4 questions you must ask when an incident happens
When an incident is happening, “it’s all about the scope”: assessing its severity, providing visibility to management and customers, and deciding the best way to handle it.
Si vis pacem, para bellum (if you want peace, prepare for war)
When an incident is reported by the monitoring system, we need a framework to help us digest the tens of thousands of metrics a SaaS product creates and assess the scope of the impact on our system and customers. The scope can always be assessed by asking: what is broken? When did it start? Where is it happening? And who is affected? (Note that the why is the root cause and not necessarily part of the initial scope assessment.)
In The Site Reliability Workbook, Google introduced the concept of SLIs to measure SLA commitments, but no out-of-the-box solution can automatically answer those questions, so I will share my metrics-based implementation.
1. What (is broken)?
It should be simple to understand which user flows are broken (or about to break) so we can communicate clearly, internally in the war room and externally to customers. For example:
- More than 5% failures to rent a book over the last 5 minutes, with at least 50 rent requests in that same window → is the SLO/SLA breached?
- The Postgres WAL size has reached 75% of the available WAL storage and could lead to a Postgres crash when it reaches 100% → user flows will fail to perform any database-related action
To answer this question, you should:
- Implement the RED methodology (or white-box monitoring): based on your load balancer metrics, create a rule and a visualisation for failures per user flow (e.g.: login, rent a book, return a book, list rented books, search for books), as explained in The Site Reliability Workbook. For example, for the rent flow and for the booking service as a whole:
sum(rate(nginx_http_requests_total{service_name="booking", handler="rent", status=~"5.*"}[5m])) * 100 / sum(rate(nginx_http_requests_total{service_name="booking", handler="rent"}[5m])) > 5
sum(rate(nginx_http_requests_total{service_name="booking", status=~"5.*"}[5m])) * 100 / sum(rate(nginx_http_requests_total{service_name="booking"}[5m])) > 5
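As a sketch of how such a rule can be wired into Prometheus alerting (the alert name, the severity label and the minimum-traffic guard of 50 rent requests are assumptions added here, not part of the original query):

groups:
  - name: booking-user-flows
    rules:
      - alert: RentBookHighErrorRate            # hypothetical alert name
        expr: |
          (
            sum(rate(nginx_http_requests_total{service_name="booking", handler="rent", status=~"5.*"}[5m])) * 100
            / sum(rate(nginx_http_requests_total{service_name="booking", handler="rent"}[5m]))
          ) > 5
          and
          sum(increase(nginx_http_requests_total{service_name="booking", handler="rent"}[5m])) > 50
        labels:
          severity: critical
        annotations:
          summary: "More than 5% failures to rent a book over the last 5 minutes"

The and clause keeps the alert quiet on low traffic, matching the “at least 50 rent requests” condition from the example above.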
- Implement the USE methodology: based on your infrastructure components, create a rule and a visualisation per component breaking point (e.g.: the Postgres WAL size has grown beyond 75% of the available WAL storage → could cause a Postgres crash; the etcd cluster has had no leader for the last 2 minutes → could cause etcd to stop accepting new connections). For example, for the WAL breaking point (an etcd sketch follows the query below):
sum by (pod) (ccp_wal_activity_total_size_bytes) > sum by (pod) (pg_settings_max_wal_size_bytes) * 0.75
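The etcd breaking point mentioned above can be expressed the same way; a minimal sketch, assuming etcd's standard etcd_server_has_leader gauge is scraped (the alert name and severity are assumptions):

groups:
  - name: etcd-breaking-points
    rules:
      - alert: EtcdNoLeader                      # hypothetical alert name
        expr: max(etcd_server_has_leader) == 0   # no member reports a leader
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "etcd cluster has had no leader for the last 2 minutes"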
Each microservice should also implement the Four Golden Signals methodology (latency, traffic, errors, saturation) and alert the service owner.
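As a rough sketch of what those per-service signals can look like as Prometheus recording rules (the metric names and the booking job/namespace labels are assumptions based on common client-library, cAdvisor and kube-state-metrics conventions):

groups:
  - name: booking-golden-signals
    rules:
      - record: booking:traffic:requests_per_second
        expr: sum(rate(http_requests_total{job="booking"}[5m]))
      - record: booking:errors:ratio
        expr: |
          sum(rate(http_requests_total{job="booking", status=~"5.*"}[5m]))
          / sum(rate(http_requests_total{job="booking"}[5m]))
      - record: booking:latency:p99_seconds
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="booking"}[5m])))
      - record: booking:saturation:cpu_ratio
        expr: |
          sum(rate(container_cpu_usage_seconds_total{namespace="booking"}[5m]))
          / sum(kube_pod_container_resource_limits{namespace="booking", resource="cpu"})

Each of these signals can then be turned into an alert routed to the service owner.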
2. When (did it start)?
The incident start date should be the UTC time when the alert was triggered, not to be confused with the time of first response.
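If Prometheus is the alerting source, one way to recover that timestamp after the fact (a sketch; the alert name is the hypothetical one used above) is the built-in ALERTS and ALERTS_FOR_STATE series:

ALERTS{alertname="RentBookHighErrorRate", alertstate="firing"}
ALERTS_FOR_STATE{alertname="RentBookHighErrorRate"}

Graphing the ALERTS series shows when the alert started firing; for rules with a for clause, the value of ALERTS_FOR_STATE is the Unix timestamp at which the condition first became active. Alertmanager also records this as the alert's startsAt field.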
3. Where (is it happening)?
It should be simple to understand where the incident is happening so we can contact the relevant customers and let them know.
To answer this question, you should aggregate your queries by deployment and target labels:
- Ensure all the metrics have the relevant deployment type, region or zone labels → failures relate to the production us-west-1 deployment
- If you are running a blue/green or canary deployment, ensure all metrics have the relevant deployment target label → failures relate to the blue target
(sum by (type, region, target) (rate(nginx_http_requests_total{service_name="booking", handler="rent", status=~"5.*"}[5m])) * 100) / sum by (type, region, target) (rate(nginx_http_requests_total{service_name="booking", handler="rent"}[5m])) > 5
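One way to attach those labels at scrape time is through the scrape configuration; a minimal prometheus.yml sketch, where the exporter targets, port and label values are assumptions:

scrape_configs:
  - job_name: booking
    static_configs:
      - targets: ["booking-blue:9113"]      # hypothetical nginx exporter endpoint
        labels:
          type: production
          region: us-west-1
          target: blue
      - targets: ["booking-green:9113"]
        labels:
          type: production
          region: us-west-1
          target: green

Labels defined under static_configs are attached to every series scraped from those targets, so the aggregation above works out of the box.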
4. Who (is affected)?
It should be simple to understand who is affected by the incident.
To answer this question, you will need to retrieve the customer identifier from within the request received by your web server (or proxy), expose it as a customer label, and aggregate your queries by that label:
- Ensure all the metrics with a customer context have the relevant customer or tenant label → failures relate to acme.com and coyote.com
(sum by (customer) (rate(nginx_http_requests_total{service_name="booking", handler="rent", status=~"5.*"}[5m])) * 100) / sum by (customer) (rate(nginx_http_requests_total{service_name="booking", handler="rent"}[5m])) > 5
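Once the customer label is in place, an alerting rule can carry it straight into the notification; a sketch (the alert name and annotation wording are assumptions):

groups:
  - name: booking-per-customer
    rules:
      - alert: RentBookHighErrorRatePerCustomer   # hypothetical alert name
        expr: |
          sum by (customer) (rate(nginx_http_requests_total{service_name="booking", handler="rent", status=~"5.*"}[5m])) * 100
          / sum by (customer) (rate(nginx_http_requests_total{service_name="booking", handler="rent"}[5m])) > 5
        labels:
          severity: critical
        annotations:
          summary: "More than 5% failures to rent a book over the last 5 minutes for customer {{ $labels.customer }} ({{ $value | humanize }}%)"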
Wrapping it up
By implementing this framework, you can quickly assess the incident by stating:
More than 5% failures to rent a book over the last 5 minutes, started at 14:00 UTC, for production us-west-1 blue, and affecting customers acme.com and coyote.com.
(sum by (type, region, target, customer) (rate(nginx_http_requests_total{service_name="booking", handler="rent", status=~"5.*"}[5m])) * 100) / sum by (type, region, target, customer) (rate(nginx_http_requests_total{service_name="booking", handler="rent"}[5m])) > 5
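The same statement can be generated almost verbatim by a single alerting rule that combines all the labels; a sketch (the alert name and annotation wording are assumptions, and the start time comes from the alert firing time, as discussed in question 2):

groups:
  - name: booking-incident-scope
    rules:
      - alert: RentBookHighErrorRate
        expr: |
          sum by (type, region, target, customer) (rate(nginx_http_requests_total{service_name="booking", handler="rent", status=~"5.*"}[5m])) * 100
          / sum by (type, region, target, customer) (rate(nginx_http_requests_total{service_name="booking", handler="rent"}[5m])) > 5
        labels:
          severity: critical
        annotations:
          summary: >-
            More than 5% failures to rent a book over the last 5 minutes,
            for {{ $labels.type }} {{ $labels.region }} {{ $labels.target }},
            affecting customer {{ $labels.customer }}.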
What I learned
- Handling an incident is stressful, so you had better be prepared.
- Managers want to understand the scope of the incident, not the technical details.
- Creating visualizations associated with this framework gives a quick overview of the system.