Why you shouldn’t notify on warnings
or warnings to /dev/null
The ease of managing modern monitoring systems such as Datadog, New Relic or Prometheus leads us to create an abundance of monitoring rules. These rules fire alerts that are most of the time informative but not actionable (for example a spike in CPU usage that has no effect on the user experience), and they overwhelm the on-duty engineers with notifications.
Several SREs have proposed solutions to reduce that toil. Štěpán Davidovič and Betsy Beyer, for instance, wrote Reduce toil through better alerting, where they propose an alerting maturity hierarchy to improve alert handling, but it doesn't address the root cause of excessive alert notifications.
In this post I explain how I reduced my alert-handling toil by using warning alerts as indicators and by sending notifications only for critical alerts.
Warning alerts as indicators
Based on the Google SRE Workbook, we should only send actionable critical alert notifications to on-duty engineers when an SLO threshold is breached. I would add one more case: when an infrastructure component alerts on a critical issue (e.g. a growing Postgres WAL on replicas that could crash the Postgres master when the mount gets full). We can therefore discard notifications for warning alerts. Below are the routes of my AlertManager configuration, where the onduty receiver pushes the notification to OpsGenie:
routes:
  - match_re:
      alertname: QuietHours
    receiver: blackhole
  - match_re:
      severity: critical
    receiver: onduty
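For completeness, here is a minimal sketch of what the matching receivers could look like; the OpsGenie API key is a placeholder and the top-level default route, which AlertManager also requires, is omitted:

receivers:
  # no notifier configured: alerts routed here are silently dropped
  - name: blackhole
  # pushes the notification to OpsGenie (the api_key value is a placeholder)
  - name: onduty
    opsgenie_configs:
      - api_key: <your-opsgenie-api-key>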
I consider warning alerts as indicators of system health that don't require immediate action, the same way we keep driving our car when a warning light comes on. Following this logic, I created a Grafana system health dashboard that gives us an overview of the fired alerts per service (e.g. the acme service), team (e.g. the authentication team) or area (e.g. Kubernetes-related).

I sum up the fired alerts per service/team/area (here for the policy service)
sum(sum_over_time(ALERTS{service="policy",alertstate="firing"}[$interval])) or on () vector(0)
based on the mandatory labels we add to each alerting rule, for example:
- alert: HighGoRoutinesCountAcmeService
  expr: sum(go_goroutines{service="acme"}) > 1000
  for: 2m
  labels:
    service: acme
    severity: warning
    team: core
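The same query pattern gives the per-team and per-area panels; a sketch, assuming the team label shown above and an area label of the same kind (the label values here are illustrative):

sum(sum_over_time(ALERTS{team="core",alertstate="firing"}[$interval])) or on () vector(0)
sum(sum_over_time(ALERTS{area="kubernetes",alertstate="firing"}[$interval])) or on () vector(0)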
There is one case left to handle: recurrent warning alerts, for example high latency during the whole last hour, which may indicate an issue that is not correctly monitored and should be addressed. For that I created a rule that counts alert occurrences per alert name and fires a critical alert, which does send a notification. It follows the same logic as the car analogy: you don't stop and check your car when a warning lights up on the dashboard, but you will if the light doesn't turn off after a day.
- alert: TooManyWarnings
  expr: count_over_time(sum by(service, alertname, triggered_alertname) (label_replace(ALERTS{alertstate="firing",severity="warning"}, "triggered_alertname", "$1", "alertname", "(.*)"))[1h:5m]) > 8
  labels:
    onduty: working-hours-only
    severity: critical
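As a side note, the label_replace call copies the original alert name into the triggered_alertname label, because Prometheus overwrites the alertname of the resulting alert with TooManyWarnings. A minimal sketch of an annotation that would surface it in the notification (this annotations block is an assumed addition to the rule above, not part of it):

  # assumed addition: expose which warning kept firing
  annotations:
    summary: "Warning {{ $labels.triggered_alertname }} on service {{ $labels.service }} fired more than 8 times in the last hour"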
The “onduty” label is used in AlertManager to inhibit notifications outside working hours:
inhibit_rules:
  - source_match:
      alertname: QuietHours
    target_match:
      onduty: working-hours-only
where QuietHours is a rule that fires outside working days and hours. QuietHours itself is routed to the blackhole receiver (first route above), so it never notifies anyone; it only exists to drive this inhibition:
- alert: QuietHours
  # hour() is in UTC; day_of_week() returns 0-6, where 0 is Sunday and 6 is Saturday
  expr: day_of_week() == 0 or day_of_week() == 6 or hour() >= 15 or hour() <= 6
  for: 1m
  labels:
    severity: info
What I learned
- A SaaS product can respect its SLOs even if warning alerts are firing.
- Control the flow of warning alerts instead of notifying on every one of them.
- On-duty engineers' trust in the monitoring system will grow if it sends critical notifications only when real issues are occurring.