Monitoring of Monitoring

Oren Shoval
5 min read · Jul 25, 2024

Or the story of the guy who wanted to monitor the monitoring stack…

Once upon a time, an SRE was tasked with providing visibility into their Kubernetes-based SaaS product’s SLA, so the SRE designed and deployed a Prometheus-based monitoring stack, and everyone was pleased.

Weeks and months passed, and once in a while alerts were triggered; engineers rushed to handle them to ensure the SLA wasn’t breached.

One day, some customers reported an issue they had started to encounter, and engineers handled it. However, because the issue was not detected by the monitoring stack, engineers started handling it late, after it had already spread to many customers. A short engineering investigation revealed that an alerting rule that should have detected the issue had failed to be deployed. Moreover, the monitoring stack had failed to notify engineers.

Monitoring of Monitoring (MoM) was born

Following the issue above, the SRE decided to continuously monitor the alerting lifecycle, from alerting rule evaluation to the notifications sent to the expected receiver(s), following exactly the same business logic and integrations.

Understanding Monitoring

Before implementing MoM, we need to define what a monitoring stack for a SaaS product looks like. Such a generic stack is composed of the following (a minimal wiring sketch follows the list):

  • a metrics management component that scrapes metrics from target components and stores them for evaluation (e.g. Prometheus). Some of the target definitions can be generic, using service discovery (e.g. the Prometheus configuration for Kubernetes)
  • an alerting management component that evaluates alerting rules against the stored metrics and triggers alerts (e.g. the Prometheus rules evaluator)
  • an alerting notification management component that receives triggered alerts, applies business logic and sends alerting notifications (e.g. Prometheus Alertmanager)
  • a notification management component that receives alerting notifications, applies business logic and pushes them to a rotation (e.g. Atlassian OpsGenie)
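
To make the composition concrete, here is a minimal Prometheus configuration sketch showing how the first three pieces wire together: a service-discovery based scrape job, a rule file location for the alerting rules, and an Alertmanager target for triggered alerts. The file paths, job name and port are illustrative assumptions, not values from the product described here.

# prometheus.yml — minimal wiring sketch (illustrative values)
global:
  scrape_interval: 30s
  evaluation_interval: 30s          # how often alerting rules are evaluated
rule_files:
  - /etc/prometheus/rules/*.yaml    # alerting rules picked up by the rules evaluator
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # triggered alerts are sent to Alertmanager
scrape_configs:
  - job_name: kubernetes-pods       # generic target definition via Kubernetes service discovery
    kubernetes_sd_configs:
      - role: pod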

Implementing MoM

Based on the monitoring stack composition above, we need to apply MoM at different product lifecycle stages to validate the different monitoring stack components.

Build time

We need to enforce linters for the alerting rules during CI to prevent deploying bad configuration; a CI pipeline sketch follows the local example below.

Prometheus provides promtool, which exposes the check command to validate alerting rules. Below is a usage example that also validates the YAML format of the alerting rules files.

# Install prerequisites
brew install prometheus
promtool --version

brew install yamllint
yamllint --version

brew install yq
yq --version

# Create alerting rules files with valid and invalid yaml
cat << "EOF" > good_yaml.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: prometheus-good-yaml
spec:
groups:
- name: elasticsearch
rules:
- alert: ElasticsearchHeapUsageTooHigh
expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
for: 10m
labels:
severity: critical
team: plt
service: elasticsearch
annotations:
summary: |-
{{`Elasticsearch Heap Usage Too High (instance {{ $labels.instance }})`}}
description: |-
{{`The heap usage is over 90% for 10m. VALUE = {{ $value }} .`}}
EOF

cat << "EOF" > bad_yaml.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
name: prometheus-bad-yaml
spec:
groups:
- name:
rules:
- alert: ElasticsearchHeapUsageTooHigh
expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
for: 10m
labels:
severity: critical
team: plt
service: elasticsearch
annotations:
summary: |-
{{`Elasticsearch Heap Usage Too High (instance {{ $labels.instance }})`}}
description: |-
{{`The heap usage is over 90% for 10m. VALUE = {{ $value }} .`}}
EOF

# Validate the YAML format of the alerting rules files
➜ yamllint --no-warnings -d relaxed good_yaml.yaml
➜ echo $?
0
➜ yamllint --no-warnings -d relaxed bad_yaml.yaml
bad_yaml.yaml
  10:12     error    trailing spaces  (trailing-spaces)
➜ echo $?
1

# Create alerting rules files with valid and invalid rule definitions
cat << "EOF" > good_rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    name: prometheus-good-rule
spec:
  groups:
    - name: elasticsearch
      rules:
        - alert: ElasticsearchHeapUsageTooHigh
          expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
          for: 10m
          labels:
            severity: critical
            team: plt
            service: elasticsearch
          annotations:
            summary: |-
              {{`Elasticsearch Heap Usage Too High (instance {{ $labels.instance }})`}}
            description: |-
              {{`The heap usage is over 90% for 10m. VALUE = {{ $value }} .`}}
EOF

cat << "EOF" > bad_rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
name: prometheus-bad-rule
spec:
groups:
- name: elasticsearch
rules:
- alert: ElasticsearchHeapUsageTooHigh
expr: (elasticsearch_jvm_memory_used_bytes{area="heap} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
for: 10m
labels:
severity: critical
team: plt
service: elasticsearch
annotations:
summary: |-
{{`Elasticsearch Heap Usage Too High (instance {{ $labels.instance }})`}}
description: |-
{{`The heap usage is over 90% for 10m. VALUE = {{ $value }} .`}}
EOF

# Validate the alerting rule definitions
➜ yq e '.spec' good_rule.yaml | promtool check rules
Checking standard input
SUCCESS: 1 rules found
➜ echo $?
0
➜ yq e '.spec' bad_rule.yaml | promtool check rules
Checking standard input
FAILED:
5:15: group "elasticsearch", rule 1, "ElasticsearchHeapUsageTooHigh": could not parse expression: 1:93: parse error: unexpected identifier "heap" in label matching, expected "," or "}"
➜ echo $?
1

Deploy time

Despite the build-time validation, alerting rules could still be deployed manually; therefore, the alerting rules configuration reload should be validated as well.

Using automated log-based alerting, we need to look for configuration reload failures. Such a failure could look like this:

level=error ts=2023-07-25T12:34:56.789Z 
caller=manager.go:123
msg="loading rule files failed" files="/path/to/rules.yml"
err="yaml: unmarshal errors: line 5: cannot unmarshal !!str `invalid` into map[string]interface {}"

Run time

To perform end-to-end alerting lifecycle validation, we need to implement a gray-box service that is deployed alongside the other SaaS product services and validates each stage of the lifecycle. Such a gray-box service could be based on https://github.com/orensho/thin-blackbox-tester

Black-box testing evaluates a system’s functionality without any knowledge of its internal workings, while gray-box testing combines this external testing approach with some understanding of the system’s internals to enhance testing effectiveness.

  • metric creation: requires the gray box to increase a synthetic Prometheus counter every minute. See the configuration pseudocode below.
flows:
  mom-creation:
    config:
      frequency: '@every 1m'
    steps:
      - increase-metric
  • metrics scraping: performed by the metrics management component using the generic pod scraping job (see the pod annotation sketch after this list)
  • alerting rule evaluation: performed by the alerting management component. Requires a deployed alerting rule to validate that the metric is being scraped as expected. See the Prometheus rule below.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    name: mom-synthetic-monitoring
spec:
  groups:
    - name: mom-synthetic-monitoring
      rules:
        - alert: MomSyntheticMonitoringRunning
          expr: increase(mom_synthetic_monitoring{}[5m]) > 4
          labels:
            service: mom-synthetic-monitoring
            severity: info
          annotations:
            summary: "mom-synthetic-monitoring is running as expected"
  • triggered alert notification: performed by the alerting notification management and the notification management components. Requires a dedicated alerting notification management configuration, and requires the gray box to validate that a fresh alerting notification was received by every configured notification channel. See below the configuration pseudocode for the alerting notification management and for the gray box.
...
route:
  routes:
    - receiver: mom
      match_re:
        service: mom-synthetic-monitoring
      continue: false
      group_interval: 30s
      repeat_interval: 1m
...
receivers:
  - name: internal-monitor-mom-p5
    opsgenie_configs:
      - ...
    slack_configs:
      - ...

flows:
  mom-validation:
    config:
      frequency: '@every 2m'
    steps:
      - validate-slack
      - validate-opsgenie
      - close-opsgenie
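
For the metrics scraping step above, one common way to make the gray-box pod visible to the generic pod scraping job is the prometheus.io/* pod annotation convention. The sketch below assumes that convention, as well as an illustrative deployment name, port and image reference.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mom-synthetic-monitoring           # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mom-synthetic-monitoring
  template:
    metadata:
      labels:
        app: mom-synthetic-monitoring
      annotations:
        prometheus.io/scrape: "true"        # picked up by the generic pod scraping job
        prometheus.io/port: "8080"          # illustrative metrics port
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: gray-box
          image: example/thin-blackbox-tester:latest   # illustrative image reference
          ports:
            - containerPort: 8080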

Monitoring the MoM

We have implemented MoM for our SaaS product, but how can we ensure we get notified if it fails, especially at runtime? In other words, we need to watch the watcher.

One implementation could be to add an additional step to our gray-box service that pings an external heartbeat (e.g. Atlassian OpsGenie Heartbeat). The heartbeat provider should be configured to send a notification if no pings are received for x minutes.

flows:
  mom-watchdog:
    config:
      frequency: '@every 5m'
    steps:
      - ping-heartbeat
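
The ping-heartbeat step itself can be as small as one HTTP call. Below is a sketch using curl against the OpsGenie heartbeat ping endpoint; the heartbeat name mom-watchdog and the OPSGENIE_API_KEY environment variable are illustrative assumptions.

# Ping an OpsGenie heartbeat (name and env var are illustrative);
# OpsGenie notifies if no ping arrives within the heartbeat's configured interval.
curl --fail --silent --show-error \
  "https://api.opsgenie.com/v2/heartbeats/mom-watchdog/ping" \
  -H "Authorization: GenieKey ${OPSGENIE_API_KEY}"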
