Moving our monitoring system from managed Datadog to a self-managed Prometheus stack

Oren Shoval
Jun 13, 2021

Growing a startup

Building a new product in a new startup is all about being agile: delivering fast to your customers, onboarding more of them, and continuing to grow. To support this goal at our startup Luminate (now called Secure Access Cloud and part of Broadcom), we focused on core product development and decided to leverage managed services instead of investing a lot of effort in building resilient infrastructure.

One thing we delegated was the metrics monitoring platform, where we looked for a solution that met our requirements:

  1. Reliable service
  2. Low learning curve
  3. Support for Docker and Kubernetes monitoring
  4. Support for Monitor-as-Code for automatic monitors and dashboards deployment

So in 2017, we decided to go with the Datadog Startup Program, and we were very satisfied with their service.

Stickiness

During the following two years we kept focusing on growth and less on expenses, as every startup does. By the time we were acquired by Symantec Enterprise in 2019, our Kubernetes clusters and related services had grown so much that our Datadog bill was around one fifth of our monthly bills. Symantec prioritized integrating us with their other products and continuing to grow, so those high Datadog bills weren’t THE wake-up call you might have expected.

Instead, our product deployment topology grew, and so did our usage of Datadog features. We became locked in to their service and continued to pay very expensive monthly bills.

Breaking point

When Symantec Enterprise was acquired by Broadcom in November 2019, priorities changed, and one required change was a reduction of costs across the whole Symantec product line. So we looked at the high expenses, understood that too much is too much, and took the decision to move away from Datadog. To this day I still kind of regret this decision, since we left one of the best monitoring platforms available, but as I will explain below, as an SRE I learnt a great deal by building our self-managed monitoring system.

Back to basics: building a monitoring system “from scratch”

I went back to our original metrics monitoring system requirements and researched how I could continue to provide a high-quality, reliable monitoring service internally. I decided to build a Prometheus-centric monitoring system. Below is the high-level architecture.

Exposing metrics

Our proprietary services already exposed metrics endpoints in the Prometheus format, so we only had to add the Prometheus annotations to the Service exposing the pods (they can also be added to the pod itself):

apiVersion: v1
kind: Service
metadata:
  name: acme-service-headless
  annotations:
    prometheus.io/port: "8080"
    prometheus.io/probe: "true"
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
spec:
  ports:
    - port: 8080
      protocol: TCP
      name: metrics
  selector:
    service: acme-service
  type: ClusterIP
  clusterIP: None

Prometheus service discovery then dynamically discovers those endpoints for the kubernetes_sd_configs job to scrape:

- job_name: kubernetes-service-endpoints
  scheme: http
  kubernetes_sd_configs:
    - role: endpoints
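
The prometheus.io/* annotations only take effect when the scrape job contains relabeling rules that read them. As an illustration (not the exact configuration from this setup), here is a minimal sketch modeled on the widely used Kubernetes example configuration shipped with Prometheus. The three relabel rules are an assumption about how the annotations are consumed:

```yaml
- job_name: kubernetes-service-endpoints
  scheme: http
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    # Keep only endpoints whose Service carries prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Use the path from prometheus.io/path when the annotation is set
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: '(.+)'
    # Rewrite the target address to use the port from prometheus.io/port
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: '([^:]+)(?::\d+)?;(\d+)'
      replacement: '$1:$2'
```

With rules like these, a Service without the scrape annotation is dropped before scraping, so adding a new service to monitoring requires no change to the Prometheus configuration itself.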

For the dependencies that don’t expose Prometheus metrics, I am using InfluxData Telegraf with the relevant input plugin and the Prometheus client output:

service:
  type: ClusterIP
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "9273"
    prometheus.io/probe: "true"
outputs:
  - prometheus_client:
      listen: ":9273"
inputs:
  - rabbitmq:
      url: "http://rabbitmq-cluster:15672"
      username: "username"
      password: "****"

To expose the state of Kubernetes API objects as metrics, I deployed kube-state-metrics and created a static Prometheus scrape job:

- job_name: kube-state-metrics
  scrape_interval: 30s
  metrics_path: /metrics
  scheme: http
  static_configs:
    - targets:
        - 'kube-state-metrics.kube-system.svc:8080'

Collecting the metrics

I am using the Prometheus Operator to manage my Prometheus clusters. Each cluster has two independent Prometheus instances with the same configuration, which act as a “logical” HA pair by running the same scrape jobs and alerting to the same Alertmanager cluster (you can get a full-blown solution with thanos.io).
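
With the Prometheus Operator, this kind of “logical” HA is typically expressed through a Prometheus custom resource. The following is a minimal sketch, not the exact resource from this setup: the resource name, namespace, and service account are hypothetical, and `alertmanager-operated` is the Service that appears in the scrape job later in this post.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main                       # hypothetical name
  namespace: monitoring            # hypothetical namespace
spec:
  replicas: 2                      # two independent instances, identical configuration
  serviceAccountName: prometheus   # hypothetical; needs RBAC for service discovery
  ruleSelector: {}                 # empty selector: load every PrometheusRule in this namespace
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager-operated
        port: web
```

Both replicas scrape every target and evaluate every rule, so each alert reaches Alertmanager twice; Alertmanager deduplicates alerts with identical labels, which is what makes this simple scheme workable.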

Accessing the metrics

I am using the simple Prometheus UI and Grafana’s more advanced visualization capabilities to access and display the metrics.

Monitoring as Code

Since the configuration of all the components managed by the Prometheus Operator is expressed as Kubernetes custom resources (CRDs), I manage them in a source code repository, and a deployment triggered on merge applies them to the target environment.

Since Grafana has an HTTP API to manage dashboards, I manage them as JSON files in a source code repository, and a deployment triggered on merge also pushes the dashboards to the Grafana instance of the target environment.
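
To give an idea of what such a file in the repository can look like, here is a minimal sketch of a PrometheusRule resource. The resource name, namespace, alert name, and threshold are hypothetical and only illustrate the shape of the resource; `up` is the metric Prometheus records for every scrape target.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: acme-service-rules         # hypothetical name
  namespace: monitoring            # hypothetical namespace
spec:
  groups:
    - name: acme-service
      rules:
        - alert: AcmeServiceTargetDown   # hypothetical alert
          expr: up{job="kubernetes-service-endpoints"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Scrape target {{ $labels.instance }} has been down for 5 minutes"
```

Because the rule is just another Kubernetes object, it goes through the same review and merge flow as application code.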

Monitoring the monitoring system

Now that we manage the monitoring system ourselves, I had to monitor the monitoring system as well. To monitor the internal components, I defined three additional Prometheus scrape jobs with the relevant Prometheus rules:

- job_name: prometheus
  scheme: http
  static_configs:
    - targets:
        - 'localhost:9090'
- job_name: grafana
  scheme: http
  static_configs:
    - targets:
        - 'grafana-svc:80'
- job_name: alertmanager
  scheme: http
  static_configs:
    - targets:
        - 'alertmanager-operated:9093'

To monitor the system as a whole, I configured a Prometheus rule that fires an alert at every evaluation interval:

rules:
  - alert: DeadMansSwitch
    expr: vector(1)
    annotations:
      summary: "Alerting DeadMansSwitch on prometheus instance"
      description: "This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional."

DeadMansSwitch alerts are routed in Alertmanager to a webhook for an Atlassian Opsgenie Heartbeat, which expects a ping at least once every 5 minutes; otherwise it creates an alert for engineering:

route:
  routes:
    - match_re:
        alertname: DeadMansSwitch
      receiver: deadmansswitch
      repeat_interval: 2m
receivers:
  - name: deadmansswitch
    webhook_configs:
      - url: https://api.opsgenie.com/v2/heartbeats/prometheus-deadmanswitch/ping
        send_resolved: false
        http_config:
          basic_auth:
            # quoted, because a bare * starts a YAML alias
            password: "****"

Notifying the team

Now the painful part of moving away from a managed service: we had to move our existing Datadog monitors and dashboards to the new monitoring system. I had to train our developers and supply them with templates so they could rewrite their services’ monitors as a collection of Prometheus rules and Grafana dashboards for our proprietary services and dependencies.

When an alert is triggered, it is sent to Alertmanager, which has several routes and receivers. Based on the alert severity, Alertmanager notifies Slack or notifies Atlassian Opsgenie to call the on-duty engineers.
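
A severity-based routing tree of this kind can be sketched as follows. This is an illustration under assumptions rather than the production configuration: the receiver names, the `severity` label values, the Slack channel, and the placeholder credentials are all hypothetical. It uses the same `match` style as the DeadMansSwitch route above; recent Alertmanager releases prefer the newer `matchers` syntax.

```yaml
route:
  receiver: slack-notifications        # default: everything goes to Slack
  group_by: [alertname]
  routes:
    - match:
        severity: critical             # critical alerts page the on-duty engineer
      receiver: opsgenie-oncall
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder webhook URL
        channel: "#alerts"                                      # hypothetical channel
        send_resolved: true
  - name: opsgenie-oncall
    opsgenie_configs:
      - api_key: "<opsgenie-api-key>"  # placeholder
        priority: P1
```

Keeping Slack as the default receiver means an alert with a missing or misspelled severity label still surfaces somewhere instead of being silently dropped.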

What I learnt

Don’t be afraid of change, either by decision or by requirement.

The monitoring community is very active, and Open Source components are well documented.

I can live without cool monitoring features I used to have.

By building the monitoring system from scratch, I learnt the ins and outs of Prometheus, updated our internal best practices for monitoring, and improved the overall supportability of our product.
