How to create “story teller” metrics for better monitoring

Oren Shoval
5 min read · Jan 16, 2022

Not all metrics are born equal. Without “story teller” metrics, you cannot monitor your product well.

The Formula 1 example

Imagine you manage a Formula 1 racing team. Your car exposes system metrics such as oil temperature, engine RPM, fuel level, car speed, and tire temperature. Those metrics are very useful for the mechanics to assess whether each component is working correctly, but as the team manager you are missing observability of the race itself. Your product is the racing team, made up of cars, drivers, mechanics, and crew members, and your product’s goal is to win races.
To monitor whether you are going to win or lose, you need race metrics, such as driver lap time, corner entry speed, corner exit speed, driver acceleration, driver reaction time, tire change time, and refueling time.

Using this analogy, the car metrics can be called system metrics, as they indicate the state of the car, and the race metrics can be called application metrics, as they “tell the story” of the race and help you make decisions.

In marketing terms, system metrics are vanity metrics and application metrics are actionable metrics.

My examples are based on the ACME product, which exposes a Go authentication service (aka auth-service) that provides a core business flow: user authentication. Failures in this service could cause users to fail to log in, to lose their sessions, or to experience long login times. The high-level architecture looks like the diagram below:

ACME auth-service architecture

Communication vs State metrics

Your application metrics should be divided into communication and state metrics.

Communication metrics describe requests from a user or service to another service or component. Those metrics are standardized and are usually exported by client libraries. If needed, you can also export your own communication metrics, either by reusing the same generic metric names without your product prefix or by creating proprietary metrics with your product prefix.

// from client libraries
http_requests_total{code="200", handler="login", method="post", namespace="monitoring", service="auth-service"}
grpc_server_handled_total{grpc_code="OK", grpc_method="Login", grpc_type="unary", service="auth-service"}

// from Acme services
// request received by proxy and routed to upstream
acme_services_requests_total{direction="proxy", code="200", tenant="3cXGCA_nike",upstream="auth-service", service="management-proxy"}
// request sent from service to upstream
acme_services_requests_total{direction="external", code="200", method="Enrich", tenant="3cXGCA_nike",upstream="enrich-service", service="auth-service"}
// request received by service
acme_services_requests_total{direction="internal", code="200", method="Login", tenant="3cXGCA_nike", service="auth-service"}
// request sent from service to database/redis
acme_services_requests_total{direction="dal", method="CreateUserSession", tenant="3cXGCA_nike", service="auth-service"}

State metrics describe the service/component itself and should follow the Prometheus metric naming conventions.

These communication metrics let you build dashboards that visualize the requests received and emitted in our system.

Requests emitted (left) and received (right) by auth-service

Moreover, you can create white-box monitoring rules based on those metrics. The example below uses recording rules and alerts if user portal login failures exceed 5% over the last 5 minutes, provided there were more than 10 login attempts in that same window.

(acme_services_requests_total:userportal:failure:rate5m * 100 / acme_services_requests_total:userportal:total:rate5m) > 5 and acme_services_requests_total:userportal:total:increase5m > 10
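To make the arithmetic of that rule concrete, here is a minimal Go sketch of the same condition. The function name `shouldAlert` and the sample numbers are assumptions made for illustration; in practice the expression is evaluated by Prometheus, not by application code.

```go
package main

import "fmt"

// shouldAlert mirrors the alert expression above: the failure percentage must
// exceed 5 AND there must be more than 10 attempts in the window.
// failureRate and totalRate are per-second rates over 5 minutes;
// totalIncrease is the number of attempts in the same window.
func shouldAlert(failureRate, totalRate, totalIncrease float64) bool {
	if totalRate == 0 {
		// In PromQL the division would yield NaN, which never satisfies "> 5".
		return false
	}
	return failureRate*100/totalRate > 5 && totalIncrease > 10
}

func main() {
	const window = 300.0 // seconds in 5 minutes

	// 3 failures out of 40 attempts: 7.5% failures with enough traffic.
	fmt.Println(shouldAlert(3/window, 40/window, 40))
	// 1 failure out of 8 attempts: 12.5% failures, but too few attempts.
	fmt.Println(shouldAlert(1/window, 8/window, 8))
	// 1 failure out of 40 attempts: enough traffic, but only 2.5% failures.
	fmt.Println(shouldAlert(1/window, 40/window, 40))
}
```

The second case shows why the volume guard matters: without it, a single failed login during a quiet period would trigger the alert.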

State metrics describe the states of your services/components. Those metrics should let you measure the core business flows of your system, so that you can visualize and monitor your internal SLOs.

acme_auth_service_sessions_total{tenant="3cXGCA_nike", service="auth-service"}
acme_auth_service_login_requests_total{type="user|api", result="success|reject|error", tenant="3cXGCA_nike", service="auth-service"}
acme_auth_service_idp_requests_total{type="local|okta|azuread", result="success|reject|error", tenant="3cXGCA_nike", service="auth-service"}
acme_auth_service_errors_total{tenant="3cXGCA_nike", service="auth-service"}
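As a rough sketch of how service code might feed the login counter, the Go example below maps the outcome of a login attempt to the `result` label and increments one series per label combination. It deliberately swaps the Prometheus client for a plain in-memory map so that it runs with the standard library only; `LoginCounter`, `ErrRejected`, `ErrBackend`, and `resultLabel` are hypothetical names, not part of the auth-service.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical sentinel errors used only for this illustration.
var (
	ErrRejected = errors.New("credentials rejected")
	ErrBackend  = errors.New("session store unavailable")
)

// loginLabels holds the variable labels of acme_auth_service_login_requests_total.
type loginLabels struct{ Type, Result, Tenant string }

// LoginCounter is an in-memory stand-in for a Prometheus counter vector.
type LoginCounter struct {
	mu     sync.Mutex
	series map[loginLabels]int
}

func NewLoginCounter() *LoginCounter {
	return &LoginCounter{series: map[loginLabels]int{}}
}

// resultLabel maps the outcome of a login attempt to success, reject or error.
func resultLabel(err error) string {
	switch {
	case err == nil:
		return "success"
	case errors.Is(err, ErrRejected):
		return "reject"
	default:
		return "error"
	}
}

// Record increments the series matching the login type, outcome and tenant.
func (c *LoginCounter) Record(loginType, tenant string, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.series[loginLabels{loginType, resultLabel(err), tenant}]++
}

// Value returns the current count of one series (0 if it was never recorded).
func (c *LoginCounter) Value(loginType, result, tenant string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.series[loginLabels{loginType, result, tenant}]
}

func main() {
	logins := NewLoginCounter()
	logins.Record("user", "3cXGCA_nike", nil)
	logins.Record("user", "3cXGCA_nike", ErrRejected)
	logins.Record("api", "3cXGCA_nike", ErrBackend)

	fmt.Println("user/success:", logins.Value("user", "success", "3cXGCA_nike"))
	fmt.Println("user/reject:", logins.Value("user", "reject", "3cXGCA_nike"))
	fmt.Println("api/error:", logins.Value("api", "error", "3cXGCA_nike"))
}
```

Keeping the mapping from error to label value in one function is one way to stop each call site from inventing its own `result` values.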

For example, you could visualize auth-service Idp request errors.

And you could create monitoring rules to monitor internal login errors per Idp type (useful when there is an Idp outage) or per tenant (useful when there is a bad integration configuration):

sum by(type) (rate(acme_auth_service_idp_requests_total{result="error",service="auth-service"}[5m])) * 100 / sum by(type) (rate(acme_auth_service_idp_requests_total{service="auth-service"}[5m])) > 5

sum by(tenant) (rate(acme_auth_service_idp_requests_total{result="error",service="auth-service"}[5m])) * 100 / sum by(tenant) (rate(acme_auth_service_idp_requests_total{service="auth-service"}[5m])) > 5

sum by(tenant) (rate(acme_auth_service_login_requests_total{result="error",service="auth-service"}[5m])) * 100 / sum by(tenant) (rate(acme_auth_service_login_requests_total{service="auth-service"}[5m])) > 5

Internal library for application metrics

To assist your developers, you could create an internal library that manages application metrics. This helps ensure that naming conventions and usage are respected.

Attach communication metrics to service router

Since all calls to our Go service are handled by the service router, we wrap each declared route with an additional handler function. That handler function maintains a Prometheus counter and histogram.

Auth-service usage in the controller:

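package controller

import (
	"github.com/acme/common/metrics"
	"github.com/gorilla/mux"

	// Hypothetical import path for the package that defines service.Handlers.
	"github.com/acme/auth-service/service"
)

// NewRouter declares the auth-service routes and wraps each one with
// the instrumented handler from the internal metrics library.
func NewRouter(handlers *service.Handlers) *mux.Router {
	r := mux.NewRouter().StrictSlash(true)
	loginHandler := handlers.LoginService
	r.HandleFunc("/login",
		metrics.NewInstrumentedHandlerFunc("Login", "internal", loginHandler.Login))
	return r
}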

And the internal library implementation:

package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

const (
	AcmeNamespace     = "acme"
	ServicesSubsystem = "services"
)

// NewInstrumentedHandlerFunc wraps next with a request counter and a
// latency histogram. It returns nil when next is nil.
func NewInstrumentedHandlerFunc(handlerName string, directionName string, next http.HandlerFunc) http.HandlerFunc {
	if next == nil {
		return nil
	}
	// GetConstLabels (defined elsewhere in this package) returns Acme common prometheus.Labels
	// and adds
	// call="handlerName", direction="directionName"
	constLabels := GetConstLabels(handlerName, directionName)
	counter := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace:   AcmeNamespace,
			Subsystem:   ServicesSubsystem,
			Name:        "requests_total",
			Help:        "A counter for requests to the wrapped handler.",
			ConstLabels: constLabels,
		},
		[]string{"code", "method"},
	)
	durationHistogram := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace:   AcmeNamespace,
			Subsystem:   ServicesSubsystem,
			Name:        "request_duration_seconds",
			Help:        "A histogram of latencies for requests.",
			ConstLabels: constLabels,
			Buckets:     []float64{.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
		},
		[]string{"method"},
	)
	// Assumption: the default registry is used. MustRegister panics if the
	// same handler/direction pair is registered twice.
	prometheus.MustRegister(counter, durationHistogram)
	// Chain the promhttp middlewares: the histogram times the counter
	// middleware, which in turn wraps the real handler.
	collectionChain := promhttp.InstrumentHandlerDuration(durationHistogram,
		promhttp.InstrumentHandlerCounter(counter, next))
	return collectionChain
}
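To show what such a wrapper does around a handler, here is a minimal sketch that uses only the Go standard library. It is not the library above: the Prometheus counter is replaced by an in-memory map, and `statusRecorder`, `RequestCounts`, `Instrument`, and `simulate` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}

// RequestCounts is an in-memory stand-in for the requests counter,
// keyed by "call/direction/code".
type RequestCounts struct {
	mu sync.Mutex
	n  map[string]int
}

func (c *RequestCounts) inc(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n[key]++
}

// Get returns how many requests were counted under key.
func (c *RequestCounts) Get(key string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n[key]
}

// Instrument wraps next and counts each request by call, direction and status code.
func Instrument(c *RequestCounts, call, direction string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Default to 200, which net/http uses when the handler never calls WriteHeader.
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next(rec, r)
		c.inc(call + "/" + direction + "/" + strconv.Itoa(rec.code))
	}
}

// simulate sends one POST and one GET to an instrumented login handler.
func simulate() *RequestCounts {
	counts := &RequestCounts{n: map[string]int{}}
	login := Instrument(counts, "Login", "internal", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	login(httptest.NewRecorder(), httptest.NewRequest(http.MethodPost, "/login", nil))
	login(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/login", nil))
	return counts
}

func main() {
	counts := simulate()
	fmt.Println(counts.Get("Login/internal/200"), counts.Get("Login/internal/405"))
}
```

Because the wrapper only sees the `http.ResponseWriter`, the controller code stays free of metric calls, which is the point of attaching the metrics at the router.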

Attach state metrics to service methods

You could expand your internal library by exposing more metric helpers for external calls (auth-service calling getUserSession on your Redis), method duration (auth-service performing enrichUser), or state metrics (auth-service requests to the Idp).
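A method-duration helper of that kind could look roughly like the sketch below. It is an assumption about the shape of such a helper, not the author's library: `Recorder`, `Track`, `Observation`, and `ErrUpstream` are hypothetical, and observations are appended to a slice where a real implementation would update a Prometheus histogram and counter.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrUpstream is a hypothetical error used only for this illustration.
var ErrUpstream = errors.New("upstream unavailable")

// Observation is one recorded method call.
type Observation struct {
	Method   string
	Result   string // "success" or "error"
	Duration time.Duration
}

// Recorder collects observations in memory. It is not safe for concurrent
// use; a real library would rely on thread-safe Prometheus collectors.
type Recorder struct {
	Observations []Observation
}

// Track runs fn, measures how long it took, records the outcome,
// and returns fn's error unchanged to the caller.
func (r *Recorder) Track(method string, fn func() error) error {
	start := time.Now()
	err := fn()
	result := "success"
	if err != nil {
		result = "error"
	}
	r.Observations = append(r.Observations, Observation{method, result, time.Since(start)})
	return err
}

func main() {
	rec := &Recorder{}

	// A successful call to a hypothetical enrichUser step.
	_ = rec.Track("enrichUser", func() error {
		time.Sleep(5 * time.Millisecond)
		return nil
	})
	// A failing call to a hypothetical getUserSession step.
	_ = rec.Track("getUserSession", func() error { return ErrUpstream })

	for _, o := range rec.Observations {
		fmt.Printf("method=%s result=%s duration>0=%t\n", o.Method, o.Result, o.Duration > 0)
	}
}
```

Passing the work as a closure keeps timing and outcome labelling in one place, so individual service methods do not need to repeat that bookkeeping.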

What’s next?

Auth-service integrates with the Idp using tokens generated by the customer admin, and those tokens usually have an expiration date for security reasons. When an expired token is used, the Idp returns 401 and Acme is not able to authenticate or enrich the customer’s users.
To prevent user lockout, when auth-service encounters an internal error with an Idp integration, you could add a reason label holding the returned error code (e.g. 401 Unauthorized). This allows you to create a monitoring rule that detects those “broken” Idp tokens, and an automation that notifies the customer admin so they can generate a new token.

acme_auth_service_idp_requests_total{result="error", reason="401", service="auth-service"}
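One possible way to derive those label values in the service is sketched below. The mapping is an assumption made for illustration (any 2xx status counts as a success, everything else as an error whose reason is the status code), and `idpLabels` and `tokenLooksBroken` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
)

// idpLabels derives the result and reason label values from the HTTP status
// returned by the Idp. Assumption: 2xx is a success with an empty reason;
// any other status is an error whose reason is the status code.
func idpLabels(status int) (result, reason string) {
	if status >= 200 && status < 300 {
		return "success", ""
	}
	return "error", strconv.Itoa(status)
}

// tokenLooksBroken reports whether a response matches the series
// {result="error", reason="401"}, the signal used to notify the customer admin.
func tokenLooksBroken(status int) bool {
	result, reason := idpLabels(status)
	return result == "error" && reason == strconv.Itoa(http.StatusUnauthorized)
}

func main() {
	for _, status := range []int{http.StatusOK, http.StatusUnauthorized, http.StatusBadGateway} {
		result, reason := idpLabels(status)
		fmt.Printf("status=%d result=%q reason=%q notify_admin=%t\n",
			status, result, reason, tokenLooksBroken(status))
	}
}
```

Using the status code as the reason keeps the label's cardinality bounded, which matters because every distinct label value creates a new time series.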
