Observability
Observability refers to gathering as much information as possible about a system so that system operators, DevOps practitioners, and Site Reliability Engineers can ask questions about that information.
The USE Method
The USE method applies to hardware resources. For each resource, check:
- Utilization - Percentage of time the resource is busy (such as CPU usage of a node)
- Saturation - Amount of work a resource has to do, often queue length or node load
- Errors - Count of error events
- USE Method - Linux Performance Checklist
- Grafana Dashboard - Node Exporter / USE Method
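
As a rough illustration of the three USE measurements above, here is a minimal Python sketch. The `ResourceSample` shape, its field names, and the `use_summary` helper are hypothetical and not taken from any monitoring library. It assumes you already have busy time, queue length, and an error count for one observation window.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation window for a single resource (hypothetical shape)."""
    busy_seconds: float    # time the resource spent doing work
    window_seconds: float  # length of the observation window
    queue_length: int      # work waiting for the resource, e.g. a run queue
    error_count: int       # error events seen during the window


def use_summary(sample: ResourceSample) -> dict:
    """Return utilization (percent), saturation, and errors for one sample."""
    if sample.window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    utilization = 100.0 * sample.busy_seconds / sample.window_seconds
    return {
        "utilization_percent": utilization,
        "saturation": sample.queue_length,  # queue length used as the proxy
        "errors": sample.error_count,
    }


# A CPU that was busy for 45 of the last 60 seconds with 3 tasks queued.
cpu = ResourceSample(busy_seconds=45.0, window_seconds=60.0,
                     queue_length=3, error_count=0)
print(use_summary(cpu))
# {'utilization_percent': 75.0, 'saturation': 3, 'errors': 0}
```

In practice these numbers would usually come from an exporter and be queried in a dashboard. The sketch only shows what each of the three terms measures.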
The RED Method
The RED method applies to services. It can be represented nicely using a Prometheus histogram:
- Rate - Requests per second
- Errors - Number of requests that are failing
- Duration - Amount of time these requests take, distribution of latency measurements
Modelling this for every service will give a consistent overview of how the system is behaving.
RED is a good proxy for user happiness.
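
To make the histogram connection concrete, the following sketch derives Rate, Errors, and Duration from a batch of request records in plain Python. It does not use the Prometheus client library. The `Request` type, the `red_summary` helper, and the bucket bounds are assumptions chosen for illustration. The cumulative "less than or equal" buckets mimic how a Prometheus histogram counts observations.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Request:
    duration_seconds: float
    failed: bool


# Hypothetical bucket upper bounds in seconds; an implicit +Inf bucket follows.
BUCKET_BOUNDS = [0.1, 0.5, 1.0]


def red_summary(requests, window_seconds, bounds=BUCKET_BOUNDS):
    """Compute rate, error count, and cumulative duration buckets."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    # Count each request in the first bucket whose bound is >= its duration.
    counts = [0] * (len(bounds) + 1)
    for request in requests:
        counts[bisect_left(bounds, request.duration_seconds)] += 1

    # Make the counts cumulative: each bucket includes all smaller buckets.
    cumulative, running = [], 0
    for count in counts:
        running += count
        cumulative.append(running)

    return {
        "rate_per_second": len(requests) / window_seconds,
        "errors": sum(1 for request in requests if request.failed),
        "duration_buckets": dict(zip([*bounds, float("inf")], cumulative)),
    }


sample = [
    Request(0.05, False),
    Request(0.3, False),
    Request(0.7, True),
    Request(2.0, True),
]
print(red_summary(sample, window_seconds=2.0))
```

Keeping the buckets cumulative means the last bucket always equals the total request count, so a single histogram can supply both Rate and Duration.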
The Four Golden Signals
The Four Golden Signals are similar to RED, but also include saturation.
If you can only measure four metrics of your user-facing system, focus on:
- Latency - the time it takes to service a request
  - Distinguish between the latency of successful requests vs failed requests
- Traffic - a measure of how much demand is being placed on the system
  - For web services, this is usually requests per second
    - Could also split by the nature of the request, like lists vs gets
  - For storage systems, this might be reads and writes per second
- Errors - the rate of requests that fail
- Saturation - how “full” the service is
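
The sketch below shows one way the four signals could be computed from raw request records, with latency split by outcome and traffic split by request kind as suggested above. Everything here is an assumption made for illustration: the `(kind, duration_seconds, ok)` tuple shape, the `golden_signals` helper, expressing errors as a fraction of all requests, and approximating saturation as in-flight requests divided by a known capacity.

```python
from collections import Counter
from statistics import mean


def golden_signals(requests, window_seconds, in_flight, capacity):
    """requests is a list of (kind, duration_seconds, ok) tuples."""
    if window_seconds <= 0 or capacity <= 0:
        raise ValueError("window_seconds and capacity must be positive")

    # Latency: keep successful and failed requests apart.
    ok_durations = [d for _, d, ok in requests if ok]
    failed_durations = [d for _, d, ok in requests if not ok]

    # Traffic: requests per second, split by the nature of the request.
    per_kind = Counter(kind for kind, _, _ in requests)

    return {
        "latency_ok_mean": mean(ok_durations) if ok_durations else None,
        "latency_failed_mean": mean(failed_durations) if failed_durations else None,
        "traffic_per_second": {k: n / window_seconds for k, n in per_kind.items()},
        # Errors: fraction of requests that failed (assumed definition).
        "error_ratio": len(failed_durations) / len(requests) if requests else 0.0,
        # Saturation: how "full" the service is (assumed proxy).
        "saturation": in_flight / capacity,
    }


sample = [
    ("get", 0.25, True),
    ("get", 0.75, True),
    ("list", 0.5, True),
    ("list", 1.0, False),
]
print(golden_signals(sample, window_seconds=2.0, in_flight=8, capacity=10))
```

A mean is used only to keep the example short. Latency is more commonly tracked as a distribution or as percentiles, since a mean hides slow outliers.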
Resources
- Grafana
- What is observability?
- Common observability strategies
- USE method
- RED method
- Four Golden Signals
- The Three Pillars of Observability
- The RED Method - Patterns for instrumentation and monitoring slides
  - Has Prometheus sample queries for USE and RED
Created: June 3, 2023