Observability

Observability is the practice of gathering as much information about a system as possible, so that system operators, DevOps practitioners, and Site Reliability Engineers can ask questions of that information and understand how the system is behaving.

The USE Method

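The USE method applies to hardware resources such as CPUs, memory, disks, and network interfaces. For every resource, check utilization, saturation, and errors.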
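
As a rough illustration of the three USE measures (utilization, saturation, errors) for a single resource, the sketch below derives them from two snapshots of that resource. The `ResourceSample` type, its fields, and `use_summary` are hypothetical names chosen for this example; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """Hypothetical snapshot of one resource (for example a CPU)."""
    timestamp: float     # seconds since some reference point
    busy_seconds: float  # cumulative time the resource spent doing work
    queue_length: int    # work items waiting for the resource right now
    error_count: int     # cumulative number of error events


def use_summary(earlier: ResourceSample, later: ResourceSample) -> dict:
    """Derive utilization, saturation, and errors from two snapshots."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("later sample must be newer than earlier sample")
    return {
        # Fraction of the interval the resource was busy.
        "utilization": (later.busy_seconds - earlier.busy_seconds) / elapsed,
        # Queued work the resource could not service yet.
        "saturation": later.queue_length,
        # Error events per second over the interval.
        "errors_per_second": (later.error_count - earlier.error_count) / elapsed,
    }


earlier = ResourceSample(timestamp=0.0, busy_seconds=100.0, queue_length=0, error_count=5)
later = ResourceSample(timestamp=60.0, busy_seconds=145.0, queue_length=3, error_count=8)
print(use_summary(earlier, later))
```

Utilization and errors are computed from cumulative counters, while saturation is read as a point-in-time gauge. This is one possible modelling choice, not a rule of the method.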

The RED Method

The RED method applies to services. All three of its signals can be derived from a single Prometheus histogram:

  • Rate - Requests per second

    sum(rate(request_duration_seconds_count{job="..."}[1m]))
    
  • Errors - Failed requests per second (here, any request whose status code is not 2xx)

    sum(rate(request_duration_seconds_count{job="...", status_code!~"2.."}[1m]))
    
  • Duration - How long requests take, as a distribution of latency measurements (here, the 99th percentile)

    histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{job="..."}[1m])) by (le))
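
To make the Duration query less opaque, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets: find the bucket that contains the target rank, then interpolate linearly inside it. The function name `estimate_quantile` and the sample bucket values are assumptions for illustration. The sketch assumes positive bucket bounds and is not Prometheus's actual implementation, which handles more edge cases.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound and ending with the +Inf bucket, mirroring the `le` label.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no quantile

    rank = q * total  # number of observations at or below the quantile
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return buckets[-2][0]
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan  # not reached: the +Inf bucket always satisfies the rank


# Hypothetical latency buckets in seconds: (le, cumulative count).
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(estimate_quantile(0.99, example_buckets))
```

Because the estimate is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.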
    

Modelling this for every service will give a consistent overview of how the system is behaving.

RED is a good proxy for user happiness.

The Four Golden Signals

Similar to RED, but adds saturation.

If you can only measure four metrics of your user-facing system, focus on:

  • Latency - the time it takes to service a request
    • Distinguish between the latency of successful requests vs failed requests
  • Traffic - a measure of how much demand is being placed on the system
    • For web services, this is usually requests per second
      • Could also split by the nature of the request, like lists vs gets
    • For storage systems, this might be reads and writes per second
  • Errors - the rate of requests that fail
  • Saturation - how “full” the service is
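
The four signals can be sketched in Python for a single observation window, as below. The `Request` type, the `golden_signals` function, and the choice of in-flight requests divided by capacity as the saturation measure are all assumptions made for this example. In practice, each service needs its own definition of "full".

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_seconds: float
    ok: bool  # True if the request succeeded


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarise latency, traffic, errors, and saturation for one window."""
    ok_latencies = [r.duration_seconds for r in requests if r.ok]
    failed_latencies = [r.duration_seconds for r in requests if not r.ok]
    total = len(requests)
    return {
        # Latency, kept separate for successful and failed requests.
        "latency_ok_mean": mean(ok_latencies) if ok_latencies else None,
        "latency_failed_mean": mean(failed_latencies) if failed_latencies else None,
        # Traffic: requests per second over the window.
        "traffic_rps": total / window_seconds,
        # Errors: share of requests that failed.
        "error_ratio": len(failed_latencies) / total if total else 0.0,
        # Saturation: how "full" the service is (assumed measure).
        "saturation": in_flight / capacity,
    }


sample = [Request(0.1, True), Request(0.3, True), Request(0.02, False), Request(0.2, True)]
print(golden_signals(sample, window_seconds=2.0, in_flight=8, capacity=10))
```

Keeping failed-request latency separate matters because fast failures would otherwise pull the overall latency down and hide a problem.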

Resources

  • Google SRE Book - The Four Golden Signals


Last update: August 5, 2023
Created: May 27, 2023