
Mastering the Four Golden Signals: A Practical Guide to System Monitoring

This guide explains how to use the four golden signals—latency, traffic, errors, and saturation—to design effective monitoring across servers, services, and external dependencies, helping teams detect issues early and maintain reliable, high‑performance systems.

Efficient Ops

Introduction

Understanding system state is crucial for ensuring application and service reliability and stability. Information about deployment health and performance helps teams react to problems and make changes confidently. A powerful monitoring system that collects metrics, visualizes data, and alerts operators is one of the best ways to gain this insight.

This guide discusses core concepts of metrics, monitoring, and alerts. Metrics are the primary material processed by monitoring systems to build a cohesive view of the tracked system. Knowing which components to monitor and which characteristics to observe is the first step in designing a system that provides reliable, actionable insights into software and hardware status.

1. The Golden Signals of Monitoring

Google's Site Reliability Engineering (SRE) book introduces a useful framework called the four golden signals, representing the most important factors to measure in user‑facing systems. Each signal is discussed below.

Latency

Definition: The time required for a service to process a request.

Importance: Increased latency may indicate performance degradation or bottlenecks; in microservice architectures, where slowdowns propagate quickly between services, latency monitoring is essential for rapid issue resolution.

Monitoring method: Track response times such as tp99 (99th‑percentile response time) to assess service performance.

Latency measures the time needed to complete an operation. Measuring latency lets you quantify how long specific tasks take, identify bottlenecks, and notice when operations take longer than expected. The SRE book emphasizes distinguishing successful from unsuccessful requests because they can have very different profiles that would otherwise skew averages.
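To make the point about separating successful and failed requests concrete, here is a minimal sketch that computes a nearest-rank tp99 over each population independently. The request data is hypothetical; real systems would read this from access logs or a metrics pipeline.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers (pct in 0-100)."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request log: (duration_ms, succeeded) pairs. Note the one
# slow failure that would badly skew a combined average.
requests = [(12, True), (15, True), (11, True), (2500, False), (14, True)]

ok = [d for d, success in requests if success]
failed = [d for d, success in requests if not success]

print("tp99 (successful):", percentile(ok, 99))   # 15
print("tp99 (failed):", percentile(failed, 99))   # 2500
```

Mixing the failed request into a single average would report roughly 510 ms for a service whose successful requests complete in about 13 ms, which is exactly the distortion the SRE book warns about.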

Traffic

Definition: Statistics of data flowing into and out of the system, used to gauge service capacity.

Importance: Traffic volume directly reflects system load and is valuable for capacity planning and resource allocation.

Monitoring method: Evaluate metrics such as transactions per second (TPS) or queries per second (QPS).

Traffic measures how busy a component or system is, capturing load or demand so you understand how much work the system is currently performing.

Consistently high or low traffic numbers may indicate that a service needs more resources or that problems are preventing proper routing. Traffic helps correlate latency spikes with load peaks and understand maximum throughput and degradation behavior.
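Traffic is usually derived from monotonically increasing counters rather than measured directly: sample the counter twice and divide the delta by the interval. A minimal sketch, with hypothetical counter values:

```python
def rate_per_second(count_prev, count_now, interval_s):
    """Requests-per-second derived from two cumulative counter samples."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (count_now - count_prev) / interval_s

# Hypothetical samples taken 15 s apart from a cumulative request counter.
qps = rate_per_second(120_000, 124_500, 15)
print(qps)  # 300.0
```

This is also how tools like Prometheus compute rates; counter resets (e.g., after a process restart) would need extra handling in production.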

Errors

Definition: The number of erroneous requests, usually expressed as an error rate.

Importance: Error rate is a key indicator of system stability and reliability; high error rates suggest serious problems or design flaws.

Monitoring method: Track not only error counts but also error types, sources, and causes to locate and resolve issues quickly.

Tracking errors reveals component health and how often components fail to respond correctly. Distinguishing error types enables precise alerts—some errors may require immediate notification, while others can be tolerated at low rates.
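The distinction between overall error rate and per-type counts can be sketched as follows, using a hypothetical window of HTTP status codes and treating 5xx responses as server errors:

```python
from collections import Counter

# Hypothetical response log: HTTP status codes returned in one window.
statuses = [200, 200, 500, 200, 404, 200, 503, 200, 200, 200]

errors = [s for s in statuses if s >= 500]  # treat 5xx as server errors
error_rate = len(errors) / len(statuses)
by_type = Counter(errors)

print(f"error rate: {error_rate:.1%}")  # 20.0%
print(by_type)
```

The aggregate rate drives the headline alert, while the per-type breakdown (here, one 500 and one 503) determines routing: a burst of 503s might page the capacity team, while 500s point at application bugs.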

Saturation

Definition: How "full" a service is, measured through resource utilization and remaining idle capacity.

Importance: Saturation reflects how heavily resources are used; when utilization approaches saturation, performance may suffer.

Monitoring method: Monitor CPU, memory, disk, network utilization; act when usage reaches defined thresholds.

Saturation provides insight into resource capacity, indicating how much of a given resource is being consumed. It helps expose underlying capacity issues and can correlate with latency or error spikes.
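A threshold check over utilization readings is the simplest form of saturation alerting. The readings and thresholds below are hypothetical; real values would come from an agent or exporter:

```python
# Hypothetical utilization readings (fractions of capacity) and per-resource
# alert thresholds; memory is typically alerted earlier than disk.
readings = {"cpu": 0.72, "memory": 0.91, "disk": 0.55, "network": 0.40}
thresholds = {"cpu": 0.80, "memory": 0.85, "disk": 0.90, "network": 0.75}

saturated = {r: v for r, v in readings.items() if v >= thresholds[r]}
for resource_name, value in saturated.items():
    print(f"ALERT: {resource_name} at {value:.0%} "
          f"(threshold {thresholds[resource_name]:.0%})")
```

Here only memory (91% against an 85% threshold) would fire; correlating that alert with latency or error spikes is what makes saturation actionable rather than merely descriptive.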

2. Measuring Important Data Across the Stack

Using the four golden signals as a guide, you can examine how these metrics appear at each level of the system hierarchy. Services are built by adding abstraction layers on top of more basic components, so metrics should add insight at every deployment level.

Individual server components

Applications and services

Server farms

Environment dependencies

End‑to‑end experience

The order expands the abstraction scope with each subsequent layer.

3. Metrics for Individual Server Components

Collect basic-level metrics related to the underlying hardware and operating system. Although modern development abstracts away many low‑level details, every service depends on hardware and OS to function, making close monitoring of these resources the first step toward understanding system health.

CPU: latency – scheduler delay; traffic – CPU utilization; errors – CPU‑specific fault events; saturation – run‑queue length.

Memory: traffic – amount of memory used; errors – out‑of‑memory errors; saturation – OOM events and swap usage.

Storage: latency – average wait time (await); traffic – read/write I/O level; errors – filesystem or device errors; saturation – I/O queue depth.

Network: latency – driver queue delay; traffic – bytes or packets per second; errors – packet loss or device errors; saturation – buffer overflows and retransmissions.

Beyond physical resources, monitor OS‑level limits such as open file handles and thread counts, which can be inspected and adjusted with commands like ulimit and help detect harmful usage patterns.
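These limits can also be read programmatically. A minimal sketch using Python's standard library (the resource module is Unix-only, and reading current descriptor usage from /proc/self/fd is Linux-specific):

```python
import os
import resource  # Unix-only standard-library module

# Soft/hard limits for open file descriptors, the same values shown
# by `ulimit -n` and `ulimit -Hn`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Warn when descriptor usage approaches the soft limit; on Linux the
# current count can be read by listing /proc/self/fd.
try:
    open_fds = len(os.listdir("/proc/self/fd"))
    if soft > 0 and open_fds / soft > 0.8:
        print("WARNING: nearing file-descriptor limit")
except FileNotFoundError:
    pass  # /proc is Linux-only
```

Alerting before the limit is hit matters because exhausting file descriptors tends to fail in confusing ways (connection errors, log writes silently dropped) rather than with a clear resource message.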

4. Metrics for Applications and Services

At this layer, focus on how well the application performs the work it is tasked with and which resources it consumes.

Latency: time to complete a request.

Traffic: requests per second served.

Errors: application errors while handling client requests or accessing resources.

Saturation: percentage or amount of resources currently in use.

Dependency‑related metrics (e.g., application memory usage, open connections, active workers) are also valuable for understanding how the application interacts with underlying servers.
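One common way to collect these application-level signals is to instrument request handlers directly. A minimal sketch using an in-process metrics dict and a decorator; the handler and payload are hypothetical, and a real service would export these values to a monitoring system instead of keeping them in memory:

```python
import time
from functools import wraps

# In-process counters for the three request-level golden signals.
metrics = {"requests": 0, "errors": 0, "latency_ms": []}

def instrumented(handler):
    """Record traffic, errors, and latency for every call to a handler."""
    @wraps(handler)
    def wrapper(*args, **kwargs):
        metrics["requests"] += 1
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        except Exception:
            metrics["errors"] += 1
            raise
        finally:
            metrics["latency_ms"].append((time.perf_counter() - start) * 1000)
    return wrapper

@instrumented
def handle_request(payload):
    return {"ok": True, "echo": payload}

handle_request("ping")
print(metrics["requests"], metrics["errors"], len(metrics["latency_ms"]))
```

Saturation is deliberately absent here: it is usually read from the runtime (worker-pool occupancy, connection-pool usage) rather than per request.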

5. Metrics for Server Farms and Their Communication

Distributed services span multiple server instances, adding coordination complexity. Monitor similar signals at the group level.

Latency: time for the pool to respond, including coordination or synchronization delays.

Traffic: requests per second processed by the pool.

Errors: errors occurring while handling client requests, accessing resources, or communicating with peers.

Saturation: amount of resources used, number of servers at full load, and number of available servers.

These signals become more complex when distributed, as latency may involve inter‑host communication, traffic reflects routing efficiency, and errors include network‑related failure modes.
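Aggregating per-instance readings into pool-level signals can be sketched as below. The instance snapshots are hypothetical; note how a zero-traffic instance surfaces as a possible routing problem rather than disappearing into a pool average:

```python
# Hypothetical per-instance snapshots for a pool of web servers.
instances = [
    {"host": "web-1", "qps": 140, "errors": 2, "requests": 8400, "cpu": 0.95},
    {"host": "web-2", "qps": 155, "errors": 0, "requests": 9300, "cpu": 0.60},
    {"host": "web-3", "qps": 0,   "errors": 0, "requests": 0,    "cpu": 0.05},
]

pool_qps = sum(i["qps"] for i in instances)
total_requests = sum(i["requests"] for i in instances)
total_errors = sum(i["errors"] for i in instances)
pool_error_rate = total_errors / total_requests if total_requests else 0.0

at_full_load = [i["host"] for i in instances if i["cpu"] >= 0.90]
idle = [i["host"] for i in instances if i["qps"] == 0]  # routing problem?

print(pool_qps, f"{pool_error_rate:.3%}", at_full_load, idle)
```

Keeping both the aggregate (pool QPS, pool error rate) and the outlier lists (saturated hosts, idle hosts) preserves the distributed failure modes the paragraph above describes.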

6. Metrics for External Dependencies and Deployment Environment

Track metrics for resources outside your direct control, such as third‑party services or cloud provider APIs.

Latency: time to receive a response from the provider.

Traffic: workload pushed to external services, number of API calls.

Errors: error rate of service requests.

Saturation: consumed account limits (instances, API quotas, cost).

These metrics help identify dependency issues, warn of resource exhaustion, and aid in cost control. When alternatives exist, the data can guide decisions to switch providers or trigger manual mitigation.
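For the saturation signal specifically, tracking consumption against account limits can be sketched as a simple quota check. The quota names and figures are hypothetical; real limits would come from the provider's API or billing console:

```python
# Hypothetical provider quotas and current consumption for the account.
quotas = {"api_calls_per_day": 1_000_000, "instances": 50}
usage  = {"api_calls_per_day": 870_000,   "instances": 31}

WARN_AT = 0.80  # warn once 80% of any account limit is consumed

warnings = [name for name, limit in quotas.items()
            if usage[name] / limit >= WARN_AT]

for name in warnings:
    print(f"WARNING: {name} at {usage[name] / quotas[name]:.0%} of quota")
```

Here the daily API-call quota (87% consumed) triggers a warning while instance count (62%) does not, giving the team time to request a limit increase or throttle callers before requests start being rejected.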

7. End‑to‑End Experience Metrics

The highest‑level metrics are observed at the entry point (load balancer, API gateway, etc.) where user requests first interact with the system.

Latency: time to complete a user request.

Traffic: user requests per second.

Errors: errors handling client requests or accessing resources.

Saturation: percentage or amount of resources currently in use.

Values outside acceptable ranges at this level directly impact users and can indicate SLA violations, traffic spikes, error surges, or resource constraints.
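A check of entry-point observations against agreed targets can be sketched as below; the SLO figures and observed values are hypothetical placeholders for numbers a team would negotiate and measure:

```python
# Hypothetical SLO targets for the user-facing entry point.
slo = {"tp99_ms": 300, "error_rate": 0.001, "saturation": 0.85}

# Current observations taken at the load balancer.
observed = {"tp99_ms": 420, "error_rate": 0.0004, "saturation": 0.60}

# Any observation exceeding its target is user-visible impact.
violations = [k for k in slo if observed[k] > slo[k]]
if violations:
    print("SLO at risk:", ", ".join(violations))
```

In this example only tp99 latency is out of bounds, which is the typical shape of an end-to-end alert: a single user-facing symptom whose cause must then be traced down through the lower layers described earlier.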

Conclusion

This guide first discussed the four golden signals that help discover and understand impactful changes in a system. It then used those signals as a lens to evaluate the most important factors to track at each deployment layer. While the golden signals provide an excellent starting framework for building health‑indicating metrics, you must also consider additional metrics specific to your own context.

Tags: monitoring, operations, metrics, SRE, system reliability
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together.
