Master System Monitoring with the USE Method and Prometheus
This article explains how to design a comprehensive monitoring system using the concise USE (Utilization, Saturation, Errors) method, outlines essential system and application metrics, and demonstrates practical implementation with Prometheus, Grafana, and related open‑source tools.
1. Introduction
A good monitoring system not only exposes problems in real time but also automatically analyzes and locates bottlenecks, reporting them precisely to the responsible teams. Effective monitoring relies on comprehensive, quantifiable metrics covering both system resources and application behavior.
System‑level monitoring should include overall resource usage such as CPU, memory, disk, file system, and network. Application‑level monitoring must track internal states like process CPU and I/O, interface latency, error counts, and memory usage of internal objects.
2. System Monitoring
1. The USE Method
Before building a monitoring system, you likely want a concise way to describe resource usage. The USE (Utilization, Saturation, Errors) method simplifies performance metrics into three categories.
Utilization – the percentage of time or capacity a resource is used for service; 100% means the resource is fully consumed.
Saturation – the degree to which a resource has more work than it can serve immediately, usually visible as queue length; at 100% the resource cannot accept any more requests.
Errors – the count of error events; a higher number signals more severe problems.
These three categories capture common performance bottlenecks and can be applied to hardware resources (CPU, memory, disk, network) as well as software resources (file descriptors, connections, connection tracking).
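As a rough illustration of the three categories for a single resource (the CPU), the sketch below derives utilization from two cumulative counter samples and saturation from the run-queue length. The `CpuSample` type, the function names, and the sample numbers are hypothetical and chosen only for this example; on Linux the raw counters would typically come from /proc/stat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSample:
    """Cumulative CPU time counters (hypothetical type for this sketch)."""
    busy: float  # time spent doing work (user, system, ...)
    idle: float  # time spent idle


def utilization(prev: CpuSample, curr: CpuSample) -> float:
    """Fraction of the sampling interval the CPU was busy (0.0 to 1.0)."""
    busy = curr.busy - prev.busy
    total = busy + (curr.idle - prev.idle)
    return busy / total if total > 0 else 0.0


def saturation(run_queue_length: float, cpu_count: int) -> float:
    """Runnable tasks per CPU; values above 1.0 mean work is queuing."""
    return run_queue_length / cpu_count


def use_summary(prev, curr, run_queue_length, cpu_count, error_count):
    """Bundle the three USE numbers for one resource."""
    return {
        "utilization": utilization(prev, curr),
        "saturation": saturation(run_queue_length, cpu_count),
        "errors": error_count,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    before = CpuSample(busy=100.0, idle=900.0)
    after = CpuSample(busy=175.0, idle=925.0)
    print(use_summary(before, after, run_queue_length=8, cpu_count=4, error_count=0))
```

Utilization is computed from the difference between two samples because the underlying counters are cumulative; a single reading says nothing about the current load.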
2. Performance Metrics
The following table (illustrated below) lists typical metrics for each resource, helping you quickly reference the needed indicators.
While USE focuses on core bottleneck indicators, other metrics such as system logs, process resource usage, and cache statistics remain important for auxiliary analysis.
3. Monitoring System Architecture
After defining metrics, you need a monitoring system to collect, store, query, process, alert, and visualize them. Open‑source tools like Zabbix, Nagios, and especially Prometheus can be used.
Prometheus consists of several components:
Data collection – Prometheus scrapes targets over HTTP (pull); short‑lived jobs that cannot be scraped can push their metrics to a Pushgateway, which Prometheus then scrapes.
Data storage – A time‑series database (TSDB) persists metrics on disk.
Query and processing – PromQL provides concise querying and basic processing.
Alerting – Alertmanager receives the alerts fired by Prometheus alerting rules and handles grouping, inhibition, and silencing.
Visualization – The built‑in web UI offers simple graphs; combined with Grafana, it delivers powerful dashboards.
Using Prometheus, you can collect CPU, memory, disk, and network utilization, saturation, and error metrics from Linux servers, then display them via Grafana.
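A minimal sketch of what such queries might look like, assuming the servers run node_exporter with its default metric names and that Prometheus listens on http://localhost:9090 (both are assumptions, not details from this article). The dictionary keys and the `instant_query_url` helper are hypothetical; the code only builds URLs for the Prometheus HTTP API and does not contact a server.

```python
from urllib.parse import urlencode

# One PromQL expression per USE category. Metric names assume node_exporter.
USE_QUERIES = {
    # Utilization: share of CPU time not spent idle over the last 5 minutes.
    "cpu_utilization":
        '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    # Saturation: 1-minute load average (compare it with the CPU count).
    "cpu_saturation": "node_load1",
    # Utilization: share of memory that is not available.
    "memory_utilization":
        "1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes",
    # Errors: receive errors per second on each network interface.
    "network_errors": "rate(node_network_receive_errs_total[5m])",
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for an instant query against /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    for name, expr in USE_QUERIES.items():
        print(name, instant_query_url("http://localhost:9090", expr))
```

The same expressions can be pasted into a Grafana panel backed by a Prometheus data source.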
4. Summary of System Monitoring
The core of system monitoring is tracking resource usage (CPU, memory, disk, file system, network, file descriptors, connections, etc.). The USE method reduces performance indicators to utilization, saturation, and errors, allowing rapid identification of bottlenecks.
3. Application Monitoring
1. Application Monitoring Metrics
Beyond system resources, application monitoring focuses on request count, error rate, and response time—key indicators of user experience and service reliability. Additional metrics include process resource usage, inter‑service call latency and errors, and internal logic performance.
These metrics enable you to correlate system bottlenecks with application issues, pinpoint problematic service calls, and drill down to specific functions causing slowdown.
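The three core application metrics can be sketched with the standard library alone. The example below counts requests and errors per endpoint, accumulates response time, and renders the result in the Prometheus text exposition format. The class name, the metric names (`app_requests_total` and so on), and the endpoint are hypothetical; a real service would more likely use an official Prometheus client library.

```python
from collections import defaultdict


class RequestMetrics:
    """Per-endpoint request count, error count, and total latency."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.latency_sum = defaultdict(float)

    def observe(self, endpoint: str, seconds: float, ok: bool = True) -> None:
        """Record one finished request."""
        self.requests[endpoint] += 1
        self.latency_sum[endpoint] += seconds
        if not ok:
            self.errors[endpoint] += 1

    def error_rate(self, endpoint: str) -> float:
        """Errors divided by requests for one endpoint."""
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

    def render(self) -> str:
        """Render all metrics as Prometheus text exposition lines."""
        endpoints = sorted(self.requests)
        lines = ["# TYPE app_requests_total counter"]
        for ep in endpoints:
            lines.append(f'app_requests_total{{endpoint="{ep}"}} {self.requests[ep]}')
        lines.append("# TYPE app_request_errors_total counter")
        for ep in endpoints:
            lines.append(f'app_request_errors_total{{endpoint="{ep}"}} {self.errors[ep]}')
        lines.append("# TYPE app_request_duration_seconds summary")
        for ep in endpoints:
            lines.append(f'app_request_duration_seconds_sum{{endpoint="{ep}"}} {self.latency_sum[ep]}')
            lines.append(f'app_request_duration_seconds_count{{endpoint="{ep}"}} {self.requests[ep]}')
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    metrics = RequestMetrics()
    metrics.observe("/login", 0.25)
    metrics.observe("/login", 0.5, ok=False)
    print(metrics.render())
```

Keeping the endpoint as a label lets one query slice request count, error rate, and average latency per interface.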
2. End‑to‑End Tracing
Distributed systems benefit from tracing tools such as Zipkin, Jaeger, and Pinpoint. They visualize call chains and quickly reveal which component caused a failure, e.g., a Redis timeout.
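To make the idea of a call chain concrete, here is a simplified sketch of how a trace might be searched for the component that caused a failure. The `Span` type, the service names, and the durations are invented for illustration and do not reflect the data model of Zipkin, Jaeger, or Pinpoint.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """One timed operation in a trace (hypothetical, simplified model)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float
    error: bool = False


def root_cause(spans: List[Span]) -> Optional[Span]:
    """Return the deepest failed span: one that failed while none of its
    children failed. If several qualify, pick the slowest."""
    failed = [s for s in spans if s.error]
    parents_of_failed = {s.parent_id for s in failed}
    leaves = [s for s in failed if s.span_id not in parents_of_failed]
    return max(leaves, key=lambda s: s.duration_ms) if leaves else None


# Illustrative trace: the error propagates up from a Redis call.
TRACE = [
    Span("1", None, "gateway", 3050.0, error=True),
    Span("2", "1", "order-service", 3040.0, error=True),
    Span("3", "2", "redis", 3000.0, error=True),
    Span("4", "2", "mysql", 12.0),
]

if __name__ == "__main__":
    culprit = root_cause(TRACE)
    print(culprit.service if culprit else "no failed span")
```

Errors usually propagate upward through the call chain, so the failed span with no failed children is a reasonable first suspect.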
3. Log Monitoring
Metrics alone may lack context; logs provide detailed string messages for deeper analysis. The classic ELK stack (Elasticsearch, Logstash, Kibana) collects, indexes, and visualizes logs.
Logstash ingests and preprocesses logs, Elasticsearch indexes them for full‑text search, and Kibana offers dashboards. In resource‑constrained environments, Fluentd (EFK) can replace Logstash.
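The preprocessing step can be pictured as turning raw text lines into structured documents that are ready for indexing. The sketch below does this with a regular expression; the log layout (timestamp, level, service, message), the sample lines, and the function names are assumptions made for the example, and a real pipeline would use Logstash or Fluentd parsers instead.

```python
import json
import re
from collections import Counter

# Assumed layout: "<timestamp> <LEVEL> <service> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)$"
)

# Hypothetical sample input, including one line that does not match.
SAMPLE_LINES = [
    "2024-01-01T12:00:00Z INFO checkout order accepted",
    "2024-01-01T12:00:01Z ERROR checkout redis timeout after 3s",
    "not a structured line",
]


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def to_documents(lines):
    """Parse all lines and drop the ones that could not be parsed."""
    return [doc for doc in map(parse_line, lines) if doc is not None]


def count_by_level(docs):
    """Count documents per log level, e.g. to chart errors over time."""
    return Counter(doc["level"] for doc in docs)


if __name__ == "__main__":
    documents = to_documents(SAMPLE_LINES)
    for doc in documents:
        print(json.dumps(doc))  # one JSON document per log line
    print(count_by_level(documents))
```

Once fields such as level and service are extracted, the search and dashboard layers can filter and aggregate on them rather than matching raw strings.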
4. Summary of Application Monitoring
Application monitoring combines metric monitoring (time‑series measurement, storage, alerting) and log monitoring (contextual information via ELK). End‑to‑end tracing adds visibility across services, helping locate performance issues in complex microservice architectures.
Source: www.cnblogs.com/-wenli/p/14017850.html
Efficient Ops