Master System Monitoring with the USE Method and Prometheus
This article explains how to build a comprehensive monitoring system: it introduces the USE (Utilization, Saturation, Errors) method, outlines key system and application metrics, and demonstrates a practical implementation with Prometheus, Grafana, full‑link tracing, and the ELK stack for observability and performance troubleshooting.
1. Introduction
A good monitoring system not only reveals issues in real time but also automatically analyzes and locates bottlenecks, reporting them to relevant teams. The core is comprehensive, quantifiable metrics covering both system and application aspects.
System‑side metrics include CPU, memory, disk, file system, and network usage.
Application‑side metrics include process CPU, disk I/O, interface latency, error counts, and internal object memory usage.
2. System Monitoring
2.1 USE Method
Before monitoring, you need a concise way to describe resource usage. The USE (Utilization, Saturation, Errors) method simplifies performance metrics into three categories.
Utilization: the percentage of a resource's capacity in use over a time interval.
Saturation: how busy or overloaded a resource is, often reflected in queue length.
Errors: the number of error events observed.
These three categories cover common performance bottlenecks for hardware resources (CPU, memory, disk, network) and software resources (file descriptors, connections, conntrack).
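To make the three categories concrete, here is a minimal sketch of a USE-style reading for a single resource. The field names and the example capacities are illustrative, not from any particular tool:

```python
from dataclasses import dataclass

@dataclass
class UseSnapshot:
    """One USE-method reading for a single resource."""
    used: float        # amount of capacity currently in use
    capacity: float    # total capacity of the resource
    queue_length: int  # work waiting on the resource (saturation proxy)
    error_count: int   # error events in the sample window

    @property
    def utilization(self) -> float:
        """Utilization: fraction of capacity in use (0.0 - 1.0)."""
        return self.used / self.capacity

    @property
    def saturated(self) -> bool:
        """Saturation: queued work means the resource can't keep up."""
        return self.queue_length > 0

# Example: a 4-core CPU with 3.2 cores busy and 5 runnable tasks waiting
cpu = UseSnapshot(used=3.2, capacity=4.0, queue_length=5, error_count=0)
print(cpu.utilization)  # 0.8
print(cpu.saturated)    # True
```

Note that utilization alone can mislead: a CPU at 80% utilization with a long run queue is in worse shape than one at 95% with no queue, which is exactly why saturation is tracked separately.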
2.2 Performance Metrics
Common metrics for each resource are listed in the accompanying table.
USE focuses on core bottleneck indicators, but other metrics such as logs, process usage, and cache usage are also useful as auxiliary data.
2.3 Monitoring System Architecture
After defining metrics, build a monitoring system that collects, stores, queries, processes, alerts, and visualizes data. Open‑source tools such as Zabbix, Nagios, and Prometheus are available.
Using Prometheus as an example, its architecture includes data collection, storage, query/processing, alerting, and visualization.
Data collection: the Retrieval module scrapes metrics from targets; both Pull mode (the server scrapes targets directly) and Push mode (short‑lived jobs push to the Pushgateway, which the server then scrapes) are supported.
Data storage: the built‑in TSDB (time‑series database) persists samples to local disk (SSD recommended).
Query & processing: PromQL provides concise queries and basic processing, feeding alerts and visualizations.
Alerting: the Prometheus server evaluates alerting rules and fires alerts to Alertmanager, which handles grouping, inhibition, silencing, and routing to receivers.
Visualization: Prometheus web UI offers basic charts, while Grafana provides rich dashboards. Using USE metrics, Prometheus can collect CPU, memory, disk, network usage, saturation, and error counts, displayed via Grafana.
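To make the pull model concrete, a scrape target is just an HTTP endpoint (conventionally /metrics) returning plain text in Prometheus' exposition format. This sketch renders such a page; the metric names are hypothetical, and a real exporter such as node_exporter exposes hundreds of series:

```python
def render_exposition(metrics: dict[str, tuple[str, float]]) -> str:
    """Render metrics in Prometheus' plain-text exposition format.

    `metrics` maps metric name -> (HELP text, current value).
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")   # human-readable description
        lines.append(f"# TYPE {name} gauge")          # metric type hint
        lines.append(f"{name} {value}")               # the sample itself
    return "\n".join(lines) + "\n"

# Hypothetical USE-style gauges for one host
page = render_exposition({
    "cpu_utilization_ratio": ("Fraction of CPU capacity in use", 0.8),
    "cpu_runnable_tasks": ("Saturation: tasks waiting to run", 5),
    "nic_rx_errors_total": ("Errors: receive errors on eth0", 0),
})
print(page)
```

In Pull mode the Prometheus server fetches a page like this on each scrape interval; in Push mode a short‑lived batch job would send the same data to the Pushgateway instead.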
2.4 Summary
The core of system monitoring is resource usage (CPU, memory, disk, filesystem, network, file descriptors, connections). The USE method reduces metrics to utilization, saturation, and errors; high values indicate potential bottlenecks. A complete monitoring stack turns these metrics into actionable alerts and visual insights.
3. Application Monitoring
3.1 Application Metrics
Key application metrics are request count, error rate, and response time, reflecting user experience and service reliability. Additional metrics include process resource usage, inter‑service call latency and errors, and internal logic timings.
Combining these with system metrics enables pinpointing whether performance issues stem from resource limits, call‑chain problems, or internal logic.
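The three core application metrics can be tracked with a small in‑process recorder; this is a hedged sketch of the idea (a real service would export these counters to Prometheus rather than keep them in memory, and would use a bounded window):

```python
from statistics import quantiles

class RequestStats:
    """Track request count, error rate, and response-time percentiles."""
    def __init__(self):
        self.count = 0
        self.errors = 0
        self.latencies_ms = []

    def record(self, latency_ms: float, ok: bool) -> None:
        """Record one completed request."""
        self.count += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    @property
    def error_rate(self) -> float:
        return self.errors / self.count if self.count else 0.0

    def p99_ms(self) -> float:
        """99th-percentile latency over the recorded window."""
        return quantiles(self.latencies_ms, n=100)[98]

stats = RequestStats()
for i in range(100):
    stats.record(latency_ms=10.0 + i, ok=(i % 50 != 0))  # 2 simulated errors
print(stats.count, stats.error_rate)  # 100 0.02
```

Percentiles matter more than averages here: a healthy mean latency can hide a slow tail that dominates user experience.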
3.2 Full‑Link Tracing
Distributed tracing tools such as Zipkin, Jaeger, and Pinpoint build end‑to‑end call graphs. For example, a Jaeger trace can reveal a Redis timeout as the root cause of a slow request that would otherwise be hard to attribute.
Tracing also generates topology maps useful for microservice analysis.
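The core idea behind all of these tools is the same: every request carries a trace ID, and each service records timed spans tagged with that ID and a parent span. A minimal sketch of the concept (not the Zipkin or Jaeger wire format):

```python
import time
import uuid
from typing import Optional

class Span:
    """One timed unit of work inside a distributed trace."""
    def __init__(self, trace_id: str, name: str,
                 parent_id: Optional[str] = None):
        self.trace_id = trace_id        # shared by every span in the request
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent_id      # links the span into the call tree
        self.name = name
        self.start = time.monotonic()
        self.duration_ms = None

    def finish(self) -> None:
        self.duration_ms = (time.monotonic() - self.start) * 1000

# A trace crossing two hypothetical services
trace_id = uuid.uuid4().hex
root = Span(trace_id, "gateway /checkout")
child = Span(trace_id, "redis GET cart", parent_id=root.span_id)
child.finish()
root.finish()
# A collector groups spans by trace_id and rebuilds the call graph
print(root.trace_id == child.trace_id)  # True
```

In practice the trace ID is propagated across service boundaries in request headers, and the spans are shipped asynchronously to a collector, which is also what makes the topology maps mentioned above possible.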
3.3 Log Monitoring
Metrics alone may miss context; logs provide detailed event information. The ELK stack (Elasticsearch, Logstash, Kibana) is the classic choice for log collection, indexing, and visualization; Fluentd can replace Logstash for lower resource consumption, giving the EFK stack.
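For logs to be searchable in Elasticsearch, it helps to emit them as structured JSON rather than free text. A minimal sketch using only the standard library; the field names are illustrative, and teams typically standardize their own schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line (ELK-friendly)."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order created")  # emits one JSON line per event
```

Shippers such as Logstash or Fluentd can then forward these lines to Elasticsearch without fragile regex parsing, since every field is already keyed.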
3.4 Summary
Application monitoring consists of metric monitoring and log monitoring. Metrics are stored as time‑series for real‑time alerts; logs give contextual detail via searchable indexes. Full‑link tracing adds cross‑service visibility, accelerating root‑cause analysis in complex environments.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.