Master System Monitoring with the USE Method and Prometheus
This article explains how to build a comprehensive monitoring system: it introduces the USE (Utilization, Saturation, Errors) method, outlines key system and application metrics, and demonstrates a practical implementation with Prometheus, Grafana, full‑link tracing, and the ELK stack for observability and performance troubleshooting.
1. Introduction
A good monitoring system not only reveals issues in real time but also automatically analyzes and locates bottlenecks, reporting them to relevant teams. The core is comprehensive, quantifiable metrics covering both system and application aspects.
System‑side metrics include CPU, memory, disk, file system, and network usage.
Application‑side metrics include process CPU, disk I/O, interface latency, error counts, and internal object memory usage.
2. System Monitoring
2.1 USE Method
Before monitoring, you need a concise way to describe resource usage. The USE (Utilization, Saturation, Errors) method simplifies performance metrics into three categories.
Utilization: the percentage of a resource's capacity in use over a time interval.
Saturation: how busy or overloaded a resource is, often reflected in queue length.
Errors: the number of error events observed.
These three categories cover common performance bottlenecks for hardware resources (CPU, memory, disk, network) and software resources (file descriptors, connections, conntrack).
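To make the three categories concrete, here is a minimal sketch of a USE-style reading for a single resource. The field names and the example capacities are illustrative, not from any particular tool:

```python
from dataclasses import dataclass

@dataclass
class UseSnapshot:
    """One USE-method reading for a single resource."""
    used: float        # amount of capacity currently in use
    capacity: float    # total capacity of the resource
    queue_length: int  # work waiting on the resource (saturation proxy)
    error_count: int   # error events in the sample window

    @property
    def utilization(self) -> float:
        """Utilization: fraction of capacity in use (0.0 - 1.0)."""
        return self.used / self.capacity

    @property
    def saturated(self) -> bool:
        """Saturation: queued work means the resource can't keep up."""
        return self.queue_length > 0

# Example: a 4-core CPU with 3.2 cores busy and 5 runnable tasks waiting
cpu = UseSnapshot(used=3.2, capacity=4.0, queue_length=5, error_count=0)
print(cpu.utilization)  # 0.8
print(cpu.saturated)    # True
```

Note that utilization alone can mislead: a CPU at 80% utilization with a long run queue is in worse shape than one at 95% with no queue, which is exactly why saturation is tracked separately.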
2.2 Performance Metrics
Common metrics for each resource are listed in the accompanying table.
USE focuses on core bottleneck indicators, but other metrics such as logs, process usage, and cache usage are also useful as auxiliary data.
2.3 Monitoring System Architecture
After defining metrics, build a monitoring system that collects, stores, queries, processes, alerts, and visualizes data. Open‑source tools such as Zabbix, Nagios, and Prometheus are available.
Using Prometheus as an example, its architecture includes data collection, storage, query/processing, alerting, and visualization.
Data collection: the Retrieval module scrapes metrics from targets; both Pull mode (the server scrapes targets directly) and Push mode (short‑lived jobs push to the Pushgateway, which the server then scrapes) are supported.
Data storage: the built‑in TSDB (time‑series database) persists samples to local disk (SSD recommended).
Query & processing: PromQL provides concise queries and basic processing, feeding alerts and visualizations.
Alerting: the Prometheus server evaluates alerting rules and fires alerts to Alertmanager, which handles grouping, inhibition, silencing, and routing to receivers.
Visualization: Prometheus web UI offers basic charts, while Grafana provides rich dashboards. Using USE metrics, Prometheus can collect CPU, memory, disk, network usage, saturation, and error counts, displayed via Grafana.
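To make the pull model concrete, a scrape target is just an HTTP endpoint (conventionally /metrics) returning plain text in Prometheus' exposition format. This sketch renders such a page; the metric names are hypothetical, and a real exporter such as node_exporter exposes hundreds of series:

```python
def render_exposition(metrics: dict[str, tuple[str, float]]) -> str:
    """Render metrics in Prometheus' plain-text exposition format.

    `metrics` maps metric name -> (HELP text, current value).
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")   # human-readable description
        lines.append(f"# TYPE {name} gauge")          # metric type hint
        lines.append(f"{name} {value}")               # the sample itself
    return "\n".join(lines) + "\n"

# Hypothetical USE-style gauges for one host
page = render_exposition({
    "cpu_utilization_ratio": ("Fraction of CPU capacity in use", 0.8),
    "cpu_runnable_tasks": ("Saturation: tasks waiting to run", 5),
    "nic_rx_errors_total": ("Errors: receive errors on eth0", 0),
})
print(page)
```

In Pull mode the Prometheus server fetches a page like this on each scrape interval; in Push mode a short‑lived batch job would send the same data to the Pushgateway instead.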
2.4 Summary
The core of system monitoring is resource usage (CPU, memory, disk, filesystem, network, file descriptors, connections). The USE method reduces metrics to utilization, saturation, and errors; high values indicate potential bottlenecks. A complete monitoring stack turns these metrics into actionable alerts and visual insights.
3. Application Monitoring
3.1 Application Metrics
Key application metrics are request count, error rate, and response time, reflecting user experience and service reliability. Additional metrics include process resource usage, inter‑service call latency and errors, and internal logic timings.
Combining these with system metrics enables pinpointing whether performance issues stem from resource limits, call‑chain problems, or internal logic.
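The three core application metrics can be tracked with a small in‑process recorder; this is a hedged sketch of the idea (a real service would export these counters to Prometheus rather than keep them in memory, and would use a bounded window):

```python
from statistics import quantiles

class RequestStats:
    """Track request count, error rate, and response-time percentiles."""
    def __init__(self):
        self.count = 0
        self.errors = 0
        self.latencies_ms = []

    def record(self, latency_ms: float, ok: bool) -> None:
        """Record one completed request."""
        self.count += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    @property
    def error_rate(self) -> float:
        return self.errors / self.count if self.count else 0.0

    def p99_ms(self) -> float:
        """99th-percentile latency over the recorded window."""
        return quantiles(self.latencies_ms, n=100)[98]

stats = RequestStats()
for i in range(100):
    stats.record(latency_ms=10.0 + i, ok=(i % 50 != 0))  # 2 simulated errors
print(stats.count, stats.error_rate)  # 100 0.02
```

Percentiles matter more than averages here: a healthy mean latency can hide a slow tail that dominates user experience.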
3.2 Full‑Link Tracing
Distributed tracing tools such as Zipkin, Jaeger, and Pinpoint build end‑to‑end call graphs. For example, a Jaeger trace can reveal a Redis timeout as the root cause of a slow request that would otherwise be hard to attribute.
Tracing also generates topology maps useful for microservice analysis.
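The core idea behind all of these tools is the same: every request carries a trace ID, and each service records timed spans tagged with that ID and a parent span. A minimal sketch of the concept (not the Zipkin or Jaeger wire format):

```python
import time
import uuid
from typing import Optional

class Span:
    """One timed unit of work inside a distributed trace."""
    def __init__(self, trace_id: str, name: str,
                 parent_id: Optional[str] = None):
        self.trace_id = trace_id        # shared by every span in the request
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent_id      # links the span into the call tree
        self.name = name
        self.start = time.monotonic()
        self.duration_ms = None

    def finish(self) -> None:
        self.duration_ms = (time.monotonic() - self.start) * 1000

# A trace crossing two hypothetical services
trace_id = uuid.uuid4().hex
root = Span(trace_id, "gateway /checkout")
child = Span(trace_id, "redis GET cart", parent_id=root.span_id)
child.finish()
root.finish()
# A collector groups spans by trace_id and rebuilds the call graph
print(root.trace_id == child.trace_id)  # True
```

In practice the trace ID is propagated across service boundaries in request headers, and the spans are shipped asynchronously to a collector, which is also what makes the topology maps mentioned above possible.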
3.3 Log Monitoring
Metrics alone may miss context; logs provide detailed event information. The ELK stack (Elasticsearch, Logstash, Kibana) is the classic choice for log collection, indexing, and visualization; Fluentd can replace Logstash for lower resource consumption, giving the EFK stack.
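For logs to be searchable in Elasticsearch, it helps to emit them as structured JSON rather than free text. A minimal sketch using only the standard library; the field names are illustrative, and teams typically standardize their own schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line (ELK-friendly)."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order created")  # emits one JSON line per event
```

Shippers such as Logstash or Fluentd can then forward these lines to Elasticsearch without fragile regex parsing, since every field is already keyed.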
3.4 Summary
Application monitoring consists of metric monitoring and log monitoring. Metrics are stored as time‑series for real‑time alerts; logs give contextual detail via searchable indexes. Full‑link tracing adds cross‑service visibility, accelerating root‑cause analysis in complex environments.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.