How to Build a Semi‑Automated Prometheus Monitoring Stack for Small Teams
This article, distilled from the SRE book chapter on effective alerting with time-series data, presents a practical, semi-automated Prometheus monitoring stack for small-scale deployments (fewer than 500 nodes). It covers active monitoring concepts, the Prometheus data model, service-framework instrumentation, data scraping and visualization with Grafana, and alert handling via AlertManager.
Active Monitoring
Monitoring is the foundation of operations; it can be classified into three types: active monitoring (instrumentation before deployment), passive monitoring (black‑box checks such as ping), and side‑channel monitoring (external data like user feedback). The focus here is on active monitoring at the business level.
Why Prometheus?
Prometheus is chosen over other TSDBs because its query language PromQL acts like a programmable calculator, enabling virtually unlimited query combinations.
For example, an HTTP service may expose a metric <code>http_requests_total</code> to count requests. Sample scraped data:
<code>http_requests_total{instance="1.1.1.1:80",job="cluster1",location="/a"} 100
http_requests_total{instance="1.1.1.1:80",job="cluster1",location="/b"} 110
http_requests_total{instance="1.1.1.2:80",job="cluster2",location="/b"} 100
http_requests_total{instance="1.1.1.3:80",job="cluster3",location="/c"} 110</code>
With three labels (instance, job, location) you can compute:
Single-node QPS:
<code>sum(rate(http_requests_total[1m])) by (instance)</code>
Per-cluster, per-path QPS:
<code>sum(rate(http_requests_total[1m])) by (job, location)</code>
Prometheus supports three metric types:
Counter: monotonically increasing values, such as request counts.
Gauge: instantaneous values that may go up or down, e.g., CPU usage.
Histogram: distribution data, e.g., 95th-percentile latency.
Most business metrics are implemented as Counters; Histograms are used sparingly due to CPU cost.
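The relationship between a Counter and PromQL's <code>rate()</code> can be sketched in plain Python: given two samples of a monotonically increasing counter, the per-second rate is the value delta divided by the elapsed time. This is a simplified model for illustration, not Prometheus's actual extrapolating implementation:

```python
# Simplified model of PromQL rate() over a Counter: two samples of a
# monotonically increasing value yield an average per-second rate.
def counter_rate(sample_old, sample_new):
    """Each sample is a (timestamp_seconds, value) pair."""
    (t0, v0), (t1, v1) = sample_old, sample_new
    return (v1 - v0) / (t1 - t0)

# A counter scraped at t=0s (100 requests) and t=60s (400 requests)
# implies an average of 5 requests/second over the window.
qps = counter_rate((0, 100), (60, 400))
print(qps)  # 5.0
```

This is why Counters pair naturally with <code>rate()</code>: resets aside, only the deltas matter, and aggregation with <code>sum() by (...)</code> stays cheap.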
Prometheus uses a pull model, which reduces intrusion on services and provides real‑time data compared to log‑based offline aggregation.
Service Framework Instrumentation
The team uses a unified service framework with a multi‑process (master/worker) and multi‑thread event‑loop model, supporting protocols like HTTP, Thrift, and PB. Modules are loaded similarly to Nginx modules, and an asynchronous downstream API is provided.
To expose internal metrics, the framework needed:
Basic metric types (Counter, Histogram) with lock-free updates; Gauge is omitted because its values are hard to aggregate meaningfully across worker processes.
A registry to automatically export metrics via a <code>/metrics</code> endpoint.
Flexible instrumentation using labels so that new endpoints can be added without code changes (e.g., using a single <code>http_requests_total</code> metric with a <code>location</code> label).
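A minimal sketch of such a registry in Python (class names and structure are illustrative, not the team's actual framework): a labeled Counter whose samples are rendered in the Prometheus text exposition format, so a new location value needs no new code:

```python
class Counter:
    """A labeled, monotonically increasing metric (illustrative sketch)."""
    def __init__(self, name):
        self.name = name
        self.samples = {}  # frozenset of (label, value) pairs -> count

    def inc(self, amount=1, **labels):
        key = frozenset(labels.items())
        self.samples[key] = self.samples.get(key, 0) + amount

class Registry:
    """Holds metrics and renders them in the text exposition format."""
    def __init__(self):
        self.metrics = []

    def register(self, metric):
        self.metrics.append(metric)
        return metric

    def render(self):
        lines = []
        for m in self.metrics:
            for key, value in m.samples.items():
                labels = ",".join(f'{k}="{v}"' for k, v in sorted(key))
                lines.append(f"{m.name}{{{labels}}} {value}")
        return "\n".join(lines) + "\n"

registry = Registry()
http_requests_total = registry.register(Counter("http_requests_total"))

# A new endpoint only needs a new label value -- no new metric, no code change.
http_requests_total.inc(location="/a")
http_requests_total.inc(location="/b")
print(registry.render())
```

A <code>/metrics</code> HTTP handler would simply return <code>registry.render()</code>; in a real multi-process framework each worker would keep its counters in atomic or per-thread storage to make updates lock-free.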
Data Scraping and Visualization
After exposing metrics, Prometheus scrapes them, but attention is needed for:
Ensuring metric names conform to Prometheus conventions; a single malformed metric causes Prometheus to reject the entire scrape for that target.
Filtering or omitting low‑value metrics to control data volume.
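Both concerns are typically handled in the scrape configuration. A hedged sketch (the job name, target, and regex are illustrative), where <code>metric_relabel_configs</code> drops low-value series at scrape time:

```yaml
scrape_configs:
  - job_name: "cluster1"
    scrape_interval: 15s
    static_configs:
      - targets: ["1.1.1.1:80"]
    # Drop low-value metrics at scrape time to control data volume.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_gc_.*|process_.*"
        action: drop
```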
Prometheus is both CPU‑ and I/O‑intensive; ample CPU cores, large memory, and SSD storage are recommended to avoid scrape stalls.
While Prometheus provides a basic UI, Grafana is typically used for richer dashboards. A unified Grafana dashboard is built with three rows:
Row 1: real‑time QPS, average latency, queue time, core dump count, downstream failure rate and latency.
Row 2: business latency (50 % and 95 % percentiles), traffic, throughput by error code.
Row 3: downstream engine latency, traffic, throughput.
Example PromQL query to list top 5 downstream failure rates across data centers:
<code>topk(5, 100 * sum(rate(downstream_responses{error_code!="0"}[5m])) by (job, server) / sum(rate(downstream_responses[5m])) by (job, server))</code>
The range vector selector <code>[5m]</code> uses a 5-minute window for charts; alert rules typically use a 1-minute window.
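An alert rule built on the same expression, using the 1-minute window convention, might look like this (the threshold, rule name, and labels are illustrative):

```yaml
groups:
  - name: downstream.rules
    rules:
      - alert: DownstreamFailureRateHigh
        # 1-minute window for alerting, per the convention above.
        expr: |
          100 * sum(rate(downstream_responses{error_code!="0"}[1m])) by (job, server)
              / sum(rate(downstream_responses[1m])) by (job, server) > 5
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Downstream failure rate above 5% on {{ $labels.server }}"
```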
AlertManager
Prometheus evaluates alert rules after each scrape and forwards triggered alerts to AlertManager, which handles aggregation, silencing, and routing.
AlertManager can forward alerts via:
Webhook to an internal conversion service.
Built‑in integrations such as PagerDuty.
Email + Slack for small teams.
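A minimal AlertManager routing configuration covering these options (receiver names, the webhook URL, and the service key are placeholders):

```yaml
route:
  group_by: ["alertname", "job"]
  group_wait: 30s
  repeat_interval: 4h
  receiver: team-webhook
  routes:
    - match:
        severity: page
      receiver: team-pager

receivers:
  - name: team-webhook
    webhook_configs:
      - url: "http://alert-converter.internal/hook"
  - name: team-pager
    pagerduty_configs:
      - service_key: "<pagerduty-service-key>"
```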
More advanced alert hierarchy management remains an open topic.
Deploying Prometheus + Grafana + Mesos
The stack is packaged and deployed on Mesos, simplifying installation for small teams. Key points:
Prometheus and Grafana are not containerized; the package is stored in MinIO.
Prometheus runs on a dedicated host with a fixed data directory to minimize restart impact.
Grafana is accessed via an HAProxy‑Consul proxy for a stable entry point.
Conclusion
Prometheus is a powerful, rapidly evolving monitoring system. Although it still has room for improvement in stability, performance, and documentation, it is an excellent choice for small to medium teams when combined with a custom service framework, well‑designed instrumentation, unified Prometheus/Grafana templates, and deployment on a platform like Mesos.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.