How to Build a Semi‑Automated Prometheus Monitoring Stack for Small Teams
This article, distilled from the SRE book chapter on effective alerting with time-series data, presents a practical, semi-automated Prometheus monitoring stack for small-scale deployments (fewer than 500 nodes). It covers active monitoring concepts, the Prometheus data model, service-framework instrumentation, data scraping and visualization with Grafana, and alert handling via AlertManager.
Active Monitoring
Monitoring is the foundation of operations; it can be classified into three types: active monitoring (instrumentation before deployment), passive monitoring (black‑box checks such as ping), and side‑channel monitoring (external data like user feedback). The focus here is on active monitoring at the business level.
Why Prometheus?
Prometheus is chosen over other TSDBs because its query language PromQL acts like a programmable calculator, enabling virtually unlimited query combinations.
For example, an HTTP service may expose a metric <code>http_requests_total</code> to count requests. Sample scraped data:
<code>http_requests_total{instance="1.1.1.1:80",job="cluster1",location="/a"} 100
http_requests_total{instance="1.1.1.1:80",job="cluster1",location="/b"} 110
http_requests_total{instance="1.1.1.2:80",job="cluster2",location="/b"} 100
http_requests_total{instance="1.1.1.3:80",job="cluster3",location="/c"} 110</code>
With three labels (instance, job, location) you can compute:
Single-node QPS:
<code>sum(rate(http_requests_total[1m])) by (instance)</code>
Per-cluster, per-path QPS:
<code>sum(rate(http_requests_total[1m])) by (job, location)</code>
Prometheus supports three metric types:
Counter: monotonically increasing values, such as request counts.
Gauge: instantaneous values that may go up or down, e.g., CPU usage.
Histogram: distribution data, e.g., 95th-percentile latency.
Most business metrics are implemented as Counters; Histograms are used sparingly due to CPU cost.
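The relationship between a Counter and PromQL's <code>rate()</code> can be sketched in plain Python: given two samples of a monotonically increasing counter, the per-second rate is the value delta divided by the elapsed time. This is a simplified model for illustration, not Prometheus's actual extrapolating implementation:

```python
# Simplified model of PromQL rate() over a Counter: two samples of a
# monotonically increasing value yield an average per-second rate.
def counter_rate(sample_old, sample_new):
    """Each sample is a (timestamp_seconds, value) pair."""
    (t0, v0), (t1, v1) = sample_old, sample_new
    return (v1 - v0) / (t1 - t0)

# A counter scraped at t=0s (100 requests) and t=60s (400 requests)
# implies an average of 5 requests/second over the window.
qps = counter_rate((0, 100), (60, 400))
print(qps)  # 5.0
```

This is why Counters pair naturally with <code>rate()</code>: resets aside, only the deltas matter, and aggregation with <code>sum() by (...)</code> stays cheap.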
Prometheus uses a pull model, which reduces intrusion on services and provides real‑time data compared to log‑based offline aggregation.
Service Framework Instrumentation
The team uses a unified service framework with a multi‑process (master/worker) and multi‑thread event‑loop model, supporting protocols like HTTP, Thrift, and PB. Modules are loaded similarly to Nginx modules, and an asynchronous downstream API is provided.
To expose internal metrics, the framework needed:
Basic metric types (Counter, Histogram) with lock-free updates; Gauge is omitted because its values are hard to aggregate meaningfully across worker processes.
A registry to automatically export metrics via a <code>/metrics</code> endpoint.
Flexible instrumentation using labels so that new endpoints can be added without code changes (e.g., using a single <code>http_requests_total</code> metric with a <code>location</code> label).
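A minimal sketch of such a registry in Python (class names and structure are illustrative, not the team's actual framework): a labeled Counter whose samples are rendered in the Prometheus text exposition format, so a new location value needs no new code:

```python
class Counter:
    """A labeled, monotonically increasing metric (illustrative sketch)."""
    def __init__(self, name):
        self.name = name
        self.samples = {}  # frozenset of (label, value) pairs -> count

    def inc(self, amount=1, **labels):
        key = frozenset(labels.items())
        self.samples[key] = self.samples.get(key, 0) + amount

class Registry:
    """Holds metrics and renders them in the text exposition format."""
    def __init__(self):
        self.metrics = []

    def register(self, metric):
        self.metrics.append(metric)
        return metric

    def render(self):
        lines = []
        for m in self.metrics:
            for key, value in m.samples.items():
                labels = ",".join(f'{k}="{v}"' for k, v in sorted(key))
                lines.append(f"{m.name}{{{labels}}} {value}")
        return "\n".join(lines) + "\n"

registry = Registry()
http_requests_total = registry.register(Counter("http_requests_total"))

# A new endpoint only needs a new label value -- no new metric, no code change.
http_requests_total.inc(location="/a")
http_requests_total.inc(location="/b")
print(registry.render())
```

A <code>/metrics</code> HTTP handler would simply return <code>registry.render()</code>; in a real multi-process framework each worker would keep its counters in atomic or per-thread storage to make updates lock-free.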
Data Scraping and Visualization
After exposing metrics, Prometheus scrapes them, but attention is needed for:
Ensuring metric names conform to Prometheus conventions; a single malformed metric causes Prometheus to reject the entire scrape for that target.
Filtering or omitting low‑value metrics to control data volume.
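Both concerns are typically handled in the scrape configuration. A hedged sketch (the job name, target, and regex are illustrative), where <code>metric_relabel_configs</code> drops low-value series at scrape time:

```yaml
scrape_configs:
  - job_name: "cluster1"
    scrape_interval: 15s
    static_configs:
      - targets: ["1.1.1.1:80"]
    # Drop low-value metrics at scrape time to control data volume.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_gc_.*|process_.*"
        action: drop
```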
Prometheus is both CPU‑ and I/O‑intensive; ample CPU cores, large memory, and SSD storage are recommended to avoid scrape stalls.
While Prometheus provides a basic UI, Grafana is typically used for richer dashboards. A unified Grafana dashboard is built with three rows:
Row 1: real‑time QPS, average latency, queue time, core dump count, downstream failure rate and latency.
Row 2: business latency (50 % and 95 % percentiles), traffic, throughput by error code.
Row 3: downstream engine latency, traffic, throughput.
Example PromQL query to list top 5 downstream failure rates across data centers:
<code>topk(5, 100 * sum(rate(downstream_responses{error_code!="0"}[5m])) by (job, server) / sum(rate(downstream_responses[5m])) by (job, server))</code>
The range vector selector <code>[5m]</code> uses a 5-minute window for charts; alert rules typically use a 1-minute window.
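An alert rule built on the same expression, using the 1-minute window convention, might look like this (the threshold, rule name, and labels are illustrative):

```yaml
groups:
  - name: downstream.rules
    rules:
      - alert: DownstreamFailureRateHigh
        # 1-minute window for alerting, per the convention above.
        expr: |
          100 * sum(rate(downstream_responses{error_code!="0"}[1m])) by (job, server)
              / sum(rate(downstream_responses[1m])) by (job, server) > 5
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Downstream failure rate above 5% on {{ $labels.server }}"
```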
AlertManager
Prometheus evaluates alert rules after each scrape and forwards triggered alerts to AlertManager, which handles aggregation, silencing, and routing.
AlertManager can forward alerts via:
Webhook to an internal conversion service.
Built‑in integrations such as PagerDuty.
Email + Slack for small teams.
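A minimal AlertManager routing configuration covering these options (receiver names, the webhook URL, and the service key are placeholders):

```yaml
route:
  group_by: ["alertname", "job"]
  group_wait: 30s
  repeat_interval: 4h
  receiver: team-webhook
  routes:
    - match:
        severity: page
      receiver: team-pager

receivers:
  - name: team-webhook
    webhook_configs:
      - url: "http://alert-converter.internal/hook"
  - name: team-pager
    pagerduty_configs:
      - service_key: "<pagerduty-service-key>"
```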
More advanced alert hierarchy management remains an open topic.
Deploying Prometheus + Grafana + Mesos
The stack is packaged and deployed on Mesos, simplifying installation for small teams. Key points:
Prometheus and Grafana are not containerized; the package is stored in MinIO.
Prometheus runs on a dedicated host with a fixed data directory to minimize restart impact.
Grafana is accessed via an HAProxy‑Consul proxy for a stable entry point.
Conclusion
Prometheus is a powerful, rapidly evolving monitoring system. Although it still has room for improvement in stability, performance, and documentation, it is an excellent choice for small to medium teams when combined with a custom service framework, well‑designed instrumentation, unified Prometheus/Grafana templates, and deployment on a platform like Mesos.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.