Design and Implementation of a Multi‑Dimensional Monitoring Platform Based on Prometheus and M3DB
This article details the background, research, architecture, performance testing, and deployment of a comprehensive monitoring system that leverages Prometheus, Grafana, and M3DB to provide flexible metric collection, automatic dashboard generation, and a custom alerting service for large‑scale business services.
Background: the early zzmonitor system only supported four aggregation functions (SUM, MAX, MIN, AVG) and stored data in MySQL with limited retention, causing functional gaps, inflexible API design, poor time‑series storage performance, and high maintenance costs.
Research & selection: after evaluating Cat, Nightingale, and Prometheus, the team chose Prometheus for its flexible PromQL, rich exporter ecosystem, and active community.
Prometheus capabilities: a built‑in single‑node TSDB with pull‑based metric collection, support for Counter, Gauge, Histogram, and multi‑dimensional labels, and a design that tolerates minor data errors through linear extrapolation.
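To make the Histogram type concrete, here is a minimal in-memory sketch of how Prometheus-style histogram buckets accumulate. Each bucket counts observations at or below its upper bound, so counts are cumulative, and an implicit +Inf bucket holds the total. The class name `BucketHistogramSketch` and its methods are hypothetical and are not part of any client library; the bucket bounds mirror the article's Histogram example.

```java
import java.util.Arrays;

/** Illustrative model of cumulative histogram buckets; not a real client class. */
final class BucketHistogramSketch {
    private final double[] upperBounds;     // finite upper bounds, sorted ascending
    private final long[] cumulativeCounts;  // one extra slot at the end for +Inf
    private double sum;                     // running sum of all observed values

    BucketHistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        Arrays.sort(this.upperBounds);
        this.cumulativeCounts = new long[this.upperBounds.length + 1];
    }

    /** Records one observation in every bucket whose upper bound is >= value. */
    void observe(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                cumulativeCounts[i]++;
            }
        }
        cumulativeCounts[upperBounds.length]++; // the +Inf bucket counts everything
        sum += value;
    }

    /** Cumulative count for the bucket at the given index (sorted order). */
    long countAtOrBelow(int bucketIndex) {
        return cumulativeCounts[bucketIndex];
    }

    long totalCount() {
        return cumulativeCounts[upperBounds.length];
    }

    double sum() {
        return sum;
    }
}
```

With bounds 10, 20, 30, 40, an observation of 20 increments the buckets for 20, 30, 40, and +Inf but not the bucket for 10, because upper bounds are inclusive.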
Architecture design: remote storage was implemented using M3DB (M3 Coordinator, M3DB, M3 Query, M3 Aggregator). The client follows Prometheus remote‑write (ProtoBuf + HTTP) and pushes metrics asynchronously in batches directly to M3DB, eliminating the need for a separate Prometheus server.
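The batching side of that client can be sketched as follows. This is a simplified illustration, not the team's implementation: `Sample`, `BatchingWriterSketch`, and the `sender` callback are hypothetical names, the flush runs on the caller's thread rather than asynchronously, and the ProtoBuf encoding and HTTP call of remote write are left to whatever `sender` does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** One metric sample; field names are assumptions made for this sketch. */
final class Sample {
    final String name;
    final double value;
    final long timestampMs;

    Sample(String name, double value, long timestampMs) {
        this.name = name;
        this.value = value;
        this.timestampMs = timestampMs;
    }
}

/** Buffers samples and hands them to a sender in batches of at most maxBatchSize. */
final class BatchingWriterSketch {
    private final int maxBatchSize;
    private final Consumer<List<Sample>> sender; // stand-in for the remote-write call
    private final List<Sample> buffer = new ArrayList<>();

    BatchingWriterSketch(int maxBatchSize, Consumer<List<Sample>> sender) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        this.maxBatchSize = maxBatchSize;
        this.sender = sender;
    }

    /** Adds a sample and flushes as soon as the buffer reaches the batch size. */
    synchronized void add(Sample sample) {
        buffer.add(sample);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; a real client would also call this on a timer. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<Sample> batch = new ArrayList<>(buffer); // copy so the buffer can be reused
        buffer.clear();
        sender.accept(batch);
    }
}
```

In a production client the `sender` would typically enqueue the batch for a background thread, so that business code never blocks on the network.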
Performance testing: client throughput reaches tens of millions of operations per second (e.g., 43 M QPS for a single‑threaded Counter, which works out to roughly 23 ns per call), per‑call latency stays in the 20‑40 ns range, and memory usage scales with the number of labels (e.g., 381 KB for 500 labels in a Histogram).
Implementation details: a unified Grafana instance serves all environments; dashboards are generated from JSON templates; authentication is handled via Enterprise WeChat proxy; automatic panel initialization creates appropriate visualizations for Counter, Gauge, and Histogram metrics. Example code snippets are shown below.
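As an illustration of the automatic panel initialization, the sketch below picks a default PromQL query per metric type and fills a small panel template. The article does not give the queries or templates the team actually uses, so the 1m rate window, the 0.99 quantile, the `${...}` placeholders, and all class and method names are assumptions; the template is a simplified fragment, not the full Grafana dashboard schema.

```java
/** Metric types named in the article. */
enum MetricType { COUNTER, GAUGE, HISTOGRAM }

/** Hypothetical helper that chooses a default query and renders a panel fragment. */
final class PanelQuerySketch {
    // Simplified panel template; the placeholders are an assumption of this sketch.
    static final String PANEL_TEMPLATE =
            "{\"title\": \"${title}\", \"type\": \"timeseries\", "
            + "\"targets\": [{\"expr\": \"${expr}\"}]}";

    /** Returns a reasonable default PromQL expression for the given metric type. */
    static String defaultQuery(MetricType type, String metricName) {
        switch (type) {
            case COUNTER:
                // Counters only increase, so plot their per-second rate.
                return "sum(rate(" + metricName + "[1m]))";
            case GAUGE:
                // Gauges are plotted as-is.
                return metricName;
            case HISTOGRAM:
                // Estimate the 99th percentile from the cumulative buckets.
                return "histogram_quantile(0.99, sum(rate(" + metricName + "_bucket[1m])) by (le))";
            default:
                throw new IllegalArgumentException("unknown metric type: " + type);
        }
    }

    /** Fills the template; assumes title and expression contain no double quotes. */
    static String renderPanel(String title, MetricType type, String metricName) {
        return PANEL_TEMPLATE
                .replace("${title}", title)
                .replace("${expr}", defaultQuery(type, metricName));
    }
}
```

Choosing the query from the metric type is what lets a new metric get a usable panel without anyone writing PromQL by hand.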
Legacy zzmonitor usage, with one call per aggregation function:

    public void test() {
        long start = System.currentTimeMillis();
        // do something
        long cost = System.currentTimeMillis() - start;
        ZMonitor.sum("execution count", 1);
        ZMonitor.max("max latency", cost);
        ZMonitor.min("min latency", cost);
        ZMonitor.avg("average latency", cost);
    }

Counter, Gauge, and Histogram in the new client:

    // Counter: a value that only increases
    Counter counter = Counter.build().name("upload_picture_total").help("Number of uploaded pictures").register();
    counter.inc();

    // Gauge: a value that can go up or down
    Gauge gauge = Gauge.build().name("active_thread_num").help("Number of active threads").register();
    gauge.set(20);

    // Histogram: observations counted into buckets with upper bounds 10, 20, 30, 40
    Histogram histogram = Histogram.build().name("http_request_cost").help("HTTP request latency").buckets(10, 20, 30, 40).register();
    histogram.observe(20);

Alerting system: a custom alert service generates PromQL statements and stores them in MySQL, schedules checks via XXL‑Job, and evaluates the conditions against M3DB, so users can configure alerts by entering simple thresholds.
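The threshold-to-PromQL step of such an alert service might look like the sketch below. The article does not show the real rule format, so `AlertRuleSketch`, `Comparison`, and the rule fields are hypothetical; the generated string is what would be stored and later evaluated against the query endpoint by the scheduled job.

```java
/** Hypothetical alert rule: a metric expression compared against a user-supplied threshold. */
final class AlertRuleSketch {
    /** Comparison operators a user can pick; the symbols are valid PromQL operators. */
    enum Comparison {
        GT(">"), GE(">="), LT("<"), LE("<=");

        final String symbol;

        Comparison(String symbol) {
            this.symbol = symbol;
        }
    }

    private final String metricExpr;
    private final Comparison comparison;
    private final double threshold;

    AlertRuleSketch(String metricExpr, Comparison comparison, double threshold) {
        this.metricExpr = metricExpr;
        this.comparison = comparison;
        this.threshold = threshold;
    }

    /** Builds the PromQL condition, e.g. "active_thread_num > 200.0". */
    String toPromQl() {
        return metricExpr + " " + comparison.symbol + " " + threshold;
    }

    /** Local check of the same condition, useful for testing a rule before saving it. */
    boolean fires(double observedValue) {
        switch (comparison) {
            case GT: return observedValue > threshold;
            case GE: return observedValue >= threshold;
            case LT: return observedValue < threshold;
            case LE: return observedValue <= threshold;
            default: throw new IllegalStateException("unknown comparison: " + comparison);
        }
    }
}
```

Because a PromQL comparison filters out series that do not satisfy it, a scheduled check can treat any non-empty query result as "the alert condition holds".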
Final outcome: the platform provides business‑service, architecture‑component, and operations‑component dashboards, a unified monitoring view, low maintenance overhead, and extensibility through open‑source contributions, receiving positive feedback across business lines.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
