Design and Implementation of a Comprehensive Monitoring System for a Big Data Platform
This article describes the end‑to‑end design of a monitoring system for a large‑scale big‑data platform: its layered metric hierarchy, data collection methods, visualization dashboards, and alerting mechanisms. The system covers physical hosts, Hadoop ecosystem components, business services, and business data, using tools such as Telegraf, Prometheus, and Grafana.
1 Background
The YunZhu big‑data platform initially used scattered monitoring approaches, leading to inconsistent data collection, incomplete metric coverage, and no unified dashboard, which hindered stability as services grew.
2 Overall Design
A data‑warehouse‑style architecture is adopted, separating real‑time dashboards from offline alert analysis. The design includes four layers: physical host, big‑data component, business service, and business data.
2.1 Layered Metric Hierarchy
Physical host layer – CPU, memory, I/O, disk.
Big‑data component layer – HDFS, YARN, Zookeeper, Kafka, ClickHouse, Hive, Trino, etc.
Business service layer – custom services (edata, master data, AI inference).
Business data layer – Hive tables, ClickHouse tables, Elasticsearch indices.
2.2 Metric Examples
Each component defines specific monitoring items (e.g., HDFS total capacity, YARN running tasks, Kafka broker memory) with severity levels p0, p1, p2.
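As a sketch of how such a metric catalog might be organized, the snippet below models monitoring items as plain records tagged with layer, component, and severity. The specific entries mirror the examples in the text; the structure itself is an illustrative assumption, not the platform's actual schema.

```python
# Hypothetical layered metric registry; the component/metric names follow
# the examples above, and severity uses the p0/p1/p2 levels from the text.
METRICS = [
    {"layer": "component", "component": "HDFS",  "metric": "total_capacity", "severity": "p1"},
    {"layer": "component", "component": "YARN",  "metric": "running_tasks",  "severity": "p2"},
    {"layer": "component", "component": "Kafka", "metric": "broker_memory",  "severity": "p0"},
]

def by_severity(metrics, level):
    """Return every metric definition registered at the given severity level."""
    return [m for m in metrics if m["severity"] == level]
```

Keeping the catalog in one structured place makes it easy to drive both dashboards and alert rules from the same source of truth.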
3 Data Collection
3.1 Physical Host Collection
Telegraf agents collect time‑series metrics (CPU, memory, disk, network) from all hosts.
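A minimal Telegraf agent configuration for this kind of host collection might look like the fragment below. The input plugins shown (`cpu`, `mem`, `disk`, `net`) are standard Telegraf plugins; the output target and port are assumptions, since Telegraf can just as well write to InfluxDB or another sink.

```toml
# Sketch of a per-host Telegraf config; interval and output are assumptions.
[agent]
  interval = "10s"

[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.mem]]

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.net]]

# Expose collected metrics on a /metrics endpoint for Prometheus to scrape.
[[outputs.prometheus_client]]
  listen = ":9273"
```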
3.2 Big‑Data Suite Collection
Prometheus scrapes JMX exporters for Hadoop ecosystem components. Core Prometheus components (server, exporters, push gateway, client SDK) are used.
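A scrape job for JMX exporters could be declared roughly as follows. The job names, host names, and ports are placeholders for the platform's actual exporter endpoints; only the `prometheus.yml` structure itself is standard.

```yaml
# prometheus.yml sketch; targets are placeholder hostnames/ports.
scrape_configs:
  - job_name: "hdfs-namenode"
    scrape_interval: 30s
    static_configs:
      - targets: ["namenode-1:7001", "namenode-2:7001"]
  - job_name: "kafka-broker"
    static_configs:
      - targets: ["kafka-1:7071", "kafka-2:7071", "kafka-3:7071"]
```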
3.3 Business Service Collection
Spring Boot applications integrate the Prometheus client library to expose service‑level metrics.
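With the `micrometer-registry-prometheus` dependency on the classpath, exposing a scrape endpoint from a Spring Boot service is mostly configuration. The fragment below is a sketch; the service name tag is a hypothetical example.

```yaml
# application.yml sketch: expose /actuator/prometheus for scraping.
management:
  endpoints:
    web:
      exposure:
        include: "prometheus,health"
  metrics:
    tags:
      application: "edata-service"   # hypothetical service name
```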
3.4 Business Data Collection
Python scripts extract metadata from Hive Metastore (MySQL), ClickHouse system tables, and Elasticsearch APIs, synchronizing them into Hive for reporting.
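One of these extraction scripts might be sketched as below. The SQL targets the standard Hive Metastore backend schema (`TBLS`/`DBS`/`TABLE_PARAMS`); the connection details and the PyMySQL driver choice are assumptions, and loading the records into Hive is left out.

```python
from datetime import date

def rows_to_records(rows, snapshot_day=None):
    """Turn (db, table, total_size) rows from the Metastore's MySQL backend
    into flat records ready to load into a Hive reporting table."""
    day = (snapshot_day or date.today()).isoformat()
    return [
        {"db_name": db, "table_name": tbl, "total_size": int(size), "dt": day}
        for db, tbl, size in rows
    ]

# Query against the standard Metastore schema: table size in bytes is kept
# in TABLE_PARAMS under the 'totalSize' key.
METASTORE_SQL = """
SELECT d.NAME, t.TBL_NAME, p.PARAM_VALUE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN TABLE_PARAMS p ON t.TBL_ID = p.TBL_ID
WHERE p.PARAM_KEY = 'totalSize'
"""

if __name__ == "__main__":
    import pymysql  # assumption: PyMySQL is the driver in use
    conn = pymysql.connect(host="metastore-db", user="reader",
                           password="...", database="hive")
    with conn.cursor() as cur:
        cur.execute(METASTORE_SQL)
        for rec in rows_to_records(cur.fetchall()):
            print(rec)
```

The same record shape works for ClickHouse (`system.tables`, `system.parts`) and Elasticsearch (`_cat/indices`), so a single loader can unify all three sources.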
4 Monitoring Visualization
Grafana dashboards aggregate metrics from Prometheus and Elasticsearch, providing real‑time views for operations and weekly business reports via an internal reporting platform.
5 Alerting
5.1 Alert Levels
Alerts are classified by severity: p0 (critical; phone call plus DingTalk message), p1 (warning; DingTalk message), with lower levels routed through less urgent channels. Typical triggers include low host memory or a node going down.
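A Prometheus alerting rule carrying these severity labels could look like the fragment below. The threshold, duration, and node_exporter metric names are illustrative assumptions; only the rule-file structure and the `severity` label convention come from the text.

```yaml
# Sketch of a p0 host-memory rule; threshold and labels are assumptions.
groups:
  - name: host-alerts
    rules:
      - alert: NodeLowMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 5m
        labels:
          severity: p0
        annotations:
          summary: "Available memory below 10% on {{ $labels.instance }}"
```

Alertmanager can then route on the `severity` label, sending p0 alerts to the phone-call channel and p1 alerts to DingTalk only.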
5.2 Alert Convergence
Efforts are underway to reduce noise by auto‑remediating issues and consolidating alerts.
5.3 Alert Implementation
Host‑level alerts via Zabbix triggers.
Component alerts via Prometheus Alertmanager.
Business data alerts via custom Python scripts.
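A business-data alert script of this kind might be sketched as follows. The threshold-checking logic is generic; the webhook URL, metric names, and expected minimums are placeholders, and the DingTalk payload follows the robot webhook's plain-text message format.

```python
import json
import urllib.request

def check_thresholds(values, rules):
    """Return (metric, value, limit) triples for every metric whose observed
    value falls below its expected minimum."""
    return [
        (name, values[name], limit)
        for name, limit in rules.items()
        if name in values and values[name] < limit
    ]

def send_dingtalk(webhook_url, text):
    """Post a plain-text alert to a DingTalk robot webhook (URL is a placeholder)."""
    payload = json.dumps({"msgtype": "text", "text": {"content": text}}).encode()
    req = urllib.request.Request(webhook_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

if __name__ == "__main__":
    # Hypothetical daily row counts checked against expected minimums.
    breaches = check_thresholds({"dws_orders_rows": 120}, {"dws_orders_rows": 1000})
    for name, value, limit in breaches:
        send_dingtalk("https://oapi.dingtalk.com/robot/send?access_token=...",
                      f"[p1] {name}={value} below expected minimum {limit}")
```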
Future plans include a unified Kafka‑Flink pipeline for rule‑based alert processing.
YunZhu Net Technology Team