How to Build a Robust Big Data Monitoring and Alerting System
This article explains why high‑availability design and comprehensive monitoring are essential for modern big‑data platforms, outlines a layered architecture, and provides practical guidance on health checks, alerting, and data‑quality monitoring across storage, compute, scheduling, and service layers.
Since Google’s seminal papers—The Google File System (2003), MapReduce (2004), and Bigtable (2006)—the big‑data era has exploded, making high‑availability (HA) design and robust monitoring indispensable for reliable service delivery.
Big Data Foundation
The foundation consists of core cluster services such as HDFS, Hive, YARN, and Spark, which together form the platform on which storage, computation, and analysis are built.
Key Monitoring Areas
Host instance health (CPU, memory, disk I/O, network, …)
Cluster service health (Hive, HDFS, YARN, process failures)
Service resource usage and error rates (e.g., Hive SQL success rate, YARN queue saturation)
Critical events such as process restarts or master‑slave failover
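The host‑level checks above can be sketched with nothing but the standard library. This is a minimal sketch, not a production agent: the thresholds are hypothetical values you would tune to your own fleet, and real deployments usually delegate collection to an exporter (e.g., node_exporter) instead.

```python
import os
import shutil

# Hypothetical thresholds -- tune these to your own SLOs.
DISK_USAGE_LIMIT = 0.85    # alert when a mount is more than 85% full
LOAD_PER_CORE_LIMIT = 2.0  # alert when 1-min load exceeds 2x core count

def check_host(path="/"):
    """Return a list of (metric, value) pairs that breach their thresholds."""
    alerts = []
    total, used, _free = shutil.disk_usage(path)
    usage = used / total
    if usage > DISK_USAGE_LIMIT:
        alerts.append(("disk_usage", round(usage, 3)))
    load1, _load5, _load15 = os.getloadavg()  # Unix-only
    cores = os.cpu_count() or 1
    if load1 / cores > LOAD_PER_CORE_LIMIT:
        alerts.append(("load_per_core", round(load1 / cores, 2)))
    return alerts
```

An empty return value means the host passed both checks; each tuple in the result names the breached metric, which maps directly onto an alert label.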
Data Integration
Understanding the data pipeline helps identify monitoring points. The typical flow includes:
Data source layer (ODS): usually no monitoring needed.
Data collection layer: monitor FlinkCDC tasks for real‑time integration.
Data storage layer: watch storage service health and usage.
Data compute layer: monitor Flink jobs (status, latency, back‑pressure, checkpoints).
Scheduling engine layer: monitor scheduler health (e.g., DolphinScheduler master/worker).
Data service layer: ensure APIs, Grafana dashboards, or BI tools are reachable.
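One way to wire the per‑layer checks above into a single pipeline view is to treat each layer as a probe and report which ones fail. The sketch below is an illustrative pattern, not a specific framework's API; the probe callables would wrap whatever check each layer needs (an HTTP ping, a scheduler API call, a Kafka lag query).

```python
def pipeline_health(probes):
    """Evaluate per-layer health probes for the data pipeline.

    probes: dict mapping layer name -> zero-argument callable that
    returns True when the layer is healthy. Any exception raised by a
    probe is counted as a failure. Returns the list of unhealthy layers.
    """
    failed = []
    for layer, probe in probes.items():
        try:
            ok = probe()
        except Exception:
            ok = False
        if not ok:
            failed.append(layer)
    return failed
```

For example, `pipeline_health({"storage": lambda: True, "scheduler": lambda: False})` returns `["scheduler"]`, which an alerting job can turn into one notification per failed layer.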
Storage Monitoring
HDFS
DataNode disk failures
Excessive single‑replica blocks
Under‑replicated blocks
Improper data directory configuration
File count thresholds
Disk space usage on DataNode and HDFS overall
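Several of these HDFS signals are exposed by the NameNode's JMX endpoint (typically `http://<namenode>:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem` on Hadoop 3.x; the port and exact field names vary by version, so treat them as assumptions to verify against your cluster). Given that bean as a dict, a threshold check can stay a pure function:

```python
def hdfs_alerts(fsnamesystem, max_under_replicated=0,
                capacity_limit=0.80, max_files=100_000_000):
    """Evaluate a NameNode FSNamesystem JMX bean (as a dict) against
    thresholds. Threshold defaults are illustrative, not recommendations."""
    alerts = []
    if fsnamesystem["UnderReplicatedBlocks"] > max_under_replicated:
        alerts.append("under-replicated blocks")
    if fsnamesystem["CapacityUsed"] / fsnamesystem["CapacityTotal"] > capacity_limit:
        alerts.append("capacity above threshold")
    if fsnamesystem["FilesTotal"] > max_files:
        alerts.append("file count above threshold")
    return alerts

# A trimmed, hand-made sample of the bean's shape:
sample = {"UnderReplicatedBlocks": 12, "CapacityUsed": 9_000,
          "CapacityTotal": 10_000, "FilesTotal": 5_000_000}
```

Keeping the evaluation separate from the HTTP fetch makes the thresholds unit‑testable and lets the same logic run against multiple NameNodes.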
Object Storage
Bucket usage
Lifecycle management policies
Security audits (access‑key/secret‑key changes)
Compute Monitoring
Real‑time (Flink)
Job abnormal termination
Restart count
Kafka consumption lag
Checkpoint health and latency
Back‑pressure and skew
Job resource usage
Sink execution timeout
Custom metrics collection
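Flink's REST API reports job states via `GET /jobs/overview`; a sketch of classifying that response against the checks above is shown below. The `restarts` field is this sketch's own convention (in practice you would merge it in from the per‑job `numRestarts` metric), so treat it as an assumption.

```python
# States that should page someone immediately.
FAILED_STATES = {"FAILED", "FAILING", "RESTARTING"}

def flink_job_alerts(overview, max_restarts=3):
    """Classify jobs from a Flink /jobs/overview-shaped response.

    overview: dict with a "jobs" list; each job has "name" and "state",
    plus an optional merged-in "restarts" count (sketch convention).
    Returns (job name, reason) pairs for jobs needing attention.
    """
    alerts = []
    for job in overview["jobs"]:
        if job["state"] in FAILED_STATES:
            alerts.append((job["name"], "bad state: " + job["state"]))
        elif job.get("restarts", 0) > max_restarts:
            alerts.append((job["name"], "restart count %d" % job["restarts"]))
    return alerts
```

Kafka lag, checkpoint latency, and back‑pressure would be additional branches fed from their own metric endpoints; the shape of the check stays the same.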
Offline (Batch)
Job abnormal termination
Job timeout
Average execution time
Long‑tail tasks
Resource‑heavy tasks
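Long‑tail detection in batch workloads is often just an outlier rule over task durations. The sketch below uses a times‑the‑median heuristic; the factor of 3 is a common rule of thumb, not a fixed standard.

```python
import statistics

def long_tail_tasks(durations, factor=3.0):
    """Flag tasks whose run time exceeds `factor` times the median.

    durations: dict of task name -> duration in seconds.
    Returns the flagged task names, sorted for stable output.
    """
    if not durations:
        return []
    median = statistics.median(durations.values())
    return sorted(name for name, d in durations.items()
                  if d > factor * median)
```

For example, with three tasks near 100 s and one at 900 s, only the 900 s task is flagged; the same pattern applies to average‑execution‑time drift by comparing today's median against a historical baseline.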
Scheduling Monitoring
Master node status
Worker node status
Node load metrics
Data Service Monitoring
Service availability (e.g., Grafana, API endpoints)
Data correctness and completeness
Response times (page load, API latency)
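Availability and response‑time checks are most useful when compared against explicit SLO targets. This is a minimal sketch assuming you already collect per‑request latencies and an error count; the targets shown are placeholders, and the p95 uses the nearest‑rank method.

```python
import math

def slo_report(latencies_ms, error_count,
               p95_target_ms=500.0, availability_target=0.999):
    """Compare observed latencies/errors against SLO targets.

    latencies_ms: latencies of successful requests, in milliseconds.
    error_count: number of failed requests in the same window.
    """
    total = len(latencies_ms) + error_count
    availability = len(latencies_ms) / total if total else 1.0
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile (0.0 when the window is empty).
    p95 = ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)] if ordered else 0.0
    return {
        "availability_ok": availability >= availability_target,
        "p95_ok": p95 <= p95_target_ms,
        "p95_ms": p95,
    }
```

Evaluating the report per window (e.g., every five minutes) turns "is Grafana slow?" into a yes/no signal an alert rule can act on.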
Data Quality Monitoring
Based on DAMA standards, monitor the following dimensions:
Completeness – missing records, null attributes, incomplete constraints.
Accuracy – values that do not reflect real‑world facts, leading to faulty decisions.
Timeliness – data availability when needed.
Uniqueness – duplicate primary keys or redundant records.
Consistency – schema or value mismatches across sources.
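The completeness and uniqueness dimensions above lend themselves to simple rule checks over a batch of records. The sketch below assumes records arrive as dicts; accuracy and consistency need source‑specific rules (cross‑system reconciliation) and are deliberately left out.

```python
def quality_report(rows, key, required):
    """Run completeness and uniqueness checks over dict records.

    rows: list of dicts (one per record).
    key: column expected to be unique (e.g., a primary key).
    required: columns that must be non-null and non-empty.
    """
    missing = sum(1 for r in rows for c in required
                  if r.get(c) in (None, ""))
    keys = [r.get(key) for r in rows]
    duplicates = len(keys) - len(set(keys))
    return {"rows": len(rows),
            "missing_values": missing,
            "duplicate_keys": duplicates}
```

Running such checks as a scheduled task after each load, and alerting when the counts exceed zero (or a tolerated rate), closes the loop between data‑quality dimensions and the alerting system described earlier.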
Summary and Best Practices
Use mailing lists or subscription groups for alert notifications to handle personnel changes.
Integrate critical alerts with SMS/phone services for rapid response outside business hours.
Prioritize monitoring based on SLA and impact rather than trying to cover everything.
Ensure all stakeholders subscribe to relevant alerts to avoid missed critical events.
Maintain a closed‑loop process and knowledge base for monitoring, alerting, and incident handling.
References:
InfoQ: “Building a Trillion‑Scale Big Data Monitoring Platform in Practice” (万亿级大数据监控平台建设实践)
Juejin: “Big Data Cluster Monitoring Architecture” (大数据集群监控体系架构)
Huawei Cloud: “ALM‑13000 ZooKeeper Service Unavailable” (ALM‑13000 ZooKeeper服务不可用)
Zhihu: “DAMA Data Quality Management” (DAMA数据质量管理)
WeiLi Technology Team
Practicing data-driven principles and believing technology can change the world.