
How to Build a Robust Big Data Monitoring and Alerting System

This article explains why high‑availability design and comprehensive monitoring are essential for modern big‑data platforms, outlines a layered architecture, and provides practical guidance on health checks, alerting, and data‑quality monitoring across storage, compute, scheduling, and service layers.

WeiLi Technology Team

Since Google’s seminal papers — The Google File System (2003), MapReduce (2004), and Bigtable (2006) — big‑data platforms have proliferated, making high‑availability (HA) design and robust monitoring indispensable for reliable service delivery.

Big Data Foundation

The foundation consists of core cluster services such as HDFS, Hive, YARN, and Spark, which together form the platform on which storage, computation, and analysis are built.

Key Monitoring Areas

Host instance health (CPU, memory, disk I/O, network, …)

Cluster service health (Hive, HDFS, YARN, process failures)

Service resource usage and error rates (e.g., Hive SQL success rate, YARN queue saturation)

Critical events such as process restarts or master‑slave failover
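As a quick illustration of the host‑level checks above, here is a minimal threshold evaluator. The metric names and limits are placeholders of our own choosing; in practice the values would come from an agent such as node_exporter or psutil.

```python
# Hypothetical threshold-based host health evaluator; metric names and
# limits are illustrative, not from any specific monitoring product.
THRESHOLDS = {
    "cpu_percent": 85.0,
    "mem_percent": 90.0,
    "disk_percent": 80.0,
}

def evaluate_host(host, metrics, thresholds=THRESHOLDS):
    """Return a list of (host, metric, value) alerts for breached thresholds."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value >= limit:
            alerts.append((host, name, value))
    return alerts

# Example: one host reporting high CPU but healthy memory and disk.
sample = {"cpu_percent": 92.3, "mem_percent": 41.0, "disk_percent": 55.0}
print(evaluate_host("dn-01", sample))
```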

Data Integration

Understanding the data pipeline helps identify monitoring points. The typical flow includes:

Data source layer (ODS): usually no monitoring needed.

Data collection layer: monitor FlinkCDC tasks for real‑time integration.

Data storage layer: watch storage service health and usage.

Data compute layer: monitor Flink jobs (status, latency, back‑pressure, checkpoints).

Scheduling engine layer: monitor scheduler health (e.g., DolphinScheduler master/worker).

Data service layer: ensure APIs, Grafana dashboards, or BI tools are reachable.
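One way to keep the per‑layer monitoring points organized is a small check registry, sketched below. The layer names mirror the pipeline above; the check bodies are placeholders to be replaced with real probes.

```python
# Minimal per-layer check registry; the check functions are placeholders.
CHECKS = {}

def register(layer):
    """Decorator that files a check function under a pipeline layer."""
    def wrap(fn):
        CHECKS.setdefault(layer, []).append(fn)
        return fn
    return wrap

@register("collection")
def flink_cdc_running():
    return True  # placeholder: query the FlinkCDC job status here

@register("scheduling")
def scheduler_master_alive():
    return True  # placeholder: probe the scheduler master here

def run_checks():
    """Run every registered check; return {layer: [names of failed checks]}."""
    failures = {}
    for layer, fns in CHECKS.items():
        bad = [fn.__name__ for fn in fns if not fn()]
        if bad:
            failures[layer] = bad
    return failures

print(run_checks())  # empty dict when every check passes
```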

Storage Monitoring

HDFS

DataNode disk failures

Excessive single‑replica blocks

Under‑replicated blocks

Improper data directory configuration

File count thresholds

Disk space usage on DataNode and HDFS overall
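The NameNode exposes most of these signals over its JMX endpoint (e.g. `http://<namenode>:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem`). The sketch below parses a payload trimmed to the fields we read; the 80% capacity threshold is illustrative.

```python
import json

# Evaluates a (trimmed) FSNamesystem bean as returned by the NameNode JMX
# endpoint; field names follow that bean, thresholds are our own.
def hdfs_alerts(jmx_json, capacity_pct_limit=80.0):
    bean = json.loads(jmx_json)["beans"][0]
    alerts = []
    if bean.get("UnderReplicatedBlocks", 0) > 0:
        alerts.append("under-replicated blocks: %d" % bean["UnderReplicatedBlocks"])
    if bean.get("MissingBlocks", 0) > 0:
        alerts.append("missing blocks: %d" % bean["MissingBlocks"])
    used_pct = 100.0 * bean["CapacityUsed"] / bean["CapacityTotal"]
    if used_pct >= capacity_pct_limit:
        alerts.append("capacity %.1f%% >= %.1f%%" % (used_pct, capacity_pct_limit))
    return alerts

# Example payload trimmed to the fields the function reads.
payload = json.dumps({"beans": [{
    "UnderReplicatedBlocks": 12,
    "MissingBlocks": 0,
    "CapacityUsed": 850,
    "CapacityTotal": 1000,
}]})
print(hdfs_alerts(payload))
```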

Object Storage

Bucket usage

Lifecycle management policies

Security audits (AK/SK changes)
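Bucket usage can be checked against per‑bucket quotas with logic like the following; in practice the byte counts would come from an S3‑compatible API or the provider's metrics service, and the 80% watermark is an arbitrary choice.

```python
# Illustrative bucket-usage check; real usage figures would come from an
# S3-compatible listing or provider metrics, not an in-memory dict.
def bucket_usage_alerts(buckets, quota_bytes):
    """buckets: {name: used_bytes}; alert when usage crosses 80% of quota."""
    return [name for name, used in buckets.items()
            if used >= 0.8 * quota_bytes.get(name, float("inf"))]

print(bucket_usage_alerts({"ods-raw": 900, "dw-cold": 100},
                          {"ods-raw": 1000, "dw-cold": 1000}))
```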

Compute Monitoring

Real‑time (Flink)

Job abnormal termination

Restart count

Kafka consumption lag

Checkpoint health and latency

Back‑pressure and skew

Job resource usage

Sink execution timeout

Custom metrics collection
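Several of these signals can be evaluated from job summaries shaped like the Flink REST API's `/jobs/overview` response (a `jobs` array with `name` and `state` per job). The `kafka_lag` field below is our own addition, assumed to be scraped separately from consumer‑group metrics.

```python
# Evaluates job summaries shaped like Flink's /jobs/overview response;
# the kafka_lag field is an assumption, joined in from consumer metrics.
BAD_STATES = {"FAILED", "CANCELED", "RESTARTING"}

def flink_alerts(jobs, max_lag=10_000):
    alerts = []
    for job in jobs:
        if job["state"] in BAD_STATES:
            alerts.append("%s is %s" % (job["name"], job["state"]))
        if job.get("kafka_lag", 0) > max_lag:
            alerts.append("%s lag %d > %d" % (job["name"], job["kafka_lag"], max_lag))
    return alerts

jobs = [
    {"name": "cdc-orders", "state": "RUNNING", "kafka_lag": 120},
    {"name": "agg-clicks", "state": "FAILED"},
]
print(flink_alerts(jobs))
```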

Offline (Batch)

Job abnormal termination

Job timeout

Average execution time

Long‑tail tasks

Resource‑heavy tasks
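A cheap way to surface long‑tail tasks is to compare each run against the median runtime of its peers; the factor of 3 below is an arbitrary starting point, not a recommendation from the source.

```python
import statistics

# Flag batch tasks whose runtime exceeds k times the median of their peers;
# k=3 is an illustrative default.
def long_tail(durations, k=3.0):
    """durations: {task: seconds}; return tasks slower than k * median."""
    med = statistics.median(durations.values())
    return sorted(t for t, d in durations.items() if d > k * med)

runs = {"ods_load": 120, "dim_build": 150, "fact_join": 140, "report_agg": 900}
print(long_tail(runs))  # the 900 s task stands out against a ~145 s median
```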

Scheduling Monitoring

Master node status

Worker node status

Node load metrics

Data Service Monitoring

Service availability (e.g., Grafana, API endpoints)

Data correctness and completeness

Response times (page load, API latency)
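A small availability‑and‑latency probe might look like the following; the endpoint URL is a placeholder, and the pure `classify()` helper keeps the SLO logic separate from the network call so it can be tested in isolation.

```python
import time
import urllib.request

# Availability/latency probe sketch; the URL below is a placeholder and the
# 1-second SLO is illustrative.
def classify(ok, latency_s, slo_s=1.0):
    """Turn a probe result into one of OK / SLOW / DOWN."""
    if not ok:
        return "DOWN"
    return "SLOW" if latency_s > slo_s else "OK"

def probe(url, timeout=5):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except OSError:
        ok = False
    return classify(ok, time.monotonic() - start)

if __name__ == "__main__":
    # Placeholder endpoint; substitute your Grafana or API health URL.
    print(probe("http://grafana.internal:3000/api/health"))
```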

Data Quality Monitoring

Based on DAMA standards, monitor the following dimensions:

Completeness – missing records, null attributes, incomplete constraints.

Accuracy – unreliable data leading to faulty decisions.

Timeliness – data availability when needed.

Uniqueness – duplicate primary keys or redundant records.

Consistency – schema or value mismatches across sources.
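Two of these dimensions (completeness and uniqueness) can be spot‑checked with simple rules like the ones below; the in‑memory records stand in for what would normally be SQL against the warehouse.

```python
# Rule checks for two DAMA dimensions, applied to in-memory records for
# illustration; in production these would be warehouse queries.
def null_rate(records, field):
    """Completeness: fraction of records where `field` is missing or None."""
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)

def duplicate_keys(records, key):
    """Uniqueness: key values that appear more than once."""
    seen, dupes = set(), set()
    for r in records:
        k = r[key]
        (dupes if k in seen else seen).add(k)
    return sorted(dupes)

rows = [{"id": 1, "email": "a@x"}, {"id": 2, "email": None}, {"id": 1, "email": "b@x"}]
print(null_rate(rows, "email"), duplicate_keys(rows, "id"))
```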

Summary and Best Practices

Use mailing lists or subscription groups for alert notifications to handle personnel changes.

Integrate critical alerts with SMS/phone services for rapid response outside business hours.

Prioritize monitoring based on SLA and impact rather than trying to cover everything.

Ensure all stakeholders subscribe to relevant alerts to avoid missed critical events.

Maintain a closed‑loop process and knowledge base for monitoring, alerting, and incident handling.
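The notification practices above can be encoded as severity‑based routing: critical alerts fan out to SMS/phone as well as the subscription group, while everything else goes to the group only. The channel and severity names are placeholders.

```python
# Severity-based alert routing sketch; channel names are placeholders.
ROUTES = {
    "critical": ["sms", "phone", "group"],
    "warning": ["group"],
    "info": ["group"],
}

def route(alert):
    """Return the notification channels for an alert with a 'severity' field."""
    return ROUTES.get(alert.get("severity", "info"), ["group"])

print(route({"severity": "critical", "msg": "NameNode failover"}))
```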


Tags: monitoring, architecture, big data, Flink, data quality, alerting, HDFS