Fault Self‑Healing System for Large‑Scale Big Data Clusters
This article describes the design, architecture, and technical implementation of BMR's fault self-healing platform, which automatically collects data, analyzes failures, applies decision rules, and executes safe recovery workflows to improve the reliability and operational efficiency of a massive, heterogeneous big-data environment.
1. Background
BMR manages more than 10,000 machines, 50+ service components, over 1 EB of storage, and over one million CPU cores, forming a massive, heterogeneous big-data cluster. This scale, combined with complex mixed deployment of services and diverse hardware, makes fault detection and remediation challenging.
2. Architecture Design
The self-healing system consists of four stages: data collection, decision definition, fault analysis, and fault handling. Near-real-time data is gathered from host metrics, logs, alarm subscriptions, system-provided fault data, and business metadata, then stored in a unified meta-warehouse.
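The four stages can be sketched as a simple pipeline. Everything below is illustrative: the function names, the sample event, and the rule threshold are assumptions for the sketch, not BMR's actual interfaces.

```python
# Hypothetical four-stage self-healing pipeline: collect -> decide -> analyze -> handle.

def collect():
    # Stage 1: gather near-real-time signals (metrics, logs, alarms, metadata).
    return [{"host": "node-01", "signal": "disk_io_error", "count": 3}]

def decide(event):
    # Stage 2: match the event against knowledge-base rules (one toy rule here).
    if event["signal"] == "disk_io_error" and event["count"] >= 3:
        return {"fault": "bad_disk", "level": "P2"}
    return None

def analyze(event, fault):
    # Stage 3: enrich the fault with context needed for remediation.
    return {**fault, "host": event["host"], "action": "isolate_node"}

def handle(fault):
    # Stage 4: execute the remediation workflow (stubbed as a log line).
    return f"executed {fault['action']} on {fault['host']}"

results = []
for ev in collect():
    rule_hit = decide(ev)
    if rule_hit:
        results.append(handle(analyze(ev, rule_hit)))
```

In the real system each stage is a separate service with the meta-warehouse in between; the sketch only shows how data flows through the four responsibilities.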
3. Technical Implementation
3.1 Data Collection – Captures abnormal host and service metrics, logs, alarm data, system-provided fault records, and business metadata (service, cluster, component, tag). Log sources include /var/log/messages, /var/log/syslog, etc.
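As a sketch of the log-collection side, the snippet below parses a classic syslog-style line (the format found in /var/log/messages) into a structured record. The regex, field names, and fault keywords are assumptions for illustration, not BMR's actual collector.

```python
import re

# Hypothetical parser for classic syslog lines, e.g. from /var/log/messages.
SYSLOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s"   # timestamp, e.g. "Mar  3 10:15:01"
    r"(?P<host>\S+)\s"                     # hostname
    r"(?P<proc>[^:\[]+)(\[\d+\])?:\s"      # process name, optional [pid]
    r"(?P<msg>.*)$"                        # free-form message
)

def parse_syslog_line(line):
    m = SYSLOG_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    # Flag lines that look like hardware/system faults for downstream analysis
    # (keyword list is a placeholder).
    rec["is_fault"] = any(k in rec["msg"].lower() for k in ("i/o error", "ecc", "segfault"))
    return rec

line = "Mar  3 10:15:01 node-01 kernel: sd 0:0:0:0: [sda] I/O error, dev sda"
record = parse_syslog_line(line)
```

A production collector would tail the file continuously and ship each parsed record downstream; the parsing step itself looks like this.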
3.1.2 Meta‑Warehouse – Aggregates all collected data, tags it with business and host identifiers, writes the unified stream to Kafka, and finally syncs it to StarRocks for analysis.
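The tagging step can be sketched as follows: look up the host's business metadata, merge it into the event, and serialize the result as the Kafka message value. The metadata table and field names are hypothetical; the actual Kafka producer call is omitted.

```python
import json

# Hypothetical host -> business metadata lookup (service, cluster, component, tag).
BUSINESS_META = {
    "node-01": {"service": "hdfs", "cluster": "bj-c1", "component": "DataNode", "tag": "ssd"},
}

def enrich(event):
    """Attach business and host identifiers, then serialize for Kafka."""
    meta = BUSINESS_META.get(event["host"], {})
    record = {**event, **meta}
    # Kafka message values are bytes; JSON keeps the record self-describing
    # so StarRocks can ingest the stream downstream.
    return json.dumps(record, sort_keys=True).encode("utf-8")

payload = enrich({"host": "node-01", "signal": "disk_io_error"})
```

Keeping the enrichment in one place means every downstream consumer (fault analysis, dashboards, StarRocks tables) sees the same unified schema.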
3.2 Decision Definition – Builds a knowledge base containing fault categories, levels, descriptions, impact scopes, and judgment conditions, as well as business impact metadata, enabling precise impact assessment and appropriate remediation actions.
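One plausible shape for a knowledge-base entry is a record holding the fault category, level, description, and impact scope, plus a judgment condition expressed as a predicate over an event. The schema below mirrors the fields named in the article but is an assumed structure, not BMR's actual format.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FaultRule:
    """Hypothetical knowledge-base entry; fields follow the article's list."""
    category: str
    level: str
    description: str
    impact_scope: str
    judge: Callable[[dict], bool]  # judgment condition over an incoming event

# Toy rule: three or more disk I/O errors on one host counts as a bad disk.
RULES = [
    FaultRule(
        category="disk",
        level="P2",
        description="Repeated I/O errors on a data disk",
        impact_scope="single host",
        judge=lambda e: e.get("signal") == "disk_io_error" and e.get("count", 0) >= 3,
    ),
]

def match_rules(event):
    """Return every knowledge-base rule whose condition the event satisfies."""
    return [r for r in RULES if r.judge(event)]
```

Because the condition is data-driven, new fault types can be covered by adding entries to the knowledge base rather than changing analysis code.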
3.3 Fault Analysis – Reads recent data from StarRocks, aggregates and filters it, enriches fault details (e.g., disk type, affected services) using multi‑dimensional data, and performs noise reduction by de‑duplicating frequent identical faults and applying silent periods.
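The noise-reduction idea can be sketched in a few lines: suppress a fault if the same (host, category) pair was already seen within a silence window, with repeats extending that window. The window length and key fields are assumptions for the sketch.

```python
# Hypothetical silence window; the real value would be tuned per fault type.
SILENCE_SECONDS = 600

def dedupe(faults):
    """Keep a fault only if the same (host, category) was not reported within
    SILENCE_SECONDS. Input: iterable of (timestamp, host, category) tuples."""
    last_seen = {}
    kept = []
    for ts, host, cat in sorted(faults):
        key = (host, cat)
        if key not in last_seen or ts - last_seen[key] >= SILENCE_SECONDS:
            kept.append((ts, host, cat))
        # Every occurrence, kept or suppressed, refreshes the silent period.
        last_seen[key] = ts
    return kept

events = [
    (0, "node-01", "disk"),    # first report: kept
    (100, "node-01", "disk"),  # repeat inside the window: suppressed
    (700, "node-01", "disk"),  # 600s after the last repeat: kept again
]
```

This keeps a flapping disk from generating one ticket per error line while still resurfacing the fault if it persists past the silent period.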
3.4 Fault Handling – Generates a workflow composed of sub‑tasks (defense, script, deployment, hardware repair, node isolation, notification). Each task type has specific safety checks, rate‑limiting, and timeout/retry policies. The workflow is stored in a database and scheduled for execution, with manual override options when needed.
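A recovery workflow of this shape can be sketched as an ordered list of sub-tasks, each with an optional safety check and a retry budget. The task names follow the article; the execution logic, dict schema, and retry count are illustrative assumptions.

```python
def run_workflow(tasks, max_retries=2):
    """Execute sub-tasks in order; each task is a dict with 'name', a 'run'
    callable returning True on success, and an optional 'safety_check'."""
    log = []
    for task in tasks:
        if not task.get("safety_check", lambda: True)():
            log.append((task["name"], "skipped"))   # safety gate refused the action
            continue
        for _attempt in range(max_retries + 1):
            if task["run"]():
                log.append((task["name"], "ok"))
                break
        else:
            log.append((task["name"], "failed"))    # exhausted all retries
    return log

flaky = iter([False, True])  # simulates a task that fails once, then succeeds
tasks = [
    {"name": "defense", "run": lambda: True},
    {"name": "node_isolation", "run": lambda: next(flaky)},
    {"name": "hardware_repair", "safety_check": lambda: False, "run": lambda: True},
]
result = run_workflow(tasks)
```

A real executor would persist the workflow state to a database between steps and expose the skipped/failed entries for manual override, as the article describes; the control flow per task looks like this.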
4. Summary and Outlook
The self‑healing system has been applied to critical components such as NodeManager, DataNode, Flink, ClickHouse, Kafka, and Spark, handling over 20 fault cases daily and saving at least one full‑time engineer's workload. Future work includes expanding coverage to more services, adding predictive fault analysis via machine learning, and further automating the recovery process.