Fault Self‑Healing System for Bilibili's Large‑Scale Big Data Cluster (BMR)
Bilibili's fault self-healing platform for its massive BMR big-data cluster (over 10,000 machines and 1 EB of storage) adds near-real-time fault discovery, intelligent diagnosis, and automated workflow handling. It dramatically cuts resolution time, improves stability across services, and scales to dozens of automated repairs per day.
The article introduces the fault‑self‑healing platform built on Bilibili's one‑stop big‑data cluster management system (BMR). BMR oversees more than 10,000 machines, 50+ service components, over 1 EB of storage and more than one million CPU cores, creating a massive and heterogeneous environment.
Challenges : huge cluster scale, complex mixed‑service deployments, and heterogeneous hardware make fault detection and remediation difficult, especially when tasks span hundreds or thousands of nodes.
Architecture Design : The self‑healing system adds near‑real‑time fault discovery, intelligent diagnosis, and analytical capabilities, turning passive response into proactive prevention. Its core modules are data collection, fault analysis, decision definition, and fault handling.
Data Collection : Captures host and service anomaly metrics, logs, alarm subscriptions, host‑fault data, and business metadata. All data are aggregated in a meta‑warehouse (元仓) that tags each record with host and business identifiers and streams the unified dataset to Kafka and StarRocks.
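The tagging step above can be sketched as follows. This is a minimal illustration, not Bilibili's actual schema: the field names (`host_id`, `rack`, `service`, `cluster`) and the in-memory `sink` standing in for a Kafka/StarRocks producer are all assumptions.

```python
import json
import time

def tag_record(raw_metric, host_meta, business_meta):
    """Attach host and business identifiers to a raw anomaly metric,
    mirroring the meta-warehouse tagging step (field names illustrative)."""
    return {
        "ts": raw_metric.get("ts", int(time.time())),
        "metric": raw_metric["metric"],
        "value": raw_metric["value"],
        "host_id": host_meta["host_id"],
        "rack": host_meta.get("rack"),
        "service": business_meta.get("service"),
        "cluster": business_meta.get("cluster"),
    }

def publish(records, sink):
    """Stand-in for streaming the unified dataset to Kafka/StarRocks."""
    for r in records:
        sink.append(json.dumps(r))

sink = []
raw = {"metric": "disk_io_util", "value": 0.98, "ts": 1700000000}
record = tag_record(
    raw,
    {"host_id": "bmr-node-001", "rack": "r12"},
    {"service": "DataNode", "cluster": "bmr-main"},
)
publish([record], sink)
```

Tagging every record with both host and business identity at ingest time is what later lets the analysis stage evaluate service impact without extra joins.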
Decision Definition : Constructs a knowledge base containing fault categories, severity, descriptions, impact scope, and handling rules. Decisions are linked to both fault metadata and business metadata to evaluate impact and select appropriate actions.
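A knowledge base of this shape could be modeled as a simple lookup joined against business metadata. The categories, severities, action names, and the `CRITICAL_SERVICES` escalation rule below are illustrative assumptions, not the platform's real schema.

```python
# Hypothetical knowledge base: fault category -> severity, impact, actions.
KNOWLEDGE_BASE = {
    "disk_failure": {
        "severity": "high",
        "impact": "node-local storage unavailable",
        "actions": ["defense_check", "isolate_disk", "hardware_repair", "notify"],
    },
    "port_anomaly": {
        "severity": "medium",
        "impact": "service endpoint unreachable",
        "actions": ["script_probe", "restart_service", "notify"],
    },
}

# Assumed escalation rule: faults touching these services are upgraded.
CRITICAL_SERVICES = {"NameNode", "Kafka"}

def decide(fault_category, business_meta):
    """Join fault metadata with business metadata to pick handling actions,
    escalating severity when a critical service is affected."""
    rule = KNOWLEDGE_BASE[fault_category]
    severity = rule["severity"]
    if business_meta.get("service") in CRITICAL_SERVICES:
        severity = "critical"
    return {"severity": severity, "actions": list(rule["actions"])}

decision = decide("disk_failure", {"service": "DataNode"})
```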
Fault Analysis : Periodically reads recent data from the meta‑warehouse, aggregates and filters it, enriches fault records (e.g., disk location, type, service impact), and performs noise reduction by deduplicating identical faults and applying a silent‑period for repeated incidents.
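The noise-reduction step (deduplication plus a silent period for repeats) can be sketched as a small stateful filter. The key structure and the 6-hour default window are assumptions for illustration.

```python
from datetime import datetime, timedelta

class NoiseFilter:
    """Suppress duplicate fault reports: a fault with the same key is
    only accepted once per silent period (window length is illustrative)."""

    def __init__(self, silent_period=timedelta(hours=6)):
        self.silent_period = silent_period
        self.last_seen = {}  # fault key -> timestamp of last accepted report

    def accept(self, fault_key, ts):
        prev = self.last_seen.get(fault_key)
        if prev is not None and ts - prev < self.silent_period:
            return False  # duplicate within the silent period, drop it
        self.last_seen[fault_key] = ts
        return True
```

Keying on, say, `(host_id, fault_category)` deduplicates repeated incidents on one machine while still letting the same fault type fire on other hosts.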
Fault Handling : Implements anti‑error strategies such as abnormal traffic detection, rate‑limiting, and proactive defense (e.g., checking under‑replicated HDFS blocks before disk removal). The system generates a workflow composed of sub‑tasks: defense tasks, script tasks, deployment tasks, hardware repair tasks, node isolation tasks, and notification tasks.
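The proactive-defense gate and the sub-task workflow described above can be combined in a short sketch. The function names, task types, and zero-block threshold are illustrative assumptions; the real check would query HDFS.

```python
def safe_to_remove_disk(under_replicated_blocks, threshold=0):
    """Proactive defense: refuse to pull a disk while HDFS still reports
    under-replicated blocks (threshold is an illustrative assumption)."""
    return under_replicated_blocks <= threshold

def build_disk_repair_workflow(host_id, disk, under_replicated_blocks):
    """Assemble the workflow as an ordered list of typed sub-tasks,
    or defer with a notification if the defense check fails."""
    if not safe_to_remove_disk(under_replicated_blocks):
        return [("notify", f"{host_id}:{disk} repair deferred: "
                           f"{under_replicated_blocks} under-replicated blocks")]
    return [
        ("defense",  f"recheck replication before touching {host_id}:{disk}"),
        ("script",   f"unmount and flag {disk}"),
        ("isolate",  f"remove {host_id}:{disk} from service"),
        ("hardware", f"open repair ticket for {disk}"),
        ("deploy",   f"reintegrate {disk} after replacement"),
        ("notify",   f"workflow finished for {host_id}:{disk}"),
    ]

wf = build_disk_repair_workflow("bmr-node-001", "sdb", 0)
```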
Workflow Execution : Tasks are executed serially with configurable timeouts and retry policies. Failures trigger alerts and allow manual intervention, including task skipping or workflow termination. Notifications are sent at workflow start, completion, and on exceptions.
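A minimal serial executor with these semantics might look like the sketch below: tasks run in order with a retry budget, and a task that exhausts its retries halts the workflow and raises an alert so an operator can skip it or terminate. Per-task timeouts are omitted here; all names are illustrative.

```python
def run_workflow(tasks, max_retries=2, on_alert=print):
    """Run (name, fn) tasks serially; retry failures, halt on exhaustion
    so the workflow can await manual skip/terminate (a sketch, not the
    platform's real executor)."""
    results = []
    for name, fn in tasks:
        for attempt in range(1, max_retries + 1):
            try:
                fn()
                results.append((name, "ok"))
                break
            except Exception as exc:
                if attempt == max_retries:
                    on_alert(f"task {name} failed after {attempt} tries: {exc}")
                    results.append((name, "failed"))
                    return "halted", results  # manual intervention point
    return "completed", results

calls = {"n": 0}
def flaky():
    # Simulates a transient failure that succeeds on the second attempt.
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient")

status, results = run_workflow([("defense", lambda: None), ("script", flaky)])
```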
Case Study – Disk Fault Self‑Healing : Three disk‑fault scenarios are handled: obvious hardware failures, performance outliers affecting services (detected via machine‑learning analysis), and disks nearing end‑of‑life during heavy shuffle workloads. The system automates isolation, repair, and reintegration steps, reducing manual effort and extending disk lifespan. Additional fault types (IO hangs, process crashes, port anomalies, etc.) are progressively covered.
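The article attributes outlier detection to machine-learning analysis; as a simplified statistical stand-in, a modified z-score over peer disks on the same host can flag a performance outlier. The latency metric, thresholds, and disk names below are all assumptions.

```python
import statistics

def find_outlier_disks(latency_ms, z_threshold=3.5):
    """Flag disks whose I/O latency is a modified-z-score outlier relative
    to peers (a crude stand-in for the article's ML-based detection)."""
    values = list(latency_ms.values())
    med = statistics.median(values)
    # Median absolute deviation; guard against an all-identical sample.
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return [disk for disk, v in latency_ms.items()
            if 0.6745 * (v - med) / mad > z_threshold]

outliers = find_outlier_disks({"sda": 5.0, "sdb": 5.2, "sdc": 4.9, "sdd": 50.0})
```

One-sided comparison is deliberate: only abnormally slow disks hurt shuffle-heavy services, so abnormally fast readings are not flagged.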
Summary and Outlook : The self‑healing system dramatically reduces fault‑resolution time, improves stability for critical big‑data components (NodeManager, DataNode, Flink, ClickHouse, Kafka, Spark), and processes over 20 automatic fault cases daily. Future work includes expanding coverage to more services, adding predictive fault analysis via machine learning, and further enhancing automation.
Bilibili Tech