Tag

fault self-healing

1 view collected around this technical thread.

Architect
Dec 27, 2024 · Big Data

Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, applies defined decision rules, and executes safe recovery workflows to improve the reliability and efficiency of massive, heterogeneous big‑data environments.

Automation · Cluster Management · Monitoring
16 min read

Bilibili Tech
Dec 10, 2024 · Big Data

Fault Self‑Healing System for Bilibili's Large‑Scale Big Data Cluster (BMR)

Bilibili's fault self‑healing platform for its BMR big‑data cluster (over 10,000 machines and 1 EB of storage) adds near‑real‑time fault discovery, intelligent diagnosis, and automated workflow handling. It cuts resolution time, improves stability across services, and scales to dozens of automated repairs per day.

Automation · BMR · Cluster Management
16 min read

Bilibili Tech
Jul 19, 2024 · Big Data

Bilibili's One-Stop Big Data Cluster Management Platform (BMR) - Architecture and Implementation

Bilibili's one‑stop Big Data Cluster Management Platform (BMR) consolidates HDFS, Spark, Flink, ClickHouse, Kafka, and other services into a unified system. It evolved through four stages (standardization, metadata‑driven construction, containerization, and observability) and addresses node consistency, scaling, fault self‑healing, and resource optimization. The platform delivers elastic scaling and automated start/stop, with further cost‑saving and stability enhancements planned.

Cluster Management · Containerization · Observability
12 min read

Efficient Ops
Mar 18, 2024 · Operations

How to Implement Fault Self‑Healing for Scalable Operations

This article explains why low‑disk alerts demand automation, outlines the concept of fault self‑healing versus manual response, and provides practical guidelines—including standards, monitoring dimensions, CMDB integration, script execution tools, and notification channels—to build a reliable self‑healing system for large‑scale environments.

Automation · CMDB · DevOps
10 min read

Efficient Ops
May 30, 2023 · Operations

Mastering Fault Self-Healing: Automate Disk Alerts and Scale Operations

Discover how to transform nightly disk‑space alerts into automated self‑healing workflows, covering prerequisite standards, multi‑dimensional monitoring, CMDB integration, script‑based remediation, and multi‑channel notifications to scale operations across thousands of servers without manual intervention.

CMDB · DevOps · Monitoring
10 min read

Efficient Ops
Jan 16, 2023 · Artificial Intelligence

How China Mobile’s AIOps Platform Achieved Top‑Tier Evaluation and What It Means for Intelligent Operations

This article explains the concept of AIOps, describes how the fault self‑healing module of China Mobile Information Technology's centralized operations management platform passed a comprehensive‑level assessment, shares insights from an interview with the project director, and introduces the national AIOps capability maturity model.

AI in IT · AIOps · Capability Maturity Model
9 min read

Efficient Ops
Aug 31, 2022 · Operations

How to Build Scalable Fault Self‑Healing for Modern Operations

This article explains why traditional manual responses to alerts are insufficient, outlines the concept of fault self‑healing, and provides a step‑by‑step guide on establishing standards, monitoring dimensions, a unified CMDB, automation tools, and notification channels to achieve automated recovery at scale.

Automation · CMDB · Monitoring
9 min read