How Alibaba Automates Server Fault Detection and Self‑Healing at Scale
Alibaba's massive data‑center operations face a growing volume of hardware failures. To cope, the team built the DAM platform, which integrates Tianji application management, predictive fault detection, automated remediation, and self‑balancing cluster reconstruction, achieving near‑complete hardware fault coverage and sharply reducing manual intervention across hundreds of thousands of servers.
Preface
As Alibaba's big‑data product business grows, the number of servers continuously increases, and IT operations pressure rises proportionally. Hardware and software failures that cause service interruptions have become a major factor affecting stability.
This article explains how Alibaba implements hardware fault prediction, automatic server decommissioning, service self‑healing, and cluster self‑balancing reconstruction, achieving a closed‑loop strategy that automatically resolves common hardware faults without manual intervention before they impact business.
1. Challenges Faced
For MaxCompute, the offline computing platform that supports 95% of Alibaba Group's data storage and computation, server scale has reached hundreds of thousands of machines. The nature of offline jobs makes hardware faults hard to detect at the software level, and the group's unified hardware fault thresholds often miss faults that affect applications; each missed fault poses a significant stability risk.
We address two problems: timely detection of hardware faults and migration of affected machines' workloads.
The following sections analyze these issues and introduce our automated hardware self‑healing platform, the Dammo Hardware Platform (DAM).
2. Tianji Application Management
MaxCompute runs on Alibaba's data‑center operating system, Apsara, and all applications on Apsara are managed by the Alibaba foundational platform Tianji.
Tianji is an automated data‑center management system that handles hardware lifecycle and static resources such as programs, configurations, OS images, and data.
Our hardware self‑healing system tightly integrates with Tianji, leveraging Tianji's Healing mechanism to build a closed‑loop for hardware fault detection and automated repair for complex business workloads.
Through Tianji, we can issue commands (restart, reinstall, repair) to physical machines; Tianji translates these commands to each application on the machine, which then decides how to respond based on its business characteristics and self‑healing scenario.
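To make this flow concrete, the sketch below models how a machine‑level command could be translated into a per‑application decision. All names here (the enums, the `healing_decision` function, the redundancy rule) are illustrative assumptions, not Tianji's actual API:

```python
# Hypothetical sketch of an application-side healing hook: a machine-level
# action is translated into a per-application response based on the
# application's business characteristics. Names are illustrative, not Tianji's.

from enum import Enum

class Action(Enum):
    RESTART = "restart"
    REINSTALL = "reinstall"
    REPAIR = "repair"

class Response(Enum):
    APPROVE = "approve"          # safe to proceed immediately
    DRAIN_FIRST = "drain_first"  # migrate workloads off the machine first
    REJECT = "reject"            # e.g. releasing the machine would break redundancy

def healing_decision(action: Action, is_stateful: bool,
                     replicas_healthy: int, min_replicas: int) -> Response:
    """Decide how an application responds to a machine-level healing action."""
    if not is_stateful:
        return Response.APPROVE
    # A stateful service releases the machine only if the remaining
    # replicas still satisfy its redundancy requirement.
    if replicas_healthy - 1 >= min_replicas:
        return Response.DRAIN_FIRST
    return Response.REJECT

print(healing_decision(Action.REINSTALL, is_stateful=True,
                       replicas_healthy=3, min_replicas=2).value)  # drain_first
```

The key point the sketch illustrates is that the platform never decides for the application: it delivers the intent, and each application answers according to its own self‑healing scenario.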
3. Hardware Fault Detection
3.1 How to Detect
We focus on hardware issues such as disks, memory, CPU, network cards, and power supplies. Below are common detection methods and primary tools.
Disk failures account for more than 50% of all hardware faults, and media errors are the most common type.
Typical symptoms include file read/write failures, hangs, or slowness, but these do not always indicate a media failure, so we need to understand how media faults manifest at each layer of the stack.
System log errors can be found in /var/log/messages, with entries like:

Sep 3 13:43:22 host1.a1 kernel: : [14809594.557970] sd 6:0:11:0: [sdl] Sense Key : Medium Error [current]
Sep 3 20:39:56 host1.a1 kernel: : [61959097.553029] Buffer I/O error on device sdi1, logical block 796203507

TSAR I/O metric changes (rs/ws/await/svctm/util) often reflect read/write pauses; a rule such as qps = ws + rs < 100 & util > 90 helps identify disk issues when no large‑scale kernel problem is present.
System metric variations, such as increased load, can also indicate I/O issues.
SMART value jumps, specifically in attributes 197 (Current_Pending_Sector) and 5 (Reallocated_Sector_Ct), correlate with read/write anomalies.
In summary, a single stage observation is insufficient; multiple stages must be analyzed together to confirm hardware problems and quickly differentiate software from hardware issues.
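As an illustration of one of these stages, the kernel‑log signal can be extracted with a small parser. This is a sketch: the regular expressions match the sample entries shown above, not every kernel message format:

```python
import re

# Sketch: scan kernel log lines (as in /var/log/messages) for media-error
# signatures, counting hits per device. Patterns follow the sample entries.
MEDIA_ERROR = re.compile(r"\[(sd[a-z]+)\] Sense Key : Medium Error")
BUFFER_IO = re.compile(r"Buffer I/O error on device (sd[a-z]+\d*)")

def suspect_devices(lines):
    hits = {}
    for line in lines:
        m = MEDIA_ERROR.search(line) or BUFFER_IO.search(line)
        if m:
            dev = m.group(1).rstrip("0123456789")  # sdi1 -> sdi
            hits[dev] = hits.get(dev, 0) + 1
    return hits

log = [
    "Sep 3 13:43:22 host1.a1 kernel: : [14809594.557970] sd 6:0:11:0: [sdl] Sense Key : Medium Error [current]",
    "Sep 3 20:39:56 host1.a1 kernel: : [61959097.553029] Buffer I/O error on device sdi1, logical block 796203507",
]
print(suspect_devices(log))  # {'sdl': 1, 'sdi': 1}
```

In practice a hit count from one stage would only be one input; it would be cross‑checked against TSAR metrics and SMART attributes before a fault is confirmed.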
3.2 How to Converge
When a potential fault is detected, we follow these principles:
Metrics should be as independent of applications/business as possible: High I/O util (>90%) alone does not imply a fault; it may simply indicate a hotspot. We consider a disk potentially faulty only if util>90% and IOPS<30 for over 10 minutes.
Collect comprehensively, converge cautiously: All possible fault indicators are collected, but most are used only as references, not as direct repair triggers. For example, we do not automatically open a repair ticket for a disk with high util and low IOPS unless SMART or clear fault sectors confirm the issue.
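The first principle, a rule that must hold continuously before a disk is flagged, can be sketched as a sliding window over samples. The thresholds (util > 90%, IOPS < 30, 10 minutes) come from the text; the sampling interval and class structure are assumptions:

```python
from collections import deque

# Sketch of the convergence rule above: flag a disk as potentially faulty
# only if util > 90% and IOPS < 30 hold continuously for 10 minutes.
WINDOW_SECONDS = 10 * 60

class DiskFaultDetector:
    def __init__(self, sample_interval=60):
        # Number of consecutive samples that must all violate the thresholds.
        self.needed = WINDOW_SECONDS // sample_interval
        self.samples = deque(maxlen=self.needed)

    def observe(self, util, iops):
        """Feed one sample; return True once the rule has held for the full window."""
        self.samples.append(util > 90 and iops < 30)
        return len(self.samples) == self.needed and all(self.samples)

det = DiskFaultDetector()
alerts = [det.observe(util=95, iops=10) for _ in range(10)]
print(alerts[-1])  # True: the rule held for 10 consecutive minutes
```

A single healthy sample resets the window (the deque then contains a False), which is exactly the cautious‑convergence behavior: transient hotspots never trigger a ticket.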
3.3 Application Effect – Coverage
In a production cluster of roughly xx machines, we compared our detections against the hardware fault tickets opened in the IDC work‑order system in 20xx.
Excluding out‑of‑band failures, our hardware fault detection coverage reaches 97.6%.
4. Hardware Fault Self‑Healing
4.1 Self‑Healing Process
For each machine with a hardware issue, we open a ticket that flows through the repair workflow automatically. Two self‑healing workflows exist: the Application‑Aware Repair Process for hot‑swappable disk failures, and the Application‑Blind Repair Process for all other whole‑machine hardware repairs.
Key design elements include:
No‑Disk Diagnosis: Crashed machines are booted into no‑disk (ramos) mode for hardware diagnosis; when no fault is found, we open a No‑Fault Crash ticket, greatly reducing false alarms and service‑desk workload.
Impact Assessment / Upgrade: If a process is stuck for over 10 minutes, we treat the disk fault as affecting the whole machine and trigger a reboot. If the reboot fails, the workflow automatically upgrades from the application‑aware to the application‑blind process.
Automatic Fallback for Unknown Issues: When a machine can enter no‑disk mode but diagnostics find no hardware problem, we reinstall the OS; a small fraction of machines reveal hardware faults during reinstall and are fixed.
Crash Analysis: The workflow also provides crash analysis capabilities, though the primary goal remains fault resolution.
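The escalation rules above can be condensed into a small decision function. This is a sketch with hypothetical names; the real workflow is a ticket‑driven state machine, not a single function:

```python
# Sketch of the escalation logic described above:
#  - a process stuck > 10 minutes upgrades a disk fault to a machine-level reboot;
#  - a failed reboot upgrades the application-aware flow to the application-blind flow;
#  - a clean no-disk diagnosis falls back to an OS reinstall (unknown-issue branch).

def next_step(fault, stuck_minutes=0, reboot_failed=False,
              diagnosis_found_fault=True):
    if fault == "disk_hotswap":
        if stuck_minutes > 10:  # stuck process: disk fault affects the whole machine
            return "app_blind_repair" if reboot_failed else "reboot"
        return "app_aware_repair"
    # All other faults go through the application-blind, whole-machine flow.
    if not diagnosis_found_fault:
        return "reinstall_os"  # automatic fallback for unknown issues
    return "app_blind_repair"

print(next_step("disk_hotswap", stuck_minutes=15))                       # reboot
print(next_step("disk_hotswap", stuck_minutes=15, reboot_failed=True))   # app_blind_repair
print(next_step("machine", diagnosis_found_fault=False))                 # reinstall_os
```

The value of writing the rules this way is that every upgrade path is explicit, so a stuck ticket can always be explained by the inputs that drove it.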
4.2 Process Statistics
Recurring hardware issues trigger statistical analysis of tickets. For example, repeated crashes caused by the Lenovo RD640 virtual serial port were identified from ticket statistics before the root cause was found, allowing us to isolate affected machines and keep clusters stable.
Similarly, a historical Hitachi disk drop‑out issue on Huawei N41 servers was traced by reviewing repair tickets.
4.3 Business‑Related Misconceptions
With the complete self‑healing system, some business, kernel, or software problems can also enter the workflow via the unknown‑issue branch. However, relying on hardware self‑healing for all problems can lead to “band‑aid” solutions that mask deeper issues.
We are gradually removing non‑hardware handling from the system, focusing on pure hardware self‑healing scenarios, which improves classification of software vs. hardware problems and aids discovery of unknown issues.
5. Architecture Evolution
5.1 Cloud‑Native Transition
The initial self‑healing architecture ran on each cluster’s control node, limiting data openness. We moved to a centralized architecture, then to a distributed service‑oriented redesign to handle massive data volumes, leveraging Alibaba Cloud Log Service (SLS), Cloud Stream Compute (Blink), and Cloud Analytic Database (ADS). The control plane now retains only core fault analysis and decision logic.
5.2 Data‑Driven Insights
Continuous data generation from the self‑healing system enables higher‑dimensional analysis, revealing valuable information. We reduce dimensionality by assigning a health score to each machine, allowing operators to quickly assess hardware status at the machine, rack, or cluster level.
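One way to picture the dimensionality reduction is a weighted penalty model. The signal names and weights below are illustrative assumptions, not DAM's actual scoring formula:

```python
# Sketch: reduce many hardware signals to one 0-100 health score per machine.
# Each signal is pre-normalized to a [0, 1] "badness" value; weights are
# illustrative assumptions.

WEIGHTS = {
    "smart_pending_sectors": 0.4,
    "io_util_violations":    0.3,
    "correctable_ecc_rate":  0.2,
    "repair_tickets_90d":    0.1,
}

def health_score(signals, weights=WEIGHTS):
    """Return a 0-100 score; missing signals default to healthy (0 badness)."""
    penalty = sum(weights[k] * min(max(signals.get(k, 0.0), 0.0), 1.0)
                  for k in weights)
    return round(100 * (1 - penalty), 1)

machine = {"smart_pending_sectors": 0.5, "io_util_violations": 0.2}
print(health_score(machine))  # 74.0
```

Rack‑ or cluster‑level status then falls out naturally, for example as the mean or minimum of machine scores, which is what lets an operator scan hardware health at any aggregation level.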
5.3 Service‑Oriented Offering
With full‑link data control, we package the fault self‑healing system as a standardized hardware lifecycle service for various product lines, providing customizable perception thresholds and supporting personalized full‑lifecycle services.
6. Conclusion
In AIOps’s perception‑decision‑execution loop, software/hardware fault self‑healing is the most common use case, and many industry players choose fault self‑healing as the first AIOps deployment. Providing a generic fault‑self‑healing closed‑loop is foundational for AIOps and NoOps, especially for massive systems.
Complex distributed systems inevitably encounter conflicts due to information asymmetry; abstracting these conflicts into explicit self‑healing behavior at the architectural level enables software components to cooperate, turning conflicts into coordinated actions.
By focusing on the biggest operational conflict—hardware vs. software—we design architecture and products that enhance the overall robustness of distributed systems through self‑healing.
Source: AliDataOps public account.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and hope to accompany you throughout your operations career.