Tag

hardware fault detection

1 views collected around this technical thread.

Efficient Ops
Efficient Ops
Nov 27, 2018 · Operations

How Alibaba Automates Server Fault Detection and Self‑Healing at Scale

Alibaba’s massive data‑center operations face growing hardware failures, so they built the DAM (Dammo) platform that integrates Tianji management, predictive fault detection, automated remediation, and self‑balancing cluster reconstruction, achieving near‑complete hardware issue coverage and reducing manual intervention across hundreds of thousands of servers.

AIOpscloud computinghardware fault detection
0 likes · 17 min read
How Alibaba Automates Server Fault Detection and Self‑Healing at Scale