Automated Fault Recovery Architecture for Alibaba's Network during Double Eleven
This article describes Alibaba's end‑to‑end automated fault recovery system for its massive network: large‑scale data collection, Spark‑based event processing, flexible alerting with the Siddhi CEP engine, alert convergence using PageRank over the network topology, and scripted recovery actions that together sustain high availability during the Double Eleven traffic surge.
Each year the Double Eleven shopping festival puts Alibaba's network under extreme pressure, requiring it to handle billions of users and massive traffic spikes; any failure is amplified, making fast and reliable fault recovery essential.
The core workflow consists of monitoring collection → fault detection → root‑cause identification → automated recovery.
Figure 1 Automated Recovery Overall Process
Data collection is extremely rich, approaching a trillion records per day, including logs, SNMP metrics from routers and switches, AliPing (intranet quality), AliInternet (internet quality), Netflow (flow data), and more.
SNMP Collection
Network devices are partitioned into collection domains; each domain polls its own devices for metrics, and the results are aggregated centrally. Domains back each other up, so collection continues even if one domain fails.
Figure 2 SNMP & Syslog Collection
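The article does not say how devices are assigned to collection domains. A minimal sketch of one plausible scheme, stable hashing with a designated backup domain (all names here are hypothetical, not Alibaba's actual implementation):

```python
import hashlib

def assign_domain(device_ip, domains):
    """Map a device to a primary and a backup collection domain.

    A stable hash keeps assignments consistent across restarts; the
    backup domain takes over polling if the primary domain fails.
    """
    h = int(hashlib.md5(device_ip.encode()).hexdigest(), 16)
    primary = domains[h % len(domains)]
    backup = domains[(h + 1) % len(domains)]  # always a different domain
    return primary, backup

domains = ["domain-cn-east", "domain-cn-north", "domain-cn-south"]
primary, backup = assign_domain("10.0.3.17", domains)
```

Because the mapping depends only on the device address, every collector agrees on who owns which device without central coordination.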
AliPing Collection
To quickly and accurately detect packet loss and latency, Alibaba simulates business‑level network characteristics and performs ICMP/TCP ping probing on all physical servers.
Figure 3 AliPing (Intranet Quality) Architecture
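The article mentions ICMP/TCP probing from every physical server. ICMP requires raw sockets and privileges, but the TCP variant can be sketched with the standard library (host, port, and thresholds below are illustrative, not AliPing's real parameters):

```python
import socket
import time

def tcp_probe(host, port, timeout=1.0):
    """One TCP 'ping': round-trip connect time in ms, or None on loss/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

def loss_and_latency(host, port, count=10):
    """Aggregate a probe burst into the two signals AliPing cares about:
    packet-loss ratio and average latency."""
    samples = [tcp_probe(host, port) for _ in range(count)]
    ok = [s for s in samples if s is not None]
    loss = 1.0 - len(ok) / count
    avg = sum(ok) / len(ok) if ok else None
    return loss, avg
```

Running this from every server toward a mesh of peers yields the server‑to‑server quality matrix that the detection pipeline consumes.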
AliInternet Collection
The internet is treated as an extension of Alibaba's network; quality is monitored by dynamically selecting live IPs from a global IP pool and probing millions of IPs per minute.
Figure 4 AliInternet (Internet Quality) Architecture
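The "dynamically selecting live IPs from a global IP pool" step can be sketched as filtering the pool on a recent liveness signal and sampling a batch per probing round (the liveness score and function names are assumptions for illustration):

```python
import random

def pick_probe_targets(ip_pool, liveness, batch_size):
    """Choose live IPs from the global pool for this probing round.

    liveness: fraction of recent probes each IP answered (a hypothetical
    signal); dead addresses are skipped so probe capacity is not wasted.
    """
    live = [ip for ip in ip_pool if liveness.get(ip, 0.0) > 0.5]
    random.shuffle(live)  # rotate targets across rounds
    return live[:batch_size]
```

Repeating this selection every minute over a large enough pool is what lets the system probe millions of internet IPs per minute without hammering unresponsive ones.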
Additional data sources include Netflow from all routers, LVS VIP traffic, and Anat session logs.
Flexible Alerting (Fault Detection)
Real‑time stream processing converts collected data into basic abnormal events (e.g., port down, protocol interruption, high latency) using Spark Streaming, chosen for its hybrid online/offline computation, easy integration of external data, high performance, ML/graph capabilities, and reuse of existing Yarn clusters.
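The per‑record detection logic such a Spark Streaming job might apply can be shown as a plain function (the field names and thresholds are hypothetical; the actual rules and the stream wiring are Alibaba‑internal):

```python
# Hypothetical threshold; real detection rules are not published.
LATENCY_MS_THRESHOLD = 50.0

def to_basic_events(record):
    """Turn one raw monitoring record into zero or more basic abnormal
    events (port down, protocol interruption, high latency).

    In production this logic would run inside a Spark Streaming
    map/flatMap stage over the collected data.
    """
    events = []
    if record.get("oper_status") == "down":
        events.append({"type": "port_down",
                       "device": record["device"],
                       "port": record.get("port")})
    if record.get("bgp_state") not in (None, "Established"):
        events.append({"type": "protocol_interrupt",
                       "device": record["device"]})
    if record.get("latency_ms", 0.0) > LATENCY_MS_THRESHOLD:
        events.append({"type": "high_latency",
                       "device": record["device"],
                       "latency_ms": record["latency_ms"]})
    return events
```

Keeping this stage a pure record‑to‑events function is also what makes it easy to migrate jobs between clusters, as described below.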
Two years ago Alibaba built the RCS platform to address the lack of a Spark job management system, providing code/JAR management, runtime parameter handling, Yarn scheduling, Spark task submission, and monitoring/alerting.
Figure 5 Zeppelin‑Based Single‑Cluster Management
To ensure high availability during severe failures, a multi‑cluster disaster‑recovery mechanism was added, allowing tasks to migrate between clusters.
Figure 6 Zeppelin‑Based Multi‑Cluster Management
CEP Complex Event Engine
After basic events are generated, Alibaba uses the Siddhi CEP engine to define flexible alert rules, such as traffic thresholds, frequency conditions, aggregated thresholds, and combined conditions.
Figure 7 Overall Alert Process
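As a sketch, an aggregated‑threshold rule of the kind described above could look like the following in SiddhiQL (stream names and the threshold are illustrative, not Alibaba's actual rules):

```
define stream PortTraffic (device string, port string, inOctets long);

from PortTraffic#window.time(1 min)
select device, port, sum(inOctets) as totalOctets
group by device, port
having totalOctets > 100000000
insert into TrafficAlerts;
```

Because rules like this are declarative, operators can add or tune alert conditions without redeploying the stream‑processing jobs that feed them.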
Alert Convergence
Generated alerts are converged using network topology and PageRank scoring within connected sub‑graphs; the highest‑scoring devices/events are identified as the primary fault alerts.
Figure 8 Alert Convergence
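The convergence idea can be sketched with a small power‑iteration PageRank over one connected sub‑graph of alerting devices (the topology below is invented for illustration; edges point from an alerting device toward the neighbours that can "see" the fault):

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [out-neighbours]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in adj.items():
            if not outs:
                # Dangling node: spread its rank evenly over all nodes.
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
        rank = new
    return rank

# Three leaf switches all alert and point at the same spine switch,
# which accumulates their rank and surfaces as the primary fault.
alerts = {
    "leaf-1": ["spine-1"],
    "leaf-2": ["spine-1"],
    "leaf-3": ["spine-1"],
    "spine-1": [],
}
scores = pagerank(alerts)
primary = max(scores, key=scores.get)
```

The highest‑scoring device in each connected sub‑graph is reported as the primary alert; the leaf alerts are folded under it instead of paging operators three extra times.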
Fault Localization & Automated Recovery
Once the primary alert is identified, customized analysis and recovery strategies are applied. Operators can submit scripts via a platform to cover diverse fault scenarios.
Figure 9 Fault Recovery Process
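The platform through which operators submit recovery scripts is not described in detail; a minimal sketch of the dispatch pattern, mapping fault types to registered handlers (all names hypothetical):

```python
RECOVERY_SCRIPTS = {}

def recovery_script(fault_type):
    """Register an operator-submitted recovery action for one fault type."""
    def wrap(fn):
        RECOVERY_SCRIPTS[fault_type] = fn
        return fn
    return wrap

@recovery_script("port_down")
def isolate_port(alert):
    # A real script would push configuration to the device; here we
    # only return a description of the intended action.
    return f"shutdown {alert['device']}:{alert['port']} and reroute traffic"

def recover(alert):
    """Run the matching recovery script, or fall back to a human."""
    handler = RECOVERY_SCRIPTS.get(alert["type"])
    if handler is None:
        return "escalate to on-call engineer"
    return handler(alert)
```

New fault scenarios are covered by registering another script, without touching the detection or convergence stages upstream.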
Example: when a major carrier experiences an external outage, the system automatically triggers the appropriate recovery workflow.
Figure 10 Carrier Fault Automated Recovery
Conclusion
Over the past two years, Alibaba has built a three‑dimensional monitoring system, refined alert customization and convergence, and achieved 47% automation of network alerts; the goal is to raise this to over 90%, further reducing incident frequency and recovery time.