Automated Fault Recovery Architecture for Alibaba's Network during Double Eleven
This article describes Alibaba's end‑to‑end automated fault recovery system for its massive network: large‑scale data collection, Spark‑based event processing, flexible alerting with the Siddhi CEP engine, alert convergence using PageRank over the network topology, and scripted recovery actions that together sustain high availability during the Double Eleven traffic surge.
Each year the Double Eleven shopping festival puts Alibaba's network under extreme pressure, requiring it to handle billions of users and massive traffic spikes; any failure is amplified, making fast and reliable fault recovery essential.
The core workflow consists of monitoring collection → fault detection → root‑cause identification → automated recovery.
Figure 1 Automated Recovery Overall Process
Data collection is extremely rich, approaching a trillion records per day, including logs, SNMP metrics from routers and switches, AliPing (intranet quality), AliInternet (internet quality), Netflow (flow data), and more.
SNMP Collection
Network devices are partitioned into collection domains; each domain polls its own devices for metrics, and the results are aggregated centrally. Domains back each other up, so collection continues even if one domain fails.
Figure 2 SNMP & Syslog Collection
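The article does not say how devices are assigned to collection domains. A minimal sketch of one plausible scheme, stable hashing with a designated backup domain (all names here are hypothetical, not Alibaba's actual implementation):

```python
import hashlib

def assign_domain(device_ip, domains):
    """Map a device to a primary and a backup collection domain.

    A stable hash keeps assignments consistent across restarts; the
    backup domain takes over polling if the primary domain fails.
    """
    h = int(hashlib.md5(device_ip.encode()).hexdigest(), 16)
    primary = domains[h % len(domains)]
    backup = domains[(h + 1) % len(domains)]  # always a different domain
    return primary, backup

domains = ["domain-cn-east", "domain-cn-north", "domain-cn-south"]
primary, backup = assign_domain("10.0.3.17", domains)
```

Because the mapping depends only on the device address, every collector agrees on who owns which device without central coordination.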
AliPing Collection
To quickly and accurately detect packet loss and latency, Alibaba simulates business‑level network characteristics and performs ICMP/TCP ping probing on all physical servers.
Figure 3 AliPing (Intranet Quality) Architecture
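The article mentions ICMP/TCP probing from every physical server. ICMP requires raw sockets and privileges, but the TCP variant can be sketched with the standard library (host, port, and thresholds below are illustrative, not AliPing's real parameters):

```python
import socket
import time

def tcp_probe(host, port, timeout=1.0):
    """One TCP 'ping': round-trip connect time in ms, or None on loss/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

def loss_and_latency(host, port, count=10):
    """Aggregate a probe burst into the two signals AliPing cares about:
    packet-loss ratio and average latency."""
    samples = [tcp_probe(host, port) for _ in range(count)]
    ok = [s for s in samples if s is not None]
    loss = 1.0 - len(ok) / count
    avg = sum(ok) / len(ok) if ok else None
    return loss, avg
```

Running this from every server toward a mesh of peers yields the server‑to‑server quality matrix that the detection pipeline consumes.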
AliInternet Collection
The internet is treated as an extension of Alibaba's network; quality is monitored by dynamically selecting live IPs from a global IP pool and probing millions of IPs per minute.
Figure 4 AliInternet (Internet Quality) Architecture
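The "dynamically selecting live IPs from a global IP pool" step can be sketched as filtering the pool on a recent liveness signal and sampling a batch per probing round (the liveness score and function names are assumptions for illustration):

```python
import random

def pick_probe_targets(ip_pool, liveness, batch_size):
    """Choose live IPs from the global pool for this probing round.

    liveness: fraction of recent probes each IP answered (a hypothetical
    signal); dead addresses are skipped so probe capacity is not wasted.
    """
    live = [ip for ip in ip_pool if liveness.get(ip, 0.0) > 0.5]
    random.shuffle(live)  # rotate targets across rounds
    return live[:batch_size]
```

Repeating this selection every minute over a large enough pool is what lets the system probe millions of internet IPs per minute without hammering unresponsive ones.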
Additional data sources include Netflow from all routers, LVS VIP traffic, and Anat session logs.
Flexible Alerting (Fault Detection)
Real‑time stream processing converts collected data into basic abnormal events (e.g., port down, protocol interruption, high latency) using Spark Streaming, chosen for its hybrid online/offline computation, easy integration of external data, high performance, ML/graph capabilities, and reuse of existing Yarn clusters.
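The per‑record detection logic such a Spark Streaming job might apply can be shown as a plain function (the field names and thresholds are hypothetical; the actual rules and the stream wiring are Alibaba‑internal):

```python
# Hypothetical threshold; real detection rules are not published.
LATENCY_MS_THRESHOLD = 50.0

def to_basic_events(record):
    """Turn one raw monitoring record into zero or more basic abnormal
    events (port down, protocol interruption, high latency).

    In production this logic would run inside a Spark Streaming
    map/flatMap stage over the collected data.
    """
    events = []
    if record.get("oper_status") == "down":
        events.append({"type": "port_down",
                       "device": record["device"],
                       "port": record.get("port")})
    if record.get("bgp_state") not in (None, "Established"):
        events.append({"type": "protocol_interrupt",
                       "device": record["device"]})
    if record.get("latency_ms", 0.0) > LATENCY_MS_THRESHOLD:
        events.append({"type": "high_latency",
                       "device": record["device"],
                       "latency_ms": record["latency_ms"]})
    return events
```

Keeping this stage a pure record‑to‑events function is also what makes it easy to migrate jobs between clusters, as described below.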
Two years ago Alibaba built the RCS platform to address the lack of a Spark job management system, providing code/JAR management, runtime parameter handling, Yarn scheduling, Spark task submission, and monitoring/alerting.
Figure 5 Zeppelin‑Based Single‑Cluster Management
To ensure high availability during severe failures, a multi‑cluster disaster‑recovery mechanism was added, allowing tasks to migrate between clusters.
Figure 6 Zeppelin‑Based Multi‑Cluster Management
CEP Complex Event Engine
After basic events are generated, Alibaba uses the Siddhi CEP engine to define flexible alert rules, such as traffic thresholds, frequency conditions, aggregated thresholds, and combined conditions.
Figure 7 Overall Alert Process
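As a sketch, an aggregated‑threshold rule of the kind described above could look like the following in SiddhiQL (stream names and the threshold are illustrative, not Alibaba's actual rules):

```
define stream PortTraffic (device string, port string, inOctets long);

from PortTraffic#window.time(1 min)
select device, port, sum(inOctets) as totalOctets
group by device, port
having totalOctets > 100000000
insert into TrafficAlerts;
```

Because rules like this are declarative, operators can add or tune alert conditions without redeploying the stream‑processing jobs that feed them.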
Alert Convergence
Generated alerts are converged using network topology and PageRank scoring within connected sub‑graphs; the highest‑scoring devices/events are identified as the primary fault alerts.
Figure 8 Alert Convergence
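The convergence idea can be sketched with a small power‑iteration PageRank over one connected sub‑graph of alerting devices (the topology below is invented for illustration; edges point from an alerting device toward the neighbours that can "see" the fault):

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [out-neighbours]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in adj.items():
            if not outs:
                # Dangling node: spread its rank evenly over all nodes.
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
        rank = new
    return rank

# Three leaf switches all alert and point at the same spine switch,
# which accumulates their rank and surfaces as the primary fault.
alerts = {
    "leaf-1": ["spine-1"],
    "leaf-2": ["spine-1"],
    "leaf-3": ["spine-1"],
    "spine-1": [],
}
scores = pagerank(alerts)
primary = max(scores, key=scores.get)
```

The highest‑scoring device in each connected sub‑graph is reported as the primary alert; the leaf alerts are folded under it instead of paging operators three extra times.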
Fault Localization & Automated Recovery
Once the primary alert is identified, customized analysis and recovery strategies are applied. Operators can submit scripts via a platform to cover diverse fault scenarios.
Figure 9 Fault Recovery Process
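The platform through which operators submit recovery scripts is not described in detail; a minimal sketch of the dispatch pattern, mapping fault types to registered handlers (all names hypothetical):

```python
RECOVERY_SCRIPTS = {}

def recovery_script(fault_type):
    """Register an operator-submitted recovery action for one fault type."""
    def wrap(fn):
        RECOVERY_SCRIPTS[fault_type] = fn
        return fn
    return wrap

@recovery_script("port_down")
def isolate_port(alert):
    # A real script would push configuration to the device; here we
    # only return a description of the intended action.
    return f"shutdown {alert['device']}:{alert['port']} and reroute traffic"

def recover(alert):
    """Run the matching recovery script, or fall back to a human."""
    handler = RECOVERY_SCRIPTS.get(alert["type"])
    if handler is None:
        return "escalate to on-call engineer"
    return handler(alert)
```

New fault scenarios are covered by registering another script, without touching the detection or convergence stages upstream.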
Example: when a major carrier experiences an external outage, the system automatically triggers the appropriate recovery workflow.
Figure 10 Carrier Fault Automated Recovery
Conclusion
Over the past two years, Alibaba has built a three‑dimensional monitoring system, refined alert customization and convergence, and achieved 47% automation of network alerts; the goal is to raise this to over 90%, further reducing incident frequency and recovery time.