
Intelligent Network Practices for Alibaba's Double 11: Automation, Fault Detection, and Traffic Optimization

Alibaba senior technical expert Houyi explains how intelligent network automation, rapid fault detection, automatic isolation, and traffic‑optimizing technologies were applied during Double 11 to dramatically improve stability, reduce costs, and enhance overall network performance across millions of devices.

Alibaba Cloud Infrastructure

Houyi, a senior technical expert at Alibaba, shares the intelligent network practices used during Double 11 to boost stability, cost efficiency, and operational performance.

Key advancements include rapid fault detection, automatic repair, and change automation; high‑performance gateways (ANAT throughput up 16×, LVS up 8×); the 4.2 architecture for de‑stacking; a 25G backbone and a traffic scheduling platform; precise traffic assessment and QoS optimization; and AGN2.0 backbone upgrades with self‑developed optical modules that cut costs.

Managing a global network of millions of physical and virtual devices requires data‑driven fault discovery, classification of incidents as change‑related or non‑change‑related, and a fault feature library for proactive prediction. Automatic isolation now handles over 90% of cases with a 95% success rate.
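As a rough illustration of the change‑related versus non‑change‑related split, the sketch below labels an incident as change‑related when a change on the same device finished shortly before the incident was detected. The `Change` and `Incident` types, the `classify` function, and the 30‑minute window are all assumptions made for this example; the article does not describe how Alibaba's classifier actually works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    device: str            # hypothetical device identifier
    finished_at: datetime  # when the change completed


@dataclass(frozen=True)
class Incident:
    device: str
    detected_at: datetime


def classify(incident, changes, window=timedelta(minutes=30)):
    """Label an incident by correlating it with recent changes.

    An incident counts as change-related if any change on the same
    device finished within `window` before the incident was detected.
    The 30-minute default is an illustrative assumption.
    """
    for change in changes:
        elapsed = incident.detected_at - change.finished_at
        if change.device == incident.device and timedelta(0) <= elapsed <= window:
            return "change-related"
    return "non-change-related"
```

A production classifier would presumably also correlate on topology neighbors and change type; this version only shows the time‑window correlation idea.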

Intelligent scheduling and automatic isolation achieve 100% success in BGP export switching and high automation rates for port/link and board anomalies, dramatically reducing manual intervention.

The Beidou fault‑identification engine performs real‑time log analysis, abnormal traffic detection, and alarm convergence using machine‑learning and graph‑based ranking to pinpoint fault sources.
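To make the idea of graph‑based ranking for alarm convergence concrete, here is a minimal sketch. It is not Beidou's algorithm, which the article does not specify. It assumes a hypothetical dependency map (`downstream`) from each device to the devices that depend on it, and scores each alarming device by how many alarming devices it can reach, so a device that explains many downstream alarms ranks first.

```python
from collections import deque


def rank_fault_sources(downstream, alarming):
    """Rank alarming devices as candidate fault sources.

    downstream: dict mapping a device to the devices that depend on it
                (an assumed topology format for this sketch).
    alarming:   set of devices that currently have active alarms.

    A candidate's score is the number of alarming devices reachable
    from it through the dependency graph, itself included.
    """
    scores = {}
    for candidate in alarming:
        seen = {candidate}
        queue = deque([candidate])
        while queue:  # breadth-first walk over dependents
            node = queue.popleft()
            for child in downstream.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        scores[candidate] = len(seen & alarming)
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With a topology such as `{"core1": ["agg1", "agg2"], "agg1": ["tor1", "tor2"], "agg2": ["tor3"]}` and alarms on `agg1`, `tor1`, and `tor2`, the function ranks `agg1` first, collapsing three alarms into one probable source.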

Automated change processes atomize operations, use state‑machine orchestration, and monitor alarms in real time, enabling rapid rollback decisions and reducing human error to near zero.
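The combination of atomic operations, state‑machine orchestration, and alarm‑driven rollback can be sketched roughly as follows. The `State` values, the `AtomicStep` type, and the `has_alarm` callback are hypothetical names chosen for this example; the article does not describe the real orchestration engine at this level of detail.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class State(Enum):
    APPLYING = "applying"
    VERIFYING = "verifying"
    ROLLING_BACK = "rolling_back"
    DONE = "done"
    ROLLED_BACK = "rolled_back"


@dataclass
class AtomicStep:
    name: str
    apply: Callable[[], None]  # performs one atomic operation
    undo: Callable[[], None]   # reverses that operation


def run_change(steps: Sequence[AtomicStep], has_alarm: Callable[[], bool]) -> State:
    """Apply steps one at a time, checking alarms after each step.

    If an alarm appears, every applied step is undone in reverse order.
    """
    state, index = State.APPLYING, 0
    while state not in (State.DONE, State.ROLLED_BACK):
        if state is State.APPLYING:
            if index == len(steps):
                state = State.DONE
            else:
                steps[index].apply()
                index += 1
                state = State.VERIFYING
        elif state is State.VERIFYING:
            state = State.ROLLING_BACK if has_alarm() else State.APPLYING
        elif state is State.ROLLING_BACK:
            for step in reversed(steps[:index]):
                step.undo()
            state = State.ROLLED_BACK
    return state
```

Keeping each step small and paired with its own `undo` is what makes the rollback decision cheap: the orchestrator never has to reason about a partially applied step.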

The end‑to‑end diagnostic system “Paoding” automates topology discovery, alarm aggregation, log retrieval, and command execution, cutting diagnosis time from 1‑2 hours to about 3 minutes.
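One way to picture an end‑to‑end diagnosis is: discover the path between two endpoints, then gather the alarms raised by devices on that path. The sketch below does only that, under assumed data formats (an adjacency dict for links and a dict of alarms per device). The function names are illustrative and are not Paoding's interfaces; log retrieval and command execution are left out.

```python
from collections import deque


def find_path(links, src, dst):
    """Return the shortest hop path from src to dst, or None.

    links: dict mapping a device to its directly connected neighbors.
    """
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def diagnose(links, alarms, src, dst):
    """Aggregate alarms for the devices on the src-to-dst path."""
    path = find_path(links, src, dst)
    if path is None:
        return {"path": None, "suspects": []}
    suspects = [(hop, alarms[hop]) for hop in path if hop in alarms]
    return {"path": path, "suspects": suspects}
```

Alarms on devices that are not on the path are ignored, which is the main way a path‑scoped view shortens manual triage.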

NetO’s traffic‑optimizing platform, powered by the SR‑TE‑based SDN solution “Kuohai,” collects global traffic and routing data to perform multi‑objective optimization (cost, latency, bandwidth utilization). It automatically reallocates flows during failures or high‑cost scenarios, achieving load‑balancing and reducing transmission time.
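A simple way to picture multi‑objective path selection is a weighted sum over normalized cost, latency, and utilization, restricted to paths that are up and have headroom. The sketch below assumes that formulation; the `Path` fields, the weights, and the `choose_path` function are hypothetical, and the actual optimizer behind NetO and Kuohai is not described in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    cost: float           # relative transit cost (assumed unit)
    latency_ms: float
    capacity_gbps: float
    load_gbps: float
    up: bool = True


def choose_path(paths, demand_gbps, weights=(1.0, 1.0, 1.0)):
    """Pick the usable path with the lowest weighted score.

    A path is usable if it is up and can absorb the extra demand.
    Cost and latency are normalized by the largest value among the
    usable paths so the three terms are comparable.
    """
    usable = [p for p in paths
              if p.up and p.load_gbps + demand_gbps <= p.capacity_gbps]
    if not usable:
        return None
    max_cost = max(p.cost for p in usable) or 1.0
    max_latency = max(p.latency_ms for p in usable) or 1.0
    w_cost, w_latency, w_util = weights

    def score(p):
        utilization = (p.load_gbps + demand_gbps) / p.capacity_gbps
        return (w_cost * p.cost / max_cost
                + w_latency * p.latency_ms / max_latency
                + w_util * utilization)

    return min(usable, key=score)
```

Marking a path as down (`up=False`) and calling `choose_path` again reroutes the flow to the next best candidate, which mirrors the reallocation on failure described above. Changing the weights shifts the trade‑off, for example `(0.0, 1.0, 0.0)` selects purely on latency.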

Overall, Alibaba’s Double 11 network automation demonstrates significant progress in stability, cost reduction, and intelligent scheduling, with continued investment in autonomous operation, cost optimization, and user experience improvements.

Tags: Alibaba, Double 11, operations, fault detection, traffic optimization, Network Automation