How Alibaba’s Global Operations Center Guarantees Seamless Service During Double‑11
This article outlines Alibaba's Global Operations Center (GOC) practices for ensuring stable, high‑performance online services during massive traffic spikes like Double‑11, covering current challenges, the operational assurance framework, best‑practice implementations, and future directions such as automation and AI‑driven monitoring.
Speaker Introduction
Wu Changlong works at Alibaba’s Global Operations Center (GOC), the core team ensuring the stability of Alibaba’s online ecosystem. He completed a master’s degree in 2014 with a focus on cloud computing, and previously worked at micro‑film, Melotic (Bitcoin), and Rakuten (Japan’s largest e‑commerce company). He joined Alibaba GOC in 2016 and has worked on operations assurance ever since.
Preface
Alibaba’s Global Operations Center (GOC) is the core team that guarantees the stability of the entire Alibaba ecosystem, similar to Google’s SRE. The talk is divided into four parts: current stability status and challenges, the operations assurance system, best‑practice implementation, and future development directions.
1. Stability Status and Challenges
During the recent Double‑11, peak order creation reached 325,000 per second and peak payments 256,000 per second, an 80% increase over 2016. Overall feedback indicated a smoother experience compared to previous years, with total transaction volume reaching 168.2 billion RMB.
Rapid business expansion across IDC, network, security, Alibaba Cloud, Alibaba Communication, DingTalk, and numerous business lines (Tmall, Taobao, Ant Financial, etc.) creates significant stability challenges.
New retail initiatives move quickly: Hema Fresh opened ten stores across five cities simultaneously, and Alibaba Cloud’s Malaysia data center was built in under a year, illustrating the speed of infrastructure deployment.
AI‑driven products like Tmall Genie sold over one million units during Double‑11, raising the question of how to measure the stability of AI algorithms.
Historical incidents (e.g., 2001 DB outage, 2017 B‑factory WAP outage, AWS regional failure) show that root causes of failures often remain unchanged despite increasing system complexity.
Two key innovations are highlighted: the ChangeFree system, which uses full‑text search and machine‑learning rules to trace online changes and reduces mean time to recovery by 65%; and a time‑series anomaly detection algorithm that raises monitoring accuracy from ~40% to 80%.
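The talk does not describe the anomaly detection algorithm itself, so the following is only a minimal sketch of one common approach: comparing each new point against a rolling median baseline and flagging large deviations. The function name, window size, and threshold are illustrative assumptions, not GOC's implementation.

```python
from statistics import median


def detect_anomalies(values, window=12, threshold=3.5):
    """Return indices of points that deviate strongly from a rolling baseline.

    Hypothetical sketch: `window` and `threshold` are assumed values,
    not parameters taken from the talk.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = median(history)
        # Median absolute deviation: a spread estimate that outliers barely move.
        mad = median(abs(v - baseline) for v in history)
        # 1.4826 rescales MAD to be comparable to a standard deviation
        # for normally distributed data; guard against a flat history.
        scale = 1.4826 * mad if mad > 0 else 1e-9
        score = abs(values[i] - baseline) / scale
        if score > threshold:
            anomalies.append(i)
    return anomalies


# Example: a steady metric with one sudden drop at index 12.
orders_per_second = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 100, 102, 40, 101]
print(detect_anomalies(orders_per_second))
```

A median-based baseline is used here because a single spike in the history window barely shifts it, which is one simple way to cut false alarms compared with fixed thresholds.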
2. Operations Assurance System
The system draws from ITIL and Business Continuity Management (BCM, ISO 22301). While ITIL provides process and service catalogs, GOC found it too heavyweight for rapid internet iteration and thus built a streamlined framework focusing on fault prevention, detection, response, recovery, localization, post‑mortem, and validation.
Key components include:
Fault Prevention: data operations, platform control (e.g., ChangeFree), and regular drills.
Fault Detection: business‑centric monitoring with four severity levels, full‑stack monitoring across IDC, network, application, system, and business layers, and intelligent monitoring that learns baselines to reduce false alarms.
Emergency Response: 24 × 7 on‑call teams across multiple regions (Silicon Valley, Beijing, Hangzhou) to ensure continuous coverage.
Rapid Recovery: isolation, one‑click failover, and quick rollback of recent changes (typically within 15 minutes).
Fault Localization: distinguishing initial causes (capacity or change) from root causes (upstream/downstream dependencies).
Post‑mortem: timely action completion and automated data collection for self‑service reviews.
Drill Validation: continuous rehearsal of failure scenarios, including traffic‑level gray releases and simulated outages.
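To make the idea of business‑centric monitoring with four severity levels concrete, here is a hedged sketch that maps the drop of a business metric below its learned baseline to a severity level. The level names (P1 to P4) and the threshold percentages are assumptions chosen for illustration; the talk does not state how GOC defines its levels.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical level names; P1 is the most severe.
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


# Assumed thresholds: minimum fractional drop below baseline for each level,
# ordered from most to least severe.
THRESHOLDS = [
    (0.50, Severity.P1),
    (0.30, Severity.P2),
    (0.15, Severity.P3),
    (0.05, Severity.P4),
]


def classify_drop(baseline, observed):
    """Return a Severity for a business-metric drop, or None if within tolerance."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    drop = (baseline - observed) / baseline
    for min_drop, severity in THRESHOLDS:
        if drop >= min_drop:
            return severity
    return None


# Example: order creation falls from a baseline of 1000/s to 400/s.
print(classify_drop(1000, 400))
```

Tying severity to a business metric such as order creation, rather than to host‑level signals, keeps the alert level proportional to customer impact.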
3. Seamless Operations Best Practices
The core of seamless operations is a tightly integrated platform suite: OPM (fault management), ODE (disaster‑recovery drills), OCM (change management), ODA (operation analysis), ODQ (data quality), etc. The practice emphasizes:
Standardized SOPs for rapid fault detection and change rollback.
Automated correlation of monitoring alerts with recent changes to enable immediate rollback.
Intelligent root‑cause analysis using baseline algorithms for both application and business‑metric paths.
Self‑service post‑mortems powered by automatic data collection.
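The alert‑to‑change correlation described above can be sketched as follows. This is an illustrative example only: the `Change` record, the `rollback_candidates` function, the dependency map, and the 30‑minute lookback are all hypothetical and do not reflect the actual interfaces of ChangeFree or OCM.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    # Hypothetical change record; field names are assumptions.
    change_id: str
    app: str
    deployed_at: datetime


def rollback_candidates(alert_app, alert_time, changes, dependencies,
                        lookback=timedelta(minutes=30)):
    """Return recent changes on the alerting app or its direct dependencies.

    Results are ordered most recent first, on the assumption that the
    latest change before the alert is the most likely trigger.
    """
    related = {alert_app} | set(dependencies.get(alert_app, ()))
    window_start = alert_time - lookback
    hits = [
        c for c in changes
        if c.app in related and window_start <= c.deployed_at <= alert_time
    ]
    return sorted(hits, key=lambda c: c.deployed_at, reverse=True)


# Example: an alert fires on "checkout" at 10:00.
alert_time = datetime(2017, 11, 11, 10, 0)
changes = [
    Change("C-1", "checkout", datetime(2017, 11, 11, 9, 50)),
    Change("C-2", "payment", datetime(2017, 11, 11, 9, 55)),
    Change("C-3", "search", datetime(2017, 11, 11, 9, 58)),   # unrelated app
    Change("C-4", "checkout", datetime(2017, 11, 11, 8, 0)),  # outside window
]
dependencies = {"checkout": ["payment"]}

for change in rollback_candidates("checkout", alert_time, changes, dependencies):
    print(change.change_id, change.app)
```

In this sketch the candidate list is only a ranking aid; whether to roll back automatically or ask an on‑call engineer to confirm is a policy decision the talk leaves open.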
4. Future Directions
GOC’s vision is fully autonomous, self‑healing online services. Four focus areas are:
Automation of repetitive operational tasks.
Further AI‑driven intelligence for monitoring, change control, and even automated incident isolation.
Internationalization to support global events across time zones.
Unattended operation through standardized SOPs and robotic process automation.