How Tencent Shifted 70M Users During Tianjin Explosion – A Multi‑Active Ops Playbook
This article details how Tencent's operations team orchestrated a seamless, zero‑impact migration of over 70 million users across three data centers during the 2015 Tianjin explosion, highlighting the four key capabilities—distribution, scheduling, data synchronization, and automated operations—that enabled multi‑active disaster recovery at massive scale.
Overview
Zhou Xiaojun, a senior operations expert at Tencent's Social Network Operations Center (DSNO), draws on extensive experience in internet site architecture, cloud platforms, and large‑scale infrastructure management; his team oversees nearly 20,000 servers running a mix of MySQL and proprietary distributed storage.
During the 2015 Tianjin Binhai New Area chemical warehouse explosion, Tencent's Tianjin data center—one of the largest in Asia with over 200,000 servers—was only two kilometers from the blast site.
Outline
813 Large‑Scale Scheduling Process
Architecture Behind Global Scheduling
Multi‑Site Data Synchronization
Operations System
1. 813 Large‑Scale Scheduling Process
Immediately after the explosion, the team evaluated two scenarios: either the situation was under control, or there were real risks such as power loss, water‑supply interruption, air‑conditioning failure, or toxic gas.
By the morning of the 13th, they decided to proactively migrate users from Tianjin to Shanghai and Shenzhen, moving roughly ten million users per batch without any noticeable impact on QQ services.
The migration lasted from the afternoon of the 13th to 1 am on the 14th, shifting over 70 million users with zero user‑perceived disruption.
Key experiences for successful migration:
User distribution: Understanding user geography enables effective disaster avoidance.
Schedulability: The ability to move users between sites on demand.
Data synchronization: Ensuring consistent data across multiple data centers.
Operational capability: Automation, monitoring, flexible services, and graceful degradation.
After the situation improved, users were gradually returned to Tianjin, completing the full round‑trip by the 20th.
2. Architecture Behind Global Scheduling
Distribution Capability
Traditional three‑tier architecture (access, business logic, data) is containerized into standardized SET (Service‑Engine‑Template) units, similar to Docker containers but at a larger scale.
Each SET can be deployed in minutes, packing dozens of service modules. Three SET types exist: Access SET, Logic SET, and Data SET.
Logic SETs contain modules such as bitmap services, temporary caches, async logs, user caches, facial recognition, and homepage rendering.
Data is replicated across three sites (Shenzhen as primary write, Shanghai and Tianjin as read replicas) using a one‑write‑multiple‑read model.
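The one‑write‑multiple‑read model described above can be sketched as a simple router: all writes funnel to the primary region, while reads are served from the caller's local replica. A minimal sketch, assuming illustrative names (`PRIMARY`, `route`) that are not Tencent's actual API:

```python
# Hypothetical one-write-multiple-read router.
# Region names follow the article; everything else is illustrative.

PRIMARY = "shenzhen"                              # single write region
REPLICAS = {"shenzhen", "shanghai", "tianjin"}    # regions serving reads

def route(op: str, caller_region: str) -> str:
    """Return the region that should serve this operation."""
    if op == "write":
        return PRIMARY                # all writes go to the primary
    if caller_region in REPLICAS:
        return caller_region          # reads stay local for low latency
    return PRIMARY                    # unknown region: fall back to primary
```

Under this model a read from Tianjin is served locally, while a write from any region lands in Shenzhen, which is what makes it possible to drain read traffic away from a threatened site without touching the write path.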
Cross‑Region Access
Qzone employs the same three‑site active‑active setup, with Shenzhen handling writes and Shanghai/Tianjin handling reads. Data sync uses dedicated high‑speed lines (≈30 ms latency) or encrypted public‑network paths as fallback.
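The dedicated‑line‑with‑fallback choice can be expressed as a small decision function. The latency budget below is an assumption for illustration, not a published Tencent threshold:

```python
# Illustrative fallback logic for cross-region sync transport.
# The 50 ms budget is an invented threshold; the article only states
# that the dedicated line normally runs at roughly 30 ms.

DEDICATED_LATENCY_BUDGET_MS = 50.0

def pick_sync_path(dedicated_up: bool, dedicated_latency_ms: float) -> str:
    """Prefer the dedicated line; fall back to an encrypted public path."""
    if dedicated_up and dedicated_latency_ms <= DEDICATED_LATENCY_BUDGET_MS:
        return "dedicated-line"
    return "encrypted-public"   # slower, but keeps replication flowing
```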
Scheduling Capability
External traffic uses GSLB (global server load balancing). Internally, a custom name‑service system (L5 and CMLB) combines DNS, LVS, and monitoring to route traffic based on health metrics, enabling automatic failover without manual IP changes.
Real‑time optimal IP calculation leverages a big‑data platform that probes millions of VIP endpoints, selecting the best‑performing node for each user.
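In the spirit of L5/CMLB, callers resolve a service name to the healthiest endpoint instead of a hard‑coded IP. The sketch below is a guess at the shape of such a lookup; the metric names and thresholds are invented:

```python
# Hypothetical health-weighted name-service resolution (L5/CMLB-style).
# Endpoints with a failing success rate are excluded; among the rest,
# the lowest-latency endpoint wins. All numbers are illustrative.

MIN_SUCCESS_RATE = 0.95

def resolve(endpoints):
    """Pick the best endpoint IP; raise if none is healthy."""
    healthy = [e for e in endpoints if e["success_rate"] >= MIN_SUCCESS_RATE]
    if not healthy:
        raise RuntimeError("no healthy endpoint for service")
    return min(healthy, key=lambda e: e["latency_ms"])["ip"]

endpoints = [
    {"ip": "10.0.1.1", "latency_ms": 12.0, "success_rate": 0.999},
    {"ip": "10.0.2.1", "latency_ms": 8.0,  "success_rate": 0.93},  # degraded
    {"ip": "10.0.3.1", "latency_ms": 15.0, "success_rate": 0.998},
]
```

Because clients ask the name service on every call, pulling a site out of rotation (as during the 813 migration) only requires updating health data centrally; no client‑side IP changes are needed.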
3. Multi‑Site Data Synchronization
Synchronization Mechanisms
Data sync challenges are addressed via application‑level sync and log‑based sync. Tencent primarily uses asynchronous replication for MySQL, with optional semi‑synchronous modes for stronger consistency.
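As an illustration of the two MySQL consistency modes, a replica set can be switched from plain asynchronous replication to semi‑synchronous by loading the semisync plugins and enabling them. The options below are standard MySQL settings (pre‑8.0.26 names), shown as a generic example rather than Tencent's actual configuration:

```ini
# my.cnf fragment (MySQL 5.5-5.7 option names; MySQL 8.0.26+ renames
# these to rpl_semi_sync_source_* / rpl_semi_sync_replica_*).
[mysqld]
plugin-load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"

# Primary: wait for at least one replica ACK before acknowledging a commit
rpl_semi_sync_master_enabled = 1
rpl_semi_sync_master_timeout = 1000   # ms; fall back to async on timeout

# Replica: acknowledge received transactions back to the primary
rpl_semi_sync_slave_enabled = 1
```

The timeout is the key trade‑off: semi‑sync tightens consistency across sites, but automatically degrades back to asynchronous replication if the remote site stops acknowledging, so replication lag cannot stall writes indefinitely.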
QQ DB Multi‑Region Model
QQ’s proprietary NoSQL storage supports one‑write‑multiple‑read across regions, with automatic master‑standby promotion and seamless client redirection via CMLB.
Application Sync Center
A message‑queue‑driven pipeline writes data to local sync agents, which then propagate updates to remote data centers, ensuring fault‑tolerant delivery and retry mechanisms.
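The described pipeline can be sketched as a local queue drained by a sync agent that retries transient failures and requeues undeliverable updates. Class and method names are illustrative, not Tencent's actual system:

```python
# Minimal sketch of a queue-driven sync agent with retry.
# `remotes` are callables modeling remote-DC endpoints: remote(update) -> bool.
from collections import deque

class SyncAgent:
    def __init__(self, remotes, max_retries=3):
        self.queue = deque()
        self.remotes = remotes
        self.max_retries = max_retries

    def publish(self, update):
        """Local write path: enqueue the update for remote propagation."""
        self.queue.append(update)

    def drain(self):
        """Push queued updates to every remote DC; return count delivered."""
        delivered = 0
        while self.queue:
            update = self.queue.popleft()
            for remote in self.remotes:
                for _ in range(self.max_retries):
                    if remote(update):
                        break                  # ACKed by this remote DC
                else:
                    # Retries exhausted: requeue so the update is not lost.
                    self.queue.append(update)
                    return delivered
            delivered += 1
        return delivered
```

Decoupling the local write from remote delivery this way is what gives the pipeline its fault tolerance: a slow or flapping cross‑region link delays propagation but never blocks or drops the local write.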
Bidirectional State Sync
User status, profiles, and social graphs are synchronized in real time across all three data centers, achieving near‑instant consistency with a throughput of about 1 Gb/s.
4. Operations System
The "ZhiYun" platform provides configuration‑driven automated operations, supporting flexible scaling, overload protection, and service tiering to prevent cascading failures.
Regular drills and capacity planning enable rapid response; for example, during the 813 event, the team leveraged pre‑planned capacity thresholds to decide when to expand resources or degrade services.
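Service tiering with pre‑planned thresholds can be sketched as a shed‑by‑tier rule: as load rises past each threshold, lower‑priority tiers are dropped first so the core service keeps running. The tiers and thresholds below are invented for illustration; the article only establishes that such pre‑planned cut‑offs existed:

```python
# Illustrative tiered degradation: shed low-priority features first.
# Tier 0 models core service (e.g. login/messaging) and is never shed.
# All tier assignments and thresholds are assumptions.

TIER_SHED_THRESHOLDS = {
    2: 0.70,   # shed tier-2 extras above 70% load
    1: 0.85,   # shed tier-1 features above 85% load
}

def allowed_tiers(load: float) -> set:
    """Return the set of service tiers still served at this load level."""
    shed = {tier for tier, limit in TIER_SHED_THRESHOLDS.items() if load > limit}
    return {0, 1, 2} - shed
```

Deciding these cut‑offs ahead of time, and rehearsing them in drills, is what let the team degrade gracefully during 813 instead of improvising under pressure.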
Review
The four core capabilities—distribution, scheduling, data synchronization, and automated operations—underpinned the successful large‑scale migration during the Tianjin incident.
Q&A
Key questions covered IP‑based user partitioning, real‑world drill authenticity, handling of data loss in primary‑standby setups, and network capacity management during massive traffic shifts.
Efficient Ops
This public account, maintained by Xiaotianguo and friends, regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.