
How Tencent Shifted 70M Users During Tianjin Explosion – A Multi‑Active Ops Playbook

This article details how Tencent's operations team carried out a zero‑impact migration of more than 70 million users across three data centers during the 2015 Tianjin explosion, highlighting the four key capabilities (distribution, scheduling, data synchronization, and automated operations) that enable multi‑active disaster recovery at massive scale.


Overview

Zhou Xiaojun, a senior operations expert at Tencent's Social Network Operations Center (DSNO), draws on extensive experience in internet site architecture, cloud platforms, and large‑scale infrastructure management; his team oversees nearly 20,000 servers running a mix of MySQL and proprietary distributed storage.

During the 2015 Tianjin Binhai New Area chemical warehouse explosion, Tencent's Tianjin data center—one of the largest in Asia with over 200,000 servers—was only two kilometers from the blast site.

Outline

813 large‑scale scheduling process

Architecture behind global scheduling

Multi‑site data synchronization

Operations system

1. 813 Large‑Scale Scheduling Process

Immediately after the explosion on the night of August 12, 2015, the team evaluated two scenarios: either the situation was under control, or there were ongoing risks such as power loss, water‑supply interruption, air‑conditioning failure, or toxic gas.

By the morning of the 13th, they decided to proactively migrate users from Tianjin to Shanghai and Shenzhen, moving roughly ten million users per batch without any noticeable impact on QQ services.

The migration lasted from the afternoon of the 13th to 1 am on the 14th, shifting over 70 million users with zero user‑perceived disruption.
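The batched approach described above can be sketched as a simple control loop. This is an illustrative sketch only; the function and health-check names are assumptions, not Tencent's actual tooling, and the real system would repoint access-layer routing entries rather than count users.

```python
def migrate_users(user_ids, src, dst, batch_size=10_000_000, healthy=lambda: True):
    """Move users from src to dst region in fixed-size batches,
    checking between batches that monitoring reports no user impact."""
    moved = 0
    for i in range(0, len(user_ids), batch_size):
        if not healthy():
            break  # stop immediately if monitoring reports user-visible impact
        batch = user_ids[i:i + batch_size]
        # In the real system this step would repoint the batch's
        # access-layer routing entries; here we only count the users.
        moved += len(batch)
    return moved
```

The key design point is pausing between batches: each roughly ten-million-user batch is verified healthy before the next one starts, which is what keeps the migration invisible to users.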

Key experiences for successful migration:

User distribution: Understanding user geography enables effective disaster avoidance.

Schedulability: Ability to move users between sites on demand.

Data synchronization: Ensuring consistent data across multiple data centers.

Operational capability: Automation, monitoring, flexible services, and graceful degradation.

After the situation improved, users were gradually returned to Tianjin, completing the full round‑trip by the 20th.

2. Architecture Behind Global Scheduling

Distribution Capability

Traditional three‑tier architecture (access, business logic, data) is containerized into standardized SET (Service‑Engine‑Template) units, similar to Docker containers but at a larger scale.

Each SET can be deployed in minutes, packing dozens of service modules. Three SET types exist: Access SET, Logic SET, and Data SET.

Logic SETs contain modules such as bitmap services, temporary caches, async logs, user caches, facial recognition, and homepage rendering.

Data is replicated across three sites (Shenzhen as primary write, Shanghai and Tianjin as read replicas) using a one‑write‑multiple‑read model.
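A minimal sketch of this one-write-multiple-read routing: writes always go to the primary site, reads prefer the replica nearest the user, and reads fall back to the primary when the nearest site is unavailable (for example, evacuated). The region-to-site mapping here is an assumption for illustration.

```python
REPLICAS = {"shenzhen", "shanghai", "tianjin"}
PRIMARY = "shenzhen"  # single write site, per the one-write-multiple-read model

NEAREST = {  # user region -> preferred read site (assumed mapping)
    "south": "shenzhen",
    "east": "shanghai",
    "north": "tianjin",
}

def route(op, user_region, available=REPLICAS):
    """Pick a data center for a request; writes go to the primary,
    reads go to the nearest available replica."""
    if op == "write":
        return PRIMARY
    site = NEAREST.get(user_region, PRIMARY)
    return site if site in available else PRIMARY
```

During the 813 event, removing "tianjin" from the available set is all it takes for northern users' reads to land elsewhere.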

Cross‑Region Access

Qzone employs the same three‑site multi‑active setup, with Shenzhen handling writes and Shanghai/Tianjin handling reads. Data sync uses dedicated high‑speed lines (≈30 ms latency) or encrypted public‑network paths as fallback.

Scheduling Capability

External traffic uses GSLB (global server load balancing). Internally, a custom name‑service system (L5 and CMLB) combines DNS, LVS, and monitoring to route traffic based on health metrics, enabling automatic failover without manual IP changes.
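The failover behavior of such a name service can be sketched as health-weighted resolution: callers resolve a service name rather than an IP, unhealthy instances are filtered out, and healthier instances receive more traffic. This is a hypothetical sketch in the spirit of L5/CMLB, not their actual implementation; the registry and stats shapes are assumptions.

```python
import random

def resolve(name, registry, stats):
    """Return a healthy instance for `name`, weighted by success rate,
    so failover needs no manual IP changes on the caller side.

    registry: {service_name: [instance, ...]}
    stats:    {instance: success_rate in [0, 1]} from monitoring
    """
    candidates = [ep for ep in registry.get(name, []) if stats.get(ep, 1.0) > 0.5]
    if not candidates:
        raise LookupError("no healthy instance for " + name)
    weights = [stats.get(ep, 1.0) for ep in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```

Because routing decisions live in the name service, taking a site's instances out of the registry (or letting their health scores collapse) reroutes all callers automatically.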

Real‑time optimal IP calculation leverages a big‑data platform that probes millions of VIP endpoints, selecting the best‑performing node for each user.
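The selection step reduces to ranking candidate VIPs by probe results. The sketch below assumes a simple data shape (latency samples per VIP) purely for illustration; the real platform aggregates far richer per-user-region probe data.

```python
import statistics

def best_vip(probes):
    """probes: {vip: [latency_ms, ...]}. Return the VIP with the lowest
    median latency, ignoring VIPs with no successful probes."""
    scored = {vip: statistics.median(samples)
              for vip, samples in probes.items() if samples}
    return min(scored, key=scored.get)
```

Using the median rather than the mean keeps one slow outlier probe from disqualifying an otherwise fast node.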

3. Multi‑Site Data Synchronization

Synchronization Mechanisms

Data sync challenges are addressed via application‑level sync and log‑based sync. Tencent primarily uses asynchronous replication for MySQL, with optional semi‑synchronous modes for stronger consistency.

QQ DB Multi‑Region Model

QQ’s proprietary NoSQL storage supports one‑write‑multiple‑read across regions, with automatic master‑standby promotion and seamless client redirection via CMLB.

Application Sync Center

A message‑queue‑driven pipeline writes data to local sync agents, which then propagate updates to remote data centers, ensuring fault‑tolerant delivery and retry mechanisms.
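The retry behavior of such a pipeline can be sketched as draining a queue of (update, destination) pairs and re-queueing failures. The `send` transport here is a stand-in assumption for the cross-region link; the real sync agents are Tencent-internal.

```python
import collections

def drain(queue, remotes, send, max_rounds=3):
    """Attempt to deliver every queued update to every remote data center;
    failed (update, dc) pairs are retried for up to max_rounds passes.
    Returns the pairs still undelivered after all retries."""
    pending = collections.deque((u, dc) for u in queue for dc in remotes)
    for _ in range(max_rounds):
        retry = collections.deque()
        while pending:
            update, dc = pending.popleft()
            if not send(update, dc):          # transport reports failure
                retry.append((update, dc))    # re-queue for the next pass
        pending = retry
        if not pending:
            break
    return list(pending)
```

Tracking delivery per destination, not per update, is what makes the pipeline tolerant of a single slow or flaky remote data center.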

Bidirectional State Sync

User status, profiles, and social graphs are synchronized in real time across all three data centers, achieving near‑instant consistency with a throughput of about 1 Gb/s.

4. Operations System

The "ZhiYun" platform provides configuration‑driven automated operations, supporting flexible scaling, overload protection, and service tiering to prevent cascading failures.
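Service tiering for overload protection can be sketched as a load-gated allowlist: core services are never shed, and lower tiers are dropped progressively as load climbs. The tiers, thresholds, and service names below are illustrative assumptions, not ZhiYun's actual configuration.

```python
TIERS = {
    "login": 0, "messaging": 0,              # tier 0: core, never degraded
    "feeds": 1, "homepage_render": 1,        # tier 1: shed under heavy load
    "face_recognition": 2, "async_log": 2,   # tier 2: shed first
}

def allowed(service, load):
    """Permit a request only if the service's tier is still enabled
    at the current load level (0.0 to 1.0); unknown services default
    to the lowest tier so they are shed first."""
    tier = TIERS.get(service, 2)
    if load >= 0.95:
        return tier == 0
    if load >= 0.80:
        return tier <= 1
    return True
```

Shedding by tier rather than rejecting uniformly is what prevents an overload in a peripheral feature from cascading into core services.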

Regular drills and capacity planning enable rapid response; for example, during the 813 event, the team leveraged pre‑planned capacity thresholds to decide when to expand resources or degrade services.

Review

The four core capabilities—distribution, scheduling, data synchronization, and automated operations—underpinned the successful large‑scale migration during the Tianjin incident.

Q&A

Key questions covered IP‑based user partitioning, real‑world drill authenticity, handling of data loss in primary‑standby setups, and network capacity management during massive traffic shifts.

Tags: distributed systems, operations, load balancing, disaster recovery, data synchronization, multi-active architecture
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
