Tencent's Large‑Scale Cloud‑Native Migration: Challenges and Solutions
In October 2022 Tencent finished migrating its flagship services, including QQ, WeChat, and Honor of Kings, to a cloud‑native architecture spanning more than 50 million CPU cores. The migration had to solve four problems: container upgrades with only millisecond‑level disruption, in‑place upgrades of stateful services, massive cross‑region scaling, and heterogeneous hardware. The TKEx platform addressed them with sidecar‑based upgrades, a three‑container pattern, the Global Scaler Operator, machine‑type abstraction, and Clusternet‑based application‑centric orchestration, raising CPU utilization to about 65 % and establishing China’s largest cloud‑native practice.
In October 2022 Tencent completed the full cloud‑native migration of its self‑developed business products (QQ, WeChat, Honor of Kings, Tencent Meeting, etc.), reaching a scale of more than 50 million CPU cores. The migration leveraged cloud‑native advantages to improve operational efficiency and validated Tencent Cloud products.
Project background
The migration was divided into Cloud‑Native 1.0 (pre‑2022) and 2.0 phases. Early stages used mixed‑deployment models (rich containers and micro‑containers) to containerize massive workloads.
Key technical challenges
Rapid container upgrades that limit service disruption to milliseconds.
In‑place (hot) upgrade for stateful services that must preserve IPC shared memory.
Global fast scaling for workloads spanning tens of thousands of instances across dozens of regions and hundreds of clusters.
Heterogeneous underlying hardware (old and new machine types) causing uneven resource utilization.
Transition from cluster‑centric to application‑centric scheduling and orchestration.
Solutions implemented by the TKEx Application Management Platform
Fast container upgrade: a sidecar container monitors version files and applies new versions with millisecond‑level disruption, comparable to a process restart.
In‑place upgrade: three‑container pattern (biz‑sidecar, biz‑container, biz‑pause) with shared volume and file‑lock mechanism ensures stateful data is not lost while the container image is refreshed.
Global fast scaling: Global Scaler Operator manages ScalerJob and ScalerTemplate CRDs to orchestrate cross‑cluster scaling, supporting step‑based and proportion‑based strategies for both HPA‑enabled and non‑HPA workloads.
Resource pooling and machine‑type abstraction: machine‑type families and a standardized compute model hide hardware differences, enabling balanced load and more predictable HPA behavior.
Application‑centric orchestration: Clusternet multi‑cluster management schedules workloads by application rather than by cluster, providing unified deployment, scaling, gray‑release, and observability.
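The step‑based and proportion‑based scaling strategies mentioned above can be illustrated with a minimal sketch. It is not the Global Scaler Operator's implementation and does not reflect the real ScalerJob or ScalerTemplate fields, which the article does not describe. The names ClusterWorkload, proportional_plan, and step_rounds are hypothetical, and the sketch assumes that "proportion‑based" means scaling each cluster to a percentage of its current size and "step‑based" means approaching a target in bounded batches.

```python
from dataclasses import dataclass


@dataclass
class ClusterWorkload:
    """Replica count of one workload in one cluster (hypothetical model)."""
    cluster: str
    replicas: int


def proportional_plan(workloads, percent):
    """Proportion-based: scale every cluster to `percent` of its current size.

    Results are rounded up so no cluster ends below the requested capacity.
    """
    if percent <= 0:
        raise ValueError("percent must be positive")
    # -(-a // b) is integer ceiling division, which avoids float rounding.
    return {w.cluster: -(-w.replicas * percent // 100) for w in workloads}


def step_rounds(current, target, step):
    """Step-based: move from `current` to `target` by at most `step` per round.

    Returns the replica count after each round; empty if already at target.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    rounds = []
    while current != target:
        delta = min(step, abs(target - current))
        current += delta if target > current else -delta
        rounds.append(current)
    return rounds


if __name__ == "__main__":
    fleet = [ClusterWorkload("cluster-a", 10), ClusterWorkload("cluster-b", 3)]
    targets = proportional_plan(fleet, 150)
    print(targets)                                    # {'cluster-a': 15, 'cluster-b': 5}
    print(step_rounds(10, targets["cluster-a"], 2))   # [12, 14, 15]
```

Splitting the plan (what each cluster should reach) from the rounds (how fast to get there) is one plausible way to let a single controller drive many clusters while bounding how much changes at once.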
These capabilities increased CPU utilization to about 65 %, improved stability of containerized services, and reduced operational overhead for large‑scale workloads.
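To show why a standardized compute model helps with uneven utilization, the sketch below converts physical cores into "standard cores" using a per‑machine‑type coefficient. The machine‑type names, the coefficients, and the function names are all assumptions made for illustration; the article does not publish Tencent's actual families or conversion factors.

```python
# Hypothetical performance coefficients relative to a baseline core.
# These values are illustrative only, not Tencent's real figures.
MACHINE_COEFFICIENT = {
    "legacy": 0.75,    # older hardware: one core does less work
    "standard": 1.0,   # baseline machine type
    "modern": 1.25,    # newer hardware: one core does more work
}


def standard_cores(machine_type, physical_cores):
    """Express a machine's physical cores in baseline-equivalent cores."""
    return physical_cores * MACHINE_COEFFICIENT[machine_type]


def physical_cores_for(machine_type, requested_standard_cores):
    """Physical cores needed on this machine type to satisfy a standard request."""
    return requested_standard_cores / MACHINE_COEFFICIENT[machine_type]


if __name__ == "__main__":
    # The same request for 5 standard cores maps to different physical amounts.
    for machine_type in MACHINE_COEFFICIENT:
        print(machine_type, physical_cores_for(machine_type, 5))
```

Under this assumption, a workload requests capacity in standard cores and the platform translates the request per machine type, so a given utilization percentage represents roughly the same amount of work on old and new hardware, which is what an autoscaler such as HPA needs to behave consistently.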
Conclusion
Tencent now operates the largest cloud‑native practice in China, covering audio‑video, gaming, e‑commerce, social, and office collaboration. The accumulated best‑practice knowledge (standardized health checks, QoS definitions, multi‑cluster gray‑release, self‑healing, etc.) is being packaged into the public‑cloud TKEx platform to serve both internal and external customers.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.