
Tencent's Large‑Scale Cloud‑Native Migration: Challenges and Solutions

In October 2022 Tencent finished migrating its flagship services, including QQ, WeChat, and Honor of Kings, to a cloud‑native architecture spanning more than 50 million CPU cores. The migration had to solve four problems: container upgrades with only millisecond‑level disruption, in‑place refresh of stateful services, large‑scale cross‑region scaling, and heterogeneous hardware. The TKEx platform addressed them with sidecar‑based upgrades, a three‑container pattern, the Global Scaler Operator, machine‑type abstraction, and Clusternet‑based application‑centric orchestration. Together these raised CPU utilization to about 65 % and established China’s largest cloud‑native practice.

Tencent Cloud Developer

In October 2022 Tencent completed the full cloud‑native migration of its self‑developed business products (QQ, WeChat, Honor of Kings, Tencent Meeting, etc.), reaching a scale of more than 50 million CPU cores. The migration leveraged cloud‑native advantages to improve operational efficiency and validated Tencent Cloud products.

Project background

The migration was divided into two phases: Cloud‑Native 1.0 (before 2022) and Cloud‑Native 2.0. The early stages used mixed‑deployment models (rich containers and micro‑containers) to containerize massive workloads.

Key technical challenges

Rapid container upgrade with millisecond‑level service disruption.

In‑place (hot) upgrade for stateful services that must preserve IPC shared memory.

Global fast scaling for workloads spanning tens of thousands of instances across dozens of regions and hundreds of clusters.

Heterogeneous underlying hardware (old and new machine types) causing uneven resource utilization.

Transition from cluster‑centric to application‑centric scheduling and orchestration.

Solutions implemented by the TKEx Application Management Platform

Fast container upgrade: a sidecar container monitors version files and applies upgrades with only millisecond‑level disruption, comparable to a process restart.

In‑place upgrade: a three‑container pattern (biz‑sidecar, biz‑container, biz‑pause) with a shared volume and a file‑lock mechanism ensures that stateful data is not lost while the container image is refreshed.

Global fast scaling: the Global Scaler Operator manages ScalerJob and ScalerTemplate CRDs to orchestrate cross‑cluster scaling, and supports step‑based and proportion‑based strategies for both HPA‑enabled and non‑HPA workloads.

Resource pooling and machine‑type abstraction: machine‑type families and a standardized compute model hide hardware differences, which balances load across old and new machines and makes HPA behavior more predictable.

Application‑centric orchestration: Clusternet multi‑cluster management schedules workloads by application rather than by cluster, providing unified deployment, scaling, gray‑release, and observability.
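The shared‑volume and file‑lock coordination in the in‑place upgrade pattern can be pictured with a minimal sketch. It assumes, purely for illustration, that the sidecar publishes a new version marker under an exclusive lock while the business process reads it under a shared lock. The file names and both helper functions are hypothetical and not taken from TKEx, the sketch does not cover preserving IPC shared memory, and `fcntl` is available on POSIX systems only.

```python
import fcntl
import os
import tempfile

# Hypothetical file names on the volume shared by the sidecar and business containers.
LOCK_NAME = "upgrade.lock"
VERSION_NAME = "VERSION"


def publish_version(shared_dir: str, version: str) -> None:
    """Sidecar side: swap in a new version marker while holding an exclusive lock."""
    version_path = os.path.join(shared_dir, VERSION_NAME)
    # Mode "a" creates the lock file if needed without truncating it.
    with open(os.path.join(shared_dir, LOCK_NAME), "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # waits until no reader holds the lock
        try:
            tmp_path = version_path + ".tmp"
            with open(tmp_path, "w") as tmp:
                tmp.write(version)
            os.replace(tmp_path, version_path)  # atomic rename within one volume
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def read_version(shared_dir: str) -> str:
    """Business side: read the marker under a shared lock; returns "" if none exists."""
    version_path = os.path.join(shared_dir, VERSION_NAME)
    with open(os.path.join(shared_dir, LOCK_NAME), "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_SH)  # many readers may hold this at once
        try:
            if not os.path.exists(version_path):
                return ""
            with open(version_path) as version_file:
                return version_file.read()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        publish_version(shared, "v1")
        publish_version(shared, "v2")
        print(read_version(shared))
```

Writing to a temporary file and renaming it means a reader never sees a half‑written marker, and the lock keeps a reader from starting while the swap is in progress.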
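The step‑based and proportion‑based strategies named for the Global Scaler Operator are not specified in detail here, so the following is only a sketch under stated assumptions: "proportion‑based" is taken to mean splitting a global replica target across clusters in proportion to their current replica counts, and "step‑based" is taken to mean moving toward a target in bounded increments. The function names and cluster names are hypothetical and do not reflect the ScalerJob or ScalerTemplate schemas.

```python
from typing import Dict, List


def proportional_targets(current: Dict[str, int], total_target: int) -> Dict[str, int]:
    """Split total_target across clusters in proportion to their current replicas.

    Uses integer largest-remainder rounding so the per-cluster targets always
    sum exactly to total_target.
    """
    total_current = sum(current.values())
    if total_current <= 0:
        raise ValueError("need at least one running replica to derive proportions")
    if total_target < 0:
        raise ValueError("total_target must be non-negative")
    targets: Dict[str, int] = {}
    remainders: Dict[str, int] = {}
    for name, replicas in current.items():
        targets[name], remainders[name] = divmod(total_target * replicas, total_current)
    leftover = total_target - sum(targets.values())
    # Hand leftover replicas to the largest remainders; ties break by cluster name.
    for name in sorted(current, key=lambda n: (-remainders[n], n))[:leftover]:
        targets[name] += 1
    return targets


def step_plan(current: int, target: int, step: int) -> List[int]:
    """Return the intermediate replica counts when moving at most `step` at a time."""
    if step <= 0:
        raise ValueError("step must be positive")
    plan: List[int] = []
    while current != target:
        delta = max(-step, min(step, target - current))
        current += delta
        plan.append(current)
    return plan


if __name__ == "__main__":
    clusters = {"cluster-a": 60, "cluster-b": 30, "cluster-c": 10}  # hypothetical
    print(proportional_targets(clusters, 150))
    print(step_plan(100, 250, 60))
```

In this reading, a cross‑cluster scale‑out would first compute per‑cluster targets and then walk each cluster toward its target in steps, which limits how much capacity changes at once.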

These capabilities increased CPU utilization to about 65 %, improved stability of containerized services, and reduced operational overhead for large‑scale workloads.
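One way to picture the standardized compute model behind the machine‑type abstraction is to convert each machine type into common units before distributing load. The sketch below assumes a per‑core performance factor relative to a baseline core; the factor values, machine names, and functions are hypothetical and are not Tencent's actual model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MachineType:
    name: str
    cores: int
    core_factor: float  # hypothetical: one core's performance relative to a baseline core


def standard_units(machine: MachineType) -> float:
    """Capacity of a machine expressed in baseline-core units."""
    return machine.cores * machine.core_factor


def split_load(machines: List[MachineType], total_load: float) -> Dict[str, float]:
    """Distribute load in proportion to standardized capacity, not raw core count."""
    total_units = sum(standard_units(m) for m in machines)
    if total_units <= 0:
        raise ValueError("machines must provide positive capacity")
    return {m.name: total_load * standard_units(m) / total_units for m in machines}


if __name__ == "__main__":
    fleet = [
        MachineType("old-gen", cores=32, core_factor=1.0),  # illustrative values
        MachineType("new-gen", cores=64, core_factor=1.5),
    ]
    shares = split_load(fleet, total_load=1280.0)
    for machine in fleet:
        per_unit = shares[machine.name] / standard_units(machine)
        print(machine.name, shares[machine.name], per_unit)
```

With load assigned per standardized unit, old and new machines run at the same relative utilization, so a utilization‑driven autoscaler sees a consistent signal regardless of which hardware a pod lands on.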

Conclusion

Tencent now operates the largest cloud‑native practice in China, covering audio‑video, gaming, e‑commerce, social, and office collaboration. The accumulated best‑practice knowledge (standardized health checks, QoS definitions, multi‑cluster gray‑release, self‑healing, etc.) is being packaged into the public‑cloud TKEx platform to serve both internal and external customers.
