How Tencent Scales 20,000+ Servers: Lessons from SNG Operations
This talk outlines the five major challenges faced by Tencent's SNG component operations—geographic distribution, HTTPS certificate management, massive device failures, long‑term maintenance, and large‑scale scaling—and describes the underlying architecture, operational principles, and practical techniques used to automate and reliably support millions of users during peak events.
This article is based on the 2018 GOPS Shenzhen talk.
Author: Zhang Liming
Zhang Liming is the head of the SNG component operations team with eight years of operations experience. He has participated in the growth of domestic social platforms QQ and Qzone and in projects such as SNG system standardization, large‑scale component deployment, and automated operations.
The SNG component operations team is responsible for the entire SNG access and logic‑layer business, managing 18,000 domains, 3,000 business modules, and 40,000 devices, with each operations engineer responsible for more than 20,000 devices. The team faces five major challenges.
Challenge 1: China spans five geographical time zones and over 30 provincial units, making near‑site access for tens of thousands of domains difficult. The data centers are in Shanghai, Tianjin, and Shenzhen, and cross‑operator traffic must be minimized.
Challenge 2: Since Apple introduced ATS, HTTPS has become mandatory. Managing certificate renewal for tens of thousands of domains requires an efficient, unified solution.
Challenge 3: When a single operations engineer manages more than ten thousand servers, device failures become commonplace; automatic, non‑intrusive recovery is essential.
Challenge 4: The operational lifecycle far exceeds the development cycle, creating long‑term maintenance challenges.
Challenge 5: Large‑scale scaling during holidays and events (e.g., QQ Space posts, QQ red‑packet activities) leads to massive device onboarding and module expansion.
The article addresses these challenges from three aspects.
1. Massive Service Infrastructure
The request flow starts with DNS lookup, obtaining TGW and STGW gateway IPs. TGW functions like an LVS/F5 load balancer, directing traffic to web, logic, and storage layers. Three key goals are achieved: reliable name service, fault‑tolerant devices, and a unified framework that improves development‑operations efficiency.
GSLB (Global Server Load Balancing) provides near‑site access by identifying client IP, province, and ISP, then returning the optimal IP. It maintains a comprehensive IP database with near‑100% accuracy and real‑time IDC coverage data.
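The lookup described above can be sketched as a two-stage table match: locate the client in the IP database, then pick a VIP from the IDC coverage data for that (province, ISP) pair. The table entries, names, and fallback policy below are illustrative assumptions, not Tencent's actual data or interfaces.

```python
import ipaddress

# Illustrative IP-library entries: client subnet -> (province, isp)
IP_LIBRARY = {
    "119.29.0.0/16": ("guangdong", "telecom"),
    "111.30.0.0/16": ("tianjin", "unicom"),
}

# Illustrative IDC coverage: (province, isp) -> ordered candidate VIPs
IDC_COVERAGE = {
    ("guangdong", "telecom"): ["10.0.1.1"],
    ("tianjin", "unicom"): ["10.0.2.1"],
}

DEFAULT_VIPS = ["10.0.0.1"]  # fallback when the client cannot be located

def locate(client_ip: str):
    """Return (province, isp) for a client IP, or None if unknown."""
    addr = ipaddress.ip_address(client_ip)
    for cidr, region in IP_LIBRARY.items():
        if addr in ipaddress.ip_network(cidr):
            return region
    return None

def resolve(client_ip: str) -> str:
    """Pick the nearest same-ISP VIP, falling back to a default."""
    region = locate(client_ip)
    return IDC_COVERAGE.get(region, DEFAULT_VIPS)[0]
```

In production such a lookup would be a longest-prefix match over millions of entries rather than a linear scan, but the accuracy of the IP library and the freshness of the coverage data, both emphasized in the talk, are what make the scheme work.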
TGW (Tencent Gateway) consolidates external IPs into a single VIP, enabling high availability via OSPF clustering and supporting multi‑operator access without requiring external IPs on backend servers.
STGW (Secure TGW) extends TGW with HTTPS support, centralizing certificate management. All certificates are stored on the STGW platform, allowing automated scanning, renewal, and deployment.
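Centralizing certificates makes the renewal problem a simple scan: with all expiry dates in one place, the platform can flag anything inside a renewal window and trigger automated redeployment. The sketch below assumes a renewal window and registry shape for illustration; it is not the STGW platform's real interface.

```python
from datetime import date, timedelta

RENEWAL_WINDOW = timedelta(days=30)  # assumed renewal lead time

def certs_needing_renewal(cert_expiry: dict, today: date) -> list:
    """cert_expiry maps domain -> expiry date; return domains due for renewal."""
    return sorted(domain for domain, expiry in cert_expiry.items()
                  if expiry - today <= RENEWAL_WINDOW)
```

At tens of thousands of domains, the scan itself is trivial; the value of the platform is that no certificate lives outside the registry, so nothing expires unnoticed.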
Weave Cloud Routing (internal name service) handles intra‑datacenter routing, offering fine‑grained traffic control, overload protection, and customizable fault‑tolerance beyond traditional F5/LVS solutions.
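The fault-tolerance idea can be illustrated with a route table that tracks per-endpoint call results and excludes endpoints whose failure rate crosses a threshold. The class, threshold, and degradation policy are assumptions for the sketch, not the internal name service's actual design.

```python
import random

class RouteTable:
    FAIL_THRESHOLD = 0.5  # assumed: exclude endpoints failing >50% of calls

    def __init__(self, endpoints):
        self.stats = {ep: {"ok": 0, "fail": 0} for ep in endpoints}

    def report(self, endpoint, success: bool):
        """Callers report each result so routing adapts to live health."""
        self.stats[endpoint]["ok" if success else "fail"] += 1

    def healthy(self):
        """Endpoints with no data or an acceptable failure rate."""
        out = []
        for ep, s in self.stats.items():
            total = s["ok"] + s["fail"]
            if total == 0 or s["fail"] / total <= self.FAIL_THRESHOLD:
                out.append(ep)
        return out

    def pick(self):
        # Degrade gracefully: if everything looks unhealthy, route anyway.
        candidates = self.healthy() or list(self.stats)
        return random.choice(candidates)
```

Because callers report results back through the same library, failure detection needs no separate probing infrastructure, which is part of what distinguishes this from a plain F5/LVS setup.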
SPP (Service Proxy Platform) unifies the backend framework with Proxy, Worker, and Controller components, enabling rapid upgrades, clear separation of network framework and business logic, and automatic process recovery.
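The automatic process recovery role of the Controller can be sketched as a supervision loop: each pass replaces any worker that has exited. The `spawn`/`is_alive` interface here is a stand-in for real process management, not SPP's actual API.

```python
class Controller:
    """Minimal SPP-style supervisor: respawn workers that have died."""

    def __init__(self, spawn, num_workers: int):
        self.spawn = spawn  # factory that starts one worker process
        self.workers = [spawn() for _ in range(num_workers)]

    def heartbeat(self) -> int:
        """One supervision pass: replace dead workers, return restart count."""
        restarted = 0
        for i, worker in enumerate(self.workers):
            if not worker.is_alive():
                self.workers[i] = self.spawn()
                restarted += 1
        return restarted
```

Because business logic lives entirely in the Workers, a crash is recovered by respawning a process, with the Proxy continuing to accept traffic throughout.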
2. Operational Principles Summarized from Practice
Name Service Principle – Ensure every address lookup uses the name service, enforce coverage via QA, provide RPC‑wrapped routing, and expand usage scenarios.
Consistency Principle – Maintain uniform module packages, configuration files, and permissions; enforce strong consistency between CMDB records and live deployments; adopt key‑value configuration stores to reduce restart impact.
No‑Data Principle – Design devices to operate without persistent state, simplifying failure detection, reboot handling, and log retention.
Unified Principle – Use a single backend framework, name service, configuration center, data reporting channel, and package release system to reduce maintenance scope and improve automation reliability.
3. Practical Techniques for Supporting Large‑Scale Events
During peak events (e.g., 2018 Spring Festival red‑packet activity), the team performed 641 expansions across 535 modules and 15,701 devices, completing all within one week.
The expansion workflow includes resource request, procurement, virtualization, deployment via Weave Cloud, gray‑release, and full rollout, with automated monitoring and traffic routing.
A batch deployment system aggregates atomic interfaces of various subsystems, providing a unified UI, automatic resource‑to‑module mapping, large‑scale parallel expansion, and real‑time status tracking.
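A driver for such a system can be sketched as one pipeline of atomic steps per module, run in parallel across modules with per-module status tracking. The step names and parallelism are illustrative assumptions, not the platform's real interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed atomic steps composed into one expansion pipeline per module.
STEPS = ["allocate", "deploy", "gray_release", "full_rollout"]

def expand_module(module: str, run_step):
    """Run all steps for one module; stop at the first failing step."""
    for step in STEPS:
        if not run_step(module, step):
            return (module, f"failed:{step}")
    return (module, "done")

def expand_all(modules, run_step, parallelism=50) -> dict:
    """Expand many modules in parallel; return module -> final status."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return dict(pool.map(lambda m: expand_module(m, run_step), modules))
```

Aggregating subsystem interfaces behind one driver is what makes figures like 641 expansions in a week feasible: a failure in one module's gray release halts that module only, while the rest of the batch proceeds.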
4. Q&A
Q: Can you identify small ISPs with multiple exits for routing?
A: Small ISPs are routed to a unified CAP exit.
Q: Do small ISPs use dedicated exits?
A: Yes, major carriers are kept separate; small ISPs use the CAP acceleration platform.
Q: Is routing based on machine density or business?
A: Both; automatic failover shifts traffic to healthy carriers when a region experiences issues.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.