How Tencent’s SNG Ops Team Overcame Five Massive Service Challenges
The SNG Operations team outlines five critical challenges it faces at scale: proximity access for tens of thousands of domains, HTTPS certificate management, self-healing after server failures, keeping live services uniform through automation, and rapid scaling during peak events. It also shares the practical strategies used to keep service delivery reliable and near real-time.
Preface
The SNG Operations Component Maintenance team is responsible for the operation and maintenance of Tencent's self‑developed business access and logic layers, covering services such as QQ, Qzone, Kankan, social value‑added services, Penguin Radio, Weiyun, and Tencent Classroom. The team manages roughly 18,000 domain names and 3,000 business modules, handling over 40,000 devices during the Spring Festival, with a single operator sometimes overseeing more than 20,000 devices.
Five Major Challenges
Challenge 1: Ensuring Proximity Access for Tens of Thousands of Domains and Handling ISP Outages
China spans five geographical time zones and 34 provincial-level divisions, while Tencent's IDCs are concentrated in Shenzhen, Shanghai, and Tianjin. Determining the nearest IDC for every domain (for example, whether users in Jiangxi should be routed to Shanghai or Shenzhen) would otherwise demand deep geographic knowledge from every operator. Moreover, with three major ISPs and many smaller ones, the team must avoid cross-ISP routing and schedule traffic along the "country + province + ISP" dimensions.
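The "country + province + ISP" scheduling idea can be sketched as a lookup table that maps a client's location and carrier to an IDC, with a default when no rule matches. This is a minimal illustration with hypothetical data, not the team's actual scheduling system:

```python
# Hypothetical scheduling table: (province, ISP) -> nearest IDC.
# Real systems resolve this per domain via DNS/GSLB; this sketch only
# shows the lookup logic that avoids cross-ISP routing.
ROUTING_TABLE = {
    ("Jiangxi", "Telecom"): "Shanghai",
    ("Jiangxi", "Unicom"): "Tianjin",
    ("Guangdong", "Telecom"): "Shenzhen",
}
DEFAULT_IDC = "Shenzhen"  # fallback when no province/ISP rule exists

def pick_idc(province: str, isp: str) -> str:
    """Return the IDC serving this client, keeping traffic on-ISP."""
    return ROUTING_TABLE.get((province, isp), DEFAULT_IDC)
```

Encoding the decision as data rather than operator knowledge is what lets one person manage routing for thousands of domains: updating a rule is a table change, not a judgment call.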
Challenge 2: Efficiently Managing HTTPS Certificates for Over 18,000 Domains
Since Apple introduced its App Transport Security (ATS) requirements, HTTPS has effectively become mandatory for Tencent domains. The team must automate certificate issuance, deployment, and renewal across all 18,000+ domains, monitor expiration dates, and maintain a reliable renewal mechanism so that no certificate lapse ever interrupts service.
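The monitoring half of that mechanism boils down to checking each domain's certificate expiry and alerting well before the deadline. A minimal sketch using only Python's standard `ssl` module (the function names and the 30-day threshold are assumptions, not the team's tooling):

```python
import socket
import ssl
from datetime import datetime, timezone

def parse_not_after(not_after: str) -> datetime:
    """Parse the 'notAfter' field from ssl.getpeercert(),
    e.g. 'Jun  1 12:00:00 2026 GMT' (always UTC)."""
    dt = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return dt.replace(tzinfo=timezone.utc)

def days_until_expiry(host: str, port: int = 443) -> int:
    """Connect to host, fetch its TLS certificate, and return
    the number of days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    remaining = parse_not_after(cert["notAfter"]) - datetime.now(timezone.utc)
    return remaining.days

def needs_renewal(host: str, threshold_days: int = 30) -> bool:
    return days_until_expiry(host) < threshold_days
```

Run across 18,000 domains on a schedule, a check like this turns certificate expiry from an outage risk into a routine work queue.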
Challenge 3: Guaranteeing Business Continuity and Self‑Healing When Servers Crash
When a single operator manages more than ten thousand servers, occasional hardware failures become the norm. The team needs to ensure that single‑machine failures do not require manual intervention, that services remain unaffected, and that automatic recovery restores traffic once the hardware is repaired.
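The eject-and-restore behavior described above can be sketched as a backend pool that removes a machine from rotation after consecutive failed health checks and returns it once checks pass again. The class and threshold values are illustrative assumptions, not the team's actual system:

```python
FAIL_THRESHOLD = 3  # consecutive failures before ejecting a host
OK_THRESHOLD = 2    # consecutive successes before restoring traffic

class BackendPool:
    """Track health-check results and keep traffic on live hosts only,
    so single-machine failures need no manual intervention."""

    def __init__(self, hosts):
        self.state = {h: {"alive": True, "fails": 0, "oks": 0} for h in hosts}

    def report(self, host, healthy):
        s = self.state[host]
        if healthy:
            s["fails"] = 0
            s["oks"] += 1
            if not s["alive"] and s["oks"] >= OK_THRESHOLD:
                s["alive"] = True  # hardware repaired: restore traffic
        else:
            s["oks"] = 0
            s["fails"] += 1
            if s["alive"] and s["fails"] >= FAIL_THRESHOLD:
                s["alive"] = False  # eject automatically

    def live_hosts(self):
        return [h for h, s in self.state.items() if s["alive"]]
```

Requiring several consecutive results in each direction prevents a single flaky check from flapping a host in and out of rotation.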
Challenge 4: Maintaining Uniformity Across Live Services Through Automation
Automation is a collaborative effort among development, operations, and testing. The team must define principles, adopt technical tools, and coordinate with developers and QA to keep live services consistent, maintainable, and resilient over years of operation, even for legacy services that are no longer strategically prioritized.
Challenge 5: Rapidly Scaling Hundreds of Modules and Tens of Thousands of Devices for Large‑Scale Events
Holiday spikes, especially during the Spring Festival and red‑packet activities, dramatically increase traffic. In 2018, the team delivered over 32,000 devices within two weeks, performed 641 expansions covering 535 modules and 15,701 devices, and needed a robust process to handle such rapid scaling.
Addressing the Challenges
At the April GOPS Global Operations Conference, the team will share practical experience in three areas to demonstrate how a single operator can manage ten‑thousand‑plus servers:
Infrastructure foundations for massive services
Key operational principles distilled from daily practice
Hands‑on techniques that support large‑scale event handling
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career, growing together.