How Tencent’s SNG Ops Team Overcame Five Massive Service Challenges
The SNG Operations team outlines five critical challenges it faces at scale: proximity access for tens of thousands of domains, HTTPS certificate management, self-healing after server failures, keeping live services uniform through automation, and rapid scaling during peak events. It also shares the practical strategies used to keep service delivery reliable and near real-time.
Preface
The SNG Operations Component Maintenance team is responsible for the operation and maintenance of Tencent's self‑developed business access and logic layers, covering services such as QQ, Qzone, Kankan, social value‑added services, Penguin Radio, Weiyun, and Tencent Classroom. The team manages roughly 18,000 domain names and 3,000 business modules, handling over 40,000 devices during the Spring Festival, with a single operator sometimes overseeing more than 20,000 devices.
Five Major Challenges
Challenge 1: Ensuring Proximity Access for Tens of Thousands of Domains and Handling ISP Outages
China spans five geographical time zones and 34 provincial-level divisions, while Tencent's IDCs are concentrated in Shenzhen, Shanghai, and Tianjin. Determining the nearest IDC for every domain (for example, whether users in Jiangxi should be routed to Shanghai or Shenzhen) would otherwise demand deep geographic knowledge from every operator. Moreover, with three major ISPs and many smaller ones, the team must avoid cross-ISP routing and schedule traffic along the "country + province + ISP" dimensions.
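The "country + province + ISP" scheduling idea can be sketched as a lookup table that maps a client's location and carrier to an IDC, with a default when no rule matches. This is a minimal illustration with hypothetical data, not the team's actual scheduling system:

```python
# Hypothetical scheduling table: (province, ISP) -> nearest IDC.
# Real systems resolve this per domain via DNS/GSLB; this sketch only
# shows the lookup logic that avoids cross-ISP routing.
ROUTING_TABLE = {
    ("Jiangxi", "Telecom"): "Shanghai",
    ("Jiangxi", "Unicom"): "Tianjin",
    ("Guangdong", "Telecom"): "Shenzhen",
}
DEFAULT_IDC = "Shenzhen"  # fallback when no province/ISP rule exists

def pick_idc(province: str, isp: str) -> str:
    """Return the IDC serving this client, keeping traffic on-ISP."""
    return ROUTING_TABLE.get((province, isp), DEFAULT_IDC)
```

Encoding the decision as data rather than operator knowledge is what lets one person manage routing for thousands of domains: updating a rule is a table change, not a judgment call.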
Challenge 2: Efficiently Managing HTTPS Certificates for Over 18,000 Domains
Since Apple introduced its App Transport Security (ATS) requirements, HTTPS has effectively become mandatory for Tencent domains. The team must automate certificate issuance, deployment, and renewal across all 18,000+ domains, monitor expiration dates, and maintain a reliable renewal mechanism so that no certificate lapse ever interrupts service.
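The monitoring half of that mechanism boils down to checking each domain's certificate expiry and alerting well before the deadline. A minimal sketch using only Python's standard `ssl` module (the function names and the 30-day threshold are assumptions, not the team's tooling):

```python
import socket
import ssl
from datetime import datetime, timezone

def parse_not_after(not_after: str) -> datetime:
    """Parse the 'notAfter' field from ssl.getpeercert(),
    e.g. 'Jun  1 12:00:00 2026 GMT' (always UTC)."""
    dt = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return dt.replace(tzinfo=timezone.utc)

def days_until_expiry(host: str, port: int = 443) -> int:
    """Connect to host, fetch its TLS certificate, and return
    the number of days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    remaining = parse_not_after(cert["notAfter"]) - datetime.now(timezone.utc)
    return remaining.days

def needs_renewal(host: str, threshold_days: int = 30) -> bool:
    return days_until_expiry(host) < threshold_days
```

Run across 18,000 domains on a schedule, a check like this turns certificate expiry from an outage risk into a routine work queue.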
Challenge 3: Guaranteeing Business Continuity and Self‑Healing When Servers Crash
When a single operator manages more than ten thousand servers, occasional hardware failures become the norm. The team needs to ensure that single‑machine failures do not require manual intervention, that services remain unaffected, and that automatic recovery restores traffic once the hardware is repaired.
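The eject-and-restore behavior described above can be sketched as a backend pool that removes a machine from rotation after consecutive failed health checks and returns it once checks pass again. The class and threshold values are illustrative assumptions, not the team's actual system:

```python
FAIL_THRESHOLD = 3  # consecutive failures before ejecting a host
OK_THRESHOLD = 2    # consecutive successes before restoring traffic

class BackendPool:
    """Track health-check results and keep traffic on live hosts only,
    so single-machine failures need no manual intervention."""

    def __init__(self, hosts):
        self.state = {h: {"alive": True, "fails": 0, "oks": 0} for h in hosts}

    def report(self, host, healthy):
        s = self.state[host]
        if healthy:
            s["fails"] = 0
            s["oks"] += 1
            if not s["alive"] and s["oks"] >= OK_THRESHOLD:
                s["alive"] = True  # hardware repaired: restore traffic
        else:
            s["oks"] = 0
            s["fails"] += 1
            if s["alive"] and s["fails"] >= FAIL_THRESHOLD:
                s["alive"] = False  # eject automatically

    def live_hosts(self):
        return [h for h, s in self.state.items() if s["alive"]]
```

Requiring several consecutive results in each direction prevents a single flaky check from flapping a host in and out of rotation.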
Challenge 4: Maintaining Uniformity Across Live Services Through Automation
Automation is a collaborative effort among development, operations, and testing. The team must define principles, adopt technical tools, and coordinate with developers and QA to keep live services consistent, maintainable, and resilient over years of operation, even for legacy services that are no longer strategically prioritized.
Challenge 5: Rapidly Scaling Hundreds of Modules and Tens of Thousands of Devices for Large‑Scale Events
Holiday spikes, especially during the Spring Festival and red‑packet activities, dramatically increase traffic. In 2018, the team delivered over 32,000 devices within two weeks, performed 641 expansions covering 535 modules and 15,701 devices, and needed a robust process to handle such rapid scaling.
Addressing the Challenges
At the April GOPS Global Operations Conference, the team will share practical experience in three areas to demonstrate how a single operator can manage ten‑thousand‑plus servers:
Infrastructure foundations for massive services
Key operational principles distilled from daily practice
Hands‑on techniques that support large‑scale event handling
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career, growing together.