What a Decade of Ops Taught Me: Key Strategies for Scalable Infrastructure
This article reflects on ten years of Tencent's operations experience, sharing the author's career journey, the evolution of large‑scale service management, the design of the L5 fault‑tolerant system, unified frameworks, resource packaging, CMDB virtual mirrors, and automated deployment practices that together enable reliable, efficient, and scalable infrastructure.
Ten‑Year Ops Review
Looking back on a decade of operations, the author asks what is most important and what the team should prioritize to support future growth.
Author Background
Zhao Jianchun is Assistant General Manager of Tencent Social Network Operations, head of the Technical Operations channel, and an expert engineer. He joined Tencent in 2004 and has worked in R&D, operations, and data, accumulating extensive experience in large‑scale operations.
Career Timeline
2004: Joined Tencent, developed e‑cards.
2005: Joined QQ Space development team, responsible for the message board module.
2006‑present: Shifted to operations after organizational changes.
Team Achievements
The operations team, now 89 people, maintains 100,000 servers and supports QQ extensions such as QQ Space, QQ Music, QQ Membership, and QQ Show. Major events include:
The Redmi launch on QQ Space sold 100,000 units in 90 seconds and drew 100 million likes.
During the Tianjin explosion incident, the team migrated over 200 million active users to Shenzhen and Shanghai.
Chinese New Year red‑packet traffic grew tenfold in 2016, requiring 5,000 additional servers and reaching 4.77 million requests per second.
Key Advice: Avoid Being a Firefighter
Operations should first ensure system reliability to prevent frequent errors, then focus on efficiency improvements to achieve higher goals.
Five "Killer Tricks" for the Ops Team
Resource Management: Clearly categorize and classify programs and code, applying appropriate deployment methods for each resource.
Fault‑Tolerance Plans: Ensure that failures inside a massive service do not affect the business, and that faulty servers are handled promptly.
Unified CMDB: Register all dependent resources of a business module in a centralized CMDB, enabling rapid decision‑making and monitoring.
DLP Monitoring: An internal critical monitor that pinpoints fault locations.
Entry Monitoring: Identifies root causes of failures; L5 handles fault tolerance, gray release, routing, etc.
L5 System Overview
The L5 system consists of L5, DNS, and L5 agents. A CGI requests a route by module ID and receives an IP+PORT chosen according to success rate and latency. After each call, the CGI reports the result to the L5 agent, which aggregates the statistics. Servers with a low success rate are throttled or removed, which provides both fault tolerance and load balancing. New servers start with weight 1 and can be introduced or removed gradually based on performance.
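The feedback loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tencent's implementation: the class names Backend and Agent, the thresholds, and the halve/double weight rule are all hypothetical, and latency is left out to keep the sketch short.

```python
import random
from dataclasses import dataclass


@dataclass
class Backend:
    """One IP+PORT behind a module ID (hypothetical structure)."""
    ip: str
    port: int
    weight: float = 1.0  # new servers start with weight 1, as in the text
    ok: int = 0
    fail: int = 0

    def success_rate(self):
        total = self.ok + self.fail
        return 1.0 if total == 0 else self.ok / total


class Agent:
    """Toy stand-in for an L5 agent: hands out routes and collects reports."""

    THROTTLE_BELOW = 0.9  # assumed: halve the weight under this success rate
    REMOVE_BELOW = 0.5    # assumed: drop the server under this success rate
    MAX_WEIGHT = 100.0    # assumed cap on weight

    def __init__(self, seed=None):
        self.modules = {}  # module ID -> list of Backend
        self.rng = random.Random(seed)

    def register(self, module_id, ip, port):
        backend = Backend(ip, port)
        self.modules.setdefault(module_id, []).append(backend)
        return backend

    def get_route(self, module_id):
        """Pick a live backend at random, in proportion to its weight."""
        live = [b for b in self.modules.get(module_id, []) if b.weight > 0]
        if not live:
            raise LookupError(f"no live backend for module {module_id}")
        return self.rng.choices(live, weights=[b.weight for b in live], k=1)[0]

    def report(self, backend, success):
        """The caller feeds back the outcome of one request."""
        if success:
            backend.ok += 1
        else:
            backend.fail += 1

    def rebalance(self, module_id):
        """Adjust weights from the aggregated statistics, then reset them."""
        for b in self.modules.get(module_id, []):
            rate = b.success_rate()
            if rate < self.REMOVE_BELOW:
                b.weight = 0.0                      # removed from rotation
            elif rate < self.THROTTLE_BELOW:
                b.weight = b.weight / 2             # throttled
            else:
                b.weight = min(self.MAX_WEIGHT, b.weight * 2)  # ramped up
            b.ok = b.fail = 0
```

A caller would ask get_route for a module ID, make the request against the returned IP+PORT, and pass the outcome to report; rebalance would run periodically. Re-admitting a removed server is deliberately out of scope here.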
Benefits of L5 for the Ops Team
Reduces daily faults by 80‑90%.
Eliminates frequent IP+PORT changes.
Lets services be brought online or taken offline by name.
Supports gray‑release deployments.
Helps locate the root cause of failures through module access relationships.
Monitors interface latency and failure rates.
Combines fault tolerance, load balancing, routing, and monitoring in one system.
Unified Framework Advantages
Separating network framework from business logic reduces learning costs, improves framework stability, enables cross‑business unified maintenance, and can increase operational efficiency up to tenfold.
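As a rough illustration of that separation, the Python sketch below keeps decoding, dispatch, and error handling in one shared class, while business code only supplies handler functions. The Framework class, the JSON request shape, and the greet handler are assumptions made for this example and do not describe Tencent's actual framework.

```python
import json


class Framework:
    """Shared part: decoding, dispatch and error handling (hypothetical)."""

    def __init__(self):
        self._handlers = {}

    def route(self, command):
        """Decorator that business code uses to plug in a handler."""
        def decorator(func):
            self._handlers[command] = func
            return func
        return decorator

    def handle(self, raw):
        """Turn one raw request into one raw reply; never raises."""
        try:
            request = json.loads(raw)
            handler = self._handlers[request["cmd"]]
            reply = {"ok": True, "body": handler(request.get("args", {}))}
        except Exception as exc:
            # Every business module gets the same failure format for free.
            reply = {"ok": False, "error": type(exc).__name__}
        return json.dumps(reply).encode()


app = Framework()


@app.route("greet")
def greet(args):
    """Business logic only: no sockets, no parsing, no error plumbing."""
    return "hello " + args["name"]
```

Because every module shares the same handle path, a fix or a monitoring hook added to the framework reaches all businesses at once, which is where the maintenance saving comes from.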
Resource Packaging Management
Resource packaging standardizes how developed programs are packaged, covering parameters, dependencies, and pre‑ and post‑deployment steps, and provides unified interfaces for install, uninstall, start, and stop operations.
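One way to picture such a package is as a small object that carries its own metadata and exposes the same four operations for every program. The sketch below is an assumption-laden illustration: Package, Host, the field names, and the log format are hypothetical, and the host is an in-memory stand-in rather than a real server.

```python
from dataclasses import dataclass, field


class Host:
    """In-memory stand-in for a server; it only records what happened."""

    def __init__(self):
        self.installed = {}   # package name -> version
        self.running = set()
        self.log = []


@dataclass
class Package:
    """Hypothetical package descriptor with a uniform lifecycle interface."""
    name: str
    version: str
    params: dict = field(default_factory=dict)
    depends_on: tuple = ()    # names of packages that must already be installed
    pre_install: tuple = ()   # named steps to run before files are placed
    post_install: tuple = ()  # named steps to run afterwards

    def install(self, host):
        missing = [d for d in self.depends_on if d not in host.installed]
        if missing:
            raise RuntimeError(f"{self.name}: missing dependencies {missing}")
        for step in self.pre_install:
            host.log.append(f"pre:{step}")
        host.installed[self.name] = self.version
        host.log.append(f"install:{self.name}-{self.version}")
        for step in self.post_install:
            host.log.append(f"post:{step}")

    def start(self, host):
        if self.name not in host.installed:
            raise RuntimeError(f"{self.name} is not installed")
        host.running.add(self.name)
        host.log.append(f"start:{self.name}")

    def stop(self, host):
        host.running.discard(self.name)
        host.log.append(f"stop:{self.name}")

    def uninstall(self, host):
        self.stop(host)
        host.installed.pop(self.name, None)
        host.log.append(f"uninstall:{self.name}")
```

Since every package answers to the same install, uninstall, start, and stop calls, deployment tooling does not need to know anything program-specific.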
CMDB Virtual Mirror
Resources are registered in a secondary CMDB to build a complete virtual image of a module, so that all of its dependencies are recorded without separate documentation.
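A minimal sketch of the idea, assuming a module's mirror is simply the set of registered resources for the module and for everything it depends on, could look like this. The Cmdb class, the resource kinds, and the module names are invented for illustration.

```python
class Cmdb:
    """Hypothetical registry: which resources and dependencies a module has."""

    def __init__(self):
        self._resources = {}  # module -> list of (kind, name)
        self._depends = {}    # module -> set of modules it relies on

    def register(self, module, kind, name):
        self._resources.setdefault(module, []).append((kind, name))

    def depends_on(self, module, other):
        self._depends.setdefault(module, set()).add(other)

    def mirror(self, module):
        """Return everything needed to rebuild `module`, dependencies first."""
        order, seen = [], set()

        def visit(current):
            if current in seen:  # also guards against dependency cycles
                return
            seen.add(current)
            for dep in sorted(self._depends.get(current, ())):
                visit(dep)
            order.append(current)

        visit(module)
        return {m: list(self._resources.get(m, [])) for m in order}


cmdb = Cmdb()
cmdb.register("msgboard", "package", "msgboard-2.3")
cmdb.register("msgboard", "config", "msgboard.conf")
cmdb.register("storage", "package", "storage-1.1")
cmdb.depends_on("msgboard", "storage")
image = cmdb.mirror("msgboard")
```

Here image lists the storage module before msgboard, so a deployment tool could walk it in order without consulting any written documentation.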
Decision Scheduling – ZhiYun Automated Deployment Platform
The internal platform automates resource acquisition, deployment, testing, and release. It includes steps such as silencing alerts during device acquisition, synchronizing files during release, verifying process startup, and conducting business tests.
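The steps named above can be sketched as an ordered pipeline that stops at the first failure. This is a simplified illustration, not ZhiYun itself: the step functions, the context dictionary, and the final go_live step are assumptions, and each step only updates in-memory state.

```python
def acquire_device(ctx):
    """Take a spare server from the pool; fail if none is left."""
    if not ctx["pool"]:
        return False
    ctx["host"] = ctx["pool"].pop()
    return True


def silence_alerts(ctx):
    """Mute alerts so the half-deployed host does not page anyone."""
    ctx["alerts_silenced"] = True
    return True


def sync_files(ctx):
    """Stand-in for pushing packages and configuration to the host."""
    ctx["deployed"] = list(ctx["packages"])
    return True


def verify_process(ctx):
    """Stand-in for checking that the processes actually started."""
    return ctx.get("process_starts", True)


def business_test(ctx):
    """Stand-in for an end-to-end business check."""
    return ctx.get("test_passes", True)


def go_live(ctx):
    """Re-enable alerts and put the host into service (assumed last step)."""
    ctx["alerts_silenced"] = False
    ctx["in_service"] = True
    return True


PIPELINE = [acquire_device, silence_alerts, sync_files,
            verify_process, business_test, go_live]


def run_pipeline(ctx, steps=PIPELINE):
    """Run steps in order; stop at the first failure and say where."""
    done = []
    for step in steps:
        if not step(ctx):
            return {"ok": False, "failed_at": step.__name__, "done": done}
        done.append(step.__name__)
    return {"ok": True, "failed_at": None, "done": done}
```

Running the steps in a fixed order means a host that fails process verification or the business test never reaches go_live, so it stays out of service with its alerts still silenced.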
Ops Norm Evolution
The progression of operational standards is depicted, highlighting the challenges of systematic adoption.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.