
What a Decade of Ops Taught Me: Key Strategies for Scalable Infrastructure

This article reflects on ten years of Tencent's operations experience, sharing the author's career journey, the evolution of large‑scale service management, the design of the L5 fault‑tolerant system, unified frameworks, resource packaging, CMDB virtual mirrors, and automated deployment practices that together enable reliable, efficient, and scalable infrastructure.

Efficient Ops

Feb 23, 2018


Ten‑Year Ops Review

Looking back on a decade of operations, the author asks what is most important and what the team should prioritize to support future growth.

Author Background

Zhao Jianchun is Assistant General Manager of Tencent Social Network Operations, head of the Technical Operations channel, and an expert engineer. He joined Tencent in 2004 and has since worked in R&D, operations, and data, accumulating extensive experience in large-scale operations.

Career Timeline

2004: Joined Tencent, developed e‑cards.

2005: Joined QQ Space development team, responsible for the message board module.

2006‑present: Shifted to operations after organizational changes.

Team Achievements

The operations team, now 89 people, maintains 100,000 servers and supports QQ extensions such as QQ Space, QQ Music, QQ Membership, and QQ Show. Major events include:

The Redmi launch on QQ Space sold 100,000 units in 90 seconds and gained 100 million likes.

During the Tianjin explosion incident, the team migrated over 200 million active users to Shenzhen and Shanghai.

Chinese New Year red‑packet traffic grew tenfold in 2016, requiring 5,000 additional servers and reaching 4.77 million requests per second.

Key Advice: Avoid Being a Firefighter

Operations should first ensure system reliability to prevent frequent errors, then focus on efficiency improvements to achieve higher goals.

Five "Killer Tricks" for the Ops Team

Resource Management: Classify programs and code clearly, and apply the appropriate deployment method to each type of resource.

Fault-Tolerance Plans: Ensure that failures inside massive services do not affect the business, and that failed servers are handled promptly.

Unified CMDB: Register all dependent resources of a business module in a centralized CMDB, enabling rapid decision-making and monitoring.

DLP Monitoring: An internal critical monitor that pinpoints fault locations.

Entry Monitoring: Identifies root causes of failures; L5 handles fault tolerance, gray release, routing, etc.

L5 System Overview

The L5 system consists of L5, DNS, and L5 agents. A CGI requests a route by module ID and receives an IP+PORT selected according to success rate and latency. After each call, the CGI reports the result to the L5 agent, which aggregates the statistics. Servers with low success rates are throttled or removed, which provides both fault tolerance and load balancing. New servers start with weight 1 and can be gradually introduced or removed based on performance.
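The feedback loop described above can be sketched in a few lines. This is a minimal illustration, not Tencent's implementation: the class names, the success-rate threshold, the halving and doubling rule, and the weight cap are all assumptions. Latency is left out for brevity, and removed servers are never re-probed.

```python
import random
from dataclasses import dataclass


@dataclass
class Server:
    """One IP+PORT target behind a module ID (hypothetical structure)."""
    ip: str
    port: int
    weight: float = 1.0  # new servers start at weight 1
    ok: int = 0          # successes reported in the current window
    fail: int = 0        # failures reported in the current window


class Agent:
    """Toy stand-in for an L5 agent: routes by weight, adjusts on feedback."""

    def __init__(self, min_success=0.9, max_weight=100.0, rng=None):
        self.modules = {}               # module_id -> list of Server
        self.min_success = min_success  # assumed threshold, not from the article
        self.max_weight = max_weight    # assumed cap, not from the article
        self.rng = rng or random.Random()

    def register(self, module_id, ip, port):
        server = Server(ip, port)
        self.modules.setdefault(module_id, []).append(server)
        return server

    def get_route(self, module_id):
        """Pick one server with probability proportional to its weight."""
        candidates = [s for s in self.modules.get(module_id, []) if s.weight > 0]
        if not candidates:
            raise LookupError(f"no healthy server for module {module_id}")
        weights = [s.weight for s in candidates]
        return self.rng.choices(candidates, weights=weights, k=1)[0]

    def report(self, server, success):
        """Called by the client after each request to feed back the result."""
        if success:
            server.ok += 1
        else:
            server.fail += 1

    def rebalance(self):
        """Close the statistics window and adjust every server's weight."""
        for servers in self.modules.values():
            for s in servers:
                total = s.ok + s.fail
                if total == 0:
                    continue  # no traffic in this window, leave weight alone
                rate = s.ok / total
                if rate < self.min_success / 2:
                    s.weight = 0.0                # remove from rotation
                elif rate < self.min_success:
                    s.weight = s.weight / 2       # throttle
                else:
                    s.weight = min(s.weight * 2, self.max_weight)  # ramp up
                s.ok = s.fail = 0
```

Because callers only ever ask for a module ID, servers can be added, drained, or replaced without any client configuration change. Starting a new server at a low weight also gives a natural gray-release ramp.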

Benefits of L5 for the Ops Team

Reduces daily faults by 80‑90%.

Eliminates frequent IP+PORT changes.

Lets services be brought online or taken offline by name.

Supports gray‑release deployments.

Helps locate root‑cause failures through module access relationships.

Monitors interface latency and failure rates.

Combines fault tolerance, load balancing, routing, and monitoring in one system.
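As a rough illustration of how module access relationships help locate a root cause: if the system knows which module calls which, a failing module whose downstream dependencies are all healthy is the most likely origin. The function and module names below are hypothetical, and the heuristic is an assumption rather than the rule L5 actually applies.

```python
def root_cause_candidates(calls, failing):
    """Return failing modules whose downstream dependencies are all healthy.

    calls:   dict mapping a module name to the modules it calls
    failing: set of module names currently showing a high failure rate
    """
    candidates = []
    for module in sorted(failing):
        downstream = calls.get(module, [])
        # A failing module with a failing dependency is probably a victim,
        # not the origin, so only keep modules with healthy dependencies.
        if not any(dep in failing for dep in downstream):
            candidates.append(module)
    return candidates


calls = {
    "cgi": ["feeds", "auth"],
    "feeds": ["storage"],
    "auth": [],
    "storage": [],
}
failing = {"cgi", "feeds", "storage"}
print(root_cause_candidates(calls, failing))  # ['storage']
```

Here `cgi` and `feeds` are failing only because something they call is failing, so the search narrows to `storage`.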

Unified Framework Advantages

Separating network framework from business logic reduces learning costs, improves framework stability, enables cross‑business unified maintenance, and can increase operational efficiency up to tenfold.

Resource Packaging Management

Resource packaging standardizes how developed programs are packaged, covering parameters, dependencies, and pre- and post-deployment steps, and it provides unified interfaces for install, uninstall, start, and stop operations.
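A unified package interface of this kind might look like the sketch below. The `Package` class, its fields, and the state names are illustrative assumptions, not the format used inside Tencent. Hooks are plain callables so the example runs without touching the file system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Package:
    """Hypothetical package manifest with one interface for every program."""
    name: str
    version: str
    params: Dict[str, str] = field(default_factory=dict)
    dependencies: List[str] = field(default_factory=list)
    pre_install: List[Callable[[], None]] = field(default_factory=list)
    post_install: List[Callable[[], None]] = field(default_factory=list)
    state: str = "absent"  # absent -> installed -> running

    def install(self, installed: Set[str]) -> None:
        # Refuse to install until every declared dependency is present.
        missing = [d for d in self.dependencies if d not in installed]
        if missing:
            raise RuntimeError(f"{self.name}: missing dependencies {missing}")
        for hook in self.pre_install:
            hook()
        self.state = "installed"
        for hook in self.post_install:
            hook()
        installed.add(self.name)

    def start(self) -> None:
        if self.state != "installed":
            raise RuntimeError(f"{self.name}: cannot start from state {self.state}")
        self.state = "running"

    def stop(self) -> None:
        if self.state != "running":
            raise RuntimeError(f"{self.name}: not running")
        self.state = "installed"

    def uninstall(self, installed: Set[str]) -> None:
        if self.state == "running":
            self.stop()  # always stop before removing
        self.state = "absent"
        installed.discard(self.name)
```

With every program exposing the same four verbs, deployment tooling can treat a web server and a batch job identically, and the differences live in the manifest rather than in operator knowledge.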

CMDB Virtual Mirror

Registering a module's resources in a secondary CMDB creates a complete virtual image of that module. The image records all of the module's dependencies, so no separate documentation is needed.
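One way to picture the virtual mirror is as a per-module registry keyed by resource type. The sketch below is an assumption about the shape of the data, not the actual CMDB schema. The resource kinds and the example values are made up for illustration.

```python
from collections import defaultdict


class ModuleRegistry:
    """Hypothetical secondary CMDB: module -> resource kind -> entries."""

    def __init__(self):
        self._resources = defaultdict(lambda: defaultdict(list))

    def register(self, module, kind, value):
        """Record one resource (package, config file, port, ...) for a module."""
        self._resources[module][kind].append(value)

    def mirror(self, module):
        """Return a plain-dict snapshot of everything the module depends on."""
        kinds = self._resources.get(module, {})
        return {kind: list(values) for kind, values in kinds.items()}


registry = ModuleRegistry()
registry.register("message_board", "package", "msgboard_svr-1.2")
registry.register("message_board", "config", "/etc/msgboard/app.conf")
registry.register("message_board", "port", 8080)
print(registry.mirror("message_board"))
```

If every resource is registered this way, rebuilding a module on a fresh server becomes a matter of replaying its mirror.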

Decision Scheduling – ZhiYun Automated Deployment Platform

The internal platform automates resource acquisition, deployment, testing, and release. It includes steps such as silencing alerts during device acquisition, synchronizing files during release, verifying process startup, and conducting business tests.
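The steps listed above amount to an ordered pipeline that halts at the first failed check. A minimal sketch follows, assuming each step can be modeled as a function returning True on success. The step names mirror the paragraph, while the lambdas are placeholders for real actions and do not reflect ZhiYun's actual API.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order; stop at the first failure.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, step in steps:
        if not step():
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder steps; a real platform would call out to its own services here.
steps = [
    ("acquire_device", lambda: True),
    ("silence_alerts", lambda: True),
    ("sync_files", lambda: True),
    ("verify_process_started", lambda: True),
    ("business_test", lambda: True),
    ("release", lambda: True),
]

completed, failed = run_pipeline(steps)
print(completed, failed)
```

Placing the process check and the business test before release means a bad build stops before it ever takes traffic.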

Ops Norm Evolution

The progression of operational standards is depicted, highlighting the challenges of systematic adoption.


