
How Tencent’s On‑Call System Transforms Incident Management and Quality Ops

This article explores how Tencent builds and practices its SRE quality operation system, focusing on On‑Call incident management, standardized channels, alert handling, data quality models, and the resulting improvements in reliability, MTTR reduction, and data‑driven decision making.


This article examines the overall construction and practice of Tencent's SRE quality operation system, sharing experiences, reflections, and future outlook.

On‑Call Incident Management

Problems Solved by Event Integration

After SLO management is established, the focus shifts to On‑Call incident management. Alert overload is addressed not by reducing the absolute number of alerts (since alerts are both cause and effect) but by routing every alert that requires human intervention into a standardized On‑Call workflow.

Standardized Channel Definition

With SLOs defining the primary fault‑discovery channel, every event channel must be standardized so that automatically discovered events can be distinguished from user‑feedback events. Together these channels cover the full set of online incidents, providing the data needed to continuously improve automatic fault discovery.

Alert Integration Capabilities

On‑Call can enrich alert handling with matching, filtering, convergence, escalation, and recovery. For example, a microservice may generate alerts across dimensions such as success rate, latency, CPU, memory, and I/O. By converging these alerts under a service‑ID deduplication key, they are aggregated into a single On‑Call ticket, solving the alert‑flood problem.
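The convergence step can be sketched in a few lines. This is a minimal illustration, not Tencent's actual implementation: the `Alert`/`Ticket` types, the `order-service` name, and the metric list are all assumptions chosen to mirror the example above.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    service_id: str   # deduplication key: all alerts for one service converge
    metric: str       # e.g. "success_rate", "latency", "cpu", "memory", "io"
    message: str

@dataclass
class Ticket:
    service_id: str
    alerts: list = field(default_factory=list)

class Converger:
    """Aggregate raw alerts into one On-Call ticket per deduplication key."""
    def __init__(self):
        self.tickets = {}

    def ingest(self, alert: Alert) -> Ticket:
        ticket = self.tickets.get(alert.service_id)
        if ticket is None:
            ticket = self.tickets[alert.service_id] = Ticket(alert.service_id)
        ticket.alerts.append(alert)
        return ticket

# Five alerts across different dimensions of one microservice...
conv = Converger()
for metric in ["success_rate", "latency", "cpu", "memory", "io"]:
    conv.ingest(Alert("order-service", metric, f"{metric} threshold breached"))

# ...collapse into a single On-Call ticket.
assert len(conv.tickets) == 1
assert len(conv.tickets["order-service"].alerts) == 5
```

The key design point is that the deduplication key (here `service_id`) decides the granularity of human attention: one ticket per affected service, regardless of how many metric dimensions fired.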

Alert Access and Response

Configuring event access carries a high learning cost, but continuous operation and feature refinement allow On‑Call to approach both goals at once: missing no real incidents and raising no false alarms. Over time, accumulated fault data and dependency graphs, combined with service‑level call chains, can be used to locate root causes algorithmically.
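One simple form of such graph‑based localization is to follow alerting edges downstream and blame the deepest alerting dependency. This is a hypothetical sketch under that assumption; the service names and topology are invented, and real localization would combine more signals (timing, call‑chain data, change events):

```python
# Assumed topology: service -> services it calls (illustrative only).
deps = {
    "gateway": ["order-service"],
    "order-service": ["inventory", "payments"],
    "payments": ["db"],
    "inventory": [],
    "db": [],
}
# Services currently firing alerts.
alerting = {"gateway", "order-service", "payments", "db"}

def root_cause(service: str, graph: dict, alerting: set) -> str:
    """Follow alerting dependencies downstream; return the deepest alerting node."""
    for child in graph.get(service, []):
        if child in alerting:
            return root_cause(child, graph, alerting)
    return service

# Alerts cascade up from the database, so localization points at "db".
assert root_cause("gateway", deps, alerting) == "db"
```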

On‑Call Positioning in Development Process

Integrating user feedback into On‑Call raises product‑positioning conflicts with user‑operation platforms, requiring clear definition of the relationship between development and operation processes. When an event may need code fixes or become a user request, it is forwarded to the development workflow tool (e.g., Tencent TAPD), achieving end‑to‑end closure of user‑feedback paths.

Operational Cases

Quantified quality data enables continuous observation and improvement of automatic fault discovery rates, providing reliable metrics for reporting to management and supporting both operational efficiency and user‑need fulfillment.
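Once every incident carries a standardized channel tag, the automatic fault discovery rate reduces to a simple count. The channel names and sample data below are illustrative assumptions:

```python
# Hypothetical incident records, each tagged with its discovery channel.
incidents = [
    {"id": 1, "channel": "slo_alert"},
    {"id": 2, "channel": "slo_alert"},
    {"id": 3, "channel": "user_feedback"},
    {"id": 4, "channel": "slo_alert"},
]

# Automatic discovery rate: incidents found by monitoring vs. all incidents.
auto = sum(1 for i in incidents if i["channel"] == "slo_alert")
rate = auto / len(incidents)
assert rate == 0.75  # 3 of 4 incidents were discovered automatically
```

Tracking this ratio over time is what lets the team claim, with data, that automatic discovery is improving.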

On‑Call Response Management

Standardized Execution

On‑Call covers the full MTTR lifecycle through four modules: business management, duty scheduling, escalation, and ticketing. Incident discovery, response, handling, and post‑mortem are all completed within the platform, which in turn yields reliable MTTR statistics.

Business Management

Each Service represents a user‑facing business scenario; development teams must define responsibilities and bind all On‑Call functions to the Service.

Duty Management

Duty scheduling shields developers from constant interruptions while still guaranteeing timely fault handling. Although the practice is widely adopted in North America, many Chinese companies claim to run On‑Call without the standardized tooling to support it.
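The core of standardized scheduling is deterministic rotation: given a roster and a date, the platform can always answer "who is on call?". A minimal weekly‑rotation sketch with invented names:

```python
from datetime import date

# Hypothetical roster and rotation start; both are illustrative assumptions.
roster = ["alice", "bob", "carol"]
epoch = date(2023, 1, 2)  # a Monday; the rotation starts here

def on_call(day: date) -> str:
    """Return who holds the pager on the given day, rotating weekly."""
    weeks = (day - epoch).days // 7
    return roster[weeks % len(roster)]

assert on_call(date(2023, 1, 3)) == "alice"   # week 0
assert on_call(date(2023, 1, 10)) == "bob"    # week 1
assert on_call(date(2023, 1, 24)) == "alice"  # week 3 wraps around
```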

Escalation Strategy

Configurable escalation policies ensure notifications and handling even during off‑hours, introducing multi‑level roles (first‑line, second‑line, leader) for flexible flow.
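A multi‑level policy like this can be modeled as an ordered list of (role, timeout) pairs: if a ticket sits unacknowledged past each level's timeout, the next role is paged. The roles match the text above; the timeout values are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Level:
    role: str             # "first-line", "second-line", "leader"
    timeout_minutes: int  # escalate after this long without an acknowledgment

# Hypothetical policy: 5 min to first-line ack, then +10 min, then +15 min.
policy = [Level("first-line", 5), Level("second-line", 10), Level("leader", 15)]

def notify_chain(minutes_unacked: int) -> list:
    """Return the roles that have been paged after the given unacked time."""
    paged, elapsed = [], 0
    for level in policy:
        paged.append(level.role)
        elapsed += level.timeout_minutes
        if minutes_unacked < elapsed:
            break  # this level still has time to acknowledge
    return paged

assert notify_chain(3) == ["first-line"]
assert notify_chain(12) == ["first-line", "second-line"]
assert notify_chain(40) == ["first-line", "second-line", "leader"]
```

The design point is that off‑hours coverage needs no human dispatcher: the policy itself guarantees the leader is eventually paged if nobody responds.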

Ticket Management

Tickets unify all On‑Call capabilities—automation, development workflow, collaboration, fault information, and related events—into a single view.

Data Quality Model

Data Model Stages

First layer: SLO data used for observing and managing product stability and automatic fault discovery.

Second layer: Operational data used to monitor On‑Call efficiency, including manpower investment for stability assessment.

Third layer: Channel data used to evaluate event‑channel coverage and continuously improve automatic fault discovery precision.

Fourth layer: Quality data used for comprehensive stability analysis, such as MTTR, incident count, severity, and root‑cause statistics.

Data‑Driven Decision and Stability Management

These data layers support OKR formulation; for example, setting a target to reduce MTTR by 30% and breaking it down into key results for MTTR, MTTA, localization time, and MTBF improvement based on observed fault patterns.
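The arithmetic behind such a decomposition is simple: split baseline MTTR into its components, set a per‑component key result, and check that the sums achieve the objective. The numbers below are illustrative assumptions, not Tencent's actual figures:

```python
# Hypothetical baseline and per-KR targets, in minutes.
baseline = {"mtta": 10, "localization": 25, "repair": 25}
targets  = {"mtta": 5,  "localization": 16, "repair": 21}

baseline_mttr = sum(baseline.values())  # 60 minutes
target_mttr = sum(targets.values())     # 42 minutes
reduction = 1 - target_mttr / baseline_mttr

assert baseline_mttr == 60 and target_mttr == 42
assert abs(reduction - 0.30) < 1e-9  # the 30% objective is met
```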

Summary and Outlook

With standardized On‑Call mechanisms, organizations can quickly achieve low‑cost, high‑impact stability improvements, extend insights to CI/CD stages, and guide investments in chaos engineering, capacity management, architecture inspection, and observability, thereby turning stability work from reactive to proactive.

Tags: Operations, Observability, SRE, Incident Management, Reliability, On‑Call
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
