
Mastering Continuous Feedback in Massive Operations: DevOps Strategies from Tencent

This article shares insights from Tencent’s SNG operations leader on building effective continuous feedback loops for massive‑scale services, covering monitoring, alerting, operational metrics, multi‑dimensional analysis, and practical DevOps techniques to improve reliability, availability, and automated self‑healing.


Preface

Please forgive the click‑bait title; this article approaches DevOps from the perspective of continuous feedback, focusing on monitoring, alerting, and operations as the three indispensable tasks for building an enterprise quality system.

1. DevOps Continuous Feedback Relies on Effective Operations

The diagram summarizes the entire DevOps system; the final stage is operations and closure. I view this stage from two dimensions: (1) quality operation and closure of the DevOps activity, and (2) technical operation and lifecycle termination of the product.

Today we discuss the quality system built during the technical operation phase before a product’s lifecycle ends, aiming to achieve continuous feedback and optimization.

2. Key Role of Operations in the Product Lifecycle

To realize continuous feedback at Tencent, we must focus on three points:

1. Monitoring – coverage, status feedback, metric measurement

Monitoring must be 360° without blind spots; any business issue should be detectable, with real‑time status and metric change feedback.

2. Alerting – timeliness, accuracy, reachability

As services become more complex, alerts increase. Unprocessed or false alerts must be avoided, and the responsible party for handling each alert must be clear.

3. Operations – RCA, incident management, reporting/evaluation

Repeated problems require root‑cause analysis (RCA) and systematic incident management; reports and assessments empower operations to drive optimization of architecture and code.

3. Tencent’s Multi‑Dimensional Monitoring

Tencent manages services in layers—from servers, databases, logic, and middle‑tier computing up to access, load balancing, data centers, DNS, client, and user side—to achieve “no blind spot” through extensive monitoring points, termed multi‑dimensional monitoring.

Since 2014, we have achieved 100% coverage of user sentiment monitoring points, but the explosion of metric data can become a new hidden risk.

4. Reflections After Building the Monitoring System

We face three challenges in the operation stage:

Complex → Simple

How to simplify the myriad alerts and failures that arise during production?

Broad → Precise

When a core switch fails, thousands of downstream alerts are generated; how to identify that the root cause is the switch?

Chaos → Order

Different collection methods and data volumes cause out‑of‑order alerts; how to sort and prioritize them effectively?

Thus, while we aggressively build monitoring, we must also learn to filter when alerts flood.

5. Clarifying the Relationship Between Monitoring Objects and Metrics

Monitoring objects are layered from infrastructure to application. Using the QQ number registration scenario, we illustrate common metrics such as memory usage, long‑connection count, throughput, CPU, response codes, success rates, and request distribution—metrics independent of specific business logic.

Metrics are divided into two categories:

Low‑level metrics: infrastructure‑level indicators such as network, hardware, and virtualization.

High‑level metrics: business‑oriented indicators such as success rate, latency, and request rate.

Low‑level metrics generate more noise; we should automate or converge them, focusing on high‑level metrics that directly reflect service availability.

6. Understanding the Essence of Monitoring

Monitoring essentially collects values and rates, applies analysis strategies or algorithms, and presents conclusions to detect anomalies.
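This collect‑analyze‑conclude loop can be sketched in a few lines. The following is a minimal illustration, not Tencent's actual system: a hypothetical detector that compares each new sample of a success‑rate metric against a moving average of recent history and flags large deviations. The class name and tolerance value are assumptions chosen for the example.

```python
from collections import deque

class MovingAverageDetector:
    """Flag a metric sample as anomalous when it deviates from the
    moving average of recent samples by more than a fixed tolerance."""

    def __init__(self, window=5, tolerance=0.05):
        self.samples = deque(maxlen=window)
        self.tolerance = tolerance

    def observe(self, value):
        """Return True if `value` is anomalous relative to recent history."""
        if len(self.samples) == self.samples.maxlen:
            baseline = sum(self.samples) / len(self.samples)
            anomalous = abs(value - baseline) > self.tolerance
        else:
            anomalous = False  # not enough history to judge yet
        self.samples.append(value)
        return anomalous

detector = MovingAverageDetector(window=5, tolerance=0.05)
rates = [0.99, 0.98, 0.99, 0.99, 0.98, 0.97, 0.80]  # final sample dips sharply
flags = [detector.observe(r) for r in rates]
print(flags)  # only the final dip is flagged
```

In practice the "analysis strategy" slot holds far richer logic (seasonality, rate‑of‑change, multi‑metric correlation), but the shape stays the same: values in, a judgment out.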

7. Emphasizing Effectiveness in Massive Operational Monitoring

Multi‑dimensional monitoring leads to metric explosion and potential alert overload. To avoid “the boy who cried wolf,” we must address:

1. Correlation analysis

Extract truly important events, activities, and metrics instead of alerting on everything.

2. Zero‑false alerts

Apply convergence and suppression strategies to strengthen alert quality.

3. Continuous operation

Ensure follow‑up, measurement, and accountability so problems do not recur.

A quality system closes the loop: monitoring discovers issues, the system drives optimization across development, operations, and product.

8. Multi‑Dimensional Monitoring Case Study

Case: Qzone heartbeat success rate

Analyzing average success rates across SET, APN, carrier, and region revealed that Android versions performed poorly while iOS maintained 100% success, exposing a version‑specific issue.
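The drill‑down behind this finding is a group‑by over raw heartbeat records, sliced per dimension. Here is a minimal stdlib sketch of the idea; the record fields and sample data are invented for illustration and do not reflect Tencent's actual schema.

```python
from collections import defaultdict

def success_rate_by(dimension, records):
    """Aggregate per-segment success rates for one slicing dimension."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [successes, total]
    for rec in records:
        bucket = totals[rec[dimension]]
        bucket[0] += rec["ok"]
        bucket[1] += 1
    return {segment: ok / total for segment, (ok, total) in totals.items()}

# Hypothetical heartbeat records; each carries the dimensions we can slice on.
records = [
    {"platform": "android", "carrier": "A", "ok": 0},
    {"platform": "android", "carrier": "B", "ok": 1},
    {"platform": "android", "carrier": "A", "ok": 0},
    {"platform": "ios",     "carrier": "A", "ok": 1},
    {"platform": "ios",     "carrier": "B", "ok": 1},
]

by_platform = success_rate_by("platform", records)
worst = min(by_platform, key=by_platform.get)
print(by_platform, worst)  # the android segment stands out as worst
```

Running the same aggregation over every dimension (SET, APN, carrier, region, version) and ranking the worst segments is what turns an average success‑rate dip into a pinpointed "Android version X" conclusion.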

Three practical tips emerged for handling massive monitoring data:

Traceability – track data back to its source.

Root‑cause analysis – identify underlying reasons.

Prioritization – select core indicators for focused attention.

9. Techniques for Analyzing Massive Data

Over‑formatting monitoring data discards valuable traces; when designing reporting protocols, retain as many raw fields as possible so the data can be traced back later.

10. Overview of Traceability Analysis

High‑dimensional vs. dimensionality reduction: Converge alerts to manageable levels; use reports and assessments to drive continuous improvement.

Cascade analysis: Distinguish between infrastructure‑level alerts and service‑level alerts to route them to the appropriate team.

Reverse thinking: Examine raw data before derived results; store raw logs for offline big‑data analysis.

11. Root‑Cause Analysis Practices

Use high‑level alerts to converge low‑level ones.

Converge caller alerts with callee alerts to avoid duplicate noise.

Apply reason‑based convergence to filter symptom alerts during integration testing.

Suppress alerts triggered by planned changes via change‑log integration.
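The last practice, change‑log suppression, reduces to a time‑window check: if an alert's module has a planned change underway, hold the alert back. A minimal sketch under assumed field names (`module`, `time`, `start`), not Tencent's actual implementation:

```python
from datetime import datetime, timedelta

def suppress_by_change_log(alerts, changes, window=timedelta(minutes=30)):
    """Drop alerts on modules that fall inside a planned change window."""
    kept = []
    for alert in alerts:
        planned = any(
            c["module"] == alert["module"]
            and c["start"] <= alert["time"] <= c["start"] + window
            for c in changes
        )
        if not planned:
            kept.append(alert)
    return kept

changes = [{"module": "qzone-web", "start": datetime(2024, 1, 1, 12, 0)}]
alerts = [
    {"module": "qzone-web", "time": datetime(2024, 1, 1, 12, 10)},  # inside window
    {"module": "qq-login",  "time": datetime(2024, 1, 1, 12, 10)},  # unrelated module
]
print(suppress_by_change_log(alerts, changes))  # only the qq-login alert survives
```

Caller/callee and reason‑based convergence follow the same pattern: encode the relationship (call graph, failure cause) as a lookup, then filter the alert stream against it.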

12. Prioritized Indicator (DLP) Practice

Core indicators (DLP) are manually selected to represent a module’s health among hundreds of metrics, enabling focused alerting and resource allocation.

13. Weaving Cloud User Sentiment Monitoring

User sentiment monitoring aggregates feedback from app stores, in‑app channels, and forums, applying machine learning for automatic classification and alerting on issues such as “Qzone unavailable.”

14. Alert Strategy and Self‑Healing

Automated alert handling requires a standardized operation system: pre‑processing, unified policy engine, and decision logic before alerts are emitted.
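The pre‑processing → policy engine → decision chain can be sketched as a small pipeline. Everything below (field names, policy shapes, action strings) is a hypothetical illustration of the flow, not the Weaving Cloud engine itself:

```python
def preprocess(alert):
    """Normalize raw alert fields before policy evaluation."""
    alert = dict(alert)
    alert["module"] = alert["module"].strip().lower()
    return alert

def policy_engine(alert, policies):
    """Return the first policy whose condition matches the alert."""
    for policy in policies:
        if policy["match"](alert):
            return policy
    return None

def handle(alert, policies):
    """Run the full chain: normalize, match a policy, decide an action."""
    alert = preprocess(alert)
    policy = policy_engine(alert, policies)
    if policy is None:
        return "notify-oncall"   # no rule matched: escalate to a human
    return policy["action"]      # e.g. an automated self-healing action

policies = [
    {"match": lambda a: a["metric"] == "process_down", "action": "restart-process"},
    {"match": lambda a: a["severity"] == "low", "action": "suppress"},
]
raw = {"module": " QZone ", "metric": "process_down", "severity": "high"}
print(handle(raw, policies))  # prints "restart-process"
```

The key design point is that self‑healing actions (restart, failover, rollback) sit behind the same decision logic as notifications, so an alert only reaches a person when no automated remedy applies.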

15. Common Convergence Algorithms

Spike convergence: trigger after three similar alerts within ten minutes.

Similarity convergence: collapse multiple alerts from the same module into one.

Time‑window convergence: ignore alerts during scheduled batch jobs.

Day‑night convergence: suppress non‑critical alerts at night.

Change convergence: suppress alerts that coincide with known operational changes.
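As an illustration, the first rule ("trigger after three similar alerts within ten minutes") can be sketched with a per‑key sliding window. The class and key format are assumptions for the example, not the production algorithm:

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

class SpikeConvergence:
    """Emit an alert only once `threshold` similar alerts have arrived
    within `window`; earlier occurrences are held back."""

    def __init__(self, threshold=3, window=timedelta(minutes=10)):
        self.threshold = threshold
        self.window = window
        self.history = defaultdict(deque)  # alert key -> recent timestamps

    def should_emit(self, key, now):
        times = self.history[key]
        times.append(now)
        # Discard occurrences that have aged out of the window.
        while times and now - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold

conv = SpikeConvergence()
t0 = datetime(2024, 1, 1, 12, 0)
emitted = [conv.should_emit("qzone:high-latency", t0 + timedelta(minutes=m))
           for m in (0, 4, 8)]
print(emitted)  # only the third alert within ten minutes fires
```

The other rules differ only in their key and window choices: similarity convergence widens the key to a whole module, time‑window and day‑night convergence gate on the clock, and change convergence gates on the change log.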

16. Monitoring Metric System

The Weaving Cloud monitoring framework defines quality metrics across user‑side, client‑side, server‑side, and infrastructure, leveraging core DLP indicators, multi‑level alerting, and diverse notification channels (SMS, QQ, WeChat, phone).

17. Summary of the Quality System

The system forms a closed loop: continuous feedback → measurement → optimization, enabling effective collaboration among development, product, QA, and support, and delivering tangible business value through DevOps practices.

Tags: monitoring, operations, DevOps, large-scale, continuous feedback
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
