Mastering Continuous Feedback in Massive Operations: DevOps Strategies from Tencent
This article shares insights from Tencent’s SNG operations leader on building effective continuous feedback loops for massive‑scale services, covering monitoring, alerting, operational metrics, multi‑dimensional analysis, and practical DevOps techniques to improve reliability, availability, and automated self‑healing.
Preface
Please forgive the click‑bait title; this article approaches DevOps from the perspective of continuous feedback, focusing on monitoring, alerting, and operations as the three indispensable tasks for building an enterprise quality system.
1. DevOps Continuous Feedback Relies on Effective Operations
The diagram summarizes the entire DevOps system; the final stage is operations and closure. I view this stage from two dimensions: (1) quality operation and closure of the DevOps activity, and (2) technical operation and lifecycle termination of the product.
Today we discuss the quality system built during the technical operation phase before a product’s lifecycle ends, aiming to achieve continuous feedback and optimization.
2. Key Role of Operations in the Product Lifecycle
To realize continuous feedback at Tencent, we must focus on three points:
1. Monitoring – coverage, status feedback, metric measurement
Monitoring must be 360° without blind spots; any business issue should be detectable, with real‑time status and metric change feedback.
2. Alerting – timeliness, accuracy, reachability
As services become more complex, alerts increase. Unprocessed or false alerts must be avoided, and the responsible party for handling each alert must be clear.
3. Operations – RCA, incident management, reporting/evaluation
Repeated problems require root‑cause analysis (RCA) and systematic incident management; reports and assessments empower operations to drive optimization of architecture and code.
3. Tencent’s Multi‑Dimensional Monitoring
Tencent manages services in layers—from servers, databases, logic, and middle‑tier computing up to access, load balancing, data centers, DNS, client, and user side—to achieve “no blind spot” through extensive monitoring points, termed multi‑dimensional monitoring.
Since 2014, we have achieved 100% coverage of user sentiment monitoring points, but the explosion of metric data can become a new hidden risk.
4. Reflections After Building the Monitoring System
We face three challenges in the operation stage:
Complex → Simple
How to simplify the myriad alerts and failures that arise during production?
Broad → Precise
When a core switch fails, thousands of downstream alerts are generated; how to identify that the root cause is the switch?
Chaos → Order
Different collection methods and data volumes cause out‑of‑order alerts; how to sort and prioritize them effectively?
Thus, while we aggressively build monitoring, we must also learn to filter when alerts flood.
5. Clarifying the Relationship Between Monitoring Objects and Metrics
Monitoring objects are layered from infrastructure to application. Using the QQ number registration scenario, we illustrate common metrics such as memory usage, long‑connection count, throughput, CPU, response codes, success rates, and request distribution—metrics independent of specific business logic.
Metrics are divided into two categories:
Low-level metrics: infrastructure-level indicators such as network, hardware, and virtualization.
High-level metrics: business-oriented indicators such as success rate, latency, and request rate.
Low‑level metrics generate more noise; we should automate or converge them, focusing on high‑level metrics that directly reflect service availability.
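The two-tier split above can be sketched as a simple routing rule. This is a minimal illustration, not Tencent's implementation; the metric names and the three actions (`alert`, `converge`, `store`) are assumptions for the example.

```python
# Hypothetical routing of metrics by category: high-level metrics go
# straight to alerting, low-level metrics are converged first.
LOW_LEVEL = {"cpu_usage", "memory_usage", "packet_loss"}
HIGH_LEVEL = {"success_rate", "latency_p99", "request_rate"}

def route(metric: str) -> str:
    if metric in HIGH_LEVEL:
        return "alert"       # directly reflects service availability
    if metric in LOW_LEVEL:
        return "converge"    # noisy; aggregate before alerting
    return "store"           # unknown metrics are only recorded

assert route("success_rate") == "alert"
assert route("cpu_usage") == "converge"
```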
6. Understanding the Essence of Monitoring
Monitoring essentially collects values and rates, applies analysis strategies or algorithms, and presents conclusions to detect anomalies.
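That essence (collect values, apply a strategy, emit a conclusion) can be captured in a few lines. A minimal sketch, assuming a windowed-average threshold strategy; the class name, window size, and threshold are illustrative only.

```python
from collections import deque

class ThresholdMonitor:
    """Collect a metric's recent values and flag anomalies with a simple rule."""

    def __init__(self, threshold: float, window: int = 5):
        self.threshold = threshold
        self.values = deque(maxlen=window)  # keep only the recent window

    def observe(self, value: float) -> str:
        self.values.append(value)
        avg = sum(self.values) / len(self.values)
        # Strategy: conclude "anomaly" when the windowed average crosses the threshold.
        return "ANOMALY" if avg > self.threshold else "OK"

mon = ThresholdMonitor(threshold=500.0)  # e.g. latency in milliseconds
print(mon.observe(120.0))  # OK
print(mon.observe(900.0))  # windowed average is 510 -> ANOMALY
```

Real strategies range from static thresholds like this one to statistical and machine-learned detectors, but the collect/analyze/conclude loop is the same.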
7. Emphasizing Effectiveness in Massive Operational Monitoring
Multi‑dimensional monitoring leads to metric explosion and potential alert overload. To avoid “the boy who cried wolf,” we must address:
1. Correlation analysis
Extract truly important events, activities, and metrics instead of alerting on everything.
2. Zero‑false alerts
Apply convergence and suppression strategies to strengthen alert quality.
3. Continuous operation
Ensure follow‑up, measurement, and accountability so problems do not recur.
A quality system closes the loop: monitoring discovers issues, the system drives optimization across development, operations, and product.
8. Multi‑Dimensional Monitoring Case Study
Case: Qzone heartbeat success rate
Analyzing average success rates across SET, APN, carrier, and region revealed that Android versions performed poorly while iOS maintained 100% success, exposing a version‑specific issue.
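The drill-down used in this case (slicing one success rate along several dimensions) can be sketched as a small aggregation. The records and field layout below are hypothetical, standing in for real heartbeat reports.

```python
from collections import defaultdict

# Hypothetical heartbeat records: (platform, carrier, region, success)
records = [
    ("android", "CMCC", "guangdong", False),
    ("android", "CMCC", "guangdong", True),
    ("ios",     "CUCC", "beijing",   True),
    ("ios",     "CMCC", "guangdong", True),
]

def success_rate_by(dim_index: int) -> dict:
    """Aggregate heartbeat success rate along one dimension of the record."""
    ok, total = defaultdict(int), defaultdict(int)
    for rec in records:
        key = rec[dim_index]
        total[key] += 1
        ok[key] += rec[3]  # True counts as 1
    return {k: ok[k] / total[k] for k in total}

print(success_rate_by(0))  # {'android': 0.5, 'ios': 1.0}
```

Grouping by platform immediately surfaces the Android-only degradation; grouping by carrier or region would rule those dimensions out.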
Three practical tips emerged for handling massive monitoring data:
Traceability – track data back to its source.
Root‑cause analysis – identify underlying reasons.
Prioritization – select core indicators for focused attention.
9. Techniques for Analyzing Massive Data
Over-formatting monitoring data destroys valuable traces; when designing reporting protocols, retain as many raw fields as possible.
10. Overview of Traceability Analysis
High-dimensional vs. dimensionality reduction: Converge alerts to a manageable volume; use reports and assessments to drive continuous improvement.
Cascade analysis: Distinguish infrastructure-level alerts from service-level alerts so each is routed to the appropriate team.
Reverse thinking: Examine raw data before derived results; store raw logs for offline big-data analysis.
11. Root‑Cause Analysis Practices
Use high‑level alerts to converge low‑level ones.
Converge caller alerts with callee alerts to avoid duplicate noise.
Apply reason‑based convergence to filter symptom alerts during integration testing.
Suppress alerts triggered by planned changes via change‑log integration.
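Two of the rules above (high-level alerts absorbing low-level ones, and change-window suppression) can be sketched as a filter over a batch of alerts. This is an illustrative simplification with made-up field names, not the production logic.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    module: str
    level: str    # "high" (service-level) or "low" (infrastructure-level)
    message: str

def converge(alerts: list, changing_modules: set) -> list:
    """Keep only alerts that survive high/low convergence and change suppression."""
    high = {a.module for a in alerts if a.level == "high"}
    kept = []
    for a in alerts:
        if a.module in changing_modules:
            continue  # change convergence: a planned change explains this alert
        if a.level == "low" and a.module in high:
            continue  # absorbed by the module's high-level alert
        kept.append(a)
    return kept

batch = [
    Alert("db",  "high", "success rate dropped"),
    Alert("db",  "low",  "disk I/O high"),
    Alert("web", "low",  "CPU spike"),
]
print(converge(batch, changing_modules={"web"}))  # only the db high-level alert
```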
12. Prioritized Indicator (DLP) Practice
Core indicators (DLP) are manually selected to represent a module’s health among hundreds of metrics, enabling focused alerting and resource allocation.
13. Weaving Cloud User Sentiment Monitoring
User sentiment monitoring aggregates feedback from app stores, in‑app channels, and forums, applying machine learning for automatic classification and alerting on issues such as “Qzone unavailable.”
14. Alert Strategy and Self‑Healing
Automated alert handling requires a standardized operation system: pre‑processing, unified policy engine, and decision logic before alerts are emitted.
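The pre-processing → policy engine → decision flow might look like the sketch below. The normalization fields, policy rules, and action names are all assumptions chosen for illustration.

```python
def preprocess(raw: dict) -> dict:
    """Normalize fields so every alert enters the policy engine in one shape."""
    return {
        "module": raw.get("module", "unknown"),
        "severity": raw.get("severity", "info").lower(),
        "message": raw.get("message", "").strip(),
    }

# Unified policy engine: ordered rules, first match wins.
POLICIES = [
    lambda a: "page_oncall" if a["severity"] == "critical" else None,
    lambda a: "notify_im" if a["severity"] == "warning" else None,
]

def decide(raw: dict) -> str:
    """Run an incoming alert through pre-processing and the policy engine."""
    alert = preprocess(raw)
    for policy in POLICIES:
        action = policy(alert)
        if action:
            return action
    return "log_only"  # default decision: record, do not notify anyone

print(decide({"module": "qzone", "severity": "CRITICAL", "message": "down"}))
```

Standardizing the alert shape first is what makes a single policy engine possible; without it, every data source needs its own decision logic.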
15. Common Convergence Algorithms
Spike convergence: trigger after three similar alerts within ten minutes.
Similarity convergence: collapse multiple alerts from the same module into one.
Time‑window convergence: ignore alerts during scheduled batch jobs.
Day‑night convergence: suppress non‑critical alerts at night.
Change convergence: suppress alerts that coincide with known operational changes.
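The first rule in the list, spike convergence, can be sketched with a sliding time window. A minimal sketch assuming the stated parameters (three similar alerts within ten minutes); the key format is hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 600   # ten minutes
SPIKE_COUNT = 3        # fire only after three similar alerts

history = defaultdict(deque)  # alert key -> timestamps within the window

def should_fire(key: str, now: float) -> bool:
    """Spike convergence: notify only once SPIKE_COUNT similar alerts
    have arrived within WINDOW_SECONDS."""
    q = history[key]
    q.append(now)
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()  # drop timestamps that fell out of the window
    return len(q) >= SPIKE_COUNT

# Two alerts in quick succession stay silent; the third one fires.
assert not should_fire("mod_a:timeout", 0.0)
assert not should_fire("mod_a:timeout", 60.0)
assert should_fire("mod_a:timeout", 120.0)
```

The other rules (similarity, time-window, day-night, change) slot in as additional gates before notification, each suppressing a different class of noise.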
16. Monitoring Metric System
The Weaving Cloud monitoring framework defines quality metrics across user‑side, client‑side, server‑side, and infrastructure, leveraging core DLP indicators, multi‑level alerting, and diverse notification channels (SMS, QQ, WeChat, phone).
17. Summary of the Quality System
The system forms a closed loop: continuous feedback → measurement → optimization, enabling effective collaboration among development, product, QA, and support, and delivering tangible business value through DevOps practices.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation, accompanying you throughout your operations career as we grow together.