How Tencent CDN Achieves Business Continuity with Intelligent Operations
This article outlines the business continuity challenges facing Tencent CDN (bandwidth, device resources, and massive request volumes) and explains how a fault‑management lifecycle, AIOps components, intelligent alerting, and automated capacity planning together enable resilient, automated operations.
Tencent CDN Business Continuity and Intelligent Operations
In this talk, Huang Xiaohua from Tencent Cloud Architecture Platform shares the challenges and solutions for maintaining continuous service of the Tencent CDN platform.
Key Challenges
Bandwidth reserve : 150 Tbps across 2,000+ global nodes, serving diverse ISPs.
Device resources : 5 million CPU cores, 85 hardware models, 10 disk classes with multi‑level speed tiers.
Massive request volume : Peak QPS exceeds 100 M/s, covering video, static, download, live streaming, dynamic acceleration, and multiple protocols (IPv4/IPv6, HTTP/HTTPS/H2/QUIC).
Business complexity : Diverse scenarios (finance, gaming, live streaming) and strict fault‑grading rules (SLO > 1% for > 10 min triggers OKR penalties).
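The fault‑grading rule in the last item can be sketched as a simple run‑length check. This is only an illustration: the source does not say how the SLO figure is measured, so the sketch assumes it is a per‑minute failed‑request ratio, and the function name `breaches_fault_rule` and its parameters are hypothetical.

```python
from typing import Sequence


def breaches_fault_rule(error_rates: Sequence[float],
                        threshold: float = 0.01,
                        min_minutes: int = 10) -> bool:
    """Return True if the rate stays above `threshold` for MORE than
    `min_minutes` consecutive samples.

    Assumption: `error_rates` holds one sample per minute.
    """
    run = 0  # length of the current streak of above-threshold minutes
    for rate in error_rates:
        run = run + 1 if rate > threshold else 0
        if run > min_minutes:
            return True
    return False
```

A short dip back under the threshold resets the streak, so brief recoveries split one long incident into several shorter ones. Whether the real grading rule behaves that way is not stated in the talk.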
To address these, the team adopts a fault‑management‑centric continuity model.
Fault Management Lifecycle
Fault Prevention : Defensive stability design, proactive monitoring, risk‑based architecture, and chaos engineering.
Fault Handling : Three sub‑stages: discovery (target: 10 s for IaaS, 1 min for PaaS), localization, and recovery. These are supported by rapid alerting, automated on‑call routing, and real‑time analysis.
Fault Root‑Cause : Post‑mortem reviews, cultural emphasis, graded response, and precise metrics (MTBF, MTTR).
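The two metrics named above follow their standard definitions: MTTR is total repair time divided by the number of incidents, and MTBF is total uptime divided by the number of incidents. The sketch below computes both from a list of incident records. The `Incident` type and function names are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """One outage: when it began and when service was restored."""
    start: datetime
    recovered: datetime


def total_downtime(incidents: List[Incident]) -> timedelta:
    """Sum of all outage durations."""
    return sum((i.recovered - i.start for i in incidents), timedelta())


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to recovery: total downtime / number of incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined without incidents")
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: List[Incident],
         window_start: datetime,
         window_end: datetime) -> timedelta:
    """Mean time between failures: uptime in the window / number of incidents."""
    if not incidents:
        raise ValueError("MTBF is undefined without incidents")
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)
```

For example, two outages of 30 and 90 minutes inside a 10‑hour window give an MTTR of 1 hour and an MTBF of 4 hours.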
AIOps Framework
The continuity effort integrates three pillars:
Observability : Comprehensive logs, metrics, and tracing.
Analysis : Intelligent analysis of network, business, and performance data.
Automation : Expert‑knowledge‑driven decision making for known scenarios and algorithmic prediction for hidden risks.
Examples include an intelligent alert system that filters out regular spikes, AI‑driven ticket clustering, and automated capacity planning.
Intelligent Alerting
Smart alerts detect subtle anomalies such as periodic spikes, jitter intensity changes, and gradual mean shifts that traditional threshold alerts miss.
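To illustrate why a gradual mean shift escapes a fixed threshold, here is a minimal sketch using a one‑sided CUSUM detector. CUSUM is a classical change‑detection technique used here as a stand‑in; the talk does not say which algorithms the Tencent system uses. The function name and parameters (`baseline_mean`, `slack`, `limit`) are assumptions.

```python
from typing import Optional, Sequence


def cusum_shift_index(values: Sequence[float],
                      baseline_mean: float,
                      slack: float,
                      limit: float) -> Optional[int]:
    """Return the index at which an upward mean shift is flagged, or None.

    Each sample adds its excess over (baseline_mean + slack) to a running
    sum that is clamped at zero; an alarm fires once the sum exceeds `limit`.
    """
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - baseline_mean - slack))
        if s > limit:
            return i
    return None


# A metric that sits at 100, then drifts upward by 0.5 per sample.
series = [100.0] * 20 + [100.0 + 0.5 * k for k in range(1, 21)]
drift_detected_at = cusum_shift_index(series, baseline_mean=100.0,
                                      slack=1.0, limit=10.0)
```

In this series the metric never exceeds 110, so a static alert at 120 would stay silent, while the accumulated deviation trips the detector a few samples into the drift. The `slack` term keeps ordinary jitter from accumulating.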
Capacity Planning
Version 1.0 uses cost‑centric global optimization (≈2 h runtime) but cannot handle real‑time spikes. The newer self‑training system pre‑computes scenarios, enabling near‑real‑time adjustments with a “slightly over‑provision” strategy.
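The idea of pre‑computing scenarios and then slightly over‑provisioning can be sketched as a lookup: plans are solved offline for a range of demand levels, and at run time the smallest plan that covers observed demand plus some headroom is chosen. The data layout, the `headroom` value, and the plan names are hypothetical; the talk does not describe the real system at this level of detail.

```python
import bisect
from typing import List, Optional, Tuple

# (demand the plan was solved for, in Gbps; opaque plan identifier)
Scenario = Tuple[float, str]


def pick_plan(scenarios: List[Scenario],
              observed_gbps: float,
              headroom: float = 0.1) -> Optional[Scenario]:
    """Pick the smallest pre-computed scenario covering demand plus headroom.

    `scenarios` must be sorted by demand. Returns None when the target
    exceeds every pre-computed scenario, signalling that a full
    re-optimization is needed.
    """
    target = observed_gbps * (1.0 + headroom)
    demands = [demand for demand, _ in scenarios]
    idx = bisect.bisect_left(demands, target)
    if idx == len(scenarios):
        return None
    return scenarios[idx]
```

Because the expensive optimization has already run, the run‑time step is only a binary search. Rounding up to the next scenario is what makes the result "slightly over‑provisioned".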
Root‑Cause Analysis
A data foundation aggregates network, performance, business, log, event, scheduling, health, and client data, linked across the entire service chain. Combined with expert knowledge, it powers fast initial triage, detailed module identification, and final root‑cause determination.
Intelligent Automation Cases
SSD lifespan management : Predict wear‑out, migrate hot disks, and replace before failure, extending SSD life by ~8 months.
Link quality segmentation : Region‑specific quality thresholds replace a single global line, reducing false alarms and improving user experience.
LDNS chunk scheduling : Consolidate top domains, pool resources, and dynamically balance traffic across virtual platforms, cutting operational complexity.
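The link quality segmentation case can be illustrated with per‑region thresholds derived from each region's own history. This sketch assumes the quality metric is latency in milliseconds and sets each threshold at the regional mean plus `k` standard deviations. The statistic, the value of `k`, and the region names are illustrative assumptions, not details from the talk.

```python
from statistics import mean, pstdev
from typing import Dict, List


def regional_thresholds(history: Dict[str, List[float]],
                        k: float = 3.0) -> Dict[str, float]:
    """Compute one alert threshold per region: mean + k * std of its history."""
    return {region: mean(samples) + k * pstdev(samples)
            for region, samples in history.items()}


def is_degraded(region: str, latency_ms: float,
                thresholds: Dict[str, float]) -> bool:
    """True if the latest latency exceeds that region's own threshold."""
    return latency_ms > thresholds[region]
```

With a single global line at, say, 100 ms, a region that normally sits near 20 ms would not alarm at 60 ms even though it has tripled, while a region that normally sits near 200 ms would alarm constantly. Per‑region thresholds avoid both failure modes.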
Summary
Intelligent operations, built on fault management, observability, analysis, and automation, significantly improve Tencent CDN’s business continuity by reducing MTTR and enhancing reliability.
Tencent Architect
We share technical insights on storage, computing, and access, and explore industry-leading product technologies together.