
How Tencent CDN Achieves Business Continuity with Intelligent Operations

This article details Tencent CDN's extensive business continuity challenges—including bandwidth, device resources, and massive request volumes—and explains how a fault‑management lifecycle, AIOps components, intelligent alerting, and automated capacity planning together enable resilient, automated operations.

Tencent Architect

Tencent CDN Business Continuity and Intelligent Operations

In this talk, Huang Xiaohua from Tencent Cloud Architecture Platform shares the challenges and solutions for maintaining continuous service of the Tencent CDN platform.

Key Challenges

Bandwidth reserve: 150 Tbps across 2,000+ global nodes, serving diverse ISPs.

Device resources: 5 million CPU cores, 85 hardware models, and 10 disk classes with multi-level speed tiers.

Massive request volume: Peak load exceeds 100 million queries per second, covering video, static, download, live streaming, and dynamic acceleration traffic over multiple protocols (IPv4/IPv6, HTTP/HTTPS/H2/QUIC).

Complex business scenarios: Customers span finance, gaming, and live streaming, and fault-grading rules are strict (an SLO deviation of more than 1% sustained for over 10 minutes triggers OKR penalties).

To address these, the team adopts a fault‑management‑centric continuity model.

Fault Management Lifecycle

Fault Prevention: Defensive stability design, proactive monitoring, risk-based architecture, and chaos engineering.

Fault Handling: Three sub-stages (discovery, localization, and recovery), supported by rapid alerting, automated on-call routing, and real-time analysis. The discovery target is 10 s for IaaS and 1 min for PaaS.

Fault Root-Cause: Post-mortem reviews, cultural emphasis, graded response, and precise metrics (MTBF, MTTR).
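As a rough illustration of the two metrics named above, the sketch below computes MTTR and MTBF from a list of incident records. The Incident type, its fields, and the sample timestamps are hypothetical and not taken from the Tencent CDN system. MTBF is computed here as uptime in the observation window divided by the number of failures, which is one common convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    start: datetime     # when the fault began (hypothetical field)
    resolved: datetime  # when service was restored (hypothetical field)


def mttr(incidents):
    """Mean time to repair: average outage duration."""
    total = sum((i.resolved - i.start for i in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents, window_start, window_end):
    """Mean time between failures: uptime in the window / number of failures."""
    downtime = sum((i.resolved - i.start for i in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents)


# Made-up incidents inside a 30-day observation window.
incidents = [
    Incident(datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 20)),
    Incident(datetime(2024, 1, 17, 2, 0), datetime(2024, 1, 17, 2, 40)),
]
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 31)

print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents, window_start, window_end))
```

Tracking both numbers per fault grade, rather than one global figure, keeps a few long outages from hiding behind many short ones.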

AIOps Framework

The continuity effort integrates three pillars:

Observability: Comprehensive logs, metrics, and tracing.

Analysis: Intelligent analysis of network, business, and performance data.

Automation: Expert-knowledge-driven decision making for known scenarios and algorithmic prediction for hidden risks.

Examples include an intelligent alert system that filters out regular spikes, AI‑driven ticket clustering, and automated capacity planning.

Intelligent Alerting

Smart alerts detect subtle anomalies such as periodic spikes, jitter intensity changes, and gradual mean shifts that traditional threshold alerts miss.
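To make the contrast with fixed thresholds concrete, here is a minimal sketch of one classic technique for catching a gradual mean shift: a one-sided CUSUM detector. The talk does not say which algorithms the alert system uses, so the choice of CUSUM, the function name cusum_upward_shift, and all parameter values are assumptions made for illustration.

```python
def cusum_upward_shift(values, baseline_mean, slack, threshold):
    """Return the index at which an upward mean shift is flagged, or None.

    Accumulates s = max(0, s + (x - baseline_mean - slack)) and raises an
    alarm once s exceeds threshold. Small deviations are absorbed by slack,
    while a persistent drift keeps adding to s.
    """
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - baseline_mean - slack))
        if s > threshold:
            return i
    return None


# Hypothetical latency series (ms): stable near 100, then drifting up 1 ms per sample.
stable = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
drift = [100 + k for k in range(1, 11)]
series = stable + drift

static_limit = 120  # a fixed threshold that the drift never reaches
static_alarm = any(x > static_limit for x in series)
cusum_alarm_at = cusum_upward_shift(series, baseline_mean=100, slack=2, threshold=10)

print("static threshold fired:", static_alarm)
print("CUSUM alarm index:", cusum_alarm_at)
```

In this made-up series the fixed limit never fires, while the cumulative sum flags the drift partway through the ramp. Periodic spikes and jitter changes would need different detectors, for example seasonal baselines or variance tracking.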

Capacity Planning

Version 1.0 used cost-centric global optimization. Each run took about 2 hours, so it could not react to real-time spikes. The newer self-training system pre-computes scenarios, enabling near-real-time adjustments under a "slightly over-provisioned" strategy.
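One way to picture the pre-computation idea is a lookup over plans that were solved offline: at run time the system only picks the smallest prepared scenario that covers current demand plus some headroom. The sketch below is an assumption about how such a lookup could work, not a description of Tencent's system. The plan table, the Gbps demand levels, the 10% headroom, and the function name pick_plan are all hypothetical.

```python
import bisect

# Assumed offline step: the slow optimizer has already produced one plan per
# demand level (in Gbps). The string values stand in for real allocation plans.
PRECOMPUTED_PLANS = {
    100: "plan-100G",
    200: "plan-200G",
    400: "plan-400G",
    800: "plan-800G",
}
SCENARIO_LEVELS = sorted(PRECOMPUTED_PLANS)


def pick_plan(observed_gbps, headroom=1.1):
    """Pick the smallest precomputed scenario covering demand plus headroom."""
    target = observed_gbps * headroom
    idx = bisect.bisect_left(SCENARIO_LEVELS, target)
    if idx == len(SCENARIO_LEVELS):
        raise ValueError("demand exceeds every precomputed scenario")
    level = SCENARIO_LEVELS[idx]
    return level, PRECOMPUTED_PLANS[level]


print(pick_plan(150))
print(pick_plan(190))
```

Rounding up to the next prepared scenario is what makes the result slightly over-provisioned: the lookup is fast, at the price of some spare capacity.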

Root‑Cause Analysis

A data foundation aggregates network, performance, business, log, event, scheduling, health, and client data, linked across the entire service chain. Combined with expert knowledge, it powers fast initial triage, detailed module identification, and final root‑cause determination.

Intelligent Automation Cases

SSD lifespan management: Predict wear-out, migrate hot disks, and replace before failure, extending SSD life by ~8 months.

Link quality segmentation: Region-specific quality thresholds replace a single global line, reducing false alarms and improving user experience.

LDNS chunk scheduling: Consolidate top domains, pool resources, and dynamically balance traffic across virtual platforms, cutting operational complexity.
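The link quality case above can be sketched as follows. Each region gets its own latency threshold derived from its own history, here as mean plus k standard deviations. The statistic, the value of k, the region names, and the sample data are assumptions for illustration. The talk does not specify how the regional thresholds are computed.

```python
from statistics import mean, pstdev


def region_thresholds(samples_by_region, k=3.0):
    """Derive one latency alert threshold per region as mean + k * std."""
    return {
        region: mean(vals) + k * pstdev(vals)
        for region, vals in samples_by_region.items()
    }


def is_anomalous(region, latency_ms, thresholds):
    """True when the latency exceeds that region's own threshold."""
    return latency_ms > thresholds[region]


# Hypothetical latency history (ms) for a fast and a slow region.
history = {
    "region-a": [20, 22, 21, 19, 23],
    "region-b": [80, 85, 78, 82, 90],
}
thresholds = region_thresholds(history)
global_line = 50  # a single global threshold, for comparison

# 85 ms is normal for region-b but would trip the global line (false alarm).
print(is_anomalous("region-b", 85, thresholds), 85 > global_line)
# 40 ms is abnormal for region-a but stays under the global line (missed).
print(is_anomalous("region-a", 40, thresholds), 40 > global_line)
```

The two sample checks show both failure modes of a single global line: it fires on a region that is always slow and stays silent on a fast region that has degraded.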

Summary

Intelligent operations, built on fault‑management, observability, analysis, and automation, significantly improve Tencent CDN’s business continuity, reducing MTTR and enhancing reliability.

Continuity summary diagram