
Tencent Cloud Network Operations Platform: Architecture, Chaos Engineering, Change Health Check, and Monitoring

Tencent Cloud’s network operations platform combines a layered underlay‑overlay architecture, rapid fault detection within seconds and recovery in minutes, chaos‑engineering experiments, rigorous change health checks, high‑frequency multi‑path monitoring, and plans for predictive self‑healing to ensure reliable service across millions of servers.


This article summarizes a technical talk by Tencent Cloud expert engineer Chen Zhengchan on the design and operation of the Tencent Cloud network infrastructure and its operations platform.

Network Overview: Tencent Cloud's network is split into an underlay layer (region → zone → network planning module) and an overlay layer built on top of it. The underlay interconnects internal networks across regions and zones, while the overlay provides point-to-point tunnels, programmed by a self-developed SDN controller, between both compute and network nodes.
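The overlay/underlay split described above can be sketched as a toy controller: the controller tracks which underlay host each virtual node lives on, and overlay traffic between two nodes is encapsulated into a tunnel between their hosts. This is a minimal illustration only; the class and method names are hypothetical, not Tencent Cloud's actual SDN API.

```python
class TunnelController:
    """Toy sketch of an overlay SDN controller (all names hypothetical).

    The underlay knows only physical hosts; the controller maps each
    virtual node to its host and derives a point-to-point tunnel for
    any pair of virtual nodes.
    """

    def __init__(self):
        self.host_of = {}  # virtual node id -> underlay host IP

    def register(self, node_id: str, host_ip: str) -> None:
        """Record which physical host a compute/network node runs on."""
        self.host_of[node_id] = host_ip

    def tunnel_for(self, src: str, dst: str) -> tuple:
        """Overlay traffic between src and dst is carried in a tunnel
        between their underlay hosts."""
        return (self.host_of[src], self.host_of[dst])


ctrl = TunnelController()
ctrl.register("vm-a", "10.0.0.1")
ctrl.register("vm-b", "10.0.0.2")
print(ctrl.tunnel_for("vm-a", "vm-b"))  # tunnel endpoints on the underlay
```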

Scale and Reliability Requirements: Tencent Cloud now operates over 40 regions, 100+ zones, and more than one million servers. The platform aims for rapid fault detection (within 15-30 seconds) and recovery (within 3 minutes), treating each network fault as having a lifecycle: pre-fault (hidden risks), fault-in-progress (often triggered by changes), and post-fault.

Chaos Engineering: To expose hidden risks before they become incidents, the team runs chaos experiments that inject failures such as packet loss, traffic spikes, or hash-load imbalance. Experiments are carefully scoped, with rollback mechanisms to prevent real outages, and results are used to build automated remediation workflows.
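The "carefully scoped, with rollback" property can be captured as a context manager that guarantees the injected fault is withdrawn even if the experiment crashes mid-run. This is a generic sketch, not Tencent's tooling; the injected action could shell out to something like Linux `tc netem` in practice, but here it is an arbitrary callable.

```python
class ChaosExperiment:
    """Scoped fault injection with guaranteed rollback (illustrative sketch).

    `inject` applies the fault (e.g. add packet loss on one test link);
    `rollback` removes it. Using a context manager ensures rollback runs
    even when the experiment raises, so a failed experiment cannot leave
    a real fault behind.
    """

    def __init__(self, name, inject, rollback):
        self.name = name
        self._inject = inject
        self._rollback = rollback
        self.rolled_back = False

    def __enter__(self):
        self._inject()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._rollback()           # always undo the fault
        self.rolled_back = True
        return False               # never swallow the experiment's errors


# Usage: inject 10% loss on a single scoped link (placeholder actions).
state = {"loss_pct": 0}
with ChaosExperiment(
    "loss-on-test-link",
    inject=lambda: state.update(loss_pct=10),
    rollback=lambda: state.update(loss_pct=0),
):
    pass  # run probes, observe detection/remediation here
print(state["loss_pct"])  # → 0, the fault was rolled back
```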

Change Health Check: Every network change follows a strict process (baseline, time window, approval, announcement). During the change window, health-check tasks monitor business-level metrics to detect anomalies. When metrics are unavailable, correlated alarms and log analysis are used instead.
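A minimal form of such a health check compares in-window metrics against the pre-change baseline and flags anything that drifts past a tolerance, falling back to a "missing" marker when a metric is unavailable. The function name and 20% threshold below are illustrative assumptions, not the platform's actual values.

```python
def check_change_health(baseline, current, max_deviation=0.2):
    """Flag business metrics that drift from the pre-change baseline.

    baseline/current: dicts of metric name -> value sampled before and
    during the change window. Returns a dict of anomalous metrics:
    relative deviation for drifted metrics, "missing" when the metric
    is unavailable (triggering fallback to alarms/log analysis).
    """
    anomalies = {}
    for metric, base in baseline.items():
        cur = current.get(metric)
        if cur is None:
            anomalies[metric] = "missing"
        elif base and abs(cur - base) / abs(base) > max_deviation:
            anomalies[metric] = round((cur - base) / base, 3)
    return anomalies


# Latency rose 75% during the window -> flagged; QPS dipped 1% -> fine.
print(check_change_health(
    {"qps": 1000, "latency_ms": 20},
    {"qps": 990, "latency_ms": 35},
))
```

A real deployment would sample these metrics repeatedly during the window and roll the change back automatically once anomalies persist.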

Network Monitoring: Monitoring covers external ISP links, internal LAN/DCI, gateway clusters, forwarding quality, and dedicated lines. Probes (Ping, Traceroute, Curl, Socket) are deployed at high frequency (5-10 s intervals) and must be stable, short-path, and consistently reachable. The system aggregates probe data, performs rapid anomaly detection, and correlates paths to pinpoint fault locations.
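The rapid-detection side of this can be sketched as a sliding window over recent probe results: with probes every 5 s, a 12-sample window covers about a minute, which fits the 15-30 s detection goal once a few samples fail. The window size and loss threshold here are illustrative assumptions.

```python
from collections import deque


class ProbeMonitor:
    """Sliding-window anomaly detector over high-frequency probe results.

    Each probe (Ping/Curl/etc.) reports success or failure; an alarm is
    raised only once the window is full and the loss rate exceeds the
    threshold, avoiding alerts on a single dropped probe.
    """

    def __init__(self, window=12, loss_threshold=0.25):
        self.results = deque(maxlen=window)
        self.loss_threshold = loss_threshold

    def record(self, success: bool) -> None:
        self.results.append(success)

    def loss_rate(self) -> float:
        if not self.results:
            return 0.0
        return 1 - sum(self.results) / len(self.results)

    def anomalous(self) -> bool:
        return (len(self.results) == self.results.maxlen
                and self.loss_rate() > self.loss_threshold)


mon = ProbeMonitor(window=4, loss_threshold=0.25)
for ok in (True, True, False, False):
    mon.record(ok)
print(mon.loss_rate(), mon.anomalous())  # 0.5 loss over a full window -> alarm
```

Correlating which probes' paths share a failing segment (the cross-path correlation mentioned above) would then localize the fault.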

Future Directions: The team plans to explore additional risk-detection methods beyond chaos engineering, improve fault isolation without extensive packet capture, and develop automated network self-healing based on predictive analysis of syslog, SNMP, and other telemetry.

Q&A Highlights: Questions covered the practicality of point-based monitoring versus full-mesh, automated analysis of probe paths, use of machine learning for log anomaly detection, and the challenges of providing business-specific monitoring versus platform-wide observability.

Tags: operations, chaos engineering, change management, cloud networking, network monitoring, Tencent Cloud
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
