
Tencent Cloud Network Operations Platform: Architecture, Chaos Engineering, Change Health Check, and Monitoring

Tencent Cloud’s network operations platform combines a layered underlay‑overlay architecture, rapid fault detection within seconds and recovery in minutes, chaos‑engineering experiments, rigorous change health checks, high‑frequency multi‑path monitoring, and plans for predictive self‑healing to ensure reliable service across millions of servers.


This article summarizes a technical talk by Tencent Cloud expert engineer Chen Zhengchan on the design and operation of the Tencent Cloud network infrastructure and its operations platform.

Network Overview: Tencent Cloud's network is split into an underlay layer (region → zone → network planning module) and an overlay layer built on top of it. The underlay interconnects internal networks across regions and zones, while the overlay provides point-to-point tunnels, programmed by a self-developed SDN controller, between both compute and network nodes.
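The overlay/underlay split described above can be sketched as a toy controller: the controller tracks which underlay host each virtual node lives on, and overlay traffic between two nodes is encapsulated into a tunnel between their hosts. This is a minimal illustration only; the class and method names are hypothetical, not Tencent Cloud's actual SDN API.

```python
class TunnelController:
    """Toy sketch of an overlay SDN controller (all names hypothetical).

    The underlay knows only physical hosts; the controller maps each
    virtual node to its host and derives a point-to-point tunnel for
    any pair of virtual nodes.
    """

    def __init__(self):
        self.host_of = {}  # virtual node id -> underlay host IP

    def register(self, node_id: str, host_ip: str) -> None:
        """Record which physical host a compute/network node runs on."""
        self.host_of[node_id] = host_ip

    def tunnel_for(self, src: str, dst: str) -> tuple:
        """Overlay traffic between src and dst is carried in a tunnel
        between their underlay hosts."""
        return (self.host_of[src], self.host_of[dst])


ctrl = TunnelController()
ctrl.register("vm-a", "10.0.0.1")
ctrl.register("vm-b", "10.0.0.2")
print(ctrl.tunnel_for("vm-a", "vm-b"))  # tunnel endpoints on the underlay
```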

Scale and Reliability Requirements: Tencent Cloud now operates over 40 regions, 100+ zones, and more than one million servers. The platform aims for rapid fault detection (within 15-30 seconds) and recovery (within 3 minutes), treating each network fault as having a lifecycle: pre-fault (hidden risks), fault-in-progress (often triggered by changes), and post-fault.

Chaos Engineering: To expose hidden risks before they become incidents, the team runs chaos experiments that inject failures such as packet loss, traffic spikes, or hash-load imbalance. Experiments are carefully scoped, with rollback mechanisms to prevent real outages, and results are used to build automated remediation workflows.
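The "carefully scoped, with rollback" property can be captured as a context manager that guarantees the injected fault is withdrawn even if the experiment crashes mid-run. This is a generic sketch, not Tencent's tooling; the injected action could shell out to something like Linux `tc netem` in practice, but here it is an arbitrary callable.

```python
class ChaosExperiment:
    """Scoped fault injection with guaranteed rollback (illustrative sketch).

    `inject` applies the fault (e.g. add packet loss on one test link);
    `rollback` removes it. Using a context manager ensures rollback runs
    even when the experiment raises, so a failed experiment cannot leave
    a real fault behind.
    """

    def __init__(self, name, inject, rollback):
        self.name = name
        self._inject = inject
        self._rollback = rollback
        self.rolled_back = False

    def __enter__(self):
        self._inject()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._rollback()           # always undo the fault
        self.rolled_back = True
        return False               # never swallow the experiment's errors


# Usage: inject 10% loss on a single scoped link (placeholder actions).
state = {"loss_pct": 0}
with ChaosExperiment(
    "loss-on-test-link",
    inject=lambda: state.update(loss_pct=10),
    rollback=lambda: state.update(loss_pct=0),
):
    pass  # run probes, observe detection/remediation here
print(state["loss_pct"])  # → 0, the fault was rolled back
```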

Change Health Check: Every network change follows a strict process (baseline, time window, approval, announcement). During the change window, health-check tasks monitor business-level metrics to detect anomalies. When metrics are unavailable, correlated alarms and log analysis are used instead.
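A minimal form of such a health check compares in-window metrics against the pre-change baseline and flags anything that drifts past a tolerance, falling back to a "missing" marker when a metric is unavailable. The function name and 20% threshold below are illustrative assumptions, not the platform's actual values.

```python
def check_change_health(baseline, current, max_deviation=0.2):
    """Flag business metrics that drift from the pre-change baseline.

    baseline/current: dicts of metric name -> value sampled before and
    during the change window. Returns a dict of anomalous metrics:
    relative deviation for drifted metrics, "missing" when the metric
    is unavailable (triggering fallback to alarms/log analysis).
    """
    anomalies = {}
    for metric, base in baseline.items():
        cur = current.get(metric)
        if cur is None:
            anomalies[metric] = "missing"
        elif base and abs(cur - base) / abs(base) > max_deviation:
            anomalies[metric] = round((cur - base) / base, 3)
    return anomalies


# Latency rose 75% during the window -> flagged; QPS dipped 1% -> fine.
print(check_change_health(
    {"qps": 1000, "latency_ms": 20},
    {"qps": 990, "latency_ms": 35},
))
```

A real deployment would sample these metrics repeatedly during the window and roll the change back automatically once anomalies persist.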

Network Monitoring: Monitoring covers external ISP links, internal LAN/DCI, gateway clusters, forwarding quality, and dedicated lines. Probes (Ping, Traceroute, Curl, Socket) are deployed at high frequency (5-10 s intervals) and must be stable, short-path, and consistently reachable. The system aggregates probe data, performs rapid anomaly detection, and correlates paths to pinpoint fault locations.
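The rapid-detection side of this can be sketched as a sliding window over recent probe results: with probes every 5 s, a 12-sample window covers about a minute, which fits the 15-30 s detection goal once a few samples fail. The window size and loss threshold here are illustrative assumptions.

```python
from collections import deque


class ProbeMonitor:
    """Sliding-window anomaly detector over high-frequency probe results.

    Each probe (Ping/Curl/etc.) reports success or failure; an alarm is
    raised only once the window is full and the loss rate exceeds the
    threshold, avoiding alerts on a single dropped probe.
    """

    def __init__(self, window=12, loss_threshold=0.25):
        self.results = deque(maxlen=window)
        self.loss_threshold = loss_threshold

    def record(self, success: bool) -> None:
        self.results.append(success)

    def loss_rate(self) -> float:
        if not self.results:
            return 0.0
        return 1 - sum(self.results) / len(self.results)

    def anomalous(self) -> bool:
        return (len(self.results) == self.results.maxlen
                and self.loss_rate() > self.loss_threshold)


mon = ProbeMonitor(window=4, loss_threshold=0.25)
for ok in (True, True, False, False):
    mon.record(ok)
print(mon.loss_rate(), mon.anomalous())  # 0.5 loss over a full window -> alarm
```

Correlating which probes' paths share a failing segment (the cross-path correlation mentioned above) would then localize the fault.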

Future Directions: The team plans to explore additional risk-detection methods beyond chaos engineering, improve fault isolation without extensive packet capture, and develop automated network self-healing based on predictive analysis of syslog, SNMP, and other telemetry.

Q&A Highlights: Questions covered the practicality of point-based monitoring versus full-mesh, automated analysis of probe paths, use of machine learning for log anomaly detection, and the challenges of providing business-specific monitoring versus platform-wide observability.

Tags: operations, chaos engineering, change management, cloud networking, network monitoring, Tencent Cloud
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
