
How Tencent Built an AI‑Powered Network Fault Detection System in Minutes

In this talk, Tencent's infrastructure lead explains how the team built an AI-assisted fault detection and recovery pipeline that combines high-precision Meshping monitoring, multi-KPI analytics, and automated Moveout isolation. Faults are detected within about three minutes, and network outage resolution drops from hours to minutes.

Efficient Ops

Overview

This article records He Weibing’s presentation at GOPS 2018 Shanghai, where he shares Tencent’s experience building an intelligent network‑operation platform that can detect, locate, and recover from faults within minutes.

Problem Background

During a WeChat Red‑Packet event, a network outage prevented the boss from sending a live red packet, highlighting the critical impact of network failures on high‑visibility services.

3M Methodology

The team defined a three‑step approach—Meshping monitoring, Multi‑KPI analytics, and Moveout isolation—to improve the three stages of incident handling: discovery, localization, and recovery.

3M Method Overview

Meshping Monitoring

High-precision active probing is performed by selecting a subset of servers and having them cross-probe each other. The system schedules a very large number of ping tasks and aggregates the results quickly. Probes are varied across packet-size combinations, QoS tags, and UDP checks so that alerts are highly accurate.
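The cross-probing idea can be sketched as follows. This is a minimal illustration, not the Meshping implementation: the names `ProbeTask`, `build_mesh_tasks`, and `lossy_pairs`, the packet sizes, the DSCP values, and the 10% loss threshold are all assumptions made for the example.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from itertools import permutations, product


@dataclass(frozen=True)
class ProbeTask:
    src: str
    dst: str
    packet_size: int  # payload size in bytes (assumed values)
    dscp: int         # QoS tag carried by the probe (assumed values)
    protocol: str     # "icmp" or "udp"


def build_mesh_tasks(servers, sample_size, packet_sizes=(64, 1400),
                     dscps=(0, 46), protocols=("icmp", "udp"), seed=0):
    """Pick a subset of servers and create one probe task per ordered
    pair and per (packet size, QoS tag, protocol) combination."""
    rng = random.Random(seed)
    probers = rng.sample(sorted(servers), min(sample_size, len(servers)))
    tasks = [
        ProbeTask(src, dst, size, dscp, proto)
        for src, dst in permutations(probers, 2)
        for size, dscp, proto in product(packet_sizes, dscps, protocols)
    ]
    return probers, tasks


def lossy_pairs(results, threshold=0.1):
    """Aggregate (task, reply_received) results per (src, dst) pair and
    return the pairs whose loss rate exceeds the threshold."""
    sent = defaultdict(int)
    lost = defaultdict(int)
    for task, ok in results:
        key = (task.src, task.dst)
        sent[key] += 1
        if not ok:
            lost[key] += 1
    return {k: lost[k] / sent[k] for k in sent if lost[k] / sent[k] > threshold}
```

Probing every ordered pair in both directions lets a one-way loss show up as a single lossy pair, and varying size and QoS tag per pair helps separate faults that only affect large packets or a particular traffic class.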

Multi‑KPI Analytics

Instead of examining every low-level detail, the team identifies a small set of KPIs that correlate strongly with faults. One example is the device forwarding efficiency ratio, derived from the conservation of packet volume: a healthy device should send out roughly as many packets as it receives. A sudden shift in such a KPI points to the likely fault location.
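A forwarding-efficiency check of this kind might look like the sketch below. The exact formula the team uses is not given in the talk, so the ratio of packets out to packets in is an assumption, as are the function names, the window size, and the thresholds. The shift test is a simple median and MAD comparison chosen for illustration.

```python
from statistics import median


def forwarding_efficiency(packets_in, packets_out):
    """Assumed definition: packets forwarded divided by packets received.
    A healthy device should stay close to 1.0."""
    if packets_in == 0:
        return 1.0  # nothing to forward, treat as healthy
    return packets_out / packets_in


def sudden_shift(series, window=5, k=5.0, min_dev=0.01):
    """Return the indices whose value deviates from the median of the
    previous `window` values by more than max(k * MAD, min_dev)."""
    flagged = []
    for i in range(window, len(series)):
        base = series[i - window:i]
        med = median(base)
        mad = median([abs(x - med) for x in base])
        if abs(series[i] - med) > max(k * mad, min_dev):
            flagged.append(i)
    return flagged


# Hypothetical (packets_in, packets_out) counters for one device.
counters = [(1000, 999), (1200, 1199), (1100, 1099), (1000, 1000),
            (1300, 1298), (1200, 1199), (1100, 880)]
ratios = [forwarding_efficiency(i, o) for i, o in counters]
shifted = sudden_shift(ratios)  # only the last sample drops sharply
```

The `min_dev` floor keeps the check from firing on tiny fluctuations when the baseline is almost perfectly flat, which is the normal state for this ratio.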

Moveout Isolation

Redundant network components are encapsulated as reusable Moveout modules. When a fault is detected, the system automatically shields the affected path, and silent fallback channels can be activated on demand.
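One way to picture a Moveout module is as a redundancy group that only isolates a member when the remaining members can still carry traffic. The sketch below is hypothetical: the class `RedundancyGroup`, the `min_active` capacity rule, and the method names are assumptions, and the talk does not describe the real safety checks.

```python
from dataclasses import dataclass, field


@dataclass
class RedundancyGroup:
    name: str
    members: set
    min_active: int  # assumed rule: members needed to carry the traffic
    isolated: set = field(default_factory=set)

    def active(self):
        """Members currently carrying traffic."""
        return self.members - self.isolated

    def moveout(self, member):
        """Isolate a faulty member. Refuse if that would leave fewer
        than `min_active` members in service."""
        if member not in self.members:
            raise KeyError(member)
        if member in self.isolated:
            return True  # already isolated
        if len(self.active()) - 1 < self.min_active:
            return False  # not enough redundancy left, escalate instead
        self.isolated.add(member)
        return True

    def restore(self, member):
        """Put a repaired member back into service."""
        self.isolated.discard(member)


group = RedundancyGroup("core-pair", {"sw1", "sw2", "sw3"}, min_active=2)
first = group.moveout("sw1")   # allowed: two members remain
second = group.moveout("sw2")  # refused: only one member would remain
```

Wrapping the capacity check and the isolation action in one reusable unit is what makes automatic isolation safe to trigger from an alert without a human in the loop.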

Moveout Isolation

Performance Results

With the integrated pipeline, the team can detect anomalies within three minutes, push alerts via WeChat, and complete fault localization in four minutes. In most cases, full recovery is achieved within ten minutes, and 90% of alerts are accurate.

Exploring AIOps

Recognizing the limits of rule-based KPI thresholds, the team investigated AIOps. Network fault samples are scarce, each fault tends to be novel, and signals are highly correlated across the mesh topology, which makes a pure machine-learning approach difficult. They experimented with two pilots:

Time‑series anomaly detection using a pre‑trained model from an internal SNG platform, achieving high precision on individual metric curves.

Text clustering of configuration files using TF‑IDF‑based dimensionality reduction to audit and standardize device configs.
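The configuration-audit pilot can be illustrated with a small standard-library sketch. It is a simplified stand-in, not the team's pipeline: it builds TF-IDF vectors and groups configs with a greedy leader clustering on cosine similarity, and it skips the dimensionality-reduction step mentioned above. The function names, the 0.5 threshold, and the sample configs are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (token -> weight) per document,
    using whitespace tokens and a smoothed IDF."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(vectors, threshold=0.5):
    """Greedy leader clustering: join the first cluster whose leader is
    similar enough, otherwise start a new cluster."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical device configs, flattened to one line each.
configs = [
    "hostname sw1 snmp community ops ntp server 10.0.0.1 logging host 10.0.0.2",
    "hostname sw2 snmp community ops ntp server 10.0.0.1 logging host 10.0.0.2",
    "hostname sw3 snmp community ops ntp server 10.0.0.1 logging host 10.0.0.2",
    "hostname sw4 telnet enable snmp community public",
]
groups = cluster(tfidf_vectors(configs))
```

In an audit, small or singleton clusters are the interesting ones: they are the configs that differ from the fleet's common template and are candidates for review.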

AIOps Pilot

Future Directions

The team plans to build a comprehensive network knowledge graph that models hardware connections, configuration semantics, and operational states. They also plan to launch “Black Mirror 2.0”, an event-driven intelligent diagnosis platform that will perform massive logical computations in real time.

They also advocate a shift toward NRE (Network‑DevOps) roles, where software engineers manage network infrastructure using DevOps practices, automated testing, and continuous improvement of monitoring precision.

NRE Operations

Overall, the presentation outlines practical steps for building AI‑assisted network operations, emphasizing system modeling, data quality, and the balance between algorithmic sophistication and operational practicality.
