
How AI-Powered Intelligent Operations Transform Network Fault Detection

This talk explains how Guangdong Mobile uses AI‑driven intelligent operations, including a centroid‑based fault‑location algorithm, standardized event‑distance models, and clustering techniques such as DBSCAN and nearest‑neighbor, to automate network alarm correlation, improve fault resolution, and enable predictive maintenance across a massive 4G/VoLTE network.

Efficient Ops

1. Challenges – Many

Guangdong Mobile operates one of China’s largest mobile networks, serving more than 120 million subscribers, including 80 million 4G users and 7 million VoLTE users. Rapid service growth and the transition toward 5G have produced a highly complex core network: more than 1.3 million network elements generate over 530,000 alarms per day, along with thousands of work orders.

The complexity creates two main challenges: (1) the core network has evolved into a multi‑network, multi‑link architecture that is difficult to manage, and (2) the shift to NFV means traditional proprietary equipment is being replaced by commodity servers, requiring staff to master both telecom and IT skills.

2. Solutions – More Than Challenges

To cope with the massive alarm volume, the team built a standardized event-distance model. Instead of relying solely on expert-defined correlation rules, the model assigns every pair of alarms a numeric distance that combines the time difference between them with their topological distance within the virtual network.
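The talk does not give the exact formula, so the following is only a minimal sketch of what such an event distance could look like. The `Alarm` fields, the `TOPOLOGY` graph, the additive combination, and the scaling parameters (`time_scale`, `hop_scale`, `unreachable`) are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    alarm_id: str
    element: str      # network element that raised the alarm
    timestamp: float  # seconds


# Hypothetical topology: undirected adjacency between network elements.
TOPOLOGY = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": set(),  # isolated element
}


def hop_distance(topology, src, dst):
    """Breadth-first search hop count; None if dst is unreachable from src."""
    if src == dst:
        return 0
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, depth = frontier.popleft()
        for nxt in topology.get(node, ()):
            if nxt == dst:
                return depth + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None


def event_distance(a, b, topology, time_scale=300.0, hop_scale=2.0, unreachable=10.0):
    """Assumed form: scaled time gap plus scaled hop count."""
    time_term = abs(a.timestamp - b.timestamp) / time_scale
    hops = hop_distance(topology, a.element, b.element)
    topo_term = unreachable if hops is None else hops / hop_scale
    return time_term + topo_term


a1 = Alarm("a1", "A", 0.0)
a2 = Alarm("a2", "C", 150.0)
a3 = Alarm("a3", "D", 0.0)
d12 = event_distance(a1, a2, TOPOLOGY)  # 150 s apart, 2 hops apart
d13 = event_distance(a1, a3, TOPOLOGY)  # same time, no path between elements
```

Dividing each term by a scale puts seconds and hops on comparable footing, which matters because the clustering step only sees the combined number.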

After defining a numeric distance, machine‑learning algorithms (DBSCAN clustering and nearest‑neighbor) were applied to automatically group related alarms and incidents. This quantitative approach enables the system to learn from data and classify events without manual rule updates.
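To make the clustering step concrete, here is a small plain-Python sketch of DBSCAN that works with any pairwise distance function. It is not the speaker's implementation: the alarm tuples, the `toy_distance` function, and the `eps` and `min_samples` values are assumptions, and a production system would more likely use a library implementation.

```python
from collections import deque


def dbscan(items, distance, eps, min_samples):
    """Return one label per item: 0, 1, ... for clusters, -1 for noise."""
    n = len(items)
    # Each neighborhood includes the point itself, as in the usual definition.
    neighbors = [
        [j for j in range(n) if distance(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing
    return labels


def toy_distance(a, b):
    """Assumed distance: scaled time gap plus scaled topological gap."""
    return abs(a[0] - b[0]) / 300.0 + abs(a[1] - b[1]) / 2.0


# Hypothetical alarms as (timestamp_seconds, position_in_topology).
alarms = [(0, 0), (60, 1), (120, 1), (5000, 7), (5060, 8), (20000, 3)]
labels = dbscan(alarms, toy_distance, eps=1.0, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

The first three alarms form one incident, the next two form another, and the isolated alarm is left as noise rather than forced into a group, which is one reason a density-based method suits alarm data.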

The implementation began with a single Python script on a server that calculated distances and stored results in a database. A simple front‑end was later added to visualize the clustered alarms. Within a year, the solution was piloted in five provinces and proved effective.

3. Results – Positive Impact

During night-shift monitoring, the system automatically correlated alarms from up to seven different subsystems, condensing more than 5,000 raw alarms into a few meaningful clusters and pinpointing the affected network element (e.g., EC77). Fault localization accuracy improved by about 25%.
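The summary mentions a centroid-based fault-location algorithm but gives no details. One possible reading, sketched below purely as an assumption, is to treat the alarmed elements of a cluster as points on the topology graph and pick the element with the smallest total hop distance to all of them. The topology and element names here are hypothetical.

```python
from collections import Counter, deque


def hops_from(topology, source):
    """Breadth-first search: hop count from source to every reachable element."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        for nxt in topology.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                frontier.append(nxt)
    return dist


def locate_fault(topology, alarmed_elements):
    """Return the element minimizing total hops to all alarms (assumed criterion)."""
    counts = Counter(alarmed_elements)  # repeated alarms weigh more
    best, best_cost = None, float("inf")
    for candidate in sorted(topology):  # sorted for deterministic tie-breaking
        dist = hops_from(topology, candidate)
        if any(element not in dist for element in counts):
            continue  # candidate cannot reach every alarmed element
        cost = sum(dist[element] * n for element, n in counts.items())
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best


# Hypothetical star topology: one shared element with four dependents.
topology = {
    "core-1": {"edge-1", "edge-2", "edge-3", "edge-4"},
    "edge-1": {"core-1"},
    "edge-2": {"core-1"},
    "edge-3": {"core-1"},
    "edge-4": {"core-1"},
}
suspect = locate_fault(topology, ["edge-1", "edge-2", "edge-3"])
print(suspect)  # core-1
```

In this toy case the alarms come from three edge elements, yet the suspect is the shared upstream element, which mirrors the idea of collapsing many alarms onto a single probable root cause.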

Fine‑grained change management was also introduced: each change operation is treated as an event with time and location attributes, allowing automatic correlation with subsequent alarms. This enabled more precise impact analysis and faster response to incidents.
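As an illustration of treating a change as an event with time and location, the sketch below links a change operation to alarms that follow it within a time window on the same or an adjacent element. The `Event` type, the `NEIGHBORS` map, and the 30-minute window are assumptions, not details from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    element: str      # where the change or alarm occurred
    timestamp: float  # seconds


# Hypothetical adjacency used to decide which elements a change can affect.
NEIGHBORS = {
    "gw-1": {"sw-1"},
    "sw-1": {"gw-1", "sw-2"},
    "sw-2": {"sw-1"},
}


def alarms_after_change(change, alarms, neighbors, window=1800.0):
    """Alarms raised within `window` seconds after the change, on or next to its element."""
    scope = {change.element} | neighbors.get(change.element, set())
    return [
        alarm for alarm in alarms
        if alarm.element in scope
        and 0 <= alarm.timestamp - change.timestamp <= window
    ]


change = Event("chg-1", "sw-1", 1000.0)
alarms = [
    Event("al-1", "sw-1", 1200.0),  # same element, shortly after the change
    Event("al-2", "gw-1", 2500.0),  # adjacent element, still inside the window
    Event("al-3", "sw-2", 900.0),   # before the change, so not attributed to it
    Event("al-4", "sw-2", 5000.0),  # outside the window
]
impacted = [a.event_id for a in alarms_after_change(change, alarms, NEIGHBORS)]
print(impacted)  # ['al-1', 'al-2']
```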

4. Future – Promising Outlook

The speaker envisions extending the quantitative‑to‑predictive workflow beyond telecom, applying it to domains such as healthcare where wearable devices could trigger alerts that are automatically linked to health symptoms. Continued development of the intelligent analysis platform aims to further automate fault detection, prediction, and remediation across the entire network.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
