
Intelligent Operations Practices: Multi‑Dimensional Anomaly Detection, Alarm Merging, Knowledge‑Graph Construction, and Root‑Cause Analysis

This article summarizes the keynote on intelligent operations presented at the 13th GOPS Global Operations Conference, covering multi‑dimensional anomaly detection, smart alarm aggregation, the construction of an operations knowledge graph, and AI‑driven root‑cause analysis techniques for large‑scale server environments.

58 Tech

Nov 4, 2019

The 13th GOPS Global Operations Conference featured a keynote by Gong Cheng, Deputy Director of Technology at 58.com, titled “Intelligent Operations Practices for Tens of Thousands of Servers.” The talk introduced four core topics: multi‑dimensional anomaly detection, intelligent alarm merging, operations knowledge‑graph construction, and AI‑driven root‑cause analysis.

Background: 58’s intelligent monitoring system provides a unified, 24/7 monitoring stack covering the network, server, system, application, and business layers. Beyond traditional data collection, storage, alerting, and visualization, the platform adds predictive analytics, anomaly detection, alarm merging, fault self‑healing, and early‑warning capabilities.

1. Multi‑Dimensional Anomaly Detection: Monitoring metrics fall into three groups: those suited to static thresholds, those suited to dynamic thresholds, and those that require intelligent detection. Static thresholds work for metrics with known ranges (e.g., CPU, memory). Dynamic thresholds adapt to each metric's historical distribution. For metrics with periodic patterns, machine‑learning classifiers are trained after unsupervised labeling and feature engineering.
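As a rough illustration of the dynamic‑threshold idea, the sketch below derives a band from a metric's recent history and flags values outside it. The talk summary does not say how 58 computes its thresholds, so the mean ± k·σ rule, the function names, and the sample data are all assumptions chosen for simplicity.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Return (low, high) bounds as mean +/- k standard deviations.

    Assumption: a simple k-sigma band stands in for whatever
    distribution-based rule the production system actually uses.
    """
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """Flag a value that falls outside the band derived from history."""
    low, high = dynamic_threshold(history, k)
    return value < low or value > high


# Hypothetical request-rate samples for one service.
history = [100, 102, 98, 101, 99, 100, 103, 97]
low, high = dynamic_threshold(history)
print(low, high)                   # bounds derived from history
print(is_anomalous(105, history))  # inside the band
print(is_anomalous(120, history))  # outside the band
```

A static threshold is the degenerate case where `low` and `high` are fixed constants rather than recomputed from history.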

2. Intelligent Alarm Merging: To reduce alarm fatigue, a decision‑tree‑inspired strategy selects merge dimensions automatically by minimizing the Gini impurity of the resulting alarm groups, producing an alarm‑merge tree. This reduces alert volume while preserving the information needed for rapid fault isolation.
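One way to read the Gini‑based selection is sketched below: for each candidate dimension, compute the Gini impurity of its value distribution across the pending alarms, merge on the dimension with the lowest impurity, and recurse on the remaining dimensions. This is an interpretation, not the method confirmed by the talk; the alarm fields (`cluster`, `host`, `metric`) and function names are hypothetical.

```python
from collections import Counter, defaultdict


def gini(values):
    """Gini impurity of a list of categorical values: 1 - sum(p_i^2)."""
    n = len(values)
    counts = Counter(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_dimension(alarms, dims):
    """Pick the dimension whose values are most concentrated (lowest Gini)."""
    return min(dims, key=lambda d: gini([a[d] for a in alarms]))


def build_merge_tree(alarms, dims):
    """Recursively group alarms, returning nested dicts keyed by (dim, value).

    Leaves are plain lists of alarms that cannot be split further.
    """
    if not dims or len(alarms) <= 1:
        return alarms
    dim = best_dimension(alarms, dims)
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm[dim]].append(alarm)
    rest = [d for d in dims if d != dim]
    return {(dim, value): build_merge_tree(members, rest)
            for value, members in groups.items()}


# Hypothetical alarms; field names are assumptions for illustration.
alarms = [
    {"cluster": "pay", "host": "h1", "metric": "cpu"},
    {"cluster": "pay", "host": "h2", "metric": "cpu"},
    {"cluster": "pay", "host": "h3", "metric": "cpu"},
    {"cluster": "pay", "host": "h4", "metric": "latency"},
    {"cluster": "search", "host": "h5", "metric": "latency"},
]
dims = ["cluster", "host", "metric"]
tree = build_merge_tree(alarms, dims)
print(best_dimension(alarms, dims))
print(list(tree.keys()))
```

Here most alarms share a cluster, so merging on `cluster` first yields one large group and one small one, which is the kind of compact summary an on‑call engineer can scan quickly.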

3. Operations Knowledge‑Graph Construction: Diverse operational data sources (CMDB, monitoring databases) are integrated into a knowledge graph that captures entity relationships, causality, and operational patterns. Several techniques enrich the graph: isolation‑forest filtering removes erroneous service‑cluster mappings, decision‑tree training produces accurate mappings, and Apriori mining extracts frequent itemsets.
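The Apriori step can be sketched in a few lines. The summary does not say what the transactions are, so this example assumes each transaction is the set of alarms that fired in one time window; the alarm names and the `min_support` value are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1 candidates: every individual item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent itemsets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))}
    return frequent


# Hypothetical alarm co-occurrence windows.
windows = [
    {"db_slow", "api_timeout", "cpu_high"},
    {"db_slow", "api_timeout"},
    {"db_slow", "api_timeout", "disk_full"},
    {"cpu_high"},
]
frequent = apriori(windows, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

Frequent pairs such as `db_slow` with `api_timeout` are candidates for association edges in the graph, subject to review before they are treated as causal.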

4. Intelligent Root‑Cause Analysis: Leveraging the knowledge graph, a root‑cause analysis manager selects the appropriate analysis modules based on real‑time anomalies. Both state‑machine and behavior‑tree models are explored; the latter offers modularity and low coupling. The system produces concise root‑cause descriptions and visualizations that trace fault propagation across hosts, VMs, and clusters.
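The modularity claimed for behavior trees is easier to see in code. The following is a minimal sketch, not 58's implementation: it omits the RUNNING state that full behavior trees carry, and the node classes, context keys, and cause strings are hypothetical.

```python
class Sequence:
    """Succeeds only if every child succeeds, evaluated left to right."""
    def __init__(self, *children):
        self.children = children

    def run(self, ctx):
        return all(child.run(ctx) for child in self.children)


class Selector:
    """Succeeds as soon as one child succeeds, evaluated left to right."""
    def __init__(self, *children):
        self.children = children

    def run(self, ctx):
        return any(child.run(ctx) for child in self.children)


class Condition:
    """Leaf that tests the current anomaly context."""
    def __init__(self, predicate):
        self.predicate = predicate

    def run(self, ctx):
        return bool(self.predicate(ctx))


class Conclude:
    """Leaf that records a root-cause description and succeeds."""
    def __init__(self, cause):
        self.cause = cause

    def run(self, ctx):
        ctx["root_cause"] = self.cause
        return True


# Each branch is an independent analysis module; adding a new one
# does not require touching the others.
rca_tree = Selector(
    Sequence(Condition(lambda c: c.get("host_down", False)),
             Conclude("physical host failure")),
    Sequence(Condition(lambda c: c.get("cpu", 0) > 90),
             Conclude("CPU saturation")),
    Conclude("unknown, escalate to on-call"),
)

ctx = {"host_down": False, "cpu": 95}
rca_tree.run(ctx)
print(ctx["root_cause"])
```

In a state machine, the same logic would need explicit transitions between every pair of related states, which is where the coupling tends to creep in.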

Conclusion: As server counts and business complexity grow, traditional operations practices no longer suffice. Continuously integrating AI techniques (anomaly detection, alarm merging, knowledge‑graph reasoning, and automated root‑cause analysis) improves the scalability and reliability of large‑scale infrastructure.



