Intelligent Operations Practices: Multi‑Dimensional Anomaly Detection, Alarm Merging, Knowledge‑Graph Construction, and Root‑Cause Analysis
This article summarizes the keynote on intelligent operations presented at the 13th GOPS Global Operations Conference, covering multi‑dimensional anomaly detection, smart alarm aggregation, the construction of an operations knowledge graph, and AI‑driven root‑cause analysis techniques for large‑scale server environments.
The 13th GOPS Global Operations Conference featured a keynote by Gong Cheng, Deputy Director of Technology at 58.com, titled “Intelligent Operations Practices for Tens of Thousands of Servers.” The talk introduced four core topics: multi‑dimensional anomaly detection, intelligent alarm merging, operations knowledge‑graph construction, and AI‑driven root‑cause analysis.
Background : 58’s intelligent monitoring system provides a unified, 24/7 monitoring stack covering network, server, system, application, and business layers. Beyond traditional data collection, storage, alerting, and visualization, the platform adds predictive analytics, anomaly detection, alarm merging, fault self‑healing, and pre‑warning capabilities.
1. Multi‑Dimensional Anomaly Detection : Monitoring metrics fall into three groups: those suited to static thresholds, those suited to dynamic thresholds, and those that require intelligent detection. Static thresholds work for metrics with known ranges (e.g., CPU, memory). Dynamic thresholds are derived from each metric's historical distribution, so they adapt as that distribution shifts. For metrics with periodic patterns, machine‑learning classifiers are trained after unsupervised labeling and feature engineering.
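The difference between the first two groups can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the mean plus or minus k standard deviations rule, the function names, and the sample values are all assumptions chosen for clarity.

```python
from statistics import mean, stdev

def exceeds_static(value, upper_limit):
    """Static threshold: compare against a fixed, known limit."""
    return value > upper_limit

def dynamic_bounds(history, k=3.0):
    """Dynamic threshold (assumed rule): mean +/- k sample standard deviations.

    `history` needs at least two points, otherwise stdev() raises.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma

def is_dynamic_anomaly(value, history, k=3.0):
    """Flag a value that falls outside the bounds learned from history."""
    low, high = dynamic_bounds(history, k)
    return value < low or value > high

# Hypothetical data: CPU utilisation (%) and a requests-per-minute series.
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(exceeds_static(95.0, upper_limit=90.0))
print(dynamic_bounds(history))
print(is_dynamic_anomaly(120, history))
print(is_dynamic_anomaly(101, history))
```

A fixed limit suits a bounded metric such as CPU utilisation, while the bounds for the request series move with whatever window of history is supplied.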
2. Intelligent Alarm Merging : To reduce alarm fatigue, a decision‑tree‑inspired merging strategy minimizes the Gini impurity of alarm groups, automatically selecting merge dimensions and generating an alarm‑merge tree. This approach cuts down the volume of alerts while preserving essential information for rapid fault isolation.
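One way to read the Gini‑based selection is sketched below. The talk summary does not give the exact formulation, so this is an assumption: for each candidate dimension, compute the Gini impurity of the values the alarms take on it, and merge on the dimension with the lowest impurity (the one on which most alarms agree). The alarm fields and names are hypothetical.

```python
from collections import Counter, defaultdict

def gini(values):
    """Gini impurity of a list of labels: 1 - sum(p_i ** 2)."""
    n = len(values)
    counts = Counter(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def pick_merge_dimension(alarms, dimensions):
    """Choose the dimension whose values are most concentrated (lowest impurity)."""
    return min(dimensions, key=lambda d: gini([a[d] for a in alarms]))

def merge_alarms(alarms, dimensions):
    """Group alarms by the selected dimension; returns (dimension, groups)."""
    dim = pick_merge_dimension(alarms, dimensions)
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm[dim]].append(alarm)
    return dim, dict(groups)

# Hypothetical alarms; the field names are illustrative only.
alarms = [
    {"cluster": "pay", "metric": "cpu", "host": "h1"},
    {"cluster": "pay", "metric": "mem", "host": "h2"},
    {"cluster": "pay", "metric": "cpu", "host": "h3"},
    {"cluster": "search", "metric": "disk", "host": "h4"},
]
dim, groups = merge_alarms(alarms, ["cluster", "metric", "host"])
print(dim, {key: len(members) for key, members in groups.items()})
```

Applying `merge_alarms` again inside each group, with the chosen dimension removed from the candidate list, would yield a tree of merged alarms along the lines the talk describes.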
3. Operations Knowledge‑Graph Construction : Diverse operational data sources (CMDB, monitoring databases) are integrated into a knowledge graph that captures entity relationships, causality, and operational patterns. Several techniques enrich the graph: isolation‑forest filtering removes erroneous service‑cluster mappings, decision‑tree training yields accurate mappings, and Apriori mining extracts frequent itemsets.
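The Apriori step can be illustrated with a small pure‑Python sketch. What the talk treats as an "item" is not stated in this summary; the example assumes each transaction is the set of alarm types seen in one time window, and all names and data are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical alarm types observed together in four time windows.
windows = [
    {"db_slow", "api_timeout", "cpu_high"},
    {"db_slow", "api_timeout"},
    {"db_slow", "api_timeout", "disk_full"},
    {"cpu_high"},
]
frequent = apriori(windows, min_support=0.5)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Itemsets that recur together, such as a slow database alongside API timeouts in this toy data, are candidates for association edges in the graph.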
4. Intelligent Root‑Cause Analysis : Leveraging the knowledge graph, a root‑cause analysis manager selects appropriate analysis modules based on real‑time anomalies. Both state‑machine and behavior‑tree models are explored; the latter offers modularity and low coupling. The system produces concise root‑cause descriptions and visualizations that trace fault propagation across hosts, VMs, and clusters.
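The modularity of the behavior‑tree approach can be seen in a minimal sketch. This is not the system described in the talk: the node types are the generic selector and sequence nodes of behavior trees, and the anomaly keys, the 0.2 error‑rate threshold, and the cause descriptions are hypothetical.

```python
SUCCESS, FAILURE = "success", "failure"

class Condition:
    """Leaf that succeeds when its predicate holds for the context."""
    def __init__(self, predicate):
        self.predicate = predicate
    def tick(self, ctx):
        return SUCCESS if self.predicate(ctx) else FAILURE

class Action:
    """Leaf that records a root-cause description and succeeds."""
    def __init__(self, description):
        self.description = description
    def tick(self, ctx):
        ctx["root_cause"] = self.description
        return SUCCESS

class Sequence:
    """Succeeds only if every child succeeds, evaluated in order."""
    def __init__(self, *children):
        self.children = children
    def tick(self, ctx):
        for child in self.children:
            if child.tick(ctx) == FAILURE:
                return FAILURE
        return SUCCESS

class Selector:
    """Tries children in order and stops at the first one that succeeds."""
    def __init__(self, *children):
        self.children = children
    def tick(self, ctx):
        for child in self.children:
            if child.tick(ctx) == SUCCESS:
                return SUCCESS
        return FAILURE

# Each branch is an independent analysis module; adding one does not touch the others.
tree = Selector(
    Sequence(Condition(lambda c: c.get("host_down", False)),
             Action("host failure")),
    Sequence(Condition(lambda c: c.get("error_rate", 0.0) > 0.2),
             Action("cluster overload")),
    Action("unknown"),
)

def analyze(anomalies):
    """Run the tree over a dict of observed anomalies and return the cause."""
    ctx = dict(anomalies)
    tree.tick(ctx)
    return ctx["root_cause"]

print(analyze({"host_down": True}))
print(analyze({"error_rate": 0.3}))
print(analyze({}))
```

A state machine covering the same checks would need explicit transitions between every pair of related states, which is where the coupling the talk mentions comes from.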
Conclusion : As server counts and business complexity grow, traditional operations practices no longer suffice. Continued integration of AI techniques (anomaly detection, alarm merging, knowledge‑graph reasoning, and automated root‑cause analysis) improves the scalability and reliability of large‑scale infrastructure.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.