
AIOps Practices for Incident Management at Meituan: From Risk Prevention to Post‑Operation

This article presents Meituan's two‑year exploration of AIOps in incident management, detailing risk‑prevention change detection, real‑time anomaly discovery, automated root‑cause diagnosis, multi‑dimensional KPI analysis, and similar‑event recommendation, while sharing architectural designs, algorithmic techniques, performance results, and future directions.

High Availability Architecture

Jan 9, 2024

0 Preface

In this article, "incident" covers not only failures but also the alerts and anomalies encountered during operations.

"An incident is an unplanned interruption to an IT Service or a reduction in the Quality of an IT Service." – ITIL

1 Background

Building on the earlier Meituan AIOps article on anomaly detection, the Horae platform has accumulated expertise in single‑time‑series anomaly detection and intelligent alerting. Leveraging this foundation, the operations team has applied AIOps to incident management over the past two years.

Incident management complexity stems from two aspects:

Data volume and diversity – alerts, topology, metrics, logs, and changes (including releases) must be processed in real time and interpreted with strong domain knowledge.

Process complexity – a detailed incident timeline that requires efficiency improvements at every stage.

Meituan has built a rich toolset based on expert rules, configuration, and workflow control. The AIOps practice described here is divided into four modules:

Risk Prevention – Intelligent Change‑Risk Detection: rule‑based and ML analysis of user and entity behavior.

Fault Discovery – Intelligent Metric Anomaly Identification: statistical and ML models to spot abnormal patterns.

Incident Handling – Diagnosis and Remedy Recommendation: multimodal data and rule engines to locate faults and suggest mitigation.

Incident Operations – Similar‑Fault Recommendation: NLP techniques to retrieve analogous past incidents.

2 Overview of AI Capabilities in Incident Management

The capability framework includes the four modules above, illustrated in the accompanying diagram.

3 AIOps Scenarios in Incident Management

3.1 Pre‑incident Prevention

3.1.1 Risk Identification

Change detection is split into pre‑, mid‑, and post‑change stages. Pre‑change risk alerts have high ROI but limited reference data, making detection harder. Mid‑ and post‑change detection can leverage gray‑release metrics for higher accuracy. Collaboration with the MCM change‑control platform enables detection of anomalies across these stages.

Pre‑change: configuration‑change risk checks based on constraints learned from historically valid changes (structure, delimiter, consistency, etc.).

Mid/Post‑change: identify metric anomalies caused by erroneous changes during gray releases, comparing the changed targets against reference groups whose metrics closely match theirs, so that unrelated fluctuations are not mistaken for change impact.
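The pre‑change check can be pictured with a minimal sketch. It assumes that constraints are inferred from the historically accepted values of a single configuration key and that only three features matter (structure, delimiter, field count). The function names `describe`, `learn_constraints`, and `check_change` and the delimiter list are hypothetical and do not come from MCM.

```python
import json

# Assumption: the candidate delimiters worth checking for plain-text values.
DELIMITERS = [",", ";", "|", ":"]


def describe(value):
    """Summarise one config value as a few structural features."""
    try:
        structure = type(json.loads(value)).__name__  # dict, list, int, ...
    except ValueError:
        structure = "text"
    delimiter, fields = None, 1
    if structure == "text":
        counts = {d: value.count(d) for d in DELIMITERS}
        if any(counts.values()):
            delimiter = max(counts, key=counts.get)
            fields = len(value.split(delimiter))
    return {"structure": structure, "delimiter": delimiter, "fields": fields}


def learn_constraints(history):
    """Keep only the features that were identical in every valid change."""
    profiles = [describe(v) for v in history]
    constraints = {}
    for key in ("structure", "delimiter", "fields"):
        seen = {p[key] for p in profiles}
        if len(seen) == 1:
            constraints[key] = seen.pop()
    return constraints


def check_change(new_value, constraints):
    """Return a human-readable violation for every broken constraint."""
    profile = describe(new_value)
    return [
        f"{key}: expected {expected!r}, got {profile[key]!r}"
        for key, expected in constraints.items()
        if profile[key] != expected
    ]
```

With a history of comma‑separated IP lists, a new value that switches to semicolons yields one delimiter violation, while a value in the usual shape yields none. A production check would need many more features and a tolerance for legitimate format changes.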

The algorithm workflow:

Remove outlier sequences from reference data using an optimized adaptive DBSCAN clustering.

Detect anomalies in the target series (point, contextual, subsequence patterns).

The detector is deployed in MCM to re‑inspect changes on core platform clusters, where it achieves high detection accuracy (see Figures 3‑4).
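The two workflow steps can be sketched as follows. This is not the optimized adaptive DBSCAN used in production: it is textbook DBSCAN in pure Python, with `eps` picked from the median k‑th nearest‑neighbour distance as a stand‑in for the adaptive part, and only point anomalies are checked. The helper names and the parameters `scale`, `k`, and `min_spread` are assumptions made for illustration.

```python
import math
from statistics import mean, median, pstdev


def dist(a, b):
    """Euclidean distance between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adaptive_eps(series, k, scale=1.5):
    """Assumed heuristic: scaled median distance to the k-th nearest series."""
    kth = []
    for i, s in enumerate(series):
        d = sorted(dist(s, t) for j, t in enumerate(series) if j != i)
        kth.append(d[k - 1])
    return scale * median(kth)


def dbscan(series, eps, min_pts):
    """Plain DBSCAN; returns one label per series, -1 meaning noise."""
    n = len(series)
    neighbours = [
        [j for j in range(n) if dist(series[i], series[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels


def clean_reference(reference, min_pts=3):
    """Step 1: drop reference series that DBSCAN marks as noise."""
    eps = adaptive_eps(reference, k=min_pts - 1)
    labels = dbscan(reference, eps, min_pts)
    return [s for s, label in zip(reference, labels) if label != -1]


def point_anomalies(target, reference, k=3.0, min_spread=1.0):
    """Step 2: indices where the target leaves mean +/- k * spread."""
    flagged = []
    for t, value in enumerate(target):
        column = [s[t] for s in reference]
        spread = max(pstdev(column), min_spread)
        if abs(value - mean(column)) > k * spread:
            flagged.append(t)
    return flagged
```

The point of cleaning first is that a single outlying reference series inflates the spread of the band, so a genuine deviation in the target can go unnoticed. Contextual and subsequence anomalies would need additional detectors that this sketch does not cover.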

3.2 In‑incident Fast Recovery

Improving MTTD, MTTT, and MTTR metrics is essential. The system provides:

Real‑time anomaly detection using similarity of neighboring points in time‑series.

Pre‑detection to filter normal points, followed by feature extraction and model classification.

Weekly sampling and review to maintain >98% precision and recall.
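As a small worked example of the weekly review, precision and recall can be computed from a sample in which each detector verdict is paired with the reviewer's label. The function name and the input shape are assumptions, not the Horae interface.

```python
def precision_recall(samples):
    """samples: iterable of (predicted_anomaly, reviewer_says_anomaly) pairs."""
    samples = list(samples)
    tp = sum(1 for predicted, actual in samples if predicted and actual)
    fp = sum(1 for predicted, actual in samples if predicted and not actual)
    fn = sum(1 for predicted, actual in samples if not predicted and actual)
    # Convention chosen here: with nothing to measure, report a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

For a sample with 99 true positives, 1 false positive, and 1 false negative, both values come out at 0.99, which would clear the 98% bar.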

3.2.1 Anomaly Discovery

A similarity‑based algorithm flags points that deviate from historical distributions. The detection pipeline includes pre‑filtering, feature extraction, classification, and feedback loops for model improvement.
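A minimal sketch of the pre‑filter, feature extraction, and classification stages follows. The concrete features, the thresholds, and the fixed rule that stands in for the trained classifier are all assumptions. The source states only that neighbouring‑point similarity is used and that a model classifies the extracted features.

```python
from statistics import mean, median, pstdev


def is_similar_to_neighbours(history, value, window=5, tolerance=0.1):
    """Pre-detection: a point close to its recent neighbours is normal."""
    centre = median(history[-window:])
    scale = max(abs(centre), 1e-9)
    return abs(value - centre) / scale <= tolerance


def extract_features(history, value, period):
    """Features for points that did not pass the pre-filter."""
    recent = history[-period:]
    spread = pstdev(recent) or 1e-9
    same_phase = history[-period]  # value one full period earlier
    return {
        "zscore": (value - mean(recent)) / spread,
        "seasonal_ratio": abs(value - same_phase) / max(abs(same_phase), 1e-9),
    }


def classify(features, z_limit=3.0, seasonal_limit=0.2):
    """Stand-in for the trained model: a fixed rule on both features."""
    return (
        abs(features["zscore"]) > z_limit
        and features["seasonal_ratio"] > seasonal_limit
    )


def detect(history, value, period):
    """Return True when the new point is judged anomalous."""
    if is_similar_to_neighbours(history, value):
        return False  # filtered out cheaply, no features needed
    return classify(extract_features(history, value, period))
```

The pre‑filter keeps the expensive path for the small fraction of points that look unusual. The seasonal feature prevents a recurring peak from being flagged merely because it differs from its immediate neighbours. In a feedback loop, reviewed verdicts would be used to retrain the classifier that `classify` approximates here.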

3.2.2 Root‑Cause Diagnosis

Automatic root‑cause localization reduces MTTT. Techniques include:

Link expansion: building service call graphs, applying optimized DBSCAN to prune and expand abnormal links.

High‑throughput link anomaly detection: processing millions of link records per minute with 1.5‑3 ms latency, achieving ~81% F1.

Multi‑dimensional KPI root‑cause analysis: handling exponential growth of KPI dimensions via automated time‑range framing, multi‑timestamp drilling, and importance‑based pruning.
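Importance‑based pruning in the multi‑dimensional KPI analysis can be illustrated with a short sketch. It assumes each leaf row carries an expected and an actual KPI value, scores an attribute combination by the share of the total deviation it explains, and extends only the combinations that survive the first layer. Time‑range framing and multi‑timestamp drilling are not shown. The names `explanatory_power` and `locate` and the `keep` threshold are hypothetical.

```python
from itertools import combinations


def explanatory_power(rows, combo, total_delta):
    """Share of the total deviation carried by rows matching the combo."""
    delta = sum(
        r["actual"] - r["expected"]
        for r in rows
        if all(r[dim] == val for dim, val in combo)
    )
    return delta / total_delta


def locate(rows, dimensions, keep=0.8):
    """Return the most specific combinations that explain >= keep."""
    total_delta = sum(r["actual"] - r["expected"] for r in rows)
    if total_delta == 0:
        return []
    # Layer 1: single attribute values; prune those that explain too little.
    layer = []
    for dim in dimensions:
        for val in sorted({r[dim] for r in rows}):
            combo = ((dim, val),)
            if explanatory_power(rows, combo, total_delta) >= keep:
                layer.append(combo)
    # Layer 2: combine only survivors that come from different dimensions.
    pairs = []
    for a, b in combinations(layer, 2):
        if a[0][0] != b[0][0]:
            combo = a + b
            if explanatory_power(rows, combo, total_delta) >= keep:
                pairs.append(combo)
    return pairs or layer
```

Because pruned values are never extended, the number of combinations examined grows with the number of survivors instead of with the full cross product of dimensions. A real system would go beyond two layers and would need a more robust score than a single ratio.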

3.2.3 Similar‑Event Recommendation

The system tokenizes both structured and textual fields and vectorizes them with TF‑IDF, retrieves the top‑k most similar historical events, and then re‑ranks them by text richness, recency, root‑cause match, and alarm match. The final recommendation score combines similarity with these rule‑based features, achieving ~76% accuracy and a 28% reduction in MTTR for cases with recommended similar events.
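The retrieve‑then‑re‑rank flow can be sketched in pure Python. The tokenizer, the smoothed IDF formula, the event fields (`id`, `text`, `alarm`, `age_days`), and the bonus weights are assumptions for illustration. Only alarm match and recency are modelled, and the actual scoring used in production is not published in this article.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors with a smoothed IDF term."""
    tokens = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for doc in tokens for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in tokens:
        tf = Counter(doc)
        vectors.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend(query, history, k=3, alarm_weight=0.2, recency_weight=0.1):
    """Return (score, event_id) pairs, best first."""
    texts = [e["text"] for e in history] + [query["text"]]
    *hist_vecs, query_vec = tfidf_vectors(texts)
    # Stage 1: top-k by text similarity.
    scored = sorted(
        ((cosine(query_vec, v), e) for v, e in zip(hist_vecs, history)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Stage 2: add rule-based bonuses (hypothetical weights) and re-rank.
    reranked = []
    for similarity, event in scored:
        bonus = alarm_weight * (event["alarm"] == query["alarm"])
        bonus += recency_weight / (1 + event["age_days"] / 30)
        reranked.append((similarity + bonus, event["id"]))
    return sorted(reranked, reverse=True)
```

Fitting the IDF on the history plus the query keeps the sketch short. A service would fit it once on the historical corpus and reuse it for every incoming event.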

3.3 Post‑incident Operations

COE (Correction Of Error) records post‑mortem analyses. NLP‑driven topic modeling and similarity recommendation help users discover related incidents and common issues.

4 Summary and Future Outlook

The article summarizes Meituan's AIOps journey across pre‑incident prevention, in‑incident handling, and post‑incident operations, and outlines future directions such as intelligent log detection (template‑time‑series and semantic anomalies) and smart change recognition using feature similarity.

5 Authors

Zhèng Dōng, Yíng Gǎng, Zhāng Lín, Jùn Fēng – Meituan Basic R&D Platform.

6 References

Ester et al., "A Density‑Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", AAAI Press, 1996.

Li et al., "Generic and Robust Localization of Multi‑dimensional Root Causes", ISSRE 2019.

He et al., "Drain: An Online Log Parsing Approach with Fixed Depth Tree", ICWS 2017.

Du & Li, "Spell: Streaming Parsing of System Event Logs", ICDM 2016.

IBM, Drain3, https://github.com/IBM/Drain3

Aizawa, "An information‑theoretic perspective of tf–idf measures", Information Processing and Management, 2003.

Blei, Ng, & Jordan, "Latent Dirichlet Allocation", JMLR, 2003.

Cao Zhen, Wei Yuan, "Design and Implementation of a Database Anomaly Monitoring System Based on AI Algorithms" (in Chinese).

