How Alibaba’s EMonitor Achieves 96% Accurate Automated Root‑Cause Analysis
This article explains how Alibaba’s EMonitor platform combines comprehensive metric collection, a custom time‑series database, and a metric‑drill‑down algorithm to automatically pinpoint root causes of service failures with 96% accuracy and sub‑second response times.
Background
Alibaba Group set a "1/5/10" goal for incident handling – 1 minute to detect, 5 minutes to locate, and 10 minutes to recover – demanding higher fault‑location capabilities.
EMonitor is an integrated tracing and metrics system serving all technical departments of Ele.me, covering front‑end monitoring, access‑layer monitoring, business trace and metric monitoring, middleware monitoring, and container, host, and network monitoring.
It processes petabytes of data daily, writes hundreds of terabytes of metric data, and handles tens of millions of metric queries per day, yet during incidents users still spend considerable time manually inspecting data.
Root‑Cause Analysis Modeling
Industry solutions often achieve only 40%–70% accuracy, highlighting the difficulty of root‑cause analysis.
The article focuses on understanding the problem before proposing algorithms.
What Problems to Solve
For an application with many containers providing SOA or HTTP services and depending on DB, Redis, MQ, etc., we need to control:
Latency and status of all entry services.
Latency and status of each operation after an entry service.
Entry services include SOA, HTTP, MQ consumer, scheduled jobs, and others. Each entry may involve five operation types (DB remote, Redis remote, MQ remote, RPC remote, local) plus exception‑throwing behavior.
We must collect latency and status for each operation and exception statistics.
Remote operations are broken down into three components: client connection/request/response time, network time, and server‑side execution time.
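The data model described above can be sketched as follows. This is a hypothetical illustration, not EMonitor's actual schema: entry services record latency and status per operation, and remote operations carry the three-component breakdown (client, network, server time).

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative operation taxonomy from the article: four remote types plus local.
OPERATION_TYPES = ("db_remote", "redis_remote", "mq_remote", "rpc_remote", "local")

@dataclass
class RemoteBreakdown:
    client_ms: float   # client connection/request/response time
    network_ms: float  # time spent on the network
    server_ms: float   # server-side execution time

@dataclass
class Operation:
    op_type: str                                 # one of OPERATION_TYPES
    latency_ms: float
    success: bool
    breakdown: Optional[RemoteBreakdown] = None  # only present for remote ops

@dataclass
class EntryServiceSample:
    entry_type: str    # SOA, HTTP, MQ consumer, scheduled job, ...
    name: str
    latency_ms: float
    success: bool
    operations: List[Operation] = field(default_factory=list)
    exceptions: List[str] = field(default_factory=list)

# One SOA entry with a slow DB call whose breakdown shows the server side dominates.
sample = EntryServiceSample(
    entry_type="SOA", name="orderService.create", latency_ms=850.0, success=True,
    operations=[Operation("db_remote", 800.0, True,
                          RemoteBreakdown(client_ms=5.0, network_ms=15.0, server_ms=780.0))],
)
slow = max(sample.operations, key=lambda op: op.latency_ms)
print(slow.op_type, slow.breakdown.server_ms)  # db_remote 780.0
```

With data shaped like this, the question "which component of which operation is slow" becomes a direct lookup rather than guesswork.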
Fault Conclusions
With comprehensive data, we can answer:
Which entry services are affected?
Are local operations of the affected entries impacted?
Which remote operations of the affected entries are impacted?
What exceptions were thrown?
These conclusions are data‑driven and highly accurate. However, second‑class root causes such as GC issues, container problems, and change‑related issues cannot be proven with data and remain speculative.
Root‑Cause Analysis Implementation
After defining the required fault conclusions, the basic feature "Metric Drill‑Down Analysis" is introduced.
Metric Drill‑Down Analysis
A metric with multiple tags can be drilled down to identify which tag combination is causing a fluctuation.
Examples:
Identify the data center, DAL group, table, operation type, or specific SQL causing DB latency spikes.
Identify the data center, remote appId, or remote method causing RPC latency spikes.
The approach was used in last year’s AIOps competition.
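The core of drill-down can be sketched as a contribution calculation: compare each tag combination's value before and during the spike, and rank combinations by their share of the total increase. The function and data below are illustrative, not EMonitor's implementation.

```python
# Hypothetical tag-level drill-down: rank tag combinations by their
# contribution to the overall increase of a metric.

def drill_down(baseline, current):
    """Return tag combinations ranked by share of the total increase."""
    total_delta = sum(current.values()) - sum(baseline.values())
    contributions = {
        tags: (current.get(tags, 0.0) - baseline.get(tags, 0.0)) / total_delta
        for tags in set(baseline) | set(current)
    }
    return sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)

# Per-(data center, table) DB latency totals in ms: one table spikes.
baseline = {("dc1", "orders"): 100.0, ("dc1", "users"): 120.0, ("dc2", "orders"): 90.0}
current  = {("dc1", "orders"): 900.0, ("dc1", "users"): 125.0, ("dc2", "orders"): 95.0}

ranked = drill_down(baseline, current)
print(ranked[0][0])  # ('dc1', 'orders') dominates the spike
```

In this toy data, the `orders` table in `dc1` accounts for nearly 99% of the latency increase, so it surfaces as the top candidate.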
Root‑Cause Analysis Process
Check entry service response time and success rate; if both are normal, no root-cause analysis is needed.
Perform metric drill‑down on the five operation types to find abnormal fluctuations.
Drill down further on the affected operations to determine:
Which entries are impacted?
Which operation attributes are abnormal?
If remote, analyze server‑side components.
If three‑element remote operation data (client time, network time, server time) is available, we can pinpoint the exact cause; otherwise we can only speculate.
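The process above can be sketched as a short control flow. Every function name here is hypothetical, standing in for EMonitor internals passed in as callables.

```python
# Illustrative sketch of the root-cause analysis flow; is_abnormal and
# drill_down stand in for EMonitor's anomaly detection and drill-down.

OPERATION_TYPES = ("db_remote", "redis_remote", "mq_remote", "rpc_remote", "local")

def analyze(entry_metrics, op_metrics, is_abnormal, drill_down):
    # Step 1: if entry response time and success rate look normal, stop.
    if not any(is_abnormal(m) for m in entry_metrics):
        return {"conclusion": "entries healthy, no analysis needed"}

    findings = {}
    # Steps 2-3: drill down on each of the five operation types and record
    # which tag combinations (entries, attributes) are abnormal.
    for op_type in OPERATION_TYPES:
        abnormal_tags = drill_down(op_metrics[op_type])
        if abnormal_tags:
            findings[op_type] = abnormal_tags
    return {"conclusion": "abnormal operations located", "findings": findings}

# Stub data: one spiking entry whose DB operation drills down to a table.
entry_metrics = [{"name": "orderService.create", "spike": True}]
op_metrics = {t: [] for t in OPERATION_TYPES}
op_metrics["db_remote"] = [("dc1", "orders")]

result = analyze(entry_metrics, op_metrics,
                 is_abnormal=lambda m: m["spike"],
                 drill_down=lambda series: series)
print(result["findings"])  # {'db_remote': [('dc1', 'orders')]}
```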
Deployment Results
48 out of 50 detailed cases located correctly – 96% accuracy.
A peak of more than 500 root-cause localizations in a single day.
Average localization time of 1 second.
Detailed localization results are displayed for each analysis.
Why Accuracy Is So High
1. Data Completeness
Converting full-link trace data into metrics avoids sampling error.
2. Modeling Accuracy
The model establishes strict data associations between entry services and each operation’s latency, enabling rigorous proof.
3. Adaptive Anomaly Detection
Instead of simple thresholds, the system calculates contribution of each metric to the overall fluctuation, improving precision.
Why Speed Is So Fast
1. No need to predict every time series.
2. Very few candidate solutions are evaluated.
3. Scoring supports arbitrary arithmetic on metrics.
Industry‑Leading Time‑Series Database LinDB
Root‑cause analysis requires group‑by queries over massive metric dimensions, demanding a powerful distributed time‑series database.
LinDB, developed internally for EMonitor, offers:
58:1 data compression ratio.
Efficient RoaringBitmap‑based index filtering.
Highly parallel query execution using Akka.
Its query performance is several to hundreds of times faster than InfluxDB.
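The bitmap-index idea behind fast tag filtering can be illustrated simply: each tag value maps to a bitmap of series IDs, and a filter query becomes a cheap bitmap intersection. Plain Python sets stand in for compressed RoaringBitmaps here; this is a sketch of the concept, not LinDB's code.

```python
# Inverted index from tag value to the set of series IDs carrying it.
# A real implementation would use compressed RoaringBitmaps instead of sets.
index = {
    ("dc", "dc1"):       {1, 2, 3, 4},
    ("dc", "dc2"):       {5, 6},
    ("table", "orders"): {1, 2, 5},
    ("table", "users"):  {3, 4, 6},
}

def series_matching(*tag_filters):
    """Intersect per-tag bitmaps to find series matching all filters."""
    result = None
    for tf in tag_filters:
        ids = index.get(tf, set())
        result = ids if result is None else result & ids
    return result or set()

print(series_matching(("dc", "dc1"), ("table", "orders")))  # {1, 2}
```

Because intersection cost depends on bitmap size rather than total series count, group-by queries over huge tag cardinalities stay fast.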
Metric Drill‑Down Algorithm Efficiency
Only a small subset of time series are predicted.
Very few solution candidates are computed.
Scoring works with any arithmetic operation, e.g., average response time = total time / total count.
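Scoring over a derived metric can be sketched like this: average response time is computed as total time divided by total count per tag, and the drill-down evaluates the ratio rather than only the raw counters. The data and names are illustrative.

```python
# Hypothetical scoring on a derived (ratio) metric: avg = total_time / total_count.

def avg_latency(totals):
    return {tag: t / c for tag, (t, c) in totals.items()}

# (total_time_ms, total_count) per downstream appId, before and during a spike.
baseline = {"appA": (10_000.0, 1_000), "appB": (20_000.0, 1_000)}
current  = {"appA": (10_500.0, 1_000), "appB": (80_000.0, 1_000)}

base_avg, cur_avg = avg_latency(baseline), avg_latency(current)
scores = {tag: cur_avg[tag] - base_avg[tag] for tag in cur_avg}
culprit = max(scores, key=scores.get)
print(culprit, scores[culprit])  # appB 60.0
```

Working on the ratio matters: a tag whose call count grew but whose per-call latency stayed flat would inflate the raw time sum without being the real culprit.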
Actual Cases
Case 1
When an application’s SOA service showed a latency spike, clicking root‑cause analysis revealed the affected entry services, their dependent services, and specific operation attributes (e.g., Redis connection latency, exception details).
Case 2
Another incident showed similar analysis, pinpointing a DB instance causing a DAL group’s operations to jitter.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.