Root Cause Localization Algorithm and Its Implementation for Service Fault Diagnosis
This article describes a root-cause localization algorithm implemented in vivo's monitoring platform. The algorithm automatically analyzes average-latency spikes by splitting service timelines into normal and abnormal regions, computing variance per downstream service, clustering the results with K-means, and recursively tracing downstream dependencies. It achieves over 85% accuracy for downstream-dependency failures, still requires human verification, and is slated for AI-driven enhancements.
This article, based on a fault‑localization project, introduces the principles of a root‑cause localization algorithm. It uses a combination of textual explanations and diagrams so that even non‑technical readers can understand the concepts.
Background
IT professionals know the pain of being woken up in the middle of the night to handle an incident. With the rapid adoption of microservice architectures, complex call chains and massive data volumes make fault investigation a major challenge.
vivo has built a comprehensive end‑to‑end monitoring system covering basic monitoring, generic monitoring, tracing, log monitoring, and synthetic testing. The huge volume of data generated daily raises the question of how to extract value from it. The industry is moving toward AIOps, and many root‑cause analysis algorithms have been proposed in academic papers and industry solutions. By combining vivo’s monitoring data with existing algorithms, a fault‑localization platform can be built.
Implementation Effects
The platform currently focuses on average latency issues, covering two scenarios: proactive queries and trace‑based alerts.
1.1 Proactive Query Scenario
When a user reports that an application is slow, traditionally engineers would manually check response times and then locate the cause, which is time‑consuming. With the fault‑localization platform, users can simply select the faulty service and time range on the homepage, and the system performs the analysis automatically.
1.2 Alert Scenario
When an alert about average response time is triggered, a “view reason” link under the alert leads directly to the root‑cause analysis results provided by the platform.
The trace view shows the service’s average latency spike, and a “Root Cause Analysis” button reveals the detailed analysis.
Analysis Process
The system’s workflow consists of three steps:
The frontend sends the abnormal service name and time range to the backend via an API.
The backend runs an analysis function that invokes the detection algorithm and returns downstream data (service/component name, variance, and point type).
The analysis function processes the results recursively: if pointType = END_POINT, that branch of the analysis stops; if pointType = RPC_POINT, the downstream service is analyzed in turn, forming the recursion.
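The recursive driver can be sketched as follows. This is a minimal illustration, not vivo's actual code: the service names, the `Detection` record, and the toy dependency table are all invented for the example.

```python
from dataclasses import dataclass

END_POINT = "END_POINT"   # leaf component: analysis stops here
RPC_POINT = "RPC_POINT"   # RPC call: recurse into the downstream service

@dataclass
class Detection:
    service: str
    variance: float
    point_type: str

# Toy detection results keyed by service; in the real platform this table
# would be produced by the variance/K-means detection algorithm.
TOY_RESULTS = {
    "gateway": [Detection("order-svc", 12.5, RPC_POINT)],
    "order-svc": [Detection("mysql", 30.1, END_POINT)],
}

def detect(service, start, end):
    """Stand-in for the detection algorithm."""
    return TOY_RESULTS.get(service, [])

def analyze(service, start, end):
    """Recursively trace root-cause candidates: RPC_POINT results are
    drilled into; END_POINT results terminate their branch."""
    causes = []
    for hit in detect(service, start, end):
        causes.append(hit)
        if hit.point_type == RPC_POINT:
            causes.extend(analyze(hit.service, start, end))
    return causes

chain = analyze("gateway", 0, 600)   # walks gateway -> order-svc -> mysql
```

The recursion bottoms out naturally: a service with no suspicious downstream results, or one whose results are all `END_POINT`, contributes no further calls.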
The detection algorithm is the core of the analysis. It operates as follows:
4.1 Algorithm Logic
The algorithm first identifies the start time, variance‑start time, and variance‑end time of the abnormal service. It then splits the downstream service timelines into a normal region (start → variance‑start) and an abnormal region (variance‑start → variance‑end) and computes the variance for each downstream service.
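In code, the region split and the per-service score might look like this minimal sketch (the function name, the mean-squared-deviation scoring, and the sample data are illustrative assumptions, since the article does not publish the exact formula):

```python
def region_variance(timeline, variance_start, variance_end):
    """Score one downstream timeline: split it into a normal region
    (before the spike) and an abnormal region (the spike window), then
    measure how far the abnormal points drift from the normal baseline."""
    normal = [v for t, v in timeline if t < variance_start]
    abnormal = [v for t, v in timeline if variance_start <= t <= variance_end]
    baseline = sum(normal) / len(normal)            # normal-region average
    # Mean squared deviation of abnormal points from the baseline.
    return sum((v - baseline) ** 2 for v in abnormal) / len(abnormal)

# (timestamp, average latency in ms) samples for one downstream service.
timeline = [(0, 10), (1, 11), (2, 9), (3, 50), (4, 55)]
score = region_variance(timeline, variance_start=3, variance_end=4)  # 1812.5
```

A downstream service that barely moved during the spike window scores near zero, while one whose latency jumped with the parent scores high.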
The variance values are clustered using K‑Means. Large variances are grouped together, and small‑variance clusters are filtered out, leaving the most likely root causes.
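The clustering step can be sketched with a tiny two-cluster 1-D K-means (a production system would more likely use a library such as scikit-learn's `KMeans`; the variance numbers and service names here are invented):

```python
def kmeans_1d(values, iters=50):
    """Minimal two-cluster 1-D K-means over variance scores."""
    centers = [min(values), max(values)]      # simple extreme-point init
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        new = []
        for c in (0, 1):
            members = [v for v, l in zip(values, labels) if l == c]
            new.append(sum(members) / len(members) if members else centers[c])
        if new == centers:                    # converged
            break
        centers = new
    return centers, labels

# Variance scores per downstream service (illustrative numbers).
variances = {"mysql": 1812.5, "redis": 2.1, "auth-svc": 3.4}
centers, labels = kmeans_1d(list(variances.values()))
high = 0 if centers[0] > centers[1] else 1
suspects = [s for s, l in zip(variances, labels) if l == high]  # ["mysql"]
```

Only the members of the high-variance cluster survive as root-cause candidates; the small-variance cluster is discarded.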
4.2 Algorithm Implementation Details
(1) Timeline Splitting: The abnormal service's timeline is divided at its midpoint.
(2) Standard Deviation: A double exponential smoothing model predicts the first half of the data; the deviations between observed and predicted values yield the standard deviation.
(3) Abnormal Range Detection: Points deviating by more than three times the standard deviation (3σ) are marked as abnormal.
(4) Time-point Marking: The first and last crossings of the 3σ threshold define the variance-start and variance-end times.
(5) Service Drill-Down: Downstream services are examined within the identified time window.
(6) Normal Region Average: For each downstream timeline, the average of the normal region is calculated.
(7) Abnormal Region Variance: The variance of the abnormal-region points relative to that normal average is computed.
(8) Timeline Filtering: Timelines whose variance moves in the opposite direction, or whose variance ratio is low, are filtered out.
(9) K-Means Clustering: The variances of the remaining timelines are clustered with K-Means; small-variance clusters are discarded, and the high-variance cluster is reported as the root-cause candidate set.
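Steps (2) through (4) above can be sketched as follows, assuming Holt's formulation of double exponential smoothing; the `alpha`/`beta` values and the sample series are illustrative, not vivo's actual settings:

```python
def holt_predict(series, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts from double (Holt) exponential smoothing."""
    level, trend = series[0], series[1] - series[0]
    preds = [series[0]]                  # no forecast exists for t=0
    for x in series[1:]:
        preds.append(level + trend)      # forecast made before observing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return preds

def abnormal_window(series, alpha=0.5, beta=0.3):
    """Flag points whose residual exceeds 3 sigma (sigma estimated from the
    first, normal half) and return (variance_start, variance_end) indices."""
    preds = holt_predict(series, alpha, beta)
    half = len(series) // 2
    residuals = [s - p for s, p in zip(series[:half], preds[:half])]
    mean = sum(residuals) / len(residuals)
    sigma = (sum((r - mean) ** 2 for r in residuals) / len(residuals)) ** 0.5
    flagged = [i for i, (s, p) in enumerate(zip(series, preds))
               if abs(s - p) > 3 * sigma]
    return (flagged[0], flagged[-1]) if flagged else None

# Flat-ish latency series with a spike starting at index 6.
series = [10, 12, 10, 12, 10, 12, 60, 61, 60, 59]
window = abnormal_window(series)   # (6, 9)
```

The returned indices play the role of the variance-start and variance-end times; the downstream drill-down (steps 5-9) then runs inside that window.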
Summary
The algorithm quantifies service variance, filters out low‑variance downstream services, and highlights the most probable root causes. It leverages existing monitoring data with low implementation cost, achieving over 85% accuracy for downstream‑dependency failures, though it does not cover self‑induced faults such as GC pauses or hardware issues. Human intervention remains necessary for final verification.
Future Outlook
Fault prediction: moving from post‑mortem analysis to proactive fault anticipation.
Data quality governance: improving consistency of logs and metrics for better ML/AIOps outcomes.
Knowledge codification: converting expert operational knowledge into reusable models.
Evolution from statistical to AI algorithms: integrating AI techniques to enhance the current statistical approach.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.