Root Cause Localization Algorithm and Its Implementation for Service Fault Diagnosis
This article describes a root-cause localization algorithm implemented in vivo's monitoring platform. The algorithm automatically analyzes average-latency spikes by splitting service timelines into normal and abnormal regions, computing variance per downstream service, clustering the results with K-means, and recursively tracing downstream dependencies. It achieves over 85% accuracy for downstream-dependency failures, still requires human verification, and is slated for AI-driven enhancements.
This article, based on a fault‑localization project, introduces the principles of a root‑cause localization algorithm. It uses a combination of textual explanations and diagrams so that even non‑technical readers can understand the concepts.
Background
IT professionals know the pain of being woken up in the middle of the night to handle an incident. With the rapid adoption of microservice architectures, complex call chains and massive data volumes make fault investigation a major challenge.
vivo has built a comprehensive end‑to‑end monitoring system covering basic monitoring, generic monitoring, tracing, log monitoring, and synthetic testing. The huge volume of data generated daily raises the question of how to extract value from it. The industry is moving toward AIOps, and many root‑cause analysis algorithms have been proposed in academic papers and industry solutions. By combining vivo’s monitoring data with existing algorithms, a fault‑localization platform can be built.
Implementation Effects
The platform currently focuses on average latency issues, covering two scenarios: proactive queries and trace‑based alerts.
1.1 Proactive Query Scenario
When a user reports that an application is slow, traditionally engineers would manually check response times and then locate the cause, which is time‑consuming. With the fault‑localization platform, users can simply select the faulty service and time range on the homepage, and the system performs the analysis automatically.
1.2 Alert Scenario
When an alert about average response time is triggered, a “view reason” link under the alert leads directly to the root‑cause analysis results provided by the platform.
The trace view shows the service’s average latency spike, and a “Root Cause Analysis” button reveals the detailed analysis.
Analysis Process
The system’s workflow consists of three steps:
The frontend sends the abnormal service name and time range to the backend via an API.
The backend runs an analysis function that invokes the detection algorithm and returns downstream data (service/component name, variance, and point type).
The analysis function processes the results recursively: if pointType = END_POINT, that branch of the analysis stops; if pointType = RPC_POINT, the downstream service is analyzed in turn, forming the recursion.
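The recursive driver can be sketched as follows. This is a minimal illustration, not vivo's actual code: the service names, the `Detection` record, and the toy dependency table are all invented for the example.

```python
from dataclasses import dataclass

END_POINT = "END_POINT"   # leaf component: analysis stops here
RPC_POINT = "RPC_POINT"   # RPC call: recurse into the downstream service

@dataclass
class Detection:
    service: str
    variance: float
    point_type: str

# Toy detection results keyed by service; in the real platform this table
# would be produced by the variance/K-means detection algorithm.
TOY_RESULTS = {
    "gateway": [Detection("order-svc", 12.5, RPC_POINT)],
    "order-svc": [Detection("mysql", 30.1, END_POINT)],
}

def detect(service, start, end):
    """Stand-in for the detection algorithm."""
    return TOY_RESULTS.get(service, [])

def analyze(service, start, end):
    """Recursively trace root-cause candidates: RPC_POINT results are
    drilled into; END_POINT results terminate their branch."""
    causes = []
    for hit in detect(service, start, end):
        causes.append(hit)
        if hit.point_type == RPC_POINT:
            causes.extend(analyze(hit.service, start, end))
    return causes

chain = analyze("gateway", 0, 600)   # walks gateway -> order-svc -> mysql
```

The recursion bottoms out naturally: a service with no suspicious downstream results, or one whose results are all `END_POINT`, contributes no further calls.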
The detection algorithm is the core of the analysis. It operates as follows:
4.1 Algorithm Logic
The algorithm first identifies the start time, variance‑start time, and variance‑end time of the abnormal service. It then splits the downstream service timelines into a normal region (start → variance‑start) and an abnormal region (variance‑start → variance‑end) and computes the variance for each downstream service.
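In code, the region split and the per-service score might look like this minimal sketch (the function name, the mean-squared-deviation scoring, and the sample data are illustrative assumptions, since the article does not publish the exact formula):

```python
def region_variance(timeline, variance_start, variance_end):
    """Score one downstream timeline: split it into a normal region
    (before the spike) and an abnormal region (the spike window), then
    measure how far the abnormal points drift from the normal baseline."""
    normal = [v for t, v in timeline if t < variance_start]
    abnormal = [v for t, v in timeline if variance_start <= t <= variance_end]
    baseline = sum(normal) / len(normal)            # normal-region average
    # Mean squared deviation of abnormal points from the baseline.
    return sum((v - baseline) ** 2 for v in abnormal) / len(abnormal)

# (timestamp, average latency in ms) samples for one downstream service.
timeline = [(0, 10), (1, 11), (2, 9), (3, 50), (4, 55)]
score = region_variance(timeline, variance_start=3, variance_end=4)  # 1812.5
```

A downstream service that barely moved during the spike window scores near zero, while one whose latency jumped with the parent scores high.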
The variance values are clustered using K‑Means. Large variances are grouped together, and small‑variance clusters are filtered out, leaving the most likely root causes.
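The clustering step can be sketched with a tiny two-cluster 1-D K-means (a production system would more likely use a library such as scikit-learn's `KMeans`; the variance numbers and service names here are invented):

```python
def kmeans_1d(values, iters=50):
    """Minimal two-cluster 1-D K-means over variance scores."""
    centers = [min(values), max(values)]      # simple extreme-point init
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        new = []
        for c in (0, 1):
            members = [v for v, l in zip(values, labels) if l == c]
            new.append(sum(members) / len(members) if members else centers[c])
        if new == centers:                    # converged
            break
        centers = new
    return centers, labels

# Variance scores per downstream service (illustrative numbers).
variances = {"mysql": 1812.5, "redis": 2.1, "auth-svc": 3.4}
centers, labels = kmeans_1d(list(variances.values()))
high = 0 if centers[0] > centers[1] else 1
suspects = [s for s, l in zip(variances, labels) if l == high]  # ["mysql"]
```

Only the members of the high-variance cluster survive as root-cause candidates; the small-variance cluster is discarded.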
4.2 Algorithm Implementation Details
(1) Timeline Splitting: The abnormal service's timeline is divided at its midpoint.
(2) Standard Deviation: A double exponential smoothing model predicts the first half of the data; the deviations between observed and predicted values yield the standard deviation.
(3) Abnormal Range Detection: Points deviating by more than three times the standard deviation (3σ) are marked as abnormal.
(4) Time-point Marking: The first and last crossings of the 3σ threshold define the variance-start and variance-end times.
(5) Service Drill-Down: Downstream services are examined within the identified time window.
(6) Normal Region Average: For each downstream timeline, the average of the normal region is calculated.
(7) Abnormal Region Variance: The variance of the abnormal-region points relative to that normal average is computed.
(8) Timeline Filtering: Timelines whose variance moves in the opposite direction, or whose variance ratio is low, are filtered out.
(9) K-Means Clustering: The variances of the remaining timelines are clustered with K-Means; small-variance clusters are discarded, and the high-variance cluster is reported as the root-cause candidate set.
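Steps (2) through (4) above can be sketched as follows, assuming Holt's formulation of double exponential smoothing; the `alpha`/`beta` values and the sample series are illustrative, not vivo's actual settings:

```python
def holt_predict(series, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts from double (Holt) exponential smoothing."""
    level, trend = series[0], series[1] - series[0]
    preds = [series[0]]                  # no forecast exists for t=0
    for x in series[1:]:
        preds.append(level + trend)      # forecast made before observing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return preds

def abnormal_window(series, alpha=0.5, beta=0.3):
    """Flag points whose residual exceeds 3 sigma (sigma estimated from the
    first, normal half) and return (variance_start, variance_end) indices."""
    preds = holt_predict(series, alpha, beta)
    half = len(series) // 2
    residuals = [s - p for s, p in zip(series[:half], preds[:half])]
    mean = sum(residuals) / len(residuals)
    sigma = (sum((r - mean) ** 2 for r in residuals) / len(residuals)) ** 0.5
    flagged = [i for i, (s, p) in enumerate(zip(series, preds))
               if abs(s - p) > 3 * sigma]
    return (flagged[0], flagged[-1]) if flagged else None

# Flat-ish latency series with a spike starting at index 6.
series = [10, 12, 10, 12, 10, 12, 60, 61, 60, 59]
window = abnormal_window(series)   # (6, 9)
```

The returned indices play the role of the variance-start and variance-end times; the downstream drill-down (steps 5-9) then runs inside that window.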
Summary
The algorithm quantifies service variance, filters out low‑variance downstream services, and highlights the most probable root causes. It leverages existing monitoring data with low implementation cost, achieving over 85% accuracy for downstream‑dependency failures, though it does not cover self‑induced faults such as GC pauses or hardware issues. Human intervention remains necessary for final verification.
Future Outlook
Fault prediction: moving from post‑mortem analysis to proactive fault anticipation.
Data quality governance: improving consistency of logs and metrics for better ML/AIOps outcomes.
Knowledge codification: converting expert operational knowledge into reusable models.
Evolution from statistical to AI algorithms: integrating AI techniques to enhance the current statistical approach.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.