Design and Implementation of Qunar's Root Cause Analysis System for Microservice Fault Diagnosis

This article describes Qunar's root cause analysis platform: its background, its data-driven fault categorization, and its architecture (trace, runtime, middleware, and event analysis modules). It also reports the system's accuracy and its practical effect on incident resolution time across microservices.

Qunar Tech Salon

Jul 12, 2023

With the rapid growth of Qunar's business and the proliferation of microservice architectures, service call graphs have become extremely complex, making fault localization a major challenge. An analysis of 99 incidents in the first half of 2022 showed that 48.5% were timeout failures; these were grouped into seven root-cause categories.

To address this, Qunar built an enterprise-grade APM system comprising the qtracer tracing platform, the watcher monitoring suite, the heimdall anomaly-statistics engine, and an event platform. Metric-trace correlation is achieved by attaching trace context to metrics, and trace sampling is automatically increased during alerts.
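The two mechanisms in the previous paragraph can be sketched roughly as follows. This is a minimal illustration, not qtracer's or watcher's actual API: the class `AdaptiveSampler`, the function `tag_metric`, the field name `trace_id`, and both sampling rates are assumptions made for the example.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class AdaptiveSampler:
    """Hypothetical sampler that raises the trace sampling rate while an app is alerting."""
    base_rate: float = 0.01   # assumed default sampling probability
    alert_rate: float = 1.0   # assumed probability while an alert is active
    alerting_apps: Set[str] = field(default_factory=set)

    def on_alert(self, app: str) -> None:
        # Called when monitoring raises an alert for the app.
        self.alerting_apps.add(app)

    def on_recover(self, app: str) -> None:
        # Called when the alert clears; sampling falls back to the base rate.
        self.alerting_apps.discard(app)

    def rate_for(self, app: str) -> float:
        return self.alert_rate if app in self.alerting_apps else self.base_rate

    def should_sample(self, app: str, rng: Callable[[], float] = random.random) -> bool:
        # rng is injectable so the decision can be made deterministic in tests.
        return rng() < self.rate_for(app)


def tag_metric(name: str, value: float, trace_id: Optional[str]) -> dict:
    """Attach trace context to a metric point so an alert can later be joined to traces."""
    point = {"name": name, "value": value}
    if trace_id is not None:
        point["trace_id"] = trace_id
    return point
```

With this shape, an alert on a metric can be followed to the trace IDs recorded on its data points, and the raised sampling rate makes it more likely that traces exist for the alert window.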

The overall root‑cause analysis architecture consists of six modules: an API/listener entry point, an orchestration layer for workflow control, an analyzer layer (including trace, runtime, middleware, event, log and extensible analyzers), a data‑processor for aggregation, weighting and pruning, a feedback/learning component, and a base‑data layer that aggregates logs, events, alerts and metrics.
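One way to picture how the orchestration, analyzer, and data-processor layers fit together is the sketch below. It is an illustration under assumptions, not Qunar's implementation: the `Finding` type, the `run_analysis` function, the analyzer call signature, and the `min_score` pruning threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    app: str            # application the finding points at
    analyzer: str       # which analyzer produced it (trace, runtime, ...)
    description: str
    score: float        # assumed relevance score; higher is more suspicious


# Assumed analyzer interface: takes the alert context, returns findings.
Analyzer = Callable[[dict], List[Finding]]


def run_analysis(alert: dict,
                 analyzers: Dict[str, Analyzer],
                 min_score: float = 0.2) -> List[Finding]:
    """Orchestrate analyzers, then aggregate, prune, and rank their findings."""
    findings: List[Finding] = []
    for name, analyze in analyzers.items():
        try:
            findings.extend(analyze(alert))
        except Exception:
            # One failing analyzer should not abort the whole diagnosis.
            continue
    # Data-processor step: prune weak findings, then rank the rest.
    kept = [f for f in findings if f.score >= min_score]
    return sorted(kept, key=lambda f: f.score, reverse=True)
```

Keeping analyzers behind one call signature is what would make the analyzer layer extensible: a new dimension is added by registering another entry in the `analyzers` mapping.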

Key functional modules include the trace analysis module, which selects relevant traces based on alert time windows, abnormal flags, T-value and similarity. The application-dimension analysis module examines runtime (instance, JVM, business metrics), middleware (MySQL/Redis), events and logs. A weight system then combines application weight, static weight (derived from historical fault distribution), dynamic weight (severity-based escalation) and strong/weak dependency weight to rank probable root causes.
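The ranking step can be sketched as below. The source names the four weights but not how they are combined, so the multiplicative formula, the dependency factors `STRONG_DEP` and `WEAK_DEP`, and all names in the code are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Assumed factors for strong and weak dependencies.
STRONG_DEP = 1.0
WEAK_DEP = 0.5


@dataclass
class Candidate:
    name: str
    app_weight: float        # weight of the application itself
    static_weight: float     # prior from historical fault distribution
    dynamic_weight: float    # escalates with the severity of the anomaly
    strong_dependency: bool  # whether the alerting app depends strongly on this one


def score(c: Candidate) -> float:
    """Combine the four weights; a product is an assumption, not the source's formula."""
    dep = STRONG_DEP if c.strong_dependency else WEAK_DEP
    return c.app_weight * c.static_weight * c.dynamic_weight * dep


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates ordered from most to least probable root cause."""
    return sorted(candidates, key=score, reverse=True)
```

A product means that any weight close to zero suppresses a candidate regardless of the others; a weighted sum would be the gentler alternative if one weak signal should not rule a candidate out.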

Since its launch in December 2022, the system has achieved an average accuracy above 70% for analyzable incidents and has helped reduce the timeout‑incident rate from 60.9% to 38.8%; real‑world cases such as a ticket‑display‑rate drop and a MySQL thread anomaly illustrate its practical value in shortening MTTR.

The platform also supports on‑demand API triggers, integration with the watcher monitoring dashboard, and an online intelligent verification system that validates code changes by analyzing affected Java call chains.

Future work will expand analysis dimensions to include deadlock detection, detailed GC analysis, thread‑pool diagnostics, and further increase coverage and accuracy across all business lines.

