
Design and Implementation of Qunar's Root Cause Analysis System for Microservice Fault Diagnosis

This article describes Qunar's root cause analysis platform: its background, its data-driven fault categorization, and its architecture, which includes trace, runtime, middleware, and event analysis modules. It also reports the platform's accuracy and its practical impact on reducing incident resolution times across microservices.

Qunar Tech Salon

With the rapid growth of Qunar's business and the proliferation of microservice architectures, service call graphs have become extremely complex, making fault localization a major challenge. An analysis of 99 incidents in the first half of 2022 showed that 48.5% were timeout failures, and the incidents were grouped into seven root-cause categories.

To address this, Qunar built an enterprise-grade APM system comprising the qtracer tracing platform, the watcher monitoring suite, the heimdall anomaly-statistics engine, and an event platform. Metrics are correlated with traces by attaching trace context to metrics, and the trace sampling rate is automatically increased while an alert is active.
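The two mechanisms in the previous paragraph can be sketched as follows. This is a minimal illustration, not qtracer's or watcher's actual API: the `AdaptiveSampler` class, the `record_metric` helper, and the sampling rates are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AdaptiveSampler:
    """Hypothetical sampler that raises the trace sampling rate during alerts."""
    base_rate: float = 0.01   # assumed sampling rate in normal operation
    alert_rate: float = 1.0   # assumed sampling rate while an alert is firing
    alerting_apps: set = field(default_factory=set)

    def on_alert(self, app: str) -> None:
        # Called when an alert fires for the application.
        self.alerting_apps.add(app)

    def on_recover(self, app: str) -> None:
        # Called when the alert clears; sampling drops back to the base rate.
        self.alerting_apps.discard(app)

    def rate_for(self, app: str) -> float:
        return self.alert_rate if app in self.alerting_apps else self.base_rate

    def should_sample(self, app: str, rng: random.Random) -> bool:
        # rng.random() is in [0, 1), so a rate of 1.0 samples every request.
        return rng.random() < self.rate_for(app)


def record_metric(store: list, name: str, value: float,
                  trace_id: Optional[str]) -> None:
    """Store a metric point together with the trace id that produced it.

    Keeping the trace id on the point is what lets an abnormal metric be
    followed back to a concrete trace later.
    """
    store.append({"metric": name, "value": value, "trace_id": trace_id})
```

With this shape, an alert handler would call `on_alert("some-app")` so that traces are captured for the requests behind the abnormal metric, and `on_recover` to return to the cheaper base rate.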

The overall root‑cause analysis architecture consists of six modules: an API/listener entry point, an orchestration layer for workflow control, an analyzer layer (including trace, runtime, middleware, event, log and extensible analyzers), a data‑processor for aggregation, weighting and pruning, a feedback/learning component, and a base‑data layer that aggregates logs, events, alerts and metrics.
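As a rough sketch of how the orchestration, analyzer, and data-processor layers might fit together, the example below runs a set of pluggable analyzers against an alert and then aggregates, prunes, and orders their findings. The `Finding` type, the `run_analysis` function, the scores, and the pruning threshold are assumptions made for illustration; the article does not describe the real interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    app: str
    analyzer: str   # e.g. "trace", "runtime", "middleware", "event", "log"
    detail: str
    score: float    # hypothetical confidence in [0, 1]


# An analyzer takes the alert context and returns zero or more findings.
Analyzer = Callable[[dict], List[Finding]]


def run_analysis(alert: dict, analyzers: Dict[str, Analyzer],
                 min_score: float = 0.2) -> List[Finding]:
    """Orchestrate the analyzers, then aggregate, prune, and sort the results."""
    findings: List[Finding] = []
    for name, analyze in analyzers.items():
        try:
            findings.extend(analyze(alert))
        except Exception as exc:
            # One failing analyzer should not abort the whole analysis.
            print(f"analyzer {name} failed: {exc}")

    # Aggregation: keep the strongest finding per (app, analyzer) pair.
    best: Dict[tuple, Finding] = {}
    for f in findings:
        key = (f.app, f.analyzer)
        if key not in best or f.score > best[key].score:
            best[key] = f

    # Pruning: drop weak findings, then order by descending score.
    kept = [f for f in best.values() if f.score >= min_score]
    return sorted(kept, key=lambda f: f.score, reverse=True)
```

Registering analyzers in a dictionary keeps the analyzer layer extensible: a new analyzer is one more entry, and the orchestration code does not change.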

Key functional modules include the trace analysis module, which selects relevant traces based on the alert time window, abnormal flags, T-value and similarity. The application-dimension analysis module examines runtime data (instance, JVM and business metrics), middleware (MySQL/Redis), events and logs. Finally, a weight system ranks probable root causes by combining four weights: application weight, static weight (derived from the historical fault distribution), dynamic weight (severity-based escalation) and strong/weak dependency weight.
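The ranking step of the weight system can be illustrated with a small sketch. The article names the four weights but not how they are combined, so the multiplication used here is an assumption, as are the `Candidate` type, the `rank` function, and every numeric value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    app: str
    cause: str
    app_weight: float         # weight of the application itself
    static_weight: float      # from the historical fault distribution
    dynamic_weight: float     # escalated with the severity of the anomaly
    dependency_weight: float  # strong dependency > weak dependency

    def score(self) -> float:
        # Assumed combination rule: a plain product of the four weights.
        return (self.app_weight * self.static_weight
                * self.dynamic_weight * self.dependency_weight)


def rank(candidates: List[Candidate], top_n: int = 3) -> List[Candidate]:
    """Return the top_n candidates ordered by descending combined score."""
    return sorted(candidates, key=lambda c: c.score(), reverse=True)[:top_n]


candidates = [
    Candidate("search", "slow MySQL query", 1.0, 0.5, 2.0, 1.0),
    Candidate("ads", "JVM full GC", 1.0, 0.3, 1.0, 0.5),
    Candidate("order", "recent deploy", 0.8, 0.4, 1.0, 1.0),
]
for c in rank(candidates, top_n=2):
    print(c.app, c.cause, round(c.score(), 2))
```

A product lets any single low weight, such as a weak dependency, pull a candidate down sharply; a weighted sum would be a gentler alternative if that behaviour proved too aggressive.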

Since its launch in December 2022, the system has achieved an average accuracy above 70% for analyzable incidents and has helped reduce the timeout-incident rate from 60.9% to 38.8%. Real-world cases, such as a drop in the ticket display rate and a MySQL thread anomaly, illustrate its practical value in shortening MTTR.

The platform also supports on‑demand API triggers, integration with the watcher monitoring dashboard, and an online intelligent verification system that validates code changes by analyzing affected Java call chains.

Future work will expand analysis dimensions to include deadlock detection, detailed GC analysis, thread‑pool diagnostics, and further increase coverage and accuracy across all business lines.

Tags: monitoring, microservices, operations, Observability, DevOps, Trace, root cause analysis
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
