Design and Evolution of Qunar's Watcher Enterprise Monitoring Platform
The article details the background, architecture, core features, alert governance, trace integration, and cloud‑native evolution of Watcher, Qunar's internally built, highly scalable monitoring platform that unifies application‑level metrics, alerting, and observability across thousands of services and containers.
Watcher is Qunar's internally developed, one‑stop monitoring platform, built to replace traditional open‑source solutions that could no longer meet the company's scaling and feature requirements. It aggregates metrics at both the system and application layers, ingests billions of data points per minute, and manages millions of alerts while remaining horizontally scalable.
The platform is organized into six functional modules:
Application Space: manages monitoring resources per app, including hosts, pods, DBs, and load balancers.
Public Space: custom panels with tree‑structured management and permission inheritance.
User Space: private panels.
Alert Space: centralized alert management with fast search and configuration.
Global Configuration: user‑level settings such as themes and tool integrations.
System Configuration: admin‑only plugin and display settings.
Key design highlights include:
Enhanced Grafana integration with custom panel directories, templates, and alert linkage.
Application‑level metric aggregation that allows both total and per‑instance views for root‑cause analysis.
Automatic discovery of hosts, pods, DBs, and domains, providing real‑time resource monitoring without manual configuration.
Rich alert management featuring a unified alert view, escalation, noise reduction, flash‑alert suppression, and convergence based on a dependency topology derived from metrics and Qtracer.
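The topology-based convergence mentioned in the last highlight can be illustrated with a minimal sketch. It is not Watcher's implementation: the function name, the shape of the `depends_on` map, and the rule "fold an alert into the most upstream firing dependency" are all assumptions made for illustration.

```python
from collections import defaultdict


def converge_alerts(firing, depends_on):
    """Fold firing alerts into their likely root causes.

    firing:     set of app names that currently have an active alert.
    depends_on: dict mapping an app to the apps it calls (hypothetical topology).
    Returns {root_app: sorted list of apps whose alerts are grouped under it}.
    """
    def firing_deps(app):
        # Only dependencies that are themselves alerting can explain this alert.
        return [dep for dep in depends_on.get(app, ()) if dep in firing]

    groups = defaultdict(set)
    for app in firing:
        # Depth-first walk along firing dependencies, starting from this app.
        stack, visited, roots = [app], set(), set()
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            deps = firing_deps(node)
            if deps:
                stack.extend(deps)
            else:
                # No firing dependency below this node: treat it as a root cause.
                roots.add(node)
        if not roots:
            # Dependency cycle with no leaf: fall back to the app itself.
            roots.add(app)
        for root in roots:
            groups[root].add(app)
    return {root: sorted(apps) for root, apps in groups.items()}


# Hypothetical topology: gateway calls order and search, order calls payment.
topology = {"gateway": ["order", "search"], "order": ["payment"], "payment": ["db"]}
grouped = converge_alerts({"gateway", "order", "payment", "search"}, topology)
```

With this input, the alerts for `gateway`, `order`, and `payment` are grouped under `payment`, and `gateway` also appears under `search`, so an on-call engineer would see two candidate root causes instead of four separate alerts.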
Watcher also integrates tightly with Qtracer, Qunar's proprietary tracing system, enabling metrics‑driven alerts to be linked to trace IDs for rapid debugging. Metrics marked with QTracer.mark() are stored alongside trace data, allowing users to retrieve relevant traces when an alert fires.
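To make the metric-to-trace linkage concrete, here is a small sketch of the idea: each marked metric keeps recent trace IDs, and an alert looks them up by time window. The class `MarkedMetricStore`, its methods, and the metric name are hypothetical and do not reflect the actual QTracer.mark() API or Watcher's storage layout.

```python
import time
from collections import defaultdict, deque


class MarkedMetricStore:
    """Keeps the most recent trace IDs recorded for each marked metric (illustrative only)."""

    def __init__(self, max_traces_per_metric=100):
        # A bounded deque per metric so memory stays flat under heavy traffic.
        self._traces = defaultdict(lambda: deque(maxlen=max_traces_per_metric))

    def mark(self, metric, trace_id, timestamp=None):
        """Record that a request with this trace ID contributed to the metric."""
        ts = time.time() if timestamp is None else timestamp
        self._traces[metric].append((ts, trace_id))

    def traces_for_alert(self, metric, fired_at, window_seconds=300):
        """Return trace IDs marked within the window that ends when the alert fired."""
        start = fired_at - window_seconds
        return [
            trace_id
            for ts, trace_id in self._traces.get(metric, ())
            if start <= ts <= fired_at
        ]


store = MarkedMetricStore()
store.mark("order.create.error", "trace-1", timestamp=100)
store.mark("order.create.error", "trace-2", timestamp=390)
store.mark("order.create.error", "trace-3", timestamp=410)

# An alert on the metric fires at t=400; look back 60 seconds for related traces.
related = store.traces_for_alert("order.create.error", fired_at=400, window_seconds=60)
```

Bounding the number of stored trace IDs per metric is a design choice of this sketch: a handful of sample traces is usually enough to start debugging, and it avoids storing one entry per request.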
In the cloud‑native era, Qunar migrated most services to Kubernetes, prompting enhancements such as dynamic IP discovery, event listeners for cluster changes, and a client‑discovery module in Qmonitor. Infrastructure monitoring shifted from collectd on KVM to Prometheus + cAdvisor for containers, with seamless Grafana‑Prometheus integration.
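The event-listener approach to dynamic IP discovery might look roughly like the sketch below. It assumes pod changes arrive as simplified dictionaries carrying the Kubernetes watch event types `ADDED`, `MODIFIED`, and `DELETED`; the `TargetRegistry` class and the event layout are hypothetical, and a real listener would receive these events from the Kubernetes API rather than from hand-built dictionaries.

```python
class TargetRegistry:
    """Tracks which pod IPs should currently be scraped for each app (illustrative only)."""

    def __init__(self):
        self._targets = {}  # pod name -> (app name, pod IP)

    def handle_event(self, event):
        """Apply one pod event of type ADDED, MODIFIED, or DELETED."""
        kind = event["type"]
        pod = event["pod"]
        name = pod["name"]
        if kind == "DELETED" or not pod.get("ip"):
            # Pod is gone, or has no IP yet (still pending): nothing to scrape.
            self._targets.pop(name, None)
        elif kind in ("ADDED", "MODIFIED"):
            self._targets[name] = (pod["app"], pod["ip"])

    def targets_for(self, app):
        """Return the sorted list of IPs currently registered for an app."""
        return sorted(ip for owner, ip in self._targets.values() if owner == app)


registry = TargetRegistry()
registry.handle_event({"type": "ADDED", "pod": {"name": "order-a", "app": "order", "ip": "10.0.0.1"}})
registry.handle_event({"type": "ADDED", "pod": {"name": "order-b", "app": "order", "ip": None}})
registry.handle_event({"type": "MODIFIED", "pod": {"name": "order-b", "app": "order", "ip": "10.0.0.2"}})
after_scale_up = registry.targets_for("order")

registry.handle_event({"type": "DELETED", "pod": {"name": "order-a", "app": "order", "ip": "10.0.0.1"}})
after_delete = registry.targets_for("order")
```

Keying the registry by pod name rather than by IP is deliberate in this sketch: a replacement pod gets a new name and a new IP, so stale addresses disappear as soon as the old pod's `DELETED` event is processed.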
Future plans include deeper fusion of metrics, traces, and logs, automated root‑cause analysis, and dynamic anomaly detection to reduce reliance on static thresholds, further advancing Watcher's intelligent monitoring capabilities.
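The article does not say which anomaly detection method Watcher plans to use. As one common illustration of a dynamic threshold, the sketch below flags a value when it deviates from a rolling window by more than a chosen number of standard deviations. The class name and all parameters are assumptions, not part of Watcher.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags values that deviate strongly from a rolling baseline (rolling z-score)."""

    def __init__(self, window=30, z_threshold=3.0, min_samples=10):
        self._values = deque(maxlen=window)  # most recent observations
        self._z_threshold = z_threshold
        self._min_samples = min_samples

    def observe(self, value):
        """Return True if value is anomalous relative to the current window, then record it."""
        is_anomaly = False
        if len(self._values) >= self._min_samples:
            baseline = mean(self._values)
            spread = pstdev(self._values)
            if spread > 0:
                is_anomaly = abs(value - baseline) / spread > self._z_threshold
            else:
                # A perfectly flat window: any change at all is a deviation.
                is_anomaly = value != baseline
        self._values.append(value)
        return is_anomaly


detector = RollingAnomalyDetector()
# A steady metric oscillating between 100 and 102 should not trigger anything.
steady_flags = [detector.observe(v) for v in [100, 102] * 10]
# A sudden jump to 150 is far outside the recent range.
spike_flag = detector.observe(150)
```

Unlike a static threshold such as "alert above 120", the baseline here follows the metric, so the same detector can be reused across metrics with very different normal ranges.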
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.