Qunar's Watcher Monitoring System: Design, Implementation, and Operational Practices
Zhang Yue, a Qunar operations engineer, discusses the design, selection, architecture, scalability challenges, visualization, alert strategies, and future plans of the company's in‑house monitoring platform Watcher, highlighting lessons learned from migrating from Cacti to a Graphite‑based, Grafana‑enhanced solution.
At APMCon 2016, Qunar operations engineer Zhang Yue presented "Watcher", the company's home‑grown monitoring system, covering its development journey, design choices, and operational experience.
Watcher evolved from a two‑person effort to a three‑person team responsible for monitoring most of Qunar's core services. The original solution, Cacti, proved inadequate as metric volume grew, suffering from single‑point failures, poor horizontal scalability, limited visualization, and lack of an open API.
To address these issues, the team set four primary goals for the new system: high availability, horizontal scalability, enhanced visualization, and an open API.
For data accuracy and scalability, Watcher adopted Graphite, whose scale‑out architecture lets capacity grow by adding nodes. That capacity allows data to be kept uncompressed over longer time ranges, and supports sampled monitoring, so precision is maintained. Visualization is powered by a customized Grafana instance that integrates Qunar's product hierarchy and user system, offering multi‑dimensional dashboards and flexible data displays.
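As a minimal sketch of how a data point reaches Graphite, the example below formats one line of Carbon's plaintext protocol (`<metric path> <value> <timestamp>`) and sends it over TCP. The metric name, host, and helper function names are hypothetical and not taken from Watcher; port 2003 is Carbon's default plaintext listener.

```python
import socket
import time


def format_metric(path: str, value: float, timestamp: int) -> str:
    # One Carbon plaintext line: "<path> <value> <timestamp>" plus a newline.
    return f"{path} {value} {timestamp}\n"


def send_metric(line: str, host: str = "localhost", port: int = 2003) -> None:
    # Assumption: a Carbon plaintext listener is reachable at host:port.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path, for illustration only.
    line = format_metric("example.order.cancel.count", 3, int(time.time()))
    print(line, end="")
    # send_metric(line)  # uncomment when a Carbon listener is available
```

Because each metric is just a dotted path, a hierarchy such as product line, service, and host can be encoded directly in the name and later browsed in Grafana.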
Alerting in Watcher is rule‑based, supporting static thresholds, week‑over‑week comparisons, frequency checks, multi‑trigger alerts, time‑window specific rules, temporary rules, on‑call rotations, callbacks, and multiple notification channels.
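Three of these rule types can be sketched in a few lines. The functions below are an illustration under assumed semantics, not Watcher's actual rule engine: a static threshold fires when a value exceeds a limit, a week‑over‑week rule fires when the relative change from the same time last week exceeds a ratio, and a multi‑trigger rule fires only after a number of consecutive breaches. All names and parameters are hypothetical.

```python
from typing import Sequence


def breaches_static(value: float, threshold: float) -> bool:
    # Static threshold: breach when the value is strictly above the limit.
    return value > threshold


def breaches_week_over_week(current: float, last_week: float, max_ratio: float) -> bool:
    # Week-over-week: breach when the relative change exceeds max_ratio.
    if last_week == 0:
        # No baseline to divide by; treat any non-zero value as a breach.
        return current != 0
    return abs(current - last_week) / abs(last_week) > max_ratio


def should_alert(breaches: Sequence[bool], required_consecutive: int) -> bool:
    # Multi-trigger: alert only if the latest N checks all breached.
    if required_consecutive <= 0 or len(breaches) < required_consecutive:
        return False
    return all(breaches[-required_consecutive:])


if __name__ == "__main__":
    history = [breaches_static(v, 100) for v in (90, 120, 130, 140)]
    print(should_alert(history, required_consecutive=3))
```

Requiring several consecutive breaches is a common way to reduce noise from short spikes, at the cost of a slightly later notification.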
Future focus areas include cost optimization, since the system handles over 8 million metrics and ingests 1.5 million per minute, and improving personnel efficiency through refined alarm response processes.
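To put the ingestion figure in perspective, the short calculation below converts it to per‑second and per‑day rates. It assumes the 1.5 million figure refers to data points written per minute at a steady rate, which the source does not state explicitly.

```python
# Assumption: 1.5 million data points arrive per minute at a steady rate.
POINTS_PER_MINUTE = 1_500_000

points_per_second = POINTS_PER_MINUTE / 60
points_per_day = POINTS_PER_MINUTE * 60 * 24

if __name__ == "__main__":
    print(f"{points_per_second:,.0f} points/second")
    print(f"{points_per_day:,} points/day")
```

Under that assumption the system sustains about 25,000 writes per second, or roughly 2.16 billion data points per day.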
The automation and ops stack includes Graphite, Grafana, Collectd, and various infrastructure tools such as LVS, HAProxy, Docker, Mesos, SaltStack, Ansible, and Ceph, selected based on scenario fit and maturity.
For order cancellations specifically, monitoring counts cancellation events and applies threshold alerts, including comparative metrics measured against earlier periods.
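The counting step can be sketched as follows: cancellation events are grouped into one‑minute buckets, and any bucket whose count exceeds a threshold is flagged. The function names, the one‑minute bucket size, and the threshold value are assumptions for illustration; the source does not describe how Watcher implements this.

```python
from collections import Counter
from typing import Iterable, List


def count_per_minute(event_timestamps: Iterable[int]) -> Counter:
    # Group Unix timestamps (seconds) by the start of their minute.
    counts: Counter = Counter()
    for ts in event_timestamps:
        counts[ts - ts % 60] += 1
    return counts


def minutes_over_threshold(counts: Counter, threshold: int) -> List[int]:
    # Return the minute buckets whose event count is above the threshold.
    return sorted(minute for minute, n in counts.items() if n > threshold)


if __name__ == "__main__":
    # Hypothetical cancellation timestamps, in seconds.
    events = [0, 10, 59, 60, 61, 125]
    counts = count_per_minute(events)
    print(minutes_over_threshold(counts, threshold=2))
```

The per‑minute counts are the kind of series that could then be sent to Graphite and compared against the same minute in an earlier period.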
In her upcoming APMCon talk, Zhang will share Watcher's design, selection rationale, architecture, encountered challenges, and practical lessons for building monitoring systems with open‑source components.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.