
Improving Qunar.com Database Monitoring and Alert System with a Kafka‑Based Alarm Program

The article describes how Qunar.com upgraded its Nagios/NRPE‑based database monitoring by inserting a Kafka‑driven alarm component, centralizing alert configuration in MySQL, adding flexible shielding and multi‑channel notifications, and exploring intelligent features such as slow‑query and disk‑space management.

Qunar Tech Salon

Background

Qunar.com originally used Nagios together with the NRPE plugin to monitor MySQL instances, collecting metrics such as CPU load and disk usage. The architecture relied on check_nrpe calls from Nagios to remote hosts, with alerts sent via notification plugins to email or phone.

The existing system suffered from inflexible alert thresholds, rigid severity levels, limited shielding options, delayed mute periods, and a single notification channel that made it hard to prioritize critical alerts.

Improvement Roadmap

Rather than replacing the whole stack, the team added a Kafka‑based pipeline. The notification plugin now writes monitoring data to a Kafka topic; a new alarm program consumes this data, evaluates alert rules stored in MySQL, and dispatches alerts through appropriate channels.
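As a rough illustration of the hand-off described above, the sketch below serializes one check result and passes it to a transport. The topic name, the JSON field names, and the `send` callable are assumptions made for this example; the article does not specify the message format, and a real deployment would pass in a Kafka producer's send method instead of the in-memory stand-in.

```python
import json
import time
from typing import Callable, Optional

# Hypothetical topic name; the article does not name the topic.
TOPIC = "db-monitoring"


def build_message(host: str, template: str, value: float,
                  ts: Optional[float] = None) -> bytes:
    """Serialize one check result as UTF-8 JSON (field names are assumed)."""
    payload = {
        "host": host,
        "template": template,
        "value": value,
        "timestamp": ts if ts is not None else time.time(),
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def publish(send: Callable[[str, bytes], None], host: str, template: str,
            value: float, ts: Optional[float] = None) -> None:
    """Hand the serialized result to whatever transport `send` wraps."""
    send(TOPIC, build_message(host, template, value, ts))


# Usage with an in-memory list standing in for the Kafka producer.
sent = []
publish(lambda topic, data: sent.append((topic, data)),
        "db01", "disk_usage", 91.5, ts=1700000000.0)
```

Keeping the plugin this thin means thresholds and routing stay out of the monitored hosts; the plugin only reports what it measured.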

Alarm Program Details

The alarm service performs the following steps:

Consume monitoring messages from Kafka (including metric values, timestamps, host name, and template name).

Query MySQL for the corresponding alert template configuration.

Apply comparison methods, regexes, and thresholds to determine the alert level.

Check mute periods and shielding rules (host‑level, instance‑level, or metric‑level).

Group alerts by severity and send them via QTalk, phone calls, or other channels.
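The steps above can be sketched in a few lines. This is a simplified illustration, not the team's implementation: the `Template` fields, the two severity names, and the shielding model (a set holding host names and `(host, template)` pairs) are all assumptions, and the templates live in a plain dict rather than MySQL so the example runs on its own.

```python
import operator
import re
from collections import defaultdict
from dataclasses import dataclass

# Comparison methods a template may name (an assumed, minimal set).
COMPARATORS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}


@dataclass
class Template:
    compare: str                          # key in COMPARATORS
    warning: float
    critical: float
    host_pattern: str = ".*"              # regex limiting which hosts apply
    muted_hours: frozenset = frozenset()  # hours of day (0-23) to stay silent


def evaluate(msg, templates, shielded, hour):
    """Return 'critical', 'warning', or None for one monitoring message."""
    tpl = templates.get(msg["template"])
    if tpl is None or not re.fullmatch(tpl.host_pattern, msg["host"]):
        return None
    if hour in tpl.muted_hours:
        return None
    # Host-level or metric-level shielding (assumed representation).
    if msg["host"] in shielded or (msg["host"], msg["template"]) in shielded:
        return None
    cmp = COMPARATORS[tpl.compare]
    if cmp(msg["value"], tpl.critical):
        return "critical"
    if cmp(msg["value"], tpl.warning):
        return "warning"
    return None


def group_by_level(messages, templates, shielded, hour):
    """Group alerting hosts by severity so each group can go to its channel."""
    groups = defaultdict(list)
    for msg in messages:
        level = evaluate(msg, templates, shielded, hour)
        if level:
            groups[level].append(msg["host"])
    return dict(groups)


templates = {"disk_usage": Template(">=", warning=80, critical=90)}
messages = [
    {"host": "db01", "template": "disk_usage", "value": 95},
    {"host": "db02", "template": "disk_usage", "value": 85},
    {"host": "db03", "template": "disk_usage", "value": 99},  # shielded below
    {"host": "db04", "template": "disk_usage", "value": 10},
]
result = group_by_level(messages, templates, shielded={"db03"}, hour=14)
```

Because shielding is checked after the message is consumed, a shielded host still produces monitoring data; only the notification is suppressed.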

Alert templates are stored centrally in MySQL, allowing DBA staff to modify thresholds, enable/disable templates, and define shielding windows without touching scripts. High availability is achieved by deploying multiple alarm instances.
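To make the idea of centrally stored templates concrete, here is a small sketch. The table name, columns, and values are hypothetical, since the article does not publish its schema, and `sqlite3` stands in for MySQL only so the example is runnable without a server. MySQL drivers commonly use `%s` rather than `?` as the parameter placeholder.

```python
import sqlite3

# In-memory SQLite as a stand-in for the MySQL configuration database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE alert_template (          -- hypothetical schema
        name     TEXT PRIMARY KEY,
        compare  TEXT NOT NULL,
        warning  REAL NOT NULL,
        critical REAL NOT NULL,
        enabled  INTEGER NOT NULL DEFAULT 1
    )
""")
conn.execute("INSERT INTO alert_template VALUES ('disk_usage', '>=', 80, 90, 1)")


def load_template(conn, name):
    """Return the enabled template as a dict, or None if missing or disabled."""
    row = conn.execute(
        "SELECT compare, warning, critical FROM alert_template "
        "WHERE name = ? AND enabled = 1",
        (name,),
    ).fetchone()
    if row is None:
        return None
    return {"compare": row[0], "warning": row[1], "critical": row[2]}


before = load_template(conn, "disk_usage")

# A DBA raises the critical threshold; no monitoring script changes.
conn.execute("UPDATE alert_template SET critical = 95 WHERE name = 'disk_usage'")
after = load_template(conn, "disk_usage")

# Disabling the template silences it without deleting its configuration.
conn.execute("UPDATE alert_template SET enabled = 0 WHERE name = 'disk_usage'")
disabled = load_template(conn, "disk_usage")
```

Since every alarm instance reads the same rows, a threshold change takes effect for all of them, which fits the multi-instance deployment mentioned above.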

Intelligent Exploration

The upgraded system now supports:

Alert Management: flexible shielding, granular control, and statistical analysis without losing raw monitoring data.

Slow‑Query Management: detection of long‑running queries, automatic retrieval of query details, and optional kill actions presented to DBAs.

Disk‑Space Management: template‑driven classification of directories (e.g., binlog vs. log files) and automated cleanup based on usage thresholds.
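The slow-query item can be illustrated with a short sketch. It assumes the input rows look like those of MySQL's `information_schema.PROCESSLIST` (columns `ID`, `COMMAND`, `TIME`, `INFO`); the 60-second threshold and the function name are made up for the example. The function only prepares `KILL` statements as suggestions and never executes them, matching the article's description of kill actions being presented to DBAs.

```python
def slow_query_candidates(processlist, threshold_s):
    """Pick running queries at or above the threshold, longest first."""
    slow = [
        row for row in processlist
        if row["COMMAND"] == "Query" and row["TIME"] >= threshold_s
    ]
    slow.sort(key=lambda row: row["TIME"], reverse=True)
    return [
        {
            "id": row["ID"],
            "seconds": row["TIME"],
            "sql": row["INFO"],
            # Suggested action only; a DBA decides whether to run it.
            "kill": f"KILL {row['ID']}",
        }
        for row in slow
    ]


# Sample rows shaped like information_schema.PROCESSLIST (values invented).
rows = [
    {"ID": 11, "COMMAND": "Sleep", "TIME": 900, "INFO": None},
    {"ID": 12, "COMMAND": "Query", "TIME": 75, "INFO": "SELECT * FROM orders"},
    {"ID": 13, "COMMAND": "Query", "TIME": 3, "INFO": "SELECT 1"},
    {"ID": 14, "COMMAND": "Query", "TIME": 300, "INFO": "UPDATE orders SET status = 0"},
]
candidates = slow_query_candidates(rows, threshold_s=60)
```

Idle connections are skipped even when their `TIME` is large, because for a sleeping connection that column measures idle time rather than query duration.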

These features lay the foundation for future work such as historical alert analytics, multi‑metric correlation, and anomaly detection on metric spikes.

Future Directions

Statistical analysis of alert frequencies and distribution across instances.

Joint analysis of multiple metrics to reduce false positives.

Detection of sudden metric spikes (e.g., QPS/TPS, connection count) for early anomaly warning.
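One simple way to approach the spike-detection item is a rolling z-score, sketched below. The article does not say which method the team intends to use; the window size, the threshold of 3, and the class name are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class SpikeDetector:
    """Flag a value that deviates sharply from a rolling window of history."""

    def __init__(self, window=5, z_threshold=3.0):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value):
        """Return True if `value` is a spike relative to the recent window."""
        is_spike = False
        # Only judge once the window is full, to avoid noisy early verdicts.
        if len(self.history) == self.history.maxlen:
            mu = mean(self.history)
            # Floor sigma so a perfectly flat baseline does not divide by zero.
            sigma = max(pstdev(self.history), 1e-9)
            is_spike = abs(value - mu) / sigma > self.z_threshold
        self.history.append(value)
        return is_spike


# Invented QPS samples with one obvious jump.
qps = [100, 102, 98, 101, 99, 100, 400, 101]
detector = SpikeDetector(window=5, z_threshold=3.0)
flags = [detector.observe(v) for v in qps]
```

In this sketch the spike itself enters the window and inflates the baseline for the next few samples, and a completely flat baseline makes any change look extreme. A production detector would likely need a minimum absolute change and some handling of outliers in the history.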

The article concludes with a call for further development and a recruitment notice.

Tags: automation, Kafka, MySQL, Alert System, DBA, Database Monitoring
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
