Optimization Journey of Qunar's Database Inspection and Alarm Systems
This article describes how Qunar's DBA team analyzed the shortcomings of its original database inspection and alarm systems, then designed and implemented broader metric coverage, risk‑level classification, automated reporting, and alarm noise reduction. It also reports the resulting gains in stability and efficiency.
Introduction The alarm system is crucial for real‑time database monitoring, but relying solely on alarms leaves hidden risks; therefore, a complementary inspection system is essential for proactive risk detection and mitigation.
Inspection System Optimization The original inspection tool covered only host‑level disk usage and basic table/index metrics; it had no performance, load, or risk‑level assessment. To address this, the team expanded the inspection items to include cluster load and performance indicators, and introduced a risk‑level classification mechanism.
Design of the Optimization Four key improvements were made: (1) establishing a comprehensive set of metrics covering CPU, network, memory, long transactions, locks, active threads, and Redis usage; (2) categorizing metrics by their relevance (e.g., QPS, active threads); (3) defining high/medium/low risk thresholds and assigning weighted importance; (4) aggregating individual metric risks into instance‑level risk scores.
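Steps (3) and (4) can be illustrated with a minimal sketch. The article does not publish the actual metrics table, thresholds, weights, or aggregation formula, so every name and number below (`RULES`, `metric_level`, `instance_risk`, the weighted-average rule) is a hypothetical choice made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    medium: float  # value at or above this counts as medium risk
    high: float    # value at or above this counts as high risk
    weight: float  # relative importance of the metric


# Hypothetical thresholds and weights; not taken from the article.
RULES = {
    "cpu_percent": MetricRule(medium=60.0, high=85.0, weight=3.0),
    "active_threads": MetricRule(medium=20.0, high=50.0, weight=2.0),
    "long_transaction_seconds": MetricRule(medium=30.0, high=120.0, weight=1.0),
}

LEVEL_SCORE = {"low": 0, "medium": 1, "high": 2}


def metric_level(name, value):
    """Classify one metric value as low, medium, or high risk."""
    rule = RULES[name]
    if value >= rule.high:
        return "high"
    if value >= rule.medium:
        return "medium"
    return "low"


def instance_risk(sample):
    """Aggregate per-metric risks into one instance-level score and level.

    The score is a weighted average of per-metric scores in the range 0.0 to 2.0.
    """
    if not sample:
        raise ValueError("sample must contain at least one metric")
    total_weight = sum(RULES[name].weight for name in sample)
    weighted = sum(
        LEVEL_SCORE[metric_level(name, value)] * RULES[name].weight
        for name, value in sample.items()
    )
    score = weighted / total_weight
    if score >= 1.5:
        level = "high"
    elif score >= 0.5:
        level = "medium"
    else:
        level = "low"
    return score, level


score, level = instance_risk(
    {"cpu_percent": 90.0, "active_threads": 10.0, "long_transaction_seconds": 40.0}
)
print(level)
```

A weighted average is only one possible aggregation. A team that wants any single high-risk metric to dominate could take the maximum per-metric level instead.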
Implementation Agents collect metric data every 2 seconds, sending it to a server for analysis and report generation. Representative reports include instance‑level inspection, active‑thread inspection, slow‑query inspection, and scan‑row inspection, each providing detailed risk levels, rankings, and actionable insights.
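As a rough illustration of how a server might turn collected samples into a ranked report such as the active‑thread inspection, the sketch below ranks instances by peak and average active threads. The function name `active_thread_report`, the input shape, and the ranking rule are assumptions; the article does not describe the report format in this detail.

```python
from statistics import mean


def active_thread_report(samples, top_n=3):
    """Rank instances by peak, then average, active-thread count.

    samples: dict mapping an instance name to the list of active-thread
    counts collected for it during the inspection window (hypothetical shape).
    """
    rows = [
        {"instance": name, "avg": mean(values), "peak": max(values)}
        for name, values in samples.items()
        if values  # skip instances with no samples in the window
    ]
    rows.sort(key=lambda row: (row["peak"], row["avg"]), reverse=True)
    return rows[:top_n]


report = active_thread_report(
    {"db-a": [1, 2, 3], "db-b": [10, 4], "db-c": [5, 5]}
)
for row in report:
    print(row["instance"], row["peak"], row["avg"])
```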
Alarm System Optimization Identified problems were excessive false alarms, ineffective dynamic silencing, low automation coverage, and lack of reporting dashboards. The redesign introduced three major improvements: (1) alarm noise reduction through tiered handling, dynamic silencing, and customizable thresholds; (2) automated alarm processing with one‑click group creation, auto‑remediation, and root‑cause analysis; (3) operational features such as historical alarm search and visual dashboards.
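One common way to implement silencing, shown here as a minimal sketch rather than as Qunar's actual mechanism, is to suppress repeats of the same alarm inside a time window unless the severity escalates. The class name `AlarmSilencer`, the severity names, and the 300-second default are all hypothetical.

```python
# Hypothetical severity tiers, ordered from least to most severe.
LEVEL_ORDER = {"info": 0, "warning": 1, "critical": 2}


class AlarmSilencer:
    """Suppress repeated alarms for the same (instance, metric) pair."""

    def __init__(self, silence_seconds=300.0):
        self.silence_seconds = silence_seconds
        self._last_sent = {}  # (instance, metric) -> (timestamp, level)

    def should_notify(self, instance, metric, level, now):
        """Return True if the alarm should be sent at time `now` (seconds)."""
        key = (instance, metric)
        previous = self._last_sent.get(key)
        if previous is not None:
            last_time, last_level = previous
            within_window = now - last_time < self.silence_seconds
            escalated = LEVEL_ORDER[level] > LEVEL_ORDER[last_level]
            if within_window and not escalated:
                return False  # duplicate inside the window: stay silent
        self._last_sent[key] = (now, level)
        return True


silencer = AlarmSilencer(silence_seconds=300.0)
print(silencer.should_notify("db-a", "cpu", "warning", now=0.0))
print(silencer.should_notify("db-a", "cpu", "warning", now=60.0))
print(silencer.should_notify("db-a", "cpu", "critical", now=120.0))
```

Passing `now` explicitly keeps the logic deterministic and easy to test; a production version would likely read the clock itself and persist its state across restarts.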
Results After the optimization, invalid alarms were eliminated, alarm volume dropped by over 95%, analysis efficiency improved by more than 90%, and database incidents fell to near zero, greatly improving DBA productivity and business confidence.
Future Outlook Plans include further automating root‑cause analysis across host layers, refining alarm de‑duplication, and leveraging a DBA self‑service robot for smarter, faster alarm configuration.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.