
How We Built a Scalable Database Monitoring System for Real‑Time Alerts

This article details the design and implementation of a comprehensive database monitoring platform that automatically adapts to cluster changes, aggregates host and DB metrics, offers flexible alert templates and strategies, stores data in InfluxDB, and provides customizable dashboards for real‑time insight and incident response.


1. Background

Database monitoring is essential for early detection of machine and database performance issues and for loss mitigation. In the early stage, we used the open‑source Prometheus system, but encountered several drawbacks:

Cluster member changes required manual updates to Prometheus configuration.

Machine and database metrics were collected by separate exporters, making joint visualization difficult.

Alert configurations and time‑window suppression were inflexible.

Daily inspections and dashboard customization were cumbersome.

Based on these needs and research on Prometheus and Alibaba Cloud database monitoring, we designed the BanYu database monitoring system with the following core capabilities:

Cluster‑level metric collection without configuration changes when members change.

Simultaneous display of machine and database performance metrics at both cluster and host dimensions.

Alert templates that support differentiated configurations.

Time‑window suppression and flexible alert strategies.

Customizable dashboards for easy inspection.

Below we describe the overall architecture and the design rationale of each component.

2. Overall Architecture

The BanYu database monitoring architecture is illustrated below.

Component functions are as follows:

Agent module: exposes metric endpoints for data collection, similar to a Prometheus exporter.

Schedule module: fetches monitoring tasks, retrieves cluster information from the DB config service, and pulls metrics from agents at configured intervals.

Monitor module: stores, queries, and analyzes metrics, and raises alerts based on rules.

Alarm module: an internal alert service supporting DingTalk and phone notifications.

HTTP server module: handles task, template, and rule configuration, as well as data query and visualization.
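To make the agent/schedule split concrete, here is a minimal Python sketch of an agent-side metrics endpoint in a Prometheus-style text format. The metric name, labels, and port are illustrative assumptions, not the actual BanYu implementation.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(samples):
    """Render samples as Prometheus-style text exposition.

    `samples` maps metric name -> (labels dict, numeric value).
    """
    lines = []
    for name, (labels, value) in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    # In a real agent, a collector thread would refresh these samples.
    samples = {"node_cpu_usage_percent": ({"host": "db-01"}, 12.5)}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.samples).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Port 9100 chosen here only by convention; any free port works.
    HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever()
```

The schedule module then only needs an HTTP GET per agent, which keeps agents stateless and easy to replace.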

2.1 Data Collection

We collect two categories of metrics: host metrics and database performance metrics. Metric selection follows industry best practices such as Alibaba Cloud database monitoring.

Host metrics: CPU utilization, disk I/O usage, disk space usage, load, and memory usage.

MongoDB metrics: connection count, read/write queue length, traffic, cursor count, request count.

Redis metrics: memory usage, request count, traffic, expired keys per second, hit rate.

TiDB metrics: raft‑store CPU, coprocessor CPU, duration, etc., obtained via TiDB’s built‑in Prometheus exporter.

We designed four types of agents: a node agent plus one service agent per database type. The node agent runs on each host and collects system metrics via local commands. The database service agents (for MongoDB, Redis, and TiDB) run as multi-replica pods inside our Kubernetes cluster, which makes it easy to add new metrics dynamically. For example, adding a TiDB metric only requires adding the corresponding query.
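As an illustration of how a node agent can turn local system files into metrics, here is a small Python sketch that derives memory usage from `/proc/meminfo`. This is an assumed approach for the sketch; the actual agent may shell out to commands instead.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content and return memory usage percent.

    Prefers MemAvailable (present on kernels >= 3.14), falling back
    to MemFree on older systems.
    """
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            fields[key.strip()] = int(rest.split()[0])  # values are in kB
    total = fields["MemTotal"]
    available = fields.get("MemAvailable", fields.get("MemFree", 0))
    return round(100.0 * (total - available) / total, 2)

def collect_memory_usage(path="/proc/meminfo"):
    """Read the live file on a Linux host and compute usage percent."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Keeping the parsing pure (text in, number out) makes each collector trivially unit-testable without a live host.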

2.2 Task Scheduling

The schedule module loads monitoring tasks at startup, retrieves cluster details (IP, port, role) from the DB config service, and triggers data collection at the configured intervals. On successful collection, it notifies the monitor module to run alert analysis.
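The interval-driven loop described above can be sketched in a few lines of Python. This is a simplified model (one process, callbacks standing in for "pull metrics, then notify the monitor module"), not the production scheduler.

```python
import time

class Scheduler:
    """Minimal interval scheduler: each task records its next due time."""

    def __init__(self):
        self.tasks = []  # list of [next_due, interval_sec, callback]

    def add_task(self, interval_sec, callback, now=None):
        now = time.time() if now is None else now
        self.tasks.append([now + interval_sec, interval_sec, callback])

    def run_pending(self, now=None):
        """Run every task whose due time has passed; return how many ran."""
        now = time.time() if now is None else now
        ran = 0
        for task in self.tasks:
            if now >= task[0]:
                task[2]()              # e.g. pull from agent, notify monitor
                task[0] = now + task[1]
                ran += 1
        return ran
```

Injecting `now` keeps the timing logic deterministic under test; in production, a loop would call `run_pending()` with a short sleep between iterations.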

2.3 Data Storage

We store monitoring data in the time‑series database InfluxDB, chosen for the following reasons:

Easy deployment without external dependencies.

SQL‑like query language for friendly access.

Retention policies to control data lifespan.

High‑throughput writes with indexed tags for fast queries.
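For reference, writing a point to InfluxDB 1.x is a single HTTP POST of line protocol (`measurement,tags fields timestamp`). The sketch below builds that string by hand; the database name and URL are placeholders, and numeric fields are written as floats for simplicity.

```python
import urllib.request

def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Serialize one point into InfluxDB line protocol:
    measurement,tag1=v1,... field1=v1,... timestamp
    """
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

def write_points(lines, url="http://localhost:8086/write?db=monitoring"):
    """POST a batch of line-protocol points to the 1.x /write endpoint."""
    data = "\n".join(lines).encode()
    req = urllib.request.Request(url, data=data, method="POST")
    urllib.request.urlopen(req)  # InfluxDB returns 204 No Content on success
```

Tags (cluster, host, role) are indexed, which is what makes per-cluster dashboard queries fast; fields carry the raw metric values.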

2.4 Alert Rules

Fine‑grained alert strategies are configured via flexible templates. Design considerations include simple cluster‑level configuration, differentiated settings per cluster, and granular rule definition.

2.4.1 Alert Template

An alert template consists of rule name, role, metric, threshold, and strategy, as shown below.

Applying the template to service clusters enables easy mapping of metrics to rules and supports differentiated configurations.
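One way to model the template-to-cluster mapping is shown below: a template holds default rules, and per-cluster overrides supply the differentiated thresholds. The field names and override shape are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertRule:
    name: str         # e.g. "high-cpu"
    role: str         # cluster role the rule applies to, e.g. "primary"
    metric: str       # e.g. "cpu_usage_percent"
    threshold: float
    strategy: str     # name of the alert strategy to apply

@dataclass
class AlertTemplate:
    name: str
    rules: tuple      # tuple of AlertRule

def apply_template(template, clusters, overrides=None):
    """Bind a template to clusters. `overrides` supplies differentiated
    thresholds: {cluster_name: {rule_name: threshold}}."""
    overrides = overrides or {}
    bound = {}
    for cluster in clusters:
        rules = []
        for r in template.rules:
            thr = overrides.get(cluster, {}).get(r.name, r.threshold)
            rules.append(AlertRule(r.name, r.role, r.metric, thr, r.strategy))
        bound[cluster] = rules
    return bound
```

Most clusters then share one template, while a handful with unusual workloads override only the thresholds that differ.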

2.4.2 Alert Metrics

Alert metrics are the specific items we monitor, such as CPU usage, load, and disk space.

2.4.3 Alert Strategies

Two typical strategies are:

Within 2 minutes, if 4 out of 8 samples meet the condition, send DingTalk alert.

Within 2 minutes, if 4 out of 8 samples meet the condition, send DingTalk and phone alerts.
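The "4 of 8 samples within 2 minutes" condition is an m-of-n sliding window, which suppresses one-off spikes while still firing on sustained breaches. A minimal Python sketch, assuming a 15-second sampling interval so that 8 samples span 2 minutes:

```python
from collections import deque

class MofNTrigger:
    """Fire when at least m of the last n samples breach the threshold.

    With a 15 s sampling interval, n=8 covers the 2-minute window
    described above; a single spike (1 of 8) never fires.
    """

    def __init__(self, threshold, m=4, n=8):
        self.threshold = threshold
        self.m = m
        self.window = deque(maxlen=n)  # rolling breach flags

    def observe(self, value):
        """Record one sample; return True when the alert should fire."""
        self.window.append(value > self.threshold)
        return sum(self.window) >= self.m
```

Whether a firing trigger sends DingTalk only, or DingTalk plus a phone call, is decided by the strategy bound to the rule, not by the window logic itself.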

2.5 Monitoring Data Presentation

The platform provides near‑real‑time dashboards that support fault tracing, risk prediction, and overall monitoring. A single page can display both host metrics and database performance metrics for each cluster role, eliminating the need to switch between pages.

Customizable dashboards enable daily inspections and early detection of performance risks.

For example, a TiDB dashboard revealed increased latency, which was promptly addressed, reducing risk.
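A dashboard panel of this kind typically maps to one InfluxQL query per measurement, downsampled into time buckets and grouped per host. The query builder below is an assumed sketch (measurement, tag names, and windows are illustrative), not the platform's actual query layer.

```python
def dashboard_query(measurement, cluster, fields, window="2m", span="1h"):
    """Build an InfluxQL query for one dashboard panel: the mean of each
    field per time bucket, filtered to a cluster tag, split by host."""
    selects = ", ".join(f'mean("{f}") AS "{f}"' for f in fields)
    return (
        f'SELECT {selects} FROM "{measurement}" '
        f"WHERE \"cluster\" = '{cluster}' AND time > now() - {span} "
        f"GROUP BY time({window}), \"host\""
    )
```

Because host metrics and database metrics share the same cluster and host tags, two such queries can feed one combined panel without page switching.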

3. Summary

The BanYu database monitoring system has been in production for nearly six months, during which daily inspections and alerts have helped identify and resolve multiple performance issues. Remaining work includes alert convergence and high‑availability for the time‑series database. Future efforts will continue to deepen the system’s capabilities to safeguard BanYu’s databases.

Tags: alerting, Prometheus, InfluxDB, metrics collection, database monitoring, monitoring architecture
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
