
Building an Automated Database Operations Platform: From Monitoring to Multi‑Active Deployment

The article describes how a DBA team at Ele.me transformed manual database management into a fully automated, platform‑driven operation covering monitoring, alarm handling, MHA automation, resource pooling, large‑scale migration, SQL review, and multi‑active DDL release using Go‑based tools and custom workflows.

Qunar Tech Salon

Cai Peng, who joined Ele.me in 2015, witnessed the company’s growth from zero to a large‑scale business and participated in the rapid expansion of the database and DBA teams, eventually transitioning from a traditional DBA role to a DEV‑DBA focused on empowering both DBA and development teams.

Over a two‑and‑a‑half‑year period the team evolved its workflow from manual operations to tool‑based, then platform‑based, and finally self‑service automation, completing the platform and multi‑active database transformation within eight months.

The shift to a multi‑active, geographically distributed architecture made traditional manual DBA processes infeasible, prompting the creation of a comprehensive platform to handle the complexity and scale.

The platform’s core components include:

DB‑Agent: data collection, process management, remote script execution, and platform integration.

MM‑OST: a non‑destructive DDL system built on gh‑ost for multi‑active releases.

Tinker: a Go rewrite of Linux crontab that supports per‑second scheduling and exposes management APIs.

Checksum: cross‑region data consistency checking.

SqlReview: a Go‑based SQL audit tool that replaces Inception and adds custom rules.

Luna: an optimized alarm system that reduces noise while preserving critical alerts.

VDBA: an automated alarm‑handling system that performs smoke tests and resolves alerts without DBA intervention.
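The article does not describe how Checksum compares regions, so the following is only a minimal sketch of one common approach: checksum rows per primary-key range in each region and compare the results. The `Row` type, the range size, and the CRC32 choice are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
	"strings"
)

// Row is a hypothetical, simplified row: a primary key plus column values.
type Row struct {
	ID   int
	Cols []string
}

// rangeChecksums groups rows into primary-key ranges of rangeSize IDs and
// returns one running CRC32 per range. Rows are sorted by ID first so the
// result does not depend on the order in which rows were read.
func rangeChecksums(rows []Row, rangeSize int) map[int]uint32 {
	sorted := append([]Row(nil), rows...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })
	sums := make(map[int]uint32)
	for _, r := range sorted {
		bucket := r.ID / rangeSize
		line := fmt.Sprintf("%d|%s\n", r.ID, strings.Join(r.Cols, "|"))
		sums[bucket] = crc32.Update(sums[bucket], crc32.IEEETable, []byte(line))
	}
	return sums
}

// mismatchedRanges lists the ranges whose checksums differ between two
// regions, including ranges that exist in only one of them.
func mismatchedRanges(a, b map[int]uint32) []int {
	var out []int
	for k, v := range a {
		if bv, ok := b[k]; !ok || bv != v {
			out = append(out, k)
		}
	}
	for k := range b {
		if _, ok := a[k]; !ok {
			out = append(out, k)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	regionA := []Row{{1, []string{"alice"}}, {2, []string{"bob"}}, {150, []string{"carol"}}}
	regionB := []Row{{1, []string{"alice"}}, {2, []string{"bob"}}, {150, []string{"caro1"}}}
	diff := mismatchedRanges(rangeChecksums(regionA, 100), rangeChecksums(regionB, 100))
	fmt.Println("ranges to re-check row by row:", diff)
}
```

Checksumming by key range rather than by row keeps the comparison cheap; only the ranges that differ need a row-by-row follow-up.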

Real‑time monitoring provides a unified dashboard where DBAs can see the health of all instances. With a single click, the platform runs a suite of commands that capture process list snapshots, SQL execution plans, and lock analysis, and it plots historical trends, which dramatically speeds up fault localisation.

Automated alarm handling covers space exhaustion, uncommitted transactions, long‑running queries, CPU/connection overload, and replication repair. For replication repair, it replaces manual “skip” tactics with precise binlog‑based fixes.
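One way to picture this kind of automated handling is a dispatcher that routes each alarm type to a handler and escalates to a DBA only when no handler exists or the handler fails. The sketch below is hypothetical: the `Alarm` type, the alarm kind names, and the stub handlers are assumptions, not the VDBA implementation.

```go
package main

import "fmt"

// Alarm is a hypothetical, simplified alert record.
type Alarm struct {
	Kind     string // e.g. "disk_full", "long_query", "replication_broken"
	Instance string // host:port of the affected instance
}

// Handler tries to resolve one kind of alarm and reports what it did.
type Handler func(a Alarm) (action string, err error)

// Dispatcher routes alarms to the handler registered for their kind.
type Dispatcher struct {
	handlers map[string]Handler
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{handlers: make(map[string]Handler)}
}

func (d *Dispatcher) Register(kind string, h Handler) {
	d.handlers[kind] = h
}

// Handle returns the action taken and whether the alarm was resolved
// automatically. Unknown kinds and handler errors are escalated.
func (d *Dispatcher) Handle(a Alarm) (string, bool) {
	h, ok := d.handlers[a.Kind]
	if !ok {
		return "escalate to DBA: no handler for " + a.Kind, false
	}
	action, err := h(a)
	if err != nil {
		return "escalate to DBA: " + err.Error(), false
	}
	return action, true
}

// newDefaultDispatcher registers stub handlers; real ones would talk to the
// instance through an agent instead of returning a description.
func newDefaultDispatcher() *Dispatcher {
	d := NewDispatcher()
	d.Register("disk_full", func(a Alarm) (string, error) {
		return "purged expired binlogs on " + a.Instance, nil
	})
	d.Register("long_query", func(a Alarm) (string, error) {
		return "killed long-running query on " + a.Instance, nil
	})
	return d
}

func main() {
	d := newDefaultDispatcher()
	alarms := []Alarm{
		{Kind: "long_query", Instance: "db1:3306"},
		{Kind: "mystery", Instance: "db2:3306"},
	}
	for _, a := range alarms {
		action, auto := d.Handle(a)
		fmt.Printf("%s on %s -> %s (auto=%v)\n", a.Kind, a.Instance, action, auto)
	}
}
```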

Traditional MHA solutions suffered from heavy SSH dependence, complex deployment, and fragile manager nodes. The team replaced them with an agent‑centric approach, exposing Go interfaces such as GetDBTopology(), BuildMHAConfig(), WriteRsaPublicKey(), StartMHA(), MHAProcessMonitor(), InspectMHAConfigIsOK(), StopMHA(), and SwitchMHA(). The platform now orchestrates these calls, reducing failover time from minutes to seconds.
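The article gives only the function names, so the sketch below is a guess at how they might fit together. The parameter and return types, the `Topology` struct, the call order in `SetupMHA`, and the `fakeAgent` test double are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Topology is a hypothetical view of one replication cluster.
type Topology struct {
	Master   string
	Replicas []string
}

// MHAAgent uses the method names listed in the article; signatures are assumed.
type MHAAgent interface {
	GetDBTopology() (Topology, error)
	BuildMHAConfig(t Topology) (string, error)
	WriteRsaPublicKey() error
	InspectMHAConfigIsOK(config string) (bool, error)
	StartMHA(config string) error
	MHAProcessMonitor() (bool, error)
	StopMHA() error
	SwitchMHA(newMaster string) error
}

// SetupMHA is one plausible orchestration: discover, configure, verify, start.
func SetupMHA(a MHAAgent) error {
	topo, err := a.GetDBTopology()
	if err != nil {
		return fmt.Errorf("get topology: %w", err)
	}
	cfg, err := a.BuildMHAConfig(topo)
	if err != nil {
		return fmt.Errorf("build config: %w", err)
	}
	if err := a.WriteRsaPublicKey(); err != nil {
		return fmt.Errorf("write public key: %w", err)
	}
	ok, err := a.InspectMHAConfigIsOK(cfg)
	if err != nil {
		return fmt.Errorf("inspect config: %w", err)
	}
	if !ok {
		return errors.New("MHA config failed inspection")
	}
	if err := a.StartMHA(cfg); err != nil {
		return fmt.Errorf("start MHA: %w", err)
	}
	running, err := a.MHAProcessMonitor()
	if err != nil {
		return fmt.Errorf("monitor MHA: %w", err)
	}
	if !running {
		return errors.New("MHA manager is not running after start")
	}
	return nil
}

// fakeAgent records calls so the orchestration can be exercised without MySQL.
type fakeAgent struct {
	calls    []string
	configOK bool
}

func (f *fakeAgent) rec(name string) { f.calls = append(f.calls, name) }

func (f *fakeAgent) GetDBTopology() (Topology, error) {
	f.rec("GetDBTopology")
	return Topology{Master: "db1:3306", Replicas: []string{"db2:3306"}}, nil
}
func (f *fakeAgent) BuildMHAConfig(t Topology) (string, error) {
	f.rec("BuildMHAConfig")
	return "master=" + t.Master, nil
}
func (f *fakeAgent) WriteRsaPublicKey() error { f.rec("WriteRsaPublicKey"); return nil }
func (f *fakeAgent) InspectMHAConfigIsOK(string) (bool, error) {
	f.rec("InspectMHAConfigIsOK")
	return f.configOK, nil
}
func (f *fakeAgent) StartMHA(string) error            { f.rec("StartMHA"); return nil }
func (f *fakeAgent) MHAProcessMonitor() (bool, error) { f.rec("MHAProcessMonitor"); return true, nil }
func (f *fakeAgent) StopMHA() error                   { f.rec("StopMHA"); return nil }
func (f *fakeAgent) SwitchMHA(string) error           { f.rec("SwitchMHA"); return nil }

func main() {
	agent := &fakeAgent{configOK: true}
	fmt.Println("setup error:", SetupMHA(agent))
	fmt.Println("calls:", agent.calls)
}
```

Putting the steps behind an interface lets the platform drive many clusters through the same sequence and lets the sequence be tested against a fake, which is part of what an agent-centric design buys over ad hoc SSH scripts.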

A resource‑pool and one‑click installation layer abstracts machine provisioning, allowing DBAs to focus on capacity requests while agents handle environment setup automatically.

From 2015 to 2016 the team migrated over 3,000 clusters (average 2‑3 migrations per cluster) to CDB, RDS, and a custom disaster‑recovery system, completing a 300‑cluster disaster‑recovery migration in two days using a single Go script that orchestrated the entire process.

For accidental data loss, a custom rollback service built on github.com/siddontang/go-mysql/replication parses binlogs across sharded tables, runs as a distributed service, and enables rapid, UI‑driven recovery without manual command‑line steps.
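The core idea behind binlog-based rollback is to invert each row change and replay the inverses in reverse order. The sketch below illustrates that step only. It does not use go-mysql; `RowEvent` is a hypothetical stand-in for a decoded row event, and the naive string quoting (no NULL or type handling) is for illustration, not production use.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// RowEvent is a hypothetical, already-decoded row change.
type RowEvent struct {
	Type   string            // "insert", "update", or "delete"
	Table  string
	Before map[string]string // row image before the change (update, delete)
	After  map[string]string // row image after the change (insert, update)
}

func quote(v string) string {
	return "'" + strings.ReplaceAll(v, "'", "''") + "'"
}

// sortedCols returns column names in a stable order so output is deterministic.
func sortedCols(row map[string]string) []string {
	cols := make([]string, 0, len(row))
	for c := range row {
		cols = append(cols, c)
	}
	sort.Strings(cols)
	return cols
}

// pairs renders `col`='value' items joined by sep.
func pairs(row map[string]string, sep string) string {
	var parts []string
	for _, c := range sortedCols(row) {
		parts = append(parts, "`"+c+"`="+quote(row[c]))
	}
	return strings.Join(parts, sep)
}

// inverseSQL builds the statement that undoes a single row change.
func inverseSQL(ev RowEvent) (string, error) {
	switch ev.Type {
	case "insert": // undo by deleting the inserted row
		return fmt.Sprintf("DELETE FROM %s WHERE %s", ev.Table, pairs(ev.After, " AND ")), nil
	case "delete": // undo by re-inserting the deleted row
		cols := sortedCols(ev.Before)
		names := make([]string, len(cols))
		vals := make([]string, len(cols))
		for i, c := range cols {
			names[i] = "`" + c + "`"
			vals[i] = quote(ev.Before[c])
		}
		return fmt.Sprintf("INSERT INTO %s (%s) VALUES (%s)", ev.Table,
			strings.Join(names, ", "), strings.Join(vals, ", ")), nil
	case "update": // undo by restoring the before image
		return fmt.Sprintf("UPDATE %s SET %s WHERE %s", ev.Table,
			pairs(ev.Before, ", "), pairs(ev.After, " AND ")), nil
	}
	return "", fmt.Errorf("unsupported event type %q", ev.Type)
}

// Rollback inverts events newest-first so later changes are undone first.
func Rollback(events []RowEvent) ([]string, error) {
	var stmts []string
	for i := len(events) - 1; i >= 0; i-- {
		s, err := inverseSQL(events[i])
		if err != nil {
			return nil, err
		}
		stmts = append(stmts, s)
	}
	return stmts, nil
}

func main() {
	events := []RowEvent{
		{Type: "insert", Table: "orders", After: map[string]string{"id": "1", "status": "new"}},
		{Type: "update", Table: "orders",
			Before: map[string]string{"id": "1", "status": "new"},
			After:  map[string]string{"id": "1", "status": "paid"}},
	}
	stmts, err := Rollback(events)
	if err != nil {
		panic(err)
	}
	for _, s := range stmts {
		fmt.Println(s)
	}
}
```

This approach depends on row-based binlogs that carry full before and after images; a statement-based binlog does not contain enough information to invert a change.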

The task‑scheduling service (Tinker) rewrites crontab in Go and adds per‑second granularity, centralised logging, and error‑code mapping, so DBAs can diagnose failures from a dashboard rather than digging through raw logs.

SqlReview was developed using TiDB’s parser to replace Inception, adding rules for redundant indexes, index naming, varchar length limits, DDL risk checks, and multi‑active field requirements; audit results are stored for statistical analysis of developer behaviour.
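Three of these rules can be sketched against an already-parsed table definition. The example below skips SQL parsing entirely (the real tool uses TiDB's parser for that); the `Table`, `Column`, and `Index` types, the `idx_` prefix, and the 1024-character limit are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical parsed schema types; a real tool would fill these from a parser.
type Column struct {
	Name   string
	Type   string
	Length int
}

type Index struct {
	Name    string
	Columns []string
}

type Table struct {
	Name    string
	Columns []Column
	Indexes []Index
}

const (
	maxVarcharLen = 1024   // assumed limit, not taken from the article
	indexPrefix   = "idx_" // assumed naming convention
)

// isPrefix reports whether a's columns are a leftmost prefix of b's columns.
func isPrefix(a, b []string) bool {
	if len(a) > len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// Review returns one message per rule violation.
func Review(t Table) []string {
	var out []string
	for _, c := range t.Columns {
		if strings.EqualFold(c.Type, "varchar") && c.Length > maxVarcharLen {
			out = append(out, fmt.Sprintf("column %s: VARCHAR(%d) exceeds limit %d",
				c.Name, c.Length, maxVarcharLen))
		}
	}
	for _, ix := range t.Indexes {
		if !strings.HasPrefix(ix.Name, indexPrefix) {
			out = append(out, fmt.Sprintf("index %s: name must start with %q", ix.Name, indexPrefix))
		}
	}
	// An index is redundant when another index already covers its columns as
	// a leftmost prefix. Exact duplicates are reported in both directions,
	// and uniqueness is ignored here for simplicity.
	for i, a := range t.Indexes {
		for j, b := range t.Indexes {
			if i != j && isPrefix(a.Columns, b.Columns) {
				out = append(out, fmt.Sprintf("index %s: redundant, leftmost prefix of %s", a.Name, b.Name))
			}
		}
	}
	return out
}

func main() {
	tbl := Table{
		Name:    "orders",
		Columns: []Column{{Name: "note", Type: "varchar", Length: 5000}},
		Indexes: []Index{
			{Name: "idx_a", Columns: []string{"a"}},
			{Name: "idx_a_b", Columns: []string{"a", "b"}},
			{Name: "foo", Columns: []string{"c"}},
		},
	}
	for _, v := range Review(tbl) {
		fmt.Println(v)
	}
}
```

Returning violations as data rather than printing them is what makes it possible to store audit results and analyse them later, as the article describes.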

The multi‑active release system leverages a customised gh‑ost implementation that creates temporary tables, applies DDL, streams binlog events, copies data, and performs a coordinated cut‑over across regions. A coordinator ensures all gh‑ost instances finish data copy before the final rename, keeping inter‑region latency to seconds.
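The coordination step can be sketched as a barrier: every region copies data in parallel, and no region cuts over until all copies have finished successfully. The `Migrator` interface and `fakeMigrator` below are hypothetical stand-ins for per-region gh-ost instances, not gh-ost's actual API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Migrator is a hypothetical handle on one region's online-DDL run.
type Migrator interface {
	Region() string
	CopyData() error // apply the DDL to a shadow table and copy rows into it
	CutOver() error  // swap the shadow table in with the final rename
}

// Coordinate runs CopyData in every region concurrently, waits for all of
// them, and cuts over only if every copy succeeded.
func Coordinate(ms []Migrator) error {
	errs := make([]error, len(ms))
	var wg sync.WaitGroup
	for i, m := range ms {
		wg.Add(1)
		go func(i int, m Migrator) {
			defer wg.Done()
			errs[i] = m.CopyData() // each goroutine writes only its own slot
		}(i, m)
	}
	wg.Wait() // barrier: no region renames before all regions are ready

	for i, err := range errs {
		if err != nil {
			return fmt.Errorf("region %s: copy failed, cut-over aborted: %w", ms[i].Region(), err)
		}
	}
	for _, m := range ms {
		if err := m.CutOver(); err != nil {
			return fmt.Errorf("region %s: cut-over failed: %w", m.Region(), err)
		}
	}
	return nil
}

// fakeMigrator lets the coordinator be exercised without a database.
type fakeMigrator struct {
	region   string
	failCopy bool
	copied   bool
	cut      bool
}

func (f *fakeMigrator) Region() string { return f.region }

func (f *fakeMigrator) CopyData() error {
	if f.failCopy {
		return errors.New("copy failed")
	}
	f.copied = true
	return nil
}

func (f *fakeMigrator) CutOver() error {
	f.cut = true
	return nil
}

func main() {
	regions := []Migrator{
		&fakeMigrator{region: "region-a"},
		&fakeMigrator{region: "region-b"},
	}
	fmt.Println("coordinate error:", Coordinate(regions))
}
```

This sketch does not handle a cut-over that fails in one region after succeeding in another; a production coordinator would need a recovery path for that case.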

Overall, the platform automates the majority of DBA tasks, achieves a 95% self‑service rate for releases, and positions the team to adopt DevOps, AIOps, and further automation as traditional DBA workloads shrink.

Tags: monitoring, Go, multi-active, MHA, DBA tools, Database Automation, Resource Pool, SQL review
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
