
ZanDB: An Automated Database Management Platform for Large-Scale Operations

ZanDB is Youzan's automated database-management platform. It standardizes OS and MySQL configurations, is built on a Python/Django stack with a Go-based agent and a Celery scheduler, and provides unified modules for backup, host, instance, metadata, log, and HA management. It currently automates about 70% of manual operations; planned work targets full-scale monitoring, diagnostics, and sharding automation.

Youzan Coder

The article introduces ZanDB, a database automation platform that Youzan built as rapid business growth strained its database operations: server counts and per-instance data volumes grew sharply, and maintaining the CMDB by hand became inefficient.

Standardization is presented as the foundation for scaling and automation. Youzan established OS-level standards (RAID 5 disk arrays, a write-back (WB) cache policy, the deadline I/O scheduler) and database-level standards (uniform directory structures, per-instance configuration files, consistent MySQL software versions). Rolling these standards out took two months, with SaltStack handling baseline software installation and configuration.
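A standard such as "deadline I/O scheduler" is only useful if hosts can be checked against it. The following is a minimal sketch of such a check, not ZanDB's own tooling: the function names are hypothetical, and it assumes a Linux host that exposes the scheduler list in `/sys/block/<device>/queue/scheduler`, with the active entry shown in square brackets. On newer multi-queue kernels the equivalent scheduler is named `mq-deadline`, so the expected value would need adjusting there.

```python
import re
from pathlib import Path


def active_scheduler(sysfs_line: str) -> str:
    """Return the active scheduler, which sysfs marks with [brackets]."""
    match = re.search(r"\[([^\]]+)\]", sysfs_line)
    if match:
        return match.group(1)
    # Fall back to the bare text when no entry is bracketed.
    return sysfs_line.strip()


def check_device(device: str, expected: str = "deadline") -> bool:
    """Compare a block device's active scheduler with the expected standard.

    `device` is a name such as "sda"; the sysfs path is Linux-specific.
    """
    line = Path(f"/sys/block/{device}/queue/scheduler").read_text()
    return active_scheduler(line) == expected
```

A configuration-management run could call `check_device` for each data disk and flag hosts that drift from the baseline.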

ZanDB's technical stack includes Python with Django, Percona Toolkit, a custom agent (servant), Celery, and a front end built with jQuery and Ajax. Redis serves as the cache and MySQL as the storage layer.

Phase 1 – Backup Monitoring delivers a system that gives real-time visibility into backup execution and duration, plus statistics for the past five days, so that DBAs can quickly detect failures and take corrective action.

Phase 2 – System Architecture adopts a browser/server (B/S) model with a Go-based agent (servant) deployed on each DB server. The platform and the agents communicate over HTTP, so DB hosts never need a direct, password-based connection to the metadata database, which improves robustness and security.

The platform is divided into seven functional modules: metadata management, backup management, instance management, host management, task management, log management, and daily maintenance.

Task Management implements a robust scheduler that handles time‑based (minute, hour, day, week, month) and interval‑based tasks, eliminating the need for crontab scripts on DB hosts and allowing dynamic task adjustments.
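The difference between the two task kinds comes down to how the next run time is computed. The sketch below illustrates that in plain Python; it is not ZanDB's scheduler, and the class names (`IntervalTask`, `DailyTask`) are hypothetical. The source states that Celery is part of the stack but gives no task definitions, so nothing here should be read as its actual configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IntervalTask:
    """Runs a fixed amount of time after its previous run."""
    name: str
    every: timedelta

    def next_run(self, last_run: datetime) -> datetime:
        return last_run + self.every


@dataclass
class DailyTask:
    """Runs once a day at a fixed wall-clock time."""
    name: str
    hour: int
    minute: int

    def next_run(self, now: datetime) -> datetime:
        candidate = now.replace(
            hour=self.hour, minute=self.minute, second=0, microsecond=0
        )
        # If today's slot has already passed, schedule for tomorrow.
        if candidate <= now:
            candidate += timedelta(days=1)
        return candidate
```

Because the schedule lives in data rather than in a crontab file on each host, changing `hour` or `every` in one place is enough to adjust a task.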

Backup Subsystem refactors the original xtrabackup‑based physical backup scripts into Python, adds API callbacks for status reporting, integrates with the task system, and provides alerting for failed backups.
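The callback pattern described above can be sketched as a thin wrapper around the backup command. This is an illustration under assumptions rather than the authors' script: `run_backup` and the shape of the status dictionaries are hypothetical, and `report` stands in for whatever call posts status to the platform's API.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence


def run_backup(command: Sequence[str], report: Callable[[dict], None]) -> bool:
    """Run a backup command and report its status before and after.

    `report` is a hypothetical callback, e.g. a function that POSTs
    the dictionary to the management platform.
    """
    started = time.time()
    report({"status": "running", "started_at": started})
    result = subprocess.run(command, capture_output=True, text=True)
    ok = result.returncode == 0
    report({
        "status": "success" if ok else "failed",
        "duration_seconds": round(time.time() - started, 3),
        # Keep only the tail of stderr so the report stays small.
        "error": "" if ok else result.stderr[-500:],
    })
    return ok


if __name__ == "__main__":
    events = []
    # Stand-in command so the sketch runs anywhere; a real call might pass
    # ["xtrabackup", "--backup", "--target-dir=<backup directory>"] instead.
    run_backup([sys.executable, "-c", "print('ok')"], events.append)
    print([event["status"] for event in events])
```

Reporting "running" before the command starts lets the platform alert on backups that never finish as well as on those that fail outright.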

Host Management maintains host metadata (IP, location, memory, storage) and periodically updates it via Zabbix/Open‑Falcon APIs, supporting capacity planning and proactive alerts.
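One way periodically collected host metrics can support capacity planning is a simple linear projection of disk usage. The sketch below assumes one usage sample per day and a least-squares growth rate; the function name and the method are illustrative choices, since the source does not say how ZanDB projects capacity.

```python
from typing import Optional, Sequence


def days_until_full(daily_used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until the disk is full from daily usage samples.

    Returns None when there are too few samples or usage is not growing.
    """
    n = len(daily_used_gb)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_gb) / n
    # Least-squares slope: growth in GB per day.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_used_gb))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    return max(0.0, (capacity_gb - daily_used_gb[-1]) / slope)
```

For example, samples of 100, 110, 120, and 130 GB on a 200 GB volume grow by 10 GB per day, leaving 70 GB, or about 7 days of headroom.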

Instance Management supports multi‑instance hosts, enabling operations such as adding replicas, creating master‑slave pairs, consistency checks, instance splitting, and snapshotting of instance metrics.

Log Management aggregates slow‑query logs and killed SQL statements, offering Top‑N displays, alert thresholds, and automated analysis using pt‑query‑digest.
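A Top-N display depends on grouping queries that differ only in their literal values. pt-query-digest has its own fingerprinting for this; the sketch below is a deliberately simplified stand-in, not that tool's algorithm, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def fingerprint(sql: str) -> str:
    """Collapse literals so queries differing only in values group together."""
    text = sql.strip().lower()
    text = re.sub(r"'(?:[^'\\]|\\.)*'", "?", text)  # quoted strings
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "?", text)  # standalone numbers
    text = re.sub(r"\s+", " ", text)                 # runs of whitespace
    return text


def top_n(slow_queries: Iterable[Tuple[str, float]], n: int = 3) -> List[Tuple[str, float]]:
    """Rank query fingerprints by total execution time in seconds."""
    totals: Counter = Counter()
    for sql, seconds in slow_queries:
        totals[fingerprint(sql)] += seconds
    return totals.most_common(n)
```

Ranking by total time rather than by count surfaces a query that is rare but very slow alongside one that is fast but runs constantly.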

Metadata Management tracks binlog ranges, primary‑key overflow checks, and provides a shard‑lookup service to locate instances by database, table, and shard key.
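A primary-key overflow check compares a table's next auto-increment value with the maximum its column type can hold. The following sketch shows the arithmetic for MySQL integer types; the function names and the 70% alert threshold are assumptions, and in practice the current value might be read from `information_schema.TABLES.AUTO_INCREMENT`.

```python
# Storage width in bits for MySQL integer column types.
INTEGER_BITS = {"tinyint": 8, "smallint": 16, "mediumint": 24, "int": 32, "bigint": 64}


def max_value(column_type: str, unsigned: bool = False) -> int:
    """Largest value a MySQL integer column of this type can store."""
    bits = INTEGER_BITS[column_type]
    return 2 ** bits - 1 if unsigned else 2 ** (bits - 1) - 1


def needs_attention(
    next_auto_increment: int,
    column_type: str,
    unsigned: bool = False,
    threshold: float = 0.7,  # assumed alert level, not taken from the source
) -> bool:
    """True when the key has consumed more than `threshold` of its range."""
    return next_auto_increment / max_value(column_type, unsigned) > threshold
```

A signed `int` key at 2,000,000,000 has used about 93% of its range and would be flagged, while the same value in an unsigned `int` column sits below 50%.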

Daily Maintenance automates low‑frequency, time‑consuming tasks such as bulk parameter queries, configuration changes, emergency binlog recovery, and batch SQL execution (restricted to safe DML).
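The source does not define "safe DML", so the guard below is only one plausible reading: a single INSERT, UPDATE, or DELETE statement, where UPDATE and DELETE must carry a WHERE clause. The function name and the rules are assumptions, and the check is deliberately conservative (it rejects any statement containing an extra semicolon, even inside a string literal).

```python
import re

ALLOWED_VERBS = ("insert", "update", "delete")


def is_safe_dml(statement: str) -> bool:
    """Conservative pre-check before a statement is sent to many instances."""
    sql = statement.strip().rstrip(";").strip()
    if ";" in sql:
        return False  # refuse anything that looks like multiple statements
    words = sql.lower().split()
    if not words or words[0] not in ALLOWED_VERBS:
        return False  # DDL, SELECT, and everything else is out of scope
    if words[0] in ("update", "delete") and not re.search(r"\bwhere\b", sql.lower()):
        return False  # unbounded UPDATE/DELETE would touch every row
    return True
```

A keyword check like this is a first filter only; it does not parse SQL, so a real platform would likely pair it with review or a proper parser.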

Data Operations leverages accumulated instance metrics to generate trend charts for space and memory utilization and to perform cost accounting for business units.

High Availability (HA) Management evolved from a keepalived + VIP solution to a custom Go-based HA tool named hamster, offering cluster health checks, failover, active switching, and a RESTful API, while eliminating VIP-related issues.
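The source gives no detail on how hamster decides when to fail over or which replica to promote. Purely as an illustration of the two decisions any such tool has to make, the sketch below assumes a consecutive-failure threshold for the health check and a lowest-lag rule for picking the new master; every name and rule here is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    host: str
    healthy: bool
    seconds_behind: int  # replication lag behind the master


def should_fail_over(recent_checks: List[bool], threshold: int = 3) -> bool:
    """Fail over only after `threshold` consecutive failed master checks."""
    return len(recent_checks) >= threshold and not any(recent_checks[-threshold:])


def pick_new_master(replicas: List[Replica]) -> Optional[Replica]:
    """Choose the healthy replica with the least replication lag, if any."""
    candidates = [replica for replica in replicas if replica.healthy]
    return min(candidates, key=lambda replica: replica.seconds_behind, default=None)
```

Requiring several consecutive failures avoids switching masters on a single dropped health check.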

The outlook acknowledges that ZanDB currently automates about 70% of manual operations, with future goals including sub‑second monitoring, log auditing, instance inspection, horizontal sharding, performance diagnostics, and automated slow‑query analysis to further reduce manual effort and increase developer adoption.

Tags: monitoring, operations, High Availability, task scheduling, Backup Management, Database Automation, ZanDB