How ZanDB Automates MySQL Operations at Scale: A Deep Dive
ZanDB is Youzan's comprehensive MySQL automation platform that standardizes OS and database configurations, introduces a web‑based UI, task scheduling, backup monitoring, host and instance management, log analysis, metadata services, and high‑availability features to dramatically reduce manual DBA work and improve reliability.
1. Introduction
Youzan, a leading SaaS provider for new retail, has grown from dozens of merchants to three million across retail, beauty, catering, and media. Traffic has grown explosively as a result, and the number of servers and database instances has risen sharply with it.
This surge created challenges such as rapid instance provisioning, slow‑query optimization, backup and recovery management, and the inefficiency of using Excel as a CMDB.
The article presents ZanDB, Youzan's in‑house database automation platform, designed to address these challenges.
2. Automation Preparation
2.1 Standardization
Standardization is the foundation for scaling operations. Youzan defined OS‑level standards (RAID 5 disk arrays, write‑back (WB) cache policy, the deadline I/O scheduler, SSD optimizations) and database‑level standards (a uniform directory layout, per‑instance configuration files, consistent MySQL versions, and unified parameters).
These standards were applied over two months using SaltStack to enforce software installation and file configuration.
2.2 ZanDB Technology Stack
ZanDB is built with Python Django, Percona‑Toolkit, a custom agent (servant), Celery, and a front‑end based on jQuery and Ajax. Redis is used for caching and MySQL for persistent storage.
3. Phase 1 – Backup Monitoring
Data backup is critical. The initial version replaced ad‑hoc shell scripts with a centralized backup monitoring system that provides real‑time status, execution duration, and five‑day statistics, enabling DBAs to quickly detect failures and trigger alerts.
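The five‑day statistics described above can be sketched as a small aggregation over backup records. This is a minimal illustration, not ZanDB's actual code: the `BackupRecord` fields, the instance labels, and the rule "alert on any failure in the window" are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class BackupRecord:
    instance: str    # hypothetical instance label
    day: date        # day the backup ran
    succeeded: bool
    duration_s: int  # execution time in seconds


def five_day_summary(records, today):
    """Per-instance success/failure counts and mean duration for the last 5 days."""
    window_start = today - timedelta(days=4)  # today plus the four preceding days
    stats = defaultdict(lambda: {"ok": 0, "failed": 0, "total_s": 0})
    for r in records:
        if window_start <= r.day <= today:
            s = stats[r.instance]
            s["ok" if r.succeeded else "failed"] += 1
            s["total_s"] += r.duration_s
    return {
        inst: {
            "ok": s["ok"],
            "failed": s["failed"],
            "avg_duration_s": s["total_s"] / (s["ok"] + s["failed"]),
        }
        for inst, s in stats.items()
    }


def instances_to_alert(summary):
    """Assumed alert rule: any failed backup inside the window."""
    return sorted(inst for inst, s in summary.items() if s["failed"] > 0)
```

A dashboard could render `five_day_summary` directly, while `instances_to_alert` feeds whatever alerting channel is in use.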
4. Phase 2 – Full‑Feature Automation
ZanDB adopts a B/S architecture with a Go‑based agent (servant) on database servers. The system is divided into seven modules: metadata management, backup management, instance management, host management, task management, log management, and daily maintenance.
4.1 Task System
The task scheduler coordinates backup, metadata collection, instance provisioning, and other operations. It supports time‑based (minute, hour, day, week, month) and interval‑based recurring tasks, eliminating crontab scripts and allowing dynamic adjustments.
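The difference between time‑based and interval‑based tasks can be illustrated with a few lines of Python. This is a sketch under assumptions: the function names and the in‑memory `tasks` mapping are hypothetical, and only the daily case of the time‑based schedule is shown.

```python
from datetime import datetime, timedelta


def next_interval_run(last_run, every):
    """Interval-based task: run again a fixed timedelta after the previous run."""
    return last_run + every


def next_daily_run(now, hour, minute):
    """Time-based task: next occurrence of hour:minute strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return candidate


def due_tasks(tasks, now):
    """Names of tasks whose next_run has arrived, in name order.

    `tasks` maps a task name to its next_run datetime (hypothetical structure).
    """
    return sorted(name for name, next_run in tasks.items() if next_run <= now)
```

Because the schedule is plain data rather than a crontab line on each host, changing a task's time only means updating its stored `next_run`, which is one plausible way to get the dynamic adjustment mentioned above.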
4.2 Backup Subsystem
Backups are taken with Percona XtraBackup, compressed, and copied to remote storage with rsync. The backup scripts were rewritten in Python, with API callbacks that report status and alerts that fire on failure. Scheduling moved into the task system, which removes the crontab dependency.
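The backup flow (XtraBackup, compression, rsync, status callback) might look roughly like the following. It is a sketch, not the ZanDB script: the directory, the remote target, and the `report` callback are placeholders, and the real platform presumably reports status through its own API.

```python
import subprocess


def build_backup_commands(target_dir, remote):
    """Commands for one backup run; target_dir and remote are placeholders."""
    return [
        # --backup, --compress and --target-dir are standard xtrabackup options.
        ["xtrabackup", "--backup", "--compress", f"--target-dir={target_dir}"],
        # rsync -a copies the directory recursively and preserves attributes.
        ["rsync", "-a", f"{target_dir}/", remote],
    ]


def run_backup(target_dir, remote, report, run=subprocess.run):
    """Run each step in order and call report(status, detail) exactly once.

    `report` stands in for the status callback; `run` is injectable so the
    flow can be exercised without touching a real database.
    """
    for cmd in build_backup_commands(target_dir, remote):
        result = run(cmd)
        if result.returncode != 0:
            report("failed", " ".join(cmd))  # stop at the first failing step
            return False
    report("success", target_dir)
    return True
```

Passing the runner in as a parameter keeps the failure path easy to test, which matters for a script whose main job is to notice when something went wrong.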
4.3 Host Management
Host metadata (IP, location, memory, disk) is refreshed via Zabbix/Open‑Falcon APIs, enabling capacity planning and proactive alerts for low‑space situations.
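A low‑space check over refreshed host metadata can be sketched as follows. The field names and the 20% free‑space threshold are assumptions for illustration; the monitoring API calls that would populate the data are deliberately left out.

```python
def low_space_hosts(hosts, min_free_ratio=0.2):
    """Return (ip, free_ratio) for hosts below the free-space threshold.

    `hosts` is a list of dicts with hypothetical keys
    "ip", "disk_total_gb" and "disk_used_gb".
    """
    flagged = []
    for h in hosts:
        free_ratio = (h["disk_total_gb"] - h["disk_used_gb"]) / h["disk_total_gb"]
        if free_ratio < min_free_ratio:
            flagged.append((h["ip"], round(free_ratio, 2)))
    # Tightest hosts first, so the most urgent cases lead the alert.
    return sorted(flagged, key=lambda item: item[1])
```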
4.4 Instance Management
Instance management supports multi‑instance hosts, instance listing, creation of master‑slave pairs, schema splitting, daily consistency checks, and snapshots of instance metrics for historical analysis.
4.5 Log Management
The log module collects slow‑query logs and killed‑SQL logs, provides Top‑N displays, and triggers alerts when thresholds are exceeded. Logs are parsed with pt‑query‑digest and presented alongside execution plans and table statistics.
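The Top‑N ranking and threshold alert can be sketched independently of the parsing step. The input here is a hypothetical pre‑parsed list of `(fingerprint, query_time_s)` pairs, not the actual output format of pt‑query‑digest, and the alert rule is an assumed one.

```python
from collections import Counter


def top_n_slow_queries(entries, n):
    """Rank query fingerprints by total time spent, highest first.

    `entries` is an iterable of (fingerprint, query_time_s) pairs.
    """
    totals = Counter()
    for fingerprint, query_time_s in entries:
        totals[fingerprint] += query_time_s
    return totals.most_common(n)


def should_alert(entries, max_slow_queries):
    """Assumed rule: alert when the number of slow queries exceeds a limit."""
    return len(entries) > max_slow_queries
```

Ranking by total time rather than by count surfaces the statements that cost the most overall, which is usually where optimization effort pays off first.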
4.6 Metadata Management
The metadata module manages binlog metadata, primary‑key overflow checks, and shard‑lookup services, so that the instance responsible for a given database or table can be identified quickly.
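A primary‑key overflow check comes down to comparing a table's next auto‑increment value with the maximum of its integer type. The sketch below uses MySQL's documented integer ranges; the 80% threshold, the function names, and the `routing` mapping used for shard lookup are assumptions for illustration.

```python
# Maximum values of MySQL integer types, keyed by (type, unsigned).
MAX_VALUES = {
    ("tinyint", False): 127, ("tinyint", True): 255,
    ("smallint", False): 32767, ("smallint", True): 65535,
    ("mediumint", False): 8388607, ("mediumint", True): 16777215,
    ("int", False): 2147483647, ("int", True): 4294967295,
    ("bigint", False): 9223372036854775807,
    ("bigint", True): 18446744073709551615,
}


def pk_usage(column_type, unsigned, next_auto_increment):
    """Fraction of the key space already consumed."""
    return next_auto_increment / MAX_VALUES[(column_type.lower(), unsigned)]


def at_risk(column_type, unsigned, next_auto_increment, threshold=0.8):
    """True when usage has passed the (assumed) warning threshold."""
    return pk_usage(column_type, unsigned, next_auto_increment) >= threshold


def find_instance(routing, database, table):
    """Shard lookup: `routing` maps (database, table) to an instance label."""
    return routing.get((database, table))
```

For example, a signed `INT` key at 2,000,000,000 is past 80% of its range, while the same value in an unsigned `INT` column is under half.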
4.7 Daily Maintenance
Daily maintenance automates low‑frequency, high‑cost manual tasks such as batch parameter queries, batch configuration changes, emergency binlog recovery, and SQL execution (DML prohibited).
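The "DML prohibited" restriction on SQL execution could be enforced with a guard in front of the executor. The version below is deliberately naive and is only an assumption about how such a check might work: it inspects the first keyword of a single statement and does not handle leading comments, multi‑statement input, or other data‑modifying statements.

```python
DML_KEYWORDS = {"insert", "update", "delete", "replace"}


def is_dml(statement):
    """Naive check: look only at the first keyword of a single statement."""
    words = statement.strip().split(None, 1)
    return bool(words) and words[0].lower() in DML_KEYWORDS


def filter_allowed(statements):
    """Split statements into (allowed, rejected) lists, preserving order."""
    allowed = [s for s in statements if not is_dml(s)]
    rejected = [s for s in statements if is_dml(s)]
    return allowed, rejected
```

A production guard would more likely rely on a real SQL parser or on database privileges; the keyword check only shows where the restriction sits in the flow.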
4.8 Data Operations
Aggregated instance metrics feed trend charts for space and memory utilization and cost‑allocation dashboards to aid resource planning.
4.9 High‑Availability Management
The first‑generation HA setup used keepalived with a virtual IP (VIP), which suffered from disk I/O jitter and ARP limits. The second generation uses a Go‑based HA manager (hamster) that performs cluster health checks and active/passive failover based on relay logs or GTID, fronted by a proxy layer. This removes the VIP‑related issues and supports dual‑datacenter disaster recovery.
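Two ideas behind a health‑check‑driven failover can be sketched in Python, although hamster itself is written in Go and its real logic is not described here. Both pieces are assumptions for illustration: a detector that requires several consecutive failed probes before declaring a node down (so a brief jitter does not trigger failover), and a candidate picker that prefers the replica that has applied the most transactions.

```python
class FailureDetector:
    """Declare a node down only after `threshold` consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, probe_ok):
        """Feed one probe result; return True once the node counts as down."""
        if probe_ok:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


def pick_candidate(replicas):
    """Choose the most advanced replica for promotion.

    `replicas` maps a replica name to the number of transactions it has
    applied (a hypothetical stand-in for comparing relay-log or GTID
    positions). Ties go to the alphabetically first name.
    """
    if not replicas:
        return None
    return max(sorted(replicas), key=lambda name: replicas[name])
```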
5. Outlook
ZanDB currently automates about 70% of manual DBA work. Future goals include sub‑second monitoring, log auditing, instance inspection, horizontal scaling, performance diagnostics, and automated slow‑query analysis, all aimed at further increasing developer productivity.