
How ZanDB Automates MySQL Operations at Scale: A Deep Dive

ZanDB is Youzan's comprehensive MySQL automation platform that standardizes OS and database configurations, introduces a web‑based UI, task scheduling, backup monitoring, host and instance management, log analysis, metadata services, and high‑availability features to dramatically reduce manual DBA work and improve reliability.

Efficient Ops

Dec 19, 2017


1. Introduction

Youzan, a leading SaaS provider for new‑retail, has grown from dozens of merchants to three million, spanning retail, beauty, catering, and media, causing explosive traffic growth and a massive increase in server and DB instance counts.

This surge created challenges such as rapid instance provisioning, slow‑query optimization, backup and recovery management, and the inefficiency of using Excel as a CMDB.

This article presents ZanDB, Youzan's in‑house database automation platform, which was built to address these challenges.

2. Automation Preparation

2.1 Standardization

Standardization is the foundation for scaling operations. Youzan defined OS‑level standards (RAID5 disks, WB write‑back cache, deadline I/O scheduler, SSD optimizations) and database‑level standards (uniform directory layout, per‑instance configuration files, consistent MySQL versions, and unified parameters).
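A standard is only useful if drift from it can be detected. The following is a minimal sketch of such a check, not ZanDB's actual code: the `STANDARD` dictionary, its keys, and both function names are assumptions made for illustration. The only external fact relied on is that Linux reports the active I/O scheduler in square brackets in `/sys/block/<dev>/queue/scheduler`.

```python
# Hypothetical baseline: keys and values are illustrative, not Youzan's real standard.
STANDARD = {
    "io_scheduler": "deadline",
    "raid_level": "RAID5",
    "raid_cache_policy": "WriteBack",
}


def parse_active_scheduler(text):
    """Return the active scheduler from /sys/block/<dev>/queue/scheduler content.

    The kernel marks the active entry with square brackets,
    for example "noop [deadline] cfq".
    """
    start, end = text.find("["), text.find("]")
    if start == -1 or end == -1 or end < start:
        return None
    return text[start + 1:end]


def find_drift(actual, standard=STANDARD):
    """Return {setting: (expected, found)} for every setting that deviates."""
    return {
        key: (expected, actual.get(key))
        for key, expected in standard.items()
        if actual.get(key) != expected
    }
```

A host that reports `cfq` as its scheduler would show up in the result of `find_drift` with the expected and the observed value side by side, which is enough to drive a remediation run.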

These standards were applied over two months using SaltStack to enforce software installation and file configuration.

2.2 ZanDB Technology Stack

ZanDB is built with Python Django, Percona‑Toolkit, a custom agent (servant), Celery, and a front‑end based on jQuery and Ajax. Redis is used for caching and MySQL for persistent storage.

3. Phase 1 – Backup Monitoring

Data backup is critical. The initial version replaced ad‑hoc shell scripts with a centralized backup monitoring system that provides real‑time status, execution duration, and five‑day statistics, enabling DBAs to quickly detect failures and trigger alerts.
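The five‑day statistics can be pictured as a small aggregation over backup records. This sketch assumes a hypothetical record shape of `(day, instance, succeeded, duration_seconds)`; the function names are illustrative and the article does not describe how ZanDB stores or computes these figures.

```python
from collections import defaultdict
from datetime import timedelta


def five_day_summary(records, today):
    """Aggregate backup results per instance over the last five days.

    records: iterable of (day, instance, succeeded, duration_seconds),
    where day is a datetime.date. The tuple layout is an assumption.
    """
    window_start = today - timedelta(days=4)  # today plus the four days before it
    summary = defaultdict(lambda: {"ok": 0, "failed": 0, "total_seconds": 0})
    for day, instance, succeeded, seconds in records:
        if not (window_start <= day <= today):
            continue
        entry = summary[instance]
        entry["ok" if succeeded else "failed"] += 1
        entry["total_seconds"] += seconds
    return dict(summary)


def instances_to_alert(summary):
    """Return the sorted names of instances with at least one failed backup."""
    return sorted(name for name, stats in summary.items() if stats["failed"] > 0)
```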

4. Phase 2 – Full‑Feature Automation

ZanDB adopts a B/S architecture with a Go‑based agent (servant) on database servers. The system is divided into seven modules: metadata management, backup management, instance management, host management, task management, log management, and daily maintenance.

4.1 Task System

The task scheduler coordinates backup, metadata collection, instance provisioning, and other operations. It supports time‑based (minute, hour, day, week, month) and interval‑based recurring tasks, eliminating crontab scripts and allowing dynamic adjustments.
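The difference between time‑based and interval‑based tasks comes down to how the next run time is computed. The sketch below shows one way to do that with the standard library; the function names and signatures are assumptions, and ZanDB's real scheduler (the stack mentions Celery) may work quite differently.

```python
from datetime import datetime, timedelta


def next_interval_run(last_run, every_seconds):
    """Interval-based: run again a fixed number of seconds after the last run."""
    return last_run + timedelta(seconds=every_seconds)


def next_daily_run(now, hour, minute):
    """Time-based (daily): next occurrence of hour:minute strictly after now."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def next_weekly_run(now, weekday, hour, minute):
    """Time-based (weekly): weekday uses datetime.weekday(), so Monday is 0."""
    candidate = next_daily_run(now, hour, minute)
    while candidate.weekday() != weekday:
        candidate += timedelta(days=1)
    return candidate
```

Because the schedule is plain data rather than a crontab line on each host, changing a task's time only means storing a different `hour` and `minute` and recomputing the next run.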

4.2 Backup Subsystem

Backups use Percona XtraBackup, compression, and rsync to remote storage. The backup scripts were rewritten in Python so that they report status through API callbacks and send alerts on failure. Running them from the task system removes the dependency on crontab.
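A backup script that reports its own status might be structured as a list of steps plus a callback. This is a rough sketch under stated assumptions: `build_backup_steps`, `run_steps`, and the `report` callback are hypothetical names, the paths and remote target are placeholders, and the article does not say which XtraBackup or rsync options Youzan actually uses.

```python
import subprocess


def build_backup_steps(target_dir, remote):
    """Return the commands for a compressed backup followed by a remote copy.

    The options shown (--backup, --compress, --target-dir, rsync -a) are
    commonly used ones; the real job may use a different set.
    """
    return [
        ["xtrabackup", "--backup", "--compress", "--target-dir=" + target_dir],
        ["rsync", "-a", target_dir + "/", remote],
    ]


def run_steps(steps, report, run=subprocess.run):
    """Run each step in order and report progress through a callback.

    report(status, name) stands in for an API call back to the platform.
    Stops at the first step that exits with a non-zero return code.
    """
    for step in steps:
        report("running", step[0])
        result = run(step)
        if result.returncode != 0:
            report("failed", step[0])
            return False
    report("succeeded", "backup")
    return True
```

Passing `run` as a parameter keeps the function testable without a database host: a stub that returns an object with a `returncode` attribute is enough to exercise both the success and the failure path.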

4.3 Host Management

Host metadata (IP, location, memory, disk) is refreshed via Zabbix/Open‑Falcon APIs, enabling capacity planning and proactive alerts for low‑space situations.
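Proactive low‑space alerting can be as simple as extrapolating recent disk usage. The following sketch uses a straight line between the first and last sample; the function names, the sample format, and the 30‑day threshold are assumptions for illustration, not values from the article.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until the disk is full by linear extrapolation.

    samples: list of (day_index, used_gb) ordered by day, at least two
    points on different days. Returns None when usage is flat or shrinking.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    growth_per_day = (last_used - first_used) / (last_day - first_day)
    if growth_per_day <= 0:
        return None
    return (capacity_gb - last_used) / growth_per_day


def needs_capacity_alert(samples, capacity_gb, threshold_days=30):
    """True when the projected time to a full disk is within the threshold."""
    remaining = days_until_full(samples, capacity_gb)
    return remaining is not None and remaining <= threshold_days
```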

4.4 Instance Management

The instance module supports multi‑instance hosts, instance listing, creation of master‑slave pairs, schema splitting, daily consistency checks, and snapshots of instance metrics for historical analysis.

4.5 Log Management

Collects slow‑query logs and killed‑SQL logs, provides Top‑N displays, and triggers alerts when thresholds are exceeded. Logs are parsed with pt‑query‑digest and presented with execution plans and table statistics.
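A Top‑N display depends on grouping queries that differ only in their literal values. pt‑query‑digest does this with its own fingerprinting rules; the sketch below is a much simpler approximation written for illustration, and the `fingerprint` and `top_n` names are assumptions rather than part of any tool.

```python
import re
from collections import Counter


def fingerprint(sql):
    """Reduce a query to a rough shape by masking literals.

    Simplified on purpose: lowercases the text, replaces single-quoted
    strings and standalone numbers with "?", and collapses whitespace.
    """
    sql = sql.strip().lower()
    sql = re.sub(r"'[^']*'", "?", sql)
    sql = re.sub(r"\b\d+\b", "?", sql)
    return re.sub(r"\s+", " ", sql)


def top_n(entries, n):
    """Rank query shapes by total execution time.

    entries: iterable of (sql, query_time_seconds).
    Returns a list of (fingerprint, total_seconds), slowest first.
    """
    totals = Counter()
    for sql, seconds in entries:
        totals[fingerprint(sql)] += seconds
    return totals.most_common(n)
```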

4.6 Metadata Management

Manages binlog metadata, primary‑key overflow checks, and shard‑lookup services, allowing rapid identification of the instance responsible for a given database/table.
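A primary‑key overflow check compares the next auto‑increment value with the largest value the column type can hold. The sketch below relies on the standard MySQL integer widths; the function names, the input tuple layout, and the 80 % threshold are assumptions made for illustration.

```python
# Storage width in bits for MySQL integer types.
BITS = {"tinyint": 8, "smallint": 16, "mediumint": 24, "int": 32, "bigint": 64}


def column_max(type_name, unsigned):
    """Largest value an integer column of this type can store."""
    bits = BITS[type_name]
    return 2 ** bits - 1 if unsigned else 2 ** (bits - 1) - 1


def at_risk(columns, threshold=0.8):
    """Return the tables whose key space is used beyond the threshold.

    columns: iterable of (table, type_name, unsigned, next_auto_increment).
    The tuple layout and the default threshold are hypothetical.
    """
    return [
        table
        for table, type_name, unsigned, next_value in columns
        if next_value / column_max(type_name, unsigned) >= threshold
    ]
```

A signed `int` key at 1.9 billion is already past 88 % of its range, so a check like this flags it well before inserts start to fail.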

4.7 Daily Maintenance

Automates low‑frequency, high‑cost manual tasks such as batch parameter queries, batch configuration changes, emergency binlog recovery, and SQL execution (DML prohibited).

4.8 Data Operations

Aggregated instance metrics feed trend charts for space and memory utilization and cost‑allocation dashboards to aid resource planning.

4.9 High‑Availability Management

Initial HA used keepalived + VIP, which suffered from disk I/O jitter and ARP limits. The second generation employs a Go‑based HA manager (hamster) with cluster health checks, active/passive failover via relay‑log or GTID, and a proxy layer, eliminating VIP‑related issues and supporting dual‑datacenter disaster recovery.
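Two decisions sit at the core of any failover manager: when to declare the master dead, and which replica to promote. The sketch below illustrates both in a heavily simplified form. It is not hamster's logic: the function names, the three‑probe threshold, and the use of a single `executed` counter in place of real GTID sets or relay‑log positions are all assumptions.

```python
def should_fail_over(probe_results, threshold=3):
    """True when the most recent `threshold` health probes all failed.

    probe_results: list of booleans, oldest first, True meaning healthy.
    Requiring consecutive failures avoids reacting to a single blip.
    """
    recent = probe_results[-threshold:]
    return len(recent) == threshold and not any(recent)


def pick_candidate(replicas):
    """Choose the healthy replica that has applied the most transactions.

    replicas: list of dicts with "name", "healthy", and "executed", where
    "executed" is a simplified stand-in for replication progress.
    Returns None when no healthy replica exists.
    """
    healthy = [r for r in replicas if r["healthy"]]
    if not healthy:
        return None
    return max(healthy, key=lambda r: r["executed"])["name"]
```

Promoting the most advanced replica keeps the amount of data that has to be recovered from relay logs or GTID gaps as small as possible.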

5. Outlook

ZanDB currently automates about 70 % of manual DBA work; future goals include sub‑second monitoring, log auditing, instance inspection, horizontal scaling, performance diagnostics, and automated slow‑query analysis to further increase developer productivity.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Task Scheduling ZanDB Backup Monitoring MySQL automation

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
