Evolution of Vivo's Database Backup and Recovery System
Vivo replaced its fragile Python‑Celery backup system on GlusterFS—characterized by two‑day MySQL backups, single‑point failures, and weak security—with a Java‑based architecture using a Redis cluster and object storage, adding automated copying, verification, point‑in‑time recovery and migration, cutting backup windows to about ten hours and achieving near‑100 % success.
This document describes the evolution of the backup and recovery system used by Vivo's Internet database services, covering the original architecture, its shortcomings, and the redesign that introduced new technologies and processes.
1. Overview
Vivo operates MySQL, MongoDB and TiDB clusters. The legacy backup‑recovery system (the "old system") was built with Python, Celery and GlusterFS. It suffered from low availability, long backup windows (up to two days for a full MySQL backup), and fragile file‑system permissions.
2. Old Backup‑Recovery System
The old system used logical backups for MySQL/TiDB (Mydumper) and for MongoDB (mongodump), and physical backups for MySQL with Percona XtraBackup. Backup tasks were dispatched via Celery, and the backup files were stored on a GlusterFS volume shared between two data‑center sites (A and B). The system lacked high availability for Redis and did not enforce strict access control on the backup files.
Key components:
Backup module – schedules daily logical/physical backups.
Restore module – performs logical or physical restores.
Copy module – replicates backup files between A and B sites.
2.1 MySQL Backup Tools
Logical backup: Mydumper (C‑based, multithreaded, supports consistent snapshots, compression, and splitting).
Physical backup: Xtrabackup (fast, supports compression, streaming, minimal impact on the running database).
Typical MySQL backup command construction (shown in the original source):
baseDir = fmt.Sprintf("/data/mysql%d", port)
args = append(args, fmt.Sprintf("--defaults-file=%s/conf/my.cnf", baseDir))
args = append(args, fmt.Sprintf("--datadir=%s/data", baseDir))
args = append(args, fmt.Sprintf("--socket=%s/run/mysql.sock", baseDir))
args = append(args, fmt.Sprintf("--user=%s", user))
args = append(args, fmt.Sprintf("--password=%s", pwd))
args = append(args, "--slave-info")
args = append(args, fmt.Sprintf("--ftwrl-wait-timeout=%d", 250))
args = append(args, fmt.Sprintf("--open-files-limit=%d", 204800))
args = append(args, "--stream=xbstream")
args = append(args, "--backup")
args = append(args, fmt.Sprintf("--parallel=%d", parallel))
args = append(args, fmt.Sprintf("--throttle=%d", throttle))
args = append(args, "--compress")
// Incremental backup
args = append(args, fmt.Sprintf("--incremental-lsn=%s", incrLsn))
Physical backup of MySQL uses Xtrabackup with the --ftwrl-wait-timeout and --ftwrl-wait-threshold parameters to avoid long‑running lock conflicts:
--ftwrl-wait-timeout 0 # no wait, abort if lock cannot be obtained
--ftwrl-wait-threshold 30 # wait up to 30 s before issuing FLUSH TABLES WITH READ LOCK
2.2 TiDB Backup
TiDB uses Mydumper for logical backups (when data < 20 GB) and the official br tool for physical backups (≥ 20 GB). Incremental backups with br require the TiDB GC lifetime to be raised to 48 h so that older data versions are still retained.
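The size‑based tool selection described above can be sketched in Go; the function name and signature are illustrative, not taken from the original system:

```go
package main

import "fmt"

// chooseTiDBBackupTool picks a backup tool by data size, mirroring the
// 20 GB threshold described above. Names here are illustrative.
func chooseTiDBBackupTool(sizeGB float64) string {
	if sizeGB < 20 {
		return "mydumper" // logical backup for small datasets
	}
	return "br" // official physical backup tool for large datasets
}

func main() {
	fmt.Println(chooseTiDBBackupTool(5))   // small instance → logical
	fmt.Println(chooseTiDBBackupTool(200)) // large instance → physical
}
```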
2.3 MongoDB Backup
Logical backup is performed with mongodump, but large instances (> 1 TB) are backed up physically using db.fsyncLock() to freeze the storage engine, followed by a tar of the data files. Example commands:
db.fsyncLock()
// capture latest oplog timestamp
next_ts = db.oplog.rs.find().sort({$natural:-1}).limit(1)
// create tar archive
tar -cf backup.tar /data/db
db.fsyncUnlock()
Incremental MongoDB backup captures the oplog:
mongodump --host=127.0.0.1 --port=27010 \
--username=mg_backup --password=123ASD123 \
--gzip --authenticationDatabase=admin -d local -c oplog.rs \
-q '{ts:{$gt:Timestamp("next_ts")}}' --archive=oplog.inc_2
Physical restore of a MongoDB full backup:
tar -xf full_backup.tar -C /var/lib/mongo
Point‑in‑time restore uses mongorestore --archive=65.gzip --port=11303 --gzip --oplogReplay.
2.4 File System
The old system stored backups on GlusterFS, which offered high scalability but weak permission controls. A‑site files were mounted on the backup controller and on the physical backup machines, exposing them to any DBA with SSH access.
2.5 Problems of the Old System
Low backup efficiency (MySQL full backup took two days).
Logical backup could not keep up with data growth, especially for MongoDB.
Single‑point failures in the A‑site file system.
Insufficient high‑availability for Redis and other components.
Poor security – unrestricted access to backup files.
3. New Backup‑Recovery System
The redesign introduced Java services, a Redis Cluster for high availability, and object storage (Vivo’s own cloud storage) to replace GlusterFS. The new architecture stores backup files in multiple data‑centers, eliminates direct mount points, and enforces strict access control.
Key improvements:
Java + Redis Cluster replaces the Python + single‑node Redis stack.
Object storage provides twelve nines (99.9999999999 %) of durability and 99.995 % availability, and eliminates manual file‑system management.
Backup strategy: logical backup for data < 20 GB, physical backup for larger datasets, with automatic selection based on size.
Enhanced copy module that transfers files through an intermediate “transfer machine” to avoid I/O spikes on production nodes.
Backup verification module that records pre‑ and post‑backup metrics (row counts for logical backups, directory size for physical backups) and validates <10 % deviation.
Point‑in‑time recovery workflow that extracts the required binlog/GTID position, fetches the appropriate backup from object storage, and restores to the target instance.
Automated migration module that triggers instance migration when disk usage exceeds 88 %, creates tickets, and closes the loop with expansion, master‑slave switch, DNS update, and resource reclamation.
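The verification rule above (reject a backup whose post‑backup metric deviates by 10 % or more from the pre‑backup value) can be sketched as a small Go helper; the name and signature are assumptions, not the system's actual API:

```go
package main

import (
	"fmt"
	"math"
)

// withinDeviation reports whether the post-backup metric (row count for
// logical backups, directory size for physical backups) deviates less than
// 10 % from the pre-backup metric. Illustrative helper, not the real API.
func withinDeviation(before, after float64) bool {
	if before == 0 {
		return after == 0 // no baseline to compare against; require exact match
	}
	return math.Abs(after-before)/before < 0.10
}

func main() {
	fmt.Println(withinDeviation(1000, 1050)) // 5 % deviation → true (valid)
	fmt.Println(withinDeviation(1000, 880))  // 12 % deviation → false (invalid)
}
```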
3.1 Performance Gains
After migration:
MySQL backup completes before 10 AM (≈ 10 hours) with a 100 % success rate.
TiDB physical backup (using br) reaches a 100 % success rate; clusters on older versions still use Mydumper, but with dramatically higher reliability.
MongoDB physical backup also achieves a 100 % success rate, and more than six successful physical restores have been completed.
3.2 Additional Modules
New components include:
Copy module – maintains two replicas across sites.
Verification module – validates backup integrity before and after execution.
Point‑in‑time recovery – supports binlog/GTID based restores.
Automated migration – reduces DBA manual effort and improves overall operational efficiency.
3.3 Remaining Challenges
MongoDB sharded cluster backup still relies on physical copies and requires per‑shard oplog replay, which is complex.
Data‑recovery speed remains limited by full‑restore + binlog replay; logical‑only restores for single tables are not yet automated.
4. Summary
The new system greatly improves security (no direct mount access, stricter permissions), efficiency (backup window reduced from two days to ~10 hours), and functionality (point‑in‑time recovery, automated migration, verification). The involvement of Java developers and the shift to object storage have enabled rapid feature delivery and stable operation.
vivo Internet Technology