
Evolution of Vivo's Database Backup and Recovery System

Vivo replaced its fragile Python‑Celery backup system on GlusterFS—characterized by two‑day MySQL backups, single‑point failures, and weak security—with a Java‑based architecture using a Redis cluster and object storage, adding automated copying, verification, point‑in‑time recovery and migration, cutting backup windows to about ten hours and achieving near‑100 % success.

vivo Internet Technology

This document describes the evolution of the backup and recovery system used by Vivo's Internet database services, covering the original architecture, its shortcomings, and the redesign that introduced new technologies and processes.

1. Overview

Vivo operates MySQL, MongoDB and TiDB clusters. The legacy backup‑recovery system (the "old system") was built with Python, Celery and GlusterFS. It suffered from low availability, long backup windows (up to two days for a full MySQL backup), and fragile file‑system permissions.

2. Old Backup‑Recovery System

The old system used logical backups for MySQL/TiDB (Mydumper) and MongoDB (mongodump), and physical backups for MySQL with Percona Xtrabackup. Backup tasks were dispatched via Celery, and the backup files were stored on a GlusterFS volume shared between two data-center sites (A and B). The system lacked high availability for Redis and did not enforce strict access control on the backup files.

Key components:

Backup module – schedules daily logical/physical backups.

Restore module – performs logical or physical restores.

Copy module – replicates backup files between A and B sites.

2.1 MySQL Backup Tools

Logical backup: Mydumper (C‑based, multithreaded, supports consistent snapshots, compression, and splitting).

Physical backup: Xtrabackup (fast, supports compression, streaming, minimal impact on the running database).

Typical MySQL backup command construction (shown in the original source):

baseDir = fmt.Sprintf("/data/mysql%d", port)
args = append(args, fmt.Sprintf("--defaults-file=%s/conf/my.cnf", baseDir))
args = append(args, fmt.Sprintf("--datadir=%s/data", baseDir))
args = append(args, fmt.Sprintf("--socket=%s/run/mysql.sock", baseDir))
args = append(args, fmt.Sprintf("--user=%s", user))
args = append(args, fmt.Sprintf("--password=%s", pwd))
args = append(args, "--slave-info")
args = append(args, fmt.Sprintf("--ftwrl-wait-timeout=%d", 250))
args = append(args, fmt.Sprintf("--open-files-limit=%d", 204800))
args = append(args, "--stream=xbstream")
args = append(args, "--backup")
args = append(args, fmt.Sprintf("--parallel=%d", parallel))
args = append(args, fmt.Sprintf("--throttle=%d", throttle))
args = append(args, "--compress")
// Incremental backup
args = append(args, fmt.Sprintf("--incremental-lsn=%s", incrLsn))

Physical backup of MySQL uses Xtrabackup with the --ftwrl-wait-timeout and --ftwrl-wait-threshold parameters to avoid long‑running lock conflicts:

--ftwrl-wait-timeout 0    # do not wait for blocking queries; issue FLUSH TABLES WITH READ LOCK immediately
--ftwrl-wait-threshold 30 # queries running 30 s or longer count as long-running and are waited on before FTWRL

2.2 TiDB Backup

TiDB uses Mydumper for logical backups (data < 20 GB) and the official br tool for physical backups (≥ 20 GB). For incremental backups, br requires the GC lifetime to be extended to 48 h so that the older MVCC versions it reads are still available.
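The size threshold above amounts to a simple selection rule. The sketch below is a hypothetical illustration of it; the function and constant names are assumptions, not Vivo's actual code:

```go
package main

import "fmt"

// logicalBackupLimitGB is the threshold from the text: data smaller than
// 20 GB is backed up logically with Mydumper, larger data physically with br.
const logicalBackupLimitGB = 20

// chooseTiDBBackupTool is a hypothetical sketch of the size-based tool
// selection described above.
func chooseTiDBBackupTool(dataSizeGB int) string {
	if dataSizeGB < logicalBackupLimitGB {
		return "mydumper" // logical backup
	}
	return "br" // physical backup via the official br tool
}

func main() {
	fmt.Println(chooseTiDBBackupTool(5))  // small instance → mydumper
	fmt.Println(chooseTiDBBackupTool(80)) // large instance → br
}
```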

2.3 MongoDB Backup

Logical backup is performed with mongodump, but large instances (> 1 TB) are backed up physically using db.fsyncLock() to freeze the storage engine, followed by a tar of the data files. Example commands:

// In the mongo shell: block writes and flush to disk
db.fsyncLock()
// Capture the latest oplog timestamp for the next incremental backup
next_ts = db.oplog.rs.find().sort({$natural:-1}).limit(1)
// On the host shell: archive the data directory
tar -cf backup.tar /data/db
// Back in the mongo shell: resume writes
db.fsyncUnlock()

Incremental MongoDB backup captures the oplog:

mongodump --host=127.0.0.1 --port=27010 \
    --username=mg_backup --password=123ASD123 \
    --gzip --authenticationDatabase=admin -d local -c oplog.rs \
    -q '{ts:{$gt:Timestamp("next_ts")}}' --archive=oplog.inc_2
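In the same argument-building style as the MySQL snippet earlier, the incremental mongodump invocation could be assembled in Go. This is a hedged sketch: the function is hypothetical, and next_ts remains a placeholder exactly as in the command above.

```go
package main

import "fmt"

// buildOplogDumpArgs assembles the mongodump arguments for an incremental
// oplog backup, mirroring the command shown above. nextTS is the oplog
// timestamp captured at the end of the previous backup (a placeholder here,
// as in the source).
func buildOplogDumpArgs(host string, port int, user, pwd, nextTS, archive string) []string {
	return []string{
		fmt.Sprintf("--host=%s", host),
		fmt.Sprintf("--port=%d", port),
		fmt.Sprintf("--username=%s", user),
		fmt.Sprintf("--password=%s", pwd),
		"--gzip",
		"--authenticationDatabase=admin",
		"-d", "local",
		"-c", "oplog.rs",
		"-q", fmt.Sprintf(`{ts:{$gt:Timestamp("%s")}}`, nextTS),
		fmt.Sprintf("--archive=%s", archive),
	}
}

func main() {
	args := buildOplogDumpArgs("127.0.0.1", 27010, "mg_backup", "123ASD123", "next_ts", "oplog.inc_2")
	fmt.Println(args)
}
```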

Physical restore of a MongoDB full backup:

tar -xf full_backup.tar -C /var/lib/mongo

Point‑in‑time restore uses mongorestore --archive=65.gzip --port=11303 --gzip --oplogReplay.

2.4 File System

The old system stored backups on GlusterFS, which offered high scalability but weak permission controls. A‑site files were mounted on the backup controller and on the physical backup machines, exposing them to any DBA with SSH access.

2.5 Problems of the Old System

Low backup efficiency (MySQL full backup took two days).

Logical backup could not keep up with data growth, especially for MongoDB.

Single‑point failures in the A‑site file system.

Insufficient high‑availability for Redis and other components.

Poor security – unrestricted access to backup files.

3. New Backup‑Recovery System

The redesign introduced Java services, a Redis Cluster for high availability, and object storage (Vivo’s own cloud storage) to replace GlusterFS. The new architecture stores backup files in multiple data‑centers, eliminates direct mount points, and enforces strict access control.

Key improvements:

Java + Redis Cluster replaces the Python + single‑node Redis stack.

Object storage provides twelve nines (99.9999999999 %) of durability and 99.995 % availability, and eliminates manual file‑system management.

Backup strategy: logical backup for data < 20 GB, physical backup for larger datasets, with automatic selection based on size.

Enhanced copy module that transfers files through an intermediate “transfer machine” to avoid I/O spikes on production nodes.

Backup verification module that records pre‑ and post‑backup metrics (row counts for logical backups, directory size for physical backups) and validates <10 % deviation.

Point‑in‑time recovery workflow that extracts the required binlog/GTID position, fetches the appropriate backup from object storage, and restores to the target instance.

Automated migration module that triggers instance migration when disk usage exceeds 88 %, creates tickets, and closes the loop with expansion, master‑slave switch, DNS update, and resource reclamation.
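The verification rule described above (accept a backup only if the pre‑ and post‑backup metric deviate by less than 10 %) can be sketched as a small check; the helper name and zero handling are assumptions, not Vivo's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// withinDeviation reports whether the post-backup metric (row count for
// logical backups, directory size for physical ones) deviates from the
// pre-backup metric by less than the allowed fraction (0.10 for the
// <10 % rule described above). Hypothetical helper.
func withinDeviation(pre, post, allowed float64) bool {
	if pre == 0 {
		return post == 0 // degenerate case: empty before, must be empty after
	}
	return math.Abs(post-pre)/pre < allowed
}

func main() {
	fmt.Println(withinDeviation(1_000_000, 950_000, 0.10)) // 5 % deviation → true
	fmt.Println(withinDeviation(1_000_000, 850_000, 0.10)) // 15 % deviation → false
}
```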

3.1 Performance Gains

After migration:

MySQL backup completes before 10 AM (≈ 10 hours) with a 100 % success rate.

TiDB physical backup (using br) reaches 100 % success; clusters on older TiDB versions still use Mydumper, now with far higher reliability.

MongoDB physical backup also achieves 100 % success, and more than six successful restores have been performed.

3.2 Additional Modules

New components include:

Copy module – maintains two replicas across sites.

Verification module – validates backup integrity before and after execution.

Point‑in‑time recovery – supports binlog/GTID based restores.

Automated migration – reduces DBA manual effort and improves overall operational efficiency.

3.3 Remaining Challenges

MongoDB sharded cluster backup still relies on physical copies and requires per‑shard oplog replay, which is complex.

Data‑recovery speed remains limited by full‑restore + binlog replay; logical‑only restores for single tables are not yet automated.

4. Summary

The new system greatly improves security (no direct mount access, stricter permissions), efficiency (backup window reduced from two days to ~10 hours), and functionality (point‑in‑time recovery, automated migration, verification). The involvement of Java developers and the shift to object storage have enabled rapid feature delivery and stable operation.

Tags: distributed systems, performance optimization, MySQL, TiDB, data recovery, database backup, MongoDB
Written by vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.