Using DataX for Efficient MySQL Data Synchronization
This article provides a comprehensive guide on using Alibaba's open‑source DataX tool for efficient offline synchronization between heterogeneous databases such as MySQL, covering its architecture, installation on Linux, job configuration, full‑ and incremental data transfer, and practical code examples.
Introduction
Our company has a project with roughly 50 million records, and its report figures were inaccurate. The business database and the reporting database sit on separate instances, so a direct cross‑database SQL sync was impossible. Initial attempts with mysqldump and file‑based transfer both proved inefficient:
mysqldump: the backup and the restore each take time, and rows written while the backup runs are missed, so the two copies drift out of sync.
File‑based transfer: far too slow; three hours synced only about two thousand rows.
After researching, we discovered DataX, an open‑source data integration tool that offers high speed and reliable synchronization.
1. DataX Overview
DataX is the open‑source version of Alibaba Cloud DataWorks data integration, designed for offline data synchronization across heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, etc.
Transforms complex mesh sync topologies into a star‑shaped data flow, acting as the middle transport layer.
Adding a new data source only requires plugging it into DataX, enabling seamless integration.
1.1 DataX 3.0 Framework Design
DataX follows a Framework + Plugin architecture, abstracting data source reading and writing as Reader/Writer plugins.
Reader (collector): collects data from the source and sends it to the Framework.
Writer (writer): pulls data from the Framework and writes it to the destination.
Framework (middleman): connects Reader and Writer, handling buffering, flow control, concurrency, and data conversion.
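This Reader/Framework/Writer split is visible directly in a DataX job file: every job pairs one reader plugin with one writer plugin, while the Framework's behavior (channel concurrency, rate limiting) lives under `setting`. A schematic sketch (the plugin names are just examples; the `parameter` bodies are elided):

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 3 }
    },
    "content": [
      {
        "reader": { "name": "mysqlreader", "parameter": {} },
        "writer": { "name": "hdfswriter",  "parameter": {} }
      }
    ]
  }
}
```

Swapping either plugin name is all it takes to retarget a job at a different source or destination, which is what makes the star topology practical.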
1.2 DataX 3.0 Core Architecture
A DataX job represents a single synchronization task. Upon receiving a job, DataX launches a process that manages data cleaning, task splitting, and TaskGroup management.
The Job splits into multiple Tasks according to the source's split strategy, so they can execute concurrently.
The Scheduler assembles Tasks into TaskGroups according to the configured concurrency.
Each Task runs a Reader → Channel → Writer thread chain.
The Job monitors all TaskGroups and exits once they all complete.
DataX scheduling process:
The Job splits into Tasks, then computes the number of TaskGroups needed as configured concurrency ÷ channels per TaskGroup (5 by default).
For example, 100 Tasks run at a configured concurrency of 20 channels yields 20 ÷ 5 = 4 TaskGroups, each running its 25 assigned Tasks with 5 concurrent channels.
2. Using DataX for Data Synchronization
Prerequisites: JDK 1.8+, Python (2.x or 3.x), and Apache Maven 3.x (only required when compiling DataX from source).
Installation steps for JDK, DataX, MySQL, and creating test tables are provided, followed by granting privileges and creating a stored procedure to generate test data.
[root@MySQL-1 ~]# tar zxf jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# mv jdk1.8.0_181 /usr/local/java
[root@MySQL-1 ~]# cat >> /etc/profile << 'END'
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
[root@MySQL-1 ~]# source /etc/profile
[root@MySQL-1 ~]# java -version
[root@MySQL-1 ~]# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
[root@MySQL-1 ~]# tar zxf datax.tar.gz -C /usr/local/
[root@MySQL-1 ~]# rm -rf /usr/local/datax/plugin/*/._*    # delete hidden files
MariaDB [(none)]> create database `course-study`;
Query OK, 1 row affected (0.00 sec)
MariaDB [(none)]> create table `course-study`.t_member(ID int,Name varchar(20),Email varchar(30));
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> grant all privileges on *.* to root@'%' identified by '123123';
MariaDB [(none)]> flush privileges;
MariaDB [(none)]> DELIMITER $$
CREATE PROCEDURE test()
BEGIN
declare A int default 1;
while (A < 3000000) do
insert into `course-study`.t_member values(A,concat('LiSa',A),concat('LiSa',A,'@163.com'));
set A = A + 1;
END while;
END $$
DELIMITER ;
MariaDB [(none)]> call test();
2.1 Full Synchronization
Generate a MySQL‑to‑MySQL job template, edit the resulting JSON job file, and run it:
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py install.json
The sample output shows 2.57 MB/s throughput and successful completion.
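The job file executed above pairs mysqlreader with mysqlwriter. A minimal sketch of what it might contain — the host IPs, credentials, and channel count here are assumptions for illustration, not values from the article; note that mysqlreader takes `jdbcUrl` as an array while mysqlwriter takes a single string:

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 5 }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["ID", "Name", "Email"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study"]
              }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["ID", "Name", "Email"],
            "preSql": ["truncate table t_member"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study"
              }
            ]
          }
        }
      }
    ]
  }
}
```

The `preSql` truncate empties the target table before every run, which is what makes this a full rather than incremental sync.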
2.2 Incremental Synchronization
Incremental sync is achieved by adding a where clause to the reader so that only matching records are pulled. The example JSON adds the where condition and removes the preSql truncation used in the full sync.
"where": "ID <= 1888"
Running the job yields 62 records/s for the filtered dataset.
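The incremental job differs from the full‑sync job only in the reader's `parameter` block (plus dropping the writer's `preSql` truncate). A sketch of the relevant reader fragment, with the same assumed host and credentials as before:

```json
"reader": {
  "name": "mysqlreader",
  "parameter": {
    "username": "root",
    "password": "123123",
    "column": ["ID", "Name", "Email"],
    "where": "ID <= 1888",
    "connection": [
      {
        "table": ["t_member"],
        "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study"]
      }
    ]
  }
}
```

In practice an incremental job would typically key the where clause on a value greater than the last synced position, e.g. an auto‑increment ID or an update timestamp, so each run picks up only new rows.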
The article concludes with a reminder that full‑sync may be interrupted for very large datasets, making incremental sync essential.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.