Using DataX for Efficient MySQL Data Synchronization (Full and Incremental)
This article introduces DataX, an open‑source data integration tool, explains its architecture, and provides step‑by‑step instructions—including environment setup, installation, job JSON creation, and command execution—to achieve fast full‑ and incremental synchronization between MySQL databases.
Introduction
Our project needs to sync 50 million records between a business database and a reporting database, but traditional SQL sync is impractical. Tools like mysqldump and storage‑based methods are too slow, so we turned to DataX for high‑speed, reliable data transfer.
1. DataX Overview
DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration, designed for offline synchronization across heterogeneous data sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, and FTP. It simplifies complex networked sync paths into a star‑topology where DataX acts as the central transport hub.
1.1 Framework Design
DataX follows a Framework + Plugin architecture, abstracting source and destination operations into Reader and Writer plugins.
1.2 Core Architecture
A DataX job is managed by the Job module, which splits the job into multiple Tasks for concurrent execution. The Scheduler assembles Tasks into TaskGroups, and each Task runs a Reader → Channel → Writer pipeline. The Job monitors TaskGroups and exits successfully when all complete.
2. Installing DataX and Prerequisites
Required software:
JDK 1.8+
Python 2 or 3
Apache Maven 3.x (only for compiling DataX from source)
Example of installing JDK on CentOS:
# ls
anaconda-ks.cfg jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# cat <
> /etc/profile
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
# source /etc/profile
# java -versionDownload and extract DataX:
# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._* # delete hidden filesVerify installation:
# cd /usr/local/datax/bin
# python datax.py ../job/job.json # test run3. Basic DataX Usage
Generate a template for streamreader and streamwriter :
# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriterCreate a job JSON (example test.json ) defining reader and writer parameters, columns, connections, and speed settings.
4. MySQL‑to‑MySQL Full Synchronization
Prepare source and target MySQL instances, create the database and table, and grant privileges:
grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;Optionally create a stored procedure to generate test data.
Write a job JSON ( install.json ) that reads from the source MySQL and writes to the target MySQL, using preSql to truncate the target table before loading.
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"splitPk": "ID",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
"table": ["t_member"]
}],
"preSql": ["truncate t_member"],
"session": ["set session sql_mode='ANSI'"],
"writeMode": "insert"
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}Run the job:
# python /usr/local/datax/bin/datax.py install.jsonThe output shows successful transfer of ~3 million records at ~2.57 MB/s.
5. Incremental Synchronization
Incremental sync is achieved by adding a where clause to the reader configuration. Example where.json syncs records with ID <= 1888 :
"where": "ID <= 1888"Run the incremental job similarly; the log reports the filtered record count and speed.
For subsequent increments, adjust the where condition (e.g., "ID > 1888 AND ID <= 2888" ) and remove the preSql that truncates the target table.
Conclusion
DataX provides a flexible, high‑performance solution for both full and incremental data synchronization between MySQL databases, with configurable concurrency, plugin‑based extensibility, and straightforward JSON job definitions.
Selected Java Interview Questions
A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.