Big Data 13 min read

Using DataX for Efficient MySQL Data Synchronization (Full and Incremental)

This article introduces DataX, an open‑source data integration tool, explains its architecture, and provides step‑by‑step instructions—including environment setup, installation, job JSON creation, and command execution—to achieve fast full‑ and incremental synchronization between MySQL databases.

Selected Java Interview Questions

Dec 26, 2022

Using DataX for Efficient MySQL Data Synchronization (Full and Incremental)

Introduction

Our project needs to sync 50 million records between a business database and a reporting database, but traditional SQL sync is impractical. Tools like mysqldump and storage‑based methods are too slow, so we turned to DataX for high‑speed, reliable data transfer.

1. DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration, designed for offline synchronization across heterogeneous data sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, and FTP. It simplifies complex networked sync paths into a star‑topology where DataX acts as the central transport hub.

1.1 Framework Design

DataX follows a Framework + Plugin architecture, abstracting source and destination operations into Reader and Writer plugins.

1.2 Core Architecture

A DataX job is managed by the Job module, which splits the job into multiple Tasks for concurrent execution. The Scheduler assembles Tasks into TaskGroups, and each Task runs a Reader → Channel → Writer pipeline. The Job monitors TaskGroups and exits successfully when all complete.

2. Installing DataX and Prerequisites

Required software:

JDK 1.8+

Python 2 or 3

Apache Maven 3.x (only for compiling DataX from source)

Example of installing JDK on CentOS:

# ls
anaconda-ks.cfg  jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# cat <<END >> /etc/profile
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
# source /etc/profile
# java -version

Download and extract DataX:

# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files

Verify installation:

# cd /usr/local/datax/bin
# python datax.py ../job/job.json   # test run

3. Basic DataX Usage

Generate a template for streamreader and streamwriter:

# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter

Create a job JSON (example test.json) defining reader and writer parameters, columns, connections, and speed settings.

4. MySQL‑to‑MySQL Full Synchronization

Prepare source and target MySQL instances, create the database and table, and grant privileges:

grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;

Optionally create a stored procedure to generate test data.

Write a job JSON ( install.json) that reads from the source MySQL and writes to the target MySQL, using preSql to truncate the target table before loading.

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Run the job:

# python /usr/local/datax/bin/datax.py install.json

The output shows successful transfer of ~3 million records at ~2.57 MB/s.

5. Incremental Synchronization

Incremental sync is achieved by adding a where clause to the reader configuration. Example where.json syncs records with ID <= 1888: "where": "ID <= 1888" Run the incremental job similarly; the log reports the filtered record count and speed.

For subsequent increments, adjust the where condition (e.g., "ID > 1888 AND ID <= 2888") and remove the preSql that truncates the target table.

Conclusion

DataX provides a flexible, high‑performance solution for both full and incremental data synchronization between MySQL databases, with configurable concurrency, plugin‑based extensibility, and straightforward JSON job definitions.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

JSON Linux MySQL data synchronization DataX ETL incremental sync

Written by

Selected Java Interview Questions

A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.