
Comprehensive Guide to Using DataX for Data Synchronization

This article provides a step‑by‑step tutorial on installing, configuring, and using Alibaba's open‑source DataX tool to perform both full and incremental data synchronization between MySQL databases on Linux, covering framework design, job architecture, JSON job files, and practical command‑line examples.


Introduction

Our project required synchronizing 50 million rows between a business database and a reporting database, where direct SQL sync was infeasible. Traditional methods like mysqldump or file‑based storage proved too slow, leading us to evaluate DataX.

DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration, designed for offline synchronization of heterogeneous data sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, etc. It models synchronization as a star topology with DataX as the hub: each source or target only needs a plugin to talk to DataX, so N sources and M targets require N + M plugins instead of N × M point‑to‑point sync chains.

1. DataX 3.0 Framework Design

DataX follows a Framework + Plugin architecture. Readers and Writers are plug‑ins that handle data extraction and loading, while the central Framework manages buffering, flow control, concurrency, and data conversion.

Role: Function

Reader (collection module): Collects data from the source and sends it to the Framework.

Writer (write module): Continuously pulls data from the Framework and writes it to the destination.

Framework (intermediary): Connects Reader and Writer, acting as the transmission channel and handling core technical issues such as buffering, flow control, concurrency, and data conversion.

2. DataX 3.0 Core Architecture

A DataX job is split into multiple Tasks, grouped into TaskGroups according to the configured concurrency. Each Task runs a Reader → Channel → Writer pipeline, and the Job monitors all TaskGroups until completion.
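Concurrency is controlled by the channel value in the job file's setting block; a sketch (the channel count of 4 is illustrative, and the per‑TaskGroup grouping is handled by DataX itself, typically around five channels per group):

```json
{
  "job": {
    "setting": {
      "speed": {
        "channel": 4
      }
    },
    "content": []
  }
}
```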

Using DataX for Data Synchronization

Prerequisites: JDK 1.8+, Python 2 or 3, and Apache Maven 3.x (only needed if compiling DataX from source). Install the JDK, download the DataX tarball, extract it to /usr/local, and verify the installation with python datax.py ../job/job.json.

1. Install DataX on Linux

[root@MySQL-1 ~]# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
[root@MySQL-1 ~]# tar zxf datax.tar.gz -C /usr/local/
[root@MySQL-1 ~]# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files

2. Basic Usage

Run a streamreader‑to‑streamwriter template:

python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter
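This command prints an empty job template to fill in; the output looks roughly like the following (exact fields vary by DataX version):

```json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": { "column": [], "sliceRecordCount": "" }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": { "encoding": "", "print": true }
        }
      }
    ],
    "setting": { "speed": { "channel": "" } }
  }
}
```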

3. Install MySQL

Install MariaDB on both hosts, create the course‑study database and t_member table, grant privileges, and optionally create a stored procedure to generate test data.

yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
systemctl start mariadb
mysql_secure_installation
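The database setup described above might look like the following sketch; the article does not list the schema or the procedure body, so the columns, credentials, and row‑generation logic here are illustrative:

```sql
-- Create the database and a simple member table (illustrative schema)
CREATE DATABASE `course-study`;
USE `course-study`;

CREATE TABLE t_member (
    id   BIGINT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(20)
);

-- Grant access so DataX can connect from the other host
GRANT ALL PRIVILEGES ON `course-study`.* TO 'root'@'%' IDENTIFIED BY '123456';

-- Optional stored procedure to generate test rows
DELIMITER //
CREATE PROCEDURE insert_test_data(IN cnt INT)
BEGIN
    DECLARE i INT DEFAULT 1;
    WHILE i <= cnt DO
        INSERT INTO t_member (name) VALUES (CONCAT('member_', i));
        SET i = i + 1;
    END WHILE;
END //
DELIMITER ;

CALL insert_test_data(1000);
```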

4. MySQL‑to‑MySQL Synchronization

Create a JSON job file specifying mysqlreader and mysqlwriter parameters, then execute:

python /usr/local/datax/bin/datax.py install.json

The job logs show total records, speed, and duration, confirming successful synchronization.
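A job file of this shape could look like the following sketch; the IPs, credentials, and column list are placeholders rather than values from the article:

```json
{
  "job": {
    "setting": { "speed": { "channel": 3 } },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123456",
            "column": ["id", "name"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study"]
              }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "root",
            "password": "123456",
            "column": ["id", "name"],
            "preSql": ["TRUNCATE TABLE t_member"],
            "connection": [
              {
                "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study",
                "table": ["t_member"]
              }
            ]
          }
        }
      }
    ]
  }
}
```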

5. Incremental Synchronization

Use the where clause in the reader configuration to filter rows, and adjust preSql as needed for incremental loads. Example where condition: "ID <= 1888". After running the incremental job, the logs display the filtered record count.
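For an incremental run, only the reader's parameter block changes; a sketch with placeholder connection details (also drop any truncating preSql on the writer so existing rows are kept):

```json
"reader": {
  "name": "mysqlreader",
  "parameter": {
    "username": "root",
    "password": "123456",
    "column": ["id", "name"],
    "where": "ID <= 1888",
    "connection": [
      {
        "table": ["t_member"],
        "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study"]
      }
    ]
  }
}
```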

Full synchronization may be interrupted for very large data sets; incremental sync using where is essential in such cases.

Tags: JSON, Linux, MySQL, data synchronization, DataX, ETL, Shell
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
