
Step-by-Step Guide to Deploying and Using DataX‑web for Data Synchronization

This article provides a comprehensive tutorial on preparing the environment, installing DataX and DataX‑web, configuring MySQL, JDK, Maven, and Python, deploying the services on Linux, and using the web UI to create data sources, build JSON jobs, monitor execution, and manage users.


Background: The built-in DataWorks sync module cannot synchronize external and internal production data across a mixed network environment, so the author evaluated two alternatives, DataX-web and DolphinScheduler. This article walks through the DataX-web deployment process.

1. Environment preparation

Install required software: MySQL (5.5+), JDK 1.8, Maven 3.6.1+, DataX, and Python 2.x (or replace the three Python scripts under datax/bin for Python 3 support).
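Before going further, it helps to confirm the installed tools meet the minimums above (e.g., Maven 3.6.1+). A minimal sketch of a version comparison, assuming GNU sort with -V support; the version_ge helper and the sample version string are illustrative, not part of DataX:

```shell
#!/bin/sh
# version_ge A B: succeeds if version A >= version B (uses sort -V)
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example: check a Maven version string against the 3.6.1 minimum
mvn_version="3.6.3"   # e.g. parsed from the output of `mvn -v`
if version_ge "$mvn_version" "3.6.1"; then
  echo "Maven OK"
else
  echo "Maven too old" >&2
fi
```

The same helper works for the JDK and MySQL version strings once they are extracted from the respective `-version` outputs.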

2. Install DataX

Download the DataX tarball from http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz, extract it, and run a sync job with:

$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}

Verify installation via the provided self‑check script.
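As a standalone illustration of the job format, a minimal stream-to-stream job using DataX's built-in smoke-test plugins can be written and checked locally; the file path is arbitrary, and python3 is used here only to validate the JSON:

```shell
# Write a minimal stream -> stream job (DataX's built-in test plugins)
cat > /tmp/stream2stream.json <<'EOF'
{
  "job": {
    "setting": { "speed": { "channel": 1 } },
    "content": [{
      "reader": {
        "name": "streamreader",
        "parameter": {
          "column": [{ "type": "string", "value": "hello, DataX" }],
          "sliceRecordCount": 10
        }
      },
      "writer": {
        "name": "streamwriter",
        "parameter": { "print": true }
      }
    }]
  }
}
EOF

# Validate the JSON before handing it to DataX
python3 -c "import json; json.load(open('/tmp/stream2stream.json')); print('job OK')"

# Then run it against an installed DataX:
# python {YOUR_DATAX_HOME}/bin/datax.py /tmp/stream2stream.json
```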

3. Install DataX‑web

Obtain the official tar package (e.g., from Baidu Cloud) or clone the source from Git and run mvn clean install to generate build/datax-web-{VERSION}.tar.gz. Extract the package:

tar -zxvf datax-web-{VERSION}.tar.gz
mv datax-web-2.1.2 datax-web

Run the one-click install script with sh install.sh --force, or run install.sh interactively to configure the database connection, mail service, and other properties.

4. Database initialization

If MySQL is available, the installer will prompt for host, port, username, password, and database name; otherwise, manually execute /bin/db/datax-web.sql and edit modules/datax-admin/bin/env.properties accordingly.
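For the manual path, the admin module's env.properties must carry matching connection settings. A from-scratch sketch with placeholder credentials; the key names follow the datax-web 2.x release, so verify them against the comments in your shipped copy before overwriting anything:

```shell
# Placeholder credentials; key names follow datax-web 2.x env.properties
mkdir -p modules/datax-admin/bin
cat > modules/datax-admin/bin/env.properties <<'EOF'
DB_HOST=127.0.0.1
DB_PORT=3306
DB_USERNAME=root
DB_PASSWORD=changeme
DB_DATABASE=datax_web
EOF

# Quick sanity check: all five connection keys present
grep -c '^DB_' modules/datax-admin/bin/env.properties   # prints 5
```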

5. Configuration

Edit modules/datax-admin/bin/env.properties for mail settings and modules/datax-executor/bin/env.properties for PYTHON_PATH and DATAX_ADMIN_PORT. Adjust other defaults such as server.port and executor.port as needed.
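A sketch of the executor side (the module directory is named datax-executor in the 2.x release). The values below are placeholders, and the exact form PYTHON_PATH should take (interpreter vs. path to datax.py) should be checked against the comments in the shipped file:

```shell
# Placeholder values; check the comments in the shipped env.properties
# for the exact form PYTHON_PATH should take in your release.
mkdir -p modules/datax-executor/bin
cat > modules/datax-executor/bin/env.properties <<'EOF'
PYTHON_PATH=/opt/datax/bin/datax.py
DATAX_ADMIN_PORT=8080
EOF
```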

6. Service startup

Start all services with the provided script, verify the processes with jps (look for DataXAdminApplication and DataXExecutorApplication), and check logs in modules/*/console.out. Use ./bin/start.sh -m {module_name} to start a single module, or ./bin/stop.sh -m {module_name} to stop it.

7. Cluster deployment

Ensure every node uses the same database configuration and has a synchronized clock. For executor clusters, keep admin.addresses and executor.appname identical on every node.
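That consistency requirement can be checked mechanically. A minimal sketch comparing the cluster-critical keys from two hypothetical executor nodes' config copies; the paths, hostnames, and values are illustrative, and on a real cluster the files would be fetched from each node first:

```shell
# Two sample copies of an executor config (illustrative keys/values)
mkdir -p nodeA nodeB
printf 'admin.addresses=http://admin-host:8080\nexecutor.appname=datax-executor\n' > nodeA/executor.properties
printf 'admin.addresses=http://admin-host:8080\nexecutor.appname=datax-executor\n' > nodeB/executor.properties

# The cluster-critical keys must be byte-identical on every node
grep -E '^(admin\.addresses|executor\.appname)=' nodeA/executor.properties > a.keys
grep -E '^(admin\.addresses|executor\.appname)=' nodeB/executor.properties > b.keys
cmp -s a.keys b.keys && echo "executor configs consistent"
```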

8. Using DataX‑web

Through the web UI you can configure executors, create data sources (Hive, MySQL, Oracle, PostgreSQL, SQLServer, HBase, MongoDB, ClickHouse), build JSON job scripts, batch‑create tasks, monitor execution, view logs, and manage users.

9. Task execution policies

Choose a blocking strategy such as single-machine serial, discard subsequent schedules, or overwrite previous schedules, and set retry counts carefully: a retried job that appends to a half-written target can duplicate data.
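One way to make retries safer is to fail a job as soon as records start erroring, via DataX's errorLimit setting, so a retry never runs on top of a large half-written batch. A sketch of the setting block with illustrative values (python3 is used only to validate the fragment):

```shell
# errorLimit.record=0 fails the job on the first bad record;
# percentage caps the tolerated error ratio. Values are illustrative.
cat > /tmp/job-setting.json <<'EOF'
{
  "job": {
    "setting": {
      "speed": { "channel": 3 },
      "errorLimit": { "record": 0, "percentage": 0.02 }
    }
  }
}
EOF
python3 -c "import json; json.load(open('/tmp/job-setting.json')); print('setting OK')"
```

Whether this is sufficient still depends on the writer: truly idempotent retries also need an overwrite-style write mode or a cleanup step on the target side.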

Conclusion

For modest data volumes and limited budgets, DataX‑web is a viable solution; future articles will cover DolphinScheduler integration and related troubleshooting.

Tags: Python, deployment, DevOps, Linux, MySQL, data synchronization, DataX
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
