Billion‑Row MySQL to HBase Synchronization: Load Data, Kafka‑Thrift, and Flink Solutions
This article presents a comprehensive guide for transferring massive MySQL datasets to HBase, covering environment setup on Ubuntu, three synchronization pipelines—Sqoop, a Kafka‑Thrift pipeline fed by Maxwell, and real‑time Flink processing—along with bulk-loading the source data into MySQL, performance comparisons, and practical tips for Hadoop, HBase, Kafka, Zookeeper, Phoenix, and related tools.
The guide begins with a step‑by‑step setup of a pseudo‑distributed Hadoop 3.0.2 cluster on Ubuntu 16.04, including installation and configuration of HDFS, YARN, Zookeeper, and HBase (1.4.9) with Phoenix, ensuring proper ports and directory permissions.
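As a sketch of the kind of configuration this step involves (the paths and ports below are assumptions for a typical localhost setup, not values taken from the article), a minimal pseudo‑distributed `hbase-site.xml` might look like:

```xml
<!-- Minimal pseudo-distributed hbase-site.xml sketch; all values are assumptions -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- must match fs.defaultFS in core-site.xml -->
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```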
It then describes three ways to load billions of GPS‑style records into MySQL: the high‑speed LOAD DATA INFILE command, a Python batch-insertion script using pymysql with executemany, and a multi‑process Python approach that splits the source file and inserts the pieces in parallel.
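The batch-insertion idea can be sketched as follows. This is a minimal illustration, not the article's script: the table name, column names, and tab-separated file format are assumptions, and the connection object is expected to come from `pymysql.connect(...)`.

```python
def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_file(path, conn, batch_size=5000):
    """Insert a tab-separated file into a hypothetical gps_points table in batches."""
    sql = "INSERT INTO gps_points (device_id, lat, lon, ts) VALUES (%s, %s, %s, %s)"
    with open(path) as f, conn.cursor() as cur:
        rows = (line.rstrip("\n").split("\t") for line in f)
        for batch in chunked(rows, batch_size):
            cur.executemany(sql, batch)  # one round trip per batch
            conn.commit()                # commit per batch, not per row
```

The key point is that `executemany` plus per-batch commits amortizes network round trips and transaction overhead, which is where the single-row insert approach loses most of its time.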
For data synchronization to HBase, three pipelines are detailed:
Sqoop: a shell loop splits the source range into 40,000‑row chunks and runs sqoop import with --hbase-create-table, --column-family info, and -m 4 parallel mappers.
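The article drives this with a shell loop; the same chunking logic can be sketched in Python, emitting one sqoop command per range. The JDBC URL, table names, and row-key column are assumptions for illustration:

```python
def sqoop_commands(total_rows, chunk=40000,
                   table="gps_points", hbase_table="gps", row_key="id"):
    """Yield one `sqoop import` command per `chunk`-row id range."""
    for start in range(0, total_rows, chunk):
        end = min(start + chunk, total_rows)
        yield (
            "sqoop import "
            "--connect jdbc:mysql://localhost:3306/test "
            f"--table {table} "
            f'--where "id >= {start} AND id < {end}" '
            f"--hbase-table {hbase_table} --hbase-create-table "
            f"--column-family info --hbase-row-key {row_key} -m 4"
        )
```

Chunking by a bounded id range keeps each Sqoop job small, so a failed chunk can be retried without redoing the whole import.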
Kafka‑Thrift (Maxwell): MySQL binlog is enabled (ROW format); Maxwell extracts changes as JSON and publishes them to Kafka, and a Python consumer reads the topic, transforms the fields, and writes to HBase via the Thrift API.
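The consumer-side transform can be sketched as below. Maxwell events carry the changed row under a `"data"` key with a `"type"` field (`insert`, `update`, `delete`); the key-field name and column family are assumptions here, not the article's values:

```python
import json


def maxwell_to_hbase(message, key_field="id", family="info"):
    """Map one Maxwell JSON change event to an HBase (row_key, columns) pair."""
    event = json.loads(message)
    if event.get("type") not in ("insert", "update"):
        return None  # this sketch skips deletes and DDL events
    data = event["data"]
    row_key = str(data[key_field]).encode()
    columns = {
        f"{family}:{col}".encode(): str(val).encode()
        for col, val in data.items()
        if col != key_field
    }
    return row_key, columns
```

With a Thrift client such as happybase, the result would feed straight into a put, e.g. `table.put(row_key, columns)`; accumulating several results and writing them in one `Batch` is what closes most of the gap with Sqoop in the benchmarks below.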
Kafka‑Flink: Maxwell and Kafka feed a Flink job that consumes the JSON strings, maps them to a Student POJO, applies a 3‑second tumbling window, and sinks batches of rows into HBase using the native Java client.
Performance results show that LOAD DATA INFILE completes in ~1 hour, while the Python batch method takes ~7.5 hours. Sqoop (properly sharded) needs ~50 hours, Kafka‑Thrift single‑row insertion matches Sqoop, but batch insertion reduces it to ~7 hours, and the Flink pipeline achieves the best throughput at 3‑7 hours.
Additional benchmarks compare HBase native scans, Phoenix queries, and coprocessor scans, showing that coprocessor-based counts are fastest, followed by Phoenix queries, then interval (range) counts, with full table scans slowest.
Finally, the article offers practical advice: always split large data, use batch inserts, prefer Flink for real‑time pipelines, configure Zookeeper correctly, and monitor virtual memory settings for Sqoop.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, as well as architecture evolution driven by internet technologies. Idea-driven, sharing-minded architects are welcome to exchange and learn together.