Billion‑Row MySQL to HBase Synchronization: Load Data, Kafka‑Thrift, and Flink Solutions
This article presents a comprehensive guide for transferring massive MySQL datasets to HBase, covering environment setup on Ubuntu, three synchronization pipelines—Sqoop, a Kafka‑Thrift pipeline fed by Maxwell, and real‑time Flink processing—along with bulk-loading the source data into MySQL, performance comparisons, and practical tips for Hadoop, HBase, Kafka, Zookeeper, Phoenix, and related tools.
The guide begins with a step‑by‑step setup of a pseudo‑distributed Hadoop 3.0.2 cluster on Ubuntu 16.04, including installation and configuration of HDFS, YARN, Zookeeper, and HBase (1.4.9) with Phoenix, ensuring proper ports and directory permissions.
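As a sketch of the kind of configuration this step involves (the paths and ports below are assumptions for a typical localhost setup, not values taken from the article), a minimal pseudo‑distributed `hbase-site.xml` might look like:

```xml
<!-- Minimal pseudo-distributed hbase-site.xml sketch; all values are assumptions -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- must match fs.defaultFS in core-site.xml -->
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```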
It then describes three ways to load billions of GPS‑style records into MySQL: the high‑speed LOAD DATA INFILE command, a Python batch-insertion script using pymysql with executemany, and a multi‑process Python approach that splits the source file and inserts the pieces in parallel.
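The batch-insertion idea can be sketched as follows. This is a minimal illustration, not the article's script: the table name, column names, and tab-separated file format are assumptions, and the connection object is expected to come from `pymysql.connect(...)`.

```python
def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_file(path, conn, batch_size=5000):
    """Insert a tab-separated file into a hypothetical gps_points table in batches."""
    sql = "INSERT INTO gps_points (device_id, lat, lon, ts) VALUES (%s, %s, %s, %s)"
    with open(path) as f, conn.cursor() as cur:
        rows = (line.rstrip("\n").split("\t") for line in f)
        for batch in chunked(rows, batch_size):
            cur.executemany(sql, batch)  # one round trip per batch
            conn.commit()                # commit per batch, not per row
```

The key point is that `executemany` plus per-batch commits amortizes network round trips and transaction overhead, which is where the single-row insert approach loses most of its time.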
For data synchronization to HBase, three pipelines are detailed:
Sqoop: a shell loop splits the source range into 40,000‑row chunks and runs sqoop import with --hbase-create-table, --column-family info, and -m 4 parallel mappers.
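The article drives this with a shell loop; the same chunking logic can be sketched in Python, emitting one sqoop command per range. The JDBC URL, table names, and row-key column are assumptions for illustration:

```python
def sqoop_commands(total_rows, chunk=40000,
                   table="gps_points", hbase_table="gps", row_key="id"):
    """Yield one `sqoop import` command per `chunk`-row id range."""
    for start in range(0, total_rows, chunk):
        end = min(start + chunk, total_rows)
        yield (
            "sqoop import "
            "--connect jdbc:mysql://localhost:3306/test "
            f"--table {table} "
            f'--where "id >= {start} AND id < {end}" '
            f"--hbase-table {hbase_table} --hbase-create-table "
            f"--column-family info --hbase-row-key {row_key} -m 4"
        )
```

Chunking by a bounded id range keeps each Sqoop job small, so a failed chunk can be retried without redoing the whole import.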
Kafka‑Thrift (Maxwell): MySQL binlog is enabled (ROW format); Maxwell extracts changes as JSON and publishes them to Kafka, and a Python consumer reads the topic, transforms the fields, and writes to HBase via the Thrift API.
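The consumer-side transform can be sketched as below. Maxwell events carry the changed row under a `"data"` key with a `"type"` field (`insert`, `update`, `delete`); the key-field name and column family are assumptions here, not the article's values:

```python
import json


def maxwell_to_hbase(message, key_field="id", family="info"):
    """Map one Maxwell JSON change event to an HBase (row_key, columns) pair."""
    event = json.loads(message)
    if event.get("type") not in ("insert", "update"):
        return None  # this sketch skips deletes and DDL events
    data = event["data"]
    row_key = str(data[key_field]).encode()
    columns = {
        f"{family}:{col}".encode(): str(val).encode()
        for col, val in data.items()
        if col != key_field
    }
    return row_key, columns
```

With a Thrift client such as happybase, the result would feed straight into a put, e.g. `table.put(row_key, columns)`; accumulating several results and writing them in one `Batch` is what closes most of the gap with Sqoop in the benchmarks below.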
Kafka‑Flink: Maxwell and Kafka feed a Flink job that consumes the JSON strings, maps them to a Student POJO, applies a 3‑second tumbling window, and sinks batches of rows into HBase using the native Java client.
Performance results show that LOAD DATA INFILE completes in ~1 hour, while the Python batch method takes ~7.5 hours. Sqoop (properly sharded) needs ~50 hours, Kafka‑Thrift single‑row insertion matches Sqoop, but batch insertion reduces it to ~7 hours, and the Flink pipeline achieves the best throughput at 3‑7 hours.
Additional benchmarks compare HBase native scans, Phoenix queries, and coprocessor scans, showing that coprocessor-based counts are fastest, followed by Phoenix queries, then interval (range) counts, with full table scans slowest.
Finally, the article offers practical advice: always split large data, use batch inserts, prefer Flink for real‑time pipelines, configure Zookeeper correctly, and monitor virtual memory settings for Sqoop.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, as well as architecture evolution driven by internet technologies. Idea-driven, sharing-minded architects are welcome to exchange and learn together.