High-Speed Bulk Loading of 2 Billion Rows into MySQL Using TokuDB
This article details a real-world test of loading over 2 billion records into MySQL with XeLabs TokuDB, covering the requirements, configuration tweaks, table schema, bulk-loader commands, performance metrics, a comparison with InnoDB, and practical conclusions for large-scale data ingestion.
A friend needed to import more than 2 billion rows from a big-data platform into MySQL for next-day reporting. The author evaluated MySQL's InnoDB limits (10-15k rows/s even when the data fits in memory) and turned to XeLabs TokuDB for better bulk-load performance.
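A quick back-of-the-envelope check (an illustration, not from the article) shows why the cited InnoDB rate rules it out for this workload:

```python
# At InnoDB's cited 10-15k rows/s, how long would 2 billion rows take?
TOTAL_ROWS = 2_000_000_000

for rows_per_sec in (10_000, 15_000):
    hours = TOTAL_ROWS / rows_per_sec / 3600
    print(f"{rows_per_sec:,} rows/s -> {hours:.1f} hours")

# Even at the optimistic 15k rows/s the load takes ~37 hours,
# which cannot meet a next-day reporting deadline.
```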
XeLabs TokuDB Overview
Project: https://github.com/XeLabs/tokudb
Key enhancements over official TokuDB include built-in jemalloc, additional performance metrics, Xtrabackup support, ZSTD compression, and binlog group commit.
Test Table Schema
CREATE TABLE `user_summary` (
  `user_id` BIGINT(20) UNSIGNED NOT NULL COMMENT 'user id / phone number',
  `weight` VARCHAR(5) DEFAULT NULL COMMENT 'weight (kg)',
  `level` VARCHAR(20) DEFAULT NULL COMMENT 'weight class',
  `beat_rate` VARCHAR(12) DEFAULT NULL COMMENT 'percentage of users beaten',
  `level_num` INT(10) DEFAULT NULL COMMENT 'number of users in the same weight class',
  UNIQUE KEY `u_user_id` (`user_id`)
) ENGINE=TokuDB DEFAULT CHARSET=utf8;

TokuDB Core Configuration
loose_tokudb_cache_size=4G
loose_tokudb_directio=ON
loose_tokudb_fsync_log_period=1000
tokudb_commit_sync=0

Bulk Load Command
LOAD DATA INFILE '/u01/work/134-136.txt'
INTO TABLE user_summary(user_id, weight, level, beat_rate, level_num);
-- Result: 200,000,000 rows affected in 5 min 48.30 s

The measured write speed is approximately 574,217 rows/s. The source file is 8.5 GB, while the TokuDB data file occupies 3.5 GB, a compression ratio of roughly 2.4:1. The entire 2-billion-row load completes in roughly 58 minutes, far faster than InnoDB, which would take 3-4 times longer for the same data volume.
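As a sanity check on the reported numbers (an illustration, not part of the original test), both the throughput and the total load time follow directly from the 200-million-row sample:

```python
# Derive the reported throughput and extrapolated total load time
# from the 200M-row sample (5 min 48.30 s = 348.3 s).
rows_loaded = 200_000_000
elapsed_s = 5 * 60 + 48.30            # 348.3 seconds
rows_per_sec = rows_loaded / elapsed_s
print(f"throughput: {rows_per_sec:,.0f} rows/s")   # ~574,218 rows/s

total_rows = 2_000_000_000
total_min = total_rows / rows_per_sec / 60
print(f"full load:  {total_min:.0f} minutes")      # ~58 minutes
```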
Additional Scenarios
When using an auto‑increment primary key, bulk‑loader benefits disappear, and the load falls back to row‑by‑row inserts, dramatically reducing throughput. The author suggests removing the auto‑increment attribute and using a unique index to retain bulk‑load speed.
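To illustrate the point, the sketch below contrasts a hypothetical variant of the test schema (not from the original article) with the article's approach:

```sql
-- Hypothetical variant: an AUTO_INCREMENT primary key bypasses
-- TokuDB's bulk loader, forcing row-by-row inserts during LOAD DATA.
CREATE TABLE `user_summary_slow` (
  `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  `user_id` BIGINT(20) UNSIGNED NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=TokuDB DEFAULT CHARSET=utf8;

-- The article's approach: no auto-increment key, only a unique
-- index, so the bulk loader stays on the fast path.
CREATE TABLE `user_summary_fast` (
  `user_id` BIGINT(20) UNSIGNED NOT NULL,
  UNIQUE KEY `u_user_id` (`user_id`)
) ENGINE=TokuDB DEFAULT CHARSET=utf8;
```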
Test Environment
CentOS 7 with a custom‑compiled XeLabs TokuDB, 8 CPU cores, 8 GB RAM, and 500 GB high‑performance cloud storage.
Conclusion
In a cloud environment with the above specs, TokuDB can sustain over 570k rows per second, easily meeting the requirement of loading 2 billion rows within an hour. InnoDB would require significantly more time, making TokuDB the preferred engine for massive bulk imports.