Databases 18 min read

Fastest Method to Import 1 Billion Rows into MySQL: Architecture and Performance Tips

This article analyzes how to import one billion 1 KB log records stored in HDFS or S3 into MySQL, detailing why a single table cannot hold that volume, the B‑tree index limits, batch‑write strategies, storage‑engine choices, sharding, file‑reading methods, task coordination, and reliability mechanisms.

IoT Full-Stack Technology
IoT Full-Stack Technology
IoT Full-Stack Technology
Fastest Method to Import 1 Billion Rows into MySQL: Architecture and Performance Tips

Problem and Constraints

The goal is to import 1 billion log records (each ~1 KB) into MySQL. The data resides in HDFS or S3, is split into roughly 100 files, must be imported in order, and duplicates should be avoided.

Why a Single MySQL Table Is Insufficient

MySQL uses a B+‑tree clustered index. A leaf page is 16 KB, holding 16 rows (1 KB each). A non‑leaf page (also 16 KB) stores a 8‑byte BIGINT primary key and a 6‑byte pointer, allowing 16 384 / 14 ≈ 1170 child pointers. The resulting index capacities are:

2‑level index: 1170 × 16 = 18,720 rows

3‑level index: 1170 × 1170 × 16 ≈ 21,902,400 rows (≈20 M, the practical limit)

4‑level index: 1170 × 1170 × 1170 × 16 ≈ 25,625,808,000 rows (≈2.56 B)

Since a single table comfortably holds only up to ~20 M rows, the data must be split, e.g., into 100 tables of ~10 M rows each.

Efficient Write Strategies

Batch inserts dramatically improve throughput; a batch size of 100 rows (each 1 KB) is a good starting point. Use InnoDB transactions so a batch either succeeds or fails entirely, and retry on failure. Avoid creating non‑primary indexes before loading; create them after the bulk load if needed. MyISAM can be faster for raw inserts but lacks transactional safety, so it is considered only when InnoDB tuning is insufficient.

Performance test (image) shows batch inserts outperform single‑row inserts, and disabling InnoDB’s immediate flush ( innodb_flush_log_at_trx_commit) narrows the gap between InnoDB and MyISAM.

File Reading Techniques

The total file size is about 931 GB (1 B × 1 KB). Reading the whole file into memory is impossible. Benchmarks on a 3.4 GB file on macOS compare several methods: Files.readAllBytes: OOM

FileReader + BufferedReader: 11 s

File + BufferedReader: 10 s

Scanner: 57 s

Java NIO FileChannel (buffered): 3 s

Because the overall bottleneck is database writing, BufferedReader’s line‑by‑line reading (≈10 s) is sufficient and simpler than NIO’s buffer handling.

Ensuring Ordered and Reliable Import

Assign a unique primary key composed of the file suffix and line number, e.g., index_90.txt → database_9.table_0 or index_67.txt → database_6.table_7. This guarantees order within each table and global order via the database and table suffixes.

Use Redis to record task progress. After a successful batch, increment a Redis key (e.g., INCRBY task_offset_{taskId} 100). On failure, retry the batch; after repeated failures, fall back to single‑row inserts and log the problematic rows.

INCRBY KEY_NAME INCR_AMOUNT

Task Coordination and Concurrency Control

Separating read and write tasks with Kafka was abandoned because ordering and concurrency limits made it impractical. The final design merges reading and writing in the same task, limiting the number of concurrent write tasks per table.

Redisson semaphores enforce a concurrency limit of one write per table:

RedissonClient redissonClient = Redisson.create(config);
RSemaphore rSemaphore = redissonClient.getSemaphore("semaphore");
rSemaphore.trySetPermits(1);
rSemaphore.tryAcquire(); // non‑blocking lock

Master‑node election can be performed with Zookeeper + Curator to assign tasks and avoid duplicate processing.

Final Recommendations

Clarify constraints (file size, format, ordering) before designing the solution.

Split data into multiple databases/tables based on B+‑tree capacity limits.

Use sharding and dynamic concurrency limits to adapt to SSD or HDD storage characteristics.

Benchmark InnoDB vs MyISAM and choose the engine that meets performance and reliability needs.

Determine the optimal batch‑insert size through iterative testing.

Combine read and write tasks, track progress with Redis, and ensure idempotent inserts via primary‑key composition.

Employ Zookeeper + Curator for master election and Redisson locks for exclusive write access.

Performance comparison image
Performance comparison image
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

PerformanceShardingRedisMySQLBatch InsertRedissonFile Reading
IoT Full-Stack Technology
Written by

IoT Full-Stack Technology

Dedicated to sharing IoT cloud services, embedded systems, and mobile client technology, with no spam ads.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.