Designing a High‑Performance System to Import 1 Billion Records into MySQL
To import one billion 1 KB log records stored in HDFS or S3 into MySQL as quickly as possible, the article analyzes constraints, evaluates single‑table limits, proposes sharding, batch inserts, storage‑engine choices, file‑reading strategies, task coordination, reliability mechanisms, and concurrency control.
This article revisits a Java interview question that asks how to insert 1 billion rows (each about 1 KB) as fast as possible into a MySQL database. The data are unstructured access logs stored in HDFS or S3, split into roughly 100 files, and must be imported in order with minimal duplication.
1. Can a single MySQL table hold 1 billion rows?
The answer is no – a single table is recommended to stay below roughly 20 million rows. MySQL's InnoDB uses a B+‑tree clustered index: with a 16 KB leaf page and a 1 KB row, each leaf page holds 16 rows. A 16 KB non‑leaf page stores a BigInt primary key (8 bytes) plus a page pointer (6 bytes) per entry, giving roughly 1,170 child pointers (16 × 1024 / 14). A three‑level tree therefore addresses about 1,170 × 1,170 × 16 ≈ 21.9 million rows; beyond that a fourth index level is needed, adding an extra page read per lookup and degrading performance.
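The arithmetic behind this limit can be checked directly; this sketch hard‑codes the figures from the text (16 KB pages, 1 KB rows, an 8‑byte key plus a 6‑byte pointer per non‑leaf entry):

```java
public class BPlusTreeCapacity {
    static final int PAGE_SIZE = 16 * 1024; // InnoDB default page size
    static final int ROW_SIZE = 1024;       // ~1 KB per log record
    static final int ENTRY_SIZE = 8 + 6;    // BigInt key + child-page pointer

    /** Rows addressable by a B+ tree of the given depth (leaf level included). */
    static long rowsAtDepth(int depth) {
        long rowsPerLeaf = PAGE_SIZE / ROW_SIZE; // 16 rows per leaf page
        long fanout = PAGE_SIZE / ENTRY_SIZE;    // ~1,170 pointers per non-leaf page
        long capacity = rowsPerLeaf;
        for (int level = 1; level < depth; level++) {
            capacity *= fanout;
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(rowsAtDepth(3)); // 21902400
    }
}
```

A depth‑3 tree tops out just under 22 million 1 KB rows, which is where the ~20 million row guideline comes from.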
2. Efficient database writes
Single‑row inserts are slow; batch inserts (e.g., 100 rows per batch) dramatically improve throughput. InnoDB guarantees atomicity of batch transactions. To keep inserts fast, write rows in primary‑key order, avoid non‑primary indexes during the bulk load, or create them after the data is loaded.
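As a sketch of what a multi‑row batch looks like on the wire; the table and column names (`access_log_0`, `id`, `payload`) are illustrative, and production code should instead prepare `INSERT INTO ... VALUES (?, ?)` once and use `addBatch()`/`executeBatch()` with parameterized values:

```java
import java.util.List;
import java.util.StringJoiner;

public class BatchInsert {
    /**
     * Builds one multi-row INSERT per batch. With the MySQL JDBC driver,
     * setting rewriteBatchedStatements=true makes executeBatch() produce
     * the same multi-row statement automatically.
     */
    static String buildInsert(String table, long firstId, List<String> rows) {
        StringJoiner values = new StringJoiner(", ");
        long id = firstId;
        for (String row : rows) {
            // Values are inlined for readability; real code must parameterize them.
            values.add("(" + id++ + ", '" + row + "')");
        }
        return "INSERT INTO " + table + " (id, payload) VALUES " + values;
    }

    public static void main(String[] args) {
        System.out.println(buildInsert("access_log_0", 1, List.of("a", "b")));
        // INSERT INTO access_log_0 (id, payload) VALUES (1, 'a'), (2, 'b')
    }
}
```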
Should we write concurrently to the same table?
No. Concurrent writes to a single table cannot guarantee insertion order. Instead, increase the batch size to raise throughput without parallel writes to the same table; parallelism comes from writing different tables on different shards concurrently.
3. Choosing a MySQL storage engine
MyISAM offers higher raw insert speed but lacks transaction support, risking duplicate rows on failure. InnoDB, with innodb_flush_log_at_trx_commit set to 0 or 2, can approach MyISAM performance while providing ACID guarantees. The article recommends using InnoDB unless the cluster disallows changing the flush policy.
4. Need for sharding
A single MySQL instance caps at about 5 K TPS. SSDs push that ceiling higher, but HDDs cannot sustain many concurrent writes because a single read/write head serializes seeks. The design should therefore make the number of databases and tables configurable, adapting to the underlying storage.
5. Maintaining order during import
Assign each of the 100 files a unique database and table (e.g., index_90.txt → database_9.table_0 ). The combination of file suffix and line number forms a unique, ordered primary key.
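A minimal routing sketch under the naming scheme above, assuming the 100 files (index_00.txt .. index_99.txt) are spread over 10 databases × 10 tables so the two suffix digits map directly to database and table:

```java
public class ShardRouter {
    /**
     * Maps a file such as index_90.txt to database_9.table_0: the first
     * suffix digit selects the database, the second selects the table.
     */
    static String route(String fileName) {
        String digits = fileName.replaceAll("\\D", ""); // "index_90.txt" -> "90"
        int database = Character.getNumericValue(digits.charAt(0));
        int table = Character.getNumericValue(digits.charAt(1));
        return "database_" + database + ".table_" + table;
    }

    public static void main(String[] args) {
        System.out.println(route("index_90.txt")); // database_9.table_0
    }
}
```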
6. Faster file reading
A 10 GB file cannot be loaded into memory in one go. Several Java approaches were benchmarked (Files.readAllBytes, BufferedReader, Scanner, NIO FileChannel). Java NIO with ByteBuffer is the fastest but does not preserve line boundaries; BufferedReader offers acceptable performance (~30 s for 10 GB) plus natural line handling, making it the preferred method.
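A minimal sketch of the preferred BufferedReader approach, which streams the file line by line with bounded memory use (a temporary file stands in here for the 10 GB log file):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class LineReaderDemo {
    /** Streams a file line by line; memory use is bounded by the buffer, not the file. */
    static long countLines(Path file) throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            while (reader.readLine() != null) {
                lines++; // in the importer, each line becomes one record in a batch
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("access-log", ".txt");
        Files.write(tmp, List.of("row1", "row2", "row3"));
        System.out.println(countLines(tmp)); // 3
        Files.delete(tmp);
    }
}
```

For comparison, the raw NIO FileChannel variant below is faster but only yields byte buffers, not lines.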
File file = new File("/xxx.zip");
long now = System.currentTimeMillis();
try (FileInputStream fileInputStream = new FileInputStream(file);
     FileChannel fileChannel = fileInputStream.getChannel()) {
    int capacity = 1 * 1024 * 1024; // 1M read buffer
    ByteBuffer byteBuffer = ByteBuffer.allocate(capacity);
    long size = 0;
    int read;
    while ((read = fileChannel.read(byteBuffer)) != -1) {
        size += read;           // count only the bytes actually read
        byteBuffer.clear();     // reset the buffer for the next read
    }
    System.out.println("file size:" + size);
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("Time:" + (System.currentTimeMillis() - now));

7. Coordinating read and write tasks
Attempting to separate readers and writers via Kafka broke ordering guarantees and introduced partition‑mixing problems. The final solution merges reading and writing in a single task: each task reads a batch, then writes it, preserving order without extra messaging.
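The merged read‑then‑write loop can be sketched as follows; the writer callback stands in for the batched INSERT described earlier:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class ReadWriteTask {
    /**
     * Reads up to batchSize records, writes them, then reads the next batch.
     * Because one task owns both steps, per-file insertion order is preserved
     * without any intermediate queue or messaging system.
     */
    static void run(Iterator<String> source, int batchSize, Consumer<List<String>> writer) {
        List<String> batch = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                writer.accept(List.copyOf(batch)); // e.g. one multi-row INSERT
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            writer.accept(List.copyOf(batch)); // flush the final partial batch
        }
    }
}
```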
8. Ensuring task reliability
Use a deterministic primary key composed of {taskId}{fileIndex}{fileRowNumber} to achieve idempotent inserts. Record progress in Redis with INCRBY task_offset_{taskId} 100 . On failure, retry the batch; after several failures, fall back to single‑row inserts and continue updating Redis. Optionally, consume MySQL binlog to keep Redis and the database in sync.
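A sketch of the deterministic key; the fixed field widths (two digits for the file index, nine for the row number) are assumptions chosen so 100 files × 10 million rows fit into one BigInt without collisions:

```java
public class ImportKeys {
    /**
     * Deterministic primary key {taskId}{fileIndex}{fileRowNumber}: re-running a
     * failed batch produces the same keys, so duplicates are rejected by the
     * primary-key constraint (idempotence). Field widths are illustrative.
     */
    static long primaryKey(int taskId, int fileIndex, long fileRowNumber) {
        return Long.parseLong(String.format("%d%02d%09d", taskId, fileIndex, fileRowNumber));
    }

    public static void main(String[] args) {
        System.out.println(primaryKey(1, 90, 100)); // 190000000100
        // After each successful batch of 100 rows, progress advances in Redis:
        //   INCRBY task_offset_{taskId} 100
    }
}
```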
INCRBY KEY_NAME INCR_AMOUNT

9. Controlling read‑task concurrency
Limit the number of active read‑write tasks per database using a distributed semaphore (Redisson). Each node acquires a permit before processing a task.
RedissonClient redissonClient = Redisson.create(config);
RSemaphore rSemaphore = redissonClient.getSemaphore("semaphore");
// set concurrency to 1
rSemaphore.trySetPermits(1);
rSemaphore.tryAcquire(); // non‑blocking acquire

Because Redisson semaphores lack renewal, the design switches to a leader‑election model using Zookeeper + Curator. The leader assigns tasks, publishes Kafka messages, and workers acquire a distributed lock (with renewal) before processing.
10. Summary
Clarify constraints before designing a solution.
Large data volume requires sharding and careful table sizing.
Single‑instance write bottlenecks dictate the need for multiple databases.
Limit concurrent writes per database and allow dynamic tuning.
Compare MySQL storage engines (InnoDB vs MyISAM) in production.
Determine optimal batch size through testing.
Combine read and write tasks to avoid ordering issues.
Use Redis to track progress and guarantee idempotency.
Employ Zookeeper‑based leader election and distributed locks to coordinate tasks safely.
Selected Java Interview Questions