Generating One Billion SQLite Rows in Under a Minute: Python, PyPy, and Rust Performance Comparison
A programmer needed to create a billion‑row SQLite test database within a minute, found a naïve Python script unbearably slow, applied batch inserts and SQLite PRAGMA tweaks, then compared CPython, PyPy, and Rust implementations, ultimately achieving sub‑minute runtimes with Rust and highlighting best‑practice optimizations.
A developer attempted to generate a 1 billion‑row SQLite database on a 2019 MacBook Pro (2.4 GHz quad‑core i5) within one minute; the initial Python script using a simple for‑loop took about 15 minutes for 10 million rows and was deemed far too slow.
Switching to batch inserts of 100 k rows reduced the runtime to roughly 10 minutes, revealing that SQLite’s per‑transaction overhead was the main bottleneck.
Further SQLite optimizations were applied: disabling journal_mode , turning off synchronous , increasing cache_size , setting locking_mode=EXCLUSIVE , and using temp_store=MEMORY . The author warned that these settings are unsuitable for production environments.
Running the batch version under PyPy (which claims a 4× speedup over CPython) required only 2.5 minutes, about 3.5× faster than the CPython run.
A Rust rewrite was then explored. The naïve Rust loop still took about 3 minutes, but using prepared statements and inserting in batches of 50 rows cut the time to 34.3 seconds; a multithreaded version with one consumer thread and four producer threads achieved the best result of 32.37 seconds.
Testing with an in‑memory SQLite database ( :memory: ) showed the Rust version completing in 29 seconds, indicating that disk I/O contributed only ~2 seconds to the overall runtime and that most time is spent generating and inserting data.
Benchmark results were summarized in a table: Rust – 33 seconds, PyPy – 126 seconds, CPython – 210 seconds.
The final recommendations emphasized using SQLite PRAGMA statements, prepared statements, and batch inserts; acknowledging that PyPy can be significantly faster than CPython, while async approaches do not always yield speed gains.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.