Parallel Processing of Large CSV Files with multiprocessing, joblib, and tqdm in Python
This tutorial demonstrates how to accelerate processing of a multi‑million‑row CSV dataset by dividing the work into parallel tasks using Python's multiprocessing, joblib, and tqdm libraries, comparing serial, multi‑process, batch, and process‑map approaches with detailed timing results.
To achieve parallel processing, the task is split into sub-units that run concurrently across worker processes, reducing overall execution time.
The example uses the US Accidents (2016‑2021) dataset from Kaggle (2.8 million rows, 47 columns) and imports multiprocessing as mp; Parallel and delayed from joblib; tqdm from tqdm.notebook; pandas; re; NLTK's stopwords; and string.
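The article applies a clean_text function throughout but never shows its body. Given the imports it lists (re, string, NLTK stopwords), a plausible sketch is the following; the exact cleaning steps are an assumption, and the fallback stopword set is only there to keep the sketch self-contained if the NLTK corpus is not downloaded:

```python
import re
import string

# The article imports NLTK's stopwords; fall back to a tiny built-in set
# if the corpus has not been downloaded via nltk.download("stopwords").
try:
    from nltk.corpus import stopwords
    STOPWORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    STOPWORDS = {"the", "a", "an", "is", "on", "in", "to"}

def clean_text(text):
    """Hypothetical cleaner: lowercase, strip digits and punctuation, drop stopwords."""
    text = str(text).lower()
    text = re.sub(r"\d+", "", text)  # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(w for w in text.split() if w not in STOPWORDS)
```

Because every worker process must be able to import this function, it should live at module level (or in a separate module) rather than inside a notebook cell's closure.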
Workers are set by doubling the CPU count: n_workers = 2 * mp.cpu_count() (e.g., 8 workers).
Serial processing uses tqdm.pandas() and df['Description'] = df['Description'].progress_apply(clean_text), taking about 9 minutes 5 seconds for 2.8 M rows.
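A minimal runnable version of the serial step, with str.lower standing in for the article's clean_text, plain tqdm instead of the notebook variant, and a two-row stand-in frame instead of the 2.8 M-row dataset:

```python
import pandas as pd
from tqdm import tqdm  # the article uses tqdm.notebook inside Jupyter

tqdm.pandas()  # registers .progress_apply() on Series/DataFrame

# Tiny stand-in frame; the article runs this over 2.8M Description rows.
df = pd.DataFrame({"Description": ["Lane BLOCKED due to accident.",
                                   "Slow traffic on the RAMP."]})

# str.lower stands in for clean_text; progress_apply shows a live progress bar
df["Description"] = df["Description"].progress_apply(str.lower)
```

progress_apply behaves exactly like Series.apply except that it drives a tqdm progress bar, which is what makes the 9-minute serial run bearable to watch.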
Multiprocessing with Pool creates a pool of workers and maps the cleaning function: p = mp.Pool(n_workers) followed by df['Description'] = p.map(clean_text, tqdm(df['Description'])), reducing the time to roughly 3 minutes 51 seconds.
Joblib Parallel defines a helper, text_parallel_clean(array), that returns Parallel(n_jobs=n_workers, backend="multiprocessing")(delayed(clean_text)(text) for text in tqdm(array)), and applies it to the column, achieving a runtime of about 4 minutes 4 seconds.
Batch processing first splits the column into roughly n_workers equal slices with a batch_file(array, n_workers) helper (batch_size = round(file_len / n_workers), then batches = [array[ix:ix+batch_size] for ix in range(0, file_len, batch_size)]), then processes each batch in parallel using Parallel and delayed(proc_batch), resulting in a wall time of approximately 3 minutes 56 seconds.
tqdm.contrib.concurrent.process_map offers a concise one-liner with a built-in progress bar: df['Description'] = process_map(clean_text, df['Description'], max_workers=n_workers, chunksize=batch), tying the Pool approach for the fastest time (about 3 minutes 51 seconds).
The conclusion emphasizes selecting the appropriate method (serial, parallel, or batch) based on dataset size and complexity, and suggests alternatives such as Dask, datatable, or RAPIDS for further performance gains.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.