Parallel Processing of Large CSV Files with multiprocessing, joblib, and tqdm in Python
This tutorial demonstrates how to accelerate processing of a multi‑million‑row CSV dataset by dividing the work into parallel tasks using Python's multiprocessing, joblib, and tqdm libraries, comparing serial, multi‑process, batch, and process‑map approaches with detailed timing results.
To achieve parallel processing, the task is split into sub-units that run concurrently across worker processes, reducing overall execution time.
The example uses the US Accidents (2016‑2021) dataset from Kaggle (2.8 million rows, 47 columns) and imports multiprocessing as mp; Parallel and delayed from joblib; tqdm from tqdm.notebook; pandas; re; NLTK's stopwords; and string.
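The article applies a clean_text function throughout but never shows its body. Given the imports it lists (re, string, NLTK stopwords), a plausible sketch is the following; the exact cleaning steps are an assumption, and the fallback stopword set is only there to keep the sketch self-contained if the NLTK corpus is not downloaded:

```python
import re
import string

# The article imports NLTK's stopwords; fall back to a tiny built-in set
# if the corpus has not been downloaded via nltk.download("stopwords").
try:
    from nltk.corpus import stopwords
    STOPWORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    STOPWORDS = {"the", "a", "an", "is", "on", "in", "to"}

def clean_text(text):
    """Hypothetical cleaner: lowercase, strip digits and punctuation, drop stopwords."""
    text = str(text).lower()
    text = re.sub(r"\d+", "", text)  # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(w for w in text.split() if w not in STOPWORDS)
```

Because every worker process must be able to import this function, it should live at module level (or in a separate module) rather than inside a notebook cell's closure.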
Workers are set by doubling the CPU count: n_workers = 2 * mp.cpu_count() (e.g., 8 workers).
Serial processing uses tqdm.pandas() and df['Description'] = df['Description'].progress_apply(clean_text), taking about 9 minutes 5 seconds for 2.8 M rows.
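A minimal runnable version of the serial step, with str.lower standing in for the article's clean_text, plain tqdm instead of the notebook variant, and a two-row stand-in frame instead of the 2.8 M-row dataset:

```python
import pandas as pd
from tqdm import tqdm  # the article uses tqdm.notebook inside Jupyter

tqdm.pandas()  # registers .progress_apply() on Series/DataFrame

# Tiny stand-in frame; the article runs this over 2.8M Description rows.
df = pd.DataFrame({"Description": ["Lane BLOCKED due to accident.",
                                   "Slow traffic on the RAMP."]})

# str.lower stands in for clean_text; progress_apply shows a live progress bar
df["Description"] = df["Description"].progress_apply(str.lower)
```

progress_apply behaves exactly like Series.apply except that it drives a tqdm progress bar, which is what makes the 9-minute serial run bearable to watch.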
Multiprocessing with Pool creates a pool of workers and maps the cleaning function: p = mp.Pool(n_workers) followed by df['Description'] = p.map(clean_text, tqdm(df['Description'])), reducing the time to roughly 3 minutes 51 seconds.
Joblib Parallel defines a helper, text_parallel_clean(array), that returns Parallel(n_jobs=n_workers, backend="multiprocessing")(delayed(clean_text)(text) for text in tqdm(array)), and applies it to the column, achieving a runtime of about 4 minutes 4 seconds.
Batch processing first splits the column into roughly n_workers equal slices with a batch_file(array, n_workers) helper (batch_size = round(file_len / n_workers), then batches = [array[ix:ix+batch_size] for ix in range(0, file_len, batch_size)]), then processes each batch in parallel using Parallel and delayed(proc_batch), resulting in a wall time of approximately 3 minutes 56 seconds.
tqdm.contrib.concurrent.process_map offers a concise one-liner with a built-in progress bar: df['Description'] = process_map(clean_text, df['Description'], max_workers=n_workers, chunksize=batch), tying the Pool approach for the fastest time (about 3 minutes 51 seconds).
The conclusion emphasizes selecting the appropriate method (serial, parallel, or batch) based on dataset size and complexity, and suggests alternatives such as Dask, datatable, or RAPIDS for further performance gains.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.