Optimizing pandas DataFrames to Reduce Memory Usage by Up to 90%
This tutorial demonstrates how to analyze pandas memory consumption, downcast numeric columns, convert object columns to categoricals, and specify optimal dtypes when reading CSV files. Together, these steps cut the example DataFrame's memory usage by nearly 90% while preserving full analytical capability.
pandas is a Python library for data manipulation and analysis, but large datasets (hundreds of megabytes to several gigabytes) can quickly exhaust memory and cause performance issues.
After loading the merged MLB game logs CSV:

import pandas as pd

gl = pd.read_csv('game_logs.csv')
gl.head()

gl.info(memory_usage='deep') reveals 171,907 rows, 161 columns, and 861.6 MB of memory usage, with many columns stored as object (strings).
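The tutorial reports sizes through a mem_usage helper that is referenced but not defined in this excerpt. A minimal sketch of such a helper, assuming it should format the deep memory usage of a DataFrame or Series in megabytes:

```python
import pandas as pd

def mem_usage(pandas_obj):
    """Return deep memory usage of a DataFrame or Series as an 'xx.xx MB' string."""
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:
        # A Series returns a single scalar from memory_usage(deep=True)
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2
    return "{:03.2f} MB".format(usage_mb)

df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})
print(mem_usage(df))
print(mem_usage(df["b"]))
```

Passing deep=True is what forces pandas to inspect the actual Python string objects inside object columns instead of only counting pointer sizes.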
Internally, pandas groups columns of the same dtype into blocks; numeric blocks are stored as contiguous NumPy ndarrays, while object blocks store pointers to Python string objects, leading to fragmented and memory-heavy storage.
Numeric subtypes (e.g., int8, float32) use fewer bytes per value. Integer columns are downcast with:

gl_int = gl.select_dtypes(include=['int'])
converted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')
print(mem_usage(gl_int))
print(mem_usage(converted_int))

reducing int memory from 7.87 MB to 1.48 MB (≈80% reduction). Float columns are similarly downcast (with downcast='float'), cutting memory from 100.99 MB to 50.49 MB.
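The downcasting step can be seen end to end on a small synthetic frame; the column names here are illustrative, not from the game logs:

```python
import pandas as pd

# Synthetic integer and float columns; small values fit in narrow subtypes.
df = pd.DataFrame({
    "runs": pd.Series(range(100), dtype="int64"),
    "avg": pd.Series([0.25] * 100, dtype="float64"),
})

# Downcast integers to the smallest unsigned subtype that holds the values.
df_int = df.select_dtypes(include=["int"]).apply(pd.to_numeric, downcast="unsigned")
# Downcast floats to float32 where the values survive the narrowing.
df_float = df.select_dtypes(include=["float"]).apply(pd.to_numeric, downcast="float")

print(df_int["runs"].dtype)   # uint8, since all values fit in 0..255
print(df_float["avg"].dtype)  # float32
```

downcast='unsigned' is only safe for columns with no negative values; for signed data, downcast='integer' picks the smallest signed subtype instead.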
Object columns dominate memory usage. Converting low‑cardinality string columns to category dtype replaces strings with integer codes. For example, converting the day_of_week column:
dow = gl_obj.day_of_week
dow_cat = dow.astype('category')
print(mem_usage(dow))
print(mem_usage(dow_cat))

reduces memory from 9.84 MB to 0.16 MB (≈98% reduction). A loop checks each object column and converts it to category when the number of unique values is less than 50% of the total rows, shrinking object memory from 752 MB to 52 MB.
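The selective conversion loop can be sketched as follows; the example frame and its column names are made up to show the 50% cardinality test:

```python
import pandas as pd

df = pd.DataFrame({
    "day_of_week": ["Mon", "Tue", "Wed"] * 100,        # low cardinality
    "game_id": ["g{}".format(i) for i in range(300)],  # all values unique
})

converted_obj = pd.DataFrame()
for col in df.select_dtypes(include=["object"]).columns:
    num_unique = df[col].nunique()
    num_total = len(df[col])
    # Convert only when repeated values make the integer codes pay off.
    if num_unique / num_total < 0.5:
        converted_obj[col] = df[col].astype("category")
    else:
        converted_obj[col] = df[col]

print(converted_obj.dtypes)
```

High-cardinality columns such as unique identifiers are left as object on purpose: a category column still has to store every distinct string once, so converting them would save little or nothing.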
All optimizations can be applied at load time by building a dtype dictionary from the optimized DataFrame and passing it to pd.read_csv:
read_and_optimized = pd.read_csv('game_logs.csv', dtype=column_types, parse_dates=['date'], infer_datetime_format=True)
print(mem_usage(read_and_optimized))

This loads the data using only 104 MB, an 88% reduction from the original 861 MB.
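Building the column_types dictionary itself is straightforward: map each column name of the already-optimized DataFrame to the string name of its dtype (the date column would be excluded, since parse_dates handles it). A sketch with a hypothetical two-column optimized frame:

```python
import pandas as pd

# Stand-in for the DataFrame after downcasting and category conversion.
optimized = pd.DataFrame({
    "day_of_week": pd.Series(["Mon", "Tue"], dtype="category"),
    "v_score": pd.Series([3, 5], dtype="uint8"),
})

# Map each column name to its optimized dtype's string name.
column_types = {col: str(dtype) for col, dtype in optimized.dtypes.items()}
print(column_types)
# This dict can then be passed straight to pd.read_csv(..., dtype=column_types).
```

Storing dtypes as strings ('category', 'uint8', ...) keeps the dictionary easy to serialize, and pd.read_csv accepts either the strings or the dtype objects.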
With the optimized DataFrame, simple analyses such as plotting game counts by weekday over years and examining game length trends become fast and memory‑efficient.
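The weekday-by-year count behind that plot is a cheap aggregation on the optimized frame. A sketch on a tiny hypothetical slice, assuming date and day_of_week columns as in the game logs:

```python
import pandas as pd

# Hypothetical slice of the optimized game logs.
games = pd.DataFrame({
    "date": pd.to_datetime(["1990-04-02", "1990-04-03", "1991-04-01"]),
    "day_of_week": pd.Categorical(["Mon", "Tue", "Mon"]),
})

# Games per (year, weekday); observed=True skips unused category combinations.
counts = games.groupby([games["date"].dt.year, "day_of_week"], observed=True).size()
print(counts)
```

On the full dataset the same groupby feeds directly into a stacked or unstacked plot of game counts per weekday over the years.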
Key takeaways: downcast numeric columns, convert suitable object columns to category, and specify optimal dtypes when reading data to dramatically lower pandas memory footprints while retaining full analytical power.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.