Optimizing pandas DataFrames to Reduce Memory Usage by Up to 90%
This tutorial demonstrates how to analyze pandas memory consumption, downcast numeric columns, convert object columns to categoricals, and specify optimal dtypes when reading CSV files. Together, these steps cut the example DataFrame's memory usage by nearly 90% while preserving full analytical capability.
pandas is a Python library for data manipulation and analysis, but large datasets (hundreds of megabytes to several gigabytes) can quickly exhaust memory and cause performance issues.
After loading the merged MLB game logs CSV:

import pandas as pd

gl = pd.read_csv('game_logs.csv')
gl.head()

gl.info(memory_usage='deep') reveals 171,907 rows, 161 columns, and 861.6 MB of memory usage, with many columns stored as object (strings).
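The tutorial reports sizes through a mem_usage helper that is referenced but not defined in this excerpt. A minimal sketch of such a helper, assuming it should format the deep memory usage of a DataFrame or Series in megabytes:

```python
import pandas as pd

def mem_usage(pandas_obj):
    """Return deep memory usage of a DataFrame or Series as an 'xx.xx MB' string."""
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:
        # A Series returns a single scalar from memory_usage(deep=True)
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2
    return "{:03.2f} MB".format(usage_mb)

df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})
print(mem_usage(df))
print(mem_usage(df["b"]))
```

Passing deep=True is what forces pandas to inspect the actual Python string objects inside object columns instead of only counting pointer sizes.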
Internally, pandas groups columns of the same dtype into blocks; numeric blocks are stored as contiguous NumPy ndarrays, while object blocks store pointers to Python string objects, leading to fragmented and memory-heavy storage.
Numeric subtypes (e.g., int8, float32) use fewer bytes per value. Integer columns are downcast with:

gl_int = gl.select_dtypes(include=['int'])
converted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')
print(mem_usage(gl_int))
print(mem_usage(converted_int))

reducing int memory from 7.87 MB to 1.48 MB (≈80% reduction). Float columns are similarly downcast (with downcast='float'), cutting memory from 100.99 MB to 50.49 MB.
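The downcasting step can be seen end to end on a small synthetic frame; the column names here are illustrative, not from the game logs:

```python
import pandas as pd

# Synthetic integer and float columns; small values fit in narrow subtypes.
df = pd.DataFrame({
    "runs": pd.Series(range(100), dtype="int64"),
    "avg": pd.Series([0.25] * 100, dtype="float64"),
})

# Downcast integers to the smallest unsigned subtype that holds the values.
df_int = df.select_dtypes(include=["int"]).apply(pd.to_numeric, downcast="unsigned")
# Downcast floats to float32 where the values survive the narrowing.
df_float = df.select_dtypes(include=["float"]).apply(pd.to_numeric, downcast="float")

print(df_int["runs"].dtype)   # uint8, since all values fit in 0..255
print(df_float["avg"].dtype)  # float32
```

downcast='unsigned' is only safe for columns with no negative values; for signed data, downcast='integer' picks the smallest signed subtype instead.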
Object columns dominate memory usage. Converting low‑cardinality string columns to category dtype replaces strings with integer codes. For example, converting the day_of_week column:
dow = gl_obj.day_of_week
dow_cat = dow.astype('category')
print(mem_usage(dow))
print(mem_usage(dow_cat))

reduces memory from 9.84 MB to 0.16 MB (≈98% reduction). A loop checks each object column and converts it to category when the number of unique values is less than 50% of the total rows, shrinking object memory from 752 MB to 52 MB.
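The selective conversion loop can be sketched as follows; the example frame and its column names are made up to show the 50% cardinality test:

```python
import pandas as pd

df = pd.DataFrame({
    "day_of_week": ["Mon", "Tue", "Wed"] * 100,        # low cardinality
    "game_id": ["g{}".format(i) for i in range(300)],  # all values unique
})

converted_obj = pd.DataFrame()
for col in df.select_dtypes(include=["object"]).columns:
    num_unique = df[col].nunique()
    num_total = len(df[col])
    # Convert only when repeated values make the integer codes pay off.
    if num_unique / num_total < 0.5:
        converted_obj[col] = df[col].astype("category")
    else:
        converted_obj[col] = df[col]

print(converted_obj.dtypes)
```

High-cardinality columns such as unique identifiers are left as object on purpose: a category column still has to store every distinct string once, so converting them would save little or nothing.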
All optimizations can be applied at load time by building a dtype dictionary from the optimized DataFrame and passing it to pd.read_csv:
read_and_optimized = pd.read_csv('game_logs.csv', dtype=column_types, parse_dates=['date'], infer_datetime_format=True)
print(mem_usage(read_and_optimized))

This loads the data using only 104 MB, an 88% reduction from the original 861 MB.
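Building the column_types dictionary itself is straightforward: map each column name of the already-optimized DataFrame to the string name of its dtype (the date column would be excluded, since parse_dates handles it). A sketch with a hypothetical two-column optimized frame:

```python
import pandas as pd

# Stand-in for the DataFrame after downcasting and category conversion.
optimized = pd.DataFrame({
    "day_of_week": pd.Series(["Mon", "Tue"], dtype="category"),
    "v_score": pd.Series([3, 5], dtype="uint8"),
})

# Map each column name to its optimized dtype's string name.
column_types = {col: str(dtype) for col, dtype in optimized.dtypes.items()}
print(column_types)
# This dict can then be passed straight to pd.read_csv(..., dtype=column_types).
```

Storing dtypes as strings ('category', 'uint8', ...) keeps the dictionary easy to serialize, and pd.read_csv accepts either the strings or the dtype objects.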
With the optimized DataFrame, simple analyses such as plotting game counts by weekday over years and examining game length trends become fast and memory‑efficient.
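The weekday-by-year count behind that plot is a cheap aggregation on the optimized frame. A sketch on a tiny hypothetical slice, assuming date and day_of_week columns as in the game logs:

```python
import pandas as pd

# Hypothetical slice of the optimized game logs.
games = pd.DataFrame({
    "date": pd.to_datetime(["1990-04-02", "1990-04-03", "1991-04-01"]),
    "day_of_week": pd.Categorical(["Mon", "Tue", "Mon"]),
})

# Games per (year, weekday); observed=True skips unused category combinations.
counts = games.groupby([games["date"].dt.year, "day_of_week"], observed=True).size()
print(counts)
```

On the full dataset the same groupby feeds directly into a stacked or unstacked plot of game counts per weekday over the years.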
Key takeaways: downcast numeric columns, convert suitable object columns to category, and specify optimal dtypes when reading data to dramatically lower pandas memory footprints while retaining full analytical power.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.