Using pandas‑profiling for Fast Exploratory Data Analysis in Python
This article introduces pandas-profiling, a Python library that automates exploratory data analysis. It compares the library with R's skimr and pandas' DataFrame.describe(), shows quick installation and usage examples, and explains how to customize reports in code or through a YAML configuration file, with small to medium datasets in mind.
For anyone involved in data science, the initial steps of data cleaning and exploratory data analysis (EDA) consume a large share of the project timeline (a commonly quoted rule of thumb puts it as high as 80% of the effort), and the quality of this work directly affects model performance.
When a new dataset arrives, analysts first familiarize themselves with the data through manual inspection and field definitions before proceeding to the actual EDA phase, which typically involves basic statistical summaries such as mean, variance, min/max, frequencies, quantiles, and distributions.
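As a point of reference, the summaries listed above can be computed by hand with plain pandas. The sketch below uses a tiny made-up table (the column names and values are illustrative only, not taken from a real dataset) to show how much code the manual route takes before any plotting is done.

```python
import pandas as pd

# Tiny illustrative table; the values are invented for this example.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "sex": ["male", "female", "female", "female", "male"],
})

numeric_cols = ["age", "fare"]

# Mean, variance, min and max for each numeric column (NaN is skipped).
numeric_summary = df[numeric_cols].agg(["mean", "var", "min", "max"])

# Quartiles for each numeric column.
quantiles = df[numeric_cols].quantile([0.25, 0.5, 0.75])

# Frequency table for a categorical column.
sex_counts = df["sex"].value_counts()

# Number of missing values per column.
missing = df.isna().sum()

print(numeric_summary)
print(quantiles)
print(sex_counts)
print(missing)
```

Each of these calls covers one slice of the picture, which is the repetitive work a profiling report is meant to bundle.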
In R, the skimr package provides richer exploratory statistics than pandas' describe(). In the Python ecosystem, the pandas-profiling library offers comparable, and in many cases superior, functionality.
Quick Start
After installing the library with pip install pandas-profiling, you can generate a full report with a single line of code:
<code>import pandas as pd
import seaborn as sns
from pandas_profiling import ProfileReport
titanic = sns.load_dataset("titanic")  # seaborn dataset names are lowercase
ProfileReport(titanic, title="The EDA of Titanic Dataset")
</code>When run inside a Jupyter Notebook, the report renders directly in the notebook cell.
The library also extends the DataFrame object, allowing you to call DataFrame.profile_report() for the same effect.
The generated ProfileReport object supports several output formats: in interactive notebooks you can call to_widgets() or to_notebook_iframe(), while in other environments you can export the report to an HTML file with to_file().
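The choice between these three outputs can be wrapped in a small helper. This is only a sketch: render_report and its target names are hypothetical and not part of pandas-profiling; the only library calls it relies on are the to_widgets(), to_notebook_iframe(), and to_file() methods mentioned above.

```python
def render_report(profile, target="html", path="eda-report.html"):
    """Send a ProfileReport-like object to one of three outputs.

    `render_report` is a hypothetical helper, not a pandas-profiling API.
    `profile` only needs the three output methods named in the article.
    """
    if target == "widgets":
        # Interactive widget view inside a notebook.
        profile.to_widgets()
    elif target == "iframe":
        # Full HTML report embedded in a notebook cell.
        profile.to_notebook_iframe()
    elif target == "html":
        # Standalone HTML file for use outside notebooks.
        profile.to_file(path)
    else:
        raise ValueError(f"unknown target: {target!r}")
    return target


# Example call, assuming `titanic` and ProfileReport from the Quick Start:
# render_report(ProfileReport(titanic), target="html", path="titanic.html")
```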
Customizing the Report
Although the default report covers many basic needs, you can fine-tune its content either by passing options in code or by editing a YAML configuration file. The configuration controls sections such as vars (variable statistics), missing_diagrams (visualization of missing values), correlations, interactions, and samples (head/tail previews).
Example Python configuration:
<code>profile_config = {
    "progress_bar": False,
    "sort": "ascending",
    "vars": {
        "num": {"chi_squared_threshold": 0.95},
        "cat": {"n_obs": 10},
    },
    "missing_diagrams": {
        "heatmap": False,
        "dendrogram": False,
    },
}
profile = titanic.profile_report(**profile_config)
profile.to_file("titanic-EDA-report.html")
</code>Alternatively, you can place the same options in a YAML file (e.g., profile_config.yml) and pass its path via the config_file argument:
<code>titanic.profile_report(config_file="profile_config.yml")
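# Sketch of what such a YAML file might contain. This is an assumption: the keys
# simply mirror the Python dictionary shown earlier, and the accepted names can
# differ between library versions, so check them against your installed release.
#
#   progress_bar: false
#   sort: ascending
#   vars:
#     num:
#       chi_squared_threshold: 0.95
#     cat:
#       n_obs: 10
#   missing_diagrams:
#     heatmap: false
#     dendrogram: false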
</code>Conclusion
The pandas-profiling library provides a convenient, fast way to generate comprehensive EDA reports with richer statistics and visualizations than basic pandas methods, saving considerable time during the data‑preparation phase. It works best on small to medium datasets; for larger data you may need to sample or wait for future versions that integrate high‑performance back‑ends such as Modin, Spark, or Dask.
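For larger tables, the sampling workaround mentioned above can be as simple as capping the row count before building the report. The helper below is a minimal sketch: sample_for_profiling, its max_rows default, and the synthetic DataFrame are all illustrative assumptions, and a random sample can hide rare values, so treat the resulting report as approximate.

```python
import pandas as pd


def sample_for_profiling(df, max_rows=10_000, seed=0):
    """Return df unchanged if it is small enough, else a random row sample.

    Hypothetical helper for illustration; not part of pandas-profiling.
    """
    if len(df) <= max_rows:
        return df
    # Sampling without replacement keeps each original row at most once.
    return df.sample(n=max_rows, random_state=seed)


# Synthetic stand-in for a large dataset.
big = pd.DataFrame({"x": range(100_000)})
subset = sample_for_profiling(big, max_rows=5_000)
print(len(subset))

# The sample can then be profiled as usual, for example:
# subset.profile_report(title="EDA on a 5,000-row sample")
```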
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.