Using pandas‑profiling for Fast Exploratory Data Analysis in Python

This article introduces pandas‑profiling as a powerful Python library for automating exploratory data analysis, compares it with R's skimr and pandas.describe(), shows quick installation and usage examples, and explains how to customize reports via code or YAML configuration for small to medium datasets.

Python Programming Learning Circle

Dec 25, 2024

For anyone working in data science, data cleaning and exploratory data analysis (EDA) take up a large share of a project, by commonly quoted estimates as much as 80% of the effort, and the quality of that work directly affects model performance.

When a new dataset arrives, analysts first familiarize themselves with the data through manual inspection and field definitions before proceeding to the actual EDA phase, which typically involves basic statistical summaries such as mean, variance, min/max, frequencies, quantiles, and distributions.
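As a point of reference, the basic summaries listed above can be computed by hand with plain pandas. The sketch below uses a small made-up DataFrame (the column names and values are illustrative assumptions, not taken from a real dataset) and shows the kind of manual work that a profiling report automates.

```python
import pandas as pd

# Small hand-made frame; the column names and values are illustrative only.
df = pd.DataFrame(
    {
        "age": [22, 38, 26, 35, 54],
        "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
        "embarked": ["S", "C", "S", "S", "Q"],
    }
)

numeric = df[["age", "fare"]]

# Mean, sample variance, min and max for each numeric column.
summary = numeric.agg(["mean", "var", "min", "max"])

# Quartiles (25%, 50%, 75%) for each numeric column.
quartiles = numeric.quantile([0.25, 0.5, 0.75])

# Frequency table for a categorical column.
embarked_counts = df["embarked"].value_counts()

print(summary)
print(quartiles)
print(embarked_counts)
```

Each statistic needs its own call and its own choice of columns, which is why this step tends to grow into a lot of repetitive code on wider tables.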

In R, the skimr package provides richer exploratory statistics than pandas' describe(). In the Python ecosystem, the pandas-profiling library offers comparable, and in many cases superior, functionality.
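To make the gap concrete, here is a minimal sketch of what describe() returns by default and of a hand-built per-column overview that adds missing-value information. The overview columns (n_missing, complete_rate, n_unique) are loosely modeled on the kind of output skimr prints; this is an illustration on made-up data, not a reimplementation of either library.

```python
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame(
    {
        "age": [22.0, None, 26.0, 35.0],
        "sex": ["male", "female", "female", None],
    }
)

# By default, describe() summarises numeric columns only.
default_summary = df.describe()

# A per-column overview built by hand: dtype, missing count, share of
# non-missing values, and number of distinct non-missing values.
overview = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "complete_rate": df.notna().mean(),
        "n_unique": df.nunique(),
    }
)

print(default_summary)
print(overview)
```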

Quick Start

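After installing the library with pip install pandas-profiling, you can generate a full report with a single call. (Newer releases of the project are published on PyPI as ydata-profiling, with the import path ydata_profiling; the examples here use the original package name.)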

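import pandas as pd
import seaborn as sns
from pandas_profiling import ProfileReport

# seaborn's example dataset names are lowercase
titanic = sns.load_dataset("titanic")

# Build the report; as the last expression in a notebook cell it is rendered inline
ProfileReport(titanic, title="The EDA of Titanic Dataset")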

When run inside a Jupyter Notebook, the report renders directly in the notebook cell.

The library also extends the DataFrame object, allowing you to call DataFrame.profile_report() for the same effect.

The resulting ProfileReport object can be displayed or exported in several ways: in an interactive notebook, to_widgets() or to_notebook_iframe() render it in place, while in other environments to_file() exports the report to an HTML file.

Customizing the Report

Although the default report covers many basic needs, you can fine‑tune its content either by passing options in code or by editing a YAML configuration file. The configuration controls sections such as vars (variable statistics), missing_diagrams (visualization of missing values), correlations, interactions, and samples (head/tail previews).

Example Python configuration:

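# Options are passed as keyword arguments to profile_report()
profile_config = {
    "progress_bar": False,  # do not show a progress bar while the report is built
    "sort": "ascending",  # sort variables in ascending order
    "vars": {
        "num": {"chi_squared_threshold": 0.95},
        "cat": {"n_obs": 10},
    },
    "missing_diagrams": {
        "heatmap": False,  # skip the missing-value heatmap
        "dendrogram": False,  # skip the missing-value dendrogram
    },
}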

profile = titanic.profile_report(**profile_config)
profile.to_file("titanic-EDA-report.html")

Alternatively, you can place the same options in a YAML file (e.g., profile_config.yml) and pass its path via the config_file argument:

titanic.profile_report(config_file="profile_config.yml")
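For illustration, the YAML counterpart of the profile_config dictionary shown earlier might look like the text written out by the sketch below. It assumes that the installed version accepts a partial file and fills in defaults for everything else; if that does not hold for your version, a safer route is to copy the library's default configuration file and change only these keys.

```python
from pathlib import Path

# YAML equivalent of the profile_config dictionary from the Python example.
# Assumption: keys omitted here fall back to the library's defaults.
config_text = """\
progress_bar: false
sort: ascending
vars:
  num:
    chi_squared_threshold: 0.95
  cat:
    n_obs: 10
missing_diagrams:
  heatmap: false
  dendrogram: false
"""

# Write the file next to the script or notebook; adjust the path as needed.
config_path = Path("profile_config.yml")
config_path.write_text(config_text, encoding="utf-8")

# The file can then be handed to the report via config_file, for example:
# titanic.profile_report(config_file=str(config_path))
```

Keeping the options in a file makes it easy to reuse one report layout across several datasets or notebooks.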

Conclusion

The pandas-profiling library provides a convenient, fast way to generate comprehensive EDA reports with richer statistics and visualizations than basic pandas methods, saving considerable time during the data‑preparation phase. It works best on small to medium datasets; for larger data you may need to sample or wait for future versions that integrate high‑performance back‑ends such as Modin, Spark, or Dask.
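The sampling suggestion can be sketched with plain pandas. The helper name sample_for_profiling and the row limits below are hypothetical choices made for this example, not library defaults; pick a limit that suits your data and hardware.

```python
import pandas as pd


def sample_for_profiling(df, max_rows=10_000, seed=0):
    """Return df unchanged if it is small enough, else a random sample of max_rows rows."""
    if len(df) <= max_rows:
        return df
    # random_state makes the sample reproducible between runs.
    return df.sample(n=max_rows, random_state=seed)


# Illustrative "large" frame with 50,000 rows.
big = pd.DataFrame({"x": range(50_000)})
subset = sample_for_profiling(big, max_rows=5_000)
print(len(subset))

# The sample can then be profiled like any other DataFrame, for example:
# subset.profile_report(title="Profile of a 5,000-row sample")
```

A random sample keeps the overall distributions roughly intact, but rare categories and extreme values may be missing from it, so treat a report built on a sample as a first look rather than a complete picture.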
