Data Preprocessing and Modeling with Pandas and Scikit‑learn
This guide walks through using Pandas for data cleaning, feature engineering, and preparation, then demonstrates building, evaluating, and persisting a machine‑learning model with Scikit‑learn's pipeline and RandomForestClassifier in Python.
In the Pandas context, "data modeling" typically means using Pandas for preprocessing and preparation so that the data can be fed into machine‑learning models: Pandas does not train models itself, but it provides powerful tools for cleaning, transforming, and shaping data.
1. Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
2. Read data
Assume a CSV file data.csv containing features and a target variable.
# Read data
df = pd.read_csv('data.csv')
print(df.head())
3. Data exploration
Inspect basic information, descriptive statistics, and missing values.
# View basic info
print(df.info())
# Descriptive statistics
print(df.describe())
# Missing values count
print(df.isnull().sum())
4. Data cleaning
Handle missing values and outliers.
# Drop rows with missing values
df = df.dropna()  # remove rows containing NaNs
# Or, instead of dropping, fill missing numeric values with the column mean
df = df.fillna(df.mean(numeric_only=True))  # numeric_only avoids errors on text columns
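A column mean only makes sense for numeric data; for text columns a common pattern is to fill with the mode instead. A minimal sketch on a hypothetical toy frame (the column names here are illustrative, not from the dataset above):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both a numeric and a categorical column
toy = pd.DataFrame({
    'Age': [25.0, np.nan, 40.0, 35.0],
    'Gender': ['M', 'F', None, 'F'],
})

# Numeric column: fill with the column mean
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())
# Categorical column: fill with the most frequent value
toy['Gender'] = toy['Gender'].fillna(toy['Gender'].mode()[0])

print(toy.isnull().sum().sum())  # 0 -- no missing values remain
```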
# Remove outliers, e.g., age > 100
df = df[df['Age'] <= 100]
5. Feature engineering
Create new features or transform existing ones, such as age groups and date components.
# Create age group feature
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 35, 50, 65, 100], labels=['Child', 'Young', 'Adult', 'Middle Age', 'Senior'])
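To see how pd.cut assigns labels, here is a small standalone check: bins are right‑closed by default, so 18 falls in the (0, 18] "Child" bin while 19 falls in (18, 35] "Young":

```python
import pandas as pd

# pd.cut maps each value to the bin it falls in (right-closed intervals)
ages = pd.Series([10, 18, 19, 45, 70])
groups = pd.cut(ages, bins=[0, 18, 35, 50, 65, 100],
                labels=['Child', 'Young', 'Adult', 'Middle Age', 'Senior'])
print(list(groups))  # ['Child', 'Child', 'Young', 'Adult', 'Senior']
```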
# Convert date column to datetime and extract components
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
6. Split dataset
# Define features and target
df_X = df.drop(columns=['Target'])
y = df['Target']
# Train‑test split
X_train, X_test, y_train, y_test = train_test_split(df_X, y, test_size=0.2, random_state=42)
7. Feature scaling and encoding
Standardize numeric features and one‑hot encode categorical features, then build a pipeline with a RandomForest classifier.
# Define feature groups
numeric_features = ['Age', 'Income']
categorical_features = ['Gender', 'Education']
# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        # handle_unknown='ignore' prevents errors if the test set contains a category not seen during fit
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
# Full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
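Because preprocessing lives inside the pipeline, the whole thing can also be cross‑validated as one unit, so the scaler and encoder are re‑fit inside every fold and never see that fold's validation data. A sketch on synthetic data (the column names match the guide; the data itself is made up):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'Age': rng.integers(18, 80, n),
    'Income': rng.normal(50_000, 15_000, n),
    'Gender': rng.choice(['M', 'F'], n),
    'Education': rng.choice(['HS', 'BSc', 'MSc'], n),
})
y = rng.integers(0, 2, n)  # synthetic binary target

pipe = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['Age', 'Income']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender', 'Education']),
    ])),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# 5-fold CV: each fold fits the scaler/encoder on its own training split only
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```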
# Train model
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Evaluate
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))
8. Save and load model
Use joblib (or pickle) to persist the trained pipeline.
import joblib
# Save model
joblib.dump(pipeline, 'model.pkl')
# Load model
loaded_pipeline = joblib.load('model.pkl')
# Predict with loaded model
y_pred_loaded = loaded_pipeline.predict(X_test)
print('Loaded Model Accuracy:', accuracy_score(y_test, y_pred_loaded))
Conclusion
By following these steps, you can use Pandas for data cleaning and feature engineering, and combine it with Scikit‑learn to build a complete data‑modeling workflow; Pandas handles preprocessing while Scikit‑learn manages model training and evaluation.
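As a closing sanity check, the whole flow (prepare → split → pipeline → persist → reload) fits in one short, runnable sketch. All data here is synthetic, with the target derived from Age so the model has signal to learn, and the file name is arbitrary:

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'Age': rng.integers(18, 80, n).astype(float),
    'Income': rng.normal(50_000, 15_000, n),
    'Gender': rng.choice(['M', 'F'], n),
})
df['Target'] = (df['Age'] > 45).astype(int)  # target depends on Age

X = df.drop(columns=['Target'])
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['Age', 'Income']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender']),
    ])),
    ('classifier', RandomForestClassifier(random_state=42)),
])
pipe.fit(X_train, y_train)

joblib.dump(pipe, 'model.pkl')     # persist the fitted pipeline
loaded = joblib.load('model.pkl')  # reload it

# The reloaded pipeline must reproduce the original predictions exactly
assert (loaded.predict(X_test) == pipe.predict(X_test)).all()
print(accuracy_score(y_test, loaded.predict(X_test)))
```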