Data Preprocessing and Modeling with Pandas and Scikit‑learn
This guide walks through using Pandas for data cleaning, feature engineering, and preparation, then demonstrates building, evaluating, and persisting a machine‑learning model with Scikit‑learn's pipeline and RandomForestClassifier in Python.
In the Pandas context, "data modeling" typically means using Pandas for preprocessing and preparation so that the data can be fed into machine‑learning models: Pandas does not train models itself, but it provides powerful tools for cleaning, transforming, and shaping data.
1. Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
2. Read data
Assume a CSV file data.csv containing features and a target variable.
# Read data
df = pd.read_csv('data.csv')
print(df.head())
3. Data exploration
Inspect basic information, descriptive statistics, and missing values.
# View basic info
print(df.info())
# Descriptive statistics
print(df.describe())
# Missing values count
print(df.isnull().sum())
4. Data cleaning
Handle missing values and outliers.
# Drop rows with missing values
df = df.dropna()  # remove rows containing NaNs
# Or, instead of dropping, fill missing numeric values with the column mean
df = df.fillna(df.mean(numeric_only=True))  # numeric_only avoids errors on text columns
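A column mean only makes sense for numeric data; for text columns a common pattern is to fill with the mode instead. A minimal sketch on a hypothetical toy frame (the column names here are illustrative, not from the dataset above):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both a numeric and a categorical column
toy = pd.DataFrame({
    'Age': [25.0, np.nan, 40.0, 35.0],
    'Gender': ['M', 'F', None, 'F'],
})

# Numeric column: fill with the column mean
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())
# Categorical column: fill with the most frequent value
toy['Gender'] = toy['Gender'].fillna(toy['Gender'].mode()[0])

print(toy.isnull().sum().sum())  # 0 -- no missing values remain
```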
# Remove outliers, e.g., age > 100
df = df[df['Age'] <= 100]
5. Feature engineering
Create new features or transform existing ones, such as age groups and date components.
# Create age group feature
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 35, 50, 65, 100], labels=['Child', 'Young', 'Adult', 'Middle Age', 'Senior'])
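To see how pd.cut assigns labels, here is a small standalone check: bins are right‑closed by default, so 18 falls in the (0, 18] "Child" bin while 19 falls in (18, 35] "Young":

```python
import pandas as pd

# pd.cut maps each value to the bin it falls in (right-closed intervals)
ages = pd.Series([10, 18, 19, 45, 70])
groups = pd.cut(ages, bins=[0, 18, 35, 50, 65, 100],
                labels=['Child', 'Young', 'Adult', 'Middle Age', 'Senior'])
print(list(groups))  # ['Child', 'Child', 'Young', 'Adult', 'Senior']
```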
# Convert date column to datetime and extract components
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
6. Split dataset
# Define features and target
df_X = df.drop(columns=['Target'])
y = df['Target']
# Train‑test split
X_train, X_test, y_train, y_test = train_test_split(df_X, y, test_size=0.2, random_state=42)
7. Feature scaling and encoding
Standardize numeric features and one‑hot encode categorical features, then build a pipeline with a RandomForest classifier.
# Define feature groups
numeric_features = ['Age', 'Income']
categorical_features = ['Gender', 'Education']
# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        # handle_unknown='ignore' prevents errors if the test set contains a category not seen during fit
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
# Full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
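Because preprocessing lives inside the pipeline, the whole thing can also be cross‑validated as one unit, so the scaler and encoder are re‑fit inside every fold and never see that fold's validation data. A sketch on synthetic data (the column names match the guide; the data itself is made up):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'Age': rng.integers(18, 80, n),
    'Income': rng.normal(50_000, 15_000, n),
    'Gender': rng.choice(['M', 'F'], n),
    'Education': rng.choice(['HS', 'BSc', 'MSc'], n),
})
y = rng.integers(0, 2, n)  # synthetic binary target

pipe = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['Age', 'Income']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender', 'Education']),
    ])),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# 5-fold CV: each fold fits the scaler/encoder on its own training split only
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```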
# Train model
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Evaluate
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))
8. Save and load model
Use joblib (or pickle) to persist the trained pipeline.
import joblib
# Save model
joblib.dump(pipeline, 'model.pkl')
# Load model
loaded_pipeline = joblib.load('model.pkl')
# Predict with loaded model
y_pred_loaded = loaded_pipeline.predict(X_test)
print('Loaded Model Accuracy:', accuracy_score(y_test, y_pred_loaded))
Conclusion
By following these steps, you can use Pandas for data cleaning and feature engineering, and combine it with Scikit‑learn to build a complete data‑modeling workflow; Pandas handles preprocessing while Scikit‑learn manages model training and evaluation.
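As a closing sanity check, the whole flow (prepare → split → pipeline → persist → reload) fits in one short, runnable sketch. All data here is synthetic, with the target derived from Age so the model has signal to learn, and the file name is arbitrary:

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'Age': rng.integers(18, 80, n).astype(float),
    'Income': rng.normal(50_000, 15_000, n),
    'Gender': rng.choice(['M', 'F'], n),
})
df['Target'] = (df['Age'] > 45).astype(int)  # target depends on Age

X = df.drop(columns=['Target'])
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['Age', 'Income']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender']),
    ])),
    ('classifier', RandomForestClassifier(random_state=42)),
])
pipe.fit(X_train, y_train)

joblib.dump(pipe, 'model.pkl')     # persist the fitted pipeline
loaded = joblib.load('model.pkl')  # reload it

# The reloaded pipeline must reproduce the original predictions exactly
assert (loaded.predict(X_test) == pipe.predict(X_test)).all()
print(accuracy_score(y_test, loaded.predict(X_test)))
```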