
Comprehensive Python Tutorial for Data Preprocessing, Feature Engineering, Model Training, Evaluation, and Deployment

This tutorial consolidates the first ten days of learning: data preprocessing, feature engineering, model training with linear regression, decision trees, and random forests, model evaluation using cross‑validation, and finally saving and loading the best model, all illustrated with complete Python code examples.

Test Development Learning Exchange

Goal

Consolidate the learning from the first 10 days, covering data preprocessing, feature engineering, model training, and evaluation.

Learning Content

Data preprocessing, feature engineering, model training, model evaluation.

Code Example

1. Import Required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression

2. Data Collection

# Load the example dataset (Boston housing).
# Note: load_boston was removed from scikit-learn in version 1.2; a common
# workaround is to load the raw data directly from the original source.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
df = pd.DataFrame(data, columns=feature_names)
df['PRICE'] = target
print(f"Example dataset:\n{df.head()}")

3. Data Preprocessing

Check missing values

# Count missing values per column
missing_values = df.isnull().sum()
print(f"Missing values per column:\n{missing_values}")

Handle missing values

# Fill missing values with column mean
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(f"Dataset after handling missing values:\n{df_imputed.head()}")

Check outliers

# Boxplot to visualize outliers
sns.boxplot(data=df_imputed)
plt.xticks(rotation=90)
plt.show()

Handle outliers

# Remove outliers using IQR method
Q1 = df_imputed.quantile(0.25)
Q3 = df_imputed.quantile(0.75)
IQR = Q3 - Q1
df_cleaned = df_imputed[~((df_imputed < (Q1 - 1.5 * IQR)) | (df_imputed > (Q3 + 1.5 * IQR))).any(axis=1)]
print(f"Dataset after removing outliers:\n{df_cleaned.head()}")

4. Feature Engineering

Standardize features

# Standardize features
scaler = StandardScaler()
X = df_cleaned.drop('PRICE', axis=1)
X_scaled = scaler.fit_transform(X)
df_scaled = pd.DataFrame(X_scaled, columns=X.columns)
df_scaled['PRICE'] = df_cleaned['PRICE'].values  # .values avoids index-alignment NaNs after row filtering
print(f"Dataset after standardization:\n{df_scaled.head()}")

Create new features

# Create new feature as product of RM and LSTAT
df_scaled['RM_LSTAT'] = df_scaled['RM'] * df_scaled['LSTAT']
print(f"Dataset after creating new feature:\n{df_scaled.head()}")

Feature selection

# Select the top 5 features using SelectKBest (including the new RM_LSTAT feature)
features = df_scaled.drop('PRICE', axis=1)
selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(features, df_scaled['PRICE'])
selected_features = features.columns[selector.get_support()].tolist()
print(f"Selected features: {selected_features}")
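Beyond the selected subset, the fitted selector also exposes per-feature F-scores, which help explain why features were kept or dropped. A minimal, self-contained sketch on synthetic data (the feature names f0–f7 are illustrative, not from the Boston set):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: 8 features, only 3 carry signal
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=0.5, random_state=42)
names = [f"f{i}" for i in range(X.shape[1])]

selector = SelectKBest(score_func=f_regression, k=3)
selector.fit(X, y)

# Rank all features by their univariate F-score (higher = stronger)
ranking = sorted(zip(names, selector.scores_), key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: F = {score:.1f}")

selected = [n for n, keep in zip(names, selector.get_support()) if keep]
print("Selected:", selected)
```

The F-scores are univariate, so a feature that only helps in combination with others can still score low; tree-based importances are a common complement.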

5. Model Training

Split dataset

# Split data into training and test sets
X = df_scaled[selected_features]
y = df_scaled['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training features:\n{X_train.head()}")
print(f"Test features:\n{X_test.head()}")
print(f"Training labels:\n{y_train.head()}")
print(f"Test labels:\n{y_test.head()}")

Train Linear Regression model

# Train Linear Regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
# Predict
y_pred_linear = linear_reg.predict(X_test)
# Evaluate
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)
print(f"Linear Regression MSE: {mse_linear:.2f}")
print(f"Linear Regression R^2: {r2_linear:.2f}")

Train Decision Tree model

# Train Decision Tree
decision_tree = DecisionTreeRegressor(random_state=42)
decision_tree.fit(X_train, y_train)
# Predict
y_pred_tree = decision_tree.predict(X_test)
# Evaluate
mse_tree = mean_squared_error(y_test, y_pred_tree)
r2_tree = r2_score(y_test, y_pred_tree)
print(f"Decision Tree MSE: {mse_tree:.2f}")
print(f"Decision Tree R^2: {r2_tree:.2f}")

Train Random Forest model

# Train Random Forest
random_forest = RandomForestRegressor(random_state=42)
random_forest.fit(X_train, y_train)
# Predict
y_pred_forest = random_forest.predict(X_test)
# Evaluate
mse_forest = mean_squared_error(y_test, y_pred_forest)
r2_forest = r2_score(y_test, y_pred_forest)
print(f"Random Forest MSE: {mse_forest:.2f}")
print(f"Random Forest R^2: {r2_forest:.2f}")

6. Model Evaluation

Use K‑fold cross‑validation

# Cross‑validation for each model
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42)
}
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    mse_scores = -cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(model, X, y, cv=kf, scoring='r2')
    print(f"{name} CV MSE scores: {mse_scores}")
    print(f"{name} CV average MSE: {mse_scores.mean():.2f}")
    print(f"{name} CV R2 scores: {r2_scores}")
    print(f"{name} CV average R2: {r2_scores.mean():.2f}\n")
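Rather than reading the printouts by eye, the best model can be chosen programmatically from the cross-validation results. A minimal sketch, shown on a synthetic dataset so it runs standalone (the model set mirrors the one above):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
}
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Mean CV MSE per model (negated back from sklearn's neg_mean_squared_error)
mean_mse = {
    name: -cross_val_score(model, X, y, cv=kf,
                           scoring='neg_mean_squared_error').mean()
    for name, model in models.items()
}

best_name = min(mean_mse, key=mean_mse.get)
print(f"Best model by CV MSE: {best_name} ({mean_mse[best_name]:.2f})")
```

The winner then replaces the "assume Random Forest is best" step in the deployment section below.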

7. Model Deployment

Save and load the best model

import joblib
# Assume Random Forest is the best model
best_model = random_forest
joblib.dump(best_model, 'best_model.pkl')
# Load model
loaded_model = joblib.load('best_model.pkl')
# Predict with loaded model
y_pred_loaded = loaded_model.predict(X_test)
print(f"Loaded model predictions:\n{y_pred_loaded[:10]}")
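One point worth noting for deployment: the fitted scaler (and the selected feature list) must be persisted alongside the model, or inference will see unscaled inputs. A minimal sketch of bundling both into one artifact; the file name pipeline.pkl and the synthetic data are illustrative:

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=42)

# Fit preprocessing and model exactly as at training time
scaler = StandardScaler().fit(X)
model = RandomForestRegressor(random_state=42).fit(scaler.transform(X), y)

# Bundle the scaler and the model so they travel together
joblib.dump({'scaler': scaler, 'model': model}, 'pipeline.pkl')

# At inference time: load the bundle and apply the same steps in order
bundle = joblib.load('pipeline.pkl')
preds = bundle['model'].predict(bundle['scaler'].transform(X[:5]))
print(preds)
```

scikit-learn's Pipeline class achieves the same thing more idiomatically by chaining the scaler and estimator into a single object that can be dumped as one.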

Summary

By completing this exercise you will have reinforced the first ten days of study, covering data preprocessing, feature engineering, model training, and evaluation, with each step accompanied by commented code that you can adapt to real projects.

If you have any questions or need further assistance, feel free to let me know!

Tags: machine learning, Python, feature engineering, model training, data preprocessing
Written by Test Development Learning Exchange