Data Preprocessing: Standardization, Normalization, and Missing Value Imputation with Python
This tutorial shows how to carry out three essential data preprocessing steps in Python with pandas and scikit-learn: standardization, min-max normalization, and missing-value imputation. Each step comes with a code example and a short explanation to help you prepare datasets for machine-learning models.
Goal: Learn data preprocessing techniques.
Learning Content: Standardization, min-max normalization, and missing-value filling.
Code Example:
1. Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

2. Create an example dataset
# Column names: 姓名 = name, 年龄 = age, 收入 = income, 身高 = height
data = {
    '姓名': ['张三', '李四', '王五', '赵六', '孙七'],
    '年龄': [25, 30, 22, 35, 28],
    '收入': [5000, 7000, 6000, 8000, 6500],
    '身高': [170, 175, 165, 180, 172],
}
df = pd.DataFrame(data)
print(f"Example dataset:\n{df}")

3. Standardization using StandardScaler
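As background for this step: StandardScaler rescales each column to z = (x - mean) / std, using the population standard deviation (ddof=0). The sketch below recomputes that by hand for the age values from the example dataset and compares the result with scikit-learn. The variable names are illustrative only and are not part of the tutorial's code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age values copied from the example dataset.
ages = np.array([25, 30, 22, 35, 28], dtype=float)

# Manual z-score. np.std defaults to ddof=0 (population standard deviation),
# which matches what StandardScaler uses internally.
manual_z = (ages - ages.mean()) / ages.std()

# StandardScaler expects a 2-D array of shape (n_samples, n_features).
sklearn_z = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

print(np.allclose(manual_z, sklearn_z))  # True if both computations agree
```

Note that pandas' `Series.std()` defaults to ddof=1, so a z-score computed with it will differ slightly from StandardScaler's output.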
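# Rescale each numeric column to zero mean and unit variance.
# The result is stored in a copy so that df itself is left unchanged.
scaler = StandardScaler()
df_standardized = df.copy()
df_standardized[['年龄', '收入', '身高']] = scaler.fit_transform(df[['年龄', '收入', '身高']])
print(f"Standardized dataset:\n{df_standardized}")

4. Normalization using MinMaxScaler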
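As background for this step: with its default settings, MinMaxScaler maps each column to the range [0, 1] through x' = (x - min) / (max - min). The following sketch applies the formula by hand to the income values from the example dataset and checks it against scikit-learn. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Income values copied from the example dataset.
incomes = np.array([5000, 7000, 6000, 8000, 6500], dtype=float)

# Manual min-max scaling: the smallest value maps to 0, the largest to 1.
manual_scaled = (incomes - incomes.min()) / (incomes.max() - incomes.min())

# MinMaxScaler expects a 2-D array of shape (n_samples, n_features).
sklearn_scaled = MinMaxScaler().fit_transform(incomes.reshape(-1, 1)).ravel()

# Values are approximately 0, 0.667, 0.333, 1, 0.5
print(np.round(manual_scaled, 3))
print(np.allclose(manual_scaled, sklearn_scaled))
```

Because the formula depends only on the column's minimum and maximum, a single extreme value compresses all other values into a narrow band.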
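# Rescale each numeric column to the range [0, 1].
# The result is stored in a copy so that df itself is left unchanged.
scaler = MinMaxScaler()
df_normalized = df.copy()
df_normalized[['年龄', '收入', '身高']] = scaler.fit_transform(df[['年龄', '收入', '身高']])
print(f"Normalized dataset:\n{df_normalized}")

5. Check missing values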
# Count the missing (NaN) entries in every column.
missing_values = df.isnull().sum()
print(f"Missing values per column:\n{missing_values}")

6. Generate dataset with missing values
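np.random.seed(0)
# Pick two distinct row positions (replace=False avoids drawing the same row twice).
missing_idx = np.random.choice(len(df), size=2, replace=False)
# NaN requires a float column, so cast before assigning.
df['收入'] = df['收入'].astype(float)
# Use .loc instead of chained indexing (df['收入'][...] = ...), which may
# modify a temporary copy and leave df unchanged.
df.loc[df.index[missing_idx], '收入'] = np.nan
print(f"Dataset with missing values:\n{df}")

7. Fill missing values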
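As background for this step: the four fill strategies behave differently at the edges of a column. The sketch below uses a small hypothetical Series (not the tutorial's dataset) with gaps in the first, middle, and last positions to make the differences visible.

```python
import numpy as np
import pandas as pd

# Hypothetical income column with gaps at the start, middle, and end.
s = pd.Series([np.nan, 7000.0, np.nan, 8000.0, np.nan])

mean_filled = s.fillna(s.mean())      # mean of 7000 and 8000 is 7500
median_filled = s.fillna(s.median())  # median is also 7500 here
forward = s.ffill()                   # first row has no predecessor, stays NaN
backward = s.bfill()                  # last row has no successor, stays NaN

print(pd.DataFrame({
    "original": s,
    "mean": mean_filled,
    "median": median_filled,
    "ffill": forward,
    "bfill": backward,
}))
```

Mean and median imputation always fill every gap as long as the column has at least one valid value, while forward and backward fill can leave gaps at the boundaries and are mainly meaningful when row order carries information.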
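# Every strategy below starts from the same column with missing values and
# returns a new Series, so the four results can be compared side by side.
# (Filling df in place four times in a row would leave nothing for the
# later strategies to fill.)
income = df['收入']

# Mean imputation
income_mean = income.fillna(income.mean())
print(f"Income after mean imputation:\n{income_mean}")

# Median imputation
income_median = income.fillna(income.median())
print(f"Income after median imputation:\n{income_median}")

# Forward fill: copy the previous valid value (a missing first row stays NaN)
income_ffill = income.ffill()
print(f"Income after forward fill:\n{income_ffill}")

# Backward fill: copy the next valid value (a missing last row stays NaN)
income_bfill = income.bfill()
print(f"Income after backward fill:\n{income_bfill}")

# To keep one of the results, assign it back, e.g. df['收入'] = income_mean

Practice: Apply the above steps to a dataset to perform standardization, normalization, and missing-value imputation.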
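One possible way to approach the practice task is sketched below. The dataset, its column names, and the `preprocess` helper are hypothetical and chosen only for illustration. The sketch fills missing values with the column median first and then produces a standardized and a normalized copy, using the same pandas and scikit-learn calls as the steps above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical practice data (not from the tutorial); NaN marks missing entries.
practice = pd.DataFrame({
    "temperature": [21.5, np.nan, 19.0, 23.5, 22.0],
    "humidity": [40.0, 55.0, np.nan, 60.0, 45.0],
})


def preprocess(frame, columns):
    """Median-impute the given columns, then return (standardized, normalized) copies."""
    filled = frame.copy()
    # fillna with a Series of medians fills each column with its own median.
    filled[columns] = filled[columns].fillna(filled[columns].median())

    standardized = filled.copy()
    standardized[columns] = StandardScaler().fit_transform(filled[columns])

    normalized = filled.copy()
    normalized[columns] = MinMaxScaler().fit_transform(filled[columns])
    return standardized, normalized


cols = ["temperature", "humidity"]
standardized, normalized = preprocess(practice, cols)
print(f"Standardized:\n{standardized}")
print(f"Normalized:\n{normalized}")
```

Imputing before scaling is a design choice in this sketch: the scalers then compute their statistics on complete columns, and no NaN values are carried into the scaled output.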
Summary: After completing the exercises, you should be able to bring features onto a common scale with standardization or min-max normalization and to handle missing entries with mean, median, forward-fill, or backward-fill imputation. These steps are a common prerequisite for training machine-learning models.
Test Development Learning Exchange