
Data Preprocessing: Standardization, Normalization, and Missing Value Imputation with Python

This tutorial shows how to standardize features, apply min-max normalization, and fill missing values with pandas and scikit-learn in Python. Each step comes with a short code example so you can prepare a dataset for a machine-learning model.

Test Development Learning Exchange

Goal: Learn data preprocessing techniques.

Learning Content: Standardization, min-max normalization, and missing-value filling.

Code Example:

1. Import required libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

2. Create an example dataset

# Column keys: 姓名 = name, 年龄 = age, 收入 = income, 身高 = height
data = {
    '姓名': ['张三', '李四', '王五', '赵六', '孙七'],
    '年龄': [25, 30, 22, 35, 28],
    '收入': [5000, 7000, 6000, 8000, 6500],
    '身高': [170, 175, 165, 180, 172]
}
df = pd.DataFrame(data)
print(f"Example dataset:\n{df}")

3. Standardization using StandardScaler

# Rescale the numeric columns (age, income, height) to zero mean and unit variance
scaler = StandardScaler()
df[['年龄', '收入', '身高']] = scaler.fit_transform(df[['年龄', '收入', '身高']])
print(f"Standardized dataset:\n{df}")
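To make the standardization step concrete, the sketch below recomputes the z-score by hand and compares it with StandardScaler. The frame `df_demo` and its English column names are illustrative stand-ins for the example dataset, not part of the original code. Note that StandardScaler divides by the population standard deviation (`ddof=0`), whereas pandas `std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): same numbers as the age and income columns
df_demo = pd.DataFrame({
    "age": [25, 30, 22, 35, 28],
    "income": [5000, 7000, 6000, 8000, 6500],
})

# Manual z-score: subtract the column mean, divide by the population std
manual = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

# StandardScaler result for comparison
scaled = StandardScaler().fit_transform(df_demo)

# True when both approaches agree within floating-point tolerance
print(np.allclose(manual.to_numpy(), scaled))
```

After scaling, each column has a mean of about 0 and a standard deviation of about 1, which can be checked with `scaled.mean(axis=0)` and `scaled.std(axis=0)`.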

4. Normalization using MinMaxScaler

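# Rescale each numeric column to the [0, 1] range.
# df already holds the standardized values from step 3; min-max scaling gives
# the same result as on the raw values, because standardization is linear.
scaler = MinMaxScaler()
df[['年龄', '收入', '身高']] = scaler.fit_transform(df[['年龄', '收入', '身高']])
print(f"Normalized dataset:\n{df}")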
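As a rough check on what MinMaxScaler computes, the following sketch applies the formula (x - min) / (max - min) by hand. It also confirms that scaling raw values and scaling already-standardized values give the same result. The frame `raw` and its column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative data (assumption): same numbers as the age and height columns
raw = pd.DataFrame({
    "age": [25, 30, 22, 35, 28],
    "height": [170, 175, 165, 180, 172],
})

# Manual min-max scaling, column by column
manual = (raw - raw.min()) / (raw.max() - raw.min())

# MinMaxScaler applied to the raw values
from_raw = MinMaxScaler().fit_transform(raw)

# MinMaxScaler applied after standardization, as in the tutorial's flow
from_standardized = MinMaxScaler().fit_transform(
    StandardScaler().fit_transform(raw)
)

print(np.allclose(manual.to_numpy(), from_raw))   # manual formula matches
print(np.allclose(from_raw, from_standardized))   # order of steps does not matter here
```

In practice you would normally choose one of the two scalers for a given feature rather than chaining them.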

5. Check missing values

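# Count missing values per column; nothing has been removed yet, so every count is 0 here
missing_values = df.isnull().sum()
print(f"Missing values per column:\n{missing_values}")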
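Beyond raw counts, it can help to see the share of missing values per column and the affected rows. The sketch below assumes a small hypothetical frame, `df_demo`, that already contains gaps. It is one possible way to inspect them, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps (assumption, not from the example dataset)
df_demo = pd.DataFrame({
    "age": [25, np.nan, 22, 35, 28],
    "income": [5000, 7000, np.nan, np.nan, 6500],
})

# Number of missing entries per column
missing_count = df_demo.isna().sum()

# Fraction of missing entries per column (mean of a boolean mask)
missing_share = df_demo.isna().mean()

# Rows that contain at least one missing value
rows_with_gaps = df_demo[df_demo.isna().any(axis=1)]

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
print(rows_with_gaps)
```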

6. Generate dataset with missing values

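np.random.seed(0)
# Blank out income in two distinct rows; .loc avoids chained assignment,
# which may write to a temporary copy instead of df.
missing_rows = np.random.choice(df.index, size=2, replace=False)
df.loc[missing_rows, '收入'] = np.nan
print(f"Dataset with missing values:\n{df}")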

7. Fill missing values

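# Each strategy starts from the same column with its gaps intact; filling df
# in place after the first strategy would leave nothing for the others to fill.
income = df['收入']
# Mean imputation
mean_filled = income.fillna(income.mean())
print(f"Income after mean imputation:\n{mean_filled}")
# Median imputation
median_filled = income.fillna(income.median())
print(f"Income after median imputation:\n{median_filled}")
# Forward fill: copy the previous valid value (a leading NaN stays NaN);
# ffill() replaces the deprecated fillna(method='ffill')
forward_filled = income.ffill()
print(f"Income after forward fill:\n{forward_filled}")
# Backward fill: copy the next valid value (a trailing NaN stays NaN)
backward_filled = income.bfill()
print(f"Income after backward fill:\n{backward_filled}")
# Keep one result in the DataFrame, here the mean-imputed column
df['收入'] = mean_filled
print(f"Dataset after mean imputation:\n{df}")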
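Since the tutorial also relies on scikit-learn, mean and median imputation can alternatively be sketched with `SimpleImputer`, which fills several columns at once and remembers the fill values it learned. The frame `df_demo` below is hypothetical and only serves to show the call pattern.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with gaps in both columns (assumption)
df_demo = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0, 28.0],
    "income": [5000.0, np.nan, 6000.0, 8000.0, np.nan],
})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
mean_filled = pd.DataFrame(
    mean_imputer.fit_transform(df_demo), columns=df_demo.columns
)
median_filled = pd.DataFrame(
    median_imputer.fit_transform(df_demo), columns=df_demo.columns
)

# The per-column fill values learned during fit
print(mean_imputer.statistics_)
print(mean_filled)
print(median_filled)
```

A fitted imputer can later be applied to new data with `transform`, so the same fill values are reused instead of being recomputed.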

Practice: Apply the above steps to a dataset of your own to perform standardization, normalization, and missing-value imputation.
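One possible way to approach the practice task is to chain imputation and scaling in a scikit-learn `Pipeline`. The sketch below is only a starting point: the `practice` frame, its values, and the choice of median imputation followed by standardization are all assumptions, and your own dataset may call for different strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical practice data with a few gaps (assumption)
practice = pd.DataFrame({
    "age": [25, 30, np.nan, 35, 28, 41],
    "income": [5000, np.nan, 6000, 8000, 6500, 7200],
})

# Fill gaps first, then scale: the scaler then sees a complete matrix
preprocess = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

processed = pd.DataFrame(
    preprocess.fit_transform(practice), columns=practice.columns
)
print(processed.round(3))
```

Swapping `StandardScaler()` for `MinMaxScaler()` in the second step gives the normalization variant of the same exercise.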

Summary: After completing the exercise, you should be able to put numeric features on a comparable scale with standardization or min-max normalization, and to fill missing entries with mean, median, forward-fill, or backward-fill imputation. These steps matter for machine-learning models that are sensitive to feature scale or cannot accept missing values.
