Data Preprocessing: Standardization, Normalization, and Missing Value Imputation with Python

This tutorial shows how to apply three essential data preprocessing techniques in Python with pandas and scikit-learn: standardization, min-max normalization, and missing-value imputation. Each step comes with a code example to help you prepare datasets for machine-learning models.

Test Development Learning Exchange

Nov 21, 2024

Goal: Learn data preprocessing techniques.

Learning Content: Standardization, min-max normalization, and missing-value filling.

Code Example:

1. Import required libraries

import pandas as pd

import numpy as np

from sklearn.preprocessing import StandardScaler, MinMaxScaler

2. Create an example dataset

# Column keys: 姓名 = name, 年龄 = age, 收入 = income, 身高 = height
data = {
    '姓名': ['张三', '李四', '王五', '赵六', '孙七'],
    '年龄': [25, 30, 22, 35, 28],
    '收入': [5000, 7000, 6000, 8000, 6500],
    '身高': [170, 175, 165, 180, 172],
}

df = pd.DataFrame(data)
print(f"Example dataset:\n{df}")

3. Standardization using StandardScaler

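# Rescale each numeric column to zero mean and unit variance.
# The scaled values overwrite the original columns in df.
scaler = StandardScaler()
df[['年龄', '收入', '身高']] = scaler.fit_transform(df[['年龄', '收入', '身高']])
print(f"Dataset after standardization:\n{df}")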
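To make the transformation concrete, the sketch below recomputes standardization by hand as z = (x - mean) / std and compares it with StandardScaler. It rebuilds the numeric columns of the example dataset in a separate frame (the names `raw`, `manual`, and `scaled` are illustrative), and it assumes the population standard deviation (`ddof=0`), which is what StandardScaler uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Numeric columns of the example dataset: 年龄 = age, 收入 = income, 身高 = height
raw = pd.DataFrame({
    '年龄': [25, 30, 22, 35, 28],
    '收入': [5000, 7000, 6000, 8000, 6500],
    '身高': [170, 175, 165, 180, 172],
})

# z = (x - mean) / std, with the population standard deviation (ddof=0)
manual = (raw - raw.mean()) / raw.std(ddof=0)

# StandardScaler computes the same quantity column by column
scaled = StandardScaler().fit_transform(raw)

# True when the manual formula and the scaler agree
print(np.allclose(manual.to_numpy(), scaled))
```

Note that pandas defaults to the sample standard deviation (`ddof=1`), so calling `raw.std()` without the argument would give slightly different values.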

4. Normalization using MinMaxScaler

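# Rescale each numeric column to the range [0, 1].
# Min-max scaling gives the same result whether the input is the raw data
# or a positively rescaled version of it (such as standardized values).
scaler = MinMaxScaler()
df[['年龄', '收入', '身高']] = scaler.fit_transform(df[['年龄', '收入', '身高']])
print(f"Dataset after normalization:\n{df}")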
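As a cross-check, min-max normalization can also be written directly as x' = (x - min) / (max - min). The sketch below applies that formula to a separate copy of the example columns (the names `raw`, `manual`, and `restored` are illustrative) and also shows `inverse_transform`, which maps scaled values back to the original units.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Numeric columns of the example dataset: 年龄 = age, 收入 = income, 身高 = height
raw = pd.DataFrame({
    '年龄': [25, 30, 22, 35, 28],
    '收入': [5000, 7000, 6000, 8000, 6500],
    '身高': [170, 175, 165, 180, 172],
})

# x' = (x - min) / (max - min), computed per column
manual = (raw - raw.min()) / (raw.max() - raw.min())

scaler = MinMaxScaler()
scaled = scaler.fit_transform(raw)

# True when the manual formula and the scaler agree
print(np.allclose(manual.to_numpy(), scaled))

# inverse_transform undoes the scaling and recovers the original values
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, raw.to_numpy()))
```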

5. Check missing values

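# Count NaN entries per column (the example dataset has none at this point,
# so every count is 0)
missing_values = df.isnull().sum()
print(f"Missing values per column:\n{missing_values}")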

6. Generate a dataset with missing values

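np.random.seed(0)

# Pick two distinct rows and set their income to NaN.
# df.loc writes to df directly and avoids chained assignment.
missing_rows = np.random.choice(df.index, size=2, replace=False)
df.loc[missing_rows, '收入'] = np.nan
print(f"Dataset with missing values:\n{df}")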
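Once missing values are present, it can help to re-run the check and look at more than the raw counts. The sketch below uses a small hypothetical frame (`sample`, not the tutorial's `df`) to show the count per column, the fraction per column, and the rows that contain at least one gap.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing income values (年龄 = age, 收入 = income)
sample = pd.DataFrame({
    '年龄': [25, 30, 22, 35, 28],
    '收入': [5000, np.nan, 6000, np.nan, 6500],
})

missing_count = sample.isnull().sum()    # number of NaNs per column
missing_ratio = sample.isnull().mean()   # fraction of NaNs per column
rows_with_gaps = sample[sample.isnull().any(axis=1)]  # rows with any NaN

print(missing_count)
print(missing_ratio)
print(rows_with_gaps)
```

The ratio is often the more useful number when deciding between imputing a column and dropping it.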

7. Fill missing values

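# Each strategy starts from the same column with missing values and writes to
# its own copy; filling df in place would leave nothing for later strategies.
income = df['收入']

# Mean imputation
df_mean = df.copy()
df_mean['收入'] = income.fillna(income.mean())
print(f"Dataset after mean imputation:\n{df_mean}")

# Median imputation
df_median = df.copy()
df_median['收入'] = income.fillna(income.median())
print(f"Dataset after median imputation:\n{df_median}")

# Forward fill: copy the previous valid value (a missing first row stays NaN)
df_ffill = df.copy()
df_ffill['收入'] = income.ffill()
print(f"Dataset after forward fill:\n{df_ffill}")

# Backward fill: copy the next valid value (a missing last row stays NaN)
df_bfill = df.copy()
df_bfill['收入'] = income.bfill()
print(f"Dataset after backward fill:\n{df_bfill}")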
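The fills above use pandas only. scikit-learn offers a comparable tool, `SimpleImputer`, which may be convenient when the imputation has to be fitted on one dataset and reused on another. The following is a minimal sketch on a hypothetical frame (`sample`), not part of the original example; the median strategy is chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with two missing income values (年龄 = age, 收入 = income)
sample = pd.DataFrame({
    '年龄': [25, 30, 22, 35, 28],
    '收入': [5000, np.nan, 6000, np.nan, 6500],
})

# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='median')

# fit_transform returns a NumPy array, so rebuild the DataFrame around it
filled = pd.DataFrame(
    imputer.fit_transform(sample),
    columns=sample.columns,
    index=sample.index,
)

print(imputer.statistics_)  # the per-column fill values learned during fit
print(filled)
```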

Practice: Apply the steps above to a dataset of your own: standardize it, normalize it, and impute its missing values.
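One possible way to approach the exercise is to chain the steps so they always run in the same order. The sketch below is an assumption about how you might structure it, not the only option: it uses scikit-learn's `make_pipeline` on a hypothetical practice frame, imputing with the column mean first and standardizing afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical practice data with one missing income value
# (年龄 = age, 收入 = income, 身高 = height)
practice = pd.DataFrame({
    '年龄': [25, 30, 22, 35, 28],
    '收入': [5000, 7000, np.nan, 8000, 6500],
    '身高': [170, 175, 165, 180, 172],
})

# Impute first so the scaled output contains no NaN, then standardize
preprocess = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
prepared = preprocess.fit_transform(practice)

print(prepared.shape)
print(np.isnan(prepared).any())
```

Swapping `StandardScaler()` for `MinMaxScaler()` gives the normalization variant of the same exercise.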

Summary: After completing the exercise, you should be able to scale features to a common range and handle missing entries with several strategies. Both steps are essential preparation for machine-learning models and often improve model performance.
