
Basic Data Cleaning Techniques with Pandas

This tutorial teaches fundamental data cleaning with Pandas, covering how to handle missing values, rename columns, and remove duplicate rows through clear explanations and complete code examples.

Test Development Learning Exchange

Goal: Learn basic data cleaning techniques.

Learning Content:

Handle missing values
Rename columns
Delete duplicate rows

Code Example:

1. Import Pandas library

import pandas as pd

2. Create example dataset

# Create a DataFrame with one missing value and one duplicate row
# Columns: '姓名' (name), '年龄' (age), '城市' (city)
data = {
    '姓名': ['张三', '李四', '王五', '张三', '赵六', None],
    '年龄': [25, 30, 35, 25, 40, 28],
    '城市': ['北京', '上海', '广州', '北京', '深圳', '杭州']
}
df = pd.DataFrame(data)
print(f"Original dataset:\n{df}")

3. Handle missing values

Check missing values

# Count the missing values in each column
missing_values = df.isnull().sum()
print(f"Missing values per column:\n{missing_values}")
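Counts alone do not show how much of a column is affected or which rows are involved. The following is a minimal sketch, rebuilding the tutorial's dataset so it runs on its own; the variable names `missing_share` and `rows_with_missing` are illustrative, not part of the original tutorial.

```python
import pandas as pd

# Same data as the tutorial: '姓名' (name), '年龄' (age), '城市' (city)
df = pd.DataFrame({
    '姓名': ['张三', '李四', '王五', '张三', '赵六', None],
    '年龄': [25, 30, 35, 25, 40, 28],
    '城市': ['北京', '上海', '广州', '北京', '深圳', '杭州'],
})

# isna() is an alias of isnull(); the mean of a boolean column
# is the fraction of True values, i.e. the share of missing entries
missing_share = df.isna().mean()
print(f"Share of missing values per column:\n{missing_share}")

# Select the rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(f"Rows with missing values:\n{rows_with_missing}")
```

Looking at the affected rows before deleting or filling anything makes it easier to decide which of the strategies in this section fits.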

Delete rows containing missing values

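# Drop every row that has at least one missing value.
# dropna() returns a new DataFrame; df itself is not modified.
df_cleaned = df.dropna()
print(f"Dataset after dropping missing values:\n{df_cleaned}")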
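Dropping every row that has any gap can be more aggressive than needed. As a hedged sketch, `dropna` also accepts `subset` and `how` to narrow the rule; the variable names below are illustrative, and the data is the tutorial's dataset rebuilt so the block is self-contained.

```python
import pandas as pd

# Same data as the tutorial: '姓名' (name), '年龄' (age), '城市' (city)
df = pd.DataFrame({
    '姓名': ['张三', '李四', '王五', '张三', '赵六', None],
    '年龄': [25, 30, 35, 25, 40, 28],
    '城市': ['北京', '上海', '广州', '北京', '深圳', '杭州'],
})

# Only require the name column to be present; gaps elsewhere are tolerated
df_name_required = df.dropna(subset=['姓名'])

# Only drop rows in which every column is missing (none in this dataset)
df_all_missing_dropped = df.dropna(how='all')

print(len(df), len(df_name_required), len(df_all_missing_dropped))
```

With this dataset the first call removes the single row without a name, while the second removes nothing because no row is entirely empty.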

Fill missing values

# Fill every missing value in the DataFrame with one placeholder ('未知' means "unknown")
df_filled = df.fillna('未知')
print(f"Dataset after filling missing values:\n{df_filled}")

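Fill missing values with column-specific values

# Pass a dict to choose a different fill value for each column.
# In this dataset only '姓名' has a missing value, so the '年龄' entry changes nothing here.
df_filled_age = df.fillna({'姓名': '未知', '年龄': 0})
print(f"Dataset after filling with column-specific values:\n{df_filled_age}")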
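Filling a numeric column with 0 can distort statistics such as the average age. One common alternative, shown here only as a sketch, is to fill with the column's median. The data below is a hypothetical variant of the tutorial's dataset in which one age is missing, since the original has no gaps in '年龄'.

```python
import pandas as pd

# Hypothetical variant of the tutorial data: one name and one age are missing
df = pd.DataFrame({
    '姓名': ['张三', '李四', None],
    '年龄': [25, None, 35],
    '城市': ['北京', '上海', '广州'],
})

# median() skips missing values by default, so this is the median of 25 and 35
age_median = df['年龄'].median()

# Placeholder text for names, median for ages
df_filled = df.fillna({'姓名': '未知', '年龄': age_median})
print(df_filled)
```

Whether a median, a mean, or a sentinel value is appropriate depends on how the column is used later; the median is just one reasonable default for skewed numeric data.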

4. Rename columns

Rename a single column

# Rename a single column: '姓名' (name) becomes '名字'
df_renamed_single = df.rename(columns={'姓名': '名字'})
print(f"Dataset after renaming one column:\n{df_renamed_single}")

Rename multiple columns

# Rename several columns at once by adding more entries to the mapping
df_renamed_multiple = df.rename(columns={'姓名': '名字', '城市': '所在地'})
print(f"Dataset after renaming multiple columns:\n{df_renamed_multiple}")
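Two details of `rename` are easy to miss: it returns a new DataFrame rather than changing the original, and by default it silently ignores mapping keys that match no column. The sketch below illustrates both with a one-row frame; the misspelled key '名称' is a made-up typo for demonstration.

```python
import pandas as pd

df = pd.DataFrame({'姓名': ['张三'], '年龄': [25], '城市': ['北京']})

# rename() returns a new DataFrame; df keeps its original column names
renamed = df.rename(columns={'姓名': '名字'})
print(list(df.columns))
print(list(renamed.columns))

# A key that matches no column ('名称' is a deliberate typo) is ignored by default
unchanged = df.rename(columns={'名称': '名字'})

# errors='raise' turns the same mistake into a KeyError
try:
    df.rename(columns={'名称': '名字'}, errors='raise')
    typo_detected = False
except KeyError:
    typo_detected = True
print(f"Typo detected: {typo_detected}")
```

Passing `errors='raise'` is a cheap safeguard when column names come from files whose headers may change.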

5. Delete duplicate rows

Check duplicate rows

# duplicated() returns a boolean Series: True marks a row that repeats an earlier row
duplicates = df.duplicated()
print(f"Duplicate row flags:\n{duplicates}")
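By default `duplicated()` flags only the later copies, so the first occurrence stays unmarked. A small sketch, assuming the tutorial's dataset, shows how `keep=False` flags every copy, which helps when inspecting duplicates before deleting them. The variable names are illustrative.

```python
import pandas as pd

# Same data as the tutorial: '姓名' (name), '年龄' (age), '城市' (city)
df = pd.DataFrame({
    '姓名': ['张三', '李四', '王五', '张三', '赵六', None],
    '年龄': [25, 30, 35, 25, 40, 28],
    '城市': ['北京', '上海', '广州', '北京', '深圳', '杭州'],
})

# Default keep='first': only the second '张三' row is flagged
later_copies = df.duplicated()

# keep=False: both '张三' rows are flagged
all_copies = df.duplicated(keep=False)

print(f"Rows that repeat an earlier row: {later_copies.sum()}")
print(f"All rows involved in duplication:\n{df[all_copies]}")
```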

Delete duplicate rows

# Drop rows that are identical to an earlier row in every column (the first occurrence is kept)
df_no_duplicates = df.drop_duplicates()
print(f"Dataset after dropping duplicate rows:\n{df_no_duplicates}")

Delete duplicate rows based on specific columns

# Treat rows as duplicates when they share the same value in '姓名', regardless of other columns
df_no_duplicates_specified = df.drop_duplicates(subset=['姓名'])
print(f"Dataset after dropping duplicates by column:\n{df_no_duplicates_specified}")
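When duplicates are judged by a subset of columns, it matters which copy survives. The sketch below, again assuming the tutorial's dataset, uses the `keep` and `ignore_index` parameters of `drop_duplicates`; the variable names are illustrative.

```python
import pandas as pd

# Same data as the tutorial: '姓名' (name), '年龄' (age), '城市' (city)
df = pd.DataFrame({
    '姓名': ['张三', '李四', '王五', '张三', '赵六', None],
    '年龄': [25, 30, 35, 25, 40, 28],
    '城市': ['北京', '上海', '广州', '北京', '深圳', '杭州'],
})

# keep='last' retains the final occurrence of each name instead of the first
df_keep_last = df.drop_duplicates(subset=['姓名'], keep='last')

# ignore_index=True renumbers the result from 0 so there are no gaps in the index
df_renumbered = df.drop_duplicates(subset=['姓名'], keep='last', ignore_index=True)

print(df_keep_last)
print(df_renumbered)
```

Keeping the last occurrence is useful when later rows represent more recent records; for this dataset both copies are identical, so only the surviving index label differs.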

Practice: Clean a dataset containing missing values and duplicate rows.

# Import Pandas library
import pandas as pd
# Create a DataFrame with one missing value and one duplicate row
# Columns: '姓名' (name), '年龄' (age), '城市' (city)
data = {
    '姓名': ['张三', '李四', '王五', '张三', '赵六', None],
    '年龄': [25, 30, 35, 25, 40, 28],
    '城市': ['北京', '上海', '广州', '北京', '深圳', '杭州']
}
df = pd.DataFrame(data)
print(f"Original dataset:\n{df}")
# Count the missing values in each column
missing_values = df.isnull().sum()
print(f"Missing values per column:\n{missing_values}")
# Drop rows with missing values
df_cleaned = df.dropna()
print(f"Dataset after dropping missing values:\n{df_cleaned}")
# Rename columns
df_renamed = df_cleaned.rename(columns={'姓名': '名字', '城市': '所在地'})
print(f"Dataset after renaming columns:\n{df_renamed}")
# Flag duplicate rows (True marks a row that repeats an earlier row)
duplicates = df_renamed.duplicated()
print(f"Duplicate row flags:\n{duplicates}")
# Drop duplicate rows
df_final = df_renamed.drop_duplicates()
print(f"Final cleaned dataset:\n{df_final}")
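The same practice steps can also be written as one reusable function with method chaining. This is only a sketch of one possible arrangement: the function name `clean_people` is hypothetical, and the final `reset_index(drop=True)` is an addition not present in the practice code, included so the cleaned rows are numbered consecutively.

```python
import pandas as pd


def clean_people(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, rename columns, and remove exact duplicates."""
    return (
        df.dropna()                                           # remove rows with gaps
          .rename(columns={'姓名': '名字', '城市': '所在地'})   # rename name and city
          .drop_duplicates()                                  # remove repeated rows
          .reset_index(drop=True)                             # renumber rows from 0
    )


# Same data as the tutorial: '姓名' (name), '年龄' (age), '城市' (city)
raw = pd.DataFrame({
    '姓名': ['张三', '李四', '王五', '张三', '赵六', None],
    '年龄': [25, 30, 35, 25, 40, 28],
    '城市': ['北京', '上海', '广州', '北京', '深圳', '杭州'],
})

cleaned = clean_people(raw)
print(cleaned)
```

Because each step returns a new DataFrame, `raw` is left untouched and the function can be applied to other frames with the same columns.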

Summary: After today's practice, you should be able to use Pandas for basic data cleaning: handling missing values, renaming columns, and removing duplicate rows. Upcoming sessions will dive deeper into Python data processing techniques. Happy learning!
