Sales Data Analysis Project: Reading, Cleaning, Aggregating, and Visualizing with Python
This tutorial guides you through a comprehensive sales data project that covers reading a CSV file, cleaning missing and duplicate entries, grouping by department to compute average sales, and visualizing the results with line and bar charts using pandas and matplotlib.
Goal: Consolidate the first nine days of learning by reviewing every topic covered so far and applying it in one comprehensive project.
Learning Content: A review of all previous topics, followed by hands-on practice in the project below.
Comprehensive Project: Given a CSV file containing sales data, you will read the file, clean the data (handle missing values and duplicate rows), group by department to calculate average sales, and visualize the sales figures with line and bar charts.
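If you do not have a sales file at hand, a small practice file can be generated first. The sketch below is only an assumption about what the data might look like: it uses the two column names that appear in the code example ('部门' for department, '销售额' for sales amount), and the rows, department names, and values are made up for illustration. A real sales_data.csv will likely have more columns.

```python
import pandas as pd

# Hypothetical practice data: one duplicate row and two rows with missing values.
sample = pd.DataFrame({
    '部门': ['East', 'East', 'West', 'West', 'West', None, 'North'],
    '销售额': [10.0, 10.0, 20.0, None, 30.0, 15.0, 40.0],
})

# Write the file that the code example reads; missing values become empty cells.
sample.to_csv('sales_data.csv', index=False, encoding='utf-8')

# Read it back to confirm the file round-trips as expected.
reloaded = pd.read_csv('sales_data.csv')
print(reloaded)
```

Because the file contains both missing values and a repeated row, each cleaning step in the project has something visible to do.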
Code Example:
import pandas as pd
import matplotlib.pyplot as plt
# Read CSV file
file_path = 'sales_data.csv'
df = pd.read_csv(file_path)
print(f"Original dataset:\n{df.head()}")
# Check missing values
missing_values = df.isnull().sum()
print(f"Missing values per column:\n{missing_values}")
# Drop rows with missing values
df_cleaned = df.dropna()
print(f"Dataset after dropping missing values:\n{df_cleaned.head()}")
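# Check duplicate rows (duplicated() returns a boolean mask, so use it to select the rows)
duplicates = df_cleaned.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")
print(f"Duplicate rows:\n{df_cleaned[duplicates]}")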
# Drop duplicate rows
df_no_duplicates = df_cleaned.drop_duplicates()
print(f"Dataset after dropping duplicates:\n{df_no_duplicates.head()}")
# Group by '部门' (department) and compute average sales
grouped_by_department = df_no_duplicates.groupby('部门')
mean_sales_by_department = grouped_by_department['销售额'].mean()
print(f"Average sales by department:\n{mean_sales_by_department}")
# Plot line chart
plt.figure(figsize=(10, 6))
plt.plot(mean_sales_by_department.index, mean_sales_by_department.values,
         marker='o', linestyle='-', color='b', label='Average Sales')
plt.xlabel('Department')
plt.ylabel('Average Sales (10,000 CNY)')
plt.title('Average Sales by Department (Line Chart)')
plt.legend()
plt.grid(True)
plt.show()
# Plot bar chart
plt.figure(figsize=(10, 6))
plt.bar(mean_sales_by_department.index, mean_sales_by_department.values, color='b')
plt.xlabel('Department')
plt.ylabel('Average Sales (10,000 CNY)')
plt.title('Average Sales by Department (Bar Chart)')
plt.grid(True)
plt.show()
Full Code: The full script combines all steps shown above into a single executable program.
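One possible way to combine the cleaning and aggregation steps is to wrap them in a single function, which makes the logic easy to check on a small in-memory table before running it on a real file. This is a minimal sketch, not the original full script: the function name average_sales_by_department and the demo data are hypothetical, and only the column names '部门' and '销售额' come from the code above.

```python
import pandas as pd


def average_sales_by_department(df: pd.DataFrame) -> pd.Series:
    """Drop incomplete and duplicate rows, then average '销售额' per '部门'."""
    cleaned = df.dropna().drop_duplicates()
    return cleaned.groupby('部门')['销售额'].mean()


# Hypothetical in-memory data standing in for sales_data.csv
demo = pd.DataFrame({
    '部门': ['East', 'East', 'West', 'West', None],
    '销售额': [10.0, 10.0, 20.0, 30.0, 15.0],
})

result = average_sales_by_department(demo)
print(result)
```

The returned Series has the department names as its index, so it can be passed to plt.plot or plt.bar in the same way as mean_sales_by_department above.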
Summary: By completing this comprehensive project you have reinforced the previous nine days of learning and practiced the full workflow of data reading, cleaning, aggregation, and visualization, which prepares you to apply these skills to real-world projects.
Test Development Learning Exchange