Superstore Sales Data Analysis: From Data Preprocessing to RFM Modeling
This article presents a comprehensive analysis of a global supermarket's four‑year sales dataset: data collection and preprocessing, exploratory analysis of sales amount, quantity, profit, and market layout, product performance, customer segmentation with an RFM model, and actionable recommendations for improving revenue and customer retention.
Define Requirements and Objectives
Analyze the sales data of a global supermarket from 2011‑2014 using the "people, product, place" framework and provide targeted suggestions to increase sales.
Place: Overall operational analysis including revenue, sales volume, profit, average transaction value, and market layout.
Product: Product mix, best‑selling items, under‑performing items.
People: Customer count, new vs. returning customers, RFM model, repurchase rate.
Data Collection
The dataset is sourced from Kaggle: superstore_dataset2011-2015.csv with 51,290 rows and 24 features.
Data Pre‑processing
1. Data Integration
1.1 Load Libraries and Dataset
<code># Load required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# SimHei lets any Chinese characters in chart labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
warnings.filterwarnings('ignore')
# Load the retail dataset with ISO‑8859‑1 encoding
df = pd.read_csv('superstore_dataset2011-2015.csv', encoding='ISO-8859-1')
df.head()</code>
1.2 Data Overview
<code>df.shape
# (51290, 24)
# Show basic info
# df.info()
# df.describe()</code>
1.3 Column Renaming
<code>df.rename(columns=lambda x: x.replace(' ', '_').replace('-', '_'), inplace=True)
df.columns</code>
1.4 Data Type Handling
<code># Convert Order_Date to datetime
df['Order_Date'] = pd.to_datetime(df['Order_Date'])
# Add year and month columns
df['year'] = df['Order_Date'].dt.year
df['month'] = df['Order_Date'].values.astype('datetime64[M]')
</code>
1.5 Missing Value Treatment
<code># Identify missing values
df.isnull().sum()
# Drop Postal_Code (many missing values)
df.drop(['Postal_Code'], axis=1, inplace=True)
</code>
1.6 Outlier Detection
No obvious outliers were found in the descriptive statistics.
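The article states this without showing the check itself. A minimal IQR-based screen is one way to make it explicit; this sketch runs on synthetic numbers standing in for a column such as Sales or Profit:

```python
import pandas as pd

# Synthetic stand-in for one numeric column; the real dataset has 51,290 rows
sample = pd.DataFrame({'Sales': [10.0, 25.0, 40.0, 55.0, 70.0, 85.0, 100.0]})

# Flag values beyond 1.5 * IQR of the quartiles -- a common outlier screen
q1, q3 = sample['Sales'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sample[(sample['Sales'] < q1 - 1.5 * iqr) |
                  (sample['Sales'] > q3 + 1.5 * iqr)]
print(len(outliers))  # 0: no values fall outside the fences here
```

On the real data the same fences would be computed per numeric column before deciding whether any rows need trimming.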
1.7 Duplicate Removal
<code># Check for duplicates
df.duplicated().sum()
</code>
No duplicate rows were detected.
Data Analysis
1. Overall Sales Situation
Create a subset containing order date, sales, quantity, profit, year, and month.
<code>sales_data = df[['Order_Date', 'Sales', 'Quantity', 'Profit', 'year', 'month']]
</code>
Group by year and month and sum the metrics.
<code># Sum only the numeric metrics; summing the datetime column would raise an error
sales_year = sales_data.groupby(['year', 'month'])[['Sales', 'Quantity', 'Profit']].sum()
</code>
Split the grouped data into separate DataFrames for each year.
<code>year_2011 = sales_year.loc[(2011, slice(None)), :].reset_index()
year_2012 = sales_year.loc[(2012, slice(None)), :].reset_index()
year_2013 = sales_year.loc[(2013, slice(None)), :].reset_index()
year_2014 = sales_year.loc[(2014, slice(None)), :].reset_index()
</code>
1.1 Sales Amount Analysis
<code># Concatenate yearly sales columns
sales = pd.concat([year_2011['Sales'], year_2012['Sales'], year_2013['Sales'], year_2014['Sales']], axis=1)
sales.columns = ['Sales-2011', 'Sales-2012', 'Sales-2013', 'Sales-2014']
sales.index = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Visualize with a gradient background
sales.style.background_gradient()
</code>
Observation: Sales in the second half of each year are higher, and sales increase year over year.
<code># Calculate yearly total sales and growth rates
sales_sum = sales.sum()
sales_sum.plot(kind='bar', alpha=0.5)
plt.grid()
rise_12 = sales_sum.iloc[1] / sales_sum.iloc[0] - 1
rise_13 = sales_sum.iloc[2] / sales_sum.iloc[1] - 1
rise_14 = sales_sum.iloc[3] / sales_sum.iloc[2] - 1
rise_rate = [0, rise_12, rise_13, rise_14]
sales_sum = pd.DataFrame({'sales_sum': sales_sum})
sales_sum['rise_rate'] = rise_rate
sales_sum
</code>
Result: 2014 sales are almost double those of 2011, with year‑over‑year growth rates around 26%.
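The three manual divisions can also be collapsed into pandas' `pct_change`; a sketch with hypothetical yearly totals (the real values come from `sales.sum()`):

```python
import pandas as pd

# Hypothetical yearly totals standing in for sales.sum()
sales_sum = pd.Series([100.0, 126.0, 158.8, 200.0],
                      index=['Sales-2011', 'Sales-2012', 'Sales-2013', 'Sales-2014'])

# Year-over-year growth in one call; the first year has no predecessor, so fill 0
rise_rate = sales_sum.pct_change().fillna(0)
print(rise_rate)
```

This yields the same rates as the explicit `rise_12`/`rise_13`/`rise_14` divisions, with less room for index-off-by-one mistakes.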
1.2 Quantity Analysis
<code># Concatenate yearly quantity columns
quantity = pd.concat([year_2011['Quantity'], year_2012['Quantity'], year_2013['Quantity'], year_2014['Quantity']], axis=1)
quantity.columns = ['Quantity-2011', 'Quantity-2012', 'Quantity-2013', 'Quantity-2014']
quantity.index = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
quantity.style.background_gradient()
</code>
<code># Yearly total quantity and growth rates
quantity_sum = quantity.sum()
quantity_sum.plot(kind='bar', alpha=0.5)
plt.grid()
rise_12 = quantity_sum.iloc[1] / quantity_sum.iloc[0] - 1
rise_13 = quantity_sum.iloc[2] / quantity_sum.iloc[1] - 1
rise_14 = quantity_sum.iloc[3] / quantity_sum.iloc[2] - 1
rise_rate = [0, rise_12, rise_13, rise_14]
quantity_sum = pd.DataFrame({'quantity_sum': quantity_sum})
quantity_sum['rise_rate'] = rise_rate
quantity_sum
</code>
Quantity trends mirror sales trends, with higher volumes in the second half of each year.
1.3 Profit Analysis
<code># Concatenate yearly profit columns
profit = pd.concat([year_2011['Profit'], year_2012['Profit'], year_2013['Profit'], year_2014['Profit']], axis=1)
profit.columns = ['Profit-2011', 'Profit-2012', 'Profit-2013', 'Profit-2014']
profit.index = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
profit.style.background_gradient()
</code>
<code># Yearly total profit and profit margin
profit_sum = profit.sum()
profit_sum.plot(kind='bar', alpha=0.5)
plt.grid()
profit_sum = pd.DataFrame({'profit_sum': profit_sum})
profit_sum['year'] = [2011, 2012, 2013, 2014]
sales_sum = pd.DataFrame({'sales_sum': sales.sum()})
sales_sum['year'] = [2011, 2012, 2013, 2014]
profit_sum = pd.merge(profit_sum, sales_sum, on='year')
profit_sum['profit_rate'] = profit_sum['profit_sum'] / profit_sum['sales_sum']
profit_sum
</code>
Profit margins stay stable around 11‑12% while absolute profit grows each year.
1.4 Average Transaction Value (ATV) Analysis
<code># Calculate average transaction value (ATV) for each year
for i in range(2011, 2015):
    data = df[df['year'] == i]
    price = data[['Order_Date', 'Customer_ID', 'Sales']]
    # One order = one distinct (Order_Date, Customer_ID) pair
    price_dr = price.drop_duplicates(subset=['Order_Date', 'Customer_ID'])
    total_num = price_dr.shape[0]
    unit_price = price['Sales'].sum() / total_num
    print(f'{i} year ATV = {unit_price}')
</code>
ATV remains roughly stable around 500 across the four years.
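The per-year loop can also be expressed as a single groupby. A sketch on a hypothetical mini order log (column names match the dataset, the figures do not):

```python
import pandas as pd

# Hypothetical order lines; the real df spans 2011-2014
orders = pd.DataFrame({
    'year': [2011, 2011, 2011, 2012],
    'Order_Date': pd.to_datetime(['2011-01-05', '2011-01-05',
                                  '2011-03-02', '2012-02-10']),
    'Customer_ID': ['A', 'A', 'B', 'A'],
    'Sales': [100.0, 50.0, 150.0, 200.0],
})

# ATV = total sales / distinct (Order_Date, Customer_ID) pairs, per year
atv = orders.groupby('year').apply(
    lambda g: g['Sales'].sum()
    / g.drop_duplicates(['Order_Date', 'Customer_ID']).shape[0])
print(atv)  # 2011 -> 150.0, 2012 -> 200.0
```

Either form counts an "order" the same way: one distinct date-customer pair.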
1.5 Market Layout Analysis
<code># Sales by market and year
Market_Year_Sales = df.groupby(['Market', 'year']).agg({'Sales': 'sum'}).reset_index().rename(columns={'Sales': 'Sales_amounts'})
</code>
<code># Visualize market sales over years
sns.barplot(x='Market', y='Sales_amounts', hue='year', data=Market_Year_Sales)
plt.title('2011‑2014 market sales')
</code>
<code># Market share of total sales
Market_Sales = df.groupby('Market').agg({'Sales': 'sum'})
Market_Sales['percent'] = Market_Sales['Sales'] / df['Sales'].sum()
Market_Sales.style.background_gradient()
</code>
APAC, EU, US, and LATAM dominate (over 85% of total sales combined); Canada contributes minimally.
2. Product Situation Analysis
<code># Top 10 products by number of order lines
productId_count = df.groupby('Product_ID')['Customer_ID'].count().sort_values(ascending=False)
print(productId_count.head(10))
</code>
<code># Top 10 products by sales amount
productId_amount = df.groupby('Product_ID')['Sales'].sum().sort_values(ascending=False)
print(productId_amount.head(10))
</code>
<code># Top 10 products by profit
productId_Profit = df.groupby('Product_ID')['Profit'].sum().sort_values(ascending=False)
print(productId_Profit.head(10))
</code>
High‑selling items are mainly office supplies, while the top items by revenue and profit are technology and furniture products.
<code># Combine Category and Sub‑Category into one label
df['Category_Sub_Category'] = df['Category'] + '_' + df['Sub_Category']
# Aggregate sales and profit by the combined category
df_Category_Sub_Category = df.groupby('Category_Sub_Category').agg({'Profit': 'sum', 'Sales': 'sum'}).reset_index()
df_Category_Sub_Category.sort_values(by='Sales', ascending=False, inplace=True)
# Cumulative sales percentage (Pareto view)
df_Category_Sub_Category['cum_percent'] = df_Category_Sub_Category['Sales'].cumsum() / df_Category_Sub_Category['Sales'].sum()
</code>
Approximately 84% of sales come from half of the product categories, indicating a core set of strong products.
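The `cum_percent` column makes it easy to count how many categories sit inside any cumulative share. A sketch with hypothetical category totals (labels and figures are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical per-category sales, already sorted descending as in the article
cat = pd.DataFrame({
    'Category_Sub_Category': ['Tech_Phones', 'Furn_Chairs',
                              'Tech_Copiers', 'Off_Binders'],
    'Sales': [400.0, 300.0, 200.0, 100.0],
})
cat['cum_percent'] = cat['Sales'].cumsum() / cat['Sales'].sum()

# Number of top categories whose cumulative share stays within 80%
core_n = (cat['cum_percent'] <= 0.8).sum()
print(core_n)  # 2: Tech_Phones and Furn_Chairs together hold 70% of sales
```

Applied to the real `df_Category_Sub_Category`, the same expression quantifies the "half of categories carry ~84% of sales" observation.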
3. Customer Situation Analysis
3.1 Customer Segment Distribution
<code># Segment proportion pie chart
df['Segment'].value_counts().plot(kind='pie', autopct='%.2f%%', shadow=True, figsize=(14, 6))
</code>
The Consumer segment accounts for about 51.7% of order lines.
<code># Order count per segment per year
Segment_Year = df.groupby(['Segment', 'year']).agg({'Customer_ID': 'count'}).reset_index()
sns.barplot(x='Segment', y='Customer_ID', hue='year', data=Segment_Year)
plt.title('2011‑2014 Segment Customer')
</code>
<code># Segment sales per year
Segment_sales = df.groupby(['Segment', 'year']).agg({'Sales': 'sum'}).reset_index()
sns.barplot(x='Segment', y='Sales', hue='year', data=Segment_sales)
plt.title('2011‑2014 Segment Sales')
</code>
3.2 Customer Ordering Behavior
<code># Subset for ordering behavior and sort by date
grouped_Customer = df[['Customer_ID', 'Order_Date', 'Quantity', 'Sales', 'month']].sort_values('Order_Date').groupby('Customer_ID')
</code>
<code># First purchase date distribution
grouped_Customer.min().Order_Date.value_counts().plot()
</code>
<code># Last purchase date distribution
grouped_Customer.max().Order_Date.value_counts().plot()
</code>
New customer acquisition slows after early 2013, while retention remains strong.
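The first-purchase dates can be rolled up per month to quantify that slowdown. A sketch on a hypothetical mini order log (real columns are `Customer_ID` and `Order_Date`):

```python
import pandas as pd

# Hypothetical mini order log
orders = pd.DataFrame({
    'Customer_ID': ['A', 'A', 'B', 'C', 'C'],
    'Order_Date': pd.to_datetime(['2011-01-05', '2011-03-02', '2011-01-20',
                                  '2011-02-10', '2011-02-28']),
})

# Each customer's first purchase, then count new customers per month
first_buy = orders.groupby('Customer_ID')['Order_Date'].min()
new_per_month = first_buy.dt.to_period('M').value_counts().sort_index()
print(new_per_month)  # 2011-01: 2 (A, B), 2011-02: 1 (C)
```

Plotting `new_per_month` on the real data gives a cleaner acquisition trend than a raw `value_counts()` of dates.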
<code># Customers whose first and last purchase dates coincide (single purchase)
Customer_life = grouped_Customer.Order_Date.agg(['min', 'max'])
single_purchase = (Customer_life['min'] == Customer_life['max']).value_counts()
print(single_purchase)
</code>
Only 10 customers made a single purchase, indicating a very high repeat purchase rate.
3.3 RFM Model Analysis
<code># Build RFM table
rfm = df.pivot_table(index='Customer_ID', values=['Quantity', 'Sales', 'Order_Date'], aggfunc={'Quantity': 'sum', 'Sales': 'sum', 'Order_Date': 'max'})
# Recency (days since last purchase)
rfm['R'] = (rfm.Order_Date.max() - rfm.Order_Date) / np.timedelta64(1, 'D')
# Frequency and Monetary
rfm.rename(columns={'Quantity': 'F', 'Sales': 'M'}, inplace=True)
</code>
<code># Label customers into 8 groups based on mean thresholds
def rfm_func(x):
    # '1' if the metric is above its column mean, '0' otherwise
    level = x.apply(lambda v: '1' if v > 0 else '0')
    code = level.R + level.F + level.M
    mapping = {
        '111': 'High‑Value Customer',
        '011': 'High‑Retention Customer',
        '101': 'High‑Recovery Customer',
        '001': 'High‑Growth Customer',
        '110': 'Medium‑Value Customer',
        '010': 'Medium‑Retention Customer',
        '100': 'Medium‑Recovery Customer',
        '000': 'Medium‑Growth Customer'
    }
    return mapping.get(code, 'Other')
rfm['label'] = rfm[['R', 'F', 'M']].apply(lambda x: x - x.mean()).apply(rfm_func, axis=1)
</code>
<code># Visualize high‑value customers (green) vs the rest (red)
rfm.loc[rfm.label == 'High‑Value Customer', 'color'] = 'g'
rfm.loc[rfm.label != 'High‑Value Customer', 'color'] = 'r'
rfm.plot.scatter('F', 'R', c=rfm.color)
</code>
The RFM segmentation helps identify valuable customers for targeted marketing.
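The mean-threshold coding itself can be sanity-checked on a tiny hypothetical R/F/M table. Note that here R is days since the last purchase, so a '0' in the first position means a recent buyer:

```python
import pandas as pd

# Hypothetical R/F/M values (R = days since last purchase, F = quantity, M = sales)
rfm = pd.DataFrame(
    {'R': [5.0, 200.0, 10.0, 300.0],
     'F': [50, 2, 40, 3],
     'M': [5000.0, 100.0, 4000.0, 200.0]},
    index=['A', 'B', 'C', 'D'])

# Same idea as rfm_func: '1' where a metric exceeds its column mean
centered = rfm - rfm.mean()
code = centered.gt(0).astype(int).astype(str).apply(''.join, axis=1)
print(code)  # A and C -> '011' (recent, frequent, high spend); B and D -> '100'
```

Feeding these codes through the article's mapping puts A and C in the High‑Retention group and B and D in Medium‑Recovery, which matches intuition for this toy data.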
3.4 User Activity Classification
<code># Build a pivot table of purchase counts per month per customer
pivoted_counts = df.pivot_table(index='Customer_ID', columns='month', values='Order_Date', aggfunc='count').fillna(0)
# Binary purchase indicator (1 if purchased, 0 otherwise)
df_purchase = pivoted_counts.applymap(lambda x: 1 if x > 0 else 0)
</code>
<code># Define activity status per month: new / active / unactive / return / unreg
def active_status(row):
    status = []
    for i in range(len(row)):
        if row.iloc[i] == 0:
            # No purchase this month
            if status and status[-1] == 'unreg':
                status.append('unreg')
            else:
                status.append('unactive' if status else 'unreg')
        else:
            # Purchase this month
            if not status:
                status.append('new')
            elif status[-1] == 'unactive':
                status.append('return')
            elif status[-1] == 'unreg':
                status.append('new')
            else:
                status.append('active')
    # Keep the month labels so downstream counts align by month
    return pd.Series(status, index=row.index)
purchase_stats = df_purchase.apply(active_status, axis=1)
</code>
<code># Monthly count of each status (treat 'unreg' as NaN so it is excluded)
purchase_stats_ct = purchase_stats.replace('unreg', np.nan).apply(lambda s: s.value_counts()).fillna(0).T
purchase_stats_ct.plot.area()
</code>
Active, new, and returning customers show seasonal peaks, while new‑customer acquisition declines each year.
3.5 Repurchase and Return Rate Analysis
<code># Repurchase rate (customers buying more than once in a month)
purchase_r = pivoted_counts.applymap(lambda x: 1 if x > 1 else (np.nan if x == 0 else 0))
(purchase_r.sum() / purchase_r.count()).plot(figsize=(10, 4))
</code>
<code># Return rate (purchase in month i followed by a purchase in month i+1)
def purchase_back(row):
    status = []
    for i in range(len(row) - 1):
        if row.iloc[i] == 1:
            status.append(1 if row.iloc[i + 1] == 1 else 0)
        else:
            status.append(np.nan)
    status.append(np.nan)  # The last month has no following month
    return pd.Series(status, index=row.index)
purchase_b = df_purchase.apply(purchase_back, axis=1)
(purchase_b.sum() / purchase_b.count()).plot(figsize=(10, 4))
</code>
Repurchase rates stay above 52% with an upward trend; return rates peak around mid‑year and year‑end, likely linked to promotions.
Conclusion
The analysis examined the supermarket from the perspectives of "place", "product", and "people", providing insights on overall sales growth, seasonal patterns, market distribution, product performance, and detailed customer segmentation. Recommendations include focusing on high‑margin products, improving new‑customer acquisition, and leveraging RFM‑based targeted marketing to further boost revenue.