Superstore Sales Data Analysis: From Data Preprocessing to RFM Modeling
This article presents a comprehensive analysis of a global supermarket's four‑year sales dataset: data collection and preprocessing, exploratory analysis of sales amount, quantity, profit, and market layout, product performance, customer segmentation with an RFM model, and actionable recommendations for improving revenue and customer retention.
Define Requirements and Objectives
Analyze the sales data of a global supermarket from 2011‑2014 using the "people, product, place" framework and provide targeted suggestions to increase sales.
Place: Overall operational analysis including revenue, sales volume, profit, average transaction value, and market layout.
Product: Product mix, best‑selling items, under‑performing items.
People: Customer count, new vs. returning customers, RFM model, repurchase rate.
Data Collection
The dataset is sourced from Kaggle: superstore_dataset2011-2015.csv with 51,290 rows and 24 features.
Data Pre‑processing
1. Data Integration
1.1 Load Libraries and Dataset
<code># Load required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# SimHei lets any Chinese characters in chart labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
warnings.filterwarnings('ignore')
# Load the retail dataset with ISO‑8859‑1 encoding
df = pd.read_csv('superstore_dataset2011-2015.csv', encoding='ISO-8859-1')
df.head()</code>
1.2 Data Overview
<code>df.shape
# (51290, 24)
# Show basic info
# df.info()
# df.describe()</code>
1.3 Column Renaming
<code>df.rename(columns=lambda x: x.replace(' ', '_').replace('-', '_'), inplace=True)
df.columns</code>
1.4 Data Type Handling
<code># Convert Order_Date to datetime
df['Order_Date'] = pd.to_datetime(df['Order_Date'])
# Add year and month columns
df['year'] = df['Order_Date'].dt.year
df['month'] = df['Order_Date'].values.astype('datetime64[M]')
</code>
1.5 Missing Value Treatment
<code># Identify missing values
df.isnull().sum()
# Drop Postal_Code (many missing values)
df.drop(['Postal_Code'], axis=1, inplace=True)
</code>
1.6 Outlier Detection
No obvious outliers were found in the descriptive statistics.
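The article states this without showing the check itself. A minimal IQR-based screen is one way to make it explicit; this sketch runs on synthetic numbers standing in for a column such as Sales or Profit:

```python
import pandas as pd

# Synthetic stand-in for one numeric column; the real dataset has 51,290 rows
sample = pd.DataFrame({'Sales': [10.0, 25.0, 40.0, 55.0, 70.0, 85.0, 100.0]})

# Flag values beyond 1.5 * IQR of the quartiles -- a common outlier screen
q1, q3 = sample['Sales'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sample[(sample['Sales'] < q1 - 1.5 * iqr) |
                  (sample['Sales'] > q3 + 1.5 * iqr)]
print(len(outliers))  # 0: no values fall outside the fences here
```

On the real data the same fences would be computed per numeric column before deciding whether any rows need trimming.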
1.7 Duplicate Removal
<code># Check for duplicates
df.duplicated().sum()
</code>
No duplicate rows were detected.
Data Analysis
1. Overall Sales Situation
Create a subset containing order date, sales, quantity, profit, year, and month.
<code>sales_data = df[['Order_Date', 'Sales', 'Quantity', 'Profit', 'year', 'month']]
</code>
Group by year and month and sum the metrics.
<code># Sum only the numeric metrics; summing the datetime column would raise an error
sales_year = sales_data.groupby(['year', 'month'])[['Sales', 'Quantity', 'Profit']].sum()
</code>
Split the grouped data into separate DataFrames for each year.
<code>year_2011 = sales_year.loc[(2011, slice(None)), :].reset_index()
year_2012 = sales_year.loc[(2012, slice(None)), :].reset_index()
year_2013 = sales_year.loc[(2013, slice(None)), :].reset_index()
year_2014 = sales_year.loc[(2014, slice(None)), :].reset_index()
</code>
1.1 Sales Amount Analysis
<code># Concatenate yearly sales columns
sales = pd.concat([year_2011['Sales'], year_2012['Sales'], year_2013['Sales'], year_2014['Sales']], axis=1)
sales.columns = ['Sales-2011', 'Sales-2012', 'Sales-2013', 'Sales-2014']
sales.index = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Visualize with a gradient background
sales.style.background_gradient()
</code>
Observation: Sales in the second half of each year are higher, and sales increase year over year.
<code># Calculate yearly total sales and growth rates
sales_sum = sales.sum()
sales_sum.plot(kind='bar', alpha=0.5)
plt.grid()
rise_12 = sales_sum.iloc[1] / sales_sum.iloc[0] - 1
rise_13 = sales_sum.iloc[2] / sales_sum.iloc[1] - 1
rise_14 = sales_sum.iloc[3] / sales_sum.iloc[2] - 1
rise_rate = [0, rise_12, rise_13, rise_14]
sales_sum = pd.DataFrame({'sales_sum': sales_sum})
sales_sum['rise_rate'] = rise_rate
sales_sum
</code>
Result: 2014 sales are almost double those of 2011, with year‑over‑year growth rates around 26%.
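The three manual divisions can also be collapsed into pandas' `pct_change`; a sketch with hypothetical yearly totals (the real values come from `sales.sum()`):

```python
import pandas as pd

# Hypothetical yearly totals standing in for sales.sum()
sales_sum = pd.Series([100.0, 126.0, 158.8, 200.0],
                      index=['Sales-2011', 'Sales-2012', 'Sales-2013', 'Sales-2014'])

# Year-over-year growth in one call; the first year has no predecessor, so fill 0
rise_rate = sales_sum.pct_change().fillna(0)
print(rise_rate)
```

This yields the same rates as the explicit `rise_12`/`rise_13`/`rise_14` divisions, with less room for index-off-by-one mistakes.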
1.2 Quantity Analysis
<code># Concatenate yearly quantity columns
quantity = pd.concat([year_2011['Quantity'], year_2012['Quantity'], year_2013['Quantity'], year_2014['Quantity']], axis=1)
quantity.columns = ['Quantity-2011', 'Quantity-2012', 'Quantity-2013', 'Quantity-2014']
quantity.index = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
quantity.style.background_gradient()
</code>
<code># Yearly total quantity and growth rates
quantity_sum = quantity.sum()
quantity_sum.plot(kind='bar', alpha=0.5)
plt.grid()
rise_12 = quantity_sum.iloc[1] / quantity_sum.iloc[0] - 1
rise_13 = quantity_sum.iloc[2] / quantity_sum.iloc[1] - 1
rise_14 = quantity_sum.iloc[3] / quantity_sum.iloc[2] - 1
rise_rate = [0, rise_12, rise_13, rise_14]
quantity_sum = pd.DataFrame({'quantity_sum': quantity_sum})
quantity_sum['rise_rate'] = rise_rate
quantity_sum
</code>
Quantity trends mirror sales trends, with higher volumes in the second half of each year.
1.3 Profit Analysis
<code># Concatenate yearly profit columns
profit = pd.concat([year_2011['Profit'], year_2012['Profit'], year_2013['Profit'], year_2014['Profit']], axis=1)
profit.columns = ['Profit-2011', 'Profit-2012', 'Profit-2013', 'Profit-2014']
profit.index = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
profit.style.background_gradient()
</code>
<code># Yearly total profit and profit margin
profit_sum = profit.sum()
profit_sum.plot(kind='bar', alpha=0.5)
plt.grid()
profit_sum = pd.DataFrame({'profit_sum': profit_sum})
profit_sum['year'] = [2011, 2012, 2013, 2014]
sales_sum = pd.DataFrame({'sales_sum': sales.sum()})
sales_sum['year'] = [2011, 2012, 2013, 2014]
profit_sum = pd.merge(profit_sum, sales_sum, on='year')
profit_sum['profit_rate'] = profit_sum['profit_sum'] / profit_sum['sales_sum']
profit_sum
</code>
Profit margins stay stable around 11‑12% while absolute profit grows each year.
1.4 Average Transaction Value (ATV) Analysis
<code># Calculate average transaction value (ATV) for each year
for i in range(2011, 2015):
    data = df[df['year'] == i]
    price = data[['Order_Date', 'Customer_ID', 'Sales']]
    # One order = one distinct (Order_Date, Customer_ID) pair
    price_dr = price.drop_duplicates(subset=['Order_Date', 'Customer_ID'])
    total_num = price_dr.shape[0]
    unit_price = price['Sales'].sum() / total_num
    print(f'{i} year ATV = {unit_price}')
</code>
ATV remains roughly stable around 500 across the four years.
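The per-year loop can also be expressed as a single groupby. A sketch on a hypothetical mini order log (column names match the dataset, the figures do not):

```python
import pandas as pd

# Hypothetical order lines; the real df spans 2011-2014
orders = pd.DataFrame({
    'year': [2011, 2011, 2011, 2012],
    'Order_Date': pd.to_datetime(['2011-01-05', '2011-01-05',
                                  '2011-03-02', '2012-02-10']),
    'Customer_ID': ['A', 'A', 'B', 'A'],
    'Sales': [100.0, 50.0, 150.0, 200.0],
})

# ATV = total sales / distinct (Order_Date, Customer_ID) pairs, per year
atv = orders.groupby('year').apply(
    lambda g: g['Sales'].sum()
    / g.drop_duplicates(['Order_Date', 'Customer_ID']).shape[0])
print(atv)  # 2011 -> 150.0, 2012 -> 200.0
```

Either form counts an "order" the same way: one distinct date-customer pair.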
1.5 Market Layout Analysis
<code># Sales by market and year
Market_Year_Sales = df.groupby(['Market', 'year']).agg({'Sales': 'sum'}).reset_index().rename(columns={'Sales': 'Sales_amounts'})
</code>
<code># Visualize market sales over years
sns.barplot(x='Market', y='Sales_amounts', hue='year', data=Market_Year_Sales)
plt.title('2011‑2014 market sales')
</code>
<code># Market share of total sales
Market_Sales = df.groupby('Market').agg({'Sales': 'sum'})
Market_Sales['percent'] = Market_Sales['Sales'] / df['Sales'].sum()
Market_Sales.style.background_gradient()
</code>
APAC, EU, US, and LATAM dominate (over 85% of total sales combined); Canada contributes minimally.
2. Product Situation Analysis
<code># Top 10 products by number of order lines
productId_count = df.groupby('Product_ID')['Customer_ID'].count().sort_values(ascending=False)
print(productId_count.head(10))
</code>
<code># Top 10 products by sales amount
productId_amount = df.groupby('Product_ID')['Sales'].sum().sort_values(ascending=False)
print(productId_amount.head(10))
</code>
<code># Top 10 products by profit
productId_Profit = df.groupby('Product_ID')['Profit'].sum().sort_values(ascending=False)
print(productId_Profit.head(10))
</code>
High‑selling items are mainly office supplies, while the top items by revenue and profit are technology and furniture products.
<code># Combine Category and Sub‑Category into one label
df['Category_Sub_Category'] = df['Category'] + '_' + df['Sub_Category']
# Aggregate sales and profit by the combined category
df_Category_Sub_Category = df.groupby('Category_Sub_Category').agg({'Profit': 'sum', 'Sales': 'sum'}).reset_index()
df_Category_Sub_Category.sort_values(by='Sales', ascending=False, inplace=True)
# Cumulative sales percentage (Pareto view)
df_Category_Sub_Category['cum_percent'] = df_Category_Sub_Category['Sales'].cumsum() / df_Category_Sub_Category['Sales'].sum()
</code>
Approximately 84% of sales come from half of the product categories, indicating a core set of strong products.
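The `cum_percent` column makes it easy to count how many categories sit inside any cumulative share. A sketch with hypothetical category totals (labels and figures are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical per-category sales, already sorted descending as in the article
cat = pd.DataFrame({
    'Category_Sub_Category': ['Tech_Phones', 'Furn_Chairs',
                              'Tech_Copiers', 'Off_Binders'],
    'Sales': [400.0, 300.0, 200.0, 100.0],
})
cat['cum_percent'] = cat['Sales'].cumsum() / cat['Sales'].sum()

# Number of top categories whose cumulative share stays within 80%
core_n = (cat['cum_percent'] <= 0.8).sum()
print(core_n)  # 2: Tech_Phones and Furn_Chairs together hold 70% of sales
```

Applied to the real `df_Category_Sub_Category`, the same expression quantifies the "half of categories carry ~84% of sales" observation.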
3. Customer Situation Analysis
3.1 Customer Segment Distribution
<code># Segment proportion pie chart
df['Segment'].value_counts().plot(kind='pie', autopct='%.2f%%', shadow=True, figsize=(14, 6))
</code>
The Consumer segment accounts for about 51.7% of order lines.
<code># Order count per segment per year
Segment_Year = df.groupby(['Segment', 'year']).agg({'Customer_ID': 'count'}).reset_index()
sns.barplot(x='Segment', y='Customer_ID', hue='year', data=Segment_Year)
plt.title('2011‑2014 Segment Customer')
</code>
<code># Segment sales per year
Segment_sales = df.groupby(['Segment', 'year']).agg({'Sales': 'sum'}).reset_index()
sns.barplot(x='Segment', y='Sales', hue='year', data=Segment_sales)
plt.title('2011‑2014 Segment Sales')
</code>
3.2 Customer Ordering Behavior
<code># Subset for ordering behavior and sort by date
grouped_Customer = df[['Customer_ID', 'Order_Date', 'Quantity', 'Sales', 'month']].sort_values('Order_Date').groupby('Customer_ID')
</code>
<code># First purchase date distribution
grouped_Customer.min().Order_Date.value_counts().plot()
</code>
<code># Last purchase date distribution
grouped_Customer.max().Order_Date.value_counts().plot()
</code>
New customer acquisition slows after early 2013, while retention remains strong.
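The first-purchase dates can be rolled up per month to quantify that slowdown. A sketch on a hypothetical mini order log (real columns are `Customer_ID` and `Order_Date`):

```python
import pandas as pd

# Hypothetical mini order log
orders = pd.DataFrame({
    'Customer_ID': ['A', 'A', 'B', 'C', 'C'],
    'Order_Date': pd.to_datetime(['2011-01-05', '2011-03-02', '2011-01-20',
                                  '2011-02-10', '2011-02-28']),
})

# Each customer's first purchase, then count new customers per month
first_buy = orders.groupby('Customer_ID')['Order_Date'].min()
new_per_month = first_buy.dt.to_period('M').value_counts().sort_index()
print(new_per_month)  # 2011-01: 2 (A, B), 2011-02: 1 (C)
```

Plotting `new_per_month` on the real data gives a cleaner acquisition trend than a raw `value_counts()` of dates.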
<code># Customers whose first and last purchase dates coincide (single purchase)
Customer_life = grouped_Customer.Order_Date.agg(['min', 'max'])
single_purchase = (Customer_life['min'] == Customer_life['max']).value_counts()
print(single_purchase)
</code>
Only 10 customers made a single purchase, indicating a very high repeat purchase rate.
3.3 RFM Model Analysis
<code># Build RFM table
rfm = df.pivot_table(index='Customer_ID', values=['Quantity', 'Sales', 'Order_Date'], aggfunc={'Quantity': 'sum', 'Sales': 'sum', 'Order_Date': 'max'})
# Recency (days since last purchase)
rfm['R'] = (rfm.Order_Date.max() - rfm.Order_Date) / np.timedelta64(1, 'D')
# Frequency and Monetary
rfm.rename(columns={'Quantity': 'F', 'Sales': 'M'}, inplace=True)
</code>
<code># Label customers into 8 groups based on mean thresholds
def rfm_func(x):
    # '1' if the metric is above its column mean, '0' otherwise
    level = x.apply(lambda v: '1' if v > 0 else '0')
    code = level.R + level.F + level.M
    mapping = {
        '111': 'High‑Value Customer',
        '011': 'High‑Retention Customer',
        '101': 'High‑Recovery Customer',
        '001': 'High‑Growth Customer',
        '110': 'Medium‑Value Customer',
        '010': 'Medium‑Retention Customer',
        '100': 'Medium‑Recovery Customer',
        '000': 'Medium‑Growth Customer'
    }
    return mapping.get(code, 'Other')
rfm['label'] = rfm[['R', 'F', 'M']].apply(lambda x: x - x.mean()).apply(rfm_func, axis=1)
</code>
<code># Visualize high‑value customers (green) vs the rest (red)
rfm.loc[rfm.label == 'High‑Value Customer', 'color'] = 'g'
rfm.loc[rfm.label != 'High‑Value Customer', 'color'] = 'r'
rfm.plot.scatter('F', 'R', c=rfm.color)
</code>
The RFM segmentation helps identify valuable customers for targeted marketing.
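The mean-threshold coding itself can be sanity-checked on a tiny hypothetical R/F/M table. Note that here R is days since the last purchase, so a '0' in the first position means a recent buyer:

```python
import pandas as pd

# Hypothetical R/F/M values (R = days since last purchase, F = quantity, M = sales)
rfm = pd.DataFrame(
    {'R': [5.0, 200.0, 10.0, 300.0],
     'F': [50, 2, 40, 3],
     'M': [5000.0, 100.0, 4000.0, 200.0]},
    index=['A', 'B', 'C', 'D'])

# Same idea as rfm_func: '1' where a metric exceeds its column mean
centered = rfm - rfm.mean()
code = centered.gt(0).astype(int).astype(str).apply(''.join, axis=1)
print(code)  # A and C -> '011' (recent, frequent, high spend); B and D -> '100'
```

Feeding these codes through the article's mapping puts A and C in the High‑Retention group and B and D in Medium‑Recovery, which matches intuition for this toy data.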
3.4 User Activity Classification
<code># Build a pivot table of purchase counts per month per customer
pivoted_counts = df.pivot_table(index='Customer_ID', columns='month', values='Order_Date', aggfunc='count').fillna(0)
# Binary purchase indicator (1 if purchased, 0 otherwise)
df_purchase = pivoted_counts.applymap(lambda x: 1 if x > 0 else 0)
</code>
<code># Define activity status per month: new / active / unactive / return / unreg
def active_status(row):
    status = []
    for i in range(len(row)):
        if row.iloc[i] == 0:
            # No purchase this month
            if status and status[-1] == 'unreg':
                status.append('unreg')
            else:
                status.append('unactive' if status else 'unreg')
        else:
            # Purchase this month
            if not status:
                status.append('new')
            elif status[-1] == 'unactive':
                status.append('return')
            elif status[-1] == 'unreg':
                status.append('new')
            else:
                status.append('active')
    # Keep the month labels so downstream counts align by month
    return pd.Series(status, index=row.index)
purchase_stats = df_purchase.apply(active_status, axis=1)
</code>
<code># Monthly count of each status (treat 'unreg' as NaN so it is excluded)
purchase_stats_ct = purchase_stats.replace('unreg', np.nan).apply(lambda s: s.value_counts()).fillna(0).T
purchase_stats_ct.plot.area()
</code>
Active, new, and returning customers show seasonal peaks, while new‑customer acquisition declines each year.
3.5 Repurchase and Return Rate Analysis
<code># Repurchase rate (customers buying more than once in a month)
purchase_r = pivoted_counts.applymap(lambda x: 1 if x > 1 else (np.nan if x == 0 else 0))
(purchase_r.sum() / purchase_r.count()).plot(figsize=(10, 4))
</code>
<code># Return rate (purchase in month i followed by a purchase in month i+1)
def purchase_back(row):
    status = []
    for i in range(len(row) - 1):
        if row.iloc[i] == 1:
            status.append(1 if row.iloc[i + 1] == 1 else 0)
        else:
            status.append(np.nan)
    status.append(np.nan)  # The last month has no following month
    return pd.Series(status, index=row.index)
purchase_b = df_purchase.apply(purchase_back, axis=1)
(purchase_b.sum() / purchase_b.count()).plot(figsize=(10, 4))
</code>
Repurchase rates stay above 52% with an upward trend; return rates peak around mid‑year and year‑end, likely linked to promotions.
Conclusion
The analysis examined the supermarket from the perspectives of "place", "product", and "people", providing insights on overall sales growth, seasonal patterns, market distribution, product performance, and detailed customer segmentation. Recommendations include focusing on high‑margin products, improving new‑customer acquisition, and leveraging RFM‑based targeted marketing to further boost revenue.