Walmart Sales & Marketing Analytics
🔎 Introduction
This project explores Walmart’s sales trends, promotional effectiveness, and customer behavior using transaction-level data. The goal is to uncover data-driven insights that help optimize revenue, improve marketing strategies, and enhance inventory management. By analyzing key factors such as best-selling products, promotion impact, customer spending patterns, and demand forecasting accuracy, this analysis aims to provide actionable recommendations for Walmart’s business growth.
📋 Project Overview
This analysis focuses on three key areas:
📊 Sales Performance
- Identify top-selling and underperforming products.
- Analyze regional store performance and revenue trends.
🎯 Marketing & Promotions Impact
- Measure the effectiveness of different promotions on sales.
- Understand customer response to discounts and loyalty programs.
📦 Demand Forecasting & Inventory Management
- Compare forecasted vs. actual demand to assess prediction accuracy.
- Identify stockout trends and recommend inventory optimizations.
✅ Key Deliverables:
- Best & worst-selling product insights
- Promotion effectiveness analysis
- Customer segmentation insights
- Demand forecasting accuracy assessment
📊 Summary of the Dataset
📂 Key Dataset Attributes
- Total Records: 5,000 transactions
- Total Columns: 28 features
- Time Period Covered: January 1 to September 16, 2024 (per the transaction dates in the data)
🛒 Sales & Transaction Data
- Transaction ID: Unique identifier for each sale
- Product Details: Product ID, Name, Category, Unit Price, Quantity Sold
- Store Information: Store ID, Store Location
👥 Customer Demographics
- Customer ID: Unique identifier per customer
- Age & Gender: Customer’s age and gender distribution
- Income Level: Customer’s financial profile
- Loyalty Level: Bronze, Silver, Gold, Platinum
📣 Marketing & Promotions
- Promotions Applied: Whether a discount was used
- Promotion Type: Percentage discount, Buy 1 Get 1, etc.
- Payment Method: Cash, Credit Card, Digital Wallet
📦 Inventory & Demand Forecasting
- Inventory Level: Stock availability before a transaction
- Stockout Indicator: Whether an item was out of stock
- Forecasted vs. Actual Demand: Accuracy of sales predictions
🌦 External Influences
- Weather Conditions: Rainy, Sunny, Stormy, etc.
- Holiday Indicator: Whether the transaction occurred during a holiday
- Weekday: Day of the transaction
🔧 Phase 1: Data Cleaning & Preparation
Importing Libraries
# Data Manipulation
import pandas as pd
import numpy as np
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Statistical Analysis
from scipy import stats
# Display settings for better readability
pd.set_option('display.float_format', '{:.2f}'.format)
Loading the Dataset & Basic Data Check
df = pd.read_csv(filepath_or_buffer=r'D:\roy\roy files\data\data project\Walmart.csv')
df.head()
transaction_id | customer_id | product_id | product_name | category | quantity_sold | unit_price | transaction_date | store_id | store_location | inventory_level | reorder_point | reorder_quantity | supplier_id | supplier_lead_time | customer_age | customer_gender | customer_income | customer_loyalty_level | payment_method | promotion_applied | promotion_type | weather_conditions | holiday_indicator | weekday | stockout_indicator | forecasted_demand | actual_demand |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2824 | 843 | Fridge | Electronics | 3 | 188.46 | 3/31/2024 21:46 | 3 | Miami, FL | 246 | 116 | 170 | 474 | 8 | 29 | Other | 98760.83 | Silver | Credit Card | True | NaN | Stormy | False | Friday | True | 172 | 179
2 | 1409 | 135 | TV | Electronics | 4 | 1912.04 | 7/28/2024 12:45 | 5 | Dallas, TX | 43 | 70 | 212 | 135 | 4 | 34 | Other | 69781.93 | Gold | Cash | True | Percentage Discount | Rainy | False | Monday | True | 109 | 484 |
3 | 5506 | 391 | Fridge | Electronics | 4 | 1377.75 | 6/10/2024 4:55 | 1 | Los Angeles, CA | 411 | 94 | 252 | 475 | 2 | 69 | Other | 77373.1 | Platinum | Cash | False | NaN | Sunny | False | Tuesday | True | 289 | 416
4 | 5012 | 710 | Smartphone | Electronics | 5 | 182.31 | 8/15/2024 1:03 | 5 | Miami, FL | 452 | 87 | 112 | 118 | 7 | 47 | Other | 33383.04 | Silver | Cash | True | Percentage Discount | Sunny | True | Sunday | False | 174 | 446 |
5 | 4657 | 116 | Laptop | Electronics | 3 | 499.28 | 9/13/2024 0:45 | 6 | Chicago, IL | 412 | 99 | 177 | 155 | 4 | 70 | Female | 108999.41 | Bronze | Digital Wallet | False | Sunny | False | Thursday | True | 287 | 469 |
pd.set_option('display.max_columns', None)
df.info()
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 transaction_id 5000 non-null int64
1 customer_id 5000 non-null int64
2 product_id 5000 non-null int64
3 product_name 5000 non-null object
4 category 5000 non-null object
5 quantity_sold 5000 non-null int64
6 unit_price 5000 non-null float64
7 transaction_date 5000 non-null object
8 store_id 5000 non-null int64
9 store_location 5000 non-null object
10 inventory_level 5000 non-null int64
11 reorder_point 5000 non-null int64
12 reorder_quantity 5000 non-null int64
13 supplier_id 5000 non-null int64
14 supplier_lead_time 5000 non-null int64
15 customer_age 5000 non-null int64
16 customer_gender 5000 non-null object
17 customer_income 5000 non-null float64
18 customer_loyalty_level 5000 non-null object
19 payment_method 5000 non-null object
20 promotion_applied 5000 non-null bool
21 promotion_type 1593 non-null object
22 weather_conditions 5000 non-null object
23 holiday_indicator 5000 non-null bool
24 weekday 5000 non-null object
25 stockout_indicator 5000 non-null bool
26 forecasted_demand 5000 non-null int64
27 actual_demand 5000 non-null int64
dtypes: bool(3), float64(2), int64(13), object(10)
memory usage: 991.3+ KB
Data Cleaning
- Check for Missing Values
- Check & Remove Duplicates
- Fix Data Types (Dates, Categories, Numbers, Booleans)
# Checking for missing values in each column
df.isna().sum()
transaction_id 0
customer_id 0
product_id 0
product_name 0
category 0
quantity_sold 0
unit_price 0
transaction_date 0
store_id 0
store_location 0
inventory_level 0
reorder_point 0
reorder_quantity 0
supplier_id 0
supplier_lead_time 0
customer_age 0
customer_gender 0
customer_income 0
customer_loyalty_level 0
payment_method 0
promotion_applied 0
promotion_type 3407
weather_conditions 0
holiday_indicator 0
weekday 0
stockout_indicator 0
forecasted_demand 0
actual_demand 0
dtype: int64
# Replace missing promotion_type values with "No Promotion"
df['promotion_type'] = df['promotion_type'].fillna('No Promotion')
# Count duplicate rows
df.duplicated().sum()
np.int64(0)
# Convert transaction_date to datetime format
df['transaction_date'] = pd.to_datetime(df['transaction_date'],errors='coerce')
# Convert categorical columns to category type
categorical_cols = ['category', 'store_location', 'customer_loyalty_level', 'payment_method',
'promotion_type', 'weather_conditions', 'customer_gender', 'weekday']
df[categorical_cols] = df[categorical_cols].astype('category')
Validating and Ensuring Data Consistency
- Check for Unexpected Values in Categorical Columns
- Check for Negative or Illogical Numeric Values
- Verify Date Ranges
- Check for Unusual or Duplicated Entries in Key Columns
# Display unique values in categorical columns
for col in categorical_cols:
    print(f"Unique values in '{col}':\n", df[col].unique(), "\n")
Unique values in 'category':
['Electronics', 'Appliances']
Categories (2, object): ['Appliances', 'Electronics']
Unique values in 'store_location':
['Miami, FL', 'Dallas, TX', 'Los Angeles, CA', 'Chicago, IL', 'New York, NY']
Categories (5, object): ['Chicago, IL', 'Dallas, TX', 'Los Angeles, CA', 'Miami, FL', 'New York, NY']
Unique values in 'customer_loyalty_level':
['Silver', 'Gold', 'Platinum', 'Bronze']
Categories (4, object): ['Bronze', 'Gold', 'Platinum', 'Silver']
Unique values in 'payment_method':
['Credit Card', 'Cash', 'Digital Wallet', 'Debit Card']
Categories (4, object): ['Cash', 'Credit Card', 'Debit Card', 'Digital Wallet']
Unique values in 'promotion_type':
['No Promotion', 'Percentage Discount', 'BOGO']
Categories (3, object): ['BOGO', 'No Promotion', 'Percentage Discount']
Unique values in 'weather_conditions':
['Stormy', 'Rainy', 'Sunny', 'Cloudy']
Categories (4, object): ['Cloudy', 'Rainy', 'Stormy', 'Sunny']
Unique values in 'customer_gender':
['Other', 'Female', 'Male']
Categories (3, object): ['Female', 'Male', 'Other']
Unique values in 'weekday':
['Friday', 'Monday', 'Tuesday', 'Sunday', 'Thursday', 'Saturday', 'Wednesday']
Categories (7, object): ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']
# List of numeric columns to check
numeric_cols_check = ['unit_price', 'quantity_sold', 'customer_income', 'forecasted_demand', 'actual_demand']
# Check for negative values
for col in numeric_cols_check:
    invalid_values = df[df[col] < 0]
    if not invalid_values.empty:
        print(f"⚠️ Warning: Negative values found in '{col}':")
        print(invalid_values[[col]].head(), "\n")
    else:
        print(f"✅ '{col}' has no negative values.")
✅ 'unit_price' has no negative values.
✅ 'quantity_sold' has no negative values.
✅ 'customer_income' has no negative values.
✅ 'forecasted_demand' has no negative values.
✅ 'actual_demand' has no negative values.
# Display the earliest and latest transaction dates
min_date = df['transaction_date'].min()
max_date = df['transaction_date'].max()
print(f"📅 Earliest Date: {min_date}")
print(f"📅 Latest Date: {max_date}")
# Check for any future dates beyond today
from datetime import datetime
today = datetime.today()
future_dates = df[df['transaction_date'] > today]
if not future_dates.empty:
    print("⚠️ Warning: There are transactions with future dates!")
    print(future_dates[['transaction_date']].head())
else:
    print("No future dates found in dataset.")
📅 Earliest Date: 2024-01-01 00:31:00
📅 Latest Date: 2024-09-16 20:22:00
No future dates found in dataset.
# Check for duplicate product entries within the same transaction (if applicable)
duplicate_products = df[df.duplicated(subset=['transaction_id', 'product_id'],keep=False)]
print(f"Duplicate Product Entries Within Transactions: {duplicate_products.shape[0]}")
Duplicate Product Entries Within Transactions: 0
Summary of Data Cleaning & Preparation
Step | Task | Status |
---|---|---|
1. Importing Libraries | Loaded Pandas, NumPy, Matplotlib, etc. | ✅ |
2. Loading the Dataset | Read the dataset and checked structure | ✅ |
3. Data Cleaning | Fixed missing values, handled duplicates | ✅ |
4. Fixing Data Types | Converted dates, numerical, categorical fields | ✅ |
5. Validating Data Consistency | Checked for anomalies, duplicate records, incorrect values | ✅ |
📈 Phase 2: Exploratory Data Analysis (EDA)
Data Distribution
# Summary statistics for numerical columns
df.describe().round(2)
 | transaction_id | customer_id | product_id | quantity_sold | unit_price | transaction_date | store_id | inventory_level | reorder_point | reorder_quantity | supplier_id | supplier_lead_time | customer_age | customer_income | forecasted_demand | actual_demand |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
mean | 2500.5 | 5542.5 | 551.23 | 2.98 | 1023.47 | 2024-05-11 05:10:22.740000 | 10.52 | 253.12 | 99.79 | 200.52 | 300.13 | 5.52 | 44.12 | 70041.63 | 297.13 | 299.09 |
min | 1.0 | 1001.0 | 100.0 | 1.0 | 50.1 | 2024-01-01 00:31:00 | 1.0 | 0.0 | 50.0 | 100.0 | 100.0 | 1.0 | 18.0 | 20005.34 | 100.0 | 90.0 |
25% | 1250.75 | 3279.0 | 322.0 | 2.0 | 537.78 | 2024-03-07 19:21:15 | 5.0 | 130.0 | 75.0 | 150.75 | 199.0 | 3.0 | 31.0 | 44865.42 | 195.0 | 194.0 |
50% | 2500.5 | 5558.0 | 559.0 | 3.0 | 1029.18 | 2024-05-11 20:02:30 | 11.0 | 253.0 | 100.0 | 200.5 | 299.0 | 6.0 | 44.0 | 70188.29 | 297.5 | 299.0 |
75% | 3750.25 | 7767.25 | 776.0 | 4.0 | 1506.31 | 2024-07-16 08:58:15 | 16.0 | 377.25 | 125.0 | 251.0 | 405.0 | 8.0 | 58.0 | 95395.87 | 395.0 | 404.0 |
max | 5000.0 | 9998.0 | 999.0 | 5.0 | 1999.85 | 2024-09-16 20:22:00 | 20.0 | 500.0 | 150.0 | 300.0 | 500.0 | 10.0 | 70.0 | 119999.78 | 500.0 | 510.0 |
std | 1443.52 | 2582.13 | 258.83 | 1.42 | 559.61 | NaN | 5.79 | 142.89 | 29.13 | 58.26 | 116.39 | 2.86 | 15.33 | 29053.37 | 115.57 | 121.68 |
# Summary of categorical columns
df.describe(include=['category'])
 | category | store_location | customer_gender | customer_loyalty_level | payment_method | promotion_type | weather_conditions | weekday |
---|---|---|---|---|---|---|---|---|
count | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
unique | 2 | 5 | 3 | 4 | 4 | 3 | 4 | 7 |
top | Electronics | Los Angeles, CA | Male | Platinum | Credit Card | No Promotion | Cloudy | Monday |
freq | 2585 | 1038 | 1682 | 1299 | 1281 | 3407 | 1281 | 775 |
Outlier Detection & Treatment
- Detect Outliers – Boxplot Visualization
- Z-Score & IQR Methods
import matplotlib.pyplot as plt
import seaborn as sns
# List of numerical columns to check for outliers
numeric_cols = ['unit_price', 'quantity_sold', 'customer_income', 'forecasted_demand', 'actual_demand']
# Apply log transformation to adjust scaling
df_log = df.copy()
df_log[numeric_cols] = np.log1p(df[numeric_cols])
# Create a new boxplot with adjusted scaling
plt.figure(figsize=(12, 6))
sns.boxplot(data=df_log[numeric_cols])
plt.xticks(rotation=45)
plt.title("Boxplot for Outlier Detection (Log-Scaled)")
plt.show()

🔍 Key Observations:
✅ Customer Income is now well-balanced (log-transformed).
✅ Unit Price shows some lower-end outliers (small dots below whiskers).
✅ Other variables (quantity_sold, forecasted_demand, actual_demand) seem normally distributed (no major outliers).
from scipy import stats
# Define threshold for Z-score method (values beyond ±3 standard deviations are considered outliers)
z_threshold = 3
# Identify outliers in 'unit_price' using Z-score
z_scores_unit_price = np.abs(stats.zscore(df['unit_price'])) # Compute Z-score
outliers_z_unit_price = df[z_scores_unit_price > z_threshold] # Filter outliers
if not outliers_z_unit_price.empty:
    print("There are some extreme outliers in unit_price based on the Z-score method.")
else:
    print("There are no extreme outliers in unit_price based on the Z-score method.")
There are no extreme outliers in unit_price based on the Z-score method.
# Interquartile Range (IQR) method for outlier detection
Q1 = df['unit_price'].quantile(0.25)
Q3 = df['unit_price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers in 'unit_price' using IQR
outliers_iqr_unit_price = df[(df['unit_price'] < lower_bound) | (df['unit_price'] > upper_bound)]
if not outliers_iqr_unit_price.empty:
    print("There are some extreme outliers in unit_price based on the IQR method.")
else:
    print("There are no extreme outliers in unit_price based on the IQR method.")
There are no extreme outliers in unit_price based on the IQR method.
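To back up the conclusion below for all numeric columns rather than unit_price alone, the same IQR screen can be repeated in a short loop (a minimal sketch reusing numeric_cols_check from above):
# Sketch: repeat the IQR screen for every numeric column of interest (reuses numeric_cols_check from above)
for col in numeric_cols_check:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outlier_mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    print(f"'{col}': {outlier_mask.sum()} IQR outliers")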
📢 Conclusion:
✅ No extreme outliers detected in any numerical variables, confirming that the dataset is clean for further analysis.
Customer Analysis
- Customer Demographics
- Customer Loyalty Trends
- Payment Method Preferences
- Customer Spending Behavior
'''Customer Demographics Analysis'''
# Plot Age Distribution
plt.figure(figsize=(8, 5))
sns.histplot(df['customer_age'], bins=20, kde=True, color='royalblue')
plt.title("Customer Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

# Plot Gender Distribution
plt.figure(figsize=(6, 5))
bars = sns.countplot(data=df, x='customer_gender', palette='Set2')
plt.title("Customer Gender Distribution")
plt.xlabel("Gender")
plt.ylabel("Count")
# Add annotations
for bar in bars.patches:
    yval = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width() / 2, yval,
        f'{yval:.0f}',
        ha='center', va='bottom', fontsize=10
    )
plt.show()

# Plot Income Distribution
plt.figure(figsize=(8, 5))
sns.histplot(df['customer_income'], bins=20, kde=True, color='seagreen')
plt.title("Customer Income Distribution")
plt.xlabel("Income ($)")
plt.ylabel("Count")
plt.show()

📢 Key Insights:
✅Age Distribution:
- Customers are evenly distributed across different age groups, ranging from early 20s to late 60s.
- No dominant age group, indicating a broad customer base.
✅Gender Distribution:
- The dataset has a balanced representation of Male, Female, and Other customers.
- No significant gender bias, meaning marketing strategies should target all genders equally.
✅Income Distribution:
- Customers have a wide range of incomes, from $20,000 to $120,000.
- The distribution is relatively even, meaning Walmart serves both low-income and high-income segments.
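One way to build on the income observation above is to bucket customers into income bands for segment-level comparisons. The sketch below is illustrative only; the band edges are assumptions and are not part of the original analysis.
# Sketch: illustrative income bands (edges are assumptions) for segment-level analysis
income_bands = pd.cut(df['customer_income'],
                      bins=[20000, 45000, 70000, 95000, 120000],
                      labels=['Low', 'Lower-Mid', 'Upper-Mid', 'High'])
print(income_bands.value_counts().sort_index())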
'''Customer Loyalty Level Analysis'''
# Get loyalty level distribution counts
loyalty_counts = df['customer_loyalty_level'].value_counts()
# Create Pie Chart
plt.figure(figsize=(7, 7))
plt.pie(loyalty_counts, labels=loyalty_counts.index, autopct='%1.1f%%',
colors=['#66c2a5', '#2ACCE7', '#8da0cb', '#e78ac3'],
startangle=140, wedgeprops={'edgecolor': 'grey'})
# Set title
plt.title("Customer Loyalty Level Distribution")
plt.show()

📢 Key Insights:
✅ Platinum (26%) & Silver (25.8%) customers dominate, indicating strong engagement.
✅ Bronze (25.1%) customers have potential for upgrades with better incentives.
✅ Gold (23.2%) is the least preferred tier, suggesting a need for improved benefits.
'''Payment Method Preferences Analysis'''
# Plot Payment Method Distribution
plt.figure(figsize=(7, 5))
bars = sns.countplot(data=df, x='payment_method', order=df['payment_method'].value_counts().index, palette='pastel')
plt.title("Customer Payment Method Preferences")
plt.xlabel("Payment Method")
plt.ylabel("Count")
# Add annotations to bars
for bar in bars.patches:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, yval, f'{yval:.0f}', ha='center', va='bottom', fontsize=10)
plt.show()

📢 Key Insights:
✅ Credit Card (1,281) is the most preferred payment method, showing that many customers rely on credit-based transactions.
✅ Digital Wallets (1,263) are nearly as popular, indicating a shift towards cashless and mobile payments.
✅ Cash (1,246) remains a strong option, meaning traditional payment methods are still widely used.
✅ Debit Card (1,210) is the least used method, but the difference is minimal, showing all payment options are well-balanced.
'''Customer Spending Behavior Analysis'''
# Create a new column for total spending
df['total_price'] = df['unit_price'] * df['quantity_sold']
# Create a boxplot for total spending based on loyalty level
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='customer_loyalty_level', y='total_price', palette='coolwarm')
plt.title("Customer Spending Behavior by Loyalty Level")
plt.xlabel("Loyalty Level")
plt.ylabel("Total Spending ($)")
plt.show()

📢 Key Insights:
✅ All loyalty levels have similar median spending (~$2,000).
- This suggests that spending habits are relatively consistent across customer tiers.
✅ Wide range of spending within each group.
- Basket values run from under $100 up to nearly $10,000, indicating a mix of low- and high-spending customers in all tiers (see the quick check below).
✅ Presence of high-spending outliers.
- A few customers in each loyalty tier make very large purchases approaching $10,000, likely bulk or high-value item buyers.
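A quick distribution check of the per-transaction basket value (the total_price column created above) helps ground these ranges (a minimal sketch):
# Sketch: summarize basket value and peek at the largest baskets per loyalty tier
print(df['total_price'].describe().round(2))
print(df.nlargest(5, 'total_price')[['customer_loyalty_level', 'total_price']])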
Business Recommendations:
✅ Segment marketing campaigns based on income, age, and loyalty level.
✅ Enhance Gold-tier benefits to increase participation.
✅ Encourage debit card usage through cashback or loyalty perks.
✅ Target low-spending customers with promotional incentives to boost spending.
✅ Create personalized offers for high-spending customers to retain premium buyers.
📈 Phase 3: Business Insights & Reporting
Sales Trends Analysis
'''Sales Trends Analysis'''
# Aggregate total sales per week
sales_trend = df.groupby(pd.Grouper(key='transaction_date', freq='W'))['total_price'].sum()
plt.figure(figsize=(12, 5))
# Plot the weekly sales trend
sns.lineplot(x=sales_trend.index, y=sales_trend.values, marker='o', color='royalblue', label="Weekly Sales")
# Add a trend line using numpy polyfit
x_values = np.arange(len(sales_trend)) # Convert index to numerical values
y_values = sales_trend.values
z = np.polyfit(x_values, y_values, 1)
p = np.poly1d(z)
# Plot the trend line
plt.plot(sales_trend.index, p(x_values), color='red', linestyle='dashed', label="Trend Line")
plt.title("Weekly Sales Trends Over Time with Trend Line")
plt.xlabel("Date")
plt.ylabel("Total Sales ($)")
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.show()

📢 Key Findings:
✅ Weekly sales fluctuate significantly, with peaks exceeding $500,000 and dips below $300,000.
✅ A sharp drop appears at the end of the period; because the data ends on 16 September 2024, the final weekly bucket is only partially covered, so this dip is more likely a data-coverage artifact than a seasonal drop or inventory shortage (see the quick check below).
✅ The red trend line slopes slightly downward, suggesting overall sales may be declining over time.
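A quick check of transaction counts per weekly bucket (a minimal sketch) shows whether the final point covers a full week or only the last day or two of data:
# Sketch: transaction counts per weekly bucket; a much smaller final count indicates a partial week
weekly_counts = df.groupby(pd.Grouper(key='transaction_date', freq='W'))['transaction_id'].count()
print(weekly_counts.tail())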
Best-Selling Products Analysis
'''Best-Selling Products Analysis'''
# Aggregate total sales per product
top_products = df.groupby('product_name')['total_price'].sum().sort_values(ascending=False).head(10)
# Plot Best-Selling Products
plt.figure(figsize=(10, 5))
sns.barplot(x=top_products.values, y=top_products.index, palette='viridis')
plt.title("Top 10 Best-Selling Products")
plt.xlabel("Total Sales ($)")
plt.ylabel("Product Name")
# Add values on bars
for index, value in enumerate(top_products.values):
    plt.text(value, index, f"${value:,.0f}", va='center', fontsize=10)
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()

📢 Key Findings:
✅ TVs are the top-selling product, generating over $2 million in total sales.
✅ Tablets, Fridges, and Smartphones also drive high revenue, indicating strong demand.
✅ Washing Machines and Cameras are in the mid-range but still contribute significantly.
✅ Laptops rank the lowest among the top 8, but still exceed $1.7 million in sales.
Regional Store Performance Analysis
'''Regional Store Performance Analysis'''
# Aggregate total sales per store location
store_performance = df.groupby('store_location')['total_price'].sum().sort_values(ascending=False)
# Plot store locations ranked by total sales (the dataset contains five locations)
plt.figure(figsize=(10, 5))
sns.barplot(x=store_performance.values, y=store_performance.index, order=store_performance.index, palette='magma')
plt.title("Store Locations Ranked by Total Sales")
plt.xlabel("Total Sales ($)")
plt.ylabel("Store Location")
# Add values on bars
for index, value in enumerate(store_performance.values):
    plt.text(value, index, f"${value:,.0f}", va='center', fontsize=10)
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()

📢 Key Findings:
✅ Los Angeles, CA is the top-performing location with $3.27 million in total sales.
✅ Chicago, IL follows closely at $3.15 million, showing strong sales demand.
✅ New York, Miami, and Dallas are all in the $2.9 million range, contributing significantly.
✅ Revenue is spread fairly evenly across the five store locations, with Los Angeles and Chicago slightly ahead, indicating strong footfall and demand in every region.
Promotions Impact Analysis
'''Promotions Impact Analysis'''
# Compare total sales between promoted and non-promoted transactions
promotion_impact = df.groupby('promotion_applied')['total_price'].sum()
# Plot Promotion Impact
plt.figure(figsize=(7, 5))
sns.barplot(x=promotion_impact.index, y=promotion_impact.values, palette='cool')
# Formatting
plt.title("Impact of Promotions on Total Sales")
plt.xlabel("Promotion Applied")
plt.ylabel("Total Sales ($)")
plt.xticks(ticks=[0, 1], labels=['No Promotion', 'Promotion Applied'])
# Add values on bars
for index, value in enumerate(promotion_impact.values):
    plt.text(index, value, f"${value:,.0f}", ha='center', va='bottom', fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

📢 Key Findings:
✅ Promotions drive higher total sales, reaching $8.06 million compared to $7.2 million without promotions.
✅ Sales increased by approximately 12% when promotions were applied.
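The ~12% figure can be reproduced directly from the promotion_impact totals computed above (a minimal sketch):
# Sketch: percentage uplift of promoted vs. non-promoted total sales (reuses promotion_impact from above)
no_promo_total = promotion_impact.loc[False]
promo_total = promotion_impact.loc[True]
print(f"Promotion uplift: {(promo_total - no_promo_total) / no_promo_total * 100:.1f}%")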
Demand Forecasting Accuracy Analysis
'''Demand Forecasting Accuracy Analysis'''
# Calculate absolute error between forecasted and actual demand
df['demand_error'] = abs(df['forecasted_demand'] - df['actual_demand'])
# Calculate Mean Absolute Error (MAE)
mae = df['demand_error'].mean()
# Plot Forecasted vs Actual Demand
plt.figure(figsize=(10, 5))
sns.scatterplot(x=df['forecasted_demand'], y=df['actual_demand'], alpha=0.6)
plt.plot([df['forecasted_demand'].min(), df['forecasted_demand'].max()],
[df['forecasted_demand'].min(), df['forecasted_demand'].max()], 'r--', label="Perfect Prediction")
plt.title(f"Forecasted vs Actual Demand (MAE: {mae:.2f})")
plt.xlabel("Forecasted Demand")
plt.ylabel("Actual Demand")
plt.legend()
plt.grid(alpha=0.5)
plt.show()

📢 Key Findings:
✅ Mean Absolute Error (MAE) is 137.52, meaning that on average, the forecasted demand deviates by 137 units from actual demand.
✅ The scatter plot shows high variability, with actual demand widely dispersed around the red dashed “Perfect Prediction” line.
✅ Some predictions are fairly close to the actual values, but many data points indicate over- or under-estimations.
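To put the MAE of 137.52 in context, the error can also be expressed relative to demand levels (a minimal sketch; MAPE is an additional metric not computed in the original analysis):
# Sketch: relative error metrics for the demand forecast (reuses demand_error and mae from above)
mape = (df['demand_error'] / df['actual_demand']).mean() * 100      # mean absolute percentage error
relative_mae = mae / df['actual_demand'].mean() * 100               # MAE as a share of average demand
print(f"MAPE: {mape:.1f}% | MAE / mean actual demand: {relative_mae:.1f}%")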
Summary of Business Insights & Reporting
Goal: Transform exploratory findings into actionable insights to drive business decisions.
- Sales Trends Analysis: Identified fluctuations in weekly sales, highlighting seasonality and demand shifts.
- Best-Selling Products Analysis: TVs, Tablets, and Fridges lead in revenue generation.
- Regional Store Performance: Los Angeles, Chicago, and New York stores outperform other locations.
- Promotions Impact Analysis: Sales increased by ~12% with promotions, indicating their effectiveness.
- Demand Forecasting Accuracy: Forecasting error (MAE: 137.52) indicates inconsistencies, suggesting room for improvement.
Key Takeaways:
- Targeted marketing & promotions can further boost sales.
- Inventory adjustments are needed to match demand patterns.
- Enhancing demand forecasting models can reduce stockouts and overstocking.
💡 Recommendations
📊 Sales & Revenue Growth Strategies
✅ Optimize Inventory for Best-Selling Products
- TVs, Tablets, and Fridges drive the most revenue; maintain priority stock levels.
- Use predictive analytics to prevent stockouts & excess inventory.
✅ Regional Store Strategy
- Los Angeles, Chicago, & New York lead in sales; expand marketing efforts & premium product placements in these stores.
- Tailor promotions regionally based on demand patterns.
✅ Price & Demand Optimization
- Implement dynamic pricing strategies to adjust unit prices based on real-time demand trends.
- Use customer segmentation to personalize discounts for high-value customers.
📣 Marketing & Customer Engagement
✅ Expand Promotions Based on ROI
- Sales increased by ~12% when promotions were applied; strategically increase promo campaigns during peak demand periods.
- Focus on seasonal discounts, flash sales, and loyalty-based discounts.
✅ Enhance Customer Loyalty Programs
- Platinum & Silver customers make up the largest share of the customer base; introduce exclusive membership perks, personalized discounts, and early access deals.
- Use AI-driven segmentation to target high-value customers with personalized offers.
✅ Leverage Payment Preferences for More Conversions
- Customers prefer Credit Cards & Digital Wallets; collaborate with banks & fintech companies for cashback rewards & flexible financing options.
- Promote “Buy Now, Pay Later” (BNPL) services to encourage larger purchases.
📦 Inventory & Demand Forecasting Optimization
✅ Strengthen Demand Forecasting Models
- Current forecasting shows an MAE of 137.52, indicating a gap in prediction accuracy.
- Implement machine learning models to refine demand forecasts & reduce errors (a baseline sketch follows below).
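As a starting point for that work, a simple regression baseline can be benchmarked against the current forecast. This is a minimal sketch only: it assumes scikit-learn is available, and the feature set is illustrative, not a recommended model.
# Sketch: simple regression baseline vs. the existing forecast (assumes scikit-learn; feature choice is illustrative only)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

features = df[['inventory_level', 'reorder_point', 'unit_price', 'quantity_sold']]
X_train, X_test, y_train, y_test = train_test_split(features, df['actual_demand'],
                                                    test_size=0.2, random_state=42)
baseline = LinearRegression().fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
print(f"Baseline regression MAE: {baseline_mae:.2f} (current forecast MAE: ~137.5)")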
✅ Reduce Stockouts & Overstock Situations
- Align supply chain planning with actual demand trends.
- Use real-time monitoring to adjust reorder levels dynamically.
✅ Weather-Driven Stock Adjustments
- Consider weather trends in inventory planning (e.g., more air conditioners in summer, heaters in winter).
- Automate supply chain alerts based on environmental & seasonal data.