Amazon Sales Performance Dashboard


📋 Project Overview

📌Objective:

The goal of this project is to analyze Amazon sales data to uncover key insights about sales performance, customer behavior, and operational efficiency. I used statistical methods to identify trends, relationships, and factors influencing sales outcomes.


📌Key Analyses & Methods:

🔹Descriptive Statistics (Data Summary)

  • Analyze sales distribution (total revenue, order count, average order value).
  • Identify top-selling categories and products.
  • Examine order status breakdown (shipped, canceled, returned).

🔹 Inferential Statistics (Comparisons & Hypothesis Testing)

  • T-test/ANOVA: Compare sales performance between B2B vs. B2C and fulfillment methods.
  • Chi-Square Test: Analyze relationships between order status and fulfillment methods.
  • Correlation Analysis: Identify relationships between Qty and Amount.

🔹Time Series & Trend Analysis

  • Identify monthly and seasonal sales trends.

🔹 Outlier Detection & Data Quality Checks

  • Detect unusual patterns in order amounts using boxplots & statistical methods.
  • Handle missing values and anomalies for better data reliability.

📌Expected Outcomes:

  • Actionable insights into revenue trends, high-performing categories, and sales channels.
  • Statistical evidence of key sales-driving factors (B2B vs. B2C, fulfillment method impact).

📚 Workflow

✅Data Preprocessing

  • Load & Inspect: Check data structure, types, and missing values.
  • Clean Data: Handle missing values, convert dates, and standardize categories.
  • Detect Outliers: Use boxplots, Z-score, and IQR methods.

✅Descriptive Statistics & Sales Analysis

  • Overall Sales: Total revenue, order count, average order value.
  • Category & Product Analysis: Identify top-selling items.
  • Order Status: Shipped vs. Canceled impact on revenue.
  • Fulfillment & B2B vs. B2C: Compare sales performance.

✅Statistical Analysis & Hypothesis Testing

  • T-Test/ANOVA: Compare B2B vs. B2C and category sales.
  • Chi-Square: Test relationships (e.g., Order Status vs. Fulfillment).
  • Correlation & Regression: Analyze Qty vs. Amount, build predictive models.

✅Time Series & Trend Analysis

  • Sales Trends: Monthly & seasonal revenue insights.
  • Forecasting (Optional): Predict future sales using time series models.

✅Data Visualization & Insights

  • Charts & Graphs: Revenue trends, sales distribution, fulfillment comparisons.
  • Insights & Recommendations: Summarize key findings for decision-making.

✅Dashboard

  • Report: Document key metrics & trends.
  • Power BI Dashboard: Create interactive visuals for easy insights.

📊 Summary of the Amazon Sales Dataset

  • Dataset Name: Amazon Sales Data
  • Total Rows: 128,976
  • Total Columns: 19

🔑 Key Columns & Descriptions:

  1. index – Row index (not necessary for analysis).
  2. order_id – Unique identifier for each order.
  3. date – Date of the order.
  4. status – Current status of the order (Shipped, Cancelled, etc.).
  5. fulfilment – Whether fulfilled by Amazon or Merchant.
  6. sales_channel – Platform where the sale was made.
  7. ship_service_level – Type of shipping service selected.
  8. category – Product category.
  9. size – Product size.
  10. courier_status – Delivery status from the courier.
  11. qty – Quantity of items in the order.
  12. currency – Currency used for the order.
  13. amount – Total sales amount for the order.
  14. ship_city – Destination city of the order.
  15. ship_state – Destination state of the order.
  16. ship_postal_code – Postal code of the shipping address.
  17. ship_country – Destination country of the order.
  18. b2b – Whether the order is Business-to-Business (B2B) or not.
  19. fulfilled_by – Fulfillment method used (Amazon, Merchant, etc.).
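Before the cleaning steps below, the raw export has to be loaded with sensible types. A minimal sketch of such a load: the sample rows and values here are hypothetical stand-ins (only the column ideas come from the table above), and the real file would be read from disk rather than a string.

```python
import io
import pandas as pd

# Two hypothetical rows standing in for the real 128,976-row export
csv_text = """order_id,date,status,qty,amount,ship_postal_code
405-1,04-30-22,Shipped,1,648.0,560001
405-2,05-01-22,Cancelled,0,,560037
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],                  # parse order dates on load
    dtype={"ship_postal_code": "string"},  # keep postal codes as text, not numbers
)
print(df.dtypes["date"], df.dtypes["ship_postal_code"])
```

Parsing dates and pinning postal codes to a string dtype up front avoids two of the most common cleanup chores later (object-typed dates and postal codes silently cast to floats).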

🗂️ Data Processing

Drop Columns

# Drop unnecessary columns: 'index' and 'ship_postal_code'
df.drop(columns=['index', 'ship_postal_code'], inplace=True)
# Check for duplicate rows
duplicate_count = df.duplicated().sum()

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Verify that no duplicates remain
remaining_duplicates = df.duplicated().sum()

# Display results
duplicate_summary = {
    "Duplicates Found": duplicate_count,
    "Duplicates Removed": duplicate_count,
    "Remaining Duplicates": remaining_duplicates
}
duplicate_summary

{'Duplicates Found': np.int64(168),
 'Duplicates Removed': np.int64(168),
 'Remaining Duplicates': np.int64(0)}
| Column           | Missing Values | Percentage (%) |
|------------------|----------------|----------------|
| fulfilled_by     | 89,713         | 70.0           |
| currency         | 7,800          | 6.0            |
| amount           | 7,800          | 6.0            |
| ship_postal_code | 35             | 0.0            |
| ship_country     | 35             | 0.0            |
| ship_state       | 35             | 0.0            |
| ship_city        | 35             | 0.0            |
				
					# Check for missing values in each column
missing_values = df.isnull().sum()
missing_percentage = round((missing_values / df.shape[0]) * 100)

# Create a summary DataFrame for missing values
missing_summary = pd.DataFrame({
    "Missing Values": missing_values,
    "Percentage (%)": missing_percentage
}).sort_values(by="Missing Values", ascending=False)

# Display columns with missing values
missing_summary[missing_summary["Missing Values"] > 0]
				
			
				
# Fill NaN values in 'fulfilled_by' with 'Merchant Fulfilled'
df['fulfilled_by'] = df['fulfilled_by'].fillna("Merchant Fulfilled")

# Verify the changes
df['fulfilled_by'].unique()

array(['Easy Ship', 'Merchant Fulfilled'], dtype=object)
				
# Fill missing 'amount' values with 0 for cancelled orders
df.loc[(df['amount'].isnull()) & (df['courier_status'] == 'Cancelled'), 'amount'] = 0
				
			
				
# Calculate the median of 'amount' (excluding 0 values)
median_amount = df[df['amount'] > 0]['amount'].median()

# Fill missing 'amount' with the median for orders that are still in progress
in_progress = [
    ('Shipped', 'Unshipped'),
    ('Shipped - Delivered to Buyer', 'On the Way'),
    ('Shipped - Returned to Seller', 'On the Way'),
    ('Shipping', 'Unshipped'),
]
for status_val, courier_val in in_progress:
    df.loc[(df['amount'].isnull()) & (df['status'] == status_val) & (df['courier_status'] == courier_val), 'amount'] = median_amount

# Fill missing 'amount' with 0 for cancelled orders (Unshipped or On the Way)
for courier_val in ['Unshipped', 'On the Way']:
    df.loc[(df['amount'].isnull()) & (df['status'] == 'Cancelled') & (df['courier_status'] == courier_val), 'amount'] = 0
			
				
# Fill missing 'currency' values with 'INR' (since only one country is present)
df['currency'] = df['currency'].fillna('INR')


				
			
				
# Fill missing values in ship_city, ship_state, and ship_postal_code with 'Unknown'
for col in ['ship_city', 'ship_state', 'ship_postal_code']:
    df[col] = df[col].fillna('Unknown')

# Fill missing values in ship_country with 'IN' (since only one country is present)
df['ship_country'] = df['ship_country'].fillna('IN')

				
			
				
					df.isna().sum()
				
			
				
					index                 0
order_id              0
date                  0
status                0
fulfilment            0
sales_channel         0
ship_service_level    0
category              0
size                  0
courier_status        0
qty                   0
currency              0
amount                0
ship_city             0
ship_state            0
ship_postal_code      0
ship_country          0
b2b                   0
fulfilled_by          0
dtype: int64
				
			
				
# Standardize column names: lowercase with underscores
df.columns = df.columns.str.replace(' ', '_').str.replace('-', '_').str.lower()
				
			
				
					Index(['index', 'order_id', 'date', 'status', 'fulfilment', 'sales_channel',
       'ship_service_level', 'category', 'size', 'courier_status', 'qty',
       'currency', 'amount', 'ship_city', 'ship_state', 'ship_postal_code',
       'ship_country', 'b2b', 'fulfilled_by'],
      dtype='object')
				
			
				
					# Check the current data types of all columns
df.dtypes
				
			
				
					index                   int64
order_id               object
date                   object
status                 object
fulfilment             object
sales_channel          object
ship_service_level     object
category               object
size                   object
courier_status         object
qty                     int64
currency               object
amount                float64
ship_city              object
ship_state             object
ship_postal_code       object
ship_country           object
b2b                      bool
fulfilled_by           object
dtype: object
				
			
				
					# Convert 'date' to datetime format
df['date'] = pd.to_datetime(df['date'], errors='coerce')
				
			
				
					# Get unique values count for each categorical column
categorical_columns = [
    'status', 'fulfilment', 'sales_channel', 'ship_service_level', 
    'category', 'size', 'courier_status', 'ship_country', 'fulfilled_by'
]

unique_values_summary = {col: df[col].nunique() for col in categorical_columns}
unique_values_summary
				
			
				
					{'status': 13,
 'fulfilment': 2,
 'sales_channel': 2,
 'ship_service_level': 2,
 'category': 9,
 'size': 11,
 'courier_status': 4,
 'ship_country': 1,
 'fulfilled_by': 2}
				
			
				
					df['status'].unique()
				
			
				
					array(['Cancelled', 'Shipped - Delivered to Buyer', 'Shipped',
       'Shipped - Returned to Seller', 'Shipped - Rejected by Buyer',
       'Shipped - Lost in Transit', 'Shipped - Out for Delivery',
       'Shipped - Returning to Seller', 'Shipped - Picked Up', 'Pending',
       'Pending - Waiting for Pick Up', 'Shipped - Damaged', 'Shipping'],
      dtype=object)
				
			
				
					# Define a mapping dictionary to simplify 'status' values
status_mapping = {
    "Cancelled": "Cancelled",
    "Pending": "Pending",
    "Pending - Waiting for Pick Up": "Pending",
    "Shipped": "Shipped",
    "Shipped - Delivered to Buyer": "Delivered",
    "Shipped - Out for Delivery": "Out for Delivery",
    "Shipped - Picked Up": "Picked Up",
    "Shipped - Returned to Seller": "Returned",
    "Shipped - Returning to Seller": "Returned",
    "Shipped - Rejected by Buyer": "Returned",
    "Shipped - Lost in Transit": "Lost/Damaged",
    "Shipped - Damaged": "Lost/Damaged",
    "Shipping": "Shipped"
}

# Apply the mapping
df['status'] = df['status'].map(status_mapping)

# Verify the updated unique values
df['status'].unique()
				
			
				
					array(['Cancelled', 'Delivered', 'Shipped', 'Returned', 'Lost/Damaged',
       'Out for Delivery', 'Picked Up', 'Pending'], dtype=object)
				
			
				
					# Convert categorical columns to 'category' data type
categorical_columns = [
    'status', 'fulfilment', 'sales_channel', 'ship_service_level', 
    'category', 'size', 'courier_status', 'ship_country', 'fulfilled_by'
]

for col in categorical_columns:
    df[col] = df[col].astype('category')

# Verify the changes
df.dtypes
				
			
				
				
			
				
					index                          int64
order_id                      object
date                  datetime64[ns]
status                      category
fulfilment                  category
sales_channel               category
ship_service_level          category
category                    category
size                        category
courier_status              category
qty                            int64
currency                      object
amount                       float64
ship_city                     object
ship_state                    object
ship_postal_code              object
ship_country                category
b2b                             bool
fulfilled_by                category
dtype: object
				
			
				
					import numpy as np

# Function to detect outliers using IQR
def detect_outliers_iqr(data, column):
    Q1 = np.percentile(data[column], 25)  # First quartile
    Q3 = np.percentile(data[column], 75)  # Third quartile
    IQR = Q3 - Q1  # Interquartile range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers.shape[0], lower_bound, upper_bound

# Detect outliers in 'qty' and 'amount'
qty_outliers, qty_lower, qty_upper = detect_outliers_iqr(df, 'qty')
amount_outliers, amount_lower, amount_upper = detect_outliers_iqr(df, 'amount')

# Create a summary DataFrame
outlier_summary = pd.DataFrame({
    "Column": ["qty", "amount"],
    "Outlier Count": [qty_outliers, amount_outliers],
    "Lower Bound": [qty_lower, amount_lower],
    "Upper Bound": [qty_upper, amount_upper]
})

outlier_summary

				
			
| Column | Outlier Count | Lower Bound | Upper Bound |
|--------|---------------|-------------|-------------|
| qty    | 13,179        | 1.0         | 1.0         |
| amount | 3,174         | -116.5      | 1303.5      |

For qty, Q1 = Q3 = 1, so the IQR is 0 and the bounds collapse to 1.0: every order with qty ≠ 1 is flagged, which is why only the amount outliers are carried forward below.
				
					# Flag outliers in 'amount' based on IQR bounds
amount_lower_bound = -116.5  # From IQR summary
amount_upper_bound = 1303.5

df['amount_outlier'] = ((df['amount'] < amount_lower_bound) | (df['amount'] > amount_upper_bound))

# Verify how many outliers were flagged
df['amount_outlier'].sum()
				
			
				
					np.int64(3174)
				
			
				
					from scipy import stats

# Calculate Z-scores for 'amount'
df['amount_zscore'] = stats.zscore(df['amount'], nan_policy='omit')

# Flag outliers using Z-score (threshold ±3)
df['amount_outlier_z'] = df['amount_zscore'].abs() > 3

# Count the number of Z-score flagged outliers
zscore_outliers_count = df['amount_outlier_z'].sum()

zscore_outliers_count

				
			
				
					np.int64(439)
				
			
				
					# Keep the outliers but flag them instead of removing
df['amount_outlier'] = df['amount_zscore'].abs() > 3

# Drop only temporary Z-score columns
df.drop(columns=['amount_zscore','amount_outlier_z'], inplace=True)


				
			
				
					# Final Data Validation

total_rows, total_columns = df.shape
remaining_missing_values = df.isnull().sum().sum()
outlier_count = df['amount_outlier'].sum()
duplicate_count = df.duplicated().sum()

# Create summary dataframe
final_summary = pd.DataFrame({
    "Total Rows": [total_rows],
    "Total Columns": [total_columns],
    "Remaining Missing Values": [remaining_missing_values],
    "Flagged Amount Outliers": [outlier_count],
    "duplicate_count": [duplicate_count]
})
final_summary
				
			
| Total Rows | Total Columns | Remaining Missing Values | Flagged Amount Outliers | Duplicate Count |
|------------|---------------|--------------------------|-------------------------|-----------------|
| 128,808    | 20            | 0                        | 439                     | 0               |

💾 Descriptive Statistics & Sales Analysis

Overall Sales Summary
				
					# Compute overall sales summary by status
overall_sales_summary = df.groupby('status').agg({
    'amount': ['sum', 'mean', 'median','min','max'],
    'qty': ['sum', 'mean', 'median','min','max']
})

# Compute correct "Grand Total" values separately
grand_total_values = {
    ('amount', 'sum'): df['amount'].sum(),  
    ('amount', 'mean'): df['amount'].mean(),  
    ('amount', 'median'): df['amount'].median(),  
    ('qty', 'sum'): df['qty'].sum(),  
    ('qty', 'mean'): df['qty'].mean(),  
    ('qty', 'median'): df['qty'].median()  
}

# Convert to DataFrame and append as "Grand Total" row
grand_total = pd.DataFrame([grand_total_values], index=["Grand Total"])
overall_sales_summary = pd.concat([overall_sales_summary, grand_total])

overall_sales_summary
				
			
| status           | amount sum    | amount mean | amount median | amount min | amount max | qty sum | qty mean | qty median | qty min | qty max |
|------------------|---------------|-------------|---------------|------------|------------|---------|----------|------------|---------|---------|
| Cancelled        | 6,910,831.39  | 377.41      | 380.0         | 0.0        | 4,235.72   | 5,651   | 0.31     | 0.0        | 0.0     | 2.0     |
| Delivered        | 18,624,586.0  | 648.60      | 631.0         | 0.0        | 5,495.0    | 28,831  | 1.00     | 1.0        | 0.0     | 5.0     |
| Lost/Damaged     | 3,133.0       | 522.17      | 499.0         | 0.0        | 1,136.0    | 6       | 1.00     | 1.0        | 1.0     | 1.0     |
| Out for Delivery | 26,971.0      | 770.60      | 729.0         | 301.0      | 1,399.0    | 35      | 1.00     | 1.0        | 1.0     | 1.0     |
| Pending          | 622,409.0     | 662.84      | 664.0         | 0.0        | 2,326.0    | 940     | 1.00     | 1.0        | 0.0     | 2.0     |
| Picked Up        | 661,252.0     | 679.60      | 699.0         | 0.0        | 1,998.0    | 977     | 1.00     | 1.0        | 1.0     | 2.0     |
| Returned         | 1,386,026.0   | 657.51      | 635.0         | 0.0        | 2,796.0    | 2,130   | 1.01     | 1.0        | 0.0     | 4.0     |
| Shipped          | 50,344,926.0  | 647.76      | 599.0         | 0.0        | 5,584.0    | 77,926  | 1.00     | 1.0        | 0.0     | 15.0    |
| Grand Total      | 78,580,134.39 | 610.06      | 588.0         | –          | –          | 116,496 | 0.90     | 1.0        | –       | –       |

(means rounded to two decimals)
				
					# Compute category sales summary
category_sales_summary = df.groupby('category').agg({
    'amount': ['sum', 'mean', 'median', 'min', 'max'],
    'qty': ['sum', 'mean', 'median', 'min', 'max']
}).sort_values(('amount', 'sum'), ascending=False)

category_sales_summary

				
			
| category | amount sum    | amount mean | amount median | amount min | amount max | qty sum | qty mean | qty median | qty min | qty max |
|----------|---------------|-------------|---------------|------------|------------|---------|----------|------------|---------|---------|
| T-shirt  | 39,193,318.17 | 780.45      | 759.0         | 0.0        | 5,584.0    | 45,228  | 0.90     | 1.0        | 0       | 8       |
| Shirt    | 21,304,600.74 | 427.77      | 432.0         | 0.0        | 2,796.0    | 44,978  | 0.90     | 1.0        | 0       | 13      |
| Blazzer  | 11,211,616.12 | 723.84      | 744.0         | 0.0        | 2,860.0    | 13,934  | 0.90     | 1.0        | 0       | 4       |
| Trousers | 5,344,415.30  | 503.86      | 518.0         | 0.0        | 1,797.0    | 9,889   | 0.93     | 1.0        | 0       | 3       |
| Perfume  | 789,419.66    | 682.30      | 797.14        | 0.0        | 1,449.0    | 1,051   | 0.91     | 1.0        | 0       | 2       |
| Wallet   | 460,896.18    | 497.73      | 518.52        | 0.0        | 1,266.66   | 863     | 0.93     | 1.0        | 0       | 15      |
| Socks    | 151,019.50    | 344.01      | 342.86        | 0.0        | 1,028.58   | 398     | 0.91     | 1.0        | 0       | 2       |
| Shoes    | 123,933.76    | 755.69      | 790.0         | 0.0        | 2,058.0    | 152     | 0.93     | 1.0        | 0       | 3       |
| Watch    | 915.00        | 305.00      | 305.0         | 305.0      | 305.0      | 3       | 1.00     | 1.0        | 1       | 1       |

(means rounded to two decimals)
				
# Compute sales summary by size
size_sales_summary = df.groupby('size').agg({
    'amount': ['sum', 'mean', 'median', 'min', 'max'],
    'qty': ['sum', 'mean', 'median', 'min', 'max']
}).sort_values(('amount', 'sum'), ascending=False)

size_sales_summary


				
			
| size | amount sum    | amount mean | amount median | amount min | amount max | qty sum | qty mean | qty median | qty min | qty max |
|------|---------------|-------------|---------------|------------|------------|---------|----------|------------|---------|---------|
| M    | 13,691,357.13 | 612.59      | 597.0         | 0.0        | 4,235.72   | 20,116  | 0.90     | 1.0        | 0       | 4       |
| L    | 13,038,244.84 | 598.72      | 573.0         | 0.0        | 2,598.0    | 19,680  | 0.90     | 1.0        | 0       | 9       |
| XL   | 12,255,023.38 | 597.22      | 563.0         | 0.0        | 2,698.0    | 18,609  | 0.91     | 1.0        | 0       | 5       |
| XXL  | 10,461,329.64 | 588.28      | 545.0         | 0.0        | 2,676.0    | 16,226  | 0.91     | 1.0        | 0       | 13      |
| S    | 10,426,761.18 | 622.34      | 599.0         | 0.0        | 2,864.0    | 15,022  | 0.90     | 1.0        | 0       | 4       |
| 3XL  | 9,034,190.92  | 617.68      | 581.0         | 0.0        | 5,495.0    | 13,346  | 0.91     | 1.0        | 0       | 5       |
| XS   | 6,960,339.10  | 630.87      | 622.86        | 0.0        | 5,584.0    | 9,829   | 0.89     | 1.0        | 0       | 8       |
| Free | 1,375,164.60  | 611.18      | 625.0         | 0.0        | 2,058.0    | 2,069   | 0.92     | 1.0        | 0       | 15      |
| 6XL  | 577,493.33    | 782.51      | 798.0         | 0.0        | 1,672.0    | 688     | 0.93     | 1.0        | 0       | 2       |
| 5XL  | 425,778.63    | 774.14      | 798.0         | 0.0        | 1,629.0    | 513     | 0.93     | 1.0        | 0       | 2       |
| 4XL  | 334,451.64    | 783.26      | 798.0         | 0.0        | 1,672.0    | 398     | 0.93     | 1.0        | 0       | 2       |

(means rounded to two decimals)
				
# Compute sales summary by fulfillment method
fulfillment_summary = df.groupby('fulfilled_by').agg({
    'amount': ['sum', 'mean', 'median', 'min', 'max'],
    'qty': ['sum', 'mean', 'median', 'min', 'max']
}).sort_values(('amount', 'sum'), ascending=False)

# Compute sales summary by B2B vs. B2C
b2b_summary = df.groupby('b2b').agg({
    'amount': ['sum', 'mean', 'median', 'min', 'max'],
    'qty': ['sum', 'mean', 'median', 'min', 'max']
}).sort_values(('amount', 'sum'), ascending=False)

# Display both summaries (a bare expression only shows the last one in a notebook)
display(b2b_summary)
display(fulfillment_summary)
				
			
| b2b          | amount sum    | amount mean | amount median | amount min | amount max | qty sum | qty mean | qty median | qty min | qty max |
|--------------|---------------|-------------|---------------|------------|------------|---------|----------|------------|---------|---------|
| False (B2C)  | 77,988,913.60 | 609.59      | 587.0         | 0.0        | 5,584.0    | 115,656 | 0.90     | 1.0        | 0       | 15      |
| True (B2B)   | 591,220.79    | 678.78      | 653.0         | 0.0        | 5,495.0    | 840     | 0.96     | 1.0        | 0       | 5       |

| fulfilled_by       | amount sum    | amount mean | amount median | amount min | amount max | qty sum | qty mean | qty median | qty min | qty max |
|--------------------|---------------|-------------|---------------|------------|------------|---------|----------|------------|---------|---------|
| Merchant Fulfilled | 54,338,671.00 | 606.49      | 574.0         | 0.0        | 5,584.0    | 83,990  | 0.94     | 1.0        | 0       | 15      |
| Easy Ship          | 24,241,463.39 | 618.20      | 602.86        | 0.0        | 5,495.0    | 32,506  | 0.83     | 1.0        | 0       | 5       |

(means rounded to two decimals)

♟Statistical Analysis & Hypothesis Testing

T-Test/ANOVA for B2B vs. B2C and Category Sales Comparisons
				
					from scipy.stats import ttest_ind, f_oneway

# T-Test for B2B vs. B2C Sales Comparison
b2b_sales = df[df['b2b'] == True]['amount']
b2c_sales = df[df['b2b'] == False]['amount']

# Perform independent T-test
t_stat, p_value_b2b = ttest_ind(b2b_sales, b2c_sales, equal_var=False, nan_policy='omit')

# ANOVA for Category Sales Comparison
category_groups = [group['amount'].dropna() for name, group in df.groupby('category')]

# Perform ANOVA test
anova_stat, p_value_category = f_oneway(*category_groups)

# Create a summary DataFrame for hypothesis testing results
hypothesis_results = pd.DataFrame({
    "Test": ["T-Test (B2B vs. B2C)", "ANOVA (Category Sales)"],
    "Test Statistic": [t_stat, anova_stat],
    "P-Value": [p_value_b2b, p_value_category]
})

hypothesis_results
				
			
| Test                   | Test Statistic | P-Value  |
|------------------------|----------------|----------|
| T-Test (B2B vs. B2C)   | 5.78           | 1.02e-08 |
| ANOVA (Category Sales) | 6157.19        | 0.0      |

1️⃣ T-Test (B2B vs. B2C)

  • Test Statistic (5.78):

    • A large t-value indicates a large difference between B2B and B2C mean order amounts relative to the variability within each group.
    • This shows that the two groups have significantly different mean sales.
  • P-Value (1.02e-08, i.e., 0.0000000102):

    • The probability of observing a difference this large by random chance alone is essentially zero.
    • Since p < 0.05, we reject the null hypothesis: B2B and B2C sales amounts are significantly different.

Conclusion:

  • B2B orders behave differently from B2C orders in terms of order amount.

2️⃣ ANOVA (Category Sales)

  • Test Statistic (6157.19):

    • A very high F-statistic means that there is a strong difference in sales across different categories.
  • P-Value (≈ 0.000):

    • The probability of observing differences this large if all categories had the same mean sales is essentially zero.
    • Since p < 0.05, we reject the null hypothesis: some product categories sell significantly better than others.

Conclusion:

  • Product categories show significant differences in sales performance. Some categories are selling far more or far less than others.
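A significant p-value alone does not say how large the B2B/B2C gap actually is. A standardized effect size such as Cohen's d expresses the mean difference in units of pooled standard deviation. The sketch below uses synthetic samples: `b2b_sample` and `b2c_sample` are hypothetical stand-ins for the `df['amount']` subsets, not the real data.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples (pooled standard deviation)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Illustrative samples standing in for the B2B and B2C 'amount' subsets
rng = np.random.default_rng(42)
b2b_sample = rng.normal(679, 250, 800)    # hypothetical B2B order amounts
b2c_sample = rng.normal(610, 280, 5000)   # hypothetical B2C order amounts

print(round(cohens_d(b2b_sample, b2c_sample), 3))
```

With very large samples, as here, even a small d can produce a tiny p-value, so reporting the effect size alongside the t-test guards against over-interpreting "significant".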
				
					import scipy.stats as stats

# Create a contingency table
contingency_table = pd.crosstab(df['status'], df['fulfilled_by'])

# Perform Chi-Square test
chi2_stat, p_value_chi2, dof, expected = stats.chi2_contingency(contingency_table)

# Create a summary DataFrame for the Chi-Square test results
chi_square_results = pd.DataFrame({
    "Test": ["Chi-Square (Order Status vs. Fulfillment)"],
    "Chi-Square Statistic": [chi2_stat],
    "P-Value": [p_value_chi2],
    "Degrees of Freedom": [dof]
})
chi_square_results
				
			
| Test                                      | Chi-Square Statistic | P-Value | Degrees of Freedom |
|-------------------------------------------|----------------------|---------|--------------------|
| Chi-Square (Order Status vs. Fulfillment) | 107,464.39           | 0.0     | 7                  |
  1. Chi-Square Statistic (107,464.39)

    • A very high value indicates a strong association between order status and fulfillment method (Easy Ship vs. Merchant Fulfilled).
    • This means that shipped, cancelled, and returned orders are not distributed evenly across fulfillment types.
  2. P-Value (≈ 0.000)

    • The probability of this result occurring by random chance is essentially zero.
    • Since p < 0.05, we reject the null hypothesis: order status is significantly associated with the fulfillment method.
  3. Degrees of Freedom (7)

    • Computed as (rows − 1) × (columns − 1) = (8 status levels − 1) × (2 fulfillment levels − 1) = 7.

Conclusion:

The fulfillment method is significantly associated with order status: Easy Ship and Merchant-fulfilled orders show different cancellation, return, and delivery rates.
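With a table this large, almost any dependence produces a huge chi-square statistic. Cramér's V rescales it to a 0–1 effect size, which makes the strength of the association comparable across tables. A sketch under the assumption of a hypothetical status-by-fulfilment count table (`toy_table` is illustrative, not the real crosstab):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V: effect size for a chi-square test of independence (0 = none, 1 = perfect)."""
    table = np.asarray(table)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

# Hypothetical counts standing in for pd.crosstab(df['status'], df['fulfilled_by'])
toy_table = [[120, 30],   # e.g. Cancelled
             [400, 700],  # e.g. Delivered
             [900, 350]]  # e.g. Shipped
print(round(cramers_v(toy_table), 3))
```

On the real crosstab, a V near 0.1 would suggest a weak (if highly significant) association, while a V above 0.5 would back up the "strong relationship" reading.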

				
					# Correlation & Regression Analysis (Qty vs. Amount)
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Calculate Pearson correlation coefficient
correlation_coefficient = df['qty'].corr(df['amount'])

# Scatter plot with regression line
plt.figure(figsize=(8, 5))
sns.regplot(x=df['qty'], y=df['amount'], scatter_kws={"alpha":0.3}, line_kws={"color":"red"})
plt.xlabel("Quantity (Qty)")
plt.ylabel("Amount (INR)")
plt.title(f"Correlation between Quantity and Amount (r = {correlation_coefficient:.2f})")
plt.show()

# Perform linear regression (OLS - Ordinary Least Squares)
X = sm.add_constant(df['qty'])  # Add constant for intercept
y = df['amount']
model = sm.OLS(y, X).fit()

# Display regression summary
regression_summary = model.summary()
regression_summary

				
			

✅Correlation Coefficient (r = 0.39)

  • Shows a moderate positive correlation between quantity (qty) and amount.
  • Higher quantities tend to be associated with higher sales, but the relationship is not perfectly linear.

Regression Line (Red Line in Plot)

  • The upward slope confirms a positive relationship between quantity and amount.
  • Some outliers exist where low quantities still result in high sales amounts.
				
					OLS Regression Results                            
==============================================================================
Dep. Variable:                 amount   R-squared:                       0.155
Model:                            OLS   Adj. R-squared:                  0.155
Method:                 Least Squares   F-statistic:                 2.369e+04
Date:                Fri, 14 Mar 2025   Prob (F-statistic):               0.00
Time:                        17:08:52   Log-Likelihood:            -9.1196e+05
No. Observations:              128808   AIC:                         1.824e+06
Df Residuals:                  128806   BIC:                         1.824e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        254.2724      2.446    103.941      0.000     249.478     259.067
qty          393.3853      2.556    153.920      0.000     388.376     398.395
==============================================================================
Omnibus:                    17866.214   Durbin-Watson:                   1.863
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            58717.680
Skew:                           0.709   Prob(JB):                         0.00
Kurtosis:                       5.988   Cond. No.                         5.95
==============================================================================

				
			

1️⃣ Model Summary

  • Dep. Variable: amount → This is the target variable (sales amount) we are predicting.
  • Model Type: OLS (Ordinary Least Squares) → The type of regression used.
  • Method: Least Squares → The method used to estimate coefficients.
  • No. Observations: 128,808 → The total number of data points (rows) in the dataset.
  • Df Residuals: 128,806 → Degrees of freedom left after estimating model parameters.
  • Df Model: 1 → The number of predictor variables in the model (qty).

2️⃣ Model Fit Statistics (How Well the Model Explains Variance)

  • R-squared: 0.155 → This means that 15.5% of the variation in amount is explained by qty. A low R² suggests other factors influence sales.
  • Adjusted R-squared: 0.155 → Similar to R² but adjusted for additional predictors (remains the same since we have only one predictor).
  • F-statistic: 23,690 → A high F-value indicates that the model is statistically significant.
  • Prob (F-statistic): 0.000 → Since p < 0.05, the model is statistically significant.

Interpretation:

  • The model has low explanatory power (R² = 15.5%), meaning qty does not fully predict amount—other factors are influencing sales.

3️⃣ Regression Coefficients (How qty Affects amount)

  • Intercept (const): 254.27 → If qty = 0, the predicted sales amount is ₹254.27 (base value).
  • Quantity (qty): 393.39 → Each additional unit sold increases amount by ₹393.39 on average.

Statistical Significance of Predictors:

  • Standard Error (std err):
    • Measures how much the estimated coefficient might vary. Smaller values are better.
  • t-statistic (t):
    • Measures how strongly a variable influences the dependent variable. Higher absolute values mean stronger relationships.
  • p-value (P>|t|):
    • Measures statistical significance. If p < 0.05, the variable significantly affects amount.
    • Both const and qty are statistically significant (p = 0.000).
  • 95% Confidence Interval:
    • The range where the true coefficient likely falls.
    • Example: the qty coefficient is likely between ₹388.38 and ₹398.40 per unit increase.

Interpretation:

  • qty has a significant positive impact on amount: each unit sold increases predicted sales by about ₹393.39.
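Reading the coefficients as a prediction rule gives a quick sanity check. A toy helper using the rounded coefficients from the summary above (the helper name is illustrative, not part of the analysis code):

```python
# Point prediction from the fitted line: amount ≈ 254.27 + 393.39 × qty
intercept, slope = 254.2724, 393.3853  # coefficients from the OLS summary above

def predict_amount(qty):
    return intercept + slope * qty

print(round(predict_amount(2), 2))  # expected amount (INR) for a 2-unit order → 1041.04
```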

4️⃣ Residual Diagnostics (Checking Model Assumptions)

  • Omnibus: 17,866.214 → Tests if residuals are normally distributed. High values suggest they are not normal.
  • Prob (Omnibus): 0.000 → Residuals are not normally distributed (p < 0.05).
  • Durbin-Watson: 1.863 → Measures autocorrelation (values between 1.5 – 2.5 are acceptable).
  • Jarque-Bera (JB): 58,717.68 → Another normality test—high values confirm non-normal residuals.
  • Prob (JB): 0.000 → Residuals fail the normality test (p < 0.05).
  • Skew: 0.709 → Residuals are slightly right-skewed.
  • Kurtosis: 5.988 → Residuals have heavy tails (outliers present).
  • Condition Number (Cond. No.): 5.95 → Measures multicollinearity (values > 30 are problematic; this model is fine).

Interpretation:

  • Residuals are not normally distributed → This suggests the model may need transformation.
  • Some skewness and outliers exist → The model could be improved.
  • No severe multicollinearity detected, so predictor variables are independent.

🚀 Key Takeaways:

qty significantly predicts amount (p < 0.05): each additional unit sold increases predicted revenue by about ₹393.39.
The model explains only 15.5% of the variation in sales, meaning other factors also influence amount.
Residuals are not normally distributed, which suggests a transformation (e.g., taking the log of amount) could improve the model.

				

🕒Time Series & Trend Analysis

Time Series & Trend Analysis (Sales Trends & Seasonality)
				
					# Aggregate sales by month to analyze trends
monthly_sales = df.resample('M', on='date').agg({'amount': 'sum'}).reset_index()

# Plot the monthly sales trend
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['date'], monthly_sales['amount'], marker='o', linestyle='-', color='b')
plt.xlabel("Month")
plt.ylabel("Total Sales Amount (INR)")
plt.title("Monthly Sales Trend Over Time")
plt.grid(True)
plt.show()
				
			

✅ Sales Growth and Decline:

  • The chart shows a sharp increase in sales in April, peaking in May.
  • After the peak, sales gradually decline from June to July.

✅ Possible Reasons for Trends:

  • Sales Peak in May: Likely due to seasonal demand, promotions, or new product launches.
  • Decline After May: Could be due to market saturation, reduced marketing, or seasonal effects.
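The workflow lists forecasting as optional. Even a naive projection clarifies the idea: average the last few observed months and roll that value forward. The monthly figures below are illustrative placeholders, not the dataset's actual totals.

```python
import pandas as pd

# Hypothetical monthly revenue totals (in millions) standing in for monthly_sales
monthly = pd.Series(
    [18.2, 26.4, 29.3, 23.1, 21.7, 19.8],
    index=pd.period_range("2022-03", periods=6, freq="M"),
)

# Naive forecast: project the mean of the last three observed months forward
forecast_next = monthly.tail(3).mean()
print(round(forecast_next, 2))  # → 21.53
```

A seasonal model (e.g., statsmodels' ExponentialSmoothing) would be the natural next step once more than a few months of history are available.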

🎨Data Visualization & Insights

				
import seaborn as sns

# Total sales per category (computed once, reused for the plot)
category_totals = df.groupby('category')['amount'].sum()

# Create a bar chart for sales by category
plt.figure(figsize=(12, 6))
sns.barplot(x=category_totals.index, y=category_totals.values, palette="viridis")

plt.xlabel("Product Category")
plt.ylabel("Total Sales Amount (INR)")
plt.title("Total Sales by Product Category")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
				
			

🔎 Key Insights:

1️⃣ T-Shirts Dominate Sales

  • Highest-selling category with total sales nearing 40 million.
  • Suggests high demand or strong marketing efforts for T-shirts.

2️⃣ Shirts and Blazers Perform Well

  • Shirts (~21 million) and Blazers (~11 million) contribute significantly to revenue.
  • Indicates consistent demand for formal/casual wear.

3️⃣ Low-Selling Categories (Shoes, Socks, Wallets, Watches, Perfume)

  • These categories have very low sales volume.
  • Possible reasons: Lower demand, pricing issues, or lack of promotion.

4️⃣ Trousers Show Moderate Sales

  • Generates noticeable revenue (~6 million), but not as strong as Shirts or T-Shirts.

🚀 Actionable Recommendations:

Focus on T-Shirts & Shirts – Continue marketing efforts and optimize inventory.
Analyze Low-Selling Items – Investigate customer preferences and revamp product strategy.
Seasonal Campaigns – Assess if certain categories perform better in specific seasons.

				
# Total sales per order status (computed once, reused for the plot)
status_totals = df.groupby('status')['amount'].sum()

# Create a bar chart for order status impact on sales
plt.figure(figsize=(12, 6))
sns.barplot(x=status_totals.index, y=status_totals.values, palette="coolwarm")

plt.xlabel("Order Status")
plt.ylabel("Total Sales Amount (INR)")
plt.title("Impact of Order Status on Sales")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
				
			

🔎 Key Insights:

1️⃣ Shipped Orders Dominate Sales

  • The majority of sales (~50 million) come from Shipped orders, indicating high successful fulfillment rates.

2️⃣ Delivered Sales Are Significant

  • Delivered orders contribute ~19 million, confirming a strong conversion rate from shipping to successful delivery.

3️⃣ Cancelled Orders Have Noticeable Revenue Loss

  • Cancelled orders account for a significant portion (~7 million) in lost sales.
  • Indicates customer dropouts or fulfillment issues affecting revenue.

4️⃣ Returned Orders Impact Sales

  • Returned orders show a small but noticeable loss, suggesting product quality, incorrect shipments, or customer dissatisfaction.

5️⃣ Pending, Lost/Damaged, and Picked-Up Orders Have Minimal Impact

  • These statuses represent a very small portion of total sales, meaning they are not major revenue blockers.

🚀 Actionable Recommendations:

Investigate Cancelled Orders → Identify why cancellations occur and reduce lost sales.
Improve Delivery Success Rate → Ensure shipped orders convert to delivered orders more efficiently.
Analyze Returns → Assess why customers return products and optimize policies.
Enhance Order Tracking → Ensure smooth fulfillment to reduce delays, losses, and damages.

📈 Dashboard