Amazon Sales Performance Dashboard
📋 Project Overview
Objective:
The goal of this project is to analyze Amazon sales data to uncover key insights about sales performance, customer behavior, and operational efficiency. I used statistical methods to identify trends, relationships, and factors influencing sales outcomes.
Key Analyses & Methods:
Descriptive Statistics (Data Summary)
- Analyze sales distribution (total revenue, order count, average order value).
- Identify top-selling categories and products.
- Examine order status breakdown (shipped, canceled, returned).
Inferential Statistics (Comparisons & Hypothesis Testing)
- T-test/ANOVA: Compare sales performance between B2B vs. B2C and fulfillment methods.
- Chi-Square Test: Analyze relationships between order status and fulfillment methods.
- Correlation Analysis: Identify relationships between Qty and Amount.
Time Series & Trend Analysis
- Identify monthly and seasonal sales trends.
Outlier Detection & Data Quality Checks
- Detect unusual patterns in order amounts using boxplots & statistical methods.
- Handle missing values and anomalies for better data reliability.
Expected Outcomes:
- Actionable insights into revenue trends, high-performing categories, and sales channels.
- Statistical evidence of key sales-driving factors (B2B vs. B2C, fulfillment method impact).
📚 Workflow
✅Data Preprocessing
- Load & Inspect: Check data structure, types, and missing values.
- Clean Data: Handle missing values, convert dates, and standardize categories.
- Detect Outliers: Use boxplots, Z-score, and IQR methods.
✅Descriptive Statistics & Sales Analysis
- Overall Sales: Total revenue, order count, average order value.
- Category & Product Analysis: Identify top-selling items.
- Order Status: Shipped vs. Canceled impact on revenue.
- Fulfillment & B2B vs. B2C: Compare sales performance.
✅Statistical Analysis & Hypothesis Testing
- T-Test/ANOVA: Compare B2B vs. B2C and category sales.
- Chi-Square: Test relationships (e.g., Order Status vs. Fulfillment).
- Correlation & Regression: Analyze Qty vs. Amount, and build predictive models.
✅Time Series & Trend Analysis
- Sales Trends: Monthly & seasonal revenue insights.
- Forecasting (Optional): Predict future sales using time series models.
✅Data Visualization & Insights
- Charts & Graphs: Revenue trends, sales distribution, fulfillment comparisons.
- Insights & Recommendations: Summarize key findings for decision-making.
✅Dashboard
- Report: Document key metrics & trends.
- Power BI Dashboard: Create interactive visuals for easy insights.
📊 Summary of the Amazon Sales Dataset
- Dataset Name: Amazon Sales Data
- Total Rows: 128,976
- Total Columns: 19
🔑 Key Columns & Descriptions:
- index – Row index (not needed for analysis).
- order_id – Unique identifier for each order.
- date – Date of the order.
- status – Current status of the order (Shipped, Cancelled, etc.).
- fulfilment – Whether fulfilled by Amazon or Merchant.
- sales_channel – Platform where the sale was made.
- ship_service_level – Type of shipping service selected.
- category – Product category.
- size – Product size.
- courier_status – Delivery status from the courier.
- qty – Quantity of items in the order.
- currency – Currency used for the order.
- amount – Total sales amount for the order.
- ship_city – Destination city of the order.
- ship_state – Destination state of the order.
- ship_postal_code – Postal code of the shipping address.
- ship_country – Destination country of the order.
- b2b – Whether the order is Business-to-Business (B2B) or not.
- fulfilled_by – Fulfillment service used (Easy Ship, Merchant Fulfilled).
🗂️ Data Processing
Drop Columns
# Drop unnecessary columns: 'index' and 'ship_postal_code'
df.drop(columns=['index', 'ship_postal_code'], inplace=True)
Remove Duplicates
# Check for duplicate rows
duplicate_count = df.duplicated().sum()
# Remove duplicate rows
df.drop_duplicates(inplace=True)
# Verify if duplicates remain
remaining_duplicates = df.duplicated().sum()
# Display results
duplicate_summary = {
"Duplicates Found": duplicate_count,
"Duplicates Removed": duplicate_count,
"Remaining Duplicates": remaining_duplicates
}
duplicate_summary
{'Duplicates Found': np.int64(168),
'Duplicates Removed': np.int64(168),
'Remaining Duplicates': np.int64(0)}
Handling Missing Values
| Column | Missing Values | Percentage (%) |
|---|---|---|
| fulfilled_by | 89713 | 70.0 |
| currency | 7800 | 6.0 |
| amount | 7800 | 6.0 |
| ship_postal_code | 35 | 0.0 |
| ship_country | 35 | 0.0 |
| ship_state | 35 | 0.0 |
| ship_city | 35 | 0.0 |
# Check for missing values in each column
missing_values = df.isnull().sum()
missing_percentage = round((missing_values / df.shape[0]) * 100)
# Create a summary DataFrame for missing values
missing_summary = pd.DataFrame({
"Missing Values": missing_values,
"Percentage (%)": missing_percentage
}).sort_values(by="Missing Values", ascending=False)
# Display columns with missing values
missing_summary[missing_summary["Missing Values"] > 0]
# Fill NaN values in 'fulfilled_by' with 'Merchant Fulfilled'
df['fulfilled_by'] = df['fulfilled_by'].fillna('Merchant Fulfilled')
# Verify the changes
df['fulfilled_by'].unique()
array(['Easy Ship', 'Merchant Fulfilled'], dtype=object)
# Fill missing 'Amount' values with 0 for cancelled orders
df.loc[(df['amount'].isnull()) & (df['courier_status'] == 'Cancelled'), 'amount'] = 0
# Calculate the median of 'amount' (excluding 0 values)
median_amount = df[df['amount'] > 0]['amount'].median()
# Fill missing 'amount' for shipped & unshipped orders with the median
df.loc[(df['amount'].isnull()) & (df['status'] == 'Shipped') & (df['courier_status'] == 'Unshipped'), 'amount'] = median_amount
# Fill missing 'amount' for Shipped - Delivered to Buyer & on the way orders with the median
df.loc[(df['amount'].isnull()) & (df['status'] == 'Shipped - Delivered to Buyer') & (df['courier_status'] == 'On the Way'), 'amount'] = median_amount
# Fill missing 'amount' for Shipped - Returned to Seller & On the Way orders with the median
df.loc[(df['amount'].isnull()) & (df['status'] == 'Shipped - Returned to Seller') & (df['courier_status'] == 'On the Way'), 'amount'] = median_amount
# Fill missing 'amount' for Shipping & Unshipped orders with the median
df.loc[(df['amount'].isnull()) & (df['status'] == 'Shipping') & (df['courier_status'] == 'Unshipped'), 'amount'] = median_amount
# Fill missing 'amount' for Cancelled & Unshipped with the 0
df.loc[(df['amount'].isnull()) & (df['status'] == 'Cancelled') & (df['courier_status'] == 'Unshipped'), 'amount'] = 0
# Fill missing 'amount' for cancelled & on the way orders with 0
df.loc[(df['amount'].isnull()) & (df['status'] == 'Cancelled') & (df['courier_status'] == 'On the Way'), 'amount'] = 0
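For readability, the repeated fills above can be collapsed into two rules; a minimal sketch, assuming the intent generalizes to "cancelled orders get 0, every other still-missing amount gets the nonzero median":
# Equivalent two-rule version (assumption: the rules above generalize)
cancelled = df['status'].eq('Cancelled') | df['courier_status'].eq('Cancelled')
df.loc[df['amount'].isnull() & cancelled, 'amount'] = 0
df.loc[df['amount'].isnull(), 'amount'] = median_amount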
# Fill missing 'currency' values with 'INR' (since only one country is present)
df['currency'] = df['currency'].fillna('INR')
# Fill missing values in ship_city, ship_state, and ship_postal_code with 'Unknown'
df['ship_city'] = df['ship_city'].fillna('Unknown')
df['ship_state'] = df['ship_state'].fillna('Unknown')
df['ship_postal_code'] = df['ship_postal_code'].fillna('Unknown')
# Fill missing values in ship_country with 'IN' (since only one country is present)
df['ship_country'] = df['ship_country'].fillna('IN')
df.isna().sum()
index 0
order_id 0
date 0
status 0
fulfilment 0
sales_channel 0
ship_service_level 0
category 0
size 0
courier_status 0
qty 0
currency 0
amount 0
ship_city 0
ship_state 0
ship_postal_code 0
ship_country 0
b2b 0
fulfilled_by 0
dtype: int64
Standardizing Column Names
# Standardizing column names with lowercase and underscores
df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace('-', '_')
df.columns = df.columns.str.lower()
Index(['index', 'order_id', 'date', 'status', 'fulfilment', 'sales_channel',
'ship_service_level', 'category', 'size', 'courier_status', 'qty',
'currency', 'amount', 'ship_city', 'ship_state', 'ship_postal_code',
'ship_country', 'b2b', 'fulfilled_by'],
dtype='object')
Data Type Conversion
# Check the current data types of all columns
df.dtypes
index int64
order_id object
date object
status object
fulfilment object
sales_channel object
ship_service_level object
category object
size object
courier_status object
qty int64
currency object
amount float64
ship_city object
ship_state object
ship_postal_code object
ship_country object
b2b bool
fulfilled_by object
dtype: object
# Convert 'date' to datetime format
df['date'] = pd.to_datetime(df['date'], errors='coerce')
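Since errors='coerce' silently turns unparseable dates into NaT, a quick sanity check is worth running here (a small addition, not from the original notebook):
# Count rows where the date failed to parse (coerced to NaT)
df['date'].isna().sum()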
# Get unique values count for each categorical column
categorical_columns = [
'status', 'fulfilment', 'sales_channel', 'ship_service_level',
'category', 'size', 'courier_status', 'ship_country', 'fulfilled_by'
]
unique_values_summary = {col: df[col].nunique() for col in categorical_columns}
unique_values_summary
{'status': 13,
'fulfilment': 2,
'sales_channel': 2,
'ship_service_level': 2,
'category': 9,
'size': 11,
'courier_status': 4,
'ship_country': 1,
'fulfilled_by': 2}
df['status'].unique()
array(['Cancelled', 'Shipped - Delivered to Buyer', 'Shipped',
'Shipped - Returned to Seller', 'Shipped - Rejected by Buyer',
'Shipped - Lost in Transit', 'Shipped - Out for Delivery',
'Shipped - Returning to Seller', 'Shipped - Picked Up', 'Pending',
'Pending - Waiting for Pick Up', 'Shipped - Damaged', 'Shipping'],
dtype=object)
# Define a mapping dictionary to simplify 'status' values
status_mapping = {
"Cancelled": "Cancelled",
"Pending": "Pending",
"Pending - Waiting for Pick Up": "Pending",
"Shipped": "Shipped",
"Shipped - Delivered to Buyer": "Delivered",
"Shipped - Out for Delivery": "Out for Delivery",
"Shipped - Picked Up": "Picked Up",
"Shipped - Returned to Seller": "Returned",
"Shipped - Returning to Seller": "Returned",
"Shipped - Rejected by Buyer": "Returned",
"Shipped - Lost in Transit": "Lost/Damaged",
"Shipped - Damaged": "Lost/Damaged",
"Shipping": "Shipped"
}
# Apply the mapping
df['status'] = df['status'].map(status_mapping)
# Verify the updated unique values
df['status'].unique()
array(['Cancelled', 'Delivered', 'Shipped', 'Returned', 'Lost/Damaged',
'Out for Delivery', 'Picked Up', 'Pending'], dtype=object)
# Convert categorical columns to 'category' data type
categorical_columns = [
'status', 'fulfilment', 'sales_channel', 'ship_service_level',
'category', 'size', 'courier_status', 'ship_country', 'fulfilled_by'
]
for col in categorical_columns:
    df[col] = df[col].astype('category')
# Verify the changes
df.dtypes
index int64
order_id object
date datetime64[ns]
status category
fulfilment category
sales_channel category
ship_service_level category
category category
size category
courier_status category
qty int64
currency object
amount float64
ship_city object
ship_state object
ship_postal_code object
ship_country category
b2b bool
fulfilled_by category
dtype: object
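The category conversion mainly pays off in memory; a quick way to verify the effect (an optional check, not part of the original notebook):
# Run this before and after the conversion to compare memory footprints
print(f"Total memory usage: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")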
Outlier Detection & Handling
import numpy as np
# Function to detect outliers using IQR
def detect_outliers_iqr(data, column):
    Q1 = np.percentile(data[column], 25)  # First quartile
    Q3 = np.percentile(data[column], 75)  # Third quartile
    IQR = Q3 - Q1  # Interquartile range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers.shape[0], lower_bound, upper_bound
# Detect outliers in 'qty' and 'amount'
qty_outliers, qty_lower, qty_upper = detect_outliers_iqr(df, 'qty')
amount_outliers, amount_lower, amount_upper = detect_outliers_iqr(df, 'amount')
# Create a summary DataFrame
outlier_summary = pd.DataFrame({
"Column": ["qty", "amount"],
"Outlier Count": [qty_outliers, amount_outliers],
"Lower Bound": [qty_lower, amount_lower],
"Upper Bound": [qty_upper, amount_upper]
})
outlier_summary
| Column | Outlier Count | Lower Bound | Upper Bound |
|---|---|---|---|
| qty | 13179 | 1.0 | 1.0 |
| amount | 3174 | -116.5 | 1303.5 |
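The workflow also calls for boxplots; a minimal sketch for visualizing the amount distribution (points beyond the whiskers correspond to the IQR outliers above):
import matplotlib.pyplot as plt
# Horizontal boxplot of order amounts
plt.figure(figsize=(8, 3))
plt.boxplot(df['amount'].dropna(), vert=False)
plt.xlabel("Amount (INR)")
plt.title("Boxplot of Order Amounts")
plt.show()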
# Flag outliers in 'amount' using the IQR bounds computed above
df['amount_outlier'] = (df['amount'] < amount_lower) | (df['amount'] > amount_upper)
# Verify how many outliers were flagged
df['amount_outlier'].sum()
np.int64(3174)
from scipy import stats
# Calculate Z-scores for 'amount'
df['amount_zscore'] = stats.zscore(df['amount'], nan_policy='omit')
# Flag outliers using Z-score (threshold ±3)
df['amount_outlier_z'] = df['amount_zscore'].abs() > 3
# Count the number of Z-score flagged outliers
zscore_outliers_count = df['amount_outlier_z'].sum()
zscore_outliers_count
np.int64(439)
# Keep the outliers but flag them instead of removing them;
# the stricter Z-score flag (439 rows) replaces the broader IQR flag (3,174 rows)
df['amount_outlier'] = df['amount_zscore'].abs() > 3
# Drop only temporary Z-score columns
df.drop(columns=['amount_zscore','amount_outlier_z'], inplace=True)
Validation
# Final Data Validation
total_rows, total_columns = df.shape
remaining_missing_values = df.isnull().sum().sum()
outlier_count = df['amount_outlier'].sum()
duplicate_count = df.duplicated().sum()
# Create summary dataframe
final_summary = pd.DataFrame({
"Total Rows": [total_rows],
"Total Columns": [total_columns],
"Remaining Missing Values": [remaining_missing_values],
"Flagged Amount Outliers": [outlier_count],
"duplicate_count": [duplicate_count]
})
final_summary
| Total Rows | Total Columns | Remaining Missing Values | Flagged Amount Outliers | Duplicate Count |
|---|---|---|---|---|
| 128808 | 20 | 0 | 439 | 0 |
💾 Descriptive Statistics & Sales Analysis
Overall Sales Summary
# Compute overall sales summary by status
overall_sales_summary = df.groupby('status').agg({
'amount': ['sum', 'mean', 'median','min','max'],
'qty': ['sum', 'mean', 'median','min','max']
})
# Compute correct "Grand Total" values separately
grand_total_values = {
('amount', 'sum'): df['amount'].sum(),
('amount', 'mean'): df['amount'].mean(),
('amount', 'median'): df['amount'].median(),
('qty', 'sum'): df['qty'].sum(),
('qty', 'mean'): df['qty'].mean(),
('qty', 'median'): df['qty'].median()
}
# Convert to DataFrame and append as "Grand Total" row
grand_total = pd.DataFrame([grand_total_values], index=["Grand Total"])
overall_sales_summary = pd.concat([overall_sales_summary, grand_total])
overall_sales_summary
| status | amount sum | amount mean | amount median | amount min | amount max | qty sum | qty mean | qty median | qty min | qty max |
|---|---|---|---|---|---|---|---|---|---|---|
| Cancelled | 6910831.39 | 377.41 | 380.0 | 0.0 | 4235.72 | 5651 | 0.31 | 0.0 | 0.0 | 2.0 |
| Delivered | 18624586.00 | 648.60 | 631.0 | 0.0 | 5495.0 | 28831 | 1.00 | 1.0 | 0.0 | 5.0 |
| Lost/Damaged | 3133.00 | 522.17 | 499.0 | 0.0 | 1136.0 | 6 | 1.00 | 1.0 | 1.0 | 1.0 |
| Out for Delivery | 26971.00 | 770.60 | 729.0 | 301.0 | 1399.0 | 35 | 1.00 | 1.0 | 1.0 | 1.0 |
| Pending | 622409.00 | 662.84 | 664.0 | 0.0 | 2326.0 | 940 | 1.00 | 1.0 | 0.0 | 2.0 |
| Picked Up | 661252.00 | 679.60 | 699.0 | 0.0 | 1998.0 | 977 | 1.00 | 1.0 | 1.0 | 2.0 |
| Returned | 1386026.00 | 657.51 | 635.0 | 0.0 | 2796.0 | 2130 | 1.01 | 1.0 | 0.0 | 4.0 |
| Shipped | 50344926.00 | 647.76 | 599.0 | 0.0 | 5584.0 | 77926 | 1.00 | 1.0 | 0.0 | 15.0 |
| Grand Total | 78580134.39 | 610.06 | 588.0 | n/a | n/a | 116496 | 0.90 | 1.0 | n/a | n/a |
Category & Product Analysis
# Compute category sales summary
category_sales_summary = df.groupby('category').agg({
'amount': ['sum', 'mean', 'median', 'min', 'max'],
'qty': ['sum', 'mean', 'median', 'min', 'max']
}).sort_values(('amount', 'sum'), ascending=False)
category_sales_summary
| category | amount sum | amount mean | amount median | amount min | amount max | qty sum | qty mean | qty median | qty min | qty max |
|---|---|---|---|---|---|---|---|---|---|---|
| T-shirt | 39193318.17 | 780.45 | 759.0 | 0.0 | 5584.0 | 45228 | 0.90 | 1.0 | 0 | 8 |
| Shirt | 21304600.70 | 427.77 | 432.0 | 0.0 | 2796.0 | 44978 | 0.90 | 1.0 | 0 | 13 |
| Blazzer | 11211616.12 | 723.84 | 744.0 | 0.0 | 2860.0 | 13934 | 0.90 | 1.0 | 0 | 4 |
| Trousers | 5344415.30 | 503.86 | 518.0 | 0.0 | 1797.0 | 9889 | 0.93 | 1.0 | 0 | 3 |
| Perfume | 789419.66 | 682.30 | 797.14 | 0.0 | 1449.0 | 1051 | 0.91 | 1.0 | 0 | 2 |
| Wallet | 460896.18 | 497.73 | 518.53 | 0.0 | 1266.66 | 863 | 0.93 | 1.0 | 0 | 15 |
| Socks | 151019.50 | 344.01 | 342.86 | 0.0 | 1028.58 | 398 | 0.91 | 1.0 | 0 | 2 |
| Shoes | 123933.76 | 755.69 | 790.0 | 0.0 | 2058.0 | 152 | 0.93 | 1.0 | 0 | 3 |
| Watch | 915.00 | 305.00 | 305.0 | 305.0 | 305.0 | 3 | 1.00 | 1.0 | 1 | 1 |
# Compute sales summary by size (used here as the product-level breakdown)
product_sales_summary = df.groupby('size').agg({
'amount': ['sum', 'mean', 'median', 'min', 'max'],
'qty': ['sum', 'mean', 'median', 'min', 'max']
}).sort_values(('amount', 'sum'), ascending=False)
product_sales_summary
| size | amount sum | amount mean | amount median | amount min | amount max | qty sum | qty mean | qty median | qty min | qty max |
|---|---|---|---|---|---|---|---|---|---|---|
| M | 13691357.13 | 612.59 | 597.0 | 0.0 | 4235.72 | 20116 | 0.90 | 1.0 | 0 | 4 |
| L | 13038244.84 | 598.72 | 573.0 | 0.0 | 2598.0 | 19680 | 0.90 | 1.0 | 0 | 9 |
| XL | 12255023.38 | 597.22 | 563.0 | 0.0 | 2698.0 | 18609 | 0.91 | 1.0 | 0 | 5 |
| XXL | 10461329.64 | 588.28 | 545.0 | 0.0 | 2676.0 | 16226 | 0.91 | 1.0 | 0 | 13 |
| S | 10426761.18 | 622.34 | 599.0 | 0.0 | 2864.0 | 15022 | 0.90 | 1.0 | 0 | 4 |
| 3XL | 9034190.92 | 617.68 | 581.0 | 0.0 | 5495.0 | 13346 | 0.91 | 1.0 | 0 | 5 |
| XS | 6960339.10 | 630.87 | 622.86 | 0.0 | 5584.0 | 9829 | 0.89 | 1.0 | 0 | 8 |
| Free | 1375164.60 | 611.18 | 625.0 | 0.0 | 2058.0 | 2069 | 0.92 | 1.0 | 0 | 15 |
| 6XL | 577493.33 | 782.51 | 798.0 | 0.0 | 1672.0 | 688 | 0.93 | 1.0 | 0 | 2 |
| 5XL | 425778.63 | 774.14 | 798.0 | 0.0 | 1629.0 | 513 | 0.93 | 1.0 | 0 | 2 |
| 4XL | 334451.64 | 783.26 | 798.0 | 0.0 | 1672.0 | 398 | 0.93 | 1.0 | 0 | 2 |
Fulfillment & B2B vs. B2C Analysis
# Compute sales summary by fulfillment method
fulfillment_summary = df.groupby('fulfilled_by').agg({
'amount': ['sum', 'mean', 'median', 'min', 'max'],
'qty': ['sum', 'mean', 'median', 'min', 'max']
}).sort_values(('amount', 'sum'), ascending=False)
# Compute sales summary by B2B vs. B2C
b2b_summary = df.groupby('b2b').agg({
'amount': ['sum', 'mean', 'median', 'min', 'max'],
'qty': ['sum', 'mean', 'median', 'min', 'max']
}).sort_values(('amount', 'sum'), ascending=False)
b2b_summary
fulfillment_summary
| b2b | amount sum | amount mean | amount median | amount min | amount max | qty sum | qty mean | qty median | qty min | qty max |
|---|---|---|---|---|---|---|---|---|---|---|
| False | 77988913.60 | 609.59 | 587.0 | 0.0 | 5584.0 | 115656 | 0.90 | 1.0 | 0 | 15 |
| True | 591220.79 | 678.78 | 653.0 | 0.0 | 5495.0 | 840 | 0.96 | 1.0 | 0 | 5 |

| fulfilled_by | amount sum | amount mean | amount median | amount min | amount max | qty sum | qty mean | qty median | qty min | qty max |
|---|---|---|---|---|---|---|---|---|---|---|
| Merchant Fulfilled | 54338671.00 | 606.49 | 574.0 | 0.0 | 5584.0 | 83990 | 0.94 | 1.0 | 0 | 15 |
| Easy Ship | 24241463.39 | 618.20 | 602.86 | 0.0 | 5495.0 | 32506 | 0.83 | 1.0 | 0 | 5 |
♟Statistical Analysis & Hypothesis Testing
T-Test/ANOVA for B2B vs. B2C and Category Sales Comparisons
from scipy.stats import ttest_ind, f_oneway
# T-Test for B2B vs. B2C Sales Comparison
b2b_sales = df[df['b2b'] == True]['amount']
b2c_sales = df[df['b2b'] == False]['amount']
# Perform independent T-test
t_stat, p_value_b2b = ttest_ind(b2b_sales, b2c_sales, equal_var=False, nan_policy='omit')
# ANOVA for Category Sales Comparison
category_groups = [group['amount'].dropna() for name, group in df.groupby('category')]
# Perform ANOVA test
anova_stat, p_value_category = f_oneway(*category_groups)
# Create a summary DataFrame for hypothesis testing results
hypothesis_results = pd.DataFrame({
"Test": ["T-Test (B2B vs. B2C)", "ANOVA (Category Sales)"],
"Test Statistic": [t_stat, anova_stat],
"P-Value": [p_value_b2b, p_value_category]
})
hypothesis_results
| | Test | Test Statistic | P-Value |
|---|---|---|---|
| 0 | T-Test (B2B vs. B2C) | 5.78 | 1.02e-08 |
| 1 | ANOVA (Category Sales) | 6157.19 | 0.00 |
1️⃣ T-Test (B2B vs. B2C)
Test Statistic (5.78):
- A high t-value means a strong difference between B2B and B2C sales amounts.
- This shows that the two groups have significantly different mean sales.
P-Value (1.02e-08 or 0.0000000102):
- The probability that this difference happened by random chance is almost zero.
- Since p < 0.05, we reject the null hypothesis, meaning B2B and B2C sales are significantly different.
✅ Conclusion:
- B2B sales behave differently than B2C sales in terms of order amount.
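With ~129K orders, even a small difference reaches statistical significance, so an effect-size measure is a useful companion to the p-value. A sketch (cohens_d is a hypothetical helper, not part of the original analysis):
import numpy as np
def cohens_d(a, b):
    # Standardized mean difference between two samples
    pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_std
# A small d would mean the B2B/B2C gap is statistically real but practically modest
print(cohens_d(b2b_sales.dropna(), b2c_sales.dropna()))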
2️⃣ ANOVA (Category Sales)
Test Statistic (6157.19):
- A very high F-statistic means that there is a strong difference in sales across different categories.
P-Value (0.000):
- The probability that all categories have the same sales is almost zero.
- Since p < 0.05, we reject the null hypothesis, meaning some product categories sell significantly better than others.
✅ Conclusion:
- Product categories show significant differences in sales performance. Some categories are selling far more or far less than others.
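ANOVA only establishes that at least one category mean differs, not which ones. A post-hoc Tukey HSD test identifies the differing pairs; a sketch using statsmodels (assuming the same df as above):
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Pairwise category comparisons with family-wise error control
sub = df[['amount', 'category']].dropna()
tukey = pairwise_tukeyhsd(endog=sub['amount'], groups=sub['category'].astype(str), alpha=0.05)
print(tukey.summary())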
Chi-Square Analysis (Order Status vs. Fulfillment Method)
import scipy.stats as stats
# Create a contingency table
contingency_table = pd.crosstab(df['status'], df['fulfilled_by'])
# Perform Chi-Square test
chi2_stat, p_value_chi2, dof, expected = stats.chi2_contingency(contingency_table)
# Create a summary DataFrame for the Chi-Square test results
chi_square_results = pd.DataFrame({
"Test": ["Chi-Square (Order Status vs. Fulfillment)"],
"Chi-Square Statistic": [chi2_stat],
"P-Value": [p_value_chi2],
"Degrees of Freedom": [dof]
})
chi_square_results
| Test | Chi-Square Statistic | P-Value | Degrees of Freedom |
|---|---|---|---|
| Chi-Square (Order Status vs. Fulfillment) | 107464.39 | 0.00 | 7 |
Chi-Square Statistic (107464.39)
- A very high value indicates a strong association between order status and fulfillment method (Easy Ship vs. Merchant Fulfilled).
- This suggests that the fulfillment type is associated with whether an order is shipped, cancelled, or returned.
P-Value (0.000)
- The probability of this result happening by random chance is almost zero.
- Since p < 0.05, we reject the null hypothesis, meaning order status is significantly influenced by the fulfillment method.
Degrees of Freedom (7)
- Based on the number of categories in status and fulfilled_by: (8 − 1) × (2 − 1) = 7.
✅ Conclusion:
The fulfillment method significantly affects order status: Easy Ship and Merchant Fulfilled orders show different cancellation, return, and shipping rates.
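Because chi-square statistics grow with sample size, a large value alone does not prove a strong association; Cramér's V rescales it to a 0–1 range. A minimal sketch using the contingency table above:
import numpy as np
# Cramér's V: 0 = no association, 1 = perfect association
n = contingency_table.to_numpy().sum()
min_dim = min(contingency_table.shape) - 1
cramers_v = np.sqrt(chi2_stat / (n * min_dim))
print(round(cramers_v, 3))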
Correlation & Regression Analysis (Qty vs. Amount)
# Correlation & Regression Analysis (Qty vs. Amount)
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Calculate Pearson correlation coefficient
correlation_coefficient = df['qty'].corr(df['amount'])
# Scatter plot with regression line
plt.figure(figsize=(8, 5))
sns.regplot(x=df['qty'], y=df['amount'], scatter_kws={"alpha":0.3}, line_kws={"color":"red"})
plt.xlabel("Quantity (Qty)")
plt.ylabel("Amount (INR)")
plt.title(f"Correlation between Quantity and Amount (r = {correlation_coefficient:.2f})")
plt.show()
# Perform linear regression (OLS - Ordinary Least Squares)
X = sm.add_constant(df['qty']) # Add constant for intercept
y = df['amount']
model = sm.OLS(y, X).fit()
# Display regression summary
regression_summary = model.summary()
regression_summary

✅Correlation Coefficient (r = 0.39)
- Shows a moderate positive correlation between quantity (qty) and amount.
- Higher quantities tend to be associated with higher sales, but the relationship is not perfectly linear.
✅Regression Line (Red Line in Plot)
- The upward slope confirms a positive relationship between quantity and amount.
- Some outliers exist where low quantities still result in high sales amounts.
OLS Regression Results
==============================================================================
Dep. Variable: amount R-squared: 0.155
Model: OLS Adj. R-squared: 0.155
Method: Least Squares F-statistic: 2.369e+04
Date: Fri, 14 Mar 2025 Prob (F-statistic): 0.00
Time: 17:08:52 Log-Likelihood: -9.1196e+05
No. Observations: 128808 AIC: 1.824e+06
Df Residuals: 128806 BIC: 1.824e+06
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 254.2724 2.446 103.941 0.000 249.478 259.067
qty 393.3853 2.556 153.920 0.000 388.376 398.395
==============================================================================
Omnibus: 17866.214 Durbin-Watson: 1.863
Prob(Omnibus): 0.000 Jarque-Bera (JB): 58717.680
Skew: 0.709 Prob(JB): 0.00
Kurtosis: 5.988 Cond. No. 5.95
==============================================================================
1️⃣ Model Summary
- Dep. Variable: amount → the target variable (sales amount) being predicted.
- Model Type: OLS (Ordinary Least Squares) → the type of regression used.
- Method: Least Squares → the method used to estimate the coefficients.
- No. Observations: 128,808 → the total number of data points (rows) in the dataset.
- Df Residuals: 128,806 → degrees of freedom left after estimating the model parameters.
- Df Model: 1 → the number of predictor variables in the model (qty).
2️⃣ Model Fit Statistics (How Well the Model Explains Variance)
- R-squared: 0.155 → 15.5% of the variation in amount is explained by qty; a low R² suggests other factors influence sales.
- Adjusted R-squared: 0.155 → R² adjusted for the number of predictors (identical here since there is only one predictor).
- F-statistic: 23,690 → a high F-value indicates the model is statistically significant.
- Prob (F-statistic): 0.000 → since p < 0.05, the model is statistically significant.
Interpretation:
- The model has low explanatory power (R² = 15.5%): qty alone does not fully predict amount, so other factors are influencing sales.
3️⃣ Regression Coefficients (How qty Affects amount)
- Intercept (const): 254.27 → if qty = 0, the predicted sales amount is ₹254.27 (the base value).
- Quantity (qty): 393.39 → each additional unit sold increases amount by ₹393.39 on average.
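As a worked example of how the coefficients combine (illustrative arithmetic, not notebook output):
# Predicted amount for a 2-unit order: intercept + slope * qty
254.27 + 393.39 * 2  # ≈ 1041.05 INR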
Statistical Significance of Predictors:
- Standard Error (std err): how much the estimated coefficient might vary; smaller values are better.
- t-statistic (t): how strongly a variable influences the dependent variable; higher absolute values mean stronger relationships.
- p-value (P>|t|): statistical significance. Since p < 0.05 for both const and qty, both are statistically significant (p = 0.000).
- 95% Confidence Interval: the range where the true coefficient likely falls; for qty it is roughly ₹388.38 to ₹398.40 per additional unit.
Interpretation:
- qty has a significant positive impact on amount: each unit sold increases sales by ~₹393.39.
4️⃣ Residual Diagnostics (Checking Model Assumptions)
- Omnibus: 17,866.214 → Tests if residuals are normally distributed. High values suggest they are not normal.
- Prob (Omnibus): 0.000 → Residuals are not normally distributed (p < 0.05).
- Durbin-Watson: 1.863 → Measures autocorrelation (values between 1.5 – 2.5 are acceptable).
- Jarque-Bera (JB): 58,717.68 → Another normality test—high values confirm non-normal residuals.
- Prob (JB): 0.000 → Residuals fail the normality test (p < 0.05).
- Skew: 0.709 → Residuals are slightly right-skewed.
- Kurtosis: 5.988 → Residuals have heavy tails (outliers present).
- Condition Number (Cond. No.): 5.95 → Measures multicollinearity (values > 30 are problematic; this model is fine).
Interpretation:
- Residuals are not normally distributed → This suggests the model may need transformation.
- Some skewness and outliers exist → The model could be improved.
- No severe multicollinearity detected, so predictor variables are independent.
🚀 Key Takeaways:
✅ qty significantly predicts amount (p < 0.05): each additional unit sold adds ~₹393.39 in revenue.
✅ The model explains only 15.5% of sales variation, meaning other factors influence sales.
✅ Residuals are not normally distributed, which suggests potential model improvements (e.g., a transformation); see the sketch below.
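One common remedy for skewed, heavy-tailed residuals is modeling the log of the target. A hedged sketch (log_model is illustrative, not from the original analysis):
import numpy as np
import statsmodels.api as sm
# Log requires strictly positive amounts, so zero-amount (cancelled) rows are excluded
mask = df['amount'] > 0
X = sm.add_constant(df.loc[mask, 'qty'])
y = np.log(df.loc[mask, 'amount'])
log_model = sm.OLS(y, X).fit()
print(log_model.summary())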
🕒Time Series & Trend Analysis
Time Series & Trend Analysis (Sales Trends & Seasonality)
# Aggregate sales by month to analyze trends
monthly_sales = df.resample('M', on='date').agg({'amount': 'sum'}).reset_index()
# Plot the monthly sales trend
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['date'], monthly_sales['amount'], marker='o', linestyle='-', color='b')
plt.xlabel("Month")
plt.ylabel("Total Sales Amount (INR)")
plt.title("Monthly Sales Trend Over Time")
plt.grid(True)
plt.show()

✅ Sales Growth and Decline:
- The chart shows a sharp increase in sales in April, peaking in May.
- After the peak, sales gradually decline from June to July.
✅ Possible Reasons for Trends:
- Sales Peak in May: Likely due to seasonal demand, promotions, or new product launches.
- Decline After May: Could be due to market saturation, reduced marketing, or seasonal effects.
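The workflow lists forecasting as optional; with only a few months of data, a simple trend model is about all the series supports. A sketch using Holt's linear trend from statsmodels (illustrative, not part of the original notebook):
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Fit an additive-trend model to the monthly totals and project 3 months ahead
series = monthly_sales.set_index('date')['amount']
fit = ExponentialSmoothing(series, trend='add').fit()
print(fit.forecast(3))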
🎨Data Visualization & Insights
import seaborn as sns
# Create a bar chart for sales by category
plt.figure(figsize=(12, 6))
category_sales = df.groupby('category')['amount'].sum()
sns.barplot(x=category_sales.index, y=category_sales.values, palette="viridis")
plt.xlabel("Product Category")
plt.ylabel("Total Sales Amount (INR)")
plt.title("Total Sales by Product Category")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

🔎 Key Insights:
1️⃣ T-Shirts Dominate Sales
- Highest-selling category with total sales nearing 40 million.
- Suggests high demand or strong marketing efforts for T-shirts.
2️⃣ Shirts and Blazers Perform Well
- Shirts (~21 million) and Blazers (~11 million) contribute significantly to revenue.
- Indicates consistent demand for formal/casual wear.
3️⃣ Low-Selling Categories (Shoes, Socks, Wallets, Watches, Perfume)
- These categories have very low sales volume.
- Possible reasons: Lower demand, pricing issues, or lack of promotion.
4️⃣ Trousers Show Moderate Sales
- Generates noticeable revenue (~6 million), but not as strong as Shirts or T-Shirts.
🚀 Actionable Recommendations:
✅ Focus on T-Shirts & Shirts – Continue marketing efforts and optimize inventory.
✅ Analyze Low-Selling Items – Investigate customer preferences and revamp product strategy.
✅ Seasonal Campaigns – Assess if certain categories perform better in specific seasons.
# Create a bar chart for order status impact on sales
plt.figure(figsize=(12, 6))
status_sales = df.groupby('status')['amount'].sum()
sns.barplot(x=status_sales.index, y=status_sales.values, palette="coolwarm")
plt.xlabel("Order Status")
plt.ylabel("Total Sales Amount (INR)")
plt.title("Impact of Order Status on Sales")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

🔎 Key Insights:
1️⃣ Shipped Orders Dominate Sales
- The majority of sales (~50 million) come from Shipped orders, indicating high successful fulfillment rates.
2️⃣ Delivered Sales Are Significant
- Delivered orders contribute ~19 million, confirming a strong conversion rate from shipping to successful delivery.
3️⃣ Cancelled Orders Have Noticeable Revenue Loss
- Cancelled orders account for a significant portion (~7 million) in lost sales.
- Indicates customer dropouts or fulfillment issues affecting revenue.
4️⃣ Returned Orders Impact Sales
- Returned orders show a small but noticeable loss, suggesting product quality, incorrect shipments, or customer dissatisfaction.
5️⃣ Pending, Lost/Damaged, and Picked-Up Orders Have Minimal Impact
- These statuses represent a very small portion of total sales, meaning they are not major revenue blockers.
🚀 Actionable Recommendations:
✅ Investigate Cancelled Orders → Identify why cancellations occur and reduce lost sales.
✅ Improve Delivery Success Rate → Ensure shipped orders convert to delivered orders more efficiently.
✅ Analyze Returns → Assess why customers return products and optimize policies.
✅ Enhance Order Tracking → Ensure smooth fulfillment to reduce delays, losses, and damages.