Netflix Streaming Insights – Roy AI Insights

Netflix Streaming Insights: A Comprehensive Data Analysis


🔎 Project Overview

Introduction

Netflix is one of the world’s largest streaming platforms, offering a vast collection of movies and TV shows from different genres, countries, and time periods. This project analyzes Netflix’s content catalog to uncover key insights into content distribution, genre trends, country contributions, and rating and duration patterns.

Dataset Description

This analysis is based on the publicly available Netflix Movies and TV Shows dataset from Kaggle. It contains 8,807 records with 12 attributes that provide information on titles, directors, cast, countries, release years, ratings, and more.

Objectives of the Analysis

The primary goals of this project are:

Content Analysis: Understand the distribution of Movies vs. TV Shows.
Country-Based Insights: Identify which countries produce the most Netflix content.
Genre Popularity: Determine the most frequent genres on the platform.
Trends Over Time: Analyze how Netflix’s content library has grown over the years.
Ratings & Maturity Levels: Explore the distribution of content ratings (TV-MA, PG, R, etc.).
Duration Analysis: Examine movie durations and TV show season counts.

Methodology

Data Cleaning: Handling missing values, transforming data types, and restructuring key columns.
Exploratory Data Analysis (EDA): Uncovering trends through data visualization and statistical summaries.
Power BI Visualization: Creating interactive dashboards to present findings effectively.

Expected Outcomes

By the end of this analysis, I will have a clear understanding of Netflix’s content strategy, helping answer critical business and user behavior questions. The final Power BI dashboard will allow for dynamic exploration of Netflix’s catalog.

📊 Summary of the Dataset

1. Dataset Overview

Total Records: 8,807 entries
Total Columns: 12 columns
Content: Information about movies and TV shows available on Netflix, including metadata like title, director, cast, country, release year, duration, rating, and genre.
Purpose of Analysis: To explore content distribution, country-based trends, director and cast insights, rating patterns, and duration analysis.

2. Column Descriptions

The dataset contains 12 columns, each representing different attributes of Netflix content:

show_id – Unique identifier for each Netflix title.
type – Indicates whether the content is a “Movie” or a “TV Show.”
title – Name of the Movie or TV Show.
director – Name(s) of the director(s). This column has significant missing values.
cast – List of main actors/actresses in the Movie or TV Show. This column also has many missing values.
country – Country of origin where the content was produced. Some entries are missing.
date_added – Date when the content was added to Netflix. Useful for trend analysis.
release_year – The year in which the Movie or TV Show was originally released.
rating – Maturity rating of the content, such as PG, R, TV-MA, etc.
duration – Represents the length of Movies (in minutes) or the number of seasons for TV Shows. Needs transformation for better analysis.
listed_in – Categories or genres associated with the Movie or TV Show. Multiple genres are often listed in a single entry, requiring data splitting.
description – A brief summary of the Movie or TV Show. Could be used for text analysis or classification.
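Several columns (listed_in, and also cast and country) pack multiple values into one comma-separated string. A minimal sketch of the splitting step, using two made-up rows shaped like the dataset (the toy frame and the genre column name are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd

# Toy rows shaped like the dataset; the values are illustrative only.
toy = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "listed_in": ["Documentaries", "International TV Shows, TV Dramas"],
})

# Split on commas, give each genre its own row, then trim whitespace.
genres_long = (
    toy.assign(genre=toy["listed_in"].str.split(","))
       .explode("genre", ignore_index=True)
       .assign(genre=lambda d: d["genre"].str.strip())
       [["show_id", "genre"]]
)
print(genres_long)
```

In this long form each row holds one title and one genre, so a plain value_counts() on the genre column gives genre frequencies.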

❓ Key Questions

1. Content Distribution & Trends

What is the distribution of Movies vs. TV Shows on Netflix?
How has the number of new content additions changed over the years?
What are the most common genres on Netflix?

2. Country-Based Insights

Which countries produce the most Netflix content?
How does content distribution vary by country (e.g., Which countries dominate certain genres)?

3. Directors & Cast Analysis

Who are the most frequently featured directors?
Which actors appear most often in Netflix productions?

4. Ratings & Maturity Levels

What are the most common content ratings on Netflix?
How does the distribution of ratings differ between Movies and TV Shows?

5. Duration & Season Analysis

What is the average duration of movies on Netflix?
How does the number of seasons vary across TV Shows?

🧹 Data Cleaning

Load & Inspect Data

After loading the dataset, this step inspects:

The first few rows
Column names and data types
A basic summary to identify any immediate issues

				
					import pandas as pd

# Load the dataset
df = pd.read_csv(r"D:\roy\roy files\Data\data project\netflix_titles.csv")

# Display basic information about the dataset
df.info()
df.head()

				
					<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB

show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
s1	Movie	Dick Johnson Is Dead	Kirsten Johnson		United States	September 25, 2021	2020	PG-13	90 min	Documentaries	As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
s2	TV Show	Blood & Water		Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng	South Africa	September 24, 2021	2021	TV-MA	2 Seasons	International TV Shows, TV Dramas, TV Mysteries	After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.
s3	TV Show	Ganglands	Julien Leclercq	Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera		September 24, 2021	2021	TV-MA	1 Season	Crime TV Shows, International TV Shows, TV Action & Adventure	To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
s4	TV Show	Jailbirds New Orleans				September 24, 2021	2021	TV-MA	1 Season	Docuseries, Reality TV	Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series.
s5	TV Show	Kota Factory		Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar	India	September 24, 2021	2021	TV-MA	2 Seasons	International TV Shows, Romantic TV Shows, TV Comedies	In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life.

				
					import matplotlib.pyplot as plt

# Calculate missing values count per column
missing_values = df.isna().sum()

# Plot a bar chart of missing values per column
plt.figure(figsize=(10, 5))
bars = plt.bar(missing_values.index, missing_values.values, color='skyblue')
plt.xlabel("Columns")
plt.ylabel("Number of Missing Values")
plt.title("Missing Values Per Column in Netflix Dataset")
plt.xticks(rotation=45)

for bar in bars.patches:  
    yval = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width()/2, yval, 
        f'{yval:.0f}', 
        ha='center', va='bottom', fontsize=10
    )
plt.show()
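Raw counts are easier to judge as a share of all rows. The sketch below recomputes the missing counts from the non-null counts printed by df.info() above; the variable names are my own, and on the real DataFrame the same percentages could be obtained with df.isna().mean() * 100.

```python
import pandas as pd

# Non-null counts copied from the df.info() output above.
total_rows = 8807
non_null = pd.Series({
    "director": 6173,
    "cast": 7982,
    "country": 7976,
    "date_added": 8797,
    "rating": 8803,
    "duration": 8804,
})

# Missing count and missing share per column, largest first.
missing = total_rows - non_null
summary = pd.DataFrame({
    "missing": missing,
    "missing_pct": (missing / total_rows * 100).round(2),
}).sort_values("missing", ascending=False)
print(summary)
```

The director column stands out: roughly 30% of its values are missing, while date_added, rating, and duration are each missing in well under 1% of rows.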

Handling Missing Values

Director

Merged df_credits and df_titles on the 'id' column.
Extracted only the records where 'role' == "DIRECTOR" and created a new dataset df_directors.
Renamed 'name' column in df_directors to 'director'.
Merged df_directors with the Netflix dataset based on the 'title' column.
Filled missing director values using the enriched dataset.
Any remaining missing values were filled with "Unknown".

Cast

Filled missing values with “Unknown”

Country

Used cast information to infer missing country values
Remaining missing values filled with “Unknown”

Date Added

Filled missing values with the most frequent date in the dataset

Rating

Filled missing values with the most frequent rating

Duration

Extracted minutes for movies into duration_in_minutes (integer)
Extracted seasons for TV shows into num_seasons (integer)
Non-numeric values set to 0

				
					# Load the datasets
df_credits = pd.read_csv(r'D:\roy\roy files\Data\data project\Netflix\credits.csv')
df_titles = pd.read_csv(r'D:\roy\roy files\Data\data project\Netflix\titles.csv')

# Perform the merge operation on the 'id' column
df_merged = df_credits.merge(df_titles, on='id', how='left')
df_merged

# Keep only the rows where 'role' is "DIRECTOR" and rename 'name' to 'director'
df_directors = df_merged.loc[df_merged['role'] == 'DIRECTOR', ['title', 'name']]
df_directors = df_directors.rename(columns={'name': 'director'})

# Merge with the main Netflix dataset to fill missing directors
df_filled = df.merge(df_directors, on='title', how='left')

# Fill missing director values in 'df' with those found in 'df_merged'
df_filled['director'] = df_filled['director_x'].combine_first(df_filled['director_y'])

# Drop redundant columns after merging
df_filled.drop(columns=['director_x', 'director_y'], inplace=True)

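# A title with several director credits matches several rows in the merge,
# so keep one row per show_id (the result must be assigned back)
df = df_filled.drop_duplicates(subset='show_id').copy()

# Fill remaining missing values in the 'director' and 'cast' columns with "Unknown"
df['director'] = df['director'].fillna("Unknown")
df['cast'] = df['cast'].fillna("Unknown")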
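One caveat with merging on title: if the lookup table holds more than one director row for a title, a left merge repeats that title once per match. The sketch below shows the effect on two tiny hypothetical tables (main and lookup stand in for the Netflix dataset and df_directors) and one possible way around it, which is to collapse the lookup to one row per title first. This is an alternative sketch, not the method used above.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
main = pd.DataFrame({"title": ["A", "B"], "director": [None, "Known Person"]})
lookup = pd.DataFrame({
    "title": ["A", "A", "B"],
    "director": ["Dir One", "Dir Two", "Other"],
})

# Naive left merge: title "A" matches two lookup rows, so it appears twice.
naive = main.merge(lookup, on="title", how="left")

# Collapse the lookup to one row per title before merging.
lookup_one = lookup.groupby("title", as_index=False)["director"].agg(", ".join)
safe = main.merge(
    lookup_one, on="title", how="left",
    suffixes=("", "_lookup"), validate="many_to_one",
)

# Keep the original director where present, otherwise use the lookup value.
safe["director"] = safe["director"].fillna(safe["director_lookup"])
safe = safe.drop(columns="director_lookup")
print(len(naive), len(safe))
```

The validate="many_to_one" argument makes pandas raise an error if the lookup still contains repeated titles, which guards against silent row multiplication.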

				
					# Create a mapping of actors to their countries based on existing data
actor_country_mapping = {}

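# Extract known country information for each actor, skipping the "Unknown"
# cast placeholder so it cannot be mapped to a country
known_rows = df[df['cast'] != "Unknown"].dropna(subset=['cast', 'country'])
for index, row in known_rows.iterrows():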
    actors = row['cast'].split(', ')
    country = row['country']
    for actor in actors:
        if actor not in actor_country_mapping:
            actor_country_mapping[actor] = country

# Function to fill missing country values based on cast information
def fill_country_based_on_cast(row):
    if pd.isna(row['country']):
        actors = row['cast'].split(', ') if pd.notna(row['cast']) else []
        for actor in actors:
            if actor in actor_country_mapping:
                return actor_country_mapping[actor]  # Assign the first found country
    return row['country']

# Apply the function to fill missing country values
df['country'] = df.apply(fill_country_based_on_cast, axis=1)

# Fill any remaining missing values in 'country' with "Unknown"
df['country'] = df['country'].fillna("Unknown")

				
					# Find the most frequent date in the 'date_added' column
most_frequent_date = df['date_added'].mode()[0]

# Fill missing values in 'date_added' with the most frequent date
df['date_added'] = df['date_added'].fillna(most_frequent_date)

				
					# Fill missing "rating" with the most frequent rating
most_frequent_rating = df['rating'].mode()[0]
df['rating'] = df['rating'].fillna(most_frequent_rating)

# Creating two separate columns for movies and TV shows
# (raw strings keep the regex escapes intact; expand=False returns a Series)
df['duration_in_minutes'] = df['duration'].str.extract(r'(\d+)\s*min', expand=False).astype(float)
df['num_seasons'] = df['duration'].str.extract(r'(\d+)\s*Season', expand=False).astype(float)

# Filling NaN values with 0 for consistency, then converting to integer type
df['duration_in_minutes'] = df['duration_in_minutes'].fillna(0).astype(int)
df['num_seasons'] = df['num_seasons'].fillna(0).astype(int)

# Dropping the original "duration" column as it's now split
df.drop(columns=['duration'], inplace=True)

# Ensuring all values in 'duration_in_minutes' and 'num_seasons' are valid integers, replacing non-digits with 0
df['duration_in_minutes'] = pd.to_numeric(df['duration_in_minutes'], errors='coerce').fillna(0).astype(int)
df['num_seasons'] = pd.to_numeric(df['num_seasons'], errors='coerce').fillna(0).astype(int)
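The code above stores 0 wherever a duration is not applicable or missing, which keeps the columns as plain integers but lets zeros leak into averages. As an alternative sketch (not what the notebook does), pandas' nullable Int64 dtype can keep those entries as missing instead. The sample strings below are illustrative.

```python
import pandas as pd

# Illustrative duration strings, including one missing value.
durations = pd.Series(["90 min", "2 Seasons", "1 Season", None])

# Extract the number only where the unit matches; everything else stays missing.
minutes = pd.to_numeric(
    durations.str.extract(r"(\d+)\s*min", expand=False)
).astype("Int64")
seasons = pd.to_numeric(
    durations.str.extract(r"(\d+)\s*Season", expand=False)
).astype("Int64")

print(minutes.mean())            # mean over real movie durations only
print(minutes.fillna(0).mean())  # mean once missing values become 0
```

With missing values preserved, mean() skips them automatically; with zeros filled in, the same column averages noticeably lower.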

				
					show_id                0
type                   0
title                  0
director               0
cast                   0
country                0
date_added             0
release_year           0
rating                 0
listed_in              0
description            0
duration_in_minutes    0
num_seasons            0
dtype: int64

Data Type Validation and Conversion

✅ Converted date_added to datetime format (datetime64[ns])

Ensures proper handling of time-based analysis.

✅ Converted release_year to integer (int64)

Ensures consistency for numerical operations.

✅ Converted rating to categorical type (category)

Optimizes storage and improves analysis.

✅ Converted duration values into two separate numerical columns:

duration_in_minutes (int64) → For movies with minutes.
num_seasons (int64) → For TV shows with season count.

✅ Ensured all remaining columns maintain appropriate data types:

show_id, type, title, director, cast, country, listed_in, description remain as object (string-based).
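To illustrate the storage claim for the category conversion, here is a small sketch comparing memory use before and after on a synthetic column with only four distinct ratings. The column contents and size are assumptions chosen for the demonstration; exact byte counts depend on the pandas version.

```python
import pandas as pd

# Synthetic column: 8,000 values drawn from four rating labels.
ratings = pd.Series(["TV-MA", "PG-13", "TV-14", "R"] * 2000)

# Memory used when every value is stored as its own string.
as_object_bytes = ratings.memory_usage(deep=True)

# Memory used when values are stored as small integer codes plus a label table.
as_category = ratings.astype("category")
as_category_bytes = as_category.memory_usage(deep=True)

print(as_object_bytes, as_category_bytes)
print(list(as_category.cat.categories))
```

The saving comes from low cardinality: a column like title, where nearly every value is unique, would gain little from the same conversion.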

				
					# Checking current data types of each column
df.dtypes

				
					show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration         int64
listed_in       object
description     object
dtype: object

				
					# Convert 'date_added' to datetime format
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
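With errors='coerce', any value that cannot be parsed silently becomes NaT, so it is worth checking how many dates were lost. A hedged sketch, assuming the dates follow the "Month day, year" pattern seen in the sample rows and that some values may carry stray whitespace (the sample values below are made up):

```python
import pandas as pd

# Made-up values: one clean date, one with a leading space, one invalid, one missing.
raw = pd.Series(["September 25, 2021", " August 4, 2017", "not a date", None])

# Trim whitespace, then parse with an explicit format; failures become NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Count how many values could not be parsed (including originally missing ones).
unparsed = int(parsed.isna().sum())
print(parsed)
print(unparsed)
```

Comparing the NaT count after parsing with the missing count before parsing shows whether the conversion dropped any dates that were actually present.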

				
					# Extract misplaced duration values from the 'rating' column
misplaced_duration_mask = df["rating"].str.contains(r"^\d+\s*min$", na=False)

# Convert extracted values to integer and place them in 'duration_in_minutes'
df.loc[misplaced_duration_mask, "duration_in_minutes"] = (
    df.loc[misplaced_duration_mask, "rating"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Replace those misplaced values in 'rating' with the most frequent valid rating
valid_ratings = df.loc[~misplaced_duration_mask, "rating"]
df.loc[misplaced_duration_mask, "rating"] = valid_ratings.mode()[0]

# Convert 'rating' to categorical type
df["rating"] = df["rating"].astype("category")

# Convert duration column to integer type
df['duration_in_minutes'] = df['duration_in_minutes'].fillna(0).astype(int)

📈 Exploratory Data Analysis (EDA)

Distribution of Movies vs. TV Shows

				
					import matplotlib.pyplot as plt
import seaborn as sns

# Count distribution of Movies vs TV Shows
content_distribution = df['type'].value_counts()

# Plot
plt.figure(figsize=(6,4))
sns.barplot(x=content_distribution.index, y=content_distribution.values)
plt.xlabel("Content Type")
plt.ylabel("Count")
plt.title("Distribution of Movies vs TV Shows on Netflix")

# Show plot
plt.show()

Analyzing new content additions over the years

				
					# Extract year from 'date_added' column
df['year_added'] = df['date_added'].dt.year

# Count the number of additions per year
yearly_additions = df['year_added'].value_counts().sort_index()

# Plot the trend of new content additions over the years
plt.figure(figsize=(10,5))
plt.plot(yearly_additions.index, yearly_additions.values, marker='o', linestyle='-', linewidth=2)
plt.xlabel('Year')
plt.ylabel('Number of Additions')
plt.title('Trend of New Content Additions Over the Years on Netflix')
plt.grid(True)

# Show the plot
plt.show()
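The yearly trend can also be broken down by content type. A minimal sketch on four made-up rows (the toy frame is an assumption; on the real data the same groupby would run on df after the date conversion):

```python
import pandas as pd

# Made-up rows with the two columns the breakdown needs.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": pd.to_datetime(
        ["2019-01-05", "2019-06-01", "2020-03-10", "2020-07-21"]
    ),
})

# Count additions per year and type, with one column per content type.
per_year = (
    toy.assign(year_added=toy["date_added"].dt.year)
       .groupby(["year_added", "type"])
       .size()
       .unstack(fill_value=0)
)
print(per_year)
```

Calling per_year.plot() on such a table draws one line per content type, which makes it easier to see whether movies or TV shows drive the growth.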

Finding the most common genres on Netflix

				
					import matplotlib.pyplot as plt
from collections import Counter

# Ensure all values are strings and remove brackets if any
df['listed_in'] = df['listed_in'].astype(str).str.replace(r"[\[\]']", "", regex=True)

# Flatten the list of genres and count occurrences
all_genres = [genre.strip() for sublist in df['listed_in'].dropna().str.split(',') for genre in sublist]
genre_counts = Counter(all_genres)

# Convert to DataFrame for visualization
genre_df = pd.DataFrame(genre_counts.items(), columns=['Genre', 'Count']).sort_values(by='Count', ascending=False)

# Plot the top 10 most common genres
plt.figure(figsize=(12, 6))
plt.barh(genre_df['Genre'].head(10), genre_df['Count'].head(10), color='steelblue')
plt.xlabel("Count")
plt.ylabel("Genre")
plt.title("Most Common Genres on Netflix")
plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.show()

Count the number of titles produced by each country

				
					# Ensure all values are strings and remove brackets if any
df['country'] = df['country'].astype(str).str.replace(r"[\[\]']", "", regex=True)

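# Flatten the list of countries and count occurrences,
# skipping blank entries and the "Unknown" placeholder
all_countries = [
    country.strip()
    for sublist in df['country'].dropna().str.split(',')
    for country in sublist
    if country.strip() and country.strip() != "Unknown"
]
country_counts = Counter(all_countries)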

# Convert to DataFrame for visualization
country_df = pd.DataFrame(country_counts.items(), columns=['Country', 'Count']).sort_values(by='Count', ascending=False)

# Plot the top 10 countries producing the most Netflix content
plt.figure(figsize=(12, 6))
plt.barh(country_df['Country'].head(10), country_df['Count'].head(10), color='steelblue')
plt.xlabel("Count")
plt.ylabel("Country")
plt.title("Top 10 Countries Producing Netflix Content")
plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.show()
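The key question about which countries dominate which genres needs both multi-valued columns split at once. The sketch below is one possible approach on three made-up rows; explode_column is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Made-up rows with comma-separated countries and genres.
toy = pd.DataFrame({
    "country": ["United States, India", "India", "United States"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Comedies"],
})

def explode_column(frame, column):
    """Split a comma-separated column into one trimmed value per row."""
    out = frame.assign(**{column: frame[column].str.split(",")})
    out = out.explode(column, ignore_index=True)
    out[column] = out[column].str.strip()
    return out

# One row per (title, country, genre) combination.
long_df = explode_column(explode_column(toy, "country"), "listed_in")

# Rows are countries, columns are genres, cells are title counts.
table = pd.crosstab(long_df["country"], long_df["listed_in"])
print(table)
```

Note that a title with two countries and two genres contributes four cells here, so the counts describe country and genre pairings rather than unique titles.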

Count the most frequently featured directors

				
					# Count occurrences of each director (exclude "Unknown")
director_counts = df['director'][df['director']!="Unknown"].dropna().str.split(',').explode().str.strip().value_counts()

# Plot the top 10 most frequently featured directors
plt.figure(figsize=(12, 6))
director_counts.head(10).plot(kind='barh', color='steelblue')
plt.xlabel("Number of Titles")
plt.ylabel("Director")
plt.title("Top 10 Most Frequently Featured Directors on Netflix")
plt.gca().invert_yaxis()

for index, value in enumerate(director_counts.head(10)):
    plt.text(value, index, str(value))
plt.show()

Count the most frequently appearing actors

				
					# Count occurrences of each actor
actor_counts = df['cast'][df['cast']!="Unknown"].str.split(',').explode().str.strip().value_counts()

# Plot the top 10 most frequently featured actors
plt.figure(figsize=(12, 6))
actor_counts.head(10).plot(kind='barh', color='steelblue')
plt.xlabel("Number of Titles")
plt.ylabel("Actor")
plt.title("Top 10 Most Frequently Featured Actors on Netflix")
plt.gca().invert_yaxis()
for index, value in enumerate(actor_counts.head(10)):
    plt.text(value, index, str(value))
plt.show()

Most common content ratings on Netflix

				
					# Count occurrences of each rating
rating_counts = df['rating'].value_counts()

# Plot the distribution of content ratings
plt.figure(figsize=(12, 6))
rating_counts.plot(kind='bar', color='steelblue')
plt.xlabel("Content Rating")
plt.ylabel("Count")
plt.title("Most Common Content Ratings on Netflix")
plt.xticks(rotation=45)
plt.show()

Compare Rating Distribution Between Movies & TV Shows

				
					# Count ratings separately for Movies and TV Shows
# A single table keeps both content types aligned on the same rating order
rating_by_type = pd.crosstab(df['rating'], df['type'])

# Create a grouped bar chart
fig, ax = plt.subplots(figsize=(12, 6))
rating_by_type.plot(kind='bar', color=['blue', 'red'], alpha=0.6, ax=ax)

plt.xlabel("Content Rating")
plt.ylabel("Count")
plt.title("Distribution of Ratings for Movies vs TV Shows")
plt.xticks(rotation=45)
plt.legend()
plt.show()
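Because the catalog holds more movies than TV shows, raw counts make the two distributions hard to compare. A sketch that compares shares within each type instead, using six made-up rows (the toy data is an assumption):

```python
import pandas as pd

# Made-up titles: four movies and two TV shows.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "rating": ["TV-MA", "R", "PG-13", "TV-MA", "TV-MA", "TV-14"],
})

# normalize="columns" turns each type's counts into shares that sum to 1.
shares = pd.crosstab(toy["rating"], toy["type"], normalize="columns")
print(shares.round(2))
```

Each column now sums to 1, so a rating's share among movies can be read directly against its share among TV shows.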

Average duration of movies on Netflix.

				
					# Filter movies and calculate average duration
avg_movie_duration = df[df['type'] == 'Movie']['duration_in_minutes'].mean()

# Plot histogram of movie durations
plt.figure(figsize=(12, 6))
sns.histplot(df[df['type'] == 'Movie']['duration_in_minutes'], bins=30, kde=True, color='blue')
plt.axvline(avg_movie_duration, color='red', linestyle='dashed', linewidth=2, label=f'Avg: {avg_movie_duration:.2f} min')
plt.xlabel("Movie Duration (minutes)")
plt.ylabel("Count")
plt.title("Distribution of Movie Durations on Netflix")
plt.legend()
plt.show()
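The mean is sensitive to a few very long titles and to the 0 placeholders written during cleaning. A small sketch of a more robust summary, using made-up durations (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up movie durations; 0 stands for a missing value, 300 for an outlier.
minutes = pd.Series([0, 90, 100, 110, 300])

# Drop the placeholder zeros before summarizing.
valid = minutes[minutes > 0]

mean_minutes = valid.mean()
median_minutes = valid.median()
q1, q3 = valid.quantile([0.25, 0.75])
print(mean_minutes, median_minutes, q1, q3)
```

Reporting the median and the interquartile range next to the mean gives a fairer picture of a typical movie length when the distribution has a long tail.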

Variation in the number of seasons across TV shows.

				
					# Filter TV shows and count occurrences of seasons
season_counts = df[df['type'] == 'TV Show']['num_seasons'].value_counts().sort_index()

# Plot bar chart
plt.figure(figsize=(12, 6))
season_counts.plot(kind='bar', color='steelblue')
plt.xlabel("Number of Seasons")
plt.ylabel("Count of TV Shows")
plt.title("Distribution of Number of Seasons in TV Shows")
plt.xticks(rotation=0)
plt.show()
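The same season counts can be expressed as shares, which makes a statement such as "most TV shows have only one season" directly checkable. A sketch on made-up season counts (illustrative values only):

```python
import pandas as pd

# Made-up season counts for eight TV shows.
seasons = pd.Series([1, 1, 1, 2, 3, 1, 2, 5])

# Share of shows per season count, and the running total.
share = seasons.value_counts(normalize=True).sort_index()
cumulative = share.cumsum()
print(share)
print(cumulative)
```

On the real data, share.loc[1] would give the fraction of single-season shows, provided rows with the 0 placeholder are filtered out first.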

Most Frequently Appearing Words in Descriptions

				
					from wordcloud import WordCloud

# Combine all descriptions into a single text
text = " ".join(str(desc) for desc in df["description"].dropna())

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white", colormap="viridis", max_words=100).generate(text)

# Display the word cloud
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Most Frequently Appearing Words in Descriptions", fontsize=14)
plt.show()
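A word cloud shows relative prominence but not exact counts. As a complement, the sketch below counts words with the standard library only. The two descriptions and the short STOP_WORDS set are made-up assumptions, and top_words is a hypothetical helper.

```python
import re
from collections import Counter

# Made-up descriptions standing in for the 'description' column.
descriptions = [
    "A young detective hunts a killer in the city.",
    "The detective returns to the city to find her family.",
]

# A tiny illustrative stop-word list; a real analysis would need a fuller one.
STOP_WORDS = {"a", "an", "the", "in", "to", "of", "and", "her", "his"}

def top_words(texts, n=3):
    """Return the n most common non-stop words across all texts."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z']+", text.lower()))
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

top = top_words(descriptions)
print(top)
```

On the real data, the list passed in could be df["description"].dropna().tolist().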

Key Insights

✅ Content Distribution & Trends

Movies dominate Netflix’s catalog compared to TV shows.
A significant rise in new content additions began around 2015, peaking in 2019.

✅ Country-Based Insights

The United States is the top producer of Netflix content, followed by India and the United Kingdom.

✅ Directors & Cast Analysis

The most frequently featured directors and actors were identified.

✅ Ratings & Maturity Levels

TV-MA is the most common rating.
Rating distribution varies between Movies and TV Shows.

✅ Duration & Season Analysis

The average duration of movies is around 99 minutes.
Most TV shows have only one season.
