Netflix Streaming Insights:
A Comprehensive Data Analysis
🔎 Project Overview
Introduction
Netflix is one of the world’s largest streaming platforms, offering a vast collection of movies and TV shows from different genres, countries, and time periods. This project aims to analyze Netflix’s content catalog to uncover key insights into content distribution, genre trends, country contributions, and viewing patterns.
Dataset Description
This analysis is based on the publicly available Netflix Movies and TV Shows dataset from Kaggle. It contains 8,807 records with 12 attributes that provide information on titles, directors, cast, countries, release years, ratings, and more.
Objectives of the Analysis
The primary goals of this project are:
- Content Analysis: Understand the distribution of Movies vs. TV Shows.
- Country-Based Insights: Identify which countries produce the most Netflix content.
- Genre Popularity: Determine the most frequent genres on the platform.
- Trends Over Time: Analyze how Netflix’s content library has grown over the years.
- Ratings & Maturity Levels: Explore the distribution of content ratings (TV-MA, PG, R, etc.).
- Duration Analysis: Examine movie durations and TV show season counts.
Methodology
- Data Cleaning: Handling missing values, transforming data types, and restructuring key columns.
- Exploratory Data Analysis (EDA): Uncovering trends through data visualization and statistical summaries.
- Power BI Visualization: Creating interactive dashboards to present findings effectively.
Expected Outcomes
By the end of this analysis, I will have a clear understanding of Netflix’s content strategy, helping answer critical business and user behavior questions. The final Power BI dashboard will allow for dynamic exploration of Netflix’s catalog.
📊 Summary of the Dataset
1. Dataset Overview
- Total Records: 8,807 entries
- Total Columns: 12 columns
- Content: Information about movies and TV shows available on Netflix, including metadata like title, director, cast, country, release year, duration, rating, and genre.
- Purpose of Analysis: To explore content distribution, country-based trends, director and cast insights, rating patterns, and duration analysis.
2. Column Descriptions
The dataset contains 12 columns, each representing different attributes of Netflix content:
- show_id – Unique identifier for each Netflix title.
- type – Indicates whether the content is a “Movie” or a “TV Show.”
- title – Name of the Movie or TV Show.
- director – Name(s) of the director(s). This column has significant missing values.
- cast – List of main actors/actresses in the Movie or TV Show. This column also has many missing values.
- country – Country of origin where the content was produced. Some entries are missing.
- date_added – Date when the content was added to Netflix. Useful for trend analysis.
- release_year – The year in which the Movie or TV Show was originally released.
- rating – Maturity rating of the content, such as PG, R, TV-MA, etc.
- duration – Represents the length of Movies (in minutes) or the number of seasons for TV Shows. Needs transformation for better analysis.
- listed_in – Categories or genres associated with the Movie or TV Show. Multiple genres are often listed in a single entry, requiring data splitting.
- description – A brief summary of the Movie or TV Show. Could be used for text analysis or classification.
❓ Key Questions
1. Content Distribution & Trends
- What is the distribution of Movies vs. TV Shows on Netflix?
- How has the number of new content additions changed over the years?
- What are the most common genres on Netflix?
2. Country-Based Insights
- Which countries produce the most Netflix content?
- How does content distribution vary by country (e.g., Which countries dominate certain genres)?
3. Directors & Cast Analysis
- Who are the most frequently featured directors?
- Which actors appear most often in Netflix productions?
4. Ratings & Maturity Levels
- What are the most common content ratings on Netflix?
- How does the distribution of ratings differ between Movies and TV Shows?
5. Duration & Season Analysis
- What is the average duration of movies on Netflix?
- How does the number of seasons vary across TV Shows?
🧹 Data Cleaning
Load & Inspect Data
After loading the CSV file, I inspected:
- The first few rows
- Column names and data types
- Basic summary to identify any immediate issues
import pandas as pd
# Load the dataset
df = pd.read_csv(r"D:\roy\roy files\Data\data project\netflix_titles.csv")
# Display basic information about the dataset
df.info()
df.head()
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|
s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable. |
s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth. |
s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Action & Adventure | To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war. |
s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series. |
s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV Comedies | In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life. |
import matplotlib.pyplot as plt
# Calculate missing values count per column
missing_values = df.isna().sum()
# Plot a bar chart of missing values
plt.figure(figsize=(10, 5))
bars = plt.bar(missing_values.index, missing_values.values, color='skyblue')
plt.xlabel("Columns")
plt.ylabel("Number of Missing Values")
plt.title("Missing Values Per Column in Netflix Dataset")
plt.xticks(rotation=45)
# Label each bar with its count
for bar in bars:
    yval = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width() / 2, yval,
        f'{yval:.0f}',
        ha='center', va='bottom', fontsize=10
    )
plt.show()
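Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up frame (the `sample` data is hypothetical, not taken from the Netflix file); the same expressions could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Netflix dataset
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["India", "Japan", None, "Spain"],
})

# Missing values per column as a count and as a percentage of all rows
missing_summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "percent": (sample.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```

Sorting by the count puts the columns that need the most attention at the top.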

Handling Missing Values
Director
- Merged df_credits and df_titles on the 'id' column.
- Extracted only the records where 'role' == "DIRECTOR" and created a new dataset, df_directors.
- Renamed the 'name' column in df_directors to 'director'.
- Merged df_directors with the Netflix dataset on the 'title' column.
- Filled missing director values using the enriched dataset.
- Filled any remaining missing values with "Unknown".
Cast
- Filled missing values with “Unknown”
Country
- Used cast information to infer missing country values
- Remaining missing values filled with “Unknown”
Date Added
- Filled missing values with the most frequent date in the dataset
Rating
- Filled missing values with the most frequent rating
Duration
- Extracted minutes for movies into duration_in_minutes (integer).
- Extracted seasons for TV shows into num_seasons (integer).
- Set non-numeric values to 0.
# Load the datasets
df_credits = pd.read_csv(r'D:\roy\roy files\Data\data project\Netflix\credits.csv')
df_titles = pd.read_csv(r'D:\roy\roy files\Data\data project\Netflix\titles.csv')
# Perform the merge operation on the 'id' column
df_merged = df_credits.merge(df_titles, on='id', how='left')
df_merged
# Keep only the rows where 'role' is "DIRECTOR" and rename 'name' to 'director'
df_directors = df_merged.loc[df_merged['role'] == 'DIRECTOR', ['title', 'name']]
df_directors = df_directors.rename(columns={'name': 'director'})
# Merge with the main Netflix dataset to fill missing directors
df_filled = df.merge(df_directors, on='title', how='left')
# Keep the original director where present, otherwise use the one found in the credits data
df_filled['director'] = df_filled['director_x'].combine_first(df_filled['director_y'])
# Drop redundant columns after merging
df_filled = df_filled.drop(columns=['director_x', 'director_y'])
# The merge repeats a title once per matching director row, so drop exact duplicate rows
# (the result must be assigned; drop_duplicates() does not modify the frame in place)
df = df_filled.drop_duplicates().copy()
# Fill remaining missing values in the 'director' and 'cast' columns with "Unknown"
df['director'] = df['director'].fillna("Unknown")
df['cast'] = df['cast'].fillna("Unknown")
# Create a mapping of actors to their countries based on existing data
actor_country_mapping = {}
# Use only rows with a known cast and country; skip the "Unknown" cast placeholder
# so that it is never treated as an actor
known_rows = df.dropna(subset=['cast', 'country'])
known_rows = known_rows[known_rows['cast'] != "Unknown"]
for index, row in known_rows.iterrows():
    actors = row['cast'].split(', ')
    country = row['country']
    for actor in actors:
        if actor not in actor_country_mapping:
            actor_country_mapping[actor] = country
# Function to fill missing country values based on cast information
def fill_country_based_on_cast(row):
    if pd.isna(row['country']):
        actors = row['cast'].split(', ') if pd.notna(row['cast']) else []
        for actor in actors:
            if actor in actor_country_mapping:
                return actor_country_mapping[actor]  # Assign the first found country
    return row['country']
# Apply the function to fill missing country values
df['country'] = df.apply(fill_country_based_on_cast, axis=1)
# Fill any remaining missing values in 'country' with "Unknown"
df['country'] = df['country'].fillna("Unknown")
# Find the most frequent date in the 'date_added' column
most_frequent_date = df['date_added'].mode()[0]
# Fill missing values in 'date_added' with the most frequent date
df['date_added'] = df['date_added'].fillna(most_frequent_date)
# Fill missing "rating" with the most frequent rating
most_frequent_rating = df['rating'].mode()[0]
df['rating'] = df['rating'].fillna(most_frequent_rating)
# Creating two separate columns for movies and TV shows
# (raw strings avoid invalid-escape warnings; expand=False returns a Series)
df['duration_in_minutes'] = df['duration'].str.extract(r'(\d+)\s*min', expand=False).astype(float)
df['num_seasons'] = df['duration'].str.extract(r'(\d+)\s*Season', expand=False).astype(float)
# Filling NaN values with 0 for consistency and converting to integer type
df['duration_in_minutes'] = df['duration_in_minutes'].fillna(0).astype(int)
df['num_seasons'] = df['num_seasons'].fillna(0).astype(int)
# Dropping the original "duration" column as it's now split
df = df.drop(columns=['duration'])
# Confirm that no missing values remain
df.isna().sum()
show_id 0
type 0
title 0
director 0
cast 0
country 0
date_added 0
release_year 0
rating 0
listed_in 0
description 0
duration_in_minutes 0
num_seasons 0
dtype: int64
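One caveat with enriching directors through a title-based merge: if the lookup table holds several director rows for the same title, a left merge repeats the Netflix row once per match. A minimal sketch of one way to guard against that, using made-up frames (`netflix`, `directors`, and the names in them are hypothetical), is to collapse the lookup to one row per title before merging:

```python
import pandas as pd

# Hypothetical stand-ins for the Netflix data and the director lookup
netflix = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "director": [None, "Known Person", None],
})
directors = pd.DataFrame({
    "title": ["Alpha", "Alpha", "Beta"],
    "director": ["Dir One", "Dir Two", "Other Person"],
})

# Collapse to one row per title so the merge cannot multiply rows
directors_per_title = (
    directors.groupby("title", as_index=False)["director"].agg(", ".join)
)

# validate="many_to_one" raises if the lookup still has duplicate titles
enriched = netflix.merge(
    directors_per_title, on="title", how="left",
    suffixes=("", "_lookup"), validate="many_to_one",
)

# Keep the original director where present, otherwise use the lookup value
enriched["director"] = enriched["director"].combine_first(enriched["director_lookup"])
enriched = enriched.drop(columns="director_lookup")
enriched["director"] = enriched["director"].fillna("Unknown")

print(enriched)
```

Joining co-directors with ", " matches how the `director` column already stores multiple names, so the later split on commas still works. Matching on title alone can also pair unrelated works that share a name, which this sketch does not address.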
Data Type Validation and Conversion
✅ Converted date_added to datetime format (datetime64[ns])
- Ensures proper handling of time-based analysis.
✅ Converted release_year to integer (int64)
- Ensures consistency for numerical operations.
✅ Converted rating to categorical type (category)
- Optimizes storage and improves analysis.
✅ Converted duration values into two separate numerical columns:
- duration_in_minutes (int64) → For movies with minutes.
- num_seasons (int64) → For TV shows with season count.
✅ Ensured all remaining columns maintain appropriate data types:
- show_id, type, title, director, cast, country, listed_in, and description remain as object (string-based).
# Checking current data types of each column
df.dtypes
show_id object
type object
title object
director object
cast object
country object
date_added object
release_year int64
rating object
duration int64
listed_in object
description object
dtype: object
# Convert 'date_added' to datetime format
# (stripping whitespace first so padded date strings are not coerced to NaT)
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')
# Flag rows where a duration value was entered in the 'rating' column
misplaced_duration_mask = df["rating"].str.contains(r"^\d+\s*min$", na=False)
# Convert extracted values to integer and place them in 'duration_in_minutes'
# (expand=False returns a Series, which aligns cleanly with the .loc selection)
df.loc[misplaced_duration_mask, "duration_in_minutes"] = (
    df.loc[misplaced_duration_mask, "rating"].str.extract(r"(\d+)", expand=False).astype(int)
)
# Replace those misplaced values in 'rating' with NaN
df.loc[misplaced_duration_mask, "rating"] = None
# Convert 'rating' to categorical type
df["rating"] = df["rating"].astype("category")
# Convert duration column to integer type
df['duration_in_minutes'] = df['duration_in_minutes'].fillna(0).astype(int)
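To see the duration split and the misplaced-rating repair in isolation, here is a minimal sketch on a made-up frame. The `toy` rows are hypothetical and only mimic the shapes of values described above.

```python
import pandas as pd

# Hypothetical rows: the last one has its duration stored in 'rating'
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "rating": ["PG-13", "TV-MA", "74 min"],
    "duration": ["90 min", "2 Seasons", None],
})

# Split 'duration' into minutes (movies) and seasons (TV shows)
toy["duration_in_minutes"] = (
    toy["duration"].str.extract(r"(\d+)\s*min", expand=False).astype(float)
)
toy["num_seasons"] = (
    toy["duration"].str.extract(r"(\d+)\s*Season", expand=False).astype(float)
)

# Move durations that were entered in the 'rating' column
misplaced = toy["rating"].str.fullmatch(r"\d+\s*min", na=False)
toy.loc[misplaced, "duration_in_minutes"] = (
    toy.loc[misplaced, "rating"].str.extract(r"(\d+)", expand=False).astype(float)
)
toy.loc[misplaced, "rating"] = None

# Replace remaining gaps with 0 and store both columns as integers
toy["duration_in_minutes"] = toy["duration_in_minutes"].fillna(0).astype(int)
toy["num_seasons"] = toy["num_seasons"].fillna(0).astype(int)

print(toy)
```

Doing the repair before the final integer conversion keeps the recovered minutes and leaves the affected ratings explicitly missing.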
📈 Exploratory Data Analysis (EDA)
Distribution of Movies vs. TV Shows
import matplotlib.pyplot as plt
import seaborn as sns
# Count distribution of Movies vs TV Shows
content_distribution = df['type'].value_counts()
# Plot
plt.figure(figsize=(6,4))
sns.barplot(x=content_distribution.index, y=content_distribution.values)
plt.xlabel("Content Type")
plt.ylabel("Count")
plt.title("Distribution of Movies vs TV Shows on Netflix")
# Show plot
plt.show()

The chart indicates that there are significantly more movies than TV shows in the dataset.
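The gap can also be stated as a share of the catalog. The sketch below uses a tiny made-up `toy` frame (hypothetical data) to show how counts and percentages could be tabulated side by side; on the real data the same expressions would be applied to `df['type']`.

```python
import pandas as pd

# Hypothetical sample of content types
toy = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie"]})

# Counts and percentage share of each content type
type_summary = pd.DataFrame({
    "count": toy["type"].value_counts(),
    "percent": (toy["type"].value_counts(normalize=True) * 100).round(1),
})

print(type_summary)
```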
Analyzing new content additions over the years
# Extract year from 'date_added' column
df['year_added'] = df['date_added'].dt.year
# Count the number of additions per year
yearly_additions = df['year_added'].value_counts().sort_index()
# Plot the trend of new content additions over the years
plt.figure(figsize=(10,5))
plt.plot(yearly_additions.index, yearly_additions.values, marker='o', linestyle='-', linewidth=2)
plt.xlabel('Year')
plt.ylabel('Number of Additions')
plt.title('Trend of New Content Additions Over the Years on Netflix')
plt.grid(True)
# Show the plot
plt.show()

The chart indicates that there was a significant rise in additions from around 2015, peaking in 2019 before slightly declining.
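A possible refinement is to split the yearly additions by content type, which shows whether movies or TV shows drive the growth. This is a minimal sketch on made-up rows (the `toy` frame and its dates are hypothetical); rows without a usable date are left out rather than imputed.

```python
import pandas as pd

# Hypothetical sample with one missing date
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": ["September 25, 2021", "September 24, 2021", "January 10, 2020", None],
})

toy["date_added"] = pd.to_datetime(toy["date_added"], errors="coerce")

# Keep only rows with a valid date, then derive the year
dated = toy.dropna(subset=["date_added"]).copy()
dated["year_added"] = dated["date_added"].dt.year

# One row per year, one column per content type
additions_by_type = (
    dated.groupby(["year_added", "type"]).size().unstack(fill_value=0)
)

print(additions_by_type)
```

Calling `additions_by_type.plot()` on the result would draw one line per content type.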
Finding the most common genres on Netflix
import matplotlib.pyplot as plt
from collections import Counter
# Ensure all values are strings and remove brackets if any
df['listed_in'] = df['listed_in'].astype(str).str.replace(r"[\[\]']", "", regex=True)
# Flatten the list of genres and count occurrences
all_genres = [genre.strip() for sublist in df['listed_in'].dropna().str.split(',') for genre in sublist]
genre_counts = Counter(all_genres)
# Convert to DataFrame for visualization
genre_df = pd.DataFrame(genre_counts.items(), columns=['Genre', 'Count']).sort_values(by='Count', ascending=False)
# Plot the top 10 most common genres
plt.figure(figsize=(12, 6))
plt.barh(genre_df['Genre'].head(10), genre_df['Count'].head(10), color='steelblue')
plt.xlabel("Count")
plt.ylabel("Genre")
plt.title("Most Common Genres on Netflix")
plt.gca().invert_yaxis() # Invert y-axis for better readability
plt.show()
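The same genre tally can be written without `Counter`, using pandas string splitting and `explode`. This is a sketch of an alternative, not a replacement for the approach above; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample with comma-separated genres
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Docuseries, Reality TV"],
})

# One row per (title, genre), then count each genre
genre_counts = (
    toy["listed_in"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

print(genre_counts.head(10))
```

Because the result is already a sorted Series, `genre_counts.head(10).plot(kind='barh')` could feed the chart directly.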

Count the number of titles produced by each country
# Ensure all values are strings and remove brackets if any
df['country'] = df['country'].astype(str).str.replace(r"[\[\]']", "", regex=True)
# Flatten the list of countries and count occurrences
all_countries = [country.strip() for sublist in df['country'].dropna().str.split(',') for country in sublist]
country_counts = Counter(all_countries)
# Convert to DataFrame for visualization
country_df = pd.DataFrame(country_counts.items(), columns=['Country', 'Count']).sort_values(by='Count', ascending=False)
# Plot the top 10 countries producing the most Netflix content
plt.figure(figsize=(12, 6))
plt.barh(country_df['Country'].head(10), country_df['Count'].head(10), color='steelblue')
plt.xlabel("Count")
plt.ylabel("Country")
plt.title("Top 10 Countries Producing Netflix Content")
plt.gca().invert_yaxis() # Invert y-axis for better readability
plt.show()
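The key questions also ask which countries dominate certain genres. One way to approach that, sketched here on made-up rows, is to expand each title into every country and genre pair and cross-tabulate. Counting a title once per pair is an assumption of this sketch, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample with multi-valued country and genre fields
toy = pd.DataFrame({
    "country": ["United States, India", "India", "United States"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Comedies"],
})

# Expand to one row per (country, genre) pair
pairs = (
    toy.assign(
        country=toy["country"].str.split(","),
        genre=toy["listed_in"].str.split(","),
    )
    .explode("country")
    .reset_index(drop=True)
    .explode("genre")
    .reset_index(drop=True)
)
pairs["country"] = pairs["country"].str.strip()
pairs["genre"] = pairs["genre"].str.strip()

# Rows are countries, columns are genres, cells are title counts
country_genre = pd.crosstab(pairs["country"], pairs["genre"])

print(country_genre)
```

On the full dataset it would probably help to restrict the table to the top countries and genres before plotting it, for example as a heatmap.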

Count the most frequently featured directors
# Count occurrences of each director (exclude "Unknown")
director_counts = df['director'][df['director']!="Unknown"].dropna().str.split(',').explode().str.strip().value_counts()
# Plot the top 10 most frequently featured directors
plt.figure(figsize=(12, 6))
director_counts.head(10).plot(kind='barh', color='steelblue')
plt.xlabel("Number of Titles")
plt.ylabel("Director")
plt.title("Top 10 Most Frequently Featured Directors on Netflix")
plt.gca().invert_yaxis()
for index, value in enumerate(director_counts.head(10)):
    plt.text(value, index, str(value))
plt.show()

Count the most frequently appearing actors
# Count occurrences of each actor
actor_counts = df['cast'][df['cast']!="Unknown"].str.split(',').explode().str.strip().value_counts()
# Plot the top 10 most frequently featured actors
plt.figure(figsize=(12, 6))
actor_counts.head(10).plot(kind='barh', color='steelblue')
plt.xlabel("Number of Titles")
plt.ylabel("Actor")
plt.title("Top 10 Most Frequently Featured Actors on Netflix")
plt.gca().invert_yaxis()
for index, value in enumerate(actor_counts.head(10)):
    plt.text(value, index, str(value))
plt.show()

Most common content ratings on Netflix
# Count occurrences of each rating
rating_counts = df['rating'].value_counts()
# Plot the distribution of content ratings
plt.figure(figsize=(12, 6))
rating_counts.plot(kind='bar', color='steelblue')
plt.xlabel("Content Rating")
plt.ylabel("Count")
plt.title("Most Common Content Ratings on Netflix")
plt.xticks(rotation=45)
plt.show()

Compare Rating Distribution Between Movies & TV Shows
# Count ratings per content type in one table so both series share the same rating order
# (two separate value_counts() results are sorted differently, which misaligns overlaid bars)
rating_by_type = pd.crosstab(df['rating'], df['type'])
# Create a grouped bar chart
fig, ax = plt.subplots(figsize=(12, 6))
rating_by_type.plot(kind='bar', color=['blue', 'red'], alpha=0.6, ax=ax)
plt.xlabel("Content Rating")
plt.ylabel("Count")
plt.title("Distribution of Ratings for Movies vs TV Shows")
plt.xticks(rotation=45)
plt.legend(title="Content Type")
plt.show()
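Because there are far more movies than TV shows, raw counts make the two distributions hard to compare. A minimal sketch of a normalized comparison, using a made-up `toy` frame (hypothetical data), is to compute each rating's share within its own content type:

```python
import pandas as pd

# Hypothetical sample of types and ratings
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "rating": ["TV-MA", "PG-13", "TV-MA", "TV-MA", "TV-14"],
})

# Percentage of each rating within Movies and within TV Shows (rows sum to ~100)
rating_share = (
    pd.crosstab(toy["type"], toy["rating"], normalize="index") * 100
).round(1)

print(rating_share)
```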

Average duration of movies on Netflix.
# Filter movies and calculate average duration
avg_movie_duration = df[df['type'] == 'Movie']['duration_in_minutes'].mean()
# Plot histogram of movie durations
plt.figure(figsize=(12, 6))
sns.histplot(df[df['type'] == 'Movie']['duration_in_minutes'], bins=30, kde=True, color='blue')
plt.axvline(avg_movie_duration, color='red', linestyle='dashed', linewidth=2, label=f'Avg: {avg_movie_duration:.2f} min')
plt.xlabel("Movie Duration (minutes)")
plt.ylabel("Count")
plt.title("Distribution of Movie Durations on Netflix")
plt.legend()
plt.show()
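Since unknown durations were stored as 0 during cleaning, any such movies pull the mean down. The sketch below, on made-up numbers (the `toy` frame is hypothetical), compares the mean with and without those placeholders and adds the median as a more robust summary.

```python
import pandas as pd

# Hypothetical sample; 0 marks an unknown movie duration
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show"],
    "duration_in_minutes": [90, 110, 0, 0],
})

movies = toy[toy["type"] == "Movie"]
# Exclude the 0 placeholders before summarizing
known = movies.loc[movies["duration_in_minutes"] > 0, "duration_in_minutes"]

print(f"Mean incl. placeholders: {movies['duration_in_minutes'].mean():.2f} min")
print(f"Mean of known durations: {known.mean():.2f} min")
print(f"Median of known durations: {known.median():.2f} min")
```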

Variation in the number of seasons across TV shows.
# Filter TV shows and count occurrences of seasons
season_counts = df[df['type'] == 'TV Show']['num_seasons'].value_counts().sort_index()
# Plot bar chart
plt.figure(figsize=(12, 6))
season_counts.plot(kind='bar', color='steelblue')
plt.xlabel("Number of Seasons")
plt.ylabel("Count of TV Shows")
plt.title("Distribution of Number of Seasons in TV Shows")
plt.xticks(rotation=0)
plt.show()
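To put a number on how common single-season shows are, the counts can be converted to percentages. This is a minimal sketch on a made-up `toy` frame (hypothetical data):

```python
import pandas as pd

# Hypothetical sample; movies carry 0 seasons and are filtered out
toy = pd.DataFrame({
    "type": ["TV Show", "TV Show", "TV Show", "TV Show", "Movie"],
    "num_seasons": [1, 1, 2, 1, 0],
})

shows = toy[toy["type"] == "TV Show"]

# Percentage of TV shows at each season count
season_share = (
    shows["num_seasons"].value_counts(normalize=True).sort_index() * 100
).round(1)

print(season_share)
```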

Most Frequently Appearing Words in Descriptions
from wordcloud import WordCloud
# Combine all descriptions into a single text
text = " ".join(str(desc) for desc in df["description"].dropna())
# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white", colormap="viridis", max_words=100).generate(text)
# Display the word cloud
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Most Frequently Appearing Words in Descriptions", fontsize=14)
plt.show()

Key Insights
✅ Content Distribution & Trends
- Movies dominate Netflix’s catalog compared to TV shows.
- A significant rise in new content additions began around 2015, peaking in 2019.
✅ Country-Based Insights
- The United States is the top producer of Netflix content, followed by India and the United Kingdom.
✅ Directors & Cast Analysis
- The most frequently featured directors and actors were identified.
✅ Ratings & Maturity Levels
- TV-MA is the most common rating.
- Rating distribution varies between Movies and TV Shows.
✅ Duration & Season Analysis
- The average duration of movies is around 99 minutes.
- Most TV shows have only one season.