Netflix Streaming Insights:
A Comprehensive Data Analysis

🔎 Project Overview

📌 Introduction

Netflix is one of the world’s largest streaming platforms, offering a vast collection of movies and TV shows across genres, countries, and time periods. This project analyzes Netflix’s content catalog to uncover insights into content distribution, genre trends, country contributions, and how the catalog has grown over time.


📊 Dataset Description

This analysis is based on the publicly available Netflix Movies and TV Shows dataset from Kaggle. It contains 8,807 records with 12 attributes that provide information on titles, directors, cast, countries, release years, ratings, and more.


🎯 Objectives of the Analysis

The primary goals of this project are:

  • Content Analysis: Understand the distribution of Movies vs. TV Shows.
  • Country-Based Insights: Identify which countries produce the most Netflix content.
  • Genre Popularity: Determine the most frequent genres on the platform.
  • Trends Over Time: Analyze how Netflix’s content library has grown over the years.
  • Ratings & Maturity Levels: Explore the distribution of content ratings (TV-MA, PG, R, etc.).
  • Duration Analysis: Examine movie durations and TV show season counts.

🛠️ Methodology

  • Data Cleaning: Handling missing values, transforming data types, and restructuring key columns.
  • Exploratory Data Analysis (EDA): Uncovering trends through data visualization and statistical summaries.
  • Power BI Visualization: Creating interactive dashboards to present findings effectively.

📈 Expected Outcomes

By the end of this analysis, I will have a clear understanding of Netflix’s content strategy, helping answer critical business and user behavior questions. The final Power BI dashboard will allow for dynamic exploration of Netflix’s catalog.

📊 Summary of the Dataset

1. Dataset Overview

  • Total Records: 8,807 entries
  • Total Columns: 12 columns
  • Content: Information about movies and TV shows available on Netflix, including metadata like title, director, cast, country, release year, duration, rating, and genre.
  • Purpose of Analysis: To explore content distribution, country-based trends, director and cast insights, rating patterns, and duration analysis.

2. Column Descriptions

The dataset contains 12 columns, each representing different attributes of Netflix content:

  • show_id – Unique identifier for each Netflix title.
  • type – Indicates whether the content is a “Movie” or a “TV Show.”
  • title – Name of the Movie or TV Show.
  • director – Name(s) of the director(s). This column has significant missing values.
  • cast – List of main actors/actresses in the Movie or TV Show. This column also has many missing values.
  • country – Country of origin where the content was produced. Some entries are missing.
  • date_added – Date when the content was added to Netflix. Useful for trend analysis.
  • release_year – The year in which the Movie or TV Show was originally released.
  • rating – Maturity rating of the content, such as PG, R, TV-MA, etc.
  • duration – Represents the length of Movies (in minutes) or the number of seasons for TV Shows. Needs transformation for better analysis.
  • listed_in – Categories or genres associated with the Movie or TV Show. Multiple genres are often listed in a single entry, requiring data splitting.
  • description – A brief summary of the Movie or TV Show. Could be used for text analysis or classification.
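
The note on listed_in says that multi-genre entries need splitting. A minimal sketch of one way to do that with pandas is shown here; the three-row `sample` frame is a toy stand-in for the real dataset, and the `genre` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Toy rows shaped like the real columns (a stand-in, not the full dataset)
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "listed_in": [
        "Documentaries",
        "International TV Shows, TV Dramas",
        "Crime TV Shows, International TV Shows",
    ],
})

# One row per (title, genre): split on commas, explode, then trim whitespace
genres = (
    sample.assign(genre=sample["listed_in"].str.split(","))
          .explode("genre")
          .assign(genre=lambda d: d["genre"].str.strip())
)

# Count how often each individual genre appears
genre_counts = genres["genre"].value_counts()
print(genre_counts)
```

Keeping show_id on every exploded row makes it possible to join the per-genre rows back to the original titles later.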

❓ Key Questions

1. Content Distribution & Trends

  • What is the distribution of Movies vs. TV Shows on Netflix?
  • How has the number of new content additions changed over the years?
  • What are the most common genres on Netflix?

2. Country-Based Insights

  • Which countries produce the most Netflix content?
  • How does content distribution vary by country (e.g., Which countries dominate certain genres)?

3. Directors & Cast Analysis

  • Who are the most frequently featured directors?
  • Which actors appear most often in Netflix productions?

4. Ratings & Maturity Levels

  • What are the most common content ratings on Netflix?
  • How does the distribution of ratings differ between Movies and TV Shows?

5. Duration & Season Analysis

  • What is the average duration of movies on Netflix?
  • How does the number of seasons vary across TV Shows?

🧹 Data Cleaning

Load & Inspect Data

The first step loads the CSV file and checks:
  • The first few rows
  • Column names and data types
  • Basic summary to identify any immediate issues
				
import pandas as pd

# Load the dataset
df = pd.read_csv(r"D:\roy\roy files\Data\data project\netflix_titles.csv")

# Display basic information about the dataset
df.info()
df.head()
				
			
				
					<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
				
			
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description
s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.
s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Action & Adventure | To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series.
s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV Comedies | In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life.
				
import matplotlib.pyplot as plt

# Count missing values per column
missing_values = df.isna().sum()

# Plot a bar chart of missing values per column
plt.figure(figsize=(10, 5))
bars = plt.bar(missing_values.index, missing_values.values, color='skyblue')
plt.xlabel("Columns")
plt.ylabel("Number of Missing Values")
plt.title("Missing Values Per Column in Netflix Dataset")
plt.xticks(rotation=45)

# Label each bar with its missing-value count
for bar in bars:
    yval = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width() / 2, yval,
        f'{yval:.0f}',
        ha='center', va='bottom', fontsize=10
    )
plt.show()

				
			

✅ Director

  • Merged df_credits and df_titles on the 'id' column.
  • Extracted only the records where 'role' == "DIRECTOR" and created a new dataset df_directors.
  • Renamed 'name' column in df_directors to 'director'.
  • Merged df_directors with the Netflix dataset based on the 'title' column.
  • Filled missing director values using the enriched dataset.
  • Any remaining missing values were filled with "Unknown".

✅ Cast

  • Filled missing values with “Unknown”

✅ Country

  • Used cast information to infer missing country values
  • Remaining missing values filled with “Unknown”

✅ Date Added

  • Filled missing values with the most frequent date in the dataset

✅ Rating

  • Filled missing values with the most frequent rating

✅ Duration

  • Extracted minutes for movies into duration_in_minutes (integer)
  • Extracted seasons for TV shows into num_seasons (integer)
  • Non-numeric values set to 0
				
# Load the supplementary datasets
df_credits = pd.read_csv(r'D:\roy\roy files\Data\data project\Netflix\credits.csv')
df_titles = pd.read_csv(r'D:\roy\roy files\Data\data project\Netflix\titles.csv')

# Merge credits with titles on the 'id' column
df_merged = df_credits.merge(df_titles, on='id', how='left')

# Keep only director credits and rename 'name' to 'director'
df_directors = (
    df_merged.loc[df_merged['role'] == 'DIRECTOR', ['title', 'name']]
    .rename(columns={'name': 'director'})
)

# Merge with the main Netflix dataset to fill missing directors
df_filled = df.merge(df_directors, on='title', how='left')

# Prefer the original director; fall back to the value from the enriched dataset
df_filled['director'] = df_filled['director_x'].combine_first(df_filled['director_y'])

# Drop redundant columns after merging
df_filled = df_filled.drop(columns=['director_x', 'director_y'])

# The title merge repeats a row when a title matches several director credits,
# so keep one row per show_id (the result must be assigned to take effect)
df = df_filled.drop_duplicates(subset='show_id').reset_index(drop=True)

# Fill remaining missing values in the 'director' and 'cast' columns with "Unknown"
df['director'] = df['director'].fillna("Unknown")
df['cast'] = df['cast'].fillna("Unknown")
				
			
				
# Create a mapping of actors to their countries based on existing data
actor_country_mapping = {}

# Use rows with a real cast list and a known country.
# Missing cast was filled with "Unknown" earlier, so that placeholder is excluded;
# otherwise "Unknown" would be treated as an actor tied to a country.
known = df[df['cast'] != "Unknown"].dropna(subset=['cast', 'country'])
for _, row in known.iterrows():
    for actor in row['cast'].split(', '):
        # Keep the first country seen for each actor
        actor_country_mapping.setdefault(actor, row['country'])

# Function to fill missing country values based on cast information
def fill_country_based_on_cast(row):
    if pd.isna(row['country']) and pd.notna(row['cast']):
        for actor in row['cast'].split(', '):
            if actor in actor_country_mapping:
                return actor_country_mapping[actor]  # Assign the first found country
    return row['country']

# Apply the function to fill missing country values
df['country'] = df.apply(fill_country_based_on_cast, axis=1)

# Fill any remaining missing values in 'country' with "Unknown"
df['country'] = df['country'].fillna("Unknown")

				
			
				
# Find the most frequent date in the 'date_added' column
most_frequent_date = df['date_added'].mode()[0]

# Fill missing values in 'date_added' with the most frequent date
# (assign the result; in-place fillna on a column selection is unreliable in recent pandas)
df['date_added'] = df['date_added'].fillna(most_frequent_date)
				
			
				
# Fill missing "rating" with the most frequent rating
most_frequent_rating = df['rating'].mode()[0]
df['rating'] = df['rating'].fillna(most_frequent_rating)

# Create separate columns for movie minutes and TV show seasons
# (raw strings avoid invalid-escape warnings; expand=False returns a Series)
df['duration_in_minutes'] = df['duration'].str.extract(r'(\d+)\s*min', expand=False).astype(float)
df['num_seasons'] = df['duration'].str.extract(r'(\d+)\s*Season', expand=False).astype(float)

# Rows that do not match become NaN; fill them with 0 and convert to integer
df['duration_in_minutes'] = df['duration_in_minutes'].fillna(0).astype(int)
df['num_seasons'] = df['num_seasons'].fillna(0).astype(int)

# Drop the original "duration" column now that it is split
df = df.drop(columns=['duration'])

# Confirm that no missing values remain
df.isna().sum()

				
			
				
					show_id                0
type                   0
title                  0
director               0
cast                   0
country                0
date_added             0
release_year           0
rating                 0
listed_in              0
description            0
duration_in_minutes    0
num_seasons            0
dtype: int64
				
			

✅ Converted date_added to datetime format (datetime64[ns])

  • Ensures proper handling of time-based analysis.

✅ Converted release_year to integer (int64)

  • Ensures consistency for numerical operations.

✅ Converted rating to categorical type (category)

  • Optimizes storage and improves analysis.
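
As a rough illustration of the storage claim, the sketch below compares the memory footprint of the same labels stored as plain Python strings and as a category. The `ratings` series is a hypothetical stand-in (four labels repeated 2,500 times), not the real column, so the exact byte counts will differ on the actual data.

```python
import pandas as pd

# Hypothetical column: a handful of rating labels repeated many times
ratings = pd.Series(["TV-MA", "PG-13", "TV-14", "R"] * 2500)

as_object = ratings.astype(object)        # one Python string per row
as_category = ratings.astype("category")  # small integer codes plus a lookup table

# deep=True includes the memory held by the strings themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving comes from storing each distinct label once; it shrinks as the number of distinct values approaches the number of rows.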

✅ Converted duration values into two separate numerical columns:

  • duration_in_minutes (int64) → For movies with minutes.
  • num_seasons (int64) → For TV shows with season count.

✅ Ensured all remaining columns maintain appropriate data types:

    • show_id, type, title, director, cast, country, listed_in, description remain as object (string-based).
				
					# Checking current data types of each column
df.dtypes
				
			
				
					show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration         int64
listed_in       object
description     object
dtype: object
				
			
				
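# Convert 'date_added' to datetime format
# (strip stray whitespace first so entries with leading or trailing spaces are not coerced to NaT)
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')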
				
			
				
# Find duration values that were misplaced in the 'rating' column
misplaced_duration_mask = df["rating"].str.contains(r"^\d+\s*min$", na=False)

# Move them into 'duration_in_minutes'
# (expand=False returns a Series, so the assignment aligns on the row index;
# a one-column DataFrame would not match the target column name)
df.loc[misplaced_duration_mask, "duration_in_minutes"] = (
    df.loc[misplaced_duration_mask, "rating"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Replace the misplaced values in 'rating' with the most frequent valid rating
df.loc[misplaced_duration_mask, "rating"] = (
    df.loc[~misplaced_duration_mask, "rating"].mode()[0]
)

# Convert 'rating' to categorical type
df["rating"] = df["rating"].astype("category")

# Ensure the duration column is integer typed
df['duration_in_minutes'] = df['duration_in_minutes'].fillna(0).astype(int)
				
			

📈 Exploratory Data Analysis (EDA)

Distribution of Movies vs. TV Shows
				
import matplotlib.pyplot as plt
import seaborn as sns

# Count distribution of Movies vs TV Shows
content_distribution = df['type'].value_counts()

# Plot
plt.figure(figsize=(6,4))
sns.barplot(x=content_distribution.index, y=content_distribution.values)
plt.xlabel("Content Type")
plt.ylabel("Count")
plt.title("Distribution of Movies vs TV Shows on Netflix")

# Show plot
plt.show()
				
			

The chart indicates that there are significantly more movies than TV shows in the dataset.
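
To put a number on "significantly more", the share of each content type can be computed directly. This is a minimal sketch on a hypothetical five-row `types` series; on the real data the same call would be applied to df['type'].

```python
import pandas as pd

# Hypothetical 'type' column (the real one has 8,807 entries)
types = pd.Series(["Movie", "Movie", "TV Show", "Movie", "TV Show"], name="type")

# normalize=True returns proportions instead of raw counts
share = types.value_counts(normalize=True).mul(100).round(1)
print(share)
```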

				
					# Extract year from 'date_added' column
df['year_added'] = df['date_added'].dt.year

# Count the number of additions per year
yearly_additions = df['year_added'].value_counts().sort_index()

# Plot the trend of new content additions over the years
plt.figure(figsize=(10,5))
plt.plot(yearly_additions.index, yearly_additions.values, marker='o', linestyle='-', linewidth=2)
plt.xlabel('Year')
plt.ylabel('Number of Additions')
plt.title('Trend of New Content Additions Over the Years on Netflix')
plt.grid(True)

# Show the plot
plt.show()
				
			

The chart indicates that there was a significant rise in additions from around 2015, peaking in 2019 before slightly declining.
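
The same trend can be broken down by content type to see whether Movies or TV Shows drive the growth. The sketch below is one possible approach on a toy `sample` frame whose rows are illustrative only; it assumes dates in the "Month day, year" form shown in the data preview.

```python
import pandas as pd

# Toy rows; the third date has a leading space to mimic untidy input
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "date_added": [
        "September 25, 2021",
        "September 24, 2021",
        " August 4, 2017",
        "January 1, 2019",
        "July 6, 2019",
    ],
})

# Strip whitespace, then parse with an explicit format
sample["date_added"] = pd.to_datetime(sample["date_added"].str.strip(), format="%B %d, %Y")
sample["year_added"] = sample["date_added"].dt.year

# Rows are years, columns are content types, cells are counts of additions
additions = sample.groupby(["year_added", "type"]).size().unstack(fill_value=0)
print(additions)
```

Calling additions.plot(marker="o") on the resulting table would draw one line per content type.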

				
import matplotlib.pyplot as plt
from collections import Counter

# Ensure all values are strings and remove brackets if any
df['listed_in'] = df['listed_in'].astype(str).str.replace(r"[\[\]']", "", regex=True)

# Flatten the list of genres and count occurrences
all_genres = [genre.strip() for sublist in df['listed_in'].dropna().str.split(',') for genre in sublist]
genre_counts = Counter(all_genres)

# Convert to DataFrame for visualization
genre_df = pd.DataFrame(genre_counts.items(), columns=['Genre', 'Count']).sort_values(by='Count', ascending=False)

# Plot the top 10 most common genres
plt.figure(figsize=(12, 6))
plt.barh(genre_df['Genre'].head(10), genre_df['Count'].head(10), color='steelblue')
plt.xlabel("Count")
plt.ylabel("Genre")
plt.title("Most Common Genres on Netflix")
plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.show()

				
			
				
					# Ensure all values are strings and remove brackets if any
df['country'] = df['country'].astype(str).str.replace(r"[\[\]']", "", regex=True)

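# Flatten the list of countries and count occurrences,
# skipping blank entries and the "Unknown" placeholder added during cleaning
all_countries = [
    country.strip()
    for sublist in df['country'].str.split(',')
    for country in sublist
    if country.strip() and country.strip() != "Unknown"
]
country_counts = Counter(all_countries)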

# Convert to DataFrame for visualization
country_df = pd.DataFrame(country_counts.items(), columns=['Country', 'Count']).sort_values(by='Count', ascending=False)

# Plot the top 10 countries producing the most Netflix content
plt.figure(figsize=(12, 6))
plt.barh(country_df['Country'].head(10), country_df['Count'].head(10), color='steelblue')
plt.xlabel("Count")
plt.ylabel("Country")
plt.title("Top 10 Countries Producing Netflix Content")
plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.show()

				
			
				
					# Count occurrences of each director (exclude "Unknown")
director_counts = df['director'][df['director']!="Unknown"].dropna().str.split(',').explode().str.strip().value_counts()

# Plot the top 10 most frequently featured directors
plt.figure(figsize=(12, 6))
director_counts.head(10).plot(kind='barh', color='steelblue')
plt.xlabel("Number of Titles")
plt.ylabel("Director")
plt.title("Top 10 Most Frequently Featured Directors on Netflix")
plt.gca().invert_yaxis()

for index, value in enumerate(director_counts.head(10)):
    plt.text(value, index, str(value))
plt.show()
				
			
				
					# Count occurrences of each actor
actor_counts = df['cast'][df['cast']!="Unknown"].str.split(',').explode().str.strip().value_counts()

# Plot the top 10 most frequently featured actors
plt.figure(figsize=(12, 6))
actor_counts.head(10).plot(kind='barh', color='steelblue')
plt.xlabel("Number of Titles")
plt.ylabel("Actor")
plt.title("Top 10 Most Frequently Featured Actors on Netflix")
plt.gca().invert_yaxis()
for index, value in enumerate(actor_counts.head(10)):
    plt.text(value, index, str(value))
plt.show()

				
			
				
					# Count occurrences of each rating
rating_counts = df['rating'].value_counts()

# Plot the distribution of content ratings
plt.figure(figsize=(12, 6))
rating_counts.plot(kind='bar', color='steelblue')
plt.xlabel("Content Rating")
plt.ylabel("Count")
plt.title("Most Common Content Ratings on Netflix")
plt.xticks(rotation=45)
plt.show()

				
			
				
# Count ratings for Movies and TV Shows on a shared rating index.
# Plotting two separate value_counts() results on one axis misaligns the bars,
# because each result is sorted in its own order.
rating_by_type = pd.crosstab(df['rating'], df['type'])

# Order ratings by total count so the largest groups come first
rating_order = rating_by_type.sum(axis=1).sort_values(ascending=False).index
rating_by_type = rating_by_type.loc[rating_order]

# Create a grouped bar chart (columns are 'Movie' and 'TV Show')
fig, ax = plt.subplots(figsize=(12, 6))
rating_by_type.plot(kind='bar', color=['blue', 'red'], alpha=0.6, ax=ax)

plt.xlabel("Content Rating")
plt.ylabel("Count")
plt.title("Distribution of Ratings for Movies vs TV Shows")
plt.xticks(rotation=45)
plt.legend(title="Content Type")
plt.show()

				
			
				
					# Filter movies and calculate average duration
avg_movie_duration = df[df['type'] == 'Movie']['duration_in_minutes'].mean()

# Plot histogram of movie durations
plt.figure(figsize=(12, 6))
sns.histplot(df[df['type'] == 'Movie']['duration_in_minutes'], bins=30, kde=True, color='blue')
plt.axvline(avg_movie_duration, color='red', linestyle='dashed', linewidth=2, label=f'Avg: {avg_movie_duration:.2f} min')
plt.xlabel("Movie Duration (minutes)")
plt.ylabel("Count")
plt.title("Distribution of Movie Durations on Netflix")
plt.legend()
plt.show()

				
			
				
					# Filter TV shows and count occurrences of seasons
season_counts = df[df['type'] == 'TV Show']['num_seasons'].value_counts().sort_index()

# Plot bar chart
plt.figure(figsize=(12, 6))
season_counts.plot(kind='bar', color='steelblue')
plt.xlabel("Number of Seasons")
plt.ylabel("Count of TV Shows")
plt.title("Distribution of Number of Seasons in TV Shows")
plt.xticks(rotation=0)
plt.show()
				
			
				
from wordcloud import WordCloud

# Combine all descriptions into a single text
text = " ".join(str(desc) for desc in df["description"].dropna())

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white", colormap="viridis", max_words=100).generate(text)

# Display the word cloud
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Most Frequently Appearing Words in Descriptions", fontsize=14)
plt.show()

				
			

✅ Content Distribution & Trends

  • Movies dominate Netflix’s catalog compared to TV shows.
  • A significant rise in new content additions began around 2015, peaking in 2019.

✅ Country-Based Insights

  • The United States is the top producer of Netflix content, followed by India and the United Kingdom.
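
The key questions also ask which countries dominate certain genres. One way to approach that is to explode both the country and listed_in columns and cross-tabulate them. The sketch below uses a toy `sample` frame with made-up rows, and `explode_column` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Toy rows with comma-separated countries and genres (illustrative values)
sample = pd.DataFrame({
    "country": ["United States", "India", "United States, India", "United Kingdom"],
    "listed_in": ["Documentaries", "Dramas, Comedies", "Dramas", "Documentaries, Dramas"],
})

def explode_column(frame, column):
    """Return one row per comma-separated value in `column`, trimmed."""
    out = (
        frame.assign(**{column: frame[column].str.split(",")})
             .explode(column)
             .reset_index(drop=True)
    )
    out[column] = out[column].str.strip()
    return out

# One row per (country, genre) pair
long_form = explode_column(explode_column(sample, "country"), "listed_in")

# Rows are countries, columns are genres, cells are title counts
genre_by_country = pd.crosstab(long_form["country"], long_form["listed_in"])
print(genre_by_country)
```

A co-production counts once for each of its countries here, which matches how the country totals in this analysis are computed.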

✅ Directors & Cast Analysis

  • The most frequently featured directors and actors were identified.

✅ Ratings & Maturity Levels

  • TV-MA is the most common rating.
  • Rating distribution varies between Movies and TV Shows.
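
Because the catalog holds far more Movies than TV Shows, raw counts can hide how the rating mix differs. A minimal sketch of comparing percentage shares within each content type is shown here, using a hypothetical six-row `sample` frame rather than the real data.

```python
import pandas as pd

# Toy rows (illustrative values only)
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "rating": ["PG-13", "R", "TV-MA", "TV-MA", "TV-MA", "TV-14"],
})

# normalize="columns" makes each content type sum to 1; multiply to get percentages
rating_share = pd.crosstab(sample["rating"], sample["type"], normalize="columns").mul(100)
print(rating_share.round(1))
```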

✅ Duration & Season Analysis

  • The average duration of movies is around 99 minutes.
  • Most TV shows have only one season.
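
Since the cleaning step stores 0 where no duration could be parsed, those placeholders can pull an average down. The sketch below, on a hypothetical `sample` frame, shows the mean with and without the zero placeholders, plus the share of one-season shows.

```python
import pandas as pd

# Toy rows; the third movie has a 0 placeholder for a missing duration
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "duration_in_minutes": [90, 110, 0, 0, 0, 0],
    "num_seasons": [0, 0, 0, 1, 2, 1],
})

movies = sample[sample["type"] == "Movie"]
shows = sample[sample["type"] == "TV Show"]

# Mean over all movies, including the 0 placeholder
mean_all = movies["duration_in_minutes"].mean()

# Mean over movies with a real duration only
valid = movies.loc[movies["duration_in_minutes"] > 0, "duration_in_minutes"]
mean_valid = valid.mean()

# Percentage of TV shows with exactly one season
one_season_share = (shows["num_seasons"] == 1).mean() * 100

print(f"mean (all): {mean_all:.2f}, mean (valid only): {mean_valid:.2f}")
print(f"one-season shows: {one_season_share:.1f}%")
```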

📽 Visualization