top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Writer's pictureSafia Bashir

Guest Stars in The Office




Introduction


The Office is a television series that depicts the everyday lives of office employees in the fictional Scranton office of Dunder Mifflin Paper Company. There have been nine seasons of the show, which aired from March 24, 2005, to May 16, 2013. This notebook investigates the popularity and quality of the Office episodes over time by examining the datasets/office_episodes.csv, which was obtained from Kaggle here.

In this dataset, you can find information about multiple characteristics of each episode. This includes the following:

  • episode_number: Canonical episode number.

  • season: Season in which the episode appeared.

  • episode_title: Title of the episode.

  • description: Description of the episode.

  • ratings: Average IMDB rating.

  • votes: Number of votes.

  • viewership_mil: Number of US viewers in millions.

  • duration: Duration in number of minutes.

  • release_date: Airdate.

  • guest_stars: Guest stars in the episode (if any).

  • director: Director of the episode.

  • writers: Writers of the episode.

  • has_guests: True/False column for whether the episode contained guest stars.

  • scaled_ratings: The ratings scaled from 0 (worst-reviewed) to 1 (best-reviewed).

This project aims to:

1 - Create a scatter plot for the purpose of exploring data and uncovering insights

2 - Find out which stars appeared in the most viewed episode


libraries and data importing



import pandas as pd 
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [11, 7]
office_df=pd.read_csv('datasets/office_episodes.csv')
office_df.info()
%matplotlib inline 

# Load your data and print out a few lines.
# Read the CSV File Using Pandas read_csv function
office_df=pd.read_csv('datasets/office_episodes.csv')

# print the concise summery of the dataset
office_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   episode_number  188 non-null    int64  
 1   season          188 non-null    int64  
 2   episode_title   188 non-null    object 
 3   description     188 non-null    object 
 4   ratings         188 non-null    float64
 5   votes           188 non-null    int64  
 6   viewership_mil  188 non-null    float64
 7   duration        188 non-null    int64  
 8   release_date    188 non-null    object 
 9   guest_stars     29 non-null     object 
 10  director        188 non-null    object 
 11  writers         188 non-null    object 
 12  has_guests      188 non-null    bool   
 13  scaled_ratings  188 non-null    float64
dtypes: bool(1), float64(3), int64(4), object(6)
memory usage: 19.4+ KB

from this dataset, we can observe that 29 guests appeared in 188 episodes.


Data visualization :


1-Simple scatter plot :

Here is a simple scatter plot, where episode number is plotted along the x-axis and viewership, is plotted along the y-axis as follows


fig= plt.figure()
plt.scatter(x=office_df['episode_number'],y=office_df['viewership_mil'])

This scatter plot indicates that the first few episodes of the show did not have the highest viewership, but popularity picked up as the season progressed and declined later in the season.

This scatter plot is difficult to read. Therefore, we should develop a more sophisticated scatter plot.


2-Advanced scatter plot :


We are going to add a color scheme to the previous scatter plot such that each point has a color associated with its rating based on the following:


Ratings < 0.25 are colored "red"

Ratings >= 0.25 and < 0.50 are colored "orange"

Ratings >= 0.50 and < 0.75 are colored "lightgreen"

Ratings >= 0.75 are colored "darkgreen"


This can be accomplished by creating an array of colors to customize the plot

# Initiatlize an  empty list

cols=[]
# Iterate through the DataFrame, and assign colors based on the rating
for ind,row in office_df.iterrows():
    if row['scaled_ratings']<0.25: 
        cols.append('red')
    elif row['scaled_ratings']<0.50:
        cols.append('orange')
    elif row['scaled_ratings']< 0.75:
        cols.append("lightgreen")
    else:
        cols.append("darkgreen")
        
fig= plt.figure()
plt.scatter(x=office_df['episode_number'], y= office_df['viewership_mil'],c=cols)

Now each point in the scatter plot is colored according to the rating of each episode

As a next step, we can add a sizing system such that episodes with guests have a marker size of 250 and episodes without guests have a marker size of 25. we can do this by using the same technique (by creating an array of sizes to customize the plot )

# Initiatlize an empty list
sizes=[]

for ind,row in office_df.iterrows():
    if row['has_guests']==True:
        sizes.append(250)
    else:
        sizes.append(25)
print(sizes.count(250))  
fig= plt.figure()

plt.scatter(x=office_df['episode_number'], y= office_df['viewership_mil'],c=cols,s=sizes)
print(sizes.count(250)) 

If we count the number of points that has a marker size of 250 or used (print(sizes.count(250)) ) we can find that 29 guests appeared in 188 episodes. This is in line with our earlier observations about guest appearances.

This scatter plot will be easier to read by providing labels for both axes and adding a title as follows:


fig= plt.figure()
plt.scatter(x=office_df['episode_number'], y= office_df['viewership_mil'],c=cols,s=sizes)
print(sizes.count(250)) 
# Create a title
plt.title("Popularity, Quality, and Guest Appearances on the Office")
# Create an x-axis label
plt.xlabel("Episode Number")
# Create a y-axis label
plt.ylabel("Viewership (Millions)")

Aside from varying the marker size, we can also change the marker type to star to identify guest appearances. This can be achieved by calling two scatter plots, one with guest appearance data and the other with non-guest appearance data, and changing the marker

# Adding our lists as columns to the DataFrame will make plotting much easier

office_df['colors']=cols
office_df['size']=sizes
# Split data into guest and non_guest DataFrames
non_guest_df=office_df[office_df['has_guests']==False]
guest_df=office_df[office_df['has_guests']==True]

fig=plt.figure()
# Create two scatter plots with the episode number on the x axis, and the viewership on the y axis

# Create a normal scatter plot for regular episodes
plt.scatter(x=non_guest_df['episode_number'],
            y=non_guest_df['viewership_mil'],
            c=non_guest_df['colors'],
            s=non_guest_df['size'])
# Create a starred scatterplot for guest star episodes
plt.scatter(x=guest_df['episode_number'],
            y=guest_df['viewership_mil'],
            c=guest_df['colors'],
            s=guest_df['size'],
           marker='*')

# Create a title
plt.title("Popularity, Quality, and Guest Appearances on the Office")
# Create an x-axis label
plt.xlabel("Episode Number")
# Create an y-axis label
plt.ylabel("Viewership (Millions)")

Finally, let's find out which stars appeared in the most viewed episode


# Get the most popular guest star
print(office_df[office_df['viewership_mil'] > 20]['guest_stars'])

77 Cloris Leachman, Jack Black, Jessica Alba Name: guest_stars, dtype: object

A fun fact is that these stars starred in a fake movie the cast watched.


you can find the source code on GitHub

References:

1- Datasets for 'The Office' series were downloaded from Kaggle here.

2- DataCamp's Unguided Project: "Investigating Netflix Movies and Guest Stars in The Office" here.

0 comments

Recent Posts

See All

Comments


bottom of page