
Data Scientist Program

 


Marehan Refaat

Project: Investigating Guest Stars in The Office

In this project, we will take a look at a dataset of The Office episodes, and try to understand how the popularity and quality of the series varied over time.

To do so, we will use the following dataset: datasets/office_episodes.csv, which was downloaded from Kaggle.

This dataset contains information on a variety of characteristics of each episode. In detail, these are:

datasets/office_episodes.csv

  • episode_number: Canonical episode number.

  • season: Season in which the episode appeared.

  • episode_title: Title of the episode.

  • description: Description of the episode.

  • ratings: Average IMDB rating.

  • votes: Number of votes.

  • viewership_mil: Number of US viewers in millions.

  • duration: Duration in number of minutes.

  • release_date: Airdate.

  • guest_stars: Guest stars in the episode (if any).

  • director: Director of the episode.

  • writers: Writers of the episode.

  • has_guests: True/False column for whether the episode contained guest stars.

  • scaled_ratings: The ratings scaled from 0 (worst-reviewed) to 1 (best-reviewed).
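The source does not say how scaled_ratings was computed. One common way to map ratings onto a 0-to-1 range is min-max scaling; the sketch below assumes that approach and uses made-up ratings, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up ratings for illustration; not taken from the real dataset.
ratings = pd.Series([7.5, 8.0, 9.5, 8.5])

# Min-max scaling (an assumption about how the column was built):
# the lowest rating maps to 0 and the highest maps to 1.
scaled = (ratings - ratings.min()) / (ratings.max() - ratings.min())

print(scaled.tolist())  # → [0.0, 0.25, 1.0, 0.5]
```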

We will start by reading in the data and exploring it:

# Read in the CSV as a DataFrame
import pandas as pd
office_episodes = pd.read_csv("datasets/office_episodes.csv")

# Show the first ten rows of the DataFrame
office_episodes[:10]


The output is a table of the first ten rows of the DataFrame.
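Beyond looking at the first rows, it can help to check column types and missing values before plotting. The following is a minimal sketch on a hypothetical two-row sample that stands in for the real file; the sample values and the date format are assumptions, and only the column names come from the dataset description.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for datasets/office_episodes.csv;
# the values and the date format are assumptions.
sample_csv = io.StringIO(
    "episode_number,episode_title,viewership_mil,release_date,guest_stars,has_guests\n"
    "0,Example A,10.5,2005-03-24,,False\n"
    "1,Example B,12.0,2005-03-29,Some Guest,True\n"
)

# parse_dates converts release_date from text to a datetime column.
sample = pd.read_csv(sample_csv, parse_dates=["release_date"])

# Summarize column types and non-null counts.
sample.info()

# Count missing values per column; guest_stars is empty when there are no guests.
missing = sample.isna().sum()
print(missing["guest_stars"])  # → 1
```

On the real data, the same calls would be made on office_episodes instead of the sample.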


Task 1:

Create a matplotlib scatter plot of the data that contains the following attributes:

  • Each episode's episode number plotted along the x-axis

  • Each episode's viewership (in millions) plotted along the y-axis

  • A color scheme reflecting the scaled ratings (not the regular ratings) of each episode, such that:

    • Ratings < 0.25 are colored "red"

    • Ratings >= 0.25 and < 0.50 are colored "orange"

    • Ratings >= 0.50 and < 0.75 are colored "lightgreen"

    • Ratings >= 0.75 are colored "darkgreen"


  • A sizing system, such that episodes with guest appearances have a marker size of 250 and episodes without are sized 25

  • A title, reading "Popularity, Quality, and Guest Appearances on the Office"

  • An x-axis label reading "Episode Number"

  • A y-axis label reading "Viewership (Millions)"


# Create a list to hold one color name per episode
colors = []

# Iterate over the rows of office_episodes and append the matching color name
for lab, row in office_episodes.iterrows():
    if row['scaled_ratings'] < 0.25:
        colors.append("red")
    elif 0.25 <= row['scaled_ratings'] < 0.50:
        colors.append("orange")
    elif 0.50 <= row['scaled_ratings'] < 0.75:
        colors.append("lightgreen")
    else:
        colors.append("darkgreen")

# Inspect the first 10 values in the list
colors[:10]

The output is the first ten color names in the list.
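The color thresholds can also be written as a small reusable function, which makes the boundary cases easy to check. This is a sketch of one alternative, not the method used in the post; the names THRESHOLDS, BAND_COLORS, and rating_to_color are hypothetical, and the test ratings are made up.

```python
from bisect import bisect_right

# Band boundaries and the color for each of the four resulting bands.
THRESHOLDS = [0.25, 0.50, 0.75]
BAND_COLORS = ["red", "orange", "lightgreen", "darkgreen"]


def rating_to_color(scaled_rating):
    """Return the color name for a scaled rating between 0 and 1."""
    # bisect_right counts how many thresholds are <= the rating,
    # so a rating of exactly 0.25 lands in the "orange" band.
    return BAND_COLORS[bisect_right(THRESHOLDS, scaled_rating)]


# Made-up ratings chosen to hit each boundary.
example_colors = [rating_to_color(r) for r in [0.1, 0.25, 0.5, 0.74, 0.75, 1.0]]
print(example_colors)
# → ['red', 'orange', 'lightgreen', 'lightgreen', 'darkgreen', 'darkgreen']
```

With the real DataFrame, the same function could be applied as colors = [rating_to_color(r) for r in office_episodes["scaled_ratings"]].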


# Create a sizing system:
# episodes with guest appearances have a marker size of 250
# episodes without are sized 25
sizes = []

for lab, row in office_episodes.iterrows():
    if row['has_guests']:
        sizes.append(250)
    else:
        sizes.append(25)

# Inspect the first 10 values in the list
sizes[:10]

The output is the first ten marker sizes in the list.
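Because the size depends on a single True/False column, the loop can be replaced by one vectorized call. The sketch below uses NumPy's where function on made-up has_guests values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up has_guests values for illustration.
has_guests = pd.Series([False, True, True, False])

# np.where picks 250 where the flag is True and 25 elsewhere.
example_sizes = np.where(has_guests, 250, 25)

print(example_sizes.tolist())  # → [25, 250, 250, 25]
```

On the real data this would read sizes = np.where(office_episodes["has_guests"], 250, 25), and the resulting array can be passed to the s argument of plt.scatter directly.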


# Import matplotlib.pyplot under its usual alias and create a figure
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(11,7))

# Create a scatter plot
plt.scatter(office_episodes["episode_number"],
            office_episodes["viewership_mil"],
            c=colors,
            s=sizes)

# Create a title
plt.title('Popularity, Quality, and Guest Appearances on the Office')

# Label the x-axis and the y-axis
plt.xlabel('Episode Number')
plt.ylabel('Viewership (Millions)')

# Show the plot
plt.show()

The output is a scatter plot of viewership against episode number, with points colored by scaled rating and sized by guest appearances.
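The task does not ask for a legend, but one makes the color bands easier to read. The following sketch, on made-up points, shows one way to add it using matplotlib's object-oriented interface and proxy legend handles; the data values and variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Made-up points for illustration; the real plot uses office_episodes.
episode_numbers = [0, 1, 2, 3]
viewership = [11.2, 6.0, 5.8, 5.4]
point_colors = ["darkgreen", "orange", "red", "lightgreen"]
point_sizes = [250, 25, 25, 250]

fig, ax = plt.subplots(figsize=(11, 7))
ax.scatter(episode_numbers, viewership, c=point_colors, s=point_sizes)
ax.set_title("Popularity, Quality, and Guest Appearances on the Office")
ax.set_xlabel("Episode Number")
ax.set_ylabel("Viewership (Millions)")

# Proxy handles: one marker per color band, since scatter() with a list
# of color names does not produce legend entries on its own.
bands = [
    ("red", "< 0.25"),
    ("orange", "0.25 to < 0.50"),
    ("lightgreen", "0.50 to < 0.75"),
    ("darkgreen", ">= 0.75"),
]
handles = [
    Line2D([], [], marker="o", linestyle="", color=color, label=label)
    for color, label in bands
]
legend = ax.legend(handles=handles, title="Scaled rating")

# In a notebook or interactive session, call plt.show() to display the figure.
```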

Task 2:

Provide the name of one of the guest stars (hint, there were multiple!) who was in the most watched Office episode. Save it as a string in the variable top_star (e.g. top_star = "Will Ferrell").


# Find the highest viewership
highest_view = max(office_episodes["viewership_mil"])

# Filter the DataFrame down to the most watched episode
most_watched_dataframe = office_episodes.loc[office_episodes["viewership_mil"] == highest_view]

# Guest stars that were in that episode
top_stars = most_watched_dataframe[["guest_stars"]]
top_stars

The output is a table listing the guest stars of the most watched episode.

Finally, save one of those names as a string in the variable top_star:


top_star = 'Jack Black'
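Instead of reading the name off the table by hand, the lookup can be done in code. The sketch below is one possible approach on made-up rows: it uses idxmax to find the most watched episode and assumes that guest names in guest_stars are separated by ", ", which the source does not confirm.

```python
import pandas as pd

# Made-up rows for illustration; names and numbers are placeholders.
episodes = pd.DataFrame({
    "episode_title": ["Example A", "Example B", "Example C"],
    "viewership_mil": [7.1, 15.3, 9.4],
    "guest_stars": [None, "Guest One, Guest Two", "Guest Three"],
})

# idxmax returns the index label of the row with the largest viewership
# (the first such row if there is a tie).
top_index = episodes["viewership_mil"].idxmax()

# Assumption: guest names are separated by ", " within the string.
guest_list = episodes.loc[top_index, "guest_stars"].split(", ")

# Any name from the list satisfies the task; take the first one.
example_top_star = guest_list[0]
print(example_top_star)  # → Guest One
```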


Please refer to the complete code from here
