
Data Scientist Program

 


Writer: aya abdalsalam

Investigating Guest Stars in The Office

Sometimes viewing an image is enough to solve a problem.

Data visualization helps you explore your data, but before drawing anything, let's look at our data and see what it consists of:

#import needed libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

First, we read our data from a CSV file into a DataFrame called office_df and display its columns.


office_df = pd.read_csv('office_episodes.csv')
office_df.columns

Out[67]:
Index(['episode_number', 'season', 'episode_title', 'description', 'ratings',
       'votes', 'viewership_mil', 'duration', 'release_date', 'guest_stars',
       'director', 'writers', 'has_guests', 'scaled_ratings'],
      dtype='object')

Let's look at the first five rows of our data using the head() function.

office_df.head()

Let's see information about every column in our data.

We have 14 columns, and none of them has missing values except guest_stars, which has only 29 non-null values.

office_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   episode_number  188 non-null    int64  
 1   season          188 non-null    int64  
 2   episode_title   188 non-null    object 
 3   description     188 non-null    object 
 4   ratings         188 non-null    float64
 5   votes           188 non-null    int64  
 6   viewership_mil  188 non-null    float64
 7   duration        188 non-null    int64  
 8   release_date    188 non-null    object 
 9   guest_stars     29 non-null     object 
 10  director        188 non-null    object 
 11  writers         188 non-null    object 
 12  has_guests      188 non-null    bool   
 13  scaled_ratings  188 non-null    float64
dtypes: bool(1), float64(3), int64(4), object(6)
memory usage: 19.4+ KB
office_df.shape
(188, 14)

The data consists of 188 rows and 14 columns.

office_df.describe()

office_df.sort_values(["episode_number","ratings"],ascending= [True,True])

office_df[office_df['scaled_ratings'] >= 1]

Two episodes have a scaled rating of 1. Both were written by Greg Daniels, and they were directed by Paul Feig and Ken Kwapis.


Let's see which episode got the highest viewership.

maxView = office_df['viewership_mil'].max()
office_df[office_df['viewership_mil'] == maxView]


Here we find that episode number 77, called Stress Relief, has the highest viewership: 22.91 million.
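As an alternative to computing the maximum and then filtering, the row with the highest value can be fetched directly with idxmax(). The snippet below is a minimal sketch on a small made-up DataFrame (toy_df and its values are hypothetical, not taken from office_episodes.csv); only the column names match the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for office_episodes.csv
toy_df = pd.DataFrame({
    "episode_number": [0, 1, 2],
    "episode_title": ["Episode A", "Episode B", "Episode C"],
    "viewership_mil": [11.2, 22.9, 9.0],
})

# idxmax() returns the index label of the first row holding the maximum value
top_row = toy_df.loc[toy_df["viewership_mil"].idxmax()]
print(top_row["episode_title"], top_row["viewership_mil"])
```

Note that idxmax() returns only the first match, so the filtering approach is still the better choice when several episodes could tie for the maximum.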


Start and end dates of release:

print(office_df['release_date'].min())
print(office_df['release_date'].max())
2005-03-24
2013-05-16
mean = office_df['votes'].mean()
median = office_df['votes'].median()
print(mean)
print(median)
2838.228723404255
2614.0
office_df.isna().sum().plot(kind="bar")
plt.show()

As noted earlier, the guest_stars column has data in only 29 rows.
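The missing values can also be counted numerically instead of being read off the bar chart. This is a minimal sketch on a small made-up DataFrame (toy_df and the guest name are hypothetical); it also checks that has_guests agrees with whether guest_stars is filled in, which is an assumption about how that column relates to guest_stars.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for office_episodes.csv
toy_df = pd.DataFrame({
    "episode_number": [0, 1, 2, 3],
    "guest_stars": [np.nan, "Guest A", np.nan, np.nan],
    "has_guests": [False, True, False, False],
})

# Number of missing and non-missing values per column
missing_counts = toy_df.isna().sum()
present_counts = toy_df.notna().sum()
print(missing_counts)
print(present_counts)

# Assumed relationship: has_guests is True exactly where guest_stars is filled in
flags_match = bool((toy_df["has_guests"] == toy_df["guest_stars"].notna()).all())
print(flags_match)
```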


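Let's plot the viewership of every episode, using color to show its scaled rating and a larger star marker for episodes with guest stars.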

office_df = pd.read_csv('office_episodes.csv')
colorsList = []
for ind, row in office_df.iterrows():
    if row['scaled_ratings'] < 0.25:
        colorsList.append('red')
    elif row['scaled_ratings'] < 0.50:
        colorsList.append('orange')
    elif row['scaled_ratings'] < 0.75:
        colorsList.append('lightgreen')
    else:
        colorsList.append('darkgreen')

sizes = []
for ind, row in office_df.iterrows():
    if row['has_guests'] == False:
        sizes.append(25)
    else:
        sizes.append(250)


office_df['colors'] = colorsList
office_df['sizes'] = sizes

non_guest_df = office_df[office_df['has_guests'] == False]
guest_df = office_df[office_df['has_guests'] == True]

plt.rcParams['figure.figsize'] = [11, 7]
plt.style.use('fivethirtyeight')


# Create a triangle scatterplot for episodes without guest stars
plt.scatter(x=non_guest_df.episode_number, y=non_guest_df.viewership_mil,
            c=non_guest_df['colors'], marker="v", s=25)

# Create a starred scatterplot for guest star episodes
plt.scatter(x=guest_df.episode_number, y=guest_df.viewership_mil,
            c=guest_df['colors'], marker='*', s=250)

plt.title("Popularity, Quality, and Guest Appearances on the Office", fontsize = 28)
plt.xlabel("Episode Number", fontsize = 30)
plt.ylabel("Viewership (Millions)", fontsize = 30)

plt.show()
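The two iterrows() loops that build the color and size lists can also be written without explicit loops. The sketch below is one possible vectorized version, not the author's code: it uses pd.cut with the same thresholds (0.25, 0.50, 0.75) and np.where for the sizes, applied to a small made-up DataFrame (toy_df and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the two columns the loops read
toy_df = pd.DataFrame({
    "scaled_ratings": [0.10, 0.25, 0.60, 0.75, 1.00],
    "has_guests": [False, True, False, False, True],
})

# right=False makes each bin [low, high), matching the "<" comparisons in the loop
toy_df["colors"] = pd.cut(
    toy_df["scaled_ratings"],
    bins=[-np.inf, 0.25, 0.50, 0.75, np.inf],
    labels=["red", "orange", "lightgreen", "darkgreen"],
    right=False,
).astype(str)

# 250 for guest-star episodes, 25 otherwise
toy_df["sizes"] = np.where(toy_df["has_guests"], 250, 25)
print(toy_df)
```

On a dataset of 188 rows the speed difference is negligible; the main benefit is that the thresholds and labels sit side by side in one place.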

The most popular guest stars, taken from episodes with more than 20 million viewers:
print(office_df[office_df['viewership_mil'] > 20]['guest_stars'])

There have been 9 seasons

office_df['season'].max()
9

office_df.plot(x = "duration", y = "ratings", kind = "scatter",marker ="*",color = "green")
plt.show()

Longer episodes tend to have high ratings, while the ratings of shorter episodes fall between 7.5 and 9.


office_df.plot(x = "duration", y = "viewership_mil", kind = "scatter", marker = "s",color ="green")
plt.show()

Shorter episodes tend to get more views.
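Impressions read from scatter plots like these can be backed up with a correlation coefficient. The following is a minimal sketch using Series.corr() (Pearson by default) on made-up numbers; toy_df and its values are hypothetical and say nothing about the real dataset, so the same two lines would need to be run on office_df to check the actual relationship.

```python
import pandas as pd

# Hypothetical toy data; these values are invented for illustration only
toy_df = pd.DataFrame({
    "duration": [20, 22, 30, 42, 60],
    "ratings": [8.0, 8.1, 8.3, 8.8, 9.2],
    "viewership_mil": [9.0, 8.5, 7.0, 6.0, 5.0],
})

# Pearson correlation: values near +1 or -1 indicate a strong linear relationship
r_ratings = toy_df["duration"].corr(toy_df["ratings"])
r_views = toy_df["duration"].corr(toy_df["viewership_mil"])
print(round(r_ratings, 2), round(r_views, 2))
```

A coefficient close to zero on the real data would suggest that the visual pattern is driven by a few unusual episodes rather than by a general trend.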


office_df.plot(x = "release_date", y = "duration")
plt.xticks(rotation=90)
plt.show()





Duration changes over the years. The maximum duration is 60 minutes, and two episodes reach it: Stress Relief and Classy Christmas.
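Two small additions can make this step more robust: converting release_date from text to real dates, so a time axis is spaced by date rather than by row, and looking up the longest episodes in code instead of reading them from the chart. This is a minimal sketch on a made-up DataFrame (toy_df, its titles, and the two middle dates are hypothetical).

```python
import pandas as pd

# Hypothetical toy data standing in for office_episodes.csv
toy_df = pd.DataFrame({
    "episode_title": ["Episode A", "Episode B", "Episode C", "Episode D"],
    "release_date": ["2005-03-24", "2009-02-01", "2010-12-09", "2013-05-16"],
    "duration": [23, 60, 60, 52],
})

# Convert the text column to datetime64 so it behaves as a date
toy_df["release_date"] = pd.to_datetime(toy_df["release_date"])

# Keep every episode whose duration equals the maximum (ties included)
longest = toy_df[toy_df["duration"] == toy_df["duration"].max()]
print(longest["episode_title"].tolist())
```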

See all the code here:


https://github.com/AyaMohammedAli/Investigating-Guest-Stars-in-TheOffice-

