Data manipulation with Pandas: From loading files to data visualization



Pandas is a library written for data manipulation and analysis in Python. It offers data structures and operations for manipulating numerical tables and time series.


Pandas offers a wide range of functionality, but we will place ourselves in the context of an exploratory study and focus on five of the techniques most used during exploratory data analysis, going from loading the data all the way to visualization without relying on other dependencies.


The techniques that we will go through together are:

  • Loading files

  • Handling missing data

  • Manipulating string data

  • Joining data

  • Visualizing data


Before loading any files, we first import the Pandas library.


import pandas as pd

Also, we will refer to some parts of the previous post.


Loading files


The Pandas library has many functions for reading files, depending on the file type.

Usually, the function name is read_xxxx, where xxxx is the file type in lower case.

Supported file types include CSV, Excel, JSON, XML, HTML, Pickle, and more.

If you want to learn more about file types, you should read the documentation.
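
As a quick sketch, the naming pattern looks like this (the file names below are hypothetical, just for illustration):


csv_df = pd.read_csv("data.csv")        # CSV file
excel_df = pd.read_excel("data.xlsx")   # Excel file (may need an extra engine such as openpyxl)
pickle_df = pd.read_pickle("data.pkl")  # Pickle file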

In our example, we will read a JSON file and an HTML file.


  • JSON


We will retrieve data from the following URI: https://api.github.com/users/hadley/orgs. This URI returns JSON data.


github = pd.read_json("https://api.github.com/users/hadley/orgs")
github.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   login               10 non-null     object
 1   id                  10 non-null     int64 
 2   node_id             10 non-null     object
 3   url                 10 non-null     object
 4   repos_url           10 non-null     object
 5   events_url          10 non-null     object
 6   hooks_url           10 non-null     object
 7   issues_url          10 non-null     object
 8   members_url         10 non-null     object
 9   public_members_url  10 non-null     object
 10  avatar_url          10 non-null     object
 11  description         8 non-null      object
dtypes: int64(1), object(11)
memory usage: 1.1+ KB
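
Once loaded, this data frame behaves like any other; for instance, we can take a quick look at a couple of the columns listed above (a minimal sketch):


github[['login', 'url']].head() # first rows of the login and url columns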

  • HTML


This time, we will retrieve data from the URI https://fr.wikipedia.org/wiki/Population_mondiale. This page contains different types of content, such as text and tables. Thanks to the read_html function, we can directly get a list of all the tables on the page.

world_population = pd.read_html('https://fr.wikipedia.org/wiki/Population_mondiale') # list of tables found on the page
world_population[0] # display the first table in the list
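
As a side note, read_html also accepts a match parameter that keeps only the tables whose text matches a given string, which helps when a page contains many tables (a sketch, assuming the word 'Population' appears in the table we want):


population_tables = pd.read_html('https://fr.wikipedia.org/wiki/Population_mondiale', match='Population')
len(population_tables) # number of matching tables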

Handling missing data


For this part, we will use the dataset datasets/the_office_series.csv from Kaggle here.

We already used this dataset in the previous post, so we know the GuestStars column has missing values.

Using the isna function of the Pandas library, we can show which of the first five records have no value.


office_df = pd.read_csv("datasets/the_office_series.csv")
office_df['GuestStars'].isna().head()
0    True
1    True
2    True
3    True
4    True
Name: GuestStars, dtype: bool

To count missing values, we can write the following code.


office_df['GuestStars'].isna().sum() # output 159
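
The same idea works for the whole data frame at once; chaining isna and sum on office_df gives the number of missing values per column (a quick sketch):


office_df.isna().sum() # missing values counted for every column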

We can also fill missing values with other values, depending on the study. As an example, we can do the following.


office_df['GuestStars'].fillna('No value').head(10)
0         No value
1         No value
2         No value
3         No value
4         No value
5       Amy Adams
6         No value
7         No value
8    Nancy Carell
9       Amy Adams
Name: GuestStars, dtype: object
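
Note that fillna returns a new Series and leaves office_df unchanged. If we wanted to keep the replacement, we would assign the result somewhere; here we only store it in a separate variable so the missing values stay in place for the next steps (a sketch):


guests_filled = office_df['GuestStars'].fillna('No value') # office_df itself is not modified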

Next, we will keep only the values of the GuestStars column that are not missing.


available_guest =  office_df.loc[:,'GuestStars'].dropna()
available_guest.head()
5        Amy Adams
8     Nancy Carell
9        Amy Adams
12     Tim Meadows
14       Ken Jeong
Name: GuestStars, dtype: object
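
With the missing values removed, a natural exploratory step is to count how often each guest star appears (a quick sketch):


available_guest.value_counts().head() # most frequent guest stars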

Manipulating string data


Using the Pandas library, we can check whether each value matches a pattern, like this.


available_guest.str.match('Amy Adams').head()
5      True
8     False
9      True
12    False
14    False
Name: GuestStars, dtype: bool

There are other functions for finding patterns in a string as well.
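
For example, str.contains checks whether the pattern appears anywhere in the value rather than only at the beginning (a quick sketch):


available_guest.str.contains('Adams').head() # True wherever 'Adams' appears in the name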

We can also change the string case. Currently, the values in the GuestStars column are in title case.


available_guest.str.upper().head()
5        AMY ADAMS
8     NANCY CARELL
9        AMY ADAMS
12     TIM MEADOWS
14       KEN JEONG
Name: GuestStars, dtype: object
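
A few other string operations follow the same str accessor pattern (a quick sketch, not needed for the rest of the post):


available_guest.str.lower().head()    # all lower case
available_guest.str.split(' ').head() # split each name into a list of words
available_guest.str.len().head()      # length of each name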

To learn more about string manipulation with the Pandas library, you should read the documentation.


Joining data


As in the previous post, the merge function needs at least two data frames.

The first one will contain the directors who made more than ten episodes.


director_df = office_df.groupby('Director')['EpisodeTitle'].agg(['count'])
director_df = director_df[director_df['count'] > 10 ].sort_values('count', ascending=False)
director_df

The second data frame will contain the mean rating for each director.


director_mean_df = office_df.groupby('Director')['Ratings'].agg(['mean'])
director_mean_df.head()

new_df = pd.merge(director_df, director_mean_df, left_index=True, right_index=True)
new_df.sort_values('mean', ascending=False)
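
Since both data frames share the Director index here, the same result could also be obtained with the join method, which joins on the index by default (a side sketch):


director_df.join(director_mean_df).sort_values('mean', ascending=False)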

Visualizing data


We will reuse the last data frame from the previous post, created by the following code.

In this code, we define a function rangeGetData that returns the mean of the provided column. We then select the episodes in each duration range with a boolean mask, and finally build a data frame from the computed means.


# Function creation
def rangeGetData(column, df):
    return df[column].mean()

# Boolean mask trick to filter data
mean_viewers_19___27__2 = rangeGetData('Viewership', office_df[(office_df['Duration'] >= 19) & (office_df['Duration'] < 27.2)])
mean_ratings_19___27__2 = rangeGetData('Ratings', office_df[(office_df['Duration'] >= 19) & (office_df['Duration'] < 27.2)])

mean_viewers_27__2___35__4 = rangeGetData('Viewership', office_df[(office_df['Duration'] >= 27.2) & (office_df['Duration'] < 35.4)])
mean_ratings_27__2___35__4 = rangeGetData('Ratings', office_df[(office_df['Duration'] >= 27.2) & (office_df['Duration'] < 35.4)])

mean_viewers_35__4___43__6 = rangeGetData('Viewership', office_df[(office_df['Duration'] >= 35.4) & (office_df['Duration'] < 43.6)])
mean_ratings_35__4___43__6 = rangeGetData('Ratings', office_df[(office_df['Duration'] >= 35.4) & (office_df['Duration'] < 43.6)])

mean_viewers_43__6___51__8 = rangeGetData('Viewership', office_df[(office_df['Duration'] >= 43.6) & (office_df['Duration'] < 51.8)])
mean_ratings_43__6___51__8 = rangeGetData('Ratings', office_df[(office_df['Duration'] >= 43.6) & (office_df['Duration'] < 51.8)])

mean_viewers_51__8 = rangeGetData('Viewership', office_df[office_df['Duration'] >= 51.8])
mean_ratings_51__8 = rangeGetData('Ratings', office_df[office_df['Duration'] >= 51.8])

# Data frame creation
duration_df = pd.DataFrame(
    data={
        'viewers': [mean_viewers_19___27__2, mean_viewers_27__2___35__4, mean_viewers_35__4___43__6, mean_viewers_43__6___51__8, mean_viewers_51__8],
        'ratings': [mean_ratings_19___27__2, mean_ratings_27__2___35__4, mean_ratings_35__4___43__6, mean_ratings_43__6___51__8, mean_ratings_51__8]
        },
    index = ['19 to 27.2', '27.2 to 35.4', '35.4 to 43.6', '43.6 to 51.8', '51.8 to 60']
)

# Show data frame
duration_df
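
As a side note, the same table can be built more concisely with pd.cut, which assigns each episode to a duration range in a single step (a sketch, assuming every duration falls below 60 minutes so the bins cover all episodes):


# Bin the Duration column, then aggregate both metrics per bin
bins = [19, 27.2, 35.4, 43.6, 51.8, 60]
labels = ['19 to 27.2', '27.2 to 35.4', '35.4 to 43.6', '43.6 to 51.8', '51.8 to 60']

duration_alt_df = (
    office_df
    .assign(duration_range=pd.cut(office_df['Duration'], bins=bins, labels=labels, right=False))
    .groupby('duration_range')[['Viewership', 'Ratings']]
    .mean()
    .rename(columns={'Viewership': 'viewers', 'Ratings': 'ratings'})
)
duration_alt_df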

duration_df.plot.line(subplots=True, figsize=(11, 14))

duration_df.plot(kind='pie',subplots=True, figsize=(11, 14))
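
These plot calls use Matplotlib under the hood; in a Jupyter notebook the figures appear automatically, but in a plain Python script an explicit call is needed to display them (a quick sketch):


import matplotlib.pyplot as plt

duration_df.plot(kind='pie', subplots=True, figsize=(11, 14))
plt.show() # only needed outside a notebook environment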


Conclusion


The Pandas library has enough functionality to let us carry out an exploratory study without worries and without using external dependencies.


Reference

  • DataCamp's Unguided Project: "Investigating Netflix Movies and Guest Stars in The Office" here.

  • This article was originally written by Kossonou Kouamé Maïzan Alain Serge, as part of the Data Insight data scientist program.

  • The full code is available on GitHub.
