Pandas Techniques for Data Manipulation in Python
Pandas is a great library for data manipulation that offers many tools for analysing data. It is built on top of NumPy, so it uses NumPy's array operations to process large amounts of data efficiently, and its plotting methods rely on Matplotlib.
The main pandas data structure is the DataFrame, a two-dimensional labelled table. There are many techniques we can use to analyse data with pandas.
1. Sorting a DataFrame
Sorting is useful in many situations because it lets us change the order of the rows. For example, we can bring the most interesting rows to the top of the DataFrame.
Example:
Using our sales dataset, we can sort weekly_sales from the highest to the lowest value.
# Load the dataset
in: import pandas as pd
df = pd.read_csv('sales_subset.csv')
df.head()
# Sort weekly_sales in descending order to see which department has the biggest sales
df.sort_values('weekly_sales', ascending=False)
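Sorting also works on several columns at once. As a minimal sketch that runs without the CSV file, the block below builds a small made-up DataFrame (the `store`, `department` and `weekly_sales` values are illustrative assumptions, not rows from sales_subset.csv) and sorts it first by one column, then by two columns with a different direction for each.

```python
import pandas as pd

# Hypothetical sample data, not taken from sales_subset.csv
sales = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "department": [10, 20, 10, 20],
    "weekly_sales": [2500.0, 900.0, 4100.0, 900.0],
})

# Largest weekly_sales first
by_sales = sales.sort_values("weekly_sales", ascending=False)

# Sort by store ascending, then by weekly_sales descending within each store
by_store_then_sales = sales.sort_values(
    ["store", "weekly_sales"], ascending=[True, False]
)

print(by_sales)
print(by_store_then_sales)
```

Note that sort_values() returns a new DataFrame and leaves the original one unchanged unless we assign the result back.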
The complete notebook is available here
2. Subsetting
When we need a specific part of a DataFrame, we use subsetting. We can extract:
specific rows and all columns
specific columns and all rows
specific rows and specific columns
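The three cases listed above can be sketched with boolean masks and .loc. The small DataFrame below is made up for illustration, and its column names are assumptions rather than the contents of a real file.

```python
import pandas as pd

# Hypothetical sample data for illustration only
temps = pd.DataFrame({
    "country": ["A", "B", "C"],
    "avg_temp_c": [21.5, 30.2, 12.8],
    "date": ["2000-01-01", "2000-01-01", "2000-01-01"],
})

# Specific rows, all columns: filter the rows with a boolean mask
warm_rows = temps[temps["avg_temp_c"] > 20]

# Specific column, all rows: select the column by name
countries = temps["country"]

# Specific rows and specific columns: .loc[row_selector, column_selector]
warm_countries = temps.loc[temps["avg_temp_c"] > 20, ["country", "avg_temp_c"]]

print(warm_rows)
print(countries)
print(warm_countries)
```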
Example: here we use a DataFrame of temperatures to find the country with the highest and the lowest temperature in each four-year period between 2000-01-01 and 2013-09-01. The code below handles the first period.
# Import modules
import pandas as pd
import numpy as np
# Load the dataset
temp = pd.read_csv('temperatures.csv')
# Keep the rows of the first period, up to 2004-01-01
in: temp1 = temp['date'] <= "2004-01-01"
avg_temp1 = temp[temp1]
# Find the highest temperature of the period, then the rows where it occurs
in: avg_temp1["avg_temp_c"].max()
out: 38.283
in: avg_temp1[avg_temp1['avg_temp_c'] == 38.283]
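An alternative to filtering on the maximum value is to let pandas locate the extreme rows directly with idxmax() and idxmin(). This is only a sketch on made-up rows: the `date` and `avg_temp_c` column names follow the example above, while the `country` column and all values are assumptions about what temperatures.csv contains.

```python
import pandas as pd

# Hypothetical rows; the real temperatures.csv is not reproduced here
temp_demo = pd.DataFrame({
    "date": ["2000-01-01", "2003-06-01", "2004-01-01", "2005-01-01"],
    "country": ["A", "B", "C", "D"],
    "avg_temp_c": [25.1, 38.283, -3.4, 40.0],
})

# Parse the dates so the comparison is made on timestamps, not strings
temp_demo["date"] = pd.to_datetime(temp_demo["date"])

# Keep the first period, up to and including 2004-01-01
first_period = temp_demo[temp_demo["date"] <= pd.Timestamp("2004-01-01")]

# idxmax() / idxmin() return the index label of the extreme value
hottest = first_period.loc[first_period["avg_temp_c"].idxmax()]
coldest = first_period.loc[first_period["avg_temp_c"].idxmin()]

print(hottest["country"], hottest["avg_temp_c"])
print(coldest["country"], coldest["avg_temp_c"])
```

This avoids copying a floating-point value from the output back into the code. If several rows share the extreme value, idxmax() and idxmin() return only the first one.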
The complete notebook is available here
3. Grouped summary statistics
We can apply summary statistics such as mean, median, sum, mode, minimum, maximum, quantile and standard deviation to each group of a DataFrame.
Here we use grouped summary statistics to see which factors influence student performance.
#load Data
df = pd.read_csv('StudentsPerformance.csv')
df.head()
in: df.groupby('parental level of education')['math score'].mean()
in: df.groupby('parental level of education')['reading score'].mean()
in: df.groupby('parental level of education')['writing score'].mean()
The group means suggest that the parents' level of education is related to the learners' performance.
Children whose parents have a higher level of education tend to get higher marks than the others.
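Several statistics can also be computed in one call with agg(). The sketch below uses made-up scores; only the column names mirror the ones used with StudentsPerformance.csv above, and the values are assumptions chosen to keep the arithmetic easy to check.

```python
import pandas as pd

# Hypothetical scores; only the column names mirror StudentsPerformance.csv
students = pd.DataFrame({
    "parental level of education": [
        "high school", "high school", "master's degree", "master's degree",
    ],
    "math score": [60, 70, 80, 90],
    "reading score": [65, 75, 85, 95],
})

# One row per group, one column per (score, statistic) pair
summary = students.groupby("parental level of education")[
    ["math score", "reading score"]
].agg(["mean", "median", "min", "max", "std"])

print(summary)
```

The result has a two-level column index, so a single value is read with a pair such as summary.loc["high school", ("math score", "mean")].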
The complete notebook is available here
4. Iterating over rows of DataFrame
Sometimes we need to iterate over the rows of a DataFrame. There are several methods to do it:
iterrows(): the loop uses two variables; the first receives the row index and the second receives the row as a pandas Series.
Example:
# Iterate over each row using iterrows
for index, row in df.iterrows():
    print(index, row['OSName'], row['Type '])
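One thing to keep in mind with iterrows() is that each row comes back as a Series, and a Series has a single dtype, so values from columns of different numeric types can be converted on the way. The following sketch uses a made-up two-column DataFrame (the names `count` and `share` are hypothetical) to show the effect.

```python
import pandas as pd

# Hypothetical data with one integer column and one float column
df_demo = pd.DataFrame({"count": [1, 2], "share": [0.5, 1.5]})

for index, row in df_demo.iterrows():
    # row is a Series, so both values share one dtype
    print(index, row["count"], row.dtype)

# Look at the first (index, row) pair on its own
first_index, first_row = next(df_demo.iterrows())
print(df_demo["count"].dtype, first_row.dtype)
```

If the original types matter, itertuples() is usually the safer choice because it keeps each value as it is stored in its column.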
itertuples(): the loop uses a single variable, which receives each row as a namedtuple.
Example:
in: # Iterate over rows using itertuples
for row in df.itertuples():
    print(row)
out: Pandas(Index=0, OSName='Windows 10 64 bit', PercApr22=73.55, ChangeApr22=-1.14, _4='Windows')
We can drop the index with index=False and set a custom name for the yielded namedtuples with name:
in: for row in df.itertuples(index=False, name='OS'):
    print(row)
out: OS(OSName='Windows 10 64 bit', PercApr22=73.55, ChangeApr22=-1.14, _3='Windows')
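The values of a namedtuple can be read by attribute or by position. Column names that are not valid Python identifiers, such as 'Type ' with its trailing space, are replaced by positional names, which is where fields like `_4` in the output above come from. A minimal sketch, where the second row and the reduced set of columns are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical data; 'Type ' keeps the trailing space seen in the example above
os_df = pd.DataFrame({
    "OSName": ["Windows 10 64 bit", "Example OS"],
    "PercApr22": [73.55, 1.0],
    "Type ": ["Windows", "Other"],
})

rows = list(os_df.itertuples(index=False, name="OS"))
first = rows[0]

print(first.OSName)   # attribute access by column name
print(first._fields)  # the invalid name 'Type ' is replaced by a positional name
print(first[2])       # positional access always works
```

Renaming the column once, for example with os_df.rename(columns={'Type ': 'Type'}), avoids the positional field name altogether.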
5. Creating a pandas DataFrame
There are many ways to create a pandas DataFrame.
Creating pandas DataFrame from a list of lists
Example:
list_data = [['Adama',17],['Clinton',12],['Kemogne',15],['John',13],['Ntep',18],['Bodo',14],['Abdou',9]]
data = pd.DataFrame(list_data, columns=['Name','Math score'])
Creating pandas DataFrame using the zip() function
name = ['Adam','Clinton','Kemogne','John','Ntep','Bodo','Abdou']
math_score = [17,12,15,13,18,14,9]
data = pd.DataFrame(zip(name,math_score), columns=['Name','Math score'])
data
Creating pandas DataFrame from a dictionary of lists
dict_data = {'Name':['Adam',"Clinton","Kemogne","John","Ntep","Bodo","Abdou"], 'Math score':[17,12,15,13,18,14,9]}
data = pd.DataFrame(dict_data)
data
Creating pandas DataFrame from a dictionary of Series
data={'Name':pd.Series(['Adam',"Clinton","Kemogne","John","Ntep","Bodo","Abdou"]),
'Math score':pd.Series([17,12,15,13,18,14,9])}
data = pd.DataFrame(data)
data
Creating pandas DataFrame from a list of dictionaries
data = [{'Name':'Adam','Math score':17},{'Name':'Clinton','Math score':12 },{'Name':'Kemogne','Math score': 15},
{'Name':'John','Math score':13 },{'Name':'Ntep','Math score':18 },
{'Name':'Bodo','Math score':14 }, {'Name':'Abdou','Math score': 9}]
data = pd.DataFrame(data)
data
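All of these constructors describe the same table, so they should give the same result. The sketch below builds a shortened version of the data (three names instead of seven, to keep it compact) in each of the five ways and compares the results with DataFrame.equals(); the variable names are hypothetical.

```python
import pandas as pd

names = ["Adam", "Clinton", "Kemogne"]
scores = [17, 12, 15]
columns = ["Name", "Math score"]

# The five construction styles described in this section
from_lists = pd.DataFrame([[n, s] for n, s in zip(names, scores)], columns=columns)
from_zip = pd.DataFrame(list(zip(names, scores)), columns=columns)
from_dict = pd.DataFrame({"Name": names, "Math score": scores})
from_series = pd.DataFrame({"Name": pd.Series(names), "Math score": pd.Series(scores)})
from_records = pd.DataFrame([{"Name": n, "Math score": s} for n, s in zip(names, scores)])

# equals() compares shape, labels, dtypes and values
all_equal = all(
    from_lists.equals(other)
    for other in (from_zip, from_dict, from_series, from_records)
)
print(all_equal)
```

Which form to choose mostly depends on the shape the data already has: row-oriented data fits the list-based forms, and column-oriented data fits the dictionary-based ones.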