Pandas Techniques for Data Manipulation in Python

Pandas is a powerful library for data manipulation that offers many tools for analysing data. It is built on top of NumPy and integrates with Matplotlib for plotting, so it relies on NumPy's fast array operations to process large amounts of data.

Pandas stores tabular data in a two-dimensional labelled structure called a DataFrame. There are many techniques we can use to analyse data with pandas.
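As a minimal sketch of what a DataFrame looks like, the snippet below builds one from made-up values (the column names and numbers are hypothetical, not taken from the datasets used later) and inspects its basic properties.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the structure.
toy = pd.DataFrame({
    "store": [1, 1, 2],
    "weekly_sales": [250.0, 120.5, 310.0],
})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels
print(toy.dtypes)         # each column has its own dtype
```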

1. Sorting a DataFrame

Sorting changes the order of the rows and is useful in many situations, for example when we want to bring the most interesting records to the top of the DataFrame.

Example:

Using our sales dataset (sales_subset.csv), we can sort weekly_sales from the highest to the lowest value.

# Load the dataset
in: import pandas as pd
    df = pd.read_csv('sales_subset.csv')
    df.head()


# Sort weekly_sales in descending order to see which department makes the biggest sales
df.sort_values('weekly_sales', ascending=False)
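sort_values also accepts a list of columns, which helps when several rows share the same value. The sketch below uses a small made-up DataFrame (the department and weekly_sales values are hypothetical) and breaks ties on weekly_sales by department.

```python
import pandas as pd

# Hypothetical values, chosen so that two rows tie on weekly_sales.
sales = pd.DataFrame({
    "department": [1, 2, 1, 3],
    "weekly_sales": [500.0, 1200.0, 800.0, 1200.0],
})

# Sort by weekly_sales (descending), then by department (ascending) for ties.
top = sales.sort_values(["weekly_sales", "department"], ascending=[False, True])

# sort_values returns a new DataFrame; reset the index to get 0..n-1 again.
top = top.reset_index(drop=True)
print(top)
```

Note that the original DataFrame keeps its order, because sort_values returns a sorted copy by default.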

The complete notebook is available here

2. Subsetting

When we need a specific part of a DataFrame, we use subsetting. We can extract:

specific rows and all columns
specific columns and all rows
specific rows and specific columns

Example: here we use a DataFrame of temperatures to find the countries with the highest and the lowest temperature in each four-year period between 2000-01-01 and 2013-09-01. The code below handles the first period, up to 2004-01-01.
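The three kinds of subset listed above can be sketched as follows. The DataFrame is a made-up stand-in (country names and temperatures are hypothetical) that only borrows the column names of the temperatures dataset.

```python
import pandas as pd

# Hypothetical data with the same column names as the temperatures dataset.
temps = pd.DataFrame({
    "country": ["A", "B", "C"],
    "date": ["2000-01-01", "2002-06-01", "2005-03-01"],
    "avg_temp_c": [21.5, 30.1, 12.4],
})

# Specific rows, all columns: filter with a boolean mask.
rows = temps[temps["avg_temp_c"] > 20]

# Specific columns, all rows: pass a list of column labels.
cols = temps[["country", "avg_temp_c"]]

# Specific rows and specific columns: .loc[row_mask, column_labels].
both = temps.loc[temps["date"] <= "2004-01-01", ["country", "avg_temp_c"]]

print(rows, cols, both, sep="\n\n")
```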

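# Import modules
import pandas as pd
import numpy as np

# Load the dataset
temp = pd.read_csv('temperatures.csv')

# Keep the rows dated on or before 2004-01-01
# (assuming the dates are ISO-formatted strings, which compare chronologically)
in: temp1 = temp['date'] <= "2004-01-01"
    avg_temp1 = temp[temp1]

# Find the highest temperature in this period
in: avg_temp1["avg_temp_c"].max()

out: 38.283

# Select the rows that reach this maximum
# (comparing against max() avoids hard-coding a rounded float)
in: avg_temp1[avg_temp1['avg_temp_c'] == avg_temp1['avg_temp_c'].max()]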
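An alternative way to reach the same rows is idxmax() and idxmin(), which return the index label of the highest and lowest value and so cover both extremes without an equality test on floats. This is a sketch on made-up data (the countries and temperatures are hypothetical), not the notebook's own code.

```python
import pandas as pd

# Hypothetical stand-in for the filtered temperatures DataFrame.
period = pd.DataFrame({
    "country": ["A", "B", "C"],
    "avg_temp_c": [21.5, 30.1, 12.4],
})

# idxmax()/idxmin() give the index label of the first max/min value.
hottest = period.loc[period["avg_temp_c"].idxmax()]
coldest = period.loc[period["avg_temp_c"].idxmin()]

print(hottest["country"], hottest["avg_temp_c"])
print(coldest["country"], coldest["avg_temp_c"])
```

If several rows share the extreme value, idxmax() and idxmin() return only the first one, whereas the boolean comparison shown earlier returns all of them.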

The complete notebook is available here

3. Grouped summary statistics

We can compute summary statistics such as the mean, median, sum, mode, minimum, maximum, quantiles, and standard deviation, either on a whole column or separately for each group.

Here we use grouped summary statistics to look at factors that may influence student performance.

#load Data
df = pd.read_csv('StudentsPerformance.csv')
df.head()

in: df.groupby('parental level of education')['math score'].mean()

in: df.groupby('parental level of education')['reading score'].mean()

in: df.groupby('parental level of education')['writing score'].mean()

The group means suggest that the parents' level of education is associated with the learners' performance.

On average, the marks of children whose parents have a higher level of education are higher than those of the others.
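Several statistics can be computed for several columns in one call with agg(). The sketch below uses made-up scores (the numbers are hypothetical and do not come from StudentsPerformance.csv) and reuses only the column names from the example above.

```python
import pandas as pd

# Hypothetical scores, used only to show the shape of the result.
scores = pd.DataFrame({
    "parental level of education": [
        "high school", "high school", "master's degree", "master's degree",
    ],
    "math score": [60, 70, 80, 90],
    "reading score": [65, 75, 85, 95],
})

# One row per group, one (column, statistic) pair per output column.
summary = (
    scores.groupby("parental level of education")[["math score", "reading score"]]
    .agg(["mean", "min", "max"])
)
print(summary)
```

The result has two-level column labels, so a single value is read with a tuple, for example summary.loc["high school", ("math score", "mean")].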

The complete notebook is available here

4. Iterating over the rows of a DataFrame

Sometimes we need to iterate over a DataFrame. There are several methods for doing so:

iterrows(): here we use two variables to iterate over the rows; the first receives the index and the second receives the row as a pandas Series.

e.g.:

# Iterating over each row using iterrows
# (note that the 'Type ' label ends with a space)
for index, row in df.iterrows():
    print(index, row['OSName'], row['Type '])
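Because iterrows() packs each row into a single Series, the values of a row are converted to one common dtype. The sketch below, on a made-up DataFrame with hypothetical column names, shows an integer column arriving as a float inside the row.

```python
import pandas as pd

# Hypothetical DataFrame with one integer and one float column.
mixed = pd.DataFrame({"count": [1, 2], "ratio": [1.5, 2.5]})

# Take the first (index, row) pair produced by iterrows().
index, row = next(mixed.iterrows())

print(mixed["count"].dtype)  # integer dtype in the DataFrame
print(row.dtype)             # one dtype for the whole row Series
print(row["count"])          # the integer was upcast to a float
```

This is one reason to prefer itertuples() or column-wise operations when the original dtypes matter.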

itertuples(): here we use a single variable, which receives each row as a namedtuple.

Example:

in: # Iterating over rows using itertuples
    for row in df.itertuples():
        print(row)

out: Pandas(Index=0, OSName='Windows 10 64 bit', PercApr22=73.55, ChangeApr22=-1.14, _4='Windows')

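We can drop the index with index=False and set a custom name for the yielded namedtuples with the name parameter. Fields whose column labels are not valid Python identifiers (here 'Type ', with its trailing space) are renamed by position, so _4 becomes _3 once the index is dropped.

in: for row in df.itertuples(index=False, name='OS'):
        print(row)

out: OS(OSName='Windows 10 64 bit', PercApr22=73.55, ChangeApr22=-1.14, _3='Windows')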
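The fields of each namedtuple can be read by attribute or by position. The sketch below uses a small made-up DataFrame (the values are hypothetical) with one column label that is a valid identifier and one that is not.

```python
import pandas as pd

# Hypothetical data; 'Type ' deliberately ends with a space.
os_df = pd.DataFrame({
    "OSName": ["Windows", "Linux"],
    "Type ": ["Desktop", "Server"],
})

names, kinds = [], []
for row in os_df.itertuples(index=False, name="OS"):
    names.append(row.OSName)  # valid identifier: attribute access works
    kinds.append(row[1])      # invalid identifier: read it by position

# The invalid label was replaced by a positional field name.
fields = row._fields
print(names, kinds, fields)
```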

5. Creating a pandas DataFrame

There are many ways to create a pandas DataFrame.

Creating a pandas DataFrame from a list of lists

Example:

list_data = [['Adama',17],['Clinton',12],['Kemogne',15],['John',13],['Ntep',18],['Bodo',14],['Abdou',9]]
data = pd.DataFrame(list_data, columns=['Name','Math score'])

Creating a pandas DataFrame using the zip() function

name = ['Adam','Clinton','Kemogne','John','Ntep','Bodo','Abdou']
math_score = [17,12,15,13,18,14,9]
data = pd.DataFrame(zip(name,math_score), columns=['Name','Math score'])
data

Creating a pandas DataFrame from a dictionary of lists

dict_data = {'Name':['Adam',"Clinton","Kemogne","John","Ntep","Bodo","Abdou"], 'Math score':[17,12,15,13,18,14,9]}
data = pd.DataFrame(dict_data)
data

Creating a pandas DataFrame from a dictionary of Series

data={'Name':pd.Series(['Adam',"Clinton","Kemogne","John","Ntep","Bodo","Abdou"]), 
        'Math score':pd.Series([17,12,15,13,18,14,9])}
data = pd.DataFrame(data)
data

Creating a pandas DataFrame from a list of dictionaries

data = [{'Name':'Adam','Math score':17},{'Name':'Clinton','Math score':12 },{'Name':'Kemogne','Math score': 15},
        {'Name':'John','Math score':13 },{'Name':'Ntep','Math score':18 },
        {'Name':'Bodo','Math score':14 }, {'Name':'Abdou','Math score': 9}]
data = pd.DataFrame(data)
data
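As a quick check that these constructions are interchangeable, the sketch below builds the same two-row table in four ways and compares the results with DataFrame.equals(). The shortened data and the variable names are chosen here for illustration.

```python
import pandas as pd

names = ["Adam", "Clinton"]
scores = [17, 12]
cols = ["Name", "Math score"]

# List of lists: one inner list per row.
from_rows = pd.DataFrame([["Adam", 17], ["Clinton", 12]], columns=cols)

# zip(): pairs the two lists into rows.
from_zip = pd.DataFrame(list(zip(names, scores)), columns=cols)

# Dictionary of lists: one key per column.
from_dict = pd.DataFrame({"Name": names, "Math score": scores})

# List of dictionaries: one dictionary per row.
from_records = pd.DataFrame([
    {"Name": "Adam", "Math score": 17},
    {"Name": "Clinton", "Math score": 12},
])

# equals() compares values, labels, and dtypes.
same = all(from_rows.equals(other) for other in (from_zip, from_dict, from_records))
print(same)
```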

datainsightonline.com
