Five Pandas Techniques for Data Manipulations
PANDAS
Pandas is a Python library used for working with data sets. It is used for for analysing, cleaning, exploring and manipulating data. Pandas allows us to analyse big data.It can clean messy data sets, and make them readable and relevant.
Exploring Dataset
First we import the python libraries, pandas and numpy as follows.
import pandas as pd
import numpy as np
Now we load our csv (comma-separated values ) file and assign a file name to it.
df = pd.read_csv('Rectangular Data.csv')
To read the first five rows of our data we make use of head() method.
df.head()
Output
We make use of tail() method along with it we assign a numeric value for the number of last rows we want to view. Here we have assign 2, in order to view the last two rows of our data.
df.tail(2)
The info() function is used to print a concise summary of a DataFrame. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
df.info()
The shape attribute is used to view the number of rows and columns of our DataFrame.
df.shape
output
(7, 6)
Pandas describe() is used to view some basic statistical details like percentile, mean, std etc. of a data frame or a series of numeric values.
df.describe()
output
array([['Sintu', 'Labrador', 'Black', 57, 32, '7/1/2013'],
['Sonam', 'Schnauzer', 'Brown', 70, 72, '9/16/2016'],
['Rattu', 'Poodle', 'White', 45, 25, '8/25/2014'],
['Dolphie', 'Chihuahua', 'Brown', 18, 7, '12/11/2011'],
['Kattu', 'Labrador', 'Black', 44, 23, '1/20/2017'],
['Pelki', 'Chow Chow', 'Tan', 39, 25, '4/20/2015'],
['Zimba', 'St. Bernard', 'Grey', 55, 30, '2/27/2018']],
dtype=object)
The columns attribute and index returns the column labels and the index name of the given Dataframe.
df.columns
output
Index(['Name', 'Breed', 'Color', 'Height (cm)', 'Weight (kg)',
'Date of Birth'],
dtype='object')
df.index
output
RangeIndex(start=0, stop=7, step=1)
2. Pivot Table
First we calculate the mean of height of each color.
df.pivot_table(values='Height (cm)', index='Color')
Median of each color Height is calculated.
df.pivot_table(values='Height (cm)', index='Color', aggfunc=np.median)
To view multiple statistics we make use of aggfunc
df.pivot_table(values='Height (cm)', index='Color', aggfunc=[np.mean, np.median])
Pivot in Two variables.
df.pivot_table(values='Height (cm)', index='Color', columns='Breed')
Filling missing values, value to replace missing values with (in the resulting pivot table, after aggregation).
df.pivot_table(values='Height (cm)', index='Color', columns='Breed', fill_value=0)
We sum the rows and columns together.
df.pivot_table(values='Height (cm)', index='Color', columns='Breed', fill_value=0, margins=True)
3. Sorting
The sort_values() function sorts a data frame in Ascending or Descending order of passed Column. By default pandas sorts the datas in ascending order.
df.sort_values('Height (cm)')
In order to sort the datas in descending order we have to specify ascending=False.
df.sort_values('Weight (kg)', ascending=False)
We can also sort multiple values by calling pandas DataFrame .sort_values in ascending with a list of column names to sort the rows in the DataFrame object based on the columns specified.
df.sort_values(['Weight (kg)', 'Height (cm)'])
4. Subsetting
To select a single column, use square brackets [] with the column name of the column of interest.
df['Breed']
Output
0 Labrador
1 Schnauzer
2 Poodle
3 Chihuahua
4 Labrador
5 Chow Chow
6 St. Bernard
Name: Breed, dtype: object
To select multiple columns we can pass the column name of the desired columns.
df[["Name","Height (cm)"]]
Subsetting rows with boolean values
df["Height (cm)"] > 50
Output
0 True
1 True
2 False
3 False
4 False
5 False
6 True
Name: Height (cm), dtype: bool
Subsetting rows with numeric values
df[df["Height (cm)"] > 50]
Subsetting datas based on words.
df[df["Breed"] == 'Labrador']
Subsetting with on multiple datas
is_lab = df['Breed'] == 'Labrador'
is_black = df['Color'] == 'Black'
df[is_lab & is_black]
5. Summary Statistics Summary statistics is a part of descriptive statistics that summarises and provides the gist of information about the sample data. Statisticians commonly try to describe and characterise the observations by finding: a measure of location, or central tendency, such as the arithmetic mean. We find the mean of height as follows.
df['Height (cm)'].mean()
Output
46.857142857142854
We also calculate the oldest and the latest date.
df['Date of Birth'].min()
Output
'1/20/2017'
df['Date of Birth'].max()
Output
'9/16/2016'
We calculate the cumulative sum of the weights of the dogs as follows.
df['Weight (kg)'].cumsum()
Output
0 32
1 104
2 129
3 136
4 159
5 184
6 214
Name: Weight (kg), dtype: int64
Conclusion
This article is written in part of the data insight online program with reference of datas from DataCamp.
Comments