top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Writer's pictureEbrima Sise

Where Function in Pandas

The where() in pandas and NumPy is used to filter Data Frames based on a condition. We will discuss this important topic in this tutorial. First, we will need to import our libraries that will help us work with a Data Frame.


import numpy as np
import pandas as pd
import statsmodels.api as sm

Now that the libraries are imported, we will load the mtcars dataset from statsmodels into a Data Frame and name it as df.


df = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data

Let's take a quick glance at the newly created Data Frame.


df.head()

mpg cyl disp hp drat wt qsec vs am gear carb

Mazda RX4 21.0 6 160.0 110 3.9 2.62 16.46 0 1 4 4

Mazda RX4 Wag 21.0 6 160.0 110 3.9 2.875 17.02 0 1 4 4

Datsun 710 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1

Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1

Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2


Our tutorial will start by first getting the indices of all the cars in the Data Frame which has four cylinders. We will store these indices in a variable called four_cyl_ind.


four_cyl_ind = np.where(df.cyl == 4)

Since the dataset is not very big, we can assume that the four-cylinder cars will not be too many. That being the case, we will print the indices below.

four_cyl_ind
output: (array([ 2,  7,  8, 17, 18, 19, 20, 25, 26, 27, 31]),)

Now, we can subset the Data Frame and get data on only the four-cylinder cars.

four_cylinder_cars = df.iloc[four_cyl_ind]

To confirm that only four-cylinder cars are in the new Data Frame, we will check the unique values in the cyl column. If it is more than one, then our assertion is incorrect.

four_cylinder_cars['cyl'].unique()
output: array([4])

And yes! We were right. But let's look at the first few rows of the new Data Frame.


four_cylinder_cars.head()

mpg cyl disp hp drat wt qsec vs am gear carb

Datsun 710 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1

Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2

Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2

Fiat 128 32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1

Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2


We can use the where() to also create a new column in a Data Frame. It checks a condition, if true, the second argument is returned. Otherwise, the third one is returned. We will create a new column called consumption based on this.


df['consumption'] = np.where(df['mpg'] > 25,
        'High MPG', 
        'Low MPG')

Finally, we will see the new column added to Data Frame.

mpg cyl disp hp drat wt qsec vs am gear carb consumption

Mazda RX4 21.0 6 160.0 110 3.9 2.62 16.46 0 1 4 4 Low MPG

Mazda RX4 Wag 21.0 6 160.0 110 3.9 2.875 17.02 0 1 4 4 Low MPG

Datsun 710 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1 Low MPG

Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Low MPG

Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2 Low MPG



0 comments

Recent Posts

See All

Comments


bottom of page