Become a pandas ninja with nlargest(), nsmallest(), query and where methods
Pandas is a famous library with a ton of methods. The methods we cover in this article are not often used by data scientists but can be very useful in certain situations. You must add these methods into your pandas skills to definitely become a pandas ninja.
In this article, we'll work with Africa Cup of Nations dataset upload on Kaggle by Mohammad Essam, especially with Participated Teams General Statistics dataset.
import pandas as pd
df_teams_statistics = pd.read_csv('Participated Teams General Statistics.csv')
Throughout this article with this dataset, we'll explore nlargest(), nsmallest(), query() and where() methods.
1. DataFrame.nlargest() and DataFrame.nsmallest()
DataFrame.nlargest() is a pandas method that ordered DataFrame in descending order and takes the first n rows. The signature of this method is:
DataFrame.nlargest(n, columns, keep='first')
where
columns is a list of columns to use to sort DataFrame in descending order
keep indicate the occurrences of duplicate rows to conserve. This parameter takes value in the set first, last, all
n indicate the number of rows to take after sorting the DataFrame.
df_teams_statistics.nlargest(5, columns='Rank')
The image shows us clearly that nlargest() method sort DataFrame in descending order and take the first five elements. You can add keep parameter to tell nlargest() method how to do in front of duplicate rows. The default value for keep is first.
To sort our DataFrame in ascending order and get the first n rows, pandas provide another method to do that: DataFrame.nsmallest(). The signature of this method is:
DataFrame.nsmallest(n, columns, keep='first')
This method takes the same parameter as nlargest() but works in an inverse way.
df_teams_statistics.nsmallest(5, columns='Rank')
The result DataFrame in this case is ordered in ascending order based on columns we specify.
In this section, we talk about nlargest and nsmallest which are pretty useful and easy when working with pandas. In the next section, we'll talk about DataFrame.query().
2. DataFrame.query
This method is a pandas shortcut for filtering DataFrame. It allows us to query DataFrame columns with a boolean expression. The signature of this method is:
DataFrame.query(expr, inplace=False, **kwargs)
where
expr is a string to evaluate
inplace indicate if we'll work on a copy of DataFrame or not
df_teams_statistics.query("Win > 20")
The previous line of code return all entries where the number of matches played is greater than 20. We can combine multiple conditions in the same expression.
display(df_teams_statistics.query("Win > 20 & Lose < 20"))
display(df_teams_statistics.query("Win > 20 and Lose < 20"))
We have various boolean operator to use to combine different condition in our expression.
and with equivalent &
or with his equivalent |
not with his equivalent ~
We can find all African countries which have never win a African Cup of Nation like this:
display(df_teams_statistics.query('~(Titles >= 1)'))
display(df_teams_statistics.query('not(Titles >= 1)'))
where
Titles > 1 return all African countries which have at least win one African Cup of Nations
~(Titles > 1) or not(Titles > 1) return the inverse of the previous result.
To query all African countries with a number of Win greater than 20 and the number of lost less than 20, we write this condition:
df_teams_statistics.query("Win > 20 and Lose < 20")
One can combine conditions as he wants to archive a goal. In this section, we give an overview of the query method. Check pandas documentation for further explanation.
3. DataFrame.where
DataFrame.where() is another method offered by pandas that are not regularly used but can be useful in certain cases. This method replaces values for which our condition is evaluated to False. In other words, it replaces all values that do not satisfy one or more criteria by the one we give or with NaN by default. The signature of this method is :
DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=NoDefault.no_default)
where
cond is the condition to be satisfy by each entries. This condition can a boolean Series/DataFrame, array-like, or callable
other specify the value to give to each cell where the evaluation of condition is False
To know more about other parameters , check documentation.
For the following example, we will drop column Team to only have a DataFrame with numerical values.
df_teams_statistics.drop(columns='Team', inplace=True)
Now, we can make some example with DataFrame.where() method.
df_teams_statistics.where(df_teams_statistics < 100,'HIGHT').head(10)
In this example, we replace all entries in DataFrame with a value greater than 100 with the string HIGHT. As we say, DataFrame.where() fill all entries evaluate to False with the value HIGHT we provide.
If we don't set other parameter, the False entries will be replaced by NaN.
df_teams_statistics.where(df_teams_statistics < 100).head()
We can achieve the same as previous using a callback function in the first parameter of where method.
df_teams_statistics.where(lambda x: x< 100, "OK").head()
Conclusion
With this blog post, we give an overview of how nlargest(), nsmallest(), query and where methods work. For more examples, you can check pandas documentation for each method. The source code used for this article is downloading here. This article is written in part of the data insight online program.
Other links: This article is one of the series of articles on pandas techniques for data manipulation. The following link opens directly the the other ones.
Comments