Introduction to Data Visualization with Seaborn
Introduction
Visualization is a critical step in drawing insights from data because it reveals patterns, distributions and relationships. Good figures and tables are also very helpful for communicating about the data. Many libraries can visualize data; in my opinion, seaborn is the best fit for data science. It is a high-level data visualization library built on top of Matplotlib. In this tutorial, we will go step by step through some of the plots data scientists use most in seaborn. We will use the food recipes dataset from Datacamp. The tutorial is organized in three main parts: relational plots, categorical plots and distribution plots.
First, we load the necessary packages: pandas for importing the data, seaborn under its common alias sns, and pyplot from matplotlib, since seaborn is built on it. We then call sns.set() to apply seaborn's default theme and read the recipes into a DataFrame.
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
data = pd.read_csv('recipes.csv')
data.head(3)
1. Relational plots
These plots show how two variables in the data relate to each other, optionally taking further variables into account. In the following lines, we will discuss and illustrate the two types of relational plots:
Scatter plots;
Line plots;
1.1. Scatter plot
This kind of plot shows the joint distribution of two variables as a cloud of points. Each point corresponds to one observation, so both variables must have the same length. Once the cloud is displayed, the eye can quickly detect potential relationships between the two coordinates (variables), and also with any variables mapped to hue and/or size. seaborn offers functions at two levels to draw scatter plots:
A figure-level function, relplot(), with the kind parameter set to 'scatter' or left unset, since 'scatter' is the default.
An axes-level function, scatterplot().
For illustration, let's visualize the relationship between the energy (Calories) and the sugar contained in each recipe (SugarContent). The size of each point will be given by its score (HighScore column).
fig, ax = plt.subplots()
sns.scatterplot(x='SugarContent', y='Calories', size='HighScore', data=data, ax=ax)
ax.set_title('Relationship between Energy and Sugar content per recipe')
ax.set_xlabel('Sugar')
plt.show()
In the above code, the ax parameter tells seaborn which matplotlib Axes object to draw on. When it is not specified, seaborn uses the current Axes managed by pyplot. The same relationship can be drawn with the following code. Notice that the ax parameter has been removed: relplot() is a figure-level function that creates its own figure and returns a FacetGrid object, which, like the result of plt.subplots(), can hold more than one Axes. This time the color of each point depends on its score (HighScore column).
g = sns.relplot(x='SugarContent', y='Calories', data=data, hue='HighScore')
g.fig.suptitle('Relationship between Energy and Sugar content per recipe', fontsize=10)
g.fig.subplots_adjust(top=.9) # To avoid title displaying on the figure
g.set_xlabels('Sugar')
plt.show()
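To make the difference between the two levels concrete, here is a minimal sketch. It assumes a small synthetic DataFrame named toy (made-up values, not the recipes dataset) that reuses the tutorial's column names. It checks what each function returns and uses the col parameter of relplot() to get one panel per HighScore value.

```python
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Synthetic stand-in for the recipes data (values are made up for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "SugarContent": rng.uniform(0, 50, size=100),
    "HighScore": np.repeat([0.0, 1.0], 50),
})
toy["Calories"] = 4 * toy["SugarContent"] + rng.normal(200, 30, size=100)

# Axes-level: draws on the Axes we pass in and returns that same Axes.
fig, ax = plt.subplots()
returned = sns.scatterplot(x="SugarContent", y="Calories", data=toy, ax=ax)

# Figure-level: creates its own figure and returns a FacetGrid.
# col="HighScore" asks for one panel (one Axes) per category.
g = sns.relplot(x="SugarContent", y="Calories", data=toy, col="HighScore")

print(returned is ax)   # the axes-level call handed back our Axes
print(g.axes.shape)     # grid of Axes held by the FacetGrid
# plt.show()  # uncomment to display the figures outside a notebook
```

Because the FacetGrid owns its figure, titles and labels are set through the grid (as with g.set_xlabels above) rather than through an ax variable.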
1.2. Line plot
A scatter plot is very effective at showing potential relationships in data, but in some cases it is less suitable than a line plot, for example when we want to visualize how a quantity evolves over time. This can be done using relplot() with kind='line' or directly with lineplot().
fig, ax = plt.subplots()
sns.lineplot(x='SugarContent', y='Calories', data=data, ax=ax)
ax.set_title('Relationship between Energy and Sugar content per recipe')
ax.set_xlabel('Sugar')
plt.show()
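Note: lineplot() sorts the data by x in ascending order before plotting it. This default behaviour can be overridden by specifying the parameter sort=False. When several rows share the same x value, lineplot() by default plots their mean together with a 95% confidence interval.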
2. Categorical plots
Relational plots show relationships between numeric variables. What if some data are divided into groups or categories? Seaborn offers one figure-level function, catplot(), which, through the kind parameter, covers 8 axes-level functions divided into 3 categories:
Categorical scatterplots:
stripplot() (with kind="strip"; the default)
swarmplot() (with kind="swarm")
Categorical distribution plots:
boxplot() (with kind="box")
violinplot() (with kind="violin")
boxenplot() (with kind="boxen")
Categorical estimate plots:
pointplot() (with kind="point")
barplot() (with kind="bar")
countplot() (with kind="count")
In this tutorial, we will cover one function per category, namely stripplot, boxplot and countplot.
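As a rough sketch of how the kind parameter switches between these functions, the snippet below draws the three kinds covered in this tutorial through catplot(). The DataFrame toy is a synthetic stand-in with made-up values; only the column names come from the recipes data.

```python
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Synthetic stand-in: 50 "unpopular" and 50 "popular" recipes (made-up values).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "HighScore": np.repeat([0.0, 1.0], 50),
    "Calories": rng.gamma(shape=2.0, scale=150.0, size=100),
})

grids = {}
# Strip and box plots need a numeric variable on the other axis.
for kind in ["strip", "box"]:
    g = sns.catplot(x="HighScore", y="Calories", data=toy, kind=kind)
    g.ax.set_title(f'kind="{kind}"')
    grids[kind] = g

# A count plot only needs the categorical variable.
grids["count"] = sns.catplot(x="HighScore", data=toy, kind="count")
grids["count"].ax.set_title('kind="count"')
# plt.show()  # uncomment to display the figures outside a notebook
```

Each call returns a FacetGrid, so the same col and row faceting available in relplot() can be used here as well.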
2.1. Strip plot
The HighScore column has only two possible values, which describe the popularity of recipes: 1.0 (Popular) or 0.0 (Unpopular). We will draw a strip plot of the Calories column grouped by popularity, that is, by the HighScore column. Unlike the relational plots, where we used both figure-level and axes-level functions, here we will only illustrate axes-level functions.
sns.stripplot(x = 'HighScore', y='Calories', data=data)
plt.show()
This plot gives an overview of how the data are distributed within each category.
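A strip plot places every observation at its category position and, by default, adds a small random horizontal jitter so that points with similar values do not hide each other. The sketch below, on made-up data in a DataFrame called toy, compares the default with jitter=False, where all points of a category fall on one vertical line.

```python
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Synthetic stand-in with the tutorial's column names (made-up values).
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "HighScore": np.repeat([0.0, 1.0], 30),
    "Calories": rng.gamma(2.0, 150.0, size=60),
})

fig, (ax_jitter, ax_line) = plt.subplots(1, 2, sharey=True)
sns.stripplot(x="HighScore", y="Calories", data=toy, ax=ax_jitter)  # default jitter
sns.stripplot(x="HighScore", y="Calories", data=toy, jitter=False, ax=ax_line)

# Horizontal positions actually used when jitter is switched off.
x_no_jitter = sorted(
    {float(x) for coll in ax_line.collections for x in coll.get_offsets()[:, 0]}
)
print(x_no_jitter)
# plt.show()  # uncomment to display the figure outside a notebook
```

Without jitter the points overlap heavily, which is why the default is usually the better choice for exploring a distribution.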
2.2. Box plot
Categorical scatter plots become less informative as the dataset grows, because the points overlap. A box plot comes in handy in this case, as it gives a quick statistical summary of the data in each category. The illustration below summarizes the Calories of each recipe category.
sns.boxplot(x = 'HighScore', y='Calories', data=data, showfliers=False)
plt.xlabel('')
plt.title('Distribution of Calories per recipes category')
plt.xticks([0, 1], ['Unpopular', 'Popular'])
plt.show()
The line inside the box represents the median, while the upper and lower edges of the box represent the 75th and 25th percentiles (the third and first quartiles) respectively; the distance between them is the interquartile range (IQR). The whiskers, that is the upper and lower horizontal lines, mark the limits beyond which a data point is considered an outlier; by default they extend to the most extreme data points within 1.5 * IQR of the box edges. The parameter showfliers controls whether outliers are displayed.
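The quantities drawn by a box plot can be computed by hand, which helps when reading one. The following sketch uses ten made-up values (not taken from the recipes data) and plain NumPy; the variable names are illustrative.

```python
import numpy as np

# Ten made-up values with one obvious extreme point.
calories = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(calories, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which a point counts as an outlier (1.5 * IQR rule).
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the limits.
inside = calories[(calories >= lower_limit) & (calories <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = calories[(calories < lower_limit) | (calories > upper_limit)]

print(q1, median, q3, iqr)        # 3.25 5.5 7.75 4.5
print(lower_limit, upper_limit)   # -3.5 14.5
print(whisker_low, whisker_high)  # 1.0 9.0
print(outliers)                   # [100.]
```

In boxplot(), the factor 1.5 corresponds to the whis parameter, so a larger value makes the plot flag fewer points as outliers.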
2.3. Count plot
As its name indicates, this type of plot counts the number of data points in each category.
sns.countplot(x = 'HighScore', data=data)
plt.xlabel('')
plt.ylabel('')
plt.title('Total number of recipes per category')
plt.xticks([0, 1], ['Unpopular', 'Popular'])
plt.show()
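The bar heights of a count plot are the same numbers that pandas returns from value_counts(). The sketch below checks this on seven made-up labels stored in a DataFrame called toy.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Seven made-up popularity labels: three 0.0 and four 1.0.
toy = pd.DataFrame({"HighScore": [0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0]})

# Counts per category computed with pandas.
counts = toy["HighScore"].value_counts().sort_index()

# The same counts, read back from the bars drawn by seaborn.
fig, ax = plt.subplots()
sns.countplot(x="HighScore", data=toy, ax=ax)
bar_heights = sorted(patch.get_height() for patch in ax.patches)

print(counts.tolist(), bar_heights)
# plt.show()  # uncomment to display the figure outside a notebook
```

When only the numbers are needed, value_counts() is enough; the plot is useful for comparing categories at a glance.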
3. Distribution plots
Sound data analysis relies on understanding and interpreting how the data are distributed, answering questions like: What is the central tendency of the data? Are the data skewed or symmetrically distributed? Seaborn offers 4 axes-level functions, kdeplot(), ecdfplot(), histplot() and rugplot(), which are wrapped by 3 figure-level functions: displot(), pairplot() and jointplot(). In this part of the tutorial, we will only illustrate three of the axes-level functions behind displot():
histplot() for a histogram
kdeplot() for a kernel density estimate
ecdfplot() for an empirical cumulative distribution function.
3.1. Histplot
By default, histplot() chooses the bins automatically (bins='auto'), splits the data into bins of equal width and plots their frequency (the number of data points per bin). The number of bins is controlled by the parameter bins, which accepts an integer among other values. Let's visualize the calories of the recipes as a histogram. Because a few data points lie far from the mean and median, we will create a new dataframe df by sorting data on the Calories column in ascending order and keeping the first 42000 of the ≈43000 data points, thus dropping the largest values as outliers.
df = data.sort_values('Calories')[['Calories', 'HighScore']].iloc[0:42000]
sns.histplot(x='Calories', data=df, bins=20)
# sns.displot(df, x='Calories', bins=20)
plt.show()
3.2. Kde plot
It aims to plot the kernel density estimate (KDE), a non-parametric way to estimate the probability density function of the data.
sns.kdeplot(x='Calories', data=df)
plt.title('Kernel Density Estimate of recipes Calories')
plt.show()
Notice that this curve has the same overall shape as the histogram above.
3.3. Ecdf plot
sns.ecdfplot(x='Calories', data=df)
plt.title('ECDF of recipes calories')
plt.show()
Conclusion
In this tutorial we covered a few data visualization functions from seaborn. In my view, this Python library is the most accomplished one for data scientists thanks to its ease of use, simple syntax and, especially, its hierarchy of figure-level and axes-level functions. Most importantly, it leverages the power of matplotlib while making it less cumbersome and clearer.
Find the notebook attached to this article here.