
Pyae Phyo Kyaw

Pandas Techniques for Data Manipulation in Python: groupby() & pivot_table()

Pandas is a Python library for data analysis, started by Wes McKinney in 2008, created to fill the need for a powerful and flexible quantitative analysis tool.

Pandas is built on top of two core Python libraries - matplotlib for data visualization and NumPy for mathematical operations. Pandas acts as a wrapper over these libraries, allowing you to access many of matplotlib's and NumPy's methods with less code.

Loading Data in Pandas

The first thing to do when dealing with data is to load the data files for analysis. Received data can come in a variety of configurations and file types. Data can be in any of the popular formats - CSV, TXT, XLS/XLSX (Excel), sas7bdat (SAS), Stata, RData (R), etc.

While importing external files, we need to check the following points:

  • Check whether a header row exists or not

  • Treatment of special values as missing values

  • Consistent data types within a variable (column)

  • Date-type variables in a consistent date format

  • No truncation of rows while reading external data

In this blog, I will mainly work with the supermarket_sales dataset from Kaggle. Let’s start by importing the supermarket_sales CSV file. In order to do this we use Pandas' read_csv() function and pass the path to the file (which in our case is 'dataset/supermarket_sales.csv') as a parameter. We also parse the 'Date' column so that it is read with a 'datetime' data type.
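
A minimal sketch of this loading step (the path is the one mentioned above; print(df.info()) produces the summary shown below):

import pandas as pd

# Read the supermarket sales data and parse the 'Date' column as datetime
df = pd.read_csv('dataset/supermarket_sales.csv', parse_dates=['Date'])
print(df.info())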

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Invoice ID               1000 non-null   object        
 1   Branch                   1000 non-null   object        
 2   City                     1000 non-null   object        
 3   Customer type            1000 non-null   object        
 4   Gender                   1000 non-null   object        
 5   Product line             1000 non-null   object        
 6   Unit price               1000 non-null   float64       
 7   Quantity                 1000 non-null   int64         
 8   Tax 5%                   1000 non-null   float64       
 9   Total                    1000 non-null   float64       
 10  Date                     1000 non-null   datetime64[ns]
 11  Time                     1000 non-null   object        
 12  Payment                  1000 non-null   object        
 13  cogs                     1000 non-null   float64       
 14  gross margin percentage  1000 non-null   float64       
 15  gross income             1000 non-null   float64       
 16  Rating                   1000 non-null   float64       
dtypes: datetime64[ns](1), float64(7), int64(1), object(8)
memory usage: 132.9+ KB
None

Grouping by summary statistic

Grouping data is a pretty simple concept: we create groups of categories and apply a function to each group. It’s a simple idea, but it’s an extremely valuable technique that’s widely used in data science. In real data science projects, you’ll be dealing with large amounts of data and trying things over and over, so for efficiency we use the groupby concept. Groupby is important because of its ability to aggregate data efficiently, both in performance and in the amount of code. Groupby mainly refers to a process involving one or more of the following steps:

  • Splitting: the data is split into groups by applying some condition on the dataset.

  • Applying: a function is applied to each group independently.

  • Combining: the results are combined into a data structure.

groupby() in Pandas

Pandas objects can be split on any of their axes. There are multiple ways to split data, such as:

  • df.groupby(key)

  • df.groupby(key, axis=1)

  • df.groupby([key1, key2])

Grouping data with one key:

In order to group data with one key, we pass only one key as an argument to the groupby() function.
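
For instance, a small sketch grouping the sales by 'Product line' (the column choice here is just for illustration, and df is the DataFrame loaded above):

# Group the rows by a single key and look at a summary per group
grouped = df.groupby('Product line')
print(grouped.ngroups)               # number of distinct product lines
print(grouped['Total'].mean())       # mean 'Total' for each product line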

Grouping data with multiple keys:

In order to group data with multiple keys, we pass a list of keys to the groupby() function.
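
A sketch grouping by two keys, 'City' and 'Customer type' (again, the columns are chosen only as an example):

# Group by two keys at once; the result is indexed by (City, Customer type)
grouped = df.groupby(['City', 'Customer type'])
print(grouped['Total'].sum())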

Grouping data by sorting keys:

Group keys are sorted by default during the groupby operation. You can pass sort=False for potential speedups.
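
For example (using 'Payment' as the grouping key for illustration):

# Keys are sorted alphabetically by default ...
print(df.groupby('Payment')['Total'].sum())

# ... pass sort=False to keep the groups in the order they first appear
print(df.groupby('Payment', sort=False)['Total'].sum())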

Selecting a group

In order to select a single group, we use GroupBy.get_group(). This function selects one group from a grouped object.
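
A sketch of get_group(); the label 'Yangon' is assumed here to be one of the values in the 'City' column:

# Select the rows belonging to a single group
grouped = df.groupby('City')
yangon_sales = grouped.get_group('Yangon')   # 'Yangon' is an assumed city value
print(yangon_sales.head())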

Now we select a group from an object grouped on multiple columns.
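
When the grouping uses several columns, get_group() expects a tuple of labels; the values ('Yangon', 'Member') below are assumed for illustration:

# Select one group from a grouping on multiple columns
grouped = df.groupby(['City', 'Customer type'])
subset = grouped.get_group(('Yangon', 'Member'))   # tuple of (City, Customer type)
print(subset.head())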

Applying function to group

After splitting the data into groups, we apply a function to each group. To do that, we perform one of the following operations:

  • Aggregation: a process in which we compute a summary statistic (or statistics) about each group. For example, computing group sums or means.

  • Transformation: a process in which we perform some group-specific computations and return a like-indexed object. For example, filling NAs within groups with a value derived from each group.

  • Filtration: a process in which we discard some groups, according to a group-wise computation that evaluates True or False. For example, filtering out data based on the group sum or mean.

Aggregation:

Aggregation is a process in which we compute a summary statistic about each group. An aggregation function returns a single aggregated value for each group. After splitting the data into groups using the groupby function, several aggregation operations can be performed on the grouped data.
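
For example, a couple of simple aggregations over the product lines:

# Sum of 'Total' per product line
print(df.groupby('Product line')['Total'].sum())

# Mean of 'Quantity' and 'Total' per product line
print(df.groupby('Product line')[['Quantity', 'Total']].mean())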

Applying multiple functions at once:

We can apply multiple functions at once by passing a list or dictionary of functions to aggregate with, outputting a DataFrame.
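
A sketch of both forms, a list of functions and a dictionary mapping columns to functions:

# A list of functions applied to one column
print(df.groupby('Product line')['Total'].agg(['sum', 'mean', 'max']))

# A dictionary applying different functions to different columns
print(df.groupby('Product line').agg({'Quantity': 'sum', 'Total': 'mean'}))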

Transformation:

Transformation is a process in which we perform some group-specific computations and return a like-indexed object. The transform method returns an object that is indexed the same (same size) as the one being grouped. The transform function must:

  • Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk

  • Operate column-by-column on the group chunk

  • Not perform in-place operations on the group chunk.

Now we perform some group-specific computations and return a like-indexed result, as in the sketch below.
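
A small sketch, computing each sale's share of its product line's total; the new column name 'share_of_line' is only illustrative:

# transform() returns a result aligned with the original rows
df['share_of_line'] = df.groupby('Product line')['Total'].transform(lambda x: x / x.sum())
print(df[['Product line', 'Total', 'share_of_line']].head())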

Filtration:

Filtration is a process in which we discard some groups, according to a group-wise computation that evaluates True or False. In order to filter groups, we use the filter method and apply some condition by which we keep or discard each group.
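
A sketch using filter(); the threshold of 320 is an arbitrary value chosen for illustration:

# Keep only the product lines whose mean 'Total' exceeds a threshold
high_value = df.groupby('Product line').filter(lambda g: g['Total'].mean() > 320)
print(high_value['Product line'].unique())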

Pivot Table

A pivot table in Pandas is an excellent tool to summarize one or more numeric variables based on two other categorical variables. In Python, pivot tables of Pandas DataFrames can be created using the command pandas.pivot_table(). The function is quite similar to the groupby function also available in Pandas, but offers significantly more customization.

Creating a Pivot Table in Pandas

To get started with creating a pivot table in Pandas, let’s build a very simple pivot table to start things off. We’ll begin by aggregating the 'Total' of each sale by the 'Product line' the sale took place in:
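
Something along these lines, using the default aggregation:

# Mean of 'Total' for each product line (aggfunc defaults to 'mean')
print(pd.pivot_table(df, index='Product line', values='Total'))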

As the default value of the aggfunc parameter is 'mean', this gives us the mean of the 'Total' field for each 'Product line'.

If we wanted to change the type of function used, we could use the aggfunc parameter. If we wanted to return the sum of 'Total' for each product line, we could write:
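
For example:

# Sum of 'Total' for each product line
print(pd.pivot_table(df, index='Product line', values='Total', aggfunc='sum'))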

Creating a Multi-Index Pivot Table

Single-index pivot tables are great for generating high-level overviews. However, we can also add additional indices to a pivot table to create further groupings. Say we wanted to calculate the sum per "Customer type" within each "Product line"; we could write the following:
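
A sketch with two index columns:

# Sum of 'Total' per Customer type within each Product line
print(pd.pivot_table(df, index=['Product line', 'Customer type'],
                     values='Total', aggfunc='sum'))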

We could also apply multiple functions to our pivot table. Say we wanted the same pivot table to show not only the total sum but also the mean of "Total"; we could write:
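
For example:

# Both the sum and the mean of 'Total' per group
print(pd.pivot_table(df, index=['Product line', 'Customer type'],
                     values='Total', aggfunc=['sum', 'mean']))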

Adding Columns to a Pandas Pivot Table

Adding columns to a pivot table in Pandas adds another dimension to the table. Based on the description we provided in the earlier section, the columns parameter allows us to add another key to aggregate by. For example, if we wanted to see the total of sales by "Product line" and by "City", we could write:
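
A sketch along these lines:

# 'Product line' as rows and 'City' as columns, summing 'Total'
print(pd.pivot_table(df, index='Product line', columns='City',
                     values='Total', aggfunc='sum'))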

Handling Missing Data

No data set is perfect! Let’s see how we can handle missing data in Python pivot tables. When there is missing data in a pivot table, the table displays "NaN" in the missing places. Our reference dataset has no missing values, so it is hard to demonstrate this here, but if we wanted to fill in any missing values, we could use the fill_value parameter and write the following:
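
A sketch using fill_value (0 is chosen here as the placeholder value):

# Replace any empty cells in the pivot table with 0
print(pd.pivot_table(df, index='Product line', columns='City',
                     values='Total', aggfunc='sum', fill_value=0))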

Adding Totals for Rows and Columns to Pandas Pivot Tables

Let’s explore how to add totals to both rows and columns in our Python pivot table. We do this with the margins and margins_name parameters. If we wanted to add this to the pivot table we created above, we would write the following:
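
A sketch; the label 'Grand Total' is an illustrative choice for margins_name:

# margins=True adds row/column totals; margins_name sets their label
print(pd.pivot_table(df, index='Product line', columns='City',
                     values='Total', aggfunc='sum',
                     margins=True, margins_name='Grand Total'))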

In this post, we explored how to summarize grouped data and generate pivot tables from a given DataFrame using Python and Pandas. The multitude of parameters available in the groupby and pivot_table functions allows for a lot of flexibility in how data is analyzed: we saw how to summarize and filter groups, how to build pivot tables with multiple indices and columns, and how to deal with missing values.

The reference dataset used in this post was downloaded from Kaggle.

