Pandas techniques - part 3 - Merging Data Frames

Due to the development of the relational database management system (RDMS), combining and merging data has become an essential programming work in data analysis.

What is the relational database management system?

It is a common type of database that stores data in separate tables, and it can be used in relation to other stored datasets. For example, suppose a database for a company may contain information about the employees, salaries, departments, dates of each employee, the rank of each employee, and so on. This table will have too many columns and will have difficult to access, slow speed, low accuracy. Here RDMS comes into play. It separates into tables with columns of the same types. For example, employee table, department table, salary table. Each table has a primary key and may or may not have foreign keys. The primary key is a specific column for that table and it contains a unique value for each row and represents that table. A foreign key is a primary key from another table. For example, employee ID is the primary key for the employee table and department ID is the primary key for the department table but department ID is a foreign key for the employee table. We can use those keys to connect between tables. I am just talking about the concepts of RDMS and I will not focus on those keys. I will show the idea of how we can merge tables in python with Pandas.

Merging DataFrames

The main idea of merging tables is how to merge.

Suppose we want to combine two tables, the key columns from both tables serve as the pivot column. Depends on the values from these columns, the two tables will merge. There are four types of joins in general.

Join/Inner join - only the same values from each key columns

Full/Outer join - all the values from each key column

Left join - all the values from the left table

Right join - all the values from the right table

Let's see in action in Pandas.

Inner Join

df1 = pd.DataFrame({'key':['a','a','b','b','f','f'],
                   'data':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'key':['a','b','c','d','e'],
                   'data':[10,20,30,40,50]})
df1

Out[2]: 
  key  data
0   a     1
1   a     2
2   b     3
3   b     4
4   f     5
5   f     6

In [3]: df2
Out[3]: 
  key  data
0   a    10
1   b    20
2   c    30
3   d    40
4   e    50

We can use the merge method to combine, the default is the inner join.

In [4]: pd.merge(df1, df2, on='key')
Out[4]: 
  key  data_x  data_y
0   a       1      10
1   a       2      10
2   b       3      20
3   b       4      20

Look at the key column and compare it with the tables before combining. It is combined with the key values that are contained in both tables.

Left join

In [6]: df1 = pd.DataFrame({'key':['a','a','b','b','f','f'],
   ...:                    'data':[1,2,3,4,5,6]})
   ...: df2 = pd.DataFrame({'key':['a','b','c','d','e'],
   ...:                    'data':[10,20,30,40,50]})
   ...: df1
Out[6]: 
  key  data
0   a     1
1   a     2
2   b     3
3   b     4
4   f     5
5   f     6
In [7]: df2
Out[7]: 
  key  data
0   a    10
1   b    20
2   c    30
3   d    40
4   e    50

For left join, we can pass the parameter, how with left.

In [8]: pd.merge(df1, df2, on='key', how='left')
Out[8]: 
  key  data_x  data_y
0   a       1    10.0
1   a       2    10.0
2   b       3    20.0
3   b       4    20.0
4   f       5     NaN
5   f       6     NaN

It focuses on the key value of table 1 and so, the values from table 2 become missing values as table 2 does not contain key values of 'f'.

The Right join is the same as the left join. It will get the same result if you pass the parameters df2 first and df1 second.

Outer Join

The outer join will combine all the key values from both tables.

In [9]: pd.merge(df1, df2, on='key', how='outer')
Out[9]: 
  key  data_x  data_y
0   a     1.0    10.0
1   a     2.0    10.0
2   b     3.0    20.0
3   b     4.0    20.0
4   f     5.0     NaN
5   f     6.0     NaN
6   c     NaN    30.0
7   d     NaN    40.0
8   e     NaN    50.0

Different key column names

If the two tables have different key column names, you can use the parameters left_on and right_on to merge.

In [10]: df1 = pd.DataFrame(
              {'key1':['a','a','b','b','f','f'],
               'data':[1,2,3,4,5,6]})
    ...: df2 = pd.DataFrame(
              {'key2':['a','b','c','d','e'],
    ...:       'data':[10,20,30,40,50]})
    ...: pd.merge(df1, df2, left_on='key1', 
                  right_on='key2')

Out[10]: 
  key1  data_x key2  data_y
0    a       1    a      10
1    a       2    a      10
2    b       3    b      20
3    b       4    b      20

More than one key column

If the two tables have more than one key column, you can pass the on parameter with a list of key column names.

In [11]: df1 = pd.DataFrame(
                      {'key1':['aaa','aaa','bbb'],
    ...:               'key2':['one','three','one'],
    ...:               'lval':[1,2,3]})
    ...: df2 = pd.DataFrame(
                   {'key1':['aaa','bbb','aaa','bbb'],
    ...:            'key2':['one','one','two','three'],
    ...:            'rval':[10,20,30,40]})
    ...: df1
Out[11]: 
  key1   key2  lval
0  aaa    one     1
1  aaa  three     2
2  bbb    one     3

In [12]: df2
Out[12]: 
  key1   key2  rval
0  aaa    one    10
1  bbb    one    20
2  aaa    two    30
3  bbb  three    40

In [13]: pd.merge(df1, df2, on=['key1','key2'], 
                                   how='outer')
Out[13]: 
  key1   key2  lval  rval
0  aaa    one   1.0  10.0
1  aaa  three   2.0   NaN
2  bbb    one   3.0  20.0
3  aaa    two   NaN  30.0
4  bbb  three   NaN  40.0

Using Suffixes

You can use suffixes for defining custom column names after merging the two tables. For example, the default name is x and y.

In [14]: pd.merge(df1, df2, on='key1')
Out[14]: 
  key1 key2_x  lval key2_y  rval
0  aaa    one     1    one    10
1  aaa    one     1    two    30
2  aaa  three     2    one    10
3  aaa  three     2    two    30
4  bbb    one     3    one    20
5  bbb    one     3  three    40

You can pass the suffixes parameter to define the column names.

In [15]: pd.merge(df1, df2, on='key1', suffixes= 
                              ('_left','_right'))
Out[15]: 
  key1 key2_left  lval key2_right  rval
0  aaa       one     1        one    10
1  aaa       one     1        two    30
2  aaa     three     2        one    10
3  aaa     three     2        two    30
4  bbb       one     3        one    20
5  bbb       one     3      three    40

Merging on Index

You can also merge on the index by passing right_index or left_index with boolean values.

In [16]: df1 = pd.DataFrame(
                   {'key':['a','b','c','a','b','f'],
    ...:            'value':range(6)})
    ...: df2 = pd.DataFrame({'value':[10,20]}, 
                              index=['a','b'])
    ...: df1
Out[16]: 
  key  value
0   a      0
1   b      1
2   c      2
3   a      3
4   b      4
5   f      5

In [17]: df2
Out[17]: 
   value
a     10
b     20

In [18]: pd.merge(df1, df2, left_on='key',right_index=True)
Out[18]: 
  key  value_x  value_y
0   a        0       10
3   a        3       10
1   b        1       20
4   b        4       20

Hierarchically-index data

We can also use Hierarchically-index data to join with the same methods.

In [19]: df1 = pd.DataFrame(
          {'key1':['A','A','B','B','D','D'],
    ...:  'key2':['aaa','bbb','aaa','bbb','ddd','eee'],
    ...:  'data':np.arange(6)})
    ...: df2 = pd.DataFrame(np.arange(12) 
                              .reshape((6,2)),
    ...:            index=[['A','A','B','B','B','B'],
    ...:        ['aaa','ccc','aaa','bbb','ccc','ddd']],
    ...:            columns=['val1','val2'])
    ...: df
Out[19]: 
  key1 key2  data
0    A  aaa     0
1    A  bbb     1
2    B  aaa     2
3    B  bbb     3
4    D  ddd     4
5    D  eee     5

In [20]: df2
Out[20]: 
       val1  val2
A aaa     0     1
  ccc     2     3
B aaa     4     5
  bbb     6     7
  ccc     8     9
  ddd    10    11
  
In [21]: pd.merge(df1, df2, left_on=['key1','key2'], 
                                 right_index=True)
Out[21]: 
  key1 key2  data  val1  val2
0    A  aaa     0     0     1
2    B  aaa     2     4     5
3    B  bbb     3     6     7

Concatenation tables

Not same as above merging methods, table concatenation is just to combine rows and columns without using key columns. We can use pd.concat method and passing the list of tables that want to be combined and specify the axis ( 0 for bind rows, and 1 for bind columns ).

In [22]: df1 = pd.DataFrame({'Name':['A','B','C','D'],
    ...:                    'Value':np.arange(4)})
    ...: df2 = pd.DataFrame({'Name':['E','F','G'],
    ...:                    'Value':np.arange(3)})
    ...: df1
Out[22]: 
  Name  Value
0    A      0
1    B      1
2    C      2
3    D      3

In [23]: df2
Out[23]: 
  Name  Value
0    E      0
1    F      1
2    G      2

In [24]: pd.concat([df1,df2],axis=0)
Out[24]: 
  Name  Value
0    A      0
1    B      1
2    C      2
3    D      3
0    E      0
1    F      1
2    G      2

In [25]: pd.concat([df1,df2],axis=1)
Out[25]: 
  Name  Value Name  Value
0    A      0    E    0.0
1    B      1    F    1.0
2    C      2    G    2.0
3    D      3  NaN    NaN

Combining data with Overlapping

Start with an example.

In [26]: df1 = pd.DataFrame({'a':[1,np.nan,3,4,np.nan],
    ...:                    'b':[6,7,np.nan,8,9],
    ...:                    'c':[10,11,12,np.nan,14]})
    ...: df2 = pd.DataFrame({'a':[10,20,30,np.nan,50],
    ...:            'b':[60,70,80,np.nan, np.nan],
    ...:            'c':[100,np.nan, 130,np.nan, 150]})
    ...: df1
Out[26]: 
     a    b     c
0  1.0  6.0  10.0
1  NaN  7.0  11.0
2  3.0  NaN  12.0
3  4.0  8.0   NaN
4  NaN  9.0  14.0

In [27]: df2
Out[27]: 
      a     b      c
0  10.0  60.0  100.0
1  20.0  70.0    NaN
2  30.0  80.0  130.0
3   NaN   NaN    NaN
4  50.0   NaN  150.0

Look at the NaN values. The idea is to get all the values from table 1 and if there are missing values in table 1, take that value from table 2. We can use the combine_first method.

In [28]: df1.combine_first(df2)
Out[28]: 
      a     b     c
0   1.0   6.0  10.0
1  20.0   7.0  11.0
2   3.0  80.0  12.0
3   4.0   8.0   NaN
4  50.0   9.0  14.0

To be clear you can use the where method from NumPy.

In [29]: np.where(pd.isna(df1),df2,df1)
Out[29]: 
array([[ 1.,  6., 10.],
       [20.,  7., 11.],
       [ 3., 80., 12.],
       [ 4.,  8., nan],
       [50.,  9., 14.]])

If there is a missing value in df1, then take from df2, and if not missing, take the value from df1.

This is an introduction to how to combine and merge data frames in Pandas. Only one important idea is how they merge and join; inner, outer, left, and right.

datainsightonline.com

Data Scientist Program

Free Online Data Science Training for Complete Beginners.

No prior coding knowledge required!

Pandas techniques - part 3 - Merging Data Frames

Merging DataFrames

Inner Join

Left join

Outer Join

Different key column names

More than one key column

Using Suffixes

Merging on Index

Hierarchically-index data

Concatenation tables

Combining data with Overlapping

Recent Posts

Comments

40 Python Projects with Source Code for Beginners

How to Read Medium Premium Articles for Free

How to use Sqlite3 using Python

Data Visualization - which types of graphs should we use?

Best Online Courses for Data Science

9 Ways to Embed Code Snippets on your Data Science Blog Posts