Make calculations in Two Way ANOVA test with Pandas

In the area of hypothesis testing, we have one parametric test called ANOVA (Analysis Of Variance) which have three variants depends on data:

One Way ANOVA
Two Way ANOVA
Two Way ANOVA with replication

Each of these tests is used in dedicated condition. His assumption are:

Normally distributed data
Equality of variance between data

To perform it well, we generally have five step to follow:

Step 1: Hypothesis formulation
Step 2: Choice of probability law
Step 3: Compute observation values or reference values
Step 4: Determine the critical values
Step 5: Make conclusion

In this post, we'll use Pandas to compute ANOVA Two Way parameter in step 3. We will make a demonstration on the following data representing the yields of three varieties of maize using four different kinds of fertilizers. We want to test whether the variation in yields is caused by the different varieties of maize, different kinds of fertilizers or differences in both.

	Variety_1	Variety_2	Variety_3
Type_1	64	72	74
Type_2	55	57	47
Type_3	59	66	58
Type_4	58	57	53

Before going further, let's remember the formula:

Now, we can start write our python code to solve our problem.

1. Correlation factor

Before calculating the correlation factor, let's compute first the sum of the column and the sum of the row, and finally the total of all our data.

NB: The following manipulation supposes we already load our data and put it in a variable called "data".

	Variety_1	Variety_2	Variety_3	Ti.
Type_1	64	72	74	210
Type_2	55	57	47	159
Type_3	59	66	58	183
Type_4	58	57	53	168
T.j	236	252	232	720

1.1.Sum of column


  variety_sum = data.sum()

Output:

variety_1    236 
variety_2    252 
variety_3    232 
dtype: int64

The method sum is used to return the sum of pandas Series/DataFrame over the y-axis.

1.2.Sum of row

type_sum = data.sum(axis=True)

Output:

Type_1    210
Type_2    159
Type_3    183
Type_4    168
dtype: int64

The method sum(axis=True) in this case return the sum of pandas series/dataframe over the x-axis

With the type_sum and variety_sum, we can now compute the correlation factor:

1.3. Sum of all data

As we have the sum of rows and sum of the column, it's now easy for us to calculate the total of data.

type_sum.sum() or variety_sum.sum()

Output:

As type_sum and variety_sum are vectors, call pandas sum function on their return a single value represents the summation of the element.

1.4. Number of column and rows of data

we need to store the number of rows and columns of our data to use them on our following computation. These values will be extract from pandas shape function.

 #Number of rows of data
 nbre_row = data.shape[0]
 #Number of column of data
 nbre_column = data.shape[1]

1.5.Correlation factor

To calculate it, we just need to apply the formula.

correlation_factor = type_sum.sum()**2/(nbre_column*nbre_row)

2.Total sum of square

  sst = (data**2).sum().sum() - correlation_factor

The expression data**2 is used to put each value in data at square, (data**2).sum() calculate the sum of all values over y-axis (the column) and (data**2).sum().sum() return the total of summation of all data.

3. Complete code

def compute_anova_parameter(data):
    # Compute the sum of all data in column
    variety_sum = data.sum()
    #compute the sum of all data in row
    type_sum = data.sum(axis=True)
    
    #NUmber of ligne of data
    nbre_row = data.shape[0]
    #Number of column of data
    nbre_column = data.shape[1]
    
    #Correlation Factor
    correlation_factor = type_sum.sum()**2/(nbre_column*nbre_row)
    
    # Total sum of square
    sst = (data**2).sum().sum() - correlation_factor
    
    # Total sum of square of row effect
    ssr = (type_sum**2).sum()/nbre_column - correlation_factor
    
    # Total sum of squares of column effect
    ssc = (variety_sum**2).sum()/nbre_row - correlation_factor
    
    # Sum square Error
    sse = sst-ssc-ssr
    
    # Mean Square Column
    msc = ssc/(nbre_column-1)
    
    # Mean square Row
    msr = ssr/(nbre_row-1)
    
    #Mean Square Error
    mse = sse/((nbre_column-1)*(nbre_row-1))
    
    # Calculation of Fisher parameter
    Fc = round(msc/mse,3)
    Fr = round(msr/mse,3)
    return {"Fc":Fc, "Fr":Fr}

4. Testing

We have our in excel format as follow:

we can load and use it.

import pandas as pd
data = pd.read_excel('fertlizer.xlsx', index_col=0)
print(compute_anova_parameter(data))

output : {'Fc': 1.556, 'Fr': 9.222}

You can add more data in the excel file as you want and the program will compute it. The final values will be used in step 5 to make a conclusion of hypothesis testing

Source code

Download dataset

datainsightonline.com

Data Scientist Program

Free Online Data Science Training for Complete Beginners.

No prior coding knowledge required!

Make calculations in Two Way ANOVA test with Pandas

1. Correlation factor

1.1.Sum of column

1.2.Sum of row

1.3. Sum of all data

1.4. Number of column and rows of data

1.5.Correlation factor

2.Total sum of square

3. Complete code

4. Testing

Recent Posts

Comments

40 Python Projects with Source Code for Beginners

How to Read Medium Premium Articles for Free

How to use Sqlite3 using Python

Data Visualization - which types of graphs should we use?

Best Online Courses for Data Science

9 Ways to Embed Code Snippets on your Data Science Blog Posts