top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Writer's pictureTEMFACK DERICK

Make calculations in Two Way ANOVA test with Pandas



In the area of hypothesis testing, we have one parametric test called ANOVA (Analysis Of Variance) which have three variants depends on data:

  • One Way ANOVA

  • Two Way ANOVA

  • Two Way ANOVA with replication

Each of these tests is used in dedicated condition. His assumption are:

  • Normally distributed data

  • Equality of variance between data

To perform it well, we generally have five step to follow:

  • Step 1: Hypothesis formulation

  • Step 2: Choice of probability law

  • Step 3: Compute observation values or reference values

  • Step 4: Determine the critical values

  • Step 5: Make conclusion

In this post, we'll use Pandas to compute ANOVA Two Way parameter in step 3. We will make a demonstration on the following data representing the yields of three varieties of maize using four different kinds of fertilizers. We want to test whether the variation in yields is caused by the different varieties of maize, different kinds of fertilizers or differences in both.

Variety_1

Variety_2

Variety_3

Type_1

64

72

74

Type_2

55

57

47

Type_3

59

66

58

Type_4

58

57

53


Before going further, let's remember the formula:

Now, we can start write our python code to solve our problem.


1. Correlation factor

Before calculating the correlation factor, let's compute first the sum of the column and the sum of the row, and finally the total of all our data.


NB: The following manipulation supposes we already load our data and put it in a variable called "data".

​Variety_1

​Variety_2

​Variety_3

Ti.

Type_1

64

72

74

210

Type_2

55

57

47

159

Type_3

59

66

58

183

Type_4

58

57

53

168

T.j

236

252

232

720

1.1.Sum of column


  variety_sum = data.sum()

Output:

variety_1    236 
variety_2    252 
variety_3    232 
dtype: int64

The method sum is used to return the sum of pandas Series/DataFrame over the y-axis.


1.2.Sum of row

type_sum = data.sum(axis=True)

Output:

Type_1    210
Type_2    159
Type_3    183
Type_4    168
dtype: int64

The method sum(axis=True) in this case return the sum of pandas series/dataframe over the x-axis


With the type_sum and variety_sum, we can now compute the correlation factor:


1.3. Sum of all data

As we have the sum of rows and sum of the column, it's now easy for us to calculate the total of data.

type_sum.sum() or variety_sum.sum() 

Output:

720

As type_sum and variety_sum are vectors, call pandas sum function on their return a single value represents the summation of the element.


1.4. Number of column and rows of data

we need to store the number of rows and columns of our data to use them on our following computation. These values will be extract from pandas shape function.


 #Number of rows of data
 nbre_row = data.shape[0]
 #Number of column of data
 nbre_column = data.shape[1]

1.5.Correlation factor

To calculate it, we just need to apply the formula.

correlation_factor = type_sum.sum()**2/(nbre_column*nbre_row)


2.Total sum of square

  sst = (data**2).sum().sum() - correlation_factor

The expression data**2 is used to put each value in data at square, (data**2).sum() calculate the sum of all values over y-axis (the column) and (data**2).sum().sum() return the total of summation of all data.


3. Complete code

def compute_anova_parameter(data):
    # Compute the sum of all data in column
    variety_sum = data.sum()
    #compute the sum of all data in row
    type_sum = data.sum(axis=True)
    
    #NUmber of ligne of data
    nbre_row = data.shape[0]
    #Number of column of data
    nbre_column = data.shape[1]
    
    #Correlation Factor
    correlation_factor = type_sum.sum()**2/(nbre_column*nbre_row)
    
    # Total sum of square
    sst = (data**2).sum().sum() - correlation_factor
    
    # Total sum of square of row effect
    ssr = (type_sum**2).sum()/nbre_column - correlation_factor
    
    # Total sum of squares of column effect
    ssc = (variety_sum**2).sum()/nbre_row - correlation_factor
    
    # Sum square Error
    sse = sst-ssc-ssr
    
    # Mean Square Column
    msc = ssc/(nbre_column-1)
    
    # Mean square Row
    msr = ssr/(nbre_row-1)
    
    #Mean Square Error
    mse = sse/((nbre_column-1)*(nbre_row-1))
    
    # Calculation of Fisher parameter
    Fc = round(msc/mse,3)
    Fr = round(msr/mse,3)
    return {"Fc":Fc, "Fr":Fr}

4. Testing

We have our in excel format as follow:

we can load and use it.

import pandas as pd
data = pd.read_excel('fertlizer.xlsx', index_col=0)
print(compute_anova_parameter(data))

output : {'Fc': 1.556, 'Fr': 9.222}

You can add more data in the excel file as you want and the program will compute it. The final values will be used in step 5 to make a conclusion of hypothesis testing



0 comments

Kommentare


bottom of page