Rubash Mali

Basic Statistics Techniques Essential for Data Science

In this blog post, we will discuss some common statistical concepts and techniques frequently used by data scientists:


  • Population and Sample

  • Measure of Central Tendency

  • Measure of Dispersion

  • Covariance and Correlation

  • Normal Distribution

  • Central Limit Theorem

  • Quantile-Quantile (Q-Q) Plot

First, we import all the required libraries as follows:


import numpy as np
import matplotlib.pyplot as plt
from statistics import mode
from scipy.stats import norm
import scipy
import random

Population and Sample

A population is the entire group that you want to draw conclusions about, while a sample is an unbiased subset of the population that best represents the whole.

We work with samples because it is rarely feasible to collect data from the entire population.



We generate a population as follows and sample from it, with and without replacement.


population=np.arange(0,100,2)
print(population)
#Output
[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98]

#sampling randomly with replacement
sample=np.random.choice(population,10)
print(sample)
#Output
[34 52 24 34 56 70 32 16 32 12]
#sampling randomly without replacement
sample=np.random.choice(population,10,replace=False)
print(sample)
#Output
[44 28 70 48 74 68 82 66 54 38]


Notice that when we sample with replacement, the same values can appear more than once in the sample (here 34 and 32 are repeated).


Measure of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set. Popular measures of central tendency are the mean, median, and mode.



population=[1,2,2,2,5,6]
mean=np.mean(population)
median=np.median(population)
mode_value=mode(population)   # renamed so we don't shadow the imported mode function
print(mean,median,mode_value)
#Output
3.0 2.0 2 


Measure of Dispersion

A measure of central tendency alone is inadequate to describe the data.

Measures of dispersion help to interpret the variability of the data. The most commonly used measures of dispersion are the standard deviation, the range, and the interquartile range (IQR).







We use NumPy functions (std and percentile) to calculate the standard deviation and the interquartile range. For the range, we define a small function that finds the maximum and minimum values and returns their difference.


standard_deviation=np.std(population)
print(standard_deviation)
#Output
1.8257418583505538

q1, q3 = np.percentile(population, [25,75])
iqr = q3 - q1
print(iqr)
#Output
2.25

def find_range(value_list):
    maximum = max(value_list)
    minimum = min(value_list)
    return maximum - minimum
print(find_range(population))
#Output
5
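As a quick cross-check, NumPy and SciPy also provide these measures as built-ins: np.ptp ("peak to peak") returns the range, and scipy.stats.iqr returns the interquartile range. A minimal sketch:

from scipy.stats import iqr

#np.ptp returns max - min, i.e. the range
print(np.ptp(population))
#Output
5
#scipy.stats.iqr uses the same linear interpolation as np.percentile
print(iqr(population))
#Output
2.25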

Covariance and Correlation

Covariance is a measure of the relationship between two random variables. It measures to what extent the variables change together.




Positive covariance indicates that the two variables tend to move in the same direction, while negative covariance indicates that they move in opposite directions.

A major drawback of covariance is that it depends on the scale of the variables. To overcome this, Karl Pearson's correlation coefficient is used: we make the measure scale-invariant by dividing the covariance by the product of the two standard deviations.
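In symbols, with σ denoting the standard deviation:

corr(X, Y) = cov(X, Y) / (σ_X · σ_Y)

The resulting coefficient always lies between -1 and 1.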





A code snippet to calculate the covariance and correlation between two variables is as follows:



def find_cov_corr(x,y):
   mean_x = sum(x) / len(x)
   mean_y = sum(y) / len(y)
   cov_sum = 0
   for i in range(len(x)):
     cov_sum += (x[i]-mean_x) * (y[i]-mean_y)
   covariance = cov_sum/len(x)   # population covariance (divide by n)
   correl = covariance/(np.std(x)*np.std(y))
   print('correlation is',correl)
   print('covariance is',covariance)
x_coordinates = [1, 2, 3, 4, 5] 
y_coordinates = [2, 2, 3, 4, 5] 
find_cov_corr(x_coordinates,y_coordinates)
#Output
correlation is 0.9701
covariance is 1.6 
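We can cross-check these values against NumPy's built-ins. Note that np.cov defaults to the sample (n-1) covariance, so bias=True is needed to match the population formula used above:

#bias=True divides by n rather than n-1
print(np.cov(x_coordinates, y_coordinates, bias=True)[0][1])   # ≈ 1.6
print(np.corrcoef(x_coordinates, y_coordinates)[0][1])         # ≈ 0.9701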

Normal Distribution

It is also known as the Gaussian distribution. It is a continuous probability distribution that is symmetric about the mean; graphically, it has a bell shape. It is the most common type of distribution and is characterized by two parameters: the mean and the standard deviation.
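Its probability density function is

f(x) = (1 / (σ√(2π))) · exp(-(x − μ)² / (2σ²))

where μ is the mean and σ is the standard deviation.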




We create a sample of length 1000 that follows a normal distribution (mean = 40, standard deviation = 15) using NumPy as follows:



mu, sigma = 40, 15
s = np.random.normal(mu, sigma, 1000)
plt.hist(s)
plt.show()



Suppose the above is the weight distribution (in kg) of the people of a town, with mean 40 and standard deviation 15. We can then answer analytical questions, such as the probability that a person in the town weighs less than 30 kg, using the cdf function.


# Probability of person being < 30 kg
prob_less_30 = norm.cdf(30,mu,sigma)
print(prob_less_30)
#Output
0.2524925

Probability of a person weighing more than 50 kg (by symmetry about the mean of 40, this equals the probability of weighing less than 30 kg):

#Probability of person being > 50 kg
prob_over_50 = 1-norm.cdf(50,mu,sigma)
print(prob_over_50)
#Output
0.2524925

Probability of a person weighing between 20 and 40 kg:

#Probability of person being between 20 and 40
prob_20_to_40 = norm.cdf(40,mu,sigma) - norm.cdf(20,mu,sigma)
print(prob_20_to_40)
#Output
0.4087888
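As a sanity check, these analytical answers can be compared against the simulated sample s drawn earlier; with 1000 draws, the empirical fractions should come out close to (though not exactly equal to) the CDF values:

#empirical estimates from the simulated sample
print(np.mean(s < 30))               # ≈ 0.25
print(np.mean(s > 50))               # ≈ 0.25
print(np.mean((s > 20) & (s < 40)))  # ≈ 0.41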

Central Limit Theorem

The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population's distribution.

Here, let us generate uniformly distributed random data as follows:


import matplotlib.pyplot as plt
np.random.seed(1)
data=np.random.rand(1000)
plt.hist(data)
plt.show()



Now, as per the CLT, we repeatedly draw samples of 150 points from this data and record their means. We do this for an increasing number of samples; as that number grows, the histogram of the sample means looks more and more like a normal distribution.


sample_size = [10, 50, 100, 200, 400, 600]
means=[]
for i in range(len(sample_size)):
    dyy=[]
    for j in range(sample_size[i]):
        # draw a sample of 150 points (with replacement) and store its mean
        temp=random.choices(data, k=150)
        dyy.append(np.mean(temp))
    means.append(dyy)



#plotting the histograms
ncols = 3
nrows = 2
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 10))
k=0
for i in range(nrows):
    for j in range(ncols):
        ax = axes[i][j]
        ax.hist(means[k])
        ax.set_title(label='Number of samples='+str(sample_size[k]))
        k=k+1
plt.show()



Q-Q plot

The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second data set.

If the points follow a 45-degree straight line, we conclude that the two data sets have the same distribution; otherwise, they do not.

First, we create a uniformly distributed data set and draw a q-q plot against a normal distribution, as sketched below.
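Here is a minimal sketch using scipy.stats.probplot (one common way to draw such a plot), which compares the sample's quantiles against those of a theoretical normal distribution:

#q-q plot of uniform data against a normal distribution:
#the points bend away from the straight reference line
uniform_data = np.random.rand(1000)
scipy.stats.probplot(uniform_data, dist="norm", plot=plt)
plt.show()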





Then we create a normally distributed data set and draw a q-q plot against a normal distribution in the same way, as sketched below.
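Again a minimal sketch with probplot; this time the points should hug the reference line:

#q-q plot of normal data against a normal distribution:
#the points fall close to the straight reference line
normal_data = np.random.normal(0, 1, 1000)
scipy.stats.probplot(normal_data, dist="norm", plot=plt)
plt.show()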




Comparing the two plots, the uniform data deviates from the straight line while the normal data follows it closely; this verifies that the q-q plot works as described above.


Conclusion

In this post, we discussed some popular statistical concepts essential for data science, illustrated with short code snippets for each.




