By Abu Bin Fahd

Machine Learning Preprocessing: Scaling


Standardization is a preprocessing method used to transform continuous data so that it looks normally distributed. It is often a necessary step, because many models assume that the data they are trained on is normally distributed; if it isn't, you risk biasing your model. Two common techniques:

  • Log Normalization

  • Scaling


When to use standardization?

  • Standardization is useful when your data has varying scales and the algorithm you are using makes assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis.

  • The dataset's features are continuous and on different scales.

  • Suppose a dataset contains height and weight features. To compare these features, they must be on the same scale, as sketched below.
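As a minimal sketch of that idea, using a hypothetical toy array of heights in centimeters and weights in kilograms (not part of the wine dataset), StandardScaler puts both features on a comparable scale:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: height (cm) and weight (kg) live on very different scales
data = np.array([[170.0, 65.0],
                 [180.0, 85.0],
                 [160.0, 55.0],
                 [175.0, 75.0]])

# After scaling, each column has mean 0 and unit variance
scaled = StandardScaler().fit_transform(data)
print(scaled.mean(axis=0))  # approximately [0. 0.]
print(scaled.std(axis=0))   # approximately [1. 1.]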

# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the dataset
df = pd.read_csv("/content/wine_types.csv")
df.head()


# Inspect the dataset's structure and summary statistics
df.info()
df.describe()

Proline has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization comes in handy.
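One quick way to confirm this, sketched here under the assumption that every feature column in the file is numeric, is to compare the per-column variances directly:

# Per-column variance, sorted so the high-variance outlier stands out
print(df.var(numeric_only=True).sort_values())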


# Create the feature matrix and target vector
X = df.drop("Type", axis=1)
y = df["Type"]


Unscaled KNeighborsClassifier

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Model
knn = KNeighborsClassifier()

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.71


Log Normalization

Log normalization is a method for standardizing your data that is useful when you have a particular column with high variance. As you saw in the previous section, training a k-nearest neighbors classifier on the unscaled wine dataset didn't yield a very high accuracy score.



# Print out the variance of the Proline column
print(df['Proline'].var())

# Apply the log normalization function to the Proline column
df['Proline_log'] = np.log(df['Proline'])

# Check the variance of the normalized Proline column
print(df['Proline_log'].var())


99166.71735542436
0.17231366191842012

Look at the change in variance!
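A hedged aside: np.log is undefined for zero or negative values, so if a column can contain zeros, np.log1p (which computes log(1 + x)) is a common substitute. The column name below is purely illustrative:

# np.log1p computes log(1 + x) and tolerates zeros,
# unlike np.log, which returns -inf at 0
df['Proline_log1p'] = np.log1p(df['Proline'])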


Scaling

What is feature scaling? Scaling is a standardization method that is most useful when you're working with a dataset containing continuous features on different scales. Standard scaling transforms the data so that each feature has mean 0 and variance 1, giving an approximately normal distribution. Here we demonstrate only StandardScaler.
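Under the hood, standard scaling subtracts each feature's mean and divides by its standard deviation. A minimal sketch of the equivalent manual computation, assuming a hypothetical one-dimensional NumPy feature column x:

# Hypothetical feature column
x = np.array([10.0, 20.0, 30.0, 40.0])

# Standard scaling by hand: z = (x - mean) / standard deviation
z = (x - x.mean()) / x.std()
print(z.mean(), z.var())  # approximately 0 and 1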


Scaled KNeighborsClassifier


# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create the scaling method.
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling.
# In scikit-learn, running fit_transform during preprocessing both fits
# the method to the data and transforms the data in a single step.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)

# Score the model on the test data.
print(knn.score(X_test, y_test))

0.9555555555555556

Unscaled accuracy is 0.71

Scaled accuracy is 0.95

A huge difference!!!
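One caveat worth adding as a best-practice note (not part of the original walkthrough): fitting the scaler on the full dataset before splitting leaks information about the test set into training. A safer pattern is to split first, fit the scaler on the training split only, and then transform both splits with the same fit:

# Split first, then fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y)

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # fit and transform the training set
X_test_scaled = ss.transform(X_test)        # transform the test set with the same fit

knn.fit(X_train_scaled, y_train)
print(knn.score(X_test_scaled, y_test))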


Dataset: Wine Types

