
Unsupervised Learning: Basics of Clustering

Abu Bin Fahd

What is unsupervised learning?

  • A group of machine learning algorithms that find patterns in data

  • The data given to the algorithms has not been labeled, classified, or characterized

  • The objective of the algorithm is to interpret any structure in the data

  • Common unsupervised learning algorithms: clustering, neural networks, anomaly detection



What is clustering?

  • The process of grouping items with similar characteristics

  • Items in a group are more similar to each other than to items in other groups

  • Example: distance between points on a 2D plane
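To make the distance example concrete, here is a minimal sketch that measures the straight-line (Euclidean) distance between points on a 2D plane. The helper name `euclidean_distance` and the three sample points are chosen for illustration only.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two 2D points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Three illustrative points: a and b lie close together, c lies far away.
a = (80, 87)
b = (93, 96)
c = (9, 57)

near = euclidean_distance(a, b)  # about 15.8
far = euclidean_distance(a, c)   # about 77.1

# A distance-based clustering method would tend to group a with b rather than with c.
print(near < far)
```

Points separated by a small distance are treated as similar, which is the intuition the clustering algorithms later in the post build on.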


Plotting data for clustering - Pokemon sightings




from matplotlib import pyplot as plt

x_coordinates = [80, 93, 86, 98, 86, 9, 15, 3, 10, 20, 44,
                 56, 49, 62, 44]
y_coordinates = [87, 96, 95, 92, 92, 57, 49, 47, 59, 55, 25,
                 2, 10, 24, 10]
plt.scatter(x_coordinates, y_coordinates)
plt.show()


The scatter plot shows the sightings falling into three distinct groups.


Basics of cluster analysis

What is a cluster?

  • Group of items with similar characteristics

  • Google News: articles where similar words and word associations appear together

  • Customer Segments


Clustering algorithms

  • Hierarchical clustering

  • K-means clustering

  • Other clustering algorithms: DBSCAN, Gaussian Methods

Hierarchical clustering in SciPy


from scipy.cluster.hierarchy import linkage, fcluster
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 
    3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 
     47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
     
df = pd.DataFrame({'x_coordinate': x_coordinates, 
        'y_coordinate': y_coordinates})
        
Z = linkage(df, 'ward') # ward: based on the sum of squares
df['cluster_labels'] = fcluster(Z, 3, criterion='maxclust')

sns.scatterplot(x='x_coordinate', y='y_coordinate', hue='cluster_labels', data = df)
plt.show()
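If you want to check the result without a plot, the labels can be inspected directly. The sketch below reuses the same coordinates as a NumPy array (the names `points` and `sizes` are illustrative) and counts how many points fall into each cluster. It relies on `fcluster` numbering clusters from 1.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same 15 sightings as above, one (x, y) pair per row.
points = np.array([
    [80.1, 87.2], [93.1, 96.1], [86.6, 95.6], [98.5, 92.4], [86.4, 92.4],
    [9.5, 57.7], [15.2, 49.4], [3.4, 47.3], [10.4, 59.1], [20.3, 55.5],
    [44.2, 25.6], [56.8, 2.1], [49.2, 10.9], [62.5, 24.1], [44.0, 10.3],
])

# The linkage matrix has one row per merge, so n - 1 rows for n points.
Z = linkage(points, method='ward')

# Cut the hierarchy into at most 3 flat clusters; labels start at 1.
labels = fcluster(Z, 3, criterion='maxclust')

# Count the members of each cluster (index 0 is unused, so drop it).
sizes = np.bincount(labels)[1:]
print(labels)
print(sizes)
```

For this data each of the three clusters should contain five points, matching the three groups visible in the scatter plot.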


K-means clustering in SciPy

from scipy.cluster.vq import kmeans, vq
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd
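from numpy import random
# kmeans draws its starting centroids from NumPy's random state by default,
# so seed NumPy here (seeding Python's built-in random module has no effect)
random.seed((1000, 2000))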

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 
    3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 
    47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
    
df = pd.DataFrame({'x_coordinate': x_coordinates, 
      'y_coordinate': y_coordinates})
      
centroids,_ = kmeans(df, 3)
df['cluster_labels'], _ = vq(df, centroids) # vq assigns each point to its nearest centroid

sns.scatterplot(x='x_coordinate', y='y_coordinate', 
      hue='cluster_labels', data = df)
plt.show()
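The code above discards the second return value of both `kmeans` and `vq`. As a rough sketch of what those values hold, the example below keeps them and checks by hand that `vq` assigns each point to its nearest centroid. The names `points`, `pairwise`, and `nearest`, and the seed value, are illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Seed NumPy's global random state for repeatability (the value is arbitrary).
np.random.seed(0)

points = np.array([
    [80.1, 87.2], [93.1, 96.1], [86.6, 95.6], [98.5, 92.4], [86.4, 92.4],
    [9.5, 57.7], [15.2, 49.4], [3.4, 47.3], [10.4, 59.1], [20.3, 55.5],
    [44.2, 25.6], [56.8, 2.1], [49.2, 10.9], [62.5, 24.1], [44.0, 10.3],
])

# kmeans returns the centroids and the distortion, i.e. the mean distance
# between the observations and their closest centroid.
centroids, distortion = kmeans(points, 3)

# vq returns a 0-based centroid index per point and the distance to that centroid.
labels, distances = vq(points, centroids)

# Manual check: distance from every point to every centroid, then pick the closest.
pairwise = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
nearest = pairwise.argmin(axis=1)

print(np.array_equal(labels, nearest))
print(distortion)
```

A lower distortion means the points sit closer to their centroids, which makes it a handy number to compare across different choices of k.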


Data preparation for cluster analysis

Why do we need to prepare data for clustering?

  • Variables have incomparable units (product dimensions in cm, price in $)

  • Variables with the same units can have vastly different scales and variances (expenditures on cereals vs. travel)

  • Data in raw form may lead to bias in clustering

  • Clusters may be heavily dependent on one variable

  • Solution: normalization of individual variables
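The points above can be illustrated with a small sketch. The product data here is made up for the example: the first column is a width in cm and the second a price in $. Before scaling, the price column has a far larger spread and would dominate any distance calculation. After `whiten`, both columns have a standard deviation of 1.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical products: column 0 is width in cm, column 1 is price in $.
products = np.array([
    [10.0, 1000.0],
    [12.0, 5000.0],
    [50.0, 1200.0],
    [55.0, 5200.0],
])

# Spread of each variable before scaling: price is orders of magnitude larger.
raw_std = products.std(axis=0)

# whiten divides each column by its own standard deviation.
scaled = whiten(products)
scaled_std = scaled.std(axis=0)

print(raw_std)
print(scaled_std)
```

With both variables on a comparable scale, neither one decides the clusters on its own.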

Normalization of data

Normalization: the process of rescaling data to a standard deviation of 1
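As a quick check of this definition, the sketch below divides each value by the standard deviation of the data and compares the result with `whiten`. It assumes the population standard deviation (NumPy's default), and the name `manual` is illustrative.

```python
import numpy as np
from scipy.cluster.vq import whiten

data = np.array([5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5], dtype=float)

# Rescale by hand: divide every value by the standard deviation of the data.
manual = data / np.std(data)

# whiten performs the same rescaling.
scaled_data = whiten(data)

print(np.allclose(manual, scaled_data))
print(np.std(scaled_data))
```

Note that the data is only rescaled, not centered, so the mean is not moved to zero.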

from scipy.cluster.vq import whiten
data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]
scaled_data = whiten(data)
print(scaled_data)
# Output:
# [2.72733941 0.54546788 1.63640365 1.63640365 1.09093577 1.63640365
#  1.63640365 4.36374306 0.54546788 1.09093577 1.09093577 1.63640365
#  2.72733941]

# Import plotting library
from matplotlib import pyplot as plt

# Initialize original, scaled data
plt.plot(data, label="original")
plt.plot(scaled_data, label="scaled")

# Show legend and display plot
plt.legend()
plt.show()

