K-means Clustering


Let us get familiar with the term clustering. Clustering is an unsupervised learning technique: we draw inferences from input data whose output is unknown, i.e., from unlabelled data. Its goal is to find groups (clusters) of similar data points in a data set.

In clustering, the data points are divided into homogeneous groups such that the data points within each cluster are similar to each other and dissimilar to the points in other clusters.

We use a clustering algorithm to partition data into groups. There are several families of clustering algorithms, such as centroid-based, density-based, distribution-based, and hierarchical clustering. In this blog, I will walk you through a centroid-based method, namely k-means.

K-means

K-means is one of the most commonly used clustering algorithms. It partitions ‘n’ observations into ‘k’ clusters.

There are 3 steps:

Initialization – K initial “means” (centroids) are generated at random
Assignment – K clusters are created by associating each observation with the nearest centroid (the number of clusters, K, must be chosen in advance)
Update – The centroid of each cluster is recomputed as the mean of the observations assigned to it

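Assignment and Update are repeated until convergence, that is, until the cluster assignments no longer change. The end result is that the sum of squared errors between points and their respective centroids is minimized, although only to a local minimum that depends on the initial centroids.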
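
The three steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (squared_distance, kmeans), the max_iter and seed parameters, and the choice to pick K observations as the starting centroids (one common way of generating random initial means) are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means sketch; returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    # Initialization: pick k distinct observations as starting centroids
    # (an assumed initialization scheme for this sketch).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Convergence: stop once no observation changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update: each centroid becomes the mean of its assigned observations.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Sum of squared errors between points and their respective centroids.
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels, sse = kmeans(points, k=2)
print(labels, round(sse, 3))
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random initialization.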

How to determine K

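The number of clusters is commonly chosen with the elbow method: run k-means for a range of K values, plot the sum of squared errors against K, and pick the K at the “elbow”, where adding more clusters stops reducing the error by much.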
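
As a rough sketch of the elbow method, the snippet below runs a basic k-means for several values of K and prints the sum of squared errors for each. The helper sse_for_k, the toy data, the range of K values, and the use of 10 random restarts per K are assumptions for illustration only; in practice you would usually plot the curve and inspect it by eye.

```python
import random


def sse_for_k(points, k, seed=0, max_iter=100):
    """Run a basic k-means and return the final sum of squared errors."""

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Random initialization: k distinct observations as starting centroids.
    centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(dist2(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical toy data: three well-separated groups in 2-D.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

# For each candidate K, keep the best SSE over a few random restarts.
curve = {k: min(sse_for_k(points, k, seed=s) for s in range(10)) for k in range(1, 7)}
for k, sse in curve.items():
    print(k, round(sse, 2))
```

The K after which the printed error stops dropping sharply is the candidate "elbow". The restarts are there because a single run can get stuck in a poor local minimum and distort the curve.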

Advantages of K-means

Easy to implement and fast to run
Scalable and efficient on large data sets

Disadvantages of K-means

The initial centroids are selected at random, so different runs of the algorithm might give different clustering results. Thus, results lack consistency.
Selecting the optimal number of clusters is difficult.
Sensitive to noisy data and outliers
All items are forced into clusters
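
To see why outliers are a problem, recall that the Update step sets each centroid to the mean of its cluster, and a mean is easily pulled by extreme values. The small sketch below uses a made-up cluster and a made-up outlier (both hypothetical, as is the helper name centroid) to show how far a single noisy point can drag a centroid.

```python
def centroid(points):
    """Mean of a list of 2-D points, as used in the k-means Update step."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (50.0, 50.0)  # hypothetical noisy observation

print(centroid(cluster))              # (1.5, 1.5)
print(centroid(cluster + [outlier]))  # (11.2, 11.2)
```

Because every item is forced into some cluster, the outlier cannot simply be left out; it shifts the centroid away from the points that actually form the group.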

Applications of K-means Clustering

Hazard Mapping- of earthquake-prone areas, as per vulnerability
Town/City Planning- for classifying land as per use, such as residential, commercial, or university area
Marketing- in areas like customer segmentation and product penetration
Taxonomy- classification of species, flora and fauna
Document Clustering
Image Segmentation and Image Compression

https://youtu.be/7DJ7mfLlel0

Visit the YouTube channel for more details.
