
K-means Clustering

Let us first get familiar with the term clustering. Clustering is an unsupervised learning method: we draw inferences from input data whose output (label) is unknown. It is a technique for finding groups of similar data points in a data set, and it is used when one is dealing with unlabelled data.

In clustering, the data points are divided into homogeneous groups such that the points within each cluster are similar to one another and dissimilar to the points in other clusters.

We use a clustering algorithm to group data into clusters. There are several families of clustering algorithms, such as centroid-based, density-based, distribution-based, and hierarchical clustering. In this blog, I will walk you through a centroid-based algorithm, namely k-means.

K-means

K-means is one of the most commonly used clustering algorithms. It partitions n observations into k clusters.

There are 3 steps:

  • Initialization – K initial “means” (centroids) are generated at random
  • Assignment – K clusters are created by associating each observation with the nearest centroid (the number of clusters, K, must be chosen in advance)
  • Update – The mean of the observations in each cluster becomes that cluster's new centroid

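Assignment and Update are repeated until convergence, i.e. until the cluster assignments no longer change. Each iteration reduces the sum of squared errors between the points and their respective centroids, so the algorithm ends at a local (not necessarily global) minimum of that sum.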
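To make the three steps concrete, here is a minimal sketch in plain Python (standard library only, Python 3.8+ for math.dist). The function name kmeans, its parameters, and the toy data are my own choices for illustration. I also assume that the random initial centroids are drawn from the observations themselves and that an empty cluster keeps its old centroid; other choices are possible.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialization: use k randomly chosen observations as the starting centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment: give each point the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update: each centroid becomes the mean of the points assigned to it.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Assumption: an empty cluster simply keeps its old centroid.
                new_centroids.append(centroids[j])
        # Convergence: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Made-up toy data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)  # one centroid per group
print(labels)     # cluster index of each point, in input order
```

In practice you would normally reach for a library implementation; the sketch is only meant to show how Initialization, Assignment, and Update fit together.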

How to determine K

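The number of clusters is commonly chosen with the elbow method: run k-means for a range of k values, plot the sum of squared errors against k, and pick the k at the bend (the “elbow”), after which adding more clusters gives only small improvements.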
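As a rough sketch of how the elbow curve can be produced, the code below computes the within-cluster sum of squared errors for k = 1 to 6 on made-up data with three groups. The helper kmeans_wcss, the toy points, and the choice of keeping the best of 30 random restarts per k are all my own assumptions for illustration, not a fixed recipe.

```python
import math
import random

def kmeans_wcss(points, k, seed, max_iter=100):
    """Run a basic k-means once and return its within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: centroid = mean of its members (kept as-is if empty).
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

# Made-up toy data with three groups.
points = [
    (1.0, 1.0), (1.5, 1.5), (1.0, 2.0),
    (8.0, 1.0), (8.5, 1.5), (9.0, 1.0),
    (4.5, 7.0), (5.0, 8.0), (5.5, 7.0),
]

scores = {}
for k in range(1, 7):
    # Keep the best result of several random restarts for each k.
    scores[k] = min(kmeans_wcss(points, k, seed) for seed in range(30))
    print(k, round(scores[k], 2))
```

For this toy data the large drops in the printed values should stop at k = 3, which is where the elbow would appear if you plotted them.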

Advantages of K-means

  • Easy to implement and fast
  • Scalable and efficient on large data sets

Disadvantages of K-means

  • The initial centroids are selected at random, so different runs of the algorithm may give different clustering results. Thus, results lack consistency.
  • Selecting the optimal number of clusters is difficult.
  • Sensitive to noisy data and outliers
  • All items are forced into a cluster, even those that do not fit any cluster well
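The outlier problem comes from the fact that a centroid is a plain mean. Here is a tiny sketch with made-up numbers (the centroid helper is my own, for illustration) showing how a single far-away point drags the centroid away from the rest of the cluster:

```python
def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (21.5, 21.5)

print(centroid(cluster))              # (1.5, 1.5)
print(centroid(cluster + [outlier]))  # (5.5, 5.5)
```

With the outlier included, the centroid no longer sits inside the original group of four points.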

Applications of K-means Clustering

  1. Hazard Mapping: grouping earthquake-prone areas by vulnerability.
  2. Town/City Planning: classifying land by use, such as residential, commercial, or university areas.
  3. Marketing: customer segmentation and product penetration.
  4. Taxonomy: classification of species, flora and fauna.
  5. Document Clustering
  6. Image Segmentation and Image Compression

https://youtu.be/7DJ7mfLlel0

Visit the YouTube channel for more details.
