K-mean clustering Reading Notes

OVERVIEW

KMeans is a heuristic and unsupervised method that guarantee to converge to an (local) optimal solution, since each iteration must reduce the WCSS (within-cluster sum of squares), and there’s only finite number of assignments, the algorithm is gauaranteed to converge on an (local) optimal solution.

conversion

Application in Bag of Words

K means can be used for image segmentation, specifically, cluster the response of different filters into K clusters, that can be used to assign each pixel in the image to the nearest cluster. The result could be further used to generate feature vector and classify images.

Remarks

Despite being fast, K mean clustering has quite a few drawbacks:

  1. K means needs to determine K before hand, the true number may never be known
  2. Sensitive to outliers: the centroid can be pulled significantly due to one single noisy data point
  3. Since euclidean distance metric is used in the formula, it assume a sphere like distribution
  4. It assume each cluser should be equally large, in fact, the algorithm is equivalent to assigning points by Vornoi diagram of cluster centroids

conversion

EM algorithm could be used to yield better results.