K-Means

K-Means is an unsupervised Clustering algorithm and a method of Vector Quantisation. The goal is to partition nn data points into kk clusters, where each data point belongs to the cluster with the closest centroid.

The algorithm works like this:

  1. Standardise the data by centring at 0 to ensure all features are utilised equally for clustering.
  2. Randomly create kk centroids—one for each cluster. Common methods include selecting kk random data points as centroids or randomly generating centroid coordinates.
  3. Calculate the distance between each data point and the centroid. Euclidean Distance is commonly used: i=0n(qipi)2\sqrt{\sum\limits_{i=0}^{n} (q_i - p_i)^2}, where qiq_i and pip_i refer to the ithi_{th} feature of datapoint qq and centroid pp, respectively. However, other distance functions may be more suitable.

  4. Assign each datapoint to its closest centroid based on the calculated distances.

  5. Update the position of the k centroids by taking the mean of all data points assigned to each cluster.
  6. Repeat steps 4 and 5 until the centroids no longer change or a maximum number of iterations is reached.

The quality of the K-Means clustering is typically evaluated by calculating the average distance of all data points to their assigned centroids. However, K-Means does not guarantee convergence to the global minimum, and the final clustering may depend on the initial centroid positions.

K-Means clustering example