K-Means

Nov 10, 2017 permanent MachineLearning

K-Means is an unsupervised Clustering algorithm and a method of Vector Quantisation. The goal is to partition $n$ data points into $k$ clusters, where each data point belongs to the cluster with the closest centroid.

The algorithm works like this:

1. Standardise the data by centring each feature at 0 and scaling it to unit variance, so that all features contribute equally to the distance calculation.
2. Initialise $k$ centroids, one for each cluster. Common methods include selecting $k$ random data points as centroids or randomly generating centroid coordinates.
3. Calculate the distance between each data point and each centroid. Euclidean Distance is commonly used: $d(q, p) = \sqrt{\sum\limits_{i=1}^{m} (q_i - p_i)^2}$, where $m$ is the number of features and $q_i$ and $p_i$ refer to the $i$-th feature of data point $q$ and centroid $p$, respectively. However, other distance functions may be more suitable for some data.
4. Assign each data point to its closest centroid based on the calculated distances.
5. Update the position of each of the $k$ centroids by taking the mean of all data points assigned to its cluster.
6. Repeat steps 3 to 5 until the centroids no longer change or a maximum number of iterations is reached.
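
The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (`standardise`, `euclidean`, `k_means`), the `seed` and `max_iters` parameters, and the choice to leave an empty cluster's centroid where it was are all assumptions made for the example.

```python
import math
import random


def standardise(points):
    """Centre each feature at 0 and scale it to unit variance."""
    dims = len(points[0])
    means = [sum(p[j] for p in points) / len(points) for j in range(dims)]
    stds = [
        # A constant feature has zero spread; fall back to 1.0 to avoid dividing by 0.
        math.sqrt(sum((p[j] - means[j]) ** 2 for p in points) / len(points)) or 1.0
        for j in range(dims)
    ]
    return [[(p[j] - means[j]) / stds[j] for j in range(dims)] for p in points]


def euclidean(q, p):
    """Euclidean distance between data point q and centroid p."""
    return math.sqrt(sum((qi - pi) ** 2 for qi, pi in zip(q, p)))


def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Initialise by picking k distinct data points as the starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(k), key=lambda c: euclidean(q, centroids[c])) for q in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [q for q, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Centroids stopped moving, so the assignments are stable.
        centroids = new_centroids
    return centroids, labels


data = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [10.0, 10.0], [10.2, 9.9], [9.8, 10.1]]
centroids, labels = k_means(standardise(data), k=2)
```

On this toy data the first three points should end up sharing one label and the last three the other; which cluster is numbered 0 depends on the random initialisation.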

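The quality of a K-Means clustering is typically evaluated by the average distance from each data point to its assigned centroid, or by the closely related within-cluster sum of squared distances that the algorithm minimises. However, K-Means is only guaranteed to converge to a local minimum, not the global one, so the final clustering can depend on the initial centroid positions. Running the algorithm several times from different initialisations and keeping the best-scoring result is a common mitigation.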
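
To make the evaluation and the local-minimum problem concrete, the sketch below scores two different centroid sets on the same four points. Both sets are stable under the assign-and-update loop, yet one is far worse than the other. The helper names (`inertia`, `mean_distance`, `best_run`) and the toy data are assumptions chosen for illustration.

```python
import math


def squared_distance(q, p):
    """Squared Euclidean distance between point q and centroid p."""
    return sum((qi - pi) ** 2 for qi, pi in zip(q, p))


def inertia(points, centroids):
    """Within-cluster sum of squared distances to the nearest centroid."""
    return sum(min(squared_distance(q, c) for c in centroids) for q in points)


def mean_distance(points, centroids):
    """Average Euclidean distance from each point to its nearest centroid."""
    total = sum(
        math.sqrt(min(squared_distance(q, c) for c in centroids)) for q in points
    )
    return total / len(points)


def best_run(points, candidate_centroids):
    """Pick the centroid set with the lowest inertia among several runs."""
    return min(candidate_centroids, key=lambda cs: inertia(points, cs))


# Four points at the corners of a wide rectangle.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]

# Splitting left/right is the global optimum; splitting top/bottom is a
# local minimum that the update step cannot escape from.
left_right = [[0.0, 1.0], [10.0, 1.0]]
top_bottom = [[5.0, 0.0], [5.0, 2.0]]

print(inertia(points, left_right), mean_distance(points, left_right))  # 4.0 1.0
print(inertia(points, top_bottom), mean_distance(points, top_bottom))  # 100.0 5.0
print(best_run(points, [top_bottom, left_right]) == left_right)        # True
```

Comparing scores across several runs, as `best_run` does here, is one simple way to reduce the impact of an unlucky initialisation.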

K-Means clustering example

Notes by Lex Toumbourou