How the initialization affects the stability of the k-means algorithm
We investigate the role of the initialization for the stability of the k-means clustering algorithm. As opposed to other papers, we consider the actual k-means algorithm and do not ignore its property of getting stuck in local optima. We are interested in the actual clustering, not only in the costs of the solution. We analyze when different initializations lead to the same local optimum, and when they lead to different local optima. This enables us to prove that it is reasonable to select the number of clusters based on stability scores.
💡 Research Summary
The paper investigates how the choice of initialization influences the stability of the k‑means clustering algorithm when the algorithm is allowed to converge to local optima, rather than assuming an ideal global optimum. The authors define “stability” as the consistency of the final cluster assignments across multiple independent runs on the same data set, rather than merely comparing the objective function values. They first develop a probabilistic framework that links the geometry of the data (cluster separation, size, and shape) to the likelihood that a random or smart initialization will fall inside a “basin of attraction” leading to a particular local optimum. Under the assumption of well‑separated, roughly spherical clusters, they prove that if each initial centroid lies within a radius r of the true cluster mean, Lloyd’s algorithm will remain in the same assignment region and converge to the same local optimum. The probability of this event is derived for uniform random initialization and shown to be dramatically higher for distance‑based schemes such as k‑means++.
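The distance-based seeding scheme referenced above (k-means++, i.e., D² sampling) can be sketched in a few lines of plain Python. This is a minimal illustration for intuition, not code from the paper; the function name and parameters are ours:

```python
import random

def kmeans_pp_seeds(points, k, seed=0):
    """k-means++ seeding (D^2 sampling): the first centre is chosen
    uniformly at random; each subsequent centre is drawn with probability
    proportional to its squared distance from the nearest centre chosen
    so far.  `points` is a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    centres = [rng.choice(points)]
    while len(centres) < k:
        # squared distance of every point to its nearest chosen centre
        d2 = [min(sum((x - c) ** 2 for x, c in zip(p, centre))
                  for centre in centres)
              for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:       # sample proportionally to d2
                centres.append(p)
                break
    return centres
```

Because far-away points receive large sampling weight, well-separated clusters are each likely to receive a seed, which is exactly why the paper finds the probability of landing in the "good" basin of attraction to be much higher than under uniform seeding.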
Next, the authors introduce a quantitative stability score. For a given number of clusters k, they run the algorithm M times (typically 20–50) with independent initializations, compute the pairwise similarity of the resulting labelings using the Adjusted Rand Index (or Normalized Mutual Information), and average over all M(M−1)/2 pairs. A high average indicates that most runs end in the same basin, i.e., the algorithm is stable for that k.
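The score can be sketched in plain Python as follows. This is a minimal self-contained illustration (function names are ours): the Adjusted Rand Index is computed from the contingency table of two labelings, and the stability score averages it over all pairs of runs:

```python
from collections import Counter
from itertools import combinations

def comb2(n):
    """Number of unordered pairs among n items: C(n, 2)."""
    return n * (n - 1) // 2

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same points (chance-corrected)."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb2(c) for c in contingency.values())
    sum_a = sum(comb2(c) for c in Counter(labels_a).values())
    sum_b = sum(comb2(c) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:   # degenerate case: all-in-one clusterings
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def stability_score(labelings):
    """Average pairwise ARI over all M(M-1)/2 pairs of labelings."""
    pairs = list(combinations(labelings, 2))
    return sum(adjusted_rand_index(a, b) for a, b in pairs) / len(pairs)
```

Note that the ARI is invariant to label permutation, so two runs that find the same partition under different cluster numberings still score 1.0 — which is the right notion of "same local optimum" here.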
Using this score, they propose a stability‑based model selection criterion: increase k until the stability score exhibits a sharp drop; the largest k before this drop is taken as the appropriate number of clusters. This contrasts with traditional elbow or silhouette methods that rely on the objective function, which can be insensitive to changes in the actual partition when many local minima have similar costs.
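The selection rule can be sketched as a small helper. The drop threshold below is a hypothetical knob of our own, not a value from the paper; the function simply returns the largest k before the first sharp drop in the stability curve:

```python
def select_k_by_stability(scores, drop_threshold=0.2):
    """Given {k: average stability score}, return the largest k before
    the first drop exceeding `drop_threshold` (a hypothetical cutoff).
    Falls back to the largest k tried if no sharp drop occurs."""
    ks = sorted(scores)
    for prev, nxt in zip(ks, ks[1:]):
        if scores[prev] - scores[nxt] > drop_threshold:
            return prev
    return ks[-1]
```

For example, stability scores of 0.95, 0.93, 0.91, 0.55 for k = 2..5 would select k = 4, since the curve collapses when a fifth cluster is forced.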
Empirical validation is performed on synthetic Gaussian mixtures, the MNIST digit data set, and the 20 Newsgroups text corpus. The experiments confirm several key findings: (1) k‑means++ consistently yields higher stability than pure random seeding, especially when clusters are close; (2) stability scores are non‑monotonic in k and reveal clear “knee points” that align with intuitive or ground‑truth cluster numbers; (3) in high‑dimensional noisy data, cost‑based criteria remain flat while stability sharply declines after the true number of clusters, demonstrating the practical advantage of the proposed approach.
The paper also discusses limitations. The theoretical analysis assumes spherical, well‑separated clusters, which may not hold for many real‑world data sets. Moreover, even when initial centroids fall inside the attraction basin, stochastic perturbations (e.g., due to noise) can still cause divergent assignments. Computing the stability score scales quadratically with the number of runs, which can be prohibitive for very large data sets; the authors suggest sampling‑based approximations as future work.
In summary, the study provides a rigorous link between initialization strategies, the landscape of local optima, and the observable stability of k‑means clustering. It demonstrates that evaluating stability across multiple runs is a reliable way to both assess the robustness of a given clustering and to select the number of clusters, offering a more principled alternative to traditional cost‑centric methods. This insight has direct implications for practitioners who routinely employ k‑means: careful seeding (e.g., k‑means++) and systematic multi‑run stability assessment can substantially improve the reliability of the resulting partitions.