Fast $k$-means Seeding Under The Manifold Hypothesis
We study beyond-worst-case analysis for the $k$-means problem, where the goal is to model typical instances of $k$-means arising in practice. Existing theoretical approaches provide guarantees under certain assumptions on the optimal solutions to $k$-means, making them difficult to validate in practice. We propose the manifold hypothesis, where data obtained in ambient dimension $D$ concentrates around a low-dimensional manifold of intrinsic dimension $d$, as a reasonable assumption to model real-world clustering instances. We identify key geometric properties of datasets which have theoretically predictable scaling laws depending on the quantization exponent $\varepsilon = 2/d$, using techniques from optimal quantization theory. We show how to exploit these regularities to design a fast seeding method called $\operatorname{Qkmeans}$ which provides $O(\rho^{-2} \log k)$-approximate solutions to the $k$-means problem in time $O(nD) + \widetilde{O}(\varepsilon^{1+\rho}\rho^{-1}k^{1+\gamma})$, where the exponent $\gamma = \varepsilon + \rho$ for an input parameter $\rho < 1$. This allows us to obtain new runtime-quality tradeoffs. We perform a large-scale empirical study across various domains to validate our theoretical predictions and algorithm performance, bridging theory and practice for beyond-worst-case data clustering.
💡 Research Summary
The paper tackles the gap between the theoretical worst‑case analysis of the $k$‑means problem and its practical performance on real‑world data. Existing theoretical guarantees for $k$‑means algorithms (e.g., Lloyd’s iterations, $k$‑means++) rely on strong assumptions about the optimal solution—such as separability or stability—that are difficult to verify empirically. To bridge this gap, the authors adopt the manifold hypothesis, which posits that high‑dimensional data points are concentrated near a low‑dimensional smooth manifold $M\subset\mathbb{R}^D$ with intrinsic dimension $d\ll D$. This assumption shifts the focus from properties of the optimal clustering to geometric regularities of the data distribution itself, making it both theoretically tractable and empirically testable.
Using results from optimal quantization theory—particularly Zador’s asymptotic law and Gruber’s extension to manifolds—the authors derive scaling laws for the optimal $k$‑quantizer cost $\Delta_k(f)$. They show that $\Delta_k(f)$ decays as $k^{-\varepsilon}$ where $\varepsilon = 2/d$ (the quantization exponent). Translating these continuous results to finite samples, they define two data‑dependent parameters:
- $\beta_k(X) = \frac{\text{opt}_1(X)}{\text{opt}_k(X)}$, measuring how much the cost drops when increasing the number of clusters.
- $\eta(X)$, the aspect ratio (ratio of maximum to minimum pairwise distances) of the dataset.
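Both statistics can be estimated directly on a finite sample. The sketch below is illustrative, not the paper's procedure: it approximates $\text{opt}_k(X)$ with a few Lloyd iterations from a random initialization (the true optimum is NP-hard to compute), and computes $\eta(X)$ by brute force over all pairs.

```python
import numpy as np

def lloyd_cost(X, k, iters=20, seed=0):
    """Approximate opt_k(X) with a few Lloyd iterations from a random
    initialization. A heuristic stand-in for the true optimum."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # move each center to the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(1).sum()

def beta_k(X, k):
    # opt_1(X) is exactly the cost of the single mean
    opt1 = ((X - X.mean(0)) ** 2).sum()
    return opt1 / lloyd_cost(X, k)

def aspect_ratio(X):
    # eta(X): max over min nonzero pairwise distance, brute force O(n^2)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    pos = d[d > 0]
    return pos.max() / pos.min()
```

On data with $k$ well-separated clusters, $\beta_k$ is large (the cost drops sharply at the right $k$); on unstructured data it stays near 1, matching the theorem's prediction for low intrinsic dimension.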
Theorem 4 proves that, with high probability, $\beta_k(X)=1+O\big(\frac{D\log n}{n^{2/d}}k^{\varepsilon}\big)$ and $\eta(X)=O\big(n^{3/(2d)}\big)$. Hence, for low intrinsic dimension $d$, $\beta_k$ stays close to 1 and $\eta$ remains modest, confirming that the data exhibit predictable geometric regularities.
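The $\Delta_k(f)\propto k^{-\varepsilon}$ scaling from Zador's law can be checked numerically in the simplest case $d=1$ (so $\varepsilon=2$), where the optimal $k$-quantizer of the uniform density on $[0,1]$ places centers at the midpoints of $k$ equal cells and achieves cost $\frac{1}{12k^2}$. The fine grid below is an illustrative stand-in for samples from the density:

```python
import numpy as np

def uniform_quantizer_cost(k, n=100_000):
    """Mean squared distance from a fine grid on [0, 1] (standing in for
    Unif[0, 1]) to the nearest of k equally spaced cell-midpoint centers."""
    x = (np.arange(n) + 0.5) / n
    centers = (np.arange(k) + 0.5) / k
    d2 = (x[:, None] - centers[None, :]) ** 2
    return d2.min(1).mean()

# Doubling k should cut the cost by ~2^eps = 4 when eps = 2/d = 2.
ratio = uniform_quantizer_cost(8) / uniform_quantizer_cost(16)
```

In higher intrinsic dimension $d$ the same experiment gives a ratio of $2^{2/d}$, i.e., the decay flattens out, which is exactly why the exponent $\varepsilon$ is small on realistic high-dimensional data.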
Leveraging these regularities, the authors design a new seeding algorithm called Qkmeans. Traditional $k$‑means++ samples centers sequentially from the $D^2$ distribution, which is inherently $O(nkD)$ and difficult to parallelize. Qkmeans replaces this with a rejection‑sampling scheme that draws candidate centers from a simple $\ell_2$‑norm distribution and accepts them with probability proportional to the $D^2$ weight. A tunable parameter $\rho<1$ controls the number of rejection attempts, yielding an expected approximation factor $O(\rho^{-2}\log k)$. The overall runtime consists of an $O(nD)$ pass to read the data plus an additional term $\widetilde O(\varepsilon^{1+\rho}\rho^{-1}k^{1+\gamma})$, where $\gamma = \varepsilon + \rho$. Because $\varepsilon = 2/d$ is typically small (e.g., $0.1$–$0.2$ for many large datasets), the exponent $1+\gamma$ is close to 1, making the algorithm almost linear in $k$.
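The rejection step can be sketched as follows. This is an illustrative simplification, not the authors' algorithm: it uses a uniform proposal and the crude envelope $\max_y d(y,C)^2$, whereas Qkmeans draws proposals from an $\ell_2$-norm-based distribution, and it recomputes all distances each round (costing $O(nkD)$, precisely what Qkmeans is designed to avoid). Only the acceptance logic, which makes accepted candidates follow the $D^2$ distribution, is the point here.

```python
import numpy as np

def d2_seeding_by_rejection(X, k, seed=0):
    """Draw k centers D^2-distributed, via rejection sampling:
    propose a uniform index and accept x_i with probability
    d(x_i, C)^2 / max_y d(y, C)^2."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center: uniform over the data
    for _ in range(k - 1):
        # squared distance of every point to its nearest current center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(1)
        m = d2.max()  # envelope for the rejection test
        while True:
            i = rng.integers(n)            # uniform proposal
            if rng.random() * m <= d2[i]:  # accept w.p. d2[i] / m
                centers.append(X[i])
                break
    return np.array(centers)
```

Each accepted index is distributed proportionally to $d(x,C)^2$, matching $k$-means++; the parameter $\rho$ in Qkmeans trades off how many rejection attempts are allowed against the approximation factor.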
Theorem 5 formalizes the guarantee for any dataset: the expected cost of the returned centers $C$ satisfies
$$\mathbb{E}\big[\operatorname{cost}(X, C)\big] = O(\rho^{-2}\log k)\cdot \operatorname{opt}_k(X).$$