Optimal Time Bounds for Approximate Clustering
Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call successive sampling that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say 1/100) probability. Thus we establish a tight time bound of Θ(nk) for the k-median problem for a wide range of values of k. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
💡 Research Summary
The paper tackles the classic unsupervised learning problem of clustering under the k‑median objective, which seeks to minimize the average distance from data points to their assigned centers. While many approximation algorithms exist for k‑median, prior work either incurred poly‑logarithmic overheads in the running time or lacked strong lower‑bound guarantees for randomized methods. The authors introduce a novel sampling paradigm called “successive sampling” that dramatically reduces the amount of data that must be processed while preserving the clustering cost up to a constant factor.
Successive sampling works in a series of rounds. In each round a random subset of the currently unprocessed points is selected (for example, half of them). These sampled points serve as provisional centers, and every remaining point is assigned to its nearest provisional center. From the assigned points a further random fraction is kept for the next round, and the process repeats for O(log (n/k)) rounds. The key theoretical result is that after these rounds the surviving set S contains only O(k log (n/k)) points, yet any k‑median solution on S approximates the optimal solution on the full dataset within a constant factor. The proof relies on bounding the per‑round cost distortion using Chernoff‑type concentration bounds and showing that the accumulated distortion across all rounds remains bounded.
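The round structure described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the per-round sample size, and the rule for which points survive a round (here, the points farthest from any provisional center) are assumptions for this sketch; the paper's analysis fixes these parameters precisely to guarantee a summary of size O(k log(n/k)).

```python
import math
import random

def successive_sampling(points, k, dist, sample_size=None, keep_frac=0.5):
    """Illustrative sketch of the round structure described above.
    sample_size and keep_frac are hypothetical choices; the paper's
    analysis pins down these constants."""
    if sample_size is None:
        sample_size = max(1, k)  # hypothetical per-round sample size
    unprocessed = list(points)
    summary = []
    while len(unprocessed) > 2 * sample_size:
        # Select provisional centers at random from the unprocessed points.
        centers = random.sample(unprocessed, sample_size)
        summary.extend(centers)
        rest = [p for p in unprocessed if p not in centers]
        # Assign each remaining point to its nearest provisional center,
        # then keep only a fraction for the next round (in this sketch,
        # the points farthest from every provisional center survive).
        rest.sort(key=lambda p: min(dist(p, c) for c in centers))
        cut = int(len(rest) * (1 - keep_frac))
        unprocessed = rest[cut:]
    summary.extend(unprocessed)  # the final small residue joins the summary
    return summary
```

Because roughly half of the remaining points are retired each round, the loop runs for O(log(n/k)) rounds and the returned summary stays far smaller than the input.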
Armed with this compact representative set, the algorithm proceeds in three stages: (1) run successive sampling to obtain S; (2) apply any existing constant‑factor k‑median approximation algorithm (e.g., a linear‑program rounding scheme) to S to produce k candidate centers; (3) assign every original point to the nearest of these k centers. The total running time is dominated by the final assignment step, which costs O(nk). The earlier stages cost O(n log(n/k)) for the sampling passes and O(k² log(n/k)) for solving the reduced problem, both of which are lower‑order terms when k = o(n/log n). Consequently, the overall complexity is Θ(nk) for a wide range of k, improving upon the previous best bound of O(nk·polylog n).
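The three-stage pipeline can be written as a short skeleton in which stages 1 and 2 are pluggable black boxes. Here `reduce_fn` and `approx_fn` are hypothetical parameter names standing in for the paper's successive-sampling step and for whichever constant-factor k-median approximation is applied to S:

```python
def three_stage_kmedian(points, k, dist, reduce_fn, approx_fn):
    """Skeleton of the three-stage pipeline; reduce_fn and approx_fn are
    hypothetical stand-ins for the paper's stages 1 and 2."""
    # Stage 1: shrink the input to a small representative set S.
    S = reduce_fn(points, k)
    # Stage 2: a constant-factor k-median approximation on S yields k centers.
    centers = approx_fn(S, k)
    # Stage 3: assign every original point to its nearest center -- the
    # O(nk) step that dominates the total running time.
    labels = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
              for p in points]
    cost = sum(dist(p, centers[j]) for p, j in zip(points, labels))
    return centers, labels, cost
```

Because stage 3 touches every point once per center, swapping in a faster stage-2 solver cannot improve the asymptotic bound, which is exactly why the Θ(nk) assignment step sets the overall complexity.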
To complement the upper bound, the authors prove a matching lower bound for any randomized constant‑factor approximation algorithm that succeeds with even a modest probability (e.g., 1/100). Using an information‑theoretic argument based on communication complexity, they construct two families of instances that are indistinguishable unless the algorithm reads at least Ω(nk) entries of the distance matrix. This shows that no algorithm can beat the Θ(nk) barrier without sacrificing either approximation quality or success probability, thereby establishing Θ(nk) as the optimal time bound for k‑median in the randomized setting.
The paper also discusses extensions to the k‑means objective; the same sampling and analysis carry over, yielding an O(nk) algorithm with constant‑factor guarantees for k‑means as well. Empirical evaluations on large synthetic and real‑world datasets confirm that the proposed method matches or exceeds the solution quality of standard heuristics such as k‑means++ while running in comparable or lower time. Moreover, because the representative set size is only O(k log (n/k)), the technique is naturally suited for memory‑constrained, streaming, or distributed environments where building a full coreset is impractical.
In summary, the contributions are threefold: (1) introduction of successive sampling, a simple yet powerful tool for summarizing clustering instances; (2) an O(nk) time algorithm that delivers a constant‑factor approximation for both k‑median and k‑means with high probability; and (3) a tight Ω(nk) lower bound for randomized constant‑factor approximations, establishing Θ(nk) as the optimal runtime for a broad class of clustering problems. This work both advances the theoretical understanding of clustering complexity and provides a practically efficient algorithm that can replace the widely used but theoretically ungrounded k‑means iteration in many applications.