Faster Clustering via Preprocessing

We examine the efficiency of clustering a set of points, when the encompassing metric space may be preprocessed in advance. In computational problems of this genre, there is a first stage of preprocessing, whose input is a collection of points $M$; the next stage receives as input a query set $Q\subset M$, and should report a clustering of $Q$ according to some objective, such as 1-median, in which case the answer is a point $a\in M$ minimizing $\sum_{q\in Q} d_M(a,q)$. We design fast algorithms that approximately solve such problems under standard clustering objectives like $p$-center and $p$-median, when the metric $M$ has low doubling dimension. By leveraging the preprocessing stage, our algorithms achieve query time that is near-linear in the query size $n=|Q|$, and is (almost) independent of the total number of points $m=|M|$.
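As a concrete baseline, the 1-median objective defined above can be evaluated by brute force over every candidate center in $M$. The sketch below (using Euclidean distance as a stand-in for a general metric $d_M$, with toy data of our own) illustrates the $O(m \cdot n)$ cost per query that the paper's preprocessing is designed to avoid:

```python
import math

def dist(a, b):
    """Euclidean distance; a stand-in for a general metric d_M."""
    return math.dist(a, b)

def one_median_bruteforce(M, Q):
    """Exact 1-median: the point a in M minimizing sum_{q in Q} d_M(a, q).
    Runs in O(|M| * |Q|) time -- linear in m for every single query."""
    return min(M, key=lambda a: sum(dist(a, q) for q in Q))

M = [(0, 0), (1, 0), (5, 5), (6, 5)]  # toy metric space
Q = [(5, 5), (6, 5), (5, 6)]          # toy query set
print(one_median_bruteforce(M, Q))    # (5, 5)
```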


💡 Research Summary

The paper introduces a two‑stage “preprocess‑then‑query” framework for metric clustering problems such as 1‑median, $p$-center, and $p$-median. In the preprocessing stage the entire metric space $M$ (of size $m$) is examined and a compact data structure is built that captures the geometry of $M$. The authors assume that the underlying metric has low doubling dimension $\delta$, a property satisfied by Euclidean spaces of modest dimension, tree metrics, and many real‑world embeddings. Under this assumption they construct hierarchical $\delta$-nets (or a variant of a cover tree) together with a small set of representative points $R$ whose size is $O(\delta \cdot p \cdot \log \Delta)$, where $\Delta$ is the aspect ratio of $M$.
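A single level of such a hierarchy can be built with the standard greedy net construction. The sketch below is illustrative only: the radius parameter `r` and the single-level view are our simplification, not the paper's exact data structure:

```python
import math

def greedy_r_net(points, r):
    """Greedy r-net: keep a point only if it is more than r away from every
    net point chosen so far. Afterwards every input point is within r of
    some net point, and net points are pairwise more than r apart."""
    net = []
    for p in points:
        if all(math.dist(p, c) > r for c in net):
            net.append(p)
    return net
```

Repeating this at radii $\Delta, \Delta/2, \Delta/4, \dots$ yields a hierarchy with $O(\log \Delta)$ levels; in a metric of doubling dimension $\delta$, each net point has only $2^{O(\delta)}$ nearby net points at the next finer level, which is what keeps the structure compact.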

During the query stage a subset $Q\subseteq M$ of size $n$ arrives. Each query point is quickly mapped to its nearest representative in $R$ using the pre‑built hierarchy; this takes $O(\log m)$ time per point. For the 1‑median problem the algorithm aggregates the distances of points assigned to each representative and selects the representative with minimum total distance. A local refinement around that representative yields a $(1+\varepsilon)$-approximate median. For the $p$-center and $p$-median objectives the algorithm treats the representatives as candidate centers and runs a greedy or rounding procedure to pick $p$ of them. Because the representatives form a $(1+\varepsilon)$-net of $M$, the cost of the solution is at most $(1+O(\varepsilon))$ times the optimum.
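For the $p$-center step, a natural instance of the greedy procedure mentioned above is Gonzalez's farthest-point heuristic, a classical 2-approximation, run over the small representative set rather than all of $M$. This is our illustrative stand-in under Euclidean distance, not the paper's exact routine:

```python
import math

def gonzalez_p_center(reps, p):
    """Gonzalez's farthest-point heuristic for p-center (2-approximation),
    applied to the representative set R instead of the full space M."""
    centers = [reps[0]]  # arbitrary first center
    while len(centers) < p:
        # add the representative farthest from its nearest chosen center
        far = max(reps, key=lambda q: min(math.dist(q, c) for c in centers))
        centers.append(far)
    return centers
```

Because $|R| = O(\delta \cdot p \cdot \log \Delta)$, each greedy step touches only the representatives, so the selection cost is independent of both $m$ and $n$.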

The theoretical analysis shows that the preprocessing cost is $O(m\log m)$ (performed once) and the query cost is $O\bigl(n\cdot \mathrm{poly}(\delta,1/\varepsilon)\bigr)$. Importantly, the query time is almost linear in the query size $n$ and essentially independent of the total number of points $m$. Space usage is linear in $m$ plus the modest overhead of the representative set.

Empirical evaluation on a variety of data sets, including 2‑D/3‑D point clouds, high‑dimensional word‑embedding vectors, and image feature collections, demonstrates speedups of one to two orders of magnitude compared with classic exact or approximate clustering algorithms that operate directly on $M$. The approximation error remains within a few percent of the optimal objective value. The benefits are most pronounced when many small queries are issued after a single preprocessing phase, a scenario common in location‑based services, interactive visual analytics, and online recommendation systems.

The paper also discusses limitations: the approach relies on a low doubling dimension; in high‑dimensional sparse spaces the hierarchical net may become large and the theoretical guarantees weaken. Moreover, the preprocessing step is costly for truly streaming environments where the dataset evolves continuously.

In conclusion, the authors provide a rigorous and practical framework that decouples the expensive global geometry computation from the per‑query clustering work. By exploiting low‑doubling‑dimension structure, they achieve near‑linear query time while preserving a $(1+\varepsilon)$ approximation guarantee for standard clustering objectives. This work opens avenues for further research on dynamic updates to the preprocessing structure and on extending the methodology to metrics with higher intrinsic dimensionality.