Efficient Clustering with Limited Distance Information
Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. We use our algorithm to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire dataset. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
💡 Research Summary
The paper tackles the classic problem of partitioning a set of points S into k clusters when the underlying distance metric d is unknown and computing all pairwise distances is infeasible. Instead of requiring the full distance matrix, the authors introduce a “one‑versus‑all” query model: given a point s ∈ S, a single query returns the distances from s to every other point in S. This model mirrors real‑world bioinformatics tools such as BLAST, where a sequence can be compared against an entire database in one operation.
Under this model the authors assume a natural structural property of the data, which they formalize as an (α, β, γ)‑clusterability or margin condition. Roughly, each true cluster has relatively small internal distances, while distances to points outside the cluster are larger by a multiplicative factor γ and an additive gap β. This captures the intuition that proteins sharing a functional family are much more similar to each other than to proteins from other families.
The core algorithm, called Landmark‑Based Clustering, proceeds in three stages:
- Landmark Selection – Randomly pick O(k) points from S as “landmarks”.
- One‑versus‑All Queries – For each landmark ℓ, issue a query to obtain the full distance vector d(ℓ,·).
- Assignment & Refinement – Assign every point to its nearest landmark. Using the distances among landmarks and the margin condition, the algorithm infers which landmark groups correspond to distinct true clusters, and optionally refines the assignment with a second pass.
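The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact algorithm: the function and parameter names (`landmark_cluster`, `distances_from`, the `3 * k` landmark count) are hypothetical, and the sketch omits the refinement pass that uses inter-landmark distances and the margin condition to merge landmark groups into the k final clusters.

```python
import random

def landmark_cluster(points, k, distances_from, num_landmarks=None, seed=0):
    """Sketch of landmark-based clustering (hypothetical helper names).

    points            -- list of point identifiers
    distances_from(p) -- one-versus-all query: returns a dict mapping
                         every point q to d(p, q)
    """
    rng = random.Random(seed)
    m = num_landmarks or 3 * k                  # O(k) landmarks
    landmarks = rng.sample(points, min(m, len(points)))
    # One one-versus-all query per landmark: m queries total,
    # instead of the O(n^2) pairwise distances.
    dist = {ell: distances_from(ell) for ell in landmarks}
    # Assign every point to its nearest landmark.
    assignment = {p: min(landmarks, key=lambda ell: dist[ell][p])
                  for p in points}
    return landmarks, assignment
```

Under the margin condition, the nearest landmark of every point belongs to that point's own true cluster, so grouping points by (merged) landmarks recovers the target clustering.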
The theoretical contribution is a proof that, with high probability, the algorithm recovers the exact target clustering using only O(k log k log (1/δ)) queries, where δ is a failure probability. The proof hinges on two facts: (i) with O(k) random landmarks, each true cluster receives at least one landmark with probability ≥ 1 – δ, and (ii) the margin condition guarantees that points are closer to a landmark from their own cluster than to any landmark from another cluster, preventing mis‑assignments. The authors also show that the algorithm’s runtime is linear in the number of points once the queries are answered, and memory usage scales as O(n k).
Empirically, the method is evaluated on two fronts. First, a large protein‑sequence dataset (≈10 000 sequences) drawn from UniProt is clustered using SCOP manual classifications as ground truth. One‑versus‑all queries are implemented via BLAST, which returns a distance‑like score for each sequence against the entire database. The landmark algorithm queries less than 0.5 % of all possible distances yet achieves 92 % precision, 90 % recall, and an F1 score of 0.91. By contrast, standard k‑means and spectral clustering that rely on the full distance matrix achieve comparable accuracy only after computing 100 × more distances and incurring substantially higher runtime. Second, synthetic high‑dimensional Euclidean data are used to test robustness to the margin parameters; the algorithm’s accuracy sharply improves once the separation factor γ exceeds 2, confirming the theoretical predictions.
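The reported F1 score follows directly from the precision and recall figures: F1 is their harmonic mean, so 92 % precision and 90 % recall give F1 ≈ 0.91, consistent with the number quoted above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)
```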
The discussion acknowledges limitations. The margin assumption may not hold for all real datasets, and the random landmark selection could be replaced by more informed strategies (e.g., farthest‑point sampling) to reduce variance. Moreover, the cost of a one‑versus‑all query can vary across domains; extending the analysis to heterogeneous query costs is an open direction. Future work suggested includes relaxing the clusterability condition, handling streaming data where landmarks must be updated online, and adapting the framework to non‑metric distances such as edit distance.
In summary, the paper presents a compelling blend of theory and practice: it defines a realistic query model for large‑scale clustering, proves that only O(k) queries suffice under a mild structural assumption, and validates the approach on biologically relevant protein‑sequence data. The results demonstrate that accurate clustering does not necessarily require exhaustive distance information, opening avenues for efficient analysis in domains where pairwise comparisons are expensive.