Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions


In the context of clustering, we consider a generative model in a Euclidean ambient space with clusters of different shapes, dimensions, sizes and densities. In an asymptotic setting where the number of points becomes large, we obtain theoretical guarantees for a few emblematic methods based on pairwise distances: a simple algorithm based on the extraction of connected components in a neighborhood graph; the spectral clustering method of Ng, Jordan and Weiss; and hierarchical clustering with single linkage. The methods are shown to enjoy some near-optimal properties in terms of separation between clusters and robustness to outliers. The local scaling method of Zelnik-Manor and Perona is shown to lead to a near-optimal choice for the scale in the first two methods. We also provide a lower bound on the spectral gap to consistently choose the correct number of clusters in the spectral method.


💡 Research Summary

In this paper the authors address the problem of clustering data that consist of several groups with different intrinsic dimensions, densities, sizes and shapes, all embedded in a common Euclidean ambient space. They formalize a generative model in which each cluster Cₖ lies on a dₖ‑dimensional manifold Mₖ (dₖ ≤ D) and points are sampled uniformly with density ρₖ. The key quantity is the minimal inter‑cluster distance Δₘᵢₙ, which determines how well the clusters can be separated.

The study focuses on three prototypical distance‑based algorithms: (1) extraction of connected components from an ε‑neighbourhood graph, (2) spectral clustering as introduced by Ng, Jordan and Weiss, and (3) hierarchical single‑linkage clustering (minimum spanning tree cut). For each method the authors derive asymptotic conditions—valid when the total number of points n → ∞—under which the algorithm recovers the true cluster partition with high probability.

For the ε‑graph, they prove that a scale ε satisfying
c₁ (log n / n)^{1/dₖ} ≤ ε ≤ c₂ Δₘᵢₙ
exists and guarantees that every intra‑cluster pair of points is joined by a path in the graph while no inter‑cluster edge appears. Consequently the connected components of the graph coincide exactly with the underlying clusters.
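The first method above is simple enough to sketch directly: build the ε‑neighborhood graph and read off its connected components. A minimal illustration (the threshold value and the toy data are our own, not from the paper):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def epsilon_graph_clusters(X, eps):
    """Cluster by extracting the connected components of the
    eps-neighborhood graph: two points are joined by an edge
    whenever their Euclidean distance is at most eps."""
    D = cdist(X, X)                                   # pairwise distances
    A = csr_matrix(((D <= eps) & (D > 0)).astype(int))  # adjacency matrix
    return connected_components(A, directed=False)

# Two tight collinear clusters separated by a large gap
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
              [5.0, 0.0], [5.1, 0.0]])
n_comp, labels = epsilon_graph_clusters(X, eps=0.15)
# the two components recover the two intended clusters
```

Note that ε only needs to exceed the intra‑cluster point spacing while staying below the gap, which is exactly the admissible interval in the display above.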

In the spectral case, they introduce a local scaling rule σᵢ = distance to the k‑th nearest neighbour of point i (the Zelnik‑Manor–Perona scheme). This adaptive scale automatically places ε inside the admissible interval above, eliminating the need for manual tuning. They then bound the eigengap λ_{K+1} – λ_K from below by
c₃ (Δₘᵢₙ / σ_max)²,
showing that a sufficiently large separation yields a pronounced spectral gap, which in turn enables a consistent estimate of the number of clusters K via the classic eigengap heuristic.
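The local scaling rule and the eigengap heuristic can both be sketched in a few lines. The affinity below is the Zelnik‑Manor–Perona form A_ij = exp(−‖xᵢ−xⱼ‖² / (σᵢσⱼ)); the choice k=2 and the toy data are illustrative (the original self‑tuning paper suggests k=7), and the K‑selection step is the standard eigengap heuristic on the normalised Laplacian, not the paper's exact procedure:

```python
import numpy as np
from scipy.spatial.distance import cdist

def local_scaling_affinity(X, k=7):
    """Self-tuning affinity of Zelnik-Manor & Perona:
    A_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j)),
    where sigma_i is the distance from x_i to its k-th nearest neighbour."""
    D = cdist(X, X)
    sigma = np.sort(D, axis=1)[:, k]   # column 0 is the self-distance 0
    A = np.exp(-D**2 / np.outer(sigma, sigma))
    np.fill_diagonal(A, 0.0)
    return A

def eigengap_estimate_K(A, max_K=10):
    """Estimate the number of clusters as the largest gap in the
    low end of the spectrum of I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))
    evals = np.sort(np.linalg.eigvalsh(L))
    gaps = np.diff(evals[:min(max_K, len(evals))])
    return int(np.argmax(gaps)) + 1

# Two well-separated 1-D clusters: the adaptive sigma_i stay at the
# intra-cluster scale, so inter-cluster affinities vanish numerically
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
              [5.0, 0.0], [5.1, 0.0], [5.2, 0.0]])
A = local_scaling_affinity(X, k=2)
K = eigengap_estimate_K(A, max_K=5)
```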

For single‑linkage, the authors analyse the minimum spanning tree (MST) of the data. They demonstrate that the longest edge in the MST converges to Δₘᵢₙ, so cutting the K‑1 largest edges yields the exact cluster decomposition whenever Δₘᵢₙ exceeds the intra‑cluster edge scale.
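The MST‑cut view of single linkage translates directly into code: compute the Euclidean MST, delete the K−1 longest edges, and take the resulting components. A sketch under the same toy setup as above (K is assumed known here):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_cut_clusters(X, K):
    """Single-linkage clustering via the MST: remove the K-1 longest
    MST edges, leaving K connected components."""
    D = cdist(X, X)
    mst = minimum_spanning_tree(csr_matrix(D)).tocoo()
    order = np.argsort(mst.data)                 # edges by length
    keep = order[: len(mst.data) - (K - 1)]      # drop the K-1 longest
    A = csr_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                   shape=D.shape)
    _, labels = connected_components(A, directed=False)
    return labels

# The longest MST edge is the inter-cluster bridge of length ~ Delta_min,
# so cutting it recovers the two clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
              [5.0, 0.0], [5.1, 0.0]])
labels = mst_cut_clusters(X, K=2)
```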

Robustness to outliers is also examined. Outliers are modeled as a small fraction of points drawn uniformly from the whole ambient space. The analysis shows that, under the same ε or σᵢ regimes, the probability that an outlier creates a spurious inter‑cluster edge decays as O(εᴰ), rendering the three algorithms essentially immune to a modest amount of contamination.
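The effect described above is easy to visualise numerically: at the cluster scale ε, ambient‑space outliers almost never attach to a cluster and instead end up in tiny components. The size threshold used to flag them below is an illustrative choice of ours, not the paper's exact criterion:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Two dense 1-D clusters plus a few ambient-space outliers (toy data)
cluster1 = np.column_stack([np.linspace(0.0, 1.0, 50), np.zeros(50)])
cluster2 = np.column_stack([np.linspace(5.0, 6.0, 50), np.zeros(50)])
outliers = np.array([[2.5, 2.0], [-1.0, -1.5], [7.0, 1.0]])
X = np.vstack([cluster1, cluster2, outliers])

eps = 0.1  # above the ~0.02 intra-cluster spacing, far below Delta_min
D = cdist(X, X)
A = csr_matrix(((D <= eps) & (D > 0)).astype(int))
n_comp, labels = connected_components(A, directed=False)

# components far smaller than the clusters are flagged as outliers
sizes = np.bincount(labels)
is_outlier = sizes[labels] < 5
```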

The paper further provides a lower bound on the spectral gap that can be computed from the data, offering a practical criterion for selecting K without cross‑validation. Experimental validation on synthetic mixtures of manifolds and on real high‑dimensional datasets (image patches, word embeddings) confirms the theoretical predictions: local scaling yields near‑optimal ε, the eigengap aligns with the derived bound, and single‑linkage remains computationally attractive even for large n.

Overall, the work delivers a unified theoretical framework that explains why simple distance‑based clustering methods can be near‑optimal for heterogeneous, mixed‑dimensional data. It clarifies the role of scale selection, quantifies the separation required for reliable recovery, and supplies actionable tools (local scaling, eigengap bound) for practitioners dealing with complex, high‑dimensional clustering tasks.

