Robust Hierarchical Clustering


One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have long been used across many different fields ranging from computational biology to social sciences to computer vision, in part because their output is easy to interpret. It is well known, however, that many of the classic agglomerative clustering algorithms are not robust to noise. In this paper we propose and analyze a new robust algorithm for bottom-up agglomerative clustering. We show that our algorithm can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also show how to adapt our algorithm to the inductive setting, where the given data is only a small random sample of the entire data set. Experimental evaluations on synthetic and real-world data sets show that our algorithm achieves better performance than other hierarchical algorithms in the presence of noise.


💡 Research Summary

The paper addresses a well‑known weakness of classic agglomerative hierarchical clustering methods—namely, their extreme sensitivity to even a small amount of noise or outliers. While single‑linkage, complete‑linkage, and average‑linkage all succeed under the ideal “strict separation” condition (every point is more similar to points in its own cluster than to any point outside), they can fail catastrophically when a few cross‑cluster similarities appear.

To overcome this, the authors propose a new robust agglomerative algorithm that combines two key ideas. First, instead of using the maximum, minimum, or average similarity between two clusters, the algorithm computes the median similarity of all pairwise similarities across the two clusters and uses this median as the linkage score. Because the median is insensitive to a small number of extreme values, the resulting linkage is far less affected by noisy or malicious edges.
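The median-linkage idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a precomputed similarity matrix `S` (higher values mean more similar), and the function name and greedy merge loop are ours.

```python
import numpy as np

def median_linkage(S, n_clusters):
    """Greedy agglomerative clustering that merges the pair of clusters
    with the highest *median* cross-cluster similarity."""
    n = S.shape[0]
    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > n_clusters:
        best_score, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Median of all pairwise similarities across the two clusters:
                # unlike max/min/mean linkage, a few corrupted entries cannot
                # dominate this score.
                score = np.median(S[np.ix_(clusters[i], clusters[j])])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        i, j = best_pair
        merges.append((list(clusters[i]), list(clusters[j]), best_score))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges
```

For example, with two tight clusters and a single corrupted cross-cluster similarity, the median score between the clusters stays low, so the noisy edge does not trigger a premature merge the way a max-similarity (single-linkage) rule would.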

Second, the paper introduces a good‑neighborhood property as a realistic assumption on the similarity matrix. For each point x, let nC(x) be the size of its true cluster. The property requires that at most an α‑fraction of the nC(x) nearest neighbors of x belong to other clusters; equivalently, at least (1‑α)·nC(x) of the nearest neighbors are from the same true cluster. This condition is considerably weaker than the ν‑strict‑separation model used in earlier work, because it allows each point to have its own set of “bad” neighbors, rather than requiring a global set of outliers to be removed.
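One way to make this condition concrete is a direct check against a similarity matrix and ground-truth labels. The sketch below is our own illustration of the definition as paraphrased above; it assumes a point counts among its own nearest neighbors, and the function name is hypothetical.

```python
import numpy as np

def satisfies_good_neighborhood(S, labels, alpha):
    """Check that for every point x, at most an alpha-fraction of its
    n_C(x) nearest neighbors (by similarity) lie outside x's true cluster."""
    labels = np.asarray(labels)
    n = len(labels)
    for x in range(n):
        cluster_size = int(np.sum(labels == labels[x]))  # n_C(x)
        # Top n_C(x) neighbors by similarity; x itself (S[x, x] maximal)
        # counts as its own neighbor in this sketch.
        nearest = np.argsort(-S[x])[:cluster_size]
        bad = int(np.sum(labels[nearest] != labels[x]))
        if bad > alpha * cluster_size:
            return False
    return True
```

Note how the per-point nature of the condition shows up: corrupting one point's similarities can violate the property at alpha = 0 while it still holds for a modest alpha, with no need to excise a global set of outliers.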

Under this assumption, the authors prove (Section 3) that the hierarchy produced by their median‑linkage algorithm always contains a pruning that exactly recovers the target clustering for all “good” points. The proof hinges on showing that the median linkage will always merge two clusters whose internal good‑neighborhood structure dominates any noisy cross‑cluster edges, thus preserving the correct merge order.

The analysis is further extended (Section 4) to handle boundary points—points that may have many cross‑cluster neighbors but belong to a sufficiently large sub‑cluster that itself satisfies the good‑neighborhood condition. As long as the fraction of such boundary points within each sub‑cluster stays below a certain threshold, the algorithm still yields a hierarchy whose pruning correctly classifies both good and boundary points.

A practical contribution is the inductive version (Section 5). When the dataset is massive, the algorithm can be run on a small random sample whose size depends only on the desired confidence and noise parameters, not on the total number of points. The authors prove that the hierarchy built on the sample implicitly defines a hierarchy over the entire dataset, and that the target clustering remains a low‑error pruning of this global tree.
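The sample-then-extend idea can be sketched as below. This is our illustrative assumption of how such a scheme might look, not the paper's exact procedure: `cluster_fn` stands in for any hierarchical routine run on the sample, and the extension rule (assign each point to the sample cluster with the highest median similarity) is a guess consistent with the median-based theme.

```python
import numpy as np

def inductive_cluster(S, sample_size, n_clusters, cluster_fn, rng=None):
    """Cluster a random sample, then extend the clustering to every point
    by assigning it to the sample cluster of highest median similarity."""
    rng = np.random.default_rng(rng)
    n = S.shape[0]
    sample = rng.choice(n, size=sample_size, replace=False)
    # cluster_fn sees only the sample-by-sample similarity submatrix and
    # returns a partition of sample-local indices.
    sample_clusters = cluster_fn(S[np.ix_(sample, sample)], n_clusters)
    labels = np.empty(n, dtype=int)
    for x in range(n):
        scores = [np.median(S[x, sample[c]]) for c in sample_clusters]
        labels[x] = int(np.argmax(scores))
    return labels
```

The point of the construction is that `cluster_fn` runs on a `sample_size × sample_size` matrix whose size is set by the confidence and noise parameters, while the extension pass over all n points is a cheap linear scan.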

Empirical evaluation (Section 6) includes synthetic experiments where α and ν are varied to test the limits of the theoretical guarantees, as well as real‑world benchmarks (document clustering, image feature clustering, and biological gene‑expression data). Across all tests, the proposed method consistently outperforms traditional single, complete, and average linkage, maintaining high Adjusted Rand Index scores even when up to 20–30% of pairwise similarities are corrupted. On real data, the method achieves 8–12% higher precision and recall than the baselines. The authors note that the algorithm requires tuning of the noise parameters (α, β), which can be estimated from validation data.

In summary, the paper makes four major contributions:

  1. Algorithmic innovation – median‑based linkage that is provably robust to a bounded amount of noisy similarities.
  2. Theoretical framework – the good‑neighborhood property (and its boundary‑point extension) that captures realistic noise models, together with rigorous proofs that the target clustering appears as a pruning of the produced hierarchy.
  3. Scalable inductive learning – a sampling‑based scheme that yields a hierarchy for arbitrarily large datasets with guarantees independent of dataset size.
  4. Comprehensive empirical validation – synthetic and real‑world experiments confirming the theoretical robustness and demonstrating practical superiority over existing hierarchical methods.

Overall, the work bridges a gap between the interpretability of hierarchical clustering and the robustness required for noisy, large‑scale applications, offering a method that is both theoretically sound and practically effective.

