Modern hierarchical, agglomerative clustering algorithms


This paper presents algorithms for hierarchical, agglomerative clustering which perform most efficiently in the general-purpose setup that is given in modern standard software. Requirements are: (1) the input data is given by pairwise dissimilarities between data points, but extensions to vector data are also discussed; (2) the output is a “stepwise dendrogram”, a data structure which is shared by all implementations in current standard software. We present algorithms (old and new) which perform clustering in this setting efficiently, both in an asymptotic worst-case analysis and from a practical point of view. The main contributions of this paper are: (1) We present a new algorithm which is suitable for any distance update scheme and performs significantly better than the existing algorithms. (2) We prove the correctness of two algorithms by Rohlf and Murtagh, which is necessary in each case for different reasons. (3) We give well-founded recommendations for the best current algorithms for the various agglomerative clustering schemes.


💡 Research Summary

The paper addresses a practical gap between the theoretical development of hierarchical agglomerative clustering (HAC) algorithms and the requirements of modern statistical and machine learning software packages such as R, SciPy, and MATLAB. The authors formalize the problem setting as follows: the input is a full pairwise dissimilarity matrix for N objects (of size Θ(N²)), and the output must be a “stepwise dendrogram”, i.e., an (N‑1) × 3 list of triples (aᵢ, bᵢ, δᵢ) that records which two clusters are merged at each step and at what dissimilarity value. This output format is the de facto standard across the major libraries and carries more information than a plain ultrametric dendrogram, especially when multiple merges occur at the same distance.
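As a concrete illustration of this output convention, here is a small sketch using SciPy, whose `linkage()` returns exactly this merge list, with a fourth column (the size of the newly formed cluster) appended to each (aᵢ, bᵢ, δᵢ) triple. The toy data below is illustrative, not from the paper.

```python
# Sketch: inspecting the stepwise dendrogram as emitted by SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
D = pdist(points)                 # condensed pairwise dissimilarity vector
Z = linkage(D, method="single")   # stepwise dendrogram, shape (N-1, 4)
for a, b, delta, size in Z:
    print(f"merge clusters {int(a)} and {int(b)} at {delta:.3f} (new size {int(size)})")
```

Newly created clusters are labeled N, N+1, …, so the row indices double as cluster labels, which is why the format disambiguates ties that a plain ultrametric dendrogram cannot.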

The authors first restate the primitive HAC algorithm (Figure 1) that repeatedly finds the closest pair of clusters, merges them, updates all distances according to a chosen linkage formula, and records the merge in the stepwise dendrogram. They list the seven classic linkage schemes (single, complete, average, weighted, Ward, centroid, median) and show which of them admit a closed‑form, order‑independent distance update. The primitive algorithm runs in Θ(N³) time because each iteration scans all remaining pairwise distances.
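To make that control flow concrete, here is a minimal sketch of the primitive scheme, specialized to average linkage (UPGMA) via its Lance–Williams update. The function name and matrix representation are illustrative, not from the paper; note how each iteration rescans all remaining pairs, which is the source of the Θ(N³) cost.

```python
def primitive_hac(D):
    """D: full symmetric dissimilarity matrix as a list of lists.
    Returns the stepwise dendrogram as a list of (a, b, delta) triples."""
    n = len(D)
    D = [row[:] for row in D]        # work on a copy
    active = list(range(n))          # indices of still-active clusters
    size = [1] * n                   # cluster cardinalities
    label = list(range(n))           # output labels; new clusters get n, n+1, ...
    dend, next_label = [], n
    while len(active) > 1:
        # scan every remaining pair for the minimum dissimilarity
        a, b = min(((i, j) for i in active for j in active if i < j),
                   key=lambda p: D[p[0]][p[1]])
        dend.append((label[a], label[b], D[a][b]))
        # Lance-Williams update for average linkage:
        # d(a+b, c) = (|a| d(a,c) + |b| d(b,c)) / (|a| + |b|)
        for c in active:
            if c not in (a, b):
                D[a][c] = D[c][a] = ((size[a] * D[a][c] + size[b] * D[b][c])
                                     / (size[a] + size[b]))
        size[a] += size[b]
        label[a] = next_label
        next_label += 1
        active.remove(b)
    return dend
```

Swapping in a different Lance–Williams coefficient set in the marked update line yields the other closed-form schemes the paper lists.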

The core contribution is a new “generic algorithm” that works for any linkage update rule, including the order‑dependent centroid and median schemes. The algorithm combines a min‑heap (or priority queue) to retrieve the current nearest pair in O(log N) time with a Union‑Find data structure to maintain cluster representatives. Crucially, distance updates are performed lazily: after a merge, only distances involving the newly created cluster are recomputed, and outdated heap entries are ignored when they surface. This yields an overall time complexity of Θ(N² log N) and a memory footprint of Θ(N²), matching the lower bound imposed by the input size. The authors prove that the algorithm always produces a valid stepwise dendrogram and that it never creates inversions for the five linkage methods that satisfy the monotonicity condition; for centroid and median, inversions are possible but the algorithm still respects the defined update formulas.
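The heap-plus-lazy-invalidation idea can be sketched compactly. The version below is a simplification: it tracks active clusters with a set rather than the paper's Union–Find structure, and it is specialized to complete linkage, though any distance update rule could be substituted at the marked line. Stale heap entries are simply skipped when popped.

```python
import heapq

def generic_hac(D):
    """D: full symmetric dissimilarity matrix (list of lists).
    Returns the stepwise dendrogram as (a, b, delta) triples."""
    n = len(D)
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}
    heap = [(d, a, b) for (a, b), d in dist.items()]
    heapq.heapify(heap)              # min-heap of candidate merges
    alive = set(range(n))            # currently active cluster labels
    dend, next_label = [], n
    while len(alive) > 1:
        d, a, b = heapq.heappop(heap)
        if a not in alive or b not in alive:
            continue                 # stale entry: ignore it lazily
        dend.append((a, b, d))
        alive.discard(a)
        alive.discard(b)
        new = next_label
        next_label += 1
        for c in alive:
            # complete-linkage update; any Lance-Williams rule fits here
            dc = max(dist[min(a, c), max(a, c)], dist[min(b, c), max(b, c)])
            dist[min(new, c), max(new, c)] = dc
            heapq.heappush(heap, (dc, min(new, c), max(new, c)))
        alive.add(new)
    return dend
```

Only O(N) new distances are computed per merge, and each heap operation costs O(log N), which is where the Θ(N² log N) bound comes from.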

In addition to the new algorithm, the paper supplies rigorous correctness proofs for two previously informal methods: (1) Rohlf’s 1973 minimum‑spanning‑tree based single‑linkage algorithm, and (2) Murtagh’s 1985 nearest‑neighbor‑chain (NNC) algorithm. The Rohlf proof shows that each edge added to the MST corresponds exactly to a merge in the stepwise dendrogram, while the NNC proof demonstrates that the chain always terminates at a mutually nearest pair, guaranteeing optimal merges at each step.
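The nearest-neighbor-chain strategy itself is short enough to sketch. The version below, with illustrative names, uses complete linkage (one of the reducible schemes for which the correctness proof applies): follow nearest neighbors until two clusters are mutually nearest, merge them, and resume from the remainder of the chain.

```python
def nn_chain_hac(D):
    """Nearest-neighbor-chain HAC for complete linkage.
    D: full symmetric dissimilarity matrix (list of lists)."""
    n = len(D)
    D = [row[:] for row in D]
    label = list(range(n))            # output labels; merges get n, n+1, ...
    alive = set(range(n))
    dend, chain, next_label = [], [], n
    while len(alive) > 1:
        if not chain:
            chain = [next(iter(alive))]   # restart from any active cluster
        while True:
            a = chain[-1]
            # nearest active neighbor of the chain's tip
            b = min((c for c in alive if c != a), key=lambda c: D[a][c])
            if len(chain) >= 2 and b == chain[-2]:
                break                     # a and b are mutually nearest
            chain.append(b)
        chain.pop()
        chain.pop()                       # drop a and b from the chain
        dend.append((min(label[a], label[b]), max(label[a], label[b]), D[a][b]))
        for c in alive:
            if c not in (a, b):
                D[a][c] = D[c][a] = max(D[a][c], D[b][c])   # complete linkage
        label[a] = next_label
        next_label += 1
        alive.discard(b)
    # merges are found in chain order, not by increasing dissimilarity; a
    # final sort (plus cluster relabeling in a full implementation) yields
    # the stepwise dendrogram
    dend.sort(key=lambda t: t[2])
    return dend
```

Reducibility guarantees that merging a mutually nearest pair never invalidates the rest of the chain, which is why the surviving chain prefix can be reused.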

The authors conduct a detailed complexity analysis (Section 4.1) and extensive empirical tests (Section 4.2). Experiments on synthetic datasets ranging from 10⁴ to 10⁵ points and on real‑world genetic and image‑feature data compare the new generic algorithm against established implementations: SLINK (single linkage), FastCluster (multiple linkages), and the default SciPy linkage routine. Results show that for single, complete, average, weighted, and Ward linkages the existing FastCluster code already achieves near‑optimal performance, while for centroid and median linkages the generic algorithm delivers speed‑ups of 2–5× and modest memory savings. The paper also quantifies the frequency of dendrogram inversions under centroid and median schemes and discusses how the generic algorithm’s lazy update strategy mitigates the associated numerical instability.

Finally, the paper provides practical recommendations: use FastCluster (or the built‑in SciPy routine) for the five monotonic linkages, and adopt the newly proposed generic algorithm for centroid and median linkages. The authors supply C++ source code with R and Python bindings (Müllner 2011) and outline how the methods could be extended to the “stored‑data” setting where the input is a set of vectors rather than a distance matrix.

Overall, the work bridges theory and practice by delivering provably correct, asymptotically optimal algorithms that fit seamlessly into the data structures and output conventions of contemporary statistical software, thereby enabling faster and more reliable hierarchical clustering in a wide range of scientific applications.

