Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm
The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication, and it has been generalized in various ways. However, the method is interpreted differently in the literature, and commonly used software systems implement the Ward agglomerative algorithm differently, including with differing expressions of the agglomerative criterion. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward's hierarchical clustering method.
💡 Research Summary
The paper provides a comprehensive examination of Ward’s hierarchical clustering method, focusing on its original formulation, subsequent generalizations, and the divergent implementations found in major statistical and machine‑learning software packages. Ward’s method is fundamentally an agglomerative algorithm that selects the pair of clusters whose merger yields the smallest increase in the total error sum of squares (ESS). In Ward’s 1963 paper the increase in ESS when merging clusters A and B is expressed as ΔESS = (|A|·|B|)/( |A|+|B| ) · d²(A,B), where d(A,B) is the Euclidean distance between the cluster centroids and |·| denotes cluster size. This criterion can be interpreted as minimizing the within‑cluster variance after each merge.
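The ΔESS criterion above can be made concrete with a small sketch. The helper names (`ward_merge_cost`, `ess`) are illustrative, not from the paper; the code simply checks that the closed-form merge cost equals the actual increase in the error sum of squares.

```python
import numpy as np

def ward_merge_cost(A, B):
    """Ward's merge criterion: ΔESS = |A||B|/(|A|+|B|) * ||μ_A - μ_B||²,
    where μ_A, μ_B are cluster centroids and |·| is cluster size."""
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    n_a, n_b = len(A), len(B)
    return n_a * n_b / (n_a + n_b) * float(np.sum((mu_a - mu_b) ** 2))

def ess(X):
    """Error sum of squares of one cluster: Σ ||x - μ||²."""
    return float(np.sum((X - X.mean(axis=0)) ** 2))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])

# The criterion equals the ESS increase caused by the merge:
# ESS(A ∪ B) - ESS(A) - ESS(B).
delta = ward_merge_cost(A, B)
print(delta, ess(np.vstack([A, B])) - ess(A) - ess(B))
```

This identity is what justifies reading Ward's rule as "merge the pair whose union degrades within-cluster homogeneity the least."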
The authors first revisit the mathematical derivation of this criterion and show how it can be efficiently updated using the Lance‑Williams recurrence, which avoids recomputing all pairwise distances after each agglomeration. They then discuss several extensions that have appeared in the literature: (1) the distinction between Ward.D (using d) and Ward.D2 (using d²) as implemented in R’s hclust; (2) the incorporation of weighted observations via a positive‑definite weight matrix W, leading to a generalized ΔESS = (|A|·|B|)/( |A|+|B| ) · (μ_A‑μ_B)ᵀ W (μ_A‑μ_B); (3) the use of alternative distance metrics such as Mahalanobis or Manhattan distances, which changes the geometry of the space but retains the variance‑minimization spirit; and (4) adaptations for high‑dimensional data where the covariance structure is regularized.
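For Ward's method, the Lance-Williams recurrence mentioned above takes the form d(A∪B, C) = [(n_A+n_C)·d(A,C) + (n_B+n_C)·d(B,C) − n_C·d(A,B)] / (n_A+n_B+n_C), where the d(·,·) values are the stored merge costs and the n are cluster sizes. A minimal sketch (the function name is illustrative):

```python
def lance_williams_ward(d_ac, d_bc, d_ab, n_a, n_b, n_c):
    """Ward update via the Lance-Williams recurrence.

    Given the stored merge costs d(A,C), d(B,C), d(A,B) and the
    cluster sizes, returns d(A∪B, C) without revisiting the raw
    data -- this is what makes each agglomeration step cheap.
    """
    n = n_a + n_b + n_c
    return ((n_a + n_c) * d_ac + (n_b + n_c) * d_bc - n_c * d_ab) / n

# Three 1-D singletons at 0, 2, 5 with ΔESS costs d²/2:
# d(A,C) = 25/2, d(B,C) = 9/2, d(A,B) = 4/2.
cost = lance_williams_ward(12.5, 4.5, 2.0, 1, 1, 1)
print(cost)  # equals ΔESS({0,2}, {5}) = (2·1/3)·(5-1)² = 32/3
```

The printed value matches the direct ΔESS computation for merging the cluster {0, 2} with {5}, confirming that the recurrence reproduces Ward's criterion exactly.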
A substantial portion of the paper is devoted to a systematic comparison of how five widely used platforms—R (hclust), SAS (PROC CLUSTER), SPSS (Hierarchical Cluster), Python’s scikit‑learn (AgglomerativeClustering), and MATLAB (linkage)—implement Ward’s algorithm. Although all claim to follow Ward’s principle, the authors uncover concrete differences: R offers two distinct methods (Ward.D vs. Ward.D2) that differ in whether the distance or its square is used in the ΔESS formula; SAS and SPSS internally minimize the mean squared error (MSE), which is mathematically equivalent to Ward’s criterion but leads to subtle variations in the dendrogram topology; scikit‑learn computes the merge cost directly from the data matrix without pre‑computing a full distance matrix, improving memory efficiency but producing slightly different linkage values; MATLAB’s implementation is opaque, making it difficult to verify which variant is used.
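The Ward.D versus Ward.D2 split can be illustrated with a deliberately naive O(n³) agglomeration loop. This is a sketch, not any package's actual implementation: seeding the Lance-Williams update with squared Euclidean distances mimics Ward.D2, while seeding it with raw distances mimics R's `ward.D` behavior.

```python
import numpy as np
from itertools import combinations

def naive_ward(X, squared=True):
    """Naive Ward agglomeration via the Lance-Williams update.

    squared=True seeds the recurrence with squared Euclidean
    distances (Ward.D2 style); squared=False seeds it with raw
    distances (Ward.D style). Returns the sequence of merge heights.
    """
    n = len(X)
    active, size, d = set(range(n)), {i: 1 for i in range(n)}, {}
    for i, j in combinations(range(n), 2):
        dist = float(np.linalg.norm(X[i] - X[j]))
        d[(i, j)] = dist ** 2 if squared else dist

    def get(i, j):
        return d[(min(i, j), max(i, j))]

    heights, next_id = [], n
    while len(active) > 1:
        # Merge the cheapest pair under the current criterion.
        a, b = min(combinations(sorted(active), 2), key=lambda p: get(*p))
        heights.append(get(a, b))
        # Lance-Williams update for Ward's method.
        for c in active - {a, b}:
            tot = size[a] + size[b] + size[c]
            d[(min(next_id, c), max(next_id, c))] = (
                (size[a] + size[c]) * get(a, c)
                + (size[b] + size[c]) * get(b, c)
                - size[c] * get(a, b)
            ) / tot
        size[next_id] = size[a] + size[b]
        active -= {a, b}
        active.add(next_id)
        next_id += 1
    return heights

h2 = naive_ward(np.array([[0.0], [2.0], [5.0]]), squared=True)
h1 = naive_ward(np.array([[0.0], [2.0], [5.0]]), squared=False)
print(h2, h1)  # same merge order here, but different heights
```

On larger, less clean data the two seedings can also change the *order* of merges, which is exactly the dendrogram-topology divergence the paper documents across packages.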
To illustrate the practical impact of these discrepancies, the authors conduct case studies on the classic Iris dataset (low‑dimensional) and a high‑dimensional gene‑expression dataset (thousands of variables). For each dataset they run the five implementations, then compare (a) the resulting dendrograms, (b) cophenetic correlation coefficients, (c) the number of clusters suggested by common cut‑height heuristics, and (d) computational resources (runtime and memory). The findings reveal that while low‑dimensional data produce broadly similar clusterings, high‑dimensional data exhibit pronounced divergence: the choice of ΔESS formulation can alter the order of merges, leading to different cluster boundaries and even different optimal numbers of clusters. Runtime differences are also notable; scikit‑learn is the fastest, whereas SAS consumes the most memory due to its full distance‑matrix approach.
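The cophenetic-correlation comparison in (b) can be reproduced in miniature with SciPy. The two-blob dataset below is a synthetic stand-in for the paper's case-study data; SciPy's `ward` linkage corresponds to the Ward.D2 variant.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Two well-separated Gaussian blobs (synthetic stand-in for real data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])

# Ward linkage on the observations (Ward.D2 variant in SciPy).
Z = linkage(X, method="ward")

# Cophenetic correlation: how faithfully the dendrogram's merge
# heights reproduce the original pairwise distances.
c, _ = cophenet(Z, pdist(X))
print(f"cophenetic correlation: {c:.3f}")
```

Running the same comparison across several implementations of Ward's method, as the authors do, turns the cophenetic coefficient into a simple cross-package agreement diagnostic.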
The paper concludes with practical recommendations for both developers and analysts. Developers should document explicitly which Ward variant they implement, expose options for distance versus squared‑distance criteria, and consider providing weighted‑Ward extensions for users with heterogeneous observation importance. Analysts are advised to (i) align the choice of implementation with the scientific goal (e.g., strict variance minimization versus robustness to outliers), (ii) test multiple software packages on a subset of the data to assess the stability of the hierarchical structure, and (iii) be cautious when interpreting cluster solutions from high‑dimensional data, as implementation nuances can materially affect results. By clarifying the theoretical underpinnings and empirically demonstrating the consequences of software‑specific choices, the paper equips the community with the knowledge needed to use Ward’s method responsibly and to develop more consistent clustering tools in the future.
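Recommendation (ii) can be operationalized with a small consistency check. The sketch below (helper name `agreement` is illustrative) clusters the same data through two routes into SciPy's Ward implementation and scores pairwise label agreement with a Rand-index-style measure, which is insensitive to label renumbering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def agreement(labels_a, labels_b):
    """Rand index: fraction of point pairs on which two labelings agree
    about being in the same cluster vs. different clusters."""
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    mask = ~np.eye(len(labels_a), dtype=bool)
    return float(np.mean(same_a[mask] == same_b[mask]))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (15, 3)),
               rng.normal(3.0, 0.2, (15, 3))])

# Route 1: Ward directly on the observation matrix.
la = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
# Route 2: Ward on a precomputed condensed Euclidean distance matrix.
lb = fcluster(linkage(pdist(X), method="ward"), t=2, criterion="maxclust")

print(f"pairwise agreement: {agreement(la, lb):.3f}")
```

High agreement across routes (or across packages) supports the stability of the hierarchy; low agreement is the warning sign the authors flag, especially for high-dimensional data.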