Title: Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm
ArXiv ID: 1111.6285
Date: 2016-09-20
Authors: Not listed in the supplied excerpt.
📝 Abstract
The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. However there are different interpretations in the literature and there are different implementations of the Ward agglomerative algorithm in commonly used software systems, including differing expressions of the agglomerative criterion. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward's hierarchical clustering method.
📄 Full Content
In the literature and in software packages there is confusion in regard to what is termed the Ward hierarchical clustering method. This relates to any and possibly all of the following: (i) input dissimilarities, whether squared or not; (ii) output dendrogram heights and whether or not their square root is used; and (iii) a subtle but important difference that we have found in the loop structure of the stepwise dissimilarity-based agglomerative algorithm. Our main objective in this work is to raise awareness of these distinctions and differences, and to urge users of hierarchical clustering to check what their favorite software package is doing.
In R, the function hclust of stats with the method="ward" option produces results that correspond to a Ward method (Ward, 1963) described in terms of a Lance-Williams updating formula using a sum of dissimilarities, which produces updated dissimilarities. This is the implementation used by, for example, Wishart (1969), Murtagh (1985) on whose code the hclust implementation is based, Jain and Dubes (1988), Jambu (1989), in XploRe (2007), in Clustan (www.clustan.com), and elsewhere.
An important issue, though, is the form of input required to obtain Ward's method. For an input data matrix, x, R's hclust function requires the following command: hclust(dist(x)^2, method="ward"). In later sections of this article (in particular, section 3.2) we explain just why the squaring of the distances is a requirement for the Ward method. In section 4 (Experiment 4) we discuss why we may wish to take the square roots of the agglomeration, or dendrogram node height, values.
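The effect of squaring can be illustrated with a small sketch of the Lance-Williams recurrence for Ward's method (here in Python/NumPy rather than R, and not the hclust internals): running the same stepwise agglomeration on d and on d^2 yields different agglomeration levels. The function name `ward_heights` and the synthetic data are illustrative assumptions, not from the paper.

```python
import numpy as np

def ward_heights(D):
    """Stepwise agglomeration using the Lance-Williams update for Ward's
    method; D is a full symmetric dissimilarity matrix. Returns the
    sequence of agglomeration levels (dendrogram node heights)."""
    D = D.astype(float).copy()
    n = D.shape[0]
    size = {i: 1 for i in range(n)}   # cluster cardinalities
    active = set(range(n))
    heights = []
    while len(active) > 1:
        # find the closest pair among the currently active clusters
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: D[p])
        heights.append(float(D[i, j]))
        # Lance-Williams update for Ward:
        # d(i∪j, k) = [(n_i+n_k) d(i,k) + (n_j+n_k) d(j,k) - n_k d(i,j)]
        #             / (n_i + n_j + n_k)
        for k in active - {i, j}:
            ni, nj, nk = size[i], size[j], size[k]
            D[i, k] = D[k, i] = ((ni + nk) * D[i, k] + (nj + nk) * D[j, k]
                                 - nk * D[i, j]) / (ni + nj + nk)
        size[i] += size[j]
        active.remove(j)
    return heights

# Same algorithm, same data: unsquared vs squared Euclidean input differ.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 2))
d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))  # Euclidean
h_unsquared = ward_heights(d)
h_squared = ward_heights(d ** 2)
print(h_unsquared)
print(h_squared)
```

With squared Euclidean input the recurrence implements the Ward criterion, and the resulting heights form a monotone sequence; with unsquared input the same code computes something formally similar but different.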
In R, the agnes function of cluster with the method=“ward” option is also presented as the Ward method in Kaufman and Rousseeuw (1990), Legendre and Legendre (2012), among others. A formally similar algorithm is used, based on the Lance and Williams (1967) recurrence. Lance and Williams (1967) did not themselves consider the Ward method, which instead was first investigated by Wishart (1969).
What is at issue for us here starts with how hclust and agnes give different outputs when applied to the same dissimilarity matrix as input. What therefore explains the formal similarity in terms of criterion and algorithms, yet at the same time yields outputs that are different?

2 Ward's Agglomerative Hierarchical Clustering Method
We recall that a distance is a positive, definite, symmetric mapping of a pair of observation vectors onto the positive reals which in addition satisfies the triangular inequality. For observations i, j, k we have: d(i, j) > 0; d(i, j) = 0 ⇐⇒ i = j; d(i, j) = d(j, i); d(i, j) ≤ d(i, k) + d(k, j). For an observation set, I, with i, j, k ∈ I, we can write the distance as a mapping from the Cartesian product of the observation set into the positive reals: d : I × I → ℝ⁺.
A dissimilarity is usually taken as a distance but without the triangular inequality (d(i, j) ≤ d(i, k) + d(k, j), ∀i, j, k). Lance and Williams, 1967, use the term “an (i, j)-measure” for a dissimilarity.
An ultrametric, or tree distance, which defines a hierarchical clustering (and also an ultrametric topology, which goes beyond a metric geometry, or a p-adic number system) differs from a distance in that the strong triangular inequality is instead satisfied. This inequality, also commonly called the ultrametric inequality, is: d(i, j) ≤ max{d(i, k), d(k, j)}, ∀i, j, k.
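The distinction between the two inequalities can be checked mechanically. A minimal sketch (the function names `is_metric` and `is_ultrametric` and the example matrix are illustrative, not from the paper):

```python
import numpy as np
from itertools import permutations

def is_metric(D, tol=1e-12):
    """Check the ordinary triangular inequality d(i,j) <= d(i,k) + d(k,j)."""
    n = D.shape[0]
    return all(D[i, j] <= D[i, k] + D[k, j] + tol
               for i, j, k in permutations(range(n), 3))

def is_ultrametric(D, tol=1e-12):
    """Check the strong (ultrametric) inequality d(i,j) <= max(d(i,k), d(k,j))."""
    n = D.shape[0]
    return all(D[i, j] <= max(D[i, k], D[k, j]) + tol
               for i, j, k in permutations(range(n), 3))

# Three equally spaced points on a line: a metric, but not an ultrametric,
# since d(0,2) = 2 > max(d(0,1), d(1,2)) = 1.
D = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])
print(is_metric(D), is_ultrametric(D))   # → True False
```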
For observations i in a cluster q, and a distance d (which can potentially be relaxed to a dissimilarity) we have the following definitions. We may want to consider a mass or weight associated with observation i, p(i). Typically we take p(i) = 1/|q| when i ∈ q, i.e. 1 over cluster cardinality of the relevant cluster.
With the context being clear, let q denote the cluster (a set) as well as the cluster's center. We have this center defined as q = (1/|q|) Σ_{i∈q} i. Furthermore, and again where the context makes this clear, i is used both for the observation label, or index, among all observations, and for the observation vector.
Some further definitions follow.
Error sum of squares: Σ_{i∈q} d²(i, q).
Variance (or centered sum of squares): (1/|q|) Σ_{i∈q} d²(i, q).
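These two definitions are direct to compute. A short NumPy sketch, assuming Euclidean d and the example cluster below (the helper names are illustrative):

```python
import numpy as np

def cluster_center(x):
    """Center of cluster q: q = (1/|q|) Σ_{i∈q} i."""
    return x.mean(axis=0)

def error_sum_of_squares(x):
    """Σ_{i∈q} d²(i, q), with d the Euclidean distance to the center."""
    c = cluster_center(x)
    return float(((x - c) ** 2).sum())

def cluster_variance(x):
    """Centered sum of squares divided by the cluster cardinality |q|."""
    return error_sum_of_squares(x) / len(x)

# |q| = 4, center (1, 1); each point is at squared distance 2 from it.
q = np.array([[0., 0.], [2., 0.], [0., 2.], [2., 2.]])
print(error_sum_of_squares(q))   # → 8.0
print(cluster_variance(q))       # → 2.0
```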
Consider now a set of masses, or weights, m_i for observations i. Following Benzécri (1976, p. 185), the centered moment of order 2, M₂(I), of the cloud (or set) of observations i, i ∈ I, is written: M₂(I) = Σ_{i∈I} m_i ‖i − g‖², where the center of gravity of the system is g = (1/m_I) Σ_{i∈I} m_i i, and where m_I = Σ_{i∈I} m_i is the total mass of the cloud. Due to Huygens' theorem the following can be shown (Benzécri, 1976, p. 186) for clusters q whose union makes up the partition, Q:

M₂(Q) = Σ_{q∈Q} m_q ‖q − g‖²
M₂(I) = M₂(Q) + Σ_{q∈Q} M₂(q)
V(I) = V(Q) + Σ_{q∈Q} V(q), where each V(·) = M₂(·)/m_I is the corresponding moment normalized by the total mass.

The V(Q) and V(I) definitions here are discussed in Jambu (1978, pp. 154-155). The last of the above can be seen to decompose (additively) the total variance of the cloud I into (first term on the right hand side) the variance of the cloud of cluster centers (q ∈ Q), and the summed variances of the clusters. We can consider this last relation as: T = B + W, where B is the between-cluster variance (that of the cloud of cluster centers) and W is the summed within-cluster variance.
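The decomposition T = B + W can be verified numerically. A sketch with unit masses (m_i = 1), synthetic data, and an arbitrary partition, none of which come from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((12, 3))            # the cloud I, unit masses assumed
labels = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])  # a partition Q

g = x.mean(axis=0)                          # center of gravity of I
T = ((x - g) ** 2).sum()                    # total: M2(I) with m_i = 1

B = 0.0                                     # between: moment of cluster centers
W = 0.0                                     # within: summed cluster moments
for q in np.unique(labels):
    xq = x[labels == q]
    cq = xq.mean(axis=0)
    B += len(xq) * ((cq - g) ** 2).sum()    # m_q * ||q - g||^2
    W += ((xq - cq) ** 2).sum()             # M2(q)

print(np.isclose(T, B + W))                 # → True
```

This is the same additive total = between + within identity familiar from one-way analysis of variance, which is what makes the Ward criterion a within-cluster variance criterion.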