Versatile linkage: a family of space-conserving strategies for agglomerative hierarchical clustering

Agglomerative hierarchical clustering can be implemented with several strategies that differ in the way elements of a collection are grouped together to build a hierarchy of clusters. Here we introduce versatile linkage, a new infinite system of aggl…

Authors: Alberto Fernández, Sergio Gómez

Alberto Fernández (1) and Sergio Gómez (2)

(1) Departament d'Enginyeria Química, Universitat Rovira i Virgili, 43007 Tarragona, Spain. alberto.fernandez@urv.cat
(2) Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, 43007 Tarragona, Spain. sergio.gomez@urv.cat

Abstract

Agglomerative hierarchical clustering can be implemented with several strategies that differ in the way elements of a collection are grouped together to build a hierarchy of clusters. Here we introduce versatile linkage, a new infinite system of agglomerative hierarchical clustering strategies based on generalized means, which go from single linkage to complete linkage, passing through arithmetic average linkage and other clustering methods yet unexplored such as geometric linkage and harmonic linkage. We compare the different clustering strategies in terms of cophenetic correlation, mean absolute error, and also tree balance and space distortion, two new measures proposed to describe hierarchical trees. Unlike the β-flexible clustering system, we show that the versatile linkage family is space-conserving.

1 Introduction

Agglomerative hierarchical clustering constitutes one of the most widely used methods for cluster analysis. Starting with a matrix of dissimilarities between a set of elements, each element is first assigned to its own cluster, and the algorithm sequentially merges the most similar clusters until a complete hierarchy of clusters is obtained [22, 10]. This method requires the definition of the dissimilarity (or distance) between clusters, using only the original distances between their constituent elements. The way these distances are defined leads to distinct strategies of agglomerative hierarchical clustering.
To name just two such clustering strategies, single linkage usually leads to an elongated growth of clusters, while complete linkage generally leads to tight clusters that join others with difficulty. Average linkage clustering strategies were developed by Sokal and Michener to avoid the extreme cases produced by single linkage and complete linkage [23]. They require the calculation of some kind of average distance between clusters; average linkage, for instance, calculates the arithmetic average of all the distances between members of the clusters.

More than fifty years ago, Lance and Williams introduced a formula for integrating several agglomerative hierarchical clustering strategies into a single system [13]. Based on this formula they proposed β-flexible clustering [14], a generalized clustering procedure that provides an infinite number of hierarchical clustering strategies just by varying a parameter β. Similarly, in this work we introduce versatile linkage, a new parameterized family of agglomerative hierarchical clustering strategies that go from single linkage to complete linkage, passing through arithmetic average linkage and other clustering strategies yet unexplored such as geometric linkage and harmonic linkage.

Both β-flexible clustering and versatile linkage are presented here using variable-group methods [23, 7] that, unlike pair-group methods, admit any number of new members simultaneously into groups. In the case of pair-group methods the resulting hierarchical tree is called a dendrogram, which is built upon bifurcations, while in the case of variable-group methods the resulting hierarchical tree is called a multidendrogram [7], which consists of multifurcations, not necessarily binary ones.
Here we use the variable-group algorithm introduced in [7] that solves the non-uniqueness problem, also called the ties in proximity problem, found in pair-group algorithms [22, 11, 5]. This problem arises when there are more than two clusters separated by the same minimum distance during the agglomerative process. Pair-group algorithms break ties between distances by choosing a pair of clusters, usually at random. However, different output dendrograms are possible depending on the criterion used to break ties. Moreover, results very frequently depend on the order of the elements in the input data file, which is an undesired effect in hierarchical clustering except for the case of contiguity-constrained hierarchical clustering, which is used to obtain a hierarchical clustering that takes into account the ordering of the input elements. The variable-group algorithm used here always gives a uniquely determined solution, grouping more than two clusters at the same time when ties occur, and when there are no ties it gives the same results as the pair-group algorithm.

Section 2 reviews the β-flexible family of hierarchical clustering strategies, while Section 3 introduces the versatile linkage family. Four case studies are used in Section 4 to perform a descriptive analysis of different hierarchical clustering strategies in terms of cophenetic correlation, mean absolute error, and the proposed new measures of space distortion and tree balance. Finally, some concluding remarks are given in Section 5.

2 β-Flexible Clustering

In any procedure implementing an agglomerative hierarchical clustering strategy, given a set of individuals $\Omega = \{x_1, x_2, \ldots, x_n\}$, initially each individual forms a singleton cluster, $\{x_i\}$, and the distances $D(\{x_i\}, \{x_j\})$ between singleton clusters are equal to the dissimilarities between individuals, $d(x_i, x_j)$.
During the subsequent iterations of the procedure, the distances $D(X_I, X_J)$ are computed between any two clusters $X_I = \bigcup_{i \in I} X_i$ and $X_J = \bigcup_{j \in J} X_j$, each one of them made up of several subclusters $X_i$ and $X_j$ indexed by $I = \{i_1, i_2, \ldots, i_p\}$ and $J = \{j_1, j_2, \ldots, j_q\}$, respectively.

Lance and Williams introduced a formula for integrating several agglomerative hierarchical clustering strategies into a single system [13]. The variable-group generalization of Lance and Williams' formula, compatible with the fusion of more than two clusters simultaneously, is:

$$ D(X_I, X_J) = \sum_{i \in I} \sum_{j \in J} \alpha_{ij} D(X_i, X_j) + \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} \beta_{ii'} D(X_i, X_{i'}) + \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} \beta_{jj'} D(X_j, X_{j'}), \tag{1} $$

where the values of the parameters $\alpha_{ij}$, $\beta_{ii'}$ and $\beta_{jj'}$ determine the nature of the clustering strategy [7]. This formula is combinatorial [14], i.e., the distance $D(X_I, X_J)$ can be calculated from the distances $D(X_i, X_j)$, $D(X_i, X_{i'})$ and $D(X_j, X_{j'})$ obtained from the previous iteration, and it is not necessary to keep the initial distance matrix $d(x_i, x_j)$ during the whole clustering process.

Based on Equation 1, Lance and Williams [14] proposed an infinite system of agglomerative hierarchical clustering strategies defined by the constraint

$$ \underbrace{\sum_{i \in I} \sum_{j \in J} \alpha_{ij}}_{\alpha} + \underbrace{\sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} \beta_{ii'} + \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} \beta_{jj'}}_{\beta} = 1, \tag{2} $$

where $-1 \leq \beta \leq +1$ generates a whole system of hierarchical clustering strategies for the infinite possible values of β.
Given a value of β, the value for $\alpha_{ij}$ can be assigned following a weighted approach, as in the original β-flexible clustering based on WPGMA (weighted pair-group method using arithmetic mean) introduced by Lance and Williams [13], or following an unweighted approach, as in the β-flexible clustering based on UPGMA (unweighted pair-group method using arithmetic mean) introduced by Belbin et al. [2]. The standard WPGMA and UPGMA strategies are obtained from weighted and unweighted β-flexible clustering, respectively, when β is set equal to 0. The difference between weighted and unweighted methods lies in the weights assigned to individuals and clusters during the agglomerative process: weighted methods assign equal weights to clusters, while unweighted methods assign equal weights to individuals.

In unweighted β-flexible clustering the value for $\alpha_{ij}$ is determined proportionally to $|X_i||X_j|$:

$$ \alpha_{ij} = \frac{|X_i||X_j|}{|X_I||X_J|} (1 - \beta), \tag{3} $$

where $|X_i|$ and $|X_j|$ are the number of individuals in subclusters $X_i$ and $X_j$, respectively, and $|X_I|$ and $|X_J|$ are the number of individuals in clusters $X_I$ and $X_J$, i.e., $|X_I| = \sum_{i \in I} |X_i|$ and $|X_J| = \sum_{j \in J} |X_j|$. In a similar way, the value for $\beta_{ii'}$ is calculated proportionally to $|X_i||X_{i'}|$, and the value for $\beta_{jj'}$ proportionally to $|X_j||X_{j'}|$:

$$ \beta_{ii'} = \frac{|X_i||X_{i'}|}{\sigma_I + \sigma_J} \beta, \tag{4} $$

$$ \sigma_I = \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} |X_i||X_{i'}| = \frac{1}{2} \left( |X_I|^2 - \sum_{i \in I} |X_i|^2 \right). \tag{5} $$

The corresponding values for weighted β-flexible clustering are:

$$ \alpha_{ij} = \frac{1}{|I||J|} (1 - \beta), \tag{6} $$

$$ \beta_{ii'} = \frac{1}{\sigma_I + \sigma_J} \beta, \tag{7} $$

$$ \sigma_I = \frac{|I|(|I| - 1)}{2} = \frac{|I|^2 - |I|}{2}, \tag{8} $$

where $|I|$ and $|J|$ are the number of subclusters contained in clusters $X_I$ and $X_J$, respectively. These formulas derive from the unweighted ones when we take $|X_i| = 1$, $\forall i \in I$, and $|X_j| = 1$, $\forall j \in J$.
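As an illustration of Equations 1 to 5, the following Python sketch computes the unweighted coefficients and checks that they satisfy the constraint of Equation 2. The function names (`sigma`, `unweighted_coefficients`, `flexible_distance`) and the dictionary layout are assumptions made for this example only; they are not taken from any existing software. The guard for the case in which no within-cluster pairs exist is also an assumption.

```python
from itertools import combinations

def sigma(sizes):
    """Equation 5: sum of |X_i|*|X_i'| over unordered pairs of subclusters."""
    return sum(a * b for a, b in combinations(sizes, 2))

def unweighted_coefficients(sizes_I, sizes_J, beta):
    """Coefficients of Equations 3 and 4 (unweighted beta-flexible clustering).

    sizes_I, sizes_J: numbers of individuals in the subclusters of X_I and X_J.
    Returns (alpha, beta_I, beta_J), each a dict keyed by subcluster indices.
    """
    n_I, n_J = sum(sizes_I), sum(sizes_J)
    s = sigma(sizes_I) + sigma(sizes_J)
    alpha = {(i, j): a * b * (1 - beta) / (n_I * n_J)
             for i, a in enumerate(sizes_I)
             for j, b in enumerate(sizes_J)}

    def within(sizes):
        # Assumption: when neither cluster has two subclusters there are
        # no within-cluster terms, so no beta coefficients are produced.
        if s == 0:
            return {}
        return {(i, k): sizes[i] * sizes[k] * beta / s
                for i, k in combinations(range(len(sizes)), 2)}

    return alpha, within(sizes_I), within(sizes_J)

def flexible_distance(D_cross, D_in_I, D_in_J, sizes_I, sizes_J, beta):
    """Equation 1: D(X_I, X_J) from the distances of the previous iteration."""
    alpha, b_I, b_J = unweighted_coefficients(sizes_I, sizes_J, beta)
    return (sum(alpha[k] * D_cross[k] for k in alpha)
            + sum(b_I[k] * D_in_I[k] for k in b_I)
            + sum(b_J[k] * D_in_J[k] for k in b_J))

# X_I is made of subclusters of sizes 2 and 1; X_J is a single individual.
alpha, b_I, b_J = unweighted_coefficients([2, 1], [1], beta=-0.25)
total = sum(alpha.values()) + sum(b_I.values()) + sum(b_J.values())
print(total)  # the constraint of Equation 2 requires this sum to equal 1
```

With `beta=0.0` the within-cluster terms vanish and `flexible_distance` reduces to a size-weighted arithmetic average of the cross distances, which is the UPGMA case mentioned above.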
3 Versatile Linkage

Arithmetic average linkage clustering iteratively forms clusters made up of previously formed subclusters, based on the arithmetic mean distances between their member individuals; for simplicity and to avoid confusion, we will denote it arithmetic linkage instead of the standard term average linkage. Substituting the arithmetic means by generalized means, also known as power means, this clustering strategy can be extended to any finite power $p \neq 0$:

$$ D_p(X_I, X_J) = \left( \frac{1}{|X_I||X_J|} \sum_{x \in X_I} \sum_{y \in X_J} [d(x, y)]^p \right)^{1/p} = \left( \frac{1}{|X_I||X_J|} \sum_{i \in I} \sum_{j \in J} |X_i||X_j| [D_p(X_i, X_j)]^p \right)^{1/p}. \tag{9} $$

We call this new system of agglomerative hierarchical clustering strategies versatile linkage. As in the case of β-flexible clustering, versatile linkage provides a way of obtaining an infinite number of clustering strategies from a single formula. The second equality in Equation 9 shows that versatile linkage can be calculated using a combinatorial formula, from the distances $D_p(X_i, X_j)$ obtained during the previous iteration, in the same way as Lance and Williams' recurrence formula given in Equation 1.

The decision of what power $p$ to use could be taken in agreement with the type of distance employed to measure the initial dissimilarities between individuals. For instance, if the initial dissimilarities were calculated using a generalized distance of order $p$, then the natural agglomerative clustering strategy would be versatile linkage with the same power $p$. However, this procedure does not guarantee that the dendrogram obtained is the best according to other criteria, e.g., cophenetic correlation, mean absolute error, space distortion or tree balance (see Section 4).
A better approach consists in scanning the whole range of parameters $p$, calculating the preferred descriptors of the corresponding dendrograms, and deciding whether it is better to substitute the natural parameter $p$ by another one. This is especially important when only the dissimilarities between individuals are available, without coordinates for the individuals, as is common in multidimensional scaling problems, or when the dissimilarities have not been calculated using generalized means.

3.1 Particular Cases

The generalized mean contains several well-known particular cases, depending on the value of the power $p$, that deserve special attention. Some of them reduce versatile linkage to the most commonly used methods, while others emerge naturally as deserving further attention:

• In the limit when $p \to -\infty$, versatile linkage becomes single linkage (SL):

$$ D_{\min}(X_I, X_J) = \min_{x \in X_I} \min_{y \in X_J} d(x, y) = \min_{i \in I} \min_{j \in J} D_{\min}(X_i, X_j). \tag{10} $$

• In the limit when $p \to +\infty$, versatile linkage becomes complete linkage (CL):

$$ D_{\max}(X_I, X_J) = \max_{x \in X_I} \max_{y \in X_J} d(x, y) = \max_{i \in I} \max_{j \in J} D_{\max}(X_i, X_j). \tag{11} $$

There are also three other particular cases that can be grouped together as Pythagorean linkages:

• When $p = +1$, the generalized mean is equal to the arithmetic mean and arithmetic linkage (AL), i.e., the standard average linkage or UPGMA, is recovered.

• When $p = -1$, the generalized mean is equal to the harmonic mean and, therefore, harmonic linkage (HL) is obtained.

• In the limit when $p \to 0$, the generalized mean tends to the geometric mean. Hence, the distance definition for geometric linkage (GL) is:

$$ D_{\mathrm{geo}}(X_I, X_J) = \left( \prod_{x \in X_I} \prod_{y \in X_J} d(x, y) \right)^{1/(|X_I||X_J|)} = \left( \prod_{i \in I} \prod_{j \in J} [D_{\mathrm{geo}}(X_i, X_j)]^{|X_i||X_j|} \right)^{1/(|X_I||X_J|)}. \tag{12} $$

To show the effects of varying the power $p$ in versatile linkage clustering, we have built a small dataset with four individuals: Alice, Bob, Carol and Dave, who lie on a straight line, with consecutive individuals separated by distances equal to 7, 9 and 12 units, respectively. Table 1 gives the pairwise distances between the four individuals, and Figure 1 shows some multidendrograms obtained by varying the power $p$ in versatile linkage clustering.

Table 1: Sample pairwise distances between four individuals.

|       | Alice | Bob | Carol | Dave |
| Alice | 0     | 7   | 16    | 28   |
| Bob   |       | 0   | 9     | 21   |
| Carol |       |     | 0     | 12   |
| Dave  |       |     |       | 0    |

Alice and Bob are always grouped together forming the first binary cluster, at a distance equal to 7.00. For values of the exponent $p \in (-\infty, 0)$, the Alice-Bob cluster is joined with Carol's singleton cluster at distances that range between 9.00 and 12.00. More precisely, this distance takes values 9.00 for SL ($p \to -\infty$), 11.52 for HL ($p = -1$) and 12.00 when we approach GL ($p \to 0^-$). For larger values of the exponent, $p > 0$, this distance becomes larger than 12.00, thus Carol joins instead in a cluster with Dave at their distance 12.00. The remaining merge for $p \in (-\infty, 0)$, which joins the Alice-Bob-Carol cluster with Dave, happens at heights 12.00 (SL), 18.00 (HL) and 19.18 ($p \to 0^-$), respectively. For the range $p \in (0, +\infty)$, the clusters Alice-Bob and Carol-Dave join at heights 17.06 ($p \to 0^+$), 18.50 (AL) and 28.00 (CL), respectively.

GL ($p = 0$) lies between these two structurally different dendrograms, represented as "(((Alice,Bob),Carol),Dave)" and "((Alice,Bob),(Carol,Dave))". Using pair-group agglomerative clustering methods, we would assign one of these two possible dendrograms to GL, thus breaking the tied pairs (Alice,Bob)-Carol and Carol-Dave (both at distance 12.00) randomly; this is an example of the ties in proximity (non-uniqueness) problem mentioned above.
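The merge heights quoted in this example can be recomputed directly from the first form of Equation 9 and its limits in Equations 10 to 12. The sketch below is only an illustration: the helper names (`dist`, `power_mean`, `versatile_linkage`) are hypothetical, and it assumes strictly positive distances so that negative powers and logarithms are defined.

```python
import math

# Pairwise distances from Table 1.
d = {("Alice", "Bob"): 7, ("Alice", "Carol"): 16, ("Alice", "Dave"): 28,
     ("Bob", "Carol"): 9, ("Bob", "Dave"): 21, ("Carol", "Dave"): 12}

def dist(x, y):
    """Symmetric lookup of the initial dissimilarity d(x, y)."""
    return d[(x, y)] if (x, y) in d else d[(y, x)]

def power_mean(values, p):
    """Generalized mean of positive values, including the limits of p."""
    values = list(values)
    if p == -math.inf:
        return min(values)          # single linkage limit
    if p == math.inf:
        return max(values)          # complete linkage limit
    if p == 0:
        # geometric mean, i.e. the limit p -> 0
        return math.exp(sum(math.log(v) for v in values) / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)

def versatile_linkage(X_I, X_J, p):
    """First form of Equation 9: power mean over all pairs of individuals."""
    return power_mean((dist(x, y) for x in X_I for y in X_J), p)

AB, C, D = ("Alice", "Bob"), ("Carol",), ("Dave",)
for name, p in [("SL", -math.inf), ("HL", -1), ("GL", 0),
                ("AL", 1), ("CL", math.inf)]:
    print(name,
          round(versatile_linkage(AB, C, p), 2),      # (Alice,Bob) vs Carol
          round(versatile_linkage(AB, C + D, p), 2))  # (Alice,Bob) vs (Carol,Dave)
```

For instance, `versatile_linkage(AB, C, -1)` gives the harmonic mean of 16 and 9, which is 11.52, and `versatile_linkage(AB, D, 0)` gives the geometric mean of 28 and 21, which is approximately 24.25.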
With the variable-group approach [7], we join them at once forming the multidendrogram "((Alice,Bob),Carol,Dave)", where the three clusters join at distance 12.00, with a band going up to distance 24.25 to represent the heterogeneity of the new cluster, 24.25 being the distance between the clusters (Alice,Bob) and Dave (see middle multidendrogram in Figure 1). This simple example shows the ability of versatile linkage to cover structurally different hierarchical clustering structures, including at the same time the traditionally important methods of SL, AL and CL.

[Figure 1: five multidendrograms, for $p \to -\infty$ (SL), $p = -1$ (HL), $p = 0$ (GL), $p = +1$ (AL) and $p \to +\infty$ (CL).]

Figure 1: Effects of varying the power $p$ in versatile linkage clustering for the sample distances in Table 1. Computations performed using the MultiDendrograms software [9] (see footnotes 3 and 4), with the precision parameter equal to 2 significant decimal digits. When $p = 0$ (GL), the gray band shows the existence of a tie between distances.

Footnote 3: MultiDendrograms: http://deim.urv.cat/~sergio.gomez/multidendrograms.php

Footnote 4: In MultiDendrograms, to avoid the infinite range of the exponent $p$, a sigmoidal transformation is performed such that the parameter used is within the range [−1.0, +1.0], with values −1.0, −0.1, 0.0, +0.1 and +1.0 representing SL, HL, GL, AL and CL, respectively.

3.2 Weighted Versatile Linkage

Weighted clustering was introduced by Sokal and Michener [23] in an attempt to give merging branches in a hierarchical tree equal weight regardless of the number of individuals carried on each branch. Such a procedure weights the individuals unequally, contrasting with unweighted clustering, which gives equal weight to each individual in the clusters.

In weighted versatile linkage strategies, the distance between two clusters $X_I$ and $X_J$ is calculated by taking the generalized mean of the pairwise distances, not between individuals in the initial distance matrix, but between component subclusters in the matrix used during the previous iteration of the procedure. Equation 9 is thus replaced by:

$$ D_p(X_I, X_J) = \left( \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} [D_p(X_i, X_j)]^p \right)^{1/p}. \tag{13} $$

3.3 Absence of Inversions

Versatile linkage strategies are monotonic, that is, they do not produce inversions. An inversion or reversal appears in a hierarchy when the hierarchy contains two clusters $X$ and $Y$ for which $X \subset Y$ but the height of cluster $X$ is higher than the height of cluster $Y$ [18, 17]. Inversions make hierarchies difficult to interpret, especially if they occur during the last stages of the agglomeration process.

The monotonicity of versatile linkage strategies is explained by the Pythagorean means inequality,

$$ \min \leq \mathrm{HM} \leq \mathrm{GM} \leq \mathrm{AM} \leq \max, \tag{14} $$

where HM stands for the harmonic mean, GM for the geometric mean, and AM for the arithmetic mean. In the general case given by Equations 9 and 13, the generalized mean inequality holds:

$$ D_p(X_I, X_J) \leq D_q(X_I, X_J), \quad \forall p < q, \tag{15} $$

and $D_p(X_I, X_J) = D_q(X_I, X_J)$ if, and only if, the initial distances $d(x, y)$ are equal $\forall x \in X_I$ and $\forall y \in X_J$.

Supposing that at a certain step of the clustering procedure the minimum distance between any two subclusters still to be merged is equal to $\delta$, then the distance $D(X_i, X_j)$ between any two subclusters to be included in different clusters, $X_i \subseteq X_I$ and $X_j \subseteq X_J$, will be necessarily greater than $\delta$, otherwise subclusters $X_i$ and $X_j$ would be merged into the same cluster. In particular, $D_{\min}(X_I, X_J) > \delta$.
Therefore, taking into account the generalized mean inequality in Equation 15, and given that in the limit when $p \to -\infty$ we have $D_p(X_I, X_J) = D_{\min}(X_I, X_J)$, we can conclude that $D_p(X_I, X_J) > \delta$, $\forall p$, which proves the absence of inversions of versatile linkage strategies.

4 Descriptive Analysis of Hierarchical Trees

We have selected four case studies, drawn from the UCI Machine Learning Repository [15], for a descriptive analysis of several agglomerative hierarchical clustering strategies. Table 2 summarizes the main characteristics of these datasets. The values of the variables in these datasets show different orders of magnitude; therefore, all the variables have been scaled first, and then the corresponding dissimilarity matrices have been built using the Euclidean distance between all pairs of individuals.

Table 2: Characteristics of the selected datasets.

| Dataset            | Instances | Features |
| Breast tissue [12] | 106       | 9        |
| Iris [8]           | 150       | 4        |
| Wine [1]           | 178       | 13       |
| Parkinsons [16]    | 195       | 22       |

For the comparison of the hierarchical clustering strategies, we have chosen the following methods: β-flexible with β = +0.9, to avoid the completely flat hierarchical trees obtained with β = +1; versatile linkage with $p \to -\infty$, i.e., SL; centroid method; versatile linkage with $p = -1$, i.e., HL; versatile linkage with $p \to 0$, i.e., GL; versatile linkage with $p = +1$, which is the same as β-flexible with β = 0, i.e., AL; versatile linkage with $p \to +\infty$, i.e., CL; Ward's minimum variance method [25]; and β-flexible with β = −1. This selection includes five variants of versatile linkage, three of them equivalent to traditional methods (SL, AL and CL) and the other two introduced in this work (HL and GL), and three variants of β-flexible clustering, one of them equivalent to AL. Weighted and unweighted versions of the hierarchical clustering strategies have been used.
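To make the difference between the weighted and unweighted versions concrete, the sketch below contrasts the combinatorial update of Equation 9 (second form) with the weighted update of Equation 13 on the four-individual example of Table 1. The function name `combine` and the list-of-lists layout for subcluster distances are assumptions made for this illustration, and distances are assumed to be strictly positive.

```python
import math

def combine(D_sub, sizes_I, sizes_J, p, weighted=False):
    """Distance between X_I and X_J from subcluster distances D_sub[i][j].

    Unweighted (Equation 9, second form): pair (i, j) weighs |X_i|*|X_j|.
    Weighted (Equation 13): every pair of subclusters weighs the same.
    """
    pairs = [(i, j) for i in range(len(sizes_I)) for j in range(len(sizes_J))]
    values = [D_sub[i][j] for i, j in pairs]
    if p == -math.inf:
        return min(values)   # SL: the weights play no role
    if p == math.inf:
        return max(values)   # CL: the weights play no role
    weights = [1 if weighted else sizes_I[i] * sizes_J[j] for i, j in pairs]
    total = sum(weights)
    if p == 0:
        # geometric mean (limit p -> 0)
        return math.exp(sum(w * math.log(v)
                            for w, v in zip(weights, values)) / total)
    return (sum(w * v ** p for w, v in zip(weights, values)) / total) ** (1 / p)

# X_I = {(Alice,Bob), Carol} and X_J = {Dave}. For harmonic linkage (p = -1)
# the previous-iteration distances are D((Alice,Bob), Dave) = 24 (harmonic
# mean of 28 and 21) and D(Carol, Dave) = 12.
D_sub = [[24.0], [12.0]]
unweighted = combine(D_sub, [2, 1], [1], p=-1)
weighted = combine(D_sub, [2, 1], [1], p=-1, weighted=True)
print(round(unweighted, 2), round(weighted, 2))

# For fixed subcluster distances the generalized mean does not decrease
# with p, which is the inequality of Equation 15.
heights = [combine(D_sub, [2, 1], [1], p)
           for p in (-math.inf, -1, 0, 1, math.inf)]
```

The unweighted value is 18, which agrees with the harmonic mean of the three individual distances 28, 21 and 12, whereas the weighted value is 16 because the two branches count equally regardless of their sizes.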
Although weighting has no effect on SL and CL, we have included both of them for visual convenience in all the figures depicted next. The software used to run these experiments is MultiDendrograms [9], which from version 5.0 implements all the hierarchical clustering strategies analyzed here and also computes the necessary descriptive measures.

4.1 Cophenetic Correlation

The cophenetic correlation coefficient (CCC) measures the similarity between the distances in the initial matrix and the distances in the final ultrametric matrix obtained as a result of a hierarchical clustering procedure [24]. The ultrametric distance between two individuals is represented in a dendrogram by the height at which those two individuals are first joined. The CCC is calculated as the Pearson correlation coefficient between both matrices of distances; thus, the closer to 1, the greater their similarity.

In the analysis shown in Figure 2, the CCC is higher for Pythagorean linkages (i.e., HL, GL and AL), and the unweighted clustering strategies generally perform better than the weighted ones, corroborating the empirical observation already stated by Sneath and Sokal [22]. In the case of the almost flat hierarchical trees obtained with β-flexible clustering when β = +1, the CCC is very close to 0.

[Figure 2: four panels (Breast tissue, Iris, Wine, Parkinsons). Horizontal axis: clustering strategy (β = +0.9, SL, Centroid, HL, GL, AL, CL, Ward, β = −1.0), with unweighted and weighted series. Vertical axis: CCC, from 0.0 to 1.0.]

Figure 2: Cophenetic correlation coefficient (CCC). Weighted and unweighted versions of the clustering strategies are compared.
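The definition of the CCC can be tried on the small example of Table 1. The sketch below is a minimal illustration with hypothetical helper names (`pearson`, `cophenetic_correlation`); the ultrametric distances are those of the AL dendrogram described in Section 3.1, where Alice and Bob join at 7.00, Carol and Dave at 12.00, and both clusters at 18.50.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cophenetic_correlation(d, u, items):
    """CCC between initial distances d and ultrametric distances u.

    Both d and u map ordered pairs (x, y), with x before y in `items`,
    to a distance.
    """
    pairs = list(combinations(items, 2))
    return pearson([d[pair] for pair in pairs], [u[pair] for pair in pairs])

items = ["Alice", "Bob", "Carol", "Dave"]
d = {("Alice", "Bob"): 7, ("Alice", "Carol"): 16, ("Alice", "Dave"): 28,
     ("Bob", "Carol"): 9, ("Bob", "Dave"): 21, ("Carol", "Dave"): 12}
# Ultrametric distances of the AL dendrogram ((Alice,Bob),(Carol,Dave)).
u_al = {("Alice", "Bob"): 7.0, ("Carol", "Dave"): 12.0,
        ("Alice", "Carol"): 18.5, ("Alice", "Dave"): 18.5,
        ("Bob", "Carol"): 18.5, ("Bob", "Dave"): 18.5}

ccc_al = cophenetic_correlation(d, u_al, items)
print(round(ccc_al, 2))
```

Only the upper triangle of each matrix is used, since both matrices are symmetric with a zero diagonal.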
4.2 Mean Absolute Error

The CCC is a bounded measure that does not take into account how different the magnitudes of the distances in the initial matrix are from the distances in the final ultrametric matrix. For this reason, in Figure 3 we show the normalized mean absolute error (MAE), which takes into account this type of difference. Note that in the case of the Iris dataset, Ward's method and β-flexible clustering with β = −1 showed a very good CCC in Figure 2, while their MAE values in Figure 3 are the worst ones. As a matter of fact, β-flexible clustering with β = −1 yields results orders of magnitude worse than all the other methods, for the four datasets shown in Figure 3. The best results are obtained again with Pythagorean linkages, and unweighted clustering strategies are slightly better than the weighted ones.

[Figure 3: four panels (Breast tissue, Iris, Wine, Parkinsons). Horizontal axis: clustering strategy (β = +0.9, SL, Centroid, HL, GL, AL, CL, Ward, β = −1.0), with unweighted and weighted series. Vertical axis: MAE, logarithmic scale from 10^−2 to 10^+4.]

Figure 3: Normalized mean absolute error (MAE), in logarithmic scale. Weighted and unweighted versions of the clustering strategies are compared.

4.3 Space Distortion

For any agglomerative hierarchical clustering strategy, the initial distances between individuals may be regarded as defining a space with known properties [14]. When clusters begin to form, if the new distances between clusters are kept within the limits of the same space, then the original model remains unchanged and the clustering strategy is referred to as space-conserving.
Otherwise, the clustering strategy is referred to as space-distorting. According to the formalization of the concept of space distortion [6], a clustering strategy is said to be space-conserving if

$$ \min_{i \in I} \min_{j \in J} D(X_i, X_j) \leq D(X_I, X_J) \leq \max_{i \in I} \max_{j \in J} D(X_i, X_j). \tag{16} $$

On the contrary, a clustering strategy is space-contracting if the left inequality, delimited by SL, is not satisfied; and a clustering strategy is space-dilating if the right inequality, delimited by CL, is not satisfied. For space-contracting clustering strategies, as clusters grow in size, they move closer to other clusters. This effect is called chaining and it refers to the successive addition of elements to an ever expanding single cluster [14]. Space-dilating clustering strategies produce the opposite effect, i.e., clusters moving further away from other clusters as they grow in size.

To numerically assess space distortion, we propose a space distortion ratio (SDR) measure, calculated as the quotient between the range of final ultrametric distances, $u(x_i, x_j)$, and the range of initial distances, $d(x_i, x_j)$:

$$ \mathrm{SDR}(u, d) = \frac{\max u(x_i, x_j) - \min u(x_i, x_j)}{\max d(x_i, x_j) - \min d(x_i, x_j)}. \tag{17} $$

The SDR is equal to 1 for CL, thus this value separates space-conserving hierarchical trees from space-dilating ones. Figure 4 shows the SDR values corresponding to our four case studies. The outstanding differences between initial distances and ultrametric distances in the case of Ward's method and β-flexible clustering with β = −1, already observed in Figure 3, allow the classification of both hierarchical clustering methods as space-dilating. With regard to weighting, it cannot be stated that either weighted or unweighted clustering strategies produce more space distortion: it depends on the particular dataset.
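As a small numerical check of Equations 16 and 17, the sketch below evaluates the SDR for the SL and CL trees of the four-individual example in Table 1. The function names and the dictionary layout are assumptions made for this illustration; the ultrametric distances are read off the merge heights given in Section 3.1.

```python
def space_distortion_ratio(u, d):
    """Equation 17: range of ultrametric distances over range of initial ones.

    u and d map pairs of individuals to distances.
    """
    return ((max(u.values()) - min(u.values()))
            / (max(d.values()) - min(d.values())))

def is_space_conserving(D_new, D_sub):
    """Equation 16: D(X_I, X_J) lies between the smallest and largest
    distances among the component subclusters."""
    return min(D_sub) <= D_new <= max(D_sub)

d = {("Alice", "Bob"): 7, ("Alice", "Carol"): 16, ("Alice", "Dave"): 28,
     ("Bob", "Carol"): 9, ("Bob", "Dave"): 21, ("Carol", "Dave"): 12}

# SL tree (((Alice,Bob),Carol),Dave) with merge heights 7, 9 and 12.
u_sl = {("Alice", "Bob"): 7, ("Alice", "Carol"): 9, ("Bob", "Carol"): 9,
        ("Alice", "Dave"): 12, ("Bob", "Dave"): 12, ("Carol", "Dave"): 12}
# CL tree ((Alice,Bob),(Carol,Dave)) with merge heights 7, 12 and 28.
u_cl = {("Alice", "Bob"): 7, ("Carol", "Dave"): 12,
        ("Alice", "Carol"): 28, ("Alice", "Dave"): 28,
        ("Bob", "Carol"): 28, ("Bob", "Dave"): 28}

sdr_sl = space_distortion_ratio(u_sl, d)
sdr_cl = space_distortion_ratio(u_cl, d)
print(sdr_sl, sdr_cl)

# The AL height 18.5 for (Alice,Bob) vs (Carol,Dave) stays within the
# bounds of Equation 16 given by the individual distances 16, 28, 9, 21.
print(is_space_conserving(18.5, [16, 28, 9, 21]))
```

In this example the CL tree gives an SDR of exactly 1, consistent with the statement above, while the SL tree compresses the range of distances to 5/21 of the original.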
Figure 4 also shows that space distortion increases when β decreases in β-flexible clustering, or when the power $p$ increases in versatile linkage clustering. Both parameters, β and $p$, work as cluster intensity coefficients in their respective clustering systems. In the case of versatile linkage, the increasing space distortion when the power $p$ increases is explained by the generalized mean inequality in Equation 15. Therefore, taking also into account that, according to Equation 16, space-conserving clustering strategies are lower bounded by SL ($p \to -\infty$) and upper bounded by CL ($p \to +\infty$), we can state that versatile linkage defines an infinite system of space-conserving strategies for agglomerative hierarchical clustering.

[Figure 4: four panels (Breast tissue, Iris, Wine, Parkinsons). Horizontal axis: clustering strategy (β = +0.9, SL, Centroid, HL, GL, AL, CL, Ward, β = −1.0), with unweighted and weighted series. Vertical axis: SDR, logarithmic scale from 10^−2 to 10^+4.]

Figure 4: Space distortion ratio (SDR), in logarithmic scale. Weighted and unweighted versions of the clustering strategies are compared.

4.4 Tree Balance

We use the concept of entropy from information theory, more concretely Shannon's entropy [21], to introduce a new measure to assess the degree of homogeneity in size of the clusters in a hierarchical tree. Given a cluster $X_I$, we define its entropy as

$$ H_I = - \sum_{i \in I} p_i \log_{|I|}(p_i), \tag{18} $$

where $p_i = |X_i| / |X_I|$ is the proportion of individuals in cluster $X_I$ that are also members of subcluster $X_i$. Next, we define the tree balance, $H$, of a hierarchical tree as the average entropy of all its internal clusters.
The maximum tree balance is equal to 1 and it is obtained, for instance, for a completely flat hierarchical tree with a single cluster containing the $N$ individuals in the collection. Another example of hierarchical trees with maximum tree balance are the regular $m$-way trees obtained when applying the Baire-based divisive hierarchical clustering algorithm on a collection of sequences with uniformly distributed prefixes [3, 4]. On the contrary, the minimum tree balance, $H_{\min}$, corresponds to a binary tree where individuals are chained one at a time:

$$ H_{\min} = \frac{1}{N - 1} \left[ \log_2(N) + \sum_{n=2}^{N-1} \frac{1}{n + 1} \log_2(n) \right]. \tag{19} $$

[Figure 5: four panels (Breast tissue, Iris, Wine, Parkinsons). Horizontal axis: clustering strategy (β = +0.9, SL, Centroid, HL, GL, AL, CL, Ward, β = −1.0), with unweighted and weighted series. Vertical axis: NTB, from 0.0 to 1.0.]

Figure 5: Normalized tree balance (NTB). Weighted and unweighted versions of the clustering strategies are compared.

Now, we can define the normalized tree balance (NTB) as

$$ \mathrm{NTB} = \frac{H - H_{\min}}{1 - H_{\min}}, \tag{20} $$

which becomes a measure with values between 0 and 1. Figure 5 shows the NTB values obtained for our case studies. Similarly to space distortion, tree balance increases when β decreases in β-flexible clustering, or when the power $p$ increases in versatile linkage clustering. In the case of the almost flat hierarchical trees obtained with β-flexible clustering when β = +1, the NTB is very close to 1. Finally, according to the values observed in Figure 5, it cannot be stated that either weighted or unweighted clustering strategies produce hierarchical trees that are more balanced.
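Equations 18 to 20 can be sketched in a few lines of Python. In this illustration a hierarchical tree is represented as nested tuples whose leaves are strings; that representation and the function names are assumptions made for the example. It also assumes that every internal cluster has at least two subclusters and that N is at least 3, since Equation 20 is undefined when the minimum tree balance equals 1.

```python
import math

def _walk(tree, entropies):
    """Return the number of leaves under `tree`, appending the entropy
    (Equation 18) of every internal cluster to `entropies`."""
    if not isinstance(tree, tuple):
        return 1  # a leaf, i.e. a single individual
    sizes = [_walk(child, entropies) for child in tree]
    total = sum(sizes)
    # Logarithm in base |I|, the number of subclusters of this cluster.
    entropies.append(-sum(s / total * math.log(s / total, len(sizes))
                          for s in sizes))
    return total

def tree_balance(tree):
    """Average entropy of all internal clusters of the tree."""
    entropies = []
    _walk(tree, entropies)
    return sum(entropies) / len(entropies)

def min_tree_balance(N):
    """Equation 19: balance of a binary tree chaining one individual at a time."""
    return (math.log2(N)
            + sum(math.log2(n) / (n + 1) for n in range(2, N))) / (N - 1)

def normalized_tree_balance(tree, N):
    """Equation 20."""
    h_min = min_tree_balance(N)
    return (tree_balance(tree) - h_min) / (1 - h_min)

chained = ((("Alice", "Bob"), "Carol"), "Dave")    # structure found for p < 0
balanced = (("Alice", "Bob"), ("Carol", "Dave"))   # structure found for p > 0
tied = (("Alice", "Bob"), "Carol", "Dave")         # multidendrogram for GL

for tree in (chained, balanced, tied):
    print(round(normalized_tree_balance(tree, 4), 3))
```

For the chained tree the average entropy coincides with Equation 19, so its NTB is 0 up to rounding error, while the balanced binary tree reaches the maximum value of 1.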
5 Conclusions

Agglomerative hierarchical clustering methods have been continually evolving since their origins back in the 1950s, and historically they have been deployed in very diverse application domains, such as geosciences, biosciences, ecology, chemistry, text mining and information retrieval, among others [19]. Nowadays, with the advent of the big data revolution, hierarchical clustering methods have had to address the new challenges brought by more recent application domains that require the hierarchical clustering of thousands of observations [20].

In this work we have introduced versatile linkage, an infinite family of agglomerative hierarchical clustering strategies based on the definition of generalized mean. We have shown that the versatile linkage family contains as particular cases not only the traditionally important strategies of single linkage, complete linkage and arithmetic linkage, but also two new clustering strategies, geometric linkage and harmonic linkage. In addition, we have given both weighted and unweighted versions of these hierarchical clustering strategies, and we have proved the monotonicity of versatile linkage strategies, which guarantees the absence of inversions in the hierarchy. Although we have built versatile linkage upon the multidendrograms variable-group methods to ensure the uniqueness of the clustering, it may also be used with the common pair-group approach just by breaking ties randomly.

We have shown that any descriptive analysis of hierarchical trees in terms of cophenetic correlation should be complemented with the use of other measures capable of describing the space distortion that different hierarchical clustering strategies cause. From this point of view, we have shown that it is helpful to use other measures such as the mean absolute error or the space distortion ratio.
The latter, in addition, provides a way to describe numerically the increase in space distortion observed all along a system of hierarchical clustering strategies such as versatile linkage. Space distortion is inversely proportional to clustering intensity: space-contracting clustering strategies drive systems to cluster very weakly and produce a chaining effect, while space-dilating clustering strategies drive systems to cluster with high intensity and produce very compact clusters. These differences are described by the normalized tree balance measure introduced here, which is based on Shannon's entropy. Tree balance and space distortion are two new descriptive measures meant to be helpful to analyze and understand any hierarchical tree.

The β-flexible clustering also integrates an infinite number of agglomerative hierarchical clustering strategies into a single system, driven by a parameter β that works as a cluster intensity coefficient. However, to the best of our knowledge, no one has yet rigorously defined a range of values of β for which the corresponding β-flexible clustering strategies can be regarded as space-conserving. Unlike the β-flexible clustering system, we have shown that the versatile linkage family is space-conserving.

Acknowledgements

This work has been partially supported by MINECO through grant FIS2015-71582-C2-1 (S.G.), Generalitat de Catalunya project 2017-SGR-896 (S.G.), and Universitat Rovira i Virgili projects 2017PFR-URV-B2-29 (A.F.) and 2017PFR-URV-B2-41 (S.G.).

References

[1] S. Aeberhard, D. Coomans, and O. De Vel. Comparison of classifiers in high dimensional settings. Dept. Math. Statist., James Cook Univ., North Queensland, Australia, Tech. Rep. no. 92-02, 1992.
[2] L. Belbin, D.P. Faith, and G.W. Milligan. A comparison of two approaches to beta-flexible clustering. Multivariate Behavioral Research, 27(3):417–433, 1992.
[3] P.E. Bradley. Mumford dendrograms. The Computer Journal, 53(4):393–404, 2010.
[4] P. Contreras and F. Murtagh. Fast, linear time hierarchical clustering using the Baire metric. Journal of Classification, 29(2):118–143, 2012.
[5] W.H.E. Day and H. Edelsbrunner. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification, 1(1):7–24, 1984.
[6] J.L. Dubien and W.D. Warde. A mathematical comparison of the members of an infinite family of agglomerative clustering algorithms. Canadian Journal of Statistics, 7:29–38, 1979.
[7] A. Fernández and S. Gómez. Solving non-uniqueness in agglomerative hierarchical clustering using multidendrograms. Journal of Classification, 25(1):43–65, 2008.
[8] R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
[9] S. Gómez and A. Fernández. MultiDendrograms: A hierarchical clustering tool (Version 5.0), 2018. http://deim.urv.cat/~sergio.gomez/multidendrograms.php.
[10] A.D. Gordon. Classification. Chapman & Hall/CRC, 2nd edition, 1999.
[11] G. Hart. The occurrence of multiple UPGMA phenograms. In J. Felsenstein, editor, Numerical Taxonomy, pages 254–258. Springer Berlin Heidelberg, 1983.
[12] J. Jossinet. Variability of impedivity in normal and pathological breast tissue. Medical and Biological Engineering and Computing, 34(5):346–350, 1996.
[13] G.N. Lance and W.T. Williams. A generalized sorting strategy for computer classifications. Nature, 212:218, 1966.
[14] G.N. Lance and W.T. Williams. A general theory of classificatory sorting strategies: 1. Hierarchical systems. The Computer Journal, 9(4):373–380, 1967.
[15] M. Lichman. UCI machine learning repository, 2013.
[16] M.A. Little, P.E. McSharry, E.J. Hunter, J. Spielman, and L.O. Ramig. Suitability of dysphonia measurements for telemonitoring of Parkinson's disease. IEEE Transactions on Biomedical Engineering, 56(4):1015–1022, 2009.
[17] B.J.T. Morgan and A.P.G. Ray. Non-uniqueness and inversions in cluster analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 44(1):117–134, 1995.
[18] F. Murtagh. Multidimensional clustering algorithms. In Compstat Lectures. Physica-Verlag, Vienna, 1985.
[19] F. Murtagh and P. Contreras. Algorithms for hierarchical clustering: An overview, II. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(6):e1219, 2017.
[20] F. Murtagh and P. Contreras. Clustering through high dimensional data scaling: Applications and implementations. Archives of Data Science, Series A, 2(1):1–16, 2017.
[21] C.E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 1948.
[22] P.H.A. Sneath and R.R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman and Company, 1973.
[23] R.R. Sokal and C.D. Michener. A statistical method for evaluating systematic relationships. The University of Kansas Science Bulletin, 38:1409–1438, 1958.
[24] R.R. Sokal and F.J. Rohlf. The comparison of dendrograms by objective methods. Taxon, 11(2):33–40, 1962.
[25] J.H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
