Methods of Hierarchical Clustering

Fionn Murtagh (1, 2) and Pedro Contreras (2)
(1) Science Foundation Ireland, Wilton Place, Dublin 2, Ireland
(2) Department of Computer Science, Royal Holloway, University of London, Egham TW20 0EX, England
Email: fmurtagh@acm.org

November 26, 2024

Abstract

We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm.

1 Introduction

Agglomerative hierarchical clustering has been the dominant approach to constructing embedded classification schemes. It is our aim to direct the reader's attention to practical algorithms and methods – both efficient (from the computational and storage points of view) and effective (from the application point of view). It is often helpful to distinguish between method, involving a compactness criterion and the target structure of a 2-way tree representing the partial order on subsets of the power set; as opposed to an implementation, which relates to the detail of the algorithm used.

As with many other multivariate techniques, the objects to be classified have numerical measurements on a set of variables or attributes. Hence, the analysis is carried out on the rows of an array or matrix. If we do not have a matrix of numerical values to begin with, then it may be necessary to skilfully construct such a matrix.
The objects, or rows of the matrix, can be viewed as vectors in a multidimensional space (the dimensionality of this space being the number of variables or columns). A geometric framework of this type is not the only one which can be used to formulate clustering algorithms. Suitable alternative forms of storage of a rectangular array of values are not inconsistent with viewing the problem in geometric terms (and in matrix terms – for example, expressing the adjacency relations in a graph).

Motivation for clustering in general, covering hierarchical clustering and applications, includes the following: analysis of data; interactive user interfaces; storage and retrieval; and pattern recognition.

Surveys of clustering with coverage also of hierarchical clustering include Gordon (1981), March (1983), Jain and Dubes (1988), Gordon (1987), Mirkin (1996), Jain, Murty and Flynn (1999), and Xu and Wunsch (2005). Lerman (1981) and Janowitz (2010) present overarching reviews of clustering including through use of lattices that generalize trees. The case for the central role of hierarchical clustering in information retrieval was made by van Rijsbergen (1979) and continued in the work of Willett (e.g. Griffiths et al., 1984) and others. Various mathematical views of hierarchy, all expressing symmetry in one way or another, are explored in Murtagh (2009).

This article is organized as follows. In section 2 we look at the issue of normalization of data, prior to inducing a hierarchy on the data. In section 3 some historical remarks and motivation are provided for hierarchical agglomerative clustering. In section 4, we discuss the Lance-Williams formulation of a wide range of algorithms, and how these algorithms can be expressed in graph theoretic terms and in geometric terms.
In section 5, we describe the principles of the reciprocal nearest neighbor and nearest neighbor chain algorithm, to support building a hierarchical clustering in a more efficient manner compared to the Lance-Williams or general geometric approaches. In section 6 we overview the hierarchical Kohonen self-organizing feature map, and also hierarchical model-based clustering. We conclude this section with some reflections on divisive hierarchical clustering, in general. Section 7 surveys developments in grid- and density-based clustering. The following section, section 8, presents a recent algorithm of this type, which is particularly suitable for the hierarchical clustering of massive data sets.

2 Data Entry: Distance, Similarity and Their Use

Before clustering comes the phase of data measurement, or measurement of the observables. Let us look at some important considerations to be taken into account. These considerations relate to the metric or other spatial embedding, comprising the first phase of the data analysis stricto sensu.

To group data we need a way to measure the elements and their distances relative to each other in order to decide which elements belong to a group. This can be a similarity, although on many occasions a dissimilarity measurement, or a "stronger" distance, is used. A distance between any points i, j, k satisfies the properties of: symmetry, d(i, j) = d(j, i); positive definiteness, d(i, j) > 0 and d(i, j) = 0 iff i = j; and the triangular inequality, d(i, j) ≤ d(i, k) + d(k, j). If the triangular inequality is not taken into account, we have a dissimilarity. Finally a similarity is given by s(i, j) = max_{i,j} {d(i, j)} − d(i, j).
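To make these definitions concrete, here is a small Python sketch (ours, for illustration; the function names are not from the paper): a checker for the three distance properties, and the similarity conversion s(i, j) = max d − d(i, j) just given.

```python
def is_metric(d, points):
    # Check the three properties from the text: symmetry,
    # positive definiteness, and the triangular inequality.
    for i in points:
        if d[i][i] != 0:
            return False
        for j in points:
            if i != j and (d[i][j] <= 0 or d[i][j] != d[j][i]):
                return False
            for k in points:
                if d[i][j] > d[i][k] + d[k][j]:
                    return False
    return True

def similarity_from(d, points):
    # s(i, j) = max d - d(i, j): convert distances to similarities.
    d_max = max(d[i][j] for i in points for j in points)
    return {(i, j): d_max - d[i][j] for i in points for j in points}
```

Dropping the triangular inequality check from `is_metric` gives a test for a dissimilarity in the sense used above.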
When working in a vector space a traditional way to measure distances is a Minkowski distance, which is a family of metrics defined as follows:

    L_p(x_a, x_b) = ( Σ_{i=1}^{n} |x_{i,a} − x_{i,b}|^p )^{1/p},  ∀ p ≥ 1, p ∈ Z^+,    (1)

where Z^+ is the set of positive integers.

The Manhattan, Euclidean and Chebyshev distances (the latter is also called maximum distance) are special cases of the Minkowski distance when p = 1, p = 2 and p → ∞.

As an example of similarity we have the cosine similarity, which gives the angle between two vectors. This is widely used in text retrieval to match vector queries to the dataset. The smaller the angle between a query vector and a document vector, the closer a query is to a document. The normalized cosine similarity is defined as follows:

    s(x_a, x_b) = cos(θ) = (x_a · x_b) / (||x_a|| ||x_b||)    (2)

where x_a · x_b is the dot product and || · || the norm.

Other relevant distances are the Hellinger, variational, Mahalanobis and Hamming distances. Anderberg (1973) gives a good review of measurement and metrics, where their interrelationships are also discussed. Also Deza and Deza (2009) have produced a comprehensive list of distances in their Encyclopedia of Distances.

By mapping our input data into a Euclidean space, where each object is equiweighted, we can use a Euclidean distance for the clustering that follows. Correspondence analysis is very versatile in determining a Euclidean, factor space from a wide range of input data types, including frequency counts, mixed qualitative and quantitative data values, ranks or scores, and others. Further reading on this is to be found in Benzécri (1979), Le Roux and Rouanet (2004) and Murtagh (2005).

3 Agglomerative Hierarchical Clustering Algorithms: Motivation

Agglomerative hierarchical clustering algorithms can be characterized as greedy, in the algorithmic sense.
A sequence of irreversible algorithm steps is used to construct the desired data structure. Assume that a pair of clusters, including possibly singletons, is merged or agglomerated at each step of the algorithm. Then the following are equivalent views of the same output structure constructed on n objects: a set of n − 1 partitions, starting with the fine partition consisting of n classes and ending with the trivial partition consisting of just one class, the entire object set; a binary tree (one or two child nodes at each non-terminal node) commonly referred to as a dendrogram; a partially ordered set (poset) which is a subset of the power set of the n objects; and an ultrametric topology on the n objects.

An ultrametric, or tree metric, defines a stronger topology compared to, for example, a Euclidean metric geometry. For three points, i, j, k, metric and ultrametric respect the properties of symmetry (d(i, j) = d(j, i)) and positive definiteness (d(i, j) > 0 and if d(i, j) = 0 then i = j). A metric though (as noted in section 2) satisfies the triangular inequality, d(i, j) ≤ d(i, k) + d(k, j), while an ultrametric satisfies the strong triangular or ultrametric (or non-Archimedean) inequality, d(i, j) ≤ max {d(i, k), d(k, j)}. In section 2, above, there was further discussion on metrics.

The single linkage hierarchical clustering approach outputs a set of clusters (to use graph theoretic terminology, a set of maximal connected subgraphs) at each level – or for each threshold value which produces a new partition. The single linkage method with which we begin is one of the oldest methods, its origins being traced to Polish researchers in the 1950s (Graham and Hell, 1985).
The name single linkage arises since the interconnecting dissimilarity between two clusters or components is defined as the least interconnecting dissimilarity between a member of one and a member of the other. Other hierarchical clustering methods are characterized by other functions of the interconnecting linkage dissimilarities.

As early as the 1970s, it was held that about 75% of all published work on clustering employed hierarchical algorithms (Blashfield and Aldenderfer, 1978). Interpretation of the information contained in a dendrogram is often of one or more of the following kinds: set inclusion relationships, partition of the object-sets, and significant clusters.

Much early work on hierarchical clustering was in the field of biological taxonomy, from the 1950s and more so from the 1960s onwards. The central reference in this area, the first edition of which dates from the early 1960s, is Sneath and Sokal (1973). One major interpretation of hierarchies has been the evolution relationships between the organisms under study. It is hoped, in this context, that a dendrogram provides a sufficiently accurate model of underlying evolutionary progression.

A common interpretation made of hierarchical clustering is to derive a partition. A further type of interpretation is instead to detect maximal (i.e. disjoint) clusters of interest at varying levels of the hierarchy. Such an approach is used by Rapoport and Fillenbaum (1972) in a clustering of colors based on semantic attributes. Lerman (1981) developed an approach for finding significant clusters at varying levels of a hierarchy, which has been widely applied.
By developing a wavelet transform on a dendrogram (Murtagh, 2007), which amounts to a wavelet transform in the associated ultrametric topological space, the most important – in the sense of best approximating – clusters can be determined. Such an approach is a topological one (i.e., based on sets and their properties) as contrasted with more widely used optimization or statistical approaches.

In summary, a dendrogram collects together many of the proximity and classificatory relationships in a body of data. It is a convenient representation which answers such questions as: "How many useful groups are in this data?", "What are the salient interrelationships present?". But it can be noted that differing answers can feasibly be provided by a dendrogram for most of these questions, depending on the application.

4 Agglomerative Hierarchical Clustering Algorithms

A wide range of agglomerative hierarchical clustering algorithms have been proposed at one time or another. Such hierarchical algorithms may be conveniently broken down into two groups of methods. The first group is that of linkage methods – the single, complete, weighted and unweighted average linkage methods. These are methods for which a graph representation can be used. Sneath and Sokal (1973) may be consulted for many other graph representations of the stages in the construction of hierarchical clusterings.

The second group of hierarchical clustering methods are methods which allow the cluster centers to be specified (as an average or a weighted average of the member vectors of the cluster). These methods include the centroid, median and minimum variance methods. The latter may be specified either in terms of dissimilarities, alone, or alternatively in terms of cluster center coordinates and dissimilarities.
A very convenient formulation, in dissimilarity terms, which embraces all the hierarchical methods mentioned so far, is the Lance-Williams dissimilarity update formula. If points (objects) i and j are agglomerated into cluster i ∪ j, then we must simply specify the new dissimilarity between the cluster and all other points (objects or clusters). The formula is:

    d(i ∪ j, k) = α_i d(i, k) + α_j d(j, k) + β d(i, j) + γ |d(i, k) − d(j, k)|

where α_i, α_j, β, and γ define the agglomerative criterion. Values of these are listed in the second column of Table 1. In the case of the single link method, using α_i = α_j = 1/2, β = 0, and γ = −1/2 gives us

    d(i ∪ j, k) = (1/2) d(i, k) + (1/2) d(j, k) − (1/2) |d(i, k) − d(j, k)|

which, it may be verified, can be rewritten as

    d(i ∪ j, k) = min {d(i, k), d(j, k)}.

Using other update formulas, as given in column 2 of Table 1, allows the other agglomerative methods to be implemented in a very similar way to the implementation of the single link method.

In the case of the methods which use cluster centers, we have the center coordinates (in column 3 of Table 1) and dissimilarities as defined between cluster centers (column 4 of Table 1). The Euclidean distance must be used for equivalence between the two approaches.

Table 1: Specifications of seven hierarchical clustering methods.

  Single link (nearest neighbor):
    update formula: α_i = 0.5; β = 0; γ = −0.5
    (more simply: min {d_ik, d_jk})

  Complete link (diameter):
    update formula: α_i = 0.5; β = 0; γ = 0.5
    (more simply: max {d_ik, d_jk})

  Group average (average link, UPGMA):
    update formula: α_i = |i| / (|i| + |j|); β = 0; γ = 0

  McQuitty's method (WPGMA):
    update formula: α_i = 0.5; β = 0; γ = 0

  Median method (Gower's, WPGMC):
    update formula: α_i = 0.5; β = −0.25; γ = 0
    center of cluster which agglomerates clusters i and j: g = (g_i + g_j) / 2
    dissimilarity between cluster centers: ||g_i − g_j||²

  Centroid (UPGMC):
    update formula: α_i = |i| / (|i| + |j|); β = −|i||j| / (|i| + |j|)²; γ = 0
    center: g = (|i| g_i + |j| g_j) / (|i| + |j|)
    dissimilarity between cluster centers: ||g_i − g_j||²

  Ward's method (minimum variance, error sum of squares):
    update formula: α_i = (|i| + |k|) / (|i| + |j| + |k|); β = −|k| / (|i| + |j| + |k|); γ = 0
    center: g = (|i| g_i + |j| g_j) / (|i| + |j|)
    dissimilarity between cluster centers: (|i||j| / (|i| + |j|)) ||g_i − g_j||²

  Notes: |i| is the number of objects in cluster i. g_i is a vector in m-space (m is the set of attributes) – either an initial point or a cluster center. || · || is the norm in the Euclidean metric. The names UPGMA, etc. are due to Sneath and Sokal (1973). Coefficient α_j, with index j, is defined identically to coefficient α_i with index i. Finally, the Lance and Williams recurrence formula is (with | · | expressing absolute value):

      d_{i∪j,k} = α_i d_{ik} + α_j d_{jk} + β d_{ij} + γ |d_{ik} − d_{jk}|.

In the case of the median method, for instance, we have the following (cf. Table 1). Let a and b be two points (i.e. m-dimensional vectors: these are objects or cluster centers) which have been agglomerated, and let c be another point. From the Lance-Williams dissimilarity update formula, using squared Euclidean distances, we have:

    d²(a ∪ b, c) = d²(a, c)/2 + d²(b, c)/2 − d²(a, b)/4
                 = ||a − c||²/2 + ||b − c||²/2 − ||a − b||²/4.    (3)

The new cluster center is (a + b)/2, so that its distance to point c is

    ||c − (a + b)/2||².    (4)

That these two expressions are identical is readily verified. The correspondence between these two perspectives on the one agglomerative criterion is similarly proved for the centroid and minimum variance methods.
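As a concrete sketch of the Lance-Williams update (in Python, with coefficients taken from Table 1; the function names are ours, for illustration):

```python
def lw_update(d_ik, d_jk, d_ij, alpha_i, alpha_j, beta, gamma):
    # Lance-Williams dissimilarity update:
    # d(i u j, k) = a_i d(i,k) + a_j d(j,k) + b d(i,j) + g |d(i,k) - d(j,k)|
    return (alpha_i * d_ik + alpha_j * d_jk
            + beta * d_ij + gamma * abs(d_ik - d_jk))

def single_link(d_ik, d_jk, d_ij):
    # a_i = a_j = 0.5, b = 0, g = -0.5: reduces to min{d_ik, d_jk}
    return lw_update(d_ik, d_jk, d_ij, 0.5, 0.5, 0.0, -0.5)

def complete_link(d_ik, d_jk, d_ij):
    # a_i = a_j = 0.5, b = 0, g = 0.5: reduces to max{d_ik, d_jk}
    return lw_update(d_ik, d_jk, d_ij, 0.5, 0.5, 0.0, 0.5)

def ward(d_ik, d_jk, d_ij, ni, nj, nk):
    # Ward's minimum variance coefficients from Table 1, with cluster
    # sizes ni = |i|, nj = |j|, nk = |k|.
    n = ni + nj + nk
    return lw_update(d_ik, d_jk, d_ij,
                     (ni + nk) / n, (nj + nk) / n, -nk / n, 0.0)
```

For example, `single_link(3, 5, d)` returns 3 = min{3, 5} and `complete_link(3, 5, d)` returns 5 = max{3, 5}, for any d(i, j). For Ward's method the input dissimilarities are conventionally squared Euclidean distances.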
For cluster center methods, and with suitable alterations for graph methods, the following algorithm is an alternative to the general dissimilarity based algorithm. The latter may be described as a "stored dissimilarities approach" (Anderberg, 1973).

Stored data approach

Step 1  Examine all interpoint dissimilarities, and form cluster from two closest points.
Step 2  Replace two points clustered by representative point (center of gravity) or by cluster fragment.
Step 3  Return to step 1, treating clusters as well as remaining objects, until all objects are in one cluster.

In steps 1 and 2, "point" refers either to objects or clusters, both of which are defined as vectors in the case of cluster center methods. This algorithm is justified by storage considerations, since we have O(n) storage required for n initial objects and O(n) storage for the n − 1 (at most) clusters. In the case of linkage methods, the term "fragment" in step 2 refers (in the terminology of graph theory) to a connected component in the case of the single link method and to a clique or complete subgraph in the case of the complete link method. Without consideration of any special algorithmic "speed-ups", the overall complexity of the above algorithm is O(n³) due to the repeated calculation of dissimilarities in step 1, coupled with O(n) iterations through steps 1, 2 and 3.

While the stored data algorithm is instructive, it does not lend itself to efficient implementations. In the section to follow, we look at the reciprocal nearest neighbor and mutual nearest neighbor algorithms which can be used in practice for implementing agglomerative hierarchical clustering algorithms.

Before concluding this overview of agglomerative hierarchical clustering algorithms, we will describe briefly the minimum variance method.
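The stored data approach above can be sketched as follows (Python; an illustrative centroid variant in which step 2 replaces the two clusters by their weighted center of gravity — our own sketch, not code from the paper, with the O(n³) behavior noted in the text):

```python
def stored_data_clustering(points):
    # clusters: list of (member indices, center) pairs.
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    merges = []
    while len(clusters) > 1:
        # Step 1: examine all interpoint dissimilarities (squared
        # Euclidean here) and find the two closest "points".
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum((x - y) ** 2
                        for x, y in zip(clusters[a][1], clusters[b][1]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        (ma, ga), (mb, gb) = clusters[a], clusters[b]
        # Step 2: replace the two clusters by their weighted center of gravity.
        n = len(ma) + len(mb)
        g = [(len(ma) * x + len(mb) * y) / n for x, y in zip(ga, gb)]
        merges.append((sorted(ma), sorted(mb)))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, g))
        # Step 3: loop until all objects are in one cluster.
    return merges
```

On four points forming two tight pairs, the sketch first merges each pair and then joins the two resulting clusters.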
The variance or spread of a set of points (i.e. the average of the sum of squared distances from the center) has been a point of departure for specifying clustering algorithms. Many of these algorithms – iterative, optimization algorithms as well as the hierarchical, agglomerative algorithms – are described and appraised in Wishart (1969). The use of variance in a clustering criterion links the resulting clustering to other data-analytic techniques which involve a decomposition of variance, and makes the minimum variance agglomerative strategy particularly suitable for synoptic clustering. Hierarchies are also more balanced with this agglomerative criterion, which is often of practical advantage.

The minimum variance method produces clusters which satisfy compactness and isolation criteria. These criteria are incorporated into the dissimilarity. We seek to agglomerate two clusters, c1 and c2, into cluster c such that the within-class variance of the partition thereby obtained is minimum. Alternatively, the between-class variance of the partition obtained is to be maximized.

Let P and Q be the partitions prior to, and subsequent to, the agglomeration; let p1, p2, ... be classes of the partitions:

    P = {p1, p2, ..., pk, c1, c2}
    Q = {p1, p2, ..., pk, c}.

Letting V denote variance, then in agglomerating two classes of P, the variance of the resulting partition (i.e. V(Q)) will necessarily decrease: therefore in seeking to minimize this decrease, we simultaneously achieve a partition with maximum between-class variance. The criterion to be optimized can then be shown to be:

    V(P) − V(Q) = V(c) − V(c1) − V(c2)
                = (|c1| |c2| / (|c1| + |c2|)) ||c1 − c2||²,

which is the dissimilarity given in Table 1.
This is a dissimilarity which may be determined for any pair of classes of partition P; and the agglomerands are those classes, c1 and c2, for which it is minimum.

It may be noted that if c1 and c2 are singleton classes, then V({c1, c2}) = (1/2) ||c1 − c2||², i.e. the variance of a pair of objects is equal to half their squared Euclidean distance.

5 Efficient Hierarchical Clustering Algorithms Using Nearest Neighbor Chains

Early, efficient algorithms for hierarchical clustering are due to Sibson (1973), Rohlf (1973) and Defays (1977). Their O(n²) implementations of the single link method and of a (non-unique) complete link method, respectively, have been widely cited.

Figure 1: Five points, a, b, c, d, e, showing NNs and RNNs.

Figure 2: Alternative representations of a hierarchy with an inversion. Assuming dissimilarities, as we go vertically up, agglomerative criterion values (d1, d2) increase so that d2 > d1. But here, undesirably, d2 < d1 and the "cross-over" or inversion (right panel) arises.

In the early 1980s a range of significant improvements (de Rham, 1980; Juan, 1982) were made to the Lance-Williams, or related, dissimilarity update schema, which had been in wide use since the mid-1960s. Murtagh (1983, 1985) presents a survey of these algorithmic improvements. We will briefly describe them here.

The new algorithms, which have the potential for exactly replicating results found in the classical but more computationally expensive way, are based on the construction of nearest neighbor chains and reciprocal or mutual NNs (NN-chains and RNNs). A NN-chain consists of an arbitrary point (a in Fig. 1); followed by its NN (b in Fig. 1); followed by the NN from among the remaining points (c, d, and e in Fig.
1) of this second point; and so on until we necessarily have some pair of points which can be termed reciprocal or mutual NNs. (Such a pair of RNNs may be the first two points in the chain; and we have assumed that no two dissimilarities are equal.)

In constructing a NN-chain, irrespective of the starting point, we may agglomerate a pair of RNNs as soon as they are found. What guarantees that we can arrive at the same hierarchy as if we used traditional "stored dissimilarities" or "stored data" algorithms? Essentially this is the same condition as that under which no inversions or reversals are produced by the clustering method. Fig. 2 gives an example of this, where s is agglomerated at a lower criterion value (i.e. dissimilarity) than was the case at the previous agglomeration between q and r. Our ambient space has thus contracted because of the agglomeration. This is due to the algorithm used – in particular the agglomeration criterion – and it is something we would normally wish to avoid. This is formulated as:

    Inversion impossible if: d(i, j) < d(i, k) or d(i, j) < d(j, k)  ⇒  d(i, j) < d(i ∪ j, k)

This is one form of Bruynooghe's reducibility property (Bruynooghe, 1977; see also Murtagh, 1984). Using the Lance-Williams dissimilarity update formula, it can be shown that the minimum variance method does not give rise to inversions; neither do the linkage methods; but the median and centroid methods cannot be guaranteed not to have inversions.

To return to Fig. 1, if we are dealing with a clustering criterion which precludes inversions, then c and d can justifiably be agglomerated, since no other point (for example, b or e) could have been agglomerated to either of these. The processing required, following an agglomeration, is to update the NNs of points such as b in Fig.
1 (and on account of such points, this algorithm was dubbed algorithme des célibataires, or bachelors' algorithm, in de Rham, 1980).

The following is a summary of the algorithm:

NN-chain algorithm

Step 1  Select a point arbitrarily.
Step 2  Grow the NN-chain from this point until a pair of RNNs is obtained.
Step 3  Agglomerate these points (replacing with a cluster point, or updating the dissimilarity matrix).
Step 4  From the point which preceded the RNNs (or from any other arbitrary point if the first two points chosen in steps 1 and 2 constituted a pair of RNNs), return to step 2 until only one point remains.

In Murtagh (1983, 1984, 1985) and Day and Edelsbrunner (1984), one finds discussions of O(n²) time and O(n) space implementations of Ward's minimum variance (or error sum of squares) method and of the centroid and median methods. The latter two methods are termed the UPGMC and WPGMC criteria by Sneath and Sokal (1973). Now, a problem with the cluster criteria used by these latter two methods is that the reducibility property is not satisfied by them. This means that the hierarchy constructed may not be unique as a result of inversions or reversals (non-monotonic variation) in the clustering criterion value determined in the sequence of agglomerations.

Murtagh (1983, 1985) describes O(n²) time and O(n²) space implementations for the single link method, the complete link method and for the weighted and unweighted group average methods (WPGMA and UPGMA). This approach is quite general vis à vis the dissimilarity used and can also be used for hierarchical clustering methods other than those mentioned.
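The four steps above can be sketched as follows (Python; an illustrative, unoptimized implementation of ours, not the paper's code, using the single link update — which satisfies the reducibility property — and assuming, as in the text, that no two dissimilarities are equal):

```python
def nn_chain(d):
    # d: symmetric dict-of-dicts of dissimilarities over objects 0..n-1.
    # Returns the merges as (members, members, criterion value) triples.
    n = len(d)
    dist = {i: dict(row) for i, row in d.items()}
    members = {i: (i,) for i in d}
    active = set(d)
    chain, merges = [], []
    next_label = n
    while len(active) > 1:
        if not chain:
            chain.append(min(active))        # Step 1: arbitrary start
        while True:
            tip = chain[-1]                  # Step 2: grow the NN-chain
            nn = min((c for c in active if c != tip),
                     key=lambda c: dist[tip][c])
            if len(chain) > 1 and nn == chain[-2]:
                break                        # a pair of RNNs is obtained
            chain.append(nn)
        b, a = chain.pop(), chain.pop()      # Step 3: agglomerate the RNNs,
        h = dist[a][b]                       # updating the dissimilarities
        new = next_label
        next_label += 1
        members[new] = members[a] + members[b]
        dist[new] = {}
        for k in active - {a, b}:
            dk = min(dist[a][k], dist[b][k])  # single link update
            dist[new][k] = dk
            dist[k][new] = dk
        active -= {a, b}
        active.add(new)
        merges.append((members[a], members[b], h))
        # Step 4: resume from the chain remainder (or restart if empty).
    return merges
```

For four points on a line at positions 0, 1, 3, 9 (with absolute difference as dissimilarity), the sketch produces agglomerations at criterion values 1, 2 and 6, as single link requires.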
Day and Edelsbrunner (1984) prove the exact O(n²) time complexity of the centroid and median methods using an argument related to the combinatorial problem of optimally packing hyperspheres into an m-dimensional volume. They also address the question of metrics: results are valid in a wide class of distances including those associated with the Minkowski metrics.

The construction and maintenance of the nearest neighbor chain, as well as the carrying out of agglomerations whenever reciprocal nearest neighbors meet, both offer possibilities for distributed implementation. Implementations on a parallel machine architecture were described by Willett (1989).

Evidently (from Table 1) both coordinate data and graph (e.g., dissimilarity) data can be input to these agglomerative methods. Gillet et al. (1998) in the context of clustering chemical structure databases refer to the common use of the Ward method, based on the reciprocal nearest neighbors algorithm, on data sets of a few hundred thousand molecules.

Applications of hierarchical clustering to bibliographic information retrieval are assessed in Griffiths et al. (1984). Ward's minimum variance criterion is favored. From details in White and McCain (1997), the Institute of Scientific Information (ISI) clusters citations (science, and social science) by first clustering highly cited documents based on a single linkage criterion, and then four more passes are made through the data to create a subset of a single linkage hierarchical clustering.

In the CLUSTAN and R statistical data analysis packages (in addition to hclust in R, see flashClust due to P. Langfelder and available on CRAN, "Comprehensive R Archive Network", cran.r-project.org) there are implementations of the NN-chain algorithm for the minimum variance agglomerative criterion.
A property of the minimum variance agglomerative hierarchical clustering method is that we can use weights on the objects on which we will induce a hierarchy. By default, these weights are identical and equal to 1. Such weighting of observations to be clustered is an important and practical aspect of these software packages.

6 Hierarchical Self-Organizing Maps and Hierarchical Mixture Modeling

It is quite impressive how 2D (2-dimensional or, for that matter, 3D) image signals can handle with ease the scalability limitations of clustering and many other data processing operations. The contiguity imposed on adjacent pixels or grid cells bypasses the need for nearest neighbor finding. It is very interesting therefore to consider the feasibility of taking problems of clustering massive data sets into the 2D image domain. The Kohonen self-organizing feature map exemplifies this well. In its basic variant (Kohonen, 1984, 2001) it can be formulated in terms of k-means clustering subject to a set of interrelationships between the cluster centers (Murtagh and Fernández-Pajares, 1995).

Kohonen maps lend themselves well to hierarchical representation. Lampinen and Oja (1992), Dittenbach et al. (2002) and Endo et al. (2002) elaborate on the Kohonen map in this way. An example application in character recognition is Miikkulainen (1990).

A short, informative review of hierarchical self-organizing maps is provided by Vicente and Vellido (2004). These authors also review what they term probabilistic hierarchical models. This includes putting into a hierarchical framework the following: Gaussian mixture models, and a probabilistic – Bayesian – alternative to the Kohonen self-organizing map termed Generative Topographic Mapping (GTM).

GTM can be traced to the Kohonen self-organizing map in the following way.
Firstly, we consider the hierarchical map as brought about through a growing process, i.e. the target map is allowed to grow in terms of layers, and of grid points within those layers. Secondly, we impose an explicit probability density model on the data. Tino and Nabney (2002) discuss how the local hierarchical models are organized in a hierarchical way.

In Wang et al. (2000) an alternating Gaussian mixture modeling, and principal component analysis, is described, in this way furnishing a hierarchy of model-based clusters. AIC, the Akaike information criterion, is used for selection of the best cluster model overall.

Murtagh et al. (2005) use a top level Gaussian mixture modeling with the (spatially aware) PLIC, pseudo-likelihood information criterion, used for cluster selection and identifiability. Then at the next level – and potentially also for further divisive, hierarchical levels – the Gaussian mixture modeling is continued but now using the marginal distributions within each cluster, and using the analogous Bayesian clustering identifiability criterion which is the Bayesian information criterion, BIC. The resulting output is referred to as a model-based cluster tree.

The model-based cluster tree algorithm of Murtagh et al. (2005) is a divisive hierarchical algorithm. Earlier in this article, we considered agglomerative algorithms. However it is often feasible to implement a divisive algorithm instead, especially when a graph cut (for example) is important for the application concerned. Mirkin (1996, chapter 7) describes divisive Ward, minimum variance hierarchical clustering, which is closely related to a bisecting k-means also.

A class of methods under the name of spectral clustering uses eigenvalue/eigenvector reduction on the (graph) adjacency matrix.
As von Luxburg (2007) points out in reviewing this field of spectral clustering, such methods have “been discovered, re-discovered, and extended many times in different communities”. Far from seeing this great deal of work on clustering in any sense in a pessimistic way, we see the perennial and pervasive interest in clustering as testifying to the continual renewal and innovation in algorithm developments, faced with application needs.

It is indeed interesting to note how the clusters in a hierarchical clustering may be defined by the eigenvectors of a dissimilarity matrix, but subject to carrying out the eigenvector reduction in a particular algebraic structure, a semi-ring with additive and multiplicative operations given by “min” and “max”, respectively (Gondran, 1976).

In the next section, section 7, the themes of mapping, and of divisive algorithm, are frequently taken in a somewhat different direction. As always, the application at issue is highly relevant for the choice of the hierarchical clustering algorithm.

7 Density and Grid-Based Clustering Techniques

Many modern clustering techniques focus on large data sets. In Xu and Wunsch (2008, p. 215) these are classified as follows:

• Random sampling
• Data condensation
• Density-based approaches
• Grid-based approaches
• Divide and conquer
• Incremental learning

From the point of view of this article, we select density and grid based approaches, i.e., methods that either look for data densities or split the data space into cells when looking for groups. In this section we take a look at these two families of methods.

The main idea is to use a grid-like structure to split the information space, separating the dense grid regions from the less dense ones to form groups.
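Before turning to concrete grid and density methods, the min/max semi-ring remark of the previous section can be made concrete. In the sketch below (our own illustration; all names are ours), matrix "multiplication" is carried out with min in place of addition and max in place of multiplication; iterating this composition on a dissimilarity matrix until it stabilizes yields the subdominant ultrametric, whose values are the single-linkage merge levels.

```python
# Illustrative sketch: matrix composition in the (min, max) semi-ring.
# Iterating u <- u (composed with) u, where composition uses min as "sum"
# and max as "product", converges to the subdominant ultrametric of the
# input dissimilarity matrix (the single-link cophenetic distances).

def minmax_compose(a, b):
    n = len(a)
    return [[min(max(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def subdominant_ultrametric(d):
    """Fixed point of (min, max) self-composition of dissimilarity matrix d."""
    u = d
    while True:
        v = minmax_compose(u, u)
        if v == u:
            return u
        u = v

# Four points on a line at 0, 1, 3, 7 with absolute-difference dissimilarities.
pts = [0, 1, 3, 7]
d = [[abs(p - q) for q in pts] for p in pts]
print(subdominant_ultrametric(d))
# -> [[0, 1, 2, 4], [1, 0, 2, 4], [2, 2, 0, 4], [4, 4, 4, 0]]
```

Note how the direct dissimilarity d(0, 3) = 3 is replaced by 2, the largest step on the path through the intermediate point: exactly the min-over-paths, max-over-edges characterization of the subdominant ultrametric.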
In general, a typical approach within this category will consist of the following steps, as presented by Grabusts and Borisov (2002):

1. Creating a grid structure, i.e. partitioning the data space into a finite number of non-overlapping cells.
2. Calculating the cell density for each cell.
3. Sorting of the cells according to their densities.
4. Identifying cluster centers.
5. Traversal of neighbor cells.

Some of the more important algorithms within this category are the following:

– STING: STatistical INformation Grid-based clustering was proposed by Wang et al. (1997), who divide the spatial area into rectangular cells represented by a hierarchical structure. The root is at hierarchical level 1, its children at level 2, and so on. This algorithm has a computational complexity of O(K), where K is the number of cells in the bottom layer. This implies that scaling this method to higher dimensional spaces is difficult (Hinneburg and Keim, 1999). For example, if in a high dimensional data space each cell has four children, then the number of cells in the second level will be 2^m, where m is the dimensionality of the database.

– OptiGrid: Optimal Grid-Clustering was introduced by Hinneburg and Keim (1999) as an efficient algorithm to cluster high-dimensional databases with noise. It uses data partitioning based on divisive recursion by multidimensional grids, focusing on separation of clusters by hyperplanes. A cutting plane is chosen which goes through the point of minimal density, therefore splitting two dense half-spaces. This process is applied recursively to each subset of data. This algorithm is hierarchical, with time complexity of O(n · m) (Gan et al., 2007, pp. 210–212).

– GRIDCLUS: proposed by Schikuta (1996), this is a hierarchical algorithm for clustering very large datasets.
It uses a multidimensional data grid to organize the space surrounding the data values rather than organize the data themselves. Thereafter patterns are organized into blocks, which in turn are clustered by a topological neighbor search algorithm. Five main steps are involved in the GRIDCLUS method: (a) insertion of points into the grid structure, (b) calculation of density indices, (c) sorting the blocks with respect to their density indices, (d) identification of cluster centers, and (e) traversal of neighbor blocks.

– WaveCluster: this clustering technique, proposed by Sheikholeslami et al. (2000), defines a uniform two dimensional grid on the data and represents the data points in each cell by the number of points. Thus the data points become a set of grey-scale points, which is treated as an image. Then the problem of looking for clusters is transformed into an image segmentation problem, where wavelets are used to take advantage of their multi-scaling and noise reduction properties. The basic algorithm is as follows: (a) create a data grid and assign each data object to a cell in the grid, (b) apply the wavelet transform to the data, (c) use the average sub-image to find connected clusters (i.e. connected pixels), and (d) map the resulting clusters back to the points in the original space. There is a great deal of other work also that is based on using the wavelet and other multiresolution transforms for segmentation.

Further grid-based clustering algorithms can be found in the following: Chang and Jin (2002), Park and Lee (2004), Gan et al. (2007), and Xu and Wunsch (2008).

Density-based clustering algorithms define clusters as dense regions of points, separated by low-density regions. Therefore, clusters can have an arbitrary shape and the points in the clusters may be arbitrarily distributed.
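This dense-region idea can be illustrated with a minimal sketch in the spirit of the density-based family (our own toy code: the function name, parameters and data are illustrative, and the practical algorithms of this family are surveyed below). Clusters are grown from "core" points having at least min_pts neighbors within radius eps; points reachable from no core point are labeled as noise.

```python
# Toy density-based clustering sketch (illustrative only): clusters are grown
# from "core" points that have at least min_pts neighbors within radius eps
# (neighborhoods include the point itself); points reachable from no core
# point are labeled as noise (-1).
from math import dist

def density_cluster(points, eps, min_pts):
    n = len(points)
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(nbrs[i]) < min_pts:
            continue                      # already labeled, or not a core point
        stack = [i]                       # grow a new cluster from core point i
        while stack:
            j = stack.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            if len(nbrs[j]) >= min_pts:   # expand only through core points
                stack.extend(nbrs[j])
        cluster += 1
    return labels

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(density_cluster(data, eps=2.0, min_pts=3))
# two dense blobs become clusters 0 and 1; the isolated point is noise (-1)
```

Note that the number of clusters is discovered from the density structure, not supplied in advance, in line with the advantages discussed next.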
An important advantage of this methodology is that only one scan of the dataset is needed, and it can handle noise effectively. Furthermore the number of clusters to initialize the algorithm is not required.

Some of the more important algorithms in this category include the following:

– DBSCAN: Density-Based Spatial Clustering of Applications with Noise was proposed by Ester et al. (1996) to discover arbitrarily shaped clusters. Since it finds clusters based on density, it does not need to know the number of clusters at initialization time. This algorithm has been widely used and has many variations (e.g., see GDBSCAN by Sander et al. (1998), PDBSCAN by Xu et al. (1999), and DBCluC by Zaïane and Lee (2002)).

– BRIDGE: proposed by Dash et al. (2001), this uses a hybrid approach, integrating k-means to partition the dataset into k clusters, after which the density-based algorithm DBSCAN is applied to each partition to find dense clusters.

– DBCLASD: Distribution-Based Clustering of LArge Spatial Databases (see Xu et al., 1998) assumes that data points within a cluster are uniformly distributed. The cluster produced is defined in terms of the nearest neighbor distance.

– DENCLUE: DENsity based CLUstering aims to cluster large multimedia data. It can find arbitrarily shaped clusters and at the same time deals with noise in the data. This algorithm has two steps. First a pre-cluster map is generated, and the data is divided into hypercubes, of which only the populated ones are considered. The second step takes the highly populated cubes, and cubes that are connected to a highly populated cube, to produce the clusters. For a detailed presentation of these steps see Hinneburg and Keim (1998).

– CUBN: this has three steps. First an erosion operation is carried out to find border points. Second, the nearest neighbor method is used to cluster the border points.
Finally, the nearest neighbor method is used to cluster the inner points. This algorithm is capable of finding non-spherical shapes and wide variations in size. Its computational complexity is O(n), with n being the size of the dataset. For a detailed presentation of this algorithm see Wang and Wang (2003).

8 A New, Linear Time Grid Clustering Method: m-Adic Clustering

In the last section, section 7, we have seen a number of clustering methods that split the data space into cells, cubes, or dense regions to locate high density areas that can be further studied to find clusters.

For large data sets, clustering via an m-adic expansion (m integer, which if a prime is usually denoted as p) is possible, with the advantage of doing so in linear time for the clustering algorithm based on this expansion. The usual base 10 system for numbers is none other than the case of m = 10, and the base 2 or binary system can be referred to as 2-adic, where p = 2.

Let us consider the following distance relating to the case of vectors x and y with 1 attribute, hence unidimensional:

    d_B(x, y) = 1                                        if x_1 ≠ y_1
    d_B(x, y) = inf { m^{-k} : x_k = y_k, 1 ≤ k ≤ |K| }  otherwise        (5)

This distance is defined by the longest common prefix of strings. A space of strings, with this distance, is a Baire space. Thus we call this the Baire distance: here the longer the common prefix, the closer a pair of sequences. What is of interest to us here is this longest common prefix metric, which is an ultrametric (Murtagh et al., 2008).

For example, let us consider two such values, x and y. We take x and y to be bounded by 0 and 1. Each is of some precision, and we take the integer |K| to be the maximum precision. Thus we consider ordered sets x_k and y_k for k ∈ K. So, k = 1 is the index of the first decimal place of precision; k = 2 is the index of the second decimal place; . . .
; k = |K| is the index of the |K|th decimal place. The cardinality of the set K is the precision with which a number, x, is measured.

Consider as examples x = 0.478 and y = 0.472. In these cases, |K| = 3. Start from the first decimal position. For k = 1, we have x_k = y_k = 4. For k = 2, x_k = y_k = 7. But for k = 3, x_k ≠ y_k. Hence their Baire distance is 10^{-2} for base m = 10.

It is seen that this distance splits a unidimensional string of decimal values into a 10-way hierarchy, in which each leaf can be seen as a grid cell. From equation (5) we can read off the distance between points assigned to the same grid cell. All pairwise distances of points assigned to the same cell are the same.

Clustering using this Baire distance has been successfully applied to areas such as chemoinformatics (Murtagh et al., 2008), astronomy (Contreras and Murtagh, 2009) and text retrieval (Contreras, 2010).

9 Conclusions

Hierarchical clustering methods, with roots going back to the 1960s and 1970s, are continually replenished with new challenges. As a family of algorithms they are central to the addressing of many important problems. Their deployment in many application domains testifies to how hierarchical clustering methods will remain crucial for a long time to come.

We have looked at both traditional agglomerative hierarchical clustering, and more recent developments in grid or cell based approaches. We have discussed various algorithmic aspects, including well-definedness (e.g. inversions) and computational properties.
We have also touched on a number of application domains, again in areas that reach back over some decades (chemoinformatics) or many decades (information retrieval, which motivated much early work in clustering, including hierarchical clustering), and more recent application domains (such as hierarchical model-based clustering approaches).

10 References

1. Anderberg MR, Cluster Analysis for Applications, Academic Press, New York, 1973.

2. Benzécri JP et coll., L'Analyse des Données. I. La Taxinomie, Dunod, Paris, 1979 (3rd ed.).

3. Blashfield RK and Aldenderfer MS, The literature on cluster analysis, Multivariate Behavioral Research 1978, 13: 271–295.

4. Bruynooghe M, Méthodes nouvelles en classification automatique des données taxinomiques nombreuses, Statistique et Analyse des Données 1977, no. 3, 24–42.

5. Chang J-W and Jin D-S, A new cell-based clustering method for large, high-dimensional data in data mining applications, in SAC '02: Proceedings of the 2002 ACM Symposium on Applied Computing. New York: ACM, 2002, pp. 503–507.

6. Contreras P, Search and Retrieval in Massive Data Collections, PhD Thesis, Royal Holloway, University of London, 2010.

7. Contreras P and Murtagh F, Fast hierarchical clustering from the Baire distance, in Classification as a Tool for Research, eds. H. Locarek-Junge and C. Weihs, Springer, Berlin, 235–243, 2010.

8. Dash M, Liu H, and Xu X, 1 + 1 > 2: merging distance and density based clustering, in DASFAA '01: Proceedings of the 7th International Conference on Database Systems for Advanced Applications. Washington, DC: IEEE Computer Society, 2001, pp. 32–39.

9. Day WHE and Edelsbrunner H, Efficient algorithms for agglomerative hierarchical clustering methods, Journal of Classification 1984, 1: 7–24.

10. Defays D, An efficient algorithm for a complete link method, Computer Journal 1977, 20: 364–366.

11.
de Rham C, La classification hiérarchique ascendante selon la méthode des voisins réciproques, Les Cahiers de l'Analyse des Données 1980, V: 135–144.

12. Deza MM and Deza E, Encyclopedia of Distances, Springer, Berlin, 2009.

13. Dittenbach M, Rauber A and Merkl D, Uncovering the hierarchical structure in data using the growing hierarchical self-organizing map, Neurocomputing 2002, 48(1–4): 199–216.

14. Endo M, Ueno M and Tanabe T, A clustering method using hierarchical self-organizing maps, Journal of VLSI Signal Processing 2002, 32: 105–118.

15. Ester M, Kriegel H-P, Sander J, and Xu X, A density-based algorithm for discovering clusters in large spatial databases with noise, in 2nd International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1996, pp. 226–231.

16. Gan G, Ma C and Wu J, Data Clustering: Theory, Algorithms, and Applications, Society for Industrial and Applied Mathematics, SIAM, 2007.

17. Gillet VJ, Wild DJ, Willett P and Bradshaw J, Similarity and dissimilarity methods for processing chemical structure databases, Computer Journal 1998, 41: 547–558.

18. Gondran M, Valeurs propres et vecteurs propres en classification hiérarchique, RAIRO Informatique Théorique 1976, 10(3): 39–46.

19. Gordon AD, Classification, Chapman and Hall, London, 1981.

20. Gordon AD, A review of hierarchical classification, Journal of the Royal Statistical Society A 1987, 150: 119–137.

21. Grabusts P and Borisov A, Using grid-clustering methods in data classification, in PARELEC '02: Proceedings of the International Conference on Parallel Computing in Electrical Engineering. Washington, DC: IEEE Computer Society, 2002.

22. Graham RL and Hell P, On the history of the minimum spanning tree problem, Annals of the History of Computing 1985, 7: 43–57.

23.
Griffiths A, Robinson LA and Willett P, Hierarchic agglomerative clustering methods for automatic document classification, Journal of Documentation 1984, 40: 175–205.

24. Hinneburg A and Keim DA, An efficient approach to clustering in large multimedia databases with noise, in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York: AAAI Press, 1998, pp. 58–65.

25. Hinneburg A and Keim D, Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering, in VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann Publishers Inc., 1999, pp. 506–517.

26. Jain AK and Dubes RC, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, 1988.

27. Jain AK, Murty MN and Flynn PJ, Data clustering: a review, ACM Computing Surveys 1999, 31: 264–323.

28. Janowitz MF, Ordinal and Relational Clustering, World Scientific, Singapore, 2010.

29. Juan J, Programme de classification hiérarchique par l'algorithme de la recherche en chaîne des voisins réciproques, Les Cahiers de l'Analyse des Données 1982, VII: 219–225.

30. Kohonen T, Self-Organization and Associative Memory, Springer, Berlin, 1984.

31. Kohonen T, Self-Organizing Maps, 3rd edn., Springer, Berlin, 2001.

32. Lampinen J and Oja E, Clustering properties of hierarchical self-organizing maps, Journal of Mathematical Imaging and Vision 1992, 2: 261–272.

33. Lerman IC, Classification et Analyse Ordinale des Données, Dunod, Paris, 1981.

34. Le Roux B and Rouanet H, Geometric Data Analysis: From Correspondence Analysis to Structured Data Analysis, Kluwer, Dordrecht, 2004.

35. von Luxburg U, A tutorial on spectral clustering, Statistics and Computing 2007, 17(4): 395–416.

36.
March ST, Techniques for structuring database records, ACM Computing Surveys 1983, 15: 45–79.

37. Miikkulainen R, Script recognition with hierarchical feature maps, Connection Science 1990, 2: 83–101.

38. Mirkin B, Mathematical Classification and Clustering, Kluwer, Dordrecht, 1996.

39. Murtagh F, A survey of recent advances in hierarchical clustering algorithms, Computer Journal 1983, 26: 354–359.

40. Murtagh F, Complexities of hierarchic clustering algorithms: state of the art, Computational Statistics Quarterly 1984, 1: 101–113.

41. Murtagh F, Multidimensional Clustering Algorithms, Physica-Verlag, Würzburg, 1985.

42. Murtagh F and Hernández-Pajares M, The Kohonen self-organizing map method: an assessment, Journal of Classification 1995, 12: 165–190.

43. Murtagh F, Raftery AE and Starck JL, Bayesian inference for multiband image segmentation via model-based clustering trees, Image and Vision Computing 2005, 23: 587–596.

44. Murtagh F, Correspondence Analysis and Data Coding with Java and R, Chapman and Hall, Boca Raton, 2005.

45. Murtagh F, The Haar wavelet transform of a dendrogram, Journal of Classification 2007, 24: 3–32.

46. Murtagh F, Symmetry in data mining and analysis: a unifying view based on hierarchy, Proceedings of Steklov Institute of Mathematics 2009, 265: 177–198.

47. Murtagh F, Downs G and Contreras P, Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding, SIAM Journal on Scientific Computing 2008, 30(2): 707–730.

48. Park NH and Lee WS, Statistical grid-based clustering over data streams, SIGMOD Record 2004, 33(1): 32–37.

49. Rapoport A and Fillenbaum S, An experimental study of semantic structures, in eds. A.K. Romney, R.N. Shepard and S.B. Nerlove, Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, Vol.
2, Applications, Seminar Press, New York, 1972, 93–131.

50. Rohlf FJ, Algorithm 76: hierarchical clustering using the minimum spanning tree, Computer Journal 1973, 16: 93–95.

51. Sander J, Ester M, Kriegel H-P, and Xu X, Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications, Data Mining and Knowledge Discovery 1998, 2(2): 169–194.

52. Schikuta E, Grid-clustering: an efficient hierarchical clustering method for very large data sets, in ICPR '96: Proceedings of the 13th International Conference on Pattern Recognition. Washington, DC: IEEE Computer Society, 1996, pp. 101–105.

53. Sheikholeslami G, Chatterjee S and Zhang A, WaveCluster: a wavelet-based clustering approach for spatial data in very large databases, The VLDB Journal 2000, 8(3–4): 289–304.

54. Sibson R, SLINK: an optimally efficient algorithm for the single link cluster method, Computer Journal 1973, 16: 30–34.

55. Sneath PHA and Sokal RR, Numerical Taxonomy, Freeman, San Francisco, 1973.

56. Tino P and Nabney I, Hierarchical GTM: constructing localized non-linear projection manifolds in a principled way, IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24(5): 639–656.

57. van Rijsbergen CJ, Information Retrieval, Butterworths, London, 1979 (2nd ed.).

58. Wang L and Wang Z-O, CUBN: a clustering algorithm based on density and distance, in Proceedings of the 2003 International Conference on Machine Learning and Cybernetics. IEEE Press, 2003, pp. 108–112.

59. Wang W, Yang J and Muntz R, STING: a statistical information grid approach to spatial data mining, in VLDB '97: Proceedings of the 23rd International Conference on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann Publishers Inc., 1997, pp. 186–195.

60. Wang Y, Freedman MI and Kung S-Y,
Probabilistic principal component subspaces: a hierarchical finite mixture model for data visualization, IEEE Transactions on Neural Networks 2000, 11(3): 625–636.

61. White HD and McCain KW, Visualization of literatures, in M.E. Williams, ed., Annual Review of Information Science and Technology (ARIST) 1997, 32: 99–168.

62. Vicente D and Vellido A, Review of hierarchical models for data clustering and visualization, in R. Giráldez, J.C. Riquelme and J.S. Aguilar-Ruiz, eds., Tendencias de la Minería de Datos en España, Red Española de Minería de Datos, 2004.

63. Willett P, Efficiency of hierarchic agglomerative clustering using the ICL distributed array processor, Journal of Documentation 1989, 45: 1–45.

64. Wishart D, Mode analysis: a generalization of nearest neighbour which reduces chaining effects, in ed. A.J. Cole, Numerical Taxonomy, Academic Press, New York, 282–311, 1969.

65. Xu R and Wunsch D, Survey of clustering algorithms, IEEE Transactions on Neural Networks 2005, 16: 645–678.

66. Xu R and Wunsch DC, Clustering, IEEE Computer Society Press, 2008.

67. Xu X, Ester M, Kriegel H-P and Sander J, A distribution-based clustering algorithm for mining in large spatial databases, in ICDE '98: Proceedings of the Fourteenth International Conference on Data Engineering. Washington, DC: IEEE Computer Society, 1998, pp. 324–331.

68. Xu X, Jäger J, and Kriegel H-P, A fast parallel clustering algorithm for large spatial databases, Data Mining and Knowledge Discovery 1999, 3(3): 263–290.

69. Zaïane OR and Lee C-H, Clustering spatial data in the presence of obstacles: a density-based approach, in IDEAS '02: Proceedings of the 2002 International Symposium on Database Engineering and Applications. Washington, DC: IEEE Computer Society, 2002, pp. 214–223.