Adaptive Affinity Propagation Clustering

Affinity propagation clustering (AP) has two limitations: it is hard to know what value of the parameter 'preference' can yield an optimal clustering solution, and oscillations cannot be eliminated automatically if they occur. The adaptive AP method is proposed to overcome these limitations.

Authors: Kaijun Wang, Junying Zhang, Dan Li, Xinna Zhang, Tao Guo

Adaptive Affinity Propagation Clustering. Acta Automatica Sinica, 33(12):1242-1246, 2007

Adaptive Affinity Propagation Clustering

Kaijun Wang(1), Junying Zhang(1), Dan Li(1), Xinna Zhang(2) and Tao Guo(1)
(1) School of Computer Science and Engineering, Xidian University, Xi'an 710071, P. R. China.
(2) China Jiliang College, Hangzhou 310018, P. R. China
Email: sunice9@yahoo.com

Abstract

Affinity propagation clustering (AP) has two limitations: it is hard to know what value of the parameter 'preference' can yield an optimal clustering solution, and oscillations cannot be eliminated automatically if they occur. The adaptive AP method is proposed to overcome these limitations, including adaptive scanning of preferences to search the space of the number of clusters for the optimal clustering solution, adaptive adjustment of damping factors to eliminate oscillations, and adaptive escape from oscillations when the damping adjustment technique fails. Experimental results on simulated and real data sets show that adaptive AP is effective and can outperform AP in the quality of clustering results.

Keywords: Affinity propagation clustering; Adaptive clustering; Large number of clusters

1 Introduction

Affinity propagation clustering (AP) [1] is a fast clustering algorithm, especially in the case of a large number of clusters [2], and has several advantages: speed, general applicability and good performance. AP works on similarities between pairs of data points (an n × n similarity matrix S for n data points), and simultaneously considers all data points as potential cluster centers (called exemplars).
To find appropriate exemplars, AP accumulates the evidence "responsibility" R(i,k), sent from data point i, for how well-suited point k is to serve as the exemplar for point i, and the evidence "availability" A(i,k), sent from candidate exemplar k, for how appropriate it would be for point i to choose point k as its exemplar. From the evidence view, the larger R(:,k) + A(:,k) is, the more probable it is that point k becomes a final cluster center. Based on evidence accumulation, AP searches for clusters through an iterative process until a high-quality set of exemplars and corresponding clusters emerges. In the iterative process, the set of identified exemplars starts from the maximum of n exemplars and shrinks until m exemplars appear and no longer change (i.e., the AP algorithm converges). The m clusters found from these m exemplars are the clustering solution of AP. There are two important parameters in AP: the preferences (p) on the diagonal of the similarity matrix S, and the damping factor (lam). The preference parameter p(i) (its initial value is negative) indicates the preference that data point i be chosen as a cluster center, and it influences the output clusters and the number of clusters (NC). The main formulas in the AP algorithm are: R(i,k) = S(i,k) − max{A(i,j) + S(i,j)}, where j ∈ {1,2,…,n} and j ≠ k; and A(i,k) = min{0, R(k,k) + sum{max(0, R(j,k))}}, where j ∈ {1,2,…,n} and j ≠ i, j ≠ k; and p appears in R(k,k) = p(k) − max{A(k,j) + S(k,j)}. Hence p influences which and how many exemplars win as final cluster centers: when p(k) is larger, R(k,k) and A(i,k) are larger, so point k is more likely to become a final cluster center.
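The update rules above can be sketched in NumPy as follows. This is a minimal, loop-based illustration (the function and variable names are ours, not the authors' Matlab code), and it includes the standard AP self-availability rule A(k,k), which the text does not spell out:

```python
import numpy as np

def update_messages(S, R, A, lam=0.5):
    """One damped round of AP message passing on an n x n similarity
    matrix S whose diagonal holds the preferences p(i).  Sketch only:
    loop-based for clarity, no tie-breaking noise."""
    n = S.shape[0]
    AS = A + S
    # Responsibility: R(i,k) = S(i,k) - max_{j != k} { A(i,j) + S(i,j) }
    R_new = np.empty_like(R)
    for i in range(n):
        for k in range(n):
            mask = np.ones(n, dtype=bool)
            mask[k] = False
            R_new[i, k] = S[i, k] - AS[i, mask].max()
    # Availability: A(i,k) = min{0, R(k,k) + sum_{j not in {i,k}} max(0, R(j,k))},
    # and self-availability A(k,k) = sum_{j != k} max(0, R(j,k))  (standard AP)
    A_new = np.empty_like(A)
    Rp = np.maximum(R_new, 0.0)
    for k in range(n):
        col = Rp[:, k].sum()
        for i in range(n):
            if i == k:
                A_new[k, k] = col - Rp[k, k]
            else:
                A_new[i, k] = min(0.0, R_new[k, k] + col - Rp[i, k] - Rp[k, k])
    # Damped mixing keeps iterations stable: X <- (1-lam)*X_new + lam*X_old
    return (1 - lam) * R_new + lam * R, (1 - lam) * A_new + lam * A
```

After convergence, the exemplars are the points k with R(k,k) + A(k,k) > 0, in line with the "larger R(:,k) + A(:,k)" criterion above.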
This means that the number of identified clusters can be increased or decreased by adjusting p correspondingly, and usually a good choice is to set all p(i) to the median (pm) of all similarities between data points [1]. However, pm cannot lead to an optimal clustering solution in many cases, since pm is not derived from the optimal cluster structure of the data set. Furthermore, there is no exact correspondence between p and the output NC. Therefore, how to find an optimal clustering solution is an unsolved problem when using the AP algorithm. In each iteration i, R and A are updated from their values in the last iteration, i.e., R_i = (1 − lam) × R_i + lam × R_{i−1} and A_i = (1 − lam) × A_i + lam × A_{i−1}, where the damping factor lam ∈ [0,1] and the default is lam = 0.5. Another function of the damping factor is to improve convergence when AP fails to converge on account of oscillations (the identified exemplars vary periodically); in that case lam must be increased to eliminate the oscillations [1]. In non-convergent cases, one has to increase lam manually and gradually and rerun AP until the algorithm converges. Another choice is to use a large damping factor close to 1 to eliminate oscillations, but then AP runs very slowly. Both choices may consume plenty of time, especially for a large data set. Hence an important problem for AP is: how can oscillations be eliminated automatically when they occur? To solve these problems, we propose adaptive AP, which includes: adaptive adjustment of the damping factor to eliminate oscillations (called adaptive damping), adaptive escape from oscillations by decreasing p when the adaptive damping method fails (called adaptive escape), and adaptive scanning of the space of p to find the optimal clustering solution for a data set (called adaptive preference scanning). The adaptive AP method is presented in Section 2, and experimental results are given in Section 3.
Finally, Section 4 gives the conclusion. The Matlab codes of the adaptive AP are available from ref. [10].

2 Adaptive Affinity Propagation

In this section, the adaptive damping and escape methods are discussed first to eliminate oscillations, and then the adaptive scanning of p is designed. Finally, a cluster validity method is adopted to find the optimal clustering solution. Note that the same initial value is assigned to all p(i) on the diagonal of matrix S. When oscillations occur and AP fails to converge, our goal is both to eliminate the oscillations and to preserve the speed of the algorithm. Although a lam near 1 is more likely to eliminate oscillations, R and A are then updated very slowly and much more time is needed to run AP. A better choice is to check the effect of oscillation elimination while increasing lam gradually. Following this idea, the adaptive damping method is designed: (1) detect whether oscillation occurs in an iteration of the AP algorithm; (2) increase lam once by a step such as 0.05 if oscillations are detected, otherwise go to (1); (3) let the iteration continue w times; (4) repeat these steps until the algorithm meets the preset stop condition. The key is to detect oscillations while the algorithm runs, but the features of oscillations are too complex to describe directly. We therefore describe/define non-oscillation features instead: the number of identified exemplars is decreasing or unchanging during the iterative process. This definition is reasonable, since decreasing and unchanging are the features of an algorithm that is converging.
A moving monitoring window Kb(i) (window size w) is used to record whether non-oscillation features appear in a sequence of iterations, e.g., Kb(i) = 1 when non-oscillation features appear in iteration i, otherwise Kb(i) = 0. The criterion for whether oscillations occur is designed as follows: oscillations occur if the number of non-oscillation features in the monitoring window is less than two thirds of the window size. This is a tolerant design that allows for occasional random vibrations over a short time and for vibrations in the initial iterations. Thus, the above monitoring-adjusting technique realizes adaptive adjustment of lam and leads AP to convergence. If increasing lam fails to suppress the oscillations (e.g., lam has been increased to 0.85 or higher), an adaptive escape technique is designed to avoid the oscillations. That a large lam brings little effect suggests that the oscillations are pertinacious under the given p, so the alternative is to move p downward, away from the given p, to escape from the oscillations. This escape method is workable because it operates together with the adaptive scanning of p discussed below, unlike AP, which works under a fixed p. The adaptive escape technique is designed as follows: when oscillations occur and lam ≥ 0.85 in the iterative process, decrease p gradually until the oscillations disappear. This technique is added to step (2) of the adaptive damping method: if oscillations occur, increase lam by a step such as 0.05; if lam ≥ 0.85, decrease p by the step ps; otherwise go to step (1) of the adaptive damping method. The adaptive damping and adaptive escape techniques are used together to eliminate oscillations. The monitoring window size w = 40 is appropriate in our experience (occasional random vibrations and tolerable vibrations in the initial iterations would be caught if w were too small, while AP runs slowly if w is too big).
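A minimal sketch of the oscillation monitor and the damping/escape reaction just described (simplified: we treat "K decreasing or unchanged since the previous check" as the non-oscillation feature, whereas the paper averages K over a sub-window; all names are hypothetical):

```python
from collections import deque

def make_monitor(w=40):
    """Monitoring window Kb of size w: entry 1 marks a non-oscillation
    feature (number of exemplars K decreasing or unchanged).  Returns a
    step function that flags oscillation when fewer than 2w/3 window
    entries are 1 -- the tolerant criterion from the text."""
    Kb = deque([1] * w, maxlen=w)   # start full of 1s: tolerate initial vibrations
    prev = [None]

    def step(K):
        non_osc = prev[0] is None or K <= prev[0]
        prev[0] = K
        Kb.append(1 if non_osc else 0)
        return sum(Kb) < 2 * w / 3  # True => oscillation detected

    return step

def adapt(lam, p, ps, oscillating):
    """Adaptive damping: raise lam by 0.05 while oscillating; once
    lam >= 0.85 the adaptive escape also lowers p by ps (p, ps < 0)."""
    if oscillating:
        lam += 0.05
        if lam >= 0.85:
            p += ps
    return lam, p
```

The closure returned by `make_monitor` would be called once per AP iteration with the current exemplar count, and `adapt` applied to its verdict.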
The pseudocode of adaptive damping and adaptive escape is listed in Table 1 (where maxits and ps are set as in Table 2).

Table 1. Procedure of adaptive damping and adaptive escape

Initialization: damping factor lam ← 0.5, monitoring window size w ← 40, parameter w2 ← w/8, max iteration times maxits, decreasing step ps.

for i ← 1 to maxits do
    Kset(i) ← K                                      △ K is the number of exemplars
    Km(i) ← mean(Kset(i−w2 : i))
    if Km(i) − Km(i−1) < 0 then Kd ← 1               △ record the decrease of K
    Kc ← Σ_j |Kset(i) − Kset(j)|, j ∈ i−w2 to i−1    △ record the unchangedness of K
    △ record the decrease and unchangedness of K in monitoring window Kb:
    if Kd = 1 or Kc = 0 then Kb(j) ← 1               △ j is the remainder of i/w
    else Kb(j) ← 0                                   △ j is the remainder of i/w
    Ks ← Σ_j Kb(j), j ∈ 1 to w
    if Ks < 2w/3 then lam ← lam + 0.05
    if lam ≥ 0.85 then p ← p + ps

The number of identified clusters depends on the input p, but it is unknown which value of p will give the best clustering solution for a given data set. Generally, cluster validation techniques (usually based on validation indices) [3] are used to evaluate which clustering solution is optimal for a data set. The AP algorithm must therefore provide a series of clustering solutions with different NCs, among which the optimal clustering solution is found by a cluster validation index. Since there is no exact correspondence between p and the output NC, we design a method that scans the space of p to obtain different NCs.
The adaptive p-scanning technique is designed as follows: (1) specify a large p to start the algorithm; (2) run an iteration, which gives K exemplars; (3) check whether the K exemplars have converged (the condition is that every exemplar satisfies a preset count v of continuously unchanging iterations); (4) go to step (5) if the K exemplars converge, otherwise go to step (2); (5) decrease p by the step ps if the K exemplars also remain converged for dy additional iterations (for more reliable convergence), otherwise go to step (2); (6) go to step (2). Thus, a series of clustering results with different NCs is gained by scanning p, and the scanning of the p space is designed inside the iterative process to keep the speed advantage. To avoid possible repeated computation, in the p-scanning process we continue to calculate R(i,k) and A(i,k) based on (i.e., using) the current values of R(i,j) and A(i,j) after each reduction of p (only S(i,i) = p(i) changes; the other elements of S are unchanged). Selecting a proper decreasing step is the key to the adaptive p-scanning technique. In our experience, the decreasing step may be set to ps = pm/100. This is a compromise that considers both failure cases: the algorithm runs slowly when |ps| is too small, and it may miss the NC of the inherent cluster structure when |ps| is too big. Nevertheless, a fixed decreasing step cannot handle both the large-NC and the small-NC case. Considering that a large NC is more sensitive to ps than a small one, we design the adaptive decreasing step ps = 0.01·pm/q, with the decreasing parameter q = K/50 + 0.1. Thus q is adjusted dynamically with K, and ps is smaller when K is bigger, and larger when K is smaller.
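As a small numeric illustration of the adaptive step (using the form q = K/50 + 0.1 as reconstructed from the garbled formula; pm is the similarity median and is negative):

```python
def adaptive_step(pm, K):
    """Adaptive decreasing step ps = 0.01*pm/q with q = K/50 + 0.1:
    |ps| shrinks when the current number of exemplars K is large, so
    the big-NC region of the p space is scanned more finely."""
    q = K / 50.0 + 0.1
    return 0.01 * pm / q
```

For example, with pm = -10, K = 5 gives ps = -0.5, while K = 50 gives ps of roughly -0.09.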
To check whether the convergence condition is satisfied, another monitoring window B (similar to that in the adaptive damping method) is adopted to record the count v of continuously unchanging iterations of the K exemplars, and the window size is set to v = 40, which is consistent with the default convergence count of 50 in AP [1] (v = 40 plus the delay of dy = 10 iterations). It is important to specify the scanning scope, and a smaller scope is preferred for less running time. The p space [−∞, 0] corresponds to the NC space [1, n]. For the clustering of n data points, it is reasonable to regard √n as the upper limit of the optimal NC [4]. In the following experiments we find that K1 at the first convergence equals or exceeds √n when the initial p = pm/2 is set, and that the NCs searched by the algorithm are then much bigger than √n (since every data point is regarded as an exemplar when AP starts). Hence we set the initial p = pm/2. The minimal NC (K = 2) determines the lower limit of p, i.e., p is reduced until K = 2. A large maxits = 50000 is set so that the maximal iteration count maxits does not influence whether the algorithm reaches K = 2. Finally, an acceleration technique for p-scanning is needed to save running time. Since some NCs correspond to a large range of p, a large reduction of p is needed to change the NC. In this case we may increase the decreasing step of p to reach smaller NCs rapidly. The acceleration technique of p-scanning is designed as follows: (1) run the iteration once; (2) check whether the K exemplars converge; if yes, go to (3), otherwise set b = 0 and go to (1); (3) continue the iteration dy = 10 times; (4) check whether the K exemplars converge; if yes, set b = b + 1, otherwise go to (1); (5) set p = p + b × ps, and go to (3). The pseudocode of the adaptive p-scanning technique is listed in Table 2.
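The acceleration rule in steps (1)-(5) can be sketched as follows (a hypothetical helper; `converged` stands for the dy-iteration convergence check of steps (3)-(4)):

```python
def accelerate(p, ps, b, converged):
    """p-scanning acceleration: every extra stretch of dy iterations in
    which the K exemplars stay converged increments b, so the reduction
    p <- p + b*ps grows while the NC is stuck on a wide p-range
    (p and ps are negative)."""
    if not converged:
        return p, 0           # convergence lost: reset the counter
    b += 1
    return p + b * ps, b
```

Successive converged stretches thus reduce p by ps, then 2·ps, then 3·ps, and so on, until the NC finally changes.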
Now the adaptive AP gives clustering solutions with different NCs through the p-scanning process, and a cluster validation technique is then used to evaluate the quality of these solutions. Validity indices are usually used to evaluate the quality of clustering results and to determine which clustering solution is optimal for the data set. Among the many validity indices, the Silhouette index, which reflects the compactness and separation of clusters, is widely used and performs well at NC estimation for obvious cluster structures. It is applicable both to the estimation of the optimal NC and to the evaluation of clustering quality. Hence we adopt the Silhouette index, as an illustration, to find the optimal clustering solution. Let a data set with n samples be divided into k clusters C_i (i = 1~k), let a(t) be the average dissimilarity of sample t of C_j to all other samples in C_j, let d(t, C_i) be the average dissimilarity of sample t of C_j to all samples in another cluster C_i, and let b(t) = min{d(t, C_i)}, i = 1~k, i ≠ j. The Silhouette formula for sample t is [3]:

Sil(t) = (b(t) − a(t)) / max{a(t), b(t)}    (1)

With Sil(t) for each sample, the overall average silhouette Sil for the n samples of the data set is obtained directly. The largest overall average silhouette indicates the best clustering quality and the optimal NC [3]. Using formula (1), a series of Sil values corresponding to the clustering solutions under different NCs is calculated, and the optimal clustering solution is found at the largest Sil.

Table 2. Procedure of adaptive p-scanning for searching NC space

Initialization: p ← pm/2, ps ← pm/100, b ← 0, v ← 40, dy ← 10, nits ← 0, maxits ← 50000.
for i ← 1 to maxits do
    Kset(i) ← K                                      △ K is the number of exemplars
    if point k is an exemplar then B(k, j) ← 1       △ j is the remainder of i/v
    else B(k, j) ← 0                                 △ j is the remainder of i/v
    if there are K exemplars with Σ_j B(k, j) = v then Hdown ← 1   △ K exemplars converge
    else Hdown ← 0, b ← 0, nits ← 0
    nits ← nits + 1
    if Hdown = 1 and nits ≥ dy then
        b ← b + 1
        q ← K/50 + 0.1
        p ← p + b × ps/q
        nits ← 0
    if K ≤ 2 then stop

3 Experimental Results

This section compares the clustering performance of the adaptive AP method (adAP) and the AP algorithm (AP). The performance items include: whether adAP can eliminate oscillations (if they occur) automatically so as to give correct clustering results, and whether adAP can give correct clustering results based on the Silhouette index (i.e., the cluster validation technique). adAP and AP use the same initial lam = 0.5 (except lam = 0.8 in the Travelroute experiment), and AP uses fixed p = pm and maxits = 2000. For the Document and Travelroute experiments, both methods use a fixed p from prior knowledge [1]. Let a data set be an n × d matrix X = {x_i}, where x_i is d-dimensional. For general data, the similarity between samples x_i and x_j is b_ij = −||x_i − x_j||², based on Euclidean distance, while for gene expression data the Pearson coefficient is used as the similarity measure, i.e., the linear relationship between two samples x_i and x_j (with means x̄_i and x̄_j) is:

R(i,j) = Σ_{l=1..d} (x_il − x̄_i)(x_jl − x̄_j) / sqrt( Σ_{l=1..d} (x_il − x̄_i)² · Σ_{l=1..d} (x_jl − x̄_j)² )    (2)

To avoid possible calculation confusion from negative values, R(i,j) ∈ [−1, 1] is transformed to R(i,j) = 1 − (1 + R(i,j))/2. Thus the Pearson coefficient is transformed into a positive Pearson distance R(i,j) ∈ [0, 1] (the bigger the value, the farther apart the two samples), and the similarity between samples x_i and x_j is b_ij = −R(i,j).
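The similarity measures above and the Silhouette criterion of Eq. (1) can be sketched together as follows (our illustration, not the authors' Matlab code; the silhouette routine assumes no singleton clusters):

```python
import numpy as np

def similarity_matrix(X, pearson=False):
    """Negative squared Euclidean distance for general data, or the
    negative 'Pearson distance' 1 - (1 + r)/2 for gene-expression data."""
    if not pearson:
        diff = X[:, None, :] - X[None, :, :]
        return -np.sum(diff ** 2, axis=2)
    r = np.corrcoef(X)                # Pearson coefficients in [-1, 1]
    return -(1.0 - (1.0 + r) / 2.0)   # distance in [0, 1], negated

def silhouette(D, labels):
    """Overall average Silhouette from a dissimilarity matrix D:
    Sil(t) = (b(t) - a(t)) / max{a(t), b(t)}."""
    n = D.shape[0]
    labels = np.asarray(labels)
    sils = np.empty(n)
    for t in range(n):
        own = labels[t]
        mask = (labels == own)
        mask[t] = False
        a = D[t, mask].mean()         # avg dissimilarity within own cluster
        b = min(D[t, labels == c].mean() for c in np.unique(labels) if c != own)
        sils[t] = (b - a) / max(a, b)
    return sils.mean()
```

Scanning p yields one label vector per NC; the solution maximizing `silhouette(D, labels)` is taken as optimal.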
Twelve data sets, listed in Table 3, are used in the experiments; the first eight have known class labels. Their features include: far and close well-separated clusters, slightly overlapping clusters, tight clusters and loose clusters. The first four data sets are simulated data, while the others are real data. Yeast and NCI60 are gene expression data, and a subset of the Exons data set is used, i.e., the first 3499 samples plus the last one (3500 samples in total) out of 75067 samples.

Table 3. Features of data sets

Data sets    Features of cluster structures   Number of clusters   Number of samples   Dimensions   Source
3k2lap       overlap, loose                   3                    300                 2            [5]
5k8close     close, loose                     5                    1000                8            [6]
14k10close   close, loose                     14                   480                 10           [5]
22k10far     far, tight                       22                   790                 10           [5]
Ionosphere   overlap, loose                   2                    351                 4            [7]
Wine         overlap, loose                   3                    178                 3            [7]
Yeast        far, loose                       4                    208                 79           [8]
NCI60        overlap, loose                   8                    58                  20           [9]
FaceImage    overlap                          100                  900                 50×50        [1]
Document     /                                4                    125                 /            [1]
Travelroute  /                                7                    456                 3            [1]
Exons        /                                /                    3500                12           [1]

The clustering results of adAP and AP are listed in Table 4, where 'adAP error' denotes the error rate of adAP solutions (compared with the true class labels), 'yes' in 'adAP eliminates oscillations' denotes that oscillations occurred and adAP eliminated them automatically, 'adAP time' and 'AP time' denote the running times of the Matlab programs of adAP and AP on the same computer (Intel CPU 3.60 GHz, 2 GB), and FM denotes the Fowlkes-Mallows index [3]. FM, with values in [0, 1], measures the agreement between a clustering solution and the true class labels; the bigger the value, the better the agreement, e.g., the FM value is 1 when the solution and the true labels are identical. The FM index is used to evaluate clustering quality when the NC of a clustering solution differs from the true NC.
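For reference, the Fowlkes-Mallows agreement can be computed over sample pairs as below (a sketch; O(n²), and undefined when a labeling has no co-clustered pairs):

```python
import numpy as np
from itertools import combinations

def fowlkes_mallows(pred, truth):
    """FM = TP / sqrt((TP + FP) * (TP + FN)) over all sample pairs, where
    TP counts pairs co-clustered in both labelings, FP pairs co-clustered
    only in pred, and FN pairs co-clustered only in truth."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_p = pred[i] == pred[j]
        same_t = truth[i] == truth[j]
        tp += same_p and same_t
        fp += same_p and not same_t
        fn += same_t and not same_p
    return tp / np.sqrt((tp + fp) * (tp + fn))
```

Because FM is defined over pairs rather than matched cluster labels, it remains usable when the found NC differs from the true NC, which is exactly the case described above.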
In addition, the last four data sets have no true class labels, so there are no error rates or FM values for them; for Exons, the values in the FM column denote the correct rate of found exons.

Table 4. Clustering results of adAP and AP

Data sets     known NC   adAP NC   adAP error (%)   adAP FM   adAP time (s)   adAP eliminates oscillations   AP NC   AP FM   AP time (s)
3k2lap        3          3         7.67             0.85      144.0           /                              16      0.39    2.1
5k8close      5          5         0                1.00      1851.0          /                              17      0.56    31.2
14k10close    14         14        0                1.00      275.5           /                              15      0.97    6.0
22k10far      22         22        0                1.00      1125.9          yes                            168     0.80    307.8
Ionosphere    2          2         17.4             0.75      445.3           yes                            28      0.43    56.8
Wine          3          3         10.7             0.80      34.5            /                              11      0.46    0.5
Yeast         4          4         3.37             0.97      54.7            /                              11      0.66    0.9
NCI60         8          8         /                0.56      12.9            /                              9       0.48    0.1
FaceImage     100        102       /                /         3701.2          /                              103     /       14.5
Document      4          4         /                /         0.3             /                              4       /       0.2
Travelroute   7          7         /                /         24.7            /                              7       /       20.0
Exons         /          102       /                32.8%     83073.7         /                              37      22.4%   996.0

From Table 4 one can see that, for all data sets except the last four, adAP gives the correct NC in every case, while AP fails in every case; the FM values of adAP are higher than those of AP, indicating that adAP gives better clustering quality than AP; and oscillations lead AP to poor solutions for 22k10far and Ionosphere. The clustering task for the Document data is to find representative sentences (cluster centers), and both adAP and AP find the same four representative sentences; the task for the Travelroute data is to find appropriate airports (cluster centers) to serve as airport hubs, and both adAP and AP find the same seven airports. The clustering task for the Exons data is to find the cluster of exons, which is realized by finding the cluster of non-exons (it is known that the last sample is a non-exon), and adAP has a higher identification rate of exons than AP. Some of the clusters of the 900 face images from 100 persons in the FaceImage data overlap on account of similar-looking faces, so the separability of the 100 clusters is not good; in this case adAP yields a better result of 102 clusters than AP.
The experimental results show that adaptive AP gives correct clustering results based on the Silhouette index and eliminates oscillations automatically. These results demonstrate that the adaptive damping, adaptive escape and adaptive preference scanning techniques in the adaptive AP method are effective, resulting in better performance of adAP than the original AP.

4 Conclusion

The proposed adaptive AP uses adaptive preference scanning to search the space of the number of clusters, and finds the optimal clustering solution for a data set by means of a cluster validation technique. Moreover, in adaptive AP the adaptive damping technique is designed to eliminate oscillations automatically instead of manually, and the adaptive escape technique is developed to eliminate oscillations when the damping technique fails. With these adaptive techniques, adaptive AP can outperform or equal the AP algorithm in clustering quality and oscillation elimination. In addition, which validity method combined with adaptive AP is most appropriate for a data set with complex cluster structures, such as overlapping clusters, is worth further research.

References

1. Frey B J, Dueck D. Clustering by Passing Messages between Data Points. Science, 2007, 315(5814): 972-976. http://www.psi.toronto.edu/affinitypropagation
2. Karen K. Affinity program slashes computing times. [online] available: http://www.news.utoronto.ca/bin6/070215-2952.asp, Oct. 25, 2007.
3. Dudoit S, Fridlyand J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 2002, 3(7): 0036.1-0036.21.
4. Yu J, Cheng Q S. The upper bound of the optimal number of clusters in fuzzy clustering. Science in China, Ser. F, 2001, 44(2): 119-125.
5. Dembélé D, Kastner P.
Fuzzy C-means method for clustering microarray data. Bioinformatics, 2003, 19(8): 973-980.
6. Strehl A. Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. Ph.D. thesis, The University of Texas at Austin, May 2002.
7. Blake C L, Merz C J. UCI repository of machine learning databases. [online] available: http://mlearn.ics.uci.edu/MLRepository.html
8. Ben-Hur A, Guyon I. A stability based method for discovering structure in clustered data. In: Proceedings of the 7th Pacific Symposium on Biocomputing. Lihue, Hawaii, USA, 2002, 6-17.
9. Ross D T, Scherf U, Eisen M B, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 2000, 24(3): 227-234.
10. http://www.mathworks.com/matlabcentral/fileexchange/loadAuthor.do?objectType=author&objectId=1095267
