Clustering processes

The problem of clustering is considered, for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency, and show that simple consistent algorithms exist, under most general non-parametric assumptions.

Authors: Daniil Ryabko (INRIA Lille - Nord Europe)

Clustering processes

Daniil Ryabko
INRIA Lille - Nord Europe, daniil@ryabko.net

Abstract

The problem of clustering is considered, for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency, and show that simple consistent algorithms exist, under most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes. In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.

1 Introduction

Given a finite set of objects, the problem is to "cluster" similar objects together. This intuitively simple goal is notoriously hard to formalize. Most of the work on clustering is concerned with particular parametric data-generating models, or particular algorithms, a given similarity measure, and (very often) a given number of clusters. It is clear that, as in almost all learning problems, in clustering finding the right similarity measure is an integral part of the problem. However, even if one assumes the similarity measure known, it is hard to define what a good clustering is (Kleinberg, 2002; Zadeh & Ben-David, 2009). What is more, even if one assumes the similarity measure to be simply the Euclidean distance (on the plane), and the number of clusters $k$ known, clustering may still appear intractable for computational reasons. Indeed, in this case finding $k$ centres (points which minimize the cumulative distance from each point in the sample to one of the centres) seems to be a natural goal, but this problem is NP-hard (Mahajan et al., 2009).

In this work we concentrate on a subset of the clustering problem: clustering processes. That is, each data point is itself a sample generated by a certain discrete-time stochastic process. This version of the problem has numerous applications, such as clustering biological data, financial observations, or behavioural patterns, and as such it has gained tremendous attention in the literature.

The main observation that we make in this work is that, in the case of clustering processes, one can benefit from the notion of ergodicity to define what appears to be a very natural notion of consistency. This notion of consistency is shown to be satisfied by simple algorithms that we present, which are polynomial in all arguments. This can be achieved without any modeling assumptions on the data (e.g. Hidden Markov, Gaussian, etc.), and without assuming independence of any kind within or between the samples. The only assumption that we make is that the joint distribution of the data is stationary ergodic.
The assumption of stationarity means, intuitively, that the time index itself bears no information: it does not matter whether we started recording observations at time 0 or at time 100. By virtue of the ergodic theorem, any stationary process can be represented as a mixture of stationary ergodic processes. In other words, a stationary process can be thought of as first selecting a stationary ergodic process (according to some prior distribution) and then observing its outcomes. Thus, the assumption that the data is stationary ergodic is both very natural and rather weak. At the same time, ergodicity means that, asymptotically, the properties of the process can be learned from observation.

This allows us to define the clustering problem as follows. $N$ samples are given: $x_1 = (X^1_1, \dots, X^1_{n_1}), \dots, x_N = (X^N_1, \dots, X^N_{n_N})$. Each sample is drawn by one out of $k$ different stationary ergodic distributions. The samples are not assumed to be drawn independently; rather, it is assumed that the joint distribution of the samples is stationary ergodic. The target clustering is as follows: those and only those samples are put into the same cluster that were generated by the same distribution. The number $k$ of target clusters can be either known or unknown (different consistency results can be obtained in these cases). A clustering algorithm is called asymptotically consistent if the probability that it outputs the target clustering converges to 1 as the lengths $(n_1, \dots, n_N)$ of the samples tend to infinity (a variant of this definition is to require the algorithm to stabilize on the correct answer with probability 1). Note the particular regime of asymptotics: not with respect to the number of samples $N$, but with respect to the lengths of the samples $n_1, \dots, n_N$.

Similar formulations have appeared in the literature before. Perhaps the closest approach is mixture models (Smyth, 1997; Zhong & Ghosh, 2003): it is assumed that there are $k$ different distributions that have a particular known form (such as Gaussian, Hidden Markov models, or graphical models) and each one of the $N$ samples is generated independently according to one of these $k$ distributions (with some fixed probability). Since the model of the data is specified quite well, one can use likelihood-based distances (and then, for example, the $k$-means algorithm), or Bayesian inference, to cluster the data. Clearly, the main difference from our setting is that we do not assume any known model of the data; not even between-sample independence is assumed.

The problem of clustering in our formulation generalizes two classical problems of mathematical statistics. The first one is homogeneity testing, or the two-sample problem. Two samples $x_1 = (X^1_1, \dots, X^1_{n_1})$ and $x_2 = (X^2_1, \dots, X^2_{n_2})$ are given, and it is required to test whether they were generated by the same distribution or by different distributions. This corresponds to clustering just two data points ($N = 2$) with the number $k$ of clusters unknown: either $k = 1$ or $k = 2$. The second problem is process classification, or the three-sample problem. Three samples $x_1, x_2, x_3$ are given; it is known that two of them were generated by the same distribution, while the third one was generated by a different distribution.
It is required to find out which two were generated by the same distribution. This corresponds to clustering three data points, with the number of clusters known: $k = 2$. The classical approach is of course to consider Gaussian i.i.d. data, but general non-parametric solutions exist not only for i.i.d. data (Lehmann, 1986), but also for Markov chains (Gutman, 1989), and under certain conditions on mixing rates. What is important for us here is that the three-sample problem is easier than the two-sample problem; the reason is that $k$ is known in the former case but not in the latter. Indeed, in Ryabko (2010b) it is shown that in general, for stationary ergodic (binary-valued) processes, there is no solution to the two-sample problem, even in the weakest asymptotic sense. However, a solution to the three-sample problem for (real-valued) stationary ergodic processes was given in Ryabko & Ryabko (2010).

In this work we demonstrate that, if the number $k$ of clusters is known, then there is an asymptotically consistent clustering algorithm, under the only assumption that the joint distribution of the data is stationary ergodic. If $k$ is unknown, then in this general case there is no consistent clustering algorithm (as follows from the mentioned result for the two-sample problem). However, if an upper bound $\alpha_n$ on the $\alpha$-mixing rates of the joint distribution of the processes is known, and $\alpha_n \to 0$, then there is a consistent clustering algorithm. Both algorithms are rather simple, and are based on empirical estimates of the so-called distributional distance. For two processes $\rho_1, \rho_2$ a distributional distance $d$ is defined as $\sum_{k=1}^{\infty} w_k |\rho_1(B_k) - \rho_2(B_k)|$, where the $w_k$ are positive summable real weights, e.g. $w_k = 2^{-k}$, and the $B_k$ range over a countable field that generates the sigma-algebra of the underlying probability space. For example, if we are talking about finite-alphabet processes with the binary alphabet $A = \{0, 1\}$, the $B_k$ would range over the set $A^* = \cup_{k \in \mathbb{N}} A^k$, that is, over all tuples $0, 1, 00, 01, 10, 11, 000, 001, \dots$ (of course, we could just as well omit, say, $1$ and $11$); the distributional distance in this case is therefore the weighted sum of differences of the probabilities of all possible tuples. In this work we consider real-valued processes, so the $B_k$ have to range over a suitable sequence of intervals, all pairs of such intervals, triples, etc. (see the formal definitions below). This distance has proved a useful tool for solving various statistical problems concerning ergodic processes (Ryabko & Ryabko, 2010; Ryabko, 2010a).

Although this distance involves infinite summation, we show that its empirical approximations can be easily calculated. For the case of a known number of clusters, the proposed algorithm (which is shown to be consistent) is as follows. (The distance in the algorithms is a suitable empirical estimate of $d$.) The first sample is assigned to the first cluster. For each $j = 2..k$, find a point that maximizes the minimal distance to those points already assigned to clusters, and assign it to cluster $j$. Thus we have one point in each of the $k$ clusters. Next, assign each of the remaining points to the cluster that contains the closest point among those $k$ already assigned.
For the case of an unknown number of clusters $k$, the algorithm simply puts together those samples that are not farther away from each other than a certain threshold level, where the threshold is calculated based on the known bound on the mixing rates. In this case, besides the asymptotic result, finite-time bounds on the probability of outputting an incorrect clustering can be obtained. Each of the algorithms is shown to be at most quadratic in each argument.

Therefore, we show that for the proposed notion of consistency there are simple algorithms that are consistent under most general assumptions. While these algorithms can be easily implemented, we have left the problem of trying them out on particular applications, as well as optimizing the parameters, for future research. It may also be suggested that the empirical distributional distance can be replaced by other distances, for which similar theoretical results can be obtained. An interesting direction, which could preserve the theoretical generality, would be to use data compressors. These were used in Ryabko & Astola (2006) for the related problems of hypotheses testing, leading both to theoretical and practical results. As far as clustering is concerned, compression-based methods were used (without asymptotic consistency analysis) in Cilibrasi & Vitanyi (2005), and (in a different way) in Bagnall et al. (2006). Combining our consistency framework with these compression-based methods is a promising direction for further research.

2 Preliminaries

Let $A$ be an alphabet, and denote $A^*$ the set of tuples $\cup_{i=1}^{\infty} A^i$. In this work we consider the case $A = \mathbb{R}$; extensions to the multidimensional case, as well as to more general spaces, are straightforward. Distributions, or (stochastic) processes, are measures on the space $(A^\infty, \mathcal{F}_{A^\infty})$, where $\mathcal{F}_{A^\infty}$ is the Borel sigma-algebra of $A^\infty$. When talking about joint distributions of $N$ samples, we mean distributions on the space $((A^N)^\infty, \mathcal{F}_{(A^N)^\infty})$.

For each $k, l \in \mathbb{N}$, let $B^{k,l}$ be the partition of the set $A^k$ into $k$-dimensional cubes with volume $h_l^k = (1/l)^k$ (the cubes start at 0). Moreover, define $B^k = \cup_{l \in \mathbb{N}} B^{k,l}$ and $B = \cup_{k=1}^{\infty} B^k$. The set $\{B \times A^\infty : B \in B^{k,l},\ k, l \in \mathbb{N}\}$ generates the Borel $\sigma$-algebra on $\mathbb{R}^\infty = A^\infty$. For a set $B \in B$, let $|B|$ be the index $k$ of the set $B^k$ that $B$ comes from: $|B| = k : B \in B^k$.

We use the abbreviation $X_{1..k}$ for $X_1, \dots, X_k$. For a sequence $x \in A^n$ and a set $B \in B$, denote $\nu(x, B)$ the frequency with which the sequence $x$ falls in the set $B$:

$$\nu(x, B) := \begin{cases} \dfrac{1}{n - |B| + 1} \sum_{i=1}^{n - |B| + 1} \mathbb{I}\{(X_i, \dots, X_{i + |B| - 1}) \in B\} & \text{if } n \ge |B|, \\ 0 & \text{otherwise.} \end{cases}$$

A process $\rho$ is stationary if $\rho(X_{1..|B|} = B) = \rho(X_{t..t + |B| - 1} = B)$ for any $B \in A^*$ and $t \in \mathbb{N}$. We further abbreviate $\rho(B) := \rho(X_{1..|B|} = B)$. A stationary process $\rho$ is called (stationary) ergodic if the frequency of occurrence of each word $B$ in a sequence $X_1, X_2, \dots$ generated by $\rho$ tends to its a priori (or limiting) probability a.s.: $\rho(\lim_{n \to \infty} \nu(X_{1..n}, B) = \rho(B)) = 1$. Denote $\mathcal{E}$ the set of all stationary ergodic processes.
Definition 1 (distributional distance). The distributional distance is defined for a pair of processes $\rho_1, \rho_2$ as follows (e.g. Gray (1988)):
$$d(\rho_1, \rho_2) = \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} |\rho_1(B) - \rho_2(B)|,$$
where $w_j = 2^{-j}$.

(The weights in the definition are fixed for the sake of concreteness only; we could take any other summable sequence of positive weights instead.) In words, we take a sum over a series of partitions into cubes of decreasing volume (indexed by $l$) of all sets $A^k$, $k \in \mathbb{N}$, and count the differences in probabilities of all cubes in all these partitions. These differences in probabilities are weighted: smaller weights are given to larger $k$ and to finer partitions. It is easy to see that $d$ is a metric. We refer to Gray (1988) for more information on this metric and its properties.

The clustering algorithms presented below are based on empirical estimates of the distance $d$:
$$\hat d(X^1_{1..n_1}, X^2_{1..n_2}) = \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} |\nu(X^1_{1..n_1}, B) - \nu(X^2_{1..n_2}, B)|, \quad (1)$$
where $n_1, n_2 \in \mathbb{N}$ and $X^i_{1..n_i} \in A^{n_i}$. Although the expression (1) involves taking three infinite sums, it will be shown below that it can be easily calculated.

Lemma 1 ($\hat d$ is consistent). Let $\rho_1, \rho_2 \in \mathcal{E}$ and let two samples $x_1 = X^1_{1..n_1}$ and $x_2 = X^2_{1..n_2}$ be generated by a distribution $\rho$ such that the marginal distribution of $X^i_{1..n_i}$ is $\rho_i$, $i = 1, 2$, and the joint distribution $\rho$ is stationary ergodic. Then
$$\lim_{n_1, n_2 \to \infty} \hat d(X^1_{1..n_1}, X^2_{1..n_2}) = d(\rho_1, \rho_2) \quad \rho\text{-a.s.}$$

Proof. The idea of the proof is simple: for each set $B \in B$, the frequency with which the sample $x_1$ falls into $B$ converges to the probability $\rho_1(B)$, and analogously for the second sample. As the sample sizes grow, there will be more and more sets $B \in B$ whose frequencies have already converged to the probabilities, so that the cumulative weight of those sets whose frequencies have not yet converged will tend to 0.

For any $\varepsilon > 0$ we can find an index $J$ such that $\sum_{i,j=J}^{\infty} w_i w_j < \varepsilon/3$. Moreover, for each $m, l$ we can find elements $B^{m,l}_1, \dots, B^{m,l}_{t_{m,l}}$ of the partition $B^{m,l}$, for some $t_{m,l} \in \mathbb{N}$, such that $\rho_i(\cup_{j=1}^{t_{m,l}} B^{m,l}_j) \ge 1 - \varepsilon/(6 J w_m w_l)$. For each $B^{m,l}_j$, where $m, l \le J$ and $j \le t_{m,l}$, we have $\nu(X^1_{1..n_1}, B^{m,l}_j) \to \rho_1(B^{m,l}_j)$ a.s., so that $|\nu(X^1_{1..n_1}, B^{m,l}_j) - \rho_1(B^{m,l}_j)| < \rho_1(B^{m,l}_j)\, \varepsilon/(6 J w_m w_l)$ for all $n_1 \ge u$, for some $u \in \mathbb{N}$; define $U^{m,l}_j := u$. Let $U := \max_{m,l \le J,\, j \le t_{m,l}} U^{m,l}_j$ ($U$ depends on the realization $X^1_1, X^1_2, \dots$). Define $V$ analogously for the sequence $X^2_1, X^2_2, \dots$. Thus, for $n_1 > U$ and $n_2 > V$ we have
$$|\hat d(x_1, x_2) - d(\rho_1, \rho_2)| = \Big| \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} \big( |\nu(x_1, B) - \nu(x_2, B)| - |\rho_1(B) - \rho_2(B)| \big) \Big|$$
$$\le \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} \big( |\nu(x_1, B) - \rho_1(B)| + |\nu(x_2, B) - \rho_2(B)| \big)$$
$$\le \sum_{m,l=1}^{J} w_m w_l \sum_{i=1}^{t_{m,l}} \big( |\nu(x_1, B^{m,l}_i) - \rho_1(B^{m,l}_i)| + |\nu(x_2, B^{m,l}_i) - \rho_2(B^{m,l}_i)| \big) + 2\varepsilon/3$$
$$\le \sum_{m,l=1}^{J} w_m w_l \sum_{i=1}^{t_{m,l}} \big( \rho_1(B^{m,l}_i)\, \varepsilon/(6 J w_m w_l) + \rho_2(B^{m,l}_i)\, \varepsilon/(6 J w_m w_l) \big) + 2\varepsilon/3 \le \varepsilon,$$
which proves the statement. □
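To make the estimate (1) concrete, here is a minimal Python sketch of ours (the paper itself gives no code): for each tuple length $m$ and each partition $B^{m,l}$, only cells actually visited by one of the samples contribute, so the inner sum is finite. The truncation caps `m_max` and `l_max` are assumptions of this sketch; Proposition 1 below makes precise how far the sums really need to go.

```python
# A minimal sketch (not from the paper) of the empirical distributional
# distance (1), for samples with values in [0, 1). A cube of B^{m,l} is
# identified by the integer grid coordinates floor(x_i * l) of an m-tuple;
# only cubes visited by at least one sample have non-zero frequency, so the
# inner sum is finite. The caps m_max, l_max are assumptions of this sketch.
from collections import Counter


def cell_frequencies(x, m, l):
    """nu(x, B) for every cube B in B^{m,l} that the sample x visits."""
    n = len(x)
    if n < m:
        return Counter()
    counts = Counter(
        tuple(int(x[i + j] * l) for j in range(m)) for i in range(n - m + 1)
    )
    return Counter({cell: c / (n - m + 1) for cell, c in counts.items()})


def d_hat(x1, x2, m_max, l_max):
    """Empirical distributional distance, truncated at m_max, l_max."""
    total = 0.0
    for m in range(1, m_max + 1):
        for l in range(1, l_max + 1):
            nu1, nu2 = cell_frequencies(x1, m, l), cell_frequencies(x2, m, l)
            # Sum |nu1(B) - nu2(B)| over cells where either is non-zero.
            t_ml = sum(abs(nu1[b] - nu2[b]) for b in set(nu1) | set(nu2))
            total += 2.0 ** (-m) * 2.0 ** (-l) * t_ml
    return total
```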
3 Main results

The clustering problem can be defined as follows. We are given $N$ samples $x_1, \dots, x_N$, where each sample $x_i$ is a string of length $n_i$ of symbols from $A$: $x_i = X^i_{1..n_i}$. Each sample is generated by one out of $k$ different unknown stationary ergodic distributions $\rho_1, \dots, \rho_k \in \mathcal{E}$. Thus, there is a partitioning $I = \{I_1, \dots, I_k\}$ of the set $\{1..N\}$ into $k$ disjoint subsets $I_j$, $j = 1..k$,
$$\{1..N\} = \cup_{j=1}^{k} I_j,$$
such that $x_i$, $1 \le i \le N$, is generated by $\rho_j$ if and only if $i \in I_j$. The partitioning $I$ is called the target clustering, and the sets $I_i$, $1 \le i \le k$, are called the target clusters. Given samples $x_1, \dots, x_N$ and a target clustering $I$, let $I(x)$ denote the cluster that contains $x$. A clustering function $F$ takes a finite number of samples $x_1, \dots, x_N$ and an optional parameter $k$ (the target number of clusters) and outputs a partition $F(x_1, \dots, x_N, (k)) = \{T_1, \dots, T_k\}$ of the set $\{1..N\}$.

Definition 2 (asymptotic consistency). Let a finite number $N$ of samples be given, and let the target clustering partition be $I$. Define $n = \min\{n_1, \dots, n_N\}$. A clustering function $F$ is strongly asymptotically consistent if $F(x_1, \dots, x_N, (k)) = I$ from some $n$ on with probability 1. A clustering function is weakly asymptotically consistent if $P(F(x_1, \dots, x_N, (k)) = I) \to 1$.

Note that the consistency is asymptotic with respect to the minimal length of the samples, and not with respect to the number of samples.

3.1 Known number of clusters

Algorithm 1 is a simple clustering algorithm which, given the number $k$ of clusters, will be shown to be consistent under most general assumptions. It works as follows. The point $x_1$ is assigned to the first cluster. Next, find the point that is farthest away from $x_1$ in the empirical distributional distance $\hat d$, and assign this point to the second cluster. For each $j = 3..k$, find a point that maximizes the minimal distance to those points already assigned to clusters, and assign it to cluster $j$. Thus we have one point in each of the $k$ clusters. Next, simply assign each of the remaining points to the cluster that contains the closest point among those $k$ already assigned. (One may notice that Algorithm 1 is one iteration of the $k$-means algorithm, with a specific initialization and a specially designed distance.)

Algorithm 1 (the case of a known number of clusters $k$):

INPUT: the number of clusters $k$; samples $x_1, \dots, x_N$.
Initialize: $c_1 := 1$, $T_1 := \{x_{c_1}\}$.
for $j := 2$ to $k$ do
    $c_j := \operatorname{argmax}_{i=1..N} \min_{t=1..j-1} \hat d(x_i, x_{c_t})$
    $T_j := \{x_{c_j}\}$
end for
for $i := 1$ to $N$ do
    put $x_i$ into the set $T_j$ with $j = \operatorname{argmin}_{j'=1..k} \hat d(x_i, x_{c_{j'}})$
end for
OUTPUT: the sets $T_j$, $j = 1..k$.
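For illustration, the following Python sketch (ours, not from the paper) implements Algorithm 1 as described above; the parameter `dist` stands for any pairwise distance estimate, e.g. the `d_hat` sketched in Section 2 with fixed truncation caps.

```python
# A sketch of Algorithm 1: farthest-point initialization in the empirical
# distributional distance, followed by nearest-centre assignment. `dist` is
# any pairwise distance estimate; caching its values would keep the number
# of distance evaluations at O(kN), as in the analysis below.

def algorithm1(samples, k, dist):
    """Partition the sample indices (0-based here) into k clusters."""
    n_samples = len(samples)
    centres = [0]  # c_1 := 1: the first sample seeds the first cluster
    for _ in range(2, k + 1):
        # c_j := the point maximizing the minimal distance to chosen centres
        c_j = max(
            range(n_samples),
            key=lambda i: min(dist(samples[i], samples[c]) for c in centres),
        )
        centres.append(c_j)
    # assign every point to the cluster of its closest centre
    clusters = [[] for _ in range(k)]
    for i in range(n_samples):
        j = min(range(k), key=lambda t: dist(samples[i], samples[centres[t]]))
        clusters[j].append(i)
    return clusters

# Example usage (hypothetical data and caps):
# clusters = algorithm1(samples, k=2, dist=lambda a, b: d_hat(a, b, 5, 5))
```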
Proposition 1 (calculating $\hat d(x_1, x_2)$). For two samples $x_1 = X^1_{1..n_1}$ and $x_2 = X^2_{1..n_2}$, the computational complexity (time and space) of calculating the empirical distributional distance $\hat d(x_1, x_2)$ of (1) is $O(n^2 \log s_{\min}^{-1})$, where $n = \max(n_1, n_2)$ and $s_{\min} = \min_{i=1..n_1,\, j=1..n_2,\, X^1_i \ne X^2_j} |X^1_i - X^2_j|$.

Proof. First, observe that for fixed $m$ and $l$, the sum
$$T^{m,l} := \sum_{B \in B^{m,l}} |\nu(X^1_{1..n_1}, B) - \nu(X^2_{1..n_2}, B)| \quad (2)$$
has not more than $n_1 + n_2 - 2m + 2$ non-zero terms (assuming $m \le n_1, n_2$; the other case is obvious). Indeed, for each $i = 1, 2$, in the sample $x_i$ there are $n_i - m + 1$ tuples of size $m$: $X^i_{1..m}, X^i_{2..m+1}, \dots, X^i_{n_i - m + 1..n_i}$. Therefore, the complexity of calculating $T^{m,l}$ is $O(n_1 + n_2 - 2m + 2) = O(n)$. Furthermore, observe that for each $m$, the term $T^{m,l}$ is constant for all $l > \log s_{\min}^{-1}$: once the cubes are finer than $s_{\min}$, no two distinct values occurring in the samples ever share a cell. Therefore, it is enough to calculate $T^{m,1}, \dots, T^{m, \log s_{\min}^{-1}}$, since for fixed $m$
$$\sum_{l=1}^{\infty} w_m w_l T^{m,l} = w_m w_{\log s_{\min}^{-1}} T^{m, \log s_{\min}^{-1}} + \sum_{l=1}^{\log s_{\min}^{-1}} w_m w_l T^{m,l}$$
(that is, we double the weight of the last non-zero term). Thus, the complexity of calculating $\sum_{l=1}^{\infty} w_m w_l T^{m,l}$ is $O(n \log s_{\min}^{-1})$. Finally, for all $m > n$ we have $T^{m,l} = 0$. Since $\hat d(x_1, x_2) = \sum_{m,l=1}^{\infty} w_m w_l T^{m,l}$, the statement is proven. □

Theorem 1. Let $N \in \mathbb{N}$ and suppose that the samples $x_1, \dots, x_N$ are generated in such a way that the joint distribution is stationary ergodic. If the correct number of clusters $k$ is known, then Algorithm 1 is strongly asymptotically consistent. Algorithm 1 makes $O(kN)$ calculations of $\hat d(\cdot, \cdot)$, so that its computational complexity is $O(k N n_{\max}^2 \log s_{\min}^{-1})$, where $n_{\max} = \max_{i=1..N} n_i$ and $s_{\min} = \min_{u,v=1..N,\, u \ne v,\, i=1..n_u,\, j=1..n_v,\, X^u_i \ne X^v_j} |X^u_i - X^v_j|$.

Observe that the samples are not required to be generated independently. The only requirement on the distribution of the samples is that the joint distribution is stationary ergodic. This is perhaps one of the mildest possible probabilistic assumptions.

Proof. By Lemma 1, $\hat d(x_i, x_j)$, $i, j \in \{1..N\}$, converges to 0 if and only if $x_i$ and $x_j$ are in the same cluster. Since there are only finitely many samples $x_i$, there exists some $\delta > 0$ such that, from some $n$ on, we will have $\hat d(x_i, x_j) < \delta$ if $x_i, x_j$ belong to the same target cluster ($I(x_i) = I(x_j)$), and $\hat d(x_i, x_j) > \delta$ otherwise ($I(x_i) \ne I(x_j)$). Therefore, from some $n$ on, for every $j \le k$ we will have $\max_{i=1..N} \min_{t=1..j-1} \hat d(x_i, x_{c_t}) > \delta$, and the sample $x_{c_j}$, where $c_j = \operatorname{argmax}_{i=1..N} \min_{t=1..j-1} \hat d(x_i, x_{c_t})$, will be selected from a target cluster that does not contain any $x_{c_i}$, $i < j$. The consistency statement follows.

Next, let us find how many pairwise distance estimates $\hat d(x_i, x_j)$ the algorithm has to make. On the first iteration of the loop, it has to calculate $\hat d(x_i, x_{c_1})$ for all $i = 1..N$. On the second iteration, it needs again $\hat d(x_i, x_{c_1})$ for all $i = 1..N$, which are already calculated, and also $\hat d(x_i, x_{c_2})$ for all $i = 1..N$, and so on: on the $j$th iteration of the loop we need to calculate $\hat d(x_i, x_{c_j})$, $i = 1..N$, which gives at most $kN$ pairwise distance calculations in total. The statement about the computational complexity follows from this and Proposition 1: indeed, apart from the calculation of $\hat d$, the rest of the computations is of order $O(kN)$. □

Complexity–precision trade-off. The bound on the computational complexity of Algorithm 1 given in Theorem 1 is for the case of precisely calculated distance estimates $\hat d(\cdot, \cdot)$. However, precise estimates are not needed if we only want to have an asymptotically consistent algorithm.
Indeed, following the proof of Lemma 1, it is easy to check that if we replace in (1) the infinite sums with sums over any number of terms $m_n, l_n$ that grows to infinity with $n = \min(n_1, n_2)$, and if we replace the partitions $B^{m,l}$ by their (finite) subsets $B^{m,l,n}$ which increase to $B^{m,l}$, then we still have a consistent estimate of $d(\cdot, \cdot)$.

Definition 3 ($\check d$). Let $m_n, l_n$ be some sequences of numbers, let $B^{m,l,n} \subset B^{m,l}$ for all $m, l, n \in \mathbb{N}$, and denote $n := \min\{n_1, n_2\}$. Define
$$\check d(X^1_{1..n_1}, X^2_{1..n_2}) := \sum_{m=1}^{m_n} \sum_{l=1}^{l_n} w_m w_l \sum_{B \in B^{m,l,n}} |\nu(X^1_{1..n_1}, B) - \nu(X^2_{1..n_2}, B)|. \quad (3)$$

Lemma 2 ($\check d$ is consistent). Assume the conditions of Lemma 1. Let $l_n$ and $m_n$ be any sequences of integers that go to infinity with $n$, and let, for each $m, l \in \mathbb{N}$, the sets $B^{m,l,n}$, $n \in \mathbb{N}$, be an increasing sequence of subsets of $B^{m,l}$ such that $\cup_{n \in \mathbb{N}} B^{m,l,n} = B^{m,l}$. Then
$$\lim_{n_1, n_2 \to \infty} \check d(X^1_{1..n_1}, X^2_{1..n_2}) = d(\rho_1, \rho_2) \quad \rho\text{-a.s.}$$

Proof. It is enough to observe that
$$\lim_{n_1, n_2 \to \infty} \sum_{m=1}^{m_n} \sum_{l=1}^{l_n} w_m w_l \sum_{B \in B^{m,l,n}} |\rho_1(B) - \rho_2(B)| = d(\rho_1, \rho_2),$$
and then follow the proof of Lemma 1. □

If we use the estimate $\check d(\cdot, \cdot)$ in Algorithm 1 (instead of $\hat d(\cdot, \cdot)$), then we still get an asymptotically consistent clustering function. Thus the following statement holds true.

Proposition 2. Assume the conditions of Theorem 1. For all sequences $m_n, l_n$ of numbers that increase to infinity with $n$, there is a strongly asymptotically consistent clustering algorithm whose computational complexity is $O(k N n_{\max} m_{n_{\max}} l_{n_{\max}})$.

On the one hand, Proposition 2 can be thought of as an artifact of the asymptotic definition of consistency; on the other hand, in practice precise calculation of $\hat d(\cdot, \cdot)$ is hardly necessary. What we get from Proposition 2 is the possibility to select the appropriate trade-off between the computational burden and the precision of clustering before the asymptotic. Note that the bound in Proposition 2 does not involve the sizes of the sets $B^{m,l,n}$; in particular, one can take $B^{m,l,n} = B^{m,l}$ for all $n$. This is because, for every two samples $X^1_{1..n}$ and $X^2_{1..n}$, this sum has no more than $2n$ non-zero terms, whatever $m$ and $l$ are. However, in the following section, where we are after clustering with an unknown number of clusters $k$, and thus after controlled rates of convergence, the sizes of the sets $B^{m,l,n}$ will appear in the bounds.
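To make Definition 3 concrete, here is a short sketch of ours that reuses `d_hat` from the Section 2 sketch (so it assumes that function is in scope, and takes $B^{m,l,n} = B^{m,l}$, which Proposition 2 allows). The logarithmic cap below is only an illustrative assumption; by Lemma 2, any sequences increasing to infinity with $n$ give a consistent estimate.

```python
# A sketch of the truncated estimate d_check of Definition 3, reusing the
# d_hat sketch from Section 2. The choice m_n = l_n = ceil(log2(n + 1)) is
# an illustrative assumption, not prescribed by the paper; any caps growing
# to infinity with n = min(n1, n2) satisfy Lemma 2 and Proposition 2.
import math


def d_check(x1, x2):
    n = min(len(x1), len(x2))
    cap = max(1, math.ceil(math.log2(n + 1)))
    return d_hat(x1, x2, m_max=cap, l_max=cap)
```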
3.2 Unknown number of clusters

So far we have shown that when the number of clusters is known in advance, consistent clustering is possible under the only assumption that the joint distribution of the samples is stationary ergodic. However, under this assumption, in general, consistent clustering with an unknown number of clusters is impossible. Indeed, as was shown in Ryabko (2010b), when we have only two binary-valued samples, generated independently by two stationary ergodic distributions, it is impossible to decide whether they have been generated by the same or by different distributions, even in the sense of weak asymptotic consistency (this holds even if the distributions come from a smaller class: the set of all B-processes). Therefore, if the number of clusters is unknown, we have to settle for less, which means that we have to make stronger assumptions on the data. What we need are known rates of convergence of frequencies to their expectations. Such rates are provided by assumptions on the mixing rates of the distribution generating the data. Here we will show that under rather mild assumptions on the mixing rates (and, again, without any modeling assumptions or assumptions of independence), consistent clustering is possible when the number of clusters is unknown.

In this section we assume that all the samples are $[0,1]$-valued (that is, $X^j_i \in [0,1]$); the extension to arbitrary bounded (multidimensional) ranges is straightforward. Next we introduce mixing coefficients, mainly following Bosq (1996) in formulations. Informally, mixing coefficients of a stochastic process measure how fast the process forgets about its past. Any one-way infinite stationary process $X_1, X_2, \dots$ can be extended backwards to make a two-way infinite process $\dots, X_{-1}, X_0, X_1, \dots$ with the same distribution. In the definition below we assume such an extension. Define the $\alpha$-mixing coefficients as
$$\alpha(n) = \sup_{A \in \sigma(\dots, X_{-1}, X_0),\, B \in \sigma(X_n, X_{n+1}, \dots)} |P(A \cap B) - P(A) P(B)|, \quad (4)$$
where $\sigma(\cdot)$ stands for the sigma-algebra generated by the random variables in brackets. These coefficients are non-increasing. A process is called strongly $\alpha$-mixing if $\alpha(n) \to 0$. Many important classes of processes satisfy the mixing conditions. For example, if a process is a stationary irreducible aperiodic Hidden Markov process, then it is $\alpha$-mixing. If the underlying Markov chain is finite-state, then the coefficients decrease exponentially fast. Other probabilistic assumptions can be used to obtain bounds on the mixing coefficients; see e.g. Bradley (2005) and references therein.

Algorithm 2 is very simple. Its inputs are: samples $x_1, \dots, x_N$; the threshold level $\delta \in (0,1)$; and the parameters $m, l \in \mathbb{N}$ and $B^{m,l,n}$. The algorithm assigns to the same cluster all samples which are at most $\delta$-far from each other, as measured by $\check d(\cdot, \cdot)$. The estimate $\check d(\cdot, \cdot)$ can be calculated in the same way as $\hat d(\cdot, \cdot)$ (see Proposition 1 and its proof). We do not give a pseudocode implementation of this algorithm, since it is rather obvious. The idea is that the threshold level $\delta$ is selected according to the minimal length of a sample and the (known bounds on) mixing rates of the process $\rho$ generating the samples (see Theorem 2).
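Since the paper leaves the implementation implicit, the sketch below fixes one natural reading (an assumption of ours, not the paper's pseudocode): merge any two samples at most $\delta$ apart and take the transitive closure, i.e. compute the connected components of the $\delta$-threshold graph with a union-find structure.

```python
# A sketch of Algorithm 2 under the reading described above: samples i and
# j end up in the same cluster iff they are connected by a chain of pairs
# with dist <= delta. This makes N(N - 1)/2 distance evaluations, in line
# with the O(N^2 ...) complexity stated in Theorem 2 below.

def algorithm2(samples, delta, dist):
    """Threshold clustering with an unknown number of clusters."""
    n = len(samples)
    parent = list(range(n))  # union-find over sample indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist(samples[i], samples[j]) <= delta:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# delta is to be chosen from the known mixing-rate bound alpha_n and the
# minimal sample length, as Theorem 2 below makes precise.
```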
The next theorem shows that, if the joint distribution of the samples satisfies $\alpha(n) \le \alpha_n \to 0$, where the $\alpha_n$ are known, then one can select (based on $\alpha_n$ only) the parameters of Algorithm 2 in such a way that it is weakly asymptotically consistent. Moreover, a bound on the probability of error before the asymptotic is provided.

Theorem 2 (Algorithm 2 is consistent, unknown $k$). Fix sequences $\alpha_n \in (0,1)$ and $m_n, l_n \in \mathbb{N}$, and let $B^{m,l,n} \subset B^{m,l}$ be an increasing sequence of finite sets, for each $m, l \in \mathbb{N}$. Set $b_n := \max_{l \le l_n,\, m \le m_n} |B^{m,l,n}|$. Let also $\delta_n \in (0,1)$. Let $N \in \mathbb{N}$ and suppose that the samples $x_1, \dots, x_N$ are generated in such a way that the (unknown) joint distribution $\rho$ is stationary ergodic and satisfies $\alpha_n(\rho) \le \alpha_n$ for all $n \in \mathbb{N}$. Then for every sequence $q_n \in [0..n/2]$, Algorithm 2 with the above parameters satisfies
$$\rho(T \ne I) \le 2 N (N+1) \big( m_n l_n b_n \gamma_n(\delta_n) + \gamma_n(\varepsilon_\rho) \big), \quad (5)$$
where $\gamma_n(\delta) = 2 e^{-q_n \delta^2 / 32} + 11 (1 + 4/\delta)^{1/2} q_n\, \alpha_{(n - 2 m_n)/(2 q_n)}$, $T$ is the partition output by the algorithm, $I$ is the target clustering, $\varepsilon_\rho$ is a constant that depends only on $\rho$, and $n = \min_{i=1..N} n_i$.

In particular, if $\alpha_n = o(1)$, then, selecting the parameters in such a way that $\delta_n = o(1)$, $q_n, m_n, l_n, b_n = o(n)$, $q_n, m_n, l_n \to \infty$, $\cup_{k \in \mathbb{N}} B^{m,l,k} = B^{m,l}$ and $|B^{m,l,n}| \to \infty$ for all $m, l \in \mathbb{N}$, and, finally,
$$m_n l_n b_n \big( e^{-q_n \delta_n^2} + \delta_n^{-1/2} q_n\, \alpha_{(n - 2 m_n)/(2 q_n)} \big) = o(1),$$
as is always possible, Algorithm 2 is weakly asymptotically consistent (with the number of clusters $k$ unknown). The computational complexity of Algorithm 2 is $O(N^2 m_{n_{\max}} l_{n_{\max}} b_{n_{\max}})$, and is bounded by $O(N^2 n_{\max}^2 \log s_{\min}^{-1})$, where $n_{\max}$ and $s_{\min}$ are defined as in Theorem 1.

Proof. We use the following bound from Bosq (1996): for any zero-mean random process $Y_1, Y_2, \dots$, every $n \in \mathbb{N}$ and every $q \in [1..n/2]$, we have
$$P\Big( \Big| \sum_{i=1}^{n} Y_i \Big| > n \varepsilon \Big) \le 4 \exp(-q \varepsilon^2 / 8) + 22 (1 + 4/\varepsilon)^{1/2} q\, \alpha(n/(2q)).$$

For every $j = 1..N$, every $m < n$, $l \in \mathbb{N}$, and $B \in B^{m,l}$, define the processes $Y^j_1, Y^j_2, \dots$, where $Y^j_t := \mathbb{I}\{(X^j_t, \dots, X^j_{t+m-1}) \in B\} - \rho(X^j_{1..m} \in B)$. It is easy to see that the $\alpha$-mixing coefficients of this process satisfy $\alpha(n) \le \alpha_{n - 2m}$. Thus,
$$\rho\big( |\nu(X^j_{1..n_j}, B) - \rho(X^j_{1..m} \in B)| > \varepsilon/2 \big) \le \gamma_n(\varepsilon). \quad (6)$$

Then for every $i, j \in [1..N]$ such that $I(x_i) = I(x_j)$ (that is, $x_i$ and $x_j$ are in the same cluster) we have $\rho\big( |\nu(X^i_{1..n_i}, B) - \nu(X^j_{1..n_j}, B)| > \varepsilon \big) \le 2 \gamma_n(\varepsilon)$. Using the union bound, summing over $m$, $l$, and $B$, we obtain
$$\rho\big( \check d(x_i, x_j) > \varepsilon \big) \le 2 m_n l_n b_n \gamma_n(\varepsilon). \quad (7)$$

Next, let $i, j$ be such that $I(x_i) \ne I(x_j)$. Then, for some $m_{i,j}, l_{i,j} \in \mathbb{N}$, there is $B_{i,j} \in B^{m_{i,j}, l_{i,j}}$ such that $|\rho(X^i_{1..|B_{i,j}|} \in B_{i,j}) - \rho(X^j_{1..|B_{i,j}|} \in B_{i,j})| > 2 \tau_{i,j}$ for some $\tau_{i,j} > 0$. Then for every $\varepsilon < \tau_{i,j}/2$ we have
$$\rho\big( |\nu(X^i_{1..n_i}, B_{i,j}) - \nu(X^j_{1..n_j}, B_{i,j})| < \varepsilon \big) \le \rho\big( |\nu(X^i_{1..n_i}, B_{i,j}) - \rho(X^i_{1..|B_{i,j}|} \in B_{i,j})| > \tau_{i,j} \big) + \rho\big( |\nu(X^j_{1..n_j}, B_{i,j}) - \rho(X^j_{1..|B_{i,j}|} \in B_{i,j})| > \tau_{i,j} \big) \le 2 \gamma_n(\tau_{i,j}). \quad (8)$$

Moreover, for $\varepsilon < w_{m_{i,j}} w_{l_{i,j}} \tau_{i,j} / 2$,
$$\rho\big( \check d(x_i, x_j) < \varepsilon \big) \le 2 \gamma_n\big( w_{m_{i,j}} w_{l_{i,j}} \tau_{i,j} \big). \quad (9)$$

Define $\varepsilon_\rho := \min_{i,j = 1..N :\, I(x_i) \ne I(x_j)} w_{m_{i,j}} w_{l_{i,j}} \tau_{i,j} / 2$. Clearly, from this and (8), for every $\varepsilon < 2 \varepsilon_\rho$ we obtain
$$\rho\big( \check d(x_i, x_j) < \varepsilon \big) \le 2 \gamma_n(\varepsilon_\rho). \quad (10)$$

If, for every pair $i, j$ of samples, $\check d(x_i, x_j) < \delta_n$ if and only if $I(x_i) = I(x_j)$, then Algorithm 2 gives a correct answer. Therefore, taking the bounds (7) and (10) together for each of the $N(N+1)/2$ pairs of samples, we obtain (5). The complexity statement can be established analogously to that in Theorem 1. □

While Theorem 2 shows that $\alpha$-mixing with a known bound on the coefficients is sufficient to achieve asymptotic consistency, the bound (5) on the probability of error includes as multiplicative terms all the parameters $m_n$, $l_n$ and $b_n$ of the algorithm, which can make it large for practically useful choices of the parameters.
The multiplicative factors are due to the fact that we take a bound on the divergence of each individual frequency of each cell of each partition from its expectation, and then take a union bound over all of these. To obtain a more realistic performance guarantee, we would like to have a bound on the divergence of all the frequencies of all cells of a given partition from their expectations. Such uniform divergence estimates are possible under stronger assumptions; namely, they can be established under some assumptions on the $\beta$-mixing coefficients, which are defined as follows:
$$\beta(n) = \mathbb{E} \sup_{B \in \sigma(X_n, \dots)} |P(B) - P(B \mid \sigma(\dots, X_0))|.$$
These coefficients satisfy $2 \alpha(n) \le \beta(n)$ (see e.g. Bosq (1996)), so assumptions on the speed of decrease of the $\beta$-coefficients are stronger. Using the uniform bounds given in Karandikara & Vidyasagar (2002), one can obtain a statement similar to that in Theorem 2, with $\alpha$-mixing replaced by $\beta$-mixing, and without the multiplicative factor $b_n$.

4 Conclusion

We have proposed a framework for defining consistency of clustering algorithms when the data comes as a set of samples drawn from stationary processes. The main advantage of this framework is its generality: no assumptions have to be made on the distribution of the data beyond stationarity and ergodicity. The proposed notion of consistency is so simple and natural that it may be suggested as a basic sanity check for all clustering algorithms that are used on sequence-like data. For example, it is easy to see that the $k$-means algorithm will be consistent with some initializations (e.g. with the one used in Algorithm 1) but not with others (e.g. not with the random one).

While the algorithms that we presented to demonstrate the existence of consistent clustering methods are computationally efficient and easy to implement, the main value of the established results is theoretical. As mentioned in the introduction, it can be suggested that for practical applications empirical estimates of the distributional distance can be replaced with distances based on data compression, in the spirit of Ryabko & Astola (2006); Cilibrasi & Vitanyi (2005); Ryabko (2009).

Another direction for future research concerns optimal bounds on the speed of convergence: while we show that such bounds can be obtained (of course, only in the case of known mixing rates), finding practical and tight bounds, for different notions of mixing rates, remains open.

Finally, here we have only considered the setting in which the number $N$ of samples is fixed, while the asymptotics are with respect to the lengths of the samples. For on-line clustering problems, it would be interesting to consider the formulation where both $N$ and the lengths of the samples grow.

References

Bagnall, A., Ratanamahatana, C., Keogh, E., Lonardi, S., and Janacek, G. A bit level representation for time series data mining with shape based similarity. Data Mining and Knowledge Discovery, 13(1):11–40, 2006.

Bosq, D. Nonparametric Statistics for Stochastic Processes. Estimation and Prediction. Springer, 1996.

Bradley, R.C. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.

Cilibrasi, R. and Vitanyi, P.M.B. Clustering by compression. IEEE Trans. Inf.
Th., 51:1523–1545, 2005.

Gray, R. Probability, Random Processes, and Ergodic Properties. Springer Verlag, 1988.

Gutman, M. Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Inf. Th., 35(2):402–408, 1989.

Karandikara, R.L. and Vidyasagar, M. Rates of uniform convergence of empirical means with mixing processes. Stat. & Prob. Lett., 58:297–307, 2002.

Kleinberg, J. An impossibility theorem for clustering. In NIPS, pp. 446–453, 2002.

Lehmann, E. Testing Statistical Hypotheses, 2nd ed. Wiley, New York, 1986.

Mahajan, M., Nimbhorkar, P., and Varadarajan, K. The planar k-means problem is NP-hard. In WALCOM, pp. 274–285, 2009.

Ryabko, B. Compression-based methods for nonparametric prediction and estimation of some characteristics of time series. IEEE Trans. Inf. Th., 55:4309–4315, 2009.

Ryabko, B. and Astola, J. Universal codes as a basis for time series testing. Stat. Methodology, 3:375–397, 2006.

Ryabko, D. Testing composite hypotheses about discrete-valued stationary processes. In ITW, pp. 291–295, 2010a.

Ryabko, D. Discrimination between B-processes is impossible. J. Theor. Prob., 23(2):565–575, 2010b.

Ryabko, D. and Ryabko, B. Nonparametric statistical inference for ergodic processes. IEEE Trans. Inf. Th., 56(3):1430–1435, 2010.

Smyth, P. Clustering sequences with hidden Markov models. In NIPS, pp. 648–654, 1997.

Zadeh, R. and Ben-David, S. A uniqueness theorem for clustering. In UAI, 2009.

Zhong, S. and Ghosh, J. A unified framework for model-based clustering. JMLR, 4:1001–1037, 2003.
