Identifying the number of clusters in discrete mixture models

Cláudia Silvestre (a), Margarida G. M. S. Cardoso (b), Mário A. T. Figueiredo (c)

(a) Escola Superior de Comunicação Social, Instituto Politécnico de Lisboa, Portugal; (b) Department of Quantitative Methods, ISCTE - Lisbon University Institute, Portugal; (c) Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal

July 23, 2018

Abstract

Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed. However, in this context, the determination of the number of clusters is rarely addressed. In this paper, we propose a new approach in which clustering of categorical data and the estimation of the number of clusters are carried out simultaneously. Assuming that the data originate from a finite mixture of multinomial distributions, we develop a method to select the number of mixture components based on a minimum message length (MML) criterion and implement a new expectation-maximization (EM) algorithm to estimate all the model parameters. The proposed EM-MML approach, rather than selecting one among a set of pre-estimated candidate models (which requires running EM several times), seamlessly integrates estimation and model selection in a single algorithm. The performance of the proposed approach is compared with that of other well-known criteria (such as the Bayesian information criterion, BIC), resorting to synthetic data and to two real applications from the European Social Survey. The EM-MML computation time is a clear advantage of the proposed method. Also, the real-data solutions are much more parsimonious than those provided by competing methods, which reduces the risk of model order overestimation and increases interpretability.
Keywords: finite mixture model; EM algorithm; model selection; minimum message length; categorical data

1 Introduction

Clustering is a technique commonly used in several research and application areas, such as social sciences, medicine, biology, engineering, computer science, image analysis, bioinformatics, and marketing. The goal of clustering is to discover or uncover groups in data. To this end, there are essentially two different approaches: distance-based, where a distance or a similarity measure between objects is defined and similar objects are assigned to the same group; and model-based, where the data are assumed to be generated by a finite mixture model and objects are assigned to groups based on the corresponding estimates of the posterior probabilities [1].

Most clustering techniques focus on numerical data and cannot be applied directly to categorical data. In fact, clustering techniques for categorical data are more challenging [2], and fewer techniques are available [3]. When using distance-based clustering approaches, one needs to resort to specific similarity measures to deal with categorical features (e.g., [4]). In this context, the determination of the number of clusters is commonly based on clustering quality measures and the corresponding graphical displays, such as dendrograms (when using hierarchical clustering techniques) or entropy-related graphics [5]. In model-based approaches for numerical data, the usual choice is a mixture of Gaussians (e.g., [6], [7], [8]); for categorical data, a mixture of multinomials (a discrete mixture model) is usually considered (e.g., [3], [9], [10], [11]). To determine the number of groups in discrete mixture models, information criteria are commonly used, e.g., the Bayesian information criterion (BIC) [12] or the Akaike information criterion (AIC) [13].
These criteria look for a balance between the model's fit to the data (which corresponds to maximizing the likelihood function) and parsimony (using penalties associated with measures of model complexity), thus trying to avoid over-fitting. The use of information criteria follows the estimation of candidate finite mixture models, for each of which a predetermined number of groups is indicated, generally resorting to an EM (expectation-maximization) algorithm [14].

In this work, we focus on determining the number of groups while clustering categorical data, using an EM-embedded approach to estimate the number of groups. The novelty of this approach is that it does not rely on selecting among a set of pre-estimated candidate models, but rather integrates estimation and model selection in a single algorithm. We capitalize on the approach developed by Figueiredo and Jain [15] for clustering continuous data and extend it to deal with categorical data. The proposed method is based on a minimum message length (MML) criterion to select the number of clusters and on an EM algorithm to estimate the model parameters; our implementation follows a previous version described in [16].

The paper is organized as follows: Section 2 reviews the finite mixture model-based approach for clustering and addresses the case of categorical data; Section 3 provides an introduction to the topic of model selection for discrete finite mixtures; in Section 4, we describe the proposed EM-MML algorithm; in Section 5, experimental results based on synthetic and real data illustrate the performance of the EM-MML approach. Concluding remarks are summarized in Section 6.
2 Clustering with finite mixture models

Finite mixture models offer a model-based approach to clustering, exhibiting some competitive advantages when compared to alternative methods: besides producing the allocation of observations to clusters, they yield estimates of within-cluster joint probability functions for the base variables; moreover, they provide means to allocate new observations to groups; finally, when used with information criteria, mixture models provide a statistical framework to determine the number of clusters. The literature on finite mixture models and their applications is vast, including several books covering theory, geometry, and applications [17], [18], [19], [20], [21], [22]. For example, in market segmentation, cluster analysis via finite mixture models has replaced more traditional cluster analysis, such as the K-means algorithm, as the state of the art [23].

Finite mixture models have played an important role in the history of modern statistics. One of the first applications of mixture models is due to Newcomb [24], who used a mixture of Gaussians to analyse a collection of observations of transits of Mercury. Pearson [25] fitted a mixture of two Gaussians with unequal variances in an analysis of different species of crabs. Among many other examples of applying mixture models, MacDonald and Pitcher [26] analysed a single species of fish in a lake, using a mixture of Gaussians where each component consists of the fish of a single yearly spawning of that species. Another example is given by Do and McLachlan [27], where a mixture of Gaussians was used to study the populations of rats being eaten by a particular species of owl, with the distinct rat species corresponding to the components of the mixture.
Al-Hussaini and Abdel-Hamid [28] studied the behavior of the failure time of a device; they fitted a mixture of components, each of which represents a different cause of failure. In their research, special attention was paid to mixtures of two Weibull components, but two exponential components, two Rayleigh components, and a mixture of Rayleigh and Weibull components were also analysed. In these examples, the segmentation process uncovers physically meaningful components.

When applying finite mixture models to social sciences, the analyst may be confronted with the need to uncover sub-populations based on qualitative indicators. In this context, the use of mixture models for categorical data is particularly pertinent. For example, for clustering categorical data from multiple choice questions, a mixture of multinomial distributions is used in order to market products [29].

2.1 Definitions and concepts

Let Y = {y_i, i = 1, ..., n} be a set of n independent and identically distributed (i.i.d.) sample observations of a random vector Y = [Y_1, ..., Y_L]'. If Y follows a mixture of K component densities f(y | θ_k) (k = 1, ..., K), with probabilities {α_1, ..., α_K}, the probability (density) function of Y is

  f(y | Θ) = Σ_{k=1}^K α_k f(y | θ_k),

where Θ = {θ_1, ..., θ_K, α_1, ..., α_K} is the set of all the parameters of the model and θ_k are the distributional parameters defining the k-th component. The proportions, also called mixing probabilities, are subject to the usual constraints: Σ_{k=1}^K α_k = 1 and α_k ≥ 0, k = 1, ..., K.

The log-likelihood of the observed set of sample observations is

  log f(Y | Θ) = log Π_{i=1}^n f(y_i | Θ) = Σ_{i=1}^n log Σ_{k=1}^K α_k f(y_i | θ_k).

In clustering, the identity of the component that generated each sample observation is unknown.
The observed data Y are therefore regarded as incomplete, the missing data being a set of indicator variables Z = {z_1, ..., z_n}, each taking the form z_i = [z_i1, ..., z_iK]', where z_ik is a binary indicator: z_ik takes the value 1 if the observation y_i was generated by the k-th component, and 0 otherwise. It is usually assumed that the {z_i, i = 1, ..., n} are i.i.d., following a multinomial distribution of K categories with probabilities {α_1, ..., α_K}. The log-likelihood of the complete data {Y, Z} is given by

  log f(Y, Z | Θ) = Σ_{i=1}^n Σ_{k=1}^K z_ik log[ α_k f(y_i | θ_k) ].

2.2 Discrete finite mixture models

Consider that each variable in Y, Y_l (l = 1, ..., L), can take one of C_l categories. Conditionally on having been generated by the k-th component of the mixture, each Y_l is thus modeled by a multinomial distribution with n_l trials, C_l categories, and non-negative parameters θ_kl = {θ_klc, c = 1, ..., C_l}, with Σ_{c=1}^{C_l} θ_klc = 1. For a sample y_il (i = 1, ..., n) of Y_l, we denote by y_ilc the number of outcomes in category c, which is a sufficient statistic; naturally, Σ_{c=1}^{C_l} y_ilc = n_l. Thus, with θ_k = {θ_k1, ..., θ_kL} and Θ = {θ_1, ..., θ_K, α_1, ..., α_K}, the log-likelihood function for a set of observations corresponding to a discrete finite mixture model (mixture of multinomials) is

  log p(Y | Θ) = Σ_{i=1}^n log Σ_{k=1}^K α_k Π_{l=1}^L [ n_l! Π_{c=1}^{C_l} (θ_klc)^{y_ilc} / y_ilc! ].   (1)

In order to estimate the parameters of this mixture, it is necessary to ensure that they are identifiable. In particular, in the case of a mixture of multinomials, a specific identifiability condition must be fulfilled: a mixture of multinomial distributions is identifiable if T ≥ 2K − 1, where T is the number of trials of each multinomial distribution (see details in [30] and [31]).
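As a concrete illustration, the log-likelihood (1) can be evaluated numerically. The following is a minimal NumPy sketch; the array shapes, the function name, and the log-space evaluation strategy are our own choices, not taken from the paper's implementation:

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)  # elementwise log-Gamma, so log(x!) = lgamma(x + 1)

def mixture_loglik(y, alpha, theta):
    """Log-likelihood (1) of a multinomial mixture (illustrative shapes):
    y:     (n, L, C) array of counts y_ilc (y[i, l] sums to n_l)
    alpha: (K,) mixing probabilities
    theta: (K, L, C) multinomial parameters (theta[k, l] sums to 1)
    """
    n_l = y.sum(axis=2)                                                  # (n, L) trials
    # log multinomial coefficients: sum_l [log n_l! - sum_c log y_ilc!]
    log_coef = _lgamma(n_l + 1).sum(axis=1) - _lgamma(y + 1).sum(axis=(1, 2))
    # sum_l sum_c y_ilc * log theta_klc, for every observation i and component k
    log_core = np.einsum('ilc,klc->ik', y, np.log(theta))                # (n, K)
    a = np.log(alpha)[None, :] + log_core + log_coef[:, None]            # (n, K)
    m = a.max(axis=1, keepdims=True)                                     # stable log-sum-exp
    return (m[:, 0] + np.log(np.exp(a - m).sum(axis=1))).sum()
```

Working in log space avoids underflow when the counts y_ilc are large, since each term of the inner sum over k is a product of many small probabilities.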
As in the Gaussian case, the log-likelihood in (1) can be seen as corresponding to a missing-data problem, where the missing data have exactly the same meaning and structure as above. The log-likelihood of the complete data {Y, Z} is thus given by

  log p(Y, Z | Θ) = Σ_{i=1}^n Σ_{k=1}^K z_ik log( α_k Π_{l=1}^L [ n_l! Π_{c=1}^{C_l} (θ_klc)^{y_ilc} / y_ilc! ] ).   (2)

2.3 Estimation of finite mixture models via the EM algorithm

To obtain a maximum-likelihood (ML) or maximum a posteriori (MAP) estimate of the parameters of a multinomial mixture, the well-known EM algorithm is usually the tool of choice ([14] and [32]). EM is an iterative algorithm which alternates between two steps, the expectation step (E-step) and the maximization step (M-step), described next.

E-step: Compute the expectation of the complete log-likelihood (2), with respect to the missing variables Z, given the observed data Y and the current parameter estimate Θ̂^(t) (where t is the iteration counter). Because log p(Y, Z | Θ) is linear with respect to Z (as is clear in (2)),

  E[ log p(Y, Z | Θ) | Y, Θ̂^(t) ] = log p(Y, Z̄^(t) | Θ),

where each element z̄^(t)_ik of Z̄^(t) is given by

  z̄^(t)_ik = E[ Z_ik | Y, Θ̂^(t) ] = P[ Z_ik = 1 | y_i, Θ̂^(t) ] = α_k f(y_i | θ_k) / Σ_{j=1}^K α_j f(y_i | θ_j),   (3)

since Z_ik is binary and conditionally independent of all y_j, for j ≠ i; the third equality results simply from Bayes' law.

M-step: Update the parameter estimates by maximizing the current estimate of the expected complete log-likelihood:

  Θ̂^(t+1) = arg max_Θ [ log p(Y, Z̄^(t) | Θ) + log p(Θ) ],   (4)

where p(Θ) is a prior, in the case of MAP estimation; in the case of ML estimation, the term log p(Θ) is absent.
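The two steps can be sketched for the multinomial mixture as follows. This is an illustrative NumPy sketch of one ML iteration, not the authors' code; names and shapes are our own. Note that the multinomial coefficient is the same for every component k, so it cancels in the ratio (3):

```python
import numpy as np

def em_step(y, alpha, theta):
    """One EM iteration (ML) for a multinomial mixture; illustrative shapes:
    y: (n, L, C) counts, alpha: (K,), theta: (K, L, C)."""
    # E-step (3): responsibilities z_ik proportional to alpha_k * f(y_i | theta_k);
    # the multinomial coefficient cancels between numerator and denominator
    log_post = np.log(alpha)[None, :] + np.einsum('ilc,klc->ik', y, np.log(theta))
    log_post -= log_post.max(axis=1, keepdims=True)     # numerical stability
    z = np.exp(log_post)
    z /= z.sum(axis=1, keepdims=True)                   # (n, K) responsibilities
    # M-step: weighted ML updates of the mixing and multinomial parameters
    alpha_new = z.mean(axis=0)
    counts = np.einsum('ik,ilc->klc', z, y)             # sum_i z_ik * y_ilc
    theta_new = counts / counts.sum(axis=2, keepdims=True)
    return alpha_new, theta_new, z
```

Iterating `em_step` until the log-likelihood stops increasing gives the usual ML fit for a fixed number of components K.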
A more detailed derivation of the EM algorithm, its convergence properties, extensions, and instances for different types of missing-data models can be found, e.g., in [33], [34].

3 Model selection for categorical data

Model selection is an important problem in statistical analysis [35]. In model-based clustering, the term model selection usually refers to the problem of determining the number of clusters, although it may also refer to the problem of selecting the structure of the clusters. Model-based clustering provides a statistical framework to solve this problem [36], usually resorting to information criteria. The rationale of such criteria is that fitting a model with a large number of clusters requires the estimation of a very large number of parameters, with a consequent loss of precision in these estimates. Therefore, one should penalize excessive model complexity while simultaneously trying to increase the model's fit to the data, based on the likelihood function. The best-known information criteria are BIC, AIC, and their modifications, namely the consistent AIC (CAIC) [37] and the modified AIC (MAIC) [38]. Other criteria that have been proposed include the integrated completed likelihood (ICL) [39], the minimum description length (MDL) [40], and the minimum message length (MML) criterion [41]. Information criteria are well known and easily implemented, the final model being selected according to a compromise between its fit to the data and its complexity.

In this work, we use the MML criterion to choose the number of components of a mixture of multinomials. MML is based on the information-theoretic view of estimation and model selection, according to which an adequate model is one that allows a short description of the observations [41].
MML-type criteria evaluate statistical models according to their ability to compress a message containing the data, looking for a balance between choosing a simple model and one that describes the data well. According to Shannon's information theory, if Y is some random variable with probability distribution p(y | Θ), the optimal code-length (in an expected value sense) for an outcome y is l(y | Θ) = −log_2 p(y | Θ), measured in bits (from the base-2 logarithm) [42]. If Θ is unknown, the total code-length function has two parts: l(y, Θ) = l(y | Θ) + l(Θ); the first part encodes the outcome y, while the second part encodes the parameters of the model. The first part corresponds to the fit of the model to the data (better fit corresponds to higher compression), while the second part represents the complexity of the model. Different ways to compute l(Θ) are derived from different statistical frameworks and yield different flavors of information criteria, namely MDL and MML, where MML admits the existence of a prior p(Θ), while MDL does not.

The message length function for a mixture of distributions (as developed in [43]) is

  l(y, Θ) = −log p(Θ) − log p(y | Θ) + (1/2) log |I(Θ)| + (C/2)(1 − log(12)),   (5)

where p(y | Θ) is the likelihood function, I(Θ) ≡ −E[ ∂²/∂Θ² log p(Y | Θ) ] is the expected Fisher information matrix, |I(Θ)| its determinant, and C is the number of parameters of the model that need to be estimated. For example, for the mixture of K multinomial distributions presented in (1),

  C = (K − 1) + K Σ_{l=1}^L (C_l − 1).

The expected Fisher information matrix of a mixture leads to a complex analytical form of MML which cannot be easily computed.
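The parameter count C above is straightforward to compute: K − 1 free mixing probabilities plus, for each of the K components, C_l − 1 free multinomial parameters per variable. A small illustrative helper (the function name is ours):

```python
def num_free_params(K, categories):
    """Free-parameter count C for a K-component multinomial mixture.
    `categories` lists C_l, the number of categories of each of the L variables."""
    return (K - 1) + K * sum(C_l - 1 for C_l in categories)
```

For instance, a 2-component mixture over two variables with 2 and 3 categories has C = 1 + 2 * (1 + 2) = 7 free parameters.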
To overcome this difficulty, Figueiredo and Jain [15] replace the expected Fisher information matrix by its complete-data counterpart, I_c(Θ) ≡ −E[ ∂²/∂Θ² log p(Y, Z | Θ) ]. Also, they adopt independent Jeffreys' priors for the mixture parameters. The resulting message length function is

  l(y, Θ) = (M/2) Σ_{k: α_k > 0} log(n α_k / 12) + (k_nz / 2) log(n / 12) + k_nz (M + 1)/2 − log p(y | Θ),   (6)

where M is the number of parameters specifying each component (the dimension of each θ_k) and k_nz is the number of components with non-zero probability (for more details on the derivation of (6), see [15], [43]).

4 The proposed MML-based EM algorithm

In order to estimate a mixture of multinomials, we propose to use a variant of the EM algorithm (herein termed EM-MML), which integrates both estimation and model selection, by directly minimizing (6). This algorithm is an extension, to the multinomial case, of the approach developed in [15] for clustering continuous data, based on a Gaussian mixture model. The algorithm results from observing that (6) contains, in addition to the log-likelihood term, an explicit penalty on the number of components (the two terms proportional to k_nz), and a term (the first one) that can be seen as a log-prior on the α_k parameters of Θ, which will directly affect the M-step. Finally, notice that in the presence of a set of multinomial observations Y = {y_i, i = 1, ..., n}, the log-likelihood log p(Y | Θ) is as given in (1).

E-step: The E-step of EM-MML is precisely the same as in the case of ML or MAP estimation, since the generative model for the data is the same. Since we are dealing with a multinomial mixture, we simply have to plug the corresponding multinomial probability function into (3), yielding

  z̄^(t)_ik = α_k Π_{l=1}^L [ n_l! Π_{c=1}^{C_l} (θ̂^(t)_klc)^{y_ilc} / y_ilc! ] / Σ_{j=1}^K α_j Π_{l=1}^L [ n_l! Π_{c=1}^{C_l} (θ̂^(t)_jlc)^{y_ilc} / y_ilc! ],   (7)

for i = 1, ..., n and k = 1, ..., K.

M-step: For the M-step, noticing that the first term in (6) can be seen as the negative log-prior −log p(α_k) = ((C − K + 1)/(2K)) log α_k (plus a constant), and enforcing the conditions α_k ≥ 0, for k = 1, ..., K, and Σ_{k=1}^K α_k = 1, yields the following updates for the estimates of the α_k parameters:

  α̂^(t+1)_k = max{ 0, Σ_{i=1}^n z̄^(t)_ik − (C − K + 1)/(2K) } / Σ_{j=1}^K max{ 0, Σ_{i=1}^n z̄^(t)_ij − (C − K + 1)/(2K) },   (8)

for k = 1, ..., K. Notice that some α̂^(t+1)_k may be zero; in that case, the k-th component is excluded from the mixture model. The multinomial parameters corresponding to components with α̂^(t+1)_k = 0 need not be further calculated, since these components do not contribute to the likelihood. For the components with non-zero probability, α̂^(t+1)_k > 0, the estimates of the multinomial parameters are updated to their standard weighted ML estimates:

  θ̂^(t+1)_klc = Σ_{i=1}^n z̄^(t)_ik y_ilc / ( n_l Σ_{i=1}^n z̄^(t)_ik ),   (9)

for k = 1, ..., K, l = 1, ..., L, and c = 1, ..., C_l. Notice that, in accordance with the meaning of the θ_klc parameters, Σ_{c=1}^{C_l} θ̂^(t+1)_klc = 1.

The EM-MML algorithm for simultaneously clustering categorical data and selecting the number of clusters is summarized in Figure 1.

Figure 1: The EM-MML algorithm

Input:
  data: Y = {y_i, i = 1, ..., n}
  the minimum number of segments: K_min
  the maximum number of segments: K_max
  minimum increase threshold for the log-likelihood function: δ
Output:
  number of segments: K
  segment probabilities: {α_1, ..., α_K}
  multinomial parameters: {θ_1, ..., θ_K}
Initialization:
  initialize the parameters resorting to the empirical distribution: p(y_l | θ_lk) (l = 1, ..., L; k = 1, ..., K_max)
  set the segment probabilities: α_k = 1/K_max (k = 1, ..., K_max)
  store the initial log-likelihood
  store the initial message length (iml)
  minml ← iml
  continue ← 1
while continue = 1 do
  while increases in the log-likelihood are above δ do
    k ← K_max
    while k > K_min do
      compute α̂_k according to (8)
      if α̂_k = 0
        remove component k from the mixture
        K_max ← K_max − 1
        (α̂_1, ..., α̂_{K_max}) ← (α̂_1 / Σ_{k=1}^{K_max} α̂_k, ..., α̂_{K_max} / Σ_{k=1}^{K_max} α̂_k)
      else
        compute θ̂ according to (9)
        E-step according to (7)
        k ← k − 1
      end if
    end while
    compute log-likelihood
    compute message length (ml)
  end while
  if ml < minml
    minml ← ml
    segment probabilities ← current parameters
    multinomial probabilities ← current parameters
  end if
  if K_max > K_min
    k_remove ← argmin{α_1, ..., α_K}
    remove component k_remove from the mixture model
    K_max ← K_max − 1
    (α_1, ..., α_{K_max}) ← (α̂_1 / Σ_{k=1}^{K_max} α̂_k, ..., α̂_{K_max} / Σ_{k=1}^{K_max} α̂_k)
  else
    continue ← 0
  end if
end while
The best solution corresponds to the minimum message length obtained.

5 Data analysis and results

5.1 Evaluating the EM-MML performance on synthetic data

To evaluate the performance of the EM-MML algorithm, we begin by resorting to synthetic data sets: 2-component mixtures of multinomials (10 data sets) and 3-component mixtures (10 data sets) are used. In order to provide useful insights regarding the EM-MML performance, the 10 data sets within each setting exhibit diverse degrees of cluster separation.
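The distinctive part of the EM-MML M-step, the component-annihilating update (8), can be sketched as follows. This is an illustrative sketch with names of our choosing; note that, with C = (K − 1) + K Σ_l (C_l − 1) as in Section 3, the per-component penalty (C − K + 1)/(2K) reduces to half the number of free parameters per component:

```python
import numpy as np

def mml_alpha_update(z, penalty):
    """Update (8) for the mixing probabilities (illustrative).
    z:       (n, K) responsibility matrix from the E-step
    penalty: the per-component term (C - K + 1) / (2K) of (8)
    Components whose total responsibility falls below the penalty get weight 0
    and are removed from the mixture.
    """
    support = np.maximum(0.0, z.sum(axis=0) - penalty)  # max{0, sum_i z_ik - penalty}
    return support / support.sum()                      # renormalize over survivors
```

Components driven to zero weight are pruned, so model selection happens inside the EM iterations rather than by refitting for each candidate K.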
For this purpose, we measure the separation between all pairs of clusters (C_k, C_k′) forming a partition Π_K according to

  separation(Π_K) = (2 / (K(K − 1))) Σ_{k ≠ k′} (1/2) [ D_KL(C_k; C_k′) + D_KL(C_k′; C_k) ],   (10)

where

  D_KL(C_k; C_k′) = Σ_{l=1}^L Σ_{c=1}^{C_l} P(Y_kl = c) log( P(Y_kl = c) / P(Y_k′l = c) ) = Σ_{l=1}^L Σ_{c=1}^{C_l} θ̂_klc log( θ̂_klc / θ̂_k′lc )

is the sum of the Kullback-Leibler divergences between the L multinomial distributions corresponding to components k and k′ of the mixture. This symmetric measure of separation ranges from around 0.01 (poor cluster separation) to 0.17 (good cluster separation) in the experimental scenarios.

The EM-MML results are compared with those obtained from a standard EM algorithm and well-known information criteria, namely BIC, AIC, CAIC, MAIC, and ICL. The performance of the various criteria and methods is assessed by the rate (over 30 runs) of correct selection of the true number of clusters, and also by the computation time. The results are presented in Figure 2.

Figure 2: Comparison of several model selection criteria on synthetic data from mixtures of multinomials with K = 2 (left) and K = 3 (right).

In the two-component data sets with separation lower than 0.04, all the approaches identify only one cluster. ICL is unable to recover the true number of clusters, even with moderate separation (up to around 0.12). The other criteria, and our EM-MML approach, correctly identify two clusters when separation is above 0.04. For three-component mixtures, the ICL criterion is able to identify the correct number of clusters only for cluster separation higher than 0.08. The other criteria (including EM-MML) present similar results, identifying three clusters for separations larger than 0.03.
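For reference, the separation measure (10) can be sketched directly from the estimated component parameters (an illustrative implementation with names and shapes of our choosing):

```python
import numpy as np

def separation(theta):
    """Symmetrized-KL separation (10); theta is (K, L, C) of component parameters."""
    K = theta.shape[0]
    # D_KL(C_k; C_k'): summed over the L variables and their categories
    kl = lambda p, q: np.sum(p * np.log(p / q))
    total = sum(0.5 * (kl(theta[k], theta[j]) + kl(theta[j], theta[k]))
                for k in range(K) for j in range(K) if j != k)
    return 2.0 / (K * (K - 1)) * total
```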
Generally, one can thus conclude that EM-MML has a performance similar to BIC, AIC, CAIC, and MAIC in recovering the true number of clusters, while ICL clearly underperforms the other methods for not clearly separated components.

In order to further evaluate the comparative performance of EM-MML, we compare its computation times with those of BIC. Notice that BIC and the other information criteria have similar computation times. Based on a sample of 300 runs from 2-component mixtures of multinomials, we obtain the following average computation times: 146.84 seconds for EM-MML and 230.67 seconds for BIC. A paired-samples t test yields t = −17.06, df = 299, and p-value < 0.01, allowing us to conclude that EM-MML is significantly faster than BIC. For the 3-component mixtures of multinomials, the average computation times are 194.38 and 239.27 seconds for EM-MML and BIC, respectively. Again, the null hypothesis of a paired-samples t test is rejected (t = −6.88; df = 299; p-value < 0.01), showing that EM-MML is significantly faster than BIC.

Overall, the EM-MML method shows good performance when dealing with synthetic data sets: in selecting the true number of clusters, it has similar performance to BIC, AIC, CAIC, and MAIC, and outperforms ICL. In terms of computation time, since EM-MML does not require a sequential approach, it is clearly faster than the other criteria.

5.2 Experiments on real data

Additional insight into the performance of EM-MML is obtained by applying it to two real data sets from the European Social Survey (ESS). The ESS is a biennial survey started in 2002, which measures the attitudes, beliefs, values, and behaviour patterns of European populations. The most recent survey was in 2012 (round 6), covering 30 countries and 243 regions.
For the purpose of our experiment, we aggregate the ESS data by region, taking into account sampling weights (ESS weights), which are meant to ensure the representativeness of the sample. Clustering is performed based on: 1) trust in some institutions (namely, the country's parliament, the legal system, the police, politicians, political parties, the European Parliament, and the United Nations); 2) satisfaction with life as a whole, the economy, the government, and the functioning of democracy. We recode responses into binary variables: distrust/trust and dissatisfied/satisfied.

A summary of the comparison of the several model selection criteria on the two ESS data sets is presented in Tables 1 and 2. To measure the relationship between the uncovered segments and the clustering base variables, we resort to Cramer's V association measure, which ranges from 0 (no association) to 1 (perfect association), and to its sum as a proxy of the variables' discriminant capacity.

The number of segments selected by EM-MML is much lower than for the remaining criteria. This fact avoids estimation problems associated with very small segments and also improves the interpretability of the clustering solution. In addition, the total value of the Cramer's V association measure (which evaluates the relationship between clusters and clustering base variables) is higher for EM-MML, indicating variables with higher discriminant capacity in the EM-MML solution. The segments selected by the EM-MML criterion are presented in Tables 3 and 4.

Table 1: Comparison of Cramer's V association between segmentation base variables "trust in" and the segments obtained with each criterion.

                              BIC    AIC    CAIC   MAIC   ICL    EM-MML
Number of segments            11     17     11     18     11     4
Trust in
  country's parliament        .58    .53    .58    .55    .58    .72
  the legal system            .55    .54    .55    .57    .55    .67
  the police                  .52    .51    .52    .50    .52    .57
  politicians                 .64    .57    .64    .58    .64    .78
  political parties           .64    .60    .64    .58    .64    .78
  the European Parliament     .46    .45    .46    .47    .46    .52
  the United Nations          .47    .43    .47    .48    .47    .52
Sum                           3.86   3.63   3.86   3.73   3.86   4.56

Table 2: Comparison of Cramer's V association between segmentation base variables "satisfaction with" and the segments obtained with each criterion.

                                         BIC    AIC    CAIC   MAIC   ICL    EM-MML
Number of segments                       13     18     13     18     13     7
How satisfied with
  life as a whole                        .59    .55    .59    .55    .59    .55
  present state of economy in country    .58    .55    .58    .55    .58    .62
  the national government                .56    .55    .56    .55    .56    .60
  the way democracy works in country     .56    .53    .56    .53    .56    .63
Sum                                      2.29   2.18   2.29   2.18   2.29   2.40

Table 3: EM-MML segmentation of the "trust in" data.

Trust in                    Segment 1   Segment 2   Segment 3   Segment 4
country's parliament        59.2%       18.2%       14.2%       35.8%
the legal system            72.0%       23.2%       17.7%       55.4%
the police                  81.4%       41.6%       20.8%       72.9%
politicians                 45.6%       10.6%       8.7%        20.8%
political parties           44.1%       10.8%       9.9%        19.1%
the European Parliament     41.7%       26.9%       20.2%       28.7%
the United Nations          64.7%       35.0%       24.6%       45.0%

The "trust in institutions" EM-MML segmentation yields four segments. For example, Lisbon is in segment 2, where all the regions have low trust in institutions. The lowest trust values in this segment refer to trust in politicians and political parties. Comparing trust in the different institutions, Lisbon citizens have moderate trust in the country's police. Dublin, another European capital city, is in segment 4, characterized by higher trust in institutions, especially in the police (72.9%) and the legal system (55.4%). The separation of these segments (according to the measure in (10)) is 1.46, indicating that they are well separated.

Table 4: EM-MML segmentation of the "satisfaction with" data.

How satisfied with                  Seg. 1   Seg. 2   Seg. 3   Seg. 4   Seg. 5   Seg. 6   Seg. 7
life as a whole                     92.5%    93.0%    29.3%    79.7%    71.8%    59.6%    88.0%
present state of economy in country 85.8%    36.4%    3.5%     20.7%    35.5%    15.8%    57.3%
the national government             70.5%    56.7%    8.5%     23.4%    34.5%    21.9%    42.9%
the way democracy works in country  85.8%    81.3%    9.8%     47.1%    46.1%    22.8%    68.2%

The seven EM-MML segments for "satisfaction with" are also well separated (separation(Π_7) = 1.26). In this segmentation, Lisbon is in segment 6 and Dublin in segment 4. In both capitals, people only feel satisfied with life as a whole, but in Dublin they feel generally more satisfied than in Lisbon.

6 Discussion and perspectives

In this work, a model selection criterion and method for finite mixture models of categorical observations was proposed. The new algorithm simultaneously performs model estimation and selects the number of components/clusters. When compared to the information criteria commonly associated with the use of the EM algorithm, the EM-MML method exhibits several advantages: 1) it easily recovers the true number of clusters in synthetic data sets with various degrees of separation; 2) its computation times are significantly lower than those required by the sequential use of EM in standard approaches (such as BIC); 3) when applied to real data sets, it produces more parsimonious solutions, which are thus easier to interpret.
An additional advantage of the proposed approach, stemming from the more parsimonious solutions it obtains, is that such solutions have a higher number of observations per cluster, which helps to overcome potential estimation problems. Since the performance of EM-MML in selecting the number of clusters is encouraging, and the same criterion has already been used for feature selection [8], future developments will include the integration of both model and feature selection.

References

[1] Chen T, Zhang NL, Liu T, Poon KM, Wang Y. Model-based multidimensional clustering of categorical data. Artificial Intelligence. 2012;176:2246–2269.

[2] Xiong T, Wang S, Mayers A, Monga E. DHCC: Divisive hierarchical clustering of categorical data. Data Mining and Knowledge Discovery. 2012;24:103–135.

[3] Giordan M, Diana G. A clustering method for categorical ordinal data. Communications in Statistics - Theory and Methods. 2011;40:1315–1334.

[4] Chen K, Liu L. Best k: Critical clustering structures in categorical datasets. Knowledge and Information Systems. 2009;20:1–33.

[5] Desai A, Singh H, Pudi V. DISC: Data-intensive similarity measure for categorical data. Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science. 2011;6635:469–481.

[6] Constantinopoulos C, Titsias MK, Likas A. Bayesian feature and model selection for Gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2006;28:1013–1018.

[7] Celeux G, Martin O, Lavergne C. Mixture of linear mixed models for clustering gene expression profiles from repeated microarray experiments. Statistical Modelling. 2005;5:243–267.

[8] Law M, Figueiredo M, Jain A. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2004;26:1154–1166.

[9] Bouguila N.
Clustering of count data using generalized Dirichlet multinomial distributions. IEEE Transactions on Knowledge and Data Engineering. 2008;20:462–474.

[10] Li M, Zhang L. Multinomial mixture model with feature selection for text clustering. Knowledge-Based Systems. 2008;21:704–708.

[11] Dey T, Lim CY. Comparisons of computational methods for clustered binary data. Journal of Statistical Computation and Simulation. 2013;83:2030–2046.

[12] Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.

[13] Akaike H. Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika. 1973;60:255–265.

[14] Dempster A, Laird N, Rubin D. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38.

[15] Figueiredo MAT, Jain AK. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002;24:381–396.

[16] Silvestre C, Cardoso M, Figueiredo M. Clustering with finite mixture models and categorical variables. In: International Conference on Computational Statistics - COMPSTAT 2008; 2008.

[17] Everitt BS, Hand DJ. Finite mixture distributions. New York: Chapman and Hall; 1981.

[18] Titterington DM, Smith AFM, Makov UE. The statistical analysis of finite mixture models. New York: Wiley; 1985.

[19] McLachlan GJ, Peel D. Finite mixture models. New York: Wiley; 2000.

[20] Titterington DM, Smith AFM, Makov U. Statistical analysis of finite mixture distributions. John Wiley and Sons; 1985.

[21] Melnykov V, Maitra R. Finite mixture models and model-based clustering. Statistics Surveys. 2010;4:80–116.

[22] Lindsay BG. Mixture models: Theory, geometry, and applications. Hayward: Institute of Mathematical Statistics; 1995.

[23] Wedel M, Kamakura WA.
Market segmentation: Conceptual and methodological foundations (International Series in Quantitative Marketing). 2nd ed. Massachusetts: Kluwer Academic Publishers; 2000.

[24] Newcomb S. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics. 1886;8:343–366.

[25] Pearson K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society. 1894;185:71–110.

[26] Macdonald P, Pitcher T. Age-groups from size-frequency data: a versatile and efficient method of analysing distribution mixtures. Journal of the Fisheries Research Board of Canada. 1979;36:987–1001.

[27] Do K, McLachlan GJ. Estimation of mixing proportions: A case study. Applied Statistics. 1984;33:134–140.

[28] Al-Hussaini EK, Abdel-Hamid AH. Accelerated life tests under finite mixture models. Journal of Statistical Computation and Simulation. 2006;76:673–690.

[29] Maitra R, Melnykov V. Finite mixture models and model-based clustering. Statistical Surveys. 2010;4:80–116.

[30] Elmore RT, Wang S. Identifiability and estimation in finite mixture models with multinomial components. Technical Report 03-04, Department of Statistics, Pennsylvania State University. 2003.

[31] Portela J. Clustering discrete data through the multinomial mixture model. Communications in Statistics - Theory and Methods. 2008;37:3250–3263.

[32] Vermunt JK, Magidson J. Latent class cluster analysis. Applied Latent Class Analysis. 2002;11:89–106.

[33] McLachlan G, Krishnan T. The EM algorithm and extensions. 2nd ed. New Jersey: John Wiley and Sons; 2008.

[34] Gupta M, Chen Y. Theory and use of the EM algorithm. Now Publishers; 2011.

[35] Celeux G, Martin-Magniette ML, Maugis-Rabusseau C, Raftery AE.
Comparing model selection and regularization approaches to variable selection in model-based clustering. Journal de la Société Française de Statistique. 2014;155:57–71.

[36] Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association. 2002;97:611–631.

[37] Bozdogan H. Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika. 1987;52:345–370.

[38] Bozdogan H. Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity. In: Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach; 1994. p. 69–113.

[39] Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22:719–725.

[40] Rissanen J. Stochastic complexity in statistical inquiry. World Scientific; 1989.

[41] Wallace C, Boulton D. An information measure for classification. The Computer Journal. 1968;11:195–209.

[42] Cover T, Thomas J. Entropy, relative entropy and mutual information. In: Elements of Information Theory; 1991. Chapter 2.

[43] Baxter RA, Oliver JJ. Finding overlapping components with MML. Statistics and Computing. 2000;10:5–16.