Identifying the number of clusters in discrete mixture models

Cláudia Silvestre (a), Margarida G. M. S. Cardoso (b), Mário A. T. Figueiredo (c)

(a) Escola Superior de Comunicação Social, Instituto Politécnico de Lisboa, Portugal; (b) Department of Quantitative Methods, ISCTE - Lisbon University Institute, Portugal; (c) Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal

July 23, 2018

Abstract

Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed. However, in this context, the determination of the number of clusters is rarely addressed. In this paper, we propose a new approach in which clustering of categorical data and the estimation of the number of clusters are carried out simultaneously. Assuming that the data originate from a finite mixture of multinomial distributions, we develop a method to select the number of mixture components based on a minimum message length (MML) criterion and implement a new expectation-maximization (EM) algorithm to estimate all the model parameters. The proposed EM-MML approach, rather than selecting one among a set of pre-estimated candidate models (which requires running EM several times), seamlessly integrates estimation and model selection in a single algorithm. The performance of the proposed approach is compared with that of other well-known criteria (such as the Bayesian information criterion, BIC), resorting to synthetic data and to two real applications from the European Social Survey. The EM-MML computation time is a clear advantage of the proposed method. Also, the real-data solutions are much more parsimonious than those provided by competing methods, which reduces the risk of model order overestimation and increases interpretability.
Keywords: finite mixture model; EM algorithm; model selection; minimum message length; categorical data

1 Introduction

Clustering is a technique commonly used in several research and application areas, such as social sciences, medicine, biology, engineering, computer science, image analysis, bioinformatics, and marketing. The goal of clustering is to discover or uncover groups in data. To this end, there are essentially two different approaches: distance-based, where a distance or a similarity measure between objects is defined and similar objects are assigned to the same group; and model-based, where the data are assumed to be generated by a finite mixture model and objects are assigned to groups based on the corresponding estimates of the posterior probabilities [1].

Most clustering techniques focus on numerical data and cannot be applied directly to categorical data. In fact, clustering techniques for categorical data are more challenging [2], and fewer techniques are available [3]. When using distance-based clustering approaches, one needs to resort to specific similarity measures to deal with categorical features (e.g., [4]). In this context, the determination of the number of clusters is commonly based on clustering quality measures and the corresponding graphical displays, such as dendrograms (when using hierarchical clustering techniques) or entropy-related graphics [5]. In model-based approaches for numerical data, the usual choice is a mixture of Gaussians (e.g., [6], [7], [8]); for categorical data, a mixture of multinomials (a discrete mixture model) is usually considered (e.g., [3], [9], [10], [11]). To determine the number of groups in discrete mixture models, information criteria are commonly used, e.g., the Bayesian information criterion (BIC) [12] or the Akaike information criterion (AIC) [13].
These criteria look for a balance between the model's fit to the data (which corresponds to maximizing the likelihood function) and parsimony (using penalties associated with measures of model complexity), thus trying to avoid over-fitting. The use of information criteria follows the estimation of candidate finite mixture models, for each of which a predetermined number of groups is indicated, generally resorting to an EM (expectation-maximization) algorithm [14].

In this work, we focus on determining the number of groups while clustering categorical data, using an EM-embedded approach to estimate the number of groups. The novelty of this approach is that it does not rely on selecting among a set of pre-estimated candidate models, but rather integrates estimation and model selection in a single algorithm. We capitalize on the approach developed by Figueiredo and Jain [15] for clustering continuous data and extend it to deal with categorical data. The proposed method is based on a minimum message length (MML) criterion to select the number of clusters and on an EM algorithm to estimate the model parameters; our implementation follows a previous version described in [16].

The paper is organized as follows: Section 2 reviews the finite mixture model-based approach for clustering and addresses the case of categorical data; Section 3 provides an introduction to the topic of model selection for discrete finite mixtures; in Section 4, we describe the proposed EM-MML algorithm; in Section 5, experimental results based on synthetic and real data illustrate the performance of the EM-MML approach. Concluding remarks are summarized in Section 6.
2 Clustering with finite mixture models

Finite mixture models offer a model-based approach to clustering, exhibiting some competitive advantages when compared to alternative methods: besides producing the allocation of observations to clusters, they yield estimates of within-cluster joint probability functions for the base variables; moreover, they provide means to allocate new observations to groups; finally, when used with information criteria, mixture models provide a statistical framework to determine the number of clusters. The literature on finite mixture models and their applications is vast, including several books covering theory, geometry, and applications [17], [18], [19], [20], [21], [22]. For example, in market segmentation, cluster analysis via finite mixture models has replaced more traditional cluster analysis, such as the K-means algorithm, as the state of the art [23].

Finite mixture models have played an important role in the history of modern statistics. One of the first applications of mixture models is due to Newcomb [24], who used a mixture of Gaussians to analyse a collection of observations of transits of Mercury. Pearson [25] fitted a mixture of two Gaussians with unequal variances in an analysis of different species of crabs. Among many other examples of applying mixture models, MacDonald and Pitcher [26] analysed a single species of fish in a lake, using a mixture of Gaussians where each component consists of the fish of a single yearly spawning of that species. Another example is given by Do and McLachlan [27], where a mixture of Gaussians was used to study the populations of rats being eaten by a particular species of owl, with the distinct rat species corresponding to the components of the mixture.
Al-Hussaini and Abdel-Hamid [28] studied the behavior of the failure time of a device; they fitted a mixture of components, each of which represents a different cause of failure. In their research, special attention was paid to mixtures of two Weibull components, but two exponential components, two Rayleigh components, and a mixture of Rayleigh and Weibull components were also analysed. In these examples, the segmentation process uncovers physically meaningful components.

When applying finite mixture models to social sciences, the analyst may be confronted with the need to uncover sub-populations based on qualitative indicators. In this context, the use of mixture models for categorical data is particularly pertinent. For example, for clustering categorical data from multiple choice questions, a mixture of multinomial distributions is used in order to market products [29].

2.1 Definitions and concepts

Let Y = {y_i, i = 1, ..., n} be a set of n independent and identically distributed (i.i.d.) sample observations of a random vector Y = [Y_1, ..., Y_L]'. If Y follows a mixture of K component densities f(y | θ_k) (k = 1, ..., K), with probabilities {α_1, ..., α_K}, the probability (density) function of Y is

  f(y | Θ) = Σ_{k=1}^K α_k f(y | θ_k),

where Θ = {θ_1, ..., θ_K, α_1, ..., α_K} is the set of all the parameters of the model and θ_k are the distributional parameters defining the k-th component. The proportions, also called mixing probabilities, are subject to the usual constraints: Σ_{k=1}^K α_k = 1 and α_k ≥ 0, k = 1, ..., K.

The log-likelihood of the observed set of sample observations is

  log f(Y | Θ) = log Π_{i=1}^n f(y_i | Θ) = Σ_{i=1}^n log Σ_{k=1}^K α_k f(y_i | θ_k).

In clustering, the identity of the component that generated each sample observation is unknown.
The observed data Y are therefore regarded as incomplete, the missing data being a set of indicator variables Z = {z_1, ..., z_n}, each taking the form z_i = [z_i1, ..., z_iK]', where z_ik is a binary indicator: z_ik takes the value 1 if the observation y_i was generated by the k-th component, and 0 otherwise. It is usually assumed that the {z_i, i = 1, ..., n} are i.i.d., following a multinomial distribution of K categories with probabilities {α_1, ..., α_K}. The log-likelihood of the complete data {Y, Z} is given by

  log f(Y, Z | Θ) = Σ_{i=1}^n Σ_{k=1}^K z_ik log[ α_k f(y_i | θ_k) ].

2.2 Discrete finite mixture models

Consider that each variable in Y, Y_l (l = 1, ..., L), can take one of C_l categories. Conditionally on having been generated by the k-th component of the mixture, each Y_l is thus modeled by a multinomial distribution with n_l trials, C_l categories, and non-negative parameters θ_kl = {θ_klc, c = 1, ..., C_l}, with Σ_{c=1}^{C_l} θ_klc = 1. For a sample y_il (i = 1, ..., n) of Y_l, we denote by y_ilc the number of outcomes in category c, which is a sufficient statistic; naturally, Σ_{c=1}^{C_l} y_ilc = n_l. Thus, with θ_k = {θ_k1, ..., θ_kL} and Θ = {θ_1, ..., θ_K, α_1, ..., α_K}, the log-likelihood function for a set of observations corresponding to a discrete finite mixture model (mixture of multinomials) is

  log p(Y | Θ) = Σ_{i=1}^n log Σ_{k=1}^K α_k Π_{l=1}^L [ n_l! Π_{c=1}^{C_l} (θ_klc)^{y_ilc} / y_ilc! ].   (1)

In order to estimate the parameters of this mixture, it is necessary to ensure that they are identifiable. In particular, in the case of a mixture of multinomials, a specific identifiability condition must be fulfilled: a mixture of multinomial distributions is identifiable if T ≥ 2K − 1, where T is the number of trials of each multinomial distribution (see details in [30] and [31]).
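As a concrete illustration, the log-likelihood (1) can be evaluated numerically. The following is a minimal NumPy sketch; the array shapes, the function name, and the log-space evaluation strategy are our own choices, not taken from the paper's implementation:

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)  # elementwise log-Gamma, so log(x!) = lgamma(x + 1)

def mixture_loglik(y, alpha, theta):
    """Log-likelihood (1) of a multinomial mixture (illustrative shapes):
    y:     (n, L, C) array of counts y_ilc (y[i, l] sums to n_l)
    alpha: (K,) mixing probabilities
    theta: (K, L, C) multinomial parameters (theta[k, l] sums to 1)
    """
    n_l = y.sum(axis=2)                                                  # (n, L) trials
    # log multinomial coefficients: sum_l [log n_l! - sum_c log y_ilc!]
    log_coef = _lgamma(n_l + 1).sum(axis=1) - _lgamma(y + 1).sum(axis=(1, 2))
    # sum_l sum_c y_ilc * log theta_klc, for every observation i and component k
    log_core = np.einsum('ilc,klc->ik', y, np.log(theta))                # (n, K)
    a = np.log(alpha)[None, :] + log_core + log_coef[:, None]            # (n, K)
    m = a.max(axis=1, keepdims=True)                                     # stable log-sum-exp
    return (m[:, 0] + np.log(np.exp(a - m).sum(axis=1))).sum()
```

Working in log space avoids underflow when the counts y_ilc are large, since each term of the inner sum over k is a product of many small probabilities.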
As in the Gaussian case, the log-likelihood in (1) can be seen as corresponding to a missing-data problem, where the missing data have exactly the same meaning and structure as above. The log-likelihood of the complete data {Y, Z} is thus given by

  log p(Y, Z | Θ) = Σ_{i=1}^n Σ_{k=1}^K z_ik log( α_k Π_{l=1}^L [ n_l! Π_{c=1}^{C_l} (θ_klc)^{y_ilc} / y_ilc! ] ).   (2)

2.3 Estimation of finite mixture models via the EM algorithm

To obtain a maximum-likelihood (ML) or maximum a posteriori (MAP) estimate of the parameters of a multinomial mixture, the well-known EM algorithm is usually the tool of choice ([14] and [32]). EM is an iterative algorithm which alternates between two steps, the expectation step (E-step) and the maximization step (M-step), described next.

E-step: Compute the expectation of the complete log-likelihood (2), with respect to the missing variables Z, given the observed data Y and the current parameter estimate Θ̂^(t) (where t is the iteration counter). Because log p(Y, Z | Θ) is linear with respect to Z (as is clear in (2)),

  E[ log p(Y, Z | Θ) | Y, Θ̂^(t) ] = log p(Y, Z̄^(t) | Θ),

where each element z̄^(t)_ik of Z̄^(t) is given by

  z̄^(t)_ik = E[ Z_ik | Y, Θ̂^(t) ] = P[ Z_ik = 1 | y_i, Θ̂^(t) ] = α_k f(y_i | θ_k) / Σ_{j=1}^K α_j f(y_i | θ_j),   (3)

since Z_ik is binary and conditionally independent of all y_j, for j ≠ i; the third equality results simply from Bayes' law.

M-step: Update the parameter estimates by maximizing the current estimate of the expected complete log-likelihood:

  Θ̂^(t+1) = arg max_Θ [ log p(Y, Z̄^(t) | Θ) + log p(Θ) ],   (4)

where p(Θ) is a prior, in the case of MAP estimation; in the case of ML estimation, the term log p(Θ) is absent.
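The two steps can be sketched for the multinomial mixture as follows. This is an illustrative NumPy sketch of one ML iteration, not the authors' code; names and shapes are our own. Note that the multinomial coefficient is the same for every component k, so it cancels in the ratio (3):

```python
import numpy as np

def em_step(y, alpha, theta):
    """One EM iteration (ML) for a multinomial mixture; illustrative shapes:
    y: (n, L, C) counts, alpha: (K,), theta: (K, L, C)."""
    # E-step (3): responsibilities z_ik proportional to alpha_k * f(y_i | theta_k);
    # the multinomial coefficient cancels between numerator and denominator
    log_post = np.log(alpha)[None, :] + np.einsum('ilc,klc->ik', y, np.log(theta))
    log_post -= log_post.max(axis=1, keepdims=True)     # numerical stability
    z = np.exp(log_post)
    z /= z.sum(axis=1, keepdims=True)                   # (n, K) responsibilities
    # M-step: weighted ML updates of the mixing and multinomial parameters
    alpha_new = z.mean(axis=0)
    counts = np.einsum('ik,ilc->klc', z, y)             # sum_i z_ik * y_ilc
    theta_new = counts / counts.sum(axis=2, keepdims=True)
    return alpha_new, theta_new, z
```

Iterating `em_step` until the log-likelihood stops increasing gives the usual ML fit for a fixed number of components K.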
A more detailed derivation of the EM algorithm, its convergence properties, extensions, and instances for different types of missing-data models can be found, e.g., in [33], [34].

3 Model selection for categorical data

Model selection is an important problem in statistical analysis [35]. In model-based clustering, the term model selection usually refers to the problem of determining the number of clusters, although it may also refer to the problem of selecting the structure of the clusters. Model-based clustering provides a statistical framework to solve this problem [36], usually resorting to information criteria. The rationale of such criteria is that fitting a model with a large number of clusters requires the estimation of a very large number of parameters, with a consequent loss of precision in these estimates. Therefore, one should penalize excessive model complexity while simultaneously trying to increase the model's fit to the data, based on the likelihood function. The best-known information criteria are BIC, AIC, and their modifications, namely the consistent AIC (CAIC) [37] and the modified AIC (MAIC) [38]. Other criteria that have been proposed include the integrated completed likelihood (ICL) [39], the minimum description length (MDL) [40], and the minimum message length (MML) criterion [41]. Information criteria are well known and easily implemented, the final model being selected according to a compromise between its fit to the data and its complexity.

In this work, we use the MML criterion to choose the number of components of a mixture of multinomials. MML is based on the information-theoretic view of estimation and model selection, according to which an adequate model is one that allows a short description of the observations [41].
MML-type criteria evaluate statistical models according to their ability to compress a message containing the data, looking for a balance between choosing a simple model and one that describes the data well. According to Shannon's information theory, if Y is some random variable with probability distribution p(y | Θ), the optimal code-length (in an expected value sense) for an outcome y is l(y | Θ) = −log_2 p(y | Θ), measured in bits (from the base-2 logarithm) [42]. If Θ is unknown, the total code-length function has two parts: l(y, Θ) = l(y | Θ) + l(Θ); the first part encodes the outcome y, while the second part encodes the parameters of the model. The first part corresponds to the fit of the model to the data (better fit corresponds to higher compression), while the second part represents the complexity of the model. Different ways to compute l(Θ) are derived from different statistical frameworks and yield different flavors of information criteria, namely MDL and MML, where MML admits the existence of a prior p(Θ), while MDL does not.

The message length function for a mixture of distributions (as developed in [43]) is

  l(y, Θ) = −log p(Θ) − log p(y | Θ) + (1/2) log |I(Θ)| + (C/2)(1 − log(12)),   (5)

where p(y | Θ) is the likelihood function, I(Θ) ≡ −E[ ∂²/∂Θ² log p(Y | Θ) ] is the expected Fisher information matrix, |I(Θ)| its determinant, and C is the number of parameters of the model that need to be estimated. For example, for the mixture of K multinomial distributions presented in (1),

  C = (K − 1) + K Σ_{l=1}^L (C_l − 1).

The expected Fisher information matrix of a mixture leads to a complex analytical form of MML which cannot be easily computed.
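The parameter count C above is straightforward to compute: K − 1 free mixing probabilities plus, for each of the K components, C_l − 1 free multinomial parameters per variable. A small illustrative helper (the function name is ours):

```python
def num_free_params(K, categories):
    """Free-parameter count C for a K-component multinomial mixture.
    `categories` lists C_l, the number of categories of each of the L variables."""
    return (K - 1) + K * sum(C_l - 1 for C_l in categories)
```

For instance, a 2-component mixture over two variables with 2 and 3 categories has C = 1 + 2 * (1 + 2) = 7 free parameters.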
To overcome this difficulty, Figueiredo and Jain [15] replace the expected Fisher information matrix by its complete-data counterpart, I_c(Θ) ≡ −E[ ∂²/∂Θ² log p(Y, Z | Θ) ]. Also, they adopt independent Jeffreys' priors for the mixture parameters. The resulting message length function is

  l(y, Θ) = (M/2) Σ_{k: α_k > 0} log(n α_k / 12) + (k_nz / 2) log(n / 12) + k_nz (M + 1)/2 − log p(y | Θ),   (6)

where M is the number of parameters specifying each component (the dimension of each θ_k) and k_nz is the number of components with non-zero probability (for more details on the derivation of (6), see [15], [43]).

4 The proposed MML-based EM algorithm

In order to estimate a mixture of multinomials, we propose to use a variant of the EM algorithm (herein termed EM-MML), which integrates both estimation and model selection, by directly minimizing (6). This algorithm is an extension, to the multinomial case, of the approach developed in [15] for clustering continuous data, based on a Gaussian mixture model. The algorithm results from observing that (6) contains, in addition to the log-likelihood term, an explicit penalty on the number of components (the two terms proportional to k_nz), and a term (the first one) that can be seen as a log-prior on the α_k parameters of Θ, which will directly affect the M-step. Finally, notice that in the presence of a set of multinomial observations Y = {y_i, i = 1, ..., n}, the log-likelihood log p(Y | Θ) is as given in (1).

E-step: The E-step of EM-MML is precisely the same as in the case of ML or MAP estimation, since the generative model for the data is the same. Since we are dealing with a multinomial mixture, we simply have to plug the corresponding multinomial probability function into (3), yielding

  z̄^(t)_ik = α_k Π_{l=1}^L [ n_l! Π_{c=1}^{C_l} (θ̂^(t)_klc)^{y_ilc} / y_ilc! ] / Σ_{j=1}^K α_j Π_{l=1}^L [ n_l! Π_{c=1}^{C_l} (θ̂^(t)_jlc)^{y_ilc} / y_ilc! ],   (7)

for i = 1, ..., n and k = 1, ..., K.

M-step: For the M-step, noticing that the first term in (6) can be seen as the negative log-prior −log p(α_k) = ((C − K + 1)/(2K)) log α_k (plus a constant), and enforcing the conditions α_k ≥ 0, for k = 1, ..., K, and Σ_{k=1}^K α_k = 1, yields the following updates for the estimates of the α_k parameters:

  α̂^(t+1)_k = max{ 0, Σ_{i=1}^n z̄^(t)_ik − (C − K + 1)/(2K) } / Σ_{j=1}^K max{ 0, Σ_{i=1}^n z̄^(t)_ij − (C − K + 1)/(2K) },   (8)

for k = 1, ..., K. Notice that some α̂^(t+1)_k may be zero; in that case, the k-th component is excluded from the mixture model. The multinomial parameters corresponding to components with α̂^(t+1)_k = 0 need not be further calculated, since these components do not contribute to the likelihood. For the components with non-zero probability, α̂^(t+1)_k > 0, the estimates of the multinomial parameters are updated to their standard weighted ML estimates:

  θ̂^(t+1)_klc = Σ_{i=1}^n z̄^(t)_ik y_ilc / ( n_l Σ_{i=1}^n z̄^(t)_ik ),   (9)

for k = 1, ..., K, l = 1, ..., L, and c = 1, ..., C_l. Notice that, in accordance with the meaning of the θ_klc parameters, Σ_{c=1}^{C_l} θ̂^(t+1)_klc = 1.

The EM-MML algorithm for simultaneously clustering categorical data and selecting the number of clusters is summarized in Figure 1.

Figure 1: The EM-MML algorithm

Input:
  data: Y = {y_i, i = 1, ..., n}
  the minimum number of segments: K_min
  the maximum number of segments: K_max
  minimum increase threshold for the log-likelihood function: δ
Output:
  number of segments: K
  segment probabilities: {α_1, ..., α_K}
  multinomial parameters: {θ_1, ..., θ_K}
Initialization:
  initialize the parameters resorting to the empirical distribution: p(y_l | θ_lk) (l = 1, ..., L; k = 1, ..., K_max)
  set the segment probabilities: α_k = 1/K_max (k = 1, ..., K_max)
  store the initial log-likelihood
  store the initial message length (iml)
  minml ← iml
  continue ← 1
while continue = 1 do
  while increases in the log-likelihood are above δ do
    k ← K_max
    while k > K_min do
      compute α̂_k according to (8)
      if α̂_k = 0
        remove component k from the mixture
        K_max ← K_max − 1
        (α̂_1, ..., α̂_{K_max}) ← (α̂_1 / Σ_{k=1}^{K_max} α̂_k, ..., α̂_{K_max} / Σ_{k=1}^{K_max} α̂_k)
      else
        compute θ̂ according to (9)
        E-step according to (7)
        k ← k − 1
      end if
    end while
    compute log-likelihood
    compute message length (ml)
  end while
  if ml < minml
    minml ← ml
    segment probabilities ← current parameters
    multinomial probabilities ← current parameters
  end if
  if K_max > K_min
    k_remove ← argmin{α_1, ..., α_K}
    remove component k_remove from the mixture model
    K_max ← K_max − 1
    (α_1, ..., α_{K_max}) ← (α̂_1 / Σ_{k=1}^{K_max} α̂_k, ..., α̂_{K_max} / Σ_{k=1}^{K_max} α̂_k)
  else
    continue ← 0
  end if
end while
The best solution corresponds to the minimum message length obtained.

5 Data analysis and results

5.1 Evaluating the EM-MML performance on synthetic data

To evaluate the performance of the EM-MML algorithm, we begin by resorting to synthetic data sets: 2-component mixtures of multinomials (10 data sets) and 3-component mixtures (10 data sets) are used. In order to provide useful insights regarding the EM-MML performance, the 10 data sets within each setting exhibit diverse degrees of cluster separation.
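The distinctive part of the EM-MML M-step, the component-annihilating update (8), can be sketched as follows. This is an illustrative sketch with names of our choosing; note that, with C = (K − 1) + K Σ_l (C_l − 1) as in Section 3, the per-component penalty (C − K + 1)/(2K) reduces to half the number of free parameters per component:

```python
import numpy as np

def mml_alpha_update(z, penalty):
    """Update (8) for the mixing probabilities (illustrative).
    z:       (n, K) responsibility matrix from the E-step
    penalty: the per-component term (C - K + 1) / (2K) of (8)
    Components whose total responsibility falls below the penalty get weight 0
    and are removed from the mixture.
    """
    support = np.maximum(0.0, z.sum(axis=0) - penalty)  # max{0, sum_i z_ik - penalty}
    return support / support.sum()                      # renormalize over survivors
```

Components driven to zero weight are pruned, so model selection happens inside the EM iterations rather than by refitting for each candidate K.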
For this purpose, we measure the separation between all pairs of clusters (C_k, C_k′) forming a partition Π_K according to

  separation(Π_K) = (2 / (K(K − 1))) Σ_{k ≠ k′} (1/2) [ D_KL(C_k; C_k′) + D_KL(C_k′; C_k) ],   (10)

where

  D_KL(C_k; C_k′) = Σ_{l=1}^L Σ_{c=1}^{C_l} P(Y_kl = c) log( P(Y_kl = c) / P(Y_k′l = c) ) = Σ_{l=1}^L Σ_{c=1}^{C_l} θ̂_klc log( θ̂_klc / θ̂_k′lc )

is the sum of the Kullback-Leibler divergences between the L multinomial distributions corresponding to components k and k′ of the mixture. This symmetric measure of separation ranges from around 0.01 (poor cluster separation) to 0.17 (good cluster separation) in the experimental scenarios.

The EM-MML results are compared with those obtained from a standard EM algorithm and well-known information criteria, namely BIC, AIC, CAIC, MAIC, and ICL. The performance of the various criteria and methods is assessed by the rate (over 30 runs) of correct selection of the true number of clusters, and also by the computation time. The results are presented in Figure 2.

Figure 2: Comparison of several model selection criteria on synthetic data from mixtures of multinomials with K = 2 (left) and K = 3 (right).

In the two-component data sets with separation lower than 0.04, all the approaches identify only one cluster. ICL is unable to recover the true number of clusters, even with moderate separation (up to around 0.12). The other criteria, and our EM-MML approach, correctly identify two clusters when separation is above 0.04. For three-component mixtures, the ICL criterion is able to identify the correct number of clusters only for cluster separation higher than 0.08. The other criteria (including EM-MML) present similar results, identifying three clusters for separations larger than 0.03.
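For reference, the separation measure (10) can be sketched directly from the estimated component parameters (an illustrative implementation with names and shapes of our choosing):

```python
import numpy as np

def separation(theta):
    """Symmetrized-KL separation (10); theta is (K, L, C) of component parameters."""
    K = theta.shape[0]
    # D_KL(C_k; C_k'): summed over the L variables and their categories
    kl = lambda p, q: np.sum(p * np.log(p / q))
    total = sum(0.5 * (kl(theta[k], theta[j]) + kl(theta[j], theta[k]))
                for k in range(K) for j in range(K) if j != k)
    return 2.0 / (K * (K - 1)) * total
```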
Generally, one can thus conclude that EM-MML has a performance similar to BIC, AIC, CAIC, and MAIC in recovering the true number of clusters, while ICL clearly underperforms the other methods for not clearly separated components.

In order to further evaluate the comparative performance of EM-MML, we compare its computation times with those of BIC. Notice that BIC and the other information criteria have similar computation times. Based on a sample of 300 runs from 2-component mixtures of multinomials, we obtain the following average computation times: 146.84 seconds for EM-MML and 230.67 seconds for BIC. A paired-samples t test yields t = −17.06, df = 299, and p-value < 0.01, allowing us to conclude that EM-MML is significantly faster than BIC. For the 3-component mixtures of multinomials, the average computation times are 194.38 and 239.27 seconds for EM-MML and BIC, respectively. Again, the null hypothesis of a paired-samples t test is rejected (t = −6.88; df = 299; p-value < 0.01), showing that EM-MML is significantly faster than BIC.

Overall, the EM-MML method shows good performance when dealing with synthetic data sets: in selecting the true number of clusters, it has similar performance to BIC, AIC, CAIC, and MAIC, and outperforms ICL. In terms of computation time, since EM-MML does not require a sequential approach, it is clearly faster than the other criteria.

5.2 Experiments on real data

Additional insight into the performance of EM-MML is obtained by applying it to two real data sets from the European Social Survey (ESS). The ESS is a biennial survey started in 2002, which measures the attitudes, beliefs, values, and behaviour patterns of European populations. The most recent survey was in 2012 (round 6), covering 30 countries and 243 regions.
For the purpose of our experiment, we aggregate the ESS data by region, taking into account sampling weights (ESS weights), which are meant to ensure the representativeness of the sample. Clustering is performed based on: 1) trust in some institutions (namely, the country's parliament, the legal system, the police, politicians, political parties, the European Parliament, and the United Nations); 2) satisfaction with life as a whole, the economy, the government, and the functioning of democracy. We recode responses into binary variables: distrust/trust and dissatisfied/satisfied.

A summary of the comparison of the several model selection criteria on the two ESS data sets is presented in Tables 1 and 2. To measure the relationship between the uncovered segments and the clustering base variables, we resort to Cramer's V association measure, which ranges from 0 (no association) to 1 (perfect association), and to its sum as a proxy of the variables' discriminant capacity.

The number of segments selected by EM-MML is much lower than for the remaining criteria. This fact avoids estimation problems associated with very small segments and also improves the interpretability of the clustering solution. In addition, the total value of the Cramer's V association measure (which evaluates the relationship between clusters and clustering base variables) is higher for EM-MML, indicating variables with higher discriminant capacity in the EM-MML solution. The segments selected by the EM-MML criterion are presented in Tables 3 and 4.

Table 1: Comparison of Cramer's V association between segmentation base variables "trust in" and the segments obtained with each criterion.

                              BIC    AIC    CAIC   MAIC   ICL    EM-MML
Number of segments            11     17     11     18     11     4
Trust in
  country's parliament        .58    .53    .58    .55    .58    .72
  the legal system            .55    .54    .55    .57    .55    .67
  the police                  .52    .51    .52    .50    .52    .57
  politicians                 .64    .57    .64    .58    .64    .78
  political parties           .64    .60    .64    .58    .64    .78
  the European Parliament     .46    .45    .46    .47    .46    .52
  the United Nations          .47    .43    .47    .48    .47    .52
Sum                           3.86   3.63   3.86   3.73   3.86   4.56

Table 2: Comparison of Cramer's V association between segmentation base variables "satisfaction with" and the segments obtained with each criterion.

                                         BIC    AIC    CAIC   MAIC   ICL    EM-MML
Number of segments                       13     18     13     18     13     7
How satisfied with
  life as a whole                        .59    .55    .59    .55    .59    .55
  present state of economy in country    .58    .55    .58    .55    .58    .62
  the national government                .56    .55    .56    .55    .56    .60
  the way democracy works in country     .56    .53    .56    .53    .56    .63
Sum                                      2.29   2.18   2.29   2.18   2.29   2.40

Table 3: EM-MML segmentation of the "trust in" data.

Trust in                    Segment 1   Segment 2   Segment 3   Segment 4
country's parliament        59.2%       18.2%       14.2%       35.8%
the legal system            72.0%       23.2%       17.7%       55.4%
the police                  81.4%       41.6%       20.8%       72.9%
politicians                 45.6%       10.6%       8.7%        20.8%
political parties           44.1%       10.8%       9.9%        19.1%
the European Parliament     41.7%       26.9%       20.2%       28.7%
the United Nations          64.7%       35.0%       24.6%       45.0%

The "trust in institutions" EM-MML segmentation yields four segments. For example, Lisbon is in segment 2, where all the regions have low trust in institutions. The lowest trust values in this segment refer to trust in politicians and political parties. Comparing trust in the different institutions, Lisbon citizens have moderate trust in the country's police. Dublin, another European capital city, is in segment 4, characterized by higher trust in institutions, especially in the police (72.9%) and the legal system (55.4%). The separation of these segments (according to the measure in (10)) is 1.46, indicating that they are well separated.

Table 4: EM-MML segmentation of the "satisfaction with" data.

How satisfied with                  Seg. 1   Seg. 2   Seg. 3   Seg. 4   Seg. 5   Seg. 6   Seg. 7
life as a whole                     92.5%    93.0%    29.3%    79.7%    71.8%    59.6%    88.0%
present state of economy in country 85.8%    36.4%    3.5%     20.7%    35.5%    15.8%    57.3%
the national government             70.5%    56.7%    8.5%     23.4%    34.5%    21.9%    42.9%
the way democracy works in country  85.8%    81.3%    9.8%     47.1%    46.1%    22.8%    68.2%

The seven EM-MML segments for "satisfaction with" are also well separated (separation(Π_7) = 1.26). In this segmentation, Lisbon is in segment 6 and Dublin in segment 4. In both capitals, people only feel satisfied with life as a whole, but in Dublin they feel generally more satisfied than in Lisbon.

6 Discussion and perspectives

In this work, a model selection criterion and method for finite mixture models of categorical observations was proposed. The new algorithm simultaneously performs model estimation and selects the number of components/clusters. When compared to the information criteria commonly associated with the use of the EM algorithm, the EM-MML method exhibits several advantages: 1) it easily recovers the true number of clusters in synthetic data sets with various degrees of separation; 2) its computation times are significantly lower than those required by the sequential use of EM in standard approaches (such as BIC); 3) when applied to real data sets, it produces more parsimonious solutions, which are thus easier to interpret.
An additional advantage of the proposed approach, stemming from the more parsimonious solutions it obtains, is that such solutions have a higher number of observations per cluster, which helps to overcome potential estimation problems. Since the performance of EM-MML in selecting the number of clusters is encouraging, and the same criterion has already been used for feature selection [8], future developments will include the integration of both model and feature selection.

References

[1] Chen T, Zhang NL, Liu T, Poon KM, Wang Y. Model-based multidimensional clustering of categorical data. Artificial Intelligence. 2012;176:2246–2269.

[2] Xiong T, Wang S, Mayers A, Monga E. DHCC: Divisive hierarchical clustering of categorical data. Data Mining and Knowledge Discovery. 2012;24:103–135.

[3] Giordan M, Diana G. A clustering method for categorical ordinal data. Communications in Statistics - Theory and Methods. 2011;40:1315–1334.

[4] Chen K, Liu L. Best k: Critical clustering structures in categorical datasets. Knowledge and Information Systems. 2009;20:1–33.

[5] Desai A, Singh H, Pudi V. DISC: Data-intensive similarity measure for categorical data. Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science. 2011;6635:469–481.

[6] Constantinopoulos C, Titsias MK, Likas A. Bayesian feature and model selection for Gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2006;28:1013–1018.

[7] Celeux G, Martin O, Lavergne C. Mixture of linear mixed models for clustering gene expression profiles from repeated microarray experiments. Statistical Modelling. 2005;5:243–267.

[8] Law M, Figueiredo M, Jain A. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2004;26:1154–1166.

[9] Bouguila N.
Clustering of count data using generalized Dirichlet multinomial distributions. IEEE Transactions on Knowledge and Data Engineering. 2008;20:462–474.

[10] Li M, Zhang L. Multinomial mixture model with feature selection for text clustering. Knowledge-Based Systems. 2008;21:704–708.

[11] Dey T, Lim CY. Comparisons of computational methods for clustered binary data. Journal of Statistical Computation and Simulation. 2013;83:2030–2046.

[12] Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.

[13] Akaike H. Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika. 1973;60:255–265.

[14] Dempster A, Laird N, Rubin D. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38.

[15] Figueiredo MAT, Jain AK. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002;24:381–396.

[16] Silvestre C, Cardoso M, Figueiredo M. Clustering with finite mixture models and categorical variables. In: International Conference on Computational Statistics - COMPSTAT 2008; 2008.

[17] Everitt BS, Hand DJ. Finite mixture distributions. New York: Chapman and Hall; 1981.

[18] Titterington DM, Smith AFM, Makov UE. The statistical analysis of finite mixture models. New York: Wiley; 1985.

[19] McLachlan GJ, Peel D. Finite mixture models. New York: Wiley; 2000.

[20] Titterington DM, Smith AFM, Makov U. Statistical analysis of finite mixture distributions. John Wiley and Sons; 1985.

[21] Melnykov V, Maitra R. Finite mixture models and model-based clustering. Statistics Surveys. 2010;4:80–116.

[22] Lindsay BG. Mixture models: Theory, geometry, and applications. Hayward: Institute of Mathematical Statistics; 1995.

[23] Wedel M, Kamakura WA.
Market segmentation: Conceptual and methodological foundations (International Series in Quantitative Marketing). 2nd ed. Massachusetts: Kluwer Academic Publishers; 2000.

[24] Newcomb S. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics. 1886;8:343–366.

[25] Pearson K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society. 1894;185:71–110.

[26] Macdonald P, Pitcher T. Age-groups from size-frequency data: a versatile and efficient method of analysing distribution mixtures. Journal of the Fisheries Research Board of Canada. 1979;36:987–1001.

[27] Do K, McLachlan GJ. Estimation of mixing proportions: A case study. Applied Statistics. 1984;33:134–140.

[28] Al-Hussaini EK, Abdel-Hamid AH. Accelerated life tests under finite mixture models. Journal of Statistical Computation and Simulation. 2006;76:673–690.

[29] Maitra R, Melnykov V. Finite mixture models and model-based clustering. Statistical Surveys. 2010;4:80–116.

[30] Elmore RT, Wang S. Identifiability and estimation in finite mixture models with multinomial components. Technical Report 03-04, Department of Statistics, Pennsylvania State University. 2003.

[31] Portela J. Clustering discrete data through the multinomial mixture model. Communications in Statistics - Theory and Methods. 2008;37:3250–3263.

[32] Vermunt JK, Magidson J. Latent class cluster analysis. Applied Latent Class Analysis. 2002;11:89–106.

[33] McLachlan G, Krishnan T. The EM algorithm and extensions. 2nd ed. New Jersey: John Wiley and Sons; 2008.

[34] Gupta M, Chen Y. Theory and use of the EM algorithm. Now Publishers; 2011.

[35] Celeux G, Martin-Magniette ML, Maugis-Rabusseau C, Raftery AE.
Comparing model selection and regularization approaches to variable selection in model-based clustering. Journal de la Société Française de Statistique. 2014;155:57–71.

[36] Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association. 2002;97:611–631.

[37] Bozdogan H. Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika. 1987;52:345–370.

[38] Bozdogan H. Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity. In: Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach; 1994. p. 69–113.

[39] Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22:719–725.

[40] Rissanen J. Stochastic complexity in statistical inquiry. World Scientific; 1989.

[41] Wallace C, Boulton D. An information measure for classification. The Computer Journal. 1968;11:195–209.

[42] Cover T, Thomas J. Entropy, relative entropy and mutual information. In: Elements of Information Theory; 1991. Chapter 2.

[43] Baxter RA, Oliver JJ. Finding overlapping components with MML. Statistics and Computing. 2000;10:5–16.