EM algorithm and variants: an informal tutorial

The expectation-maximization (EM) algorithm introduced by Dempster et al. in 1977 is a very general method to solve maximum likelihood estimation problems. In this informal report, we review the theory behind EM as well as a number of EM variants, suggesting that beyond the current state of the art is an even much wider territory still to be discovered.

Authors: Alexis Roche

Service Hospitalier Frédéric Joliot, CEA, F-91401 Orsay, France (alexis.roche@gmail.com)
Spring 2003 (revised: September 2012)

1. Introduction

The expectation-maximization (EM) algorithm introduced by Dempster et al [12] in 1977 is a very general method to solve maximum likelihood estimation problems. In this informal report, we review the theory behind EM as well as a number of EM variants, suggesting that beyond the current state of the art is an even much wider territory still to be discovered.

2. EM background

Let Y be a random variable with probability density function (pdf) p(y|θ), where θ is an unknown parameter vector. Given an outcome y of Y, we aim at maximizing the likelihood function L(θ) ≡ p(y|θ) with respect to θ over a given search space Θ. This is the very principle of maximum likelihood (ML) estimation. Unfortunately, except in not very exciting situations such as, e.g., estimating the mean and variance of a Gaussian population, an ML estimation problem generally has no closed-form solution. Numerical routines are then needed to approximate it.

2.1. EM as a likelihood maximizer

The EM algorithm is a class of optimizers specifically tailored to ML problems, which makes it both general and not so general. Perhaps the most salient feature of EM is that it works iteratively by maximizing successive local approximations of the likelihood function. Each iteration therefore consists of two steps: one that performs the approximation (the E-step) and one that maximizes it (the M-step). But, let's make it clear, not every two-step iterative scheme is an EM algorithm. For instance, Newton and quasi-Newton methods [27] work in a similar iterative fashion but do not have much to do with EM.
What essentially defines an EM algorithm is the philosophy underlying the local approximation scheme – which, in particular, doesn't rely on differential calculus. The key idea underlying EM is to introduce a latent variable Z whose pdf depends on θ, with the property that maximizing p(z|θ) is easy or, say, easier than maximizing p(y|θ). Loosely speaking, we somehow enhance the incomplete data by guessing some useful additional information. Technically, Z can be any variable such that θ → Z → Y is a Markov chain¹, i.e. we assume that p(y|z, θ) is independent from θ, yielding a Chapman-Kolmogorov equation:

  p(z, y|θ) = p(z|θ) p(y|z)    (1)

Reasons for that definition will arise soon. Conceptually, Z is a complete-data space in the sense that, if it were fully observed, then estimating θ would be an easy game. We will emphasize that the convergence speed of EM is highly dependent upon the complete-data specification, which is widely arbitrary even though some estimation problems may have seemingly "natural" hidden variables. But, for the time being, we assume that the complete-data specification step has been accomplished.

2.2. EM as a consequence of Jensen's inequality

Quite surprisingly, the original EM formulation stems from a very simple variational argument.
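As a warm-up, the whole argument rests on the concavity of the logarithm, i.e. Jensen's inequality E[log X] ≤ log E[X]. A quick numerical sanity check (my own, with arbitrary numbers, not from the original report):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.dirichlet(np.ones(5))    # weights of an arbitrary discrete pdf
x = rng.uniform(0.1, 5.0, 5)     # arbitrary positive values

lhs = float(np.sum(w * np.log(x)))   # E[log X]
rhs = float(np.log(np.sum(w * x)))   # log E[X]
# Concavity of log guarantees lhs <= rhs, with equality iff X is constant.
```

Running this with any weights and positive values never violates the bound; the derivation below exploits exactly this fact.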
Under almost no assumption regarding the complete variable Z, except that its pdf doesn't vanish to zero, we can bound the variation of the log-likelihood function L(θ) ≡ log p(y|θ) as follows:

  L(θ) − L(θ′) = log [p(y|θ) / p(y|θ′)]
               = log ∫ [p(z, y|θ) / p(y|θ′)] dz
               = log ∫ [p(z, y|θ) / p(z, y|θ′)] p(z|y, θ′) dz
               = log ∫ [p(z|θ) / p(z|θ′)] p(z|y, θ′) dz    (2)
               ≥ ∫ log [p(z|θ) / p(z|θ′)] p(z|y, θ′) dz ≡ Q(θ, θ′)    (3)

Step (2) results from the fact that p(y|z, θ) is independent from θ owing to (1). Step (3) follows from Jensen's inequality (see [9] and appendix A.2) along with the well-known concavity property of the logarithm function. Therefore, Q(θ, θ′) is an auxiliary function for the log-likelihood, in the sense that: (i) the likelihood variation from θ′ to θ is always greater than Q(θ, θ′), and (ii) we have Q(θ′, θ′) = 0. Hence, starting from an initial guess θ′, we are guaranteed to increase the likelihood value if we can find a θ such that Q(θ, θ′) > 0. Iterating such a process defines an EM algorithm.

There is no general convergence theorem for EM, but thanks to the above-mentioned monotonicity property, convergence results may be proved under mild regularity conditions. Typically, convergence towards a non-global likelihood maximizer, or a saddle point, is a worst-case scenario. Still, the only trick behind EM is to exploit the concavity of the logarithm function!

¹ In many presentations of EM, Z is an aggregate variable (X, Y), where X is some "missing" data, which corresponds to the special case where the transition Z → Y is deterministic. We believe this restriction, although important in practice, is not useful to the global understanding of EM. By the way, further generalizations will be considered later in this report.

2.3. EM as expectation-maximization

Let's introduce some notations. Developing the logarithm in the right-hand side of (3), we may interpret our auxiliary function as a difference: Q(θ, θ′) = Q(θ|θ′) − Q(θ′|θ′), with:

  Q(θ|θ′) ≡ ∫ log p(z|θ) p(z|y, θ′) dz    (4)

Clearly, for a fixed θ′, maximizing Q(θ, θ′) with respect to θ is equivalent to maximizing Q(θ|θ′). If we consider the residual function R(θ|θ′) ≡ L(θ) − Q(θ|θ′), the incomplete-data log-likelihood may be written as:

  L(θ) = Q(θ|θ′) + R(θ|θ′)

The EM algorithm's basic principle is to replace the maximization of L(θ) with that of Q(θ|θ′), hopefully easier to deal with. We can ignore R(θ|θ′) because inequality (3) implies that R(θ|θ′) ≥ R(θ′|θ′). In other words, EM works because the auxiliary function Q(θ|θ′) always deteriorates as a likelihood approximation when θ departs from θ′. In an ideal world, the approximation error would be constant; then, maximizing Q would not only increase, but truly maximize the likelihood. Unfortunately, this won't be the case in general. Therefore, unless we decide to give up on maximizing the likelihood, we have to iterate – which gives rise to quite a popular statistical learning algorithm. Given a current parameter estimate θn:

• E-step: form the auxiliary function Q(θ|θn) as defined in (4), which involves computing the posterior distribution of the unobserved variable, p(z|y, θn). The "E" in E-step stands for "expectation" for reasons that will arise in section 2.4.

• M-step: update the parameter estimate by maximizing the auxiliary function: θn+1 = arg maxθ Q(θ|θn).

An obvious but important generalization of the M-step is to replace the maximization with a mere increase of Q(θ|θn).
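As a concrete illustration (my own minimal sketch, not from the original report), here are these two steps for a two-component, unit-variance 1D Gaussian mixture, where the latent variable Z is the component label and both steps have closed forms:

```python
import numpy as np

def em_gmm_1d(y, n_iter=100):
    """EM for a two-component 1D Gaussian mixture with unit variances.

    E-step: compute responsibilities r[i, k] = p(z_i = k | y_i, theta_n).
    M-step: maximize the auxiliary function Q(theta | theta_n) in closed form.
    """
    # Crude (assumed) initialization: put one mean at each data extreme.
    mu = np.array([np.min(y), np.max(y)])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibilities of each component per point.
        dens = pi * np.exp(-0.5 * (y[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizer of Q for this model.
        pi = r.mean(axis=0)
        mu = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)
    return pi, mu

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
pi, mu = em_gmm_1d(y)
```

With well-separated clusters like these, the estimates land close to the generating values (mixing proportions near 0.3/0.7, means near −2 and 3); each iteration provably does not decrease the likelihood.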
Since the likelihood won't be maximized in a single iteration anyway, increasing the auxiliary function is enough to ensure that the likelihood increases in turn, thus preserving the monotonicity property of EM. This defines generalized EM (GEM) algorithms. More on this later.

2.4. Some probabilistic interpretations here...

For those familiar with probability theory, Q(θ|θ′) as defined in (4) is nothing but the conditional expectation of the complete-data log-likelihood in terms of the observed variable, taken at Y = y, and assuming the true parameter value is θ′:

  Q(θ|θ′) ≡ E[log p(Z|θ) | y, θ′]    (5)

This remark explains the "E" in E-step, but also yields some probabilistic insight on the auxiliary function. For all θ, Q(θ|θ′) is an estimate of the complete-data log-likelihood that is built upon the knowledge of the incomplete data and under the working assumption that the true parameter value is known. In some way, it is not far from being the "best" estimate that we can possibly make without knowing Z, because conditional expectation is, by definition, the estimator that minimizes the conditional mean squared error².

Having said that, we might still be a bit suspicious. While we can grant that Q(θ|θ′) is a reasonable estimate of the complete-data log-likelihood, recall that our initial problem is to maximize the incomplete-data (log-)likelihood. How good a fit is Q(θ|θ′) for L(θ)? To answer that, let's see a bit more how the residual R(θ|θ′) may be interpreted. We have:

  R(θ|θ′) = log p(y|θ) − ∫ log p(z|θ) p(z|y, θ′) dz
          = ∫ log [p(y|θ) / p(z|θ)] p(z|y, θ′) dz
          = ∫ log [p(y|z, θ) / p(z|y, θ)] p(z|y, θ′) dz,    (6)

where the last step relies on Bayes' law. Now, p(y|z, θ) = p(y|z) is independent from θ by the Markov property (1).

² For all θ, we have: Q(θ|θ′) = arg minµ ∫ [log p(z|θ) − µ]² p(z|y, θ′) dz.
Therefore, using the simplified notations qθ(z) ≡ p(z|y, θ) and qθ′(z) ≡ p(z|y, θ′), we get:

  R(θ|θ′) − R(θ′|θ′) = ∫ log [qθ′(z) / qθ(z)] qθ′(z) dz ≡ D(qθ′ ‖ qθ)    (7)

In the language of information theory, the quantity D(qθ′ ‖ qθ) is known as the Kullback-Leibler distance, a general tool to assess the deviation between two pdfs [9]. Although it is not, strictly speaking, a genuine mathematical distance, it is always positive and vanishes iff the pdfs are equal which, again and not surprisingly, comes as a direct consequence of Jensen's inequality.

What does that mean in our case? We noticed earlier that the likelihood approximation Q(θ|θ′) cannot get any better as θ deviates from θ′. We now realize from equation (7) that this property reflects an implicit strategy of ignoring the variations of p(z|y, θ) with respect to θ. Hence, a perfect approximation would be one for which p(z|y, θ) is independent from θ. In other words, we would like θ → Y → Z to define a Markov chain... But, look, we already assumed that θ → Z → Y is a Markov chain. Does the Markov property still hold when permuting the roles of Y and Z? From the fundamental data processing inequality [9], the answer is no in general. Details are unnecessary here. Just remember that the validity domain of Q(θ|θ′) as a local likelihood approximation is controlled by the amount of information that both y and θ contain about the complete data. We are now going to study this aspect more carefully.

2.5. EM as a fixed-point algorithm and local convergence

Quite clearly, EM is a fixed-point algorithm:

  θn+1 = Φ(θn)    with    Φ(θ′) = arg maxθ∈Θ Q(θ, θ′)

Assume the sequence θn converges towards some value θ̂ – hopefully the maximum likelihood estimate, but possibly some other local maximum or saddle point. Under the assumption that Φ is continuous, θ̂ must be a fixed point for Φ, i.e. θ̂ = Φ(θ̂). Furthermore, we can approximate the sequence's asymptotic behavior using a first-order Taylor expansion of Φ around θ̂, which leads to:

  θn+1 ≈ S θ̂ + (I − S) θn    with    S = I − ∂Φ/∂θ |θ̂

This expression shows that the rate of convergence is controlled by S, a square matrix that is constant across iterations. Hence, S is called the speed matrix, and its spectral radius³ defines the global speed. Unless the global speed is one, the local convergence of EM is only linear.

We may relate S to the likelihood function by exploiting the fact that, under sufficient smoothness assumptions, the maximization of Q is characterized by:

  ∂Q/∂θᵀ (θn+1, θn) = 0

From the implicit function theorem, we get the gradient of Φ:

  ∂Φ/∂θ = − (∂²Q/∂θ∂θᵀ)⁻¹ ∂²Q/∂θ′∂θᵀ    ⇒    S = (∂²Q/∂θ∂θᵀ)⁻¹ [∂²Q/∂θ∂θᵀ + ∂²Q/∂θ′∂θᵀ]

where, after some manipulations:

  ∂²Q/∂θ∂θᵀ |(θ̂,θ̂) = ∫ p(z|y, θ̂) ∂² log p(z|θ)/∂θ∂θᵀ |θ̂ dz    (call the integrand −Jz(θ̂))

  ∂²Q/∂θ′∂θᵀ |(θ̂,θ̂) = ∂² log p(y|θ)/∂θ∂θᵀ |θ̂ − ∂²Q/∂θ∂θᵀ |(θ̂,θ̂)    (call the first term −Jy(θ̂))

The two quantities Jy(θ̂) and Jz(θ̂) turn out to be respectively the observed-data information matrix and the complete-data information matrix.

³ Let (λ1, λ2, ..., λm) be the complex eigenvalues of S. The spectral radius is ρ(S) = mini |λi|.
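To make the linear-rate claim tangible, here is a small numerical check on a toy model of my own (assumed for illustration): complete data Z ~ N(θ, 1) observed through Y = Z + ε with ε ~ N(0, σ²). Here the EM update collapses to the posterior mean θn+1 = E[Z | y, θn], the information quantities are Jy = 1/(1+σ²) and J̄z = 1, and the observed error-contraction factor matches 1 − S with S = Jy/J̄z:

```python
# Toy model (assumed): Z ~ N(theta, 1), Y = Z + noise, noise ~ N(0, sigma2).
sigma2 = 4.0
y = 1.7                      # observed outcome; the ML estimate is theta = y
S = 1.0 / (1.0 + sigma2)     # speed: J_y / J_z-bar = [1/(1+sigma2)] / 1

theta, errors = 10.0, []
for _ in range(20):
    # E-step and M-step collapse into the posterior-mean update.
    theta = (y + sigma2 * theta) / (1.0 + sigma2)
    errors.append(abs(theta - y))

rate = errors[1] / errors[0]   # observed linear contraction factor
# rate equals 1 - S = sigma2 / (1 + sigma2): the larger the noise
# (the more information is missing), the slower EM converges.
```

With σ² = 4 the contraction factor is 0.8 per iteration; shrinking σ² drives S toward one and the convergence toward a single step.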
The speed matrix is thus given by:

  S = J̄z(θ̂)⁻¹ Jy(θ̂)    with    J̄z(θ̂) ≡ E[Jz(θ̂) | y, θ̂]    (8)

We easily check that J̄z(θ̂) = Jy(θ̂) + Fz|y(θ̂), where Fz|y(θ̂) is the Fisher information matrix corresponding to the posterior pdf p(z|y, θ̂), which is always symmetric positive. Therefore, we have the alternative expression:

  S = [Jy(θ̂) + Fz|y(θ̂)]⁻¹ Jy(θ̂)

For fast convergence, we want S close to identity, so we had better have the posterior Fisher matrix as "small" as possible. To interpret this result, let's imagine that Z is drawn from p(z|y, θ̂), which is not exactly true since θ̂ may be at least slightly different from the actual parameter value. The Fisher information matrix represents the average information that the complete data contains about θ̂ conditional on the observed data. In this context, Fz|y(θ̂) is a measure of missing information, and I − S is the fraction of missing data. The conclusion is that the rate of convergence of EM is governed by the fraction of missing data.

2.6. EM as a proximal point algorithm

Chrétien & Hero [7] note that EM may also be interpreted as a proximal point algorithm, i.e. an iterative scheme of the form:

  θn+1 = arg maxθ [L(θ) − λn Ψ(θ, θn)],    (9)

where Ψ is some positive regularization function and λn is a sequence of positive numbers. Let us see where this result comes from. In section 2.4, we established the fundamental log-likelihood decomposition underlying EM, L(θ) = Q(θ|θ′) + R(θ|θ′), and related the variation of R(θ|θ′) to a Kullback distance (7).
Thus, for some current estimate θn, we can write:

  Q(θ|θn) = L(θ) − D(qθn ‖ qθ) − R(θn|θn),

where qθ(z) ≡ p(z|y, θ) and qθn(z) ≡ p(z|y, θn) are the posterior pdfs of the complete data, under θ and θn, respectively. From this equation, it becomes clear that maximizing Q(θ|θn) is equivalent to an update rule of the form (9) with:

  Ψ(θ, θn) = D(qθn ‖ qθ),    λn ≡ 1

The proximal interpretation of EM is very useful to derive general convergence results [7]. In particular, the convergence rate may be superlinear if the sequence λn is chosen so as to converge towards zero. Unfortunately, such generalizations are usually intractable because the objective function may no longer simplify as soon as λn ≠ 1.

2.7. EM as maximization-maximization

Another powerful way of conceptualizing EM is to reinterpret the E-step as another maximization. This idea, which was formalized only recently by Neal & Hinton [26], appears as a breakthrough in the general understanding of EM-type procedures. Let us consider the following function:

  L(θ, q) ≡ Eq[log p(Z, y|θ)] + H(q)    (10)

where q(z) is some pdf (yes, any pdf), and H(q) is its entropy [9], i.e. H(q) ≡ −∫ log q(z) q(z) dz. We easily obtain an equivalent expression that involves a Kullback-Leibler distance:

  L(θ, q) = L(θ) − D(q ‖ qθ),

where we still define qθ(z) ≡ p(z|y, θ) for notational convenience. The last equation immediately reminds us of the proximal interpretation of EM, briefly discussed in section 2.6. The main difference here is that we don't impose q(z) = p(z|y, θ) for some θ: the equality holds for any distribution! Assume we have an initial guess θn and try to find q that maximizes L(θn, q).
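A quick numerical illustration of the function L(θ, q) (my own sketch, with an assumed one-observation two-component mixture model): scanning over candidate distributions q shows that L(θ, q) never exceeds the log-likelihood L(θ), and touches it exactly at the posterior p(z|y, θ):

```python
import numpy as np

# Assumed toy model: one observation y from a two-component,
# unit-variance Gaussian mixture with equal mixing proportions.
mu, pi, y = np.array([-1.0, 2.0]), np.array([0.5, 0.5]), 0.4

joint = pi * np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2 * np.pi)  # p(z, y | theta)
L_theta = np.log(joint.sum())          # incomplete-data log-likelihood L(theta)
post = joint / joint.sum()             # posterior q_theta(z) = p(z | y, theta)

def free_energy(q0):
    """L(theta, q) = E_q[log p(Z, y | theta)] + H(q) for q = (q0, 1 - q0)."""
    q = np.array([q0, 1.0 - q0])
    return float(np.sum(q * np.log(joint)) - np.sum(q * np.log(q)))

qs = np.linspace(0.01, 0.99, 99)
vals = [free_energy(q0) for q0 in qs]
best = qs[int(np.argmax(vals))]        # grid maximizer; should sit near post[0]
```

The scan confirms the variational picture: max over q of L(θ, q) equals L(θ), attained when q is the posterior, which is exactly what the E-step computes.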
From the above-discussed properties of the Kullback-Leibler distance, the answer is q(z) = qθn(z). Now, substitute qθn in (10), and maximize over θ: this is the same as performing a standard M-step⁴! Hence, the conventional EM algorithm boils down to an alternate maximization of L(θ, q) over a search space Θ × Q, where Q is a suitable set of pdfs, i.e. Q must include all pdfs from the set {q(z) = p(z|y, θ), θ ∈ Θ}. It is easy to check that any global maximizer (θ̂, q̂) of L(θ, q) is such that θ̂ is also a global maximizer of L(θ). By the way, this is also true for local maximizers under weak assumptions [26].

The key observation of Neal & Hinton is that the alternate scheme underlying EM may be replaced with other maximization strategies without hampering the simplicity of EM. In the conventional EM setting, the auxiliary pdf qn(z) is always constrained to a specific form. This is to say that EM authorizes only specific pathways in the expanded search space Θ × Q, yielding some kind of "labyrinth" motion. Easier techniques to find one's way in a labyrinth include breaking the walls or escaping through the roof. Similarly, one may consider relaxing the maximization constraint in the E-step. This leads for instance to incremental and sparse EM variants (see section 3).

⁴ To see that, just remember that p(z, y|θ) = p(z|θ) p(y|z), where p(y|z) is independent from θ due to the Markov property (1).

3. Deterministic EM variants

We first present deterministic EM variants as opposed to stochastic variants. Most of these deterministic variants attempt at speeding up the algorithm, either by simplifying computations or by increasing the rate of convergence (see section 2.5).

3.1. CEM

Classification EM [6].
The whole EM story is about introducing a latent variable Z and performing some inference about its posterior pdf. We might wonder: why not simply estimate Z? This is actually the idea underlying the CEM algorithm, which is a simple alternate maximization of the functional p(z, y|θ) with respect to both θ and z. Given a current parameter estimate θn, this leads to:

• Classification step: find zn = arg maxz p(z|y, θn).

• Maximization step: find θn+1 = arg maxθ p(zn|θ).

Notice that a special instance of CEM is the well-known k-means algorithm. In practice, CEM has several advantages over EM, like being easier to implement and typically faster to converge. However, CEM doesn't maximize the incomplete-data likelihood and, therefore, the monotonicity property of EM is lost. While CEM estimates the complete data explicitly, EM estimates only sufficient statistics for the complete data. In this regard, EM may be understood as a fuzzy classifier that avoids the statistical efficiency problems inherent to the CEM approach. Yet, CEM is often useful in practice.

3.2. Aitken's acceleration

An early EM extension [12, 21, 22]. Aitken's acceleration is a general-purpose technique to speed up the convergence of a fixed-point recursion with asymptotic linear behavior. Section 2.5 established that, under appropriate smoothness assumptions, EM may be approximated by a recursion of the form θn+1 ≈ S θ̂ + (I − S) θn, where θ̂ is the unknown limit and S is the speed matrix given by (8), which depends on this limit. Aitken's acceleration stems from the remark that, if S were known, then the limit could be computed explicitly in a single iteration, namely:

  θ̂ ≈ θ0 + S⁻¹ (θ1 − θ0)

for some starting value θ0.
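On a scalar toy map that is exactly linear (my own assumed example, reusing Φ(θ) = (y + σ²θ)/(1+σ²), whose fixed point is θ̂ = y and whose speed is S = 1/(1+σ²)), one accelerated step lands on the limit exactly:

```python
# Assumed toy EM map (exactly linear): Phi(theta) = (y + sigma2*theta)/(1+sigma2).
sigma2, y = 4.0, 1.7
phi = lambda t: (y + sigma2 * t) / (1.0 + sigma2)

theta0 = 10.0                 # poor starting value
theta1 = phi(theta0)          # one ordinary EM iteration
S = 1.0 / (1.0 + sigma2)      # speed of this map (known here, unknown in general)

theta_acc = theta0 + (theta1 - theta0) / S   # Aitken's one-step extrapolation
# theta_acc recovers the fixed point theta-hat = y in a single step,
# whereas plain EM would need many iterations at contraction factor 1 - S.
```

In practice S is unknown and the EM map is only asymptotically linear, which is why practical schemes must re-estimate the speed matrix along the way.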
Despite the fact that S is unknown and the sequence is not strictly linear, we are tempted to consider the following modified EM scheme. Given a current parameter estimate θn:

• E-step: compute Q(θ|θn) and approximate the inverse speed matrix: Sn⁻¹ = Jy(θn)⁻¹ J̄z(θn).

• M-step: unchanged, get an intermediate value θ* = arg maxθ Q(θ|θn).

• Acceleration step: update the parameter using θn+1 = θn + Sn⁻¹ (θ* − θn).

It turns out that this scheme is nothing but the Newton-Raphson method to find a zero of θ ↦ Φ(θ) − θ, where Φ is the map defined by the EM sequence, i.e. Φ(θ′) = arg maxθ Q(θ|θ′). Since standard EM sets Sn = I on each iteration, it may be viewed as a first-order approach to the same zero-crossing problem, hence avoiding the expense of computing Sn. Besides this important implementational issue, convergence is problematic with Aitken's acceleration, as the monotonicity property of EM is generally lost.

3.3. AEM

Accelerated EM [15]. To trade off between EM and its Aitken-accelerated version (see section 3.2), Jamshidian and Jennrich propose a conjugate gradient approach. Don't be misled: this is not a traditional gradient-based method (otherwise there would be no point in discussing it in this report); no gradient computation is actually involved here. The "gradient" is the function Φ(θ) − θ, which may be viewed as a generalized gradient for the incomplete-data log-likelihood, hence justifying the use of the generalized conjugate gradient method (see e.g. [27]). Compared to Aitken-accelerated EM, the resulting AEM algorithm doesn't require computing the speed matrix.
Instead, the parameter update rule is of the form:

  θn+1 = θn + λn dn,

where dn is a direction composed from the current direction Φ(θn) − θn and the previous directions (the essence of conjugate gradient), and λn is a step size typically computed from a line maximization of the complete-data likelihood (which may or may not be cumbersome). As an advantage of line maximizations, the monotonicity property of EM is safe. Also, from this generalized gradient perspective, it is straightforward to devise EM extensions that make use of other gradient descent techniques such as steepest descent or quasi-Newton methods [27].

3.4. ECM

Expectation Conditional Maximization [23]. This variant (not to be confused with CEM, see above) was introduced to cope with situations where the standard M-step is intractable. It is the first on a list of coordinate ascent-based EM extensions. In ECM, the M-step is replaced with a number of lower-dimensional maximization problems called CM-steps. This implies decomposing the parameter space as a sum of subspaces which, up to some possible reparameterization, is the same as splitting the parameter vector into several blocks, θ = (t1, t2, ..., ts). Starting from a current estimate θn, the CM-steps update one coordinate block after another by partially maximizing the auxiliary Q-function, yielding a scheme similar in essence to Powell's multidimensional optimization method [27]. This produces a sequence θn = θn,0 → θn,1 → θn,2 → ... → θn,s−1 → θn,s = θn+1, such that:

  Q(θn|θn) ≤ Q(θn,1|θn) ≤ Q(θn,2|θn) ≤ ... ≤ Q(θn,s−1|θn) ≤ Q(θn+1|θn)

Therefore, the auxiliary function is guaranteed to increase on each CM-step, hence globally in the M-step, and so does the incomplete-data likelihood. Hence, ECM is a special case of GEM (see section 2.3).

3.5. ECME

ECM Either [18]. This is an extension of ECM where some CM-steps are replaced with steps that maximize, or increase, the incomplete-data log-likelihood L(θ) rather than the auxiliary Q-function. To make sure that the likelihood function increases globally in the M-step, the only requirement is that the CM-steps that act on the actual log-likelihood be performed after the usual Q-maximizations. This is because increasing the Q-function only increases the likelihood from the starting point, namely θn, which is held fixed during the M-step (at least, this is what we assume)⁵. Starting with Q-maximizations is guaranteed to increase the likelihood and, of course, subsequent likelihood maximizations can only improve the situation. With the correct setting, ECME is even more general than GEM as defined in section 2.3, while inheriting its fundamental monotonicity property. An example application of ECME is in mixture models, where typically mixing proportions are updated using a one-step Newton-Raphson gradient descent on the incomplete-data likelihood, leading to a simple additive correction to the usual EM update rule [18]. At least in this case, ECME has proved to converge faster than standard EM.

3.6. SAGE

Space-Alternating Generalized EM [13, 14]. In the continuity of ECM and ECME (see sections 3.4 and 3.5), one can imagine defining an auxiliary function specific to each coordinate block of the parameter vector. More technically, using a block decomposition θ = (t1, t2, ..., ts), we assume that, for each block i = 1, ..., s, there exists a function Qi(θ|θ′) such that, for all θ and θ′ with identical block coordinates except (maybe) for the i-th block, we have:

  L(θ) − L(θ′) ≥ Qi(θ|θ′) − Qi(θ′|θ′)

This idea has two important implications.
First, the usual ECM scheme needs to be rephrased, because changing the auxiliary function across CM-steps may well result in decreasing the likelihood, a problem worked around in ECME with an appropriate ordering of the CM-steps. In this more general framework, though, there may be no such fix to save the day. In order to ensure monotonicity, at least some CM-steps should start with "reinitializing" their corresponding auxiliary function, which means... performing an E-step. It is important to realize that, because the auxiliary function is coordinate-specific, so is the E-step. Hence, each "CM-step" becomes an EM algorithm in itself, which is sometimes called a "cycle". We end up with a nested algorithm where cycles are embedded in a higher-level iterative scheme.

Furthermore, how to define the Qi's? From section 2, we know that the standard EM auxiliary function Q(θ|θ′) is built from the complete-data space Z; see in particular equation (5). Fessler & Hero introduce hidden-data spaces, a concept that generalizes complete-data spaces in the sense that hidden-data spaces may be coordinate-specific, i.e. there is a hidden variable Zi for each block ti. Formally, given a block decomposition θ = (t1, t2, ..., ts), Zi is a hidden-data space for ti if:

  p(zi, y|θ) = p(y|zi, {tj≠i}) p(zi|θ)

This definition's main feature is that the conditional probability of Y knowing Zi is allowed to depend on every parameter block but ti. Let us check that the resulting auxiliary function fulfils the monotonicity condition. We define:

  Qi(θ|θ′) ≡ E[log p(Zi|θ) | y, θ′]

Then, applying Jensen's inequality (3) as in section 2.2, we get:

  L(θ) − L(θ′) ≥ Qi(θ|θ′) − Qi(θ′|θ′) + ∫ log [p(y|zi, θ) / p(y|zi, θ′)] p(zi|y, θ′) dzi

When θ and θ′ differ only by ti, the integral vanishes because the conditional pdf p(y|zi, θ) is independent from ti by the above definition. Consequently, maximizing Qi(θ|θ′) with respect to ti only (the other parameters being held fixed) is guaranteed to increase the incomplete-data likelihood. Specific applications of SAGE to the Poisson imaging model or penalized least-squares regression were reported to converge much faster than standard EM.

⁵ For example, if one chooses θ* such that L(θ*) ≥ L(θn) and, then, θn+1 such that Q(θn+1|θn) ≥ Q(θ*|θn), the only conclusion is that the likelihood increases from θn to θ*, but it may actually decrease from θ* to θn+1 because θ* is not the starting point of Q. Permuting the L-maximization and the Q-maximization, we have Q(θ*|θn) ≥ Q(θn|θn), thus L(θ*) ≥ L(θn), and therefore L(θn+1) ≥ L(θn) since we have assumed L(θn+1) ≥ L(θ*). This argument generalizes easily to any intermediate sequence using the same cascade inequalities as in the derivation of ECM (see section 3.4).

3.7. CEMM

Component-wise EM for Mixtures [4]. Celeux et al extend the SAGE methodology to the case of constrained likelihood maximization, which arises typically in mixture problems where the sum of mixing proportions should equate to one. Using a Lagrangian dualization approach, they recast the initial problem into unconstrained maximization by defining an appropriate penalized log-likelihood function. The CEMM algorithm they derive is a natural coordinatewise variant of EM whose convergence to a stationary point of the likelihood is established under mild regularity conditions.

3.8. AECM

Alternating ECM [24, 25].
In an attempt to summarize earlier contributions, Meng & van Dyk propose to cast a number of EM extensions into a unified framework, the so-called AECM algorithm. Essentially, AECM is a SAGE algorithm (itself a generalization of both ECM and ECME) that includes another data augmentation trick. The idea is to consider a family of complete-data spaces indexed by a working parameter α. More formally, we define a joint pdf q(z, y|θ, α) as depending on both θ and α, yet imposing the constraint that the corresponding marginal incomplete-data pdf be preserved:

  p(y|θ) = ∫ q(z, y|θ, α) dz,

and thus independent from α. In other words, α is identifiable only given the complete data. A simple way of achieving such data augmentation is to define Z = fθ,α(Z0), where Z0 is some reference complete-data space and fθ,α is a one-to-one mapping for any (θ, α). Interestingly, it can be seen that α has no effect if fθ,α is insensitive to θ. In AECM, α is tuned beforehand so as to minimize the fraction of missing data (8), thereby maximizing the algorithm's global speed. In general, however, this initial minimization cannot be performed exactly since the global speed may depend on the unknown maximum likelihood parameter.

3.9. PX-EM

Parameter-Expanded EM [19, 17]. Liu et al revisit the working parameter method suggested by Meng and van Dyk [24] (see section 3.8) from a slightly different angle. In their strategy, the joint pdf q(z, y|θ, α) is defined so as to meet the two following requirements. First, the baseline model is embedded in the expanded model in the sense that q(z, y|θ, α0) = p(z, y|θ) for some null value α0.
Second, and this is the main difference with AECM, the observed-data marginals are consistent up to a many-to-one reduction function r(θ, α):

  p(y|r(θ, α)) = ∫ q(z, y|θ, α) dz,    for all (θ, α).

From there, the trick is to "pretend" estimating α iteratively rather than pre-processing its value. The PX-EM algorithm is simply an EM on the expanded model with additional instructions after the M-step to apply the reduction function and reset α to its null value. Thus, given a current estimate θn, the E-step forms the auxiliary function corresponding to the expanded model from (θn, α0), which amounts to the standard E-step because α = α0. The M-step then provides (θ*, α*) such that q(y|θ*, α*) ≥ q(y|θn, α0), and the additional reduction step updates θn according to θn+1 = r(θ*, α*), implying p(y|θn+1) = q(y|θ*, α*). Because q(y|θn, α0) = p(y|θn) by construction of the expanded model, we conclude that p(y|θn+1) ≥ p(y|θn), which shows that PX-EM preserves the monotonicity property of EM.

In some way, PX-EM capitalizes on the fact that a large deviation between the estimate of α and its known value α0 is an indication that the parameter of interest θ is poorly estimated. Hence, PX-EM adjusts the M-step for this deviation via the reduction function. A variety of examples where PX-EM converges much faster than EM is reported in [19]. Possible variants of PX-EM include the coordinatewise extensions underlying SAGE.

3.10. Incremental EM

Following the maximization-maximization approach discussed in section 2.7, Neal & Hinton [26] address the common case where observations are i.i.d.
Then, we have p(y | θ) = ∏ᵢ p(yᵢ | θ) and, similarly, the global EM objective function (10) reduces to:

    L(θ, q) = ∑ᵢ { E_{qᵢ}[log p(Zᵢ, yᵢ | θ)] + H(qᵢ) },

where we can search for q under the factored form q(z) = ∏ᵢ qᵢ(zᵢ). Therefore, for a given θ, maximizing L(θ, q) wrt q is equivalent to maximizing the contribution of each data item wrt qᵢ, hence splitting the global maximization problem into a number of simpler maximizations. Incremental EM variants are justified from this remark, the general idea being to update θ by visiting the data items sequentially rather than through a global E-step. Neal & Hinton demonstrate an incremental EM variant for mixtures that converges twice as fast as standard EM.

3.11. Sparse EM

Another suggestion of Neal & Hinton [26] is to track the auxiliary distribution q(z) in a subspace of the original search space Q (at least for a certain number of iterations). This general strategy includes sparse EM variants where q is updated only at pre-defined plausible unobserved values. Alternatively, "winner-take-all" EM variants such as the CEM algorithm [6] (see section 3.1) may be seen in this light. Such procedures may have strong computational advantages but, in counterpart, are prone to estimation bias. In the maximization-maximization interpretation of EM, this comes as no surprise since these approaches "project" the estimate onto a reduced search space that may not contain the maximum likelihood solution.

4. Stochastic EM variants

While deterministic EM variants were mainly motivated by convergence speed considerations, stochastic variants are more concerned with other limitations of standard EM. One is that the EM auxiliary function (4) involves computing an integral that may not be tractable in some situations.
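To recall what is at stake, the auxiliary function maximized at each EM iteration is the conditional expectation of the complete-data log-likelihood (the notation below follows section 2 of this report, up to minor rewriting):

```latex
Q(\theta \mid \theta') \;=\; \mathrm{E}\!\left[\log p(Z \mid \theta) \,\middle|\, y, \theta'\right]
\;=\; \int p(z \mid y, \theta')\, \log p(z \mid \theta)\, dz .
```

Whenever the posterior p(z | y, θ′) has no convenient form, or the integral itself cannot be carried out analytically, this E-step becomes the bottleneck that the stochastic variants below work around.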
The idea is then to replace this tedious computation with a stochastic simulation. As a typical side effect of such an approach, the modified algorithm inherits a lesser tendency to get trapped in local maxima, yielding improved global convergence properties.

4.1. SEM

Stochastic EM [5]. As noted in section 2.4, the standard EM auxiliary function is the best estimate of the complete-data log-likelihood in the sense of the conditional mean squared error. The idea underlying SEM, like other stochastic EM variants, is that there might be no need to ask for such a "good" estimate. Therefore, SEM replaces the standard auxiliary function with:

    Q̂(θ | θ′) = log p(z′ | θ),

where z′ is a random sample drawn from the posterior distribution of the unobserved variable⁶, p(z | y, θ′). This leads to the following modified iteration; given a current estimate θₙ:

• Simulation step: compute p(z | y, θₙ) and draw an unobserved sample zₙ from p(z | y, θₙ).
• Maximization step: find θₙ₊₁ = arg max_θ p(zₙ | θ).

By construction, the resulting sequence θₙ is a homogeneous Markov chain⁷ which, under mild regularity conditions, converges to a stationary pdf. This means in particular that θₙ does not converge to a unique value! Various schemes can be used to derive a pointwise limit, such as averaging the estimates over iterations once stationarity has been reached (see also SAEM regarding this issue). It was established in some specific cases that the stationary pdf concentrates around the likelihood maximizer with a variance inversely proportional to the sample size. However, in cases where several local maximizers exist, one may expect a multimodal behavior.

4.2. DA

Data Augmentation algorithm [28].
Going further into the world of random samples, one may consider replacing the M-step in SEM with yet another random draw. In a Bayesian context, maximizing p(zₙ | θ) wrt θ may be thought of as computing the mode of the posterior distribution p(θ | zₙ), given by:

    p(θ | zₙ) = p(zₙ | θ) p(θ) / ∫ p(zₙ | θ′) p(θ′) dθ′,

where we can assume a flat (or non-informative) prior distribution for θ. In DA, this maximization is replaced with a random draw θₙ₊₁ ∼ p(θ | zₙ). From equation (1), we easily check that p(θ | zₙ) = p(θ | zₙ, y). Therefore, DA alternates conditional draws zₙ | (θₙ, y) and θₙ₊₁ | (zₙ, y), which is the very principle of a Gibbs sampler. Results from Gibbs sampling theory apply, and it is shown under general conditions that the sequence θₙ is a Markov chain that converges in distribution towards p(θ | y). Once the sequence has reached stationarity, averaging θₙ over iterations yields a random variable that converges to the conditional mean E(θ | y), which is an estimator of θ generally different from the maximum likelihood, but not necessarily worse. Interestingly enough, several variants of DA have been proposed recently [20, 17] following the parameter expansion strategy underlying the PX-EM algorithm described in section 3.9.

⁶ Notice that when Z is defined as Z = (X, Y), this simulation reduces to a random draw of the missing data X.
⁷ The draws need to be mutually independent conditional on (θ₁, θ₂, ..., θₙ), i.e. p(z₁, z₂, ..., zₙ | θ₁, θ₂, ..., θₙ) = ∏ᵢ p(zᵢ | θᵢ).

4.3. SAEM

Stochastic Approximation type EM [3]. The SAEM algorithm is a simple hybridization of EM and SEM that provides pointwise convergence, as opposed to the erratic behavior of SEM.
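To make the SEM iteration of section 4.1 concrete, here is a minimal sketch on a toy problem. Everything in it is an illustrative assumption rather than material from this report: a two-component Gaussian mixture with known unit variances and equal weights, where only the two means are estimated. Drawing m > 1 label samples per iteration and averaging them would turn the same loop into the MCEM scheme of section 4.4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two-component Gaussian mixture, unit variances, equal weights.
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])

def sem(y, n_iter=200, rng=rng):
    """SEM for the component means; weights and variances are held fixed."""
    mu = np.array([-0.5, 0.5])  # crude initial estimate
    trace = []
    for _ in range(n_iter):
        # Simulation step: posterior label probabilities, then one random draw.
        logp = -0.5 * (y[:, None] - mu[None, :]) ** 2
        w = np.exp(logp - logp.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        z = rng.random(y.size) < w[:, 1]  # z[i] = 1 with probability w[i, 1]
        # Maximization step: exact complete-data ML given the drawn labels.
        mu = np.array([y[~z].mean(), y[z].mean()])
        trace.append(mu)
    return np.array(trace)

trace = sem(y)
# theta_n does not converge pointwise; average the chain past a burn-in instead.
mu_hat = trace[100:].mean(axis=0)
```

As predicted by the theory above, the sequence keeps fluctuating at stationarity, so a pointwise estimate is obtained here by averaging the tail of the chain, one of the schemes mentioned in section 4.1.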
Given a current estimate θₙ, SAEM performs a standard EM iteration in addition to the SEM iteration. The parameter is then updated as a weighted mean of both contributions, yielding:

    θₙ₊₁ = (1 − γₙ₊₁) θₙ₊₁^EM + γₙ₊₁ θₙ₊₁^SEM,

where 0 ≤ γₙ ≤ 1. Of course, to apply SAEM, the standard EM needs to be tractable. The sequence γₙ is typically chosen so as to decrease from 1 to 0, in such a way that the algorithm is equivalent to SEM in the early iterations and then becomes more similar to EM. It is established that SAEM converges almost surely towards a local likelihood maximizer (thus avoiding saddle points) under the assumption that γₙ decreases to 0 with lim_{n→∞} (γₙ/γₙ₊₁) = 1 and ∑ₙ γₙ = ∞.

4.4. MCEM

Monte Carlo EM [30]. At least formally, MCEM turns out to be a generalization of SEM. In the SEM simulation step, draw m independent samples zₙ^(1), zₙ^(2), ..., zₙ^(m) instead of just one, and then maximize the following function:

    Q̂(θ | θₙ) = (1/m) ∑_{j=1}^{m} log p(zₙ^(j) | θ),

which, in general, converges almost surely to the standard EM auxiliary function thanks to the law of large numbers. Choosing a large value for m justifies the "Monte Carlo" label. In this case, Q̂ may be seen as an empirical approximation of the standard EM auxiliary function, and the algorithm is expected to behave similarly to EM. On the other hand, choosing a small value for m is not forbidden, if not advised (in particular, for computational reasons). We notice that, for m = 1, MCEM reduces to SEM. A possible strategy consists of progressively increasing the parameter m, yielding a "simulated annealing" MCEM which is close in spirit to SAEM.

4.5. SAEM2

Stochastic Approximation EM [11].
Delyon et al propose a generalization of MCEM called SAEM, not to be confused with the earlier SAEM algorithm presented in section 4.3, although both algorithms promote a similar simulated annealing philosophy. In this version, the auxiliary function is defined recursively by averaging a Monte Carlo approximation with the auxiliary function computed in the previous step:

    Q̂ₙ(θ) = (1 − γₙ) Q̂ₙ₋₁(θ) + (γₙ/mₙ) ∑_{j=1}^{mₙ} log p(zₙ^(j) | θ),

where zₙ^(1), zₙ^(2), ..., zₙ^(mₙ) are drawn independently from p(z | y, θₙ). The weights γₙ are typically decreased across iterations in such a way that Q̂ₙ(θ) eventually stabilizes at some point. One may either increase the number of random draws mₙ, or set a constant value mₙ ≡ 1 when simulations have a heavy computational cost compared to the maximization step. The convergence of SAEM2 towards a local likelihood maximizer is proved in [11] under quite general conditions.

Kuhn et al [16] further extend the technique to make it possible to perform the simulation under a distribution Π_{θₙ}(z) simpler to deal with than the posterior pdf p(z | y, θₙ). Such a distribution may be defined as the transition probability of a Markov chain generated by a Metropolis-Hastings algorithm. If Π_θ(z) is such that its associated Markov chain converges to p(z | y, θ), then the convergence properties of SAEM2 generalize under mild additional assumptions.

5. Conclusion

This report's primary goal is to give a flavor of the current state of the art in EM-type statistical learning procedures. We also hope it will help researchers and developers in finding the literature relevant to their current existential questions. For a more comprehensive overview, we advise some good tutorials that are found on the internet [8, 2, 1, 10, 29, 17].

A. Appendix
A.1. Maximum likelihood quickie

Let Y be a random variable with pdf p(y | θ), where θ is an unknown parameter vector. Given an outcome y of Y, maximum likelihood estimation consists of finding the value of θ that maximizes the probability p(y | θ) over a given search space Θ. In this context, p(y | θ) is seen as a function of θ and called the likelihood function. Since it is often more convenient to manipulate the logarithm of this expression, we will focus on the equivalent problem of maximizing the log-likelihood function:

    θ̂(y) = arg max_{θ∈Θ} L(y, θ),

where the log-likelihood L(y, θ) ≡ log p(y | θ) is denoted L(y, θ) to emphasize the dependence on y, contrary to the notation L(θ) usually employed throughout this report. Whenever the log-likelihood is differentiable wrt θ, we also define the score function as the log-likelihood gradient:

    S(y, θ) = ∂L/∂θ (y, θ).

In this case, a fundamental result is that, for any vector U(y, θ), we have:

    E(S Uᵗ) = ∂/∂θ E(Uᵗ) − E(∂Uᵗ/∂θ),

where the expectation is taken wrt the distribution p(y | θ). This equality is easily obtained from the logarithm differentiation formula and some additional manipulations. Assigning the "true" value of θ in this expression leads to the following:

• E(S) = 0.
• Cov(S, S) = −E(∂Sᵗ/∂θ) (Fisher information).
• If U(y) is an unbiased estimator of θ, then Cov(S, U) = Id.
• In the case of a single parameter, the above results imply Var(U) ≥ 1/Var(S) from the Cauchy-Schwarz inequality, i.e. the inverse Fisher information is a lower bound for the variance of U. Equality occurs iff U is an affine function of S, which imposes a specific form on p(y | θ) (Darmois theorem).
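The first two identities are easy to check numerically. For the model Y ∼ N(θ, 1), the score is S(y, θ) = y − θ and the Fisher information equals 1. The quick Monte Carlo check below is purely illustrative (the model choice and variable names are assumptions, not taken from this report):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.3                        # "true" parameter value
y = rng.normal(theta, 1.0, 1_000_000)

score = y - theta                  # S(y, theta) for the N(theta, 1) model
print(score.mean())                # close to 0: E(S) = 0
print(score.var())                 # close to 1: Var(S) = Fisher information
```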
A.2. Jensen's inequality

For any random variable X and any real continuous concave function f, we have:

    f(E(X)) ≥ E(f(X)).

If f is strictly concave, equality occurs iff X is deterministic.

References

[1] A. Berger. Convexity, Maximum Likelihood and All That. Tutorial published on the web, 1998.
[2] J. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021, University of Berkeley, Apr. 1998.
[3] G. Celeux, D. Chauveau, and J. Diebolt. On Stochastic Versions of the EM Algorithm. Technical Report 2514, INRIA, Mar. 1995.
[4] G. Celeux, S. Chrétien, F. Forbes, and A. Mkhadri. A Component-wise EM Algorithm for Mixtures. Journal of Computational and Graphical Statistics, 10:699–712, 2001.
[5] G. Celeux and J. Diebolt. The SEM Algorithm: a Probabilistic Teacher Algorithm Derived from the EM Algorithm for the Mixture Problem. Computational Statistics Quarterly, 2:73–82, 1985.
[6] G. Celeux and G. Govaert. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14:315–332, 1992.
[7] S. Chrétien and A. O. Hero. Kullback Proximal Algorithms for Maximum Likelihood Estimation. IEEE Transactions on Information Theory, 46(5):1800–1810, 2000.
[8] C. Couvreur. The EM Algorithm: A Guided Tour. In Proc. 2nd IEEE European Workshop on Computationally Intensive Methods in Control and Signal Processing, Prague, Czech Republic, Aug. 1996.
[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 1991.
[10] F. Dellaert. The Expectation Maximization Algorithm. Tutorial published on the web, 2002.
[11] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm.
Annals of Statistics, 27(1):94–128, 1999.
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39:1–38, 1977.
[13] J. A. Fessler and A. O. Hero. Space-Alternating Generalized Expectation-Maximization Algorithm. IEEE Transactions on Signal Processing, 42(10):2664–2677, 1994.
[14] J. A. Fessler and A. O. Hero. Penalized Maximum-Likelihood Image Reconstruction Using Space-Alternating Generalized EM Algorithms. IEEE Transactions on Image Processing, 4(10):1417–1429, 1995.
[15] M. Jamshidian and R. I. Jennrich. Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association, 88:221–228, 1993.
[16] E. Kuhn and M. Lavielle. Coupling a stochastic approximation version of EM with a MCMC procedure. Submitted, 2002.
[17] C. Liu. An Example of Algorithm Mining: Covariance Adjustment to Accelerate EM and Gibbs. To appear in Development of Modern Statistics and Related Topics, 2003.
[18] C. Liu and D. B. Rubin. The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika, 81:633–648, 1994.
[19] C. Liu, D. B. Rubin, and Y. N. Wu. Parameter Expansion to Accelerate EM: The PX-EM Algorithm. Biometrika, 85:755–770, 1998.
[20] J. S. Liu and Y. N. Wu. Parameter Expansion for Data Augmentation. Journal of the American Statistical Association, 94:1264–1274, 1999.
[21] T. A. Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B, 44:226–233, 1982.
[22] I. Meilijson. A fast improvement of the EM algorithm on its own terms. Journal of the Royal Statistical Society: Series B, 51:127–138, 1989.
[23] X. L. Meng and D. B. Rubin. Maximum likelihood via the ECM algorithm: a general framework. Biometrika, 80:267–278, 1993.
[24] X. L. Meng and D. A. van Dyk. The EM Algorithm - An Old Folk Song Sung To A Fast New Tune. Journal of the Royal Statistical Society: Series B, 59:511–567, 1997.
[25] X. L. Meng and D. A. van Dyk. Seeking efficient data augmentation schemes via conditional and marginal data augmentation. Biometrika, 86(2):301–320, 1999.
[26] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998.
[27] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. Cambridge University Press, 2nd edition, 1992.
[28] M. A. Tanner and W. H. Wong. The Calculation of Posterior Distributions by Data Augmentation. Journal of the American Statistical Association, 82:528–550, 1987.
[29] D. A. van Dyk and X. L. Meng. Algorithms based on data augmentation: A graphical representation and comparison. In K. Berk and M. Pourahmadi, editors, Computing Science and Statistics: Proceedings of the 31st Symposium on the Interface, pages 230–239. Interface Foundation of North America, 2000.
[30] G. Wei and M. A. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithm. Journal of the American Statistical Association, 85:699–704, 1990.
