Stochastic Algorithm For Parameter Estimation For Dense Deformable Template Mixture Model

Estimating probabilistic deformable template models is a new approach in the fields of computer vision and probabilistic atlases in computational anatomy. A first coherent statistical framework modelling the variability as a hidden random variable has been given by Allassonnière, Amit and Trouvé. They introduce a Bayesian approach, together with a mixture of deformable template models, to estimate the templates and the geometric variability of a population of images. We propose a stochastic "SAEM-like" algorithm to approximate the MAP estimator in this setting, prove its convergence towards a critical point of the penalised likelihood of the observations, and illustrate it on handwritten digit images and medical images.

Authors: Stéphanie Allassonnière (CMAP), Estelle Kuhn (LAGA)

STOCHASTIC ALGORITHM FOR BAYESIAN MIXTURE EFFECT TEMPLATE ESTIMATION

S. ALLASSONNIÈRE¹, E. KUHN²

¹ CMAP - École Polytechnique, Route de Saclay, 91128 Palaiseau, France
² LAGA - Université Paris 13, 99 Av. J.-B. Clément, 93430 Villetaneuse, France

Abstract. The estimation of probabilistic deformable template models in computer vision, or of probabilistic atlases in Computational Anatomy, is a core issue in both fields. A first coherent statistical framework, in which the geometrical variability is modelled as a hidden random variable, was given by Allassonnière, Amit and Trouvé in [1]. They introduced a Bayesian approach, together with a mixture of such models, to estimate deformable template models. A consistent stochastic algorithm was introduced in [2] to address the problem, encountered in [1], of the convergence of the estimation algorithm for the one-component model in the presence of noise. We propose here to pursue this direction and use an "SAEM-like" algorithm to approximate the MAP estimator in the general Bayesian setting of a mixture of deformable template models. We also prove the convergence of our algorithm towards a critical point of the penalised likelihood of the observations, and illustrate this with handwritten digit images and medical images.

Keywords: stochastic approximations, non-rigid deformable templates, shape statistics, MAP estimation, Bayesian method, mixture models.

AMS: 60J22, 62F10, 62F15, 62M40.

1. Introduction. The issue of representing and analysing geometrical structures upon which deformations can act is a challenging question in applied mathematics as well as in Computational Anatomy. One central point is the modelling of varying objects, and the quantification of this variability with respect to one or several reference models, which will be called templates. This is known as "Deformable Templates" [12]. To the best of our knowledge, the problem of constructing probabilistic models of variable shapes, in order to statistically quantify this variability, has not been successfully addressed yet, in spite of its importance. For example, modelling the anatomical variability of organs around an ideal shape is of crucial interest in the medical domain, in order to find characteristic differences between populations (pathological and control), or to exhibit pathological kinds of deformations or shapes of an organ.

Many solutions have been proposed for the problem of template definition. They range from generalised Procrustes means, with a variational [11] or statistical [10] point of view, to statistical models like Active Appearance Models [6] or Minimum Description Length methods [16]. Unfortunately, all these methods focus only on the template, whereas the geometrical variability is computed afterwards (using PCA). This contradicts the fact that a metric is required to compute the template through the computation of deformations. Moreover, they do not really differ from the variational point of view, since they consider the deformations as nuisance parameters which have to be estimated, and not as unobserved random variables. The main goal of this paper is to propose a coherent estimation of both the photometric model and the geometrical distribution in a given population.
Another issue addressed here is the clustering problem. Given a set of images, the statistical estimation of the component weights and of the image labels is usually supervised; at least, the number of components is fixed. The templates of each component and the labels are estimated iteratively (for example in methods like K-means), but the geometry, and related to it the metric used to compute the distances between elements, is still fixed. Moreover the label, which is not observed, is, like the deformations, considered as a parameter and not as a hidden random variable. These methods do not lead to a statistically coherent framework for the understanding of deformable template estimation, and the iterative algorithms derived from those approaches do not have a statistical interpretation as the parameter optimisation of a generative model describing the data.

In this paper we consider the statistical framework for dense deformable templates developed by Allassonnière, Amit and Trouvé in [1], generalised to the mixture model case for multicomponent estimation. Each image taken from a database is supposed to be generated from a noisy random deformation of a template image picked randomly among a given set of possible templates. All the templates are assumed to be drawn from a common prior distribution on the template image space. To obtain a generative model, each deformation and each image label have to be considered as hidden variables. The templates, the parameters of the deformation laws and the component weights are the parameters of interest. This generative model makes it possible to automatically decompose the database into components and, at the same time, to estimate the parameters corresponding to each component, while increasing the likelihood of the observations.

Given this parametric statistical Bayesian model, the parameter estimation was performed in [1] by Maximum A Posteriori (MAP). The authors carried out this estimation using a deterministic and iterative scheme based on the EM (Expectation-Maximisation) algorithm, where the posterior distribution is approximated by a Dirac measure on its mode. Unfortunately, this gives an algorithm whose convergence towards the MAP estimator cannot be proved. Moreover, as shown by the experiments in that paper, the convergence is lost in a noisy setting.

Our goal in this paper is to propose a stochastic iterative method to reach the MAP estimator, for which we are able to obtain a convergence result, as already done for the one-component case in [2]. We propose to use a stochastic version of the EM algorithm to reach the maximum of the posterior distribution: the Stochastic Approximation EM (SAEM) algorithm introduced by Delyon et al. in [7], coupled with a Markov Chain Monte Carlo (MCMC) method. This coupling was introduced by Kuhn and Lavielle in [14] in the case where the missing variables have compact support. Contrary to the one-component model, where the iterations of the SAEM algorithm can be coupled with the evolution of the Markov chain (cf. [2]), we show here that this coupling cannot be driven numerically, and we need to consider an alternative method. We propose to simulate the hidden variables using auxiliary Markov chains, one per component, to approach the posterior distribution.
We prove the convergence of our algorithm in a non-compact setting by adapting Delyon's theorem on general stochastic approximations and introducing truncation on random boundaries as in [5].

The paper is organised as follows: in Section 2 we first recall the observation mixture model proposed by Allassonnière, Amit and Trouvé in [1]. In Section 3, we describe the stochastic algorithm used in our particular setting. Section 4 is devoted to the convergence theorem. Experiments on 2D real data sets are presented in Section 5. The proofs of the convergence of the algorithm are postponed to Section 6, whereas conclusion and discussion are given in Section 7.

2. The observation model. We consider the multicomponent framework introduced in [1]. Given a sample of gray level images $(y_i)_{1 \le i \le n}$ observed on a grid of pixels $\{v_u \in D \subset \mathbb{R}^2, u \in \Lambda\}$, where $D$ is a continuous domain and $\Lambda$ the pixel network, we are looking for template images which explain the population. Each of these images is a real function $I_0 : \mathbb{R}^2 \to \mathbb{R}$ defined on the whole plane. An observation $y$ is supposed to be a discretisation on $\Lambda$ of a deformation of one of the templates plus an independent additive noise. This leads to assume the existence of an unobserved deformation field $z : \mathbb{R}^2 \to \mathbb{R}^2$ such that for $u \in \Lambda$:
$$y(u) = I_0(v_u - z(v_u)) + \epsilon(u),$$
where $\epsilon$ denotes an additive noise.

2.1. Models for templates and deformations. We use the same framework as chosen in [1] to describe both the templates $I_0$ and the deformation fields $z$. Our model takes into account two complementary sides: photometric, indexed by $p$, corresponding to the templates and the noise variances; and geometric, indexed by $g$, corresponding to the deformations. The templates $I_0$ and the deformations $z$ are assumed to belong to finite dimensional subspaces of two reproducing kernel Hilbert spaces $V_p$ and $V_g$ (determined by their respective kernels $K_p$ and $K_g$). We choose a representation of both of them by finite linear combinations of the kernels centred at some fixed landmark points in the domain $D$: $(v_{p,j})_{1 \le j \le k_p}$, respectively $(v_{g,j})_{1 \le j \le k_g}$. They are therefore parametrised by coefficients $\alpha \in \mathbb{R}^{k_p}$ and $\beta \in (\mathbb{R}^{k_g})^2$, which yield: for all $v \in D$,
$$I_\alpha(v) \triangleq (K_p \alpha)(v) \triangleq \sum_{j=1}^{k_p} K_p(v, v_{p,j})\, \alpha_j, \qquad z_\beta(v) \triangleq (K_g \beta)(v) \triangleq \sum_{j=1}^{k_g} K_g(v, v_{g,j})\, \beta_j.$$

2.2. Parametrical model. In this paper, we consider a mixture of deformable template models which allows a fixed number $\tau_m$ of components in each training set. This means that the data will be separated into (at most) $\tau_m$ different components by the algorithm. Therefore, for each observation $y_i$, we consider the pair $(\beta_i, \tau_i)$ of unobserved variables, which correspond respectively to the deformation field and to the label of image $i$. We denote below $\mathbf{y} \triangleq (y_1, \ldots, y_n)$, $\boldsymbol\beta \triangleq (\beta_1, \ldots, \beta_n)$ and $\boldsymbol\tau \triangleq (\tau_1, \ldots, \tau_n)$. The generative model is:
$$\begin{cases}
\boldsymbol\tau \sim \otimes_{i=1}^n \sum_{t=1}^{\tau_m} \rho_t \delta_t \;\big|\; (\rho_t)_{1 \le t \le \tau_m}, \\
\boldsymbol\beta \sim \otimes_{i=1}^n \mathcal{N}(0, \Gamma_{g,\tau_i}) \;\big|\; \boldsymbol\tau, (\Gamma_{g,t})_{1 \le t \le \tau_m}, \\
\mathbf{y} \sim \otimes_{i=1}^n \mathcal{N}(z_{\beta_i} I_{\alpha_{\tau_i}}, \sigma^2_{\tau_i} \mathrm{Id}_{|\Lambda|}) \;\big|\; \boldsymbol\beta, \boldsymbol\tau, (\alpha_t, \sigma^2_t)_{1 \le t \le \tau_m},
\end{cases} \tag{2.1}$$
where $z_\beta I_\alpha(u) = I_\alpha(v_u - z_\beta(v_u))$ is the action of the deformation on the template $I_\alpha$, for $u$ in $\Lambda$, and $\delta_t$ is the Dirac measure at $t$.
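To make the kernel parametrisation of Section 2.1 concrete, here is a minimal sketch, in Python, of how a template $I_\alpha$ and a deformation $z_\beta$ can be evaluated and combined as in the action $z_\beta I_\alpha$ of model (2.1). The Gaussian kernels match the choice made in the experiments of Section 5; all function and variable names (`gaussian_kernel`, `ctrl_p`, `ctrl_g`, ...) are ours, not the paper's.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Gram matrix K(x_m, c_k) of a Gaussian kernel: x is (m,2), centers is (k,2)."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))                 # shape (m, k)

def template(alpha, points, ctrl_p, sigma_p):
    """I_alpha(v) = sum_j K_p(v, v_{p,j}) alpha_j, evaluated at `points`."""
    return gaussian_kernel(points, ctrl_p, sigma_p) @ alpha

def deform_and_sample(alpha, beta, pixels, ctrl_p, ctrl_g, sigma_p, sigma_g):
    """z_beta I_alpha(u) = I_alpha(v_u - z_beta(v_u)) on the pixel grid.
    beta has shape (k_g, 2): one 2D displacement coefficient per control point."""
    z = gaussian_kernel(pixels, ctrl_g, sigma_g) @ beta   # (m, 2) deformation field
    return template(alpha, pixels - z, ctrl_p, sigma_p)   # dense I_alpha: no interpolation
```

Because $I_\alpha$ is defined on the whole plane, evaluating the deformed template requires no image interpolation, only a second kernel evaluation at the displaced pixel positions.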
The parameters of interest are the vectors $(\alpha_t)_{1 \le t \le \tau_m}$ coding the templates, the variances $(\sigma^2_t)_{1 \le t \le \tau_m}$ of the additive noise, the covariance matrices $(\Gamma_{g,t})_{1 \le t \le \tau_m}$ of the deformation fields, and the component weights $(\rho_t)_{1 \le t \le \tau_m}$. We denote by $(\theta_t, \rho_t)_{1 \le t \le \tau_m}$ the parameters, so that $\theta_t$ gathers the photometric part $(\alpha_t, \sigma^2_t)$ and the geometric part $\Gamma_{g,t}$ of component $t$. We assume that for all $1 \le t \le \tau_m$, the parameter $\theta_t = (\alpha_t, \sigma^2_t, \Gamma_{g,t})$ belongs to the open set $\Theta$ defined as
$$\Theta = \{(\alpha, \sigma^2, \Gamma_g) \mid \alpha \in \mathbb{R}^{k_p}, |\alpha| < R, \sigma > 0, \Gamma_g \in \mathrm{Sym}^{+,*}_{2k_g}(\mathbb{R})\},$$
where $R$ is an arbitrary positive constant and $\mathrm{Sym}^{+,*}_{2k_g}(\mathbb{R})$ is the set of strictly positive definite symmetric matrices. Concerning the weights $(\rho_t)_{1 \le t \le \tau_m}$, we assume that they belong to the set
$$\varrho = \Big\{ (\rho_t)_{1 \le t \le \tau_m} \in \,]0,1[^{\tau_m} \;\Big|\; \sum_{t=1}^{\tau_m} \rho_t = 1 \Big\}.$$

Remark 1. This yields a generative model: given the parameters of the model, to get a realisation of an image, we first draw a label $\tau$ with respect to the probability law $\sum_{t=1}^{\tau_m} \rho_t \delta_t$. Then we simulate a deformation field $\beta$ according to $\mathcal{N}(0, \Gamma_{g,\tau})$, using the covariance matrix corresponding to component $\tau$, and apply it to the template of the $\tau$-th component. Last, we add an independent Gaussian noise of variance $\sigma^2_\tau$.

We choose a normal distribution for the unobserved deformation variable because of the background we have in image analysis. Indeed, the registration problem is an issue that has been studied deeply for the past two decades. The goal is, given two images, to find the best deformation that will match one image closely onto the other. Such methods require choosing the kind of deformations that are allowed (smooth, diffeomorphic, etc.). For some of these methods, these conditions are equivalent to choosing a covariance matrix that defines an inner product between two deformations coded by a vector $\beta$ (cf. [18, 4]). The regularisation term of the matching energy in the small deformation framework treated in this paper can be written as $\beta^t \Gamma_g^{-1} \beta$. This is, up to constants, the logarithm of the density of a Gaussian distribution on $\beta$ with mean $0$ and covariance matrix $\Gamma_g$. The link between these two points of view was given in [1]: the mode of the posterior distribution equals the solution of a general matching problem. This is why we set such a distribution on the deformation vector $\beta$. Moreover, many experiments have been run using a large variety of such matrices, which gives us a good initial guess for this parameter. This leads us to consider a Bayesian approach with a weakly informative prior.

2.3. The Bayesian approach. The information given by the image analysis background is introduced here mathematically in terms of prior laws on the parameters of model (2.1). As already mentioned in the previous paragraph, this background knowledge enables us to determine a good initial guess for the laws and the values of the hyper-parameters. The same arguments as for the covariance matrix $\Gamma_g$ hold for the noise variance $\sigma^2$: from the registration viewpoint, this variance is the trade-off between the deformation cost and the data attachment term which compose the energy to minimise. A good empirical initial guess is therefore known as well.
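Remark 1 translates directly into a sampling routine. The following hedged sketch, reusing the helpers of the previous listing, draws one synthetic image from model (2.1); it is also how the synthetic examples of Section 5.1.4 can be produced.

```python
def sample_image(rho, alphas, Gammas, sigma2s, pixels, ctrl_p, ctrl_g,
                 sigma_p, sigma_g, rng=np.random.default_rng()):
    """One draw from the generative model (2.1), following Remark 1."""
    t = rng.choice(len(rho), p=rho)                    # label tau ~ sum_t rho_t delta_t
    k_g = len(ctrl_g)
    beta = rng.multivariate_normal(np.zeros(2 * k_g),  # beta ~ N(0, Gamma_{g,t})
                                   Gammas[t]).reshape(k_g, 2)
    mean = deform_and_sample(alphas[t], beta, pixels, ctrl_p, ctrl_g,
                             sigma_p, sigma_g)         # z_beta I_{alpha_t}
    y = mean + rng.normal(0.0, np.sqrt(sigma2s[t]), size=mean.shape)
    return y, t
```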
On the other hand, the high dimensionality of the parameters can lead to degenerate maximum likelihood estimators when the training sample is small. With prior distributions, the estimation from small samples remains possible. The importance of these prior distributions in the estimation problem has been shown in [1]: the solution of the estimation equation can be interpreted as a barycenter between the hyper-parameters of the priors and the empirical values. This ensures easy computations and other theoretical properties, for instance the invertibility of the covariance matrix $\Gamma_g$. The role of the other hyper-parameters is discussed in the experiments.

We use a generative model which includes natural standard conjugate prior distributions with fixed hyper-parameters. These distributions are inverse-Wishart priors on each $\Gamma_{g,t}$ and $\sigma^2_t$, and a normal prior on each $\alpha_t$, for all $1 \le t \le \tau_m$. All priors are assumed independent. Then
$$\begin{cases}
\nu_p(d\alpha, d\sigma^2) \propto \exp\Big(-\frac12 (\alpha - \mu_p)^t (\Sigma_p)^{-1} (\alpha - \mu_p)\Big) \Big(\exp\Big(-\frac{\sigma_0^2}{2\sigma^2}\Big) \frac{1}{\sqrt{\sigma^2}}\Big)^{a_p} d\sigma^2\, d\alpha, & a_p \ge 3, \\[4pt]
\nu_g(d\Gamma_g) \propto \Big(\exp\big(-\langle \Gamma_g^{-1}, \Sigma_g\rangle_F / 2\big) \frac{1}{\sqrt{|\Gamma_g|}}\Big)^{a_g} d\Gamma_g, & a_g \ge 4k_g + 1,
\end{cases}$$
where $\langle A, B\rangle_F \triangleq \mathrm{tr}(A^t B)$ is the scalar product of two matrices $A$ and $B$, and $\mathrm{tr}$ stands for the trace. For the prior law $\nu_\rho$, we choose the Dirichlet distribution $\mathcal{D}(a_\rho)$, with density $\nu_\rho(\rho) \propto \big(\prod_{t=1}^{\tau_m} \rho_t\big)^{a_\rho}$ and fixed parameter $a_\rho$.

3. Parameter estimation using a stochastic version of the EM algorithm. For the sake of simplicity, let us denote $N \triangleq 2nk_g$ and $\mathcal{T} \triangleq \{1, \ldots, \tau_m\}^n$, so that the missing deformation variables take their values in $\mathbb{R}^N$ and the missing labels in $\mathcal{T}$. We also introduce the following notation: $\eta = (\theta, \rho)$, with $\theta = (\theta_t)_{1 \le t \le \tau_m}$ and $\rho = (\rho_t)_{1 \le t \le \tau_m}$. In our Bayesian framework, we choose the MAP estimator to estimate the parameters:
$$\tilde\eta_n = \arg\max_\eta q_B(\eta \mid \mathbf{y}), \tag{3.1}$$
where $q_B(\eta \mid \mathbf{y})$ denotes the distribution of $\eta$ conditionally on $\mathbf{y}$.

Remark 2. Even though we are working in a Bayesian framework, we do not want to estimate the distributions of our parameters. Knowing the distribution of the template image and its possible deformations is not of great interest from an image analysis point of view. Indeed, people, in particular in the medical imaging community, are more interested in an atlas which characterises the populations of shapes they consider, rather than in its distribution. Moreover, the distribution of the deformation law makes even less sense. This is the reason why we focus on the MAP.

In practice, to reach this estimator, we maximise this posterior distribution using a Stochastic Approximation EM (SAEM) algorithm coupled with a Markov Chain Monte Carlo (MCMC) method. Indeed, since the E step of the EM algorithm introduced by [8] is intractable in this complex nonlinear setting, we follow a stochastic variant called SAEM, proposed in [7]. However, again due to the expression of our model, the simulation required by this algorithm cannot be performed directly. We therefore propose to use MCMC methods to perform this simulation, as proposed by Kuhn and Lavielle in [14] and done for the one-component model in [2].
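For illustration only, initial parameters can be drawn from the conjugate priors of Section 2.3. The mapping of the paper's exponents $(a_g, a_\rho)$ onto SciPy's inverse-Wishart (df, scale) and Dirichlet concentration conventions below is our approximation, not the paper's specification.

```python
from scipy.stats import invwishart, dirichlet

def sample_prior(mu_p, Sigma_p, Sigma_g, a_g, a_rho, tau_m,
                 rng=np.random.default_rng()):
    """Hedged sketch: one draw of (alpha_t), (Gamma_{g,t}), rho from the priors.
    The (df, scale) correspondence with the density nu_g is only indicative."""
    alphas = [rng.multivariate_normal(mu_p, Sigma_p) for _ in range(tau_m)]
    Gammas = [invwishart.rvs(df=a_g, scale=a_g * Sigma_g, random_state=rng)
              for _ in range(tau_m)]
    # density proportional to prod_t rho_t^{a_rho}  <=>  Dirichlet(a_rho + 1)
    rho = dirichlet.rvs([a_rho + 1] * tau_m, random_state=rng)[0]
    return alphas, Gammas, rho
```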
Unfortunately, the direct generalisation of the algorithm presented in [2] turns out to be of no use in practice, because of trapping state problems (cf. Subsection 3.2). This suggests going back to another extension of the SAEM procedure.

3.1. The SAEM algorithm using MCMC methods. Let us first recall the SAEM algorithm. It generates a sequence of estimated parameters $(\eta_k)_k$ which converges towards a critical point of $\eta \mapsto \log q(\mathbf{y}, \eta)$ under some mild assumptions (cf. [7]). These critical points coincide with the critical points of $\eta \mapsto \log q_B(\eta \mid \mathbf{y})$. The $k$-th iteration consists of three steps:

Simulation step: the missing data, here the deformation parameters and the labels $(\boldsymbol\beta, \boldsymbol\tau)$, are drawn from the distribution of $(\boldsymbol\beta, \boldsymbol\tau)$ conditionally on $\mathbf{y}$, denoted by $\pi_\eta$, using the current parameter $\eta_{k-1}$:
$$(\boldsymbol\beta_k, \boldsymbol\tau_k) \sim \pi_{\eta_{k-1}}, \tag{3.2}$$

Stochastic approximation step: given a decreasing sequence of positive step-sizes $(\Delta_k)_k$, a stochastic approximation of the quantity $\log q(\mathbf{y}, \boldsymbol\beta, \boldsymbol\tau, \eta)$ is computed, using the simulated values of the missing data:
$$Q_k(\eta) = Q_{k-1}(\eta) + \Delta_k \big[\log q(\mathbf{y}, \boldsymbol\beta_k, \boldsymbol\tau_k, \eta) - Q_{k-1}(\eta)\big], \tag{3.3}$$

Maximisation step: the parameters are updated in the M-step,
$$\eta_k = \arg\max_\eta Q_k(\eta). \tag{3.4}$$

Initial values $Q_0$ and $\eta_0$ are chosen arbitrarily.

We notice that the density function of the model proposed in paragraphs 2.2 and 2.3 belongs to the curved exponential family; that is to say, the complete likelihood can be written as
$$q(\mathbf{y}, \boldsymbol\beta, \boldsymbol\tau, \eta) = \exp\big[-\psi(\eta) + \langle S(\boldsymbol\beta, \boldsymbol\tau), \phi(\eta)\rangle\big],$$
where the sufficient statistic $S$ is a Borel function on $\mathbb{R}^N \times \mathcal{T}$ taking its values in an open subset $\mathcal{S}$ of $\mathbb{R}^m$, and $\psi$, $\phi$ are two Borel functions on $\Theta \times \varrho$. (Note that $S$, $\phi$ and $\psi$ may also depend on $\mathbf{y}$; since $\mathbf{y}$ stays fixed in the sequel, we omit this dependence.) Thanks to this property of our model, it is equivalent to perform the stochastic approximation on the complete log-likelihood or on the sufficient statistics. Equation (3.3) is thus replaced by the following stochastic approximation of the sufficient statistics $S$:
$$s_k = s_{k-1} + \Delta_k \big(S(\boldsymbol\beta_k, \boldsymbol\tau_k) - s_{k-1}\big). \tag{3.5}$$

We now introduce the function $L : \mathcal{S} \times \Theta \times \varrho \to \mathbb{R}$ given by $L(s; \eta) = -\psi(\eta) + \langle s, \phi(\eta)\rangle$. It has been proved in [1] that there exists a critical function $\hat\eta : \mathcal{S} \to \Theta \times \varrho$ which is a zero of $\nabla L$. It is straightforward to prove that this function satisfies: for all $\eta \in \Theta \times \varrho$ and all $s \in \mathcal{S}$, $L(s; \hat\eta(s)) \ge L(s; \eta)$, so that the maximisation step (3.4) becomes $\eta_k = \hat\eta(s_k)$.

Concerning the simulation step, in our model the simulation of the missing variables with respect to the conditional distribution $\pi_\eta$ cannot be carried out. Indeed, its probability density function (pdf) has a closed form but is rather complicated; it does not correspond to any usual pdf. One solution, proposed in [14] for such cases, is to couple the SAEM algorithm with an MCMC method. However, we do not fit exactly into their requirements, since the missing variable $\boldsymbol\beta$ does not have compact support. We introduce an ergodic Markov chain whose stationary distribution is the conditional distribution $\pi_\eta$; we denote its transition kernel by $\Pi_\eta$. The simulation step (3.2) is thus replaced by the following step:
$$(\boldsymbol\beta_k, \boldsymbol\tau_k) \sim \Pi_{\eta_{k-1}}\big((\boldsymbol\beta_{k-1}, \boldsymbol\tau_{k-1}), \cdot\big). \tag{3.6}$$
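The three steps above fit in a few lines. The skeleton below is a minimal sketch of the sufficient-statistics form of SAEM (equations (3.5)-(3.6)); `sample_missing`, `suff_stat` and `eta_hat` are placeholders for the MCMC transition of Section 3.2, the statistics $S$, and the closed-form maximiser $\hat\eta$.

```python
def saem(s0, eta_hat, sample_missing, suff_stat, deltas, n_iter):
    """SAEM skeleton: simulation (3.6), stochastic approximation (3.5),
    then the closed-form maximisation eta_k = eta_hat(s_k)."""
    s = s0
    eta = eta_hat(s)
    state = None                                    # current hidden (beta, tau)
    for k in range(1, n_iter + 1):
        state = sample_missing(state, eta)          # MCMC transition kernel Pi_eta
        s = s + deltas(k) * (suff_stat(state) - s)  # s_k = s_{k-1} + Delta_k (S - s_{k-1})
        eta = eta_hat(s)                            # maximisation step
    return eta
```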
The most common choice of kernel is an accept-reject step carried out through a Metropolis-Hastings algorithm. Unfortunately, in our particular setting we deal with large dimensions for the missing variables. This made us move to another kind of MCMC method, namely a Gibbs sampler, to simulate our missing variables.

3.2. The transition of the MCMC method: a hybrid Gibbs sampler. If we consider the full vector $(\boldsymbol\beta, \boldsymbol\tau)$ as a single vector of missing data, we can use a hybrid Gibbs sampler on $\mathbb{R}^N \times \mathcal{T}$ as follows. For any $b \in \mathbb{R}$ and $1 \le j \le N$, let us denote by $\boldsymbol\beta_{b \to j}$ the unique configuration which is equal to $\boldsymbol\beta$ everywhere except at coordinate $j$, where $\beta^j_{b \to j} = b$, and by $\boldsymbol\beta_{-j}$ the vector $\boldsymbol\beta$ without coordinate $j$. Each coordinate $\beta^j$ of the deformation field is updated using a Metropolis-Hastings step, where the proposal is given by the conditional distribution of $\beta^j \mid \boldsymbol\beta_{-j}, \boldsymbol\tau$ coming from the current Gaussian distribution with the corresponding parameters (pointed to by $\boldsymbol\tau$). Then the last coordinates, corresponding to the missing variable $\boldsymbol\tau$, are drawn with respect to $q(\boldsymbol\tau \mid \boldsymbol\beta, \mathbf{y}, \eta)$.

Even though this procedure provides an estimated parameter sequence which would theoretically converge towards the MAP estimator, in practice, as mentioned in [19], it would take a very long time to reach its limit because of the trapping state problem: when a small number of observations is assigned to a component, the estimate of the component parameters is highly concentrated, and the probability of changing the label of an image to this component, or from this component to another, is extremely small (most of the time below computer precision). We can interpret this from an image analysis viewpoint: the first iteration of the algorithm gives random labels to the training set and computes the corresponding maximiser $\eta = (\theta, \rho)$. Then, for each image, according to its current label, it simulates a deformation field which only takes into account the parameters of this given component. Indeed, the simulation of $\boldsymbol\beta$ through the Gibbs sampler involves a proposal whose corresponding Markov chain has $q(\boldsymbol\beta \mid \boldsymbol\tau, \mathbf{y}, \eta)$ as stationary distribution. The deformation therefore tries to match $\mathbf{y}$ to the deformed template of the given component $\boldsymbol\tau$: it seeks a better connection between the component parameters and the observation, and there is only a small probability that, given this deformation field, the observation will be closer to another component. The update of the label $\boldsymbol\tau$ is conditional on this deformation, which leaves little chance of switching components.

To overcome the trapping state problem, we simulate the optimal label using as many Markov chains in $\boldsymbol\beta$ as there are components, so that each component has a corresponding deformation which "computes" its distance to the observation. Then we can simulate the optimal deformation corresponding to that optimal label. Since we aim to simulate $(\boldsymbol\beta, \boldsymbol\tau)$ through a transition kernel that has $q(\boldsymbol\beta, \boldsymbol\tau \mid \mathbf{y}, \eta)$ as stationary distribution, we first simulate $\boldsymbol\tau$ with a kernel whose stationary distribution is $q(\boldsymbol\tau \mid \mathbf{y}, \eta)$, and then $\boldsymbol\beta$ through a transition kernel that has $q(\boldsymbol\beta \mid \boldsymbol\tau, \mathbf{y}, \eta)$ as stationary distribution.
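One sweep of the coordinate-wise kernel just described might look as follows. Since the proposal is the Gaussian conditional of the prior, the Metropolis-Hastings ratio reduces to a likelihood ratio, as in Algorithm 2 below; `log_lik` and `cond_gauss_sample` are placeholder callables of ours.

```python
def gibbs_sweep(beta, log_lik, cond_gauss_sample, rng=np.random.default_rng()):
    """One hybrid-Gibbs sweep (Subsection 3.2): a Metropolis-Hastings update of
    each coordinate beta[j], with the prior conditional q(b | beta_{-j}) as proposal."""
    for j in range(len(beta)):
        b = cond_gauss_sample(beta, j)          # proposal from the Gaussian conditional
        cand = beta.copy()
        cand[j] = b
        # acceptance ratio r_j = [ q(y | cand) / q(y | beta) ] ∧ 1, in log-space
        if np.log(rng.uniform()) < min(0.0, log_lik(cand) - log_lik(beta)):
            beta = cand
    return beta
```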
For the first step, we need to compute the weights $q(t \mid y_i, \eta) \propto q(t, y_i \mid \eta)$ for all $1 \le t \le \tau_m$ and all $1 \le i \le n$, which cannot be obtained easily. However, for any density function $f$, any image $y_i$ and any $1 \le t \le \tau_m$, we have
$$q(t, y_i \mid \eta) = \left( \mathbb{E}_{q(\beta \mid y_i, t, \eta)} \left[ \frac{f(\beta)}{q(y_i, \beta, t \mid \eta)} \right] \right)^{-1}. \tag{3.7}$$
Obviously the computation of this expectation w.r.t. the posterior distribution is not tractable either, but we can approximate it by a Monte Carlo sum. However, we cannot easily simulate variables from the posterior distribution $q(\cdot \mid y_i, t, \eta)$ either, so instead of independent realisations of this distribution we use realisations of an ergodic Markov chain having $q(\cdot \mid y_i, t, \eta)$ as stationary distribution.

The solution we propose is the following. Suppose we are at the $k$-th iteration of the algorithm and let $\eta$ be the current parameters. Given any initial deformation field $\xi_0 \in \mathbb{R}^{2k_g}$, we run, for each component $t$, the hybrid Gibbs sampler $\Pi_{\eta,t}$ on $\mathbb{R}^{2k_g}$ $J$ times, so that we get $J$ elements $\xi_{t,i} = (\xi^{(l)}_{t,i})_{1 \le l \le J}$ of an ergodic homogeneous Markov chain whose stationary distribution is $q(\cdot \mid y_i, t, \eta)$. Let us denote by $\xi_i = (\xi_{t,i})_{1 \le t \le \tau_m}$ the matrix of all the auxiliary variables. We then use these elements to compute the weights $p_J(t \mid \xi_i, y_i, \eta)$ through a Monte Carlo sum:
$$p_J(t \mid \xi_i, y_i, \eta) \propto \left( \frac{1}{J} \sum_{l=1}^J \frac{f(\xi^{(l)}_{t,i})}{q(y_i, \xi^{(l)}_{t,i}, t \mid \eta)} \right)^{-1}, \tag{3.8}$$
where the normalisation is done so that the sum over $t$ equals one, involving the dependence on all the auxiliary variables $\xi_i$. The ergodic theorem ensures the convergence of our approximation towards the expected value. We then simulate $\boldsymbol\tau$ through $\otimes_{i=1}^n \sum_{t=1}^{\tau_m} p_J(t \mid \xi_i, y_i, \eta)\, \delta_t$.

Concerning the second step, we update $\boldsymbol\beta$ by re-running the hybrid Gibbs sampler $\Pi_{\eta,\boldsymbol\tau}$ on $\mathbb{R}^N$ $J$ times, starting from a random initial point $\boldsymbol\beta_0$ in a compact subset of $\mathbb{R}^N$. The size of $J$ will depend on the iteration $k$ of the SAEM algorithm, in a sense that will be made precise later; we therefore index it by $k$ from now on.

The density function $f$ involved in the Monte Carlo sum above needs to be specified in order to get the convergence result proved in the last section of this paper. We show that using the prior on the deformation field yields the sufficient conditions for convergence. This density is the Gaussian density function, depending on the component we are working with:
$$f_t(\xi) = \frac{1}{(\sqrt{2\pi})^{2k_g} \sqrt{|\Gamma_{g,t}|}} \exp\left( -\frac12 \xi^t \Gamma_{g,t}^{-1} \xi \right). \tag{3.9}$$
Algorithm 2 shows the detailed iteration.

Remark 3. The use of one simulation of $\beta$ per component was already present in [1], where the best matching $\beta^*$ was computed for all components by minimising the corresponding energies. This gives as many $\beta^*$ as components for each image; then, according to these best matchings, the best component was computed, which in turn pointed to the matching to consider.
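The weight computation (3.8) is a harmonic-mean-type Monte Carlo estimate; computing it in log-space avoids underflow on images. A hedged sketch, with `log_f` and `log_joint` as placeholders of ours for $\log f_t$ and $\log q(y_i, \cdot, t \mid \eta)$:

```python
def component_weights(xi_chains, log_f, log_joint):
    """Monte Carlo approximation (3.8) of p_J(t | xi_i, y_i, eta); xi_chains[t]
    holds the J auxiliary Gibbs samples of component t for one image y_i."""
    log_w = []
    for t, chain in enumerate(xi_chains):
        # log of [ (1/J) sum_l f_t(xi) / q(y_i, xi, t) ]^{-1}, via a log-sum-exp
        terms = np.array([log_f(t, xi) - log_joint(t, xi) for xi in chain])
        m = terms.max()
        log_w.append(-(m + np.log(np.exp(terms - m).mean())))
    w = np.exp(np.array(log_w) - max(log_w))
    return w / w.sum()                          # normalised over t, as in (3.8)
```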
3.3. Truncation on random boundaries. Since our missing data have non-compact support, some of the convergence assumptions of such algorithms [14] are not satisfied. This leads us to consider a truncation algorithm, as suggested in [7] and extended in [2]. Let $(\mathcal{K}_q)_{q \ge 0}$ be an increasing sequence of compact subsets of $\mathcal{S}$ such that $\cup_{q \ge 0} \mathcal{K}_q = \mathcal{S}$ and $\mathcal{K}_q \subset \mathrm{int}(\mathcal{K}_{q+1})$ for all $q \ge 0$. Let $K$ be a compact subset of $\mathbb{R}^N$.

Let $\Pi_\eta$ be the transition kernel of an ergodic Markov chain on $\mathbb{R}^N$ having $\pi_\eta$ as stationary distribution. We construct the homogeneous Markov chain $((\boldsymbol\beta_k, \boldsymbol\tau_k, s_k, \kappa_k))_{k \ge 0}$ as explained in Algorithm 1. As long as the stochastic approximation does not wander out of the current compact set, we run our "SAEM-MCMC" algorithm. As soon as this condition is not satisfied, we reinitialise the sequences of $s$ and $(\boldsymbol\beta, \boldsymbol\tau)$ using a projection (for more details see [7]), and the current compact set is enlarged. To keep track of the current compact set, we use a counter sequence $(\kappa_k)_k$ which remains unchanged when the previous condition is satisfied and increases, pointing towards a bigger compact set, when re-projecting.

Algorithm 1: Stochastic approximation with truncation on random boundaries
  Set $\boldsymbol\beta_0 \in K$, $\boldsymbol\tau_0 \in \mathcal{T}$, $s_0 \in \mathcal{K}_0$ and $\kappa_0 = 0$.
  for all $k \ge 1$ do
    compute $\bar s = s_{k-1} + \Delta_k (S(\bar{\boldsymbol\beta}, \bar{\boldsymbol\tau}) - s_{k-1})$, where $(\bar{\boldsymbol\beta}, \bar{\boldsymbol\tau})$ is sampled from the transition kernel $\Pi_{\eta_{k-1}}$ (see Algorithm 2)
    if $\bar s \in \mathcal{K}_{\kappa_{k-1}}$ then
      set $(s_k, \boldsymbol\beta_k, \boldsymbol\tau_k) = (\bar s, \bar{\boldsymbol\beta}, \bar{\boldsymbol\tau})$ and $\kappa_k = \kappa_{k-1}$
    else
      set $(s_k, \boldsymbol\beta_k, \boldsymbol\tau_k) = (\tilde s, \tilde{\boldsymbol\beta}, \bar{\boldsymbol\tau}) \in \mathcal{K}_0 \times K \times \mathcal{T}$ and $\kappa_k = \kappa_{k-1} + 1$, where $(\tilde s, \tilde{\boldsymbol\beta})$ can be chosen in different ways (cf. [7])
    end if
    $\eta_k = \hat\eta(s_k)$
  end for

Algorithm 2: Transition step $k \to k+1$ using a hybrid Gibbs sampler on $(\boldsymbol\beta, \boldsymbol\tau)$
  Require: $\eta = \eta_k$, $J = J_k$
  for all $i = 1 : n$ do
    for all $t = 1 : \tau_m$ do
      $\xi^{(0)}_{t,i} = \xi_0$
      for all $l = 1 : J$ do
        $\xi = \xi^{(l-1)}_{t,i}$
        Gibbs sampler $\Pi_{\eta,t}$: for all $j = 1 : 2k_g$ do
          Metropolis-Hastings procedure:
            $b \sim q(b \mid \xi_{-j}, t, \eta)$
            compute $r_j(\xi^j, b; \xi_{-j}, \eta, t) = \left[ \dfrac{q(y_i \mid \xi_{b \to j}, t, \eta)}{q(y_i \mid \xi, t, \eta)} \wedge 1 \right]$
            $u \sim \mathcal{U}[0,1]$
            if $u < r_j(\xi^j, b; \xi_{-j}, \eta, t)$ then $\xi^j = b$ end if
        end for
        $\xi^{(l)}_{t,i} = \xi$
      end for
      $p_J(t \mid \xi_i, y_i, \eta) \propto \left( \dfrac{1}{J} \displaystyle\sum_{l=1}^J \dfrac{f_t(\xi^{(l)}_{t,i})}{q(y_i, \xi^{(l)}_{t,i}, t \mid \eta)} \right)^{-1}$
    end for
  end for
  $\boldsymbol\tau_{k+1} \sim \otimes_{i=1}^n \sum_{t=1}^{\tau_m} p_J(t \mid \xi_i, y_i, \eta)\, \delta_t$ and $\boldsymbol\beta_{k+1} \sim \Pi^J_{\eta, \boldsymbol\tau_{k+1}}(\boldsymbol\beta_0, \cdot)$

4. Convergence theorem of the multicomponent procedure. In this particular section the variances of the components $(\sigma^2_t)_{1 \le t \le \tau_m}$ are fixed. Alleviating this condition is not straightforward and is an issue for our future work.

To prove the convergence of our parameter estimate towards the MAP, we go back to a convergence theorem which deals with general stochastic approximations. Indeed, the SAEM-MCMC algorithm introduced and detailed above is a Robbins-Monro type stochastic approximation procedure. One common tool to prove the w.p.1 convergence of such a stochastic approximation was introduced by Kushner and Clark in [15]. However, some of the assumptions they require are intractable for our procedure (in particular concerning the mean field defined below). This leads us to slightly adapt the convergence theorem for stochastic approximations given in [7]. We consider the following Robbins-Monro stochastic approximation procedure:
$$s_k = s_{k-1} + \Delta_k \big( h(s_{k-1}) + e_k + r_k \big), \tag{4.1}$$
where $(e_k)_{k \ge 1}$ and $(r_k)_{k \ge 1}$ are random processes defined on the same probability space, taking their values in an open subset $\mathcal{S}$ of $\mathbb{R}^{n_s}$; $h$ is referred to as the mean field of the algorithm, $(r_k)_{k \ge 1}$ is a remainder term and $(e_k)_{k \ge 1}$ is a stochastic excitation.
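A schematic of the truncation mechanism of Algorithm 1, stripped of the model-specific parts: `in_compact`, `reproject` and `mean_field_sample` are placeholders of ours for membership in $\mathcal{K}_\kappa$, the re-projection, and one noisy evaluation $S(\bar{\boldsymbol\beta}, \bar{\boldsymbol\tau})$.

```python
def truncated_sa(s0, mean_field_sample, in_compact, reproject, deltas, n_iter):
    """Stochastic approximation with truncation on random boundaries: if the
    tentative update leaves the current compact set, re-project and enlarge it."""
    s, kappa = s0, 0
    for k in range(1, n_iter + 1):
        s_bar = s + deltas(k) * (mean_field_sample(s) - s)  # tentative update
        if in_compact(s_bar, kappa):
            s = s_bar                       # stay: kappa unchanged
        else:
            s = reproject()                 # back into K_0 ...
            kappa += 1                      # ... and point to a bigger compact set
    return s
```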
To be able to get a convergence result, we consider the truncated sequence $(s_k)_k$ defined as follows: let $S_a \subset \mathcal{S}$ and
$$\bar s_k = s_{k-1} + \Delta_k h(s_{k-1}) + \Delta_k e_k + \Delta_k r_k, \qquad
\begin{cases}
\text{if } \bar s_k \in \mathcal{K}_{\kappa_{k-1}}: & s_k = \bar s_k, \quad \kappa_k = \kappa_{k-1}, \\
\text{if } \bar s_k \notin \mathcal{K}_{\kappa_{k-1}}: & s_k = \tilde s_k \in S_a, \quad \kappa_k = \kappa_{k-1} + 1.
\end{cases} \tag{4.2}$$
The projection $\tilde s_k$ can be made in different ways (cf. [7]).

We will use Delyon's theorem, which gives sufficient conditions for the sequence $(s_k)_{k \ge 0}$, truncated on random boundaries, to converge with probability one. The theorem we state here is a generalisation of the theorem presented in [7]: we have added the existence of an absorbing set for the stochastic approximation sequence. The proof of this theorem can be carried out the same way Delyon et al. do theirs, adding the absorbing set; this is why it is not detailed here.

Theorem 4.1. We consider the sequence $(s_k)_{k \ge 0}$ given by the truncated procedure (4.2). Assume that:
(SA0') There exists a closed convex set $S_a \subset \mathcal{S}$ such that for all $k \ge 0$, $s_k \in S_a$ w.p.1.
(SA1) $(\Delta_k)_{k \ge 1}$ is a decreasing sequence of positive numbers such that $\sum_{k=1}^\infty \Delta_k = \infty$.
(SA2) The vector field $h$ is continuous on $S_a$ and there exists a continuously differentiable function $w : \mathcal{S} \to \mathbb{R}$ such that
  (i) for all $s \in \mathcal{S}$, $F(s) = \langle \partial_s w(s), h(s)\rangle \le 0$;
  (ii) $\mathrm{int}(w(\mathcal{L}')) = \emptyset$, where $\mathcal{L}' \triangleq \{s \in \mathcal{S} : F(s) = 0\}$.
(STAB1') There exist a continuously differentiable function $W : \mathcal{S} \to \mathbb{R}$ and a compact set $\mathcal{K}$ such that
  (i) for all $c \ge 0$, $W_c \cap S_a$ is a compact subset of $\mathcal{S}$, where $W_c = \{s \in \mathcal{S} : W(s) \le c\}$ is a level set;
  (ii) $\langle \partial_s W(s), h(s)\rangle < 0$ for all $s \in \mathcal{S} \setminus \mathcal{K}$.
(STAB2) For any positive integer $M$, w.p.1 $\lim_{p \to \infty} \sum_{k=1}^p \Delta_k e_k \mathbb{1}_{W(s_{k-1}) \le M}$ exists and is finite, and w.p.1 $\limsup_{k \to \infty} |r_k| \mathbb{1}_{W(s_{k-1}) \le M} = 0$.
Then, w.p.1, $\limsup_{k \to \infty} d(s_k, \mathcal{L}') = 0$.

Assumption (SA2), which involves a Lyapunov function $w$, replaces the usual condition that, w.p.1, the sequence comes back infinitely often to a compact set, which is not easy to check in practice. In addition, assumptions (STAB1') and (STAB2) give a recurrence condition, introducing a Lyapunov function $W$ which controls the excursions outside the compact sets. The two Lyapunov functions $w$ and $W$ need not be the same. Another interesting point is that the truncation changes neither the mean field nor, therefore, the stationary points of the sequence.

This theorem does not ensure the convergence of the sequence to a maximum of the likelihood, but to one of its critical points. To ensure that the critical point reached is a maximum, we would have to satisfy two other conditions (called (LOC1-2) in [7]), which are typical conditions: they require that the critical points be isolated and that, for every stationary point $s^* \in \mathcal{L}'$, the Hessian matrix of the observed log-likelihood be negative definite.

We now want to apply this theorem to prove the convergence of our "SAEM-like" procedure, in which the missing variables are not simulated through the posterior density function but by a kernel which can be made as close to it as desired, by increasing $J_k$ (generalising Theorem 3 in [7]).
Let us consider the following stochastic approximation: $(\boldsymbol\beta_k, \boldsymbol\tau_k)$ are simulated by the transition kernel described in the previous section, and
$$s_k = s_{k-1} + \Delta_k \big( S(\boldsymbol\beta_k, \boldsymbol\tau_k) - s_{k-1} \big),$$
which can be connected to the Robbins-Monro procedure using the notation introduced in [7]: let $\mathcal{F} = (\mathcal{F}_k)_{k \ge 1}$ be the filtration, where $\mathcal{F}_k$ is the $\sigma$-algebra generated by the random variables $(S_0, \boldsymbol\beta_1, \ldots, \boldsymbol\beta_k, \boldsymbol\tau_1, \ldots, \boldsymbol\tau_k)$, let $\mathbb{E}_{\pi_\eta}$ be the expectation with respect to the posterior distribution $\pi_\eta$, and
$$\begin{aligned}
h(s_{k-1}) &= \mathbb{E}_{\pi_{\hat\eta(s_{k-1})}}[S(\boldsymbol\beta, \boldsymbol\tau)] - s_{k-1}, \\
e_k &= S(\boldsymbol\beta_k, \boldsymbol\tau_k) - \mathbb{E}[S(\boldsymbol\beta_k, \boldsymbol\tau_k) \mid \mathcal{F}_{k-1}], \\
r_k &= \mathbb{E}[S(\boldsymbol\beta_k, \boldsymbol\tau_k) \mid \mathcal{F}_{k-1}] - \mathbb{E}_{\pi_{\hat\eta(s_{k-1})}}[S(\boldsymbol\beta, \boldsymbol\tau)].
\end{aligned}$$

Theorem 4.2. Let $w(s) = -l(\hat\eta(s))$, where $l(\eta) = \log \sum_{\boldsymbol\tau} \int_{\mathbb{R}^N} q(\mathbf{y}, \boldsymbol\beta, \boldsymbol\tau, \eta)\, d\boldsymbol\beta$, and let $h(s) = \sum_{\boldsymbol\tau} \int_{\mathbb{R}^N} (S(\boldsymbol\beta, \boldsymbol\tau) - s)\, \pi_{\hat\eta(s)}(\boldsymbol\beta, \boldsymbol\tau)\, d\boldsymbol\beta$ for $s \in \mathcal{S}$. Assume that:
(A1) the sequences $(\Delta_k)_{k \ge 1}$ and $(J_k)_{k \ge 1}$ satisfy:
  (i) $(\Delta_k)_{k \ge 1}$ is non-increasing, positive, $\sum_{k=1}^\infty \Delta_k = \infty$ and $\sum_{k=1}^\infty \Delta_k^2 < \infty$;
  (ii) $(J_k)_{k \ge 1}$ takes its values in the set of positive integers and $\lim_{k \to \infty} J_k = \infty$.
(A2) $\mathcal{L}' \triangleq \{s \in \mathcal{S} : \langle \partial_s w(s), h(s)\rangle = 0\}$ is included in a level set of $w$.
Let $(s_k)_{k \ge 0}$ be the truncated sequence defined in equation (4.2), $K$ a compact set of $\mathbb{R}^N$ and $\mathcal{K}_0$ a compact subset of $\mathcal{S}$. Then, for all $\boldsymbol\beta_0 \in K$, $\boldsymbol\tau_0 \in \mathcal{T}$ and $s_0 \in \mathcal{K}_0$, we have
$$\lim_{k \to \infty} d(s_k, \mathcal{L}') = 0 \quad \bar{\mathbb{P}}_{\boldsymbol\beta_0, \boldsymbol\tau_0, s_0, 0}\text{-a.s.},$$
where $\bar{\mathbb{P}}_{\boldsymbol\beta_0, \boldsymbol\tau_0, s_0, 0}$ is the probability measure associated with the chain $(Z_k = (\boldsymbol\beta_k, \boldsymbol\tau_k, s_k, \kappa_k))_{k \ge 0}$ starting at $(\boldsymbol\beta_0, \boldsymbol\tau_0, s_0, 0)$.

The first assumption, which concerns the two sequences involved in the algorithm, is not restrictive at all, since these sequences can be chosen arbitrarily. The second assumption, however, is more complex; it is required to satisfy the assumptions of Theorem 4.1. This is a condition we have not proved yet; it is part of our future work.

Proof. The proof of this theorem is given in Section 6. We give here a quick sketch, to emphasise the main difficulties and the differences between our proof and the convergence proof of the SAEM algorithm given in [7]. Even though the only algorithmic difference between our algorithm and the SAEM algorithm is that the missing data are not simulated with respect to the posterior law $q(\boldsymbol\beta, \boldsymbol\tau \mid \mathbf{y}, \eta)$ but through an approximation which can be made arbitrarily close to it, this yields a very different proof. Indeed, whereas for the SAEM algorithm the stochastic approximation leads to a Robbins-Monro type equation (4.1) with no residual term $r_k$, our method induces one. The first difficulty is therefore to prove that this residual term tends to 0 as the number of iterations $k$ tends to infinity. Our proof is decomposed into two parts, the first one concerning the deformation variable $\boldsymbol\beta$ and the second the label $\boldsymbol\tau$. The first term requires proving the geometric ergodicity of the Markov chain in $\boldsymbol\beta$ generated through our kernel. For this purpose, we prove some typical sufficient conditions, which include the existence of a small set for the transition kernel and a drift condition.
Then, for the second term, we use some concentration inequalities for non-stationary Markov chains to prove that the kernel associated with the label distribution converges towards the posterior distribution $q(\boldsymbol\tau \mid \mathbf{y}, \eta)$. The second difficulty is to prove the convergence of the excitation term $e_k$; this can be carried out as in [7], using the properties of our Markov chain and some martingale limit properties.

Corollary 1. Under the assumptions of Theorem 4.2, we have for all $\boldsymbol\beta_0 \in K$, $\boldsymbol\tau_0 \in \mathcal{T}$, $s_0 \in S_a$ and $\eta_0 \in \Theta \times \varrho$,
$$\lim_{k \to \infty} d(\eta_k, \mathcal{L}) = 0 \quad \bar{\mathbb{P}}_{\boldsymbol\beta_0, \boldsymbol\tau_0, s_0, 0}\text{-a.s.},$$
where $\bar{\mathbb{P}}_{\boldsymbol\beta_0, \boldsymbol\tau_0, s_0, 0}$ is the probability measure associated with the chain $(Z_k = (\boldsymbol\beta_k, \boldsymbol\tau_k, s_k, \kappa_k))_{k \ge 0}$ starting at $(\boldsymbol\beta_0, \boldsymbol\tau_0, s_0, 0)$, and $\mathcal{L} \triangleq \{\eta \in \hat\eta(\mathcal{S}) : \frac{\partial l}{\partial \eta}(\eta) = 0\}$.

Proof. This is a direct consequence of the smoothness of the function $s \mapsto \hat\eta(s)$ on $\mathcal{S}$ and of Lemma 2 of [7], which links the sets $\mathcal{L}$ and $\mathcal{L}'$.

5. Experiments.

5.1. USPS database. To illustrate the previous algorithm for the deformable template model, we consider handwritten digit images. For each digit, referred to as a class below, we learn two templates, the corresponding noise variances and the geometric covariance matrices. We use the USPS database, which contains a training set of around 7000 images. Each picture is a $16 \times 16$ gray level image with intensity in $[0, 2]$, where 0 corresponds to the black background. In Figure 5.1 we show some of the training images used for the statistical estimation.

Fig. 5.1. Some examples of the training set: 40 images per class (inverse video).

5.1.1. General setting of the algorithm. A natural choice for the prior laws on $\alpha$ and $\Gamma_g$ is to set 0 for the mean of $\alpha$ and to induce the two covariance matrices by the metric of the spaces $V_p$ and $V_g$, involving the correlation between the landmarks through the kernels: define the square matrices $M_p(j, j') = K_p(v_{p,j}, v_{p,j'})$ for all $1 \le j, j' \le k_p$, and $M_g(j, j') = K_g(v_{g,j}, v_{g,j'})$ for all $1 \le j, j' \le k_g$. Then $\Sigma_p = M_p^{-1}$ and $\Sigma_g = M_g^{-1}$. In our experiments, we have chosen Gaussian kernels for both $K_p$ and $K_g$, where the standard deviations are fixed: $\sigma_p = 0.2$ and $\sigma_g = 0.12$ (for an estimation on $[-1.5, 1.5]^2$ and $[-1, 1]^2$ respectively).

For the stochastic approximation step-size, we allow a heating period $k_h$, which corresponds to the absence of memory for the first iterations. This allows the Markov chain to reach an area of interest in the posterior probability density function $q(\boldsymbol\beta, \boldsymbol\tau \mid \mathbf{y}, \eta)$ before exploring this particular region. In the experiments presented, the heating time $k_h$ lasts up to 150 iterations, and the whole algorithm is stopped after at most 200 iterations, depending on the data set (noisy or not); this number of iterations corresponds to a point where convergence has been reached. As suggested in [7], we choose the step-size sequence $\Delta_k = 1$ for all $1 \le k \le k_h$ and $\Delta_k = (k - k_h)^{-0.6}$ otherwise (a minimal sketch of this schedule is given below).

The multicomponent case has to face the problem of its computational time. Indeed, since we approximate the posterior density by running $J_k$ elements of $\tau_m$ independent Markov chains, the computation time increases linearly with $J_k$. In our experiments, we have chosen a fixed $J$ for every EM iteration, $J = 50$.
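The step-size schedule with heating period, as referenced above, is one line ($k_h = 150$ and the exponent 0.6 are the values used in these experiments):

```python
def step_size(k, k_h=150, exponent=0.6):
    """Delta_k = 1 during the heating period (no memory), (k - k_h)^{-0.6} after."""
    return 1.0 if k <= k_h else float(k - k_h) ** (-exponent)
```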
This fixed choice of $J = 50$ is enough to get a good approximation, thanks to the coupling between the iterations of the SAEM algorithm and the iterations of the Markov chains. Indeed, even though the parameter $\eta$ is modified along the SAEM iterations, its successive jumps are small enough to ensure the convergence of the MCMC method. Intuitively speaking, it amounts to considering not only 50 iterations of the MCMC method, but 50 times the number of SAEM iterations.

5.1.2. The estimated templates. We now show the results of the statistical learning algorithm for our generative model. To avoid the problems shown in [2], we choose the same initialisation of the template parameter $\alpha$ as they did: we set the initial value of $\alpha$ such that the corresponding $I_\alpha$ is the mean of the gray-level training images.

Fig. 5.2. Estimated prototypes of the two-component model for each digit (40 images per class; 100 iterations; two components per class).

Fig. 5.3. Estimated prototypes of the two-component model for each digit (40 images per class, second random sample).

In Figure 5.2, we show the two estimated templates obtained by the multicomponent procedure with 40 training examples per class. It appears that, as for the mode approximation algorithm whose results are presented on this database in [1], the two components reached are meaningful, such as the 2 with and without a loop, or the American and European 7; they even look alike. In Figure 5.3, we show a second run of the algorithm with a different database, whose training images are randomly selected from the whole USPS training set. We can see some variability, in particular for digit 7, where there were no European 7s in this training set; this generates two different clusters, still relevant for this digit. The other digits are quite stable, in particular the strongly constrained ones (like 3, 5, 8 or 9).

5.1.3. The photometric noise variance. Even though we prove the convergence result for fixed component noise variances, we still learn them in the experiments. The same behaviour is observed for our stochastic EM as for the mode approximation EM algorithm of [1]: allowing the decomposition of the class into components enables the model to better fit the data, yielding a lower residual noise. In addition, the stochastic algorithm explores the whole posterior distribution instead of focusing only on its mode, which increases the accuracy of the geometric covariance and of the template estimation; this in turn lowers the noise required to explain the gap between the model and the truth.

The evolution of the estimated variances for the two components of each digit is presented in Figure 5.4. The convergence of this variance is faster for some very constrained digits, like digit 1. This is due to the well-defined templates and geometric variability of the class, which can be easily captured; a very low level of noise is therefore reached very quickly. On the other hand, some very variable digits, like digit 2, are slower to converge: the huge geometric variability, added to very complex template shapes, leads to a more difficult estimation and therefore more iterations before convergence.
A last point worth noticing is the convergence of the European 7, which looks slower than that of the other component (the American 7). The reason for this behaviour is that there are only two images of such a 7 in the training set, and it takes the algorithm a longer time to put together these two shapes, and only them, so that the clustering is better with respect to the likelihood. The other 7 does not suffer from this problem and converges faster.

Fig. 5.4. Evolution of the two cluster variances along the iterations (two panels: $\sigma^2$ against iteration number, $N = 50$, one curve per digit 0-9).

5.1.4. The estimated geometric distribution. To be able to compare the learnt geometries, we draw some synthetic examples using the mixture model with the learnt parameters. Even when the templates look similar, the separation between two components can be justified by the different geometry distributions. To show the effects of the geometry on the components, we have drawn some "2"s with their respective parameters in the four top rows of Figure 5.5.

Fig. 5.5. Some synthetic examples of the components of digit 2. First four rows: templates of the two components deformed through some deformation field $\beta$ and $-\beta$ drawn from their respective geometric covariances. Two last rows: template of the first component from Figure 5.2 with deformations drawn with respect to the covariance matrix of the second component.

For each component, we have drawn the deformation given by the variable $\beta$ and its opposite $-\beta$, since, as soon as one is learnt, the opposite deformation is learnt at the same time, because of the symmetry of the centred Gaussian distribution. This is why, sometimes, one of the two looks strange whereas the other looks like some element of the training set. The simulation is done using a common standard Gaussian distribution, which is then multiplied by a square root of the covariance matrix we want to apply. We can see the effects of the covariance matrix on both templates, and the large variability learnt. This has to be compared with the bottom rows of Figure 5.5, where the two samples are drawn with the first template but with the covariance matrix of the other component. Even though all six rows represent "2"s, the bottom ones suffer from the geometrical tendency of the other cluster and do not look as natural. This shows the variability of the models inside the classes.

5.2. Medical images. We also test the algorithm on a database consisting of 47 2D medical images. Each of them represents the splenium (back of the corpus callosum) and a part of the cerebellum. Some of the training images are shown in the first row of Figure 5.6.

Fig. 5.6. First row: ten images of the training set representing the splenium and a part of the cerebellum. Second row: results of the template estimation. (a): gray level mean image of the 47 images. Templates estimated (b): with the FAM, (c): with the stochastic algorithm on the simple model, (d,e): on the two-component model.

The results of the estimation are presented in the second row of Figure 5.6.
The first picture, (a), is the gray level mean of the 47 images. The second one, (b), shows the estimated template computed with the Fast Approximation with Mode (FAM) algorithm presented in [1] for a single-component model. This algorithm is an EM-like algorithm where the E step is simplified: the posterior distribution of the hidden variable is approximated by a Dirac distribution at its mode. This yields a deterministic algorithm, quite simple to implement, but with no theoretical convergence properties. It shows a well-contrasted splenium, whereas the cerebellum remains a little blurry (note that it is still much better than the simple mean (a)). This picture has to be compared with picture (c), which gives the estimated template computed with our algorithm with $\tau_m = 1$. The great improvement from the gray level mean of the images (a), or the FAM estimate (b), to our estimates is obvious. In particular, the splenium is still very contrasted, better localised, and the cerebellum is reconstructed with several branches. The background presents several structures, whereas the other estimates are blurry. The two anatomical shapes are relevant representatives of the ones observed in the training set.

The estimation was then run while enabling the decomposition of the database into two components with our SAEM-MCMC algorithm presented here. The two estimated templates are shown in Figure 5.6 (d) and (e). The differences can be seen in particular in the shape of the splenium, whose boundaries are more or less curved; the thickness of the splenium varies as well between the two estimates. The position of the fornix also differs, being closer to the boundary of the image in one case. The number of branches in the two cerebella also tends to differ from one template to the other (4 in the first component and 5 in the second one). The estimation suffers from the small number of images we have; this can be seen in the estimation of the background, which is blurry in both images. To explain the huge variability of the two anatomical shapes, more components would be interesting, but at the same time more images would be required, so that the components do not end up empty.

6. Proof of Theorem 4.2. We recall that in this section the variances of the components are fixed. This reduces the parameters $\theta_t$ to $(\alpha_t, \Gamma_{g,t})$ for all $1 \le t \le \tau_m$.

First, let us exhibit sufficient statistics for the model. The complete log-likelihood equals:
$$\log q(\mathbf{y}, \boldsymbol\beta, \boldsymbol\tau \mid \eta) = \sum_{i=1}^n \left\{ \log\left[ \left(\frac{1}{2\pi\sigma^2_{\tau_i}}\right)^{|\Lambda|/2} \exp\left(-\frac{1}{2\sigma^2_{\tau_i}} \big\|y_i - K_p^{\beta_i} \alpha_{\tau_i}\big\|^2\right) \right] + \log\left[ \left(\frac{1}{2\pi}\right)^{k_g} |\Gamma_{g,\tau_i}|^{-1/2} \exp\left(-\frac12 \beta_i^t \Gamma_{g,\tau_i}^{-1} \beta_i\right) \right] + \log(\rho_{\tau_i}) \right\},$$
where $K_p^\beta \alpha = z_\beta I_\alpha$ and $\|\cdot\|$ denotes the Euclidean norm. This emphasises five sufficient statistics, given in their matricial form for all $1 \le t \le \tau_m$:
$$\begin{aligned}
S_{0,t}(\boldsymbol\beta, \boldsymbol\tau) &= \sum_{1 \le i \le n} \mathbb{1}_{\tau_i = t}, \\
S_{1,t}(\boldsymbol\beta, \boldsymbol\tau) &= \sum_{1 \le i \le n} \mathbb{1}_{\tau_i = t} \big(K_p^{\beta_i}\big)^t y_i, \\
S_{2,t}(\boldsymbol\beta, \boldsymbol\tau) &= \sum_{1 \le i \le n} \mathbb{1}_{\tau_i = t} \big(K_p^{\beta_i}\big)^t \big(K_p^{\beta_i}\big), \\
S_{3,t}(\boldsymbol\beta, \boldsymbol\tau) &= \sum_{1 \le i \le n} \mathbb{1}_{\tau_i = t}\, \beta_i^t \beta_i, \\
S_{4,t}(\boldsymbol\beta, \boldsymbol\tau) &= \sum_{1 \le i \le n} \mathbb{1}_{\tau_i = t} \|y_i\|^2.
\end{aligned}$$
Thus we apply the stochastic approximation at iteration $k$ of the algorithm, leading to
$$s_{k,m,t} = s_{k-1,m,t} + \Delta_k \big( S_{m,t}(\boldsymbol\beta_k, \boldsymbol\tau_k) - s_{k-1,m,t} \big) \quad \text{for } 0 \le m \le 4,$$
and rewrite the maximisation step.
The weights and the covariance matrices are updated as follows:
$$\rho_{\tau,k} = \frac{s_{k,0,\tau} + a_\rho}{n + \tau_m a_\rho}, \tag{6.1}$$
$$\Gamma_{g,\tau,k} = \frac{1}{s_{k,0,\tau} + a_g} \big( s_{k,0,\tau}\, s_{k,3,\tau} + a_g \Sigma_g \big). \tag{6.2}$$
The photometric parameters are the solution of the following system:
$$\begin{cases}
\alpha_{\tau,k} = \big( s_{k,0,\tau}\, s_{k,2,\tau} + \sigma^2_{\tau,k} (\Sigma_p)^{-1} \big)^{-1} \big( s_{k,0,\tau}\, s_{k,1,\tau} + \sigma^2_{\tau,k} (\Sigma_p)^{-1} \mu_p \big), \\[4pt]
\sigma^2_{\tau,k} = \dfrac{1}{s_{k,0,\tau} |\Lambda| + a_p} \big( s_{k,0,\tau} \big( s_{k,4,\tau} + (\alpha_{\tau,k})^t s_{k,2,\tau}\, \alpha_{\tau,k} - 2 (\alpha_{\tau,k})^t s_{k,1,\tau} \big) + a_p \sigma_0^2 \big),
\end{cases} \tag{6.3}$$
which can be solved iteratively for each component $\tau$, starting from the previous values. In this part, all the $\sigma^2_{\tau,k}$ are fixed, and this leads to an explicit form for the parameters $\alpha_{\tau,k}$.
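The updates (6.1)-(6.3) just stated are closed-form once $\sigma^2_\tau$ is fixed, as assumed in Section 4. A hedged sketch for one component (the dictionary layout of the statistics and hyper-parameters is ours):

```python
def m_step(s, hyper):
    """Maximisation step (6.1)-(6.3) for one component tau with fixed noise
    variance; s holds the statistics s0..s4, hyper the hyper-parameters."""
    rho = (s['s0'] + hyper['a_rho']) / (hyper['n'] + hyper['tau_m'] * hyper['a_rho'])
    Gamma = (s['s0'] * s['s3'] + hyper['a_g'] * hyper['Sigma_g']) \
            / (s['s0'] + hyper['a_g'])                                    # (6.2)
    Sp_inv = np.linalg.inv(hyper['Sigma_p'])
    sigma2 = hyper['sigma2']                  # fixed in the convergence analysis
    alpha = np.linalg.solve(s['s0'] * s['s2'] + sigma2 * Sp_inv,          # (6.3)
                            s['s0'] * s['s1'] + sigma2 * Sp_inv @ hyper['mu_p'])
    return rho, Gamma, alpha
```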
We will now apply Theorem 4.1 to prove Theorem 4.2. (SA0') is satisfied with the set $S_a$ defined by
$$S_a \triangleq \{ S \in \mathcal{S} \mid 0 \le S_{0,t} \le n,\; \|S_{1,t}\| \le \|\mathbf{y}\|,\; \|S_{2,t}\| \le n,\; 0 \le S_{3,t},\; 0 \le S_{4,t} \le \|\mathbf{y}\|^2,\; \forall 1 \le t \le \tau_m \}.$$
Thanks to the convexity of this set, the new value $s_k$, defined as a barycenter, remains in $S_a$. Assumption (SA1) is trivially satisfied, since we can choose our step-size sequence $(\Delta_k)_k$. (SA2) holds, as already proved in [2] for the one-component case, with $w(s) = -l(\hat\eta(s))$, as does (STAB1'(ii)) with the same function $W(s) = -l(\hat\eta(s))$. These conditions imply the contraction property of the Lyapunov function $w$ and the convergence of the stochastic approximation under some conditions on the perturbations. We need to suppose, as in the one-component case [2], that the critical points of our model are in a compact subset of $\mathcal{S}$, which stands for (STAB1'(i)); this is an assumption to be considered in future work.

We now focus on (STAB2), which is the assumption giving the control of the perturbations required for convergence. We first show the convergence to zero of the remainder term $|r_k| \mathbb{1}_{W(s_{k-1}) \le M}$ for any positive integer $M$. We denote $\pi_k = \pi_{\hat\eta(s_k)}$ for any $k \ge 0$. We have $r_k = \mathbb{E}[S(\boldsymbol\beta_k, \boldsymbol\tau_k) \mid \mathcal{F}_{k-1}] - \mathbb{E}_{\pi_{k-1}}[S(\boldsymbol\beta, \boldsymbol\tau)]$, thus
$$r_k = \sum_{\boldsymbol\tau} \int_{\mathbb{R}^N} S(\boldsymbol\beta, \boldsymbol\tau)\, \Pi^{J_k}_{\eta_{k-1}, \boldsymbol\tau}(\boldsymbol\beta_0, \boldsymbol\beta) \prod_{i=1}^n \int p_{J_k}(\tau_i \mid \xi_i, y_i, \eta_{k-1}) \prod_{t=1}^{\tau_m} \prod_{l=1}^{J_k} \Pi_{\eta,t}\big(\xi^{(l-1)}_{t,i}, \xi^{(l)}_{t,i}\big)\, d\xi^{(l)}_{t,i}\, d\boldsymbol\beta \;-\; \sum_{\boldsymbol\tau} \int_{\mathbb{R}^N} S(\boldsymbol\beta, \boldsymbol\tau)\, \pi_{\eta_{k-1}}(\boldsymbol\beta, \boldsymbol\tau)\, d\boldsymbol\beta.$$
We denote $Q(\xi_i)\, d\xi_i = \prod_{t=1}^{\tau_m} \prod_{l=1}^{J_k} \Pi_{\eta,t}(\xi^{(l-1)}_{t,i}, \xi^{(l)}_{t,i})\, d\xi^{(l)}_{t,i}$ and $R_{J_k}(\boldsymbol\tau \mid \mathbf{y}, \eta_{k-1}) = \prod_{i=1}^n \int p_{J_k}(\tau_i \mid \xi_i, y_i, \eta_{k-1})\, Q(\xi_i)\, d\xi_i$. We can now rewrite
$$\begin{aligned}
|r_k| &\le \left| \sum_{\boldsymbol\tau} \int_{\mathbb{R}^N} S(\boldsymbol\beta, \boldsymbol\tau) \Big[ \Pi^{J_k}_{\eta_{k-1}, \boldsymbol\tau}(\boldsymbol\beta_0, \boldsymbol\beta)\, R_{J_k}(\boldsymbol\tau \mid \mathbf{y}, \eta_{k-1}) - \pi_{\eta_{k-1}}(\boldsymbol\beta, \boldsymbol\tau) \Big] d\boldsymbol\beta \right| \\
&\le \sum_{\boldsymbol\tau} \left| \int_{\mathbb{R}^N} S(\boldsymbol\beta, \boldsymbol\tau) \Big[ \Pi^{J_k}_{\eta_{k-1}, \boldsymbol\tau}(\boldsymbol\beta_0, \boldsymbol\beta) - q(\boldsymbol\beta \mid \boldsymbol\tau, \mathbf{y}, \eta_{k-1}) \Big] d\boldsymbol\beta \right| \big| R_{J_k}(\boldsymbol\tau \mid \mathbf{y}, \eta_{k-1}) \big| \\
&\quad + \sum_{\boldsymbol\tau} \left| \int_{\mathbb{R}^N} S(\boldsymbol\beta, \boldsymbol\tau)\, q(\boldsymbol\beta \mid \boldsymbol\tau, \mathbf{y}, \eta_{k-1})\, d\boldsymbol\beta \right| \big| R_{J_k}(\boldsymbol\tau \mid \mathbf{y}, \eta_{k-1}) - q(\boldsymbol\tau \mid \mathbf{y}, \eta_{k-1}) \big|.
\end{aligned}$$
Denoting $M_{\eta_{k-1}} = \max_{\boldsymbol\tau} \int_{\mathbb{R}^N} |S(\boldsymbol\beta, \boldsymbol\tau)|\, q(\boldsymbol\beta \mid \boldsymbol\tau, \mathbf{y}, \eta_{k-1})\, d\boldsymbol\beta$, we finally obtain
$$|r_k| \mathbb{1}_{W(s_{k-1}) \le M} \le \sum_{\boldsymbol\tau} \left| \int_{\mathbb{R}^N} S(\boldsymbol\beta, \boldsymbol\tau) \Big[ \Pi^{J_k}_{\eta_{k-1}, \boldsymbol\tau}(\boldsymbol\beta_0, \boldsymbol\beta) - q(\boldsymbol\beta \mid \boldsymbol\tau, \mathbf{y}, \eta_{k-1}) \Big] d\boldsymbol\beta \right| \mathbb{1}_{W(s_{k-1}) \le M} \tag{6.4}$$
$$\qquad + M_{\eta_{k-1}} \sum_{\boldsymbol\tau} \big| R_{J_k}(\boldsymbol\tau \mid \mathbf{y}, \eta_{k-1}) - q(\boldsymbol\tau \mid \mathbf{y}, \eta_{k-1}) \big| \mathbb{1}_{W(s_{k-1}) \le M}. \tag{6.5}$$
We will first show that the Gibbs sampler kernel $\Pi_{\eta,\boldsymbol\tau}$ satisfies a lower bound condition and a drift condition (MDRI), to get its geometric ergodicity (as was done in [2]).

(MDRI) For any $s \in \mathcal{S}$ and any $\boldsymbol\tau \in \mathcal{T}$, $\Pi_{\hat\eta(s), \boldsymbol\tau}$ is irreducible and aperiodic. In addition, there exists a function $V : \mathbb{R}^N \to [1, \infty[$ such that for any $p \ge 1$ and any compact subset $\mathcal{K} \subset \mathcal{S}$, there exist a set $C$, an integer $m$, constants $0 < \kappa < 1$, $B > 0$, $\delta > 0$ and a probability measure $\nu$ such that
$$\inf_{s \in \mathcal{K}, \boldsymbol\tau \in \mathcal{T}} \Pi^m_{\hat\eta(s), \boldsymbol\tau}(\boldsymbol\beta, A) \ge \delta\, \nu(A) \quad \forall \boldsymbol\beta \in C,\; \forall A \in \mathcal{B}(\mathbb{R}^N), \tag{6.6}$$
$$\sup_{s \in \mathcal{K}, \boldsymbol\tau \in \mathcal{T}} \Pi^m_{\hat\eta(s), \boldsymbol\tau} V^p(\boldsymbol\beta) \le \kappa V^p(\boldsymbol\beta) + B \mathbb{1}_C(\boldsymbol\beta). \tag{6.7}$$

Notation 1. Let $(e_j)_{1 \le j \le N}$ be the canonical basis of the $\boldsymbol\beta$-space, and for any $1 \le j \le N$, let $E_{\eta, \boldsymbol\tau, j} \triangleq \{\boldsymbol\beta \in \mathbb{R}^N \mid \langle \boldsymbol\beta, e_j\rangle_{\eta,\boldsymbol\tau} = 0\}$ be the orthogonal of $\mathrm{Span}\{e_j\}$, and let $p_{\eta,\boldsymbol\tau,j}$ be the orthogonal projection onto $E_{\eta,\boldsymbol\tau,j}$, i.e.
$$p_{\eta,\boldsymbol\tau,j}(\boldsymbol\beta) \triangleq \boldsymbol\beta - \frac{\langle \boldsymbol\beta, e_j\rangle_{\eta,\boldsymbol\tau}}{\|e_j\|^2_{\eta,\boldsymbol\tau}}\, e_j,$$
where $\langle \boldsymbol\beta, \boldsymbol\beta'\rangle_{\eta,\boldsymbol\tau} = \sum_{i=1}^n \beta_i^t \Gamma_{g,\tau_i}^{-1} \beta_i'$ for $\boldsymbol\beta$ and $\boldsymbol\beta'$ in $\mathbb{R}^N$ (i.e. the natural dot product associated with the covariance matrices $(\Gamma_{g,t})_t$), and $\|\cdot\|_{\eta,\boldsymbol\tau}$ is the corresponding norm. For any $1 \le j \le N$, $\eta \in \Theta \times \varrho$ and $\boldsymbol\tau \in \mathcal{T}$, we denote by $\Pi_{\eta,\boldsymbol\tau,j}$ the Markov kernel on $\mathbb{R}^N$ associated with the $j$-th Metropolis-Hastings step of the Gibbs sampler on $\mathbb{R}^N$. We have $\Pi_{\eta,\boldsymbol\tau} = \Pi_{\eta,\boldsymbol\tau,N} \circ \cdots \circ \Pi_{\eta,\boldsymbol\tau,1}$.

Inequality (6.6) is equivalent to the existence of a small set $C$ for the kernel $\Pi_{\hat\eta(s),\boldsymbol\tau}$, independent of $s \in \mathcal{K}$. We recall here the definition of a small set:

Definition 1 (cf. [17]). A set $E \in \mathcal{B}(\mathbb{R}^N)$ is called a small set for the kernel $\Pi$ if there exist an integer $m > 0$ and a nontrivial measure $\nu_m$ on $\mathcal{B}(\mathbb{R}^N)$ such that for all $\boldsymbol\beta \in E$ and $B \in \mathcal{B}(\mathbb{R}^N)$,
$$\Pi^m(\boldsymbol\beta, B) \ge \nu_m(B).$$
When this holds, we say that $E$ is $\nu_m$-small.

We now prove the following lemma.

Lemma 6.1. Let $E$ be a compact subset of $\mathbb{R}^N$ and $\mathcal{K}$ a compact subset of $\mathcal{S}$; then $E$ is a small set of $\mathbb{R}^N$ for $(\Pi_{\hat\eta(s),\boldsymbol\tau})_{s \in \mathcal{K}, \boldsymbol\tau \in \mathcal{T}}$.

Proof. The transition probability kernel of our Markov chain on $\boldsymbol\beta$ is defined as follows: for coordinate $j$, the kernel is
$$\Pi_{\eta,\boldsymbol\tau,j}(\boldsymbol\beta, d\mathbf{z}) = \big(\otimes_{m \ne j} \delta_{\beta^m}(dz^m)\big) \times \Big( q_j(dz^j \mid \boldsymbol\beta_{-j}, \eta, \boldsymbol\tau)\, r_j(\beta^j, dz^j; \boldsymbol\beta_{-j}, \eta, \boldsymbol\tau) + \delta_{\beta^j}(dz^j) \int \big(1 - r_j(\beta^j, b; \boldsymbol\beta_{-j}, \eta, \boldsymbol\tau)\big)\, q_j(b \mid \boldsymbol\beta_{-j}, \eta, \boldsymbol\tau)\, db \Big). \tag{6.8}$$
Then note that there exists $a_c > 0$ such that for any $\eta \in \Theta \times \varrho$, any $\boldsymbol\beta \in \mathbb{R}^N$ and any $b \in \mathbb{R}$, the acceptance rate $r_j(\beta^j, b; \boldsymbol\beta_{-j}, \eta, \boldsymbol\tau)$ is uniformly lower bounded by $a_c$, so that for any $1 \le j \le N$ and any non-negative function $f$,
$$\Pi_{\eta,\boldsymbol\tau,j} f(\boldsymbol\beta) \ge a_c \int_{\mathbb{R}} f(\boldsymbol\beta_{-j} + b e_j)\, q_j(b \mid \boldsymbol\beta_{-j}, \boldsymbol\tau, \eta)\, db = a_c \int_{\mathbb{R}} f\big( p_{\eta,\boldsymbol\tau,j}(\boldsymbol\beta) + z\, e_j / \|e_j\|_{\eta,\boldsymbol\tau} \big)\, g_{0,1}(z)\, dz,$$
where $g_{0,1}$ is the probability density function of the standard normal $\mathcal{N}(0,1)$. By induction, we have
$$\Pi_{\eta,\boldsymbol\tau} f(\boldsymbol\beta) \ge a_c^N \int_{\mathbb{R}^N} f\Big( p_{\eta,\boldsymbol\tau,1,N}(\boldsymbol\beta) + \sum_{j=1}^N z_j\, p_{\eta,\boldsymbol\tau,j+1,N}(e_j) / \|e_j\|_{\eta,\boldsymbol\tau} \Big) \prod_{j=1}^N g_{0,1}(z_j)\, dz_j, \tag{6.9}$$
where $p_{\eta,\boldsymbol\tau,q,r} = p_{\eta,\boldsymbol\tau,r} \circ p_{\eta,\boldsymbol\tau,r-1} \circ \cdots \circ p_{\eta,\boldsymbol\tau,q}$ for any integers $q \le r$, and $p_{\eta,\boldsymbol\tau,N+1,N} = \mathrm{Id}_{\mathbb{R}^N}$. Let $A_{\eta,\boldsymbol\tau} \in \mathcal{L}(\mathbb{R}^N)$ be the linear mapping defined on $z_1^N = (z_1, \cdots, z_N)$ by
$$A_{\eta,\boldsymbol\tau} z_1^N = \sum_{j=1}^N z_j\, p_{\eta,\boldsymbol\tau,j+1,N}(e_j) / \|e_j\|_{\eta,\boldsymbol\tau}.$$
One easily checks that for any $1 \le k \le N$, $\mathrm{Span}\{p_{\eta,\boldsymbol\tau,j+1,N}(e_j), k \le j \le N\} = \mathrm{Span}\{e_j, k \le j \le N\}$, so that $A_{\eta,\boldsymbol\tau}$ is an invertible mapping.
By a change of variable, we get
\[
\int_{\mathbb{R}^N} f\big(p_{\eta,\tau,1,N}(\beta) + A_{\eta,\tau} z_1^N\big) \prod_{j=1}^N g_{0,1}(z_j)\, dz_j = \int_{\mathbb{R}^N} f(u)\, g_{p_{\eta,\tau,1,N}(\beta),\, A_{\eta,\tau} A^t_{\eta,\tau}}(u)\, du,
\]
where $g_{\mu,\Sigma}$ stands for the probability density function of the normal law $\mathcal{N}(\mu, \Sigma)$. Since $(\eta, \tau) \mapsto A_{\eta,\tau}$ is smooth with values in the set of invertible mappings, we deduce that there exist $c_{\mathcal{K}} > 0$ and $C_{\mathcal{K}} > 0$ such that $c_{\mathcal{K}} \mathrm{Id} \le A_{\eta,\tau} A^t_{\eta,\tau} \le \mathrm{Id} / c_{\mathcal{K}}$ and
\[
g_{p_{\eta,\tau,1,N}(\beta),\, A_{\eta,\tau} A^t_{\eta,\tau}}(u) \ge C_{\mathcal{K}}\, g_{p_{\eta,\tau,1,N}(\beta),\, \mathrm{Id}/c_{\mathcal{K}}}(u)
\]
uniformly for $\eta = \hat\eta(s)$ with $s \in \mathcal{K}$ and $\tau \in \mathcal{T}$. Assuming $\beta \in E$, since $\eta \mapsto p_{\eta,\tau,1,N}$ is smooth and $E$ is compact, we have $\sup_{\beta \in E,\, \eta = \hat\eta(s),\, s \in \mathcal{K},\, \tau \in \mathcal{T}} \|p_{\eta,\tau,1,N}(\beta)\| < \infty$, so that there exist other constants $C_{\mathcal{K}} > 0$ and $c_{\mathcal{K}} > 0$ such that for any $(u, \beta) \in \mathbb{R}^N \times E$ and any $\eta = \hat\eta(s)$, $s \in \mathcal{K}$, $\tau \in \mathcal{T}$,
\[
g_{p_{\eta,\tau,1,N}(\beta),\, A_{\eta,\tau} A^t_{\eta,\tau}}(u) \ge C_{\mathcal{K}}\, g_{0,\, \mathrm{Id}/c_{\mathcal{K}}}(u). \qquad (6.10)
\]
Using (6.9) and (6.10), we deduce that for any $A$,
\[
\Pi_{\eta,\tau}(\beta, A) \ge C_{\mathcal{K}}\, a_c^N\, \nu(A),
\]
with $\nu$ the normal law $\mathcal{N}(0, \mathrm{Id}/c_{\mathcal{K}})$. This yields the existence of the small set as well as inequality (6.6).

This property also implies the $\varphi$-irreducibility of the Markov chain generated by $\Pi_{\eta,\tau}$. Moreover, the existence of a $\nu_1$-small set implies the aperiodicity of the chain (cf. [17]).

Now consider the drift condition (6.7). We set $V : \mathbb{R}^N \to [1, +\infty[$ as the function $V(\beta) = 1 + \|\beta\|^2$, where $\|\cdot\|$ denotes the Euclidean norm. Define for any $g : \mathbb{R}^N \to \mathbb{R}^{n_s}$ the norm $\|g\|_V = \sup_{\beta \in \mathbb{R}^N} \|g(\beta)\| / V(\beta)$ and the functional space $L_V = \{ g : \mathbb{R}^N \to \mathbb{R}^{n_s} \mid \|g\|_V < +\infty \}$. For any $\eta \in \Theta \times \varrho$ and any $\tau \in \mathcal{T}$, we introduce the $(\eta, \tau)$-dependent function $V_{\eta,\tau}(\beta) \triangleq 1 + \|\beta\|^2_{\eta,\tau}$.

Lemma 6.2. Let $\mathcal{K}$ be a compact subset of $\Theta \times \varrho$. For any integer $p \ge 1$, there exist $0 < \rho < 1$ and $C > 0$ such that for any $\eta \in \mathcal{K}$, any $\tau \in \mathcal{T}$ and any $\beta \in \mathbb{R}^N$ we have $\Pi_{\eta,\tau} V^p_{\eta,\tau}(\beta) \le \rho V^p_{\eta,\tau}(\beta) + C$.

Proof. The proposal distribution for $\Pi_{\eta,\tau,j}$ is given in law by $q_j(\beta_j \mid \beta_{-j}, \tau, y, \eta) \sim p_{\eta,\tau,j}(\beta) + \frac{z}{\|e_j\|_{\eta,\tau}}\, e_j$ with $z \sim \mathcal{N}(0,1)$. Then, for any $\beta \in \mathbb{R}^N$ and any measurable set $A \in \mathcal{B}(\mathbb{R}^N)$, there exists $a_{\eta,\tau,j}(\beta)$, uniformly bounded from below by $a_c > 0$, such that
\[
\Pi_{\eta,\tau,j}(\beta, A) = (1 - a_{\eta,\tau,j}(\beta))\, \mathbf{1}_A(\beta) + a_{\eta,\tau,j}(\beta) \int_{\mathbb{R}} \mathbf{1}_A\Big( p_{\eta,\tau,j}(\beta) + \frac{z}{\|e_j\|_{\eta,\tau}}\, e_j \Big)\, g_{0,1}(dz).
\]
Since $\langle p_{\eta,\tau,j}(\beta), e_j \rangle_{\eta,\tau} = 0$, we get $V_{\eta,\tau}\big( p_{\eta,\tau,j}(\beta) + \frac{z}{\|e_j\|_{\eta,\tau}}\, e_j \big) = V_{\eta,\tau}(p_{\eta,\tau,j}(\beta)) + z^2$. We deduce that there exists $C$ such that for any $\beta \in \mathbb{R}^N$:
\[
\Pi_{\eta,\tau,j} V^p_{\eta,\tau}(\beta) = (1 - a_{\eta,\tau,j}(\beta))\, V^p_{\eta,\tau}(\beta) + a_{\eta,\tau,j}(\beta) \int_{\mathbb{R}} \big( V_{\eta,\tau}(p_{\eta,\tau,j}(\beta)) + z^2 \big)^p\, g_{0,1}(dz)
\]
\[
\le (1 - a_{\eta,\tau,j}(\beta))\, V^p_{\eta,\tau}(\beta) + a_{\eta,\tau,j}(\beta) \Big[ V^p_{\eta,\tau}(p_{\eta,\tau,j}(\beta)) + V^{p-1}_{\eta,\tau}(p_{\eta,\tau,j}(\beta)) \int_{\mathbb{R}} (1 + z^2)^p\, g_{0,1}(dz) \Big]
\]
\[
\le (1 - a_{\eta,\tau,j}(\beta))\, V^p_{\eta,\tau}(\beta) + a_{\eta,\tau,j}(\beta)\, V^p_{\eta,\tau}(p_{\eta,\tau,j}(\beta)) + C\, V^{p-1}_{\eta,\tau}(p_{\eta,\tau,j}(\beta)).
\]
In the last inequality we used the fact that a Gaussian variable has bounded moments of any order. Since $a_{\eta,\tau,j}(\beta) \ge a_c$ and $\|p_{\eta,\tau,j}(\beta)\|_{\eta,\tau} \le \|\beta\|_{\eta,\tau}$ ($p_{\eta,\tau,j}$ is an orthogonal projection for the dot product $\langle \cdot, \cdot \rangle_{\eta,\tau}$), we get that for any $\varepsilon > 0$ there exists $C_{\mathcal{K},\varepsilon}$ such that for any $\beta \in \mathbb{R}^N$, $\eta \in \mathcal{K}$ and $\tau \in \mathcal{T}$,
\[
\Pi_{\eta,\tau,j} V^p_{\eta,\tau}(\beta) \le (1 - a_c)\, V^p_{\eta,\tau}(\beta) + (a_c + \varepsilon)\, V^p_{\eta,\tau}(p_{\eta,\tau,j}(\beta)) + C_{\mathcal{K},\varepsilon}.
\]
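The identity $V_{\eta,\tau}(p_{\eta,\tau,j}(\beta) + z e_j / \|e_j\|_{\eta,\tau}) = V_{\eta,\tau}(p_{\eta,\tau,j}(\beta)) + z^2$ used above is simply Pythagoras' theorem in the $\langle \cdot, \cdot \rangle_{\eta,\tau}$ metric. A throwaway numerical check (the precision matrix and all names are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
A = rng.standard_normal((N, N))
G_inv = A @ A.T + N * np.eye(N)          # random SPD stand-in for Gamma_g^{-1}
ip = lambda u, v: u @ G_inv @ v          # the dot product <.,.>_{eta,tau}

beta, j, z = rng.standard_normal(N), 2, rng.standard_normal()
e = np.zeros(N); e[j] = 1.0
p = beta - (ip(beta, e) / ip(e, e)) * e  # orthogonal projection p_{eta,tau,j}
moved = p + z * e / np.sqrt(ip(e, e))    # the proposed point
# V(moved) and V(p) + z^2 agree up to floating-point rounding
print(1.0 + ip(moved, moved), (1.0 + ip(p, p)) + z ** 2)
```

The proof of Lemma 6.2 now proceeds by composing the $N$ coordinate bounds.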
By induction, we get
\[
\Pi_{\eta,\tau} V^p_{\eta,\tau}(\beta) \le \sum_{u \in \{0,1\}^N} \prod_{j=1}^N (1 - a_c)^{1 - u_j} (a_c + \varepsilon)^{u_j}\, V^p_{\eta,\tau}(p_{\eta,\tau,u}(\beta)) + \frac{C_{\mathcal{K},\varepsilon}}{\varepsilon} \big( (1 + \varepsilon)^{N+1} - 1 \big),
\]
where $p_{\eta,\tau,u} = ((1 - u_N) \mathrm{Id} + u_N p_{\eta,\tau,N}) \circ \cdots \circ ((1 - u_1) \mathrm{Id} + u_1 p_{\eta,\tau,1})$. Let $p_{\eta,\tau} = p_{\eta,\tau,N} \circ \cdots \circ p_{\eta,\tau,1}$ and note that $p_{\eta,\tau,u}$ is contracting, so that
\[
\Pi_{\eta,\tau} V^p_{\eta,\tau}(\beta) \le b_{c,\varepsilon}\, V^p_{\eta,\tau}(\beta) + (a_c + \varepsilon)^N\, V^p_{\eta,\tau}(p_{\eta,\tau}(\beta)) + \frac{C_{\mathcal{K},\varepsilon}}{\varepsilon} (1 + \varepsilon)^{N+1}
\]
for $b_{c,\varepsilon} = \sum_{u \in \{0,1\}^N,\, u \ne \mathbf{1}} \prod_{j=1}^N (1 - a_c)^{1 - u_j} (a_c + \varepsilon)^{u_j}$. To end the proof, we need to check that $p_{\eta,\tau}$ is strictly contracting, uniformly on $\mathcal{K}$. Indeed, $\|p_{\eta,\tau}(\beta)\|_{\eta,\tau} = \|\beta\|_{\eta,\tau}$ implies that $p_{\eta,\tau,j}(\beta) = \beta$ for any $1 \le j \le N$, so that $\langle \beta, e_j \rangle_{\eta,\tau} = 0$ and $\beta = 0$ since $(e_j)_{1 \le j \le N}$ is a basis. Using the continuity of the norm of $p_{\eta,\tau}$ and the compactness of $\mathcal{K}$, we deduce that there exists $0 < \rho_{\mathcal{K}} < 1$ such that $\|p_{\eta,\tau}(\beta)\|_{\eta,\tau} \le \rho_{\mathcal{K}} \|\beta\|_{\eta,\tau}$ for any $\beta \in \mathbb{R}^N$, $\eta \in \mathcal{K}$ and $\tau \in \mathcal{T}$. Changing $\rho_{\mathcal{K}}$ for $1 > \rho'_{\mathcal{K}} > \rho_{\mathcal{K}}$, we get $(1 + \rho^2_{\mathcal{K}} \|\beta\|^2_{\eta,\tau})^p \le \rho'^{2p}_{\mathcal{K}} (1 + \|\beta\|^2_{\eta,\tau})^p + C_{\mathcal{K}}$ for some uniform constant $C_{\mathcal{K}}$, so that
\[
\Pi_{\eta,\tau} V^p_{\eta,\tau}(\beta) \le b_{c,\varepsilon}\, V^p_{\eta,\tau}(\beta) + \rho'^{2p}_{\mathcal{K}} (a_c + \varepsilon)^N\, V^p_{\eta,\tau}(\beta) + C_{\mathcal{K},\varepsilon}.
\]
Since $\inf_{\varepsilon > 0} \big\{ b_{c,\varepsilon} + \rho'^{2p}_{\mathcal{K}} (a_c + \varepsilon)^N \big\} < 1$, the result is straightforward.

Lemma 6.3. For any compact set $\mathcal{K} \subset \Theta \times \varrho$ and any integer $p \ge 0$, there exist $0 < \rho < 1$, $C > 0$ and a positive integer $m_0$ such that for all $m \ge m_0$, $\eta \in \mathcal{K}$, $\tau \in \mathcal{T}$ and $\beta \in \mathbb{R}^N$,
\[
\Pi^m_{\eta,\tau} V^p(\beta) \le \rho V^p(\beta) + C.
\]
Proof. There exist $0 < c_1 \le c_2$ such that $c_1 V(\beta) \le V_{\eta,\tau}(\beta) \le c_2 V(\beta)$ for any $(\beta, \eta, \tau) \in \mathbb{R}^N \times \mathcal{K} \times \mathcal{T}$. Then, using the previous lemma, we have $\Pi^m_{\eta,\tau} V^p(\beta) \le c_1^{-p}\, \Pi^m_{\eta,\tau} V^p_{\eta,\tau}(\beta) \le c_1^{-p} \big( \rho^m V^p_{\eta,\tau}(\beta) + C/(1 - \rho) \big) \le (c_2/c_1)^p \big( \rho^m V^p(\beta) + C/(1 - \rho) \big)$. Choosing $m$ large enough for $(c_2/c_1)^p \rho^m < 1$ gives the result.

This finishes the proof of (6.7) and, at the same time, of (MDRI). Thanks to this property we can use the following proposition (cf. [17] and [5, Proposition B1]) and lemma, applied to every sequence $(\xi^{(l)}_{t,i})_l$ with stationary distribution $q(\cdot \mid y_i, t, \eta)$, for all $1 \le t \le \tau_m$ and all $1 \le i \le n$.

Proposition 1. Suppose that $\Pi$ is irreducible and aperiodic, that $\Pi^m(\beta_0, \cdot) \ge \mathbf{1}_{\mathcal{C}}(\beta_0)\, \delta\, \nu(\cdot)$ for a set $\mathcal{C} \in \mathcal{B}(\mathbb{R}^N)$, some integer $m$ and $\delta > 0$, and that there is a drift condition to $\mathcal{C}$ in the sense that, for some $0 < \kappa < 1$, $B > 0$ and a function $V : \mathbb{R}^N \to [1, +\infty[$,
\[
\Pi V(\beta_0) \le \kappa V(\beta_0) \quad \forall \beta_0 \notin \mathcal{C} \qquad \text{and} \qquad \sup_{\beta_0 \in \mathcal{C}} \big( V(\beta_0) + \Pi V(\beta_0) \big) \le B.
\]
Then there exist constants $K$ and $0 < \rho < 1$, depending only upon $m$, $\delta$, $\kappa$ and $B$, such that for all $\beta_0 \in \mathbb{R}^N$ and all $g \in L_V$,
\[
\| \Pi^n g(\beta_0) - \pi(g) \|_V \le K \rho^n \|g\|_V.
\]
Lemma 6.4. Assume that there exist an integer $m$, constants $0 < \kappa < 1$ and $\varsigma > 0$, and a set $\mathcal{C}$ such that $\Pi^m V(\beta_0) \le \kappa V(\beta_0)$ for all $\beta_0 \notin \mathcal{C}$ and $\Pi V(\beta_0) \le \varsigma V(\beta_0)$ for all $\beta_0 \in \mathbb{R}^N$, for some function $V : \mathbb{R}^N \to [1, +\infty[$. Then there exist a function $\tilde V$ and constants $0 < \rho < 1$, $c > 0$ and $C > 0$, depending only upon $m$, $\kappa$ and $\varsigma$, such that $\Pi \tilde V(\beta_0) \le \rho \tilde V(\beta_0)$ for all $\beta_0 \notin \mathcal{C}$ and $c V \le \tilde V \le C V$.

Proof. Define
\[
\tilde V = \sum_{j=1}^m \kappa^{1 - j/m}\, \Pi^{j-1} V.
\]
For $\beta_0 \notin \mathcal{C}$, we have
\[
\Pi \tilde V(\beta_0) \le \sum_{j=1}^{m-1} \kappa^{1 - j/m}\, \Pi^j V(\beta_0) + \kappa V(\beta_0) \le \kappa^{1/m}\, \tilde V(\beta_0).
\]
Therefore we obtain
\[
\kappa^{1 - 1/m}\, V \le \tilde V \le \Big( \sum_{j=1}^m \kappa^{1 - j/m}\, \varsigma^{j-1} \Big) V,
\]
which ends the proof of Lemma 6.4.
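For later use, Proposition 1 applied to $g = S(\cdot, \tau) \in L_V$ (once the drift of Lemma 6.3 has been upgraded via Lemma 6.4) unfolds the $V$-norm bound into the pointwise estimate behind the treatment of (6.4):
\[
\Big| \Pi^{J_k}_{\eta,\tau} S(\beta_0, \tau) - \int_{\mathbb{R}^N} S(\beta, \tau)\, q(\beta \mid \tau, y, \eta)\, d\beta \Big| \le K\, \rho^{J_k}\, \|S(\cdot, \tau)\|_V\, V(\beta_0),
\]
where $K$ and $\rho$ depend only on the constants $(m, \delta, \kappa, B)$ of (MDRI) and are therefore uniform over $\eta = \hat\eta(s)$, $s \in \mathcal{K}$, and $\tau \in \mathcal{T}$.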
Applying Proposition 1 and Lemma 6.4 to the drift conditions of Lemmas 6.2 and 6.3, we thus get that each Gibbs sampler kernel $\Pi_{\eta,\tau}$ is geometrically ergodic.

Let us now go back to the convergence of the first part (6.4) of the residual term towards 0. We use the term $\mathbf{1}_{W(s_{k-1}) \le M}$ to show that the parameters $\eta_{k-1}$ are constrained to move within a compact set of $\Theta \times \varrho$. We first show that the observed log-likelihood $l$ tends to minus infinity as the parameters tend to the boundary of $\Theta \times \varrho$. Equation (2.1) implies that for any $\theta \in \Theta$ we have
\[
q(y_i \mid \beta_i, \tau_i, \alpha, \sigma^2)\, q(\beta_i \mid \Gamma_{g,\tau_i}) \le (2\pi\sigma^2)^{-|\Lambda|/2} (2\pi)^{-k_g}\, |\Gamma_{g,\tau_i}|^{-1/2} \exp\Big( -\frac{1}{2}\, \beta_i^t\, \Gamma^{-1}_{g,\tau_i}\, \beta_i \Big),
\]
so that, denoting by $C$ a constant,
\[
\log q(y, \eta) \le \sum_{i=1}^n \Big[ -\frac{a_g}{2} \langle \Gamma^{-1}_{g,\tau_i}, \Sigma_g \rangle_F + \frac{1 + a_g}{2} \log |\Gamma^{-1}_{g,\tau_i}| - \frac{a_p \sigma_0^2}{2 \sigma^2_{\tau_i}} - \frac{|\Lambda| + a_p}{2} \log(\sigma^2_{\tau_i}) - \frac{1}{2} (\alpha_{\tau_i} - \mu_p)^t \Sigma_p^{-1} (\alpha_{\tau_i} - \mu_p) + a_\rho \log \rho_{\tau_i} \Big] + C.
\]
It was shown in [1] that $\lim_{\|\Gamma\| + \|\Gamma^{-1}\| \to \infty} -\frac{a_g}{2} \langle \Gamma^{-1}, \Sigma_g \rangle_F + \frac{1 + a_g}{2} \log |\Gamma^{-1}| = -\infty$ and $\lim_{\|\alpha\| \to \infty} -\frac{1}{2} (\alpha - \mu_p)^t \Sigma_p^{-1} (\alpha - \mu_p) = -\infty$. Moreover, $\lim_{\rho \to 0} \log \rho = -\infty$, so we get $\lim_{\eta \to \partial(\Theta \times \varrho)} \log q(y, \eta) = -\infty$. This ensures that for all $M > 0$ there exists $\ell > 0$ such that $\|\alpha_t\| \ge \ell$ or $\|\Gamma_t\| + \|\Gamma_t^{-1}\| \ge \ell$ or $\rho_t \le \frac{1}{\ell}$ implies $-l(\eta) \ge M$. So $W(s_{k-1}) \le M$ implies that for all $1 \le t \le \tau_m$ we have $\|\alpha_t\| \le \ell$, $\|\Gamma_t\| + \|\Gamma_t^{-1}\| \le \ell$ and $\frac{1}{\ell} \le \rho_t \le 1 - \frac{1}{\ell}$, because $\sum_{t=1}^{\tau_m} \rho_t = 1$. Let us denote
\[
\mathcal{V}_\ell = \Theta_\ell^{\tau_m} \times \Big\{ (\rho_t)_{1 \le t \le \tau_m} \in \big[\tfrac{1}{\ell},\, 1 - \tfrac{1}{\ell}\big]^{\tau_m} \;\Big|\; \sum_{t=1}^{\tau_m} \rho_t = 1 \Big\},
\]
where $\Theta_\ell = \big\{ \theta = (\alpha, \Gamma_g) \mid \alpha \in \mathbb{R}^{k_p},\ \Gamma_g \in \mathrm{Sym}^{+}_{2k_g,*}(\mathbb{R}),\ \|\alpha\| \le \ell,\ \tfrac{1}{\ell} \le \|\Gamma_g\| \le \ell \big\}$. So there exists a compact set $\mathcal{V}_\ell$ of $\Theta \times \varrho$ such that $W(s_{k-1}) \le M$ implies $\hat\eta(s_{k-1}) \in \mathcal{V}_\ell$, and the first term (6.4) can be bounded as follows:
\[
\sum_\tau \Big| \int_{\mathbb{R}^N} S(\beta, \tau) \big[ \Pi^{J_k}_{\eta_{k-1},\tau}(\beta_0, \beta) - q(\beta \mid \tau, y, \eta_{k-1}) \big]\, d\beta \Big| \mathbf{1}_{W(s_{k-1}) \le M} \le \sum_\tau \sup_{\eta \in \mathcal{V}_\ell} \Big| \int_{\mathbb{R}^N} S(\beta, \tau) \big[ \Pi^{J_k}_{\eta,\tau}(\beta_0, \beta) - q(\beta \mid \tau, y, \eta) \big]\, d\beta \Big|.
\]
Since for each $\tau$ the function $\beta \mapsto S(\beta, \tau)$ belongs to $L_V$, since we have proved that each transition kernel $\Pi_{\eta,\tau}$ is geometrically ergodic and since the set $\mathcal{V}_\ell$ is compact, we deduce that the first term (6.4) converges to zero as $J_k$ tends to infinity.

We now consider the second term (6.5). We first need to prove that $M_{\eta_k} \mathbf{1}_{W(s_{k-1}) \le M}$ is uniformly bounded, that is, that the integrals of the sufficient statistics are uniformly bounded on $\{W(s_{k-1}) \le M\}$. We only need to focus on the sufficient statistic which is not bounded itself: let $(j, m) \in \{1, \ldots, 2k_g\}^2$; then
\[
\int |\beta_j \beta_m|\, q(\beta \mid \tau, y, \eta_{k-1})\, d\beta\, \mathbf{1}_{\eta_{k-1} \in \mathcal{V}_\ell} \le \int |\beta_j \beta_m|\, \frac{q(\beta, \tau, y, \eta_{k-1})}{q(\tau, y, \eta_{k-1})}\, d\beta\, \mathbf{1}_{\eta_{k-1} \in \mathcal{V}_\ell}
\le \frac{C(\mathcal{V}_\ell)}{q(\tau, y, \eta_{k-1})} \int |\beta_j \beta_m| \exp\Big( -\frac{1}{2}\, \beta^t\, \hat\Gamma^{-1}_{g,\tau,k-1}\, \beta \Big)\, d\beta
\le C(\mathcal{V}_\ell) \int Q(\beta, \hat\Gamma_{g,\tau,k-1}) \exp\Big( -\frac{1}{2} \|\beta\|^2 \Big)\, d\beta < \infty,
\]
where $C(\mathcal{V}_\ell)$ is a constant depending only on the set $\mathcal{V}_\ell$ (it absorbs, in particular, the positive lower bound of $\eta \mapsto q(\tau, y, \eta)$ on the compact set $\mathcal{V}_\ell$), $\hat\Gamma_{g,\tau}$ is the block-diagonal matrix with all the $\Gamma_{g,\tau_i}$ given by the label vector $\tau$, we have changed the variable in the last inequality, and $Q$ is a quadratic form in $\beta$ whose coefficients are continuous functions of the entries of $\Gamma_g$. So we obtain that for all $M > 0$ there exists $\ell > 0$ such that for all integers $k$ we have $M_{\eta_k} \mathbf{1}_{W(s_{k-1}) \le M} \le C(\mathcal{V}_\ell)$.
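As a sanity check on the boundary argument above, a one-dimensional caricature of the geometric penalty already exhibits the divergence: with scalar $\Gamma = \gamma$, the term $-\frac{a_g}{2} \langle \Gamma^{-1}, \Sigma_g \rangle_F + \frac{1 + a_g}{2} \log |\Gamma^{-1}|$ becomes $-\frac{a_g \sigma_g}{2\gamma} - \frac{1 + a_g}{2} \log \gamma$, which tends to $-\infty$ at both ends of $]0, \infty[$. A few illustrative values (hyper-parameters set to 1 for the toy; not the paper's settings):

```python
import numpy as np

a_g, sigma_g = 1.0, 1.0   # illustrative hyper-parameter values
pen = lambda g: -0.5 * a_g * sigma_g / g - 0.5 * (1.0 + a_g) * np.log(g)

for g in [1e-8, 1e-4, 1.0, 1e4, 1e8]:
    print(f"gamma = {g:8.0e}   penalised geometric term = {pen(g):16.2f}")
# diverges to -infinity as gamma -> 0 (like -1/gamma) and as gamma -> infinity
# (like -log gamma), so the level set {pen >= -M} is compact in ]0, infinity[
```

The same mechanism, applied to each of $(\alpha_t, \Gamma_t, \rho_t)$, is what confines the level sets of $W$ to the compact sets $\mathcal{V}_\ell$.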
We now prove the convergence to 0 of the second term of the product involved in (6.5). Let us write $R_{\tau,y,k}$ for the term $|R_{J_k}(\tau \mid y, \eta_{k-1}) - q(\tau \mid y, \eta_{k-1})|$. Thus we have
\[
R_{\tau,y,k} = \Big| \prod_{i=1}^n \int p_{J_k}(\tau_i \mid \xi_i, y_i, \eta_{k-1})\, Q(\xi_i)\, d\xi_i - \prod_{i=1}^n q(\tau_i \mid y_i, \eta_{k-1}) \Big|
\le \sum_{i=1}^n \Big| \int p_{J_k}(\tau_i \mid \xi_i, y_i, \eta_{k-1})\, Q(\xi_i)\, d\xi_i - q(\tau_i \mid y_i, \eta_{k-1}) \Big|
\]
\[
\le \sum_{i=1}^n \int \big| p_{J_k}(\tau_i \mid \xi_i, y_i, \eta_{k-1}) - q(\tau_i \mid y_i, \eta_{k-1}) \big|\, Q(\xi_i)\, d\xi_i
\le \sum_{i=1}^n \int \Big| \frac{S_{J_k}(\tau_i, y_i \mid \xi_{\tau_i,i}, \eta_{k-1})}{\sum_s S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1})} - \frac{q(\tau_i, y_i \mid \eta_{k-1})}{q(y_i \mid \eta_{k-1})} \Big|\, Q(\xi_i)\, d\xi_i,
\]
where we denote by $S_J(t, y_i \mid \xi_{t,i}, \eta)$ the quantity $\Big[ \frac{1}{J} \sum_{l=1}^J \frac{f(\xi^{(l)}_{t,i})}{q(y_i, \xi^{(l)}_{t,i}, t \mid \eta)} \Big]^{-1}$. We write each term of this sum as follows:
\[
\frac{S_{J_k}(\tau_i, y_i \mid \xi_{\tau_i,i}, \eta_{k-1})}{\sum_{s=1}^{\tau_m} S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1})} - \frac{q(\tau_i, y_i \mid \eta_{k-1})}{q(y_i \mid \eta_{k-1})}
= \frac{S_{J_k}(\tau_i, y_i \mid \xi_{\tau_i,i}, \eta_{k-1}) \Big( q(y_i \mid \eta_{k-1}) - \sum_{s=1}^{\tau_m} S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1}) \Big)}{q(y_i \mid \eta_{k-1}) \sum_{s=1}^{\tau_m} S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1})}
+ \frac{\Big( S_{J_k}(\tau_i, y_i \mid \xi_{\tau_i,i}, \eta_{k-1}) - q(\tau_i, y_i \mid \eta_{k-1}) \Big) \sum_{s=1}^{\tau_m} S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1})}{q(y_i \mid \eta_{k-1}) \sum_{s=1}^{\tau_m} S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1})}.
\]
Denoting by $\mathcal{T}_i$ the family of $\tau_m + 1$ integers $\{1, \ldots, \tau_m\} \cup \{\tau_i\}$ (the label $\tau_i$ being counted twice), we finally obtain
\[
R_{\tau,y,k} \le \sum_{i=1}^n \frac{1}{q(y_i \mid \eta_{k-1})} \sum_{s \in \mathcal{T}_i} \int \big| S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1}) - q(s, y_i \mid \eta_{k-1}) \big|\, Q(\xi_i)\, d\xi_i.
\]
Defining the event $A_{k,i,t} = \{ |S_{J_k}(t, y_i \mid \xi_{t,i}, \eta_{k-1}) - q(t, y_i \mid \eta_{k-1})| > \zeta_k \}$ for some positive sequence $(\zeta_k)_k$, we get
\[
R_{\tau,y,k} \le \sum_{i=1}^n \frac{1}{q(y_i \mid \eta_{k-1})} \sum_{s \in \mathcal{T}_i} \int_{A_{k,i,s}} \big| S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1}) - q(s, y_i \mid \eta_{k-1}) \big|\, Q(\xi_i)\, d\xi_i
+ \sum_{i=1}^n \frac{1}{q(y_i \mid \eta_{k-1})} \sum_{s \in \mathcal{T}_i} \int_{A^c_{k,i,s}} \big| S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1}) - q(s, y_i \mid \eta_{k-1}) \big|\, Q(\xi_i)\, d\xi_i.
\]
So we deduce that
\[
R_{\tau,y,k} \le \sum_{i=1}^n \frac{1}{q(y_i \mid \eta_{k-1})} \sum_{s \in \mathcal{T}_i} \Big( \sup_\xi S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1}) + q(s, y_i \mid \eta_{k-1}) \Big)\, \mathbb{P}(A_{k,i,s}) + \Big( \sum_{i=1}^n \frac{1}{q(y_i \mid \eta_{k-1})} \Big) (\tau_m + 1)\, \zeta_k
\]
\[
\le \sum_{i=1}^n \Big( \frac{\sup_{\xi,s} S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1})}{q(y_i \mid \eta_{k-1})} + 1 \Big) \sum_{s \in \mathcal{T}_i} \mathbb{P}(A_{k,i,s}) + \Big( \sum_{i=1}^n \frac{1}{q(y_i \mid \eta_{k-1})} \Big) (\tau_m + 1)\, \zeta_k.
\]
Assuming $\zeta_k < \min_{i,t} q(t, y_i \mid \eta_{k-1})$, we obtain
\[
\mathbb{P}(A^c_{k,i,t}) = \mathbb{P}\big( |S_{J_k}(t, y_i \mid \xi_{t,i}, \eta_{k-1}) - q(t, y_i \mid \eta_{k-1})| \le \zeta_k \big)
\ge \mathbb{P}\Big( \Big| \frac{1}{S_{J_k}(t, y_i \mid \xi_{t,i}, \eta_{k-1})} - \frac{1}{q(t, y_i \mid \eta_{k-1})} \Big| \le \frac{\zeta_k}{q(t, y_i \mid \eta_{k-1}) \big( q(t, y_i \mid \eta_{k-1}) + \zeta_k \big)} \Big)
\]
\[
\ge \mathbb{P}\Big( \Big| \frac{1}{S_{J_k}(t, y_i \mid \xi_{t,i}, \eta_{k-1})} - \frac{1}{q(t, y_i \mid \eta_{k-1})} \Big| \le \frac{\zeta_k}{2\, q(t, y_i \mid \eta_{k-1})^2} \Big).
\]
Using the first inequality of Theorem 2 of [9], we get
\[
\mathbb{P}(A_{k,i,t}) \le c_1 \exp\Big( -c_2\, \frac{J_k\, \zeta_k^2}{q(t, y_i \mid \eta_{k-1})^4} \Big),
\]
where $c_1$ and $c_2$ are independent of $k$, since $(\eta_k)$ moves only within the compact set $\mathcal{V}_\ell$ thanks to the indicator $\mathbf{1}_{W(s_{k-1}) \le M}$. This yields
\[
R_{\tau,y,k} \le c_1 \sum_{i=1}^n \Big( \frac{\sup_{\xi,s} S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1})}{q(y_i \mid \eta_{k-1})} + 1 \Big) (\tau_m + 1) \exp\Big( -c_2\, \frac{J_k\, \zeta_k^2}{\max_i q(y_i \mid \eta_{k-1})^4} \Big) + \sup_{\eta_{k-1} \in \mathcal{V}_\ell} \Big( \sum_{i=1}^n \frac{1}{q(y_i \mid \eta_{k-1})} \Big) (\tau_m + 1)\, \zeta_k.
\]
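The quantity $S_J$ above is a reciprocal importance-sampling estimate of the joint weight $q(t, y_i \mid \eta)$: when $f$ is the prior density of $\xi$ and the $\xi^{(l)}$ are (approximately) posterior draws, $\frac{1}{J} \sum_l f(\xi^{(l)}) / q(y_i, \xi^{(l)}, t \mid \eta)$ converges to $1 / q(t, y_i \mid \eta)$. A toy Python sketch in one dimension (exact posterior draws replace the Gibbs chain for simplicity; the model, numbers and names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
gauss = lambda x, m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

y = 0.7                                          # one toy observation
means, sd, rho = [-1.0, 1.0], 2.0, [0.5, 0.5]    # xi ~ N(mean_t, 1), y|xi ~ N(xi, sd^2)

def S_J(t, J):
    """Reciprocal importance-sampling estimate of q(t, y), with f = prior of xi."""
    post_var = 1.0 / (1.0 + 1.0 / sd ** 2)
    post_mean = post_var * (y / sd ** 2 + means[t])
    xi = rng.normal(post_mean, np.sqrt(post_var), size=J)   # posterior draws
    # f(xi) / q(y, xi, t) reduces to 1 / (q(y|xi, t) * rho_t)
    return 1.0 / np.mean(1.0 / (gauss(y, xi, sd) * rho[t]))

for J in (10, 100, 10_000):
    s = np.array([S_J(t, J) for t in range(2)])
    print(J, s / s.sum())                # p_J(t | y): stabilises as J grows
```

The concentration bound above quantifies exactly this stabilisation in $J_k$.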
We have to prove that the Monte Carlo sum involved in $S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1})$ does not vanish anywhere, so that $\sup_{\xi,s} S_{J_k}(s, y_i \mid \xi_{s,i}, \eta_{k-1})$ is finite. For this purpose, we can choose a particular probability density function $f$. Indeed, if we set $f$ to be the prior density function of the simulated deformation fields $\xi$, we have for all $\eta \in \mathcal{V}_\ell$:
\[
\frac{1}{J} \sum_{l=1}^J \frac{f(\xi^{(l)}_{t,i})}{q(y_i, \xi^{(l)}_{t,i}, t \mid \eta)} = \frac{1}{J} \sum_{l=1}^J \frac{1}{q(y_i \mid \xi^{(l)}_{t,i}, t, \eta)\, q(t \mid \eta)}
\ge \frac{1}{J} \sum_{l=1}^J \Big[ (2\pi\sigma_t^2)^{-|\Lambda|/2} \exp\Big( -\frac{1}{2\sigma_t^2} \big\| y_i - K^{\xi^{(l)}}_p \alpha_t \big\|^2 \Big) \Big]^{-1} \ge (2\pi\sigma^2)^{|\Lambda|/2},
\]
where $\sigma^2$ is the lower bound of the variances $(\sigma_t^2)$.

We choose the sequence $(\zeta_k)_k$, depending on $(J_k)_k$, such that $\lim_{k \to \infty} \zeta_k = 0$ and $\lim_{k \to \infty} J_k \zeta_k^2 = +\infty$; one can take, for example, $\zeta_k = J_k^{-1/3}$ for all $k \ge 1$.

We now prove the convergence of the sequence of excitation terms. For any $M > 0$ we define $M_n = \sum_{k=1}^n \Delta_k e_k \mathbf{1}_{W(s_{k-1}) \le M}$ and let $\mathcal{F} = (\mathcal{F}_k)_{k \ge 1}$ be the filtration, where $\mathcal{F}_k$ is the $\sigma$-algebra generated by the random variables $(S_0, \beta_1, \ldots, \beta_k, \tau_1, \ldots, \tau_k)$. We have $M_n = \sum_{k=1}^n \Delta_k \big( S(\beta_k, \tau_k) - \mathbb{E}[S(\beta_k, \tau_k) \mid \mathcal{F}_{k-1}] \big) \mathbf{1}_{W(s_{k-1}) \le M}$, which shows that $(M_n)_n$ is an $\mathcal{F}$-martingale. In addition, we have
\[
\sum_{k=1}^\infty \mathbb{E}\big[ \|M_k - M_{k-1}\|^2 \mid \mathcal{F}_{k-1} \big] = \sum_{k=1}^\infty \mathbb{E}\big[ \Delta_k^2 \|e_k\|^2 \mathbf{1}_{W(s_{k-1}) \le M} \mid \mathcal{F}_{k-1} \big] \le \sum_{k=1}^\infty \Delta_k^2\, \mathbb{E}\big[ \|e_k\|^2 \mid \mathcal{F}_{k-1} \big]
\]
\[
\le \sum_{k=1}^\infty \Delta_k^2\, \mathbb{E}\big[ \big\| S(\beta_k, \tau_k) - \mathbb{E}[S(\beta_k, \tau_k) \mid \mathcal{F}_{k-1}] \big\|^2 \mid \mathcal{F}_{k-1} \big] \le \sum_{k=1}^\infty \Delta_k^2\, \mathbb{E}\big[ \|S(\beta_k, \tau_k)\|^2 \mid \mathcal{F}_{k-1} \big].
\]
We now evaluate this last term:
\[
\mathbb{E}\big[ \|S(\beta_k, \tau_k)\|^2 \mid \mathcal{F}_{k-1} \big] = \sum_\tau \int_{\mathbb{R}^N} \int \|S(\beta, \tau)\|^2\, \Pi^{J_k}_{\eta_{k-1},\tau}(\beta_0, \beta) \prod_{i=1}^n p_{J_k,\eta_{k-1}}(\tau_i, \xi_{\tau_i,i}, y_i)\, Q(\xi_{\tau_i,i})\, d\xi_{\tau_i,i}\, d\beta
\le \Big[ \sum_\tau \int_{\mathbb{R}^N} \|S(\beta, \tau)\|^2\, \Pi^{J_k}_{\eta_{k-1},\tau}(\beta_0, \beta)\, d\beta \Big] \Big[ \int \Pi^{J_k}_{\eta_{k-1},\tau}(\xi_0, \xi)\, d\xi \Big].
\]
The last factor equals one, and again we only need to focus on the sufficient statistic which is not bounded itself, namely $S_{3,t}(\beta, \tau)$ for $1 \le t \le \tau_m$. Using the fact that the function $V$ dominates this sufficient statistic, we obtain
\[
\mathbb{E}\big[ \|S_{3,t}(\beta_k, \tau_k)\|^2 \mid \mathcal{F}_{k-1} \big] \le \sum_\tau \int_{\mathbb{R}^N} \|S_{3,t}(\beta, \tau)\|^2\, \Pi^{J_k}_{\eta_{k-1},\tau}(\beta_0, \beta)\, d\beta \le C \sum_\tau \int_{\mathbb{R}^N} V(\beta)^2\, \Pi^{J_k}_{\eta_{k-1},\tau}(\beta_0, \beta)\, d\beta \le C \sum_\tau \Pi^{J_k}_{\eta_{k-1},\tau} V(\beta_0)^2.
\]
Applying Lemma 6.3 with $p = 2$, we get
\[
\mathbb{E}\big[ \|S(\beta_k, \tau_k)\|^2 \mid \mathcal{F}_{k-1} \big] \le C \sum_\tau \big( \rho V(\beta_0)^2 + C \big) \le C\, \tau_m^n \big( \rho V(\beta_0)^2 + C \big).
\]
Finally, $\sum_{k=1}^\infty \mathbb{E}\big[ \|M_k - M_{k-1}\|^2 \mid \mathcal{F}_{k-1} \big] \le C \sum_{k=1}^\infty \Delta_k^2$, which ensures that this series converges. This implies that $(M_k)_{k \in \mathbb{N}}$ is a martingale bounded in $L^2$, so that $\lim_{k \to \infty} M_k$ exists (see [13]). This proves the first part of (STAB2). To conclude the proof, we apply Theorem 4.1 and get $\lim_{k \to \infty} d(s_k, \mathcal{L}') = 0$.

7. Conclusion and discussion. We consider the setting of Bayesian non-rigid deformable model building in the context of [1] and the associated MAP estimator. We approximate this estimator of the generative model parameters by a stochastic algorithm derived from the EM algorithm, and we prove its theoretical convergence toward a critical point of the observed likelihood.
This is, to the best of our knowledge, the first convergent estimation algorithm for the templates and the geometric variability in the framework of mixtures of deformable template models. The algorithm is based on a stochastic approximation of the EM algorithm using an MCMC approximation of the posterior distribution and truncation on random boundaries. We present experiments on the US Postal database as well as on some 2D medical data. These show that the stochastic approach can be easily implemented with the algorithm detailed here and is robust to noisy situations, giving better results than the previous deterministic schemes.

Many interesting questions remain open. The first goal of this model and algorithm is the estimation of atlases in a given population, together with the admissible deformations of these atlases that explain the variability within the population. However, as soon as the parameters are estimated, the model can also be used to build a classifier. Given a new image, one can compute the most likely component this image belongs to. This computation requires evaluating the integral of the complete likelihood with respect to the posterior distribution, as in the estimation process. A first proposal to overcome this difficulty was given in [1], approximating the posterior distribution by a Dirac mass at its mode. This gave very interesting results, which are presented in that paper. However, in the case of noisy images the same problem occurs and leads to poor classification rates. Another approach has been proposed in [3] using the same methods as in this paper, namely Markov chain Monte Carlo methods. The results are impressive and the improvement noticeable.

We have presented here a set of experiments on 2D images. The generative model, the algorithm and the proof of its convergence do not depend on the dimension of the images; the implementation for 3D images is only a numerical issue. We are currently working on the 3D code in order to test this algorithm on real medical databases.

An interesting extension would be to consider diffeomorphic mappings, and not only displacement fields, for the hidden deformation. This appears particularly interesting in the context of Computational Anatomy, where a one-to-one correspondence between the template and the observation is usually needed and cannot be guaranteed with linear spline interpolation schemes. This extension could be done in principle using tangent models based on geodesic shooting in the spirit of [20]. Much numerical as well as theoretical work remains to be done in this area.

REFERENCES

[1] S. Allassonnière, Y. Amit, and A. Trouvé. Towards a coherent statistical framework for dense deformable template estimation. Journal of the Royal Statistical Society, Series B, 69(1):3–29, 2007.
[2] S. Allassonnière, E. Kuhn, and A. Trouvé. Bayesian deformable models building via stochastic approximation algorithm: a convergence study. In revision.
[3] S. Allassonnière, E. Kuhn, and A. Trouvé. MAP estimation of statistical deformable templates via nonlinear mixed effects models: deterministic and stochastic approaches. In X. Pennec and S. Joshi, editors, Proc. of the International Workshop on the Mathematical Foundations of Computational Anatomy (MFCA-2008), September 2008.
[4] Y. Amit, U. Grenander, and M.
Piccioni. Structural image restoration through deformable templates. Journal of the American Statistical Association, 86(414):376–387, 1991.
[5] C. Andrieu, É. Moulines, and P. Priouret. Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim., 44(1):283–312, 2005.
[6] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In H. Burkhardt and B. Neumann, editors, 5th European Conference on Computer Vision, Berlin, volume 2, pages 484–498. Springer, 1998.
[7] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist., 27(1):94–128, 1999.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–22, 1977.
[9] C. Dorea and L. Zhao. Nonparametric density estimation in hidden Markov models. Statistical Inference for Stochastic Processes, 5:55–64, 2002.
[10] C. A. Glasbey and K. V. Mardia. A penalised likelihood approach to image warping. Journal of the Royal Statistical Society, Series B, 63:465–492, 2001.
[11] J. Glaunès and S. Joshi. Template estimation from unlabeled point set data and surfaces for computational anatomy. In X. Pennec and S. Joshi, editors, Proc. of the International Workshop on the Mathematical Foundations of Computational Anatomy (MFCA-2006), pages 29–39, 2006.
[12] U. Grenander. General Pattern Theory. Oxford Science Publications, 1993.
[13] P. Hall and C. C. Heyde. Martingale Limit Theory and Its Application. Academic Press, New York, 1980.
[14] E. Kuhn and M. Lavielle. Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM Probab. Stat., 8:115–131, 2004.
[15] H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences. Springer-Verlag, New York, 1978.
[16] S. Marsland, C. Twining, and C. Taylor. A minimum description length objective function for groupwise non-rigid image registration. Image and Vision Computing, 2007.
[17] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Communications and Control Engineering Series. Springer-Verlag, London, 1993.
[18] M. I. Miller, A. Trouvé, and L. Younes. On the metrics and Euler–Lagrange equations of computational anatomy. Annual Review of Biomedical Engineering, 4:375–405, 2002.
[19] C. Robert. Méthodes de Monte Carlo par chaînes de Markov. Statistique Mathématique et Probabilité. Éditions Économica, Paris, 1996.
[20] M. Vaillant, M. I. Miller, A. Trouvé, and L. Younes. Statistics on diffeomorphisms via tangent space representations. NeuroImage, 23(S1):S161–S169, 2004.
