Bayes Model Selection with Path Sampling: Factor Models and Other Examples

Statistic al Scienc e 2013, V ol. 28, No. 1, 95–1 15 DOI: 10.1214 /12-STS403 c  Institute of Mathematical Statisti cs , 2013 Ba y es Mo del Selection with P ath Sampling: F acto r Mo dels and Ot her Examples Ritab rata Dutta and Ja y anta K. Ghosh Abstr act. W e pr o v e a theorem justifying th e regularit y conditions whic h are needed for P ath Sampling in F ac tor Mo dels. W e then sho w that the remaining ingredien t, namely , MCMC for ca lculating the in tegrand at eac h p oin t in the path, ma y b e seriously ﬂaw ed, leading to w rong es- timates of Ba ye s factors. W e provide a new metho d of P ath S amp ling (with Small Change) that w orks muc h b etter than standard P ath Sam- pling in the sense of estimating the Ba yes factor better and c ho osing the correct model more often. Wh en the more complex factor mo del is true, PS -S C is subs tan tially more accurate. New MCMC d iagnostics is p r o vided for these pr oblems in supp ort of ou r conclusions and rec- ommendations. Some of our ideas for diagnostics and impr o v emen t in computation through small c hanges should apply to other metho ds of computation of the Ba ye s factor for mo del selectio n. Key wor ds and phr ases: Ba yes mod el selection, co v ariance mo d els, path sampling, Laplace app ro ximation. 1. BA YES MODEL SELECTION Adv ances in MCMC tec hniqu es to compute the p osterior for many complex, hierarc hical mo dels h a v e b een a ma jor reason for su ccess in Ba y es mo del- ing and analysis of complex ph en omena (And rieu, Doucet and Rob ert, 2004 ). Th ese tec hniqu es along with applications are sur vey ed in numerous pap ers, including Chen, Shao and Ibr ahim ( 2000 ), Liu ( 2008 ) and Rob ert and Casella ( 2004 ). Moreo ver, man y Ba yesia n b o oks on applications or theory a nd meth- o ds pro vide a quick in tro d u ction to MCMC, suc h as Gelman et al. ( 2004 ), Ghosh, Delampady and Ritab r ata Dutt a is Ph.D. Stu dent and Jayanta K. Ghosh is Pr ofessor, Dep artment of S tatistics, Pur due University, L afayette, Indiana 47907, USA e-mail: r dutt a@pur due.e du ; statrita2004@gmail .c om ; ghosh@pur due.e du . This is an electronic r eprint of the orig inal ar ticle published b y the Institute of Mathematica l Statistics in Statistic al S cienc e , 2013, V ol. 28, No. 1, 9 5–11 5 . This reprint diﬀers fr om the or iginal in pagination and t yp ogr aphic detail. Saman ta ( 2006 ), Gamerman and Lop es ( 2006 ) and Lync h ( 2007 ). Just as the p osterior for the parameters of a giv en mo del are imp ortan t for calculating Ba yes estimates, p osterior v ariance, c redibility in terv als and a general description of th e uncertain t y inv olv ed, one needs to calculate Ba yes factors for se lecting on e of sev eral mo dels. Ba y es factors are the r atio of marginals of giv en d ata und er diﬀerent mo dels, when more than one mo del is in volv ed and one w ish es to c ho ose one from among them, b ased on their r elativ e or p os- terior pr obabilit y . The r atio of m arginals measures the relativ e p osterior probabilit y or credibilit y of one mo del with resp ect to the other if we make the usual ob jectiv e choic e of half as prior probabilit y for eac h mo del. Although there are many metho ds for calculat- ing Ba y es factors, their success in handlin g complex mo dern mod els is far more limited than seems to b e generally recognized. Part of the reason for lac k of a w areness of this is that mo del selection has b e- come imp ortan t r elativ ely recen tly . Also, one ma y think that, in principle, calculation of a BF can 1 2 R. DUTT A AND J. K. GHOSH b e reduced to the calculation of a p osterior, and hence solv able by the same metho ds as those f or calculating the p osterior. Rev ersible Ju mp MCMC (RJMCMC) is an innov ativ e metho dology du e to Green ( 1995 ), based on this simp le fact. Ho w ev er, t w o mo dels essen tially lead to t wo d iﬀerent sets of states for any Mark o v c hain that connects them. The state spaces for diﬀeren t mo dels often d iﬀer w id ely in th eir d im en sion. Th is ma y pr ev en t go o d mixing and ma y sho w up in the kno wn diﬃculties of tun ing RJMCMC. F or a discussion of tunin g diﬃculties see Rob ert and Casella ( 2004 ). Another p opular m etho d for calculating BF is path sampling (PS ), whic h is du e to Gelman and Meng ( 1998 ) and recen tly r e-examined b y Lefebvre et al. ( 2009 ). Ou r ma jor goal is to explore PS further in the conte xt of nested, relativ ely high-dimens ional co v ariance mo dels, rather than nonnested low-dimen- sional m ean mod els, as in the last reference. Th e new examples sh o w b oth similarities and sharp c hanges from th e s ort of b eha vior do cumented in Lefebvre et al. ( 2009 ). W e consider three p aths, namely , the geo metric mean path, th e arithmetic mean path and the para- metric arithmetic mean path, whic h app ear in Gel- man and Meng ( 1998 ), Le febvre et al. ( 200 9 ), Ghosh and Dunson ( 2008 ), Ghosh and Dunson ( 2009 ), Lee and Song ( 2 002 ) a nd Song and Lee ( 2006 ). O ther applications of path sampling and br idge sampling (with some mo d iﬁcations) app ear in Lartillot and Philipp e ( 2006 ), F riel and Pe ttitt ( 2008 ), Xie et al. ( 2011 ) and F an et al. ( 2011 ). Ou r priors are usually the diﬀuse Cauc hy priors, ﬁrst sugge sted b y Jeﬀreys ( 1961 ) and since then recommended by many others, including B erger (p ersonal communicatio n), Liang et al. ( 2008 ), G elman ( 2 006 ) and Ghosh and Dun s on ( 2009 ). But w e also examine other less diﬀu se priors to o, going all the wa y to normal priors. S ince Lefeb- vre et al. ( 2009 ) ha v e stud ied applicatio ns of PS to mean lik e parameters, we fo cus on co v ariance mo d- els. W e restrict our selv es g enerally to factor mo d- els for co v ariance, which hav e b ecome quite p opu- lar in rece nt app lications, f or example, Lop es and W est ( 2004 ), Ghosh and Dunson ( 2008 ), Ghosh and Dunson ( 2009 ) a nd Lee a nd Song ( 2002 ). The r ecen t p opularity of facto r m o dels is due to the relativ e e ase with w hic h they may b e us ed to provi de a sp arse rep - resen tation of the co v ariance matrix of multiv ariate normal data in many applied problems of ﬁnance, psyc hometry and epidemiology; see, for example, the last thr ee r eferences. Also, often it leads to in terest- ing scien tiﬁc insight ; see Bartholo mew et al. ( 2002 ). In a ddition to pr ior, like liho o d and path, there are other choi ces to b e made b efore PS can b e imp le- men ted, namely , a metho d of discretizing the path, for example, b y equispaced p oints or a daptive ly ( Lefeb vre et al. , 200 9 ) and ho w to in tegrate the score function of Gelman and Meng ( 1998 ) at eac h p oin t in the discrete path. A p opu lar method is to use MCMC. These more tec hnical choice s are discussed later in the pap er. Along with PS, we will consider other metho d s lik e Imp ortance Sampling (IS) and its descendan ts lik e An nealed Imp ortance Sampling (AIS), due to Neal ( 2001 ), and Bridge S ampling (BS), du e to Meng and W ong ( 199 6 ). W e no w summarize our co ntribution in this p ap er. In Section 2 w e review what is kn o wn ab out path sampling an d factor mo dels. W e in tro du ce factor mo dels, a suitable path and suitable diﬀuse t -priors. The path w e use was ﬁ r st introdu ced in Gelman and Meng ( 1998 ) for mean mo d els and b y Lee and Song ( 2002 ) and Ghosh and Dunson ( 2009 ) for fac- tor mo dels. In S ection 2.4 we prov e a theorem (T h eorem 2.1 ) whic h essentiall y sho ws that except for the conv er- gence of MCMC estimates for exp ected score func- tion E t ( U ( θ , t )) at eac h grid p oin t t in the path, all other needed c onditions for PS will hold for o ur c ho- sen path, pr ior and lik eliho o d for f actor mo dels. I n one of the remarks follo wing the theorem w e gener- alize this result to other paths. Remark 3 p oin ts to the need f or some ﬁnite momen ts for the pr ior, not just for T h eorem 2.1 to hold b ut for the p osterior to b eha v e w ell. Th en in Remark 5 w e p ro vide a de- tailed, heuristic argumen t as to wh y the MCMC may fail dramatically by n ot mixing prop erly if the data has come from the bigger of th e tw o m o dels un der consideration. If our heuristics is correct, and there is a sm all in terv al where E t ( U ( θ , t )) oscillates most, then a grid size that is a bit coa rse will not only b e a bit inaccurate, it will b e v ery wrong. Even if the grid size is suﬃcien tly small, one will need to do MCMC s ev eral times with diﬀerent starting p oin ts just to realize PS w ill not work. Our new prop osal a voids these problems but will require more time if man y mo dels are in v olv ed. In Section 3 w e give an argument as to why the ab o v e is un lik ely to b e true if the d ata has come from the smaller mo del. More imp ortant ly , in Sectio n 3.3 w e prop ose a mod iﬁcation of PS, which w e call P ath BA Y ES MODEL SELECTION 3 Sampling with Sm all Chan ge (PS-SC) wh ic h is ex- p ected to do b etter. Implemen tation of PS and PS-SC can b e v ery time consuming due to the need of MCMC samp ling for eac h grid p oint along the path. Time can b e sav ed if w e can implemen t PS and PS-SC b y parallel com- putation, as noted by Gelman and Meng ( 1998 ). In Section 3.4 we sho w MCMC output for the v arious cases discus sed and v alidate o ur heur istics ab o v e. The diagnostics via pr o jection into lik eliho o d space sh ou ld pro v e useful for other mo del selection problems. Our gol d standard is PS-SC, b ased on an MCMC with the num b er of dr a ws m = 50,000 and burn -in of 1000 , if necessary . But actually in our examples m = 6000 and bu rn-in of 1000 s u ﬃces for PS-SC. F or ot her mod el selection r ules we also go up to m = 50,000 if necessary . After Section 3.4 , ha ving sho wn our mo diﬁed PS, namely , PS-SC, is sup erior to PS und er b oth mod els, we do n ot consider PS in the rest of the p ap er. In the last t wo sections we touc h on the follo wing related topics: eﬀects of grid size, alternativ e path, alternativ e m etho ds and p erformance of PS-SC and some other metho ds in very high-dimensional simu- lated and real examples. PS-SC seems to c ho ose the true mo dels in the sim ulated cases and relativ ely conserv ativ e mo dels for real data. In S ection 5 we explore v arious real life and high-dimensional facto r mo dels, w ith the ob ject of co mbining PS-SC with t w o of the method s wh ic h do relativ ely wel l in Sec- tion 4 to reduce the time of PS-SC in problems with the num b er of factors r ather high, sa y , 20 or 26, for whic h PS-S C can b e qu ite slo w. F or these high- dimensional examples, w e use Laplace a pp ro xima- tion to marginals for preliminary screening of mo d- els. A few general commen ts on Laplace approxima - tion in high-dimensional prob lems are in Section 5 . In App endix A.1 w e introdu ce brieﬂy a few other metho ds lik e Annealed Im p ortance Sampling (AIS) whic h we ha ve compared with PS-SC. Finally , Ap- p end ix A.2 p oin ts to some striking d iﬀerences b e- t w een w h at w e observe in factor mo dels and w h at one m igh t ha ve expected from our familiar classi- cal asymptotics for maximum likeli ho o d estimates. Of course, as p oint ed out b y Drton ( 2009 ), classical asymptotics do es n ot apply here, bu t it surpr ised us that the d iﬀerences would b e so sta rk. It is in ter- esting and w orth p oin ting out that the Ba y es meth- o ds like PS-S C can b e v alidated partly theoreti cally and partly numerical ly in s p ite of a lac k of suitable asymptotic th eory . 2. P A TH SAMPLING AND F ACTOR MODELS In the f ollo wing sub sections we review some basic facts ab out PS, including the deﬁnition of th e th ree paths and th e notion of an optimal path. More im- p ortantl y , s ince our inte rest w ould b e in model selec- tion for co v ariance rather than mean, w e introdu ce factor mod els and then PS for factor m o dels in Sec- tions 2.3 and 2.4 . Section 2.1 is mostly an introdu ction to PS and reviews p revious w ork. After that we show the fail- ure of PS-estimates in a to y pr ob lem related to the mo deling of th e co v arince matrix in Section 2.2 . In Section 2.3 we in tro d uce factor mo dels and our pri- ors. Section 2.4 introdu ces p aths that w e consid er for factor mo dels and a theorem sho wing th e regu- larit y conditions needed for v alidity of PS under fac- tor mo d els. Th en in a series of remarks we extend the theorem and also s tu dy and explain ho w the re- maining ingredient of PS, namely , M CMC, can go wrong. W e sho w a few MCMC ou tp uts to sup p ort our arguments in Section 3.4 . This particular theme is v ery imp ortan t and will come up sev eral times in later sections or s u bsections where related diﬀeren t asp ects will b e p resen ted. 2.1 P ath S ampling Among the man y diﬀerent method s related to im- p ortance sampling, th e most p opular is Path S am- pling (PS). How ev er, PS is b est under s to o d as a limit of the simp ler Bridge S ampling (BS) (Gelman and Meng, 1998 ). S o w e ﬁr st b egin w ith BS. It is w ell kno wn t hat un less the dens ities of the sampling and target distributions are close in r ela- tiv e imp ortance s amp ling w eigh ts, Imp ortance Sam- pling (IS ) will ha v e high v ariance as well as high bias. Due to the diﬃculty of ﬁnd ing a suitable sampling distribution for IS, one ma y tr y to redu ce the d iﬃ- cult y by in tro du cing a nonnorm alized int ermediate densit y f 1 / 2 that act s like a bridge betw een the non- normalized sampling d ensit y f 1 and nonnormalized target den sit y f 0 ( Meng and W ong , 1996 ). On e can then use the iden tit y Z 1 / Z 0 = Z 1 / 2 / Z 0 Z 1 / 2 / Z 1 and estimate b oth the numerator and denominato r by I S. Extend- ing this idea, Gelman an d Meng ( 1998 ) constructed a whole path f t : t ∈ [0 , 1] connecting f 0 and f 1 . This is also lik e a b ridge. Discretizing this, they get the iden tit y Z 1 / Z 0 = Q L l =1 Z ( l − 1 / 2) / Z ( l − 1) Z ( l − 1 / 2) / Z ( l ) , wh ic h leads to a c hain of IS estimat es in the n umerator and denom- inator. W e call this estimate the Generalized Br id ge Sampling (GBS) estimate. 4 R. DUTT A AND J. K. GHOSH More imp ortant ly , Gelman and Meng ( 1998 ) in- tro duced PS, whic h is a new sc h eme, using the idea of a contin uous ve rsion of GBS but us ing t he log scale. The PS estimate is calculated b y ﬁrst con- structing a path as in BS. Supp ose the path is giv en b y p t : t ∈ [0 , 1] where for eac h t , p t is a probability densit y . Then we ha ve the follo wing deﬁnition: p t ( θ ) = 1 z t f t ( θ ) , (2.1) where f t is an unnormalized densit y and z t = R f t ( θ ) dθ is th e normalizing constan t. T aking the deriv ativ e of th e log arithm on b oth sides, we obtain the follo win g id en tit y un der the assumption of in - terc h angeabilit y of the order of integ ration and dif- feren tiation: d dt log( z t ) = Z 1 z t d dt f t ( θ ) µ ( dθ ) (2.2) = E t  d dt log f t ( θ )  = E t [ U ( θ , t )] , where th e exp ectation E t is tak en with resp ect to p t ( θ ) and U ( θ , t ) = d dt log f t ( θ ). No w in tegrating ( 2.2 ) from 0 to 1 gi ve s the log of the ratio of the normaliz- ing constan ts, that is, log BF in the con text of mo del selection: log  Z 1 Z 0  = Z 1 0 E t [ U ( θ , t )] dt. (2.3) T o approxima te the inte gral, w e d iscretize the p ath with k p oin ts t (0) = 0 < t (1) < · · · < t ( k ) = 1 and dra w m MC MC samples con verging to p t ( θ ) at eac h of these k p oin ts. Then estimate E t [ U ( θ , t )] b y 1 m P U ( θ ( i ) , t ) wh ere θ ( i ) is the MCMC outpu t. T o estimate the ﬁn al log Ba y es fact or, commonly nu- merical in tegration sc hemes are used. It i s clear that MCMC at diﬀerent p oin ts “ t ” on the path can b e done in parallel. W e ha v e used th is b oth for PS and for our mo diﬁcation o f it, n amely , PS-SC intro d uced in Section 3.3 . Gelman and Meng ( 1998 ) sho wed there is an opti- m um path in the wh ole distribution space pro viding a lo wer b oun d for MCMC v ariance, namely ,  arctan H ( f 0 , f 1 ) p 4 − H 2 ( f 0 , f 1 )  2 . m, where f 0 and f 1 are the densities corresp onding to the t w o mo d els compared and H ( f 0 , f 1 ) is their Hel- linger distance. Unfortunately in nested examples f 0 and f 1 are m utually orthogonal, so H ( f 0 , f 1 ) tak es the trivial v alue of t w o. Moreo ve r, m is so large that the lo wer b ound b ecomes trivial and unattainable. Ho wev er, in a giv en problem, one path ma y b e more suitable or con venien t th an another. F ollo wing Gelman and Meng ( 1998 ) and Lefeb vr e et al. ( 2009 ), w e consider three paths generally used for the implement ation of PS. The Geometric Mean P ath (GMP) a nd Arithmetic Mean Path (A MP) are deﬁned by the mean [ f t = f (1 − t ) 0 f t 1 and f t = tf 0 + (1 − t ) f 1 , resp.] of the d ensities of t wo comp eting mo dels for eac h mo del M t : t ∈ (0 , 1) along the path. Our notation for the Ba yes factor is giv en later in equation ( 2.6 ). One more common path is obtained by assum - ing a sp eciﬁc functional form f θ for the densit y and then constructing the path in the parametric space ( θ ∈ Θ ) of the assumed density . If θ t = tθ 0 + (1 − t ) θ 1 , then f t,θ t is the density of the mo del M t , where f 0 ,θ 0 = f 0 and f 1 ,θ 1 = f 1 . W e d enote th is third path as the P arametric Arith m etic Mean P ath (P AMP). The P AMP path w as shown by Gelman and Meng ( 1998 ) to minimize the Rao d istance in a path for mo del selection ab out normal means. More imp or- tan tly , it is v ery con ve nient for use of MCMC, as sho wn f or some factor mo d els by Song and Lee ( 2006 ) and Ghosh and Dunson ( 2009 ), and for linear mo d els b y Lefeb vre et al. ( 2009 ). Implement ation of PS with the p aths m en tioned ab o ve is denoted as GMP-PS, AMP-PS and P AMP-PS . In view of the discussion in Lefeb vre et al. ( 2009 ) regarding the degeneracy of the AMP-PS, w e will only consider P AMP-PS and GMP-PS. Unlik e Lefebvre et al. ( 2009 ), who s tu dy mo dels ab out means, our in terest is in studying mo del selec- tion for co v ariance models, sp eciﬁcally factor mo dels with diﬀeren t num b er of factors. Th ese are d iscussed in the Sections 2.3 and 2.4 . Performance of PS for co v ariance mo dels can b e v ery diﬀeren t from the ex- amples in L efebvre et al. ( 2009 ). I n the next sub- section we giv e a to y example of co v ariance mo del selection where PS fails and our prop osed mod iﬁca- tion PS -SC is also not applicable. 2.2 Cova riance Mo del: T o y Example T o illustrate the diﬃculties in calculation of the BF that we discuss later, we b egin by considering a problem w here w e can calculate the true v alue of the Ba y es factor. Assuming Y p ∼ N (0 , Σ), for some m < p w e w ish to test wh ether Y 1 ,...,m and Y m +1 ,...,p are indep endent or not. If Σ = ( A 11 A 12 A ′ 12 A 22 ) w here Y 1 ,...,m ∼ N (0 , A 11 ) BA Y ES MODEL SELECTION 5 T able 1 Performanc e of PS in toy example mo deli ng c ovarianc e: L o g Bayes factor (MCMC-standar d deviation) Metho d Data 1 Data 2 T rue BF v alue 258.38 − 132 . 87 PS estimate of BF 184.59 (0.012) − 20 . 11 (0.008) and Y m +1 ,...,p ∼ N (0 , A 22 ), th en the comp etitiv e mo d - els for a ﬁxed m will b e M 0 : A 12 = 0 vs M 1 : A 12 6 = 0. Under M 1 w e use an inv erse-Wishart p rior f or the co v ariance matrix, as it h elps us to calculate the true BF, using the conjugacy pr op erty of the prior. Un- der M 0 w e tak e A 11 , A 22 to b e in dep end en t, eac h with an in verse Wishart prior. W e illustrate th e ab ov e p roblem with p = 10 and m = 7 for a p ositiv e deﬁnite matrix Σ 0 = ( A 0 11 A 0 12 ( A 0 12 ) ′ A 0 22 ) (giv en in App endix A.3 ). W e implemen t the p ath sampling f or this problem connecting M 0 and M 1 , using a Parametric Arithmetic Mean Path: M t : y i ∼ N  0 , Σ =  A 0 11 tA 0 12 t ( A 0 12 ) ′ A 0 22  . (2.4) F or ev ery 0 ≤ t ≤ 1 , the Σ matrix is p ositiv e deﬁnite, b eing a con v ex com b ination of tw o p ositiv e deﬁnite matrices. F or t = 0 and t = 1 we get the mo dels M 0 and M 1 . W e can estimate the Ba yes factor by using the path sampling schemes as describ ed e arlier. W e sim- ulated t w o data sets, one eac h from M 0 and M 1 , a nd rep ort the tr u e BF v alue with the PS estimate in T a- ble 1 . Here the rep orted Ba y es f actor is deﬁn ed as the ratio m 1 m 0 , where m 1 and m 0 are the marginals under the mo d els M 1 and M 0 , r esp ectiv ely . The v alues in the table sho w us that th e estimated BF v alue is oﬀ by an order of magnitude when M 0 is true. The v alue is r elativ ely stable as jud ged by the MCMC-standard d eviation based on 10 runs and near to the true v alue for M 1 . 2.3 F acto r M o dels a nd Bay esian Sp eciﬁcation of Prio r A factor mod el with k factors is deﬁned as n i.i.d. observ ed r .v.’s y i = Λ η i + ε i , ε i ∼ N p (0 , Σ) , where Λ is a p × k matrix of factor loadings, η i = ( η i 1 , . . . , η ik ) ′ ∼ N k (0 , I k ) is a vect or of standard normal laten t f actors, and ε i is the residual with diagonal co v ariance matrix Σ = diag( σ 2 1 , . . . , σ 2 p ). T h us, we ma y w r ite the marginal distribution of y i as N p (0 , Ω), Ω = ΛΛ ′ + Σ. This mo del implies that the sharing of common laten t factors explains the d ep endence in the outcomes and the outcome v ariables are uncorrelated given the la- ten t factors. A factor mo del, without any other constrain ts, is nonidentiﬁable un der orthogonal rotation. Post- m ultiplying Λ by an orthogonal matrix P , where P is su c h that P P ′ = I k , w e ob tain exactly the same Ω as in the previous factor mo del. T o a v oid this, it is customary to assume that Λ has a fu ll-rank lo w er triangular stru cture, restricting the n umb er of free parameters in Λ and Σ to q = p ( k + 1) − k ( k − 1) / 2, where k m u st b e c hosen so that q ≤ p ( p + 1) / 2. The recipro cal of d iagonal en tries of Σ forms the preci- sion ve ctor here. It is w ell kn o wn that maxim um like liho o d esti- mates for parameters in f actor mo dels ma y lie on b ound aries and , hence, lik eliho o d equations ma y n ot hold ( Anderson , 1984 ). Th e Ba yes estimate of Ω de- ﬁned as a v erage o v er MCMC outputs is w ell deﬁned, easy to calculate and, b eing a ve rage of p ositiv e def- inite matrices, is easily s een to b e p ositiv e deﬁnite. This fact is used to searc h f or maxim um lik eliho o d estimates (mle) or m axim um p r ior × lik eliho o d esti- mates (mp le) in a neigh b orho o d of the Ba yes esti- mate. W e also note for later use the follo wing w ell-kno wn simple fact, for example, And er s on ( 1984 ). If th e lik eliho o d is maximized ov er all p ositiv e deﬁnite m a- trices Ω, not just ov er factor mo dels, then the global maxim um for n indep endent observ ations exist s and is giv en by ˆ Ω = 1 n − 1 n X i =1 ( y i − ¯ y )( y i − ¯ y ) ′ . (2.5) F r om the Ba yes mo del selection p ersp ectiv e, a sp ec- iﬁcation of the prior distribution for the fr ee ele- men ts of Λ and Σ is needed. T r uncated normal pr iors for the diagonal elemen ts of Λ, normal priors for the lo wer triangular elemen ts and inv erse-gamma priors for σ 2 1 , . . . , σ 2 p ha v e b een commonly u sed in practice due to conjugacy and the resulting simpliﬁcation in p osterior distribution. Prior elicitatio n is n ot com- mon. Ghosh and Dunson ( 2009 ) ad d ressed the ab ov e problems by using the idea of Gelman ( 20 06 ) to in- tro duce a new class of d efault pr iors for the fac- tor l oadings that ha v e goo d mixing p rop erties. They 6 R. DUTT A AND J. K. GHOSH used the Gibbs samp ling scheme and sh o w ed there w as go o d mixing and con v ergence. T hey u sed p a- rameter expansion to induce a class of t or folded t -priors dep en ding on sign constrain ts on the load- ings. Su itable t -p riors hav e b een very p opular. W e use the same family of priors bu t consider a w h ole range of many degrees of fr eedom going all the w a y to the normal and use the same Gibbs sampler as in Ghosh and Dunson ( 2008 ). W e ha v e used a mo diﬁ ed v ersion of their co d e. In th e factor mo d el fr amew ork, w e stick to the con v en tion of denoting the Ba y es factor for t wo mo d - els with laten t factors h − 1 and h as BF h,h − 1 = m h ( x ) m h − 1 ( x ) , (2.6) where m h ( x ) is the marginal under th e model ha ving h laten t factors. So the Bayes f actor for the simpler mo del ( deﬁne d as M 0 ) and c omplex mo del ( deﬁne d as M 1 ) with h − 1 and h latent factors wil l b e deﬁne d as BF h,h − 1 . W e choose the mo del with h and h − 1 laten t factors, resp ectiv ely , dep end ing on the v alue of the log Ba y es factor b eing p ositiv e and negativ e. Alternativ ely , one ma y c ho ose a mo d el only w hen the v alue of log BF is decisiv ely negativ e or p ositiv e, sa y , less than or greate r than a c h osen threshold. 2.4 P ath S ampling fo r Facto r Mo dels There are s ev eral v ariants of path samplin g w hic h ha v e b een explored in diﬀeren t setups, d ep endin g on c h oice of path, p rior and other tun ing parameters (grid size and MCMC sample size). In the factor mo del s etup th e p arametric arithm etic mean p ath (P AMP) [used by Son g and Lee ( 2006 ) and Ghosh and Duns on ( 2009 )] seems to b e the most p opular one. W e also consider Geometric Mean P ath (GMP) along with the P AMP for the factor mo del. By constructing a GM path from corresp ondin g prior to the p osterior, we can estimate the v alue of the log-marginal under b oth M 0 and M 1 , whic h in turn leads u s to an estimate of the log-BF. W e will ﬁrst describ e th e t w o paths and their corresp onding score fu n ctions to b e estimated along the path. (i) Par ametric a rithmetic me an p ath : Lee and Song ( 2002 ) used this path in factor mo dels, follo w- ing an example in Gelman and Meng ( 1998 ). Ghosh and Dunson ( 2008 ) also used this p ath along w ith parameter expansion. Here we deﬁne M 0 and M 1 to b e the t wo mod els corresp ond ing to the factor mo del with factors h − 1 and h , r esp ectiv ely , and th en connect them by th e path M t : y i = Λ t η i + ε i , Λ t = ( λ 1 , λ 2 , . . . , λ h − 1 , tλ h ), where λ j is the j th column of the loading matrix. So for t = 0 and t = 1 w e get the mo dels M 0 and M 1 . The lik eliho o d f unction at grid p oint t is a MVN whic h is denoted as f ( Y | Λ , Σ , η, t ). W e hav e indep endent priors π (Λ) , π (Σ) , π ( η ) and a score fu n ction, U (Λ , Σ , η, Y , t ) (2.7) = n X i =1 ( y i − Λ t η i ) ′ Σ − 1 (0 p × ( h − 1) , λ h ) η i . F or ﬁxed and ordered grid p oin ts along the path t (0) = 0 < t (1) < · · · < t ( S ) < t ( S +1) = 1 , our p ath sam- pling estimate for the log Ba yes factor is log( d BF h : h − 1 ) (2.8) = 1 2 S X s =0 ( t s +1 − t s )( b E s +1 ( U ) + b E s ( U )) . W e sim u late m samples of (Λ t s ,i , Σ i , η i : i = 1 , . . . , m ) from the p osterior distribu tion of (Λ t s , Σ , η ) at the p oint 0 ≤ t s ≤ 1 and use them to estimate b E s ( U ) = 1 m P U (Λ t s ,i , Σ i , η i , y ) , ∀ s = 1 , . . . , S + 1. (ii) Ge ometric me an p ath : This p ath is constructed o ver the d istributional sp ace (Gelman and Meng, 1998 ), h ence, we mo d el the density for the mo del M t at eac h p oin t alo ng the grid. W e use the dens it y f t (Λ , Σ , η | Y ) = f ( y | Λ , Σ , η ) t π (Λ , Σ , η ) as the un n or- malized den s it y f or the mo del M t connecting the prior and the p osterior, when π (Λ , Σ , η ) and f ( y | Λ , Σ , η ) are the pr ior and the lik eliho o d fun ction, re- sp ectiv ely . By using PS along this path w e can ﬁnd the log marginal for th e mo dels M 0 and M 1 , as the normalizing constan t f or the prior is kno wn. Hence, the log BF can b e estimated by using those esti- mates of the log marginal for those mo dels. The score fu nction U (Λ , Σ , η, Y , t ) will b e th e log lik e- liho o d function log f ( y | Λ , Σ , η ). The theorem b elo w v eriﬁes the regularity condi- tions of path samp lin g for factor mo dels. F or PS to succeed w e also need con vergence of MCMC at eac h p oint in the path. That w ill b e tak en up after prov- ing the theorem. Theorem 2.1. Consider p ath sampling for fac- tor mo dels with p ar ametric arith metic me an p ath (P AMP ) and likeliho o d as g iven ab ove for factor mo dels. Assume prior is pr op er and the c orr esp ond- ing sc or e function is inte gr able w.r.t. the prior: (1) The inter change ability of inte gr ation and dif- fer entiation in ( 2.2 ) is v alid. (2) E t ( U ) is ﬁnite as t → 0 . BA Y ES MODEL SELECTION 7 (3) The p ath sampling inte g r al for factor mo dels, namely, ( 2.3 ), is ﬁnite. Pr oof. Here, for notational conv enience, we write (Λ , Σ , η ) = θ . When f ( Y | θ ) and π ( θ ) are the lik eliho o d function of the data and the pr ior densit y function for the corresp ond ing parameter, r esp ec- tiv ely , then the follo wing is equiv alen t to showing equation ( 2.2 ): d dt Z ∞ −∞ f ( Y | θ , t ) π ( θ ) dθ = Z ∞ −∞ d dt f ( Y | θ , t ) π ( θ ) dθ . W e can wr ite the LHS as the follo wing: = lim δ → 0 Z ∞ −∞ f ( Y | θ , t + δ ) − f ( Y | θ , t ) δ π ( θ ) dθ = lim δ → 0 Z ∞ −∞ f ′ ( Y | θ , t ′ ) π ( θ ) dθ , t ′ ∈ [ t, t + δ ] = lim δ → 0 Z ∞ −∞ U ( Y | θ , t ′ ) f ( Y | θ , t ′ ) π ( θ ) dθ , where t ′ ∈ [ t, t + δ ]. U is a quadratic fun ction in θ and, hence, its absolute v alue is b ou n ded ab o v e by a qu adratic fun ction in θ , free of t bu t dep end ing on Y . f ( Y | θ , t ′ ) is b ounded b y the global maximum of the MVN like liho o d , sa y , M , ac hieve d at b Ω [equa- tion ( 2.6 )]. No w applying th e momen t assumptions for π ( θ ), w e can use the Dominated C onv ergence theorem (DCT) and take the limit within the in- tegral sign. The rest of statemen ts 2 and 3 follo w similarly .  In Remark 1 we extend th e theorem to other p aths. Then in a series of remarks we study v arious as- p ects lik e con v ergence and d iv ergence of PS , that are closely related to the theorem. All the remarks are related to the theorem and insigh ts gained from its p r o of. Remark 5 is the most imp ortan t. Remark 1. F or PS with GMP , the score func- tion is the log like liho o d fun ction whic h can b e b ound ed as b efore by usin g th e RHS of equation ( 2.5 ). Also, f ( y | Λ , Σ , η ) t ≤ (1 ∨ f ( y | ˆ Ω)) with ˆ Ω as in equation ( 2.5 ). W e b elieve a sim ilar generalization holds for most paths mo deling means of t wo mo d - els. No w the p r o of of Theorem 2.1 applies exac tly as b efore (i.e., as f or P AMP). W e exh ibit p erformance of PS for this path in S ection 4 . Remark 2. If w e f u rther assume the MCMC a v- erage at eac h p oint on the grid con v erges to the Ex- p ectation of the score function of MCMC, then the theorem implies the con verge nce of PS. W e sh o wed the in tegrand is con tinuous on [0, 1]. So by cont inu- it y it can b e ap p ro ximated by a ﬁnite sum. Now tak e t he limit of the MCMC av erage at eac h of these ﬁnitely many grid points. Ho w ev er, ev en if the MCMC conv erges in th eory , the rate of conv ergence ma y b e very slo w or there m ay b e a problem with mixing ev en f or m = 50,000, whic h w e ha ve tak en as our gold stand ard f or go o d MCMC. This problem will b e apparent to some extent fr om h igh MC MC standard deviation. Remark 3. As t → 0 the lik eliho o d is practically indep en d en t of the extra p arameters of the bigger mo del, so that a p rior for those p arameters (con- ditional on other parameters) will not learn m uch from data. In particular, the p osterior for these p a- rameters will remain close to the d iﬀuse prior one normally starts with. If the pr ior fails to ha ve the required ﬁn ite m omen t in th e theorem, th e p oste- rior will also b e like ly to ha v e large v alues for mo- men ts, which ma y cause con ve rgence problems for the MCMC. Th at’s why we c hose a prior making the s core function integrable. I n the p ro of of the theorem, we ha ve assum ed the ﬁrst t wo moments of the prior to b e ﬁn ite. In most numerical wo rk our prior is a t with 5 to 10 d.f. Remark 4. In the same v ein, we suggest that ev en when the in tegral at t near zero conv erges, the con v ergence may b e slo w for th e follo win g r eason. Consider a ﬁxed (Λ t , Σ , η ) w ith a large p osterior or negativ e v alue of U (Λ t , Σ , η ) L (Λ t , Σ , η ) at p oint t , the same large v alue will o ccur at ( 1 t Λ t , Σ , η ) with prior weig ht π ( 1 t Λ t , Σ , η ) . F or priors like t -distribution with lo w degrees of fr eedom, π ( 1 t Λ t , Σ , η ) will not deca y rapidly enough to substantia lly dim in ish the con tribution of the large v alue of U (Λ t , Σ , η ) L (Λ t , Σ , η ) at (Λ t , Σ , η ) . Remark 5. The structur e of the lik eliho o d and prior actually provi des insight as to when the MCMC will not con verge to the righ t distribu tion o win g to bad mixing. T o this end, we sk etc h a heuristic argu- men t b elo w , whic h will b e supp orted in Section 3.4 b y MCMC ﬁgures: (1) The maximized lik eliho o d remains the same along the wh ole path, b ecause the path makes a one-to-one transform ation of the parameter space as giv en b elo w. (2) If the MLE of λ h at t = 1 is ˆ λ h , then the MLE at t = t ′ is ˆ λ h t ′ (sub ject to v ariation du e to MCMC at t wo diﬀerent p oin ts at the p ath), which go es to inﬁnity as t go es to zero. This happ ens as th e ˆ λ h re- mains the v ector among λ ′ h (where λ ′ h is the MCMC 8 R. DUTT A AND J. K. GHOSH sample fr om model M t at t ) h a ving the highest max- im um lik eliho o d. Hence, as t → 0, π ( ˆ λ h /t ) → 0 at a rate determined b y the tail of the prior. The conﬂict b et wee n prior and m aximized likelihoo d ma y also b e view ed as a conﬂict b et we en the nested models, with the prior fa vo ring the parsimonious smaller mo del. This inherent conﬂict in mo del selection seems to ha v e the follo wing implications for MCMC. W e exp ect to see a range (sa y , [ t 1 , t 2 ]) n ear zero sho wing a conﬂict b et wee n prior and maximized lik e- liho o d. Deﬁnitely the p oin ts t 1 and t 2 are not we ll sp eciﬁed, but we treat them as such s o as to un der- stand some und erlying issues of mixing and conv er- gence here. On th e set of p oin ts t > t 2 the MCMC samples are exp ected to b e aroun d the p oin ts max- imizing lik eliho o d, whereas for t < t 1 they will b e nearly zero due to the concen tration aroun d a v alue λ h whic h is b oth the prior mo d e and the mle und er M 0 , namely , λ h = 0. But for any p oin t in the range [ t 1 , t 2 ], they will s p an a huge p art of the parameter space, ranging from p oints maximizi ng lik eliho o d to ones ha ving higher pr ior probability , sho wing a lot of ﬂu ctuations f r om MCMC to MCMC. The MCMC outputs in S ection 3.4 sh o w b oth clusters but ha v- ing highly ﬂuctuating v alues (Figure 1 , Section 3.4 ) for the prop ortions of the clusters. Equation ( 2.7 ) tells us that the score function is prop ortional to λ ′ h t (where λ ′ h is the MCMC sample from mo del M t at t ). Hence, we will see E t ( U ) as an increasing function while t → t 2 from the righ t-hand side [(2) in Remark 5 ]. This leads to a lot of v ariation of the estimate of E t ( U ) for diﬀeren t MCMC sam- ples in the r ange [ t 1 , t 2 ] as explained abov e. Also, as explained ab o v e, for t < t 1 , the score function will concen tr ate near zero. The width of the zone of conﬂict (here t 2 − t 1 ) will shrink, if w e h a v e a relativ ely strong deca ying tail of the pr ior. On the other hand, for h ea vy-tailed pri- ors we ma y see th ese ab o v e mentioned ﬂuctuations for a longer range, causing a loss of mass fr om the ﬁnal in tegration. These pr oblems are aggra v ated b y the high dimension of the problem and the diﬀuse spread of the p rior on the h igh-d imensional space. This ma y mean the usu al BF estimated by PS will b e oﬀ b y an order of magnitude. W e will see the im- plications reﬂ ected in some ﬁgures and tables in the next section, wh en w e stu dy P AMP-PS for factor mo dels in detail in Section 3 . Remark 6. W e hav e c hec ke d that adaptiv e c h oice of grid p oints by Lefebvre et al. ( 200 9 ), whic h impro ve s accuracy in their t wo exa mples with GMP , do es not help in the case of the very large ﬂu ctu- ations describ ed ab o v e. It seems to u s that adap- tiv e c hoice w ould w ork b etter when the t wo mo d- els tested are less dissimilar than the mod els in Re- mark 5 , for exa mple, wh en the s m aller of tw o nested mo dels is true (Section 3.1 ) or when our p rop osed mo diﬁcation of PS is used (Section 3.3 ). How ev er, w e h a ve not veriﬁed this because eve n without adap- tiv e c hoice, our new p rop osal w ork ed quite we ll in our examples. W e n ote in passing that in b oth the examples of Lefeb vre et al. ( 2009 ), the tw o mo dels b eing tested ha v e maximum lik eliho o ds that diﬀer by ﬁf teen in the log scale, whereas for the mo dels in Remark 5 they diﬀer b y muc h more, ov er a hundred. 3. WHA T DO ACTUAL COMPUT A TIONS TELL US? F ollo wing the discussion in the previous section, w e would like to study the eﬀects of the th eoretical observ ations in the previous section for the imp le- men tation of path sampling. Here we only consider the P AMP for PS, and for notational conv enience w e will menti on it as ju st PS . After studyin g esti- mated BF’s in s ev eral simulat ed data sets (not re- p orted here) f rom v arious factor mo dels, we note a few salient features. Error in estimation of the BF or the discrepancy b etw een diﬀerent methods tend s to b e relativ ely large, if one of the follo wing is true: the data has come from the complex mo del rather than the simpler m o del, the pr ior is relativ ely diﬀuse or the v alue of th e p r ecision parameters are relativ ely small. Diﬀerent subsections study wh at happ ens if the complex or simpler mo del is tru e, the eﬀect of the p rior, the grid size and the MCMC size. These are don e in Sections 3.1 – 3.3 . In Section 3.3 we in tro du ce a new PS sc heme, whic h op erates through a c h ain of paths, eac h path in v olving t wo nested mo d els with a small c hange b et wee n the con tiguous pairs. The new sc heme is denoted as P ath Sampling with Small Changes (PS- SC). The eﬀect of precision p arameters will also b e studied in t his sub section for PS-S C. Then w e study the MCMC samples and try to u nderstand their b e- ha vior f rom the p oint of view of exp laining the dis- crepancy b et we en diﬀeren t metho d s for estimating Ba yes factors and why PS -S C do es b etter than PS in Section 3.4 . Our simulate d data are similar to those of Ghosh and Du nson ( 2009 ) but ha v e diﬀerent parameters. BA Y ES MODEL SELECTION 9 T able 2 L o ading factors use d for simulation F actor 1 0.89 0 0.25 0 0.8 0 0. 5 F actor 2 0 0.9 0.25 0.4 0 0.5 0 T able 3 Diagonal entries of Σ 0.2079 0.19 0.15 0.2 0.36 0.1875 0.1875 We use a 2 -factor mo del and a 1 -f actor mo del as our c omplex mo del M 1 and simpler mo del M 0 , r e- sp e ctively, to demons tr ate the underlying issues. The loading parameters a nd the diagonal entrie s of the Σ matrix are giv en in T ables 2 and 3 . In sim ulation we tak e mod el M 0 or M 1 as true but Σ is not c hanged. Of course, if the one-factor mo del M 0 is tru e, then since it is nested in M 1 , M 1 is also tru e. 3.1 Issues in Complex (2-Facto r) Mo del W e will study the eﬀect of grid size, prior and the b ehavio r of MCMC, k eeping in mind Theorem 2.1 and the r emarks in S ection 2 . F or path sampling with the P AM path, w e no w discuss the eﬀect of the prior and th e t wo tunin g parameters, namely , the eﬀect of the grid size and MCMC size, on the esti- mated v alue of th e BF and their standard d eviation. F ollo wing the discussion in Remarks 3 and 4 , w e kno w that lim t → 0 E t ( U ) is ﬁnite and path sampling con v erges u nder some ﬁ nite momen t assu mption for the prior. The prior considered in PS b y Ghosh and Dunson ( 2008 ) are Cauch y and half-Cauc h y , which do not h a ve an y ﬁnite momen ts and so U is not inte- grable. W e therefore choose a relativ ely diﬀuse prior, but w ith enough ﬁnite moments for U . F or ﬁnite mean and v ariance one needs a t with at least fou r degrees of freedom. Our f a vorites are t -distrib utions with 5 to 10 degrees of freedom. W e sh o w resu lts for 5 and 10 d.f. only . But we ﬁ rst explore the sen- sitivit y of the estimate to changes in d .f . of the t - distribution as pr ior, o v er a range of 1 through 90. The BF v alues c hange considerably until we reac h ab out 40 d .f . and then it s tabilizes. In T ab le 4 we rep ort the log BF v alues estimated for 5 d ata sets sim ulated from a 2-factor mo del using diﬀerent pri- ors. T he b eha vior of the estimated log BF with the c h ange of d.f. conti nuously f rom 1 to 100 is sho wn in Figure 1 for the 3rd d ata set. W e can see th e estimate of the BF changing with the change in the pattern of the tail of t he prior. The T able 4 P AM-PS: Dep endanc e of log BF 21 over prior, 2-factor m o del true PS using grid size 0.01 t 1 t 5 t 10 t 90 normal 2.62 14.42 22.45 70.20 70.25 3.67 11.90 21.39 68.70 68.72 3.00 13.43 21.31 47.06 47.21 4.29 13.17 18.49 48.03 48.13 4.20 13.11 18.48 47.70 47.74 Fig. 1. Dep endanc e of log BF 21 over prior for 3r d data set . eﬀect of the grid size and MCMC size on MCMC- standard deviation of the estimate are s tu died, us- ing p riors t 10 and N (0 , 1) and rep orted in T able 5 . W e rep ort the mean of the estimates f ou n d from 25 diﬀeren t MCMC runs and the corresp onding stan- dard deviation as MCMC-standard deviation. T he study has b een done on th e 2nd of the 5 d ata sets sim ulated from mo del 1 earlier. T able 5 P AM-PS: Dep endenc e of log BF 21 (MCMC-standar d deviation) estimates over grid size and MCM C size, 2-factor m o del true Grid size 0.01 0.001 MCMC size Prior Data 2 Data 2 5000 t 10 21.26 (1.39) 21.26 (1.29) N (0 , 1) 66.89 (4.15) 67.21 (3.28) 50,000 t 10 23.71 (1.21) 23.57 (0.52) N (0 , 1) 68.21 (3.62) 68.23 (3.11) 10 R. DUTT A AND J. K. GHOSH T able 6 P AM-PS: Dep endenc e of log BF 21 (MCMC-standar d deviation) estimates over grid size and MCM C size, while 1-factor m o del true Grid size 0.01 0.001 MCMC size Prior Data 1 Data 1 5000 t 10 − 4 . 26 (0.054) − 4 . 27 (0.0 44) N (0 , 1) − 4 . 62 (0.052) − 4 . 60 (0.0 51) 50,000 t 10 − 4 . 24 (0.012) − 4 . 24 (0.0 07) N (0 , 1) − 4 . 60 (0.006) − 4 . 62 (0.0 05) As exp ected, T able 5 shows a ma jor increase of MCMC size and ﬁner grid size reduces the MCMC- standard deviation of the estimator. The diﬀerence b et wee n the mean v alues of BF estimated by t 10 and N (0 , 1) diﬀer b y an order of magnitude. W e will s tudy these issues as w ell as sp ecial p atterns exhibiting MCMC in Section 3.4 . Though the d if - feren t v ariants of PS compared h er e d iﬀer in their estimated v alue of BF, they still choose the correct mo del 100% of the time. 3.2 Issues in Simpler (1-Facto r) Mo del No w we study the scenario when the 1-factor mo del is true fo cusin g on the eﬀect of pr ior, grid size and MCMC size on th e e stimated Bay es fact or (T able 6 ). In this scenario the estimates do n ot c hange muc h with the change of prior, so we w ill rep ort the esti- mates for prior t 10 and N (0 , 1 ) with diﬀeren t v alues of MCMC size and grid size. This table sho ws us that the MCMC-standard de- viation improv es w ith the ﬁ ner grid size and large MCMC size as exp ected, b ut th e e stimated v alues of BF 21 remain mostly the same. As noted earlier, PS c h o oses the correct mo d el 100% of the time w hen M 0 is true. W e explain ten tativ ely why the calculatio n of BF is relativ ely stable when the lo wer dim mo del M 0 is true. S ince M 0 is n ested in M 1 , M 1 is also true in this case, w hic h in t urn implies b oth max lik eliho o ds (under M 0 and M 1 ) are similar and smaller than for data coming from M 1 true (but not M 0 ). This tends to reduce or at least is asso ciated with the r eduction of the conﬂict b et ween the tw o mo dels or p rior and lik eliho o d along the path ment ioned in Remark 5 . Moreo ver, the score fun ction for small t causes less p roblem since for d ata under M 0 , λ ′ 2 is rela- tiv ely small compared with that for data generated under M 1 . So w e see w hen t wo m o dels are c lose in some sense, w e exp ect their likeli ho o d r atio will not ﬂ uctuate widely provi ded the parameters from the t wo pa- rameter spaces are prop erly aligned, for example, if found by m in imizing a K-L diverge nce b et we en the corresp on d ing densities or taking a simple pro- jection from th e bigger space to the smaller sp ace. This is lik ely to mak e imp ortance samp ling more stable than if the tw o mo dels we re v ery diﬀerent. It seems plausible that this stabilit y or its lac k in the calculati on of BF will also show u p in metho ds lik e PS that are deriv ed from imp ortance sampling in some w ay . In genious mod iﬁcations of imp ortance sampling seems to mitigate but not completely solv e the problem. F ollo wing this idea of closer mo d els in some sense, we mo dify PS in a s imilar manner b e- lo w . 3.3 P ath S ampling with Sma ll Changes: Prop osed Solution In Remark 5 , Section 3.1 , a pr ior-lik eliho o d con- ﬂict w as identiﬁed as a cause of p o or mixing. Th is will b e re-examined in the n ext sub s ection. In the present subsection we prop ose a mod iﬁcation of PS whic h tries to solve or at least reduce the magnitude of this problem. T o solv e this p roblem without havi ng to give up our diﬀu se prior (w e will b e u sing t w ith 10 d.f. as our prior), we try to redu ce th e problem to a se- ries of one-dimensional problems so th at the com- p eting mo dels are close to eac h other. W e calcu- late th e Ba y es factor b y usin g th e p ath sampling step for ev ery single parameter th at may b e zero, k eeping others ﬁxed. It is easily seen that the origi- nal log Ba yes factor is the sum of all the log Ba y es factors estimate d in these smaller steps. W e denote this pro cedure as PS-SC (P ath Samp lin g w ith Small Change) and implemen t with the parametric arith- metic mean path (P AMP). (As p ointe d out by a Referee, there is scop e for exploring other paths, in- cluding a s earc h for an optimal one, to r educe th e MCMC-v ariance.) More f ormally , if we consider λ 2 as a p -dimensional vec tor, then M 0 and M 1 diﬀer only in the last p − 1 p arameters, as λ 21 is alw a ys zero d ue to the upp er-triangular condition. W e con- sider p mo dels M ′ i : i = 1 , . . . , p , where for mo del M ′ i w e ha v e ﬁrst i p arameters of λ 2 b eing zero cor- resp ond ingly . If we deﬁn e BF ′ i,i +1 = m i ( x ) m i +1 ( x ) , when m i ( x ) is the m arginal for the mo del M ′ i , then log BF 21 = p − 1 X i =1 log BF ′ i,i +1 . BA Y ES MODEL SELECTION 11 T able 7 log BF 21 (MCMC-standar d deviation) estimate d by P AM-PS-SC and P AM-PS T ru e model MCMC size PS-SC PS ( t 10 ) PS ( N (0 , 1)) 1-factor 5000 − 8 . 09 (0 . 013) − 4 . 26 (0 . 054) − 4 . 62 (0 . 052) 1-factor 50, 000 − 8 . 08 (0 . 0067) − 4 . 24 (0 . 012) − 4 . 60 (0 . 0065 ) 2-factor 5000 80 . 14 (0 . 66) 21 . 26 (1 . 39) 66 . 89 (4 . 15) 2-factor 50, 000 80 . 75 (0 . 54 ) 23 . 71 (1 . 21) 68 . 21 (3 . 62) So w e p erform p − 1 path sampling computations to estimate log BF ′ i,i +1 , ∀ i = 1 , . . . , p − 1. And for eac h of the steps the score function will b e of the follo wing form: U ′ i (Λ , Σ , η, Y , t ) = n X j =1 ( y j − Λ t η i ) ′ · Σ − 1 (0 p × ( h − 1) , [0 i ; λ h,i +1 ; 0 p − i − 1 ]) η i , where Λ t = ( λ 1 , [0 i ; tλ 2 ,i +1 ; λ 2 , ( i +2 ,...,p ) ]). As in the case of the small mo del tr u e, the max lik eliho o ds under b oth mo dels are close, and gener- ally the tw o mo dels are close, suggesting ﬂuctuations are less lik ely and true BF is not v ery large. Th is seems generally to lead to stabilit y of computation of BF. Also, the p arameter λ ′ 2 is no w one d imensional. So the score function is more lik ely to b e sm all than when λ ′ 2 is a v ector as under PS. W e also noti ce that in eac h step the score function is not an ymore pro- p ortional to λ ′ 2 t but rather to λ ′ 2 i t whic h will b e m uch smaller in v alue, hence reducing the ﬂuctuation and loss of mass. Computational implementa tion sho ws it to b e sta- ble for diﬀerent MCMC size and grid size regard- ing MCMC-standard deviation and also pr o duces a smo oth c urve of E t ( U ) for ev ery single step. Here w e use an MCMC size of 5000 / 50,000 and grid s ize of 0.01 for our study and report th e corr esp ondin g esti- mated BF v alues for tw o data sets from 1 -factor and 2-factor mo dels, r esp ectiv ely . Th e MCMC-standard deviation of the estimates along with the mean of the estimated v alue o ver 25 MCMC runs are re- p orted in T able 7 . PS-SC has smaller standard devi- ation than PS under b oth M 0 and M 1 . In Section 2 and Section 3.4 , we argue th at, at least under M 1 , PS-SC pr o v id es a b etter estimate of BF. No w w e see th e eﬀect of c hanging th e precision p a- rameters keeping the facto r loadings as b efore. T he diagonal e ntries of Σ are in T able 8 . The p recision o f T able 8 Diagonal entries of Σ in the 3 diﬀer ent mo dels: the ﬁrst one is mo diﬁe d fr om Ghosh and Dunson ( 2008 ) Model 1 0 . 2079 0.19 0.15 0 . 2 0 . 36 0 . 1875 0 . 187 5 Model 2 0 . 553 0.52 0.48 0 . 54 0 . 409 0 . 55 0 . 54 Model 3 0 . 73 0.71 0.67 0 . 7 0 . 59 9 0 . 67 0 . 72 these 3 mo d els lie in the ranges of [2.77, 6.5 5], [1.7 9, 2.44], [1.36, 1.66], resp ectiv ely . W e stud y PS -SC for 6 data sets generated from the 3 mo d els (2 data sets with n = 100 from eac h mo del: Data 1 from 1-factor and Data 2 from 2- factor mo del) and rep ort the estimate d Bay es factor v alue in T able 9 . The eﬀect of p recision parameters is seen on the estimated v alue of the Ba ye s F actor (BF), more p romi- nen tly when the 2-factor mo d el is tru e. Generally , the absolute v alue of the BF decreases with the de- crease in the v alue of the precision parameters. F or the smaller v alue of precision parameters, we exp ect the mod el selection to b e less conclusiv e, explaining the pattern sho wn in the estimated BF v alues. Under M 1 , PS is often bad in estimating the Ba ye s F actor ( BF 21 ), b ut since the true Ba y es factor is large, it u sually c ho oses the true m o del as often as PS-SC. When M 0 is true, PS is m u ch b etter in esti- mating the Ba y es factor, bu t since the Ba ye s factor is usu ally not that large, it do es n ot choose M 0 all the time. The p robabilit y of choosing M 0 correctly dep end s on the data in addition to the tru e v alues of the parameters. PS-SC do es b etter than PS in all these cases; it estimates BF 21 b etter and c ho oses the correct mo del equally or more often. Th e sense in whic h PS -S C estimat es BF 21 b etter has b een dis- cussed in detail earlier in this section. Un der M 0 PS- SC estimat es BF 21 b etter b y ha vin g a smaller, that is, more negativ e, v alue than PS. 3.4 Issues Rega rding MCMC S ampling This sub section is b est read along w ith the re- marks in Section 2 . W e ﬁr st stu dy the graph of E t ( U ) and the lik elho o d v alues for the MCMC sam- 12 R. DUTT A AND J. K. GHOSH T able 9 log BF 21 (MCMC-standar d deviation) estimation by PS-SC: eﬀe ct of pr e cision p ar ameter Mod el T ru e model Data PS-SC PS ( t 10 ) Model 1 1- factor Data 1 − 8 . 09 (0 . 012) − 3 . 84 (0 . 055) 2-factor Data 2 71 . 59 (0 . 66) 19 . 81 (1 . 38) Model 2 1- factor Data 1 − 11 . 01 (0 . 0066) − 3 . 09 (0 . 0277 ) 2-factor Data 2 51 . 41 (0 . 3658) 2 . 8 (1 . 9104) Model 3 1- factor Data 1 − 5 . 13 (0 . 0153) − 2 . 6 (0 . 0419) 2-factor Data 2 3 . 975 (0 . 0130) 2 . 2 (0 . 3588) ples at t for both the t 10 and N (0 , 1) prior (Figures 2 and 3 ). W e will plot the likeli ho o d as a scalar pro xy since we can n ot sh o w ﬂ uctuations of the ve ctor of factor loadings in the MCMC output. The clusters of the latter can b e inferred f rom the clusters of the former. We wil l ar gue that ther e ar e two clusters at e ach grid p oint and the mixing pr op ortion of the two clusters has a deﬁnite p attern. Under the true 2-factor mo d el M 1 , denote λ ′ = [ λ ′ 1 , λ ′ 2 ], where λ ′ i is the loading for th e corresp ond- ing laten t f actor under M t . Here λ ′ 2 is a 7 × 1 v ector and b ecomes zero, as it approac h es M 0 from M 1 (as t → 0). The p osterior distribution a t eac h M t can b e view ed roughly as a sort of m ixture mo d el with tw o comp onent s rep r esen ting M 0 and M 1 , the form of the lik eliho o d as giv en in T h eorem 2.1 . In the dia- gram (Figure 4 ) of th e log-lik eliho o d of MCMC sam- ples, we see t w o clear clusters around log-lik eliho o d v alues − 850 a nd − 925, represen tin g MCMC outpu ts with n onzero λ ′ 2 and zero λ ′ 2 v alues, resp ectiv ely . W e ma y thin k of them as coming from the comp onen t corresp ondin g to M 1 (cluster 2) and the comp onent corresp ondin g to M 0 (cluster 1). Samples of b oth clusters are pr esen t in the r ange [0.03, 0.2], wh ile samples ap p ear to b e predominant ly from cluster 2 unt il t = 0 . 1. A go o d r ep resen tation of samples from cluster 1 are on ly present in the range [0, 0.1]. In the range [0.03, 0.2], b oth clusters o ccur w ith pro- p ortions v arying a lot. Moreo v er, here the magni- tude of the score function is p rop ortional to λ ′ 2 t . W e see these ﬂuctuations in Figure 4 in the r egion [0. 03, 0.2]. This is also br ough t out by the MCMC stan- dard deviat ion of E t ( U ) which ar e of or der of 30–50 in the lo g sc ale . W e notice the absence of an y s amp les from M 1 for t < 0 . 03, except some chaoti c repr esen tation for a few r an d om v alues of t (notice in th e ﬁgur e, a spike represent ing samples fr om M 1 at t = 0 . 016), clea rly represent ing p o or mixing of MC MC samples near the mo del M 0 . Fig. 2. E t ( U ) for prior t 10 and N (0 , 1) , 2-factor mo del is true. BA Y ES MODEL SELECTION 13 Fig. 3. L o g -likeliho o d for prior t 10 and N (0 , 1) , 2-factor mo del is true. The new metho d PS -SC stabilizes the estimated Ba yes factor v alue with a ve ry small MCMC -stand - ard d eviation. Here w e c h ec k through Figures 5 an d 6 that it a voi ds p r ior-lik eliho o d conﬂict and the pr ob- lem ab out mixing for MCMC samp les seen f or the standard PS. W e concen trate our study for the ﬁrst step of PS-SC. In this step only the ﬁr st comp onen t of λ ′ 2 , λ ′ 22 con v erges to zero as t → 0. So her e we consider the sp read of the MCMC sample of λ ′ 22 for diﬀeren t v alues of t near t = 0, fr om b oth PS and PS-SC in Figures 5 and 6 b y considering the h is- togram of MCMC sample of λ ′ 22 . W e can easily n o- Fig. 4. E t ( U ) and L o g -likeli ho o d f or prior t 10 in the r ange t ∈ [0 , 0 . 2] , 2-factor mo del is true. 14 R. DUTT A AND J. K. GHOSH Fig. 5. Histo gr ams for λ ′ 22 for diﬀer ent values of t ne ar t = 0 (MCMC size use d 50,000), using PS. tice that t he spread of the MCMC sample ﬂuctuates in b et ween the tw o mod es in a c haotic manner sho w- ing p o or or un stable mixing for PS, whereas PS-S C samples come from b oth the clusters and slowly s hift to ward the prior mo de as t → 0. W e h a ve also stud- ied but d o not rep ort s im ilar n ice b eha vior regarding mixing of MCMC of PS-SC for the data simula ted from the 1-fac tor mo del. The p o or m ixing discus s ed ab ov e for MCMC out- puts for PS will no w b e illustrated with plots of auto correlation for λ ′ 22 for diﬀeren t lags (Figure 7 ). F or the sak e of comparison, w e do the same for PS- SC (Figure 8 ). C learly , except v ery near t = 0, that is, in what we ha v e called the c haotic zone, the au- to correlations for PS are m uc h bigger than those for PS-SC. Ho w ev er, near t = 0 , though plots in b oth Figures 7 and 8 are small, those for PS are sligh tly smaller. W e hav e no simple explanation for this. P o or m ixing seems to lead to missing mass and random ﬂuctuations f or calculat ions for E t ( U ). This probably explains the discrepancy we h a v e noticed Fig. 6. Hi sto gr ams for λ ′ 22 for diﬀer ent values of t ne ar t = 0 (MCMC size use d 50,000), using PS-SC. BA Y ES MODEL SELECTION 15 Fig. 7. Auto c orr elation f or λ ′ 22 for diﬀer ent values of t ne ar t = 0 (MCMC size use d 50,000), using PS. in the estimation o f BF by PS as compared with PS- SC. W e now lo ok at auto corr elations for a ﬁrst factor loading in Figure 7 and seco nd factor loa ding in Fig - ure 8 . Th e top ro ws in eac h of the t wo ﬁgures sh o w zero autocorrelation, as they are v ery close to t = 0. On the other hand, h igh auto correlations are sho wn in the n ext t wo rows. W e b eliev e they corresp ond to what we called a c haotic region. The b ottom t wo ro ws of Figure 8 show small auto correlation. They corresp ond to the sec ond f actor loading wh ic h comes only in model 2, and they also depict the zone dom- inated by mo del 2. Th e other ﬁ gure is in the same zone as in the previous lin e, but the v ariable consid- ered is a 1-fact or loading. Here autocorrelation also ev en tually tends to 0, but its v alues are bigger than in Figure 8 . W e do not h a ve an y simple explanation for this higher auto correlation. The ab o ve discussion co ve rs the case when the more complex m o del is tr ue. If the simpler mo d el ( M 0 ) is true, as noted in Section 3.3 b oth PS and PS-SC p erform we ll in estimating the Bay es factor as we ll as c ho osing the correct mo del. The Ba y es factor based on PS-SC provides stronger supp ort for the true mo del than the Ba yes factor based on PS. T o chec k whether PSS C works well in other ex- amples as in the factor mo d el, we try to explore its impact on our earlier to y example. In this case, we w ere unable to implemen t path sampling with small Fig. 8. Auto c orr elation for λ ′ 22 for diﬀer ent values of t ne ar t = 0 (MCMC size use d 50,000), using PS-SC. 16 R. DUTT A AND J. K. GHOSH T able 10 Performanc e of PS and pseudo-PS-SC in toy example mo deli ng c ovarianc e: L o g Bayes factor (MCMC standar d deviation) Metho d Data 1 Data 2 T rue BF v alue 258 . 38 − 132 . 87 PS estimate of BF 184 . 59 (0 . 012) − 20 . 11 (0 . 008) pseudo-PSSC estimate of BF 195 . 35 (0 . 01 1) − 25 . 21 (0 . 007) c h anges, but rather used a pseud o-PSSC scheme. Going back to our example wh ere w e ha v e tak en m = 7 and p = 10, w e deﬁne a sequence of mo dels as the follo wing: M i : y i ∼ N  0 , Σ =  A 11 0 0 A 22  when A 11 is ( i × i ) matrix f or i = 7 , 8 , 9 , 10 . W e can s ee our p reviously deﬁn ed M 0 and M 1 are no w M 7 and M 10 , resp ectiv ely . F or our pseud o-PSSC, w e estimate log BF i,i +1 b y log BF b et we en the mo d- els M ′ 0 and M ′ 1 , with m = i and p = i + 1: M ′ t : y i ∼ N  0 , Σ =  A 11 tA 12 t ( A 12 ) ′ A 22  . Still being underestimates on eac h step, this metho d impro ve s on the standard p ath sampling in terms of Ba yes factor estimation, as we can s ee in the T a- ble 10 . 4. IMPLEMENT A TION OF OTHER METHODS W e hav e explored several metho ds of estimating the ratio of norm alizing constan ts, for example, the metho ds of Nielsen ( 2004 ), DiCiccio et al. ( 1997 ), Rue, Martino and Chopin ( 2009 ) and Chib ( 1995 ). The metho d of Rue, Martino and Chopin ( 2009 ) mo dels a link function of m eans, b ut here we are concerned w ith mo dels for the v ariance–co v ariance matrix. W e could not u se Chib ’s metho d here since for our paramete r exp anded prior the fu ll condition- als of the original mo d el parameters are not av ail- able. But we were able to im p lemen t the determin- istic v ariational Ba y es metho d of Nielsen ( 2004 ) and the Laplace ap p ro ximation with a correction due to DiCiccio et al . ( 199 7 ). Since the results w ere not sat- isfactory , w e do not r ep ort them in this pap er. In the v ariational Ba ye s appr oac h, the method selected the correct mo del approxima tely 80% of th e time, but the estimated logBF v alues w ere considerably ov er (or under) estimated. The v ariational Ba y es metho d is wo rth fu rther study , p ossibly with suitable mo d- iﬁcations. I t app ears to us it is still not understo o d when Belief Pr opagatio n provides a go o d approxi- mation to a marginal or n ot, f or example, Gamarnik, Shah and W ei ( 2010 ) commen ted: Only r e c ently we have witnesse d an explosion of r ese ar ch for the or et- ic al understanding of the p erformanc e of the BP al- gorithm in the c ontext of various c ombinatorial op- timization pr oblems, b oth tr actable and intr actable (NP-har d) ve rsions . F ollo wing the discus sion in Section 2.4 , we hav e implemen ted the GMP-PS. Here the marginal for b oth mo dels is estimate d by constructing a p ath b e- t w een the p rior distrib ution to the p osterior distri- bution of the mo del. Due to very high-d im en sionalit y of the mo d el, the mo d es of prior and p osterior dis- tribution are far apart. So as discuss ed b efore, the MCMC sampling along the path fails to sa mple smo othly and ﬂuctuates b etw een the t wo mo des in a c h aotic wa y near the p r ior mo de. Hence, th e es- timate of the m arginal of b oth th e mo dels b ecomes v ery unstable. Due to the p o or estimation of BF, this metho d also fails to choose the correct mo d el v ery often. As in the case of GMP-PS, the AIS with the GM path also did not work w ell. Hence, w e im- plemen ted the AIS with the P AM-path. Implemen- tation of P AM-AIS is also v ery time in tensive , so w e h av e only implemen ted P AMP-AIS with MCMC sample size 5000. P AM-AIS not only sho ws v ery high MCMC-standard deviation, bu t it also fails to c h o ose th e correct mo del many a time, w hen the 2- factor mo del is correct. The last method s we lo oke d at are the follo w in g: (1) Imp ortance Sampling (IS). (2) Newton-Raftery app ro ximation (BICM). (3) Laplace/B IC type appro ximation (BICIM). IS is the most easy to implement and sho ws mo d - erately go o d results in c ho osing the correct mo d el ( Ghosh and Dunson , 2008 )). W e s tu dy the stabil- it y of Ba yes factor v alues estimated by IS with the c h ange of the MCMC size in T able 11 . Similarly , we also study the stabilit y of the es- timates of the Ba y es f actor b y BICM and BICIM (explained in A.1.3 in the App endix) using MCMC sample size 10,000, w here b oth of these metho ds sho w signiﬁcan tly less amount of MCMC-standard deviation than other methods considered. He nce, w e will only consider PS -SC, BICM and BICI M to ex- plore mo del selection for a dimen s ion m uch higher than pr eviously considered. BA Y ES MODEL SELECTION 17 T able 11 Study of IS, BI CM and BICIM for diﬀer ent MC MC size: Estimate d Bayes f actor (MCMC standar d deviation) Metho d(MCMC-size)/ true model 2-factor model 1-factor model IS (10,000) 109 . 78 (168 . 72) 0 . 0749 (0 . 1063) IS (50,000) 97 . 12 (61 . 25) − 5 . 39 (84 . 35) IS (100,000) 86 . 92 (110 . 35) − 3 . 07 (10 . 41) IS (200,000) 83 . 66 (58 . 53) − 2 . 69 (2 . 96) BICM (10,000 ) 68 . 66 (0 . 93) − 5 . 72 (0 . 62) BICIM (10,000) 67 . 9 (0 . 11) − 5 . 3 (0 . 57) PS-SC (5000) 80 . 75 (0 . 63) − 8 . 08 (0 . 0013) 5. EFFECT OF PR ECIS ION P ARA METERS AND HIGH-DIMENSIONAL (SIMULA TED AND REAL) D A T A SET Our goal is to explore if PS-S C ma y b e made more eﬃcien t by co mbining with BICM and BIC IM and also to explore the n u m b er of dimens ions m u c h higher than b efore and the real life examples. In the examples in this section, p v aries from 6 to 26. W e h a v e 2 examples of real life examples with p = 6 and 26 and a simulat ed examp le with p = 20. As exp ected, PS -SC still tak es a long time, even with a p arallel pro cessing for high-d im en sional ex- amples. W e exp lore whether PS-SC can b e com b in ed with BICM and BICIM to substantia lly redu ce time, since their p erformance seems m uch faster than PS- SC. W e compare the b ehavio r of these metho ds for a higher-d imensional mo del an d for some real d ata sets tak en from Ghosh and Dunson ( 2009 ) and Ak aik e ( 1987 ). W e ﬁ rst consid er one 3-factor mo del with p = 20 and n = 100 in T able 12 . W e n otice that all the metho ds are selecting cor- rect mo dels for all the 3 data sets, but based on our earlier discussion of PS-S C, we b eliev e only this metho d p ro vides a reliable estimate of BF. No w we will compare the metho d s f or some real data s ets. W e c ho ose t wo data s ets: “Ro dent Organ Data” from Ghosh and Dunson ( 2009 ) and “26-v ariable Psyc hological Data” from Ak aik e ( 1987 ). These data sets ha ve b een norm alized ﬁrst b efore analyzing them further. W e not only study the estimated Ba yes fac- tor b u t also th e mo del c hosen by them. In the “Ro dant Or gan Data” the mo del chosen b y PS-SC and other metho d s are, resp ectiv ely , the 3-factor m o del and 2-factor mo del (T able 13 ). F or the “26-v ariable Ps yc hological Data,” PS-S C and BICM/BICIM choose the mo d el with 3 factors and 4 factors, resp ectiv ely (T able 14 ). The mo dels c h o- sen by PS-SC and the other metho ds are close, b ut as exp ected diﬀer a lot in their estimate of BFs. There is still no rigorously pr o ved Laplace ap p ro x- imation for relativ ely high-dimensional cases b ecause of analytical diﬃculties. Problems of determining sample size in hierarchica l mo deling, p ointe d out b y Clyde and George ( 2004 ), are av oided b y b oth v ersions of our app ro ximations (App endix A.1.3 ). These t wo metho ds seem to b e go o d as a prelimi- nary searc hing metho d to n arro w the ﬁ eld of plau- sible mo dels b efore usin g PS -SC. T his sav es time relativ e to PS-SC for mo del searc h as seen in the previous examples. 6. CONCLUSION W e hav e studied PS f or factor mo dels (and one other toy example) and h a v e identi ﬁed the comp o- nen t of PS that is most lik ely to go wrong and where. This is p artly based on the fact that w e h a v e a rel- ativ ely simple suﬃcien t condition for factor m o dels (Theorem 2.1 ). Typically , f or the higher-dimens ional T able 12 Simulate d mo del ( p = 20 , n = 100 ) and ( k = the numb er of true factors): Com p arison of lo g Bayes factor Data BF PS-SC BICM BICIM Data1 ( k = 1) BF 21 − 25 . 91 (0 . 0233) − 32 . 68 − 38 . 01 BF 32 − 24 . 84 (0 . 0594) − 21 . 18 − 38 . 24 BF 43 − 22 . 79 (0 . 0483) − 19 . 81 − 43 . 77 Data2 ( k = 2) BF 21 225 . 81 (4 . 2099) 248 . 09 219 . 87 BF 32 − 23 . 61 (0 . 0160) − 23 . 59 − 46 . 17 BF 43 − 19 . 18 (0 . 0297) − 20 . 3 − 47 . 98 Data3 ( k = 3) BF 21 152 . 07 (1 . 7422) 185 . 45 162 . 3 BF 32 104 . 17 (2 . 5468) 198 . 1 168 . 54 BF 43 − 17 . 35 (0 . 0276) − 29 . 73 − 48 . 24 18 R. DUTT A AND J. K. GHOSH T able 13 R o dant or gan weight data ( p = 6 , n = 60 ): Comp arison of lo g Bayes factor Ba yes facto r PS -SC BICM BICIM log BF 21 4.8 26.34 21.57 log BF 32 10.52 − 3 . 14 − 10 . 01 log BF 43 − 3 . 28 T able 14 26-variable psyc holo gic al data ( p = 26 , n = 300 ): Comp arison of lo g Bayes factor Ba yes facto r PS-SC BI CM BICIM log BF 21 122 . 82 205 . 27 188 . 19 log BF 32 35 . 27 71 . 05 35 . 5 log BF 43 − 10 . 7 23 . 16 7 . 55 log BF 54 − 33 . 32 − 4 . 63 − 25 . 51 log BF 65 − 16 . 7 − 17 . 32 − 43 . 21 mo del the MCMC output for ﬁnding the in tegral along grid p oints in the path may b ecome quite unreliable at some parts of th e path. Some insight ab out why it happ ens and h o w it can b e r ectiﬁed has b een suggested. MCMC seems to b e un reliable for PS when the h igher-dimensional m o del is tru e. The problem is w orse the m ore the t w o mo d els dif- fer, as when a very h igh-dimensional mod el is b eing compared to a lo w-dimensional mo del. The suggestion for rectiﬁcation wa s based on the in tuition that PS, lik e imp ortance sampling itself, seems more reliable when the tw o marginal densi- ties in the Ba yes factor are r elativ ely similar, as is the case when the smaller of t wo nested mo dels is true. Based on this in tuition, we suggested P S-SC and j ustiﬁed PS-SC by comparing MCMC output and MCMC standard deviation of b oth PS-SC and PS. It is our b elief that the ab o v e in sigh ts as to w hen things will tend to go w rong and when not, will also b e v alid for the other general strategy for se- lection from among nested mo dels, namely , RJM- CMC. Piya s Chakrab ort y in Pu rdue is w orking on a c h ange p oin t problem in sev eral parameters wh ere Shen and Ghosh ( 2011 ) ha v e an accurate approxi - mation to the Ba y es factor, w hic h may b e used f or v alidation. He will exp lore small c hanges as w ell as adaptiv e MCMC. Our work has focused on mo del sele ction b y Bay es factors, w h ic h seems v ery natural since it pro vides p osterior probabilit y for eac h mo d el. Ho wev er, mo del selection is a complicated bu siness and one of its ma jor purp oses is also to ﬁ nd a mo del that ﬁts the data well. S ev eral mo d el selecting statisticia ns feel this should also b e done along with calculatio n of Ba yes factors. Ho wev er, there has not b een a goo d discussion on ho w one sh ould put together the ﬁ n dings from the t w o diﬀeren t approac hes. W e hop e to return to these issues in a futu r e comm u nication. A natur al futur e dir e ction of our study of factor mo dels is to add to the mo del an unknown me an ve c- tor with a r e gr ession setup. The pr oblem now would b e to simultane ously determine a p arsimonious mo del for b oth the varianc e–c ovarianc e matrix and the me an ve ctor. Ther e ar e natur al priors for these pr oblems, but c omputation of the Bayes factor se ems to b e a chal lengi ng pr oblem. APPENDIX A.1 Other Metho ds A.1.1 Imp ortanc e sampling S upp ose we h av e tw o densities p r op ortional to tw o functions f ( x ) and g ( x ), whic h are f easible to ev aluate at ev ery x , but one of the distributions, say , the one ind uced b y f ( x ), is not easy to sample. Th en the imp ortance samplin g (IS) estimate of the ratio of n ormalizing constan ts is based on m indep end en t dra ws x 1 , . . . , x m gener- ated from the distr ibution deﬁned by g ( x ) . W e ﬁrst compute the imp ortance we igh ts w i = f ( x i ) g ( x i ) and then deﬁne the IS estimate: 1 m m X i =1 w i . (A.1) Under the assumption that g ( x ) 6 = 0 when f ( x ) 6 = 0 , 1 m P m i =1 w i con v erges as m → ∞ to Z f / Z g , when Z f = R f ( x ) dx an d Z g = R g ( x ) dx are the normal- izing constants for f ( x ) and g ( x ). The v ariability of the IS est imate dep ends hea vily on the v ariability of the w eigh t functions. So to hav e a goo d IS estimate, w e need to h a ve g ( x ) as a go o d appro ximation to f ( x ), w hic h is diﬃcu lt to ac h iev e in problems with high or mo d erately high-dimens ional, p ossibly mul- timo dal d ensit y . Analysis of Ba yesian f actor mo dels using IS has b een in tro du ced b y Ghosh and Dunson ( 2008 ). The IS estimator of BF for f actor mo dels is based on m samples θ ( h ) i from the p osterior distrib ution, u nder M ( h ) d BF h − 1 ,h = 1 m m X i =1 p ( y | θ ( h ) i , k = h − 1) p ( y | θ ( h ) i , k = h ) , (A.2) BA Y ES MODEL SELECTION 19 whic h in turn is based on the follo wing iden tit y: Z p ( y | θ ( h ) , k = h − 1) p ( y | θ ( h ) , k = h ) p ( θ ( h ) | y , k = h ) dθ ( h ) = Z p ( y | θ ( h ) , k = h − 1) p ( θ ( h ) ) p ( y | k = h ) dθ ( h ) (A.3) = p ( y | k = h − 1) p ( y | k = h ) . Ghosh and Dun son ( 2008 ) implemented I S w ith a parameter expanded prior. They also ha v e noted that IS is fast and often (90%) chooses the correct mo del in sim ulation. In our simulati on IS c ho oses a true bigger mo d el correctly , b ut a 20% error rate w as observ ed when the smaller mo d el is true. A.1.2 Anne ale d imp ortanc e sampling F ollo wing Neal ( 2001 ), we consider d en sities p t : t ∈ [0 , 1] j oin- ing the densities p 0 and p 1 . W e c ho ose densities b y discr etizing the p ath p t ( i ) where 0 = t (1) < · · · < t ( k ) = 1 and then sim ulate a Mark o v c hain d esigned to con v erge to p t ( k ) . Starting from the ﬁnal states of the pr evious simulatio n, we simulat e some num- b er of iterations of a Mark o v chain designed to con- v erge to p t ( k − 1) . Similarly , w e sim ulate some itera- tions starting from the ﬁ n al steps of p t ( j ) designed to con verge to p t ( j − 1) unt il we sim ulate some itera- tions con ve rging to p t (1) . This sampling scheme pro- duces a sample of p oint s x 1 , . . . , x m and then w e compute the weig hts w i = p 1 ( x i ) p 0 ( x i ) . Then the estimate of the ratio of normalizing constant b ecomes as fol- lo w s : 1 m m X i =1 w i . (A.4) Notice that while b oth AIS and PS are based on MCMC ru ns along a path from one mo del to an- other, the MCMC’S are drawn at eac h p oin t, but the details are v ery diﬀeren t. Due to the b etter spread of MCMC samples, the estimates in AIS seem to b e b etter th an those calculate d by IS when the smaller mo del is tru e, h elping in correct mo del selec tion and also impro ving the estimation of Ba y es fact ors. Ho w- ev er, sim ulations show that AIS has the same prob- lem as IS in estimating the Bay es factor when th e bigger mo d el is tru e. A.1.3 BIC typ e metho ds: R aftery–Newton and our metho d using information matrix I n con trast to the metho ds previously discussed, w e try to directly es- timate the marginal u nder eac h mo d el and th en u se these marginals to ﬁ n d the Ba y es factor. W e kno w that BIC is an app ro ximation to the log-marginal based on a Laplace-t yp e approximati on of th e log- marginal ( Ghosh, Dela mpady and Saman ta , 2006 ), under the assumption of i.i.d. observ ations. Thus, log( m ( x )) ≈ log ( f ( x | ˆ θ ) π ( ˆ θ )) + ( p/ 2) log(2 π ) + ( p/ 2) log( n ) (A.5) + log ( | H − 1 1 , ˆ θ | 1 / 2 ) , where H 1 , ˆ θ is th e observed Fisher In formation ma- trix ev aluated at the maxim um lik eliho o d estimator using a s in gle observ ation. F or BIC w e just use log( m ( x )) ≈ log ( f ( x | ˆ θ ) π ( ˆ θ )) + ( p/ 2) log ( n ) (A.6) ≈ log ( f ( x | ˆ θ )) + ( p/ 2) log( n ) , ignoring other terms as they are O (1). It is kno wn BIC m ay b e a p o or approxi mation to the log-marginal in high-dimensions (Berger, Gh osh and Mukhopadh ya y , 2003 ). T o tak e care of this prob- lem, Raftery et al. ( 2007 ) suggest the foll o wing. Sim- ulate i.i.d. MCMC samples f rom the p osterior distri- butions, ev aluate indep endent sequence of log(prior × lik eliho o d) s (log-p.l.) { l t : t = 1 , . . . , m } , and then an estimate for the marginal is log( m ( x )) ≈ ¯ l − s 2 l (log( n ) − 1) , (A.7) where ¯ l and s 2 l will be the sample mean and v ariance of l t ’s. W e call this metho d BICM, f ollo wing the con v en tion of Raftery et al. ( 2007 ). In order to apply ( A.5 ), w e d o not n eed to ev al- uate n since it cancels b y com bining the last t w o terms. Th is suggests the app r o ximation ( A.5 ) tak e care of the point raised b y Clyde and George ( 2004 ). Ho wev er, ( A.7 ) do es use n , b ut w e d o not know th e impact on the appr o xim ation. W e ha ve also used th e Laplace appro ximation ( A.5 ) without any change as lik ely to work b etter th an the usual BIC. W e compute the Inf orm ation Matrix at the maximum prior × lik eliho o d (mpl) v alue under the mo del and imp ute its v alue in the computation of the marginal. T o ﬁnd the mpl estimate, we use the MCMC sample f r om the p osterior distribution and pic k the maxima in that sample. Then w e searc h for the mple in its neigh b orh o o d, using it as the starting p oint for the optimiza tion algorithm. In our sim u la- tion study , it has b een see n to giv e v ery go o d results similar to the computationally in tensiv e numerical algorithms used to ﬁnd the maximum of a fun ction 20 R. DUTT A AND J. K. GHOSH o ver the w h ole parameter space seen b y taking rep- etition of MCMC r uns and large MCMC samples. In the s pirit of Raftery et al. ( 2007 ), we call this metho d BICIM, ind icating th e use of In formation Matrix b ased Laplace Appro ximation. W e also used sev eral other mo d iﬁcations th at did not giv e go o d results, so are not r ep orted. A.2 A Theo retical Remark on the Lik eliho o d F unction It app ears that the b eha vior of the lik eliho o d, for example, its maxim um, plays an imp ortant role in mo del s electio n, sp eciﬁcally in the kind of conﬂict w e see b et w een PS and the Laplace appro ximations (BICM, BICIM) when the bigge r mo del is true (and the prior is a t with a relativ ely small d.f.). The b e- ha vior seems to b e diﬀerent from the asymptotic b ehavio r of maximum likeli ho o d u nder th e follo w- ing standard assumptions. Assume d imension of the parameter space is ﬁxed and usual regularit y condi- tions h old. Moreo ver, when th e b ig mo del is tru e but the small mo del is assum ed (so that it is a missp eciﬁed mo del), the Kullbac k–Liebler pro jec- tion of the tru e parameter space to the parameter space of the small mo del exists (Bun k e and Milhaud, 1998 ). F a ct. Assu me the b ig m o del is true, and the small m o del is false. Then, as ma y b e veriﬁed easily b y the T a ylor exp ansion, (1) log L ( ˆ θ big ) − log L ( θ true(big) ) = O P (1) (2) log L ( ˆ θ small ) − log L (KL pro jection of θ true(big) to Θ small ) = O P (1) (3) log L ( θ true(big) ) − log L (KL pr o jection of θ true(big) to Θ small ) = O P ( n ) and (4) log L ( ˆ θ big ) − log L ( ˆ θ small ) = log L ( θ true(big) ) − log L (KL pro jection of θ true(big) to Θ small ) (A.8) + O P (1) = O P ( n ) . The maximized like liho o d f or factor mo d els sub- stan tially o ve restimates the true likeli ho o d, u nlik e relation (1) ab o v e. Un fortunately , as p oin ted out in Drton ( 2009 ), the asymptotics of mle for factor mo d- els is still not fully wo rke d out. A.3 Mat rix Used fo r the T oy Example Σ 0 =                   128 . 35 5 2 . 69 − 19 . 25 − 11 . 86 2 4 . 34 52 . 69 73 . 37 − 21 . 04 − 37 . 85 12 . 29 − 19 . 25 − 21 . 04 30 . 8 6 8 . 63 − 1 . 41 − 11 . 86 − 37 . 85 8 . 63 80 . 49 4 . 66 24 . 34 12 . 29 − 1 . 41 4 . 66 15 . 45 8 . 80 8 . 74 − 13 . 58 3 . 26 2 . 58 10 . 63 15 . 60 − 3 . 03 − 49 . 2 4 2 . 05 13 . 75 12 . 09 − 11 . 64 − 9 . 68 3 . 7 2 − 7 . 40 − 14 . 0 8 21 . 2 8 2 2 . 18 − 1 . 3 1 − 29 . 80 − 17 . 27 22 . 0 5 8 . 52 − 7 . 87 8 . 80 1 0 . 63 13 . 75 − 7 . 40 − 29 . 80 8 . 74 1 5 . 60 12 . 09 − 14 . 08 − 17 . 27 − 13 . 58 − 3 . 0 3 − 11 . 6 4 21 . 28 22 . 05 3 . 26 − 49 . 24 − 9 . 68 2 2 . 18 8 . 52 2 . 58 2 . 05 3 . 72 − 1 . 31 − 7 . 87 31 . 37 1 1 . 62 − 4 . 8 5 − 16 . 89 − 20 . 10 11 . 62 5 8 . 09 7 . 0 0 − 19 . 5 8 5 . 16 − 4 . 85 7 . 00 26 . 59 − 3 . 0 4 11 . 17 − 16 . 89 − 19 . 58 − 3 . 04 3 1 . 81 22 . 86 − 20 . 10 5 . 16 11 . 17 22 . 8 6 6 4 . 68                   A.4 Choice of Prio r Under M 0 A r eferee h as ask ed wh ether un der M 0 , the prior for the extra parameter can b e c hosen in a same optimal or philosophically co mp elling manner. This has b een a long-standing problem, b ut the metho d follo wed for factor mo dels is one of the standard pro cedur es, apparen tly ﬁrs t su ggested b y Ed w ards, Lindman and Sa v age ( 1984 ). This prior is mentio ned by E dwa rds, Lind man and Sa v age ( 1984 ) and ma y b e justiﬁed as follo ws. One tries to ensure the extra parameter has similar roles under b oth the mo dels. If the join t p rior of ( θ 1 , θ 2 ) under M 1 is π ( θ 1 , θ 2 ) , then th e natural prior f or ( θ 2 | θ 1 ) is the usual conditional d en sit y of π ( θ 2 | θ 1 ) . In our case π ( θ 1 , θ 2 ) = π ( θ 1 ) π ( θ 2 ) . So π ( θ 2 | θ 1 ) is as we ha v e c h osen. This is one of the standard default c hoices. Another default c hoice is due to Jeﬀreys ( 1961 ), but when θ 1 , θ 2 are in dep end en t, b oth lead to th e same c hoice. If we in tr o duce a prior (e.g., min imiz- ing MCMC-v ariance), it ma y not b e acceptable to Ba yesia n ph ilosoph y . A CKNO WLEDGMENTS W e thank Joy ee Ghosh f or h elpin g us in d iscus- sions on factor mo dels and s haring her co d e and BA Y ES MODEL SELECTION 21 Andrew Lewando wski for th ou ght-pro voking com- men ts on an earlier draft. REFERENCES Akaike, H. (1987). F actor analysis and AIC. Psychometrika 52 317–33 2. MR0914459 Anderson, T. W. (1984). An Intr o duction to M ultivariate Statistic al Analysis , 2nd ed. Wiley , New Y ork . MR0771294 Andrieu, C. , Doucet, A. and Rober t, C. P. (2004). C om- putational adv ances for and from Bay esian analysis. Statist. Sci. 19 118–12 7. MR2082151 Bar tholomew, D. J. , Steele, F. , Moust aki, I. and Gabbrith, J. I. (2002). The Ana lysis and Interpr etation of Multivariate Data for So cial Scientists . Chapman & Hall, Boca Raton, FL. Berger, J. O. , Ghosh, J. K. and Mukhop adhy a y, N. (2003). Ap proximatio ns and consistency of Bay es factors as mo del dimension grows . J. Statist. Plann. Infer enc e 11 2 241–258 . MR1961733 Bunke, O. and Mi lhaud, X. (1998). A sy mptotic b ehavior of Ba yes estimates un der p ossibly incorrect mo dels. Ann. Statist. 26 617–644 . MR1626075 Chen, M.-H. , Shao, Q.-M. and Ibrahim , J. G . (2000). Monte Carlo Metho ds in Bayes ian Computation . Sp ringer, New Y ork. MR1742311 Chib, S. (1995). Marginal likelihoo d from the Gibbs outpu t. J. Amer. Statist. Asso c. 90 131 3–1321. MR1379473 Cl yde, M. and Ge or ge, E. I. (2004). Mod el uncertaint y. Statist. Sci. 19 81– 94. MR2082148 DiCiccio, T. J. , Kass, R. E. , Rafte r y, A. and W asser- man, L. (1997). Computing Bay es factors by com bin- ing simula tion and asymptotic app ro ximations. J. Amer. Statist. Asso c. 92 903–915. MR1482122 Dr ton, M. (2009). Likelihoo d ratio tests an d singularities. Ann . Statist. 37 979–1 012. MR2502658 Edw ards, W. , Lindman , H. and Sa v age, L. J. (1984). Ba yesian statistical inference for psyc hological research. In R obustness of B ayesian Analyses . Stud. Bayesian Ec ono- metrics 4 1–62. North-Holland, Amsterdam. MR0785366 F an, Y. , Wu, R. , Chen, M.-H. , Kuo, L. and Lewis, P. O. (2011). Choosing amo ng partition mo dels in Bay esian phy- logenetics. Mol. Biol. Evol. 28 523–532. Friel, N. and Pettitt, A. N. (2008). Marginal lik elihoo d estimation v ia p ow er p osteriors. J. R. Stat. So c. Ser. B Stat. Metho dol. 70 589–607. MR2420416 Gamarnik, D. , Shah, D. and Wei, Y. (2010). Belief prop- agation f or min-cost net w ork ﬂo w: Conv ergence & correct- ness. In Pr o c e e dings of the Twenty-First Annual ACM- SIAM Symp osium on Discr ete Algorithms 279–292. SIAM, Philadelphia, P A . MR2809676 Gamerman, D. and Lopes, H . F. (2006). Markov C hai n Monte Carlo: Sto chastic Simulation f or Bayesian Inf er- enc e , 2nd ed. Chapman & Hall/CRC , Bo ca Raton, FL. MR2260716 Gelman, A. (2006). Prior distributions for v ariance pa- rameters in h ierarc hical mo dels (comment on article by Bro wne and Drap er). Bayesian Anal. 1 515–533 ( elec- tronic). MR2221284 Gelman, A. and Meng, X.-L. (1998). Sim ulating nor- malizing constants: F rom imp ortance sampling to bridge sampling to path sampling. Statist . Sci. 13 163–185 . MR1647507 Gelman, A. , Carlin, J. B. , Ste rn, H. S. and R ubin, D. B. (2004). Bayesian Data Analysis , 2nd ed. Chapman & Hall/CR C, Bo ca Raton, FL. MR2027492 Ghosh, J. K. , Delamp ady, M. and Saman t a, T. ( 2006). An Intr o duction to Bayesian Analysis: The ory and Metho ds . Springer, New Y ork. MR2247439 Ghosh, J. and D unson, D. B. (2008). R andom Eﬀe ct and L atent V ariable Mo del Sele ction . L e ctur e Notes i n Statistics 192 . Springer, New Y ork. MR2761923 Ghosh, J. and Du nson, D. B. (2009). D efault prior dis- tributions and eﬃcien t p osterior computation in Ba yesian factor analysis. J. Comput. Gr aph. Statist. 18 306–320. MR2749834 Green, P. J. (1995). Reversible jump Mark ov c h ain Monte Carlo computation and Bay esian mo del determination. Biometrika 82 711–73 2. MR1380810 Jeffreys, H. (1961). The ory of Pr ob abi l ity , 3rd ed. Claren- don Press, Oxford. MR0187257 Lar tillot, N. and Philippe, H. ( 2006). Computing Ba yes factors u sing t hermod ynamic integratio n. Syst. Biol. 55 195–207 . Lee, S.-Y. and Song, X.-Y. (2002). Ba yesian selection on the num b er of factors in a factor analysis mo del. Behav- iormetrika 29 23–39. MR1894459 Lefebvre, G. , Stee le, R. , V andal, A. C. , Nara y anan, S. and Arnold, D. L. (2009). Path sampling to compute integ rated likelihood s: An adaptive app roac h. J. Com put. Gr aph. Statist. 18 415–43 7. MR2749839 Liang, F. , P aulo, R. , Molina, G. , Cl y de, M . A. and Berger, J. O. ( 2008). Mixtures of g priors for Bay esian v ariable selection. J. Amer. Statist. Asso c. 103 410–423. MR2420243 Liu, J. S. (2008). Monte Carlo Str ate gies in Scientiﬁc Com- puting . Springer, New Y ork. MR2401592 Lopes, H. F. and W est, M. (2004). Bay esian mo del as- sessmen t in factor analysis. Statist. Sinic a 14 41–67 . MR2036762 L ynch, S. M. (20 07). Intr o duction to Applie d Bayesian Statistics and Estimation for So cial Scientists . S pringer, New Y ork. Meng, X.-L. a nd W ong, W. H. (1996). S imula ting ratios of normalizing constan ts via a simple iden tity: A th eoretical exploration. Statist. Sinic a 6 831–860. MR1422406 Neal, R. M. (2001). Annealed imp ortance sampling. Stat. Comput. 11 125–139 . MR1837132 Nielsen, F. B. (2004 ). V ariational approac h to factor anal y- sis and related mo dels. Master’s th esis, Institute of Infor- matics and Mathematical Modelling, T echnical Univ. Den- mark. Rafter y, A. E. , Newton, M. A. , Sa t ago p an , J. M. and Krivitsky, P. N. (2007). Estimating the integrated lik e- lihoo d via p osterior simulation using the harmonic mean identit y . In Bayesian Statistics 8 371–416 . Oxford Univ. Press, Oxford. MR2433201 Ro ber t, C. P. and Casella, G. (2004). Monte Carlo Sta- tistic al Metho ds , 2nd ed. Springer, New Y ork. MR2080278 22 R. DUTT A AND J. K. GHOSH Rue, H. , M ar tino, S. and C hopin, N. (2009). Appro ximate Ba yesian inference for latent Gaussian mod els b y using in- tegrated nested Laplace approximati ons. J. R. Stat. So c. Ser. B Stat. M etho dol. 71 319–392. MR2649602 Shen, G. and Ghosh, J. K. (2011). Developing a new BIC for d etecting change-p oin ts. J. Statist. Plann. Infer enc e 141 1436–14 47. MR2747912 Song, X.-Y. and Lee, S.-Y. (200 6). Model comparison of generalized linear mi xed mod els. Stat. Me d. 25 1685–1698. MR2227347 Xie, W. , Lewis, P . O. , F an, Y. , Kuo, L. and Chen, M.-H. (2011). Improving marginal likeli ho o d estimation for Ba yesian ph ylogenetic model selection. Syst. Biol. 60 150– 160.

Bayes Model Selection with Path Sampling: Factor Models and Other Examples

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment