An empirical Bayes procedure for the selection of Gaussian graphical models

Sophie Donnet, CEREMADE, Université Paris Dauphine, France
Jean-Michel Marin*, Institut de Mathématiques et Modélisation de Montpellier, Université Montpellier 2, France

Abstract

A new methodology for model determination in decomposable graphical Gaussian models (Dawid and Lauritzen, 1993) is developed. The Bayesian paradigm is used and, for each given graph, a hyper inverse Wishart prior distribution on the covariance matrix is considered. This prior distribution depends on hyper-parameters. It is well known that the posterior distribution over models is sensitive to the specification of these hyper-parameters, and no completely satisfactory method for choosing them has been established. To avoid this problem, we suggest adopting an empirical Bayes strategy, that is, a strategy in which the values of the hyper-parameters are determined from the data. Typically, the hyper-parameters are fixed at their maximum likelihood estimates. To compute these maximum likelihood estimates, we suggest a Markov chain Monte Carlo version of the Stochastic Approximation EM algorithm. Moreover, we introduce a new sampling scheme in the space of graphs that improves on the add-and-delete proposal of Armstrong et al. (2009). We illustrate the efficiency of this new scheme on simulated and real datasets.

Keywords: Gaussian graphical models, decomposable models, empirical Bayes, Stochastic Approximation EM, Markov chain Monte Carlo

*Corresponding author: jean-michel.marin@univ-montp2.fr

1 Gaussian graphical models in a Bayesian context

Statistical applications in genetics, sociology, biology, etc. often lead to complicated interaction patterns between variables.
Graphical models have proved to be powerful tools for representing the conditional independence structure of a multivariate distribution: the nodes represent the variables, and the absence of an edge between two vertices indicates some conditional independence between the associated variables. Our paper presents a new approach for estimating the graph structure in a Gaussian graphical model. A very large literature deals with this issue in the Bayesian paradigm: Dawid and Lauritzen (1993); Madigan and Raftery (1994); Giudici and Green (1999); Jones et al. (2005); Armstrong et al. (2009); Carvalho and Scott (2009). For a frequentist point of view, see Drton and Perlman (2004). We suggest here an empirical Bayes approach: the parameters of the prior are estimated from the data.

Parametric empirical Bayes methods have a long history, with major developments evolving in the sequence of papers by Efron and Morris (1971, 1972a,b, 1973a,b, 1976a,b). Empirical Bayes estimation falls outside the Bayesian paradigm. However, it has proven to be an effective technique for constructing estimators that perform well under both Bayesian and frequentist criteria. Moreover, in the case of decomposable Gaussian graphical models, it gives a default and objective way of constructing the prior distribution. The theory and applications of empirical Bayes methods are given by Morris (1983).

In this section, we first recall some results on Gaussian graphical models, then we justify the use of the empirical Bayes strategy.

1.1 Background on Gaussian graphical models

Let G = (V, E) be an undirected graph with vertices V = {1, ..., p} and set of edges E = {e_1, ..., e_t}, where e_i ∈ V × V for all i = 1, ..., t. Using the notation of Giudici and Green (1999), we first recall the definition of a decomposable graph. A graph or subgraph is said to be complete if all pairs of its vertices are joined by edges.
Moreover, a complete subgraph that is not contained within another complete subgraph is called a clique. Let C = {C_1, ..., C_k} be the set of cliques of an undirected graph. An order of the cliques (C_1, ..., C_k) is said to be perfect if, for all i = 2, ..., k, there exists h = h(i) ∈ {1, ..., i-1} such that

S_i = C_i ∩ (C_1 ∪ ... ∪ C_{i-1}) ⊆ C_h.

S = {S_2, ..., S_k} is the set of separators associated with the perfect order (C_1, ..., C_k). An undirected graph admitting a perfect order is said to be decomposable. Let D_p denote the set of decomposable graphs with p vertices. For more details, one can refer to Dawid and Lauritzen (1993), Lauritzen (1996) (Chapters 2, 3 and 5) or Giudici and Green (1999).

The graph drawn in Figure 1 – and used as a benchmark in the numerical Section 4.2 – is decomposable. Indeed, the cliques C_1 = {1, 2, 3}, C_2 = {2, 3, 5, 6}, C_3 = {2, 4, 5}, C_4 = {5, 6, 7} and C_5 = {6, 7, 8, 9}, with associated separators S_2 = {2, 3}, S_3 = {2, 5}, S_4 = {5, 6} and S_5 = {6, 7}, form a perfect order.

[Figure 1: Example of a decomposable graph on p = 9 vertices.]

Note that, with p vertices, the total number of possible graphs is 2^{p(p-1)/2}, p(p-1)/2 being the number of possible edges. The total number of decomposable graphs with p vertices can be calculated for moderate values of p. For instance, if p = 6, among the 32 768 possible graphs, 18 154 are decomposable (around 55%); if p = 8, then 30 888 596 of the 268 435 456 possible graphs are decomposable (around 12%).

A pair (A, B) of subsets of the vertex set V of an undirected graph G is said to form a decomposition of G if (1) V = A ∪ B, (2) A ∩ B is complete and (3) A ∩ B separates A from B, i.e. any path from a vertex in A to a vertex in B goes through A ∩ B. To each vertex v ∈ V, we associate a random variable y_v.
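The counts above can be reproduced mechanically: a graph is decomposable exactly when it is chordal, and chordality can be checked with the maximum cardinality search (MCS) test of Tarjan and Yannakakis (1984). The sketch below (an illustration, not part of the paper's method) enumerates all 2^15 = 32 768 graphs on p = 6 vertices:

```python
from itertools import combinations

def is_decomposable(edges, p):
    """Decomposability (= chordality) test via maximum cardinality search."""
    nbr = [set() for _ in range(p)]
    for i, j in edges:
        nbr[i].add(j)
        nbr[j].add(i)
    # MCS: repeatedly number the vertex with the most numbered neighbours.
    weight, numbered, order = [0] * p, [False] * p, []
    for _ in range(p):
        v = max((u for u in range(p) if not numbered[u]), key=weight.__getitem__)
        numbered[v] = True
        order.append(v)
        for w in nbr[v]:
            if not numbered[w]:
                weight[w] += 1
    # The reverse MCS order is a perfect elimination ordering iff the graph
    # is chordal; verify the elimination property vertex by vertex.
    pos = {v: k for k, v in enumerate(reversed(order))}
    for v in range(p):
        later = [w for w in nbr[v] if pos[w] > pos[v]]
        if later:
            u = min(later, key=pos.get)
            if not set(later) - {u} <= nbr[u]:
                return False
    return True

# Reproduce the count quoted in the text: 18 154 of the 32 768 graphs
# on p = 6 vertices are decomposable (around 55%).
p = 6
pairs = list(combinations(range(p), 2))   # the p(p-1)/2 = 15 possible edges
n_dec = sum(
    is_decomposable([pairs[k] for k in range(len(pairs)) if mask >> k & 1], p)
    for mask in range(1 << len(pairs))
)
```

The same brute-force enumeration becomes infeasible well before p = 10, which is precisely what motivates the sampling schemes of Section 3.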
For A ⊆ V, y_A denotes the collection of random variables {y_v : v ∈ A}. To simplify the notation, we set y = y_V. The probability distribution of y is said to be Markov with respect to G if, for any decomposition (A, B) of G, y_A is independent of y_B given y_{A∩B}. A graphical model is a family of distributions on y verifying the Markov property with respect to a graph. A Gaussian graphical model, also called a covariance selection model (see Dempster (1972)), is such that

y | G, Σ_G ∼ N_p(µ, Σ_G),   (1)

where N_p(µ, Σ_G) denotes the p-variate Gaussian distribution with expectation µ ∈ R^p and p × p symmetric positive definite covariance matrix Σ_G. Σ_G has to ensure the Markov property with respect to G. In the Gaussian case, y is Markov with respect to G = (V, E) if and only if

(i, j) ∉ E ⟺ (Σ_G^{-1})_{i,j} = 0,

where A^{-1} denotes the inverse of the matrix A. Σ_G^{-1} is called the concentration matrix.

In the following, we suppose that we observe a sample Y = (y^1, ..., y^n) from model (1) with mean parameter µ set to zero. The data are expressed as deviations from the sample mean. This centering strategy is standard in the literature; however, the technique developed here can easily be extended to the case µ ≠ 0_p.

The density of Y is a function of multivariate Gaussian densities on the cliques and separators of G. More precisely, let C and S denote respectively the sets of cliques and separators of G corresponding to a perfect order for G. We have:

f(Y | Σ_G, G) = ∏_{i=1}^{n} [ ∏_{C∈C} φ_{|C|}(y^i_C | (Σ_G)_C) / ∏_{S∈S} φ_{|S|}(y^i_S | (Σ_G)_S) ],   (2)

where, for every subset of vertices A, |A| denotes its cardinality, (Σ_G)_A is the restriction of Σ_G to A, i.e. {(Σ_G)_{i,j}}_{i∈A, j∈A}, and y_A = (y_j)_{j∈A}. φ_q(· | ∆) is the q-variate Gaussian density with mean 0_q and q × q symmetric positive definite covariance matrix ∆.
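Factorization (2) can be checked numerically on a toy case, say the path graph 1-2-3 (indexed 0-2 in code), whose concentration matrix has a zero in position (1, 3). The example below, with an arbitrary illustrative concentration matrix, is not from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Markov covariance for the path 1-2-3: zero concentration between 1 and 3.
K = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
Sigma = np.linalg.inv(K)

def clique_separator_density(y, Sigma, cliques, seps):
    """Density (2) for one observation: product of clique marginal densities
    divided by the product of separator marginal densities."""
    sub = lambda A: Sigma[np.ix_(list(A), list(A))]
    num = np.prod([multivariate_normal.pdf(y[list(C)], cov=sub(C)) for C in cliques])
    den = np.prod([multivariate_normal.pdf(y[list(S)], cov=sub(S)) for S in seps])
    return num / den

rng = np.random.default_rng(0)
y = rng.standard_normal(3)
lhs = clique_separator_density(y, Sigma, cliques=[[0, 1], [1, 2]], seps=[[1]])
rhs = multivariate_normal.pdf(y, cov=Sigma)   # full trivariate density
```

The two quantities agree to machine precision, which is exactly the content of (2): conditional independence of vertices 1 and 3 given 2 makes f(y) = f(y_{12}) f(y_{23}) / f(y_2).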
From a Bayesian perspective, we are interested in the posterior probabilities

π(G | Y) ∝ π(G) ∫ f(Y | Σ_G, G) π(Σ_G | G) dΣ_G,   (3)

for specific priors π(Σ_G | G) and π(G). In the following, we discuss the choice of these prior distributions.

1.2 Prior distributions specification

Prior and posterior distributions for the covariance matrix

Conditionally on G, we set a hyper inverse Wishart (HIW) distribution as prior distribution on Σ_G:

Σ_G | G, δ, Φ ∼ HIW_G(δ, Φ),

where δ > 0 is the degree of freedom and Φ is a p × p symmetric positive definite location matrix. This distribution is the unique hyper-Markov distribution such that, for every clique C ∈ C, (Σ_G)_C ∼ IW(δ, Φ_C) with density

π((Σ_G)_C | δ, Φ_C) = h_IW(δ, Φ_C) [det (Σ_G)_C]^{-(δ+2|C|)/2} exp{ -(1/2) tr((Σ_G)_C^{-1} Φ_C) },   (4)

where h_IW(δ, Φ_C) is the normalizing constant:

h_IW(δ, Φ_C) = det(Φ_C / 2)^{(|C|+δ-1)/2} / Γ_{|C|}((|C|+δ-1)/2),   (5)

where det(·) and tr(·) are respectively the determinant and the trace, and Γ_v is the multivariate Γ-function with parameter v:

Γ_v(a) = π^{v(v-1)/4} ∏_{j=1}^{v} Γ[a + (1-j)/2].

The full joint density is:

π(Σ_G | G, δ, Φ) = ∏_{C∈C} π((Σ_G)_C | δ, Φ_C) / ∏_{S∈S} π((Σ_G)_S | δ, Φ_S).   (6)

Conditionally on G, the HIW distribution is conjugate. The posterior distribution of Σ_G is given by (Giudici, 1996):

Σ_G | Y, G, δ, Φ ∼ HIW_G(δ + n, Φ + S_Y),   (7)

where S_Y = ∑_{i=1}^{n} y^i (y^i)^t, with ^t v denoting the transpose of v. Moreover, for such a prior distribution, the marginal likelihood of any graph G is a simple function of the HIW prior and posterior normalizing constants h_G(δ, Φ) and h_G(δ + n, Φ + S_Y) (Giudici, 1996):

f(Y | G, δ, Φ) = (2π)^{-np/2} h_G(δ, Φ) / h_G(δ + n, Φ + S_Y),   (8)
where h_G(δ, Φ) is the normalizing constant of the HIW distribution, which, for decomposable graphs, can be computed explicitly from the normalizing constants (4)-(5)-(6) of the inverse Wishart clique and separator densities:

h_G(δ, Φ) = ∏_{C∈C} h_IW(δ, Φ_C) / ∏_{S∈S} h_IW(δ, Φ_S).

Note that Roverato (2002) extends the hyper inverse Wishart distribution to non-decomposable cases. Moreover, a general treatment of priors for decomposable models is given by Letac and Massam (2007).

Prior and posterior distributions for the graphs

The prior distribution on the space of decomposable graphs has been widely discussed in the literature. The naive choice is the standard uniform prior distribution: π(G) ∝ 1. One great advantage of this choice is that it simplifies the calculus, but it can be criticized. Indeed, with p vertices, the number of possible edges is equal to m = p(p-1)/2 and, in the case of a uniform prior over all graphs, the prior number of edges has its mode around m/2, which is typically too large. An alternative to this prior is to place a Bernoulli distribution with parameter r on the inclusion of each edge (Jones et al., 2005; Carvalho and Scott, 2009):

π(G | r) ∝ r^{k_G} (1 - r)^{m - k_G},   (9)

where k_G is the number of edges of G. The parameter r has to be calibrated. If r = 1/2, this prior reduces to the uniform one. In the following, we consider this prior distribution and give an empirical estimate of r.

Using (8) and (9), we easily deduce that the density of the posterior distribution on the space of decomposable graphs satisfies:

π(G | Y, δ, r, Φ) ∝ [ h_G(δ, Φ) / h_G(δ + n, Φ + S_Y) ] π(G | r).   (10)

This posterior distribution is known to be sensitive to the specification of the hyper-parameters r, δ and Φ (see Jones et al. (2005); Armstrong et al. (2009)). To tackle this problem, various strategies have been developed.
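The marginal likelihood (8) is easy to implement from the clique and separator constants (5); the sketch below uses SciPy's multigammaln for the multivariate Γ-function. The final sanity check is an illustration of ours, not from the paper: with p = 1, a single clique and one observation, (8) collapses to a Student-t density with δ degrees of freedom and scale √(φ/δ).

```python
import numpy as np
from scipy.special import multigammaln
from scipy.stats import t as student_t

def log_h_iw(delta, Phi_A):
    """log of the inverse Wishart normalizing constant (5) on a subset A."""
    q = Phi_A.shape[0]
    a = (q + delta - 1) / 2.0
    _, logdet = np.linalg.slogdet(Phi_A / 2.0)
    return a * logdet - multigammaln(a, q)

def log_h_G(delta, Phi, cliques, seps):
    """log h_G(delta, Phi): clique constants over separator constants."""
    sub = lambda A: Phi[np.ix_(list(A), list(A))]
    return (sum(log_h_iw(delta, sub(C)) for C in cliques)
            - sum(log_h_iw(delta, sub(S)) for S in seps))

def log_marginal_likelihood(Y, delta, Phi, cliques, seps):
    """log f(Y | G, delta, Phi), formula (8)."""
    n, p = Y.shape
    SY = Y.T @ Y
    return (log_h_G(delta, Phi, cliques, seps)
            - 0.5 * n * p * np.log(2.0 * np.pi)
            - log_h_G(delta + n, Phi + SY, cliques, seps))

# 1-D sanity check: p = 1, clique {0}, one observation. The IW prior on the
# single variance is inverse-gamma(delta/2, phi/2), so the marginal of y is
# Student-t with df = delta and scale sqrt(phi/delta).
delta, phi, y = 1.0, 2.5, 0.7
lm = log_marginal_likelihood(np.array([[y]]), delta, np.array([[phi]]), [[0]], [])
ref = student_t.logpdf(y, df=delta, scale=np.sqrt(phi / delta))
```

The ratio of two such log constants, plus the log of (9), gives the unnormalized log posterior (10) used by the samplers of Section 3.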
In the following, we supply a short review of these methods and offer an alternative one.

Choice of the hyper-parameters δ, r and Φ

In a fully Bayesian context, as proposed by Giudici and Green (1999), a hierarchical prior modelling can be used. In this approach, δ and Φ are considered as random quantities and a prior distribution is assigned to these parameters (r is fixed to 1/2). This strategy does not completely solve the problem, since the prior distributions on δ and Φ also depend on hyper-parameters which are difficult to calibrate.

Another strategy consists in fixing the values of δ, r and Φ, as in Jones et al. (2005). In that paper, r is set to 1/(p-1), encouraging sparse graphs. They choose δ = 3, which is the minimal integer such that the first moment of the prior distribution on Σ_G exists. Finally, they set Φ = τ I_p and, using the fact that the mode of the marginal prior for each variance term σ_ii is equal to τ/(δ+2), τ is fixed to δ + 2 if the dataset is standardized.

An intermediate strategy is suggested by Armstrong et al. (2009). First, they fix the value of δ to 4,¹ asserting that such a value gives a suitably non-informative prior for Σ_G. Then, they consider different possibilities for Φ, all of the form Φ = τ A, where the matrix A is fixed. In all cases, for the hyper-parameter τ, they use a uniform prior distribution on the interval [0, Γ], where Γ is very large. Finally, they also use a hierarchical prior on r, r ∼ Beta(1, 1), which by integration leads to π(G) ∝ (m choose k_G)^{-1}, where (m choose k_G) is the binomial coefficient. This hierarchical prior on r is also used in Carvalho and Scott (2009). In that paper, they suggest a HIW g-prior approach with g = 1/n. This approach consists of fixing δ = 1 and Φ = S_Y / n.

In our point of view, δ measures the amount of information in the prior relative to the sample (see (7)): we suggest setting δ to 1, so that the prior weight is the same as the weight of one observation.
As pointed out by Jones et al. (2005), for this particular choice, the first moment of the prior distribution on Σ_G does not exist. However, for δ = 1, the prior distribution is proper, and we fail to see any argument in favour of the existence of a first moment.

The structure of Φ can be discussed, and various forms exist in the literature (see Armstrong et al. (2009) for instance). In this paper, we standardise the data and use Φ = τ I_p. This choice leads to sparse graphs: on average, each variable has major interactions with a relatively small number of other variables. In that context, τ plays the role of a shrinkage factor and has to be carefully chosen on the appropriate scale.

In this paper, we recommend using an empirical Bayes strategy and fixing (τ, r) at its maximum likelihood estimate, whose computation is a challenging issue. To tackle this point, a Markov chain Monte Carlo (MCMC) version of the Stochastic Approximation EM (SAEM) algorithm is used. The SAEM algorithm is presented in Section 2. In Section 3, a new Metropolis-Hastings algorithm is introduced. Then, the proposed methodology is tested on real and simulated datasets.

¹ In fact, they set δ = 5, but they consider that µ is unknown with a uniform prior distribution: this situation corresponds to the case δ = 4 when µ = 0_p.

2 An empirical Bayes procedure via the SAEM-MCMC algorithm

In the following, we set θ = (τ, r) ∈ R*_+ × ]0, 1[. In order to compute the maximum likelihood estimate of θ, we need to optimize in θ the following function:

f(Y | θ) ∝ ∑_{G∈D_p} [ h_G(δ, τ I_p) / h_G(n + δ, τ I_p + S_Y) ] π(G | r).   (11)

If the number of vertices is greater than 10, the number of decomposable graphs is so huge that it is not possible to calculate the sum over D_p. In that case, we consider the use of the Expectation-Maximization (EM) algorithm developed by Dempster et al. (1977), noting that the data Y = (y^1, ..., y^n) arise as partial observations of the complete data (Y, G, Σ_G). However, for such a data augmentation scheme, the E-step of the EM algorithm is not explicit and we have to resort to a stochastic version of the EM algorithm, such as:

1. the S-EM scheme introduced by Celeux and Diebolt (1992) and Diebolt and Celeux (1993), where the E-step is replaced by a single simulation from the distribution of (G, Σ_G) given Y and θ;

2. the MC-EM or MCMC-EM algorithms, where the E-step is replaced by some Monte Carlo approximation (McLachlan and Krishnan, 2008);

3. the SAEM algorithm introduced by Delyon et al. (1999), where the E-step is divided into a simulation step and a stochastic approximation step;

4. the SAEM-MCMC algorithm (Kuhn and Lavielle, 2004), which extends the SAEM scheme, the "exact" simulation step being replaced by a simulation from an ergodic Markov chain.

The S-EM, MC-EM and SAEM methods require simulating a realization from the distribution of (G, Σ_G) given Y and θ. We are not able to produce a realization exactly distributed according to this distribution. We therefore use the SAEM-MCMC algorithm, which only requires some realizations from an ergodic Markov chain with stationary distribution (G, Σ_G) | Y, θ. In a first part, we recall the EM algorithm principles and present the SAEM-MCMC scheme. In a second part, we detail its application to Gaussian graphical models and prove its convergence.

2.1 The Stochastic Approximation version of the EM algorithm

The EM algorithm is competitive when the maximization of the function

θ → Q(θ | θ_0) = E_{Σ_G, G | Y, θ_0} { log f(Y, Σ_G, G | θ) }

is easier than the direct maximization of the marginal likelihood (11). The EM algorithm is a two-step iterative procedure. More precisely, at the k-th iteration, the E-step consists of evaluating Q_k(θ) = Q(θ | θ̂_{k-1}), while the M-step updates θ̂_{k-1} by maximizing Q_k(θ).
For complicated models where the E-step is intractable, Delyon et al. (1999) introduced the Stochastic Approximation EM algorithm (SAEM), replacing the E-step by a stochastic approximation of Q_k(θ). At iteration k, the E-step is divided into a simulation step (S-step), which draws (Σ_G^(k), G^(k)) from the posterior distribution (Σ_G, G) | Y, θ̂_{k-1}, and a stochastic approximation step (SA-step):

Q_k(θ) = (1 - γ_k) Q_{k-1}(θ) + γ_k log f(Y, Σ_G^(k), G^(k) | θ),

where (γ_k)_{k∈N} is a sequence of positive numbers decreasing to zero. When the joint distribution of (Y, Σ_G, G) belongs to the exponential family, the SA-step reduces to a stochastic approximation of the minimal sufficient statistics. The M-step remains the same. One of the benefits of the SAEM algorithm is its low dependence on the initialization θ_0, due to the stochastic approximation of the SA-step.

In Gaussian graphical models, we cannot directly generate a realization from the conditional distribution of (Σ_G, G) given Y and θ̂_{k-1}. For such cases, Kuhn and Lavielle (2004) suggest replacing the simulation step by an MCMC scheme, which consists of generating M realizations from an ergodic Markov chain with stationary distribution Σ_G, G | Y, θ̂_{k-1}, and using the last simulation in the SAEM algorithm. Kuhn and Lavielle (2004) prove the convergence of the sequence of estimates provided by this SAEM-MCMC algorithm towards a maximum of the function f(Y | θ), under general conditions, for the exponential family.

2.2 The SAEM-MCMC algorithm on Gaussian graphical models

In this section, we detail the application of the SAEM-MCMC algorithm to the Gaussian graphical model introduced in Section 1.2. More precisely, we give the expression of the complete log-likelihood and of the minimal sufficient statistics. Lavielle and Lebarbier (2001) applied the same methodology to a change-point problem.
The complete log-likelihood f(Y, G, Σ_G | θ) can be decomposed into three terms:

log f(Y, G, Σ_G | θ) = log f(Y | G, Σ_G) + log π(Σ_G | G, τ) + log π(G | r).   (12)

On the right-hand side of equation (12), the first quantity is independent of θ and thus plays no part in its estimation. Using the fact that we only consider decomposable graphs, together with the definition of the hyper inverse Wishart distribution, the second term on the right-hand side of equation (12) can be developed, up to an additive constant free of θ, as:

log π(Σ_G | G, τ) = ∑_{C∈C} [ (|C|(|C|+δ-1)/2) log τ - log Γ_{|C|}((|C|+δ-1)/2) - ((δ+2|C|)/2) log det (Σ_G)_C ]
 - ∑_{S∈S} [ (|S|(|S|+δ-1)/2) log τ - log Γ_{|S|}((|S|+δ-1)/2) - ((δ+2|S|)/2) log det (Σ_G)_S ]
 - (τ/2) tr(Σ_G^{-1}).

Furthermore,

log π(G | r) = k_G log( r/(1-r) ) + m log(1-r).

As a consequence, there exists a function Ψ of (Y, Σ_G, G, δ), independent of θ = (τ, r), such that

log f(Y, G, Σ_G | θ) = Ψ(Y, Σ_G, G, δ) + ((δ-1)p/2) log τ + m log(1-r) + ⟨ (S_1, S_2, S_3), ( (1/2) log τ, -τ/2, log(r/(1-r)) ) ⟩,   (13)

where ⟨·,·⟩ is the scalar product of R³. Here we used the identities ∑_{C∈C} |C| - ∑_{S∈S} |S| = p and ∑_{C∈C} tr((Σ_G)_C^{-1}) - ∑_{S∈S} tr((Σ_G)_S^{-1}) = tr(Σ_G^{-1}), valid for any perfect order. Following (13), the complete likelihood function belongs to the exponential family, with minimal sufficient statistic S = (S_1, S_2, S_3) given by:

S_1(Y, G, Σ_G) = ∑_{C∈C} |C|² - ∑_{S∈S} |S|²,
S_2(Y, G, Σ_G) = tr(Σ_G^{-1}),
S_3(Y, G, Σ_G) = k_G.

In an exponential model, the SA-step of the SAEM-MCMC algorithm reduces to the approximation of the minimal sufficient statistics. Thus, we can now write the three steps of the SAEM-MCMC algorithm. Let (γ_k)_{k∈N} be a sequence of positive numbers such that ∑_k γ_k = ∞ and ∑_k γ_k² < ∞.

Algorithm 1: SAEM-MCMC algorithm

(1) Initialize θ̂^(0), s_1^(0), s_2^(0) and s_3^(0).
(2) At iteration k:

• [S-step] Generate (G^(k), Σ_G^(k)) from M iterations of an MCMC procedure – detailed in Section 3 – with G, Σ_G | Y, θ̂^(k-1) as stationary distribution.

• [SA-step] Update (s_i^(k))_{i=1,2,3} using a stochastic approximation scheme: for i = 1, 2, 3,

s_i^(k) = s_i^(k-1) + γ_k [ S_i(Y, G^(k), Σ_G^(k)) - s_i^(k-1) ].

• [M-step] Maximize the joint log-likelihood (13):

τ̂^(k) = [ (δ-1) p + s_1^(k) ] / s_2^(k),   r̂^(k) = s_3^(k) / m.

(3) Set k = k + 1 and return to (2) until convergence.

The convergence of the sequence of estimates supplied by this SAEM-MCMC algorithm is ensured by the results of Kuhn and Lavielle (2004). Indeed, first, the complete likelihood belongs to the exponential family and the regularity assumptions required by Kuhn and Lavielle (2004) (assumptions M1-M5 and SAEM2) are easily verified. Secondly, the convergence requires the ergodicity of the Markov chain generated at the S-step towards its stationary distribution, that is, the distribution of G, Σ_G | Y, θ̂^(k-1). Finally, the properties of (γ_k)_{k∈N} allow us to apply the results of Kuhn and Lavielle (2004), and we conclude that the sequence of estimates (θ̂^(k))_{k∈N} converges almost surely towards a (local) maximum of the function f(Y | θ).

3 A new Metropolis-Hastings sampler

At each iteration k of the SAEM algorithm, a couple (G, Σ_G) has to be generated from the posterior distribution G, Σ_G | Y, θ̂^(k-1). As described in Giudici and Green (1999), Brooks et al. (2003) and Wong et al. (2003), this simulation can be achieved using a variable dimension MCMC scheme such as the reversible jump algorithm. In the case of an HIW prior distribution on Σ_G, the marginal likelihood is available in closed form (8) and, therefore, there is no need to resort to a variable dimension MCMC scheme.
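Before turning to that sampler, note that, outside of the S-step, the SA- and M-steps of Algorithm 1 amount to a few lines. A minimal sketch, with made-up illustrative statistics standing in for an actual MCMC draw:

```python
def sa_step(s, S_new, gamma):
    """SA-step: stochastic approximation update of the sufficient statistics."""
    return [si + gamma * (Si - si) for si, Si in zip(s, S_new)]

def m_step(s, delta, p):
    """Closed-form M-step of Algorithm 1, with m = p(p-1)/2 possible edges."""
    m = p * (p - 1) // 2
    tau = ((delta - 1) * p + s[0]) / s[1]
    r = s[2] / m
    return tau, r

# One illustrative pass with hypothetical statistics S = (S1, S2, S3).
s = [0.0, 1.0, 0.0]
s = sa_step(s, S_new=[27.0, 90.0, 10.0], gamma=1.0)   # gamma_k = 1 just copies S
tau_hat, r_hat = m_step(s, delta=1.0, p=9)
```

With δ = 1 the M-step reduces to τ̂ = s_1/s_2, consistent with the choice of δ = 1 advocated in Section 1.2.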
At iteration k of the SAEM algorithm, the simulation of (G, Σ_G)^(k) can be achieved through the following two-step procedure:

• [S1-step] G^(k) ∼ π(G | Y, θ̂^(k-1));
• [S2-step] Σ_G^(k) ∼ π(Σ_G | G^(k), Y, θ̂^(k-1)).

According to (7), the second step [S2-step] of this procedure reduces to the simulation of HIW distributions, the principle of which is detailed in Carvalho et al. (2007). For the first step [S1-step], we have to resort to an MCMC algorithm, but not one of variable dimension, since the chain is generated in the space of decomposable graphs with p vertices.

To sample from the posterior in the space of graphs, Armstrong et al. (2009) use the fact that the marginal likelihood is available in closed form and introduce a Metropolis-Hastings (MH) algorithm. At iteration t, their add-and-delete MH proposal consists of picking uniformly at random an edge such that the current graph with or without this edge stays decomposable, then deducing the proposed graph by deleting the generated edge from the current graph if it contains this edge, or adding the generated edge otherwise.

Let G be the current graph, G⁻(G) the set of decomposable graphs derived from G by removing an edge, and G⁺(G) the set of decomposable graphs derived from G by adding an edge. For pedagogical reasons, we present here an add-and-delete MH sampler slightly different from the one of Armstrong et al. (2009). In our proposal, we first decide at random whether we try to delete or to add an edge. The two schemes have exactly the same properties. Our add-and-delete algorithm is initialized at G^(0) and the following procedure is repeated until convergence is reached.

Algorithm 2: Add-and-delete MH proposal

At iteration t:

(a) Choose at random (with probability 1/2) to delete or add an edge of G^(t-1).

(a.1) If deleting an edge, enumerate G⁻(G^(t-1)) and generate G_p according to the uniform distribution on G⁻(G^(t-1)).
(a.2) If adding an edge, enumerate G⁺(G^(t-1)) and generate G_p according to the uniform distribution on G⁺(G^(t-1)).

(b) Calculate the MH acceptance probability ρ(G^(t-1), G_p) such that π(G | Y, θ) is the invariant distribution of the Markov chain.

(c) With probability ρ(G^(t-1), G_p), accept G_p and set G^(t) = G_p; otherwise reject G_p and set G^(t) = G^(t-1).

The acceptance probability ρ(G^(t-1), G_p) is equal to α(G^(t-1), G_p) ∧ 1, where

α(G^(t-1), G_p) = [ π(G_p | Y, δ, r, Φ) / π(G^(t-1) | Y, δ, r, Φ) ] × [ q(G^(t-1) | G_p) / q(G_p | G^(t-1)) ],

with

q(G^(t-1) | G_p) / q(G_p | G^(t-1)) = |G⁺(G^(t-1))| / |G⁻(G_p)| when adding, and |G⁻(G^(t-1))| / |G⁺(G_p)| when deleting.

Note that, because in general |G⁺(G^(t-1))| ≠ |G⁻(G_p)|, the proposal distribution is not symmetric. The ratio π(G_p | Y, δ, r, Φ) / π(G^(t-1) | Y, δ, r, Φ) is evaluated with formula (10).

The enumerations of G⁻(G^(t-1)) and G⁺(G^(t-1)) are not obvious and can be time-consuming. To tackle this point, we apply the results of Giudici and Green (1999) characterizing the set of moves (add and delete) which preserve the decomposability of the graph. These criteria lead to a fast enumeration.

Armstrong et al. (2009) prove that this scheme² is more efficient than the variable dimension sampler of Brooks et al. (2003), which is itself an improvement on the reversible jump algorithm proposed by Giudici and Green (1999). The proposal is clearly irreducible and, therefore, the theoretical convergence of the produced Markov chain towards the stationary distribution π(G | Y, θ) is ensured, following standard results on MH schemes. However, in practice, the space of decomposable graphs is so large that the chain may take quite some time to reach the invariant distribution.
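Algorithm 2 and its Hastings correction can be sketched generically, with the neighborhood enumerations G⁻ and G⁺ passed in as functions (in real use they would apply the Giudici-Green criteria). The toy usage below, our illustration rather than the paper's experiment, runs the chain on p = 3 vertices, where all 8 graphs are decomposable so no chordality filter is needed, targeting the Bernoulli prior (9) with the arbitrary value r = 0.3:

```python
import math
import random
from itertools import combinations

def mh_step(G, log_post, minus_nbrs, plus_nbrs, rng):
    """One add-and-delete MH move (Algorithm 2). minus_nbrs / plus_nbrs
    enumerate the decomposable graphs reachable by deleting / adding an edge."""
    if rng.random() < 0.5:
        moves, reverse = minus_nbrs(G), plus_nbrs
    else:
        moves, reverse = plus_nbrs(G), minus_nbrs
    if not moves:
        return G          # no admissible move: stay put (fine for a sketch)
    Gp = rng.choice(moves)
    # Hastings correction: q(G | Gp) / q(Gp | G) = |moves(G)| / |reverse(Gp)|.
    log_alpha = (log_post(Gp) - log_post(G)
                 + math.log(len(moves)) - math.log(len(reverse(Gp))))
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        return Gp
    return G

# Toy usage: graphs on 3 vertices as frozensets of edges; every such graph
# is decomposable, and the target is the Bernoulli prior (9) with r = 0.3.
ALL_EDGES = list(combinations(range(3), 2))
minus_nbrs = lambda G: [G - {e} for e in G]
plus_nbrs = lambda G: [G | {e} for e in ALL_EDGES if e not in G]
log_post = lambda G: len(G) * math.log(0.3 / 0.7)

rng = random.Random(42)
G, visited = frozenset(), set()
for _ in range(5000):
    G = mh_step(G, log_post, minus_nbrs, plus_nbrs, rng)
    visited.add(G)
```

On this 8-state space the chain quickly visits every graph; the same skeleton applies unchanged when log_post is the log of (10) and the neighborhoods are restricted to decomposable graphs.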
To improve on this point, we introduce a data-driven MH kernel which uses the information contained in the inverse of the empirical covariance matrix. To justify this choice, recall that, because of the Gaussian graphical model properties, if the inverse empirical covariance between vertices i and j is near zero, we can presume that there is no edge between vertices i and j. Then, during the MH iterations, if the current graph contains an edge between vertices i and j, it is legitimate to propose removing this edge. The same type of reasoning applies when the absolute value of the inverse empirical covariance between vertices k and l is large: in that case, if during the MH iterations the current graph does not contain an edge between vertices k and l, it is legitimate to propose adding this edge. With this proposal, once the random choice to add or delete an edge has been made, the proposed graph is not chosen uniformly within the class of decomposable graphs, but according to the values of the inverse empirical covariances.

² In Armstrong et al. (2009), the step on the space of graphs represents a Gibbs step of a hybrid sampler (as already explained, they consider a hierarchical model in which the hyper-parameter τ is a random variable).

Let K denote the inverse empirical covariance matrix: K = (S_Y / n)^{-1}. G^(t-1) \ (i, j) (respectively G^(t-1) ∪ (i, j)) denotes the graph G^(t-1) from which the edge (i, j) has been removed (respectively to which it has been added). The data-driven kernel is the following one:

Algorithm 3: Data-driven MH proposal

At iteration t:

(a) Choose at random to delete or add an edge of G^(t-1).

(a.1) If deleting an edge, enumerate G⁻(G^(t-1)) and generate G_p according to the distribution such that

P[ G_p = G^(t-1) \ (i, j) | G^(t-1) ] ∝ 1 / |K_{i,j}|.
(a.2) If adding an edge, enumerate G⁺(G^(t-1)) and generate G_p according to the distribution such that

P[ G_p = G^(t-1) ∪ (i, j) | G^(t-1) ] ∝ |K_{i,j}|.

(b) Calculate the MH acceptance probability ρ(G^(t-1), G_p) such that π(G | Y, θ) is the invariant distribution of the Markov chain.

(c) With probability ρ(G^(t-1), G_p), accept G_p and set G^(t) = G_p; otherwise reject G_p and set G^(t) = G^(t-1).

The algorithm is initialized at G^(0) and the procedure is repeated until convergence is reached.

Finally, in view of some numerical experiments, and in order to keep the good exploration properties of the standard MH kernel, we propose to use in practice a combination of the standard add-and-delete MH kernel and the previously presented data-driven kernel. This point is detailed in the next section.

4 Numerical experiments

In this part, we illustrate the statistical performance of our methodology on three different datasets. The second one is a simulated example which highlights the convergence properties of the SAEM-MCMC algorithm. The first and third examples appeared in Whittaker (1990) and have been widely used to evaluate the statistical performance of graphical model methodologies; see for instance Giudici and Green (1999); Armstrong et al. (2009). Through these two examples, the importance of the choice of the hyper-parameters and the efficiency of the new MCMC sampler are underlined.

4.1 The Frets' heads dataset (Whittaker, 1990)

Frets' heads dataset contains head measurements on the first and the second adult son in a sample of n = 25 families. The p = 4 variables are the head length of the first son, the head breadth of the first son, the head length of the second son and the head breadth of the second son. Among the 64 possible graphs, 61 are decomposable. We compare three different prior distributions on (Σ_G, G).
1. We first consider the prior distribution suggested by Jones et al. (2005), i.e. δ = 3, r = 1/(p-1) and Φ = τ I_p with τ = δ + 2.

2. In a second experiment, we use the prior distribution proposed in Carvalho and Scott (2009), i.e. δ = 1 and Φ = S_Y / n. Furthermore, r ∼ Beta(1, 1), resulting in π(G) ∝ (m choose k_G)^{-1}.

3. Finally, we use our prior distribution, i.e. δ = 1, Φ = τ I_p and a Bernoulli prior with parameter r on the edges of G. Using the SAEM algorithm described previously, we estimate τ and r as τ̂ = 0.3925 and r̂ = 0.6052.

In this example, there are only 61 decomposable graphs, so we are able to compute exactly the posterior probabilities {π(G | Y), G decomposable} for every prior distribution. We then compare the posterior probabilities of the three most probable decomposable graphs under the three prior distributions above. The results are summarized in Table 1.

Table 1: Frets' heads dataset: the three most probable posterior graphs under the various priors on (Σ_G, G).

Prior                      | p(G | Y) of the three most probable graphs
Jones et al. (2005)        | 0.24076   0.16924   0.11761
Carvalho and Scott (2009)  | 0.30512   0.19979   0.10813
SAEM                       | 0.28613   0.18219   0.1264

The empirical Bayes estimate of τ is quite a bit smaller than the value provided by the heuristic of Jones et al. (2005). As a consequence, the posterior probabilities of graphs are really different. Moreover, the approach of Carvalho and Scott (2009) gives results that agree with neither of the two other methods. The way the hyper-parameters τ and r are handled is essential, since it drastically influences the results.

4.2 Simulated datasets

We consider 10 artificial datasets with p = 9. These datasets are simulated according to model (1) with the graph of Figure 1; τ, δ and n are set equal to 0.03, 1 and 100 respectively.
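A setup of this kind can be reproduced schematically. The sketch below is an assumed construction, not the paper's HIW draw: any positive definite concentration matrix with zeros exactly off the edge set of Figure 1 yields a covariance matrix that is Markov with respect to that graph, and the 0.1 off-diagonal entries are purely illustrative.

```python
from itertools import combinations
import numpy as np

# Edge set of the Figure 1 graph, read off the cliques listed in Section 1.1
# (vertices renumbered 0..8).
cliques = [{0, 1, 2}, {1, 2, 4, 5}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7, 8}]
edges = {tuple(sorted(e)) for C in cliques for e in combinations(C, 2)}

# Concentration matrix K that is Markov w.r.t. the graph: zeros exactly at
# the missing edges; strict diagonal dominance makes it positive definite.
p = 9
K = np.eye(p)
for i, j in edges:
    K[i, j] = K[j, i] = 0.1
np.linalg.cholesky(K)   # would raise LinAlgError if K were not positive definite

# Draw n = 100 observations from model (1) with mu = 0 and Sigma_G = K^{-1}.
rng = np.random.default_rng(1)
Y = rng.multivariate_normal(np.zeros(p), np.linalg.inv(K), size=100)
```

The zero pattern of K is what the sampler of Section 3 is trying to recover from Y alone.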
The SAEM-MCMC algorithm was run on the 10 datasets in order to estimate the hyper-parameter τ. The algorithm is arbitrarily initialized with τ̂^(0) = 0.001 and r̂^(0) = 0.5. Given τ̂^(0), G is initialized with a standard backward procedure based on the posterior probabilities with r̂^(0). The step size of the stochastic approximation scheme is chosen as recommended by Kuhn and Lavielle (2005): γ_k = 1 during the first iterations 1 ≤ k ≤ K_1, and γ_k = (k − K_1)^{−1} during the subsequent iterations. The initial guesses τ̂^(0) and r̂^(0) could be far from a local maximum of the likelihood function, and the first iterations with γ_k = 1 allow the sequence of estimates to converge to a neighborhood of a local maximum. Subsequently, smaller step sizes during the K − K_1 additional iterations ensure the almost sure convergence of the algorithm to a local maximum of the likelihood function. We implemented the SAEM-MCMC algorithm with K_1 = 100 and K = 300. At the S-step of the algorithm, the Markov chain supplied by the MCMC algorithm is of length M = 500 during the first 5 iterations of the SAEM scheme and M = 10 for the remaining iterations. Figure 2 illustrates the convergence of the parameter estimates on 2 arbitrarily chosen datasets. The estimated sequences are represented as a function of the iteration number. During the first iterations of SAEM, the parameter estimates fluctuate, reflecting the Markov chain construction. After 100 iterations, the curves smooth out but continue to converge towards a neighborhood of a local maximum of the likelihood function. Convergence is obtained after 300 iterations. Over the 10 datasets, the relative bias for the parameter τ is negligible and the relative root mean square error (RMSE) amounts to 32.10%.
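The step-size schedule above can be written down directly. The sketch below also illustrates the averaging behaviour it induces in a stochastic approximation update; the noisy statistic S_k is a stand-in, not the paper's actual sufficient statistic.

```python
import random

def gamma(k, K1=100):
    """SAEM step size: 1 during the first K1 iterations, then (k - K1)^{-1}."""
    return 1.0 if k <= K1 else 1.0 / (k - K1)

# Stochastic approximation update s_k = s_{k-1} + gamma_k * (S_k - s_{k-1}).
# With gamma_k = 1 the running value simply tracks the latest draw (free
# exploration); the decreasing steps afterwards average the draws, which is
# what forces convergence to a neighbourhood of a local maximum.
random.seed(1)
s = 0.0
for k in range(1, 301):               # K = 300 iterations, as in the paper
    S_k = 0.03 + random.gauss(0.0, 0.01)  # noisy stand-in statistic around 0.03
    s = s + gamma(k, K1=100) * (S_k - s)
```

With this schedule, after iteration K_1 the update reduces to a running average of the subsequent draws, so the noise is averaged out at rate 1/√(k − K_1).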
Note that the same study was conducted with a uniform prior on G. In that case, the algorithm only involves the parameter τ, and the corresponding RMSE is equal to 23.5%.

Figure 2: Simulated datasets: evolution of the SAEM-MCMC estimates τ̂^(k) (left) and r̂^(k) (right) on 2 datasets.

4.3 The Fowl bones dataset

Whittaker (1990)'s fowl bones dataset concerns bone measurements taken from n = 276 white leghorn fowl. The 6 variables are skull length, skull breadth, humerus (wings), ulna (wings), femur (legs) and tibia (legs). On such a dataset, the determination of the best decomposable Gaussian graphical model amounts to finding the best graph among 18,154 decomposable graphs (55% of the possible graphs). Using this example, we aim to illustrate the fact that a careful choice of the transition kernel in the MCMC algorithm ensures a better exploration of the support of the posterior distribution. To do this, we compare the performance of the add and delete proposal of Armstrong et al. (2009) with that of the data-driven one. In a first step, we use the SAEM-MCMC algorithm to calibrate the values of τ and r. We obtain τ* = 0.674 and r* = 0.69. In a second step, using these fixed values of τ and r, we generate 2 Markov chains of 110,000 iterations. The first one is simulated using the add and delete kernel alone. For the second one, we use exclusively the add and delete kernel during the first 10,000 iterations: this burn-in phase allows a large exploration of the space of decomposable graphs. During the last 100,000 iterations, we alternatively and systematically use the add and delete and data-driven kernels.
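The systematic alternation of two MH kernels with the same invariant distribution can be illustrated on a toy discrete target. This is a hedged sketch, not the paper's graph kernels: a symmetric random-walk proposal plays the role of the add and delete kernel, a weighted independence proposal plays the role of the data-driven kernel (mirroring the idea of proposing moves with probability proportional to a data-dependent weight), and posterior probabilities are then estimated from visit frequencies.

```python
import random
random.seed(2)

# Toy unnormalised target over 5 states (a stand-in for pi(G | Y, tau)).
w = [1.0, 4.0, 2.0, 8.0, 1.0]
Z = sum(w)

def rw_step(x):
    """'Add and delete'-style move: symmetric proposal to a ring neighbour."""
    xp = random.choice([(x - 1) % 5, (x + 1) % 5])
    return xp if random.random() < min(1.0, w[xp] / w[x]) else x

def dd_step(x):
    """'Data-driven'-style move: independence proposal q(x') = w[x'] / Z.
    The MH ratio (w[xp]/w[x]) * (q(x)/q(xp)) equals 1, so it always accepts."""
    return random.choices(range(5), weights=w)[0]

x, counts = 0, [0] * 5
for t in range(100_000):
    x = rw_step(x) if t % 2 == 0 else dd_step(x)  # systematic alternation
    counts[x] += 1
pi_hat = [c / 100_000 for c in counts]  # visit-frequency estimates of pi
```

Since each kernel leaves the target invariant, so does their deterministic composition, and the visit frequencies pi_hat converge to the normalised target w/Z; this is the same frequency-based estimator used below to compare the two chains.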
To illustrate the performance of this new kernel, we compute exactly the posterior probabilities p(G | Y; τ*, r*) for each decomposable graph of size p = 6. We concentrate our efforts on the graphs such that p(G | Y; τ*, r*) ≥ 0.001 (resulting in 107 graphs among the 18,154), assuming that the other ones are of little interest because they are almost never reached by the Markov chains. For each graph of interest G_int, we count the number of times each Markov chain reached it (after having removed the burn-in period). We finally obtain an estimate of the posterior probability from each chain:

π̂_1(G_int | Y; τ*, r*) = |{t; G_1^(t) = G_int}| / 100 000,    π̂_2(G_int | Y; τ*, r*) = |{t; G_2^(t) = G_int}| / 100 000.

These values are compared to the exact ones p(G_int | Y; τ*, r*). In Figure 3, we plot the estimated densities of the relative errors

[π̂_1(G_int | Y; τ*, r*) − p(G_int | Y; τ*, r*)] / p(G_int | Y; τ*, r*) × 100

in solid line, and the analogous quantity for π̂_2 in dashed line. We note that the density corresponding to the errors of the data-driven kernel is more concentrated around the value 0.

Figure 3: Fowl bones dataset: densities of the relative errors on the posterior probabilities for the 107 most probable graphs; add and delete kernel in solid line and data-driven kernel in dashed line.

The large errors in the add and delete density are due to the graphs with small probabilities. Thus, the new kernel explores the posterior distribution more efficiently. The acceptance rate is also higher for the data-driven chain (see Figure 4).

5 Conclusion and discussion

An empirical Bayes strategy estimating prior hyper-parameters in a Gaussian graphical model using a SAEM-MCMC algorithm is introduced.
This proposal does not depend on any calibrating parameter and can be viewed as a default option for decomposable graphical model determination. Some empirical studies show the relevance of the proposed approach and the good properties of the introduced algorithms. However, Scott and Berger (2010) have recently found considerable differences between fully Bayes and empirical Bayes strategies in the context of variable selection. It would be very interesting to investigate, from both theoretical and practical perspectives, such a discrepancy in the case of decomposable graphical model selection.

Figure 4: Fowl bones dataset: evolution of the acceptance rate for the add and delete Markov chain (solid line) and the data-driven Markov chain (dashed line).

Acknowledgments

The authors are grateful to Marc Lavielle for very helpful discussions. The authors wish to thank the Associate Editor and two reviewers whose suggestions were very helpful in improving the presentation of this work. This work has been supported by the Agence Nationale de la Recherche (ANR, 212, rue de Bercy, 75012 Paris) through the 2009-2012 project Big'MC.

References

Armstrong, H., Carter, C., Wong, K., and Kohn, R. (2009). Bayesian covariance matrix estimation using a mixture of decomposable graphical models. Statistics and Computing, 19(3):303–316.

Brooks, S. P., Giudici, P., and Roberts, G. O. (2003). Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions. J. R. Stat. Soc. Ser. B Stat. Methodol., 65(1):3–55.

Carvalho, C., Massam, H., and West, M. (2007). Simulation of hyper-inverse Wishart distributions in graphical models. Biometrika, 94:647–659.

Carvalho, C. and Scott, J. (2009). Objective Bayesian model selection in Gaussian graphical models. Biometrika, 96:497–512.

Celeux, G. and Diebolt, J. (1992).
A stochastic approximation type EM algorithm for the mixture problem. Stochastics Stochastics Rep., 41(1-2):119–134.

Dawid, A. P. and Lauritzen, S. L. (1993). Hyper-Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist., 21(3):1272–1317.

Delyon, B., Lavielle, M., and Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist., 27:94–128.

Dempster, A. (1972). Covariance selection. Biometrics, 28:157–175.

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39(1):1–38. With discussion.

Diebolt, J. and Celeux, G. (1993). Asymptotic properties of a stochastic EM algorithm for estimating mixing proportions. Comm. Statist. Stochastic Models, 9(4):599–613.

Drton, M. and Perlman, M. (2004). Model selection for Gaussian concentration graphs. Biometrika, 91(3):591–602.

Efron, B. and Morris, C. (1971). Limiting the risk of Bayes and empirical Bayes estimators. I. The Bayes case. J. Amer. Statist. Assoc., 66:807–815.

Efron, B. and Morris, C. (1972a). Empirical Bayes on vector observations: an extension of Stein's method. Biometrika, 59:335–347.

Efron, B. and Morris, C. (1972b). Limiting the risk of Bayes and empirical Bayes estimators. II. The empirical Bayes case. J. Amer. Statist. Assoc., 67:130–139.

Efron, B. and Morris, C. (1973a). Combining possibly related estimation problems (with discussion). J. Roy. Statist. Soc. Ser. B, 35:379–421.

Efron, B. and Morris, C. (1973b). Stein's estimation rule and its competitors: an empirical Bayes approach. J. Amer. Statist. Assoc., 68:117–130.

Efron, B. and Morris, C. (1976a). Families of minimax estimators of the mean of a multivariate normal distribution. Ann. Statist., 4(1):11–21.

Efron, B. and Morris, C. (1976b). Multivariate empirical Bayes and estimation of covariance matrices. Ann. Statist., 4(1):22–32.

Giudici, P.
(1996). Learning in graphical Gaussian models. In Bayesian Statistics, 5 (Alicante, 1994), Oxford Sci. Publ., pages 621–628. Oxford Univ. Press, New York.

Giudici, P. and Green, P. J. (1999). Decomposable graphical Gaussian model determination. Biometrika, 86(4):785–801.

Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statist. Sci., 20.

Kuhn, E. and Lavielle, M. (2004). Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM Probab. Stat., 8:115–131.

Kuhn, E. and Lavielle, M. (2005). Maximum likelihood estimation in nonlinear mixed effects models. Comput. Statist. Data Anal., 49:1020–1038.

Lauritzen, S. (1996). Graphical Models, volume 17 of Oxford Statistical Science Series. The Clarendon Press Oxford University Press, New York. Oxford Science Publications.

Lavielle, M. and Lebarbier, E. (2001). An application of MCMC methods to the multiple change-points problem. Signal Processing, 81:39–53.

Letac, G. and Massam, H. (2007). Wishart distributions for decomposable graphs. Ann. Statist., 35(3):1278–1323.

Madigan, D. and Raftery, A. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association, 89:1335–1346.

McLachlan, G. and Krishnan, T. (2008). The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.

Morris, C. N. (1983). Parametric empirical Bayes inference: theory and applications (with discussion). J. Amer. Statist. Assoc., 78(381):47–65.

Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scand. J. Statist., 29(3):391–411.

Scott, J. and Berger, J. (2010).
Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38(5):2587–2619.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. John Wiley & Sons Ltd., Chichester.

Wong, F., Carter, C. K., and Kohn, R. (2003). Efficient estimation of covariance selection models. Biometrika, 90(4):809–830.