A review of Gaussian Markov models for conditional independence

Irene Córdoba*, Concha Bielza, Pedro Larrañaga

Universidad Politécnica de Madrid, Madrid, Spain

Abstract

Markov models lie at the interface between statistical independence in a probability distribution and graph separation properties. We review model selection and estimation in directed and undirected Markov models with Gaussian parametrization, emphasizing the main similarities and differences. These two model classes are similar but not equivalent, although they share a common intersection. We present the existing results from a historical perspective, taking into account the amount of literature existing in both the artificial intelligence and statistics research communities, where these models originated. We cover classical topics such as maximum likelihood estimation and model selection via hypothesis testing, but also more modern approaches like regularization and Bayesian methods. We also discuss how the Markov models reviewed fit in the rich hierarchy of other, higher-level Markov model classes. Finally, we close the paper by overviewing relaxations of the Gaussian assumption and pointing out the main areas of application where these Markov models are nowadays used.

Keywords: Gaussian Markov model, Conditional independence, Model selection, Parameter estimation

1. Introduction

Markov models, or probabilistic graphical models, explicitly establish a correspondence between statistical independence in a probability distribution and certain separation criteria holding in a graph. They originated at the interface between statistics, where Markov random fields were predominant (Darroch et al., 1980), and artificial intelligence, with a focus on Bayesian networks (Pearl, 1985, 1986).
These two model classes are now considered the traditional ones, but they are still widely applied and nowadays there is a significant amount of research devoted to them (Daly et al., 2011; Uhler, 2012). They both share the modelling of conditional independences: Bayesian networks relate them to acyclic directed graphs, whereas in Markov fields they are associated with undirected graphs. However, the models they represent are only equivalent under additional assumptions on the respective graphs.

*Corresponding author
Email addresses: irene.cordoba@upm.es (Irene Córdoba), mcbielza@fi.upm.es (Concha Bielza), pedro.larranaga@fi.upm.es (Pedro Larrañaga)

Preprint submitted to Elsevier, October 3, 2019

In this paper, we review the existing methods for model selection and estimation in undirected and acyclic directed Markov models with a Gaussian parametrization. The multivariate Gaussian distribution is among the most widely developed and applied statistical families in this context (Werhli et al., 2006; Ibáñez et al., 2016), and it allows for an explicit parametric comparison of their similarities and differences. The highly interdisciplinary nature of these Markov model classes has led to a wide range of terminology in methodological developments and theoretical results. They have usually been studied separately, with some exceptions (Wermuth, 1980; Pearl, 1988), and most unifying works (Sadeghi and Lauritzen, 2014; Wermuth, 2015) are characterized by a high-level view, where the models are embedded in other, more expressive classes, and the focus is on the properties of these container classes. By contrast, in this paper we review them from a low-level perspective. In doing so, we use a unified notation that allows for a direct comparison between the two types of classes. Furthermore, throughout each section we explicitly compare them, in terms of both methodological and theoretical developments.
The paper is structured as follows. A historical introduction to Markov models is presented in Section 2, emphasizing the different research areas that contributed to their birth. Afterwards, preliminary concepts from graph theory are presented in Section 3. In Section 4, undirected and acyclic directed Markov model classes are introduced, under no distributional assumptions. This is because many foundational relationships between them can already be established from this general perspective. Next, we restrict their parametrization to multivariate Gaussian distributions, and explore the main properties derived from this in Section 5. We review maximum likelihood estimation in Section 6. These estimates are used for model selection via hypothesis testing, as we present in Section 7. When maximum likelihood estimators are not guaranteed to exist, a popular technique is to employ regularization, which we overview in Section 8. Finally, the alternative Bayesian approach for model selection and estimation is treated in Section 9. We explore the relationship of Gaussian acyclic directed and undirected Markov models with other, higher-level model classes in Section 10. Alternatives to the Gaussian distribution are discussed in Section 11. We close the paper by discussing the main real applications of the Gaussian Markov model classes reviewed in Section 12.

2. A historical perspective

We will now introduce the main terminology for Gaussian Markov models that can be found nowadays, from a historical perspective. In Figure 1 we have depicted a timeline of the origins of Markov models, containing most of the key works we will refer to in this section.

Undirected Markov models for conditional independence are the oldest type of Markov models, preceded only by special cases such as the Ising model for ferromagnetic materials (Kindermann and Snell, 1980; Isham, 1981).
In fact, they are a generalization of the Ising model, which is in turn a generalization of Markov chains. Originally, undirected Markov models were called Markov random fields (Grimmett, 1973), since they generalized the correspondence between Gibbs measures (Besag, 1974) and Markov properties. The terminology graphical model was not introduced until Darroch et al. (1980) linked the graphical ideas for contingency tables with Markov properties of discrete Markov fields. Furthermore, we also find them called Markov networks (Pearl, 1988), by researchers in artificial intelligence, as a parallel to the terminology Bayesian networks, used for acyclic directed Markov fields.

[Figure 1: Timeline of the origins of Gaussian Markov models: linear covariance, Anderson (1973); covariance selection (CS), Dempster (1972); CS and contingency tables (CTs), Wermuth (1976); Markov fields (MFs) and CTs, Darroch et al. (1980); influence diagrams (IDs), Howard & Matheson (1981); probabilistic IDs and Bayesian networks (BNs), Pearl (1985); Gaussian MFs, Speed & Kiiveri (1986); BNs & MFs, Pearl (1988); directed MFs and BNs, Lauritzen et al. (1990). Papers from the statistical community appear at the top, while papers from other research areas appear below. Thematically, gray filled squares are papers about acyclic directed Markov models, white ones are about the undirected case, and gradient filled ones treat both classes.]

Regarding the Gaussian parametrization, one of the first works to impose some structure on the covariance matrix of a multivariate Gaussian distribution, in order to reduce the number of parameters to be estimated, was Anderson (1973). He considered the mean vector and covariance matrix to be linear combinations of known linearly independent vectors and matrices, respectively.
Closely following this work was Dempster (1972), who suggested estimating the inverse of the covariance matrix (the concentration matrix) by assuming certain entries equal to zero, motivated by the representation of the multivariate Gaussian distribution as an exponential family. His work was later referred to as covariance selection models. Interestingly, although Dempster did not have any graphical interpretation in mind, such zero entries in the concentration matrix are directly associated with missing edges in an undirected Gaussian Markov model, and this correspondence was analysed some years later in Wermuth (1976a). This is why, even nowadays, these Markov models with a Gaussian parametrization are sometimes called covariance selection models.

Acyclic digraphs, in contrast, were intensely used as models for multivariate probability distributions after the definition of influence diagrams. These are used to model decision-making processes, and were introduced by Howard and Matheson in 1981 (article reprinted in Howard and Matheson (2005)). Their probabilistic reduction coincides with acyclic directed Markov models, and was subsequently extensively studied by Pearl (1988), who renamed probabilistic influence diagrams as Bayesian networks or influence networks (Pearl, 1985). Some researchers working on Markov fields also developed theory regarding these directed counterparts, calling them directed Markov fields (Lauritzen et al., 1990).

Earlier works than those previously outlined, employing or referencing acyclic directed Markov models, are available. Wermuth (1980) implicitly studied them in the Gaussian case as linear recursive regression systems, although the main focus was rather on covariance selection models.
In fact, we can trace the use of directed graphs as graphical models for dependencies among random variables at least to the work of the geneticist Sewall Wright in 1918, who developed the method of path coefficients (Wright, 1934), nowadays known as path analysis. Linearly related variables were represented using a directed acyclic graph, whereas their correlation was represented by bi-directed edges joining them.

3. Graph preliminaries

A graph is defined as a pair G = (V, E), where V is the vertex set and E is the edge set. Throughout the paper, and unless otherwise stated, graphs will be labelled and simple, which means that the elements of V are labelled, for example, as 1, ..., p, and E is formed by pairs of distinct elements of V. A graph is called undirected if these pairs are unordered (E ⊆ {{u, v} : u, v ∈ V}), and directed, or a digraph, otherwise (E ⊆ {(u, v) : u, v ∈ V}). Edges {u, v} in an undirected graph are usually denoted as uv and graphically represented as a line (see Figure 2a), while in a digraph they are called arcs or directed edges and represented as arrows (Figures 2b and 2c).

[Figure 2: Examples of (a) an undirected graph, (b) a cyclic digraph and (c) an acyclic digraph, on vertices 1, ..., 6.]

3.1. Undirected graphs

In an undirected graph G = (V, E), if uv ∈ E, then u and v are called neighbours. For v ∈ V, the set of its neighbours is denoted as ne(v), and the closure of v is cl(v) := {v} ∪ ne(v). G is called complete if uv ∈ E for every u, v ∈ V. A maximal C ⊆ V such that G_C is complete is called a clique. Let H = (V_H, E_H) be another undirected graph. H is a sub-graph of G (written H ⊆ G) if V_H ⊆ V and E_H ⊆ E. If E_H = {uv ∈ E : u, v ∈ V_H}, then H is called the induced sub-graph and denoted G_{V_H}. A walk between u and v is an ordered sequence of vertices (u =) u_0, u_1, ...
, u_{k−1}, u_k (= v), where u_{i−1}u_i ∈ E for i ∈ {1, ..., k}. The number k is called the length of the walk. If u = v, the walk is closed, and when u_0, ..., u_{k−1} are distinct, the walk is called a path. A closed path of length k ≥ 3 is called a cycle or k-cycle. G is called chordal or triangulated if all minimal k-cycles are of length k = 3. A chordal cover of a graph G is a graph G* such that G ⊆ G* and G* is chordal.

S ⊆ V separates u and v in G = (V, E) if there is no path between u and v in the sub-graph G_{V\S}. If we consider A, B, S ⊆ V, then A and B are said to be separated by S if u and v are separated by S for all u ∈ A, v ∈ B. Let V be partitioned into disjoint sets A, B, S ⊆ V. (A, B, S) is called a decomposition of G if S separates A and B in G and G_S is complete. If A ≠ ∅ and B ≠ ∅, the decomposition is said to be proper. An undirected graph is decomposable if (i) it is complete, or (ii) it admits a proper decomposition into decomposable sub-graphs. An undirected graph is decomposable if and only if it is chordal.

3.2. Acyclic digraphs

In a digraph D = (V, A) the definitions of (induced) sub-graph, walk, path and cycle are analogous to the undirected case. The undirected graph D^U := (V, A^U) with A^U := {uv : (u, v) ∈ A} is called the skeleton of D, and D is one of its orientations. A digraph D is said to be complete when D^U is complete. In the following, assume that D is acyclic (see Figure 2b for a cyclic digraph, and Figure 2c for an acyclic one). The parent set of v ∈ V is pa(v) := {u ∈ V : (u, v) ∈ A}; conversely, the child set is ch(v) := {u ∈ V : (v, u) ∈ A}. The ancestors of v, an(v), are those u ∈ V such that there exists a directed path from u to v; the descendants of v, de(v), are those u ∈ V such that there exists a directed path from v to u.
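The ancestor and descendant sets just defined are directed reachability computations; they can be sketched as follows in Python, on a small hypothetical digraph (the arcs below are made up for illustration):

```python
def ancestors(arcs, v):
    """an(v): vertices with a directed path to v, by iterated arc expansion."""
    anc = set()
    changed = True
    while changed:
        changed = False
        for (u, w) in arcs:
            if (w == v or w in anc) and u not in anc and u != v:
                anc.add(u)
                changed = True
    return anc

def descendants(arcs, v):
    """de(v): vertices reachable from v, i.e. those having v as an ancestor."""
    vertices = {x for a in arcs for x in a}
    return {u for u in vertices if v in ancestors(arcs, u)}

# hypothetical acyclic digraph with arcs 1 -> 3, 2 -> 3, 3 -> 4
arcs = [(1, 3), (2, 3), (3, 4)]
# ancestors(arcs, 4) == {1, 2, 3}; descendants(arcs, 1) == {3, 4}
```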
We will let nd(v) := V \ ({v} ∪ de(v)) be the set of non-descendants of v ∈ V, and An(A) := A ∪ (∪_{a∈A} an(a)) the ancestral set of A ⊆ V.

Note that a total order ≺ can be defined over the set of vertices V in an acyclic digraph D = (V, A), such that if (u, v) ∈ A, then u ≺ v. This ordering is usually called ancestral, and it is a linear extension of the partial order naturally defined as u ⪯ v if u ∈ an(v). For v ∈ V, the set of successors of v with respect to ≺ is su(v) = {u ∈ V : v ≺ u}; the set of predecessors of v is pr(v) = {u ∈ V : u ≺ v}.

Finally, let u, w_1, w_2 ∈ V with (w_1, u), (w_2, u) ∈ A and (w_1, w_2), (w_2, w_1) ∉ A (see vertices 1, 2 and 3 in Figure 2c). Such configurations are usually called v-structures and denoted as w_1 → u ← w_2. The moral graph of D is defined as the undirected graph D^m = (V, A^m) with A^m := A^U ∪ {w_1w_2 : w_1 → u ← w_2 for some u ∈ V}.

4. Undirected and acyclic directed Markov model classes

The Markov model classes we will review associate conditional independences in random vectors X = (X_1, ..., X_p)^t with undirected graph and acyclic digraph separation properties. This is made explicit via the Markov properties of the distribution of X, which are in turn based on what are known as independence relations. In the following, for arbitrary I ⊆ {1, ..., p}, we will denote the |I|-dimensional sub-vector of X as X_I := (X_i)_{i∈I}. Conditional independence will be expressed as in Dawid (1979): X_I ⊥⊥ X_J | X_K represents the statement 'X_I is conditionally independent of X_J given X_K' (see, e.g., Studený, 2018, § 1.3).

4.1. Independence relations

An independence relation over a set V = {1, ..., p} is a collection I of triples (A, B, C), where A, B and C are pairwise disjoint subsets of V.
It is called a semi-graphoid when the following conditions are met:

• if (A, B, C) ∈ I then (B, A, C) ∈ I;
• if (A, B ∪ C, D) ∈ I then (A, C, D) ∈ I and (A, B, C ∪ D) ∈ I;
• if (A, B, C ∪ D) ∈ I and (A, C, D) ∈ I then (A, B ∪ C, D) ∈ I;

and a graphoid when it additionally satisfies that

• if (A, B, C ∪ D) ∈ I and (A, C, B ∪ D) ∈ I then (A, B ∪ C, D) ∈ I

(Pearl and Paz, 1987).

Independence relations arise in different contexts relevant for Markov models. Specifically, an independence relation I over V = {1, ..., p} is said to be induced by

• an undirected graph G = (V, E) if (A, B, S) ∈ I ⟺ A and B are separated by S in G,
• an acyclic digraph D = (V, A) if (A, B, S) ∈ I ⟺ A and B are separated by S in (D_{An(A∪B∪S)})^m,
• a p-dimensional random vector X if (A, B, S) ∈ I ⟺ X_A ⊥⊥ X_B | X_S.

Graph-induced independence relations are always graphoids, while probabilistic ones are always semi-graphoids and require additional assumptions on the probability spaces involved to be graphoids (Dawid, 1980). See Studený (2018), § 1.5 and § 1.11, for a detailed exposition of graphoid theory and of how to compute and represent closures, that is, all the triples that can be derived from a given independence relation by using the graphoid axioms.

The core of Markov model classes is the relationship between induced independence relations, which we will denote as I(·), with the argument being the inducing element. Specifically, if G is an undirected (acyclic directed) graph, an undirected (directed) Markov model is defined as

M(G) := {P_X : I(G) ⊆ I(X)},

where the random vectors X are defined over the same probability space and P_X denotes their distribution.
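The separation criterion induced by an acyclic digraph (moralize the sub-graph induced by the ancestral set of A ∪ B ∪ S, then test undirected separation) can be sketched in Python as follows; the graphs in the example are hypothetical, with arcs given as (tail, head) pairs:

```python
from itertools import combinations

def ancestral_set(arcs, A):
    """An(A): A together with all ancestors of its elements."""
    anc = set(A)
    changed = True
    while changed:
        changed = False
        for (u, v) in arcs:
            if v in anc and u not in anc:
                anc.add(u)
                changed = True
    return anc

def moralize(vertices, arcs):
    """Edges of D^m: the skeleton plus 'marriages' between parents of a common child."""
    edges = {frozenset(a) for a in arcs}
    for v in vertices:
        pa = [u for (u, w) in arcs if w == v]
        for w1, w2 in combinations(pa, 2):
            edges.add(frozenset((w1, w2)))
    return edges

def u_separated(vertices, edges, A, B, S):
    """Undirected separation: no path between A and B in the sub-graph on V \\ S."""
    keep = set(vertices) - set(S)
    reach = set(A) & keep
    frontier = list(reach)
    while frontier:
        u = frontier.pop()
        for e in edges:
            if u in e and e <= keep:
                (w,) = tuple(e - {u})
                if w not in reach:
                    reach.add(w)
                    frontier.append(w)
    return not (reach & set(B))

def d_separated(arcs, A, B, S):
    """Separation induced by an acyclic digraph, via the moral ancestral sub-graph."""
    W = ancestral_set(arcs, set(A) | set(B) | set(S))
    sub = [(u, v) for (u, v) in arcs if u in W and v in W]
    return u_separated(W, moralize(W, sub), A, B, S)
```

For the v-structure 1 → 3 ← 2, for instance, `d_separated([(1, 3), (2, 3)], {1}, {2}, set())` holds, whereas conditioning on the common child destroys the separation: `d_separated([(1, 3), (2, 3)], {1}, {2}, {3})` is false.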
These classes are non-empty (Geiger and Pearl, 1990, 1993); that is, for any undirected or acyclic directed graph, we can always find a probability distribution whose independence model contains the one generated by the graph.

Graphoids can be generalised to what are known as separoids (Dawid, 2001), which are algebraic structures usually appearing whenever a notion of 'irrelevance' is being mathematically treated (see, e.g., Studený, 2018, § 1.1.3). Further research on these axiom systems from an abstract point of view could shed more light on how the apparently different mathematical contexts in which such structures arise are related, and also provide an explicit bridge between them and the closely related, recently defined independence logic (Grädel and Väänänen, 2013).

4.2. Markov properties

When a distribution P_X belongs to M(G) for an undirected or acyclic directed graph G, it is said that P_X is globally G-Markov, or satisfies the global Markov property with respect to G. There are other, weaker Markov properties that usually allow the model to be simplified. Specifically, if G = (V, E) is an undirected graph, then the probability distribution P_X of X is said to be

• pairwise G-Markov if X_u ⊥⊥ X_v | X_{V\{u,v}} for all uv ∉ E,
• locally G-Markov if X_v ⊥⊥ X_{V\cl(v)} | X_{ne(v)} for all v ∈ V;

whereas if G is an acyclic digraph, then P_X is called

• pairwise G-Markov if X_u ⊥⊥ X_v | X_{nd(u)\{v}} for all u ∈ V, v ∈ nd(u) \ pa(u);
• locally G-Markov if X_v ⊥⊥ X_{nd(v)\pa(v)} | X_{pa(v)} for all v ∈ V.

The three Markov properties are equivalent when G is acyclic directed (Lauritzen et al., 1990), while if G is undirected this equivalence is only guaranteed when I(X) is a graphoid (Pearl, 1988). A sufficient condition for this to happen is that P_X admits a continuous and strictly positive density.
This result was proved in different forms by several authors, but it is usually attributed to Hammersley and Clifford (1971), who were the first to outline the proof for the discrete case (Speed, 1979). It relies on an additional characterization of a probability distribution with respect to G: denoting as C(G) the class of cliques of G, the density function f of P_X is said to factorize according to G when there exists a set {ψ_C(x_C) : C ∈ C(G), ψ_C ≥ 0} such that

f(x) = ∏_{C ∈ C(G)} ψ_C(x_C).    (1)

When (1) holds, then P_X is globally G-Markov, while if f is continuous and strictly positive, the pairwise Markov property implies (1), which gives the equivalence of the Markov properties. Positivity is a straightforward sufficient condition for checking whether an independence model originating from a distribution is a graphoid. Necessary and sufficient conditions are given in measure-theoretic terms by Dawid (1980), and recently by Peters (2014) in terms of special functions over the sample space.

Finally, recall that the nodes of an acyclic digraph D = (V, A) can be totally ordered such that if (u, v) ∈ A, then u ∈ pr(v). This gives rise to another Markov property, exclusive to acyclic digraphs: P_X is said to be ordered D-Markov if X_v ⊥⊥ X_{pr(v)\pa(v)} | X_{pa(v)} for all v ∈ V. This property is also equivalent to the global, local and pairwise Markov properties (Lauritzen et al., 1990). The classical theory of undirected and acyclic directed Markov properties can be found in Lauritzen (1996), whereas Studený (2018), § 1.7 and § 1.8, provides a recent overview.

4.3. Independence and Markov equivalence

When the Markov models defined by two graphs G and G̃ with the same vertex set V coincide, such graphs are said to be Markov equivalent. A simpler notion, which implies Markov equivalence, is independence equivalence, holding when I(G) = I(G̃).
Independence equivalence is implied by Markov equivalence under fairly general circumstances (Studený, 2005, § 6.1), which is why most authors treat them as the same notion. These equivalences allow the most suitable graph to be chosen for a Markov model.

We will first characterize equivalence within undirected graphs. For each graphoid I over V there exists a unique edge-minimal undirected graph G such that I(G) ⊆ I (Pearl and Paz, 1987). It follows that I(G) = I(G̃) (independence equivalence) if and only if G and G̃ are identical. Furthermore, if we assume that I(X) is a graphoid for all P_X ∈ M(G), then a unique edge-minimal G̃ exists, with G̃ ⊆ G, such that M(G) = M(G̃) (Markov equivalence); that is, a unique undirected graph can be chosen as the representative of each undirected Markov model.

In contrast, acyclic digraphs are not, in general, unique representations of a Markov model, since I(D) = I(D̃) if and only if D and D̃ have the same skeleton and the same v-structures (Verma and Pearl, 1991). However, unique representatives can be constructed: let D_p be the set of acyclic digraphs over V = {1, ..., p} and define an equivalence relation ∼ in D_p as D ∼ D̃ ⟺ I(D) = I(D̃). The quotient space of ∼ is D_p/∼ = {[D] : D ∈ D_p}, where [D] := {D̃ ∈ D_p : D̃ ∼ D} is the Markov equivalence class; indeed, M(D̃) = M(D) for all D̃ ∈ [D], that is, [D] is the unique representative of the directed Markov model. The asymptotic ratio l = lim_{p→∞} |D_p|/|D_p/∼| influences the computational gain obtained by using D_p/∼ instead of D_p as a search space for model selection. Steinsky (2004) analytically calculates an upper bound on l of 13.65. Exact computations by Gillispie and Perlman (2002), for p ≤ 10, and approximations by Sonntag et al. (2015), up to p = 31, seem to indicate that l ≈ 3.7.
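For small p, such counts can be reproduced by brute force, grouping acyclic digraphs by their skeleton and v-structures as in the Verma and Pearl characterization; a minimal sketch in Python for p = 3, where the enumeration yields 25 acyclic digraphs in 11 equivalence classes:

```python
from itertools import combinations, permutations

p = 3
pairs = [(u, v) for u in range(p) for v in range(p) if u != v]

def is_acyclic(arcs):
    # a digraph is acyclic iff some vertex ordering places every arc forwards
    return any(all(order.index(u) < order.index(v) for (u, v) in arcs)
               for order in permutations(range(p)))

def signature(arcs):
    # Verma-Pearl invariants: skeleton and set of v-structures
    skeleton = frozenset(frozenset(a) for a in arcs)
    parents = {v: {u for (u, w) in arcs if w == v} for v in range(p)}
    vstructs = frozenset(
        (frozenset((w1, w2)), u)
        for u in range(p)
        for w1, w2 in combinations(parents[u], 2)
        if frozenset((w1, w2)) not in skeleton)
    return skeleton, vstructs

dags = [arcs for r in range(len(pairs) + 1)
        for arcs in combinations(pairs, r) if is_acyclic(arcs)]
classes = {signature(arcs) for arcs in dags}
# len(dags) == 25, len(classes) == 11
```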
However, its analytical deduction remains an open problem. Note that the computational gain is influenced not only by l, but also by other factors, such as how the element sizes in D_p/∼ are distributed. An algorithm to compute such sizes can be found in He et al. (2015). Recently, Radhakrishnan et al. (2018) have provided tight lower and upper bounds on the number and size of Markov equivalence classes when D is a tree.

Finally, we will characterize equivalence between directed and undirected graphs, first obtained by Wermuth (1980) for multivariate Gaussian distributions, by Wermuth and Lauritzen (1983) for contingency tables, and generalized in Frydenberg (1990) for graphoid-inducing distributions. When G is an undirected graph, M(G) = M(D) for some acyclic digraph D if and only if G is chordal. Conversely, an acyclic digraph D is Markov equivalent to its skeleton D^U if and only if D contains no v-structures. Furthermore, a relation with the moral graph can be established, which requires an analogue of (1): a density function f is said to recursively factorize according to D when

f(x) = ∏_{v ∈ V} f(x_v | x_{pa(v)}).

This characterization is equivalent to the Markov properties, and also implies that f factorizes as in (1) with respect to the moral graph D^m (Lauritzen et al., 1990). This means that P_X is always globally D^m-Markov for continuous X, and thus M(D) ⊆ M(D^m), with the equality only holding when D^m = D^U.

Example 1. An illustration of the previous concepts can be found in Figure 3. The graph in 3a is not chordal, and thus there is no Markov equivalent acyclic digraph. 3b is a chordal cover of 3a, and a Markov equivalent orientation is depicted in 3c. The acyclic digraph in 3d has v-structures, emphasized in dark grey, and thus cannot be Markov equivalent to its skeleton (3a).
The moral graph of 3d is 3e, which is in fact another chordal cover of 3a, and thus none of its orientations will be Markov equivalent to 3c.

5. Gaussian parametrization

When restricting to multivariate Gaussian distributions, we find connections between conditional independences and vanishing parameters. This correspondence can be used to provide a direct interpretation of the Markov properties, both in the undirected and in the directed case, allowing an enhanced manipulation of these Markov models.

In the following, the elements of a real q × r matrix M ∈ M_{q×r}(ℝ) will be denoted as m_ij, where i ∈ {1, ..., q} and j ∈ {1, ..., r}. M_IJ will be the |I| × |J| sub-matrix of M, where I ⊆ {1, ..., q} and J ⊆ {1, ..., r}, and we will write M_IJ^{-1} for (M_IJ)^{-1}. S_{≻0} and S_{⪰0} will represent the sets of positive definite and positive semi-definite symmetric matrices, respectively. The p-variate Gaussian distribution is denoted as N_p(µ, Σ), where µ ∈ ℝ^p is the mean vector and Σ ∈ S_{≻0} is the covariance matrix. I_p will denote the p-dimensional identity matrix, whereas 1_p will denote the p-vector with all entries equal to 1; dimensionality sub-scripts will often be dropped when the dimension of the respective object is clear from the context.

[Figure 3: Markov equivalence. (a) Chordless cycle; (b) chordal cover; (c) orientation; (d) v-structures; (e) moral graph.]

5.1. Conditional independence and the multivariate Gaussian distribution

Let V = {1, ..., p}. When a random vector X is distributed as N_p(µ, Σ), then for i, j ∈ V, X_i ⊥⊥ X_j is equivalent to σ_ij = 0. If we consider a partition (I, J) of V, then X_I | x_J is distributed as N_{|I|}(µ_{I·J}, Σ_{I·J}), where Σ_{I·J} = Σ_II − Σ_IJ Σ_JJ^{-1} Σ_JI (Anderson, 2003).
Thus, for i, k ∈ I, we have that X_i ⊥⊥ X_k | x_J is equivalent to σ_{ik·J} = 0, the (i, k) element of the conditional covariance matrix Σ_{I·J}. A correspondence can be established between the zeros in Σ_{I·J} and zero patterns in other representative matrices (Wermuth, 1976a, 1980; Uhler, 2018, § 9.1), as follows. Let the concentration matrix of X be Ω = Σ^{-1}, with elements ω_uv for u, v ∈ V. The matrix Σ_IJ Σ_JJ^{-1} is usually denoted as B_{I·J} and called the matrix of regression coefficients of X_I on X_J. Letting Ω_{I·J} := Σ_JJ^{-1}, we have the following matrix identity (Horn and Johnson, 2012):

Ω = ( Σ_II, Σ_IJ ; Σ_JI, Σ_JJ )^{-1} = ( Σ_{I·J}^{-1}, −Σ_{I·J}^{-1} B_{I·J} ; −B_{I·J}^t Σ_{I·J}^{-1}, Ω_{I·J} + B_{I·J}^t Σ_{I·J}^{-1} B_{I·J} ).

This allows us to relate Σ_{I·J} with Ω and B_{I·J} as

Σ_{I·J} = Ω_II^{-1},    (2)
B_{I·J} = −Ω_II^{-1} Ω_IJ,    (3)

which implies that, dually, Ω_II is identically equal to the concentration matrix of X_I | x_J, while Ω_{I·J} is the concentration matrix of X_J. Finally, from (2) we get, for i, k ∈ V,

X_i ⊥⊥ X_k | X_{V\{i,k}} ⟺ ω_ik = 0,    (4)

whereas from (3) it follows that, for J ⊆ V and i, k ∈ V \ J,

X_i ⊥⊥ X_k | X_J ⟺ β_{ik·J∪{k}} = 0,    (5)

where β_{ik·J∪{k}} is the k entry of the vector β_{i·J∪{k}}^t, that is, the coefficient of X_k in the regression of X_i on x_{J∪{k}}. The original notation for this, introduced in Yule (1907), was β_{ik·J}; that is, k is implicitly considered as included in the conditioning indexes. We have however chosen the alternative, explicit notation β_{ik·J∪{k}}, since it provides more notational simplicity in later sections.

5.2. Gaussian Markov models

In the Gaussian case, undirected Markov models are in correspondence with the concentration matrix, while for acyclic digraphs this correspondence is with the regression coefficients. Both rely on the auxiliary Markov properties that we presented in Section 4.2.
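Identities (2) and (3) are easy to verify numerically. The sketch below uses an arbitrary, made-up positive definite Σ, with the partition I = {1, 2}, J = {3, 4, 5} written 0-based in the code:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
M = rng.standard_normal((p, p))
Sigma = M @ M.T + p * np.eye(p)        # positive definite by construction
Omega = np.linalg.inv(Sigma)           # concentration matrix

I, J = [0, 1], [2, 3, 4]               # a partition (I, J) of V
SJJinv = np.linalg.inv(Sigma[np.ix_(J, J)])
# conditional covariance Sigma_{I.J} = Sigma_II - Sigma_IJ Sigma_JJ^{-1} Sigma_JI
Sigma_cond = Sigma[np.ix_(I, I)] - Sigma[np.ix_(I, J)] @ SJJinv @ Sigma[np.ix_(J, I)]
# regression coefficients B_{I.J} = Sigma_IJ Sigma_JJ^{-1}
B = Sigma[np.ix_(I, J)] @ SJJinv

# (2): the conditional covariance is the inverse of the I-block of Omega
ok2 = np.allclose(Sigma_cond, np.linalg.inv(Omega[np.ix_(I, I)]))
# (3): B_{I.J} = -Omega_II^{-1} Omega_IJ
ok3 = np.allclose(B, -np.linalg.inv(Omega[np.ix_(I, I)]) @ Omega[np.ix_(I, J)])
```

Identities (4) and (5) then amount to reading off single entries of these matrices.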
Let G = (V, E) be an undirected graph and consider X distributed as P_X ≡ N_p(µ, Σ) with P_X ∈ M(G). Since P_X is globally G-Markov, it is also pairwise G-Markov, and thus (4) directly gives that ω_uv = 0 for all u, v ∈ V such that uv ∉ E. This means that, if we define the set

S_{≻0}(G) := {M ∈ S_{≻0} : m_uv = 0 for all uv ∉ E},

we have Ω ∈ S_{≻0}(G) if and only if P_X is pairwise G-Markov. Furthermore, since the multivariate Gaussian distribution has a positive density, I(X) is a graphoid and thus the three Markov properties are equivalent. This allows us to redefine the Gaussian undirected Markov model as

N(G) = {N_p(µ, Σ) : Σ^{-1} ∈ S_{≻0}(G), µ ∈ ℝ^p}.    (6)

In the directed case, the redefinition is not so direct. Let D = (V, A) be an acyclic digraph, and assume, for notational simplicity, that the nodes are already ancestrally ordered as 1 ≺ ··· ≺ p. If X is distributed as P_X ≡ N_p(µ, Σ) with P_X ∈ M(D), it satisfies the ordered Markov property. Thus, whenever v ∈ pr(u) \ pa(u), we have X_u ⊥⊥ X_v | X_{pa(u)}, which is equivalent to β_{uv·pa(u)∪{v}} = 0 as in (5). Since we have assumed an ancestral order, β_{uv·pa(u)∪{v}} = β_{uv·pr(u)} for all u ∈ V, v ∈ pr(u) \ pa(u), which leads to P_X being ordered D-Markov if and only if β_{uv·pr(u)} = 0 for all u ∈ V, v ∈ pr(u) \ pa(u). Such a triangular requirement on the regression coefficients can be expressed with the matrix B defined, for v < u, as b_uv = 0 if v ∉ pa(u), and b_uv = β_{uv·pa(u)} ≡ β_{uv·pr(u)} otherwise. If we let v_u := σ_{uu·pr(u)}, the previous characterization leads to a matrix form of the linear regressions involved,

X = µ + B(X − µ) + E,    (7)

where E_u ∼ N(0, v_u). Since U := I_p − B is invertible, we can rearrange this as X = U^{-1}ξ + U^{-1}E, where ξ := Uµ. Let V be the diagonal matrix of conditional variances v_u.
Sometimes ξ, B and V are called the D-parameters of (µ, Σ) (Andersson and Perlman, 1998). In fact, U and V allow a decomposition of Σ (and of Σ^{-1}) as Σ = U^{-1}VU^{-t}. Furthermore, this decomposition uniquely determines Σ via U (equivalently B) and V (Horn and Johnson, 2012). Thus, in analogy with (6), if we define the set M(D) := {M ∈ M_{p×p}(ℝ) : m_uv = 0 for all (u, v) ∉ A} and the set ∆_p of p × p diagonal matrices, we can redefine the Gaussian directed Markov model as

N(D) = {N_p(µ, Σ) : Σ^{-1} = (I_p − B)^t V^{-1} (I_p − B), B ∈ M(D), V ∈ ∆_p}.    (8)

6. Maximum likelihood estimation

Maximum likelihood estimation is greatly simplified by exponential family theory (Barndorff-Nielsen, 1978). The multivariate Gaussian distribution is a regular exponential family, and thus both undirected and directed Gaussian Markov models can be expressed as special subfamilies of it.

6.1. The Gaussian family and maximum likelihood

In the multivariate Gaussian family the canonical parameter is η = (Ωµ, −Ω/2), over the space H = {(η_1, η_2) : η_1 ∈ ℝ^p, −η_2 ∈ S_{≻0}}, and the sufficient statistics are t(X) = (X, XX^t). Let {x^(n) : 1 ≤ n ≤ N} be N independent observations, where X^(n) ∼ N_p(µ, Σ) for each n ∈ {1, ..., N}, arranged in x ∈ M_{p×N}(ℝ), the respective random matrix being X. The random sample is also a regular exponential family with canonical parameter η = (Ωµ, −Ω/2) over the space H. The sufficient statistics in this case are t(X) = (N X̄, XX^t), with N X̄ = ∑_{n=1}^N X^(n).

In a regular exponential family F_H, a maximum of the likelihood function L(η), given a random sample X = x, is reached in H if and only if t(x) belongs to the interior of C(t), the closed convex hull of the support of the distribution of t, denoted as int(C(t)). In such a case, it is unique and given by the η ∈ H satisfying E[t(X)] = t(x).
For the multivariate Gaussian random sample, we have that E[N X̄] = Nµ and E[XXᵗ] = NΣ + Nµµᵗ; thus the convex support of t(X) = (N X̄, XXᵗ) is C(t) = {(v, M) ∈ R^p × S^p : M − vvᵗ/N ∈ S₀}. This gives that the maximum likelihood estimator for (µ, Σ) exists if and only if xxᵗ − N x̄x̄ᵗ ∈ S₀, which happens with probability one whenever N > p, and never otherwise. The solution in such a case is (x̄, Q/N), where

Q = Σ_{n=1}^N (X⁽ⁿ⁾ − X̄)(X⁽ⁿ⁾ − X̄)ᵗ = XXᵗ − N X̄X̄ᵗ.

A particular situation, usually assumed, is µ = 0. The canonical parameter now is η = −Ω/2 in the space {η : −η ∈ S₀}, and the sufficient statistic is t(X) = XXᵗ. The maximum likelihood estimator exists if and only if xxᵗ ∈ S₀, and in such a case it is XXᵗ/N = Q/N.

6.2. Gaussian Markov models as exponential families

When G is an undirected graph, the set S₀(G) is a convex (linear) cone inside the positive definite cone S₀ (e.g. Uhler, 2018, §9.2), which means that R^p × S₀(G) is an affine subspace of R^p × S₀, and thus N(G) is also a regular exponential family (Barndorff-Nielsen, 1978). Assume that µ = 0 and let Q_G be the projection of Q on E ∪ {uu : u ∈ V}, that is, such that q^G_uv = 0 for all uv ∉ E with u ≠ v. Since L(Ω) ∝ det(Ω)^{N/2} exp(−tr(ΩQ)/2) and Ω ∈ S₀(G), we have tr(ΩQ) = tr(ΩQ_G), and the sufficient statistic for N(G) is t(x) = Q_G (Lauritzen, 1996). Its convex support is C(t) = {P_G : P ∈ S₀}, equivalently called the set of projections extendible to full positive definite matrices. Thus, the maximum likelihood estimator for Σ exists if and only if Q_G ∈ int(C(t)), which is equivalent to saying that Q_G is extendible to a full positive definite matrix. Whenever it exists, it is the only extendible matrix Σ̂ that also satisfies the model restriction Σ̂⁻¹ ∈ S₀(G).
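A minimal sketch of the unrestricted Gaussian MLE (x̄, Q/N), with simulated data and numpy (the sample layout follows the p × N convention of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 3, 50                       # N > p, so the MLE exists almost surely
x = rng.standard_normal((p, N))    # sample arranged as a p x N matrix

xbar = x.mean(axis=1, keepdims=True)
Q = (x - xbar) @ (x - xbar).T      # Q = sum_n (x_n - xbar)(x_n - xbar)^t

mu_hat, Sigma_hat = xbar.ravel(), Q / N   # maximum likelihood estimator (xbar, Q/N)
```

The alternative expression Q = xxᵗ − N x̄x̄ᵗ and the existence condition (Q/N positive definite, which holds almost surely when N > p) can both be checked on the simulated data.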
A sufficient condition thus is that Q ∈ S₀, which happens almost surely for N ≥ p. Recovering Σ̂ is a convex optimization problem; Uhler (2018, §9.6) overviews some of the algorithms available for its computation. Note, however, that if G is chordal, then there is a closed-form expression for Ω̂ (Lauritzen, 1996). The existence of Σ̂ has been completely characterized when G is chordal by Grone et al. (1984) and Frydenberg and Lauritzen (1989), independently. Since finding Σ̂ is equivalent to a positive definite matrix completion problem (Uhler, 2018, §9.3), the problem lies at the interface between statistics and linear algebra. Therefore, this problem has been solved from an algebraic viewpoint (Sturmfels and Uhler, 2010; Uhler, 2012, 2018, §9.4) for a general, non-chordal G. However, the conditions on the sample size N are still unknown except for certain non-chordal graph types; see Uhler (2018, §9.5) for an up-to-date overview of the advances made so far.

Now we turn to the case where the random sample X is assumed to follow a multivariate Gaussian distribution constrained by the separation properties in an acyclic digraph. The restriction in (8), however, is not linear in the canonical parameter; in fact, Spirtes et al. (1997) show that these are curved exponential families. To obtain the maximum likelihood estimates, theory from multivariate linear regression can be applied (Andersson and Perlman, 1998). Recall that if X ∼ N_p(µ, Σ) and N_p(µ, Σ) ∈ N(D), then X can be expressed as (7). Thus, we can estimate the D-parameters for (µ, Σ) as the usual least squares estimators,

β̂ᵗ_{u·pa(u)} = Q_{u pa(u)} (Q_{pa(u) pa(u)})⁻¹,

ξ̂_u = x̄_u − β̂ᵗ_{u·pa(u)} x̄_{pa(u)},

N v̂_u = q_uu − β̂ᵗ_{u·pa(u)} Qᵗ_{u pa(u)},

respectively, for each u ∈ V, where q_uu is the u-th diagonal entry of Q.
We can then obtain directly the maximum likelihood estimator for (µ, Σ) from the respective D-parameter estimators (see Andersson and Perlman, 1998, for an algorithm). As opposed to the undirected case, (µ̂, Σ̂) exist with probability one if and only if N ≥ p + max{|pa(u)| : u ∈ V} (Anderson, 2003). Recently, Ben-David and Rajaratnam (2012) analyzed in detail the relationship between the D-parameters and Σ as a positive definite matrix completion problem, in analogy with the undirected case.

7. Model selection via hypothesis testing

Maximum likelihood estimators, presented in the previous section, can be used to address the problem of model estimation, but they require either prior knowledge or a statistical procedure that allows model selection; that is, selecting the graph that will define the Markov model. In this section we review the main hypothesis testing methods for this task. Throughout the section, for u, v ∈ V = {1, …, p} and U ⊆ V \ {u, v}, we will denote by ρ_{uv·U} the partial correlation coefficient between u and v given the variables in U, and by r_{uv·U} its maximum likelihood estimator, the sample partial correlation (see e.g. Anderson, 2003, §4.3 for an introduction to partial correlation theory).

7.1. Stepwise selection

In the undirected case, we are interested in testing the hypothesis H₀ : Ω ∈ S₀(G₀) against H₁ : Ω ∈ S₀(G), where G₀ = (V, E_{G₀}) ⊆ G = (V, E_G). The result of such a test determines whether the edges in E_G \ E_{G₀} should be excluded from the selected model; that is why these tests are usually known as edge exclusion tests. Note also that this is backward model selection, since our null hypothesis consists of a subgraph. Let Σ̂₀ and Σ̂ be the maximum likelihood estimators for a covariance matrix in the Markov model determined by G₀ and G, respectively.
The likelihood ratio statistic is

T_L = det(Σ̂)^{N/2} / det(Σ̂₀)^{N/2} = (det(Ω̂₀) / det(Ω̂))^{N/2}.

Under H₀, −2 log(T_L) = N(log det(Ω̂) − log det(Ω̂₀)) is asymptotically distributed as a χ² distribution with |E_G| − |E_{G₀}| degrees of freedom; however, this is a poor approximation in many cases (Porteous, 1989). More accurate distributional results have been derived by Eriksen (1996), as follows. Let G₀ ⊂ ··· ⊂ G_k (= G) be a sequence of graphs where, for 1 ≤ i ≤ k, E_{G_{i−1}} = E_{G_i} \ {e_i} for some e_i = u_i v_i ∈ E_{G_i} (a sequence of edge deletions). Then, under H₀, T_L^{2/N} is distributed as the product Π_{i=1}^k B_i of univariate Beta variables, where, for 1 ≤ i ≤ k,

B_i ∼ B(½(N − |ne_{G_i}(u_i) ∩ ne_{G_i}(v_i)| − 2), ½).

The above result is exact whenever G and G₀ are chordal or share the same non-chordal maximal subgraphs (Eriksen, 1996). Specifically, denote by C*_i = {u_i, v_i} ∪ (ne_{G_i}(u_i) ∩ ne_{G_i}(v_i)) the unique clique in G_i of which edge e_i is a member. Then, under H₀ (see e.g. Lauritzen, 1996, Proposition 5.14),

T_L^{2/N} = Π_{i=1}^k (1 − r²_{u_i v_i · C*_i \ {u_i, v_i}}),

giving that T_L^{2/N} is distributed as Π_{i=1}^k B((N − |C*_i|)/2, 1/2). Note that in this decomposable case one avoids actually computing Ω̂ and Ω̂₀. The statistic T_L has been used for model selection in undirected Gaussian Markov models by Wermuth (1976b).

In the case of a directed Gaussian Markov model over an acyclic digraph D = (V, A), most of the results are adaptations of analogues in multivariate linear Gaussian models. The likelihood ratio, whose moments are also characterized in Andersson and Perlman (1998), is

T_L = (det(Σ̂) / det(Σ̃))^{N/2} = ( Π_{v∈V} (σ̂_vv − σ̂ᵗ_{v·pa(v)} Σ̂⁻¹_{pa(v)} σ̂_{v·pa(v)}) / Π_{v∈V} (σ̃_vv − σ̃ᵗ_{v·pa(v)} Σ̃⁻¹_{pa(v)} σ̃_{v·pa(v)}) )^{N/2},

where Σ̃ and Σ̂ are the respective maximum likelihood estimators for D̃ and D, with D̃ ⊆ D.
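A minimal sketch (numpy and scipy assumed, data simulated under the null) of a single edge exclusion test in the decomposable case: deleting one edge uv from a complete graph G, so that C*₁ = V and T_L^{2/N} = 1 − r²_{uv·V\{u,v}}:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, N = 4, 100
x = rng.standard_normal((p, N))   # simulated sample; all true partial correlations zero

# Test deletion of edge uv from a complete (hence chordal) graph:
# the unique clique containing uv is C* = V, so |C*| = p.
u, v = 0, 1
xc = x - x.mean(axis=1, keepdims=True)
Q = xc @ xc.T
Omega_hat = np.linalg.inv(Q / N)

# Sample partial correlation r_{uv.V\{u,v}} from the concentration matrix
r = -Omega_hat[u, v] / np.sqrt(Omega_hat[u, u] * Omega_hat[v, v])

T_2N = 1.0 - r ** 2                                  # T_L^{2/N}
# Exact null law: Beta((N - |C*|)/2, 1/2); small values of T_2N are evidence
# against H0, so the p-value is the lower tail probability.
p_value = stats.beta.cdf(T_2N, (N - p) / 2, 0.5)
```

Small p-values lead to rejecting H₀, i.e., to keeping the edge uv in the model.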
A backward stepwise method has become popular for selecting D, commonly called the PC algorithm (Spirtes et al., 2000). This method proceeds by first finding an estimator of the skeleton, D̂_U, from a complete undirected graph, and then orienting it. At iteration i of the first step (finding D̂_U), H₀ : X_u ⊥⊥ X_v | X_C is tested, with C ⊆ n̂e(u) \ {v} and |C| = i. The edge uv is removed from D̂_U if H₀ is not rejected. Note that D̂_U depends on the order in which H₀ is tested at each iteration, a problem circumvented in the modification by Colombo and Maathuis (2014). Assuming that I(X) = I(D) (see Section 4), commonly called the faithfulness assumption, Robins et al. (2003) showed that the PC algorithm is pointwise consistent but may not be uniformly consistent, regardless of the method used for testing H₀. Zhang and Spirtes (2003) approached this problem by introducing a stronger condition, called strong faithfulness, which, by requiring nonzero partial correlations to have a common lower bound, gives uniform consistency, even in a high-dimensional setting (Kalisch and Bühlmann, 2007). However, although the set of 'unfaithful' distributions has Lebesgue measure zero (Meek, 1995), the 'strongly unfaithful' ones constitute a set of non-zero Lebesgue measure, which can in some cases be very large (Uhler et al., 2013; Lin et al., 2014).

7.2. Multiple testing

When performing model selection with these tests, multiple testing error rates need to be controlled. To overcome this, Drton and Perlman (2004) propose an alternative to the previous stepwise methods. First, note that both acyclic directed and undirected Gaussian Markov models over V = {1, …, p} are characterized by certain partial correlation coefficients, since for u, v ∈ V and U ⊆ V \ {u, v}, we have X_u ⊥⊥ X_v | X_U ⟺ ρ_{uv·U} = 0.
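The skeleton phase of the PC algorithm can be sketched in a few lines. The following is a simplified, order-dependent version (the orientation phase and the order-independent modification of Colombo and Maathuis are omitted; the function name and test via Fisher's Z-transform are our choices):

```python
import itertools
import numpy as np
from scipy import stats

def pc_skeleton(x, alpha=0.05):
    """Sketch of the PC skeleton phase with Fisher-z partial correlation tests."""
    p, N = x.shape
    corr = np.corrcoef(x)
    adj = {u: set(range(p)) - {u} for u in range(p)}   # start from a complete graph

    def pcorr(u, v, C):
        idx = [u, v] + list(C)
        P = np.linalg.inv(corr[np.ix_(idx, idx)])
        return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

    i = 0
    while any(len(adj[u]) - 1 >= i for u in adj):
        for u in range(p):
            for v in list(adj[u]):
                if len(adj[u] - {v}) < i:
                    continue
                for C in itertools.combinations(sorted(adj[u] - {v}), i):
                    r = float(np.clip(pcorr(u, v, C), -0.999999, 0.999999))
                    z = np.sqrt(N - len(C) - 3) * np.arctanh(r)
                    if 2 * stats.norm.sf(abs(z)) > alpha:   # H0 not rejected: drop edge
                        adj[u].discard(v)
                        adj[v].discard(u)
                        break
        i += 1
    return {frozenset((u, v)) for u in adj for v in adj[u]}

# Simulated chain 0 -> 1 -> 2: the skeleton should retain edges 0-1 and 1-2.
rng = np.random.default_rng(3)
e = rng.standard_normal((3, 500))
x = np.empty_like(e)
x[0] = e[0]
x[1] = 0.9 * x[0] + e[1]
x[2] = 0.9 * x[1] + e[2]
skel = pc_skeleton(x)
```

The true edges 0–1 and 1–2 have large partial correlations and survive every test, while the spurious edge 0–2 is typically removed once the separating set {1} is conditioned on.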
Assuming conditional independence, that is, ρ_{uv·U} = 0, the statistic √(N − |U| − 2) r_{uv·U} / √(1 − r²_{uv·U}) has a t distribution with N − |U| − 2 degrees of freedom. However, a faster Gaussian approximation can be obtained using Fisher's Z-transform,

Z(x) = ½ log((1 + x)/(1 − x)) = tanh⁻¹(x).

In such a case, the distribution of √(N − |U| − 3) Z(r_{uv·U}) tends to a standard Gaussian. Based on the above discussion, Drton and Perlman (2004) propose a method where a set of simultaneous p-values and confidence intervals is obtained such that the edge set is estimated, for a significance level α and using the Šidák (1967) inequality, as

Ê_α := { uv : √(N − p − 1) |Z(r_{uv·V\{u,v}})| > Φ⁻¹(½(1 − α)^{2/(p(p−1))} + ½) },   (9)

where Φ is the cumulative distribution function of a standard Gaussian. Denoting Ĝ_α = (V, Ê_α), it holds that lim inf_{N→∞} P(Ĝ_α = G) ≥ 1 − α if the distribution under consideration, N_p(µ, Σ) ∈ N(G), is faithful to G, that is, if ω_uv = 0 ⟺ uv ∉ E. If faithfulness is not satisfied, then the result holds with respect to the smallest graph H such that G ⊆ H and N_p(µ, Σ) is faithful to H. The multiple testing procedure in (9) has also been extended in Drton and Perlman (2008), obtaining an estimate of the arc set as

Â_α := { (v, u) : v < u and √(N − u − 1) |Z(r_{uv·pr(u)\{v}})| > Φ⁻¹(½(1 − α)^{2/(p(p−1))} + ½) },   (10)

where an ancestral ordering ≺ is assumed in V such that the resulting permutation is the identity; that is, such that v ≺ u ⟺ v < u. Consistency is established as in the undirected case; note the symmetry with (9). See Drton and Perlman (2007) for a general discussion of some variations of (9) and (10) and their impact on overall error control. Recently, Liu (2013) has extended the methodology of Drton and Perlman (2004) to the high-dimensional scenario, with p > N.
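The simultaneous edge selection rule (9) is direct to implement once the sample concentration matrix is available. A sketch (numpy/scipy; `edge_select` is our name, and the sample covariance uses the usual N − 1 divisor rather than the MLE):

```python
import numpy as np
from scipy import stats

def edge_select(x, alpha=0.05):
    """Sketch of the simultaneous edge selection rule (9) of Drton & Perlman (2004)."""
    p, N = x.shape
    Omega = np.linalg.inv(np.cov(x))   # sample concentration matrix (N > p assumed)
    thresh = stats.norm.ppf(0.5 * (1 - alpha) ** (2 / (p * (p - 1))) + 0.5)
    E_hat = set()
    for u in range(p):
        for v in range(u + 1, p):
            # full sample partial correlation r_{uv.V\{u,v}}
            r = -Omega[u, v] / np.sqrt(Omega[u, u] * Omega[v, v])
            if np.sqrt(N - p - 1) * abs(np.arctanh(r)) > thresh:
                E_hat.add((u, v))
    return E_hat

# Simulate from a chain 0 - 1 - 2 with a tridiagonal concentration matrix.
rng = np.random.default_rng(4)
Omega_true = np.array([[2.0, -1.0, 0.0],
                       [-1.0, 2.0, -1.0],
                       [0.0, -1.0, 2.0]])
Sigma = np.linalg.inv(Omega_true)
x = np.linalg.cholesky(Sigma) @ rng.standard_normal((3, 500))
E_hat = edge_select(x)
```

The true partial correlations along the chain are 0.5, far above the simultaneous threshold at N = 500, so both chain edges are selected.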
A related testing procedure has emerged, motivated by the field of gene network learning from microarray data. Instead of testing full partial correlations ρ_{uv·V\{u,v}} in an undirected model, only limited q-order partial correlations ρ_{uv·U}, where U ⊆ V \ {u, v} and |U| = q, are tested (Castelo and Roverato, 2006). An edge is added to the resulting graph, called a q-partial graph, only when the hypothesis ρ_{uv·U} = 0 is rejected for all such U. This procedure is especially suited for situations where the number of variables is substantially larger than the number of instances, as happens in the case of microarray data, where low-order conditional independence relationships (up to q ≤ 3) have been popular (Wille and Bühlmann, 2006; de la Fuente et al., 2004; Magwene and Kim, 2004). Castelo and Roverato (2006) generalize and formalize these approaches, and provide a robust model selection procedure for q-partial graphs. This is intended to serve as an intermediate step for model selection of a classical undirected Markov model N(G), and yields a great simplification when G is sparse (Castelo and Roverato, 2006).

8. Regularization

Regularization approaches, which perform model selection and estimation simultaneously, have become popular in the context of Markov models. They are usually applied when N < p, and thus the existence of the maximum likelihood estimator is not guaranteed. The main consistency results available for both the directed and undirected cases share sparseness and high-dimensionality assumptions, as we will see below. There are two different approaches: those that penalize the likelihood and those that instead focus on the regression coefficients. Throughout this section we will employ asymptotic notation, specifically the symbols O(·) and Θ(·), denoting an asymptotic upper bound and asymptotic equivalence, respectively.
For M ∈ M_{q×r}(R), vec(M) will denote the vectorization of M, (m_11, …, m_q1, …, m_1r, …, m_qr)ᵗ. This way, the operator norm of M will be denoted ‖M‖, whereas ‖M‖_q will be used to denote ‖vec(M)‖_q, ‖·‖_q being the vector q-norm. If v is a p-vector, diag(v) will denote the matrix M in ∆_p with main diagonal v; analogously, diag(M) ∈ ∆_p will have the same diagonal as M, and M⁻ will be used for M − diag(M).

8.1. Node-wise regression

Let G = (V, E) be an undirected graph, with V = {1, …, p}. Let X be a random vector whose distribution N_p(µ, Σ) belongs to the undirected Gaussian Markov model N(G). Assume, for notational simplicity, that µ = 0, and, following the notation of Section 6, let X = x be a p × N random sample from N_p(0, Σ). Since for each u, v ∈ V, β_{uv·V\{u}} = −ω_uv/ω_uu (Equation (3)), then

X_u ⊥⊥ X_v | X_{V\{u,v}} ⟺ ω_uv = 0 ⟺ β_{uv·V\{u}} = β_{vu·V\{v}} = 0.

This means that an analogue of the matrix B in directed Gaussian Markov models (Equation (7)) can be used for determining the missing edges in the undirected case. In Meinshausen and Bühlmann (2006), this is done in the regression function, as

b̂^λ_u := argmin_{b_u ∈ R^p, b_uu = 0} (1/N) ‖xᵗ_u − xᵗ b_u‖²₂ + λ f(b_u),   (11)

where λ ≥ 0, x_u is the u-th row vector of x, and f(·) is the penalty function. For each v ∈ V \ {u}, b̂^λ_uv gives an estimate of β_{uv·V\{u}}. Let n̂e(v) := {u ∈ V : b̂^λ_vu ≠ 0}. While u ∈ ne(v) ⟺ v ∈ ne(u) for all u, v ∈ V, this may not be true for n̂e(u) and n̂e(v). Hence, two different estimators for the edge set E may be defined:

Ê_∧ := {uv : u ∈ n̂e(v) and v ∈ n̂e(u)},    Ê_∨ := {uv : u ∈ n̂e(v) or v ∈ n̂e(u)}.

Let f(·) = ‖·‖₁, commonly known as the lasso penalty (Tibshirani, 1996) or l1 regularization.
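A minimal sketch of node-wise lasso regression as in (11), using scikit-learn's `Lasso` (whose objective differs from (11) by a constant factor; the graph, penalty value and variable names are our illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
p, N, lam = 4, 300, 0.1
# Simulate from a chain 0 - 1 - 2 - 3 via a tridiagonal concentration matrix.
Omega = np.eye(p) * 2.0
for u in range(p - 1):
    Omega[u, u + 1] = Omega[u + 1, u] = -0.9
Sigma = np.linalg.inv(Omega)
x = np.linalg.cholesky(Sigma) @ rng.standard_normal((p, N))   # p x N sample

ne_hat = {}
for u in range(p):
    others = [v for v in range(p) if v != u]
    # Lasso regression of X_u on the remaining variables (l1 penalty, as in (11))
    fit = Lasso(alpha=lam).fit(x[others].T, x[u])
    ne_hat[u] = {v for v, c in zip(others, fit.coef_) if c != 0}

# AND / OR combination of the estimated neighbourhoods
E_and = {frozenset((u, v)) for u in range(p) for v in ne_hat[u] if u in ne_hat[v]}
E_or = {frozenset((u, v)) for u in range(p) for v in ne_hat[u]}
```

The chain's true regression coefficients (±0.45) are large enough to survive the shrinkage at this penalty level, so the chain edges appear in both Ê_∧ and Ê_∨.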
Then both estimators Ê_∧ and Ê_∨ are consistent for a certain choice of λ. This result was independently discovered by Meinshausen and Bühlmann (2006), Zhao and Yu (2006), Zou (2006) and Yuan and Lin (2007b). It relies on the following almost necessary and sufficient condition:

|Σ_{z∈ne(v)} sign(β_{vz·ne(v)}) β_{uz·ne(v)}| < 1.   (12)

This node-wise regression approach may also be used to perform model selection for acyclic directed Gaussian Markov models if there is a known order among the variables; see for example Shojaie and Michailidis (2010) or Yu and Bien (2017) and references therein. From Equation (7), the regression function to penalize in this case would be, for each u ∈ V,

X_u = µ_u + Σ_{v∈pr(u)} β_{uv·pr(u)} (X_v − µ_v) + E_u.   (13)

The condition of Equation (12), commonly called the 'irrepresentable condition' (Zhao and Yu, 2006) or 'neighbourhood stability' (Meinshausen and Bühlmann, 2006), is inherent to model selection in linear regression with l1 regularization, and thus it also holds when penalizing (13) with the l1 penalty. However, some variants have been proposed because it is rather restrictive. These alternatives usually rely on thresholding the regression coefficients or adding weights in the l1 penalty, and under milder assumptions they still achieve model selection consistency (Meinshausen and Yu, 2009) or other attractive, 'oracle' properties (van de Geer and Bühlmann, 2009); see Bühlmann and van de Geer (2011, §7) for a review. van de Geer and Bühlmann (2009) show that although model selection consistency for neighbourhood selection may be restrictive, sufficient conditions for such oracle properties hold fairly generally.

8.2. Penalized likelihood

In van de Geer and Bühlmann (2013), l0 regularization is alternatively used in the context of directed Gaussian Markov models, without assuming a known order.
As in Meinshausen and Bühlmann (2006), the regression coefficients are penalized in their approach; more precisely, the D-parameters in the likelihood function (assuming µ = 0). As such, the assumptions required for the consistency of both methods share some symmetry, as we have outlined in Table 1. The estimators in this case are obtained as

(V̂_λ, B̂_λ) = argmin_{Ω = (I_p − B)ᵗ V⁻¹ (I_p − B), B ∈ M(D), V ∈ ∆_p} (tr(ΩS) − N log det(Ω) + λ f(Ω)),

where λ ≥ 0, M(D) and ∆_p are as in Equation (8), and S = xxᵗ/N.

N(G) (Meinshausen and Bühlmann, 2006)  |  N(D) (van de Geer and Bühlmann, 2013)
l1 regularization                      |  l0 regularization
Lower bound on ρ_{uv·V\{u,v}}          |  Lower bound on β_{uv·pr(u)}
Upper bound on |ne(v)|                 |  Upper bound on |pa(v)|
Bounded neighbourhood perturbations    |  Bounded permutation perturbations

Table 1: Comparison of assumptions for consistency results on regression-based penalized estimation in acyclic directed and undirected Gaussian Markov models.

When f(Ω) = |{(u, v) : b_uv ≠ 0}| (l0 regularization), V̂_λ and B̂_λ are equal among Markov equivalent models, and the resulting estimator of the concentration matrix Ω̂_λ is consistent for a certain choice of λ (van de Geer and Bühlmann, 2013). The strong faithfulness condition for the PC algorithm, bounding nonzero partial correlations, resembles the assumptions for regularization methods (Table 1). In fact, l0 regularization has been suggested as an alternative to the PC algorithm, in order to avoid the restrictive strong faithfulness assumption (Uhler et al., 2013); however, it is unclear how the assumptions of both methods are related. For recent extensions of the work by van de Geer and Bühlmann (2013), see Aragam and Zhou (2015) and Aragam et al. (2017).

In undirected Gaussian Markov models conditional independences can be read directly off Ω.
Therefore, the penalized likelihood approach can be formulated more directly, for λ ≥ 0, as

Ω̂_λ = argmin_{Ω ∈ S₀(G)} (tr(ΩS) − N log det(Ω) + λ f(Ω)).   (14)

Yuan and Lin (2007a) were the first to pursue this approach, and they chose f(Ω) = ‖Ω⁻‖₁; that is, the off-diagonal elements in Ω, which determine the edges of the resulting undirected graph, are penalized. Later, in Banerjee et al. (2008) the diagonal elements are included in the regularization function, that is, f(Ω) = ‖Ω‖₁; however, since 1/ω_uu = σ_{uu·V\{u}}, this choice for the penalty favours larger values for the error variances in the regression of X_u on the rest of the variables (Bühlmann and van de Geer, 2011). Nonetheless, this estimator is the one chosen in the extensively used Graphical Lasso algorithm of Friedman et al. (2008), although model selection consistency has only been proved for f(Ω) = ‖Ω⁻‖₁ (Lam and Fan, 2009; Ravikumar et al., 2011). It is not known whether the sufficient conditions required for this consistency are strictly stronger than the irrepresentable condition, as some examples (Meinshausen, 2008) seem to indicate. For the penalization of Yuan and Lin (2007a) (f(Ω) = ‖Ω⁻‖₁), the convergence rate is (Rothman et al., 2008)

‖Ω̂_λ − Ω‖₂ ∈ O(√((|E| + p) log(p) / N)) as N → ∞.

A relaxation of this rate can be obtained based on the correlation matrix, as follows. Since Σ = DPD, with P the correlation matrix and D the diagonal matrix of standard deviations, if we let the corresponding sample estimates be D̂² = diag(Σ̂) and P̂ = D̂⁻¹ Σ̂ D̂⁻¹, we can then estimate K = P⁻¹ as

K̂_λ = argmin_{K ∈ S₀(G)} (tr(K P̂) − N log det(K) + λ f(K)),

for λ ≥ 0. The concentration matrix can then be alternatively estimated as Ω̃_λ = D̂⁻¹ K̂_λ D̂⁻¹, yielding a convergence rate of (Rothman et al., 2008)

‖Ω̃_λ − Ω‖ ∈ O(√((|E| + 1) log(p) / N)) as N → ∞.
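A small sketch of penalized likelihood estimation using scikit-learn's `GraphicalLasso` (which, per its documentation, applies the l1 penalty to the off-diagonal precision entries, i.e. f(Ω) = ‖Ω⁻‖₁ as in Yuan and Lin; the simulated graph and penalty value are our illustrative choices):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(6)
p, N = 4, 500
# Chain 0 - 1 - 2 - 3 via a tridiagonal concentration matrix.
Omega = np.eye(p) * 2.0
for u in range(p - 1):
    Omega[u, u + 1] = Omega[u + 1, u] = -0.9
Sigma = np.linalg.inv(Omega)
x = np.linalg.cholesky(Sigma) @ rng.standard_normal((p, N))

gl = GraphicalLasso(alpha=0.05).fit(x.T)   # fit expects an N x p data matrix
Omega_hat = gl.precision_

# Selected edge set: pairs with a nonzero estimated off-diagonal entry
E_hat = {(u, v) for u in range(p) for v in range(u + 1, p)
         if abs(Omega_hat[u, v]) > 1e-8}
```

The estimated precision matrix recovers the strong tridiagonal structure at this sample size, so the chain edges appear in the selected edge set.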
Convergence rates in other norms are provided in Ravikumar et al. (2011), and they have been generalized by Lam and Fan (2009) to other penalty functions.

9. Bayesian model selection and estimation

Consider a continuous multivariate family F_θ parametrized by θ, and denote by f(x | θ) the density function of a random sample X from P ∈ F_θ for a given value of θ. In Bayesian statistics, θ is treated as a random variable with known distribution, f(θ), usually called the prior distribution of θ. Inference is then performed based on the value of f(θ | x) ∝ f(θ) f(x | θ), the posterior distribution of θ given the information in X = x. In Gaussian Markov models, θ = (µ, Ω, G), where in our case G is either undirected or acyclic directed. Therefore the target probability is f(G, µ, Ω | x) ∝ f(x | µ, Ω, G) f(µ, Ω | G) f(G). Integrating out µ and Ω, we obtain the posterior density of model G, f(G | x) ∝ f(x | G) f(G). The prior for the graph space, f(G), is usually set as uniform. However, this choice is biased towards middle-sized graphs, and thus other prior distributions have been proposed (e.g. Scutari, 2013); see Massam (2018, §10.4.1) and references therein for a recent detailed overview. Bayesian inference for Gaussian graphical models is usually meant for moderate sample sizes, since it relies on sampling from the resulting posterior distribution, which becomes infeasible in high dimensions (see e.g. Jones et al., 2005, or Massam, 2018). In the following, the p-variate Wishart distribution will be denoted W_p(n, Λ), with n ∈ R, n > p − 1 and Λ ∈ M_{p×p}(R), Λ ≻ 0; analogously, the p-variate inverse Wishart distribution will be W⁻¹_p(ν, Ψ), with ν ∈ R, ν > p − 1 and Ψ ∈ M_{p×p}(R), Ψ ≻ 0.

9.1. Hyper Markov laws

When G is undirected and decomposable, and assuming µ = 0, Dawid and Lauritzen (1993) proposed for f(Ω | G) what are known as the hyper Markov laws. These are defined in terms of properties of the graph associated with the Markov model, mimicking Markov properties. Specifically, let θ be a random variable taking values over N(G) and, for subsets A, B ⊆ V, denote by θ_A and θ_{B|A} the parameters of the marginal distribution of X_A and the conditional distribution of X_B given values of X_A, respectively. The probability distribution of θ is said to be (weakly) hyper G-Markov if, for any decomposition (A, B, S) of G, it holds that θ_{A∪S} ⊥⊥ θ_{B∪S} | θ_S; if it further holds that θ_{B∪S|A∪S} ⊥⊥ θ_{A∪S}, it is called strongly hyper G-Markov. For chordal graphs, if the probability distribution of θ is strongly hyper G-Markov with respect to G, then the probability distribution of θ | x is the unique (strong) hyper G-Markov distribution specified by the clique-marginal distributions {P(θ_C | x_C) : C ∈ C(G)}; and, when densities exist, f(θ_C | x) ∝ f(x_C | θ_C) f(θ_C) (Dawid and Lauritzen, 1993), where x_C stands for all the observations in x corresponding to the variables in C. That is, under these assumptions, it is possible to localize computations over the graph cliques when performing Bayesian inference.

In a multivariate Gaussian distribution N_p(0, Σ), the inverse Wishart is a conjugate prior for Σ; that is, if Σ ∼ W⁻¹_p(ν, Ψ), then Σ | x ∼ W⁻¹_p(N + ν, Q + Ψ) (recall Q = xxᵗ). We can thus construct the hyper inverse Wishart distribution as the unique hyper Markov distribution associated with inverse Wishart clique marginals: Σ_CC ∼ W⁻¹_{|C|}(ν, Ψ_C) for each clique C ∈ C(G). This hyper Markov distribution is denoted HW⁻¹_p(ν, Ψ), where Ψ ∈ S₀ is such that Ψ_CC = Ψ_C for each clique C ∈ C(G).
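The inverse Wishart conjugate update is straightforward to check numerically with scipy. The sketch below treats the full covariance matrix (i.e. a complete graph; the hyper inverse Wishart adds the clique-marginal structure on top of this); the prior hyperparameters are made up for illustration:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(7)
p, N = 3, 40
Sigma_true = np.array([[1.0, 0.5, 0.0],
                       [0.5, 1.0, 0.3],
                       [0.0, 0.3, 1.0]])
x = np.linalg.cholesky(Sigma_true) @ rng.standard_normal((p, N))
Q = x @ x.T                       # sufficient statistic (mu = 0 assumed)

nu, Psi = p + 2, np.eye(p)        # inverse Wishart prior W^{-1}_p(nu, Psi)
# Conjugacy: the posterior is again inverse Wishart with updated parameters,
# Sigma | x ~ W^{-1}_p(N + nu, Q + Psi).
post = invwishart(df=nu + N, scale=Psi + Q)
Sigma_post_mean = post.mean()     # equals (Psi + Q) / (nu + N - p - 1)
```

The posterior mean formula follows from the inverse Wishart mean scale/(df − p − 1), which is well defined here since nu + N > p + 1.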
From the discussion above, we know that this distribution is strongly hyper G-Markov. The main advantage of this prior is that it has many properties mirroring those of Markov models, since hyper Markov distributions are also defined in terms of an underlying graph. Since its introduction, the hyper inverse Wishart prior for decomposable graphs has been extensively studied. The explicit expression for its density is derived in, e.g., Giudici (1996) or Roverato (2000). In order to set its parameters, a hierarchical approach such as that in Giudici and Green (1999) can be followed, where ν and Ψ are assumed to have a Gamma and a Wishart distribution, respectively. Since the absent edges of G correspond to zeros in Ω (Equation (4)), Roverato (2000) derives the distribution induced on Ω by assuming f(Σ | G) = HW⁻¹_p(ν, Ψ). He shows that in that case the density is proportional to that of a Wishart matrix conditioned on the event Ω ∈ S₀(G), and calls such a prior distribution on Ω the G-conditional Wishart. Recently, Massam (2018, §10.3.2) has provided a detailed overview of the properties of the hyper inverse Wishart, and of technical considerations such as how to sample from it or perform Bayesian model selection using Bayes factors. Letac and Massam (2007) generalize both the G-conditional and hyper inverse Wishart to a broader conjugate family, allowing for more than one shape parameter, which is used for model selection by Rajaratnam et al. (2008) (see also Massam, 2018, §10.3.3). The hyper inverse Wishart has been extended to non-chordal graphs by Roverato (2002), based on properties of the Isserlis matrix of Σ (Roverato and Whittaker, 1998).
However, Bayesian model selection in this scenario requires the evaluation of the G-conditional Wishart normalizing constant, which becomes a problem since it lacked a known closed-form expression for a general non-chordal G until very recently (Uhler et al., 2018). Much of the literature has therefore been devoted to this issue: Atay-Kayis and Massam (2005) analysed the Cholesky decomposition of Ω and its relation with the cone S₀(G) and positive definite matrix completions (Section 6), and Carvalho et al. (2007) and Wang and Carvalho (2010) used such theoretical analysis to provide a direct sampler from the hyper inverse Wishart prior, among other contributions. A recent detailed presentation of this computational body of research can be found in Massam (2018, §10.4). Note that, although Uhler et al. (2018) provide exact formulas and examples for special types of graphs, efficient methods for their computation remain to be found.

9.2. Priors for acyclic directed models

The methodology by Geiger and Heckerman (2002) for acyclic directed Gaussian Markov models can be seen as an extension of hyper Markov distributions to such a context, since they both coincide for chordal skeletons. If D^c is an arbitrary complete digraph, then, under some assumptions on f(µ, Ω | D) and f(x | D, µ, Ω), computations can be localized as

f(x | D) = Π_{v∈V} f(x_{{v}∪pa_D(v)} | D^c) / f(x_{pa_D(v)} | D^c).   (15)

The marginal likelihood in Equation (15) is equal among Markov equivalent acyclic digraphs (Geiger and Heckerman, 2002). The conjugate prior for (µ, Ω) is the normal-Wishart distribution, where Ω ∼ W_p(α_Ω, Λ) and µ | Ω ∼ N_p(µ₀, (α_µ Ω)⁻¹). This yields a normal-Wishart posterior distribution for (µ, Ω) | D^c.
Using this, Geiger and Heckerman (1994) obtain an explicit expression for each factor in Equation (15): for U ⊆ V,

f(x_U | D^c) = (α_µ/(α_µ + N))^{|U|/2} (2π)^{−|U|N/2} · ( Γ_{|U|}((N + α_Ω − p + |U|)/2) / Γ_{|U|}((α_Ω − p + |U|)/2) ) · ( |Λ_UU|^{(α_Ω − p + |U|)/2} / |R_UU|^{(α_Ω − p + |U| + N)/2} ),

where Γ_p(·) is the p-dimensional Gamma function, and

R = Λ + Q + (N α_µ/(N + α_µ)) (µ₀ − x̄)(µ₀ − x̄)ᵗ.

Furthermore, Geiger and Heckerman (2002) characterize the normal-Wishart prior for (µ, Ω) as the only distribution satisfying the global parameter independence assumption,

f(θ | D) = Π_{v∈V} f(θ_v | D),

for every P_X ∈ N(D). This condition is required for Equation (15) to hold. The above priors have been used by Consonni and La Rocca (2012) and Altomare et al. (2013) for objective Bayesian model selection, where f(θ | D) might be improper. To overcome this, they use fractional Bayes factors (O'Hagan, 1995), which had also been previously used by Carvalho and Scott (2009) for chordal undirected models (see Massam, 2018, §10.6 and references therein for more details). Recently, Ben-David et al. (2016) have proposed a family of priors extending those by Geiger and Heckerman (2002) but including more shape parameters, that is, mimicking those in Letac and Massam (2007) for undirected models. See Rajaratnam (2012) and Cao et al. (2019) for further discussion of these priors.

10. Higher level Markov model classes with mixed graphs

As we have seen throughout this review, the classes of acyclic directed and undirected Markov models are intimately related. Therefore, one approach to their unified treatment is to step to a higher level and define new Markov model classes containing them as subclasses. In this section we will overview this approach, which has been particularly active in the past few years.
The graphs used for these new Markov models are usually called mixed graphs because, unlike purely undirected or acyclic directed graphs, they allow for more than one edge type. We do not aim in this section for a thorough account of the achievements and drawbacks of the different developments, since that would take another full paper.

Chain graphs are the first higher level attempt at this unification: they allow two edge types, directed and undirected, and forbid semi-directed cycles. Drton (2009) provides a unifying view of these model classes, focusing on their discrete parametrization: both the undirected and directed edges can have two different interpretations, thus giving rise to four different chain graph model classes. Among them, AMP chain graphs (Andersson et al., 2001) and LWF chain graphs (Lauritzen and Wermuth, 1989; Frydenberg, 1990), named after the initials of the respective papers' authors, contain both the acyclic directed and undirected Markov model classes. Multivariate regression (MVR) chain graphs (Cox and Wermuth, 1993, 1996), or Type IV in Drton (2009), are possibly the most traditional ones and can be viewed as a special case of the path diagrams of Wright (1934). Although they do not contain undirected models, their extension, regression graphs (Wermuth and Sadeghi, 2012; Wermuth, 2011), does contain both classes treated in this review, by allowing up to three edge types. The class of regression graphs allows representing additional relationships in classical sequences of multivariate regressions by means of a bi-directed edge, which could not otherwise be modelled using only acyclic directed or undirected graphs. These bi-directed edges represent interactions with latent variables. The recovery of latent variables is especially relevant in social studies, where the presence of these confounding variables may affect the prediction.
It has recently been shown that a Gaussian MVR chain graph is Markov equivalent to an acyclic directed Gaussian Markov model with latent variables when its bidirected part is chordal (Fox et al., 2015). Pure Gaussian bidirected graphs represent marginal independences among variables; therefore they impose zero constraints directly on the covariance Σ. Finally, the Type III chain graph has so far received little attention (Lauritzen and Sadeghi, 2018). All of the mentioned chain graph model classes are smooth, whereas among their discrete counterparts only the classes of LWF and multivariate regression chain graphs consist of smooth models (Drton, 2009).

The semi-directed cycle constraint on chain graphs can be relaxed, and new graphical model classes are obtained by forbidding only directed cycles. By doing so, we arrive at three different classes of what are called acyclic directed mixed graphs: the so-called original acyclic directed mixed graph (oADMG, Richardson, 2003), the alternative (aADMG, Peña, 2016a), and UDAGs (Peña, 2018), which relax MVR, AMP and LWF chain graphs, respectively. Each oADMG model contains a model obtained from a Bayesian network after marginalizing some of its nodes (the latent variables). However, other constraints may arise after marginalizing that cannot be represented in terms of conditional independence with this class, for example, the Verma constraints (Richardson and Spirtes, 2002, § 7.3.1) and inequality constraints (Drton et al., 2012). In order to deal with these, Richardson et al. (2012) introduced nested Markov models, which also allow for hyper-edges between more than two nodes; however, we are not aware of any Gaussian parametrization. The classes of both oADMGs and aADMGs are subsumed by the class of ADM graphs (Peña, 2018), consisting, naturally, of three edge types.
When parametrized with the Gaussian distribution, ADM graph models can be represented as recursive linear equations with two blocks of variables and possibly correlated errors (Koster, 1999; Spirtes, 1995; Peña, 2016a,b). Bidirected edges in these models represent latent confounding effects, whereas undirected edges account for dependence between the errors. Note that, although the classes of ADM graphs and regression graphs allow the same edge types, they are not equivalent, since the former contains AMP chain graph models while the latter does not.

There are other models allowing for up to three edge types besides the already mentioned ADM and regression graphs: anterial and chain mixed graphs (Sadeghi, 2016), ribbonless graphs (Sadeghi, 2013), MC graphs (Koster, 2002), summary graphs (Wermuth, 2011; Cox and Wermuth, 1996), ancestral graphs (Richardson and Spirtes, 2002), etc. These model classes share rich relationships, which have recently been discussed by Lauritzen and Sadeghi (2018). Ancestral graphs extend regression graphs by relaxing the cycle constraint, but they are not a maximal class; that is, if an edge is removed from the graph, we may remain in the same Markov model. Maximality is convenient because it is what allows one to define pairwise Markov properties, so that each absent edge implies a conditional independence. Fortunately, for an arbitrary ancestral graph we may always find a maximal one which is Markov equivalent to it; therefore authors often speak of the class of maximal ancestral graphs (MAGs). This class is closed under marginalization and conditioning, and every MAG can be obtained from an acyclic digraph after performing such operations on its nodes. Just as marginalization leads to latent confounders, conditioning is sometimes called selection bias in the literature on the social sciences.
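The recursive-equation representation mentioned above can be made concrete with a minimal sketch (the graph and coefficients below are our own invented example, not taken from the cited papers): directed edges enter a coefficient matrix B, a bidirected edge i ↔ j appears as a nonzero off-diagonal entry of the error covariance Ω, and the model X = BX + ε implies the covariance Σ = (I − B)⁻¹ Ω (I − B)⁻ᵗ.

```python
import numpy as np

# Directed part: X1 -> X2 -> X3 (acyclic, so I - B is invertible)
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.5, 0.0]])

# Error covariance: the off-diagonal 0.3 encodes a bidirected edge
# X1 <-> X3, i.e. latent confounding between the errors of X1 and X3.
Omega = np.array([[1.0, 0.0, 0.3],
                  [0.0, 1.0, 0.0],
                  [0.3, 0.0, 1.0]])

# Implied covariance of X = B X + e with e ~ N(0, Omega):
# Sigma = (I - B)^{-1} Omega (I - B)^{-t}
A = np.linalg.inv(np.eye(3) - B)
Sigma = A @ Omega @ A.T
```

Missing directed edges constrain B, while missing bidirected edges constrain Ω; both kinds of zeros propagate into the implied Σ through the formula above.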
The class of summary graphs, although in correspondence with ancestral graphs, is not easily parametrized. One of the main drawbacks is that they allow more than one edge type between the same pair of nodes, which means that in principle more than one parameter can be associated with a pair of variables. Furthermore, they are not maximal and thus cannot have a pairwise Markov property, which implies that fewer independences can be deduced from the model. Most of the other three-edge-type models mentioned share these drawbacks for defining a parametrization (Richardson and Spirtes, 2002; Sadeghi and Marchetti, 2012).

The proliferation of higher level Markov model classes has led Lauritzen and Sadeghi (2018) to propose a class of mixed graphs consisting of up to four edge types, in an attempt to unify most of them under a unique Markov property (see also Sadeghi and Lauritzen, 2014; Evans, 2018). Nowadays, a great deal of research focuses on characterizing basic foundational properties of these higher level models: for example, Markov equivalence, definition and equivalence of Markov properties, factorization properties, etc.

11. Relaxing the Gaussian assumption

In some real problems, the Gaussian assumption is too restrictive, and alternative models have been proposed to overcome this. Although these are outside the scope of this review, we survey here the main proposals to relax the Gaussian assumption.

As we have seen, Gaussian Bayesian networks are equivalent to a set of recursive regressions where the errors are Gaussian. In Shimizu et al. (2006), an analogous model is proposed, called LiNGAM, where the errors are assumed to be non-Gaussian. The work by Loh and Bühlmann (2014) further generalizes this by not making distributional assumptions on the errors. As an alternative, Peters et al. (2014) and Bühlmann et al.
(2014) maintain Gaussian errors, but the additive regression is now assumed to be nonlinear. Other families of continuous distributions that have been used for parametrizing Markov and Bayesian networks are nonparametric Gaussian copulas (Liu et al., 2009) and elliptical distributions (Vogel and Fried, 2011), both of which generalize the Gaussian distribution. Copula graphical models are usually referred to as 'nonparanormal' models, and model selection and estimation have been studied by Harris and Drton (2013), Xue and Zou (2012) and Liu et al. (2012), including high-dimensional scenarios. Estimation results for elliptical graphical models have been obtained by Vogel and Tyler (2014).

Another approach is to extend the model and allow for both discrete and continuous variables. In that case, a particular challenge arises for inference in Bayesian networks, where the usual operations may not allow a direct and efficient implementation as in the pure cases. The main source of this problem is the integration that appears when marginalizing continuous variables. To overcome this, the classical approach is to use the conditional Gaussian distribution (Olkin and Tate, 1961). It is characterized by a multinomial distribution on the discrete variables and a Gaussian distribution for the continuous variables when conditioned on the discrete ones. Therefore, it contains the pure multinomial and Gaussian models as particular cases. Markov properties of this distribution with respect to an undirected graph were defined by Lauritzen and Wermuth (1989). With respect to an acyclic digraph, a further assumption is that no discrete variable may have continuous parents, which leads to conditional linear Gaussian Bayesian networks (Lauritzen, 1992). Exact inference in these networks remains feasible thanks to the constraints imposed on the network topology.
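Returning to the nonparanormal models above, the core idea can be sketched in a few lines (our own toy example, not the cited authors' code): if each observed variable is a monotone transform of a latent Gaussian, mapping empirical ranks through the Gaussian quantile function recovers the latent correlation structure, which can then be fed to any Gaussian graphical model selection procedure.

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(0)

# Latent bivariate Gaussian with correlation 0.8, then monotone distortions
n, rho = 5000, 0.8
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
x = np.column_stack([z[:, 0] ** 3, np.exp(z[:, 1])])  # observed, non-Gaussian


def nonparanormal_scores(col):
    # Winsorized rank transform: rank -> uniform in (0, 1) -> Gaussian score
    u = (rankdata(col) - 0.5) / len(col)
    return norm.ppf(u)


s = np.column_stack([nonparanormal_scores(x[:, j]) for j in range(x.shape[1])])
rho_hat = np.corrcoef(s, rowvar=False)[0, 1]  # close to rho despite distortion
```

Because the transform depends only on ranks, it is invariant to the unknown monotone distortions, which is what makes the copula approach distribution-free in each margin.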
In order to avoid the structural constraints of conditional linear Gaussian Bayesian networks, nonparametric density estimation techniques have been proposed. Moral et al. (2001) approximated the joint density by mixtures of truncated exponentials. In this model, discrete nodes with continuous parents are allowed, while exact inference remains possible. A similar approach is that of Shenoy and West (2011), where mixtures of polynomials are used instead for approximating the joint density. These two models have been generalized by Langseth et al. (2012) as mixtures of truncated basis functions. However, there are limited results about maximum likelihood estimation and model selection for these models (Langseth et al., 2010, 2014; Varando et al., 2015).

12. Main application areas

Graphical or Markov models have been widely applied since their conception and continue to be an essential tool in many fields, since they are intuitive for visualizing the associations between the components in a system. We will first outline applications of Gaussian Markov and Gaussian Bayesian networks and then illustrate other areas where graphical models have played an important role.

Markov and Bayesian networks with Gaussian parametrization have been especially useful in the biomedical sciences. For example, Gaussian Bayesian networks have been used for extracting knowledge from fMRI studies (Mumford and Ramsey, 2014; Zhou et al., 2016), where nodes are identified with brain regions and arrows are interpreted as direct influences between the respective regions. Another example where both models have been applied is the modelling of gene regulatory networks, which are high-dimensional and complex by nature. In fact, the challenge posed by this problem has served as an impulse for methodological developments in both models.
A vast amount of literature can be found regarding the main computational aspects involved in this subject, as well as interpretability issues; see Lauritzen and Sheehan (2003), Friedman (2004), Markowetz and Spang (2007) and Ness et al. (2016) for reviews.

In the social sciences, Bayesian networks have been used since their conception; in fact, we could say that research in this application area helped to settle the foundations of graphical models (Kiiveri and Speed, 1982). In terms of interpretability, the directed arcs in Bayesian networks are usually given a causal interpretation (Pearl, 2000; Cox and Wermuth, 1996), since ultimately the main goal of social studies is to identify the causes of a resulting event of interest. In recent years, graphical models have found a natural area of application in social network analysis (Farasat et al., 2015), which includes problems such as influence analysis, privacy protection, web browsing, etc.

Some other traditional models from bioinformatics can also be seen as graphical models. These include phylogenetic trees, which model evolutionary relationships between different species or organisms, and pedigrees, which are diagrams showing the occurrence and variants of a gene from one generation of organisms to the next (Jordan, 2004). Apart from fMRI studies, Bayesian networks have also been applied in different subareas of neuroscience (Bielza and Larrañaga, 2014), such as morphological and electrophysiological studies. Finally, other, more technical application areas include information retrieval (de Campos et al., 2004), where relevant documents about some matter are collected from an available set of sources, and linguistics, with subfields such as speech recognition (Deng and Li, 2013) and natural language processing (Cambria and White, 2014).
Acknowledgements

This work has been supported by the Spanish Ministry of Science, Innovation and Universities through the predoctoral grant FPU15/03797 held by Irene Córdoba, and through the TIN2016-79684-P project.

References

Altomare, D., Consonni, G., La Rocca, L., 2013. Objective Bayesian search of Gaussian directed acyclic graphical models for ordered variables with non-local priors. Biometrics 69 (2), 478–487.
Anderson, T. W., 1973. Asymptotically efficient estimation of covariance matrices with linear structure. Ann. Stat. 1 (1), 135–141.
Anderson, T. W., 2003. An Introduction to Multivariate Statistical Analysis, 3rd Edition. John Wiley & Sons.
Andersson, S. A., Madigan, D., Perlman, M. D., 2001. Alternative Markov properties for chain graphs. Scand. J. Stat. 28 (1), 33–85.
Andersson, S. A., Perlman, M. D., 1998. Normal linear regression models with recursive graphical Markov structure. J. Multivar. Anal. 66 (2), 133–187.
Aragam, B., Amini, A. A., Zhou, Q., 2017. Learning directed acyclic graphs with penalized neighbourhood regression.
Aragam, B., Zhou, Q., 2015. Concave penalized estimation of sparse Gaussian Bayesian networks. J. Mach. Learn. Res. 16, 2273–2328.
Atay-Kayis, A., Massam, H., 2005. A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika 92 (2), 317–335.
Banerjee, O., El Ghaoui, L., d'Aspremont, A., 2008. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516.
Barndorff-Nielsen, O., 1978. Information and Exponential Families in Statistical Theory. John Wiley & Sons.
Ben-David, E., Li, T., Massam, H., Rajaratnam, B., 2016. High dimensional Bayesian inference for Gaussian directed acyclic graph models.
Ben-David, E., Rajaratnam, B., 2012. Positive definite completion problems for Bayesian networks. SIAM J. Matrix Anal. Appl. 33 (2), 617–638.
Besag, J., 1974. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B Stat. Methodol. 36 (2), 192–236.
Bielza, C., Larrañaga, P., 2014. Bayesian networks in neuroscience: A survey. Front. Comput. Neurosci. 8, 131.
Bühlmann, P., Peters, J., Ernest, J., 2014. CAM: Causal additive models, high-dimensional order search and penalized regression. Ann. Stat. 42 (6), 2526–2556.
Bühlmann, P., van de Geer, S., 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.
Cambria, E., White, B., 2014. Jumping NLP curves: A review of natural language processing research. IEEE Comput. Intell. Mag. 9 (2), 48–57.
Cao, X., Khare, K., Ghosh, M., 2019. Posterior graph selection and estimation consistency for high-dimensional Bayesian DAG models. Ann. Stat. 47 (1), 319–348.
Carvalho, C. M., Massam, H., West, M., 2007. Simulation of hyper-inverse Wishart distributions in graphical models. Biometrika 94 (3), 647–659.
Carvalho, C. M., Scott, J. G., 2009. Objective Bayesian model selection in Gaussian graphical models. Biometrika 96 (3), 497–512.
Castelo, R., Roverato, A., 2006. A robust procedure for Gaussian graphical model search from microarray data with p larger than n. J. Mach. Learn. Res. 6, 2621–2650.
Colombo, D., Maathuis, M. H., 2014. Order-independent constraint-based causal structure learning. J. Mach. Learn. Res. 15, 3921–3962.
Consonni, G., La Rocca, L., 2012. Objective Bayes factors for Gaussian directed acyclic graphical models. Scand. J. Stat. 39 (4), 743–756.
Cox, D. R., Wermuth, N., 1993. Linear dependencies represented by chain graphs. Stat. Sci. 8 (3), 204–218.
Cox, D. R., Wermuth, N., 1996. Multivariate dependencies: Models, analysis and interpretation. Chapman & Hall.
Daly, R., Shen, Q., Aitken, S., 2011. Learning Bayesian networks: Approaches and issues. Knowl. Eng. Rev. 26, 99–157.
Darroch, J. N., Lauritzen, S. L., Speed, T. P., 1980.
Markov fields and log-linear interaction models for contingency tables. Ann. Stat. 8 (3), 522–539.
Dawid, A. P., 1979. Conditional independence in statistical theory. J. R. Stat. Soc. Ser. B Stat. Methodol. 41 (1), 1–31.
Dawid, A. P., 1980. Conditional independence for statistical operations. Ann. Stat. 8 (3), 598–617.
Dawid, A. P., 2001. Separoids: A mathematical framework for conditional independence and irrelevance. Ann. Math. Artif. Intell. 32 (1), 335–372.
Dawid, A. P., Lauritzen, S. L., 1993. Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann. Stat. 21 (3), 1272–1317.
de Campos, L. M., Fernández-Luna, J. M., Huete, J. F., 2004. Bayesian networks and information retrieval: an introduction to the special issue. Inf. Process. Manag. 40 (5), 727–733.
de la Fuente, A., Bing, N., Hoeschele, I., Mendes, P., 2004. Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20 (18), 3565–3574.
Dempster, A. P., 1972. Covariance selection. Biometrics 28 (1), 157–175.
Deng, L., Li, X., 2013. Machine learning paradigms for speech recognition: An overview. IEEE Trans. Audio, Speech, Lang. Process. 21 (5), 1060–1089.
Drton, M., 2009. Discrete chain graph models. Bernoulli 15 (3), 736–753.
Drton, M., Fox, C., Käufl, A., 2012. Comments on: Sequences of regressions and their independencies. TEST 21 (2), 255–261.
Drton, M., Perlman, M. D., 2004. Model selection for Gaussian concentration graphs. Biometrika 91 (3), 591–602.
Drton, M., Perlman, M. D., 2007. Multiple testing and error control in Gaussian graphical model selection. Stat. Sci. 22 (3), 430–449.
Drton, M., Perlman, M. D., 2008. A SINful approach to Gaussian graphical model selection. J. Stat. Plan. Inference 138 (4), 1179–1200.
Eriksen, P. S., 1996. Tests in covariance selection models. Scand. J. Stat. 23 (3), 275–284.
Evans, R., 2018.
Markov properties for mixed graphical models. In: Handbook of Graphical Models. CRC Press, pp. 57–78.
Farasat, A., Nikolaev, A., Srihari, S. N., Blair, R. H., 2015. Probabilistic graphical models in modern social network analysis. Soc. Netw. Anal. Min. 5 (1), 62.
Fox, C. J., Käufl, A., Drton, M., 2015. On the causal interpretation of acyclic mixed graphs under multivariate normality. Linear Algebra Its Appl. 473 (Suppl. C), 93–113.
Friedman, J., Hastie, T., Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), 432–441.
Friedman, N., 2004. Inferring cellular networks using probabilistic graphical models. Science 303 (5659), 799–805.
Frydenberg, M., 1990. The chain graph Markov property. Scand. J. Stat. 17 (4), 333–353.
Frydenberg, M., Lauritzen, S., 1989. Decomposition of maximum likelihood in mixed graphical interaction models. Biometrika 76 (3), 539–555.
Geiger, D., Heckerman, D., 1994. Learning Gaussian networks. In: Proc. of the Tenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, pp. 235–243.
Geiger, D., Heckerman, D., 2002. Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. Ann. Stat. 30 (5), 1412–1440.
Geiger, D., Pearl, J., 1990. On the logic of causal models. In: Proc. of the Fourth Annual Conference on Uncertainty in Artificial Intelligence. AUAI Press, Corvallis, pp. 3–14.
Geiger, D., Pearl, J., 1993. Logical and algorithmic properties of conditional independence and graphical models. Ann. Stat. 21 (4), 2001–2021.
Gillispie, S. B., Perlman, M. D., 2002. The size distribution for Markov equivalence classes of acyclic digraph models. Artif. Intell. 141 (1–2), 137–155.
Giudici, P., 1996. Learning in graphical Gaussian models. Bayesian Stat. 5, 621–628.
Giudici, P., Green, P. J., 1999. Decomposable graphical Gaussian model determination.
Biometrika 86 (4), 785–801.
Grädel, E., Väänänen, J., 2013. Dependence and independence. Stud. Log. 101 (2), 399–410.
Grimmett, G. R., 1973. A theorem about random fields. Bull. Lond. Math. Soc. 5 (1), 81–84.
Grone, R., Johnson, C. R., Sá, E. M., Wolkowicz, H., 1984. Positive definite completions of partial hermitian matrices. Linear Algebra Its Appl. 58, 109–124.
Hammersley, J. M., Clifford, P., 1971. Markov fields on finite graphs and lattices, Unpublished manuscript.
Harris, N., Drton, M., 2013. PC algorithm for nonparanormal graphical models. J. Mach. Learn. Res. 14, 3365–3383.
He, Y., Jia, J., Yu, B., 2015. Counting and exploring sizes of Markov equivalence classes of directed acyclic graphs. J. Mach. Learn. Res. 16, 2589–2609.
Horn, R. A., Johnson, C. R., 2012. Matrix Analysis, 2nd Edition. Cambridge University Press.
Howard, R. A., Matheson, J. E., 2005. Influence diagrams. Decis. Anal. 2 (3), 127–143.
Ibáñez, A., Armañanzas, R., Bielza, C., Larrañaga, P., 2016. Genetic algorithms and Gaussian Bayesian networks to uncover the predictive core set of bibliometric indices. J. Assoc. Inf. Sci. Technol. 67 (7), 1703–1721.
Isham, V., 1981. An introduction to spatial point processes and Markov random fields. Int. Stat. Rev. 49 (1), 21–43.
Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., West, M., 2005. Experiments in stochastic computation for high-dimensional graphical models. Stat. Sci. 20 (4), 388–400.
Jordan, M. I., 2004. Graphical models. Stat. Sci. 19 (1), 140–155.
Kalisch, M., Bühlmann, P., 2007. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res. 8, 613–636.
Kiiveri, H., Speed, T., 1982. Structural analysis of multivariate data: A review. Sociol. Methodol. 13, 209–289.
Kindermann, R., Snell, J. L., 1980. Markov Random Fields and their Applications. American Mathematical Society.
Koster, J. T., 2002.
Marginalizing and conditioning in graphical models. Bernoulli 8 (6), 817–840.
Koster, J. T. A., 1999. On the validity of the Markov interpretation of path diagrams of Gaussian structural equations systems with correlated errors. Scand. J. Stat. 26 (3), 413–431.
Lam, C., Fan, J., 2009. Sparsistency and rates of convergence in large covariance matrix estimators. Ann. Stat. 37 (6B), 4254–4278.
Langseth, H., Nielsen, T., Pérez-Bernabé, I., Salmerón, A., 2014. Learning mixtures of truncated basis functions from data. Int. J. Approx. Reason. 55, 940–956.
Langseth, H., Nielsen, T., Rumí, R., Salmerón, A., 2010. Parameter estimation and model selection for mixtures of truncated exponentials. Int. J. Approx. Reason. 51, 485–498.
Langseth, H., Nielsen, T., Rumí, R., Salmerón, A., 2012. Mixtures of truncated basis functions. Int. J. Approx. Reason. 53, 212–227.
Lauritzen, S., Sadeghi, K., 2018. Unifying Markov properties for graphical models. Ann. Stat. 46 (5), 2251–2278.
Lauritzen, S. L., 1992. Propagation of probabilities, means, and variances in mixed graphical association models. J. Amer. Stat. Assoc. 87 (420), 1098–1108.
Lauritzen, S. L., 1996. Graphical Models. Oxford University Press.
Lauritzen, S. L., Dawid, A. P., Larsen, B. N., Leimer, H.-G., 1990. Independence properties of directed Markov fields. Networks 20 (5), 491–505.
Lauritzen, S. L., Sheehan, N. A., 2003. Graphical models for genetic analyses. Stat. Sci. 18 (4), 489–514.
Lauritzen, S. L., Wermuth, N., 1989. Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann. Stat. 17 (1), 31–57.
Letac, G., Massam, H., 2007. Wishart distributions for decomposable graphs. Ann. Stat. 35 (3), 1278–1323.
Lin, S., Uhler, C., Sturmfels, B., Bühlmann, P., 2014. Hypersurfaces and their singularities in partial correlation testing. Found. Comput. Math. 14 (5), 1079–1116.
Liu, H., Han, F., Yuan, M., Lafferty, J., Wasserman, L., 2012. High-dimensional semiparametric Gaussian copula graphical models. Ann. Stat. 40 (4), 2293–2326.
Liu, H., Lafferty, J., Wasserman, L., 2009. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res. 10, 2295–2328.
Liu, W., 2013. Gaussian graphical model estimation with false discovery rate control. Ann. Stat. 41 (6), 2948–2978.
Loh, P.-L., Bühlmann, P., 2014. High-dimensional learning of linear causal networks via inverse covariance estimation. J. Mach. Learn. Res. 15, 3065–3105.
Magwene, P. M., Kim, J., 2004. Estimating genomic coexpression networks using first-order conditional independence. Genome Biol. 5 (12), R100.
Markowetz, F., Spang, R., 2007. Inferring cellular networks – a review. BMC Bioinform. 8 (6), S5.
Massam, H., 2018. Bayesian inference in graphical Gaussian models. In: Handbook of Graphical Models. CRC Press, pp. 257–282.
Meek, C., 1995. Strong completeness and faithfulness in Bayesian networks. In: Proc. of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, pp. 411–418.
Meinshausen, N., 2008. A note on the lasso for Gaussian graphical model selection. Stat. Probab. Lett. 78 (7), 880–884.
Meinshausen, N., Bühlmann, P., 2006. High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34 (3), 1436–1462.
Meinshausen, N., Yu, B., 2009. Lasso-type recovery of sparse representations for high-dimensional data. Ann. Stat. 37 (1), 246–270.
Moral, S., Rumí, R., Salmerón, A., 2001. Mixtures of truncated exponentials in hybrid Bayesian networks. In: Symbolic and Quantitative Approaches to Reasoning with Uncertainty. Vol. 2143 of Lecture Notes in Artificial Intelligence. Springer, pp. 156–167.
Mumford, J. A., Ramsey, J. D., 2014. Bayesian networks for fMRI: A primer. NeuroImage 86, 573–582.
Ness, R.
O., Sachs, K., Vitek, O., 2016. From correlation to causality: Statistical approaches to learning regulatory relationships in large-scale biomolecular investigations. J. Proteom. Res. 15 (3), 683–690.
O'Hagan, A., 1995. Fractional Bayes factors for model comparison. J. R. Stat. Soc. Ser. B Stat. Methodol. 57 (1), 99–138.
Olkin, I., Tate, R., 1961. Multivariate correlation models with mixed discrete and continuous variables. Ann. Math. Stat. 32 (2), 448–465.
Pearl, J., 1985. Bayesian networks: A model of self-activated memory for evidential reasoning. Tech. Rep. R-43, University of California, Los Angeles.
Pearl, J., 1986. Fusion, propagation, and structuring in belief networks. Artif. Intell. 29 (3), 241–288.
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.
Pearl, J., 2000. Causality: Models, Reasoning and Inference. Cambridge University Press.
Pearl, J., Paz, A., 1987. Graphoids: A graph-based logic for reasoning about relevance relations. In: Advances in Artificial Intelligence. Vol. 2. Elsevier, pp. 357–363.
Peters, J., 2014. On the intersection property of conditional independence and its application to causal discovery. J. Causal Inference 3 (1), 97–108.
Peters, J., Mooij, J. M., Janzing, D., Schölkopf, B., 2014. Causal discovery with continuous additive noise models. J. Mach. Learn. Res. 15, 2009–2053.
Peña, J. M., 2016a. Alternative Markov and causal properties for acyclic directed mixed graphs. In: Proc. of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. AUAI Press, Arlington, pp. 577–586.
Peña, J. M., 2016b. Learning acyclic directed mixed graphs from observations and interventions. In: Proc. of the Eighth International Conference on Probabilistic Graphical Models. Vol. 52 of Proceedings of Machine Learning Research. PMLR, Lugano, pp. 392–402.
Peña, J. M., 2018. Unifying DAGs and UGs. In: Proc.
of the Ninth International Conference on Probabilistic Graphical Models. Vol. 72 of Proceedings of Machine Learning Research. PMLR, Prague, pp. 308–319.
Porteous, B. T., 1989. Stochastic inequalities relating a class of log-likelihood ratio statistics to their asymptotic χ² distribution. Ann. Stat. 17 (4), 1723–1734.
Radhakrishnan, A., Solus, L., Uhler, C., 2018. Counting Markov equivalence classes for DAG models on trees. Discret. Appl. Math. 244, 170–185.
Rajaratnam, B., 2012. Comment on: Sequences of regressions and their independences. TEST 21 (2), 268–273.
Rajaratnam, B., Massam, H., Carvalho, C. M., 2008. Flexible covariance estimation in graphical Gaussian models. Ann. Stat. 36 (6), 2818–2849.
Ravikumar, P., Wainwright, M. J., Raskutti, G., Yu, B., 2011. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat. 5, 935–980.
Richardson, T., 2003. Markov properties for acyclic directed mixed graphs. Scand. J. Stat. 30 (1), 145–157.
Richardson, T., Spirtes, P., 2002. Ancestral graph Markov models. Ann. Stat. 30 (4), 962–1030.
Richardson, T. S., Robins, J. M., Shpitser, I., 2012. Nested Markov properties for acyclic directed mixed graphs. In: Proc. of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence. AUAI Press, Arlington, pp. 13–13.
Robins, J. M., Scheines, R., Spirtes, P., Wasserman, L., 2003. Uniform consistency in causal inference. Biometrika 90 (3), 491–515.
Rothman, A. J., Bickel, P. J., Levina, E., Zhu, J., 2008. Sparse permutation invariant covariance estimation. Electron. J. Stat. 2, 494–515.
Roverato, A., 2000. Cholesky decomposition of a hyper inverse Wishart matrix. Biometrika 87 (1), 99–112.
Roverato, A., 2002. Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scand. J. Stat. 29 (3), 391–411.
Roverato, A., Whittaker, J., 1998.
The Isserlis matrix and its application to non-decomposable graphical Gaussian models. Biometrika 85 (3), 711–725.
Sadeghi, K., 2013. Stable mixed graphs. Bernoulli 19 (5B), 2330–2358.
Sadeghi, K., 2016. Marginalization and conditioning for LWF chain graphs. Ann. Stat. 44 (4), 1792–1816.
Sadeghi, K., Lauritzen, S., 2014. Markov properties for mixed graphs. Bernoulli 20 (2), 676–696.
Sadeghi, K., Marchetti, G. M., 2012. Graphical Markov Models with Mixed Graphs in R. The R J. 4 (2), 65–73.
Scutari, M., 2013. On the prior and posterior distributions used in graphical modelling. Bayesian Anal. 8 (3), 505–532.
Shenoy, P., West, J. C., 2011. Inference in hybrid Bayesian networks using mixtures of polynomials. Int. J. Approx. Reason. 52 (5), 641–657.
Shimizu, S., Hoyer, P. O., Hyvarinen, A., Kerminen, A., 2006. A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 7, 2003–2030.
Shojaie, A., Michailidis, G., 2010. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika 97 (3), 519–538.
Sidak, Z., 1967. Rectangular confidence regions for the means of multivariate normal distributions. J. Amer. Stat. Assoc. 62 (318), 626–633.
Sonntag, D., Peña, J. M., Gómez-Olmedo, M., 2015. Approximate counting of graphical models via MCMC revisited. Intern. J. Intell. Sys. 30 (3), 384–420.
Speed, T. P., 1979. A note on nearest-neighbour Gibbs and Markov probabilities. Sankhya A 41 (3/4), 184–197.
Spirtes, P., 1995. Directed cyclic graphical representations of feedback models. In: Proc. of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, pp. 491–498.
Spirtes, P., Glymour, C., Scheines, R., 2000. Causation, Prediction, and Search. MIT Press.
Spirtes, P., Richardson, T., Meek, C., 1997. The dimensionality of mixed ancestral graphs. Tech. Rep. CMU-PHIL-83, Carnegie Mellon University.
Steinsky, B., 2004.
Asymptotic behaviour of the number of labelled essential acyclic digraphs and labelled chain graphs. Graphs Combin. 20 (3), 399–411.
Studený, M., 2005. On Probabilistic Conditional Independence Structures. Springer London.
Studený, M., 2018. Conditional independence and basic Markov properties. In: Handbook of Graphical Models. CRC Press, pp. 21–56.
Sturmfels, B., Uhler, C., 2010. Multivariate Gaussians, semidefinite matrix completion, and convex algebraic geometry. Ann. Inst. Stat. Math. 62 (4), 603–638.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 (1), 267–288.
Uhler, C., 2012. Geometry of maximum likelihood estimation in Gaussian graphical models. Ann. Stat. 40 (1), 238–261.
Uhler, C., 2018. Gaussian graphical models. In: Handbook of Graphical Models. CRC Press, pp. 235–256.
Uhler, C., Lenkoski, A., Richards, D., 2018. Exact formulas for the normalizing constants of Wishart distributions for graphical models. Ann. Stat. 46 (1), 90–118.
Uhler, C., Raskutti, G., Bühlmann, P., Yu, B., 2013. Geometry of the faithfulness assumption in causal inference. Ann. Stat. 41 (2), 436–463.
van de Geer, S., Bühlmann, P., 2013. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Ann. Stat. 41 (2), 536–567.
van de Geer, S. A., Bühlmann, P., 2009. On the conditions used to prove oracle results for the lasso. Electron. J. Stat. 3, 1360–1392.
Varando, G., López-Cruz, P., Nielsen, T., Larrañaga, P., Bielza, C., 2015. Conditional density approximations with mixtures of polynomials. Intern. J. Intell. Sys. 30, 236–264.
Verma, T., Pearl, J., 1991. Equivalence and synthesis of causal models. In: Proc. of the Sixth Annual Conference on Uncertainty in Artificial Intelligence. AUAI Press, Corvallis, pp. 255–270.
Vogel, D., Fried, R., 2011. Elliptical graphical modelling. Biometrika 98 (4), 935–951.
Vogel, D., Tyler, D. E., 2014.
Robust estimators for nondecomp osable elliptical graphical models. Biometrik a 101 (4), 865. W ang, H., Carv alho, C. M., 2010. Sim ulation of h yp er-in verse Wishart distributions for non- decomposable graphs. Electron. J. Stat. 4, 1470–1475. W erhli, A. V., Grzegorczyk, M., Husmeier, D., 2006. Comparative ev aluation of reverse engineering gene regulatory netw orks with relev ance netw orks, graphical Gaussian mo dels and Bay esian net works. Bioinformatics 22 (20), 2523–2531. W ermuth, N., 1976a. Analogies b etw een multiplicativ e models in contingency tables and cov ariance selection. Biometrics 32 (1), 95–108. W ermuth, N., 1976b. Model searc h among m ultiplicative mo dels. Biometrics 32 (2), 253–263. W ermuth, N., 1980. Linear recursive equations, cov ariance selection, and path analysis. J. Am. Stat. Assoc. 75 (372), 963–972. W ermuth, N., 2011. Probability distributions with summary graph structure. Bernoulli 17 (3), 845–879. W ermuth, N., 2015. Graphical Marko v mo dels, unifying results and their interpretation. Wiley StatsRef: Stat. Ref. Online, 1–29. W ermuth, N., Lauritzen, S. L., 1983. Graphical and recursive mo dels for contingency tables. Biometrik a 70 (3), 537–552. W ermuth, N., Sadeghi, K., 2012. Sequences of regressions and their independences. TEST 21 (2), 215– 252. Wille, A., B ¨ uhlmann, P ., 2006. Lo w-order conditional indep endence graphs for inferring genetic netw orks. Stat. Appl. Genet. Mol. Biol. 5 (1), Article 1. W right, S., 1934. The metho d of path coefficients. Ann. Math. Stat. 5 (3), 161–215. Xue, L., Zou, H., 2012. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Ann. Stat. 40 (5), 2541–2571. Y u, G., Bien, J., 2017. Learning lo cal dep endence in ordered data. J. Mach. Learn. Res. 18, 1–60. 29 Y uan, M., Lin, Y., 2007a. Mo del selection and estimation in the Gaussian graphical model. Biometrik a 94 (1), 19–35. Y uan, M., Lin, Y., 2007b. On the non-negative garrotte estimator. J. R. Stat. 
Soc. Ser. B Stat. Methodol. 69 (2), 143–161. Y ule, G. U., 1907. On the theory of correlation for any num b er of v ariables, treated by a new system of notation. Pro c. R. Soc. A 79 (529), 182–193. Zhang, J., Spirtes, P ., 2003. Strong faithfulness and uniform consistency in causal inference. In: Proc. of the Nineteen th Conference on Uncertain ty in Artificial Intelligence. Morgan Kaufmann, San F rancisco, pp. 632–639. Zhao, P ., Y u, B., 2006. On mo del selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–2563. Zhou, L., W ang, L., Liu, L., Ogunbona, P ., Dinggang, S., 2016. Learning discriminative Bay esian net- works from high-dimensional contin uous neuroimaging data. IEEE T rans. Pattern Anal. Mach. In tell. 38 (11), 2269–2283. Zou, H., 2006. The adaptive lasso and its oracle prop erties. J. Amer. Stat. Asso c. 101 (476), 1418–1429. 30