Graphs for margins of Bayesian networks

Directed acyclic graph (DAG) models, also called Bayesian networks, impose conditional independence constraints on a multivariate probability distribution, and are widely used in probabilistic reasoning, machine learning and causal inference. If late…

Authors: Robin J. Evans

Robin J. Evans
August 24, 2015

Abstract

Directed acyclic graph (DAG) models, also called Bayesian networks, impose conditional independence constraints on a multivariate probability distribution, and are widely used in probabilistic reasoning, machine learning and causal inference. If latent variables are included in such a model, then the set of possible marginal distributions over the remaining (observed) variables is generally complex, and not represented by any DAG. Larger classes of mixed graphical models, which use multiple edge types, have been introduced to overcome this; however, these classes do not represent all the models which can arise as margins of DAGs. In this paper we show that this is because ordinary mixed graphs are fundamentally insufficiently rich to capture the variety of marginal models. We introduce a new class of hyper-graphs, called mDAGs, and a latent projection operation to obtain an mDAG from the margin of a DAG. We show that each distinct marginal of a DAG model is represented by at least one mDAG, and provide graphical results towards characterizing when two such marginal models are the same. Finally we show that mDAGs correctly capture the marginal structure of causally-interpreted DAGs under interventions on the observed variables.

1 Introduction

Directed acyclic graph (DAG) models, also known as Bayesian networks, are widely used in probabilistic reasoning, machine learning and causal inference (Bishop, 2007; Darwiche, 2009; Pearl, 2009). Their popularity stems from a relatively simple definition in terms of a Markov property, a modular structure which is computationally scalable, their nice statistical properties, and their intuitive causal interpretations.
DAG models are not closed under marginalization, in the sense that a margin of a joint distribution which obeys a DAG model will not generally be faithfully represented by any DAG. Indeed, although DAG models that include latent variables are widely used, they induce models over the observed variables that are extremely complicated, and not well understood.

Various authors have developed larger classes of graphical models to represent the result of marginalizing (and in some cases also conditioning) in Bayesian networks. In the context of causal models Pearl and Verma (Verma, 1991; Pearl and Verma, 1992; Pearl, 2009) introduced mixed graphs obtained by an operation called latent projection to represent the models induced by marginalizing. These have been developed into larger classes of graphical models such as summary graphs, MC-graphs, ancestral graphs and acyclic directed mixed graphs (ADMGs) which are closed under marginalization from the perspective of conditional independence constraints (Koster, 2002; Richardson and Spirtes, 2002; Richardson, 2003; Wermuth, 2011).

[Figure 1: An mDAG on the vertices $a, b, c, d, e, f$ with maximal non-trivial bidirected edges (facets) $\{a, c\}$, $\{c, d, e\}$ and $\{d, e, f\}$.]

As has long been known, however, these models do not fully capture the range of marginal constraints imposed by DAG models. In this paper we show that no class of ordinary graphs is rich enough to do so, regardless of how many types of edge are used. Instead we introduce the mDAG, a hyper-graph which extends the idea of an ADMG to have hyper bidirected edges; an example is given in Figure 1. Intuitively, each red hyper-edge represents an exogenous latent variable whose children are the vertices joined by the edge. We show that mDAGs are the natural graphical object to represent margins of DAG models.
They are rich enough to represent the variety of models that can be induced observationally, and to graphically represent the effect of interventions when the DAG is interpreted causally. In addition, if the class of possible interventions is suitably defined, then there is a one-to-one correspondence between causally interpreted mDAGs and the marginal models induced by causally interpreted DAGs. The graphical framework also provides a platform for studying the models themselves, which are complex objects (see, for example, Evans, 2012; Shpitser et al., 2014). We provide some graphical results for Markov equivalence in this context, i.e. criteria for when two marginal models are equal, though a complete characterization remains an open problem.

As we shall see, marginal DAG models are relatively complex and there is, as yet, no general parameterization or fitting algorithm available to use with them; in contrast, explicit parametric incorporation of latent variables makes fitting relatively straightforward. However the latter approach has some disadvantages: most obviously it requires additional assumptions about the nature of the latent variables that may be implausible or untestable; additionally, the resulting models are typically not statistically regular (Drton, 2009). In contexts where the hidden variables represent arbitrary confounders whose nature is unknown, as is common in epidemiological models, it may be preferable to use a marginal DAG model rather than an ordinary latent variable model. For these reasons marginal DAG models have attracted considerable interest, as the references in the previous paragraphs attest.
The remainder of the paper is organized as follows: in Section 2 we review directed acyclic graphs and their Markov properties; in Section 3 we consider latent variables, and discuss existing results in this area. Section 4 introduces mDAGs, and shows that they are rich enough to represent the class of models induced by margins of Bayesian networks, while Section 5 gives Markov properties for mDAGs. Section 6 considers Markov equivalence, and demonstrates that ordinary mixed graphical models cannot capture the full range of possible models. Section 7 extends the interpretation of these models to causal settings, and Section 8 contains a discussion including some open problems.

2 Directed Graphical Models

We begin with a review of definitions concerning directed acyclic graphs. We omit examples of many of these ideas because these are well known but see, for example, Richardson and Spirtes (2002) or Pearl (2009) for more detail.

Definition 2.1. A directed graph $\mathcal{D}$ is a pair $(V, E)$, where $V$ is a finite set of vertices and $E$ a collection of edges, which are ordered pairs of vertices. If $(v, w) \in E$ we write $v \to w$. Self-loops are not allowed: that is $(v, v) \notin E$ for any $v$. A graph is acyclic if it does not contain any sequences of edges of the form $v_1 \to \cdots \to v_k \to v_1$ with $k > 1$. We call such a graph a directed acyclic graph (DAG); all the directed graphs considered in this paper are acyclic.

A path from $v_0$ to $v_k$ is an alternating sequence of vertices and edges $\langle v_0, e_1, v_1, \ldots, e_k, v_k \rangle$, such that each edge $e_i$ is between the vertices $v_{i-1}$ and $v_i$; no repetition of vertices (or, therefore, of edges) is permitted. A path may contain zero edges: i.e. $\langle v_0 \rangle$ is a path from $v_0$ to itself. $v_0$ and $v_k$ are the endpoints of the path, and any other vertices are non-endpoints.
A path is directed from $v_0$ to $v_k$ if it is of the form $v_0 \to v_1 \to \cdots \to v_k$.

If $v \to w$ then $v$ is a parent of $w$, and $w$ a child of $v$. The set of parents of $w$ is denoted by $\operatorname{pa}_{\mathcal{D}}(w)$, and the set of children of $v$ by $\operatorname{ch}_{\mathcal{D}}(v)$. If there is a directed path from $v$ to $w$ (including the case $v = w$), we say that $v$ is an ancestor of $w$ (see footnote 1). The set of ancestors of $w$ is denoted by $\operatorname{an}_{\mathcal{D}}(w)$. We apply these definitions disjunctively to sets of vertices so that

$$\operatorname{pa}_{\mathcal{D}}(A) = \bigcup_{a \in A} \operatorname{pa}_{\mathcal{D}}(a), \qquad \operatorname{an}_{\mathcal{D}}(A) = \bigcup_{a \in A} \operatorname{an}_{\mathcal{D}}(a).$$

A set is called ancestral if it contains all its own ancestors: $A = \operatorname{an}_{\mathcal{D}}(A)$.

Footnote 1: Note that $w$ is always an ancestor of itself, which differs from the convention used by some authors (e.g. Lauritzen, 1996).

Given DAGs $\mathcal{D}(V, E)$ and $\mathcal{D}'(V', E')$, we say that $\mathcal{D}'$ is a subgraph of $\mathcal{D}$, and write $\mathcal{D}' \subseteq \mathcal{D}$, if $V' \subseteq V$ and $E' \subseteq E$. The induced subgraph of $\mathcal{D}$ over $A \subseteq V$ is the DAG $\mathcal{D}_A$ with vertices $A$ and edges $E_A = \{(v, w) \in E : v, w \in A\}$; that is, those edges with both endpoints in $A$.

A graphical model arises when a graph is identified with structure on a multivariate probability distribution. With each vertex $v$ we associate a random variable $X_v$ taking values in some set $\mathfrak{X}_v$; the joint distribution is over the product space $\mathfrak{X}_V = \times_{v \in V} \mathfrak{X}_v$. In DAGs the structure takes the form of each variable $X_v$ 'depending' only upon the random variables $X_{\operatorname{pa}(v)}$ corresponding to its immediate parents in the graph.

Unless explicitly stated otherwise we make no assumption about the state-space of each of the random variables $X_v$, save that we work with Lebesgue-Rokhlin probability spaces. Hence $X_v$ could be discrete, one-dimensional real, vector-valued, or a countably generated process such as a Brownian motion (see Rokhlin, 1952, Section 2).

Definition 2.2 (Structural Equation Property).
Let $\mathcal{D}$ be a DAG with vertices $V$, and $\mathfrak{X}_V$ a Cartesian product space. We say that a joint distribution $P$ over $\mathfrak{X}_V$ satisfies the structural equation property (SEP) for $\mathcal{D}$ if for some independent random variables $E_v$ (the error variables) taking values in $\mathcal{E}_v$, and measurable functions $f_v : \mathfrak{X}_{\operatorname{pa}(v)} \times \mathcal{E}_v \to \mathfrak{X}_v$, recursively setting

$$X_v = f_v(X_{\operatorname{pa}(v)}, E_v), \qquad v \in V$$

gives $X_V$ the joint distribution $P$. Equivalently, each $X_v$ is $\sigma(X_{\operatorname{pa}(v)}, E_v)$-measurable, where $\sigma(Y)$ denotes the $\sigma$-algebra generated by the random variable $Y$. We denote the collection of such distributions (the structural equation model for $\mathcal{D}$) by $\mathcal{M}_{se}(\mathcal{D})$.

Remark 2.3. The fact that we can use this recursive definition follows from the fact that the graph is acyclic.

Although in principle the error variables have arbitrary state-space, it follows from the discussion in Chentsov (1982, Section 2.11) that there is no loss of generality if they are assumed to be uniformly distributed on $(0, 1)$.

Note that the structural equation model for $\mathcal{D}$ does not require that a joint density for $X_V$ exists, and in particular allows for degenerate relationships such as functional dependence between two variables. If a joint density with respect to a product measure does exist, then the model is equivalent to that defined by requiring the usual factorization of the joint density (Pearl, 2009).

Remark 2.4. The potential outcomes view of causal inference (Rubin, 1974) considers the random function $f_v(\cdot, E_v) : \mathfrak{X}_{\operatorname{pa}(v)} \to \mathfrak{X}_v$, generally denoted by $X_v(\cdot) = f_v(\cdot, E_v)$, as the main unit of interest. Under our formulation this is almost surely measurable, and we can identify the pair $(f_v, E_v)$ with $X_v(\cdot)$.
In general, some care is needed when defining random functions: one might naïvely choose to set, for example, $X_v(x_{\operatorname{pa}(v)}) \sim N(0, 1)$ independently for each $x_{\operatorname{pa}(v)} \in \mathfrak{X}_{\operatorname{pa}(v)}$; however if the indexing set $\mathfrak{X}_{\operatorname{pa}(v)}$ is continuous, then the function $X_v(\cdot)$ will almost surely not be Lebesgue measurable, and therefore $X_v(X_{\operatorname{pa}(v)})$ is not a random variable.

The structural equation model implies that each random variable is a measurable function of its parents in the graph; it is therefore clear that, conditional upon its parents, each variable is independent of the other variables already defined. Pearl (1985) introduced 'd-separation' as a method for interrogating Bayesian networks about their conditional independence implications. The resulting Markov property is equivalent to the structural equation property, but it is often easier to work with in practice.

Definition 2.5. Let $\pi$ be a path from $v$ to $w$, and let $a$ be a non-endpoint on $\pi$. We say $a$ is a collider on the path if the two edges in $\pi$ which contain $a$ are both oriented towards it: i.e. $\to a \leftarrow$. Otherwise (i.e. if $\to a \to$; $\leftarrow a \to$; or $\leftarrow a \leftarrow$) we say $a$ is a non-collider.

Definition 2.6 (d-separation). Let $\pi$ be a path from $a$ to $b$ in a DAG $\mathcal{D}$; we say that $\pi$ is blocked by a (possibly empty) set $C \subseteq V \setminus \{a, b\}$ if either (i) there is a non-collider on $\pi$ which is also in $C$, or (ii) there is a collider on the path which is not contained in $\operatorname{an}_{\mathcal{D}}(C)$.

Sets $A$ and $B$ are said to be d-separated given $C$ if all paths from any $a \in A$ to any $b \in B$ are blocked by $C$.

Definition 2.7 (Global Markov Property). Let $\mathcal{D}$ be a DAG and $X_V$ random variables under a joint probability measure $P$. We say that $P$ obeys the global Markov property for $\mathcal{D}$ if

$$X_A \perp\!\!\!\perp X_B \mid X_C \quad [P]$$

whenever $A$ and $B$ are d-separated by $C$ in $\mathcal{D}$.
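As an illustration of Definition 2.6, d-separation can be checked mechanically. The sketch below does not enumerate paths; instead it uses the equivalent moralization criterion (Lauritzen et al., 1990): $A$ and $B$ are d-separated given $C$ exactly when $C$ separates them in the moral graph of the subgraph induced by $\operatorname{an}(A \cup B \cup C)$. The function names and the parent-dictionary encoding of the DAG are assumptions made for this example, and $A$, $B$, $C$ are assumed to be disjoint.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return an(nodes): all vertices with a directed path into `nodes`,
    including the vertices of `nodes` themselves (the convention of Section 2)."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(parents, A, B, C):
    """Check d-separation of A and B given C via the moral ancestral graph.

    `parents` maps each vertex to an iterable of its parents (hypothetical encoding).
    """
    A, B, C = set(A), set(B), set(C)
    keep = ancestors(parents, A | B | C)
    # Build the moral graph on the ancestral set: drop directions, marry co-parents.
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, ()))
        for p in pa:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p, q in combinations(pa, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # Search from A in the moral graph, never entering C.
    seen, stack = set(A), list(A)
    while stack:
        v = stack.pop()
        if v in B:
            return False
        for w in nbrs[v] - C - seen:
            seen.add(w)
            stack.append(w)
    return True


# Toy collider 4 -> 3 <- 5: the parents are separated marginally but not given 3.
collider = {3: [4, 5]}
print(d_separated(collider, {4}, {5}, set()))
print(d_separated(collider, {4}, {5}, {3}))
```

Working on the moral graph keeps the check linear in the size of the ancestral subgraph, whereas a literal implementation of Definition 2.6 would have to examine every path.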
Denote the collection of probability measures that satisfy the global Markov property by $\mathcal{M}_g(\mathcal{D})$.

In fact $\mathcal{M}_g(\mathcal{D}) = \mathcal{M}_{se}(\mathcal{D})$, so the structural equation property and the global Markov property are equivalent (Lauritzen et al., 1990). We use $\mathcal{M}(\mathcal{D})$ to denote these equivalent models.

3 Latent Variables

In a great many practical statistical applications it is necessary to include unmeasured random variables in a model to correctly capture the dependence structure among observed variables. Consider a DAG $\mathcal{D}$ with vertices $V \,\dot\cup\, U$, and suppose that $(X_V, X_U) \sim P \in \mathcal{M}(\mathcal{D})$ (here and throughout $\dot\cup$ represents a union of disjoint sets). What restrictions does this place on the marginal distribution of $X_V$ under $P$? In this context we call $V$ the observed vertices, and $X_V$ the observed variables; similarly $U$ (respectively $X_U$) are the unobserved or latent vertices (variables).

Definition 3.1. Let $\mathcal{D}$ be a DAG with vertices $V \,\dot\cup\, U$, and $\mathfrak{X}_V$ a state-space for $V$. Define the marginal DAG model $\mathcal{M}(\mathcal{D}, V)$ by the collection of probability distributions $P$ over $\mathfrak{X}_V$ such that there exist

(i) some state-space $\mathfrak{X}_U$ for $X_U$; and
(ii) a probability measure $Q \in \mathcal{M}(\mathcal{D})$ over $\mathfrak{X}_V \times \mathfrak{X}_U$;

and $P$ is the marginal distribution of $Q$ over $\mathfrak{X}_V$.

In other words, we need to construct $(X_U, X_V)$ with joint distribution $Q \in \mathcal{M}(\mathcal{D})$ such that $X_V \sim P$. Trivially, if $U = \emptyset$ then everything is observed and $\mathcal{M}(\mathcal{D}, V) = \mathcal{M}(\mathcal{D})$. The problem of interest is to characterize the set $\mathcal{M}(\mathcal{D}, V)$ in general.

Remark 3.2. Note that we allow the state-space of the latent variables to be arbitrary in principle (though see Remark 2.3) and the model is non-parametric. Typical latent variable models either assume a fixed finite number of levels for the latents, or invoke some other parametric structure such as Gaussianity.
Such models are useful in many contexts, but have various disadvantages if the aim is to remain agnostic as to the precise nature of the unobserved variables. In general any latent variable model will be a sub-model of the marginal DAG model, and may impose additional constraints on the observed joint distribution (see, for example, Allman et al., 2013). This is clearly undesirable if it is simply an artefact of an arbitrary and untested parametric structure applied to unmeasured variables. In addition, latent variable models are often not regular and may have poor statistical properties, such as non-standard asymptotics (Drton, 2009). The regularity of marginal DAG models has not been established in general, but is known in some special cases (Evans, 2015).

The following proposition shows that taking margins with respect to ancestral sets preserves the structure of the original graph, representing an important special case. The result is well known, see for example Richardson and Spirtes (2002).

Proposition 3.3. Let $\mathcal{D}$ and $\mathcal{D}'$ be DAGs with the same vertex set $V$.

(a) If $A \subseteq V$ is an ancestral set in $\mathcal{D}$, then $\mathcal{M}(\mathcal{D}, A) = \mathcal{M}(\mathcal{D}_A)$.
(b) If $\mathcal{D}' \subseteq \mathcal{D}$, then $\mathcal{M}(\mathcal{D}') \subseteq \mathcal{M}(\mathcal{D})$.

Proof. These both follow directly from the definition of the structural equation property, since each variable depends only upon its parents. For the first claim it is clear from the recursive form of the SEP that the restrictions on $X_A$ are identical for $\mathcal{D}$ and $\mathcal{D}_A$ if $A$ is ancestral. For the second claim, note that since $\operatorname{pa}_{\mathcal{D}'}(w) \subseteq \operatorname{pa}_{\mathcal{D}}(w)$, any $\sigma(X_{\operatorname{pa}_{\mathcal{D}'}(w)}, E_w)$-measurable random variable must also be $\sigma(X_{\operatorname{pa}_{\mathcal{D}}(w)}, E_w)$-measurable.

[Figure 2: A DAG $\mathcal{K}$ with hidden vertices, on the vertices 1 to 5.]

[Figure 3: A directed acyclic graph on five vertices.]

Example 3.4. Consider the DAG $\mathcal{K}$ shown in Figure 2, which contains five vertices.
We claim that the model defined by the margin of this graph over the vertices $\{1, 2, 3\}$ is precisely those distributions for which $X_1 \perp\!\!\!\perp X_2$. To see this, first note that from the global Markov property for $\mathcal{K}$, any distribution in $\mathcal{M}(\mathcal{K}, \{1, 2, 3\})$ must satisfy $X_1 \perp\!\!\!\perp X_2$.

Conversely, suppose that $P$ is a distribution on $(X_1, X_2, X_3)$ such that $X_1 \perp\!\!\!\perp X_2$; now let $(X_4, X_5, X_3) \sim P$ so that $X_4 \perp\!\!\!\perp X_5$; by Proposition 3.3(a) and the global Markov property the distribution of $(X_3, X_4, X_5)$ satisfies the Markov property for the ancestral subgraph $4 \to 3 \leftarrow 5$. Setting $X_1 = X_4$ and $X_2 = X_5$ is consistent with the structural equation property for $\mathcal{K}$, so it follows that the joint distribution of $(X_1, \ldots, X_5)$ is contained in $\mathcal{M}(\mathcal{K})$, and that $(X_1, X_2, X_3) \sim P$. Hence $P \in \mathcal{M}(\mathcal{K}, \{1, 2, 3\})$.

Even in small problems, explicitly characterizing the margin of a DAG model can be quite tricky, as the following example shows.

Example 3.5. Consider the DAG $\mathcal{D}$ in Figure 3, and the marginal model $\mathcal{M}(\mathcal{D}, \{1, 2, 3, 4\})$. By applying the global Markov property to $\mathcal{D}$, one can see that any joint distribution satisfies $X_1 \perp\!\!\!\perp X_3 \mid X_2$, so this also holds for any marginal distribution. It was also shown by Robins (1986) that any such distribution with a positive probability density must also satisfy a non-parametric constraint that the quantity

$$q(x_3, x_4) \equiv \int p_2(x_2 \mid x_1) \cdot p_4(x_4 \mid x_1, x_2, x_3) \, dx_2 \tag{1}$$

is independent of $x_1$ (here $p_2$ and $p_4$ represent the relevant conditional densities). This does not correspond to an ordinary conditional independence, and is known as a Verma constraint after Verma and Pearl (1990) who introduced it to the computer science literature.
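The constraint (1) can be checked numerically in the discrete case, where the integral becomes a sum over $x_2$. The sketch below rests on an assumption: Figure 3 is not reproduced in this text, so the code takes the graph to be $1 \to 2 \to 3 \to 4$ with the latent vertex 5 pointing into 2 and 4, a choice that is consistent with the independence $X_1 \perp\!\!\!\perp X_3 \mid X_2$ stated in Example 3.5. The conditional tables are random, and all names (`random_dist`, `marg`, `q`, `max_gap`) are hypothetical.

```python
import itertools
import random

random.seed(0)
K = 2                      # number of levels per variable (binary here)
LEVELS = range(K)


def random_dist():
    """A random strictly positive probability vector of length K."""
    w = [random.random() + 0.1 for _ in LEVELS]
    s = sum(w)
    return [x / s for x in w]


# Conditional tables for the ASSUMED DAG 1 -> 2 -> 3 -> 4 with 5 -> 2 and 5 -> 4.
p5 = random_dist()                                                 # p(x5), latent
p1 = random_dist()                                                 # p(x1)
p2 = {(x1, x5): random_dist() for x1 in LEVELS for x5 in LEVELS}   # p(x2 | x1, x5)
p3 = {x2: random_dist() for x2 in LEVELS}                          # p(x3 | x2)
p4 = {(x3, x5): random_dist() for x3 in LEVELS for x5 in LEVELS}   # p(x4 | x3, x5)

# Observed margin p(x1, x2, x3, x4), obtained by summing out the latent x5.
joint = {
    (x1, x2, x3, x4): sum(
        p5[x5] * p1[x1] * p2[x1, x5][x2] * p3[x2][x3] * p4[x3, x5][x4]
        for x5 in LEVELS
    )
    for x1, x2, x3, x4 in itertools.product(LEVELS, repeat=4)
}


def marg(fixed):
    """Sum the observed joint over all cells agreeing with `fixed` (position -> value)."""
    return sum(p for cell, p in joint.items()
               if all(cell[i] == v for i, v in fixed.items()))


def q(x1, x3, x4):
    """Discrete version of (1): sum over x2 of p(x2 | x1) * p(x4 | x1, x2, x3)."""
    total = 0.0
    for x2 in LEVELS:
        p2_given_1 = marg({0: x1, 1: x2}) / marg({0: x1})
        p4_given_123 = joint[x1, x2, x3, x4] / marg({0: x1, 1: x2, 2: x3})
        total += p2_given_1 * p4_given_123
    return total


# Largest change in q as x1 varies; it should vanish up to rounding error.
max_gap = max(abs(q(x1, x3, x4) - q(0, x3, x4))
              for x1, x3, x4 in itertools.product(LEVELS, repeat=3))
print(max_gap)
```

Under the assumed graph, summing out $x_2$ against $p_2(x_2 \mid x_1)$ removes the dependence on $x_1$ that enters through the latent variable, which is why $q$ depends only on $(x_3, x_4)$.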
3.1 Existing Results

Margins of DAG models are of considerable interest because of their relationship to causal models under confounding, and consequently have been well studied. Restricting to implications of d-separation applied to the observed variables leads to a pure conditional independence model; this is a super-model of the marginal DAG model (so for Example 3.5 we would just find $X_1 \perp\!\!\!\perp X_3 \mid X_2$, for instance). This class, which we refer to as ordinary Markov models, was the subject of the work by Richardson (2003) and Evans and Richardson (2014) (see also Richardson and Spirtes, 2002).

Constraints of the kind given in Example 3.5 can be generalized via the algorithm of Tian and Pearl (2002), and when used to augment the ordinary Markov model yield nested Markov models (Shpitser et al., 2014); these models are defined in Section 5. For discrete variables both ordinary and nested Markov models are curved exponential families, and can be parameterized and fitted using the methods of Evans and Richardson (2010, 2014); see also Shpitser et al. (2013). Evans (2015) shows that, up to inequality constraints, nested models are the same as marginal DAG models when the observed variables are discrete (see footnote 2): so, for example, the model in Example 3.5 has no equality constraints beyond the conditional independence and (1).

In addition to conditional independences and Verma constraints, margins also exhibit inequality constraints. These were first identified by Bell (1964), and the earliest example in the context of graphical models was the instrumental inequality of Pearl (1995). Evans (2012) extended Pearl's work to general DAG models and gave a graphical criterion similar to d-separation for detecting inequality constraints. Further inequalities are given in Fritz (2012).
Bonet (2001) showed that a full derivation of inequalities in these models is likely to be very complicated in general. An alternative approach using information theory, also for discrete variables, is given by Chaves et al. (2014).

A related problem to the one we consider here arises when observed and latent variables are assumed to be jointly Gaussian. Again one can define an 'ordinary model' using conditional independence constraints, which is larger than the marginal model but can be smoothly parameterized using the results in Richardson and Spirtes (2002). However margins of these models also induce Verma constraints and inequalities, as well as more exotic constraints (see Section 8.3.1 of Richardson and Spirtes, 2002); an overview is given in Drton et al. (2012). Fox et al. (2014) characterize these models in a fairly large class of graphs, though the general case remains an open problem.

Footnote 2: In algebraic language, the marginal and nested models have the same Zariski closure.

3.2 Reduction

It might seem that to characterize general models of the form $\mathcal{M}(\mathcal{D}, V)$ we will have to consider an infinite collection of models with arbitrarily many latent variables, making the problem extremely hard. However the three results in this subsection show that without any loss of generality we can assume latent variables to be exogenous (that is, they have no parents), and that for a fixed number of observed variables, the number of latent variables can be limited to a finite value. This is in the spirit of the latent projection operation used in Pearl (2009).

[Figure 4: (a) A DAG, $\mathcal{D}$, and (b) the exogenized version $r(\mathcal{D}, u)$. The two DAGs induce the same marginal model over the vertices $\{l_1, l_2, k_1, k_2, k_3\}$.]

Definition 3.6. Let $\mathcal{D}$ be a DAG containing a vertex $u$.
Define the exogenized DAG $r(\mathcal{D}, u)$ as follows: take the vertices and edges of $\mathcal{D}$, and then (i) add an edge $l \to k$ from every $l \in \operatorname{pa}_{\mathcal{D}}(u)$ to every $k \in \operatorname{ch}_{\mathcal{D}}(u)$ (if necessary), and (ii) delete any edge $l \to u$ for $l \in \operatorname{pa}_{\mathcal{D}}(u)$. All other edges are as in $\mathcal{D}$.

In other words, we join all parents of $u$ to all children of $u$ with directed edges, and then remove edges between $u$ and its parents; the process is most easily understood visually: see the example in Figure 4. Note that if $u$ has no parents in $\mathcal{D}$, then $r(\mathcal{D}, u) = \mathcal{D}$.

[Figure 5: Two DAGs whose marginal models over the vertices $\{v_1, v_2, v_3\}$ are the same.]

Lemma 3.7. Let $\mathcal{D}$ be a DAG with vertices $V \,\dot\cup\, \{u\}$, and $\tilde{\mathcal{D}} \equiv r(\mathcal{D}, u)$. Then $\mathcal{M}(\mathcal{D}, V) = \mathcal{M}(\tilde{\mathcal{D}}, V)$; i.e. the marginal models induced by the two graphs over $V$ are the same.

Proof. If $u$ has no parents in $\mathcal{D}$ then the result is trivial, since $\mathcal{D} = \tilde{\mathcal{D}}$. Otherwise let $L = \operatorname{pa}_{\mathcal{D}}(u)$ and $K = \operatorname{ch}_{\mathcal{D}}(u)$. Suppose $P \in \mathcal{M}(\mathcal{D}, V)$, so one can construct $(X_u, X_V) \sim Q \in \mathcal{M}(\mathcal{D})$ such that $X_V \sim P$. Let $Q$ be generated using the SEP by independent error variables $(E_v : v \in V \cup \{u\})$, so that each $X_v$ is $\sigma(X_{\operatorname{pa}_{\mathcal{D}}(v)}, E_v)$-measurable.

Now let $\tilde{X}_u = E_u$, and all other $X_v$ remain unchanged, so that $\tilde{X}_u$ is $\sigma(E_u)$-measurable. The only other variables whose parent sets are different in $\tilde{\mathcal{D}}$ are those in $K$, so we need only show that $X_k$ is $\sigma(\tilde{X}_u, X_L, X_{\operatorname{pa}_{\mathcal{D}}(k)}, E_k)$-measurable for $k \in K$. Since $X_u$ is $\sigma(X_L, E_u) = \sigma(X_L, \tilde{X}_u)$-measurable, it follows that

$$\sigma(X_u, X_{\operatorname{pa}_{\mathcal{D}}(k)}, E_k) \subseteq \sigma(\tilde{X}_u, X_L, X_{\operatorname{pa}_{\mathcal{D}}(k)}, E_k).$$

$X_k$ is $\sigma(X_u, X_{\operatorname{pa}_{\mathcal{D}}(k)}, E_k)$-measurable by the definition of $\mathcal{M}(\mathcal{D})$, so it is also $\sigma(\tilde{X}_u, X_L, X_{\operatorname{pa}_{\mathcal{D}}(k)}, E_k)$-measurable. The joint distribution $\tilde{Q}$ of $(\tilde{X}_u, X_V)$ is therefore contained in $\mathcal{M}(\tilde{\mathcal{D}})$, and so $P \in \mathcal{M}(\tilde{\mathcal{D}}, V)$.
Conversely, if $(\tilde{X}_u, X_V) \sim \tilde{Q} \in \mathcal{M}(\tilde{\mathcal{D}})$, let $E_u = \tilde{X}_u$, and $X_u = (X_L, \tilde{X}_u)$; then $E_u$ is independent of other error variables, and $X_u$ is $\sigma(X_L, E_u)$-measurable. For $k \in K$,

$$\sigma(X_u, X_{\operatorname{pa}_{\mathcal{D}}(k)}, E_k) \supseteq \sigma(\tilde{X}_u, X_L, X_{\operatorname{pa}_{\mathcal{D}}(k)}, E_k),$$

so $(X_u, X_V) \sim Q \in \mathcal{M}(\mathcal{D})$.

As a consequence of this lemma it is sufficient to consider models in which the unobserved vertices are exogenous. Our second result shows that only a finite number of exogenous latent variables are necessary.

Lemma 3.8. Let $\mathcal{D}$ be a DAG with vertices $V \,\dot\cup\, \{u, w\}$ (where $u \neq w$), such that $\operatorname{pa}_{\mathcal{D}}(w) = \operatorname{pa}_{\mathcal{D}}(u) = \emptyset$ and $\operatorname{ch}_{\mathcal{D}}(w) \subseteq \operatorname{ch}_{\mathcal{D}}(u)$. Then $\mathcal{M}(\mathcal{D}, V) = \mathcal{M}(\mathcal{D}_{-w}, V)$, where $\mathcal{D}_{-w}$ is the induced subgraph of $\mathcal{D}$ after removing $w$.

Proof. By Proposition 3.3(b), $\mathcal{M}(\mathcal{D}_{-w}, V) \subseteq \mathcal{M}(\mathcal{D}, V)$. Take $P \in \mathcal{M}(\mathcal{D}, V)$, so that there exists $(X_V, X_u, X_w) \sim Q \in \mathcal{M}(\mathcal{D})$ whose $V$-margin is $P$. Letting $\tilde{X}_u = (X_u, X_w)$ note that $(X_V, \tilde{X}_u)$ satisfies the SEP for $\mathcal{D}_{-w}$. Hence $P \in \mathcal{M}(\mathcal{D}_{-w}, V)$.

This result, combined with Lemma 3.7, shows that for a fixed set of observed variables $V$, there are only finitely many distinct models of the form $\mathcal{M}(\mathcal{D}, V)$. In particular, all unobserved vertices may be assumed to be exogenous, and their child sets to correspond to maximal sets of observed vertices. An example of two DAGs shown to have equal marginal models by this result is given in Figure 5.

We can make one final simplification, again without any loss of generality.

Lemma 3.9. Let $\mathcal{D}$ be a DAG with vertices $V \,\dot\cup\, \{u\}$, such that $u$ has no parents and at most one child. Then $\mathcal{M}(\mathcal{D}, V) = \mathcal{M}(\mathcal{D}_{-u}, V)$.

Proof. $\mathcal{M}(\mathcal{D}_{-u}, V) \subseteq \mathcal{M}(\mathcal{D}, V)$, so suppose $P \in \mathcal{M}(\mathcal{D}, V)$. For the unique $v \in \operatorname{ch}_{\mathcal{D}}(u)$ (if indeed there is any such $v$), let $\tilde{E}_v = (E_v, E_u)$, so $\tilde{E}_v \perp\!\!\!\perp (E_w : w \in V)$, and $X_v$ is $\sigma(X_{\operatorname{pa}(v)}, E_v) = \sigma(X_{\operatorname{pa}(v) \setminus u}, \tilde{E}_v)$-measurable.
Then $P \in \mathcal{M}(\mathcal{D}_{-u}, V)$.

The combination of these results means that we can restrict our attention to models in which the latent variables are exogenous, and have non-nested sets of children of size at least two. A similar conclusion is reached by Pearl and Verma (1992), but the authors also claim that each latent variable can be assumed to have exactly two children. In the context of models of conditional independence this is correct, but in general it is too restrictive, as we show in Section 6.1.

4 mDAGs

The results of the previous section suggest a way to construct a new class of graph, rich enough to represent the distinct models that can arise as the margins of DAGs. First we define the following abstract object, which will be used to represent latent structure.

Definition 4.1. A simplicial complex (or abstract simplicial complex), $\mathcal{B}$, over a finite set $V$ is a collection of non-empty subsets of $V$ such that

(i) $\{v\} \in \mathcal{B}$ for all $v \in V$;
(ii) for non-empty sets $A \subseteq B \subseteq V$ we have $B \in \mathcal{B} \implies A \in \mathcal{B}$.

The inclusion maximal elements of $\mathcal{B}$ are called facets. Any simplicial complex $\mathcal{B}$ can be characterized by its non-trivial facets (i.e. those of size at least 2), denoted by $\bar{\mathcal{B}}$.

Definition 4.2. An mDAG (marginalized DAG) $\mathcal{G}$ is a triple $(V, E, \mathcal{B})$, where $(V, E)$ defines a DAG, and $\mathcal{B}$ is an abstract simplicial complex on $V$. The elements of $\mathcal{B}$ are called the bidirected faces.

DAGs correspond to mDAGs whose bidirected faces are just singleton vertices: $\mathcal{B} = \{\{v\} : v \in V\}$. We can represent an mDAG as a graph with ordinary directed edges $E$, and bidirected hyper-edges corresponding to the non-trivial facets $\bar{\mathcal{B}}$. We call $(V, E)$ the underlying DAG, and draw its edges in blue; the bidirected hyper-edges are in red. See the example in Figure 1. If $w$ has no parents and $\{w\}$ is a facet of $\mathcal{B}$, we say that $w$ is exogenous.
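Definitions 4.1 and 4.2 translate directly into a small data structure. The following is a minimal sketch, assuming that an mDAG is stored through its non-trivial facets $\bar{\mathcal{B}}$ only, so that a set is a bidirected face when it is a singleton or a non-empty subset of some stored facet. The class and function names (`MDag`, `facets`, `is_face`, `is_admg`) are hypothetical. The facets are those listed in the caption of Figure 1; its directed edges are not recoverable from the text and are left empty here.

```python
from dataclasses import dataclass


def facets(sets):
    """Inclusion-maximal members of a collection of vertex sets."""
    sets = {frozenset(s) for s in sets}
    return {s for s in sets if not any(s < t for t in sets)}


@dataclass(frozen=True)
class MDag:
    vertices: frozenset
    edges: frozenset                # directed edges as (parent, child) pairs
    nontrivial_facets: frozenset    # frozensets of size >= 2

    def is_face(self, B):
        """True if B is a bidirected face: a singleton, or a non-empty
        subset of one of the stored non-trivial facets."""
        B = frozenset(B)
        if not B or not B <= self.vertices:
            return False
        return len(B) == 1 or any(B <= F for F in self.nontrivial_facets)

    def is_admg(self):
        """True if every facet has size at most 2 (an 'edge complex')."""
        return all(len(F) <= 2 for F in self.nontrivial_facets)


# Facets taken from the caption of Figure 1; directed edges omitted (unknown here).
g = MDag(
    vertices=frozenset("abcdef"),
    edges=frozenset(),
    nontrivial_facets=frozenset(
        {frozenset("ac"), frozenset("cde"), frozenset("def")}
    ),
)
print(g.is_face({"d", "e"}), g.is_face({"a", "d"}), g.is_admg())
```

Storing only the facets keeps the representation compact, since the full complex is recovered by taking non-empty subsets.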
Informally we may think of each facet $B$ as representing a latent variable with children $B$. The definitions of parents, children, ancestors and ancestral sets are extended to mDAGs by applying them to the underlying DAG, ignoring the bidirected faces.

Visually, there is some resemblance between the bidirected hyper-edges in mDAGs and the factor nodes in factor graphs, but this similarity is only superficial: for example, factor graphs do not require inclusion maximality (Kschischang et al., 2001).

If we restrict the facets of $\mathcal{B}$ to have size at most 2 (so that $\mathcal{B}$ is an 'edge complex'), then the definition of an mDAG is isomorphic to that of an acyclic directed mixed graph or ADMG (Richardson, 2003). Clearly then, mDAGs are a richer class of graphs: the relationship between mDAGs and ADMGs is explained further in Section 6.1.

Definition 4.3 (Subgraph). Let $\mathcal{G}(V, E, \mathcal{B})$ and $\mathcal{H}(V', E', \mathcal{B}')$ be mDAGs. Say that $\mathcal{H}$ is a subgraph of $\mathcal{G}$, and write $\mathcal{H} \subseteq \mathcal{G}$, if $V' \subseteq V$, $E' \subseteq E$, and $\mathcal{B}' \subseteq \mathcal{B}$. The induced subgraph of $\mathcal{G}$ over $A \subseteq V$ is the mDAG defined by the induced underlying DAG $(A, E_A)$ and bidirected faces $\mathcal{B}_A = \{B \subseteq A : B \in \mathcal{B}\}$. In other words, taking those parts of each edge which intersect with the vertices in $A$.

4.1 Latent Projection

We now relate margins of DAGs to mDAGs, via an operation called latent projection. This is based on the approach taken by Pearl (2009), but allows for joint dependence of more than two variables due to a common 'cause' or ancestor.

Definition 4.4. Let $\mathcal{G}$ be an mDAG with bidirected faces $\mathcal{B}$, and let $W, U$ be disjoint sets of vertices in $\mathcal{G}$. We say that the vertices in $W$ share a hidden common cause in $\mathcal{G}$, with respect to $U$, if there exists a set $B \in \mathcal{B}$ such that

(i) $B \subseteq U \,\dot\cup\, W$; and
(ii) for each $w \in W$ there is a directed path $\pi_w$ from some $b \in B$ to $w$, with all vertices on $\pi_w$ being in $U \cup \{w\}$.
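The two conditions of Definition 4.4 can be tested with a short search. The sketch below assumes the mDAG is given by its vertex set, directed edges and non-trivial facets. Because condition (ii) only becomes easier to satisfy as $B$ grows, it is enough to try $B = F \cap (U \cup W)$ for each facet $F$, including the singleton facets of vertices that lie in no non-trivial facet. The toy edge set and all function names are hypothetical, chosen for illustration only.

```python
def reachable_through(children, start, U, target):
    """True if there is a directed path start -> ... -> target whose vertices
    all lie in U ∪ {target}; the zero-edge path start == target counts."""
    if start == target:
        return True
    if start not in U:
        return False
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for c in children.get(v, ()):
            if c == target:
                return True
            if c in U and c not in seen:
                seen.add(c)
                stack.append(c)
    return False


def shares_hidden_common_cause(vertices, edges, nontrivial_facets, W, U):
    """Check Definition 4.4 for disjoint vertex sets W and U."""
    W, U = set(W), set(U)
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    covered = set().union(*nontrivial_facets)
    # All facets: the non-trivial ones plus singletons for uncovered vertices.
    all_facets = [set(F) for F in nontrivial_facets]
    all_facets += [{v} for v in vertices if v not in covered]
    for F in all_facets:
        B = F & (U | W)          # largest face inside U ∪ W contained in F
        if B and all(
            any(reachable_through(children, b, U, w) for b in B) for w in W
        ):
            return True
    return False


# Hypothetical toy DAG (no non-trivial facets) on the vertices 1..7.
V = set(range(1, 8))
E = {(1, 2), (1, 3), (1, 4), (2, 5), (2, 6), (7, 2)}
print(shares_hidden_common_cause(V, E, [], {3, 4, 5, 6}, {1, 2}))
print(shares_hidden_common_cause(V, E, [], {3, 7}, {1, 2}))
```

In the toy graph the vertex 1 is a common ancestor of 3, 4, 5 and 6 along paths that stay inside $U = \{1, 2\}$, whereas no single face reaches both 3 and 7.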
If $\mathcal{G}$ is a DAG, a hidden common cause is a common ancestor $a \in V$ of each $w \in W$, where $a$ and the vertices on a directed path between $a$ and $w$ are unobserved. Note that if $W \in \mathcal{B}$ then $W$ is trivially a hidden common cause with respect to any $U \subseteq V \setminus W$.

The concept of a hidden common cause is similar to a system of treks which induce latent correlation; see, for example, Foygel et al. (2012). The difference is that treks only consider pairwise dependence, not dependence between an arbitrary collection of variables.

Example 4.5. Let $\mathcal{G}$ be the DAG in Figure 6(a). The vertices $W = \{3, 4, 5, 6\}$ share a hidden common cause $B = \{1\}$ with respect to $U = \{1, 2\}$. In the mDAG in Figure 6(c) the set of vertices $W = \{3, 4, 5, 6\}$ share a hidden common cause in the bidirected facet $\{2, 3, 4\}$, with respect to $\{2\}$.

The hidden common cause forms the basis for determining which vertices should share a bidirected face in an mDAG after projecting out some of the variables. We formalize this with the next definition.

[Figure 6: (a) A DAG on seven vertices, and (b) its latent projection to an mDAG over $\{1, 3, 4, 5, 6, 7\}$, (c) over $\{2, 3, 4, 5, 6, 7\}$ and (d) over $\{3, 4, 5, 6, 7\}$.]

Definition 4.6. Let $\mathcal{G}$ be an mDAG with vertices $V \,\dot\cup\, U$. The latent projection of $\mathcal{G}$ onto $V$, denoted by $p(\mathcal{G}, V)$, is an mDAG with vertices $V$, and edges $E'$ and bidirected faces $\mathcal{B}'$ defined as follows:

• $(a, b) \in E'$ whenever $a \neq b$ and there is a directed path $a \to \cdots \to b$ in $\mathcal{G}$, with all non-endpoints in $U$;
• $W \in \mathcal{B}'$ whenever the vertices $W \subseteq V$ share a hidden common cause in $\mathcal{G}$ with respect to $U$.

It is straightforward to see that $\mathcal{B}'$ is an abstract simplicial complex, and therefore the definition above gives an mDAG.

Example 4.7.
Consider the mDAG in Figure 6(a), and its latent projection after projecting out the vertex 2, shown in Figure 6(b). In the original graph the directed paths $7 \to 2 \to 5$ and $7 \to 2 \to 6$ are manifested as the directed edges $7 \to 5$ and $7 \to 6$ in the projection. Additionally, there is a hidden common cause for the vertices 5, 6 (as noted in the previous example), so we end up with a bidirected facet $\{5, 6\}$ in the projection. The projection of the graph in Figure 6(b) onto $\{3, 4, 5, 6, 7\}$ is shown in (d).

Definition 4.8. Let $\mathcal{G}(V, \mathcal{E}, \mathcal{B})$ be an mDAG with bidirected facets $\bar{\mathcal{B}}$. We define $\bar{\mathcal{G}}$, the canonical DAG associated with $\mathcal{G}$, as the DAG with vertices $V \cup \bar{\mathcal{B}}$ and edges
$$\mathcal{E} \cup \{B \to v : v \in B \in \bar{\mathcal{B}}\}.$$

Figure 7: The canonical DAG associated with the mDAG in Figure 1.

In other words, we replace every non-trivial facet $B \in \mathcal{B}$ with a vertex whose children are precisely the elements of $B$. The canonical DAG associated with the mDAG from Figure 1 is shown in Figure 7.

Proposition 4.9. Let $\mathcal{G}$ be an mDAG with vertex set $V$.
(a) $\mathcal{H} \subseteq \mathcal{G} \implies p(\mathcal{H}, W) \subseteq p(\mathcal{G}, W)$ for any $W \subseteq V$;
(b) $p(\bar{\mathcal{G}}, V) = \mathcal{G}$;
(c) if $A \subseteq V$ is an ancestral set in $\mathcal{G}$, then $p(\mathcal{G}, A) = \mathcal{G}_A$.

Proof. (a): If $\mathcal{H}$ is a subgraph of $\mathcal{G}$, then any directed path or hidden common cause in $\mathcal{H}$ must also be found in $\mathcal{G}$.
(b): Since $\bar{\mathcal{G}}$ is a DAG on vertices $V \cup \bar{\mathcal{B}}$ and no $B \in \bar{\mathcal{B}}$ has any parents in $\bar{\mathcal{G}}$, the only directed edges added in $p(\bar{\mathcal{G}}, V)$ are those already joining elements of $V$ in $\bar{\mathcal{G}}$, and therefore are precisely the directed edges in $\mathcal{G}$. The only hidden common causes with respect to $\bar{\mathcal{B}}$ are singletons $\{v\}$ and subsets of any $B \in \bar{\mathcal{B}}$, whose children are all observed. Hence the bidirected faces in $p(\bar{\mathcal{G}}, V)$ are precisely $\mathcal{B}$.
(c): Since $A$ is ancestral, any directed paths between elements of $A$ have all vertices in $A$, and there are no directed paths from $V \setminus A$ to $A$ (hence there are no hidden common causes).

A critical fact about latent projection is that it does not matter in what order we project out vertices, or indeed if we do several at once.

Theorem 4.10. Let $\mathcal{G}$ be an mDAG with vertices $V \,\dot\cup\, U_1 \,\dot\cup\, U_2$. Then
$$p(\mathcal{G}, V) = p(p(\mathcal{G}, V \cup U_1), V) = p(p(\mathcal{G}, V \cup U_2), V).$$
That is, the order of projection does not matter.

The proof of this result is found in the Appendix. The commutativity is illustrated in Figure 6: if we first project out 1 and then 2 from the DAG (a) we obtain the mDAGs in (c) and then (d) respectively. If the order of projection is reversed we obtain the mDAGs in (b) and then (d).

A second crucial fact is that if two DAGs have the same latent projection onto a set $V$, then their marginal models over $V$ are also the same. To prove this we use the following two lemmas, which show that two different DAGs result in the same mDAG if their margins are equivalent by Lemmas 3.7, 3.8 and 3.9.

Lemma 4.11. Let $\mathcal{D}$ be a DAG with vertices $V \,\dot\cup\, \{u\}$, and $r(\mathcal{D}, u)$ the exogenized DAG for $u$. Then $p(\mathcal{D}, V) = p(r(\mathcal{D}, u), V)$.

Proof. From the definition of $r$, any directed paths passing through $u$ as an intermediate node $l \to u \to k$ in $\mathcal{D}$ are replaced by $l \to k$ in $r(\mathcal{D}, u)$. Hence the directed edges in both projections are the same. The only vertex being projected out is $u$, and since its child set is the same in both $\mathcal{D}$ and $r(\mathcal{D}, u)$, the groups of vertices sharing a hidden common cause with respect to $\{u\}$ will remain unchanged. Hence the bidirected faces in both projections are the same.

Lemma 4.12. Let $\mathcal{G}$ be an mDAG with vertices $V \,\dot\cup\, U$, containing an exogenous vertex $w \in U$.
If either $|\mathrm{ch}_{\mathcal{G}}(w)| \le 1$, or $\mathrm{ch}_{\mathcal{G}}(w) \subseteq \mathrm{ch}_{\mathcal{G}}(u)$ for some $u \in U$, then $p(\mathcal{G}, V) = p(\mathcal{G} - w, V)$.

Proof. Since $w$ has no parents, there are no directed paths containing it as an intermediate vertex; hence we need only show that if some vertices in $V$ share a hidden common cause in $\mathcal{G}$ with respect to $U$, then they also share one in $\mathcal{G} - w$ with respect to $U \setminus \{w\}$. Since $w$ is exogenous this is clearly true whenever the hidden common cause is not $\{w\}$, and so if $w$ has no children the result is trivial. If $\mathrm{ch}_{\mathcal{G}}(w) = \{k\}$ then $\{k\}$ will also serve as a hidden common cause. If $\mathrm{ch}_{\mathcal{G}}(w) \subseteq \mathrm{ch}_{\mathcal{G}}(u)$ for some $u \in U$ then clearly any vertices which share $\{w\}$ as a hidden common cause in $\mathcal{G}$ will also have $\{u\}$ as a hidden common cause in $\mathcal{G}$ and $\mathcal{G} - w$.

Theorem 4.13. Let $\mathcal{D}, \mathcal{D}'$ be two DAGs whose latent projections onto some set $V$ are the same. Then $\mathcal{M}(\mathcal{D}, V) = \mathcal{M}(\mathcal{D}', V)$.

Proof. Let $\mathcal{G} = p(\mathcal{D}, V)$ be the latent projection. We will show that $\mathcal{M}(\mathcal{D}, V) = \mathcal{M}(\bar{\mathcal{G}}, V)$, and thereby prove the result.

Let the vertex set of $\mathcal{D}$ be $V \,\dot\cup\, U$. If no vertex in $U$ has any parents in $\mathcal{D}$, each vertex in $U$ has at least two children, and their child sets are never nested, then $\mathcal{D} = \bar{\mathcal{G}}$ and there is nothing to prove.

Otherwise suppose $u \in U$ has at least one parent. Then $r(\mathcal{D}, u)$ has the same latent projection onto $V$ as $\mathcal{D}$ by Lemma 4.11, and $\mathcal{M}(r(\mathcal{D}, u), V) = \mathcal{M}(\mathcal{D}, V)$ by Lemma 3.7. The problem reduces to $r(\mathcal{D}, u)$, and by repeated application it reduces to DAGs in which no vertex in $U$ has any parents.

Similarly, if either $w \in U$ has at most one child, or $\mathrm{ch}_{\mathcal{D}}(w) \subseteq \mathrm{ch}_{\mathcal{D}}(u)$ for some other $u \in U$, then by Lemmas 3.8 and 3.9 we have $\mathcal{M}(\mathcal{D} - w, V) = \mathcal{M}(\mathcal{D}, V)$ and by Lemma 4.12 $p(\mathcal{D} - w, V) = \mathcal{G}$, so the problem reduces to $\mathcal{D} - w$. It follows that we can reduce to the canonical DAG $\bar{\mathcal{G}}$, and the result is proved.
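Definitions 4.6 and 4.8 translate fairly directly into code, which makes Proposition 4.9(b) and Theorem 4.10 easy to check on small graphs. The sketch below is illustrative only: all function names are hypothetical, an mDAG is assumed to be a triple (vertices, directed edges, facets) with singleton faces left implicit, and the seven-vertex edge list is an assumed DAG consistent with the paths described in Examples 4.5 and 4.7, not a transcription of Figure 6. The projected faces are generated, for each face $F$, by $(F \cap V)$ together with the observed vertices reachable from $F \cap U$ through hidden vertices.

```python
def reach_through(edges, sources, hidden):
    """Endpoints of directed paths of length >= 1 that start in `sources`
    and whose non-endpoint vertices all lie in `hidden`."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    seen, stack = set(), list(sources)
    while stack:
        v = stack.pop()
        for c in children.get(v, ()):
            if c not in seen:
                seen.add(c)
                if c in hidden:        # only hidden vertices may be passed through
                    stack.append(c)
    return seen


def maximal(sets):
    """Inclusion-maximal non-empty sets, as frozensets."""
    sets = {frozenset(s) for s in sets if s}
    return {s for s in sets if not any(s != t and s.issubset(t) for t in sets)}


def latent_projection(vertices, edges, facets, keep):
    """Latent projection onto `keep` (Definition 4.6), returned as a triple."""
    keep = set(keep)
    hidden = set(vertices) - keep
    new_edges = {(a, b) for a in keep
                 for b in reach_through(edges, [a], hidden) if b in keep}
    faces = [set(F) for F in facets] + [{v} for v in vertices]
    candidates = [(F & keep) | (reach_through(edges, F & hidden, hidden) & keep)
                  for F in faces]
    return keep, new_edges, maximal(candidates)


def canonical_dag(vertices, edges, facets):
    """Canonical DAG (Definition 4.8): one latent parent per non-trivial facet."""
    latents = {("latent", F) for F in maximal(facets) if len(F) >= 2}
    new_edges = set(edges) | {(L, v) for L in latents for v in L[1]}
    return set(vertices) | latents, new_edges, set()


# Proposition 4.9(b) on a small hypothetical mDAG.
V, E, B = {"a", "b", "c", "d"}, {("a", "b")}, [{"a", "b", "c"}, {"c", "d"}]
G = (V, E, maximal(B + [{v} for v in V]))
print(latent_projection(*canonical_dag(V, E, B), V) == G)

# Theorem 4.10 on the assumed seven-vertex DAG.
D = (set(range(1, 8)), {(1, 2), (1, 3), (1, 4), (7, 2), (2, 5), (2, 6)}, set())
obs = {3, 4, 5, 6, 7}
direct = latent_projection(*D, obs)
via_c = latent_projection(*latent_projection(*D, obs | {2}), obs)  # drop 1, then 2
via_b = latent_projection(*latent_projection(*D, obs | {1}), obs)  # drop 2, then 1
print(direct == via_c == via_b)
```

Both comparisons print `True`. These are spot checks on particular graphs, not a substitute for the proofs.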
Theorem 4.13 shows that mDAGs are rich enough to fully express the class of marginal DAG models. In Section 6 we will see that ordinary (i.e. not hyper) graphs are unable to do this, and in Section 7 that mDAGs are, from a causal perspective, the natural object to represent these models.

5 Markov Properties

We are now in a position to define a Markov property for mDAGs that relates to the original problem of characterizing the margins of DAG models.

Definition 5.1. Say that $P$ obeys the marginal Markov property for an mDAG $\mathcal{G}$ with vertices $V$, if it is contained within the marginal DAG model of the canonical DAG: $P \in \mathcal{M}(\bar{\mathcal{G}}, V)$. We denote the set of such distributions (the marginal model) by $\mathcal{M}_m(\mathcal{G})$.

For instance, we know from Example 3.4 that the marginal model for $1 \leftrightarrow 3 \leftrightarrow 2$ is the collection of distributions under which $X_1 \perp\!\!\!\perp X_2$. It follows from Theorem 4.13 that the marginal model of any DAG $\mathcal{M}(\mathcal{G}, V)$ is the same as the model obtained by applying the marginal Markov property to its latent projection $p(\mathcal{G}, V)$.

For some $W \subseteq V$ we denote the marginal model of an mDAG with respect to $W$ as $\mathcal{M}_m(\mathcal{G}, W) \equiv \mathcal{M}(\bar{\mathcal{G}}, W)$. Note that Theorem 4.10 shows that this is a sensible definition.

Proposition 5.2. Let $\mathcal{G}, \mathcal{H}$ be mDAGs with vertex set $V$.
(a) If $A$ is an ancestral set in $\mathcal{G}$, then $\mathcal{M}_m(\mathcal{G}_A) = \mathcal{M}_m(\mathcal{G}, A)$.
(b) If $\mathcal{H} \subseteq \mathcal{G}$, then $\mathcal{M}_m(\mathcal{H}) \subseteq \mathcal{M}_m(\mathcal{G})$.

Proof. (a) By definition $\mathcal{M}_m(\mathcal{G}, A) = \mathcal{M}(\bar{\mathcal{G}}, A) = \mathcal{M}_m(p(\bar{\mathcal{G}}, A))$, and from Proposition 4.9 $p(\bar{\mathcal{G}}, A) = \mathcal{G}_A$.
(b) If $\mathcal{H} \subseteq \mathcal{G}$ then $\bar{\mathcal{H}} \subseteq \bar{\mathcal{G}}$, so by Proposition 3.3 $\mathcal{M}(\bar{\mathcal{H}}) \subseteq \mathcal{M}(\bar{\mathcal{G}})$. It follows that $\mathcal{M}(\bar{\mathcal{H}}, V) \subseteq \mathcal{M}(\bar{\mathcal{G}}, V)$, giving the required result.

The marginal Markov property also implies certain factorizations of the joint density, if one exists. To describe them, we first need to define a special subgraph.

Definition 5.3.
Let $\mathcal{G}(V, \mathcal{E}, \mathcal{B})$ be an mDAG. Say that $C \subseteq V$ is bidirected-connected if for every $v, w \in C$ there is a sequence of vertices $v = v_0, v_1, \ldots, v_k = w$, all in $C$, such that $\{v_{i-1}, v_i\} \in \mathcal{B}$ for $i = 1, \ldots, k$. A maximal bidirected-connected set is called a district.

Let $\mathcal{G}$ be an mDAG with district $D$. The graph $\mathcal{G}[D]$ is the mDAG with vertices $D \cup \mathrm{pa}_{\mathcal{G}}(D)$, directed edges from $D \cup \mathrm{pa}_{\mathcal{G}}(D)$ to $D$, and bidirected faces $\mathcal{B}_D = \{B \subseteq D : B \in \mathcal{B}\}$.

In other words, $\mathcal{G}[D]$ is the induced subgraph over $D$, together with any directed edges that point into $D$ (and the associated vertices). As an example, the mDAG in Figure 8(a) has districts $\{1\}$, $\{3\}$ and $\{2, 4\}$. The subgraph corresponding to $D = \{2, 4\}$ is shown in Figure 8(b).

Proposition 5.4. Let $\mathcal{G}$ be an mDAG with districts $D_1, \ldots, D_k$, and suppose that $P$ with density $p$ obeys the marginal Markov property for $\mathcal{G}$. Then
$$p(x_V) = \prod_{i=1}^{k} q_i(x_{D_i} \mid x_{\mathrm{pa}(D_i) \setminus D_i}),$$
for some conditional distributions $q_i$ that obey the marginal Markov property with respect to $\mathcal{G}[D_i]$, $i = 1, \ldots, k$.

The proof of this is omitted but see Shpitser et al. (2014), which includes various examples. Each $q_i$ is a conditional distribution, but can be renormalized as a joint density over $D_i \cup \mathrm{pa}_{\mathcal{G}}(D_i)$. The notion of conditional distributions in graphical models is dealt with in Shpitser et al. (2014) by having two types of vertex, separately representing the random and conditioned variables; we have omitted these details for the sake of brevity.

5.1 Weaker Markov Properties

The marginal model precisely answers our original question: what collections of distributions can be induced as the margin of a DAG model?
However, because the definition is rather indirect, it is generally difficult to characterize the set $\mathcal{M}_m(\mathcal{G})$, and we may be unable to tell whether or not a particular distribution lies in it. This complexity is one of the motivations behind the ordinary and nested Markov properties of Richardson (2003) and Shpitser et al. (2014) respectively. Both properties follow from treating the ancestrality in Proposition 5.2(a) and the factorization in Proposition 5.4 as axiomatic. In order to do so, we assume the existence of a joint density with respect to a product measure on $\mathfrak{X}_V$.

Definition 5.5. Let $\mathcal{G}$ be an mDAG with vertices $V$, and $P$ a probability distribution over $\mathfrak{X}_V$ with density $p$. Say that $P$ obeys the nested Markov property with respect to $\mathcal{G}$ if either $|V| = 1$, or both:
1. for every ancestral set $A \subseteq V$, the margin of $P$ over $\mathfrak{X}_A$ obeys the nested Markov property for $\mathcal{G}_A$; and
2. if $\mathcal{G}$ has districts $D_1, \ldots, D_k$ then $p(x_V) = \prod_{i=1}^{k} q_i(x_{D_i} \mid x_{\mathrm{pa}(D_i) \setminus D_i})$, where each $q_i$ obeys the nested Markov property for $\mathcal{G}[D_i]$.
We denote the resulting models by $\mathcal{M}_n(\mathcal{G})$.

Figure 8: (a) An mDAG $\mathcal{G}$ representing the DAG in Figure 3, with the vertex 5 treated as unobserved. (b) The subgraph $\mathcal{G}[\{2, 4\}]$.

The nested model 'throws away' the inequality constraints of the marginal model, but for discrete variables is known to give models of the same dimension (Evans, 2015), and it has the advantage of a fairly explicit characterization. Various equivalent formulations to the one above are given in Shpitser et al. (2014).

The ordinary model can be defined in the same way as the nested model, but replacing 2 with the weaker condition:
2'. if $\mathcal{G}$ has districts $D_1, \ldots, D_k$ then $p(x_V) = \prod_{i=1}^{k} q_i(x_{D_i} \mid x_{\mathrm{pa}(D_i) \setminus D_i})$ for some conditional densities $q_i$.
Crucially, no further structure is imposed upon the pieces $q_i$, so the definition does not recurse. From their definitions and Proposition 5.4 it is clear that the models obey the inclusion $\mathcal{M}_m(\mathcal{G}) \subseteq \mathcal{M}_n(\mathcal{G}) \subseteq \mathcal{M}_o(\mathcal{G})$: the next example shows that these inclusions are strict in general.

Example 5.6. Consider again the graph in Figure 3; its latent projection over the vertices $\{1, 2, 3, 4\}$ is shown in Figure 8(a): call this projection $\mathcal{G}$. Applying the ancestrality property we see that, under the ordinary Markov property, the margin over $(X_1, X_2, X_3)$ satisfies the global Markov property for the DAG $1 \to 2 \to 3$, so $X_1 \perp\!\!\!\perp X_3 \mid X_2$. If we factorize into districts we find
$$p(x_1, x_2, x_3, x_4) = q_1(x_1) \cdot q_3(x_3 \mid x_1, x_2) \cdot q_{24}(x_2, x_4 \mid x_1, x_3),$$
which is a vacuous requirement under the ordinary Markov property, and indeed there are no further constraints. However, the nested property additionally requires that $q_{24}$ obeys the nested property for the mDAG in Figure 8(b). Under this graph we see that $X_4 \perp\!\!\!\perp X_1 \mid X_3$, and this gives the constraint (1); hence $\mathcal{M}_n(\mathcal{G}) \subset \mathcal{M}_o(\mathcal{G})$.

If $X_2$ and $X_4$ are discrete, then the marginal Markov property induces an extra inequality constraint known as Bell's inequality (Bell, 1964; Gill, 2014); hence $\mathcal{M}_m(\mathcal{G}) \subset \mathcal{M}_n(\mathcal{G})$.

6 Markov Equivalent Graphs

A natural question to ask is when two different graphs lead to the same model under a particular Markov property. That is, what is the equivalence class determined by $\mathcal{G} \sim \mathcal{G}'$ whenever $\mathcal{M}_m(\mathcal{G}) = \mathcal{M}_m(\mathcal{G}')$? Without further assumptions such as a causal ordering, graphs that are Markov equivalent are indistinguishable; any model search procedure over the class of mDAG models should therefore report the equivalence class rather than a single graph.
In addition, because the marginal Markov property is difficult to characterize explicitly, it can be helpful to reduce a problem down to a simpler graph (see Example 6.4).

For the ordinary Markov property there is a relatively simple criterion for determining whether two graphs are equivalent (Richardson, 2003); for the nested Markov model, on the other hand, equivalence is an open problem. This section provides partial results towards a characterization in the case of the marginal model. We conjecture that if two graphs are equivalent under the marginal property then they are also equivalent under the nested property. The results of Evans (2015) show that this holds for discrete variables, but the general case is still open.

Our first substantive equivalence result generalizes an idea for instrumental variables.

Proposition 6.1. Let $\mathcal{G}$ be an mDAG containing a bidirected facet $B = C \,\dot\cup\, D$ such that:
(i) every bidirected face containing any $c \in C$ is a subset of $B$; and
(ii) $\mathrm{pa}_{\mathcal{G}}(d) \supseteq \mathrm{pa}_{\mathcal{G}}(C)$ for each $d \in D$.
Let $\mathcal{H}$ be the mDAG defined from $\mathcal{G}$ by removing the facet $B$ and replacing it with $C$ and $D$, and adding edges $c \to d$ for each $c \in C$ and $d \in D$ (where such an edge is not already present). Then $\mathcal{M}_m(\mathcal{G}) = \mathcal{M}_m(\mathcal{H})$.

Proof. The result follows from Lemma A.4 in the appendix, which shows that under these circumstances we can split the latent variable corresponding to $B$ into two independent pieces.

Example 6.2. Consider the mDAG in Figure 9(a). We can apply the Proposition with $C = \{a, b\}$ and $D = \{c, d\}$ to see that it is Markov equivalent to the graph in Figure 9(b). The advantage of such a reduction is that it moves the graph 'closer' to something which looks like a DAG, having smaller bidirected facets. This makes it clearer how the joint distribution factorizes.

Example 6.3.
The canonical example to which Proposition 6.1 can be applied is the instrumental variables model, shown in Figure 10(a). As noted by Didelez and Sheehan (2007), it is not possible observationally to tell whether 1 is a direct cause of 2, or there is a hidden common cause, or both. Applying Proposition 6.1 to the graphs in Figure 10(b) and (c) with $C = \{1\}$ and $D = \{2\}$ shows that they are indeed equivalent to Figure 10(a).

Figure 9: Two mDAGs shown to be Markov equivalent by application of Proposition 6.1.

Figure 10: Three Markov equivalent graphs representing the instrumental variables model.

Figure 11: (a) An mDAG; (b) an mDAG which is Markov equivalent to the one in (a); and (c) a DAG which is Markov equivalent to the mDAGs.

Example 6.4. The mDAG in Figure 11(a) can be reduced to the simpler one in 11(b) by applying Proposition 6.1 with $C = \{1\}$ and $D = \{2, 3\}$. This can be further simplified to the DAG in (c) by applying the proposition again, this time with $C = \{2\}$ and $D = \{3\}$. By using the global Markov property for DAGs, this shows that each graph represents those distributions under which $X_4 \perp\!\!\!\perp X_1, X_2 \mid X_3$.

Define the skeleton of an mDAG $\mathcal{G}(V, \mathcal{E}, \mathcal{B})$ as the simple undirected graph with vertices $V$, and edges $v - w$ whenever $v$ and $w$ appear together in some edge (directed or bidirected) in $\mathcal{G}$.

Proposition 6.5. Let $\mathcal{G}$ and $\mathcal{H}$ be mDAGs with different skeletons. Then if the state-space $\mathfrak{X}_V$ is discrete, $\mathcal{M}_m(\mathcal{G}) \neq \mathcal{M}_m(\mathcal{H})$.

Proof. This follows from Evans (2012), Corollary 4.4.

Note that this is not necessarily true for all state-spaces: if $X_2$ is continuous the three models defined by applying the marginal Markov property to the graphs in Figure 10 are all saturated (i.e.
contain any joint distribution over those variables), even though they have skeleton $1 - 2 - 3$ (Bonet, 2001).

6.1 Bidirected Graphs and Connection to ADMGs

The notion of latent projection was defined by Verma (1991) with respect to acyclic directed mixed graphs (though this term for such graphs was not introduced until Richardson (2003)). The importance of our more general formulation is now made clear.

Example 6.6. Consider the mDAGs in Figure 12. The graph in Figure 12(a) is the latent projection one would obtain from a single latent variable having all three observed nodes as children, while Figure 12(b) corresponds to having three independent latents, each with a pair of observed variables as children. The first graph is associated with a model which is clearly saturated, but the second is not: for example, if the observed variables are binary, it is not possible to have $P(X_1 = X_2 = X_3 = 1) = P(X_1 = X_2 = X_3 = 0) = \frac{1}{2}$ (Fritz, 2012).

Figure 12: (a) An mDAG corresponding to a saturated model; (b) an mDAG corresponding to a model with constraints.

Under Verma's original formulation of latent projection with ADMGs, both these models are represented by the same graph: the one in Figure 12(b). However, as the previous example shows, the two marginal models formed in this way are actually distinct. The next result generalizes this idea.

Lemma 6.7. Let $\mathcal{G}$ be a purely bidirected mDAG with vertices $V$, whose bidirected faces consist of all non-empty strict subsets $B \subset V$ of vertices. Then the model $\mathcal{M}_m(\mathcal{G})$ is not saturated (for any state-space $\mathfrak{X}_V$).

Proof. For each $v \in V$, let $B_v = V \setminus \{v\}$, so that $\mathcal{B}$ consists of the sets $B_v$ and their subsets. The canonical DAG $\bar{\mathcal{G}}$ has vertices $V \cup \{B_v : v \in V\}$ and edges $B_v \to w$ whenever $v \neq w$.
Let $(X_V, Y_{\mathcal{B}})$ have a joint distribution which respects the SEP with respect to $\bar{\mathcal{G}}$, so that, writing $Y_{-v} \equiv (Y_{B_w} : w \neq v)$, we have $X_v = f_v(Y_{-v}, E_v)$. Given some permutation $s$ of $V$ such that $s(v) \neq v$ for any $v \in V$, let $\mathcal{F}_v = \sigma(Y_{B_v}, E_{s(v)})$. Note that each $X_v$ is $\sigma\left(\bigvee_{w \neq v} \mathcal{F}_w\right)$-measurable, and that all the $\sigma$-algebras $\mathcal{F}_v$ are independent. It follows from Lemma A.2 in the appendix that if $\mathbb{E}(X_v - X_w)^2 \le \epsilon$ for each $v, w$, then each $X_v$ has variance at most $|V| \epsilon$. But this precludes, for example, the possibility of a joint binary distribution in which $P(\{X_v \text{ all equal}\}) = 1 - \epsilon$ with $P(X_v = 0) = P(X_v = 1) = \frac{1}{2}$ for some sufficiently small positive $\epsilon$. Since it is always possible to dichotomize a (non-trivial) random variable, this shows that the model is not saturated on any state-space.

In the case where mDAGs contain only bidirected edges, Markov equivalence turns out to be very simple.

Proposition 6.8. Let $\mathcal{G}, \mathcal{G}'$ be mDAGs containing no directed edges. Then $\mathcal{M}_m(\mathcal{G}) = \mathcal{M}_m(\mathcal{G}')$ if and only if $\mathcal{G} = \mathcal{G}'$.

Figure 13: mDAGs representing the eight distinct models over three (unlabelled) variables.

Proof. Suppose that $\mathcal{G}$ and $\mathcal{G}'$ are not equal, so (without loss of generality) there exists some $B \in \mathcal{B}(\mathcal{G}) \setminus \mathcal{B}(\mathcal{G}')$. Since $B$ is ancestral (there are no directed edges), it is sufficient to prove that $\mathcal{M}_m(\mathcal{G}_B) \neq \mathcal{M}_m(\mathcal{G}'_B)$, so assume that in fact the vertices of $\mathcal{G}$ and $\mathcal{G}'$ are $B$. The model $\mathcal{M}_m(\mathcal{G})$ is saturated. Let $\tilde{\mathcal{G}}$ be the bidirected graph with vertices $B$ and such that $\mathcal{B}(\tilde{\mathcal{G}})$ consists of all strict subsets of $B$; by Lemma 6.7 $\mathcal{M}_m(\tilde{\mathcal{G}})$ is not saturated.
But $\mathcal{G}' \subseteq \tilde{\mathcal{G}}$, so $\mathcal{M}_m(\mathcal{G}') \subseteq \mathcal{M}_m(\tilde{\mathcal{G}}) \subset \mathcal{M}_m(\mathcal{G})$, so in particular $\mathcal{M}_m(\mathcal{G}) \neq \mathcal{M}_m(\mathcal{G}')$.

It follows from this result that ordinary graphs are fundamentally unable to fully represent marginal models, even if we add additional kinds of edge; the number of possible marginal models just grows too quickly. Consequently our extension to hyper-edges is necessary.

Corollary 6.9. No class of ordinary graphs (i.e. not hyper-graphs) is sufficient to represent marginal models of DAGs.

Proof. The number of simplicial complexes on $n$ vertices grows faster than $2^{\binom{n}{\lfloor n/2 \rfloor}}$ (see, for example, Kleitman, 1969), so by Proposition 6.8 there are at least this many marginal models. For a class of ordinary graphs with $k$ different edge types, there are only $2^{k \binom{n}{2}}$ different graphs, and $\binom{n}{\lfloor n/2 \rfloor} > k \binom{n}{2}$ for sufficiently large $n$. Hence ordinary graphs are not sufficient.

6.2 mDAGs on Three Variables

There are 48 distinct mDAGs over three unlabelled vertices (i.e. up to permutation of the vertices). Using Propositions 5.2, 6.1 and 6.5 one can show that of these there are 8 equivalence classes of induced models. These are shown in Figure 13. Five of them are DAG models, the other three being the instrumental variables model from Figure 10(a), the 'unrelated confounding' model studied by Evans (2012), and the pairwise bidirected model from Example 6.6.

Figure 14: Three mDAGs whose associated models under the marginal Markov property may or may not be saturated.

For four nodes the problem becomes much more complicated. As an illustration of the limitations of the results in this section, we note that we are unable to determine whether or not the graphs in Figure 14 represent saturated models under the marginal Markov property.
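The counting argument in the proof of Corollary 6.9 compares two exponents of 2: $\binom{n}{\lfloor n/2 \rfloor}$ for the lower bound on simplicial complexes, and $k\binom{n}{2}$ for ordinary graphs with $k$ edge types. As a rough illustration (the function name is hypothetical and not from the paper), the sketch below finds the first $n \ge 5$ at which the first exponent exceeds the second. The search starts at $n = 5$ on the assumption, which can be checked by elementary algebra, that the ratio of the two binomial coefficients is strictly increasing from that point on, so the inequality then holds for all larger $n$.

```python
from math import comb


def first_n_beyond_ordinary_graphs(k, start=5, n_max=1000):
    """Smallest n >= start with C(n, floor(n/2)) > k * C(n, 2), or None.

    k is the number of edge types allowed in the class of ordinary graphs.
    """
    for n in range(start, n_max + 1):
        if comb(n, n // 2) > k * comb(n, 2):
            return n
    return None


for k in (1, 2, 3, 4):
    print(k, first_n_beyond_ordinary_graphs(k))
```

For $k = 1, 2, 3, 4$ the thresholds are $n = 6, 8, 9, 10$ respectively, so with three edge types the lower bound on the number of marginal models overtakes the number of ordinary graphs at nine vertices. This only compares the two bounds used in the proof; it says nothing about the exact number of distinct models at smaller $n$.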
7 Causal Models and Interventions

The use of DAGs to represent causal models goes back to the work of Sewall Wright, and has found popularity more recently (see Spirtes et al., 2000; Pearl, 2009, and references therein). The use of an arrow $X \to Y$ to express the statement that '$X$ causes $Y$' is natural and intuitive, and directed acyclic graphs provide a convenient recursive structure for representing causal models, with acyclicity enforcing the idea that causes should precede effects in time.

Note that the structural equation property as formulated in Definition 2.2 only posits the existence of some functions $f_v$ and error variables $E_v$ which generate the required joint distribution. In general, there will be many graphical structures and pairs $(f_v, E_v)$ which give rise to a given distribution. However, if a distribution is structurally generated in this way, then when some of the variables in the system are intervened upon (in an appropriately defined way), a suitably modified version of the original DAG will correctly represent the resulting interventional probability distribution (Pearl, 2009). Analogously we will show that mDAGs are able to represent the models induced on the margins of DAGs after intervention.

Definition 7.1. Let $\mathcal{D}$ be a DAG with vertices $V$, and suppose that data are generated according to a particular collection of pairs $(f_v, E_v)$, $v \in V$, which satisfy the SEP for $\mathcal{D}$. An intervention on $A \subseteq V$ replaces $(f_v, E_v)$ with $(\tilde{f}_v, \tilde{E}_v)$ for each $v \in A$, where $\tilde{f}_v : E_v \to \mathfrak{X}_v$ is measurable, and all $E_w, \tilde{E}_v$ are independent.

Denote by $\mathcal{D}_A$ the DAG $\mathcal{D}$ after intervening on $A$, formed from $\mathcal{D}$ by removing edges directed towards $v \in A$.

An intervention removes the dependence of a variable on all of its parents.
If $P$ is generated by $(f_v, E_v)$ according to the DAG $\mathcal{D}$, then the distribution $P_A$ after intervention on $A$ is generated according to the mutilated DAG $\mathcal{D}_A$, and hence obeys the SEP for $\mathcal{D}_A$. This definition of an intervention is based on the one in Pearl (2009).

Figure 15: The mDAG from Figure 1 after intervening on $d$.

Note that intervention is not a purely probabilistic operation, in the sense that its effect is not identifiable from the observed probability distribution alone: it relies upon knowledge of the full structural generating system.

7.1 Causal mDAGs

Let $\mathcal{D}$ be a DAG with vertex set $U \,\dot\cup\, V$ and let $\mathcal{G} = p(\mathcal{D}, V)$. If $(X_U, X_V)$ are generated according to the structural equation property for $\mathcal{D}$, the definitions and results of previous sections tell us that the distribution of $X_V$, say $P$, is contained in $\mathcal{M}_m(\mathcal{G})$. If an intervention is performed on some of the vertices in $V$, what then should we expect from the resulting marginal distribution?

Definition 7.2. Let $\mathcal{G}(V, \mathcal{E}, \mathcal{B})$ be an mDAG, and $A \subseteq V$. The mDAG $\mathcal{G}_A$ has vertices $V$, directed edges $\mathcal{E}_A = \{(w, v) \in \mathcal{E} : v \notin A\}$, and bidirected faces $\{B \setminus A : B \in \mathcal{B}\}$ (together with the singletons $\{a\}$ for $a \in A$).

In other words, to obtain $\mathcal{G}_A$ from $\mathcal{G}$, delete directed edges pointing to $A$, and remove vertices in $A$ from each bidirected edge. For example, Figure 15 shows the result of intervening on $\{d\}$ in the mDAG from Figure 1. The next result shows that this definition of a mutilated mDAG is sensible, because mutilation and projection commute.

Proposition 7.3. Let $A \subseteq V$. If $\mathcal{G} = p(\mathcal{D}, V)$, then $\mathcal{G}_A = p(\mathcal{D}_A, V)$.

Proof. Note that the definitions of latent projections and of hidden common causes refer only to directed paths with non-endpoint vertices in $U$.
Since $U \cap A = \emptyset$, it follows that such a directed path in $\mathcal{D}$ is also contained in $\mathcal{D}_A$ if and only if the final vertex is not in $A$. Hence, the directed edges in $p(\mathcal{D}_A, V)$ are precisely those which are in $\mathcal{G} = p(\mathcal{D}, V)$ and do not point to $A$, as required.

Now, suppose $B \in \mathcal{B}(\mathcal{G}_A)$: then there is some $B' \in \mathcal{B}(\mathcal{G})$ with $B' \setminus A = B$. Hence the vertices in $B'$ share a hidden common cause in $\mathcal{D}$ with respect to $U$, and by the same reasoning as above, the vertices in $B' \setminus A = B$ share a hidden common cause in $\mathcal{D}_A$ with respect to $U$. Hence $B \in \mathcal{B}(p(\mathcal{D}_A, V))$.

Conversely, if $B \in \mathcal{B}(p(\mathcal{D}_A, V))$, then the elements of $B$ share a hidden common cause in $\mathcal{D}_A$ with respect to $U$, and hence also in the supergraph $\mathcal{D}$. So there is some $B' \supseteq B$ with $B' \setminus A = B$ such that $B' \in \mathcal{B}(\mathcal{G})$, and hence $B \in \mathcal{B}(\mathcal{G}_A)$.

It follows from this result that mDAGs not only represent the structure of a margin of a DAG model, but they can also correctly represent the manner in which it will change under interventions on the observed variables.

Proposition 7.4. Let $\mathcal{D}, \mathcal{D}'$ be DAGs with the same latent projection $\mathcal{G}$ over some set of variables $V$. For any subset $A \subseteq V$ of intervened nodes, $\mathcal{M}(\mathcal{D}_A, V) = \mathcal{M}(\mathcal{D}'_A, V)$.

Proof. By Proposition 7.3, $p(\mathcal{D}_A, V) = p(\mathcal{D}'_A, V)$, so that the result follows from Theorem 4.13.

Two DAGs may be observationally Markov equivalent, such as the graphs $1 \to 2$ and $1 \leftarrow 2$ (which both represent saturated models). However, for any two distinct causal DAGs, there is always some intervention under which the resulting mutilated DAGs are not Markov equivalent. For example, if we intervene on 1 in the causal model $1 \leftarrow 2$ the two variables become independent, but in $1 \to 2$ the model remains unchanged.
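The mutilation operation of Definition 7.2 is simple enough to write down directly. The following sketch is illustrative only: the function name and the (vertices, edges, facets) representation are assumptions, with the bidirected faces stored through their inclusion-maximal elements. It reproduces the two-vertex example above and then mutilates a small hypothetical mDAG (not one of the paper's figures).

```python
def mutilate(vertices, edges, facets, A):
    """Return the mDAG after intervening on A (Definition 7.2).

    Directed edges pointing into A are deleted, and the vertices of A are
    removed from every bidirected face; each vertex keeps its singleton face.
    """
    A = set(A)
    new_edges = {(w, v) for (w, v) in edges if v not in A}
    candidates = {frozenset(F) - A for F in facets}
    candidates |= {frozenset({v}) for v in vertices}
    candidates.discard(frozenset())
    new_facets = {F for F in candidates
                  if not any(F != G and F.issubset(G) for G in candidates)}
    return set(vertices), new_edges, new_facets


# The two observationally equivalent DAGs 1 -> 2 and 1 <- 2, intervening on 1.
forward = mutilate({1, 2}, {(1, 2)}, [], {1})
backward = mutilate({1, 2}, {(2, 1)}, [], {1})
print(forward[1])    # the edge 1 -> 2 survives
print(backward[1])   # no edges remain, so the two variables are independent

# A hypothetical mDAG with edge 1 -> 2 and facet {1, 2, 3}, intervening on 2.
_, e, f = mutilate({1, 2, 3}, {(1, 2)}, [{1, 2, 3}], {2})
print(e == set(), f == {frozenset({1, 3}), frozenset({2})})
```

In the last example the intervention on 2 removes the edge $1 \to 2$ and shrinks the facet $\{1, 2, 3\}$ to $\{1, 3\}$, leaving $\{2\}$ as a trivial face.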
We might hope that something similar holds for mDAGs: given distinct mDAGs $\mathcal{G}, \mathcal{H}$, is there always some intervention such that $\mathcal{M}_m(\mathcal{G}_A) \neq \mathcal{M}_m(\mathcal{H}_A)$, so that one could in principle distinguish between the two causal models via a suitable experiment? In fact this turns out not to be the case: consider the mDAGs in Figures 16(a) and (b); denote them by $\mathcal{G}$ and $\mathcal{H}$ respectively. Both represent saturated models, so in particular $\mathcal{M}_m(\mathcal{G}) = \mathcal{M}_m(\mathcal{H})$. In addition, after intervening on any of the vertices the resulting mutilated graphs are the same: $\mathcal{G}_A = \mathcal{H}_A$ for any $A \neq \emptyset$. Hence $\mathcal{M}_m(\mathcal{G}_A) = \mathcal{M}_m(\mathcal{H}_A)$ for any $A \subseteq \{1, 2, 3\}$.

Figure 16: Two mDAGs whose corresponding models are the same under any set of perfect node interventions.

The next result shows that two causal mDAGs can be distinguished by intervention if they have different underlying DAGs.

Proposition 7.5. Let $\mathcal{G}$ and $\mathcal{H}$ be mDAGs on the same vertex set $V$, and suppose that their underlying DAGs are distinct. Then for some $A \subseteq V$, $\mathcal{M}_m(\mathcal{G}_A) \neq \mathcal{M}_m(\mathcal{H}_A)$.

Proof. Suppose that the edge $v \to w$ appears in $\mathcal{G}$ but not $\mathcal{H}$. Then let $A = V \setminus \{w\}$: since non-trivial bidirected faces contain at least two vertices, $\mathcal{G}_A$ and $\mathcal{H}_A$ are DAGs. Therefore the only edges in $\mathcal{G}_A$ and $\mathcal{H}_A$ are those directed into $w$. It follows that $X_v \perp\!\!\!\perp X_w$ under any distribution in $\mathcal{M}_m(\mathcal{H}_A)$, whereas any form of dependence between $X_v$ and $X_w$ is possible in $\mathcal{M}_m(\mathcal{G}_A)$.

Remark 7.6. The inability to distinguish between certain causal mDAGs is partly an artefact of the sort of interventions we consider. If we allow more delicate interventions which can block a specific causal mechanism between any pair of variables, this would correspond to removing individual directed edges from the graph. In this case, by blocking all the direct causal links we
would obtain a distribution which satisfies the marginal Markov property for the underlying bidirected graphs. It would then follow from Proposition 6.8 that causal models would be in one-to-one correspondence with graphs.

8 Discussion

The class of mDAGs provides a natural framework to represent the margins of non-parametric Bayesian network models, and the structure of these models under interventions when interpreted causally. We have given a partial characterization of the Markov equivalence class of these models under the marginal Markov property, but a full result is still an open problem. As mentioned in Section 6, Markov equivalence for the nested Markov model is also open.

Fitting and testing models under the marginal Markov property is difficult because no explicit representation of the model is generally available, though the results in Section 6 give characterizations in special cases (see Example 6.4). The work of Bonet (2001) suggests that a general characterization may be infeasible because of the complexity of the inequality constraints. The nested model provides a useful surrogate because, at least in the discrete case, it is known to be smooth, has an explicit parameterization, and has the same dimension as the marginal model (Evans, 2015). Since $\mathcal{M}_n(\mathcal{G}) \supseteq \mathcal{M}_m(\mathcal{G})$, if the nested model is a bad fit then so is the marginal model. The converse is not true however, so we potentially lose power by ignoring inequality constraints. Evans (2012) gives a graphical method for deriving some inequality constraints, so these can in principle be tested after fitting a larger model. The approach of Richardson et al. (2011) gives a parameterization of the marginal model for the mDAG in Figure 10(a), incorporating inequality constraints; a general parameterization for such models is another open problem.
Alternatively it is possible to use a latent variable model $\mathcal{M}_l(\mathcal{G})$ as a second surrogate, knowing that $\mathcal{M}_l(\mathcal{G}) \subseteq \mathcal{M}_m(\mathcal{G})$. If the nested and latent variable models give similar fits (by some suitable criterion) then we effectively have a fit for the marginal model, which lies in between the two. Methods for fitting models under the marginal Markov property would enable powerful search procedures for distinguishing between different causal models with latent variables.

Acknowledgements

We thank Steffen Lauritzen for helpful discussions, and two anonymous referees for excellent suggestions, including the idea of using a simplicial complex to represent the bidirected structure.

References

E. S. Allman, J. A. Rhodes, B. Sturmfels, and P. Zwiernik. Tensors of nonnegative rank two. Linear Algebra and its Applications, 2013.

J. S. Bell. On the Einstein-Podolsky-Rosen paradox. Physics, 1(3):195–200, 1964.

C. M. Bishop. Pattern recognition and machine learning. Springer, 2007.

B. Bonet. Instrumentality tests revisited. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI-01), pages 48–55, 2001.

R. Chaves, L. Luft, T. O. Maciel, D. Gross, D. Janzing, and B. Schölkopf. Inferring latent structures via information inequalities. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI-14), pages 112–121, 2014.

N. N. Chentsov. Statistical Decision Rules and Optimal Inference. American Mathematical Society, 1982. Translated from Russian.

A. Darwiche. Modeling and reasoning with Bayesian networks. Cambridge University Press, 2009.

V. Didelez and N. Sheehan. Mendelian randomization as an instrumental variable approach to causal inference. Statistical Methods in Medical Research, 16(4):309–330, 2007.

M. Drton. Likelihood ratio tests and singularities. Annals of Statistics, pages 979–1012, 2009.

M. Drton, C. J. Fox, and A. Käufl. Comments on: Sequences of regressions and their independencies. TEST, 21(2):255–261, 2012.

R. J. Evans. Graphical methods for inequality constraints in marginalized DAGs. In Machine Learning for Signal Processing (MLSP), 2012.

R. J. Evans. Margins of discrete Bayesian networks. Preprint, arXiv:1501.02103, 2015.

R. J. Evans and T. S. Richardson. Maximum likelihood fitting of acyclic directed mixed graphs to binary data. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI-10), 2010.

R. J. Evans and T. S. Richardson. Markovian acyclic directed mixed graphs for discrete data. Annals of Statistics, 42(4):1452–1482, 2014.

C. J. Fox, A. Käufl, and M. Drton. On the causal interpretation of acyclic mixed graphs under multivariate normality. Linear Algebra and its Applications, 2014.

R. Foygel, J. Draisma, and M. Drton. Half-trek criterion for generic identifiability of linear structural equation models. Annals of Statistics, 40(3):1682–1713, 2012.

T. Fritz. Bell's Theorem without free will. arXiv preprint arXiv:1206.5115, 2012.

R. D. Gill. Statistics, causality and Bell's theorem. Statistical Science, 29(4):512–528, 2014.

D. Kleitman. On Dedekind's problem: the number of monotone Boolean functions. Proceedings of the American Mathematical Society, pages 677–682, 1969.

J. T. A. Koster. Marginalizing and conditioning in graphical models. Bernoulli, pages 817–840, 2002.

F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. Information Theory, IEEE Transactions on, 47(2):498–519, 2001.

S. L. Lauritzen. Graphical Models. Clarendon Press, Oxford, UK, 1996.

S. L. Lauritzen, A. P. Dawid, B. N. Larsen, and H. G. Leimer. Independence properties of directed Markov fields. Networks, 20(5), 1990.

J. Pearl. A constraint-propagation approach to probabilistic reasoning. In Proceedings of the First Annual Conference on Uncertainty in Artificial Intelligence (UAI-85), pages 31–42, Corvallis, Oregon, 1985. AUAI Press.

J. Pearl. On the testability of causal models with latent and instrumental variables. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI-95), pages 435–443, 1995.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, second edition, 2009.

J. Pearl and T. S. Verma. A statistical semantics for causation. Statistics and Computing, 2(2):91–95, 1992.

T. S. Richardson. Markov properties for acyclic directed mixed graphs. Scand. J. Statist., 30(1):145–157, 2003.

T. S. Richardson and P. Spirtes. Ancestral graph Markov models. Ann. Statist., 30:962–1030, 2002.

T. S. Richardson, R. J. Evans, and J. M. Robins. Transparent parameterizations of models for potential outcomes. Bayesian Statistics, 9:569–610, 2011.

J. M. Robins. A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9):1393–1512, 1986.

V. A. Rokhlin. On the fundamental ideas of measure theory. Number 71 in Translations. American Mathematical Society, 1952. Translated from the Russian: Matematičeskiĭ Sbornik (N.S.) 25(67), 107–150 (1949).

D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.

I. Shpitser, R. J. Evans, T. S. Richardson, and J. M. Robins. Sparse nested Markov models with log-linear parameters. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI-13), pages 576–585, 2013.

I. Shpitser, R. J. Evans, T. S. Richardson, and J. M. Robins. Introduction to nested Markov models. Behaviormetrika, 41(1):3–39, 2014.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. MIT Press, 2000.

J. Tian and J. Pearl. On the testable implications of causal models with hidden variables. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI-02), pages 519–527, 2002.

T. S. Verma. Invariant properties of causal models. Technical Report R-134, UCLA Cognitive Systems Laboratory, 1991.

T. S. Verma and J. Pearl. Equivalence and synthesis of causal models. In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence (UAI-90), pages 255–270, 1990.

N. Wermuth. Probability distributions with summary graph structure. Bernoulli, 17(3):845–879, 2011.

A Technical Proofs

A.1 Proof of Theorem 4.10

Lemma A.1. Let $\mathcal{G}(V \,\dot\cup\, U_1 \,\dot\cup\, U_2, \mathcal{E}_{\mathcal{G}}, \mathcal{B}_{\mathcal{G}})$ be an mDAG, and $\mathcal{H}(V \,\dot\cup\, U_1, \mathcal{E}_{\mathcal{H}}, \mathcal{B}_{\mathcal{H}})$ the latent projection of $\mathcal{G}$ over $V \,\dot\cup\, U_1$. Then

(a) for $a, b \in V$, there is a directed path from $a$ to $b$ in $\mathcal{G}$ with non-endpoint vertices in $U_1 \,\dot\cup\, U_2$ if and only if there is such a path in $\mathcal{H}$ with non-endpoint vertices in $U_1$;

(b) there is a hidden common cause for $B \subseteq V$ in $\mathcal{G}$ with respect to $U_1 \,\dot\cup\, U_2$ if and only if there is a hidden common cause for $B$ in $\mathcal{H}$ with respect to $U_1$.

Proof. (a): Suppose there is a directed path from $a$ to $b$ in $\mathcal{G}$ with non-endpoint vertices in $U_1 \cup U_2$. If any non-endpoint vertices on the path are also in $U_1$, then the problem reduces to showing the existence of two shorter paths (acyclicity means we can always concatenate directed paths and still obtain a path).
On the other hand if all non-endpoint vertices are in $U_2$ then there is an edge $a \to b$ in $\mathcal{H}$. Conversely if there is a directed path in $\mathcal{H}$ with intermediate vertices in $U_1$ then each edge $c \to d$ in that path represents a directed path from $c$ to $d$ in $\mathcal{G}$ with intermediate vertices in $U_2$.

(b): Let $B \subseteq V$ have a hidden common cause in $\mathcal{G}$ with respect to $U_1 \cup U_2$; for each $b \in B$ there is a directed path $\pi_b$ to $b$ with all other vertices in $U_1 \cup U_2$ as described in the definition of a hidden common cause. Let $u_b$ be the first vertex on $\pi_b$ which is not in $U_2$ (certainly $b \notin U_2$, so this is well defined). Then the vertices $A = \{u_b : b \in B\}$ share a hidden common cause with respect to $U_2$, and hence $A \in \mathcal{B}_{\mathcal{H}}$. But for each $b \in B$, there is a directed path in $\mathcal{G}$ from $u_b$ to $b$ with non-endpoints in $U_1 \cup U_2$, and hence by (a) there is a directed path in $\mathcal{H}$ from $u_b$ to $b$ with non-endpoints in $U_1$; hence the vertices in $B$ share a hidden common cause with respect to $U_1$ in $\mathcal{H}$.

Conversely, suppose the elements of $B$ share a hidden common cause $A \in \mathcal{B}_{\mathcal{H}}$ with respect to $U_1$ in $\mathcal{H}$. By the definition of latent projection, the vertices in $A$ must share a hidden common cause $C$ with respect to $U_2$ in $\mathcal{G}$. It follows by concatenating the paths from $C$ to $A$, and from $A$ to $B$, that the vertices in $B$ share the hidden common cause $C$ with respect to $U_1 \cup U_2$ in $\mathcal{G}$.

Proof of Theorem 4.10. It is sufficient to prove the first equality: let $\mathcal{H} = p(\mathcal{G}, V \cup U_1)$. Let $a, b \in V$; by Lemma A.1, there is a directed path from $a$ to $b$ in $\mathcal{G}$ with all non-endpoint vertices in $U_1 \cup U_2$ if and only if there is such a path in $\mathcal{H}$ with all non-endpoint vertices in $U_1$. Hence the directed edges in $p(\mathcal{G}, V)$ and $p(\mathcal{H}, V)$ are the same. Also by Lemma A.1, for any set $B \subseteq V$, there is a hidden common cause in $\mathcal{G}$ for $B$ with respect to $U_1 \cup U_2$, if and only if there is one in $\mathcal{H}$ for $B$ with respect to $U_1$. Hence the bidirected faces in $p(\mathcal{G}, V)$ and $p(\mathcal{H}, V)$ are also the same.

A.2 Measure Theoretic Results

Let $X$ be a square integrable random variable, and $\mathcal{F}$ a $\sigma$-algebra. Say that $X$ is $(\epsilon, \mathcal{F})$-measurable if
\[
E\left(X - E[X \mid \mathcal{F}]\right)^2 \leq \epsilon.
\]
Let $\mathcal{F}_{-i} \equiv \mathcal{F}_1 \vee \cdots \vee \mathcal{F}_{i-1} \vee \mathcal{F}_{i+1} \vee \cdots \vee \mathcal{F}_k$.

Lemma A.2. Let $X_i$ be $(\epsilon, \mathcal{F}_{-i})$-measurable for $i = 1, \ldots, k$, where $\mathcal{F}_j$ are independent $\sigma$-algebrae. Then $E(X_i - X_j)^2 \leq \epsilon$ for all $i, j$ implies that $X_i$ is $(2\epsilon, \mathcal{F}_{-i,j})$-measurable for $i \neq j$. In addition, $\operatorname{Var} X_i \leq k\epsilon$.

Proof. Since $X_i, \mathcal{F}_{-i} \perp\!\!\!\perp \mathcal{F}_i$,
\[
\begin{aligned}
E\left(X_i - E[X_i \mid \mathcal{F}_{-i,j}]\right)^2 &= E\left(X_i - E[X_i \mid \mathcal{F}_{-j}]\right)^2 \\
&\leq E\left(X_i - E[X_j \mid \mathcal{F}_{-j}]\right)^2 \\
&\leq E(X_i - X_j)^2 + E\left(X_j - E[X_j \mid \mathcal{F}_{-j}]\right)^2 \\
&\leq 2\epsilon,
\end{aligned}
\]
so $X_i$ is $(2\epsilon, \mathcal{F}_{-i,j})$-measurable. Repeating this proof shows that $X_i$ is $(k\epsilon, \emptyset)$-measurable, which is to say that its variance is at most $k\epsilon$.

Lemma A.3. Let $X$ be a $\sigma(Y, Z)$-measurable random variable, and $(X, Y, Z)$ have joint distribution $P$. Then there exist random variables $U, W$ such that:

(i) $U \perp\!\!\!\perp W$;

(ii) $X$ is $\sigma(Y, U)$-measurable;

(iii) $Z$ is $\sigma(W, X, Y)$-measurable;

(iv) $(X, Y, Z)$ has the appropriate joint distribution $P$.

Proof. Using the fact that our probability space is Lebesgue-Rokhlin, there exists a measurable function $g$ such that if $U$ is a uniform random variable independent of $Y$ then $(X, Y) \equiv (g(Y, U), Y)$ has the correct marginal distribution (Chentsov, 1982, Theorem 2.2). Similarly, let $W$ be a uniform random variable independent of $U, Y$ (and therefore $X$), and let $h$ be a measurable function such that $(X, Y, Z) \equiv (X, Y, h(X, Y, W))$ has the same distribution as $(X, Y, Z)$. By construction, (i)-(iv) are satisfied.

Lemma A.4. Let $\mathcal{G}$ be an mDAG containing a bidirected facet $B = C \,\dot\cup\, D$ such that: for any $c \in C$, any bidirected edge containing $c$ is a subset of $B$; and $\operatorname{pa}_{\mathcal{G}}(d) \supseteq \operatorname{pa}_{\mathcal{G}}(C)$ for each $d \in D$. Take $P \in \mathcal{M}_m(\mathcal{G})$. Then there exists $Q \in \mathcal{M}(\bar{\mathcal{G}})$ such that under $Q$ we have $Y_B = (Y_C, Y_D)$, where:

(i) $Y_C \perp\!\!\!\perp Y_D$;

(ii) each $X_c$ is $\sigma(X_{\operatorname{pa}_{\mathcal{G}}(c)}, Y_C)$-measurable;

(iii) each $X_d$ is $\sigma(X_C, X_{\operatorname{pa}_{\mathcal{G}}(C)}, X_{\operatorname{pa}_{\mathcal{G}}(d)}, Y_{\mathcal{B}(d) \setminus B}, Y_D)$-measurable;

(iv) the $V$-margin of $Q$ is $P$.

Proof. This is just an application of Lemma A.3 with $X = X_C$, $Y = X_{\operatorname{pa}_{\mathcal{G}}(C)}$, $Z = X_D$, and some extra variables $X_{\operatorname{pa}_{\mathcal{G}}(d)}$, $Y_{\mathcal{B}(d) \setminus B}$ on which $Z$ can depend (but this extension is trivial).

In other words, the result says that we can decompose $Y_B$ into two independent pieces, one of which determines the value of $X_C$ (once its parents are known) and contains no further information, in the sense that it is irrelevant once $X_C$ and $X_{\operatorname{pa}(C)}$ are known.
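
The construction in the proof of Lemma A.3 can be made concrete in the finite discrete case, where $g$ and $h$ may be taken to be conditional inverse-CDF maps. The following Python sketch is an illustration under stated assumptions and is not part of the argument above: the pmf `JOINT` (chosen so that $X = 1\{Z \geq 1\}$, making $X$ a function of $(Y, Z)$) and the helper names `conditional`, `inverse_cdf`, `build_maps` and `simulate` are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical joint pmf over outcomes (x, y, z), chosen so that
# X = 1{Z >= 1}; X is then a function of (Y, Z), as Lemma A.3 assumes.
JOINT = {
    (0, 0, 0): 0.2, (1, 0, 1): 0.1, (1, 0, 2): 0.2,
    (0, 1, 0): 0.1, (1, 1, 1): 0.3, (1, 1, 2): 0.1,
}


def conditional(joint, given_idx, target_idx):
    """Return {given values: [(target value, conditional prob), ...]}."""
    totals = defaultdict(float)
    cells = defaultdict(lambda: defaultdict(float))
    for outcome, p in joint.items():
        given = tuple(outcome[i] for i in given_idx)
        totals[given] += p
        cells[given][outcome[target_idx]] += p
    return {g: [(t, p / totals[g]) for t, p in sorted(c.items())]
            for g, c in cells.items()}


def inverse_cdf(table, u):
    """Send a uniform draw u in [0, 1) to a value through the table's CDF."""
    acc = 0.0
    for value, p in table:
        acc += p
        if u < acc:
            return value
    return table[-1][0]  # guard against floating-point round-off


def build_maps(joint):
    """Build g(y, u) and h(x, y, w) as conditional inverse-CDF maps."""
    x_given_y = conditional(joint, (1,), 0)
    z_given_xy = conditional(joint, (0, 1), 2)

    def g(y, u):
        return inverse_cdf(x_given_y[(y,)], u)

    def h(x, y, w):
        return inverse_cdf(z_given_xy[(x, y)], w)

    return g, h


def simulate(joint, n, seed=0):
    """Draw Y, then independent uniforms U and W; return frequencies."""
    rng = random.Random(seed)
    g, h = build_maps(joint)
    y_marginal = conditional(joint, (), 1)[()]
    counts = defaultdict(int)
    for _ in range(n):
        y = inverse_cdf(y_marginal, rng.random())
        u, w = rng.random(), rng.random()  # U and W drawn independently
        x = g(y, u)                        # X depends only on (Y, U)
        z = h(x, y, w)                     # Z depends only on (W, X, Y)
        counts[(x, y, z)] += 1
    return {k: c / n for k, c in counts.items()}


if __name__ == "__main__":
    freq = simulate(JOINT, n=100_000)
    for outcome in sorted(JOINT):
        print(outcome, JOINT[outcome], round(freq.get(outcome, 0.0), 3))
```

Because `x` is computed from `(y, u)` alone and `z` from `(x, y, w)` alone, properties (i)-(iii) hold by construction in this sketch; comparing the simulated frequencies with `JOINT` gives an empirical check of (iv).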
