Joint Causal Inference from Multiple Contexts


Authors: Joris M. Mooij, Sara Magliacane, Tom Claassen

Journal of Machine Learning Research 21 (2020) 1-108. Submitted 3/17; Revised 1/20; Published 3/20.

Joris M. Mooij* (j.m.mooij@uva.nl), Korteweg-De Vries Institute, University of Amsterdam, Postbox 94248, 1090 GE Amsterdam, The Netherlands
Sara Magliacane (sara.magliacane@ibm.com), MIT-IBM Watson AI Lab, IBM Research, 75 Binney St, Cambridge, MA 02142, USA
Tom Claassen (tomc@cs.ru.nl), Institute for Computing and Information Sciences, Radboud University Nijmegen, Postbox 9010, 6500 GL Nijmegen, The Netherlands

Editor: Peter Spirtes

Abstract

The gold standard for discovering causal relations is by means of experimentation. Over the last decades, alternative methods have been proposed that can infer causal relations between variables from certain statistical patterns in purely observational data. We introduce Joint Causal Inference (JCI), a novel approach to causal discovery from multiple data sets from different contexts that elegantly unifies both approaches. JCI is a causal modeling framework rather than a specific algorithm, and it can be implemented using any causal discovery algorithm that can take into account certain background knowledge. JCI can deal with different types of interventions (e.g., perfect, imperfect, stochastic) in a unified fashion, and does not require knowledge of intervention targets or types in the case of interventional data. We explain how several well-known causal discovery algorithms can be seen as addressing special cases of the JCI framework, and we also propose novel implementations that extend existing causal discovery methods for purely observational data to the JCI setting. We evaluate different JCI implementations on synthetic data and on flow cytometry protein expression data and conclude that JCI implementations can considerably outperform state-of-the-art causal discovery algorithms.
Keywords: causal discovery, causal modeling, causal inference, observational and experimental data, interventions, randomized controlled trials

1. Introduction

The aim of causal discovery is to learn the causal relations between variables of a system of interest from data. As a simple example, suppose a researcher wants to find out whether playing violent computer games causes aggressive behavior. She gathers observational data by taking a sample from pupils at several high schools in different countries and observes a significant correlation between the daily amount of hours spent on playing violent computer games and aggressive behavior at school (see also Figure 1). This in itself does not yet imply a causal relation between the two in either direction. Indeed, an alternative explanation of the observed correlation could be the presence of a confounder (a latent common cause), for example, a genetic predisposition towards violence that makes the carrier particularly enjoy such games and also behave more aggressively. The most reliable way to establish whether playing violent computer games causes aggressive behavior is by means of experimentation, for example by a randomized controlled trial (Fisher, 1935). This would imply randomly assigning each pupil to one of two groups, where the pupils in one group are forced to play violent computer games for several hours a day, while the pupils in the other group are forced to abstain from playing those games.

* Part of this work was done while the authors were with the Informatics Institute of the University of Amsterdam.

(c) 2020 Joris M. Mooij, Sara Magliacane and Tom Claassen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/17-123.html.
After several months, the aggressive behavior in both groups is measured. If a significant correlation between group and outcome is observed (or, equivalently, the outcome is significantly different between the two groups), it can then be concluded that playing violent computer games indeed causes aggressive behavior.

Given the ethical and practical problems that such an experiment would involve, one might wonder whether there are alternative ways to answer this question. One such alternative is to combine data from different contexts. For example, in some countries the government may have decided to forbid certain ultra-violent games from being sold. In addition, some schools may have introduced certain measures to discourage aggressive behavior. By combining the data from these different contexts in an appropriate way, one may be able to identify the presence or absence of a causal effect of playing violent computer games on aggressive behavior. For example, in the setting of Figure 1(c), the causal relationship between the two variables of interest turns out to be identifiable from conditional independence relationships in pooled data from all the contexts. In particular, in that case the observed correlation between playing violent computer games and aggressive behavior could be unambiguously attributed to a causal effect of one on the other, just from combining multiple readily available data sets, without the need for an impractical experiment.[1] In this paper, we propose a simple and general way to combine and analyze data sets from different contexts that enables one to draw such strong causal conclusions.

While experimentation is still the gold standard for establishing causal relationships, researchers realized in the early nineties that there are other methods that require only purely observational data (Spirtes et al., 2000; Pearl, 2009).
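The logic of the randomized trial described above can be illustrated with a small simulation. This is a hypothetical sketch with assumed coefficients, not an analysis from the paper: a latent confounder creates an observational correlation between gaming and aggression, yet randomized assignment is independent of the confounder, so the trial correctly finds (approximately) no effect when the true causal effect is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent confounder: a predisposition that raises both gaming hours and
# aggression in the observational regime (coefficients are assumptions).
u = rng.normal(size=n)

# Observational data: X (gaming) and Y (aggression) share the cause u,
# but X has no causal effect on Y here.
x_obs = u + rng.normal(size=n)
y_obs = u + rng.normal(size=n)

# Randomized trial: group assignment ignores u, so any group/outcome
# association would have to be causal; here the true effect is zero.
group = rng.integers(0, 2, size=n)
y_rct = u + rng.normal(size=n)      # outcome unaffected by group

corr_obs = np.corrcoef(x_obs, y_obs)[0, 1]   # spurious, about 0.5
diff_rct = y_rct[group == 1].mean() - y_rct[group == 0].mean()  # about 0
```

The spurious observational correlation is large, while the randomized group difference is indistinguishable from zero, which is exactly the asymmetry the text exploits.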
Many methods for causal discovery from purely observational data have been proposed over the last decades, relying on different assumptions. These can be roughly divided into constraint-based causal discovery methods, such as the PC (Spirtes et al., 2000), IC (Pearl, 2009) and FCI algorithms (Spirtes et al., 1999; Zhang, 2008a), score-based causal discovery methods (e.g., Cooper and Herskovits, 1992; Heckerman et al., 1995; Chickering, 2002; Koivisto and Sood, 2004), and methods exploiting other statistical patterns in the joint distribution (e.g., Mooij et al., 2016; Peters et al., 2017). Originally, these methods were designed to estimate the causal graph of the system from a single data set corresponding to a single (purely observational) context.

More recently, various causal discovery methods have been proposed that extend these techniques to deal with multiple data sets from different contexts. As an example, the data sets may correspond with a baseline of purely observational data consisting of measurements concerning the "natural" state of the system, and data consisting of measurements under different perturbations of the system due to external interventions on the system.[2] More generally, they can correspond to measurements of the system in different environments. These methods can be divided into two main approaches: (a) methods that obtain statistics or constraints from each context separately and then construct a single context-independent causal graph by combining these statistics, but never directly compare data from different contexts (Claassen and Heskes, 2010; Tillman and Spirtes, 2011; Hyttinen et al., 2012, 2014; Triantafillou and Tsamardinos, 2015; Rothenhäusler et al., 2015; Forré and Mooij, 2018); (b) methods that pool all data and construct a single context-independent causal graph directly from the pooled data (Cooper, 1997; Cooper and Yoo, 1999; Tian and Pearl, 2001; Sachs et al., 2005; Eaton and Murphy, 2007; Chen et al., 2007; Hauser and Bühlmann, 2012; Mooij and Heskes, 2013; Peters et al., 2016; Oates et al., 2016a; Zhang et al., 2017). In this paper, we propose Joint Causal Inference (JCI), a framework for causal modeling of a system in different contexts and for causal discovery from multiple data sets consisting of measurements obtained in different contexts, which takes the latter approach.

Figure 1: Different causal graphs relating X1, the daily amount of hours spent on playing violent computer games, and X2, a measure of aggressive behavior. (a) Playing violent computer games causes aggressive behavior; (b) the observed correlation between X1 and X2 is explained by a latent confounder, e.g., a genetic predisposition towards violence; (c) hypothetical causal graph also involving the context variables C_alpha, which indicates whether ultra-violent games have been banned by the government, and C_beta, which represents school interventions to stimulate social behavior. Without considering contexts, it is not possible to distinguish between (a) and (b) based on conditional independences in the data. In scenario (c), JCI allows one to infer from conditional independences in the pooled data that X1 causes X2 and that X1 and X2 are not confounded (assuming that the context variables C_alpha and C_beta are not caused by the system variables X1 and X2).

[1] One can show that the conditional dependence C_alpha \not\perp\!\!\!\perp X2 | C_beta and the conditional independence C_alpha \perp\!\!\!\perp X2 | {X1, C_beta} in the pooled data that are entailed by the causal graph, together with the assumption that neither C_alpha nor C_beta is caused by X1 or X2, suffice to arrive at this conclusion.
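The reasoning of footnote 1 can be checked numerically on a linear-Gaussian instance of the Figure 1(c) scenario. This is an illustrative sketch with assumed coefficients: partial correlations estimate the conditional dependence of C_alpha and X2 given C_beta, and the conditional independence of C_alpha and X2 given {X1, C_beta}.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical linear-Gaussian version of Figure 1(c); all coefficients
# are assumptions made for illustration.
c_alpha = rng.integers(0, 2, size=n).astype(float)  # government game ban
c_beta = rng.integers(0, 2, size=n).astype(float)   # school measures
x1 = -c_alpha + rng.normal(size=n)                  # gaming hours
x2 = x1 - c_beta + rng.normal(size=n)               # aggression (X1 -> X2)

def partial_corr(a, b, conds):
    """Correlation of a and b after regressing both on the conditioning set."""
    z = np.column_stack([np.ones_like(a)] + conds)
    a_res = a - z @ np.linalg.lstsq(z, a, rcond=None)[0]
    b_res = b - z @ np.linalg.lstsq(z, b, rcond=None)[0]
    return np.corrcoef(a_res, b_res)[0, 1]

dep = partial_corr(c_alpha, x2, [c_beta])         # clearly nonzero: dependent
indep = partial_corr(c_alpha, x2, [x1, c_beta])   # approximately zero
```

Together with the background knowledge that neither context variable is caused by a system variable, these two test results identify X1 as a cause of X2 in this model, matching the footnote's argument.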
As will be discussed in more detail in Section 4.3, JCI is the most generally applicable of those approaches (for example, it allows for the presence of latent confounders and cyclic causal relationships) and also offers the most flexibility in terms of its implementation. While the ingredients of the JCI framework are not novel, the added value of the framework is that on the one hand it arrives at a unifying description of a diverse spectrum of existing approaches, while on the other hand it serves to inspire new implementations, such as the adaptations of FCI that we propose in this work. Technically, this is achieved by formulating the problem in terms of a (standard) Structural Causal Model that considers system and environment as subsystems of one joint system, rather than other types of representations in which the system is modeled conditionally on its environment (Dawid, 2002; Bareinboim and Pearl, 2013; Oates et al., 2016a; Yang et al., 2018; Forré and Mooij, 2019). This allows us to apply the standard notion of statistical independence in the same way as is commonly done in the purely observational setting. As we observed in our experiments (reported in Section 5), the novel algorithms proposed in this work compare favorably with the state of the art in causal discovery on synthetic data in many settings.

[2] In certain parts of the causal discovery literature, the word "intervention" has become synonymous with "perfect intervention" (i.e., an intervention that precisely sets a variable or set of variables to a certain value without directly affecting any other variables in the system), but in this work we use it in the more general meaning of any external perturbation of the system.
The key idea of JCI is to (i) consider auxiliary context variables that describe the context of each data set, (ii) pool all the data from the different contexts, including the values of the context variables, into a single data set, and finally (iii) apply standard causal discovery methods to the pooled data, incorporating appropriate background knowledge on the causal relationships involving the context variables. The framework is simple and very generally applicable, as it allows one to deal with latent confounding and cycles (if the causal discovery method supports this) and various types of interventions in a unified way. It does not require background knowledge on the intervention types and targets, making it very suitable for application to complex systems in which the effects of certain interventions are not known a priori, a situation that often occurs in practice. On the other hand, if such background knowledge is available, it can be exploited.

JCI can be implemented using any causal discovery method that can incorporate the appropriate background knowledge on the relationships between context and system variables. This allows one to benefit from the availability of sophisticated and powerful causal discovery methods that have been primarily designed for a single data set from a single context by extending their application domain to the setting of multiple data sets from multiple contexts. For example, we will show in this work how FCI (Spirtes et al., 1999; Zhang, 2008a) can easily be adapted to the JCI setting. At the same time, JCI accommodates various well-known causal discovery methods as special cases, such as the standard randomized controlled trial setting (Fisher, 1935), Local Causal Discovery (LCD) (Cooper, 1997) and Invariant Causal Prediction (ICP) (Peters et al., 2016).
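Steps (i) and (ii) of the key idea above are purely mechanical and can be sketched in a few lines. The data sets, context names, and offsets below are hypothetical placeholders; the point is only the shape of the pooled data that step (iii) would consume.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three hypothetical data sets over system variables (X1, X2), measured in
# an observational baseline and two (here unspecified) intervention contexts.
datasets = {
    "baseline": rng.normal(size=(100, 2)),
    "context_A": rng.normal(size=(80, 2)) + [2.0, 0.0],
    "context_B": rng.normal(size=(120, 2)) + [0.0, -1.0],
}

# Steps (i) and (ii): add one binary context variable per non-baseline
# context and pool everything into one matrix with columns [C1, C2, X1, X2].
context_names = ["context_A", "context_B"]
blocks = []
for name, data in datasets.items():
    context_cols = np.tile(
        [float(name == c) for c in context_names], (len(data), 1))
    blocks.append(np.hstack([context_cols, data]))
pooled = np.vstack(blocks)

# Step (iii) would run a causal discovery method on `pooled`, with the
# background knowledge that no system variable causes a context variable
# (e.g., encoded as forbidden X -> C edges).
```

How the context variables are encoded (one indicator per context, a single categorical variable, etc.) is a modeling choice discussed later in the paper; the indicator encoding here is just one simple option.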
By explicitly introducing the context variables and treating them analogously to the system variables (but with additional background knowledge about their causal relations with the system variables), JCI makes it possible to elegantly combine the principles of causal discovery from experimentation with those of causal discovery from purely observational data, achieving a causal discovery framework that is more powerful than either of the two separately.

This paper is structured as follows. In Section 2 we describe the relevant causal modeling and discovery concepts and define terminology and notation. In Section 3 we introduce the JCI framework and modeling assumptions. In Section 4, we show how JCI can be implemented using various causal discovery methods, and compare it with related work. In Section 5 we report experimental results on synthetic and flow cytometry data. We conclude in Section 6 with some promising directions for future developments.

2. Background

In this section, we present the background material on which we will base our exposition. We start in Section 2.1 with a brief subsection stating the basic definitions and results in the field of graphical causal modeling that we will use in this paper. In addition to covering material that is standard in the field, we review more recent extensions to the cyclic setting (Bongers et al., 2020). Because the cyclic setting is quite similar to the acyclic one that is mostly considered in the literature, we decided to present both cases in parallel rather than first explaining the acyclic setting and then explaining how everything generalizes to the cyclic setting.[3] In Section 2.2, we discuss the key idea of causal discovery from experimentation (in the setting of a randomized controlled trial, or A/B-testing) in these terms.
We finish with Section 2.3, which briefly illustrates the basic idea underlying constraint-based causal discovery from purely observational data in a simple setting.

2.1 Graphical Causal Modeling

We briefly summarize some basic definitions and results in the field of graphical causal modeling. For more details, we refer the reader to Pearl (2009) and Bongers et al. (2020).

2.1.1 Directed Mixed Graphs

A Directed Mixed Graph (DMG) is a graph G = <V, E, F> with nodes V and two types of edges: directed edges E ⊆ V^2, and bidirected edges F ⊆ {{i, j} : i, j ∈ V, i ≠ j}. We will denote a directed edge (i, j) ∈ E as i → j or j ← i, and call i a parent of j and j a child of i. We denote all parents of j in the graph G as pa_G(j) := {i ∈ V : i → j ∈ E}, and all children of i in G as ch_G(i) := {j ∈ V : i → j ∈ E}. We allow for self-cycles i → i, so a variable can be its own parent and child. We will denote a bidirected edge {i, j} ∈ F as i ↔ j or j ↔ i, and call i and j spouses. Two nodes i, j ∈ V are called adjacent in G if they are connected by an edge (or multiple edges), i.e., if i → j ∈ E or i ← j ∈ E or i ↔ j ∈ F. For a subset of nodes W ⊆ V, we define the induced subgraph G_W := (W, E ∩ W^2, F ∩ {{i, j} : i, j ∈ W, i ≠ j}), i.e., with nodes W and exactly those edges of G that connect nodes in W.

A walk between i, j ∈ V is a tuple <i_0, e_1, i_1, e_2, i_2, ..., e_n, i_n> of alternating nodes and edges in G (n ≥ 0), such that all i_0, ..., i_n ∈ V, all e_1, ..., e_n ∈ E ∪ F, starting with node i_0 = i and ending with node i_n = j, and such that for all k = 1, ..., n, the edge e_k connects the two nodes i_{k-1} and i_k in G. If the walk contains each node at most once, it is called a path. A trivial walk (path) consists of just a single node and zero edges.
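The graph-theoretic definitions in this subsection translate directly into a small data structure. The following is an illustrative sketch, not code from the paper; it implements parents, children, adjacency, and the ancestor relation an_G (obtained by repeatedly taking parents, as defined in the text that follows).

```python
# Minimal data structure for a directed mixed graph, following the
# definitions above (an illustrative sketch, not code from the paper).
class DMG:
    def __init__(self, nodes, directed, bidirected):
        self.nodes = set(nodes)
        self.directed = set(directed)                 # i -> j stored as (i, j)
        self.bidirected = {frozenset(e) for e in bidirected}  # i <-> j

    def parents(self, j):                             # pa_G(j)
        return {i for (i, k) in self.directed if k == j}

    def children(self, i):                            # ch_G(i)
        return {j for (k, j) in self.directed if k == i}

    def adjacent(self, i, j):
        return ((i, j) in self.directed or (j, i) in self.directed
                or frozenset((i, j)) in self.bidirected)

    def ancestors(self, j):
        # an_G(j): repeatedly take parents; each node is its own ancestor
        result, frontier = {j}, {j}
        while frontier:
            frontier = {p for n in frontier for p in self.parents(n)} - result
            result |= frontier
        return result

# Example: 1 -> 2 -> 3 with a bidirected edge 1 <-> 3 and an isolated node 4.
g = DMG(nodes={1, 2, 3, 4}, directed={(1, 2), (2, 3)}, bidirected={(1, 3)})
```

Note that bidirected edges are stored as unordered pairs (frozensets), matching the definition F ⊆ {{i, j} : i, j ∈ V, i ≠ j}.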
A directed walk (path) from i ∈ V to j ∈ V is a walk (path) between i and j such that every edge e_k on the walk (path) is of the form i_{k-1} → i_k, i.e., every edge is directed and points away from i. By repeatedly taking parents, we obtain the ancestors of j: an_G(j) := {i ∈ V : i = i_0 → i_1 → i_2 → ... → i_n = j in G}. Similarly, we define the descendants of i: de_G(i) := {j ∈ V : i = i_0 → i_1 → i_2 → ... → i_n = j in G}. In particular, each node is an ancestor and a descendant of itself. A directed cycle is a directed path from i to j such that, in addition, j → i ∈ E. An almost directed cycle is a directed path from i to j such that, in addition, j ↔ i ∈ F. All nodes on directed cycles passing through i ∈ V together form the strongly-connected component sc_G(i) := an_G(i) ∩ de_G(i) of i. We extend the definitions to sets I ⊆ V by setting an_G(I) := ∪_{i ∈ I} an_G(i), and similarly for de_G(I) and sc_G(I).

A directed mixed graph G is acyclic if it does not contain any directed cycle, in which case it is known as an Acyclic Directed Mixed Graph (ADMG). A directed mixed graph that does not contain bidirected edges is known as a Directed Graph (DG). If a directed mixed graph does not contain bidirected edges and is acyclic, it is called a Directed Acyclic Graph (DAG).

A node i_k on a walk (path) π = <i_0, e_1, i_1, e_2, i_2, ..., e_n, i_n> in G is said to form a collider on π if it is a non-endpoint node (1 ≤ k < n) and the two edges e_k, e_{k+1} meet head-to-head on their shared node i_k (i.e., if the two subsequent edges are of the form i_{k-1} → i_k ← i_{k+1}, i_{k-1} ↔ i_k ← i_{k+1}, i_{k-1} → i_k ↔ i_{k+1}, or i_{k-1} ↔ i_k ↔ i_{k+1}). Otherwise (that is, if it is an endpoint node, i.e., k = 0 or k = n, or if the two subsequent edges are of the form i_{k-1} → i_k → i_{k+1}, i_{k-1} ← i_k ← i_{k+1}, i_{k-1} ← i_k → i_{k+1}, i_{k-1} ↔ i_k → i_{k+1}, or i_{k-1} ← i_k ↔ i_{k+1}), i_k is called a non-collider on π. We will denote the colliders on a walk π as col(π) and the non-colliders on π (including the endpoints of π) as ncol(π). A triple of nodes <i, j, k> in G is called an unshielded triple if i is adjacent to j, j is adjacent to k, and i is not adjacent to k in G.

[3] The disadvantage is that our notation and definitions deviate somewhat from those commonly used in the acyclic causal discovery literature. Therefore, we recommend reading this section also to those readers who are already familiar with the theory of acyclic structural causal models.

2.1.2 Structural Causal Models

Directed Mixed Graphs form a convenient graphical representation for variables (labelled by the nodes) and their functional relations (expressed by the edges) in a Structural Causal Model (SCM) (Pearl, 2009), also known as a (non-parametric) Structural Equation Model (SEM) (Wright, 1921). Several slightly different definitions of SCMs have been proposed in the literature, which all have their (dis)advantages. Here we use a variant of the definition in Bongers et al. (2020) that is most convenient for our purposes. The reason we use SCMs to formulate JCI (rather than, for example, the more well-known causal Bayesian networks) is that SCMs are expressive enough to model both latent common causes and cyclic causal relationships.
Definition 1 A Structural Causal Model (SCM) is a tuple M = <I, J, H, X, E, f, P_E> of:
(i) a finite index set I for the endogenous variables in the model;
(ii) a finite index set J for the latent exogenous variables in the model (disjoint from I);
(iii) a directed graph H with nodes I ∪ J, and directed edges pointing from I ∪ J to I;
(iv) a product of Borel[4] spaces X = ∏_{i ∈ I} X_i, which define the domains of the endogenous variables;
(v) a product of Borel spaces E = ∏_{j ∈ J} E_j, which define the domains of the exogenous variables;
(vi) a product probability measure P_E = ∏_{j ∈ J} P_{E_j} on E specifying the exogenous distribution;
(vii) a measurable function f : X × E → X, the causal mechanism, such that each of its components f_i only depends on a particular subset of the variables, as specified by the directed graph H: f_i : X_{pa_H(i) ∩ I} × E_{pa_H(i) ∩ J} → X_i, for i ∈ I.

[4] A Borel space is both a measurable and a topological space, such that the sigma-algebra is generated by the open sets. Most spaces that one encounters in applications as the domain of a random variable are (isomorphic to) Borel spaces.

Figure 2: Relationships between various representations of simple SCMs (the figure relates: simple SCM, intervened SCM, marginal SCM, augmented graph, graph, d-/σ-separations, latent confounders, direct causes, causal relations, observational distribution, interventional distribution, (conditional) independences, Markov property, and faithfulness). Directed edges represent mappings. Intervened and marginal SCMs are always defined and are also simple.

In discussing the concepts and properties of SCMs, the graphical representation of the various objects and their relations in Figure 2 may be helpful. It shows how the SCM is the basic object containing all information, and how other representations can be derived from the SCM.
In the rest of this section, we will discuss this in more detail. We refer to the graph H in Definition 1(iii) as the augmented graph of M. In contrast, the graph of M, denoted G(M), is the directed mixed graph with nodes I, directed edges i_1 → i_2 iff i_1 → i_2 ∈ H, and bidirected edges i_1 ↔ i_2 iff there exists j ∈ pa_H(i_1) ∩ pa_H(i_2) ∩ J.[5] While the augmented graph H shows in detail the functional dependence of endogenous variables on the (independent) exogenous variables, the graph G(M) provides an abstraction by not including the exogenous variables explicitly, but using bidirected edges to represent any shared dependence of pairs of endogenous variables on a common exogenous parent. If G(M) is acyclic, we call the SCM M acyclic; otherwise we call the SCM cyclic. If G(M) contains no bidirected edges, we call the endogenous variables in the SCM M causally sufficient.

A pair of random variables (X, E) is called a solution of the SCM M if X = (X_i)_{i ∈ I} with each X_i taking values in X_i, E = (E_j)_{j ∈ J} with each E_j taking values in E_j, the distribution P(E) is equal to the exogenous distribution P_E, and the structural equations

  X_i = f_i(X_{pa_H(i) ∩ I}, E_{pa_H(i) ∩ J})  a.s.

hold for all i ∈ I. An SCM is often specified informally by giving only the structural equations and the density[6] of the exogenous distribution with respect to some product measure, for example:

  M : X_i = f_i(X_{pa_H(i) ∩ I}, E_{pa_H(i) ∩ J}), i ∈ I,  with p(E) = ∏_{j ∈ J} p(E_j).

[5] This definition of the graph makes a slight simplification: a more precise definition would leave out edges that are redundant. For example, if the structural equation for X_2 reads X_2 = 0 · X_1 + X_3, it could be that 1 → 2 ∈ H, but this edge would not appear in G(M). For the rigorous version of this definition, see Bongers et al. (2020).
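For an acyclic SCM, a solution can be sampled by drawing the exogenous variables from P_E and evaluating the structural equations in topological order. The following is a hypothetical two-variable example (all coefficients are assumptions chosen for illustration) in which a shared exogenous parent E3 induces a bidirected edge X1 ↔ X2 in G(M) alongside the directed edge X1 → X2.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Hypothetical acyclic SCM with I = {1, 2} and J = {1, 2, 3}, where the
# exogenous E3 is a parent of both endogenous variables, so the graph G(M)
# has X1 <-> X2 in addition to X1 -> X2 (coefficients are assumptions):
#   X1 = E1 + E3
#   X2 = 0.5 * X1 + E2 + E3
e1, e2, e3 = rng.normal(size=(3, n))   # independent exogenous variables
x1 = e1 + e3
x2 = 0.5 * x1 + e2 + e3

# Being acyclic, the SCM has a unique observational distribution; the
# sample correlation mixes the causal effect of X1 on X2 with the
# confounding contributed by the common exogenous parent E3.
corr = np.corrcoef(x1, x2)[0, 1]
```

The observed correlation is stronger than the causal coefficient 0.5 alone would produce, which is precisely the confounding that the bidirected edge in G(M) records.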
For acyclic SCMs, solutions exist and have a unique distribution that is determined by the SCM. This is not generally the case for cyclic SCMs, as these could have no solution at all, or could have multiple solutions with different distributions (Bongers et al., 2020).

Definition 2 An SCM M is said to be uniquely solvable w.r.t. O ⊆ I if there exists a measurable mapping g_O : X_{(pa_H(O)\O) ∩ I} × E_{pa_H(O) ∩ J} → X_O such that, for P_E-almost every e, for all x ∈ X:

  x_O = g_O(x_{(pa_H(O)\O) ∩ I}, e_{pa_H(O) ∩ J})  ⟺  x_O = f_O(x, e).

(Loosely speaking: the structural equations for O have a unique solution for X_O in terms of the other variables appearing in those equations.)

If M is uniquely solvable with respect to I (in particular, this holds if M is acyclic), then it induces a unique observational distribution P_M(X).

Given an SCM that models a certain system, we can model the system after an idealized intervention in which an external influence enforces a subset of the endogenous variables to take on certain values, while leaving the rest of the system untouched.

Definition 3 Let M be an SCM. The perfect intervention with target I ⊆ I and value ξ_I ∈ X_I induces the intervened SCM M_{do(I, ξ_I)} obtained by copying M, but letting H̃ be H without the edges {j → i ∈ H : j ∈ I ∪ J, i ∈ I}, and modifying the causal mechanism into f̃ such that

  f̃_i(x, e) = ξ_i  if i ∈ I,   f̃_i(x, e) = f_i(x, e)  if i ∉ I.

The interpretation is that the causal mechanisms that normally determine the values of the components i ∈ I are replaced by mechanisms that assign the values ξ_i. Other types of interventions are possible as well (see also Section 3.3). If the intervened SCM M_{do(I, ξ_I)} induces a unique observational distribution, this is denoted P_M(X | do(I, ξ_I)) and referred to as the interventional distribution of M under the perfect intervention do(I, ξ_I).
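Definition 3 can be mimicked directly in a simulation: applying do(X1 = ξ) replaces the mechanism for X1 by the constant ξ and leaves the mechanism for X2 untouched. The SCM below is a hypothetical two-variable example used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

def sample(do_x1=None):
    """Sample the hypothetical simple SCM  X1 = E1,  X2 = X1 + E2.

    Passing do_x1 applies the perfect intervention do(X1 = do_x1): the
    mechanism for X1 is replaced by the constant value, while the
    mechanism for X2 is left untouched (cf. Definition 3)."""
    e1, e2 = rng.normal(size=(2, n))
    x1 = e1 if do_x1 is None else np.full(n, float(do_x1))
    x2 = x1 + e2
    return x1, x2

_, x2_obs = sample()            # observational distribution of X2
_, x2_do = sample(do_x1=2.0)    # interventional distribution under do(X1 = 2)
```

The mean of X2 shifts from about 0 to about 2 under the intervention, reflecting that X1 is a cause of X2 in this model.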
Pearl (2009) derived the do-calculus for acyclic SCMs, consisting of three rules that express relationships between interventional distributions of an SCM.

[6] We denote a probability measure (or distribution) of a random variable X by P(X), and a density of X with respect to some fixed product measure by p(X).

2.1.3 Simple Structural Causal Models

The theory of general cyclic Structural Causal Models is rather involved (Bongers et al., 2020). In this work, for simplicity of exposition, we will focus on a certain subclass of SCMs that has many convenient properties and for which the theory simplifies considerably:

Definition 4 An SCM M is called simple if it is uniquely solvable with respect to any subset O ⊆ I.

All acyclic SCMs are simple. Simple SCMs provide a special case of the more general class of modular SCMs (Forré and Mooij, 2017). The class of simple SCMs can be thought of as a generalization of acyclic SCMs that allows for (weak) cyclic causal relations, but preserves many of the convenient properties that acyclic SCMs have. Indeed, a simple SCM induces a unique observational distribution. Its marginalizations are always defined (Bongers et al., 2020) and are also simple; in other words, the class of simple SCMs is closed under marginalization. The class of simple SCMs is also closed under perfect interventions, and hence all perfect interventional distributions of a simple SCM are uniquely defined. Without loss of generality, one can assume that simple SCMs have no self-cycles.

The causal interpretation of the graph of an SCM with cycles and/or bidirected edges can be rather subtle in general. However, for graphs of simple SCMs there is a straightforward causal interpretation:

Definition 5 Let M be a simple SCM. If i → j ∈ G(M), we call i a direct cause of j according to M.
If there exists a directed path i → ... → j in G(M), i.e., if i ∈ an_{G(M)}(j), then we call i a cause of j according to M. If there exists a bidirected edge i ↔ j ∈ G(M), then we call i and j confounded according to M.

We conclude that the graph G(M) of a simple SCM can be interpreted as its causal graph. In the next subsection, we will discuss how the same graph G(M) of a simple SCM M also represents the conditional independences that must hold in the observational distribution of M.

2.1.4 Structural Causal Models: Markov Properties

Under certain conditions, the graph G(M) of an SCM M can be interpreted as a statistical graphical model, i.e., it allows one to read off conditional independences that must hold in the observational distribution P_M(X). One of the most common formulations of such Markov properties involves the following notion of d-separation, first proposed by Pearl (1986) in the context of DAGs, and later shown to be more generally applicable:[7]

Definition 6 (d-separation) We say that a walk <i_0, ..., i_n> in a DMG G = <V, E, F> is d-blocked by C ⊆ V if:
(i) its first node i_0 ∈ C or its last node i_n ∈ C, or
(ii) it contains a collider i_k ∉ an_G(C), or
(iii) it contains a non-collider i_k ∈ C.
If all paths in G between any node in a set A ⊆ V and any node in a set B ⊆ V are d-blocked by a set C ⊆ V, we say that A is d-separated from B by C, and we write A ⊥^d_G B | C.

[7] It is also sometimes called "m-separation" in the ADMG literature.

In the general cyclic case, however, the notion of d-separation is too strong, as was already pointed out by Spirtes (1994). A solution is to replace it with a non-trivial generalization of d-separation, known as σ-separation (Forré and Mooij, 2017):

Definition 7 (σ-separation) We say that a walk <i_0, ..., i_n> in a DMG G = <V, E, F> is σ-blocked by C ⊆ V if:
(i) its first node i_0 ∈ C or its last node i_n ∈ C, or
(ii) it contains a collider i_k ∉ an_G(C), or
(iii) it contains a non-collider i_k ∈ C that points to a neighboring node on the walk in another strongly-connected component (i.e., i_{k-1} → i_k → i_{k+1} or i_{k-1} ↔ i_k → i_{k+1} with i_{k+1} ∉ sc_G(i_k); i_{k-1} ← i_k ← i_{k+1} or i_{k-1} ← i_k ↔ i_{k+1} with i_{k-1} ∉ sc_G(i_k); or i_{k-1} ← i_k → i_{k+1} with i_{k-1} ∉ sc_G(i_k) or i_{k+1} ∉ sc_G(i_k)).
If all paths in G between any node in a set A ⊆ V and any node in a set B ⊆ V are σ-blocked by a set C ⊆ V, we say that A is σ-separated from B by C, and we write A ⊥^σ_G B | C.

Forré and Mooij (2017) proved the following fundamental result for modular SCMs, which we formulate here only for the special case of simple SCMs:

Theorem 8 (Generalized Directed Global Markov Property) Any solution (X, E) of a simple SCM M obeys the Generalized Directed Global Markov Property with respect to the graph G(M):

  A ⊥^σ_{G(M)} B | C  ⟹  X_A ⊥⊥ X_B | X_C in P_M(X),  for all A, B, C ⊆ I.

The following stronger Markov properties, in which σ-separation is replaced by the more familiar notion of d-separation, have been derived for special cases by Forré and Mooij (2017) (where again we consider only the special case of simple SCMs):

Theorem 9 (Directed Global Markov Property) Let M = <I, J, H, X, E, f, P_E> be a simple SCM.
If M satisfies at least one of the following three conditions: (i) M is acyclic; (ii) all endogenous spaces X_i are discrete; (iii) M is linear (i.e., X_i = ℝ for each i ∈ I, E_j = ℝ for each j ∈ J, and each causal mechanism f_i : X_{pa_H(i)∩I} × E_{pa_H(i)∩J} → X_i is linear), each causal mechanism f_i depends non-trivially on some exogenous variable(s), and its exogenous distribution has a density p(E) with respect to Lebesgue measure; then any solution (X, E) of M obeys the Directed Global Markov Property with respect to the graph G(M):

    A ⊥_d B | C in G(M)  ⟹  X_A ⊥⊥ X_B | X_C in P^M(X),  for all A, B, C ⊆ I.

Of these cases, the acyclic and linear cases are well known.[8]

[8] The acyclic case was first shown in the context of linear-Gaussian structural equation models (Spirtes et al., 1998; Koster, 1999). The discrete case fixes the erroneous theorem by Pearl and Dechter (1996), for which a counterexample was found by Neal (2000), by adding the unique solvability condition, and extends it to allow for latent common causes. The linear case extends existing results for the linear-Gaussian setting without latent common causes (Spirtes, 1994, 1995; Koster, 1996) to a linear (possibly non-Gaussian) setting with latent common causes.

We conclude that simple SCMs also have convenient Markov properties. A simple SCM induces a unique observational distribution that satisfies the Generalized Directed Global Markov Property; under additional conditions, it satisfies even the Directed Global Markov Property. Similarly, for any perfect intervention, a simple SCM induces a unique interventional distribution that satisfies the (Generalized) Directed Global Markov Property with respect to the intervened graph.
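In the acyclic case of Theorem 9, d-separation can be decided algorithmically. The following minimal sketch (our own illustration, not part of the paper; the dict-based graph encoding and helper names are invented) checks d-separation in a DAG via the standard ancestral-moralization criterion:

```python
from itertools import combinations

def ancestors(parents, nodes):
    """All nodes with a directed path into `nodes`, including `nodes` itself."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(parents, A, B, C):
    """Check A ⊥_d B | C in a DAG given as a dict node -> list of parents
    (acyclic, no bidirected edges), via ancestral moralization."""
    keep = ancestors(parents, set(A) | set(B) | set(C))
    # Moralize: undirected parent-child edges plus "married" co-parents.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, ()) if p in keep]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(ps, 2):
            adj[p].add(q); adj[q].add(p)
    # Remove the conditioning set, then test reachability from A to B.
    blocked = set(C)
    stack, seen = [a for a in A if a not in blocked], set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(w for w in adj[v] if w not in blocked)
    return not (seen & set(B))

# Y-structure graph: X1 -> X3 <- X2, X3 -> X4.
parents = {"X3": ["X1", "X2"], "X4": ["X3"]}
print(d_separated(parents, {"X1"}, {"X2"}, set()))    # True: marginally separated
print(d_separated(parents, {"X1"}, {"X2"}, {"X3"}))   # False: conditioning opens the collider
print(d_separated(parents, {"X1"}, {"X4"}, {"X3"}))   # True: X3 screens off X4
```

This moralization criterion is equivalent to walk-wise d-blocking for DAGs; for cyclic graphs, σ-separation would need a different treatment.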
We conclude that the graph G(M) of a simple SCM has two interpretations: it expresses both the causal structure between the variables and the conditional independence structure of the solutions. These two interpretations of the graph G(M) of a simple SCM can be combined into a causal do-calculus (Forré and Mooij, 2019) that extends the acyclic do-calculus of Pearl (2009) to the class of simple (or more generally, modular) SCMs.

The starting point for constraint-based approaches to causal discovery from observational data is to assume that the data are modelled by an (unknown) SCM M, such that its observational distribution P^M(X) exists and satisfies a Markov property with respect to its graph G(M). In addition, one usually assumes the faithfulness assumption to hold (Spirtes et al., 2000; Pearl, 2009), i.e., that the graph explains all conditional independences present in the observational distribution. For the cases in which the d-separation criterion of Theorem 9 applies, this amounts to assuming the following implication:

    A ⊥_d B | C in G(M)  ⟸  X_A ⊥⊥ X_B | X_C in P^M(X),  for all A, B, C ⊆ V.

Meek (1995) has shown completeness properties of d-separation. More specifically, Meek (1995) showed that faithfulness holds generically for DAGs if (i) all variable domains are finite, or (ii) all variables are real-valued, linearly related and have a multivariate Gaussian distribution. This in particular provides some justification for assuming faithfulness. On the other hand, no completeness results are known yet for the general cyclic case in which the σ-separation criterion of Theorem 8 applies. Nevertheless, we believe that such results can be shown, and we will assume for simple SCMs a similar faithfulness assumption as for the d-separation case:

    A ⊥_σ B | C in G(M)  ⟸  X_A ⊥⊥ X_B | X_C in P^M(X),  for all A, B, C ⊆ V.
2.2 Causal Discovery by Experimentation

The gold standard for causal discovery is by means of experimentation. For example, randomized controlled trials (Fisher, 1935) form the foundation of modern evidence-based medicine. In engineering, A/B-testing is a common protocol to optimize certain causal effects of an engineered system. Toddlers learn causal representations of the world through playful experimentation. We will discuss here the simplest randomized controlled trial setting by formulating it in terms of the graphical causal terminology introduced in the last section.

The experimental procedure is as follows. Consider two variables, "treatment" C and "outcome" X. In the simplest setting, one considers a binary treatment variable, where C = 1 corresponds to "treat with drug" and C = 0 corresponds to "treat with placebo". For example, the drug could be aspirin, and the outcome could be the severity of headache perceived two hours later.

[Figure 3 data: (a) two separate data sets, placebo (C = 0) with X = −0.2, 0.6, −1.7, … and drug (C = 1) with X = −0.3, 1.8, −0.1, …; (b) the same data pooled into a single table with columns C and X.]

Figure 3: Illustration of the data from an example randomized controlled trial. The data can either be interpreted as (a) two separate data sets, one for the treatment and one for the control group, or (b) as a single data set including a context variable indicating treatment/control. Note that in this particular example, C is dependent on X in the pooled data (or equivalently, the distribution of X differs between contexts C = 0 and C = 1), which implies that C is a cause of X.

Patients are split into two groups, the treatment and the control group, by means of a coin flip that assigns a value of C to every patient.
Patients are treated depending on the assigned value of C, i.e., patients in the treatment group are treated with the drug and patients in the control group are treated with a placebo.[9]

[9] Usually this is done in a double-blind way, so that neither the patient nor the doctor knows which group a patient has been assigned to.

Some time after treatment, the outcome X is measured for each patient. This yields a data set (C_n, X_n)_{n=1}^N with two measurements (C_n, X_n) for the n-th patient. If the distribution of the outcome X significantly differs between the two groups, one concludes that treatment is a cause of outcome. The important underlying causal assumptions that ensure the validity of this conclusion are:

(i) outcome X is not a cause of treatment C (which is commonly deemed justified if the outcome is an event that occurs later in time than the treatment event);

(ii) there is no latent confounder of treatment and outcome (this is where the randomization comes in: if treatment is decided solely by a proper coin flip, then it seems reasonable to assume that there cannot be any latent common cause of the coin flip C and the outcome X that is not just a combination of two statistically independent separate causes of C and X);

(iii) no selection bias is present in the data (in other words, no data is missing; for example, if only those patients that did not suffer from certain treatment side effects are included in the data set, then this assumption will be violated).

Under these assumptions, one can show that if the distribution of the outcome X differs between the two groups of patients ("treatment group" with C = 1 vs. "control group" with C = 0), then treatment must be a cause of outcome, at least in this population of patients (see Proposition 10). There are two conceptually slightly different ways of testing
this in the data, depending on whether we treat the data as a single pooled data set, or rather as two separate data sets (each one corresponding to a particular patient group); see also Figure 3.

If we consider the data about outcome X in the two groups as two separate data sets (corresponding to the same variable X, but measured in different contexts C), then the question is whether the distribution of X is statistically different in the two data sets. This can be tested with a two-sample test, for example, a t-test or a Wilcoxon test. The other alternative is to consider the data as a single pooled data set (by pooling the data for the two groups), and let the value of C indicate the context of each sample (treatment or control). The question now becomes whether the conditional distribution of X given C = 0 differs from the conditional distribution of X given C = 1, i.e., whether P(X | C = 0) ≠ P(X | C = 1). In other words, we have to test whether there is a statistically significant dependence C ⊥̸⊥ X in the pooled data between treatment C and outcome X; if there is, it must be due to the treatment C causing the outcome X, as the following proposition shows:

Proposition 10 Suppose that the data-generating process on context variable C and outcome variable X can be modelled by a simple SCM M and no selection bias is present.[10] Under the randomized controlled trial assumptions:

(i) C ← X ∉ G(M) ("outcome X is not a cause of treatment C"),

(ii) C ↔ X ∉ G(M) ("there is no latent confounder of treatment C and outcome X"),

a dependence C ⊥̸⊥ X in the joint distribution P(C, X) implies that C causes X. Furthermore, the causal effect of C on X is given by:

    P^M(X | do(C = c)) = P^M(X | C = c).    (1)
Proof Out of the eight possible graphs G(M) (any subset of the three edges C → X, C ← X, and C ↔ X), only two satisfy the assumptions: the empty graph, in which C ⊥⊥ X, and the graph containing only the edge C → X. By the Markov property (Theorem 8), if the edge C → X were absent in G(M), then C would be independent of X. Therefore, if C ⊥̸⊥ X, the edge C → X must be in G(M). In both cases, the causal do-calculus applied to G(M) yields the identity (1).

[10] The context variable C is here considered as an endogenous variable in the SCM, as explained in Section 3.1.

Of course, in this straightforward example the equivalence between the two approaches (differences between two separate data sets vs. properties of a single pooled data set) is trivial, and the reader may wonder why we emphasize it. The reason is that the key idea of our approach is precisely this: reducing an apparently complicated causal discovery problem with multiple data sets to a more standard causal discovery problem involving a single pooled data set. The Joint Causal Inference framework that we propose in this paper can be considered as an extension of this randomized controlled trial setting to multiple treatment and outcome variables.

It is important to realize that the simple causal reasoning for the RCT cannot be made when looking at the two data sets in isolation (i.e., by considering only properties of P(X | C = 0) and P(X | C = 1) separately, and not using in addition any other properties of the joint distribution P(X, C)). The latter approach is commonly used by constraint-based methods for causal discovery from multiple data sets (e.g., Tillman, 2009; Claassen and Heskes, 2010; Tillman and Spirtes, 2011; Hyttinen et al., 2014; Triantafillou and Tsamardinos, 2015; Rothenhäusler et al., 2015; Forré and Mooij, 2018).
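The equivalence of the two views can be made concrete in a small simulation (our own illustration, not from the paper; the group sizes and effect size are invented). For a binary context C, the between-group mean difference computed from the two separate data sets and the covariance between C and X computed from the pooled data carry the same signal, related exactly by a factor p(1 − p), where p is the fraction of treated samples:

```python
import random

random.seed(0)

# Hypothetical RCT data: placebo group (C = 0) and drug group (C = 1).
placebo = [random.gauss(0.0, 1.0) for _ in range(500)]
drug    = [random.gauss(1.0, 1.0) for _ in range(500)]

# View (a): two separate data sets -> compare group means.
mean_diff = sum(drug) / len(drug) - sum(placebo) / len(placebo)

# View (b): one pooled data set over (C, X) -> measure dependence of C and X.
pooled = [(0, x) for x in placebo] + [(1, x) for x in drug]
n = len(pooled)
mC = sum(c for c, _ in pooled) / n
mX = sum(x for _, x in pooled) / n
cov_CX = sum((c - mC) * (x - mX) for c, x in pooled) / n

# Exact identity for binary C: cov(C, X) = p(1 - p) * (mean_1 - mean_0).
p = mC
assert abs(cov_CX - p * (1 - p) * mean_diff) < 1e-9
```

A significant mean difference in view (a) and a significant C–X dependence in view (b) are thus the same finding expressed in two ways.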
Under the assumptions made, the crucial (and possibly very strong) signal in the data that allows one to draw the conclusion that C causes X is the dependence C ⊥̸⊥ X, which can only be seen in the pooled data. Methods that only test for conditional independences within each context and subsequently combine these into a single context-independent causal model will not yield any conclusion in this setting. The approach taken by JCI, on the other hand, is to analyze the pooled data jointly, so that informative signals like these can be taken into account.

2.3 Causal Discovery from Purely Observational Data

In the previous section, we discussed the current gold standard for discovering causal relations. Over the last two decades, alternative methods have been proposed to perform causal discovery from purely observational data. This is intriguing and of high relevance, since experiments may be impossible, infeasible, impractical, unethical or too expensive to perform. These causal discovery methods can be divided into constraint-based causal discovery methods, such as the PC (Spirtes et al., 2000), IC (Pearl, 2009) and FCI algorithms (Spirtes et al., 1999; Zhang, 2008a), and score-based causal discovery methods (e.g., Heckerman et al., 1995; Chickering, 2002; Koivisto and Sood, 2004). The PC and IC algorithms and most score-based methods assume causal sufficiency (i.e., the absence of latent confounders), while the FCI algorithm and other modern constraint-based algorithms allow for latent confounders and selection bias. Originally, these methods have been designed to estimate the causal graph of the system from a single data set corresponding to a single (purely observational) context.

All these methods try to infer causal relationships on the basis of subtle statistical patterns in the data. The most important of these patterns are conditional independences between variables.
These are exploited by most constraint-based methods, and implicitly, by score-based methods. Other patterns, such as "Verma constraints" (Shpitser et al., 2014), algebraic constraints in the linear-Gaussian case (van Ommen and Mooij, 2017), non-Gaussianity in linear models (Kano and Shimizu, 2003), and non-additivity of noise in nonlinear models (Peters et al., 2014) can also be exploited. Another class of methods that has become popular more recently are methods that try to infer the causal direction (A → B vs. B → A) from purely observational data of variable pairs (see e.g., Mooij et al., 2016).

Since our main goal is to enable constraint-based causal discovery from multiple contexts, we will focus on this approach here, while noting that the JCI framework that we propose in the next section is compatible with all approaches to causal discovery from purely observational data that allow for multiple variables and can handle certain background knowledge (to be made precise in Section 3.4).

As discussed in detail by Spirtes et al. (2000), causal discovery from conditional independence patterns in purely observational data becomes possible under strong assumptions. The simplest example of how certain patterns of conditional independences in the observational distribution can lead to conclusions about the causal relations of the variables is given by the "Y-structure" pattern (Mani, 2006), which is illustrated in Figure 4. We show here that the Y-structure pattern also generalizes to the cyclic case.

Proposition 11 Suppose that the data-generating process on four variables X_1, X_2, X_3, X_4 can be modelled by a simple SCM M. Assume that the sampling procedure is not subject to selection bias, and that faithfulness holds.
If the following conditional (in)dependences hold in the observational distribution P^M(X):

    X_1 ⊥̸⊥ X_4,   X_2 ⊥̸⊥ X_4,   X_1 ⊥⊥ X_2,
    X_1 ⊥⊥ X_4 | X_3,   X_2 ⊥⊥ X_4 | X_3,   X_1 ⊥̸⊥ X_2 | X_3,

then X_3 is a direct cause of X_4 according to M. Furthermore, X_3 and X_4 are unconfounded according to M, and the causal effect of X_3 on X_4 is given by:

    P^M(X_4 | do(X_3 = x_3)) = P^M(X_4 | X_3 = x_3).    (2)

Proof By the assumed Markov and faithfulness properties, one can check that the only (cyclic or acyclic) graphs that are compatible with the observed conditional independences are the ones in Figure 4 (left), where X_1 must be adjacent to X_3 via at least one of the two dashed edges, and similarly, X_2 must be adjacent to X_3 via at least one of the two dashed edges. Hence, X_3 is a direct cause of X_4 according to M, but X_4 is not a direct cause of X_3 according to M. Also, X_3 and X_4 cannot be confounded according to M. By applying the causal do-calculus, we arrive at (2).

This example illustrates how conditional independence patterns in the observational distribution allow one to infer certain features of the underlying causal model. This principle is exploited more generally by constraint-based methods, and implicitly, by score-based methods that optimize a penalized likelihood over (equivalence classes of) causal graphs. Typically, the graph cannot be completely identified from purely observational data. For example, in the Y-structure case, the conditional independences in the observational data do not allow one to conclude whether the dependence between X_1 and X_3 is explained by X_1 being a cause of X_3, or by X_1 and X_3 having a latent confounder, or both.
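The (in)dependence pattern of Proposition 11 can be observed in simulated data. The sketch below (our own illustration, not from the paper; the linear-Gaussian SCM and its coefficients are invented, and partial correlation is used as a proxy for conditional independence, which is valid in the linear-Gaussian case):

```python
import math
import random

random.seed(1)
N = 20_000

def corr(xs, ys):
    """Sample Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sxy / (sx * sy)

def pcorr(xs, ys, zs):
    """First-order partial correlation of xs and ys given zs."""
    rxy, rxz, ryz = corr(xs, ys), corr(xs, zs), corr(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))

# Linear-Gaussian SCM with the Y-structure: X1 -> X3 <- X2, X3 -> X4.
X1 = [random.gauss(0, 1) for _ in range(N)]
X2 = [random.gauss(0, 1) for _ in range(N)]
X3 = [x1 + x2 + 0.5 * random.gauss(0, 1) for x1, x2 in zip(X1, X2)]
X4 = [x3 + 0.5 * random.gauss(0, 1) for x3 in X3]

assert abs(corr(X1, X2)) < 0.05        # X1 independent of X2
assert abs(corr(X1, X4)) > 0.5         # X1 dependent on X4
assert abs(pcorr(X1, X4, X3)) < 0.05   # X1 independent of X4 given X3
assert abs(pcorr(X1, X2, X3)) > 0.5    # X1 dependent on X2 given X3
```

When the four (in)dependence tests come out as asserted, the proposition licenses the conclusion that X_3 directly and unconfoundedly causes X_4.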
However, under the assumption of faithfulness, one can deduce the Markov equivalence class of the graph from the conditional independences in the observational data, i.e., the class of all DMGs that induce the same separations. Another disadvantage of causal discovery methods from purely observational data is that they typically need very large sample sizes and strong assumptions in order to work reliably. These are some of the motivations to combine these ideas with those of causal discovery by experimentation, as we will do in the next section.

Figure 4: Left: Causal graphs satisfying the "Y-structure" pattern on four variables (X_1, X_2, X_3, X_4). Right: Scatter plots illustrating the Y-structure pattern in purely observational data, where X_3 is discrete-valued and its value is indicated by color (red/blue).

3. Joint Causal Inference

In this section we present Joint Causal Inference (JCI), a novel framework for causal discovery from multiple data sets corresponding to measurements that have been performed in different contexts. JCI combines the existing approaches towards causal discovery that we discussed in Sections 2.2 and 2.3.

3.1 The Distinction between System and Context

Henceforth, we will distinguish system variables (X_i)_{i∈I}, describing the system of interest, and context variables (C_k)_{k∈K}, describing the context in which the system has been observed. An observation that will turn out to be crucial in what follows is that the decision of what to consider part of the "system" and what to consider part of its "context" does not reflect an objective property of nature, but is a choice of the modeler.

While the system variables are treated as endogenous variables of the system of interest, we usually (but not necessarily) think of the context variables as observed exogenous variables for the system of interest.
In particular, context variables could describe which interventions have been performed on the system (or more specifically, how these interventions have been performed), in which case we will also refer to them as intervention variables. The possible interventions are not limited to the perfect interventions modeled by the do-operator of Pearl (2009), but can also be more general types of interventions that appear in practice, like mechanism changes (Tian and Pearl, 2001), soft interventions (Markowetz et al., 2005), fat-hand interventions (Eaton and Murphy, 2007), activity interventions (Mooij and Heskes, 2013), and stochastic versions of all these. This will be discussed in more detail in Section 3.3. Even more generally, a context variable could describe any property of the environment of the system, including those properties that one would not normally think about as an intervention. Examples are the lab in which measurements have been done, the time of day, the patient population, variables like "gender" or "age", etc. Like system variables, context variables can be discrete or continuous (or more generally, take values in some Borel space).

Figure 5: JCI reduces modeling a system in its environment to modeling the meta-system consisting of the system and its environment.

The idea of explicitly considering context variables is not novel: they have been discussed in the literature under various names, such as "policy variables" (Spirtes et al., 2000), "force variables" (Pearl, 1993), "decision variables" in influence diagrams (Dawid, 2002), "regime indicators" (Didelez et al., 2006), "selection variables" in selection diagrams (Bareinboim and Pearl, 2013), and "environment variables" (Peters et al., 2016). Their use for causal discovery was already suggested by Cooper and Yoo (1999).
Formal aspects of how these variables are treated vary across accounts, however. For example, Dawid (2002) treats system variables as random variables but chooses not to treat context ("decision") variables as random variables. In this work we simply consider context variables as random variables, with added background knowledge on their causal relations that expresses their assumed exogeneity with respect to the system.

Conceptually, context variables provide a more general notion than intervention variables, since every intervention can be seen as a change of context, but not every change of context is naturally thought of as an intervention. For example, the causal effect of some drug on a certain health outcome may differ for males and females. Taking "gender" as a context variable that just encodes the specific subpopulation of patients we are considering is more natural than considering it to be an intervention variable that encodes the result of a gender-changing operation on the patient. Furthermore, interventions usually come with an "observational baseline" of "doing nothing", but this is not always naturally available for more general context variables (e.g., "male" and "female" could both qualify as a baseline, while neither of the two would provide a more natural "observational" baseline than the other). When considering context variables, we do not have to specify such a baseline, whereas if we consider them as intervention variables, one can always ask "which value of the variable corresponds with no intervention?". Ultimately, though, both interpretations can be treated equally from a mathematical modeling perspective. Henceforth, we will use the term "context variable" in general, but "intervention variable" specifically for context variables that model an external intervention on the system.
That being said, the approach we take in JCI is simple (see also Figure 5): rather than considering a causal model of the system alone (i.e., modeling only the endogenous system variables), we broaden its scope to include relevant parts of the environment of the system (i.e., we include the context variables as additional endogenous variables). Thereby, we "internalize" parts of the environment of the system, which makes the meta-system (consisting of both the system and its environment) amenable to formal causal modeling. The meta-system can now formally be considered as occurring in just a single (meta-)context, and thereby we have reduced the problem of how to deal with multiple contexts to one of dealing with a single context only. We will formalise this idea in the next subsection.

3.2 Joint Causal Modeling of Multiple Contexts

Different approaches to modeling multiple contexts can be taken, e.g., using influence diagrams (Dawid, 2002), using selection diagrams (Bareinboim and Pearl, 2013), considering only conditional models (i.e., for the conditional probability of the system given the context) (Eaton and Murphy, 2007; Mooij and Heskes, 2013), or using ioSCMs (Forré and Mooij, 2019). Here, we will take what is perhaps the simplest approach: we treat both context and system variables as endogenous variables in an SCM.

We will use a simple SCM to model the meta-system (i.e., the system and its contexts) causally. The endogenous variables of the SCM consist of the system variables X = (X_i)_{i∈I} with values x ∈ X = ∏_{i∈I} X_i and the context variables C = (C_k)_{k∈K} with values c ∈ C = ∏_{k∈K} C_k. The latent exogenous variables of the SCM are denoted E = (E_j)_{j∈J} with values e ∈ E = ∏_{j∈J} E_j.
The SCM modeling the meta-system is then assumed to be of the following form:

    M :  C_k = f_k(X_{pa_H(k)∩I}, C_{pa_H(k)∩K}, E_{pa_H(k)∩J}),   k ∈ K,
         X_i = f_i(X_{pa_H(i)∩I}, C_{pa_H(i)∩K}, E_{pa_H(i)∩J}),   i ∈ I,
         P(E) = ∏_{j∈J} P(E_j).    (3)

The system variables X and context variables C are all treated as endogenous variables of the meta-system, and the exogenous variables E are independent latent variables that are assumed not to be caused by the system variables X or the context variables C.[11] The augmented graph H has nodes I ∪ J ∪ K and directed edges corresponding to the functional dependencies of the causal mechanisms on the variables. The graph G(M) has only nodes I ∪ K, and may contain both directed and bidirected edges between the nodes, expressing direct causal relations and latent confounders.

[11] At this stage, we have not yet incorporated the assumption that context variables are exogenous to the system, and they are still treated equally to system variables in (3).

Note that the most general way to use SCMs to model multiple contexts would be to use separate SCMs, one for each context. In that approach, we could have a different graph for each context. Representing the contexts jointly, as in (3), we simply obtain the union of those graphs. In particular, even if the system is acyclic within each context, the mixture of systems in different contexts may have a cyclic graph. As a simple example, consider a system with two system variables X_1 and X_2, and two different contexts, where in the first context X_1 causes X_2 (but not vice versa), and in the second context X_2 causes X_1 (but not vice versa); see also Figure 6. As a more concrete example, the engine drives the wheels of a car when going uphill, but when going downhill, the rotation of the wheels drives the engine. Modeling this in a joint SCM as in (3) requires a cyclic graph.

The model (3) imposes a probability distribution P(C) on the context variables, the context distribution.
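The two-context example above can be sketched as a small simulation (our own illustration; the coefficient 0.8 and sample sizes are arbitrary choices). Each context on its own is an acyclic linear-Gaussian SCM, but pooling them yields data whose joint graph needs both X_1 → X_2 and X_2 → X_1:

```python
import random
from statistics import pvariance

random.seed(2)

def sample(c):
    """One draw from the meta-system: in context c = 0, X1 -> X2;
    in context c = 1, X2 -> X1 (each context separately is acyclic)."""
    if c == 0:
        x1 = random.gauss(0, 1)
        x2 = 0.8 * x1 + random.gauss(0, 1)
    else:
        x2 = random.gauss(0, 1)
        x1 = 0.8 * x2 + random.gauss(0, 1)
    return c, x1, x2

# Pool the two contexts into a single data set over (C, X1, X2).
pooled = [sample(random.randint(0, 1)) for _ in range(4000)]

# The mechanism switch leaves a footprint in the pooled data: the variance
# of X1 differs between contexts (1 vs. 1 + 0.8^2 in the population).
v0 = pvariance([x1 for c, x1, _ in pooled if c == 0])
v1 = pvariance([x1 for c, x1, _ in pooled if c == 1])
assert v1 > v0 + 0.3
```

A context-dependent distribution of X_1 like this is exactly the kind of signal that only becomes visible once the context variable is included in the pooled data.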
The context distribution will reflect the empirical distribution P̂(C) of the context variables in the pooled data, by using as the probability of a context the fraction of the total number of samples that have been measured in that context. In case the context variables are used to model interventions, for example, the context distribution is determined by the experimental design. One might object that this makes the model very specific to the particular setting, since it also specifies the relative numbers of samples in each data set; but as it turns out, the conclusions of the causal discovery procedure do not depend on these details under reasonable assumptions, and therefore generalize to other context distributions. In other words, the behavior of the system is invariant of the context distribution.

Figure 6: The graph of a mixture of two acyclic SCMs can be cyclic. (a) X_1 causes X_2 in context C = 0; (b) X_2 causes X_1 in context C = 1; (c) X_1 and X_2 cause each other in the joint model.

Because the context variables are treated as endogenous variables (similarly to the system variables), we have "internalized" them. The main advantage of our modeling approach over alternative approaches is that in (3), context variables are formally treated in exactly the same way as the system variables. This implies in particular that all standard definitions and terminology of Section 2.1, and all causal discovery methods that are applicable in that setting, can be directly applied.
3.3 Modeling Interventions as Context Changes

The causal model in (3) allows one to model a perfect intervention in the usual way (Pearl, 2009). Specifically, the perfect intervention that forces X_I to take on the value ξ_I ("do(X_I = ξ_I)") for some subset I of the system indices and some value ξ_I ∈ ∏_{i∈I} X_i can be modeled by replacing the structural equations for the system variables in (3) by:[12]

    X_i = ξ_i                                            for i ∈ I,
    X_i = f_i(X_{pa(i)∩I}, C_{pa(i)∩K}, E_{pa(i)∩J})     otherwise,

while leaving the rest of the model invariant.

[12] For brevity, we dropped the subscript H of pa_H(·).

Alternatively, the context variables can be used to model interventions. For example, the same perfect intervention could be modeled by introducing a context variable C_k that has ch(k) = I, no parents or spouses, and domain C_k = {∅} ∪ ∏_{i∈I} X_i, by taking f_i to be of the following form, for i ∈ I:

    f_i(X_{pa(i)∩I}, C_{pa(i)∩K}, E_{pa(i)∩J}) =
        f̃_i(X_{pa(i)∩I}, C_{pa(i)∩K∖{k}}, E_{pa(i)∩J})   if C_k = ∅,
        (C_k)_i                                            if C_k ∈ ∏_{i∈I} X_i.    (4)

Here, C_k = ∅ corresponds to no intervention (i.e., the observational baseline). Modeling a perfect intervention in this way is similar to the concept of "force variables" introduced by Pearl (1993). The observational distribution of the system variables is then given by the conditional distribution P(X | C_k = ∅), the interventional distribution corresponding to the perfect intervention do(X_I = ξ_I) is given by the conditional distribution P(X | C_k = ξ_I), and the marginal distribution P(X) represents a mixture of those. This is illustrated in Figure 7.
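The mechanism in (4) can be sketched for a single intervention target (our own illustration; the mechanism `f2`, the coefficient, and the use of `None` for the baseline value ∅ are all invented):

```python
import random

random.seed(3)

def f2(x1, c_k):
    """Causal mechanism of X2 with an intervention (context) variable C_k,
    in the spirit of Eq. (4): c_k is None for the observational baseline,
    or the value that X2 is forced to take."""
    if c_k is None:                 # C_k = "no intervention"
        return 0.8 * x1 + random.gauss(0, 1)
    return c_k                      # perfect intervention do(X2 = c_k)

def sample(c_k):
    x1 = random.gauss(0, 1)
    return x1, f2(x1, c_k)

# P(X | C_k = None) is the observational distribution; P(X | C_k = 2.5)
# is the interventional distribution for do(X2 = 2.5).
observational  = [sample(None) for _ in range(5)]
interventional = [sample(2.5) for _ in range(5)]
assert all(x2 == 2.5 for _, x2 in interventional)
```

Pooling both sample lists together with their C_k values yields exactly the joint-data representation of Figure 7 (right).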
More general types of interventions, such as mechanism changes (Tian and Pearl, 2001), can be modeled in a similar way, simply by not enforcing the dependence on C_k to be of the form (4), but allowing more general forms of functional dependence. For example, switching the causal mechanism of system variable X_i from mechanism A to mechanism B can be modeled as follows by introducing a context variable C_k with ch(k) = {i} and domain C_k = {A, B}:

    f_i(X_{pa(i)∩I}, C_{pa(i)∩K}, E_{pa(i)∩J}) =
        f̃^A_i(X_{pa(i)∩I}, C_{pa(i)∩K∖{k}}, E_{pa(i)∩J})   if C_k = A,
        f̃^B_i(X_{pa(i)∩I}, C_{pa(i)∩K∖{k}}, E_{pa(i)∩J})   if C_k = B.

As another example, a stochastic perfect intervention on X_i that is only successful with a certain probability can be modeled by having one of the latent exogenous variables E_j with j ∈ pa(i) determine whether the intervention was successful:

    f_i(X_{pa(i)∩I}, C_{pa(i)∩K}, E_{pa(i)∩J}) =
        f̃_i(X_{pa(i)∩I}, C_{pa(i)∩K∖{k}}, E_{pa(i)∩J∖{j}})   if C_k = ∅ or E_j = 0,
        C_k                                                    if C_k ∈ X_i and E_j = 1.

This approach of modeling interventions by means of context variables is very general, as it allows one to treat various types of interventions in a unified way. For example, it can deal with perfect interventions (Pearl, 2009), mechanism changes (Tian and Pearl, 2001), soft interventions (Markowetz et al., 2005), fat-hand interventions (Eaton and Murphy, 2007), activity interventions (Mooij and Heskes, 2013), and stochastic versions of all these. In case the context variables are used to model interventions in this way, we also refer to the context distribution P(C) (the probability for each context to occur) as the experimental design.
3.4 JCI Assumptions

In this subsection, we discuss additional background knowledge about the causal relationships of context variables that one may often have in practice, and that can be very helpful for causal discovery.

Figure 7: Two ways of representing interventions: either by modeling contexts separately (left), or by modeling system and context jointly (right). In this example, we consider a perfect intervention on $X_2$, though the same idea applies to other types of interventions. (a) shows the corresponding causal graphs, as separate graphs for each context (left), or as a single joint graph that includes a context variable (right); (b) shows different ways of grouping the data: as separate data sets for each context (left), or as a single joint data set after pooling (right).

3.4.1 JCI Assumption 0

First, we restate formally our basic modeling assumption:

Assumption 0 ("Joint SCM") The data-generating mechanism is described by a simple SCM $\mathcal{M}$ of the form

$$\mathcal{M}: \begin{cases} C_k = f_k(X_{\mathrm{pa}_H(k) \cap \mathcal{I}}, C_{\mathrm{pa}_H(k) \cap \mathcal{K}}, E_{\mathrm{pa}_H(k) \cap \mathcal{J}}), & k \in \mathcal{K}, \\ X_i = f_i(X_{\mathrm{pa}_H(i) \cap \mathcal{I}}, C_{\mathrm{pa}_H(i) \cap \mathcal{K}}, E_{\mathrm{pa}_H(i) \cap \mathcal{J}}), & i \in \mathcal{I}, \\ P(E) = \prod_{j \in \mathcal{J}} P(E_j), \end{cases} \quad (5)$$

that jointly models the system and the context. Its graph $\mathcal{G}(\mathcal{M})$ has nodes $\mathcal{I} \cup \mathcal{K}$ (corresponding to system variables $\{X_i\}_{i \in \mathcal{I}}$ and context variables $\{C_k\}_{k \in \mathcal{K}}$).
Whereas we will always make this assumption in order to facilitate the formulation of JCI, the following three assumptions that we discuss are optional, and their applicability has to be decided on a case-by-case basis.

3.4.2 JCI Assumption 1

Typically, when a modeler decides to distinguish a system from its context, the modeler possesses background knowledge expressing that the context is exogenous to the system:

Assumption 1 ("Exogeneity", optional) No system variable causes any context variable, i.e., $\forall k \in \mathcal{K}, \forall i \in \mathcal{I}: i \to k \notin \mathcal{G}(\mathcal{M})$.

This exogeneity assumption is often easy to justify, for example if the context is gender or age. Another common case is that the context encodes interventions that have been decided and performed on the system before measurements on the system are made: this already rules out any causal influence of system variables on the intervention (context) variables if time travel is not deemed possible. Of course, one can imagine settings in which a system variable was measured before an intervention was performed on the system. For example, a doctor typically first diagnoses a patient before deciding on treatment. For system variables containing the results of the medical examination used for the diagnosis, and intervention variables describing the treatment that was decided after, and based upon, the medical examination, JCI Assumption 1 would not apply.

3.4.3 JCI Assumption 2

The second JCI assumption generalizes the randomization assumption for randomized controlled trials:

Assumption 2 ("Complete randomized context", optional) No context variable is confounded with a system variable, i.e., $\forall k \in \mathcal{K}, \forall i \in \mathcal{I}: i \leftrightarrow k \notin \mathcal{G}(\mathcal{M})$.

This assumption is often harder to justify in practice.
It is justifiable in experimental protocols in which the decision of which intervention to perform on the system does not depend on anything else that might also affect the system of interest, and in which the observed context variables provide a complete description of the context. This is ensured, for example, by proper randomization in a double-blind randomized trial, i.e., one in which neither the patient nor the physician knows whether the patient was assigned the drug or a placebo.

Many experimental protocols that do not involve explicit coin flips or random number generators implicitly perform randomization. For example, in the experimental procedure described by Sachs et al. (2005) (see also Section 5.8), one starts with a collection of human immune system cells. These are divided into batches randomly, without taking into account any property of the cells. When done carefully, the experimenter tries to ensure that, for example, the size of a cell cannot influence the batch it ends up in, by stirring the liquid that contains the cells before pipetting. Then, after randomly assigning cells to batches, interventions are performed on each batch separately, by adding some chemical compound to the batch of cells. Finally, properties of each individual cell within each batch are measured. If the system variables reflect the measured properties of the individual cells, and the context variables encode the batch ID, this experimental procedure justifies JCI Assumption 2.

However, one should be careful not to jump to the conclusion that the chemical compound administered to the batch is what actually causes the observed system behavior, as there may be other factors that vary across batches due to unintentional side effects of the experimental procedure.
For example, the lab assistant who carries out the experiment for a particular batch of cells might influence the outcome, because slightly different experimental procedures are used by different lab assistants. Another example is that the time of day may affect the measurements, and also correlate with batch ID. In situations like these, identifying the batch ID with the chemical compound administered to that batch could be misleading, and could lead one to incorrectly attribute the inferred causal relation between batch ID and a certain system variable to the causal effect of the intended intervention corresponding to that batch. This is a subtle type of error that the causal modeler should beware of. Even though we have good reasons to assume that proper randomization was performed for batch ID in the Sachs et al. (2005) experiment, it is questionable whether the interpretation of the context variables as concerning solely the addition of certain chemical compounds (and not any other factors that actually varied across batches) is appropriate.

The issue can also be understood by noting that JCI Assumption 2 may not be preserved when marginalizing out context variables, as illustrated in Figure 8. The following example describes a situation in which this phenomenon may occur.

Example 1 Consider a randomized trial setup for establishing whether sugar causes plants to grow. Context variable $C_\alpha$ denotes the coin flip result, $C_\beta$ indicates whether sugar is administered to the plant, and $C_\gamma$ indicates whether water is administered to the plant. The experimenter decided to use an experimental design with two groups, assigning plants to groups by a coin flip. One group of plants was administered a solution of sugar dissolved in water on a daily basis; the other (control) group was not treated in any way.
The growth rate $X_1$ of the plants was measured for both groups. Suppose the following experimental design was used:

  P(C = c) | C_α (coin flip) | C_β (sugar) | C_γ (water)
  1/2      | 0               | 0           | 0
  1/2      | 1               | 1           | 1

If one would take only context variable $C_\beta$ (did the plant get sugar?) into account and treat $C_\alpha$ and $C_\gamma$ as latent, as in Figure 8(b), and would make JCI Assumptions 1 and 2, one would arrive at the (wrong) conclusion that sugar causes plants to grow. However, if one would take all three context variables into account, and make JCI Assumptions 1 and 2, one would obtain the correct conclusion that at least one of the three context variables must cause plants to grow.

A simple remedy to avoid the wrong conclusion if only $C_\beta$ is observed would be to drop JCI Assumption 2: then it is no longer identifiable whether $C_\beta$ causes $X_1$, or whether $C_\beta$ and $X_1$ are merely confounded.

3.4.4 JCI Assumption 3

We have seen that JCI Assumption 1 is often easily justifiable, but the applicability of JCI Assumption 2 may be less obvious in practice. We now state JCI Assumption 3, which can be useful whenever both JCI Assumptions 1 and 2 have been made as well.

Figure 8: Confounding between system and context variables due to unobserved context variables. (a) If all three context variables $C_\alpha, C_\beta, C_\gamma$ are observed, JCI Assumption 2 would be valid. (b) After marginalizing out $C_\alpha$ and $C_\gamma$, leaving only context variable $C_\beta$ as observed, JCI Assumption 2 is no longer valid.

Assumption 3 ("Generic context model", optional) The context graph[13] $\mathcal{G}(\mathcal{M})_{\mathcal{K}}$ is of the following special form: $\forall k \neq k' \in \mathcal{K}: k \leftrightarrow k' \in \mathcal{G}(\mathcal{M}) \wedge k \to k' \notin \mathcal{G}(\mathcal{M})$.

In Figure 9(b), this assumption is satisfied, while in Figure 9(a), it is not.
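Returning for a moment to Example 1: it can be mimicked in a few lines of simulation. The parameterization below is our own toy choice, assuming for the sketch that only water affects growth; observing $C_\beta$ alone then makes sugar look like a strong cause of growth:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Example 1's design: a coin flip assigns each plant to the treated group,
# which receives sugar AND water together, so C_beta = C_gamma = C_alpha.
C_alpha = rng.integers(0, 2, size=n)  # coin flip
C_beta = C_alpha                      # sugar administered?
C_gamma = C_alpha                     # water administered?

# Toy assumption of this sketch: only water affects growth (effect size 1.0).
X1 = 1.0 * C_gamma + rng.normal(scale=0.1, size=n)

# Looking at C_beta alone, growth differs sharply between the sugar and
# no-sugar groups, inviting the wrong conclusion that sugar causes growth.
effect_beta = float(X1[C_beta == 1].mean() - X1[C_beta == 0].mean())
print(round(effect_beta, 2))
```

The apparent "effect" of sugar is entirely inherited from the latent co-varying water treatment, which is exactly the failure mode of Figure 8(b).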
We will show that JCI Assumption 3 seems stronger than it is, since it can be made without loss of generality in many cases occurring in practice. In order to formulate and prove that claim precisely, the following definition is needed.

Definition 12 Given an SCM $\mathcal{M}$ satisfying JCI Assumption 0, define the conditional system graph $\mathcal{G}(\mathcal{M})_{\mathrm{do}(\mathcal{K})}$ as the DMG with context nodes $\mathcal{K}$ and system nodes $\mathcal{I}$, and as directed and bidirected edges those edges in $\mathcal{G}(\mathcal{M})$ that contain at least one system node in $\mathcal{I}$ (i.e., excluding edges between context nodes). We will graphically represent the system nodes $\mathcal{I}$ of $\mathcal{G}(\mathcal{M})_{\mathrm{do}(\mathcal{K})}$ by ellipses and the context nodes $\mathcal{K}$ of $\mathcal{G}(\mathcal{M})_{\mathrm{do}(\mathcal{K})}$ by squares.

Figure 9(c) shows the common conditional system graph for SCMs with graph as given in Figure 9(a) and for SCMs with graph as given in Figure 9(b). The conditional system graph provides a particular graphical representation for an SCM with context and system variables that is less expressive than its graph. This representation is useful when we are not interested in describing relationships between context variables, but only in describing the relationships between system variables and how the context affects the system.

The following key result essentially states that when one is only interested in modeling the causal relations involving the system variables (under JCI Assumptions 1 and 2), one does not need to care about the causal relations between the context variables, as long as one correctly models the context distribution.

[13] Remember that $\mathcal{G}(\mathcal{M})_{\mathcal{K}}$ denotes the subgraph on the context variables $\mathcal{K}$ induced by the causal graph $\mathcal{G}(\mathcal{M})$.
Theorem 13 Assume that JCI Assumptions 0, 1 and 2 hold for SCM $\mathcal{M}$:

$$\mathcal{M}: \begin{cases} C_k = f_k(C_{\mathrm{pa}_H(k) \cap \mathcal{K}}, E_{\mathrm{pa}_H(k) \cap \mathcal{J}}), & k \in \mathcal{K}, \\ X_i = f_i(X_{\mathrm{pa}_H(i) \cap \mathcal{I}}, C_{\mathrm{pa}_H(i) \cap \mathcal{K}}, E_{\mathrm{pa}_H(i) \cap \mathcal{J}}), & i \in \mathcal{I}, \\ P(E) = \prod_{j \in \mathcal{J}} P(E_j). \end{cases}$$

For any other SCM $\tilde{\mathcal{M}}$ satisfying JCI Assumptions 0, 1 and 2 that is the same as $\mathcal{M}$ except that it models the context differently, i.e., of the form

$$\tilde{\mathcal{M}}: \begin{cases} C_k = \tilde f_k(C_{\mathrm{pa}_{\tilde H}(k) \cap \mathcal{K}}, E_{\mathrm{pa}_{\tilde H}(k) \cap \tilde{\mathcal{J}}}), & k \in \mathcal{K}, \\ X_i = f_i(X_{\mathrm{pa}_H(i) \cap \mathcal{I}}, C_{\mathrm{pa}_H(i) \cap \mathcal{K}}, E_{\mathrm{pa}_H(i) \cap \mathcal{J}}), & i \in \mathcal{I}, \\ P(E) = \prod_{j \in \tilde{\mathcal{J}}} P(E_j), \end{cases}$$

with $\mathcal{J} \subseteq \tilde{\mathcal{J}}$ and $\mathrm{pa}_H(i) = \mathrm{pa}_{\tilde H}(i)$ for all $i \in \mathcal{I}$, we have that:

(i) the conditional system graphs coincide: $\mathcal{G}(\mathcal{M})_{\mathrm{do}(\mathcal{K})} = \mathcal{G}(\tilde{\mathcal{M}})_{\mathrm{do}(\mathcal{K})}$;

(ii) if $\tilde{\mathcal{M}}$ and $\mathcal{M}$ induce the same context distribution, i.e., $P_{\mathcal{M}}(C) = P_{\tilde{\mathcal{M}}}(C)$, then for any perfect intervention on the system variables $\mathrm{do}(I, \xi_I)$ with $I \subseteq \mathcal{I}$ (including the non-intervention $I = \emptyset$), $\tilde{\mathcal{M}}_{\mathrm{do}(I, \xi_I)}$ is observationally equivalent to $\mathcal{M}_{\mathrm{do}(I, \xi_I)}$;

(iii) if the context graphs $\mathcal{G}(\tilde{\mathcal{M}})_{\mathcal{K}}$ and $\mathcal{G}(\mathcal{M})_{\mathcal{K}}$ induce the same separations, then also $\mathcal{G}(\tilde{\mathcal{M}})$ and $\mathcal{G}(\mathcal{M})$ induce the same separations (where "separations" can refer to either $d$-separations or $\sigma$-separations).

Proof See Appendix A.

The following corollary of Theorem 13 states that JCI Assumption 3 can be made without loss of generality for the purposes of constraint-based causal discovery if the context distribution contains no conditional independences:

Corollary 14 Assume that JCI Assumptions 0, 1 and 2 hold for SCM $\mathcal{M}$.
Then there exists an SCM $\tilde{\mathcal{M}}$ that satisfies JCI Assumptions 0, 1, 2 and 3, such that:

(i) the conditional system graphs coincide: $\mathcal{G}(\mathcal{M})_{\mathrm{do}(\mathcal{K})} = \mathcal{G}(\tilde{\mathcal{M}})_{\mathrm{do}(\mathcal{K})}$;

(ii) for any perfect intervention on the system variables $\mathrm{do}(I, \xi_I)$ with $I \subseteq \mathcal{I}$ (including the non-intervention $I = \emptyset$), $\tilde{\mathcal{M}}_{\mathrm{do}(I, \xi_I)}$ is observationally equivalent to $\mathcal{M}_{\mathrm{do}(I, \xi_I)}$;

(iii) if the context distribution $P_{\mathcal{M}}(C)$ contains no conditional or marginal independences, then the same $\sigma$-separations hold in $\mathcal{G}(\tilde{\mathcal{M}})$ as in $\mathcal{G}(\mathcal{M})$; if, in addition, the Directed Global Markov Property holds for $\mathcal{M}$, then also the same $d$-separations hold in $\mathcal{G}(\tilde{\mathcal{M}})$ as in $\mathcal{G}(\mathcal{M})$.

Proof This follows from Theorem 13 by showing that there exists an $\tilde{\mathcal{M}}$ that satisfies all requirements of Theorem 13 and, by construction, JCI Assumption 3, and that induces the same context distribution as $\mathcal{M}$ does. For a detailed proof, see Appendix A.

Corollary 14 gives sufficient conditions for $\tilde{\mathcal{M}}$ and $\mathcal{M}$ to be equivalent for our purposes. An example illustrating this corollary is provided in Figure 9.

Figure 9: Example graphs of (a) a true SCM $\mathcal{M}$ and (b) the modified SCM $\tilde{\mathcal{M}}$ constructed in the proof of Corollary 14 that satisfies JCI Assumption 3, and (c) their corresponding conditional system graph $\mathcal{G}(\mathcal{M})_{\mathrm{do}(\mathcal{K})}$.

Figure 10: Example illustrating that the genericity assumption in statement (iii) of Corollary 14 is necessary. Graphs of (a) the true SCM $\mathcal{M}$ and (b) the modified SCM $\tilde{\mathcal{M}}$ constructed in the proof of Corollary 14, which are not Markov equivalent. The graph in (a) is identifiable under JCI Assumptions 0-2. The joint distribution $P_{\mathcal{M}}(X, C)$ is not faithful with respect to the graph in (b), which is the minimal graph that also satisfies JCI Assumption 3 such that $P_{\mathcal{M}}(X, C)$ is Markov with respect to it.
JCI Assumption 3 is typically made for convenience. When our aim is not to model the causal relations between the context variables, but just to use the context variables as an aid to model the causal relations between system variables and between context and system variables, Corollary 14 shows that we may assume JCI Assumption 3 without loss of generality, provided that JCI Assumptions 1 and 2 are made and the context distribution contains no (conditional) independences. The causal discovery algorithm then does not need to waste time on learning the causal relations between context variables, but can focus directly on learning the causal relations involving the system variables.

Note that the genericity assumption in statement (iii) of Corollary 14 (i.e., $P_{\mathcal{M}}(C)$ containing no conditional independences) is necessary, as the simple counterexample in Figure 10 shows. Depending on how well the causal discovery algorithm can handle faithfulness violations, model misspecification due to incorrectly assuming JCI Assumption 3 even though $P(C)$ contains conditional independences might prevent successful identification of the causal relationships between system variables. Therefore, it is prudent to check that the empirical context distribution $\hat P(C)$ indeed contains no conditional independences before making JCI Assumption 3.

Table 1: Example of a diagonal design with 5 context variables. If the context variables are indicators of interventions, the context with $C_k = 0$ for all $k \in \mathcal{K}$ corresponds to the purely observational setting, and the other contexts, in which one $C_k = 1$ and the others $C_l = 0$ for $l \neq k$, each correspond to a particular intervention.

  C_α C_β C_γ C_δ C_ε | possible interpretation
  0   0   0   0   0   | observational
  1   0   0   0   0   | intervention α
  0   1   0   0   0   | intervention β
  0   0   1   0   0   | intervention γ
  0   0   0   1   0   | intervention δ
  0   0   0   0   1   | intervention ε
An example of a common situation in which the context distribution contains no conditional independences is what we refer to as a diagonal design (see also Table 1). This is a simple experimental design that is often used to discover the effects of single interventions when one is not interested in understanding the interactions that multiple interventions might have. Note that two non-constant binary variables $X, Y$ can only be independent if $P(X = 1, Y = 1) > 0$. Moreover, they can only be conditionally independent given a third discrete variable $Z$ if $P(X = 1, Y = 1 \mid Z = z) > 0$ for all $z$ with $P(Z = z) > 0$. Therefore, each pair of context variables in a diagonal design is dependent (as there is no context in which a pair of context variables simultaneously takes the value 1), even conditionally on any subset of the other context variables. In other words, the context distribution $P(C)$ corresponding to any such diagonal design (with non-zero probability for each context) contains no conditional independences.

JCI Assumption 3 can easily be modified for situations in which the context distribution does contain conditional independences. For example, in the extreme case in which all context variables are jointly independent, one would simply assume that $\mathcal{G}(\mathcal{M})$ contains no directed and no bidirected edges between context variables. Such situations may occur for symmetric experimental designs in which all context variables are jointly independent by design (for example, factorial designs with equal sample sizes in each experimental context). However, we believe that this occurs less often in practice than the generic case in which all context variables are (conditionally) dependent, because resource constraints often lead experimenters to deviate from completely symmetric experimental designs.
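The pairwise-dependence argument for the diagonal design can be checked mechanically. This sketch enumerates the six contexts of Table 1 (assuming, for illustration, a uniform experimental design) and verifies that the factorization required for independence fails for every pair of context indicators:

```python
from itertools import combinations

# The diagonal design of Table 1: one observational context plus one
# single-intervention context per variable; uniform weights are an assumption.
contexts = [(0, 0, 0, 0, 0)] + [tuple(int(i == j) for i in range(5)) for j in range(5)]
p = 1.0 / len(contexts)

def prob(event):
    """Probability of {C_i = v for (i, v) in event} under the design."""
    return sum(p for c in contexts if all(c[i] == v for i, v in event.items()))

# For every pair (k, l): P(C_k=1, C_l=1) = 0 while both marginals are 1/6 > 0,
# so the product rule P(C_k=1, C_l=1) = P(C_k=1) P(C_l=1) fails.
dependent_pairs = sum(
    prob({k: 1, l: 1}) != prob({k: 1}) * prob({l: 1})
    for k, l in combinations(range(5), 2)
)
print(dependent_pairs)  # all 10 pairs are dependent
```

The same enumeration extends to conditional independences by restricting the sum to contexts consistent with the conditioning event.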
Therefore, rather than assuming the context variables to be jointly independent as a default, we have opted here for the more generic default of assuming that no conditional independences hold between context variables in the context distribution.

More generally, one could replace JCI Assumption 3 by assuming that $\mathcal{G}(\mathcal{M})_{\mathcal{K}}$ equals a certain graph that expresses the known conditional independences in the experimental design. Theorem 13 can be applied to these more general situations as well, and shows that for the purpose of constraint-based causal discovery, any context graph that implies the observed conditional independences (i.e., any graph that is Markov equivalent to the true context graph) works. Another alternative is to omit JCI Assumption 3 and instead try to infer the context subgraph $\mathcal{G}(\mathcal{M})_{\mathcal{K}}$ from the data. This would typically be computationally more expensive, but in our experience does not seem to make much of a difference in terms of accuracy (as we report in Section 5).

JCI Assumption 3 only makes sense when both JCI Assumptions 1 and 2 are made. If we would not make JCI Assumption 1 or 2, the causal relations between the observed context variables will in general have testable consequences in the joint distribution. For an example of this,[14] see Figure 11. Here, $C_\alpha$ could be "lab", and $C_\beta$ could be the "temperature" at which an experiment is performed.

Figure 11: If JCI Assumption 2 does not apply, the causal relations between context variables have testable consequences for the conditional independences in the joint distribution. (a) $X_1 \not\perp\!\!\!\perp X_2 \mid \{C_\alpha, C_\beta\}$; (b) $X_1 \perp\!\!\!\perp X_2 \mid \{C_\alpha, C_\beta\}$.
In this case, we get different conditional independences in the joint distribution $P(X_1, X_2, C_\alpha, C_\beta)$ if lab causes temperature than if they are confounded (for example, by geographical location). Something similar can happen if context variables are caused by system variables.

3.4.5 Summary of JCI Assumptions and Other Background Knowledge

Summarizing, the JCI framework rests on different assumptions, one of which is required, whereas the others are all optional. The basic required assumption is JCI Assumption 0, which states that the meta-system consisting of context and system can be described by a simple SCM. This is just the standard assumption made throughout the causal discovery literature, but now applied to the meta-system rather than to the system only. In addition, assumptions about the causal relationships of the context variables can be made, which are all optional and can be decided on a case-by-case basis. In most cases, we would expect JCI Assumption 1 (no system variable causes any context variable) to apply. In some cases, JCI Assumption 2 (no system variable is confounded with any context variable) applies as well. If both apply, one can assume JCI Assumption 3 for convenience if the context distribution contains no (conditional) independences. More generally, when our interest is in modeling the causal relations involving system variables only, we need to model only the observed conditional independences in the context distribution, not necessarily the causal relations between context variables.

[14] We are grateful to Thijs van Ommen for pointing this out.

The reader may wonder when one can ever be sure in practice that JCI Assumption 2 applies. There is one very common scenario in which JCI Assumption 2 holds.
This is a scientific experiment in which, in chronological order: (i) an ensemble of systems is prepared in an initial state; (ii) the systems are randomly permuted and randomly divided into batches; (iii) for each batch, all systems in the batch are intervened upon simultaneously in the same way (following an experimental protocol determined in advance); (iv) measurements of the system variables are performed.

The experimental protocol specifies the intended interventions for each batch, which should be completely encoded as context variables. Since the intended interventions have been decided before system variables are measured, the intended interventions cannot be caused by the system variables. Because the systems were randomly permuted and divided into batches, the assigned batch cannot be caused by prior values of the system variables, or by anything else that may also have an effect on the system variables. Because the intended interventions for each system in each batch are determined completely by the batch, this implies that intended interventions and system variables cannot be confounded. As long as the context variables provide a complete encoding of the intended interventions (i.e., the intended interventions are in one-to-one correspondence with values of the context variables), JCI Assumptions 1 and 2 then apply to the context variables.[15] If, additionally, no (conditional) independences hold in the empirical context distribution, we can also make use of JCI Assumption 3 to simplify and speed up the causal discovery procedure.

In more general scenarios, such as the example in the introduction (concerning the question whether playing violent computer games causes aggressive behavior), the validity of JCI Assumption 2 (no confounding between context and system) should not be taken for granted.
For example, it could be that precisely the schools with a more violent population of pupils see themselves forced to actively take measures to promote social behavior. In that case, the level of violence in the past would confound $C_\beta$ (does a school take measures to stimulate social behavior?) and $X_2$ (how violently do the pupils of the school behave?). Thus, in scenarios like these, it seems safer not to rely on JCI Assumption 2, as incorrectly assuming it might lead to wrong conclusions (although we do not currently understand the precise impact of such model misspecification).

For causal discovery in the JCI framework, knowledge of the intervention targets (or, more generally, of which system variables are directly affected by which context variables) is not necessary, but it is certainly helpful and can be exploited similarly to other available background knowledge, depending on the algorithm used to implement JCI. When applying JCI to a combination of different interventional data sets, intervention targets can be learnt from data when they are not known (as the direct effects of intervention variables), similarly to how the effects of system variables can be learnt.

[15] If the experimenter sticks to the experimental protocol that was fixed before the experiment was performed, any possible influence of the system variables on the performed interventions is excluded, and therefore the performed interventions will equal the intended interventions. This means that the JCI modeling framework (with JCI Assumptions 1 and 2) applies also when interpreting the context variables as the performed interventions. This may explain why it is considered good scientific practice to perform an experiment according to an experimental protocol that was fixed beforehand, and not to deviate from it in case of unexpected measurement outcomes, for example.
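The chronological protocol (i)-(iv) described above can be mimicked in a short simulation; all numbers and the "property" variable are illustrative assumptions. Random division into batches makes the batch ID carry essentially no information about any pre-existing system property, which is the content of JCI Assumption 2:

```python
import numpy as np

rng = np.random.default_rng(2)
n_systems, n_batches = 60_000, 3

# Step (i): systems prepared with some pre-existing property (arbitrary units).
prop = rng.normal(loc=10.0, scale=2.0, size=n_systems)

# Step (ii): random permutation into equally sized batches, ignoring all
# properties of the systems (mimicking stirring before pipetting).
batch = rng.permutation(np.repeat(np.arange(n_batches), n_systems // n_batches))

# Steps (iii)-(iv) would intervene per batch and then measure; already at this
# point, the per-batch means of the prior property agree up to sampling noise,
# i.e., batch ID and prior properties are empirically unconfounded.
means = [float(prop[batch == b].mean()) for b in range(n_batches)]
print([round(m, 2) for m in means])
```

If the batch assignment instead depended on the property (e.g., sorting by size), the per-batch means would diverge and Assumption 2 would fail.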
One main advantage of the JCI framework is that it offers a unified way to deal with different types of interventions, as discussed in Section 3.3. Therefore, knowledge of intervention types (e.g., is it a perfect intervention, or a mechanism change?) is also not necessary, but can still be helpful, as it provides additional background knowledge that may be exploited for causal discovery.

In concluding this subsection, we observe that the JCI framework generalizes and combines the ideas of causal discovery from purely observational data and of causal discovery by means of randomized controlled trials. Indeed, note that if JCI is applied to a single context (i.e., 0 context variables), it reduces to the standard setting of causal discovery from purely observational data described in Section 2.3. If JCI is applied to a setting with a single context variable and a single system variable, JCI (with Assumptions 1 and 2) reduces to the randomized controlled trial setting described in Section 2.2. Therefore, the Joint Causal Inference framework truly generalizes both these special cases.

4. Causal Discovery from Multiple Contexts with JCI

In this section, we discuss how causal discovery from multiple contexts can be performed in the Joint Causal Inference framework. Our starting point is the assumption that some model of the form (5) is an appropriate causal model for the system and its context, and that we have obtained samples of all system variables in multiple contexts.[16] Suppose that the exact model $\mathcal{M}$, and in particular its causal graph $\mathcal{G}(\mathcal{M})$, are unknown to us. The goal of causal discovery is to infer as much as possible about the causal graph $\mathcal{G}(\mathcal{M})$ from the available data and from available background knowledge about context and system. Let us denote the data set for context $c \in \mathcal{C}$ as $D^{(c)} = \big( (x^{(c)}_{in})_{i \in \mathcal{I}} \big)_{n=1}^{N_c}$, and for simplicity, assume that no values are missing.
The number of samples in each context, given by $N_c$, is allowed to depend on the context. As a first step, we pool the data, representing it as a single data set $D = (x_n, c_n)_{n=1}^N$, where $N = \sum_{c \in \mathcal{C}} N_c$. We then assume that $D$ is an i.i.d. sample of $P_{\mathcal{M}}(X, C)$, where $(X, C, E)$ is a solution of the SCM $\mathcal{M}$ of the form (5).[17]

In setting up the problem, we have made the simplifying assumptions that the measurement procedure is subject neither to selection bias nor to (independent) measurement error (Blom et al., 2018). We will assume that the data has been generated by an SCM in accordance with JCI Assumption 0, and optionally, a subset of JCI Assumptions 1, 2 and 3. To enable constraint-based causal discovery, we will assume that the joint distribution $P_{\mathcal{M}}(X, C)$ is faithful with respect to the graph $\mathcal{G}(\mathcal{M})$, using the appropriate separation criterion ($\sigma$-separation in general, or $d$-separation for specific cases, as discussed in Section 4.1). We will discuss the ramifications of the faithfulness assumption in more detail in Section 4.1.

Definition 15 We say that a particular feature of $\mathcal{G}(\mathcal{M})$ is identifiable from $P_{\mathcal{M}}(X, C)$ and background knowledge if the feature is present in the graph $\mathcal{G}(\tilde{\mathcal{M}})$ of any SCM $\tilde{\mathcal{M}}$ with $P_{\tilde{\mathcal{M}}}(X, C) = P_{\mathcal{M}}(X, C)$ that incorporates the background knowledge.

"Feature" could refer to the presence or absence of a direct edge, a directed path, a bidirected edge, arbitrary subgraphs, or even the complete graph. The task of causal discovery is then to identify as many features of $\mathcal{G}(\mathcal{M})$ as possible from the data (the i.i.d. sample $D$ of $P_{\mathcal{M}}(X, C)$) and the available background knowledge.

The key insight of the Joint Causal Inference framework that allows one to deal with data from multiple contexts $(D^{(c)})_{c \in \mathcal{C}}$ is that by incorporating the context variables explicitly, and pooling the data, we have now reduced the causal discovery problem to one that is mathematically equivalent to causal discovery from purely observational data $D$ and applicable background knowledge on the causal relations between context and system variables (a subset of JCI Assumptions 1 and 2). If applicable, JCI Assumption 3 can be made to reduce the computational effort.

[16] An interesting problem setting considered by several researchers (Claassen and Heskes, 2010; Tillman and Spirtes, 2011; Triantafillou and Tsamardinos, 2015; Hyttinen et al., 2014; Forré and Mooij, 2018), which we do not consider here, would be to allow, for each context, a (possibly context-dependent) subset of system variables to remain unobserved.

[17] Although this may sound like an innocuous assumption, it is not necessarily satisfied by the data-generating process. For example, suppose that in a randomized controlled trial it is decided a priori that a certain number $N_0$ of patients will be assigned to the control group, and a number $N_1$ of patients to the treatment group, but which patients end up in which group is completely randomized. The resulting pooled data is not i.i.d.; indeed, if we repeat this procedure, we will always end up with the same number of patients in each group, whereas for an i.i.d. sample, the numbers would fluctuate around their expected values. Nevertheless, this assumption can be made here without losing much generality. In particular, for the case of binary treatment and binary outcome, Wasserman (2004, Section 15.5) shows that for independence tests based on the (log) odds ratio the i.i.d. assumption can be weakened accordingly. Alternatively, in a bootstrapping procedure (which we will use in practice for most implementations of JCI in Section 5), the resampled pooled data is i.i.d. by construction.
This trick also allows us to learn intervention targets from data in the same way that we usually learn causal effects between variables from data: the intervention targets are simply encoded as the direct effects of the intervention variables. After discussing the faithfulness assumption in more detail, we will give a few suggestions for how JCI can be implemented in Section 4.2.

4.1 Faithfulness Assumption

In this subsection we discuss the subtleties of the faithfulness assumption in the JCI setting and compare it with alternative faithfulness assumptions that have been made in the literature.

Given a simple SCM M of the form (5) with graph G := G(M), the joint distribution P_M(X, C) induced by the SCM satisfies the Generalized Directed Global Markov Property (Theorem 8) with respect to the graph G of the SCM, i.e., any σ-separation U ⊥^σ_G V | W between sets of nodes U, V, W ⊆ I ∪ K in the graph G implies a conditional independence X̃_U ⊥⊥_{P_M(X,C)} X̃_V | X̃_W, where we write X̃ := (X, C). Under the additional assumptions of Theorem 9, the stronger Directed Global Markov Property holds (i.e., the d-separation criterion).

For constraint-based causal discovery, some type of faithfulness assumption is usually made. For simplicity, the faithfulness assumption that we make in this work is the standard one, but we apply it to the combination of the system and its environment: we assume that the joint distribution P_M(X, C) is faithful with respect to the graph G(M) of M. In other words, any conditional independence X̃_U ⊥⊥_{P_M(X,C)} X̃_V | X̃_W for sets of nodes U, V, W ⊆ I ∪ K is due to the σ-separation U ⊥^σ_G V | W in G (or d-separation, if applicable), and no other conditional independences in P_M(X, C) exist.
In particular, this assumption rules out any conditional independence between context variables in case JCI Assumption 3 is made. If JCI Assumption 3 is not made, conditional independences in the context distribution that are faithfully described by some DMG are allowed.

This faithfulness assumption allows us to deal with different types of interventions, including perfect interventions. For example, for the perfect intervention on X_2 illustrated in Figure 7, the causal graphs G^(c)_I (restricted to the system variables {X_i}_{i∈I}) depend on the context c ∈ C: in the observational context (C_α = 0), X_1 → X_2, whereas in the interventional context (C_α = 1) this direct causal relation is no longer present (as it has been overruled by the perfect intervention). This does not invalidate the faithfulness of the joint distribution P(C_α, X_1, X_2, X_3) with respect to the joint causal graph. Indeed, even though X_1 ⊥⊥ X_2 | C_α = 1, we still have X_1 ⊥̸⊥ X_2 | C_α because X_1 ⊥̸⊥ X_2 | C_α = 0.^18 In other words, the fact that P(X | C_α = 1) is not faithful to the system subgraph G_I (i.e., the induced subgraph of the causal graph G on the system nodes I) does not lead to any problem as long as we do not test for independences in the subset of data corresponding to context C_α = 1 separately, but restrict ourselves to testing independences only in the pooled data set that combines all contexts.

Causal discovery methods that analyze data from each context separately (e.g., Hauser and Bühlmann, 2012; Triantafillou and Tsamardinos, 2015; Hyttinen et al., 2014) typically make another faithfulness assumption. In our notation, such approaches assume that P(X | C = c) is faithful w.r.t.
a causal subgraph G^(c)_I that may be context-dependent, and must then reason about how these context-dependent subgraphs are related, explicitly relying on knowledge about the type of interventions (typically assuming that the interventions are perfect interventions with known targets). This faithfulness assumption is to a certain extent stronger than ours, because it requires faithfulness of the system within each context. On the other hand, it is to a certain extent weaker than ours, because ours implies restrictions on the context distribution (P_M(C) must be faithful to some DMG, at the very least) that the alternative does not have. We consider the extension of the applicability of JCI due to the faithfulness assumption chosen here (no knowledge of intervention types or targets is needed) to outweigh the limitations in applicability (not every context distribution can be handled). In the rest of this section, we discuss some simple workarounds that can be applied in practice when dealing with faithfulness violations in the context distribution P_M(C).

Under JCI Assumption 3, the faithfulness assumption implies that the context distribution P(C) does not contain any conditional independences. In case the empirical context distribution P̂(C) does contain conditional independences, one has several options. The first (assuming also JCI Assumptions 1 and 2 are made) is to modify the assumed graph of the context variables in JCI Assumption 3 such that the context distribution is faithful to it. For example, if all context variables are jointly independent, one could simply assume that the context graph G(M)_K has no (directed or bidirected) edges at all. The second option is to omit JCI Assumption 3, or some analogue of it, completely. In that case, the faithfulness assumption still imposes the restriction that the context distribution can be faithfully modeled by some DMG. The third option is to use only data corresponding to a certain subset of the contexts and to ignore data from the other contexts. In addition, one can sometimes work around conditional independences in the context distribution by partitioning the set of context variables into groups and using combined context variables instead of the original ones. This will be illustrated in the next paragraph. Finally, one could ignore the faithfulness violations in the context distribution and hope that the causal discovery algorithm will handle them well. This last approach usually means that it will be harder to guarantee consistency of the approach. Note that the faithfulness assumption for the context variables is actually testable, since the empirical context distribution is available and can be directly tested for conditional independences.

The faithfulness assumption also rules out deterministic relations between the variables that lead to faithfulness violations. In particular, there could be deterministic relations between context variables. For example, in the experimental design of the experiments in Sachs et al. (2005) described in Tables 2 and 3, C_α is a (deterministic) function of C_θ and C_ι: C_α = ¬(C_θ ∨ C_ι). One might naïvely believe that this could be dealt with by simply removing context variable C_α from consideration, leaving only context variables C_β, ..., C_ι as observed context variables, none of which is a (deterministic) function of the others.

18. Note that for a discrete context domain C, we have that A ⊥⊥ B | C if and only if A ⊥⊥ B | C = c for all c with p(c) > 0. More generally, A ⊥⊥ B | C if and only if A ⊥⊥ B | C = c for almost all c.
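Since the empirical context distribution is fully observed, deterministic relations such as C_α = ¬(C_θ ∨ C_ι) can be detected directly from it. A minimal sketch (the helper `is_function_of` and the short variable names are hypothetical):

```python
# Detect whether a context variable is a deterministic function of other context
# variables across the observed contexts, by checking that equal predictor values
# never map to different target values.

def is_function_of(contexts, target, predictors):
    """contexts: list of dicts {variable: value}, one per observed context.
    Returns True iff `target` is a function of `predictors` in these contexts."""
    seen = {}
    for row in contexts:
        key = tuple(row[p] for p in predictors)
        if key in seen and seen[key] != row[target]:
            return False
        seen[key] = row[target]
    return True

# Sachs et al.-style design: Ca = NOT(Ct OR Ci) over the observed contexts.
rows = [{"Ca": int(not (t or i)), "Ct": t, "Ci": i}
        for t, i in [(0, 0), (1, 0), (0, 1)]]
assert is_function_of(rows, "Ca", ["Ct", "Ci"])
```

In the same spirit, plain (conditional) independence tests on the empirical context distribution can be used to check the faithfulness assumption for the context variables before running causal discovery.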
However, marginalizing out context variables may give rise to violations of JCI Assumption 2, as we have seen in Section 3.4.3. An operation that is generally allowed is grouping context variables together. In the case of the Sachs et al. (2005) experimental design, we can combine C_α, C_θ and C_ι into a single context variable given by the triple (C_α, C_θ, C_ι). Incidentally, in this case this would be mathematically equivalent to the pair (C_θ, C_ι), but the interpretation of (C_α, C_θ, C_ι) is different from that of (C_θ, C_ι). Another option would be to ignore a subset of the contexts. In this case, one could exclude the two contexts with C_θ = 1 or C_ι = 1. Then C_α becomes a constant, and constants can be safely ignored (or trivially combined with any other context variable).^19

To wrap up: under JCI Assumption 3, it is advisable to check that there are indeed no conditional independences between context variables in the empirical context distribution. More generally, one should check whether the conditional independences in the context distribution can be faithfully described by a DMG. If not, one can try to work around this by applying the tricks discussed in the last paragraph (grouping context variables and omitting certain contexts). When grouping context variables, one should note that the inferred causal relations from context variables to system variables may no longer be easily interpretable. A simple example that illustrates this is to consider two interventions that are always performed together: when drug A is prescribed, drug B is also prescribed, and vice versa. In that case we cannot be sure whether the effect on the outcome is due to drug A or to drug B (or to both combined). Nonetheless, we can still use the inferred causal relations between system variables.

19.
In an earlier draft of this work (Magliacane et al., 2016a), we proposed to handle deterministic relations between context variables by using the notion of D-separation, first presented in Geiger et al. (1990) and later extended in Spirtes et al. (2000). However, this notion does not provide a complete characterization of conditional independences due to a combination of graph structure and deterministic relations. Therefore, in this work we use the simpler techniques of grouping context variables and ignoring certain contexts to deal with faithfulness violations due to deterministic relations between context variables.

C_α C_β C_γ C_δ C_ε C_ζ C_η C_θ C_ι   N_c
 1   0   0   0   0   0   0   0   0    853
 1   1   0   0   0   0   0   0   0    902
 1   0   1   0   0   0   0   0   0    911
 1   0   0   1   0   0   0   0   0    723
 1   0   0   0   1   0   0   0   0    810
 1   0   0   0   0   1   0   0   0    799
 1   0   0   0   0   0   1   0   0    848
 0   0   0   0   0   0   0   1   0    913
 0   0   0   0   0   0   0   0   1    707
 1   1   1   0   0   0   0   0   0    899
 1   1   0   1   0   0   0   0   0    753
 1   1   0   0   1   0   0   0   0    868
 1   1   0   0   0   1   0   0   0    759
 1   1   0   0   0   0   1   0   0    927

C_β C_γ C_δ C_ε C_ζ C_η   (C_α, C_θ, C_ι)   N_c
 0   0   0   0   0   0       (1,0,0)        853
 1   0   0   0   0   0       (1,0,0)        902
 0   1   0   0   0   0       (1,0,0)        911
 0   0   1   0   0   0       (1,0,0)        723
 0   0   0   1   0   0       (1,0,0)        810
 0   0   0   0   1   0       (1,0,0)        799
 0   0   0   0   0   1       (1,0,0)        848
 0   0   0   0   0   0       (0,1,0)        913
 0   0   0   0   0   0       (0,0,1)        707
 1   1   0   0   0   0       (1,0,0)        899
 1   0   1   0   0   0       (1,0,0)        753
 1   0   0   1   0   0       (1,0,0)        868
 1   0   0   0   1   0       (1,0,0)        759
 1   0   0   0   0   1       (1,0,0)        927

Table 2: Left: Experimental design used by Sachs et al. (2005). N_c is the number of data samples in context c. The interpretation of the context variables is provided in Table 3. Right: A different choice of context variables: C_α, C_θ and C_ι have been grouped into a single combined context variable (C_α, C_θ, C_ι) in order to deal with the deterministic relation C_α = ¬(C_θ ∨ C_ι) (see main text for details).
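The grouping operation illustrated in Table 2 can be sketched as follows (a minimal sketch with a hypothetical dictionary-based representation of a row of context values):

```python
# Replace a set of context variables (e.g., C_alpha, C_theta, C_iota in the Sachs
# et al. design) by a single combined context variable holding their value tuple.

def group_context_vars(row, group, keep):
    """row: dict of context-variable values. Returns a new row in which the
    variables in `group` are fused into one tuple-valued combined variable."""
    new_row = {k: row[k] for k in keep}
    new_row[tuple(group)] = tuple(row[k] for k in group)
    return new_row

row = {"Ca": 1, "Cb": 1, "Ct": 0, "Ci": 0}
grouped = group_context_vars(row, group=["Ca", "Ct", "Ci"], keep=["Cb"])
assert grouped == {"Cb": 1, ("Ca", "Ct", "Ci"): (1, 0, 0)}
```

Applying this map to every row of the pooled data yields the right half of Table 2 and removes the deterministic relation from the set of context variables.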
     Reagent            Intervention
C_α  α-CD3, α-CD28      global activator
C_β  ICAM-2             global activator
C_γ  AKT inhibitor      activity of AKT
C_δ  G0076              activity of PKC
C_ε  Psitectorigenin    abundance of PIP2
C_ζ  U0126              MEK activity
C_η  LY294002           PIP2, PIP3 mechanism change
C_θ  PMA                PKC activity
C_ι  β2CAMP             PKA activity

Table 3: For each context variable in Table 2: the reagents used in this experimental setting, and the expected intervention type and targets based on (our interpretation of) the biological background knowledge described in Sachs et al. (2005).

4.2 Implementing JCI

Any causal discovery method that is applicable under the assumptions described in Section 4 can be used for Joint Causal Inference. In this section we describe some concrete examples of JCI implementations. Identifiability may greatly benefit from taking into account the available background knowledge on the causal graph stemming from the applicable JCI assumptions, as discussed in Section 3.4. In addition, taking into account background knowledge on the targets of intervention variables may help considerably. Some logic-based causal discovery methods (e.g., Hyttinen et al., 2014; Triantafillou and Tsamardinos, 2015; Forré and Mooij, 2018) are ideally suited to exploit such background knowledge. For other methods, e.g., FCI (Spirtes et al., 1999; Zhang, 2008a), or methods that focus on ancestral relations, e.g., ACI (Magliacane et al., 2016b), incorporating all background knowledge is less straightforward and, as far as we know, cannot be done with off-the-shelf implementations. Often, simple adaptations of off-the-shelf implementations can be made that do allow one to benefit from all JCI background knowledge. For example, in Section 4.2.4 we propose a simple adaptation of FCI that does so.
Given a causal discovery algorithm for purely observational data that can exploit the JCI background knowledge, we can implement JCI in a straightforward fashion: (i) introduce context variables, if not already provided; (ii) pool all data sets, including the values of the context variables; (iii) handle faithfulness violations between context variables by grouping context variables and/or leaving out certain contexts, if necessary; (iv) apply the causal discovery algorithm to the pooled data, taking into account the appropriate JCI background knowledge. Any soundness, completeness and consistency results that hold for the causal discovery algorithm in the purely observational setting (with the background knowledge taken into account) directly apply to the JCI setting, as long as there is no model misspecification (i.e., as long as the assumed JCI assumptions hold for the true model). If we use JCI Assumption 3 (or something similar), we can use Theorem 13 to show that what the algorithm concludes about the causal relations concerning system variables is still correct.

In the remainder of this subsection, we discuss four JCI implementations. We first describe two existing algorithms (LCD and ICP) that can be used off-the-shelf for causal discovery within the JCI framework. Then we propose two adaptations of existing algorithms (ASD and FCI) so that they can be used for causal discovery within the JCI framework. In Section 5, we provide empirical results on the properties of these four algorithms.

4.2.1 Local Causal Discovery (LCD)

Perhaps the first implementation of JCI (apart from randomized controlled trials) is provided by the LCD algorithm of Cooper (1997). LCD is a very simple constraint-based causal discovery algorithm that can be used in the purely observational causal discovery setting where certain background knowledge is available, and in particular, in the JCI setting.
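LCD checks, for a triple of variables, the pattern X_1 ⊥̸⊥ X_2, X_2 ⊥̸⊥ X_3, X_1 ⊥⊥ X_3 | X_2, given background knowledge that X_2 is not a cause of X_1 (e.g., because X_1 is a context variable under JCI Assumption 1). A minimal oracle-style sketch (the CI oracle `indep` and all variable names are placeholders, not part of any existing implementation):

```python
def lcd_triple(indep, x1, x2, x3):
    """indep(a, b, cond) -> True iff a is independent of b given cond in the
    pooled data (a CI oracle or statistical test). Assumes background knowledge
    that x2 is not a cause of x1. Returns True iff LCD infers the unconfounded
    direct causal relation x2 -> x3."""
    return (not indep(x1, x2, []) and
            not indep(x2, x3, []) and
            indep(x1, x3, [x2]))

# Toy oracle for the chain C -> X2 -> X3: the only independence is C _||_ X3 | X2.
def oracle(a, b, cond):
    return {a, b} == {"C", "X3"} and cond == ["X2"]

assert lcd_triple(oracle, "C", "X2", "X3")
```

In the JCI setting, one would run this check over all triples ⟨C_k, X_i, X_i'⟩ of one context variable and two system variables on the pooled data.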
The basic idea behind the LCD algorithm is the following result (which we generalized to allow for cycles):

Proposition 16 Suppose that the data-generating process on three variables X_1, X_2, X_3 can be represented by a faithful, simple SCM M with I = {1, 2, 3}, and that the sampling procedure is not subject to selection bias. If X_2 is not a cause of X_1 according to M, the following conditional (in)dependences in the observational distribution P_M(X_1, X_2, X_3):

X_1 ⊥̸⊥ X_2,   X_2 ⊥̸⊥ X_3,   X_1 ⊥⊥ X_3 | X_2

imply that the graph G(M) must be one of the three DMGs in Figure 12. In particular,

(i) X_3 is not a cause of X_2 according to M;
(ii) X_2 is a direct cause of X_3 according to M;
(iii) X_2 and X_3 are not confounded according to M;
(iv) the causal effect of X_2 on X_3 is given by:

P_M(X_3 | do(X_2 = x_2)) = P_M(X_3 | X_2 = x_2).    (6)

[Figure 12: All possible DMGs detected by LCD.]

Proof The proof proceeds by enumerating all (possibly cyclic) DMGs on three variables and ruling out the ones that do not satisfy the assumptions. The assumption that X_2 is not a cause of X_1 implies that there is no directed edge X_2 → X_1 in the graph G(M). If there were an edge between X_1 and X_3, X_1 ⊥⊥ X_3 | X_2 would not hold (faithfulness). Also, since X_1 ⊥̸⊥ X_2, X_1 and X_2 must be adjacent (Markov property). Similarly, X_2 and X_3 must be adjacent. X_2 cannot be a collider on any path between X_1 and X_3 (faithfulness). Since the only possible edges between X_1 and X_2 are X_1 → X_2 and X_1 ↔ X_2 (both of which have an arrowhead at X_2), this means that there must be a directed edge X_2 → X_3, and there cannot be a bidirected edge X_2 ↔ X_3 or a directed edge X_2 ← X_3. In other words, the only three possible graphs are the ones in Figure 12.
The causal do-calculus applied to G(M) yields (6).

In a JCI setting where JCI Assumption 1 is made, we can directly apply LCD for causal discovery on triples ⟨C_k, X_i, X_i'⟩ with k ∈ K, i ≠ i' ∈ I, on the pooled data. A conservative version of LCD has been applied by Triantafillou et al. (2017) to the task of inferring signaling networks from mass-cytometry data. A high-dimensional version of LCD has been shown to be successful in predicting the effects of gene knockouts on gene expression levels (Versteeg and Mooij, 2019) from large-scale interventional yeast gene expression data (Kemmeren et al., 2014). An algorithm closely related to LCD, named "Trigger", has been applied to genomics data (Chen et al., 2007). Chen et al. (2007) motivate the JCI assumptions in the setting of learning the causal relations between gene expression levels using single nucleotide polymorphisms (SNPs) as context variables. Since the DNA content cannot be caused by gene expression levels, JCI Assumption 1 is satisfied. Chen et al. (2007) then argue that Mendelian randomization justifies JCI Assumption 2. Finally, a single conditional independence in the pooled data (as in LCD) provides the desired evidence for an unconfounded causal relation between two gene expression levels.

4.2.2 Invariant Causal Prediction (ICP)

ICP exploits invariance of the conditional distribution of a target variable given its direct causes across multiple contexts, assuming that none of the contexts corresponds to an intervention that targets the target variable (Peters et al., 2016). The implementation described in Peters et al. (2016) handles linear relationships and arbitrary interventions (as long as they do not change the conditional distribution of the effect variable given its direct causes), and assumes the absence of latent confounders between the target variable and its direct causes, as well as the absence of cycles involving the target variable.
One of the main advantages of this method over others is that it provides (conservative) confidence intervals on direct causal relationships that do not require the faithfulness assumption; however, this only works under the assumptions of causal sufficiency and acyclicity. The authors discuss several possible extensions to broaden the scope of the method, but do not address this in full generality. A nonlinear extension of the method has been proposed recently (Heinze-Deml et al., 2018). ICP has been successfully applied to predict the effects of gene knockouts on gene expression levels (Meinshausen et al., 2016) from large-scale interventional yeast gene expression data (Kemmeren et al., 2014).

ICP can be interpreted as a particular implementation of the JCI framework, even in the general setting with nonlinear relations between variables and with latent confounders and cycles present (although faithfulness is then required). The following result broadens the conditions under which ICP identifies (possibly indirect) causal relations, strengthening results of Peters et al. (2016):

Corollary 17 Consider the JCI setting with a single context variable C and multiple system variables {X_i}_{i∈I}. Under JCI Assumptions 0 and 1 and faithfulness, the ICP estimator for target i ∈ I:

J*_i := ⋂ { S ⊆ I \ {i} : C ⊥⊥_{P_M(C,X)} X_i | X_S }

satisfies J*_i ⊆ an_{G(M)}(i), i.e., the set J*_i consists only of (possibly indirect) causes of i.

Proof Follows immediately from Proposition 28 in Appendix A.2.

Note that this means that asymptotically, ICP outputs a subset of the ancestors of the target variable, even in the presence of confounders and linear or nonlinear cycles. Hence, ICP can be interpreted as a particular causal discovery algorithm implementing the JCI framework.
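The ICP estimator can be sketched by brute force over all candidate sets (a minimal sketch, exponential in the number of variables; the CI oracle `indep` is a placeholder for a statistical invariance test):

```python
from itertools import combinations

def icp_estimator(indep, target, others):
    """Brute-force version of the estimator J*_i: intersect all subsets S of the
    remaining variables for which C _||_ X_target | X_S holds in the pooled data.
    indep(S) -> True iff C _||_ X_target | X_S (oracle or test).
    Returns the empty set if no invariant subset is found."""
    accepted = []
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            if indep(set(S)):
                accepted.append(set(S))
    if not accepted:
        return set()
    result = accepted[0]
    for S in accepted[1:]:
        result &= S
    return result

# Toy oracle for C -> X1 -> X2 with target X2: C _||_ X2 | S holds iff X1 in S.
assert icp_estimator(lambda S: "X1" in S, "X2", ["X1", "X3"]) == {"X1"}
```

Consistent with Corollary 17, the intersection in the toy example returns only the (indirect) cause X1 of the target.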
4.2.3 ASD-JCI

Here we introduce a novel JCI implementation that builds on the algorithm by Hyttinen et al. (2014), its generalization to σ-separation (Forré and Mooij, 2018), and some of the extensions proposed by Magliacane et al. (2016b). Since adapting this algorithm to the JCI setting is straightforward, and the algorithm itself has been described in detail in the cited papers, we only provide a brief description of how it works here. More details can be found in Appendix C.

Hyttinen et al. (2014) proposed formulating causal discovery as an optimization problem over possible causal graphs, where the loss function sums the weights of all the conditional (in)dependences present in the data that would be violated for a certain underlying causal graph, assuming Markov and faithfulness properties. The input consists of a list of weighted conditional independence statements. The weights λ encode the confidence in the conditional (in)dependence, where a weight of λ = ∞ corresponds to a "hard constraint" (absolute certainty) and a weight of λ = 0 corresponds to "no evidence at all". Hyttinen et al. (2014) provide an encoding of the notion of d-separation in Answer Set Programming (ASP), a declarative programming language that can be used, among other things, for solving discrete optimization problems. Forré and Mooij (2018) generalize the encoding to σ-separation. The optimization problem is solved by means of an off-the-shelf ASP solver. There may be multiple optimal solutions to the optimization problem, because the underlying causal graph may not be identifiable from the inputs. Nonetheless, some of the features of the causal graph (e.g., the presence or absence of a certain directed edge) may still be identifiable. We employ the method proposed by Magliacane et al.
(2016b) for scoring the confidence that a certain feature is present or absent by calculating the difference between the optimal losses under the additional hard constraints that the feature is present vs. that the feature is absent. Magliacane et al. (2016b) showed that this algorithm for scoring features is sound for oracle inputs and asymptotically consistent under reasonable assumptions.

We will make use of the weights proposed in Magliacane et al. (2016b): λ_j = log p_j − log α, where p_j is the p-value of a statistical test for the j-th conditional independence statement, with independence as the null hypothesis, and α is a significance level (e.g., 1%) that should decrease with sample size at a suitable rate. These weights have the desirable property that independences get a lower weight than strong dependences.

As we will need an acronym for this algorithm later, we will henceforth refer to it as ASD (Accounting for Strong Dependencies), as it essentially tries to explain the observed dependences in the data, taking into account their statistical strength. This is fundamentally different from other constraint-based algorithms such as PC or FCI, which give priority to observed independences and do not take the strength of dependences into account.

Taking into account the JCI background knowledge (a subset of JCI Assumptions 1, 2 and 3), and possibly background knowledge on intervention targets, is trivial thanks to the expressive power of ASP, and can be done with a few lines of ASP code. The resulting algorithm is very accurate but scales only up to a few variables due to the combinatorial explosion. Incorporating JCI Assumption 3 considerably reduces the computation time, as it removes the need to learn the causal relations of the context variables.
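The weighting scheme λ_j = log p_j − log α can be sketched as follows (a minimal sketch; the function name and the tuple return convention are our own, not from any existing implementation):

```python
import math

def asd_constraint(p_value, alpha=0.01):
    """Turn a CI-test p-value into a weighted (in)dependence statement, following
    the weighting lambda_j = log p_j - log alpha of Magliacane et al. (2016b):
    the sign of lambda decides independence (p > alpha) vs. dependence (p < alpha),
    and its magnitude is the confidence weight. Strong dependences (tiny p) get
    arbitrarily large weights, while an independence can never exceed -log(alpha)."""
    lam = math.log(p_value) - math.log(alpha)
    return ("independent" if lam > 0 else "dependent", abs(lam))

kind, w_dep = asd_constraint(1e-10)   # very strong dependence
_, w_indep = asd_constraint(0.9)      # independence
assert kind == "dependent" and w_dep > w_indep
```

This exhibits the property mentioned above: independences are bounded in weight, whereas strong dependences can dominate the loss.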
Since the JCI background knowledge can be completely and exactly encoded as constraints on the possible causal graphs, we can directly extend the known soundness, completeness and consistency results for ASD:

Theorem 18 ASD-JCI is sound and complete for oracle inputs. It is asymptotically consistent if the weights are asymptotically consistent.

Proof For the precise meaning of these statements, we refer the reader to Appendix C, and in particular, to Theorems 43 and 44.

4.2.4 FCI-JCI

Here we introduce an adaptation of the constraint-based causal discovery algorithm FCI (Spirtes et al., 1999; Zhang, 2008a) that can be used in a JCI setting. The FCI algorithm was designed to work under the assumption that the data was generated by an acyclic SCM. While FCI can deal with selection bias, we will assume here for simplicity that no selection bias is present.

The FCI algorithm consists of two main phases: an adjacency search phase leading to the skeleton, followed by an edge orientation phase. In the adjacency search, the algorithm searches for conditional independences in order to eliminate edges from the graph. The subsequent orientation stage consists of a set of graphical rules that allow invariant edge marks, signifying either causal (tail marks) or non-causal (arrowhead marks) relations, to be added to the skeleton. For a single observational data set, the final result is a so-called Partial Ancestral Graph (PAG), a concise representation of ancestral relations and conditional independences. The PAG represents a set of Maximal Ancestral Graphs (MAGs) (Richardson and Spirtes, 2002), each MAG represents a set of ADMGs (Triantafillou and Tsamardinos, 2015), and each ADMG represents an infinite set of DAGs (with an arbitrary number of latent variables).
In the case of purely observational data, the PAG output by FCI provides a complete description of the Markov equivalence class (Zhang, 2008a; Ali et al., 2009). For a more detailed discussion of MAGs and PAGs, we refer the reader to Appendix B.1.

Extending FCI such that it can take into account the additional JCI background knowledge on the adjacencies and causal relations between the combined set of context and system variables (see also Section 3.4) is straightforward:

• If JCI Assumption 3 is made, all context variables are connected by bidirected edges, and the adjacency phase of FCI is adapted accordingly by not removing any edges between context variables; afterwards, all edges between context variables are oriented as k ↔ k' for k ≠ k' ∈ K. In the subsequent phase of orienting unshielded triples, only system variables can take on the role of the collider.

• If JCI Assumption 1 is made, then (since we assume no selection bias) any adjacent pair of a context variable k ∈ K and a system variable i ∈ I must be connected in the MAG by an edge with an arrowhead at i. Therefore, after the adjacency phase, all edges between a context and a system variable are oriented as k *→ i, with an arrowhead at the system variable i ∈ I.

• If both JCI Assumptions 1 and 2 are made, any adjacent pair of a context variable k ∈ K and a system variable i ∈ I must be connected in the MAG by a directed edge k → i (since we assume no selection bias). Hence, after the adjacency phase, all edges between a context and a system variable are oriented as k → i, pointing from the context variable k ∈ K to the system variable i ∈ I.

The subsequent orientation phase of the FCI algorithm does not need to be adapted. We will refer to this adaptation of the FCI algorithm as FCI-JCI.
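The post-adjacency orientation rules listed above can be sketched as follows (a minimal sketch with a hypothetical edge-mark representation; real FCI implementations use their own graph data structures):

```python
# Apply the JCI background-knowledge orientations after FCI's adjacency phase.
# Edges: dict {(u, v): (mark_at_u, mark_at_v)}, marks in {"circle","tail","arrow"};
# `context` is the set K of context nodes; `assumptions` is the subset of {1,2,3}
# of JCI assumptions that are made.

def orient_jci_background(edges, context, assumptions):
    oriented = {}
    for (u, v), (mu, mv) in edges.items():
        u_ctx, v_ctx = u in context, v in context
        if u_ctx and v_ctx and 3 in assumptions:
            mu, mv = "arrow", "arrow"                 # k <-> k' between contexts
        elif u_ctx != v_ctx and 1 in assumptions:
            # k -> i if Assumption 2 also holds, otherwise k *-> i
            k_mark = "tail" if 2 in assumptions else "circle"
            if u_ctx:
                mu, mv = k_mark, "arrow"
            else:
                mu, mv = "arrow", k_mark
        oriented[(u, v)] = (mu, mv)
    return oriented

skel = {("C1", "C2"): ("circle", "circle"), ("C1", "X1"): ("circle", "circle")}
out = orient_jci_background(skel, context={"C1", "C2"}, assumptions={1, 2, 3})
assert out[("C1", "C2")] == ("arrow", "arrow")   # C1 <-> C2
assert out[("C1", "X1")] == ("tail", "arrow")    # C1 -> X1
```

Edges between system variables are left untouched and are handled by FCI's standard orientation rules afterwards.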
In particular, we distinguish three variants: FCI-JCI0 (which only makes JCI Assumption 0), FCI-JCI1 (which also makes JCI Assumption 1), and FCI-JCI123 (which also makes JCI Assumptions 1, 2, 3). In Appendix B.4 we discuss how one can read off the identified causal and non-causal relations from the PAG output by FCI or FCI-JCI. Furthermore, in Appendix B.5 we discuss how one can read off the direct targets and direct non-targets of the interventions represented by context nodes from the PAG output by FCI-JCI123.

4.2.5 Speeding up FCI-JCI123

Adding the context nodes may make FCI-JCI considerably slower than FCI on a single context. In this subsection, we propose a further adaptation of FCI-JCI123. By exploiting the following observation, we can achieve a considerable speedup:

Lemma 19 Let M be an SCM that satisfies JCI Assumptions 0, 1, 2. Then for X ⊆ I and Y, Z ⊆ I ∪ K:

X ⊥_{G(M)} Y | Z  ⟹  X ⊥_{G(M)} Y | Z ∪ (K \ Y)

(for both d-separation and σ-separation).

Proof By contradiction: suppose there exists a path ⟨v_0, e_1, v_1, e_2, v_2, ..., e_n, v_n⟩ between X and Y that is open given Z ∪ (K \ Y) and contains no non-endpoint nodes in X ∪ Y, but is closed given Z. That can only happen if the path contains a collider v_i (0 < i < n) that is a G(M)-ancestor of K \ (Y ∪ Z). Then v_i ∈ K (because no system node is an ancestor of K by JCI Assumption 1). Let 1 ≤ j < i be the lowest index such that v_l ∈ K for all l ∈ {j, j+1, ..., i}. Then v_{j−1} ← v_j on the path (by JCI Assumptions 1, 2), where v_{j−1} is in another strongly-connected component than v_j, and therefore v_j ∈ K \ Y blocks the path, which is a contradiction.

This implies that in the skeleton search of FCI-JCI12 and FCI-JCI123, the search spaces for finding separating sets between pairs of nodes can be reduced.
Indeed, instead of testing for each subset B of A ⊆ I ∪ K whether B d-separates node v from node w, one can test for each subset B of A \ K whether B ∪ K \ {v, w} d-separates v from w.^20 For the subsequent stages of the FCI algorithm, it does not matter which separating set is found (as long as some separating set is found whenever one exists), as follows from Zhang (2006, Lemmas 3.2.1 and 3.2.2). This modification of the skeleton phase reduces the worst-case number of conditional independence tests by a factor that is exponential in the number of context variables. We will refer to the version of FCI-JCI123 that implements this modified skeleton search as FCI-JCI123r.

4.2.6 Soundness, Completeness and Consistency Results for FCI-JCI

The FCI algorithm was shown to be sound and complete (Zhang, 2008a) for oracle inputs. In Appendix B, we prove:

Theorem 20 Let M be an acyclic SCM that satisfies JCI Assumption 0. Assume that its distribution P_M(X, C) is faithful w.r.t. the graph G(M). Then, with input P_M(X, C):

• FCI-JCI0 is sound and complete;
• FCI-JCI1 is sound if M also satisfies JCI Assumption 1;
• FCI-JCI12 is sound if M also satisfies JCI Assumptions 1 and 2;
• FCI-JCI123 is sound and complete if M also satisfies JCI Assumptions 1, 2, 3.

Here, sound means that the output of the algorithm is a DPAG that contains the true DMAG(M), and complete means that all edge marks of the true DMAG(M) that can be identified from the (conditional) independences in P_M(X, C) and the JCI background knowledge have been oriented in the DPAG output by the algorithm.

20. Of course, the power of the conditional independence test may be reduced when conditioning on many variables. However, if the number of contexts in the pooled data is small, one can design conditional independence tests that do not suffer from this problem.
Proof We refer the reader to Appendix B, and in particular to Theorems 35, 37, and 38.

In practice, one often does not have access to the joint distribution P_M(X, C), but only to a finite sample of it. In that case we have:

Corollary 21 The FCI variants mentioned in Theorem 20 are also asymptotically consistent under the assumptions stated in Theorem 20 if the conditional independence test (including the choice of the threshold to decide between independence and dependence) is consistent.

Proof Direct application of Lemma 36 in Appendix B.

4.3 Related Work

In this section we provide a more detailed comparison with related work. Since the pioneering work by Fisher (1935), many different causal discovery methods that can deal with data from different contexts have been proposed. Table 4 provides an overview of some of these methods and the features they offer. Note that JCI offers most features of all methods. By implementing the JCI framework using sophisticated causal discovery methods for observational data (plus background knowledge), one obtains versatile and powerful causal discovery algorithms for multiple contexts. We will now discuss in detail some aspects of the related work.

4.3.1 Latent Confounders

Most score-based methods that combine multiple contexts (like the ones by Cooper and Yoo, 1999; Tian and Pearl, 2001; Sachs et al., 2005; Eaton and Murphy, 2007; Hauser and Bühlmann, 2012; Mooij and Heskes, 2013; Oates et al., 2016a) and some constraint-based methods (Zhang et al., 2017; Yang et al., 2018) assume causal sufficiency, i.e., that no latent confounders are present. This simplifies the causal discovery problem considerably, but the assumption is likely violated in practice and may lead to wrong conclusions.
This is well-known for causal discovery from a single observational data set, but it also applies to the JCI setting.

4.3.2 Cycles

As we have seen in Proposition 10, the method by Fisher (1935) can handle cycles. Less well-known is that LCD (Cooper, 1997) and Trigger (Chen et al., 2007) can also handle cycles (see Proposition 16). Hyttinen et al. (2012) provide an algorithm for linear SCMs with cycles and confounders that deals with perfect interventions. The methods by Hyttinen et al. (2014) and Mooij and Heskes (2013) can deal with cycles in a linear (or approximately linear) setting. The method by Hyttinen et al. (2014) relies on d-separation, which only applies in certain settings (see Theorem 9); it can be modified to use σ-separation instead (Forré and Mooij, 2018). The way Mooij and Heskes (2013) handle cycles is less straightforward. In principle, their method could handle nonlinear cyclic models, but for computational reasons, their implementation linearizes the SCMs around each (context-dependent) equilibrium, thereby essentially assuming that d-separation holds within each context. The method by Rothenhäusler et al. (2015) assumes linearity and can deal with cycles in that case, under a certain condition that suffices to prove identifiability of the method. The method by Peters et al. (2016) can handle cycles, as our Corollary 17 shows. The JCI framework in general allows for cycles, but requires its implementation to support this.

Features: (1) latent confounders; (2) nonlinear mechanisms; (3) cycles; (4) perfect interventions; (5) mechanism changes; (6) activity interventions; (7) other context changes; (8) unknown intervention/context targets; (9) learns intervention/context targets; (10) global causal discovery; (11) different variables in each context; last column: combination strategy.

Method                                  1  2  3  4  5  6  7  8  9 10 11  Strategy
(Fisher, 1935)                          +  +  +  +  +  +  +  +  +  -  -  b
LCD (Cooper, 1997)                      +  +  +  +  +  +  +  +  -  -  -  b
(Cooper and Yoo, 1999)                  -  +  -  +  -  -  -  -  -  +  -  b
(Tian and Pearl, 2001)                  -  +  -  -  +  -  +  -  -  +  -  b
(Sachs et al., 2005)                    -  +  -  +  -  -  -  -  -  +  -  b
(Eaton and Murphy, 2007)                -  +  -  +  +  +  +  +  +  +  -  b
Trigger (Chen et al., 2007)             +  +  +  +  +  +  +  +  -  -  -  b
(Claassen and Heskes, 2010)             +  +  -  -  +  +  +  +  -  +  +  a
(Tillman and Spirtes, 2011)             +  +  -  +  +  +  +  +  -  +  +  a
(Hauser and Bühlmann, 2012)             -  +  -  +  -  -  -  -  -  +  -  b
(Hyttinen et al., 2012)                 +  -  +  +  -  -  -  -  -  +  -  a
(Mooij and Heskes, 2013)                -  ±  ±  +  +  +  +  -  -  +  -  b
(Hyttinen et al., 2014)                 +  +  ±  +  -  -  -  -  -  +  +  a
(Triantafillou and Tsamardinos, 2015)   +  +  -  +  -  -  -  -  -  +  +  a
(Rothenhäusler et al., 2015)            +  -  ±  -  -  -  +  +  +  +  -  a
(Oates et al., 2016a)                   -  -  -  -  -  -  +  -  -  +  -  b
ICP (Peters et al., 2016)               +  +  +  +  +  +  +  +  -  -  -  b
(Zhang et al., 2017)                    -  +  -  +  +  +  +  +  +  +  -  b
(Yang et al., 2018)                     -  +  -  +  +  -  -  -  -  +  -  a/b
(Forré and Mooij, 2018)                 +  +  +  +  -  -  -  -  -  +  +  a
Joint Causal Inference (this work)      +  +  +  +  +  +  +  +  +  +  -  b
FCI-JCI (this work)                     +  +  ?  +  +  +  +  +  +  +  -  b
ASD-JCI (this work)                     +  +  +  +  +  +  +  +  +  +  -  b

Table 4: Overview of causal discovery methods that can combine data from multiple contexts. Features offered by the original implementations of these methods are indicated. When a feature is offered only under additional restrictive assumptions, it is indicated with a ± sign. Combination strategies (right-most column) are: (a) obtain statistics or constraints from each context separately and then construct a single causal graph based on the combined statistics; (b) pool all data and construct a single causal graph directly from the pooled data.

4.3.3 Selection Bias

The only causal discovery method for multiple data sets that is explicitly claimed to be able to deal with selection bias (i.e., conditioning on a latent variable that is a common effect of one or more of the observed variables), at least to some extent, is the IOD algorithm (Tillman and Spirtes, 2011).
It allows for different sets of observed (system) variables in each context and for different distributions in each context, while assuming that each context can be described by a MAG that is the marginal of a common MAG defined on the union of all system variables. It performs conditional independence tests in each data set separately, and merges the p-values of the test results using Fisher's method. It then constructs the PAG that simultaneously represents all contexts. Since it does not assume invariance of the distribution across contexts, it can deal with a single (latent) context variable that models mechanism changes or other "soft" interventions that do not change the conditional independences in the distribution. It can also deal with perfect interventions, since Fisher's method is used to test for independence in all contexts (see also Figure 7).

4.3.4 Imperfect Interventions and Other Context Changes

Cooper and Yoo (1999) provided the first score-based causal discovery algorithm that could deal with data from multiple contexts, focusing on perfect interventions with known targets. They describe in detail how to handle perfect interventions and introduced the idea of adding explicit context variables to deal with mechanism changes, which was later refined by Eaton and Murphy (2007), who provide an algorithm that can handle (stochastic) perfect interventions with unknown targets, soft interventions, and mechanism changes. Sachs et al. (2005) also use a score-based causal discovery algorithm based on the ideas of Cooper and Yoo (1999), with a greedy search strategy through the space of DAGs. Another recent approach to constraint-based causal discovery in a JCI setting is the one by Yang et al.
(2018); these authors propose the algorithm IGSP, which can be seen as an implementation of JCI for causally sufficient, acyclic models with a diagonal experimental design under JCI Assumptions 1, 2, for mechanism changes with intervention targets assumed to be known. An advantage of IGSP over our JCI approach is that essentially no assumptions on the context distribution need to be made (apart from positivity), since it relies on a weaker faithfulness assumption.

Tian and Pearl (2001) were the first to consider mechanism changes. They deal with sequences of mechanism changes, exploiting changes in the distribution to infer descendants of the changed mechanism. This is followed by a constraint-based approach from observational data that also takes into account the background knowledge on the causal ordering of the system variables inferred from analysing the interventional data. A similar approach, using the differences between data from experimental conditions and an observational baseline as background knowledge for a constraint-based approach, was applied by Magliacane et al. (2016b) to the data of Sachs et al. (2005).

Claassen and Heskes (2010) handle certain environment changes: direct causal relations between system variables are assumed to be invariant across contexts, but latent confounding (and, more generally, the exogenous distribution) may differ between contexts. Rothenhäusler et al. (2015) assume stochastic shift interventions, in which the mean of a target variable is shifted by an (independent) random amount. Various multi-task "structure learning" (i.e., Bayesian network learning) approaches have been proposed that put a prior on the DAGs in multiple contexts which encourages them to be similar (e.g., Oates et al., 2014, 2016b).
Some methods that allow for a single context variable have been applied to time-series data, using time as the context variable (Friedman et al., 2000; Zhang et al., 2017). This extends the more usual approach of treating time-series data by assuming invariance of the causal structure across time, as in dynamic Bayesian networks (DBNs) (Murphy, 2002), methods based on Granger causality (Granger, 1969), or constraint-based approaches (Entner and Hoyer, 2010). JCI allows one to handle all interventions and context changes discussed above in a unified way.

4.3.5 Multiple Context Variables

Some causal discovery methods for combining data from different contexts that explicitly consider a context variable allow for a single context variable only, for example, LCD, ICP, and the method by Zhang et al. (2017). There is an important advantage to allowing multiple context variables, as JCI generally does. One might argue that the case of multiple context variables can always be reduced to the case of a single context variable, by simply combining all context variables {C_k}_{k∈K} into a single tuple C = (C_k)_{k∈K}. However, this reduction to a single context variable typically loses information, as illustrated in Figure 13. When using only a single context variable in that case, the DMG cannot be identified from conditional independences in the data. On the other hand, when using all three context variables with JCI, the complete DMG can be identified, even when the causal relations between context and system variables are unknown.

4.3.6 Dependent Context Variables

If one allows for multiple context variables and considers the joint distribution on context and system variables, as we do in JCI, one should account for possible dependencies between the context variables.
Indeed, incorrectly assuming the context variables to be independent a priori may lead to wrong conclusions. An example is provided in Figure 14. In that example, incorrectly assuming the contexts to be independent leads to the wrong conclusion that context variable C_β causes the system variable X_1, at least for causal discovery algorithms that are tolerant to faithfulness violations.

This issue was recognized and addressed in recent work (Oates et al., 2016a) by introducing a novel graphical modeling framework, Conditional DAGs (CDAGs), which bears some similarity with our approach. However, a disadvantage of the CDAG framework is that existing causal discovery methods cannot be directly applied to learn a CDAG from data, and the wealth of results on causal modeling with SCMs cannot be used directly.

Figure 13: Example showing that allowing multiple context variables (right: context nodes C_α, C_β, C_γ with system nodes X_1, X_2, X_3) has advantages over considering a single context variable only (left: a single context node C with system nodes X_1, X_2, X_3). The causal graph on the right can be identified by JCI (with JCI Assumptions 1, 2, 3) from conditional independences in the pooled data, whereas the causal graph on the left is not identifiable.

Figure 14: Example showing that incorrectly assuming independent context variables can lead to wrong conclusions when using causal discovery algorithms that are tolerant to faithfulness violations. Left: true causal graph, with dependent context variables, which is identifiable by JCI. Right: causal graph that reproduces all conditional dependences (except C_α ⊥̸⊥ C_β) and minimizes the number of faithfulness violations (faithfulness here implies C_β ⊥̸⊥ X_1 | C_α), when (incorrectly) assuming that the context variables are independent.
One of the key advantages of the JCI framework is that it utilizes existing theory and methods, as it reduces a causal discovery problem from multiple contexts to a purely observational one with background knowledge. This is one of the reasons why JCI offers many more features than the approach by Oates et al. (2016a). We also note that CDAGs can be dealt with as a special case of the JCI framework.

4.3.7 Partially Overlapping Sets of Variables

A particular case of missing data that has been addressed by some of the methods is when the sets of observed variables differ between data sets, while still having some overlap. The first to address this using constraint-based causal discovery was Tillman (2009), and several other methods have been proposed over the years (Claassen and Heskes, 2010; Tillman and Spirtes, 2011; Hyttinen et al., 2014; Triantafillou and Tsamardinos, 2015). JCI can only deal with this when strengthening its faithfulness assumption: one would need to assume that the context variables are discrete, and that every conditional distribution P(X | C = c), for c ∈ C with P(C = c) > 0, is faithful with respect to the marginalization of the same DMG G_I onto the system variables that were observed in that context (or, in the case of perfect interventions with known targets, the corresponding marginal intervened DMG).

4.3.8 Influence Diagrams

Our representation of a system within a context imposed by its environment bears strong similarities with influence diagrams (Dawid, 2002). A formal difference is that we consider the context variables to be random variables that reflect the empirical distribution of the experimental design, whereas in influence diagrams they are interpreted as non-random decision variables.
The advantage of treating context variables as random variables is that this allows one to apply standard causal discovery techniques (designed for random variables) jointly on system and context variables. In particular, the standard notion of statistical conditional independence (Dawid, 1979) suffices. If one would like to treat the context variables as decision (i.e., non-random) variables, extended notions of conditional independence would be necessary (Forré and Mooij, 2019). Since we can always view the context variables as random variables in the empirical distribution of the experimental design (see also Footnote 17), this allows us to make use of the standard notion of conditional independence for the purposes of causal discovery.

4.3.9 Selection Diagrams

Our representation of a system within a context imposed by its environment also bears some similarities with selection diagrams (Bareinboim and Pearl, 2013). Selection diagrams have also been used for causal modeling in different contexts, but one crucial difference is that we are modeling the joint distribution on the intervention and system variables, whereas a selection diagram represents the conditional distribution of the system variables given the intervention ("selection") variables. Because we are modeling the joint distribution and not only the conditional one, we can apply standard causal discovery techniques directly on pooled data, something that would not be as straightforward when using selection diagrams instead. Bareinboim and Pearl (2013) define a selection diagram as follows:

Definition 22 Let M = ⟨I, J, H, X, E, f, P_E⟩ and M* = ⟨I, J, H, X, E, f*, P*_E⟩ be two acyclic SCMs corresponding to two different contexts, that only differ with respect to their causal mechanisms and exogenous distributions.
In particular, they share the same augmented graph H, and hence also their graphs are identical, G(M) = G(M*). The selection diagram S induced by ⟨M, M*⟩ is the acyclic directed mixed graph with nodes I ∪̇ Ī, where Ī := {ī : i ∈ I} is a copy of I of selection variable indices, such that:

1. the induced subgraph of S on I equals the common graph of M and M*, i.e., S_I = G(M) = G(M*), and
2. for each i ∈ I such that f_i ≠ f*_i or P_{E_{pa(i)}} ≠ P*_{E_{pa(i)}} (i.e., the distribution of the exogenous parents of i differs), there is an edge ī → i in S.

From the definition, it is apparent that a selection diagram essentially models two contexts, and that the selection variables in the selection diagram correspond to the children of the context variable in our representation. Indeed, consider a JCI model of the form (5) (p. 21) with a single binary context variable C. The joint SCM can be split into two context-specific SCMs, M_0 and M_1, and the induced selection diagram D can be obtained from the causal graph G(M) as follows: (i) each edge i_1 → i_2 or i_1 ↔ i_2 in G(M) between system variables i_1, i_2 ∈ I is also in D; (ii) if C → i in G(M) for i ∈ I, then ī → i is in D. Since the JCI framework can be used to learn (features of) the causal graph G(M) from data, this means that we can thereby learn (features of) the selection diagram from data.

It is not clear how a selection diagram could be used to represent the same information that an SCM with multiple context variables can represent. Indeed, even though the selection diagram has multiple selection variables, it is still modeling only two contexts, corresponding to just a single binary context variable in the JCI framework.

5.
Experiments

In this section we report on the experiments we performed with JCI, comparing various implementations of the framework with several baselines and state-of-the-art causal discovery methods. We experimented both with simulated data with perfectly known ground truth and with real-world data where the ground truth is only known approximately. The source code that we used for producing the results and plots in this section is provided under a free and open source license as Online Appendix 1.

5.1 Methods and Baselines

In our experiments we study different implementations of JCI, based on two existing causal discovery algorithms: ASD (Hyttinen et al., 2014; Magliacane et al., 2016b; Forré and Mooij, 2018) and FCI (Spirtes et al., 1999; Zhang, 2008a). The ASD algorithm is accurate but slow, while FCI is faster but less accurate due to its "greedy" approach. Another difference between the two methods is that ASD can deal with partial inputs, whereas FCI requires all the independence test results it asks for. Although FCI (and, with some small extensions, also ASD) can deal with selection bias, we ignore this additional complication here and use simplified implementations that assume that there is no selection bias. For the cyclic case, we used the adaptation of the ASD algorithm proposed by Forré and Mooij (2018) that replaces d-separation with its cyclic generalization, σ-separation. Adapting FCI to the cyclic case seems less straightforward and is beyond the scope of this paper.^21 Table 5 provides an overview of all implementations that we have studied here. The methods will be discussed in more detail in the next few subsections.
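As a concrete illustration of the pooled input format used by the methods below, the following sketch pools per-context data sets into a single data set with added binary context variables (a toy construction of ours, not the paper's source code; the column names and the diagonal design are illustrative assumptions):

```python
def pool_contexts(datasets):
    """datasets: dict mapping context index -> list of rows (dicts of
    system-variable values). Returns one pooled list of rows, each row
    extended with indicator columns C1, ..., Cq (context 0 is treated
    as the observational baseline, so all its indicators are 0)."""
    contexts = sorted(datasets)
    pooled = []
    for c in contexts:
        for row in datasets[c]:
            extended = dict(row)
            for k in contexts:
                if k == 0:
                    continue  # no indicator for the observational baseline
                extended[f"C{k}"] = 1 if c == k else 0
            pooled.append(extended)
    return pooled

data = {
    0: [{"X1": 0.1, "X2": 1.2}],   # observational
    1: [{"X1": 0.3, "X2": 0.9}],   # e.g. an intervention on X1
    2: [{"X1": -0.2, "X2": 2.0}],  # e.g. an intervention on X2
}
pooled = pool_contexts(data)
print(pooled[1])  # {'X1': 0.3, 'X2': 0.9, 'C1': 1, 'C2': 0}
```

A standard observational causal discovery algorithm (plus JCI background knowledge) can then be run directly on such pooled data.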
The "CI Tests" column in the table describes which conditional independence tests are performed, and how; it can take the following values:

A: use all variables, including context variables; the conditional independence tests performed are of the form X̃_a ⊥⊥ X̃_b | X̃_S with {a} ∪ {b} ∪ S ⊆ I ∪ K and {a}, {b}, S mutually disjoint.

S: use only system variables; the conditional independence tests performed are of the form X_a ⊥⊥ X_b | X_S with {a} ∪ {b} ∪ S ⊆ I and {a}, {b}, S mutually disjoint.

SS: system variables only, separately for each context; the conditional independence tests performed are of the form X_a ⊥⊥ X_b | X_S, C = c with {a} ∪ {b} ∪ S ⊆ I and {a}, {b}, S mutually disjoint, for all values c ∈ C with P(c) > 0. Note that this method assumes that the context domain C is discrete.

SF: system variables only, separately for each context, using Fisher's method;^22 the conditional independence tests performed are of the form X_a ⊥⊥ X_b | X_S, C = c with {a} ∪ {b} ∪ S ⊆ I and {a}, {b}, S mutually disjoint, for all values c ∈ C with P(c) > 0. Note that this method assumes that the context domain C is discrete.

NC: test all variables, except for conditional independences between context variables; the conditional independence tests performed are of the form X_a ⊥⊥ X̃_b | X̃_S with a ∈ I, {b} ∪ S ⊆ I ∪ K and {a}, {b}, S mutually disjoint.

PF: test all pairs of context and system variables, conditioning on the remaining context variables; the conditional independence tests performed are of the form X_i ⊥⊥ C_k | C_{K\{k}} for i ∈ I, k ∈ K.

21. The d-separation case has recently been addressed by Strobl (2018).

5.1.1 ASD Variants

For ASD, we implemented different variants, as described in Table 5.
The variants ASD-obs, ASD-pooled and ASD-pikt use the original implementation of Hyttinen et al. (2014) in the acyclic case and the σ-separation adaptation of Forré and Mooij (2018) in the cyclic case, with the weights and query method of Magliacane et al. (2016b). ASD-obs only uses the observational context and ignores data from the other contexts, ASD-pooled pools data from all contexts but does not add context variables, and ASD-pikt uses data from all contexts and assumes that the contexts correspond to perfect interventions with known targets. Inspired by the approach of Tillman (2009) and Tillman and Spirtes (2011), we also implemented a variant ASD-meta that uses Fisher's method as a "meta-analysis" method to combine the p-values of conditional independence tests performed with data from each context separately into a single overall p-value, which is then used as input for ASD. We use these implementations as state-of-the-art causal discovery methods for comparison.

For the JCI approach, we study several variants that differ in which JCI assumptions they make, whether they also test conditional independences between context variables, whether all context variables are used or are first merged into a single context variable, and whether the intervention targets of the context variables are considered to be known or not. The different implementations are described in detail in Table 5. ASD-JCI123sc and ASD-JCI1sc both use a single merged context variable C = (C_k)_{k∈K}, whereas the other ASD-JCI variants use all context variables {C_k}_{k∈K} as separate variables. The background knowledge for all ASD-JCI variants is a subset of the three JCI Assumptions 1, 2 and 3. The only ASD-JCI variant that uses background knowledge on intervention targets is ASD-JCI123kt (but it does not make any assumptions on the type of the intervention).

22.
Fisher's method (Fisher, 1925) aggregates N independent p-values {p_i}_{i=1}^N by computing the p-value of the statistic F := −2 Σ_{i=1}^N log p_i, which has a χ² distribution with 2N degrees of freedom if all N p-values p_i are independent.

5.1.2 FCI Variants

We implemented different variants of the FCI algorithm by adapting the implementation in the R package pcalg (Kalisch et al., 2012). We used the default configuration, i.e., an order-independent ("stable") skeleton phase, and no conservative or majority rule modifications (Colombo and Maathuis, 2014). For simplicity, we assumed that no selection bias is present, which means that rules R5–R7 in Zhang (2008a) can be ignored in the FCI algorithm, and only PAGs without (possibly) undirected edges need to be considered. We consider two variants of FCI as the current state of the art: FCI-obs, which uses only the observational context, and FCI-pooled, which uses pooled data from all contexts but does not add context variables. We also implemented a "meta-analysis" approach, FCI-meta, that uses Fisher's method to combine the p-values from separate contexts into overall p-values that are used as input for the FCI algorithm. Finally, we have three variants (FCI-JCI123, FCI-JCI1 and FCI-JCI0) referring to the JCI adaptation of the FCI algorithm described in Section 4.2.4, for three different combinations of JCI Assumptions (we have not yet implemented and evaluated the speedup for FCI-JCI123r, but leave that for future work). The output of FCI is a PAG. Here we will not evaluate the PAG itself, since estimating the PAG is often not the ultimate task in causal discovery; instead, we will evaluate the presence and absence of ancestral relations that can be identified from the estimated PAG. The procedure we use for this task is explained in Appendix B.4.
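Fisher's method for combining p-values, used by FCI-meta and the other "-meta" variants (see footnote 22 above), can be sketched in pure Python; for the even degrees of freedom 2N arising here, the χ² tail probability has a closed form, which the sketch below uses (a sketch of ours, not the paper's implementation):

```python
import math

def fisher_combine(pvalues):
    """Combine independent p-values via Fisher's method: under the global
    null, F = -2 * sum(log p_i) is chi-square with 2N degrees of freedom.
    For even degrees of freedom 2N, the survival function is
    P(F > f) = exp(-f/2) * sum_{k=0}^{N-1} (f/2)^k / k!."""
    n = len(pvalues)
    half_f = -sum(math.log(p) for p in pvalues)  # this is F/2
    term, total = 1.0, 0.0
    for k in range(n):
        total += term
        term *= half_f / (k + 1)
    return math.exp(-half_f) * total

print(fisher_combine([0.5]))       # a single p-value is returned unchanged: 0.5
print(fisher_combine([0.1, 0.1]))  # ≈ 0.056, smaller than either input
```

Note that the combined test rejects when the evidence against independence accumulates across contexts, which is what makes the "-meta" variants robust to dependences that appear in only some contexts.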
For the special case of FCI-JCI123, we also read off the direct intervention targets (and non-targets) from the estimated PAG, as explained in Appendix B.5. We encode identified presence of a feature by +1, identified absence by −1, and an unidentifiable feature by 0. These predictions can then also easily be bootstrapped (or, more precisely, bagged). The bagged feature scores can then be used as a score for the confidence that the feature is present.

5.1.3 LCD Variants

The LCD implementation simply iterates over all context variables and ordered pairs of system variables and tests for the LCD pattern. As conditional independence test, we test whether the partial correlation vanishes. As confidence measure for an LCD pattern ⟨C, X, Y⟩ (see also Figure 12, where ⟨X_1, X_2, X_3⟩ corresponds with ⟨C, X, Y⟩), we use −log p_{C⊥⊥Y}. Note that LCD predicts the presence of an ancestral relation X ∈ an(Y), the absence of a confounder between X and Y, and the absence of a direct causal effect of C on Y.^23

5.1.4 ICP Variants

We also compare with the ICP function in the R package InvariantCausalPrediction (Peters et al., 2016).^24 Under the assumption of causal sufficiency, ICP returns direct causal relations. However, in our setting, in which causal sufficiency cannot be assumed, ICP will generally return ancestral causal relations, as shown in Corollary 17. Note that ICP assumes a single context variable, whereas one typically may have multiple context variables in the data.

23. The absence of a bidirected edge X ↔ Y in the marginalization of the graph G on {C, X, Y} also implies the absence of the bidirected edge X ↔ Y in the full graph G. Furthermore, the absence of the direct edge C → Y in this marginalization implies the absence of the direct edge C → Y in the full graph G.
24. We used the default arguments, except that we set stopIfEmpty to TRUE.
One way to deal with this is to merge all context variables before the pooled data is input to ICP, which we do in ICP-sc. Another way is to run ICP on each context variable separately, hiding the other context variables when feeding the pooled data to ICP, and finally merging all predictions; this is done in ICP-mc. The conditional independence tests that ICP performs are of the form C ⊥⊥ X_i | X_S with S ⊆ I \ {i}, for i ∈ I. The default conditional independence test used by ICP first linearly regresses X_i on X_S for each context C = c individually, and once globally. It then tests whether there exists a context c in which the mean or the variance of the regression residuals differs from the global mean or variance of the residuals. All p-values are then combined using a Bonferroni correction. Note that this test can detect more conditional dependences than a simple partial correlation test, as it also considers the variation of the variances across contexts. As confidence score for the ancestral causal relation i ∈ an(j), we use −log p_{i→j}, where p_{i→j} is the p-value returned by ICP for system variable i being an ancestor of system variable j.

5.1.5 Fisher's Test for Causality

This is a very simple and immensely popular baseline in which we go through all pairs (i, k) of a system variable i ∈ I and a context variable k ∈ K, and perform the conditional independence test X_i ⊥⊥ C_k | C_{K\{k}} on the pooled data, resulting in p-value p_{ik}. It is limited to the discovery of ancestral causal relations from context to system variables. As confidence value for the ancestral causal relation k ∈ an(i), we report −log p_{ik} + log α, where α is the threshold for the independence test.

5.1.6 Bootstrapping

A simple way to improve the stability of causal discovery algorithms is bootstrapping.
For a method that outputs a confidence measure for a certain prediction, we simply average the confidence measures over bootstrap samples. For FCI, as confidence measure we take a {−1, 0, 1}-valued variable encoding identifiable absence / unidentifiability / identifiable presence of an ancestral relation (or direct intervention target, for FCI-JCI123). For LCD, we average −log p_{C⊥⊥Y} over bootstrap samples. For ICP, we similarly average the negative logarithm of the p-values for the discovered ancestral causal relations. We do not bootstrap the ASD variants because of their high computational complexity. In our experiments, we use 100 bootstrap samples. Bootstrapped methods are indicated with the suffix "-bs".

5.1.7 Conditional Independence Tests

Using an appropriate conditional independence test is important to obtain good causal discovery results. In this work we use two different conditional independence tests, both relying on the assumption that the context variables are discrete and that the system variables have a multivariate Gaussian distribution given the context. However, the JCI framework imposes no principled restrictions on the conditional independence tests used, so one could, for example, also use non-parametric tests instead. The default conditional independence test that we used is the following.
For testing X̃_a ⊥⊥ X̃_b | X̃_S, we distinguish two cases:

Name            Data           Context variables    JCI Assumptions   Interventions   CI Tests
                                                    1  2  3
Baselines:
ASD-obs         observational  none                 -  -  -           none            S
ASD-pooled      pooled         none                 -  -  -           any             S
ASD-meta        all            none                 -  -  -           any             SF
ASD-pikt        all            none                 -  -  -           perfect, KT     SS
FCI-obs         observational  none                 -  -  -           none            S
FCI-pooled      pooled         none                 -  -  -           any             S
FCI-meta        all            none                 -  -  -           any             SF
JCI implementations:
ASD-JCI0        pooled         all                  -  -  -           any             A
ASD-JCI1        pooled         all                  +  -  -           any             A
ASD-JCI12       pooled         all                  +  +  -           any             A
ASD-JCI123      pooled         all                  +  +  +           any             NC
ASD-JCI123kt    pooled         all                  +  +  +           any, KT         NC
FCI-JCI123      pooled         all                  +  +  +           any             NC
FCI-JCI1        pooled         all                  +  -  -           any             A
FCI-JCI0        pooled         all                  -  -  -           any             A
LCD-sc          pooled         single (merged)      +  -  -           any             A*
ICP-sc          pooled         single (merged)      +  -  -           any             A*
ASD-JCI1-sc     pooled         single (merged)      +  -  -           any             A*
ASD-JCI123-sc   pooled         single (merged)      +  +  +           any             A*
LCD-mc          pooled         all (one-by-one)     +  -  -           any             A
ICP-mc          pooled         all (one-by-one)     +  -  -           any             A*
Fisher          pooled         all (one-by-one)     +  +  -           any             PF

Table 5: Variants of implemented JCI algorithms and baselines used in our experiments. JCI Assumption 0 is always assumed. "KT" abbreviates "known targets". The meaning of "CI Tests" is (explained in more detail in the main text): A: use all variables, including context variables; S: use only system variables; SS: system variables only, separately for each context; SF: system variables only, separately for each context, using Fisher's method to combine them into a single p-value; NC: test all variables, except for conditional independences between context variables; PF: test all pairs of context and system variables, conditioning on the remaining context variables. The meaning of the superscript ∗ is explained in Section 5.1.7.
Bootstrapped versions of methods are indicated with the suffix "-bs"; they have been omitted from Table 5 for clarity.

• S ∩ K = ∅: the test then reduces to a standard partial correlations test.

• Otherwise, we go through all observed values c_{S∩K} of C_{S∩K}, and use a standard partial correlations test to calculate a p-value for X̃_a ⊥⊥ X̃_b | X̃_{S\K}, C_{S∩K} = c_{S∩K}. We then aggregate the p-values corresponding to the observed values of C_{S∩K} into one overall p-value for X̃_a ⊥⊥ X̃_b | X̃_S, using Fisher's method for aggregating p-values.[25]

Note that in the case of zero context variables (i.e., a single context), this test reduces to a standard partial correlations test.

For the ICP implementations, we make use of the implementation in the R package InvariantCausalPrediction. By default, this uses a conditional independence test that relies on more than just partial correlations. We extended this conditional independence test to allow for conditioning on context variables, and we make use of this extended test in all algorithms that assume a single merged context, i.e., the ones marked with a * in Table 5. It assumes that there is a single context variable, i.e., |K| = 1. For testing X̃_a ⊥⊥ X̃_b | X̃_S, it distinguishes the following cases:

• If ({a} ∪ {b} ∪ S) ∩ K = ∅, it reduces to a standard partial correlations test.

• If X̃_a = C is the context variable, it uses linear regression to fit X̃_b as a linear function of X_S, using data pooled over all contexts, and calculates the corresponding residuals. It then goes through all observed values c of C, and tests whether the residuals in context c have a different distribution than the residuals in the other contexts (i.e., those with C ≠ c). This two-sample test is performed by comparing the means with a t-test and the variances with an F-test.
The two resulting p-values are combined with a Bonferroni correction. The resulting p-values, one for each context, are then also combined with a Bonferroni correction.[26]

• If X̃_b = C is the context variable, we proceed similarly to the previous case.

• If the context variable C is part of X̃_S, we go through all observed values c of C, and use a standard partial correlations test to calculate a p-value for X_a ⊥⊥ X_b | X_{S\K}, C = c. We then aggregate the p-values corresponding to the observed values of C into one overall p-value for X_a ⊥⊥ X_b | X̃_S, using Fisher's method for aggregating p-values.

As a final note regarding the conditional independence tests: ASD-pikt uses a standard partial correlation test to calculate a p-value for each context separately. It subsequently combines all these p-values by adding the log p-values, while taking into account how the graph structure is changed by the perfect interventions.

The choice of the p-value threshold α for rejecting the null hypothesis of independence is an important one. To obtain consistent results, one should let α decrease to 0 with increasing sample size. In our experiments, we used a fixed sample size and simply used a global threshold α = 0.01.

25. Not to be confused with Fisher's test for causality that we described in Section 5.1.5.
26. This is the default test for continuous data in the ICP function of the R package InvariantCausalPrediction.

5.2 Simulations

We simulated random linear-Gaussian SCMs with p system variables and q context variables. We considered both the acyclic setting and the cyclic one. We simulated stochastic interventions of two different types: mechanism changes and perfect interventions.
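As a concrete illustration of the default test of Section 5.1.7 (a partial correlations test within each observed value of the conditioning context variables, aggregated with Fisher's method), here is a minimal Python sketch. It is our own simplified reimplementation, assuming numpy and scipy, and not the code used for the experiments; the function names are ours.

```python
import numpy as np
from scipy import stats

def partial_corr_pvalue(x, y, Z):
    """p-value for X independent of Y given Z under a linear-Gaussian model:
    correlate the OLS residuals of X and Y on Z, then apply Fisher's
    z-transform to the resulting (partial) correlation."""
    n = len(x)
    if Z.shape[1] > 0:
        A = np.column_stack([np.ones(n), Z])
        x = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
        y = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r = np.clip(np.corrcoef(x, y)[0, 1], -0.9999999, 0.9999999)
    z = np.sqrt(max(n - Z.shape[1] - 3, 1)) * np.arctanh(r)
    return 2.0 * stats.norm.sf(abs(z))

def fisher_combined_ci_test(x, y, Z_sys, contexts):
    """Test X independent of Y given (Z_sys, context): run a partial
    correlation test within each observed joint context value, then combine
    the per-context p-values with Fisher's method
    (-2 * sum(log p) ~ chi^2 with 2k degrees of freedom under the null)."""
    pvals = []
    for c in np.unique(contexts, axis=0):
        mask = (contexts == c).all(axis=1)
        pvals.append(partial_corr_pvalue(x[mask], y[mask], Z_sys[mask]))
    statistic = -2.0 * np.sum(np.log(np.clip(pvals, 1e-300, 1.0)))
    return stats.chi2.sf(statistic, df=2 * len(pvals))
```

With a single observed context the loop runs once, so the test reduces to a plain partial correlations test, matching the behavior described above.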
Random causal graphs were simulated by drawing directed edges independently between system variables with probability ε. For the acyclic models, we only allowed directed edges i₁ → i₂ for i₁ < i₂ with i₁, i₂ ∈ I. For the cyclic models, we allowed directed edges i₁ → i₂ for i₁ ≠ i₂ with i₁, i₂ ∈ I, and subsequently selected only the graphs containing at least one cycle. We drew bidirected edges independently between all unordered pairs of system variables with probability η, and associated each bidirected edge with a separate latent confounding variable. For each context variable, we randomly selected a single system variable as its target, while ensuring that each system variable has at most one context variable as a direct cause. We sampled all linear coefficients between system variables, context variables and confounders from the uniform distribution on [−1.5, −0.5] ∪ [0.5, 1.5]. The exogenous variables ("error terms") were sampled independently from the standard-normal distribution. To ensure that the system variables have comparable scales, we rescaled the weight matrix such that each system variable would have variance 1 if all its direct causes were i.i.d. standard-normal.

We used binary context variables in a "diagonal" design. This means that for each random SCM, we simulated q + 1 contexts, the first context being purely observational (i.e., C_k = 0 for all k ∈ {1, …, q}), and each of the other q contexts corresponding to one of the context variables being turned on (say C_{k₀} = 1 for some k₀ ∈ {1, …, q}) and the others turned off (C_k = 0 for k ∈ {1, …, q} \ {k₀}). We either took all interventions to be mechanism changes, or all interventions to be perfect.
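The graph sampling, rescaling, and "diagonal" context design described above can be sketched as follows. This is a simplified, hypothetical Python reimplementation (numpy only), not the authors' code: it covers the acyclic case with mechanism-change interventions only, omits the latent confounders drawn with probability η, and assumes q ≤ p.

```python
import numpy as np

def simulate_jci_dataset(p=4, q=2, eps=0.5, N=500, seed=0):
    """Sketch of the simulation design: a random acyclic linear-Gaussian SCM
    over p system variables, with q binary context variables in a 'diagonal'
    design and mechanism-change interventions (a constant offset of 1 on the
    target when the intervention is turned on)."""
    rng = np.random.default_rng(seed)
    # Directed edges i1 -> i2 only for i1 < i2, each present with probability eps.
    adj = np.triu(rng.random((p, p)) < eps, k=1)
    signs = rng.choice([-1.0, 1.0], size=(p, p))
    B = adj * signs * rng.uniform(0.5, 1.5, size=(p, p))
    # Rescale so each variable would have variance 1 if its direct causes
    # were i.i.d. standard-normal (column j holds the parent weights of X_j).
    scale = np.sqrt(1.0 + (B ** 2).sum(axis=0))
    # Each context variable targets one distinct system variable (assumes q <= p).
    targets = rng.choice(p, size=q, replace=False)
    rows = []
    for k0 in range(q + 1):                 # context 0 is purely observational
        C = np.zeros(q)
        if k0 > 0:
            C[k0 - 1] = 1.0                 # 'diagonal' design: one context var on
        X = np.zeros((N, p))
        for j in range(p):                  # generate in topological order
            shift = C[targets == j].sum()   # mechanism change: offset of 1
            X[:, j] = (X[:, :j] @ B[:j, j] + shift
                       + rng.normal(size=N)) / scale[j]
        rows.append(np.hstack([np.tile(C, (N, 1)), X]))
    return B, targets, np.vstack(rows)      # pooled data set: (q+1)*N rows
```

The returned pooled data set stacks q + 1 contexts of N samples each, with the context variables in the first q columns, mirroring the pooling described in the text.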
For mechanism changes, we simply add the value of the parent context variable to the structural equation (i.e., this corresponds to adding a constant offset of 1 to the intervention target variable when the intervention is turned on). For perfect interventions, we additionally set the linear coefficients of the incoming edges on the intervention target to zero. Finally, we sampled N observed values of the system variables from each context and combined all samples into one pooled data set. This was done for each random SCM separately.

5.3 Evaluation

In evaluating the results, we consider different prediction tasks: establishing the absence or presence of ancestral causal relations between system variables, the absence or presence of direct causal relations between system variables, and the absence or presence of confounders between system variables. In addition, we consider predicting the absence or presence of indirect intervention targets (i.e., whether or not some context variable is an ancestor of some system variable) and of direct intervention targets (i.e., whether or not some context variable is a parent of some system variable).

Each method outputs a confidence score for each feature of interest, where positive scores mean that it is more likely that the feature is present in the causal graph G, whereas negative scores mean that it is more likely that the feature is absent. The higher the absolute value of the score, the more likely its presence or absence. The predictions are pooled both within model instances (e.g., all possible ancestral relations i ∈ an_G(j) for all ordered pairs of system variables i, j ∈ I) and across model instances to gather more statistics. The scores are then ranked and turned into ROC curves and PR curves (one PR curve for the presence of the features, and one for their absence) by comparing with the true features.
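The scheme just described (pooling confidence scores, ranking them, and sweeping a decision threshold from the top of the ranking) can be sketched as follows. This is a minimal Python illustration; the function names are ours, not the evaluation code used in the paper.

```python
import numpy as np

def pr_curve(scores, present):
    """Precision-recall curve for predicting the *presence* of a feature.
    Features are ranked by confidence score (higher = more confidently
    present); 'present' is a boolean array with the ground-truth presence
    of each feature in the causal graph."""
    scores = np.asarray(scores, dtype=float)
    present = np.asarray(present, dtype=bool)
    order = np.argsort(-scores)              # most confident first
    hits = present[order]
    tp = np.cumsum(hits)                     # true positives at each cutoff
    k = np.arange(1, len(hits) + 1)          # number of predictions made
    precision = tp / k
    recall = tp / max(present.sum(), 1)
    return precision, recall

def pr_curve_absent(scores, present):
    """PR curve for predicting the *absence* of a feature: reverse both the
    ranking and the ground truth."""
    return pr_curve(-np.asarray(scores, dtype=float),
                    ~np.asarray(present, dtype=bool))
```

Ranking the pooled scores once and sweeping the cutoff yields the full curve in a single pass, which is convenient when predictions are pooled across many model instances.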
In the ROC curves, we use solid lines for positive scores (feature present) and negative scores (feature absent), and dotted lines for vanishing scores (feature presence/absence is unknown).

5.4 Results: Small Simulated Models

We first present results for small models with p = 4 system variables and 0 ≤ q ≤ 4 (by default, q = 2) context variables. We used ε = 0.5, η = 0.5, and sampled N_c = 500 samples for each context, i.e., N = 500(q + 1) samples in total.

5.4.1 ASD-JCI vs. Baselines (Causal Mechanism Changes)

We start by showing the advantage that JCI can offer over existing methods. We first consider only ASD variants, because this most clearly shows the impact of how one merges data from different contexts and how one treats the context variables, as all other aspects of the causal discovery algorithm are the same for all ASD variants. In Figure 15 we present results for several ASD variants on acyclic models with causal mechanism changes. We compare the JCI variants ASD-JCI123 (unknown intervention targets) and ASD-JCI123kt (known intervention targets) with the available baselines: ASD-obs (observational data only), ASD-pooled (pooled data from all contexts treated as if they were all observational, context variables not included), ASD-meta (using Fisher's method to combine p-values from conditional independence tests in separate contexts), and ASD-pikt (which assumes that the interventions are perfect and uses knowledge of the intervention targets). The tasks of predicting ancestral causal relations and direct causal relations show relatively similar ROC and PR curves. Predicting the absence or presence of confounders is a more challenging task. The three baselines ASD-obs, ASD-pooled and ASD-meta show very similar performance. In particular, for the tasks of predicting the presence of the features, these baselines perform poorly, not much better than random guessing.
This is partially due to the small sample size, but also to the fact that many relationships are simply not identifiable from purely observational data alone. ASD-pikt even performs poorly on nearly all prediction tasks in this simulation setting, because it incorrectly assumes that the interventions are perfect. The two JCI variants, on the other hand, strongly outperform the baselines and obtain very high precisions. In particular, even without knowing the intervention targets, ASD-JCI123 manages to predict the presence of (direct and indirect) causal relations at maximum precision for low recall. Exploiting knowledge of the intervention targets, ASD-JCI123kt obtains an even more impressive precision for predicting ancestral causal relations over a large range of recalls. This illustrates the significant improvement in precision that JCI can yield.

Figure 15: Results of some ASD variants (acyclic, causal mechanism changes) for small models. The two JCI variants (with unknown/known intervention targets) strongly outperform the baselines in this setting. From top to bottom: ancestral relations, direct causes, confounders. From left to right: ROC curves, PR curves for presence of feature, PR curves for absence of feature.

Figure 16 shows a similar picture in the cyclic setting, where all ASD variants make use of σ-separation (Forré and Mooij, 2018). The task of predicting the presence of ancestral relations is easier than in the acyclic setting, because for most pairs of system variables, one is an ancestor of the other due to the cycles. The task of predicting the absence of ancestral relations, on the other hand, is more challenging. Detecting the presence or absence of confounders in this setting has become nearly impossible for any method. For the other tasks, the JCI approach again shows substantially improved precisions compared to the baselines.
5.4.2 ASD-JCI vs. Baselines (Perfect Interventions)

Figures 17 and 18 show results for the acyclic and cyclic settings, respectively, but now for perfect interventions with known targets rather than causal mechanism changes. In these perfect intervention scenarios, the JCI variants again obtain a much higher precision than any of the baselines, with the sole exception of ASD-pikt. In this setting, the latter method successfully exploits the assumed perfect nature of the interventions, thereby outperforming the JCI variants that do not make any assumption about the nature of the interventions. However, a significant disadvantage of ASD-pikt in practice is that its assumption of perfect interventions with known targets may not be valid. As we already saw in Section 5.4.1, ASD-pikt then breaks down, in contrast to the JCI variants.

Figure 16: Results of some ASD variants (cyclic, causal mechanism changes) for small models. The two JCI variants substantially outperform the baselines in this setting.

For the cyclic setting with perfect interventions (Figure 18), we observe that predicting the presence of confounders still seems impossible for all methods, but predicting their absence seems at least feasible in principle (although it appears to be a very challenging task). ASD-pikt again obtains the highest precisions, followed by the JCI variants.

5.4.3 Influence of JCI Assumptions

We now investigate in more detail which JCI assumptions are responsible for the excellent performance of the JCI variants of ASD. Figure 19 shows that, as expected, the more prior knowledge about the context variables is used, the better the predictions become.

Figure 17: Results of some ASD variants (acyclic, perfect interventions) for small models.
ASD-pikt, which takes into account the perfect nature of the interventions, is the best performing method in this setting. The JCI variants do not assume perfect interventions, but still yield a vast improvement in the precision of the predicted features with respect to the other baselines.

However, surprisingly, the largest boost in precision with respect to the observational baseline is due to simply pooling the data and adding the context variables: ASD-JCI0 already strongly improves over ASD-obs. Adding more background knowledge regarding the nature of the context variables helps to improve the results further. JCI Assumption 1 yields a marginal improvement. JCI Assumptions 2 (and 3) do not lead to any further improvements for discovering the causal relations between system variables in this setting, though. Exploiting knowledge of the intervention targets, on the other hand, turns out to be very helpful for obtaining highly accurate predictions of the ancestral relations between system variables, and also significantly improves the precision of predicting direct causal relations and confounders. We show here only the results for the acyclic setting with causal mechanism changes, because we obtained qualitatively similar results for the other simulation settings.

Figure 18: Results of some ASD variants (cyclic, perfect interventions) for small models. We see a similar picture as in the acyclic case in Figure 17.

We also investigated variants of ASD-JCI0, ASD-JCI1 and ASD-JCI12 in which we did not perform any independence tests on the context variables, i.e., using conditional independence testing scheme "NC" rather than "A". This is possible because ASD is capable of handling incomplete inputs.
We obtained almost identical results (in all simulation settings considered) to those of the standard variants of these methods, in which we do perform conditional independence tests on the context variables themselves (not shown here).

5.4.4 Multiple Context Variables vs. Single (Merged) Context Variable

Figure 20 shows that for the ASD-JCI variants, exploiting multiple context variables leads to better results than using a single (merged) context variable, as expected. Similar results hold for the cyclic settings and for causal mechanism changes (not shown).

5.4.5 FCI Variants

Figure 21 shows the results for the various FCI variants. FCI is seen to be somewhat less accurate than ASD, but bootstrapping helps to boost precision at lower recalls. Similarly to ASD, we conclude that the JCI variants of FCI substantially outperform the non-JCI variants. FCI-JCI1 and FCI-JCI123 seem to yield identical results in this setting. Results for causal mechanism changes are very similar to those for perfect interventions, and are therefore not shown here. The results for the cyclic setting are also not shown, because FCI was designed for the acyclic setting.

Figure 19: Influence of the JCI assumptions for ASD-JCI (acyclic, causal mechanism changes) for small models. Exploiting more prior knowledge is seen to lead to better results. For reference, the non-JCI baseline ASD-obs is shown, which only uses observational data.

5.4.6 LCD and ICP

Figure 22 shows the results of the LCD and ICP variants for the task of predicting ancestral relations. We only show the results for perfect interventions, as the results for causal mechanism changes are similar. LCD and ICP can both only predict the presence of ancestral relations, not their absence. LCD and ICP apparently benefit from merging the context variables into a single one.
A possible explanation of this phenomenon could be that a combination of conditional independence tests of the form C_k ⊥⊥ X_{i′} | X_i, with each C_k binary (and both X_i and X_{i′} real-valued), might be less reliable than a single test C ⊥⊥ X_{i′} | X_i, where C is categorical with ≥ 2 states. Another observation we made is that bootstrapping leads to only marginal improvements for these methods (not shown).

Figure 20: Multiple context variables vs. a single (merged) context variable for ASD-JCI (acyclic, perfect interventions) for small models. Exploiting multiple context variables leads to better results than using a single (merged) context variable.

5.4.7 Varying the Number of Context Variables

As we have seen, discovery of the causal relations between system variables can benefit strongly from observing the system in multiple contexts. As Figure 23 shows, the more context variables are taken into account, the better the predictions of ASD-JCI123 become. Although not shown here, the same conclusion also holds for the other JCI variants ASD-JCI0, ASD-JCI1 and ASD-JCI123kt. It does not hold for the ASD baselines in general, but it does hold for ASD-pikt if the interventions are perfect. For the JCI variants of FCI (FCI-JCI0, FCI-JCI1 and FCI-JCI123) we also observed that the more context variables are available, the more accurate the predictions become. This is in line with our expectation that jointly analyzing data from multiple experiments makes it easier to estimate the causal structure of the system.

Figure 21: Results of FCI variants (acyclic, perfect interventions) for small models.

Figure 22: Multiple context variables vs. a single (merged) one for LCD and ICP (perfect interventions) for small models. LCD and ICP both benefit from merging the context variables into a single one for the task of predicting the presence of ancestral relations. Top: acyclic; bottom: cyclic.

Figure 23: Results for different numbers of context variables for ASD-JCI123 (acyclic, causal mechanism changes) for small models. Taking more context variables into account leads to better predictions.

Interestingly, the same conclusion does not hold for ASD-JCI1-sc and ASD-JCI123-sc. This suggests that having multiple contexts is mostly beneficial if each context variable targets only a small subset of the system variables, and only for methods that can explicitly take multiple context variables into account. For LCD and ICP, precision also does not improve monotonically with the number of context variables. Although LCD-mc and ICP-mc in principle allow for multiple context variables, they suffer from a drop in recall, because they focus on detecting a certain causal pattern that becomes increasingly rare with more context variables. Indeed, for the extreme case q = p, each system variable is targeted by a single context variable in our simulation setting, and hence one would expect LCD-mc and ICP-mc to make no predictions at all. Any predictions they do make must therefore be false positives, resulting in low precision.

Figure 24: Discovering indirect intervention targets (acyclic, causal mechanism changes) on small models. Top: ASD variants; bottom: FCI variants.

Figure 25: Discovering indirect intervention targets (cyclic, causal mechanism changes) on small models. Top: ASD variants; bottom: FCI variants.
5.4.8 Discovering Indirect Intervention Targets

Figure 24 shows, for the acyclic setting with causal mechanism changes, how accurately indirect intervention targets (i.e., which system variables are descendants of each context variable) can be discovered by the various methods. The baselines ASD-obs, ASD-pooled, ASD-meta and ASD-pikt cannot learn intervention targets (neither direct nor indirect ones), since they do not represent context variables explicitly, and are therefore excluded.[27] LCD and ICP also cannot address this task.[28] Although ASD-JCI123kt makes use of the known direct intervention targets (i.e., which system variables are children of each context variable?), there still remains a non-trivial task of learning the indirect ones. Deciding that a system variable is targeted is an easier task than deciding that a system variable is not targeted by an intervention. Although Fisher's test is generally hard to beat when it comes to predicting indirect intervention targets, ASD-JCI123kt outperforms it in this setting by exploiting the knowledge about the direct intervention targets. While JCI Assumption 2 turned out to be unimportant for learning the causal relations between system variables, it is seen to be very useful for this task of learning the causal relations between context and system variables. Figure 25 shows a largely similar picture for the cyclic setting with causal mechanism changes. Surprisingly, the FCI variants also perform quite well in the cyclic setting. We do not show the results for perfect interventions here: we observed that this task is easier and the results are generally better, but otherwise mostly similar conclusions are obtained. The only exception is that for perfect interventions, the best method is Fisher's approach.
5.4.9 Discovering Direct Intervention Targets

Fisher's test is not able to predict direct intervention targets (i.e., which system variables are children of each intervention variable?), but the ASD-JCI variants can, as can FCI-JCI123. Figure 26 shows the results for these algorithms in the four different simulation settings. The task is considerably easier in the acyclic setting than in the cyclic one. Having perfect interventions makes it slightly easier than having causal mechanism changes. We again notice that exploiting JCI Assumption 2 considerably improves performance on this task. Surprisingly, FCI-JCI123 obtains almost perfect precision in all scenarios, outperforming ASD-JCI123 notably in the cyclic cases. We do not understand why this is the case.

5.4.10 Computation Time

So far we have only considered the accuracy of the predictions. Another interesting aspect is the computation time that the various methods need. Figure 27 shows the total computation time (for all prediction tasks together) for all methods considered thus far. Note the logarithmic scale on the x-axis. We only show run times for causal mechanism changes, since those for perfect interventions are nearly identical.

27. However, it would be trivial to extend ASD-pikt such that it can predict indirect intervention targets, by combining the known direct intervention targets of a certain intervention variable with the descendants of those assumed targets as predicted by the method, as a postprocessing step.
28. However, when also assuming JCI Assumption 2, both LCD and ICP could be used to learn indirect intervention targets.

Figure 26: Discovering direct intervention targets on small models. From top to bottom: acyclic, causal mechanism changes; acyclic, perfect interventions; cyclic, causal mechanism changes; cyclic, perfect interventions.
On the other hand, we do see that the cyclic setting is in general more computationally demanding than the acyclic one. Already for small models of p + q = 4 + 2 = 6 variables, the ASD algorithms become slow, because they perform an optimization over a large discrete space. The availability of more background knowledge makes the search space considerably smaller, and hence leads to reduced computation time for the ASD variants. Also, the search space is considerably larger in the cyclic setting than in the acyclic one. By design, the FCI variants are much faster, but bootstrapping also takes its toll. The fastest methods are Fisher's test, LCD and ICP. Figure 28 shows how computation time scales with the number of context variables, for three JCI implementations (ASD-JCI123, ASD-JCI1 and FCI-JCI123).

Figure 27: Runtimes for various methods on small models. Shown are runtimes for causal mechanism changes; for perfect interventions, runtimes are similar. Left: acyclic; right: cyclic.

Figure 28: Runtimes for three different algorithms as a function of the number of context variables on small models. Note the considerably different ranges of the (logarithmic) x-axis. Results are omitted if the computation took too long to finish.

5.5 Results: Larger Simulated Models

We now present results for larger models, with p = 10 system variables and q = 10 context variables (the meaning of the simulation parameters is explained in Section 5.2). We only consider causal mechanism changes here, but we do distinguish the acyclic and cyclic settings. We again used 500 samples per context. For the acyclic setting, we used ε = η = 0.25, while for the cyclic setting we used ε = η = 0.15, to obtain more or less similarly dense graphs in both scenarios. The motivation for these parameter choices is that they are somewhat comparable to the setting of the real-world data set that we will study in Section 5.8.

For these larger models, the computation time of the bootstrapped FCI methods became prohibitive with the default conditional independence test (described in Section 5.1.7). Although the implementation of this test that we were using could be sped up considerably by implementing it more efficiently, we here simply replaced it by a standard partial correlation test. This led to a speedup of about one order of magnitude at no apparent loss of accuracy.

5.5.1 FCI Variants

Figure 29 shows the accuracy on the task of predicting ancestral causal relations between system variables for the various FCI-JCI variants and FCI baselines, in both the acyclic and cyclic settings. The conclusions are in line with what we already observed for smaller models. Again, bootstrapping FCI helps considerably to boost the accuracy of its predictions. As before, FCI-obs (which uses only observational data) performs worst. The two baselines FCI-pooled and FCI-meta (which make use of all data) lead to a moderate improvement. The JCI variants (FCI-JCI0, FCI-JCI1 and FCI-JCI123) perform best, delivering almost maximum precision over a considerable recall range on the task of predicting the presence of an ancestral relation. JCI Assumption 1 does not seem to help much, as the results for FCI-JCI0 and FCI-JCI1 seem to be identical. Assuming in addition JCI Assumption 2 (and 3) does help to obtain slightly higher precision. The good performance of the FCI variants in the cyclic setting is surprising, since FCI was designed for the acyclic case.

Figure 29: FCI results for discovering ancestral causal relations between system variables in larger models. Top: acyclic; bottom: cyclic.

Figure 30: Bootstrapped LCD and ICP results for discovering ancestral causal relations between system variables for larger models.
Top: acyclic; bottom: cyclic.

5.5.2 LCD and ICP

In Figure 30, we show the accuracy of bootstrapped LCD and ICP on the task of predicting ancestral causal relations between system variables, in both the acyclic and the cyclic setting. As for FCI, bootstrapping improves the accuracy of the LCD and ICP results, and we decided to show only the bootstrapped results here. Contrary to what we observed for small models, the "multiple context" ("-mc") versions of both algorithms now clearly outperform the versions that use only a single (merged) context ("-sc") in these settings. Interestingly, the accuracy of LCD is quite similar to that of ICP. The additional complexity of ICP apparently does not lead to substantially better results than the LCD algorithm already offers in these settings. Also, the precision of LCD-mc is comparable to that of FCI-JCI123, the most accurate of the JCI variants of FCI (cf. Figure 29), on the task of predicting the presence of ancestral relations.

5.5.3 Discovering Intervention Targets

Figure 31 shows the performance of the FCI-JCI variants (and, as a baseline, Fisher's test) on the task of discovering indirect intervention targets, for both the acyclic and cyclic settings. Interestingly, JCI Assumption 2 seems necessary to obtain good results on this task. Still, FCI-JCI123-bs is outperformed by Fisher's test. On the other hand, Fisher's test cannot identify direct intervention targets, whereas FCI-JCI123 can. Figure 32 shows that FCI-JCI123 can identify the direct intervention targets as well as the non-targets with high precision. Surprisingly, this also works in the cyclic case.

Figure 31: FCI results for discovering indirect intervention targets in larger models. Top: acyclic; bottom: cyclic.

Figure 32: FCI results for discovering direct intervention targets in larger models. Top: acyclic; bottom: cyclic.
69 Mooij, Ma gliacane and Claassen Figure 33: Runtimes for v arious methods on larger models. Left: acyclic; Right: cyclic. 5.5.4 Comput a tion Time Figure 33 shows total run times of the metho ds that we ran on the larger sim ulated models. First, we observe no big differences betw een the run times for the cyclic setting with resp ect to the acyclic one. How ev er, we do observe huge differences in run time b et ween v arious metho ds. LCD v ariants and Fisher’s test are b y far the fastest. ICP v ariants come second. Bo otstrapping puts a large toll on computation time for FCI v arian ts. JCI v arian ts of F CI are m uc h slow er than non-JCI v ariants. This seems to b e mostly due to an exp onen tial increase in the n um b er of conditional indep endence tests. Indeed, w e observ ed that F CI-JCI v ariants are conditioning on a substan tial fraction of all 2 10 subsets of all context v ariables in the skeleton search phase. Nevertheless, JCI v arian ts of FCI are still computationally feasible in this setting, even with b o otstrapping. 5.6 Results: Large Simulated Mo dels W e no w presen t results for large simulated mo dels, with p = 100 system v ariables and q = 10 con text v ariables. W e only consider causal mechanism changes with unknown targets and only the acyclic setting. W e used  = η = 0 . 02, whic h yields rather sparse graphs, in order to a void that the computations w ould tak e too long (the meaning of the sim ulation parameters is explained in Section 5.2). W e used only 100 samples p er con text, b ecause the tasks w ould b ecome to o easy otherwise due to the sparsity of the graphs. W e again used the standard partial correlation test for FCI v ariants in this setting instead of the default conditional indep endence test describ ed in Section 5.1.7 for computational efficiency reasons. 29 29. 
29. It is still possible to use the standard test, and the results are slightly better, but the small gain in precision does not seem to justify the large increase in computation time. Implementing FCI-JCI123r as proposed in Section 4.2.5 would probably yield a significant reduction in computation time. Also, alternatives for the skeleton search phase such as FCI+ (Claassen et al., 2013) could be employed to gain further speedups. Last but not least, a more efficient implementation of the standard conditional independence test would help considerably.

Figure 34: Discovering ancestral causal relations between system variables in large models. Top: FCI variants; bottom: bootstrapped LCD and ICP variants. Note that we zoomed in on the PR curves.

5.6.1 Ancestral Causal Relations between System Variables

Figure 34 shows the accuracy for the task of discovering ancestral causal relations between system variables for various methods and baselines. Overall, we see that JCI variants outperform non-JCI baselines. On a detailed level, the conclusions are somewhat different from what we saw for smaller models. We start by discussing the results for the task of predicting the presence of causal relations. Like before, we see that bootstrapping FCI helps considerably to increase the precision of the predictions in the lower recall range. On the other hand, it reduces recall, possibly because only half of the available data is used and the independence threshold was not adjusted. As before, FCI-obs (which uses only observational data) performs worst. However, FCI-obs outperforms random guessing by a large margin in this setting. This is especially noteworthy given that it is only using 100 observational samples.
Interestingly, the main improvement in this setting is obtained by pooling the data; whether one includes the context variables (FCI-JCI0) or not (FCI-pooled) does not seem to make much of a difference. Also, using more JCI background knowledge yields only small improvements: the differences between FCI-JCI0, FCI-JCI1 and FCI-JCI123 are small. We observe that LCD-mc-bs obtains a higher precision at low recall than the FCI-JCI variants. On the other hand, the FCI-JCI variants maintain a decent precision over a larger recall range, contrary to the LCD and ICP variants that only look for very specific patterns, which may explain why their precision drops off at lower recall than it does for FCI-JCI. The “multiple context” (“-mc”) versions of LCD and ICP outperform the versions that use only a single (merged) context (“-sc”) in these settings. Interestingly, both variants of LCD outperform the corresponding variant of the more complicated ICP algorithm.

For predicting the absence of causal relations, the random guessing baseline already obtains a high precision, because of the sparsity of the graphs. FCI variants improve on this, roughly halving the error for the most confident predictions when using the bootstrapped versions. FCI-JCI variants again obtain the highest precision on this task, but do not significantly outperform FCI-pooled.

Figure 35: Results for discovering intervention targets in large models. Top: indirect intervention targets; bottom: direct intervention targets. Note that we zoomed in on the PR curves.

5.6.2 Discovering Intervention Targets

Figure 35 shows the performance of FCI-JCI variants (and as a baseline, Fisher’s test) on the task of discovering intervention targets. For discovering indirect intervention targets, Fisher’s test is now slightly outperformed by FCI-JCI123-bs.
Interestingly, JCI Assumption 2 seems necessary to obtain any results on that particular task. Investigating the PAGs shows that the edges between context and system variables are mostly bidirected for FCI-JCI1 and FCI-JCI0, which explains why these two algorithms yield no predictions at all. We do not have a good explanation for this behavior, but speculate that the sparse setting with many nodes makes certain empirical violations of faithfulness quite likely. One of the features of FCI-JCI123 is that it can discover direct intervention targets, something that Fisher’s test cannot. We find that FCI-JCI123 identifies with high precision direct intervention targets as well as non-targets in this simulation setting.

Figure 36: Runtimes for various methods on large models.

5.6.3 Computation Time

Figure 36 shows total run times of the methods that we ran on the large simulated models. For most methods, the largest part of the total running time is spent on performing independence tests. In particular, the tests that subdivide data according to context are relatively slow, since we have not seriously optimized their implementation.

5.7 Summary of Results on Simulated Data

We have seen in our experiments with simulated data that JCI methods typically outperform non-JCI methods, in some settings by a large margin. For certain tasks, our newly proposed FCI-JCI algorithms provide the new state of the art. Interestingly, Fisher’s baseline turned out to be hard to beat on the task of discovering indirect intervention targets. However, there are other tasks for which Fisher’s baseline cannot be applied but for which our newly proposed methods do apply, such as the task of discovering direct intervention targets. As expected, LCD, ICP and ASD variants work in both the acyclic and cyclic setting.
While FCI variants were expected to work only in the acyclic setting, we were surprised by how well they perform in the cyclic setting.30 Often, but not always, adding more context variables leads to better results. Having multiple contexts was seen to be mostly beneficial if each context variable targets only a small subset of system variables, and then only for methods that can explicitly take into account multiple context variables. An interesting exception to this are LCD and ICP, which due to their sensitivity to very specific patterns actually degrade in performance when too many system variables become directly targeted by context variables. Exploiting more JCI background knowledge typically led to better results, but it depends on the task and simulation setting how large the benefits are. Interestingly, the largest boost in accuracy for discovery of causal relations between system variables comes already from JCI Assumption 0 (i.e., from pooling the data and adding the context variables).

30. These empirical observations led us to conjecture that FCI does not need to be adapted for the σ-separation setting. It was shown very recently that this is indeed the case: FCI is also sound and complete in that setting (Mooij and Claassen, 2020).

As to the relative merits of the various JCI methods that we compared, it is difficult to state this concisely. Different methods behave differently on different tasks in different simulation settings, both in terms of precision and recall as well as in terms of computation time, and for many methods there is a combination of a task and a simulation setting in which they do relatively well.
Generally speaking, LCD and ICP behave quite similarly, are relatively fast and obtain high precision, but have low recall; ASD-JCI variants are among the most accurate and have the highest recall, but their computation time explodes for more than a handful of variables; the performance of FCI-JCI methods turns out to be somewhere in between, both in terms of accuracy and in terms of computation time. When implemented properly, their scalability is comparable to that of standard FCI, where the total number of variables |I| + |K| and the sparsity of the underlying MAGs are important factors in the computation time.

Some aspects that should be kept in mind when interpreting these results are that all simulations have been done under JCI Assumptions 1, 2, 3, and that there was no model misspecification. From that perspective, it was to be expected that JCI methods would work well. Also, we make no claims as to how our conclusions would generalize to different settings, for example, with non-Gaussian noise distributions, discrete variables, or continuous variables with non-linear interactions. For those settings, the choice of the conditional independence tests could have a large influence on the results, for example. A detailed study of that is beyond the scope of this paper.

5.8 Results: Real-world (Flow Cytometry) Data

In this subsection, we present an application of the Joint Causal Inference framework to real-world data: the flow cytometry data of Sachs et al. (2005). The data consists of a collection of data sets, where each data set corresponds with a different experimental condition in which a system was perturbed and subsequently measured. The system consists of an individual cell, drawn randomly from a collection of primary human immune system cells.
The system variables measure the abundances of several phosphorylated protein and phospholipid components in an individual cell using a measurement technology known as flow cytometry. Performing the measurement destroys the cell, and hence it is not possible to obtain multiple measurements over time from the same cell. Instead, snapshot measurements of thousands of individual cells are available, obtained in different experimental conditions in which the cells were perturbed with molecular interventions, performed by administering certain reagents to the cells. Most of these interventions are not perfect, but rather change the activity of some component by an unknown amount. There is prior knowledge about the targets of these interventions (see Table 3), but it is not clear whether the interventions are as specific as claimed.

Many existing causal discovery approaches assume that the true causal graph is acyclic and that the system variables are causally sufficient. However, it is known that these cellular signaling networks contain strong feedback loops, and it is quite likely that some of the variables may be subject to latent confounding. Thus, this type of experimental data constitutes a compelling motivation for the Joint Causal Inference framework. Over the years, this particular flow cytometry data set has become a “benchmark” in causal discovery (see, e.g., Ramsey and Andrews (2018) for some references).
C_γ  C_δ  C_ε  C_ζ  C_η  (C_α, C_θ, C_ι)   N_C   Reagents added
0    0    0    0    0    (1,0,0)           853   α-CD3, α-CD28
1    0    0    0    0    (1,0,0)           911   α-CD3, α-CD28, AKT inhibitor
0    1    0    0    0    (1,0,0)           723   α-CD3, α-CD28, G0076
0    0    1    0    0    (1,0,0)           810   α-CD3, α-CD28, Psitectorigenin
0    0    0    1    0    (1,0,0)           799   α-CD3, α-CD28, U0126
0    0    0    0    1    (1,0,0)           848   α-CD3, α-CD28, LY294002
0    0    0    0    0    (0,1,0)           913   PMA
0    0    0    0    0    (0,0,1)           707   β2CAMP

Table 6: Experimental design for part of the Sachs et al. (2005) flow cytometry data used in our experiments.

Many causal discovery methods have been applied to this data, and in many cases the “consensus network” in Sachs et al. (2005), visualized in Figure 38(a), was used as a ground truth to evaluate the results of the causal discovery procedure. However, we would like to point out that there are good reasons to be skeptical about the assumption that the “consensus network” represents the true causal graph of the system (as acknowledged by several domain experts we spoke with). Indeed, by inspecting the data one can find many examples where the data is incompatible with the hypothesis that the “consensus network” is a realistic and complete description of the underlying system. For example, according to the “consensus network”, an intervention that inhibits the activity of Mek should have no effect on Raf (because Raf is a direct cause of Mek and there is no feedback loop from Mek to Raf). However, in the data we see an increase of more than one order of magnitude in the abundance of Raf when U0126 (a reagent assumed to inhibit Mek activity) is added to the cells (see Figure 37). So either U0126 also directly targets Raf, or Mek must be a cause of Raf, in both cases contradicting the “consensus network”.
In the literature regarding this signaling pathway (which has been studied in great detail, since it plays an important role in many human cancers), it is often suggested that there should be a feedback loop back from Erk to Raf (whose molecular mechanism is still unknown). This would be in line with our observations from the data in Figure 37, and would also imply that the “consensus network” is incomplete. This is just one example illustrating that the data is not entirely compatible with the “consensus network”, and it is easy to find more of these examples via visual inspection of the data. Therefore, we will not make use of the “consensus network” as a ground truth to compare with when evaluating the output of the causal discovery algorithms.

Sachs et al. (2005) use an MCMC method to estimate the structure of a causal Bayesian network from the combined observational and interventional data, making use of the modified BDe score proposed by Cooper and Yoo (1999). Eaton and Murphy (2007) later used a dynamic programming algorithm to solve the estimation problem exactly. Like the original analysis by Sachs et al. (2005), many causal discovery methods that have been applied to this data rely on the background knowledge about the intervention types and targets, for which we provide (our interpretation) in Table 3. A notable exception is Eaton and Murphy (2007), who were the first to estimate the intervention targets directly from the data. Exploiting the background knowledge on intervention targets and types simplifies the causal discovery problem considerably. However, the accuracy of this background knowledge is not universally accepted. In particular, many biologists that we spoke with were skeptical about the assumed specificity of the interventions (i.e., the interventions may have additional direct effects that are not listed in the table).

Figure 37: Log-abundances of Mek vs. Raf (left) and of Erk vs. Mek (right). Blue: observational baseline (C_α = 1, C_∖α = 0); red: reagent U0126 added (C_ζ = 1). We observe: (i) the measurement noise is quite small; (ii) Raf and Mek are highly correlated (“consensus network”: Raf is a direct cause of Mek); (iii) strong evidence for feedback (intervening on Mek changes Raf abundance) if we assume that U0126 directly targets Mek but not Raf; (iv) the Mek inhibitor U0126 increases Mek abundance (so modeling this as a perfect intervention would not be realistic); (v) Mek and Erk are independent in both contexts (even though Mek is a direct cause of Erk according to the “consensus network”), an apparent violation of faithfulness.

We ran various FCI-JCI variants on a subset of the flow cytometry data.31 The experimental design of the original data is described in Table 2 (left), p. 33. In order to avoid deterministic relations between context variables, we merged context variables C_α (α-CD3/CD28), C_θ (PMA) and C_ι (β2CAMP), as discussed in Section 4.1, leading to the experimental design in Table 2 (right). This means that we must interpret the merged context variable as referring to the addition of PMA or β2CAMP, combined with the omission of α-CD3/CD28. However, note that there are still approximate conditional independences in this experimental design of the form C_β ⊥⊥ C_k | (C_α, C_θ, C_ι) for k ∈ {γ, δ, ε, ζ, η}. This could lead to problems with JCI Assumption 3. For that reason, but also in order to enable comparisons with other results reported in the literature, we only used the 8 (out of 14) experimental conditions in the data set in which no ICAM.2 had been administered (i.e., with C_β = 0), and ignored the others.
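The context-variable merge described above can be sketched as follows. The data layout and variable names here are hypothetical; the point is only that mutually exclusive binary indicators, which are deterministically related, get collapsed into a single context variable.

```python
# Sketch of merging mutually exclusive context indicators (e.g., C_alpha,
# C_theta, C_iota from the experimental-design table) into one variable,
# removing the deterministic relations between them.
def merge_exclusive_indicators(rows, cols):
    """rows: list of dicts of binary context indicators.
    cols: names of the mutually exclusive indicators to merge.
    Returns new rows with a single merged categorical context variable."""
    merged = []
    for r in rows:
        active = [c for c in cols if r.get(c) == 1]
        assert len(active) == 1, "indicators assumed mutually exclusive"
        new = {k: v for k, v in r.items() if k not in cols}
        new["C_merged"] = active[0]
        merged.append(new)
    return merged

rows = [{"C_alpha": 1, "C_theta": 0, "C_iota": 0, "C_gamma": 1}]
print(merge_exclusive_indicators(rows, ["C_alpha", "C_theta", "C_iota"]))
# → [{'C_gamma': 1, 'C_merged': 'C_alpha'}]
```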
Similarly to Eaton and Murphy (2007), we do not use the background knowledge regarding intervention types or targets. We only assume that the experimental setting is captured by the JCI framework. JCI Assumption 1 should be true, because the intervention is performed some time (approximately 20 minutes) before the measurements are done. We have already discussed the validity of JCI Assumption 2 for this particular experimental setting in Section 3.4. Assuming that the context variables provide a complete causal description of the context (in particular, that there are no unintended batch effects), JCI Assumption 2 applies. JCI Assumption 3 then also applies, since there are no conditional independences in the context distribution (after merging C_α, C_θ and C_ι and leaving out all contexts with C_β = 1). When using FCI-JCI123, we can then learn the intervention targets from the data itself, without making use of the background knowledge on intervention types and targets.

For comparison, we also ran FCI-obs, i.e., standard FCI using only the observational data set (i.e., the one in which only the global activators α-CD3 and α-CD28 have been administered). We also ran FCI-meta, which uses Fisher’s method to combine p-values of conditional independence tests in the 8 separate experimental conditions, which are then used as input for standard FCI. Finally, we ran FCI-pooled, i.e., standard FCI on the 8 experimental conditions pooled together (but excluding context variables). In all those FCI variants, we assumed that no selection bias would be present. We additionally compare with other JCI implementations, in particular, multiple variants of LCD and ICP. Computation times of various implementations are reported in Figure 42.

31. As preprocessing, we simply took the logarithm of the raw values.
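The p-value pooling behind FCI-meta is Fisher's classical method for combining independent p-values. A minimal sketch (using the closed-form chi-squared tail for even degrees of freedom, so no statistics library is needed):

```python
import math

def fisher_combine(pvals):
    """Fisher's method: -2 * sum(log p_i) is chi-squared distributed with
    2k degrees of freedom under the global null hypothesis; return the
    combined p-value. For even df = 2k the survival function has the
    closed form exp(-x) * sum_{i<k} x^i / i!  with  x = stat / 2."""
    k = len(pvals)
    x = -sum(math.log(max(p, 1e-300)) for p in pvals)  # guard against log(0)
    return math.exp(-x) * sum(x ** i / math.factorial(i) for i in range(k))
```

Combining a single p-value returns it unchanged; several small per-context p-values yield a small combined p-value, which is then fed to standard FCI as if it came from a single test.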
The “consensus network” and the PAGs obtained by the FCI baselines are shown in Figure 38. Figure 39 shows the PAGs obtained by the FCI-JCI variants. Note that we show the PAGs obtained without bootstrapping, although these are not necessarily stable. Therefore, we also show in Figure 40 the bootstrapped results for the learned ancestral causal relations between system variables and the learned intervention targets. One can see here that most, but not all, of these features are stably predicted by the FCI-JCI variants. In particular, the causal relations32 Mek ⇢ Raf, PLCg ⇢ PIP2, Akt ⇢ Erk, P38 ⇢ PKC are predicted by both FCI-JCI123 and FCI-JCI1 with high confidence. Regarding the learned indirect intervention targets, FCI-JCI123 has an advantage over FCI-JCI1, because it can exclude bidirected edges between context and system variables. Nonetheless, FCI-JCI1 predicts (in accordance with FCI-JCI123) that G0076 affects PLCg, PIP2, PKC and P38. For discovering indirect intervention targets, Fisher’s test for causality is simple and powerful. However, it is not able to learn the direct intervention targets, like FCI-JCI123 can. Although the direct intervention targets learned by FCI-JCI123 do not all correspond with the “consensus network”, they do agree for example on Psitectorigenin directly targeting PIP2. Most direct intervention targets learned by FCI-JCI123 were also found by Eaton and Murphy (2007). It is interesting to see that FCI-JCI123 considers both Mek and Erk as direct intervention targets of U0126, while Raf is identified as an indirect target.

In the absence of a reliable ground truth, we draw the following conclusions from these results. First, the reference methods that exploit knowledge on intervention types and targets obtain rather consistent results.
Second, the JCI methods that do not assume knowledge on intervention types and targets (and the approach by Eaton and Murphy (2007)) show less consistent results. This could indicate that the data does not contain enough signal to solve this more ambitious task reliably. In that case, some model misspecification (for example, strong non-linearities or deviations from Gaussianity, which make a simple partial correlation based conditional independence test inadequate) could lead to inconsistencies between methods. Nonetheless, the performance of the FCI-JCI variants appears to be a considerable improvement over the simple FCI baselines FCI-obs, FCI-meta and FCI-pooled. Remarkably, FCI-JCI123 manages to orient most of the edges. The output also resembles the consensus network, although some of the edges seem to be reversed.

32. We write i ⇢ j if i is a cause of j.

Figure 38: PAGs resulting from various FCI baselines on the flow cytometry data of Sachs et al. (2005). Intervention variables are denoted with rectangles, system variables with ellipses. From top to bottom: (a) “Consensus network” according to Sachs et al. (2005); (b) FCI-pooled: FCI on pooled data (without adding the context variables); (c) FCI-obs: FCI on the first (“observational”) data set, in which only the global activators α-CD3 and α-CD28 have been administered; (d) FCI-meta: FCI with as input the result of Fisher’s method for combining conditional independence test results from all data sets.
Considering that we have not taken into account the available background knowledge on the intervention types and targets (Table 3), and that we have not used any tuning of the parameters, we consider this still an impressive and encouraging result that illustrates the potential that JCI has for analyzing complex scientific data sets.

Figure 39: PAGs resulting from various FCI-JCI variants on the flow cytometry data of Sachs et al. (2005): (a) FCI-JCI123; (b) FCI-JCI1; (c) FCI-JCI0. These causal discovery methods do not make use of the biological prior knowledge regarding intervention types and targets, but learn the intervention targets from the data. Intervention variables are denoted with rectangles, system variables with ellipses. From top to bottom, fewer JCI Assumptions are made. Note that these are individual PAGs that have not been bootstrapped. To get an idea of the robustness, Figure 40 also shows the corresponding bootstrap estimates for certain features of the PAGs.
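The bootstrap estimates mentioned above rest on a generic scheme: rerun the discovery procedure on resampled data and score each feature by how often it is predicted. A hedged sketch follows; the `discover` routine stands in for any of the JCI implementations, and the resampling details are illustrative rather than the paper's exact protocol.

```python
import random

def bootstrap_confidence(rows, discover, n_boot=100, seed=0):
    """Score each discovered feature (e.g., an ancestral relation or an
    intervention target) by the fraction of bootstrap runs predicting it.
    `discover` maps a data set to a set of hashable features."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        resample = [rng.choice(rows) for _ in rows]  # rows with replacement
        for feature in discover(resample):
            counts[feature] = counts.get(feature, 0) + 1
    return {feat: c / n_boot for feat, c in counts.items()}
```

A feature that is predicted in every bootstrap run gets confidence 1.0; unstable features get scores closer to 0, which is the kind of stability information the bootstrapped figures display.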
Figure 40: Causal relationships between the biochemical agents in the flow cytometry data of Sachs et al. (2005), according to different causal discovery methods and the “consensus network” according to Sachs et al. (2005) (which we do not consider as a reliable complete ground truth). Panels: direct causal relations and ancestral relations; cells show confidence scores. We also included results of causal discovery methods reported in other works.
Figure 41: Intervention effects on biochemical agents in the flow cytometry data of Sachs et al. (2005), according to different causal discovery methods and the “consensus network” according to Sachs et al. (2005) (which we do not consider as a reliable complete ground truth). Panels: direct intervention targets and indirect intervention targets; cells show confidence scores. We also included results of causal discovery methods reported in other works.

Figure 42: Runtimes for various methods on the flow cytometry data of Sachs et al. (2005).

6. Conclusions and Discussion

In this work, we proposed Joint Causal Inference (JCI), a powerful and elegant framework for causal discovery from data sets from multiple contexts. JCI generalizes the ideas of causal discovery based on experimentation (as in randomized controlled trials and A/B testing) to multiple context and system variables. Seen from another perspective, it also generalizes the ideas of causal discovery from purely observational data to the setting of data sets from multiple contexts (for example, different interventional regimes) by reducing the latter to a special case of the former, with additional background knowledge on the causal relationships involving the context variables.
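Concretely, this reduction amounts to pooling the context-specific data sets and appending context indicator variables, after which an observational-data algorithm can be run on the joint (context, system) data. A minimal sketch with hypothetical data:

```python
# Illustrative sketch of the data pooling behind the JCI reduction: stack the
# per-context data sets and append one-hot context indicators (context 0 is
# treated as the baseline). Data sets and values below are hypothetical.
def pool_with_context(datasets):
    """datasets: list of lists of rows (one list of rows per context).
    Returns pooled rows of the form context_indicators + system_values."""
    q = len(datasets) - 1  # number of context indicator variables
    pooled = []
    for c, rows in enumerate(datasets):
        indicators = [1 if c == k + 1 else 0 for k in range(q)]
        for row in rows:
            pooled.append(indicators + list(row))
    return pooled

obs = [[0.1, 0.2], [0.3, 0.4]]   # baseline context
intv = [[1.1, 0.2]]              # e.g., one intervened context
print(pool_with_context([obs, intv]))
# → [[0, 0.1, 0.2], [0, 0.3, 0.4], [1, 1.1, 0.2]]
```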
We proposed different flavours of JCI that differ in the amount of background knowledge that is assumed, some being more conservative than others. JCI can be implemented with any causal discovery method that can take into account the background knowledge. Surprisingly, we saw that one can even apply an off-the-shelf causal discovery algorithm for purely observational data to the pooled data (with context variables included), completely ignoring the background knowledge, and thereby already obtain significant improvements in the accuracy of the discovered causal relations. We have seen how JCI deals with different types of interventions in a unified fashion, how it reduces learning intervention targets to learning the causal relations between context and system variables, and that it allows one to fully exploit all the information in the joint distribution on system and context variables. JCI was partially inspired by the approach by Eaton and Murphy (2007), but is much more generally applicable, as it allows for latent confounders and cycles, which are both important in many application domains. Especially noteworthy is that the more conservative flavours of JCI allow for confounders between system and context variables, which cannot always be excluded, for example because the relevant aspects of the system’s context were only partially observed.

We have investigated various implementations of JCI, amongst which some existing algorithms (LCD, ICP, and standard estimators for the presence of a causal effect in a randomized controlled trial), and we also proposed novel implementations that are adaptations of algorithms for causal discovery from purely observational data to the JCI setting. In particular, we proposed ASD-JCI, an adaptation of the method of Hyttinen et al. (2014) combined with ideas from Magliacane et al.
(2016b), which is very flexible and accurate. By replacing d-separation with σ -separation (F orr ´ e and Mo oij, 2018), ASD-JCI can also b e used in general nonlinear cyclic settings. A ma jor disadv antage of ASD-JCI is that it b ecomes computationally extremely exp ensiv e already for as few as ab out 7 v ariables. W e also proposed FCI-JCI, an adaptation of the FCI algorithm that enables it to exploit the applicable JCI bac kground knowledge. This algorithm is less accurate than ASD-JCI, but m uch faster. W e ev aluated different implementations of the JCI approac h on syn thetic data. W e saw that JCI implementations outp erform other state-of-the-art causal discov ery algorithms in most settings. In some cases, the gains were quite extreme; for example, while purely observ ational causal discov ery metho ds did not p erform b etter than random guessing on small mo dels, JCI v arian ts were able to discov er with almost p erfect precision ancestral causal relations b et ween system v ariables. The only case in whic h all JCI implemen tations w ere outp erformed b y another causal discov ery algorithm that com bines data from differen t con texts, was the setting in which the contexts corresp ond with p erfect interv en tions with kno wn targets. The reason is that none of the JCI implemen tations exploited the p erfect nature of the in terven tions. Ho wev er, we also sa w that if interv en tions are not p erfect (for example, in the case of causal mec hanism c hanges), JCI implemen tations still p erform v ery w ell, while algorithms relying on the p erfect nature of interv entions ma y suffer from mo del missp ecification. Another in teresting observ ation w e made in the exp eriments on synthetic data is that for the task of disco vering indirect (ancestral) causal relations, the classic (and v ery simple and fast) LCD algorithm can be competitive with more sophisticated algorithms, lik e ICP and b o otstrapp ed FCI-JCI. 
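The simplicity of LCD can be appreciated from a compact sketch. For a context variable C that is known not to be caused by the system, observing that C and X are dependent, X and Y are dependent, and C and Y are independent given X licenses the conclusion that X is a cause of Y. The implementation below is our own rough sketch, using a Gaussian partial-correlation (Fisher z) test rather than the specific tests used in the paper's experiments; the function names and the chosen significance level are illustrative assumptions.

```python
import numpy as np
from math import atanh, erf, sqrt

def fisher_z_indep(x, y, z=None, alpha=0.001):
    """Gaussian (partial) correlation test: True iff independence is not rejected."""
    data = np.column_stack([x, y] + ([z] if z is not None else []))
    corr = np.corrcoef(data, rowvar=False)
    if z is None:
        r, dim_z = corr[0, 1], 0
    else:
        p = np.linalg.inv(corr)  # precision matrix gives the partial correlation
        r, dim_z = -p[0, 1] / sqrt(p[0, 0] * p[1, 1]), 1
    stat = sqrt(data.shape[0] - dim_z - 3) * abs(atanh(r))
    pval = 2 * (1 - 0.5 * (1 + erf(stat / sqrt(2))))
    return pval > alpha

def lcd(c, x, y):
    """LCD trigger: with C not caused by the system, conclude 'X causes Y'
    from C _||/_ X, X _||/_ Y, and C _||_ Y | X."""
    return (not fisher_z_indep(c, x)
            and not fisher_z_indep(x, y)
            and fisher_z_indep(c, y, x))

rng = np.random.default_rng(1)
n = 5000
c = rng.integers(0, 2, size=n).astype(float)  # context: regime indicator
x = 1.5 * c + rng.normal(size=n)              # context shifts X
y = 0.8 * x + rng.normal(size=n)              # X causes Y
# The first call should trigger (X -> Y detected); the second should not.
print(lcd(c, x, y), lcd(c, y, x))
```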
We further illustrated the use of JCI by analyzing flow cytometry protein expression data (Sachs et al., 2005), a famous "benchmark" in the field of causal discovery. Unfortunately, applying ASD-JCI to the 11 system and 6 context variables would take excessive amounts of computation time, so we had to resort to FCI-JCI instead for causal discovery on a global scale. We compared with LCD and ICP variants that do causal discovery locally. The results of the various methods differ considerably, but also show some consistent patterns. This suggests that there is indeed a strong causal signal in the data, but it seems hard to conclude which of the various methods is best equipped to extract this signal most reliably, because the ground truth is only partially known. In future work, we plan to analyze more recent cytometry data sets that will allow for a more principled validation. Because the true causal structure is often not known while interventional data is available, this requires extending JCI-based causal discovery with causal prediction techniques, enabling one to predict the results of a particular intervention (Magliacane et al., 2018).

JCI offers increased flexibility when it comes to designing experiments for the purpose of causal discovery, as the JCI framework facilitates the analysis of data from almost arbitrary experimental designs. This allows researchers to trade off the number and complexity of experiments to be done against the reliability of the analysis of the data for the purpose of causal discovery. Compared with existing methods, the framework offered by JCI is the most generally applicable: it handles various intervention types and other context changes in a unified and non-parametric way, allows for latent variables and cycles, and also applies when intervention types and targets are unknown, a common situation in causal discovery for complex systems.
As future work, we plan to (i) weaken the faithfulness assumption of JCI with respect to the context variables to allow for even more general experimental designs, (ii) address the problem of learning from data sets with non-identical (but overlapping) sets of observed variables, (iii) address selection bias, (iv) develop algorithms that need less computation time to deliver reliable results, and (v) work on more applications to real-world data.

Acknowledgments

We thank Thijs van Ommen for useful discussions and the reviewers and editor for their constructive comments. SM, JMM and TC were supported by NWO, the Netherlands Organization for Scientific Research (VIDI grant 639.072.410). SM was also supported by the Dutch programme COMMIT/ under the Data2Semantics project. TC was supported by EU-FP7 grant agreement n.603016 (MATRICS).

Appendix A. Proofs

In this appendix we provide the proofs that were omitted from the main text.

A.1 JCI Foundations

Theorem 13 Assume that JCI Assumptions 0, 1 and 2 hold for SCM M:

  M:  C_k = f_k(C_{pa_H(k) ∩ 𝒦}, E_{pa_H(k) ∩ 𝒥}),  k ∈ 𝒦,
      X_i = f_i(X_{pa_H(i) ∩ ℐ}, C_{pa_H(i) ∩ 𝒦}, E_{pa_H(i) ∩ 𝒥}),  i ∈ ℐ,
      P(E) = ∏_{j ∈ 𝒥} P(E_j).

For any other SCM M̃ satisfying JCI Assumptions 0, 1 and 2 that is the same as M except that it models the context differently, i.e., of the form

  M̃:  C_k = f̃_k(C_{pa_H̃(k) ∩ 𝒦}, E_{pa_H̃(k) ∩ 𝒥̃}),  k ∈ 𝒦,
      X_i = f_i(X_{pa_H(i) ∩ ℐ}, C_{pa_H(i) ∩ 𝒦}, E_{pa_H(i) ∩ 𝒥}),  i ∈ ℐ,
      P(E) = ∏_{j ∈ 𝒥̃} P(E_j),

with 𝒥 ⊆ 𝒥̃ and pa_H(i) = pa_H̃(i) for all i ∈ ℐ, we have that:
(i) the conditional system graphs coincide: G(M)_{do(𝒦)} = G(M̃)_{do(𝒦)};
(ii) if M̃ and M induce the same context distribution, i.e., P_M(C) = P_M̃(C), then for any perfect intervention on the system variables do(I, ξ_I) with I ⊆ ℐ (including the non-intervention I = ∅), M̃_{do(I, ξ_I)} is observationally equivalent to M_{do(I, ξ_I)};
(iii) if the context graphs G(M̃)_𝒦 and G(M)_𝒦 induce the same separations, then also G(M̃) and G(M) induce the same separations (where "separations" can refer to either d-separations or σ-separations).

Proof Let M be an SCM of the form (5). Under JCI Assumption 1, the structural equations for the context variables do not depend on the system variables: C_k = f_k(C_{pa_H(k) ∩ 𝒦}, E_{pa_H(k) ∩ 𝒥}) for k ∈ 𝒦. Because of JCI Assumption 2, pa_H(𝒦) ∩ pa_H(ℐ) ∩ 𝒥 = ∅, i.e., the context variables do not share any exogenous variable with the system variables. This means that in G(M), any edge between a context variable and a system variable must be a directed edge pointing from the context variable to the system variable, i.e., of the form k → i with k ∈ 𝒦, i ∈ ℐ. Since the structural equations for the system variables of M̃ coincide with those of M, their solutions (in terms of the context and exogenous variables) also coincide, even after any perfect intervention on a subset of the system variables. Since C is independent of E_{pa_H(ℐ)} (both for M and for M̃), and since the distributions of the exogenous variables E_𝒥 coincide for M and M̃ by assumption, this implies that the interventional distributions of M and M̃ coincide for any perfect intervention on a subset of system variables if P_M(C) = P_M̃(C).

Assume now that G(M̃)_𝒦 and G(M)_𝒦 induce the same separations. In the remainder of this proof, "open" can be read consistently either as "σ-open" or as "d-open". Note that by assumption, G(M)_{do(𝒦)} = G(M̃)_{do(𝒦)}, and that the edges in G(M)_{do(𝒦)} are a subset of those in G(M), and of those in G(M̃).
We will prove that G(M̃) and G(M) induce the same separations by first showing that for any two context nodes connected by a path π in G(M)_𝒦 such that π is A-open in G(M) for some A ⊆ ℐ ∪ 𝒦, we can find a path π′ in G(M)_𝒦 between the two nodes that is A′-open in G(M), where A′ = (A ∩ 𝒦) ∪ B with B ⊆ 𝒦 ∩ an_{G(M)_{do(𝒦)}}(A \ 𝒦). For π to be A-open in G(M), any collider on π that is not a G(M)-ancestor of A ∩ 𝒦 must be a G(M)-ancestor of A \ 𝒦. Since the latter does not necessarily imply that the collider must also be a G(M̃)-ancestor of A \ 𝒦, the idea will be to replace the variables from A \ 𝒦 in the conditioning set by variables in 𝒦 ∩ an_{G(M)_{do(𝒦)}}(A \ 𝒦) (i.e., context nodes that are guaranteed to be both G(M)-ancestors and G(M̃)-ancestors of A \ 𝒦) that are G(M)-descendants of those colliders that are not already G(M)-ancestors of A ∩ 𝒦. It will turn out that this is not always possible to achieve for π itself, but that we can construct another path π′ for which this can be done.

Consider a path π in G(M)_𝒦 between k_0 ∈ 𝒦 and k_n ∈ 𝒦 that is A-open in G(M) for some A ⊆ ℐ ∪ 𝒦. We will iteratively construct a walk in G(M)_𝒦 between the same two nodes k_0 and k_n that is both A-open in G(M) and ((A ∩ 𝒦) ∪ B)-open in G(M), where B ⊆ 𝒦 ∩ an_{G(M)_{do(𝒦)}}(A \ 𝒦). We proceed by induction. Suppose a walk π_m in G(M)_𝒦 between k_0 and k_n is A-open in G(M). Then it is (A ∪ B_m)-open in G(M), where B_m = (𝒦 ∩ an_{G(M)_{do(𝒦)}}(A \ 𝒦)) \ ncol(π_m). Consider the "problematic" colliders col(π_m) \ an_{G(M)}((A ∩ 𝒦) ∪ B_m) on π_m, i.e., the ones that are not ancestors of (A ∩ 𝒦) ∪ B_m. If there are any, choose one such problematic collider c ∈ 𝒦 on π_m. Since c is not a G(M)-ancestor of (A ∩ 𝒦) ∪ B_m, but π_m is A-open, it has to be a G(M)-ancestor of A \ 𝒦.
This means that there is a directed path in G(M) that starts at c, passes through zero or more context nodes, none of which lie in (A ∩ 𝒦) ∪ B_m by assumption, and then through zero or more system nodes, until it ends at a system node in A \ 𝒦. Let k_c ∈ 𝒦 be the last context node on this directed path before the path crosses the context-system boundary. By assumption, k_c must occur as a non-collider on π_m (otherwise it would be in B_m and c would be a G(M)-ancestor of B_m), hence we can make a shortcut by replacing the subwalk of π_m between c and k_c by a directed path c → ··· → k_c in G(M)_𝒦, which necessarily consists entirely of context nodes that are not in A. If k_c occurs more than once on this new walk, remove the entire subwalk between the two outermost occurrences of k_c, so that k_c occurs only once. This new walk π_{m+1} must be A-open: c (if still present) is now a non-collider that is not in A, none of the (non-collider) nodes on the directed path (if still present) between c and k_c are in A, and k_c itself is not in A and is a G(M)-ancestor of A, so it does not matter whether it is a collider or a non-collider. The number of problematic colliders on π_{m+1} is at least one less than on π_m: c is no longer a collider, and if k_c became a collider on π_{m+1}, it will not be problematic (as it is itself in 𝒦 ∩ an_{G(M)_{do(𝒦)}}(A \ 𝒦) and cannot also occur as a non-collider on π_{m+1}). We repeat this procedure until no problematic colliders are present anymore. This yields a walk π_M that is both A-open and A′-open, with A′ = (A ∩ 𝒦) ∪ B, where B = B_M = (𝒦 ∩ an_{G(M)_{do(𝒦)}}(A \ 𝒦)) \ ncol(π_M). We now shorten this A′-open walk π_M in G(M)_𝒦 into an A′-open path π′ in G(M)_𝒦. This implies that there must be an A′-open path π̃′ in G(M̃)_𝒦 connecting k_0 and k_n, by assumption.
Every collider on π̃′ is a G(M̃)_𝒦-ancestor of (A ∩ 𝒦) ∪ B, and hence a G(M̃)-ancestor of (A ∩ 𝒦) ∪ B, and hence a G(M̃)-ancestor of A. Therefore, π̃′ is also A′-open in G(M̃). But then it must also be A-open, as we can add A \ 𝒦 to the conditioning set without blocking any non-collider on π̃′, and then remove B \ (A \ 𝒦) from the conditioning set, as all colliders are still kept open due to being either a G(M̃)-ancestor of A ∩ 𝒦 or of A \ 𝒦.

Consider now any path in G(M) that is A-open, for A ⊆ ℐ ∪ 𝒦. Any edge on the path between a system node and a context node must be of the form i ← k (with i ∈ ℐ, k ∈ 𝒦) or k → i, where i is in a different strongly-connected component than k and k cannot be in A (because the path was assumed to be A-open). Replacing each longest subpath consisting entirely of context nodes k_0 ... k_n (with all k_0, ..., k_n ∈ 𝒦) by a corresponding A-open path in G(M̃)_𝒦 between k_0 and k_n gives a walk in G(M̃) that by construction is also A-open in G(M̃). Any system collider on this walk must be a collider on the original path, and therefore a G(M)-ancestor of A, and therefore also a G(M̃)-ancestor of A. Any system non-collider on this walk is also a system non-collider on the original path and is therefore not in A or, in the case of σ-separation, points only to nodes in the same strongly-connected component of G(M), and hence of G(M̃). Any context non-collider on this walk cannot be in A, or, in the case of σ-separation, points to the same strongly-connected component in G(M̃), since the replacing path in G(M̃)_𝒦 was A-open by construction. Any context collider on this walk is a G(M)_𝒦-ancestor of (A ∩ 𝒦) ∪ B, and therefore must be a G(M̃)-ancestor of A. The walk can be shortened into an A-open path in G(M̃).
Similarly, one can show that for any path in G(M̃) that is A-open, there must be a corresponding path in G(M) that is A-open.

Corollary 14 Assume that JCI Assumptions 0, 1 and 2 hold for SCM M. Then there exists an SCM M̃ that satisfies JCI Assumptions 0, 1, 2 and 3, such that
(i) the conditional system graphs coincide: G(M)_{do(𝒦)} = G(M̃)_{do(𝒦)};
(ii) for any perfect intervention on the system variables do(I, ξ_I) with I ⊆ ℐ (including the non-intervention I = ∅), M̃_{do(I, ξ_I)} is observationally equivalent to M_{do(I, ξ_I)};
(iii) if the context distribution P_M(C) contains no conditional or marginal independences, then the same σ-separations hold in G(M̃) as in G(M); if, in addition, the Directed Global Markov Property holds for M, then also the same d-separations hold in G(M̃) as in G(M).

Proof Let M be an SCM of the form (5). Under JCI Assumption 1, the structural equations for the context variables do not depend on the system variables: C_k = f_k(C_{pa_H(k) ∩ 𝒦}, E_{pa_H(k) ∩ 𝒥}) for k ∈ 𝒦. Because of JCI Assumption 2, pa_H(𝒦) ∩ pa_H(ℐ) ∩ 𝒥 = ∅, i.e., the context variables do not share any exogenous variable with the system variables. Consider now the modified SCM M̃ of the form:

  M̃:  C_k = g_k(E_C),  k ∈ 𝒦,
      X_i = f_i(X_{pa_H(i) ∩ ℐ}, C_{pa_H(i) ∩ 𝒦}, E_{pa_H(i) ∩ 𝒥}),  i ∈ ℐ,
      P(E) = ∏_{j ∈ 𝒥̃} P(E_j),

where 𝒥̃ = 𝒥 ∪ {C} contains an additional exogenous variable E_C ∈ ∏_{k ∈ 𝒦} 𝒞_k with components (E_C)_k ∈ 𝒞_k, with distribution P(E_C) = P_M(C), and where g_k is the projection onto the k-th component, g_k : E_C ↦ (E_C)_k. By construction, this SCM M̃ satisfies JCI Assumptions 1 and 2. The only aspect that requires some work is to prove that M̃ as constructed above is simple (Definition 4).
Take O ⊆ ℐ and consider the solution function for O according to M:

  g_O : 𝒳_{(pa_H(O)\O) ∩ ℐ} × 𝒞_{(pa_H(O)\O) ∩ 𝒦} × ℰ_{pa_H(O) ∩ 𝒥} → 𝒳_O.

This solves the structural equations for O, and since these are the same for M̃ as for M, the same solution function also works for M̃. Now take Q ⊆ 𝒦 and consider the solution function g_Q : ℰ_C → 𝒞_Q with components ℰ_C → 𝒞_k : e_C ↦ g_k(e_C) for k ∈ Q. Any other solution function can be obtained by composition. We conclude that M̃ is simple. M̃ also induces the same context distribution, P_M̃(C) = P_M(C), and satisfies JCI Assumption 3 by construction.

The other statements now follow by applying Theorem 13, where the only thing left to show is that G(M̃)_𝒦 and G(M)_𝒦 induce the same σ-separations if the context distribution P_M(C) contains no conditional independences, and the same d-separations if in addition the Directed Global Markov Property holds for M. Marginalizing out the system variables (both in M and in M̃) yields M_{\ℐ} and M̃_{\ℐ}, with graphs G(M_{\ℐ}) = G(M)_𝒦 and G(M̃_{\ℐ}) = G(M̃)_𝒦, respectively. By the Generalized Directed Global Markov Property, since P_M(C) has no conditional independences, there must be a K-σ-open path in G(M)_𝒦 between any two context nodes k ≠ k′ ∈ 𝒦, for any K ⊆ 𝒦 with {k, k′} ∩ K = ∅. If the Directed Global Markov Property holds for M, then it holds for G(M_{\ℐ}), and hence there must even be a K-d-open path in G(M)_𝒦 between any two context nodes k ≠ k′ ∈ 𝒦, for any K ⊆ 𝒦 with {k, k′} ∩ K = ∅. Since by construction G(M̃)_𝒦 contains all bidirected edges k ↔ k′, there is a K-d-open path in M̃_{\ℐ} between any two context nodes k ≠ k′ ∈ 𝒦, for any K ⊆ 𝒦 with {k, k′} ∩ K = ∅.
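The construction in this proof replaces the separate context mechanisms by a single exogenous variable E_C that carries the entire context distribution, while the system mechanisms stay untouched. For discrete variables one can check claim (ii) by exact enumeration. The two toy models below are our own instances of the construction (binary context variables C1, C2 and a binary system variable X), not examples taken from the paper:

```python
from collections import defaultdict
from itertools import product

# Exogenous distributions for M: independent E1, E2, EX.
pE1 = {0: 0.5, 1: 0.5}
pE2 = {0: 0.7, 1: 0.3}
pEX = {0: 0.9, 1: 0.1}

def joint_M():
    dist = defaultdict(float)
    for (e1, p1), (e2, p2), (ex, px) in product(pE1.items(), pE2.items(), pEX.items()):
        c1 = e1                  # context mechanism of M
        c2 = c1 ^ e2             # context mechanism of M (C2 depends on C1)
        x = (c1 & c2) ^ ex       # system mechanism, shared with M~
        dist[(c1, c2, x)] += p1 * p2 * px
    return dict(dist)

# Distribution of E_C in M~: the context distribution P_M(C1, C2).
pEC = defaultdict(float)
for (e1, p1), (e2, p2) in product(pE1.items(), pE2.items()):
    pEC[(e1, e1 ^ e2)] += p1 * p2

def joint_Mtilde():
    dist = defaultdict(float)
    for (ec, pc), (ex, px) in product(pEC.items(), pEX.items()):
        c1, c2 = ec              # g_k: projection onto the k-th component
        x = (c1 & c2) ^ ex       # unchanged system mechanism
        dist[(c1, c2, x)] += pc * px
    return dict(dist)

dM, dMt = joint_M(), joint_Mtilde()
print("same support:", set(dM) == set(dMt))
# The maximal difference is at floating-point rounding level: the joint
# distributions over (C1, C2, X) coincide, as the corollary asserts.
print("max probability difference:", max(abs(dM[k] - dMt[k]) for k in dM))
```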
The following Lemma and Corollary extend these fundamental results further, which enables one to state a precise relationship between our JCI approach of jointly modeling system and context and alternative approaches based on modeling the system conditional on its context (e.g., Yang et al. (2018)).

Lemma 23 Let M be an SCM that satisfies JCI Assumptions 0, 1 and 2. Then the same restricted separations hold in G(M) as in the conditional system graph G(M)_{do(𝒦)}, i.e.,

  X ⊥_{G(M)_{do(𝒦)}} Y | Z  ⟺  X ⊥_{G(M)} Y | Z

whenever X, Y, Z ⊆ ℐ ∪ 𝒦 with X ∩ 𝒦 = ∅ and 𝒦 ⊆ Y ∪ Z (where "separations" can refer to either d-separations or σ-separations).

Proof Let X, Y, Z ⊆ ℐ ∪ 𝒦 be such that X ∩ 𝒦 = ∅ and 𝒦 ⊆ Y ∪ Z. Let G_1, G_2 be two graphs in {G(M), G(M)_{do(𝒦)}}. Let π be a path in G_1 between a node in X and a node in Y that is open in G_1 and that does not contain any non-endpoint nodes in X ∪ Y. It cannot have non-endpoint nodes in 𝒦, because those would be either in Y (a contradiction), or in Z (and since they would be non-colliders with an outgoing directed edge pointing to another strongly-connected component, they would block the path, another contradiction). But then the same path π must be present in G_2 as well. It is easy to see that it must also be open in G_2, since de_{G_1}(i) = de_{G_2}(i) and sc_{G_1}(i) = sc_{G_2}(i) for each i ∈ ℐ.

We can now formulate the following slightly adapted version of Corollary 14:

Corollary 24 Assume that JCI Assumptions 0, 1 and 2 hold for SCM M.
Then there exists an SCM M̃ that satisfies JCI Assumptions 0, 1, 2 and 3, such that
(i) the conditional system graphs coincide: G(M)_{do(𝒦)} = G(M̃)_{do(𝒦)};
(ii) as a consequence, the same restricted separations hold in G(M̃) as in G(M) and in their corresponding conditional system graphs, i.e.,

  X ⊥_{G(M̃)} Y | Z ⟺ X ⊥_{G(M̃)_{do(𝒦)}} Y | Z ⟺ X ⊥_{G(M)_{do(𝒦)}} Y | Z ⟺ X ⊥_{G(M)} Y | Z

whenever X, Y, Z ⊆ ℐ ∪ 𝒦 with X ∩ 𝒦 = ∅ and 𝒦 ⊆ Y ∪ Z (where "separations" can refer to either d-separations or σ-separations);
(iii) for any perfect intervention on the system variables do(I, ξ_I) with I ⊆ ℐ (including the non-intervention I = ∅), and any perfect intervention on all context variables do(𝒦, c):

  P_M̃(X | do(𝒦, c), do(I, ξ_I)) = P_M(X | do(𝒦, c), do(I, ξ_I));

(iv) as a consequence, P_M(X | C) = P_M̃(X | C), and in particular, the same restricted conditional independences hold, i.e.,

  X ⊥⊥_{M̃} Y | Z ⟺ X ⊥⊥_{M̃_{do(𝒦)}} Y | Z ⟺ X ⊥⊥_{M_{do(𝒦)}} Y | Z ⟺ X ⊥⊥_{M} Y | Z

whenever X, Y, Z ⊆ ℐ ∪ 𝒦 with X ∩ 𝒦 = ∅ and 𝒦 ⊆ Y ∪ Z;
(v) the context distribution P_M̃(C) contains no conditional or marginal independences.

Proof The same SCM M̃ as constructed in the proof of Corollary 14, but with a generic distribution P(E_C) that contains no conditional or marginal independences, is easily seen to fulfill all requirements.

A.2 Minimal Conditional (In)Dependencies

In this section we generalize two useful Lemmas from Claassen and Heskes (2011) to the cyclic setting.

Definition 25 Let X, Y, Z, S ⊆ V be sets of nodes in a DMG G = ⟨V, E, F⟩. Let ⊥ denote a DMG separation property, e.g., d-separation (⊥_d) or σ-separation (⊥_σ). We say that the minimal separation X ⊥_G Y | S ∪ [Z] holds if and only if

  X ⊥_G Y | S ∪ Z  ∧  ∀ Q ⊊ Z : X ⊥̸_G Y | S ∪ Q.
In words: all nodes in Z are required (in the context of the nodes in S) to separate X from Y. Similarly, we say that the minimal connection X ⊥̸_G Y | S ∪ [Z] holds if and only if

  X ⊥̸_G Y | S ∪ Z  ∧  ∀ Q ⊊ Z : X ⊥_G Y | S ∪ Q.

In words: all nodes in Z are required (in the context of the nodes in S) to connect X with Y. Note that despite the notation, a minimal connection is not the logical negation of a minimal separation.

Minimal connections imply the absence of certain ancestral relations:

Lemma 26 Let {X}, {Y}, S, {Z} ⊆ V be mutually disjoint sets of nodes in a DMG G = ⟨V, E, F⟩. For both d-separation (⊥_d) and σ-separation (⊥_σ), we have that:

  X ⊥̸_G Y | S ∪ [{Z}]  ⟹  Z ∉ an_G({X, Y} ∪ S).

Proof The minimal connection means that all paths between X and Y are closed when conditioning on S, and that there exists at least one path between X and Y that is open when conditioning on S ∪ {Z}. For d-separation, this means that such a path (i) contains a collider not in an_G(S), (ii) has every collider in an_G(S ∪ {Z}), and (iii) has no non-collider in S ∪ {Z}. For σ-separation, this means that such a path (i) contains a collider not in an_G(S), (ii) has every collider in an_G(S ∪ {Z}), and (iii) has every non-collider either not in S ∪ {Z}, or, if it is, pointing only to neighboring nodes in the same strongly-connected component. Thus there exists a path between X and Y that contains a collider in an_G({Z}) that is not in an_G(S). If Z ∈ an_G(S), this would be a contradiction. If Z ∈ an_G(X), then we can consider the walk between X and Y obtained by composing the subpath of the original path between Y and the first collider (starting from Y) in an_G({Z}) \ an_G(S) with a directed path to Z and then on to X, without passing through nodes in S.
This walk between X and Y must be open when conditioning on S, and hence there exists a path between X and Y that is open when conditioning on S, a contradiction. Similarly, we obtain a contradiction if Z ∈ an_G(Y).

On the other hand, minimal separations imply the presence of certain ancestral relations:

Lemma 27 Let {X}, {Y}, S, Z ⊆ V be mutually disjoint sets of nodes in a DMG G = ⟨V, E, F⟩. For both d-separation (⊥_d) and σ-separation (⊥_σ), we have that:

  X ⊥_G Y | S ∪ [Z]  ⟹  Z ⊆ an_G({X, Y} ∪ S).

Proof Let Q ⊊ Z. Consider a path between X and Y that is open when conditioning on S ∪ Q, but becomes blocked when conditioning on S ∪ Z. For d-separation, this means that (i) every collider on the path is in an_G(S ∪ Q), (ii) no non-collider is in S ∪ Q, and (iii) the path contains a non-collider in S ∪ Z. For σ-separation, this means that (i) every collider on the path is in an_G(S ∪ Q), (ii) every non-collider is either not in S ∪ Q or, if it is, points to a neighboring node on the path in another strongly-connected component, and (iii) the path contains a non-collider in S ∪ Z that points to a neighboring node on the path in another strongly-connected component. In both cases, we have that (i) every collider on the path is in an_G(S ∪ Q) and (ii) the path contains a non-collider in Z \ Q. Consider a maximal directed subpath of the path starting at a non-collider U in Z \ Q and stopping at a collider or at an end node. Then U ∈ an_G({X, Y} ∪ S ∪ Q). So, for each Q ⊊ Z, there exists a U ∈ Z \ Q with U ∈ an_G({X, Y} ∪ S ∪ Q). Thus, for every Z_i ∈ Z, we either obtain (taking Q = Z \ {Z_i}) an ancestral relation of the form Z_i ∈ an_G({X, Y} ∪ S), or, otherwise, at least Z_i ∈ an_G(Z_j) for some Z_j ∈ Z \ {Z_i}.
Define a directed graph A with nodes Z ∪ {ω} (where ω represents {X, Y} ∪ S) and add an edge Z_i → ω whenever our construction yields an ancestral relation of the form Z_i ∈ an_G({X, Y} ∪ S), or otherwise, an edge Z_i → Z_j if our construction yields Z_i ∈ an_G(Z_j). Then, taking the transitive closure of the constructed directed graph A and using transitivity of ancestral relations, for any Z_i ∈ Z we either obtain Z_i ∈ an_G({X, Y} ∪ S), or Z_i lies in some strongly-connected component C ⊆ Z in A. In the latter case, we can apply the reasoning above (taking now Q = Z \ C) to conclude that there exists a Z_j ∈ C with Z_j ∈ an_G({X, Y} ∪ S) or Z_j ∈ an_G(C′), where C′ ⊆ Z is another strongly-connected component. Since the strongly-connected components of Z form an acyclic structure, repeating this reasoning a finite number of times, we ultimately conclude that Z_i ∈ an_G({X, Y} ∪ S).

An implication of this is that the intersection of all sets that separate a node X from a node Y can only consist of ancestors of X or Y:

Proposition 28 Let X, Y ∈ V be different nodes in a DMG G = ⟨V, E, F⟩. For d-separation (⊥_d) or σ-separation (⊥_σ), consider

  Z* := ⋂ {Z ⊆ V : X ∉ Z, Y ∉ Z, X ⊥_G Y | Z}.

Then Z* ⊆ an_G({X, Y}).

Proof First, note that Z* = ⋂ {Z ⊆ V : X ∉ Z, Y ∉ Z, X ⊥_G Y | [Z]}. From Lemma 27, X ⊥_G Y | [Z] implies Z ⊆ an_G({X, Y}). Hence Z* ⊆ an_G({X, Y}).

Appendix B. Soundness, Consistency and Completeness Properties of FCI-JCI

In this appendix we formulate and prove various results concerning the soundness and completeness of FCI-JCI variants.
B.1 Preliminaries on MAGs and PAGs

We start by summarizing the basic definitions and results from the theory of maximal ancestral graphs and partial ancestral graphs (Spirtes et al., 2000; Richardson and Spirtes, 2002; Zhang, 2006, 2008a,b) that we will need.

B.1.1 Directed Maximal Ancestral Graphs

Richardson and Spirtes (2002) introduced a class of graphs known as maximal ancestral graphs (MAGs). The general formulation of MAGs allows for undirected edges, which are useful when modeling selection bias, but here we will only use directed maximal ancestral graphs without undirected edges (sometimes abbreviated as DMAGs in the literature), as we assume for simplicity that there is no selection bias. In order to define a directed maximal ancestral graph, we need the notion of an inducing path.

Definition 29 Let G = ⟨V, E, F⟩ be an acyclic directed mixed graph (ADMG). An inducing path between two nodes u, v ∈ V is a path in G between u and v on which every node (except for the end nodes) is a collider on the path and an ancestor in G of an end node of the path.

We can now state:

Definition 30 A directed mixed graph G = ⟨V, E, F⟩ is called a directed maximal ancestral graph (DMAG) if all of the following conditions hold:
1. Between any two different nodes there is at most one edge, and there are no self-cycles;
2. The graph contains no directed or almost directed cycles ("ancestral");
3. There is no inducing path between any two non-adjacent nodes ("maximal").

Given an ADMG, we can define a corresponding DMAG (Richardson and Spirtes, 2002):

Definition 31 Let G = ⟨V, E, F⟩ be an ADMG.
The directed maximal ancestral graph induced by G is denoted DMAG(G) and is defined as DMAG(G) = ⟨Ṽ, Ẽ, F̃⟩ such that Ṽ = V and, for each pair u, v ∈ V with u ≠ v, there is an edge in DMAG(G) between u and v if and only if there is an inducing path between u and v in G; in that case, the edge in DMAG(G) connecting u and v is:

  u → v if u ∈ an_G(v),
  u ← v if v ∈ an_G(u),
  u ↔ v if u ∉ an_G(v) and v ∉ an_G(u).

An important property of the induced DMAG is that it preserves all ancestral and non-ancestral relations. More precisely, for two nodes u, v in an ADMG G: u ∈ an_G(v) if and only if u ∈ an_{DMAG(G)}(v). Another important property of the induced DMAG is that it preserves all d-separations. Indeed,

  A ⊥^d_{DMAG(G)} B | C  ⟺  A ⊥^d_G B | C  for all A, B, C ⊆ V.

We sometimes identify a DMAG H with the set of ADMGs G such that DMAG(G) = H. For an acyclic SCM M, we define its induced DMAG as DMAG(M) := DMAG(G(M)). For a directed maximal ancestral graph H, define its independence model to be

  IM(H) := {⟨A, B, C⟩ : A, B, C ⊆ V, A ⊥^d_H B | C},

i.e., the set of all d-separations entailed by the DMAG. For a simple SCM M with endogenous index set ℐ and distribution P_M(X), we define its independence model to be

  IM(M) := {⟨A, B, C⟩ : A, B, C ⊆ ℐ, X_A ⊥⊥_{P_M} X_B | X_C},

i.e., the set of all (conditional) independences that hold in its (observational) distribution. If M is acyclic, then by the Markov property, IM(M) ⊇ IM(DMAG(M)); the faithfulness assumption then means that IM(M) ⊆ IM(DMAG(M)).

B.1.2 Directed Partial Ancestral Graphs

Since in many cases the true DMAG is unknown, it is often convenient when performing causal reasoning to be able to represent a set of hypothetical DMAGs in a compact way.
F or this purp ose, p artial anc estr al gr aphs (P AGs) ha v e b een in tro duced (Zhang, 2006). Again, since w e are assuming no selection bias for simplicity , we will only discuss dir e cte d P A Gs (that is, P AGs without undirected or circle-tail edges, i.e., edges of the form {− − , − − ◦ , ◦ − −} ). Definition 32 We c al l a mixe d gr aph G = hV , E i with no des V and e dges E of the typ es {→ , ← , ← ◦ , ↔ , ◦ − ◦ , ◦ →} a directed partial ancestral graph (DP A G) if: 1. Betwe en any two differ ent no des ther e is at most one e dge, and ther e ar e no self-cycles; 2. The gr aph c ontains no dir e cte d or almost dir e cte d cycles (“anc estr al”); 3. Ther e is no inducing p ath b etwe en any two non-adjac ent no des (“maximal”). Giv en a DMAG or DP A G, its induced skeleton is an undirected graph with the same no des and with an edge b et w een an y pair of no des if and only if the t wo no des are adjacen t in the DMA G or DP A G. W e often identify a DP A G with the set of all DMAGs that ha v e the same sk eleton as the DP AG, ha ve an arrowhead (tail) on each edge mark for which the DP AG has an arrowhead (tail) at that corresponding edge mark, and for each circle in the DP A G, ha ve either an arrowhead or a tail at the corresp onding edge mark. Hence, the circles in a DP AG can b e thought of as to represent either an arrowhead or a tail. 92 Joint Ca usal Inference from Mul tiple Contexts W e extend the definitions of (directed) walks, (directed) paths and colliders for directed mixed graphs to apply also to DP A Gs. Edges of the form i ← j, i ← ◦ j, i ↔ j are called into i , and similarly , edges of the form i → j, i ◦ → j, i ↔ j are called into j . Edges of the form i → j and j ← i are called out of i . In addition, we define: Definition 33 A p ath v 0 , e 1 , v 1 , . . . , v n b etwe en no des v 0 and v n in a DP AG G = hV , E i is c al le d a p ossibly directed path from v 0 to v n if for e ach i = 1 , . . . 
, n, the edge e_i between v_{i−1} and v_i is not into v_{i−1} (i.e., is of the form v_{i−1} ◦−◦ v_i, v_{i−1} ◦→ v_i, or v_{i−1} → v_i). The path is called uncovered if every subsequent triple is unshielded, i.e., v_i and v_{i−2} are not adjacent in G for i = 2, . . . , n.

If H_1 and H_2 are DMAGs, then we call them Markov equivalent if IM(H_1) = IM(H_2). One can show that this implies that H_1 and H_2 must have the same skeleton and the same unshielded colliders. The FCI algorithm maps the independence model IM(H) of a DMAG H to a DPAG P. Zhang (2008a) showed that FCI is sound and complete, which means that:

• P has the same skeleton as H;
• As a set of DMAGs, P contains H and all Markov equivalent DMAGs;
• For every circle edge mark in P, there exists a DMAG in P Markov equivalent to H that has a tail at the corresponding edge mark, and there exists a DMAG in P Markov equivalent to H that has an arrowhead at the corresponding edge mark.

We will denote the completely oriented directed partial ancestral graph that contains the Markov equivalence class of a DMAG H by CDPAG(H). For an acyclic SCM M we will denote its corresponding CDPAG representation as CDPAG(M) := CDPAG(DMAG(G(M))). We will make use of the notion of (in)visible edges in a DMAG (Zhang, 2008b):

Definition 34 A directed edge i → j in a DMAG is said to be visible if there is a node k not adjacent to j, such that either there is an edge between k and i that is into i, or there is a collider path between k and i that is into i and every collider on the path is a parent of j. Otherwise i → j is said to be invisible.

We will use the same notion in a DPAG, but call it definitely visible (and its negation possibly invisible). If a directed edge in a DPAG is definitely visible, it must be visible in all DMAGs in the DPAG.
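Definition 33 is easy to operationalize. As an illustration (the edge-mark encoding below is our own, not the paper's), the following sketch checks whether a given node sequence forms an uncovered possibly directed path in a DPAG.

```python
# Illustrative sketch (hypothetical representation): a DPAG is stored as
# a dict mapping ordered pairs (u, v) to the pair of edge marks
# (mark at u, mark at v), with marks in {'-', '>', 'o'}. Since directed
# PAGs contain no circle-tail or undirected edges, an edge is "not into"
# its earlier endpoint exactly when the mark there is not an arrowhead.

def add_edge(dpag, u, v, mark_u, mark_v):
    dpag[(u, v)] = (mark_u, mark_v)
    dpag[(v, u)] = (mark_v, mark_u)

def is_possibly_directed(dpag, path):
    """Is the path possibly directed from path[0] to path[-1]? (Def. 33)"""
    for u, v in zip(path, path[1:]):
        mark_u, _ = dpag[(u, v)]
        if mark_u == '>':  # edge is into u, so the path is not possibly directed
            return False
    return True

def is_uncovered(dpag, path):
    """Is every consecutive triple on the path unshielded?"""
    for a, c in zip(path, path[2:]):
        if (a, c) in dpag or (c, a) in dpag:
            return False
    return True

# Example: X0 o-o X1 o-> X2, with X0 and X2 non-adjacent.
dpag = {}
add_edge(dpag, 'X0', 'X1', 'o', 'o')
add_edge(dpag, 'X1', 'X2', 'o', '>')
print(is_possibly_directed(dpag, ['X0', 'X1', 'X2']))  # True
print(is_uncovered(dpag, ['X0', 'X1', 'X2']))          # True
```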
B.2 Soundness and Consistency of FCI-JCI

We are now equipped to prove the soundness (and for some cases, completeness) of FCI-JCI, the adaptation of FCI that incorporates the JCI background knowledge that we introduced in Section 4.2.4. First, the soundness of FCI-JCI is easy to prove by checking that the soundness of the FCI orientation rules is not invalidated by the JCI background knowledge.

Theorem 35 Let M be an acyclic SCM that satisfies JCI Assumption 0 and a subset of JCI Assumptions 1, 2, 3. Suppose that its distribution P_M(X, C) is faithful w.r.t. the graph G(M). With input IM(M), and with the right JCI Assumptions, FCI-JCI outputs a DPAG that contains DMAG(M).

Proof First note that the skeleton obtained by FCI-JCI must coincide with that of DMAG(M) (as it would for standard FCI). Indeed, if JCI Assumption 3 is made, the context nodes are all adjacent in DMAG(M) by assumption. For all other edges, and also for edges between context nodes if JCI Assumption 3 is not made, the edge is in the skeleton found by FCI-JCI if and only if it is in the skeleton of DMAG(M), for the same reason as for standard FCI. Furthermore, one can easily see that FCI rule R0 is still sound and will not conflict with the application of the background knowledge stemming from the JCI Assumptions. The extra orientation rules that incorporate the JCI background knowledge are easily seen to be sound themselves, as they just impose additional features on the DPAG that are satisfied by DMAG(M) by assumption. By checking the soundness proofs of each of the standard FCI orientation rules R1–R4 and R8–R10 in Zhang (2006), it is obvious that all these rules are sound when applied to any DPAG as long as (i) it contains the true DMAG and (ii) FCI rule R0 has been completely applied, i.e., all unshielded colliders have been oriented as such.
Rules R5–R7 are not needed since we assumed no selection bias. Hence all the rules applied by FCI-JCI are sound (as long as they are applied in the prescribed ordering), and hence the final DPAG must contain the true DMAG(M).

In general, soundness of a constraint-based causal discovery algorithm implies consistency of the algorithm when using appropriate conditional independence tests.

Lemma 36 If a conditional independence test, including the choice of the sample-size dependent threshold (to decide between the null and alternative hypothesis), is consistent, then any sound constraint-based causal discovery algorithm based on the test is asymptotically consistent.

Proof Consistency of the conditional independence test means that for any distribution, the probability of a Type I or Type II error converges to 0 as sample size N → ∞. Since the number of tests is finite for a fixed number of variables, and the number of possible predictions made by the algorithm is finite, the probability of any error then converges to 0. Any constraint-based algorithm that is sound (i.e., that would return correct answers when using an independence oracle, including the possible answer "unknown") is therefore asymptotically consistent if it makes use of that conditional independence test.

As an example of a consistent test, Kalisch and Bühlmann (2007) provide a choice of the threshold for the standard partial correlation test that ensures asymptotic consistency of the test under the assumption that the distribution is multivariate Gaussian. Another example of a distribution-free and even strongly-consistent conditional independence test is proposed by Győrfi and Walk (2012).
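To make the role of the sample-size dependent threshold concrete, here is a sketch of a Fisher z partial-correlation test with a shrinking significance level. This is in the spirit of, but not identical to, the Kalisch and Bühlmann (2007) construction; the Gaussianity assumption is inherited from that setting, and the decay rate chosen for α_N below is purely illustrative.

```python
# Sketch: partial-correlation independence test with a sample-size
# dependent significance level alpha_N -> 0 (illustrative rate only,
# not the exact schedule of Kalisch and Buhlmann, 2007).
from math import log, sqrt
from statistics import NormalDist

def fisher_z_independent(r, n, cond_set_size, alpha_n):
    """Decide 'independent' from an estimated (partial) correlation r,
    sample size n, conditioning-set size |Z|, and significance level alpha_n."""
    z = 0.5 * log((1 + r) / (1 - r))           # Fisher z-transform
    stat = sqrt(n - cond_set_size - 3) * abs(z)
    threshold = NormalDist().inv_cdf(1 - alpha_n / 2)
    return stat <= threshold                    # True: accept independence

def alpha_schedule(n):
    """A shrinking level alpha_N -> 0 as N -> infinity (illustrative rate;
    consistency requires the decay to be suitably slow)."""
    return min(0.05, 1.0 / sqrt(n))

# A fixed strong partial correlation is eventually rejected as dependence,
# while a tiny one still passes as independence at the same sample size.
print(fisher_z_independent(0.5, 1000, 2, alpha_schedule(1000)))   # False
print(fisher_z_independent(0.01, 1000, 2, alpha_schedule(1000)))  # True
```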
In general, when using the p-value as a test statistic, one should choose the sample-size dependent threshold α_N (where the p-value of the test result is used to decide "dependence" if p ≤ α_N and "independence" otherwise) in such a way that α_N → 0 as N → ∞ at a suitable rate.

B.3 Completeness of FCI-JCI

Regarding completeness, we currently only know how to prove the completeness of the variants FCI-JCI0 and FCI-JCI123. In particular, we do not know whether FCI-JCI1 is complete. The completeness of FCI-JCI0 is obvious, because FCI-JCI0 reduces to the standard FCI algorithm without additional background knowledge.

Theorem 37 Let M be an acyclic SCM that satisfies JCI Assumption 0. Suppose that its distribution P_M(X, C) is faithful w.r.t. the graph G(M). With input IM(M), the output of FCI-JCI0 is a CDPAG in which all edge marks that can possibly be identified from IM(M) have been oriented.

Proof Follows immediately from the completeness of FCI (Zhang, 2008a) under the additional assumption of no selection bias.

Proving the completeness of FCI-JCI123 is more work. The proof strategy is to introduce additional variables that mimic the JCI background knowledge. We can then apply the completeness results for the standard FCI algorithm (Zhang, 2008a).

Theorem 38 Let M be an acyclic SCM that satisfies JCI Assumptions 0, 1, 2, 3. Suppose that its distribution P_M(X, C) is faithful w.r.t. the graph G(M). With input IM(M), the output of FCI-JCI123 is a DPAG in which all edge marks that can possibly be identified from IM(M) and the JCI background knowledge have been oriented.

Proof The adjacency phase (skeleton search) of FCI-JCI123 and the orientation of unshielded triples by applying FCI rule R0 are both sound, as we have seen in Theorem 35.
Furthermore, the skeleton and unshielded colliders found by FCI-JCI123 will be the same as those found by standard FCI (in particular, note that FCI would not orient any unshielded colliders on a context node, since the true DMAG(M) does not have these).³³ Before continuing with the FCI orientation rules R1–R4 and R8–R10, FCI-JCI123 now uses the JCI background knowledge to orient the following edges:

• k ↔ k′ for all k ≠ k′ ∈ K;
• k → i for k ∈ K, i ∈ I if k and i are adjacent.

After this background orientation step, the only edges in the skeleton that remain to be (further) oriented are the ones connecting two system variables. Denote the DPAG identified by FCI-JCI123 so far by P. Each DMAG H with IM(H) = IM(M) that satisfies JCI Assumptions 0, 1, 2, 3 must be contained in P. Consider any such DMAG H. We can extend it to a DMAG H*, defined over an extended set of variables I ∪̇ K ∪̇ {r} ∪̇ K̄, where K̄ := {k̄ : k ∈ K} is a copy of K, by adding edges r → k for all k ∈ K, adding edges k̄ → k for all k ∈ K, and removing all bidirected edges k ↔ k′ for all k ≠ k′ ∈ K (see also Figure 43). By construction, the marginal DMAG of H* on I ∪ K is H.³⁴

33. If we were to use the results of statistical conditional independence tests on a finite data sample, then there could be differences between the DPAGs constructed by FCI and FCI-JCI123 after these stages.
34. For the notion of marginal DMAG, see Richardson and Spirtes (2002).

If we run FCI on IM(H*), then the first stages of the algorithm yield a DPAG P* with:
• the skeleton of H*, which equals the skeleton of P together with the additionally constructed edges (r ∗−∗ k for all k ∈ K, and k̄ ∗−∗ k for all k ∈ K) but without the edges between the context variables (k ∗−∗ k′ for k ≠ k′ ∈ K);
• the unshielded colliders in P plus the additionally constructed unshielded colliders (r ∗→ k ←∗ k̄ for all k ∈ K), which are all identified by rule R0;
• the arrowheads identified by rule R1 that could also be found in P, plus the edge orientations k → i for k ∈ K, i ∈ I ∩ ch_{G(M)}(k) that are obtained from rule R1.

This means that the subgraph of DPAG P (obtained by FCI-JCI123 so far) induced on the system nodes I is identical to the subgraph of DPAG P* (obtained by FCI so far) induced on the system nodes I. In addition, all pairs of a system node i ∈ I and a context node k ∈ K are identically connected in the two DPAGs. By examining rules R1–R4 and R8–R10 of FCI (which are used without modifications in FCI-JCI123) in detail, one can check that they will perform exactly the same edge mark orientations on P as on P*. For rules R2, R3 and R8–R10 this is obvious, because the only subsets of nodes that play a role in those rules necessarily must be in I. For rules R1 and R4 the situation is only slightly more complicated: a single node appearing in those rules can be in I ∪ K, while all others must be in I. Hence each of these rules is applicable to some tuple of nodes in P if and only if it is applicable to the same tuple of nodes in P*. Hence, the final DPAG obtained by FCI-JCI123 from IM(M) and the final DPAG obtained by FCI from IM(H*) induce identical subgraphs on the system nodes I.
Thus, if all DMAGs H in IM(M) that satisfy JCI Assumptions 0, 1, 2, 3 have a certain invariant edge mark on an edge between two system variables, then all extended DMAGs H* must have the same invariant edge mark. All these extended DMAGs H* must be Markov equivalent, since FCI arrives at the same CDPAG for all IM(H*). Now suppose FCI-JCI123 left an edge mark on an edge between two system variables unoriented. Then FCI must also leave the corresponding edge mark unoriented when it is run on IM(DMAG(M)*). This means that there must exist DMAGs that are Markov equivalent to DMAG(M)* that have an arrowhead at that spot, but also DMAGs that are Markov equivalent to DMAG(M)* that have a tail at that spot. Marginalizing those DMAGs down to I ∪ K gives DMAGs that are Markov equivalent to DMAG(M), satisfy JCI Assumptions 0, 1, 2, 3, and have an arrowhead respectively a tail at that spot. This means that all edge marks between system variables that could possibly be oriented have been oriented by FCI-JCI123. This completes the proof of arrowhead and tail completeness of FCI-JCI123.

B.4 Reading off Definite (Non-)Ancestors From a DPAG

Zhang (2006) conjectured the soundness and completeness of a criterion to read off definite ancestral relations from a CDPAG. Roumpelaki et al. (2016) proved soundness of this criterion.³⁵ We will need a slightly stronger result (with a similar proof) for DPAGs:

35. Roumpelaki et al. (2016) also claim to have proved completeness, but their proof is flawed: the last part of the proof, which aims to prove that u, v are non-adjacent, appears to be incomplete.
Figure 43: (a) Example DMAG, over context nodes C_α, C_β, C_γ and system nodes X_0, . . . , X_4, satisfying JCI Assumptions 1, 2, 3, and (b) corresponding extended DMAG with additional variables C_r, C_ᾱ, C_β̄, C_γ̄, as used in the proof of Theorem 38.

Proposition 39 Let M be an acyclic SCM. Let P be a DPAG that contains DMAG(M), and in which all unshielded colliders in DMAG(M) have been oriented. For two nodes i, j ∈ P: If

• there is a directed path from i to j in P, or
• there exist uncovered possibly directed paths from i to j in P of the form i, u, . . . , j and i, v, . . . , j such that u, v are non-adjacent nodes in P,

then i causes j according to M, i.e., i ∈ an_{G(M)}(j).

Proof First, if there is a directed path from i to j in P, it must be in any DMAG in P; hence there must be a directed path from i to j in DMAG(M) as well. Therefore i ∈ an_{G(M)}(j). Second, assume that there exist uncovered possibly directed paths from i to j in P of the form i, u, . . . , j and i, v, . . . , j such that u, v are non-adjacent in P. If DMAG(M) has i → u, the path i, u, . . . , j must actually correspond to a directed path in DMAG(M), because otherwise it would contain unshielded colliders that were not oriented, contradicting the assumptions. If DMAG(M) has i ←∗ u instead, it must have i → v to avoid an unshielded collider u ∗→ i ←∗ v that was not oriented, and hence must have a directed path i, v, . . . , j. In both cases, DMAG(M) must have a directed path from i to j, and hence i ∈ an_{G(M)}(j).

Zhang (2006, p. 137) provides a sound and complete criterion to read off definite non-ancestors from a CDPAG. It is easy to prove the soundness of the criterion also for (arbitrary) DPAGs:

Proposition 40 Let M be an acyclic SCM. Let P be a DPAG that contains DMAG(M).
For two nodes i, j ∈ P: if there is no possibly directed path from i to j in P, then i ∉ an_{G(M)}(j).

Figure 44: Example to illustrate that directed edges in the DPAG obtained by FCI-JCI123 do not necessarily correspond with a direct cause. Panels (a)–(c) each show the three nodes C, X_1, X_2: (a) graph G(M), satisfying JCI Assumptions 1, 2, 3; (b) corresponding DMAG(M); (c) corresponding DPAG P output by FCI-JCI123. The directed edge C → X_2 in P identified by FCI-JCI123 does not correspond with a direct causal effect of C on X_2 (note that there is no directed edge C → X_2 in G(M)).

Proof If i ∈ an_{G(M)}(j), then there is a directed path from i to j in DMAG(M). Since P contains DMAG(M), this must correspond with a possibly directed path from i to j in P.

These two propositions allow us to read off (a subset of) the ancestral and non-ancestral relations that are identifiable from the conditional independences in the joint distribution and the JCI background knowledge (if applicable) from the DPAGs output by the various FCI variants (FCI-JCI123, FCI-JCI1, FCI-JCI0, FCI).

B.5 Discovering Direct Intervention Targets with FCI-JCI123

One of the features of FCI-JCI123 is that it allows one to read off direct intervention targets from the DPAG that it outputs. Naïvely interpreting a directed edge k → i from context node k ∈ K to system node i ∈ I in the DPAG output by FCI-JCI123 as meaning that k directly targets i is incorrect, as can be seen from the example in Figure 44. Here we propose a provably correct (but possibly incomplete) procedure.

We will first consider how to read direct intervention targets from DMAGs before applying this to DPAGs.

Lemma 41 Let M be an acyclic SCM that satisfies JCI Assumptions 0, 1, 2, 3. For k ∈ K, i ∈ I: if

• k → i in DMAG(M), and
• for all nodes j in DMAG(M) s.t.
k → j → i in DMAG(M), the edge j → i is visible,

then k is a direct cause of i according to M, i.e., k → i ∈ G(M).

Proof Because the edge k → i is present in DMAG(M), there must be an inducing path between k and i in G(M). This must be a collider path into i where each collider is an ancestor of i. First suppose that the path consists of more than a single edge. Denote its first collider (the one adjacent to k) by j. If the first edge on the path were into k, then it must be k ↔ j and j must be a context node (because of JCI Assumptions 1, 2). Similarly, all subsequent nodes on the inducing path (except for the final node i) must be collider nodes and hence in K. But then the final edge is between a context node and system node i and into the context node, contradicting JCI Assumption 1 or 2. Hence the first edge on the inducing path must be k → j. The same edge k → j must then occur in DMAG(M). The remainder of the inducing path is actually an inducing path between j and i that is into j. By Zhang (2008b, Lemma 9), j → i is in DMAG(M) and it is invisible. This contradicts the assumption. Therefore the inducing path in G(M) between k and i must consist of a single edge. This must be out of k because of JCI Assumptions 1, 2, and is thus necessarily of the form k → i. Hence k → i is in G(M).

The following result enables us to read off direct intervention targets from the DPAG output by FCI-JCI123.

Proposition 42 Let M be an acyclic SCM that satisfies JCI Assumptions 0, 1, 2, 3. Suppose that its distribution P_M(X, C) is faithful w.r.t. the graph G(M). Let P be the DPAG output by FCI-JCI123 with input IM(M). Let k ∈ K, i ∈ I.

• If k is not adjacent to i in P, then k is not a direct cause of i according to M, i.e., k → i ∉ G(M).
• If:
  1. k → i in P, and
  2. for all system nodes j ∈ I s.t.
k → j in P and j ◦−◦ i or j ◦→ i or j → i in P, the edge j → i is definitely visible in the DPAG obtained from P by replacing the edge between j and i by j → i,

  then k is a direct cause of i according to M, i.e., k → i ∈ G(M).

Proof Because FCI-JCI123 is sound (Theorem 38), P contains DMAG(M). For the first statement: if k is not adjacent to i in P, then the two nodes are not adjacent in any DMAG in P, and in particular not in DMAG(M). This means that k → i ∉ G(M), because otherwise k → i would be in DMAG(M), a contradiction. The second statement follows from Lemma 41 and from JCI Assumptions 1, 2, 3.

If the context variables represent interventions, then this allows us to learn (a subset of) the direct targets and non-targets of each intervention. While it is easy to see that this criterion is sound, we do not know whether it is complete.

Appendix C. ASD: Accounting for Strong Dependencies

The causal discovery and reasoning algorithm that we refer to as ASD (Accounting for Strong Dependencies) used in this work is based on an algorithm proposed by Hyttinen et al. (2014) and extensions proposed by Magliacane et al. (2016b) and Forré and Mooij (2018). To make this paper more self-contained, we give here a short description of this algorithm, referring the reader to the original publications for more details.

Hyttinen et al. (2014) formulate causal discovery as an optimization problem where a loss function is minimized over possible causal graphs. Intuitively, the loss function can be thought of as measuring the amount of evidence against the hypothesis that the data was generated by an SCM with a particular graph. The loss function depends on the hypothetical causal graph and on a list of input statements.
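Before the formal definitions, the idea can be sketched in a few lines. The following toy illustration uses our own interfaces, not the answer-set-programming implementation of Hyttinen et al. (2014): the loss of a candidate graph sums the absolute weights of the violated input statements, and a feature's confidence compares the best achievable loss without and with the feature.

```python
# Toy sketch of the ASD loss and feature-confidence scoring (hypothetical
# interfaces; a real implementation tests d-separation in each candidate
# graph, e.g. via answer set programming as in Hyttinen et al., 2014).

def asd_loss(entails_independence, statements):
    """Sum |lambda_j| over the input statements violated by the graph.
    A positive-weight statement is violated if the graph does NOT entail
    the independence; a negative-weight one is violated if it does."""
    loss = 0.0
    for (a, b, zs, lam) in statements:
        if entails_independence(a, b, zs) != (lam > 0):
            loss += abs(lam)
    return loss

def asd_confidence(graphs, feature, separation_oracle, statements):
    """Confidence for a Boolean feature: optimal loss with the feature
    forced absent minus optimal loss with it forced present."""
    loss_without = min(asd_loss(separation_oracle(g), statements)
                       for g in graphs if not feature(g))
    loss_with = min(asd_loss(separation_oracle(g), statements)
                    for g in graphs if feature(g))
    return loss_without - loss_with

# Tiny example over two candidate graphs on nodes {0, 1}: no edge vs. 0 -> 1.
# The marginal-independence oracle here is trivial: adjacency = dependence.
graphs = [frozenset(), frozenset({(0, 1)})]
oracle = lambda g: lambda a, b, zs: (a, b) not in g and (b, a) not in g
statements = [(0, 1, frozenset(), -2.0)]   # evidence that 0 and 1 are dependent
has_edge = lambda g: (0, 1) in g
print(asd_confidence(graphs, has_edge, oracle, statements))  # 2.0
```

A positive confidence (as here) indicates that there is less evidence against the feature's presence than against its absence.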
For the purely observational case, the input consists of a list S = ((a_j, b_j, Z_j, λ_j))_{j=1}^n of weighted conditional independence statements. Here, the weighted statement (a_j, b_j, Z_j, λ_j), with {a_j}, {b_j}, Z_j disjoint sets of endogenous variable indices and λ_j ∈ R̄ := R ∪ {−∞, +∞}, encodes that the conditional independence X_{a_j} ⊥⊥ X_{b_j} | X_{Z_j} holds with "confidence" λ_j, where a finite value of λ_j gives a "soft constraint" and a value of λ_j = ±∞ imposes a "hard constraint". Positive weights encode that we have empirical support in favor of the independence, whereas negative weights encode empirical support against the independence (in other words, in favor of dependence). The loss function simply sums the absolute weights of all the input statements that would be violated if the true causal graph were the hypothetical one:

    L(G, S) := Σ_{(a_j, b_j, Z_j, λ_j) ∈ S} λ_j (1[λ_j > 0] − 1[a_j ⊥_G b_j | Z_j]),

where 1[·] is the indicator function. While the original implementation by Hyttinen et al. (2014) makes use of d-separation, Forré and Mooij (2018) show how this can be modified for σ-separation. Causal discovery can now be formulated as the optimization problem:

    G* = argmin_{G ∈ G(I)} L(G, S),    (7)

where G(I) denotes the set of all possible causal graphs with nodes I (ADMGs in the acyclic case, and DMGs in the cyclic case). The optimization problem (7) may have multiple minima, for example because the underlying causal graph is not identifiable from the inputs. Nonetheless, some of the features of the causal graph (e.g., the presence or absence of a certain directed edge) may still be identifiable. Let f : G(I) → {0, 1} be a feature, i.e., a Boolean function of the causal graph G. We employ the method proposed by Magliacane et al.
(2016b) for scoring the confidence that feature f is present in the causal graph by calculating the difference between the optimal losses under the additional hard constraints that the feature f is present vs. that the feature f is absent in the causal graph:

    C(f, S) := min_{G ∈ G(I) : ¬f(G)} L(G, S) − min_{G ∈ G(I) : f(G)} L(G, S).    (8)

This confidence is positive if there is less evidence against its presence than against its absence, negative if there is less evidence against its absence than against its presence, and vanishes if there is as much evidence against its presence as there is against its absence. As features, we can consider for example the presence of a direct causal relation, the presence of an (ancestral) causal relation, and the presence of a latent confounder. Magliacane et al. (2016b) showed that this scoring method is sound for oracle inputs.

Theorem 43 For any feature f : G(I) → {0, 1}, the ASD confidence score C(f, S) of (8) is sound and complete for oracle inputs with infinite weights. In other words, C(f, S) = ∞ if f is identifiable from the inputs, C(f, S) = −∞ if ¬f is identifiable from the inputs, and C(f, S) = 0 if f is unidentifiable from the inputs.

Additionally, Magliacane et al. (2016b) showed that the scoring method is asymptotically consistent under a consistency condition on the weights that encode the confidence of conditional (in)dependence.

Theorem 44 Assume that the weights are asymptotically consistent, meaning that

    λ_j →P −∞ if X_{a_j} ⊥⊥ X_{b_j} | X_{Z_j} does not hold, and λ_j →P +∞ if it does,    (9)

(where →P means convergence in probability) as the number of samples N → ∞.
Then for any feature f : G(I) → {0, 1}, the ASD confidence score C(f, S) of (8) is asymptotically consistent, i.e., C(f, S) →P ∞ if f is identifiably true, C(f, S) →P −∞ if f is identifiably false, and C(f, S) →P 0 otherwise.

In our experiments, we used the weights proposed in Magliacane et al. (2016b): λ_j = log p_j − log α, where p_j is the p-value of a statistical test with independence as null hypothesis, and α is a significance level (e.g., 1%). These weights have the desirable property that independences typically get a smaller absolute weight than strong dependencies. This leads to the strong dependencies dominating the loss function, which explains the acronym ASD (Accounting for Strong Dependencies) that we use here to describe this method. By choosing a sample-size dependent threshold α_N such that α_N → 0 as N → ∞ at a suitable rate, these weights may become asymptotically consistent. Kalisch and Bühlmann (2007) provide a choice of α_N for partial correlation tests that ensures asymptotic consistency under the assumption that the distribution is multivariate Gaussian. Another possibility for obtaining consistent weights would be to base them on the distribution-free and strongly-consistent conditional independence test proposed by Győrfi and Walk (2012).

References

R. Ayesha Ali, Thomas S. Richardson, and Peter Spirtes. Markov equivalence for ancestral graphs. The Annals of Statistics, 37(5B):2808–2837, 2009.

Elias Bareinboim and Judea Pearl. A general algorithm for deciding transportability of experimental results. Journal of Causal Inference, 1:107–134, 2013.

Tineke Blom, Anna Klimovskaia, Sara Magliacane, and Joris M. Mooij. An upper bound for random measurement error in causal discovery. In Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2018), 2018.
Stephan Bongers, Patrick Forré, Jonas Peters, Bernhard Schölkopf, and Joris M. Mooij. Foundations of structural causal models with cycles and latent variables. arXiv.org preprint, arXiv:1611.06221v3 [stat.ME], May 2020. URL https://arxiv.org/abs/1611.06221v3.

Lin S. Chen, Frank Emmert-Streib, and John D. Storey. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 8(10):R219, 2007.

David M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.

Tom Claassen and Tom Heskes. Causal discovery in multiple models from different experiments. In Advances in Neural Information Processing Systems 23 (NIPS 2010), pages 415–423, Vancouver, British Columbia, Canada, 2010.

Tom Claassen and Tom Heskes. A logical characterization of constraint-based causal discovery. In Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 135–144, 2011.

Tom Claassen, Joris M. Mooij, and Tom Heskes. Learning sparse causal models is not NP-hard. In Ann Nicholson and Padhraic Smyth, editors, Proceedings of the 29th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2013), pages 172–181. AUAI Press, 2013.

Diego Colombo and Marloes H. Maathuis. Order-independent constraint-based causal structure learning. The Journal of Machine Learning Research, 15(1):3741–3782, 2014.

Gregory F. Cooper. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1(2):203–224, 1997.

Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.

Gregory F. Cooper and Changwon Yoo.
Causal discovery from a mixture of experimental and observational data. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI 1999), pages 116–125, Stockholm, Sweden, 1999.

A. Philip Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society B, 41(1):1–31, 1979.

A. Philip Dawid. Influence diagrams for causal modelling and inference. International Statistical Review, 70(2):161–189, 2002.

Vanessa Didelez, A. Philip Dawid, and Sara Geneletti. Direct and indirect effects of sequential treatments. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), pages 138–146. AUAI Press, 2006.

Daniel Eaton and Kevin Murphy. Exact Bayesian structure learning from uncertain interventions. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico, 2007.

Doris Entner and Patrik O. Hoyer. On causal discovery from time series data using FCI. In Proceedings of the Fifth European Workshop on Probabilistic Graphical Models (PGM 2010), pages 121–128, 2010.

Ronald A. Fisher. Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, 1925.

Ronald A. Fisher. The Design of Experiments. Hafner, 1935.

Patrick Forré and Joris M. Mooij. Markov properties for graphical models with cycles and latent variables. arXiv.org preprint, arXiv:1710.08775 [math.ST], October 2017. URL https://arxiv.org/abs/1710.08775.

Patrick Forré and Joris M. Mooij. Constraint-based causal discovery for non-linear structural causal models with cycles and latent confounders. In Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2018), 2018.

Patrick Forré and Joris M. Mooij.
Causal calculus in the presence of cycles, latent confounders and selection bias. In Proceedings of the 35th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2019), 2019.

Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601–620, 2000.

Dan Geiger, Thomas Verma, and Judea Pearl. Identifying independence in Bayesian networks. Networks, 20(5):507–534, 1990.

Clive W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, 1969.

László Győrfi and Harro Walk. Strongly consistent nonparametric tests of conditional independence. Statistics & Probability Letters, 82:1145–1150, June 2012.

Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13:2409–2464, 2012.

David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.

Christina Heinze-Deml, Jonas Peters, and Nicolai Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2):20170016, 2018.

Antti Hyttinen, Frederick Eberhardt, and Patrik O. Hoyer. Learning linear cyclic causal models with latent variables. Journal of Machine Learning Research, 13:3387–3439, 2012.

Antti Hyttinen, Frederick Eberhardt, and Matti Järvisalo. Constraint-based causal discovery: Conflict resolution with answer set programming. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014), pages 340–349, Quebec City, Quebec, Canada, 2014.

Markus Kalisch and Peter Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm.
Journal of Machine Learning Research, 8:613–636, March 2007.

Markus Kalisch, Martin Mächler, Diego Colombo, Marloes H. Maathuis, and Peter Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1–26, 2012.

Yutaka Kano and Shohei Shimizu. Causal inference using nonnormality. In Proceedings of the International Symposium on Science of Modeling, the 30th Anniversary of the Information Criterion, pages 261–270, 2003.

Patrick Kemmeren, Katrin Sameith, Loes A. L. van de Pasch, Joris J. Benschop, Tineke L. Lenstra, Thanasis Margaritis, Eoghan O'Duibhir, Eva Apweiler, Sake van Wageningen, Cheuk W. Ko, Sebastiaan van Heesch, Mehdi M. Kashani, Giannis Ampatziadis-Michailidis, Mariel O. Brok, Nathalie A. C. H. Brabers, Anthony J. Miles, Diane Bouwmeester, Sander R. van Hooff, Harm van Bakel, Erik Sluiters, Linda V. Bakker, Berend Snel, Philip Lijnzaad, Dik van Leenen, Marian J. A. Groot Koerkamp, and Frank C. P. Holstege. Large-scale genetic perturbations reveal regulatory networks and an abundance of gene-specific repressors. Cell, 157(3):740–752, 2014.

Mikko Koivisto and Kismat Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.

Jan T. A. Koster. On the validity of the Markov interpretation of path diagrams of Gaussian structural equations systems with correlated errors. Scandinavian Journal of Statistics, 26:413–431, 1999.

Jan T. A. Koster. Markov properties of nonrecursive causal models. Annals of Statistics, 24(5):2148–2177, 1996.

Sara Magliacane, Tom Claassen, and Joris M. Mooij. Joint causal inference from observational and interventional datasets. arXiv.org preprint, arXiv:1611.10351v1 [cs.LG], November 2016a. URL https://arxiv.org/abs/1611.10351v1.

Sara Magliacane, Tom Claassen, and Joris M. Mooij. Ancestral causal inference.
In A dvanc es in Neur al Information Pr o c essing Systems 29 (NIPS 2016) , pages 4466–4474, B arcelona, Spain, 2016b. Sara Magliacane, Thijs v an Ommen, T om Claassen, Stephan Bongers, Philip V ersteeg, and Joris M Mo oij. Domain adaptation b y using causal inference to predict in v arian t conditional distributions. In S. Bengio, H. W allach, H. Laro c helle, K. G rauman, N. Cesa- Bianc hi, and R. Garnett, editors, A dvanc es in Neur al Information Pr o c essing Systems 31 (NeurIPS 2018) , pages 10869–10879. Curran Asso ciates, Inc., 2018. Subramani Mani. A Bayesian L o c al Causal Disc overy F r amework . PhD thesis, Universit y of Pittsburg, Marc h 2006. URL http://d- scholarship.pitt.edu/10181/ . Florian Marko wetz, Steffen Grossmann, and Rainer Spang. Probabilistic soft interv en tions in conditional Gaussian net works. In Pr o c e e dings of the T enth International Workshop on Artificial Intel ligenc e and Statistics (AIST A TS 2005) , Bridgeto wn, Barbados, 2005. Christopher Meek. Strong completeness and faithfulness in Bay esian net w orks. In Pr o c e e d- ings of the 11th Confer enc e on Unc ertainty in A rtificial Intel ligenc e (UAI 1995) , pages 411–419, 1995. 104 Joint Ca usal Inference from Mul tiple Contexts Nicolai Meinshausen, Alain Hauser, Joris M. Mo oij, Jonas Peters, Philip V ersteeg, and P eter B ¨ uhlmann. Metho ds for causal inference from gene p erturbation exp eriments and v alidation. Pr o c e e dings of the National A c ademy of Scienc es of the Unite d States of Americ a , 113(27):7361–7368, 2016. Joris M. Mo oij and T om Claassen. Constraint-based causal disco very in the presence of cycles. arXiv.or g pr eprint , arXiv:2005.00610 [math.ST], May 2020. URL https://arxiv. org/abs/2005.00610 . Joris M. Mo oij and T om Hesk es. Cyclic causal disco very from con tinuous equilibrium data. In Pr o c e e dings of the 29th A nnual Confer enc e on Unc ertainty in Artificial Intel ligenc e (UAI 2013) , pages 431–439, 2013. Joris M. 
Mo oij, Jonas P eters, Dominik Janzing, Jakob Zsc heisc hler, and Bernhard Sc h¨ olkopf. Distinguishing cause from effect using observ ational data: Metho ds and b ench- marks. Journal of Machine L e arning R ese ar ch , 17(32):1–102, 2016. Kevin Murph y . Dynamic Bayesian Networks: R epr esentation, Infer enc e and L e arning . PhD thesis, Univ ersity of Pittsburg, July 2002. URL http://www.cs.ubc.ca/ ~ murphyk/ Thesis/thesis.pdf . Radford M. Neal. On deducing conditional indep endence from d -separation in causal graphs with feedback. Journal of A rtificial Intel ligenc e R ese ar ch , 12:87–91, 2000. Chris J. Oates, Jim Korkola, Joe W. Gray , and Sach Mukherjee. Joint estimation of multiple related biological net works. Annals of Applie d Statistics , 8(3):1892–1919, 2014. Chris J. Oates, Jim Q. Smith, and Sac h Mukherjee. Estimating causal structure using conditional DA G models. Journal of Machine L e arning R ese ar ch , 17(1):1880–1903, 2016a. Chris J. Oates, Jim Q. Smith, Sach Mukherjee, and James Cussens. Exact estimation of m ultiple directed acyclic graphs. Statistics and Computing , 26(4):797–811, 2016b. Judea Pearl. A constrain t propagation approac h to probabilistic reasoning. In Pr o c e e dings of the First Confer enc e on Unc ertainty in A rtificial Intel ligenc e (UAI 1985) , pages 357–370, 1986. Judea Pearl. Comment: Graphical mo dels, causality , and interv ention. Statistic al Scienc e , 8:266–269, 1993. Judea Pearl. Causality: Mo dels, R e asoning and Infer enc e . Cam bridge Universit y Press, 2009. Judea Pearl and Rina Dec hter. Identifying indepe ndence in causal graphs with feedback. In Pr o c e e dings of the 12th Annual Confer enc e on Unc ertainty in Artificial Intel ligenc e (UAI 1996) , pages 420–426, 1996. Jonas P eters, Joris M. Mo oij, Dominik Janzing, and Bernhard Sch¨ olkopf. Causal discov ery with contin uous additiv e noise mo dels. Journal of Machine L e arning R ese ar ch , 15:2009– 2053, 2014. 
Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B, 78(5):947–1012, 2016.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA, USA, 2017.

Joseph Ramsey and Bryan Andrews. FASK with interventional knowledge recovers edges from the Sachs model. arXiv.org preprint, arXiv:1805.03108 [q-bio.MN], 2018. URL https://arxiv.org/abs/1805.03108.

Thomas S. Richardson and Peter Spirtes. Ancestral graph Markov models. The Annals of Statistics, 30(4):962–1030, August 2002.

Dominik Rothenhäusler, Christina Heinze, Jonas Peters, and Nicolai Meinshausen. BACKSHIFT: Learning causal cyclic graphs from unknown shift interventions. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pages 1513–1521. Curran Associates, Inc., 2015.

Anna Roumpelaki, Giorgos Borboudakis, Sofia Triantafillou, and Ioannis Tsamardinos. Marginal causal consistency in constraint-based causal learning. In Frederick Eberhardt, Elias Bareinboim, Marloes Maathuis, Joris Mooij, and Ricardo Silva, editors, Proceedings of the UAI 2016 Workshop on Causation: Foundation to Application, number 1792 in CEUR Workshop Proceedings, pages 39–47, Aachen, 2016. URL http://ceur-ws.org/Vol-1792/paper5.pdf.

Karen Sachs, Omar Perez, Dana Pe'er, Douglas A. Lauffenburger, and Garry P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308, 2005.

Ilya Shpitser, Robin J. Evans, Thomas S. Richardson, and James M. Robins. Introduction to nested Markov models. Behaviormetrika, 41(1):3–39, 2014.

Peter Spirtes. Conditional independence in directed cyclic graphical models for feedback. Technical Report CMU-PHIL-54, Carnegie Mellon University, 1994.

Peter Spirtes. Directed cyclic graphical representations of feedback models. In Philippe Besnard and Steve Hanks, editors, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI 1995), pages 499–506, San Francisco, CA, USA, 1995. Morgan Kaufmann.

Peter Spirtes, Thomas Richardson, Christopher Meek, Richard Scheines, and Clark Glymour. Using path diagrams as a structural equation modelling tool. Sociological Methods & Research, 27:182–225, 1998.

Peter Spirtes, Christopher Meek, and Thomas S. Richardson. An algorithm for causal inference in the presence of latent variables and selection bias. In Clark Glymour and Gregory F. Cooper, editors, Computation, Causation and Discovery, chapter 6, pages 211–252. The MIT Press, 1999.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.

Eric V. Strobl. A constraint-based algorithm for causal discovery with cycles, latent variables and selection bias. International Journal of Data Science and Analytics, 2018.

Jin Tian and Judea Pearl. Causal discovery from changes. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI 2001), Seattle, Washington, USA, 2001.

Robert E. Tillman. Structure learning with independent non-identically distributed data. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pages 1041–1048, 2009.

Robert E. Tillman and Peter Spirtes. Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011), 2011.

Sofia Triantafillou and Ioannis Tsamardinos. Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 16:2147–2205, 2015.

Sofia Triantafillou, Vincenzo Lagani, Christina Heinze-Deml, Angelika Schmidt, Jesper Tegner, and Ioannis Tsamardinos. Predicting causal relationships from biological data: Applying automated causal discovery on mass cytometry data of human immune cells. Scientific Reports, 7:12724, 2017.

Thijs van Ommen and Joris M. Mooij. Algebraic equivalence of linear structural equation models. In Proceedings of the 33rd Annual Conference on Uncertainty in Artificial Intelligence (UAI 2017), 2017.

Philip J.J.P. Versteeg and Joris M. Mooij. Boosting local causal discovery in high-dimensional expression data. arXiv.org preprint, arXiv:1910.02505v2 [stat.ML], November 2019. Accepted for publication in BIBM 2019.

Larry Wasserman. All of Statistics. Springer Texts in Statistics. Springer, 2004.

Sewall Wright. Correlation and causation. Journal of Agricultural Research, 20:557–585, 1921.

Karren D. Yang, Abigail Katcoff, and Caroline Uhler. Characterizing and learning equivalence classes of causal DAGs under interventions. In Proceedings of Machine Learning Research 80 (ICML 2018), pages 5537–5546, 2018.

Jiji Zhang. Causal Inference and Reasoning in Causally Insufficient Systems. PhD thesis, Carnegie Mellon University, July 2006. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.466.7206&rep=rep1&type=pdf.

Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17):1873–1896, 2008a.

Jiji Zhang. Causal reasoning with ancestral graphs. Journal of Machine Learning Research, 9:1437–1474, 2008b.

Kun Zhang, Biwei Huang, Jiji Zhang, Clark Glymour, and Bernhard Schölkopf. Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), pages 1347–1353, 2017.
