Foundations of Structural Causal Models with Cycles and Latent Variables
FOUNDATIONS OF STRUCTURAL CAUSAL MODELS WITH CYCLES AND LATENT VARIABLES

By Stephan Bongers¹, Patrick Forré¹, Jonas Peters² and Joris M. Mooij³

¹ Informatics Institute, University of Amsterdam, s.r.bongers@uva.nl; p.d.forre@uva.nl
² Department of Mathematical Sciences, University of Copenhagen, jonas.peters@math.ku.dk
³ Korteweg-De Vries Institute, University of Amsterdam, j.m.mooij@uva.nl

Structural causal models (SCMs), also known as (nonparametric) structural equation models (SEMs), are widely used for causal modeling purposes. In particular, acyclic SCMs, also known as recursive SEMs, form a well-studied subclass of SCMs that generalize causal Bayesian networks to allow for latent confounders. In this paper, we investigate SCMs in a more general setting, allowing for the presence of both latent confounders and cycles. We show that in the presence of cycles, many of the convenient properties of acyclic SCMs do not hold in general: they do not always have a solution; they do not always induce unique observational, interventional and counterfactual distributions; a marginalization does not always exist, and if it exists the marginal model does not always respect the latent projection; they do not always satisfy a Markov property; and their graphs are not always consistent with their causal semantics. We prove that for SCMs in general each of these properties does hold under certain solvability conditions. Our work generalizes results for SCMs with cycles that were only known for certain special cases so far. We introduce the class of simple SCMs that extends the class of acyclic SCMs to the cyclic setting, while preserving many of the convenient properties of acyclic SCMs. With this paper we aim to provide the foundations for a general theory of statistical causal modeling with SCMs.

1.
Introduction. Structural causal models (SCMs), also known as (nonparametric) structural equation models (SEMs), are widely used for causal modeling purposes [5, 51, 55, 73]. They form the basis for many statistical methods that aim at inferring knowledge of the underlying causal structure from data [see, e.g., 7, 37, 45, 48, 56]. In these models, the causal relationships between the variables are expressed in the form of deterministic, functional relationships, and probabilities are introduced through the assumption that certain variables are exogenous latent random variables. SCMs arose out of certain causal models that were first introduced in genetics [79], econometrics [25], electrical engineering [39, 40] and the social sciences [12, 23].

Acyclic SCMs, also known as recursive SEMs, form a special well-studied subclass of SCMs that generalize causal Bayesian networks [51]. They have many convenient properties [see, e.g., 15, 16, 34, 35, 50, 60, 78]: (i) they induce a unique distribution over the variables; (ii) they are closed under perfect interventions; (iii) they are closed under marginalizations; (iv) their marginalization respects the latent projection; (v) they obey (various equivalent versions of) the Markov property and (vi) their graphs express the causal relationships encoded by the SCM in an intuitive manner.

One important limitation of acyclic SCMs is that they cannot model systems that involve causal cycles. In many systems occurring in the real world, there are feedback loops between observed variables.

MSC2020 subject classifications: Primary 62A09, 68T30; secondary 68T37.
Keywords and phrases: structural causal models, causal graph, cycles, interventions, counterfactuals, solvability, Markov properties, marginalization.
For example, in economics the price of a product may be a function of the demanded or supplied quantities, and vice versa, the demanded and supplied quantities may be functions of the price. The underlying dynamic processes describing such systems have an acyclic causal structure over time. However, causal cycles may arise when one approximates such systems over time [17, 42, 43] or when one describes the equilibrium states of these systems [3, 6, 27, 29, 33, 46, 57]. In particular, in [6] it was shown that the equilibrium states of a system governed by (random) differential equations can be described by an SCM that represents their causal semantics, which gives rise to a plethora of SCMs that include cycles (we provide some examples of such feedback systems in Appendix D.1 of the Supplementary Material).

In contrast to their acyclic counterparts, SCMs with cycles have enjoyed less attention in the literature and are not as well understood. In general, none of the above properties (i)–(vi) hold in the class of SCMs. However, some progress has been made in the case of discrete [49, 52] and linear models [27, 31, 63, 70–72], and more recently, for more general cyclic models the Markov properties have been elucidated [18].

Contributions. The purpose of this paper is to provide the foundations for a general theory of statistical causal modeling with SCMs. We study properties of SCMs and allow for cycles, latent variables and nonlinear functional relationships between the variables. We investigate to which extent and under which sufficient conditions each of the properties (i)–(vi) holds, in particular, in the presence of cycles. In the next paragraphs, we describe our contributions in more detail.

When there are cyclic functional relationships between variables, one encounters various technical complications, which even arise in the linear setting.
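The price and quantity feedback loop above has a simple linear incarnation that is useful to keep in mind. The sketch below is our own illustration, not taken from the paper: the coefficient matrix B and the noise distribution are assumptions. It solves the cyclic structural equations x = Bx + e in closed form, which is possible precisely when I − B is invertible.

```python
import numpy as np

# Hypothetical linear cyclic SCM for the price/quantity feedback loop:
#   x = B x + e, with x = (price, quantity) and exogenous shocks e.
B = np.array([[0.0, -0.5],   # price decreases in supplied quantity (assumed)
              [0.8,  0.0]])  # quantity increases in price (assumed)

rng = np.random.default_rng(0)
e = rng.normal(size=(10000, 2))

# A unique equilibrium solution exists here because (I - B) is invertible.
solve = np.linalg.inv(np.eye(2) - B)
x = e @ solve.T  # each row solves x = B x + e for its noise sample

# Verify that the structural equations hold for every sample.
assert np.allclose(x, x @ B.T + e)
```

When I − B is singular, the same equations have either no solution or a continuum of solutions, which is the linear shadow of the solvability issues discussed next.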
The structural equations of an acyclic SCM trivially have a unique solution. This unique solvability property ensures that the SCM gives rise to a unique, well-defined probability distribution on the variables. In the case of cycles, however, this property may be violated, and consequently, the SCM may not have a solution at all, or may allow for multiple different probability distributions [26]. Even if one starts with a cyclic SCM that is uniquely solvable, performing an intervention on the SCM may lead to an intervened SCM that is not uniquely solvable. Hence, a cyclic SCM may not give rise to a unique, well-defined probability distribution corresponding to that intervention, and whether or not this happens may depend on the intervention. We provide sufficient conditions for the existence and uniqueness of these probability distributions after intervention. In general, it is not clear whether the solutions of the structural equations of an SCM are measurable if cycles are present. In addition, we provide sufficient and necessary conditions for the measurability of solution functions of cyclic SCMs.

SCMs provide a detailed modeling description of a system. Not all information may be necessary for a certain modeling task, which motivates considering certain classes of SCMs to be equivalent. In this paper, we formally introduce several such equivalence relations. For example, we consider two SCMs observationally equivalent if they cannot be distinguished based on observations alone. Observationally equivalent SCMs can often still be distinguished by interventions. We consider two SCMs interventionally equivalent if they cannot be distinguished based on observations and interventions.
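The dependence of solvability on the intervention can already be seen by brute force in tiny finite models. The following sketch is our own illustration with hypothetical binary mechanisms (no noise, for simplicity) and is not an example from the paper: a cyclic SCM may have several solutions, exactly one, or none, and an intervention can change which case applies.

```python
from itertools import product

def solutions(f, domains):
    """Brute-force all x with x = f(x) over finite domains (illustrative)."""
    return [x for x in product(*domains) if f(x) == x]

# Hypothetical cyclic SCM: x1 = x2, x2 = x1.
# Two solutions, (0, 0) and (1, 1), so it is not uniquely solvable.
f = lambda x: (x[1], x[0])
assert solutions(f, [(0, 1), (0, 1)]) == [(0, 0), (1, 1)]

# After the perfect intervention do({1}, 0), the cycle is broken and the
# intervened SCM becomes uniquely solvable.
f_do = lambda x: (0, x[0])
assert solutions(f_do, [(0, 1), (0, 1)]) == [(0, 0)]

# Hypothetical cyclic SCM: x1 = x2 + 1, x2 = x1.  The equations imply
# x1 = x1 + 1, a contradiction, so there is no solution at all.
g = lambda x: (x[1] + 1, x[0])
assert solutions(g, [(0, 1), (0, 1)]) == []
```

For continuous domains no such enumeration is available, which is why the paper develops solvability conditions instead.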
While these concepts have been around in implicit form for acyclic SCMs, we formulate them in such a way that they also apply to cyclic SCMs that have either no solution at all or have multiple different induced probability distributions on the variables. Finally, we consider two SCMs counterfactually equivalent if they cannot be distinguished based on observations and interventions and in addition encode the same counterfactual distributions, which are the distributions induced by the so-called twin SCM via the twin network method [1]. These different equivalence relations formalize the different levels of abstraction in the so-called causal hierarchy [53, 69]. In addition, we add another, strong version of equivalence, such that equivalent SCMs have the same solutions. This notion clarifies ambiguities when a function is constant in one of its arguments, for example.

Marginalization becomes useful if not all variables are observed: given a joint probability distribution on some variables, we obtain a marginal distribution on a subset of the variables by integrating out the remaining variables. Analogously, we can marginalize an acyclic SCM by substituting the solutions of the structural equations of a subset of the endogenous variables into the structural equations of the remaining endogenous variables. For acyclic SCMs, the induced observational and interventional distributions of the marginalized SCM coincide with the marginals of the distributions induced by the original SCM [see 15, 16, 75, 78, a.o.]. In other words, for acyclic SCMs the operation of marginalization preserves the probabilistic and causal semantics (restricted to the remaining variables). We show that for cyclic SCMs a marginalization does not always exist without further assumptions.
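The substitution step behind marginalizing an acyclic SCM can be made concrete in a toy model. The sketch below is our own illustration with a hypothetical two-variable SCM, not an example from the paper: substituting the solution of X1 into the mechanism of X2 yields a marginal SCM on X2 alone that induces the same distribution on X2.

```python
import numpy as np

rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=(2, 100000))

# Hypothetical acyclic SCM: X1 = E1, X2 = X1 + E2.
x1 = e1
x2 = x1 + e2

# Marginalizing out X1 substitutes its solution X1 = E1 into the mechanism
# of X2, giving the marginal SCM X2 = E1 + E2 on the remaining variable.
x2_marginal = e1 + e2

# The marginal SCM induces the same observational distribution on X2.
assert np.allclose(x2, x2_marginal)
```

For cyclic SCMs this substitution may fail to produce a well-defined mechanism at all, which is the point of the solvability conditions developed in Section 5.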
In [18] it is shown that for modular SCMs, which can be seen as an SCM together with an additional structure of a compatible system of solution functions, a marginalization can be defined that preserves the probabilistic and causal semantics. We prove that this additional structure is not necessary and use a local unique solvability condition instead. Under this condition, we show that an SCM and its marginalization are observationally, interventionally and counterfactually equivalent on the remaining endogenous variables. Analogously, we define a marginalization operation on the associated graph of an SCM, which generalizes the latent projection [15, 76, 78]. In general, the marginalization of an SCM does not respect the latent projection of its associated graph, but we show that it does so under an additional local ancestral unique solvability condition.

In graphical models, Markov properties allow one to read off conditional independencies in a distribution directly from a graph. Various equivalent formulations of Markov properties exist for acyclic SCMs [34], one prominent example being the d-separation criterion, also known as the directed global Markov property, which was originally derived for Bayesian networks [50]. Markov properties have been of key importance to derive various central results regarding causal reasoning and causal discovery. For cyclic SCMs, however, the usual Markov properties do not hold in general, as was already pointed out by Spirtes [71]. His solution in terms of collapsed graphs was recently generalized and reformulated for a general class of causal graphical models [18] by adapting the notion of d-separation into what has been termed σ-separation. This resulted in a general directed global Markov property expressed in terms of σ-separation instead of d-separation. Here, we formulate these general Markov properties specifically within the framework of SCMs.
Again, they only hold under certain unique solvability conditions.

In addition to its interpretation in terms of conditional independencies, the graph of an acyclic SCM also has a direct causal interpretation [51]. As was already observed in [49], the causal interpretation of SCMs with cycles can be counterintuitive, as the causal semantics under interventions no longer needs to be compatible with the structure imposed by the functional relations between the variables. We resolve this issue by showing that under certain ancestral unique solvability conditions the causal interpretation of SCMs is consistent with its graph.

Cycles lead to several technical complications related to solvability issues. We introduce a special subclass of (possibly cyclic) SCMs, the class of simple SCMs, for which most of these technical complications are absent and which preserves much of the simplicity of the theory for acyclic SCMs. A simple SCM is an SCM that is uniquely solvable with respect to every subset of the variables. Because of this strong solvability assumption, simple SCMs have all the convenient properties (i)–(vi): they always have uniquely defined observational, interventional and counterfactual distributions; we can perform every perfect intervention and marginalization on them and the result is again a simple SCM; marginalization does respect the latent projection; they obey the general directed global Markov property, and for special cases (including the acyclic, linear and discrete case) they obey the (stronger) directed global Markov property; and their graphs have a direct and intuitive causal interpretation.

[Figure 1, a diagram relating an SCM and its intervened, twin, marginal and acyclified versions, their (augmented) graphs, and the induced distributions, independencies and causal relations, appears here.]
Fig 1: Overview of the objects constructed from an SCM and the mappings between them. The numbers correspond to the definition, proposition or theorem of the corresponding object, mapping or result. When an arrow is dashed, the relation only holds under nontrivial assumptions that can be found in the corresponding definition or theorem. The symbol "⊆" stands for the subgraph of a directed mixed graph (see Definition A.1 in the Supplementary Material) and the commuting-square symbol denotes that the surrounding diagram commutes. Table 1 gives an overview of the commutativity results for each pair of mappings between the objects with the names in bold.

SCMs   | do       | twin     | marg
G, G^a | 2.14     | 2.19     | (5.11)
do     | 2.15.(1) | 2.21.(1) | 5.5.(1)
twin   | ·        | -        | 5.5.(2)
marg   | ·        | ·        | 5.4

Graphs | do       | twin     | marg
do     | 2.15.(1) | 2.21.(2) | 5.9.(1)
twin   | ·        | -        | 5.9.(2)
marg   | ·        | ·        | 5.8

Table 1: Overview of the commutativity results of different pairs of mappings, defined on SCMs (left table) and on graphs (right table). All results apply under the assumptions stated in the corresponding proposition. The entries denoted by dots are omitted due to symmetry. We do not consider the commutativity of the twin operation with itself in this paper. Proposition 5.11 (in parentheses) is not a commutativity result but a weaker relation. The graphical twin operator is only defined for directed graphs.

The scope of this paper is limited to establishing the foundations for statistical causal modeling with cyclic SCMs (Figure 7 in Appendix A.4 of the Supplementary Material shows an overview of how SCMs relate to other causal graphical models).
For a detailed discussion of causal reasoning, causal discovery and causal prediction with cyclic SCMs we refer the reader to other literature [e.g., 14, 21, 27, 28, 58, 59, 61]. Several recent results (generalizations of the do-calculus, adjustment criteria and an identification algorithm) for modular SCMs [19, 20] directly apply to the subclass of simple SCMs as well. Finally, many causal discovery algorithms that have been designed for the acyclic case also apply to simple SCMs with no or only minor changes [44, 47].

Overview. Figure 1 gives an overview of the different objects that can be constructed from an SCM and the different mappings between them. For pairs of mappings between the objects with the names in bold, we prove commutativity results which are summarized in Table 1.

Outline. This paper is structured as follows: In Section 2, we provide a formal definition of SCMs and a natural notion of equivalence between SCMs, define the (augmented) graph corresponding to an SCM, and describe perfect interventions and counterfactuals. In Section 3, we discuss the concept of (unique) solvability, its properties and how it relates to self-cycles. In Section 4, we define and relate various equivalence relations between SCMs. In Section 5, we define a marginalization operation that is applicable to cyclic SCMs under certain conditions. We discuss several properties of this marginalization operation and discuss the relation with a marginalization operation defined on directed mixed graphs. In Section 6, we discuss Markov properties of SCMs. In Section 7, we discuss the causal interpretation of the graphs of SCMs. Section 8 introduces and discusses the class of simple SCMs.

The Supplementary Material introduces causal graphical models in Appendix A. This section also contains details on Markov properties and modular SCMs.
Appendix B provides additional (unique) solvability properties, some results for linear SCMs are discussed in Appendix C, other examples in Appendix D, and the proofs of all the theoretical results are in Appendix E. Appendix F contains some lemmas and measurable selection theorems that are used in several proofs.

2. Structural causal models. In this section, we provide the definition and properties of structural causal models (SCMs). Our definition of SCMs slightly deviates from existing definitions [5, 51, 73], because we make the definition of the SCM independent of the random variables that solve it. This enables us to deal with the various technical complications that arise in the presence of cycles.

2.1. Structural causal models and their solutions.

Definition 2.1 (Structural causal model). A structural causal model (SCM) is a tuple¹
M := ⟨I, J, X, E, f, P_E⟩,
where
1. I is a finite index set of endogenous variables,
2. J is a disjoint finite index set of exogenous variables,
3. X = ∏_{i∈I} X_i is the product of the domains of the endogenous variables, where each domain X_i is a standard measurable space (see Definition F.1),
4. E = ∏_{j∈J} E_j is the product of the domains of the exogenous variables, where each domain E_j is a standard measurable space,
5. f : X × E → X is a measurable function that specifies the causal mechanism,
6. P_E = ∏_{j∈J} P_{E_j} is a product measure, the exogenous distribution, where P_{E_j} is a probability measure on E_j for each j ∈ J.²

In SCMs, the functional relationships between variables are expressed in terms of deterministic equations, where each equation expresses an endogenous variable (on the left-hand side) in terms of a causal mechanism depending on endogenous and exogenous variables (on the right-hand side). This allows us to model interventions in an unambiguous way by changing the causal mechanisms that target specific endogenous variables (see Section 2.4).
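The tuple of Definition 2.1 can be mirrored directly in code. The sketch below is our own illustration and not part of the paper: the class name, its fields and the toy acyclic SCM (X1 = E1, X2 = X1 + E2) are all hypothetical. It encodes ⟨I, J, X, E, f, P_E⟩ for real-valued variables and checks the structural equations X = f(X, E) on samples, exactly rather than almost surely.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class SCM:
    """Finite real-valued SCM <I, J, X, E, f, P_E> (illustrative sketch)."""
    endogenous: list[str]                               # index set I
    exogenous: list[str]                                # index set J
    f: Callable[[np.ndarray, np.ndarray], np.ndarray]   # causal mechanism
    sample_exogenous: Callable[[int], np.ndarray]       # draws from P_E

    def satisfies_equations(self, x: np.ndarray, e: np.ndarray) -> bool:
        """Check x = f(x, e) (here: exactly, rather than almost surely)."""
        return np.allclose(x, self.f(x, e))

# Toy acyclic example: X1 = E1, X2 = X1 + E2.
scm = SCM(
    endogenous=["X1", "X2"],
    exogenous=["E1", "E2"],
    f=lambda x, e: np.stack([e[0], x[0] + e[1]]),
    sample_exogenous=lambda n: np.random.default_rng(1).normal(size=(2, n)),
)
e = scm.sample_exogenous(5)
x = np.stack([e[0], e[0] + e[1]])  # unique solution, found by substitution
assert scm.satisfies_equations(x, e)
```

In the acyclic case the solution is obtained by substituting the mechanisms in topological order, as done by hand in the last lines; for cyclic mechanisms no such order exists, which is where the solvability theory of the paper takes over.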
Definition 2.2 (Structural equations). Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM. We call the set of equations
x_i = f_i(x, e), x ∈ X, e ∈ E, for i ∈ I,
the structural equations of the structural causal model M.

¹ We often use boldface for variables that have multiple components, for example, vectors in a Cartesian product.
² For the case J = ∅, we have that E is the singleton 1 and P_E is the degenerate probability measure P_1.

Although it is common to assume the absence of cyclic functional relations (see Definition 2.9), we make no such assumption here. In particular, we allow for self-cycles, which we will discuss in more detail in Sections 2.2 and 3.3.

The solutions of an SCM in terms of random variables are defined up to almost sure equality. Random variables that are almost surely equal are generally considered to be equivalent to each other for all practical purposes.

Definition 2.3 (Solution). A pair (X, E) of random variables X : Ω → X, E : Ω → E, where Ω is a probability space, is a solution of the SCM M = ⟨I, J, X, E, f, P_E⟩ if
1. P^E = P_E, that is, the distribution of E is equal to P_E,³ and
2. the structural equations are satisfied, that is, X = f(X, E) a.s.
For convenience, we call a random variable X a solution of M if there exists a random variable E such that (X, E) forms a solution of M.

Often, the endogenous random variables X can be observed, while the exogenous random variables E are treated as latent. Latent exogenous variables are often referred to as "disturbance terms" or "noise variables." For a solution X, we call the distribution P^X the observational distribution of M associated to X. In general, there may be multiple different observational distributions associated to an SCM due to the existence of different solutions of the structural equations. This is a consequence of the allowance of cycles in SCMs, as the following simple example illustrates.
Example 2.4 (Cyclic SCMs). For brevity, we use throughout this paper the notation n := {1, 2, . . . , n} for n ∈ ℕ. Let M = ⟨2, 1, ℝ², ℝ, f, P_ℝ⟩ be an SCM⁴ with f_1(x, e) = x_2 and f_2(x, e) = x_1, and P_ℝ an arbitrary probability measure on ℝ. Then (X, X) is a solution of M for any arbitrary random variable X with values in ℝ. Hence, any probability distribution on {(x, x) : x ∈ ℝ} is an observational distribution associated to M. Now consider instead the same SCM but with f_1(x, e) = x_2 + 1. This SCM has no solutions at all, and hence induces no observational distribution.

³ This implies that the components E_j of E are mutually independent, since P_E = ∏_{j∈J} P_{E_j}.
⁴ We will abuse notation by using nondisjoint subsets of the natural numbers to index both endogenous and exogenous variables; these should be understood to be disjoint copies of the natural numbers: if we write I = n and J = m, we mean instead I = {1, 2, . . . , n} and J = {1′, 2′, . . . , m′}, where k′ is a copy of k.

Due to the fact that the structural equations only need to be satisfied almost surely, there may exist many different SCMs representing the same set of solutions (see Example D.4). It therefore seems natural not to differentiate between structural equations that have different solutions on at most a P_E-null set of exogenous variables. This leads to an equivalence relation between SCMs. To be able to state the equivalence relation concisely, we introduce the following notation: For subsets U ⊆ I and V ⊆ J, we write X_U := ∏_{i∈U} X_i and E_V := ∏_{j∈V} E_j. In particular, X_∅ and E_∅ are defined by the singleton 1. Moreover, for a subset W ⊆ I ∪ J, we use the convention that we write X_W and E_W instead of X_{W∩I} and E_{W∩J}, respectively, and we adopt a similar notation for the (random) variables in those
spaces, that is, we write x_W and e_W instead of x_{W∩I} and e_{W∩J}, respectively. This allows us to define the following natural equivalence relation for SCMs.⁵,⁶

Definition 2.5 (Equivalence). The two SCMs M = ⟨I, J, X, E, f, P_E⟩ and M̃ = ⟨I, J, X, E, f̃, P_E⟩ are equivalent, denoted by M ≡ M̃, if for all i ∈ I, for P_E-almost every e ∈ E and for all x ∈ X,
x_i = f_i(x, e) ⟺ x_i = f̃_i(x, e).

Thus, two equivalent SCMs can only differ in terms of their causal mechanism. Importantly, equivalent SCMs have the same solutions and, as we will see in Sections 2.4 and 2.5, they have the same causal and counterfactual semantics (see Definitions 2.12 and 2.17, respectively). This equivalence relation on the set of all SCMs gives rise to the quotient set of equivalence classes of SCMs.

2.2. The (augmented) graph. We will now define two types of graphs that can be used for representing structural properties of the SCM. These graphical representations are related to Wright's path diagrams [79]. The structural properties of the functional relations between variables modeled by an SCM are specified by the causal mechanism of the SCM and can be encoded in an (augmented) graph. For the graphical notation and standard terminology on directed (mixed) graphs that is used throughout this paper, we refer the reader to Appendix A.1. We first define the parents of an endogenous variable.

Definition 2.6 (Parent). Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM. We call k ∈ I ∪ J a parent of i ∈ I if and only if there does not exist a measurable function⁷ f̃_i : X_{∖k} × E_{∖k} → X_i such that for P_E-almost every e ∈ E and for all x ∈ X,
x_i = f_i(x, e) ⟺ x_i = f̃_i(x_{∖k}, e_{∖k}).

Exogenous variables have no parents by definition. These parental relations are preserved under the equivalence relation ≡ on SCMs.
They can be represented by a directed graph or a directed mixed graph.⁸

⁵ An attempt at coarsening this notion of equivalence by replacing the quantifier "for all x ∈ X" by "for almost every x ∈ X under the observational distribution P^X" will not lead to a well-defined equivalence relation, since in general the observational distribution P^X may be nonunique or even nonexistent. Refining it by replacing the quantifier "for P_E-almost every e ∈ E" by "for all e ∈ E" would make it too fine for our purposes, since we assume the exogenous distribution to be fixed and we assume as usual that random variables that are almost surely identical are indistinguishable in practice. Note that the "for P_E-almost every e ∈ E" and "for all x ∈ X" quantifiers do not commute in general (see Example D.5).
⁶ We may extend this definition to allow J̃ ≠ J and a larger class of SCMs such that the exogenous distribution does not factorize. Then, for any M that satisfies Definition 2.1, except that it may have a nonfactorizing exogenous distribution, there exists an equivalent SCM with a factorizing exogenous distribution (and a different J); the latter can be obtained by partitioning the exogenous components into independent tuples. This motivates why we can restrict ourselves in Definition 2.1 to factorizing exogenous distributions only. For some more discussion on the representation of latent confounders, see also Example D.6.
⁷ For X = ∏_{i∈I} X_i, I some index set, and a subset I ⊆ I and element k ∈ I, we denote X_{∖I} = ∏_{i∈I∖I} X_i and X_{∖k} = ∏_{i∈I∖{k}} X_i, and similarly for their elements.
⁸ A directed mixed graph G = (V, E, B) consists of a set of nodes V, a set of directed edges E and a set of bidirected edges B (see Definition A.1 for a more precise definition).

Definition 2.7 (Graph and augmented graph). Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM. We define:
1. the augmented graph G^a(M) as the directed graph with nodes I ∪ J and directed edges u → v if and only if u ∈ I ∪ J is a parent of v ∈ I;
2. the graph G(M) as the directed mixed graph with nodes I, directed edges u → v if and only if u ∈ I is a parent of v ∈ I, and bidirected edges u ↔ v if and only if there exists a j ∈ J that is a parent of both u ∈ I and v ∈ I.

We call the mappings G^a and G, which map M to G^a(M) and G(M), the augmented graph mapping and the graph mapping, respectively.

In particular, the augmented graph contains no directed edges pointing toward an exogenous variable, that is, u ∈ I ∪ J cannot be a parent of v ∈ J, because they are not functionally related through the causal mechanism. We call a directed edge i → i in G^a(M) and G(M) (here, i is a parent of itself) a self-cycle at i. By definition, the mappings G^a and G are invariant under the equivalence relation ≡ on SCMs, and hence the equivalence class of an SCM M is mapped to a unique augmented graph G^a(M) and a unique graph G(M).

Example 2.8 (Graphs of an SCM). Let M = ⟨5, 3, ℝ⁵, ℝ³, f, P_{ℝ³}⟩ be an SCM with causal mechanism given by
f_1(x, e) = x_1 - x_1² + α e_1²,
f_2(x, e) = x_1 + x_3 + x_4 + e_1,
f_3(x, e) = -x_4 + e_2,
f_4(x, e) = x_2 + e_2,
f_5(x, e) = x_4 · e_3,
where α ≠ 0 and P_{ℝ³} is a product of three probability measures P_ℝ over ℝ that are nondegenerate. The augmented graph G^a(M) and the graph G(M) of M are depicted⁹ in Figure 2 (left and center).

[Figure 2, showing the three graphs, appears here.]
Fig 2: The augmented graph (left) and the graph (center) of the SCM M of Example 2.8, and the graph of the intervened SCM M_do({3},1) of Example 2.16 (right).
Observe that if α had been equal to zero, then the endogenous variable 1 would not have any parents in G^a(M), that is, it would have neither a self-cycle nor a directed edge from any exogenous variable in G^a(M), and it would have neither a self-cycle nor a bidirected edge to any other variable in G(M). Moreover, if one of the probability measures P_ℝ over ℝ were degenerate, then some of the directed edges from the exogenous variables to the endogenous variables in the augmented graph G^a(M) and bidirected edges in the graph G(M) would be missing.

As is illustrated in this example, the augmented graph provides a more detailed representation than the graph. Therefore, we use the augmented graph as the standard graphical representation for SCMs, unless stated otherwise. For an SCM M, we denote the sets pa_{G^a(M)}(U), ch_{G^a(M)}(U), an_{G^a(M)}(U), etc., for some subset U ⊆ I ∪ J, by respectively pa(U), ch(U), an(U), etc., when the notation is clear from the context.

⁹ For visualizing an (augmented) graph, we adopt the common convention of using random variables, with the index set as a subscript, instead of using the index set itself. With a slight abuse of notation, we still use the random variables notation in the (augmented) graph in the case that the SCM has no solution at all.

Definition 2.9. We call an SCM M acyclic if G^a(M) is a directed acyclic graph (DAG). Otherwise, we call M cyclic.

Equivalently, an SCM M is acyclic if G(M) is an acyclic directed mixed graph (ADMG) [60]. Acyclic SCMs are also known as semi-Markovian SCMs [51, 76]. A commonly considered class of acyclic SCMs are the Markovian SCMs, which are acyclic SCMs for which each exogenous variable has at most one child. Several Markov properties were first shown for these models [35, 51, 76].

2.3.
Structurally minimal representations. We have discussed an equivalence relation between SCMs in Section 2.1. In this subsection, we show that for each SCM there exists a representative of the equivalence class of that SCM for which each component of the causal mechanism does not depend on its nonparents [see also 55].

Definition 2.10 (Structurally minimal SCM). Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM. We call M structurally minimal if for all i ∈ I there exists a mapping f̃_i : X_{pa(i)} × E_{pa(i)} → X_i such that
f_i(x, e) = f̃_i(x_{pa(i)}, e_{pa(i)})
for all e ∈ E and all x ∈ X.

We already encountered a structurally minimal SCM M in Example 2.8. Taking instead α = 0 in that example gives an SCM M that is not structurally minimal, since the endogenous variable 1 is then not a parent of itself, while f_1(x, e) depends on x_1. However, the equivalent SCM where we have replaced the causal mechanism of 1 by f_1(x, e) = 0 yields a structurally minimal SCM. In general, there always exists an equivalent structurally minimal SCM.

Proposition 2.11 (Existence of a structurally minimal SCM). For an SCM M = ⟨I, J, X, E, f, P_E⟩, there exists an equivalent SCM M̃ = ⟨I, J, X, E, f̃, P_E⟩ that is structurally minimal.

For a causal mechanism f : X × E → X and a subset U ⊆ I, we write f_U : X × E → X_U for the U components¹⁰ of f. A structurally minimal representation is compatible with the (augmented) graph, in the sense that for every U ⊆ I there exists a unique measurable mapping f̃_U : X_{pa(U)} × E_{pa(U)} → X_U such that f_U(x, e) = f̃_U(x_{pa(U)}, e_{pa(U)}) for all e ∈ E and all x ∈ X. Moreover, for any U ⊆ I there exists a unique measurable mapping f̃_{an(U)} : X_{an(U)} × E_{an(U)} → X_{an(U)} with f_{an(U)}(x, e) = f̃_{an(U)}(x_{an(U)}, e_{an(U)}) for all e ∈ E and all x ∈ X.

2.4.
Interventions. To define the causal semantics of SCMs, we consider here an idealized class of interventions introduced by Pearl [51] that we refer to as perfect interventions. Other types of interventions, like mechanism changes [77], fat-hand interventions [13], activity interventions [45] and stochastic versions of all of these, are at least as relevant, but we do not consider them here.

DEFINITION 2.12 (Perfect intervention on an SCM). Let M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ be an SCM, I ⊆ 𝓘 a subset of endogenous variables and ξ_I ∈ 𝓧_I a value. The perfect intervention do(I, ξ_I) maps M to the SCM M_{do(I,ξ_I)} := ⟨𝓘, 𝓙, 𝓧, 𝓔, f̃, P_𝓔⟩, where the intervened causal mechanism f̃ is given by

    f̃_i(x, e) = ξ_i           if i ∈ I,
    f̃_i(x, e) = f_i(x, e)     if i ∈ 𝓘 \ I.

¹⁰ For U = ∅, we always consider the trivial mapping f_∅ : 𝓧 × 𝓔 → 𝓧_∅, where 𝓧_∅ is the singleton 1.

The operation do(I, ξ_I) preserves the equivalence relation (see Definition 2.5) on the set of all SCMs, and hence this mapping induces a well-defined mapping on the set of equivalence classes of SCMs. Previous work has considered interventions only on a specific subset of endogenous variables [2, 3, 67]. Instead, we assume that we can intervene on any subset of endogenous variables in the model. We define an analogous operation do(I) on directed mixed graphs.

DEFINITION 2.13 (Perfect intervention on a directed mixed graph). Let G = (V, E, B) be a directed mixed graph and I ⊆ V a subset. The perfect intervention do(I) maps G to the directed mixed graph do(I)(G) := (V, Ẽ, B̃), where Ẽ = E \ {v → i : v ∈ V, i ∈ I} and B̃ = B \ {v ↔ i : v ∈ V, i ∈ I}.

This operation simply removes all incoming edges on the nodes in I. The two notions of intervention are compatible with the (augmented) graph mapping.

PROPOSITION 2.14.
Let M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ be an SCM, I ⊆ 𝓘 a subset of endogenous variables and ξ_I ∈ 𝓧_I a value. Then (G^a ∘ do(I, ξ_I))(M) = (do(I) ∘ G^a)(M) and (G ∘ do(I, ξ_I))(M) = (do(I) ∘ G)(M).

The two notions of perfect intervention satisfy the following elementary properties.

PROPOSITION 2.15. For an SCM and a directed mixed graph, we have the following properties:
1. perfect interventions on disjoint subsets of variables commute;
2. acyclicity is preserved under perfect intervention.

The following example shows that an SCM with a solution may no longer have a solution after a perfect intervention, and, vice versa, that an SCM without a solution may yield an SCM with a solution after an intervention.

EXAMPLE 2.16 (Intervened SCM and its graphs). Consider the SCM M of Example 2.8, which has a solution if and only if α ≥ 0. Applying the perfect intervention do({3}, 1) to M gives the intervened model M_{do({3},1)} with the intervened causal mechanism

    f̃_1(x, e) = x_1 − x_1² + α e_1²,
    f̃_2(x, e) = x_1 + x_3 + x_4 + e_1,
    f̃_3(x, e) = 1,
    f̃_4(x, e) = x_2 + e_2,
    f̃_5(x, e) = x_4 · e_3,

for which the graph G(M_{do({3},1)}) is depicted in Figure 2 (right). This is an example where a perfect intervention leads to an intervened SCM M_{do({3},1)} that does not have a solution anymore. In addition, performing a perfect intervention do({4}, 1) on M_{do({3},1)} yields again an SCM with a solution for α ≥ 0.

Recall that for each solution X of an SCM M we call the distribution P_X the observational distribution of M associated to X. For cyclic SCMs, the observational distribution is in general not unique.¹¹ For example, the SCM M of Example 2.8 has two different observational distributions if α > 0.
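The no-solution claim in Example 2.16 can be checked mechanically. Combining the intervened equations x_2 = x_1 + x_3 + x_4 + e_1, x_3 = 1 and x_4 = x_2 + e_2 forces x_1 = −(1 + e_1 + e_2), while x_1 = f̃_1(x, e) forces x_1² = α e_1²; for P_𝓔-almost every e these two constraints are incompatible. The following numerical sketch (the helper name and tolerance are ours, not the paper's) illustrates this:

```python
import random

def has_solution_after_do3(e1, e2, alpha, tol=1e-9):
    """Consistency check for the intervened system of Example 2.16:
    x_4 = x_2 + e_2 and x_2 = x_1 + 1 + x_4 + e_1 force
    x_1 = -(1 + e_1 + e_2), while x_1 = f~_1(x, e) forces
    x_1^2 = alpha * e_1^2."""
    x1 = -(1.0 + e1 + e2)
    return abs(x1**2 - alpha * e1**2) < tol

random.seed(1)
draws = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
# For almost every draw of (e_1, e_2) the two constraints clash,
# so the intervened SCM has no solution:
assert not any(has_solution_after_do3(e1, e2, 0.7) for e1, e2 in draws)
```

The incompatibility set {e : (1 + e_1 + e_2)² = α e_1²} has Lebesgue measure zero, which is why the intervened SCM fails to have a solution almost surely rather than everywhere.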
Similarly, an intervened SCM may induce a distribution that is not unique.

¹¹ In order to assure the existence of a unique observational distribution, it is common to consider only SCMs for which the structural equations have a unique solution (see, e.g., Definition 7.1.1 in [51]). Although these SCMs induce a unique observational distribution, they generally do not induce a unique distribution after a perfect intervention.

Whenever the intervened SCM M_{do(I,ξ_I)} has a solution X, we call the distribution P_X the interventional distribution of M under the perfect intervention do(I, ξ_I) associated to X.¹²

2.5. Counterfactuals. The causal semantics of an SCM are described by the interventions on the SCM. Adding another layer of complexity, one can describe the counterfactual semantics of an SCM by the interventions on the so-called twin SCM, an idea introduced in [1].

DEFINITION 2.17 (Twin SCM). Let M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ be an SCM. The twin operation maps M to the twin structural causal model (twin SCM) M_twin := ⟨𝓘 ∪ 𝓘′, 𝓙, 𝓧 × 𝓧, 𝓔, f̃, P_𝓔⟩, where 𝓘′ = {i′ : i ∈ 𝓘} is a copy of 𝓘 and the causal mechanism f̃ : 𝓧 × 𝓧 × 𝓔 → 𝓧 × 𝓧 is the measurable function given by f̃(x, x′, e) = (f(x, e), f(x′, e)).

The twin operation on SCMs preserves the equivalence relation ≡ on the set of all SCMs. We define an analogous twin operation twin(𝓘) on directed graphs.

DEFINITION 2.18 (Twin graph). Let G = (V, E) be a directed graph and 𝓘 ⊆ V a subset such that 𝓙 := V \ 𝓘 is exogenous, that is, pa_G(𝓙) = ∅. The twin(𝓘) operation maps G to the twin graph w.r.t. 𝓘, defined by twin(𝓘)(G) := (Ṽ, Ẽ), where:
1. Ṽ = V ∪ 𝓘′, where 𝓘′ is a copy of 𝓘,
2.
Ẽ = E ∪ E′, where E′ is given by

    E′ = {j → i′ : j ∈ 𝓙, i ∈ 𝓘, j → i ∈ E} ∪ {ĩ′ → i′ : ĩ, i ∈ 𝓘, ĩ → i ∈ E},

with i′, ĩ′ ∈ 𝓘′ the respective copies of i, ĩ ∈ 𝓘.

Twin operations are compatible with the augmented graph mapping and preserve acyclicity.

PROPOSITION 2.19. Let M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ be an SCM. Then (G^a ∘ twin)(M) = (twin(𝓘) ∘ G^a)(M).

PROPOSITION 2.20. For SCMs and directed graphs, acyclicity is preserved under the twin operation.

The perfect intervention and the twin operation on SCMs and directed graphs commute with each other in the following way.

PROPOSITION 2.21. Let M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ be an SCM and G = (V, E) a directed graph. Then perfect intervention commutes with the twin operation on both:
1. the SCM M: for a subset I ⊆ 𝓘 and value ξ_I ∈ 𝓧_I, (do(I ∪ I′, ξ_{I∪I′}) ∘ twin)(M) = (twin ∘ do(I, ξ_I))(M), and
2. the directed graph G: for subsets I ⊆ 𝓘 ⊆ V such that 𝓙 := V \ 𝓘 is exogenous, (do(I ∪ I′) ∘ twin(𝓘))(G) = (twin(𝓘) ∘ do(I))(G),

where I′ is the copy of I in 𝓘′ and ξ_{I′} = ξ_I.

¹² In the literature, one often finds the notation p(x) and p(x | do(X_I = x_I)) for the densities of the observational and interventional distribution, respectively, in case these are uniquely defined by the SCM [e.g., 51].

Whenever the intervened twin SCM (M_twin)_{do(Ĩ,ξ_Ĩ)}, where Ĩ ⊆ 𝓘 ∪ 𝓘′ and ξ_Ĩ ∈ 𝓧_Ĩ, has a solution (X, X′), we call the distribution P_{(X,X′)} the counterfactual distribution of M under the perfect intervention do(Ĩ, ξ_Ĩ) associated to (X, X′). In Example D.3, we provide an example of how counterfactuals can be sensibly formulated for a well-known market equilibrium model described in terms of a cyclic SCM.
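Both the twin operation of Definition 2.18 and the commutation property of Proposition 2.21(2) are simple graph transformations that can be checked mechanically. The sketch below is our own encoding (a tuple (i, "prime") stands in for the copy i′), not the paper's notation; it verifies the commutation on a small cyclic graph with one exogenous node:

```python
def do_edges(E, I):
    """do(I) on a directed graph: drop every edge v -> i with i in I."""
    return {(u, v) for (u, v) in E if v not in I}

def twin_graph(V, E, I):
    """twin(I) of a directed graph (Definition 2.18): duplicate the
    endogenous nodes I; exogenous nodes J = V \\ I are shared."""
    E_twin = set(E)
    for (u, v) in E:
        if v in I:
            # j -> i' for exogenous j, and i~' -> i' for endogenous i~:
            src = (u, "prime") if u in I else u
            E_twin.add((src, (v, "prime")))
    V_twin = set(V) | {(i, "prime") for i in I}
    return V_twin, E_twin

V, E, I = {"e", 1, 2}, {("e", 1), (1, 2), (2, 1)}, {1, 2}
# Proposition 2.21(2): do(I ∪ I') ∘ twin = twin ∘ do(I), here with I = {1}.
lhs = do_edges(twin_graph(V, E, I)[1], {1, (1, "prime")})
rhs = twin_graph(V, do_edges(E, {1}), I)[1]
assert lhs == rhs == {(1, 2), ((1, "prime"), (2, "prime"))}
```

Note that the exogenous node "e" feeds both copies in the twin graph; this sharing of the exogenous variables is exactly what makes counterfactual reasoning possible.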
The interpretation of counterfactual statements has received a lot of attention in the literature [1, 8, 36, 51, 66]. For acyclic graphs, an alternative graphical approach to counterfactuals is the framework of single world intervention graphs (SWIGs) [64]. One topic of discussion is that there exist SCMs that induce the same observational and interventional distributions, but differ in their counterfactual statements [11] (see also Example D.7). This raises the question of how one can estimate such SCMs from data.

3. Solvability. In this section, we introduce the notions of solvability and unique solvability with respect to a subset of the endogenous variables of an SCM. They describe the existence and uniqueness of measurable solution functions for the subsystem of structural equations that corresponds to a certain subset of the endogenous variables. These notions play a central role in formulating sufficient conditions under which several properties of acyclic SCMs may be extended to the cyclic setting. For example, we show that solvability of an SCM is a sufficient and necessary condition for the existence of a solution of an SCM. Further, unique solvability of an SCM implies the uniqueness of the induced observational distribution.

3.1. Definition of solvability. Intuitively, one can think of the structural equations corresponding to a subset of endogenous variables O ⊆ 𝓘 as a description of how the subsystem formed by the variables O interacts with the rest of the system 𝓘 \ O through the variables pa(O) \ O. A solution function w.r.t. O assigns to each input value (x_{pa(O)\O}, e_{pa(O)}) of this subsystem a specific output value x_O of the subsystem. This is formalized as follows.

DEFINITION 3.1 (Solvability). Let M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ be an SCM. We call M solvable w.r.t.
O ⊆ 𝓘 if there exists a measurable mapping g_O : 𝓧_{pa(O)\O} × 𝓔_{pa(O)} → 𝓧_O such that for P_𝓔-almost every e ∈ 𝓔 and for all x ∈ 𝓧,

    x_O = g_O(x_{pa(O)\O}, e_{pa(O)})  ⟹  x_O = f_O(x, e).

We then call g_O a measurable solution function w.r.t. O for M. We call M solvable if it is solvable w.r.t. 𝓘.

By definition, solvability w.r.t. a subset respects the equivalence relation ≡ on SCMs. The measurable solution functions w.r.t. a certain subset do not always exist, and if they exist, they are not always uniquely defined. For example, for the SCM M in Example 2.8, the measurable solution functions w.r.t. {1} are given by g_1^±(e_1) = ±√(α e_1²) if and only if α ≥ 0.

The following theorem states that various possible notions of "solvability" are equivalent.

THEOREM 3.2 (Sufficient and necessary conditions for solvability). For an SCM M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩, the following are equivalent:
1. M has a solution (see Definition 2.3);
2. for P_𝓔-almost every e ∈ 𝓔 the structural equations x = f(x, e) have a solution x ∈ 𝓧;
3. M is solvable (see Definition 3.1).

Fig 3: Left: the graphs of the observationally equivalent SCMs M and M̃ of Examples 3.5 and 4.2, respectively. Right: the graphs of the interventionally equivalent SCMs M̄ and M̂ of Example 4.4.

While in the acyclic case the above theorem is almost trivial, in the cyclic case the measure-theoretic aspects are not that obvious. In particular, to prove the existence of a measurable solution function g : 𝓔_{pa(𝓘)} → 𝓧 in case the structural equations have a solution for almost every e ∈ 𝓔, we make use of a strong measurable selection theorem (see Theorem F.8 or [30]).
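Returning to the solution functions g_1^±(e_1) = ±√(α e_1²) mentioned above, a quick numerical sanity check (only the structural equation of variable 1 from Example 2.8 is used; the helper name is ours) confirms that these are precisely the fixed points of x_1 = f_1(x, e) when α ≥ 0:

```python
import math

def f1(x1, e1, alpha):
    """Causal mechanism of endogenous variable 1 in Example 2.8:
    f_1(x, e) = x_1 - x_1^2 + alpha * e_1^2."""
    return x1 - x1**2 + alpha * e1**2

alpha = 0.7
for e1 in (-1.3, 0.0, 0.4, 2.5):
    for sign in (+1, -1):
        x1 = sign * math.sqrt(alpha * e1**2)   # g_1^{+-}(e_1)
        # a fixed point of x_1 = f_1(x, e), i.e. a solution w.r.t. {1}:
        assert abs(f1(x1, e1, alpha) - x1) < 1e-9
# For alpha < 0, the fixed-point equation x_1^2 = alpha * e_1^2 has no
# real solution whenever e_1 != 0, so no solution function w.r.t. {1}
# exists in that case.
```

The two branches g_1^+ and g_1^− are both measurable, which illustrates why solution functions, when they exist, need not be unique.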
This theorem implies that if there exists a solution X : Ω → 𝓧, then there necessarily exist a random variable E : Ω → 𝓔 and a mapping g : 𝓔_{pa(𝓘)} → 𝓧 such that g(E_{pa(𝓘)}) is a solution. However, it does not imply that there necessarily exist a random variable E : Ω → 𝓔 and a mapping g : 𝓔_{pa(𝓘)} → 𝓧 such that X = g(E_{pa(𝓘)}) holds a.s.; this fails, for example, if X is a nontrivial mixture of such solutions (see Example D.8).

Solvability w.r.t. a strict subset of 𝓘 is in general neither sufficient nor necessary for the existence of a (global) solution of the SCM. Consider, for example, the SCM M in Example 2.8 with α < 0. Even though this SCM is solvable w.r.t. {2, 3, 4}, it is not (globally) solvable, and hence does not have any solution. In Proposition B.1, we provide a sufficient condition for solvability w.r.t. a strict subset of 𝓘 that is similar to condition (2) in Theorem 3.2, in the sense that it is formulated in terms of the solutions of (a subset of) the structural equations without requiring measurability of the solutions. For the class of linear SCMs, we provide in Proposition C.2 a sufficient and necessary condition for solvability w.r.t. a subset of 𝓘.

3.2. Unique solvability. The notion of unique solvability w.r.t. a subset O ⊆ 𝓘 is similar to the notion of solvability, but with the additional requirement that the measurable solution function g_O : 𝓧_{pa(O)\O} × 𝓔_{pa(O)} → 𝓧_O is unique up to a P_𝓔-null set.

DEFINITION 3.3 (Unique solvability). Let M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ be an SCM. We call M uniquely solvable w.r.t. O ⊆ 𝓘 if there exists a measurable mapping g_O : 𝓧_{pa(O)\O} × 𝓔_{pa(O)} → 𝓧_O such that for P_𝓔-almost every e ∈ 𝓔 and for all x ∈ 𝓧,

    x_O = g_O(x_{pa(O)\O}, e_{pa(O)})  ⟺  x_O = f_O(x, e).

We call M uniquely solvable if it is uniquely solvable w.r.t. 𝓘.

If M ≡ M̃ and M is uniquely solvable w.r.t.
O, then M̃ is uniquely solvable w.r.t. O, too, and the same mapping g_O is a measurable solution function w.r.t. O for both M and M̃.

The following result explains why the notions of (unique) solvability do not play an important role in the theory of acyclic SCMs.

PROPOSITION 3.4. An acyclic SCM M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ is uniquely solvable w.r.t. every subset O ⊆ 𝓘.

We now illustrate that cyclic SCMs, too, can be uniquely solvable w.r.t. every subset.

EXAMPLE 3.5 (Cyclic SCM, uniquely solvable w.r.t. each subset). Consider the SCM M = ⟨4, 4, ℝ⁴, ℝ⁴, f, P_{ℝ⁴}⟩ with causal mechanism given by

    f_1(x, e) = e_1,
    f_2(x, e) = e_2,
    f_3(x, e) = x_1 x_4 + e_3,
    f_4(x, e) = x_2 x_3 + e_4,

and P_{ℝ⁴} the standard-normal distribution on ℝ⁴. This SCM M is uniquely solvable w.r.t. every subset and its (augmented) graph includes a cycle (see Figure 3).

Theorem 3.2 provides sufficient and necessary conditions for (global) solvability. The next theorem states that under the additional uniqueness requirement there exists a sufficient and necessary condition for unique solvability w.r.t. any subset (for solvability w.r.t. a subset we only have the sufficient condition provided in Proposition B.1), and moreover, that all solutions of a uniquely solvable SCM induce the same observational distribution.

THEOREM 3.6 (Sufficient and necessary conditions for unique solvability). Let M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ be an SCM and O ⊆ 𝓘 a subset. The following are equivalent:
1. for P_𝓔-almost every e ∈ 𝓔 and for all x_{𝓘\O} ∈ 𝓧_{𝓘\O} the structural equations x_O = f_O(x, e) have a unique solution x_O ∈ 𝓧_O;
2. M is uniquely solvable w.r.t. O.

Furthermore, if M is uniquely solvable, then there exists a solution, and all solutions have the same observational distribution.

It is well known that under acyclicity the observational distribution is unique. Theorem 3.6 generalizes this result to settings with cycles.
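For Example 3.5 the unique solvability can be made explicit: substituting x_1 = e_1 and x_2 = e_2 into the cyclic pair x_3 = x_1 x_4 + e_3, x_4 = x_2 x_3 + e_4 gives closed-form solution functions x_3 = (x_1 e_4 + e_3)/(1 − x_1 x_2) and x_4 = (x_2 e_3 + e_4)/(1 − x_1 x_2), valid whenever x_1 x_2 ≠ 1, which holds P_{ℝ⁴}-almost surely. A numerical sketch (our own helper, not from the paper):

```python
import random

def solve_example_3_5(e1, e2, e3, e4):
    """Closed-form solution functions of the cyclic SCM of Example 3.5,
    valid whenever e1 * e2 != 1 (which holds almost surely)."""
    x1, x2 = e1, e2
    denom = 1.0 - x1 * x2
    x3 = (x1 * e4 + e3) / denom
    x4 = (x2 * e3 + e4) / denom
    return x1, x2, x3, x4

random.seed(0)
for _ in range(100):
    e = [random.gauss(0, 1) for _ in range(4)]
    x1, x2, x3, x4 = solve_example_3_5(*e)
    # verify the original cyclic structural equations:
    assert abs(x3 - (x1 * x4 + e[2])) < 1e-8 * (1 + abs(x3))
    assert abs(x4 - (x2 * x3 + e[3])) < 1e-8 * (1 + abs(x4))
```

These solution functions are exactly the causal mechanism of the observationally equivalent SCM M̃ of Example 4.2 below, which is how that example is constructed.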
For linear SCMs, the unique solvability condition w.r.t. a subset is equivalent to a matrix invertibility condition (see Proposition C.3).

In general, (unique) solvability w.r.t. O ⊆ 𝓘 implies neither (unique) solvability w.r.t. a strict superset O ⊊ V ⊆ 𝓘 nor (unique) solvability w.r.t. a strict subset W ⊊ O (see Example B.2). Moreover, (unique) solvability is in general not preserved under unions and intersections (see Appendix B.3).

3.3. Self-cycles. One can think of the structural equation of a single endogenous variable i ∈ 𝓘 as describing a small subsystem that interacts with the rest of the system. If the output x_i of this subsystem is uniquely determined by the input (x_{𝓘\{i}}, e) from the rest of the system (up to a P_𝓔-null set), then i is not a parent of itself (see Definition 2.6).

PROPOSITION 3.7 (Self-cycles). The SCM M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ is uniquely solvable w.r.t. {i} for i ∈ 𝓘 if and only if G^a(M) (or G(M)) has no self-cycle i → i at i ∈ 𝓘.

A self-cycle at an endogenous variable indicates that this variable is not uniquely determined by its parents, up to a P_𝓔-null set. This implies that an SCM with a self-cycle at an endogenous variable in its graph can be either solvable, or not solvable, w.r.t. that variable. For the SCM M of Example 2.8, we indeed have that it is solvable w.r.t. {1} for α > 0, while for α < 0 it is not. For linear SCMs with structural equations X_i = Σ_{j∈𝓘} B_ij X_j + Σ_{k∈𝓙} Γ_ik E_k, the endogenous variable i ∈ 𝓘 has a self-cycle if and only if B_ii = 1 (see also Appendix C).

3.4. Interventions. The property of (unique) solvability is in general not preserved under perfect intervention. For example, a (uniquely) solvable SCM can lead to a nonuniquely solvable SCM after an intervention, which either has no solution or has solutions with multiple induced distributions (see, e.g., Examples 2.16 and D.9).
A sufficient condition for the intervened SCM to be (uniquely) solvable is that the original SCM is (uniquely) solvable w.r.t. the subset of nonintervened endogenous variables.

PROPOSITION 3.8. Let M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ be an SCM that is (uniquely) solvable w.r.t. O ⊆ 𝓘. Then, for any set I such that pa(O) \ O ⊆ I ⊆ 𝓘 \ O and value ξ_I ∈ 𝓧_I, the intervened SCM M_{do(I,ξ_I)} is (uniquely) solvable w.r.t. O ∪ I.

Fig 4: The graphs of the SCM M (left) of Example 3.11 and the marginal SCM M_{marg({2,3})} (right) of Example 5.10.

Proposition 3.4 shows that acyclic SCMs are uniquely solvable w.r.t. every subset and hence are uniquely solvable after every perfect intervention. This also directly follows from the fact that acyclicity is preserved under perfect intervention (see Proposition 2.15). Moreover, since acyclicity is preserved under the twin operation (see Proposition 2.20), an acyclic SCM induces unique observational, interventional and counterfactual distributions.

3.5. Ancestral (unique) solvability. We saw that, in general, solvability w.r.t. O ⊆ 𝓘 does not imply solvability w.r.t. a strict subset of O. Here we show that it does imply solvability w.r.t. the ancestral subsets in G(M)_O, that is, in the induced subgraph of the graph G(M) on O. A subset A ⊆ O is called an ancestral subset in G(M)_O if A = an_{G(M)_O}(A), where an_{G(M)_O}(A) are the ancestors of A according to the induced subgraph¹³ G(M)_O.

DEFINITION 3.9 (Ancestral (unique) solvability). Let M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ be an SCM. We call M ancestrally (uniquely) solvable w.r.t. O ⊆ 𝓘 if M is (uniquely) solvable w.r.t. every ancestral subset in G(M)_O. We call M ancestrally (uniquely) solvable if it is ancestrally (uniquely) solvable w.r.t. 𝓘.
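Ancestral subsets as used in Definition 3.9 can be computed by reverse reachability in the induced subgraph. A minimal sketch (the encoding and edge list are our own reading of Example 3.11's mechanism, including the self-cycles at 2 and 3; not the paper's notation):

```python
def ancestors(directed_edges, O, A):
    """an_{G_O}(A): ancestors of A in the subgraph induced on O
    (including A itself), via reverse reachability along v -> w."""
    parents = {v: set() for v in O}
    for u, v in directed_edges:
        if u in O and v in O:
            parents[v].add(u)
    result, stack = set(A), list(A)
    while stack:
        v = stack.pop()
        for p in parents[v]:
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

# Edges among {2, 3} read off Example 3.11: 3 -> 2, plus the
# self-cycles 2 -> 2 and 3 -> 3 (and 1 -> 2, 3 -> 4 outside).
E = {(3, 2), (2, 2), (3, 3), (1, 2), (3, 4)}
assert ancestors(E, {2, 3}, {3}) == {3}       # {3} is ancestral in G_{{2,3}}
assert ancestors(E, {2, 3}, {2}) == {2, 3}    # {2} is not: an({2}) = {2, 3}
```

A subset A is ancestral in G_O precisely when `ancestors(E, O, A) == set(A)`, which is how one would enumerate the ancestral subsets appearing in Definition 3.9.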
PROPOSITION 3.10 (Solvability is equivalent to ancestral solvability). The SCM M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ is solvable w.r.t. the subset O ⊆ 𝓘 if and only if M is ancestrally solvable w.r.t. O.

A similar result does not hold for unique solvability. Although ancestral unique solvability w.r.t. O ⊆ 𝓘 implies unique solvability w.r.t. O, the converse does not hold in general, as the following example illustrates.

EXAMPLE 3.11 (Unique solvability w.r.t. O does not imply ancestral unique solvability w.r.t. O). Consider the SCM M = ⟨4, 1, ℝ⁴, ℝ, f, P_ℝ⟩ with causal mechanism given by

    f_1(x, e) = e,
    f_2(x, e) = x_2 · (1 − 1_{{0}}(x_1 − x_3)) + 1,
    f_3(x, e) = x_3,
    f_4(x, e) = x_3,

and P_ℝ the standard-normal measure on ℝ. This SCM is uniquely solvable w.r.t. the set {2, 3}, and thus solvable w.r.t. this set. Although it is solvable w.r.t. the ancestral subset {3} in G(M)_{{2,3}}, depicted in Figure 4 (left), it is not uniquely solvable w.r.t. this subset, because the structural equation x_3 = x_3 holds for any x_3 ∈ ℝ. Hence, it is not ancestrally uniquely solvable w.r.t. {2, 3}.

However, for the class of linear SCMs we have that unique solvability w.r.t. O always implies ancestral unique solvability w.r.t. O (see Proposition C.4). Although in general unique solvability is not preserved under unions, in Proposition B.4 we show that if an SCM is uniquely solvable w.r.t. two ancestral subsets and w.r.t. their intersection, then it is uniquely solvable w.r.t. their union. In general, the property of ancestral unique solvability is not preserved under perfect intervention, as can be seen in Example D.9. The notion of ancestral unique solvability will appear in various results in Sections 5 and 6.

¹³ Here, one can also use the augmented graph G^a(M) on O, since an_{G(M)_O}(A) = an_{G^a(M)_O}(A) for every subset A ⊆ O ⊆ 𝓘.

4.
Equivalences. In Section 2, we already encountered an equivalence relation on the class of SCMs (see Definition 2.5). The (augmented) graph of an SCM, its solutions and its induced observational, interventional and counterfactual distributions are preserved under this equivalence relation. In this section, we give several coarser equivalence relations on the class of SCMs: observational, interventional and counterfactual equivalence.

4.1. Observational equivalence. Observational equivalence is the property that two SCMs are indistinguishable on the basis of their observational distributions.

DEFINITION 4.1 (Observational equivalence). Two SCMs M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ and M̃ = ⟨𝓘̃, 𝓙̃, 𝓧̃, 𝓔̃, f̃, P_𝓔̃⟩ are observationally equivalent w.r.t. O ⊆ 𝓘 ∩ 𝓘̃, denoted by M ≡_obs(O) M̃, if 𝓧_O = 𝓧̃_O, for all solutions X of M there exists a solution X̃ of M̃ such that P_{X_O} = P_{X̃_O}, and for all solutions X̃ of M̃ there exists a solution X of M such that P_{X_O} = P_{X̃_O}. M and M̃ are called observationally equivalent if they are observationally equivalent w.r.t. 𝓘 = 𝓘̃.

Equivalent SCMs have the same solutions, and hence they are observationally equivalent w.r.t. every subset O ⊆ 𝓘. However, observational equivalence does not imply equivalence.

EXAMPLE 4.2 (Observational equivalence does not imply equivalence). Consider the SCM M̃ that is the same as M of Example 3.5 but with the causal mechanism f̃ given by

    f̃_1(x, e) := e_1,
    f̃_2(x, e) := e_2,
    f̃_3(x, e) := (x_1 e_4 + e_3) / (1 − x_1 x_2),
    f̃_4(x, e) := (x_2 e_3 + e_4) / (1 − x_1 x_2).

This SCM M̃ is observationally equivalent to the SCM M. Because the two SCMs have different (augmented) graphs, they are not equivalent to each other (see Figure 3).
This example shows that if two SCMs M and M̃ are observationally equivalent, then their associated augmented graphs G^a(M) and G^a(M̃) are not necessarily equal to each other.

4.2. Interventional equivalence. We consider two SCMs to be interventionally equivalent if they induce the same interventional distributions under all perfect interventions.

DEFINITION 4.3 (Interventional equivalence). Two SCMs M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ and M̃ = ⟨𝓘̃, 𝓙̃, 𝓧̃, 𝓔̃, f̃, P_𝓔̃⟩ are interventionally equivalent w.r.t. O ⊆ 𝓘 ∩ 𝓘̃, denoted by M ≡_int(O) M̃, if 𝓧_O = 𝓧̃_O and for every I ⊆ O and every value ξ_I ∈ 𝓧_I their intervened models M_{do(I,ξ_I)} and M̃_{do(I,ξ_I)} are observationally equivalent with respect to O. M and M̃ are called interventionally equivalent if they are interventionally equivalent w.r.t. 𝓘 = 𝓘̃.

Equivalent SCMs have the same solutions under every perfect intervention, and hence they are interventionally equivalent w.r.t. every subset O ⊆ 𝓘. SCMs that are interventionally equivalent w.r.t. a subset O ⊆ 𝓘 are interventionally equivalent w.r.t. every strict subset W ⊊ O. But in general, they are not interventionally equivalent w.r.t. a strict superset O ⊊ V ⊆ 𝓘, as can be seen in Example 4.2, where the SCMs M and M̃ are interventionally equivalent w.r.t. {1, 2} but are not interventionally equivalent. Interventional equivalence w.r.t. O ⊆ 𝓘 implies observational equivalence w.r.t. O, since the empty perfect intervention (I = ∅) is a special case of a perfect intervention. However, observational equivalence w.r.t. O ⊆ 𝓘 does not imply interventional equivalence w.r.t. O in general, as can be seen in Example 4.2, where the SCMs M and M̃ are observationally equivalent but not interventionally equivalent.
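The failure of interventional equivalence for the SCMs M (Example 3.5) and M̃ (Example 4.2) can be seen at the level of solution maps. The sketch below (helper names and the pointwise comparison are ours; the paper's claim is about the induced distributions, of which this is only an illustration) solves both models for a fixed exogenous value, with and without the intervention do({3}, ξ_3):

```python
import math

def solve_M(e, xi3=None):
    """Solve the cyclic equations of Example 3.5, optionally under do({3}, xi3)."""
    x1, x2 = e[0], e[1]
    if xi3 is None:
        x3 = (x1 * e[3] + e[2]) / (1 - x1 * x2)   # unique solution of the cycle
    else:
        x3 = xi3                                  # mechanism of 3 replaced
    x4 = x2 * x3 + e[3]                           # mechanism of 4 still uses x_3
    return (x1, x2, x3, x4)

def solve_Mt(e, xi3=None):
    """Solve the equations of M~ (Example 4.2) under the same intervention."""
    x1, x2 = e[0], e[1]
    x3 = ((x1 * e[3] + e[2]) / (1 - x1 * x2)) if xi3 is None else xi3
    x4 = (x2 * e[2] + e[3]) / (1 - x1 * x2)       # does not depend on x_3
    return (x1, x2, x3, x4)

e = (0.5, 0.5, 1.0, 1.0)
# Without intervention the solution maps coincide ...
assert all(math.isclose(a, b) for a, b in zip(solve_M(e), solve_Mt(e)))
# ... but under do({3}, 0) the mechanisms for X_4 disagree:
assert solve_M(e, xi3=0.0)[3] == 1.0      # x_4 = x_2 * 0 + e_4
assert solve_Mt(e, xi3=0.0)[3] == 2.0     # x_4 = (x_2 e_3 + e_4)/(1 - x_1 x_2)
```

In M the intervention on variable 3 propagates to variable 4, while in M̃ the dependence on x_3 has been substituted away, so interventions on 3 have no effect there; this is why the two models agree only w.r.t. {1, 2}.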
Although interventional equivalence is a finer notion than observational equivalence, if two SCMs M and M̃ are interventionally equivalent, their associated augmented graphs G^a(M) and G^a(M̃) are still not necessarily equal to each other.

EXAMPLE 4.4 (Interventionally equivalent SCMs with different graphs). Consider the SCM M̄ = ⟨2, 2, {−1, 1}², {−1, 1}², f̄, P_𝓔⟩ and the SCM M̂ that is the same as M̄ except for its causal mechanism f̂, where the causal mechanisms are given by

    f̄_1(x, e) = e_1,    f̄_2(x, e) = x_1 e_2,
    f̂_1(x, e) = e_1,    f̂_2(x, e) = e_2,

and P_𝓔 the distribution of E = (E_1, E_2) with E_1, E_2 ∼ U({−1, 1}) uniformly distributed and E_1 ⊥⊥ E_2. Then M̄ and M̂ are interventionally equivalent although G(M̄) is not equal to G(M̂) (see Figure 3).

Example D.6 showcases an SCM with two endogenous and three exogenous variables for which there is no interventionally equivalent SCM (satisfying smoothness constraints) with one exogenous variable taking values in ℝ² whose first and second components enter the first and second structural equation, respectively. In this sense, representing confounders with dependent exogenous variables can be nontrivial in nonlinear models.

4.3. Counterfactual equivalence. We consider two SCMs to be counterfactually equivalent if their twin SCMs induce the same counterfactual distributions under every perfect intervention.

DEFINITION 4.5 (Counterfactual equivalence). Two SCMs M = ⟨𝓘, 𝓙, 𝓧, 𝓔, f, P_𝓔⟩ and M̃ = ⟨𝓘̃, 𝓙̃, 𝓧̃, 𝓔̃, f̃, P_𝓔̃⟩ are counterfactually equivalent with respect to O ⊆ 𝓘 ∩ 𝓘̃, denoted by M ≡_cf(O) M̃, if M_twin and M̃_twin are interventionally equivalent with respect to O ∪ O′, where O′ corresponds to the copy of O in 𝓘′ ∩ 𝓘̃′.
M and M̃ are called counterfactually equivalent if they are counterfactually equivalent with respect to 𝓘 = 𝓘̃.

The notion of counterfactual equivalence is coarser than equivalence and finer than interventional equivalence.

PROPOSITION 4.6. For SCMs, equivalence implies counterfactual equivalence w.r.t. O, which in turn implies interventional equivalence w.r.t. O, for any O ⊆ 𝓘.

Interventionally equivalent SCMs that have the same causal mechanism (and differ only in their exogenous distribution) may not be counterfactually equivalent (see, e.g., Example D.7). Although the notion of counterfactual equivalence is finer than the notions of observational and interventional equivalence, the (augmented) graphs of counterfactually equivalent SCMs are in general not equal to each other (see Example D.10).

4.4. Relations between equivalences. The definitions of observational, interventional and counterfactual equivalence provide equivalence relations on the set of all SCMs. For two SCMs to be observationally, interventionally or counterfactually equivalent w.r.t. O ⊆ 𝓘 ∩ 𝓘̃, the domains of their endogenous variables O have to be equal, that is, 𝓧_O = 𝓧̃_O. Apart from that, the index sets of the endogenous and the exogenous variables, the spaces of the other endogenous and exogenous variables, the causal mechanism and the exogenous probability measure may all differ. The observational, interventional and counterfactual equivalence classes w.r.t. O ⊆ 𝓘 ∩ 𝓘̃ are related in the following way (see Proposition 4.6):

    M and M̃ are equivalent
    ⟹ M and M̃ are counterfactually equivalent w.r.t. O
    ⟹ M and M̃ are interventionally equivalent w.r.t. O
    ⟹ M and M̃ are observationally equivalent w.r.t. O.

This hierarchy allows us to compare SCMs at different levels of abstraction and formally establishes the "ladder" of causation (last two implications) [51, 53, 69].

5.
Marginalizations. In this section, we show how, and under which conditions, one can marginalize an SCM over a subset L ⊆ 𝓘 of endogenous variables (thereby "hiding" the variables L) to another SCM on the margin 𝓘 \ L that is observationally, interventionally and even counterfactually equivalent with respect to 𝓘 \ L. In other words, we provide a formal notion of marginalization and show that it preserves the probabilistic, causal and counterfactual semantics on the margin.

The problem of marginalization of directed graphical models has been addressed for acyclic graph structures, for example, ADMGs and mDAGs [see 15, 16, 60, 62, 78, a.o.], and more recently in [18] for certain graph structures ("HEDGes") that may include cycles. Although in the acyclic setting it has been shown that the marginalization for some of these graph structures preserves the probabilistic and causal semantics, in the cyclic setting this has only been shown for modular SCMs [18]. We show that without the additional structure of a compatible system of solution functions (see Appendix A.3) one can still define a marginalization for SCMs under certain local unique solvability conditions. Intuitively, the idea is that if the state of a subsystem of endogenous variables is uniquely determined by the parents outside of this subsystem, then one can ignore the internals of this subsystem by treating it as a "black box" that can be described by certain measurable solution functions (see Figure 4). One can marginalize over this subsystem by substituting these measurable solution functions into the rest of the model, thereby removing the functional dependencies on the variables of the subsystem from the rest of the system, while preserving the probabilistic, causal and counterfactual semantics of the rest of the system.
We show that in general this marginalization operation defined on SCMs does not respect the latent projection on its associated (augmented) graph, where the latent projection is a similar marginalization operation defined on directed mixed graphs [15, 76, 78]. We show that under certain stronger local ancestral unique solvability conditions the marginalization does respect the latent projection.

5.1. Marginalization of a structural causal model. Before we show how one can marginalize an SCM w.r.t. a subset of endogenous variables, we first point out that in general it is not always possible to find an SCM on the margin that preserves the causal semantics, as the following example illustrates.

EXAMPLE 5.1 (No SCM on the margin preserves the causal semantics). Consider the SCM M = ⟨3, ∅, ℝ³, 1, f, P_1⟩ with causal mechanism

    f_1(x) = x_1 + x_2 + x_3,    f_2(x) = x_2,    f_3(x) = 0.

Then there exists no SCM M̃ on the endogenous variables {2, 3} that is interventionally equivalent to M w.r.t. {2, 3}. To see this, suppose there exists such an SCM M̃; then for every (ξ_2, ξ_3) ∈ 𝓧_{{2,3}} such that ξ_2 + ξ_3 ≠ 0, the intervened model M̃_{do({2,3},(ξ_2,ξ_3))} has a solution but M_{do({2,3},(ξ_2,ξ_3))} does not.

More generally, for an SCM M that is not solvable w.r.t. a subset L ⊆ 𝓘, there is no SCM M̃ on the endogenous variables 𝓘 \ L that is interventionally equivalent w.r.t. 𝓘 \ L. The following example illustrates that for an SCM that is uniquely solvable w.r.t. a subset, there exists an SCM on the margin that preserves the causal semantics.

EXAMPLE 5.2 (SCM on the margin that preserves the causal semantics). Consider the SCM M of Example 3.11, which is uniquely solvable w.r.t. the subset L = {2, 3} (depicted by the gray box in Figure 4).
Substituting the measurable solution functions g_L into the causal mechanism components f_1 and f_4 for the remaining endogenous variables {1, 4} gives a "marginal" causal mechanism f̃_1(x, e) := e and f̃_4(x, e) := x_1. This defines an SCM M̃ on the margin I \ L = {1, 4} that is interventionally equivalent w.r.t. I \ L to M.

In general, for an SCM M, a given subset L ⊆ I of endogenous variables and its complement O = I \ L, we can consider the "subsystem" of structural equations x_L = f_L(x_L, x_O, e). If M is uniquely solvable w.r.t. L with measurable solution function g_L : X_{pa(L)\L} × E_{pa(L)} → X_L, then for each input (x_{pa(L)\L}, e_{pa(L)}) ∈ X_{pa(L)\L} × E_{pa(L)} of the subsystem, there exists an output x_L ∈ X_L, which is unique for P_{E_{pa(L)}}-almost every e_{pa(L)} ∈ E_{pa(L)} and for all x_{pa(L)\L} ∈ X_{pa(L)\L}. We can remove this subsystem of endogenous variables from the model by substitution. This leads to a marginal SCM that is observationally, interventionally and counterfactually equivalent to the original SCM w.r.t. the margin, as we prove in Theorem 5.6.

DEFINITION 5.3 (Marginalization of an SCM). Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM that is uniquely solvable w.r.t. a subset L ⊆ I and let O = I \ L. For g_L : X_{pa(L)\L} × E_{pa(L)} → X_L, any measurable solution function of M w.r.t. L, we call the SCM M_marg(L) := ⟨O, J, X_O, E, f̃, P_E⟩ with the marginal causal mechanism f̃ : X_O × E → X_O given by f̃(x_O, e) = f_O(g_L(x_{pa(L)\L}, e_{pa(L)}), x_O, e) a marginalization of M w.r.t. L. We denote by marg(L)(M) the equivalence class of the marginalizations of M w.r.t. L.

The marginalization of M w.r.t. L is defined up to the equivalence ≡ on SCMs, since the measurable solution functions g_L are uniquely defined up to P_E-null sets.
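Definition 5.3 can be sketched exactly in the linear case (the specific coefficients below are invented; for linear SCMs, unique solvability w.r.t. L amounts to invertibility of I − B_LL, and since substitution preserves linearity the marginal SCM is again linear, cf. Proposition C.5). Note how marginalizing variable 3 creates a new self-cycle at variable 1:

```python
from fractions import Fraction as F

# Hypothetical linear SCM  x = B x + e  on {1, 2, 3}; we marginalize L = {3}:
#   x1 = a12*x2 + a13*x3 + e1
#   x2 = a21*x1           + e2
#   x3 = a31*x1           + e3   (B_LL = 0, so I - B_LL = 1 is invertible)
a12, a13, a21, a31 = F(1, 2), F(1, 3), F(1, 4), F(3)
e1, e2, e3 = F(1), F(2), F(5)

# Unique solution of the full system (substitute x2, x3 into the x1-equation):
denom = 1 - a12 * a21 - a13 * a31
assert denom != 0  # unique solvability of the full system
x1 = (e1 + a12 * e2 + a13 * e3) / denom
x2 = a21 * x1 + e2
x3 = a31 * x1 + e3

# Marginal SCM on O = {1, 2}: substitute x3 = a31*x1 + e3 into f1.
b11_marg = a13 * a31     # new self-cycle at 1, created by hiding variable 3
e1_marg = e1 + a13 * e3  # transformed exogenous term

# The restriction (x1, x2) of the full solution solves the marginal system:
assert x1 == b11_marg * x1 + a12 * x2 + e1_marg
assert x2 == a21 * x1 + e2
```

Exact rational arithmetic makes the equivalence of the two systems on the margin visible without numerical tolerance.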
With this definition at hand, we can always construct a marginal SCM over a subset of the endogenous variables of an acyclic SCM by mere substitution (see also Proposition 3.4). Moreover, this definition extends that notion to SCMs that are uniquely solvable w.r.t. a certain subset. For linear SCMs this condition translates into a matrix invertibility condition, and since substitution preserves linearity, marginalization yields a linear marginal SCM (see Proposition C.5).

In general, marginalization is not always defined for all subsets. For instance, the SCM of Example 3.11 cannot be marginalized over the variable 3 (due to the self-cycle at 3), but can be marginalized over the variables 2 and 3 together. It follows from Proposition 3.7 that we can only marginalize over a single variable if that variable has no self-cycle. Note that we may introduce new self-cycles if we marginalize over a subset of variables, as can be seen, for example, from the SCM M in Example 2.8. This SCM has only one self-cycle; however, marginalizing w.r.t. {2} gives a marginal SCM with another self-cycle at variable 4.

The definition of marginalization satisfies an intuitive property: if we can marginalize over two disjoint subsets one after the other, then we can also marginalize over the union of those subsets at once, and the respective results agree.

PROPOSITION 5.4. Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM that is uniquely solvable w.r.t. a subset L_1 ⊆ I and let L_2 ⊆ I be a subset disjoint from L_1. Then M_marg(L_1) is uniquely solvable w.r.t. L_2 if and only if M is uniquely solvable w.r.t. L_1 ∪ L_2. Moreover, marg(L_2) ∘ marg(L_1)(M) = marg(L_1 ∪ L_2)(M).

In this proposition, L_1 and L_2 have to be disjoint, since marginalizing first over L_1 gives a marginal SCM M_marg(L_1) with endogenous variables I \ L_1.
Next, we show that the distributions of a marginal SCM are identical to the marginal distributions induced by the original SCM. A simple proof of this result proceeds by showing that both the intervention and the twin operation commute with marginalization.

PROPOSITION 5.5. Let M be an SCM that is uniquely solvable w.r.t. a subset L ⊆ I. Then the marginalization marg(L) commutes with both:
1. the perfect intervention do(I, ξ_I) for a subset I ⊆ I \ L and a value ξ_I ∈ X_I, that is, (marg(L) ∘ do(I, ξ_I))(M) = (do(I, ξ_I) ∘ marg(L))(M), and
2. the twin operation twin, that is, (marg(L ∪ L′) ∘ twin)(M) = (twin ∘ marg(L))(M), where L′ is the copy of L in I′.

With Proposition 5.5 at hand, we can prove the main result of this subsection.

THEOREM 5.6 (Marginalization of an SCM preserves the observational, causal and counterfactual semantics). Let M be an SCM that is uniquely solvable w.r.t. a subset L ⊆ I. Then M and marg(L)(M) are observationally, interventionally and counterfactually equivalent w.r.t. I \ L.

This shows that our definition of marginalization (Definition 5.3) preserves the probabilistic, causal and counterfactual semantics under a certain local unique solvability condition. Moreover, this allows us to marginalize SCMs w.r.t. a certain subset without the additional assumptions imposed by modular SCMs; for example, the SCM M of Example 3.11 does not have the additional structure of a compatible system of solution functions, but M can be marginalized w.r.t. the subset {2, 3} (see Appendix A.3).

In general, interventional equivalence does not imply counterfactual equivalence (see, e.g., Example D.7). However, for our definition of marginalization we arrive at a marginal SCM that is not only interventionally equivalent, but also counterfactually equivalent w.r.t. the margin. For an SCM M, unique solvability w.r.t.
a certain subset L ⊆ I is a sufficient, but not a necessary condition for the existence of an SCM M̃ on the margin I \ L such that M and M̃ are counterfactually equivalent w.r.t. I \ L (see, e.g., Example D.11). Hence, in certain cases it may be possible to relax the uniqueness condition.

5.2. Marginalization of a graph. We now turn to a marginalization operation for directed mixed graphs, which we call the latent projection. This name is inspired by a similar construction on directed mixed graphs in [78]. In [78], the authors concentrate on a mapping between directed mixed graphs and show that it preserves conditional independence properties [see also 76]. In this subsection, we provide a sufficient condition for the marginalization of an SCM to respect the latent projection, that is, for the augmented graph of the marginal SCM to be a subgraph of the latent projection of the augmented graph of the original SCM.

DEFINITION 5.7 (Marginalization/latent projection of a directed mixed graph). Let G = (V, E, B) be a directed mixed graph and L ⊆ V a subset. The marginalization of G w.r.t. L, or the latent projection of G onto V \ L, maps G to the marginal graph marg(L)(G) := (Ṽ, Ẽ, B̃), where:
1. Ṽ = V \ L,
2. for i, j ∈ Ṽ: i → j ∈ Ẽ if and only if there exists a directed path i → ℓ_1 → ··· → ℓ_n → j in G with n ≥ 0 and ℓ_1, …, ℓ_n ∈ L,
3. for i ≠ j ∈ Ṽ: i ↔ j ∈ B̃ if and only if
   a) there exist n, m ≥ 0 and ℓ_1, …, ℓ_n ∈ L, ℓ̃_1, …, ℓ̃_m ∈ L such that i ← ℓ_1 ← ℓ_2 ← ··· ← ℓ_n ↔ ℓ̃_m → ℓ̃_{m−1} → ··· → ℓ̃_1 → j in G, or
   b) there exist n, m ≥ 1 and ℓ_1, …, ℓ_n ∈ L, ℓ̃_1, …, ℓ̃_m ∈ L such that i ← ℓ_1 ← ℓ_2 ← ··· ← ℓ_n and ℓ̃_m → ℓ̃_{m−1} → ··· → ℓ̃_1 → j in G and ℓ_n = ℓ̃_m.

Note that this gives G(M) = marg(J)(G^a(M)) for any SCM M. Further, for a subgraph H ⊆ G we have marg(L)(H) ⊆ marg(L)(G) for any subset of nodes L.
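Definition 5.7 is purely combinatorial and can be implemented directly. The sketch below uses our own encoding (sets of ordered pairs for E, frozensets for B) and closes directed paths through L by a fixed-point computation:

```python
def latent_projection(V, E, B, L):
    """Latent projection of a directed mixed graph (V, E, B) onto V - L,
    following Definition 5.7. E is a set of (tail, head) pairs and B a
    set of frozenset({i, j}) bidirected edges."""
    def reach(target):
        # target itself, plus every l in L with a directed path
        # l -> ... -> target whose intermediate nodes all lie in L
        out, changed = {target}, True
        while changed:
            changed = False
            for (a, b) in E:
                if a in L and b in out and a not in out:
                    out.add(a)
                    changed = True
        return out

    Vm = V - L
    R = {v: reach(v) for v in Vm}
    # (2) i -> j iff some edge i -> a starts a directed path to j through L
    Em = {(i, j) for i in Vm for j in Vm
          if any((i, a) in E for a in R[j])}
    # (3) i <-> j via a bidirected edge between the two chains (case a),
    #     or via a common node of L heading into both i and j (case b)
    Bm = set()
    for i in Vm:
        for j in Vm:
            if i != j and (any(frozenset({a, b}) in B
                               for a in R[i] for b in R[j])
                           or (R[i] & R[j] & L)):
                Bm.add(frozenset({i, j}))
    return Vm, Em, Bm

# Projecting the middle of a chain 1 -> 2 -> 3 out over L = {2} gives 1 -> 3:
_, Em, Bm = latent_projection({1, 2, 3}, {(1, 2), (2, 3)}, set(), {2})
assert Em == {(1, 3)} and Bm == set()

# A latent common cause 1 <- 2 -> 3 projects onto the bidirected edge 1 <-> 3:
_, Em, Bm = latent_projection({1, 2, 3}, {(2, 1), (2, 3)}, set(), {2})
assert Em == set() and Bm == {frozenset({1, 3})}

# Projecting out one node of the cycle 1 -> 2 -> 1 creates a self-cycle at 1:
_, Em, _ = latent_projection({1, 2}, {(1, 2), (2, 1)}, set(), {2})
assert (1, 1) in Em
```

The last example mirrors the earlier observation that marginalization can introduce new self-cycles.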
It does not matter in which order we project out the nodes or whether we perform several projections at once.

PROPOSITION 5.8. Let G = (V, E, B) be a directed mixed graph and L_1, L_2 ⊆ V two disjoint subsets. Then (marg(L_1) ∘ marg(L_2))(G) = (marg(L_2) ∘ marg(L_1))(G) = marg(L_1 ∪ L_2)(G).

Similar to the definition of marginalization for SCMs, this definition of the latent projection commutes with both the (graphical) perfect intervention and the twin operation.

PROPOSITION 5.9. Let G = (V, E, B) be a directed mixed graph and L, I, I ⊆ V subsets. Then the marginalization marg(L) commutes with both:
1. the perfect intervention do(I) if I is disjoint from L, that is, (marg(L) ∘ do(I))(G) = (do(I) ∘ marg(L))(G), and
2. the twin operation twin(I) if B = ∅, J := V \ I is exogenous (i.e., pa_G(J) = ∅) and L ⊆ I, that is, (marg(L ∪ L′) ∘ twin(I))(G) = (twin(I \ L) ∘ marg(L))(G), where L′ is the copy of L in I′.

An example of an SCM for which a marginalization respects the latent projection is the SCM M of Example 2.8. Marginalizing M w.r.t. L = {2} gives a marginal SCM M_marg(L) with a graph that is a subgraph of the latent projection of the graph of M onto I \ L. In general, not all marginalizations respect the latent projection, as illustrated in the following example.

EXAMPLE 5.10 (Marginalization does not respect the latent projection). Consider the SCM M of Example 3.11. Although M and its marginalization M_marg(L) with L = {2, 3} are interventionally equivalent w.r.t. I \ L = {1, 4}, the graph G(M_marg(L)) is not a subgraph of the latent projection of G(M) onto I \ L, as can be verified from the graphs depicted in Figure 4.
Under the local ancestral unique solvability condition, which is stronger than the local unique solvability condition (i.e., ancestral unique solvability w.r.t. a subset implies unique solvability w.r.t. that subset), one can prove that the marginalization of an SCM respects the latent projection.

PROPOSITION 5.11. Let M be an SCM that is ancestrally uniquely solvable w.r.t. a subset L ⊆ I. Then G^a ∘ marg(L)(M) ⊆ marg(L) ∘ G^a(M) and G ∘ marg(L)(M) ⊆ marg(L) ∘ G(M).

The (augmented) graph of a marginalized SCM can be a strict subgraph of the corresponding latent projection if, for example, certain paths cancel each other out after the substitution of the measurable solution function(s) into the causal mechanism(s) on the margin (see Example D.12). For acyclic SCMs, we recover with Proposition 5.11 the known result that this class is closed under marginalization (see Proposition 3.4) [15]. For linear SCMs, unique solvability w.r.t. a subset L holds if and only if ancestral unique solvability w.r.t. L holds (see Proposition C.4), and hence a marginalization of a linear SCM always respects the latent projection.

6. Markov properties. In this section, we give a short overview of Markov properties for SCMs with cycles. We make use of the Markov properties that were recently developed by Forré and Mooij [18] for HEDGes, a graphical representation that is similar to the augmented graph of SCMs. We briefly summarize some of their main results and apply them to the class of SCMs. In Appendix A.2, we provide a more thorough introduction and give an intuitive derivation, which can act as an entry point for the reader into the more extensive discussion of Markov properties provided in [18].

Markov properties associate a set of conditional independence relations to a graph.
The directed global Markov property for directed acyclic graphs (see Definitions A.4 and A.6), also known as the d-separation criterion [50], is one of the most widely used. It directly extends to a similar property for acyclic directed mixed graphs (ADMGs) [60]. It does not hold in general for cyclic SCMs, however, as was already observed earlier [71, 72].

EXAMPLE 6.1 (Directed global Markov property does not hold for cyclic SCM). One can check that for every solution X of the SCM M of Example 3.5, X_1 is not independent of X_2 given {X_3, X_4}. However, the variables X_1 and X_2 are d-separated given {X_3, X_4} in G(M) (see Figure 3). Hence the directed global Markov property does not hold here.

Although some progress has been made in the case of discrete [18, 49, 52] and linear models [18, 27, 31, 63, 70–72], only recently has a general directed global Markov property been introduced for more general cyclic models [18], based on σ-separation (see Definitions A.16 and A.20), an extension of d-separation. The notion of σ-separation was derived from the notion of d-separation in the acyclification of the graph [18] (see Definition A.13). The acyclification of a graph generalizes the idea of the collapsed graph developed by Spirtes [71] and can, in particular, be applied to the graphs of SCMs. The main idea of the acyclification is that under the condition that the SCM is uniquely solvable w.r.t. each strongly connected component, we can replace the causal mechanisms of these strongly connected components by their measurable solution functions, which results in an acyclic SCM. This acyclified SCM (see Definition A.11) is observationally equivalent to the original SCM (see Proposition A.12).

EXAMPLE 6.2 (Construction of an observationally equivalent acyclic SCM). The SCM M of Example 3.5 is uniquely solvable w.r.t.
all its strongly connected components, that is, the subsets {1}, {2} and {3, 4}. Replacing the causal mechanisms of these strongly connected components by their measurable solution functions gives the observationally equivalent SCM M̃ of Example 4.2. Because M̃ is acyclic (see Figure 3), we can apply the directed global Markov property to M̃. The fact that X_1 and X_2 are not d-separated given {X_3, X_4} in G(M̃) is in line with X_1 being dependent on X_2 given {X_3, X_4} for every solution X of M̃ (and hence of M).

This acyclification preserves solutions, and d-separation in the acyclification can directly be translated into σ-separation on the original graph (see Proposition A.19). This leads to the general directed global Markov property. The following theorem summarizes the main results of [18] applied to SCMs.

THEOREM 6.3 (Global Markov properties for SCMs [18]). Let M be a uniquely solvable SCM. Then its observational distribution P^X exists, is unique and the following two statements hold:
1. P^X satisfies the directed global Markov property ("d-separation criterion") relative to G(M) (see Definition A.6) if M satisfies at least one of the following conditions:
   a) M is acyclic;
   b) all endogenous spaces X_i are discrete and M is ancestrally uniquely solvable;
   c) M is linear (see Definition C.1), each of its causal mechanisms {f_i}_{i∈I} has a nontrivial dependence on at least one exogenous variable, and P_E has a density w.r.t. the Lebesgue measure on ℝ^J.
2. P^X satisfies the general directed global Markov property ("σ-separation criterion") relative to G(M) (see Definition A.20) if M is uniquely solvable w.r.t. each strongly connected component of G(M).¹⁴

The general directed global Markov property is generally weaker than the directed global Markov property, since σ-separation implies d-separation.
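Graphically, the acyclification described above can be sketched as follows. This is an illustration consistent with the textual description (see Definition A.13 for the precise version): within each strongly connected component, internal directed edges are dropped, every node inherits the parents of the whole component, and the component's members become pairwise bidirectedly connected. The example graph merely mimics the shape of Example 3.5, with strongly connected components {1}, {2} and {3, 4}:

```python
def reachable(E, s):
    """All nodes reachable from s via directed edges in E (including s)."""
    out, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for (a, b) in E:
            if a == u and b not in out:
                out.add(b)
                stack.append(b)
    return out

def acyclify(V, E, B):
    """Graphical acyclification sketch: strongly connected components lose
    their internal directed edges; each node of a component inherits the
    parents of the whole component; component members become bidirectedly
    connected."""
    R = {v: reachable(E, v) for v in V}
    scc = {v: frozenset(u for u in V if u in R[v] and v in R[u]) for v in V}
    Ea = {(a, v) for (a, b) in E if scc[a] != scc[b] for v in scc[b]}
    Ba = set(B) | {frozenset({u, v}) for v in V for u in scc[v] if u != v}
    return V, Ea, Ba

# Hypothetical graph shaped like Example 3.5: 1 -> 3, 2 -> 4 and a cycle
# between 3 and 4.
V = {1, 2, 3, 4}
E = {(1, 3), (2, 4), (3, 4), (4, 3)}
_, Ea, Ba = acyclify(V, E, set())
assert Ea == {(1, 3), (1, 4), (2, 3), (2, 4)}
assert Ba == {frozenset({3, 4})}
# The directed part of the acyclification is indeed acyclic:
assert all(v not in reachable(Ea, w) for (v, w) in Ea)
```

In the acyclified graph the paths 1 → 3 ↔ 4 ← 2 and 1 → 4 ↔ 3 ← 2 open up when conditioning on {3, 4}, which matches the dependence observed in Example 6.1.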
The acyclic case is well known and was first shown in the context of linear-Gaussian structural equation models [32, 75]. The discrete case fixes the erroneous theorem by Pearl and Dechter [52], for which a counterexample was found by Neal [49], by adding the ancestral unique solvability condition, and extends it to allow for bidirected edges in the graph. The linear case is an extension of existing results for the linear-Gaussian setting without bidirected edges [31, 71, 72] to a linear (possibly non-Gaussian) setting with bidirected edges in the graph.

In constraint-based approaches to causal discovery, one usually assumes the converse of the (general) directed global Markov property to hold [51, 73], which is called σ-faithfulness respectively d-faithfulness (see Definitions A.9 and A.23). Meek [41] showed that for multinomial and linear-Gaussian DAG models (i.e., acyclic and causally sufficient SCMs), d-faithfulness holds for all parameter values up to a measure zero set. To the best of our knowledge, no such results have been shown in more general parametric or nonparametric settings (neither for d-faithfulness in acyclic or cyclic settings, nor for σ-faithfulness).

7. Causal interpretation of the graph of SCMs. In Example 4.4, we already saw that sometimes no information in the observational, interventional and even the counterfactual distributions suffices to decide whether a directed path or bidirected edge is present in the graph or not. Here, we do not attempt to provide a complete characterization of the conditions under which the presence or absence of a directed path or bidirected edge in the graph can be identified from the observational and interventional distributions. Instead, we give sufficient conditions to detect a directed path and a bidirected edge in the graph.
In general, cyclic SCMs may have none, one or multiple induced observational distributions, and this may change after intervening in the system. Here, we restrict ourselves to graphs of SCMs where the induced (marginal) observational and interventional distributions are uniquely defined.

7.1. Directed paths and edges. For cyclic SCMs, the causal interpretation of the SCM is not always consistent with its graph. This can be illustrated with the SCM M of Example 5.10. Here, one sees a difference in the marginal distribution P^{M_do({1},ξ_1)} on X_4 for different values of ξ_1, although variable 1 is not an ancestor of variable 4 and each marginal distribution P^{M_do({1},ξ_1)} on X_4 is uniquely defined. This counterintuitive behavior, that an intervention on a nonancestor of a variable can change the distribution of that variable, was already observed by Neal [49]. However, under a specific unique solvability condition, we obtain a direct causal interpretation for the absence of a directed edge or directed path in the graph of an SCM.

¹⁴ Since [18] also provides results under the weaker condition that an SCM is solvable (not necessarily uniquely) w.r.t. each strongly connected component of G(M), one might believe that Theorem 6.3.(2) could be generalized to state that in that case, any of its observational distributions satisfies the general directed global Markov property. However, that is not true: consider, for example, the SCM M = ⟨2, ∅, ℝ², 1, f, P_1⟩ with f_1(x) = x_1 and f_2(x) = x_2. Then M is solvable w.r.t. each of its strongly connected components {1} and {2}. The solution with X_1 = X_2, where X_2 has a nondegenerate distribution, shows a dependence between X_1 and X_2, and thus X_1 ⊥⊥ X_2 does not hold. In general, all strongly connected components that admit multiple solutions may be dependent on any other variable(s) in the model.
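The footnote's counterexample can be made concrete: since every value solves x_1 = x_1 and x_2 = x_2, the SCM admits many solutions with different dependence structures (a small illustrative sketch; the Gaussian choice of distribution is ours):

```python
import random

rng = random.Random(0)

# Footnote 14's SCM: f1(x) = x1, f2(x) = x2. Every pair (x1, x2) solves
# the structural equations, so a "solution" is any random vector at all.
def f(x1, x2):
    return (x1, x2)

def solution_independent():
    return (rng.gauss(0, 1), rng.gauss(0, 1))  # X1 independent of X2

def solution_coupled():
    z = rng.gauss(0, 1)
    return (z, z)                              # X1 = X2: maximally dependent

# Both are genuine solutions of x = f(x):
for sol in (solution_independent, solution_coupled):
    x = sol()
    assert f(*x) == x

# The coupled solution violates X1 _||_ X2, so no Markov property can hold
# simultaneously for all solutions of this merely-solvable SCM.
x = solution_coupled()
assert x[0] == x[1]
```

This is why Theorem 6.3.(2) needs unique solvability w.r.t. each strongly connected component, not mere solvability.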
PROPOSITION 7.1 (Sufficient condition for detecting a directed edge in the latent projection of the graph of an SCM). Consider an SCM M = ⟨I, J, X, E, f, P_E⟩, a subset O ⊆ I and i, j ∈ O such that i ≠ j. Let ξ_I ∈ X_I, where I := O \ {i, j}, be such that M_do(I,ξ_I) is uniquely solvable w.r.t. an_{G(M_do(I,ξ_I))\i}(j). If there exist values ξ_i ≠ ξ̃_i ∈ X_i such that both (M_do(I,ξ_I))_do({i},ξ_i) and (M_do(I,ξ_I))_do({i},ξ̃_i) induce unique marginal distributions on X_j, and these two induced distributions do not coincide, that is, there exists a measurable set B_j ⊆ X_j such that
P^{(M_do(I,ξ_I))_do({i},ξ_i)}(X_j ∈ B_j) ≠ P^{(M_do(I,ξ_I))_do({i},ξ̃_i)}(X_j ∈ B_j),
then the directed edge i → j is present in the latent projection marg(I \ O)(G(M)) of G(M) on O.

Two cases are of special interest: O = I, which corresponds to a directed edge i → j in G(M), and O = {i, j}, which corresponds to a directed path i → ··· → j in G(M). The condition in Proposition 7.1 is a sufficient condition for determining whether a directed edge or path is present in the graph. In general, not all directed edges and paths can be identified from the interventional distributions with this sufficient condition. For example, no interventional distribution satisfies the condition of Proposition 7.1 for the SCM M̄ in Example 4.4, although there is a directed edge 1 → 2 in the graph G(M̄).

7.2. Bidirected edges. It is well known that there exists a similar sufficient condition for detecting bidirected edges in the graph of an acyclic SCM, also known as the common-cause principle [see, e.g., 51].
In the two-variable case, this criterion informally states that there exists a bidirected edge between the variables i and j in the graph of the SCM if the marginal interventional distribution of X_j under the intervention do({i}, x_i) differs from the conditional distribution of X_j given X_i = x_i (see Example D.13). The following proposition provides a generalization of this sufficient condition for detecting bidirected edges in graphs of SCMs that may include cycles.

PROPOSITION 7.2 (Sufficient condition for detecting a bidirected edge in the latent projection of the graph of an SCM). Consider an SCM M = ⟨I, J, X, E, f, P_E⟩, a subset O ⊆ I and i, j ∈ O such that i ≠ j. Let ξ_I ∈ X_I, where I := O \ {i, j}, be such that M_do(I,ξ_I) is uniquely solvable w.r.t. both an_{G(M_do(I,ξ_I))}(i) and an_{G(M_do(I,ξ_I))\i}(j). Assume that for every ξ_i ∈ X_i both M_do(I,ξ_I) and (M_do(I,ξ_I))_do({i},ξ_i) induce a unique marginal distribution on X_j × X_i and X_j, respectively. If j ∉ an_{G(M_do(I,ξ_I))}(i) and there exists a measurable set B_j ⊆ X_j such that for every version of the regular conditional probability P^{M_do(I,ξ_I)}(X_j ∈ B_j | X_i = ξ_i) there exists a value ξ_i ∈ X_i such that
P^{(M_do(I,ξ_I))_do({i},ξ_i)}(X_j ∈ B_j) ≠ P^{M_do(I,ξ_I)}(X_j ∈ B_j | X_i = ξ_i),
then there exists a bidirected edge i ↔ j in the latent projection marg(I \ O)(G(M)) of G(M) on O.

This proposition gives a sufficient condition for determining that a bidirected edge is present in the graph. In general, not all bidirected edges in the graph can be identified from the observational, interventional and even the counterfactual distributions, as we saw in Example D.10. In this example, there exists a bidirected edge 1 ↔ 2 ∈ G(M) while the density p(x_2 | do(X_1 = x_1)) = p(x_2 | X_1 = x_1) for all x_1 ∈ X_1.
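The two-variable criterion can be checked by exact enumeration on a toy confounded SCM (the model below is hypothetical, in the spirit of Example D.13): a single exogenous variable enters both mechanisms, so conditioning and intervening disagree.

```python
from fractions import Fraction as F

# Hypothetical confounded acyclic SCM: one exogenous variable E, uniform
# on {0, 1}, entering both mechanisms (hence the bidirected edge 1 <-> 2):
#   X1 = E,   X2 = X1 + E
E_vals = [0, 1]

def p_x2_given_x1(b, xi):
    """Conditional P(X2 = b | X1 = xi), by enumeration of E."""
    matches = [e for e in E_vals if e == xi]  # events with X1 = E = xi
    return F(sum(1 for e in matches if xi + e == b), len(matches))

def p_x2_do_x1(b, xi):
    """Interventional P(X2 = b) under do(X1 = xi): then X2 = xi + E."""
    return F(sum(1 for e in E_vals if xi + e == b), len(E_vals))

# Conditioning on X1 = 1 reveals E = 1, so X2 = 2 with certainty; under
# do(X1 = 1), E stays uniform, so X2 = 2 only half the time. The mismatch
# is exactly the common-cause criterion for a bidirected edge 1 <-> 2.
assert p_x2_given_x1(2, 1) == 1
assert p_x2_do_x1(2, 1) == F(1, 2)
```

The criterion fires here because observing X_i leaks information about the shared exogenous variable, while intervening on X_i does not.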
For the acyclic setting, the above criterion is generally considered a universal way to detect a confounder (note that one can then also deal with the case j ∈ an_{G(M_do(I,ξ_I))}(i) by swapping the roles of i and j). If i and j are part of a cycle, the above sufficient condition cannot be applied, and in that case, to the best of our knowledge, no simple sufficient conditions for detecting the presence of a bidirected edge are known.

8. Simple SCMs. In this section, we introduce the well-behaved class of simple SCMs. Simple SCMs satisfy all the local unique solvability conditions that ensure that this class is closed under both perfect intervention and marginalization. They extend the subclass of acyclic SCMs to the cyclic setting, while preserving many of their convenient properties.

DEFINITION 8.1 (Simple SCM). Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM. We call M simple if it is uniquely solvable w.r.t. every subset O ⊆ I.

Loosely speaking, an SCM is simple if any subset of its structural equations can be solved uniquely for its associated variables in terms of the other variables that appear in these equations. An example of a simple SCM is given in Example D.1. On simple SCMs one can perform any number of marginalizations (see Definition 5.3) in any order (see Proposition 5.4). All these marginalizations respect the latent projection (see Proposition 5.11) and each resulting marginal SCM is again simple. Moreover, we show that this class is closed under intervention and the twin operation.

PROPOSITION 8.2. The class of simple SCMs is closed under marginalization, perfect intervention and the twin operation.

The class of simple SCMs contains the acyclic SCMs as a subclass (see Proposition 3.4).
In particular, a simple SCM has no self-cycles (see Proposition 3.7), since a self-cycle at a variable means that that variable cannot be uniquely (up to a P_E-null set) determined by its parents. From Proposition 8.2, it follows that the results summarized in Theorem 6.3 also apply to all the observational, interventional and counterfactual distributions of simple SCMs.

COROLLARY 8.3 (Global Markov properties for simple SCMs). Let M be a simple SCM. Then the:
1. observational distribution,
2. interventional distribution after perfect intervention on I ⊆ I,
3. counterfactual distribution after perfect intervention on Ĩ ⊆ I ∪ I′,
all exist, are unique and satisfy the general directed global Markov property relative to G(M), do(I)(G(M)) and do(Ĩ)(twin(G(M))), respectively. Moreover, if M satisfies at least one of the three conditions (1a), (1b), (1c) of Theorem 6.3, then they also obey the directed global Markov property relative to G(M), do(I)(G(M)) and do(Ĩ)(twin(G(M))), respectively.

Many of these properties are also shown to hold for the class of modular SCMs [18], which contains, in particular, the class of simple SCMs (see Appendix A.3 for more details). Moreover, simple SCMs satisfy the unique solvability conditions of Propositions 7.1 and 7.2, which allows us to define the causal relationships for simple SCMs in terms of their graphs.

DEFINITION 8.4 (Causal relationships for simple SCMs). Let M be a simple SCM.
1. If there exists a directed edge i → j ∈ G(M), that is, i ∈ pa(j), then we call i a direct cause of j according to M;
2. If there exists a directed path i → ··· → j in G(M), that is, i ∈ an(j), then we call i a cause of j according to M;
3. If there exists a bidirected edge i ↔ j ∈ G(M), then we call i and j (latently) confounded according to M.
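For linear SCMs, simplicity is mechanically checkable: by the matrix-invertibility characterization of unique solvability mentioned in Section 5 (cf. Propositions C.4 and C.5), a linear SCM x = Bx + e is simple iff I − B_OO is invertible for every subset O of endogenous variables. A brute-force sketch (exact arithmetic, fine for small systems):

```python
from fractions import Fraction as F
from itertools import chain, combinations, permutations

def det(M):
    """Determinant by Leibniz expansion (exact; fine for tiny matrices)."""
    n = len(M)
    total = F(0)
    for p in permutations(range(n)):
        sign = 1
        for i in range(n):
            for j in range(i + 1, n):
                if p[i] > p[j]:
                    sign = -sign
        prod = F(1)
        for i, pi in enumerate(p):
            prod *= M[i][pi]
        total += sign * prod
    return total

def is_simple_linear(B):
    """A linear SCM x = B x + e is simple iff I - B_OO is invertible for
    every subset O (the matrix condition mentioned in Section 5)."""
    n = len(B)
    idx = list(range(n))
    for O in chain.from_iterable(combinations(idx, r) for r in range(n + 1)):
        M = [[(F(1) if a == b else F(0)) - B[a][b] for b in O] for a in O]
        if det(M) == 0:
            return False
    return True

# A cyclic yet simple linear SCM: x1 = x2/2 + e1, x2 = x1/4 + e2.
assert is_simple_linear([[F(0), F(1, 2)], [F(1, 4), F(0)]])
# A genuine self-cycle x1 = x1 + e1 breaks unique solvability w.r.t. {1}:
assert not is_simple_linear([[F(1), F(0)], [F(0), F(0)]])
```

The first example shows that simplicity genuinely goes beyond acyclicity: a cycle is harmless as long as every subsystem remains uniquely solvable.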
In summary, we have the following sufficient conditions for determining the different causal and confoundedness relationships according to a specific simple SCM M.

COROLLARY 8.5 (Sufficient conditions for the presence of causal and confoundedness relationships for simple SCMs). Let M be a simple SCM and i, j ∈ I such that i ≠ j, and let I := I \ {i, j}. Then:
1. If there exist values ξ_I ∈ X_I and ξ_i ≠ ξ̃_i ∈ X_i and a measurable set B_j ⊆ X_j such that
P^{(M_do(I,ξ_I))_do({i},ξ_i)}(X_j ∈ B_j) ≠ P^{(M_do(I,ξ_I))_do({i},ξ̃_i)}(X_j ∈ B_j),
then i is a direct cause of j according to M, that is, i → j ∈ G(M);
2. If there exist values ξ_i ≠ ξ̃_i ∈ X_i and a measurable set B_j ⊆ X_j such that
P^{M_do({i},ξ_i)}(X_j ∈ B_j) ≠ P^{M_do({i},ξ̃_i)}(X_j ∈ B_j),
then i is a cause of j according to M, that is, i → ··· → j in G(M);
3. If j ∉ an_{G(M_do(I,ξ_I))}(i) and there exist a value ξ_I ∈ X_I and a measurable set B_j ⊆ X_j such that for every version of the regular conditional probability P^{M_do(I,ξ_I)}(X_j ∈ B_j | X_i = ξ_i) there exists a value ξ_i ∈ X_i such that
P^{(M_do(I,ξ_I))_do({i},ξ_i)}(X_j ∈ B_j) ≠ P^{M_do(I,ξ_I)}(X_j ∈ B_j | X_i = ξ_i),
then i and j are confounded according to M, that is, i ↔ j ∈ G(M).

For simple SCMs, it is in general not possible to identify all the causal and confoundedness relationships in the graph from the observational, interventional or even the counterfactual distributions. Examples 4.4 and D.10 show that this is already impossible for acyclic SCMs without further assumptions.

Finally, there is a connection between SCMs and potential outcomes [68] that generalizes to the cyclic setting. One of the consequences of Proposition 8.2 is that all counterfactuals are defined for a simple SCM (even if it is cyclic). This allows us to define potential outcomes in terms of a simple SCM in the following way.
DEFINITION 8.6 (Potential outcome). Let M = ⟨I, J, X, E, f, P_E⟩ be a simple SCM, I ⊆ I a subset, ξ_I ∈ X_I a value and E a random variable such that P^E = P_E. The potential outcome under the perfect intervention do(I, ξ_I) is defined as X^{ξ_I} := g_{M_do(I,ξ_I)}(E_{pa(I)}), where g_{M_do(I,ξ_I)} : E_{pa(I)} → X is a measurable solution function for M_do(I,ξ_I).

9. Discussion. In this paper, we studied the basic properties of SCMs in the presence of cycles and latent variables, without restricting to linear functional relationships between the variables. We saw that cyclic SCMs behave differently in many respects from acyclic SCMs. Indeed, in the presence of cycles, many of the convenient properties of acyclic SCMs do not hold in general: SCMs do not always have a solution; they do not always induce unique observational, interventional and counterfactual distributions; a marginalization does not always exist, and if it exists the marginal model does not always respect the latent projection; they do not always satisfy a Markov property; and their graphs are not always consistent with their causal semantics.

We introduced various notions of (unique) solvability and showed that under appropriate (unique) solvability conditions, many of the operations and results for the acyclic setting can be extended to SCMs with cycles. For example, we introduced several equivalence relations between SCMs to compare SCMs at different levels of abstraction, we showed how to define marginal SCMs on a subset of the variables that are (in various ways) equivalent to the original SCM, we discussed under which conditions the distributions satisfy the (general) directed global Markov property relative to their graphs, and we showed under which conditions the graph of an SCM can be interpreted causally.
Most of these results were shown under sufficient conditions that are not necessary (e.g., for the marginalization operation this was shown in Example D.11). It may therefore be possible to further relax some of the conditions.

These insights led us to introduce the more well-behaved class of simple SCMs, which forms an extension of the class of acyclic SCMs to the cyclic setting that preserves many of its convenient properties: simple SCMs induce unique observational, interventional and counterfactual distributions; the class of simple SCMs is closed under both perfect intervention and marginalization; the marginalization respects the latent projection; and the induced distributions obey the general directed global Markov property, as well as the directed global Markov property in the acyclic, discrete and linear cases. This class does not contain SCMs that have self-cycles, and graphs of simple SCMs have a direct and intuitive causal interpretation.

One key property of simple SCMs is that their solutions always satisfy the conditional independencies implied by σ-separation. By simply replacing d-separation with σ-separation, it turns out that one can directly extend results and algorithms for acyclic SCMs to the more general class of simple SCMs. For example, adjustment criteria (including the back-door criterion), Pearl's do-calculus and Tian's ID algorithm for the identification of causal effects have recently been extended to the class of modular SCMs, which contains the class of simple SCMs [20]. Several causal discovery algorithms have already been proposed that work with simple SCMs, for example, the first constraint-based causal discovery algorithm that can deal with cycles and nonlinear functional relationships [19]. Also, Local Causal Discovery (LCD) [10], Y-structures [38] and the Joint Causal Inference framework (JCI) [47] all apply to simple SCMs, even though they were originally developed for acyclic SCMs only.
Recently, it has been shown that even the well-known Fast Causal Inference (FCI) algorithm [74, 80] is directly applicable to simple SCMs [44] and provides a consistent estimate of the Markov equivalence class (under the faithfulness assumption). Moreover, a method for constructing nonlinear simple SCMs using neural networks and sampling from them has been proposed [19]. This illustrates that the class of simple SCMs forms a convenient and practical extension of the class of acyclic SCMs that can be used for the purposes of causal modeling, reasoning, discovery and prediction.

We hope that this work will provide the foundations for a general theory of statistical causal modeling with SCMs. Future work might consist of reparametrizing and reducing the space of the exogenous variables of an SCM while preserving the causal and counterfactual semantics; extending and generalizing the identifiability results for (direct) causes and confounders; extending the graphs of SCMs to represent selection bias; proving completeness results for some Markov properties for a subclass of SCMs that contains cycles.

Acknowledgments

S. Bongers and J.M. Mooij are supported in part by NWO, the Netherlands Organization for Scientific Research (VIDI grant 639.072.410 and VENI grant 639.031.036). P. Forré and J.M. Mooij are supported in part by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 639466). J. Peters is supported by research grants from VILLUM FONDEN (18968) and the Carlsberg Foundation.

The authors are grateful to Bernhard Schölkopf and Robin Evans for stimulating discussions, and to Noud de Kroon, Tineke Blom and Alexander Ly for providing helpful comments on earlier drafts. We thank two anonymous reviewers and the associate editor for helpful comments.
Supplementary Material

This Supplementary Material contains a summary of the basic terminology and results for causal graphical models (Appendix A), additional (unique) solvability properties (Appendix B), some results for linear SCMs (Appendix C), other examples (Appendix D), the proofs of all the theoretical results (Appendix E) and the measurable selection theorems (Appendix F) that are used in several proofs.

APPENDIX A: CAUSAL GRAPHICAL MODELS

In this appendix, we provide a summary of the basic terminology and results for causal graphical models. In Appendix A.1 we provide the terminology for directed (mixed) graphs. In Appendix A.2 we give an introduction and an intuitive derivation of Markov properties for SCMs with cycles. In Appendix A.3 we provide a definition of modular SCMs and show how they relate to SCMs. In Appendix A.4 we provide an overview of the causal graphical models related to SCMs. The proofs of the theoretical results in this appendix are given in Appendix E.

A.1. Directed (mixed) graphs

In this subsection, we introduce the terminology for directed (mixed) graphs, where we do allow for cycles [18, 34, 51, 60].

DEFINITION A.1 (Directed (mixed) graph).

1. A directed graph is a pair G = (V, E), where V is a set of nodes and E is a set of directed edges, which is a subset E ⊆ V × V of ordered pairs of nodes. Each element (i, j) ∈ E can be represented by the directed edge i → j or equivalently j ← i. In particular, (i, i) ∈ E represents a self-cycle i → i.
2. A directed mixed graph is a triple G = (V, E, B), where the pair (V, E) forms a directed graph and B is a set of bidirected edges, which is a subset B ⊆ {{i, j} : i, j ∈ V, i ≠ j} of unordered (distinct) pairs of nodes. Each element {i, j} ∈ B can be represented by the bidirected edge i ↔ j or equivalently j ↔ i.
Note that a directed graph can be considered as a directed mixed graph without bidirected edges.

3. Let G = (V, E, B) be a directed mixed graph. A directed mixed graph G̃ = (Ṽ, Ẽ, B̃) is a subgraph of G if Ṽ ⊆ V, Ẽ ⊆ E and B̃ ⊆ B, in which case we write G̃ ⊆ G. For a subset W ⊆ V, we define the induced subgraph of G on W by G_W := (W, Ẽ, B̃), where Ẽ and B̃ are the sets of directed and bidirected edges in E and B, respectively, that lie in W × W and {{i, j} : i, j ∈ W, i ≠ j}, respectively.
4. A walk between i, j ∈ V in a directed mixed graph G is a tuple (i_0, ε_1, i_1, ε_2, i_2, ..., ε_n, i_n) of alternating nodes and edges in G for some n ≥ 0, where all i_0, ..., i_n ∈ V, all ε_1, ..., ε_n ∈ E ∪ B such that ε_k ∈ {i_{k−1} → i_k, i_{k−1} ← i_k, i_{k−1} ↔ i_k} for all k = 1, ..., n, and it starts with node i_0 = i and ends with node i_n = j. Note that n = 0 corresponds with a trivial walk consisting of a single node. If all nodes i_0, ..., i_n are distinct, it is called a path. A walk (path) of the form i → ··· → j, that is, ε_k is i_{k−1} → i_k for all k = 1, 2, ..., n, is called a directed walk (path) from i to j.
5. A cycle through i ∈ V in a directed mixed graph G is a directed path from i to some node j, extended with the edge j → i ∈ E. In particular, a self-cycle i → i ∈ E is a cycle. Note that a path cannot contain any cycles. A directed graph and a directed mixed graph are said to be acyclic if they contain no cycles, and are then referred to as a directed acyclic graph (DAG) and an acyclic directed mixed graph (ADMG), respectively.
6.
For a directed mixed graph G and a node i ∈ V we define the set of parents of i by pa_G(i) := {j ∈ V : j → i ∈ E}, the set of children of i by ch_G(i) := {j ∈ V : i → j ∈ E}, the set of ancestors of i by an_G(i) := {j ∈ V : there is a directed path from j to i in G} and the set of descendants of i by de_G(i) := {j ∈ V : there is a directed path from i to j in G}. Note that we have {i} ∪ pa_G(i) ⊆ an_G(i) and {i} ∪ ch_G(i) ⊆ de_G(i). We can apply all these definitions to subsets U ⊆ V by taking unions, for example pa_G(U) := ∪_{i∈U} pa_G(i). A subset A ⊆ V is called an ancestral subset in G if A = an_G(A), that is, A is closed under taking ancestors of A in G.
7. Let G = (V, E, B) be a directed mixed graph. We call G strongly connected if for every pair of distinct nodes i, j ∈ V, the graph contains a cycle that passes through both i and j. The strongly connected component of i ∈ V, denoted by sc_G(i), is the maximal subset S ⊆ V such that i ∈ S and the induced subgraph G_S is strongly connected. Equivalently, sc_G(i) = an_G(i) ∩ de_G(i).
8. A loop in a directed mixed graph G = (V, E, B) is a subset O ⊆ V that is strongly connected in the induced subgraph G_O of G on O.
9. For a directed graph G = (V, E), we define the graph of strongly connected components of G as the directed graph G^sc := (V^sc, E^sc), where V^sc are the strongly connected components of G, that is, V^sc are the equivalence classes in V/∼ with the equivalence relation i ∼ j if and only if i ∈ sc_G(j), and E^sc = (E \ {i → i : i ∈ V})/∼ with the equivalence relation (i → j) ∼ (i′ → j′) if and only if i ∼ i′ and j ∼ j′.

We omit the subscript G whenever it is clear which directed (mixed) graph G we are referring to.

LEMMA A.2 (DAG of strongly connected components). Let G = (V, E) be a directed graph.
Then G^sc, the graph of strongly connected components of G, is a DAG.

A.2. Markov properties

In this subsection, we give a short overview of Markov properties for SCMs with cycles. We will make use of the Markov properties that were recently developed by Forré and Mooij [18] for HEDGes, a graphical representation that is similar to the augmented graph of SCMs. We briefly summarize some of their main results and apply them to the class of SCMs. We also provide a shorter and more intuitive derivation, so that this subsection can act as an entry point for the reader into the more extensive discussion of Markov properties provided in [18].

Markov properties associate a set of conditional independence relations to a graph. The directed global Markov property for directed acyclic graphs, also known as the d-separation criterion [50], is one of the most widely used. It directly extends to a similar property for acyclic directed mixed graphs (ADMGs) [60]. It does not hold in general for cyclic SCMs, however, as was already observed earlier [71, 72]. Under some conditions (roughly speaking, linearity or discrete variables) the directed global Markov property can be shown to hold also in the presence of cycles [18].

Inspired by work of Spirtes [71], Forré and Mooij [18] recognized that in the general cyclic case a different extension of d-separation, termed σ-separation, is needed, leading to the general directed global Markov property. One key result in [18] implies that under the assumption of unique solvability w.r.t. each strongly connected component of its graph, the observational distribution of an SCM satisfies the general directed global Markov property w.r.t. its graph. The solvability assumptions are in general not preserved under interventions.
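Strongly connected components (Definition A.1, item 7) play a central role in what follows. As a concrete illustration (our own sketch, with a made-up example graph, not code from the paper), the characterization sc_G(i) = an_G(i) ∩ de_G(i) and the statement of Lemma A.2 that G^sc is a DAG can both be checked mechanically with plain reachability searches:

```python
def reach(edges, start):
    """All nodes j with a directed path start -> ... -> j (including start)."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in edges:
            if a == v and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def sc(nodes, edges, i):
    """Strongly connected component of i: sc(i) = an(i) ∩ de(i)."""
    de = reach(edges, i)
    an = reach([(b, a) for (a, b) in edges], i)  # reverse all edges
    return frozenset(de & an)

def condensation(nodes, edges):
    """The graph G^sc of strongly connected components (Definition A.1, item 9)."""
    comps = {sc(nodes, edges, i) for i in nodes}
    comp_of = {i: c for c in comps for i in c}
    E_sc = [(comp_of[a], comp_of[b]) for (a, b) in edges
            if comp_of[a] != comp_of[b]]
    return comps, E_sc

nodes = {1, 2, 3, 4}
edges = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]  # two 2-cycles, linked
comps, E_sc = condensation(nodes, edges)
print(sorted(sorted(c) for c in comps))  # [[1, 2], [3, 4]]

# Lemma A.2: G^sc is acyclic, i.e. no edge's source is reachable from its target.
assert all(c not in reach(E_sc, d) for (c, d) in E_sc)
```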
Under the stronger assumption of simplicity, however, they are, and one obtains the corollary that also all interventional and counterfactual distributions of a simple SCM satisfy the general directed global Markov property w.r.t. their corresponding graphs. For a more extensive study of different Markov properties that can be associated to SCMs we refer the reader to [18].

A.2.1. The directed global Markov property

Conditional independencies in the observational distribution of an acyclic SCM can be read off from its graph by using the graphical criterion called d-separation [51]. The directed global Markov property associates a conditional independence relation in the observational distribution of the SCM to each d-separation entailed by the graph. Here, we use a formulation of d-separation that generalizes d-separation for DAGs [50] and m-separation for ADMGs [60] and mDAGs [15].

DEFINITION A.3 (Collider). Let π = (i_0, ε_1, i_1, ε_2, i_2, ..., ε_n, i_n) be a walk (path) in a directed mixed graph G = (V, E, B). A node i_k on π is called a collider on π if it is a non-endpoint node (1 ≤ k < n) and the two edges ε_k, ε_{k+1} meet head-to-head on i_k (i.e., if the subwalk (i_{k−1}, ε_k, i_k, ε_{k+1}, i_{k+1}) is of the form i_{k−1} → i_k ← i_{k+1}, i_{k−1} ↔ i_k ← i_{k+1}, i_{k−1} → i_k ↔ i_{k+1} or i_{k−1} ↔ i_k ↔ i_{k+1}). The node i_k is called a non-collider on π otherwise, that is, if it is an endpoint node (k = 0 or k = n) or if the subwalk (i_{k−1}, ε_k, i_k, ε_{k+1}, i_{k+1}) is of the form i_{k−1} → i_k → i_{k+1}, i_{k−1} ← i_k ← i_{k+1}, i_{k−1} ← i_k → i_{k+1}, i_{k−1} ↔ i_k → i_{k+1} or i_{k−1} ← i_k ↔ i_{k+1}.

Note in particular that the endpoints of a walk are non-colliders on the walk.

DEFINITION A.4 (d-separation). Let G = (V, E, B) be a directed mixed graph and let C ⊆ V be a subset of nodes. A walk (path) π = (i_0, ε_1, i_1, ...
, i_n) in G is said to be C-d-blocked or d-blocked by C if

1. it contains a collider i_k ∉ an_G(C), or
2. it contains a non-collider i_k ∈ C.

The walk (path) π is said to be C-d-open if it is not d-blocked by C. For two subsets of nodes A, B ⊆ V, we say that A is d-separated from B given C in G if all paths between any node in A and any node in B are d-blocked by C, and write A ⊥^d_G B | C.

The next lemma is a straightforward generalization of Lemma 3.3 in [22] to the cyclic setting. It implies that it suffices to formulate d-separation in terms of paths rather than walks.

LEMMA A.5. Let G = (V, E, B) be a directed mixed graph, C ⊆ V and i, j ∈ V. There exists a C-d-open walk between i and j in G if and only if there exists a C-d-open path between i and j in G.

Fig 5: The graphs of the observationally equivalent SCMs M (left) and M̃ (right) of Examples A.8 and A.10.

DEFINITION A.6 (Directed global Markov property). Let G = (V, E, B) be a directed mixed graph and P_V a probability distribution on X_V = ∏_{i∈V} X_i, where each X_i is a standard probability space. The probability distribution P_V satisfies the directed global Markov property relative to G if for all subsets A, B, C ⊆ V we have

A ⊥^d_G B | C  ⟹  X_A ⊥⊥_{P_V} X_B | X_C,

that is, (X_i)_{i∈A} and (X_i)_{i∈B} are conditionally independent given (X_i)_{i∈C} under P_V, where we take the canonical projections X_i : X_V → X_i as random variables.

From the results in [18] it directly follows that for the observational distribution of an SCM, the directed global Markov property w.r.t. the graph of the SCM (also known as the d-separation criterion) holds under one of the following assumptions.

THEOREM A.7 (Directed global Markov property for SCMs [18]). Let M be a uniquely solvable SCM that satisfies at least one of the following three conditions:

1. M is acyclic;
2.
all endogenous spaces X_i are discrete and M is ancestrally uniquely solvable;
3. M is linear (see Definition C.1), each of its causal mechanisms {f_i}_{i∈I} has a nontrivial dependence on at least one exogenous variable, and P_E has a density w.r.t. the Lebesgue measure on R^J.

Then its observational distribution P_X exists, is unique and satisfies the directed global Markov property relative to G(M) (see Definition A.6).

The acyclic case is well known and was first shown in the context of linear-Gaussian structural equation models [32, 75]. The discrete case fixes the erroneous theorem by Pearl and Dechter [52], for which a counterexample was found by Neal [49], by adding the ancestral unique solvability condition, and extends it to allow for bidirected edges in the graph. The linear case is an extension of existing results for the linear-Gaussian setting without bidirected edges [31, 71, 72] to a linear (possibly non-Gaussian) setting with bidirected edges in the graph.

The following counterexample of an SCM for which the directed global Markov property does not hold was already given in [71, 72].

EXAMPLE A.8 (Directed global Markov property does not hold for cyclic SCM). Consider the SCM M = ⟨4, 4, R^4, R^4, f, P_{R^4}⟩ with causal mechanism given by

f_1(x, e) = e_1,
f_2(x, e) = e_2,
f_3(x, e) = x_1 x_4 + e_3,
f_4(x, e) = x_2 x_3 + e_4,

and P_{R^4} the standard-normal distribution on R^4. The graph of M is depicted in Figure 5 on the left. The model is uniquely solvable (it is even simple). One can check that for every solution X of M, X_1 is not independent of X_2 given {X_3, X_4}. However, the variables X_1 and X_2 are d-separated given {X_3, X_4} in G(M). Hence the directed global Markov property does not hold here.

In constraint-based approaches to causal discovery, one usually assumes the converse of the directed global Markov property to hold [51, 73].
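For given x_1, x_2 with x_1 x_2 ≠ 1, the cyclic pair of equations for x_3 and x_4 in Example A.8 is linear in (x_3, x_4) and can be solved in closed form. The following sketch (our own illustration, not code from the paper) samples standard-normal exogenous noise, applies these closed-form solution functions, and checks numerically that the result indeed satisfies both structural equations:

```python
import random

random.seed(0)

def solve(e1, e2, e3, e4):
    """Closed-form solution of Example A.8: substituting x4 = x2*x3 + e4
    into x3 = x1*x4 + e3 and solving gives the expressions below
    (valid whenever x1 * x2 != 1)."""
    x1, x2 = e1, e2
    x3 = (x1 * e4 + e3) / (1 - x1 * x2)
    x4 = (x2 * e3 + e4) / (1 - x1 * x2)
    return x1, x2, x3, x4

for _ in range(1000):
    e = [random.gauss(0, 1) for _ in range(4)]
    x1, x2, x3, x4 = solve(*e)
    # The solution satisfies both cyclic structural equations
    # (relative tolerance, since 1 - x1*x2 can be small).
    assert abs(x3 - (x1 * x4 + e[2])) <= 1e-9 * max(1.0, abs(x3))
    assert abs(x4 - (x2 * x3 + e[3])) <= 1e-9 * max(1.0, abs(x4))
```

These closed-form expressions are exactly the acyclified causal mechanisms f̃_3 and f̃_4 that appear in Example A.10 below.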
DEFINITION A.9 (d-Faithfulness). Let G = (V, E, B) be a directed mixed graph and P_V a probability distribution on X_V = ∏_{i∈V} X_i, where each X_i is a standard probability space. The probability distribution P_V is d-faithful to G if for all subsets A, B, C ⊆ V we have

A ⊥^d_G B | C  ⟸  X_A ⊥⊥_{P_V} X_B | X_C,

where we take the canonical projections X_i : X_V → X_i as random variables.

In other words, the d-faithfulness assumption states that the graph explains, via d-separation, all the conditional independencies that are present in the observational distribution. Meek [41] showed that for multinomial and linear-Gaussian DAG models (i.e., acyclic and causally sufficient SCMs), d-faithfulness holds for all parameter values up to a measure zero set (in a natural parameterization). To the best of our knowledge, no such results have been shown in more general parametric or nonparametric settings (neither in the acyclic case, nor in the cyclic one).

A.2.2. The general directed global Markov property

In [18] the general directed global Markov property is introduced, which is based on σ-separation, an extension of d-separation. This notion of σ-separation was derived from the notion of d-separation in the acyclification of the graph. The acyclification of a graph generalizes the idea of the collapsed graph for directed graphs, developed by Spirtes [71], to HEDGes. In particular, this notion can be applied to directed mixed graphs, and thus to the graphs of SCMs. The main idea of the acyclification is that under the condition that the SCM is uniquely solvable w.r.t. each strongly connected component, we can replace the causal mechanisms of these strongly connected components by their measurable solution functions, which results in an acyclic SCM. This acyclification preserves the solutions, and d-separation in the acyclification can directly be translated into σ-separation in the original graph.
This then leads to the general directed global Markov property. We will discuss this now in more detail.

EXAMPLE A.10 (Construction of an observationally equivalent acyclic SCM). Consider the SCM M of Example A.8, which is uniquely solvable w.r.t. all its strongly connected components, i.e., the subsets {1}, {2} and {3, 4}. Replacing the causal mechanisms of these strongly connected components by their measurable solution functions gives the SCM M̃ that is the same as M except that its causal mechanism f̃ is given by

f̃_1(x, e) := e_1,
f̃_2(x, e) := e_2,
f̃_3(x, e) := (x_1 e_4 + e_3) / (1 − x_1 x_2),
f̃_4(x, e) := (x_2 e_3 + e_4) / (1 − x_1 x_2).

By construction, M and M̃ are observationally equivalent. Because M̃ is acyclic (see Figure 5 on the right) we can apply the directed global Markov property to M̃. The fact that X_1 and X_2 are not d-separated given {X_3, X_4} in G(M̃) is in line with X_1 being dependent of X_2 given {X_3, X_4} for every solution X of M̃ (and hence of M).

One of the key insights in [18] is that this example can easily be generalized as follows.

Fig 6: The graphs of the original SCM M (left), of the acyclified SCM (center), and of the acyclification of the graph of M (right), corresponding to Example A.15.

DEFINITION A.11 (Acyclification of an SCM). Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM that is uniquely solvable w.r.t. each strongly connected component of G(M). For each i ∈ I, let g_i be the i-th component of a measurable solution function g_{sc(i)} : X_{pa(sc(i)) \ sc(i)} × E_{pa(sc(i))} → X_{sc(i)} of M w.r.t. sc(i), where pa and sc denote the parents and strongly connected components according to G^a(M), respectively.
We call the SCM M^acy := ⟨I, J, X, E, f̂, P_E⟩ with the acyclified causal mechanism f̂ : X × E → X given by

f̂_i(x, e) = g_i(x_{pa(sc(i)) \ sc(i)}, e_{pa(sc(i))}),  i ∈ I,

an acyclification of M. We denote by acy(M) the equivalence class of the acyclifications of M.

Note that acy(M) is well-defined: all acyclifications of an SCM M belong to the same equivalence class of SCMs.

PROPOSITION A.12. Let M be an SCM that is uniquely solvable w.r.t. each strongly connected component of G(M). Then an acyclification M^acy of M is acyclic and observationally equivalent to M.

We can also define a graphical acyclification for directed mixed graphs, which is a special case of the operation defined in [18] for HEDGes.

DEFINITION A.13 (Acyclification of a directed mixed graph). Let G = (V, E, B) be a directed mixed graph. The acyclification of G maps G to the acyclified graph G^acy := (V, Ê, B̂) with directed edges j → i ∈ Ê if and only if j ∈ pa_G(sc_G(i)) \ sc_G(i), and bidirected edges i ↔ j ∈ B̂ if and only if there exist i′ ∈ sc_G(i) and j′ ∈ sc_G(j) with i′ = j′ or i′ ↔ j′ ∈ B.

The following compatibility result is immediate from the definitions.

PROPOSITION A.14. Let M be an SCM that is uniquely solvable w.r.t. each strongly connected component of G(M). Then G^a(acy(M)) ⊆ acy(G^a(M)) and G(acy(M)) ⊆ acy(G(M)).

The following example illustrates that the graph of the acyclification of an SCM can be a strict subgraph of the acyclification of the graph of the SCM.

EXAMPLE A.15 (Graph of the acyclification of the SCM is a strict subgraph of the acyclification of its graph). Consider the SCM M = ⟨2, 1, R^2, R, f, P_R⟩ with the causal mechanism defined by

f_1(x, e) = x_2 − e,
f_2(x, e) = ½ x_1 + e,

and P_R the standard Gaussian measure on R. The SCM M is uniquely solvable w.r.t.
the (only) strongly connected component {1, 2}. An acyclification of M is the acyclified SCM M^acy with the acyclified causal mechanism f̂ defined by

f̂_1(x, e) = 0,
f̂_2(x, e) = e.

The graph G(acy(M)) is a strict subgraph of acy(G(M)), as can be seen in Figure 6.

Translating the notion of d-separation from the acyclified graph back to the original graph led to the notion of σ-separation.

DEFINITION A.16 (σ-separation [18]). Let G = (V, E, B) be a directed mixed graph and let C ⊆ V be a subset of nodes. A walk (path) π = (i_0, ε_1, i_1, ..., i_n) in G is said to be C-σ-blocked or σ-blocked by C if

1. its first node i_0 ∈ C or its last node i_n ∈ C, or
2. it contains a collider i_k ∉ an_G(C), or
3. it contains a non-endpoint non-collider i_k ∈ C that points towards a neighboring node on π that lies in a different strongly connected component of G, that is, such that i_{k−1} ← i_k in π and i_{k−1} ∉ sc_G(i_k), or i_k → i_{k+1} in π and i_{k+1} ∉ sc_G(i_k).

The walk (path) π is said to be C-σ-open if it is not σ-blocked by C. For two subsets of nodes A, B ⊆ V, we say that A is σ-separated from B given C in G if all paths between any node in A and any node in B are σ-blocked by C, and write A ⊥^σ_G B | C.

The only difference between σ-separation and d-separation is that d-separation does not have the extra condition on the non-collider that it has to point to a node in a different strongly connected component. It is therefore obvious that σ-separation reduces to d-separation for acyclic graphs, since sc_G(i) = {i} for each i ∈ V in that case.

Although for proofs it is often easier to make use of walks, it suffices to formulate σ-separation in terms of paths rather than walks because of the following result, which is analogous to a similar result for d-separation (see Lemma A.5).

LEMMA A.17.
Let G = (V, E, B) be a directed mixed graph, C ⊆ V and i, j ∈ V. There exists a C-σ-open walk between i and j in G if and only if there exists a C-σ-open path between i and j in G.

It is clear from the definitions that σ-separation implies d-separation. The other way around does not hold in general, as can be seen in the following example.

EXAMPLE A.18 (d-separation does not imply σ-separation). Consider the directed graph G as depicted in Figure 5 (left). Here X_1 is d-separated from X_2 given {X_3, X_4}, but X_1 is not σ-separated from X_2 given {X_3, X_4}.

The following result in [18] relates σ-separation to d-separation.

PROPOSITION A.19. Let G = (V, E, B) be a directed mixed graph. Then for A, B, C ⊆ V,

A ⊥^σ_G B | C  ⟺  A ⊥^d_{acy(G)} B | C.

By replacing "d-separation" by "σ-separation" in Definition A.6, one obtains the formulation of what Forré and Mooij [18] termed the general directed global Markov property.

DEFINITION A.20 (General directed global Markov property [18]). Let G = (V, E, B) be a directed mixed graph and P_V a probability distribution on X_V = ∏_{i∈V} X_i, where each X_i is a standard probability space. The probability distribution P_V satisfies the general directed global Markov property relative to G if for all subsets A, B, C ⊆ V we have

A ⊥^σ_G B | C  ⟹  X_A ⊥⊥_{P_V} X_B | X_C,

that is, (X_i)_{i∈A} and (X_i)_{i∈B} are conditionally independent given (X_i)_{i∈C} under P_V, where we take the canonical projections X_i : X_V → X_i as random variables.

The fact that σ-separation implies d-separation means that the directed global Markov property implies the general directed global Markov property. In other words, the general directed global Markov property is weaker than the directed global Markov property. It is actually strictly weaker, as we saw in Example A.18.
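The contrast of Example A.18 can be reproduced mechanically. The sketch below (our own illustration, not code from the paper) implements a path-based check of Definitions A.4 and A.16 for directed graphs without bidirected edges, and applies it to the graph of Example A.8 / Figure 5 (left): edges 1 → 3, 2 → 4, and the cycle 3 → 4 → 3.

```python
def reach(edges, start):
    """Nodes with a directed path from start (including start itself)."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in edges:
            if a == v and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def sc(edges, v):
    """Strongly connected component of v: ancestors intersect descendants."""
    return reach(edges, v) & reach([(b, a) for (a, b) in edges], v)

def all_paths(edges, i, j):
    """All paths between i and j as lists of steps (u, orient, v);
    orient '>' means u -> v in the graph, '<' means v -> u."""
    results = []
    def dfs(node, visited, steps):
        if node == j:
            results.append(list(steps))
            return
        moves = [(v, '>') for (u, v) in edges if u == node and v not in visited]
        moves += [(u, '<') for (u, v) in edges if v == node and u not in visited]
        for nxt, orient in moves:
            steps.append((node, orient, nxt))
            dfs(nxt, visited | {nxt}, steps)
            steps.pop()
    dfs(i, {i}, [])
    return results

def separated(edges, i, j, C, sigma):
    """True iff every path between i and j is C-blocked (d- or sigma-version)."""
    anC = set().union(*(reach([(b, a) for (a, b) in edges], c) for c in C))
    for steps in all_paths(edges, i, j):
        nodes = [steps[0][0]] + [s[2] for s in steps]
        blocked = nodes[0] in C or nodes[-1] in C  # endpoints are non-colliders
        for k in range(1, len(nodes) - 1):
            into_prev = steps[k - 1][1] == '>'  # left edge points into node k
            into_next = steps[k][1] == '<'      # right edge points into node k
            if into_prev and into_next:         # collider
                if nodes[k] not in anC:
                    blocked = True
            elif nodes[k] in C:                 # non-collider in C
                if not sigma:
                    blocked = True              # d-separation: always blocks
                else:
                    # sigma-separation (Definition A.16, condition 3): blocks
                    # only if it points towards a different strongly
                    # connected component
                    s = sc(edges, nodes[k])
                    if (not into_prev and nodes[k - 1] not in s) or \
                       (not into_next and nodes[k + 1] not in s):
                        blocked = True
        if not blocked:
            return False
    return True

edges = [(1, 3), (2, 4), (3, 4), (4, 3)]
print(separated(edges, 1, 2, {3, 4}, sigma=False))  # True: d-separated
print(separated(edges, 1, 2, {3, 4}, sigma=True))   # False: not sigma-separated
```

The path 1 → 3 ← 4 ← 2 is the σ-open one: the non-collider 4 ∈ C only points to 3, which lies in its own strongly connected component {3, 4}, so condition 3 does not block it.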
The following fundamental result, also known as the σ-separation criterion, follows directly from the theory in [18].

THEOREM A.21 (General directed global Markov property for SCMs). Let M be an SCM that is uniquely solvable w.r.t. each strongly connected component of G(M). Then its observational distribution P_X exists, is unique and satisfies the general directed global Markov property relative to G(M) (see footnote 15).

The proof is based on the reasoning that, for A, B, C ⊆ I, if A is σ-separated from B given C in G(M), then A is d-separated from B by C in acy(G(M)) and hence in G(acy(M)), and since acy(M) is acyclic and observationally equivalent to M, it follows from the directed global Markov property applied to acy(M) that X_A ⊥⊥_{P_X} X_B | X_C for every solution X of M.

Note that the ancestral unique solvability condition for the discrete case is strictly weaker than the condition of unique solvability w.r.t. each strongly connected component in Theorem A.21. For the linear case, the condition of unique solvability is equivalent to the condition of unique solvability w.r.t. each strongly connected component (see Proposition C.4).

The results in Theorems A.7 and A.21 are not preserved under perfect intervention, because intervening on a strongly connected component could split it into several strongly connected components with different solvability properties. As the class of simple SCMs is preserved under perfect intervention and the twin operation (Proposition 8.2), we obtain the following corollary.

COROLLARY A.22 (Global Markov properties for simple SCMs). Let M be a simple SCM. Then the:

1. observational distribution,
2. interventional distribution after perfect intervention on I ⊂ I,
3.
counterfactual distribution after perfect intervention on Ĩ ⊆ I ∪ I′,

all exist, are unique and satisfy the general directed global Markov property relative to G(M), do(I)(G(M)) and do(Ĩ)(twin(G(M))), respectively. Moreover, if M satisfies at least one of the three conditions (1), (2), (3) of Theorem A.7, then they also satisfy the directed global Markov property relative to G(M), do(I)(G(M)) and do(Ĩ)(twin(G(M))), respectively.

Similar to d-faithfulness, σ-faithfulness (see footnote 16) is defined as follows.

Footnote 15: Since [18] also provides results under the weaker condition that an SCM is solvable (not necessarily uniquely) w.r.t. each strongly connected component of G(M), one might believe that Theorem A.21 could be generalized to stating that in that case, any of its observational distributions satisfies the general directed global Markov property. However, that is not true: consider for example the SCM M = ⟨2, ∅, R^2, 1, f, P_1⟩ with f_1(x) = x_1 and f_2(x) = x_2. Then M is solvable w.r.t. each of its strongly connected components {1} and {2}. The solution with X_1 = X_2 shows a dependence between X_1 and X_2, and thus X_1 ⊥⊥ X_2 does not hold. In general, all strongly connected components that admit multiple solutions may be dependent on any other variable(s) in the model.

Footnote 16: In [63] it is called "collapsed graph faithfulness".

DEFINITION A.23 (σ-Faithfulness). Let G = (V, E, B) be a directed mixed graph and P_V a probability distribution on X_V = ∏_{i∈V} X_i, where each X_i is a standard probability space. The probability distribution P_V is σ-faithful to G if for all subsets A, B, C ⊆ V we have

A ⊥^σ_G B | C  ⟸  X_A ⊥⊥_{P_V} X_B | X_C,

where we take the canonical projections X_i : X_V → X_i as random variables.
In other words, the graph explains, via σ-separation, all the conditional independencies that are present in the observational distribution. Although it has been conjectured [72] that under certain conditions σ-faithfulness should hold, formulating and proving such completeness results is an open problem to the best of our knowledge.

A.3. Modular SCMs

In this subsection, we relate the class of (simple) SCMs to that of modular SCMs. Modular SCMs, introduced by Forré and Mooij [18], are causal graphical models on which marginalizations and interventions are defined and which satisfy the general directed global Markov property. For a comprehensive account of modular SCMs we refer the reader to [18].

A.3.1. Definition of a modular SCM

In contrast to an SCM, from which a graph can be derived, a modular SCM is defined in terms of a graphical object, which Forré and Mooij [18] call a directed graph with hyperedges (HEDG). The hyperedges of a HEDG are described in terms of a simplicial complex.

DEFINITION A.24 (Simplicial complex). Let V be a finite set. A simplicial complex H over V is a set of subsets of V such that

1. all single-element sets {v} are in H for v ∈ V, and
2. if F ∈ H, then also all subsets F̃ ⊆ F are elements of H.

DEFINITION A.25 (Directed graph with hyperedges (HEDGes) [18]). A directed graph with hyperedges (HEDG) is a triple G = (V, E, H), where (V, E) is a directed graph and H a simplicial complex over the set of nodes V. The elements F of H are called hyperedges of G. The inclusion-maximal elements F of H are called maximal hyperedges and are denoted by Ĥ.

A HEDG G = (V, E, H) can be represented as a directed graph Ḡ := (V, E) consisting of nodes V and directed edges E, with additional maximal hyperedges F ∈ Ĥ with |F| ≥ 2 (i.e., not corresponding to single-element sets {v} ∈ Ĥ), that point to their target nodes v ∈ F.
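As a concrete illustration (our own sketch, with a made-up example, not code from the paper), the two conditions of Definition A.24 and the maximal hyperedges of Definition A.25 can be checked mechanically; below, H is the downward closure (restricted to nonempty subsets) of the maximal hyperedges {1, 2} and {3}:

```python
from itertools import combinations

def subsets(s):
    """All nonempty subsets of s, as frozensets."""
    s = list(s)
    return [frozenset(c) for r in range(1, len(s) + 1)
            for c in combinations(s, r)]

def is_simplicial_complex(V, H):
    """Check Definition A.24: all singletons present, closed under
    (nonempty) subsets."""
    singletons_ok = all(frozenset({v}) in H for v in V)
    closed = all(sub in H for F in H for sub in subsets(F))
    return singletons_ok and closed

def maximal_hyperedges(H):
    """The inclusion-maximal elements of H (Definition A.25)."""
    return {F for F in H if not any(F < G for G in H)}

V = {1, 2, 3}
H = {frozenset({1}), frozenset({2}), frozenset({3}), frozenset({1, 2})}
print(is_simplicial_complex(V, H))                       # True
print(sorted(sorted(F) for F in maximal_hyperedges(H)))  # [[1, 2], [3]]
```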
For a HEDG G, we define pa_G, ch_G, etc., in terms of the underlying directed graph Ḡ, that is, as pa_Ḡ, ch_Ḡ, etc., respectively. A loop in a HEDG G = (V, E, H) is a subset O ⊆ V that is a loop in the underlying directed graph Ḡ = (V, E). In other words, a loop of G is a set of nodes O ⊆ V such that for every two nodes v, w ∈ O there are directed paths v → ··· → w and w → ··· → v in G for which all the intermediate nodes (if any exist) lie in O. In particular, a loop may consist of a single element {v} for v ∈ V. The set of loops in G is denoted by L(G).

In order to define a modular SCM one needs the notion of a compatible system of solution functions, which assigns to each loop a separate solution function such that all these solution functions are "compatible" with each other.

DEFINITION A.26 (Compatible system of solution functions; see footnote 17). Let G = (V, E, H) be a HEDG. For every v ∈ V and maximal hyperedge F in Ĥ, let X_v and E_F be standard measurable spaces. For a subset O ⊆ V we define (see footnote 18)

X_O := ∏_{v∈O} X_v  and  Ê_O := ∏_{F∈Ĥ : F∩O≠∅} E_F.

Consider a family of measurable mappings (g_O)_{O∈L(G)} indexed by L(G) which are of the form

g_O : X_{pa_G(O)\O} × Ê_O → X_O.

We call the family of measurable mappings (g_O)_{O∈L(G)} a compatible system of solution functions if for all O, Õ ∈ L(G) with Õ ⊆ O, and for all ê_O ∈ Ê_O and x_{pa_G(O)∪O} ∈ X_{pa_G(O)∪O}, we have

x_O = g_O(x_{pa_G(O)\O}, ê_O)  ⟹  x_Õ = g_Õ(x_{pa_G(Õ)\Õ}, ê_Õ).

This structure of a compatible system of solution functions is at the heart of the definition of a modular SCM.

DEFINITION A.27 (Modular structural causal model (mSCM) [18]). A modular structural causal model (mSCM) is a tuple M̂ := ⟨G, X, E, (g_O)_{O∈L(G)}, P_E⟩, where

1. G = (V, E, H) is a HEDG,
2. X = ∏_{v∈V} X_v is the product of standard measurable spaces X_v,
3.
E = ∏_{F ∈ Ĥ} E_F is the product of standard measurable spaces E_F,
4. (g_O)_{O ∈ L(G)} is a compatible system of solution functions,
5. P_E = ∏_{F ∈ Ĥ} P_{E_F} is a product measure, where P_{E_F} is a probability measure on E_F for each F ∈ Ĥ.

Let M̂ = ⟨G, X, E, (g_O)_{O ∈ L(G)}, P_E⟩ be a modular SCM and let O_1, ..., O_r ∈ L(G) be the strongly connected components of G, ordered according to a topological order of the DAG of strongly connected components of G. Then for any random variable E : Ω → E such that P^E = P_E one can inductively define the random variables X_v := (g_{O_i})_v(X_{pa_G(O_i) \ O_i}, Ê_{O_i}) for all v ∈ O_i and all i ≥ 1, starting with X_v := (g_{O_1})_v(Ê_{O_1}) for all v ∈ O_1. Because (g_O)_{O ∈ L(G)} is a compatible system of solution functions, we have for every O ∈ L(G)

X_O = g_O(X_{pa_G(O) \ O}, Ê_O).

We call the random variable X a solution of the modular SCM M̂. Note that the solution X depends on the choice of the random variable E : Ω → E.

The causal semantics of modular SCMs can be defined in terms of perfect interventions, as follows.

^17 We deviate from the terminology in [18], where this is called a "compatible system of structural equations".
^18 We use the "hat" notation Ê_O to distinguish it from the ordinary subscript convention that E_O = ∏_{F ∈ O} E_F for a subset O ⊆ Ĥ.

DEFINITION A.28 (Perfect intervention on an mSCM). Consider a modular SCM M̂ = ⟨G, X, E, (g_O)_{O ∈ L(G)}, P_E⟩, a subset I ⊆ V of endogenous variables and a value ξ_I ∈ X_I. The perfect intervention do(I, ξ_I) maps M̂ to the modular SCM M̂_{do(I, ξ_I)} := ⟨G^do, X, E^do, (g^do_O)_{O ∈ L(G^do)}, P_{E^do}⟩, where
1. G^do = (V, E^do, H^do), where

E^do = E \ {v → w : v ∈ V, w ∈ I},
H^do = {F \ I : F ∈ H} ∪ {{v} : v ∈ I},

2.
φ : {F ∈ Ĥ : F \ I ≠ ∅} → Ĥ^do \ {{v} : v ∈ I} is a mapping such that φ(F) ⊇ F \ I for all F ∈ Ĥ with F \ I ≠ ∅,
3. E^do = ∏_{F̃ ∈ Ĥ^do} E^do_F̃, where

E^do_F̃ = X_v  if F̃ = {v} for v ∈ I,  and  E^do_F̃ = ∏_{F ∈ φ^{-1}(F̃)} E_F  if F̃ ∈ Ĥ^do \ {{v} : v ∈ I},

4. for every O ∈ L(G^do)

g^do_O = I_{{v}} (the identity on X_v = E^do_{{v}})  if O = {v} for some v ∈ I,  and  g^do_O = g_O  otherwise

(note that if O is a loop in G^do, then it is also a loop in G),
5. P_{E^do} = ∏_{F̃ ∈ Ĥ^do} P_{E^do_F̃}, where

P_{E^do_F̃} = δ_{ξ_v}  if F̃ = {v} for v ∈ I,  and  P_{E^do_F̃} = ∏_{F ∈ φ^{-1}(F̃)} P_{E_F}  if F̃ ∈ Ĥ^do \ {{v} : v ∈ I}.

In contrast to SCMs, these perfect interventions on modular SCMs are defined directly on the underlying HEDG and depend on the choice of the mapping φ.

A.3.2. Relation between SCMs and modular SCMs. The solutions of a modular SCM can be described by an SCM that is loop-wisely solvable.

DEFINITION A.29 (Underlying SCM). Let M̂ = ⟨G, X, E, (g_O)_{O ∈ L(G)}, P_E⟩ be a modular SCM. Then the mapping ι maps M̂ to the underlying SCM M̃ := ⟨Ĩ, J̃, X̃, Ẽ, f̃, P_Ẽ⟩, where
1. Ĩ = V,
2. J̃ = Ĥ,
3. X̃ = X,
4. Ẽ = E,
5. f̃ is given by f̃_v = (g_{{v}})_v for all v ∈ V,
6. P_Ẽ = P_E.

Every solution X of a modular SCM M̂ is also a solution of the underlying SCM ι(M̂). Observe that for the modular SCM M̂ the induced subgraph G^a(ι(M̂))_Ĩ of the augmented graph G^a(ι(M̂)) of the underlying SCM on Ĩ is a subgraph of the underlying HEDG G, that is, G^a(ι(M̂))_Ĩ ⊆ G. This implies that, in general, the underlying HEDG G of M̂ may have more loops than G(ι(M̂)). For a subset O ⊆ Ĩ, the exogenous parents in the underlying SCM ι(M̂) satisfy

pa(O) ∩ J̃ ⊆ {F ∈ J̃ : F ∩ O ≠ ∅},

where pa(O) denotes the set of parents of O in G^a(ι(M̂)).
Hence, in general, not all hyperedges F ∈ H with |F| = 2 (i.e., bidirected edges) occur in the set of bidirected edges B of the graph G(ι(M̂)) = (V, E, B) of the underlying SCM. We conclude that the graph of the underlying SCM is, in general, sparser than the HEDG of the modular SCM.

Next, we show that the compatible system of solution functions of a modular SCM induces a compatible system of solution functions on the underlying SCM. For this we need the notion of loop-wise solvability for SCMs.

DEFINITION A.30 (Loop-wise (unique) solvability for SCMs). We call an SCM M
1. loop-wisely solvable if M is solvable w.r.t. every loop O ∈ L(G(M)), and
2. loop-wisely uniquely solvable if M is uniquely solvable w.r.t. every loop O ∈ L(G(M)).

DEFINITION A.31 (Compatible system of solution functions for SCMs). For a loop-wisely solvable SCM M, we call a family of measurable solution functions (g_O)_{O ∈ L(G(M))}, where g_O is a measurable solution function of M w.r.t. O, a compatible system of solution functions if for all O, Õ ∈ L(G(M)) with Õ ⊆ O, for P_E-almost every e ∈ E and for all x ∈ X we have

x_O = g_O(x_{pa(O) \ O}, e_{pa(O)})  ⟹  x_Õ = g_Õ(x_{pa(Õ) \ Õ}, e_{pa(Õ)}).

The underlying SCM of a modular SCM always has a compatible system of solution functions, by construction.

PROPOSITION A.32. Let M̂ = ⟨G, X, E, (g_O)_{O ∈ L(G)}, P_E⟩ be a modular SCM. Then the underlying SCM M̃ := ι(M̂) is loop-wisely solvable. Moreover, it has a compatible system of solution functions (g_O)_{O ∈ L(G(M̃))}, where g_O is a measurable solution function of M̃ w.r.t. O.

This shows that a modular SCM can be seen as an SCM together with the additional structure of a compatible system of solution functions, and that it is, in particular, loop-wisely solvable.
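To make Definition A.31 concrete, the following sketch (our own toy example, not from [18]) writes down solution functions for all three loops of a linear two-variable cycle x₁ = a x₂ + e₁, x₂ = b x₁ + e₂ (with ab ≠ 1) and verifies the compatibility implication numerically: the joint solution of the loop {1, 2}, restricted to a subloop, solves that subloop's equation with the other variable treated as a parent.

```python
import numpy as np

a, b = 0.5, 0.4          # cycle coefficients; a*b != 1 gives unique solvability
assert a * b != 1

def g_1(x2, e1):          # solution function for the loop {1}: x1 = a*x2 + e1
    return a * x2 + e1

def g_2(x1, e2):          # solution function for the loop {2}: x2 = b*x1 + e2
    return b * x1 + e2

def g_12(e1, e2):         # solution function for the loop {1,2}: solves both at once
    x1 = (e1 + a * e2) / (1 - a * b)
    x2 = (b * e1 + e2) / (1 - a * b)
    return x1, x2

# Compatibility check of Definition A.31 on random exogenous inputs.
rng = np.random.default_rng(0)
for e1, e2 in rng.standard_normal((100, 2)):
    x1, x2 = g_12(e1, e2)
    assert np.isclose(x1, g_1(x2, e1))   # {1} is consistent with {1,2}
    assert np.isclose(x2, g_2(x1, e2))   # {2} is consistent with {1,2}
print("compatible system of solution functions")
```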
Moreover, the class of simple SCMs corresponds exactly to the class of SCMs that are loop-wisely uniquely solvable.

LEMMA A.33. An SCM M is simple if and only if it is loop-wisely uniquely solvable.

In particular, for simple SCMs, i.e., loop-wisely uniquely solvable SCMs, a compatible system of solution functions always exists.

PROPOSITION A.34. Let M = ⟨I, J, X, E, f, P_E⟩ be a simple SCM. Then every family of measurable solution functions (g_O)_{O ∈ L(G(M))}, where g_O is a measurable solution function of M w.r.t. O, is a compatible system of solution functions.

Fig 7: Overview of causal graphical models (causal BNs, acyclic SCMs, simple SCMs, modular SCMs, SCMs, CCMs). The "gray" and "dark gray" areas contain all the causal graphical models that can be modeled by an SCM and by an acyclic SCM, respectively.

A.4. Overview of causal graphical models. Figure 7 gives an overview of the causal graphical models related to SCMs. The "gray" area contains all the causal graphical models that can be modeled by an SCM, by which we mean that there exists an SCM that can describe all its observational and interventional distributions. The "dark gray" area contains all the causal graphical models that can be modeled by an acyclic SCM. Acyclic SCMs generalize causal Bayesian networks (causal BNs) [51] to allow for latent confounders and to derive counterfactuals. Simple SCMs form a subclass of SCMs that extends acyclic SCMs to the cyclic setting, while preserving many of their convenient properties. Modular SCMs [18] can be seen as SCMs equipped with the additional structure of a compatible system of solution functions and contain, in particular, the class of simple SCMs. Forré and Mooij [18] showed that modular SCMs satisfy various convenient properties, like marginalization and the general directed global Markov property.
We show that for SCMs in general various of those properties still hold under certain solvability conditions. A generalization of SCMs, known as causal constraints models (CCMs), has been proposed [3] in order to completely model the causal semantics of the equilibrium solutions of a dynamical system given its initial conditions. The class of CCMs is rich enough to model the causal semantics of SCMs, but it does not come with a single graphical representation that provides both a Markov property and a causal interpretation [4].

APPENDIX B: (UNIQUE) SOLVABILITY PROPERTIES

In this appendix, we provide additional (unique) solvability properties for SCMs. In Appendix B.1 we provide a sufficient condition for solvability w.r.t. (strict) subsets. In Appendix B.2 we discuss how (unique) solvability behaves under strict super- and subsets. In Appendix B.3 we discuss how (unique) solvability behaves under unions and intersections. The proofs of the theoretical results in this appendix are given in Appendix E.

B.1. Sufficient condition for solvability w.r.t. subsets. For solvability w.r.t. a (strict) subset of I there exists a sufficient condition that is similar to the sufficient (and necessary) condition (2) of Theorem 3.2, in the sense that it is formulated in terms of the solutions of (a subset of) the structural equations, but no measurability is required.

PROPOSITION B.1 (Sufficient condition for solvability w.r.t. a subset). Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM and O ⊆ I a subset. If for P_E-almost every e ∈ E and for all x_{\O} ∈ X_{\O} the topological space

S(e, x_{\O}) := {x_O ∈ X_O : x_O = f_O(x, e)},

with the subspace topology induced by X_O, is nonempty and σ-compact,^19 then M is solvable w.r.t. O.

For many purposes, this condition of σ-compactness suffices, since σ-compact spaces include, for example, all countable discrete spaces, every interval of the real line, and moreover all Euclidean spaces.
In particular, it suffices for proving a sufficient and necessary condition for unique solvability w.r.t. a subset, formulated in terms of the solutions of a subset of the structural equations (see Theorem 3.6). For larger solution spaces, we refer the reader to [30]. For the class of linear SCMs (see Definition C.1), we provide in Proposition C.2 a sufficient and necessary condition for solvability w.r.t. a (strict) subset of I.

B.2. (Unique) solvability w.r.t. strict super- and subsets. In general, (unique) solvability w.r.t. O ⊆ I implies neither (unique) solvability w.r.t. a strict superset O ⊊ V ⊆ I nor w.r.t. a strict subset W ⊊ O, as can be seen in the following example.

EXAMPLE B.2 (Solvability is not preserved under strict sub- or supersets). Consider the SCM M = ⟨3, ∅, R³, 1, f, P_1⟩ where the causal mechanism is given by

f_1(x) = x_1 · (1 − 1_{{1}}(x_2)) + 1,
f_2(x) = x_2,
f_3(x) = x_3 · (1 − 1_{{−1}}(x_2)) + 1.

This SCM is (uniquely) solvable w.r.t. the subsets {1, 2} and {2, 3}; however, it is not (uniquely) solvable w.r.t. the subsets {1}, {3} and {1, 2, 3}, and not uniquely solvable w.r.t. {2}.

However, in Proposition 3.10 we show that solvability w.r.t. O implies solvability w.r.t. every ancestral subset in G(M)_O.

B.3. (Unique) solvability w.r.t. unions and intersections. In general, (unique) solvability is not preserved under unions and intersections. The following example illustrates that (unique) solvability is in general not preserved under intersections.

EXAMPLE B.3 (Solvability is not preserved under intersections). Consider the SCM M = ⟨3, ∅, R³, 1, f, P_1⟩ where the causal mechanism is given by

f_1(x) = 0,
f_2(x) = x_2 · (1 − 1_{{0}}(x_1 · x_3)) + 1,
f_3(x) = 0.

Then M is (uniquely) solvable w.r.t. {1, 2} and {2, 3}; however, it is not (uniquely) solvable w.r.t. their intersection {2}.
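Example B.3 can be checked mechanically. The sketch below (our own illustration; the grid and helper function are assumptions for the sake of the demo) brute-forces the solution sets of the relevant subsystems on a small grid: solvability w.r.t. {1, 2} forces x₁ = 0 and hence x₂ = 1, while w.r.t. {2} alone the equation x₂ = x₂ + 1 has no solution whenever x₁·x₃ ≠ 0.

```python
import itertools

def f(x):
    """Causal mechanism of Example B.3 (0-based indices for x1, x2, x3)."""
    x1, x2, x3 = x
    return (0.0,
            x2 * (1 - (1.0 if x1 * x3 == 0 else 0.0)) + 1,
            0.0)

def solutions_wrt(O, x_rest, grid):
    """Brute-force the solution set of the subsystem x_O = f_O(x) on a grid,
    with the remaining coordinates clamped to x_rest."""
    sols = []
    for cand in itertools.product(grid, repeat=len(O)):
        x = list(x_rest)
        for i, v in zip(O, cand):
            x[i] = v
        if all(abs(x[i] - f(tuple(x))[i]) < 1e-9 for i in O):
            sols.append(cand)
    return sols

grid = [-1.0, 0.0, 1.0, 2.0]
# w.r.t. {1,2}: x1 = 0 forces x1*x3 = 0, hence x2 = 1 -> unique solution
s12 = solutions_wrt([0, 1], [9.9, 9.9, 5.0], grid)
print(s12)                     # [(0.0, 1.0)]
# w.r.t. {2} with x1 = x3 = 1: x2 = x2 + 1 has no solution
s2 = solutions_wrt([1], [1.0, 9.9, 1.0], grid)
print(s2)                      # []
```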
Example B.2 also provides an example in which (unique) solvability is not preserved under unions. Even if we take the union of disjoint subsets, (unique) solvability is not preserved (see Example 2.4). Although, in general, unique solvability is not preserved under unions, we show next that unique solvability is preserved under the union of ancestral subsets, under the following assumptions.

PROPOSITION B.4 (Combining measurable solution functions on different sets). Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM, O ⊆ I a subset and A, Ã ⊆ O two ancestral subsets in G(M)_O. If M is uniquely solvable w.r.t. A, Ã and A ∩ Ã, then M is uniquely solvable w.r.t. A ∪ Ã.

^19 A topological space X is called σ-compact if it is the union of countably many compact topological spaces.

A consequence of this property is that, in order to check whether an SCM is ancestrally uniquely solvable w.r.t. O, it suffices to check that it is uniquely solvable w.r.t. the ancestral subset of each node in O.

COROLLARY B.5. Let M = ⟨I, J, X, E, f, P_E⟩ be an SCM and O ⊆ I a subset. Then M is ancestrally uniquely solvable w.r.t. O if and only if M is uniquely solvable w.r.t. an_{G(M)_O}(i) for every i ∈ O.

APPENDIX C: LINEAR SCMS

In this appendix, we provide some results about (unique) solvability and marginalization for linear SCMs. Linear SCMs form a special class of SCMs that has received much attention in the literature [see, e.g., 5, 27]. The proofs of the theoretical results in this appendix are given in Appendix E.

DEFINITION C.1 (Linear SCM). We call an SCM M = ⟨I, J, R^I, R^J, f, P_{R^J}⟩ linear if each component of the causal mechanism is a linear combination of the endogenous and exogenous variables, that is,

f_i(x, e) = ∑_{j ∈ I} B_ij x_j + ∑_{k ∈ J} Γ_ik e_k,

where i ∈ I, B ∈ R^{I×I} and Γ ∈ R^{I×J} are matrices, and P_{R^J} is a product probability measure^20 on R^J.
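A linear SCM per Definition C.1 is fully specified by the pair (B, Γ) and the exogenous product measure. The minimal sketch below (our own, with illustrative coefficients) samples from the observational distribution of a cyclic linear SCM with I − B invertible, in which case the unique solution is x = (I − B)⁻¹ Γ e.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cyclic linear SCM: x1 = 0.5*x2 + e1, x2 = 0.3*x1 + e2.
B = np.array([[0.0, 0.5],
              [0.3, 0.0]])
Gamma = np.eye(2)
A = np.eye(2) - B                      # structural equations read A x = Gamma e

e = rng.standard_normal((10_000, 2))   # i.i.d. samples from the product measure
x = e @ (np.linalg.inv(A) @ Gamma).T   # unique solution x = (I - B)^{-1} Gamma e

# Every sample satisfies the structural equations x = B x + Gamma e exactly.
assert np.allclose(x, x @ B.T + e @ Gamma.T)
```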
For a subset O ⊆ I we also use the shorthand vector notation f_O(x, e) = B_{OI} x + Γ_{OJ} e. A nonzero coefficient B_ij for i, j ∈ I with i ≠ j corresponds to a directed edge j → i in the (augmented) graph, and a coefficient B_ii = 1 for i ∈ I corresponds to a self-cycle i → i in the (augmented) graph of the SCM. A nonzero coefficient Γ_ij for i ∈ I, j ∈ J with P_{E_j} a nondegenerate probability distribution on R corresponds to a directed edge j → i in the augmented graph. A nonzero entry (ΓΓ^T)_ij for i, j ∈ I with i ≠ j, such that there exists a k ∈ J with Γ_ik, Γ_jk ≠ 0 and P_{E_k} a nondegenerate probability distribution on R, corresponds to a bidirected edge i ↔ j in the graph of the SCM.

For linear SCMs, the solvability condition w.r.t. a subset (Definition 3.1) translates into a matrix condition. In order to state this condition we need the pseudoinverse (or Moore-Penrose inverse) A^+ of a real matrix A [24, 54]. The pseudoinverse of the matrix A is defined by A^+ := V Σ^+ U^*, where A = U Σ V^* is the singular value decomposition of A and Σ^+ is obtained by replacing each nonzero entry on the diagonal of Σ by its reciprocal [24]. One of its useful properties is that A A^+ A = A.

^20 Note that we do not assume that the probability measure P_{R^J} is Gaussian.

PROPOSITION C.2 (Sufficient and necessary condition for solvability w.r.t. a subset for linear SCMs). Let M be a linear SCM, L ⊆ I and O = I \ L. Then M is solvable w.r.t. L if and only if for the matrix A_{LL} = I_L − B_{LL}, for P_E-almost every e ∈ E and for all x_O ∈ X_O, the identity

A_{LL} A^+_{LL} (B_{LO} x_O + Γ_{LJ} e) = B_{LO} x_O + Γ_{LJ} e

is satisfied, where A^+_{LL} is the pseudoinverse of A_{LL}. Moreover, if M is solvable w.r.t.
L, then for every vector v ∈ R^L the mapping g^v_L : R^O × R^J → R^L given by

g^v_L(x_O, e) = A^+_{LL} (B_{LO} x_O + Γ_{LJ} e) + [I_L − A^+_{LL} A_{LL}] v

is a measurable solution function for M w.r.t. L.

For linear SCMs, the unique solvability condition w.r.t. a subset translates into a matrix invertibility condition, as was already shown in [27].

PROPOSITION C.3 (Sufficient and necessary condition for unique solvability w.r.t. a subset for linear SCMs). Let M be a linear SCM, L ⊆ I and O = I \ L. Then M is uniquely solvable w.r.t. L if and only if the matrix A_{LL} = I_L − B_{LL} is invertible. Moreover, if M is uniquely solvable w.r.t. L, then the mapping g_L : R^O × R^J → R^L given by

g_L(x_O, e) = A^{-1}_{LL} (B_{LO} x_O + Γ_{LJ} e)

is a measurable solution function for M w.r.t. L.

Note that if A_{LL} is invertible, then A^+_{LL} = A^{-1}_{LL} (see Lemma 1.3 in [54]); in that case the matrix condition of Proposition C.2 is always satisfied and all the measurable solution functions g^v_L of Proposition C.2 are (up to a P_E-null set) equal to the solution function g_L of Proposition C.3.

REMARK. A sufficient condition for A_{LL} to be invertible is that the spectral radius of B_{LL} is less than one. If that is the case, then A^{-1}_{LL} = ∑_{n=0}^∞ (B_{LL})^n. Note that the nonzero off-diagonal entries of the matrix B_{LL} represent the directed edges in the induced subgraph G(M)_L. In particular, if the diagonal entries of B_{LL} are zero, then for n ∈ N the coefficients of the matrix (B_{LL})^n in the sum represent the sums of the products of the edge weights B_ij over directed paths of length n in the induced subgraph G(M)_L.

From Proposition 3.10 we know that an SCM is solvable w.r.t. L if and only if it is ancestrally solvable w.r.t. L. In particular, this result also holds for linear SCMs.
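The matrix condition of Proposition C.2 and the family of solution functions g^v_L can be checked numerically. In the sketch below (our own toy instance; the matrices are illustrative assumptions), A_LL is singular, yet B_LO x_O + Γ_LJ e always lies in the column space of A_LL, so the model is solvable w.r.t. L with a whole family of solutions parametrized by v.

```python
import numpy as np

rng = np.random.default_rng(0)

# L = {1,2} with B_LL = [[0,1],[1,0]], so A_LL = I - B_LL is singular.
B_LL = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
A = np.eye(2) - B_LL
A_pinv = np.linalg.pinv(A)            # Moore-Penrose pseudoinverse

Gamma_L = np.array([[1.0], [-1.0]])   # chosen so Gamma_L @ e lies in col(A_LL)

for _ in range(100):
    e = rng.standard_normal(1)
    b = Gamma_L @ e                   # here B_LO = 0, so b = B_LO x_O + Gamma_LJ e
    # Proposition C.2: solvable w.r.t. L iff A A^+ b = b.
    assert np.allclose(A @ A_pinv @ b, b)
    # Solution function g^v_L of Proposition C.2, for an arbitrary v:
    v = rng.standard_normal(2)
    x_L = A_pinv @ b + (np.eye(2) - A_pinv @ A) @ v
    # x_L solves the structural equations x_L = B_LL x_L + Gamma_LJ e.
    assert np.allclose(x_L, B_LL @ x_L + b)
```

Since A A⁺ A = A, the check succeeds for every v, illustrating the nonuniqueness of solutions when A_LL is singular.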
We saw in Example 3.11 that a similar result for unique solvability does not hold; that is, in general, unique solvability w.r.t. L does not imply ancestral unique solvability w.r.t. L. For the class of linear SCMs we do have the following positive result.

PROPOSITION C.4 (Equivalent unique solvability conditions for linear SCMs). For a linear SCM M and a subset L ⊆ I the following are equivalent:
1. M is uniquely solvable w.r.t. L;
2. M is ancestrally uniquely solvable w.r.t. L;
3. M is uniquely solvable w.r.t. each strongly connected component in G(M)_L.

Under the condition of unique solvability w.r.t. a subset L we can define the marginalization w.r.t. L of a linear SCM by mere substitution.

PROPOSITION C.5 (Marginalization of a linear SCM). Let M be a linear SCM and L ⊆ I a subset of endogenous variables such that I_L − B_{LL} is invertible. Then there exists a marginalization M_{marg(L)} that is linear, with marginal causal mechanism f̃ : R^O × R^J → R^O given by

f̃(x_O, e) = [B_{OO} + B_{OL} A^{-1}_{LL} B_{LO}] x_O + [B_{OL} A^{-1}_{LL} Γ_{LJ} + Γ_{OJ}] e,

where A_{LL} = I_L − B_{LL}. Moreover, this marginalization respects the latent projection, that is, G^a ∘ marg(L)(M) ⊆ marg(L) ∘ G^a(M).

Fig 8: Damped coupled harmonic oscillator (top) and the graph of the SCM M that describes the positions of the masses at equilibrium (bottom) of Example D.1 for d = 5.

From Theorem 5.6 we know that M and its marginalization M_{marg(L)} over L are observationally, interventionally and counterfactually equivalent w.r.t. O. A similar result can also be found in [27]. In contrast to nonlinear SCMs, the class of linear SCMs has the convenient property that every marginalization of a model in this class respects the latent projection.
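The substitution formula of Proposition C.5 is easy to verify numerically. The sketch below (our own coefficients, chosen for illustration) marginalizes out L = {3} from a three-variable linear SCM and checks that the solution of the full model, restricted to O = {1, 2}, solves the marginal model for the same exogenous input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Full linear SCM on I = {1,2,3} (0-based indices below), with Gamma = I.
B = np.array([[0.0, 0.5, 0.4],
              [0.3, 0.0, 0.0],
              [0.0, 0.2, 0.0]])
Gamma = np.eye(3)
O, L = [0, 1], [2]

A_LL = np.eye(len(L)) - B[np.ix_(L, L)]   # here A_LL = [[1.0]], invertible
A_LL_inv = np.linalg.inv(A_LL)

# Marginal causal mechanism of Proposition C.5:
B_marg = B[np.ix_(O, O)] + B[np.ix_(O, L)] @ A_LL_inv @ B[np.ix_(L, O)]
G_marg = B[np.ix_(O, L)] @ A_LL_inv @ Gamma[L] + Gamma[O]

e = rng.standard_normal(3)
x_full = np.linalg.solve(np.eye(3) - B, Gamma @ e)        # unique solution of M
x_marg = np.linalg.solve(np.eye(2) - B_marg, G_marg @ e)  # solution of M_marg(L)

# The marginalization is exact: the solutions agree on O.
assert np.allclose(x_full[O], x_marg)
```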
Moreover, the subclass of simple linear SCMs is even closed under marginalization.

APPENDIX D: EXAMPLES

In this appendix, we provide additional examples. In Appendix D.1 we provide some examples of SCMs that describe the equilibrium states of certain feedback systems governed by (random) differential equations [6], which motivated our study of cyclic SCMs. In Appendix D.2 we provide additional examples that support the main text.

D.1. SCMs as equilibrium models. In many real-world systems, feedback loops between observed variables are present. For example, in economics, the price of a product may be a function of the demanded or supplied quantities, and vice versa; or, in physics, two masses that are connected by a spring may exert forces on each other. Such systems are often described by a system of (random) differential equations. In [6] it was shown that SCMs are capable of modeling the causal semantics of the equilibrium states of such systems. For illustration purposes we provide the following toy example of interacting masses attached to springs.

EXAMPLE D.1 (Damped coupled harmonic oscillator). Consider a one-dimensional system of d point masses m_i ∈ R (i = 1, ..., d) with positions Q_i, which are coupled by springs with spring constants k_i > 0 and equilibrium lengths ℓ_i > 0 (i = 0, ..., d), under the influence of friction with friction coefficients b_i ∈ R (i = 1, ..., d) and with fixed endpoints Q_0 = 0 and Q_{d+1} = L > 0 (see Figure 8 (top)). The equations of motion of this system are given by the differential equations

d²Q_i/dt² = (k_i/m_i)(Q_{i+1} − Q_i − ℓ_i) + (k_{i−1}/m_i)(Q_{i−1} − Q_i + ℓ_{i−1}) − (b_i/m_i) dQ_i/dt   (i = 1, ..., d).

The dynamics of the masses, in terms of position, velocity and acceleration, is described by a single, separate equation of motion for each mass. Under friction, that is, b_i > 0 (i = 1, . . .
, d), there is a unique equilibrium position, where the sum of forces vanishes for each mass. If one starts out of equilibrium, for example by moving one or several masses out of their equilibrium positions, then the masses will start to oscillate and converge to their unique equilibrium position. At equilibrium (i.e., for t → ∞) the velocity dQ_i/dt and acceleration d²Q_i/dt² of the masses vanish, and thus the following equation holds at equilibrium:

0 = (k_i/m_i)(Q_{i+1} − Q_i − ℓ_i) + (k_{i−1}/m_i)(Q_{i−1} − Q_i + ℓ_{i−1}),

for each mass (i = 1, ..., d). Hence, for each mass i = 1, ..., d its equilibrium position Q_i is given by

Q_i = [k_i (Q_{i+1} − ℓ_i) + k_{i−1} (Q_{i−1} + ℓ_{i−1})] / (k_i + k_{i−1}).

By considering the ℓ_i, the k_i and L as fixed parameters, we arrive at a linear SCM (see [6] for more details on constructing an SCM from a dynamical system) M = ⟨{1, ..., d}, ∅, R^d, 1, f, P_1⟩, where the causal mechanism f is given by

f_i(q) = [k_i (q_{i+1} − ℓ_i) + k_{i−1} (q_{i−1} + ℓ_{i−1})] / (k_i + k_{i−1}).

Alternatively, (some of) the parameters could be treated as exogenous variables instead. Its graph is depicted in Figure 8 (bottom). This SCM allows us to describe the equilibrium behavior of the system under perfect intervention. For example, when forcing mass j to a fixed position Q_j = ξ_j with 0 ≤ ξ_j ≤ L, the equilibrium positions of the masses correspond to the solutions of the intervened model M_{do({j}, ξ_j)}. It is an easy exercise to show that M is a simple SCM by using Proposition C.3.

Next, we show that the well-known market equilibrium model from economics, which has been thoroughly discussed in the literature [see, e.g., 65], can be described by a (non-simple) SCM. This example illustrates how self-cycles enrich the class of SCMs.

EXAMPLE D.2 (Price, supply and demand). Let X_D denote the demand and X_S the supply of a quantity of a product.
The price of the product is denoted by X_P. The following system of differential equations describes how the demanded and supplied quantities are determined by the price, and how price adjustments occur in the market:

X_D = β_D X_P + E_D,
X_S = β_S X_P + E_S,
dX_P/dt = X_D − X_S,

where E_D and E_S are exogenous random influences on the demand and supply, respectively, β_D < 0 is the reciprocal of the slope of the demand curve, and β_S > 0 is the reciprocal of the slope of the supply curve. In the situation known as a "market equilibrium", the price is determined implicitly by the condition that the demanded and supplied quantities be equal, since dX_P/dt = 0 at equilibrium. Applying the results in [6] gives rise to a linear SCM M = ⟨{P, S, D}, {S, D}, R³, R², f, P_E⟩ at equilibrium, with the causal mechanism defined by

f_D(x, e) := β_D x_P + e_D,
f_S(x, e) := β_S x_P + e_S,
f_P(x, e) := x_P + (x_D − x_S).

Note how we use a self-cycle for P in order to implement the equilibrium equation X_D = X_S as the causal mechanism for the price P.^21 Moreover, M is uniquely solvable. Its augmented graph is depicted in Figure 9 (left).

^21 Richardson and Robins [65] argue that this market equilibrium model cannot be modeled as an SCM. We observe that it can, as long as one allows for self-cycles.

Fig 9: The augmented graph of the SCM M (left), its twin SCM M^twin (center) and the intervened twin SCM (M^twin)_{do({S, S'}, (s, s'))} (right) of Examples D.2 and D.3.

Next, we provide an example of how counterfactuals can be sensibly formulated for cyclic SCMs, namely for the price, supply and demand model at equilibrium.

EXAMPLE D.3 (Price, supply and demand at equilibrium).
Consider the price, supply and demand model at equilibrium of Example D.2, given by the SCM M. As an example of a counterfactual query, consider

P(X'_P | do(X_S = s, X_{S'} = s'), X_P = p),

which denotes the conditional distribution of X'_P given X_P = p of a solution of the intervened twin model M^twin_{do({S, S'}, (s, s'))}. In words: how would, ceteris paribus, the price have been distributed, had we intervened to set the supplied quantity equal to s', given that we actually intervened to set the supplied quantity equal to s and observed that this led to price p? A straightforward calculation shows that this counterfactual distribution of the price is the Dirac measure on x'_P = p + (s' − s)/β_D. The augmented graphs of the SCM, its twin SCM, and its intervened twin SCM are depicted in Figure 9.

D.2. Additional examples. In this subsection, we provide additional examples that support the main text.

Section 2

EXAMPLE D.4 (Structural equations up to almost sure equality). Consider the SCM M = ⟨1, 1, X, E, f, P_E⟩ with X = E = {−1, 0, 1}, P_E({−1}) = P_E({1}) = 1/2 and f(x, e) = e² + e − 1. Let M̃ be the same SCM as M but with the different causal mechanism f̃(x, e) = e. Then the sets of solutions of the structural equations agree for both SCMs for e ∈ {−1, +1}, while they differ only for e = 0, which occurs with probability zero. Hence, a pair of random variables (X, E) is a solution of M if and only if it is a solution of M̃.

EXAMPLE D.5 (The for-all and for-almost-every quantifiers do not commute in general). Consider the SCM M = ⟨2, 1, X, E, f, P_E⟩ with X = (0, 1)², E = (0, 1), the causal mechanism f given by

f_1(x, e) = x_1,
f_2(x, e) = 1_{{0}}(x_1 − e) · (x_2 + 1),

and P_E = P^E with E ∼ U(0, 1). Define the property

P(x, e) := 1 if x = f(x, e) holds, and 0 otherwise.
Then, for all x ∈ X and for P_E-almost every e ∈ E the property P(x, e) holds; however, it does not hold that for P_E-almost every e ∈ E and for all x ∈ X the property P(x, e) holds, since for P_E-almost every e ∈ E the equation x = f(x, e) fails for x_1 = e. Hence, in general, for a property P(x, e), "for all x ∈ X and for P_E-almost every e ∈ E, P(x, e) holds" does not imply "for P_E-almost every e ∈ E and for all x ∈ X, P(x, e) holds" (see Lemma F.11 for additional properties of the for-almost-every quantifier).

EXAMPLE D.6 (Representation of latent confounders). Consider the SCM M = ⟨2, 3, R², R³, f, P_{R³}⟩ with causal mechanism given by

f_1(e_1, e_3) = e_1 + e_3,
f_2(x_1, e_2, e_3) = x_1 e_3 + e_2,

and P_{R³} the standard-normal distribution on R³; Figure 10 (left) shows the corresponding augmented graph. Then there exists no SCM M* = ⟨2, 1, R², R², f*, P*_{R²}⟩ that satisfies the following conditions:
1. M* is interventionally equivalent to M;
2. its structural equations have the form

x_1 = f*_1(e*_1),
x_2 = f*_2(x_1, e*_2),

where e*_1, e*_2 are the two components of e* = (e*_1, e*_2) ∈ R²;
3. the function e*_2 ↦ f*_2(x_1, e*_2) is strictly monotonically increasing for all x_1 ∈ R;
4. the cumulative distribution function F*_2 of the second component of P*_{R²} is continuous and strictly monotonically increasing.

The augmented graph of such an SCM is shown in Figure 10 (right).

Fig 10: Augmented graphs of the SCMs M (left) and M* (right) in Example D.6. For the SCM M*, the exogenous variable E consists of two real-valued components; the structural equation for X_1 depends only on the first, while the structural equation for X_2 depends only on the second component.

The proof of this statement proceeds by contradiction. Assume that such an SCM M* exists.
For any uniquely solvable SCM M̄ and any endogenous variable i of M̄, we denote by F^M̄_{X_i} the marginal cumulative distribution function of the i-th component of the observational distribution of M̄. For all ξ ∈ R and all x_2 ∈ R we have

(1)  F^{M_{do({1}, ξ)}}_{X_2}(x_2) = P(ξ E_3 + E_2 ≤ x_2) = Φ(x_2 / √(1 + ξ²)),

where Φ denotes the (invertible) cdf of the standard-normal distribution. Now define φ : R → R by φ(e_2) := Φ^{-1}(F*_2(e_2)) and define the SCM M̃ := ⟨2, 1, R², R², f̃, P̃_{R²}⟩ such that the causal mechanism f̃ is given by

f̃_1(e_1) = f*_1(e_1),
f̃_2(x_1, e_2) = f*_2(x_1, φ^{-1}(e_2)),

and P̃_{R²} is the push-forward measure of P*_{R²} under (I_R, φ). Then M̃ is interventionally equivalent to M* by construction, and the second component of P̃_{R²} has a standard-normal distribution. Let (X̃_1, X̃_2, Ẽ) be a solution of M̃ and write Ẽ = (Ẽ_1, Ẽ_2). Then, for all ξ ∈ R and ẽ_2 ∈ R,

F^{M̃_{do({1}, ξ)}}_{X_2}(f̃_2(ξ, ẽ_2)) = P(f̃_2(ξ, Ẽ_2) ≤ f̃_2(ξ, ẽ_2)) = P(Ẽ_2 ≤ ẽ_2) = Φ(ẽ_2),

using that ẽ_2 ↦ f̃_2(ξ, ẽ_2), too, is strictly monotonically increasing for all ξ. This implies that, for all ξ ∈ R and ẽ_2 ∈ R,

f̃_2(ξ, ẽ_2) = (F^{M_{do({1}, ξ)}}_{X_2})^{-1}(Φ(ẽ_2)) = √(1 + ξ²) ẽ_2,

where we used the interventional equivalence of M and M̃, and (1) for the second equality. Furthermore,

X̃_2 = f̃_2(X̃_1, Ẽ_2) = √(1 + X̃_1²) Ẽ_2  a.s.,

so Ẽ_2 = X̃_2 / √(1 + X̃_1²) a.s. Now let (X_1, X_2, E_1, E_2, E_3) be a solution of M. By observational equivalence, (X̃_1, X̃_2) has the same distribution as (X_1, X_2), and thus Ẽ_2 is distributed as

X_2 / √(1 + X_1²) = ((E_1 + E_3) E_3 + E_2) / √(1 + (E_1 + E_3)²)  a.s.

This contradicts the fact that Ẽ_2 has a standard-normal distribution, as, for example, the mean of the right-hand side is nonzero.
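The final step of the argument, that the right-hand side cannot be standard-normal, can be checked by simulation. The sketch below (our own sanity check) estimates the mean of ((E₁+E₃)E₃+E₂)/√(1+(E₁+E₃)²) under the standard-normal distribution on R³ and finds it clearly positive, whereas a standard-normal Ẽ₂ would have mean zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
e1, e2, e3 = rng.standard_normal((3, n))

x1 = e1 + e3                          # solution for X_1 in the SCM M of Example D.6
x2 = x1 * e3 + e2                     # solution for X_2
candidate = x2 / np.sqrt(1 + x1**2)   # what E~_2 would have to be distributed as

m = candidate.mean()
print(m)          # clearly positive (around 0.45), not 0 as standard-normality requires
assert m > 0.3
```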
EXAMPLE D.7 (Counterfactual density unidentifiable from observational and interventional densities [11]). Let ρ ∈ R and let M_ρ = ⟨2, 2, {0, 1} × R, {0, 1} × R², f, P_E⟩ be the SCM with causal mechanism given by

f_1(x, e) = e_1,
f_2(x, e) = e_21 (1 − x_1) + e_22 x_1,

and P_E = P^{(E_1, E_2)} with E_1 ∼ Bernoulli(1/2), E_2 := (E_21, E_22) normally distributed with mean zero and covariance matrix (1 ρ; ρ 1), and E_1 ⊥⊥ E_2. In an epidemiological setting, this SCM could be used to model whether a patient was treated or not (X_1) and the corresponding outcome for that patient (X_2). Suppose that in the actual world we did not assign treatment to a patient (X_1 = 0) and the outcome was X_2 = c ∈ R. Consider the counterfactual query "What would the outcome have been, had we assigned treatment to this patient?". We can answer this question by introducing a parallel counterfactual world that is modeled by the twin SCM M^twin_ρ, as depicted in Figure 11. The counterfactual query then asks for p(X_{2'} = x_{2'} | do(X_{1'} = 1, X_1 = 0), X_2 = c). One can calculate that

(X_{2'}, X_2) | do(X_{1'} = 1, X_1 = 0) ∼ N((0; 0), (1 ρ; ρ 1))

and hence

X_{2'} | do(X_{1'} = 1, X_1 = 0), X_2 = c ∼ N(ρc, 1 − ρ²).

Note that the answer to the counterfactual query depends on the quantity ρ, which we cannot identify from the observational density p(X_1, X_2) or the interventional densities p(X_2 | do(X_1 = 0)) and p(X_2 | do(X_1 = 1)), none of which depends on ρ. Therefore, even data from randomized controlled trials combined with observational data would not suffice to determine the value of this particular counterfactual query. Indeed, the SCMs M_ρ and M_{ρ'} with ρ ≠ ρ' are interventionally equivalent, but not counterfactually equivalent.
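The closed-form counterfactual N(ρc, 1 − ρ²) can be checked by simulating the twin SCM. In the sketch below (our own, with illustrative values ρ = 0.8 and c = 1.0), X₂ is the untreated factual outcome, X₂' the treated counterfactual outcome, and conditioning on X₂ = c is approximated by a narrow window around c.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, c, n = 0.8, 1.0, 2_000_000

cov = np.array([[1.0, rho], [rho, 1.0]])
e21, e22 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

x2_factual = e21      # do(X_1 = 0): outcome without treatment
x2_counter = e22      # do(X_1' = 1) in the twin world: outcome with treatment

# Condition on the factual observation X_2 = c (narrow-window approximation).
sel = np.abs(x2_factual - c) < 0.02
est_mean = x2_counter[sel].mean()
est_var = x2_counter[sel].var()

assert abs(est_mean - rho * c) < 0.05        # theory: mean rho*c = 0.8
assert abs(est_var - (1 - rho**2)) < 0.05    # theory: variance 1 - rho^2 = 0.36
```

Repeating this with a different ρ leaves the observational and interventional margins unchanged but shifts the conditional mean, which is exactly the unidentifiability the example demonstrates.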
Fig 11: The augmented graph of the SCM $M_\rho$ (left), its twin SCM $M^{\mathrm{twin}}_\rho$ (center) and the intervened twin SCM $(M^{\mathrm{twin}}_\rho)_{\mathrm{do}(\{1',1\},(1,0))}$ (right) of Example D.7.

Fig 12: The augmented graphs of the SCMs $\bar M$, $\hat M$, $M$, and $\tilde M$ that appear in Examples 4.4, D.10, and D.13.

Section 3

EXAMPLE D.8 (Mixtures of solutions are solutions). Let $M = \langle 1, \emptyset, \mathbb R, \mathbf 1, f, P_{\mathbf 1} \rangle$ be an SCM with causal mechanism $f : \mathcal X \times \mathcal E \to \mathcal X$ defined by $f(x,e) = x - x^2 + 1$. There exist only two measurable solution functions $g_\pm : \mathcal E \to \mathcal X$ for $M$, defined by $g_\pm(e) = \pm 1$. Let $X : \Omega \to \mathbb R$ be a random variable that is a nontrivial mixture of point masses on $\{-1, +1\}$. Then $X$ is a solution of $M$; however, neither $g_+(E) = X$ a.s. nor $g_-(E) = X$ a.s., for any random variable $E$ with distribution $P_{\mathbf 1}$.

EXAMPLE D.9 (Solvability is not preserved under perfect intervention). Consider the SCM $M = \langle 2, \emptyset, \mathbb R^2, \mathbf 1, f, P_{\mathbf 1} \rangle$ with the causal mechanism

$f_1(x) = x_1 + x_1^2 - x_2 + 1, \qquad f_2(x) = x_2\big(1 - \mathbb 1_{\{0\}}(x_1)\big) + 1.$

This SCM has the unique solution $(0,1)$. Performing a perfect intervention $\mathrm{do}(\{1\}, \xi_1)$ for some $\xi_1 \neq 0$, however, leads to an intervened model $M_{\mathrm{do}(\{1\},\xi_1)}$ that is not solvable. Performing instead the perfect intervention $\mathrm{do}(\{2\}, \xi_2)$ for some $\xi_2 > 1$ leads to a nonuniquely solvable SCM $M_{\mathrm{do}(\{2\},\xi_2)}$ which has solutions with multiple induced distributions, for example, $(X_1, X_2) = (\phi(\xi_2)\sqrt{\xi_2 - 1}, \xi_2)$ with some measurable $\phi : \mathbb R \to \{-1, +1\}$, but also mixtures of those.

Section 4

EXAMPLE D.10 (Counterfactually equivalent SCMs with different graphs).
Consider the SCM $\hat M = \langle 2, 2, \{-1,1\}^2, \{-1,1\}^2, \hat f, P_{\mathcal E} \rangle$ with causal mechanism given by $\hat f_1(x,e) = e_1$ and $\hat f_2(x,e) = e_2$, and $P_{\mathcal E} = P_{(E_1,E_2)}$ with $E_1, E_2 \sim \mathcal U(\{-1,1\})$ uniformly distributed and $E_1 \perp\!\!\!\perp E_2$. Consider also the SCM $M$ that is the same as $\hat M$ except for its causal mechanism, which is given by $f_1(x,e) = e_1$ and $f_2(x,e) = e_1 e_2$. Then $M$ and $\hat M$ are counterfactually equivalent although $G(M)$ is not equal to $G(\hat M)$ (see Figure 12).

Section 5

EXAMPLE D.11 (Marginalization condition of an SCM is not a necessary condition). Consider the SCM $M = \langle 4, 1, \mathbb R^4, \mathbb R, f, P_{\mathbb R} \rangle$ with causal mechanism given by

$f_1(x,e) = e, \quad f_2(x,e) = x_1, \quad f_3(x,e) = x_2, \quad f_4(x,e) = x_4$

and $P_{\mathbb R}$ the standard-normal measure on $\mathbb R$. This SCM is solvable w.r.t. $L = \{2,4\}$, but not uniquely solvable w.r.t. $L$, and hence we cannot apply Definition 5.3 to $L$. However, the SCM $\tilde M$ on the endogenous variables $\{1,3\}$ with the causal mechanism $\tilde f$ given by $\tilde f_1(x,e) = e$ and $\tilde f_3(x,e) = x_1$ is counterfactually equivalent to $M$ w.r.t. $\{1,3\}$, which can be checked easily.

EXAMPLE D.12 (Graph of the marginal SCM is a strict subgraph of the latent projection). Consider the SCM $M = \langle 3, 1, \mathbb R^3, \mathbb R, f, P_{\mathbb R} \rangle$ with causal mechanism given by

$f_1(x,e) = e, \quad f_2(x,e) = x_1 - x_3, \quad f_3(x,e) = x_1$

and take for $P_{\mathbb R}$ the standard-normal measure on $\mathbb R$. In contrast to the (augmented) graph of $M$, there is no directed path in the (augmented) graph of the marginal SCM $M_{\mathrm{marg}(\{3\})}$.

Section 7

EXAMPLE D.13 (Detecting a bidirected edge in the graph of an SCM). Consider the SCM $\bar M = \langle 2, 2, \{-1,1\}^2, \{-1,1\}^2, \bar f, P_{\mathcal E} \rangle$ with causal mechanism given by $\bar f_1(x,e) = e_1$ and $\bar f_2(x,e) = x_1 e_2$, and $P_{\mathcal E} = P_{(E_1,E_2)}$ with $E_1, E_2 \sim \mathcal U(\{-1,1\})$ uniformly distributed and $E_1 \perp\!\!\!\perp E_2$.
Consider also the SCM $\tilde M$ that is the same as $\bar M$ except for its causal mechanism, which is given by $\tilde f_1(x,e) = e_1$ and $\tilde f_2(x,e) = x_1 e_1$. See Figure 12 for their augmented graphs. For the SCM $\tilde M$ we observe that the marginal interventional distribution $P^{\tilde M_{\mathrm{do}(\{1\},\xi_1)}}(X_2 = -1)$ is not equal to the conditional distribution $P^{\tilde M}(X_2 = -1 \mid X_1 = \xi_1)$ for both $\xi_1 = -1$ and $\xi_1 = 1$. This observation suffices to identify the presence of the bidirected edge $1 \leftrightarrow 2$ in the graph $G(\tilde M)$. For the SCM $\bar M$, whose graph does not contain the bidirected edge $1 \leftrightarrow 2$, the marginal interventional distribution and the conditional distribution coincide.

APPENDIX E: PROOFS

This appendix contains the proofs of all the theoretical results in Appendices A, B and C, and in the main text. Some of the proofs rely on the measure-theoretic terminology and results of Appendix F.

E.1. Proofs of the appendices

Appendix A

PROOF OF LEMMA A.5. It suffices to show that for every $C$-$d$-open walk between $i$ and $j$ in $G$, there exists a $C$-$d$-open path between $i$ and $j$ in $G$. Take a $C$-$d$-open walk $\pi = (i = i_0, \ldots, i_n = j)$. If a node $\ell$ occurs more than once in $\pi$, let $i_j$ be the first occurrence of $\ell$ in $\pi$ and $i_k$ the last occurrence of $\ell$ in $\pi$. We now construct a new walk $\pi'$ from $\pi$ by removing the subwalk between $i_j$ and $i_k$ from $\pi$. It is easy to check that the new walk $\pi'$ is still $C$-$d$-open. If $\ell$ is an endpoint on $\pi'$, then $i_j$ or $i_k$ must be an endpoint of $\pi$, and hence $\ell \notin C$. If $\ell$ is a non-endpoint non-collider on $\pi'$, then also $i_j$ or $i_k$ must have been a non-endpoint non-collider on $\pi$, and hence $\ell \notin C$. If $\ell$ is a collider on $\pi'$, then either (i) $i_j$ and $i_k$ are both colliders on $\pi$, and hence $\ell$ is an ancestor of $C$ in $G$, or (ii) on the subwalk between $i_j$ and $i_k$ that was removed, there must be a directed path in $G$ from $i_j$ or $i_k$ to a collider in $\mathrm{an}_G(C)$, and hence $\ell \in \mathrm{an}_G(C)$.
The other nodes on $\pi'$ cannot be responsible for $C$-$d$-blocking the walk, since they also occur (together with their adjacent edges) on $\pi$ and they do not $C$-$d$-block $\pi$. In $\pi'$, the number of nodes that occur multiple times is at least one less than in $\pi$. Repeat this procedure until no repeated nodes are left.

PROOF OF THEOREM A.7. The first case is a well-known result. An elementary proof is obtained by noting that an acyclic system of structural equations trivially satisfies the local directed Markov property, then applying [35, Proposition 4], followed by the stability of $d$-separation with respect to (graphical) marginalization [18, Lemma 2.2.15]. Alternatively, the result also follows from sequential application of Theorems 3.8.2, 3.8.11, 3.7.7, 3.7.2 and 3.3.3 (using Remark 3.3.4) in [18].

The discrete case is proved by the series of results Theorem 3.8.12, Remark 3.7.2, Theorems 3.6.6 and 3.5.2 in [18].

The linear case is proved in Example 3.8.17 in [18]. To connect the assumptions made there with the ones we state here, observe that under the linear transformation rule for Lebesgue measures, the image measure of $P_{\mathcal E}$ under the linear mapping $\mathbb R^{\mathcal J} \to \mathbb R^{\mathcal I} : e \mapsto \Gamma_{\mathcal I \mathcal J} e$ gives a measure on $\mathcal X = \mathbb R^{\mathcal I}$ with a density w.r.t. the Lebesgue measure on $\mathbb R^{\mathcal I}$, as long as the image of the linear mapping is the entire $\mathbb R^{\mathcal I}$. This is guaranteed if each causal mechanism has a nontrivial dependence on some exogenous variable(s), that is, for each $i \in \mathcal I$ there is some $j \in \mathcal J$ with $\Gamma_{ij} \neq 0$.

PROOF OF PROPOSITION A.12. This follows directly from the fact that the strongly connected components of $G^a(M)$ form a DAG by Lemma A.2 and that the directed edges in $G^a(\mathrm{acy}(M))$ by construction respect every topological ordering of that DAG. Both SCMs are observationally equivalent by construction.

PROOF OF PROPOSITION A.14. This follows immediately from Definitions A.11 and A.13.
PROOF OF LEMMA A.17. It suffices to show that for every $C$-$\sigma$-open walk between $i$ and $j$ in $G$, there exists a $C$-$\sigma$-open path between $i$ and $j$ in $G$. Let $\pi = (i = i_0, \ldots, i_n = j)$ be a $C$-$\sigma$-open walk in $G$. If a node $\ell$ occurs more than once in $\pi$, let $i_j$ be the first node in $\pi$ and $i_k$ the last node in $\pi$ that are in the same strongly connected component as $\ell$. Since $i_j$ and $i_k$ are in the same strongly connected component, there are directed paths $i_j \to \cdots \to i_k$ and $i_k \to \cdots \to i_j$ in $G$. We now construct a new walk $\pi'$ from $\pi$ by replacing the subwalk between $i_j$ and $i_k$ of $\pi$ by a particular directed path between $i_j$ and $i_k$: (i) if $k = n$, or if $k < n$ and $i_k \to i_{k+1}$ on $\pi$, we replace it by a shortest directed path $i_j \to \cdots \to i_k$; otherwise (ii) we replace it by a shortest directed path $i_j \leftarrow \cdots \leftarrow i_k$.

We now show that the new walk $\pi'$ is still $C$-$\sigma$-open. $\pi'$ cannot become $C$-$\sigma$-blocked through one of the initial nodes $i_0, \ldots, i_{j-1}$ or one of the final nodes $i_{k+1}, \ldots, i_n$ on $\pi'$, since these nodes occur in the same local configuration on $\pi$ and do not $C$-$\sigma$-block $\pi$ by assumption. Furthermore, $\pi'$ cannot become $C$-$\sigma$-blocked through one of the nodes strictly between $i_j$ and $i_k$ on $\pi'$ (if there are any), since these nodes are all non-endpoint non-colliders that only point to nodes in the same strongly connected component on $\pi'$. Because $\pi$ is $C$-$\sigma$-open, $i_k \notin C$ if $k = n$ or if $i_k \to i_{k+1}$ on $\pi$. This holds in particular in case (i). Similarly, $i_j \notin C$ if $j = 0$ or $i_{j-1} \leftarrow i_j$ on $\pi$. In case (i), $\pi'$ is not $C$-$\sigma$-blocked by $i_k$ because $i_k$ is a non-collider on $\pi'$ but $i_k \notin C$. Also $i_j$ does not $C$-$\sigma$-block $\pi'$. Assume $i_j \neq i_k$ (otherwise there is nothing to prove). If $j = 0$, or if $j > 0$ and $i_{j-1} \leftarrow i_j$ on $\pi'$, then the same holds for $\pi$ and hence $i_j \notin C$; $i_j$ is then a non-collider on $\pi'$, but $i_j \notin C$.
If $j > 0$ and $i_{j-1} \leftrightarrow i_j$ or $i_{j-1} \to i_j$ on $\pi'$, then $i_j$ is a non-endpoint non-collider on $\pi'$ that does not point to a node in another strongly connected component. Now consider case (ii). If $j = 0$ or $i_{j-1} \leftarrow i_j$ on $\pi'$, then this case is analogous to case (i). So assume $j > 0$ and $i_{j-1} \to i_j$ or $i_{j-1} \leftrightarrow i_j$ on $\pi'$. If $i_j$ is an endpoint of $\pi'$, then $i_j = i_k$ and $k = n$ and therefore $i_k \notin C$, and hence $i_j$ and $i_k$ do not $C$-$\sigma$-block $\pi'$. Otherwise, $i_j$ must be a collider on $\pi'$ (whether $i_j = i_k$ or not). Then on the subwalk of $\pi$ between $i_j$ and $i_k$ there must be a directed path from $i_j$ to a collider that is an ancestor of $C$, which implies that $i_j$ is itself an ancestor of $C$, and hence $i_j$ does not $C$-$\sigma$-block $\pi'$. Also $i_k$ cannot $C$-$\sigma$-block $\pi'$. Assume $i_j \neq i_k$ (otherwise there is nothing to prove). Since $i_k \leftarrow i_{k+1}$ or $i_k \leftrightarrow i_{k+1}$ on $\pi'$, $i_k$ is a non-endpoint non-collider on $\pi'$ that does not point to a node in another strongly connected component.

Now in $\pi'$, the number of nodes that occur more than once is at least one less than in $\pi$. Repeat this procedure until no nodes occur more than once.

PROOF OF PROPOSITION A.19. This follows directly as a special case of Corollary 2.8.4 in [18].

PROOF OF THEOREM A.21. An SCM $M$ that is uniquely solvable w.r.t. each strongly connected component is uniquely solvable and hence, by Theorem 3.6, all its solutions have the same observational distribution. The last statement follows from the series of results Theorems 3.8.2, 3.8.11, Lemma 3.7.7 and Remark 3.7.2 in [18]. Alternatively, we give here a shorter proof: under the stated conditions one can always construct the acyclification $\mathrm{acy}(M)$, which is observationally equivalent to $M$ and acyclic (see Proposition A.12), and hence we can apply Theorem A.7 to $\mathrm{acy}(M)$.
Together with Propositions A.14 and A.19 this gives

$A \underset{G(M)}{\overset{\sigma}{\perp}} B \mid C \iff A \underset{\mathrm{acy}(G(M))}{\overset{d}{\perp}} B \mid C \implies A \underset{G(\mathrm{acy}(M))}{\overset{d}{\perp}} B \mid C \implies X_A \perp\!\!\!\perp_{P^M_X} X_B \mid X_C,$

for $A, B, C \subseteq \mathcal I$ and $X$ a solution of $M$.

PROOF OF COROLLARY A.22. First observe that simplicity is preserved under both perfect intervention and the twin operation (see Proposition 8.2). Now the first statement follows from Theorem A.21 if one takes into account the identities of Propositions 2.14 and 2.19. Similarly, the last statement follows from Theorem A.7.

PROOF OF PROPOSITION A.32. Let $\tilde M =: \langle \mathcal V, \hat{\mathcal H}, \mathcal X, \mathcal E, \tilde f, P_{\mathcal E} \rangle$ be the induced SCM. Observe that every loop $O \in \mathcal L(G(\tilde M))$ is a loop in $\mathcal L(G)$. Fix $\check x \in \mathcal X$ and $\check e \in \mathcal E$. For every $O \in \mathcal L(G(\tilde M))$, define $I_O := (\mathrm{pa}_G(O) \setminus O) \setminus (\mathrm{pa}(O) \setminus O) \subseteq \tilde{\mathcal I}$ and $J_O := \{F \in \tilde{\mathcal J} : F \cap O \neq \emptyset\} \setminus \mathrm{pa}(O) \subseteq \tilde{\mathcal J}$. Now define the family of measurable mappings $(\tilde g_O)_{O \in \mathcal L(G(\tilde M))}$, where the mapping $\tilde g_O : \mathcal X_{\mathrm{pa}(O) \setminus O} \times \mathcal E_{\mathrm{pa}(O)} \to \mathcal X_O$ is given by

$\tilde g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)}) := g_O(x_{\mathrm{pa}(O) \setminus O}, \check x_{I_O}, e_{\mathrm{pa}(O)}, \check e_{J_O}),$

where $x_{\mathrm{pa}_G(O) \setminus O} = (x_{\mathrm{pa}(O) \setminus O}, \check x_{I_O})$ and $\hat e_O = (e_{\mathrm{pa}(O)}, \check e_{J_O})$. Observe that from the definition of the parents (see Definition 2.6) it follows that for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$ we have

$x_O = \tilde f_O(x_{\setminus I_O}, \check x_{I_O}, e_{\setminus J_O}, \check e_{J_O}) \iff x_O = \tilde f_O(x, e).$

This, together with the fact that the family of mappings $(g_O)_{O \in \mathcal L(G)}$ is a compatible system of solution functions, implies that for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$ we have

$x_O = \tilde g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)}) \implies x_O = \tilde f_O(x, e).$

Hence, $\iota(\hat M)$ is loop-wisely solvable and thus $(\tilde g_O)_{O \in \mathcal L(G(\tilde M))}$ is a family of measurable solution functions.
In particular, for all $O, \tilde O \in \mathcal L(G(\tilde M))$ with $\tilde O \subseteq O$ and for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$ we have

$x_O = \tilde g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)}) \implies x_{\tilde O} = \tilde g_{\tilde O}(x_{\mathrm{pa}(\tilde O) \setminus \tilde O}, e_{\mathrm{pa}(\tilde O)}).$

From this we conclude that $(\tilde g_O)_{O \in \mathcal L(G(\tilde M))}$ is a compatible system of solution functions.

PROOF OF LEMMA A.33. Suppose $M$ is loop-wisely uniquely solvable and consider a subset $O \subseteq \mathcal I$. Consider the induced subgraph $G^a(M)_O$ of $G^a(M)$ on the nodes $O$. Then every strongly connected component of $G^a(M)_O$ is an element of $\mathcal L(G(M))$. Let $C$ be such a strongly connected component in $G^a(M)_O$, and let $g_C : \mathcal X_{\mathrm{pa}(C) \setminus C} \times \mathcal E_{\mathrm{pa}(C)} \to \mathcal X_C$ be a measurable solution function for $M$ w.r.t. $C$. Since $G^a(M)_O$ partitions into strongly connected components, we can recursively (by following a topological ordering of the DAG $G^a(M)^{\mathrm{sc}}_O$ from Lemma A.2) insert these mappings into each other to obtain a mapping $g_O : \mathcal X_{\mathrm{pa}(O) \setminus O} \times \mathcal E_{\mathrm{pa}(O)} \to \mathcal X_O$ that makes $M$ uniquely solvable w.r.t. $O$.

PROOF OF PROPOSITION A.34. Let $(g_O)_{O \in \mathcal L(G(M))}$ be any family of measurable solution functions, where $g_O$ is a measurable solution function of $M$ w.r.t. $O$. Then, for $O, \tilde O \in \mathcal L(G(M))$ such that $\tilde O \subseteq O$, we have that for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$

$x_O = f_O(x, e) \implies x_{\tilde O} = f_{\tilde O}(x, e).$

This implies that for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$

$x_O = g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)}) \implies x_{\tilde O} = g_{\tilde O}(x_{\mathrm{pa}(\tilde O) \setminus \tilde O}, e_{\mathrm{pa}(\tilde O)}).$

PROOF OF COROLLARY 8.5. This follows directly from Propositions 7.1 and 7.2.

Appendix B

PROOF OF PROPOSITION B.1. Let $\tilde f : \mathcal X \times \mathcal E \to \mathcal X$ be the causal mechanism of a structurally minimal SCM that is equivalent to $M$ (see Proposition 2.11).
In particular, for any $\varepsilon_{\setminus \mathrm{pa}(O)} \in \mathcal E_{\setminus \mathrm{pa}(O)}$ and $\xi_{\setminus \mathrm{pa}(O)} \in \mathcal X_{\setminus \mathrm{pa}(O)}$, we have that for all $x \in \mathcal X$ and all $e \in \mathcal E$,

$\tilde f(x, e) = \tilde f(x_{\mathrm{pa}(O)}, \xi_{\setminus \mathrm{pa}(O)}, e_{\mathrm{pa}(O)}, \varepsilon_{\setminus \mathrm{pa}(O)}).$

This means that we may also consider $\tilde f$ as a mapping $\tilde f : \mathcal X_{\mathrm{pa}(O)} \times \mathcal E_{\mathrm{pa}(O)} \to \mathcal X$. Consider the set

$\tilde S := \{(e_{\mathrm{pa}(O)}, x_{\mathrm{pa}(O) \setminus O}, x_O) \in \mathcal E_{\mathrm{pa}(O)} \times \mathcal X_{\mathrm{pa}(O) \setminus O} \times \mathcal X_O : x_O = \tilde f_O(x_{\mathrm{pa}(O)}, e_{\mathrm{pa}(O)})\}.$

By similar reasoning as in the proof of Theorem 3.2, $\tilde S$ is measurable. By assumption, for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x_{\setminus O} \in \mathcal X_{\setminus O}$ the space $\{x_O \in \mathcal X_O : x_O = f_O(x, e)\}$ is nonempty and $\sigma$-compact. By applying Lemma F.10 to the canonical projection $\mathrm{pr}_{\mathcal E_{\mathrm{pa}(O)}} : \mathcal E \to \mathcal E_{\mathrm{pa}(O)}$ and using the equivalence of $f$ and $\tilde f$, we obtain that for $P_{\mathcal E_{\mathrm{pa}(O)}}$-almost every $e_{\mathrm{pa}(O)} \in \mathcal E_{\mathrm{pa}(O)}$ and for all $x_{\mathrm{pa}(O) \setminus O} \in \mathcal X_{\mathrm{pa}(O) \setminus O}$ the space

$\tilde S(e_{\mathrm{pa}(O)}, x_{\mathrm{pa}(O) \setminus O}) := \{x_O \in \mathcal X_O : x_O = \tilde f_O(x_{\mathrm{pa}(O)}, e_{\mathrm{pa}(O)})\}$

is nonempty and $\sigma$-compact. The second measurable selection theorem, Theorem F.9, now implies that there exists a measurable $g_O : \mathcal X_{\mathrm{pa}(O) \setminus O} \times \mathcal E_{\mathrm{pa}(O)} \to \mathcal X_O$ such that for $P_{\mathcal E_{\mathrm{pa}(O)}}$-almost every $e_{\mathrm{pa}(O)} \in \mathcal E_{\mathrm{pa}(O)}$ and for all $x_{\mathrm{pa}(O) \setminus O} \in \mathcal X_{\mathrm{pa}(O) \setminus O}$

$g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)}) = \tilde f_O\big(x_{\mathrm{pa}(O) \setminus O}, g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)}), e_{\mathrm{pa}(O)}\big).$

Once more applying Lemma F.10, we obtain that for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$

$x_O = g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)}) \implies x_O = f_O(x, e).$

Hence $M$ is solvable w.r.t. $O$.

PROOF OF PROPOSITION B.4. Without loss of generality, we assume that $M$ is structurally minimal (see Proposition 2.11). Define $C := A \cap \tilde A$ and $D := A \cup \tilde A$. Let $g_A$, $g_{\tilde A}$ be measurable solution functions for $M$ w.r.t. $A$ and $\tilde A$, respectively. Note that $\mathrm{pa}(C) \setminus C \subseteq \mathrm{pa}(A) \setminus A$ and similarly $\mathrm{pa}(C) \setminus C \subseteq \mathrm{pa}(\tilde A) \setminus \tilde A$.
Indeed, for $c \in \mathrm{pa}(C)$: if $c \in O$ then $c \in C$ because $A$ and $\tilde A$ are both ancestral in $G(M)_O$, while if $c \notin O$ then $c \notin A$ and $c \notin \tilde A$. Hence by Lemma E.1, for $P_{\mathcal E}$-almost all $e \in \mathcal E$ and for all $x \in \mathcal X$

$(g_A)_C(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}) = (g_{\tilde A})_C(x_{\mathrm{pa}(\tilde A) \setminus \tilde A}, e_{\mathrm{pa}(\tilde A)}).$

Hence for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$

$x_D = f_D(x,e) \iff \begin{cases} x_{A \setminus C} = f_{A \setminus C}(x,e) \\ x_C = f_C(x,e) \\ x_C = f_C(x,e) \\ x_{\tilde A \setminus C} = f_{\tilde A \setminus C}(x,e) \end{cases} \iff \begin{cases} x_{A \setminus C} = (g_A)_{A \setminus C}(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}) \\ x_C = (g_A)_C(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}) \\ x_C = (g_{\tilde A})_C(x_{\mathrm{pa}(\tilde A) \setminus \tilde A}, e_{\mathrm{pa}(\tilde A)}) \\ x_{\tilde A \setminus C} = (g_{\tilde A})_{\tilde A \setminus C}(x_{\mathrm{pa}(\tilde A) \setminus \tilde A}, e_{\mathrm{pa}(\tilde A)}) \end{cases} \iff \begin{cases} x_A = g_A(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}) \\ x_{\tilde A} = g_{\tilde A}(x_{\mathrm{pa}(\tilde A) \setminus \tilde A}, e_{\mathrm{pa}(\tilde A)}). \end{cases}$

Now $\mathrm{pa}(A) \setminus A \subseteq \mathrm{pa}(D) \setminus D$, and similarly, $\mathrm{pa}(\tilde A) \setminus \tilde A \subseteq \mathrm{pa}(D) \setminus D$. Hence, we conclude that the mapping $h_D : \mathcal X_{\mathrm{pa}(D) \setminus D} \times \mathcal E_{\mathrm{pa}(D)} \to \mathcal X_D$ defined by

$h_D(x_{\mathrm{pa}(D) \setminus D}, e_{\mathrm{pa}(D)}) := \big((g_A)_{A \setminus C}(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}),\ (g_A)_C(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}),\ (g_{\tilde A})_{\tilde A \setminus C}(x_{\mathrm{pa}(\tilde A) \setminus \tilde A}, e_{\mathrm{pa}(\tilde A)})\big)$

is a measurable solution function for $M$ w.r.t. $D$, and that $M$ is uniquely solvable w.r.t. $D$.

PROOF OF COROLLARY B.5. It suffices to show the implication to the left. We have to show that $M$ is uniquely solvable w.r.t. each ancestral subset of $G(M)_O$. The proof proceeds via induction with respect to the size of the ancestral subset. For ancestral subsets of size 0, the claim is trivially true. Ancestral subsets of size 1 must be of the form $\{i\} = \mathrm{an}_{G(M)_O}(i)$ for $i \in O$ and hence the claim is true by assumption. Assume that the claim holds for all ancestral subsets of size $\le n$. Let $A$ be an ancestral subset of $G(M)_O$ of size $n+1$. If $A = \mathrm{an}_{G(M)_O}(i)$ for some $i \in O$, then the claim holds for $A$ by assumption. Otherwise, $A = \bigcup_{i \in A} \mathrm{an}_{G(M)_O}(i)$ is a union of ancestral subsets of size $\le n$.
Choose distinct elements $\{i_1, \ldots, i_k\} \subseteq A$, where $k$ is the smallest integer such that $\bigcup_{j=1}^k \mathrm{an}_{G(M)_O}(i_j) = A$. By applying Proposition B.4 to $\bigcup_{j=1}^{k-1} \mathrm{an}_{G(M)_O}(i_j)$ and $\mathrm{an}_{G(M)_O}(i_k)$, thereby noting that the intersection of these two sets is an ancestral subset of size $\le n$ and making use of the induction hypothesis, we arrive at the conclusion that $M$ is uniquely solvable w.r.t. $A$.

Appendix C

PROOF OF PROPOSITION C.2. Let $e \in \mathcal E$ and $x_O \in \mathcal X_O$. For $x_L \in \mathcal X_L$,

$x_L = f_L(x,e) \iff x_L = B_{LL}x_L + B_{LO}x_O + \Gamma_{LJ}e \iff A_{LL}x_L = B_{LO}x_O + \Gamma_{LJ}e \iff \begin{cases} A_{LL}A^+_{LL}(B_{LO}x_O + \Gamma_{LJ}e) = B_{LO}x_O + \Gamma_{LJ}e \\ \exists v \in \mathcal X_L : x_L = A^+_{LL}(B_{LO}x_O + \Gamma_{LJ}e) + [I_L - A^+_{LL}A_{LL}]v, \end{cases}$

where the last equivalence follows from [54, Theorem 2].

PROOF OF PROPOSITION C.3. $M$ is uniquely solvable w.r.t. $L$ if and only if for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x_O \in \mathcal X_O$ the linear system of equations

$x_L = f_L(x,e) \iff x_L = B_{LL}x_L + B_{LO}x_O + \Gamma_{LJ}e \iff A_{LL}x_L = B_{LO}x_O + \Gamma_{LJ}e$

has a unique solution $x_L \in \mathcal X_L$. Hence, $M$ is uniquely solvable w.r.t. $L$ if and only if $A_{LL}$ is invertible.

PROOF OF PROPOSITION C.4. It suffices to show (1) $\implies$ (2) and (1) $\iff$ (3). We start by showing that (1) $\implies$ (2). Let $V \subseteq L$ and denote $U := \mathrm{an}_{G(M)_L}(V)$; we need to show that $M$ is uniquely solvable w.r.t. $U$. From Proposition C.3 we know that $M$ is uniquely solvable w.r.t. $L$ if and only if the matrix $A_{LL} = I_L - B_{LL}$ is invertible. The matrix $A_{LL}$ is invertible if and only if the rows of $A_{LL}$ are all linearly independent. In particular, the rows of $A_{UL}$ are all linearly independent. Because $A_{UL} = [A_{UU}\ \ Z_{U, L \setminus U}]$, where $Z_{U, L \setminus U}$ is the zero matrix, we know that the rows of $A_{UU} = I_U - B_{UU}$ are also all linearly independent, and hence $A_{UU}$ is invertible. Next, we show that (1) $\iff$ (3).
Observe that the strongly connected components of $G(M)_L$ form a partition of the set $L$, and that the directed mixed graph $G(M)_L$ and the directed graph $G^a(M)_L$ have the same strongly connected components. Because, by Lemma A.2, the graph of strongly connected components $G^{\mathrm{sc}}$ of the directed graph $G^a(M)_L$ is a DAG, the square matrix $B_{LL}$ can be permuted to an upper triangular block matrix $\tilde B_{LL}$, where for each diagonal block $\tilde B_{VV}$ of $\tilde B_{LL}$ the set of nodes $V$ is a strongly connected component in $G(M)_L$. Without loss of generality we now assume that $B_{LL}$ is an upper triangular block matrix. From Proposition C.3 it follows that $M$ is uniquely solvable w.r.t. $L$ if and only if the matrix $A_{LL} = I_L - B_{LL}$ is invertible. Because $B_{LL}$ is an upper triangular block matrix, so is $A_{LL}$, where for each diagonal block $A_{VV}$ of $A_{LL}$ the set of nodes $V$ is a strongly connected component in $G(M)_L$. Since an upper triangular block matrix $A_{LL}$ is invertible if and only if every diagonal block in $A_{LL}$ is invertible, we have that $M$ is uniquely solvable w.r.t. $L$ if and only if $M$ is uniquely solvable w.r.t. each strongly connected component in $G(M)_L$.

PROOF OF PROPOSITION C.5. By the definition of marginalization and Proposition C.3, the marginal causal mechanism $\tilde f$ is given by

$\tilde f(x_O, e) := f_O(x_O, g_L(x_O, e), e) = B_{OO}x_O + B_{OL}g_L(x_O, e) + \Gamma_{OJ}e = [B_{OO} + B_{OL}A_{LL}^{-1}B_{LO}]x_O + [B_{OL}A_{LL}^{-1}\Gamma_{LJ} + \Gamma_{OJ}]e.$

From Propositions C.4 and 5.11 it follows that the marginalization respects the latent projection.

E.2. Proofs of the main text

Section 2

PROOF OF PROPOSITION 2.11. Let $i \in \mathcal I$.
Note that Definition 2.6 can alternatively be formulated as follows: for $k \in \mathcal I \cup \mathcal J$, $k \notin \mathrm{pa}(i)$ if and only if there exists a measurable mapping $\hat f_i : \mathcal X \times \mathcal E \to \mathcal X_i$ such that for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$, $x_i = f_i(x,e) \iff x_i = \hat f_i(x,e)$, and either $k \in \mathcal I$ and there exists $\hat x_k \in \mathcal X_k$ such that $\hat f_i(x,e) = \hat f_i(x_{\setminus k}, \hat x_k, e)$ for all $x \in \mathcal X$, $e \in \mathcal E$, or $k \in \mathcal J$ and there exists $\hat e_k \in \mathcal E_k$ such that $\hat f_i(x,e) = \hat f_i(x, e_{\setminus k}, \hat e_k)$ for all $x \in \mathcal X$, $e \in \mathcal E$. By repeatedly applying (this formulation of) Definition 2.6 to all $k \notin \mathrm{pa}(i)$, we obtain the existence of a measurable mapping $\tilde f_i : \mathcal X \times \mathcal E \to \mathcal X_i$ and $\hat x_{\setminus \mathrm{pa}(i)} \in \mathcal X_{\setminus \mathrm{pa}(i)}$, $\hat e_{\setminus \mathrm{pa}(i)} \in \mathcal E_{\setminus \mathrm{pa}(i)}$ such that for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$, $x_i = f_i(x,e) \iff x_i = \tilde f_i(x,e)$, and for all $e \in \mathcal E$ and all $x \in \mathcal X$,

$\tilde f_i(x,e) = \tilde f_i(x_{\mathrm{pa}(i)}, \hat x_{\setminus \mathrm{pa}(i)}, e_{\mathrm{pa}(i)}, \hat e_{\setminus \mathrm{pa}(i)}).$

Define the SCM $\tilde M$ as $M$ except that its causal mechanism is $\tilde f$ instead of $f$. Then $\tilde M$ is structurally minimal and equivalent to $M$.

PROOF OF PROPOSITION 2.14. The $\mathrm{do}(I, \xi_I)$ operation on $M$ completely removes the functional dependence on $x$ and $e$ from the components $f_i$ for $i \in I$, and hence removes the corresponding incoming directed and bidirected edges on the nodes in $I$ from the (augmented) graph.

PROOF OF PROPOSITION 2.15. The first statement follows from Definitions 2.12 and 2.13. For the second statement, note that a perfect intervention can only remove parental relations, and therefore will never introduce a cycle.

PROOF OF PROPOSITION 2.19. This follows directly from Definitions 2.17 and 2.18.

PROOF OF PROPOSITION 2.20.
The additional edges introduced by the twin operation cannot lead to a directed cycle involving both copied and original nodes, because there are no edges pointing from copied nodes to original nodes (i.e., of the form $i' \to v$ with $i' \in \mathcal I'$ and $v \in \mathcal V$). Directed cycles involving only original nodes are absent by assumption, and directed cycles involving only copied nodes as well, since they would correspond to a directed cycle in the original directed graph.

PROOF OF PROPOSITION 2.21. It suffices to prove the property for directed graphs, since the property for SCMs follows directly from Definitions 2.12 and 2.17. Applying the intervention $\mathrm{do}(I)$ to the graph $G$ removes all the incoming edges of the nodes in $I$. Now, if we perform the twin operation w.r.t. $I$ on this graph $\mathrm{do}(I)(G)$, then we copy the same edges as if we had twinned the graph $G$ w.r.t. $I$, except those edges that point to one of the nodes in $I$. Hence, if we apply the intervention $\mathrm{do}(I \cup I')$ to the graph $\mathrm{twin}(I)(G)$, which removes all incoming edges of both $I$ and its copy $I'$, then we clearly obtain the same graph.

Section 3

PROOF OF THEOREM 3.2. First we define the solution space $\mathcal S(M)$ of $M$ by

$\mathcal S(M) := \{(e, x) \in \mathcal E \times \mathcal X : x = f(x, e)\}.$

This is a measurable set, since $\mathcal S(M) = h^{-1}(\Delta)$, where $h : \mathcal E \times \mathcal X \to \mathcal X \times \mathcal X$ is the measurable mapping defined by $h(e, x) = (x, f(x, e))$ and $\Delta$ is the set $\{(x, x) : x \in \mathcal X\}$, which is measurable since $\mathcal X$ is Hausdorff. Note that

$A := \mathrm{pr}_{\mathcal E}(\mathcal S(M)) = \{e \in \mathcal E : \exists x \in \mathcal X \text{ s.t. } x = f(x, e)\}$

is an analytic set, because the projection $\mathrm{pr}_{\mathcal E} : \mathcal E \times \mathcal X \to \mathcal E$ is a measurable mapping between standard measurable spaces (Lemma F.3). Suppose that (1) holds, that is, $M$ has a solution. Then there exists a pair of random variables $(E, X) : \Omega \to \mathcal E \times \mathcal X$ such that $X = f(X, E)$ $P$-a.s. Note that $\{\omega \in \Omega : X(\omega) = f(X(\omega), E(\omega))\} \subseteq \{\omega \in \Omega : \exists x \in \mathcal X$ s.t.
$x = f(x, E(\omega))\} \subseteq E^{-1}\big(\{e \in \mathcal E : \exists x \in \mathcal X \text{ s.t. } x = f(x, e)\}\big) = E^{-1}(A)$. By Lemma F.6, $A$ is $P_{\mathcal E}$-measurable because it is analytic, and we can write $A = B \mathbin{\dot\cup} N$ with $B \subseteq \mathcal E$ measurable and $N$ a $P_{\mathcal E}$-null set. Hence $E^{-1}(A) = E^{-1}(B) \cup E^{-1}(N)$, where $E^{-1}(N)$ is a $P$-null set. Therefore, $E^{-1}(B) \supseteq \{\omega \in \Omega : X(\omega) = f(X(\omega), E(\omega))\} \setminus E^{-1}(N)$, which implies that $P(E^{-1}(B)) = 1$. Hence, $\mathcal E \setminus A$ is a $P_{\mathcal E}$-null set. In other words, for $P_{\mathcal E}$-almost every $e \in \mathcal E$ the structural equations $x = f(x, e)$ have a solution $x \in \mathcal X$, that is, (2) holds.

Suppose that (2) holds. Then $\mathcal E \setminus \mathrm{pr}_{\mathcal E}(\mathcal S(M))$ is a $P_{\mathcal E}$-null set. By application of the measurable selection theorem, Theorem F.8, there exists a measurable $g : \mathcal E \to \mathcal X$ such that for $P_{\mathcal E}$-almost all $e \in \mathcal E$, $g(e) = f(g(e), e)$. Hence, there exists a measurable mapping $g : \mathcal E \to \mathcal X$ such that for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$

$x = g(e) \implies x = f(x, e),$

which we call property (A). Let $\tilde f : \mathcal X \times \mathcal E \to \mathcal X$ be the causal mechanism of a structurally minimal SCM that is equivalent to $M$ (see Proposition 2.11). In particular, for any $\varepsilon_{\setminus \mathrm{pa}(\mathcal I)} \in \mathcal E_{\setminus \mathrm{pa}(\mathcal I)}$, we have that $\tilde f(x, e) = \tilde f(x, e_{\mathrm{pa}(\mathcal I)}, \varepsilon_{\setminus \mathrm{pa}(\mathcal I)})$ for all $x \in \mathcal X$ and all $e \in \mathcal E$. This means that we may also consider $\tilde f$ as a mapping $\tilde f : \mathcal X \times \mathcal E_{\mathrm{pa}(\mathcal I)} \to \mathcal X$. By applying Lemma F.10 to the canonical projection $\mathrm{pr}_{\mathcal E_{\mathrm{pa}(\mathcal I)}} : \mathcal E \to \mathcal E_{\mathrm{pa}(\mathcal I)}$ and using the equivalence of $f$ and $\tilde f$, we obtain that for $P_{\mathcal E_{\mathrm{pa}(\mathcal I)}}$-almost all $e_{\mathrm{pa}(\mathcal I)} \in \mathcal E_{\mathrm{pa}(\mathcal I)}$ there exists $x \in \mathcal X$ with $x = \tilde f(x, e_{\mathrm{pa}(\mathcal I)})$. By applying the implication (2) $\implies$ (A) to $\mathcal E_{\mathrm{pa}(\mathcal I)}$ and $\tilde f$, we conclude the existence of a measurable $g : \mathcal E_{\mathrm{pa}(\mathcal I)} \to \mathcal X$ such that for $P_{\mathcal E_{\mathrm{pa}(\mathcal I)}}$-almost all $e_{\mathrm{pa}(\mathcal I)} \in \mathcal E_{\mathrm{pa}(\mathcal I)}$, $g(e_{\mathrm{pa}(\mathcal I)}) = \tilde f(g(e_{\mathrm{pa}(\mathcal I)}), e_{\mathrm{pa}(\mathcal I)})$. Once more using Lemma F.10, we obtain that for $P_{\mathcal E}$-almost all $e \in \mathcal E$, $g(e_{\mathrm{pa}(\mathcal I)}) = f(g(e_{\mathrm{pa}(\mathcal I)}), e)$. In other words, (3) holds.
Lastly, suppose that (3) holds, that is, there exists a measurable solution function $g : \mathcal E_{\mathrm{pa}(\mathcal I)} \to \mathcal X$. Then the measurable mappings $E : \mathcal E \to \mathcal E$ and $X : \mathcal E \to \mathcal X$, defined by $E(e) := e$ and $X(e) := g(e_{\mathrm{pa}(\mathcal I)})$, respectively, define a pair of random variables $(X, E)$ such that $X = f(X, E)$ holds a.s., and hence $(X, E)$ is a solution. Hence (1) holds.

PROOF OF PROPOSITION 3.4. Let $\tilde f : \mathcal X \times \mathcal E \to \mathcal X$ be the causal mechanism of a structurally minimal SCM $\tilde M$ that is equivalent to $M$ (see Proposition 2.11). For a subset $O \subseteq \mathcal I$, consider the induced subgraph $G^a(M)_O$ of the augmented graph $G^a(M)$ on $O$. The acyclicity of $G^a(M)$ implies that the induced subgraph $G^a(M)_O$ is acyclic, and hence there exists a topological ordering of the nodes $O$. We can substitute the components $\tilde f_i$ of the causal mechanism $\tilde f$ for $i \in O$ into each other along this topological ordering. This gives a measurable solution function $g_O : \mathcal X_{\mathrm{pa}(O) \setminus O} \times \mathcal E_{\mathrm{pa}(O)} \to \mathcal X_O$ for $\tilde M$, and hence for $M$. It is clear from the acyclic structure that this mapping $g_O$ is independent of the choice of the topological ordering and is the only solution function for $M$. Therefore, $\tilde M$ is uniquely solvable w.r.t. $O$, and so is $M$.

PROOF OF PROPOSITION 3.7. This follows immediately from Definitions 2.7 and 3.3.

PROOF OF THEOREM 3.6. Suppose that (1) holds. By Proposition B.1 there exists a measurable solution function $g_O : \mathcal X_{\mathrm{pa}(O) \setminus O} \times \mathcal E_{\mathrm{pa}(O)} \to \mathcal X_O$ for $M$ w.r.t. $O$. Then for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x_{\setminus O} \in \mathcal X_{\setminus O}$ we have that $g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)})$ is a solution of $x_O = f_O(x, e)$. Hence, because of (1), for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x_{\setminus O} \in \mathcal X_{\setminus O}$ we have that $x_O = f_O(x, e)$ implies $x_O = g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)})$. Thus, $M$ is uniquely solvable w.r.t. $O$, that is, (2) holds.

Suppose that (2) holds.
Let $g_O : \mathcal X_{\mathrm{pa}(O) \setminus O} \times \mathcal E_{\mathrm{pa}(O)} \to \mathcal X_O$ be a measurable solution function for $M$ w.r.t. $O$. Then, for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$

$x_O = g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)}) \iff x_O = f_O(x, e).$

This implies (1).

For the last statement, assume that $M$ is uniquely solvable. Let $g : \mathcal E_{\mathrm{pa}(\mathcal I)} \to \mathcal X$ be a measurable solution function. Then there exists a measurable set $B \subseteq \mathcal E$ with $P_{\mathcal E}(B) = 1$ such that for all $e \in B$,

$\forall x \in \mathcal X : x = f(x, e) \implies x = g(e_{\mathrm{pa}(\mathcal I)}).$

The existence of a solution for $M$ follows directly from Theorem 3.2. Each solution $(X, E) : \Omega \to \mathcal X \times \mathcal E$ of $M$ satisfies $X(\omega) = f(X(\omega), E(\omega))$ $P$-a.s. In addition, it satisfies $E(\omega) \in B$ $P$-a.s., since $P \circ E^{-1} = P_{\mathcal E}$. Hence, it satisfies $X(\omega) = g(E(\omega)_{\mathrm{pa}(\mathcal I)})$ $P$-a.s. Thus for every solution $(X, E)$ the associated observational distribution is the push-forward of $P_{\mathcal E}$ under $g \circ \mathrm{pr}_{\mathrm{pa}(\mathcal I)}$.

PROOF OF PROPOSITION 3.8. Let $g_O : \mathcal X_{\mathrm{pa}(O) \setminus O} \times \mathcal E_{\mathrm{pa}(O)} \to \mathcal X_O$ be a measurable solution function for $M$ w.r.t. $O$. Then the mapping $\tilde g_{O \cup I} : \mathcal E_{\mathrm{pa}(O)} \to \mathcal X_{O \cup I}$ defined by

$\tilde g_{O \cup I}(e_{\mathrm{pa}(O)}) := \big(g_O(\xi_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)}), \xi_I\big)$

is a measurable solution function for the SCM $M_{\mathrm{do}(I, \xi_I)}$ w.r.t. $O \cup I$. If $M$ is (uniquely) solvable w.r.t. $O$, then it follows that $M_{\mathrm{do}(I,\xi_I)}$ is (uniquely) solvable w.r.t. $O \cup I$.

PROOF OF PROPOSITION 3.10. It suffices to show that solvability of $M$ w.r.t. $O$ implies ancestral solvability w.r.t. $O$. Solvability of $M$ w.r.t. $O$ implies that there exists a measurable mapping $g_O : \mathcal X_{\mathrm{pa}(O) \setminus O} \times \mathcal E_{\mathrm{pa}(O)} \to \mathcal X_O$ such that for $P_{\mathcal E}$-almost every $e \in \mathcal E$ and for all $x \in \mathcal X$

$x_O = g_O(x_{\mathrm{pa}(O) \setminus O}, e_{\mathrm{pa}(O)}) \implies x_O = f_O(x, e).$

Let $\tilde f : \mathcal X \times \mathcal E \to \mathcal X$ be the causal mechanism of a structurally minimal SCM $\tilde M$ that is equivalent to $M$ (see Proposition 2.11). Let $P := \mathrm{an}_{G(M)_O}(A)$ for some $A \subseteq O$.
Then for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ \begin{cases} x_P = (g_{\mathcal{O}})_P(x_{\mathrm{pa}(\mathcal{O}) \setminus \mathcal{O}}, e_{\mathrm{pa}(\mathcal{O})}) \\ x_{\mathcal{O} \setminus P} = (g_{\mathcal{O}})_{\mathcal{O} \setminus P}(x_{\mathrm{pa}(\mathcal{O}) \setminus \mathcal{O}}, e_{\mathrm{pa}(\mathcal{O})}) \end{cases} \implies \begin{cases} x_P = \tilde f_P(x_{\mathrm{pa}(P)}, e_{\mathrm{pa}(P)}) \\ x_{\mathcal{O} \setminus P} = \tilde f_{\mathcal{O} \setminus P}(x_{\mathrm{pa}(\mathcal{O} \setminus P)}, e_{\mathrm{pa}(\mathcal{O} \setminus P)}). \end{cases} \]
Since $\mathrm{pa}(P) \setminus P \subseteq \mathrm{pa}(\mathcal{O}) \setminus \mathcal{O}$, we have in particular that for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_P = (g_{\mathcal{O}})_P(x_{\mathrm{pa}(\mathcal{O}) \setminus \mathcal{O}}, e_{\mathrm{pa}(\mathcal{O})}) \implies x_P = \tilde f_P(x_{\mathrm{pa}(P)}, e_{\mathrm{pa}(P)}). \]
This implies that the mapping $(g_{\mathcal{O}})_P$ cannot depend on elements outside $\mathrm{pa}(P)$. Moreover, it follows from the definition of $P$ that $(\mathrm{pa}(\mathcal{O}) \setminus \mathcal{O}) \cap \mathrm{pa}(P) = \mathrm{pa}(P) \setminus P$, and thus we have $\mathrm{pa}(\mathcal{O}) \setminus \mathcal{O} = (\mathrm{pa}(P) \setminus P) \cup (\mathrm{pa}(\mathcal{O}) \setminus (\mathcal{O} \cup \mathrm{pa}(P)))$. Now pick an element $\hat x_{\mathrm{pa}(\mathcal{O}) \setminus (\mathcal{O} \cup \mathrm{pa}(P))} \in \mathcal{X}_{\mathrm{pa}(\mathcal{O}) \setminus (\mathcal{O} \cup \mathrm{pa}(P))}$ and define the mapping $\tilde g_P : \mathcal{X}_{\mathrm{pa}(P) \setminus P} \times \mathcal{E}_{\mathrm{pa}(P)} \to \mathcal{X}_P$ by
\[ \tilde g_P(x_{\mathrm{pa}(P) \setminus P}, e_{\mathrm{pa}(P)}) := (g_{\mathcal{O}})_P(x_{\mathrm{pa}(P) \setminus P}, \hat x_{\mathrm{pa}(\mathcal{O}) \setminus (\mathcal{O} \cup \mathrm{pa}(P))}, e_{\mathrm{pa}(\mathcal{O})}). \]
Then, for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_P = \tilde g_P(x_{\mathrm{pa}(P) \setminus P}, e_{\mathrm{pa}(P)}) \iff x_P = (g_{\mathcal{O}})_P(x_{\mathrm{pa}(\mathcal{O}) \setminus \mathcal{O}}, e_{\mathrm{pa}(\mathcal{O})}). \]
Together this gives that for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_P = \tilde g_P(x_{\mathrm{pa}(P) \setminus P}, e_{\mathrm{pa}(P)}) \implies x_P = \tilde f_P(x_{\mathrm{pa}(P)}, e_{\mathrm{pa}(P)}), \]
which is equivalent to the statement that $M$ is solvable w.r.t. $\mathrm{an}_{\mathcal{G}(M)_{\mathcal{O}}}(A)$.

Section 4

LEMMA E.1. Let $M$ be an SCM that is uniquely solvable w.r.t. two subsets $A, B \subseteq \mathcal{I}$ that satisfy $A \subseteq B$ and $\mathrm{pa}(A) \setminus A \subseteq \mathrm{pa}(B) \setminus B$. Let $g_A : \mathcal{X}_{\mathrm{pa}(A) \setminus A} \times \mathcal{E}_{\mathrm{pa}(A)} \to \mathcal{X}_A$ and $g_B : \mathcal{X}_{\mathrm{pa}(B) \setminus B} \times \mathcal{E}_{\mathrm{pa}(B)} \to \mathcal{X}_B$ be measurable solution functions for $M$ w.r.t. $A$ and $B$, respectively. Then for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ g_A(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}) = (g_B)_A(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)}). \]

PROOF. Without loss of generality, we assume that $M$ is structurally minimal (see Proposition 2.11).
Let $\bar{\mathcal{E}} \subseteq \mathcal{E}$ be a measurable set with $P_{\mathcal{E}}(\bar{\mathcal{E}}) = 1$ such that for all $e \in \bar{\mathcal{E}}$ and all $x \in \mathcal{X}$:
\[ x_A = g_A(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}) \iff x_A = f_A(x_{\mathrm{pa}(A)}, e_{\mathrm{pa}(A)}) \]
and
\[ x_B = g_B(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)}) \iff x_B = f_B(x_{\mathrm{pa}(B)}, e_{\mathrm{pa}(B)}). \]
Now let $e \in \bar{\mathcal{E}}$ and let $x_{A \cup \mathrm{pa}(B) \setminus B} \in \mathcal{X}_{A \cup \mathrm{pa}(B) \setminus B}$. Then
\begin{align*}
x_A = (g_B)_A(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)})
&\implies \begin{cases} x_A = (g_B)_A(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)}) \\ \exists x_{B \setminus A} \in \mathcal{X}_{B \setminus A} : x_{B \setminus A} = (g_B)_{B \setminus A}(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)}) \end{cases} \\
&\implies \exists x_{B \setminus A} \in \mathcal{X}_{B \setminus A} : x_B = g_B(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)}) \\
&\implies \exists x_{B \setminus A} \in \mathcal{X}_{B \setminus A} : x_B = f_B(x_{\mathrm{pa}(B)}, e_{\mathrm{pa}(B)}) \\
&\implies \exists x_{B \setminus A} \in \mathcal{X}_{B \setminus A} : x_A = f_A(x_{\mathrm{pa}(A)}, e_{\mathrm{pa}(A)}) \\
&\implies x_A = f_A(x_{\mathrm{pa}(A)}, e_{\mathrm{pa}(A)}) \\
&\implies x_A = g_A(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}),
\end{align*}
where the existential quantifier could be omitted because the expression it binds does not depend on $x_{B \setminus A}$ (from the assumptions it follows that $(A \cup \mathrm{pa}(A)) \cap (B \setminus A) = \emptyset$). Hence, for all $e \in \bar{\mathcal{E}}$ and all $x_{A \cup \mathrm{pa}(B) \setminus B} \in \mathcal{X}_{A \cup \mathrm{pa}(B) \setminus B}$,
\[ x_A = (g_B)_A(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)}) \implies x_A = g_A(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}), \]
and hence, for all $e \in \bar{\mathcal{E}}$ and all $x_{A \cup \mathrm{pa}(B) \setminus B} \in \mathcal{X}_{A \cup \mathrm{pa}(B) \setminus B}$,
\[ (g_B)_A(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)}) = g_A(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}). \]
Since this expression does not depend on $x_{(B \setminus A) \cup \mathcal{I} \setminus (B \cup \mathrm{pa}(B))}$, from Lemma F.11.(2) we conclude that for all $e \in \bar{\mathcal{E}}$ and all $x \in \mathcal{X}$,
\[ (g_B)_A(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)}) = g_A(x_{\mathrm{pa}(A) \setminus A}, e_{\mathrm{pa}(A)}). \]

LEMMA E.2. An SCM $M$ is observationally equivalent to $M^{\mathrm{twin}}$ w.r.t. $\mathcal{O} \subseteq \mathcal{I}$.

PROOF. Let $(X, E)$ be a solution of $M$; then $((X, X), E)$ is a solution of $M^{\mathrm{twin}}$. Conversely, let $((X, X'), E)$ be a solution of $M^{\mathrm{twin}}$; then $(X, E)$ is a solution of $M$.

PROOF OF PROPOSITION 4.6. First we show that equivalence implies counterfactual equivalence w.r.t. $\mathcal{O}$.
The twin operation preserves the equivalence relation on SCMs, and since equivalent SCMs are interventionally equivalent w.r.t. every subset, the two equivalent twin SCMs have to be interventionally equivalent w.r.t. $\mathcal{O} \cup \mathcal{O}'$ for every $\mathcal{O} \subseteq \mathcal{I}$, with $\mathcal{O}'$ the copy of $\mathcal{O}$ in $\mathcal{I}'$.

Now, let $M$ and $\tilde M$ be counterfactually equivalent w.r.t. $\mathcal{O}$. Then $M^{\mathrm{twin}}$ and $\tilde M^{\mathrm{twin}}$ are interventionally equivalent w.r.t. $\mathcal{O} \cup \mathcal{O}'$. Thus for $I \subseteq \mathcal{O}$, $I' \subseteq \mathcal{O}'$ the copy of $I$ and $\xi_{I'} = \xi_I \in \mathcal{X}_I$, the SCMs $M^{\mathrm{twin}}_{\mathrm{do}(I \cup I', \xi_{I \cup I'})}$ and $\tilde M^{\mathrm{twin}}_{\mathrm{do}(I \cup I', \xi_{I \cup I'})}$ are observationally equivalent w.r.t. $\mathcal{O} \cup \mathcal{O}'$. In particular, they are observationally equivalent w.r.t. $\mathcal{O}$. From Proposition 2.21 we have that $M^{\mathrm{twin}}_{\mathrm{do}(I \cup I', \xi_{I \cup I'})} = (M_{\mathrm{do}(I, \xi_I)})^{\mathrm{twin}}$ and $\tilde M^{\mathrm{twin}}_{\mathrm{do}(I \cup I', \xi_{I \cup I'})} = (\tilde M_{\mathrm{do}(I, \xi_I)})^{\mathrm{twin}}$, and together with Lemma E.2 this gives that $M_{\mathrm{do}(I, \xi_I)}$ and $\tilde M_{\mathrm{do}(I, \xi_I)}$ are observationally equivalent w.r.t. $\mathcal{O}$.

Section 5

LEMMA E.3. Let $M$ be an SCM. Let $B \subseteq \mathcal{I}$ and $A \subseteq \mathcal{I} \cup \mathcal{J}$ such that $(\mathrm{pa}(B) \setminus B) \subseteq A$ and $B \cap A = \emptyset$. Assume that $g_B : \mathcal{X}_A \times \mathcal{E}_A \to \mathcal{X}_B$ is a measurable function such that for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_B = f_B(x_{\mathrm{pa}(B)}, e_{\mathrm{pa}(B)}) \iff x_B = g_B(x_A, e_A). \]
Then $M$ is uniquely solvable w.r.t. $B$.

PROOF. Assume that for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_B = f_B(x_{\mathrm{pa}(B)}, e_{\mathrm{pa}(B)}) \iff x_B = g_B(x_A, e_A). \]
Let $C := A \setminus (\mathrm{pa}(B) \setminus B)$; then by Lemma F.11.(7) there exist $\hat e_C \in \mathcal{E}_C$ and $\hat x_C \in \mathcal{X}_C$ such that for $P_{\mathcal{E}_{\mathcal{J} \setminus C}}$-almost every $e_{\mathcal{J} \setminus C} \in \mathcal{E}_{\mathcal{J} \setminus C}$ and for all $x_{\mathcal{I} \setminus C} \in \mathcal{X}_{\mathcal{I} \setminus C}$,
\[ x_B = f_B(x_{\mathrm{pa}(B)}, e_{\mathrm{pa}(B)}) \iff x_B = g_B(x_{\mathrm{pa}(B) \setminus B}, \hat x_C, e_{\mathrm{pa}(B)}, \hat e_C). \]
Defining the mapping $h_B : \mathcal{X}_{\mathrm{pa}(B) \setminus B} \times \mathcal{E}_{\mathrm{pa}(B)} \to \mathcal{X}_B$ by
\[ h_B(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)}) := g_B(x_{\mathrm{pa}(B) \setminus B}, \hat x_C, e_{\mathrm{pa}(B)}, \hat e_C), \]
where we picked $\hat e_C \in \mathcal{E}_C$ and $\hat x_C \in \mathcal{X}_C$ such that the above equivalence holds, and applying Lemma F.11.(6), we get that for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_B = f_B(x_{\mathrm{pa}(B)}, e_{\mathrm{pa}(B)}) \iff x_B = h_B(x_{\mathrm{pa}(B) \setminus B}, e_{\mathrm{pa}(B)}) \]
holds. Thus, $M$ is uniquely solvable w.r.t. $B$.

PROOF OF PROPOSITION 5.4. From unique solvability of $M$ w.r.t. $L_1$ it follows that there exists a mapping $g_{L_1} : \mathcal{X}_{\mathrm{pa}(L_1) \setminus L_1} \times \mathcal{E}_{\mathrm{pa}(L_1)} \to \mathcal{X}_{L_1}$ such that for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_{L_1} = g_{L_1}(x_{\mathrm{pa}(L_1) \setminus L_1}, e_{\mathrm{pa}(L_1)}) \iff x_{L_1} = f_{L_1}(x, e). \]
Let $\widehat{\mathrm{pa}}$ denote the parents in $\mathcal{G}^a(M_{\mathrm{marg}(L_1)})$. Note that $\widehat{\mathrm{pa}}(L_2) \setminus L_2 \subseteq \mathrm{pa}(L_1 \cup L_2) \setminus (L_1 \cup L_2)$. Let $\tilde f$ denote the marginal causal mechanism of a structurally minimal SCM that is equivalent to the marginalization $M_{\mathrm{marg}(L_1)}$ constructed from $g_{L_1}$ (see Proposition 2.11).

$\implies$: If $M_{\mathrm{marg}(L_1)}$ is uniquely solvable w.r.t. $L_2$, then there exists a mapping $\tilde g_{L_2} : \mathcal{X}_{\widehat{\mathrm{pa}}(L_2) \setminus L_2} \times \mathcal{E}_{\widehat{\mathrm{pa}}(L_2)} \to \mathcal{X}_{L_2}$ such that for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x_{\mathcal{I} \setminus L_1} \in \mathcal{X}_{\mathcal{I} \setminus L_1}$,
\[ x_{L_2} = \tilde g_{L_2}(x_{\widehat{\mathrm{pa}}(L_2) \setminus L_2}, e_{\widehat{\mathrm{pa}}(L_2)}) \iff x_{L_2} = f_{L_2}(g_{L_1}(x_{\mathrm{pa}(L_1) \setminus L_1}, e_{\mathrm{pa}(L_1)}), x_{\mathcal{I} \setminus L_1}, e). \]
Define the mapping $h : \mathcal{X}_{\mathrm{pa}(L_1 \cup L_2) \setminus (L_1 \cup L_2)} \times \mathcal{E}_{\mathrm{pa}(L_1 \cup L_2)} \to \mathcal{X}_{L_1 \cup L_2}$ by
\begin{align*}
(h_{L_1}, h_{L_2})(x_{\mathrm{pa}(L_1 \cup L_2) \setminus (L_1 \cup L_2)}, e_{\mathrm{pa}(L_1 \cup L_2)}) := \Big( & g_{L_1}\big((\tilde g_{L_2})_{\mathrm{pa}(L_1)}(x_{\widehat{\mathrm{pa}}(L_2) \setminus L_2}, e_{\widehat{\mathrm{pa}}(L_2)}), x_{\mathrm{pa}(L_1) \setminus (L_1 \cup L_2)}, e_{\mathrm{pa}(L_1)}\big), \\
& \tilde g_{L_2}(x_{\widehat{\mathrm{pa}}(L_2) \setminus L_2}, e_{\widehat{\mathrm{pa}}(L_2)}) \Big).
\end{align*}
Then for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\begin{align*}
\begin{cases} x_{L_1} = f_{L_1}(x, e) \\ x_{L_2} = f_{L_2}(x, e) \end{cases}
&\iff \begin{cases} x_{L_1} = g_{L_1}(x_{\mathrm{pa}(L_1) \setminus L_1}, e_{\mathrm{pa}(L_1)}) \\ x_{L_2} = f_{L_2}(x, e) \end{cases} \\
&\iff \begin{cases} x_{L_1} = g_{L_1}(x_{\mathrm{pa}(L_1) \setminus L_1}, e_{\mathrm{pa}(L_1)}) \\ x_{L_2} = f_{L_2}(g_{L_1}(x_{\mathrm{pa}(L_1) \setminus L_1}, e_{\mathrm{pa}(L_1)}), x_{\mathcal{I} \setminus L_1}, e) \end{cases} \\
&\iff \begin{cases} x_{L_1} = g_{L_1}(x_{\mathrm{pa}(L_1) \setminus L_1}, e_{\mathrm{pa}(L_1)}) \\ x_{L_2} = \tilde g_{L_2}(x_{\widehat{\mathrm{pa}}(L_2) \setminus L_2}, e_{\widehat{\mathrm{pa}}(L_2)}) \end{cases} \\
&\iff \begin{cases} x_{L_1} = g_{L_1}\big((\tilde g_{L_2})_{\mathrm{pa}(L_1)}(x_{\widehat{\mathrm{pa}}(L_2) \setminus L_2}, e_{\widehat{\mathrm{pa}}(L_2)}), x_{\mathrm{pa}(L_1) \setminus (L_1 \cup L_2)}, e_{\mathrm{pa}(L_1)}\big) \\ x_{L_2} = \tilde g_{L_2}(x_{\widehat{\mathrm{pa}}(L_2) \setminus L_2}, e_{\widehat{\mathrm{pa}}(L_2)}) \end{cases} \\
&\iff \begin{cases} x_{L_1} = h_{L_1}(x_{\mathrm{pa}(L_1 \cup L_2) \setminus (L_1 \cup L_2)}, e_{\mathrm{pa}(L_1 \cup L_2)}) \\ x_{L_2} = h_{L_2}(x_{\mathrm{pa}(L_1 \cup L_2) \setminus (L_1 \cup L_2)}, e_{\mathrm{pa}(L_1 \cup L_2)}), \end{cases}
\end{align*}
where in the first equivalence we used unique solvability of $M$ w.r.t. $L_1$, in the second we used substitution, in the third we used unique solvability of $M_{\mathrm{marg}(L_1)}$ w.r.t. $L_2$, in the fourth we again used substitution, and in the last equivalence we used the definition of $h$. From this we conclude that $M$ is uniquely solvable w.r.t. $L_1 \cup L_2$. Hence, by definition it follows that $\mathrm{marg}(L_2) \circ \mathrm{marg}(L_1)(M) = \mathrm{marg}(L_1 \cup L_2)(M)$.

$\impliedby$: If $M$ is uniquely solvable w.r.t. $L_1 \cup L_2$, then there exists a mapping $h : \mathcal{X}_{\mathrm{pa}(L_1 \cup L_2) \setminus (L_1 \cup L_2)} \times \mathcal{E}_{\mathrm{pa}(L_1 \cup L_2)} \to \mathcal{X}_{L_1 \cup L_2}$ such that for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_{L_1 \cup L_2} = h(x_{\mathrm{pa}(L_1 \cup L_2) \setminus (L_1 \cup L_2)}, e_{\mathrm{pa}(L_1 \cup L_2)}) \iff x_{L_1 \cup L_2} = f_{L_1 \cup L_2}(x, e). \]
Then, for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\begin{align*}
\begin{cases} x_{L_1} = h_{L_1}(x_{\mathrm{pa}(L_1 \cup L_2) \setminus (L_1 \cup L_2)}, e_{\mathrm{pa}(L_1 \cup L_2)}) \\ x_{L_2} = h_{L_2}(x_{\mathrm{pa}(L_1 \cup L_2) \setminus (L_1 \cup L_2)}, e_{\mathrm{pa}(L_1 \cup L_2)}) \end{cases}
&\iff \begin{cases} x_{L_1} = f_{L_1}(x, e) \\ x_{L_2} = f_{L_2}(x, e) \end{cases} \\
&\iff \begin{cases} x_{L_1} = g_{L_1}(x_{\mathrm{pa}(L_1) \setminus L_1}, e_{\mathrm{pa}(L_1)}) \\ x_{L_2} = f_{L_2}(x, e) \end{cases} \\
&\iff \begin{cases} x_{L_1} = g_{L_1}(x_{\mathrm{pa}(L_1) \setminus L_1}, e_{\mathrm{pa}(L_1)}) \\ x_{L_2} = f_{L_2}(g_{L_1}(x_{\mathrm{pa}(L_1) \setminus L_1}, e_{\mathrm{pa}(L_1)}), x_{\mathcal{I} \setminus L_1}, e) \end{cases} \\
&\iff \begin{cases} x_{L_1} = g_{L_1}(x_{\mathrm{pa}(L_1) \setminus L_1}, e_{\mathrm{pa}(L_1)}) \\ x_{L_2} = \tilde f_{L_2}(x_{\widehat{\mathrm{pa}}(L_2)}, e_{\widehat{\mathrm{pa}}(L_2)}). \end{cases}
\end{align*}
This gives, for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x_{\mathcal{I} \setminus L_1} \in \mathcal{X}_{\mathcal{I} \setminus L_1}$,
\[ x_{L_2} = h_{L_2}(x_{\mathrm{pa}(L_1 \cup L_2) \setminus (L_1 \cup L_2)}, e_{\mathrm{pa}(L_1 \cup L_2)}) \iff x_{L_2} = \tilde f_{L_2}(x_{\widehat{\mathrm{pa}}(L_2)}, e_{\widehat{\mathrm{pa}}(L_2)}). \]
Now apply Lemma E.3 to conclude that $M_{\mathrm{marg}(L_1)}$ is uniquely solvable w.r.t. $L_2$.

PROOF OF PROPOSITION 5.5. The commutation relation with the perfect intervention follows straightforwardly from the definitions of perfect intervention and marginalization and the fact that if $M$ is uniquely solvable w.r.t. $L$, then $M_{\mathrm{do}(I, \xi_I)}$ is also uniquely solvable w.r.t. $L$, since the structural equations for the variables $L$ are the same for $M$ and $M_{\mathrm{do}(I, \xi_I)}$. The commutation relation with the twin operation follows straightforwardly from the definitions of the twin operation and marginalization and the fact that if $M$ is uniquely solvable w.r.t. $L$, then $\mathrm{twin}(M)$ is uniquely solvable w.r.t. $L \cup L'$, where $L'$ is the copy of $L$ in $\mathcal{I}'$.

LEMMA E.4. Given an SCM $M$ and a subset $L \subseteq \mathcal{I}$ such that $M$ is uniquely solvable w.r.t. $L$. Then $M$ and $\mathrm{marg}(L)(M)$ are observationally equivalent w.r.t. $\mathcal{I} \setminus L$.

PROOF. Let $\mathcal{O} := \mathcal{I} \setminus L$. From unique solvability w.r.t.
$L$ it follows that for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\begin{align*}
\begin{cases} x_L = f_L(x, e) \\ x_{\mathcal{O}} = f_{\mathcal{O}}(x, e) \end{cases}
&\iff \begin{cases} x_L = g_L(x_{\mathrm{pa}(L) \setminus L}, e_{\mathrm{pa}(L)}) \\ x_{\mathcal{O}} = f_{\mathcal{O}}(g_L(x_{\mathrm{pa}(L) \setminus L}, e_{\mathrm{pa}(L)}), x_{\mathcal{O}}, e) \end{cases} \\
&\iff \begin{cases} x_L = g_L(x_{\mathrm{pa}(L) \setminus L}, e_{\mathrm{pa}(L)}) \\ x_{\mathcal{O}} = \tilde f(x_{\mathcal{O}}, e), \end{cases}
\end{align*}
where $\tilde f$ is the marginal causal mechanism of $M_{\mathrm{marg}(L)}$ constructed from a measurable solution function $g_L : \mathcal{X}_{\mathrm{pa}(L) \setminus L} \times \mathcal{E}_{\mathrm{pa}(L)} \to \mathcal{X}_L$ for $M$ w.r.t. $L$. Hence, a solution $(X, E)$ of $M$ satisfies $X_{\mathcal{O}} = \tilde f(X_{\mathcal{O}}, E)$ a.s. Conversely, if $(\tilde X_{\mathcal{O}}, E)$ is a solution of the marginal SCM $M_{\mathrm{marg}(L)}$, then with $\tilde X_L := g_L(\tilde X_{\mathrm{pa}(L) \setminus L}, E_{\mathrm{pa}(L)})$, the random variables $(X, E) := (\tilde X_{\mathcal{O}}, \tilde X_L, E)$ are a solution of $M$.

PROOF OF THEOREM 5.6. The observational equivalence follows from Lemma E.4. Using both Lemma E.4 and Proposition 5.5, we can prove the interventional equivalence. Observe that from Proposition 5.5 we know that for a subset $I \subseteq \mathcal{I} \setminus L$ and a value $\xi_I \in \mathcal{X}_I$, $(\mathrm{marg}(L) \circ \mathrm{do}(I, \xi_I))(M)$ exists. By Lemma E.4 we know that $\mathrm{do}(I, \xi_I)(M)$ and $(\mathrm{marg}(L) \circ \mathrm{do}(I, \xi_I))(M)$ are observationally equivalent w.r.t. $\mathcal{O}$, and hence, by applying Proposition 5.5 again, $\mathrm{do}(I, \xi_I)(M)$ and $(\mathrm{do}(I, \xi_I) \circ \mathrm{marg}(L))(M)$ are observationally equivalent w.r.t. $\mathcal{O}$. This implies that $M$ and $\mathrm{marg}(L)(M)$ are interventionally equivalent w.r.t. $\mathcal{O}$. Lastly, we need to show that $\mathrm{twin}(M)$ and $(\mathrm{twin} \circ \mathrm{marg}(L))(M)$ are interventionally equivalent w.r.t. $(\mathcal{I} \cup \mathcal{I}') \setminus (L \cup L')$, where $L'$ is the copy of $L$ in $\mathcal{I}'$. From Proposition 5.5, $(\mathrm{twin} \circ \mathrm{marg}(L))(M)$ is equivalent to $(\mathrm{marg}(L \cup L') \circ \mathrm{twin})(M)$, and since we proved that $(\mathrm{marg}(L \cup L') \circ \mathrm{twin})(M)$ and $\mathrm{twin}(M)$ are interventionally equivalent w.r.t. $(\mathcal{I} \cup \mathcal{I}') \setminus (L \cup L')$, the result follows.

PROOF OF PROPOSITION 5.8. A proof similar to that of Theorem 1 in [15] works.

PROOF OF PROPOSITION 5.9.
First we prove the commutation relation with the perfect intervention. Observe that applying the $\mathrm{do}(I)$ operation to the latent projection $\mathrm{marg}(L)(\mathcal{G})$ removes all the incoming edges on the nodes $I$. Such an incoming edge at a node in $I$ in $\mathrm{marg}(L)(\mathcal{G})$ corresponds to a path in $\mathcal{G}$ that points to that node. But since $\mathrm{do}(I)(\mathcal{G})$ is just $\mathcal{G}$ with all the incoming edges on $I$ removed, the graph $(\mathrm{marg}(L) \circ \mathrm{do}(I))(\mathcal{G})$ also has all the incoming edges on the nodes $I$ removed.

Next, we prove the commutation relation with the twin operation. We denote the copy in $\mathcal{I}'$ of any node $i \in \mathcal{I}$ by $i'$, that is, $\mathcal{I}' = \{i' : i \in \mathcal{I}\}$. The edges in $(\mathrm{twin}(\mathcal{I} \setminus L) \circ \mathrm{marg}(L))(\mathcal{G})$ can be partitioned into three cases:
\begin{align*}
v \to w \quad & v \in \mathcal{J} \cup \mathcal{I} \setminus L, \; w \in \mathcal{J} \cup \mathcal{I} \setminus L, \; v \to w \in \mathrm{marg}(L)(\mathcal{G}), \\
v \to w' \quad & v \in \mathcal{J}, \; w \in \mathcal{I} \setminus L, \; v \to w \in \mathrm{marg}(L)(\mathcal{G}), \\
v' \to w' \quad & v \in \mathcal{I} \setminus L, \; w \in \mathcal{I} \setminus L, \; v \to w \in \mathrm{marg}(L)(\mathcal{G}),
\end{align*}
where $\mathcal{J} := \mathcal{V} \setminus \mathcal{I}$. Note that in $\mathrm{twin}(\mathcal{I})(\mathcal{G})$ there are no directed edges of the form $v' \to w$, by definition. Therefore, the edges in $(\mathrm{marg}(L \cup L') \circ \mathrm{twin}(\mathcal{I}))(\mathcal{G})$ can be partitioned into three cases:
\begin{align*}
v \to w \quad & v \in \mathcal{J} \cup \mathcal{I} \setminus L, \; w \in \mathcal{J} \cup \mathcal{I} \setminus L, \; v \to \ell_1 \to \cdots \to \ell_n \to w \in \mathrm{twin}(\mathcal{I})(\mathcal{G}), \\
v \to w' \quad & v \in \mathcal{J}, \; w \in \mathcal{I} \setminus L, \; v \to \ell_1' \to \cdots \to \ell_n' \to w' \in \mathrm{twin}(\mathcal{I})(\mathcal{G}), \\
v' \to w' \quad & v \in \mathcal{I} \setminus L, \; w \in \mathcal{I} \setminus L, \; v' \to \ell_1' \to \cdots \to \ell_n' \to w' \in \mathrm{twin}(\mathcal{I})(\mathcal{G}),
\end{align*}
where all $\ell_1, \ldots, \ell_n \in L$ and $\ell_1', \ldots, \ell_n' \in L'$; that is, the non-endpoint nodes on the directed paths in $\mathrm{twin}(\mathcal{I})(\mathcal{G})$ must either all lie in $L$ or all lie in $L'$. With the definition of $\mathrm{twin}(\mathcal{I})(\mathcal{G})$ we can rewrite this as follows:
\begin{align*}
v \to w \quad & v \in \mathcal{J} \cup \mathcal{I} \setminus L, \; w \in \mathcal{J} \cup \mathcal{I} \setminus L, \; v \to \ell_1 \to \cdots \to \ell_n \to w \in \mathcal{G}, \\
v \to w' \quad & v \in \mathcal{J}, \; w \in \mathcal{I} \setminus L, \; v \to \ell_1 \to \cdots \to \ell_n \to w \in \mathcal{G}, \\
v' \to w' \quad & v \in \mathcal{I} \setminus L, \; w \in \mathcal{I} \setminus L, \; v \to \ell_1 \to \cdots \to \ell_n \to w \in \mathcal{G},
\end{align*}
where all intermediate nodes $\ell_1, \ldots, \ell_n$ must lie in $L$. This corresponds exactly to the edges in $(\mathrm{twin}(\mathcal{I} \setminus L) \circ \mathrm{marg}(L))(\mathcal{G})$.

PROOF OF PROPOSITION 5.11.
Without loss of generality, we assume that $M$ is structurally minimal (see Proposition 2.11). Let $g_L$ be a measurable solution function for $M$ w.r.t. $L$ and denote by $M_{\mathrm{marg}(L)}$ the marginal SCM constructed from $g_L$. For $j \in \mathcal{I} \setminus L$, define $A_j := \mathrm{an}_{\mathcal{G}(M)_L}(\mathrm{pa}(j) \cap L) \subseteq L$ and let $\tilde g_{A_j}$ be a measurable solution function for $M$ w.r.t. $A_j$. Because $A_j \subseteq L$ and $\mathrm{pa}(A_j) \setminus A_j \subseteq \mathrm{pa}(L) \setminus L$, by Lemma E.1, for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ (g_L)_{A_j}(x_{\mathrm{pa}(L) \setminus L}, e_{\mathrm{pa}(L)}) = \tilde g_{A_j}(x_{\mathrm{pa}(A_j) \setminus A_j}, e_{\mathrm{pa}(A_j)}). \]
Therefore, the component $\tilde f_j$ of the marginal causal mechanism $\tilde f$ of $M_{\mathrm{marg}(L)}$ satisfies, for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\begin{align*}
\tilde f_j(x_{\mathcal{I} \setminus L}, e) &:= f_j\big((g_L)_{\mathrm{pa}(j)}(x_{\mathrm{pa}(L) \setminus L}, e_{\mathrm{pa}(L)}), x_{\mathrm{pa}(j) \setminus L}, e_{\mathrm{pa}(j)}\big) \\
&= f_j\big((\tilde g_{A_j})_{\mathrm{pa}(j) \cap L}(x_{\mathrm{pa}(A_j) \setminus A_j}, e_{\mathrm{pa}(A_j)}), x_{\mathrm{pa}(j) \setminus L}, e_{\mathrm{pa}(j)}\big).
\end{align*}
Hence, the endogenous parents of $j$ in $M_{\mathrm{marg}(L)}$ are a subset of $\big((\mathrm{pa}(A_j) \setminus A_j) \cup (\mathrm{pa}(j) \setminus L)\big) \cap \mathcal{I}$ and the exogenous parents of $j$ in $M_{\mathrm{marg}(L)}$ are a subset of $(\mathrm{pa}(A_j) \cup \mathrm{pa}(j)) \cap \mathcal{J}$. Hence, all parents of $j$ in $M_{\mathrm{marg}(L)}$ are a subset of those $k \in (\mathcal{I} \setminus L) \cup \mathcal{J}$ such that there exists a path $k \to \ell_1 \to \cdots \to \ell_n \to j$ in $\mathcal{G}^a(M)$ with $n \geq 0$ and $\ell_1, \ldots, \ell_n \in L$. Therefore, the augmented graph $\mathcal{G}^a(\mathrm{marg}(L)(M))$ is a subgraph of the latent projection $\mathrm{marg}(L)(\mathcal{G}^a(M))$. Hence,
\begin{align*}
\mathcal{G}(\mathrm{marg}(L)(M)) &= \mathrm{marg}(\mathcal{J})\big(\mathcal{G}^a(\mathrm{marg}(L)(M))\big) \subseteq \mathrm{marg}(\mathcal{J})\big(\mathrm{marg}(L)(\mathcal{G}^a(M))\big) \\
&= \mathrm{marg}(L)\big(\mathrm{marg}(\mathcal{J})(\mathcal{G}^a(M))\big) = \mathrm{marg}(L)(\mathcal{G}(M)),
\end{align*}
and we conclude that the graph $\mathcal{G}(\mathrm{marg}(L)(M))$ is also a subgraph of the latent projection $\mathrm{marg}(L)(\mathcal{G}(M))$.

Section 6

PROOF OF THEOREM 6.3. This follows directly from Theorems A.7 and A.21.

Section 7

PROOF OF PROPOSITION 7.1. We define $\tilde M := M_{\mathrm{do}(I, \xi_I)}$, $\widetilde{\mathrm{pa}} := \mathrm{pa}_{\mathcal{G}^a(\tilde M)}$ and $A := \mathrm{an}_{\mathcal{G}(\tilde M) \setminus i}(j)$. Suppose that $i \to j \notin \mathrm{marg}(\mathcal{I} \setminus \mathcal{O})(\mathcal{G}(M))$ and assume that the two induced distributions do not coincide.
Because $i \to j \notin \mathrm{marg}(\mathcal{I} \setminus \mathcal{O})(\mathcal{G}(M))$, it follows that $(\widetilde{\mathrm{pa}}(A) \setminus A) \cap \mathcal{I} = \emptyset$. Let now $\tilde g_A : \mathcal{E}_{\widetilde{\mathrm{pa}}(A)} \to \mathcal{X}_A$ be a measurable solution function for $\tilde M$ w.r.t. $A$, that is, for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_A = \tilde f_A(x, e) \iff x_A = \tilde g_A(e_{\widetilde{\mathrm{pa}}(A)}), \]
where $\tilde f$ is the causal mechanism of $\tilde M$. Because $i \notin A$ and $j \in A$, it follows that for the intervened model $(M_{\mathrm{do}(I, \xi_I)})_{\mathrm{do}(\{i\}, \xi_i)}$ the marginal solution $X_j$ is also a marginal solution of $(M_{\mathrm{do}(I, \xi_I)})_{\mathrm{do}(\{i\}, \tilde\xi_i)}$ and vice versa, which contradicts the assumption.

PROOF OF PROPOSITION 7.2. Define $\tilde M := M_{\mathrm{do}(I, \xi_I)}$, $\widetilde{\mathrm{pa}} := \mathrm{pa}_{\mathcal{G}^a(\tilde M)}$, $A_i := \mathrm{an}_{\mathcal{G}(\tilde M)}(i)$ and $A_j^{\setminus i} := \mathrm{an}_{\mathcal{G}(\tilde M) \setminus i}(j)$. Suppose that there does not exist a bidirected edge $i \leftrightarrow j$ in the latent projection $\mathrm{marg}(\mathcal{I} \setminus \mathcal{O})(\mathcal{G}(M))$. Because $i \leftrightarrow j \notin \mathrm{marg}(\mathcal{I} \setminus \mathcal{O})(\mathcal{G}(\tilde M))$, where $\tilde M$ is the intervened model $M_{\mathrm{do}(I, \xi_I)}$, we have that
\[ \mathrm{an}_{\mathcal{G}^a(\tilde M) \setminus j}(i) \cap \mathrm{an}_{\mathcal{G}^a(\tilde M) \setminus i}(j) \cap \mathcal{J} = \emptyset. \]
From $j \notin \mathrm{an}_{\mathcal{G}(\tilde M)}(i)$ it follows that $\mathrm{an}_{\mathcal{G}(\tilde M) \setminus j}(i) = \mathrm{an}_{\mathcal{G}(\tilde M)}(i)$, and hence
\[ \mathrm{an}_{\mathcal{G}^a(\tilde M)}(i) \cap \mathrm{an}_{\mathcal{G}^a(\tilde M) \setminus i}(j) \cap \mathcal{J} = \emptyset. \]
Observe that $\widetilde{\mathrm{pa}}(A_i) \subseteq \mathrm{an}_{\mathcal{G}^a(\tilde M)}(i)$ and $\widetilde{\mathrm{pa}}(A_j^{\setminus i}) \subseteq \mathrm{an}_{\mathcal{G}^a(\tilde M) \setminus i}(j) \cup \{i\}$, and thus $\widetilde{\mathrm{pa}}(A_i) \cap \widetilde{\mathrm{pa}}(A_j^{\setminus i}) \cap \mathcal{J} = \emptyset$. Let $g_{A_i} : \mathcal{E}_{\widetilde{\mathrm{pa}}(A_i)} \to \mathcal{X}_{A_i}$ be a measurable solution function for $\tilde M$ w.r.t. $A_i$, that is, for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_{A_i} = \tilde f_{A_i}(x, e) \iff x_{A_i} = g_{A_i}(e_{\widetilde{\mathrm{pa}}(A_i)}), \]
where $\tilde f$ is the intervened causal mechanism of $\tilde M$. Because $\widetilde{\mathrm{pa}}(A_i) \cap \widetilde{\mathrm{pa}}(A_j^{\setminus i}) \cap \mathcal{J} = \emptyset$ and $i \in A_i$, we have that $X_i \perp\!\!\!\perp E_{\widetilde{\mathrm{pa}}(A_j^{\setminus i})}$ for every solution $(X, E)$ of $\tilde M$. Assume for the moment that $i \in \widetilde{\mathrm{pa}}(A_j^{\setminus i}) \setminus A_j^{\setminus i}$; then $(\widetilde{\mathrm{pa}}(A_j^{\setminus i}) \setminus A_j^{\setminus i}) \cap \mathcal{I} = \{i\}$.
Let $g_{A_j^{\setminus i}} : \mathcal{X}_i \times \mathcal{E}_{\widetilde{\mathrm{pa}}(A_j^{\setminus i})} \to \mathcal{X}_{A_j^{\setminus i}}$ be a measurable solution function for $\tilde M$ w.r.t. $A_j^{\setminus i}$, that is, for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$,
\[ x_{A_j^{\setminus i}} = \tilde f_{A_j^{\setminus i}}(x, e) \iff x_{A_j^{\setminus i}} = g_{A_j^{\setminus i}}(x_i, e_{\widetilde{\mathrm{pa}}(A_j^{\setminus i})}). \]
For every measurable set $B_j \subseteq \mathcal{X}_j$ there exists a version of the regular conditional probability $P_{M_{\mathrm{do}(I, \xi_I)}}(X_j \in B_j \mid X_i = \xi_i)$ such that for every value $\xi_i \in \mathcal{X}_i$ it satisfies
\begin{align*}
P_{M_{\mathrm{do}(I, \xi_I)}}(X_j \in B_j \mid X_i = \xi_i)
&= P_{\tilde M}(X_j \in B_j \mid X_i = \xi_i) \\
&= P_{\tilde M}\big((g_{A_j^{\setminus i}})_j(X_i, E_{\widetilde{\mathrm{pa}}(A_j^{\setminus i})}) \in B_j \mid X_i = \xi_i\big) \\
&= P_{\tilde M}\big((g_{A_j^{\setminus i}})_j(\xi_i, E_{\widetilde{\mathrm{pa}}(A_j^{\setminus i})}) \in B_j \mid X_i = \xi_i\big) \\
&= P_{\tilde M}\big((g_{A_j^{\setminus i}})_j(\xi_i, E_{\widetilde{\mathrm{pa}}(A_j^{\setminus i})}) \in B_j\big) \\
&= P_{\tilde M_{\mathrm{do}(\{i\}, \xi_i)}}\big((g_{A_j^{\setminus i}})_j(X_i, E_{\widetilde{\mathrm{pa}}(A_j^{\setminus i})}) \in B_j\big) \\
&= P_{\tilde M_{\mathrm{do}(\{i\}, \xi_i)}}(X_j \in B_j) \\
&= P_{(M_{\mathrm{do}(I, \xi_I)})_{\mathrm{do}(\{i\}, \xi_i)}}(X_j \in B_j),
\end{align*}
where we used $X_i \perp\!\!\!\perp E_{\widetilde{\mathrm{pa}}(A_j^{\setminus i})}$ in the fourth equality. If we assume $i \notin \widetilde{\mathrm{pa}}(A_j^{\setminus i}) \setminus A_j^{\setminus i}$ instead of $i \in \widetilde{\mathrm{pa}}(A_j^{\setminus i}) \setminus A_j^{\setminus i}$, then we similarly arrive at the same conclusion.

Section 8

PROOF OF PROPOSITION 8.2. We first show that the class of simple SCMs is closed under marginalization. Take two disjoint subsets $L_1$ and $L_2$ in $\mathcal{I}$. It suffices to show that $M_{\mathrm{marg}(L_1)}$ is uniquely solvable w.r.t. $L_2$, and this follows directly from Proposition 5.4.

Next, we show that the class of simple SCMs is closed under perfect intervention. Let $M$ be a simple SCM, $\mathcal{O} \subseteq \mathcal{I}$, $I \subseteq \mathcal{I}$ and $\xi_I \in \mathcal{X}_I$. Define $\mathcal{O}_1 := \mathcal{O} \cap I$ and $\mathcal{O}_2 := \mathcal{O} \setminus I$, so that $\mathcal{O} = \mathcal{O}_1 \cup \mathcal{O}_2$. Note that $\mathrm{pa}(\mathcal{O}_2) \setminus \mathcal{O}_2 = (\mathrm{pa}(\mathcal{O}_2) \setminus (\mathcal{O}_2 \cup I)) \cup (\mathrm{pa}(\mathcal{O}_2) \cap I)$ and $\mathrm{pa}(\mathcal{O}_2) \setminus (\mathcal{O}_2 \cup I) \subseteq \mathrm{pa}(\mathcal{O}) \setminus \mathcal{O}$. Let $g_{\mathcal{O}_2} : \mathcal{X}_{\mathrm{pa}(\mathcal{O}_2) \setminus \mathcal{O}_2} \times \mathcal{E}_{\mathrm{pa}(\mathcal{O}_2)} \to \mathcal{X}_{\mathcal{O}_2}$ be a measurable solution function for $M$ w.r.t. $\mathcal{O}_2$.
The mapping $\tilde g_{\mathcal{O}} : \mathcal{X}_{\mathrm{pa}(\mathcal{O}) \setminus \mathcal{O}} \times \mathcal{E}_{\mathrm{pa}(\mathcal{O})} \to \mathcal{X}_{\mathcal{O}}$ defined by
\[ \begin{cases} (\tilde g_{\mathcal{O}})_{\mathcal{O}_1}(x_{\mathrm{pa}(\mathcal{O}) \setminus \mathcal{O}}, e_{\mathrm{pa}(\mathcal{O})}) := \xi_{\mathcal{O}_1} \\ (\tilde g_{\mathcal{O}})_{\mathcal{O}_2}(x_{\mathrm{pa}(\mathcal{O}) \setminus \mathcal{O}}, e_{\mathrm{pa}(\mathcal{O})}) := g_{\mathcal{O}_2}(x_{\mathrm{pa}(\mathcal{O}_2) \setminus (\mathcal{O}_2 \cup I)}, \xi_{\mathrm{pa}(\mathcal{O}_2) \cap I}, e_{\mathrm{pa}(\mathcal{O}_2)}) \end{cases} \]
is a measurable solution function for $M_{\mathrm{do}(I, \xi_I)}$ w.r.t. $\mathcal{O}$, and it is clear that $M_{\mathrm{do}(I, \xi_I)}$ is uniquely solvable w.r.t. $\mathcal{O}$.

Next, we show that the class of simple SCMs is closed under the twin operation. Let $\tilde{\mathcal{O}} \subseteq \mathcal{I} \cup \mathcal{I}'$. Take $\mathcal{O}_1 = \tilde{\mathcal{O}} \cap \mathcal{I}$, $\mathcal{O}_2' = \tilde{\mathcal{O}} \cap \mathcal{I}'$ and $\mathcal{O}_2$ the original copy of $\mathcal{O}_2'$ in $\mathcal{I}$. Let $g_{\mathcal{O}_1} : \mathcal{X}_{\mathrm{pa}(\mathcal{O}_1) \setminus \mathcal{O}_1} \times \mathcal{E}_{\mathrm{pa}(\mathcal{O}_1)} \to \mathcal{X}_{\mathcal{O}_1}$ and $g_{\mathcal{O}_2} : \mathcal{X}_{\mathrm{pa}(\mathcal{O}_2) \setminus \mathcal{O}_2} \times \mathcal{E}_{\mathrm{pa}(\mathcal{O}_2)} \to \mathcal{X}_{\mathcal{O}_2}$ be measurable solution functions for $M$ w.r.t. $\mathcal{O}_1$ and $\mathcal{O}_2$, respectively. Define now the mapping $h_{\tilde{\mathcal{O}}} : \mathcal{X}_{\widetilde{\mathrm{pa}}(\tilde{\mathcal{O}}) \setminus \tilde{\mathcal{O}}} \times \mathcal{E}_{\widetilde{\mathrm{pa}}(\tilde{\mathcal{O}})} \to \mathcal{X}_{\tilde{\mathcal{O}}}$ by
\[ \begin{cases} (h_{\tilde{\mathcal{O}}})_{\tilde{\mathcal{O}} \cap \mathcal{I}}(x_{\widetilde{\mathrm{pa}}(\tilde{\mathcal{O}}) \setminus \tilde{\mathcal{O}}}, e_{\widetilde{\mathrm{pa}}(\tilde{\mathcal{O}})}) := g_{\mathcal{O}_1}(x_{\widetilde{\mathrm{pa}}(\mathcal{O}_1) \setminus \mathcal{O}_1}, e_{\widetilde{\mathrm{pa}}(\mathcal{O}_1)}) \\ (h_{\tilde{\mathcal{O}}})_{\tilde{\mathcal{O}} \cap \mathcal{I}'}(x_{\widetilde{\mathrm{pa}}(\tilde{\mathcal{O}}) \setminus \tilde{\mathcal{O}}}, e_{\widetilde{\mathrm{pa}}(\tilde{\mathcal{O}})}) := g_{\mathcal{O}_2}(x_{\widetilde{\mathrm{pa}}(\mathcal{O}_2') \setminus \mathcal{O}_2'}, e_{\widetilde{\mathrm{pa}}(\mathcal{O}_2')}), \end{cases} \]
where we define $\widetilde{\mathrm{pa}} := \mathrm{pa}_{\mathcal{G}^a(M^{\mathrm{twin}})}$ as the parents w.r.t. the twin graph $\mathcal{G}^a(M^{\mathrm{twin}})$. Then by construction this mapping $h_{\tilde{\mathcal{O}}}$ is a measurable solution function for $M^{\mathrm{twin}}$ w.r.t. $\tilde{\mathcal{O}}$, and it is clear that $M^{\mathrm{twin}}$ is uniquely solvable w.r.t. $\tilde{\mathcal{O}}$.

Lastly, it follows that the observational and all the intervened models of $M$ and $M^{\mathrm{twin}}$ are uniquely solvable. From Theorem 3.6 we conclude that $M$ induces unique observational, interventional and counterfactual distributions.

PROOF OF COROLLARY 8.3. This follows from Corollary A.22.

APPENDIX F: MEASURABLE SELECTION THEOREMS

In this appendix, we derive some lemmas and state two measurable selection theorems that are used in several proofs in Appendix E. First, we introduce the measure-theoretic notation and terminology needed to understand the results (see [30] for more details).

DEFINITION F.1 (Standard measurable space).
A measurable space $(\mathcal{X}, \Sigma)$ is a standard measurable space if it is isomorphic to $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$, where $\mathcal{Y}$ is a Polish space, that is, a separable completely metrizable space,$^{22}$ and $\mathcal{B}(\mathcal{Y})$ are the Borel subsets of $\mathcal{Y}$, that is, the $\sigma$-algebra generated by the open sets in $\mathcal{Y}$. A measure space $(\mathcal{X}, \Sigma, \mu)$ is a standard probability space if $(\mathcal{X}, \Sigma)$ is a standard measurable space and $\mu$ is a probability measure.

Examples of standard measurable spaces are the open and closed subsets of $\mathbb{R}^d$, and the finite sets with the usual complete metric. If we say that $\mathcal{X}$ is a standard measurable space, then we implicitly assume that there exists a $\sigma$-algebra $\Sigma$ such that $(\mathcal{X}, \Sigma)$ is a standard measurable space. Similarly, if we say that $\mathcal{X}$ is a standard probability space with probability measure $P_{\mathcal{X}}$, then we implicitly assume that there exists a $\sigma$-algebra $\Sigma$ such that $(\mathcal{X}, \Sigma, P_{\mathcal{X}})$ is a standard probability space.

DEFINITION F.2 (Analytic set). Let $\mathcal{X}$ be a Polish space. A set $A \subseteq \mathcal{X}$ is called analytic if there exist a Polish space $\mathcal{Y}$ and a continuous mapping $f : \mathcal{Y} \to \mathcal{X}$ with $f(\mathcal{Y}) = A$.

LEMMA F.3. Let $\mathcal{X}$ and $\mathcal{Y}$ be standard measurable spaces and $f : \mathcal{X} \to \mathcal{Y}$ a measurable mapping. Then:
1. every measurable set $A \subseteq \mathcal{X}$ is analytic;
2. if the subsets $A \subseteq \mathcal{X}$ and $\tilde A \subseteq \mathcal{Y}$ are analytic, then the sets $f(A)$ and $f^{-1}(\tilde A)$ are analytic.

PROOF. From Proposition 13.7 in [30] it follows that every measurable set $A \subseteq \mathcal{X}$ is analytic. From Proposition 14.4.(ii) in [30] it follows that the image and the preimage of an analytic set under a measurable mapping are analytic.

DEFINITION F.4 ($\mu$-measurability). Let $(\mathcal{X}, \Sigma, \mu)$ be a measure space. A set $E \subseteq \mathcal{X}$ is called a $\mu$-null set if there exists an $A \in \Sigma$ with $E \subseteq A$ and $\mu(A) = 0$. We denote the class of $\mu$-null sets by $\mathcal{N}$, and we denote the $\sigma$-algebra generated by $\Sigma \cup \mathcal{N}$ by $\bar\Sigma$; its members are called the $\mu$-measurable sets. Note that each member of $\bar\Sigma$ is of the form $A \cup E$ with $A \in \Sigma$ and $E \in \mathcal{N}$.
The measure $\mu$ is extended to a measure $\bar\mu$ on $\bar\Sigma$ by $\bar\mu(A \cup E) = \mu(A)$ for every $A \in \Sigma$ and $E \in \mathcal{N}$; this extension is called its completion. A mapping $f : \mathcal{X} \to \mathcal{Y}$ between measurable spaces is called $\mu$-measurable if the inverse image $f^{-1}(C)$ of every measurable set $C \subseteq \mathcal{Y}$ is $\mu$-measurable.

$^{22}$ A metrizable space is a topological space $\mathcal{X}$ for which there exists a metric $d$ such that $(\mathcal{X}, d)$ is a metric space that induces the topology on $\mathcal{X}$. For a metric space $(\mathcal{X}, d)$, a Cauchy sequence is a sequence $(x_n)_{n \in \mathbb{N}}$ of elements of $\mathcal{X}$ such that for every $\varepsilon > 0$ there exists an $N \in \mathbb{N}$ such that for all natural numbers $p, q > N$ we have $d(x_p, x_q) < \varepsilon$. We call $(\mathcal{X}, d)$ complete if every Cauchy sequence has a limit in $\mathcal{X}$. A completely metrizable space is a topological space $\mathcal{X}$ for which there exists a metric $d$ such that $(\mathcal{X}, d)$ is a complete metric space that induces the topology on $\mathcal{X}$. A topological space $\mathcal{X}$ is called separable if it contains a countable dense subset, that is, there exists a sequence $(x_n)_{n \in \mathbb{N}}$ of elements in $\mathcal{X}$ such that every nonempty open subset of $\mathcal{X}$ contains at least one element of the sequence. A separable completely metrizable space is called a Polish space (see [9] and [30] for more details).

DEFINITION F.5 (Universal measurability). Let $(\mathcal{X}, \Sigma)$ be a standard measurable space. A set $A \subseteq \mathcal{X}$ is called universally measurable if it is $\mu$-measurable for every $\sigma$-finite measure$^{23}$ $\mu$ on $\mathcal{X}$ (in particular, for every probability measure). A mapping $f : \mathcal{X} \to \mathcal{Y}$ between standard measurable spaces is universally measurable if it is $\mu$-measurable for every $\sigma$-finite measure $\mu$.

LEMMA F.6. Let $\mathcal{E}$ be a standard probability space with probability measure $P_{\mathcal{E}}$ and $A \subseteq \mathcal{E}$ an analytic set. Then $A$ is $P_{\mathcal{E}}$-measurable and there exist measurable sets $S, T \subseteq \mathcal{E}$ such that $S \subseteq A \subseteq T$ and $P_{\mathcal{E}}(S) = \bar P_{\mathcal{E}}(A) = P_{\mathcal{E}}(T)$, where $\bar P_{\mathcal{E}}$ is the completion of $P_{\mathcal{E}}$.

PROOF. Let $A \subseteq \mathcal{E}$ be an analytic set.
Since every analytic set in a standard measurable space is universally measurable (see Theorem 21.10 in [30]), we know that $A$ is universally measurable, and hence in particular $P_{\mathcal{E}}$-measurable. Thus, there exist a measurable set $S \subseteq \mathcal{E}$ and a $P_{\mathcal{E}}$-null set $C \subseteq \mathcal{E}$ such that $A = S \cup C$ and $\bar P_{\mathcal{E}}(A) = P_{\mathcal{E}}(S)$, where $\bar P_{\mathcal{E}}$ is the completion of $P_{\mathcal{E}}$. Moreover, there exists a measurable set $\tilde C \subseteq \mathcal{E}$ such that $C \subseteq \tilde C$ and $P_{\mathcal{E}}(\tilde C) = 0$. Let $T := S \cup \tilde C$; then $A \subseteq T$ and $P_{\mathcal{E}}(T) = P_{\mathcal{E}}(S)$.

LEMMA F.7. Let $f : \mathcal{X} \to \mathcal{Y}$ be a $\mu$-measurable mapping. If $\mathcal{Y}$ is countably generated, then there exists a measurable mapping $g : \mathcal{X} \to \mathcal{Y}$ such that $f(x) = g(x)$ holds $\mu$-a.e.

PROOF. Let the $\sigma$-algebra of $\mathcal{Y}$ be generated by the countable generating set $\{C_n\}_{n \in \mathbb{N}}$. The $\mu$-measurable set $f^{-1}(C_n) = A_n \cup E_n$ for some $A_n \in \Sigma$ and some $E_n \in \mathcal{N}$, and hence there is some $B_n \in \Sigma$ with $E_n \subseteq B_n$ and $\mu(B_n) = 0$. Let $\hat B := \cup_{n \in \mathbb{N}} B_n$, $\hat A_n := A_n \setminus \hat B$ and $\hat A := \cup_{n \in \mathbb{N}} \hat A_n$; then $\mu(\hat B) = 0$, $\hat A$ and $\hat B$ are disjoint, and $\mathcal{X} = \hat A \cup \hat B$. Now define the mapping $g : \mathcal{X} \to \mathcal{Y}$ by
\[ g(x) := \begin{cases} f(x) & \text{if } x \in \hat A, \\ y_0 & \text{otherwise,} \end{cases} \]
where for $y_0$ we can take an arbitrary point in $\mathcal{Y}$. This mapping $g$ is measurable, since for each generator $C_n$ we have
\[ g^{-1}(C_n) = \begin{cases} \hat A_n & \text{if } y_0 \notin C_n, \\ \hat A_n \cup \hat B & \text{otherwise,} \end{cases} \]
which is in $\Sigma$. Moreover, $f(x) = g(x)$ $\mu$-almost everywhere.

With this result at hand we can now prove the first measurable selection theorem.

THEOREM F.8 (Measurable selection theorem). Let $\mathcal{E}$ be a standard probability space with probability measure $P_{\mathcal{E}}$, $\mathcal{X}$ a standard measurable space and $S \subseteq \mathcal{E} \times \mathcal{X}$ a measurable set such that $\mathcal{E} \setminus \mathrm{pr}_{\mathcal{E}}(S)$ is a $P_{\mathcal{E}}$-null set, where $\mathrm{pr}_{\mathcal{E}} : \mathcal{E} \times \mathcal{X} \to \mathcal{E}$ is the projection mapping on $\mathcal{E}$. Then there exists a measurable mapping $g : \mathcal{E} \to \mathcal{X}$ such that $(e, g(e)) \in S$ for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$.

PROOF.
Take the subset $\hat{\mathcal{E}} := \mathcal{E} \setminus B$ for some measurable set $B \supseteq \mathcal{E} \setminus \mathrm{pr}_{\mathcal{E}}(S)$ with $P_{\mathcal{E}}(B) = 0$, and note that $\hat{\mathcal{E}}$ is a standard measurable space (see Corollary 13.4 in [30]) and $\hat{\mathcal{E}} \subseteq \mathrm{pr}_{\mathcal{E}}(S)$. Let $\hat S := S \cap (\hat{\mathcal{E}} \times \mathcal{X})$. Because the set $\hat S$ is measurable, it is in particular analytic (see Lemma F.3). It follows by the Jankov-von Neumann theorem (see Theorem 18.8 or 29.9 in [30]) that $\hat S$ has a universally measurable uniformizing function, that is, there exists a universally measurable mapping $\hat g : \hat{\mathcal{E}} \to \mathcal{X}$ such that for all $e \in \hat{\mathcal{E}}$, $(e, \hat g(e)) \in \hat S$. Hence, in particular, it is $P_{\mathcal{E}}|_{\hat{\mathcal{E}}}$-measurable, where $P_{\mathcal{E}}|_{\hat{\mathcal{E}}}$ is the restriction of $P_{\mathcal{E}}$ to $\hat{\mathcal{E}}$. Now define the mapping $g^* : \mathcal{E} \to \mathcal{X}$ by
\[ g^*(e) := \begin{cases} \hat g(e) & \text{if } e \in \hat{\mathcal{E}}, \\ x_0 & \text{otherwise,} \end{cases} \]
where for $x_0$ we can take an arbitrary point in $\mathcal{X}$. Then this mapping $g^*$ is $P_{\mathcal{E}}$-measurable. To see this, take any measurable set $C \subseteq \mathcal{X}$; then
\[ g^{*-1}(C) = \begin{cases} \hat g^{-1}(C) & \text{if } x_0 \notin C, \\ \hat g^{-1}(C) \cup B & \text{otherwise.} \end{cases} \]
Because $\hat g^{-1}(C)$ is $P_{\mathcal{E}}|_{\hat{\mathcal{E}}}$-measurable, it is also $P_{\mathcal{E}}$-measurable, and thus $g^{*-1}(C)$ is $P_{\mathcal{E}}$-measurable. By Lemma F.7 and the fact that standard measurable spaces are countably generated (see Proposition 12.1 in [30]), there exists a measurable mapping $g : \mathcal{E} \to \mathcal{X}$ such that $g^* = g$ $P_{\mathcal{E}}$-a.e., and thus $g$ satisfies $(e, g(e)) \in S$ for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$.

$^{23}$ A measure $\mu$ on a measurable space $(\mathcal{X}, \Sigma)$ is called $\sigma$-finite if $\mathcal{X} = \cup_{n \in \mathbb{N}} A_n$ with $A_n \in \Sigma$ and $\mu(A_n) < \infty$.

This theorem rests on the assumption that the standard measurable space $\mathcal{E}$ has a probability measure $P_{\mathcal{E}}$. If this space becomes the product space $\mathcal{Y} \times \mathcal{E}$ for some standard measurable space $\mathcal{Y}$, where only the space $\mathcal{E}$ has a probability measure, then in general this theorem no longer holds. However, if we assume in addition that the fibers of $S$ in $\mathcal{Y}$ are $\sigma$-compact for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$, then we can prove a second measurable selection theorem.
A topological space is $\sigma$-compact if it is the union of countably many compact subspaces. For example, all countable discrete spaces, every interval of the real line, and moreover all Euclidean spaces are $\sigma$-compact.

THEOREM F.9 (Second measurable selection theorem). Let $\mathcal{E}$ be a standard probability space with probability measure $P_{\mathcal{E}}$, $\mathcal{X}$ and $\mathcal{Y}$ standard measurable spaces and $S \subseteq \mathcal{X} \times \mathcal{E} \times \mathcal{Y}$ a measurable set such that $\mathcal{E} \setminus K_\sigma$ is a $P_{\mathcal{E}}$-null set, where
\[ K_\sigma := \{e \in \mathcal{E} : \forall x \in \mathcal{X} \; (S_{(x, e)} \text{ is nonempty and } \sigma\text{-compact})\}, \]
with $S_{(x, e)}$ denoting the fiber over $(x, e)$, that is, $S_{(x, e)} := \{y \in \mathcal{Y} : (x, e, y) \in S\}$. Then there exists a measurable mapping $g : \mathcal{X} \times \mathcal{E} \to \mathcal{Y}$ such that for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$ we have $(x, e, g(x, e)) \in S$.

PROOF. Take the subset $\hat{\mathcal{E}} := \mathcal{E} \setminus B$ for some measurable set $B \supseteq \mathcal{E} \setminus K_\sigma$ with $P_{\mathcal{E}}(B) = 0$. Note that $\hat{\mathcal{E}}$ is a standard measurable space, $\hat{\mathcal{E}} \subseteq K_\sigma$ and $\hat S := S \cap (\mathcal{X} \times \hat{\mathcal{E}} \times \mathcal{Y})$ is measurable. By assumption, for each $(x, e) \in \mathcal{X} \times \hat{\mathcal{E}}$ the fiber $\hat S_{(x, e)}$ is nonempty and $\sigma$-compact, and hence by applying the theorem of Arsenin-Kunugui (see Theorem 35.46 in [30]) it follows that the set $\hat S$ has a measurable uniformizing function, that is, there exists a measurable mapping $\hat g : \mathcal{X} \times \hat{\mathcal{E}} \to \mathcal{Y}$ such that for all $(x, e) \in \mathcal{X} \times \hat{\mathcal{E}}$, $(x, e, \hat g(x, e)) \in \hat S$. Now define the mapping $g : \mathcal{X} \times \mathcal{E} \to \mathcal{Y}$ by
\[ g(x, e) := \begin{cases} \hat g(x, e) & \text{if } e \in \hat{\mathcal{E}}, \\ y_0 & \text{otherwise,} \end{cases} \]
where for $y_0$ we can take an arbitrary point in $\mathcal{Y}$. This mapping $g$ inherits the measurability from $\hat g$, and it satisfies for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$ and for all $x \in \mathcal{X}$ that $(x, e, g(x, e)) \in S$.

The next two lemmas provide some useful properties of the "for $P_{\mathcal{E}}$-almost every $e \in \mathcal{E}$" quantifier.

LEMMA F.10. Let $\phi : \mathcal{E} \to \tilde{\mathcal{E}}$ be a measurable map between two standard measurable spaces. Let $P_{\mathcal{E}}$ be a probability measure on $\mathcal{E}$ and let $P_{\tilde{\mathcal{E}}} := P_{\mathcal{E}} \circ \phi^{-1}$ be its push-forward under $\phi$.
Let P̃ : Ẽ → {0, 1} be a property, that is, a (measurable) Boolean-valued function on Ẽ. Then the property P = P̃ ∘ φ on E holds P_E-a.e. if and only if the property P̃ holds P_Ẽ-a.e.

PROOF. Assume the property P = P̃ ∘ φ holds P_E-a.e.; then C = { e ∈ E : P(e) = 1 } contains a measurable set C* of P_E-measure 1, that is, C* ⊆ C and P_E(C*) = 1. By Lemma F.3, φ(C*) is analytic. By Lemma F.6, there exist measurable sets A, B such that A ⊆ φ(C*) ⊆ B and P_Ẽ(A) = P_Ẽ(B). Because φ is measurable, φ⁻¹(A) and φ⁻¹(B) are both measurable. Also, φ⁻¹(A) ⊆ φ⁻¹(φ(C*)) ⊆ φ⁻¹(B). As C* ⊆ φ⁻¹(φ(C*)), we must have P_E(φ⁻¹(B)) ≥ P_E(C*) = 1. Hence P_Ẽ(A) = P_Ẽ(B) = 1. Note that, as C* ⊆ C, we have A ⊆ φ(C*) ⊆ φ(C) ⊆ { ẽ ∈ Ẽ : P̃(ẽ) = 1 }. Hence the set C̃ := { ẽ ∈ Ẽ : P̃(ẽ) = 1 } contains a measurable set of P_Ẽ-measure 1; in other words, P̃ holds P_Ẽ-a.e.

The converse is easier to prove. Suppose C̃ = { ẽ ∈ Ẽ : P̃(ẽ) = 1 } contains a measurable set C̃* of P_Ẽ-measure 1, that is, C̃* ⊆ C̃ and P_Ẽ(C̃*) = 1. Because φ is measurable, the set φ⁻¹(C̃*) is measurable and P_E(φ⁻¹(C̃*)) = 1, and furthermore φ⁻¹(C̃*) ⊆ φ⁻¹(C̃) = C. ∎

LEMMA F.11 (Some properties of the for-almost-every quantifier). Let 𝐗 = X × X̃ and 𝐄 = E × Ẽ be products of nonempty standard measurable spaces and let 𝐏_𝐄 = P_E × P_Ẽ be the product measure of probability measures P_E and P_Ẽ on E and Ẽ, respectively. Denote by "∀~𝐞" the quantifier "for 𝐏_𝐄-almost every 𝐞 ∈ 𝐄" and by "∀𝐱" the quantifier "for all 𝐱 ∈ 𝐗", and similarly for their components, for example, "∀~e" for "for P_E-almost every e ∈ E" and "∀x" for "for all x ∈ X". Then we have the following properties:
1. ∀~𝐞 : P(𝐞) ⇒ ∃𝐞 : P(𝐞) (similarly, ∀𝐱 : P(𝐱) ⇒ ∃𝐱 : P(𝐱));
2.
∀~𝐞 : P(e) ⇔ ∀~e : P(e) (similarly, ∀𝐱 : P(x) ⇔ ∀x : P(x));
3. ∃x ∀~e : P(x, e) ⇒ ∀~e ∃x : P(x, e) (similarly, ∃x ∀e : P(x, e) ⇒ ∀e ∃x : P(x, e));
4. ∀~e ∀x : P(x, e) ⇒ ∀x ∀~e : P(x, e) (similarly, ∀e ∀x : P(x, e) ⇒ ∀x ∀e : P(x, e));
5. ∀~𝐞 : P(𝐞) ⇒ ∃ẽ ∀~e : P(𝐞) (similarly, ∀𝐱 : P(𝐱) ⇒ ∃x̃ ∀x : P(𝐱));
6. ∀~𝐞 ∀𝐱 : P(x, e) ⇔ ∀~e ∀x : P(x, e);
7. ∀~𝐞 ∀𝐱 : P(𝐱, 𝐞) ⇒ ∃ẽ ∃x̃ ∀~e ∀x : P(𝐱, 𝐞),

where P denotes a property, that is, a measurable Boolean-valued function, on the corresponding measurable spaces, and we write 𝐞 and 𝐱 for (e, ẽ) and (x, x̃), respectively.

PROOF. We only prove the statements that may not be immediately obvious.

Property 2: Let pr_E : 𝐄 → E be the projection mapping onto E. Then by Lemma F.10 we have

∀~𝐞 : P(e) ⇔ ∀~𝐞 : P ∘ pr_E(𝐞) ⇔ ∀~e : P(e).

Property 4: We have

∀~e ∀x : P(x, e)
⇒ there exists a P_E-null set N such that for all e ∈ E \ N and all x ∈ X: P(x, e)
⇒ there exists a P_E-null set N such that for all x ∈ X and all e ∈ E \ N: P(x, e)
⇒ for all x ∈ X there exists a P_E-null set N such that for all e ∈ E \ N: P(x, e)
⇒ ∀x ∀~e : P(x, e).

Property 5: Let N be a measurable 𝐏_𝐄-null set such that P(𝐞) holds for all 𝐞 ∈ 𝐄 \ N. Define for ẽ ∈ Ẽ the set N_ẽ := { e ∈ E : (e, ẽ) ∈ N }. Note that the sets N_ẽ are measurable. From Fubini's theorem it follows that for P_Ẽ-almost every ẽ ∈ Ẽ we have P_E(N_ẽ) = 0. That is, there exists a measurable P_Ẽ-null set Ñ such that P_E(N_ẽ) = 0 for all ẽ ∈ Ẽ \ Ñ. Hence, there exists ẽ ∈ Ẽ \ Ñ such that P_E(N_ẽ) = 0; for all e ∈ E \ N_ẽ, P(𝐞) then holds. This means ∃ẽ ∀~e : P(𝐞).
Property 7: We have

∀~𝐞 ∀𝐱 : P(𝐱, 𝐞)
⇒ ∃ẽ ∀~e ∀𝐱 : P(𝐱, 𝐞)
⇒ ∃ẽ ∀~e ∀x̃ ∀x : P(𝐱, 𝐞)
⇒ ∃ẽ ∀x̃ ∀~e ∀x : P(𝐱, 𝐞)
⇒ ∃ẽ ∃x̃ ∀~e ∀x : P(𝐱, 𝐞),

where in the first implication we used Property 5, in the third implication we used Property 4, and in the last implication we used Property 1. ∎

REFERENCES

[1] Balke, A. and Pearl, J. (1994). Probabilistic Evaluation of Counterfactual Queries. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94) 1 230–237. AAAI Press.
[2] Beckers, S. and Halpern, J. Y. (2019). Abstracting Causal Models. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) 33 2678–2685. AAAI Press.
[3] Blom, T., Bongers, S. and Mooij, J. M. (2019). Beyond Structural Causal Models: Causal Constraints Models. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI-19) (R. P. Adams and V. Gogate, eds.). AUAI Press.
[4] Blom, T., van Diepen, M. M. and Mooij, J. M. (2020). Conditional Independences and Causal Relations implied by Sets of Equations. arXiv preprint arXiv:2007.07183 [cs.AI].
[5] Bollen, K. A. (1989). Structural Equations with Latent Variables. John Wiley & Sons, New York, USA.
[6] Bongers, S., Blom, T. and Mooij, J. M. (2021). Causal Modeling of Dynamical Systems. arXiv preprint arXiv:1803.08784v3 [cs.AI].
[7] Bühlmann, P., Peters, J. and Ernest, J. (2014). CAM: Causal Additive Models, high-dimensional order search and penalized regression. The Annals of Statistics 42 2526–2556.
[8] Byrne, R. M. J. (2007). The Rational Imagination: How People Create Alternatives to Reality. A Bradford Book. MIT Press, Cambridge, MA.
[9] Cohn, D. L. (2013). Measure Theory, 2nd ed. Birkhäuser, Boston, USA.
[10] Cooper, G.
F. (1997). A Simple Constraint-Based Algorithm for Efficiently Mining Observational Databases for Causal Relationships. Data Mining and Knowledge Discovery 1 203–224.
[11] Dawid, A. P. (2002). Influence Diagrams for Causal Modelling and Inference. International Statistical Review 70 161–189.
[12] Duncan, O. D. (1975). Introduction to Structural Equation Models. Academic Press, New York.
[13] Eaton, D. and Murphy, K. (2007). Exact Bayesian structure learning from uncertain interventions. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (M. Meila and X. Shen, eds.). Proceedings of Machine Learning Research 2 107–114.
[14] Eberhardt, F., Hoyer, P. and Scheines, R. (2010). Combining Experiments to Discover Linear Cyclic Models with Latent Variables. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Y. W. Teh and M. Titterington, eds.). Proceedings of Machine Learning Research 9 185–192.
[15] Evans, R. J. (2016). Graphs for Margins of Bayesian Networks. Scandinavian Journal of Statistics 43 625–648.
[16] Evans, R. J. (2018). Margins of discrete Bayesian networks. The Annals of Statistics 46 2623–2656.
[17] Fisher, F. M. (1970). A Correspondence Principle For Simultaneous Equation Models. Econometrica 38 73–92.
[18] Forré, P. and Mooij, J. M. (2017). Markov Properties for Graphical Models with Cycles and Latent Variables. arXiv preprint arXiv:1710.08775 [math.ST].
[19] Forré, P. and Mooij, J. M. (2018). Constraint-based Causal Discovery for Non-Linear Structural Causal Models with Cycles and Latent Confounders. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI-18) (A. Globerson and R. Silva, eds.). AUAI Press.
[20] Forré, P. and Mooij, J. M. (2019).
Causal Calculus in the Presence of Cycles, Latent Confounders and Selection Bias. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI-19) (R. P. Adams and V. Gogate, eds.). AUAI Press.
[21] Foygel, R., Draisma, J. and Drton, M. (2012). Half-trek Criterion for Generic Identifiability of Linear Structural Equation Models. The Annals of Statistics 40 1682–1713.
[22] Geiger, D. (1990). Graphoids: A Qualitative Framework for Probabilistic Inference. Technical Report No. R-142, Computer Science Department, University of California, Los Angeles, USA.
[23] Goldberger, A. S. and Duncan, O. D. (1973). Structural Equation Models in the Social Sciences. Seminar Press, New York.
[24] Golub, G. and Kahan, W. (1965). Calculating the Singular Values and Pseudo-Inverse of a Matrix. Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis 2 205–224.
[25] Haavelmo, T. (1943). The Statistical Implications of a System of Simultaneous Equations. Econometrica 11 1–12.
[26] Halpern, J. (1998). Axiomatizing Causal Reasoning. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98) (G. Cooper and S. Moral, eds.) 202–210. Morgan Kaufmann, San Francisco, CA, USA.
[27] Hyttinen, A., Eberhardt, F. and Hoyer, P. O. (2012). Learning Linear Cyclic Causal Models with Latent Variables. Journal of Machine Learning Research 13 3387–3439.
[28] Hyttinen, A., Hoyer, P. O., Eberhardt, F. and Järvisalo, M. (2013). Discovering Cyclic Causal Models with Latent Variables: A General SAT-based Procedure. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI-13) (A. Nicholson and P. Smyth, eds.) 301–310. AUAI Press, Corvallis, Oregon, USA.
[29] Iwasaki, Y.
and Simon, H. A. (1994). Causality and model abstraction. Artificial Intelligence 67 143–194.
[30] Kechris, A. S. (1995). Classical Descriptive Set Theory. Graduate Texts in Mathematics 156. Springer-Verlag, New York, USA.
[31] Koster, J. T. A. (1996). Markov Properties of Nonrecursive Causal Models. The Annals of Statistics 24 2148–2177.
[32] Koster, J. T. A. (1999). On the Validity of the Markov Interpretation of Path Diagrams of Gaussian Structural Equations Systems with Correlated Errors. Scandinavian Journal of Statistics 26 413–431.
[33] Lacerda, G., Spirtes, P. L., Ramsey, J. and Hoyer, P. O. (2008). Discovering cyclic causal models by independent components analysis. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI-08) (D. McAllester and P. Myllymaki, eds.) 366–374. AUAI Press, Corvallis, Oregon, USA.
[34] Lauritzen, S. L. (1996). Graphical Models. Oxford Statistical Science Series 17. Clarendon Press, Oxford.
[35] Lauritzen, S. L., Dawid, A. P., Larsen, B. N. and Leimer, H. G. (1990). Independence Properties of Directed Markov Fields. Networks 20 491–505.
[36] Lewis, D. K. (1979). Counterfactual Dependence and Time's Arrow. Noûs 13 455–476.
[37] Maathuis, M. H., Colombo, D., Kalisch, M. and Bühlmann, P. (2009). Estimating High-Dimensional Intervention Effects from Observational Data. The Annals of Statistics 37 3133–3164.
[38] Mani, S. (2006). A Bayesian Local Causal Discovery Framework. PhD thesis, University of Pittsburgh.
[39] Mason, S. J. (1953). Feedback Theory - Some Properties of Signal Flow Graphs. Proceedings of the IRE 41 1144–1156. IEEE.
[40] Mason, S. J. (1956). Feedback Theory - Further Properties of Signal Flow Graphs. Proceedings of the IRE 44 920–926. IEEE.
[41] Meek, C. (1995). Strong Completeness and Faithfulness in Bayesian Networks. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI-95) (P. Besnard and S. Hanks, eds.) 411–418. Morgan Kaufmann, San Francisco, CA, USA.
[42] Mogensen, S. W. and Hansen, N. R. (2020). Markov equivalence of marginalized local independence graphs. The Annals of Statistics 48 539–559.
[43] Mogensen, S. W., Malinsky, D. and Hansen, N. R. (2018). Causal Learning for Partially Observed Stochastic Dynamical Systems. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI-18) (A. Globerson and R. Silva, eds.). AUAI Press.
[44] Mooij, J. M. and Claassen, T. (2020). Constraint-Based Causal Discovery using Partial Ancestral Graphs in the presence of Cycles. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI-20) (J. Peters and D. Sontag, eds.) 124 1159–1168. PMLR.
[45] Mooij, J. M. and Heskes, T. (2013). Cyclic Causal Discovery from Continuous Equilibrium Data. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI-13) (A. Nicholson and P. Smyth, eds.) 431–439. AUAI Press, Corvallis, Oregon, USA.
[46] Mooij, J. M., Janzing, D. and Schölkopf, B. (2013). From Ordinary Differential Equations to Structural Causal Models: the deterministic case. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI-13) (A. Nicholson and P. Smyth, eds.) 440–448. AUAI Press.
[47] Mooij, J. M., Magliacane, S. and Claassen, T. (2020). Joint Causal Inference from Multiple Contexts. Journal of Machine Learning Research 21 1–108.
[48] Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J.
and Schölkopf, B. (2016). Distinguishing Cause from Effect using Observational Data: Methods and Benchmarks. Journal of Machine Learning Research 17 1–102.
[49] Neal, R. M. (2000). On Deducing Conditional Independence from d-Separation in Causal Graphs with Feedback. Journal of Artificial Intelligence Research 12 87–91.
[50] Pearl, J. (1985). A Constraint Propagation Approach to Probabilistic Reasoning. In Proceedings of the First Conference on Uncertainty in Artificial Intelligence (UAI-85) (L. Kanal and J. Lemmer, eds.) 31–42. AUAI Press, Corvallis, Oregon, USA.
[51] Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press, New York, USA.
[52] Pearl, J. and Dechter, R. (1996). Identifying Independence in Causal Graphs with Feedback. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI-96) (E. Horvitz and F. Jensen, eds.) 420–426. Morgan Kaufmann, San Francisco, CA, USA.
[53] Pearl, J. and Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect, 1st ed. Basic Books, New York, USA.
[54] Penrose, R. (1955). A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society 51 406–413.
[55] Peters, J., Janzing, D. and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA, USA.
[56] Peters, J., Mooij, J. M., Janzing, D. and Schölkopf, B. (2014). Causal Discovery with Continuous Additive Noise Models. Journal of Machine Learning Research 15 2009–2053.
[57] Pfister, N., Bauer, S. and Peters, J. (2019). Learning Stable and Predictive Structures in Kinetic Systems. Proceedings of the National Academy of Sciences 116 25405–25411.
[58] Richardson, T. S. (1996).
A Discovery Algorithm for Directed Cyclic Graphs. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI-96) (E. Horvitz and F. Jensen, eds.) 454–461. Morgan Kaufmann, San Francisco, CA, USA.
[59] Richardson, T. S. (1996). Discovering Cyclic Causal Structure. Technical Report No. CMU-PHIL-68, Carnegie Mellon University.
[60] Richardson, T. (2003). Markov Properties for Acyclic Directed Mixed Graphs. Scandinavian Journal of Statistics 30 145–157.
[61] Richardson, T. S. and Spirtes, P. (1999). Automated Discovery of Linear Feedback Models. In Computation, Causation, and Discovery (C. Glymour and G. F. Cooper, eds.) 253–304. MIT Press.
[62] Richardson, T. S. and Spirtes, P. (2002). Ancestral Graph Markov Models. The Annals of Statistics 30 962–1030.
[63] Richardson, T. S. (1996). Models of Feedback: Interpretation and Discovery. PhD thesis, Carnegie Mellon University.
[64] Richardson, T. S. and Robins, J. (2013). Single World Intervention Graphs (SWIGs): A Unification of the Counterfactual and Graphical Approaches to Causality. Technical Report No. 128, Center for Statistics and the Social Sciences.
[65] Richardson, T. S. and Robins, J. M. (2014). ACE Bounds; SEMs with Equilibrium Conditions. Statistical Science 29 363–366.
[66] Roese, N. J. (1997). Counterfactual Thinking. Psychological Bulletin 121 133–148.
[67] Rubenstein, P. K., Weichwald, S., Bongers, S., Mooij, J. M., Janzing, D., Grosse-Wentrup, M. and Schölkopf, B. (2017). Causal Consistency of Structural Equation Models. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI-17) (G. Elidan and K. Kersting, eds.). AUAI Press.
[68] Rubin, D. B. (1974).
Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology 66 688–701.
[69] Shpitser, I. and Pearl, J. (2008). Complete Identification Methods for the Causal Hierarchy. Journal of Machine Learning Research 9 1941–1979.
[70] Spirtes, P. (1993). Directed Cyclic Graphs, Conditional Independence, and Non-recursive Linear Structural Equation Models. Technical Report No. CMU-PHIL-35, Carnegie Mellon University.
[71] Spirtes, P. (1994). Conditional Independence in Directed Cyclic Graphical Models for Feedback. Technical Report No. CMU-PHIL-54, Carnegie Mellon University.
[72] Spirtes, P. (1995). Directed Cyclic Graphical Representations of Feedback Models. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI-95) (P. Besnard and S. Hanks, eds.) 499–506. Morgan Kaufmann, San Francisco, CA, USA.
[73] Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction, and Search, 2nd ed. Adaptive Computation and Machine Learning. MIT Press, Cambridge, Massachusetts.
[74] Spirtes, P., Meek, C. and Richardson, T. S. (1999). An Algorithm for Causal Inference in the Presence of Latent Variables and Selection Bias. In Computation, Causation and Discovery (C. Glymour and G. F. Cooper, eds.) 211–252. MIT Press.
[75] Spirtes, P., Richardson, T., Meek, C., Scheines, R. and Glymour, C. (1998). Using Path Diagrams as a Structural Equation Modelling Tool. Sociological Methods & Research 27 182–225.
[76] Tian, J. (2002). Studies in Causal Reasoning and Learning. Technical Report No. R-309, Cognitive Systems Laboratory, University of California, Los Angeles, USA.
[77] Tian, J. and Pearl, J. (2001). Causal Discovery from Changes.
In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence (UAI-01) (J. Breese and D. Koller, eds.) 512–521. Morgan Kaufmann, San Francisco, CA, USA.
[78] Verma, T. S. (1993). Graphical Aspects of Causal Models. Technical Report No. R-191, Computer Science Department, University of California, Los Angeles, USA.
[79] Wright, S. (1921). Correlation and Causation. Journal of Agricultural Research 20 557–585.
[80] Zhang, J. (2008). On the Completeness of Orientation Rules for Causal Discovery in the Presence of Latent Confounders and Selection Bias. Artificial Intelligence 172 1873–1896.