Causal screening in dynamical systems
Authors: Søren Wengel Mogensen
Causal screening in dynamical systems

Søren Wengel Mogensen
Department of Mathematical Sciences
University of Copenhagen
Copenhagen, Denmark
swengel@math.ku.dk

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.

Abstract

Many classical algorithms output graphical representations of causal structures by testing conditional independence among a set of random variables. In dynamical systems, local independence can be used analogously as a testable implication of the underlying data-generating process. We suggest some inexpensive methods for causal screening which provide output with a sound causal interpretation under the assumption of ancestral faithfulness. The popular model class of linear Hawkes processes is used to provide an example of a dynamical causal model. We argue that for sparse causal graphs the output will often be close to complete. We give examples of this framework and apply it to a challenging biological system.

1 INTRODUCTION

Constraint-based causal learning is computationally and statistically challenging. There is a large literature on learning structures that are represented by directed acyclic graphs (DAGs) or marginalizations thereof (see Maathuis et al. (2019) for references). The fast causal inference algorithm (FCI, Spirtes et al., 2000) provides in a certain sense maximally informative output (Zhang, 2008), but at the cost of using a large number of conditional independence tests (Colombo et al., 2012). To reduce the computational cost, other methods provide output which has a sound causal interpretation, but may be less informative. Among these are the anytime FCI (Spirtes, 2001) and RFCI (Colombo et al., 2012). A recent algorithm, ancestral causal inference (ACI, Magliacane et al., 2016), aims to learn only the directed part of the underlying graphical structure, which allows for a sound causal interpretation even though some information is lost.

In this paper, we describe some simple methods for learning causal structure in dynamical systems represented by stochastic processes. Many authors have described frameworks and algorithms for learning structure in systems of time series, ordinary differential equations, stochastic differential equations, and point processes. However, most of these methods do not have a clear causal interpretation when the observed processes are part of a larger system, and most of the current literature is either non-causal in nature or requires that there are no unobserved processes.

Analogously to testing conditional independence when learning DAGs, one can use tests of local independence in the case of dynamical systems. Eichler (2013), Meek (2014), and Mogensen et al. (2018) propose algorithms for learning graphs that represent local independence structures. We show empirically that we can recover features of their graphical learning target using considerably fewer tests of local independence. First, we suggest a learning target which is easier to learn, though it still conveys useful causal information, analogously to ACI (Magliacane et al., 2016). Second, the proposed algorithm is only guaranteed to provide a supergraph of the learning target, and this also reduces the number of local independence tests drastically. A central point is that our proposed methods retain a causal interpretation in the sense that absent edges in the output correspond to implausible causal connections.
Meek (2014) suggests learning a directed graph to represent a causal dynamical system and gives a learning algorithm which we will describe as a simple screening algorithm (Section 4.2). We show that this algorithm can be given a sound interpretation under a weaker faithfulness assumption than that of Meek (2014). We also provide a simple interpretation of the output of this algorithm, and we show that similar screening algorithms can give comparable results using considerably fewer tests of local independence. All proofs are provided in the supplementary material.

[Figure 1; plot graphics not reproduced.]
(a) Top: Example data from a four-dimensional Hawkes process. Bottom: The corresponding intensities. The time axis is aligned between the two plots.
(b) Left: The causal graph (see Section 2.1) of a four-dimensional Hawkes process. Right: Learning output of the standard approach (see Section 2) when 3 is unobserved. When 3 is unobserved, 2 is predictive of 4 and vice versa (heuristically, more events in process 2 indicate more events in process 3, which in turn indicate more events in process 4). However, they are not causally connected, and using local independence one can learn that 2 is not a parent of 4. This is important for predicting what would happen under interventions in the system, as the right-hand graph indicates that an intervention on 2 would change the distribution of 4 even though this is not the case, since g_{α2} = 0 for α ∈ {1, 3, 4}.
Figure 1: Subfigure 1a shows data generated from the system in 1b (left). Until the first event all intensities are constant (equal to μ_α for the α-process). The first event occurs in process 3. We see that g_{23}, g_{33}, and g_{43} are different from zero, as encoded by the graph in 1b (left). The event therefore makes the intensity processes of 2, 3, and 4 jump, making new events in these processes more likely in the immediate future (1a, bottom).

2 HAWKES PROCESSES

Local independence can be defined in a wide range of discrete-time and continuous-time dynamical models (e.g., point processes (Didelez, 2000), time series (Eichler, 2012), and diffusions (Mogensen et al., 2018); see also Commenges and Gégout-Petit (2009)), and the algorithmic results we present apply to all of these classes of models. However, the causal interpretation differs between these model classes, and we will use linear Hawkes processes to exemplify the framework. Laub et al. (2015) give an accessible introduction to this continuous-time model class, and Liniger (2009), Bacry et al. (2015), and Daley and Vere-Jones (2003) provide more background. Hawkes processes have also been studied in the machine learning community in recent years (Zhou et al., 2013a,b; Luo et al., 2015; Xu et al., 2016; Etesami et al., 2016; Achab et al., 2017; Tan et al., 2018; Xu et al., 2018; Trouleau et al., 2019). It is important to note that these papers all consider the case of full observation, i.e., every coordinate process is observed. In causal systems that are not fully observed, that assumption may lead to false conclusions (see Figure 1b). Our work addresses the learning problem without the assumption of full observation, hence there can be unknown and unobserved confounding processes.

On a filtered probability space, (Ω, F, (F_t), P), we consider an n-dimensional multivariate point process, X = (X^1, ..., X^n).
F_t is a filtration, i.e., a nondecreasing family of σ-algebras, and it represents the information which is available at a specific point in time. Each coordinate process X^α is described by a sequence of positive, stochastic event times T^α_1, T^α_2, ... such that T^α_j > T^α_i almost surely for j > i. We let V = {1, ..., n}. This can also be formulated in terms of a counting process, N, such that N^α_s = ∑_i 1(T^α_i ≤ s), α ∈ V. There exist so-called intensity processes, λ = (λ^1, ..., λ^n), such that

λ^α_t = lim_{h→0} (1/h) P(N^α_{t+h} − N^α_t = 1 | F_t),

and the intensity at time t can therefore be thought of as describing the probability of a jump in the immediate future after time t, conditionally on the history until time t as captured by the filtration F_t. In a linear Hawkes model, the intensity of the α-process, α ∈ V, is of the simple form

λ^α_t = μ_α + ∑_{γ∈V} ∫_0^t g_{αγ}(t − s) dN^γ_s = μ_α + ∑_{γ∈V} ∑_{i : T^γ_i < t} g_{αγ}(t − T^γ_i),

where μ_α > 0.

2.1 THE CAUSAL GRAPH

The causal graph of the process is the directed graph, G, on nodes V in which the edge γ → α is present if and only if g_{αγ} is not identically zero. In Figure 1b (left), for instance, g_{23}, g_{33}, and g_{43} are nonzero, while g_{α2} = 0 for α ∈ {1, 3, 4}.

2.2 PARENT GRAPHS

Let O ⊆ V denote a set of observed coordinate processes. The parent graph of the causal graph G relative to O is the directed graph on nodes O in which the edge α → β is present if and only if there is a directed path from α to β in G on which no nonendpoint node is in O (see also Proposition 18 in the supplementary material). We denote the parent graph of the causal graph by P_O(G), or just P(G) if the set O used is clear from the context. In applications, a parent graph may provide answers to important questions as it tells us the causal relationships between the observed nodes. A similar idea was applied in DAG-based models by Magliacane et al. (2016), though that paper describes an exact method and not a screening procedure. In large systems, it can easily be infeasible to learn the complete independence structure of the observed system, and we propose instead to estimate the parent graph, which can be done efficiently. In the supplementary material, we give another characterization of a parent graph. Figure 2 contains an example of a causal graph and a corresponding parent graph.

2.3 LOCAL INDEPENDENCE

Local independence has been studied by several authors, in different classes of continuous-time models as well as in time series (Aalen, 1987; Didelez, 2000, 2008; Eichler and Didelez, 2010).

[Figure 2; graphs not reproduced.]
Figure 2: Left: A causal graph on nodes V = {α, β, γ, δ, ε, φ}. Right: The corresponding parent graph on nodes O = {α, δ, ε}. Note that causal graphs and parent graphs may contain cycles. The parent graph does not contain information on the confounder process φ as it only encodes 'causal ancestors'. One can also marginalize the causal graph to obtain a directed mixed graph from which one can read off the parent graph (see the supplementary material).

We give an abstract definition of local independence, following the exposition of Mogensen et al. (2018).

Definition 2 (Local independence). Let X be a multivariate stochastic process and let V be an index set of its coordinate processes. Let F^D_t denote the complete and right-continuous version of the σ-algebra σ({X^α_s : s ≤ t, α ∈ D}), D ⊆ V. Let λ be a multivariate stochastic process (assumed to be integrable and càdlàg) such that its coordinate processes are indexed by V. For A, B, C ⊆ V, we say that X_B is λ-locally independent of X_A given X_C (or simply that B is λ-locally independent of A given C) if the process t ↦ E(λ^β_t | F^{C∪A}_t) has an F^C_t-adapted version for all β ∈ B. We write this as A ↛_λ B | C, or simply A ↛ B | C.

In the case of Hawkes processes, the intensities will be used as the λ-processes in the above definition. Didelez (2000), Mogensen et al. (2018), and Mogensen and Hansen (2020) provide technical details on the definition of local independence.
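To make the linear Hawkes model above concrete, the following sketch simulates event times by Ogata-style thinning. It assumes exponential excitation kernels g_{αγ}(t) = A[α, γ]·e^{−bt}; the kernel form and the names mu, A, b, and T are illustrative choices of ours, not specifications from the paper, which only requires continuous kernels and positive baselines (Assumption 4).

    import numpy as np

    def simulate_hawkes(mu, A, b, T, seed=0):
        """Ogata-style thinning for a linear Hawkes process with exponential
        kernels g_{alpha gamma}(t) = A[alpha, gamma] * exp(-b * t).
        (Kernel form and parameters are illustrative assumptions.)
        Returns a list of (time, coordinate) event pairs on [0, T]."""
        rng = np.random.default_rng(seed)
        mu, A = np.asarray(mu, float), np.asarray(A, float)
        n = len(mu)
        excite = np.zeros((n, n))  # excite[a, g]: current excitation of a by past g-events
        t, events = 0.0, []
        while t < T:
            lam = mu + excite.sum(axis=1)
            bound = lam.sum()              # valid bound: intensities only decay until the next event
            w = rng.exponential(1.0 / bound)
            excite *= np.exp(-b * w)       # decay the excitation to the candidate time
            t += w
            lam = mu + excite.sum(axis=1)  # true intensities at the candidate time
            if t < T and rng.uniform() < lam.sum() / bound:
                a = rng.choice(n, p=lam / lam.sum())  # pick the jumping coordinate
                events.append((t, a))
                excite[:, a] += A[:, a]    # an event in a excites every alpha with A[alpha, a] > 0
            # rejected candidates just advance time; the decayed state stays valid
        return events

Choosing n = 4 and letting only the entries of A corresponding to the edges of Figure 1b (left) be nonzero reproduces qualitative behavior like that of Subfigure 1a: an event in process 3 makes the intensities of processes 2, 3, and 4 jump.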
Local independence can be thought of as a dynamical system analogue of classical conditional independence. It is, however, asymmetric, which means that A ↛ B | C does not imply B ↛ A | C. This is a natural and desirable feature of an independence relation in a dynamical system as it helps us distinguish between the past and the present. It is important to note that by testing local independences we can obtain more information about the underlying parent graph than by simply assuming full observation and fitting a model to the observed data (see Figure 1b).

2.3.1 Local Independence and the Causal Graph

To make progress on the learning task, we will in this subsection describe the link between the local independence model and the causal graph.

Definition 3 (Pairwise Markov property (Didelez, 2008)). We say that a local independence model satisfies the pairwise Markov property with respect to a directed graph, D = (V, E), if the absence of the edge α → β in D implies α ↛_λ β | V \ {α} for all α, β ∈ V.

We will make the following technical assumption throughout the paper. In applications, the functions g_{αβ} are often assumed to be of the below type (Laub et al., 2015).

Assumption 4. Assume that N is a multivariate Hawkes process and that we observe N over the interval J = [0, T] where T > 0. For all α, β ∈ V, the function g_{αβ} : R_+ → R is continuous and μ_α > 0.

A version of the following result was also stated by Eichler et al. (2017), but no proof was given, and we provide one in the supplementary material. If G_1 = (V, E_1) and G_2 = (V, E_2) are graphs, we say that G_1 is a proper subgraph of G_2 if E_1 ⊊ E_2.

Proposition 5. The local independence model of a linear Hawkes process satisfies the pairwise Markov property with respect to the causal graph of the process, and no proper subgraph of the causal graph has this property.

3 GRAPH THEORY AND INDEPENDENCE MODELS

A graph is a pair (V, E) where V is a finite set of nodes and E a finite set of edges. We will use ∼ to denote a generic edge. Each edge is between a pair of nodes (not necessarily distinct), and for α, β ∈ V, e ∈ E, we will write α ∼_e β to denote that the edge e is between α and β. We will in particular consider the class of directed graphs (DGs) where between each pair of nodes α, β ∈ V one has a subset of the edges {α → β, α ← β}, and we say that these edges are directed. Let G_1 = (V, E_1) and G_2 = (V, E_2) be graphs. We say that G_2 is a supergraph of G_1, and write G_1 ⊆ G_2, if E_1 ⊆ E_2. For a graph G = (V, E) such that α, β ∈ V, we write α →_G β to indicate that the directed edge from α to β is contained in the edge set E. In this case we say that α is a parent of β. We let pa_G(β) denote the set of nodes in V that are parents of β. We write α ↛_G β to indicate that the edge is not in E. Earlier work allowed loops, i.e., self-edges α → α, to be either present or absent in the graph (Meek, 2014; Mogensen et al., 2018; Mogensen and Hansen, 2020). We assume that all loops are present, though this is not an essential assumption.

A walk is a finite sequence of nodes, α_i ∈ V, and edges, e_i ∈ E, ⟨α_1, e_1, α_2, ..., α_k, e_k, α_{k+1}⟩, such that e_i is between α_i and α_{i+1} for all i = 1, ..., k and such that an orientation of each edge is known. We say that a walk is nontrivial if it contains at least one edge. A path is a walk such that no node is repeated.
A directed path from α to β is a path such that all edges are directed and point in the direction of β.

Definition 6 (Trek, directed trek). A trek between α and β is a (nontrivial) path ⟨α, e_1, ..., e_k, β⟩ with no colliders (Foygel et al., 2012). We say that a trek between α and β is directed from α to β if e_k has a head at β.

We will formulate the following properties using a general independence model, I, on V. Let P(·) denote the power set of some set. An independence model on V is simply a subset of P(V) × P(V) × P(V) and can be thought of as a collection of independence statements that hold among the processes/variables indexed by V. In subsequent sections, the independence models will be defined using the notion of local independence. In this case, for A, B, C ⊆ V, A ↛_λ B | C is equivalent to writing ⟨A, B | C⟩ ∈ I in the abstract notation, and we use the two interchangeably. We do not require I to be symmetric, i.e., ⟨A, B | C⟩ ∈ I does not imply ⟨B, A | C⟩ ∈ I. In the following, we also use µ-separation, which is a ternary relation and a dynamical (and asymmetric) analogue of d-separation or m-separation.

Definition 7 (µ-separation). Let G = (V, E) be a DMG, and let α, β ∈ V and C ⊆ V. We say that a (nontrivial) walk from α to β, ⟨α, e_1, ..., e_k, β⟩, is µ-connecting given C if α ∉ C, the edge e_k has a head at β, every collider on the walk is in an(C), and no noncollider is in C. Let A, B, C ⊆ V. We say that B is µ-separated from A given C if there is no µ-connecting walk from any α ∈ A to any β ∈ B given C. In this case, we write A ⊥_µ B | C, or A ⊥_µ B | C [G] if we wish to emphasize the graph to which the statement relates.

More graph-theoretical definitions and references are given in the supplementary material.

Definition 8 (Global Markov property). We say that an independence model I satisfies the global Markov property with respect to a DG, G = (V, E), if A ⊥_µ B | C [G] implies ⟨A, B | C⟩ ∈ I for all A, B, C ⊆ V.

From Proposition 5, we know that the local independence model of a linear Hawkes process satisfies the pairwise Markov property with respect to its causal graph, and using the results of Didelez (2008) and Mogensen et al. (2018) it also satisfies the global Markov property with respect to this graph.

Definition 9 (Faithfulness). We say that I is faithful with respect to a DG, G = (V, E), if ⟨A, B | C⟩ ∈ I implies A ⊥_µ B | C [G] for all A, B, C ⊆ V.

4 NEW LEARNING ALGORITHMS

In this section, we state a very general class of algorithms which is easily seen to provide sound causal learning, and we describe some specific algorithms. We assume throughout that there is some underlying, true DG, D_0 = (V, E), describing the causal model, and we wish to output P_O(D_0). However, this graph is not in general identifiable from the local independence model. In the supplementary material, we argue that for an equivalence class of parent graphs, there exists a unique member of the class which is a supergraph of all other members. Denote this unique graph by D̄. Our algorithms will output supergraphs of D̄, and the output will therefore also be supergraphs of the true parent graph. We assume that we are in the 'oracle case', i.e., that we have access to a local independence oracle that provides the correct answers.
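In the oracle case, the global Markov property and faithfulness tie local independence to µ-separation, so an oracle can be realized by deciding µ-separation directly in a known causal DG. The sketch below is our own walk-based reachability implementation of Definition 7 for DGs (states are (node, arrival-mark) pairs); the function names are assumptions of ours.

    from collections import deque

    def ancestors(nodes, edges, C):
        """an(C): nodes with a directed path to some node in C; C itself is
        included since trivial paths count as directed (supplementary material)."""
        parents = {v: set() for v in nodes}
        for a, b in edges:
            parents[b].add(a)
        an, stack = set(C), list(C)
        while stack:
            v = stack.pop()
            for u in parents[v]:
                if u not in an:
                    an.add(u)
                    stack.append(u)
        return an

    def mu_separated(nodes, edges, A, B, C):
        """True iff B is mu-separated from A given C in the DG (nodes, edges).
        Searches for a mu-connecting walk: start outside C, end with a head at
        some node of B, colliders in an(C), noncolliders outside C."""
        A, B, C = set(A), set(B), set(C)
        anC = ancestors(nodes, edges, C)
        out_edges = {v: [] for v in nodes}
        in_edges = {v: [] for v in nodes}
        for a, b in edges:
            out_edges[a].append(b)
            in_edges[b].append(a)
        queue, seen = deque(), set()
        for v in A - C:                       # the first node must avoid C
            for w in out_edges[v]:
                queue.append((w, 'head'))     # traverse v -> w
            for u in in_edges[v]:
                queue.append((u, 'tail'))     # traverse u -> v backwards
        while queue:
            state = queue.popleft()
            if state in seen:
                continue
            seen.add(state)
            v, mark = state
            if mark == 'head' and v in B:
                return False                  # mu-connecting walk found
            if v not in C:                    # v acts as a noncollider
                for w in out_edges[v]:
                    queue.append((w, 'head'))
                if mark == 'tail':
                    for u in in_edges[v]:
                        queue.append((u, 'tail'))
            if mark == 'head' and v in anC:   # v acts as a collider
                for u in in_edges[v]:
                    queue.append((u, 'tail'))
        return True

Under the global Markov property, mu_separated(...) being True implies the corresponding local independence; under faithfulness the two coincide, so in the oracle case `oracle = lambda a, b, C: mu_separated(nodes, edges, {a}, {b}, set(C))` can stand in for a local independence oracle.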
We will say that an algorithm is sound if, in the oracle case, it outputs a supergraph of D̄, and that it is complete if it outputs D̄. We let I_O denote the local independence model restricted to subsets of O, i.e., the observed part of the local independence model. We provide algorithms that are guaranteed to be sound, but only complete in particular cases. Naturally, one would wish for completeness as well. However, complete algorithms can easily be computationally infeasible, whereas sound algorithms can be very inexpensive (e.g., Mogensen et al., 2018). We think of these sound algorithms as screening procedures as they rule out some causal connections, but do not ensure completeness.

4.1 ANCESTRAL FAITHFULNESS

Under the faithfulness assumption, every local independence implies µ-separation in the graph. We assume a weaker, but similar, property to show soundness. For learning marginalized DAGs, weaker types of faithfulness have also been explored; see Zhang and Spirtes (2008) and Zhalama et al. (2017a,b).

Definition 10 (Ancestral faithfulness). Let I be an independence model and let D be a DG. We say that I satisfies ancestral faithfulness with respect to D if for every α, β ∈ V and C ⊆ V \ {α}, ⟨α, β | C⟩ ∈ I implies that there is no µ-connecting directed path from α to β given C in D.

Ancestral faithfulness is a strictly weaker requirement than faithfulness. We conjecture that local independence models of linear Hawkes processes satisfy ancestral faithfulness with respect to their causal graphs. Heuristically, if there is a directed path from α to β which is not blocked by any node in C, then information should flow from α to β, and this cannot be 'cancelled out' by other paths in the graph, as linear Hawkes processes are self-excitatory, i.e., no process has a dampening effect on any process. This conjecture is supported by the so-called Poisson cluster representation of a linear Hawkes process (see Jovanović et al. (2015)).

4.2 SIMPLE SCREENING ALGORITHMS

As a first step in describing a causal screening algorithm, we define a very general class of learning algorithms that simply test local independences and sequentially remove edges. It is easily seen that under the assumption of ancestral faithfulness, every algorithm in this class gives sound learning in the oracle case. The complete DG on nodes V is the DG with edge set {α → β | α, β ∈ V}.

Definition 11 (Simple screening algorithm). We say that a learning algorithm is a simple screening algorithm if it starts from the complete DG on nodes O and removes an edge α → β only if a conditioning set C ⊆ O \ {α} has been found such that ⟨α, β | C⟩ ∈ I_O.

The next results describe what can be learned from absent edges in the output of a simple screening algorithm.

Proposition 12. Assume that I satisfies ancestral faithfulness with respect to D_0 = (V, E). The output of any simple screening algorithm is sound in the oracle case.

Corollary 13. Assume ancestral faithfulness of I with respect to D_0 and let A, B, C ⊆ O. If every directed path from A to B goes through C in the output graph of a simple screening algorithm, then every directed path from A to B goes through C in D_0.

Corollary 14. If there is no directed path from A to B in the output graph, then there is no directed path from A to B in D_0.
4.3 PARENT LEARNING

In the previous section, it was shown that if edges are only removed when a separating set is found, then the output is sound under the assumption of ancestral faithfulness. In this section, we give a specific algorithm. The key observation is that we can easily retrieve structural information from a rather small subset of local independence tests. Let D_t denote the output from Subalgorithm 1 (see below). The following result shows that under the assumption of faithfulness, α →_{D_t} β if and only if there is a directed trek from α to β in D_0.

Proposition 15. There is no directed trek from α to β in D_0 if and only if α ⊥_µ β | β [D_0].

Note that above, the β in the conditioning set represents the β-past while the other β represents the present of the β-process. While there is no distinction in the graph, this interpretation follows from the definition of local independence and the global Markov property. We will refer to running first Subalgorithm 1 and then Subalgorithm 2 (using the output DG from the first as input to the second) as the causal screening (CS) algorithm. Intuitively, Subalgorithm 2 simply tests whether a candidate set (the parent set) is a separating set, and other candidate sets could be chosen.

Proposition 16. The CS algorithm is a simple screening algorithm.

It is of course of interest to understand under which conditions the edge α → β is guaranteed to be removed by the CS algorithm when it is not in the underlying target graph. In the supplementary material, we state and prove a result describing one such condition.

Subalgorithm 1 (trek step):
  input: a local independence oracle for I_O
  output: a DG on nodes O
  initialize D as the complete DG on O;
  for each (α, β) ∈ V × V do
    if α ↛_λ β | β then delete α → β from D;
  end
  return D

Subalgorithm 2 (parent step):
  input: a local independence oracle for I_O and a DG, D = (O, E)
  output: a DG on nodes O
  for each (α, β) ∈ V × V such that α →_D β do
    if α ↛_λ β | pa_D(β) \ {α} then delete α → β from D;
  end
  return D

4.4 ANCESTRY PROPAGATION

In this section, we describe an additional step which propagates ancestry by reusing the output of Subalgorithm 1 to remove further edges. This comes at a price, as one needs faithfulness to ensure soundness. The idea is similar to ACI (Magliacane et al., 2016).

Subalgorithm 3 (ancestry propagation):
  input: a DG, D = (O, E)
  output: a DG on nodes O
  initialize E_r = ∅ as the empty edge set;
  for each (α, β, γ) ∈ V × V × V such that α, β, γ are all distinct do
    if α →_D β, β ↛_D α, β →_D γ, and α ↛_D γ then update E_r = E_r ∪ {β → γ};
  end
  update D = (V, E \ E_r);
  return D

In ancestry propagation, we exploit the fact that any trek between α and β (such that γ is not on this trek) composed with the edge β → γ gives a directed trek from α to γ. We only use the trek between α and β 'in one direction', as a directed trek from α to β. In Subalgorithm 4 (supplementary material), we use a trek between α and β twice when possible, at the cost of an additional test.

We can construct an algorithm by first running Subalgorithm 1, then Subalgorithm 3, and finally Subalgorithm 2 (using the output of one subalgorithm as input to the next). We will call this the CSAPC algorithm. If we use Subalgorithm 4 (in the supplementary material) instead of Subalgorithm 3, we will call it the CSAP algorithm.

Proposition 17. If I is faithful with respect to D_0, then CSAP and CSAPC both provide sound learning.
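The subalgorithms translate directly into code. The sketch below implements the CS and CSAPC algorithms against an abstract oracle(α, β, C) that returns True when ⟨α, β | C⟩ ∈ I_O (for instance, the µ-separation routine above in the oracle case); the function names are ours, and loops are kept present and never tested, in line with Section 3.

    def trek_step(O, oracle):
        # Subalgorithm 1 (trek step): start from the complete DG on O and
        # delete a -> b whenever a is locally independent of b given {b}.
        edges = {(a, b) for a in O for b in O}  # includes loops a -> a
        for a in O:
            for b in O:
                if a != b and oracle(a, b, {b}):
                    edges.discard((a, b))
        return edges

    def parent_step(O, edges, oracle):
        # Subalgorithm 2 (parent step): delete a -> b if the current parent
        # set of b (minus a) is a separating set; parents are recomputed as
        # edges are removed.
        edges = set(edges)
        for a, b in sorted(edges):
            if a == b or (a, b) not in edges:
                continue
            pa_b = {u for (u, v) in edges if v == b and u != a}
            if oracle(a, b, pa_b):
                edges.discard((a, b))
        return edges

    def ancestry_propagation(edges):
        # Subalgorithm 3 (ancestry propagation): uses no extra tests; collect
        # removals E_r first, then delete them all at once.
        edges = set(edges)
        removals = {(b, g) for (a, b) in edges for (b2, g) in edges
                    if b2 == b and len({a, b, g}) == 3
                    and (b, a) not in edges and (a, g) not in edges}
        return edges - removals

    def cs(O, oracle):
        # The CS algorithm: trek step followed by parent step.
        return parent_step(O, trek_step(O, oracle), oracle)

    def csapc(O, oracle):
        # The CSAPC algorithm: trek step, ancestry propagation, parent step.
        return parent_step(O, ancestry_propagation(trek_step(O, oracle)), oracle)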
5 APPLICATION AND SIMULATIONS

When evaluating the performance of a sound screening algorithm, the output graph is guaranteed to be a supergraph of the true parent graph, and we will say that edges that are in the output but not in the true graph are excess edges. For a node in a directed graph, the indegree is the number of directed edges adjacent to and pointing into the node, and the outdegree is the number of directed edges adjacent to and pointing away from the node.

One should note that all our experiments are done using an oracle test, i.e., instead of using real or synthetic data, the algorithms simply query an oracle for each local independence and receive the correct answer. This tests whether or not an algorithm can give good results using an efficient testing strategy (i.e., a low number of queries to the oracle), and it therefore evaluates the algorithms themselves. This approach separates the algorithm from the specific test of local independence and evaluates only the algorithm. As such, it is highly unrealistic, as we would never have access to an oracle with real data; however, we should think of these experiments as a study of efficiency. The oracle approach to evaluating graphical learning algorithms is common in the DAG-based case; see Spirtes (2010) for an overview. Also note that the comparison is only made with other constraint-based learning algorithms that can actually solve the problem at hand. Learning methods that assume full observation (such as the Hawkes methods mentioned in Section 2) would generally not output a graph with the correct interpretation even in the oracle case (see the example in Figure 1b).

5.1 C. ELEGANS NEURONAL NETWORK

Caenorhabditis elegans is a roundworm in which the network between neurons has been mapped completely (Varshney et al., 2011). We apply our methods to this network as an application to a highly complex network. It consists of 279 neurons which are connected by both non-directional gap junctions and directional chemical synapses. We represent the former as unobserved processes and the latter as direct influences, which is consistent with the biological system (Varshney et al., 2011). From this network, we sampled subnetworks of 75 neurons each (details in the supplementary material) and computed the output of the CS algorithm. These subsampled networks had on average 1109 edges (including bidirected edges representing unobserved processes; see the supplementary material) and on average 424 directed edges. The output graphs had on average 438 excess edges, which is explained by the fact that there are many unobserved nodes in the graphs. To compare the output to the true parent graph, we computed the rank correlation between the indegrees of the nodes in the output graph and the indegrees of the nodes in the true parent graph, and similarly for the outdegrees (indegree correlation: 0.94, outdegree correlation: 0.52). Finally, we investigated the method's ability to identify the observed nodes of highest directed connectivity (i.e., highest in- and outdegrees). The neuronal network of C. elegans is inhomogeneous in the sense that some neurons are extremely highly connected while others are only very sparsely connected. We considered the 15 nodes of highest indegree/outdegree (out of the 75 observed nodes). On average, the CS algorithm placed 13.4 (in) and 9.2 (out) of these 15 among the 15 most connected nodes.
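The degree-based summaries above are straightforward to compute from an output graph and the true parent graph. A small sketch follows; we assume Spearman's coefficient is the rank correlation intended, and the function is an illustration of ours, not code from the paper.

    from scipy.stats import spearmanr

    def degree_summaries(true_edges, output_edges, O, k=15):
        """Compare an output graph with the true parent graph via node degrees.
        Edges are (a, b) pairs meaning a -> b; loops are ignored."""
        def degrees(edges):
            indeg = {v: 0 for v in O}
            outdeg = {v: 0 for v in O}
            for a, b in edges:
                if a != b and a in O and b in O:
                    outdeg[a] += 1
                    indeg[b] += 1
            return indeg, outdeg

        t_in, t_out = degrees(true_edges)
        o_in, o_out = degrees(output_edges)
        nodes = sorted(O)
        in_corr, _ = spearmanr([t_in[v] for v in nodes], [o_in[v] for v in nodes])
        out_corr, _ = spearmanr([t_out[v] for v in nodes], [o_out[v] for v in nodes])
        # Overlap between the k nodes of highest indegree in the two graphs.
        top_true = set(sorted(nodes, key=t_in.get, reverse=True)[:k])
        top_out = set(sorted(nodes, key=o_in.get, reverse=True)[:k])
        return in_corr, out_corr, len(top_true & top_out)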
From the output of the CS algorithm, we can find areas of the neuronal network which mediate information from one area to another, e.g., using Corollary 13.

[Figure 3; plots not reproduced.]
(a) Comparison of the number of tests used (number of tests used vs. number of edges in the underlying graph, for CS, dFCI, and CA). For each level of sparsity (number of edges in the true graph), we generated 500 graphs, all on 5 nodes. The number of tests required quickly rises for dFCI and CA, while CS spends no more than 2 · 5(5 − 1) tests. The output of dFCI and CA is not considerably more informative as measured by the mean number of excess edges: CS 0.96, dFCI 0.07, CA 0.81 (average over all levels of sparsity).
(b) Mean number of excess edges in output graphs (for CS, CSAP, CSAPC, dFCI, and CA) for varying numbers of edges (bidirected and directed) in the true graph (all graphs are on 10 nodes), not counting loops.
Figure 3: Comparison of performance.

5.2 COMPARISON OF ALGORITHMS

In this section we compare the proposed causal screening algorithms with previously published algorithms that solve similar problems. Mogensen et al. (2018) propose two algorithms, one of which is certain to output the correct graph when an oracle test is available. They note that this complete algorithm is computationally very expensive and adds little extra information, and we therefore only consider their other algorithm for comparison. We will call this algorithm dynamical FCI (dFCI) as it resembles FCI (Mogensen et al., 2018). dFCI actually solves a harder learning problem (see details in the supplementary material); however, it is computationally infeasible for many problems. The Causal Analysis (CA) algorithm of Meek (2014) is a simple screening algorithm, and we have in this paper argued that it is sound for learning the parent graph under the weaker assumption of ancestral faithfulness. Even though this algorithm uses a large number of tests, it is not guaranteed to provide complete learning, as there may be inseparable nodes that are not adjacent (Mogensen et al., 2018; Mogensen and Hansen, 2020).

For the comparison of these algorithms, two aspects are important. As they are all sound, one aspect is the number of excess edges. The other aspect is of course the number of tests needed. The CS and CSAPC algorithms use at most 2n(n − 1) tests, and empirically the CSAP algorithm uses roughly the same number as the two former. This makes them feasible in large graphs. The quality of their output depends on the sparsity of the true graph, though the CSAP and CSAPC algorithms deal considerably better with less sparse graphs (Subfigure 3b).

6 DISCUSSION

We suggested inexpensive constraint-based methods for learning causal structure based on testing local independence. An important observation is that local independence is asymmetric while conditional independence is symmetric. In a certain sense, this may help when constructing learning algorithms as there is no need for something like an 'orientation phase' as in FCI. This facilitates using very simple methods to give sound causal learning as we do not need the independence structure in full to give interesting output. Simple screening algorithms may be either adaptive or nonadaptive. We note that nonadaptive algorithms may be more robust to false conclusions from statistical tests of local independence.
The amount of information in the output of the screening algorithms depends on the sparsity of the true graph. However, even in examples with very little sparsity, interesting structural information can be learned.

We showed that the proposed algorithms have a computational advantage over previously published algorithms within this framework. This makes it feasible to consider causal learning in large networks with unobserved processes. We obtained this gain in efficiency in part by outputting only the directed part of the causal structure. This means that we may be able to answer structural questions, but not questions relating to causal effect estimation.

Acknowledgments

This work was supported by VILLUM FONDEN (research grant 13358). We thank Niels Richard Hansen and the anonymous reviewers for their helpful comments that improved this paper.

References

Odd O. Aalen. Dynamic modelling and causality. Scandinavian Actuarial Journal, pages 177–190, 1987.

Massil Achab, Emmanuel Bacry, Stéphane Gaïffas, Iacopo Mastromatteo, and Jean-François Muzy. Uncovering causality from multivariate Hawkes integrated cumulants. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

Emmanuel Bacry, Iacopo Mastromatteo, and Jean-François Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(1), 2015.

Diego Colombo, Marloes H. Maathuis, Markus Kalisch, and Thomas S. Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 40(1):294–321, 2012.

Daniel Commenges and Anne Gégout-Petit. A general dynamical statistical model with causal interpretation. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 71(3):719–736, 2009.

Daryl J. Daley and David D. Vere-Jones. An Introduction to the Theory of Point Processes. New York: Springer, 2nd edition, 2003.

Vanessa Didelez. Graphical Models for Event History Analysis based on Local Independence. PhD thesis, Universität Dortmund, 2000.

Vanessa Didelez. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society, Series B, 70(1):245–264, 2008.

Michael Eichler. Graphical modelling of multivariate time series. Probability Theory and Related Fields, 153(1):233–268, 2012.

Michael Eichler. Causal inference with multiple time series: Principles and problems. Philosophical Transactions of the Royal Society, 371(1997):1–17, 2013.

Michael Eichler and Vanessa Didelez. On Granger causality and the effect of interventions in time series. Lifetime Data Analysis, 16(1):3–32, 2010.

Michael Eichler, Rainer Dahlhaus, and Johannes Dueck. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38:225–242, 2017.

Jalal Etesami, Negar Kiyavash, Kun Zhang, and Kushagra Singhal. Learning network of multivariate Hawkes processes: A time series approach. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI), 2016.

Rina Foygel, Jan Draisma, and Mathias Drton. Half-trek criterion for generic identifiability of linear structural equation models. The Annals of Statistics, 40(3):1682–1713, 2012.

Stojan Jovanović, John Hertz, and Stefan Rotter. Cumulants of Hawkes point processes. Physical Review E, 91(4), 2015.

Patrick J. Laub, Thomas Taimre, and Philip K. Pollett. Hawkes processes. 2015.
URL https://arxiv.org/pdf/1507.02822.pdf.

Thomas Josef Liniger. Multivariate Hawkes Processes. PhD thesis, ETH Zürich, 2009.

Dixin Luo, Hongteng Xu, Yi Zhen, Xia Ning, Hongyuan Zha, Xiaokang Yang, and Wenjun Zhang. Multi-task multi-dimensional Hawkes processes for modeling event sequences. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.

Marloes Maathuis, Mathias Drton, Steffen Lauritzen, and Martin Wainwright. Handbook of Graphical Models. Chapman & Hall/CRC Handbooks of Modern Statistical Methods, 2019.

Sara Magliacane, Tom Claassen, and Joris M. Mooij. Ancestral causal inference. In Proceedings of the 29th Conference on Neural Information Processing Systems (NIPS), 2016.

Christopher Meek. Toward learning graphical and causal process models. In CI'14: Proceedings of the UAI 2014 Conference on Causal Inference: Learning and Prediction, 2014.

Søren Wengel Mogensen and Niels Richard Hansen. Markov equivalence of marginalized local independence graphs. The Annals of Statistics, 48(1), 2020.

Søren Wengel Mogensen, Daniel Malinsky, and Niels Richard Hansen. Causal learning for partially observed stochastic dynamical systems. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018.

Judea Pearl. Causality. Cambridge University Press, 2009.

Jonas Christopher Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.

Kjetil Røysland. Counterfactual analyses with graphical models based on local independence. The Annals of Statistics, 40(4):2162–2194, 2012.

Peter Spirtes. An anytime algorithm for causal inference. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics (AISTATS), 2001.

Peter Spirtes. Introduction to causal inference. Journal of Machine Learning Research, 11, 2010.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.

Xi Tan, Vinayak Rao, and Jennifer Neville. Nested CRP with Hawkes-Gaussian processes. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

William Trouleau, Jalal Etesami, Matthias Grossglauser, Negar Kiyavash, and Patrick Thiran. Learning Hawkes processes under synchronization noise. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Lav R. Varshney, Beth L. Chen, Eric Paniagua, David H. Hall, and Dmitri B. Chklovskii. Structural properties of the Caenorhabditis elegans neuronal network. PLoS Computational Biology, 7(2), 2011.

Thomas Verma and Judea Pearl. Equivalence and synthesis of causal models. Technical Report R-150, University of California, Los Angeles, 1991.

Hongteng Xu, Mehrdad Farajtabar, and Hongyuan Zha. Learning Granger causality for Hawkes processes. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

Hongteng Xu, Dixin Luo, Xu Chen, and Lawrence Carin. Benefits from superposed Hawkes processes. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Zhalama, Jiji Zhang, Frederick Eberhardt, and Wolfgang Mayer. SAT-based causal discovery under weaker assumptions. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017a.

Zhalama, Jiji Zhang, and Wolfgang Mayer.
Weakening faithfulness: Some heuristic causal discovery algorithms. International Journal of Data Science and Analytics, 3:93–104, 2017b.

Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172:1873–1896, 2008.

Jiji Zhang and Peter Spirtes. Detection of unfaithfulness and robust causal inference. Minds & Machines, 18:239–271, 2008.

Ke Zhou, Hongyuan Zha, and Le Song. Learning triggering kernels for multi-dimensional Hawkes processes. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013a.

Ke Zhou, Hongyuan Zha, and Le Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 2013b.

Supplementary material

This supplementary material contains additional graph theory, results, and definitions, as well as the proofs of the main paper.

7 GRAPH THEORY

In the main paper, we introduce the class of DGs to represent causal structures. One can represent marginalized DGs using the larger class of DMGs. A directed mixed graph (DMG) is a graph such that any pair of nodes α, β ∈ V is joined by a subset of the edges {α → β, α ← β, α ↔ β}. We say that the edges α → β and α ← β are directed, and that α ↔ β is bidirected. We say that the edge α → β has a head at β and a tail at α; α ↔ β has heads at both α and β. We also introduced walks, ⟨α_1, e_1, α_2, ..., α_n, e_n, α_{n+1}⟩. We say that α_1 and α_{n+1} are endpoint nodes. A nonendpoint node α_i on a walk is a collider if e_{i−1} and e_i both have heads at α_i, and otherwise it is a noncollider. A cycle is a path ⟨α, e_1, ..., β⟩ composed with an edge between α and β. We say that α is an ancestor of β if there exists a directed path from α to β. We let an(β) denote the set of nodes that are ancestors of β. For a node set C, we let an(C) = ∪_{β∈C} an(β). By convention, we say that a trivial path (i.e., a path with no edges) is directed, and this means that C ⊆ an(C).

For DAGs, d-separation is often used for encoding independences. We use the analogous notion of µ-separation, which is a generalization of δ-separation (Didelez, 2000, 2008; Meek, 2014; Mogensen and Hansen, 2020). We use the class of DGs to represent the underlying, data-generating structure. When only part of the causal system is observed, the class of DMGs can be used to represent marginalized DGs (Mogensen and Hansen, 2020). This can be done using the latent projection (Verma and Pearl, 1991; Mogensen and Hansen, 2020), which is a map that for a DG (or, more generally, a DMG), D = (V, E), and a subset of observed nodes/processes, O ⊆ V, provides a DMG, m(D, O), such that for all A, B, C ⊆ O,

A ⊥_µ B | C [D] ⇔ A ⊥_µ B | C [m(D, O)].

See Mogensen and Hansen (2020) for details on this graphical marginalization. We say that two DMGs, G_1 = (V, E_1) and G_2 = (V, E_2), are Markov equivalent if A ⊥_µ B | C [G_1] ⇔ A ⊥_µ B | C [G_2] for all A, B, C ⊆ V, and we let [G_1] denote the Markov equivalence class of G_1. Every Markov equivalence class of DMGs has a unique maximal element (Mogensen and Hansen, 2020), i.e., there exists G ∈ [G_1] such that G is a supergraph of all other graphs in [G_1].
For a DMG, G, we let D(G) denote the directed part of G, i.e., the DG obtained by deleting all bidirected edges from G.

Proposition 18. Let D = (V, E) be a DG, and let O ⊆ V. Consider G = m(D, O). For α, β ∈ O it holds that α ∈ an_D(β) if and only if α ∈ an_{D(G)}(β). Furthermore, the directed part of G equals the parent graph of D on nodes O, i.e., D(G) = P_O(D).

Proof. Note first that α ∈ an_D(β) if and only if α ∈ an_G(β) (Mogensen and Hansen, 2020). Ancestry is defined by the directed edges only, and it follows that α ∈ an_G(β) if and only if α ∈ an_{D(G)}(β). For the second statement, the definition of the latent projection gives that there is a directed edge from α to β in G if and only if there is a directed path from α to β in D such that no nonendpoint node is in O. By definition, this is the parent graph, P_O(D).

In words, the above proposition says that if G is a marginalization (done by latent projection) of D, then the ancestor relations of D and D(G) are the same among the observed nodes. It also says that our learning target, the parent graph, is actually the directed part of the latent projection on the observed nodes. In the next subsection, we use this to describe what is actually identifiable from the induced independence model of a graph.

7.1 MAXIMAL GRAPHS AND PARENT GRAPHS

Under faithfulness of the local independence model with respect to the causal graph, we know that the maximal DMG is a correct representation of the local independence structure in the sense that it encodes exactly the local independences that hold in the local independence model. From the maximal DMG, one can use results on equivalence classes of DMGs to obtain every other DMG which encodes the observed local independences (Mogensen and Hansen, 2020), and from this graph one can find the parent graph as simply the directed part. However, it may require an infeasible number of tests to output such a maximal DMG. This is not surprising, seeing that this learning target encodes the complete information on local independences.

Assume that D_0 = (V, E) is the underlying causal graph and that G_0 = (O, F), O ⊆ V, is the marginalized graph over the observed variables, i.e., the latent projection of D_0. In principle, we would like to output P(D_0) = D(G_0), the directed part of G_0. However, no algorithm can in general output this graph by testing only local independences, as Markov equivalent DMGs need not have the same parent graph. Within each Markov equivalence class of DMGs, there is a unique maximal graph. Let Ḡ denote the maximal graph which is Markov equivalent to G_0. The DG D(Ḡ) is a supergraph of D(G_0), and we say that a learning algorithm is complete if it is guaranteed to output D(Ḡ), as no algorithm testing only local independences can identify anything more than the equivalence class.

8 COMPLETE LEARNING

The CS algorithm provides sound learning of the parent graph of a general DMG under the assumption of ancestral faithfulness. For a subclass of DMGs, the algorithm actually provides complete learning. It is of interest to find sufficient graphical conditions ensuring that the algorithm removes an edge α → β which is not in the true parent graph. In this section, we state and prove one such condition, which can be understood as 'the true parent set is always found for unconfounded processes'. We let D denote the output of the CS algorithm.

Proposition 19.
If α ↛_{G_0} β and there is no γ ∈ V \ {β} such that γ ↔_{G_0} β, then α ↛_D β.

Proof. Let D_1, D_2, ..., D_N denote the DGs that are constructed when running the algorithm by sequentially removing edges, starting from the complete DG, D_1. Consider a connecting walk from α to β in G_0. It must be of the form α ∼ ... ∼ γ → β, γ ≠ α. Under ancestral faithfulness, the edge γ → β is in D, thus γ ∈ pa_{D_i}(β) for all D_i that occur during the algorithm, and therefore when ⟨α, β | pa_{D_i}(β) \ {α}⟩ is tested, the walk is closed. Any walk from α to β is of this form, thus also closed, and we have that α ⊥_µ β | pa_{D_i}(β) and therefore ⟨α, β | pa_{D_i}(β) \ {α}⟩ ∈ I. The edge α →_{D_i} β is removed and is thus absent in the output graph, D.

9 ANCESTRY PROPAGATION

We state Subalgorithm 4 here. Composing Subalgorithm 1, Subalgorithm 4, and Subalgorithm 2 is referred to as the causal screening, ancestry propagation (CSAP) algorithm. If we use Subalgorithm 3 instead of Subalgorithm 4, we call it the CSAPC algorithm (C for cheap, as it does not entail any additional independence tests compared to CS).

Subalgorithm 4 (ancestry propagation):
  input: a local independence oracle for I_O and a DG, D = (O, E)
  output: a DG on nodes O
  initialize E_r = ∅ as the empty edge set;
  for each (α, β, γ) ∈ V × V × V such that α, β, γ are all distinct do
    if α ∼_D β, β →_D γ, and α ↛_D γ then
      if ⟨α, γ | ∅⟩ ∈ I_O then update E_r = E_r ∪ {β → γ};
  end
  update D = (V, E \ E_r);
  return D

10 APPLICATION AND SIMULATIONS

In this section, we provide some additional details about the C. elegans neuronal network and the simulations.

10.1 C. ELEGANS NEURONAL NETWORK

For each connection between two neurons, a different number of synapses is present (ranging from 1 to 37). We only consider connections with more than 4 synapses when we define the true underlying network. When sampling the subnetworks, highly connected neurons were sampled with higher probability to avoid a fully connected subnetwork after marginalization.

10.2 COMPARISON OF ALGORITHMS

As noted in the main paper, the dFCI algorithm solves a strictly harder problem. Using the additional graph theory in this supplementary material, we can understand the output of the dFCI algorithm as a supergraph of the maximal DMG, Ḡ. There is also a version of dFCI which is guaranteed to output not only a supergraph of Ḡ, but the graph Ḡ itself. Clearly, from the output of the dFCI algorithm one can simply take the directed part, and this is a supergraph of the underlying parent graph.

11 PROOFS

In this section, we provide the proofs of the results in the main paper.

Proof of Proposition 5. Let D denote the causal graph. Assume first that α ↛_D β. Then g_{βα} is identically zero over the observation interval, and it follows directly from the functional form of λ^β_t that α ↛ β | V \ {α}. This shows that the local independence model satisfies the pairwise Markov property with respect to D.

If instead g_{βα} ≠ 0 over J, there exists r ∈ J such that g_{βα}(r) ≠ 0. By continuity of g_{βα}, there exists a compact interval of positive measure, I ⊆ J, such that inf_{s∈I} g_{βα}(s) ≥ g^{min}_{βα} for some g^{min}_{βα} > 0. Let i_0 and i_1 denote the endpoints of this interval, i_0 < i_1. We now consider the events

D_k = (N^α_{T−i_0} − N^α_{T−i_1} = k, N^γ_T = 0 for all γ ∈ V \ {α}), k ∈ N_0.
Then under Assumption 4, for all k,

λ^β_T 1_{D_k} ≥ 1_{D_k} ∫_I g_{βα}(T − s) dN^α_s ≥ g^{min}_{βα} · k · 1_{D_k}.

Assume for contradiction that β is locally independent of α given V \ {α}. Then λ^β_T = E(λ^β_T | F^V_T) = E(λ^β_T | F^{V\{α}}_T) is constant on ∪_k D_k, and furthermore P(D_k) > 0 for all k. However, this contradicts the above inequality as k → ∞.

Proof of Proposition 12. Let D denote the DG which is output by the algorithm. We should show that P(D_0) ⊆ D. Assume that α →_{P(D_0)} β. In this case, there is a directed path from α to β in D_0 such that no nonendpoint node on this directed path is in O (the observed coordinates). Therefore, for any C ⊆ O \ {α} there exists a µ-connecting directed walk from α to β in D_0, and by ancestral faithfulness it follows that ⟨α, β | C⟩ ∉ I. The algorithm starts from the complete directed graph, and the above means that the directed edge from α to β will not be removed.

Proof of Corollary 13. Consider some directed path from α to β in D_0 on which no node is in C. Then there is also a directed path from α to β on which no node is in C in the graph P(D_0), and therefore also in the output graph, using Proposition 12.

Proof of Proposition 15. Assume that there is a µ-connecting walk from α to β given {β}. If this walk has no colliders, then it is a directed trek, or it can be reduced to one. Otherwise, let γ be the collider which is closest to the endpoint α. Then γ ∈ an(β), and composing the subwalk from α to γ with a directed path from γ to β gives a directed trek, or it can be reduced to one. On the other hand, assume that there is a directed trek from α to β. This trek is µ-connecting from α to β given {β}.

Proof of Proposition 17. Assume β →_{P(D_0)} γ. Subalgorithms 1 and 2 are both simple screening algorithms, and they will not remove this edge. Assume for contradiction that β → γ is removed by Subalgorithm 3. Then there must exist α ≠ β, γ and a directed trek from α to β in D_0. On this directed trek, γ does not occur, as this would imply a directed trek either from α to γ or from β to α, thus implying α →_D γ or β →_D α, respectively (D is the output graph of Subalgorithm 1). As γ does not occur on the trek, composing this trek with the edge β → γ would give a directed trek from α to γ. By faithfulness, ⟨α, γ | γ⟩ ∉ I, and this is a contradiction, as α → γ would then not have been removed during Subalgorithm 1.

We consider instead CSAP. Assume for contradiction that β → γ is removed during Subalgorithm 4. There exists in D_0 either a directed trek from α to β or a directed trek from β to α. If γ is on this trek, then γ is not µ-separated from α given the empty set (recall that there are loops at all nodes, therefore also at γ), and using faithfulness we conclude that γ is not on this trek. Composing it with the edge β → γ would give a directed trek from α to γ, and using faithfulness we obtain a contradiction.