Causal Discovery of Linear Cyclic Models from Multiple Experimental Data Sets with Overlapping Variables
Much scientific data is collected in randomized experiments that intervene on some variables of interest and observe others. Quite often, a given phenomenon is investigated in several studies, each involving a different set of variables. In this article we consider the problem of integrating such knowledge: inferring as much as possible about the underlying causal structure, with respect to the union of observed variables, from such experimental or passive-observational overlapping data sets. We do not assume acyclicity or joint causal sufficiency of the underlying data-generating model, but we do restrict the causal relationships to be linear and use only second-order statistics of the data. We derive conditions for full model identifiability in the most generic case, and provide novel techniques for incorporating an assumption of faithfulness to aid in inference. In each case we seek to establish what is, and what is not, determined by the data at hand.
💡 Research Summary
The paper tackles the problem of causal discovery when researchers have access to several experimental or observational data sets that only partially overlap in the variables they measure. Unlike most existing methods, the authors assume neither that the underlying system is acyclic nor that all common causes are observed. Instead, they restrict the model to be linear, allow feedback loops, and rely solely on second-order statistics (covariances).

The central contribution is twofold. First, the authors derive precise conditions under which the full linear cyclic model (both the coefficient matrix and the noise covariances) can be uniquely identified from a collection of data sets. These conditions require that each variable be directly intervened upon in at least one experiment, that every pair of variables appear together in some data set, and that the interventions be sufficiently diverse. By expressing the relationship between the observed covariances and the unknown structural coefficients as a system of linear equations, and by linking the overlapping parts of the data sets through a "transition matrix," they show that this system has a unique solution whenever the conditions hold. Second, recognizing that the strict identifiability conditions are rarely met in practice, the authors introduce a faithfulness assumption: a zero entry in a population covariance matrix is taken to reflect the absence of a causal connection between the corresponding variables. This yields additional linear constraints, called structural zeros, that can be incorporated into a linear (or mixed-integer) optimization problem.
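To make the model class concrete: a linear cyclic model is typically written as x = Bx + e, and at equilibrium x = (I - B)^{-1} e, so the observed covariance is Sigma = (I - B)^{-1} Sigma_e (I - B)^{-T}. This is what ties the second-order statistics to the unknown coefficients. The sketch below checks this identity numerically on a hypothetical 3-variable feedback system; the matrix values are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-variable linear cyclic model x = B x + e, with a feedback
# loop between x1 and x2 (B cannot be made triangular, so the graph is cyclic).
B = np.array([[0.0, 0.3, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.7, 0.0]])
Sigma_e = np.diag([1.0, 0.5, 0.8])     # independent noise terms

# At equilibrium x = (I - B)^{-1} e, so the second-order statistics are
# fully determined by B and Sigma_e:
#     Sigma = (I - B)^{-1} Sigma_e (I - B)^{-T}
M = np.linalg.inv(np.eye(3) - B)
Sigma = M @ Sigma_e @ M.T

# Check the formula against a large simulated sample from the equilibrium.
e = rng.multivariate_normal(np.zeros(3), Sigma_e, size=200_000)
x = e @ M.T
Sigma_hat = np.cov(x, rowvar=False)
print(np.max(np.abs(Sigma_hat - Sigma)))   # sampling error only; small
```

The inverse problem the paper studies runs in the other direction: given (experimental) covariances like `Sigma`, recover `B` and `Sigma_e`.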
The resulting algorithm proceeds by (i) estimating sample covariances for each data set, (ii) translating the intervention information into linear constraints on the unknown coefficient matrix, (iii) adding the faithfulness-derived zero constraints, and (iv) solving a regularized least-squares problem that simultaneously recovers the numeric values of the coefficients and the presence or absence of edges.

Experiments on synthetic graphs with varying degrees of variable overlap, and on real-world biological and neuroimaging data, show that the proposed method recovers the true causal structure more accurately than traditional acyclic methods such as PC or GES, especially when the overlap is limited. The paper also discusses limitations: the linearity assumption rules out non-linear systems, the optimization becomes computationally demanding in high-dimensional settings, and faithfulness may be violated in some domains. Suggested future work includes non-linear extensions, scalable sparse optimization, and Bayesian formulations that quantify uncertainty. In sum, the study provides a rigorous theoretical framework and a practical algorithm for integrating multiple overlapping experimental data sets to discover linear cyclic causal models, expanding the toolbox available to scientists working with fragmented but complementary data.
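As a toy illustration of why interventions turn coefficient recovery into a linear problem (a deliberately simplified, single-data-set version; the paper's actual algorithm combines partially overlapping experiments through a joint linear system): if in one experiment every variable except x_j is randomized, its structural equation x_j = B_j x + e_j still holds, so ordinary least squares of x_j on the randomized variables recovers row j of B. The experiment design and variable names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical cyclic coefficient matrix as a ground truth.
B = np.array([[0.0, 0.3, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.7, 0.0]])
n_vars, n = 3, 50_000
B_hat = np.zeros_like(B)

# One experiment per variable j: intervene on every variable except x_j
# (cutting their incoming edges and randomizing them), observe x_j passively.
for j in range(n_vars):
    others = [k for k in range(n_vars) if k != j]
    X = np.zeros((n, n_vars))
    X[:, others] = rng.normal(size=(n, len(others)))   # randomized inputs
    X[:, j] = X @ B[j] + 0.5 * rng.normal(size=n)      # x_j = B_j x + e_j
    coef, *_ = np.linalg.lstsq(X[:, others], X[:, j], rcond=None)
    B_hat[j, others] = coef                            # linear constraint solved

print(np.max(np.abs(B_hat - B)))   # recovery error; close to zero
```

The paper's setting is much harder than this sketch: no single experiment need intervene on all but one variable, and no single data set need contain all variables, which is exactly why the constraints from all experiments must be pooled and solved jointly.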