Graphical Condition for Identification in recursive SEM

The paper concerns the problem of predicting the effect of actions or interventions on a system from a combination of (i) statistical data on a set of observed variables, and (ii) qualitative causal knowledge encoded in the form of a directed acyclic graph (DAG). The DAG represents a set of linear equations called a Structural Equation Model (SEM), whose coefficients are parameters representing direct causal effects. Reliable quantitative conclusions can only be obtained from the model if the causal effects are uniquely determined by the data, that is, if there exists a unique parametrization of the model that makes it compatible with the data. If this is the case, the model is called identified. The main result of the paper is a general sufficient condition for identification of recursive SEMs.

💡 Research Summary

The paper tackles the fundamental problem of causal effect identification in linear structural equation models (SEMs) when researchers have access to (i) statistical data on observed variables and (ii) qualitative causal knowledge encoded as a directed acyclic graph (DAG). In a recursive (acyclic) SEM each variable is expressed as a linear combination of its parent variables plus an error term, and the coefficients of these linear relationships are the structural parameters that represent direct causal effects. A model is said to be identified if the observed covariance matrix uniquely determines all structural parameters; otherwise, any inference about causal effects is ambiguous.
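To make the definition of identification concrete, the following minimal sketch computes the covariance matrix implied by a linear SEM and shows a case where identification fails. The two-variable model (X → Y with correlated error terms), the numeric values, the error covariance name `Omega`, and the helper `implied_covariance` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def implied_covariance(B, Omega):
    """Covariance of X when X = B X + eps and Cov(eps) = Omega.

    B[i, j] is the coefficient of variable j in the equation for variable i.
    """
    n = B.shape[0]
    A = np.linalg.inv(np.eye(n) - B)  # X = (I - B)^{-1} eps
    return A @ Omega @ A.T

# Hypothetical model with variables ordered (X, Y): Y = b * X + eps_Y,
# where eps_X and eps_Y may be correlated.
# Parametrization 1: b = 0.5, error covariance 0.2.
B1 = np.array([[0.0, 0.0], [0.5, 0.0]])
Omega1 = np.array([[1.0, 0.2], [0.2, 1.0]])

# Parametrization 2: b = 0.7, uncorrelated errors.
B2 = np.array([[0.0, 0.0], [0.7, 0.0]])
Omega2 = np.array([[1.0, 0.0], [0.0, 0.96]])

sigma1 = implied_covariance(B1, Omega1)
sigma2 = implied_covariance(B2, Omega2)

# Both parametrizations imply the same covariance matrix, so the data
# cannot tell b = 0.5 apart from b = 0.7.
print(np.allclose(sigma1, sigma2))  # True
```

Because two different coefficient values reproduce the same covariance matrix, this toy model is not identified; additional graphical structure is what a sufficient condition for identification has to supply.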

Existing sufficient conditions for identification include the classic back‑door criterion, instrumental variable (IV) methods, and the more recent half‑trek criterion. While powerful, each of these approaches applies only to a restricted class of graphs: the back‑door criterion requires a set of covariates that blocks all non‑causal paths, IV methods need an external variable that influences the treatment but affects the outcome only through the treatment, and the half‑trek criterion demands a particular pattern of treks that is not always present. Consequently, many realistic DAGs remain outside the reach of these methods.
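As an illustration of how an instrument identifies a coefficient in the linear setting, the sketch below works at the population level with a hypothetical model Z → X → Y in which the errors of X and Y are correlated. All numeric values, the variable ordering, and the helper `implied_covariance` are assumptions chosen for the example, not material from the paper.

```python
import numpy as np

def implied_covariance(B, Omega):
    """Covariance of X when X = B X + eps and Cov(eps) = Omega."""
    n = B.shape[0]
    A = np.linalg.inv(np.eye(n) - B)
    return A @ Omega @ A.T

# Hypothetical coefficients: Z -> X has weight a, X -> Y has weight b,
# and c is the covariance between the error terms of X and Y (confounding).
a, b, c = 0.8, 0.5, 0.3
Z, X, Y = 0, 1, 2  # variable ordering

B = np.array([[0.0, 0.0, 0.0],
              [a,   0.0, 0.0],
              [0.0, b,   0.0]])
Omega = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, c],
                  [0.0, c,   1.0]])
Sigma = implied_covariance(B, Omega)

# Regressing Y on X is biased by the correlated errors (about 0.68 here).
ols_estimate = Sigma[X, Y] / Sigma[X, X]

# The IV ratio cov(Z, Y) / cov(Z, X) recovers b from the covariances alone.
iv_estimate = Sigma[Z, Y] / Sigma[Z, X]

print(ols_estimate, iv_estimate)
```

The ratio works because Z reaches Y only through X, so cov(Z, Y) = a·b and cov(Z, X) = a; when no such variable exists in the graph, this route to identification is unavailable.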

The authors propose a new, more general graphical condition, called the Graphical Condition for Identification (GCI). For any target variable Y, GCI seeks a set of auxiliary variables Z that satisfies two graphical requirements: (1) Z d‑separates Y from its parent set Pa(Y), thereby blocking every back‑door path that could introduce confounding; and (2) Z maintains at least one directed (forward) path to Y, ensuring that Z carries enough variation to identify the coefficients linking Pa(Y) to Y. In other words, Z must be a valid “instrument” for the whole parent set, not just for a single variable, and the condition does not rely on the bow‑free assumption (that no pair of variables is linked by both a direct effect and correlated error terms).

The main theoretical contribution is Theorem 1, which proves that if such a Z exists, the mapping from the structural parameters θ to the observable covariance matrix Σ is injective. The proof combines linear algebra (expressing the SEM as X = B X + ε, where B contains the unknown coefficients) with graph‑theoretic d‑separation arguments. By showing that the submatrix of (I − B) corresponding to Y and Pa(Y) is invertible when the two GCI requirements hold, the authors establish that the coefficients can be solved uniquely from Σ.
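For reference, the parameter-to-covariance mapping mentioned above can be written out explicitly. The symbol Ω for the covariance matrix of the error terms is notation introduced here for illustration; the summary itself only names B, ε, θ, and Σ.

```latex
% Structural form and its solution (I - B is invertible in a recursive model,
% since B is strictly triangular under a topological ordering of the DAG):
X = B X + \varepsilon
\quad\Longrightarrow\quad
X = (I - B)^{-1} \varepsilon

% Implied covariance matrix, with \Omega = \operatorname{Cov}(\varepsilon):
\Sigma(\theta) = (I - B)^{-1} \, \Omega \, (I - B)^{-\top}

% Identification is injectivity of the map \theta \mapsto \Sigma(\theta):
\Sigma(\theta) = \Sigma(\theta') \;\Longrightarrow\; \theta = \theta'
```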

To make GCI practical, the paper presents Algorithm 1, an automated procedure that searches the DAG for a suitable Z for each endogenous variable. The algorithm proceeds in three stages: (i) generate candidate subsets of observed variables, (ii) test each candidate for the back‑door blocking property using standard d‑separation checks, and (iii) verify the forward‑path condition via a maximum‑matching computation on a bipartite representation of directed paths. The overall computational complexity is O(|V|³), which is feasible for graphs with hundreds of nodes.
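The d‑separation checks used in stage (ii) can be sketched with the standard moralization test: restrict the DAG to the ancestors of the variables involved, connect co‑parents, drop edge directions, delete the conditioning set, and test reachability. The code below is a generic textbook procedure, not the paper's Algorithm 1; the function names and the example graphs are hypothetical.

```python
from itertools import combinations

def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, xs, ys, zs):
    """True if zs d-separates xs from ys; `parents` maps child -> list of parents."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors_of(parents, xs | ys | zs)
    # Moralize the ancestral subgraph: link each node to its parents
    # and link every pair of parents that share a child.
    adj = {v: set() for v in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Undirected search from xs that never enters a conditioning node.
    start = xs - zs
    seen, stack = set(start), list(start)
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        for w in adj[v]:
            if w not in seen and w not in zs:
                seen.add(w)
                stack.append(w)
    return True

# Hypothetical DAG: W -> X, W -> Y, X -> M, M -> Y.
dag = {"X": ["W"], "M": ["X"], "Y": ["W", "M"]}
print(d_separated(dag, {"X"}, {"Y"}, {"W", "M"}))  # True: both paths are blocked
print(d_separated(dag, {"X"}, {"Y"}, set()))       # False: paths remain open

# Collider A -> C <- B: conditioning on C opens the path between A and B.
collider = {"C": ["A", "B"]}
print(d_separated(collider, {"A"}, {"B"}, set()))  # True
print(d_separated(collider, {"A"}, {"B"}, {"C"}))  # False
```

Each call is linear in the size of the moralized ancestral graph, so the cost of a search is dominated by how many candidate sets Z are examined rather than by any single check.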

Empirical evaluation covers three categories of data: (a) benchmark Bayesian networks such as Alarm and Asia, (b) synthetically generated random DAGs ranging from 50 to 200 nodes, and (c) a real‑world social‑science survey dataset. Across all experiments, GCI establishes identification for models that the back‑door, IV, and half‑trek criteria fail to identify. In large random graphs, the algorithm typically finds a Z set of size 2–3, and the resulting parameter estimates have a mean‑squared error at least 30% lower than that of the best competing method. Runtime measurements confirm that the algorithm runs in seconds for 200‑node graphs, demonstrating scalability.

The paper concludes by highlighting the broader implications of GCI. First, it provides a unified graphical test that subsumes many earlier identification criteria, allowing researchers to assess identifiability directly from the causal graph without ad‑hoc reasoning. Second, the automated search eliminates the need for manual selection of instruments or covariates, which is especially valuable in high‑dimensional applications. Third, although the current development is limited to linear recursive SEMs, the authors argue that the underlying graphical logic can be extended to non‑linear models, to models with latent variables, or even to cyclic systems with appropriate modifications. The authors outline three directions for future work: (i) generalizing GCI to models with feedback loops, (ii) handling partially observed graphs in which some edges are uncertain, and (iii) integrating Bayesian priors on the graph structure to guide the search for Z. In sum, the Graphical Condition for Identification offers a general and computationally tractable tool for guaranteeing that causal effect estimates derived from SEMs are uniquely determined by the available data and prior causal knowledge.