Causal Network Discovery from Interventional Count Data with Latent Linear DAGs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The increasing availability of interventional data offers new opportunities for causal discovery, with gene perturbation studies providing a prominent example. Such data are typically count-valued and subject to substantial measurement error arising from technical variability and latent state heterogeneity. Motivated by these challenges, we study identification and estimation in latent linear structural causal models for interventional count data. We propose a latent linear Gaussian directed acyclic graph (DAG) model with Poisson measurement error that explicitly separates the latent causal structure from the observed counts. Under a mean-shift intervention design, we establish population-level identifiability of the latent causal DAG. Building on these identification results, we develop an estimation procedure based on sparse inverse matrix estimation and provide theoretical guarantees on estimation error and finite-sample causal discovery. Simulation studies and applications to Perturb-seq data demonstrate the practical effectiveness of the proposed method.

💡 Research Summary

This paper presents a novel methodological framework for causal discovery from interventional count data, with a primary focus on applications in single-cell gene perturbation experiments such as Perturb-seq. The core challenge addressed is that such data are inherently count-valued and contaminated by substantial measurement error due to technical noise and latent cellular heterogeneity, which existing methods based on Gaussian or continuous approximations fail to handle adequately.

The authors propose a two-layer latent linear structural causal model. The observation layer models the observed RNA read count for gene j as a Poisson variable, conditional on a library size factor, observable covariates, and a latent true expression level Z_j. This explicitly separates the measurement process from the biological process. The latent layer posits that the vector of latent expression levels Z^(m) in each interventional environment m follows a linear Gaussian Structural Causal Model (SCM): Z^(m) = A Z^(m) + η^(m) + ε^(m). Here, A is a matrix of causal coefficients that is invariant across environments and whose sparsity pattern defines a Directed Acyclic Graph (DAG). The intervention is modeled as a “mean-shift” on the latent variable: η^(m) = η^(0) + α_m e_{ℓ_m}, where e_{ℓ_m} is the standard basis vector for the targeted gene ℓ_m. Only the intercept of the targeted gene is shifted, by α_m, which makes this a sparse mechanism shift. The summary presents this “soft” intervention as more realistic than the “hard” do-interventions often assumed.
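
To make the two-layer structure concrete, here is a minimal simulation sketch in Python. It is not the authors' code. Several pieces are assumptions that the summary does not specify: the log link between Z and the Poisson rate, the uniform library size factors, the independent latent noise (the paper also allows correlated noise), the omission of covariates, and all names and numeric values.

```python
import numpy as np

def simulate_environment(A, eta0, n_cells, rng, target=None, alpha=0.0, noise_sd=0.3):
    """Draw counts for one environment of a hypothetical two-layer model."""
    p = A.shape[0]
    eta = eta0.copy()
    if target is not None:
        eta[target] += alpha  # mean shift: eta^(m) = eta^(0) + alpha_m * e_{l_m}
    # Independent Gaussian latent noise (a simplification chosen for this sketch).
    eps = rng.normal(0.0, noise_sd, size=(n_cells, p))
    # Z = A Z + eta + eps  is equivalent to  Z = (I - A)^{-1} (eta + eps).
    Z = np.linalg.solve(np.eye(p) - A, (eta + eps).T).T
    # Assumed library size factors and assumed log link for the Poisson rate.
    lib = rng.uniform(0.5, 1.5, size=(n_cells, 1))
    Y = rng.poisson(lib * np.exp(Z))
    return Y, Z

# A[j, k] is the effect of gene k on gene j; the chain is 0 -> 1 -> 2.
A = np.array([[0.0,  0.0, 0.0],
              [0.8,  0.0, 0.0],
              [0.0, -0.5, 0.0]])
eta0 = np.array([1.0, 0.5, 0.2])
rng = np.random.default_rng(0)

Y_obs, Z_obs = simulate_environment(A, eta0, 5000, rng)
Y_int, Z_int = simulate_environment(A, eta0, 5000, rng, target=0, alpha=1.0)

# The shift in latent means propagates downstream along the DAG.
shift = Z_int.mean(axis=0) - Z_obs.mean(axis=0)
```

In this toy setup, shifting the intercept of gene 0 by 1 should move the latent means by roughly (1, 0.8, -0.4), which is the first column of (I - A)^{-1}.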

The paper’s first major contribution is an identifiability theorem. It proves that under this model, provided each node is subject to at least one non-vanishing mean-shift intervention, the latent causal mechanism matrix A (and hence the DAG) is identifiable at the population level by leveraging differences in latent means across environments. This identifiability holds even in the presence of latent confounding (non-diagonal Cov(ε^(m))) and does not rely on the restrictive causal faithfulness assumption.
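
The role of mean differences can be illustrated with a short population-level calculation. The sketch below is one way to see why identification is plausible under the stated model, not a reproduction of the paper's proof. It assumes zero-mean noise, exactly one intervention environment per node, and latent means that are known exactly. Under those assumptions μ^(m) = (I - A)^{-1} η^(m), so each mean difference is α_m times a column of (I - A)^{-1}, and the unit diagonal of I - A fixes the unknown scale. The function name and the example values are hypothetical.

```python
import numpy as np

def recover_A_from_mean_shifts(mu0, mu_int):
    """Recover A and the shift sizes from exact latent means.

    mu0 is the observational latent mean; mu_int[:, l] is the latent mean
    in the environment whose intervention target is node l.
    """
    D = mu_int - mu0[:, None]      # column l equals alpha_l * (I - A)^{-1} e_l
    D_inv = np.linalg.inv(D)       # equals diag(1 / alpha) @ (I - A)
    scale = np.diag(D_inv)         # 1 / alpha_l, since I - A has a unit diagonal
    I_minus_A = D_inv / scale[:, None]
    A = np.eye(len(mu0)) - I_minus_A
    return A, 1.0 / scale

# Hypothetical ground truth used only to generate exact means.
A_true = np.array([[0.0,  0.0, 0.0],
                   [0.8,  0.0, 0.0],
                   [0.0, -0.5, 0.0]])
eta0 = np.array([1.0, 0.5, 0.2])
alpha_true = np.array([1.0, -2.0, 0.5])

B = np.linalg.inv(np.eye(3) - A_true)
mu0 = B @ eta0
mu_int = np.column_stack([B @ (eta0 + alpha_true[l] * np.eye(3)[l]) for l in range(3)])

A_hat, alpha_hat = recover_A_from_mean_shifts(mu0, mu_int)
```

The calculation never uses Cov(ε^(m)), which is consistent with the claim that latent confounding does not break identification. It also shows why every node needs a non-vanishing shift: a zero α_ℓ would make D singular.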

Building on this, the second contribution is a practical estimation procedure. It involves first estimating the latent environment-specific means μ^(m) and covariances from the observed Poisson counts using moment relationships. Then, it formulates the estimation of A as a constrained optimization problem: minimizing the difference between the estimated and model-implied precision matrices under sparsity and DAG (acyclicity) constraints. This non-convex problem is solved efficiently using the Alternating Direction Method of Multipliers (ADMM).
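
The first step, recovering latent moments from counts, can be sketched with the standard Poisson log-normal moment identities. This is an illustration under an assumed log link, Y_j | Z ~ Poisson(s · exp(Z_j)) with Z ~ N(μ, Σ) and known library sizes s. The summary does not state the paper's exact moment relationships, so the function below is hypothetical. It relies on E[Y_j / s] = exp(μ_j + Σ_jj / 2), E[Y_j (Y_j - 1) / s²] = exp(2 μ_j + 2 Σ_jj), and E[Y_j Y_k / s²] = exp(μ_j + μ_k + (Σ_jj + Σ_kk) / 2 + Σ_jk) for j ≠ k.

```python
import numpy as np

def poisson_lognormal_moments(Y, lib):
    """Method-of-moments estimates of latent mean and covariance.

    Y is an (n, p) count matrix; lib is an (n, 1) array of library sizes.
    """
    n = Y.shape[0]
    Yn = Y / lib
    m1 = Yn.mean(axis=0)                     # estimates exp(mu_j + Sigma_jj / 2)
    m2 = (Yn.T @ Yn) / n                     # estimates E[Y_j Y_k / s^2] for j != k
    # On the diagonal, use the factorial moment to remove the Poisson variance.
    m2[np.diag_indices_from(m2)] = (Y * (Y - 1) / lib**2).mean(axis=0)
    Sigma = np.log(m2) - np.log(np.outer(m1, m1))
    mu = np.log(m1) - 0.5 * np.diag(Sigma)
    return mu, Sigma

# Synthetic check with hypothetical parameter values.
rng = np.random.default_rng(1)
mu_true = np.array([1.0, 2.0])
Sigma_true = np.array([[0.3, 0.1],
                       [0.1, 0.2]])
n = 100_000
Z = rng.multivariate_normal(mu_true, Sigma_true, size=n)
lib = rng.uniform(0.8, 1.2, size=(n, 1))
Y = rng.poisson(lib * np.exp(Z))

mu_hat, Sigma_hat = poisson_lognormal_moments(Y, lib)
```

The logarithms require strictly positive sample moments, so very sparse genes would need extra care in practice. The sparse, acyclicity-constrained ADMM step that follows is not sketched here because the summary gives too little detail to reproduce it faithfully.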

The third contribution is a theoretical analysis providing non-asymptotic guarantees. The authors derive a Frobenius norm error bound for the estimator Â and, under a beta-min condition on the minimum edge strength and intervention strength, characterize the finite-sample probability of exactly recovering the true DAG skeleton and directions.
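
The intuition behind a beta-min condition can be shown with a small sketch. It uses the standard argument that if every entry of Â lies within β/2 of the truth, where β is the smallest nonzero |A_jk|, then thresholding at β/2 recovers the exact support. The thresholding rule, the function names, and the numbers are assumptions for illustration, not the paper's procedure or its bound.

```python
import numpy as np

def threshold_support(A_hat, tau):
    """Return the 0/1 edge indicator of entries larger than tau in magnitude."""
    return (np.abs(A_hat) > tau).astype(int)

def is_acyclic(S):
    """A 0/1 adjacency matrix on p nodes is a DAG iff S^p is the zero matrix.

    Suitable for small p only; walk counts can overflow for large graphs.
    """
    p = S.shape[0]
    return not np.linalg.matrix_power(S, p).any()

A_true = np.array([[0.0,  0.0, 0.0],
                   [0.8,  0.0, 0.0],
                   [0.0, -0.5, 0.0]])
beta = np.abs(A_true[A_true != 0]).min()     # minimum edge strength, here 0.5

# A hypothetical estimate whose entrywise error (0.1) is below beta / 2.
error = 0.1 * np.array([[ 1, -1,  1],
                        [-1,  1, -1],
                        [ 1,  1, -1]])
A_hat = A_true + error

support_hat = threshold_support(A_hat, beta / 2)
support_true = (A_true != 0).astype(int)
recovered = np.array_equal(support_hat, support_true) and is_acyclic(support_hat)
```

With weaker edges or a larger estimation error the threshold can no longer separate true edges from noise, which is why exact recovery guarantees are stated in terms of the minimum edge strength.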

Simulation studies show that the proposed method outperforms existing observational methods and interventional methods designed for continuous data, especially under high measurement error and latent confounding. An application to real Perturb-seq data on immune response genes recovers a causal network with biologically plausible pathways, which supports the method’s practical utility.

In summary, this work provides a principled and theoretically grounded framework for causal discovery from interventional count data that directly addresses the measurement error and latent state challenges pervasive in modern biological data like Perturb-seq, bridging a significant gap between causal methodology and practical data analysis.
