Confounder-robust causal discovery and inference in Perturb-seq using proxy and instrumental variables
Emerging single-cell technologies that integrate CRISPR-based genetic perturbations with single-cell RNA sequencing, such as Perturb-seq, have substantially advanced our understanding of gene regulation and the causal influence of genes. While Perturb-seq data provide valuable causal insights into gene-gene interactions, statistical concerns remain regarding unobserved confounders that may bias inference. These latent factors may arise not only from intrinsic molecular features of regulatory elements encoded in Perturb-seq experiments, but also from genes left unobserved by cost-constrained experimental designs. Although methods for analyzing large-scale Perturb-seq data are rapidly maturing, approaches that explicitly account for such unobserved confounders when learning causal gene networks are still lacking. Here, we propose a novel method to recover causal gene networks from Perturb-seq experiments with robustness to arbitrarily omitted confounders. Our framework leverages proxy and instrumental variable strategies to exploit the rich information embedded in perturbations, enabling unbiased estimation of the underlying directed acyclic graph (DAG) of gene expressions. Simulation studies and analyses of CRISPR interference experiments in K562 cells demonstrate that our method outperforms baseline approaches that ignore unmeasured confounding, yielding more accurate and biologically relevant recovery of the true gene causal DAGs.
💡 Research Summary
The paper introduces ARGEN (Arbitrary‑confounder Robust causal Gene Network), a statistical framework designed to infer directed gene‑gene regulatory networks (DAGs) from Perturb‑seq experiments while remaining robust to arbitrarily omitted confounders. Perturb‑seq combines CRISPR‑based perturbations with single‑cell RNA‑sequencing, offering interventional data that, in principle, enable causal discovery. However, existing DAG‑learning methods (PC, GES, NOTEARS, GIES, etc.) assume that all relevant variables are observed and that interventions are perfectly measured. Both assumptions are violated in typical Perturb‑seq studies because of hidden biological factors (cell‑cycle stage, chromatin accessibility, unmeasured genes) and technical covariates (sequencing depth, batch effects).
ARGEN tackles this problem by exploiting two features inherent to Perturb‑seq. (1) The guide RNA (gRNA) detection indicator serves as an instrumental variable (IV): because gRNA assignment is random and independent of technical covariates, it is taken to satisfy the relevance and exclusion‑restriction conditions required of an IV, providing a source of exogenous variation for the perturbed gene. (2) Unobserved biological confounders are approximated by proxy variables constructed from the expression of other genes that share the same gRNA perturbation. By aggregating these co‑perturbed expression profiles, ARGEN creates a surrogate for the latent confounder, allowing the model to adjust for hidden bias.
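The instrumental-variable idea can be illustrated with a toy simulation. The sketch below is not ARGEN's estimator: it uses the textbook ratio (Wald) estimator for a binary instrument, and the effect sizes, noise model, and function names are all assumptions chosen for illustration. It shows that a randomized gRNA indicator recovers the effect of a perturbed gene on a downstream gene even when a hidden factor drives both, whereas a naive regression slope does not.

```python
import numpy as np

def wald_iv_estimate(d, x, y):
    """Ratio (Wald) estimator of the effect of x on y using a binary instrument d."""
    d = np.asarray(d, dtype=bool)
    effect_on_y = y[d].mean() - y[~d].mean()   # instrument -> outcome
    effect_on_x = x[d].mean() - x[~d].mean()   # instrument -> exposure
    return float(effect_on_y / effect_on_x)

def ols_slope(x, y):
    """Naive regression slope of y on x, ignoring any confounding."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

# Hypothetical toy data: all numbers below are illustrative assumptions.
rng = np.random.default_rng(0)
n = 20_000
true_effect = 0.8
u = rng.normal(size=n)                    # hidden confounder, never observed
d = rng.integers(0, 2, size=n)            # randomized gRNA indicator (instrument)
x = -1.5 * d + u + rng.normal(size=n)     # perturbed gene; knockdown lowers it
y = true_effect * x + u + rng.normal(size=n)  # downstream gene

naive = ols_slope(x, y)
iv = wald_iv_estimate(d, x, y)
print(f"naive slope: {naive:.3f}, IV estimate: {iv:.3f}, truth: {true_effect}")
```

Because `u` enters both `x` and `y`, the naive slope is pushed away from the true effect, while the instrument-based ratio only uses variation in `x` that is induced by the randomized indicator.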
The authors extend the standard measurement model for scRNA‑seq counts (Poisson or Negative Binomial) to incorporate a structural equation model (SEM) that reflects the underlying regulatory DAG and the perturbation effects. For each gene i, the logarithm of the true expression level μ_i follows
log μ_i = ∑_{j∈Pa(i)} f_{ij}(μ_j) + β_i D_i + γ_i X + ε_i,
where D_i is the binary gRNA indicator, X denotes observed technical covariates, and ε_i captures unobserved confounding.
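To make the structural equation concrete, here is a minimal simulation sketch. It assumes a log-linear special case, f_{ij}(μ_j) = θ_{ij} log μ_j, a scalar technical covariate X, and Poisson measurement noise. The function name, the coefficient values, and the way ε is generated from a shared latent factor are all hypothetical choices, not taken from the paper.

```python
import numpy as np

def simulate_log_expression(theta, beta, gamma, D, X, eps):
    """Compute log mu for every cell under a log-linear special case of the SEM.

    theta[i, j] is the assumed coefficient of parent j on gene i. theta must be
    strictly lower triangular, so genes 0..p-1 are already in topological order.
    """
    n, p = D.shape
    log_mu = np.zeros((n, p))
    for i in range(p):
        parent_term = log_mu[:, :i] @ theta[i, :i]   # sum over parents of gene i
        log_mu[:, i] = parent_term + beta[i] * D[:, i] + gamma[i] * X + eps[:, i]
    return log_mu

# Hypothetical three-gene chain 0 -> 1 -> 2; every value is illustrative.
rng = np.random.default_rng(1)
n, p = 500, 3
theta = np.array([[0.0, 0.0, 0.0],
                  [0.7, 0.0, 0.0],
                  [0.0, -0.5, 0.0]])
beta = np.full(p, -1.0)                    # knockdown lowers log expression
gamma = np.full(p, 0.2)                    # effect of the technical covariate
target = rng.integers(-1, p, size=n)       # -1 marks a non-targeting control cell
D = (target[:, None] == np.arange(p)).astype(float)  # one-hot gRNA indicators
X = rng.normal(size=n)                     # e.g. centered log sequencing depth
u = rng.normal(size=n)                     # shared latent factor (the confounder)
eps = 0.5 * u[:, None] + 0.3 * rng.normal(size=(n, p))

log_mu = simulate_log_expression(theta, beta, gamma, D, X, eps)
counts = rng.poisson(np.exp(log_mu))       # observed scRNA-seq counts
```

The shared factor `u` inside `eps` is what makes the error terms of different genes correlated, which is the situation the proxy and IV adjustments are meant to handle.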
Identifiability is rigorously established through three theorems. Theorem 1 (non‑degenerate direct effect) guarantees that, under the assumption that every gene has a non‑zero causal effect on its descendants and a non‑zero intervention effect, the ancestor and descendant sets of each node are identifiable. The proof introduces the concept of an “exclusive directed path,” a unique directed path between two nodes, and shows that such paths exist for any pair of nodes in the true subgraph. Theorem 2 leverages exclusive paths to recover the full topological ordering of the DAG once the ancestor information is known. Theorem 3 combines proxy‑adjustment and IV reasoning to demonstrate that the parent set of each gene can be identified via a quasi‑maximum likelihood estimation (QMLE) problem, even when hidden confounders are present.
Algorithmically, ARGEN proceeds in two stages. First, descendant sets are estimated by testing conditional mean relationships (Equation 4 of the paper) across all genes, yielding a collection of descendant candidates for each node. These are passed to the paper's Algorithm 1, which constructs a topological order while enforcing acyclicity. Second, given the order, parent sets are recovered: (i) proxy variables η_i are constructed from co‑perturbed genes; (ii) a QMLE is solved to obtain estimates of β_i (intervention strength) and θ_{ij} (edge coefficients); (iii) p‑values for θ_{ij} are adjusted using an online false discovery rate (FDR) procedure (Zrnic et al., 2020) across all potential parent‑child pairs. The final DAG consists of edges that survive the FDR threshold and respect the previously determined ordering.
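The first stage, turning descendant sets into a topological order, can be sketched as follows. This is not the paper's Algorithm 1, whose details are not given here. It is a simple stand-in that relies on one fact: in a DAG, every descendant of a node has strictly fewer descendants than the node itself, so sorting by descendant count (largest first) yields a valid order when the sets are exact. The function name and the gene labels are hypothetical.

```python
def order_from_descendants(descendants):
    """Return a topological order given each node's full set of descendants.

    descendants maps a node to the set of all nodes reachable from it.
    Raises ValueError if the sets cannot come from a DAG (self or mutual descent).
    """
    for node, desc in descendants.items():
        if node in desc:
            raise ValueError(f"{node} is listed as its own descendant")
        for child in desc:
            if node in descendants.get(child, set()):
                raise ValueError(f"{node} and {child} are mutual descendants")
    # More descendants means earlier in the order; ties are broken by name
    # so that the output is deterministic.
    return sorted(descendants, key=lambda g: (-len(descendants[g]), g))

# Hypothetical diamond-shaped network: A -> B -> D and A -> C -> D.
descendant_sets = {
    "A": {"B", "C", "D"},
    "B": {"D"},
    "C": {"D"},
    "D": set(),
}
order = order_from_descendants(descendant_sets)
print(order)
```

With estimated rather than exact descendant sets, the counts can be inconsistent, so a practical procedure would need an additional step to resolve conflicts; this sketch only rejects the most obvious violations.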
Extensive simulations explore varying levels of hidden confounding, sample sizes, and numbers of perturbed genes. Compared to baseline methods that ignore confounding, ARGEN achieves substantially lower structural Hamming distance (SHD), higher area under the precision‑recall curve, and better recovery of true edges, especially when confounding is strong.
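For readers unfamiliar with the structural Hamming distance, the sketch below shows one common way to compute it from adjacency matrices. It assumes the convention that entry [i, j] equals 1 when there is an edge i → j, and it counts a reversed edge as a single error. Other SHD variants count a reversal as two errors, and the summary does not say which one the authors use.

```python
import numpy as np

def structural_hamming_distance(true_adj, est_adj):
    """Count missing, extra, and reversed edges between two directed graphs.

    Both inputs are square 0/1 matrices where entry [i, j] == 1 means i -> j.
    A reversed edge is counted once rather than as one deletion plus one insertion.
    """
    true_adj = np.asarray(true_adj, dtype=bool)
    est_adj = np.asarray(est_adj, dtype=bool)
    mismatches = true_adj != est_adj
    # i -> j in the truth, j -> i in the estimate, and no edge the other way
    # in either graph: this shows up as two mismatched entries, so subtract one.
    reversed_edges = true_adj & est_adj.T & ~true_adj.T & ~est_adj
    return int(mismatches.sum() - reversed_edges.sum())

# Hypothetical example: truth is 0 -> 1 -> 2.
truth = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
# Estimate reverses 0 -> 1, keeps 1 -> 2, and adds a spurious 0 -> 2.
estimate = np.array([[0, 0, 1],
                     [1, 0, 1],
                     [0, 0, 0]])
shd = structural_hamming_distance(truth, estimate)
print(shd)  # 2: one reversed edge plus one extra edge
```

A lower value means the estimated graph is closer to the reference graph, with 0 indicating an exact match.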
In a real‑world application, the authors analyze a Perturb‑seq dataset from K562 cells (≈5,000 cells, ≈120 gRNAs targeting ≈100 genes). ARGEN successfully recovers known transcription factor–target relationships (e.g., GATA1 → HBB, MYC → CCND1) with high precision, and uncovers novel candidate edges that are biologically plausible. Importantly, methods that ignore hidden confounders either add spurious edges or miss key regulatory links, whereas ARGEN’s confounder‑robust approach yields a network that aligns more closely with prior knowledge and functional assays.
Limitations are acknowledged: (1) proxy construction requires a sufficient number of co‑perturbed genes; sparse perturbation designs may weaken identifiability; (2) the current implementation assumes linear or log‑linear functional forms for f_{ij}, potentially missing complex non‑linear regulatory effects; (3) solving the QMLE can be computationally intensive for very large gene sets. Future work will explore multi‑instrument extensions, Bayesian modeling of latent confounders, and non‑linear SEMs using neural networks.
Overall, ARGEN is presented as the first framework that explicitly leverages the instrumental nature of CRISPR perturbations and proxy variables to deliver confounder‑robust causal network inference from Perturb‑seq data, giving the single‑cell genomics community a tool for moving beyond association maps toward reliable mechanistic insights.