Latent Variable Causal Discovery under Selection Bias
Addressing selection bias in latent variable causal discovery is important yet underexplored, largely due to a lack of suitable statistical tools: while various tools beyond basic conditional independencies have been developed to handle latent variables, none have been adapted for selection bias. We take a first step by studying rank constraints, which, as a generalization of conditional independence constraints, exploit the ranks of covariance submatrices in linear Gaussian models. We show that although selection can significantly complicate the joint distribution, the ranks of submatrices of the biased covariance matrix still preserve meaningful information about both the causal structure and the selection mechanism. We provide a graph-theoretic characterization of these rank constraints. Using this tool, we demonstrate that the one-factor model, a classical latent variable model, remains identifiable under selection bias. Simulations and real-world experiments confirm the effectiveness of our rank constraints.
💡 Research Summary
This paper tackles the significant yet underexplored challenge of performing latent variable causal discovery in the presence of selection bias. Selection bias, where data is not randomly sampled from the population (e.g., survey respondents self-selecting based on traits), can severely distort inferred causal relationships. While tools like the FCI algorithm can handle both latent confounding and selection bias using conditional independence (CI) tests, they are limited to discovering relationships among observed variables and cannot identify causal structures between latent variables.
The authors bridge this gap by developing a novel statistical tool: Generalized Rank Constraints. Their work builds on rank constraints from prior literature, which, under linear Gaussian assumptions, detect low-rank patterns in covariance submatrices to infer the presence and structure of latent variables, going beyond what CI tests can reveal. However, these existing constraints fail when selection bias is present, because the selected data distribution becomes non-Gaussian and analytically complex.
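To make the classical (selection-free) rank constraint concrete, here is a minimal simulation sketch, not the paper's code: in a one-factor linear Gaussian model, every cross-covariance factors through the latent variable, so the "tetrad" difference over two disjoint pairs of indicators vanishes. All variable names and loading values below are illustrative assumptions.

```python
# Classical rank constraint in a one-factor linear Gaussian model:
# X_i = lambda_i * L + eps_i, so sigma_ij = lambda_i * lambda_j for i != j,
# and any 2x2 cross-covariance submatrix over disjoint pairs has rank 1.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
loadings = np.array([1.0, 0.8, 1.2, 0.6])  # lambda_i, chosen arbitrarily

L = rng.standard_normal(n)                  # single latent common cause
X = np.outer(L, loadings) + 0.5 * rng.standard_normal((n, 4))

S = np.cov(X, rowvar=False)
# Tetrad difference sigma_13 * sigma_24 - sigma_14 * sigma_23 (0-indexed
# below); it vanishes in the population because the submatrix is rank 1.
tetrad = S[0, 2] * S[1, 3] - S[0, 3] * S[1, 2]
print(f"tetrad difference: {tetrad:.4f}")   # close to 0 up to sampling noise
```

In practice such constraints are tested statistically rather than checked for exact zero, since the sample covariance carries noise of order 1/√n.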
The key innovation is proving that, even under selection bias, low-rank patterns in the biased covariance matrix still preserve meaningful structural information. The authors assume a “linear selection mechanism,” where the probability of a sample being selected depends on a linear function of the observed (and possibly latent) variables. Under this assumption, they prove that the rank of any submatrix of the biased covariance matrix Σ′ is determined by a graphical criterion on an “augmented graph” that contains both the original causal DAG and the selection mechanism. This criterion generalizes the t-separation rule used for rank constraints in the absence of selection.
A compelling example is the “Inverse Tetrad” structure: four independent observed variables are truncated based on their linear sum. The selected data no longer follow a linear Gaussian model, yet their covariance matrix exhibits the same low-rank (rank-1) patterns across variable pairs as the classical “Tetrad” structure with a single latent common cause. The selection mechanism itself creates a dimensional bottleneck, leaving a signature detectable via rank constraints.
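The Inverse Tetrad can be reproduced in a few lines; the sketch below is an illustrative toy under the assumptions just described (hard truncation on the sum, threshold chosen arbitrarily), not the paper's implementation. Selection on the sum induces equal negative pairwise covariances, so the tetrad difference still vanishes even though no latent common cause exists.

```python
# "Inverse Tetrad": four mutually independent Gaussians, with samples kept
# only when their linear sum exceeds a threshold. The selected data are no
# longer Gaussian, yet their covariance shows the same rank-1 signature as
# a one-factor model.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2_000_000, 4))  # independent before selection

selected = X[X.sum(axis=1) > 2.0]        # linear selection: keep sum > 2
S = np.cov(selected, rowvar=False)

# Selection induces identical negative covariances between all pairs, so
# sigma_13 * sigma_24 - sigma_14 * sigma_23 vanishes in the population.
tetrad = S[0, 2] * S[1, 3] - S[0, 3] * S[1, 2]
print(f"pairwise covariance ~ {S[0, 2]:.3f}, tetrad difference: {tetrad:.4f}")
```

The pairwise covariances come out clearly nonzero (around −0.2 with this threshold), so the variables look dependent, as a common-cause model would predict, while the rank-1 pattern is exactly the signature the generalized rank constraints are designed to detect.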
The paper applies this new tool to establish the identifiability of the One-Factor Model (a single latent variable influencing multiple observed indicators) under selection bias, a problem previously unsolved. Finally, simulations and a real-world experiment on estimating latent political orientation from survey data demonstrate the practical effectiveness of the proposed generalized rank constraints. This work provides the first formal tool beyond CI constraints that can handle both latent variables and selection bias simultaneously, paving the way for more powerful and realistic causal discovery algorithms.