Weighting-Based Identification and Estimation in Graphical Models of Missing Data

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We propose a constructive algorithm for identifying complete data distributions in graphical models of missing data. The complete data distribution is unrestricted, while the missingness mechanism is assumed to factorize according to a conditional directed acyclic graph. Our approach follows an interventionist perspective in which missingness indicators are treated as variables that can be intervened on. A central challenge in this setting is that sequences of interventions on missingness indicators may induce and propagate selection bias, so that identification can fail even when a propensity score is invariant to available interventions. To address this challenge, we introduce a tree-based identification algorithm that explicitly tracks the creation and propagation of selection bias and determines whether it can be avoided through admissible intervention strategies. The resulting tree provides both a diagnostic and a constructive characterization of identifiability under a given missingness mechanism. Building on these results, we develop recursive inverse probability weighting procedures that mirror the intervention logic of the identification algorithm, yielding valid estimating equations for both the missingness mechanism and functionals of the complete data distribution. Simulation studies and a real-data application illustrate the practical performance of the proposed methods. An accompanying R package, flexMissing, implements all proposed procedures.


💡 Research Summary

This paper tackles the problem of recovering the full-data distribution under missing-not-at-random (MNAR) mechanisms by exploiting graphical models. The authors assume that the missingness mechanism p(R|X) factorizes according to a conditional directed acyclic graph (DAG) G, while placing no restrictions on the target distribution p(X). Missingness indicators Rₖ are treated as intervention nodes: setting Rₖ=1 corresponds to a do-operation that removes the incoming edges into Rₖ, and under Rₖ=1 the observed proxy of Xₖ coincides with the underlying variable Xₖ.
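The graph surgery behind such a do-operation can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the parent-dictionary encoding, the node names `X1`, `X2`, `R1`, `R2`, and the helper `do_indicator` are all hypothetical and are not taken from the paper or from flexMissing.

```python
from typing import Dict, Set

# Assumed encoding: each node maps to the set of its parents.
Graph = Dict[str, Set[str]]


def do_indicator(parents: Graph, indicator: str) -> Graph:
    """Return a copy of the graph in which every edge into `indicator` is removed.

    This mimics do(R_k = 1) at the level of the graph only; it does not
    touch any probability distribution.
    """
    if indicator not in parents:
        raise KeyError(f"unknown node: {indicator}")
    mutilated = {node: set(pa) for node, pa in parents.items()}
    mutilated[indicator] = set()  # cut all incoming edges
    return mutilated


# Hypothetical toy missingness DAG: X2 -> R1, X1 -> R2, R1 -> R2.
toy = {"X1": set(), "X2": set(), "R1": {"X2"}, "R2": {"X1", "R1"}}
after = do_indicator(toy, "R1")
```

The input graph is copied rather than modified, so several intervention sequences can be explored from the same starting graph.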

A central difficulty is that sequences of such interventions can create and propagate selection bias, which may block identification even when propensity scores are invariant to available interventions. To formalize the sources of bias, the authors introduce for each indicator Rₖ the following sets: (i) Sₓₖ, the “counterfactual‑induced” selection set containing indicators whose presence is forced by missing parents of Rₖ; (ii) Sᵣₖ, the “indicator‑induced” set of variables that must be evaluated at R=1 to use the propensity score for recovering p(X); (iii) Sₖ=Sₓₖ∪Sᵣₖ, the overall selection set; and (iv) Rₚₖ=Sₓₖ∩de_G(Rₖ), the problematic set of descendants that interfere with associative arguments.
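One simplified, one-step reading of these definitions can be sketched in Python. The encoding is an assumption made for illustration: nodes are named `X1`, `R1`, and so on, Sₓₖ is taken to be the indicators of the X-parents of Rₖ, and Sᵣₖ is taken to be the indicator parents of Rₖ. The paper's own definitions may be recursive or differ in detail, and the functions `descendants` and `selection_sets` are hypothetical names.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # assumed encoding: node -> set of parents


def descendants(parents: Graph, node: str) -> Set[str]:
    """All nodes reachable from `node` along directed edges."""
    children: Dict[str, Set[str]] = {n: set() for n in parents}
    for child, pas in parents.items():
        for pa in pas:
            children.setdefault(pa, set()).add(child)
    seen: Set[str] = set()
    stack = list(children.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(children.get(cur, ()))
    return seen


def selection_sets(parents: Graph, k: int) -> Dict[str, Set[str]]:
    """One-step reading of S_x, S_r, S and R_p for the indicator R_k."""
    r_k = f"R{k}"
    # Indicators that must equal 1 so that the X-parents of R_k are observed.
    s_x = {"R" + pa[1:] for pa in parents[r_k] if pa.startswith("X")}
    # Indicator parents of R_k, to be evaluated at 1.
    s_r = {pa for pa in parents[r_k] if pa.startswith("R")}
    # Problematic set: selected indicators that are also descendants of R_k.
    r_p = s_x & descendants(parents, r_k)
    return {"S_x": s_x, "S_r": s_r, "S": s_x | s_r, "R_p": r_p}


# Hypothetical toy DAG: X2 -> R1 -> R2 -> R3 and X1 -> R3.
toy = {
    "X1": set(), "X2": set(), "X3": set(),
    "R1": {"X2"}, "R2": {"R1"}, "R3": {"X1", "R2"},
}
sets_r1 = selection_sets(toy, 1)
sets_r3 = selection_sets(toy, 3)
```

In this toy graph, observing the parent X2 of R1 requires R2=1, yet R2 is a child of R1, so R2 lands in the problematic set for R1. R3 has no descendants, so its problematic set is empty.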

The paper distinguishes two notions of irrelevance. Associative irrelevance follows from the local Markov property of the missingness DAG: given its parents, Rₖ is independent of its non-descendants, so conditioning on non-descendant, non-parent indicators does not induce bias. When this fails because a descendant of Rₖ lies in Sₓₖ (that is, Rₚₖ is non-empty), causal irrelevance (also called autonomy or stability) is invoked: the propensity score πₖ is invariant to interventions on any other indicator. Consequently, one may intervene on the problematic descendants Rⱼ∈Rₚₖ to break the offending paths, provided that such interventions do not themselves generate prohibitive selection bias.

To operationalize these ideas, the authors propose a tree‑based identification algorithm. The root node represents the observed data law p(X*,R). Each node is labeled by a set of interventions R* already performed. Expanding a node consists of selecting an admissible indicator Rⱼ (e.g., from Rₚₖ) and applying do(Rⱼ=1). The algorithm then rewrites the target propensity score πₖ in terms of the post‑intervention distribution, checks whether the required conditioning variables are available, and determines whether selection bias has been introduced. If a branch leads to a representation of πₖ that is free of bias, the branch is retained; otherwise it is pruned. The resulting identification tree not only yields a binary verdict of identifiability but also provides explicit formulas for each πₖ, the necessary evaluation sets Sᵣₖ, and the admissible intervention sequence that achieves identification. When full identification of p(X) fails, the tree can still be queried to identify which subsets of propensity scores suffice for particular functionals θ(p(X)).
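The expand-and-prune pattern described above can be illustrated with a generic depth-first search. This sketch is not the paper's algorithm: the actual admissibility and bias checks are replaced by two caller-supplied predicates, `is_admissible` and `is_bias_free`, which are hypothetical placeholders, and the toy rules at the bottom are invented purely to exercise the search.

```python
from typing import Callable, FrozenSet, Iterable, List, Optional, Set, Tuple

State = FrozenSet[str]  # set of indicators already intervened on


def find_intervention_sequence(
    candidates: Iterable[str],
    is_admissible: Callable[[State, str], bool],
    is_bias_free: Callable[[State], bool],
) -> Optional[Tuple[str, ...]]:
    """Depth-first search for an intervention sequence ending in a bias-free state.

    Returns the first sequence found, or None if every branch is pruned.
    """
    candidates = tuple(candidates)
    start: State = frozenset()
    stack: List[Tuple[State, Tuple[str, ...]]] = [(start, ())]
    seen: Set[State] = {start}
    while stack:
        done, path = stack.pop()
        if is_bias_free(done):
            return path
        for r in candidates:
            if r in done or not is_admissible(done, r):
                continue  # this branch is pruned
            nxt = done | {r}
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + (r,)))
    return None


# Invented toy rules: R3 may only be intervened on after R2,
# and the target is bias-free once both have been set.
def admissible(done: State, r: str) -> bool:
    return r == "R2" or "R2" in done


def bias_free(done: State) -> bool:
    return {"R2", "R3"} <= done


sequence = find_intervention_sequence(["R2", "R3"], admissible, bias_free)
no_sequence = find_intervention_sequence(["R2"], lambda d, r: False, lambda d: False)
```

Because the predicates here depend only on the set of interventions already performed, states are deduplicated by that set. A check that depends on the order of interventions would need the full path as the state.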

Having obtained identified propensity scores, the authors construct recursive inverse‑probability‑weighting (IPW) estimators that mirror the intervention logic of the tree. The key identity is
p(X)=p(X,R=1)/π(R=1|X),
where π(R=1|X)=∏ₖπₖ. Each πₖ is estimated from the observed data using the representation supplied by the tree. Because the identity involves only units with R=1, the weight attached to a complete case is the inverse of the product of its estimated πₖ; units with missing entries contribute to estimating the πₖ but receive no weight in the final average. The recursion handles cases where a πₖ depends on another πⱼ by estimating the inner scores first, exactly as the tree dictates. Under a positivity condition πₖ>σ>0, the authors prove consistency, asymptotic normality, and semiparametric efficiency of the resulting estimators. The estimating equations fit naturally into the M-estimation framework, allowing straightforward variance estimation via the sandwich formula.
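A small simulation can make the weighting identity concrete. The sketch below assumes a simple MNAR model chosen for illustration, not one taken from the paper: binary X1 and X2, with R1 depending only on X2 and R2 depending only on X1. In that graph R1 is independent of R2 given X2, so π₁ can be estimated among rows with R2=1, and symmetrically for π₂. All parameter values and function names are hypothetical, and the estimator is a plain normalized IPW mean rather than the paper's recursive procedure.

```python
import random

X1, X2, R1, R2 = 0, 1, 2, 3  # column positions within each simulated row


def simulate(n, seed=0):
    """Draw n rows (x1, x2, r1, r2) from the assumed toy MNAR model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = int(rng.random() < 0.5)
        x2 = int(rng.random() < 0.3 + 0.4 * x1)
        r1 = int(rng.random() < 0.9 - 0.5 * x2)  # R1 depends on X2 only
        r2 = int(rng.random() < 0.8 - 0.5 * x1)  # R2 depends on X1 only
        rows.append((x1, x2, r1, r2))
    return rows


def stratified_rate(rows, outcome, covariate, observed_by):
    """Estimate P(outcome=1 | covariate=v) for v in {0, 1}.

    Only rows where the covariate is observed (row[observed_by] == 1) are used.
    """
    hits, totals = [0, 0], [0, 0]
    for row in rows:
        if row[observed_by] == 1:
            v = row[covariate]
            totals[v] += 1
            hits[v] += row[outcome]
    return [hits[v] / totals[v] for v in (0, 1)]


def ipw_mean_x1(rows):
    """Normalized IPW estimate of E[X1] from complete cases."""
    pi1 = stratified_rate(rows, outcome=R1, covariate=X2, observed_by=R2)
    pi2 = stratified_rate(rows, outcome=R2, covariate=X1, observed_by=R1)
    num = den = 0.0
    for x1, x2, r1, r2 in rows:
        if r1 == 1 and r2 == 1:
            w = 1.0 / (pi1[x2] * pi2[x1])  # inverse of the product of scores
            num += w * x1
            den += w
    return num / den


def complete_case_mean_x1(rows):
    """Unweighted mean of X1 over complete cases, for comparison."""
    values = [x1 for x1, _, r1, r2 in rows if r1 == 1 and r2 == 1]
    return sum(values) / len(values)


rows = simulate(200_000)
ipw_est = ipw_mean_x1(rows)
cc_est = complete_case_mean_x1(rows)
print(f"true mean 0.5, IPW {ipw_est:.3f}, complete-case {cc_est:.3f}")
```

Under these assumed parameters the complete-case mean converges to about 0.22 rather than 0.5, because complete cases over-represent X1=0, while the weighted mean targets the true value.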

Simulation studies explore a variety of MNAR structures, including self‑masking, block‑conditional MAR, and discrete‑choice missingness. The proposed method is compared against the EM algorithm, multiple imputation, and existing causal‑identification approaches. Results show that when the missingness DAG contains complex dependencies among indicators, the tree‑guided interventions dramatically reduce bias and mean‑squared error relative to competitors. A real‑world application to a survey with non‑response on income and education variables demonstrates the practical utility of the R package flexMissing, which implements the identification tree, the recursive IPW estimator, and diagnostic tools for assessing admissible intervention strategies.

In summary, the paper makes three major contributions: (1) a novel tree‑based algorithm that explicitly tracks creation and propagation of selection bias in graphical missing‑data models and yields constructive identification rules; (2) a recursive IPW estimation framework that directly exploits the identified propensity scores while preserving the bias‑correction logic; and (3) a comprehensive theoretical and empirical validation, together with open‑source software, that bridges the gap between abstract identifiability theory and applied missing‑data analysis. The work advances the state of the art by providing both a diagnostic for identifiability and a practical, statistically sound estimation procedure for a broad class of MNAR problems.

