Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models


Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak shadow variables: they satisfy a conditional independence assumption with respect to missingness but need not meet the completeness conditions required by classical shadow-variable methods. When predictions are sufficiently informative, the bounds collapse to a point, recovering standard identification as a special case. To provide valid finite-sample coverage of the identified set, we propose a set-expansion estimator that achieves a slower-than-$\sqrt{n}$ convergence rate in the set-identified regime and the standard $\sqrt{n}$ rate under point identification. In simulations and semi-synthetic experiments on customer-service dialogues, we find that LLM predictions are often ill-conditioned for classical shadow-variable methods yet remain highly effective in our framework, shrinking identification intervals by 75–83% while maintaining valid coverage under realistic MNAR mechanisms.


💡 Research Summary

The paper tackles the classic problem of estimating population quantities—most notably the mean outcome—from data subject to missing‑not‑at‑random (MNAR) mechanisms. In MNAR settings, standard estimators are biased because the probability of observation depends on the unobserved outcome itself (e.g., users with strong opinions are more likely to respond). Traditional solutions either impose strong parametric models on the missingness mechanism or rely on auxiliary “shadow variables” that satisfy a completeness condition. Both approaches are often infeasible on modern digital platforms, where such auxiliary variables are unavailable and where large pretrained models (especially large language models, LLMs) can generate rich predictions that nevertheless fail the completeness requirement.

The authors introduce a novel concept called a weak shadow variable: any observed quantity—such as an LLM‑generated prediction—that is conditionally independent of the missingness indicator given the observed covariates but need not satisfy the stringent completeness condition required by classical shadow‑variable theory. This relaxation lets practitioners harness the predictive power of pretrained models without demanding that the predictions be a perfect proxy for the true outcome.
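
The complete-case bias under MNAR, and the sense in which an imperfect prediction can still serve as a shadow variable, can be sketched in a toy simulation. All distributions below are hypothetical illustrations, not the paper's experimental design:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: binary covariate X, binary outcome Y, and a noisy
# "LLM prediction" S of Y (the candidate weak shadow variable).
X = rng.binomial(1, 0.5, n)
Y = rng.binomial(1, 0.3 + 0.4 * X)            # P(Y = 1 | X) depends on X
S = np.where(rng.random(n) < 0.8, Y,          # S copies Y 80% of the time,
             rng.binomial(1, 0.5, n))         # otherwise a fair coin flip

# MNAR response: R depends on the outcome itself, so complete cases are
# biased.  Given (X, Y), S is pure noise and R a pure coin, so S is
# conditionally independent of R -- the shadow-variable condition.
R = rng.binomial(1, 0.2 + 0.6 * Y)            # stronger opinions respond more

truth = Y.mean()                              # ~0.5 by construction
naive = Y[R == 1].mean()                      # complete-case mean, badly biased
print(f"true mean {truth:.3f}, complete-case mean {naive:.3f}")
```

With these numbers the complete-case mean lands near 0.8 against a true mean of 0.5, because respondents are disproportionately drawn from the Y = 1 group.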

The methodological core is a partial‑identification framework that formulates the problem as two linear programs (LPs). The decision variables of the LPs represent a latent joint distribution of the full data (outcome, missingness indicator, covariates, and the shadow variable). The constraints encode (1) consistency with the observed marginal distributions, (2) the conditional independence between the weak shadow variable and the missingness indicator, and (3) basic probability axioms (non‑negativity, summing to one). Solving the first LP yields the minimum mean outcome compatible with the data; solving the second yields the maximum. The interval between these two solutions is the sharp identified set for the population mean. When the weak shadow variable is highly informative, the feasible region collapses and the identified set degenerates to a point, recovering standard point identification.
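
A minimal sketch of the two LPs, in a deliberately simplified discrete setting: no covariates, a three-level outcome Y, a binary prediction S, and hypothetical population probabilities. With only two levels of S against three of Y, completeness fails, so this is precisely the weak-shadow regime; the numbers and the `scipy.optimize.linprog` encoding are illustrative, not the paper's exact formulation:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical discrete population: Y in {0,1,2}, binary shadow S, response R.
p_y  = np.array([0.3, 0.4, 0.3])     # P(Y = y); true E[Y] = 1.0
p_r1 = np.array([0.3, 0.5, 0.8])     # P(R = 1 | Y = y): MNAR response
q_s1 = np.array([0.1, 0.6, 0.9])     # P(S = 1 | Y = y); S independent of R given Y

# Quantities the analyst can actually observe:
obs_r1  = p_y * p_r1                          # P(Y = y, R = 1): Y seen when R = 1
known   = np.dot([0, 1, 2], obs_r1)           # respondents' contribution to E[Y]
mass_r0 = p_y * (1 - p_r1)                    # P(Y = y, R = 0): NOT observed
obs_s_r0 = np.array([np.dot(1 - q_s1, mass_r0),   # P(S = 0, R = 0)
                     np.dot(q_s1,     mass_r0)])  # P(S = 1, R = 0)

# LP unknowns m[y] = P(Y = y, R = 0).  Shadow independence gives LINEAR
# constraints: sum_y P(S = s | Y = y) * m[y] = P(S = s, R = 0).
A_eq = np.vstack([1 - q_s1, q_s1])
c = np.array([0.0, 1.0, 2.0])                 # missing-stratum part of E[Y]

lo = linprog(c,  A_eq=A_eq, b_eq=obs_s_r0, bounds=(0, None))  # minimize
hi = linprog(-c, A_eq=A_eq, b_eq=obs_s_r0, bounds=(0, None))  # maximize
interval = (known + lo.fun, known - hi.fun)
print(f"identified set for E[Y]: [{interval[0]:.3f}, {interval[1]:.3f}]")

# Worst-case bounds without the shadow variable: Y can be anything in [0, 2]
# in the missing stratum.
print(f"no-shadow bounds: [{known:.3f}, {known + 2 * mass_r0.sum():.3f}]")
```

Here the no-shadow bounds span [0.68, 1.62], while the two LPs shrink the identified set to roughly [0.976, 1.050], a narrow interval around the true mean of 1.0 — tighter but not a point, since the binary S cannot fully pin down the three-level Y.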

Because the identified set is generally a set rather than a point, the authors develop a set‑expansion estimator for finite‑sample inference. Using bootstrap resampling, they estimate the sampling variability of the LP solutions and then expand the identified interval by a data‑driven margin. This construction guarantees asymptotically correct coverage: in the set‑identified regime the convergence rate is slower than $\sqrt{n}$ (approximately $n^{-1/4}$), reflecting the intrinsic difficulty of learning a set; in the point‑identified regime the estimator attains the usual $\sqrt{n}$ rate.
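
One way such a bootstrap expansion could look in code — a simplified sketch under a hypothetical discrete setup (three-level outcome, binary prediction, no covariates), not the paper's exact procedure:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, B, alpha = 20_000, 200, 0.05

# Hypothetical DGP: Y in {0,1,2}, binary prediction S, MNAR response R.
p_y, p_r1, q_s1 = [0.3, 0.4, 0.3], [0.3, 0.5, 0.8], [0.1, 0.6, 0.9]
Y = rng.choice(3, n, p=p_y)
S = rng.binomial(1, np.take(q_s1, Y))
R = rng.binomial(1, np.take(p_r1, Y))

def lp_bounds(Y, S, R):
    """Plug-in LP bounds on E[Y] from one (resampled) data set."""
    # P(S | Y) estimated from respondents; valid because S ⟂ R given Y.
    q_hat = np.array([S[(R == 1) & (Y == y)].mean() for y in range(3)])
    known = Y[R == 1].sum() / len(Y)                 # respondents' part of E[Y]
    obs_s = np.array([np.mean((R == 0) & (S == 0)),  # P(S = s, R = 0)
                      np.mean((R == 0) & (S == 1))])
    A_eq, c = np.vstack([1 - q_hat, q_hat]), np.array([0.0, 1.0, 2.0])
    # (At these sample sizes both LPs are feasible; a robust implementation
    # would add slack variables to absorb sampling noise.)
    lo = linprog(c,  A_eq=A_eq, b_eq=obs_s, bounds=(0, None))
    hi = linprog(-c, A_eq=A_eq, b_eq=obs_s, bounds=(0, None))
    return known + lo.fun, known - hi.fun

lo_hat, hi_hat = lp_bounds(Y, S, R)

# Bootstrap the endpoint deviations and expand by their (1 - alpha)-quantile.
devs = []
for _ in range(B):
    idx = rng.integers(0, n, n)
    lo_b, hi_b = lp_bounds(Y[idx], S[idx], R[idx])
    devs.append(max(lo_b - lo_hat, hi_hat - hi_b, 0.0))
eps = np.quantile(devs, 1 - alpha)
lo_exp, hi_exp = lo_hat - eps, hi_hat + eps
print(f"estimate [{lo_hat:.3f}, {hi_hat:.3f}], expanded [{lo_exp:.3f}, {hi_exp:.3f}]")
```

The expansion margin `eps` widens the plug-in interval just enough that, asymptotically, the true identified set is covered at the nominal level.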

The empirical evaluation proceeds in two stages. First, Monte‑Carlo simulations explore a variety of MNAR mechanisms (logistic response probabilities, non‑linear missingness functions) and different levels of LLM prediction accuracy. The results demonstrate that even when the LLM predictions are ill‑conditioned for classical shadow‑variable methods, the LP‑based partial‑identification approach still produces substantially tighter bounds. Second, a semi‑synthetic experiment uses real customer‑service dialogue data. An LLM is fine‑tuned to predict sentiment or satisfaction scores from the dialogue text; these scores serve as weak shadow variables. Compared with traditional shadow‑variable techniques, the proposed method shrinks the identification interval by 75–83% while maintaining nominal 95% coverage under realistic MNAR patterns.

In summary, the paper makes four key contributions: (1) it formalizes weak shadow variables, enabling the use of pretrained model outputs in MNAR analysis without demanding completeness; (2) it provides a tractable linear‑programming formulation that yields sharp identified sets; (3) it proposes a robust set‑expansion inference procedure that delivers valid finite‑sample coverage in both set‑identified and point‑identified settings; and (4) it validates the approach on both synthetic and real‑world data, showing that LLM‑derived predictions can dramatically improve identification despite their failure to meet classical assumptions. This work opens a practical pathway for social scientists, platform engineers, and policy analysts to incorporate modern AI predictions into rigorous causal inference under non‑random missingness.

