Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak shadow variables: they satisfy a conditional independence assumption with respect to missingness but need not meet the completeness conditions required by classical shadow-variable methods. When predictions are sufficiently informative, the bounds collapse to a point, recovering standard identification as a special case. To provide valid finite-sample coverage of the identified set, we propose a set-expansion estimator that achieves a slower-than-$\sqrt{n}$ convergence rate in the set-identified regime and the standard $\sqrt{n}$ rate under point identification. In simulations and semi-synthetic experiments on customer-service dialogues, we find that LLM predictions are often ill-conditioned for classical shadow-variable methods yet remain highly effective in our framework. They shrink identification intervals by 75--83\% while maintaining valid coverage under realistic MNAR mechanisms.
Missing data is pervasive in economic and social research as well as on digital platforms. In household and health surveys, respondents often skip questions perceived as sensitive or irrelevant. On digital platforms, users often choose whether or not to leave feedback based on their experiences.
As noted by Abrevaya and Donald (2017), nearly 40% of top economics papers involve missing data, and about 70% of those drop the affected observations.
In many of these settings, data is missing not at random (MNAR): the probability of observing an outcome depends on its possibly unobserved value. For example, Bollinger et al. (2019) show that earnings nonresponse is U-shaped across the earnings distribution, with respondents in both the left tail (“strugglers”) and the right tail less likely to report their earnings. Pretrained models such as LLMs can supply outcome predictions for respondents and nonrespondents alike, but these predictions are imperfect, which cautions against assuming that model predictions can perfectly substitute for human judgments. These observations suggest that LLM predictions can serve as useful auxiliary signals for tightening identification bounds, but that an approach robust to prediction imperfections is needed.
In light of both the promise and limitations of these predictions, we treat LLM-generated outputs as weak shadow variables. Specifically, we assume an exclusion-type condition: conditional on the true outcome and observed covariates, the prediction is independent of the missingness indicator. However, we do not require strong relevance or completeness conditions as in classical shadow variables (Miao and Tchetgen Tchetgen 2016); the predictions may only weakly correlate with the outcome. Even so, incorporating them introduces additional linear constraints into our identification framework, tightening the feasible region. When the predictions are sufficiently informative, the bounds may collapse to a single point, yielding point identification as a special case.
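In notation introduced here only for illustration (the paper's symbols may differ), let $Y$ denote the outcome, $X$ the observed covariates, $R$ the response indicator, and $Z$ the LLM prediction. The weak shadow variable condition is the exclusion restriction
\[
Z \perp\!\!\!\perp R \mid (Y, X),
\]
with no accompanying requirement that the conditional distribution of $Z$ given $(Y, X)$ satisfy the completeness (strong relevance) condition imposed by classical shadow-variable methods.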
We summarize our three main contributions below.
First, we propose a novel linear programming framework for partial identification under MNAR that applies both with and without auxiliary predictions. In the baseline setting without auxiliary inputs, the formulation yields closed-form solutions for the identification region of the mean outcomes. When incorporating auxiliary predictions from LLMs, we derive analytical results that quantify how these predictions tighten the feasible set and shrink the identification region. This formulation offers a unified and tractable approach to understanding how predictive signals impact identification under minimal assumptions.
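To make the construction concrete, the following is a minimal sketch of the two linear programs for a discrete outcome; the variable names (y_support, p_obs, p_miss, A, b) and the discretization are our illustrative assumptions, not the paper's implementation.

# Illustrative sketch, not the paper's implementation: bounds on E[Y] via two
# linear programs, assuming a discrete outcome with known support.
# Hypothetical inputs: y_support[k] is the k-th outcome value, p_obs[k] an
# estimate of P(Y = y_k, R = 1), p_miss an estimate of P(R = 0), and A, b
# optional weak-shadow-variable constraints A @ q = b, where row z of A holds
# P(Z = z | Y = y_k, R = 1) and b[z] = P(Z = z | R = 0).
import numpy as np
from scipy.optimize import linprog

def mean_bounds(y_support, p_obs, p_miss, A=None, b=None):
    y = np.asarray(y_support, dtype=float)
    K = len(y)
    # Decision variables: q[k] = P(Y = y_k | R = 0), a point on the simplex.
    A_eq = np.ones((1, K))
    b_eq = np.ones(1)
    if A is not None:  # extra linear constraints contributed by the prediction Z
        A_eq = np.vstack([A_eq, A])
        b_eq = np.concatenate([b_eq, np.asarray(b, dtype=float)])
    observed_part = float(np.dot(y, p_obs))  # contribution of respondents
    results = []
    for sign in (+1.0, -1.0):  # +1 gives the lower bound, -1 the upper bound
        res = linprog(c=sign * p_miss * y, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0.0, 1.0)] * K, method="highs")
        results.append(observed_part + sign * res.fun)
    lower, upper = results
    return lower, upper  # identification interval for E[Y]

Without the optional constraints, the program reduces to worst-case bounds over the simplex of missing-outcome distributions; adding the weak-shadow-variable rows shrinks the feasible set and hence the interval.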
Second, we develop a set-expansion estimator that asymptotically converges to the identification region while accounting for estimation error in the probability constraints that define the bounds. We establish convergence rates for this estimator. In the partially identified setting, the convergence rate is slower than the usual $\sqrt{n}$ rate (for example, on the order of $\sqrt{n}/\log n$), as a result of the additional uncertainty inherent in set identification. When the auxiliary information is sufficiently informative and point identification is achieved, the estimator recovers the standard $\sqrt{n}$ rate, matching classical results in the shadow variable literature (Miao and Tchetgen Tchetgen 2016).
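Concretely, a stylized version of such an estimator (the slack sequence below is illustrative rather than the paper's exact tuning) inflates the plug-in bounds $\hat{L}_n$ and $\hat{U}_n$ by a slack $c_n$ that shrinks slightly more slowly than $n^{-1/2}$:
\[
\widehat{\Theta}_n = \bigl[\hat{L}_n - c_n,\; \hat{U}_n + c_n\bigr], \qquad c_n \propto \frac{\log n}{\sqrt{n}},
\]
so that estimation error in the probability constraints is absorbed with high probability; a slack of order $\log n/\sqrt{n}$ is what yields a $\sqrt{n}/\log n$-type rate in the partially identified regime.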
We further extend our framework to randomized experiments and derive bounds on treatment effects, and provide sufficient conditions under which reliable treatment decisions can still be made despite missing outcomes.
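As an illustration (the per-arm bounds and notation are ours, not necessarily the paper's exact construction), if $[L_1, U_1]$ and $[L_0, U_0]$ bound the mean outcomes under treatment and control, then the average treatment effect $\tau$ satisfies
\[
L_1 - U_0 \;\le\; \tau \;\le\; U_1 - L_0,
\]
and one simple sufficient condition for recommending treatment despite missing outcomes is $L_1 - U_0 > 0$, so that the identified set for $\tau$ lies entirely above zero.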
Third, we evaluate the proposed methods through simulation studies and semi-synthetic experiments based on real customer-service dialogue data. To construct auxiliary signals, we generate outcome predictions using LLMs under several prompting and training regimes, including zero-shot, few-shot, and chain-of-thought prompting, as well as fine-tuning. Our results reveal two key insights.
First, we show that LLM predictions can fail to meet the strong completeness conditions required for classical shadow variable methods, rendering point identification unstable or infeasible. This underscores the need for our partial identification framework. Second, despite these limitations, the predictions remain informative: incorporating LLM-based weak shadow variables reduces the width of identification intervals by 75--83% across prompting strategies, while preserving valid coverage under realistic MNAR mechanisms.
Our work contributes to the rich literature on identification and estimation under MNAR mechanisms. Classical approaches include parametric selection models, such as the Heckman correction, which jointly model the outcome and missingness process (Heckman 1979), and pattern-mixture models that parameterize outcome distributions within each missingness stratum (Little 1994, Rubin 1987). Other strands of work leverage graphical models to represent missing data processes (Fay 1986), or use auxiliary variables such as instrumental variables that affect missingness but not outcomes (Das et al. 2003, Tchetgen Tchetgen and Wirth 2017, Sun et al. 2018). Our approach is most closely aligned with recent developments in partial identification, which bound the estimand under weak assumptions rather than imposing a fully parametric model.