Nearly Optimal Bayesian Inference for Structural Missingness
Structural missingness breaks ‘just impute and train’: values can be undefined by causal or logical constraints, and the mask may depend on observed variables, unobserved variables (MNAR), and other missingness indicators. It simultaneously brings (i) a catch-22 causal loop, where prediction needs the missing features yet inferring them depends on the missingness mechanism; (ii) MNAR distribution shift, where the unseen are different and the missing values can come from a shifted distribution; and (iii) plug-in imputation bias, where a single fill-in collapses uncertainty and yields overconfident, biased decisions. In the Bayesian view, prediction via the posterior predictive distribution integrates over the full model posterior, rather than relying on a single point estimate. This framework decouples (i) learning an in-model posterior over missing values from (ii) label prediction by optimizing the posterior predictive distribution, enabling posterior integration. This decoupling yields an in-model almost-free lunch: once the posterior is learned, prediction is plug-and-play while uncertainty propagation is preserved. The method achieves SOTA on 43 classification and 15 imputation benchmarks, with finite-sample near-Bayes-optimality guarantees under our SCM prior.
💡 Research Summary
The paper tackles a pervasive yet under‑explored problem in modern data science: structural missingness. Unlike classical missing‑data settings (MCAR, MAR, MNAR) where the missingness mask is either independent of the data or depends only on observed variables, structural missingness allows the mask to be generated by logical or causal constraints that involve both observed and unobserved features, and even other missingness indicators. Typical examples include questionnaire items that become undefined when a preceding question is unanswered, or sensor readings that are physically impossible under certain conditions. This creates a “catch‑22”: prediction requires the missing features, yet inferring those features depends on the missingness mechanism, which itself may be driven by the very features we wish to predict. Moreover, under MNAR the distribution of missing values can differ from that of observed values, leading to systematic bias if one naively treats them as drawn from the same distribution. Finally, deterministic single‑imputation approaches collapse uncertainty into a single guess, producing over‑confident and biased downstream decisions.
To address these challenges, the authors propose a unified Bayesian framework built on a second‑order structural causal model (SCM). The generative process consists of two stages: (1) sample a causal graph and its parameters, then generate a complete data matrix X from exogenous noise; (2) generate a binary mask M using deterministic gates that can depend on parent features (X→M) and on other mask entries (M→M), thereby capturing both structural validity constraints and propagation of missingness. The resulting joint distribution over (X, M) defines a prior over tasks.
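The two-stage generative process can be sketched in miniature. Everything below (the linear mechanisms, the threshold gates, the 0.5 edge probability) is an illustrative assumption, not the paper's actual prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n=200, d=5):
    """Stage 1: sample a causal graph + parameters, generate complete data X.
    Stage 2: generate a mask M via deterministic X->M and M->M gates."""
    # Strictly lower-triangular weights => a random DAG over d features.
    W = np.tril(rng.normal(size=(d, d)), k=-1) * (rng.random((d, d)) < 0.5)
    X = np.zeros((n, d))
    for j in range(d):  # column order doubles as a topological order
        noise = rng.normal(size=n)
        X[:, j] = X @ W[j] + noise  # linear SCM with Gaussian exogenous noise

    # M[i, j] = True means entry (i, j) is missing/undefined.
    M = np.zeros((n, d), dtype=bool)
    for j in range(1, d):
        # X -> M gate: feature j is undefined when a parent feature is too large.
        M[:, j] |= X[:, j - 1] > 1.0
        # M -> M gate: missingness propagates from the parent's mask entry.
        M[:, j] |= M[:, j - 1]
    return X, M

X, M = sample_task()
```

Each call yields one synthetic task (X, M) from the prior; the second loop shows how a single gate can encode both structural validity (X→M) and propagation of missingness (M→M).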
Prediction is formalized as posterior predictive inference conditioned on the observed sub‑space X_c^M (the observed features together with the mask) and the observed training set D_c^M. The posterior predictive distribution (PPD) decomposes into (i) a standard supervised term p(y|X_c^M, X_m, D_c^M, D_m) that assumes full data, and (ii) a posterior over the missing values p(X_m, D_m|X_c^M, D_c^M). Existing methods effectively replace term (ii) with a shortcut that ignores the mask or assumes the missing values follow the observed distribution, which induces MNAR bias and plug‑in imputation bias.
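In symbols, the decomposition described above reads (using the summary's notation):

```latex
p\bigl(y \mid X_c^M, D_c^M\bigr)
  \;=\; \int
  \underbrace{p\bigl(y \mid X_c^M, X_m, D_c^M, D_m\bigr)}_{\text{(i) full-data supervised term}}
  \;\underbrace{p\bigl(X_m, D_m \mid X_c^M, D_c^M\bigr)}_{\text{(ii) missing-value posterior}}
  \,\mathrm{d}X_m\,\mathrm{d}D_m
```

Shortcut methods replace term (ii) with a point mass at an imputed value, or with the observed-data distribution, which is exactly where the MNAR and plug-in biases enter.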
The core technical contribution is the use of Prior‑Fitted Networks (PFNs) to learn the mapping from an incomplete dataset to the PPD directly. PFNs are meta‑learners trained on a massive collection of synthetic tasks sampled from the second‑order SCM prior. For each synthetic task, the PFN receives the incomplete training set D_train (including masks) and a query point X_test with its mask, and is trained to minimize the cross‑entropy of the true label under its output distribution. This objective is equivalent to minimizing the expected conditional KL divergence, guaranteeing that the PFN approximates the Bayes‑optimal posterior predictive distribution under the chosen prior. Because the PFN outputs a full predictive distribution, downstream decisions can exploit calibrated uncertainty without any Monte‑Carlo sampling over imputations.
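The meta-training objective can be illustrated with a toy stand-in. The linear "PFN", the zero-fill mask encoding, and the prior sampler below are all hypothetical simplifications (the real model is a transformer over the whole set, trained by gradient descent); only the shape of the loss matches the description above:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_prior_task():
    """Hypothetical stand-in for the second-order SCM prior: one synthetic
    binary-classification task with an incomplete train set and a query point."""
    n, d = 32, 4
    w = rng.normal(size=d)
    X = rng.normal(size=(n + 1, d))
    y = (X @ w > 0).astype(int)
    M = rng.random((n + 1, d)) < 0.3       # mask: True = missing
    X_obs = np.where(M, 0.0, X)            # zero-fill so the model sees the mask
    return (X_obs[:n], M[:n], y[:n]), (X_obs[n], M[n]), y[n]

def pfn_forward(params, train_set, query):
    """Toy 'PFN': any set-conditioned predictor (D_train, x_query) -> p(y)."""
    (Xtr, Mtr, ytr), (xq, mq) = train_set, query
    # Crude summary statistics in place of a transformer over the set.
    feats = np.concatenate([xq, mq.astype(float), Xtr.mean(0), ytr.mean(keepdims=True)])
    p1 = 1.0 / (1.0 + np.exp(-(feats @ params)))
    return np.array([1.0 - p1, p1])

# Meta-training objective: cross-entropy of the true label under the model's
# output distribution, averaged over tasks drawn from the prior. Minimizing
# its expectation drives the model toward the Bayes-optimal PPD.
params = rng.normal(size=4 + 4 + 4 + 1) * 0.1
losses = []
for _ in range(100):
    train_set, query, y_true = sample_prior_task()
    p = pfn_forward(params, train_set, query)
    losses.append(-np.log(p[y_true] + 1e-12))
print(f"mean prior cross-entropy: {np.mean(losses):.3f}")
```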
Optionally, the authors attach a Flow Matching head that learns an explicit posterior over missing values p(X_m|X_c^M, D_c^M). This auxiliary model enables sampling‑based imputation and uncertainty analysis but is not required for label prediction, illustrating a clean separation between inference of missing features and downstream prediction.
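Sampling from such a head amounts to numerically integrating a learned velocity field from noise (t=0) to data (t=1) while clamping the observed entries. The `toy_field` below is a hypothetical stand-in for the trained conditional flow, chosen so its dynamics are easy to verify:

```python
import numpy as np

def sample_missing(v_field, x_obs, mask, n_steps=50, rng=None):
    """Impute missing entries (mask == True) by Euler-integrating a velocity
    field from Gaussian noise to a posterior sample; observed entries stay fixed."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.where(mask, rng.normal(size=x_obs.shape), x_obs)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = v_field(t, x, x_obs, mask)        # conditioned on the observed part
        x = x + dt * np.where(mask, v, 0.0)   # observed slots never move
    return x

def toy_field(t, x, x_obs, mask):
    """Hypothetical stand-in: its flow transports any start point to the
    mean of the observed entries by t = 1."""
    target = x_obs[~mask].mean() if (~mask).any() else 0.0
    return (target - x) / max(1.0 - t, 1e-3)

mask = np.array([False, True, True])
x_obs = np.array([2.0, 0.0, 0.0])
x_imp = sample_missing(toy_field, x_obs, mask, rng=np.random.default_rng(0))
```

Drawing many such samples (different noise seeds) gives a Monte-Carlo picture of the missing-value posterior, which is what the label predictor deliberately avoids needing.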
Theoretical analysis provides a finite‑sample excess‑risk bound that decomposes into (a) error due to approximating the missing‑value posterior and (b) error due to the predictive model. The bound shows that, compared to plug‑in baselines, the proposed method achieves the same target risk with fewer samples, because it correctly integrates over the missing‑value uncertainty.
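Schematically, the decomposition has the following shape (the exact constants and rates are in the paper; this only records the additive structure described above):

```latex
\mathcal{R}(\hat{p}) - \mathcal{R}(p^{\ast})
  \;\le\;
  \underbrace{\varepsilon_{\text{post}}}_{\text{(a) missing-value posterior approximation}}
  \;+\;
  \underbrace{\varepsilon_{\text{pred}}}_{\text{(b) predictive-model error}}
```

Plug-in baselines pay an extra, irreducible bias term in place of (a), which is why they need more samples to reach the same target risk.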
Empirically, the method is evaluated on 43 classification benchmarks and 15 imputation benchmarks that span MCAR, MAR, MNAR, and fully structural missingness scenarios. Across all datasets, the PFN‑based approach outperforms state‑of‑the‑art baselines (including multiple imputation, variational auto‑encoders, and recent causal imputation methods) by a substantial margin (often 3–7 percentage points in accuracy or AUROC). The advantage is most pronounced on datasets with strong MNAR or mask‑propagation effects, confirming the theoretical claims. The optional Flow Matching head yields realistic posterior samples of missing features, useful for downstream interpretability and risk‑aware decision making (e.g., in medical diagnosis).
In summary, the paper makes three intertwined contributions: (1) a principled prior over structural missingness via second‑order SCMs, (2) an efficient amortized Bayesian inference scheme for posterior predictive distributions using PFNs, and (3) an optional explicit missing‑value posterior estimator. Together, these components provide a practically viable solution to the long‑standing problem of learning and predicting under structural missingness, demonstrating both strong theoretical guarantees and state‑of‑the‑art empirical performance.