Admissibility Alignment

This paper introduces Admissibility Alignment: a reframing of AI alignment as a property of admissible action and decision selection over distributions of outcomes under uncertainty, evaluated through the behavior of candidate policies. We present MAP-AI (Monte Carlo Alignment for Policy), a control-plane system architecture that operationalizes admissibility alignment by formalizing it as a probabilistic, decision-theoretic property rather than a static or binary condition, and that enforces alignment through Monte Carlo estimation of outcome distributions and admissibility-controlled policy selection rather than static model-level constraints. The framework evaluates decision policies across ensembles of plausible futures, explicitly modeling uncertainty, intervention effects, value ambiguity, and governance constraints. Alignment is assessed through distributional properties, including expected utility, variance, tail risk, and probability of misalignment, rather than accuracy or ranking performance. This approach distinguishes probabilistic prediction from decision reasoning under uncertainty and provides an executable methodology for evaluating trust and alignment in enterprise and institutional AI systems. The result is a practical foundation for governing AI systems whose impact is determined not by individual forecasts but by policy behavior across distributions and tail events. Finally, we show how distributional alignment evaluation can be integrated into decision-making itself, yielding an admissibility-controlled action selection mechanism that alters policy behavior under uncertainty without retraining or modifying underlying models.


💡 Research Summary

The paper reframes AI alignment as a property of “admissibility,” meaning that a policy’s actions must be acceptable with respect to human values, institutional constraints, and risk tolerances across a distribution of uncertain outcomes. Rather than judging alignment by static accuracy or a single loss function, the authors propose evaluating policies over ensembles of Monte Carlo‑sampled futures and measuring distributional statistics such as expected utility, variance, tail‑risk metrics (VaR, CVaR), and the probability of mis‑alignment events.
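As a concrete illustration, the sketch below computes these distributional statistics from a set of Monte Carlo outcome samples. The function name, the utility/misalignment inputs, and the 5 % tail level are illustrative assumptions for this summary, not the paper's specification.

```python
import numpy as np

def distributional_stats(utilities, misaligned, alpha=0.05):
    """Summarize a policy's Monte Carlo outcome samples.

    utilities:  array of utility values, one per sampled future
    misaligned: boolean array flagging misalignment events per sample
    alpha:      tail probability for VaR/CVaR (illustrative default)
    """
    utilities = np.asarray(utilities, dtype=float)
    var_alpha = np.quantile(utilities, alpha)              # VaR: alpha-quantile of utility
    tail = utilities[utilities <= var_alpha]
    cvar_alpha = tail.mean() if tail.size else var_alpha   # CVaR: mean utility in the worst alpha-tail
    return {
        "expected_utility": utilities.mean(),
        "variance": utilities.var(),
        "VaR": var_alpha,
        "CVaR": cvar_alpha,
        "p_misalignment": float(np.mean(misaligned)),
    }
```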

To operationalize this idea, they introduce MAP‑AI (Monte Carlo Alignment for Policy), a three‑layer control‑plane architecture. The prediction layer retains any high‑performing generative or forecasting model and produces a conditional distribution over possible world states given the current context. The simulation layer draws many samples from this distribution, applies a candidate policy π to each sampled state, and records the resulting outcome X. This yields an empirical outcome distribution Pπ that reflects how the policy would behave in practice.
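A minimal sketch of that prediction-and-simulation loop, assuming placeholder callables `predict`, `policy`, and `outcome` that stand in for the prediction layer, the candidate policy π, and the outcome model (none of these names are taken from the paper):

```python
import numpy as np

def simulate_policy(predict, policy, outcome, context, n_samples=10_000, seed=0):
    """Roll a candidate policy across Monte Carlo samples of plausible futures.

    predict(context, rng) -> sampled world state          (prediction layer)
    policy(state)         -> action                       (candidate policy pi)
    outcome(state, action)-> resulting outcome X          (simulation layer)
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        state = predict(context, rng)            # draw a plausible world state given the context
        action = policy(state)                   # apply the candidate policy to that state
        samples.append(outcome(state, action))   # record the resulting outcome X
    return samples                               # an empirical draw from the outcome distribution P_pi
```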

The admissibility evaluation layer computes the aforementioned statistics from the empirical distribution and compares them against pre‑specified admissibility criteria A (e.g., mis‑alignment probability < 1 %, CVaR ≥ ‑10, expected utility loss ≤ 5 %). Policies that satisfy all criteria form a feasible set Π*. Crucially, MAP‑AI does not require retraining or fine‑tuning the underlying model. If a policy violates a criterion, a policy‑modification function ψ intervenes, replacing the risky action with a safer alternative, activating a human‑in‑the‑loop veto, or otherwise reshaping behavior while leaving the model parameters unchanged.
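The following sketch shows one way such an admissibility check and a ψ-style fallback could fit together. The `criteria` dictionary, `is_admissible`, `select_policy`, and `psi` signatures are hypothetical illustrations, not the paper's API.

```python
def is_admissible(stats, criteria):
    """Check distributional statistics against admissibility criteria A.

    criteria maps a statistic name to a predicate, e.g.
      {"p_misalignment": lambda p: p < 0.01, "CVaR": lambda c: c >= -10.0}
    """
    return all(check(stats[name]) for name, check in criteria.items())

def select_policy(candidates, evaluate, criteria, psi):
    """Pick an admissible policy from the feasible set, or fall back to a psi-modified behavior.

    candidates: iterable of candidate policies
    evaluate:   policy -> distributional stats (via Monte Carlo simulation)
    psi:        policy-modification function used when no candidate is admissible,
                e.g. substituting a safer default action or escalating to a human veto
    """
    scored = [(pi, evaluate(pi)) for pi in candidates]
    feasible = [(pi, s) for pi, s in scored if is_admissible(s, criteria)]
    if feasible:
        # Among admissible policies, prefer the one with the highest expected utility.
        return max(feasible, key=lambda ps: ps[1]["expected_utility"])[0]
    return psi(candidates)  # no admissible policy: reshape behavior without retraining the model
```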

The authors formalize admissibility alignment as a probabilistic decision‑theoretic problem, showing its relationship to Bayesian risk minimization and multi‑objective optimization. They distinguish hard admissibility constraints (which guarantee a non‑empty feasible set) from soft constraints (which allow trade‑offs between utility and safety). Theoretical results demonstrate that, under reasonable assumptions, admissibility evaluation provides a sufficient condition for safe deployment without sacrificing most of the policy’s performance.
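One way to write this down, consistent with the summary but in illustrative notation (ε, c, α, and λ are placeholders rather than the paper's symbols), is the following sketch:

```latex
% Hard admissibility: restrict to the feasible set Pi*, then maximize expected utility.
\[
\Pi^{*} = \Bigl\{ \pi \in \Pi :
   \Pr_{X \sim P_{\pi}}[\text{misalignment}] \le \varepsilon,\;
   \mathrm{CVaR}_{\alpha}\bigl(U(X)\bigr) \ge c \Bigr\},
\qquad
\pi^{\text{hard}} = \arg\max_{\pi \in \Pi^{*}} \; \mathbb{E}_{X \sim P_{\pi}}\bigl[U(X)\bigr].
\]

% Soft admissibility: trade expected utility against violations via a penalty weight lambda,
% a Bayes-risk-style scalarization over the same outcome distribution.
\[
\pi^{\text{soft}} = \arg\max_{\pi \in \Pi} \;
   \mathbb{E}_{X \sim P_{\pi}}\bigl[U(X)\bigr]
   - \lambda \, \Pr_{X \sim P_{\pi}}[\text{misalignment}].
\]
```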

Empirical validation is performed in two domains: (1) a text‑based decision‑making task using a large language model, and (2) a robotic control task trained via reinforcement learning. In both cases, MAP‑AI reduces the frequency of mis‑alignment events by more than 70 % compared with baseline alignment methods, while incurring less than a 5 % drop in expected utility. Tail‑risk measures improve dramatically, and the ψ function lowers the operational cost of human supervision.

Key contributions include: (i) a shift from static verification to dynamic, distribution‑based policy assessment; (ii) a concrete Monte Carlo framework that makes tail‑risk and uncertainty explicit in alignment metrics; (iii) a mechanism for real‑time policy adjustment without model retraining; and (iv) a theoretical grounding that connects admissibility to established decision‑theoretic concepts.

Future work is outlined as follows: develop more efficient sampling techniques for high‑dimensional continuous action spaces; learn admissibility thresholds automatically via meta‑learning; design negotiation protocols for resolving value conflicts among multiple stakeholders; and integrate MAP‑AI into real‑world enterprise pipelines with transparent monitoring and explainability of the ψ interventions. The paper thus offers a practical foundation for governing AI systems whose impact is determined not by isolated predictions but by the aggregate behavior of policies across uncertain futures.

