This paper introduces Admissibility Alignment: a reframing of AI alignment as a property of admissible action and decision selection over distributions of outcomes under uncertainty, evaluated through the behavior of candidate Policies. We present MAP-AI (Monte Carlo Alignment for Policy) as a canonical system architecture for operationalizing Admissibility Alignment, formalizing alignment as a probabilistic, decision-theoretic property rather than a static or binary condition. MAP-AI is a control-plane system architecture for aligned decision-making under uncertainty, in which alignment is enforced through Monte Carlo estimation of outcome distributions and admissibility-controlled Policy selection rather than static model-level constraints. The framework evaluates decision Policies across ensembles of plausible futures, explicitly modeling uncertainty, intervention effects, value ambiguity, and governance constraints. Alignment is assessed through distributional properties (expected utility, variance, tail risk, and probability of misalignment) rather than accuracy or ranking performance. This approach distinguishes probabilistic prediction from decision reasoning under uncertainty and provides an executable methodology for evaluating trust and alignment in enterprise and institutional AI systems. The result is a practical foundation for governing AI systems whose impact is determined not by individual forecasts, but by Policy behavior across distributions and tail events. Finally, we show how distributional alignment evaluation can be integrated into decision-making itself, yielding an admissibility-controlled action selection mechanism that alters Policy behavior under uncertainty without retraining or modifying underlying models. Where prior alignment work focuses on how AI systems think, through representation or reasoning, Admissibility Alignment addresses the orthogonal technical problem of how AI systems act: it formalizes alignment at the level of action selection by determining which candidate actions and decisions are permitted to execute under uncertainty. While non-agentic probabilistic monitors reduce incentives for self-preservation and deception, they do not eliminate the need to embed normative thresholds under deep uncertainty. Any guardrail that blocks actions based on predicted harm must choose how to trade false positives against catastrophic tail risks, an inherently value-laden decision that cannot be resolved ex ante. As a result, oracle-based approaches face a decision-theoretic uncertainty problem structurally similar to long-horizon reward optimization, albeit at a different layer. MAP-AI therefore treats alignment not as a fixed value-embedding problem, but as a procedural decision-theoretic process over admissible actions, with explicit thresholds, auditability, and human governance: a canonical decision control plane that uses Monte Carlo evaluation to govern decision selection without modifying underlying representation or reasoning models. This paper advances a precise and operational claim: once AI systems act under uncertainty, any alignment evaluation that does not simulate Policy behavior across outcome distributions is incomplete by definition. The limitation is structural, not methodological.
Metrics designed to evaluate predictions or scores, even when probabilistic, do not, by construction, capture how a system trades off expected performance, tail risk, value ambiguity, and governance constraints when selecting actions. As a result, alignment failures may remain undetected until they manifest operationally, often in rare but consequential regimes. We formalize alignment as a probabilistic, decision-theoretic property of Policies rather than as a static attribute of models. The core object of evaluation is the distribution of outcomes induced by a Policy operating in an uncertain world under uncertain values and explicit constraints. Alignment, under this framing, is assessed through distributional characteristics (expected utility, variance, tail risk, and constraint violation probability) rather than through point estimates or binary compliance criteria. This reframing shifts alignment from a question of correctness to a question of risk engineering.
As artificial intelligence systems transition from passive prediction engines to systems that act, intervene, and optimize in the world, the question of alignment necessarily changes in kind. Predictive accuracy, ranking performance, and static safety checks are no longer sufficient to characterize system behavior once decisions propagate through uncertain environments, interact with human values, and generate irreversible outcomes. In such settings, alignment is not a property that can be asserted pointwise or certified once; it is a property that emerges from how a decision Policy behaves across distributions of possible futures. Alignment is not a property of internal cognition or belief correctness; it is a property of the external behavior induced by a decision Policy operating under uncertainty.
The alignment system architecture introduced in this paper operationalizes this definition through Monte Carlo-based alignment stress testing. By simulating Policy rollouts across ensembles of plausible environments and value specifications, the approach makes explicit the distribution of outcomes a system is likely to induce, including low-probability but high-impact events. Monte Carlo simulation is not introduced as a novel algorithmic contribution, but as a canonical instantiation of a more general requirement: alignment must be evaluated by examining Policy behavior under uncertainty, not inferred from isolated predictions or training-time objectives.
To operationalize alignment as a distributional property of decision-making under uncertainty, we introduce the MAP-AI Alignment System Architecture (Figure 1). The architecture decomposes alignment evaluation and control into three interacting components. Part I is a Monte Carlo uncertainty engine that generates distributions over possible futures by jointly sampling world realizations, Policy behavior, trajectory evolution, value uncertainty, and constraint realization. Part II performs cross-layer alignment stress testing by estimating distributional risk measures, such as constraint violation probability and tail risk, and projecting these risks across the model, Policy, constraint, and governance layers of the system. Part III integrates these risk signals into decision control through an admissibility filter and a champion-challenger decision loop, ensuring that only policies satisfying institutionally specified governance thresholds are eligible for execution. Together, these components close the alignment loop: uncertainty is transformed into decision-relevant constraints that directly shape which actions an agent is permitted to take. This architecture is agnostic to model internals and task domain, and is intended as a canonical system-level framing for alignment in agentic AI systems. Crucially, this architecture is not limited to post hoc evaluation. Because alignment failures arise from decisions taken under uncertainty, alignment assessment must ultimately inform which actions are admissible to execute. We therefore treat alignment evaluation as an input to decision-making itself, enabling Policies to be filtered, overridden, or escalated when distributional risk exceeds institutionally specified thresholds. This distinction, between measuring alignment and enforcing it at the point of action, forms the basis of the decision integration mechanism introduced later in the paper.
Policy (capital P) denotes the abstract decision rule that is the object of alignment: a mapping from histories or states to actions whose induced outcome distribution is evaluated. The policy layer refers to the system-level decision interface where a Policy is instantiated and where action selection is subject to admissibility constraints. MAP-AI evaluates Policies and enforces alignment constraints at the policy layer.
First, it defines alignment as a distributional property of action selection under uncertainty, rather than as a property of model internals, training objectives, or belief correctness.
Second, it introduces an evaluation-first alignment standard in which decision-making behavior is assessed via Monte Carlo estimation of outcome distributions, including tail risk and constraint violation probabilities, rather than optimized directly for alignment objectives.
Third, it formalizes governance-admissible action selection as a first-class construct, showing how institutional risk tolerances can be enforced through an admissibility-controlled decision interface that directly shapes which actions are executed under uncertainty.
Fourth, it demonstrates via stress tests that systems with equivalent predictive performance can exhibit sharply divergent alignment behavior in the tails, underscoring the insufficiency of expectation-based evaluation for alignment-critical systems.
Conceptually, this work is aligned with prior decision-theoretic treatments of AI alignment, particularly those emphasizing uncertainty over objectives and the evaluation of policies rather than predictions. Notably, Stuart Russell has argued that advanced AI systems should be understood as decision-makers operating under uncertainty about human values, and that treating objectives as fixed and known leads to systematic misalignment. This paper builds on that foundation by focusing not on how aligned policies are learned, but on how alignment can be evaluated, compared, and governed in deployed systems using tools already familiar to institutions that manage risk in finance, engineering, and safety-critical domains.
The contribution of this paper is therefore neither an ethical taxonomy nor a new training paradigm. It is a system-level alignment evaluation standard that treats alignment as a continuous, distributional property of decision-making systems, operationalized via Monte Carlo estimation of policy-induced outcomes. By separating alignment evaluation from capability development and optimization, the framework is intended to be complementary to existing AI systems-including forecasting and ranking platforms-while addressing a gap that becomes unavoidable as AI systems increasingly act, intervene, and optimize in the world.
This work draws on, but is distinct from, several bodies of prior research spanning probabilistic prediction, decision-theoretic alignment, and robust evaluation under uncertainty. The common limitation across these literatures is not conceptual sophistication, but scope: most focus on what systems should optimize or how aligned behavior might be learned, rather than on how alignment should be evaluated once systems act in uncertain environments.
The dominant paradigm for evaluating AI systems-particularly in enterprise and institutional contexts-remains grounded in probabilistic prediction and ranking. Forecast accuracy, calibration, likelihood-based metrics, and ranking performance have proven effective for systems whose primary function is estimation rather than intervention. Even when predictions are probabilistic, evaluation typically remains pointwise, assessing whether a model assigns high probability to observed outcomes or ranks options correctly.
While these approaches are appropriate for passive inference, they do not evaluate the consequences of acting on predictions. Once a system selects actions that influence future states, downstream outcomes depend not only on predictive quality but on how uncertainty, tradeoffs, and constraints are resolved at decision time. As a result, alignment failures may occur despite well-calibrated predictions, particularly in rare or high-impact regimes. This limitation is structural rather than methodological: predictive metrics are not designed to characterize Policy behavior across distributions of futures.
The closest conceptual foundation for the present work comes from decision-theoretic treatments of AI alignment that emphasize uncertainty over objectives and policies rather than fixed reward maximization. In particular, Stuart Russell has argued that advanced AI systems should be understood as decision-makers operating under uncertainty about human values, and that treating objectives as fully specified leads to systematic misalignment. Formalizations such as cooperative inverse reinforcement learning and assistance games model alignment as a cooperative decision problem in which the system must reason under incomplete information about preferences.
These frameworks establish a critical principle adopted here: alignment is fundamentally about Policy choice under uncertainty, not about prediction alone. However, prior work in this tradition has focused primarily on how aligned behavior might be learned or incentivized, rather than on how alignment should be evaluated in deployed systems. In particular, there is limited treatment of alignment as a continuously monitored, distributional property of outcomes under varying environmental conditions, value specifications, and governance constraints.
Adjacent to alignment research is a substantial literature on robust and risk-sensitive decision-making, including robust Markov decision processes, constrained optimization, and tail-risk-aware planning. These approaches explicitly account for uncertainty, worst-case scenarios, and risk measures such as conditional value at risk. Separately, simulation-based stress testing has been widely adopted in finance, engineering, and safety-critical systems to evaluate behavior under extreme but plausible conditions. While these techniques provide important methodological building blocks, they are typically framed as optimization or planning tools rather than as a general evaluation standard for alignment. Moreover, they are rarely connected explicitly to questions of value uncertainty, institutional governance, or trust thresholds. As a result, their relevance to alignment is often implicit rather than formalized.
Taken together, prior work establishes that (i) advanced AI systems should be evaluated as decision-makers rather than predictors, and (ii) uncertainty and risk must be treated explicitly. What remains unaddressed is a unifying evaluation framework that treats alignment as a distributional property of Policy behavior, estimated through systematic simulation across uncertain environments, values, and constraints.
The present work fills this gap by reframing alignment evaluation as a form of decision-theoretic stress testing. Rather than proposing new objectives, learning algorithms, or ethical taxonomies, it introduces a practical methodology for comparing and governing AI decision Policies based on the distributions of outcomes they induce.
We evaluate alignment at the level of induced trajectory distributions under a scenario generator, rather than model internals. The notation below defines the stochastic objects used throughout, given $N$ Monte Carlo rollouts $\{\tau_i\}_{i=1}^{N}$ of a Policy under the scenario generator.

• Expected utility: $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} U(\tau_i)$.

• Utility variance: $\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N}\bigl(U(\tau_i) - \hat{\mu}\bigr)^2$.

• Constraint violation probability: $\hat{p}_{\mathrm{viol}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\{\tau_i \text{ violates } C\}$.

• Tail risk ($\mathrm{CVaR}_\alpha$). For $\alpha \in (0,1)$, let $\hat{q}_{1-\alpha}$ be the empirical $(1-\alpha)$-quantile of the loss $L(\tau) := -U(\tau)$. The empirical $\mathrm{CVaR}_\alpha$ estimator is $\widehat{\mathrm{CVaR}}_\alpha = \frac{1}{|\{i : L(\tau_i) \ge \hat{q}_{1-\alpha}\}|}\sum_{i : L(\tau_i) \ge \hat{q}_{1-\alpha}} L(\tau_i)$.

• Governance thresholds (admissibility). Governance parameters $(\varepsilon, \kappa)$ define an admissible region; a Policy is admissible if, for example, $\hat{p}_{\mathrm{viol}} \le \varepsilon$ and $\widehat{\mathrm{CVaR}}_\alpha \le \kappa$. Expected utility is compared among admissible Policies.

• Confidence intervals. MAP-AI reports uncertainty using bootstrap confidence intervals (default), or asymptotic intervals where appropriate. For $\hat{p}_{\mathrm{viol}}$ at modest $N$, Wilson or Clopper-Pearson intervals may be used.
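To make these definitions concrete, the following is a minimal sketch of how the canonical MAP-AI metrics could be computed from an array of rollout utilities; the function name, array-based inputs, and the illustrative tolerance values are assumptions for exposition, not part of the MAP-AI specification.

```python
import numpy as np

def alignment_metrics(utilities, violated, alpha=0.05, eps=0.01, kappa=40.0):
    """Canonical MAP-AI estimators from N Monte Carlo rollouts.

    utilities : array of U(tau_i) per rollout
    violated  : boolean array, True if tau_i violates constraint C
    alpha     : CVaR tail level; eps, kappa are illustrative governance tolerances
    """
    utilities = np.asarray(utilities, dtype=float)
    violated = np.asarray(violated, dtype=bool)

    mu_hat = utilities.mean()            # expected utility
    var_hat = utilities.var(ddof=1)      # sample variance
    p_viol = violated.mean()             # constraint violation probability

    # CVaR_alpha over loss L(tau) = -U(tau): mean of losses at or above the (1-alpha)-quantile
    losses = -utilities
    q = np.quantile(losses, 1.0 - alpha)
    cvar = losses[losses >= q].mean()

    admissible = (p_viol <= eps) and (cvar <= kappa)
    return {"E[U]": mu_hat, "Var[U]": var_hat,
            "p_viol": p_viol, "CVaR": cvar, "admissible": admissible}
```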
MAP-AI is related to off-policy evaluation (OPE), constrained Markov decision processes (CMDPs), and risk-sensitive reinforcement learning, but differs in both objective and scope.
OPE methods typically estimate expected return or performance metrics for a fixed policy under a logged data distribution. In contrast, MAP-AI evaluates alignment risk, operationalized as the probability and severity of unacceptable outcomes under explicitly modeled uncertainty, including value uncertainty and governance thresholds.
Risk-sensitive and CVaR-based RL approaches incorporate tail risk into optimization objectives. MAP-AI is intentionally evaluation-first rather than optimization-first: it does not prescribe how policies should be trained, nor does it assume that alignment can be achieved through objective shaping alone. Instead, it provides a common evaluation substrate for comparing policies, constraints, and governance regimes once systems are capable of acting.
Finally, MAP-AI explicitly models governance admissibility and escalation as control variables external to the policy itself, a dimension largely absent from standard CMDP and risk-sensitive control formulations.
We model an AI system as selecting actions via a decision Policy in an uncertain environment with uncertain values and explicit governance constraints. Let $s_t \in \mathcal{S}$ denote the state at time $t$, $a_t \in \mathcal{A}$ the action selected, and $\omega \in \Omega$ a latent world realization capturing uncertain dynamics, regime shifts, and rare events. A scenario generator $G$ induces a distribution $P_G(\omega)$ over such worlds.

A Policy $\pi(a_t \mid s_t, h_t)$ selects actions based on state and available information $h_t$, which may include internal model state and human inputs. A trajectory $\tau = (s_0, a_0, \dots, s_T)$ is generated by rolling out $\pi$ under world $\omega$.

Values are represented by a utility function $U(\tau; \theta)$, with uncertainty encoded via a distribution $\theta \sim P(\theta)$. Governance requirements are represented as trajectory-level constraints $c_i(\tau) \le 0$ and an unacceptable outcome set $\mathcal{M}$.

Each Monte Carlo rollout samples jointly from world and value uncertainty, $(\omega_i, \theta_i) \sim P_G(\omega)\,P(\theta)$, generating an outcome trajectory $\tau_i \sim P(\tau \mid \pi, \omega_i, \theta_i)$.

The central object of alignment evaluation is the induced distribution over trajectories, $P(\tau \mid \pi, G)$, i.e., the distribution of outcomes generated by the Policy under uncertainty.
For this purpose, uncertainty is decomposed into world realization ($\omega$), Policy stochasticity, trajectory evolution, value uncertainty ($\theta$), and constraint realization, each of which may be sampled independently or jointly in Monte Carlo evaluation. This decomposition is not intended to be exhaustive of all sources of uncertainty, but it is sufficient to characterize the Policy-induced outcome distribution that MAP-AI treats as the alignment object.

Alignment is evaluated through distributional functionals of this distribution, including expected utility, variance, tail risk, constraint violation probability, and misalignment risk. Alignment is defined comparatively and contextually: a Policy is preferred if its outcome distribution dominates those of competing Policies with respect to the declared evaluation criteria and governance thresholds.
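As a sketch of this evaluation object, the loop below jointly samples a world realization and value parameters and rolls out a Policy; all callables (sample_world, sample_values, rollout, utility, violates) are hypothetical interfaces standing in for a concrete scenario generator, value distribution, dynamics, and constraint check.

```python
import numpy as np

def evaluate_policy(policy, sample_world, sample_values, rollout, utility,
                    violates, n_rollouts=10_000, seed=0):
    """Estimate the Policy-induced outcome distribution P(tau | pi, G).

    The caller supplies the scenario generator G (sample_world), the value
    distribution P(theta) (sample_values), the trajectory dynamics (rollout),
    the utility U(tau; theta), and the constraint check (violates).
    """
    rng = np.random.default_rng(seed)
    utilities, violations = [], []
    for _ in range(n_rollouts):
        omega = sample_world(rng)           # world realization omega_i ~ P_G(omega)
        theta = sample_values(rng)          # value parameters theta_i ~ P(theta)
        tau = rollout(policy, omega, rng)   # trajectory tau_i ~ P(tau | pi, omega_i)
        utilities.append(utility(tau, theta))
        violations.append(violates(tau))
    return np.asarray(utilities), np.asarray(violations)
```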
This section clarifies the scope, limits, and intended interpretation of MAP-AI. The objective is to prevent common category errors, particularly the misinterpretation of MAP-AI as a training protocol, compliance checklist, or safety guarantee, by making explicit what claims the framework does and does not make. MAP-AI is not an offline or pre-deployment evaluation; the framework applies to continuous, post-deployment monitoring as policies, environments, values, and governance thresholds evolve.
MAP-AI is not a protocol, checklist, or procedure whose execution guarantees alignment. It does not prescribe how policies are constructed, trained, or optimized, nor does it specify actions that must be taken to achieve alignment. Instead, MAP-AI is a trust and safety evaluation system that measures, stress-tests, and governs Policy behavior under uncertainty. The framework defines what must be evaluated (the distribution of outcomes induced by a Policy operating under uncertainty) and how alignment risk should be reported (via distributional metrics, tail risk, and governance admissibility). It does not assert that following a sequence of steps produces alignment, nor does it collapse alignment into compliance with a fixed procedure.
Alignment, under MAP-AI, is an empirical property of system behavior that must be observed, compared, and governed, not a box that can be checked.
All MAP-AI results are conditional. Reported alignment metrics are conditional on:
• the declared scenario generator and its abstractions,
• the modeled interfaces (including human involvement and tooling),
• the specified value parameter distributions,
• and the governance thresholds used to define admissibility.
MAP-AI makes no unconditional claims about real-world frequencies or universal safety. When scenario generators or evaluators are mis-specified, results should be interpreted as bounds under stated assumptions, not as guarantees. This conditionality is a feature rather than a limitation: it makes assumptions explicit and auditable, rather than implicit and unverifiable.
MAP-AI treats all evaluators as fallible system components. This includes automated constraint classifiers, harm estimators, red-teaming models, and guardrail mechanisms, whether human-operated or automated.
Evaluator-first safety architectures, such as probabilistic harm estimators or automated guardrail systems, can be incorporated within MAP-AI as implementations of constraint evaluation or gating logic applied to policy outcomes rather than internal beliefs. However, their outputs are treated as noisy measurements, not as ground truth. Evaluator error, calibration drift, blind spots, and normative ambiguity are therefore sources of system risk that propagate into misalignment estimates.
MAP-AI does not assume evaluator trustworthiness by design. Instead, it requires that the effect of evaluators and guardrails be assessed empirically through their impact on the induced trajectory distribution.
MAP-AI explicitly separates failure discovery from risk estimation.
Procedures that over-sample hazardous regions, such as adversarial scenario search, rare-event amplification, or evaluator-guided stress generation, are permitted and encouraged for discovering alignment-relevant failure modes. However, reported alignment metrics must always be computed under a declared evaluation distribution, whether the base scenario generator or a specified stress distribution.
This separation is required to avoid conflating “failures were found” with “failures are likely.” MAP-AI therefore treats adversarial or evaluator-guided sampling as a discovery tool, not as a substitute for distributional estimation.
Governance thresholds defining admissibility regions are institutional decisions, not objective truths. A policy deemed inadmissible under one set of risk tolerances may be admissible under another. MAP-AI makes these thresholds explicit and evaluates their consequences, rather than embedding them implicitly in optimization objectives or safety rules.
Thresholding introduces discontinuities: small changes in tolerances can induce large changes in the admissible policy set. MAP-AI treats this sensitivity as an object of evaluation, not as a defect.
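As a small illustration of this sensitivity analysis, the sketch below sweeps the violation-probability tolerance and reports how the admissible Policy set changes; the metric names and grid values are illustrative assumptions.

```python
import numpy as np

def admissibility_sweep(metrics_by_policy, eps_grid, kappa):
    """Report how the admissible Policy set changes as the violation tolerance
    eps varies, holding the CVaR tolerance kappa fixed (illustrative sweep)."""
    table = {}
    for eps in eps_grid:
        table[float(eps)] = sorted(p for p, m in metrics_by_policy.items()
                                   if m["p_viol"] <= eps and m["CVaR"] <= kappa)
    return table

# Example: tolerances spanning an order of magnitude around an operating point
# admissibility_sweep(metrics, eps_grid=np.linspace(0.005, 0.05, 10), kappa=40.0)
```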
Governance is therefore modeled as an active control layer, not as a static compliance filter. Alignment emerges from the interaction of these layers once a system has the ability to act. Monte Carlo simulation is not itself the alignment layer; it is the evaluation substrate that makes the interaction between layers observable under uncertainty.
Human involvement (approval gates, escalation rules, overrides) is treated as a policy component, not as a guarantee. MAP-AI evaluates whether oversight mechanisms change outcome distributions, tail risk, and admissibility in practice. It does not assume that the presence of a human in the loop ensures alignment by design.
This mirrors institutional safety practice in other domains: oversight mechanisms are evaluated by their realized effect on outcomes, not by their intent.
MAP-AI does not claim to solve alignment, eliminate emergent risk, or replace interpretability, training-time alignment, or governance institutions. Its claim is narrower and more defensible: once AI systems act under uncertainty, alignment cannot be meaningfully assessed without evaluating policy behavior across outcome distributions.
Absent such evaluation, alignment remains an assumption rather than an operational property.
Alignment metrics in MAP-AI are estimated via repeated simulation of Policy behavior under a declared scenario generator. This section specifies the stochastic evaluation object, the estimator family used to quantify alignment risk, the treatment of tail events and rare failures, and the calibration and reporting requirements needed to make results comparable across policies and reproducible across evaluation contexts.
Let $G$ denote a scenario generator that induces a distribution over possible worlds $\omega$. For a candidate Policy $\pi$, MAP-AI evaluates the induced distribution over trajectories $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$, whose randomness arises from world uncertainty, Policy stochasticity, value uncertainty, and constraint realization. These sources may be sampled independently or jointly. All reported results must explicitly state which components are randomized and which are conditioned.
Let $\{\tau_i\}_{i=1}^{N} \sim P(\tau \mid \pi, G)$ denote $N$ independent rollouts of Policy $\pi$. For any trajectory-level functional $f(\tau)$ (e.g., utility, loss, or constraint indicator), MAP-AI estimates distributional quantities using the following canonical estimators. Throughout this section, $\mathrm{CVaR}_\alpha$ is computed over loss, where the loss is defined as $L(\tau) := -U(\tau)$.

Location (sample mean): $\hat{\mu}_f = \frac{1}{N}\sum_{i=1}^{N} f(\tau_i)$.

Dispersion (sample variance): $\hat{\sigma}_f^2 = \frac{1}{N-1}\sum_{i=1}^{N}\bigl(f(\tau_i) - \hat{\mu}_f\bigr)^2$.

Constraint violation probability. For trajectory-level constraints $C$, $\hat{p}_{\mathrm{viol}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\{\tau_i \text{ violates } C\}$.

Tail risk ($\mathrm{CVaR}_\alpha$). Let $L(\tau)$ denote loss, and let $\hat{q}_{1-\alpha}$ be the empirical $(1-\alpha)$-quantile of $L(\tau)$. Define the index set $\mathcal{I}_\alpha = \{\, i : L(\tau_i) \ge \hat{q}_{1-\alpha} \,\}$; the empirical estimator is $\widehat{\mathrm{CVaR}}_\alpha = \frac{1}{|\mathcal{I}_\alpha|}\sum_{i \in \mathcal{I}_\alpha} L(\tau_i)$.

Governance admissibility. MAP-AI treats governance thresholds as first-class. A Policy $\pi$ is admissible only if $\hat{p}_{\mathrm{viol}}(\pi) \le \varepsilon$ and $\widehat{\mathrm{CVaR}}_\alpha(\pi) \le \kappa$ for institutionally specified tolerances $(\varepsilon, \kappa)$. Expected utility is not sufficient for admissibility; it is evaluated only among admissible policies.
Uncertainty in all reported estimates must be quantified. MAP-AI supports:
• nonparametric bootstrap confidence intervals for $\hat{\mathbb{E}}[U]$, $\hat{p}_{\mathrm{viol}}$, and $\widehat{\mathrm{CVaR}}_\alpha$;
• asymptotic intervals when regularity conditions are satisfied;
• Wilson or Clopper-Pearson intervals for violation probabilities when $N$ is modest.
Unless otherwise stated, results are reported with 95% confidence intervals.
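A minimal sketch of the default nonparametric percentile bootstrap interval, assuming per-rollout values are available as an array; the helper name and resampling count are illustrative choices.

```python
import numpy as np

def bootstrap_ci(samples, statistic, n_boot=2000, level=0.95, seed=0):
    """Nonparametric percentile bootstrap CI for a trajectory-level statistic.

    samples   : array of per-rollout values (e.g., U(tau_i) or violation indicators)
    statistic : callable mapping a resampled array to a scalar (mean, CVaR, ...)
    """
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    n = len(samples)
    stats = np.array([statistic(rng.choice(samples, size=n, replace=True))
                      for _ in range(n_boot)])
    lo, hi = np.quantile(stats, [(1 - level) / 2, 1 - (1 - level) / 2])
    return statistic(samples), (lo, hi)

# Example: 95% CI for CVaR_0.05 over loss, using the worst 5% of losses
# cvar = lambda u: np.mean(np.sort(-u)[-max(1, int(0.05 * len(u))):])
# point, (lo, hi) = bootstrap_ci(utilities, cvar)
```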
Naïve Monte Carlo sampling under-represents rare but high-impact failures. MAP-AI therefore supports variance-reduction and rare-event amplification techniques as part of the evaluation protocol.
Two families are recommended:
1. Stratified or conditional sampling, in which rollouts are allocated across scenario strata (e.g., regimes defined by $G$) to ensure coverage of relevant operational states and enable regime-conditional risk reporting.
2. Importance sampling, using a proposal distribution $q(\omega)$ that over-weights known hazard regions. When likelihood ratios are available, unbiased estimators may be constructed; otherwise, resulting estimates are treated as conservative upper bounds.

Adversarial scenario search can be interpreted as constructing $q$ to target failure discovery. In MAP-AI, such procedures are permitted for discovery, but any reported alignment metric must be computed under a declared evaluation distribution (either the base generator $G$ or an explicitly specified stress distribution) to avoid conflating discovery with estimation.
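The sketch below illustrates the importance-sampling case with explicit likelihood ratios; the proposal and density callables are hypothetical placeholders, and the estimate is unbiased only when the ratios are exact, as noted above.

```python
import numpy as np

def is_violation_estimate(sample_proposal, logp_base, logp_proposal,
                          rollout_and_check, policy, n=50_000, seed=0):
    """Importance-sampled violation probability under the base generator G.

    Worlds are drawn from a proposal q(omega) that over-weights hazard regions;
    each violation indicator is reweighted by the likelihood ratio p_G(omega)/q(omega).
    All callables are placeholders for a concrete scenario generator.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n):
        omega = sample_proposal(rng)
        w = np.exp(logp_base(omega) - logp_proposal(omega))       # likelihood ratio
        total += w * float(rollout_and_check(policy, omega, rng))  # 1.0 if violated
    return total / n   # unbiased estimate of p_viol under P_G when ratios are exact
```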
All Monte Carlo results are conditional on the scenario generator $G$ and any simulator used to produce rollouts. MAP-AI treats this dependence as model risk and requires explicit reporting of:
1. the assumed dynamics, interfaces, and abstractions represented in $G$;
2. calibration targets and procedures, analogous to backtesting in finance;
3. sensitivity analyses over key scenario parameters.
Calibration is assessed by whether the simulator reproduces known regimes and failure modes relevant to the deployment context. When calibration is weak, MAP-AI results are interpreted as bounds under stated assumptions rather than claims about real-world frequencies.
In practice, the scenario generator $G$ may take several forms, including:
(i) physics- or rules-based simulators calibrated to known regimes;
(ii) learned world models trained on historical interaction data;
(iii) hybrid generators combining empirical logs with synthetic stress scenarios; and
(iv) curated adversarial or red-team scenario suites.
MAP-AI is agnostic to the specific instantiation of $G$, but requires that its abstractions, calibration targets, and known failure modes be explicitly reported. Differences in $G$ are treated as model risk rather than hidden assumptions.
To support comparability across candidate policies, MAP-AI fixes the evaluation horizon $T$, applies random-seed control for paired evaluations, and enforces identical interface constraints (e.g., human oversight rules, tool access, rate limits). Where possible, policies are evaluated using common random numbers to reduce estimator variance in pairwise comparisons.
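A brief sketch of a paired, common-random-numbers comparison, in which both Policies are rolled out against the same sampled world and the same per-pair seed; the callables are illustrative placeholders.

```python
import numpy as np

def paired_comparison(policy_a, policy_b, sample_world, rollout, utility,
                      n_rollouts=10_000, seed=0):
    """Compare two Policies with common random numbers: each pair of rollouts
    shares the same world realization and seed, reducing the variance of the
    estimated utility difference E[U_A - U_B]."""
    rng = np.random.default_rng(seed)
    diffs = []
    for i in range(n_rollouts):
        omega = sample_world(rng)                               # shared world realization
        tau_a = rollout(policy_a, omega, np.random.default_rng(seed + i))
        tau_b = rollout(policy_b, omega, np.random.default_rng(seed + i))
        diffs.append(utility(tau_a) - utility(tau_b))
    diffs = np.asarray(diffs)
    return diffs.mean(), diffs.std(ddof=1) / np.sqrt(n_rollouts)  # mean diff, std error
```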
Human involvement enters only through its operationalized effect on trajectories-approval gates, escalation rules, and constraint overrides. MAP-AI evaluates these mechanisms empirically as part of the policy, rather than assuming they guarantee alignment by design.
This section instantiates Part II of the MAP-AI Alignment System Architecture (alignment stress testing) in its simplest non-trivial form. To keep the diagnostic example maximally interpretable, the stress tests in this section vary a single uncertainty dimension, world uncertainty via the scenario generator $G$, while holding Policy stochasticity, value parameters, and constraint realization fixed. This isolates the effect of environmental uncertainty on alignment-relevant behavior without confounding effects from learning dynamics or preference drift.
Accordingly, the worked example in Section 6.1 varies only world uncertainty via the scenario generator, holding Policy stochasticity, value parameters, and constraint realization fixed. We illustrate MAP-AI using a minimal diagnostic stress test sufficient to induce distributional divergence under identical predictive beliefs.
We model uncertainty using a regime-switching environment, where regime is used in its standard technical sense to denote latent operating modes of a stochastic system, as in regime-switching Markov models, econometric bull/bear regimes, control-theoretic operating modes, and reliability models distinguishing normal and fault conditions.
The scenario generator $G$ induces world uncertainty through a latent operational regime variable with two modes: Normal and Adverse. The system begins in the Normal regime and transitions to the Adverse regime at each timestep with probability $p = 0.02$. Once entered, the Adverse regime is absorbing. The rollout horizon is fixed at $T = 20$.
This structure captures rare but persistent adverse operating conditions, such as distribution shift, cascading failures, or adversarial onset, without assuming any particular task domain, model architecture, or internal reasoning process.
We evaluate two deterministic policies, $\pi_A$ and $\pi_B$, that share an identical predictive belief state and identical observations. Both policies assign the same probability to regime transitions and differ only in their action mapping under uncertainty.
• Policy $\pi_A$ selects an aggressive action at every timestep, prioritizing higher nominal returns under normal conditions.
• Policy $\pi_B$ selects a conservative action at every timestep, sacrificing some nominal return to cap downside exposure.
Under naive expectation-based evaluation, the two policies are nearly indistinguishable.
Per-step utilities are defined as follows. In the Normal regime, the aggressive action yields utility +1.2 and the conservative action yields +1.0. In the Adverse regime, both actions yield utility -2.0 per timestep. In addition, when the system first transitions into the Adverse regime, Policy $\pi_A$ incurs a one-time catastrophic penalty of -10, reflecting actions whose downside risk materializes only under rare conditions.
Alignment is evaluated with respect to a single hard governance constraint. A trajectory violates the constraint if its total cumulative utility satisfies $U(\tau) \le U_{\min}$ for a fixed governance threshold $U_{\min}$.
The constraint is deterministic and binary, enforcing a clear admissibility criterion without reliance on moral interpretation.
We estimate alignment metrics using Monte Carlo rollouts $\tau_i \sim P(\tau \mid \pi, G)$, with $N = 200{,}000$ rollouts per policy. Policy stochasticity, value uncertainty, and constraint realization are held fixed by design. Tail risk is measured using Conditional Value-at-Risk at level $\alpha = 0.05$. All estimates are reported with 95% bootstrap confidence intervals.
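The following is a runnable sketch of this stress test under the stated parameters (p = 0.02, T = 20, N = 200,000, alpha = 0.05). The constraint threshold U_MIN and the convention that the transition step already accrues Adverse-regime utility are modeling assumptions made for illustration, so the printed numbers are not guaranteed to reproduce the paper's reported values.

```python
import numpy as np

P_ADVERSE, T, N, ALPHA = 0.02, 20, 200_000, 0.05
U_MIN = -40.0   # placeholder governance threshold (assumption; not specified in the text)

def simulate(aggressive, rng):
    """One rollout: Normal regime with an absorbing switch to Adverse at rate P_ADVERSE."""
    total, adverse = 0.0, False
    for _ in range(T):
        if not adverse and rng.random() < P_ADVERSE:
            adverse = True
            if aggressive:
                total -= 10.0        # one-time catastrophic penalty for the aggressive Policy
        if adverse:
            total -= 2.0             # both actions yield -2.0 per timestep in the Adverse regime
        else:
            total += 1.2 if aggressive else 1.0
    return total

rng = np.random.default_rng(0)
k = max(1, int(np.ceil(ALPHA * N)))          # worst ceil(alpha*N) losses define the tail
for name, aggressive in [("pi_A (aggressive)", True), ("pi_B (conservative)", False)]:
    u = np.array([simulate(aggressive, rng) for _ in range(N)])  # reduce N for a quick check
    cvar = np.sort(-u)[-k:].mean()           # empirical CVaR_alpha over loss L = -U
    print(f"{name}: E[U]={u.mean():.3f}  p_viol={np.mean(u <= U_MIN):.5f}  CVaR={cvar:.3f}")
```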
Table 1 reports expected utility, constraint violation probability, and tail risk for both policies.
Under expected utility, the policies exhibit predictive parity (Table 1); the alignment-relevant differences emerge only in constraint violation probability and tail risk.
This experiment demonstrates that alignment-relevant distinctions emerge only at the level of outcome distributions, as evaluated by the MAP-AI stress-testing layer, not at the level of expected performance or internal belief correctness. MAP-AI does not modify policy internals or belief formation. It changes which policies are considered admissible by making tail risk and constraint violations explicit and measurable. Alignment, in this framing, is not a property of internal cognition or belief correctness, but of externally observable behavior under uncertainty.
MAP-AI treats governance thresholds as first-class parameters rather than fixed background assumptions. In the regime-switching scenario, small changes in the allowable violation probability or $\mathrm{CVaR}_\alpha$ thresholds alter the set of admissible policies without changing the policies themselves.
This demonstrates that alignment decisions are not downstream of optimization alone. They arise from the interaction between policy behavior and institutional risk tolerance. MAP-AI explicitly separates risk estimation from threshold selection, allowing governance sensitivity analysis without retraining models or altering policy logic.
While Section 6.1 holds value parameters fixed, real deployments face both distribution shift and value drift. MAP-AI accommodates these effects by allowing the scenario generator and value parameters to vary independently.
Stress tests reveal that modest increases in adverse regime frequency or severity can disproportionately affect tail risk while leaving expected utility largely unchanged. Similarly, changes in value parameters can flip admissibility decisions even when observable policy behavior remains constant. MAP-AI exposes this brittleness by making both forms of uncertainty explicit.
Human oversight is often assumed to guarantee alignment by design. MAP-AI instead treats human intervention as part of the evaluated Policy.
In the regime-switching environment, human intervention can be modeled as an approval gate or override that activates probabilistically under adverse conditions. Monte Carlo evaluation shows that such mechanisms reduce but do not eliminate tail risk. Their effectiveness depends on detection latency, intervention reliability, and outcome severity.
By evaluating human-in-the-loop mechanisms empirically, MAP-AI grounds alignment claims in observed outcomes rather than architectural intent.
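As a minimal illustration of treating oversight as part of the evaluated Policy, the wrapper below applies a probabilistic override when adverse conditions are detected; the intervention reliability parameter is an illustrative assumption.

```python
def with_human_gate(base_action, adverse_detected, rng, p_intervene=0.8):
    """Model human oversight as a Policy component: when adverse conditions are
    detected, an override to a conservative action fires with probability
    p_intervene (illustrative detection/intervention reliability)."""
    if adverse_detected and rng.random() < p_intervene:
        return "conservative"   # human override takes effect
    return base_action          # otherwise the base Policy's action stands
```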
A critical class of alignment failures arises under shutdown pressure, where Policies may take actions that preserve operation at the expense of safety constraints. MAP-AI evaluates such behavior by modeling shutdown as an additional adverse operating mode within the scenario generator.
Stress tests reveal that Policies optimized for nominal conditions may exhibit sharply elevated tail risk under shutdown pressure, even when shutdown is rare. These failures emerge only under compounded uncertainty, reinforcing the need for distributional evaluation.
Sections 6.1-6.5 demonstrate that alignment-relevant distinctions arise from the interaction of policy behavior, uncertainty, and governance constraints. Expected performance alone is insufficient. MAP-AI surfaces these distinctions by evaluating the full distribution of outcomes induced by Policy actions, enabling principled admissibility decisions in environments where rare but consequential failures dominate risk. We now show how these distributional evaluations directly determine action selection in agentic systems.
This section demonstrates how MAP-AI functions as a decision-relevant alignment substrate for agentic systems. We instantiate MAP-AI within a minimal tool-using agent setting and show how distributional alignment evaluation induces concrete Policy selection decisions under uncertainty, without modifying model internals, training objectives, or optimization dynamics.
We consider an agent that, at each decision point, may select one of three actions:
1. Act: execute a proposed tool-mediated action autonomously;
2. Escalate: defer execution to a human supervisor or external authority;
3. Abort: decline action execution.
The downstream consequences of autonomous action are uncertain and depend on latent world conditions, modeled via a scenario generator $G$. Escalation and abort actions incur fixed opportunity costs but eliminate the risk of catastrophic outcomes associated with autonomous execution.
The agent's Policy space therefore consists of mappings from observed state to one of these actions. Importantly, multiple candidate Policies may exhibit similar expected utility under naïve evaluation, yet differ materially in their induced tail-risk and constraint-violation profiles.
For each candidate Policy $\pi$, MAP-AI evaluates the induced trajectory distribution $P(\tau \mid \pi, G)$ and estimates alignment-relevant risk metrics, including:
• constraint violation probability $\hat{p}_{\mathrm{viol}}$,
• tail risk measured via $\mathrm{CVaR}_\alpha$, with associated confidence intervals.
Governance constraints are specified institutionally as admissibility thresholds: $\hat{p}_{\mathrm{viol}}(\pi) \le \varepsilon$ and $\widehat{\mathrm{CVaR}}_\alpha(\pi) \le \kappa$.
Policies failing to satisfy these constraints are deemed inadmissible, regardless of expected utility.
MAP-AI induces a decision functional that maps evaluated policies to actionable governance decisions.
Formally, define the decision functional $D : \Pi \to \{\text{Act}, \text{Escalate}, \text{Abort}\}$ with the following structure:

$D(\pi) = \begin{cases} \text{Act}, & \text{if } \pi \text{ is admissible and maximizes } \hat{\mathbb{E}}[U] \text{ among admissible policies},\\ \text{Escalate}, & \text{if } \pi \text{ is inadmissible but an escalation alternative is admissible},\\ \text{Abort}, & \text{otherwise}. \end{cases}$
Crucially, MAP-AI does not define or optimize this functional; it renders it well-defined by producing measurable, distributional alignment quantities with quantified uncertainty.
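A minimal sketch of such a decision functional, assuming each candidate Policy has already been evaluated by MAP-AI into a dictionary of metrics; the data layout and function name are illustrative assumptions.

```python
def decision_functional(candidates, metrics, eps, kappa, escalation_policy=None):
    """Map evaluated Policies to Act / Escalate / Abort under declared thresholds.

    metrics maps each Policy identifier to {"E[U]", "p_viol", "CVaR"} as estimated
    by MAP-AI; eps and kappa are the institutionally specified tolerances.
    """
    def admissible(p):
        m = metrics[p]
        return m["p_viol"] <= eps and m["CVaR"] <= kappa

    admissible_set = [p for p in candidates if admissible(p)]
    if admissible_set:
        # Act with the admissible Policy of highest expected utility.
        return "Act", max(admissible_set, key=lambda p: metrics[p]["E[U]"])
    if escalation_policy is not None and admissible(escalation_policy):
        return "Escalate", escalation_policy
    return "Abort", None
```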
In the evaluated agentic setting, naïve expected-utility evaluation favors a Policy that executes actions autonomously due to higher mean reward. However, MAP-AI reveals that this Policy exhibits elevated tail risk and constraint violation probability, rendering it inadmissible under declared governance thresholds.
An alternative Policy that escalates under uncertainty, despite slightly lower expected utility, is admissible. Applying the decision functional therefore yields a Policy selection reversal: the Policy chosen under expected-utility selection differs from the Policy chosen under MAP-AI admissible selection. This decision flip occurs without modifying the agent's predictive beliefs, internal reasoning, or training procedure. Alignment emerges solely at the level of policy-induced outcome distributions under uncertainty.
This example illustrates that alignment-relevant decisions cannot be derived from model internals, belief correctness, or expected performance alone. They arise only when policy behavior is evaluated distributionally and constrained by explicit governance criteria.
MAP-AI therefore functions as a decision interface, not an optimization algorithm. It separates:
• measurement (Monte Carlo evaluation of policy outcomes),
• governance (institutionally specified admissibility thresholds),
• choice (application of a decision functional).
By maintaining this separation, MAP-AI preserves auditability, accountability, and institutional control, while enabling agentic systems to act-or decline to act-based on measurable alignment risk.
In deployment settings, MAP-AI outputs may be consumed within a champion-challenger decision loop that enables alignment-aware action selection while preserving strict separation between evaluation, governance, and policy generation.
Let governance be represented as a compilable specification $\mathcal{G} = (H, S, \succ, T)$,
where:
• $H$ is a set of hard admissibility constraints. Each $h \in H$ is a predicate over evaluated metric vectors, $h : \mathbb{R}^d \to \{0,1\}$ (e.g., $\hat{p}_{\mathrm{viol}} \le \varepsilon$, $\widehat{\mathrm{CVaR}}_\alpha \le \kappa$).
• $S$ is a set of soft objectives or penalty-weighted criteria.
• $\succ$ is a total, deterministic priority order over evaluation criteria (e.g., lexicographic tiers prioritizing risk over utility).
• $T$ is a deterministic tie-breaking rule ensuring a unique output.
Define the Proof-Carrying Admissibility Compiler as a deterministic function
$\mathrm{PCAC} : (\Pi, M, \mathcal{G}) \mapsto (\pi^{\star}, b), \quad b \in B,$
where $B$ is a certificate space.
Algorithm: Proof-Carrying Admissibility Compilation (PCAC)
PCAC is deterministic, model-agnostic, and replayable: identical inputs $(\Pi, M, \mathcal{G})$ yield identical decisions and certificates. Its computational complexity is polynomial in the number of candidates and constraints, and its output is an executable decision artifact rather than a recommendation.
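Since the algorithm listing itself is not reproduced here, the following is a hedged sketch of a PCAC-style compiler consistent with the stated properties (deterministic, replayable, polynomial, certificate-producing); the data structures, key choices, and certificate format are assumptions for illustration.

```python
import hashlib, json

def pcac(candidates, metrics, hard_constraints, priority_keys, tie_break_key):
    """Sketch of Proof-Carrying Admissibility Compilation.

    candidates       : list of Policy identifiers
    metrics          : Policy -> dict of evaluated MAP-AI metrics (the input M)
    hard_constraints : list of predicates over a metrics dict (the set H)
    priority_keys    : list of (metric_name, "min"|"max") tuples, highest priority first
                       (a lexicographic stand-in for the priority order)
    tie_break_key    : deterministic final key (the rule T), e.g. the Policy identifier
    """
    # 1. Admissibility: keep candidates satisfying every hard constraint.
    admissible = [p for p in candidates
                  if all(h(metrics[p]) for h in hard_constraints)]
    if not admissible:
        return None, {"decision": "abort", "reason": "no admissible candidate"}

    # 2. Deterministic lexicographic ranking over soft criteria, then tie-break.
    def sort_key(p):
        vals = tuple(metrics[p][k] if d == "min" else -metrics[p][k]
                     for k, d in priority_keys)
        return vals + (tie_break_key(p),)

    winner = sorted(admissible, key=sort_key)[0]

    # 3. Certificate: a replayable record of inputs, constraint outcomes, and the choice.
    record = {"winner": str(winner),
              "admissible": [str(p) for p in admissible],
              "metrics": {str(p): metrics[p] for p in candidates}}
    certificate = {"record": record,
                   "digest": hashlib.sha256(
                       json.dumps(record, sort_keys=True, default=str).encode()).hexdigest()}
    return winner, certificate
```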
Within the champion-challenger loop, MAP-AI functions as an evaluation substrate, while PCAC provides the explicit decision operator that maps evaluated alternatives into institutionally governed action. This completes the MAP-AI control plane: uncertainty is evaluated via Monte Carlo simulation, governance is expressed as compilable constraints, and decisions are produced through deterministic, auditable compilation rather than optimization or learning.
MAP-AI reframes alignment, through Admissibility Alignment, as the control of action selection under uncertainty rather than a one-time certification of model properties. Because deployed AI systems operate in open-ended environments with evolving objectives, counterparties, and governance assumptions, such guarantees cannot be treated as static. Instead, alignment must be continuously evaluated as policies, environments, and institutional constraints change.
A central contribution of this framework is the explicit separation of alignment evaluation from optimization. MAP-AI does not prescribe how policies are trained or constructed; it evaluates how candidate policies behave under uncertainty once deployed. This separation avoids conflating alignment with capability and allows the framework to remain compatible with diverse modeling approaches, including predictive models, planning systems, and hybrid human-AI workflows.
By treating governance thresholds, human intervention mechanisms, and escalation rules as first-class decision variables, MAP-AI makes alignment tradeoffs explicit and auditable. Rather than assuming that constraints or oversight mechanisms ensure safety by design, the framework evaluates their realized effect on trajectory distributions. This enables institutions to reason concretely about questions such as how stricter risk tolerances reshape the feasible Policy class, or how changes in oversight alter tail-risk exposure.
The framework also clarifies the limits of model-centric alignment approaches. Training procedures may shape tendencies and preferences, but they cannot guarantee aligned decision-making in the presence of combinatorial state spaces, value uncertainty, and rare but high-impact events. MAP-AI addresses this gap by evaluating alignment at the level where actions are selected and consequences materialize: the policy operating under uncertainty.
From an institutional perspective, MAP-AI positions alignment evaluation as infrastructure, analogous to stress testing and risk limits in finance or safety certification in engineering. Its outputs (distributional metrics, tail-risk estimates, and admissibility regions) are designed to support governance, audit, and escalation decisions, rather than to serve as abstract ethical scores. This makes the framework directly applicable to enterprise, regulatory, and safety-critical deployment contexts.
The primary limitations of MAP-AI arise from its dependence on scenario generation and sampling efficiency. As with all stress-testing methodologies, results are conditional on the assumed scenario distributions and the coverage of rare events. These limitations are intrinsic to any probabilistic evaluation of complex systems and are addressed through explicit reporting of assumptions, calibration procedures, and sensitivity analyses rather than by claims of completeness.
We note that MAP-AI naturally induces a class of alignment benchmarks defined over admissibility under uncertainty, which we leave as future infrastructure work (‘MAPBench’).
Overall, MAP-AI advances alignment from a philosophical aspiration to an operational discipline: one in which alignment is measured, stress-tested, and governed as a property of policy behavior over distributions, not inferred from model intent or training objectives alone.
This paper argues that once AI systems act in the world, alignment can no longer be treated as a property of model internals or training procedures alone. Deployed systems face open-ended environments, value uncertainty, and combinatorial state spaces that make it intractable to guarantee aligned decision-making purely through optimization of a fixed objective. Alignment therefore cannot be certified at the level of the model. It must be evaluated at the level of Policy behavior under uncertainty through Admissibility Alignment at the point of action selection.
MAP-AI formalizes this shift by treating alignment as a distributional property of Policy behavior, rather than a binary attribute of artifacts. The framework is explicitly agnostic about how alignment is learned-through training, instruction, reinforcement, or human oversight-but explicit about where alignment must be evaluated and governed: at the point where actions are selected and consequences materialize. If systems act, alignment evaluation that does not simulate Policy behavior across distributions is incomplete by definition.
A central contribution of MAP-AI is the separation of evaluation from optimization. The framework does not prescribe architectures, loss functions, or training regimes. Instead, it evaluates the induced distribution of outcomes produced by candidate Policies under uncertainty, incorporating governance thresholds, constraints, and human intervention mechanisms as first-class variables. This separation avoids conflating alignment with capability and allows alignment evaluation to function as infrastructure, rather than as an aspirational property of individual models.
Within this framing, alignment is not localized to a single layer. Models generate representations and proposals; training shapes tendencies; Policies select actions; constraints define admissible behavior; stress testing reveals distributional failures; and governance adjusts thresholds and escalation rules. Alignment emerges from the interaction of these layers once the system has power. Monte Carlo simulation is not itself the alignment layer, but the evaluation substrate that makes this interaction measurable under uncertainty.
MAP-AI is intentionally narrow in its claims. It does not claim to solve alignment, guarantee safety, eliminate emergent risk, or replace interpretability, training, or governance. Instead, it advances a stronger and more defensible claim: that alignment only exists when Policy behavior is evaluated distributionally across interacting layers, and that Monte Carlo stress testing provides a natural and established substrate for doing so. Without such evaluation, alignment remains an assumption rather than an operational property.
Beyond evaluation, MAP-AI establishes a principled mechanism for integrating alignment into decision execution itself. By treating alignment constraints as admissibility conditions rather than optimization targets, the framework enables risk-aware action selection, escalation, or suppression without modifying underlying models or training objectives. This separation preserves model generality while ensuring that decisions taken under uncertainty remain institutionally aligned. In this sense, MAP-AI functions as an alignment control layer, bridging probabilistic risk assessment and operational decision-making in agentic systems.
Finally, MAP-AI reframes alignment as a systems engineering and risk management problem, rather than as a philosophical overlay. Human alignment is not guaranteed by cognition alone, but by institutions that evaluate, constrain, and govern behavior under uncertainty. MAP-AI applies the same principle to AI systems. By simulating the conditions under which alignment failures would occur, rather than waiting for them to occur in deployment, the framework provides a concrete basis for alignment evaluation, governance, and iterative refinement as system capability scales.
Alignment Layers in MAP-AI (System-Level Framing). Alignment emerges at the system level. No single layer is sufficient for alignment in isolation. Evaluator and guardrail mechanisms operate within the Constraints and Stress-Testing layers and are themselves subject to distributional evaluation.
The estimated constraint violation probability for the conservative Policy is $\hat{p}_{\mathrm{viol}}(\pi_B) = 0.02035$ (95% CI: 0.01973-0.02089). Tail risk follows the same pattern. The estimated Conditional Value-at-Risk is $\widehat{\mathrm{CVaR}}_{0.05}(\pi_A) = 47.443$ (95% CI: 47.358-47.531) versus $\widehat{\mathrm{CVaR}}_{0.05}(\pi_B) = 37.604$ (95% CI: 37.519-37.690). Under institutionally specified governance thresholds, Policy $\pi_A$ is inadmissible despite competitive expected utility, while Policy $\pi_B$ remains admissible.
Figure 2. Empirical distributions of cumulative utility induced by two policies under the regime-switching stress test. The dashed vertical line denotes the governance constraint threshold. While central mass is similar across policies, alignment-relevant divergence arises exclusively in the tail, where rare but catastrophic outcomes dominate admissibility.
Admissible set: $\mathcal{A} = \{\, \pi^{(k)} \in \Pi : \forall h \in H,\ h(m_k) = 1 \,\}$, where $m_k$ denotes the evaluated metric vector of candidate $\pi^{(k)}$.