A Decision-Theoretic Approach for Managing Misalignment


📝 Abstract

When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to how to determine, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent’s value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff precisely, accounting for a principal’s uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment. An agent’s superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.

📄 Content

Daniel A. Herrmann† (University of North Carolina at Chapel Hill, danher@unc.edu)
Abinav Chari (Georgia Institute of Technology, abinav.ram.chari@gmail.com)
Isabelle Qian (University of California, Berkeley, isabelle.qian@gmail.com)
Sree Sharvesh (Amrita Vishwa Vidyapeetham, sssreesharvesh@gmail.com)
B. A. Levinstein† (Anthropic; University of Illinois at Urbana-Champaign, benlevin@illinois.edu)

1 Introduction

When should we delegate decisions to AI systems?
This is one of the key questions in AI safety, as we must decide when we expect things to go better for ourselves and humanity by outsourcing decision-making to AI. An intuitive answer implicit in much of the value alignment literature is: when we’ve successfully aligned their values with our own. But this answer overlooks two critical facts about real-world delegation.

First, we routinely delegate to agents whose values differ from ours. Many people delegate investment decisions to a financial advisor who charges fees they’d rather not pay. We delegate medical decisions to doctors who spend less time thinking about our specific case than we’d wish them to. These delegations make sense because the agents’ superior competence outweighs their value misalignment. Second, we make these delegation decisions under uncertainty. We cannot be certain about either an agent’s values or their capability. Any practical framework for AI delegation must account for this uncertainty and help us manage it.

These observations reveal a gap in the value alignment literature. Significant effort has gone into developing alignment techniques such as reinforcement learning from human feedback and constitutional AI. These techniques are aimed at answering the question: how do we nudge AI values in the direction of our own? Less attention has been paid to the complementary question: how do we determine when an AI system is sufficiently aligned to warrant delegation? This is not simply a matter of measuring how well we shaped the values of a system. It requires weighing the misalignment against the capability under uncertainty.

†Equal contribution. Second Conference of the International Association for Safe and Ethical Artificial Intelligence (IASEAI’26). arXiv:2512.15584v2 [cs.AI] 21 Dec 2025.
In this article we provide a formal framework that makes this tradeoff precise by modeling three key factors: the agent’s epistemic accuracy (how well-calibrated its beliefs are), its value alignment (how closely its objectives match ours), and its reach (what kinds of problems it can access and solve). Our framework treats delegation as a decision problem where a principal must choose whether to delegate based on her beliefs about all three factors.

Our analysis yields a sharp contrast between two types of delegation. Universal delegation—being willing to delegate any decision problem to the agent—requires an extremely demanding condition: the principal must totally trust the agent’s beliefs when values are aligned, and have near-perfect alignment when values differ. This helps explain why full automation might be inadvisable even for capable AI systems. However, context-specific delegation—delegating particular types of problems or accepting delegation in expectation—can be rational even with moderate misalignment, provided the agent’s superior epistemic accuracy or expanded reach generates sufficient gains.

We consider three important results. First, as shown in Dorst et al. (2021), when the principal and agent share values and face the same problems, universal delegation is equivalent to a
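The tradeoff between accuracy, alignment, and reach can be sketched numerically. The following toy model is an illustrative assumption, not the paper’s formal framework: all state names, payoffs, and credences are invented. A principal either acts on her own beliefs with her own menu of acts, or delegates to an agent who chooses by its own (better-calibrated) beliefs and (misaligned) values over a larger menu; the principal then scores the delegated act by her own utilities and credences, ex ante, averaging over her uncertainty about how misaligned the agent is.

```python
import numpy as np

# Toy sketch of the delegation tradeoff (all names and numbers are
# illustrative assumptions, not taken from the paper's formal model).

# Two states of the world; credences over them.
p_principal = np.array([0.5, 0.5])   # principal's less accurate beliefs
p_agent     = np.array([0.8, 0.2])   # agent's better-calibrated beliefs

# u[act] = payoff in each state, by the principal's lights.
# "hedge" is only within the agent's reach.
u = {
    "hold":   np.array([1.0, 1.0]),
    "invest": np.array([2.0, -2.0]),
    "hedge":  np.array([2.0, 0.5]),
}
principal_menu = ["hold", "invest"]
agent_menu     = ["hold", "invest", "hedge"]

def best_act(menu, beliefs, utility_of):
    """Act in the menu maximizing expected utility under the given beliefs."""
    return max(menu, key=lambda a: float(beliefs @ utility_of(a)))

# Acting alone: the principal's beliefs, values, and smaller menu.
own_act = best_act(principal_menu, p_principal, lambda a: u[a])
own_ev = float(p_principal @ u[own_act])

# Misaligned agent types: each adds a private bonus (a "fee") to some acts.
def agent_utility(bonus_on_active, bonus_on_invest):
    def ua(act):
        extra = bonus_on_active if act != "hold" else 0.0
        extra += bonus_on_invest if act == "invest" else 0.0
        return u[act] + extra
    return ua

# The principal is uncertain which type of agent she faces.
types = [
    (0.9, agent_utility(0.4, 0.0)),  # mildly misaligned: small fee on trades
    (0.1, agent_utility(0.4, 2.0)),  # severely misaligned: strongly favors "invest"
]

# Ex ante value of delegating: each type picks by ITS beliefs and values,
# but the outcome is scored by the principal's utilities and credences.
delegation_ev = 0.0
for prob, ua in types:
    chosen = best_act(agent_menu, p_agent, ua)
    delegation_ev += prob * float(p_principal @ u[chosen])

print(f"act alone: {own_act!r}, EV = {own_ev}")        # 'hold', EV = 1.0
print(f"delegate:  EV = {delegation_ev}")              # EV = 1.125
```

In this toy run, delegation beats acting alone (1.125 > 1.0) even though the agent is misaligned with certainty and possibly severely so: the agent’s sharper beliefs plus its exclusive access to "hedge" generate enough gain to outweigh the risk of the severely misaligned type choosing "invest".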

