AI Deception: Risks, Dynamics, and Controls

Reading time: 6 minutes
...

📝 Original Info

  • Title: AI Deception: Risks, Dynamics, and Controls
  • ArXiv ID: 2511.22619
  • Date: 2025-11-27
  • Authors: Boyuan Chen, Sitong Fang, Jiaming Ji, Yanxu Zhu, Pengcheng Wen, Jinzhou Wu, Yingshui Tan, Boren Zheng, Mengying Yuan, Wenqi Chen, Donghai Hong, Alex Qiu, Xin Chen, Jiayi Zhou, Kaile Wang, Juntao Dai, Borong Zhang, Tianzhuo Yang, Saad Siddiqui, Isabella Duan, Yawen Duan, Brian Tse, Jen-Tse Huang, Kun Wang, Baihui Zheng, Jiaheng Liu, Jian Yang, Yiming Li, Wenting Chen, Dongrui Liu, Lukas Vierling, Zhiheng Xi, Haobo Fu, Wenxuan Wang, Jitao Sang, Zhengyan Shi, Chi-Min Chan, Eugenie Shi, Simin Li, Juncheng Li, Jian Yang, Wei Ji, Dong Li, Jinglin Yang, Jun Song, Yinpeng Dong, Jie Fu, Bo Zheng, Min Yang, Yike Guo, Philip Torr, Robert Trager, Yi Zeng, Zhongyuan Wang, Yaodong Yang, Tiejun Huang, Ya-Qin Zhang, Hongjiang Zhang, Andrew Yao

📝 Abstract

As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern to an empirically demonstrated risk across language models, AI agents, and emerging frontier systems. This project provides a comprehensive and up-to-date overview of the AI deception field, covering its core concepts, methodologies, genesis, and potential mitigations. First, we identify a formal definition of AI deception, grounded in signaling theory from studies of animal deception. We then review existing empirical studies and associated risks, highlighting deception as a sociotechnical safety challenge. We organize the landscape of AI deception research as a deception cycle, consisting of two key components: deception emergence and deception treatment. Deception emergence reveals the mechanisms underlying AI deception: systems with sufficient capability and incentive potential inevitably engage in deceptive behaviors when triggered by external conditions. Deception treatment, in turn, focuses on detecting and addressing such behaviors. On deception emergence, we analyze incentive foundations across three hierarchical levels and identify three essential capability preconditions required for deception. We further examine contextual triggers, including supervision gaps, distributional shifts, and environmental pressures. On deception treatment, we survey detection methods, covering benchmarks and evaluation protocols in static and interactive settings. Building on the three core factors of deception emergence, we outline potential mitigation strategies and propose auditing approaches that integrate technical, community, and governance efforts to address sociotechnical challenges and future AI risks. To support ongoing work in this area, we release a living resource at www.deceptionsurvey.com.

📄 Full Content

AI Deception: Risks, Dynamics, and Controls

Project Team (the full list of Senior Advisors, Project Leaders, and Core Contributors is detailed on page 5 of the paper). Contact: deceptionsurvey@gmail.com, www.deceptionsurvey.com

Abstract | As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern to an empirically demonstrated risk across language models, AI agents, and emerging frontier systems. This survey provides a comprehensive and up-to-date overview of the AI deception field, covering its core concepts, methodologies, genesis, and potential mitigations. First, we identify a formal definition of AI deception, grounded in signaling theory from studies of animal deception. We then review existing empirical studies and associated risks, highlighting deception as a sociotechnical safety challenge. We organize the landscape of AI deception research as a deception cycle, consisting of two key components: deception emergence and deception treatment. Deception emergence reveals the mechanisms underlying AI deception: systems with sufficient capability and incentive potential inevitably engage in deceptive behaviors when triggered by external conditions. Deception treatment, in turn, focuses on detecting and addressing such behaviors. On deception emergence, we analyze incentive foundations across three hierarchical levels and identify three essential capability preconditions, namely perception, planning, and performing, required for deception. We further examine contextual triggers, including supervision gaps, distributional shifts, and environmental pressures. On deception treatment, we survey detection methods spanning both external and internal analyses, covering benchmarks and evaluation protocols in static and interactive settings. Building on the three core factors of deception emergence, we outline potential mitigation strategies and propose auditing approaches that integrate technical, community, and governance efforts to address sociotechnical challenges and future AI risks. The survey concludes with key challenges and future directions in AI deception research, aiming to provide a comprehensive and insightful review of the field. To support ongoing work in this area, we release a living resource at www.deceptionsurvey.com, continuously capturing the latest developments and curating collections of papers, blog posts, and other resources.

One may smile, and smile, and be a villain. — William Shakespeare

Executive Summary

AI systems are increasingly capable, interactive, and embedded in sensitive workflows. With these advances, the possibility of deception, where systems cause humans or other agents to hold false beliefs that benefit the system, has moved from speculation to empirical reality. This survey provides a comprehensive mapping of the AI deception field, integrating definitions, empirical taxonomy, risks, causal mechanisms, and treatments into a unified framework.

Definition of AI Deception

Although deception is conventionally associated with intent, we characterize AI deception through a functional lens, referring to behaviors that mislead humans or other AI systems and yield outcomes aligned with the system's objectives.
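Read as an operational test rather than a claim about intent, this functional definition lends itself to a minimal sketch. The Python below is our illustration only; every name in it (Interaction, is_deceptive, the utility fields) is invented for this example and does not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One sender -> receiver exchange, in the survey's signaling-theory terms."""
    true_state: str         # the actual state of the world
    induced_belief: str     # belief the receiver forms from the sender's signal
    sender_utility: float   # sender's payoff given the receiver's response
    honest_baseline: float  # counterfactual payoff had the sender signaled truthfully

def is_deceptive(x: Interaction) -> bool:
    """Functional criterion: a false belief is induced AND the sender benefits
    relative to honest signaling. No assumption about intent is made."""
    false_belief = x.induced_belief != x.true_state
    sender_gains = x.sender_utility > x.honest_baseline
    return false_belief and sender_gains

# Hypothetical example: a model reports "tests passing" to get its patch approved.
claim = Interaction(true_state="tests failing", induced_belief="tests passing",
                    sender_utility=1.0, honest_baseline=0.0)
assert is_deceptive(claim)
```

Note that the check compares against a counterfactual honest signal rather than inspecting the model's internal state, which is exactly what makes the definition functional.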
Thus, AI deception can be understood as a signal-based causal process in which a model, acting as the sender, produces signals that induce the receiver to form false beliefs and respond rationally on the basis of those beliefs, thereby yielding actual or potential benefits for the sender. Its formal elements include the sender and the receiver, the signals and subsequent actions, the resulting utility, and the temporal dimension. In multi-step interactions, if the trajectory of the receiver's beliefs persistently deviates from reality in ways that enhance the sender's utility, the behavior constitutes sustained deception. This formulation avoids presuppositions about the model's intent and instead relies on a causal criterion: whether the signals systematically induce false beliefs, alter the receiver's behavior, and advantage the sender.

Figure 1 | The Entanglement of Intelligence and Deception. (Axes: capability scaling vs. deception scaling.) (1) The Möbius Lock: Contrary to the view that capability and safety are opposites, advanced reasoning and deception actually exist on the same Möbius surface. They are fundamentally linked; as AI capabilities grow, deception becomes deeply rooted in the system. It is impossible to remove it without damaging the model's core intelligence. (2) The Shadow of Intelligence: Deception is not a bug or error, but an intrinsic companion of advanced intelligence. As models expand their boundaries in complex reasoning and intent understanding, the risk space for strategic deception exhibits non-linear, exponential growth. (3) The Cyclic Dilemma: Mitigation strategi
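The sustained-deception condition above, for multi-step interactions, can be sketched the same way. As before, the code and its names are our own operationalization; in particular, the survey states "persistently deviates" qualitatively, so the 0.8 fraction below is an assumed threshold, not the paper's number.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One step of a multi-step interaction (illustrative names, ours)."""
    true_state: str         # reality at this step
    receiver_belief: str    # receiver's belief after the sender's signal
    sender_utility: float   # sender's payoff at this step
    honest_baseline: float  # counterfactual payoff under honest signaling

def is_sustained_deception(trajectory: list[Step], min_fraction: float = 0.8) -> bool:
    """Sustained deception: the receiver's belief trajectory deviates from
    reality on most steps, and those deviations enhance the sender's utility."""
    if not trajectory:
        return False
    deviating = [s for s in trajectory if s.receiver_belief != s.true_state]
    persistent = len(deviating) / len(trajectory) >= min_fraction
    beneficial = all(s.sender_utility > s.honest_baseline for s in deviating)
    return bool(deviating) and persistent and beneficial
```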

…(Full text truncated)…

📸 Image Gallery

adversarial_pressure.png · capability_preconditions.png · contextual_triggers.png · deception_cycle.png · deception_mitigation.png · distributional_shift.png · empirical_studies.png · evaluation_gap.png · incentive_foundations.png · instructional_pressure.png · multi_agent.png · observation_gap.png · oversight_gap.png · topology_risk.png (each image is also available in .webp format)

Reference

This content is AI-processed from ArXiv data.
