The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems

Reading time: 5 minute
...

📝 Original Info

  • Title: The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems
  • ArXiv ID: 2512.05449
  • Date: 2025-12-05
  • Authors: Robert Yang

📝 Abstract

Large language models display a peculiar form of inconsistency: they "know" the correct answer but fail to act on it. In human philosophy, this tension between global judgment and local impulse is called akrasia, or weakness of will. We propose akrasia as a foundational concept for analyzing inconsistency and goal drift in agentic AI systems. To operationalize it, we introduce a preliminary version of the Akrasia Benchmark, currently a structured set of prompting conditions (Baseline [B], Synonym [S], Temporal [T], and Temptation [X]) that measures when a model's local response contradicts its own prior commitments. The benchmark enables quantitative comparison of "self-control" across model families, decoding strategies, and temptation types. Beyond single-model evaluation, we outline how micro-level akrasia may compound into macro-level instability in multi-agent systems that may be interpreted as "scheming" or deliberate misalignment. By reframing inconsistency as weakness of will, this work connects agentic behavior to classical theories of agency and provides an empirical bridge between philosophy, psychology, and the emerging science of agentic AI.

💡 Deep Analysis

Figure 1

📄 Full Content

The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems Robert Yang 1 1Independent Researcher San Jose, CA 95129 USA bobyang9@alumni.stanford.edu Abstract Large language models display a peculiar form of incon- sistency: they ”know” the correct answer but fail to act on it. In human philosophy, this tension between global judg- ment and local impulse is called akrasia, or weakness of will. We propose akrasia as a foundational concept for analyz- ing inconsistency and goal drift in agentic AI systems. To operationalize it, we introduce a preliminary version of the Akrasia Benchmark, currently a structured set of prompting conditions (Baseline [B], Synonym [S], Temporal [T], and Temptation [X]) that measures when a model’s local response contradicts its own prior commitments. The benchmark en- ables quantitative comparison of ”self-control” across model families, decoding strategies, and temptation types. Beyond single-model evaluation, we outline how micro-level akra- sia may compound into macro-level instability in multi-agent systems that may be interpreted as ”scheming” or deliber- ate misalignment. By reframing inconsistency as weakness of will, this work connects agentic behavior to classical theories of agency and provides an empirical bridge between philoso- phy, psychology, and the emerging science of agentic AI. Introduction What is ”scheming”? And how do we know if someone (or something) is performing the action of ”scheming”, and not some other, less complex and more fundamental pattern of behavior? Today, we are faced with the central challenge in AI safety: ensuring that increasingly autonomous, agentic sys- tems behave reliably and in accordance with human- specified goals. The potential for unexpected, harmful be- havior is a primary concern. A dominant mental model for this failure is ”scheming” or ”deceptive alignment”, as ob- served in the work by Apollo Research and OpenAI (Ope- nAI 2025). In this view, the AI is a rational agent with a stable, hidden objective that conflicts with its stated objec- tive. This is not an average case, rather the ultimate, critical failure mode, a worst-case scenario; it is the observation of and the attribution of a high degree of coherent, long-term planning and stable hidden intent to current systems. However, system failures are usually a conflation of mul- tiple factors compounding over time (Perrow 1999). We ob- serve the final result and assign attributions, but sometimes the micro-failures embedded within seemingly performant components (Reason 1997) are the latent drivers of systemic collapse (Cook 1998). Hence, it may be more valuable to investigate the seed conditions of failure – the subtle mech- anisms that give rise to systemic breakdown – rather than merely diagnosing the event of collapse itself (Dekker 2011; Rasmussen 1997). One such mechanism is inconsistency: where models of- ten quote-and-quote ”know” the right thing to do in a gen- eral sense but fail to apply that knowledge in a specific, local context (Strotz 1955; Ainslie 1992). The effect is magnified when the model is placed under duress, i.e. in double-bind situations, where competing imperatives fracture its internal consistency. Like a person acting against their better judg- ment under pressure, the model defaults to patterns most deeply ingrained in its training rather than to its stated in- tent (Simon 1955). This pattern of fractured behavior can be framed through the classical philosophical concept of akrasia, or weakness of will (Aristotle 2009; Plato 1992). In Aristotle’s formula- tion, akrasia denotes the act of choosing a course of action despite judging that another course is better – a misalign- ment between one’s global judgment and a local impulse. Davidson goes further in characterizing this as an override of a judgment that is for the best ”all things considered”, due to immediate urges (Davidson 1969). Transposed to ar- tificial agents, akrasia captures how a model may ”know” the correct or intended response in a global, deliberative sense, yet fail to enact it when faced with an immediate, context- specific cue. Unlike explanations that invoke hidden goals or deliberate deception, the akratic framing treats such failures as lapses of self-control rather than signs of malevolent intent. It por- trays inconsistency not as a strategic choice but as an invol- untary epistemic error, a breakdown in the model’s ability to preserve internal coherence between the beliefs it reports and what it subsequently generates. In this view, ensuring that a model remains epistemically stable (that its local pre- dictions remain faithful to its global understanding) becomes central to maintaining behavioral alignment. We propose akrasia as a foundational and measurable con- cept for analyzing inconsistency and goal drift in agentic AI systems. To begin to operationalize this, we introduce the Akrasia Benchmark, which systematically measures when a model’s lo

📸 Image Gallery

model_size.png

Reference

This content is AI-processed based on open access ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut