Title: The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems
ArXiv ID: 2512.05449
Date: 2025-12-05
Authors: Robert Yang
📝 Abstract
Large language models display a peculiar form of inconsistency: they "know" the correct answer but fail to act on it. In human philosophy, this tension between global judgment and local impulse is called akrasia, or weakness of will. We propose akrasia as a foundational concept for analyzing inconsistency and goal drift in agentic AI systems. To operationalize it, we introduce a preliminary version of the Akrasia Benchmark, currently a structured set of prompting conditions (Baseline [B], Synonym [S], Temporal [T], and Temptation [X]) that measures when a model's local response contradicts its own prior commitments. The benchmark enables quantitative comparison of "self-control" across model families, decoding strategies, and temptation types. Beyond single-model evaluation, we outline how micro-level akrasia may compound into macro-level instability in multi-agent systems that may be interpreted as "scheming" or deliberate misalignment. By reframing inconsistency as weakness of will, this work connects agentic behavior to classical theories of agency and provides an empirical bridge between philosophy, psychology, and the emerging science of agentic AI.
📄 Full Content
The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems
Robert Yang
Independent Researcher, San Jose, CA 95129 USA
bobyang9@alumni.stanford.edu
Introduction
What is "scheming"? And how do we know whether someone (or something) is actually scheming, rather than exhibiting some other, less complex and more fundamental pattern of behavior?
Today, we are faced with the central challenge in AI safety: ensuring that increasingly autonomous, agentic systems behave reliably and in accordance with human-specified goals. The potential for unexpected, harmful behavior is a primary concern. A dominant mental model for this failure is "scheming" or "deceptive alignment", as observed in the work by Apollo Research and OpenAI (OpenAI 2025). In this view, the AI is a rational agent with a stable, hidden objective that conflicts with its stated objective. This is not an average-case failure but the ultimate, critical one, a worst-case scenario; it attributes a high degree of coherent, long-term planning and stable hidden intent to current systems.
However, system failures are usually the product of multiple factors compounding over time (Perrow 1999). We observe the final result and assign attributions, but sometimes the micro-failures embedded within seemingly performant components (Reason 1997) are the latent drivers of systemic collapse (Cook 1998). Hence, it may be more valuable to investigate the seed conditions of failure, the subtle mechanisms that give rise to systemic breakdown, rather than merely diagnosing the event of collapse itself (Dekker 2011; Rasmussen 1997).
One such mechanism is inconsistency: models often "know" the right thing to do in a general sense but fail to apply that knowledge in a specific, local context (Strotz 1955; Ainslie 1992). The effect is magnified when the model is placed under duress, i.e., in double-bind situations where competing imperatives fracture its internal consistency. Like a person acting against their better judgment under pressure, the model defaults to patterns most deeply ingrained in its training rather than to its stated intent (Simon 1955).
This pattern of fractured behavior can be framed through the classical philosophical concept of akrasia, or weakness of will (Aristotle 2009; Plato 1992). In Aristotle's formulation, akrasia denotes the act of choosing a course of action despite judging that another course is better: a misalignment between one's global judgment and a local impulse. Davidson goes further in characterizing this as an override of a judgment that is for the best "all things considered", due to immediate urges (Davidson 1969). Transposed to artificial agents, akrasia captures how a model may "know" the correct or intended response in a global, deliberative sense, yet fail to enact it when faced with an immediate, context-specific cue.
Unlike explanations that invoke hidden goals or deliberate deception, the akratic framing treats such failures as lapses of self-control rather than signs of malevolent intent. It portrays inconsistency not as a strategic choice but as an involuntary epistemic error: a breakdown in the model's ability to preserve internal coherence between the beliefs it reports and what it subsequently generates. In this view, ensuring that a model remains epistemically stable (that its local predictions remain faithful to its global understanding) becomes central to maintaining behavioral alignment.
We propose akrasia as a foundational and measurable concept for analyzing inconsistency and goal drift in agentic AI systems. To begin to operationalize this, we introduce the Akrasia Benchmark, which systematically measures when a model's local response contradicts its own prior commitments.
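
To make the intended measurement concrete, the sketch below illustrates one way such a benchmark loop could be structured under the abstract's description: elicit a commitment, re-probe it under each prompting condition (Baseline [B], Synonym [S], Temporal [T], Temptation [X]), and report the per-condition rate at which the local response contradicts that commitment. The data structure, prompt handling, and the `query_model` and `contradicts` callables are illustrative assumptions for exposition, not the paper's released implementation.

```python
# Minimal sketch (illustrative, not the paper's released code) of the kind of
# loop the Akrasia Benchmark describes: elicit a commitment, re-probe it under
# the four prompting conditions, and score per-condition contradiction rates.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AkrasiaItem:
    """One benchmark item: a commitment-eliciting prompt plus local probes."""
    commitment_prompt: str   # elicits the model's global, deliberative judgment
    probes: Dict[str, str]   # condition code ("B", "S", "T", "X") -> local probe prompt


def akrasia_rates(
    items: List[AkrasiaItem],
    query_model: Callable[[str], str],        # wraps any LLM / decoding strategy
    contradicts: Callable[[str, str], bool],  # judge: does the response break the commitment?
) -> Dict[str, float]:
    """Per-condition fraction of items where the local response
    contradicts the model's own prior commitment."""
    counts: Dict[str, int] = {}
    violations: Dict[str, int] = {}
    for item in items:
        commitment = query_model(item.commitment_prompt)
        for condition, probe in item.probes.items():
            counts[condition] = counts.get(condition, 0) + 1
            # The prior commitment stays in context; the probe supplies the local cue.
            local_response = query_model(f"{commitment}\n\n{probe}")
            if contradicts(commitment, local_response):
                violations[condition] = violations.get(condition, 0) + 1
    return {c: violations.get(c, 0) / n for c, n in counts.items()}
```

Because `query_model` and `contradicts` are injected rather than fixed, the same loop can be reused to compare model families, decoding strategies, and temptation types, which is the kind of quantitative comparison of "self-control" the abstract describes.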