Self-Improving AI Agents through Self-Play

Reading time: 5 minutes

📝 Original Info

  • Title: Self-Improving AI Agents through Self-Play
  • ArXiv ID: 2512.02731
  • Date: 2025-12-02
  • Authors: Przemyslaw Chojecki

📝 Abstract

We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the space of agent representations, this paper formalizes the agent as a flow $ν_r$ parameterized by computational resource $r$, governed by a recursive Generator-Verifier-Updater (GVU) operator. We prove that this operator generates a vector field on the parameter manifold $Θ$, and we identify the coefficient of self-improvement $κ$ as the Lie derivative of the capability functional along this flow. The central contribution of this work is the derivation of the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for the stability of self-improvement. We show that a sufficient condition for $κ > 0$ is that, up to curvature and step-size effects, the combined noise of generation and verification is small enough. We then apply this formalism to unify the recent literature on Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping. We demonstrate that architectures such as STaR, SPIN, Reflexion, GANs, and AlphaZero are specific topological realizations of the GVU operator that satisfy the Variance Inequality through filtration, adversarial discrimination, or grounding in formal systems.

💡 Deep Analysis

Figure 1

📄 Full Content

The central problem in Artificial General Intelligence (AGI) is not the achievement of a specific benchmark score, but the achievement of ignition: the point at which an agent can autonomously convert computational resources into capability gains without human intervention. In the framework of [2], we defined the capability of an agent A on a battery B as a functional value $Φ_B(ρ_B(A))$. However, for current Large Language Models (LLMs), this value is static once pre-training concludes. As noted in [12], the trajectory of self-improvement for standard LLMs is flat (κ ≈ 0) or decaying due to hallucination drift.

By contrast, systems like AlphaGo Zero [3] exhibited κ ≫ 0, reaching superhuman capability solely through self-play. The disparity lies in the nature of the verification signal. Go provides a noiseless, ground-truth verifier (the game rules). Open-ended domains do not.

To bridge this gap, recent literature has proposed various mechanisms for “self-correction” and “self-play” in language models. These include iterative reasoning bootstrapping (STaR [4]), zero-sum language games (SPIN [5], LSP [6]), and verbal reinforcement learning (Reflexion [8]), as well as adversarial and game-based systems such as GANs and AlphaZero.

This paper unifies these approaches under a single rigorous mathematical framework. We define the GVU Operator as the canonical engine of self-improvement. We show that the success or failure of any self-improving agent is determined by the spectral properties of this operator acting on the tangent bundle of the moduli space. Specifically, we derive a “Second Law of AGI Dynamics”: Entropy (hallucination) tends to increase unless the combined signal from generation and verification is strong enough, relative to their noise and to curvature, to keep the expected capability gain positive. In practice, many architectures satisfy this by making verification spectrally “easier” than generation (e.g., via oracles, ensembles, or external structure).
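To make one GVU cycle concrete, below is a minimal numerical sketch, not the paper's implementation: a Gaussian generator proposes samples around the current parameters, a noisy verifier scores them, and the updater takes a REINFORCE-style step along the estimated drift. The quadratic capability functional and all constants are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def capability(theta):
    # Toy stand-in for the capability functional F (peaks at theta = 3).
    return -(theta - 3.0) ** 2

def gvu_step(theta, sigma_G=1.0, sigma_V=0.5, eta=0.05, batch=64):
    """One Generator -> Verifier -> Updater cycle, REINFORCE-style."""
    x = theta + sigma_G * rng.standard_normal(batch)           # Generator: sample behaviours
    v = capability(x) + sigma_V * rng.standard_normal(batch)   # Verifier: noisy internal score
    score = (x - theta) / sigma_G**2                           # d/dtheta of log N(x; theta, sigma_G^2)
    return theta + eta * np.mean((v - v.mean()) * score)       # Updater: step along the estimated drift

theta = 0.0
for _ in range(300):
    theta = gvu_step(theta)
print("final capability F(theta):", capability(theta))
```

Whether iterating this cycle actually raises capability depends on exactly the quantities the Variance Inequality tracks: the verifier's alignment with the true score, the generation and verification noise levels, and the step size.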

The central message for practitioners is that the Variance Inequality tells you exactly why your RL training plateaus and what to do about it: strengthen the verifier, not the generator. See ?? for how our framework relates to current LLM training pipelines.
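As a minimal sketch of the "strengthen the verifier" lever, assume (purely for illustration) that each verifier head returns the true score plus independent Gaussian noise; averaging k heads then shrinks the verification noise by a factor of √k, the kind of SNR(V) boost the inequality asks for, without touching the generator at all.

```python
import numpy as np

rng = np.random.default_rng(1)

def verifier_head(x, sigma_V=2.0):
    # One noisy verifier head: true score plus independent reading noise (toy assumption).
    return -(x - 3.0) ** 2 + sigma_V * rng.standard_normal()

x = 1.0
single   = [verifier_head(x) for _ in range(2000)]
ensemble = [np.mean([verifier_head(x) for _ in range(16)]) for _ in range(2000)]

# Sixteen independent heads cut the verifier noise by roughly 4x (a 1/sqrt(k) effect).
print("single-head score std:", round(float(np.std(single)), 2))    # ~2.0
print("16-head ensemble std :", round(float(np.std(ensemble)), 2))  # ~0.5
```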

This paper makes the following contributions:

• From static scores to dynamical flows. We extend the moduli-theoretic framework of psychometric batteries [2] from static capability scores to dynamical trajectories. An agent is modeled as a flow $(ν_r)_{r≥0}$ on a statistical parameter manifold $Θ$, and the self-improvement coefficient $κ(r)$ is identified with the Lie derivative of the capability functional $F = Φ_B \circ ρ_B$ along this flow. This yields an operational notion of ignition as sustained $κ > 0$ across capability fibers.

• The GVU operator and a universality theorem. We formalize the Generator-Verifier-Updater (GVU) operator $T_{\mathrm{GVU}} = U \circ V \circ G$ as the canonical engine of self-improvement, and prove a score-based GVU representation theorem: on a regular statistical manifold, any first-order, sample-based update vector field can be written in REINFORCE form, i.e., as an expectation of the score function $∇_θ \log p_θ$ weighted by some scalar potential $V_θ$. Thus any rational, data-driven self-update implicitly instantiates a GVU with an internal Verifier potential. A non-trivial verifier is shown to be necessary for non-zero expected $κ$.

• The Variance Inequality and the Hallucination Barrier. We derive the Variance Inequality, a sufficient spectral condition for expected capability gain $E[ΔF] > 0$. It quantitatively relates the alignment $ρ$ between the internal potential and the external score, the generation and verification variances $(σ_G^2, σ_V^2)$, the curvature $L$, and the step size $η$. A corollary identifies the Hallucination Barrier: in diagonal regimes where $V ≈ G$, verification noise matches generation noise and self-correction typically fails to produce sustained $κ > 0$.

• Geometric and spectral design levers. Working on the Fisher-information statistical manifold $(Θ, g)$, we interpret the GVU drift as a noisy vector field whose usefulness is governed by the Fisher angle between the mean update and the true gradient of $F$. We analyze generic design levers that improve $κ$: ensemble verifiers, group-based normalization (GRPO-style schemes), oracle-like executors (code, games, proofs), and “cold” verifier interfaces in diagonal GVU. We quantify how these increase SNR(V) and widen the stable step-size window, and we introduce a Goodhart-type limit on long-run $κ$ via decay of the alignment coefficient $ρ$ under proxy optimization.

• Topological realizations and an empirical κ protocol. We show that a wide range of existing self-improvement methods (AlphaZero, GANs, STaR, SPIN/LSP, PRMs, RAG self-training, self-debugging code agents, RLHF, Constitutional AI, Self-Instruct, and GRPO) are concrete topological realizations of the GVU operator on different fibers (Sociality, Planning, Embodiment, Recursive, Alignment, Synthetic, Critic-less) of the moduli space. Finally, we propose a finite-difference evaluation protocol for estimating $κ$; a minimal sketch follows this list.
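To make the last bullet concrete, here is a minimal sketch of the finite-difference κ protocol. The function names and the stub agent/battery are hypothetical placeholders; the paper only prescribes the finite-difference idea of comparing the capability score at two resource levels of the same self-improving run.

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate_kappa(capability_fn, agent_at, r, dr, n_eval=8):
    """Finite-difference estimate kappa(r) ~ (F(nu_{r+dr}) - F(nu_r)) / dr,
    averaged over repeated battery evaluations to suppress scoring noise."""
    f_lo = np.mean([capability_fn(agent_at(r)) for _ in range(n_eval)])
    f_hi = np.mean([capability_fn(agent_at(r + dr)) for _ in range(n_eval)])
    return (f_hi - f_lo) / dr

# Placeholder agent and battery: capability grows with log-resources and is scored noisily.
agent_at = lambda r: np.log1p(r)                             # checkpoint of the flow nu_r
capability_fn = lambda a: a + 0.05 * rng.standard_normal()   # noisy battery score F

print("kappa estimate:", estimate_kappa(capability_fn, agent_at, r=100.0, dr=50.0))
```

A sustained positive estimate across successive checkpoints is the operational signature of ignition in the sense defined above.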


Reference

This content is AI-processed based on open access ArXiv data.
