The Geometry of Benchmarks: A New Path Toward AGI
📝 Abstract
Benchmarks are the primary tool for assessing progress in artificial intelligence (AI), yet current practice evaluates models on isolated test suites and provides little guidance for reasoning about generality or autonomous self-improvement. Here we introduce a geometric framework in which all psychometric batteries for AI agents are treated as points in a structured moduli space, and agent performance is described by capability functionals over this space. First, we define an Autonomous AI (AAI) Scale, a Kardashev-style hierarchy of autonomy grounded in measurable performance on batteries spanning families of tasks (for example reasoning, planning, tool use and long-horizon control). Second, we construct a moduli space of batteries, identifying equivalence classes of benchmarks that are indistinguishable at the level of agent orderings and capability inferences. This geometry yields determinacy results: dense families of batteries suffice to certify performance on entire regions of task space. Third, we introduce a general Generator-Verifier-Updater (GVU) operator that subsumes reinforcement learning, self-play, debate and verifier-based fine-tuning as special cases, and we define a self-improvement coefficient $κ$ as the Lie derivative of a capability functional along the induced flow. A variance inequality on the combined noise of generation and verification provides sufficient conditions for $κ > 0$. Our results suggest that progress toward artificial general intelligence (AGI) is best understood as a flow on moduli of benchmarks, driven by GVU dynamics rather than by scores on individual leaderboards.
📄 Content
Modern AI systems are usually evaluated by their performance on fixed benchmarks: standardized test suites for language modelling, reasoning, vision, or control [6]. Performance is reported as a percentage score or leaderboard rank on individual datasets, occasionally aggregated across a small collection of tasks. This practice has three limitations.
First, benchmarks are typically narrow: a system may excel on a particular dataset while failing catastrophically on nearby tasks [5]. Second, evaluation is often fragmented: each new domain introduces bespoke test suites, with little understanding of how different benchmarks relate to one another or how much new information they provide. Third, benchmarks are usually static: they measure capabilities at a single point in time, while progress in AI, especially toward artificial general intelligence (AGI), is fundamentally about self-improvement over time [7].
Here we develop a framework that addresses all three issues simultaneously by treating benchmarks themselves as mathematical objects and studying their geometry. The central idea is that once we consider all psychometric batteries at once, they organize into a well-structured moduli space. Agents are then characterized not only by their scores on individual benchmarks but by their position and trajectory in this space.
Our contributions are threefold:
- We propose an Autonomous AI (AAI) Scale, an operational, Kardashev-inspired hierarchy that measures the autonomy and generality of AI systems. The AAI scale is defined in terms of performance on families of batteries under explicit resource constraints, providing a behavioural notion of “AGI-like” capacity. [2]
- We develop a moduli-theoretic view of batteries. By quotienting out natural equivalence relations between test suites, we define a moduli space on which capability functionals become smooth fields, enabling geometric reasoning about generalization, coverage and redundancy of benchmarks. [3]
- We extend this static geometry to dynamics of self-improvement. We introduce a general Generator-Verifier-Updater (GVU) loop that subsumes reinforcement learning (RL), self-play, debate, adversarial training and verifier-based fine-tuning. Viewing GVU as a stochastic flow on the parameter manifold of agents, we define a self-improvement coefficient κ and derive a variance inequality giving sufficient conditions for κ > 0 in terms of the noise in generation and verification. [4]

Together, these ingredients recast progress toward AGI as a flow on moduli of batteries. “Everything is reinforcement learning” in the sense that any practical self-improving AI system can be represented as a GVU flow that climbs a capability functional defined over this moduli space.
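The Generator-Verifier-Updater loop can be sketched abstractly as a single step that samples candidates, scores them, and updates the agent. The toy instantiation below (all names and the hill-climbing updater are illustrative, not the paper's construction) shows how RL-style self-play fits this shape:

```python
import random

def gvu_step(params, generate, verify, update, n_samples=8):
    """One Generator-Verifier-Updater step: sample candidate
    behaviours, score them with the verifier, and update the
    parameters toward higher-scoring behaviour."""
    candidates = [generate(params) for _ in range(n_samples)]
    scores = [verify(c) for c in candidates]
    return update(params, candidates, scores)

# Toy instantiation: params is a scalar "skill"; the generator
# perturbs it, the verifier rewards larger values, and the updater
# keeps the best candidate (a trivial hill-climbing special case).
random.seed(0)
generate = lambda p: p + random.gauss(0.0, 0.1)
verify = lambda c: c
update = lambda p, cs, ss: max(zip(ss, cs))[1]

params = 0.0
for _ in range(50):
    params = gvu_step(params, generate, verify, update)
```

Because the verifier's signal drives every update, noise in `verify` directly limits how reliably the flow climbs the capability functional, which is the intuition behind the variance inequality for κ > 0.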
We formalize an agent as a policy π acting in interactive environments, mapping histories of observations and actions to distributions over actions. A task instance τ specifies an environment, initial condition and termination rule, together with a scoring functional $S(\pi, \tau; R)$ that measures the performance of π on τ given a resource budget $R$ (for example, number of calls, computation time or human interventions).
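As a minimal sketch of these definitions (the data-structure names and the step-budget resource model are illustrative assumptions, not the paper's formalism):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TaskInstance:
    """A task instance τ: environment dynamics, initial condition,
    termination rule, and an instance-level reward."""
    reset: Callable[[], Any]            # initial condition
    step: Callable[[Any, Any], Any]     # environment transition
    done: Callable[[Any], bool]         # termination rule
    reward: Callable[[Any], float]      # instance-level score

def score(policy: Callable[[Any], Any], task: TaskInstance, budget: int) -> float:
    """Scoring functional S(π, τ; R): run policy π on task τ under a
    resource budget R, here modelled as a cap on environment steps."""
    state, total = task.reset(), 0.0
    for _ in range(budget):
        if task.done(state):
            break
        state = task.step(state, policy(state))
        total += task.reward(state)
    return total
```

For example, a counting task that terminates at 5 and rewards reaching it scores 1.0 under a policy that always increments, and 0.0 under a budget too small to reach the goal.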
A battery B is a finite or countable collection of task instances equipped with:
• a sampling distribution $\mu_B$ over tasks and random seeds,
• a scoring rule that aggregates instance-level scores into a battery score $S(\pi, B)$,
• metadata specifying the family of capabilities probed (e.g., mathematical reasoning, tool orchestration, long-horizon planning).
We write $F(\pi, B)$ for a capability functional, which may coincide with $S$ or incorporate penalties for resource usage and uncertainty. Formally, such functionals act on the agent’s trajectory distribution under $B$ and are required to satisfy the axioms of naturality, restricted monotonicity, threshold calibration and generality.
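A battery score can be estimated as an expectation over tasks drawn from $\mu_B$. The sketch below (an assumed Monte Carlo estimator; each task is represented as a callable returning the instance-level score, and the linear resource penalty is just one illustrative choice of $F$) makes this concrete:

```python
import random

def battery_score(policy, tasks, weights, budget, n_draws=200, seed=0):
    """Estimate S(π, B): Monte Carlo average of instance scores over
    tasks drawn from the sampling distribution µ_B (a weighted choice).
    Each task is a callable task(policy, budget) -> float."""
    rng = random.Random(seed)
    draws = rng.choices(tasks, weights=weights, k=n_draws)
    return sum(task(policy, budget) for task in draws) / n_draws

def capability(policy, tasks, weights, budget, cost_per_step=0.0):
    """One illustrative capability functional F(π, B): the battery
    score minus a penalty proportional to the resource budget."""
    return battery_score(policy, tasks, weights, budget) - cost_per_step * budget
```

When `cost_per_step` is zero, $F$ coincides with $S$; a positive value trades raw score against resource usage, matching the penalized variant described above.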
Real-world agency is multi-dimensional. We therefore consider a finite collection of families $\mathcal{F} = \{f_1, \ldots, f_m\}$ (for example reasoning, learning, memory, tool use, social interaction). For each family $f$ we specify a set of batteries $\mathcal{B}_f$ that probe that family under standardized protocols.
The AAI Index of an agent π is the vector
$$\mathrm{AAI}(\pi) = \Big( \inf_{B \in \mathcal{B}_f} F(\pi, B) \Big)_{f \in \mathcal{F}},$$
which measures worst-case performance over batteries in each family, under fixed resources. When desired, a scalar AAI score can be obtained by aggregating with a monotone functional (for example a norm or weighted quantile), but the vector form is primary.
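Computationally, the vector-valued index is just a per-family minimum over battery-level capability values. In the sketch below (the dictionary representation is an assumption for illustration), each battery is a callable evaluating $F(\pi, B)$ at the fixed budget:

```python
def aai_index(policy, families, budget):
    """AAI Index: for each family f, the worst-case (infimum)
    capability over its batteries B_f at a fixed resource budget.
    `families` maps a family name to a list of callables
    F(policy, budget) -> float, one per battery in B_f."""
    return {
        f: min(F(policy, budget) for F in batteries)
        for f, batteries in families.items()
    }
```

The worst-case aggregation is what makes the index conservative: an agent cannot raise its score for a family by excelling on one battery while failing another in the same family.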
To make the scale actionable, we define discrete AAI levels by specifying level gates: for level $\ell$ and each family $f$, a threshold $\theta_{f,\ell}$ and robustness parameters (for example tolerated degradation under perturbations of the battery). An agent π is said to be at level $\ell$ if, for all families $f$,
$$\inf_{B \in \mathcal{B}_f} F(\pi, B) \ge \theta_{f,\ell},$$
and this inequality continues to hold under (controlled) drifts in battery composition and environment parameters.
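The level-gate check reduces to comparing the per-family index against each level's thresholds in order. A minimal sketch, assuming gates are given as an ordered list of per-family threshold dictionaries (robustness checks under battery drift are omitted here):

```python
def aai_level(index, gates):
    """Return the highest level ℓ whose gates are all met:
    index[f] >= θ_{f,ℓ} for every family f. `index` is a per-family
    score dict; `gates` lists per-level threshold dicts, ordered
    from level 0 upward. Returns -1 if no gate is satisfied."""
    level = -1
    for l, thresholds in enumerate(gates):
        if all(index[f] >= theta for f, theta in thresholds.items()):
            level = l
        else:
            break  # levels are cumulative: a failed gate caps the level
    return level
```

Because the loop stops at the first failed gate, an agent's level is capped by its weakest family, mirroring the worst-case character of the index itself.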
We denote the highest level satisfied by π as AAI-ℓ. Instantiations of this scheme yield interpretable levels such as:
• AAI-0: Nar