AGITB: A Signal-Level Benchmark for Evaluating Artificial General Intelligence
Current artificial intelligence systems exhibit strong performance on narrow tasks, while existing evaluation frameworks provide limited insight into generality across domains. We introduce the Artificial General Intelligence Testbed (AGITB), a complementary benchmarking framework grounded in twelve explicitly stated axioms and implemented as a suite of twelve automated, simple, and reusable tests. AGITB evaluates models on their ability to learn and to predict the next input in a temporal sequence whose semantic content is initially unknown to the model. The framework targets core computational properties, such as determinism, adaptability, and generalisation, that parallel principles observed in biological information processing. Designed to resist brute-force or memorisation-based strategies, AGITB requires autonomous learning across previously unseen environments, in a manner broadly inspired by cortical computation. Preliminary application of AGITB suggests that no contemporary system evaluated to date satisfies all test criteria, indicating that the benchmark provides a structured and interpretable means of assessing progress toward more general learning capabilities. A reference implementation of AGITB is freely available on GitHub.
💡 Research Summary
The paper introduces the Artificial General Intelligence Testbed (AGITB), a novel benchmark designed to evaluate core learning capabilities of AI systems at the raw signal level rather than through high‑level symbolic tasks. AGITB is built around twelve explicitly stated axioms—determinism, adaptability, generalisation, meta‑learning, learning‑time constraints, self‑consistency, configuration equivalence detection, among others—and implements each axiom as an automated test. A model under evaluation receives a stream of ten‑bit binary vectors whose semantic meaning is unknown to the system. At each discrete time step the model must predict the next vector, receive the true input, and update its internal state. Success is defined not by conventional accuracy or loss metrics but by relative consistency and superiority across multiple independently instantiated copies of the same model, each run for at least 5,000 trials per test. All twelve tests must be passed for a model to be deemed successful, reflecting the authors’ view that satisfying the full set of axioms captures a necessary suite of general‑intelligence behaviours.
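The predict → observe → update loop described above can be sketched in a few lines. This is an illustrative harness, not the AGITB reference implementation: `TrivialPredictor`, `run_trials`, and the constant test stream are hypothetical names and data invented here to show the protocol's shape (the model emits a ten-bit prediction, then sees the true vector, then updates its state).

```python
import random

VECTOR_BITS = 10  # AGITB inputs are ten-bit binary vectors

class TrivialPredictor:
    """Hypothetical baseline: predicts that the next input repeats the last one."""
    def __init__(self):
        self.last = [0] * VECTOR_BITS

    def predict(self):
        # Emit a prediction before the true input is revealed.
        return list(self.last)

    def update(self, observed):
        # Incorporate the revealed input into internal state.
        self.last = list(observed)

def run_trials(model, stream):
    """Drive the predict -> observe -> update loop; count exact-match predictions."""
    correct = 0
    for observed in stream:
        if model.predict() == observed:
            correct += 1
        model.update(observed)
    return correct

# A constant stream is learned after a single observation.
constant = [[1, 0] * 5 for _ in range(100)]
print(run_trials(TrivialPredictor(), constant))  # 99: all steps after the first
```

AGITB runs this loop over far longer streams (at least 5,000 trials per test) and judges models by relative consistency across independently instantiated copies rather than by the raw count shown here.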
The authors formalise prediction, model update, learning, learning time, and autoregressive generation with precise mathematical notation, allowing clear measurement of when a model first achieves accurate prediction (learning time τ_A(ϕ)). Importantly, AGITB requires model authors to provide a mechanism for determining whether two instantiated copies occupy the same configuration, ensuring that internal state comparisons are possible even for black‑box systems.
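One natural reading of the learning time τ_A(ϕ) is the earliest step from which the model's predictions remain correct for the rest of the observed run. The helper below encodes that reading; it is an assumption about the paper's definition, sketched over flat prediction/input sequences for brevity (the real quantity is defined over ten-bit vectors and a specific model A and input process ϕ).

```python
def learning_time(predictions, inputs):
    """Earliest step t such that predictions match inputs from t onward,
    or None if the model never settles into accurate prediction.

    Assumed interpretation of tau_A(phi), not the paper's exact definition.
    """
    assert len(predictions) == len(inputs)
    tau = None
    for t, (p, x) in enumerate(zip(predictions, inputs)):
        if p == x:
            if tau is None:
                tau = t  # candidate start of the accurate suffix
        else:
            tau = None  # a later mistake resets the candidate
    return tau

print(learning_time([0, 1, 1, 1], [1, 1, 1, 1]))  # 1
print(learning_time([0, 1, 0, 1], [1, 1, 1, 1]))  # 3 (the step-2 error resets it)
```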
A comparative discussion positions AGITB alongside existing benchmarks such as the Turing Test, ARC, and NeuroBench, highlighting its avoidance of symbol‑grounding issues by operating directly on binary spike‑like signals. This low‑level focus mirrors cortical processing, where neurons handle temporally sensitive spike trains rather than abstract symbols.
Empirical evaluation on several state‑of‑the‑art systems—including large language models, transformer‑based time‑series predictors, and reinforcement‑learning agents—shows that while individual tests can be passed, none of the systems satisfy all twelve requirements. The failures are most pronounced in meta‑learning, rapid adaptation, and configuration equivalence detection, underscoring gaps between current narrow AI and the broader learning abilities exhibited by biological brains.
The benchmark’s code and test suite are released openly on GitHub, facilitating reproducibility and future extensions. By providing a stress‑test that forces autonomous learning in previously unseen environments, AGITB offers a structured, interpretable, and metric‑free means to track progress toward artificial general intelligence, positioning itself as a potential standard for future AGI research.