Can Complexity and Uncomputability Explain Intelligence? SuperARC: A Test for Artificial Super Intelligence Based on Recursive Compression
We introduce an open-ended, human-agnostic metric of increasing complexity to evaluate foundational and frontier AI models in the context of Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI) claims. Unlike tests that rely on human-centric questions with expected answers, or on pattern-matching methods, the test introduced here is grounded in the fundamental mathematics of randomness and optimal inference. We argue that human-agnostic metrics based on the universal principles of Algorithmic Information Theory (AIT), which formally frame the concepts of model abstraction and prediction, offer a powerful metrological framework. When applied to frontier models, the leading LLMs outperform most others across multiple tasks, but not always with their latest model versions, which often regress and remain far from any global maximum or target estimated from the principles of AIT, which define a Universal Intelligence (UAI) point and trend in the benchmarking. Conversely, a hybrid neuro-symbolic approach to UAI based on the same principles is shown to outperform frontier specialised prediction models in a simplified but relevant example of compression-based model abstraction and sequence prediction. Finally, we prove and conclude that predictive power through arbitrary formal theories is directly proportional to compression over the algorithmic space, not the statistical space, and therefore that further progress in AI models can only be achieved in combination with the symbolic approaches that LLM developers are adopting, often without acknowledgement or realisation.
💡 Research Summary
The paper proposes a novel, human‑agnostic benchmark called SuperARC (Super Artificial Recursive Compression) to evaluate the intelligence of AI systems, especially in the context of Artificial General Intelligence (AGI) and Artificial Super‑Intelligence (ASI). The authors argue that existing benchmarks are heavily human‑centric, relying on language‑based questions, expected answers, or simple pattern‑matching, and therefore cannot capture the deeper computational abilities that define general intelligence. To overcome this, they ground their metric in Algorithmic Information Theory (AIT), using the concepts of algorithmic (Kolmogorov‑Chaitin) complexity, algorithmic probability, and algorithmic randomness as the theoretical foundation for optimal inference.
Methodologically, the paper employs two AIT‑based approximation techniques: the Coding Theorem Method (CTM) and the Block Decomposition Method (BDM). CTM exhaustively runs a large library of tiny programs to estimate the algorithmic probability distribution of short strings, while BDM decomposes larger objects into smaller blocks whose CTM values are summed, providing a scalable estimate of algorithmic complexity. Unlike conventional compressors such as GZIP or LZW, which are essentially Shannon‑entropy based and capture only statistical regularities, CTM/BDM aim to detect true recursive regularities and therefore differentiate genuine randomness from structured complexity.
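The block-decomposition idea can be sketched in a few lines: split a string into short blocks, look up each distinct block's CTM value, and add log2 of each block's multiplicity. This is a minimal illustration only — the CTM values below are made-up placeholders, whereas real CTM tables come from exhaustively running enumerations of small Turing machines.

```python
import math
from collections import Counter

# Hypothetical CTM lookup table: algorithmic-complexity estimates (in bits)
# for 4-bit binary blocks. Values are illustrative stand-ins, not real CTM data.
CTM = {
    "0000": 3.0, "1111": 3.0,   # highly regular -> lowest complexity
    "0101": 3.5, "1010": 3.5,
    "0011": 4.0, "1100": 4.0,
    "0110": 4.5, "1001": 4.5,
    "0001": 5.0, "0111": 5.0, "1000": 5.0, "1110": 5.0,
    "0010": 5.5, "0100": 5.5, "1011": 5.5, "1101": 5.5,
}

def bdm(s: str, block: int = 4) -> float:
    """Block Decomposition Method sketch: cut s into fixed-size blocks,
    then sum each distinct block's CTM value plus log2 of how often it
    repeats (repetitions cost only logarithmically more)."""
    blocks = [s[i:i + block] for i in range(0, len(s) - len(s) % block, block)]
    counts = Counter(blocks)
    return sum(CTM[b] + math.log2(n) for b, n in counts.items())

print(bdm("0000" * 8))          # one regular block repeated -> low score
print(bdm("0110100110010110"))  # more distinct blocks -> higher score
```

The key property the sketch preserves is that repeating a block many times adds only a logarithmic penalty, so a long but regular string scores far lower than one built from many distinct blocks.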
The experimental evaluation consists of two main tasks. The first is a next‑digit prediction task on binary and non‑binary sequences. “Climber” sequences (low‑complexity, recursively defined) and truly random sequences are generated, and several specialized time‑series prediction models (Lag‑Llama, TimeGPT‑1, Chronos) are asked to predict the final digit. Lag‑Llama achieves the best performance on climbers (≈70 % accuracy), while all models hover around chance (≈50 %) on random sequences, confirming that such models can capture simple regularities but fail on high‑complexity or truly random data. The second task is a free‑form generation challenge: models must produce any program or formula that generates target sequences of increasing algorithmic complexity. The authors evaluate the generated models using compression length (LZW, ZIP) and BDM scores. Results show that as sequence complexity rises, LLM performance degrades sharply, whereas a hybrid neuro‑symbolic approach—combining neural pattern learning with explicit symbolic manipulation—maintains shorter compressed representations and higher BDM scores, indicating superior algorithmic compression and predictive power.
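The compression-length side of the evaluation can be illustrated with a DEFLATE (ZIP-style) compressor: a low-complexity, recursively generated sequence should compress far better than an incompressible random one. The sequence generators below are illustrative stand-ins, not the paper's actual "climber" definitions.

```python
import os
import zlib

# Illustrative sequences: a trivially recursive digit stream vs. a
# near-random one drawn from the OS entropy source.
climber = "".join(str(i % 10) for i in range(1000))          # 0123456789 repeated
random_seq = "".join(str(b % 10) for b in os.urandom(1000))  # unpredictable digits

def compressed_len(s: str) -> int:
    """Length in bytes of the DEFLATE-compressed string — a Shannon-style,
    purely statistical complexity proxy, as the paper notes for LZW/ZIP."""
    return len(zlib.compress(s.encode(), level=9))

print(compressed_len(climber), compressed_len(random_seq))
```

On sequences like these, the recursive stream collapses to a few dozen bytes while the random one stays near its raw size — which is exactly the gap that statistical compressors can see; distinguishing structured complexity from randomness beyond such statistical regularities is what CTM/BDM are meant to add.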
These findings support the central theoretical claim that predictive capability is proportional to compression in algorithmic space, not merely statistical space. The authors also observe that newer versions of leading LLMs sometimes regress in performance on these tasks, suggesting that scaling model size or data alone does not guarantee progress toward universal intelligence. Instead, integrating symbolic reasoning components appears essential for achieving the kind of recursive compression that AIT identifies as a hallmark of intelligence.
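The compression-prediction link can be made concrete with a toy predictor: choose the continuation that keeps the sequence most compressible. This sketch uses zlib, i.e. a statistical compressor, so it is only a stand-in for the algorithmic-space argument; `predict_next` is a hypothetical helper, not the paper's method.

```python
import zlib

def predict_next(seq: str, alphabet: str = "01") -> str:
    """Toy prediction-by-compression: the symbol whose appended sequence
    compresses shortest is the predicted continuation (ties break to the
    first symbol in the alphabet)."""
    def clen(s: str) -> int:
        return len(zlib.compress(s.encode(), level=9))
    return min(alphabet, key=lambda c: clen(seq + c))

print(predict_next("0" * 100))  # -> "0": extending the run stays compressible
```

A better model of the data (shorter description) yields better predictions — the direction of the proportionality the authors prove, with the caveat that a statistical compressor can only exploit statistical regularities.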
The paper discusses several limitations. CTM/BDM are computationally intensive and approximate non‑computable measures, making large‑scale deployment challenging. The test sequences are synthetic and may not fully represent the richness of real‑world data such as natural language, images, or video. Human evaluation is not directly incorporated, so the relationship between the proposed metric and human‑perceived intelligence remains indirect. Nevertheless, the work establishes a rigorous, mathematically grounded framework for open‑ended, complexity‑driven AI assessment, offering a promising direction for future AGI/ASI research that seeks to move beyond anthropocentric benchmarks.