Evolutionary System 2 Reasoning: An Empirical Proof
Zeyuan Ma1,4, Wenqi Huang1, Guo-Huan Song2,3, Hongshu Guo1,4,
Sijie Ma1, Zhiguang Cao5, Yue-Jiao Gong1*
1South China University of Technology, 2Zhejiang Normal University, 3Northern Computility,
4Panorama Optimization, 5Singapore Management University
[Figure 1 shows two parallel timelines: "The Evolution of Human Beings" and "The Evolution of Machine Intelligence". The machine-intelligence timeline marks milestones including game theory, information theory, and the Turing Test (1940s), the MCP neuron (1943), the Mark I Perceptron (1958), ADALINE (1960), ReLU (1969), the CNN (1979), backpropagation (1982), LeNet (1989), LSTM (1991) and DanNet (2011), AlexNet (2012), the GAN (2014), the Transformer (2017), and LLMs (2020s).]

Figure 1: An intuitive comparison between the evolution paths of human beings and machine intelligence.
Abstract
Machine intelligence marks the ultimate dream of making machines' intelligence comparable to that of human beings. While recent progress in Large Language Models (LLMs) demonstrates substantial task-specific skills across a wide array of downstream tasks, LLMs more or less fall short of general intelligence. Following the correlation between intelligence and System 2 reasoning (slow thinking), in this paper we aim to answer a worthwhile research question: could machine intelligence such as LLMs be evolved to acquire reasoning ability (not a specific skill), just like human beings? To this end, we propose the evolutionary reasoning optimization (ERO) framework, which performs survival of the fittest over a population of LLMs to search for an individual with strong reasoning ability. Given a reasoning task, ERO first initializes multiple LLMs as a population, after which an evolutionary strategy evolves the population to maximize the quantified reasoning score of the best individual. Based on experiments on representative test suites, we report two surprising empirical discoveries: i) the latest LLMs such as GPT-5 still show limited System 2 reasoning ability; ii) with the simple evolution loop of ERO, a relatively weak model (Qwen-7B) can be enhanced to exhibit powerful reasoning ability. Our project can be accessed at https://github.com/MetaEvo/ERO for reproduction needs.
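To make the workflow described above concrete, the minimal Python sketch below illustrates an ERO-style evolution loop under stated assumptions: the individual encoding, the mutate variation operator, and the score_reasoning evaluator are hypothetical placeholders of our own, not the framework's actual components.

import random
from typing import Callable, Dict, List

# Illustrative sketch only: the individual encoding, `mutate`, and
# `score_reasoning` are assumed placeholders, not ERO's real code.
def evolve(population: List[Dict],
           mutate: Callable[[Dict], Dict],
           score_reasoning: Callable[[Dict], float],
           generations: int = 50,
           elite_k: int = 4) -> Dict:
    """Survival-of-the-fittest search over a population of LLM configurations."""
    for _ in range(generations):
        # Rank individuals by their quantified reasoning score on the task.
        ranked = sorted(population, key=score_reasoning, reverse=True)
        elites = ranked[:elite_k]  # the fittest individuals survive
        # Refill the population by perturbing randomly chosen survivors.
        offspring = [mutate(random.choice(elites))
                     for _ in range(len(population) - elite_k)]
        population = elites + offspring
    # Return the best individual found after the final generation.
    return max(population, key=score_reasoning)

In this reading, the reasoning score plays the role of environmental selection pressure: the loop simply retains the highest-scoring individuals and perturbs them to explore further.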
This paper is not an advertisement for LLMs; it explores further possibilities.
— The authors
*Corresponding author (gongyuejiao@gmail.com)
1 Introduction
Machine intelligence (often used interchangeably with AI) has experienced ups and downs over a long history (Legg and Hutter 2007; Minsky 2007; LeCun, Bengio, and Hinton 2015). Since the initial proposal of AI in the 1950s (McCarthy et al. 2006), an evolution path has been observed: from basic theories (Shannon 1948; Turing 1950) to concrete architectures (Rosenblatt 1958; Fukushima 1980; Hochreiter and Schmidhuber 1997; Vaswani et al. 2017; Gu and Dao 2024) and algorithms (Robbins and Monro 1951; Werbos 1994; Graves 2013; Loshchilov and Hutter 2017). Today, the application of AI has spread to every corner of the world. Domains such as image processing (Gonzalez 2009), natural language processing (Bengio et al. 2003), and scientific discovery (Jumper et al. 2021) benefit from its learning power and the human-competitive performance that comes with it.
However, we should not overlook the dark side of advanced machine intelligence (i.e., LLMs) simply because of its dazzling academic and engineering achievements (Zhou et al. 2024; Li et al. 2024; Novikov et al. 2025a). In other words, we have to realize that LLMs, though pre-trained on a massive prior of human knowledge, may still operate at the level of pattern recognition (fast thinking, System 1 reasoning), and hence lack long-chain, deep, logical reasoning ability (slow thinking, System 2 reasoning), as evidenced in recent competitions¹.
¹ https://arcprize.org/leaderboard

As illustrated in Figure 1, such System 2 reasoning inability potentially stems from the essential difference between the evolution of machine intelligence and that of human beings (Cosmides and Tooby 1994; Pinker 2003). Human beings are continually involved in an evolutionary process under open-ended environmental selection pressure, which follows the survival-of-the-fittest principle proposed by Darwin (Darwin, Burrow, and Burrow 1958). The term "open-ended" refers to the extreme generalization scenario in which environmental uncertainty is naturally unknown to human beings (Wolpert 2024). In contrast, almost all machine intelligence instances are trained for specific application scopes explicitly restricted by their developers (human beings). The feedback or learning signal in their learning loops may inherently restrict them from attaining general intelligence with logical reasoning (Wolpert and Macready 2002). To make this point clearer, we borrow a valuable perspective from developmental psychology (Spelke and Kinzler 2007), which holds the position that human-lev