Discovering Differences in Strategic Behavior Between Humans and LLMs
As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or of black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to discover interpretable models of human and LLM behavior directly from data, enabling open-ended discovery of the structural factors that drive behavior. Our analysis of iterated rock-paper-scissors reveals that frontier LLMs are capable of deeper strategic behavior than humans. These results provide a foundation for understanding the structural differences underlying human and LLM behavior in strategic interactions.
💡 Research Summary
The paper investigates how state‑of‑the‑art large language models (LLMs) differ from humans in strategic decision‑making, using iterated rock‑paper‑scissors (I‑RPS) as a testbed. The authors argue that traditional behavioral game theory (BGT) models are human‑centric and may fail to capture the idiosyncrasies of non‑human agents such as LLMs. To overcome this limitation, they employ AlphaEvolve, a recent program‑discovery framework that uses LLMs themselves to generate symbolic programs modeling observed behavior. AlphaEvolve optimizes a multi‑objective fitness function that balances predictive likelihood (cross‑validated normalized likelihood) against interpretability, measured by the Halstead effort metric. This approach yields human‑readable programs that can be compared across humans and LLMs without imposing pre‑specified theoretical forms.
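The summary names the two fitness terms but not how they combine. A minimal sketch is below; the Halstead formulas (effort = difficulty × volume) are standard, but the per-choice normalization, the linear penalty, and the `weight`/`effort_scale` parameters are assumptions rather than the paper's exact objective.

```python
import math

def halstead_effort(n1, n2, N1, N2):
    """Halstead effort E = D * V.
    n1, n2: distinct operators / operands; N1, N2: total occurrences."""
    volume = (N1 + N2) * math.log2(n1 + n2)   # V = length * log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)         # D
    return difficulty * volume                # E

def fitness(cv_log_likelihoods, n_choices, effort,
            effort_scale=1e5, weight=0.1):
    """Hypothetical combination: cross-validated normalized likelihood
    (geometric-mean per-choice likelihood, in (0, 1]) minus a capped,
    scaled interpretability penalty."""
    norm_likelihood = math.exp(sum(cv_log_likelihoods) / n_choices)
    penalty = weight * min(effort / effort_scale, 1.0)
    return norm_likelihood - penalty
```

A random guesser on three moves would score a normalized likelihood of 1/3 with zero program complexity, which gives a concrete baseline any discovered program must beat.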
Data were collected from two sources. The human dataset consists of 411 participants playing 300‑round games of I‑RPS against 15 pre‑programmed bots (both non‑adaptive and adaptive), yielding 129,087 choices. On the LLM side, four leading models—Gemini 2.5 Pro, Gemini 2.5 Flash, GPT‑5.1, and the open‑source GPT‑OSS 120B—were prompted to play the identical games under the same payoff structure (+3 for a win, 0 for a tie, –1 for a loss). Each LLM generated 90,000 choices (20 games per bot).
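The asymmetric payoff structure is worth pinning down, since it makes wins worth three times what losses cost. A direct encoding:

```python
# Payoff table from the study: +3 win, 0 tie, -1 loss.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(player, opponent):
    """Return the player's payoff for one round of rock-paper-scissors."""
    if player == opponent:
        return 0
    return 3 if BEATS[player] == opponent else -1
```

Note that the LLM choice count is internally consistent: 20 games per bot × 15 bots × 300 rounds = 90,000 choices per model.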
AlphaEvolve was run separately on the human dataset and on each LLM dataset. The resulting “best” programs for humans are relatively simple: they implement a value‑based learning rule (Q‑learning) with a modest opponent‑bias term that depends on the frequency of the opponent’s previous moves. In contrast, the program discovered for Gemini 2.5 Pro contains three notable enhancements. First, it builds a probabilistic opponent model that updates beliefs about the opponent’s next move using a Bayesian‑like recursion. Second, it incorporates counterfactual updates: the value of actions not taken is adjusted based on the hypothetical reward they would have earned, with an asymmetric learning rate that depends on the sign of the prediction error. Third, the temperature parameter governing exploration versus exploitation is dynamically adjusted according to recent prediction errors. Together these mechanisms enable the LLM to detect and exploit patterns in the opponent’s play far more efficiently than humans.
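The three mechanisms attributed to the Gemini 2.5 Pro program can be illustrated in one small agent. This is a sketch of the *kind* of program described, not the discovered program itself; all parameter values and functional forms here are hypothetical.

```python
import math
import random

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def reward(me, opp):
    """Study payoffs: +3 win, 0 tie, -1 loss."""
    if me == opp:
        return 0
    return 3 if BEATS[me] == opp else -1

class SketchAgent:
    """Illustrative agent combining (1) a probabilistic opponent model,
    (2) counterfactual value updates with asymmetric learning rates, and
    (3) an error-driven exploration temperature. Parameters are assumed."""

    def __init__(self, lr_pos=0.3, lr_neg=0.1):
        self.counts = {m: 1.0 for m in MOVES}  # Dirichlet-style pseudocounts
        self.q = {m: 0.0 for m in MOVES}       # action values
        self.lr_pos, self.lr_neg = lr_pos, lr_neg
        self.temp = 1.0

    def opponent_probs(self):
        # 1) Bayesian-like belief over the opponent's next move
        total = sum(self.counts.values())
        return {m: c / total for m, c in self.counts.items()}

    def act(self):
        # Softmax over values; exploration governed by current temperature
        weights = [math.exp(self.q[m] / self.temp) for m in MOVES]
        return random.choices(MOVES, weights=weights)[0]

    def update(self, opp_move):
        self.counts[opp_move] += 1.0
        total_err = 0.0
        for m in MOVES:
            # 2) Counterfactual update: every action, chosen or not, moves
            #    toward the reward it would have earned this round, with a
            #    learning rate that depends on the error's sign
            err = reward(m, opp_move) - self.q[m]
            self.q[m] += (self.lr_pos if err > 0 else self.lr_neg) * err
            total_err += abs(err)
        # 3) Error-driven temperature: explore more when predictions are off
        self.temp = 0.2 + 0.3 * total_err / len(MOVES)
```

After a single observed `rock`, the counterfactual rule already favors `paper` without `paper` ever having been played, which is exactly the efficiency advantage over action-by-action Q-learning that the summary attributes to the LLM program.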
Performance evaluation shows that all LLMs achieve higher win rates than humans, with gains ranging from roughly 8 percentage points for the less capable Gemini 2.5 Flash to over 12 percentage points for Gemini 2.5 Pro. The advantage is especially pronounced against adaptive bots, where humans perform near chance while the LLMs maintain a win rate above 30%. Predictive likelihood scores confirm that the LLM‑derived programs fit their respective data better than the human‑derived program does, while remaining within a reasonable interpretability budget (Halstead effort).
The authors discuss several implications. The superior opponent modeling and counterfactual learning suggest that frontier LLMs possess a form of theory‑of‑mind capability that surpasses typical human performance in this constrained setting. However, they caution that AlphaEvolve’s programs are descriptive, not necessarily causal, and that the findings may not generalize to richer strategic environments. Limitations include reliance on a single, simple game, potential biases introduced by prompt engineering, and the fact that the discovered programs reflect the inductive biases of the underlying LLMs used for program generation.
Contributions are threefold: (1) introducing automated symbolic model discovery to compare human and LLM strategic behavior, (2) demonstrating that modern LLMs can achieve deeper strategic reasoning than humans in I‑RPS, and (3) providing a structural explanation—more sophisticated opponent models and counterfactual updates—for the observed performance gap. The paper concludes with a roadmap for future work: extending the methodology to multi‑stage negotiation or cooperation games, integrating neuro‑cognitive validation (e.g., fMRI) to test whether the discovered mechanisms align with human brain activity, and exploring how prompt design and token limits affect LLM strategic behavior. This line of research is positioned as essential for ensuring the safety and alignment of increasingly capable LLMs in real‑world social interactions.