The Token Games: Evaluating Language Model Reasoning with Puzzle Duels
Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks that use PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or whether similar problems were seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a Python function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable automatic verification of solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models against each other. We evaluate 10 frontier models on TTG and closely match the ranking from existing benchmarks such as Humanity's Last Exam, without any human effort in creating puzzles. We also find that creating good puzzles is still a highly challenging task for current models, one not measured by previous benchmarks. Overall, our work suggests new paradigms for evaluating reasoning that cannot be saturated by design, and that allow testing models for skills like creativity and task creation alongside problem solving.
💡 Research Summary
The paper introduces “The Token Games” (TTG), a novel evaluation framework for large language models (LLMs) that draws inspiration from 16th‑century mathematical duels. Instead of relying on costly human‑crafted benchmarks, TTG pits two LLMs against each other in a series of “reasoning duels” where each model alternately acts as a puzzle proposer and a solver. Puzzles are expressed as Python functions that return a Boolean value; a solution is any input that makes the function evaluate to True. This “programming puzzle” format is highly expressive—it can encode NP‑complete problems, algebraic equations, or even open mathematical conjectures—while remaining automatically verifiable by executing the code in a sandboxed environment.
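To make the puzzle format concrete, here is a minimal illustrative puzzle in that style (our own toy example, not one from the paper): the function takes a candidate input and returns True exactly when the candidate solves it.

```python
def puzzle(s: str) -> bool:
    """Find a string that starts with 'hello' and, when concatenated
    with 'world', has total length 10."""
    return s.startswith("hello") and len(s + "world") == 10

# "hello" satisfies both conditions, so it is a valid solution.
assert puzzle("hello")
```

Verification requires nothing beyond running the function on the proposed input, which is what makes the format automatically checkable at scale.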
During a duel, the proposer first writes a puzzle and a private scratch‑pad reasoning trace, then submits the final puzzle together with its own solution. If the solution fails verification, the proposer instantly loses the round. If the solution is correct, the opponent (the solver) receives only the puzzle code and must find any input that satisfies it. A successful solver results in a draw; a failure awards a point to the proposer. Each round’s outcome is recorded in a shared history, allowing models to adapt their strategies in later rounds (e.g., avoiding repeated mistakes or increasing difficulty).
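The round-scoring logic described above can be sketched as follows (a simplified reconstruction with our own function names, not the paper's code; we assume the solver is credited when the proposer's own solution fails):

```python
def verify(puzzle, solution) -> bool:
    """A solution is valid only if the puzzle returns exactly True on it."""
    try:
        return puzzle(solution) is True
    except Exception:
        return False  # malformed inputs or runtime errors count as failure

def play_round(puzzle, proposer_solution, solver_solution):
    """Return the round's points as (proposer, solver)."""
    if not verify(puzzle, proposer_solution):
        return (0, 1)  # proposer cannot solve its own puzzle: instant loss
    if verify(puzzle, solver_solution):
        return (0, 0)  # solver cracks the puzzle: draw
    return (1, 0)      # valid puzzle the solver fails on: point to proposer
```

In the full framework these outcomes would accumulate in the shared history that both models see before the next round.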
TTG aggregates the outcomes of many pairwise duels and fits an Elo‑style rating using the Bradley‑Terry likelihood model. The standard Elo formula (with the usual scale factor of 400) translates rating differences into win probabilities, providing a single scalar measure of relative reasoning ability. The authors evaluated ten frontier models—including GPT‑4o, Claude‑3.5, Gemini‑1.5, and Llama‑2‑70B—by having each pair play multiple 5‑round duels. The resulting Elo rankings correlate strongly with established reasoning benchmarks: ρ = 0.58 with Humanity's Last Exam (HLE) and ρ = 0.63 with GPQA Diamond. Notably, win rates when models act as solvers correlate even more tightly (ρ ≈ 0.75 with both HLE and GPQA), indicating that TTG captures traditional problem‑solving competence.
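The Elo-to-probability mapping is the standard logistic formula; a minimal sketch:

```python
def elo_win_prob(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B under the Elo model:
    a rating gap of `scale` points corresponds to 10:1 odds."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

# Equal ratings give a 50% expected score; a 200-point edge gives ~76%.
print(elo_win_prob(1500, 1500))  # 0.5
print(round(elo_win_prob(1600, 1400), 2))  # 0.76
```

Fitting the ratings then amounts to choosing the r values that maximize the Bradley-Terry likelihood of the observed duel outcomes.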
A key contribution of TTG is that it also measures a model’s ability to generate novel, solvable puzzles—a dimension largely absent from existing benchmarks. The authors find that puzzle‑creation performance shows low correlation with HLE/GPQA, highlighting that many state‑of‑the‑art models struggle with self‑assessment and creative problem design. Overconfidence is a common failure mode: models often propose puzzles they cannot solve themselves, leading to losses when the opponent fails to find a solution. The sandboxed execution also catches malformed code, runtime errors, or timeouts, automatically penalizing such proposals.
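Sandboxed verification of the kind described above can be approximated by running the puzzle in a separate process with a time limit, so that crashes, hangs, and malformed code all count as failures. This is our own simplified sketch (process isolation alone is not a full security sandbox, and we assume the submitted code defines a function named `puzzle`):

```python
import multiprocessing as mp

def _run(code: str, candidate, q: mp.Queue) -> None:
    env: dict = {}
    exec(code, env)  # define the puzzle from its source (assumed name: `puzzle`)
    q.put(env["puzzle"](candidate) is True)

def verify_sandboxed(code: str, candidate, timeout: float = 5.0) -> bool:
    """Run a puzzle on a candidate input in a child process.
    Runtime errors, non-True results, and timeouts all return False."""
    q: mp.Queue = mp.Queue()
    p = mp.Process(target=_run, args=(code, candidate, q))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()  # puzzle exceeded the time limit
        p.join()
        return False
    try:
        return q.get_nowait()
    except Exception:
        return False  # child crashed before reporting a result
```

Because the child process dies on any exception, a proposal with malformed code is automatically penalized without special-casing each failure mode.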
The paper discusses limitations and future directions. Current timeouts (e.g., 5 seconds) restrict the complexity of puzzles that can be reliably evaluated, especially for genuine NP‑complete instances. Extending the framework to multi‑model tournaments, introducing diversity metrics for puzzle topics and difficulty, and combining automated scoring with human expert review are proposed avenues. Moreover, integrating static analysis to filter overly complex or ambiguous code could improve the quality of generated puzzles.
In summary, The Token Games provide a scalable, low‑cost, and self‑sustaining benchmark that simultaneously evaluates LLM reasoning, creativity, and self‑evaluation. By turning models into both problem creators and solvers, TTG mitigates data‑contamination concerns, avoids saturation, and opens a path toward continual assessment as models advance. The strong alignment with existing human‑curated benchmarks validates its effectiveness, while the novel focus on puzzle generation reveals new challenges for future LLM development.