SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges

Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others’ weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial reasoning), having LLMs creatively pose challenges enables open-ended and scalable evaluation. As a proof of concept, we introduce LLM-set code-output-prediction (COP) challenges as a verifiable and extensible framework in which to test our approach. Using a TrueSkill-based ranking system, we evaluate six frontier LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine-grained capability differences between models. Our findings are an important step towards general, scalable evaluation frameworks which can keep pace with LLM progress.


💡 Research Summary

The paper introduces SKATE (Scalable Tournament Evaluation), a novel framework that turns the evaluation of large language models (LLMs) into an automated, competitive game. Instead of relying on static, human‑crafted benchmarks, SKATE lets each model act both as a task‑setter and a solver. In each round, every model generates a verifiable Code‑Output Prediction (COP) question for all other participants, then attempts to answer its own and the competitors’ questions. A COP question consists of a code snippet whose output can be deterministically obtained by running the code in a sandbox; this guarantees an objective ground‑truth answer without human judges.
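Because a COP question's ground truth is just the output of running its code, verification can be fully automatic. The paper's sandbox internals are not described here, but a minimal sketch of the idea, using a subprocess with a timeout as a stand-in sandbox (the function name and limits are illustrative), looks like this:

```python
import subprocess
import sys

def verify_cop_question(code_snippet, timeout_s=5.0):
    """Run a candidate COP snippet in a subprocess and return its stdout,
    or None if it errors out or times out. Illustrative stand-in for the
    paper's code-execution sandbox (not the actual implementation)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code_snippet],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:  # the code must execute without error
        return None
    return result.stdout

# A valid COP question: deterministic code with a single correct output.
snippet = "print(sum(i * i for i in range(5)))"
ground_truth = verify_cop_question(snippet)  # "30\n"
```

A real deployment would need stronger isolation (resource limits, no network, restricted filesystem) than a bare subprocess provides.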

Question generation is constrained by three criteria: (1) Verifiability – the code must execute without error; (2) Distractor‑rich – the model must produce nine distinct incorrect answer choices; (3) Uniqueness – the new question must be sufficiently dissimilar from any previously generated question by the same model, measured with cosine similarity on text embeddings (threshold d_thresh = 0.336). These constraints push models to craft questions that align with their own strengths while avoiding repetition, thereby surfacing a diverse set of capabilities.
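The uniqueness criterion can be sketched as a simple filter over cosine distances between the new question's embedding and the setter's previous ones. The embedding model is not specified here, so the vectors below are placeholders; only the threshold value comes from the summary above:

```python
import math

D_THRESH = 0.336  # minimum cosine distance to prior questions (from the paper)

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def is_sufficiently_novel(new_emb, prior_embs, d_thresh=D_THRESH):
    """Accept a new question only if its embedding is at least d_thresh
    away (in cosine distance) from every question this model has set."""
    return all(cosine_distance(new_emb, e) >= d_thresh for e in prior_embs)
```

A near-duplicate embedding fails the check, while an orthogonal one (distance 1.0) passes; rejected questions would simply be regenerated.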

Answer scoring addresses the known sensitivity of multiple‑choice formats to option ordering. For each question, the correct answer is randomly shuffled among the distractors and the model’s response is sampled repeatedly until the standard error of the estimated accuracy p(correct) falls below a target of σ* = 0.05. This yields a robust probability of correctness for every model‑question pair.
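This adaptive sampling loop can be sketched as follows. The stopping rule uses the binomial standard error of the running accuracy estimate; the `answer_once` callable, the minimum-sample floor, and the sample cap are illustrative assumptions, not details from the paper:

```python
import math
import random

SIGMA_STAR = 0.05  # target standard error for p(correct), per the paper

def estimate_p_correct(answer_once, min_samples=10, max_samples=400):
    """Repeatedly query a model on option-shuffled versions of a question
    until the standard error of the accuracy estimate drops below SIGMA_STAR.
    `answer_once` is a hypothetical callable returning True iff one attempt
    was answered correctly."""
    correct = 0
    p_hat = 0.0
    for n in range(1, max_samples + 1):
        correct += bool(answer_once())
        p_hat = correct / n
        std_err = math.sqrt(p_hat * (1 - p_hat) / n)
        if n >= min_samples and std_err < SIGMA_STAR:
            break
    return p_hat

# Simulated solver that answers correctly ~70% of the time:
rng = random.Random(0)
p = estimate_p_correct(lambda: rng.random() < 0.7)
```

Near p = 0.7 the rule needs on the order of 0.7 × 0.3 / 0.05² ≈ 84 samples, while questions a model always (or never) solves terminate at the minimum-sample floor.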

Model rankings are derived using the TrueSkill algorithm, a Bayesian rating system originally designed for online games. Each model’s skill is represented by a mean μ and uncertainty σ; after every round, the observed successes (correct answers) update these parameters. Over many rounds the system converges to a stable hierarchy, even when initial uncertainties are high.
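The shape of a TrueSkill update can be illustrated with the classic two-player win/loss case (no draws, no skill dynamics). This is a simplified sketch, not the paper's full implementation — SKATE feeds in p(correct) observations across many pairings, and a production system would typically use the `trueskill` Python package rather than hand-rolled math — but it shows how the winner's mean rises, the loser's falls, and both uncertainties shrink:

```python
import math

BETA = 25 / 6  # performance noise scale (conventional TrueSkill default)

def _v(x):
    """Additive mean-update factor: normal pdf(x) / cdf(x)."""
    pdf = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(x / math.sqrt(2)))
    return pdf / cdf

def _w(x):
    """Multiplicative variance-update factor: v(x) * (v(x) + x)."""
    v = _v(x)
    return v * (v + x)

def trueskill_update(winner, loser, beta=BETA):
    """One simplified 1-vs-1 TrueSkill update. `winner`/`loser` are
    (mu, sigma) pairs; returns the updated pairs."""
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    c = math.sqrt(2 * beta**2 + s_w**2 + s_l**2)
    t = (mu_w - mu_l) / c
    mu_w_new = mu_w + (s_w**2 / c) * _v(t)
    mu_l_new = mu_l - (s_l**2 / c) * _v(t)
    s_w_new = s_w * math.sqrt(max(1 - (s_w**2 / c**2) * _w(t), 1e-9))
    s_l_new = s_l * math.sqrt(max(1 - (s_l**2 / c**2) * _w(t), 1e-9))
    return (mu_w_new, s_w_new), (mu_l_new, s_l_new)

# Two fresh players at the conventional prior (mu=25, sigma=25/3):
a, b = trueskill_update((25.0, 25 / 3), (25.0, 25 / 3))
```

Repeating such updates over many rounds is what drives the convergence to a stable hierarchy described above: each result nudges the means apart and tightens the sigmas.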

The authors evaluate six state‑of‑the‑art LLMs (including GPT‑4o, Gemini 1.5, Claude 3, etc.) across 50 rounds of play. Three key findings emerge: (1) p(correct) scores on questions set by weaker models still cleanly separate the stronger solvers, demonstrating that even less capable LLMs can reliably differentiate and rank higher‑performing peers. (2) Models exhibit self‑preferencing behavior: the reward scheme (+1 for a valid question, +1 for each correct answer) incentivises them to generate questions they can answer well, confirming that the game dynamics steer task creation toward each model’s own strengths. (3) Automatic clustering of questions based on embedding similarity uncovers “discriminatory” questions that expose fine‑grained capability gaps, allowing the system to surface nuanced performance differences without human annotation.
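The clustering step that surfaces capability gaps could be approximated by any standard method over question embeddings. As one hedged illustration (the paper's actual algorithm is not specified here), a greedy leader-clustering pass groups questions whose embeddings fall within a cosine-distance threshold of a cluster's first member; the threshold value reuses the uniqueness threshold from above purely for illustration:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def greedy_cluster(embeddings, d_thresh=0.336):
    """Greedy leader clustering: join the first cluster whose leader is
    within d_thresh cosine distance, else start a new cluster. Returns
    clusters as lists of question indices."""
    clusters = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine_distance(emb, embeddings[cluster[0]]) < d_thresh:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Per-model p(correct) could then be averaged within each cluster to spot topics where one model's performance diverges from the rest.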

Limitations are acknowledged. The current implementation restricts tasks to multiple‑choice COP, which may miss abilities better captured by open‑ended generation, code synthesis, or multi‑step reasoning. Distractor generation quality varies across models, potentially biasing difficulty. Reliance on a code‑execution sandbox introduces latency and security considerations that could hinder scaling to massive model fleets. Moreover, self‑preferring bias could lead to evaluations that over‑emphasise a model’s niche strengths while under‑representing its weaknesses.

Future work is outlined: extending the framework to other verifiable task families (e.g., theorem proving, game tree evaluation), improving automatic distractor quality via meta‑prompting, exploring richer augmentation strategies where models can exploit information about opponents’ past performance, and optimizing sandbox infrastructure for high‑throughput evaluation.

In sum, SKATE demonstrates that a fully automated, data‑free, and objective evaluation pipeline is feasible by turning LLMs into both challengers and respondents. It offers a scalable solution that keeps pace with rapid model advances, automatically reveals subtle capability differences, and reduces the human labor traditionally required for benchmark construction and scoring. This approach could become a cornerstone for ongoing monitoring and alignment of increasingly powerful language models.

