Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences
Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks, MMLU-Pro and GPQA, we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and on 16-38% even among top-performing frontier models. These discrepancies indicate distinct error profiles across LLMs. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80% and, in some cases, reverse their sign. Together, these findings illustrate a benchmark illusion, where equal accuracy may conceal disagreement, making model choice a hidden yet consequential variable for scientific reproducibility.
💡 Research Summary
The paper introduces the concept of a “benchmark illusion”: the observation that large language models (LLMs) can achieve nearly identical scores on standard reasoning benchmarks yet disagree substantially on which specific items they answer correctly. Using two widely cited benchmarks, MMLU-Pro (12,032 multiple-choice questions across 14 disciplines) and GPQA (448 expert-level science questions), the authors compute pairwise disagreement rates among a broad set of contemporary models. Even when models have comparable overall accuracies, the proportion of items on which any two models give different answers ranges from 16% to 66% (MMLU-Pro) and from 17% to 65% (GPQA). Among the highest-performing frontier models (accuracy > 60%), disagreement remains at 16-38% for MMLU-Pro and 17-32% for GPQA. Because prompts and decoding are held constant, these differences reflect systematic variation in how models represent and reason about knowledge, not random noise.
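The pairwise disagreement rate described above can be sketched in a few lines. The answer vectors below are hypothetical, constructed only to show that two models with identical accuracy against the answer key can still diverge heavily at the item level:

```python
# Pairwise disagreement: the fraction of items on which two models
# give different answers, regardless of which answer is correct.
def disagreement_rate(answers_a, answers_b):
    if len(answers_a) != len(answers_b):
        raise ValueError("answer lists must cover the same items")
    differing = sum(a != b for a, b in zip(answers_a, answers_b))
    return differing / len(answers_a)

# Hypothetical 6-item benchmark: both models score 4/6 against the key,
# yet they pick different options on 4 of the 6 items.
key     = ["A", "B", "C", "D", "A", "B"]
model_1 = ["A", "B", "C", "D", "C", "D"]  # 4/6 correct
model_2 = ["A", "B", "A", "B", "A", "B"]  # 4/6 correct
print(disagreement_rate(model_1, model_2))  # 0.666...
```

Equal aggregate accuracy places no ceiling on item-level disagreement, which is exactly the gap the benchmark-illusion argument exploits.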
The authors then argue that such hidden divergence has concrete consequences for scientific research that relies on LLMs for data annotation. They formalize a measurement-error framework: if the annotation error e = Ŷ − Y is correlated with covariates X or the true outcome Y, ordinary-least-squares (OLS) estimates of treatment effects become biased. The bias magnitude and direction depend on the error profile of the chosen model, turning model selection into an implicit, unreported specification choice.
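A minimal numerical illustration of this measurement-error logic follows. The data-generating process, error magnitudes, and seed are assumptions made for this sketch, not values from the paper; the point is only that error uncorrelated with X leaves the OLS slope intact, while error correlated with X shifts it by cov(e, x)/var(x):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 2.0 + 1.0 * x + rng.normal(size=n)      # true slope = 1.0

# Classical (random) measurement error: independent of x.
y_random = y + rng.normal(scale=1.0, size=n)

# Systematic error: correlated with x (here, e grows with x).
y_biased = y + 0.5 * x + rng.normal(scale=1.0, size=n)

def ols_slope(x, y):
    # Simple-regression slope: cov(x, y) / var(x).
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

print(ols_slope(x, y_random))   # close to the true slope, 1.0
print(ols_slope(x, y_biased))   # shifted by cov(e, x)/var(x) = 0.5
```

With random error the slope estimate stays near 1.0 (only its variance grows); with x-correlated error it lands near 1.5, illustrating why the direction of an annotator's bias, not just its rate, determines the downstream estimate.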
To illustrate the point, a simulation study is presented. Three synthetic “AI annotators” are defined:
- Annotator 1 – low overall accuracy (85 %) with purely random errors.
- Annotator 2 – high accuracy (93.6 %) but systematic: it under‑detects positive outcomes in the treated group (25 % error) while almost never misclassifying positives in the control group (1 % error).
- Annotator 3 – similarly high accuracy (93.8 %) but the opposite systematic bias: it under‑detects positives in the control group (30 % error) and only rarely in the treated group (3 % error).
All three annotators label a simulated dataset of 10,000 observations generated from a logistic model where the true treatment effect is 1.0. Logistic regressions using each annotator’s labels produce estimated treatment effects of approximately 0.62 (random error, classic attenuation), 0.37 (severe under‑estimation due to bias in the treated group), and 1.43 (over‑estimation because the bias is concentrated in the control group). Thus, two models with nearly identical aggregate accuracies can induce biases that are larger—and opposite in sign—than those from a less accurate but random‑error model.
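The three-annotator simulation can be approximated as below. Only the true effect of 1.0, the sample size of 10,000, and the group-specific error rates come from the summary; the intercept, seed, and flip mechanics are assumptions, so the printed estimates will differ somewhat from the paper's 0.62/0.37/1.43 while showing the same pattern:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Assumed data-generating process: binary treatment, true effect of
# 1.0 on the log-odds scale (intercept of -0.5 is an assumption).
t = rng.integers(0, 2, size=n)
p = 1 / (1 + np.exp(-(-0.5 + 1.0 * t)))
y = rng.random(n) < p

def flip_positives(y, t, rate_treated, rate_control):
    """Flip true positives to negatives at group-specific rates."""
    flip = np.where(t == 1, rate_treated, rate_control)
    return y & (rng.random(len(y)) >= flip)

# Annotator 1: purely random symmetric noise (15% flip rate).
y1 = np.where(rng.random(n) < 0.15, ~y, y)
# Annotator 2: misses positives mainly in the treated group.
y2 = flip_positives(y, t, rate_treated=0.25, rate_control=0.01)
# Annotator 3: misses positives mainly in the control group.
y3 = flip_positives(y, t, rate_treated=0.03, rate_control=0.30)

def log_odds_ratio(y_hat, t):
    # With a single binary covariate, the logistic-regression
    # coefficient equals the log odds ratio between groups.
    p1, p0 = y_hat[t == 1].mean(), y_hat[t == 0].mean()
    return np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))

for label, y_hat in [("random", y1), ("treated-biased", y2),
                     ("control-biased", y3)]:
    print(label, round(log_odds_ratio(y_hat, t), 2))
# random noise attenuates toward 0; annotator 2 underestimates and
# annotator 3 overestimates the true effect of 1.0
```

Because the only covariate is a binary treatment indicator, the log odds ratio is computed directly rather than via a fitted logistic regression; the qualitative result (attenuation, underestimation, overestimation) matches the paper's three cases.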
The paper then validates these findings with two real‑world case studies:
- Education study – Re-analysis of a large randomized trial of a literacy intervention (Kim et al., 2021). Original human-graded essay scores yielded a treatment effect of 0.44. When the same essays were scored by eight high-performing LLMs, the estimated effects ranged from 0.19 to 0.35, an 80% swing despite identical experimental design, data, and analysis code.
- Political-science study – Re-analysis of research on selective attribution in Russian state media (Rozenas & Stukal, 2019). Some LLMs reproduced the original finding that officials are more likely to be credited for good news, while other models reversed the pattern, suggesting officials are blamed for bad news. Model choice alone flips the sign of the key coefficient.
These empirical demonstrations underscore that current LLM evaluation practices, which focus on average accuracy, are misaligned with scientific needs. For many downstream tasks, researchers care not only about how often a model is wrong, but also where it is wrong, whether errors correlate with variables of interest, and how stable those error patterns are across time and domains.
The authors propose several concrete recommendations:
- Report disagreement metrics alongside benchmark scores, making transparent the extent of pairwise divergence among high‑performing models.
- Characterize error structures by testing for correlations between annotation errors and key covariates before downstream analysis.
- Adopt ensemble or bias‑correction strategies (e.g., Bayesian calibration, label smoothing) to mitigate model‑specific systematic errors.
- Develop standard protocols for LLM validation in scientific pipelines, analogous to inter‑coder reliability measures (Krippendorff’s α) used for human annotators.
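The second recommendation, characterizing error structure before downstream analysis, might look like the following on a gold-standard validation subsample. The data and function name here are hypothetical; the idea is simply to compare the annotator's error rate across levels of a key covariate such as treatment status:

```python
import numpy as np

def error_covariate_check(y_true, y_hat, covariate):
    """Compare annotation error rates across levels of a (binary)
    covariate, using a gold-standard validation subsample."""
    err = (np.asarray(y_true) != np.asarray(y_hat)).astype(float)
    cov = np.asarray(covariate)
    return {int(level): err[cov == level].mean()
            for level in np.unique(cov)}

# Hypothetical validation set: the annotator's errors are
# concentrated in the treated group (covariate == 1).
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
y_hat  = [0, 1, 0, 0, 0, 1, 1, 0]
treat  = [1, 1, 0, 1, 0, 0, 0, 0]
print(error_covariate_check(y_true, y_hat, treat))
```

A large gap between the per-group error rates is exactly the warning sign the paper's framework flags: error correlated with a covariate will bias any regression that uses those labels as the outcome.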
In conclusion, the study reveals a hidden source of variability, model-specific disagreement, that can dramatically alter scientific conclusions even when models appear equally competent on standard benchmarks. Recognizing and explicitly accounting for this “benchmark illusion” is essential for preserving reproducibility and credibility in an era where AI-driven annotation is becoming routine.