LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv source.

Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-Fair offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.


💡 Research Summary

The paper addresses a growing “evaluation crisis” in large language model (LLM) research, where static benchmarks are increasingly vulnerable to data contamination and leaderboard over‑fitting, leading to inflated scores that do not reflect true generalization ability. To remedy this, the authors introduce LLMEval‑Fair, a comprehensive framework for dynamic, contamination‑resistant evaluation. The core of the system is a proprietary bank of over 220,000 graduate‑level exam questions collected from Chinese universities across 13 disciplines. Expert annotators filtered the pool down to 78,009 high‑quality original items, and an LLM‑driven augmentation pipeline expanded each item into multiple formats (multiple‑choice, fill‑in‑the‑blank, true/false, short answer, etc.), yielding a diverse, private dataset that is difficult for models to memorize.
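The augmentation step could be modeled roughly as follows. This is an illustrative sketch only: the field names and the stubbed `augment` function are assumptions, since the actual LLM‑driven rewriting prompts are not shown in the summary; only the four target formats are named in the text.

```python
from dataclasses import dataclass, field

# Target formats named in the summary; the concrete rewriting is done
# by an LLM in the actual pipeline and is only stubbed here.
FORMATS = ("multiple_choice", "fill_in_the_blank", "true_false", "short_answer")

@dataclass
class ExamItem:
    discipline: str   # one of the 13 disciplines
    stem: str         # original question text
    answer: str       # gold answer (never sent to the model under test)
    variants: list = field(default_factory=list)

def augment(item: ExamItem) -> ExamItem:
    """Expand one curated item into multiple question formats.

    Placeholder for the LLM-driven rewrite: each variant retains its
    source stem and answer so provenance can be traced.
    """
    for fmt in FORMATS:
        item.variants.append({"format": fmt, "stem": item.stem, "answer": item.answer})
    return item
```

Keeping every variant linked to its source item is what makes a per-item quota and contamination audit (described below) straightforward.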

Evaluation proceeds in three stages. First, for each model run, a unique set of 1,000 questions is sampled without replacement from the bank, and the order is fixed to prevent “cherry‑picking.” Second, a two‑layer anti‑cheating architecture protects the process: an outer layer uses JWT‑based authentication and role‑based access control to restrict API access, while an inner layer enforces quota limits, tracks question allocation, and strips all answer text from the payload sent to the model, preventing answer leakage. Third, a calibrated ranking system combines a single LLM‑as‑a‑Judge (GPT‑4o) with a relative scoring (Elo‑style) mechanism. The judge assigns a 0‑3 integer score based on correctness and explanation quality; absolute scores are normalized to a 0‑100 scale, and each model’s score is expressed relative to a reference model (Doubao‑1.5‑Thinking‑Pro). This relative metric ensures that rankings remain stable even when different question subsets are used.
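The sampling and relative-scoring stages can be sketched as below. This is a minimal illustration under the assumption that per-question judge scores (0‑3) are averaged, rescaled to 0‑100, and reported as a difference from the reference model's score on the same question subset; the paper's exact formula may differ.

```python
import random

def sample_run(question_bank, n=1000, seed=None):
    """Draw a unique question set for one evaluation run.

    Sampling is without replacement; the returned order is frozen so
    every model in the run sees the same questions in the same sequence.
    """
    rng = random.Random(seed)
    return rng.sample(question_bank, n)

def normalize(judge_scores):
    """Map per-question 0-3 judge scores onto a 0-100 scale."""
    return 100.0 * sum(judge_scores) / (3 * len(judge_scores))

def relative_score(model_scores, reference_scores):
    """Express a model's normalized score relative to the reference
    model (Doubao-1.5-Thinking-Pro in the paper) on the same subset."""
    return normalize(model_scores) - normalize(reference_scores)
```

Anchoring every score to a shared reference model is what makes rankings comparable across runs that use different question subsets.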

Human‑machine agreement was validated by comparing GPT‑4o judgments with expert ratings, achieving a 90% agreement rate (Cohen’s κ > 0.70). The framework was deployed in a 30‑month longitudinal study, evaluating nearly 60 proprietary and open‑source LLMs (each model evaluated at least three times). Over 180,000 evaluation datapoints were collected. Key findings include: (1) a performance ceiling around 90% on knowledge memorization, with persistent gaps in specialized domains such as literature and medicine; (2) dynamic rankings diverge substantially from static benchmark rankings, revealing that many static scores are inflated by data contamination; (3) the relative ranking system exhibits negligible variance under multi‑round resampling and different sample sizes, confirming its fairness and robustness.
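The agreement check above amounts to a standard Cohen's κ computation over the judge and expert score labels. The sketch below uses only the standard library and assumes simple (unweighted) κ; the paper does not state whether a weighted variant was used.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two raters' label sequences
    (e.g. GPT-4o judge scores vs. expert scores on the same items)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: dot product of the raters' marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum((ca[label] / n) * (cb[label] / n) for label in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)
```

Values above 0.6 are conventionally read as substantial agreement, so κ > 0.70 alongside 90% raw agreement supports using a single LLM judge in place of expert panels.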

All code, data pipelines, and evaluation scripts are released on GitHub, enabling the community to reproduce the anti‑cheating mechanisms and relative scoring methodology. The authors argue that dynamic, contamination‑resistant evaluation should become the new standard for LLM assessment, providing a more trustworthy basis for model comparison and guiding future research toward genuine generalization rather than memorization of benchmark data.

