Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs
Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a purely data-driven framework for auditing evaluation sets along three orthogonal axes: (1) Capability Discrimination, measuring how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, estimating the headroom remaining before ceiling effects erode resolution, and thus the benchmark’s expected longevity; and (3) Impact, quantifying influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical reports of 91 representative models in 2025, we systematically characterize the evaluation landscape. BHI is the first framework to quantify benchmark health at a macro level, providing a principled basis for benchmark selection and enabling dynamic lifecycle management for next-generation evaluation protocols.
💡 Research Summary
The paper addresses a growing problem in the evaluation of large language models (LLMs): the benchmarks that have traditionally been used to track progress are losing their discriminative power, becoming saturated, and often no longer reflecting real‑world impact. To remedy this, the authors propose the Benchmark Health Index (BHI), a systematic, data‑driven framework that quantifies the “health” of any evaluation set along three orthogonal dimensions: (1) Capability Discrimination, (2) Anti‑Saturation, and (3) Impact.
Capability Discrimination measures how well a benchmark can separate models of different abilities. It combines two sub-metrics: the Effective Differentiation Ratio (EDR), which counts the proportion of model-pair score differences that exceed a noise threshold set at 2% of the observed score range, and the Robust Coefficient of Variation (RCV), which captures the spread of scores using the interdecile range (P90 − P10) normalized to a 0–100 scale. Both sub-metrics are min-max normalized to [0, 1] before being combined into the Capability Discrimination score.
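To make these definitions concrete, the Python sketch below implements one plausible reading of the two sub-metrics. The function names (`edr`, `rcv`), the toy score array, and the choice to divide the interdecile range by 100 (i.e., assuming scores already live on a 0–100 scale) are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def edr(scores: np.ndarray, noise_frac: float = 0.02) -> float:
    """Effective Differentiation Ratio: the fraction of model-pair score
    differences exceeding a noise threshold set at `noise_frac` (2% by
    default) of the observed score range."""
    scores = np.asarray(scores, dtype=float)
    threshold = noise_frac * (scores.max() - scores.min())
    # Enumerate all unordered pairs of model scores.
    i, j = np.triu_indices(len(scores), k=1)
    diffs = np.abs(scores[i] - scores[j])
    return float(np.mean(diffs > threshold))

def rcv(scores: np.ndarray) -> float:
    """Robust Coefficient of Variation: spread measured by the
    interdecile range (P90 - P10), here divided by 100 under the
    assumption that scores are reported on a 0-100 scale."""
    p10, p90 = np.percentile(scores, [10, 90])
    return float(p90 - p10) / 100.0

# Toy example: hypothetical scores of seven models on one benchmark.
scores = np.array([41.2, 55.0, 58.3, 62.1, 70.4, 71.0, 83.5])
print(f"EDR = {edr(scores):.3f}, RCV = {rcv(scores):.3f}")
```

On this toy data, most pairwise gaps clear the 2% threshold, so EDR is high; a saturated benchmark where all models cluster near the ceiling would instead yield low values on both sub-metrics.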