Aligning Language Model Benchmarks with Pairwise Preferences
Language model benchmarks are pervasive and computationally efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. Towards bridging this gap, we introduce benchmark alignment, where we use limited amounts of information about model performance to automatically update offline benchmarks, aiming to produce new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem, which learns preference-aligned weightings for benchmark questions using the question-level performance of language models alongside ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that our aligned benchmarks can accurately rank unseen models according to models of human preferences, even across different sizes, while remaining interpretable. Overall, our work provides insights into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.
💡 Research Summary
The paper introduces the concept of benchmark alignment, a new problem that seeks to adapt existing static language‑model (LM) benchmarks so that their rankings better reflect external preferences, such as human judgments of helpfulness or honesty. Traditional benchmarks treat every test item as equally important and evaluate models by a single aggregate score. However, numerous recent studies have shown that high benchmark scores often do not translate into superior real‑world performance, especially when the end‑user’s preferences are taken into account. To bridge this gap, the authors propose BenchAlign, the first method that learns preference‑aligned weightings for individual benchmark questions using only a limited set of model‑level performance data together with pairwise preference rankings of those models.
Problem definition
Given a benchmark \(D = (Q, s)\) with a set of questions \(Q\) and a scoring function \(s\), and a set of language models \(F\), the goal is to construct a new benchmark \(\hat D = (\hat Q, s)\) whose question weights \(\hat w\) produce system-level scores that induce a ranking \(\hat R\) close to a target ranking \(R_T\) derived from human or other external preferences. The target ranking is expressed as pairwise comparisons (model \(i\) preferred over model \(j\)).
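The pairwise form of the target ranking can be sketched as follows; `ranking_to_pairs` is a hypothetical helper (not from the paper) that expands a total order, best model first, into the ordered comparisons used as supervision:

```python
def ranking_to_pairs(ranking):
    """Expand a total order over model indices (best first) into
    ordered preference pairs (i, j), meaning i is preferred over j.

    Hypothetical helper for illustration, not from the paper."""
    return [(ranking[a], ranking[b])
            for a in range(len(ranking))
            for b in range(a + 1, len(ranking))]
```

For example, a target ranking placing model 2 first, then model 0, then model 1 yields the pairs (2, 0), (2, 1), and (0, 1).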
Methodology
BenchAlign treats the alignment task as a learning-to-rank problem. For each model \(f_i\), the authors collect a vector \(x_i\) of instance-level binary scores \(a(f_i, q)\) (correct/incorrect) across all questions. A single-layer linear model with weight vector \(\hat w\) computes a relevance score \(\hat y_i = \hat w^\top x_i\). Pairwise training data are generated from the target ranking: for every ordered pair \((i, j)\) where model \(i\) is preferred to model \(j\), the loss penalizes cases where \(\hat y_i < \hat y_j\). The loss is a sigmoid-based pairwise cross-entropy (Equation 2), optimized via mini-batch stochastic gradient descent (Algorithm 1). After training, the learned weights \(\hat w\) are applied to the original questions, yielding a new benchmark \(\hat D\). When a previously unseen model is evaluated on \(\hat D\), its weighted score should align with the external preference ordering.
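A minimal sketch of this training loop in plain Python, assuming binary score vectors and a sigmoid pairwise cross-entropy loss as described above; the function name and hyperparameters are illustrative, and per-pair updates are used for brevity where the paper's Algorithm 1 uses mini-batch SGD:

```python
import math
import random

def train_weights(X, pref_pairs, n_epochs=200, lr=0.1, seed=0):
    """Learn per-question weights w so that the weighted score
    sum_k w[k] * X[i][k] respects the preference pairs.

    X[i][k] is model i's binary score on question k;
    pref_pairs lists ordered pairs (i, j) with i preferred over j.
    Illustrative sketch, not the authors' implementation."""
    rng = random.Random(seed)
    n_questions = len(X[0])
    w = [0.0] * n_questions
    for _ in range(n_epochs):
        pairs = list(pref_pairs)
        rng.shuffle(pairs)
        for i, j in pairs:
            # score margin between preferred and dispreferred model
            diff = sum(w[k] * (X[i][k] - X[j][k])
                       for k in range(n_questions))
            # gradient of -log sigmoid(diff), the pairwise cross-entropy
            grad = 1.0 / (1.0 + math.exp(-diff)) - 1.0
            for k in range(n_questions):
                w[k] -= lr * grad * (X[i][k] - X[j][k])
    return w
```

Questions on which preferred and dispreferred models score identically receive no gradient, so the learned weights concentrate on questions that actually discriminate along the preference axis.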
Experimental setup
The authors use the Open LLM Leaderboard dataset, containing responses from 4,576 models on six widely used benchmarks (BigBench Hard, MMLU-Pro, MuSR, MATH, GPQA, IFEval), totaling 21,606 questions. Human-like preferences are simulated using two reward models, HelpSteer and UltraFeedback, trained to predict "helpfulness" and "honesty". These reward models generate pairwise rankings over the models, which serve as the target ranking \(R_T\).
BenchAlign is compared against two benchmark‑distillation baselines that select a binary subset of questions (weight = 1 for selected, 0 otherwise). The authors evaluate three research questions: (RQ1) generalization across model sizes, (RQ2) impact of limited data, and (RQ3) robustness to arbitrary source‑target model splits.
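Ranking quality in such experiments is typically summarized by the Spearman rank correlation between the aligned benchmark's ranking and the target ranking. A minimal pure-Python version (assuming no tied scores; a library routine would handle ties) looks like:

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two score lists,
    assuming no ties. Returns a value in [-1, 1]."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # standard closed form for the tie-free case
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Identical orderings give 1.0, fully reversed orderings give -1.0, matching the ρ values quoted in the results below.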
Key results
- Generalization across sizes: Training BenchAlign only on small (≤ 13 B) and medium-sized models yields high correlation (Spearman ρ ≈ 0.62) when ranking much larger models (30 B–70 B) on both the helpfulness and honesty dimensions. The baselines degrade sharply for larger models, confirming that learned continuous weights capture preference-relevant signal beyond mere model similarity.
- Data efficiency: Using as little as 25 % of the original questions (≈ 5 k items) still produces benchmarks that outperform the baselines; even with 10 % of items the Spearman correlation remains above 0.55. This demonstrates that a modest amount of preference data suffices for effective alignment.
- Robustness to source-target splits: When the set of models used for training is randomly replaced, or when entirely new models are evaluated, BenchAlign's ranking performance remains stable. This suggests that the learned weights reflect intrinsic question relevance to the preference signal rather than overfitting to particular model behaviours.
- Interpretability: The scalar weights \(\hat w\) can be inspected to identify which questions drive the alignment for each preference axis. For "helpfulness", items requiring multi-step reasoning or contextual clarification receive higher weights; for "honesty", fact-checking and citation-type questions dominate. Such insights enable benchmark designers to curate more purpose-aligned test suites.
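Because the learned weights are scalars attached to individual questions, this kind of inspection is straightforward; a hypothetical helper (not from the paper) for surfacing the most influential questions per preference axis might look like:

```python
def top_questions(weights, question_ids, k=3):
    """Return the k question ids with the largest learned weights.

    Hypothetical inspection helper for illustration only."""
    ranked = sorted(zip(question_ids, weights), key=lambda t: -t[1])
    return [qid for qid, _ in ranked[:k]]
```

Running this once per preference axis (helpfulness, honesty) gives the per-axis question lists described above.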
Limitations and future work
The approach relies on the quality of the simulated human preferences; any bias in the reward models propagates into the aligned benchmark. Direct human‑annotated pairwise rankings would provide a stronger validation. Moreover, the current formulation assumes binary correctness scores; extending to graded or generative evaluations (e.g., BLEU, ROUGE, or open‑ended quality scores) is an open direction. Finally, while BenchAlign produces a static, preference‑aligned benchmark, integrating online user feedback to adapt weights dynamically could further close the gap between evaluation and deployment.
Conclusion
BenchAlign offers a practical, data‑efficient solution to the benchmark‑alignment problem. By learning continuous, interpretable question weights from limited model‑performance and pairwise preference data, it creates static benchmarks that reliably predict how unseen models will be ranked according to real‑world user preferences. This method promises to reduce the reliance on costly human‑labelled leaderboards, improve model selection confidence, and accelerate the development of language models that truly serve end‑user needs.