Beyond Arrow: From Impossibility to Possibilities in Multi-Criteria Benchmarking
Modern benchmarks such as HELM MMLU account for multiple metrics like accuracy, robustness, and efficiency. When these metrics are combined into a single ranking, natural aggregation procedures can become incoherent or unstable under changes in the model set. We formalize this aggregation as a social choice problem in which each metric induces a preference ranking over models on each dataset, and a benchmark operator aggregates these votes across metrics. While prior work has focused on Arrow’s impossibility result, we argue that the impossibility often originates from pathological examples and identify sufficient conditions under which it disappears and meaningful multi-criteria benchmarking becomes possible. In particular, we study three restrictions on the combinations of rankings and prove that on single-peaked, group-separable, and distance-restricted preferences, the benchmark operator allows for the construction of well-behaved rankings of the involved models. Empirically, we investigate several modern benchmark suites, including HELM MMLU, and verify which structural conditions are fulfilled on which benchmark problems.
💡 Research Summary
The paper tackles a pressing problem in modern machine‑learning evaluation: how to combine multiple performance criteria—accuracy, robustness, efficiency, fairness, etc.—into a single, meaningful ranking of models. Existing practices such as arithmetic‑mean aggregation or Borda‑score averaging often produce incoherent or unstable leaderboards, especially when metrics have different scales, outliers dominate, or ordinal criteria are involved.
To address this, the authors recast multi‑criteria benchmarking as a social‑choice problem. They define a finite set of models A, a set of datasets or tasks D, and a collection of metrics Φ = (ϕ₁,…,ϕₙ). Each metric evaluated on a particular dataset induces a complete, transitive preference relation R_{D,i} over the models (higher scores mean higher preference). The collection of all such relations for a given dataset forms a preference profile R_D. A benchmark operator B maps any profile to an overall ranking of the models; the simplest operator examined is the pairwise‑majority rule M, which declares model a at least as good as model b if a majority of metrics rank a above b.
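A minimal sketch of the pairwise-majority operator M described above (illustrative only; the model names and metric labels are hypothetical, not the paper's code or data):

```python
from itertools import combinations

def pairwise_majority(rankings):
    """Build the majority relation from a list of rankings.

    Each ranking lists models from best to worst. Returns a dict
    mapping ordered pairs (a, b) to True when a strict majority of
    rankings places a above b.
    """
    models = rankings[0]
    pos = [{m: r.index(m) for m in r} for r in rankings]
    beats = {}
    for a, b in combinations(models, 2):
        votes_a = sum(p[a] < p[b] for p in pos)
        votes_b = len(rankings) - votes_a
        beats[(a, b)] = votes_a > votes_b
        beats[(b, a)] = votes_b > votes_a
    return beats

# Three hypothetical metric rankings over three models
profile = [
    ["m1", "m2", "m3"],   # e.g., accuracy
    ["m1", "m3", "m2"],   # e.g., robustness
    ["m2", "m1", "m3"],   # e.g., efficiency
]
rel = pairwise_majority(profile)
# m1 beats m2 (2 of 3 metrics) and m1 beats m3 (3 of 3)
```

On this profile the majority relation happens to be transitive (m1 over m2 over m3); the point of the paper is characterizing when that is guaranteed rather than accidental.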
Arrow’s impossibility theorem tells us that no aggregation rule can satisfy a set of desirable axioms (Pareto efficiency, independence of irrelevant alternatives, non‑dictatorship, and transitivity) for all possible preference profiles. However, the authors argue that real‑world benchmark data rarely explore the full, unrestricted domain. Instead, they identify three natural domain restrictions that often hold in practice and that restore the possibility of coherent aggregation:
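A toy example (not from the paper) makes the failure on the unrestricted domain concrete: three metrics over three hypothetical models can produce a Condorcet cycle, so no transitive majority ranking exists.

```python
# The classic Condorcet cycle: three rankings, three models
profile = [
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["c", "a", "b"],
]

def majority_prefers(rankings, x, y):
    """True if a strict majority of rankings places x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) - wins

cycle = (
    majority_prefers(profile, "a", "b")
    and majority_prefers(profile, "b", "c")
    and majority_prefers(profile, "c", "a")
)
# cycle is True: a beats b, b beats c, c beats a
```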
- Single‑Peakedness – There exists a one‑dimensional ordering of models such that each metric has a single “peak” (its most preferred model) and preference declines monotonically as one moves away from that peak. Under this condition, pairwise majority never creates Condorcet cycles; a Condorcet winner is guaranteed and the majority relation is transitive.
- Group‑Separability – Models can be partitioned into independent groups (e.g., by size, architecture family). Within each group, all metrics induce consistent rankings, while inter‑group comparisons follow a separate, possibly simpler rule. This structure yields a hierarchical overall ranking that is both transitive and robust to intra‑group variations.
- Distance‑Restricted Preferences – The rankings generated by different metrics are close to each other under a suitable distance measure (e.g., Kendall‑tau distance). When the pairwise distances are bounded by a constant, linear aggregation methods (average rank, Borda) inherit transitivity, and the addition or removal of a model has limited impact on the relative order of the remaining models.
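The first of these conditions is easy to test mechanically. One standard check (a sketch, not the paper's procedure) uses the prefix-interval characterization: a ranking is single-peaked with respect to an axis exactly when every top-k prefix of the ranking forms a contiguous interval on that axis. The axis below is a hypothetical one-dimensional ordering of models.

```python
def is_single_peaked(ranking, axis):
    """Check whether `ranking` (best to worst) is single-peaked with
    respect to the linear order `axis`: every top-k prefix of the
    ranking must occupy a contiguous interval of axis positions."""
    pos = {m: i for i, m in enumerate(axis)}
    chosen = []
    for m in ranking:
        chosen.append(pos[m])
        if max(chosen) - min(chosen) != len(chosen) - 1:
            return False
    return True

axis = ["small", "medium", "large"]  # hypothetical 1-D model axis
assert is_single_peaked(["medium", "large", "small"], axis)
assert not is_single_peaked(["small", "large", "medium"], axis)
```

The second ranking fails because its top-2 set {small, large} skips over "medium" on the axis, i.e., preference does not decline monotonically away from a single peak.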
The paper proves that, under any of these restrictions, the majority operator M satisfies coherence (no cycles), stability (adding or removing a model does not flip the order of existing models), non‑dictatorship, and independence of irrelevant alternatives. In particular, single‑peakedness guarantees a unique Condorcet winner; group‑separability yields a tiered ranking respecting the group hierarchy; distance‑restriction bounds the magnitude of rank changes caused by new entrants.
Empirically, the authors evaluate these concepts on the HELM MMLU benchmark, which comprises 57 subject datasets and several language models (GPT‑4, Claude‑3‑Opus, Qwen1.5, Llama‑2, etc.). They focus on three metrics per subject: accuracy, inference time, and output length. For each subject they test:
- Single‑Peakedness – By ordering models according to accuracy and checking whether the other two metrics exhibit a single peak along this order. Some subjects (e.g., “Business Ethics”) satisfy the condition, while others (e.g., “Abstract Algebra”) display multiple peaks or irregular patterns.
- Group‑Separability – By clustering models into size‑based groups (large, medium, small) and verifying that within each group the three metrics produce consistent rankings. Most subjects show intra‑group consistency, supporting the separability assumption.
- Distance‑Restriction – By computing Kendall‑tau correlations between the three metric rankings. The average correlation exceeds 0.85 for the majority of subjects, indicating that the rankings are indeed close.
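The distance-restriction check reduces to counting discordant pairs. A self-contained sketch of the normalized Kendall-tau distance between two rankings (hypothetical model names; for tie-free rankings the reported correlation relates to this distance via τ = 1 − 2d):

```python
from itertools import combinations

def kendall_tau_distance(r1, r2):
    """Normalized Kendall-tau distance between two rankings of the
    same models: the fraction of model pairs ordered differently."""
    pos1 = {m: i for i, m in enumerate(r1)}
    pos2 = {m: i for i, m in enumerate(r2)}
    pairs = list(combinations(r1, 2))
    discordant = sum(
        (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0 for a, b in pairs
    )
    return discordant / len(pairs)

acc = ["m1", "m2", "m3", "m4"]    # hypothetical accuracy ranking
lat = ["m1", "m3", "m2", "m4"]    # hypothetical inference-time ranking
d = kendall_tau_distance(acc, lat)  # one of six pairs disagrees
```

A correlation above 0.85, as reported for most subjects, corresponds to a normalized distance below 0.075, i.e., fewer than one in thirteen pairs in disagreement.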
When the majority rule is applied to subjects that meet any of the structural conditions, the resulting overall rankings are largely free of cycles and are far more stable than those obtained by naïve averaging. In a stability test where an additional model (Llama‑2) is introduced, the majority‑based ranking changes the relative order of existing models in only a small fraction of subjects (≈9 %), whereas the average‑rank approach flips rankings in 44 out of 57 subjects. Moreover, the frequency of Condorcet cycles drops dramatically under the identified restrictions.
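The instability of naïve averaging can be reproduced on a toy profile (hypothetical models and rankings, not the paper's data): adding one model flips the average-rank order of two incumbents, while the pairwise majority between them is untouched, since majority comparisons depend only on the pair itself.

```python
def average_rank(rankings, models):
    """Mean rank position of each model across metric rankings."""
    return {m: sum(r.index(m) for r in rankings) / len(rankings)
            for m in models}

# Hypothetical profile: three metric rankings over four models
before = [
    ["a", "c", "d", "b"],
    ["b", "a", "c", "d"],
    ["b", "a", "d", "c"],
]
# The same metrics after a new model enters, slotting in just
# between b and a in two of the three rankings
after = [
    ["a", "c", "d", "b", "new"],
    ["b", "new", "a", "c", "d"],
    ["b", "new", "a", "d", "c"],
]

avg_before = average_rank(before, ["a", "b"])
avg_after = average_rank(after, ["a", "b"])
# Average rank flips a vs b: a wins before, b wins after.
# Pairwise majority is unchanged: b beats a 2-of-3 in both profiles.
```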
The authors conclude that Arrow’s theorem is not a fatal barrier for multi‑criteria benchmarking because the “universal domain” assumption is overly pessimistic for real benchmark suites. By deliberately designing benchmarks—or selecting metrics—that satisfy single‑peakedness, group‑separability, or distance‑restriction, practitioners can obtain coherent, stable, and interpretable model leaderboards. The paper suggests future work on (i) extending the structural analysis to more complex, possibly non‑numeric criteria such as interpretability or safety, (ii) monitoring whether these conditions hold over time as benchmarks evolve, and (iii) devising hybrid aggregation schemes that fall back gracefully when the conditions are violated.
In sum, the work bridges social‑choice theory and machine‑learning evaluation, showing that meaningful aggregation is achievable once realistic domain restrictions are recognized and leveraged. This provides a principled foundation for the next generation of holistic benchmark suites.