When LLMs get significantly worse: A statistical approach to detect model degradations
Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods as well as others, such as quantization, that come without accuracy guarantees. In all of these cases it is crucial to ensure that model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless optimizations, due to numerical errors. We therefore require statistical tools to decide whether a finite-sample accuracy deviation is evidence of a model degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar’s test that efficiently detects model degradations while guaranteeing a controlled rate of false positives. The crucial insight is that model scores must be compared on each sample, rather than aggregated at the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the widely adopted open-source LM Evaluation Harness, along with a case study illustrating that the method correctly flags degraded models while not flagging model optimizations that are provably lossless. With our tests, even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.
💡 Research Summary
The paper addresses a practical yet under‑explored problem in the deployment of large language models (LLMs): how to reliably determine whether an optimization (e.g., quantization, sparsity, or a more efficient inference kernel) has actually degraded the model’s predictive quality, or whether any observed change in accuracy is simply due to sampling noise. While many recent works report average accuracy differences after compression, they typically treat the baseline and the optimized model as if their performance estimates were independent, even though both are evaluated on exactly the same set of test examples. This oversight inflates the estimated variance of the difference and can mask small but real degradations, especially when the reported changes are on the order of a few tenths of a percent.
To solve this, the authors revisit McNemar’s 1947 test for paired binary outcomes. They formalize the evaluation as a 2 × 2 contingency table with counts a (both models fail), b (baseline succeeds, optimized fails), c (baseline fails, optimized succeeds), and d (both succeed). The overall accuracies are γ = (b + d)/N and β = (c + d)/N, but the crucial information for testing equality lies solely in the discordant cells b and c. The paper introduces two derived probabilities: the “flip probability” p↕ = (b + c)/N, which measures how often the two models disagree, and the “conditional degradation probability” q↓ = b/(b + c), i.e., the proportion of disagreements where the baseline is correct and the optimized model is wrong. Fact 1 proves that β < γ if and only if q↓ > 0.5.
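The per-sample bookkeeping behind the contingency table is simple. A minimal sketch (the `contingency` helper and the toy score lists are illustrative, not part of the paper's released code):

```python
def contingency(base, opt):
    """Return (a, b, c, d) from paired 0/1 correctness scores, where
    index i in both lists refers to the same test example."""
    a = sum(1 for x, y in zip(base, opt) if x == 0 and y == 0)  # both fail
    b = sum(1 for x, y in zip(base, opt) if x == 1 and y == 0)  # baseline right, optimized wrong
    c = sum(1 for x, y in zip(base, opt) if x == 0 and y == 1)  # baseline wrong, optimized right
    d = sum(1 for x, y in zip(base, opt) if x == 1 and y == 1)  # both succeed
    return a, b, c, d

# Toy per-sample scores for two models on the same 8 examples:
base = [1, 1, 0, 1, 0, 1, 1, 0]
opt  = [1, 0, 0, 1, 1, 1, 0, 0]
a, b, c, d = contingency(base, opt)
n = len(base)
p_flip = (b + c) / n    # flip probability p↕: how often the models disagree
q_down = b / (b + c)    # conditional degradation probability q↓
```

Note that only the discordant counts `b` and `c` enter `q_down`; the concordant cells `a` and `d` carry no information about which model is better.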
Consequently, the hypothesis test reduces to checking whether q↓ exceeds 0.5. Conditional on the number of discordant pairs, the count b follows a Binomial(b + c, q↓) distribution, allowing an exact one‑sided binomial test with p‑value P(X ≥ b) for X ~ Binomial(b + c, 0.5). This is more precise than the traditional McNemar chi‑square approximation, which is two‑sided and relies on a large‑sample normal approximation. The authors call their approach the “Exact One‑Sided McNemar Test.”
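The exact test itself is only a few lines. A sketch (the function name is mine) that computes the one-sided binomial tail directly with the standard library:

```python
from math import comb

def exact_mcnemar_pvalue(b: int, c: int) -> float:
    """P(X >= b) for X ~ Binomial(b + c, 0.5): the probability of seeing
    at least b degradations among b + c discordant pairs if the two
    models were in fact equally accurate."""
    n = b + c
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n

# With 60 degradations against 40 improvements, the null is rejected at 5%:
p = exact_mcnemar_pvalue(60, 40)  # ~0.028
```

For large b + c the exact sum can be replaced by a normal approximation, but with benchmark-scale counts the exact computation is cheap and avoids the approximation error the paper criticizes.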
The paper also analyses statistical power. The accuracy difference δ = γ − β can be expressed as δ = p↕·(2q↓ − 1). Under the null (q↓ = 0.5), the variance of the estimator δ̂ = (b − c)/N simplifies to p↕/N. Therefore, power depends on both the flip probability (how often the models disagree) and the total sample size N. When p↕ is large (i.e., many disagreements), the test can detect very small δ, even as low as 0.3% with typical benchmark sizes. Asymptotic normal approximations are used to derive power formulas, confirming that the exact binomial test is near‑optimal for the paired‑sample setting.
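That null variance gives a quick back-of-the-envelope sample-size rule. The sketch below is my own normal-approximation rearrangement, N ≈ p↕·(z_α + z_β)²/δ², not a formula quoted from the paper:

```python
from math import ceil
from statistics import NormalDist

def required_samples(delta: float, p_flip: float,
                     alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate N needed to detect an accuracy drop `delta` when the
    two models disagree on a fraction `p_flip` of examples, using the
    null-hypothesis variance var(delta_hat) ~= p_flip / N."""
    z = NormalDist().inv_cdf
    return ceil(p_flip * (z(1 - alpha) + z(power)) ** 2 / delta ** 2)

# Detecting a 0.3% accuracy drop when the models flip on 5% of samples:
n = required_samples(0.003, 0.05)
```

The quadratic dependence on 1/δ is the usual price of detecting small effects; the linear dependence on p↕ is what makes the paired test efficient when the models rarely disagree.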
Because LLM research often aggregates results across many benchmarks, the authors propose three methods to combine per‑benchmark p‑values: (1) Fisher’s method (−2∑log p_i, compared against a chi‑square distribution), (2) Stouffer’s method (a weighted sum of Z‑scores), and (3) a Bonferroni‑style correction that scales the smallest p‑value by the number of tests. Synthetic experiments show Fisher is more powerful when the number of benchmarks is small and heterogeneous, while Stouffer excels with many similarly sized benchmarks. The Bonferroni approach is the most conservative but guarantees family‑wise error control.
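All three combiners can be written with the standard library alone. A sketch (function names are mine; the Fisher tail uses the closed-form chi-square survival function available for even degrees of freedom):

```python
from math import exp, log, sqrt
from statistics import NormalDist

_norm = NormalDist()

def fisher_combine(pvals):
    """Fisher's method: -2 * sum(log p_i) is chi-square with 2k dof under H0.
    For even dof, the survival function has a closed form."""
    k = len(pvals)
    x = -2.0 * sum(log(p) for p in pvals)
    # P(chi2_{2k} > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 0.0
    for i in range(k):
        total += term
        term *= (x / 2) / (i + 1)
    return exp(-x / 2) * total

def stouffer_combine(pvals):
    """Stouffer's method: sum of per-benchmark z-scores, renormalized."""
    z = sum(_norm.inv_cdf(1 - p) for p in pvals) / sqrt(len(pvals))
    return 1 - _norm.cdf(z)

def bonferroni_combine(pvals):
    """Bonferroni-style: the smallest p-value scaled by the number of tests."""
    return min(1.0, len(pvals) * min(pvals))
```

Stouffer's variant here is unweighted; per-benchmark weights (e.g., by sample size) would multiply each z-score before renormalizing.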
Empirical validation is performed on several state‑of‑the‑art LLMs (Llama‑3.1 8B Instruct, LLaMA‑2 13B, etc.). The authors compare three categories of optimizations: (a) lossless kernel improvements, (b) mixed‑precision quantization of KV‑cache and attention (e.g., 4‑bit cache, 8‑bit attention), and (c) aggressive 8‑bit weight quantization. For lossless kernels, the test yields q↓ ≈ 0.5, p‑values around 0.4–0.6, and no statistically significant degradation, matching the expectation that these methods are truly lossless. For the mixed‑precision quantizations, the discordant count b exceeds c, leading to q↓ ≈ 0.55–0.60 and p‑values as low as 1.7 × 10⁻⁵, thereby flagging a genuine degradation. Importantly, the method detects degradations as small as 0.3 % in overall accuracy, a level at which naïve mean‑difference reporting would be indistinguishable from noise.
Implementation-wise, the authors release an open‑source Python package (github.com/amazon‑science/LLM‑Accuracy‑Stats) that plugs into the popular LM Evaluation Harness. A single command‑line flag (--stat-test mc_nemar) automatically builds the contingency table, computes the exact p‑value, and reports q↓. An additional flag (--aggregate fisher|stouffer|bonferroni) lets users choose the multi‑benchmark aggregation strategy. The code is lightweight, requires only the per‑sample binary scores already produced by the harness, and integrates with existing CI pipelines for model regression testing.
In summary, the paper provides a rigorous, easy‑to‑implement statistical framework for detecting LLM performance regressions after optimization. By leveraging the exact binomial distribution of paired discordant outcomes, it offers controlled type‑I error (false‑positive) rates and high power to detect sub‑percent accuracy changes. The three aggregation schemes give practitioners flexibility when dealing with suites of benchmarks. The released tooling makes the approach immediately applicable to real‑world model deployment pipelines, helping the community move beyond anecdotal “accuracy drop” claims toward statistically validated regression testing.