Efficient Evaluation of LLM Performance with Statistical Guarantees
Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized Active Querying (FAQ), which (a) leverages historical information through a Bayesian factor model; (b) adaptively selects questions using a hybrid variance-reduction/active-learning sampling policy; and (c) maintains validity through Proactive Active Inference – a finite-population extension of active inference (Zrnic & Candès, 2024) that enables direct question selection while preserving coverage. With negligible overhead, FAQ delivers up to $5\times$ gains in effective sample size over strong baselines on two benchmark suites, across varying levels of historical-data missingness: that is, it matches the CI width of uniform sampling while using up to $5\times$ fewer queries. We release our source code and our curated datasets to support reproducible evaluation and future research.
💡 Research Summary
The paper tackles the costly problem of evaluating many large language models (LLMs) on extensive benchmark suites. Rather than exhaustively scoring every model on every question, the authors frame benchmarking as finite‑population inference: the benchmark's N questions form a fixed population, and the goal is to estimate a model's overall accuracy θ = (1/N)∑_{j=1}^{N} z_j, where each z_j ∈ {0,1} indicates a correct answer. Under a fixed query budget n_b, the task is to select which questions to query so that the resulting confidence interval (CI) for θ is as narrow as possible while guaranteeing frequentist coverage (e.g., a 95% CI contains the true θ at least 95% of the time).
Factorized Active Querying (FAQ) is introduced as a three‑component solution.
- Bayesian factor model – Historical performance data H (models × questions) is partially observed. The authors fit a logistic factor model, P(H_ij = 1) = σ(u_iᵀv_j), where u_i ∈ ℝ^k captures the latent capabilities of model i and v_j ∈ ℝ^k captures the latent difficulty or skill requirements of question j. The model is trained on the observed entries using a masked negative log‑likelihood and AdamW; hyperparameters (latent dimension k and weight decay λ) are chosen by cross‑validation. This model is not used for final inference; it merely provides informative priors and predictions for the adaptive sampling stage.
- Hybrid active‑learning sampling policy – For a new model, the question factors {v_j} are fixed, and the model factor u is initialized with a Gaussian prior whose mean and covariance are the empirical mean and covariance of the historical model factors. After each query, the authors perform a Laplace update to maintain a Gaussian approximation of the posterior over u: they compute the predicted probability p̂_{t−1}^{I_t} = σ(û_{t−1}ᵀv_{I_t}) and the weight ŵ = p̂(1−p̂), then update the covariance Σ̂_t and mean û_t via closed‑form formulas (Eqs. 3–5). This yields an online, O(k²) update that refines the predictions for all unqueried questions.
- Proactive Active Inference (PAI) – Traditional active inference (Zrnic & Candès, 2024) assumes a streaming order and can only decide to label or skip the next item, often with stochastic labeling. In the benchmark setting, the full pool of questions is always available, and labeling is deterministic once a question is chosen. PAI defines the estimator
  θ̂_{n_b} = (1/n_b) ∑_{t=1}^{n_b} ϕ_t,  where  ϕ_t = (1/N)[∑_{j=1}^{N} p̂_{t−1}^{j} + (z_{I_t} − p̂_{t−1}^{I_t})/q_t(I_t)].
  The first term inside the brackets is the current factor‑model plug‑in estimate of total accuracy; the second corrects its bias using the observed label and the sampling probability q_t, making ϕ_t conditionally unbiased for θ. The authors prove (Theorem 3.1) that θ̂_{n_b} is unbiased for θ and, under mild regularity conditions, satisfies a martingale central limit theorem: √n_b (θ̂_{n_b} − θ)/σ̂_{n_b} → N(0, 1). Consequently, a standard Wald‑type CI, θ̂_{n_b} ± z_{1−α/2}·σ̂_{n_b}/√n_b, attains asymptotic (1−α) coverage.
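The factor-model fit can be sketched in a few lines. The paper trains with AdamW and cross-validated k and λ; the minimal sketch below substitutes plain full-batch gradient descent on the masked negative log-likelihood, and the function name `fit_factor_model` and its hyperparameter defaults are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def fit_factor_model(H, mask, k=5, lr=0.2, weight_decay=1e-4, epochs=500, seed=0):
    """Fit P(H_ij = 1) = sigmoid(u_i . v_j) on the observed entries only.

    H    : (M, N) 0/1 matrix of historical correctness (entries under the mask are ignored)
    mask : (M, N) boolean, True where H_ij is observed
    Returns latent factors U (M, k) and V (N, k).
    """
    rng = np.random.default_rng(seed)
    M, N = H.shape
    U = 0.1 * rng.standard_normal((M, k))
    V = 0.1 * rng.standard_normal((N, k))
    W = mask.astype(float)
    for _ in range(epochs):
        P = 1.0 / (1.0 + np.exp(-(U @ V.T)))   # predicted success probabilities
        R = W * (P - H)                        # masked residual = d(NLL)/d(logit)
        gU = R @ V / N + weight_decay * U      # gradients of masked NLL + L2 penalty
        gV = R.T @ U / M + weight_decay * V
        U -= lr * gU
        V -= lr * gV
    return U, V
```

For a new model, only a single row factor u needs to be inferred against the frozen question factors V, which is what the online posterior update exploits.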
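Putting the posterior update and the PAI estimator together, here is a hedged sketch. The paper's closed-form Eqs. 3–5 are not reproduced in this summary, so the update below uses a standard online Laplace step (rank-one Sherman–Morrison covariance update) for Bayesian logistic regression, and the sampling distribution is fixed to uniform for simplicity; the names `laplace_update` and `pai_estimate` are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def laplace_update(u, Sigma, v, z):
    """One O(k^2) Gaussian posterior update after observing label z on a
    question with factor v (an EKF-style stand-in for the paper's Eqs. 3-5)."""
    p = sigmoid(u @ v)
    w = p * (1.0 - p)                 # local curvature of the logistic NLL
    Sv = Sigma @ v
    # Sherman-Morrison inverse of the precision update  Lambda + w v v^T
    Sigma_new = Sigma - np.outer(Sv, Sv) * (w / (1.0 + w * (v @ Sv)))
    u_new = u + Sigma_new @ v * (z - p)   # mean step toward the posterior mode
    return u_new, Sigma_new

def pai_estimate(V, z, budget, u0, Sigma0, rng):
    """PAI-style estimator sketch: average of the per-step phi_t terms."""
    N = V.shape[0]
    u, Sigma = u0.copy(), Sigma0.copy()
    phis = []
    for _ in range(budget):
        p_hat = sigmoid(V @ u)            # plug-in predictions for all questions
        q = np.full(N, 1.0 / N)           # sampling distribution (uniform here)
        i = rng.choice(N, p=q)
        # phi_t = (1/N) [ sum_j p_hat_j + (z_i - p_hat_i) / q_i ]
        phis.append((p_hat.sum() + (z[i] - p_hat[i]) / q[i]) / N)
        u, Sigma = laplace_update(u, Sigma, V[i], z[i])
    return float(np.mean(phis))
```

Because each ϕ_t is conditionally unbiased for θ given the history, the running average stays unbiased even though the plug-in predictions change after every query.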
The paper also derives an oracle‑optimal sampling distribution: if the true per‑question success probabilities p_j were known and answers were independent Bernoulli(p_j), the variance‑minimizing distribution is q*_j ∝ √(p_j(1−p_j)). Since the factor model supplies estimates p̂_j, the practical policy approximates this optimal distribution, achieving near‑oracle efficiency.
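For independent Bernoulli(p_j) answers, the variance of the inverse‑probability correction term is proportional to ∑_j p_j(1−p_j)/q_j, and minimizing this subject to ∑_j q_j = 1 gives q*_j ∝ √(p_j(1−p_j)) (a Cauchy–Schwarz argument). The sketch below checks this numerically; the helper names are illustrative, not from the paper:

```python
import numpy as np

def oracle_q(p):
    """Variance-minimizing sampling distribution under independent
    Bernoulli(p_j) answers: q*_j proportional to sqrt(p_j (1 - p_j))."""
    s = np.sqrt(p * (1.0 - p))
    return s / s.sum()

def correction_variance(p, q):
    """Objective sum_j p_j (1 - p_j) / q_j (up to constants) that q* minimizes."""
    return float(np.sum(p * (1.0 - p) / q))
```

In practice p is unknown, so the policy plugs in the factor-model estimates p̂_j; the closer p̂ is to p, the closer the resulting q is to the oracle.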
Experiments are conducted on two large benchmark suites (including MMLU‑Pro) with thousands of questions and hundreds of models. Historical data is artificially masked at missingness levels from 0% to 80%. Baselines include uniform random sampling, a simple active‑learning heuristic, and a recent AIPW estimator. Results show that FAQ reaches the same CI width as uniform sampling while using roughly 20%–25% of the queries (up to a 5× effective‑sample‑size gain). Coverage remains close to the nominal 95% (94.7%–95.3%). The gains are larger when historical data is sparse, highlighting the factor model's ability to extract useful signal from limited observations. Sensitivity analyses confirm that a latent dimension k between 10 and 30 balances expressiveness and overfitting, and that early queries dramatically improve the posterior over u, leading to better subsequent question selection.
Finally, the authors release a curated dataset (≈4.4K models, 21.5K questions) and full source code, facilitating reproducibility and future extensions (e.g., multi‑turn safety evaluation, domain‑specific benchmarks).
In summary, FAQ offers a statistically principled, computationally lightweight framework for LLM benchmarking under budget constraints. By leveraging historical performance through a Bayesian factor model, employing a variance‑reduction‑driven active‑learning policy, and wrapping everything in a coverage‑preserving Proactive Active Inference estimator, the method delivers up to five‑fold query efficiency without sacrificing confidence‑interval reliability. This work bridges the gap between practical evaluation needs and rigorous statistical guarantees, and its open‑source release positions it as a valuable tool for the growing LLM ecosystem.