Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges


Reliable certification of Large Language Models (LLMs), i.e., verifying that failure rates fall below a safety threshold, is critical yet challenging. While “LLM-as-a-Judge” offers scalability, judge imperfections, noise, and bias can invalidate statistical guarantees. We introduce a “Noisy but Valid” hypothesis-testing framework to address this. Leveraging a small human-labelled calibration set to estimate the judge’s True Positive and False Positive Rates (TPR/FPR), we derive a variance-corrected critical threshold applied to a large judge-labelled dataset. Crucially, our framework theoretically guarantees finite-sample Type-I error control (validity) despite calibration uncertainty. This distinguishes our work from Prediction-Powered Inference (PPI), positioning our method as a diagnostic tool that explicitly models judge behavior rather than a black-box estimator. Our contributions include: (1) Theoretical guarantees: we derive the exact conditions under which noisy testing yields higher statistical power than direct evaluation; (2) Empirical validation: experiments on Jigsaw Comment, Hate Speech, and SafeRLHF confirm our theory; (3) The Oracle Gap: we reveal a significant performance gap between practical methods and the theoretical “Oracle” (perfectly known judge parameters), quantifying the cost of estimation. Specifically, we provide the first systematic treatment of the imperfect-judge setting, yielding interpretable diagnostics of judge reliability and clarifying how evaluation power depends on judge quality, dataset size, and certification levels. Together, these results sharpen the understanding of statistical evaluation with LLM judges and highlight trade-offs among competing inferential tools.
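As a rough illustration of the overall workflow, the hypothetical Python sketch below certifies that a failure rate is under a threshold `p0` using a bias-corrected estimate and a plain normal-approximation critical value. All names and numbers are invented for illustration, and the sketch treats TPR/FPR as exactly known; the paper’s variance-corrected threshold additionally propagates the calibration-set uncertainty in the estimated TPR/FPR to retain finite-sample Type-I control.

```python
from math import sqrt
from statistics import NormalDist


def certify_failure_rate(p_j, n_j, tpr, fpr, p0, alpha=0.05):
    """One-sided test of H0: p >= p0 against H1: p < p0.

    p_j : raw positive rate reported by the judge on the large set
    n_j : size of the judge-labelled set
    tpr, fpr : judge error rates (treated here as exactly known,
               unlike the paper, which corrects for their estimation)
    Returns True if the model is certified (H0 rejected).
    """
    if tpr <= fpr:
        raise ValueError("judge must be better than chance (TPR > FPR)")
    # Bias-corrected estimate of the true failure rate.
    p_hat = (p_j - fpr) / (tpr - fpr)
    # Delta-method standard error of p_hat with known TPR/FPR.
    se = sqrt(p_j * (1 - p_j) / (n_j * (tpr - fpr) ** 2))
    # Reject H0 when the estimate falls below the critical threshold.
    z = NormalDist().inv_cdf(1 - alpha)
    return p_hat < p0 - z * se


# Hypothetical run: judge flags 6% of 10,000 items; TPR=0.9, FPR=0.05.
print(certify_failure_rate(0.06, 10_000, tpr=0.9, fpr=0.05, p0=0.05))
```

With these numbers the corrected estimate is about 0.012, well under the threshold, so the model is certified; a raw judge rate of 0.12 would not be.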


💡 Research Summary

The paper tackles a central problem in the deployment of large language models (LLMs): how to certify that a model’s failure rate stays below a predefined safety threshold while keeping the evaluation process scalable and cost‑effective. Traditional approaches either rely on large public benchmarks, which suffer from label noise, contamination, and over‑optimization, or on small human‑annotated test sets, which are expensive and insufficient for statistically reliable conclusions. Recent work has turned to “LLM‑as‑a‑Judge” (LLM‑J), using another LLM to label massive evaluation data, but this introduces a new source of uncertainty because the judge itself is imperfect, noisy, and potentially biased. If the judge’s error rates are ignored, any statistical guarantee on the target model’s safety can be invalidated.

Key Idea – Noisy‑but‑Valid Hypothesis Testing
The authors propose a two-stage data collection strategy. First, a small, high-quality human-labeled calibration set \(D_M\) (size \(n_M\)) is used to estimate the judge’s true-positive rate (TPR) and false-positive rate (FPR). Second, the same judge is applied to a large, cheap, automatically labeled set \(D_J\) (size \(n_J\)). The raw proportion of positive judgments in \(D_J\) is denoted \(\hat{p}_J\). Because the judge’s expected positive rate satisfies \(\mathbb{E}[\hat{p}_J] = \mathrm{TPR}\cdot p + \mathrm{FPR}\cdot(1-p)\), where \(p\) is the true failure rate, the estimated TPR and FPR can be used to invert this relation, yielding an unbiased estimator of \(p\):

\[
\hat{p} = \frac{\hat{p}_J - \widehat{\mathrm{FPR}}}{\widehat{\mathrm{TPR}} - \widehat{\mathrm{FPR}}}.
\]
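This TPR/FPR debiasing step can be sketched in a few lines of Python. The round-trip check below uses invented numbers and treats the judge’s TPR and FPR as given, whereas the paper accounts for the uncertainty in estimating them from the calibration set.

```python
def corrected_failure_rate(p_j: float, tpr_hat: float, fpr_hat: float) -> float:
    """Debias the judge's raw positive rate using estimated TPR/FPR.

    Inverts E[p_J] = TPR * p + FPR * (1 - p), the standard
    misclassification (Rogan-Gladen style) correction. Requires a
    better-than-chance judge, i.e. tpr_hat > fpr_hat.
    """
    if tpr_hat <= fpr_hat:
        raise ValueError("judge must satisfy TPR > FPR")
    return (p_j - fpr_hat) / (tpr_hat - fpr_hat)


# Round-trip check: with true failure rate p = 0.10, TPR = 0.9, FPR = 0.05,
# the judge's expected positive rate is 0.9*0.10 + 0.05*0.90 = 0.135,
# and the correction recovers p = 0.10.
p_j = 0.9 * 0.10 + 0.05 * 0.90
print(corrected_failure_rate(p_j, 0.9, 0.05))  # ~0.10
```

Note that the raw rate 0.135 would overstate the failure rate by 35% here, which is why ignoring judge error rates can invalidate a certification.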

