Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation

Reading time: 5 minutes

📝 Abstract

As LLM-based judges become integral to industry applications, obtaining well-calibrated uncertainty estimates efficiently has become critical for production deployment. However, existing techniques, such as verbalized confidence and multi-generation methods, are often either poorly calibrated or computationally expensive. We introduce linear probes trained with a Brier score-based loss to provide calibrated uncertainty estimates from reasoning judges’ hidden states, requiring no additional model training. We evaluate our approach on both objective tasks (reasoning, mathematics, factuality, coding) and subjective human preference judgments. Our results demonstrate that probes achieve superior calibration compared to existing methods with $\approx 10\times$ computational savings, generalize robustly to unseen evaluation domains, and deliver higher accuracy on high-confidence predictions. However, probes produce conservative estimates that underperform on easier datasets but may benefit safety-critical deployments prioritizing low false-positive rates. Overall, our work demonstrates that interpretability-based uncertainty estimation provides a practical and scalable plug-and-play solution for LLM judges in production.

📄 Content

The LLM-as-judge paradigm (Zheng et al., 2023; Dubois et al., 2023; Touvron et al., 2023) has become ubiquitous in modern AI development, providing reward signals for alignment training (RLAIF) (Bai et al., 2022; Lee et al., 2023; Kirk et al., 2024) and ranking models for deployment decisions (Chiang et al., 2024). With the rapid advancement of reasoning-capable models (OpenAI, 2024; DeepSeek-AI, 2025), LLM judges that generate reasoning traces before rendering verdicts are becoming increasingly prevalent due to their higher accuracy and transparency (Whitehouse et al., 2025; Chen et al., 2025). Yet current practice treats all judgments as equally reliable. Without calibrated confidence estimates, we cannot distinguish high-confidence judgments from cases where the LLM is essentially guessing. Moreover, LLM judges are known to be overconfident, systematically expressing higher confidence than their empirical accuracy supports (Tian et al., 2025; Jung et al., 2024). This leaves practitioners to either blindly trust all judgments or manually review everything, defeating the purpose of automation. This lack of uncertainty awareness is particularly problematic across judgment types: for correctness evaluation, we cannot identify objectively wrong judgments to exclude (Zhou et al., 2024b; Ross et al., 2024; Wang et al., 2023); for preference evaluation, we cannot detect genuinely ambiguous cases where humans may disagree or multiple valid answers exist (Aroyo and Welty, 2015; Pavlick and Kwiatkowski, 2019; Nie et al., 2020; Talat et al., 2022; Radharapu et al., 2025).

Calibrated LLM judges, whose expressed confidence matches their empirical accuracy, enable systems to route straightforward cases to efficient models while reserving expensive judges or human reviewers for uncertain decisions (Jung et al., 2024; Chen et al., 2023), dramatically reducing computational and labor costs. Training processes become more efficient by down-weighting uncertain judgments, preventing noisy labels from causing reward hacking (Gao et al., 2022) and model collapse (Zhang et al., 2024). Across these applications, calibration serves complementary purposes: flagging likely errors in correctness tasks while preserving valid diversity in preference tasks. Calibration is thus essential for building reliable and trustworthy AI systems with LLM judges.
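As an illustration of such confidence-based routing, here is a hypothetical three-way rule on a calibrated confidence score. The thresholds and tier names are placeholders for exposition, not values from the paper:

```python
def route(confidence: float) -> str:
    """Hypothetical three-way routing rule; thresholds are placeholders."""
    if confidence >= 0.90:
        return "auto-accept"  # trust the cheap judge's verdict as-is
    if confidence >= 0.60:
        return "escalate-to-strong-judge"  # re-judge with an expensive model
    return "human-review"  # too uncertain to automate
```

In a real pipeline the thresholds would be tuned on a validation set so that each tier meets a target error rate.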

In this work, our main contributions are: (1) A Brier-score-trained linear probe that produces calibrated uncertainty estimates from reasoning judges’ hidden states, requiring no additional model training or multi-sample generation. (2) Our probe achieves substantially better calibration than verbalized and multi-generation baselines across multiple model families (dense and MoE) and judging styles (prompted and fine-tuned), while requiring an order of magnitude less compute. (3) Strong out-of-distribution generalization to unseen benchmarks, with analysis of key trade-offs in the probe’s performance.
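The probe itself is conceptually simple: a linear map from a frozen hidden state to a correctness probability, trained to minimize the Brier score (the mean squared error between the predicted probability and the binary outcome). A minimal sketch on synthetic data follows; the dimensions, the data, and the sigmoid link are illustrative assumptions, not the paper's exact architecture or extraction layer:

```python
import numpy as np

# Synthetic stand-in for the real setup: hidden states H (n, d) extracted
# from the judge, and binary labels y (1 = the judgment was correct).
rng = np.random.default_rng(0)
n, d = 512, 64
H = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (H @ w_true + rng.normal(scale=1.0, size=n) > 0).astype(float)

def sigmoid(z):
    z = np.clip(z, -30.0, 30.0)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

# Fit probe parameters (w, b) by gradient descent on the Brier score:
#   L(w, b) = mean((sigmoid(H @ w + b) - y) ** 2)
w, b, lr = np.zeros(d), 0.0, 0.5
for _ in range(2000):
    p = sigmoid(H @ w + b)
    # Chain rule: dL/dz_i = 2 (p_i - y_i) * p_i (1 - p_i) / n
    g = 2.0 * (p - y) * p * (1.0 - p) / n
    w -= lr * (H.T @ g)
    b -= lr * g.sum()

brier = np.mean((sigmoid(H @ w + b) - y) ** 2)  # lower is better; 0.25 = chance
```

Because the probe is a single linear layer over states the judge already computed, the added inference cost is negligible compared to sampling extra generations.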

Large language model judges have become ubiquitous for evaluating model outputs across tasks ranging from pairwise preference ranking for RLHF (Christiano et al., 2017; Ouyang et al., 2022) to judging correctness in verifiable tasks. These judges span a spectrum from prompted general-purpose models like GPT-4 (Zheng et al., 2023; Dubois et al., 2023) to specialized fine-tuned judges (Whitehouse et al., 2025; Zhu et al., 2025; Kim et al., 2024a,b; Li et al., 2024a) that reason before giving their verdict. While they achieve high accuracy, LLM judges are known to suffer from systematic overconfidence (Jung et al., 2024; Tian et al., 2025; Xiong et al., 2024).

Uncertainty estimation methods for LLMs fall into several categories, each with distinct trade-offs:

Logit-Based Methods. Token-level uncertainty estimation methods like Perplexity and Temperature Scaling (Guo et al., 2017) assume uniform calibration across tokens, ignoring the semantic and contextual nuances crucial for reasoning tasks (Xie et al., 2024a). Maximum Softmax Probability (MSP) (Plaut et al., 2024) has been shown to be consistently miscalibrated in reasoning and multiple-choice QA tasks. Additionally, approaches such as contextual calibration (Zhao et al., 2021) and batch calibration (Zhou et al., 2024a) operate on single-token output logits, making them inapplicable to reasoning judges that generate multi-token responses.
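Temperature scaling (Guo et al., 2017) illustrates the single-parameter, logit-based approach: divide validation logits by a scalar T chosen to minimize negative log-likelihood. A toy sketch on synthetic logits with deliberately noisy labels, so the "model" is overconfident; all sizes and values here are illustrative:

```python
import numpy as np

# Toy validation set: logits (n, k) from an overconfident model
# (labels match the argmax only ~70% of the time).
rng = np.random.default_rng(1)
n, k = 200, 4
logits = rng.normal(scale=3.0, size=(n, k))
noisy = rng.random(n) >= 0.7
labels = np.where(noisy, rng.integers(0, k, size=n), logits.argmax(axis=1))

def nll(T):
    """Mean negative log-likelihood of the labels under temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # stabilize the softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(n), labels].mean()

# Temperature scaling is a 1-D optimization; a grid search suffices here.
Ts = np.linspace(0.1, 10.0, 200)
T_star = Ts[np.argmin([nll(T) for T in Ts])]  # T > 1 softens overconfident logits
```

Note that a single scalar T rescales every token identically, which is exactly the uniformity assumption the paragraph above criticizes for multi-token reasoning traces.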

Verbalized Confidence. Asking models directly for confidence scores (Tian et al., 2023; Kadavath et al., 2022; Lin et al., 2022) is straightforward and has been argued to outperform logit-based methods. However, verbalized confidence produces overconfident estimates (Tian et al., 2025; Xiong et al., 2024; Tao et al., 2025; Lyu et al., 2025).

Consistency-Based Methods. Self-consistency (Wang et al., 2022; Manakul et al., 2023), semantic entropy (Kuhn et al., 2023), and related approaches (Chen and Mueller, 2024) achieve strong calibration by aggregating uncertainty across multiple generations. Methods such as prompt e
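The core of these consistency-based methods can be sketched as majority agreement across several sampled verdicts. This is a simplified stand-in for the cited approaches (semantic-entropy variants additionally cluster semantically equivalent outputs before counting), but it shows why the compute cost scales with the number of samples:

```python
from collections import Counter

def consistency_confidence(verdicts):
    """Return the majority verdict and the fraction of samples agreeing with it."""
    verdict, freq = Counter(verdicts).most_common(1)[0]
    return verdict, freq / len(verdicts)

# e.g. 8 verdicts sampled from the judge at nonzero temperature
verdict, conf = consistency_confidence(["A", "A", "B", "A", "A", "A", "B", "A"])
```

Each confidence estimate requires as many full judge generations as there are samples, which is the roughly order-of-magnitude overhead the probe approach avoids.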

This content is AI-processed based on ArXiv data.
