Balanced Accuracy: The Right Metric for Evaluating LLM Judges -- Explained through Youden's J statistic

Reading time: 5 minutes

📝 Abstract

Rigorous evaluation of large language models (LLMs) relies on comparing models by the prevalence of desirable or undesirable behaviors, such as task pass rates or policy violations. These prevalence estimates are produced by a classifier, either an LLM-as-a-judge or human annotators, making the choice of classifier central to trustworthy evaluation. Common metrics used for this choice, such as Accuracy, Precision, and F1, are sensitive to class imbalance and to arbitrary choices of positive class, and can favor judges that distort prevalence estimates. We show that Youden’s $J$ statistic is theoretically aligned with choosing the best judge to compare models, and that Balanced Accuracy is an equivalent linear transformation of $J$. Through analytical arguments, empirical examples, and simulations, we demonstrate how selecting judges using Balanced Accuracy leads to better, more robust classifier selection.


📄 Content

Evaluating large language models (LLMs) is a cornerstone of their development cycle. Standard practice involves running models on benchmark datasets of user prompts and estimating the prevalence of key behaviors in their responses, such as task pass rates, safety violations, or false refusals. Prevalence estimates rely on another classifier, typically an LLM, a fine-tuned model, or human annotators. We refer to this classifier as a judge (Gu et al., 2024; Liu et al., 2023; Li et al., 2024b,a; Zheng et al., 2023). Because prevalence measurements feed directly into ablation studies, capabilities assessments, and release decisions, the quality of this judge critically determines the validity of the resulting model comparisons.

However, despite the widespread use of LLM-as-a-judge pipelines, there is less consensus on how to evaluate the judges themselves. We raise a central methodological question: which metric best evaluates judges for the downstream task of comparing models on prevalence? (A preprint of this work is available on arXiv.)

In this position paper, we identify and advocate for a principled best practice grounded in the statistical structure of prevalence estimation, with a focus on judge-quality metrics measured on a golden set. We show that widely used metrics such as Accuracy, Precision, Recall, F1, and Macro-F1 are prevalence-dependent: they change as a function of the underlying class distribution, causing judges to be over- or under-valued depending on the dataset imbalance. As a result, these metrics less reliably reflect a judge’s ability to detect true differences between evaluated models.
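The prevalence dependence described above can be checked directly. The sketch below (illustrative, not from the paper) fixes a judge's sensitivity and specificity and varies only the class balance of the evaluation set: Accuracy and F1 shift with prevalence, while Balanced Accuracy does not move.

```python
# Sketch: a judge with fixed sensitivity (TPR) and specificity (TNR),
# evaluated on golden sets of varying prevalence. The judge itself never
# changes, yet Accuracy and F1 do; Balanced Accuracy stays constant.

def metrics(tpr, tnr, prevalence, n=100_000):
    """Expected confusion-matrix metrics for a judge with the given
    true-positive and true-negative rates at this class prevalence."""
    pos = n * prevalence
    neg = n - pos
    tp, fn = pos * tpr, pos * (1 - tpr)
    tn, fp = neg * tnr, neg * (1 - tnr)
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (tpr + tnr) / 2   # independent of prevalence
    return accuracy, f1, balanced_accuracy

for p in (0.5, 0.2, 0.05):
    acc, f1, bacc = metrics(tpr=0.90, tnr=0.80, prevalence=p)
    print(f"prev={p:.2f}  acc={acc:.3f}  f1={f1:.3f}  bacc={bacc:.3f}")
```

Under these assumed rates (TPR 0.90, TNR 0.80), F1 drops sharply as positives become rarer, even though the judge's error characteristics are unchanged.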

We argue instead for Balanced Accuracy (equivalently, Macro-Recall) as the primary metric for judge selection. Balanced Accuracy is independent of class prevalence, assigns equal importance to both classes, extends naturally to multi-class settings, and most directly captures the key property needed for prevalence comparison: how well a judge distinguishes positive from negative instances. We formalize this by grounding the argument in Youden’s J statistic (Youden, 1950), historically used in diagnostic testing to measure a classifier’s ability to separate classes. We show that J is theoretically aligned with detecting prevalence differences and that Balanced Accuracy is a simple monotonic linear transformation of it. We provide geometric intuition through ROC analysis and empirical examples demonstrating that Balanced Accuracy leads to more reliable judge selection and more trustworthy downstream evaluation.

Some evaluation setups, such as pairwise preference comparison, produce inherently balanced labels, making metrics like Accuracy suitable for evaluating preference models (Malik et al., 2025).
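The alignment between $J$ and prevalence comparison admits a short worked example. Under the standard identity that a judge's expected flag rate is $\hat{p} = \mathrm{TPR}\cdot p + \mathrm{FPR}\cdot(1-p)$, the measured gap between two models' true prevalences is shrunk by exactly $J = \mathrm{TPR} - \mathrm{FPR}$; the numbers below are illustrative, not taken from the paper.

```python
# Sketch: the expected flag rate of a judge is
#   p_hat = TPR * p + FPR * (1 - p),
# so the measured gap between two models with true prevalences p1, p2 is
#   p_hat1 - p_hat2 = (TPR - FPR) * (p1 - p2) = J * (p1 - p2).
# A judge with higher J preserves more of the true difference. Note also
# that Balanced Accuracy = (J + 1) / 2, a monotone linear transform of J,
# so ranking judges by either quantity gives the same ordering.

def expected_flag_rate(tpr, fpr, p):
    return tpr * p + fpr * (1 - p)

tpr, fpr = 0.90, 0.20            # hypothetical judge: J = 0.70
p1, p2 = 0.30, 0.10              # true prevalences of two evaluated models

gap_true = p1 - p2
gap_measured = (expected_flag_rate(tpr, fpr, p1)
                - expected_flag_rate(tpr, fpr, p2))
j = tpr - fpr

print(gap_true, round(gap_measured, 3))   # 0.2 vs 0.14 = J * 0.2
assert abs(gap_measured - j * gap_true) < 1e-12
```

In other words, a judge's $J$ is precisely the factor by which it preserves true prevalence differences, which is why selecting judges by $J$ (or Balanced Accuracy) targets the downstream comparison task.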

We describe two datasets in our setup:

1. Benchmark: A dataset of prompts used to elicit responses from the evaluated LLMs, whose behavior prevalence we aim to compare.
2. Golden set: A dataset of prompts, responses, and gold labels used to evaluate the judges themselves.

An ideal golden set is balanced across classes to enable precise measurement of judge performance. In practice, this is difficult to obtain: ground-truth labels are unknown during dataset construction; rare behaviors (e.g., safety violations) are costly to collect; and a high-quality set must include responses from multiple models to capture model-specific biases. Downsampling wastes expensive gold labels, while upsampling introduces artificial distribution shifts. Consequently, golden sets are typically imbalanced, underscoring the need for judge metrics, such as Balanced Accuracy, that remain valid under class imbalance.

When comparing multiple judges, we need a single, principled metric that reflects how well each judge will support the downstream task of estimating behavior prevalence across LLMs. A suitable judge metric should satisfy three core criteria:

1. Prevalence independence: It should not change when the class distribution of the golden set changes.
2. Label symmetry: Flipping which class is designated “positive” should not alter the metric’s meaning.
3. Balanced class treatment: It should capture a judge’s ability to correctly identify both positive and negative instances, since both directly affect prevalence estimation.

This section outlines the key issues with commonly used metrics.

Precision and Recall are widely used for evaluating binary classifiers, but they have structural properties that make them unsuited for judge selection.

Lack of label symmetry. Precision and Recall treat the “positive” class as privileged. When we flip class labels (for example, defining “safe” instead of “violating” as the positive class), Recall simply becomes Recall for the other class, but Precision does not: it turns into Negative Predictive Value (NPV). This asymmetry creates inconsistencies across datasets or benchmarks that use different labeling conventions. In contrast, Sensitivity and Specificity (true positive rate and true negative rate) form a label-symmetric pair: swapping class labels simply swaps the two rates.
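The label-flip asymmetry can be verified on a toy labeled set (the labels below are made up for illustration): recomputing the metrics with the opposite positive class swaps Precision with NPV, while Balanced Accuracy is identical under both conventions.

```python
# Sketch: flipping which class counts as "positive" turns Precision into
# NPV, but leaves Balanced Accuracy unchanged, since Sensitivity and
# Specificity simply swap roles.
from collections import Counter

def confusion(gold, pred, positive):
    c = Counter((g == positive, p == positive) for g, p in zip(gold, pred))
    tp, fn = c[(True, True)], c[(True, False)]
    fp, tn = c[(False, True)], c[(False, False)]
    return tp, fn, fp, tn

gold = ["viol", "viol", "viol", "safe", "safe", "safe", "safe", "safe"]
pred = ["viol", "viol", "safe", "viol", "safe", "safe", "safe", "safe"]

for positive in ("viol", "safe"):
    tp, fn, fp, tn = confusion(gold, pred, positive)
    precision = tp / (tp + fp)
    npv = tn / (tn + fn)
    bacc = (tp / (tp + fn) + tn / (tn + fp)) / 2
    print(f"positive={positive}: precision={precision:.3f} "
          f"npv={npv:.3f} bacc={bacc:.3f}")
# Precision under one labeling equals NPV under the other; bacc matches.
```

On this toy set, Precision is 2/3 when "viol" is positive and 4/5 when "safe" is positive (exactly the two conventions' NPVs swapped), while Balanced Accuracy is the same either way.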

This content is AI-processed based on ArXiv data.
