Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLMs used as judges. Most classical methods, e.g., Dawid-Skene or (weighted) majority voting, assume annotators are conditionally independent given the true label $Y\in\{0,1\}$, an assumption often violated by LLM judges due to shared data, architectures, prompts, and failure modes. Ignoring such dependencies can yield miscalibrated posteriors and even confidently incorrect predictions. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors. For class-dependent Ising models, the Bayes log-odds is generally quadratic in votes; for class-independent couplings, it reduces to a linear weighted vote with correlation-adjusted parameters. We present finite-$K$ examples showing that methods based on conditional independence can flip the Bayes label despite matching per-annotator marginals. We prove separation results demonstrating that these methods remain strictly suboptimal as the number of judges grows, incurring nonvanishing excess risk under latent factors. Finally, we evaluate the proposed method on three real-world datasets, demonstrating improved performance over the classical baselines.


💡 Research Summary

The paper addresses the problem of aggregating binary judgments from multiple annotators—specifically large language models (LLMs) used as judges—in large‑scale AI evaluation. Classical aggregation methods such as Dawid‑Skene (DS) or weighted majority voting rely on the conditional independence (CI) assumption: given the true label Y∈{0,1}, the annotators’ votes are independent. The authors argue that this assumption is routinely violated for LLM judges because they share pre‑training data, architecture, prompts, and failure modes, which induce strong pairwise dependencies among their outputs. Ignoring these dependencies can dramatically mis‑calibrate posterior probabilities and even cause a system to confidently predict the wrong label despite each judge being better than random.

To capture such dependencies, the authors propose a hierarchy of Ising graphical models. The most general form is a class‑conditional Ising distribution

P(J | Y=y) ∝ exp( Jᵀh(y) + ½ JᵀW(y)J )

where J∈{0,1}^K is the vote vector, h(y)∈ℝ^K encodes per‑judge bias for class y, and W(y)∈ℝ^{K×K} (zero diagonal, symmetric) encodes pairwise couplings that may differ across classes. The Bayes log‑odds for a single item become

Λ(J) = log π/(1−π) + ∑_j Δh_j J_j + ∑_{j<k} ΔW_{jk} J_j J_k + ΔZ,

with Δh_j = h(1)_j − h(0)_j, ΔW_{jk} = W(1)_{jk} − W(0)_{jk}, and ΔZ the difference of the log partition functions. Hence, under a fully class‑dependent Ising model the Bayes decision boundary is a quadratic function of the votes.
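As a concrete illustration, the quadratic Bayes rule can be evaluated exactly for small K by brute-force enumeration of the two partition functions. This is a minimal sketch with made-up parameters, not the authors' implementation:

```python
import numpy as np
from itertools import product

def log_partition(h, W):
    """Brute-force log Z over all J in {0,1}^K (illustration only; exponential in K)."""
    exps = [J @ h + 0.5 * J @ W @ J
            for J in (np.array(t, float) for t in product([0, 1], repeat=len(h)))]
    m = max(exps)
    return m + np.log(sum(np.exp(e - m) for e in exps))

def bayes_log_odds(J, h0, W0, h1, W1, pi=0.5):
    """Λ(J) = log π/(1−π) + J·Δh + ½ JᵀΔW J + ΔZ, with ΔZ = log Z(0) − log Z(1).

    For symmetric, zero-diagonal W the ½ JᵀΔW J term equals ∑_{j<k} ΔW_{jk} J_j J_k.
    """
    J = np.asarray(J, float)
    quad = J @ (h1 - h0) + 0.5 * J @ (W1 - W0) @ J
    dZ = log_partition(h0, W0) - log_partition(h1, W1)
    return np.log(pi / (1 - pi)) + quad + dZ
```

Note that when W(0) ≠ W(1), the pairwise term ΔW_{jk} J_j J_k makes the decision boundary genuinely quadratic in the votes.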

A special, practically important case is the class‑independent Ising model where the coupling matrix is shared across classes, i.e., W(0)=W(1)=W. In this setting the quadratic terms cancel in the likelihood ratio, leaving a linear decision rule

g*(J)=1{ ∑_j c_j J_j + b₀ ≥ 0 },

where c_j = Δh_j (or equivalently the difference of logits) and b₀ absorbs the prior and partition‑function terms. Thus, even though judges are dependent, the optimal aggregator remains a weighted vote, but the weights and intercept are “correlation‑adjusted”: they account for redundancy and over‑counting that would plague naïve weighted voting.
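A sketch of this correlation-adjusted weighted vote, again using brute-force partition functions and hypothetical parameters (not the paper's code): with a shared coupling W, the ½ JᵀWJ terms cancel in the likelihood ratio, so only Δh and an intercept survive.

```python
import numpy as np
from itertools import product

def log_Z(h, W):
    """Brute-force log partition function for small K (illustration only)."""
    terms = [J @ h + 0.5 * J @ W @ J
             for J in (np.array(t, float) for t in product([0, 1], repeat=len(h)))]
    m = max(terms)
    return m + np.log(sum(np.exp(t - m) for t in terms))

def make_linear_aggregator(h0, h1, W, pi=0.5):
    """Class-independent case W(0)=W(1)=W: the optimal rule is a weighted vote.

    Weights c_j = Δh_j; the intercept b0 absorbs the prior and both log partition
    functions -- this is where the correlation adjustment enters.
    """
    c = h1 - h0
    b0 = np.log(pi / (1 - pi)) + log_Z(h0, W) - log_Z(h1, W)
    return lambda J: int(np.asarray(J, float) @ c + b0 >= 0.0)
```

Even though W never appears in the weights, it shifts the intercept through the partition functions, which is how redundancy among correlated judges is discounted.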

The paper’s contributions are threefold:

  1. Model hierarchy – The authors formalize the inclusion CI ⊂ class‑independent Ising ⊂ class‑dependent Ising, deriving exact Bayes log‑odds for each level. They also discuss connections to Potts models for multiclass problems and to Gaussian discriminant analysis.

  2. Separation results – Using a Curie‑Weiss Ising specialization (global magnetization M_K), they prove that for any K, a CI‑based predictor collapses to the prior (risk = min{π,1−π}) because the one‑dimensional marginals are identical across classes (each judge’s vote is ½‑½). In contrast, the Bayes predictor that exploits the dependence can achieve risk → 0 as K→∞ by thresholding M_K². Hence, the excess risk of CI remains non‑vanishing even with infinitely many judges. They also extend the separation to latent‑factor models where a shared random effect Z induces low‑rank correlations; CI remains suboptimal under such generation.

  3. Empirical validation – Experiments on three real‑world evaluation tasks (relevance, toxicity, summarization) with six LLM judges (Claude Sonnet 4.5, Claude Haiku 4.5, OpenAI gpt‑oss‑120b, Llama‑4 Maverick 17B Instruct, Llama‑4 Scout 17B Instruct, DeepSeek‑R1) show that the class‑independent Ising linear aggregator consistently outperforms DS and naïve weighted voting by 2–5 percentage points. Synthetic simulations corroborate the theoretical separation, demonstrating that CI‑based methods misclassify when dependence is strong, while the quadratic Ising predictor recovers near‑perfect accuracy.
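The Curie‑Weiss separation is easy to reproduce in a toy simulation. The generator below is a simplified stand‑in with invented parameters, not the paper's exact specialization: a shared hidden sign plays the role of the alignment, so each judge's marginal is exactly ½ under both classes (invisible to any CI‑based rule), yet thresholding the squared centered magnetization separates the classes almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_votes(y, K, p=0.9):
    """Hypothetical Curie-Weiss-style vote generator.

    Y=0: K iid fair coins, so M_K concentrates at K/2 (fluctuations ~ sqrt(K)/2).
    Y=1: a hidden sign aligns all judges toward 0 or 1 with equal probability,
         keeping every marginal at exactly 1/2 but pushing M_K far from K/2.
    """
    if y == 0:
        return rng.integers(0, 2, size=K)
    q = p if rng.integers(0, 2) == 1 else 1 - p
    return (rng.random(K) < q).astype(int)

def dependence_aware_predict(J, thresh):
    """Threshold the squared centered magnetization (M_K − K/2)²."""
    K = len(J)
    return int((J.sum() - K / 2) ** 2 >= thresh)
```

Any rule built only from per‑judge marginals is stuck at the prior's risk here, while the M_K² threshold drives the risk toward zero as K grows, matching the separation result.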

The paper also discusses how the Ising formulation relates to factor models: a latent variable Z that makes judges conditionally independent given (Y,Z) yields an approximately low‑rank Ising coupling matrix, situating factor models between CI and the full class‑dependent Ising.
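That latent‑factor picture can be simulated directly. In the hypothetical generator below (logistic link, Gaussian factor, invented parameters), judges are conditionally independent given (Y, Z), but marginalizing out the shared Z leaves visibly equicorrelated votes of the kind a CI model cannot represent.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_factor_votes(y, K, n, a=1.0, b=1.5):
    """Hypothetical latent-factor generator: J_j | Y, Z ~ Bernoulli(sigmoid(a(2Y−1) + bZ)).

    Z ~ N(0, 1) is shared by all K judges on an item; conditional on (Y, Z) the
    votes are independent, but integrating Z out induces a rank-one-like
    dependence across judges.
    """
    Z = rng.standard_normal((n, 1))                       # one shared factor per item
    p = 1.0 / (1.0 + np.exp(-(a * (2 * y - 1) + b * Z)))  # same success prob for all judges
    return (rng.random((n, K)) < p).astype(int)
```

Estimating the pairwise vote correlations from such samples shows them uniformly positive, which is exactly the low‑rank coupling structure that places factor models between CI and the full class‑dependent Ising model.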

In summary, the work highlights a critical flaw in current LLM‑as‑a‑judge pipelines—over‑reliance on conditional independence—and provides a principled, tractable alternative based on Ising graphical models. By explicitly modeling pairwise dependencies, the proposed methods achieve Bayes‑optimal aggregation, both theoretically (risk separation) and empirically (improved accuracy), offering a concrete path forward for reliable large‑scale AI evaluation.

