Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory


While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework and show that leveraging IRT-GRM yields interpretable signals for diagnosing judgments systematically. These signals provide practical guidance for verifying the reliability of LLM-as-a-Judge and identifying potential causes of unreliability.


💡 Research Summary

The paper tackles a fundamental yet under‑explored problem in the rapidly growing practice of using large language models as evaluators (LLM‑as‑a‑Judge). While many recent works rely on simple outcome‑level metrics—such as correlation with human scores, agreement percentages, or uncertainty estimates—to claim that an LLM’s judgments are “good enough,” these measures do not reveal whether the LLM itself behaves as a stable, reliable measurement instrument. To fill this gap, the authors introduce a two‑phase diagnostic framework grounded in Item Response Theory (IRT), specifically the Graded Response Model (GRM), which is traditionally used in psychometrics to separate latent traits from item characteristics.

Methodology
The GRM treats each rating (e.g., a 5‑point Likert score) as a probabilistic function of a latent quality variable θ for the evaluated item and a set of prompt‑specific parameters: a discrimination αₚ and category thresholds βₚₖ. By fitting a Bayesian GRM to the scores an LLM produces under several controlled prompt variations (typo insertion, newline insertion, paraphrase), the framework attributes observed score differences either to genuine quality differences (θ) or to prompt‑induced measurement noise (α, β). Posterior inference is performed with NUTS MCMC, yielding both posterior means θ̂ⱼ and posterior variances σⱼ² for each item.
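As a concrete illustration (not the paper's own code), the GRM's category probabilities under a single prompt can be sketched in plain NumPy. The standard GRM defines cumulative probabilities P(Y ≥ k) = σ(α(θ − βₖ)) and obtains per-category probabilities by differencing; the function name and example parameter values below are illustrative:

```python
import numpy as np

def grm_category_probs(theta, alpha, betas):
    """Graded Response Model: probability of each rating category.

    theta : latent quality of the evaluated item
    alpha : prompt-specific discrimination
    betas : increasing thresholds beta_1 < ... < beta_{K-1}
    Returns a length-K vector of category probabilities.
    """
    betas = np.asarray(betas, dtype=float)
    # Cumulative curves P(Y >= k) for k = 1..K-1 (logistic link)
    cum = 1.0 / (1.0 + np.exp(-alpha * (theta - betas)))
    # Pad with P(Y >= lowest category) = 1 and P(Y > highest) = 0,
    # then difference adjacent cumulative probabilities.
    upper = np.concatenate(([1.0], cum))
    lower = np.concatenate((cum, [0.0]))
    return upper - lower

# Illustrative values: a 5-point scale with evenly spaced thresholds
probs = grm_category_probs(theta=0.5, alpha=1.2, betas=[-1.5, -0.5, 0.5, 1.5])
```

A higher α makes the rating more sharply responsive to θ, which is how the framework distinguishes a discriminating prompt from a noisy one.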

Phase 1 – Intrinsic Consistency
Two metrics assess whether the LLM functions as a reliable instrument without reference to human judgments:

  1. Prompt Consistency (CV) – For each prompt variant, the variance of θ within each rating category is computed and averaged across categories; the coefficient of variation of these per-prompt averages is then calculated across prompts. A CV below 0.1 (i.e., variation under 10 % of the mean) signals that the judge’s measurements are stable across semantically equivalent prompts.

  2. Marginal Reliability (ρ) – Defined as Var(θ̂) / (Var(θ̂) + mean σⱼ²), the proportion of variance in the estimated quality scores attributable to true differences in θ rather than to measurement error; values near 1 indicate a precise instrument.
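Both Phase 1 metrics reduce to simple statistics over the posterior summaries. The sketch below is illustrative rather than the authors' implementation: the data layout is hypothetical, and the marginal-reliability denominator assumes the conventional empirical-reliability form ρ = Var(θ̂) / (Var(θ̂) + mean σ²):

```python
import numpy as np

def prompt_consistency_cv(theta_by_prompt):
    """Coefficient of variation, across prompts, of the mean
    within-category variance of theta estimates.

    theta_by_prompt: dict mapping prompt variant ->
        dict mapping rating category -> array of theta estimates.
    (Hypothetical layout for illustration.)
    """
    per_prompt = []
    for categories in theta_by_prompt.values():
        variances = [np.var(np.asarray(v)) for v in categories.values()]
        per_prompt.append(np.mean(variances))
    per_prompt = np.asarray(per_prompt)
    return per_prompt.std() / per_prompt.mean()

def marginal_reliability(theta_hat, sigma_sq):
    """rho = Var(theta_hat) / (Var(theta_hat) + mean posterior variance),
    assuming the standard empirical-reliability form."""
    theta_hat = np.asarray(theta_hat)
    return np.var(theta_hat) / (np.var(theta_hat) + np.mean(sigma_sq))

# Illustrative usage with made-up posterior summaries
data = {
    "typo":      {1: [0.0, 0.2], 2: [1.0, 1.2]},
    "paraphrase": {1: [0.1, 0.3], 2: [1.1, 1.3]},
}
cv = prompt_consistency_cv(data)       # near 0: stable across prompts
rho = marginal_reliability([0.0, 2.0], [0.1, 0.1])
```

A judge passing both checks behaves as a consistent instrument before any comparison with human ratings is made.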

