Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas


As Generative AI (GenAI) systems see growing adoption, a key concern involves the external validity of evaluations, or the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of “persona” ratings produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.


💡 Research Summary

This paper tackles the problem of external validity in the evaluation of generative AI (GenAI) systems. When system quality is estimated from human ratings collected in a laboratory or crowd‑working setting (the “source” distribution), the resulting metric often fails to generalize to the real‑world deployment environment (the “target” distribution). Two major sources of evaluation sampling bias are identified: (1) covariate shift, where the joint distribution of rater characteristics (X) and content (V) differs between source and target, and (2) selection bias, where the probability of a rating being observed (C) depends on X and V. Existing statistical tools such as Prediction‑Powered Inference (PPI), PPI++ and RePPI assume i.i.d. data and missing‑completely‑at‑random (MCAR) labels, and therefore break down under the more realistic bias conditions described above.
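As a concrete illustration of these two bias mechanisms, consider the following toy simulation (a hypothetical setup with illustrative numbers, not the paper's actual experiments): covariate shift moves the distribution of rater covariates $X$ between source and target, while selection bias makes observation of a rating depend on $X$ and $V$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Covariate shift: rater covariates X differ between the source (lab)
# sample and the target (deployment) population. All values are illustrative.
X_src = rng.normal(0.0, 1.0, n)
X_tgt = rng.normal(0.5, 1.0, n)
V_src = rng.normal(0.0, 1.0, n)   # content features
V_tgt = rng.normal(0.0, 1.0, n)

def rating(X, V, rng):
    # True human rating depends on both rater and content.
    return 2.0 + 0.8 * X + 0.5 * V + rng.normal(0.0, 0.3, len(X))

Y_src = rating(X_src, V_src, rng)
Y_tgt = rating(X_tgt, V_tgt, rng)

# Selection bias: whether a source rating is observed (C = 1) depends on
# X and V, so observed labels are not missing completely at random (MCAR).
p_obs = 1.0 / (1.0 + np.exp(-(0.5 - X_src - 0.3 * V_src)))
C = rng.binomial(1, p_obs)

naive_estimate = Y_src[C == 1].mean()   # plain mean of observed ratings
target_mean = Y_tgt.mean()              # the estimand we actually care about
```

Under this setup the naive mean of the observed source ratings is substantially biased for the target mean, which is exactly the failure mode that breaks MCAR-based tools like PPI.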

The authors introduce a novel auxiliary data source: “persona” ratings. By prompting an LLM‑as‑a‑judge with demographic and expertise specifications, the model simulates a human rater with a desired persona and produces a proxy rating $\hat{Y}$. Although $\hat{Y}$ is imperfect, it is often correlated with the true human rating $Y$ and can be used as an informative feature in downstream models.
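A persona-conditioned judge prompt might be composed along the following lines (a minimal sketch; the field names and wording are illustrative assumptions, not the paper's actual prompt template):

```python
def build_persona_prompt(persona: dict, system_output: str) -> str:
    """Compose a judge prompt asking the LLM to rate as a specific persona.

    The persona schema below is hypothetical, not the paper's actual one.
    """
    traits = ", ".join(f"{k}: {v}" for k, v in persona.items())
    return (
        f"You are a human rater with the following characteristics: {traits}.\n"
        "Rate the quality of the response below on a 1-5 scale. "
        "Answer with a single integer.\n\n"
        f"Response to rate:\n{system_output}"
    )

prompt = build_persona_prompt(
    {"age": "30-39", "occupation": "nurse", "expertise": "healthcare"},
    "Take two aspirin and rest.",
)
```

The integer returned by the judge under such a prompt plays the role of the proxy rating $\hat{Y}$ for that persona.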

Two naïve baselines are first described. Persona-augmented regression trains a predictor $\mu(W, \hat{Y})$ of $\mathbb{E}[Y \mid W, \hat{Y}]$ (with $W$ collecting the rater and content covariates) on the biased source data and averages its predictions over the target distribution; this is valid only if the outcome model is well specified. Importance reweighting instead weights the observed human ratings by an estimate of the density ratio between target and source, which is valid only if the reweighting model is accurate. The doubly-robust estimator combines the two, remaining valid when either component is of sufficient quality.
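The doubly-robust combination can be sketched on synthetic data as follows (a minimal illustration assuming a known closed-form density ratio and a simple linear outcome model; this is not the paper's estimator, only the standard augmented-reweighting recipe it builds on):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Synthetic covariate shift: source raters X ~ N(0, 1), target X ~ N(0.5, 1).
X_s = rng.normal(0.0, 1.0, n)
X_t = rng.normal(0.5, 1.0, n)
Y_s = 2.0 + 0.8 * X_s + rng.normal(0.0, 0.3, n)   # human ratings (source only)

# Persona ratings: informative but imperfect proxies, available on both sides.
Yhat_s = Y_s + 0.3 + rng.normal(0.0, 0.5, n)
Y_t = 2.0 + 0.8 * X_t + rng.normal(0.0, 0.3, n)   # unobserved in practice
Yhat_t = Y_t + 0.3 + rng.normal(0.0, 0.5, n)

# (i) Outcome model: predict the human rating from the persona rating,
# fit on the (biased) source sample.
b, a = np.polyfit(Yhat_s, Y_s, 1)
mu_s = a + b * Yhat_s
mu_t = a + b * Yhat_t

# (ii) Reweighting model: density ratio p_target(x) / p_source(x), known in
# closed form here because both are unit-variance Gaussians.
w = np.exp(X_s / 2.0 - 1.0 / 8.0)

# Doubly-robust estimate: average model predictions on the target, plus an
# importance-weighted residual correction computed on the source.
theta_dr = mu_t.mean() + np.mean(w * (Y_s - mu_s))
theta_naive = Y_s.mean()   # ignores the shift entirely
```

Because the weights are exact in this toy setup, the residual correction cancels the outcome model's extrapolation bias, and the estimate recovers the target mean; symmetrically, with a correct outcome model the estimate tolerates a misspecified reweighting model.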

