Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the Original Paper Viewer below or the original arXiv source.

LLMs-as-a-judge is a recently popularized method that replaces human judgements in task evaluation (Zheng et al. 2024) with automatic evaluation by LLMs. Due to the widespread use of RLHF (Reinforcement Learning from Human Feedback), state-of-the-art LLMs such as GPT-4 and Llama 3 are expected to align strongly with human preferences when prompted for a quality judgement, such as the coherence of a text. While this seems beneficial, it is not clear whether the assessments by an LLM-as-a-judge constitute only an evaluation based on the instructions in the prompt, or also reflect its preference for high-quality data similar to its fine-tuning data. To investigate how much prompting influences the alignment of AI judgements with human judgements, we analyze prompts with increasing levels of instruction about the target quality of an evaluation, for several LLMs-as-a-judge. Further, we compare against a prompt-free method that uses model perplexity as a quality measure instead. We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide it as a rigorous benchmark of models as judges. Overall, we show that LLMs-as-a-judge benefit only marginally from highly detailed instructions in prompts and that perplexity can sometimes align better with human judgements than prompting, especially for textual quality.


💡 Research Summary

This paper conducts a systematic investigation into the reliability of using large language models (LLMs) as judges, i.e., as automatic substitutes for human evaluators, across a variety of natural‑language‑generation (NLG) and reasoning tasks. The authors ask whether the judgments produced by an LLM‑as‑a‑judge stem primarily from the explicit instructions supplied in the prompt (e.g., detailed rubrics) or from the model’s inherent preference for text that resembles its fine‑tuning corpus. To answer this, they design four evaluation conditions of increasing instructional richness: (1) Perplexity (no prompt) – the model’s perplexity on the candidate text (computed from its token log‑probabilities) is taken as a quality score, eliminating any prompt bias; (2) Generic quality prompt – a minimal instruction to rate the response on a 1‑5 Likert scale without specifying any criteria; (3) Criteria‑specific prompt – the name of a single evaluation criterion (e.g., coherence) is provided, but no definition; (4) Full rubric prompt – a complete rubric containing a definition of the criterion and explicit scoring guidelines is supplied.
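The three prompted conditions can be illustrated with minimal templates. The wording below is hypothetical, written to match the summary's description, not the paper's exact prompts:

```python
# Illustrative prompt templates for conditions (2)-(4); condition (1),
# perplexity, needs no prompt at all. Exact phrasing is an assumption.
PROMPTS = {
    "generic": (
        "Rate the quality of the following response on a scale from 1 to 5.\n"
        "Response: {response}\nScore:"
    ),
    "criterion_name": (
        "Rate the {criterion} of the following response on a scale from 1 to 5.\n"
        "Response: {response}\nScore:"
    ),
    "full_rubric": (
        "Criterion: {criterion}\nDefinition: {definition}\n"
        "Scoring guidelines: {guidelines}\n"
        "Rate the following response on a scale from 1 to 5.\n"
        "Response: {response}\nScore:"
    ),
}

# Example: filling the criterion-name template for "coherence"
prompt = PROMPTS["criterion_name"].format(
    criterion="coherence", response="The cat sat on the mat."
)
```

Moving down this list adds instructional detail while holding the response and the 1‑5 scale fixed, which is what lets the study attribute correlation changes to the instructions alone.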

The study aggregates a taxonomy of 34 evaluation metrics drawn from eight widely used benchmark datasets, grouping them into four high‑level categories: Content, Engagement, Integrity, and Relevance. This taxonomy serves as a common framework for comparing model performance across diverse tasks such as summarization (SummEval, OpinSummEval, InstruSumm), dialogue (TopicalChat), creative story generation (Hanna, TheNextChapter), and problem‑solving/reasoning (Roscoe, Flask).

The four prompting conditions are applied to five LLMs spanning four model families: OpenAI’s GPT‑4‑Turbo, Meta’s Llama‑3 (70B and 8B), Mistral‑v0.3, and Microsoft’s Phi‑3‑Medium. For each model‑dataset‑prompt combination, the authors compute the Pearson correlation between the model‑generated scores and the human annotations. The key findings are:

  1. Limited benefit from detailed rubrics – Adding full rubric information improves correlation by at most ~4 percentage points compared with the generic prompt. GPT‑4‑Turbo shows the largest absolute gain (0.414 → 0.469), while other models exhibit even smaller improvements.
  2. Perplexity as a strong baseline for textual quality – For content‑centric metrics (fluency, readability, grammaticality), perplexity scores often correlate better with human judgments than any prompting strategy, achieving a Pearson r of 0.51 versus 0.44 for the best prompting condition. This suggests that the model’s internal language modeling signal captures aspects of “high‑quality” text that align with human preferences.
  3. Task‑specific nuances – For more subjective or creative criteria (e.g., surprise, empathy), full rubrics sometimes provide a slight edge, but the overall advantage remains modest.
  4. Model size and fine‑tuning effects – Larger, closed‑source models (GPT‑4‑Turbo) tend to benefit slightly more from richer prompts, yet the trend is not uniform across all families.
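The alignment measurement behind these findings reduces to a Pearson correlation between model scores and human annotations. A minimal, dependency-free sketch with toy scores (in practice one would use `scipy.stats.pearsonr`):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    assert n == len(ys) and n > 1
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: judge-model scores vs. human annotations for five responses
model_scores = [4, 3, 5, 2, 4]
human_scores = [5, 3, 4, 2, 4]
r = pearson_r(model_scores, human_scores)  # close to 1.0 means strong alignment
```

A reported difference such as 0.414 vs. 0.469 is a difference between two such correlations computed under two prompting conditions over the same human annotations.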

The authors conclude that (a) detailed instruction does not guarantee substantially higher alignment with human judgments, and (b) a simple, prompt‑free perplexity measure can be a competitive, even superior, alternative for many textual quality assessments.
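As a reminder of how the prompt-free baseline works: perplexity is the exponentiated negative mean token log-probability under the judge model, so lower values indicate text the model finds more natural. A sketch from precomputed per-token log-probabilities, with made-up numbers for illustration (obtaining real log-probs from a model, e.g. via Hugging Face transformers, is omitted):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp(-(1/N) * sum(log p_i)). Lower = more fluent under the model."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Hypothetical log-probs for a fluent vs. a disfluent sentence
fluent = [-1.2, -0.8, -0.5, -1.0]
disfluent = [-4.5, -3.9, -5.1, -4.2]

# As a quality score, candidates can simply be ranked by ascending perplexity:
# perplexity(fluent) < perplexity(disfluent)
```

Because this score requires no instructions at all, any case where it beats a prompted judge suggests the judgment is driven by the model's language-modeling signal rather than by the prompt.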

Limitations include reliance on English‑only datasets, potential biases inherent in human annotations, and the fact that perplexity reflects similarity to training data, which may reinforce model‑specific preferences rather than an objective notion of quality.

Future work is suggested in three directions: (i) developing hybrid metrics that combine perplexity with rubric‑based scores, (ii) employing multi‑task regression to learn optimal weighting of the 34 metrics against human judgments, and (iii) automating rubric generation to discover the most effective instructional phrasing for each criterion. Extending the analysis to multilingual settings and exploring the impact of different fine‑tuning regimes are also highlighted as promising avenues.

