Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large language model (LLM) judges are often used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, reason more effectively, and are more robust to paraphrasing. However, LLM judges exhibit biases toward length and position, among others, and are vulnerable to various adversarial input prompts. While recent studies have examined these biases, few have analyzed them at a granular level in relation to a well-defined overlap metric. In this work we analyze LLM judge bias as a function of overlap with human-written responses in the summarization domain. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarity between the judged summaries (measured by ROUGE and BLEU) decreases; this pattern holds for all but one model tested and persists regardless of each model's own position bias. Additionally, we find that models struggle to judge even summaries with limited overlap, suggesting that LLM-as-a-judge in the summarization domain should rely on techniques beyond simple comparison.


💡 Research Summary

This paper investigates a subtle bias in large language model (LLM) judges when they evaluate summarization outputs. While LLM judges are praised for capturing semantics and being robust to paraphrasing, prior work has identified length, order, and self‑preference biases. The authors ask two questions: (1) How does the degree of n‑gram overlap (measured by ROUGE and BLEU) between a candidate summary and a human reference affect the LLM judge’s preference? (2) How does presentation order (position bias) interact with this relationship?

To answer these, the authors construct a benchmark of 6,744 LLM‑generated summaries from filtered subsets of WikiSum and CNN/DailyMail, ensuring human references are 95‑105 words long to control for length effects. Nine LLMs ranging from 1 B to 12 B parameters—including variants of Gemma 3, LLaMA 3, Mistral‑7B, Phi‑4‑mini, and GPT‑4o mini—are used both as summarizers and as judges. Summaries are scored by the average of ROUGE‑1, ROUGE‑2, BLEU‑1, and BLEU‑4 to obtain an overlap metric. Because the initial generated summaries occupied a narrow overlap range (<0.55), the authors additionally prompt models to rephrase human references, creating higher‑overlap summaries while keeping the rephrasing hidden from the judges.
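The paper's overlap metric averages ROUGE-1, ROUGE-2, BLEU-1, and BLEU-4. A minimal pure-Python sketch of how such a score could be computed is shown below; note that it simplifies BLEU-4 to a clipped 4-gram precision (real BLEU uses a geometric mean over 1–4-grams plus a brevity penalty), and whitespace tokenization stands in for whatever tokenizer the authors actually used:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(cand, ref, n):
    """ROUGE-n style recall: clipped n-gram matches / reference n-grams."""
    c, r = ngrams(cand, n), ngrams(ref, n)
    total = sum(r.values())
    return sum(min(c[g], r[g]) for g in r) / total if total else 0.0

def ngram_precision(cand, ref, n):
    """BLEU-n style clipped precision: matches / candidate n-grams.
    (Simplification: no geometric mean, no brevity penalty.)"""
    c, r = ngrams(cand, n), ngrams(ref, n)
    total = sum(c.values())
    return sum(min(c[g], r[g]) for g in c) / total if total else 0.0

def overlap_score(candidate, reference):
    """Average of ROUGE-1, ROUGE-2, BLEU-1, BLEU-4, per the paper's metric."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return (ngram_recall(cand, ref, 1) + ngram_recall(cand, ref, 2)
            + ngram_precision(cand, ref, 1) + ngram_precision(cand, ref, 4)) / 4
```

An identical candidate and reference score 1.0 under this metric, while fully disjoint texts score 0.0, matching the paper's observation that raw generated summaries cluster below 0.55 on this scale.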

During evaluation, each pair of summaries (one human, one machine) is presented in both possible orders. Judges are instructed to output only the name of the better summary; choices are categorized as “ground truth” (human), “generated” (machine), or “tied‑choose‑first/last” when order influences the decision.
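The order-swap protocol can be sketched as follows. The function below (a hypothetical helper, not the authors' code) combines a judge's picks from the two presentation orders into the paper's four categories, labeling a verdict as tied when the pick flips with slot position:

```python
def categorize(pick_human_first, pick_machine_first):
    """Combine a judge's choices from both presentation orders.

    pick_human_first:   'human' or 'machine' when the human summary is shown first
    pick_machine_first: 'human' or 'machine' when the machine summary is shown first
    Returns one of: 'ground truth', 'generated', 'tied-choose-first', 'tied-choose-last'.
    """
    if pick_human_first == pick_machine_first:
        # Consistent across orders: a genuine preference.
        return 'ground truth' if pick_human_first == 'human' else 'generated'
    # Inconsistent: the judge tracked slot position, not content.
    if pick_human_first == 'human' and pick_machine_first == 'machine':
        return 'tied-choose-first'   # always picked whichever summary came first
    return 'tied-choose-last'        # always picked whichever summary came last
```

This makes the position-bias measurement in the results concrete: a judge that consistently outputs the first-presented name lands in 'tied-choose-first' regardless of which summary that is.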

Results show a striking pattern: across almost all models, the probability of selecting the machine‑generated summary rises sharply as overlap decreases. This “AI‑AI bias” appears even when the machine summary comes from a tiny 1 B model, and it diminishes only when the average overlap exceeds roughly 0.5. Position bias is also observed—larger models tend to favor the last‑presented summary, smaller models the first—but this does not alter the core preference for generated summaries. The bias persists regardless of the direction of position bias, suggesting a stylistic marker inherent to LLM‑produced text that makes it more appealing to other LLM judges when it diverges from human wording.

The authors conclude that LLM‑as‑a‑judge frameworks cannot rely solely on simple n‑gram similarity or naive prompting; more sophisticated techniques (e.g., multi‑reference evaluation, adversarial prompting, bias mitigation strategies) are required. Limitations include the use of a single human reference per article, a restricted length window, and the absence of adversarial examples. Future work should broaden reference diversity, explore additional similarity metrics, and test mitigation methods to improve the reliability of LLM‑based evaluation.

