Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevance Assessment for IR Benchmarks


Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https://github.com/DISL-Lab/DREAM-ICLR-26, and the BRIDGE dataset is available at https://github.com/DISL-Lab/BRIDGE-Benchmark.


💡 Research Summary

Information retrieval (IR) evaluation suffers from incomplete benchmark datasets that contain many unlabeled “holes” – chunks that may be relevant but have never been judged. Recent attempts to use large language models (LLMs) either fully automate relevance assessment or adopt a confidence‑based hybrid where the LLM handles easy cases and humans intervene on low‑confidence instances. Both approaches rely on a single model’s judgment, which leads to over‑confidence, mis‑calibration, and unnecessary or missed escalations to human annotators.

The paper introduces DREAM (Debate‑based RElevance Assessment with Multi‑agents), a novel framework that replaces the single‑model pipeline with a two‑agent adversarial debate. One LLM is initialized with a “relevant” stance, the other with an “irrelevant” stance, forcing the agents to explore opposite perspectives from the start. They then engage in multi‑round reciprocal critique: each round the agents read the query, the candidate chunk, the answer set, and the full debate history from the previous round, critique the opponent’s arguments, extract supporting evidence, and produce a new relevance label together with a rationale. After each round the system checks whether the two agents agree; if they do, the agreed label is emitted automatically. If disagreement persists after a predefined maximum number of rounds (R), the case is deemed genuinely uncertain and escalated to a human annotator together with the entire debate transcript. The transcript serves as a structured briefing, allowing the human to see the exact points of contention and the evidence each agent considered, which improves human labeling accuracy compared with starting from scratch.
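The agreement-based control flow described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: `ask` stands in for the underlying LLM call, and the `Turn` record, function names, and round limit are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    stance: str      # "relevant" or "irrelevant" initial stance
    label: str       # label the agent argues for this round
    rationale: str   # critique of the opponent plus supporting evidence

def debate(query, chunk, ask, max_rounds=3):
    """Run a two-agent debate; `ask(stance, query, chunk, history)`
    must return a (label, rationale) pair from the underlying LLM."""
    history = []
    for _ in range(max_rounds):
        turns = [Turn(s, *ask(s, query, chunk, history))
                 for s in ("relevant", "irrelevant")]
        history.extend(turns)            # transcript accumulates each round
        if len({t.label for t in turns}) == 1:
            # agents agree -> emit the agreed label automatically
            return turns[0].label, history
    # persistent disagreement -> genuinely uncertain; escalate to a human
    # reviewer together with the full transcript as a structured briefing
    return "escalate", history
```

Note that escalation is triggered by persistent disagreement rather than by a confidence threshold, which is the key difference from confidence-based hybrid pipelines.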

Empirical evaluation was performed on 700 randomly sampled query‑chunk pairs drawn from MS MARCO, Natural Questions, and five RobustQA topics. DREAM was compared against several fully automated LLM annotators (UMBRELA, D‑MERIT, MIRA‑GE, SynDL) and against the confidence‑based hybrid method LARA. Results show that DREAM reaches agreement within an average of two rounds for 84 % of cases and achieves an overall labeling accuracy of 95.2 %, while requiring human involvement for only 3.5 % of instances. When humans do intervene, the provided debate history boosts their accuracy by roughly 12 % relative to annotating the same cases from scratch.

Using DREAM, the authors construct the BRIDGE benchmark by re‑labeling two widely used IR test collections, BEIR and RobustQA. The debate‑driven process discovers 29,824 previously unlabeled relevant chunks, a 428 % increase over the original 6,976 gold chunks. This massive augmentation reduces evaluation bias and leads to different retriever rankings when the benchmarks are re‑evaluated. Notably, dense and sparse retrievers that appeared comparable on the original benchmarks show clearer performance separation on BRIDGE.
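A toy illustration of why filling label holes can change retriever rankings. The document IDs, runs, and scores below are invented for the example, not taken from the paper:

```python
def precision_at_k(ranked, relevant, k=3):
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

original_qrels = {"d1"}               # sparse gold labels with holes
bridged_qrels  = {"d1", "d2", "d3"}   # missing relevant chunks now labeled

run_a = ["d1", "d5", "d6"]            # retriever A's top-3 for a query
run_b = ["d5", "d2", "d3"]            # retriever B's top-3 for the same query

# Under the incomplete labels, A outperforms B (1/3 vs 0); once the holes
# are filled, the comparison flips (1/3 vs 2/3), because B's hits were
# relevant chunks that simply had never been judged.
```

The same mechanism, aggregated over many queries, is what distorts leaderboard-style retriever comparisons on benchmarks with many unjudged relevant chunks.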

The paper also investigates the downstream impact on Retrieval‑Augmented Generation (RAG). Prior work has noted a “retrieval‑generation misalignment”: improvements in retrieval scores do not always translate into better generated answers. The authors demonstrate that part of this misalignment stems from the missing relevance judgments in the original benchmarks; when the missing chunks are added via BRIDGE, the correlation between retrieval quality and generation quality improves substantially.
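One way to quantify that alignment is a rank correlation between per-system retrieval scores and answer-quality scores. The sketch below uses invented toy numbers (not the paper's results) and a minimal tie-free Spearman implementation:

```python
def spearman(xs, ys):
    """Spearman rank correlation for tie-free score lists (toy version)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for five retrieval systems feeding the same generator.
gen_quality          = [0.52, 0.61, 0.58, 0.70, 0.66]
retrieval_incomplete = [0.40, 0.35, 0.55, 0.45, 0.60]  # scored with holes
retrieval_bridged    = [0.42, 0.50, 0.48, 0.62, 0.58]  # holes filled
```

In this toy setup the correlation rises from 0.3 to 1.0 once the retrieval scores are computed against the completed labels, mirroring the paper's claim that part of the apparent misalignment is a measurement artifact of incomplete judgments.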

Limitations are acknowledged. First, using only two agents may still miss nuanced arguments that a larger panel could surface. Second, each additional debate round incurs extra LLM inference cost, which may be prohibitive for very large collections. Third, even with debate transcripts, human annotators retain some subjectivity, and the system does not completely eliminate human bias. Future directions include scaling to more agents, learning optimal debate policies via meta‑learning, and designing richer human‑in‑the‑loop interfaces that further leverage the debate history for training annotators.

In summary, DREAM offers a principled, agreement‑driven alternative to confidence‑based LLM annotation. By turning disagreement into a signal for human escalation and by providing a transparent debate record, it achieves high labeling accuracy (95.2 %) with minimal human cost (3.5 %). The derived BRIDGE benchmark substantially enriches existing IR testbeds, leading to more reliable retriever comparisons and a clearer understanding of retrieval‑generation interactions in RAG systems.

