Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

State-of-the-art single-agent claim verification methods struggle with complex claims that require nuanced analysis of multifaceted evidence. Inspired by real-world professional fact-checkers, we propose **DebateCV**, the first debate-driven claim verification framework powered by multiple LLM agents. In DebateCV, two *Debaters* argue opposing stances to surface subtle errors in single-agent assessments. A decisive *Moderator* is then required to weigh the evidential strength of conflicting arguments to deliver an accurate verdict. Yet, zero-shot Moderators are biased toward neutral judgments, and no datasets exist for training them. To bridge this gap, we propose **Debate-SFT**, a post-training framework that leverages synthetic data to enhance agents' ability to effectively adjudicate debates for claim verification. Results show that our methods surpass state-of-the-art non-debate approaches in both accuracy (across various evidence conditions) and justification quality.


💡 Research Summary

The paper addresses a fundamental shortcoming of current automated claim‑verification systems: single‑agent, retrieval‑augmented generation (RAG) pipelines often fail when a claim requires the integration of multiple, sometimes conflicting pieces of evidence. Inspired by professional fact‑checking practices such as PolitiFact’s “star‑chamber” debates, the authors introduce DebateCV, the first debate‑driven claim‑verification framework that orchestrates three large language model (LLM) agents—a positive Debater, a negative Debater, and a Moderator.

In DebateCV, both Debaters receive the same evidence set E and argue opposite positions over multiple rounds. The positive Debater (D⁺) initially presents supporting arguments, while the negative Debater (D⁻) immediately rebuts. Subsequent rounds consist of back‑and‑forth counter‑arguments, each grounded in the shared evidence. After every round, the Moderator (M) synthesizes a concise summary Sₜ of the exchanged arguments, checks for convergence (i.e., whether Sₜ is substantially similar to Sₜ₋₁), and decides whether to continue or terminate the debate. When the debate ends—either by early convergence or after a predefined maximum number of rounds (T_max)—the Moderator produces the final verdict (Supported, Refuted, Not Enough Evidence, or Conflicting Evidence/Cherry‑picking) together with a justification that distills the key insights from the entire discussion.
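The round structure described above can be sketched as a simple loop. This is an illustrative reconstruction, not the authors' code: `call_llm` stands in for any chat-completion API, the prompt strings are placeholders, and the lexical-similarity convergence test is an assumption, since the paper's exact criterion for "substantially similar" summaries is not specified here.

```python
from difflib import SequenceMatcher

VERDICTS = ["Supported", "Refuted", "Not Enough Evidence",
            "Conflicting Evidence/Cherry-picking"]

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError

def converged(s_t: str, s_prev: str, threshold: float = 0.9) -> bool:
    # Lexical-similarity proxy for the Moderator's convergence check
    # (the actual criterion used in the paper may differ).
    return SequenceMatcher(None, s_t, s_prev).ratio() >= threshold

def debate_cv(claim: str, evidence: list[str], t_max: int = 3) -> str:
    transcript: list[tuple[str, str]] = []
    s_prev = None
    for _ in range(t_max):
        # D+ argues the claim is true; D- rebuts, both grounded in `evidence`.
        pos = call_llm(f"Support the claim.\nClaim: {claim}\n"
                       f"Evidence: {evidence}\nDebate so far: {transcript}")
        neg = call_llm(f"Refute the claim.\nClaim: {claim}\n"
                       f"Evidence: {evidence}\n"
                       f"Debate so far: {transcript + [('D+', pos)]}")
        transcript += [("D+", pos), ("D-", neg)]
        # Moderator summarizes the round and checks for early convergence.
        s_t = call_llm(f"Summarize the debate so far: {transcript}")
        if s_prev is not None and converged(s_t, s_prev):
            break
        s_prev = s_t
    # Final verdict plus a justification distilled from the whole debate.
    return call_llm(f"Debate: {transcript}\n"
                    f"Choose one verdict from {VERDICTS} and justify it.")
```

The early-exit check on successive summaries is what keeps the debate from running all T_max rounds when the Debaters have stopped producing new arguments.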

Preliminary zero‑shot experiments reveal that while the Debaters can generate coherent, evidence‑based arguments, the zero‑shot Moderator suffers from a strong conformity bias: it frequently defaults to neutral verdicts (e.g., “Not Enough Evidence”) even when the evidence clearly supports a definitive true or false judgment. This limitation motivates the second major contribution: Debate‑SFT, a post‑training framework designed to fine‑tune the Moderator using synthetic debate data.

Debate‑SFT proceeds in three stages. First, the authors take an existing non‑debate claim‑verification dataset (claims C, gold verdicts Y, and gold evidence E) and run the zero‑shot DebateCV pipeline to generate synthetic multi‑round debate transcripts D for each claim. The zero‑shot Moderator also produces provisional verdicts Ŷ and justifications Ĵ based on these debates. Second, an error‑correction step identifies cases where the predicted verdict ŷ diverges from the gold verdict y. For each mismatch, an LLM‑based Corrector rewrites the justification to align with the gold label, thereby creating a clean (D, y, corrected justification) training triple. Third, the Moderator is fine‑tuned on this curated SynDeC (Synthetic Debate for Claim verification) dataset, learning to aggregate multi‑round arguments and to overcome its prior neutrality bias.
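The first two stages of this pipeline can be sketched as follows. All function names here (`run_debate`, `zero_shot_moderator`, `corrector`) are hypothetical stand-ins for the paper's components, not its actual API:

```python
def build_syndec(dataset, run_debate, zero_shot_moderator, corrector):
    """Stages 1-2 of Debate-SFT (illustrative sketch, not the authors' code).

    dataset: iterable of (claim, gold_verdict, evidence) triples.
    Returns SynDeC-style training examples for Stage 3 fine-tuning.
    """
    syndec = []
    for claim, gold_verdict, evidence in dataset:
        # Stage 1: generate a synthetic debate transcript with zero-shot DebateCV.
        transcript = run_debate(claim, evidence)
        verdict, justification = zero_shot_moderator(transcript)
        # Stage 2: when the provisional verdict misses the gold label,
        # have the Corrector rewrite the justification to match it.
        if verdict != gold_verdict:
            justification = corrector(transcript, gold_verdict)
        syndec.append({
            "debate": transcript,
            "verdict": gold_verdict,
            "justification": justification,
        })
    return syndec

# Stage 3 (not shown): fine-tune the Moderator on `syndec` with standard
# supervised fine-tuning over (debate -> verdict + justification) pairs.
```

Note that every training triple carries the gold verdict, so the fine-tuned Moderator never learns from the zero-shot model's mistaken labels, only from its (possibly corrected) justifications.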

Extensive experiments evaluate DebateCV with and without Debate‑SFT across several LLM backbones (e.g., GPT‑3.5, LLaMA‑2) and under three evidence‑quality regimes: full evidence, partial evidence, and scarce evidence. The Debate‑SFT‑enhanced Moderator consistently outperforms state‑of‑the‑art single‑agent baselines, achieving 2.6–5.8 percentage‑point gains in accuracy. Human expert assessments further demonstrate that the justifications produced by DebateCV are more detailed, logically coherent, and better grounded in the evidence than those from baseline models. Qualitative case studies illustrate how the debate mechanism corrects typical single‑agent failures such as misinterpreting outdated statistics, overlooking newly retrieved evidence, or over‑relying on speculative inference.

The authors summarize three core contributions: (1) the novel DebateCV framework that brings structured, adversarial multi‑agent debate to claim verification; (2) the Debate‑SFT pipeline that automatically synthesizes high‑quality debate data and fine‑tunes the Moderator without costly human annotation; and (3) comprehensive empirical validation showing improvements in both verdict accuracy and justification quality.

Limitations are acknowledged. The quality of the synthetic debates depends on the zero‑shot performance of the initial Debaters and Moderator; errors in early stages can propagate into the training data despite the correction step. The choice of T_max and convergence thresholds influences the depth of the debate and computational cost. Moreover, the current taxonomy includes only four verdict categories, which may be insufficient for nuanced domains requiring finer‑grained stance detection.

Future work is outlined: extending the debate paradigm to other tasks such as stance detection, misinformation detection, or domain‑specific fact‑checking (legal, medical); exploring human‑in‑the‑loop debate where experts intervene during the discussion; and investigating meta‑learning approaches that treat the debate process itself as a learnable module. Overall, the paper presents a compelling argument that structured multi‑LLM debate, coupled with synthetic data‑driven fine‑tuning, can substantially elevate the reliability and transparency of automated claim verification.
