Interactive Proof Learning for Evidence-Based Retrieval-Augmented Generation
📝 Abstract
Retrieval-augmented generation (RAG) relies on retrieved context to guide large language models (LLMs), yet treats retrieval as a weak heuristic rather than verifiable evidence, leading to unsupported answers, hallucinations, and reliance on spurious context. We introduce a novel training framework that treats the RAG pipeline as an interactive proof system by adapting the Merlin-Arthur (M/A) protocol: Arthur (the generator LLM) trains on questions with unknown context provenance; Merlin supplies helpful evidence, while Morgana injects adversarial, misleading context. Both provers use an XAI method to identify and modify the evidence most influential to Arthur. This trains Arthur to (1) answer when evidence supports the answer, (2) reject when evidence is insufficient, and (3) rely on the context spans that truly ground the answer. We further introduce a verification framework that disentangles explanation fidelity from model predictive errors, along with the Explained Information Fraction (EIF), which normalizes M/A mutual-information guarantees. Across three RAG datasets and multiple LLM families and sizes, M/A training makes LLMs more grounded in evidence and increases information-theoretic measures (soundness, completeness) and reject behavior with fewer hallucinations, without manually annotated unanswerable samples. Finally, the retriever also improves recall and MRR via automatically generated M/A hard positives and negatives. While high accuracy does not guarantee entropy flow from context to answer, our EIF results show that autonomous interactive-proof-style supervision enables RAG systems that treat retrieved documents as verifiable evidence.
📄 Content
Retrieval-augmented generation (RAG) systems combine a context retriever (R) with a context-augmented answer generator (G) and are increasingly deployed in high-stakes settings where large language models (LLMs) must reason reliably over external evidence. Yet current RAG systems often break: R1) retrieval is error-prone when multiple documents appear plausible; G1) generators answer despite missing support (Sun et al., 2025); G2) hallucinate under incomplete or misleading evidence (Wang et al., 2025b); G3.1) change answers under noise such as exchanged or erased tokens (Cao et al., 2025); and G3.2) rely on spurious parts of context that do not warrant their predictions (Wang et al., 2022). A root cause is that current RAG architectures tend to treat context merely as a heuristic cue instead of enforcing strict grounding in aligned, verifiable evidence.
To address this, we reframe the RAG pipeline as an interactive proof system by extending the Merlin-Arthur protocol. We train the generator LLM (Arthur) against (i) a helpful prover (Merlin) supplying valid context and (ii) an adversarial prover (Morgana) injecting misleading evidence, forcing the model to distinguish between reliable and deceptive inputs without knowing the source. This induces robustness against hallucinations via reject behavior, without annotated reject samples or preference data. We further incorporate Merlin and Morgana masks into retriever training, which yields better contrastive retrieval behavior.
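A minimal sketch of how such Merlin/Morgana training pairs could be constructed. All names here (`make_merlin_context`, the `REJECT` token, the string-matching heuristics) are our illustrative assumptions, not the paper's implementation, which uses an XAI method to select influential spans:

```python
# Toy construction of Merlin/Morgana training data for Arthur.
# Merlin keeps the supporting span; Morgana corrupts it; Arthur is
# trained to answer under Merlin and to abstain under Morgana.

REJECT = "<reject>"  # explicit abstention target for Arthur (assumed token)

def make_merlin_context(gold_answer: str, passage: str) -> str:
    """Helpful prover: keep only sentences that support the gold answer."""
    return " ".join(s for s in passage.split(". ") if gold_answer in s)

def make_morgana_context(gold_answer: str, passage: str, distractor: str) -> str:
    """Adversarial prover: replace the supporting span with a distractor."""
    return passage.replace(gold_answer, distractor)

def build_training_pairs(question, gold_answer, passage, distractor):
    """Arthur sees contexts of unknown provenance:
    answer under Merlin, reject under Morgana."""
    merlin_ctx = make_merlin_context(gold_answer, passage)
    morgana_ctx = make_morgana_context(gold_answer, passage, distractor)
    return [
        {"question": question, "context": merlin_ctx, "target": gold_answer},
        {"question": question, "context": morgana_ctx, "target": REJECT},
    ]

pairs = build_training_pairs(
    question="Who wrote Hamlet?",
    gold_answer="Shakespeare",
    passage="Hamlet is a tragedy. It was written by Shakespeare",
    distractor="Marlowe",
)
```

Both targets are generated automatically from the same annotated QA pair, which is why no manually labeled unanswerable questions are needed.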
This adversarial formulation establishes a lower bound on the mutual information between context and generated answer. Specifically, we train the model to satisfy (i) completeness (answering correctly under valid evidence) and (ii) soundness (resisting misleading context). Consequently, the system learns to act as a discerning verifier: it answers only when the evidence is grounded and rejects queries otherwise, improving robustness and interpretability without sacrificing accuracy relative to standard supervised finetuning.
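One way to make the completeness/soundness-to-mutual-information connection concrete is a Fano-style argument; the notation below is ours, and the paper's exact bound may differ. Let C ∈ {Merlin, Morgana} be drawn uniformly and let A = 1 iff Arthur answers rather than rejects:

```latex
% Illustrative Fano-style sketch (our notation; not the paper's exact bound).
\begin{align*}
  \text{completeness:}\quad & \Pr[A = 1 \mid C = \text{Merlin}] \ge 1 - \varepsilon_c \\
  \text{soundness:}\quad    & \Pr[A = 0 \mid C = \text{Morgana}] \ge 1 - \varepsilon_s \\
  \Rightarrow\quad          & I(A; C) \ge 1 - H_b\!\left(\frac{\varepsilon_c + \varepsilon_s}{2}\right)
  \quad \text{for } \frac{\varepsilon_c + \varepsilon_s}{2} \le \frac{1}{2}
\end{align*}
```

Here $H_b$ is the binary entropy: predicting the prover from Arthur's behavior errs with probability at most $(\varepsilon_c + \varepsilon_s)/2$, so Fano's inequality lower-bounds the information Arthur's answer carries about the context.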
Our contributions are as follows:
1. Extending the M/A theoretical framework: We extend the M/A framework to retrieval-augmented open-ended language generation. We introduce conditional evaluation to disentangle explanation fidelity from predictive errors; we propose and measure the Explained Information Fraction (EIF) to normalize certified guarantees relative to model capability and imperfect benchmarks.
2. M/A generator training: We use Merlin- and Morgana-generated contexts to adversarially train generator LLMs with increased soundness, groundedness, and explicit reject behavior, without annotated unanswerable questions or preference data.
3. M/A retriever training: We feed Merlin- and Morgana-generated contexts, produced by a converged generator, back into retriever training, forming a feedback loop from generation to retrieval and yielding automated hard positives and hard negatives to improve the retriever.
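As a sketch of the EIF idea, one could normalize a certified mutual-information lower bound by the information the model could plausibly convey. Both the Fano-style bound and the normalization below are illustrative assumptions; the paper's exact EIF definition is not reproduced here:

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def certified_mi_lower_bound(eps_c: float, eps_s: float) -> float:
    """Fano-style lower bound (bits) on I(answer; context),
    assuming a uniform Merlin/Morgana mix (illustrative)."""
    p_err = (eps_c + eps_s) / 2  # worst-case prover-prediction error
    return max(0.0, 1.0 - binary_entropy(p_err))

def explained_information_fraction(eps_c, eps_s, capability_bits):
    """EIF-style normalization: the certified bound divided by the
    information the model could plausibly convey (assumed given)."""
    return certified_mi_lower_bound(eps_c, eps_s) / capability_bits
```

For example, completeness and soundness errors of 0.1 each give a certified bound of roughly 0.53 bits; normalizing by a capability of 1 bit yields an EIF around 0.53.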
Across Llama-3.2-1B, Llama-3.2-3B, and Qwen3-4B Instruct models on SQuAD2.0, HotpotQA, and TriviaQA, we consistently improve groundedness, completeness, and soundness for the generator, and recall and Mean Reciprocal Rank (MRR) for a BERT-based retriever. Specifically, we observe:
1. Improved retriever: With our automated M/A augmentations, the retriever learns to separate useful from harmful context better than under standard training. Recall@1 improves by 2-11 pp over the baseline across datasets.
2. Higher robustness, fewer hallucinations: The generator learns to abstain when retrieved context does not support an answer and becomes more robust to misleading evidence. The reject behavior emerges without explicit unanswerable questions or preference data and improves upon instructing LLMs to reject under insufficient evidence. Under adversarial context, M/A training reduces incorrect answers by 25-60 pp relative to the instruct LLM and by 2-50 pp compared to vanilla finetuning, yet matches it in accuracy (model utility).
3. Improved interpretability with mutual-information exchange guarantees: Before training, input attributions are often diffuse and misaligned with evidence. After M/A training, they become more human-aligned, reflecting grounded reasoning. We further obtain information exchange bounds between context and answer of EIF_cond ≥ 0.3 on all datasets, providing strong lower bounds on context-dependent generation behavior.¹
Reject Behavior in RAG. A growing body of work aims to elicit abstention from LLMs, including in RAG (Wen et al., 2025). Methods elicit refusal via prompting (Madhusudhan et al., 2025; Peng et al., 2025) with mixed success, or via uncertainty estimation (Tomani et al., 2024; Lavi et al., 2025), which remains unreliable because LLM uncertainty is hard to calibrate. Training for refusal behavior uses annotated unanswerable questions, e.g., from SQuAD2.0 (Rajpurkar et al., 2018), Natural Questions (Kwiatkowski et al., 2019), or handcrafted rejec

¹ We will release our code before publication.