IRB: Automated Generation of Robust Factuality Benchmarks
Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Static benchmarks for RAG systems often saturate quickly and require significant manual effort to maintain robustness. To address this, we present IRB, a framework for automatically generating benchmarks that evaluate the factuality of RAG systems. IRB employs a structured generation pipeline built on a factual scaffold and an algorithmic scaffold. We use IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results show that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.


💡 Research Summary

The paper addresses a pressing problem in the evaluation of Retrieval‑Augmented Generation (RAG) systems: existing static benchmarks quickly saturate, become contaminated by model memorization, and require costly human annotation to stay relevant. To overcome these limitations, the authors introduce IRB (Iterative Retrieval Benchmark), a fully automated framework that generates factuality benchmarks with minimal human effort while preserving high quality and controllability.

IRB’s pipeline is built on two complementary scaffolds. The factual scaffold extracts citing sentences from Wikipedia articles—sentences that have been manually selected by human editors as evidence for a claim. After filtering for syntactic completeness, each citing sentence is split into atomic “keypoints” when multiple inline citations are present. An LLM then de‑contextualizes each keypoint, turning it into a self‑contained factual statement. Each keypoint is linked to the URLs of its supporting documents, and a second LLM‑based groundedness check verifies that the keypoint is indeed supported by the retrieved text. Only verified keypoints survive to the next stage, ensuring that the factual base of the benchmark is trustworthy.
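The factual-scaffold steps above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's code: the function names are invented, and the two LLM-based steps (de-contextualization and the groundedness check) are replaced here with crude rule-based stand-ins so the sketch is self-contained.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Keypoint:
    text: str                              # self-contained factual statement
    source_urls: list = field(default_factory=list)
    verified: bool = False

def split_citing_sentence(sentence: str) -> list:
    """Split a Wikipedia sentence carrying multiple inline citations
    (markers like '[1]', '[2]') into per-citation fragments ("keypoints")."""
    parts = re.split(r"\[\d+\]", sentence)
    return [p.strip(" ,;") for p in parts if p.strip(" ,;")]

def decontextualize(fragment: str, article_title: str) -> str:
    """Stand-in for the paper's LLM de-contextualization step:
    here we only resolve a leading pronoun to the article title."""
    return re.sub(r"^(It|He|She|They)\b", article_title, fragment)

def is_grounded(keypoint: str, evidence: str) -> bool:
    """Crude proxy for the LLM groundedness check: require every
    capitalized term in the keypoint to appear in the evidence text."""
    terms = re.findall(r"\b[A-Z][a-z]+\b", keypoint)
    return all(t in evidence for t in terms)
```

For example, the sentence "It was founded in 1891,[1] and it moved to Pasadena in 1910.[2]" splits into two fragments, and de-contextualizing the first against the article title yields a standalone statement such as "Caltech was founded in 1891". Only keypoints passing the groundedness check survive.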

The algorithmic scaffold takes these verified facts and systematically creates question‑answer (QA) pairs. First, each fact is transformed into a structured knowledge graph (KG) consisting of (head, relation, tail) triples, with node types annotated. A coverage metric guarantees that the KG captures enough of the original wording. The KG is then masked and transformed to produce three distinct question types: (1) single‑hop questions where a single node is masked as the answer; (2) multi‑hop questions formed by merging two single‑hop graphs, requiring the model to perform a two‑step reasoning chain; and (3) false‑premise questions where nodes are deliberately altered (e.g., swapping a surname or distorting a date) to inject misleading information. To increase difficulty and avoid trivial keyword matching, unmasked nodes are paraphrased using rule‑based transformations (abbreviations for person names, relative time expressions for dates, etc.). Finally, a step‑by‑step prompting strategy generates natural‑language questions from the masked graphs, while the answer is taken directly from the original keypoint.
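The three masking operations can be illustrated with a toy knowledge-graph representation. Again, this is a hedged sketch with hypothetical names, not the authors' implementation; the paper generates natural-language questions from the masked graphs via step-by-step prompting, which is omitted here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str

MASK = "[MASK]"

def single_hop(t: Triple) -> tuple:
    """Mask the tail node: the masked graph becomes the question,
    and the original node value is the answer."""
    return (f"{t.head} --{t.relation}--> {MASK}", t.tail)

def multi_hop(t1: Triple, t2: Triple) -> tuple:
    """Merge two triples sharing a bridge entity (t1.tail == t2.head)
    and mask the final tail, forcing a two-step reasoning chain."""
    assert t1.tail == t2.head, "triples must share a bridge node"
    chain = f"{t1.head} --{t1.relation}--> ? --{t2.relation}--> {MASK}"
    return (chain, t2.tail)

def false_premise(t: Triple, wrong_tail: str) -> Triple:
    """Swap in a deliberately wrong node (e.g. a distorted date) to
    create a false-premise item the model should detect and reject."""
    return Triple(t.head, t.relation, wrong_tail)
```

For instance, merging ("Marie Curie", born_in, "Warsaw") with ("Warsaw", capital_of, "Poland") produces a two-hop question whose answer is "Poland", with the bridge entity hidden from the surface form.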

Using this pipeline, the authors construct IRB1K, a benchmark comprising roughly one thousand QA pairs covering diverse topics (culture, geography, STEM), varying temporal freshness, and a mix of single‑hop, multi‑hop, and false‑premise items. Each entry includes rich metadata (topic, hop count, temporal freshness, etc.), the ground‑truth answer, and the set of supporting documents that constitute the retrieval ground truth.
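A benchmark entry might look like the following JSON record. The field names and the example question are illustrative guesses based on the metadata described above; the actual IRB1K schema may differ.

```python
import json

# Hypothetical IRB1K entry; field names and content are illustrative only.
entry = {
    "id": "irb1k-0001",
    "question": "Which institute, founded in 1891, moved to Pasadena in 1910?",
    "answer": "Caltech",
    "type": "single_hop",            # single_hop | multi_hop | false_premise
    "topic": "STEM",                 # culture | geography | STEM | ...
    "hops": 1,
    "temporal_freshness": "2023",    # recency of the underlying fact
    "supporting_docs": [             # retrieval ground truth
        "https://en.wikipedia.org/wiki/California_Institute_of_Technology"
    ],
}
print(json.dumps(entry, indent=2))
```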

The evaluation probes both closed‑book LLMs (e.g., GPT‑4, Claude‑2, Gemini‑1.5) and RAG systems that pair these generators with different retrievers (FAISS, BM25, DPR). The authors report Exact Match, F1, and robustness under adversarial conditions (incorrect retrieval, false‑premise questions, and conflicts between internal and external knowledge). Key findings: (1) IRB1K is substantially more challenging than traditional static benchmarks; closed‑book LLMs suffer a 15-20% drop in accuracy, exposing factuality gaps. (2) Models with explicit reasoning mechanisms (Chain‑of‑Thought, self‑consistency) outperform vanilla generators, especially on false premises or noisy retrieval results, indicating that reasoning capabilities improve robustness. (3) Retrieval quality emerges as the dominant factor in overall system performance; upgrading the retriever yields larger, more cost‑effective gains than further scaling the generator.
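Exact Match and token-level F1 are standard QA metrics; the conventional (SQuAD-style) definitions used by most RAG evaluations can be computed as follows. The exact normalization rules the paper applies are not stated here, so this follows the common convention of lowercasing and stripping punctuation and articles.

```python
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse
    whitespace (the usual SQuAD-style answer normalization)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff prediction and gold answer match after normalization."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between answers."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions, "the Eiffel Tower!" exactly matches "Eiffel Tower", while a verbose prediction like "Paris France" against gold "Paris" earns partial F1 credit but zero Exact Match.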

The paper’s contributions are threefold: (i) a novel factual scaffold that leverages human‑curated citations to guarantee high‑quality factual seeds; (ii) a graph‑based algorithmic scaffold that offers fine‑grained control over question type, reasoning depth, and adversarial perturbations; and (iii) an open‑source benchmark (IRB1K) and codebase that enable reproducible research and continuous benchmark evolution. Limitations are acknowledged: the current pipeline handles only textual evidence, excluding multimodal sources (videos, audio), and occasional noisy keypoints may slip through despite the groundedness check. Future work aims to integrate multimodal verification and more sophisticated automated validation models.

In summary, IRB demonstrates that fully automated, scaffold‑driven benchmark generation is feasible and yields challenging, up‑to‑date factuality tests for RAG systems. The results underscore that while LLMs have reached impressive capabilities, their factual reliability still hinges on the quality of the retrieval component, and that reasoning‑enhanced models provide a promising path toward more trustworthy AI assistants.

