WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora
Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge, failing to adequately evaluate systems in realistic settings involving long contexts and large-scale heterogeneous documents. To bridge this gap, we introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild. We leverage Wikipedia’s unique structure, where cohesive narratives are grounded in long and heterogeneous external reference documents, to construct a benchmark reflecting real-world scenarios. Specifically, we sample articles across 12 top-level topics, using their external references as the retrieval corpus and citation-linked statements as ground truth, resulting in 1,100 questions spanning three levels of complexity: single-fact QA, multi-fact QA, and section-level summarization. Experiments across multiple baselines reveal that current GraphRAG pipelines help on multi-fact aggregation when evidence comes from a moderate number of sources, but this aggregation paradigm may overemphasize high-level statements at the expense of fine-grained details, leading to weaker performance on summarization tasks. Project page: https://github.com/BstWPY/WildGraphBench.
💡 Research Summary
WildGraphBench addresses a critical gap in the evaluation of Graph‑based Retrieval‑Augmented Generation (GraphRAG) systems: the lack of benchmarks that reflect the noisy, long‑document, heterogeneous nature of real‑world knowledge sources. Existing GraphRAG benchmarks typically rely on short, curated passages, which do not stress the multi‑document aggregation and long‑context reasoning capabilities that GraphRAG is designed to provide. To remedy this, the authors exploit Wikipedia’s unique citation structure. Wikipedia articles serve as concise, well‑written summaries, while the external reference pages linked via citations are often lengthy, noisy, and span a wide range of formats (news articles, PDFs, government reports, blogs, etc.). This mismatch creates a “wild” retrieval environment that closely mirrors real web corpora.
The dataset construction proceeds in three phases. First, the authors select 12 high‑level Wikipedia topics (e.g., History, Science, Society) and, within each topic, choose articles with a large number of references. All reference URLs are crawled using the jina.ai service; when a page is unavailable, archived versions are retrieved to preserve completeness. The raw text, including boilerplate and noise, is kept intact. Second, citation‑linked sentences in the Wikipedia articles are extracted, and a large language model (LLM) rewrites each into a clean factual statement, stripping footnote markers and resolving coreferences. Each statement is paired with its set of reference URLs and the count of references, forming a triple (statement, ref_urls, ref_count). Third, three question types are generated from this gold corpus:
- Single‑Fact questions (667 instances) – derived from triples with ref_count = 1. An LLM is prompted to write a non‑trivial question whose answer is exactly the gold statement, encouraging inclusion of multiple constraints (entity, time, location).
- Multi‑Fact questions (191 instances) – derived from triples with ref_count ≥ 2. The authors enforce a strict multi‑reference check: an LLM judge verifies that no single reference alone can fully support the statement, ensuring that the question truly requires evidence aggregation across at least two sources.
- Section‑Level Summary questions (339 instances) – each leaf section of a Wikipedia article provides a set of statements S*. An LLM generates an information‑seeking question based solely on the article title and section path, without seeing the section text. The expected answer is the entire set S*, requiring the system to retrieve and synthesize information from many noisy documents.
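The (statement, ref_urls, ref_count) triples and the ref_count-based split into single-fact and multi-fact candidates can be sketched as follows; the class and function names (`Triple`, `bucket_by_ref_count`) are illustrative, not taken from the paper's code:

```python
from dataclasses import dataclass

@dataclass
class Triple:
    """Illustrative data model for one (statement, ref_urls, ref_count) triple."""
    statement: str
    ref_urls: list[str]

    @property
    def ref_count(self) -> int:
        # ref_count is derived from the set of reference URLs
        return len(self.ref_urls)

def bucket_by_ref_count(triples: list[Triple]) -> tuple[list[Triple], list[Triple]]:
    """Split triples into single-fact candidates (exactly one reference)
    and multi-fact candidates (two or more references)."""
    single = [t for t in triples if t.ref_count == 1]
    multi = [t for t in triples if t.ref_count >= 2]
    return single, multi
```

In the paper's pipeline, the multi-fact bucket would additionally pass through the LLM-based multi-reference check before question generation.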
Evaluation metrics are tailored to each question type. For single‑ and multi‑fact questions, a separate LLM judge decides whether the system’s answer is factually equivalent to the gold statement, yielding a binary accuracy score. For summary questions, the system’s output is passed through a statement extractor to produce a predicted set Ŝ. A binary matching function determines whether each predicted statement paraphrases a gold statement, and precision, recall, and F1 are computed at the statement level. This design captures both factual coverage (recall) and hallucination (precision) while tolerating paraphrasing.
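The statement-level metrics can be sketched as below. The `matches` argument stands in for the paper's LLM-based binary paraphrase judge (any boolean predicate works for illustration), and the function name `statement_f1` is hypothetical:

```python
from typing import Callable

def statement_f1(
    predicted: list[str],
    gold: list[str],
    matches: Callable[[str, str], bool],
) -> tuple[float, float, float]:
    """Statement-level precision/recall/F1.

    Precision: fraction of predicted statements that paraphrase some gold
    statement (penalizes hallucination). Recall: fraction of gold statements
    covered by some prediction (rewards factual coverage)."""
    matched_pred = sum(1 for p in predicted if any(matches(p, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(matches(p, g) for p in predicted))
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

With an exact-match predicate in place of the LLM judge, two correct predictions out of three against three gold statements yield precision = recall = F1 = 2/3.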
Experimental setup includes flat RAG baselines (NaïveRAG, BM25) and several GraphRAG pipelines (Fast‑GraphRAG, Microsoft GraphRAG – local and global variants). Documents are chunked into 1,200‑token windows with 100‑token overlap; top‑k retrieval is set to 5 for fact questions and 10 for summaries. Graph construction and answer generation use gpt‑4o‑mini, while evaluation employs gpt‑5‑mini as the judge.
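The chunking setup (1,200‑token windows with 100‑token overlap) can be illustrated as a sliding window over a pre-tokenized document. This is a minimal sketch: the paper does not specify the tokenizer, and the helper name `chunk_tokens` is assumed:

```python
def chunk_tokens(tokens: list, window: int = 1200, overlap: int = 100) -> list[list]:
    """Split a token sequence into fixed-size windows with a fixed overlap.

    Consecutive chunks share `overlap` tokens, so the window advances by
    window - overlap = 1100 tokens per step under the paper's settings."""
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the final window already covers the tail
    return chunks
```

A 3,000-token document would thus produce three chunks, with each pair of neighbors sharing 100 tokens.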
Results reveal that GraphRAG methods outperform flat baselines on multi‑fact questions, achieving 12–15 % higher accuracy, confirming that graph‑guided evidence expansion is beneficial when relevant facts are scattered across multiple sources. However, on summary questions, GraphRAG systems exhibit lower precision: they tend to over‑emphasize high‑level statements and omit finer details, leading to reduced statement‑level recall and overall F1. This suggests that current graph aggregation strategies bias toward “core” nodes and lack mechanisms to preserve granular information.
The paper’s contributions are threefold: (1) introducing a benchmark that faithfully reproduces the challenges of real‑world, heterogeneous web corpora; (2) providing a statement‑grounded evaluation framework that jointly measures factual correctness and coverage; (3) delivering empirical evidence that while GraphRAG excels at moderate‑scale multi‑source aggregation, it struggles with broad summarization tasks, highlighting a design limitation that future work should address. Potential directions include refining graph propagation weights to retain low‑level facts, incorporating noise‑robust filtering before graph construction, and designing summary‑specific graph structures that balance high‑level abstraction with detail preservation.