DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization
Retrieval-augmented generation (RAG) can substantially enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms - including vanilla, planning-based, and iterative RAG - all depend on a robust retriever, yet existing retrievers rely heavily on public knowledge and often falter when faced with domain-specific queries. To address these limitations, we introduce DRAGON, a framework that combines a data-construction modeling approach with a scalable synthetic data-generation pipeline, specifically designed to optimize domain-specific retrieval performance and bolster retriever robustness. To evaluate RAG performance in domain-specific settings, we propose DRAGONBench, a benchmark spanning 8 domain-specific document collections across 4 distinct fields and featuring a wide spectrum of query complexities, answerability conditions, and hop counts. Leveraging DRAGON, we generate a large-scale synthetic dataset - encompassing both single-hop and multi-hop queries - to enrich retriever training. Extensive experiments demonstrate that retrievers trained on this data yield significant performance gains and exhibit strong cross-domain generalization. Moreover, when our optimized retrievers are integrated into vanilla, planning-based, and iterative RAG paradigms, we observe consistent end-to-end improvements in system accuracy.
💡 Research Summary
Retrieval‑augmented generation (RAG) has become a cornerstone for extending the knowledge reach of large language models (LLMs) on tasks that require up‑to‑date or specialized information. While many recent works focus on improving the generator or the planning component, the retriever remains a critical bottleneck, especially when the target corpus lies outside the public domain (e.g., Wikipedia). Existing dense retrievers are typically pre‑trained on generic corpora and consequently struggle with domain‑specific terminology, stylistic quirks, and multi‑document reasoning patterns that are common in specialized fields.
The paper introduces DRAGON (Domain‑specific Robust Automatic Data Generation for RAG Optimization), a two‑stage framework that automatically synthesizes high‑quality RAG training data for any given domain and uses this data to fine‑tune dense retrievers. The first stage builds an entity‑centric graph from the target document collection. Documents are first chunked to match the retriever’s context window; entities and relations are extracted from each chunk, and an entity resolution step merges duplicate mentions. This graph serves as a scaffold for both single‑hop and multi‑hop query generation, because edges naturally encode cross‑document links that can be turned into “clues”.
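The graph-construction stage can be sketched as follows. This is a minimal illustration, not the paper's implementation: the entity extractor is assumed to be an external NER or LLM component, and entity resolution is reduced to simple string normalization.

```python
from collections import defaultdict

def normalize(mention: str) -> str:
    # Stand-in for the paper's entity-resolution step: merge duplicate
    # mentions by case/whitespace normalization.
    return " ".join(mention.lower().split())

def build_entity_graph(chunks, extract_entities):
    """chunks: list of (chunk_id, text) pairs sized to the retriever's
    context window; extract_entities: callable returning entity mentions
    for a chunk (an assumed NER/LLM component)."""
    entity_to_chunks = defaultdict(set)
    for chunk_id, text in chunks:
        for mention in extract_entities(text):
            entity_to_chunks[normalize(mention)].add(chunk_id)
    # An edge links two chunks that share an entity; these cross-document
    # links are the "clue" candidates that seed multi-hop queries.
    edges = defaultdict(set)
    for cids in entity_to_chunks.values():
        for a in cids:
            for b in cids:
                if a != b:
                    edges[a].add(b)
    return entity_to_chunks, edges
```

A single-hop query can then be grounded in one node, while a multi-hop query walks an edge between chunks that mention the same resolved entity.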
In the second stage, the extracted clues are fed to a large language model (LLM) that generates base questions. Two families of transformation rules are then applied:
- Logical Rephraser Rules – temporal expansion, comparison addition, metric segmentation, multi‑step questioning, and reason explanation. These increase the logical depth of the query.
- Completeness Rephraser Rules – near‑synonym replacement, order reversal, semantic ambiguity injection, perspective shift, and conditional addition. These manipulate how much of the required evidence is explicitly present in the question.
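The two rule families above can be organized as a prompt registry fed to the generator LLM. The rule descriptions and prompt wording below are illustrative assumptions; the paper's actual prompts are not reproduced here.

```python
# Hypothetical one-line instructions for each transformation rule.
LOGICAL_RULES = {
    "temporal_expansion": "Extend the question to cover a time range.",
    "comparison_addition": "Add a comparison against a related entity.",
    "metric_segmentation": "Split the asked-for metric into sub-metrics.",
    "multi_step": "Require an intermediate answer before the final one.",
    "reason_explanation": "Also ask for the reason behind the answer.",
}
COMPLETENESS_RULES = {
    "near_synonym": "Replace key terms with near-synonyms.",
    "order_reversal": "Reverse the order of stated conditions.",
    "semantic_ambiguity": "Make one referent deliberately ambiguous.",
    "perspective_shift": "Restate the question from another viewpoint.",
    "conditional_addition": "Add a qualifying condition.",
}

def build_rephrase_prompt(base_question: str, rule_name: str) -> str:
    # Compose the prompt sent to the LLM for one rephrasing step;
    # iterative prompting applies several rules in sequence.
    rules = {**LOGICAL_RULES, **COMPLETENESS_RULES}
    return (f"Rewrite the question below according to this rule: "
            f"{rules[rule_name]}\n\nQuestion: {base_question}\nRewritten:")
```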
Through iterative prompting, the pipeline produces a structured dataset G = (D, Q, C, A, M) where:
- D – the set of document chunks.
- Q – generated queries.
- C – the set of “clues”, i.e., sentences that must be retrieved to answer the query.
- A – the ground‑truth answer together with answer variants that correspond to partial retrieval scenarios (e.g., missing one of the supporting documents).
- M – two mappings: (M₁) answer‑to‑clue and (M₂) clue‑to‑document‑sentence. These mappings make it possible to compute, for any query, the exact subset of documents and sentences that constitute a correct retrieval set, and to generate realistic “incomplete” answer variants for training.
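One record of G = (D, Q, C, A, M) might be represented as below. The field names and container types are assumptions; the paper defines the tuple abstractly.

```python
from dataclasses import dataclass, field

@dataclass
class DragonExample:
    """One record of the structured dataset G = (D, Q, C, A, M)."""
    documents: list[str]                       # D: document chunks
    query: str                                 # Q: the generated query
    clues: list[str]                           # C: sentences that must be retrieved
    answer: str                                # A: ground-truth answer
    # A (cont.): variants keyed by the subset of clue indices actually
    # retrieved, modeling partial-retrieval scenarios.
    answer_variants: dict[frozenset, str] = field(default_factory=dict)
    answer_to_clue: dict[str, list[int]] = field(default_factory=dict)  # M1
    # M2: clue index -> (document index, sentence index)
    clue_to_sentence: dict[int, tuple[int, int]] = field(default_factory=dict)

    def gold_retrieval_set(self) -> set[int]:
        # The exact documents a correct retrieval must return,
        # recoverable from the M2 mapping.
        return {doc for doc, _ in self.clue_to_sentence.values()}
```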
To evaluate DRAGON, the authors construct DRAGONBench, a benchmark comprising eight domain‑specific corpora across four fields: (1) gaming wikis (Hearthstone, Zelda), (2) medical/health (Drugs.com, Mayo Clinic), (3) software/micro‑electronics (Cyotek, Notion), and (4) academic/research (Stanford, UC Berkeley). For each domain the benchmark provides:
- Queries spanning 1‑ to 3‑hop reasoning.
- Labels for answerability (answerable vs. unanswerable).
- Fine‑grained clue‑completeness levels (full, partial, missing).
- Sentence‑level citations linking each clue to its source.
Because many recent RAG benchmarks rely on LLM‑as‑a‑Judge for automatic scoring—a method known to be unstable—the authors propose Criteria‑Based Score Generation (CSG). CSG uses a predefined rubric (e.g., factual correctness, citation coverage, logical coherence) and a separate LLM to score each answer, yielding more consistent evaluations than raw LLM‑as‑Judge.
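A minimal sketch of CSG-style scoring, assuming per-criterion judging and weighted aggregation: the criterion names follow the rubric examples in the text, while the weights and the judge interface are placeholders (in practice the judge is a separate LLM prompted with one rubric criterion at a time).

```python
# Assumed rubric weights; the paper does not publish these values.
RUBRIC = {"factual_correctness": 0.5,
          "citation_coverage": 0.3,
          "logical_coherence": 0.2}

def csg_score(answer, reference, judge):
    """judge(answer, reference, criterion) -> float in [0, 1].
    Scoring each criterion separately against a fixed rubric is what
    makes CSG more stable than a single free-form LLM-as-a-Judge call."""
    per_criterion = {c: judge(answer, reference, c) for c in RUBRIC}
    total = sum(RUBRIC[c] * s for c, s in per_criterion.items())
    return total, per_criterion
```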
Experimental Setup
Six state‑of‑the‑art dense retrievers from the MTEB leaderboard (sizes 33 M–611 M parameters, context windows 512–8192 tokens) are selected. Three training regimes are compared:
- General‑pretrained – retrievers trained only on public data (Wikipedia, MS‑MARCO).
- In‑domain real – retrievers fine‑tuned on the raw domain documents without synthetic augmentation.
- DRAGON‑augmented – retrievers fine‑tuned on the synthetic dataset generated by DRAGON (including both single‑ and multi‑hop queries, varied logical depth, and clue completeness).
Training uses contrastive learning with ANCE‑style hard‑negative mining and a standard cross‑entropy (InfoNCE) objective. Evaluation metrics include Recall@k for retrieval and end‑to‑end RAG accuracy (using a frozen LLaMA‑2‑13B generator).
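The training objective and retrieval metric can be sketched as below, assuming L2-normalized embeddings and with hard negatives simply given (ANCE mines them from the current index during training).

```python
import numpy as np

def info_nce_loss(q, pos, hard_negs, temperature=0.05):
    """Contrastive (InfoNCE) loss for one query.
    q, pos: (d,) embeddings; hard_negs: (n, d) mined negatives."""
    logits = np.concatenate([[q @ pos], hard_negs @ q]) / temperature
    logits -= logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                       # positive sits at index 0

def recall_at_k(ranked_doc_ids, gold_doc_ids, k=10):
    """Fraction of gold documents found in the top-k retrieved list."""
    return len(set(ranked_doc_ids[:k]) & set(gold_doc_ids)) / len(gold_doc_ids)
```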
Key Findings
- Retrieval Gains – On all four domains, DRAGON‑augmented models achieve 12–19 % absolute improvements in Recall@10 for multi‑hop queries compared to the general‑pretrained baseline, and 5–8 % gains for single‑hop queries.
- Cross‑Domain Generalization – A model fine‑tuned on the Zelda corpus (gaming) transfers well to the other three domains, delivering 4–7 % Recall gains despite no direct exposure, indicating that the synthetic data captures generic reasoning patterns.
- Impact of Logical vs. Completeness Augmentation – Ablation shows that removing logical rephrasings drops multi‑hop performance by ~6 %, while removing completeness rephrasings reduces robustness to missing evidence by ~8 %, confirming both components are essential.
- End‑to‑End RAG Improvements – When integrated into three RAG pipelines:
  - Vanilla RAG – answer accuracy rises from 68.2 % to 71.4 % (+3.2 points).
  - Planning‑based RAG – the planner issues on average 1.8 fewer sub‑queries, and overall accuracy improves by 2.9 points.
  - Iterative RAG – the LLM stops after 2.1 rounds on average (vs. 3.4), cutting inference cost by ~14 % while maintaining a 2.5‑point accuracy boost.
These gains are attributed to the retriever’s ability to surface the exact clue sentences needed for the generator, reducing hallucination and unnecessary context.
Limitations & Future Work
The pipeline’s reliance on high‑quality entity extraction means that domains lacking robust NER models may suffer from noisy graphs. Moreover, generating millions of queries with an LLM incurs substantial GPU time and API costs. The authors suggest (i) lightweight prompting or distillation to reduce LLM overhead, (ii) automated verification of clue‑document links (e.g., using cross‑encoder consistency checks), and (iii) a human‑in‑the‑loop feedback loop to iteratively refine synthetic data quality.
Conclusion
DRAGON offers a systematic, scalable solution to the data scarcity problem that hampers domain‑specific retrieval for RAG. By automatically constructing a richly annotated query‑answer‑clue dataset and fine‑tuning dense retrievers on it, the authors achieve substantial retrieval and end‑to‑end performance improvements across diverse specialized domains. DRAGONBench, with its fine‑grained annotations and robust CSG evaluation, provides a new benchmark for future research on domain‑aware RAG systems. The work underscores that enhancing the retriever—often overlooked in favor of LLM improvements—can be equally, if not more, impactful for building reliable knowledge‑augmented AI.