A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Frontier language models have demonstrated strong reasoning and long-horizon tool-use capabilities. However, existing RAG systems fail to leverage these capabilities. They still rely on two paradigms: (1) designing an algorithm that retrieves passages in a single shot and concatenates them into the model’s input, or (2) predefining a workflow and prompting the model to execute it step-by-step. Neither paradigm allows the model to participate in retrieval decisions, preventing efficient scaling with model improvements. In this paper, we introduce A-RAG, an Agentic RAG framework that exposes hierarchical retrieval interfaces directly to the model. A-RAG provides three retrieval tools: keyword search, semantic search, and chunk read, enabling the agent to adaptively search and retrieve information across multiple granularities. Experiments on multiple open-domain QA benchmarks show that A-RAG consistently outperforms existing approaches with comparable or fewer retrieved tokens, demonstrating that A-RAG effectively leverages model capabilities and dynamically adapts to different RAG tasks. We further systematically study how A-RAG scales with model size and test-time compute. Our code and evaluation suite are available at https://github.com/Ayanami0730/arag to facilitate future research.


💡 Research Summary

The paper introduces A‑RAG, an agentic Retrieval‑Augmented Generation framework that gives large language models (LLMs) direct access to three hierarchical retrieval tools: keyword search, semantic search, and chunk read. Existing RAG systems fall into two paradigms: (1) a single‑shot retrieval that fetches many passages and concatenates them into the model’s context, or (2) a predefined workflow where the model follows a fixed step‑by‑step procedure. Both prevent the model from using its reasoning and tool‑use capabilities to guide retrieval, limiting scalability as models improve.

A‑RAG’s key insight is that information in a corpus is naturally organized at multiple granularities—exact lexical cues, sentence‑level semantic representations, and larger chunk‑level contexts. To exploit this, the authors build a lightweight hierarchical index: the corpus is split into ~1,000‑token chunks aligned to sentence boundaries; each chunk is further broken into sentences, and each sentence is embedded with a pretrained sentence encoder (Qwen3‑Embedding‑0.6B). The keyword level requires no separate index, since exact string matching is performed at query time, dramatically reducing offline indexing cost.
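The offline part of this index can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenization and regex sentence splitter are simplifying assumptions, and the sentence‑embedding step (Qwen3‑Embedding‑0.6B in the paper) is omitted—only the chunk/sentence hierarchy is shown.

```python
import re

CHUNK_TOKEN_BUDGET = 1000  # the paper's ~1,000-token chunk size


def split_sentences(text):
    # Naive sentence splitter; the paper aligns chunk boundaries to sentences.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_index(corpus_text):
    """Build a two-level index mapping chunk_id -> list of sentences.

    Chunks are packed greedily up to CHUNK_TOKEN_BUDGET tokens without
    splitting a sentence. In the full system each sentence would also be
    embedded here; the keyword level needs no index at all.
    """
    sentences = split_sentences(corpus_text)
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # whitespace "tokens" (an assumption)
        if current and current_len + n > CHUNK_TOKEN_BUDGET:
            chunks.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(current)
    return {cid: sents for cid, sents in enumerate(chunks)}
```

Because only sentence embeddings (not chunk summaries or graphs) are precomputed, indexing stays cheap compared with graph‑based RAG pipelines.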

The three tools operate as follows:

  1. Keyword Search – Given a list of keywords and a desired result count k, the tool computes a relevance score based on keyword frequency and chunk length, then returns the top‑k chunk IDs together with snippets: the sentences that contain any of the keywords. This acts as a fast filter to narrow the search space.

  2. Semantic Search – The query is encoded into the same embedding space; cosine similarity is computed against all sentence embeddings. The top‑k sentences are aggregated by their parent chunks, and the highest‑scoring sentence per chunk determines the chunk’s relevance. The tool returns chunk IDs and the matched sentences as snippets, enabling more nuanced semantic matching.

  3. Chunk Read – Using snippets from the previous tools, the model can request the full text of selected chunks (or adjacent chunks for additional context). A context tracker records which chunks have already been read; subsequent reads of the same chunk return an “already read” notice, saving tokens and encouraging exploration of new material.
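The keyword tool above can be sketched against a simple chunk index (chunk ID → list of sentences). The paper describes scoring only as “keyword frequency and length,” so the frequency‑over‑length ratio used here is an assumption, as is the snippet construction:

```python
def keyword_search(index, keywords, k=3):
    """Sketch of the Keyword Search tool.

    Scores each chunk by total keyword occurrences normalized by chunk
    length (an assumed formula), and returns the top-k (chunk_id, snippet)
    pairs, where the snippet is the list of sentences containing any keyword.
    """
    results = []
    for cid, sentences in index.items():
        text = " ".join(sentences).lower()
        hits = sum(text.count(kw.lower()) for kw in keywords)
        if hits == 0:
            continue  # chunks with no keyword match are filtered out
        score = hits / max(len(text.split()), 1)
        snippet = [s for s in sentences
                   if any(kw.lower() in s.lower() for kw in keywords)]
        results.append((score, cid, snippet))
    results.sort(key=lambda r: r[0], reverse=True)
    return [(cid, snippet) for _, cid, snippet in results[:k]]
```

Semantic Search would follow the same chunk‑aggregation shape, but rank sentences by cosine similarity to the query embedding instead of lexical hits.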

A‑RAG adopts a simple ReAct‑style agent loop: at each iteration the LLM reasons, decides which tool to call, observes the result, and repeats. No parallel tool calls or complex orchestration are used, allowing a clean analysis of how the hierarchical interfaces affect behavior. The loop stops either when the model decides it has enough evidence or when a maximum iteration budget is reached, after which the model synthesizes an answer from the gathered information.
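The agent loop described above can be sketched with a stubbed model callable. The action format, tool schemas, and prompting are simplifying assumptions; the point is only the sequential reason → call → observe cycle with an iteration budget:

```python
def agent_loop(model, tools, max_iters=8):
    """Minimal ReAct-style loop (a sketch, not the paper's implementation).

    `model` maps the transcript so far to either
    ("call", tool_name, kwargs) or ("answer", text). Tool calls are strictly
    sequential; there is no parallelism or orchestration.
    """
    transcript = []
    for _ in range(max_iters):
        action = model(transcript)
        if action[0] == "answer":
            return action[1]  # model decided it has enough evidence
        _, name, kwargs = action
        observation = tools[name](**kwargs)
        transcript.append((name, kwargs, observation))
    # Budget exhausted: force a final answer from the gathered evidence.
    return model(transcript + [("finalize", {}, None)])[1]
```

A scripted `model` (e.g. one that issues a single tool call and then answers from the observation) is enough to exercise the loop end to end.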

Experiments were conducted on four multi‑hop QA benchmarks—HotpotQA, 2WikiMultiHopQA, MuSiQue, and GraphRAG‑Bench—using two backbone LLMs: GPT‑4o‑mini and GPT‑5‑mini. Baselines include vanilla direct‑answer models, a naive single‑tool RAG, several graph‑based RAG systems (GraphRAG, HippoRAG2, LinearRAG), and workflow‑based agents (FaithfulRAG, MA‑RAG, RAGentA). Two evaluation metrics were used: LLM‑Evaluation Accuracy (LLM‑Acc), an LLM‑based semantic equivalence check, and Contain‑Match Accuracy (Cont‑Acc), which verifies whether the ground‑truth answer appears verbatim in the generated response.
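Of the two metrics, Cont‑Acc is mechanical enough to sketch directly; the lowercase/whitespace normalization below is an assumption on top of the paper's "verbatim" description, and LLM‑Acc would instead call a judge model:

```python
def contain_match(prediction, gold):
    """Cont-Acc check: does the gold answer appear verbatim in the output?

    Normalizing case and whitespace before the substring test is an
    assumed detail, not stated in the summary.
    """
    norm = lambda s: " ".join(s.lower().split())
    return norm(gold) in norm(prediction)
```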

Results show that A‑RAG (Full) consistently outperforms all baselines across both metrics, often with fewer retrieved tokens. For example, with GPT‑5‑mini on MuSiQue, A‑RAG (Full) achieves 94.5 % LLM‑Acc and 88.0 % Cont‑Acc, surpassing the best prior method by roughly 5–10 percentage points. Even the “Naive” variant that only uses semantic search beats many graph‑ and workflow‑based systems, demonstrating the power of giving the model agency over retrieval.

A systematic scaling study reveals that performance improves steadily as model size and test‑time compute increase, indicating that the hierarchical interface scales well with advances in LLM capabilities. The authors attribute this to the model’s ability to dynamically select the most appropriate granularity (keyword → semantic → full chunk) based on the question’s difficulty, rather than being constrained by a static pipeline.

Limitations discussed include the reliance of keyword search on exact matches, which may miss synonyms or morphological variants, and the fixed chunk size and top‑k hyperparameters that may be suboptimal for very large corpora or extremely long documents. Future work is proposed on dynamic chunking, multimodal retrieval, and reinforcement‑learning‑based tool‑selection policies to further generalize the agentic RAG paradigm.

In summary, A‑RAG demonstrates that exposing LLMs to hierarchical retrieval tools enables true agentic behavior, reduces unnecessary context, and yields superior QA performance that scales with model improvements. The framework offers a clean, extensible foundation for next‑generation retrieval‑augmented generation systems.

