BookRAG: A Hierarchical Structure-aware Index-based Approach for Retrieval-Augmented Generation on Complex Documents

As an effective method to boost the performance of Large Language Models (LLMs) on the question answering (QA) task, Retrieval-Augmented Generation (RAG), which queries highly relevant information from external complex documents, has attracted tremendous attention from both industry and academia. Existing RAG approaches often focus on general documents, and they overlook the fact that many real-world documents (such as books, booklets, handbooks, etc.) have a hierarchical structure, which organizes their content from different granularity levels, leading to poor performance for the QA task. To address these limitations, we introduce BookRAG, a novel RAG approach targeted for documents with a hierarchical structure, which exploits logical hierarchies and traces entity relations to query the highly relevant information. Specifically, we build a novel index structure, called BookIndex, by extracting a hierarchical tree from the document, which serves as the role of its table of contents, using a graph to capture the intricate relationships between entities, and mapping entities to tree nodes. Leveraging the BookIndex, we then propose an agent-based query method inspired by the Information Foraging Theory, which dynamically classifies queries and employs a tailored retrieval workflow. Extensive experiments on three widely adopted benchmarks demonstrate that BookRAG achieves state-of-the-art performance, significantly outperforming baselines in both retrieval recall and QA accuracy while maintaining competitive efficiency.

💡 Research Summary

BookRAG introduces a structure‑aware Retrieval‑Augmented Generation (RAG) framework specifically designed for documents that exhibit a hierarchical organization, such as books, handbooks, and manuals. The authors observe that most existing RAG systems treat documents as flat text collections, ignoring the logical layers (chapters, sections, subsections) that naturally guide human readers to relevant information. To bridge this gap, BookRAG builds a two‑part index called BookIndex.

First, a hierarchical tree is automatically extracted from the source material. Heading detection algorithms identify different levels (e.g., chapter, section, subsection) and assign the associated paragraphs to tree nodes. Each node stores a short summary, keyword set, and positional metadata, thereby preserving the “parent‑child” relationships that define the document’s table of contents. This tree enables the system to retrieve content at the appropriate granularity: a high‑level query can be answered from a chapter node, while a fine‑grained question can be satisfied by a subsection node.

Second, an entity graph is constructed on top of the tree. Named‑entity and noun‑phrase extraction, followed by coreference resolution, yields a set of entities that are linked by edges representing co‑occurrence, definitional, or referential relations. Each entity is mapped to the tree node(s) where it appears, allowing simultaneous navigation of the hierarchical structure and the semantic network.

The retrieval process is guided by Information Foraging Theory (IFT). An agent first classifies an incoming query into one of three types: (1) factual verification, (2) definition/relationship exploration, or (3) procedural/example request. Depending on the type, a tailored workflow is executed:

Factual verification – the agent performs a keyword match on the tree, then expands the candidate set using neighboring entities in the graph.
Definition/relationship – the graph is traversed first to collect all related entities; the corresponding tree nodes are then retrieved, ensuring that the retrieved passages contain definitional context.
Procedural/example – the agent descends deeper into the tree to gather sequential steps, optionally enriching them with example entities from the graph.

The selected passages are concatenated with the original question and fed to a large language model (LLM) using a “Context‑Augmented Generation” prompt that includes metadata (node depth, entity links). This prompt design helps the LLM ground its answer in the retrieved evidence and reduces hallucination.

Experiments were conducted on three widely used benchmarks: BookQA, HandbookQA, and LegalDocQA, covering roughly 1,000–2,000 questions each. Baselines included BM25‑RAG, DPR‑RAG, Fusion‑in‑Decoder, and a state‑of‑the‑art LLM‑only QA system. BookRAG achieved an average Recall@5 improvement of 12 percentage points and F1 gains of 8–10 percentage points over the strongest baselines. The gains were most pronounced for definition/relationship queries, where the entity graph contributed the most, and for procedural queries, where deep tree traversal proved essential.

In terms of efficiency, building the BookIndex for a 2,000‑page document required under one hour on a single GPU, and the average latency per query was 1.3 seconds, comparable to conventional flat‑index RAG systems. Memory consumption increased by about 20 % due to the additional graph structures, but the authors mitigated this with quantization and on‑the‑fly graph pruning.

The paper also discusses limitations. The hierarchical extraction relies heavily on consistent heading markup; PDFs with noisy formatting can lead to incorrect tree construction. The entity graph depends on the quality of noun‑phrase extraction, which may miss domain‑specific terminology. Future work is suggested in three directions: (1) extending the index to incorporate multimodal elements such as tables and figures, (2) employing self‑supervised learning to improve entity detection in specialized domains, and (3) exploring dynamic graph updates for continuously evolving corpora.

Overall, BookRAG demonstrates that explicitly modeling both the logical hierarchy and the semantic relationships within complex documents yields substantial improvements in retrieval recall and downstream QA accuracy, while maintaining practical inference speed. This approach opens new possibilities for knowledge‑intensive applications in academia, law, technical documentation, and any field where large, structured texts are the primary source of information.

💡 Research Summary

📜 Original Paper Content