DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search
With the rapid advancement of tool-use capabilities in Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) is shifting from static, one-shot retrieval toward autonomous, multi-turn evidence acquisition. However, existing agentic search frameworks typically treat long documents as flat collections of unstructured chunks, disregarding the native hierarchical organization and sequential logic essential for human comprehension. To bridge this gap, we introduce **DeepRead**, a structure-aware document reasoning agent designed to operationalize document-native structural priors into actionable reasoning capabilities. Leveraging the structural fidelity of modern OCR, DeepRead constructs a paragraph-level, coordinate-based navigation system and equips the LLM with two synergistic tools: `Retrieve` for scanning-aware localization, and `ReadSection` for contiguous, order-preserving reading within specific hierarchical scopes. This design elicits a human-like "locate-then-read" reasoning paradigm, effectively mitigating the context fragmentation inherent in traditional retrieval methods. Extensive evaluations across four benchmarks spanning diverse document types demonstrate that DeepRead outperforms Search-o1-style agentic search baselines by an average of 10.3%. Fine-grained behavioral analysis further confirms that DeepRead autonomously adopts human-aligned reading strategies, validating the critical role of structural awareness in achieving precise document reasoning. Our code is available at https://github.com/Zhanli-Li/DeepRead.
💡 Research Summary
DeepRead tackles a fundamental shortcoming of current agentic Retrieval‑Augmented Generation (RAG) systems: the inability to exploit the native hierarchical and sequential structure of long documents. While modern LLMs can invoke external tools autonomously, most existing frameworks treat documents as flat collections of text chunks, ignoring the headings, sections, and reading order that humans naturally rely on when searching for information. This “structural blindness” leads to two major problems: (1) keyword‑driven retrieval may miss relevant evidence whose wording the agent fails to anticipate in its queries, and (2) the agent repeatedly revisits regions it has already examined, wasting interaction turns.
The proposed solution leverages recent advances in OCR that can output richly structured markup (e.g., Markdown) preserving both hierarchy (headings) and sequence (paragraph order). DeepRead first parses a document with a state‑of‑the‑art OCR engine, then maps each heading and paragraph to a compact two‑dimensional coordinate system: (section_id, paragraph_id). This coordinate system becomes a first‑class interface for the LLM, allowing it to reason about “where” evidence resides and “how far” to read.
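The mapping from OCR output to the coordinate system can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' code: it assumes the OCR engine emits Markdown in which `#`-prefixed lines are headings and each remaining non-empty line is one paragraph.

```python
def build_coordinates(markdown_text):
    """Map each paragraph to a (section_id, paragraph_id) coordinate.

    Hypothetical sketch: a heading line (starting with '#') opens a new
    section; every non-empty line under it counts as one paragraph.
    """
    coords = {}    # (section_id, paragraph_id) -> paragraph text
    headings = {}  # section_id -> heading text
    section_id, paragraph_id = 0, 0
    for line in markdown_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # New section: bump the section counter, reset paragraph counter.
            section_id += 1
            paragraph_id = 0
            headings[section_id] = line.lstrip("#").strip()
        else:
            paragraph_id += 1
            coords[(section_id, paragraph_id)] = line
    return headings, coords
```

With this index, "where evidence resides" is a pair of small integers, and "how far to read" is an interval over `paragraph_id` within one section.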
DeepRead equips the LLM with two synergistic tools:
- **Retrieve** – a scanning‑aware retrieval primitive that accepts a natural‑language query, searches over the entire document collection, and returns a set of matching paragraphs together with their coordinates. Retrieve therefore provides a quick semantic anchor (the “locate” step).
- **ReadSection** – a contiguous reading primitive that, given a specific section identifier and a paragraph interval, returns the full text of that interval in the original order. ReadSection implements the “read” step, delivering a coherent, order‑preserving narrative rather than isolated snippets.
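The two primitives can be sketched over the coordinate index as plain functions. This is a simplified stand-in: the scoring here is naive word overlap, whereas the actual system presumably uses a much stronger retriever, and the function names and signatures are illustrative only.

```python
def retrieve(query, index, top_k=3):
    """'Locate' step: return up to top_k paragraphs with their coordinates.

    Naive bag-of-words overlap stands in for the real retriever.
    """
    q_tokens = set(query.lower().split())
    scored = []
    for coord, text in index.items():
        overlap = len(q_tokens & set(text.lower().split()))
        if overlap:
            scored.append((overlap, coord, text))
    scored.sort(reverse=True)  # highest overlap first
    return [(coord, text) for _, coord, text in scored[:top_k]]

def read_section(index, section_id, start=1, end=None):
    """'Read' step: return paragraphs [start, end] of one section, in order."""
    paras = sorted(
        (pid, text) for (sid, pid), text in index.items() if sid == section_id
    )
    selected = [t for pid, t in paras
                if pid >= start and (end is None or pid <= end)]
    return "\n".join(selected)
```

Because `retrieve` returns coordinates and `read_section` consumes them, the output of the first call is directly actionable input for the second.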
Both tools are integrated into the ReAct agentic framework. At each interaction round the agent observes its full history (system prompt, user question, prior tool calls and observations) and decides either to invoke one of the two tools or to output a final answer. Because each tool call returns explicit coordinates, the agent can keep track of which regions have already been examined, thereby avoiding redundant retrievals. This mirrors how a human reader first scans a table of contents or headings to locate a relevant section, then reads the whole section linearly.
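The interaction loop described above can be sketched as follows. Everything here is hypothetical scaffolding: `tools` maps tool names to callables that return an observation plus the coordinates it touched, and `llm` is a stand-in policy that maps the history to the next action. The visited-coordinate set makes the redundancy-avoidance idea concrete.

```python
def answer_question(question, tools, llm, max_turns=8):
    """Hedged sketch of a ReAct-style loop with coordinate tracking.

    `llm(history)` returns either {"tool": name, "args": {...}} or
    {"answer": text}. Each tool call returns (observation, coords),
    and visited coordinates are fed back so the policy can avoid
    re-examining regions it has already read.
    """
    history = [("user", question)]
    visited = set()  # coordinates already examined
    for _ in range(max_turns):
        action = llm(history)
        if "answer" in action:
            return action["answer"]
        observation, coords = tools[action["tool"]](**action["args"])
        visited.update(coords)
        history.append(("tool", {"observation": observation,
                                 "visited": sorted(visited)}))
    return None  # turn budget exhausted without a final answer
```

A scripted policy that retrieves once, reads the located section, and then answers exercises exactly the "locate-then-read" pattern the paper describes.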
The authors evaluate DeepRead on four diverse benchmarks that require multi‑hop reasoning over long, structured documents: financial reports, legal contracts, scientific papers, and multi‑document synthesis tasks. Each benchmark contains complex questions whose answers are distributed across multiple paragraphs within the same or across sections. DeepRead is compared against two baselines built on the same underlying LLM (e.g., GPT‑4): a Search‑o1‑style agent that only has a generic Retrieve tool, and a traditional one‑shot RAG pipeline that performs a single top‑k retrieval before generation.
Results show that DeepRead achieves an average 10.3 percentage‑point improvement in Exact Match accuracy over the Search‑o1 baseline. The gain is especially pronounced (up to 15 pp) on questions that require reading an entire section to capture all relevant constraints. Moreover, DeepRead reduces the average number of interaction turns from 4.2 to 3.1 and exhibits a high coordinate‑reuse rate (≈78 %), indicating effective avoidance of duplicate searches.
Behavioral analysis of the agent’s logs confirms that DeepRead indeed adopts a “locate‑then‑read” pattern: early turns are dominated by Retrieve calls that pinpoint a promising heading, followed by one or two ReadSection calls that consume the whole section. In contrast, the baseline agent repeatedly issues narrow Retrieve calls, often missing key terms and revisiting the same area.
Ablation studies further validate the design choices. Removing Retrieve forces the agent to guess sections directly via ReadSection, causing a 6–8 pp drop in accuracy because semantic anchoring is lost. Removing ReadSection forces the agent to rely solely on fragmented Retrieve results, leading to a 7–9 pp accuracy loss due to broken context continuity. The authors also quantify the impact of OCR quality: when parsing accuracy is ≥95 % the performance degradation is negligible, but it rises to a 4–5 pp drop when OCR accuracy falls below 85 %, highlighting the importance of reliable document parsing.
Limitations discussed include dependence on OCR correctness, the current focus on textual hierarchy (tables, figures, and other visual elements are not yet integrated), and the need for more sophisticated multi‑LLM collaboration for extremely complex queries. Future work may extend the coordinate system to multimodal elements, explore hierarchical planning across multiple documents, and combine expert LLMs with retrieval agents for domain‑specific tasks.
In summary, DeepRead introduces a novel, structure‑aware agentic search paradigm that transforms raw visual documents into a navigable coordinate space and equips LLMs with tools that enable human‑like locate‑then‑read behavior. By explicitly modeling document hierarchy and sequence, DeepRead significantly improves both the accuracy and efficiency of long‑document question answering, establishing structural awareness as a critical component for next‑generation retrieval‑augmented language agents.