ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization

Reading time: 5 minutes

📝 Original Info

  • Title: ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization
  • ArXiv ID: 2511.18192
  • Date: 2025-11-22
  • Authors: Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath

📝 Abstract

Document Visual Question Answering (VQA) requires models to not only extract accurate textual answers but also precisely localize them within document images, a capability critical for interpretability in high-stakes applications. However, existing systems achieve strong textual accuracy while producing unreliable spatial grounding, or sacrifice performance for interpretability. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework that orchestrates specialized tools through an LLM-based planning agent to achieve both precise answer extraction and reliable spatial grounding. ARIAL decomposes Document VQA into structured subtasks: OCR-based text extraction with TrOCR, retrieval-augmented context selection using semantic search, answer generation via a fine-tuned Gemma 3-27B model, and explicit bounding-box localization through text-to-region alignment. This modular architecture produces transparent reasoning traces, enabling tool-level auditability and independent component optimization. We evaluate ARIAL on four benchmarks (DocVQA, FUNSD, CORD, and SROIE) using both textual accuracy (ANLS) and spatial precision (mAP at IoU 0.50 to 0.95). ARIAL achieves state-of-the-art results across all datasets: 88.7 ANLS and 50.1 mAP on DocVQA, 90.0 ANLS and 50.3 mAP on FUNSD, 85.5 ANLS and 60.2 mAP on CORD, and 93.1 ANLS on SROIE, surpassing the previous best method (DLaVA) by +2.8 ANLS and +3.9 mAP on DocVQA. Our work demonstrates how agentic orchestration of specialized tools can simultaneously improve performance and interpretability, providing a pathway toward trustworthy, explainable document AI systems.
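The paper reports textual accuracy as ANLS (Average Normalized Levenshtein Similarity), the standard Document VQA metric: each prediction is scored by its normalized edit similarity to the closest ground-truth answer, with scores below a threshold (conventionally 0.5) truncated to zero. A minimal sketch of that computation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
    # advance to the next row of the DP table
        prev = cur
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of answer strings, one per question.
    ground_truths: list of lists of acceptable answers per question.
    Scores below tau are counted as 0 (standard ANLS behavior).
    """
    total = 0.0
    for pred, golds in zip(predictions, ground_truths):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl)
        total += best if best >= tau else 0.0
    return total / len(predictions)
```

A near-miss answer still earns partial credit: `anls(["hllo"], [["hello"]])` gives 0.8, while an entirely wrong answer scores 0.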

💡 Deep Analysis

Figure 1: Overview of the ARIAL agentic workflow for Document VQA.

📄 Full Content

ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization

  • Ahmad Mohammadshirazi, Ohio State University / Flairsoft, Columbus, Ohio, US (mohammadshirazi.2@osu.edu)
  • Pinaki Prasad Guha Neogi, Ohio State University, Columbus, Ohio, US (guhaneogi.2@osu.edu)
  • Dheeraj Kulshrestha, Flairsoft, Columbus, Ohio, US (dheeraj@flairsoft.net)
  • Rajiv Ramnath, Ohio State University, Columbus, Ohio, US (ramnath.6@osu.edu)

Abstract

Document Visual Question Answering (VQA) requires models to not only extract accurate textual answers but also precisely localize them within document images, a capability critical for interpretability in high-stakes applications. However, existing systems achieve strong textual accuracy while producing unreliable spatial grounding, or sacrifice performance for interpretability. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework that orchestrates specialized tools through an LLM-based planning agent to achieve both precise answer extraction and reliable spatial grounding. ARIAL decomposes Document VQA into structured subtasks: OCR-based text extraction with TrOCR, retrieval-augmented context selection using semantic search, answer generation via fine-tuned Gemma 3-27B, and explicit bounding-box localization through text-to-region alignment. This modular architecture produces transparent reasoning traces, enabling tool-level auditability and independent component optimization. We evaluate ARIAL on four benchmarks (DocVQA, FUNSD, CORD, and SROIE) using both textual accuracy (ANLS) and spatial precision (mAP@IoU 0.50:0.95). ARIAL achieves state-of-the-art results across all datasets: 88.7 ANLS and 50.1 mAP on DocVQA, 90.0 ANLS and 50.3 mAP on FUNSD, 85.5 ANLS and 60.2 mAP on CORD, and 93.1 ANLS on SROIE, surpassing the previous best method (DLaVA) by +2.8 ANLS and +3.9 mAP on DocVQA.
Our work demonstrates how agentic orchestration of specialized tools can simultaneously improve performance and interpretability, providing a pathway toward trustworthy, explainable document AI systems. Code is available at: https://github.com/ahmad-shirazi/ARIAL

39th Conference on Neural Information Processing Systems (NeurIPS 2025). arXiv:2511.18192v2 [cs.CV], 28 Nov 2025.

1 Introduction

Document Visual Question Answering (VQA) requires reasoning over both textual content and visual layout in scanned or digitally rendered documents. Models must not only read and understand diverse formats (forms, receipts, reports) but also locate where answers appear within the document structure. While recent models such as LayoutLMv3 [14], LayoutLLM [29], and DocLayLLM [25] have improved textual accuracy by combining language with layout features, they often treat localization as a secondary task. Consequently, they may generate plausible answers without clearly identifying their source in the document, making verification difficult. Standard metrics like ANLS [40] capture string similarity but fail to reflect spatial correctness, prompting a shift toward combined evaluations that include IoU for grounding precision.

Figure 1: Overview of the ARIAL agentic workflow for Document VQA. The system consists of three modular stages: (1) Input Processing, where an OCR module extracts text segments and bounding boxes from a document image; (2) Agentic Reasoning Pipeline, where the planner agent coordinates task execution: retrieving relevant text, invoking QA or computation, and triggering spatial grounding; and (3) Output Generation, where the final answer and its bounding box are produced. The reasoning loop enables iterative refinement based on confidence, supporting flexible and context-aware decision-making.

DLaVA [33] introduced answer localization by integrating bounding-box prediction within a large multimodal transformer.
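The grounding metric referenced here, mAP@IoU 0.50:0.95, averages precision over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 (the COCO convention). The core quantity is the intersection-over-union of a predicted and a ground-truth box; a minimal sketch for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle (empty if the boxes are disjoint)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union > 0 else 0.0

# mAP@[0.50:0.95] averages average precision over these ten thresholds:
thresholds = [0.50 + 0.05 * k for k in range(10)]
```

At the strictest threshold (0.95) a predicted box must almost exactly coincide with the ground-truth region, which is why spatial mAP scores run well below ANLS scores on the same benchmarks.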
However, its monolithic design can be computationally intensive and may struggle with fine-grained details in dense or handwritten layouts.

We propose ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular document VQA framework built around an agentic planning model. Rather than using a single large model, ARIAL delegates subtasks (OCR, layout analysis, retrieval, reasoning, and grounding) to specialized modules orchestrated by a central agent. This agent, implemented with LLaMA 4 Scout [31], dynamically selects tools and composes multi-step reasoning chains, enabling accurate and interpretable answers with precise spatial grounding.

Our key contributions are:

1. Agentic Document QA: We introduce an agent-based document VQA system that decomposes queries into tool calls for OCR, retrieval, and grounding. The modular design enables tool reuse, error tracing, and flexible adaptation across document types.
2. Precise Answer Localization: ARIAL produces both answer text and corresponding bounding boxes by aligning answers to OCR-detected spans and contextual cues, ensuring visual traceability.
3. Retrieval-Augmented Reasoning: ARIAL incorporates retrieval-augmented ...
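The four-stage decomposition described above (OCR extraction, retrieval-augmented context selection, answer generation, text-to-region grounding) can be sketched as a simple tool-dispatch pipeline. This is a toy illustration, not the authors' code: TrOCR, the semantic retriever, and the fine-tuned Gemma 3-27B answerer are replaced by trivial heuristic stand-ins so the control flow is runnable end to end.

```python
from dataclasses import dataclass

@dataclass
class Span:
    text: str
    box: tuple  # (x1, y1, x2, y2) in page coordinates

def run_ocr(document):
    """Stage 1, OCR extraction. Stand-in: 'document' is already a list of
    Spans, in the shape TrOCR plus a detector would produce from a page image."""
    return list(document)

def retrieve(spans, question, k=3):
    """Stage 2, retrieval-augmented context selection. Stand-in: rank spans by
    word overlap with the question instead of semantic embedding search."""
    q = set(question.lower().split())
    return sorted(spans, key=lambda s: -len(q & set(s.text.lower().split())))[:k]

def generate_answer(question, context):
    """Stage 3, answer generation. Stand-in for the fine-tuned LLM: return the
    text of the top-ranked retrieved span."""
    return context[0].text

def ground(answer_text, spans):
    """Stage 4, text-to-region alignment: map the answer string back to the
    bounding box of the OCR span that contains it."""
    for s in spans:
        if answer_text.lower() in s.text.lower():
            return s.box
    return None  # no matching span found

def arial_pipeline(document, question):
    spans = run_ocr(document)
    context = retrieve(spans, question)
    answer = generate_answer(question, context)
    box = ground(answer, spans)
    return {"answer": answer, "box": box}
```

The payoff of this structure is auditability: each stage's input and output can be logged and inspected independently, and any single tool (e.g. the retriever) can be swapped or re-tuned without touching the rest of the pipeline.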


Reference

This content is AI-processed based on open access ArXiv data.
