Vision-First Multimodal RAG: Innovations in OCR-Free Pipelines and Pyramid Indexing
📝 Abstract
Document-centric RAG pipelines typically begin with OCR, followed by brittle, engineering-heavy heuristics for chunking, table parsing, and layout reconstruction. These text-first workflows are costly to maintain, sensitive to small layout shifts, and discard the visuo-spatial cues that frequently contain the answer. Vision-first retrieval has recently emerged as a compelling alternative: by operating directly on page images, systems such as ColPali and ColQwen preserve spatial structure and reduce pipeline complexity while achieving strong benchmark performance. However, these late-interaction models tightly couple retrieval to a specific vision backbone and require storing hundreds of patch embeddings per page, creating substantial memory overhead and complicating large-scale deployment. We introduce VisionRAG, a multimodal retrieval system that is both OCR-free and model-agnostic. VisionRAG indexes documents directly as images, preserving layout, table structure, and spatial cues, and constructs semantic vectors without committing to a specific extraction. Our three-pass pyramid indexing framework creates semantic vectors using global page summaries, section headers, visual hotspots, and fact-level cues. These summaries serve as lightweight retrieval surrogates: at query time, VisionRAG retrieves the most relevant pages using the pyramid index, then forwards the raw page image (encoded as base64) to a multimodal LLM for final question answering. During retrieval, reciprocal rank fusion integrates representations across the pyramid, yielding robust ranking across heterogeneous visual and textual content. VisionRAG maintains just 17-27 vectors per page, matching the efficiency of patch-based approaches while remaining adaptable to different multimodal encoders.
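The reciprocal-rank-fusion step described above can be illustrated with a minimal sketch. The level names, page IDs, and the constant `k=60` are illustrative assumptions, not the paper's exact implementation; the paper only states that RRF combines ranked lists produced by the different pyramid levels.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of page IDs into one ranking.

    Each list comes from one pyramid level (e.g. page summaries,
    section headers, visual hotspots, fact-level cues). A page's
    fused score is the sum of 1 / (k + rank) over the lists that
    retrieved it; k=60 is the conventional RRF constant.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-level rankings for one query:
summaries = ["p7", "p2", "p9"]   # global page-summary index
headers   = ["p2", "p7", "p4"]   # section-header index
facts     = ["p2", "p9", "p1"]   # fact-level cue index
fused = reciprocal_rank_fusion([summaries, headers, facts])
# "p2" ranks first: it appears near the top of all three lists.
```

A page retrieved by several pyramid levels accumulates score from each list, which is what makes the fused ranking robust across heterogeneous visual and textual content.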
On financial document benchmarks, VisionRAG achieves 0.8051 accuracy@10 on FinanceBench and 0.9629 Recall@100 on TAT-DQA, demonstrating strong coverage of answer-bearing content in complex, visually rich documents. These results suggest that OCR-free, summary-guided multimodal retrieval provides a practical and scalable alternative to traditional text-extraction pipelines.
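The two reported metrics can be read as follows: accuracy@10 asks whether an answer-bearing page appears among the top 10 retrieved pages, and Recall@100 measures the fraction of gold pages recovered within the top 100. A small sketch with made-up page IDs (the exact evaluation protocol is an assumption here):

```python
def hit_at_k(retrieved, gold, k=10):
    """Accuracy@k for one query: 1 if any gold page is in the top-k."""
    return int(any(p in gold for p in retrieved[:k]))

def recall_at_k(retrieved, gold, k=100):
    """Recall@k for one query: fraction of gold pages in the top-k."""
    found = sum(1 for p in gold if p in retrieved[:k])
    return found / len(gold)

retrieved = ["p3", "p8", "p1", "p5"]  # ranked output for one query
gold = {"p1", "p9"}                   # answer-bearing pages

top3_hit = hit_at_k(retrieved, gold, k=3)  # p1 appears in the top 3
r4 = recall_at_k(retrieved, gold, k=4)     # one of two gold pages found
```

Benchmark-level numbers like 0.8051 are then averages of these per-query values over the full query set.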
📄 Content
Beyond Patch Aggregation: 3-Pass Pyramid Indexing for Vision-Enhanced Document Retrieval

Anup Roy (anup.roy@inceptionai.ai), Inception AI, Abu Dhabi, UAE
Rishabh Gyanendra Upadhyay (rishabh.upadhyay@inceptionai.ai), Inception AI, Abu Dhabi, UAE
Animesh Rameshbhai Panara (animesh.panara@inceptionai.ai), Inception AI, Abu Dhabi, UAE
Robin Mills (robin.mills@inceptionai.ai), Inception AI, Abu Dhabi, UAE
Aidan Philip Millar (amillar@mubadala.ae), Mubadala, Abu Dhabi, UAE

Keywords: Retrieval Augmented Generation, Vision-Language Models, Document Question Answering, Reciprocal Rank Fusion, Multi-Index Retrieval, ColPali, FinanceBench, TAT-DQA, Pyramid Indexing, Explicit Semantic Fusion

ACM Reference Format: Anup Roy, Rishabh Gyanendra Upadhyay, Animesh Rameshbhai Panara, Robin Mills, and Aidan Philip Millar. 2025. Beyond Patch Aggregation: 3-Pass Pyramid Indexing for Vision-Enhanced Document Retrieval. In Proceedings of XX. ACM, New York, NY, USA, 12 pages. https://doi.org/XXXXXXX.XXXXXXX

1 Introduction

Retrieval-Augmented Generation (RAG) has improved factual grounding in large language models by enabling access to external knowledge sources [17]. However, in enterprise and financial domains, critical information is embedded in visually rich PDFs containing complex tables, multi-column layouts, section hierarchies, and spatial cues. OCR-based pipelines flatten these structures into plain text, discarding layout boundaries, table geometry, and reading order, leading to degraded retrieval recall and weaker downstream answer quality.
These limitations are amplified in document-intensive settings such as financial filings, where hundreds of densely formatted pages contain key facts within table cells, visually emphasized regions, or multi-column spans that OCR systems often fragment or misinterpret. As a result, text-only representations fail to capture the multimodal signals necessary for accurate indexing and retrieval. Recent vision-aware systems address these issues by processing document pages directly as images. Approaches such as ColPali generate dense patch-level embeddings to support image-to-text matching. While effective, they impose substantial computational cost: ColPali produces hundreds of patch embeddings per page, and even aggressively pooled variants still require ∼341 vectors, far exceeding what is feasible for large-scale indexing and low-latency retrieval. Figure 1 summarizes this evolution from OCR-based RAG to dense vision retrieval and our proposed approach. These computational constraints pose a challenge in enterprise environments, where repositories may contain millions of pages and latency, memory, and hardware budgets are tightly restricted. VisionRAG is designed specifically with these constraints in mind. By relying on compact semantic representations
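The storage argument can be made concrete with back-of-envelope arithmetic. The vector counts (∼341 pooled patch vectors vs. 17-27 pyramid vectors per page) come from the text; the embedding dimension and float16 precision are illustrative assumptions chosen for the sketch, not figures from the paper.

```python
def index_size_gb(pages, vectors_per_page, dim=128, bytes_per_val=2):
    """Approximate raw vector storage for a page-image index.

    Assumes dim-dimensional float16 embeddings (2 bytes per value);
    ignores index metadata and compression.
    """
    return pages * vectors_per_page * dim * bytes_per_val / 1e9

pages = 1_000_000  # an enterprise-scale corpus

pooled_colpali = index_size_gb(pages, 341)  # ~341 pooled vectors/page
visionrag_hi   = index_size_gb(pages, 27)   # VisionRAG upper bound
```

Under these assumptions, a million-page corpus needs roughly 87 GB of raw embeddings for the pooled patch-based index versus under 7 GB for VisionRAG, which is the order-of-magnitude gap that matters for latency- and memory-constrained deployments.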
This content is AI-processed based on ArXiv data.