IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering
AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: how do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieves 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, enabling multimodal hybrid search to outperform either alone, achieving 49% Recall@1, 81% Recall@5, and 95% Recall@20. We further evaluate efficiency-performance tradeoffs with MUVERA and assess multiple multi-vector image embedding models. Among closed-source models, Cohere Embed v4 page image embeddings outperform Voyage 3 Large text embeddings and all tested open-source models, achieving 58% Recall@1, 87% Recall@5, and 97% Recall@20. For question answering, text-based RAG systems achieve higher ground-truth alignment than image-based systems (0.82 vs. 0.71), and both benefit substantially from increased retrieval depth, with multi-document retrieval outperforming oracle single-document retrieval. We analyze the complementary limitations of unimodal text and image representations and identify question types that require one modality over the other. The IRPAPERS dataset and all experimental code are publicly available.
💡 Research Summary
IRPAPERS introduces a new benchmark for visual document retrieval and question answering (QA) that directly compares image‑based and text‑based approaches on scientific literature. The dataset comprises 3,230 pages from 166 information‑retrieval (IR) papers, each provided as a high‑resolution image and an OCR transcription generated with GPT‑4.1. A set of 180 “needle‑in‑the‑haystack” questions was curated to require the exact source page for a correct answer, ensuring that retrieval systems must discriminate fine‑grained methodological details rather than relying on coarse topical similarity.
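Because each question has exactly one gold page, the benchmark's Recall@k metric reduces to the fraction of questions whose gold page appears in the top-k retrieved pages. A minimal sketch (function and variable names are illustrative, not from the paper's released code):

```python
def recall_at_k(rankings, gold_pages, k):
    """Fraction of queries whose single gold page appears in the top-k results.

    rankings: dict mapping query id -> list of page ids, best first.
    gold_pages: dict mapping query id -> the one correct page id.
    """
    hits = sum(1 for qid, gold in gold_pages.items() if gold in rankings[qid][:k])
    return hits / len(gold_pages)

# Toy example with three queries:
rankings = {
    "q1": ["p3", "p1", "p7"],   # gold p1 at rank 2
    "q2": ["p2", "p4", "p9"],   # gold p2 at rank 1
    "q3": ["p5", "p6", "p8"],   # gold p0 not retrieved
}
gold = {"q1": "p1", "q2": "p2", "q3": "p0"}

print(recall_at_k(rankings, gold, 1))  # 1/3
print(recall_at_k(rankings, gold, 3))  # 2/3
```

The same function evaluated at k = 1, 5, and 20 over the 180 questions yields the percentages reported throughout the summary.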
The authors evaluate open‑source retrieval models for both modalities. Text retrieval uses Arctic 2.0 dense embeddings, BM25, and a hybrid combination; image retrieval employs multi‑vector embedding models such as ColModernVBERT, ColPali, and ColQwen2. Results show that text retrieval slightly outperforms image retrieval at the top rank (Recall@1 = 46 % vs 43 %), while image retrieval catches up and exceeds text at deeper ranks (Recall@20 = 93 % vs 91 %). Crucially, the two modalities exhibit complementary failure patterns: many queries succeed with one and fail with the other. By normalizing scores and fusing them, a multimodal hybrid search achieves the best performance (Recall@1 = 49 %, Recall@20 = 95 %).
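The score-normalization-and-fusion step can be sketched as per-query min-max normalization followed by a weighted sum. The exact normalization and weighting used in the paper are not specified here; this is one common illustrative scheme:

```python
def minmax_normalize(scores):
    """Rescale a {page_id: score} map to [0, 1] for one query."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {p: 0.0 for p in scores}
    return {p: (s - lo) / (hi - lo) for p, s in scores.items()}

def hybrid_fuse(text_scores, image_scores, alpha=0.5):
    """Re-rank pages by a weighted sum of normalized text and image scores.

    Pages missing from one modality's result list contribute 0 for it.
    """
    t = minmax_normalize(text_scores)
    i = minmax_normalize(image_scores)
    pages = set(t) | set(i)
    fused = {p: alpha * t.get(p, 0.0) + (1 - alpha) * i.get(p, 0.0) for p in pages}
    return sorted(fused, key=fused.get, reverse=True)

# Raw scores live on different scales (e.g. BM25 vs cosine similarity),
# which is why per-query normalization is needed before fusing.
text = {"p1": 12.0, "p2": 9.5, "p3": 8.0}
image = {"p2": 0.82, "p4": 0.80, "p1": 0.35}
print(hybrid_fuse(text, image))  # ['p2', 'p1', 'p4', 'p3']
```

A page that only one modality surfaces can still win overall, which is exactly how complementary failure patterns translate into a fused ranking that beats either modality alone.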
Efficiency‑performance trade‑offs are explored with the MUVERA encoder, where the ef search parameter controls how many candidates the index examines per query, trading latency against recall. Lower ef values dramatically reduce latency with only modest drops in recall, highlighting practical considerations for real‑time services.
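MUVERA's appeal for such services is that it collapses a multi-vector (per-token) representation into a single fixed-dimensional vector, so a standard approximate-nearest-neighbor index can serve it with one dot product per document. A heavily simplified sketch of that fixed-dimensional encoding, using SimHash-style bucketing (the real method adds repetitions, projections, and empty-bucket filling; all names below are illustrative):

```python
import numpy as np

def fde(token_vecs, hyperplanes, is_query):
    """Map a set of token vectors to one fixed-dimensional vector.

    Each token is assigned to a bucket by the signs of its dot products
    with random hyperplanes; bucket vectors are summed (for queries) or
    averaged (for documents), then concatenated.
    """
    n_buckets = 2 ** hyperplanes.shape[0]
    dim = token_vecs.shape[1]
    out = np.zeros((n_buckets, dim))
    counts = np.zeros(n_buckets)
    for v in token_vecs:
        bits = (hyperplanes @ v > 0).astype(int)
        b = int("".join(map(str, bits)), 2)
        out[b] += v
        counts[b] += 1
    if not is_query:  # documents keep bucket centroids
        nz = counts > 0
        out[nz] /= counts[nz, None]
    return out.ravel()

rng = np.random.default_rng(0)
planes = rng.normal(size=(3, 8))              # 3 hyperplanes -> 8 buckets
q = fde(rng.normal(size=(5, 8)), planes, is_query=True)    # short query
d = fde(rng.normal(size=(40, 8)), planes, is_query=False)  # full page
score = float(q @ d)  # one dot product approximates the MaxSim similarity
```

Tokens that land in the same bucket are likely to be close in embedding space, so the single dot product approximates the expensive token-to-token MaxSim comparison that multi-vector models otherwise require.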
Closed‑source models are also benchmarked. Cohere Embed v4 image embeddings attain Recall@1 = 58 %, Recall@5 = 87 %, and Recall@20 = 97 %, outperforming the best open‑source image model by about 9 % absolute on Recall@1 and also beating the top text model (Voyage 3 Large, Recall@1 = 52 %). This underscores the current gap between commercial multimodal embeddings and publicly available alternatives.
For QA, Retrieval‑Augmented Generation (RAG) pipelines are built for both modalities. Using an LLM‑as‑Judge to assess semantic equivalence, text‑based RAG achieves a ground‑truth alignment score of 0.82, whereas image‑based RAG scores 0.71. Retrieval depth matters: increasing k from 1 to 5 improves both modalities substantially, and retrieving five documents outperforms an oracle that supplies only the single gold document. This indicates that scientific answers often require synthesis across multiple related pages.
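The alignment metric can be sketched as a loop over (question, gold, predicted) triples, with a judge model deciding semantic equivalence. The prompt and the judge callable below are placeholders, not the paper's actual setup:

```python
JUDGE_PROMPT = (
    "Question: {q}\nReference answer: {gold}\nCandidate answer: {pred}\n"
    "Does the candidate convey the same answer as the reference? Reply YES or NO."
)

def alignment_score(examples, judge):
    """Fraction of answers the judge deems semantically equivalent.

    examples: list of (question, gold_answer, predicted_answer) tuples.
    judge: callable taking a prompt string and returning "YES" or "NO"
           (in practice an LLM call; stubbed below for illustration).
    """
    hits = 0
    for q, gold, pred in examples:
        verdict = judge(JUDGE_PROMPT.format(q=q, gold=gold, pred=pred))
        hits += verdict.strip().upper().startswith("YES")
    return hits / len(examples)

# Stub judge: succeeds only when the candidate matches the reference exactly.
def stub_judge(prompt):
    return "YES" if "Candidate answer: 42" in prompt and "Reference answer: 42" in prompt else "NO"

examples = [("q", "42", "42"), ("q", "42", "41")]
print(alignment_score(examples, stub_judge))  # 0.5
```

The point of an LLM judge rather than exact match is that paraphrases of the gold answer should score as correct; the stub here is only to make the loop runnable.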
Error analysis reveals four visual element categories—architectural diagrams, tables, charts, and abstract concepts—where image representations are essential, while long narrative passages favor text. The complementary strengths suggest future systems should dynamically select or combine modalities based on question type.
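The routing idea this analysis suggests could be sketched as a classifier over question text. The keyword heuristic below is purely illustrative; a realistic system would use a learned classifier:

```python
# Cues hinting that the answer lives in a visual element of the page.
VISUAL_CUES = ("figure", "diagram", "architecture", "table", "chart", "plot", "curve")

def pick_modality(question):
    """Route a question to image retrieval when it references visual
    elements, and to text retrieval otherwise."""
    q = question.lower()
    if any(cue in q for cue in VISUAL_CUES):
        return "image"
    return "text"

print(pick_modality("What does the architecture diagram show for the encoder?"))  # image
print(pick_modality("Which dataset was used for pretraining?"))                   # text
```

Falling back to multimodal hybrid search when the signal is ambiguous would be a natural extension, given that hybrid fusion is the strongest overall retriever in the benchmark.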
The paper concludes that IRPAPERS provides a reproducible, low‑overhead benchmark for visual document processing, demonstrates that multimodal hybrid search currently yields the highest retrieval quality, and highlights the performance gap between open‑source and closed‑source embeddings. Future work is suggested on cheaper, high‑quality OCR, end‑to‑end multimodal models, and adaptive modality selection for query decomposition.