CMRAG: Co-modality-based visual document retrieval and question answering

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Retrieval-Augmented Generation (RAG) has become a core paradigm for document question answering. However, existing methods have limitations when dealing with multimodal documents: one category relies on layout analysis and text extraction, which can only utilize explicit text information and struggles to capture images or unstructured content; the other treats document page images as visual input and passes them directly to visual language models (VLMs), yet ignores the semantic advantages of text, leading to suboptimal retrieval and generation results. To address these research gaps, we propose the Co-Modality-based RAG (CMRAG) framework, which can simultaneously leverage text and images for more accurate retrieval and generation. Our framework includes two key components: (1) a Unified Encoding Model (UEM) that projects queries, parsed text, and images into a shared embedding space via triplet-based training, and (2) a Unified Co-Modality-informed Retrieval (UCMR) method that statistically normalizes similarity scores to effectively fuse cross-modal signals. To support research in this direction, we further construct and release a large-scale triplet dataset of (query, text, image) examples. Experiments demonstrate that our proposed framework consistently outperforms single-modality-based RAG on multiple visual document question-answering (VDQA) benchmarks. The findings show that integrating co-modality information into the RAG framework in a unified manner is an effective approach to improving the performance of complex VDQA systems.


💡 Research Summary

The paper introduces CMRAG (Co‑Modality‑based Retrieval‑Augmented Generation), a novel framework designed to improve visual document question answering (VDQA) by jointly exploiting textual and visual information. Traditional RAG approaches for documents fall into two categories: (1) text‑only pipelines that rely on layout analysis and OCR to extract plain text, which ignore images, tables, and other non‑textual cues; and (2) vision‑only pipelines that treat whole document pages as images and feed them directly to visual language models (VLMs), thereby discarding the rich semantic content of the extracted text. Both suffer from sub‑optimal retrieval and generation performance.

CMRAG addresses these gaps with two core components. First, the Unified Encoding Model (UEM) projects queries, parsed text, and page images into a shared embedding space. UEM builds on the SigLIP backbone, re‑using the pretrained query encoder (E_q) and image encoder (E_I) while initializing a text encoder (E_T) as a length‑extended copy of E_q. During training, only E_T is updated; E_q and E_I remain frozen to preserve their large‑scale multimodal alignment. Alignment is enforced by a Dual‑Sigmoid Alignment (DSA) loss, a pairwise sigmoid‑based contrastive objective applied separately to query‑text and query‑image pairs, with a weighting hyper‑parameter λ to balance the two modalities.
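The DSA objective described above can be sketched as a SigLIP-style pairwise sigmoid contrastive loss applied separately to query-text and query-image pairs, weighted by λ. This is an illustrative reconstruction, not the authors' code; the function names, and the temperature and bias values, are assumptions on my part.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_sigmoid_loss(q, k, temperature=10.0, bias=-10.0):
    """SigLIP-style pairwise sigmoid contrastive loss.

    q, k: (B, D) L2-normalized embeddings; matched pairs sit on the
    diagonal of the similarity matrix, all other pairs are negatives.
    temperature/bias are illustrative values, not from the paper.
    """
    logits = temperature * q @ k.T + bias           # (B, B) similarity logits
    labels = 2.0 * np.eye(len(q)) - 1.0             # +1 on diagonal, -1 off it
    return -np.mean(np.log(sigmoid(labels * logits)))

def dsa_loss(q, t, i, lam=0.5):
    """Dual-Sigmoid Alignment: lambda-weighted sum of the query-text
    and query-image pairwise sigmoid terms."""
    return lam * pairwise_sigmoid_loss(q, t) + (1.0 - lam) * pairwise_sigmoid_loss(q, i)
```

In an actual training loop, only the gradients flowing into the text encoder E_T would be applied, since E_q and E_I are kept frozen.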

Second, the Unified Co‑Modality‑informed Retrieval (UCMR) module fuses the similarity scores from the two modalities in a statistically principled way. For a given query embedding q, inner products with each page’s image embedding I_i and text embedding T_i yield raw scores z_Ii and z_Ti. Because these scores have different scales and distributions, a naïve linear combination (α·z_Ti + (1‑α)·z_Ii) is inadequate. UCMR first applies a sigmoid to map each raw score into the (0, 1) interval, placing the two modalities on a comparable scale before the normalized scores are fused.
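The fusion step can be sketched as follows. Since the summary does not give the paper's full normalization recipe, this is a minimal illustrative version: sigmoid-squash each modality's raw inner-product scores, then take a convex combination with weight α (the function names and default α are my own assumptions).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ucmr_scores(q, T, I, alpha=0.5):
    """Fuse co-modality retrieval scores for one query.

    q: (D,) query embedding; T, I: (N, D) per-page text and image
    embeddings. Raw inner products are squashed to (0, 1) with a
    sigmoid so the two modalities are comparable, then combined.
    """
    z_t = T @ q                       # raw query-text scores
    z_i = I @ q                       # raw query-image scores
    return alpha * sigmoid(z_t) + (1.0 - alpha) * sigmoid(z_i)

def retrieve_top_k(q, T, I, k=3, alpha=0.5):
    """Return indices of the k pages with the highest fused scores."""
    s = ucmr_scores(q, T, I, alpha)
    return np.argsort(-s)[:k]
```

The convex combination keeps the fused score in (0, 1), so α directly expresses how much the retriever trusts the text modality relative to the image modality.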
