Original Info
- ArXiv ID: 2512.20626
- Authors: Unknown

Abstract
Retrieval-augmented generation (RAG) enables large language models (LLMs) to dynamically access external information, which is powerful for answering questions over previously unseen documents. Nonetheless, LLMs still struggle with high-level conceptual understanding and holistic comprehension due to limited context windows, which constrain their ability to perform deep reasoning over long-form, domain-specific content such as full-length books. To address this problem, knowledge graphs (KGs) have been leveraged to provide entity-centric structure and hierarchical summaries, offering more structured support for reasoning. However, existing KG-based RAG solutions remain restricted to text-only inputs and fail to leverage the complementary insights provided by other modalities such as vision. At the same time, reasoning over visual documents requires integrating textual, visual, and spatial cues into structured, hierarchical concepts. To address this issue, we introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process. Experimental results across both global and fine-grained question answering tasks show that our approach consistently outperforms existing RAG-based approaches on both textual and multimodal corpora. Our code is available on GitHub.

Full Content
Multimodal large language models (MLLMs) are constrained by limited context windows, restricting their ability to deeply process long-form, domain-specific content. For example, interpreting a history textbook requires both conceptual insights and localized observations, which remains challenging for MLLMs.
On the other hand, RAG can enhance LLMs by providing on-demand access to external knowledge. Early text-based RAG relied on sparse or dense retrieval but struggled with deep, multi-hop reasoning over multimodal documents. Recently, graph-based RAG has introduced structured abstraction via entity-relation graphs. Models such as GraphRAG (Edge et al., 2024) and LightRAG (Guo et al., 2025) improve long-range knowledge retrieval and scalability through KG-assisted retrieval pipelines. These methods excel at text-based multi-hop reasoning but remain constrained in handling complex, multimodal content. Current graph-based RAG methods face two key limitations. First, existing approaches remain unimodal, overlooking visual cues such as diagrams, charts, or maps, yielding disjointed representations that hinder multimodal reasoning. Additionally, due to context window constraints, most approaches segment documents into independent chunks, extracting entities separately rather than sequentially. This leads to fragmented KGs that miss cross-chunk relationships and key entities.
To our knowledge, while recent studies have explored manually constructed multimodal knowledge graphs (KGs) for RAG-based question answering (Lee et al., 2024), automatically building such KGs for RAG-assisted reasoning remains underexplored. To address this gap, we introduce MegaRAG, a multimodal, graph-based RAG method that enhances cross-modal reasoning.
To better handle the association of different modalities in visual documents, more relations beyond text-to-text need to be extracted, such as text-to-figure and figure-to-figure relations. Although the parallel-reading-then-combining strategy can refine entities and relations, as in GraphRAG (Edge et al., 2024) and LightRAG (Guo et al., 2025), such refinement still relies on a single chunk while overlooking global document information. To address this limitation, we design a page-based, two-round approach for KG construction. Our solution initiates a KG by extracting entity-relation pairs in parallel for every page of a document using existing MLLMs, and the page-based relations are joined to form an initial graph. As the initial KG may not capture the inter-relationships between text and visual elements sufficiently well, we conduct a refinement process in a subsequent stage, where the initial KG serves as global guidance to capture subtle relationships often lost in naïve, isolated extraction. In particular, to maintain scalability while incorporating long-range dependencies, we avoid injecting the entire initial KG into the MLLM inputs. Instead, we retrieve only a subgraph of the entire KG for each page, yielding a lightweight yet context-aware input. This strategy enables progressive improvement of the graph’s structural coherence, semantic coverage, and cross-modal grounding.
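To make the two-round construction concrete, the sketch below shows how the pipeline could be orchestrated. The four callables (per-page extraction, refinement, subgraph retrieval, and merging) are hypothetical stand-ins for the MLLM prompting and graph utilities described in this paper, not its actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def build_mmkg(pages, extract_page_kg, refine_page_kg, retrieve_subgraph, merge_graphs):
    """Two-round, page-based MMKG construction (a sketch under the assumptions above)."""
    # Round 1: extract entity-relation pairs for every page in parallel.
    with ThreadPoolExecutor() as pool:
        page_kgs = list(pool.map(extract_page_kg, pages))
    initial_graph = merge_graphs(page_kgs)                 # initial graph G^0

    # Round 2: refine each page independently, guided by a lightweight,
    # page-specific subgraph of G^0 instead of the whole graph.
    def refine(page_and_kg):
        page, page_kg = page_and_kg
        subgraph = retrieve_subgraph(initial_graph, page_kg)
        return refine_page_kg(page, subgraph)              # round-1 output plus additions

    with ThreadPoolExecutor() as pool:
        refined = list(pool.map(refine, zip(pages, page_kgs)))
    return merge_graphs(refined)                           # enriched graph G^1
```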
We validate MegaRAG across global (book-level) and local (page/slide-level) QA benchmarks, spanning both text-only and multimodal datasets. Experimental results demonstrate that MegaRAG consistently outperforms strong baselines, particularly in scenarios requiring deep cross-modal integration and structured abstraction. Our contributions are summarized as follows.
• We introduce MegaRAG, an easy-to-use system that automatically constructs multimodal KGs for visual document question answering with MLLMs.
• We develop a novel refinement process that enhances cross-modal grounding while addressing limitations in independent KG construction.
• We demonstrate that MegaRAG outperforms strong baselines, including GraphRAG and LightRAG, on both global and local QA tasks.
We briefly review several major directions of RAG, including retrieving information directly from raw data sources such as documents and images, and integrating structured knowledge through KGs.
RAG with Raw Data Sources. Early RAG methods (Guu et al., 2020; Lewis et al., 2020) retrieve text chunks from corpora to support answer generation, relying primarily on either sparse or dense retrieval strategies. Sparse methods, exemplified by TF-IDF (Salton et al., 1975) and BM25 (Robertson and Zaragoza, 2009), depend on lexical heuristics to match queries with relevant text segments. They offer computational efficiency but lack deeper semantic comprehension. Dense techniques (Karpukhin et al., 2020; Khattab and Zaharia, 2020; Santhanam et al., 2022) project queries and documents into a shared embedding space, significantly improving retrieval robustness to lexical variations. Subsequent works have recently enhanced this pipeline using LLMs: HyDE (Gao et al., 2023) generates a hypothetical answer to enrich the retrieval query, Self-RAG (Asai et al., 2024) introduces reflection tokens to enable adaptive retrieval and self-critique within a single LLM, and RQ-RAG (Chan et al., 2024) decomposes the query into sub-queries to improve context coverage. Despite their strong performance on text-based RAG tasks, these methods often struggle with multimodal documents involving complex text, layouts, and visual elements.
Multimodal RAG (MMRAG). To tackle these limitations, more recent studies have focused on multimodal retrieval methods that better retain the structural information of documents. DSE (Ma et al., 2024) treats document screenshots as unified inputs and directly encodes their visual layout, text, and images into a single vector embedding. ColPali (Faysse et al., 2025) continues this direction by encoding document images into multi-vector embeddings, effectively capturing fine-grained visual cues. Its variant, ColQwen, replaces the PaLI-Gemma backbone (Beyer et al., 2024) with Qwen2-VL (Wang et al., 2024b) and achieves improved retrieval performance. Moving beyond retrieval, VisRAG (Yu et al., 2025) integrates MLLMs into the full RAG pipeline: instead of extracting text, it embeds document images directly for retrieval and incorporates them into the generation stage, allowing the model to jointly reason over visual and textual content.
The above methods excel in text-to-image retrieval but fail to solve tasks involving a mixture of single-modality (e.g., text-to-text), cross-modality (e.g., text-to-image), and fused-modality (text+image-to-text+image) retrieval. GME (Zhang et al., 2025) tackles this by introducing a unified embedding model that encodes diverse modality combinations and enables flexible retrieval within a shared representation space.
While these approaches significantly enhance document understanding, they neglect the long-range, corpus-level structure that is essential for handling complex, multi-hop QA (Tanaka et al., 2023; Yang et al., 2018).
RAG with Knowledge Graph. Knowledge-augmented generation (Procko and Ochoa, 2024) leverages KGs to provide structured, factual context for LLMs. Within this line of research, SubgraphRAG (Li et al., 2025) enhances efficiency through lightweight scoring mechanisms for subgraph retrieval, while G-Retriever (He et al., 2024) frames subgraph selection as a Steiner Tree optimization problem to support large-scale textual graphs. Gao et al. (Gao et al., 2022) employ a learning-to-rank approach to improve retrieval from KGs. While these methods advance graph-based retrieval, they depend on manually constructed KGs, which are costly to build and require substantial domain expertise. Moreover, static KGs are inherently limited in addressing queries that require corpus-level reasoning beyond fixed graph structures.
To address this limitation, GraphRAG (Edge et al., 2024) proposes building KGs directly from raw text using LLMs, followed by a hierarchical community detection algorithm (Traag et al., 2019) to cluster semantically related nodes. During inference, it prompts the LLM to generate intermediate answers for each community summary, scores them by confidence, and aggregates the top responses into a final answer. Although this enables corpus-level reasoning, it incurs high computational cost due to repeated LLM queries over many community summaries. To improve efficiency, LightRAG (Guo et al., 2025) introduces a two-stage retrieval process: it first extracts local and global keywords from the query, then retrieves relevant nodes and their surrounding subgraphs using dense retrieval. This design reduces the need for repeated LLM inference and significantly improves scalability. TOG-2 (Ma et al., 2025) introduces a hybrid RAG method that alternates between dense retrieval and graph reasoning, but it relies on a manually curated KG, which is costly to construct and limited in coverage.
However, these KG-augmented RAG methods rely solely on textual KGs, limiting their ability to handle multimodal content such as images. To overcome this limitation, multimodal knowledge graphs (MMKGs) (Liu et al., 2019; Zhang et al., 2023) enrich KGs by associating entities with aligned visual (e.g., images), numeric (e.g., dates, measurements), and textual descriptions. A representative benchmark (Liu et al., 2019) introduces MMKGs constructed by linking overlapping entities via sameAs relations and annotating them with web-crawled images and numeric literals. MMKGs have demonstrated utility across tasks, including KG completion (Mousselly-Sergieh et al., 2018; Xie et al., 2017), recommendation systems (Sun et al., 2020), and image captioning (Zhao and Wu, 2023).
More recently, MMKGs have been integrated into RAG pipelines to support multimodal QA with LLMs. For instance, Lee et al. (Lee et al., 2024) leverage a manually constructed MMKG for RAG-based question answering.
In this section, we present MegaRAG, covering the iterative construction process of MMKG, graph indexing and retrieval mechanisms, and the answer generation pipeline.
We define our MMKG as G = (V, E), where V is the set of nodes representing entities and E is the set of edges denoting relations between entities. Given a document consisting of N pages, we extract four types of content from each page i: text content T_i, figure images F_i, table images B_i, and the full-page rendered image I_i (which captures the layout of the page). These elements are obtained using an off-the-shelf document analysis tool. We define the input for page i as P_i = {T_i, F_i, B_i, I_i}, which serves as the input to our graph construction pipeline.
Initial Graph Construction. As illustrated in Figure 1(a), the initial stage involves extracting entities and relations from each page in parallel using a graph generation function G(·), which leverages an MLLM guided by a task-specific prompt. The prompt specifies the extraction goals, provides reasoning instructions, and enforces a constrained output format to ensure consistency across pages. In our implementation, GPT-4o-mini serves as the MLLM for the MMKG construction.
Given a multimodal input P_i, the graph generation function produces a set of page-level entities and relations (E, R)^0_i = G(P_i), extracted from both textual and visual content. The MLLM is guided to identify multiple entities within the text and to treat each figure or table as a single, standalone entity. For instance, a bar chart titled “Monthly Website Visitors” may be recognized as an entity and connected to surrounding text discussing user engagement trends. Decorative or non-informative visuals, such as background patterns or logos, are ignored. The full-page image I_i is used solely to support spatial reasoning and does not generate entity nodes. Each extracted entity includes a name, a predefined type (e.g., person, organization), and a description. Relations are defined by a source and target entity, a description, and a set of representative keywords.
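The following sketch illustrates the per-page input P_i and the entity/relation records described above; the field and type names are illustrative assumptions rather than the paper's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class PageInput:                      # P_i = {T_i, F_i, B_i, I_i}
    text: str                                      # T_i: text content
    figures: list = field(default_factory=list)    # F_i: figure images
    tables: list = field(default_factory=list)     # B_i: table images
    page_image: Optional[bytes] = None             # I_i: full-page render (layout only)

@dataclass
class Entity:
    name: str
    type: str            # predefined type, e.g. "person", "organization"
    description: str

@dataclass
class Relation:
    source: str
    target: str
    description: str
    keywords: list = field(default_factory=list)

# G(.): an MLLM-backed extraction function mapping one page to (E, R)^0_i.
GraphGenerator = Callable[[PageInput], tuple]
```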
After generating the set of page-level entities and relations (denoted as {(E, R)^0_i}_{i=1}^{N}), we merge them into a unified MMKG G^0. This involves consolidating entity nodes with the same name and merging relation edges with matching source, target, and relation types. During this process, different descriptions associated with the same entity or relation are aggregated to form a richer, more comprehensive representation. Similarly, keywords from multiple occurrences are accumulated.
Graph Refinement and Enrichment. The initial MMKG G^0 is often incomplete, as many cross-modal entities and relationships may be overlooked during the first-pass extraction. To bridge these gaps, we introduce a refinement stage that produces an enriched graph G^1, leveraging both the original multimodal inputs and the preliminary knowledge encoded in G^0. The process is illustrated in Figure 1(b).
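A minimal merging sketch, reusing the Entity/Relation records from the previous snippet; the dictionary-based graph layout is an assumption, but it follows the consolidation rules described above (same-name nodes merged, matching edges merged, descriptions and keywords aggregated).

```python
def merge_graphs(page_outputs):
    """Merge page-level (entities, relations) pairs into one graph: nodes with
    the same name are consolidated, edges with the same endpoints are merged,
    and their descriptions/keywords are aggregated."""
    nodes, edges = {}, {}
    for entities, relations in page_outputs:
        for e in entities:
            node = nodes.setdefault(e.name, {"type": e.type, "descriptions": []})
            node["descriptions"].append(e.description)
        for r in relations:
            edge = edges.setdefault((r.source, r.target),
                                    {"descriptions": [], "keywords": set()})
            edge["descriptions"].append(r.description)
            edge["keywords"].update(r.keywords)
    return {"nodes": nodes, "edges": edges}
```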
To efficiently refine the MMKG under the MLLM’s limited context window, we focus on constructing lightweight, page-specific subgraphs rather than processing the entire graph. For each page i, we extract a context-specific subgraph G^0_i from G^0. In practice, we reuse entity names and relation keywords from the previously extracted page-level output (E, R)^0_i to retrieve relevant content in G^0, reducing redundancy and simplifying subgraph construction. These entity names and relation keywords are encoded into semantic embeddings and efficiently matched against dense vector representations of entities and relations built from the initial MMKG. To enrich the local context, the selected nodes and edges are further expanded by including their one-hop neighbors, resulting in a compact yet informative subgraph. A detailed explanation of this graph indexing and retrieval process is provided in Section 3.2.
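A sketch of this page-specific subgraph retrieval, assuming precomputed dense embeddings for the initial graph's entities and relations; the `embed` function, the `graph` dictionary layout, and the plain cosine search are illustrative stand-ins for the vector-store machinery described in Section 3.2.

```python
import numpy as np

def cosine_top_k(query_vecs, index_vecs, k):
    """Indices of the top-k index vectors by cosine similarity to any query
    vector (a stand-in for querying a dense vector store)."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    x = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    best = (q @ x.T).max(axis=0)          # best score per indexed item
    return np.argsort(-best)[:k]

def retrieve_page_subgraph(page_entity_names, page_rel_keywords, embed, graph, k=120):
    """Build the page-specific subgraph G^0_i from G^0: match the page's
    entity names and relation keywords against dense embeddings of the
    initial graph, then expand the selected entities by one hop."""
    queries = embed(page_entity_names + page_rel_keywords)        # (Q, d) array
    ent_idx = cosine_top_k(queries, graph["entity_embeddings"], k)
    rel_idx = cosine_top_k(queries, graph["relation_embeddings"], k)

    selected = {graph["entity_names"][i] for i in ent_idx}
    seed = set(selected)
    for src, dst in graph["edges"]:                # one-hop neighbor expansion
        if src in seed or dst in seed:
            selected.update((src, dst))
    return selected, [graph["relations"][i] for i in rel_idx]
```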
The refinement process is formalized as (E, R)^1_i = G_ref(P_i, G^0_i), where G_ref(·) is a refinement function that reuses the same MLLM from the initial stage, now guided by a KG-specific refinement prompt. Since the pages remain independent when extracting entities and relations with the help of the subgraph, the benefit of parallelism is maintained for efficient graph construction. This function identifies missing knowledge in page P_i by examining the retrieved subgraph G^0_i. Specifically, it detects entities mentioned in the input that are not yet present in the subgraph, as well as implicit relations between entities that are suggested by the content but missing from G^0_i. For example, consider a page where the text states “Electric vehicle sales increased significantly in 2023,” and a nearby figure titled “Annual Sales by Vehicle Type” presents a bar chart with a prominent “EV” bar (denoting Electric Vehicles). In the initial extraction, the text and the figure may be treated as independent entities. During refinement, the MLLM infers that the figure visually supports the textual claim and adds a relation such as illustrates or supports between the textual entity “Electric vehicle sales in 2023” and the visual entity “Annual Sales by Vehicle Type.”
These newly identified entities and relations are added to the refined set (E, R)^1_i. The updated page-level outputs {(E, R)^1_i}_{i=1}^{N} are then merged to form the enriched MMKG G^1. Although we perform only a single refinement step, the process can be applied iteratively to further improve graph completeness. To balance effectiveness and efficiency, we adopt one round of refinement and provide the full prompt formats used for both the initial construction and refinement. More details can be found in Appendix B.
We adopt a unified retrieval framework that integrates graph structure, represented by entities and relations, along with page images within a shared embedding space to enable seamless cross-modal retrieval. Specifically, we use GME (Zhang et al., 2025), a multimodal encoder that jointly embeds textual and visual inputs. GME aligns all content types, including both textual and visual information, into a common vector space, supporting text-to-text and text-to-image retrieval through a unified representation.
Indexing. Our indexing process encompasses three content types, as illustrated in Figure 1(c): document page images, entities, and relations. Page images are directly encoded using GME without additional preprocessing. For each entity, we concatenate its name with its textual description to form a descriptive sentence, which is then embedded using GME. Relation embeddings are constructed similarly, by combining relation keywords, the names of the source and target entities, and a textual description. All embeddings are stored in separate dense vector stores by type.
Graph Retrieval. To retrieve relevant knowledge, we adopt a dual-level retrieval strategy (Guo et al., 2025) that targets both entities and relations. Given a user query, we first prompt the MLLM to extract two types of keywords: low-level keywords corresponding to specific entities, and high-level keywords that capture broader concepts. These keywords are then embedded using the same GME model adopted during indexing. Both low-level and high-level keywords are combined into a single keyword list and used to query the entity vector store, retrieving the top-k most relevant entities. In parallel, the top-k most relevant relations, along with their associated source and target entities, are retrieved from the relation store. To further enrich the context, each retrieved entity is expanded by incorporating its one-hop neighbors from G^1. The final set of entities and relations serves as input to the downstream reasoning module.
Page Retrieval. Complementary to graph retrieval, we also perform text-to-page (image) retrieval to capture fine-grained visual and layout cues that may be missed by symbolic representations alone. Given the same input query, we retrieve the top-m relevant document pages by comparing text and image embeddings within the shared vector space.
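The retrieval flow could look roughly like the sketch below; the vector-store interface (`.search`), the keyword-extraction call, and the networkx-style `graph.neighbors` are assumed interfaces, with k = 60 and m = 6 matching the settings reported in the experimental setup.

```python
import networkx as nx

def retrieve_context(query, extract_keywords, embed, stores, graph: nx.Graph, k=60, m=6):
    """Dual-level graph retrieval plus text-to-page retrieval (a sketch).
    `stores` maps {"entities", "relations", "pages"} to dense vector stores
    exposing a .search(query_vectors, top_k) method (assumed interface);
    `extract_keywords` wraps the MLLM keyword-extraction prompt."""
    low_level, high_level = extract_keywords(query)   # specific vs. broad keywords
    keyword_vecs = embed(low_level + high_level)

    entities = stores["entities"].search(keyword_vecs, top_k=k)
    relations = stores["relations"].search(keyword_vecs, top_k=k)  # with endpoints attached

    # Expand each retrieved entity with its one-hop neighbors from G^1.
    expanded = set(entities)
    for e in entities:
        expanded.update(graph.neighbors(e))

    # Page retrieval: compare the query embedding with page-image embeddings
    # in the shared GME embedding space.
    pages = stores["pages"].search(embed([query]), top_k=m)
    return expanded, relations, pages
```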
Combining visual content and the MMKG in a single MLLM prompt can lead to modality bias: the model often disproportionately focuses on one modality, typically text, while underutilizing the other. To address this issue, we propose a two-stage answer generation approach that decouples the processing of textual and visual inputs. Given the retrieved subgraph and the relevant page images, the model first generates two intermediate responses in parallel: one based on the symbolic knowledge graph, and the other on the visual content. In the second stage, the MLLM synthesizes a final answer by integrating both intermediate outputs. Full prompt formats for each generation stage are provided in Appendix B.
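A minimal sketch of this two-stage generation, assuming a generic `mllm_call(prompt, images=None)` chat wrapper; the prompt wording here is illustrative, with the actual prompts given in Appendix B.

```python
from concurrent.futures import ThreadPoolExecutor

def two_stage_answer(query, kg_context, page_images, mllm_call):
    """Stage 1: answer the query from the knowledge graph and from the page
    images in parallel; Stage 2: synthesize a final answer from both."""
    kg_prompt = f"Answer using only this knowledge graph context:\n{kg_context}\n\nQuery: {query}"
    vis_prompt = f"Answer using only the attached document images.\n\nQuery: {query}"
    with ThreadPoolExecutor(max_workers=2) as pool:
        kg_future = pool.submit(mllm_call, kg_prompt)
        vis_future = pool.submit(mllm_call, vis_prompt, images=page_images)
        kg_answer, vis_answer = kg_future.result(), vis_future.result()

    # Stage 2: integrate the two intermediate answers into one response.
    synth_prompt = ("Integrate the two answers below into a single comprehensive response.\n"
                    f"Answer from the knowledge graph:\n{kg_answer}\n\n"
                    f"Answer from the document images:\n{vis_answer}\n\nQuery: {query}")
    return mllm_call(synth_prompt)
```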
In this section, we outline the experimental setups and present the results for our MegaRAG method.
Global QA. As these datasets lack manually labeled global questions, we adopt the question generation strategy from GraphRAG (Edge et al., 2024) and LightRAG (Guo et al., 2025). For each dataset, we use the document outline as input and prompt an LLM to create five synthetic RAG users, each with a profile describing their background and information needs. Each user is assigned five tasks representing distinct information-seeking goals, and each task is used to generate five questions that require a comprehensive understanding of the full document. This process yields 125 global questions per dataset.
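A sketch of this generation loop (5 users x 5 tasks x 5 questions = 125); `llm_call` is assumed to return parsed lists of strings, and the prompt wording is a placeholder for the actual prompt shown in the appendix.

```python
def generate_global_questions(outline, llm_call, n_users=5, n_tasks=5, n_questions=5):
    """Synthesize global questions from a document outline."""
    users = llm_call(f"Given this document outline, describe {n_users} potential "
                     f"users of the document:\n{outline}")
    questions = []
    for user in users:
        tasks = llm_call(f"List {n_tasks} tasks this user would perform: {user}")
        for task in tasks:
            questions.extend(llm_call(
                f"For user '{user}' and task '{task}', generate {n_questions} questions "
                f"that require a high-level understanding of the entire document."))
    return questions   # 125 questions with the default settings
```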
Local QA. To evaluate local (slide- or page-level) QA, we use two benchmarks: SlideVQA (Tanaka et al., 2023) and RealMMBench (Wasserman et al., 2025). SlideVQA includes over 52,000 slides and 14,500 questions covering complex reasoning and numerical understanding, but its scale makes full evaluation computationally expensive. Instead, we construct a subset of 2,000 slides, referred to as SlideVQA (2k). RealMMBench assesses retrieval in multimodal RAG settings using visually rich, table-heavy, and rephrased queries. RealMMBench consists of four sub-datasets: FinReport (2,687 pages), FinSlides (2,280 pages), TechReport (1,674 pages), and TechSlides (1,963 pages). Additional details are provided in Appendix A.
As our approach is the first to automatically build multimodal KGs for MMRAG-based question answering, we compare it with several widely adopted RAG baselines, including the raw-source-based NaiveRAG, as well as the KG-aided methods GraphRAG (Edge et al., 2024) and LightRAG (Guo et al., 2025), which represent recent advances in graph-based RAG. Details of these baselines are provided in Appendix C. For fairness, we compare our method against them not only on the multimodal benchmarks but also on the purely textual benchmark.
Global QA. In the absence of ground truth answers for global (book-level) questions, we follow the LLM-based evaluation strategy from GraphRAG (Edge et al., 2024) and LightRAG (Guo et al., 2025). Model responses are assessed along four qualitative dimensions: Comprehensiveness, Diversity, Empowerment, and Overall, as defined in prior work (Guo et al., 2025). Each response is compared against a baseline in a pairwise setup, with win rates (including ties) reported. Comprehensiveness measures how well the answer covers all aspects of the question; Diversity captures the richness and variety of perspectives; Empowerment reflects how effectively the answer informs and supports user understanding; Overall provides an aggregate score across the three preceding criteria.
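Aggregating the pairwise judgments into win rates could be done as below; counting ties toward the win rate is one reading of "win rates (including ties)" and is an assumption of this sketch.

```python
from collections import Counter

def win_rates(judgments):
    """Per-dimension win rates for MegaRAG vs. a baseline. `judgments` is a
    list of dicts mapping each dimension to 'win', 'tie', or 'loss'."""
    dims = ["Comprehensiveness", "Diversity", "Empowerment", "Overall"]
    counts = {d: Counter(j[d] for j in judgments) for d in dims}
    n = max(len(judgments), 1)
    return {d: 100.0 * (counts[d]["win"] + counts[d]["tie"]) / n for d in dims}
```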
Local QA. For local (slide- or page-level) QA, we evaluate performance by comparing the generated answers against ground truth answers. Specifically, an LLM is used to judge whether the generated answer aligns semantically with the reference answer.
Accuracy is then computed based on the proportion of correct matches. Further details regarding the evaluation dimensions and procedures are provided in Appendix C.
To ensure consistency across all RAG methods, we standardize the LLM/MLLM implementation. Response generation and global question generation use GPT-4o-mini, while evaluation uses GPT-4.1-mini for greater robustness. All methods, including NaiveRAG, GraphRAG, and LightRAG, use OpenAI’s text-embedding-3-small model for textual embeddings. Textual documents are segmented into 1,200-token chunks with a 100-token overlap.
We follow GraphRAG and LightRAG by setting their gleaning parameter to 1. The generation temperature is fixed at 0 across all tasks to reduce output variance. For multimodal documents, we use the MinerU toolkit (Wang et al., 2024a) to extract text, figures, and tables. MinerU converts PDFs into machine-readable formats while preserving layout and symbols, making it especially effective for processing scientific and technical documents. In MegaRAG, multimodal embeddings are encoded using GME-Qwen2-VL-2B (Zhang et al., 2025), which is designed to support a unified embedding space across single-, cross-, and fused-modality retrieval tasks. This allows MegaRAG to flexibly retrieve diverse input types within a consistent representation space. During retrieval, we set the top-k value to k = 60 for the graph retrieval steps, following the dual-level retrieval strategy, and set the top-m value to m = 6 for the page retrieval described in Section 3.2. For baselines without multimodal support, we retain only the extracted text and process it using the same pipeline as for textual documents. To mitigate inconsistencies, we standardize response prompts across all baselines, so output quality differences stem from model capabilities rather than prompt variations.
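For reference, the settings reported in this section can be gathered into a single configuration block (key names are illustrative, not from the paper's code).

```python
# Consolidated experimental settings from this section.
CONFIG = {
    "generation_model": "gpt-4o-mini",           # responses and global question generation
    "evaluation_model": "gpt-4.1-mini",          # LLM-as-judge
    "text_embedding_model": "text-embedding-3-small",
    "chunk_size_tokens": 1200,
    "chunk_overlap_tokens": 100,
    "gleaning": 1,                               # GraphRAG / LightRAG baselines
    "temperature": 0,
    "document_parser": "MinerU",
    "multimodal_embedder": "GME-Qwen2-VL-2B",
    "graph_retrieval_top_k": 60,
    "page_retrieval_top_m": 6,
}
```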
Textual Global QA. Table 1 shows the results on the UltraDomain benchmark consisting of purely textual documents. As can be seen, across all domains and evaluation dimensions, MegaRAG consistently outperforms the baselines, achieving average win rates of 59.0% for Comprehensiveness, 71.4% for Diversity, 74.8% for Empowerment, and 71.8% Overall.
A key contributor to this performance is MegaRAG’s graph refinement process. Unlike GraphRAG and LightRAG, which employ per-page gleaning, a form of local subgraph refinement, MegaRAG does not use gleaning; instead, it constructs and refines a global knowledge graph that captures broader contextual relationships across documents. This approach enhances the expressiveness and coverage of the graph, leading to superior performance.
Multimodal Global QA. A main characteristic of our method is its ability to build MMKGs for RAG. In this experiment, we evaluate MegaRAG on global QA tasks over multimodal documents. As shown in Table 2, MegaRAG outperforms all baselines on four visually rich datasets: World History, Environmental Report, DLCV, and GenAI. It achieves average win rates of 83.3% for Comprehensiveness, 92.7% for Diversity, 84.7% for Empowerment, and 89.5% Overall. The advantage is particularly evident on slide-based datasets such as DLCV and GenAI, where much of the core content is visual rather than textual. Compared with NaiveRAG and LightRAG, which rely primarily on text, MegaRAG delivers stronger results across all evaluation dimensions. These gains stem from MegaRAG’s ability to build KGs that jointly encode textual information and visual cues.
Although all baselines in this comparison are text-only models, our ablation study (Section 4.5) further demonstrates that removing the MMKG from MegaRAG leads to a substantial performance drop. Since MegaRAG reduces to an MMRAG approach when its KG components are removed, this suggests that even vision-capable MMRAG retrieval methods would struggle to match MegaRAG without multimodal global knowledge integration.
Multimodal Local QA. Table 3 shows the accuracy results on SlideVQA (2k) and the four RealMMBench subsets. Across all five test sets, MegaRAG performs the most favorably. On SlideVQA (2k), which focuses on fine-grained slide-level reasoning, MegaRAG achieves 64.85% accuracy, more than double the score of the strongest baseline. Similar trends are observed in RealMMBench. On FinSlides and TechSlides, which feature highly visual and table-heavy slide content, MegaRAG achieves 58.37% and 60.86%, outperforming the best baselines by 45 and 29 percentage points, respectively. Even on the more text-heavy FinReport and TechReport subsets, MegaRAG maintains a clear lead with 39.51% and 51.51%, surpassing LightRAG by 8 to 9 percentage points.
To evaluate the contribution of each major component in MegaRAG, we conduct an ablation study by disabling key modules across the three main stages: MMKG construction, retrieval, and answer generation. In the first setting (A1), we remove all visual inputs, such as figures, tables, and page images, from the graph construction stage, relying solely on textual content. In the second setting (A2), we disable the MMKG-based retrieval mechanism and rely solely on the page retrieval. In the third setting (A3), we replace the two-stage generation pipeline with a single-pass generation setup that simultaneously considers both the subgraph and visual input. (A1) Text-only graph construction. Removing visual inputs from the graph construction stage leads to a substantial performance decline across all datasets. Without visual entities and relations, the MMKG lacks critical cross-modal context, which is especially detrimental in visually rich domains such as GenAI. For example, the overall win rate on GenAI drops dramatically from 86.4% to just 0.8%. These results underscore the importance of incorporating visual elements in MMKG. (A2) Disable MMKG retrieval.
Disabling MMKG-based retrieval and relying solely on page retrieval results in the most severe performance degradation. Across all datasets and evaluation dimensions, MegaRAG achieves near-100% win rates when compared to this variant. This clearly demonstrates that structured retrieval over the MMKG is essential for accessing semantically rich and well-connected information, far outperforming page-level retrieval alone. (A3) Remove two-stage answer generation. Replacing the two-stage generation pipeline with a single-pass setup causes moderate but consistent performance drops. Although this variant still benefits from MMKG construction and retrieval, average win rates decline by 14 to 25 percentage points. The largest drops appear in Diversity and Empowerment, suggesting that separating textual and visual reasoning before integration helps generate more nuanced and informative answers.
Among the three components, MMKG-based retrieval (A2) proves to be the most critical; its removal leads to a near-complete collapse in performance. Visual inputs in graph construction (A1) also play an important role, particularly for slide-centric documents, though their absence results in less dramatic losses. The two-stage generation strategy (A3) contributes more subtle but consistent gains, especially in generating diverse and empowering responses. Together, these results highlight the complementary value of all three components, with graph-based retrieval emerging as the core driver of MegaRAG’s effectiveness.
In this paper, we introduced MegaRAG, a novel KG-based RAG method that leverages MLLMs to automatically construct MMKGs. MegaRAG improves MLLMs’ capabilities over complex, long-form documents by combining textual and visual information into a unified graph representation and refining it through iterative updates. MegaRAG needs no fine-tuning and is easy to use. To reduce modality bias, we adopt a two-stage answer generation process that separately reasons over textual and visual evidence before integrating the results, enabling more comprehensive and balanced responses. Through evaluations on both global and local QA tasks across textual and multimodal datasets, MegaRAG consistently outperforms other competitive RAG approaches. Our work highlights a promising new direction for scalable and interpretable multimodal reasoning in RAG systems.
In the Appendix, we present the Datasets, Implementation Details, and Baselines & Evaluations in Appendices A, B, and C, respectively.
We provide an overview of the datasets in our experiments and dataset statistics in Table 5.
The UltraDomain benchmark (Qian et al., 2024) comprises 428 college-level textbooks spanning 18 academic disciplines. For this study, we focus on four representative subsets, including the Agriculture dataset.
SlideVQA (2k). SlideVQA (Tanaka et al., 2023) includes over 52,000 slides and 14,500 questions covering complex reasoning and numerical understanding, but its scale makes full evaluation computationally expensive. Instead, we construct a subset of SlideVQA consisting of 2,000 educational slides, featuring 1,581 figures, 139 tables, and 120,000 tokens.
RealMMBench (Wasserman et al., 2025) is designed to evaluate retrieval performance in realistic multimodal RAG scenarios and contains four subsets. FinReport: this subset includes 19 long-form financial reports.
To generate global questions, we utilize the prompt shown in Figure 5. This prompt guides the MLLM (GPT-4o-mini) to first identify representative user profiles and their associated tasks, then generate questions that require a comprehensive understanding of the dataset.
MegaRAG leverages the General Multimodal Embedder (GME) (Zhang et al., 2025) to encode entities, relations, and page images within a unified embedding space. GME is built upon the Qwen2-VL architecture, an MLLM capable of processing text, images, or combined text-image inputs. It supports a broad range of retrieval tasks, including single-modality retrieval (e.g., text-to-text, image-to-image), cross-modality retrieval (e.g., text-to-image, image-to-text), and fused-modality retrieval (e.g., text with image to text with image). To generate embeddings, GME uses the final hidden state of the last token as the representation of the input. GME’s strength lies in its flexibility and generalization capability, making it well-suited for MegaRAG, which requires seamless integration of both text-to-text and text-to-page (image) retrieval tasks.
GME Encoding Time. In our pipeline, the GME-Qwen2-VL-2B encoder is executed locally to process both text and image inputs. All encoding is performed on a single NVIDIA RTX 3090 GPU with 24GB of VRAM. Due to memory constraints, we limit GME to encoding two page images concurrently, with an average processing time of approximately 0.97 seconds per image.
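The last-token pooling strategy can be illustrated with a small text-only stand-in model; this is not the actual GME checkpoint or its multimodal input handling, only a sketch of how the final hidden state of the last token becomes the embedding.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder text-only checkpoint; the real system uses GME-Qwen2-VL-2B with
# multimodal inputs, which this sketch does not reproduce.
MODEL_NAME = "Qwen/Qwen2-0.5B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed_last_token(text: str) -> torch.Tensor:
    """GME-style pooling: use the final hidden state of the last input token
    as the representation of the input."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, -1, :].squeeze(0)
```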
During graph retrieval in the MMKG refinement stage, as described in Section 3.1, we retrieve the top 120 entities and relations from the initial MMKG and concatenate them into a single string (as illustrated in Figure 3, subgraph). We then truncate this string to a maximum of 32,000 tokens. The truncated string is then used to prompt the MLLM to identify missing entity-relation pairs that were not captured in the initial stage. We experimented with both larger and smaller retrieval sizes and found that retrieving 120 entities and relations provides the best balance between global coverage of the MMKG and input length constraints.
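A sketch of this concatenate-and-truncate step; using tiktoken's cl100k_base tokenizer for the 32,000-token budget is an assumption, since the paper does not name the tokenizer.

```python
import tiktoken

def build_refinement_context(entities, relations, max_tokens=32_000):
    """Concatenate the top retrieved entities/relations into one string and
    truncate it to the token budget before prompting the MLLM."""
    enc = tiktoken.get_encoding("cl100k_base")   # assumed tokenizer
    context = "\n".join(str(item) for item in [*entities, *relations])
    return enc.decode(enc.encode(context)[:max_tokens])
```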
Given a primary image document with its text content that is potentially relevant to this activity, along with additional images obtained from layout detection (if available), and a list of entity types, identify all entities of those types from the text content and from any additional images that contain meaningful content. Note that: -The first image is always the primary image document.
-The remaining 0 to many images are results from layout detection.
-For each additional image, analyze whether it contains meaningful content (e.g., tables, charts, images of important persons, events, etc.). In making this determination, also reference the primary image document and its text content to understand the context. If the additional image is meaningful, treat it as an entity by extracting its relevant details. If the image is merely decorative or irrelevant (e.g., decorative patterns, unrelated photos), then ignore it.
-The input images are provided by appending them directly after the text (with the primary image document guaranteed to be the first image). Use {language} as output language.
-Steps-
1. Process the Input: a. The primary image document and its text content. b. Additional images from layout detection (if any), appended after the prompt.
2. Identify all entities from the text content and from any additional images that contain meaningful content. For each identified entity, extract the following information:
-entity_name: Name of the entity, using the same language as the input text (capitalize the name if it is in English).
-entity_type: One of the following types: [{entity_types}]
-entity_description: A comprehensive description of the entity’s attributes and activities.
Format each entity as (“entity”{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
*For additional images that are deemed meaningful (for example, a table showing financial data, a chart representing trends, an image of an important person or event, or one of the following types: [{entity_types}]), create an entity with an appropriate name and description indicating the content and significance of the image. When evaluating these images, also refer to the primary image document and its text content for context.
3. From the entities identified in step 2, identify all pairs of (source_entity, target_entity) that are clearly related to each other. For each pair, extract the following information:
-source_entity: Name of the source entity, as identified in step 2.
-target_entity: Name of the target entity, as identified in step 2.
-relationship_description: Explanation of why the source entity and the target entity are related.
-relationship_strength: A numeric score indicating the strength of the relationship between the source and target entities.
-relationship_keywords: One or more high-level keywords that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details.
Format each relationship as (“relationship”{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)
4. Identify high-level keywords that summarize the main concepts, themes, or topics of the entire text and images. Format these as (“content_keywords”{tuple_delimiter}<high_level_keywords>)
5. Return the output in {language} as a single list of all the entities and relationships identified in steps 2 and 3. Use {record_delimiter} as the list delimiter.
6. When finished, output {completion_delimiter}
Entity_types: {entity_types} Primary Image Document text content: {input_text} Additional Layout Detection Images: (The images are provided by appending them directly after this prompt, with the primary image document as the first image.)
Best Practices for Enterprise Gen AI Solutions A proven, scalable platform | Vela cloud-native supercomputer Deployed on Vela Cloud-native supercomputer, IBM Cloud Full stack running across thousands of GPUs with OpenShift (each node with 8 x A100 GPUs) Covering entire life cycle of foundation model, from data preprocessing, training, inference and workbench Jobs requiring anywhere between single to hundreds of GPUs Support for priorities and pre-emption, improving utilization and user experience …
Entity_types: [person, technology] Primary Image Document text content: “Alex clenched his jaw in frustration as Taylor asserted control. Jordan’s drive for discovery clashed with Cruz’s desire for order. Later, Taylor examined a device with reverence, hinting at its transformative power.” Additional Layout Detection Images:
-Image 1: (An image file showing a handwritten note on a whiteboard) -Image 2: (An image file showing a decorative background pattern with no meaningful information) Output:
(“entity”, “Taylor”, “person”, “Taylor is portrayed with strong authority and later shows respect toward a powerful device.”) (“entity”, “The Device”, “technology”, “The device is treated as a transformative object with great potential.”) (“relationship”, “Taylor”, “The Device”, “Taylor’s reverence for the device emphasizes its significance.”, “technological importance”, 8) (“content_keywords”, “authority, technology, significance”)
“person”, “organization”, “job_title”, “concept_or_framework”, “quote_or_statement”, “challenge_or_problem”, “question_or_use_case”, “technology_investment_area”, “business_goal_or_value”, “audience_or_stakeholder”
One-shot Exemplar
Given a primary image document with its text content that is potentially relevant to this activity, along with additional images obtained from layout detection (if available), and a list of entity types, identify all entities of those types from the text content and from any additional images that contain meaningful content. Additionally, use the provided Knowledge Graph Data to enhance entity extraction by leveraging prior knowledge, ensuring that:
-Entities and relationships already present in the Knowledge Graph Data should not be re-extracted from the text content or images.
-If a new entity is found in the text content or images that is not present in the Knowledge Graph Data, it should be extracted.
-If an entity from the text content or images is related to an existing entity in the Knowledge Graph Data, establish a new relationship between them.
-If two existing entities from the Knowledge Graph Data have a new relationship within the given text content or images, this relationship should also be extracted.
-If a previously known entity appears in the current text content or image with new descriptive attributes not found in the Knowledge Graph Data, those descriptions should be added to the entity.
-If a new entity is mentioned multiple times across text or images with different complementary attributes, the extracted description should integrate all such information.
Note that:
-The first image is always the primary image document.
-The remaining 0 to many images are results from layout detection.
-For each additional image, analyze whether it contains meaningful content (e.g., tables, charts, images of important persons, events, etc.). In making this determination, also reference the primary image document, its text content, and the Knowledge Graph Data to understand the context. If the additional image is meaningful, treat it as an entity by extracting its relevant details. If the image is merely decorative or irrelevant (e.g., decorative patterns, unrelated photos), then ignore it.
-The input images are provided by appending them directly after the text (with the primary image document guaranteed to be the first image).
-Use {language} as the output language.
-Steps-
1. Process the Input: a. The primary image document and its text content. b. Additional images from layout detection (if any), appended after the prompt. c. The Knowledge Graph Data, which provides structured relationships and prior knowledge that can help with entity identification.
2. Identify all new entities from the text content and additional images containing meaningful content.
-Do not extract entities that already exist in the Knowledge Graph Data.
-If a new entity is found, extract the following:
-entity_name: Name of the entity, using the same language as the input text (capitalize the name if it is in English).
-entity_type: One of the following types: [{entity_types}]
-entity_description: A comprehensive description of the entity’s attributes and activities. If found in multiple locations, integrate all details into one complete description.
-Format: (“entity”{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
3. Identify relationships between entities. For each relationship, extract the following information:
-source_entity: Name of the source entity, as identified in step 2 or the Knowledge Graph Data.
-target_entity: Name of the target entity, as identified in step 2 or the Knowledge Graph Data.
-relationship_description: Explanation of why the source entity and the target entity are related.
-relationship_strength: A numeric score indicating the strength of the relationship between the source and target entities.
-relationship_keywords: One or more high-level keywords that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details.
-Format: (“relationship”{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)
4. Extract high-level content keywords summarizing the main concepts, themes, or topics from the text and meaningful images, but excluding the Knowledge Graph Data.
-Format: (“content_keywords”{tuple_delimiter}<high_level_keywords>)
5. Return the output in {language} as a single list of all the entities and relationships identified in steps 2 and 3. Use {record_delimiter} as the list delimiter.
6. When finished, output {completion_delimiter}.
Entity_types: {entity_types} Primary Image Document text content: {input_text} Additional Layout Detection Images: (The images are provided by appending them directly after this prompt, with the primary image document as the first image.) Knowledge Graph Data: {kg_context}
Entity_types: [person, technology]
Primary Image Document text content: “Alex clenched his jaw in frustration as Taylor asserted control. Jordan’s drive for discovery clashed with Cruz’s desire for order. Later, Taylor examined a device with reverence, hinting at its transformative power.”
Additional Layout Detection Images:
-Image 1: (An image file showing a handwritten note on a whiteboard)
-Image 2: (An image file showing a decorative background pattern with no meaningful information)
Output:
(“entity”, “Taylor”, “person”, “Taylor is portrayed with strong authority and later shows respect toward a powerful device.”) (“entity”, “The Device”, “technology”, “The device is treated as a transformative object with great potential.”) (“relationship”, “Taylor”, “The Device”, “Taylor’s reverence for the device emphasizes its significance.”, “technological importance”, 8) (“content_keywords”, “authority, technology, significance”)
Best Practices for Enterprise Gen AI Solutions A proven, scalable platform | Vela cloud-native supercomputer Deployed on Vela Cloud-native supercomputer, IBM Cloud Full stack running across thousands of GPUs with OpenShift (each node with 8 x A100 GPUs) Covering entire life cycle of ….
“person”, “organization”, “job_title”, “concept_or_framework”, “quote_or_statement”, “challenge_or_problem”, “question_or_use_case”, “technology_investment_area”, “business_goal_or_value”, “audience_or_stakeholder”
Inputs
Subgraph: (“entity”, “Vela Cloud-native Supercomputer”, “platform”, “A scalable platform deployed on IBM Cloud, supporting large-scale AI workloads.”) … (“relationship”, “Vela Cloud-native Supercomputer”, “IBM Cloud”, “Vela is deployed on IBM Cloud, leveraging its infrastructure.”, “deployment, cloud integration”, 9) …
You are a helpful assistant responding to user query about Document Images provided below.
—Goal—Generate a concise response based on Document Images and follow Response Rules, considering both the conversation history and the current query. Summarize all information in the provided Document Images, and incorporating general knowledge relevant to the Document Images. Do not include information not provided by Document Images.
When handling content with timestamps:
1. Each piece of content has a “created_at” timestamp indicating when we acquired this knowledge
2. When encountering conflicting information, consider both the content and the timestamp
3. Don’t automatically prefer the most recent content - use judgment based on the context
4. For time-specific queries, prioritize temporal information in the content before considering creation timestamps
—Response Rules—
-Target format and length: Multiple Paragraphs
-Use markdown formatting with appropriate section headings
-Please respond in English.
-Ensure the response maintains continuity with the conversation history.
-If you don’t know the answer, just say so.
-Do not include information not provided by the Document Images.
You are a helpful assistant responding to user query about Knowledge Base provided below.
—Goal—Generate a concise response based on Knowledge Base and follow Response Rules, considering both the conversation history and the current query. Summarize all information in the provided Knowledge Base, and incorporating general knowledge relevant to the Knowledge Base. Do not include information not provided by Knowledge Base.
When handling relationships with timestamps:
1. Each relationship has a “created_at” timestamp indicating when we acquired this knowledge
2. When encountering conflicting relationships, consider both the semantic content and the timestamp
3. Don’t automatically prefer the most recently created relationships - use judgment based on the context
4. For time-specific queries, prioritize temporal information in the content before considering creation timestamps
—Knowledge Base—
—Response Rules—
-Target format and length: Multiple Paragraphs
-Use markdown formatting with appropriate section headings
-Please respond in English.
-Ensure the response maintains continuity with the conversation history.
-If you don’t know the answer, just say so.
-Do not make anything up. Do not include information not provided by the Knowledge Base.
You are a professional assistant responsible for answering questions based on both a knowledge graph and visual information extracted from document images containing relevant textual and visual content (e.g., scanned pages, slides, charts, or forms).
You are provided with a user query and two independent answers:
- An answer based on the knowledge graph.
- An answer based on the document images.
Your task is to analyze the user’s query and integrate the two provided answers into a single comprehensive response. Do not omit any relevant points from either source. When the answers conflict or provide complementary insights, use grounded reasoning to reconcile them. If the knowledge graph provides explicit facts, do not override them unless contradicted by strong visual evidence.
Please respond in English.
- Generate a concise response to the query that incorporates all relevant information from both Answers from the Knowledge Graph and the Document Images. If you don’t know the answer, just say so. Do not make anything up or include information where the supporting evidence is not provided.
When handling information with timestamps:
1. Each piece of information (both relationships and content) has a “created_at” timestamp indicating when we acquired this knowledge.
2. When encountering conflicting information, consider both the content/relationship and the timestamp.
3. Don’t automatically prefer the most recent information - use judgment based on the context.
4. For time-specific queries, prioritize temporal information in the content before considering creation timestamps.
—Response Rules—
-Target format and length: Multiple Paragraphs
-Generate a final answer that integrates both inputs.
-Use markdown formatting with appropriate section headings.
-Organize the answer into sections, each focusing on one main point or aspect of the answer.
-List up to 5 most important reference sources at the end under a “References” section. Clearly indicate whether each source is from Knowledge Graph (KG) or Document Content (DC), using this format: [KG/DC] Source content.
-Ensure the response maintains continuity with the conversation history.
-If you don’t know the answer, just say so. Do not make anything up.
-Do not include information not provided by the inputs.
We evaluate MegaRAG against two widely used graph-based RAG baselines, GraphRAG and LightRAG, as well as a commonly adopted non-graph baseline, NaiveRAG. To ensure a fair comparison, we set the generation temperature to 0 across all models. Below, we provide a detailed overview of each method along with its specific settings for reference.
NaiveRAG. Serving as a standard baseline among RAG systems, NaiveRAG divides the input document into multiple text chunks, which are then encoded into a vector space using text embeddings. At query time, relevant chunks are retrieved based on the similarity between their embeddings and the query representation.
GraphRAG. GraphRAG begins by segmenting the input text into chunks and extracting entities and relationships to construct a graph. This graph is subsequently partitioned into communities at multiple levels. During retrieval, GraphRAG identifies entities mentioned in the query and synthesizes answers by referencing summaries of the corresponding communities. Compared to traditional RAG approaches, GraphRAG offers a more structured and high-level understanding of the document.
LightRAG. LightRAG is a variant of GraphRAG.
It is designed to reduce computational overhead while enhancing retrieval quality through a dual-level retrieval mechanism. This design improves both efficiency and effectiveness, offering a better balance between performance and resource usage compared to GraphRAG.
Global QA. To evaluate model performance on global (book-level) questions, where no gold-standard answers are available, we conduct pairwise comparative evaluations between MegaRAG and baseline models. Responses are assessed along three qualitative dimensions: Comprehensiveness, Diversity, and Empowerment, as well as an overall rating that reflects performance across all criteria.
Each evaluation instance presents a question alongside two competing answers, one from a baseline model and one from MegaRAG. We employ GPT-4.1-mini as the evaluator to compare the two responses, select a winner for each dimension, and provide brief justifications. Comprehensiveness measures how thoroughly the answer addresses all aspects of the question. Diversity evaluates the richness and variety of perspectives presented. Empowerment assesses how effectively the answer enhances user understanding and supports informed decision-making. The full evaluation prompt used in this process is shown in Figure 6 (a). Local QA. For local (slide-or page-level) QA, where reference answers are available, we use GPT-4.1-mini to assess answer correctness. Each instance includes a question, the model’s response, and the corresponding ground truth. The LLM judge evaluates whether the response is semantically consistent with the reference, regardless of surface phrasing. The output is a binary label (yes or no) accompanied by a brief explanation. Accuracy is calculated as the proportion of responses judged correct. The evaluation prompt is shown in Figure 6 (b).
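The local QA accuracy computation can be sketched as follows; `judge_call` is a hypothetical wrapper around the GPT-4.1-mini judge prompt that is assumed to return a string starting with "yes" or "no".

```python
def local_qa_accuracy(examples, judge_call):
    """Accuracy for local QA. `examples` is a list of (question, response,
    reference) triples judged for semantic consistency."""
    correct = 0
    for question, response, reference in examples:
        verdict = judge_call(
            f"Question: {question}\nResponse: {response}\nReference: {reference}\n"
            "Is the response semantically consistent with the reference? Answer yes or no.")
        correct += verdict.strip().lower().startswith("yes")
    return 100.0 * correct / max(len(examples), 1)
```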
To verify that GPT-4o-mini cannot answer our evaluation questions solely by relying on its internal knowledge, and that it has not simply memorized the evaluation datasets during pretraining, we conduct an additional ablation study. Specifically, we compare MegaRAG against a retrieval-free baseline where answers are generated using GPT-4o-mini without access to any external context or retrieved information. As shown in Table 6, MegaRAG consistently outperforms the retrieval-free baseline, highlighting the value of combining retrieval with multimodal knowledge to enhance answer quality.
Table 6: Win rates (%) of MegaRAG compared with using GPT-4o-mini alone.
Given the following description of a dataset: {description} Please identify 5 potential users who would engage with this dataset. For each user, list 5 tasks they would perform with this dataset. Then, for each (user, task) combination, generate 5 questions that require a high-level understanding of the entire dataset.
Output the results in the following structure:
-User 1: [user description]
-Task 1: [task description]
-Question 1:
-Question 2:
-Question 3:
-Question 4:
-Question 5:
We present two case studies demonstrating the benefits of our MMKG refinement stage in improving knowledge extraction from visually rich documents. These examples show how refinement enhances multimodal grounding and enables the recovery of global, cross-page relations.
Example of enhanced multimodal relations. In the initial MMKG stage shown in Figure 7, entities such as Estimated Global Emissions and Earth Network of Grids are extracted from figure images, but their connections to textual entities are missing. After refinement, these visual entities are correctly linked to the 1 Gigaton Aspiration.
Example of enhanced cross-page relations. We demonstrate that cross-page relations can be recovered after the refinement stage in the example shown in Figure 8. By leveraging the provided MMKG subgraph, our method successfully links the visual entity "Renewable Energy Purchasing vs. Total Electricity" to the cross-page entity Total Electricity Consumption.
Further examples are provided in Tables 7, 8, 9, and 10 to compare our MegaRAG with GraphRAG and LightRAG. As shown in the respective LLM judgments, our approach consistently outperforms the baselines across four evaluation metrics: comprehensiveness, diversity, empowerment, and overall.
GANs use a generator and discriminator to produce realistic images, which are helpful when the dataset is small or imbalanced. VAEs encode input data into a latent space and sample from it to create new images, increasing dataset variation.
Generative models can also generate labeled data, reducing the need for expensive manual annotation. Basic augmentation like color shifts, noise addition, and rotations enhances generalization. By integrating generative models early in the data pipeline, models learn from both real and synthetic images, leading to higher accuracy and reduced overfitting.
Answer 1 offers a more complete explanation, discussing synthetic generation, feature learning, data structure analysis, augmentation robustness, and simulation. Answer 2 focuses mainly on synthetic data and standard augmentation.
Answer 1 explores multiple dimensions-from feature learning to simulation-while Answer 2 concentrates on class imbalance and dataset enlargement.
Answer 1 gives readers a clearer picture of how and why generative models are effective, with practical examples and diverse use cases. Answer 2 is helpful but more limited in depth.
Answer 1 outperforms Answer 2 in all aspects, providing broader insights and more actionable information for leveraging generative models in dataset augmentation.
Performance on SlideVQA (2k) and RealMMBench datasets in terms of Accuracy (%). GraphRAG (L) and GraphRAG (G) denote its local and global search modes.
A1: text-only graph construction (no visual inputs); A2: disable MMKG retrieval (page retrieval only); A3: replace two-stage generation with single-pass generation.
4 https://www.microsoft.com/en-us/corporate-responsibility/sustainability/report
5 https://sustainability.atmeta.com/2024-sustainability-report/
6 https://www.nvidia.com/en-us/sustainability/
Answer 1 empowers readers with clear explanations, real examples, and reasoning behind each strategy. Answer 2 lacks the same level of depth.
Overall - Winner: MegaRAG. Answer 1 is the most comprehensive, diverse, and empowering of the two answers.
Case Study (1): Comparison between MegaRAG and GraphRAG. Question 1: