Recent advances in multi-modal retrieval-augmented generation (mRAG), which enhances multi-modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple communicating agents can significantly outperform a single model. Despite their impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical communication graph PRUNING framework, termed M$^3$Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M$^3$Prune first applies intra-modal graph sparsification to the textual and visual modalities, identifying the edges most critical to solving the task. It then constructs a dynamic communication topology from these key edges for inter-modal graph sparsification. Finally, it progressively prunes redundant edges to obtain a more efficient, hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and strong multi-agent mRAG systems while significantly reducing token consumption.
While retrieval-augmented generation (RAG) has achieved significant success in the textual domain [9,20,56], extending it to the multi-modal realm presents substantial challenges [17]. Traditional methods typically rely on a cascaded pipeline in which a multi-modal retriever extracts relevant evidence, which is then forwarded to LLMs for answer synthesis [15,18,36]. This segmented approach inherently suffers from a global semantic alignment gap [19,24].
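To make the cascaded design concrete, it can be summarized as a retrieve-then-generate sequence in which the two stages are built and optimized separately. The sketch below is illustrative only; the `MultiModalRetriever` and `LLMGenerator` interfaces are hypothetical stand-ins, not the APIs of any specific cited system.

```python
# Illustrative sketch of a cascaded mRAG pipeline (hypothetical interfaces).
from dataclasses import dataclass

@dataclass
class Evidence:
    text: str        # textual snippet retrieved for the query
    image_path: str  # associated visual evidence, if any

class MultiModalRetriever:
    def retrieve(self, query: str, k: int = 5) -> list[Evidence]:
        raise NotImplementedError  # e.g., a dense cross-modal retriever

class LLMGenerator:
    def generate(self, query: str, evidence: list[Evidence]) -> str:
        raise NotImplementedError  # answer synthesis over retrieved context

def cascaded_mrag(query: str, retriever: MultiModalRetriever,
                  generator: LLMGenerator) -> str:
    # Retrieval and generation are decoupled stages; this separation is
    # the source of the semantic alignment gap discussed above.
    evidence = retriever.retrieve(query)
    return generator.generate(query, evidence)
```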
The emergence of multi-modal large language models (MLLMs) presents a transformative opportunity, as these models possess a remarkable ability to understand visual-language correlations [3,12,39]. This capability enables MLLMs to function not only as generators but also as integrative engines for multi-modal RAG (mRAG). By jointly processing retrieved multi-modal evidence and questions within a cohesive reasoning backbone, MLLMs can synthesize information and derive coherent conclusions from retrieved knowledge [31,54,58]. Nonetheless, when confronted with complex, multi-faceted multi-modal questions that demand diverse expertise or deliberative reasoning, a single agent often reaches its limits [10,33,49].
Recent efforts have investigated multi-agent systems built upon MLLMs, in which multiple specialized agents collaboratively communicate through structured graph topologies to distribute reasoning workloads for complex multi-modal tasks [17,32,34,51]. However, these performance improvements often come at a substantial cost, notably a significant increase in computational and token overhead. Critically, we identify the root cause as communication redundancy, which not only induces inefficiency but also directly undermines the accuracy of the final response [17,53]. This redundancy is especially pronounced in multi-modal settings, where the informational requirements for processing textual cues and visual patterns inherently differ. Yet, existing methods frequently apply the same intra-modal communication strategy across all modalities [22,27,44]. As shown in Fig. 1, in the "first flight date" question, the date expert agent is overwhelmed by irrelevant messages about engine counts and models. This noise prevents filtering out incorrect dates and obscures the correct evidence, leading to an erroneous answer.
We introduce M$^3$Prune, a novel framework for hierarchical communication graph pruning in multi-agent mRAG systems. The two key modules of M$^3$Prune are as follows.

Intra-Modal Graph Sparsification: In mRAG tasks, agents assigned different roles may hold divergent viewpoints on the same question [50,57]. Consequently, both cooperation and conflict can arise among agents within the same modality, which may hinder mutual enhancement. To address this, we design a spatio-temporal message-passing technique that facilitates effective exchange of viewpoints within the visual and textual modalities, respectively. The significance of edge connections between agents in each modality is determined by the quality of the final response and the structure of the communication topology. For redundant edges that connect low-contribution agents within a modality, we sparsify the edge weights, substantially reducing their communication with key agents.

Inter-Modal Graph Sparsification: Given the diverse effects of information granularity across modalities, it is essential to comprehensively integrate key clues from multiple modalities before generating the final response [2,7,23,41]. We therefore aggregate agent roles across modalities to construct an inter-modal spatio-temporal communication topology, facilitating robust inter-modal collaboration and the resolution of viewpoint conflicts. Specifically, we initialize the inter-modal communication topology from the intra-modal sparse graphs and the correlations among diverse agent roles, integrating cross-modal text-visual and visual-text role viewpoints to learn the significance of inter-modal edges. In addition to optimizing for task performance and structural regularity, as in intra-modal learning, we further incorporate a modality alignment loss to ensure consistent task understanding across modalities during inter-modal topology sparsification. Finally, after learning edge sparsity weights within the hierarchical intra- and inter-modal graphs, we progressively prune invalid edges associated with redundant agent roles across modalities; a sketch of this pruning step follows.
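To ground the two-stage procedure, the following minimal sketch shows one plausible way to realize learnable edge weights with a structural regularizer, a modality alignment penalty, and progressive pruning. It is a simplified illustration under our own assumptions, not the authors' implementation; the names (`CommunicationGraph`, `alignment_loss`, `progressive_prune`) and the choice of an L1 regularizer and an MSE alignment term are hypothetical.

```python
# Minimal sketch of learnable edge sparsification and progressive pruning
# (assumed design; not the released M^3Prune code).
import torch
import torch.nn.functional as F

class CommunicationGraph(torch.nn.Module):
    """Soft communication topology over directed agent-to-agent edges."""

    def __init__(self, num_agents: int):
        super().__init__()
        # One learnable logit per directed edge; sigmoid gives a weight in (0, 1).
        self.edge_logits = torch.nn.Parameter(torch.zeros(num_agents, num_agents))

    def edge_weights(self) -> torch.Tensor:
        return torch.sigmoid(self.edge_logits)

    def structural_loss(self) -> torch.Tensor:
        # Structural regularity: an L1-style penalty drives redundant
        # edges toward zero weight.
        return self.edge_weights().mean()

def alignment_loss(text_summary: torch.Tensor,
                   visual_summary: torch.Tensor) -> torch.Tensor:
    # Modality alignment: penalize divergence between the two modalities'
    # summary representations of the task (MSE is one simple choice).
    return F.mse_loss(text_summary, visual_summary)

@torch.no_grad()
def progressive_prune(graph: CommunicationGraph,
                      prune_ratio: float) -> torch.Tensor:
    """Return a binary mask that removes the lowest-weight fraction of edges."""
    w = graph.edge_weights().flatten()
    k = max(1, int(prune_ratio * w.numel()))
    threshold = w.kthvalue(k).values  # k-th smallest edge weight
    return (graph.edge_weights() > threshold).float()

# Example: one optimization step for an intra-modal graph of five agents.
graph = CommunicationGraph(num_agents=5)
optimizer = torch.optim.Adam(graph.parameters(), lr=1e-2)

task_loss = torch.tensor(0.7)  # placeholder for a response-quality loss
text_repr, visual_repr = torch.randn(16), torch.randn(16)  # placeholder summaries

loss = (task_loss
        + 0.1 * graph.structural_loss()
        + 0.1 * alignment_loss(text_repr, visual_repr))
loss.backward()
optimizer.step()

mask = progressive_prune(graph, prune_ratio=0.2)  # drop ~20% weakest edges
```

In the full framework, the task loss would be derived from the quality of the final multi-agent response, and pruning would be applied progressively over optimization rounds rather than in a single shot.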
In our experiments, we evaluate M$^3$Prune against a range of strong baselines, spanning zero-shot, single-agent, and multi-agent multi-modal settings, on mRAG benchmarks covering both general and domain-specific tasks: ViDoSeek [42], MultimodalQA [38], and ScienceQA [28]. Results show that our approach achieves state-of-the-art performance, improving accuracy by 9.4% while significantly reducing token consumption by 15.7% compared with strong multi-modal multi-agent baselines.
Previous works in mRAG can be broadly categorized into two interconnected themes: (1) Modality-