MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation
Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly irrelevant. In the second stage, reasoning-based reinforcement fine-tuning is utilized to jointly optimize fine-grained list-wise ranking and answer generation, guiding multi-modal large language models to output explainable reasoning logic in the MMRAG process. Our method achieves state-of-the-art results on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation, and its effectiveness is validated through comprehensive ablation experiments.
💡 Research Summary
Multi‑modal Retrieval‑Augmented Generation (MMRAG) has emerged as a powerful paradigm for producing fact‑rich, context‑aware answers by pulling external visual‑textual knowledge into large language models (LLMs). Despite impressive gains on complex tasks, existing MMRAG pipelines are essentially black boxes: they retrieve documents and generate responses without exposing the reasoning that led to the selection of those documents or the generation of the answer. This opacity hampers adoption in high‑stakes domains such as medicine, law, or education, where users must be able to verify the provenance of the information.
The paper addresses this critical gap by introducing a two‑stage reinforcement‑learning fine‑tuning framework called MMRAG‑RFT (Reinforcement Fine‑Tuning). The core idea is to treat both retrieval and generation as sequential decision‑making problems and to shape the policy with carefully designed reward signals that encourage not only high performance but also explicit, human‑readable explanations.
Stage 1 – Rule‑Based Reinforcement Fine‑Tuning (Coarse‑grained Point‑wise Ranking).
In the first stage the model learns to filter out obviously irrelevant multi‑modal documents. A set of deterministic heuristics—image‑text similarity scores, textual overlap thresholds, and metadata matches—is encoded as a scalar reward for each candidate document. Using Proximal Policy Optimization (PPO), the policy outputs a probability distribution over the document pool, and the point‑wise reward pushes the probability mass toward documents that satisfy the rules. Because the reward is computed independently for each document, this stage can be trained efficiently on very large corpora and stabilises the subsequent fine‑grained optimisation by removing noisy candidates early on.
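The point‑wise scoring described above can be sketched as follows. The specific heuristics, field names, thresholds, and reward values here are illustrative assumptions, not the paper's actual rule set:

```python
def rule_based_reward(doc, query, sim_threshold=0.3, overlap_threshold=0.2):
    """Score one candidate document independently of all others (Stage 1).

    `doc` is a dict with hypothetical fields:
      - "image_text_sim": precomputed image-text similarity (e.g., CLIP cosine)
      - "text": the document's textual content
      - "metadata_match": whether document metadata matches the query
    """
    reward = 0.0

    # Heuristic 1: image-text similarity above a threshold.
    if doc.get("image_text_sim", 0.0) >= sim_threshold:
        reward += 1.0

    # Heuristic 2: lexical overlap between query and document text.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.get("text", "").lower().split())
    overlap = len(q_tokens & d_tokens) / max(len(q_tokens), 1)
    if overlap >= overlap_threshold:
        reward += 1.0

    # Heuristic 3: metadata match contributes a smaller bonus.
    if doc.get("metadata_match", False):
        reward += 0.5

    return reward
```

Because each document is scored in isolation, rewards for an entire corpus can be computed in a single embarrassingly parallel pass before any policy update.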
Stage 2 – Reasoning‑Based Reinforcement Fine‑Tuning (Fine‑grained List‑wise Ranking & Answer Generation).
The second stage jointly optimises (i) the ordering of the retrieved set and (ii) the generation of the final answer together with a textual reasoning trace. The reward function is a weighted sum of three components:
- List‑wise retrieval quality – measured by NDCG, Recall@k, and Precision@k against the ground‑truth document IDs.
- Answer fidelity – evaluated with ROUGE‑L, BLEU, and BERTScore to capture lexical and semantic similarity to the reference answer.
- Explanation consistency – a novel term that scores the model‑generated reasoning paragraph against the documents actually used, via a cross‑modal similarity metric.
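As an illustration, the weighted combination of the three reward components might look like the sketch below. It substitutes a binary‑relevance NDCG and a token‑level F1 for the paper's full metric suite (ROUGE‑L, BLEU, BERTScore); the weights `w` and the precomputed `explanation_sim` score are assumptions:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """Binary-relevance NDCG@k over a ranked list of document IDs."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]) if d in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

def token_f1(pred, ref):
    """Token-level F1, a cheap stand-in for ROUGE-L/BLEU/BERTScore."""
    p, r = pred.lower().split(), ref.lower().split()
    common = len(set(p) & set(r))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(r)
    return 2 * prec * rec / (prec + rec)

def combined_reward(ranked_ids, relevant_ids, answer, ref_answer,
                    explanation_sim, w=(0.4, 0.4, 0.2)):
    """Weighted sum of retrieval, answer, and explanation terms (Stage 2)."""
    return (w[0] * ndcg_at_k(ranked_ids, relevant_ids)
            + w[1] * token_f1(answer, ref_answer)
            + w[2] * explanation_sim)
```

In this form a single scalar reward summarises list quality, answer fidelity, and explanation grounding, which is what the policy gradient in Stage 2 would optimise.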
During training the policy now emits two streams: (a) a distribution over the candidate set (still conditioned on the query) and (b) a token‑by‑token distribution for the answer plus explanation. PPO with clipping is again employed to keep policy updates stable. The architecture couples a multi‑modal encoder (e.g., ViLT or CLIP‑based) that fuses image and text embeddings with an LLM decoder (such as LLaMA‑2). Cross‑attention layers allow the decoder to attend to the selected documents while generating both the answer and its justification.
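The clipped surrogate that keeps these PPO updates stable can be written, for a single action, as the minimal sketch below; the clipping range `eps=0.2` is a common default, not necessarily the paper's setting:

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate objective for one action, negated as a loss.

    The probability ratio r = pi_new(a|s) / pi_old(a|s) is clipped to
    [1 - eps, 1 + eps] so that a single update cannot move the policy
    too far from the behaviour policy that collected the rewards.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Take the pessimistic (smaller) of the two surrogates, then negate
    # so that minimising this value maximises the PPO objective.
    return -min(ratio * advantage, clipped * advantage)
```

When the new and old policies agree (`ratio = 1`), the loss reduces to minus the advantage; when the ratio strays outside the clipping band, the gradient through it is cut off, which is what stabilises both the ranking and the generation streams.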
Experimental Evaluation.
The authors evaluate on two challenging benchmarks: WebQA, which contains web‑page snippets paired with images, and MultimodalQA, a dataset of multi‑modal, compositional questions. Baselines include state‑of‑the‑art RAG‑style models, Fusion‑in‑Decoder, and recent retrieval‑augmented vision‑language models. MMRAG‑RFT achieves notable improvements: Exact Match rises from 68.4 % to 73.9 % on WebQA and from 61.2 % to 66.8 % on MultimodalQA (an absolute gain of ≈5 percentage points). F1 and BLEU scores also increase by 3–4 percentage points. Human evaluation of explanation quality shows a mean consistency score of 4.3/5 for MMRAG‑RFT versus 3.1/5 for the best baseline, confirming that the model's reasoning traces are both accurate and useful.
Ablation studies reveal that removing the rule‑based stage degrades overall performance by ~2.8 percentage points, while omitting the list‑wise or explanation rewards each causes a ~3–4 point drop, underscoring the complementary roles of the two stages and their reward components.
Contributions and Impact.
- The paper pioneers the integration of reinforcement learning into multi‑modal retrieval‑augmented generation, turning a previously opaque pipeline into an explainable, controllable system.
- By separating coarse, rule‑driven filtering from fine, reasoning‑driven optimisation, the framework achieves both training stability and computational efficiency.
- The explicit generation of a reasoning paragraph bridges the gap between model decisions and human interpretability, a prerequisite for trustworthy AI in critical applications.
Limitations and Future Work.
The current reward design relies on domain‑specific heuristics; a more universal, meta‑learning‑based reward could broaden applicability. Automatic metrics for explanation quality remain under‑developed, so future research should explore better proxy measures to reduce reliance on costly human judgments. Finally, extending the approach beyond image‑text pairs to audio, video, or 3‑D data would test the scalability of the method to richer multi‑modal environments.
In summary, MMRAG‑RFT demonstrates that a carefully staged reinforcement‑learning fine‑tuning regimen can simultaneously boost retrieval‑augmented generation performance and endow the system with transparent, human‑readable reasoning, setting a new benchmark for explainable multi‑modal AI.