Multimodal Fact-Level Attribution for Verifiable Reasoning
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
💡 Research Summary
The paper introduces MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark designed to evaluate fact‑level attribution in multimodal large language models (MLLMs) when answering questions that require multi‑step reasoning over heterogeneous inputs such as video, audio, and graphical figures. Unlike prior grounding tasks that focus on simple observation‑based queries or a single visual modality, MuRGAt demands that models not only produce correct answers but also attach precise citations to each verifiable claim. Each citation must specify the modality (e.g., audio, visual) and an exact temporal segment, thereby forcing the model to self‑select evidence rather than relying on pre‑provided timestamps.
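To make the required output concrete, a response with fact-level citations might look like the following sketch. The exact schema is not specified in this summary, so every field name here (`answer`, `claims`, `start_s`, etc.) is illustrative, not the benchmark's actual format:

```python
import json

# Illustrative only: each verifiable claim carries one or more citations,
# and each citation names a modality plus an exact temporal segment.
example_output = {
    "answer": "The narrator is demonstrating a chemistry experiment.",
    "claims": [
        {
            "text": "A narrator describes mixing two solutions.",
            "citations": [
                {"modality": "audio", "start_s": 12.0, "end_s": 18.5},
                {"modality": "visual", "start_s": 13.0, "end_s": 20.0},
            ],
        }
    ],
}
print(json.dumps(example_output, indent=2))
```

The key point is that the model itself selects the segments: timestamps are emitted by the model rather than provided in the prompt.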
The evaluation pipeline consists of three sub-tasks.

(1) Verifiable Claim Identification uses an LLM-based verifier to filter sentences that can be directly grounded in the input; sentences that are purely reasoning or lack citations are discarded.

(2) Atomic Fact Decomposition breaks each remaining sentence into minimal, independently verifiable facts, resolves pronouns, and propagates the original citation set to every atomic fact.

(3) Attribution Quality assesses, for each fact-citation pair, (a) Recall: whether the union of cited segments fully entails the fact, and (b) Precision: whether each cited segment is strictly necessary, thus penalizing spurious or overly broad citations.

These two measures are combined into an F1 score, and together with coverage (the proportion of verifiable sentences that receive at least one citation) they form the holistic MuRGAt-Score.
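The per-fact metrics described above can be sketched as follows. This is a minimal illustration, assuming the LLM-based entailment check is supplied as a callable; all names are hypothetical, and the paper's actual implementation and score aggregation may differ:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Citation:
    modality: str   # e.g. "audio" or "visual"
    start: float    # segment start (seconds)
    end: float      # segment end (seconds)

# Entailment oracle: in the paper this role is played by an LLM-based
# verifier; here it is an injected callable so the metric logic can be
# exercised with a stub.
Entails = Callable[[List[Citation], str], bool]

def recall(fact: str, cits: List[Citation], entails: Entails) -> float:
    """Recall: the union of all cited segments must entail the fact."""
    return 1.0 if cits and entails(cits, fact) else 0.0

def precision(fact: str, cits: List[Citation], entails: Entails) -> float:
    """Precision: a citation counts as necessary if dropping it breaks
    entailment, penalizing spurious or overly broad citations."""
    if not cits or not entails(cits, fact):
        return 0.0
    necessary = sum(
        1 for i in range(len(cits))
        if not entails(cits[:i] + cits[i + 1:], fact)
    )
    return necessary / len(cits)

def f1(p: float, r: float) -> float:
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def coverage(sentence_citations: List[List[Citation]]) -> float:
    """Coverage: fraction of verifiable sentences with >= 1 citation."""
    if not sentence_citations:
        return 0.0
    return sum(1 for c in sentence_citations if c) / len(sentence_citations)
```

In practice the `entails` callable would wrap a verifier-model query over the cited segments; swapping in a stub, as above, keeps the scoring logic testable in isolation.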
Human annotations were collected on two multimodal datasets, WorldSense and Video‑MMMU, covering a wide range of question types, including those that require reasoning beyond direct observation. Using these annotations as gold standards, the authors evaluated several state‑of‑the‑art MLLMs, including Gemini‑2.5‑Flash, Gemini‑3‑Flash, Gemini‑3‑Pro, Qwen‑3‑Omni‑Instruct, and Qwen‑3‑Omni‑Thinking. While many models achieved high answer accuracy, their attribution performance was poor: average citation precision hovered below 30%, and hallucinated citations (incorrect or unsupported references) were common, with the hallucination rate approaching 50% on complex reasoning questions.
The automatic metric MuRGAt‑Score correlates strongly with human judgments (Pearson r = 0.84), outperforming a baseline LLM‑as‑judge approach (r = 0.59). The study also uncovers a “reasoning tax” effect: requiring explicit citations modestly impacts simple recognition tasks but dramatically degrades performance on tasks demanding deep logical inference. Programmatic strategies that separate reasoning from citation generation improve attribution quality (average +9.6 points in MuRGAt‑Score) but incur a trade‑off with overall answer correctness.
Further analysis reveals a non‑linear relationship between model scale, compute budget, and grounding ability. Larger models (e.g., Gemini‑3‑Pro) improve citation quality with increased compute, whereas smaller models experience a decline in MuRGAt‑Score as compute grows, suggesting that their internal reasoning becomes increasingly detached from verifiable evidence.
In summary, MuRGAt provides the first large‑scale, fact‑level multimodal attribution benchmark that evaluates both the correctness of generated content and the fidelity of its supporting evidence. The findings highlight a substantial gap: current MLLMs can often answer correctly but fail to ground their statements reliably. The paper points toward future research directions such as joint evidence retrieval and reasoning, modular architectures that enforce citation constraints, and training objectives that directly optimize the precision‑recall trade‑off of multimodal attribution.