ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode those clues into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify "reasoning drift": during clue gathering, the model over-focuses on question-irrelevant entities, diluting focus on task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model's reasoning pathway (question → outputs → visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, without any additional training, ClueTracer improves all reasoning architectures (including R1-OneVision, Ocean-R1, MM-Eureka, etc.) by 1.21× on reasoning benchmarks. When transferred to non-reasoning settings, it yields a 1.14× gain.


💡 Research Summary

The paper tackles a critical failure mode in large multimodal reasoning models (MLRMs) known as “reasoning drift,” where during long‑chain inference the model’s attention drifts toward question‑irrelevant visual regions, leading to hallucinations—generated content not grounded in the image. Existing inference‑time mitigation techniques such as contrastive decoding, logit steering, or global attention reallocation assume that visual grounding can be improved without conditioning on the specific tokens the model generates. While these methods can help short responses, they accumulate bias over many reasoning steps, often causing premature token emission and overlooking the intermediate visual clues that are essential for correct answers.

To address this, the authors first introduce ClueRecall, a layer‑wise visual‑grounding metric that measures how well each transformer layer retrieves question‑relevant visual clues. By aligning model attention maps with ground‑truth object bounding boxes (using datasets like POPE and MSCOCO), they compute an average recall per layer. Experiments on a 7‑billion‑parameter, 28‑layer model reveal that layers 18‑24 achieve roughly 50 % ClueRecall, indicating that mid‑to‑late layers are most effective at visual clue extraction.
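As a rough illustration, the per-layer metric can be sketched as follows. This is a minimal sketch, not the paper's exact formulation: the function name, the top-k budget (set to the number of ground-truth patches), and the patch-grid representation of attention and bounding boxes are all assumptions.

```python
import numpy as np

def clue_recall(attn, gt_mask, top_k=None):
    """Fraction of ground-truth object patches recovered by a layer's
    top-attended patches (one plausible reading of the metric)."""
    attn = attn.ravel()
    gt = gt_mask.ravel().astype(bool)
    if top_k is None:
        top_k = int(gt.sum())               # budget = number of GT patches
    top = np.argsort(attn)[::-1][:top_k]    # most-attended patch indices
    hits = gt[top].sum()                    # how many fall inside GT boxes
    return hits / max(int(gt.sum()), 1)

# toy example: 4x4 patch grid, GT object occupies the top-left 2x2 block
rng = np.random.default_rng(0)
attn = rng.random((4, 4))
attn[:2, :2] += 1.0                         # model attends to the object
gt = np.zeros((4, 4))
gt[:2, :2] = 1
print(clue_recall(attn, gt))                # -> 1.0 (perfect recall)
```

Averaging this quantity over a dataset, per layer, would yield the kind of layer-wise recall curve the paper uses to pick layers 18-24.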

Building on this insight, they propose ClueTracer, a training‑free, parameter‑free plugin that operates at inference time. The method proceeds in three steps:

  1. Key Question Token Detection – Identify question tokens whose attention variance across decoding steps is high; these tokens are likely to carry the core constraints of the query.
  2. Output Token Selection – From the generated reasoning trace, select output tokens that show strong alignment (high attention weight) with the identified key question tokens.
  3. Visual Token Tracing – For each selected output token, retrieve its attention distribution over visual tokens at the layer identified by ClueRecall. The top‑k visual tokens form a minimal patch set that directly supports the answer.
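The three steps above can be sketched on mock attention tensors. All tensor shapes, selection counts, and function names below are illustrative assumptions rather than the paper's actual interface:

```python
import numpy as np

def trace_clues(q_attn_steps, out2q_attn, out2vis_attn,
                n_key_q=2, n_out=2, k_vis=4):
    """Hypothetical sketch of the question -> output -> vision trace.

    q_attn_steps : (steps, n_question)   attention to each question token
                                         at every decoding step
    out2q_attn   : (n_output, n_question) output->question attention
    out2vis_attn : (n_output, n_visual)   output->visual attention at the
                                          ClueRecall-selected layer
    """
    # 1. key question tokens = highest attention variance across steps
    key_q = np.argsort(q_attn_steps.var(axis=0))[::-1][:n_key_q]
    # 2. output tokens most aligned with those key question tokens
    alignment = out2q_attn[:, key_q].sum(axis=1)
    key_out = np.argsort(alignment)[::-1][:n_out]
    # 3. union of the top-k visual tokens attended by each selected output
    patches = set()
    for t in key_out:
        patches |= set(np.argsort(out2vis_attn[t])[::-1][:k_vis].tolist())
    return key_q, key_out, sorted(patches)

# demo on random tensors: 8 decoding steps, 6 question tokens,
# 5 output tokens, 16 visual patches
rng = np.random.default_rng(1)
kq, ko, patches = trace_clues(rng.random((8, 6)),
                              rng.random((5, 6)),
                              rng.random((5, 16)))
```

The returned `patches` set plays the role of the "minimal patch set" described in step 3.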

These patches are then either re‑fed to the model or used to mask attention, effectively re‑anchoring the generation process to the most relevant visual evidence. Because the procedure runs before the final answer token is produced, it is architecture‑agnostic and can be attached to any decoder‑style multimodal model without additional training or fine‑tuning.
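One simple way to realize the attention-masking variant is to down-weight every visual token outside the traced patch set and renormalize. The function name and the damping factor `alpha` below are hypothetical, not taken from the paper:

```python
import numpy as np

def reanchor_attention(vis_attn, clue_patches, alpha=0.1):
    """Down-weight attention to non-clue visual tokens, then renormalize
    so the row still forms a distribution (illustrative scheme only)."""
    mask = np.full_like(vis_attn, alpha)      # damp everything by default
    mask[..., list(clue_patches)] = 1.0       # keep clue patches intact
    reweighted = vis_attn * mask
    return reweighted / reweighted.sum(axis=-1, keepdims=True)

# demo: 4 visual tokens, patch 3 was traced as the clue
out = reanchor_attention(np.array([0.1, 0.2, 0.3, 0.4]), {3})
# most attention mass is now concentrated on the clue patch
```

Because this only rescales an existing attention distribution, it needs no parameters or retraining, which matches the plugin-style deployment described above.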

The authors evaluate ClueTracer on several reasoning‑focused benchmarks (HallusionBench, VMCBench, POPE), applying it both to reasoning models (e.g., R1‑OneVision) and to non‑reasoning vision‑language models (e.g., LLaVA‑1.6). Results show:

  • Reasoning models: average improvements of 4.25× on HallusionBench and 1.17× on VMCBench, with a 1.21× boost in overall reasoning accuracy.
  • Non‑reasoning models: a 1.14× gain, lifting many systems from near‑chance performance to the GPT‑4V range, and an average +8.7 percentage‑point increase in accuracy.

Ablation studies confirm that each component (key‑token detection, output‑guided tracing, layer selection via ClueRecall) contributes meaningfully; removing any step degrades performance substantially. Qualitative analyses illustrate that ClueTracer progressively narrows attention to the true visual clue (e.g., the batter’s head for a helmet query), suppressing hallucinated details that previously led to incorrect answers.

In summary, the paper makes three major contributions: (1) a diagnostic metric (ClueRecall) for probing visual grounding across transformer layers, (2) the ClueTracer inference‑time algorithm that traces clues through the query → output → vision pipeline, and (3) extensive empirical evidence that this simple, training‑free approach dramatically reduces hallucination in both reasoning‑heavy and standard multimodal settings. The work opens a new direction for post‑hoc, token‑aware grounding that respects the sequential nature of multimodal reasoning, offering a practical tool for improving reliability of large vision‑language systems without costly retraining.

