Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs
Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. However, these methods share a limitation: boosting attention to all visual tokens inevitably increases attention to task-irrelevant tokens. To tackle this challenge, we propose a training-free attentional intervention algorithm that enhances the attention of task-relevant tokens, based on the observation that task-relevant tokens generally exhibit high visual-textual similarity. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct reweighting matrices that reallocate attention. In addition, to enhance the contribution of visual tokens, we inject visual attention values into beam search decoding to identify candidates with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs while preserving the accuracy and coherence of generated content.
💡 Research Summary
The paper tackles the persistent hallucination problem in Large Vision‑Language Models (LVLMs), which often stems from insufficient attention paid to visual tokens during generation. While prior works have tried to boost visual attention globally or employ contrastive decoding, these approaches inadvertently amplify attention to irrelevant image regions, sometimes worsening hallucinations. The authors observe that task‑relevant visual tokens tend to have high visual‑textual similarity and propose a training‑free, plug‑and‑play framework called VAALE (Visual‑Aware Attention and Logits Enhancement) consisting of two complementary modules.
The first module, Attention Refocusing, extracts the visual‑to‑instruction (v→i) and instruction‑to‑visual (i→v) sub‑matrices from the cross‑attention matrix of an LVLM. These sub‑matrices quantify the semantic alignment between visual and textual tokens. By constructing reweighting matrices from them and mixing them with the original attention weights via a balance factor α, the method amplifies attention on tokens with strong visual‑textual alignment while suppressing attention on unrelated visual patches. Importantly, this operation is applied only to the attention of the last generated token, leveraging the cached hidden states that modern LVLMs maintain for efficient inference, so it incurs negligible runtime overhead.
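A minimal sketch of this reweighting idea, assuming a simplified setting: one attention head, the i→v column mean used as the per‑visual‑token relevance score, and a linear blend with α (function and variable names here are illustrative, not the authors' implementation):

```python
import numpy as np

def refocus_last_token(attn, vis_idx, txt_idx, alpha=0.5):
    """Sketch of Attention Refocusing applied to the last generated token.

    attn    : [seq, seq] attention matrix of one head (rows sum to 1)
    vis_idx : indices of the visual tokens
    txt_idx : indices of the instruction (text) tokens
    alpha   : balance factor between original and reweighted attention
    """
    # i->v sub-matrix: how strongly each instruction token attends to each
    # visual token; its column mean serves as a per-visual-token relevance
    # score (assumed here as the proxy for visual-textual similarity).
    i2v = attn[np.ix_(txt_idx, vis_idx)]           # [n_txt, n_vis]
    relevance = i2v.mean(axis=0)
    relevance = relevance / (relevance.sum() + 1e-9)

    row = attn[-1].copy()                          # last generated token only
    vis_mass = row[vis_idx].sum()                  # preserve total visual mass
    reweighted = vis_mass * relevance              # reallocate within it
    row[vis_idx] = (1 - alpha) * row[vis_idx] + alpha * reweighted
    return row / row.sum()                         # renormalize to sum to 1
```

Because only the last token's row is touched, this composes naturally with KV caching: no recomputation of earlier positions is needed.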
The second module, Visual Beam Search, modifies the decoding phase. For each candidate beam, a Visual Interaction Degree (VID) is computed by aggregating cross‑attention scores between the candidate token and visual tokens across a selected range of layers. The original logits are then adjusted as L′ = L + β·γ·VID, where β controls how much the visual signal influences the decision and γ scales the VID term. This encourages the beam that interacts most strongly with the image to be selected, directly embedding visual relevance into the probability distribution.
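The adjustment rule L′ = L + β·γ·VID can be sketched as follows, under the assumption that VID is obtained by summing a candidate's cross‑attention over visual tokens and averaging over the selected layers (the aggregation choice and all names here are hypothetical):

```python
import numpy as np

def adjust_beam_scores(beam_logits, beam_vis_attn, beta=0.4, gamma=1.0,
                       layers=None):
    """Sketch of the Visual Beam Search rule L' = L + beta * gamma * VID.

    beam_logits   : [n_beams] original scores of the candidate beams
    beam_vis_attn : [n_beams, n_layers, n_vis] cross-attention from each
                    candidate token to the visual tokens, per layer
    layers        : iterable of layer indices to aggregate (default: all)
    """
    attn = beam_vis_attn if layers is None else beam_vis_attn[:, layers, :]
    # Visual Interaction Degree: total attention a candidate pays to the
    # image, summed over visual tokens and averaged over selected layers.
    vid = attn.sum(axis=-1).mean(axis=-1)          # [n_beams]
    return beam_logits + beta * gamma * vid
```

In use, the adjusted scores replace the raw beam scores when ranking candidates, so beams that interact more strongly with the image are preferred at each decoding step.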
Experiments are conducted on two representative LVLMs: LLaVA‑v1.5‑7B and Qwen2.5‑VL‑3B. Using 500 random COCO‑2014 validation images, the authors evaluate hallucination reduction with CHAIR‑I/CHAIR‑S (instance‑level and sentence‑level image‑caption alignment) and POPE (binary VQA probing). VAALE outperforms three strong baselines—OPERA, VCD, and PAI—by reducing CHAIR‑S by up to 15.5% and CHAIR‑I by 17.8% on LLaVA, and achieving even larger relative drops (≈39% and 43%) on Qwen2.5‑VL. F1 scores and POPE accuracy remain comparable or slightly improved, demonstrating that the method does not sacrifice content richness or factual correctness.
Ablation studies explore the sensitivity of α (attention refocusing strength) and β (visual beam weight). Optimal α lies around 0.5 for greedy decoding and 0.2 for beam search, while β performs best between 0.3 and 0.6, with γ fixed as per Table 1. Combining both modules yields marginally better results than either alone, confirming their complementary nature.
In summary, VAALE offers a practical, training‑free solution to LVLM hallucinations by (1) re‑allocating attention toward visually‑textually aligned tokens using cross‑attention sub‑matrices, and (2) injecting visual interaction signals directly into beam‑search logits. The approach is model‑agnostic, incurs minimal computational cost, and preserves generation quality. The paper opens avenues for further work such as automatic layer selection, dynamic hyper‑parameter tuning, and extension to other multimodal tasks like video captioning or multi‑image reasoning.