Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework
Despite progress in Large Vision-Language Models (LVLMs), object hallucination remains a critical issue in image captioning, where models generate descriptions of non-existent objects, compromising their reliability. Previous work attributes this to LVLMs' over-reliance on language priors and attempts to mitigate it through logits calibration, but it still lacks a thorough analysis of that over-reliance. To gain a deeper understanding, we conduct a series of preliminary experiments, which indicate that as generation length increases, LVLMs' over-reliance on language priors inflates the probability of hallucinated object tokens, exacerbating object hallucination. To circumvent this issue, we propose Language-Prior-Free Verification, which enables LVLMs to faithfully verify the confidence of object existence. Based on this, we propose a novel training-free Self-Validation Framework to counter the over-reliance trap: it first validates objects' existence in sampled candidate captions and then mitigates object hallucination via caption selection or aggregation. Experimental results demonstrate that our framework significantly mitigates object hallucination in image captioning (e.g., a 65.6% improvement on the CHAIR-I metric with LLaVA-v1.5-7B), surpassing previous SOTA methods. This result highlights a novel path toward mitigating hallucination by unlocking the inherent potential within LVLMs themselves.
💡 Research Summary
Large Vision‑Language Models (LVLMs) have achieved impressive results on tasks such as image captioning and visual question answering, yet they still suffer from object hallucination – the generation of objects that do not appear in the image. Prior work attributes this to an over‑reliance on language priors and attempts to alleviate it through logits calibration (e.g., VCD, DeCo). However, these methods do not fully address the fact that the reliance on language priors grows as the generated caption becomes longer.
The authors first conduct a systematic analysis on LLaVA‑v1.5‑7B using samples from the MS‑COCO validation set. For each decoding step they compute the Jensen‑Shannon Divergence (JSD) between the full conditional distribution p(yₜ|v, x, y<ₜ) and the language‑only distribution p(yₜ|x, y<ₜ). High JSD values indicate strong visual influence, while low values signal dominance of language priors. Their results show that the first ~20 % of tokens have high JSD (strong visual grounding), but thereafter JSD drops sharply, and the hallucination rate of object tokens rises by an order of magnitude. Even state‑of‑the‑art calibration methods keep JSD slightly higher than the vanilla baseline but cannot reverse this downward trend.
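The per-step divergence described above can be sketched in a few lines. This is a minimal, self-contained illustration of the JSD computation, not the authors' code: `p_full` and `p_lang` stand in for the model's next-token distributions with and without the image, which in practice would come from the LVLM's softmax outputs.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions: zero divergence (the language prior fully dominates the step).
p_full = p_lang = [0.7, 0.2, 0.1]
print(round(jsd(p_full, p_lang), 6))  # 0.0
# Disjoint distributions: maximal divergence ln(2) ~= 0.693 (strong visual influence).
print(round(jsd([1.0, 0.0], [0.0, 1.0]), 3))  # 0.693
```

A low JSD at a decoding step thus signals that conditioning on the image barely changed the model's prediction, which is exactly the regime in which hallucinated object tokens become likely.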
Motivated by this finding, the paper introduces Language‑Prior‑Free Verification (LPFV). After a normal caption is generated, the model is prompted with a simple instruction: “Describe any element of the image with only one word or phrase.” This forces the model to produce a distribution p(o | v, xₑ) that is free from the previously generated textual context. Comparing the original object probability p(o | v, x, y<ₛ(o)) with the LPFV probability, the authors report an AUROC of 0.85 versus 0.69, demonstrating that LPFV provides a much more reliable estimate of object existence.
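The verification step can be sketched as follows. The prompt string is the one quoted above; everything else is a hypothetical scaffold, with `next_token_probs` standing in for whatever interface exposes the model's next-token distribution given the image and a fresh prompt:

```python
def lpfv_confidence(object_word, next_token_probs):
    """Language-Prior-Free Verification (sketch): query the model with the
    verification prompt -- which carries none of the previously generated
    caption text -- and read off the probability mass placed on the object.
    `next_token_probs(prompt)` is a hypothetical callable returning a
    {token: probability} dict for the model's next-token distribution."""
    prompt = "Describe any element of the image with only one word or phrase."
    dist = next_token_probs(prompt)
    return dist.get(object_word, 0.0)

# Toy stand-in for an LVLM conditioned on an image of a dog on a sofa:
fake_model = lambda prompt: {"dog": 0.6, "sofa": 0.3, "frisbee": 0.02}
print(lpfv_confidence("dog", fake_model))      # 0.6
print(lpfv_confidence("frisbee", fake_model))  # 0.02 -> likely hallucinated
```

The key design choice is that the conditioning context contains no prior caption tokens, so the resulting confidence cannot be inflated by language priors accumulated during generation.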
Building on LPFV, the authors propose a Self‑Validation Framework consisting of two stages:
- Candidate Generation & Verification – N captions are sampled (temperature 0.5, top‑k = 50). For each caption, objects are extracted and each object’s existence confidence cⱼ is obtained via LPFV.
- Final Caption Production – Two strategies are explored:
  - Best‑of‑N Selection (BoN): Compute the average confidence of objects in each candidate (f(yᵢ) = (1/|Oᵢ|)∑cⱼ) and select the caption with the highest score.
  - Filter‑then‑Aggregate (FtA): Discard sentences containing low‑confidence objects (cⱼ ≤ α), collect the remaining factual sentences, and ask the LVLM to re‑aggregate them into a single, coherent caption.
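The two strategies reduce to simple scoring and filtering once per-object confidences are in hand. A minimal sketch, with toy data in place of real LPFV outputs (the final LVLM re-aggregation call in FtA is omitted):

```python
def best_of_n(candidates):
    """BoN: score each candidate caption by the mean LPFV confidence of its
    objects, f(y_i) = (1/|O_i|) * sum(c_j), and keep the top scorer.
    `candidates` maps caption text -> list of object confidences c_j."""
    score = lambda confs: sum(confs) / len(confs) if confs else 0.0
    return max(candidates, key=lambda cap: score(candidates[cap]))

def filter_then_aggregate(sentences, alpha=0.5):
    """FtA (filtering half): drop sentences containing any object with
    confidence c_j <= alpha; the surviving sentences would then be handed
    back to the LVLM for re-aggregation into one coherent caption."""
    return [s for s, confs in sentences if all(c > alpha for c in confs)]

caps = {"A dog on a sofa.": [0.9, 0.8], "A dog with a frisbee.": [0.9, 0.1]}
print(best_of_n(caps))  # "A dog on a sofa."
sents = [("A dog sits on a sofa.", [0.9, 0.8]), ("It holds a frisbee.", [0.1])]
print(filter_then_aggregate(sents))  # ["A dog sits on a sofa."]
```

BoN keeps a single sampled caption intact, while FtA can salvage factual sentences from several candidates at the cost of one extra generation call.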
The framework is training‑free and can be applied to any LVLM. Experiments are conducted on four models (LLaVA‑v1.5‑7B/13B, mPLUG‑Owl2‑7B, Qwen2.5‑VL‑7B) and evaluated on 500 randomly chosen COCO images. Metrics include CHAIR‑S (sentence‑level hallucination), CHAIR‑I (instance‑level hallucination), F1 (precision‑recall balance on object extraction), and GPT‑4o‑mini based Accuracy and Relevancy scores. Compared with strong baselines (VCD, CGD, HALC, DeCo, Less, Nullu), the Self‑Validation framework achieves dramatic reductions in hallucination: for LLaVA‑v1.5‑7B, FtA improves CHAIR‑I by 65.6 % and BoN by 28.8 % relative to the vanilla model, while maintaining comparable F1 and even slightly higher Accuracy/Relevancy. Similar gains are observed across the other models, confirming the method’s robustness.
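For readers unfamiliar with the CHAIR metrics, they are straightforward counts over extracted objects. A sketch following the standard definitions (this is the metric's general recipe, not the paper's evaluation script):

```python
def chair(captions_objects, ground_truths):
    """CHAIR metrics (sketch). captions_objects: list of object lists
    extracted from each generated caption; ground_truths: list of sets of
    objects actually present in the corresponding images.
    CHAIR-I = hallucinated mentions / all mentions (instance level);
    CHAIR-S = captions with >=1 hallucination / all captions (sentence level)."""
    total_mentions = hallucinated = hallucinated_caps = 0
    for objs, gt in zip(captions_objects, ground_truths):
        bad = [o for o in objs if o not in gt]
        total_mentions += len(objs)
        hallucinated += len(bad)
        hallucinated_caps += bool(bad)
    chair_i = hallucinated / max(total_mentions, 1)
    chair_s = hallucinated_caps / max(len(captions_objects), 1)
    return chair_i, chair_s

caps = [["dog", "sofa"], ["dog", "frisbee"]]
gts = [{"dog", "sofa"}, {"dog"}]
print(chair(caps, gts))  # (0.25, 0.5): 1 of 4 mentions, 1 of 2 captions
```

Lower is better for both scores, which is why the reported 65.6% relative reduction in CHAIR-I is the headline result.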
Ablation studies on the number of candidates N and the confidence threshold α show that performance is stable across a wide range of settings. The authors note limitations: LPFV’s one‑word/phrase prompt may struggle with complex multi‑word objects, and the computational cost grows with N, which could be a concern for real‑time applications. Future directions include integrating multimodal attention scores directly into the verification step and leveraging self‑supervised learning on large unlabeled image‑text pairs to further enhance the self‑validation capability.
In summary, the paper provides a clear diagnosis of the over‑reliance problem in LVLM captioning, introduces a simple yet powerful verification mechanism that removes language bias, and demonstrates that a training‑free self‑validation pipeline can substantially curb object hallucination, outperforming existing state‑of‑the‑art methods without sacrificing overall caption quality.