Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Despite the advanced capabilities of Large Vision-Language Models (LVLMs), they frequently suffer from object hallucination. One reason is that visual features and pretrained textual representations often become intertwined in the deeper network layers. To address this, we propose REVIS, a training-free framework designed to explicitly re-activate this suppressed visual information. Rooted in latent space geometry, REVIS extracts the pure visual information vector via orthogonal projection and employs a calibrated strategy to perform sparse intervention only at the precise depth where suppression occurs. This surgical approach effectively restores visual information with minimal computational cost. Empirical evaluations on standard benchmarks demonstrate that REVIS reduces object hallucination rates by approximately 19% compared to state-of-the-art baselines, while preserving general reasoning capabilities.

💡 Research Summary

Large Vision‑Language Models (LVLMs) have demonstrated impressive multimodal reasoning abilities, yet they remain vulnerable to object hallucination—producing plausible‑looking but non‑existent visual details. The authors of this paper argue that the root cause lies in the entanglement of visual features with pretrained language priors in the deeper layers of the model. When visual cues are weak or ambiguous, the language backbone dominates, causing the decoder to drift toward ungrounded probabilistic guessing.

To address this, the paper introduces REVIS (Re‑activating VISual information), a training‑free framework that intervenes directly in the latent space of an LVLM. The method proceeds in three stages: (1) Orthogonal visual vector construction, (2) Layer‑wise calibration to locate the optimal intervention depth, and (3) Dynamic sparse steering at inference time.

Stage 1 – Orthogonal projection.
From a small calibration set (≈100 image‑caption pairs) the authors compute two difference vectors per layer ℓ:

Raw visual vector v_raw(ℓ) = E

Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models

💡 Research Summary

Comments & Academic Discussion

Leave a Comment