DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation
Zexin Lin1,3,∗, Hawen Wan1,3,∗, Yebin Zhong1,3, and Xiaoqiang Ji1,2,3,†
Abstract—Vision-Language Models (VLMs) deployed in safety-critical applications like autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images, ignoring temporal degradation and error propagation—critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions (motion blur, sensor noise, compression artifacts) and measures hallucination persistence, error recovery, and temporal consistency through multi-turn Q&A tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth via lightweight VLMs with uncertainty filtering, achieving a 15.3% accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal significant robustness gaps: even top models like GPT-4o show only a 78.5% recovery rate, while open-source models struggle with temporal consistency (<60%). DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.
Index Terms—Large Vision-Language Models, Error Propagation, Automated Annotation, Multimodal Benchmarking, Robustness Evaluation
I. INTRODUCTION
Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications such as autonomous driving and robotic manipulation, where they must interpret continuous visual streams under imperfect conditions. A fundamental challenge is hallucination—the tendency to fabricate non-existent objects or attributes—which becomes particularly dangerous when errors compound over time in sequential reasoning tasks.
Limitations of Existing Benchmarks. Current evaluation paradigms suffer from three critical gaps: (1) Static focus: benchmarks like LLaVA-Bench [1] and MME [2] assess only single-frame understanding; (2) Temporal blindness: hallucination benchmarks (POPE [3], AMBER [4]) evaluate isolated responses without modeling error propagation; (3) Idealized inputs: video benchmarks (ConBench [5]) assume pristine quality, ignoring real-world degradation from motion blur, sensor noise, and compression artifacts. These limitations obscure a critical failure mode: cognitive inertia, where hallucinations induced by transient degradation persist even after visual quality recovers.
1The School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China.
2The School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China.
3The Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China.
†Corresponding author is Xiaoqiang Ji, whose e-mail is jixiaoqiang@cuhk.edu.cn.
Fig. 1. Overview of motivation and approach. (a) VLMs hallucinate under degradation. (b) Existing benchmarks ignore temporal error propagation. (c) DIQ-H evaluates hallucination persistence and recovery under dynamic degradation.
Our Approach. We introduce DIQ-H (Degraded Image Quality leading to Hallucinations), the first benchmark to evaluate VLMs under dynamic visual degradation in temporal sequences. DIQ-H addresses the above gaps through: (1) Physics-based degradation simulation applying realistic motion blur, Poisson-Gaussian noise, and H.265 compression; (2) Temporal task design with multi-turn Q&A probing error propagation and recovery; (3) Adaptive difficulty calibration that stress-tests model limits. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3% accuracy improvement over direct annotation. Our main contributions are:
• DIQ-H Benchmark: First systematic evaluation of VLM robustness to sequential degradation and hallucination propagation in dynamic video environments.
• Multi-Agent Generation Framework: Scalable pipeline combining degradation simulation, temporal task design, and adaptive difficulty control.
• UIR Annotation Framework: Cost-effective method for high-quality GT synthesis via uncertainty-guided refinement, reducing reliance on expensive human/GPT-4o annotation.
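As a concrete illustration of the physics-based degradation step described above, the sketch below applies motion blur, Poisson-Gaussian sensor noise, and compression artifacts to a single frame. The function names and parameter values are illustrative rather than the benchmark's actual implementation, and JPEG re-encoding stands in here for the H.265 compression used in DIQ-H.

```python
# Minimal sketch of the three corruption families (illustrative parameters).
import cv2
import numpy as np

def motion_blur(frame: np.ndarray, length: int = 15, angle_deg: float = 0.0) -> np.ndarray:
    """Convolve with a rotated linear kernel to mimic camera/object motion."""
    kernel = np.zeros((length, length), dtype=np.float32)
    kernel[length // 2, :] = 1.0 / length  # horizontal line kernel
    center = (length / 2 - 0.5, length / 2 - 0.5)
    rot = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (length, length))
    kernel /= max(kernel.sum(), 1e-8)
    return cv2.filter2D(frame, -1, kernel)

def poisson_gaussian_noise(frame: np.ndarray, peak: float = 50.0, sigma: float = 5.0) -> np.ndarray:
    """Signal-dependent shot noise (Poisson) plus read noise (Gaussian)."""
    img = frame.astype(np.float32) / 255.0
    shot = np.random.poisson(img * peak) / peak
    read = np.random.normal(0.0, sigma / 255.0, img.shape)
    return np.clip((shot + read) * 255.0, 0, 255).astype(np.uint8)

def compression_artifacts(frame: np.ndarray, quality: int = 15) -> np.ndarray:
    """Aggressive re-encoding; JPEG used only as a lightweight proxy for H.265."""
    _ok, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```

In a temporal sequence, such corruptions would be applied to a contiguous window of frames and then removed, so that hallucination persistence and recovery can be probed once the input returns to clean quality.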
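The uncertainty filtering at the core of UIR can likewise be sketched as agreement-based self-consistency over a lightweight VLM's sampled answers. This is an assumption-laden illustration of the general idea, not the paper's exact procedure: the agreement threshold, the refinement prompt, and the fallback to a stronger annotator are placeholders, and query_vlm is a hypothetical callable.

```python
# Illustrative sketch of uncertainty-guided pseudo-ground-truth filtering.
from collections import Counter
from typing import Callable, List, Optional

def uir_annotate(
    query_vlm: Callable[[str, str], str],   # (image_path, question) -> answer
    image_path: str,
    question: str,
    num_samples: int = 5,
    agreement_threshold: float = 0.8,
    max_rounds: int = 3,
) -> Optional[str]:
    """Accept a pseudo-label only when sampled answers agree strongly enough;
    otherwise refine the prompt and retry, finally deferring to a stronger
    annotator (human or larger model) by returning None."""
    prompt = question
    for _ in range(max_rounds):
        answers: List[str] = [query_vlm(image_path, prompt) for _ in range(num_samples)]
        top_answer, count = Counter(a.strip().lower() for a in answers).most_common(1)[0]
        agreement = count / num_samples  # crude uncertainty proxy
        if agreement >= agreement_threshold:
            return top_answer            # confident pseudo-ground-truth
        # Low agreement: ask again with an explicit grounding instruction.
        prompt = question + " Answer concisely and only from visible evidence."
    return None                          # flag for human / stronger-model review
```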
II. RELATED WORK
With the increasing use of VLMs, challenges regarding their stability, accuracy, and controllability have become more prominent. Benchmarking VLMs has therefore become a focal point of multimodal intelligence research.
Early benchmarks primarily assessed semantic comprehension over static images. Datasets such as VQA v2 [6] and the RefCOCO series [7, 8] focused on tasks like visual question answering and referring expression grounding