Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models


Despite their outstanding performance in multimodal tasks, Large Vision-Language Models (LVLMs) have been plagued by the issue of hallucination, i.e., generating content that is inconsistent with the corresponding visual inputs. While previous works have proposed various benchmarks to evaluate this issue, the quality of these evaluations remains unverified. We observe that some of these benchmarks may produce inconsistent evaluation results across repeated tests or fail to align with human evaluation. To address this, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages specific indicators to assess both reliability and validity. Our empirical analysis using HQM reveals and pinpoints potential evaluation issues in existing benchmarks, exposing a critical gap in current hallucination evaluation. To bridge this gap, we propose HQH, a High-Quality Hallucination benchmark, which demonstrates superior reliability and validity under HQM, serving as a credible evaluation tool. Our large-scale evaluation of popular LVLMs on HQH reveals severe hallucination problems, which occur not only in the models' main answer to a question but also in additional analysis. This highlights the necessity for future model improvements to effectively mitigate hallucinations and reduce the associated security risks in real-world applications. Our benchmark is publicly available at https://github.com/HQHBench/HQHBench.


💡 Research Summary

Large Vision‑Language Models (LVLMs) have achieved impressive results on multimodal tasks such as image captioning and visual question answering, yet they frequently generate text that contradicts the visual input—a phenomenon known as hallucination. Existing works have introduced a variety of hallucination benchmarks, but the quality of these evaluation tools has never been rigorously examined. This paper addresses that gap by proposing the Hallucination benchmark Quality Measurement (HQM) framework, which adapts psychometric concepts of reliability and validity to the AI benchmarking context.

HQM evaluates a benchmark along four dimensions. (1) Test‑retest reliability measures the consistency of model scores when the same benchmark is run twice with different random seeds; Pearson correlation between the two result sets quantifies stability. (2) Parallel‑forms reliability assesses robustness to prompt variations by creating re‑phrased (“parallel”) versions of the benchmark (e.g., opposite‑ground‑truth yes/no questions, shuffled multiple‑choice options, synonymic re‑writes of open‑ended prompts) and correlating the scores with the original. (3) Content validity checks whether each image‑instruction‑ground‑truth triple accurately reflects the intended hallucination type (object‑level, attribute‑level, scene‑level) through manual verification. (4) Criterion validity measures the alignment between automatic scores and human judgments, again using Pearson correlation.
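Both reliability dimensions above reduce to a Pearson correlation between two sets of per-model benchmark scores (two seeded runs for test-retest, original vs. parallel form for parallel-forms). A minimal sketch, with hypothetical scores standing in for real benchmark results:

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical per-model scores from two runs of the same benchmark with
# different random seeds; a correlation near 1 means high test-retest
# reliability. (Scores are illustrative, not taken from the paper.)
run_a = [0.81, 0.74, 0.66, 0.59, 0.72]
run_b = [0.80, 0.75, 0.64, 0.60, 0.71]
reliability = pearson(run_a, run_b)
print(round(reliability, 3))
```

The same function applies unchanged to parallel-forms reliability and to criterion validity (automatic scores vs. human judgments).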

Applying HQM to six representative benchmarks reveals systematic problems. Closed‑ended benchmarks such as POPE and AMBER‑Y achieve near‑perfect test‑retest reliability (≈0.999) but suffer from low parallel‑forms reliability (≈0.35), indicating strong response bias (e.g., acquiescence to “yes” or position bias in multiple‑choice). Open‑ended benchmarks like OpenCHAIR, MMHal, and GAVIE display moderate test‑retest reliability (0.88–0.91) and modest content validity (0.68–0.79). Moreover, LLM‑based automatic scoring (e.g., using GPT to assign hallucination scores) shows weak criterion validity (correlation ≤0.75), confirming that subjective scoring diverges from human preferences.

To overcome these deficiencies, the authors construct a new benchmark, HQH (High‑Quality Hallucination benchmark). HQH draws images from the Visual Genome dataset and formulates free‑form questions that probe fine‑grained perceptual dimensions: object existence and count, attribute properties (color, action), spatial relations, comparative relations, environmental context, and embedded text. Every sample undergoes manual review to eliminate annotation errors and ensure that the instruction matches the intended hallucination type. By avoiding binary or multiple‑choice formats, HQH mitigates response bias.

For evaluation, HQH introduces a two‑step, LLM‑assisted metric. First, the model’s answer is checked for semantic equivalence with the ground truth. Second, any statements that contradict the image are extracted, yielding two quantitative measures: (a) hallucination rate (the proportion of hallucinated claims among all claims) and (b) the absolute number of hallucinated claims. This objective approach sidesteps the limitations of object‑centric metrics such as CHAIR/OCH and reduces reliance on subjective LLM scoring. Under HQM, HQH achieves test‑retest reliability of 0.9977, parallel‑forms reliability of 0.9856, and high content and criterion validity (criterion correlation ≈0.95), outperforming all prior benchmarks.
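Once the LLM-assisted step has labeled each extracted claim as consistent or contradictory, the two measures are simple aggregates. A minimal sketch, assuming the claim labels have already been produced upstream (the labels below are hypothetical):

```python
def hallucination_metrics(claim_labels):
    """Given per-claim booleans (True = the claim contradicts the image),
    return (hallucination rate, number of hallucinated claims)."""
    n_halluc = sum(claim_labels)
    rate = n_halluc / len(claim_labels) if claim_labels else 0.0
    return rate, n_halluc


# Hypothetical labels for one model answer decomposed into five claims,
# two of which contradict the image.
labels = [False, True, False, False, True]
rate, count = hallucination_metrics(labels)
print(rate, count)  # 0.4 2
```

Benchmark-level scores would then average the rate (and sum the counts) over all samples.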

The authors then conduct a large-scale evaluation of ten state‑of‑the‑art LVLMs, including open‑source models (BLIP‑2, InstructBLIP, LLaVA, Shikra, Qwen‑VL) and closed‑source APIs (Gemini‑1.5‑Pro, GPT‑4o). Results show that hallucinations are pervasive: on average, 30% of the generated statements are hallucinated, and the problem appears not only in the primary answer but also in auxiliary explanations and analyses. Even the most capable closed‑source models exhibit multi‑claim hallucinations on complex queries, indicating that simple accuracy metrics severely underestimate safety risks.

In summary, the paper makes four key contributions: (1) a psychometric‑inspired HQM framework for systematic quality assessment of hallucination benchmarks; (2) an empirical diagnosis of reliability and validity flaws in existing benchmarks; (3) the creation of HQH, a rigorously curated, bias‑resistant benchmark with an objective, human‑aligned evaluation metric; and (4) a comprehensive analysis of current LVLMs that highlights the urgent need for hallucination mitigation strategies in model training and deployment. By foregrounding benchmark quality, the work sets a new standard for trustworthy evaluation in multimodal AI research.

