UniVRSE: Unified Vision-conditioned Response Semantic Entropy for Hallucination Detection in Medical Vision-Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv paper.

Vision-language models (VLMs) hold great promise for medical image understanding, particularly in Visual Report Generation (VRG) and Visual Question Answering (VQA), but they may generate hallucinated responses that contradict visual evidence, limiting clinical deployment. Uncertainty-based hallucination detection methods are intuitive and effective, yet their reliability degrades in medical VLMs: Semantic Entropy (SE), effective in text-only LLMs, becomes less dependable because strong language priors make medical VLMs overconfident. To address this challenge, we propose UniVRSE, a Unified Vision-conditioned Response Semantic Entropy framework for hallucination detection in medical VLMs. UniVRSE strengthens visual guidance during uncertainty estimation by contrasting the semantic predictive distributions derived from an original image-text pair and a visually distorted counterpart, with higher entropy indicating greater hallucination risk. For VQA, UniVRSE operates directly on the image-question pair; for VRG, it decomposes the report into claims, generates verification questions, and applies vision-conditioned entropy estimation at the claim level. To evaluate hallucination detection, we propose a unified pipeline that generates responses on medical datasets and derives hallucination labels via factual consistency assessment. Because current consistency-assessment methods rely on subjective criteria or modality-specific rules, we introduce the Alignment Ratio of Atomic Facts (ALFA), a novel method that quantifies fine-grained factual consistency; ALFA-derived labels provide ground truth for robust benchmarking. Experiments on six medical VQA/VRG datasets and three VLMs show that UniVRSE significantly outperforms existing methods, with strong cross-modal generalization.


💡 Research Summary

The paper introduces UniVRSE, a unified vision‑conditioned response semantic entropy framework designed to detect hallucinations in medical vision‑language models (VLMs) used for visual question answering (VQA) and visual report generation (VRG). Hallucinations—outputs that are plausible in language but contradict visual evidence—pose a serious barrier to clinical deployment. Existing uncertainty‑based detection methods, particularly Semantic Entropy (SE), work well for text‑only large language models but fail for medical VLMs because strong language priors cause over‑confidence even when visual input is altered.

UniVRSE addresses this by explicitly incorporating visual guidance into uncertainty estimation. For a given image‑question pair (VQA) or image‑claim pair (VRG), the method creates two versions of the visual input: the original image and a visually distorted version (e.g., blurred, noisy, or color‑shifted). The VLM generates multiple low‑temperature responses for each version; these responses are clustered in semantic space to form the predictive distributions P_orig and P_dist. The discrepancy distribution D = |P_orig − P_dist| captures how much the model's semantic output changes when visual information is perturbed, and the entropy of D is defined as the Vision‑Conditioned Semantic Entropy (VCSE). A high VCSE indicates that the model's output is insensitive to visual changes, signaling a higher risk of hallucination.
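The discrepancy-entropy computation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: normalizing the discrepancy vector into a distribution before taking its entropy is an assumption, and in practice P_orig and P_dist would come from semantic clustering of sampled responses.

```python
import numpy as np

def vcse(p_orig, p_dist, eps=1e-12):
    """Vision-Conditioned Semantic Entropy (illustrative sketch).

    p_orig, p_dist: predictive distributions over the same semantic
    clusters, from responses to the original vs. distorted image.
    The normalization of the discrepancy vector is an assumption.
    """
    p_orig = np.asarray(p_orig, dtype=float)
    p_dist = np.asarray(p_dist, dtype=float)
    d = np.abs(p_orig - p_dist)        # per-cluster discrepancy
    d = d / max(d.sum(), eps)          # normalize to a distribution
    return float(-np.sum(d * np.log(d + eps)))

# A vision-grounded model shifts probability mass under distortion:
grounded = vcse([0.9, 0.05, 0.05], [0.1, 0.8, 0.1])
# An overconfident model barely changes, so the small discrepancies
# normalize to a near-uniform (high-entropy) distribution:
overconfident = vcse([0.9, 0.05, 0.05], [0.88, 0.06, 0.06])
```

In this toy example the overconfident model, whose distribution is nearly unchanged by the distortion, yields a higher VCSE than the vision-grounded one, matching the intuition that insensitivity to visual perturbation signals hallucination risk.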

For VRG, the generated report is first decomposed into atomic claims. For each claim a verification question is automatically generated (e.g., “Is there evidence of pneumonia in this image?”). The same VCSE computation is performed at the claim level, and claim‑wise entropies are aggregated to obtain a report‑level hallucination score.
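The claim-level aggregation can be illustrated with a small helper. The aggregation rule (mean vs. max over claim entropies) is an assumption chosen for illustration; the paper may combine claim scores differently.

```python
def report_hallucination_score(claim_entropies, agg="mean"):
    """Aggregate claim-level VCSE values into a report-level score.

    claim_entropies: one VCSE value per atomic claim in the report.
    The aggregation scheme here is an assumption for illustration.
    """
    if not claim_entropies:
        return 0.0
    if agg == "max":
        # Flag the report by its single riskiest claim.
        return max(claim_entropies)
    # Default: average risk across all claims.
    return sum(claim_entropies) / len(claim_entropies)

# Hypothetical claim-level entropies for a three-claim report:
score = report_hallucination_score([0.2, 1.1, 0.4])
```

The max variant is stricter: a single high-entropy (likely hallucinated) claim dominates the report score, which may be preferable in safety-critical clinical settings.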

A major contribution is the Alignment Ratio of Atomic Facts (ALFA), a new metric for automatically labeling hallucinations. ALFA parses both the model‑generated text and the reference answer into sets of atomic facts, aligns them using semantic similarity and rule‑based matching, and computes the proportion of reference facts that are covered. An ALFA score below a chosen threshold (e.g., 0.7) marks the output as hallucinated. This provides objective, fine‑grained ground‑truth labels without costly human annotation, enabling robust benchmarking.
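The ALFA scoring step can be sketched as follows, assuming fact extraction has already been done and replacing the semantic-similarity/rule-based matcher with a pluggable predicate (here, a toy exact-match function); the real alignment procedure is more involved.

```python
def alfa(generated_facts, reference_facts, match, threshold=0.7):
    """Alignment Ratio of Atomic Facts (illustrative sketch).

    match(ref, gen) -> bool stands in for the semantic-similarity /
    rule-based matcher. Returns the fraction of reference facts
    covered by the generated text, plus a hallucination flag when
    that ratio falls below `threshold`.
    """
    if not reference_facts:
        return 1.0, False
    covered = sum(
        any(match(ref, gen) for gen in generated_facts)
        for ref in reference_facts
    )
    ratio = covered / len(reference_facts)
    return ratio, ratio < threshold

# Toy matcher: case-insensitive exact string match on facts.
def exact(a, b):
    return a.lower() == b.lower()

ratio, hallucinated = alfa(
    ["cardiomegaly present", "no pleural effusion"],
    ["Cardiomegaly present", "No pleural effusion", "no pneumothorax"],
    match=exact,
)
```

Here two of the three reference facts are covered, giving a ratio of about 0.67, which falls below the example 0.7 threshold and flags the output.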

Experiments span six medical datasets (four VQA: MIMIC‑Diff‑VQA, Path‑VQA, SLAKE, RAD‑VQA; two VRG: IU‑Xray, CheXpertPlus) covering CT, MRI, X‑ray, and pathology images, and three state‑of‑the‑art VLMs (MedGemma‑4B‑it, LLaVA‑Med‑7B, HuaTuoGPT‑Vision‑7B). UniVRSE is compared against baseline uncertainty methods (standard SE, VL‑Uncertainty, VASE), token‑level uncertainty, and supervised detectors. Across all settings UniVRSE achieves an average AUROC of 0.92, improving by 8–12 percentage points over the best baselines. It especially mitigates over‑confidence in models that remain certain even after visual distortion. The ALFA‑derived labels correlate strongly (ρ≈0.84) with expert human judgments, outperforming existing subjective metrics such as MedHallTune or GREEN.

Cross‑modal generalization tests show that a UniVRSE detector trained on one dataset retains high performance (AUROC >0.88) on unseen datasets and models, demonstrating its model‑agnostic nature. Ablation studies confirm that both the visual distortion step and the discrepancy‑entropy computation are essential; removing either component degrades performance to the level of plain SE.

The authors acknowledge limitations: the distortion techniques are simple and may not capture all real‑world imaging artefacts; ALFA focuses on atomic fact alignment and may miss higher‑order clinical reasoning; and VCSE computation incurs extra inference cost due to multiple samples and clustering. Future work includes richer perturbations, integration with medical knowledge graphs for deeper fact alignment, and efficiency optimizations.

In summary, UniVRSE provides a principled, vision‑aware uncertainty metric that reliably flags hallucinations in both short‑form (VQA) and long‑form (VRG) medical language generation. Coupled with the ALFA labeling framework, it establishes a scalable benchmark for hallucination detection and moves the field toward safer, more trustworthy AI assistance in clinical imaging.

