VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models


Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs. It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit strong modality-specific specialization and substantial dependence on the underlying VLM, (ii) modality-specific uncertainty frequently co-occurs with hallucinations while current UQ scores provide only weak and inconsistent risk signals, and (iii) although UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity, they largely fail to detect the subtle, instance-level ambiguity introduced by our perturbation pipeline. These results highlight a significant gap between current UQ practices and the fine-grained, modality-aware uncertainty required for reliable VLM deployment.


💡 Research Summary

The paper introduces VLM‑UQBench, a dedicated benchmark for evaluating uncertainty quantification (UQ) in vision‑language models (VLMs) with explicit modality awareness. Recognizing that uncertainty in VLMs can stem from the image, the text, or their interaction, the authors construct a 600‑sample dataset derived from VizWiz and augment it with grounded cross‑modal ambiguity cases from VQ‑FocusAmbiguity and synthetic hallucination‑focused subsets built on CLEVR scene graphs. Each sample is annotated by experts as belonging to one of four categories: clean, image‑uncertainty, text‑uncertainty, or cross‑modality uncertainty.

To probe how UQ methods respond to controlled perturbations, the authors design a scalable perturbation pipeline that injects eight visual (blur, brightness, darkness, cutout, noise, pixelate, shuffle, etc.), five textual (typos, word‑shuffle, drop‑words, subjectivity rewrites, invalid rewrites), and three cross‑modal (ambiguous reference, insufficient visual evidence, etc.) edits. Perturbation intensity is calibrated on small validation subsets to avoid trivial or catastrophic changes, enabling the automatic generation of contrastive pairs (original vs. perturbed) without additional human labeling.
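To make the contrastive-pair idea concrete, here is a minimal sketch of how such a pipeline could produce (original, perturbed) pairs without extra labels. The specific perturbation functions and intensity parameters below are illustrative assumptions, not the paper's actual implementation:

```python
import random
import numpy as np

def perturb_image_noise(img: np.ndarray, sigma: float = 0.1, seed: int = 0) -> np.ndarray:
    """Visual perturbation: add Gaussian noise to an image in [0, 1].

    `sigma` is the intensity knob that the paper calibrates on a small
    validation subset to avoid trivial or catastrophic changes.
    """
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0.0, 1.0)

def perturb_text_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Textual perturbation: swap adjacent characters in a fraction of words
    to simulate typos; `rate` plays the role of perturbation intensity."""
    rng = random.Random(seed)
    words = text.split()
    for i, word in enumerate(words):
        if len(word) > 3 and rng.random() < rate:
            j = rng.randrange(len(word) - 1)
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)

# A contrastive pair: the same sample before and after perturbation,
# generated automatically with no additional human annotation.
image = np.zeros((4, 4, 3))
question = "What color is the mug on the table?"
pair = {
    "original": (image, question),
    "perturbed": (perturb_image_noise(image), perturb_text_typos(question, rate=1.0)),
}
```

Because each pair shares a ground-truth original, any increase in a UQ score on the perturbed side can be attributed to the injected perturbation of a known modality.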

Two evaluation metrics are proposed: (1) Uncertainty Reflection Rate (URR) measures the proportion of perturbed instances whose UQ scores increase, quantifying sensitivity to a specific modality of perturbation; (2) Hallucination Consistency Coefficient (HCC) captures the correlation between UQ score changes and the occurrence of hallucinations in the synthetic CLEVR‑Hallucination set.
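The two metrics can be sketched directly from their descriptions. The code below is one plausible reading of the definitions (URR as the fraction of pairs whose score rises; HCC as a Pearson, i.e. point-biserial, correlation between score deltas and binary hallucination labels), not the paper's reference implementation:

```python
import numpy as np

def urr(scores_orig, scores_pert) -> float:
    """Uncertainty Reflection Rate: fraction of contrastive pairs whose UQ
    score increases after perturbation. Higher URR means the UQ method is
    more sensitive to that modality of perturbation."""
    orig = np.asarray(scores_orig, dtype=float)
    pert = np.asarray(scores_pert, dtype=float)
    return float(np.mean(pert > orig))

def hcc(score_deltas, hallucinated) -> float:
    """Hallucination Consistency Coefficient: correlation between UQ score
    changes and whether the model hallucinated on that instance."""
    deltas = np.asarray(score_deltas, dtype=float)
    labels = np.asarray(hallucinated, dtype=float)
    return float(np.corrcoef(deltas, labels)[0, 1])

orig = [0.2, 0.3, 0.1, 0.4]   # UQ scores on original samples
pert = [0.5, 0.25, 0.6, 0.7]  # UQ scores on perturbed counterparts
sensitivity = urr(orig, pert)                                   # 0.75
risk_signal = hcc(np.subtract(pert, orig), [1, 0, 1, 1])        # > 0
```

A high URR with a low HCC would reproduce the paper's finding that a score can react to perturbations without being a useful hallucination-risk indicator.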

The benchmark is used to assess nine UQ methods—both white‑box (token entropy, maximum sequence probability, PMI, etc.) and black‑box (lexical similarity, diversity‑based metrics)—across four modern VLMs (e.g., OFA, BLIP‑2, LLaVA, InstructBLIP) and three datasets (VizWiz, VQ‑FocusAmbiguity, CLEVR‑Hallucination). Evaluation includes standard calibration and selective‑prediction metrics (AUROC, F1) together with URR and HCC.
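For the selective-prediction side of the evaluation, AUROC asks how well a UQ score ranks hallucinated outputs above faithful ones. A minimal, dependency-light version via the Mann-Whitney U statistic (the sample scores and labels are made up for illustration):

```python
import numpy as np

def auroc(scores, labels) -> float:
    """AUROC as the probability that a randomly chosen positive (e.g.,
    hallucinated) instance receives a higher UQ score than a randomly
    chosen negative one, with ties counted as half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

uq_scores = [0.9, 0.8, 0.4, 0.3, 0.7]   # hypothetical UQ outputs
is_hallucination = [1, 1, 0, 0, 1]      # hypothetical hallucination labels
score = auroc(uq_scores, is_hallucination)  # 1.0: perfect ranking here
```

An AUROC near 0.5 on this task is what the paper's "weak and inconsistent risk signals" finding would look like in practice.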

Key findings:

  1. Modality‑specific specialization – Most UQ scores react strongly to perturbations of the modality they were originally designed for (visual vs. textual) but show little sensitivity to the other modality, indicating that current scalar‑based UQ methods lack fine‑grained modality awareness.
  2. Model dependence – The same UQ method can exhibit markedly different URR/HCC values across VLM architectures, especially for visual uncertainty, suggesting that encoder‑decoder designs and multimodal attention mechanisms heavily influence uncertainty signals.
  3. Weak link to hallucinations – High URR does not reliably predict hallucination occurrence, and HCC values are generally low, meaning that existing UQ scores are poor risk indicators for unsafe VLM outputs.
  4. Group‑level vs. instance‑level detection – On the grounded cross‑modal ambiguity set (VQ‑FocusAmbiguity), UQ methods perform comparably to chain‑of‑thought (CoT) reasoning baselines in detecting overt ambiguity. However, on the subtle, instance‑level perturbations generated by the pipeline, most UQ methods fail to distinguish uncertain from certain cases, highlighting a critical gap for real‑world deployment.

The authors argue that VLM‑UQBench fills a missing niche: it provides instance‑wise, modality‑labeled data and a systematic perturbation framework, enabling researchers to move beyond the traditional “uncertainty = single scalar” paradigm. They also acknowledge limitations: the nine evaluated methods all show limited effectiveness, underscoring the need for new UQ approaches that explicitly model image‑text interaction (e.g., attention‑distribution‑based uncertainty, multimodal consistency scores).

Future directions suggested include: (i) integrating UQ scores with actionable policies such as question reformulation or image reacquisition; (ii) training VLMs to predict modality‑specific uncertainty during pre‑training; and (iii) developing calibration techniques that leverage human‑annotated risk labels.

In summary, VLM‑UQBench offers a comprehensive, modality‑aware benchmark for uncertainty quantification in vision‑language models, reveals substantial shortcomings of current UQ methods, and sets the stage for next‑generation uncertainty‑aware multimodal AI systems.

