AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.
💡 Research Summary
The paper tackles a fundamental shortcoming of current Vision‑Language Models (VLMs) in Visual Question Answering (VQA): the inability to recognise and appropriately react to ambiguous visual queries. While most VQA benchmarks consist of clear, single‑answer pairs, real‑world interactions often contain vague references, multiple plausible targets, or insufficient visual cues. Existing work on ambiguous VQA either treats all ambiguity uniformly or adopts a binary “answer‑or‑ask for clarification” policy, which does not reflect how humans actually handle uncertainty.
To address this gap, the authors introduce AQuA (Ambiguous Visual Question Answering), a new dataset and training framework that (1) systematically categorises visual ambiguity into four fine‑grained levels, and (2) defines four corresponding response strategies that a model should be able to select dynamically.
Ambiguity Levels
- Level 0 (Unambiguous) – Traditional VQA items with a single, obvious answer.
- Level 1 (Low‑level Referential Ambiguity) – Questions contain pronouns such as “this” or “that”, but the referent can be inferred unambiguously from visual context (e.g., a single salient object).
- Level 2 (Multiple Valid Interpretations) – Two or three plausible targets exist; the optimal response is to list all reasonable answers rather than ask for clarification.
- Level 3 (High‑level Ambiguity Requiring Clarification) – Many similar objects or insufficient cues make it impossible to decide; the model should request clarification.
Response Strategies
- Direct answer (Level 0/1).
- Context‑based inference (Level 1).
- Enumerate plausible alternatives (Level 2).
- Ask for clarification (Level 3).
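The level-to-strategy mapping above is deterministic, so it can be sketched as a simple lookup. This is an illustrative sketch only; the enum and strategy names below are our own, not identifiers from the paper's released code.

```python
from enum import IntEnum


class AmbiguityLevel(IntEnum):
    """The four AQuA ambiguity levels (names are illustrative)."""
    UNAMBIGUOUS = 0         # single obvious answer
    REFERENTIAL = 1         # pronoun resolvable from visual context
    MULTIPLE_VALID = 2      # two or three plausible targets
    NEEDS_CLARIFICATION = 3 # too many candidates / insufficient cues


# Hypothetical strategy labels; the paper describes the behaviours but we
# invented these specific string identifiers for the sketch.
STRATEGY = {
    AmbiguityLevel.UNAMBIGUOUS: "direct_answer",
    AmbiguityLevel.REFERENTIAL: "context_based_inference",
    AmbiguityLevel.MULTIPLE_VALID: "enumerate_alternatives",
    AmbiguityLevel.NEEDS_CLARIFICATION: "ask_clarification",
}


def choose_strategy(level: int) -> str:
    """Return the target response strategy for a given ambiguity level."""
    return STRATEGY[AmbiguityLevel(level)]
```

In the trained model this mapping is of course implicit—the model must infer the level from the image and question—but it defines the ground-truth behaviour the dataset supervises.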
Dataset Construction
Images are drawn from COCO; bounding‑box annotations are used to compute object saliency as a weighted combination of area ratio (weight 0.7) and proximity to the image center (weight 0.3). Based on saliency thresholds, the authors automatically assign each image to a level. Questions, answers, and strategy labels are generated with GPT‑5 using level‑specific prompts, then filtered through a three‑stage pipeline: (i) automatic verification of level conformity, (ii) a cross‑level sanity check, and (iii) human validation via MTurk. The final corpus contains 7.2 K triplets (3.6 K for training, 3.6 K for evaluation), evenly split across the four levels.
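The saliency score that drives the automatic level assignment can be sketched as follows. The 0.7/0.3 weighting comes from the summary above; the exact normalisation (area ratio, and distance converted to a centrality score so that more central boxes score higher) is our assumption, not the paper's released implementation.

```python
def saliency(bbox, image_w, image_h, w_area=0.7, w_center=0.3):
    """Score a COCO-style bbox [x, y, w, h] in an image of the given size.

    Larger and more centrally placed objects score higher. The weights
    follow the paper's summary; the normalisation is an assumption.
    """
    x, y, w, h = bbox
    # Fraction of the image the object occupies.
    area_ratio = (w * h) / (image_w * image_h)
    # Distance of the box centre from the image centre, normalised by the
    # half-diagonal so a corner box gets centrality 0 and a centred box 1.
    cx, cy = x + w / 2, y + h / 2
    half_diag = ((image_w / 2) ** 2 + (image_h / 2) ** 2) ** 0.5
    dist = ((cx - image_w / 2) ** 2 + (cy - image_h / 2) ** 2) ** 0.5
    centrality = 1.0 - dist / half_diag
    return w_area * area_ratio + w_center * centrality
```

Thresholding this score over all annotated objects in an image (e.g., one dominant salient object vs. several comparable ones) is what would place an image into Level 1, 2, or 3.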
Training Procedure
The authors first apply Supervised Fine‑Tuning (SFT), feeding the model not only the answer text but also an explicit strategy token, thereby teaching the model the space of possible responses. Since SFT only optimises for token‑level likelihood, they further employ Group Relative Policy Optimization (GRPO), a reinforcement‑learning method that rewards the model when its generated output aligns with the ground‑truth strategy. This two‑stage approach yields a model that both understands the visual content and selects the appropriate behavioural policy.
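The GRPO stage can be illustrated with the core group-relative computation: several responses are sampled per prompt, each is scored against the ground-truth strategy, and advantages are normalised within the group. This is a minimal sketch of the general GRPO advantage formula with a simplified binary reward; the paper's actual reward design may be richer.

```python
def strategy_reward(predicted: str, gold: str) -> float:
    """Binary reward: 1 if the sampled response used the gold strategy.
    (A simplified assumption; the paper may use a shaped reward.)"""
    return 1.0 if predicted == gold else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each reward against its own group,
    i.e. the set of responses sampled for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that pick the correct strategy receive a positive advantage relative to their group and are reinforced; no separate learned value function is needed, which is the appeal of GRPO over PPO here.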
Evaluation
Baseline VLMs—both open‑source models (e.g., LLaVA, InternVL2) and closed‑source models (GPT‑5, Gemini Flash)—are evaluated on AQuA. Across all levels, they overwhelmingly produce over‑confident single answers, rarely asking for clarification even in Level 3 cases. Human annotation of 200 test items (50 per level) shows near‑perfect agreement with the defined strategies for Levels 0 and 1, and high but imperfect agreement for Levels 2 and 3, reflecting the inherent subjectivity of those boundaries.
Fine‑tuned AQuA models dramatically improve both overall accuracy and strategy‑selection metrics:
- In Level 2, the proportion of responses that correctly enumerate alternatives rises from ~78 % (baseline) to ~92 %.
- In Level 3, clarification requests increase from ~12 % to ~85 %.
- Human‑model agreement reaches ~90 % for the ambiguous levels, indicating that the model’s behaviour aligns with human expectations.
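The strategy-selection metrics reported above amount to per-level accuracy of the chosen strategy against the gold label. A minimal sketch of that evaluation, assuming records of (level, predicted strategy, gold strategy):

```python
from collections import defaultdict


def per_level_strategy_accuracy(records):
    """Fraction of responses using the gold strategy, grouped by level.

    records: iterable of (level, predicted_strategy, gold_strategy).
    """
    correct, total = defaultdict(int), defaultdict(int)
    for level, pred, gold in records:
        total[level] += 1
        correct[level] += int(pred == gold)
    return {lvl: correct[lvl] / total[lvl] for lvl in total}
```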
Insights & Contributions
- Systematic Ambiguity Taxonomy – The four‑level scheme provides a reproducible way to benchmark VLMs on nuanced uncertainty handling.
- Strategy‑Aware Training – By coupling SFT with GRPO, the model learns not only “what to say” but “how to say it” based on context.
- Empirical Evidence of Failure in Existing Models – Even state‑of‑the‑art closed‑source systems lack the ability to adapt their response style to ambiguity.
- Public Release – Code, dataset, and trained checkpoints are made publicly available, encouraging further research.
Limitations & Future Directions
- The dataset relies on synthetic question‑answer generation from GPT‑5 and on COCO images; domain transfer to medical, industrial, or highly specialized visual domains remains untested.
- Strategy labels are human‑annotated and can be ambiguous at the boundary between Levels 2 and 3; more granular uncertainty metrics could reduce subjectivity.
- Only four strategies are considered; real conversational agents may need richer behaviours such as partial answers, progressive clarification, or multimodal feedback.
- GRPO optimises for strategy alignment but does not directly enforce linguistic naturalness; integrating reward models that evaluate fluency and coherence could further improve user experience.
In summary, AQuA establishes the first comprehensive benchmark that couples a fine‑grained ambiguity taxonomy with explicit response strategies, and demonstrates that VLMs can be trained to manage uncertainty in a human‑like, context‑adaptive manner. This work paves the way for more trustworthy and conversationally competent multimodal AI systems.