CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion


With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism for determining when to ask clarification questions, relying instead implicitly on their learned representations. CLUE addresses this gap by converting the VLM’s cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text-to-image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA-fine-tuned decoder conducts the dialogue and emits grounding location tokens. We train on a real-world interactive IVG dataset, and on a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning, and the ambiguity detector likewise outperforms prior baselines. The data and code are publicly available at: mouadabrini.github.io/clue


💡 Research Summary

CLUE (Crossmodal disambiguation via Language‑vision Understanding with attEntion) tackles a core challenge in interactive visual grounding (IVG): deciding when a robot should ask clarification questions. Existing IVG systems rely on heuristics such as the number of candidate objects or token‑level uncertainty (entropy, confidence) to trigger clarification. These signals are indirect, policy‑level cues that do not explicitly indicate where in the visual scene the ambiguity originates.

The key insight of CLUE is to repurpose the cross‑modal attention maps that naturally arise inside a large vision‑language model (VLM) as a spatially grounded ambiguity signal. The authors use a SigLIP image encoder and a Gemma2 decoder (based on the PaliGemma‑2‑3B‑mix‑448 foundation model). During inference, after feeding the image and the current dialog prefix into the decoder, they extract the text‑to‑image attention weights from the 14th transformer layer. Each attention head is L1‑normalized over the 32×32 image patches, and the normalized maps for all query tokens (excluding special tokens) are averaged, yielding a single 32×32 heatmap per example. When an instruction is unambiguous, the heatmap concentrates on a single region; when multiple objects satisfy the instruction, the attention mass spreads across several patches, producing a diffuse pattern.
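The extraction step above can be sketched in a few lines of NumPy (a minimal sketch; the function name, tensor layout, and stabilizing epsilon are assumptions, not taken from the paper):

```python
import numpy as np

def ambiguity_heatmap(attn, query_mask):
    """Collapse text-to-image attention from one decoder layer into a
    single 32x32 heatmap.

    attn:       (num_heads, num_query_tokens, 1024) non-negative attention
                weights over the 32x32 = 1024 image patches
    query_mask: boolean (num_query_tokens,), False for special tokens
    """
    # L1-normalize each head's attention over the 1024 image patches
    attn = attn / (attn.sum(axis=-1, keepdims=True) + 1e-8)
    # keep only real text query tokens, then average over heads and tokens
    heatmap = attn[:, query_mask, :].mean(axis=(0, 1))
    return heatmap.reshape(32, 32)
```

Because each normalized row sums to one, the resulting heatmap also sums to roughly one, so how widely its mass spreads over the grid can be read directly as a dispersion signal.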

A lightweight convolutional network (three conv layers) is trained on these heatmaps to predict a binary ambiguity probability p_amb. This “ambiguity detector” is trained with binary cross‑entropy on a synthetic dataset generated in Isaac Sim (≈2,000 tabletop scenes with at least two visually similar YCB objects) and the IT2P dataset, totaling about 4,000 image‑instruction pairs. An additional out‑of‑distribution (OOD) set of 100 real‑world images from the InViG benchmark (each labeled ambiguous/unambiguous) is used only for evaluation.
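A detector of this shape can be sketched in PyTorch as follows. The channel widths, strides, and pooling are illustrative assumptions; the paper only specifies a three-conv-layer network producing a binary ambiguity probability trained with binary cross-entropy:

```python
import torch
import torch.nn as nn

class AmbiguityDetector(nn.Module):
    """Lightweight head over 32x32 attention heatmaps (sketch).
    Channel widths and strides are assumptions, not from the paper."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, heatmaps):          # heatmaps: (B, 1, 32, 32)
        logits = self.head(self.features(heatmaps)).squeeze(-1)
        return torch.sigmoid(logits)      # p_amb per example, in (0, 1)

# Training would use binary cross-entropy on the predicted probability:
# loss = nn.functional.binary_cross_entropy(detector(x), labels.float())
```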

For the actual grounding and clarification dialogue, CLUE employs LoRA adapters (rank r = 16, α = 32) inserted into the decoder’s projection layers. Two separate adapters are used: Adapter A for ambiguity detection (trained jointly with the CNN head) and Adapter B for the IVG task. The decoder is conditioned with a dedicated special token; given the image and the running dialog context, it autoregressively generates either a clarification question or a sequence of tokens that encode a bounding box in a 1024‑grid coordinate system. Training on the human‑human subset of InViG‑21K uses a prefix‑to‑target formulation in which the prefix is the accumulated dialogue and the target is either the next question or the final grounding tokens; the standard token‑level cross‑entropy loss is applied only to the suffix.
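The 1024-grid bounding-box encoding matches PaliGemma's `<locNNNN>` location-token convention, which can be sketched as below. The function name is illustrative, and the y1, x1, y2, x2 coordinate ordering is PaliGemma's documented convention, assumed to carry over here:

```python
def box_to_loc_tokens(box, img_w, img_h):
    """Encode a pixel-space box (x1, y1, x2, y2) as four location tokens
    on a 1024-bin grid, in PaliGemma's y1, x1, y2, x2 order (assumed)."""
    x1, y1, x2, y2 = box

    def to_bin(v, size):
        # map a pixel coordinate to one of 1024 bins, clamped to [0, 1023]
        return max(0, min(int(v / size * 1024), 1023))

    coords = [(y1, img_h), (x1, img_w), (y2, img_h), (x2, img_w)]
    return "".join(f"<loc{to_bin(v, s):04d}>" for v, s in coords)
```

For example, a full-frame box on a 640×480 image, `box_to_loc_tokens((0, 0, 640, 480), 640, 480)`, yields `<loc0000><loc0000><loc1023><loc1023>`.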

Experimental results show that the attention‑based ambiguity detector outperforms prior baselines such as Grad‑CAM‑derived saliency or simple confidence thresholds, achieving higher F1 scores on both synthetic and real OOD data. In the full IVG pipeline, CLUE surpasses the state‑of‑the‑art TiO model on the InViG‑only setting, despite using far fewer trainable parameters (only the LoRA adapters). An ablation study varying the decoder layer used for attention extraction demonstrates that layer 14 provides the strongest ambiguity signal, with performance improving steadily as deeper layers are used.

Overall contributions are threefold: (1) a novel spatial ambiguity detector that turns cross‑modal attention into an explicit, interpretable signal, (2) a synthetic dataset for multimodal ambiguity detection released to the community, and (3) an end‑to‑end, parameter‑efficient IVG system that jointly decides when to ask and where to ground, achieving state‑of‑the‑art results with minimal fine‑tuning. By leveraging the latent alignment patterns already present in large VLMs, CLUE eliminates the need for heavy‑weight auxiliary modules or extensive annotation, offering a scalable solution for real‑time human‑robot interaction where transparency and efficiency are paramount. Future work may extend the approach to video streams, multimodal cues such as gestures, and larger language models to further improve robustness in open‑world settings.

