Common Objects Out of Context (COOCo): Investigating Multimodal Context and Semantic Scene Violations in Referential Communication
To what degree and under what conditions do VLMs rely on scene context when generating references to objects? To address this question, we introduce the $\textit{Common Objects Out-of-Context (COOCo)}$ dataset and conduct experiments on several VLMs under different degrees of scene-object congruency and noise. We find that models leverage scene context adaptively, depending on scene-object semantic relatedness and noise level. Based on these consistent trends across models, we turn to the question of how VLM attention patterns change as a function of target-scene semantic fit, and to what degree these patterns are predictive of categorisation accuracy. We find that successful object categorisation is associated with increased mid-layer attention to the target. We also find a non-monotonic dependency on semantic fit, with attention dropping at moderate fit and increasing for both low and high fit. These results suggest that VLMs dynamically balance local and contextual information for reference generation. Dataset and code are available here: $\href{https://github.com/cs-nlp-uu/scenereg}{https://github.com/cs-nlp-uu/scenereg}$.
💡 Research Summary
The paper introduces the Common Objects Out‑of‑Context (COOCo) dataset to study how Vision‑Language Models (VLMs) use scene context when naming objects, especially under semantic violations. COOCo builds on COCO‑Search18, selecting 2,241 images without people or animals, and then systematically replaces the target object with alternatives that have low, medium, or high semantic relatedness to the scene. Semantic relatedness is quantified using cosine similarity between ConceptNet Numberbatch embeddings of the scene label (derived from a fine‑tuned ViT model on SUN‑397) and candidate object labels from THINGSPlus, ensuring realistic size and typicality constraints. For each original image, the authors generate 15 variants: original, same‑category replacement, high‑fit, medium‑fit, low‑fit objects, and versions with Gaussian noise (λ = 0, 0.5, 1.0) applied to the target region, the context region, or the whole image. The final dataset contains 18,395 images, providing a graded manipulation of object‑scene congruence and visual degradation.
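The two core manipulations described above — scoring scene-object semantic relatedness by cosine similarity between embeddings, and blending Gaussian noise into a target region at a given level λ — can be sketched as follows. This is a minimal illustration, not the authors' code: the blending formula `(1 − λ)·region + λ·noise` and the noise statistics are assumptions, and in the actual pipeline the embeddings would come from ConceptNet Numberbatch rather than being passed in directly.

```python
import numpy as np

def cosine_similarity(u, v):
    # Semantic relatedness between a scene-label embedding and a
    # candidate-object-label embedding (e.g. Numberbatch vectors).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def add_region_noise(image, bbox, lam, seed=None):
    # Blend Gaussian noise into one region of the image (target or context).
    # lam = 0 leaves the region unchanged; lam = 1 replaces it with noise.
    # bbox is (x, y, width, height) in pixel coordinates.
    rng = np.random.default_rng(seed)
    x, y, w, h = bbox
    out = image.astype(np.float64).copy()
    region = out[y:y + h, x:x + w]
    noise = rng.normal(loc=region.mean(),
                       scale=region.std() + 1e-8,
                       size=region.shape)
    out[y:y + h, x:x + w] = (1.0 - lam) * region + lam * noise
    return np.clip(out, 0, 255).astype(np.uint8)
```

Applying `add_region_noise` with the whole image as the bounding box covers the full-image noise condition; the context condition would instead noise everything outside the target box.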
Experiments evaluate five state‑of‑the‑art VLMs—KOSMOS‑2 (1.6 B), Molmo (7 B), xGen‑MM‑Phi3/BLIP‑3 (≈4.4 B), LLaVA‑OneVision (0.5 B/7 B), and Qwen2.5‑VL (7 B). Each model receives an image and a structured prompt that explicitly references the bounding‑box coordinates (e.g., “What is the object in this part of the image […]?”).
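The localized-prompting setup can be illustrated with a small helper. The exact prompt wording is not fully reproduced in the summary above, so the template below is hypothetical; it only shows the general pattern of embedding the target's bounding-box coordinates directly in the instruction.

```python
def build_prompt(bbox):
    # Hypothetical prompt template: the target region is identified by
    # its pixel bounding box (x, y, width, height), rendered as
    # [x_min, y_min, x_max, y_max] inside the question. The trailing
    # instruction is an assumption, not the paper's exact wording.
    x, y, w, h = bbox
    return (
        f"What is the object in this part of the image "
        f"[{x}, {y}, {x + w}, {y + h}]? "
        f"Answer with the object's name only."
    )
```

A prompt built this way lets the same question be posed across all congruency and noise variants of an image, so differences in the generated name can be attributed to the manipulation rather than to the prompt.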