Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing “a red cube and a blue sphere” with “a blue cube and a red sphere”. Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated and synthetic pipeline for generating scalable benchmarks. Its controllable nature is key to dissecting and isolating different reasoning skills. Auto-Comp generates paired images from Minimal captions (e.g., “a monitor to the left of a bicycle on a white background”) and LLM-generated Contextual captions (e.g., “In a brightly lit photography studio, a monitor is positioned to the left of a bicycle”), allowing a controlled A/B test that disentangles core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations reveals universal compositional failures in both the CLIP and SigLIP model families. Crucially, our novel “Confusion Benchmark” reveals a deeper flaw beyond simple attribute swaps: models are highly susceptible to low-entropy distractors (e.g., repeated objects or colors), demonstrating that their compositional failures extend beyond known bag-of-words limitations. We also uncover a surprising trade-off: visio-linguistic context, which provides global scene cues, aids spatial reasoning but simultaneously hinders local attribute binding by introducing visual clutter. We release the Auto-Comp pipeline to facilitate future benchmark creation, alongside all our generated benchmarks (https://huggingface.co/AutoComp).


💡 Research Summary

The paper introduces Auto‑Comp, a fully automated pipeline that creates large‑scale, photorealistic benchmarks for probing the compositional reasoning abilities of contrastive Vision‑Language Models (VLMs). Starting from a user‑defined “concept” – a set of objects, attributes, and spatial relations – the system generates two parallel caption streams: a Minimal caption that follows a strict template and specifies “on a white background”, and a Contextual caption that is rewritten by a large language model (Gemma‑3‑12b‑it) into natural, fluent prose. Both captions are fed to Stable Diffusion 3.5‑large to synthesize images.
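The concept-to-caption step can be sketched as follows. This is a minimal illustration, not the paper's actual code: the `Concept` structure and `minimal_caption` template are assumptions based on the example captions above, and the Contextual stream would pass the result to an LLM rewriter (Gemma-3-12b-it in the paper).

```python
from dataclasses import dataclass


@dataclass
class Concept:
    """Hypothetical concept record: objects plus a spatial relation."""
    objects: list[str]   # e.g. ["monitor", "bicycle"]
    relation: str        # e.g. "to the left of"


def minimal_caption(c: Concept) -> str:
    """Strict template for the Minimal stream, pinned to a sterile
    white background so the image contains only the concept itself."""
    return (f"a {c.objects[0]} {c.relation} a {c.objects[1]} "
            "on a white background")


caption = minimal_caption(Concept(["monitor", "bicycle"], "to the left of"))
```

Both caption streams are then rendered with the same text-to-image model, so any downstream performance gap is attributable to the added visual context rather than the underlying concept.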

Each generated image undergoes a two-stage automatic validation. First, GroundedSAM2 checks that all objects appear with the correct count; Minimal images receive an additional background-uniformity test. Second, a VLM judge answers attribute-and-relation questions derived from the original concept; a sample passes only if every answer matches the ground truth. Human evaluation shows this automated judge agrees with experts over 94 % of the time.
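The two-stage filter amounts to the logic below. The stub classes are stand-ins invented for illustration; in the pipeline the detector role is played by GroundedSAM2 and the judge by a VLM, both with their own real APIs.

```python
class StubDetector:
    """Stand-in for an open-vocabulary detector (GroundedSAM2 in the paper)."""
    def __init__(self, found):
        self.found = found                      # object name -> detected count

    def count(self, image, obj):
        return self.found.get(obj, 0)


class StubJudge:
    """Stand-in for the VLM question-answering judge."""
    def __init__(self, answers):
        self.answers = answers                  # question -> answer

    def answer(self, image, question):
        return self.answers.get(question)


def validate(image, objects, qa_pairs, detector, judge):
    """Stage 1: every object must be detected exactly once.
    Stage 2: every attribute/relation question must match ground truth."""
    if any(detector.count(image, obj) != 1 for obj in objects):
        return False
    return all(judge.answer(image, q) == a for q, a in qa_pairs)
```

Only samples that survive both stages enter the Positive Benchmark, which keeps the generated images faithful to their captions.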

Validated image‑caption pairs form the Positive Benchmark. Separate sets are kept for Minimal (B_min) and Contextual (B_ctx) conditions, and their intersection (B_paired) enables a controlled A/B comparison of the same concept under sterile versus realistic visual contexts.
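Constructing B_paired is a simple key intersection; the sketch below assumes each benchmark is keyed by a concept identifier, which is an illustrative data layout rather than the released format.

```python
def paired_benchmark(b_min: dict, b_ctx: dict) -> dict:
    """Concepts validated under BOTH conditions form B_paired,
    pairing each concept's Minimal and Contextual records for a
    controlled A/B comparison."""
    shared = b_min.keys() & b_ctx.keys()
    return {cid: (b_min[cid], b_ctx[cid]) for cid in shared}
```

Because each entry holds the same concept in both visual conditions, any score difference on B_paired isolates the effect of visual context.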

Hard negative benchmarks are then generated programmatically. The Swap Benchmark creates all possible attribute/relationship permutations (N!‑1 negatives), while the Confusion Benchmark injects low‑entropy distractors (repeated objects or colors) to produce N·2N‑1 negatives. This systematic manipulation ensures that positive and negative samples are identical in length and linguistic complexity, differing only in the swapped conceptual elements.
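The Swap Benchmark's N! − 1 negatives follow directly from enumerating attribute permutations and dropping the original assignment. A minimal sketch (the caption template is an assumption for illustration):

```python
from itertools import permutations


def swap_negatives(objects, attributes):
    """All attribute-permutation hard negatives for one caption.

    With N objects there are N! attribute assignments; removing the
    correct one leaves the N! - 1 negatives of the Swap Benchmark.
    Each negative has identical length and wording to the positive,
    differing only in which attribute binds to which object.
    """
    original = tuple(attributes)
    negatives = []
    for perm in permutations(attributes):
        if perm == original:
            continue
        caption = " and ".join(f"a {a} {o}" for a, o in zip(perm, objects))
        negatives.append(caption)
    return negatives
```

For example, `swap_negatives(["cube", "sphere"], ["red", "blue"])` yields the single negative "a blue cube and a red sphere" for the positive "a red cube and a blue sphere".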

The authors evaluate over 20 state‑of‑the‑art VLMs, including the CLIP and SigLIP families, on two compositional tasks: Color Binding (N = 1, 2, 3) and Position Binding (N = 2, 3). Results reveal universal failures: models frequently confuse swapped attributes, especially as N grows. SigLIP models outperform CLIP overall, yet a striking trade‑off emerges: Contextual scenes improve spatial reasoning but degrade local attribute binding due to visual clutter. The Confusion Benchmark uncovers a deeper weakness: all models lose 30–50 % of their performance when low‑entropy distractors are present, indicating a limitation beyond simple bag‑of‑words effects.

Auto‑Comp’s fully automatic, concept‑driven generation and rigorous validation make it extensible to any new compositional skill. The code, datasets, and benchmarks are released on HuggingFace, providing the community with a scalable tool for diagnosing VLM compositional shortcomings and guiding the development of more robust multimodal models.

