Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens

Reading time: 2 minutes

📝 Original Info

  • Title: Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens
  • ArXiv ID: 2510.26302
  • Date: 2025-10-30
  • Authors: Not provided in the source data.

📝 Abstract

Contrastive Language-Image Pre-training (CLIP) delivers strong cross-modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations, often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena, such as prompt sensitivity and failures on hard negatives, unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token SCM. Our theory extends block identifiability to tokenized text, proving that CLIP's contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP's compositional brittleness: composition nonidentifiability. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations over atomic concepts, thereby failing to distinguish correct captions from hard negatives despite optimizing the same training objective as true-optimal encoders. The analysis further links language-side nonidentifiability to visual-side failures via the modality gap, and shows how iterated composition operators compound hardness, motivating improved negative-mining strategies.
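The SWAP, REPLACE, and ADD operations over atomic concepts can be pictured as token-level edits that turn a correct caption into a hard negative. Below is a minimal sketch of such edits; the example caption, the token indices, and the function names are illustrative choices, not definitions from the paper.

```python
def swap(tokens, i, j):
    """SWAP: exchange two atomic concepts,
    e.g. 'black dog ... white cat' -> 'white dog ... black cat'."""
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out

def replace(tokens, i, new_concept):
    """REPLACE: substitute one atomic concept with another."""
    out = list(tokens)
    out[i] = new_concept
    return out

def add(tokens, i, new_concept):
    """ADD: insert an extra atomic concept into the caption."""
    out = list(tokens)
    out.insert(i, new_concept)
    return out

caption = "a black dog beside a white cat".split()
print(" ".join(swap(caption, 1, 5)))          # a white dog beside a black cat
print(" ".join(replace(caption, 1, "brown"))) # a brown dog beside a white cat
print(" ".join(add(caption, 2, "small")))     # a black small dog beside a white cat
```

A text encoder that is insensitive to these edits, in the paper's sense, assigns (nearly) identical embeddings to the original caption and each perturbed version, so the contrastive objective cannot separate them from the matching image.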

💡 Deep Analysis

Figure 1

Reference

This content is AI-processed based on open access ArXiv data.
