Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding
Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP), memorized textual patterns from pre-training, while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding for multimodal reasoning. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representational discrepancy beyond the VIP to quantify how strongly the visual input influences response generation. Across 60 model-dataset combinations spanning 10 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges, and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
💡 Research Summary
The paper tackles the pervasive issue that large vision‑language models (LVLMs) often default to language priors (LP) learned during massive text pre‑training, neglecting visual evidence. Prior work has mainly probed LP through input‑output behavior, which cannot reveal where inside the network visual information begins to influence reasoning. To fill this gap, the authors introduce a novel analysis framework based on the “chain‑of‑embedding”—the sequence of hidden states across the model’s decoder layers. For each layer l they extract two embeddings: Zₗ^vis (from a multimodal input containing both image and text) and Zₗ^blind (from text‑only input). By measuring a distance d(Zₗ^vis, Zₗ^blind) (cosine distance by default) they define an expected representation distance Dₗ for two data distributions: vision‑dependent (P_VT) and vision‑independent (P_T).
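The per-layer comparison described above can be sketched with a small NumPy example. This is an illustrative stand-in, not the paper's implementation: the toy arrays below play the role of the last-token hidden states Zₗ^vis and Zₗ^blind collected from each decoder layer of the two forward passes, and `layerwise_distance` computes the default cosine distance dₗ at every layer.

```python
import numpy as np

def layerwise_distance(z_vis, z_blind):
    """Cosine distance d(Z_l^vis, Z_l^blind) at each decoder layer.

    z_vis, z_blind: arrays of shape (num_layers, hidden_dim), standing in
    for the hidden states from the multimodal (image + text) and
    text-only ("blind") forward passes respectively.
    """
    num = np.sum(z_vis * z_blind, axis=-1)
    denom = np.linalg.norm(z_vis, axis=-1) * np.linalg.norm(z_blind, axis=-1)
    return 1.0 - num / denom  # shape (num_layers,)

# Toy stand-ins for hidden states (synthetic data, not real model activations).
rng = np.random.default_rng(0)
L, H = 32, 64
z_blind = rng.normal(size=(L, H))
z_vis = z_blind + 0.5 * rng.normal(size=(L, H))  # perturbed "visual" states
D = layerwise_distance(z_vis, z_blind)
print(D.shape)  # one distance per layer
```

Averaging `D` over samples drawn from the vision-dependent distribution P_VT or the vision-independent distribution P_T would then give the expected distance curve Dₗ for each distribution.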
The central hypothesis is that there exists a specific layer l*—the Visual Integration Point (VIP)—where the distance between Zₗ^vis and Zₗ^blind begins to rise sharply for vision‑dependent samples while remaining near zero for vision‑independent ones. Before VIP the model processes visual features superficially; after VIP it actively integrates visual cues into its reasoning. Empirically, the authors observe a clear VIP in every examined LVLM, with its position being relatively stable across datasets but varying across model architectures (e.g., around layers 18‑20 for Qwen2.5‑VL‑7B and 20‑22 for Gemma‑3‑4B).
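Given the two distance curves, locating the VIP amounts to finding the layer where the vision-dependent curve first pulls away from the vision-independent baseline. The margin-based rule below is a hypothetical simplification for illustration; the paper's exact change-point criterion may differ.

```python
import numpy as np

def find_vip(D_vt, D_t, margin=0.05):
    """Return the first layer index l* where the vision-dependent curve
    D_vt exceeds the vision-independent baseline D_t by more than
    `margin` (an assumed threshold, chosen for illustration)."""
    gap = np.asarray(D_vt) - np.asarray(D_t)
    above = np.flatnonzero(gap > margin)
    return int(above[0]) if above.size else None

# Synthetic curves mimicking the reported behavior: the baseline stays
# near zero while the vision-dependent distance rises sharply mid-network.
layers = np.arange(32)
D_t = np.full(32, 0.01)
D_vt = np.where(layers < 18, 0.02, 0.02 + 0.1 * (layers - 17))
vip = find_vip(D_vt, D_t)
print(vip)  # → 18
```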
Building on VIP, the paper proposes the Total Visual Integration (TVI) metric, which aggregates the representation distance across all layers after VIP: TVI = Σ_{l≥l*} [Dₗ(P_VT, F) − Dₗ(P_T, F)]. TVI quantifies how much visual information actually shapes the final answer. A high TVI indicates strong visual grounding, whereas a low TVI signals heavy reliance on language priors.
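Under the same toy setup, the TVI sum is a one-liner: accumulate the gap between the two distance curves from the VIP layer onward. Again a sketch, with synthetic curves standing in for the expected distances Dₗ.

```python
import numpy as np

def total_visual_integration(D_vt, D_t, vip):
    """TVI = sum over layers l >= l* of (D_l(P_VT) - D_l(P_T))."""
    D_vt, D_t = np.asarray(D_vt), np.asarray(D_t)
    return float(np.sum(D_vt[vip:] - D_t[vip:]))

# Same synthetic curves as above: flat baseline, rise after layer 18.
layers = np.arange(32)
D_t = np.full(32, 0.01)
D_vt = np.where(layers < 18, 0.02, 0.02 + 0.1 * (layers - 17))
tvi = total_visual_integration(D_vt, D_t, vip=18)
print(tvi)
```

On this toy input a large positive TVI corresponds to strong visual grounding; a sample whose two curves nearly coincide past the VIP would score near zero, signaling reliance on the language prior.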
The authors evaluate ten contemporary LVLMs on six multimodal benchmarks (MME, MMBench, VLind‑Bench, etc.), yielding 60 model‑dataset configurations. Across all settings, VIP consistently emerges, and TVI correlates strongly with performance on vision‑heavy tasks, outperforming alternative proxies such as visual‑attention weights or output divergence scores. They also provide a theoretical interpretation of TVI in information‑theoretic terms and derive upper and lower bounds based on model capacity and modality alignment.
Overall, the study delivers a principled toolkit for diagnosing language prior in LVLMs: (1) a method to locate the internal layer where visual integration truly begins, and (2) a quantitative metric (TVI) that reflects the extent of visual grounding on a per‑sample basis. This framework enables fine‑grained, annotation‑free analysis, offers insights for model debugging and architecture design, and paves the way for future work aimed at reducing harmful language‑prior reliance in multimodal AI systems.