The Geometry of Representational Failures in Vision Language Models


Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractors. While these errors mirror human cognitive constraints, such as the “Binding Problem”, the internal mechanisms driving them in artificial systems remain poorly understood. Here, we offer a mechanistic account by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies for distilling “concept vectors”: latent directions encoding visual concepts. We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework for understanding how internal representations shape model behavior and drive visual failures.


💡 Research Summary

The paper investigates why modern open‑weight vision‑language models (VLMs) such as Qwen‑VL, InternVL, and Gemma frequently make “binding” errors in multi‑object scenes: confusing colors and shapes across objects, or hallucinating objects that are not present. The authors hypothesize that these failures stem from geometric interference in the shared high‑dimensional latent space where visual tokens are projected into the language model’s embedding stream. To test this, they develop two complementary methods for extracting “concept vectors,” linear directions that encode specific visual concepts (e.g., a particular color or shape).

The first method, supervised discrimination, trains a linear probe on synthetic multi‑object images to separate the presence versus absence of a target concept. The learned probe weight, after normalization, serves as a candidate concept direction. While effective, this approach can overfit to the training distribution and capture spurious shortcuts. The second method, centroid‑based geometric distillation, aggregates token embeddings that contain the target concept across many images, computes their mean, and orthogonalizes it against the global activation mean to isolate the pure concept signal. This yields vectors that reflect the underlying data distribution rather than label‑specific boundaries.
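The centroid‑based distillation step can be sketched in a few lines of numpy. This is an illustrative reading of the description above, not the authors' code: array shapes, the function name, and the exact orthogonalization detail are assumptions.

```python
import numpy as np

def centroid_concept_vector(concept_acts: np.ndarray,
                            all_acts: np.ndarray) -> np.ndarray:
    """Distill a concept direction from token activations.

    Hypothetical shapes: concept_acts is [n_tokens, d] for tokens that
    contain the target concept; all_acts is [m_tokens, d] drawn from the
    full data distribution.
    """
    mu_concept = concept_acts.mean(axis=0)    # centroid of concept-bearing tokens
    mu_global = all_acts.mean(axis=0)         # global activation mean
    g = mu_global / np.linalg.norm(mu_global) # unit direction of the global mean
    # Orthogonalize the concept centroid against the global-mean direction
    # to isolate the concept-specific signal.
    v = mu_concept - (mu_concept @ g) * g
    return v / np.linalg.norm(v)              # normalized concept direction
```

The resulting direction is, by construction, unit-norm and orthogonal to the global activation mean, so it captures how concept-bearing tokens deviate from the average token rather than what all tokens share.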

To bridge the two, the authors introduce a PCA‑probe: they generate N² concept vectors for all combinations of N colors and N shapes, apply principal component analysis, and retain only the 2N‑2 components that correspond to independent variation along each categorical axis (a factor with N levels spans N‑1 independent directions once the mean is removed, giving 2N‑2 in total). This regularization forces the discriminative vectors to respect the known factor structure, reducing reliance on accidental correlations.
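A minimal sketch of the PCA‑probe truncation, assuming the N² concept vectors are stacked as rows of a matrix (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def pca_probe_basis(concept_vectors: np.ndarray, n_categories: int) -> np.ndarray:
    """Keep the 2N-2 principal directions of N*N color-shape concept vectors.

    concept_vectors: [N*N, d] matrix, one row per color-shape combination.
    Returns an orthonormal [2N-2, d] basis spanning the retained components.
    """
    X = concept_vectors - concept_vectors.mean(axis=0)  # center before PCA
    # Right singular vectors of the centered matrix are the principal directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2 * n_categories - 2
    return Vt[:k]
```

Projecting probe-derived vectors onto this basis discards variance outside the two known categorical axes, which is the regularization effect the paragraph describes.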

The core experimental contribution is a causal steering intervention. Given a source concept A and a target concept B, each represented by unit vectors v̂_A and v̂_B, they modify every token h_t in the residual stream as follows:

 h′_t = h_t − (h_t · v̂_A) v̂_A + (h_t · v̂_A) v̂_B

This operation removes the component of the activation aligned with concept A and injects an equal‑magnitude component aligned with concept B. Because the scaling is derived from each token’s own projection onto v̂_A, the intervention preserves the original intensity of unrelated features and minimizes collateral disruption.
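The steering edit is a two-line vectorized operation; here is a numpy sketch applying it to all tokens at once (the function name and batch shapes are assumptions; v_a and v_b are taken to be unit vectors as in the formula):

```python
import numpy as np

def steer(h: np.ndarray, v_a: np.ndarray, v_b: np.ndarray) -> np.ndarray:
    """Steer residual-stream activations from concept A toward concept B.

    h:   [n_tokens, d] residual-stream activations.
    v_a: [d] unit vector for the source concept A.
    v_b: [d] unit vector for the target concept B.
    """
    coeff = h @ v_a  # each token's own projection onto the source direction
    # Remove the A-aligned component, inject an equal-magnitude B-aligned one.
    return h - np.outer(coeff, v_a) + np.outer(coeff, v_b)
```

When v̂_A and v̂_B are orthogonal, the steered activations have zero projection onto v̂_A, and their projection onto v̂_B grows by exactly the amount that was removed, matching the equal-magnitude substitution described above.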

The authors evaluate steering on a set of 60 real‑world images spanning six colors. For each ordered pair of colors they attempt to “erase” the ground‑truth color vector and inject the target color vector, then query the model for the reported color. Success rates differ markedly across extraction methods and models. Centroid‑based vectors achieve the highest performance (e.g., 84.7 % on Qwen, 88.4 % on InternVL, 95.7 % on Gemma), whereas supervised probes and PCA‑probes lag behind (typically 3–16 %). This demonstrates that the distilled vectors capture causally relevant internal representations, while probe‑derived vectors often reflect superficial discriminative patterns.

Beyond steering, the authors compute cosine similarities between all pairs of concept vectors and find a strong positive correlation between vector overlap and the frequency of binding errors in the original, un‑steered models. In other words, when the latent directions for “red” and “square” are not orthogonal, the model is more likely to produce an illusory conjunction such as “red square” when the scene actually contains a red circle and a blue square. This empirical link supports the central claim: geometric interference—i.e., insufficient angular separation between concept directions—drives many VLM failures.
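The overlap-versus-error analysis amounts to correlating pairwise concept-vector overlap with per-pair error frequencies. A hypothetical numpy sketch, assuming errors are tabulated in a symmetric per-pair matrix (this framing and all names are illustrative):

```python
import numpy as np

def overlap_vs_errors(vectors: np.ndarray, error_rates: np.ndarray) -> float:
    """Pearson correlation between concept overlap and binding-error frequency.

    vectors:     [K, d] concept vectors (one per concept).
    error_rates: symmetric [K, K] matrix of observed binding-error
                 frequencies for each concept pair.
    """
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos = np.abs(V @ V.T)              # pairwise |cosine similarity|
    iu = np.triu_indices(len(V), k=1)  # unique unordered pairs, no diagonal
    return float(np.corrcoef(cos[iu], error_rates[iu])[0, 1])
```

A strongly positive return value is the quantitative signature the paper reports: pairs of concepts whose directions overlap more are confused more often.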

The paper situates these findings within cognitive neuroscience, drawing parallels to the human binding problem described by Treisman and Gelade (1980). Human vision mitigates interference through serial attention, effectively allocating separate temporal slots to each object. Current VLMs lack such a mechanism, processing an entire scene in a single feed‑forward pass, which forces a compression of rich visual information into a limited token sequence. The authors label this tension the “Curse of Generalization”: the same compositional representations that enable systematic generalization also make the model vulnerable to interference when many concepts co‑occur.

In the discussion, the authors propose several avenues for future work: (1) integrating attention‑like temporal sequencing into multimodal architectures to allocate distinct sub‑spaces for each object, (2) designing training objectives that explicitly penalize cosine similarity between concept vectors, thereby encouraging orthogonal representations, and (3) extending the steering framework to more complex, hierarchical concepts (e.g., “red striped shirt”) to test the limits of linear manipulability.

Overall, the study provides a rigorous mechanistic account of VLM visual failures, introduces robust methods for extracting and validating concept vectors, and demonstrates that intervening directly in the latent geometry can both diagnose and remediate errors. By linking representational geometry to observable behavior, the work offers a concrete quantitative framework for future interpretability and robustness research in multimodal AI.

