Probing Perceptual Constancy in Large Vision-Language Models

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv source.

Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for visual understanding in a dynamic world. Here, we explore this ability in current Vision-Language Models (VLMs). In this study, we evaluated 155 VLMs using 236 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks under in-the-wild conditions. We found significant variability in VLM performance across these domains, with performance on shape constancy clearly dissociated from performance on color and size constancy.


💡 Research Summary

This paper investigates the extent to which modern Vision‑Language Models (VLMs) exhibit perceptual constancy—the human ability to maintain stable perceptions of an object’s color, size, and shape despite changes in lighting, distance, or viewpoint. Drawing inspiration from classic cognitive‑psychology experiments, the authors construct ConstancyBench, a benchmark comprising 236 tasks (both single‑image and video‑based) that probe three constancy dimensions: color, size, and shape. The dataset aggregates material from six source types, including real photographs, classic experimental images, movie clips, hand‑drawn sketches, and AI‑generated 3D shapes, ensuring a diverse, in‑the‑wild evaluation setting.

A total of 155 VLMs are evaluated in a zero‑shot regime, ranging from lightweight open‑source models to frontier systems such as GPT‑4o and Gemini 1.5 Pro. For each task the model must output a textual answer and a brief explanation, allowing the authors to assess both correctness and reasoning quality. Model size (parameter count) and architecture type are recorded for subsequent correlation analyses.

Results reveal a clear hierarchy among the three constancy domains. Shape constancy achieves the highest mean accuracy (0.723 ± 0.170), followed at a distance by color (0.588 ± 0.185) and size (0.584 ± 0.123). A one‑way ANOVA confirms a strong main effect of domain (F(2,451)=36.49, p≈2×10⁻¹⁵, η²=0.139). Post‑hoc Tukey HSD tests show that shape performance significantly exceeds both color and size (p < 0.001, Cohen’s d≈0.8), while color and size do not differ (p = 0.976). The authors interpret the superior shape scores as evidence that current VLMs can exploit simple geometric shortcuts, whereas color constancy demands high‑dimensional photometric reasoning and size constancy requires robust 3‑D world representations.
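The domain comparison above can be sketched with a one‑way ANOVA plus a manual η² computation. This is a minimal illustration on synthetic item‑level accuracies (drawn to roughly match the reported means and standard deviations; the paper's actual per‑item data and group sizes are assumptions here), not a reproduction of the authors' analysis:

```python
# Sketch: one-way ANOVA across the three constancy domains on
# SYNTHETIC item-level accuracies (group sizes and samples are
# assumptions, chosen so that df = (2, 451) as reported).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
shape = np.clip(rng.normal(0.723, 0.170, 150), 0, 1)
color = np.clip(rng.normal(0.588, 0.185, 152), 0, 1)
size  = np.clip(rng.normal(0.584, 0.123, 152), 0, 1)

# F-test for a main effect of domain
f_stat, p_value = stats.f_oneway(shape, color, size)

# Effect size: eta^2 = SS_between / SS_total
groups = [shape, color, size]
grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = sum(((g - grand_mean) ** 2).sum() for g in groups)
eta_sq = ss_between / ss_total

print(f"F={f_stat:.2f}, p={p_value:.2e}, eta^2={eta_sq:.3f}")
```

Pairwise follow-ups (the Tukey HSD tests reported in the paper) could then be run on the same three groups, e.g. via `statsmodels`' `pairwise_tukeyhsd`.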

Scaling analyses demonstrate a log‑linear relationship between model capacity and performance across all domains. Across 151 models with known parameter counts, overall accuracy scales as y = 0.1200·log₁₀(param) + 0.4766 (R² = 0.2804, p ≈ 2.7×10⁻¹²). Domain‑specific regressions show similar trends, with the strongest scaling effect for size constancy (R² = 0.3080), moderate effects for shape (R² = 0.1148) and color (R² = 0.0973). These findings suggest that larger models develop more invariant representations, but the degree of benefit varies with the perceptual challenge.
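The reported scaling law is an ordinary least‑squares fit of accuracy against log₁₀(parameter count). A minimal sketch, using illustrative (not the paper's) model sizes and accuracies:

```python
# Sketch: fit accuracy = slope * log10(params) + intercept.
# The parameter counts and accuracies below are ILLUSTRATIVE values,
# not the paper's data; only the fitting procedure is shown.
import numpy as np

params = np.array([0.5e9, 2e9, 7e9, 13e9, 34e9, 70e9, 400e9])
acc    = np.array([0.52, 0.58, 0.62, 0.65, 0.68, 0.71, 0.76])

x = np.log10(params)
slope, intercept = np.polyfit(x, acc, deg=1)

# R^2 of the linear fit
pred = slope * x + intercept
ss_res = ((acc - pred) ** 2).sum()
ss_tot = ((acc - acc.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

print(f"acc ~ {slope:.4f}*log10(params) + {intercept:.4f}, R^2={r2:.3f}")
```

On the paper's 151 models this procedure yields the reported fit y = 0.1200·log₁₀(param) + 0.4766 with R² = 0.2804.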

To uncover latent structure in task difficulty, the authors fit a two‑parameter logistic Item Response Theory (2PL‑IRT) model to all items. Shape items cluster at low difficulty (mean b = ‑1.27) and moderate discrimination (mean a = 1.33), indicating they are generally easy for most models. Color items exhibit the highest discrimination (mean a = 2.25) with moderate difficulty (mean b = ‑0.58), meaning they sharply separate stronger from weaker models but do not require extreme ability. Size items display intermediate discrimination (mean a = 1.59) and the broadest difficulty distribution (mean b = ‑0.72), reflecting heterogeneous spatial reasoning demands.
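The 2PL item response function underlying this analysis models the probability that a model of ability θ answers an item correctly as P(θ) = 1 / (1 + exp(−a(θ − b))), where b is item difficulty and a is discrimination. A small sketch using the reported per‑domain mean parameters (fitting the actual IRT model would require the full model × item response matrix, which is not reproduced here):

```python
# Sketch of the two-parameter logistic (2PL) IRT item response function.
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """P(correct | ability theta) for an item with discrimination a,
    difficulty b. Equals 0.5 exactly when theta == b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Reported mean (a, b) per domain
items = {
    "shape": (1.33, -1.27),
    "color": (2.25, -0.58),
    "size":  (1.59, -0.72),
}
for domain, (a, b) in items.items():
    print(f"{domain}: P(correct | theta=0) = {p_correct(0.0, a, b):.3f}")
```

The higher discrimination a for color items means the response curve is steeper around b, so small differences in model ability translate into large differences in expected accuracy on those items.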

The discussion highlights that while state‑of‑the‑art VLMs approach human‑level performance on shape constancy, they lag behind on color and size constancy. This gap is attributed to current multimodal pre‑training objectives that prioritize text‑image alignment over explicit modeling of illumination, depth, or 3‑D geometry. The authors argue that scaling alone will not close the gap; targeted architectural innovations—such as dedicated illumination‑invariant encoders, depth‑aware vision backbones, or joint 3‑D scene understanding—are needed.

Limitations include reliance on static prompts, absence of real‑time video reasoning, and lack of direct human performance baselines. Future work is proposed in three directions: (1) extending ConstancyBench to continuous video streams to assess temporal stability, (2) benchmarking against human participants to quantify the human‑model gap, and (3) designing training regimes that explicitly incorporate photometric and geometric transformations.

In sum, the paper provides the first large‑scale, systematic evaluation of perceptual constancy in VLMs, revealing pronounced domain‑specific performance differences, a robust scaling law, and structural insights via IRT. These contributions furnish a valuable diagnostic tool for the community and chart a roadmap toward more robust, human‑like visual perception in multimodal AI systems.

