SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, grounding relative positions, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 43 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetric and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time shows strong correlation with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs. We believe SpinBench provides critical insights into spatial reasoning in VLMs and highlights key gaps in their ability to reason about physical space. Our website can be found at https://spinbench25.github.io/.
💡 Research Summary
SpinBench is a cognitively‑inspired diagnostic benchmark that probes the spatial reasoning abilities of vision‑language models (VLMs) through the lens of perspective‑taking. The authors argue that true perspective‑taking requires a suite of sub‑skills: recognizing objects across viewpoints, grounding relative positions, and mentally simulating transformations such as translation and rotation. To isolate and measure these capabilities, SpinBench defines seven hierarchical task groups that progress from simple single‑object perception to complex multi‑object scene reasoning.
- Identity Matching tests whether a model can consistently identify the same object when presented from different viewpoints, a prerequisite for any cross‑view reasoning.
- Object‑Relation Grounding evaluates static spatial relations (left/right, front/behind, near/far) within a single image, with explicit frame‑of‑reference definitions to eliminate linguistic ambiguity.
- Dynamic Translation presents two temporally ordered frames of a single object and asks the model to infer the direction of linear displacement (left, right, front, back), isolating translational understanding from rotation.
- Dynamic Rotation similarly provides before/after images of an object that has rotated in place and requires a clockwise versus counter‑clockwise decision, defined from a top‑down perspective.
- Canonical View Selection asks the model to map a reference front view to the correct alternative perspective (left, right, back), testing the ability to retrieve canonical viewpoints without the confound of multiple objects.
- Mental Rotation gives an object, a specified rotation angle (e.g., 135°), and direction, then asks the model to select the resulting orientation among four candidates, probing internal visual simulation rather than direct observation.
- Perspective Taking is the most demanding group, split into (S) scene selection from a new viewpoint and (T) prediction of how object relations transform under that viewpoint change. This group integrates all prior sub‑skills and includes multi‑object clutter, partial occlusion, and both symmetric and syntactic augmentations.
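The seven task groups above form an ordered taxonomy, and each benchmark item pairs one or more images with a multiple-choice question. A minimal sketch of how such items might be represented — the class and field names here are illustrative, not SpinBench's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class TaskGroup(Enum):
    """The seven SpinBench task groups, ordered from simplest to most demanding."""
    IDENTITY_MATCHING = 1
    OBJECT_RELATION_GROUNDING = 2
    DYNAMIC_TRANSLATION = 3
    DYNAMIC_ROTATION = 4
    CANONICAL_VIEW_SELECTION = 5
    MENTAL_ROTATION = 6
    PERSPECTIVE_TAKING = 7

@dataclass
class Sample:
    """One benchmark item: images, a question, candidate answers, and the key."""
    group: TaskGroup
    images: list      # paths to one or more views/frames
    question: str
    choices: list
    answer: str

# Hypothetical item for the Dynamic Rotation group (content is made up):
item = Sample(
    group=TaskGroup.DYNAMIC_ROTATION,
    images=["frame_before.png", "frame_after.png"],
    question="Viewed from the top, did the object rotate clockwise or counter-clockwise?",
    choices=["clockwise", "counter-clockwise"],
    answer="clockwise",
)
```

The ordering of the enum mirrors the scaffolding described above: a model must solve identity matching before cross-view relation grounding is even well-posed, and perspective taking sits last because it integrates all prior sub-skills.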
The benchmark comprises 2,700 samples drawn from four visual domains: synthetic Infinigen tabletop scenes, household objects, vehicles, and human faces. Each sample is accompanied by variations that manipulate premise structure, symmetry (e.g., swapping left/right), and syntax (rephrasing questions) to ensure that performance reflects genuine spatial reasoning rather than shortcut learning.
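The symmetry augmentation (swapping left/right) amounts to a mirrored rephrasing whose correct answer must flip accordingly. A hypothetical helper showing the idea, not the authors' actual generation code:

```python
import re

def mirror_left_right(text: str) -> str:
    """Swap every whole-word occurrence of 'left' and 'right',
    producing the mirrored variant used to probe answer consistency."""
    # A placeholder token prevents double-swapping ('left' -> 'right' -> 'left').
    return (re.sub(r"\bleft\b", "\0", text)
              .replace("right", "left")
              .replace("\0", "right"))

q = "Is the mug to the left of the laptop?"
mirrored = mirror_left_right(q)  # "Is the mug to the right of the laptop?"
# A model that truly grounds the relation should flip its answer
# on the mirrored variant; answering identically reveals a shortcut.
```

Syntactic augmentations work analogously but rephrase the question without changing its meaning, so the answer should stay the same; comparing behavior across both kinds of variant separates genuine spatial grounding from pattern matching.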
The authors evaluated 43 state‑of‑the‑art VLMs, including both open‑source and proprietary models. Key findings include:
- Egocentric bias – models overwhelmingly answer from their own viewpoint, leading to systematic errors when the correct answer requires an allocentric perspective.
- Rotational weakness – across all models, accuracy on dynamic rotation and mental rotation tasks hovers below 60%, and performance collapses to near chance under symmetric augmentations.
- Scaling behavior – while larger models and more training data yield smooth improvements on simpler tasks, emergent jumps in capability appear only after a certain scale is reached for the high‑level perspective‑taking tasks, suggesting a threshold for internal 3D reasoning.
- Human correlation – a human baseline achieved 91.2% accuracy. Human response times correlated strongly (r≈0.78) with VLM accuracy across tasks, indicating that SpinBench captures difficulty dimensions shared by humans and machines.
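The human-time/model-accuracy relationship is a plain per-task Pearson correlation. A sketch on synthetic numbers (the real per-task values are in the paper and are not reproduced here; note the sign depends on whether accuracy or error rate is correlated with response time):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic illustration: tasks on which humans take longer (seconds)
# tend to be the ones where VLM accuracy drops.
human_rt     = [2.1, 2.8, 3.5, 4.9, 6.2, 7.0, 8.4]
vlm_accuracy = [0.92, 0.88, 0.81, 0.66, 0.55, 0.49, 0.41]
r = pearson_r(human_rt, vlm_accuracy)  # strongly negative on this toy data
```

A shared difficulty axis of this kind is the evidence behind the claim that SpinBench measures something cognitively meaningful rather than model-specific quirks.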
Compared to prior benchmarks such as CLEVR, MindCube, and SPHERE, SpinBench uniquely isolates viewpoint transformation and rotation, provides controlled variations of premise and symmetry, and spans both synthetic and real‑world domains. The analysis reveals that current VLMs rely heavily on language‑visual pattern matching and lack robust internal representations of 3D geometry. The authors suggest future work should incorporate explicit 3D scene representations (e.g., NeRF, voxel grids), multimodal simulation environments, and reinforcement‑learning curricula focused on rotation and viewpoint invariance.
In summary, SpinBench offers a rigorous, cognitively grounded tool for diagnosing spatial reasoning gaps in modern VLMs, highlighting egocentric bias, rotational deficits, and inconsistency under symmetry. By exposing these weaknesses, the benchmark paves the way for targeted architectural and training innovations aimed at endowing VLMs with genuine 3D spatial intelligence.