SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models

Reading time: 5 minutes
...

📝 Original Info

  • Title: SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
  • ArXiv ID: 2511.23075
  • Date: 2025-11-28
  • Authors: Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, Zizhuang Wei

📝 Abstract

Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.

💡 Deep Analysis

Figure 1
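
The abstract spells out CGMF as three operations applied before the language model: a camera-conditioned bias on the spatial tokens, a query-independent importance weight per spatial token, and a camera-conditioned gate on the fused representation. The paper's exact formulation is not reproduced in this summary, so the PyTorch sketch below is only a minimal illustration of that recipe; the class name, layer choices, and tensor shapes are all assumptions.

```python
import torch
import torch.nn as nn


class CameraGuidedFusionSketch(nn.Module):
    """Illustrative CGMF-style fusion block (hypothetical, not the paper's code).

    vis_tokens: (B, N, D) tokens from the 2D visual encoder (e.g. InternViT)
    spa_tokens: (B, N, D) geometry-aware tokens from the spatial encoder (e.g. VGGT)
    cam_embed:  (B, D)    camera / viewpoint embedding
    """

    def __init__(self, dim: int):
        super().__init__()
        self.cam_bias = nn.Linear(dim, dim)                                # camera-conditioned bias
        self.importance = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())   # query-independent weights
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())       # camera-conditioned gate
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis_tokens, spa_tokens, cam_embed):
        # 1) Bias spatial tokens with the camera embedding.
        spa_tokens = spa_tokens + self.cam_bias(cam_embed).unsqueeze(1)
        # 2) Scale each spatial token by its geometric importance (independent of any query).
        spa_tokens = spa_tokens * self.importance(spa_tokens)
        # 3) Fuse with the 2D visual tokens and gate the result with the camera embedding.
        fused = self.fuse(torch.cat([vis_tokens, spa_tokens], dim=-1))
        return fused * self.gate(cam_embed).unsqueeze(1)                   # fed to the LLM as visual input
```

Whatever the exact layers, the intent described in the paper is that the camera embedding acts on the fusion itself, through biasing and gating, instead of being appended as passive metadata alongside the other tokens.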

📄 Full Content

Humans perceive space not merely by seeing, but by understanding from where they see. This coupling between visual observation and viewpoint underlies our ability to estimate distances, compare sizes, infer connectivity, and navigate unfamiliar environments. Endowing multimodal models with such spatial intelligence [68] is crucial for a wide range of applications. However, despite remarkable progress in large language models (LLMs) [6,7,29,46,48,49,56,57,63,76] and multimodal large language models (MLLMs) [1-3,9,20,21,27,35,40] on open-ended reasoning, instruction following, and long-context understanding, recent studies consistently show that they struggle with explicitly 3D-aware tasks, such as metric comparison, layout inference, and multi-view consistency.

Existing approaches to spatially grounded multimodal reasoning can be roughly grouped into two families. The first family augments language or vision-language models with explicit 3D inputs, such as point clouds, depth maps, reconstructed meshes, or BEV/voxel features [14,26,42,44,61,71,77,78,81,82]. These methods leverage depth sensors or multi-view reconstruction pipelines to build 3D scene representations, which are then encoded and aligned with language. They can capture rich geometric details, but inherit several structural limitations: reliance on specialized hardware or pre-scanned environments, heavy multi-stage processing, sensitivity to reconstruction failures and scale ambiguity, and difficulty scaling to unconstrained in-the-wild video. The second family, exemplified by recent 3D-aware MLLMs, targets visual-based spatial intelligence from monocular or multi-view RGB [16,65]. They typically combine a general-purpose visual encoder [3,72] with a feed-forward geometry encoder [58,59] and then fuse the resulting features via simple mechanisms such as MLP projection or one-stage cross-attention. These designs avoid explicit 3D supervision and are more scalable, but the way they integrate spatial cues remains largely ad hoc.
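
The "shallow feature fusion" described here typically amounts to concatenating the two encoders' tokens and pushing them through an MLP projection before the language model. The snippet below is a hypothetical minimal version of that baseline, not code from any cited system; names and shapes are assumptions.

```python
import torch
import torch.nn as nn


class ShallowFusionBaseline(nn.Module):
    """Concatenate semantic and geometry tokens, then project with an MLP."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, vis_tokens, spa_tokens):
        # (B, N, D) + (B, N, D) -> (B, N, 2D) -> (B, N, D)
        # Camera/viewpoint information plays no explicit role in this fusion.
        return self.proj(torch.cat([vis_tokens, spa_tokens], dim=-1))
```

In such a scheme the camera, when used at all, is just another input blended into the same feature space, which is exactly the conflation the next paragraph takes issue with.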

A closer look at these two lines of work reveals a common limitation. Explicit-3D methods over-commit to precomputed geometry and do not address how viewpoint should interact with language-native reasoning. Visual-based methods, on the other hand, tend to treat all geometric signals (image features, spatial features, and camera information) as if they belonged to a single homogeneous feature space. In practice, camera-related signals are often appended as auxiliary embeddings or implicitly blended into fusion layers. This conflation ignores a principle long emphasized in 3D vision [28,36,51,58,74]: camera (ego/viewpoint) and scene (allocentric content) play fundamentally different roles.

Motivated by this insight, we present SpaceMind, a large vision-language model explicitly designed for 3D spatial reasoning from RGB inputs. SpaceMind treats the camera representation as a dedicated modality that guides how spatial information is injected into the visual stream, rather than as a passive conditioning vector. Concretely, SpaceMind uses a visual encoder [9] and a spatial encoder [58] to obtain semantic and geometry-aware tokens, and introduces a Camera-Guided Modality Fusion (CGMF) module before the language model. CGMF applies camera-conditioned biasing over spatial tokens, a query-independent spatial importance weighting, and camera-conditioned gating on fused features, making the role of the camera explicit in the fusion process while preserving the interface of standard VLMs. This design is intentionally simple, built from standard components, but aligns viewpoint, spatial cues, and semantics within a unified multimodal framework. Empirically, SpaceMind achieves a new state of the art on VSI-Bench [68], with a clear margin over existing open and proprietary systems, and delivers comparable or superior performance on other spatial reasoning benchmarks such as SQA3D [41] and SPBench [55], indicating strong generalization across datasets and task formats. In summary, our contributions are four-fold:

Multimodal large language models extend large language models to vision and other modalities by aligning visual and textual tokens in a shared embedding space. CLIP [50] and ALIGN [1] learn joint image-text representations via contrastive pretraining, enabling strong zero-shot transfer. Flamingo [1] and BLIP-2 [35] introduce token-level or feature-level fusion modules (e.g., Q-Former, cross-attention bridges) to couple frozen LLM backbones with vision encoders, while instruction-tuned systems such as the LLaVA family [32,33,38,39,75], MiniGPT-4 [79], Qwen-VL [3], and InternVL [9] further enhance open-ended visual QA. Recent video MLLMs [37,43,54,64,73] extend these designs to temporal inputs through spatiotemporal pooling, causal attention over frame tokens, or long-context architectures, and achieve strong performance on video understanding benchmarks.

Despite their success, these models are mostly optimized for semantic and

📸 Image Gallery

  • cgmf.png
  • pipeline.png
  • radar.png

Reference

This content is AI-processed based on open access ArXiv data.
