VISOR: VIsual Spatial Object Reasoning for Language-driven Object Navigation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Language-driven object navigation requires agents to interpret natural language descriptions of target objects, which combine intrinsic and extrinsic attributes for instance recognition and commonsense navigation. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines with large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy. To this end, we propose a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning for both object recognition and action selection, removing the need for stitched multi-model pipelines. Instead of raw embedding matching, our agent employs explicit image-grounded reasoning to directly answer “Is this the target object?” and “Why should I take this action?” The reasoning process unfolds in three stages: “think”, “think summary”, and “action”, yielding improved explainability, stronger generalization, and more efficient navigation. Code and dataset available upon acceptance.


💡 Research Summary

VISOR (Visual Spatial Object Reasoning) tackles the language‑driven object navigation problem by unifying perception, reasoning, and action selection within a single 3‑billion‑parameter Vision‑Language‑Action (VLA) model. Traditional approaches fall into two camps: (i) end‑to‑end policy networks that map vision‑language embeddings directly to low‑level actions, which suffer from poor generalization to unseen environments and lack explainability, and (ii) modular zero‑shot pipelines that stitch together large language models (LLMs), vision‑language models (VLMs), and open‑set object detectors. While the latter provide richer reasoning, they incur high computational cost, error propagation across components, and difficulty integrating reasoning back into the navigation policy.

VISOR adopts the CURE design principles—Compact (≈3 B parameters), Unified (single model), Reasoning‑capable (spatial reasoning from multiple observations), and Explainable (producing textual rationales). The model is built on the pre‑trained Qwen‑2.5‑VL (3 B) backbone and receives three inputs at each timestep: a panoramic RGB observation (768 × 256) composed of three 90°‑FOV cameras, an online‑built top‑down map (256 × 256), and the natural‑language instruction describing the target object. Using depth information, the system projects each pixel into world coordinates, discards invalid points, and clusters the remaining positions with DBSCAN to generate a set of candidate waypoints. Each waypoint is overlaid on the panoramic image and assigned a random alphabetic label (e.g., A, B, C) to avoid over‑fitting to specific names.
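The projection-and-clustering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the camera intrinsics, depth bounds, and DBSCAN hyperparameters are assumed values, and the camera frame stands in for world coordinates (the paper's agent pose transform is omitted).

```python
import numpy as np
from sklearn.cluster import DBSCAN

def candidate_waypoints(depth, fx, fy, cx, cy, max_depth=10.0,
                        eps=0.5, min_samples=20):
    """Project depth pixels into 3-D coordinates, discard invalid points,
    and cluster the remaining positions into candidate waypoints.
    Intrinsics (fx, fy, cx, cy) and DBSCAN settings are illustrative."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    valid = (z > 0.1) & (z < max_depth)           # drop invalid/far points
    x = (u - cx) * z / fx                         # pinhole back-projection
    pts = np.stack([x[valid], z[valid]], axis=1)  # top-down (x, z) plane
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    # One waypoint per cluster: its centroid (DBSCAN noise label -1 skipped).
    return np.array([pts[labels == k].mean(axis=0)
                     for k in sorted(set(labels)) if k != -1])
```

Each resulting centroid would then be drawn onto the panorama and given a randomly shuffled letter label before being shown to the model.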

The reasoning process is explicitly structured into three textual stages:

  1. Think – a chain‑of‑thought (CoT) style narrative that answers “Is this the target object?” and “Why is this waypoint promising?” by grounding the answer in the visual observation, the map, and the instruction.
  2. Think summary – a concise bullet‑point rationale that distills the key factors driving the decision.
  3. Action – the selection of the most plausible waypoint label.
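Because the three stages are emitted as plain text, downstream code has to recover them from the model's response. The sketch below assumes XML-style delimiters (`<think>`, `<summary>`, `<action>`); the paper's exact tag names are not specified here, so treat these as hypothetical.

```python
import re

# Hypothetical stage delimiters -- the paper's exact format may differ.
STAGES = ("think", "summary", "action")

def parse_stages(text):
    """Split a model response into its three reasoning stages,
    assuming each stage is wrapped in XML-style tags."""
    out = {}
    for tag in STAGES:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        out[tag] = m.group(1).strip() if m else None
    return out

reply = ("<think>The red chair is near waypoint B.</think>"
         "<summary>- target visible\n- B is closest</summary>"
         "<action>B</action>")
```

Only the final `action` field feeds the planner; the other two stages exist for explainability and supervision.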

The chosen waypoint is then handed to Habitat’s shortest‑path planner, which converts the high‑level decision into low‑level motor commands (forward, turn left/right, stop). This decoupling allows the model to focus on strategic reasoning while relying on a reliable planner for execution.
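To make the decoupling concrete, a toy greedy controller is sketched below. Note this is only an illustration: Habitat's actual planner follows the geodesic shortest path on the scene's navigation mesh, whereas this straight-line version just shows how a chosen 2-D waypoint reduces to discrete forward/turn/stop commands.

```python
import math

def step_toward(pos, heading, goal, turn=math.radians(30), stop_radius=0.25):
    """Toy greedy controller: turn until roughly facing the waypoint,
    then move forward; stop when within stop_radius of it.
    Habitat's navmesh shortest-path planner replaces this in practice."""
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    if math.hypot(dx, dy) < stop_radius:
        return "stop"
    bearing = math.atan2(dy, dx)
    # Wrap the heading error into (-pi, pi].
    err = (bearing - heading + math.pi) % (2 * math.pi) - math.pi
    if abs(err) > turn / 2:
        return "turn_left" if err > 0 else "turn_right"
    return "forward"
```

Because execution is delegated to the planner, the VLA model never has to reason about collision geometry, only about which waypoint to commit to.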

To train VISOR, the authors introduce WAYPOINT‑Bench, a novel dataset derived from GOAT‑Bench. Each sample contains: (i) a detailed natural‑language description of the target object (including intrinsic attributes like color and shape, and extrinsic spatial relations), (ii) the top‑down map of the environment, (iii) the panoramic RGB view, (iv) a set of waypoint candidates, (v) the ground‑truth waypoint (the one minimizing geodesic distance to the target), and (vi) a reasoning trace generated by GPT‑4o using a CoT prompt. The dataset comprises 36,170 training instances and 3,047 validation instances, with an average of ~4 candidate actions per step.
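The six fields of a WAYPOINT-Bench sample can be pictured as a simple record. The field names below are assumptions inferred from the description, not the released format.

```python
from dataclasses import dataclass

@dataclass
class WaypointSample:
    """Illustrative schema for one WAYPOINT-Bench instance; field names
    are guesses based on the summary above, not the actual release."""
    instruction: str     # natural-language target description
    map_path: str        # top-down map image (256 x 256)
    panorama_path: str   # panoramic RGB view (768 x 256)
    waypoints: dict      # label -> (x, z) position, e.g. {"A": (1.2, 0.4)}
    gt_label: str        # waypoint minimizing geodesic distance to target
    reasoning: str       # GPT-4o chain-of-thought trace

sample = WaypointSample(
    instruction="the blue armchair next to the bookshelf",
    map_path="maps/scene01/step03.png",
    panorama_path="pano/scene01/step03.png",
    waypoints={"A": (1.2, 0.4), "B": (-0.8, 2.1)},
    gt_label="B",
    reasoning="The armchair is visible on the left side of the panorama...",
)
```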

Training proceeds in two phases:

  • Supervised Fine‑Tuning (SFT): The model is teacher‑forced to generate the exact think, think‑summary, and action token sequences given the multimodal prompt. Random alphabetic labels prevent memorization of specific waypoint names. Cross‑entropy loss is used to align the model’s output distribution with the ground‑truth sequence.
  • Reinforcement Learning (RL) Post‑Training: Using Group Sequence Policy Optimization (GSPO), the model is further refined to maximize a reward proportional to the reduction in geodesic distance to the target after each action. The group‑based advantage normalizes rewards across sampled action sequences, encouraging consistent improvement without additional labeled data.
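The RL reward and group-based advantage described above can be sketched numerically. This is a simplified illustration of the GSPO-style normalization under stated assumptions (unit reward scale, a small epsilon for stability); it is not the authors' exact objective.

```python
import numpy as np

def distance_reward(d_before, d_after):
    """Reward proportional to the reduction in geodesic distance to the
    target after an action (unit scale assumed)."""
    return d_before - d_after

def group_advantage(rewards):
    """GSPO-style group-based advantage: normalize each sampled sequence's
    reward by the mean and standard deviation of its sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

Intuitively, a sequence is only reinforced relative to its siblings: moving 1.5 m closer to the goal earns a positive advantage only if the other sampled sequences in the group did worse.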

Empirical results show that VISOR outperforms prior end‑to‑end policies on standard navigation metrics such as Success Rate and SPL, especially in unseen test environments where its reasoning‑driven approach yields better generalization. Moreover, the generated think and think‑summary texts provide human‑readable explanations, addressing the transparency gap of earlier methods.

The paper also discusses limitations: the 3 B model, while compact, still struggles with fine‑grained attribute discrimination in cluttered scenes, and depth information is only used for waypoint candidate generation rather than full 3‑D scene understanding. Future work is suggested to explore larger multimodal models, tighter integration of depth/point‑cloud data, and real‑world robot deployments with on‑device inference constraints.

In summary, VISOR demonstrates that a single, moderately sized VLA model can perform explicit, explainable spatial reasoning for language‑driven object navigation, eliminating the need for cumbersome multi‑model pipelines while achieving stronger generalization and interpretability.

