Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning
While large vision-language models (VLMs) show promise for object goal navigation, current methods still struggle with low success rates and inefficient localization of unseen objects; these failures are primarily attributed to weak temporal-spatial reasoning. Meanwhile, recent attempts to inject reasoning into VLM-based agents improve success rates but incur substantial computational overhead. To address both the ineffectiveness and inefficiency of existing approaches, we introduce Hydra-Nav, a unified VLM architecture that adaptively switches between a deliberative slow system for analyzing exploration history and formulating high-level plans, and a reactive fast system for efficient execution. We train Hydra-Nav through a three-stage curriculum: (i) spatial-action alignment to strengthen trajectory planning, (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long-horizon exploration, and (iii) iterative rejection fine-tuning to enable selective reasoning at critical decision points. Extensive experiments demonstrate that Hydra-Nav achieves state-of-the-art performance on the HM3D, MP3D, and OVON benchmarks, outperforming the second-best methods by 11.1%, 17.4%, and 21.2%, respectively. Furthermore, we introduce SOT (Success weighted by Operation Time), a new metric to measure search efficiency across VLMs with varying reasoning intensity. Results show that adaptive reasoning significantly enhances search efficiency over fixed-frequency baselines.
💡 Research Summary
Hydra‑Nav tackles the long‑standing challenges of object‑goal navigation with large vision‑language models (VLMs): weak temporal‑spatial reasoning and high inference cost. The core contribution is a unified VLM architecture that embeds both a deliberative “slow” system and a reactive “fast” system within a single model and learns to switch between them adaptively.
The slow system receives the natural‑language goal, a panoramic view (four 90° RGB images), and a structured long‑term memory graph composed of text‑image landmark nodes. It first summarizes past observations, analyzes the current view, and then generates a high‑level plan expressed as chain‑of‑thought (CoT) reasoning text. This reasoning is immediately followed by a meta‑action (e.g., MoveAhead 0.25 m) that is handed to the fast system.
The fast system operates with KV‑caching, encoding only the latest ego‑centric frame while autoregressively decoding low‑level motor primitives (MoveAhead, TurnLeft/Right, etc.). This design avoids re‑processing the entire history at every step, dramatically reducing computation. The transition between the two systems is triggered by a special “obs” token: when the high‑level plan is completed or the current view invalidates it, the model emits “obs”, initiates a new panoramic scan, creates a new landmark node, and updates the memory graph. Memory length is capped at ten landmarks by preserving the start and end nodes and uniformly sampling the intermediate ones.
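The capping strategy above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's code: the function name and the exact way intermediate landmarks are evenly sampled are assumptions, but it preserves the stated invariants (at most ten landmarks, start and end always kept).

```python
def prune_landmarks(nodes, cap=10):
    """Cap the memory graph at `cap` landmarks (sketch): always keep the
    start and end nodes, and pick evenly spaced intermediate landmarks.
    The paper's exact sampling scheme may differ."""
    if len(nodes) <= cap:
        return list(nodes)
    middle = nodes[1:-1]
    k = cap - 2  # slots remaining after reserving start and end
    # evenly spaced indices into the intermediate landmarks
    idx = [round(i * (len(middle) - 1) / (k - 1)) for i in range(k)]
    return [nodes[0]] + [middle[i] for i in idx] + [nodes[-1]]
```

For example, pruning a 25-landmark trajectory keeps the first and last nodes and eight roughly equidistant intermediate ones.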
Training proceeds in three curriculum stages. Stage 1 (Spatial‑Action Alignment) generates 500 K trajectories using an A* planner on HM3D, MP3D, and OVON training splits, teaching the base VLM (Qwen2.5‑VL‑7B) to produce collision‑free motor sequences. Stage 2 (Reasoning‑Memory Integration) introduces exploration waypoints and synthesizes CoT reasoning via a larger LLM (Qwen3‑VL‑235B‑Thinking). The pipeline prompts the LLM to review past images, summarize progress, and plan future moves while explicitly hiding the future view from the output; a verifier LLM filters any leakage. Stage 3 (Iterative Rejection Fine‑Tuning, IRFT) automatically detects “stagnation points” during policy roll‑outs, forces a slow‑system invocation at those moments, and applies rejection sampling to discard inefficient reasoning traces. This teaches the agent to invoke costly reasoning only when it truly benefits navigation.
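The rejection-sampling step in Stage 3 can be illustrated with a minimal sketch. The acceptance rule below (keep only successful rollouts whose step count is at or below the median of successful rollouts) is an assumption chosen to convey the idea of discarding inefficient reasoning traces; the paper's actual criterion may be more elaborate.

```python
import statistics

def rejection_filter(traces):
    """Sketch of the IRFT rejection-sampling filter (illustrative):
    among rollouts that reached the goal, keep only the efficient ones,
    here defined as step counts at or below the median."""
    successful = [t for t in traces if t["success"]]
    if not successful:
        return []
    cutoff = statistics.median(t["steps"] for t in successful)
    return [t for t in successful if t["steps"] <= cutoff]
```

The surviving traces would then serve as fine-tuning targets, so the agent learns to invoke slow-system reasoning only where it actually shortens the search.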
Evaluation on three benchmark suites shows substantial gains: Hydra‑Nav‑IRFT outperforms the previous best by 11.1 % on HM3D, 17.4 % on MP3D, and 21.2 % on OVON. The authors also propose SOT (Success weighted by Operation Time), a metric that jointly measures success and inference latency. Hydra‑Nav‑IRFT achieves markedly higher SOT scores than fixed‑frequency reasoning baselines, confirming that adaptive reasoning preserves success while cutting computation.
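Since SOT is described as jointly measuring success and inference latency, a plausible form mirrors SPL with operation time in place of path length. The formula below is an assumption for illustration only; consult the paper for the actual definition.

```python
def sot(successes, op_times, ref_times):
    """Assumed SPL-style form of Success weighted by Operation Time:
        SOT = (1/N) * sum_i S_i * T*_i / max(T_i, T*_i)
    where S_i is episode success (0/1), T_i the agent's operation time,
    and T*_i a reference (e.g. minimal) operation time.
    This is a hypothetical reconstruction, not the paper's formula."""
    pairs = zip(successes, op_times, ref_times)
    return sum(s * t_ref / max(t, t_ref) for s, t, t_ref in pairs) / len(successes)
```

Under this form, an agent that succeeds but spends twice the reference time scores 0.5 on that episode, so heavy fixed-frequency reasoning is penalized even when it succeeds.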
Limitations include reliance on synthetic trajectories and CoT texts generated in simulation, which may not transfer perfectly to real‑world robots; a simple uniform memory‑pruning strategy that could discard critical landmarks; the heavy computational footprint of the underlying 7 B‑parameter VLM, potentially prohibitive for edge devices; and the SOT metric’s focus on operation time without accounting for power consumption or battery constraints. Future work could explore domain adaptation, more sophisticated memory management, and lightweight VLM variants.
Overall, Hydra‑Nav presents the first integrated dual‑process VLM that learns when to think and when to act, delivering state‑of‑the‑art navigation performance and efficiency. Its curriculum‑driven training and IRFT fine‑tuning provide a compelling blueprint for building adaptive, reasoning‑aware embodied agents in robotics and beyond.