StepNav: Structured Trajectory Priors for Efficient and Multimodal Visual Navigation
Visual navigation is fundamental to autonomous systems, yet generating reliable trajectories in cluttered and uncertain environments remains a core challenge. Recent generative models promise end-to-end synthesis, but their reliance on unstructured noise priors often yields unsafe, inefficient, or unimodal plans that cannot meet real-time requirements. We propose StepNav, a novel framework that bridges this gap by introducing structured, multimodal trajectory priors derived from variational principles. StepNav first learns a geometry-aware success probability field to identify all feasible navigation corridors. These corridors are then used to construct an explicit, multi-modal mixture prior that initializes a conditional flow-matching process. This refinement is formulated as an optimal control problem with explicit smoothness and safety regularization. By replacing unstructured noise with physically-grounded candidates, StepNav generates safer and more efficient plans in significantly fewer steps. Experiments in both simulation and real-world benchmarks demonstrate consistent improvements in robustness, efficiency, and safety over state-of-the-art generative planners, advancing reliable trajectory generation for practical autonomous navigation. The code has been released at https://github.com/LuoXubo/StepNav.
💡 Research Summary
StepNav tackles the core inefficiency of current end‑to‑end visual navigation generators, which rely on an unstructured Gaussian noise prior. The framework consists of three tightly coupled stages. First, raw RGB observations and a goal image are encoded with a pretrained V‑JEPA2 backbone; temporal coherence is enforced by solving a projection problem that balances fidelity to the original features, smoothness via a temporal Laplacian, and alignment with a global motion context. This yields temporally smoothed embeddings that are robust to visual flicker.
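The temporal-coherence projection described above can be sketched as a quadratic problem with a closed-form solution. The function name, the choice of the mean feature as the "global motion context," and the weights `lam`, `mu` are illustrative assumptions, not the paper's actual values:

```python
import numpy as np

def smooth_embeddings(Z, lam=1.0, mu=0.1):
    """Temporally smooth frame embeddings Z (T x d) by minimizing
    fidelity + temporal-Laplacian smoothness + alignment with a
    global motion context (here approximated by the mean feature).
    lam, mu are illustrative weights, not the paper's values."""
    T = Z.shape[0]
    # Path-graph temporal Laplacian L = D - A over consecutive frames.
    L = np.zeros((T, T))
    for t in range(T - 1):
        L[t, t] += 1.0
        L[t + 1, t + 1] += 1.0
        L[t, t + 1] -= 1.0
        L[t + 1, t] -= 1.0
    g = Z.mean(axis=0, keepdims=True)        # global motion context (proxy)
    A = (1.0 + mu) * np.eye(T) + lam * L     # normal-equation matrix
    return np.linalg.solve(A, Z + mu * g)    # closed-form minimizer
```

Because the objective is quadratic, the smoothed embeddings come from a single linear solve; with `lam = mu = 0` the projection reduces to the identity, and increasing `lam` damps frame-to-frame flicker.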
Second, a lightweight convolutional decoder predicts a continuous success‑probability field F(x) over the workspace. The field is trained as the minimizer of a variational energy that combines a data term (binary labels from expert demonstrations) with first‑ and second‑order gradient regularizers, leading to a biharmonic‑regularized Poisson PDE. The resulting field encodes the inverse of collision cost and naturally forms wide, smooth corridors. An energy functional E(τ) = ∫ −log(F(τ(t)) + δ) ‖τ̇(t)‖ dt measures the difficulty of traversing a candidate path. By discretizing the environment into a graph and applying a K‑shortest‑path algorithm, the method extracts a set of low‑energy trajectories. To preserve multimodality, a greedy max‑min Hausdorff selection picks a diverse subset, which is then turned into a mixture prior p_prior with weights proportional to a composite score of success probability, length, and curvature.
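A minimal sketch of two ingredients from this stage: the discretized path energy E(τ) ≈ Σ −log(F(xᵢ) + δ) ‖xᵢ₊₁ − xᵢ‖ evaluated on a polyline, and the greedy max‑min Hausdorff selection over candidate paths. The function signatures, the midpoint quadrature, and the seeding of the greedy loop with the first (lowest-energy) candidate are assumptions for illustration:

```python
import numpy as np

def path_energy(path, F, delta=1e-3):
    """Discretize E(tau) = ∫ -log(F(tau) + delta) ||tau_dot|| dt on a
    polyline. `F` maps a point to a success probability in [0, 1]."""
    e = 0.0
    for a, b in zip(path[:-1], path[1:]):
        a, b = np.asarray(a, float), np.asarray(b, float)
        mid = 0.5 * (a + b)                      # midpoint quadrature
        e += -np.log(F(mid) + delta) * np.linalg.norm(b - a)
    return e

def hausdorff(p, q):
    """Symmetric Hausdorff distance between two polylines (point sets)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def select_diverse(paths, k):
    """Greedy max-min selection: start from the first (assumed
    lowest-energy) candidate, then repeatedly add the path farthest,
    in Hausdorff distance, from everything selected so far."""
    chosen = [0]
    while len(chosen) < min(k, len(paths)):
        best, best_d = None, -1.0
        for i in range(len(paths)):
            if i in chosen:
                continue
            d = min(hausdorff(paths[i], paths[j]) for j in chosen)
            if d > best_d:
                best, best_d = i, d
        chosen.append(best)
    return chosen
```

The max‑min rule is what preserves multimodality: two near-duplicate low-energy paths through the same corridor contribute almost nothing to the minimum distance, so the next pick is forced into a genuinely different corridor.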
Third, the mixture prior serves as the initialization for a conditional flow‑matching model vθ. Standard flow‑matching aligns the learned vector field with a stochastic interpolant, but StepNav augments the loss with explicit smoothness (finite‑difference jerk penalty) and safety (log‑barrier based on distance to obstacles estimated from F). The resulting regularized objective directly optimizes for physically plausible trajectories. Because the prior already respects feasibility and multimodality, only a few integration steps (typically five) are needed at inference time, enabling real‑time generation on embedded hardware.
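The two regularizers added to the flow-matching objective can be sketched as follows. This is a numpy illustration of the loss terms only (not a training loop); the weights, the `margin` clearance threshold, and the `dist_fn` obstacle-distance proxy derived from F are all assumptions:

```python
import numpy as np

def jerk_penalty(traj, dt=1.0):
    """Finite-difference jerk (third derivative) penalty for a
    trajectory of shape (T, 2)."""
    jerk = np.diff(traj, n=3, axis=0) / dt**3
    return np.mean(jerk ** 2)

def safety_barrier(traj, dist_fn, margin=0.5):
    """Log-barrier on estimated obstacle clearance. `dist_fn` stands
    in for a distance proxy derived from the success field F; it is
    active only when clearance falls below `margin`."""
    d = np.clip(np.array([dist_fn(p) for p in traj]), 1e-6, None)
    return np.mean(np.where(d < margin, -np.log(d / margin), 0.0))

def regularized_fm_loss(v_pred, v_target, traj, dist_fn,
                        w_smooth=0.1, w_safe=0.1):
    """Flow-matching MSE plus smoothness and safety regularizers.
    Weights are illustrative, not the paper's values."""
    fm = np.mean((v_pred - v_target) ** 2)
    return (fm + w_smooth * jerk_penalty(traj)
            + w_safe * safety_barrier(traj, dist_fn))
```

On a straight trajectory with ample clearance both regularizers vanish, so the loss reduces to the standard flow-matching term; near obstacles or on jerky paths the penalties grow and steer the learned vector field toward smooth, safe refinements of the prior.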
Extensive experiments on over 1,400 scenes, including indoor Stanford 2D‑3D‑S and outdoor Gazebo citysim benchmarks, demonstrate that StepNav outperforms strong baselines (V‑iNT, NoMaD, NaviBridger, FlowNav) across four metrics: success rate, SPL, collision rate, and minimum snap. In zero‑shot transfer to unseen environments, the gap widens, confirming the robustness of the geometry‑aware prior. Real‑world deployment on a Clearpath Jackal equipped with an NVIDIA Jetson Orin‑X validates that the full pipeline runs under 30 ms per decision, satisfying real‑time constraints.
The paper’s contributions are threefold: (1) a variationally learned success‑probability field that captures navigation feasibility; (2) a structured, multimodal mixture prior extracted directly from this field; and (3) a regularized conditional flow‑matching refinement that enforces smoothness and safety. Limitations include dependence on labeled success maps for field training and potential computational overhead in highly complex 3D scenes. Future work may explore self‑supervised field estimation and more scalable graph constructions, broadening applicability to diverse robotic platforms.