VLS: Steering Pretrained Robot Policies via Vision-Language Models

Notice: This research summary and analysis were generated automatically with AI. For accuracy, please refer to the original arXiv source.

Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train-test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time. We propose Vision-Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation-language inputs without modifying policy parameters. By leveraging vision-language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements. Across simulation and real-world evaluations, VLS consistently outperforms prior steering methods, achieving a 31% improvement on CALVIN and a 13% gain on LIBERO-PRO. Real-world deployment on a Franka robot further demonstrates robust inference-time adaptation under test-time spatial and semantic shifts. Project page: https://vision-language-steering.github.io/webpage/


💡 Research Summary

The paper tackles a fundamental limitation of current imitation‑learning‑based robot policies: they are tightly coupled to the spatial configurations and language instructions seen during training, causing dramatic performance drops when faced with modest out‑of‑distribution (OOD) changes such as shifted support surfaces, nearby obstacles, or altered task phrasing. Retraining or fine‑tuning the policy to cover every possible variation is impractical and conceptually misaligned because the required motor primitives already exist in the pretrained model; what is missing is a mechanism to steer those primitives under new constraints at test time.

Vision‑Language Steering (VLS) is introduced as a training‑free, inference‑time adaptation framework for frozen diffusion or flow‑matching policies. VLS treats the generation of action trajectories as a controllable denoising process. It first grounds the OOD observation‑language pair (o, l) into a compact set of 3D keypoints using a vision‑language model (VLM). Object masks are obtained with SAM, visual features are extracted with DINOv2, and depth maps re‑project the masked pixels into point clouds. Clustering yields task‑relevant keypoints P that encode the geometric constraints implied by the OOD input.
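The re-projection and clustering step above can be sketched as follows. This is a simplified stand-in, not the authors' implementation: the mask would come from SAM and clustering would also use DINOv2 features, but here plain k-means over 3D positions illustrates how masked depth pixels become keypoints P (function and parameter names are illustrative).

```python
import numpy as np

def mask_to_keypoints(mask, depth, K, n_clusters=4, iters=20):
    """Re-project masked depth pixels into 3D and cluster them into keypoints.

    Sketch of the grounding step: `mask` is a boolean object mask, `depth` a
    metric depth map, `K` the 3x3 camera intrinsics. Clustering here is plain
    k-means on position only (an assumption for brevity)."""
    v, u = np.nonzero(mask)                      # pixel coordinates inside the mask
    z = depth[v, u]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # pinhole back-projection: (u, v, z) -> (x, y, z) in the camera frame
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)

    rng = np.random.default_rng(0)
    centers = pts[rng.choice(len(pts), n_clusters, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(pts[:, None] - centers[None], axis=2), axis=1)
        centers = np.stack([pts[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(n_clusters)])
    return centers                               # task-relevant keypoints P, shape (n_clusters, 3)
```

The returned cluster centers play the role of the geometric anchors that the synthesized rewards later score trajectories against.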

Next, the VLM is prompted to synthesize programmatic, differentiable reward functions R(a, P) that score any candidate action chunk a with respect to the spatial and semantic constraints encoded in P. Because R is differentiable with respect to a, its gradient ∇ₐR can be used as a guidance signal. For diffusion policies the standard noise predictor ϵ is modified as ϵ̂ = ϵ − λ·g, where g = ∇ₐR and λ controls guidance strength. For flow‑matching policies the velocity field is similarly adjusted: v̂ = v + λ·g. This gradient‑based steering reshapes the sampling distribution without altering the underlying policy parameters, preserving the diversity and robustness of the pretrained prior while biasing it toward trajectories that satisfy the OOD constraints.
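A minimal sketch of this steering rule, using a toy quadratic reward in place of the VLM-synthesized programs (the real rewards are arbitrary differentiable code; the attraction term and function names below are illustrative assumptions):

```python
import numpy as np

def reward_grad(a, P):
    """Analytic gradient of a toy differentiable reward that pulls every
    waypoint of the action chunk `a` (T, 3) toward its nearest keypoint in
    P (K, 3): R(a, P) = -sum_t ||a_t - p*(t)||^2."""
    d = a[:, None, :] - P[None, :, :]            # (T, K, 3) offsets to all keypoints
    idx = np.argmin((d ** 2).sum(-1), axis=1)    # nearest keypoint index per step
    return -2.0 * (a - P[idx])                   # grad_a R

def guided_noise(eps_pred, a, P, lam=0.1):
    """Diffusion steering: eps_hat = eps - lam * grad_a R.
    (Flow matching would instead adjust the velocity: v_hat = v + lam * g.)"""
    return eps_pred - lam * reward_grad(a, P)
```

Subtracting λ·g from the predicted noise at each denoising step biases samples toward high-reward trajectories while the frozen policy still supplies the prior.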

VLS also incorporates closed‑loop execution monitoring. During multi‑stage tasks, the system evaluates whether the current stage’s goal has been met and dynamically switches to the next stage’s reward function, while adaptively scaling λ based on observed progress. Particle‑level diversity and occasional resampling are employed to avoid mode collapse and to maintain exploration.
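The monitoring loop could be sketched as below. This is a hedged reconstruction, not the authors' API: `stages` stands in for the VLM-supplied (reward, success-check) pairs, `execute` for one steered rollout, and the "scale λ up when reward stalls" rule is a simple heuristic standing in for the paper's adaptive scheme.

```python
import numpy as np

def run_stages(stages, execute, lam0=0.1, max_steps=50):
    """Closed-loop execution monitor (sketch).

    `stages`  : list of (reward_fn, success_fn) pairs, one per task stage.
    `execute` : rolls out one guided action chunk under (reward_fn, lam)
                and returns the resulting observation."""
    lam = lam0
    for reward_fn, success_fn in stages:         # switch reward functions per stage
        prev = -np.inf
        for _ in range(max_steps):
            obs = execute(reward_fn, lam)        # steered rollout of one chunk
            if success_fn(obs):                  # stage goal met -> advance
                lam = lam0                       # reset guidance strength
                break
            r = reward_fn(obs)
            lam = lam * 1.5 if r <= prev else lam0   # stalled -> stronger guidance
            prev = r
        else:
            return False                         # stage never succeeded
    return True
```

The per-stage reset of λ keeps guidance gentle by default, escalating only when the monitored reward stops improving.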

The authors evaluate VLS on two large‑scale simulation benchmarks: CALVIN and LIBERO‑PRO. On CALVIN, VLS achieves up to a 31% absolute increase in success rate over prior inference‑time steering methods such as ITPS, DynaGuide, and V‑GPS, especially on long‑horizon tasks with spatial perturbations. On LIBERO‑PRO, VLS improves frozen VLA‑based policies (e.g., OpenVLA, π0, π0.5 variants) by up to 13% under both layout and instruction shifts.

Real‑world experiments are conducted with a Franka Emika robot. The robot is tasked with multi‑stage, language‑specified manipulations while encountering unseen object appearances, shifted table positions, and target substitutions. Without any parameter updates, VLS enables stable execution, reduces collisions, and achieves higher placement accuracy compared to the unmodified base policy.

Key contributions include:

  1. A fully training‑free adaptation scheme that leverages frozen generative policies.
  2. Automatic generation of dense, differentiable reward functions from vision‑language models, eliminating manual reward engineering.
  3. Gradient‑based steering of diffusion/flow‑matching denoising that preserves the policy’s prior while enforcing OOD constraints.
  4. A multi‑stage closed‑loop controller that dynamically adjusts guidance strength and switches reward functions.

Limitations are acknowledged: VLM inference adds computational overhead, the quality of the synthesized reward depends on prompt design, and current rewards focus on geometric constraints rather than force/torque or dynamic interaction constraints. Future work could explore more efficient VLM architectures, richer physical constraint modeling, and integration with learned dynamics models for even tighter control.

Overall, VLS demonstrates that “steering” rather than “retraining” is a powerful paradigm for deploying generalist robot policies in the messy, variable real world, opening a path toward scalable, adaptable robotic manipulation powered by large multimodal models.

