SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning
Vision-Language-Action (VLA) models exhibit strong generalization in robotic manipulation, yet reinforcement learning (RL) fine-tuning often degrades robustness under spatial distribution shifts. For flow-matching VLA policies, this degradation is closely associated with the erosion of spatial inductive bias during RL adaptation, as sparse rewards and spatially agnostic exploration increasingly favor short-horizon visual cues. To address this issue, we propose \textbf{SA-VLA}, a spatially-aware RL adaptation framework that preserves spatial grounding during policy optimization by aligning representation learning, reward design, and exploration with task geometry. SA-VLA fuses implicit spatial representations with visual tokens, provides dense rewards that reflect geometric progress, and employs \textbf{SCAN}, a spatially-conditioned annealed exploration strategy tailored to flow-matching dynamics. Across challenging multi-object and cluttered manipulation benchmarks, SA-VLA enables stable RL fine-tuning and improves zero-shot spatial generalization, yielding more robust and transferable behaviors. Code and project page are available at https://xupan.top/Projects/savla.
💡 Research Summary
Vision‑Language‑Action (VLA) models have shown impressive generalization for robotic manipulation, yet fine‑tuning these pretrained policies with reinforcement learning (RL) often erodes the spatial inductive bias that underlies their robustness. This problem is especially acute for flow‑matching policies, whose continuous‑time, noise‑driven formulation relies on implicit geometric priors that can be overwritten by high‑variance on‑policy updates and sparse rewards. The authors introduce SA‑VLA, a spatially‑aware RL adaptation framework designed to preserve spatial grounding while still allowing the policy to improve through interaction.
The core of SA‑VLA consists of three tightly coupled components. First, spatial token fusion augments the usual 2‑D visual tokens with implicit spatial tokens derived from multi‑view imagery. These spatial tokens are projected, enriched with 2‑D positional and view encodings, and then attended to by the visual tokens through a unidirectional cross‑attention mechanism. A learnable channel‑wise gate controls the contribution of the spatial stream, and a residual MLP refines the fused representation, yielding geometry‑aware embeddings (h_t) that retain the pretrained visual features while injecting geometric cues, even under occlusion.
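The fusion step above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the token dimensions, the single-head attention, and the fixed gate value are all assumptions made for clarity (the paper's gate is learnable, and projections and the residual MLP are omitted).

```python
import numpy as np

def cross_attention_fusion(visual_tokens, spatial_tokens, gate):
    """Unidirectional cross-attention: visual tokens (queries) attend to
    spatial tokens (keys/values); spatial tokens are never updated.

    visual_tokens:  (N_v, D) array, the 2-D visual stream
    spatial_tokens: (N_s, D) array, the implicit spatial stream
    gate:           (D,) channel-wise gate in [0, 1] (learned in the paper)
    """
    d = visual_tokens.shape[-1]
    scores = visual_tokens @ spatial_tokens.T / np.sqrt(d)      # (N_v, N_s)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # row softmax
    attended = weights @ spatial_tokens                         # (N_v, D)
    # Residual connection preserves the pretrained visual features;
    # the gate scales the injected geometric cues per channel.
    return visual_tokens + gate * attended

rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))        # 8 visual tokens, dim 16 (illustrative)
s = rng.normal(size=(4, 16))        # 4 spatial tokens
h = cross_attention_fusion(v, s, np.full(16, 0.5))
print(h.shape)  # (8, 16)
```

With the gate at zero the output reduces exactly to the pretrained visual tokens, which is the property that lets the spatial stream be injected without overwriting the original features.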
Second, the authors propose step‑level dense rewards that directly encode geometric progress. Each manipulation episode is automatically segmented into three phases—Reach, Place, and Leave—based on gripper state and relative pose changes. At every timestep the normalized distances between end‑effector and object (d_{ro}) and between object and goal (d_{od}) are computed. The reward for a phase is the signed change of the relevant distance, scaled by a factor (\lambda). This yields a dense, phase‑consistent signal that rewards reducing the end‑effector‑object distance during Reach, reducing the object‑goal distance during Place, and increasing the end‑effector‑object distance after release (Leave). Because the reward is computed online from geometry, it does not require hand‑crafted phase annotations and adapts to arbitrary objects and goal locations.
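The per-phase signed-change reward can be written out directly. The sketch below assumes the phase labels `"reach"`/`"place"`/`"leave"` and a default scale `lam=10.0` purely for illustration; the paper gives only the general form r = λ·Δd with the sign chosen per phase.

```python
def phase_reward(phase, d_ro_prev, d_ro, d_od_prev, d_od, lam=10.0):
    """Signed change of the phase-relevant normalized distance, scaled by lam.

    d_ro: end-effector-to-object distance, d_od: object-to-goal distance
    (both normalized); *_prev are the values at the previous timestep.
    """
    if phase == "reach":   # reward shrinking the EE-object distance
        return lam * (d_ro_prev - d_ro)
    if phase == "place":   # reward shrinking the object-goal distance
        return lam * (d_od_prev - d_od)
    if phase == "leave":   # reward growing the EE-object distance
        return lam * (d_ro - d_ro_prev)
    raise ValueError(f"unknown phase: {phase}")

# Moving 0.1 closer to the object during Reach earns +1.0 at lam=10.
print(phase_reward("reach", d_ro_prev=0.5, d_ro=0.4,
                   d_od_prev=0.0, d_od=0.0))  # 1.0 (up to float rounding)
```

Because each term is a difference of consecutive distances, the reward telescopes over a phase: the total return for Reach depends only on the start and end distances, which is what makes the signal phase-consistent.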
Third, the exploration strategy SCAN (Spatially‑Conditioned Annealed Noise) injects stochasticity that is both geometry‑aware and annealed over training. For a given state, a learned noise scale (\sigma_{\text{learn}}(x_t)) is predicted from the spatial embedding; this is combined with a time‑dependent minimum noise (\sigma_{\min}(t)) to form the total noise (\sigma_t(x_t)=\sigma_{\min}(t)+h(\sigma_{\text{learn}}(x_t)-\sigma_{\min}(t))). The final action is (a_t = \pi_\theta(h_t) + \epsilon_{\text{SCAN}}(t)) where (\epsilon_{\text{SCAN}}(t)\sim\mathcal{N}(0,\sigma_t^2 I)). Consequently, early training enjoys broad exploration, while later stages focus exploration on regions where the spatial embedding indicates higher uncertainty, preventing premature collapse to deterministic policies.
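A minimal sketch of the SCAN noise schedule follows. Several details are assumptions for illustration only: the linear annealing of (\sigma_{\min}(t)) from 0.5 to 0.05, and treating (h(\cdot)) as a non-negative clamp (the paper does not specify (h) here, only that the total noise never falls below the annealed floor).

```python
import numpy as np

def scan_noise_scale(sigma_learn, t, T, sigma0=0.5, sigma_end=0.05):
    """sigma_t(x_t) = sigma_min(t) + h(sigma_learn(x_t) - sigma_min(t)).

    sigma_min(t) anneals linearly from sigma0 to sigma_end over T steps
    (assumed schedule); h is taken as max(., 0) so sigma_t >= sigma_min(t).
    """
    sigma_min = sigma0 + (sigma_end - sigma0) * (t / T)
    return sigma_min + max(sigma_learn - sigma_min, 0.0)

def scan_action(mean_action, sigma_t, rng):
    """a_t = pi_theta(h_t) + eps,  eps ~ N(0, sigma_t^2 I)."""
    return mean_action + rng.normal(scale=sigma_t, size=mean_action.shape)

rng = np.random.default_rng(1)
mean = np.zeros(7)                       # e.g. a 7-DoF action (illustrative)
a = scan_action(mean, scan_noise_scale(sigma_learn=0.3, t=10, T=100), rng)
print(a.shape)  # (7,)
```

Early in training the annealed floor dominates, giving broad exploration everywhere; late in training the floor is small and noise is injected mainly where the learned, spatially conditioned scale (\sigma_{\text{learn}}(x_t)) remains high.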
The authors evaluate SA‑VLA on three challenging benchmarks: multi‑object pick‑and‑place, cluttered‑table rearrangement, and tasks with large viewpoint shifts. Baselines include the original flow‑matching policy (π₀), Rein‑Flow, and standard PPO fine‑tuning with sparse rewards. Across all metrics—success rate, average return, and zero‑shot performance under viewpoint perturbations—SA‑VLA outperforms baselines by 12–18%. Visualizations of end‑effector trajectories show that the "phase‑inconsistent" behavior observed in naïve fine‑tuning disappears, confirming that spatial consistency is retained.
Key contributions are: (1) a cross‑attention‑based spatial token fusion that injects implicit 3‑D structure into 2‑D visual representations without explicit reconstruction; (2) a geometry‑driven dense reward formulation that aligns with natural manipulation phases; (3) SCAN, a spatially‑conditioned annealed noise scheme that harmonizes exploration with the same geometric priors used for representation and reward design. Together, these elements mitigate catastrophic forgetting of spatial priors, improve credit assignment, and enable stable RL fine‑tuning of flow‑matching VLA policies in cluttered, partially observable environments.
The paper also discusses limitations: reliance on multi‑view inputs for spatial tokens, heuristic phase detection that may need refinement for more complex task hierarchies, and the need for environment‑specific tuning of the annealing schedule. Future work could explore meta‑learning of phase boundaries, extend SA‑VLA to multi‑robot coordination, and integrate sim‑to‑real transfer techniques. Overall, SA‑VLA presents a compelling solution to the longstanding issue of spatial bias erosion during RL adaptation of large multimodal policies.