ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving a state-of-the-art 98.5% success rate on LIBERO. We further demonstrate the superior performance of ROCKET across LIBERO-Plus and RoboTwin, as well as multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.
💡 Research Summary
The paper addresses a fundamental limitation of current Vision‑Language‑Action (VLA) models for robotic manipulation: they are pretrained on 2‑D image data and therefore lack robust 3‑D spatial reasoning. While prior work has tried to inject 3‑D cues by augmenting inputs with depth maps, using external depth estimators, or aligning a single intermediate layer of the VLA model to a strong 3‑D vision foundation model, these approaches either add hardware complexity, incur extra inference cost, or require costly post‑hoc layer selection.
ROCKET (Residual‑Oriented Multi‑Layer Alignment) proposes a principled multi‑layer alignment strategy that operates on the residual streams of both the student VLA network and a frozen 3‑D vision teacher (e.g., VGGT, Depth Anything). The key insight is that deep residual networks evolve as a sequence of small additive updates; thus, aligning multiple layers should be performed as a single stream‑to‑stream mapping rather than independent per‑layer projections. To achieve this, ROCKET employs a shared lightweight projector that is used for all selected student layers. The shared projector forces the gradients from different alignment losses to be coherent, dramatically reducing the gradient interference that plagues naïve multi‑projector designs. The authors provide a theoretical analysis showing that, under a shared projector, cross‑layer gradient inner products admit a lower bound that is positive when teacher‑side error signals are aligned, whereas independent projectors yield arbitrary, often destructive, cross terms. Empirical measurements of cosine similarity between gradients confirm that the shared projector substantially increases alignment coherence.
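The shared-projector idea can be sketched in a few lines of NumPy. This is an illustrative toy only: the hidden sizes, the chosen layer indices, and the cosine-style alignment loss are assumptions for the sketch, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_student, d_teacher = 8, 6      # hypothetical hidden widths
layers = [2, 5, 8]               # hypothetical student layers selected for alignment

# One shared projector W used for ALL selected layers (the core ROCKET idea),
# instead of one independent projector per layer.
W = rng.normal(size=(d_student, d_teacher)) * 0.1

def cosine_alignment_loss(h_student, h_teacher, W):
    """1 - cosine similarity between the projected student feature and the
    teacher feature (an assumed loss form for this sketch)."""
    z = h_student @ W
    cos = float((z * h_teacher).sum()
                / (np.linalg.norm(z) * np.linalg.norm(h_teacher) + 1e-8))
    return 1.0 - cos

# Toy residual streams: one feature vector per selected layer.
student_feats = {l: rng.normal(size=d_student) for l in layers}
teacher_feats = {l: rng.normal(size=d_teacher) for l in layers}

# Multi-layer alignment loss: a single sum through the shared projector, so
# every per-layer loss sends its gradient into the SAME parameters W, which
# is what encourages the cross-layer gradient coherence the paper analyzes.
total = sum(cosine_alignment_loss(student_feats[l], teacher_feats[l], W)
            for l in layers)
```

With independent projectors, each layer's loss would update disjoint parameters and their gradients could point in arbitrary directions; routing all losses through one `W` is what makes the cross-layer gradient inner products comparable in the first place.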
A second challenge is that shallow layers converge quickly and can dominate the shared projector, starving deeper layers of capacity to learn high‑level geometric cues. ROCKET solves this with a Matryoshka‑style sparse activation scheme: deeper layers activate more of the projector’s parameters, while shallow layers use only a small subset. This hierarchical activation balances the contribution of each layer to the total alignment loss without adding significant computational overhead.
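The Matryoshka-style activation can be sketched as a depth-dependent prefix of the shared projector's output dimensions. The linear activation schedule and the prefix-slicing rule below are assumptions for illustration; the paper's exact schedule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 6
n_layers = 12
W = rng.normal(size=(d_in, d_out))  # the single shared projector

def matryoshka_project(h, W, layer_idx, n_layers):
    """Project h through a depth-dependent prefix of the shared projector.

    Shallow layers activate only the first few output dimensions of W;
    deeper layers progressively activate more, up to the full projector.
    (A sketch of the Matryoshka-style sparse activation; the linear
    schedule here is an assumption.)
    """
    frac = (layer_idx + 1) / n_layers
    k = max(1, int(round(frac * W.shape[1])))  # number of active output dims
    z = np.zeros(W.shape[1])
    z[:k] = h @ W[:, :k]
    return z, k

h = rng.normal(size=d_in)
z_shallow, k_shallow = matryoshka_project(h, W, layer_idx=1, n_layers=n_layers)
z_deep, k_deep = matryoshka_project(h, W, layer_idx=11, n_layers=n_layers)
```

Because shallow layers touch only a small prefix of `W`, their fast-converging losses cannot overwrite the full parameter set, leaving the remaining columns free to specialize on the deeper layers' high-level geometric cues.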
The overall training objective combines the standard action‑prediction loss with the multi‑layer alignment loss weighted by a single λ. A “training‑free layer selection” rule (based on layer depth and residual magnitude) selects which student layers to align, eliminating the need for exhaustive search.
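The objective and the selection rule can be summarized as a short sketch. The λ value, the number of selected layers, and the "pick layers with the largest residual-update norms" heuristic are hypothetical stand-ins; the source only states that selection depends on layer depth and residual magnitude.

```python
def total_loss(action_loss, alignment_losses, lam=0.1):
    """Standard action-prediction loss plus a single lambda-weighted
    multi-layer alignment term (lam=0.1 is an assumed value)."""
    return action_loss + lam * sum(alignment_losses)

def select_layers(residual_norms, n_select=3):
    """Training-free layer selection sketch: rank layers by the magnitude
    of their residual updates and keep the top n_select, with no search
    over trained checkpoints. The exact rule is an assumption."""
    ranked = sorted(range(len(residual_norms)),
                    key=lambda i: residual_norms[i], reverse=True)
    return sorted(ranked[:n_select])

# Toy per-layer residual-update norms for a 6-layer student.
norms = [0.2, 0.9, 0.4, 1.1, 0.3, 0.8]
chosen = select_layers(norms, n_select=3)      # -> [1, 3, 5]
loss = total_loss(action_loss=1.5,
                  alignment_losses=[0.2, 0.3, 0.1], lam=0.1)
```

Since the rule reads statistics that are available before alignment training starts, it replaces the exhaustive per-layer sweep that single-layer alignment methods otherwise require.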
Experiments span three benchmark suites—LIBERO, LIBERO‑Plus, and RoboTwin—and multiple VLA backbones (OpenVLA‑7B, PI0‑6B, etc.). ROCKET achieves a 98.5% success rate on LIBERO, surpassing prior state‑of‑the‑art methods while using only ~4% of their training compute. Across all datasets, ROCKET consistently improves success rates, especially on tasks requiring fine‑grained spatial reasoning such as object contact, viewpoint changes, and constrained manipulation. Ablation studies confirm that (1) the shared projector is the primary driver of the performance gains, (2) the Matryoshka activation further boosts deep‑layer learning, and (3) the training‑free layer selection yields stable improvements without extra tuning.
In summary, ROCKET introduces three major contributions: (i) a residual‑oriented multi‑layer alignment framework with a shared projector that mitigates gradient conflict, (ii) a depth‑aware sparse activation mechanism that balances alignment losses across layers, and (iii) a highly compute‑efficient training pipeline that delivers state‑of‑the‑art 3‑D spatial understanding to 2‑D‑pretrained VLA models. This work paves the way for more capable, data‑efficient embodied AI systems that can reason about three‑dimensional environments without requiring additional sensors or heavy computational resources.