SimVLA: A Simple VLA Baseline for Robotic Manipulation
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation, leveraging large-scale pre-training to achieve strong performance. The field has rapidly evolved with additional spatial priors and diverse architectural innovations. However, these advancements are often accompanied by varying training recipes and implementation details, which can make it challenging to disentangle the precise source of empirical gains. In this work, we introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research. By strictly decoupling perception from control, using a standard vision-language backbone and a lightweight action head, and standardizing critical training dynamics, we demonstrate that a minimal design can achieve state-of-the-art performance. Despite having only 0.5B parameters, SimVLA outperforms multi-billion-parameter models on standard simulation benchmarks without robot pretraining. SimVLA also matches the real-robot performance of π0.5. Our results establish SimVLA as a robust, reproducible baseline that enables clear attribution of empirical gains to future architectural innovations. Website: https://frontierrobo.github.io/SimVLA
💡 Research Summary
The paper introduces SimVLA, a deliberately minimalist baseline for Vision‑Language‑Action (VLA) robotic manipulation that isolates the true sources of performance gains in a rapidly evolving field. Recent VLA research has produced a plethora of sophisticated architectures—temporal chain‑of‑thought modules, explicit 3D spatial priors, diffusion‑based policy networks, and large cross‑modal adapters—often accompanied by opaque training recipes and implementation tricks. This makes it difficult to attribute improvements to architectural novelties versus hidden training dynamics. SimVLA addresses this by (1) strictly decoupling perception from control, (2) using a standard pretrained vision‑language model (VLM) as a frozen‑or‑lightly‑fine‑tuned encoder, and (3) attaching a lightweight action head that predicts continuous action chunks via conditional flow‑matching.
Architecture: The VLM processes multi‑view RGB images and a language instruction into a shared token sequence Zₜ. No cross‑attention or modality‑specific routing is added; the tokens are simply concatenated with proprioceptive state embeddings, a sinusoidal timestep embedding, and a noised action chunk. This concatenated sequence is fed into a vanilla Transformer encoder (self‑attention only) that outputs a denoising vector field. The action head itself stays lightweight, keeping the entire system at ≈0.5 B parameters, and can be swapped out or upgraded independently of the perception backbone.
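The concatenation step above can be sketched in a few lines. Everything here is illustrative: the embedding dimension, the sinusoidal frequency schedule, and the helper names (`sinusoidal_embedding`, `build_action_head_input`) are assumptions, not the paper's actual code.

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    """Embed the scalar noise level t with sines and cosines.
    The exact frequency schedule is an assumption -- the summary only
    says 'sinusoidal timestep embedding'."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def build_action_head_input(vlm_tokens, proprio_emb, noised_actions, t):
    """Concatenate the shared VLM token sequence Z_t with proprioceptive
    embeddings, one timestep token, and the noised action chunk into a
    single flat sequence for the self-attention-only encoder."""
    d = vlm_tokens.shape[-1]
    t_token = sinusoidal_embedding(t, dim=d)[None, :]   # shape (1, d)
    return np.concatenate(
        [vlm_tokens, proprio_emb, t_token, noised_actions], axis=0
    )                                                    # shape (N, d)
```

Because the sequence is a plain concatenation, upgrading the VLM only changes the length and width of `vlm_tokens`; the action head's interface is untouched.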
Flow‑Matching for Action Generation: Instead of stochastic diffusion or discrete token autoregression, SimVLA adopts deterministic flow‑matching. A normalized action chunk x is corrupted with Gaussian noise ε at a noise level t, forming xₜ = t·ε + (1‑t)·x. The network vθ predicts the vector field (ε − x), trained with an L₂ loss. At inference, a few Euler integration steps transform pure noise into a smooth action trajectory, yielding low latency and stable control suitable for real‑time robot execution.
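The corruption, regression target, and Euler sampler described here fit in a short sketch. It follows the summary's definitions (xₜ = t·ε + (1−t)·x, target ε − x); the step count and function names are assumed for illustration.

```python
import numpy as np

def flow_matching_pair(x, eps, t):
    """Corrupt a normalized action chunk x with Gaussian noise eps at
    level t: x_t = t*eps + (1-t)*x. The L2 regression target for the
    network v_theta is the vector field (eps - x)."""
    return t * eps + (1.0 - t) * x, eps - x

def euler_sample(v_theta, noise, n_steps=10):
    """Integrate the learned vector field from pure noise (t=1) down to
    a clean action chunk (t=0) with a few Euler steps."""
    x, dt = noise, 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt            # current noise level; never hits 0
        x = x - dt * v_theta(x, t)  # step against the (eps - x) field
    return x
```

For the ideal field v(xₜ, t) = (xₜ − x)/t, this sampler lands on the clean chunk after the final step; in practice v_theta is the trained network, and a handful of steps is what keeps inference latency low.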
Standardized Training Recipe: The authors identify three “silent drivers” that heavily influence performance: (i) data shuffling strategies, (ii) per‑dimension normalization of actions and proprioceptive states, and (iii) consistent optimizer schedules (learning‑rate warm‑up, cosine decay, batch size). By fixing these across all experiments, they demonstrate that a well‑tuned baseline can outperform many larger, more complex systems.
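Driver (ii), per‑dimension normalization, is easy to state concretely. The summary does not say whether SimVLA z‑scores or quantile‑scales its statistics, so this sketch assumes simple per‑dimension mean/std; the `PerDimNormalizer` class name is illustrative.

```python
import numpy as np

class PerDimNormalizer:
    """Normalize actions / proprioceptive states per dimension using
    dataset statistics computed once before training. A sketch: the
    exact statistics SimVLA uses are not stated in the summary."""

    def __init__(self, data, eps=1e-8):
        # data: (num_samples, dim); keep one mean/std per dimension
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0) + eps  # eps guards constant dims

    def normalize(self, x):
        return (x - self.mean) / self.std

    def denormalize(self, x):
        return x * self.std + self.mean
```

The same object must be reused at inference time to map predicted chunks back to robot units; mismatched statistics between training and deployment are exactly the kind of "silent driver" the recipe pins down.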
Empirical Results: On the LIBERO benchmark, SimVLA attains a 98.6% success rate, surpassing OpenVLA‑OFT (7 B, 97.1%) and π0.5 (3 B, 96.9%). Training memory consumption drops dramatically to 9.3 GB, versus 24.7 GB for the 0.5 B VLA‑Adapter and 62 GB for the 7 B model. In zero‑shot real‑robot experiments on the Galaxea R1 Lite, SimVLA matches the performance of the state‑of‑the‑art generalist policy π0.5 without any robot‑specific fine‑tuning, handling multi‑stage tasks that require both dexterous manipulation and semantic understanding.
Implications: SimVLA proves that a clean, modular design combined with disciplined training practices can achieve state‑of‑the‑art results while being far more compute‑efficient and easier to reproduce. Its architecture is future‑proof: as newer, larger VLMs become available, they can be swapped in without redesigning the action head or cross‑modal adapters. This baseline thus provides the community with a transparent reference point for quantifying the genuine contribution of any added spatial priors, temporal modules, or diffusion‑style policies.
In summary, SimVLA demonstrates that simplicity, when paired with rigorous training standardization, is sufficient to set a new performance “lower bound” for VLA research, enabling clearer, fairer comparisons and accelerating progress toward more capable and general robotic manipulation systems.