A Pragmatic VLA Foundation Model

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

A capable Vision-Language-Action (VLA) foundation model holds great promise for robotic manipulation: it is expected to generalize faithfully across tasks and platforms while remaining cost-efficient (e.g., in the data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second with an 8-GPU training setup, a 1.5-2.8$\times$ speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. These features make our model well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.


💡 Research Summary

LingBot-VLA presents a pragmatic, large-scale Vision-Language-Action foundation model designed for real-world robotic manipulation. The authors first address a critical gap in the VLA literature: the lack of systematic evidence on how performance scales with massive, diverse real-robot data. To this end, they collected approximately 20,000 hours of tele-operated manipulation data from nine popular dual-arm robot platforms (including AgiBot G1, AgileX, Galaxea R1Lite/Pro, Realman RS-02, Leju Kuavo 4 Pro, Qinglong, ARX Lift2, and a bimanual Franka setup). Each platform provides multiple RGB-D viewpoints, high-dimensional joint states, and parallel grippers, yielding a dataset an order of magnitude larger and more varied than those used in prior VLA work.

Data labeling proceeds in two stages. Human annotators first segment raw videos into atomic action clips and prune redundant start/end frames. A large multimodal language model (Qwen3-VL-235B-A22B) then automatically generates task-level and sub-task instructions, which are subsequently refined by humans to ensure linguistic fidelity. This pipeline produces high-quality vision-language-action triples suitable for end-to-end training.
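The two-stage pipeline above can be sketched as a record layout for one labeled clip. This is a minimal illustration only; all field names are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical record for one labeled clip from the two-stage pipeline:
# human segmentation/pruning first, then LLM-generated and human-refined
# instructions. All field names here are illustrative.
@dataclass
class LabeledClip:
    video_path: str                 # atomic action clip (human-segmented)
    start_frame: int                # redundant lead-in frames pruned
    end_frame: int                  # redundant trailing frames pruned
    task_instruction: str           # generated by the multimodal LLM
    subtask_instructions: list[str] = field(default_factory=list)  # human-refined

clip = LabeledClip("episode_0001/cam_head.mp4", 12, 480,
                   "place the cup on the shelf",
                   ["grasp the cup", "move to the shelf", "release"])
print(clip.task_instruction)
```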

The model architecture builds on the state-of-the-art vision-language model Qwen2.5-VL as a semantic backbone and introduces an "action expert" module. The two components are integrated via a Mixture-of-Transformers (MoT) scheme similar to BAGEL: vision-language tokens and action tokens are processed in separate transformer streams but share a unified self-attention layer, enabling cross-modal conditioning while preserving modality-specific representations. Observations at timestep t (Oₜ) consist of three-view image tokens, the natural-language instruction, and the robot's proprioceptive state; the action chunk Aₜ comprises 50 future joint commands. Continuous action generation is trained with Flow Matching, which interpolates between Gaussian noise and the ground-truth trajectory and minimizes the L2 distance between the model's predicted velocity field and the analytically optimal field. This yields smooth, high-precision trajectories.
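The Flow Matching objective described above can be sketched in a few lines. This is a minimal sketch assuming the common linear-interpolation (rectified-flow) convention; the 50-step chunk length comes from the summary, while the 14 joint dimensions and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Action chunk of 50 future joint commands; 14 DoF is an assumed
# dual-arm dimensionality, not a figure from the paper.
CHUNK, DOF = 50, 14

def flow_matching_targets(actions, noise, t):
    """Linearly interpolate between Gaussian noise (t=0) and the
    ground-truth action chunk (t=1); for this linear path the
    analytically optimal velocity field is simply (actions - noise)."""
    x_t = (1.0 - t) * noise + t * actions   # noisy sample on the path
    v_star = actions - noise                # optimal velocity target
    return x_t, v_star

def flow_matching_loss(v_pred, v_star):
    """L2 distance between predicted and optimal velocity fields."""
    return np.mean((v_pred - v_star) ** 2)

actions = rng.normal(size=(CHUNK, DOF))     # ground-truth trajectory
noise = rng.normal(size=(CHUNK, DOF))       # Gaussian source sample
t = rng.uniform()                           # random interpolation time

x_t, v_star = flow_matching_targets(actions, noise, t)
# A perfect model predicts v_star exactly and incurs zero loss.
print(flow_matching_loss(v_star, v_star))                     # 0.0
print(flow_matching_loss(np.zeros_like(v_star), v_star) > 0)  # True
```

At inference, integrating the learned velocity field from noise toward t=1 recovers a smooth action chunk, which matches the summary's claim of high-precision trajectories.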

Spatial reasoning is explicitly enhanced through a knowledge‑distillation step: learnable query embeddings derived from the VLM are aligned with depth embeddings from a dedicated depth network (LingBot‑Depth). The distillation loss forces the multimodal representation to encode geometric cues, improving performance on tasks that require precise depth perception (e.g., object insertion, assembly).
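A minimal sketch of such a distillation loss, assuming a cosine-style alignment between normalized embedding sets; the paper's exact projection and loss form are not specified here, and all dimensions and names are illustrative.

```python
import numpy as np

# Hypothetical dims: align N learnable query embeddings from the VLM
# with N depth embeddings from a depth network (LingBot-Depth in the
# paper); 32 queries of width 256 are assumed values.
N, D = 32, 256
rng = np.random.default_rng(1)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def distill_loss(query_emb, depth_emb):
    """Cosine-style alignment: normalize both embedding sets, then
    penalize their mean squared difference, forcing the VLM queries
    to encode the geometric cues carried by the depth features."""
    q = l2_normalize(query_emb)
    d = l2_normalize(depth_emb)
    return np.mean((q - d) ** 2)

queries = rng.normal(size=(N, D))   # learnable queries from the VLM
depth = rng.normal(size=(N, D))     # frozen depth-network features

print(distill_loss(queries, depth))   # positive for mismatched features
print(distill_loss(depth, depth))     # 0.0 when already aligned
```

Normalizing both sides makes the loss scale-invariant, so the VLM stream only needs to match the direction of the geometric features, not their magnitude.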

Training efficiency is a central contribution. The authors employ Fully Sharded Data Parallel (FSDP) together with a Hybrid Sharded Data Parallel (HSDP) strategy that places the action expert in its own shard group, dramatically reducing inter-GPU communication. Mixed-precision training uses float32 for reductions and float16 for storage and communication. At the operator level, they replace generic attention with PyTorch's FlexAttention (which supports block-sparse attention patterns) and fuse kernels via torch.compile, achieving a throughput of 261 samples per second on an 8-GPU (40 GB each) cluster, a 1.5-2.8× speedup over existing VLA codebases.
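As a back-of-the-envelope check on the reported numbers (illustrative arithmetic only, derived from the figures quoted above, not from the paper's tables):

```python
# 261 samples/s total on 8 GPUs, with a 1.5-2.8x speedup whose exact
# value depends on the underlying VLM base model.
throughput = 261
speedup_lo, speedup_hi = 1.5, 2.8

per_gpu = throughput / 8                 # per-GPU throughput
baseline_hi = throughput / speedup_lo    # fastest competing codebase
baseline_lo = throughput / speedup_hi    # slowest competing codebase

print(f"{per_gpu:.1f} samples/s per GPU")                      # 32.6
print(f"baseline range: {baseline_lo:.0f}-{baseline_hi:.0f} samples/s")
```

So the claimed speedups imply that existing VLA codebases process roughly 93-174 samples per second on the same hardware.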

For empirical validation, the authors adopt the GM-100 benchmark, which defines 100 diverse manipulation tasks. They evaluate on three distinct robot embodiments, running 130 episodes per task on each robot (39,000 episodes in total). Metrics include success rate, time-to-completion, and failure-mode analysis. LingBot-VLA attains an average success rate above 87%, substantially outperforming prior state-of-the-art VLA models (which typically hover around 70%). Notably, zero-shot transfer to unseen robot platforms remains robust, demonstrating that the large, heterogeneous pre-training data successfully mitigates the cross-embodiment gap.
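The total episode count follows directly from the evaluation protocol:

```python
# Evaluation budget: 100 tasks x 3 robot embodiments x 130 episodes
# per task on each robot.
tasks, embodiments, episodes_per_task = 100, 3, 130
total_episodes = tasks * embodiments * episodes_per_task
print(total_episodes)  # 39000
```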

Beyond performance, the authors release the training codebase (with the aforementioned optimizations), pretrained checkpoints, and the benchmark data. These resources are publicly available via GitHub and HuggingFace, encouraging reproducibility and community-driven extensions. The paper concludes with a roadmap that includes integrating additional sensor modalities (force/torque), scaling to even larger datasets, and expanding the benchmark to more challenging, long-horizon tasks.

In summary, LingBot‑VLA convincingly demonstrates that (1) VLA models continue to benefit from massive real‑world data without evident saturation, (2) a carefully engineered training pipeline can make such large‑scale learning tractable on modest GPU clusters, and (3) the resulting model achieves state‑of‑the‑art, broadly generalizable manipulation capabilities ready for deployment in real robotic systems.

