cVLA: Towards Efficient Camera-Space VLAs


Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and exhibits strong sim-to-real transfer capabilities. We evaluate our approach using a combination of simulated and real data, demonstrating its effectiveness on a real robotic system.


💡 Research Summary

Vision‑Language‑Action (VLA) models promise unified perception, language understanding, and manipulation, yet they typically require massive multimodal datasets and heavy computation. This paper introduces cVLA, a lightweight VLA that sidesteps these bottlenecks by predicting robot end‑effector keyposes directly in camera‑image coordinates using a pre‑trained vision‑language model (PaliGemma‑2). Instead of outputting low‑level joint commands or long action sequences, cVLA predicts only two absolute keyposes (start and goal) as a token sequence; a downstream low‑level planner then converts them into a full trajectory. The model fine‑tunes only the attention layers of PaliGemma‑2, keeping the number of trainable parameters low while inheriting the strong image encoding capabilities of large‑scale VLMs.
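The attention‑only fine‑tuning strategy can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code: the module name `self_attn` follows PyTorch's `nn.TransformerEncoderLayer` and is an assumption about how the backbone names its attention submodules.

```python
import torch.nn as nn

# Minimal sketch (assumed details): freeze a transformer backbone and
# unfreeze only its attention projections, as cVLA does with PaliGemma-2.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)

for name, param in model.named_parameters():
    # Gradients flow only through attention weights; MLPs and norms stay frozen.
    param.requires_grad = "self_attn" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable}/{total} parameters")
```

Only a fraction of the backbone's parameters receive gradient updates, which keeps the optimizer state and memory footprint small during fine‑tuning.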

Training data are generated entirely in simulation (ManiSkill). Two object families are used: simple geometric shapes (CLEVR‑style) and a diverse set of real‑world meshes from Objaverse. Four dataset variants (CLEVR‑easy/hard, Mix‑easy/hard) are created by varying scene randomization, background replacement, and camera parameters. Each sample consists of an RGB image, optionally a depth map (converted to a viridis‑colored RGB image), the current robot state, and a natural‑language instruction.
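The depth‑as‑RGB trick mentioned above (color‑mapping a depth image with viridis so it can pass through the unmodified RGB encoder) can be sketched as follows. The normalization bounds and image size are assumptions for illustration:

```python
import numpy as np
import matplotlib

def depth_to_viridis_rgb(depth, d_min=None, d_max=None):
    """Color-map a metric depth image to a 3-channel uint8 RGB image.

    Sketch of feeding depth through an RGB image encoder; the
    min/max normalization scheme here is an assumption.
    """
    d_min = depth.min() if d_min is None else d_min
    d_max = depth.max() if d_max is None else d_max
    norm = np.clip((depth - d_min) / max(d_max - d_min, 1e-8), 0.0, 1.0)
    rgba = matplotlib.colormaps["viridis"](norm)     # H x W x 4 floats in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)    # drop alpha, scale to uint8

depth = np.random.uniform(0.3, 1.5, size=(224, 224))  # synthetic depth in meters
rgb = depth_to_viridis_rgb(depth)
print(rgb.shape, rgb.dtype)  # (224, 224, 3) uint8
```

The resulting array is indistinguishable, format‑wise, from an ordinary camera frame, so no dedicated depth network or extra input channels are needed.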

Key technical contributions include: (1) encoding 6‑DoF poses as discrete tokens—1024 position tokens and 128 orientation tokens—augmented with separate depth tokens; (2) feeding depth information through the same image encoder by color‑mapping it, avoiding a dedicated depth network; (3) exploring inference‑time strategies such as image cropping and a novel beam‑search‑NMS decoding that generates multiple candidate trajectories and selects non‑maximum‑suppressed results; (4) extending the framework to one‑shot imitation learning by conditioning on a demonstration image‑trajectory pair (demo‑image + demo‑trajectory + live‑image → predicted trajectory). No fine‑tuning is performed at test time.
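Contribution (1), discrete pose tokenization, amounts to uniformly quantizing each pose component into a fixed vocabulary. The sketch below assumes uniform bins over normalized ranges; the paper specifies only the vocabulary sizes (1024 position tokens, 128 orientation tokens), so the exact binning is an assumption:

```python
import numpy as np

N_POS_BINS = 1024   # position tokens (per coordinate), per the paper
N_ROT_BINS = 128    # orientation tokens (per angle), per the paper

def to_token(value, lo, hi, n_bins):
    """Uniformly quantize a scalar in [lo, hi] to an integer token index."""
    idx = int((value - lo) / (hi - lo) * n_bins)
    return int(np.clip(idx, 0, n_bins - 1))

def from_token(idx, lo, hi, n_bins):
    """Decode a token back to the center of its bin (lossy inverse)."""
    return lo + (idx + 0.5) * (hi - lo) / n_bins

# Example: tokenize a normalized image-frame x-coordinate and a yaw angle.
x_tok = to_token(0.42, 0.0, 1.0, N_POS_BINS)
yaw_tok = to_token(1.2, -np.pi, np.pi, N_ROT_BINS)
x_back = from_token(x_tok, 0.0, 1.0, N_POS_BINS)
print(x_tok, yaw_tok, abs(x_back - 0.42))
```

With 1024 bins per coordinate, the quantization error is bounded by half a bin width, which is well below a pixel for typical image resolutions.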

Ablation studies on simulated data show that adding depth consistently improves success rates (6–18 percentage points) across all dataset variants, while aggressive augmentation slightly harms raw simulation performance but benefits real‑world transfer. The beam‑search‑NMS decoder markedly boosts performance on the multimodal DR‑OID‑hard subset, where multiple plausible trajectories exist.
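The beam‑search‑NMS idea (decode several candidate trajectories, then keep a diverse, high‑scoring subset) can be illustrated with a greedy non‑maximum suppression step. This is an assumed sketch of the selection stage only, not the paper's implementation; the distance metric and threshold are placeholders:

```python
import numpy as np

def nms_trajectories(candidates, scores, dist_thresh=0.05):
    """Greedy NMS: keep the highest-scoring candidates that lie at least
    dist_thresh apart (L2 distance over flattened waypoints)."""
    order = np.argsort(scores)[::-1]          # best score first
    kept = []
    for i in order:
        if all(np.linalg.norm(candidates[i] - candidates[j]) >= dist_thresh
               for j in kept):
            kept.append(int(i))               # suppress near-duplicates
    return kept

# Three decoded beams: two near-duplicates and one distinct alternative.
cands = np.array([[0.10, 0.20], [0.11, 0.20], [0.60, 0.70]])
scores = np.array([0.9, 0.8, 0.7])
print(nms_trajectories(cands, scores))  # [0, 2]: the duplicate beam is dropped
```

Suppressing near‑duplicate beams is what makes the decoder useful on multimodal scenes: the surviving candidates cover genuinely different plausible trajectories rather than minor variations of the top beam.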

For real‑world evaluation, the authors use the DR‑OID dataset, extracting two subsets focused on cube‑to‑cube moves: an “easy” set with blurred distractors and a “hard” set with clutter. Without any real‑world fine‑tuning, cVLA achieves average L1 position errors of 2–3 cm and rotation errors of 5–7°, demonstrating strong sim‑to‑real transfer.

Overall, cVLA delivers four major advantages: (i) dramatically reduced training cost by leveraging a frozen VLM backbone; (ii) embodiment‑agnostic action representation via camera‑space keyposes; (iii) effective use of depth cues without extra encoders; (iv) flexible inference strategies and a simple one‑shot imitation protocol. The work opens a path toward scalable VLA research that does not rely on massive real‑world datasets, while still achieving practical performance on tabletop manipulation tasks. Future directions include increasing the number of keyposes for more complex, multi‑step tasks, integrating temporal token prediction for long‑horizon planning, and extensive safety‑oriented real‑robot testing.
