A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation
Large behavior models have shown strong dexterous manipulation capabilities by extending imitation learning to large-scale training on multi-task robot data, yet their generalization remains limited by insufficient robot data coverage. To expand this coverage without costly additional data collection, recent work relies on co-training: jointly learning from target robot data and heterogeneous data modalities. However, how different co-training data modalities and strategies affect policy performance remains poorly understood. We present a large-scale empirical study examining five co-training data modalities: standard vision-language data, dense language annotations for robot trajectories, cross-embodiment robot data, human videos, and discrete robot action tokens, across single- and multi-phase training strategies. Our study leverages 4,000 hours of robot and human manipulation data and 50M vision-language samples to train vision-language-action policies. We evaluate 89 policies over 58,000 simulation rollouts and 2,835 real-world rollouts. Our results show that co-training with forms of vision-language and cross-embodiment robot data substantially improves generalization to distribution shifts, unseen tasks, and language following, while discrete action token variants yield no significant benefits. Combining effective modalities produces cumulative gains and enables rapid adaptation to unseen long-horizon dexterous tasks via fine-tuning. Training exclusively on robot data degrades the visiolinguistic understanding of the vision-language model backbone, while co-training with effective modalities restores these capabilities. Explicitly conditioning action generation on chain-of-thought traces learned from co-training data does not improve performance in our simulation benchmark. Together, these results provide practical guidance for building scalable generalist robot policies.
💡 Research Summary
This paper presents a comprehensive empirical investigation into how different co‑training data modalities and training‑phase strategies affect the performance and generalization of large behavior models (LBMs) for dexterous robot manipulation. Recognizing that existing LBMs are limited by the relatively tiny amount of robot‑specific data compared to the massive corpora used to pre‑train vision‑language models (VLMs), the authors explore “co‑training” – jointly training on target robot data together with heterogeneous external data – as a way to bridge this gap.
Five distinct co‑training modalities are examined: (1) standard vision‑language (VL) data (e.g., RoboPoint, RefSpatial) that provide commonsense, object‑grounding, and spatial‑reasoning knowledge; (2) dense language annotations for robot trajectories, generated both by heuristic scripting and by prompting a large VLM (GPT‑5) to produce rich per‑frame descriptions; (3) cross‑embodiment robot data drawn from Open X‑Embodiment, covering many robot morphologies and environments; (4) human egocentric videos, from which either latent action tokens are extracted using a latent‑action model or language captions are generated via a VLM; and (5) discrete robot action tokens obtained by aggressive compression (FAST) or vector‑quantization (VQ‑VAE).
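The five modalities above can be collected into a small registry sketch. The dictionary structure and field names below are illustrative only; the modality descriptions and sources are taken from the text, but this is not the paper's actual configuration format:

```python
# Hypothetical registry of the five co-training modalities studied in the paper.
# Field names ("sources", "supervision") are invented for illustration.
MODALITIES = {
    "vision_language":   {"sources": ["RoboPoint", "RefSpatial"],
                          "supervision": "text tokens (cross-entropy)"},
    "dense_annotations": {"sources": ["heuristic scripting", "GPT-5 captioning"],
                          "supervision": "per-frame text tokens"},
    "cross_embodiment":  {"sources": ["Open X-Embodiment"],
                          "supervision": "continuous robot actions"},
    "human_video":       {"sources": ["egocentric human videos"],
                          "supervision": "latent action tokens or VLM captions"},
    "discrete_actions":  {"sources": ["FAST compression", "VQ-VAE"],
                          "supervision": "discrete action tokens"},
}

assert len(MODALITIES) == 5
```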
The authors assemble a massive dataset (named TRI‑Ramen) comprising roughly 4,000 hours of manipulation data (both simulated and real) and 50 million VL samples. They adopt a Vision‑Language‑Action (VLA) architecture built on a pretrained PaliGemma2‑PT backbone, augmented with a single “observation token” whose hidden states from the last four layers form a compact conditioning vector for an Action Flow Transformer (ActionFT). Continuous actions are trained with a flow‑matching loss, while text or discrete tokens use cross‑entropy; a weighted sum combines the two when both are present.
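The mixed training objective described above can be sketched in NumPy. This is a minimal illustration of a rectified-flow-style flow-matching loss combined with a token cross-entropy via a weighted sum; the function names and weight values are hypothetical, not the paper's implementation:

```python
import numpy as np

def interpolate(action, noise, t):
    """Linear path x_t = (1 - t) * noise + t * action (rectified-flow convention)."""
    return (1.0 - t) * noise + t * action

def flow_matching_loss(pred_velocity, action, noise):
    """MSE between the predicted velocity and the straight-line target (action - noise)."""
    target = action - noise
    return np.mean((pred_velocity - target) ** 2)

def cross_entropy_loss(logits, labels):
    """Token-level cross-entropy over the vocabulary axis, with log-sum-exp stabilization."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

def combined_loss(pred_velocity, action, noise, logits, labels,
                  w_action=1.0, w_text=0.1):
    """Weighted sum used when a batch contains both continuous actions and
    text/discrete tokens. The weights here are placeholders, not the paper's values."""
    return (w_action * flow_matching_loss(pred_velocity, action, noise)
            + w_text * cross_entropy_loss(logits, labels))
```

A perfect velocity prediction drives the flow-matching term to zero, leaving only the weighted text term; in the paper the backbone's compact conditioning vector (from the observation token) is what produces `pred_velocity` through the ActionFT head.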
Three training‑phase strategies are evaluated: (i) single‑phase co‑training (all data together), (ii) two‑phase 1st‑phase‑only (co‑training data only in phase 1, robot actions only in phase 2), and (iii) two‑phase full (co‑training data in both phases). Loss weights and batch ratios are fixed based on ablations.
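The three schedules can be written down as a small configuration sketch. The strategy names, data labels, and structure are illustrative assumptions, not the paper's actual config:

```python
# Hypothetical phase schedules for the three strategies; only the phase
# structure mirrors the text, everything else is invented for illustration.
ROBOT_ACTIONS = "robot_actions"
COTRAIN = ["vision_language", "dense_annotations", "cross_embodiment"]

STRATEGIES = {
    # (i) single phase: all data mixed in one run
    "single_phase": [
        {"data": [ROBOT_ACTIONS] + COTRAIN},
    ],
    # (ii) two-phase, 1st-phase-only: co-training data only in phase 1
    "two_phase_first_only": [
        {"data": [ROBOT_ACTIONS] + COTRAIN},
        {"data": [ROBOT_ACTIONS]},
    ],
    # (iii) two-phase full: co-training data in both phases
    "two_phase_full": [
        {"data": [ROBOT_ACTIONS] + COTRAIN},
        {"data": [ROBOT_ACTIONS] + COTRAIN},
    ],
}
```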
A total of 89 policies are trained and evaluated across 58,000 simulation rollouts (covering seen and unseen tasks, nominal and distribution-shift conditions) and 2,835 real-world rollouts (language following and long-horizon dexterous tasks). The key findings are:
- Vision-language and cross-embodiment data consistently boost generalization. Policies that incorporate standard VL data and/or cross-embodiment robot data achieve significantly higher success rates on distribution-shifted environments, on tasks never seen during training, and on language-following benchmarks. The VL data injects rich physical commonsense and spatial reasoning, while cross-embodiment data expands the diversity of robot morphologies and interaction contexts.
- Human-video modalities and discrete action tokens provide little or no benefit. Neither latent-action tokens extracted from videos nor VLM-generated captions improve performance, and discrete token representations (FAST, VQ-VAE) do not help continuous control, likely because the abstraction loses fine-grained motor information needed for dexterous manipulation.
- Two-phase 1st-phase-only training is the most efficient. Training first on the heterogeneous modalities to learn strong multimodal representations, then fine-tuning solely on robot action data, yields the best trade-off between data efficiency and final task performance.
- Combining effective modalities yields cumulative gains. Adding both VL and cross-embodiment data together produces additive improvements (≈12% higher success over any single modality). When these combined policies are later fine-tuned on unseen long-horizon tasks, adaptation speed roughly doubles compared with single-modality baselines.
- Exclusive robot-only training degrades the VLM backbone's vision-language abilities. Benchmarks such as VQA, GQA, and NLVR2 show a marked drop when the backbone is trained only on robot data. Co-training with VL and cross-embodiment data restores or even slightly improves these scores, confirming that heterogeneous data preserves the backbone's general vision-language competence.
- Explicit chain-of-thought (CoT) conditioning does not help. Adding a step where the model first generates an intermediate CoT trace (learned from co-training data) before producing actions yields no measurable benefit in the simulated benchmark, suggesting that the added complexity outweighs any potential guidance for the current control tasks.
Overall, the paper delivers a clear, data‑driven roadmap for building scalable, generalist robot policies: prioritize standard vision‑language and cross‑embodiment robot data, employ a two‑phase training schedule that front‑loads multimodal learning, and avoid reliance on discrete action token abstractions or CoT conditioning for now. These insights are validated at both simulation scale and real‑world deployment, offering practical guidance for future research aiming to close the data‑scale gap between robot learning and foundation models.