Generating Diverse and Scalable Data for Vision-Language-Action Models with Multi-Pattern Reinforcement Learning

Reading time: 6 minutes

📝 Abstract

Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pre-training. We propose Discover, Learn and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.

📄 Content

Vision-language-action (VLA) models have established a dominant paradigm of large-scale pre-training followed by downstream task fine-tuning [30,32,52,59,69]. The goal of pre-training is to expose the model to diverse robot behaviors so that it acquires broad manipulation capabilities. The subsequent fine-tuning phase then adapts this foundation to skillfully execute specific tasks [4,8,25,37,72]. The efficacy of this paradigm depends critically on the scale and diversity of the pre-training data [3,6,56]. Currently, this data is primarily sourced from human teleoperation, a process that is not only labor-intensive and costly but also inherently limited in behavioral diversity. Driven by the sole goal of task success, human demonstrators naturally rely on a few efficient strategies rather than deliberately demonstrating alternative viable solutions [71]. This limitation poses a fundamental challenge to generating the rich, multi-pattern data required for effective VLA pre-training.

Reinforcement learning (RL) has emerged as a powerful alternative for enabling robots to acquire complex patterns through environmental interaction [19,22,46]. Its fundamental strength lies in the trial-and-error process: by optimizing the reward signal, an agent can autonomously discover efficient strategies that often surpass what can be learned by simply imitating human demonstrations [2,45].

Previous work has demonstrated the potential of RL to refine VLA policies on specific tasks, achieving smoother, more efficient behaviors than human demonstrators or even discovering novel successful strategies [24,39,57,67]. However, prior work focuses primarily on using RL for VLA fine-tuning, while its potential for VLA pre-training remains largely understudied. In this paper, we explore how to collect diverse trajectories for VLA pre-training, which is important for a VLA to succeed even on a single task in the same environment [3,49]. By design, the objective of policy-based RL is to find the optimal policy, which often leads to convergence on a fixed execution pattern [31,60]. While highly effective for mastering a specific skill, the resulting trajectories may not possess the diversity required to instill the rich knowledge needed for downstream generalization. Designing an RL framework capable of explicitly generating a diverse dataset for VLA pre-training therefore emerges as a critical research challenge.

[Figure: DLR's training process results in a diverse, multi-modal state visitation distribution. The bottom row shows a standard offline-to-online RL baseline: (1) a policy is initialized via behavior cloning on the entire unlabeled human dataset; (2) the policy is refined online with a sparse success reward. This standard approach leads to mode collapse, resulting in a uni-modal state visitation distribution.]

In this paper, we propose to reframe the goal of RL training as discovering a repertoire of high-success behavioral patterns for each task. Concretely, we introduce the three-stage Discover, Learn, and Reinforce (DLR) framework:

We 1) discover distinct behavioral patterns from human demonstrations using an information-theoretic principle; 2) learn a pattern-conditioned policy to imitate these discovered patterns; and 3) reinforce the pattern-conditioned policy using the task reward toward the refined solutions corresponding to different patterns. This process yields a multi-pattern policy where each behavior serves as a high-quality data generator, enabling diverse sampling for VLA pretraining.
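The three stages above can be sketched on a toy problem. The snippet below is a hypothetical 1-D simplification, not the paper's actual implementation: the clustering stand-in for information-theoretic discovery, the proportional-controller policy, and the hill-climbing "reinforce" step are all illustrative assumptions. The task admits two equally valid solutions (reach either +1 or -1), so a single-pattern learner would cover only one mode, while the pattern-conditioned pipeline preserves both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: starting from x = 0, reaching EITHER goal (+1 or -1) counts as
# success, so two distinct behavioral patterns solve the same task.
def rollout(target, gain, steps=30):
    """Proportional controller toward `target`; returns states and a
    sparse success reward (1.0 if the final state is near either goal)."""
    x, xs = 0.0, []
    for _ in range(steps):
        x += gain * (target - x) + rng.normal(0.0, 0.01)
        xs.append(x)
    success = float(min(abs(x - 1.0), abs(x + 1.0)) < 0.1)
    return np.array(xs), success

# Human-like demonstrations: half reach +1, half reach -1.
demos = [rollout(t, 0.2)[0] for t in [1.0] * 10 + [-1.0] * 10]

# Stage 1 -- Discover: split demonstrations into behavioral patterns.
# (Thresholding the final state is a toy stand-in for the paper's
# information-theoretic pattern discovery.)
finals = np.array([d[-1] for d in demos])
labels = (finals > 0).astype(int)

# Stage 2 -- Learn: pattern-conditioned imitation; the policy for pattern z
# steers toward the mean final state of that pattern's demonstrations.
targets = {z: finals[labels == z].mean() for z in (0, 1)}

# Stage 3 -- Reinforce: refine each pattern's policy against the sparse
# task reward (simple hill-climbing as a stand-in for RL fine-tuning);
# the pattern conditioning keeps the two solutions separate.
gains = {z: 0.2 for z in targets}
for z in targets:
    for _ in range(50):
        cand = float(np.clip(gains[z] + rng.normal(0.0, 0.05), 0.1, 0.9))
        r_old = np.mean([rollout(targets[z], gains[z])[1] for _ in range(5)])
        r_new = np.mean([rollout(targets[z], cand)[1] for _ in range(5)])
        if r_new >= r_old:
            gains[z] = cand

# The resulting multi-pattern policy acts as a diverse data generator:
data = {z: [rollout(targets[z], gains[z]) for _ in range(20)] for z in targets}
rates = {z: np.mean([s for _, s in data[z]]) for z in data}
ends = {z: np.mean([xs[-1] for xs, _ in data[z]]) for z in data}
```

Both patterns end with near-perfect success rates, but their trajectories terminate at opposite goals, which is exactly the multi-modal state visitation a single-pattern RL run would lose.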

We conduct experiments to evaluate out-of-distribution generalization using the LIBERO benchmark [33]. Specifically, we pre-train the VLA model on tasks from LIBERO-90 with data collected using RL, and fine-tune the model on tasks from LIBERO-Spatial/Object/Goal/Long respectively. Our experimental results indicate that 1) when fine-tuned on downstream tasks, the VLA model pre-trained on the multi-pattern RL data generated by DLR outperforms its counterpart pre-trained on an equal-sized dataset collected using canonical RL, and 2) VLA performance scales with data volume when DLR is used to collect the data. These findings suggest the possibility of shifting from human-centric to algorithmically generated data pipelines, reducing cost while enabling principled scaling. In summary, our paper makes the following contributions:

• We propose a principled three-stage framework, DLR, to generate high-quality and diverse robotic trajectories for VLA pre-training using reinforcement learning.

• We provide a theoretical analysis demonstrating that DLR preserves the diversity of discovered patterns and prevents collapse to a single solution.

• We show that DLR can not only generate diverse successful trajectories but also yield pre-trained VLAs that perform better when fine-tuned on downstream tasks.
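The controlled comparison described above hinges on holding dataset size fixed while varying behavioral coverage. A minimal sketch of that setup follows; `sample_trajectory` and `build_corpus` are hypothetical placeholders standing in for actual rollouts of a pattern-conditioned policy, not a real LIBERO data pipeline.

```python
import random

random.seed(0)

def sample_trajectory(pattern):
    # Placeholder for one rollout of the pattern-conditioned policy.
    return {"pattern": pattern, "states": [random.random() for _ in range(5)]}

def build_corpus(patterns, size):
    # Cycling over patterns keeps the corpus balanced across behavior modes.
    return [sample_trajectory(patterns[i % len(patterns)]) for i in range(size)]

# Equal-sized corpora: multi-pattern (DLR-style) vs. single-pattern RL data.
dlr_corpus = build_corpus(patterns=[0, 1, 2], size=300)
baseline_corpus = build_corpus(patterns=[0], size=300)

# Same data volume, different behavioral coverage:
dlr_modes = sorted({t["pattern"] for t in dlr_corpus})
baseline_modes = sorted({t["pattern"] for t in baseline_corpus})
```

Balancing by round-robin is one simple design choice; any sampler that keeps all discovered patterns represented at equal corpus size would serve the same comparison.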

Diverse and high-quality trajectories are crucial for producing generalist VLA models.

This content is AI-processed based on ArXiv data.
