Emergence of Human to Robot Transfer in Vision-Language-Action Models
📝 Original Info
- Title: Emergence of Human to Robot Transfer in Vision-Language-Action Models
- ArXiv ID: 2512.22414
- Date: 2025-12-27
- Authors: Simar Kareer, Karl Pertsch, James Darpinian, Judy Hoffman, Danfei Xu, Sergey Levine, Chelsea Finn, Suraj Nair
📝 Abstract
We observe that the transfer from human data to robot policies scales with the size and diversity of VLA pre-training data. In Figure 1, the x-axis represents the diversity of the pre-training robot dataset, and the yellow and blue lines show the finetuning performance with and without human embodiment data. While both increase, the gain from leveraging human data only appears beyond a certain pre-training scale. We evaluate on a suite of four generalization scenarios shown only in the human data.
📄 Full Content
Drawing inspiration from language models, recent work has found that the ability to leverage certain sources of data is intrinsically tied to model scale [47,49,48]. For instance, while smaller models fail to leverage diverse instruction tuning datasets, larger models become generalists that soak up diverse data and generalize to new tasks. This leads us to ask: does the ability to learn skills from human video data, with no explicit alignment, emerge with scale? To test this hypothesis, we introduce a simple co-training recipe that treats human videos as an additional embodiment with the same objectives used for robot data. Specifically, we predict low-level end-effector trajectories using 3D hand tracks and high-level sub-tasks using dense language annotations, mirroring the objectives used during robot pretraining. We then co-finetune on a mix of this human data with the relevant robot data, and evaluate in a setting present only in the human data. For example, one such setting involves sorting eggs, where the robot data covers placing eggs in cartons, while human data specifies how differently colored eggs should be sorted across multiple cartons.
With this recipe, we uncover our key finding: human to robot transfer is an emergent property of diverse VLA pretraining (Figure 1). As we expand the diversity of robot data across tasks, scenes, and embodiments, the pretrained VLA becomes increasingly capable of leveraging human videos during post-training. We quantify this effect on four generalization benchmarks that probe different axes of transfer, including unseen apartments, novel object categories, and new task semantics. Our full recipe leverages human video data to enable capabilities never shown in robot data: for instance, it busses unseen objects, tidies an unseen home, and performs a task with novel semantic structure.
Fig. 2: Per-task improvement from human data. We plot the difference in performance between policies fine-tuned with robot + human data versus robot-only data, isolating the lift from human supervision. Gains are largest when pre-training spans diverse tasks, scenes, and embodiments, suggesting that broad pre-training improves transfer from human videos.
These findings might lead one to ask: why does diverse pretraining matter so much for transfer? We find that as pretraining diversity increases, the latent representations of human and robot data naturally align. This suggests that with sufficient data coverage, models begin to form embodiment-agnostic representations, despite vast visual and kinematic domain shifts. In the same way large language models become generalists that can learn from diverse supervision, diverse VLAs become generalists that can learn from diverse embodiments.
Concretely, we show that robotic foundation models are able to directly leverage human data when pre-trained on sufficiently diverse data, which we demonstrate quantitatively by performance on generalization tasks, as well as qualitatively by analyzing the structure of the latent embeddings across embodiments. To test the efficacy of our recipe, we ablate the importance of each training objective, the importance of wrist cameras, and compare the relative value of human and robot data. We believe this research provides a new perspective on the potential role of embodied human data in training state-of-the-art VLAs. Rather than developing bespoke algorithms to leverage human data, we can think of it in the context of cross-embodiment transfer, where its usefulness is amplified by diverse pretraining.
Learning from Humans. Learning manipulation from human video has received significant attention due to its potential scalability. Over the years, advances have been made to leverage this data more directly for policy learning. Early works in this field leveraged human video data to train stronger vision encoders, which can improve downstream policy learning [31,30,29]. Such approaches leverage the visual diversity of large human datasets like Ego4D [19] to train rich visual features, but are unable to directly improve action prediction. To address this, a number of works developed proxies for actions via intermediate prediction tasks such as keypoint tracking [5,45], latent actions [55], reward modeling [9], and affordance prediction [3,2]. Alternatively, other approaches use overlaid robots and AR/VR to explicitly align the human and robot actions [15,34]. While these works get closer to capturing real human actions, they introduce a manually engineered structure to enable transfer, limiting the generality of tasks that can be captured.
In parallel to this work, advances in AR/VR enable us to extract explicit actions from humans in the form of 3D hand and head tracking [17]. Recent works leverage this advancement to train unified policies on human and robot data with a single objective: future action prediction, whether that be from a human hand or a robot end-effector [22,36,59,27,28,54,37]. These works offer a promising path to directly leverage large-scale human data, but such methods are generally brittle at small scale. As such, they often rely on some form of alignment to work well, whether that be kinematic, visual, or latent. In our work, we extend this style of work and perform no explicit alignment steps.
Fig. 3: Training mixture and benchmark. Our fine-tuning mix is evenly split between human data for generalization tasks and robot data for the nearest-neighbor task. For each task we evaluate generalization to a new concept introduced only in human data. 1) Scene generalization: we have robot data for tidying dressers and spice racks across many Airbnbs and human data for an unseen apartment. 2) Object generalization: robot data covers bussing a table filled with trash and dinnerware and human data covers a new set of objects. 3) Task generalization: robot data covers placing eggs into cartons, but human data introduces the new concept of sorting eggs by color.
Heterogeneous Vision-Language-Action Models. Modern VLAs are trained as generalist policies with heterogeneous supervision, combining robot teleoperation, web-scale vision-language data, and language annotations into a single model [13,60,24,8,50,57,4,40,35,42,51,7,20]. These models bootstrap strong vision-language backbones to get broad semantic understanding from human-generated images and text, and then ground that understanding in robot experience via behavior cloning on large teleoperation datasets [11,16,44,33,23,6,18,38,1,25,21]. While supervision from web images, videos, and language further improves open-world generalization, this data lacks explicit actions and is visually out-of-distribution for a robot's egocentric observations.
A common theme in recent VLAs is cross-embodiment training [33,32,12,53,58], where a single policy is used to control many different robot embodiments with a unified architecture and action representation. These multirobot VLAs show that skills can transfer across embodiments, often without bespoke alignment beyond shared observation and action spaces. This suggests that heterogeneous, multirobot pretraining can produce internal representations that are naturally conducive to transfer across embodiments.
We build on this cross-embodiment hypothesis by treating humans as yet another embodiment within the same heterogeneous VLA training recipe. In contrast to non-embodied human videos one might find on YouTube, we leverage embodied human videos with explicit hand motion and language annotations in the VLA mixture. We find that with sufficient pretraining scale, the resulting VLAs naturally form embodiment-agnostic representations that align human and robot trajectories.
Scaling Alternative Data Collection Strategies. While most VLAs are largely driven by robot teleoperation data, there is recent work exploring alternative data collection mechanisms that are more scalable. A number of works use portable hardware that a user operates with their hands to simulate teleoperation [39,56]; for instance, UMI [10] is a hand-held parallel-jaw gripper that tracks its own movement to produce demonstration data. A number of works expanded on this design to capture data for dexterous hands as well, both via exoskeletons and portable motion capture [52,41,46]. While these devices are exciting options for increasing data scalability, they ultimately encumber the operator, and it is difficult to work naturally while wearing them. Capturing embodied human data offers a promising way to address these limitations, using cameras and computer vision to record 3D hand motion with minimal disruption. This approach enables us to observe human behavior without encumbrance. In this study, we therefore focus on methods for leveraging embodied human data.
We consider the setting of training generalist policies with vision-language-action models. VLAs inherit the architecture and pretrained weights of a vision-language model, but are trained to produce continuous robot control. VLAs are typically trained via behavior cloning on a dataset of demonstrations $\mathcal{D} = \{(o_t, l_t, a_{t:t+H})\}$ to produce a policy that maps an observation and language command to a trajectory of future actions, $\pi_\theta(a_{t:t+H} \mid o_t, l_t)$. Actions can either be represented as discrete action tokens [60,26,35], trainable via standard next-token prediction, or as continuous values, often trained via flow-matching objectives [8]. In this work, we follow Driess et al. [14] and train VLA models using both action representations: we train our model to predict discretized FAST [35] action tokens, and introduce a small action expert network that decodes continuous actions via a flow matching objective. For more details on model architecture and training objective, see [20].
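To make the dual action representation concrete, here is a minimal, hypothetical sketch of the two low-level objectives described above: a next-token loss over discretized action tokens and a rectified-flow style flow matching loss on the continuous chunk. The shapes, toy model stand-ins, and the 256-token vocabulary are illustrative assumptions, not the actual π 0.5 implementation.

```python
import numpy as np

# Hypothetical shapes: an action chunk of H steps with D action dimensions.
H, D = 50, 16
rng = np.random.default_rng(0)

def flow_matching_loss(model_velocity_fn, actions, rng):
    """Rectified-flow style loss on a continuous action chunk.

    actions: (H, D) ground-truth chunk. model_velocity_fn maps
    (noisy_actions, tau) -> predicted velocity of the same shape.
    """
    tau = rng.uniform()                          # interpolation time in [0, 1]
    noise = rng.standard_normal(actions.shape)
    x_tau = tau * actions + (1.0 - tau) * noise  # point on the noise -> data path
    target_velocity = actions - noise            # constant velocity of that path
    pred = model_velocity_fn(x_tau, tau)
    return np.mean((pred - target_velocity) ** 2)

def token_nll(token_logits, target_tokens):
    """Next-token cross-entropy over discretized (FAST-style) action tokens."""
    log_probs = token_logits - np.log(np.exp(token_logits).sum(-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(target_tokens)), target_tokens])

# Toy stand-ins for the action expert and the token head of the VLA backbone.
dummy_expert = lambda x, tau: np.zeros_like(x)
actions = rng.standard_normal((H, D))
logits = rng.standard_normal((H, 256))           # 256 = hypothetical token vocabulary
tokens = rng.integers(0, 256, size=H)

loss = flow_matching_loss(dummy_expert, actions, rng) + token_nll(logits, tokens)
print(f"combined objective: {loss:.3f}")
```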
Recent VLAs like π 0.5 showed improved generalization from co-training VLAs with additional objectives like subtask prediction, object detection, and VQA. For subtask prediction, the policy predicts a subtask string given visual observations and a high-level language command, $p(l^{\text{subtask}}_t \mid o_t, l_t)$. This language is fed back into the model to condition action generation, $\pi_\theta(a_{t:t+H} \mid o_t, l^{\text{subtask}}_t)$, similar to chain of thought. Subtask labels are obtained by densely annotating demonstration data with language descriptions of short, atomic action sequences. In this work, we train on human data with two objectives: flow-based prediction for continuous actions and language-based subtask prediction.
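The subtask-conditioned control flow can be summarized with a short, hypothetical sketch: the high-level head produces a subtask string from the observation and command, and that string is fed back as the language conditioning for low-level action generation. The function names and the dictionary observation format are assumptions for illustration; in the real model both heads share one VLA backbone.

```python
from typing import Callable, Sequence

def hierarchical_step(
    predict_subtask: Callable[[dict, str], str],
    predict_actions: Callable[[dict, str], Sequence],
    observation: dict,
    command: str,
):
    """One control step of the subtask-conditioned policy.

    First sample the subtask p(l_subtask | o_t, l_t), then condition the
    low-level head on it, pi(a_{t:t+H} | o_t, l_subtask).
    """
    subtask = predict_subtask(observation, command)       # e.g. "pick up the white egg"
    action_chunk = predict_actions(observation, subtask)   # H future actions
    return subtask, action_chunk

# Toy usage with stand-in heads.
subtask, chunk = hierarchical_step(
    predict_subtask=lambda o, l: "pick up the spice bottle",
    predict_actions=lambda o, l: [[0.0] * 16 for _ in range(50)],
    observation={"head_cam": None},
    command="tidy the spice rack",
)
```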
Our fine-tuning recipe aims to leverage embodied human data identically to the other robots in our mixture, with no explicit alignment. This approach is maximally general: it leans on the capabilities of large models to ingest relevant information from diverse sources rather than on human-designed heuristics for aligning domains. We first collect, process, and annotate the human video data, and then use it in combination with robot data to fine-tune the pre-trained model, which we base on the π 0.5 model shown in Figure 4. The fine-tuning objective treats human and robot data in exactly the same way, without any explicit transfer learning method or loss.
Data collection device. We design our data collection setup to enable capture of a wide range of human interactions while being minimally intrusive, and thus scalable. We equip human data collectors with a head-worn, high-resolution camera. Since recent robotics research has demonstrated the benefit of wrist-mounted cameras for policy learning, which provide a more detailed view of the end-effector's interactions with manipulated objects, we also experiment with equipping data collectors with wrist-mounted cameras that provide two additional, time-synchronized camera streams. We ablate the effect of these additional cameras in Section V.
Data collection protocol. Our intent is to collect human data in the style of episodic robot teleoperation data, which allows us to isolate the transfer problem to only the visual and kinematic differences between human and robot. As such, operators are instructed to collect repeated demonstrations for each of our tasks while wearing the data collection device. Further, operators are asked to keep their hands in the field of view of the camera to improve tracking quality. We collected 3 hours of data for bussing, 3 hours for spice, 3 hours for dresser, and 5 hours for sort eggs.
Data processing & annotation. Given a raw video capture of human interactions, we use visual SLAM to reconstruct the 6D movement $e_t \in \mathbb{R}^6$ of the head-mounted camera relative to a constant world frame. We also reconstruct the positions of 17 3D keypoints $h^{e_t}_t \in \mathbb{R}^{3 \times 17}$ for both hands in the head camera frame. Finally, analogously to the robot teleoperation data in our training mixture, we annotate the human video data with text-based subtasks describing the actions of each arm.
Action space. We aim to roughly align the action representations for human and robot data in our training mixture to facilitate transfer. For robot teleoperation data, there are multiple choices of action representation. Two common options are to represent actions as trajectories of robot joint positions or of end-effector poses. We consider end-effector based actions, since approximating joint positions for humans is difficult. Specifically, these end-effector actions are represented as a length-H action chunk $[a_0, a_1, \ldots, a_H]$, where each $a_i$ represents the 6-DoF pose relative to the 6-DoF pose $s_0$ of the current observed state. The total action space for robot data is the concatenation of the 6-DoF left-arm end-effector trajectory + gripper, the 6-DoF right-arm end-effector trajectory + gripper, and 2-dimensional base actions, yielding a total action chunk of $a \in \mathbb{R}^{H \times 16}$. To compute corresponding actions in the human video, we define an "end-effector" pose spanning the 3D keypoints of each hand's palm, middle, and ring finger (Figure 6), relative to the head frame $e_t$. We then compute end-effector actions as relative transformations from the current 6-DoF state, similar to how we do for the robot end-effector. Similarly, we approximate relative robot base actions by projecting the human video base camera poses into the frame of the chunk's first-timestep base camera pose. We do not explicitly approximate "gripper actions" for the human video, since it is challenging to estimate the openness of a human hand.
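As an illustration of this action-space construction, the sketch below builds a hand "end-effector" frame from three keypoints and expresses every pose in a chunk relative to the chunk's first pose. The axis conventions, the rotation-vector parametrization, and all names are our own assumptions; they stand in for whichever 6-DoF representation the robot data actually uses.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def frame_from_keypoints(palm, middle, ring):
    """Build a hypothetical 6-DoF "end-effector" frame from three hand keypoints.

    Origin at the palm; one axis points toward the middle-finger keypoint and
    the frame is completed by Gram-Schmidt with the ring-finger direction.
    """
    x = middle - palm
    x /= np.linalg.norm(x)
    y = ring - palm
    y -= x * (x @ y)
    y /= np.linalg.norm(y)
    z = np.cross(x, y)
    T = np.eye(4)
    T[:3, :3] = np.stack([x, y, z], axis=1)
    T[:3, 3] = palm
    return T

def relative_chunk_actions(poses):
    """Express each pose in a chunk relative to the chunk's first pose.

    poses: list of 4x4 homogeneous transforms [T_0, ..., T_H].
    Returns an (H+1, 6) array of [translation, rotation-vector] actions.
    """
    T0_inv = np.linalg.inv(poses[0])
    actions = []
    for T in poses:
        rel = T0_inv @ T
        actions.append(np.concatenate([rel[:3, 3],
                                       R.from_matrix(rel[:3, :3]).as_rotvec()]))
    return np.stack(actions)

# Toy chunk: the "hand" translates 1 cm along x each step with fixed orientation.
palm, middle, ring = np.zeros(3), np.array([0.0, 0.1, 0.0]), np.array([0.05, 0.05, 0.0])
base = frame_from_keypoints(palm, middle, ring)
chunk = []
for i in range(5):
    T = base.copy()
    T[0, 3] += 0.01 * i
    chunk.append(T)
print(relative_chunk_actions(chunk))
```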
Training objectives. Our best recipes for difficult long-horizon tasks leverage both high-level subtask prediction and low-level action prediction, and we construct both of these prediction tasks on our human data. For low-level action prediction, we supervise action chunk prediction both via next-token prediction on discrete FAST tokens and via a flow matching loss on the continuous actions, $\pi_\theta(a_{t:t+H} \mid o_t, l^{\text{subtask}}_t)$. For high-level subtask prediction, we train with next-token prediction on the subtask language tokens, $\pi_\theta(l^{\text{subtask}}_t \mid o_t, l_t)$.
Training mixture. At fine-tuning time, it is important to create a training mixture that both retains the model's original capabilities and introduces new concepts from human data to improve generalization. Our mixture reflects this with a simple recipe: we co-train the human data for our generalization tasks at a 50-50 proportion with the nearest-neighbor robot task. We use this mixture to finetune π 0.5, a strong VLA exhibiting zero-shot generalization, and further improve its capabilities. As shorthand, we refer to the combined model that integrates egocentric data into π 0.5 as π 0.5 + ego.
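A minimal sketch of the 50-50 co-training mixture might look like the following; the episode identifiers and sampling loop are hypothetical, and in practice the mixture would be built over the actual human and nearest-neighbor robot datasets.

```python
import random

def cotraining_sampler(human_episodes, robot_episodes, human_fraction=0.5, seed=0):
    """Yield fine-tuning episodes, drawing human vs. robot data at a fixed ratio.

    Each draw picks the human generalization data with probability
    `human_fraction`, otherwise the nearest-neighbor robot task data.
    """
    rng = random.Random(seed)
    while True:
        source = human_episodes if rng.random() < human_fraction else robot_episodes
        yield rng.choice(source)

# Toy usage with placeholder episode identifiers.
sampler = cotraining_sampler(["human_ep_%d" % i for i in range(3)],
                             ["robot_ep_%d" % i for i in range(3)])
batch = [next(sampler) for _ in range(8)]
print(batch)
```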
To test whether π 0.5 + ego can generalize to new concepts from egocentric human data, we construct a suite of "generalization" scenarios that have limited coverage in robot data but are present in human data. These scenarios span generalizing to new scenes, objects, and tasks. We begin our study by understanding whether our recipe can enable transfer to these new settings. Then, we validate our core hypothesis, which is that this transfer is an emergent property of diverse VLA pretraining. Finally, we compare human embodiment data to other robot embodiments, study whether transfer occurs from high-level subtask or low-level action prediction, and ablate the impact of our human-worn wrist cameras.
Our benchmark aims to test human to robot transfer across various axes of generalization: scene, object, and task (Figure 3). For each axis, we consider a setup where our robot teleoperation data lacks coverage, and we collect targeted human data to expand this coverage. In each setting we co-train using the π 0.5 + ego recipe and evaluate on the new concept introduced in human data.
Scene transfer: We identified two tasks where we have robot data coverage in a fixed number of homes, but π 0.5 trained only on robot data fails to generalize to an unseen home: Spice and Dresser, in which the robot must tidy a spice rack and the top of a dresser, respectively. We collect human data in the unseen target home, then benchmark π 0.5 + ego on this new scene. For both of these shorter-horizon tasks the score is a binary success rate.
Object transfer: Our robot data has coverage of bussing a messy table covered with trash and dinnerware. We then collect human data that introduces new objects like kitchen tools, and benchmark π 0.5 + ego on these new objects. For this longer-horizon task the score measures the number of correctly placed objects.
Task transfer: Our robot data has coverage for picking eggs and packing them into a carton. We collect human data that sorts eggs into two cartons based on color and benchmark π 0.5 + ego on this new task. For this longer-horizon task the score measures the number of correctly placed eggs.
For more details about task setup and scoring, see Section A.
B. The π 0.5 + ego recipe enables generalization to unseen scenes, objects, and tasks
We report transfer results on our suite of benchmark tasks in Figure 7. Across all three generalization axes we find that targeted human data collection and co-training can significantly improve policy generalization. Concretely, for both scene and object generalization, we see substantially higher task scores after co-training: Spice: 32% → 71%; Dresser: 25% → 50%; Bussing: 53% → 63%. Notably, we also see strong task transfer from human video in the egg sorting task: while a policy trained on robot data only has the basic manipulation skills to pick and place eggs, it has no concept of sorting and simply places eggs into cartons randomly (57% sorting accuracy). In contrast, once co-trained with human egg sorting videos, the robot policy is able to sort eggs with 78% accuracy, and on average placed 4 more eggs correctly than π 0.5.
We've established that π 0.5 + ego can leverage embodied human data to expand its capabilities, which leads us to the central question of this work: what enables this transfer? We hypothesize that policy pretraining with a diverse data mix of many scenes, tasks, and embodiments is the key enabler for effective human to robot transfer. Intuitively, VLAs with strong pretraining may learn abstractions that span embodiments, organizing their representations to capture shared structure across domains and thus facilitating transfer. We test this hypothesis in two parts. First, we establish that human to robot transfer on our generalization benchmark increases as a function of pretraining diversity. Then, we analyze our model's learned representations as pretraining diversity increases.
Strong human to robot transfer emerges with diverse pretraining. To evaluate the impact of VLA pretraining on human to robot transfer, we repeat our transfer benchmark experiments with the following, increasingly diverse, pretrained initializations:
• 0%: base VLM initialization only.
• 25%, 50%, 75%, 100%: VLA pre-trained on increasingly diverse robot data, corresponding to fractions of the full diversity of [scene-task] combinations in our data, constrained to the target robot embodiments: ARX and mobile ARX.
• 100% + X-emb: the full π 0.5 VLA pretraining mixture of Intelligence et al. [20], which additionally contains data across numerous non-target robot embodiments.
With each of these pretrained initializations we train two models: one with only robot teleoperation data from the most similar tasks in our dataset, and one which additionally includes human embodiment data for these tasks. This allows us to measure the impact of diverse pretraining on human to robot transfer.
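For concreteness, a subset for the 25% to 100% initializations could be constructed roughly as below, by keeping a fraction of [scene-task] combinations and filtering to the target embodiments; the episode schema and field names are assumptions for illustration, and the X-emb mixture would simply skip the embodiment filter.

```python
import random

def diversity_subset(scene_task_pairs, fraction, target_embodiments, episodes, seed=0):
    """Select a pretraining subset covering `fraction` of [scene, task] combinations.

    episodes: list of dicts with hypothetical "scene", "task", "embodiment" keys.
    Only episodes from the target embodiments are kept.
    """
    rng = random.Random(seed)
    k = int(round(fraction * len(scene_task_pairs)))
    kept_pairs = set(rng.sample(scene_task_pairs, k))
    return [ep for ep in episodes
            if (ep["scene"], ep["task"]) in kept_pairs
            and ep["embodiment"] in target_embodiments]

# Toy usage.
episodes = [{"scene": "home_a", "task": "bussing", "embodiment": "ARX"},
            {"scene": "home_b", "task": "spice", "embodiment": "UR5"}]
pairs = [("home_a", "bussing"), ("home_b", "spice")]
print(diversity_subset(pairs, 0.5, {"ARX", "mobile_ARX"}, episodes))
```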
We report results in Figure 2. Concretely, we report the difference in score between the model finetuned with human data and the model finetuned without it, at different levels of pre-trained model scale. This difference represents the magnitude of the human to robot transfer as a function of pre-training diversity. We find that this transfer significantly increases as a function of pretraining diversity. While with no or little pre-training, VLAs cannot benefit from human data co-training (0%, 25%), VLAs pre-trained on diverse data see significant gains from human data co-training (75%, 100%). Transfer is further improved by pre-training on a diverse cross-embodiment data mix that includes data from diverse, non-target robot embodiments.
We can analyze the scaling trends for each task individually. For instance, in Sort Eggs we see that increased pretraining diversity alone does not enable a robot-only policy to perform Sort Eggs, a task never seen in our robot teleoperation data (Figure 8). However, increased pretraining diversity enables us to transfer significantly more knowledge from human data, which does cover this new task. Similarly, for Dresser, up until our 50% pretrained checkpoint there are no gains from human video co-training, and potentially even negative transfer (Figure 13). But between 75% and 100% + X-emb we see consistent stacked gains on top of the robot-only baseline, even as the pretrained checkpoint gets stronger. More broadly, these results suggest that human to robot transfer will continue to improve as the diversity of our pretrained model grows. This follows our intuition, since we expect that diversity of scenes, tasks, and embodiments ought to improve the model's ability to form embodiment-agnostic abstractions.
Fig. 9: Human data compared to target robot data. For Eggs and Dresser we find that finetuning with comparable amounts of human data (yellow) and robot data (grey) resulted in similarly performant models. For Bussing we observe a larger performance gap to target robot data.
Embodiment-agnostic representations emerge with pretraining scale. We hypothesize that diverse pre-training helps produce embodiment-agnostic representations, which in turn improve human to robot transfer. To probe this, we perform a t-SNE [43] analysis on the output embeddings of the VLA from both human and robot data after co-training (Figure 5). With poor pre-training, the model has disjoint representations across embodiments, suggesting that the model separately fits these distributions. Then, as pretraining diversity increases, the representations converge, suggesting the model builds a unified representation for both embodiments. See Appendix C for more details.
Prior works that operate on less data observe that co-training improves performance but the representations of human and robot remain disjoint, and they propose methods to explicitly improve representational alignment [36]. Our analysis suggests that with sufficiently diverse pretraining, co-training alone can produce aligned representations that facilitate transfer.
D. How does embodied human data compare to data from other robots?
π 0.5 + ego frames human to robot transfer as an instantiation of cross-embodiment transfer, so it is natural to benchmark human to robot transfer against robot to robot transfer. This helps us understand whether we can leverage human data as just another robot embodiment in our mix. We first compare human data to an "upper bound" scenario where we collect target robot data for our benchmark tasks (Figure 9).
For two of the three tasks (Sort Eggs and Dresser), we find that finetuning with human data was nearly as effective as finetuning with in-domain data from the target robot itself. However, we note that on the Bussing task, target robot data was more effective than human data alone (25% vs. 65%).
Next, we study whether embodied human data has roughly the same value as non-target robot data for a new task. Specifically, for the Bussing task we collected 400 demonstrations (7.45 hours) on a UR5 robot and evaluated transfer to an ARX robot. We see a similar trend in the nature of human-to-ARX and UR5-to-ARX transfer: both exceed the baseline, but neither matches data from the target robot embodiment, suggesting that human data transfer and cross-embodiment robot transfer share similar properties.
Fig. 10: Human data compared to cross-embodiment robot data. We compare the transfer from human data on the bussing task to that from another robot (a UR5). We find that both enable lift over the baseline, but neither matches data from the target robot, suggesting that human data transfer parallels cross-embodiment robot transfer.
A natural question is whether human data can only be used to transfer "high-level" semantic concepts, or whether it also transfers "low-level" action prediction. For the Bussing and Eggs tasks, we do not use a high-level policy during evaluation, so the transfer has to come from low-level action prediction. For our mobile tasks, Spice and Dresser, we evaluate a joint high-level + low-level system and ablate the impact of each (Figure 11) by testing robot-only HL + LL, robot-only HL + co-trained LL, co-trained HL + robot-only LL, and co-trained HL + LL. We find that leveraging human data for only the HL or LL policy alone is not as effective as co-training both with human data, suggesting that transfer occurs across both levels.
Fig. 12: Human-worn wrist cameras. We see that wrist cameras provide benefit for the Dresser and Bussing tasks and similar performance elsewhere. This matches our intuition that some (but not all) tasks will benefit from the added observability of wrist cameras.
When we only leverage human data for the HL policy, the low-level policy does not follow commands correctly. For instance, in the spice task we observe a failure mode where "pick up the spice bottle" is misinterpreted by the low-level policy, which picks up bottles that are already on the tray. In the dresser task, when the HL says to "put the necklace in the jewelry box", the LL policy sometimes puts it in the dresser organizer.
Likewise, when we only leverage human data for the LL policy, we get poor HL policy commands. For instance, in the spice task the high-level policy continues to predict "pick up spice bottle" long after the bottle has been picked, blocking task progress. And in the dresser task the HL policy often predicts incorrect actions, like "put the hair clip on the top of the dresser" instead of correctly predicting to put it in the organizer.
To help mitigate the sensor gap between human and robot, we opted to collect human data with small wrist-worn cameras that mimic the wrist cameras on our robot arms. We seek to understand their importance, since it has implications for how to sensorize humans for large-scale data collection. We report transfer results with and without wrist camera observations for the human video data in Figure 12. For some tasks, like Bussing and Dresser, we see improved transfer from leveraging the human-worn wrist cameras, while for other tasks, like Spice and Eggs, transfer does not benefit from the additional camera streams. This matches our expectations, since some tasks rely on wrist cameras for observability more than others. As a result of these experiments, we expect that collecting embodied human data with wrist-worn cameras maximally covers the space of potential tasks.
We study the emergence of human to robot transfer in our proposed recipe, π 0.5 + ego. We find that with limited pretraining diversity, VLAs fail to transfer knowledge from human data, but as pretraining diversity grows past a critical threshold, transfer emerges. While our recipes leverage vast datasets of robot teleoperation data in pretraining, we ultimately use only tens of hours of human data, and this data is collected in an episodic manner. We are moving towards a future with vast datasets of embodied human data, covering both episodic collection like in this work and passive data of people performing everyday tasks. There is additional work to be done to effectively leverage this data during pre-training, but we believe our work lays the groundwork to train VLAs with human data at scale.
Our findings on the emergence of human-to-robot transfer point to a promising future for scaling vision-language-action models. Much like large language models, larger VLAs may not only improve performance but also unlock entirely new capabilities. These capabilities could make it easier to tap into previously hard-to-use data sources and enable more effective transfer across domains, ultimately allowing robotic foundation models to scale even further. Using human video may be just one such capability, and it is exciting to imagine what others might emerge as we continue to scale up robotic foundation models.
Fig. 13: Per-task scaling graphs. Across all tasks we see a clear upward trend in the efficacy of finetuning with human data as pretraining diversity increases. In Eggs we observe that despite pretraining diversity increasing, the model fails to generalize to this OOD task (blue line), but improves significantly with in-distribution human embodiment data (yellow line).
Our human to robot transfer benchmark consists of 4 tasks, each testing a concept shown only in human data.
Bussing: The robot begins with nine kitchen items on a table, and must place each item in either the trash can or the bussing bin. The robot earns one point for each item correctly bussed, and we normalize the score to [0, 1].
Spice: A kitchen island is set with three randomly placed spice bottles and a spice rack. The robot earns a point if it successfully places all spice bottles on the rack. This kitchen is unseen in robot data, but covered in human data.
Dresser: A bedroom dresser is set with a necklace and hair clip. The robot earns a point by correctly putting the necklace in the jewelry box and the hair clip in the accessories organizer. This bedroom is unseen in robot data, but covered in human data.
Sort Eggs: The table is set with two egg cartons (half a dozen each) and a bowl containing six white and six brown eggs. The robot scores one point for each white egg it places in the left carton and each brown egg it places in the right carton. An additional point is awarded for closing each egg carton at the end. Scores are normalized to [0, 1]. While the robot data covers picking eggs, the concept of sorting eggs is only covered in human data.
For each experiment, we perform between 20 and 40 evaluations. All error bars visualize 1 standard error.
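As a worked example of the scoring and error bars, the following sketch normalizes a Sort Eggs episode (assuming 12 egg-placement points plus 2 carton-closing points, i.e. 14 total, which is our reading of the description above) and computes the standard error of the mean over evaluation episodes.

```python
import math

def normalized_sort_eggs_score(correct_eggs, cartons_closed, max_points=14):
    """Normalize the Sort Eggs score to [0, 1].

    Assumes 12 egg-placement points plus one point per closed carton.
    """
    return (correct_eggs + cartons_closed) / max_points

def standard_error(scores):
    """Standard error of the mean over per-episode scores (what the error bars show)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(var / n)

# Toy usage: 20 hypothetical evaluation episodes.
episode_scores = [normalized_sort_eggs_score(e, c) for e, c in [(10, 2), (8, 1)] * 10]
print(sum(episode_scores) / len(episode_scores), standard_error(episode_scores))
```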
In the context of our experiments, performance improves as a function of two factors: pretraining diversity and finetuning on human data. To help disentangle these factors, we compare the robot-only and human + robot performance scaling curves.
First, we consider the zero-shot generalization of our robot-only model (blue robot-only curve). For tasks like Bussing and Spice, zero-shot generalization to these new tasks increases steadily with increased pretraining diversity. However, for Dresser, zero-shot generalization only emerges in our strongest pretrained model. And for Eggs, performance quickly plateaus even with improved pretraining. This tells us that increasing pretraining diversity generally improves zero-shot generalization across tasks, but the effect can be nonlinear.
Interestingly, across these tasks there are points at which zero-shot generalization and transferability from human data (yellow Human + Robot curve) are not necessarily correlated. For instance, for Spice from 50%→75% we see a modest increase in zero-shot generalization but a large increase in transferability. Similarly, for Dresser 75%→100% or Eggs 50%→100% + X-emb we see no improvement in zero-shot generalization but a large increase in transfer from human data. In other words, there are cases where increased pretraining diversity does not improve zero-shot generalization, but it does improve human to robot transfer.
To visualize how the model represents the human and robot data internally, we visualize the output embeddings of the VLA from human and robot data with t-SNE in Figure 5. Specifically, we pass observations from human and robot data through the co-finetuned VLA. We then mean-pool the first 200 output embeddings of the VLA (corresponding roughly to the "task") and visualize them with t-SNE, capturing how well the VLA aligns different observations from humans and robots onto the same task. We observe that this alignment visibly improves with more diverse pre-training, even after all models have been finetuned on the same data.
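A minimal sketch of this analysis, assuming the output embeddings are available as an (examples, tokens, dim) array and using scikit-learn's TSNE, could look like the following; the tensor layout and dimensions are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_for_tsne(output_embeddings, pool_tokens=200):
    """Mean-pool the first `pool_tokens` output embeddings of each example.

    output_embeddings: (num_examples, num_tokens, dim) array of VLA outputs,
    a hypothetical stand-in for the model's actual output tensor layout.
    """
    return output_embeddings[:, :pool_tokens, :].mean(axis=1)

rng = np.random.default_rng(0)
human = rng.standard_normal((64, 256, 512))   # toy human-data embeddings
robot = rng.standard_normal((64, 256, 512))   # toy robot-data embeddings
pooled = np.concatenate([embed_for_tsne(human), embed_for_tsne(robot)], axis=0)

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(pooled)
labels = ["human"] * 64 + ["robot"] * 64      # color by embodiment when plotting
```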