EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data
Human behavior is among the most scalable sources of data for learning physical intelligence, yet how to effectively leverage it for dexterous manipulation remains unclear. While prior work demonstrates human-to-robot transfer in constrained settings, it is unclear whether large-scale human data can support fine-grained, high-degree-of-freedom dexterous manipulation. We present EgoScale, a human-to-dexterous-manipulation transfer framework built on large-scale egocentric human data. We train a Vision-Language-Action (VLA) model on over 20,854 hours of action-labeled egocentric human video, more than 20 times larger than prior efforts, and uncover a log-linear scaling law between human data scale and validation loss. This validation loss strongly correlates with downstream real-robot performance, establishing large-scale human data as a predictable supervision source. Beyond scale, we introduce a simple two-stage transfer recipe: large-scale human pretraining followed by lightweight, aligned human-robot mid-training. This enables strong long-horizon dexterous manipulation and one-shot task adaptation with minimal robot supervision. Our final policy improves the average success rate by 54% over a no-pretraining baseline on a 22-DoF dexterous robotic hand, and transfers effectively to robots with lower-DoF hands, indicating that large-scale human motion provides a reusable, embodiment-agnostic motor prior.
💡 Research Summary
EgoScale introduces a two‑stage framework for transferring large‑scale egocentric human manipulation data to dexterous robot control. The authors first assemble an unprecedented dataset of 20,854 hours of egocentric video, covering nearly 10 k scenes, 6 k tasks, and over 43 k objects. Each video is paired with estimated camera pose and 21‑point hand pose (including the wrist). From these raw streams they derive a unified action representation: (1) relative wrist motion ΔW between consecutive frames, which is invariant to global camera movement, and (2) a 22‑DoF joint‑angle vector obtained by mapping the 21 human keypoints onto the Sharpa hand’s joint space via an optimization that respects joint limits and kinematic constraints. This representation is deliberately shared across human demonstrations and robot executions, providing a common “language” for cross‑embodiment learning.
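The camera-motion invariance of the relative wrist representation ΔW can be sketched in a few lines of numpy. This is a minimal illustration under the assumption that ΔW is the SE(3) relative transform between consecutive wrist poses expressed in the camera frame; the function names and the simple pure-translation camera shift are ours, not the paper's implementation.

```python
import numpy as np

def relative_wrist_motion(T_prev, T_next):
    """Relative wrist transform between consecutive frames.

    T_prev, T_next: 4x4 homogeneous wrist poses in the (moving) camera
    frame. Left-composing with the inverse of the previous pose cancels
    any rigid motion applied equally to both frames, i.e. global camera
    movement drops out of the representation.
    """
    return np.linalg.inv(T_prev) @ T_next

def translate(T, offset):
    """Apply a pure camera translation to a pose (illustrative only)."""
    T2 = T.copy()
    T2[:3, 3] += offset
    return T2

W0 = np.eye(4)
W1 = np.eye(4)
W1[:3, 3] = [0.10, 0.00, 0.05]  # wrist moved 10 cm forward, 5 cm up

dW = relative_wrist_motion(W0, W1)

# Shifting the camera by the same amount in both frames leaves dW unchanged.
cam_shift = np.array([1.0, -2.0, 0.5])
dW_shifted = relative_wrist_motion(translate(W0, cam_shift),
                                   translate(W1, cam_shift))
```

The same cancellation holds for any rigid camera motion `C`, since `(C·T_prev)⁻¹ (C·T_next) = T_prev⁻¹ T_next`; the pure translation above is just the simplest case to verify.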
In Stage I, a flow‑based Vision‑Language‑Action (VLA) model is pretrained on the entire human corpus. The model receives an RGB image and a natural‑language instruction, encodes them with a vision‑language transformer, and predicts a chunk of future actions using a flow‑matching objective. Training runs for 100 k steps on 200 × 256 GB GPUs with a global batch size of 8 192 and a learning rate of 5 × 10⁻⁵, fully fine‑tuning all parameters. The authors discover a clear log‑linear scaling law: the validation loss on wrist‑hand prediction decreases linearly in the logarithm of the human data volume. Crucially, this loss correlates strongly with downstream real‑robot performance, establishing the human dataset as a predictable source of supervision.
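The flow‑matching objective can be sketched generically. Below is a minimal numpy version of rectified‑flow velocity regression on action chunks; the linear interpolant, time sampling, and action dimensions are our assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def flow_matching_loss(model, actions, rng):
    """One flow-matching training loss on a batch of action chunks.

    actions: (B, H, D) chunks of future wrist + hand actions.
    A point x_t is sampled on the straight line between Gaussian noise
    (t=0) and the data (t=1); the model regresses the constant velocity
    (actions - noise) of that interpolant.
    """
    B = actions.shape[0]
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(size=(B, 1, 1))              # one time per chunk
    x_t = (1.0 - t) * noise + t * actions        # linear interpolant
    target_v = actions - noise                   # interpolant velocity
    pred_v = model(x_t, t)
    return np.mean((pred_v - target_v) ** 2)

rng = np.random.default_rng(0)
actions = rng.standard_normal((8, 16, 28))       # e.g. 16-step, 28-D chunks
zero_model = lambda x_t, t: np.zeros_like(x_t)   # untrained stand-in model
loss = flow_matching_loss(zero_model, actions, rng)
```

With the zero stand-in model the loss is simply the second moment of `actions - noise`, about 2 for unit-Gaussian data; a trained network that predicts the velocity well would drive it toward zero.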
Stage II introduces a small, embodiment‑aligned dataset that contains both human and teleoperated robot trajectories captured under identical sensor setups (head‑mounted RGB, wrist‑mounted cameras, Vive trackers for wrist pose, and Manus gloves for hand pose). This dataset comprises 344 tabletop tasks, roughly 30 human and 5 robot trajectories per task, amounting to 50 hours of human data and only 4 hours of robot data. During mid‑training (50 k steps, batch size 2 048, LR = 3 × 10⁻⁵) the vision‑language backbone is frozen while the vision encoder and a DiT action expert are updated, anchoring the pretrained representations to the robot’s proprioceptive and joint‑command spaces. Lightweight MLP adapters handle embodiment‑specific inputs and outputs, enabling the same core model to be reused for different hands (e.g., the 22‑DoF Sharpa hand and the 3‑finger Unitree G1 hand).
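The adapter pattern here — one shared frozen core, small per-embodiment heads — can be sketched as follows. All class names, layer sizes, and the G1 action dimension are illustrative assumptions; the paper's actual adapters sit inside a transformer VLA, not a toy two-layer MLP.

```python
import numpy as np

class MLPAdapter:
    """Tiny two-layer MLP standing in for an embodiment adapter."""
    def __init__(self, d_in, d_out, rng, d_hidden=32):
        self.W1 = rng.standard_normal((d_in, d_hidden)) * 0.05
        self.W2 = rng.standard_normal((d_hidden, d_out)) * 0.05
    def __call__(self, x):
        return np.maximum(x @ self.W1, 0.0) @ self.W2  # ReLU hidden layer

rng = np.random.default_rng(1)
D_MODEL = 64

# Shared core (frozen after pretraining in the paper); here just another MLP.
core = MLPAdapter(D_MODEL, D_MODEL, rng)

# One output head per embodiment; dimensions = hand joints + 6-DoF wrist.
heads = {
    "sharpa_22dof": MLPAdapter(D_MODEL, 22 + 6, rng),
    "g1_3finger":   MLPAdapter(D_MODEL, 7 + 6, rng),   # 7 is our guess
}

feat = rng.standard_normal((1, D_MODEL))
shared = core(feat)                  # same representation for every robot
a_sharpa = heads["sharpa_22dof"](shared)
a_g1 = heads["g1_3finger"](shared)
```

The point of the design is that only the small `heads` (and, per the paper, the vision encoder and DiT expert) are updated during mid-training, so the pretrained human motor prior in the core is reused across hands rather than relearned.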
In Stage III the model is fine‑tuned on five challenging dexterous manipulation tasks (syringe injection, fruit picking, shirt folding, bottle unscrewing, and cloth folding). Each task provides 100 tele‑operated robot demonstrations (20 for the deformable shirt‑folding task). After fine‑tuning (10 k steps, batch size 512, LR = 3 × 10⁻⁵) the resulting policy achieves an average success‑rate improvement of 54 % over a baseline trained from scratch. Remarkably, with only a single robot demonstration (one‑shot), the policy reaches 88 % success on shirt folding, demonstrating emergent few‑shot generalization. Moreover, the same human‑pretrained policy transfers to the low‑DoF G1 hand, yielding more than a 30 % absolute gain in success across evaluated tasks, confirming that the learned motor prior is largely embodiment‑agnostic.
The paper answers five research questions: (RQ1) large‑scale human pretraining substantially boosts dexterous manipulation; (RQ2) the log‑linear scaling law predicts performance gains as data grows; (RQ3) a modest amount of aligned human‑robot data is sufficient to enable few‑shot adaptation; (RQ4) the learned representations generalize across robots with very different kinematics; and (RQ5) explicit supervision of wrist motion and high‑DoF hand articulation is critical for downstream success.
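The log‑linear relationship behind RQ2 can be made concrete with a toy fit: if validation loss follows loss = a − b·log(hours), a least‑squares fit in log‑space recovers the coefficients and can extrapolate to larger data scales. The numbers below are synthetic and purely illustrative, not the paper's measurements.

```python
import numpy as np

# Synthetic validation losses generated exactly on a log-linear law
# (a_true, b_true are made-up coefficients, not from the paper).
hours = np.array([100.0, 500.0, 2000.0, 8000.0, 20854.0])
a_true, b_true = 1.2, 0.08
val_loss = a_true - b_true * np.log(hours)

# Fit loss = intercept + slope * log(hours); polyfit returns [slope, intercept].
slope, intercept = np.polyfit(np.log(hours), val_loss, 1)

# Extrapolate the fitted law to a hypothetical larger dataset.
pred_40k = intercept + slope * np.log(40000.0)
```

Because the synthetic points lie exactly on the law, the fit recovers `slope = -b_true` and `intercept = a_true`; on real measurements one would additionally check residuals before trusting the extrapolation.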
Key contributions are: (1) empirical validation of a scaling law linking human data volume to robot manipulation performance; (2) a simple yet effective two‑stage transfer recipe that decouples data scale from embodiment alignment; (3) demonstration of emergent one‑shot transfer and cross‑embodiment generalization, suggesting that massive, noisy human video can serve as a scalable, reusable motor prior for future robot learning systems. The work points toward a future where human demonstrations are treated as a parallel, scalable embodiment alongside robots, dramatically reducing the need for expensive robot‑specific data collection.