World Models Can Leverage Human Videos for Dexterous Manipulation

Reading time: 5 minute
...

📝 Original Info

  • Title: World Models Can Leverage Human Videos for Dexterous Manipulation
  • ArXiv ID: 2512.13644
  • Date: 2025-12-15
  • Authors: Researchers from original ArXiv paper

📝 Abstract

Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.

💡 Deep Analysis

Deep Dive into World Models Can Leverage Human Videos for Dexterous Manipulation.

Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on

📄 Full Content

World Models Can Leverage Human Videos for Dexterous Manipulation Raktim Gautam Goswami1,2,∗, Amir Bar1, David Fan1, Tsung-Yen Yang1, Gaoyue Zhou1,2, Prashanth Krishnamurthy2, Michael Rabbat1, Farshad Khorrami2, Yann LeCun1,2 1FAIR at Meta, 2New York University ∗Work done during internship at Meta Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks. Project Page: https://raktimgg.github.io/dexwm/ Figure 1 We introduce DexWM, a Dexterous Manipulation World Model which predicts future latent states of the environment based on past states and dexterous actions. Trained on large-scale human and non-dexterous robot video data, DexWM learns to simulate complex manipulation trajectories in the latent space. With minimal fine-tuning on a small exploratory robot simulation dataset, DexWM enables robust planning for novel reaching, grasping, and placing tasks in simulation, and achieves zero-shot transfer to real-world robot tasks. 1 arXiv:2512.13644v1 [cs.RO] 15 Dec 2025 1 Introduction As embodied agents become increasingly integrated into daily lives, dexterous manipulation emerges as a critical capability for achieving human-like inter- action with the physical world. Everyday tasks like cooking, as well as high-stakes applications like surgery, demand a level of dexterity that is infeasible with commonly used parallel-jaw grippers. Dexterous grippers are modeled after the human hand and can unlock advanced human skills, including handling complex tools, performing fine-grained movements, and executing in-hand manipulation [39, 63, 69, 93]. Recent advances in deep learning have enabled the development of computer vision based policies for robotic manipulation [22, 23, 37, 56, 59]. However, these approaches face challenges in generalizing to unseen tasks and in planning and executing policies in physical environments [15, 32, 101]. Successful execution requires models to reason about how their actions affect objects and their surroundings; for example, recognizing that opening the gripper when holding an object will cause it to drop. World models can learn environmental dynamics from observation and action [58], and thus offer a promising solution. Early work on learned world models [48, 74, 102] has primarily focused on small- scale tasks with constrained environments and lim- ited action spaces. More recent efforts have extended these approaches to handle complex actions, such as text [2], navigation [10] and whole-body motion [8]. However, the action spaces in these methods are often too coarse to capture the fine-grained information required for precise dexterous control. Moreover, building world models for dexterous manipulation is challenging as there are no large-scale robotic datasets with dexterous grippers. To address these challenges, we propose DexWM (Figure 1), a latent space world model that learns from human data to predict future latent states based on past states and dexterous hand actions. Inspired by recent work [29, 80] that leverages human train- ing data, we pre-train DexWM on EgoDex [47], a large-scale egocentric human interaction dataset, and further incorporate DROID [54] sequences, consisting of non-dexterous robot manipulation, to reduce the embodiment gap. DexWM’s actions are represented as differences in 3D hand keypoints and camera poses, capturing detailed hand configurations and enabling the model to learn how hand posture changes affect the environment. We find that accurately simulating hand locations using the next latent state prediction objective alone is difficult. Therefore, we train DexWM to jointly optimize both the future environment state and the hand configuration, providing a richer learning signal for dexterity. With this auxiliary hand consistency loss, DexWM outperforms existing world models [2, 8, 10] in open-loop trajectory simulation. Furthermore, DexWM enables strong zero-shot trans- fer to dexterous robot manipulation tasks by opti- mizing actions at test time within an MPC frame- wo

…(Full text truncated)…

📸 Image Gallery

action_transfer.png action_transfer.webp atomic_actions.png atomic_actions.webp droid_hands.png droid_hands.webp encoder_ablation.png encoder_ablation.webp exploratory_data.png exploratory_data.webp flow.png flow.webp keypoints.png keypoints.webp model_size_swapped.png model_size_swapped.webp planning.png planning.webp real_world_experiments.png real_world_experiments.webp real_world_rollout.png real_world_rollout.webp rollouts.png rollouts.webp sim_experiments.png sim_experiments.webp teaser_new.png teaser_new.webp

Reference

This content is AI-processed based on ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut