Dexterous Manipulation Policies from RGB Human Videos via 3D Hand-Object Trajectory Reconstruction
Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 3D hand-object trajectories from monocular videos by estimating human hand poses and object meshes, then retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at videomanip.github.io.
💡 Research Summary
The paper introduces VIDEOMANIP, a device‑free framework that learns dexterous multi‑fingered robot manipulation policies directly from ordinary RGB human videos, without any wearables, depth sensors, or robot demonstrations. The authors address three core challenges: (1) reconstructing accurate 3‑D hand‑object trajectories from monocular video, (2) aligning those trajectories to a robot‑centric world frame even for “in‑the‑wild” footage lacking calibration, and (3) turning the noisy reconstructed data into physically plausible, diverse training demonstrations.
Reconstruction pipeline.
Given a video V = {I₁,…,I_T}, the system first runs MoGe‑2 to obtain metric depth maps and camera intrinsics, establishing a common metric 3‑D coordinate system. Object masks are extracted with Segment‑Anything‑Model 2 (SAM 2); the masked crops are fed to MeshyAI for image‑to‑mesh generation. Because the mesh lacks real‑world scale, the authors query GPT‑4.1 for a coarse size estimate based on the object’s semantic label, then refine the scale by evaluating a set of candidate scalings (0.5×–2×) with FoundationPose and selecting the one that minimizes rendering error against the original masks. Human hand meshes are recovered with HaMeR, which outputs low‑dimensional pose (θ) and shape (β) parameters. Depth ambiguity in HaMeR is resolved by averaging MoGe‑2 depth values at the 2‑D keypoints, yielding a corrected hand depth. The hand mesh and object mesh are thus placed in the same metric space.
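The candidate-scale search can be sketched as follows. Here `render_mask_fn` is a stand-in for the FoundationPose render-and-compare step, and the function names and the 7-point candidate grid are illustrative assumptions, not the paper's API:

```python
import numpy as np

def select_object_scale(render_mask_fn, gt_masks, base_scale,
                        candidates=np.linspace(0.5, 2.0, 7)):
    """Pick the mesh scale whose rendered masks best match the
    observed SAM 2 masks (lower IoU-style error is better).

    render_mask_fn(scale) -> list of binary masks, one per frame
    (a stand-in for FoundationPose render-and-compare).
    """
    best_scale, best_err = None, float("inf")
    for s in candidates:
        rendered = render_mask_fn(base_scale * s)
        # mean (1 - IoU) error against the ground-truth masks
        err = np.mean([
            1.0 - (np.logical_and(r, g).sum()
                   / max(np.logical_or(r, g).sum(), 1))
            for r, g in zip(rendered, gt_masks)
        ])
        if err < best_err:
            best_scale, best_err = base_scale * s, err
    return best_scale
```

The GPT-4.1 size estimate supplies `base_scale`, so the search only has to correct a coarse semantic guess rather than recover absolute scale from scratch.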
Robot retargeting.
Using the robot’s URDF, a set of robot link keypoints is defined. An optimization aligns these robot keypoints with the corresponding human hand joints, producing a full robot configuration q_t (wrist pose + finger joint angles) for each frame. This yields a time‑indexed robot‑object trajectory suitable for imitation learning.
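A minimal per-frame version of this keypoint-matching optimization, using a toy two-link planar finger in place of the URDF-defined kinematics (link lengths and function names are illustrative, not the Inspire or LEAP Hand specs):

```python
import numpy as np
from scipy.optimize import least_squares

# Toy 2-link planar finger standing in for URDF-based kinematics.
LINK1, LINK2 = 0.04, 0.03  # link lengths in meters (assumed)

def fingertip(q):
    """Forward kinematics: joint angles -> fingertip position."""
    return np.array([
        LINK1 * np.cos(q[0]) + LINK2 * np.cos(q[0] + q[1]),
        LINK1 * np.sin(q[0]) + LINK2 * np.sin(q[0] + q[1]),
    ])

def retarget_frame(human_tip, q_init):
    """Solve for joint angles whose robot keypoint (the fingertip)
    matches the corresponding human hand keypoint for one frame."""
    res = least_squares(lambda q: fingertip(q) - human_tip, q_init)
    return res.x
```

In the full pipeline the same least-squares objective is stacked over all defined link keypoints (wrist plus fingers), and warm-starting each frame from the previous solution keeps the trajectory temporally smooth.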
Calibration for wild videos.
In‑scene videos can be transformed to the robot base via a known extrinsic T_cam. Wild videos lack such calibration; instead the authors apply GeoCalib, a single‑image method that infers the gravity direction from visual cues. The estimated camera‑to‑gravity rotation ᵍʳᵃᵛR_cam aligns gravity with the negative z‑axis, and this rotation is applied to all reconstructed meshes and robot configurations, producing gravity‑aligned trajectories that share a common horizontal plane with in‑scene data. Full world‑frame recovery is unnecessary for policy learning.
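Given an estimated gravity direction in camera coordinates, the leveling rotation can be built with the standard Rodrigues construction; this is a sketch of how the gravity estimate is used, not the GeoCalib API:

```python
import numpy as np

def gravity_alignment(g_cam):
    """Rotation R such that R @ g_cam points along -z, leveling a
    wild-video reconstruction onto a common horizontal plane."""
    g = np.asarray(g_cam, dtype=float)
    g = g / np.linalg.norm(g)
    target = np.array([0.0, 0.0, -1.0])
    v = np.cross(g, target)          # rotation axis (unnormalized)
    c = float(np.dot(g, target))     # cosine of rotation angle
    if np.isclose(c, 1.0):           # already aligned
        return np.eye(3)
    if np.isclose(c, -1.0):          # opposite: 180 deg about x
        return np.diag([1.0, -1.0, -1.0])
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + K @ K / (1.0 + c)  # Rodrigues formula
```

Applying the returned matrix to every reconstructed mesh vertex and wrist pose yields the gravity-aligned trajectories described above.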
Physical feasibility and data augmentation.
Reconstructed trajectories often contain interpenetrations or unrealistic contacts due to mesh‑scale errors. To mitigate this, a differentiable hand‑object contact optimization is performed: a contact map is predicted and a loss term penalizes distance and normal misalignment between hand and object surfaces, yielding contact‑aware hand poses. Moreover, a single video provides only one demonstration, which is insufficient for robust policy training. The authors adopt DemoGen’s skill‑motion decomposition, splitting each trajectory into a grasp phase (approach → stable grasp) and a manipulation phase (post‑grasp actions). DemoGen then synthesizes many spatially randomized demonstrations by perturbing object pose, hand pose, and trajectory timing, dramatically increasing data diversity while preserving the underlying skill.
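The contact objective can be sketched as a per-point penalty on hand-to-object distance plus normal misalignment (normals at a contact should be anti-parallel). The function and argument names are illustrative, and a real implementation would use differentiable nearest-neighbor queries rather than this brute-force version:

```python
import numpy as np

def contact_loss(hand_pts, hand_normals, obj_pts, obj_normals,
                 contact_weights):
    """Sum over hand surface points flagged by the contact map of
    (distance to nearest object point) + (1 + cos of normal angle),
    which is zero when contacts touch with anti-parallel normals."""
    # nearest object point for every hand point (brute force)
    d2 = np.sum((hand_pts[:, None, :] - obj_pts[None, :, :]) ** 2,
                axis=-1)
    nn = np.argmin(d2, axis=1)
    dist = np.sqrt(d2[np.arange(len(hand_pts)), nn])
    # cos = -1 (anti-parallel normals) makes this term vanish
    cos = np.sum(hand_normals * obj_normals[nn], axis=-1)
    normal_term = 1.0 + cos
    return float(np.sum(contact_weights * (dist + normal_term)))
```

Minimizing this loss over the hand pose parameters pulls penetrating or floating fingers onto the object surface, producing the contact-aware poses used for training.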
Learning and evaluation.
Point‑cloud‑based policies are trained on the reconstructed, contact‑optimized, and augmented data. In simulation with the Inspire Hand, the learned grasping model achieves a 70.25% success rate across 20 diverse objects. In real‑world experiments with the LEAP Hand, policies trained from RGB videos attain an average 62.86% success rate across seven manipulation tasks (three from in‑scene videos, four from in‑the‑wild videos), outperforming retargeting‑based baselines by 15.87% absolute. The wild‑video policies perform comparably to in‑scene ones, demonstrating successful transfer despite the lack of explicit camera‑to‑world calibration.
Contributions.
- A fully device‑free pipeline that reconstructs 3‑D hand‑object trajectories from single RGB videos, leveraging depth estimation, segmentation, image‑to‑mesh generation, and language‑model‑guided scale estimation.
- Contact‑aware grasp modeling via differentiable hand‑object contact optimization, ensuring physically plausible demonstrations.
- Demonstration synthesis (DemoGen) that expands one video into many diverse trajectories, enabling robust policy learning without any robot‑side data.
- Empirical validation in both simulation and real robots, showing that policies learned solely from human videos can achieve high success rates on dexterous manipulation tasks.
Overall, VIDEOMANIP demonstrates that large‑scale, readily available human video corpora can be transformed into valuable robot learning data, removing a major bottleneck in dexterous manipulation research and opening the door to scalable, vision‑driven robot skill acquisition.