Human Preference Modeling Using Visual Motion Prediction Improves Robot Skill Learning from Egocentric Human Video
We present an approach to robot learning from egocentric human videos by modeling human preferences in a reward function and optimizing robot behavior to maximize this reward. Prior work on reward learning from human videos attempts to measure the long-term value of a visual state as the temporal distance between it and the terminal state in a demonstration video. These approaches make assumptions that limit performance when learning from video. They must also transfer the learned value function across the embodiment and environment gap. Our method models human preferences by learning to predict the motion of tracked points between subsequent images and defines a reward function as the agreement between predicted and observed object motion in a robot’s behavior at each step. We then use a modified Soft Actor Critic (SAC) algorithm initialized with 10 on-robot demonstrations to estimate a value function from this reward and optimize a policy that maximizes this value function, all on the robot. Our approach is capable of learning on a real robot, and we show that policies learned with our reward model match or outperform prior work across multiple tasks in both simulation and on the real robot.
💡 Research Summary
This paper introduces a novel framework for leveraging large-scale egocentric human video data to accelerate robot skill acquisition. Traditional video-based reward learning methods estimate the long-term value of a visual state by measuring its temporal distance to the final frame of a demonstration. Such approaches suffer from two major drawbacks: (1) they are highly sensitive to sub-optimal human behavior (hesitations, multitasking, non-time-optimal actions), which biases the value estimates of all preceding frames, and (2) the learned value function must transfer across a substantial embodiment and environment gap between human and robot videos, often leading to poor generalization.
To overcome these issues, the authors propose modeling “short‑term human preferences” via a motion‑prediction model. They first process a large collection of human videos (e.g., Ego4D, Epic Kitchens) by extracting object masks and sampling a dense grid of points on each task‑relevant object. A Transformer‑based network, denoted Fθ, is trained to predict the locations of these points at the next time step given the current RGB observation, current point positions, and a normalized progress indicator t/T. In effect, Fθ learns how objects tend to move from one frame to the next in successful human demonstrations.
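To make the training setup concrete, here is a minimal sketch of how supervised (input, target) pairs for a next-step point-prediction model like Fθ could be assembled from a tracked video. The function name, the numpy representation, and the omission of the RGB frame (which the real model also conditions on) are simplifications for illustration, not the authors' implementation:

```python
import numpy as np

def build_training_pairs(tracks, T):
    """Build next-step prediction pairs from one tracked video.

    tracks: (T, N, 2) array of N tracked point locations over T frames.
    Each input pairs the current point positions with the normalized
    progress indicator t/T; the target is the points' locations at the
    next frame -- the quantity the motion-prediction model must output.
    (The real model additionally receives the current RGB observation.)
    """
    inputs, targets = [], []
    for t in range(T - 1):
        progress = t / T                     # normalized progress indicator
        inputs.append((tracks[t], progress))  # current points + progress
        targets.append(tracks[t + 1])         # next-step point locations
    return inputs, targets
```

Training Fθ then reduces to regressing `targets` from `inputs` over a large corpus of human videos, so the model absorbs how task-relevant objects tend to move from one frame to the next.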
The reward for a robot rollout is defined as the per‑step alignment between the predicted point displacements (Δp_pred) and the actual displacements observed in the robot’s own video (Δp_track). Specifically, for each point p, the cosine similarity between Δp_pred and Δp_track is computed, and the reward r_t is the maximum of zero and this similarity. This yields a dense, step‑wise signal that rewards the robot for reproducing the one‑step motion preferences exhibited by humans, while being robust to any earlier hesitation or distraction in the human data because each time step is evaluated independently.
Policy learning proceeds in two stages. First, a small set of on‑robot demonstrations (10 trajectories) is used to train a behavior‑cloning policy π_base, providing a reasonable initial behavior and reducing the exploration burden. Second, a modified Soft Actor‑Critic (SAC) algorithm is employed in a residual‑RL setting: the base policy is kept fixed (or fine‑tuned with a very low learning rate) while a residual policy π_residual is learned to correct π_base. The motion‑prediction reward r_t is fed to SAC, which simultaneously learns a value function V(s) and a Q‑function Q(s,a). Entropy regularization in SAC encourages exploration, and an online buffer of robot experience is interleaved with the offline demonstrations to achieve high sample efficiency.
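The residual-RL action composition described above can be sketched in a few lines. The additive form and the `scale` factor bounding the correction are common residual-RL conventions and are our assumptions here, not details confirmed by the summary:

```python
import numpy as np

def residual_action(state, pi_base, pi_residual, scale=0.1):
    """Residual-RL action composition.

    The frozen (or slowly fine-tuned) behavior-cloned base policy
    proposes an action for the current state, and the SAC-trained
    residual policy adds a bounded correction on top of it. `scale`
    limits how far the residual can deviate from the base behavior.
    """
    return pi_base(state) + scale * pi_residual(state)
```

In this setup, SAC's actor only has to learn the (typically small) correction term, which is a large part of why 10 demonstrations plus an interleaved online buffer can suffice for sample-efficient learning.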
The authors evaluate the approach on three real‑world manipulation tasks—opening a microwave, grasping a cup, and opening a door—as well as corresponding simulated versions. Compared to prior methods that use temporal‑distance‑based visual rewards (e.g., VIP), the proposed method achieves significantly higher success rates, often improving performance by more than 30 % within a single hour of real‑world interaction. Importantly, the method remains stable even when the human videos contain pauses or non‑optimal actions, confirming that the short‑term motion‑prediction reward is less biased than long‑term value estimates.
Key contributions include: (1) reframing human video supervision as a point‑motion prediction problem that captures immediate preferences, (2) converting these predictions into a dense, per‑step reward that directly guides robot behavior, and (3) integrating this reward with a sample‑efficient residual SAC pipeline that leverages a tiny amount of on‑robot data. Limitations are acknowledged: the approach relies on accurate point tracking and object masks, which may be challenging for deformable, reflective, or occluded objects; it also assumes that task‑relevant objects can be identified a priori. Future work could explore integrating tactile or force feedback, learning keypoints end‑to‑end rather than using fixed grids, and extending the method to fully unstructured environments where object segmentation is not readily available. Overall, the paper demonstrates that modeling short‑term visual motion preferences is a powerful and practical way to extract useful supervision from abundant human video data for real‑world robot learning.