Robust and Generalized Humanoid Motion Tracking

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Learning a general humanoid whole-body controller is challenging because practical reference motions can exhibit noise and inconsistencies after being transferred to the robot domain, and local defects may be amplified by closed-loop execution, causing drift or failure in highly dynamic and contact-rich behaviors. We propose a dynamics-conditioned command aggregation framework that uses a causal temporal encoder to summarize recent proprioception and a multi-head cross-attention command encoder to selectively aggregate a context window based on the current dynamics. We further integrate a fall recovery curriculum with random unstable initialization and an annealed upward assistance force to improve robustness and disturbance rejection. The resulting policy requires only about 3.5 hours of motion data and supports single-stage end-to-end training without distillation. The proposed method is evaluated under diverse reference inputs and challenging motion regimes, demonstrating zero-shot transfer to unseen motions as well as robust sim-to-real transfer on a physical humanoid robot.


💡 Research Summary

The paper tackles the long‑standing challenge of learning a general whole‑body controller for humanoid robots that can reliably track a wide variety of motions despite noisy or imperfect reference signals. The authors propose a dynamics‑conditioned command aggregation framework that couples a causal temporal encoder with a multi‑head cross‑attention command encoder. Recent proprioceptive observations (joint positions, velocities, base angular velocity, gravity direction, previous action) over the last ten simulation steps are embedded via a two‑layer MLP, enriched with sinusoidal positional encodings, and processed by a lightweight causal transformer. The resulting dynamics embedding hₜ summarizes the robot’s current motion state and serves as a query for cross‑attention.
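The dynamics encoder described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: layer widths, head counts, activation choices, and the history length H = 10 are assumed hyperparameters.

```python
import torch
import torch.nn as nn

class DynamicsEncoder(nn.Module):
    """Causal temporal encoder: embeds a short proprioception history and
    summarizes it into a dynamics embedding h_t (sketch with assumed sizes)."""

    def __init__(self, obs_dim: int, d_model: int = 64, history: int = 10):
        super().__init__()
        # two-layer MLP token embedding of each proprioceptive frame
        self.embed = nn.Sequential(
            nn.Linear(obs_dim, d_model), nn.ELU(), nn.Linear(d_model, d_model)
        )
        # fixed sinusoidal positional encodings over the history window
        pos = torch.arange(history).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2)
                        * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe = torch.zeros(history, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # lightweight causal transformer over the embedded history
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.history = history

    def forward(self, obs_hist: torch.Tensor) -> torch.Tensor:
        # obs_hist: (batch, history, obs_dim) of recent proprioception
        x = self.embed(obs_hist) + self.pe
        mask = nn.Transformer.generate_square_subsequent_mask(self.history)
        h = self.transformer(x, mask=mask)     # causal: each step sees only the past
        return h[:, -1]                        # h_t: summary at the latest step
```

The causal mask keeps the encoder deployable online: hₜ at each control step depends only on observations already seen.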

A contextual command window (2L + 1 frames) containing reference base velocities, angular velocities, gravity direction, and joint positions is similarly tokenized. The dynamics embedding is projected to a query vector qₜ, which attends to the command tokens through multi‑head attention. Because the attention weights are conditioned on the current dynamics, the encoder can emphasize command segments that are physically feasible and down‑weight noisy or inconsistent parts (e.g., body penetration, contact mismatches). The aggregated command embedding uₜ is concatenated with the current proprioceptive observation and fed to an actor‑critic policy trained with PPO.
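A minimal sketch of this dynamics-conditioned aggregation, assuming illustrative dimensions (the embedding width, head count, and window half-width L are not the paper's values):

```python
import torch
import torch.nn as nn

class CommandAggregator(nn.Module):
    """Cross-attention command encoder: the dynamics embedding h_t queries a
    (2L+1)-frame command window (sketch with assumed sizes)."""

    def __init__(self, cmd_dim: int, d_model: int = 64, heads: int = 4, L: int = 5):
        super().__init__()
        self.tokenize = nn.Linear(cmd_dim, d_model)   # per-frame command tokens
        self.to_query = nn.Linear(d_model, d_model)   # project h_t -> query q_t
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.window = 2 * L + 1

    def forward(self, h_t: torch.Tensor, cmd_window: torch.Tensor):
        # h_t: (batch, d_model); cmd_window: (batch, 2L+1, cmd_dim)
        q = self.to_query(h_t).unsqueeze(1)           # (batch, 1, d_model)
        kv = self.tokenize(cmd_window)                # (batch, 2L+1, d_model)
        # attention weights are conditioned on current dynamics, so physically
        # infeasible or noisy reference frames can receive low weight
        u, weights = self.attn(q, kv, kv)
        return u.squeeze(1), weights.squeeze(1)       # u_t and per-frame weights
```

In use, uₜ would be concatenated with the current proprioceptive observation and passed to the actor-critic policy, as the summary describes.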

The policy outputs a residual joint‑position correction aₜ, which is added to the reference joint configuration to form a PD setpoint. This residual formulation anchors the controller to the reference motion while allowing fine‑grained adjustments, dramatically improving sample efficiency. The reward combines dense key‑point tracking terms (position, orientation, velocity consistency) with regularization penalties (action smoothness, joint limits, unintended contacts). The critic receives privileged, noise‑free information (reference base height, full link poses, base linear velocity) to produce more accurate value estimates.
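The residual formulation can be written in a few lines. The PD gains, action scale, and function name below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def pd_torques(q, dq, q_ref, action, kp=30.0, kd=1.0, action_scale=0.25):
    """Residual control sketch: the policy output `action` is a correction
    added to the reference joint configuration to form the PD setpoint."""
    q_target = q_ref + action_scale * action   # anchor to the reference motion
    return kp * (q_target - q) - kd * dq       # PD torque toward the setpoint
```

Because a zero action reproduces the reference motion exactly, the policy only has to learn small corrections, which is what makes this formulation sample-efficient.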

Robustness is further enhanced by integrating a fall‑recovery curriculum directly into training. A fraction (15%) of the parallel environments is randomly initialized in unstable poses with diverse contact configurations. Early in training, an upward pulling force sampled from
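The annealed assistance described above might look like the following sketch. The sampling range `f_max` and the linear anneal schedule are assumptions for illustration; the text above does not specify them.

```python
import numpy as np

def assistance_force(step, total_anneal_steps, f_max=200.0, rng=None):
    """Sample an upward assistance force that is linearly annealed to zero
    over training, so the policy gradually loses the assist.
    `f_max` and the linear schedule are assumed, not the paper's values."""
    rng = rng or np.random.default_rng()
    scale = max(0.0, 1.0 - step / total_anneal_steps)  # linear anneal to 0
    return scale * rng.uniform(0.0, f_max)             # applied upward to the base
```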

