HuMam: Humanoid Motion Control via End-to-End Deep Reinforcement Learning with Mamba
End-to-end reinforcement learning (RL) for humanoid locomotion is appealing for its compact perception-action mapping, yet practical policies often suffer from training instability, inefficient feature fusion, and high actuation cost. We present HuMam, a state-centric end-to-end RL framework that employs a single-layer Mamba encoder to fuse robot-centric states with oriented footstep targets and a continuous phase clock. The policy outputs joint position targets tracked by a low-level PD loop and is optimized with PPO. A concise six-term reward balances contact quality, swing smoothness, foot placement, posture, and body stability while implicitly promoting energy saving. On the JVRC-1 humanoid in mc-mujoco, HuMam consistently improves learning efficiency, training stability, and overall task performance over a strong feedforward baseline, while reducing power consumption and torque peaks. To our knowledge, this is the first end-to-end humanoid RL controller that adopts Mamba as the fusion backbone, demonstrating tangible gains in efficiency, stability, and control economy.
💡 Research Summary
The paper introduces HuMam, a novel end‑to‑end reinforcement‑learning (RL) framework for humanoid locomotion that leverages a single‑layer Mamba encoder as a lightweight fusion backbone. The authors identify three major shortcomings of existing RL‑based humanoid controllers: training instability, inefficient multimodal feature fusion, and high actuation cost. HuMam addresses these by (1) fusing robot‑centric states (joint positions, velocities, base orientation, angular velocity) with external guidance (two upcoming footstep targets, heading, and a continuous gait‑phase clock) into a compact token sequence, (2) processing this sequence with a gated state‑space model (Mamba) that mixes features without recurrence or transformer‑style attention, and (3) shaping the reward with six carefully weighted terms: contact force, swing velocity, foot‑placement error, body orientation, torso height, and upper‑body sway. The reward weights (0.15, 0.15, 0.45, 0.05, 0.05, 0.05) balance stability, accuracy, and implicit energy efficiency.
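The six-term weighted reward can be sketched as a simple weighted sum. The weights below are the ones reported in the summary; the `score` function and the way each term's error is computed are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the six-term reward described above.
# Weights come from the summary; score() is a hypothetical stand-in
# that maps a non-negative error to a value in (0, 1].
import math

REWARD_WEIGHTS = {
    "contact_force": 0.15,
    "swing_velocity": 0.15,
    "foot_placement": 0.45,
    "body_orientation": 0.05,
    "torso_height": 0.05,
    "upper_body_sway": 0.05,
}

def score(error, scale=1.0):
    """Smaller error -> higher score; exact kernel is an assumption."""
    return math.exp(-scale * error)

def total_reward(errors):
    """Weighted sum over the six terms; `errors` is keyed like REWARD_WEIGHTS."""
    return sum(w * score(errors[name]) for name, w in REWARD_WEIGHTS.items())

# With zero error in every term the reward reaches its cap of 0.9
# (the six weights sum to 0.9, so foot placement dominates at half the total).
perfect = {name: 0.0 for name in REWARD_WEIGHTS}
```

Note how the 0.45 weight on foot-placement error makes accurate stepping the dominant objective, while the three 0.05 posture terms act as gentle regularizers.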
The policy outputs desired joint positions for the 12 actuated leg joints at 40 Hz; a low‑gain proportional‑derivative (PD) controller runs at 1000 Hz to convert these targets into torques, effectively embedding low‑level tracking dynamics into the high‑level policy. Training uses Proximal Policy Optimization (PPO) under the same hyper‑parameters and computational budget as a strong feed‑forward baseline.
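The two-rate scheme above can be sketched as a nested loop: each 40 Hz policy step holds one set of joint position targets while the PD loop runs 25 times at 1000 Hz. The gains, the joint count, and the `sim` interface here are illustrative assumptions, not the paper's actual values or API.

```python
# Hedged sketch of the two-rate control loop: policy at 40 Hz,
# PD tracking at 1000 Hz. Gains and simulator hooks are assumptions.

POLICY_HZ = 40
PD_HZ = 1000
SUBSTEPS = PD_HZ // POLICY_HZ  # 25 PD ticks per policy step

KP, KD = 50.0, 2.0  # illustrative low-level gains

def pd_torque(q_target, q, qdot):
    """Per-joint PD torque driving the joint toward the policy's target."""
    return KP * (q_target - q) - KD * qdot

def control_step(policy, state, sim):
    """One 40 Hz policy step: hold the target while the PD loop runs 25x."""
    q_target = policy(state)  # desired positions for the 12 actuated leg joints
    for _ in range(SUBSTEPS):
        q, qdot = sim.joint_state()
        tau = [pd_torque(t, qi, qdi) for t, qi, qdi in zip(q_target, q, qdot)]
        sim.apply_torques(tau)
        sim.advance(1.0 / PD_HZ)
    return sim.observe()
```

Because the PD loop runs inside each policy step, the policy effectively learns on top of the closed-loop tracking dynamics rather than commanding raw torques.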
Experiments are conducted on the JVRC‑1 humanoid in the mc‑mujoco simulator across five locomotion tasks: forward, backward, curved, lateral, and standing. Results show that HuMam achieves faster convergence (steeper learning curves), higher final average returns (≈12 % improvement), and reduced variance across random seeds (≈5 % reduction). Energy metrics improve as well: average power consumption drops by about 8 % and torque peaks are suppressed by roughly 10 %, indicating a more economical control strategy. Qualitatively, the robot exhibits smoother swing phases, more precise foot placements, and fewer destabilizing impacts, even on curved trajectories.
The authors discuss the advantages of the Mamba encoder: near‑linear computational complexity, minimal memory footprint, and gated dynamics that enhance robustness to observation noise. They also acknowledge limitations: the current design relies solely on the current timestep, lacking explicit temporal memory for long‑horizon prediction, and the work is confined to simulation without a sim‑to‑real transfer study. Future directions include augmenting Mamba with temporal modules, integrating exteroceptive sensors (vision, lidar) for more complex terrain, and validating the approach on physical hardware.
In summary, HuMam is the first humanoid RL controller that adopts Mamba for state‑centric multimodal fusion. By combining this efficient backbone with a concise six‑term reward, the framework simultaneously improves learning efficiency, training stability, and control economy, marking a significant step toward practical, energy‑aware humanoid locomotion via deep reinforcement learning.