ECO: Energy-Constrained Optimization with Reinforcement Learning for Humanoid Walking
Achieving stable and energy-efficient locomotion is essential for humanoid robots to operate continuously in real-world applications. Existing MPC and RL approaches often rely on energy-related metrics embedded within a multi-objective optimization framework, which requires extensive hyperparameter tuning and tends to yield suboptimal policies. To address these challenges, we propose ECO (Energy-Constrained Optimization), a constrained RL framework that separates energy-related metrics from rewards, reformulating them as explicit inequality constraints. This method provides a clear and interpretable physical representation of energy costs, enabling more efficient and intuitive hyperparameter tuning for improved energy efficiency. ECO introduces dedicated constraints for energy consumption and reference motion, enforced by the Lagrangian method, to achieve stable, symmetric, and energy-efficient walking for humanoid robots. We evaluated ECO against MPC, standard RL with reward shaping, and four state-of-the-art constrained RL methods. Experiments, including sim-to-sim and sim-to-real transfers on the kid-sized humanoid robot BRUCE, demonstrate that ECO significantly reduces energy consumption compared to baselines while maintaining robust walking performance. These results highlight a substantial advancement in energy-efficient humanoid locomotion. All experimental demonstrations can be found on the project website: https://sites.google.com/view/eco-humanoid.
💡 Research Summary
The paper introduces ECO (Energy‑Constrained Optimization), a novel constrained reinforcement‑learning (RL) framework designed to improve the energy efficiency of humanoid robot walking while preserving stability and tracking performance. Traditional model‑predictive control (MPC) and RL approaches embed energy‑related terms directly into the cost or reward function, requiring extensive tuning of weight coefficients that lack clear physical meaning. This often yields sub‑optimal policies in which energy savings conflict with stability or task achievement.
ECO resolves this by decoupling energy consumption from the reward and treating it as an explicit inequality constraint within a Constrained Markov Decision Process (CMDP). The primary reward remains focused on velocity tracking, reference motion fidelity, and posture stability, while two dedicated constraints are enforced: (1) an energy consumption constraint that limits the average cost‑of‑transport (CoT) over an episode, and (2) a reference‑motion constraint that encourages symmetric gait and reduced body shaking. Both constraints are incorporated via a Lagrangian formulation, yielding the classic PPO‑Lagrangian (PPO‑Lag) update: the policy parameters are optimized to maximize the reward, while Lagrange multipliers are adjusted to keep the constraint violations below predefined thresholds.
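The dual update at the heart of the Lagrangian formulation can be sketched compactly. The code below is a minimal illustration, not the authors' implementation: `update_multipliers` performs projected gradient ascent on the multipliers (one per constraint, e.g. energy and reference motion), and `lagrangian_objective` shows the penalized objective the policy would maximize. Function names, the learning rate, and the per-episode cost estimates are assumptions for illustration.

```python
import numpy as np

def update_multipliers(lambdas, episode_costs, thresholds, lr=0.01):
    """Dual ascent on the Lagrange multipliers: each multiplier grows
    while its constraint is violated (cost above threshold) and decays
    toward zero otherwise. Projection keeps multipliers non-negative."""
    lambdas = np.asarray(lambdas, dtype=float)
    violations = np.asarray(episode_costs, dtype=float) - np.asarray(thresholds, dtype=float)
    return np.maximum(0.0, lambdas + lr * violations)

def lagrangian_objective(reward, episode_costs, lambdas, thresholds):
    """Scalar objective the policy maximizes: task reward minus the
    multiplier-weighted constraint violations."""
    penalty = sum(l * (c - d) for l, c, d in zip(lambdas, episode_costs, thresholds))
    return reward - penalty
```

In a PPO‑Lag loop these two steps alternate: the policy takes a clipped PPO step on the penalized objective, then the multipliers are nudged according to the measured constraint costs, so a persistently violated constraint gradually dominates the reward signal.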
The authors evaluate four state‑of‑the‑art constrained RL algorithms—PPO‑Lag, CRPO, IPO, and P3O—under identical settings. Empirical results show that PPO‑Lag converges fastest (≈2 M steps), maintains zero constraint violations, and achieves the lowest energy cost. In contrast, CRPO and IPO exhibit frequent constraint breaches and unstable learning curves.
Experiments are conducted both in simulation (MuJoCo model of the kid‑sized humanoid “BRUCE”) and on the physical robot. Compared with a well‑tuned MPC baseline and a standard PPO baseline with reward‑shaped energy penalties, ECO reduces the CoT roughly sixfold relative to MPC and by a factor of about 2.3 relative to PPO, while preserving walking speed and stability. Qualitatively, the learned gait displays “extended knee movements,” “lighter steps,” and “reduced body shaking,” traits that are advantageous for loco‑manipulation tasks because they minimize disturbances to the upper body.
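For reference, cost of transport is the standard dimensionless efficiency metric used in such comparisons: energy spent per unit weight per unit distance traveled, CoT = E / (m·g·d). The sketch below computes it from logged joint torques and velocities; the energy model (mechanical power with regeneration ignored) and all names are illustrative assumptions, not the paper's exact accounting.

```python
import numpy as np

def mechanical_energy(torques, joint_vels, dt):
    """Approximate energy from per-step joint torques (N*m) and joint
    velocities (rad/s). Negative (regenerative) power is clipped to zero."""
    power = np.sum(np.asarray(torques) * np.asarray(joint_vels), axis=-1)
    return float(np.sum(np.maximum(power, 0.0)) * dt)

def cost_of_transport(energy_joules, mass_kg, distance_m, g=9.81):
    """Dimensionless cost of transport: E / (m * g * d). Lower is better."""
    return energy_joules / (mass_kg * g * distance_m)
```

A lower CoT at the same walking speed directly corresponds to longer operating time on the same battery, which is why the authors use it both as the constrained quantity and as the evaluation metric.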
A notable contribution is the physically interpretable hyper‑parameter tuning for the energy constraint. Instead of searching a high‑dimensional weight space, the authors perform a simple linear search on the energy threshold, directly guided by measured CoT values. This dramatically reduces the tuning burden and makes the method more accessible for practitioners.
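That one-dimensional search can be written down in a few lines. The sketch below assumes a user-supplied callable (hypothetical here) that trains a policy at a given CoT threshold and reports the measured CoT plus whether walking remained stable; the grid parameters and selection rule are illustrative.

```python
def linear_threshold_search(train_and_measure, d_start, d_step, n_points):
    """Sweep the CoT threshold d over a 1-D grid. `train_and_measure(d)`
    is assumed to return (measured_cot, walk_ok) for one training run.
    Returns all records and the most efficient feasible setting."""
    results = []
    for i in range(n_points):
        d = d_start + i * d_step
        measured_cot, walk_ok = train_and_measure(d)
        results.append((d, measured_cot, walk_ok))
    # Among thresholds that still produce stable walking, pick the one
    # with the lowest measured cost of transport.
    feasible = [r for r in results if r[2]]
    best = min(feasible, key=lambda r: r[1]) if feasible else None
    return results, best
```

Because the search variable is a physical quantity (a CoT bound) rather than an opaque reward weight, each grid point has a direct interpretation, which is the accessibility gain the paper highlights.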
The paper also discusses limitations. The energy constraint is static; it does not adapt to varying battery levels, terrain inclines, or payload changes. Moreover, scaling the Lagrangian approach to many simultaneous constraints could lead to instability in multiplier updates, suggesting future work on adaptive multiplier schemes or hierarchical constraint handling.
In summary, ECO demonstrates that formulating energy consumption as a hard, physically meaningful constraint—rather than a soft penalty—enables efficient, stable, and interpretable learning of humanoid walking policies. The framework bridges the gap between simulation and real‑world deployment, achieving substantial energy savings on a real robot without sacrificing performance, and opens avenues for extending constrained RL to more complex, multi‑objective humanoid tasks.