Automatic Reward Shaping from Multi-Objective Human Heuristics

Reading time: 5 minutes

📝 Original Info

  • Title: Automatic Reward Shaping from Multi-Objective Human Heuristics
  • ArXiv ID: 2512.15120
  • Date: 2025-12-17
  • Authors: Yuqing Xie, Jiayu Chen, Wenhao Tang, Ya Zhang, Chao Yu, and Yu Wang

📝 Abstract

Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to that obtained with manually tuned reward functions.

💡 Deep Analysis

Deep Dive into Automatic Reward Shaping from Multi-Objective Human Heuristics.


📄 Full Content

AUTOMATIC REWARD SHAPING FROM MULTI-OBJECTIVE HUMAN HEURISTICS

Yuqing Xie(1), Jiayu Chen(1), Wenhao Tang(1), Ya Zhang(2), Chao Yu(1), and Yu Wang(1)
(1) Tsinghua University, (2) Shanghai Jiao Tong University

ABSTRACT

Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to that obtained with manually tuned reward functions.

1 INTRODUCTION

Reward design remains a fundamental challenge in robot learning, particularly when multiple conflicting objectives must be balanced. A typical task definition might involve the robot traversing a specified distance (e.g., 2 meters) within a given time limit (e.g., 10 seconds) while maintaining torque below a safety limit (e.g., 45 Nm). While the success criteria for the task are straightforward to define, their sparsity makes them inadequate as rewards for effective policy optimization (Andrychowicz et al., 2018).

To facilitate policy learning, researchers often design dense heuristic reward functions, such as rewarding higher velocities or penalizing large joint movements to minimize energy cost.
However, these heuristics frequently conflict with one another, necessitating labor-intensive manual tuning to combine them into a final reward function. Studies show that over 90% of reinforcement learning (RL) practitioners devote substantial effort to adjusting reward functions, highlighting a critical scalability bottleneck (Booth et al., 2023).

Our work proposes a self-supervised reward shaping framework that automates the reward design process. The approach requires RL practitioners to specify only: (1) a task performance criterion, which scores a trajectory at the end of the episode, and (2) a set of heuristic reward functions, which provide dense feedback at each step. The system then automatically learns an optimal combination of these heuristics through a bi-level optimization process (Zhang et al., 2024b), where the inner loop trains RL policies under the current reward function, and the outer loop adapts the reward function itself to maximize task performance.

However, it is not trivial to apply bi-level optimization to shape the rewards for real-world robotic tasks. Consider the robotic dog benchmark in (Margolis & Agrawal, 2023), which defines 15 distinct objectives, constituting an enormous space of valid reward combinations. With such a large reward weight space, the weight–performance landscape becomes highly non-convex, with numerous local extrema (Xie et al., 2024). Under such circumstances, existing bi-level methods (Gupta et al., 2023) fail to converge reliably in complex scenarios, as we further demonstrate in the experiments.
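The bi-level structure described above can be illustrated with a small sketch: a shaped reward as a weighted sum of heuristics, and an outer loop that updates the weights by finite-difference gradient ascent on task performance. This is a toy illustration, not the paper's implementation; the closed-form `task_performance` stands in for what would really be an expensive inner-loop RL training run, and all names and constants are assumptions.

```python
import numpy as np

def shaped_reward(w, heuristics, state, action):
    # Shaped per-step reward: weighted sum of heuristic terms, r_t = w . f(s_t, a_t)
    return float(np.dot(w, [h(state, action) for h in heuristics]))

def task_performance(w):
    # Stand-in for the episode-end task criterion J(w). In the real algorithm
    # this value comes from rolling out a policy trained under the shaped
    # reward (the inner loop), not from a closed-form expression.
    target = np.array([0.7, 0.3])  # hypothetical "good" weighting
    return -float(np.sum((w - target) ** 2))

def outer_step(w, lr=0.1, eps=1e-3):
    # Outer loop: estimate dJ/dw by central finite differences, then ascend.
    grad = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        grad[i] = (task_performance(w + e) - task_performance(w - e)) / (2 * eps)
    return w + lr * grad

w = np.array([0.2, 0.8])
for _ in range(200):
    # The inner loop (RL policy training under weights w) would run here.
    w = outer_step(w)
```

On this smooth toy surrogate the weights converge to the target weighting; the paper's point is precisely that real weight-performance landscapes are non-convex, which is why plain gradient updates alone are not enough.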
[Figure 1 diagram: the environment-policy loop produces state s_t, action a_t, and shaped reward r_t = w · f(s_t, a_t), where w is the reward weight vector and f(s_t, a_t) the vector of heuristic values. Example task criteria (desired velocity, smooth action) pair with heuristic functions (tracking error reward, control cost penalty). The outer loop alternates gradient updates with exploration: (1) randomly sample 1k candidate weight points, (2) calculate their novelty values via RND, (3) apply softmax sampling.]

Figure 1: (a) Experts provide only task criteria and heuristic functions, instead of a manually tuned reward function. (b) Example task criteria and heuristic functions for the robotic dog. (c) The inner loop trains the RL policy on shaped rewards, while the outer loop, combining gradient updates with exploration, updates reward weights to maximize the task criteria. In the 3D graphs, the xy plane denotes the reward weight space and the z axis the corresponding task performance.

To overcome these challenges, we introduce Multi-Objective Reward Shaping with Exploration (MORSE). MORSE augments the outer loop of bi-level optimization with controlled stochastic exploration, thereby facilitating training when the reward space is large. Concretely, if the policy plateaus without reaching the target objective, the algorithm initiates outer-loop exploration
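The RND-guided exploration step in Figure 1(c) (sample candidate weights, score each by the prediction error of a fixed, randomly initialized network, then softmax-sample) can be sketched as below. The linear networks, zero-initialized predictor, temperature, and sampling range are illustrative assumptions, not the paper's architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 2, 16                          # weight dimension, feature size (assumed)
W_target = rng.normal(size=(D, H))    # fixed, randomly initialized network (frozen)
W_pred = np.zeros((D, H))             # predictor, trained on previously visited weights

def novelty(w):
    # RND novelty: squared prediction error against the frozen target network.
    # Low error means the predictor has seen similar weights before.
    return float(np.sum((w @ W_target - w @ W_pred) ** 2))

def explore(n_candidates=1000, temp=1.0):
    # (1) randomly sample candidate reward-weight points
    cands = rng.uniform(0.0, 1.0, size=(n_candidates, D))
    # (2) score each candidate's novelty via RND
    scores = np.array([novelty(w) for w in cands])
    # (3) softmax-sample a candidate, biased toward novel regions
    logits = (scores - scores.max()) / temp  # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return cands[rng.choice(n_candidates, p=probs)]

w_next = explore()
```

In a full loop, the predictor would be regressed toward the target network's outputs on visited weights, so already-explored regions of the weight space lose novelty over time.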

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
