Automatic Reward Shaping from Multi-Objective Human Heuristics

Reading time: 5 minutes

📝 Original Info

  • Title: Automatic Reward Shaping from Multi-Objective Human Heuristics
  • ArXiv ID: 2512.15120
  • Date: 2025-12-17
  • Authors: Yuqing Xie, Jiayu Chen, Wenhao Tang, Ya Zhang, Chao Yu, and Yu Wang

📝 Abstract

Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to that obtained with manually tuned reward functions.

💡 Deep Analysis

Deep Dive into Automatic Reward Shaping from Multi-Objective Human Heuristics.


📄 Full Content

AUTOMATIC REWARD SHAPING FROM MULTI-OBJECTIVE HUMAN HEURISTICS

Yuqing Xie(1), Jiayu Chen(1), Wenhao Tang(1), Ya Zhang(2), Chao Yu(1), and Yu Wang(1)
(1) Tsinghua University, (2) Shanghai Jiao Tong University

ABSTRACT

Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to that obtained with manually tuned reward functions.

1 INTRODUCTION

Reward design remains a fundamental challenge in robot learning, particularly when multiple conflicting objectives must be balanced. A typical task definition might involve the robot traversing a specified distance (e.g., 2 meters) within a given time limit (e.g., 10 seconds) while maintaining torque below a safety limit (e.g., 45 Nm). While the success criteria for the task are straightforward to define, their sparsity makes them inadequate as rewards for effective policy optimization (Andrychowicz et al., 2018).

To facilitate policy learning, researchers often design dense heuristic reward functions, such as rewarding higher velocities or penalizing large joint movements to minimize energy cost.
However, these heuristics frequently conflict with one another, necessitating labor-intensive manual tuning to combine them into a final reward function. Studies show that over 90% of reinforcement learning (RL) practitioners devote substantial effort to adjusting reward functions, highlighting a critical scalability bottleneck (Booth et al., 2023).

Our work proposes a self-supervised reward shaping framework that automates the reward design process. The approach requires RL practitioners to specify only: (1) a task performance criterion, which scores a trajectory at the end of the episode, and (2) a set of heuristic reward functions, which provide dense feedback at each step. The system then automatically learns an optimal combination of these heuristics through a bi-level optimization process (Zhang et al., 2024b), where the inner loop trains RL policies under the current reward function, and the outer loop adapts the reward function itself to maximize task performance.

However, it is not trivial to apply bi-level optimization to shape the rewards for real-world robotic tasks. Consider the robotic dog benchmark in (Margolis & Agrawal, 2023), which defines 15 distinct objectives, constituting an enormous space of valid reward combinations. With such a large reward weight space, the weight–performance landscape becomes highly non-convex, with numerous local extrema (Xie et al., 2024). Under such circumstances, existing bi-level methods (Gupta et al., 2023) fail to converge reliably in complex scenarios, as we further demonstrate in the experiments.
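The bi-level structure described above can be illustrated with a small sketch: a shaped reward as a weighted sum of heuristics, and an outer loop that updates the weights by finite-difference gradient ascent on task performance. This is a toy illustration, not the paper's implementation; the closed-form `task_performance` stands in for what would really be an expensive inner-loop RL training run, and all names and constants are assumptions.

```python
import numpy as np

def shaped_reward(w, heuristics, state, action):
    # Shaped per-step reward: weighted sum of heuristic terms, r_t = w . f(s_t, a_t)
    return float(np.dot(w, [h(state, action) for h in heuristics]))

def task_performance(w):
    # Stand-in for the episode-end task criterion J(w). In the real algorithm
    # this value comes from rolling out a policy trained under the shaped
    # reward (the inner loop), not from a closed-form expression.
    target = np.array([0.7, 0.3])  # hypothetical "good" weighting
    return -float(np.sum((w - target) ** 2))

def outer_step(w, lr=0.1, eps=1e-3):
    # Outer loop: estimate dJ/dw by central finite differences, then ascend.
    grad = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        grad[i] = (task_performance(w + e) - task_performance(w - e)) / (2 * eps)
    return w + lr * grad

w = np.array([0.2, 0.8])
for _ in range(200):
    # The inner loop (RL policy training under weights w) would run here.
    w = outer_step(w)
```

On this smooth toy surrogate the weights converge to the target weighting; the paper's point is precisely that real weight-performance landscapes are non-convex, which is why plain gradient updates alone are not enough.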
[Figure 1 diagram: the environment-policy loop produces state s_t, action a_t, and shaped reward r_t = w · f(s_t, a_t), where w is the reward weight vector and f(s_t, a_t) the vector of heuristic values. Example task criteria (desired velocity, smooth action) pair with heuristic functions (tracking error reward, control cost penalty). The outer loop alternates gradient updates with exploration: (1) randomly sample 1k candidate weight points, (2) calculate their novelty values via RND, (3) apply softmax sampling.]

Figure 1: (a) Experts provide only task criteria and heuristic functions, instead of a manually tuned reward function. (b) Example task criteria and heuristic functions for the robotic dog. (c) The inner loop trains the RL policy on shaped rewards, while the outer loop, combining gradient updates with exploration, updates reward weights to maximize the task criteria. In the 3D graphs, the xy plane denotes the reward weight space and the z axis the corresponding task performance.

To overcome these challenges, we introduce Multi-Objective Reward Shaping with Exploration (MORSE). MORSE augments the outer loop of bi-level optimization with controlled stochastic exploration, thereby facilitating training when the reward space is large. Concretely, if the policy plateaus without reaching the target objective, the algorithm initiates outer-loop exploration
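The RND-guided exploration step in Figure 1(c) (sample candidate weights, score each by the prediction error of a fixed, randomly initialized network, then softmax-sample) can be sketched as below. The linear networks, zero-initialized predictor, temperature, and sampling range are illustrative assumptions, not the paper's architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 2, 16                          # weight dimension, feature size (assumed)
W_target = rng.normal(size=(D, H))    # fixed, randomly initialized network (frozen)
W_pred = np.zeros((D, H))             # predictor, trained on previously visited weights

def novelty(w):
    # RND novelty: squared prediction error against the frozen target network.
    # Low error means the predictor has seen similar weights before.
    return float(np.sum((w @ W_target - w @ W_pred) ** 2))

def explore(n_candidates=1000, temp=1.0):
    # (1) randomly sample candidate reward-weight points
    cands = rng.uniform(0.0, 1.0, size=(n_candidates, D))
    # (2) score each candidate's novelty via RND
    scores = np.array([novelty(w) for w in cands])
    # (3) softmax-sample a candidate, biased toward novel regions
    logits = (scores - scores.max()) / temp  # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return cands[rng.choice(n_candidates, p=probs)]

w_next = explore()
```

In a full loop, the predictor would be regressed toward the target network's outputs on visited weights, so already-explored regions of the weight space lose novelty over time.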

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
