📝 Original Info
- Title: Automatic Reward Shaping from Multi-Objective Human Heuristics
- ArXiv ID: 2512.15120
- Date: 2025-12-17
- Authors: Yuqing Xie, Jiayu Chen, Wenhao Tang, Ya Zhang, Chao Yu, Yu Wang (Tsinghua University; Shanghai Jiao Tong University)
📝 Abstract
Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to that obtained with manually tuned reward functions.
💡 Deep Analysis
Deep Dive into Automatic Reward Shaping from Multi-Objective Human Heuristics.
📄 Full Content
AUTOMATIC REWARD SHAPING FROM MULTI-OBJECTIVE HUMAN HEURISTICS
Yuqing Xie1, Jiayu Chen1, Wenhao Tang1, Ya Zhang2, Chao Yu1, and Yu Wang1
1Tsinghua University
2Shanghai Jiao Tong University
ABSTRACT
Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to that obtained with manually tuned reward functions.
1 INTRODUCTION
Reward design remains a fundamental challenge in robot learning, particularly when multiple conflicting objectives must be balanced. A typical task definition might involve the robot traversing a specified distance (e.g., 2 meters) within a given time limit (e.g., 10 seconds) while maintaining the torque below a safety limit (e.g., 45 Nm). While the success criteria for the task are straightforward to define, their sparsity makes them inadequate as rewards for effective policy optimization (Andrychowicz et al., 2018).
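To make that sparsity concrete, such an end-of-episode success criterion could be scored as follows. This is a minimal sketch: the function name, step size, and array layout are illustrative assumptions, not the paper's code; only the thresholds (2 m, 10 s, 45 Nm) come from the example above.

```python
import numpy as np

def task_criterion(positions, torques, horizon_s=10.0, dt=0.02,
                   goal_dist=2.0, torque_limit=45.0):
    """Score a trajectory only at episode end: 1.0 on success, else 0.0.

    Success = the robot travelled at least goal_dist meters within
    horizon_s seconds while every torque stayed below torque_limit (Nm).
    Note the reward is zero everywhere except this single terminal check,
    which is exactly why it gives a policy so little learning signal.
    """
    n_steps = min(len(positions), int(horizon_s / dt))
    travelled = positions[n_steps - 1] - positions[0]
    torque_ok = np.all(np.abs(torques[:n_steps]) < torque_limit)
    return float(travelled >= goal_dist and torque_ok)
```

A trajectory that covers 2.5 m under the torque limit scores 1.0; one that stalls at 1 m, or spikes above 45 Nm at any step, scores 0.0 regardless of how close it came.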
To facilitate policy learning, researchers often design dense heuristic reward functions, such as rewarding higher velocities or penalizing large joint movements to minimize energy cost. However, these heuristics frequently conflict with one another, necessitating labor-intensive manual tuning to combine them into a final reward function. Studies show that over 90% of reinforcement learning (RL) practitioners devote substantial effort to adjusting reward functions, highlighting a critical scalability bottleneck (Booth et al., 2023).
Our work proposes a self-supervised reward shaping framework that automates the reward design
process. The approach requires RL practitioners to specify only: (1) a task performance criterion,
which scores a trajectory at the end of the episode, and (2) a set of heuristic reward functions, which
provide dense feedback for each step. The system then automatically learns an optimal combination
of these heuristics through a bi-level optimization process (Zhang et al., 2024b), where the inner loop
trains RL policies under the current reward function, and the outer loop adapts the reward function
itself to maximize task performance.
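The bi-level structure can be sketched as a skeleton loop. Everything here is an illustrative stand-in rather than the paper's optimizer: `train_policy` abstracts the inner RL training, and the outer update uses a simple finite-difference gradient estimate of task performance with respect to the reward weights.

```python
import numpy as np

def shaped_reward(weights, heuristics, state, action):
    """Per-step shaped reward: a weighted sum of the heuristic rewards."""
    return sum(w * h(state, action) for w, h in zip(weights, heuristics))

def bilevel_reward_shaping(heuristics, task_criterion, train_policy,
                           n_outer=50, lr=0.1, eps=1e-2):
    """Skeleton of the bi-level loop.

    Inner loop: train_policy(w) trains a policy under the reward shaped by
    weights w and returns the resulting trajectory. Outer loop: nudge w in
    the direction that raises the end-of-episode task score, here via a
    finite-difference estimate (a stand-in for the paper's update rule).
    """
    w = np.ones(len(heuristics)) / len(heuristics)
    for _ in range(n_outer):
        score = task_criterion(train_policy(w))
        grad = np.zeros_like(w)
        for i in range(len(w)):
            w_pert = w.copy()
            w_pert[i] += eps
            grad[i] = (task_criterion(train_policy(w_pert)) - score) / eps
        w = np.clip(w + lr * grad, 0.0, None)
        if w.sum() > 0:
            w /= w.sum()  # keep the weights a convex combination
    return w
```

The key cost is visible in the skeleton: every outer-loop gradient step requires retraining (or fine-tuning) a policy per perturbed weight vector, which is why the outer loop must be sample-efficient in practice.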
However, it is not trivial to apply bi-level optimization to shape the rewards for real-world robotic
tasks. Consider the robotic dog benchmark in (Margolis & Agrawal, 2023), which defines 15 distinct
objectives, constituting an enormous space of valid reward combinations. With such a large reward
weight space, the weight–performance landscape becomes highly non-convex with numerous local
extrema (Xie et al., 2024). Under such circumstances, existing bi-level methods (Gupta et al., 2023)
fail to converge reliably in complex scenarios, as we will further demonstrate in the experiments.
[Figure 1: pipeline diagram. Panel (a) task definition, (b) reward definition with sparse task criteria, a tracking-error reward for desired velocity, and a control-cost penalty for smooth actions, (c) MORSE overview, where the outer-loop exploration ① randomly samples 1k weight points, ② calculates novelty values via RND, and ③ applies softmax sampling.]
Figure 1: (a) Experts provide only task criteria and heuristic functions, instead of a manually tuned reward function. (b) Example task criteria and heuristic functions for a robotic dog. (c) The inner loop trains the RL policy on shaped rewards, while the outer loop, combining gradient updates with exploration, updates reward weights to maximize the task criteria. In the 3D graphs, the xy-plane denotes the reward weight space and the z-axis the corresponding task performance.
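The three exploration steps in Figure 1(c) could be sketched as follows. This is an assumption-laden illustration, not the paper's implementation: the tiny MLPs standing in for the RND target/predictor pair, the untrained predictor, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, out_dim, hidden=32):
    """A tiny fixed-parameter MLP; returns its forward function."""
    w1 = rng.normal(size=(in_dim, hidden)) / np.sqrt(in_dim)
    w2 = rng.normal(size=(hidden, out_dim)) / np.sqrt(hidden)
    return lambda x: np.tanh(x @ w1) @ w2

def rnd_novelty(weights, target, predictor):
    """RND-style novelty: prediction error of a predictor network against a
    fixed, randomly initialized target. (Here the predictor is untrained,
    so this shows only the scoring rule, not the full RND training loop.)"""
    return np.sum((target(weights) - predictor(weights)) ** 2, axis=-1)

def explore_weights(n_candidates=1000, dim=4, temperature=1.0):
    # Step 1: randomly sample candidate reward-weight vectors on the simplex.
    cand = rng.dirichlet(np.ones(dim), size=n_candidates)
    # Step 2: score each candidate's novelty via the RND prediction error.
    target = make_mlp(dim, 8)
    predictor = make_mlp(dim, 8)
    novelty = rnd_novelty(cand, target, predictor)
    # Step 3: softmax-sample one candidate, preferring novel weight vectors.
    logits = novelty / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return cand[rng.choice(n_candidates, p=probs)]
```

Sampling through a softmax over novelty, rather than taking the argmax, keeps the exploration stochastic: high-novelty weight vectors are preferred but every candidate retains some probability mass, which is what lets the outer loop escape local extrema.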
To overcome these challenges, we introduce Multi-Objective Reward Shaping with Exploration (MORSE). MORSE augments the outer loop of bi-level optimization with controlled stochastic exploration, thereby facilitating training when the reward space is large. Concretely, if the policy plateaus without reaching the target objective, the algorithm initiates outer-loop exploration
…(Full text truncated)…
This content is AI-processed based on ArXiv data.