Black Box Meta-Learning Intrinsic Rewards


The broader application of reinforcement learning (RL) is limited by challenges including poor data efficiency, limited generalization, and difficulty learning in sparse-reward environments. Meta-learning has emerged as a promising approach to address these issues by optimizing components of the learning algorithm to meet desired characteristics. A separate line of work has extensively studied intrinsic rewards as a way to enhance the exploration capabilities of algorithms. This work investigates how meta-learning can improve the training signal received by RL agents. We introduce a method to learn intrinsic rewards within a reinforcement learning framework that bypasses the typical computation of meta-gradients through an optimization process by treating policy updates as black boxes. We validate our approach against training with extrinsic rewards, demonstrating its effectiveness, and additionally compare it to the use of a meta-learned advantage function. Experiments are carried out on distributions of continuous control tasks with both parametric and non-parametric variations, and only sparse rewards are used during evaluation. Code is available at: https://github.com/Octavio-Pappalardo/Meta-learning-rewards


💡 Research Summary

The paper tackles two persistent challenges in reinforcement learning (RL): poor data efficiency and difficulty learning in sparse‑reward environments. It proposes a meta‑learning approach that learns an intrinsic reward function without the need for second‑order meta‑gradients. Instead of differentiating through the inner‑loop optimizer, the authors treat the inner loop (a PPO agent) as a black box. The intrinsic reward generator is modeled as a stochastic policy πᵣ_ϕ, implemented with an LSTM that receives the full interaction history (state, action, current policy probabilities, extrinsic reward, previous intrinsic reward, and a flag for episode start) and outputs a reward rᵢₜ at each timestep. During meta‑training, multiple tasks sampled from a distribution are solved by the inner PPO agent, but the environment’s extrinsic reward is replaced by the intrinsic reward produced by πᵣ_ϕ. The outer loop collects the trajectories from all tasks, computes the cumulative return G(τ) for each, and updates πᵣ_ϕ (and an outer‑loop critic) using standard PPO updates. Because the outer‑loop update only requires gradients with respect to ϕ, no gradients need to flow through the inner PPO updates, eliminating the expensive computation of second‑order derivatives.
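The black-box outer loop described above can be sketched in simplified form. This is a hedged illustration, not the paper's implementation: the LSTM reward policy πᵣ_ϕ is replaced by a linear-Gaussian model over per-step features, and the outer-loop PPO update with a critic is replaced by plain REINFORCE with a mean-return baseline. The class name `RewardGenerator` and all parameters are hypothetical; the key point the sketch preserves is that the update needs only first-order gradients with respect to ϕ (here `w`), never gradients through the inner agent's own updates.

```python
import numpy as np

class RewardGenerator:
    """Simplified stand-in for the paper's stochastic reward policy pi_r_phi.

    Assumption: a linear-Gaussian model over per-step interaction features
    replaces the LSTM over the full history, and REINFORCE with a baseline
    replaces the outer-loop PPO update. Names and constants are illustrative.
    """

    def __init__(self, feat_dim, lr=1e-2, sigma=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.w = np.zeros(feat_dim)  # outer-loop parameters phi
        self.lr = lr
        self.sigma = sigma

    def sample(self, feats):
        """Sample an intrinsic reward r_t ~ N(feats . w, sigma^2).

        The inner agent (a black box, e.g. PPO) trains on this signal
        in place of the environment's extrinsic reward.
        """
        mean = feats @ self.w
        return mean + self.sigma * self.rng.normal(), mean

    def update(self, trajectories):
        """REINFORCE surrogate for the outer objective: maximize E[G(tau)].

        Each trajectory is a list of (feats, sampled_intrinsic_reward,
        extrinsic_reward) tuples; G(tau) is the cumulative extrinsic return.
        No gradients flow through the inner-loop policy updates.
        """
        returns = np.array(
            [sum(ext for _, _, ext in traj) for traj in trajectories]
        )
        baseline = returns.mean()  # crude stand-in for the outer-loop critic
        grad = np.zeros_like(self.w)
        for traj, ret in zip(trajectories, returns):
            for feats, r, _ in traj:
                mean = feats @ self.w
                # grad of log N(r; mean, sigma^2) with respect to w
                grad += (ret - baseline) * (r - mean) / self.sigma**2 * feats
        self.w += self.lr * grad / len(trajectories)
```

In this simplified form, steps whose sampled intrinsic rewards co-occur with above-baseline extrinsic returns become more likely under the reward policy, which is exactly the first-order signal the black-box formulation relies on.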

The method is evaluated on the MetaWorld benchmark suite, which provides continuous‑control robotic manipulation tasks. Two families of tasks are used: ML1 (single problem class with 50 training and 50 test parametric variations) and ML10 (ten problem classes, each with 50 training variations and five unseen test classes). During meta‑training, dense shaped rewards are available, but during evaluation only sparse rewards are used: a small penalty for failure, a scaled reward based on the proportion of steps taken to reach the goal, and a unit reward upon successful completion. This hybrid setting mirrors realistic scenarios where shaping is possible for training but not for deployment.
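The sparse evaluation signal described above can be sketched as a single function. The exact penalty magnitude and the way the time-scaled component combines with the unit success reward are assumptions made for illustration (the summary does not give the constants), and `sparse_eval_reward` is a hypothetical name:

```python
def sparse_eval_reward(success, steps_to_goal, max_steps, failure_penalty=-0.1):
    """Hedged sketch of the sparse reward used at evaluation time.

    Per the summary: a small penalty on failure, a unit reward on success,
    plus a component scaled by the proportion of steps taken to reach the
    goal (faster completion earns more). Constants are assumptions.
    """
    if not success:
        return failure_penalty
    # Unit success reward plus a time bonus in [0, 1]:
    # 0 when the goal is reached at the step limit, 1 when reached instantly.
    return 1.0 + (1.0 - steps_to_goal / max_steps)
```

Because this signal is uninformative until the goal is actually reached, it makes the evaluation setting a genuine test of the exploration behavior induced by the meta-learned intrinsic reward.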

Results show that after a short adaptation phase of 4,000 environment steps, agents trained with the meta‑learned intrinsic reward consistently outperform agents trained directly on the dense extrinsic reward. The authors also meta‑learn an advantage function within the same framework and demonstrate comparable performance, indicating that the approach can be used to learn alternative objective components. The black‑box formulation offers several advantages: (1) it simplifies implementation and reduces memory consumption because only first‑order gradients are needed; (2) it is agnostic to how the inner loop uses the learned signal, allowing non‑differentiable integrations that traditional meta‑gradient methods cannot handle; (3) it avoids the high computational cost of second‑order differentiation. The primary drawback is potentially higher variance in the meta‑learning signal, as meta‑gradients can provide a lower‑variance estimate by explicitly modeling the influence of the intrinsic reward on policy parameters.

In summary, the paper introduces a practical and efficient way to meta‑learn intrinsic motivation signals for RL agents. By treating the inner learning algorithm as a black box, it sidesteps the need for costly meta‑gradients while still achieving superior performance on challenging continuous‑control tasks under sparse‑reward evaluation. The work bridges meta‑reinforcement learning and intrinsic reward research, offering a scalable pathway toward more data‑efficient and robust agents suitable for real‑world robotic applications.

