Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration
Reinforcement Learning (RL) has become a key approach for enhancing the reasoning capabilities of large language models. However, prevalent RL approaches like proximal policy optimization and group relative policy optimization suffer from sparse, outcome-based rewards and weak exploration incentives, limiting their effectiveness. Specifically, sparse rewards offer limited feedback, especially on difficult problems, and introduce biases favoring familiar trajectories over novel reasoning paths. These issues critically undermine performance on complex tasks that inherently require iterative reasoning. To overcome these challenges, we propose Intrinsic MotivAtion Guided exploratIoN for Enhanced reasoning (IMAGINE), which delivers dense rewards and encourages exploration. IMAGINE introduces three innovations: a trajectory-aware exploration reward that reduces token-level bias efficiently; an error-conditioned reward allocation that promotes efficient exploration on hard samples while stabilizing training; and an advantage-preserving integration mechanism that retains distributional integrity during learning. Experiments on four public datasets show that IMAGINE improves performance by 22.23% on AIME 2024.
💡 Research Summary
This paper addresses two fundamental shortcomings of current reinforcement‑learning (RL) approaches for improving large language model (LLM) reasoning: sparse outcome‑based rewards and weak exploration incentives. Standard methods such as Proximal Policy Optimization (PPO) and Group‑Relative Policy Optimization (GRPO) rely on binary correctness signals that provide little guidance for intermediate reasoning steps, especially on hard problems where most rollouts receive zero reward. Moreover, because GRPO normalizes advantages across a group, trajectories that lead to the same final answer receive identical advantage estimates regardless of the diversity of their chain‑of‑thought (CoT), discouraging genuine exploration.
To overcome these issues, the authors propose IMAGINE (Intrinsic Motivation Guided Exploration for Enhanced reasoning). IMAGINE introduces three key innovations:
- Trajectory‑aware exploration rewards – Instead of token‑level novelty, a dual‑network architecture (a frozen random target network and a trainable predictor) evaluates the prediction error on the entire question‑answer sequence. This yields a single dense intrinsic reward per trajectory, eliminating length bias and reducing computational cost from O(|output|) to O(1).
- Error‑conditioned reward allocation – The intrinsic reward is applied only to incorrect samples, directing exploration resources toward hard cases while keeping correct trajectories stable. This self‑regulating mechanism prevents wasteful exploration on already solved inputs.
- Advantage‑preserving integration – Exploration bonuses are added after the standard advantage computation (Â_new = Â_old + R*). By postponing the intrinsic term, the method avoids contaminating value‑function estimates in PPO and prevents negative advantage flips in GRPO's group normalization, preserving the distributional integrity of the original RL objective.
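The paper's exact dual‑network architecture is not reproduced here, but the first innovation follows the familiar random‑network‑distillation pattern at the trajectory level. Below is a minimal NumPy sketch under assumed simplifications: both networks are single linear maps, and a fixed‑size vector stands in for the embedding of the whole question‑answer sequence (the names `W_target`, `W_pred`, and the dimensions are illustrative, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

EMB = 16   # assumed dimensionality of the trajectory embedding
OUT = 8    # assumed output dimensionality of both networks

# Frozen, randomly initialized target network (never trained).
W_target = rng.normal(size=(EMB, OUT))

# Trainable predictor network, trained to imitate the target.
W_pred = np.zeros((EMB, OUT))

def intrinsic_reward(traj_emb):
    """Prediction error between predictor and frozen target:
    high on novel trajectories, decaying as the predictor catches up."""
    err = traj_emb @ W_target - traj_emb @ W_pred
    return float(np.mean(err ** 2))

def update_predictor(traj_emb, lr=0.01):
    """One SGD step on the predictor's MSE toward the target output."""
    global W_pred
    err = traj_emb @ W_pred - traj_emb @ W_target        # shape (OUT,)
    W_pred -= lr * (2.0 / OUT) * np.outer(traj_emb, err)

# One embedding summarizes the entire trajectory, so the bonus costs
# O(1) per rollout rather than O(|output|) per-token novelty scores.
emb = rng.normal(size=EMB)
r_first = intrinsic_reward(emb)
for _ in range(200):
    update_predictor(emb)
r_later = intrinsic_reward(emb)
```

After repeated visits `r_later` falls well below `r_first`: trajectories the predictor has already learned stop paying an exploration bonus, which is exactly the incentive structure the bullet describes.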
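The second and third innovations compose naturally: gate the intrinsic bonus on incorrectness, and add it only after the group‑normalized advantage is computed. A toy GRPO‑style sketch, assuming binary outcome rewards and hypothetical per‑trajectory novelty scores (the function names are illustrative):

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO group-normalized advantages: (r - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def imagine_advantages(rewards, intrinsic, correct):
    """Add the intrinsic bonus AFTER group normalization
    (Â_new = Â_old + R*), and only on incorrect samples, so the
    exploration term neither disturbs correct rollouts nor leaks
    into the mean/std used for the original advantages."""
    adv = grpo_advantages(rewards)
    gate = 1.0 - np.asarray(correct, dtype=float)   # 1 for incorrect samples
    return adv + gate * np.asarray(intrinsic, dtype=float)

# Toy group of 4 rollouts: binary outcome reward, one correct answer.
rewards   = [1.0, 0.0, 0.0, 0.0]
correct   = [1,   0,   0,   0]
intrinsic = [0.5, 0.4, 0.9, 0.1]    # hypothetical novelty scores

base = grpo_advantages(rewards)
new  = imagine_advantages(rewards, intrinsic, correct)
```

Because normalization runs before the bonus is added, the correct rollout's advantage is untouched, while the incorrect rollouts are nudged upward in proportion to how novel their reasoning traces are.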
The framework is compatible with PPO, GRPO, and the newer DAPO algorithm. Experiments were conducted on four public reasoning benchmarks—including the AIME 2024 competition and the Countdown‑4 dataset—using two base LLMs (Qwen2.5‑3B and DeepSeek‑7B). Across all settings, IMAGINE consistently improves accuracy, with an average gain of 22.23% on the hardest benchmark. Qualitative case studies show reductions in logical errors, fewer redundant steps, and more diverse CoT paths. Additionally, models trained with IMAGINE generate longer, more complex responses, indicating that the intrinsic exploration encourages deeper reasoning without sacrificing training stability.
In summary, IMAGINE demonstrates that carefully designed intrinsic motivation, when aligned with the sequential nature of LLM reasoning, can supply dense, stable guidance in otherwise sparse reward environments. This leads to more effective exploration, better utilization of hard samples, and ultimately superior reasoning performance, marking a significant step forward for RL‑based fine‑tuning of large language models.