Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RL for transformers on compositional reasoning tasks. Our theory characterizes how the effectiveness of RLVR is governed by the smoothness of the difficulty spectrum. When data contains abrupt discontinuities in difficulty, learning undergoes grokking-type phase transitions, producing prolonged plateaus before progress resumes. In contrast, a smooth difficulty spectrum leads to a relay effect: persistent gradient signals on easier problems elevate the model's capabilities to the point where harder ones become tractable, resulting in steady and continuous improvement. Our theory explains how RLVR can improve performance at the edge of competence, and suggests that appropriately designed data mixtures can yield scalable gains. As a technical contribution, our analysis develops and adapts tools from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.
Large language models (LLMs) such as OpenAI-o3 (OpenAI, 2024) and DeepSeek (DeepSeek-AI, 2025) have shown striking performance in complex reasoning tasks. A key enabler of this recent progress is reinforcement learning with verifiable rewards (RLVR) (Shao et al., 2024; Lambert et al., 2024; Gao et al., 2024), which finetunes pre-trained base models via reinforcement learning (RL) using automatically verifiable, outcome-based feedback, such as a binary signal indicating whether the final answer is correct.
This raises an immediate question: if RLVR relies on outcome-based feedback only at the end of a reasoning trajectory, how can such a sparse reward mechanism drive effective learning on long-horizon problems? As the horizon grows, RL algorithms encounter an inherent search barrier, because useful signals are buried within an exponentially expanding space of trajectories. While recent studies have sought to understand the mechanism of RLVR (Yeo et al., 2025; Wu et al., 2025; Yue et al., 2025; Sun et al., 2025; Yuan et al., 2025; Wen et al., 2025), existing findings often provide mixed and inconclusive results across different tasks and setups. It remains unclear under what conditions outcome rewards alone are sufficient to ensure effective RL.
Figure 1: Reward-growth dynamics in mixed-difficulty RL. A schematic illustration of the reward growth rate dr/dt and the reward r(t) for mixed-difficulty RL, showing how the difficulty ratio R = L_{k+1}/L_k changes the learning dynamics at the edge of the model's competence, yielding either grokking-type phase transitions or smooth relays.

A recent controlled study (Zhang et al., 2025) proposed an important insight: RLVR is only effective when training operates near the model's edge of competence, where the model can solve problems with non-random accuracy but has not yet mastered them. This suggests a principle for RL data design: for effective training, one should select problems that lie right at the edge of the model's competence. Yet this principle is mainly descriptive: it indicates where RL tends to work, but does not explain why its effectiveness is confined to this regime. This motivates us to ask the following intriguing question:
Why does RLVR primarily improve performance near the edge of the model’s competence?
To address this question, we study a multi-step compositional reasoning setting where solving a problem requires sequential steps, yet feedback is provided only through a terminal reward. Our model is a minimal transformer (Vaswani et al., 2017), the backbone architecture of most LLMs, consisting of a softmax-based attention layer followed by a multilayer perceptron (MLP) layer. We fix the MLP to perfectly implement the atomic operation, modeling the regime where the model already possesses the requisite atomic skills and RLVR only needs to learn how to compose them (Yuan et al., 2025). We study RL training on this task under outcome-based rewards via the standard policy gradient algorithm REINFORCE (Williams, 1992). Within this setting, we track the learning dynamics of the transformer and identify the factors that govern progress across increasing horizons, thereby shedding light on when RLVR can (or cannot) scale to long-horizon compositional reasoning. An overview of our main contributions is provided below.
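To make the setting concrete, the following is a minimal sketch of this kind of training loop under simplifying assumptions of ours (the cyclic group Z_n as the composition task, a tabular per-step softmax standing in for the attention layer, and illustrative hyperparameters), not the paper's exact construction: the fixed atomic operation plays the role of the frozen MLP, and the only feedback is a binary terminal reward fed to REINFORCE.

```python
# Minimal sketch (our simplification): atomic operation = addition mod n,
# per-step tabular softmax policy, REINFORCE with a binary terminal reward.
import numpy as np

rng = np.random.default_rng(0)
n, horizon, lr, target = 7, 3, 0.5, 5      # cyclic group Z_n, steps, step size, goal

# Policy parameters: one logit vector per composition step.
logits = np.zeros((horizon, n))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rollout():
    """Sample a trajectory; reward is 1 iff the composed element equals the target."""
    state, actions = 0, []
    for t in range(horizon):
        a = rng.choice(n, p=softmax(logits[t]))
        actions.append(a)
        state = (state + a) % n            # fixed atomic operation: addition mod n
    return actions, float(state == target)

for step in range(2000):
    actions, reward = rollout()
    for t, a in enumerate(actions):
        # REINFORCE: grad_z log pi(a|z) = onehot(a) - softmax(z), scaled by the reward.
        p = softmax(logits[t])
        g = -p
        g[a] += 1.0
        logits[t] += lr * reward * g

# After training, the policy concentrates on action tuples summing to the target mod n.
print("final success rate:", np.mean([rollout()[1] for _ in range(100)]))
```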
A comparative study of short-horizon learning vs. the long-horizon barrier. We first show that with outcome-based rewards, REINFORCE-style policy gradient algorithms provably learn short-horizon compositions. However, even if the initial policy achieves non-zero reward, the gradient field at initialization is exponentially flat beyond a critical horizon, indicating an optimization barrier for near-random policies. In contrast, we show that supervised fine-tuning (SFT) can provably learn beyond the critical horizon by providing intermediate feedback for sequential compositional reasoning.
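The flat-gradient claim can be checked numerically. The sketch below is our own toy illustration on the cyclic group Z_n, not the paper's analysis: it computes the exact policy gradient of the expected terminal reward for a near-random per-step softmax policy and shows the largest per-step gradient norm shrinking roughly geometrically as the horizon grows.

```python
# Toy numerical check (ours) of the flat-gradient phenomenon on Z_n.
import numpy as np

rng = np.random.default_rng(1)
n, target = 11, 4

def circ_conv(p, q):
    """Circular convolution: distribution of the mod-n sum of two independent draws."""
    return np.array([sum(p[a] * q[(x - a) % n] for a in range(n)) for x in range(n)])

def max_grad_norm(horizon, scale=0.5):
    # Near-random initialization: small Gaussian logits per step.
    logits = scale * rng.standard_normal((horizon, n))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    norms = []
    for t in range(horizon):
        # Distribution of the composition of all steps except step t.
        q = np.zeros(n)
        q[0] = 1.0                                   # delta at the identity
        for s in range(horizon):
            if s != t:
                q = circ_conv(q, probs[s])
        # Expected reward J = sum_a probs[t, a] * q[(target - a) mod n].
        dJ_dp = np.array([q[(target - a) % n] for a in range(n)])
        g = probs[t] * (dJ_dp - probs[t] @ dJ_dp)    # chain rule through the softmax
        norms.append(np.linalg.norm(g))
    return max(norms)

for L in [1, 2, 4, 6, 8]:
    print(f"horizon {L}: largest per-step gradient norm = {max_grad_norm(L):.2e}")
```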
A theory of phase transitions in RLVR on mixed-difficulty data distributions. On an easy-to-hard mixture over horizons, we establish polynomial-time convergence guarantees for outcome-based RL training. We further show that the shape of the difficulty spectrum in the mixture governs the dynamics: when the spectrum contains discontinuities, the learning process undergoes long plateaus followed by abrupt improvement, exhibiting grokking-like phase transitions (Sun et al., 2025); in comparison, a smoother spectrum yields a relay effect that maintains the momentum of reward growth, ensuring steady progress through increasingly harder problems.
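For concreteness, the snippet below constructs two such difficulty spectra over horizons (the numbers are ours and purely illustrative): a smooth geometric ladder with a small ratio R = L_{k+1}/L_k, and a spectrum with an abrupt jump of the kind that produces plateaus.

```python
# Two illustrative difficulty spectra over horizons (numbers are ours).
import numpy as np

def horizon_ladder(L_min, L_max, R):
    """Geometric ladder of horizons with ratio R, rounded up to integers."""
    ladder = [L_min]
    while ladder[-1] < L_max:
        ladder.append(min(L_max, int(np.ceil(ladder[-1] * R))))
    return ladder

smooth = horizon_ladder(2, 64, R=1.5)   # [2, 3, 5, 8, 12, 18, 27, 41, 62, 64]
abrupt = [2, 4, 64]                     # ratio jumps to R = 16 at the last step

rng = np.random.default_rng(0)
# Each training problem draws its horizon uniformly from the chosen spectrum.
sampled = rng.choice(smooth, size=5)
print(smooth, abrupt, sampled)
```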
Novel techniques from Fourier analysis on groups. We introduce a framework based on Fourier analysis on finite groups (Terras, 1999) that transforms the problem of trajectory-level success conditioning into tractable calculations on convolutions of measures. This framework allows us to compute the magnitude of policy gradients in long-horizon group composition problems.
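The core identity this framework exploits can be seen already on the cyclic group Z_n (the paper's machinery covers general finite groups; the example below is our simplification): the distribution of an L-step composition is the L-fold convolution of the per-step measure, the Fourier transform turns convolution into a pointwise power, and the second-largest Fourier coefficient controls how quickly the success probability flattens toward 1/n, hence how quickly the outcome signal decays with the horizon.

```python
# Convolution of measures via Fourier analysis on Z_n (simplified toy instance).
import numpy as np

n = 11
rng = np.random.default_rng(0)
mu = rng.dirichlet(np.ones(n))           # a generic per-step measure on Z_n

mu_hat = np.fft.fft(mu)                  # characters of Z_n diagonalize convolution
gap = np.sort(np.abs(mu_hat))[-2]        # second-largest |Fourier coefficient|

for L in [1, 2, 4, 8, 16, 32]:
    conv_L = np.fft.ifft(mu_hat ** L).real               # L-fold convolution of mu
    deviation = np.abs(conv_L - 1.0 / n).max()
    print(f"L={L:2d}  max deviation from uniform {deviation:.2e}   gap^L = {gap**L:.2e}")
```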