Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data


Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of “simple examples”: instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.


💡 Research Summary

The paper investigates why and how transformers trained with outcome‑only reinforcement learning (RL) can spontaneously develop chain‑of‑thought (CoT) reasoning, despite receiving reward only for the final answer. The authors focus on a synthetic graph‑traversal task that requires multi‑step reasoning: given two disjoint directed chains and a starting vertex, the model must output the terminal vertex of the chain containing the start. This task cannot be solved in a single step under standard complexity assumptions, but it admits a simple iterative solution that proceeds forward along the chain.
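As a concrete illustration, the task can be sketched as follows. This is a hypothetical encoding (function names, the two-chain layout, and the vertex representation are assumptions; the paper's exact tokenization may differ), but it captures the structure: the answer requires following the chain, yet a simple iterative procedure solves it.

```python
import random

def make_instance(chain_len=8, n_vertices=20, seed=None):
    """Sample one instance of the two-chain traversal task.

    Vertices are split into two disjoint directed chains; the query is a
    start vertex, and the answer is the terminal vertex of its chain.
    """
    rng = random.Random(seed)
    vertices = rng.sample(range(n_vertices), 2 * chain_len)
    chain_a, chain_b = vertices[:chain_len], vertices[chain_len:]
    # Edge list: u -> v for consecutive vertices in each chain.
    edges = [(c[i], c[i + 1]) for c in (chain_a, chain_b)
             for i in range(chain_len - 1)]
    start_chain = rng.choice((chain_a, chain_b))
    start = rng.choice(start_chain)
    answer = start_chain[-1]  # terminal vertex of the start's chain
    return edges, start, answer

def solve_iteratively(edges, start):
    """The simple iterative solution: follow outgoing edges until none remain."""
    succ = dict(edges)
    v = start
    while v in succ:
        v = succ[v]
    return v
```

Each forward step of `solve_iteratively` corresponds to one CoT step the transformer must emit, since no single-step shortcut exists.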

The authors analyze the policy‑gradient dynamics of a single‑layer transformer. They show that the gradient of the expected return pushes the model to increase the probability of forward steps that advance toward the terminal vertex, while backward and chain‑switching steps receive no reward signal and are therefore suppressed. Consequently, the policy converges to a structured algorithm that repeatedly selects the next vertex along the chain: an explicit, interpretable reasoning procedure. This convergence is proved under mild assumptions about the model's expressivity and the smoothness of the loss landscape.
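The mechanism can be sketched with a minimal REINFORCE toy model. This is not the paper's transformer: the three abstract actions, the step budget, and the reward scheme are illustrative assumptions. It shows the qualitative dynamic described above, with outcome-only reward concentrating probability on forward steps.

```python
import math
import random

# Three abstract step types on a chain of length L; "switch" jumps to the
# other chain and can never reach the goal (illustrative assumption).
ACTIONS = ("forward", "backward", "switch")

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def rollout(logits, start, length, budget, rng):
    """Roll out the policy; reward is outcome-only: 1 iff we end at the terminal."""
    pos, chosen = start, []
    probs = softmax(logits)
    for _ in range(budget):
        a = rng.choices(range(3), weights=probs)[0]
        chosen.append(a)
        if ACTIONS[a] == "forward":
            pos = min(pos + 1, length - 1)
        elif ACTIONS[a] == "backward":
            pos = max(pos - 1, 0)
        else:  # switching chains forfeits the reward
            return 0.0, chosen
    return (1.0 if pos == length - 1 else 0.0), chosen

def reinforce(steps=3000, lr=0.5, length=6, seed=0):
    """Vanilla REINFORCE on the action logits; only rewarded rollouts update."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        start = rng.randrange(length)  # includes simple (near-terminal) starts
        r, chosen = rollout(logits, start, length, budget=length, rng=rng)
        probs = softmax(logits)
        for a in chosen:  # grad log pi(a) = one_hot(a) - probs
            for i in range(3):
                logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
    return softmax(logits)
```

Because only trajectories that reach the terminal vertex carry reward, and those trajectories are forward-heavy, the forward action accumulates the largest logit over training, mirroring the forward-step bias in the paper's analysis.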

Crucially, the analysis reveals that the emergence of this efficient algorithm depends on the training data distribution. The authors introduce a family of distributions D_Q parameterized by a distribution Q over the start‑position index k. Small values of k correspond to “simple examples” that require few forward steps, while large k produce “hard examples” needing many steps. The theory proves an “if and only if” condition: when Q places non‑vanishing probability mass on simple examples, the policy‑gradient updates receive low‑variance signals early in training, allowing the model to discover the forward‑step bias and then generalize to longer chains. If the mass on simple examples vanishes, the gradient signal becomes too noisy for long chains, and learning stalls regardless of training time.
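A sampler for such a family might look like the sketch below. The split at k ≤ 2 and the mixture weight `alpha` are hypothetical choices standing in for "non-vanishing mass on simple examples"; the paper's D_Q is defined more generally.

```python
import random

def sample_k(chain_len, alpha, rng):
    """Draw a start index k: with prob. alpha a simple start (k <= 2, few
    forward steps needed), otherwise a hard start (many steps needed)."""
    if rng.random() < alpha:
        return rng.randint(0, 2)
    return rng.randint(3, chain_len - 1)

def sample_batch(chain_len=32, alpha=0.2, n=10000, seed=0):
    """Sample n start indices from the mixture; alpha bounded away from
    zero is the distributional prerequisite identified by the theory."""
    rng = random.Random(seed)
    return [sample_k(chain_len, alpha, rng) for _ in range(n)]
```

Sending `alpha` to zero reproduces the failure regime: every sampled instance needs a long rollout before any reward can be observed.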

The paper derives two major implications. First, policy‑gradient methods possess an implicit efficiency bias: among all policies that achieve the same final‑answer reward, those that use fewer reasoning steps are favored because they yield higher‑probability trajectories and lower‑variance gradients. Second, data curriculum matters: a sufficient proportion of simple examples is essential for the model to acquire a reusable reasoning subroutine that extrapolates to more complex instances. Remarkably, the authors show that augmenting the training set with out‑of‑distribution simple examples can improve performance on in‑distribution hard examples more than training directly on the hard examples.
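The curriculum implication can be sketched as a mixture over example pools. This is an illustrative recipe, not the paper's exact sampling procedure; the mixing fraction is a hypothetical knob for the proportion of simple examples.

```python
import random

def build_training_stream(hard_examples, simple_examples, simple_frac, rng):
    """Yield an endless training stream where each draw is a simple
    (possibly out-of-distribution) example with probability simple_frac,
    and an in-distribution hard example otherwise."""
    while True:
        pool = simple_examples if rng.random() < simple_frac else hard_examples
        yield rng.choice(pool)
```

With `simple_frac = 0`, the stream reproduces the hard-only regime where learning stalls; a modest positive fraction supplies the early low-variance reward signal the theory requires.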

Empirical validation covers both the synthetic graph task and real‑world mathematical reasoning benchmarks. In the synthetic setting, models trained with a Q that emphasizes simple starts quickly learn to generate pure forward‑step rollouts and achieve near‑perfect accuracy even on chains longer than those seen during training. When simple starts are removed, the models fail to learn any coherent strategy. For real language models, the authors fine‑tune Qwen‑based transformers on a suite of math problems. Models trained only on easy arithmetic problems develop CoT‑style generation with explicit intermediate steps and successfully solve harder problems that require longer reasoning chains. Ablation studies confirm that excluding simple problems prevents CoT emergence, while adding simple out‑of‑distribution tasks boosts performance on the target hard tasks.

In summary, the work provides a rigorous theoretical foundation for the observation that outcome‑based RL can induce step‑by‑step reasoning in transformers. It identifies the distributional prerequisite—non‑vanishing mass on simple examples—as the key factor enabling policy‑gradient dynamics to discover an efficient iterative algorithm. The findings have practical implications for designing curricula and data sampling strategies when applying RL fine‑tuning to large language models for reasoning‑intensive applications.

