New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.


💡 Research Summary

The paper tackles the contentious question of whether Reinforcement Learning with Verifiable Rewards (RLVR) endows large language models (LLMs) with genuinely new reasoning abilities or merely amplifies latent skills already present in the base model. The authors adopt a probabilistic definition of capability: a model lacks a capability if, after exhaustive sampling, it still fails to produce a correct solution for a given instance. This leads to a fine‑grained analysis of the Pass@k metric, decomposing it into per‑instance success probabilities Pθ(q).

A central theoretical construct is the "Multiplicative Barrier." For a multi-step reasoning problem that can be broken into M atomic sub-tasks, the overall success probability is approximated as the product of the atomic success probabilities: Pθ(q) ≈ ∏_{j=1}^{M} Pθ(s_j). Even with modest atomic accuracies (e.g., 0.3), this product decays exponentially with depth, pushing the composite probability below a statistical existence threshold ε ≈ 0.023 and effectively placing the problem in a "Null Set" of unsolvable tasks.
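The decay behind the Multiplicative Barrier is easy to reproduce numerically. The sketch below (assuming, for simplicity, that all M atomic steps share the same accuracy) shows how a 0.3 atomic accuracy drives the composite probability below the paper's existence threshold ε ≈ 0.023 within a few steps:

```python
# Illustrative sketch of the "Multiplicative Barrier": the composite
# success probability is the product of atomic step probabilities, so it
# decays exponentially with reasoning depth M.
EPS = 0.023  # statistical existence threshold reported in the paper

def composite_success(atomic_p: float, depth: int) -> float:
    """P_theta(q) ~= product of M identical atomic step probabilities."""
    return atomic_p ** depth

for m in range(1, 6):
    p = composite_success(0.3, m)
    status = "solvable" if p >= EPS else "null set"
    print(f"depth {m}: P = {p:.4f} ({status})")
```

With atomic accuracy 0.3, the composite probability crosses below ε between depth 3 and depth 4, which is exactly the regime where the paper's multi-step test problems (depths 2-5) live.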

RLVR provides dense reward signals on the atomic level, allowing the training process to “sharpen” each Pθ(s_j). When atomic accuracies rise sufficiently (e.g., to 0.7), the product surpasses a feasibility bound δ=1/K_min≈0.125, moving the problem into a “Feasible Set.” The authors describe this shift as a phase transition driven by probability amplification, which manifests as a steep rise in Pass@k curves for low k values.
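The phase transition can be sketched with the standard Pass@k identity for a per-instance success probability p under independent sampling, Pass@k = 1 − (1 − p)^k. The depth M = 3 and the before/after atomic accuracies (0.3 → 0.7) below are the paper's illustrative values; the choice of k = 8 follows from δ = 1/K_min ≈ 0.125:

```python
# Sketch of the feasibility phase transition: sharpening atomic accuracy
# from 0.3 to 0.7 lifts a depth-3 composite probability across the
# feasibility bound delta = 1/K_min ~= 0.125.
DELTA = 0.125
M = 3  # reasoning depth (illustrative)

def pass_at_k(p: float, k: int) -> float:
    """Pass@k for per-instance success probability p, i.i.d. samples."""
    return 1.0 - (1.0 - p) ** k

before = 0.3 ** M  # 0.027: below delta, practically unsolvable
after = 0.7 ** M   # 0.343: above delta, solvable within a few samples
print(f"before: P = {before:.3f}, Pass@8 = {pass_at_k(before, 8):.3f}")
print(f"after:  P = {after:.3f}, Pass@8 = {pass_at_k(after, 8):.3f}")
```

The same modest per-step gain that barely changes single-step behavior multiplies into a qualitative jump at the composite level, which is why the Pass@k curves rise steeply at low k after training.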

The paper also introduces “Capability Erosion.” Because RLVR optimizes the expected reward over the entire task distribution, improvements on frequent or high‑reward instances can come at the expense of rarer or more difficult instances. Empirically, this appears as an increase in Pass@1 while Pass@k for larger k may decline, indicating a polarization of success probabilities across the dataset.
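A toy two-instance example makes the polarization mechanism concrete. The probability values below are hypothetical, chosen only to show that mean Pass@1 can rise while mean Pass@k at larger k falls:

```python
# Toy illustration of "Capability Erosion": RLVR polarizes per-instance
# success probabilities, so aggregate Pass@1 improves while aggregate
# Pass@k for large k degrades.
def pass_at_k(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

def mean_pass(probs, k: int) -> float:
    """Average Pass@k over a set of instances."""
    return sum(pass_at_k(p, k) for p in probs) / len(probs)

before = [0.4, 0.4]   # moderate success on both instances
after = [0.9, 0.05]   # polarized: one sharpened, one sacrificed

print(f"Pass@1: {mean_pass(before, 1):.3f} -> {mean_pass(after, 1):.3f}")
print(f"Pass@8: {mean_pass(before, 8):.3f} -> {mean_pass(after, 8):.3f}")
```

Because Pass@k saturates quickly for any p well above 1/k, shifting mass from a moderate instance to an already-easy one buys Pass@1 at the cost of high-k coverage, matching the empirical polarization the paper reports.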

To validate the theory, the authors use the Algebrarium framework to generate synthetic algebraic reasoning tasks. Training data consist exclusively of single‑step (atomic) examples from four algebraic systems (integers, modular products, a “Knitting” system, and Rubik’s Cube). Test data contain 800 multi‑step problems stratified by depth (2–5 steps). This “Atomic‑Composite Split” isolates the effect of improving atomic skills on unseen composite tasks.

Experimental findings: (1) Composite task performance correlates strongly with the joint probability of atomic steps (Pearson ρ ranging from 0.69 to 0.96). (2) RLVR can dramatically increase atomic accuracies, thereby crossing the multiplicative barrier and enabling previously impossible multi‑step reasoning. (3) However, some atomic or composite instances degrade, confirming the predicted capability erosion. (4) Pass@k curves exhibit the predicted polarization: low‑k metrics improve, high‑k metrics may fall.
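Finding (1) can be checked with a plain Pearson correlation between the predicted composite probability (the product of measured atomic accuracies) and the observed composite success rate. The data points below are hypothetical stand-ins for the paper's per-instance measurements, which yielded ρ ∈ [0.69, 0.96]:

```python
# Sketch of the correlation test behind finding (1): does the product of
# atomic accuracies predict composite task accuracy?
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Predicted composite P = product of per-step atomic accuracies
# (hypothetical two-step instances), vs. hypothetical observed rates.
predicted = [0.9 * 0.8, 0.7 * 0.7, 0.5 * 0.6, 0.3 * 0.4]
observed = [0.70, 0.47, 0.33, 0.10]
print(f"rho = {pearson(predicted, observed):.3f}")
```

A high ρ on such data is exactly the evidence the paper uses to argue that composite performance is governed by the joint probability of atomic steps rather than by any separately learned multi-step skill.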

The authors conclude that RLVR does not directly teach new complex reasoning patterns; instead, it amplifies the reliability of primitive operations, which mathematically yields emergent compositional abilities. At the same time, global reward maximization can reallocate model capacity, sacrificing certain skills for overall gain. This probabilistic perspective clarifies why emergent abilities appear after RLVR and highlights the need for careful evaluation metrics and balanced training objectives when using RLVR to enhance LLM reasoning.

