When Is Compositional Reasoning Learnable from Verifiable Rewards?
The emergence of compositional reasoning in large language models through reinforcement learning with verifiable rewards (RLVR) has been a key driver of recent empirical successes. Despite this progress, it remains unclear which compositional problems are learnable in this setting using outcome-level feedback alone. In this work, we theoretically study the learnability of compositional problems in autoregressive models under RLVR training. We identify a quantity we call the task-advantage ratio, a joint property of the compositional problem and the base model that characterizes which tasks and compositions are learnable from outcome-level feedback. On the positive side, using this characterization, we show that compositional problems where correct intermediate steps provide a clear advantage are efficiently learnable with RLVR. We also analyze how such an advantage naturally arises in different problems. On the negative side, when the structural advantage is not present, RLVR may converge to suboptimal compositions. We prove that, in some cases, the quality of the base model determines whether such an advantage exists and whether RLVR will converge to a suboptimal solution. We hope our analysis can provide a principled theoretical understanding of when and why RLVR succeeds and when it does not.
💡 Research Summary
This paper provides a rigorous theoretical analysis of when large language models (LLMs) can acquire compositional reasoning abilities using only outcome-level feedback, a setting known as Reinforcement Learning with Verifiable Rewards (RLVR). The authors model reasoning as a sequence of deterministic “tasks” (or skills) that the model selects at each autoregressive step. Each task σ_j maps the current token prefix to a single next token, and the model combines fixed pretrained features with step-specific positional embeddings h_{s,j} through a linear parameterization θ.
Training proceeds via a REINFORCE-style update that receives only a binary verification signal V on the final output. Rollouts are sampled repeatedly until a successful one (V=1) is obtained; no KL regularization is used. The central contribution is the introduction of the “task-advantage ratio” A_{s,j} = P(V=1 | task σ_j selected at step s) / P(V=1 | σ_j not selected at step s). Theorem 5.2 shows that the expected gradient update for a given task is proportional to (A_{s,j} − 1): if A_{s,j} > 1 the update reinforces the task, otherwise it suppresses it.
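Because the ratio is defined purely in terms of rollout statistics, it can be estimated without access to the model's internals. A minimal sketch (all names are illustrative; the policy is treated as a black-box sampler that returns the selected tasks and the binary verifier outcome):

```python
import random

def task_advantage_ratio(sample_rollout, step, task, n=20000, seed=0):
    """Monte Carlo estimate of A_{s,j} =
    P(V=1 | task j at step s) / P(V=1 | task j not at step s).
    Per Theorem 5.2, the expected update for task j at step s
    has the sign of (A_{s,j} - 1)."""
    rng = random.Random(seed)
    hits = [0, 0]    # verified rollouts, split by whether task j was chosen at step s
    totals = [0, 0]
    for _ in range(n):
        tasks, verified = sample_rollout(rng)
        chose = int(tasks[step] == task)
        totals[chose] += 1
        hits[chose] += verified
    p_with = hits[1] / max(totals[1], 1)
    p_without = hits[0] / max(totals[0], 1)
    return p_with / max(p_without, 1e-12)

# Toy policy: two steps, three candidate tasks, uniform selection;
# the (hypothetical) verifier accepts only the composition (0, 1).
def toy_rollout(rng):
    tasks = [rng.randrange(3), rng.randrange(3)]
    return tasks, int(tasks == [0, 1])

A_correct = task_advantage_ratio(toy_rollout, step=0, task=0)  # >> 1: reinforced
A_wrong = task_advantage_ratio(toy_rollout, step=0, task=2)    # < 1: suppressed
```

In this toy, selecting task 0 at step 0 is necessary for verification, so its ratio is large, while task 2 never leads to a verified rollout and is suppressed.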
Theorem 5.4 establishes a positive‑advantage condition: if for every step s the correct task τ(s) satisfies A_{s,τ(s)} ≥ 1 + Δ for some constant Δ>0, then RLVR converges to the correct chain‑of‑thought (CoT) in O(S²) iterations, where S is the length of the desired CoT. This condition captures the intuition that partially correct intermediate steps must provide a statistical edge toward final verification. Conversely, when A_{s,τ(s)} ≈ 1, the expected learning signal vanishes and RLVR may stagnate or converge to suboptimal compositions, even though the correct composition exists and achieves perfect verification.
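The positive-advantage regime can be illustrated with a toy tabular simulation (an illustration of the mechanism, not the paper's exact model): a softmax policy selects one of a few tasks per step, and the verifier rewards only fully correct rollouts, so every update pushes probability mass toward the correct composition.

```python
import math
import random

def rlvr_toy(S=3, n_tasks=3, lr=0.5, iters=600, seed=1):
    """Tabular REINFORCE with an outcome-level binary verifier:
    V = 1 iff the (hypothetical) correct task -- here task 0 -- is
    selected at every one of the S steps. Returns the probability
    that a fresh rollout is fully correct after training."""
    rng = random.Random(seed)
    theta = [[0.0] * n_tasks for _ in range(S)]  # per-step logits

    def softmax(logits):
        z = [math.exp(t) for t in logits]
        s = sum(z)
        return [w / s for w in z]

    def sample(probs):
        r, acc = rng.random(), 0.0
        for j, p in enumerate(probs):
            acc += p
            if r <= acc:
                return j
        return len(probs) - 1

    for _ in range(iters):
        choices = [sample(softmax(theta[s])) for s in range(S)]
        V = int(all(j == 0 for j in choices))  # outcome-level reward only
        for s, j in enumerate(choices):
            probs = softmax(theta[s])
            for k in range(n_tasks):
                # REINFORCE: grad of log pi(j|s) is one-hot(j) minus softmax
                theta[s][k] += lr * V * ((1.0 if k == j else 0.0) - probs[k])

    p = 1.0
    for s in range(S):
        p *= softmax(theta[s])[0]
    return p
```

With three candidate tasks per step the untrained success rate is only (1/3)³ ≈ 0.037, yet because only correct rollouts ever trigger an update, the correct logits are monotonically reinforced and the policy concentrates on the correct composition.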
The paper illustrates these ideas with concrete problem families. In “long addition”, each digit‑wise addition step directly influences the final sum, yielding a large task‑advantage ratio and enabling efficient learning. In contrast, problems like sparse parity have intermediate results that barely affect the verifier, leading to ratios near one and causing exponential‑time learning or failure.
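The contrast can be made concrete with two stylized verifiers (toy stand-ins for the paper's examples, not its exact constructions): one whose pass probability factors through every step, and a parity-style one where a single correct step is cancelled or un-cancelled by the rest of the chain.

```python
import random

def advantage(rollout, step, n=50000, seed=0):
    """Estimate A_s = P(V=1 | step s correct) / P(V=1 | step s incorrect)."""
    rng = random.Random(seed)
    hits, totals = [0, 0], [0, 0]
    for _ in range(n):
        correct_at_step, verified = rollout(rng, step)
        totals[correct_at_step] += 1
        hits[correct_at_step] += verified
    return (hits[1] / max(totals[1], 1)) / max(hits[0] / max(totals[0], 1), 1e-12)

def addition_like(rng, step, S=4):
    # Verifier passes iff every step is correct (each digit feeds the sum),
    # so a correct step s is necessary for V = 1 and A_s >> 1.
    steps = [rng.random() < 0.5 for _ in range(S)]
    return int(steps[step]), int(all(steps))

def parity_like(rng, step, S=4):
    # Verifier checks a single XOR bit; conditioned on any one step being
    # correct, the remaining steps still flip the answer with probability
    # 1/2, so P(V = 1) stays near 1/2 either way and A_s ~ 1.
    steps = [rng.random() < 0.5 for _ in range(S)]
    return int(steps[step]), int(sum(steps) % 2 == 1)
```

Running `advantage` on the two toys yields a large ratio for the addition-like verifier and a ratio statistically indistinguishable from one for the parity-like verifier, matching the learnable/unlearnable split described above.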
A further insight concerns the quality of the base model. The authors argue that the task‑advantage ratio depends on the base model’s ability to perform each primitive task σ_j with some probability p_j. If p_j is low, even selecting the correct task does not substantially increase the chance of verification, making A_{s,τ(s)} close to one. Thus, a sufficiently capable pretrained model is a prerequisite for RLVR to succeed; otherwise even simple compositions may be unlearnable.
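This dependence can be written in a toy closed form (an illustration of the mechanism, not a formula from the paper): suppose selecting the correct task pays off only when the base model also executes it correctly, with probability p_j, and the rest of the chain succeeds with probability q, while a small accidental-verification probability ε exists regardless.

```python
def toy_advantage(p_j, q=0.5, eps=0.05):
    """A_{s,tau(s)} in a toy model (illustrative assumptions) where
    P(V=1 | correct task selected) = p_j * q + (1 - p_j * q) * eps
    and P(V=1 | otherwise) = eps.
    As p_j -> 0 the ratio collapses to 1 and the learning signal vanishes."""
    return (p_j * q + (1 - p_j * q) * eps) / eps

# A weak base model (small p_j) yields A ~ 1; a capable one yields A >> 1.
```

In this toy the ratio is monotonically increasing in p_j and equals exactly 1 at p_j = 0, mirroring the claim that base-model competence on the primitive tasks is a prerequisite for RLVR.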
Overall, the work introduces a clean, quantitative metric (task‑advantage ratio) that predicts the learnability of compositional reasoning under RLVR. It clarifies why RLVR works well on tasks where intermediate steps are informative for the final answer, and why it can fail when such structure is absent or when the base model lacks the necessary primitive skills. These findings offer practical guidance for designing RL‑based fine‑tuning pipelines: choose problems with informative intermediate steps, and ensure the pretrained model already possesses reasonable competence on the constituent tasks. The theoretical framework thus bridges the gap between empirical successes of RLVR and a principled understanding of its limitations.