Beyond Correctness: Learning Robust Reasoning via Transfer

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Reinforcement Learning with Verifiable Rewards (RLVR) has recently strengthened LLM reasoning, but its focus on final-answer correctness leaves a critical gap: it does not ensure the robustness of the reasoning process itself. We adopt a simple philosophical view: robust reasoning should remain useful beyond the mind that produced it. We therefore treat reasoning as a form of meaning transfer that must survive truncation, reinterpretation, and continuation. Building on this principle, we introduce Reinforcement Learning with Transferable Reward (RLTR), which operationalizes robustness via a transfer reward that tests whether a partial reasoning prefix from one model can guide a separate model to the correct answer. This encourages LLMs to produce reasoning that is stable, interpretable, and genuinely generalizable. Our approach improves sampling consistency alongside final-answer accuracy, and it reaches comparable performance in substantially fewer training steps. For example, on MATH500, RLTR achieves a +3.6 percentage-point gain in Maj@64 compared to RLVR and matches RLVR's average accuracy with roughly 2.5x fewer training steps, providing both more reliable reasoning and significantly greater sample efficiency.


💡 Research Summary

The paper addresses a critical shortcoming of recent reinforcement‑learning‑with‑verifiable‑reward (RLVR) methods for large language models (LLMs): they reward only the correctness of the final answer and ignore the quality of the reasoning process that leads to that answer. To fill this gap, the authors introduce a philosophical principle that robust reasoning should be transferable—that is, a reasoning fragment produced by one model should remain useful when handed to another model for continuation. Building on this idea, they propose Reinforcement Learning with Transferable Reward (RLTR), which augments the standard RLVR objective with a new “transfer reward.”

In RLTR, a generator model first produces a full solution (reasoning steps plus final answer). A random truncation ratio τ (uniformly sampled between 0.3 and 0.9) is applied to cut the output into a prefix. This prefix is fed to a frozen receiver model, which attempts to complete the solution. If the receiver's final answer matches the ground truth, the transfer reward R_trans is 1; otherwise it is 0. The overall reward combines the original answer reward, a format reward, and the transfer reward weighted by a hyper‑parameter t (set to 1.0 in the experiments). The policy is then updated using Group Relative Policy Optimization (GRPO), the same algorithm used in prior RLVR work.
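The reward computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `receiver_complete` stands in for a frozen receiver model's decode-and-extract-answer step, and the function and parameter names are our own.

```python
import random

def truncate(tokens, tau):
    """Cut a generated solution to its first tau fraction of tokens."""
    cut = max(1, int(len(tokens) * tau))
    return tokens[:cut]

def transfer_reward(solution_tokens, receiver_complete, ground_truth,
                    tau_low=0.3, tau_high=0.9, rng=random):
    """Binary transfer reward R_trans: truncate at a random ratio
    tau ~ U(tau_low, tau_high), let the frozen receiver finish the
    solution, and reward 1 iff its final answer matches the ground truth."""
    tau = rng.uniform(tau_low, tau_high)
    prefix = truncate(solution_tokens, tau)
    receiver_answer = receiver_complete(prefix)
    return 1.0 if receiver_answer == ground_truth else 0.0

def total_reward(answer_reward, format_reward, trans_reward, t=1.0):
    """Overall reward: answer + format + t * transfer (t = 1.0 in the paper)."""
    return answer_reward + format_reward + t * trans_reward
```

In training, this scalar reward would be fed to GRPO in place of (or alongside) the standard RLVR answer reward; the receiver's parameters receive no gradient.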

Experiments are conducted on a 7‑billion‑parameter Qwen2.5‑7B‑Instruct as the generator and a 3‑billion‑parameter Qwen2.5‑3B‑Instruct as the receiver. The method is evaluated on five benchmarks spanning moderate to high difficulty: MATH‑500, GSM8K, AMC23, AIME2024, and the scientific reasoning set GPQA. The authors report two primary metrics: average single‑sample accuracy (Acc.) and majority‑vote accuracy at various sample sizes (Maj@K), which captures multi‑sample consistency.

Results show that RLTR consistently outperforms RLVR across all datasets. On MATH‑500, RLTR raises average accuracy from 71.0 % to 77.0 % and improves Maj@64 from 80.6 % to 84.2 %, a +3.6 percentage‑point gain. Similar gains are observed on GSM8K and the harder AMC/AIME tasks. Moreover, RLTR reaches comparable performance to RLVR with roughly 2.5 × fewer training steps, demonstrating superior sample efficiency. Analyses reveal a strong correlation between the measured transferability score and Maj@K, confirming that the transfer reward indeed encourages more stable intermediate reasoning. The authors also test alternative receiver models (smaller Qwen variants and Llama 3.2) and find that the benefit persists, indicating that the approach is not tied to a specific architecture.

The paper acknowledges two limitations. First, computing the transfer reward requires an extra forward pass through a receiver model, adding computational overhead. Second, the current implementation uses a single continuation to decide success; richer estimators (e.g., expected reward over multiple continuations) could provide smoother learning signals. Future work may explore meta‑reward models that predict transferability without explicit roll‑outs, or integrate multi‑step verification to further tighten the reasoning process.

In summary, RLTR introduces a novel, process‑oriented reward that pushes LLMs to generate reasoning that is not only correct but also reusable and interpretable. By rewarding the ability of a reasoning fragment to be handed off to another model, RLTR improves both final‑answer accuracy and multi‑sample consistency while reducing training time, marking a significant step toward more robust, trustworthy LLM reasoning.

