MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop
Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, yet outcome-only scalar rewards are sparse and uninformative on failed samples: they signal only that a solution is wrong, not why the reasoning fails. In this paper, we investigate how to leverage richer verbal feedback to guide RLVR training on failed samples, and how to convert such feedback into a trainable learning signal. Specifically, we propose a multi-turn feedback-guided reinforcement learning framework built on three mechanisms: (1) dynamic multi-turn regeneration guided by feedback, triggered only on failed samples; (2) two complementary learning signals for within-turn and cross-turn optimization; and (3) structured injection of feedback into the model's reasoning process. Trained on a sample of OpenR1-Math, the approach outperforms supervised fine-tuning and RLVR baselines in-domain and generalizes well out-of-domain.
💡 Research Summary
MulFeRL tackles a fundamental limitation of reinforcement‑learning‑with‑verifiable‑rewards (RLVR): the sparsity and uninformative nature of scalar outcome signals, especially on failed attempts. The authors propose a multi‑turn framework that activates only when a sampled group of K candidate solutions for a prompt all receive a zero reward from a verifier. In such cases, a feedback simulator (implemented as an LLM) generates a concise, group‑level verbal critique that pinpoints the dominant error mode and offers concrete suggestions for improvement. This feedback is then injected into the model’s next‑turn context, prompting a new set of K candidates conditioned on the feedback. The process repeats for up to T turns, stopping early if the group becomes mixed (some successes) or all‑positive (all successes).
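The control flow above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: the function names (`sample_group`, `verify`, `simulate_feedback`) and the way feedback is appended to the context are hypothetical stand-ins for the components the summary describes.

```python
def multi_turn_rollout(prompt, sample_group, verify, simulate_feedback, K=8, T=3):
    """Regenerate for up to T turns, injecting a group-level verbal critique
    only while every one of the K candidates fails the verifier."""
    context = prompt
    history = []  # list of (group, rewards) per turn
    for _ in range(T):
        group = sample_group(context, K)                # K candidate solutions
        rewards = [verify(prompt, g) for g in group]    # binary 0/1 rewards
        history.append((group, rewards))
        if any(rewards):
            # Mixed or all-positive group: stop early, no feedback needed.
            break
        # All-failed group: obtain a concise critique and inject it into
        # the next-turn context before regenerating.
        critique = simulate_feedback(prompt, group)
        context = f"{context}\n\nFeedback: {critique}"
    return history
```

Note that the feedback simulator is only ever invoked on all-failed groups, which is what keeps the extra LLM calls confined to the failure-dominated regime the method targets.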
Two complementary learning signals are employed. When a regenerated group within a turn is mixed (some solutions pass the verifier, some fail), the method applies Group Relative Policy Optimization (GRPO), computing group-wise normalized advantages from the verifier's binary rewards and updating the policy via a clipped importance-sampling loss with KL regularization. When feedback is strong enough that the regenerated group becomes all-positive, the within-group contrast disappears; MulFeRL therefore forms cross-turn preference pairs between the newly generated solutions and the previous-turn solutions for the same prompt. These pairs are optimized with a Direct Preference Optimization (DPO) loss, encouraging the model to assign higher log-probability to the feedback-improved answers.
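The two signals reduce to standard, well-known forms. Below is a minimal sketch of both: GRPO's group-relative advantage is the within-group z-score of the binary rewards, and the DPO loss on a cross-turn pair is the usual `-log sigmoid` of the scaled log-probability margin against a frozen reference model. Function names and the scalar `beta` default are illustrative, not taken from the paper.

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize the verifier's binary rewards
    by the group's own mean and standard deviation."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (math.sqrt(var) + eps) for r in rewards]

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one cross-turn pair: the feedback-improved solution is
    the winner (w), the previous-turn failed solution the loser (l).
    logp_* are policy log-probs, ref_logp_* reference-model log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

For an all-failed or all-positive group the rewards are constant, so the GRPO advantages vanish; that is exactly the degenerate case the cross-turn DPO pairs are introduced to cover.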
The framework thus converts the qualitative improvement induced by verbal feedback into a quantitative gradient signal, enabling stable policy updates even in failure‑dominated regimes. Experiments on a sampled subset of OpenR1‑Math (≈1 M training examples) demonstrate that MulFeRL consistently outperforms baseline GRPO, PPO, DPO‑only, and supervised fine‑tuning across five challenging math benchmarks (including MATH and GSM‑8K). Notably, on hard problem subsets where the initial failure rate exceeds 70 %, the multi‑turn feedback loop raises success rates by over 20 %. Out‑of‑distribution evaluations on scientific and general reasoning tasks (ARC‑E, SciQ) also show gains of 2–3 % absolute accuracy, indicating strong generalization.
Ablation studies confirm the necessity of each component: removing the multi‑turn loop or the DPO cross‑turn loss degrades performance, while using feedback on all samples (instead of only on all‑failed groups) wastes learning capacity and reduces sample efficiency. The authors discuss limitations such as dependence on the quality of the feedback simulator, increased computational cost due to multiple regeneration turns, and the need to explore human‑generated feedback or meta‑RL policies for feedback selection.
In summary, MulFeRL introduces a principled way to harness rich verbal feedback for reinforcement learning in reasoning‑heavy domains. By dynamically triggering feedback‑guided regeneration and jointly optimizing GRPO and DPO objectives, it transforms otherwise silent failure cases into informative learning signals, achieving state‑of‑the‑art results on both in‑domain and out‑of‑domain reasoning benchmarks.