Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
💡 Research Summary
The paper tackles two fundamental inefficiencies in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models: (1) indiscriminate use of all rollouts generated for a prompt, despite large quality variance, and (2) discarding historical rollouts after a single training step, which limits sample efficiency. To address both, the authors cast rollout scheduling as a contextual bandit problem. Each rollout is treated as an arm, and its reward is defined as the performance gain (change in average reward and entropy) achieved when the policy is updated after training on that rollout.
A neural scheduler, called Contextual Bandit Scheduler (CBS), is introduced. CBS encodes each rollout into a compact 10‑dimensional context vector that captures training dynamics such as reward, advantage, entropy, and clipping statistics. An MLP predicts the future utility of each rollout. Selection proceeds by ranking rollouts according to predicted utility and keeping the top‑K. This mechanism serves two purposes: intra‑group selection (filtering noisy rollouts within a prompt) and global reuse (choosing high‑value rollouts from a FIFO buffer that stores the most recent L batches). The same bandit framework unifies both tasks.
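The selection mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature list, MLP sizes, group size, and buffer length are all assumed for concreteness, and only the forward/ranking step is shown (the online training of the predictor is covered separately).

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

# Assumed dimensions: a 10-dim context per rollout (reward, advantage,
# entropy, clipping statistics, ...) and one hidden layer of 32 units.
CONTEXT_DIM = 10
HIDDEN = 32

# Randomly initialized MLP weights; in CBS these are trained online.
W1 = rng.normal(0, 0.1, (CONTEXT_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (HIDDEN, 1))
b2 = np.zeros(1)

def predict_utility(contexts):
    """MLP forward pass: contexts of shape (N, 10) -> utilities of shape (N,)."""
    h = np.tanh(contexts @ W1 + b1)
    return (h @ W2 + b2).ravel()

def select_top_k(contexts, k):
    """Rank rollouts by predicted utility and keep the indices of the top k."""
    utilities = predict_utility(contexts)
    return np.argsort(utilities)[::-1][:k]

# Intra-group selection: keep the 4 most promising of 8 rollouts for a prompt.
group = rng.normal(size=(8, CONTEXT_DIM))
kept = select_top_k(group, k=4)
assert len(kept) == 4

# Global reuse: a FIFO buffer holding the most recent L batches (L=3 assumed);
# the same ranking is applied over buffered rollouts when reusing history.
buffer = deque(maxlen=3)
```

Both intra-group filtering and buffer reuse reduce to the same ranking call, which is the sense in which one bandit framework unifies the two tasks.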
Reward feedback is derived from the group‑level performance gain R(·) defined in Equation 5, which combines the change in mean reward V and an entropy‑penalty term. The group reward is distributed to individual rollouts proportionally to their advantage magnitude, yielding sample‑level rewards (Equation 6). CBS updates its parameters online via stochastic gradient descent on the squared error between predicted and observed sample rewards (Equation 7), allowing it to adapt to evolving policy dynamics.
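The feedback loop can be illustrated schematically. The exact forms of Equations 5–7 are not reproduced here; the entropy coefficient `lam`, the linear predictor standing in for the MLP, and all numeric values below are assumptions for the sketch.

```python
import numpy as np

def group_gain(v_new, v_old, h_new, h_old, lam=0.1):
    """Schematic Eq. 5: change in mean reward minus an entropy-penalty term.
    The sign convention and coefficient lam are assumed."""
    return (v_new - v_old) - lam * (h_new - h_old)

def sample_rewards(group_reward, advantages):
    """Schematic Eq. 6: split the group reward across rollouts in
    proportion to each rollout's advantage magnitude."""
    w = np.abs(advantages)
    return group_reward * w / w.sum()

def sgd_update(theta, context, pred, target, lr=1e-2):
    """Schematic Eq. 7: one SGD step on the squared error between the
    predicted and observed sample reward (linear predictor for simplicity)."""
    grad = 2.0 * (pred - target) * context
    return theta - lr * grad

# Toy step: mean reward rose 0.55 -> 0.62 while entropy fell 1.3 -> 1.1.
gain = group_gain(0.62, 0.55, 1.1, 1.3)
adv = np.array([1.5, -0.5, 0.2, -1.8])
r = sample_rewards(gain, adv)
assert np.isclose(r.sum(), gain)  # per-sample rewards sum to the group gain
```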
Theoretically, under a simplified single‑arm setting, the authors prove a sub‑linear regret bound of O(√T) for the scheduler, showing that its cumulative loss compared to an oracle that always picks the best rollout vanishes as training progresses. They also prove that enlarging the replay buffer strictly improves the achievable upper bound on policy performance, providing a formal justification for global rollout reuse.
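The regret notion behind this guarantee can be written schematically as follows; this is the standard bandit-regret form, and the paper's precise statement and constants may differ:

```latex
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \bigl( r_t(a_t^{\star}) - r_t(a_t) \bigr) \;=\; O(\sqrt{T}),
```

where $a_t^{\star}$ is the rollout an oracle would pick at step $t$ and $a_t$ is the scheduler's choice, so the average per-step loss $\mathrm{Regret}(T)/T = O(1/\sqrt{T})$ vanishes as training progresses.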
Empirically, the method is evaluated on six mathematical reasoning benchmarks (including MATH, GSM8K, and MMLU) and three RLVR policy‑optimization algorithms (GRPO, DAPO, GSPO). Across all combinations, CBS yields consistent gains of 2–4 percentage points in accuracy while reducing computational overhead by up to 30% relative to baseline RLVR pipelines that use all rollouts. Ablation studies isolate the contributions of intra‑group filtering and global reuse, demonstrating that each component improves performance and that their combination produces the largest benefit. Early in training, CBS quickly identifies high‑utility rollouts, accelerating policy improvement; later, buffer reuse sustains steady gains.
Limitations include the relatively low‑dimensional rollout representation, which may omit richer signals present in large models, and the reliance on a reward proxy based on average reward and entropy, which might miss subtle improvements on very hard problems. Future work could explore richer contextual embeddings, non‑linear bandit models, and alternative reward formulations.
Overall, the paper presents the first unified, theoretically‑grounded approach to selective rollout usage and historical reuse in RLVR, delivering both analytical guarantees and practical performance improvements, and opening a promising direction for more data‑efficient reinforcement learning of LLM reasoning abilities.