Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Reinforcement learning (RL) for large language models (LLMs) remains expensive, largely because rollout generation dominates the training cost. Decoupling rollout generation from policy optimization (e.g., using a more efficient model for rollouts) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budgeted Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with batch size 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization in RL for LLMs.


💡 Research Summary

Reinforcement learning for large language models (LLMs) is notoriously expensive because the rollout phase—where the model autoregressively generates trajectories—typically consumes 80% or more of the total training budget. A natural way to cut costs is to replace the expensive target model with a cheaper surrogate (e.g., a quantized, sparsified, or distilled version) for rollout generation. However, this decoupling creates a severe actor‑policy distribution mismatch: the rollout model’s output distribution $q$ diverges dramatically from the target policy distribution $p$, leading to exploding KL divergences, unstable advantage estimates, and ultimately training collapse.

Existing off‑policy corrections rely on importance sampling (IS) or truncated importance sampling (TIS) to re‑weight the on‑policy objective after the fact. While these methods can mitigate modest mismatches, they break down when the KL gap is an order of magnitude larger, as is typical in extreme off‑policy settings. Moreover, they do not address the root cause—the mismatch at the source.

The paper introduces Jackpot, a framework that directly narrows the gap between $q$ and $p$ before policy updates by employing Optimal Budgeted Rejection Sampling (OBRS). Classical rejection sampling would accept a token $i$ drawn from $q$ with probability $p_i/(\lambda q_i)$, where $\lambda \ge \max_i p_i/q_i$. With the vocabularies of modern LLMs (≥100k tokens), even tiny local differences can make $\lambda$ astronomically large, driving the acceptance rate to near zero. OBRS relaxes this requirement: the user specifies a desired average acceptance budget $\bar a$ (e.g., 90%); OBRS then finds the unique scaling factor $\lambda$ that satisfies the budget and simultaneously minimizes the KL divergence between the post‑rejection distribution $\tilde q$ and the target $p$. Theoretical guarantees (Theorems 3.3 and 3.4) prove that for any non‑trivial budget, $\tilde q$ is strictly closer to $p$ than the original $q$ in the KL sense, and that OBRS is optimal among all acceptance rules respecting the same budget.
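The budgeted acceptance rule above can be sketched numerically. Assuming dense token distributions $p$ and $q$ and an acceptance rule of the form $\min(1, p_i/(\lambda q_i))$, the scaling factor $\lambda$ that meets a target average acceptance rate can be found by bisection, since the acceptance rate decreases monotonically in $\lambda$. This is a minimal illustration under our own assumptions (the function `obrs_lambda` and its bisection details are ours, not the paper's implementation):

```python
import numpy as np

def obrs_lambda(p, q, budget, iters=60):
    """Bisect for the scaling factor lambda such that the expected
    acceptance rate under q, E_q[min(1, p/(lambda*q))], hits the budget."""
    def acc_rate(lam):
        return float(np.sum(q * np.minimum(1.0, p / (lam * q))))
    lo, hi = 1e-9, float(np.max(p / q))  # hi recovers classical rejection sampling
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if acc_rate(mid) > budget:  # acceptance rate decreases as lambda grows
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(1000))  # toy "target policy" over a 1000-token vocab
q = rng.dirichlet(np.ones(1000))  # toy "rollout model" distribution
lam = obrs_lambda(p, q, budget=0.9)
rate = float(np.sum(q * np.minimum(1.0, p / (lam * q))))  # ~0.9 by construction
```

Setting the budget to 1 recovers $\lambda \to 0$ (accept everything), while $\lambda = \max_i p_i/q_i$ recovers exact rejection sampling, matching the trade-off described above.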

Jackpot integrates OBRS into a unified training objective:

  1. Policy loss – the standard PPO clipped surrogate using the post‑rejection samples.
  2. Rollout‑policy alignment loss – a reverse‑KL term that gradually pulls the rollout model toward the evolving policy.
  3. OBRS weighting – the acceptance probabilities $a_i$ are used as sample weights, ensuring that the PPO objective sees a distribution already aligned with the target.
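The three components above can be combined into one scalar loss. The sketch below is a hedged NumPy illustration under our own simplifying assumptions (a flat token-level batch, a single `kl_coef` weighting, and function/argument names of our choosing); the paper's exact objective may differ:

```python
import numpy as np

def jackpot_loss(logp_new, logp_old, logq_rollout, adv, accept_w,
                 clip_eps=0.2, kl_coef=0.1):
    """Combined objective on a batch of accepted tokens: an
    acceptance-weighted PPO clipped surrogate plus a reverse-KL term
    that pulls the rollout model toward the current policy."""
    ratio = np.exp(logp_new - logp_old)
    surr = np.minimum(ratio * adv,
                      np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv)
    ppo_loss = -np.average(surr, weights=accept_w)  # OBRS acceptance weights
    # Monte-Carlo estimate of reverse KL(q || p) on samples drawn from q
    align_loss = np.mean(logq_rollout - logp_new)
    return ppo_loss + kl_coef * align_loss
```

When the rollout and policy distributions agree, the alignment term vanishes and the objective reduces to the acceptance-weighted clipped surrogate.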

System‑level challenges are addressed as follows. Computing $a_i$ for every token in the full vocabulary would be prohibitive. Jackpot therefore approximates the full distribution by extracting the top‑$k$ tokens (e.g., $k=64$) and estimating their probabilities; the remaining mass is treated as a uniform tail. This top‑$k$ approximation dramatically reduces memory and compute while preserving the dominant probability mass. Because top‑$k$ introduces bias, Jackpot adds a batch‑level bias‑correction term derived from the expected contribution of the omitted tail, ensuring unbiased gradient estimates.
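A minimal version of the top-$k$-plus-uniform-tail approximation might look as follows. This is a sketch of the idea as described (the function name and renormalization details are assumptions, and the batch-level bias correction is not reproduced here):

```python
import numpy as np

def topk_with_uniform_tail(probs, k=64):
    """Keep the top-k probabilities exactly and spread the leftover
    mass uniformly over the remaining (tail) tokens."""
    order = np.argsort(probs)[::-1]      # token indices, most probable first
    top, tail = order[:k], order[k:]
    approx = np.empty_like(probs)
    approx[top] = probs[top]             # dominant mass kept exactly
    approx[tail] = (1.0 - probs[top].sum()) / tail.size  # uniform tail
    return approx
```

Only the $k$ top probabilities need to be stored per token position, which is what makes the method tractable at LLM vocabulary sizes.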

Empirical evaluation focuses on a mathematical reasoning benchmark using the Qwen family of models. The target policy is Qwen‑3‑8B‑Base, while rollouts are generated by the smaller Qwen‑3‑1.7B‑Base. Over 300 PPO update steps with batch size 64, Jackpot maintains an average acceptance rate above 90% across a wide range of KL gaps (from 0.1 up to >2.0). The KL divergence after OBRS drops by an order of magnitude compared to the raw rollout distribution, and the training curve remains stable, matching the performance of an on‑policy PPO baseline. In contrast, TIS and plain IS exhibit exploding KL, unstable advantage estimates, and rapid performance degradation. Ablation studies confirm that (i) removing the reverse‑KL alignment allows the mismatch to grow over time, (ii) omitting the top‑$k$ approximation makes the method infeasible memory‑wise, and (iii) skipping the batch‑level correction introduces a small but cumulative bias that harms long‑run performance.

Key insights from the work are:

  • Pre‑emptive distribution alignment via OBRS reduces reliance on high‑variance importance weights, leading to more stable learning.
  • Budget‑driven trade‑off: practitioners can directly control the acceptance rate (hence computational cost) while guaranteeing the best possible KL reduction for that budget.
  • Scalable implementation: top‑$k$ probability estimation combined with bias correction enables OBRS to run on modern LLM vocabularies without prohibitive overhead.
  • Complementarity: OBRS can be stacked with existing IS or TIS corrections, handling the bulk of the mismatch up‑front and leaving only a small residual to be corrected post‑hoc.
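The last point can be illustrated numerically: after budgeted rejection, the post-rejection distribution $\tilde q \propto q \cdot \min(1, p/(\lambda q))$ sits closer to $p$, so the residual importance weights $p/\tilde q$ that a stacked IS/TIS correction would use are far less extreme. The snippet below uses a fixed $\lambda$ purely for illustration (Jackpot instead derives $\lambda$ from the acceptance budget, and the helper names are ours):

```python
import numpy as np

def post_rejection_dist(p, q, lam):
    """Post-rejection distribution: q_tilde proportional to q * min(1, p/(lam*q))."""
    w = q * np.minimum(1.0, p / (lam * q))
    return w / w.sum()

def kl(a, b):
    """KL divergence KL(a || b) for dense, strictly positive distributions."""
    return float(np.sum(a * np.log(a / b)))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(500))  # toy target distribution
q = rng.dirichlet(np.ones(500))  # toy rollout distribution

q_tilde = post_rejection_dist(p, q, lam=2.0)  # fixed lam, illustration only
res_w = p / q_tilde  # residual weights a stacked IS/TIS correction would use
```

Because the truncation caps exactly the tokens that $q$ over-represents relative to $p$, the KL divergence to the target shrinks, consistent with the paper's Theorems 3.3 and 3.4.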

In summary, Jackpot demonstrates that optimal‑budget rejection sampling is a practical and theoretically sound tool for bridging the actor‑policy gap in extreme off‑policy RL for LLMs. By doing so, it makes decoupled rollout generation viable, opening the door to large‑scale, cost‑effective RL fine‑tuning of future multi‑billion‑parameter language models.

