Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning


Authors: Xu Wan, Yansheng Wang, Wenqi Huang

Published as a conference paper at ICLR 2026

Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

Xu Wan ♠♣♡, Yansheng Wang ♣, Wenqi Huang ♢, Mingyang Sun ∗♡
♠ Zhejiang University  ♣ Bytedance Seed Robotics  ♢ China Southern Power Grid  ♡ Peking University

ABSTRACT

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinders learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while holding a lower-bound guarantee for policy improvement. Extensive experiments further demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve. The code is available here.

1 INTRODUCTION

Reinforcement Learning from Human Feedback (RLHF) has emerged as a transformative paradigm for aligning Large Language Models (LLMs) with human preferences and improving their performance on complex reasoning tasks (Ouyang et al., 2022; Bai et al., 2022). A significant recent evolution is Reinforcement Learning with Verifiable Rewards (RLVR) (Lambert et al., 2024), which replaces costly neural reward models with deterministic verification functions for more efficient and reliable training (Guo et al., 2025).
Numerous on-policy RL optimization methods, particularly Group Relative Policy Optimization (GRPO) (Shao et al., 2024) and its variants such as Dynamic Sampling Policy Optimization (DAPO) (Yu et al., 2025) and Group Sequence Policy Optimization (GSPO) (Zheng et al., 2025), have demonstrated remarkable success in LLM post-training scenarios, achieving exceptional performance on mathematical reasoning, code generation, and various downstream applications (Yang et al., 2025; Chen et al., 2025a; Shen et al., 2025).

Figure 1: Tracking the sample counts across accuracy groups of the mathematical dataset before and after GRPO post-training.

Although theoretically equipped with lower-bound guarantees of policy improvement (Mroueh, 2025), existing RL post-training frameworks still face significant efficiency challenges in practice. As shown in Figure 1, models after GRPO post-training struggle to handle difficult samples, especially those with zero accuracy in the initial rollout group. The reasons are twofold: (1) Homogeneous rewards: Recent investigations (Hong et al., 2025; Simoni et al., 2025) reveal that samples at both extremes of difficulty offer minimal benefit for post-training policy improvement. This arises because advantage estimation in most GRPO-based methods relies heavily on relative reward diversity within each group. Consequently, when intra-group rewards are identical, the lower-bound guarantee for policy improvement collapses (Zhang et al., 2025; Mroueh et al., 2025), resulting in negligible effective gradient contributions (Liu et al., 2025; Yu et al., 2025). (2) Waste of experience: Given the sensitivity of policy improvement to intra-group reward variance, uneven difficulty distributions yield significantly fewer high-quality samples than the configured batch size implies.

∗ Corresponding Author
Crucially, since these methods are primarily on-policy and lack experience replay, each rollout group is consumed only once, leading to a substantial waste of valuable training data (Sun et al., 2025; Li et al., 2025). A straightforward solution is to adopt off-policy rather than on-policy training paradigms, which has been established in traditional RL as a viable way to increase sample efficiency and diversity in the training batch (Queeney et al., 2021; Hilton et al., 2022; Meng et al., 2023). However, naively applying sample-reusing schemes to RL frameworks may exacerbate instability during LLM post-training, leading to entropy collapse and, ultimately, performance degradation (Yu et al., 2025; He et al., 2025; Chen et al., 2025c). Thus, to systematically explore the utility of stale off-policy experience in RLVR post-training, we incorporate multiple off-policy strategies into an on-policy RLVR framework to dissect effective pathways for historical data utilization. The main contributions of this paper are as follows:

(1) We propose a difficulty-aware experience replay mechanism as a practical solution for efficient off-policy data utilization. Unlike simple mixing of buffer data and online data, we actively re-evaluate historical hard prompts to drive exploration, while directly reusing high-quality trajectories under a dynamic quality threshold.

(2) Theoretically, we prove that under certain assumptions, the proposed adaptive construction mechanism mitigates the homogeneous-reward issue via adaptive batch construction and KL-constrained updates.

(3) By integrating it into multiple reasoning tasks with different LLM backbones, we validate that the proposed Batch Adaptation Policy Optimization (BAPO) method achieves better convergence and yields greater improvements on solving difficult samples than existing on-policy and off-policy RLVR frameworks.

2 RELATED WORK

2.1 ON-POLICY RL POST-TRAINING FRAMEWORK

We first review the concept of on-policy RLVR, where the core objective is to optimize an LLM policy to maximize the outcome response reward. Let x ∈ X represent the input prompts, and y ∈ Y denote responses generated by the LLM policy π_θ. The terminal reward r(x, y) ∈ {0, 1} is determined by a deterministic verification function (Lambert et al., 2024; Guo et al., 2025). Following the setting of GRPO (Shao et al., 2024), the objective is formulated as:

(1/G) Σ_{i=1}^{G} (1/|y_i|) Σ_{t=1}^{|y_i|} min( ρ_{i,t}(θ) Â_{i,t}, clip(ρ_{i,t}(θ), 1−ε, 1+ε) Â_{i,t} ) − β · D_KL(π_θ || π_ref)    (1)

where G = {y_1, y_2, ..., y_G} represents a G-size group of responses sampled from π_{θ_t}(·|x) for each input x; ρ_{i,t}(θ) is the probability ratio π_θ(y_i^t | x, y_i^{0:t−1}) / π_{θ_old}(y_i^t | x, y_i^{0:t−1}); and Â_{i,t} is the group-relative advantage. In BAPO's off-policy objective (Equation 3), the rollout policy is a delayed copy α = π_{θ_{t−v}} of the current policy, with v > 0 representing the delay timesteps. (x, y) ∼ B denotes historical samples from the replay buffer B. The importance-sampling ratios are defined as ρ_α = π_θ(y|x) / α(y|x) for online rollout samples and ρ_{α_B} = π_θ(y|x) / α_B(y|x) for buffer samples, where α_B denotes the historical rollout policies that generated the buffer. Each entry in the buffer B is formally defined as:

B = { ( u_i, {x_{i,j}}_{j=1}^{G}, {y_{i,j}}_{j=1}^{G}, {r_{i,j}}_{j=1}^{G}, {α_B(y_{i,j} | x_i)}_{j=1}^{G} ) }_{i=1}^{|B|}    (4)

where u_i is the unique identifier of each prompt; {x_{i,j}}, {y_{i,j}}, and {r_{i,j}} represent the sets of prompts, generated responses, and corresponding rewards, respectively; {α_B(y_{i,j} | x_i)}_{j=1}^{G} is the rollout policy's probability, stored for calculating ρ_{α_B}(θ) when reusing; and |B| is the buffer size.

3.2 ADAPTIVE TRAINING BATCH CONSTRUCTION

The core of off-policy RLVR lies in how to integrate historical experiences with online samples, so as to maintain non-homogeneous rewards and an appropriate difficulty distribution at each training step.
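To make the buffer bookkeeping of Equation 4 concrete, here is a minimal sketch of a replay-buffer record and the importance ratio recomputed from stored log-probabilities. Field and function names are illustrative assumptions, not the paper's released code:

```python
import math
from dataclasses import dataclass

@dataclass
class BufferEntry:
    """One replay-buffer record, mirroring Eq. 4: a prompt and its G rollouts."""
    uid: str              # unique prompt identifier u_i
    prompt: str           # x_i
    responses: list       # {y_{i,j}} for j = 1..G
    rewards: list         # {r_{i,j}}, verifiable rewards in {0, 1}
    rollout_logps: list   # log alpha_B(y_{i,j} | x_i), recorded at rollout time

    def group_mean_reward(self):
        """mu_{alpha_B, r}(x): the group mean reward used later for filtering."""
        return sum(self.rewards) / len(self.rewards)

def importance_ratio(logp_theta, logp_rollout):
    """rho = pi_theta(y|x) / alpha(y|x), computed from stored sequence log-probs."""
    return math.exp(logp_theta - logp_rollout)
```

Storing log α_B(y|x) at rollout time means the buffer ratio ρ_{α_B} only needs one forward pass under the current policy when a buffered sample is reused, rather than re-running the stale rollout policy.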
For BAPO, we introduce a filter function I(x) in Definition 3.1 that decomposes the data-selection criteria for each training step's batch into three parts.

Definition 3.1 (Training Batch Filtering Function). Define μ_{π,r}(x) = E_{y∼π(·|x)}[r(x, y)] as the expected reward under policy π for input x. The training batch indicator function I : X → {0, 1} is formulated as:

I(x) = 1{ 1/G ≤ μ_{α,r}(x) ≤ (G−1)/G }   (Filtered Fresh)
     + 1{ μ_{α_B,r}(x) ≤ c_1 ∧ μ_{π_{θ_t},r}(x) > c_1 }   (Improved Historical Difficult)
     + 1{ c_2 ≤ μ_{α_B,r}(x) ≤ c_3 }   (Historical High-quality)    (5)

where α denotes the delayed rollout policy and α_B denotes the policy associated with buffer samples. The function selects samples based on three criteria, yielding subsets X_1, X_2, and X_3, respectively. Next, we explain the selection principles behind I(x) and the three categories of samples, X_1, X_2, and X_3, obtained from these three conditions.

(1) Filtered Fresh Samples (X_1). To prevent gradient vanishing and maintain training stability, we filter the online rollout batch to exclude samples with zero variance. Specifically, we retain fresh samples whose group mean reward satisfies μ_{α,r}(x) ∈ [1/G, (G−1)/G]. While other filtering strategies (e.g., Gaussian or uniform sampling) could be applied, we find simple truncation sufficient for effective learning. A detailed discussion and comparison of different online filtering functions are provided in Appendix A.3.

(2) Improved Historical Difficult Samples (X_2). Samples exhibiting extremely low group mean rewards, where μ_{α,r}(x) ∈ [0, c_1], present significant challenges to the current policy and typically yield negligible policy improvement. However, as the model evolves, these historically difficult queries may eventually become tractable for a successor policy.
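The three admission criteria of Equation 5 can be sketched as a single predicate. This is an illustrative reading of Definition 3.1, not the paper's implementation; argument names are assumptions:

```python
def batch_indicator(mu_fresh, mu_buffer, mu_current, G, c1, c2, c3):
    """Sketch of the training-batch filter I(x) from Eq. 5.

    Each argument is a group mean reward; pass None when the corresponding
    estimate is unavailable for this prompt.

    mu_fresh   -- mean reward under the delayed rollout policy alpha
    mu_buffer  -- mean reward recorded when x entered the buffer (alpha_B)
    mu_current -- mean reward re-evaluated under the current policy pi_{theta_t}
    """
    # X1: fresh samples with non-degenerate group rewards (no all-0 / all-1 groups)
    fresh = mu_fresh is not None and 1 / G <= mu_fresh <= (G - 1) / G
    # X2: historically difficult prompts the current policy has started to solve
    improved = (mu_buffer is not None and mu_current is not None
                and mu_buffer <= c1 and mu_current > c1)
    # X3: buffered samples inside the "high-quality" accuracy band [c2, c3]
    high_quality = mu_buffer is not None and c2 <= mu_buffer <= c3
    return fresh or improved or high_quality
```

For example, with G = 8 and (c1, c2, c3) = (0.1, 0.4, 0.8), an all-wrong fresh group (mu_fresh = 0) is rejected, while a buffered prompt whose re-evaluated accuracy climbs from 0 to 0.25 is admitted through the X_2 branch.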
To harness this, we periodically re-generate responses using the current policy π_{θ_t} every m training steps and construct the subset X_2 based on the observed improvement. Let B_bad ⊆ B denote the buffer for difficult samples. To manage the computational overhead associated with the re-evaluation process, we limit the buffer capacity |B_bad| to equal the training batch size. A First-In-First-Out (FIFO) mechanism automatically discards outdated samples when the buffer reaches capacity. X_2 is formulated as:

X_2 = { (x, y′) | (x, y) ∈ B_bad, y′ ∼ π_{θ_t}(·|x), c_1 < μ_{π_{θ_t},r}(x) < 1 }    (6)

where y′ represents the new response generated by π_{θ_t}, and we specifically select samples that show improvement, i.e., c_1 < μ_{π_{θ_t},r}(x) < 1.

(3) Reused Historical High-quality Samples (X_3). To prevent underfilled batches caused by the scarcity of X_1 and X_2, we maintain a FIFO auxiliary buffer B_high ⊆ B. To mitigate training instability from stale data, B_high is restricted to high-quality trajectories from the three most recent steps. The subset X_3 is randomly sampled to fill the remaining capacity:

X_3 = S( B_high, min(|B_high|, B − |X_1| − |X_2|) )    (7)

where B is the configured training batch size and S(·, k) denotes random sampling of k elements. Furthermore, to progressively master increasingly difficult tasks, we employ a linear mapping that shifts the historical "high-quality" band from easier to harder instances, scaling with the global average performance r_tot:

c_i = r_tot · (c_i^high − c_i^low) + c_i^low,   i ∈ {2, 3}    (8)

3.3 THEORETICAL ANALYSIS

In this section, we provide a theoretical analysis in Theorem 3.2 to establish BAPO's training stability, building on the theorem of Mroueh et al. (2025). We show that, under certain assumptions, our constructed adaptive batches consistently maintain guaranteed bounded policy improvement.
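The buffer maintenance and batch-filling steps of Equations 7 and 8 can be sketched as follows. This is a minimal sketch under stated assumptions: the band endpoints and buffer capacities are illustrative placeholders, since the paper's actual values are hyperparameters given in its appendix:

```python
import random
from collections import deque

def adaptive_band(r_tot, c2_low=0.25, c2_high=0.5, c3_low=0.5, c3_high=1.0):
    """Eq. 8: linearly shift the 'high-quality' band (c2, c3) with the
    global average performance r_tot in [0, 1]. Endpoints here are
    illustrative assumptions, not the paper's settings."""
    c2 = r_tot * (c2_high - c2_low) + c2_low
    c3 = r_tot * (c3_high - c3_low) + c3_low
    return c2, c3

def assemble_batch(x1, x2, b_high, batch_size):
    """Eq. 7: top up the batch with randomly sampled high-quality replays."""
    k = min(len(b_high), max(0, batch_size - len(x1) - len(x2)))
    x3 = random.sample(list(b_high), k)
    return x1 + x2 + x3

# Bounded FIFO buffers: deque(maxlen=...) silently drops the oldest entry,
# matching the first-in-first-out eviction policy described above.
b_bad = deque(maxlen=64)    # capacity tied to the training batch size
b_high = deque(maxlen=192)  # restricted to trajectories from recent steps
```

Note that when both X_1 and X_2 are scarce and B_high cannot cover the shortfall, the assembled batch simply stays below the configured size, which is consistent with the fluctuating batch sizes reported later.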
Theorem 3.2 (Policy Improvement Lower Bound with Adaptive Training Batch). Assume rewards are bounded: 0 ≤ r ≤ 1. Let π_{θ_t} be the current policy, α_1 = π_{θ_{t−v}} the delayed rollout policy, α_2 = π_{θ_t} the current policy used for re-evaluation, α_3 = α_B the buffer policy distribution, and I(x) the filtering function partitioning samples into X_1, X_2, and X_3. Suppose c_1, c_2, c_3 ∈ (0, 1) with c_2 < c_3, and that the following total-variation (TV) distance constraints hold:

TV( π_{θ_t}(·|x), π_{θ_{t−v}}(·|x) ) ≤ δ_1   ∀x ∈ X_1    (9)
TV( π_{θ_t}(·|x), α_B(·|x) ) ≤ δ_3   ∀x ∈ X_3    (10)

where δ_1, δ_3 > 0 are sufficiently small that the variance lower bounds remain positive. Then, for the policy update objective in Equation 3, the expected policy improvement over filtered samples satisfies:

E_{x∼ρ_X}[ I(x) ( J(π_θ(·|x)) − J(π_{θ_t}(·|x)) ) ] ≥ Σ_{i=1}^{3} L_i(π_θ, α_i)

where:

J(π_θ(·|x)) = E_{y∼π_θ(·|x)} r(x, y)
L_i(π_θ, α_i) = E_{x∈X_i}[ L_{α_i}(π_θ(·|x)) − 2 K_i · TV(π_θ(·|x), α_i(·|x)) − 2 TV(π_{θ_t}(·|x), α_i(·|x)) ]

with L_{α_i}(π_θ(·|x)) = (1 / σ_{α_i,r,ε}(x)) · ( J(π_θ(·|x)) − J(α_i(·|x)) ). The constants are:

K_1 = ( 1 − √( (G−1)/G² + ε ) ) / √( (G−1)/G² + ε )    (11)
K_2 = ( 1 − √( c_1(1−c_1) + ε ) ) / √( c_1(1−c_1) + ε )    (12)
K_3 = ( 1 − √( min(c_2(1−c_2), c_3(1−c_3)) + ε ) ) / √( min(c_2(1−c_2), c_3(1−c_3)) + ε )    (13)

More importantly, we highlight several properties of this theorem:

Bounded Stability. All constants K_1, K_2, and K_3 are finite positive values, which guarantees that the training process remains numerically stable and theoretically bounded.

Off-policy Tolerance. Trust-region methods inherently constrain the magnitude of single-step policy updates. Consequently, the divergence between the current policy π_{θ_t} and the delayed rollout policy α remains bounded over short intervals.
Furthermore, the strict FIFO mechanism with limited buffer capacity ensures that only samples from recent policies are retained, thereby maintaining policy consistency within the training batch.

4 EXPERIMENTAL SETUP

To comprehensively evaluate the effectiveness of our off-policy RLVR framework, we conduct extensive experiments across different tasks and backbones, following the experimental setup described in Qu et al. (2025). First, we select three representative reasoning tasks, detailed below:

Mathematics. Following prior work (Luo et al., 2025), we use DeepSeek R1 Distilled 1.5B (Guo et al., 2025) and Qwen3 8B (Yang et al., 2025) as base models, and conduct post-training on the DeepScaleR-Preview-Dataset (Aggarwal & Welleck, 2025), which contains 40 thousand question-answer pairs sourced from several mathematics competitions. Evaluation is performed on multiple mathematics benchmarks, including AIME24, AMC23, MATH500 (Hendrycks et al., 2021), Minerva Math (Minerva) (Lewkowycz et al., 2022), and OlympiadBench (Olympiad) (He et al., 2024).

Planning. We choose Qwen2.5 Math 1.5B and 7B (Yang et al., 2024) as backbones and adopt the Countdown Number Game as the task. For training, we use a 10,000-problem subset of the Countdown-34 dataset, where each problem provides 3-4 source numbers. Evaluation is conducted on two variants: the Countdown-3to4 (CD-34) test set, a 200-problem held-out split, and the more challenging Countdown-4 (CD-4) test set with 200 problems that consistently provide four source numbers (Chen et al., 2025b).

Visual Geometry. We train Qwen2.5 VL 3B and 7B (Bai et al., 2025) on the 2,101-problem training split of the Geometry3K dataset (Lu et al., 2021), where each problem consists of a geometric diagram paired with a natural-language question requiring spatial and logical reasoning.
Evaluation is performed on the official 300-problem validation split (Geo-3K val) and the 601-problem test split (Geo-3K test) of Geometry3K. Besides, we select several on-policy and off-policy RLVR frameworks as baselines:

On-policy. We select GRPO (Shao et al., 2024), DAPO (Yu et al., 2025), and MoPPS (Qu et al., 2025) as representative on-policy RLVR methods. GRPO is the first to integrate group-relative advantage estimation into the RLVR framework, while DAPO further improves training stability and efficiency. MoPPS incorporates difficulty-aware prediction into prompt selection.

Off-policy. We compare our approach with three representative off-policy methods: GRPO (v = 5) (Mroueh et al., 2025), RePO (Li et al., 2025), and Remix-GRPO (Liang et al., 2025). Specifically, GRPO (v = 5) delays the rollout policy with a frequency of 5, whereas RePO and Remix-GRPO adopt diverse replay strategies to retrieve off-policy samples from a replay buffer.

Implementation Details. All comparative experiments were run on 8 A100 GPUs with 80GB memory, based on the Verl framework (Sheng et al., 2025). Identical parameters were used to ensure fair comparison, with specific details in Appendix A.7.

5 RESULTS ANALYSIS

5.1 MAIN RESULTS

We evaluate BAPO across three reasoning tasks to demonstrate its broad applicability. Experimental results show that BAPO consistently outperforms existing baselines throughout training (Figure 4) and testing (Figure 12). Notably, in mathematical tasks, the GRPO baseline exhibits severe training instability, as evidenced by significant oscillations in its early-stage training curve. This is attributed to the high variance in problem difficulty within the DeepScaleR dataset. Under the same settings, BAPO achieves smoother convergence and higher reward bounds.
In Table 1, BAPO achieves an average 12.5% accuracy improvement over baselines. Crucially, while DAPO approaches BAPO's performance on some metrics, it requires approximately 2.5× more rollouts (as visualized in Figure 9), imposing a substantial computational burden.

Figure 4: Training curves of reward changes for mathematics, planning, and geometry tasks using DeepSeek Distilled Qwen 1.5B, Qwen2.5 Math 1.5B, and Qwen2.5 VL 3B, respectively.

Table 1: Comprehensive evaluation results. '+' indicates fine-tuning via the corresponding method. Accuracy is averaged over 32 runs. The bold value denotes the top result, and the underlined value denotes the second-best result.

(a) Mathematics Benchmarks

Method                                 AIME24  AMC    MATH500  Minerva  Olympiad  Avg.↑  Rollouts↓  Type
DeepSeek R1 Distill Qwen 1.5B          28.80   62.90  82.80    26.50    44.42     48.90  -          -
+GRPO (Guo et al., 2025)               30.73   67.47  85.40    28.95    45.33     51.58  677k       on
+DAPO (Yu et al., 2025)                35.73   70.08  86.05    30.70    48.48     54.20  1921k      on
+MoPPS* (Qu et al., 2025)              33.33   65.29  84.94    28.88    45.93     51.67  737k       on
+GRPO (v = 5) (Mroueh et al., 2025)    30.49   65.09  86.72    28.16    46.18     51.57  677k       off
+RePO (Li et al., 2025)                30.42   64.76  83.75    28.33    45.44     50.54  677k       off
+Remix-GRPO* (Liang et al., 2025)      33.33   65.06  84.60    26.10    43.55     50.53  -          off
+BAPO (Ours)                           38.54   72.74  89.18    29.55    50.06     56.01  733k       off

(b) Planning and Visual Geometry Benchmarks

Method                       CD-34  CD-4   Avg.
Qwen2.5 Math 1.5B            1.12   0.37   0.75
+GRPO (Guo et al., 2025)     62.94  35.88  49.41
+DAPO (Yu et al., 2025)      70.56  45.87  58.22
+BAPO w/o X_2 (Ours)         60.31  35.31  47.81
+BAPO w/o X_3 (Ours)         64.43  38.75  51.59
+BAPO (Ours)                 73.00  47.50  60.25
Qwen2.5 Math 7B              2.68   0.94   1.81
+GRPO (Guo et al., 2025)     70.75  50.25  60.50
+DAPO (Yu et al., 2025)      78.75  57.43  68.09
+BAPO (Ours)                 79.13  57.13  68.13

Method                       Geo-3K(val)  Geo-3K(test)  Avg.
Qwen2.5 VL 3B                14.77        19.18         16.98
+GRPO (Guo et al., 2025)     36.44        43.12         39.78
+DAPO (Yu et al., 2025)      40.11        45.18         42.65
+BAPO w/o X_2 (Ours)         30.57        36.92         33.75
+BAPO w/o X_3 (Ours)         32.22        39.79         36.01
+BAPO (Ours)                 40.11        46.33         43.22
Qwen2.5 VL 7B                30.40        36.10         33.25
+GRPO (Guo et al., 2025)     40.79        47.15         43.97
+DAPO (Yu et al., 2025)      40.87        47.02         43.95
+BAPO (Ours)                 41.89        48.77         45.33

*This method's performance is taken from the corresponding paper.

5.2 MECHANISM ANALYSIS

To investigate whether BAPO's success stems from sensitive hyperparameter tuning or from its core batch-reconstruction mechanism, we conduct both Minimalist Verification and Hyperparameter Robustness experiments.

Off-policy Components > Off-policy Hyperparameters. The performance gains of BAPO primarily stem from the structural logic of its off-policy components rather than from specific hyperparameter settings. The framework remains effective even under rigid, parameter-free conditions.

Figure 5: Test curves of group accuracy changes on AIME for different RLVR methods based on Qwen3 8B. Left: standard BAPO vs. GRPO. Middle: BAPO (mini test) vs. GRPO. Right: standard BAPO vs. DAPO.

Minimalist Verification. To validate the theoretical implications of Theorem 3.2 without relying on hyperparameter engineering, specifically avoiding the tuning of thresholds c_1, c_2, c_3 and update frequencies, we devise a "Mini-test" experiment. We train Qwen3 8B on the mathematics task under a 4K length constraint, using a stripped-down, parameter-free BAPO logic for constructing the training batch:

X_1: We apply strictly standard zero-advantage filtering, removing only the prompts where all G responses are entirely correct or entirely wrong.
X_2: We replay historical all-wrong samples (μ_{α,r}(x) = 0). These correspond exactly to the difficult cases discarded by X_1, creating a closed-loop system that recovers wasted data without requiring a difficulty threshold c_1.

X_3: Instead of a dynamic accuracy range, we reuse historical samples with exactly 50% accuracy. As formally proven in Proposition A.3, samples with accuracy μ_{α,r}(x) = 1/2 maximize the reward variance, thereby providing the theoretical maximum potential for single-step policy improvement J(π_θ) − J(π_{θ_t}).

The results in Figure 5 demonstrate that even in the hyperparameter-free "Mini-test", BAPO maintains a clear advantage over GRPO. This confirms that the structural introduction of X_2 and X_3 drives the performance, not the specific tuning of c values.

Component Efficacy. To evaluate the contributions of re-evaluated difficult samples X_2 and reused high-quality samples X_3, we conduct ablation studies shown in Table 1 and Figure 6 (Column 2). Both components are essential: removing X_2 causes a ∼21% performance drop, underscoring the importance of explicitly targeting difficult samples.

Hyperparameter Robustness. We further evaluate the sensitivity of BAPO to its key hyperparameters: rollout delay v, re-rollout frequency m, and the difficulty thresholds. Frequency (v, m): As shown in Figure 6 (Column 1), performance remains stable within reasonable ranges (e.g., v = 5, m = 5). Extreme delays degrade performance only when policy divergence becomes excessive, aligning with our theoretical analysis regarding the trust region. Difficulty Thresholds (c_2, c_3): While our adaptive boundary mechanism yields the best convergence, Figure 6 (Column 3) shows that fixed ranges still significantly outperform baselines. This indicates that the presence of diverse historical data is more critical than the precise threshold values.
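The parameter-free Mini-test rules described above can be written down directly. This is an illustrative sketch, not the released code; the X_3 rule exploits the fact that the Bernoulli reward variance μ(1 − μ) peaks at μ = 1/2:

```python
def mini_test_keep(mu_fresh=None, mu_buffer=None):
    """Parameter-free 'Mini-test' batch rules (illustrative sketch).

    X1: keep fresh prompts unless the group is all-correct or all-wrong.
    X2: replay buffered prompts whose recorded group accuracy was exactly 0.
    X3: reuse buffered prompts with exactly 50% accuracy, the value that
        maximizes the group reward variance mu * (1 - mu).
    """
    keep_fresh = mu_fresh is not None and 0.0 < mu_fresh < 1.0
    replay_hard = mu_buffer == 0.0
    reuse_half = mu_buffer == 0.5
    return keep_fresh or replay_hard or reuse_half

# Sanity check on the X3 rule: among group accuracies m/8 for a group of
# G = 8, the variance mu * (1 - mu) is largest at mu = 1/2.
best_m = max(range(9), key=lambda m: (m / 8) * (1 - m / 8))
```

The X_2 rule needs no threshold because it replays exactly the prompts that zero-advantage filtering discards, closing the loop between the fresh and historical streams.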
Figure 6: Ablation studies for BAPO. The first column presents ablations on frequency-related hyperparameters (m, v). The second column shows ablations on buffer subsets (X_2, X_3). The third column compares fixed vs. adaptive difficulty thresholds.

5.3 DETAILED ANALYSIS

We analyze BAPO's internal mechanisms below. For extended analysis of training dynamics, computation, and visualization, please refer to Appendices A.4, A.5, and A.6.

Tracking Difficult Samples. We visualize the training dynamics in Figure 7. BAPO exhibits a superior capability to "unlock" difficult problems: after 3 epochs, BAPO successfully improves 31% of the samples that were initially unsolvable (0/8 accuracy), compared to only 19% for GRPO.

Figure 7: Tracking changes in the number of samples in different accuracy bins on the DeepScaleR training subset. Special attention is paid to the reduction of bad samples (red bars).

Sample Distribution & Efficiency. To uncover the source of BAPO's efficiency, we analyze the dynamic batch construction in Figure 8 alongside the rollout costs in Figure 9. As observed in Figure 8, the assembled training batch size frequently fluctuates below the maximum configured capacity. This reduction in backward-propagation load effectively offsets the computational overhead caused by off-policy re-evaluation and log-probability re-computation. Consequently, as detailed in Table 2, BAPO maintains a training speed comparable to GRPO while requiring significantly fewer rollouts than DAPO, achieving a superior trade-off between convergence performance and computational cost.

Figure 8: Dynamic sample distribution. The composition of BAPO's X_1, X_2, X_3 and the total samples, compared to the fixed GRPO batch size (red line).

Figure 9: Cumulative rollout batches comparison between BAPO and DAPO. The maximum rollout time for DAPO is set to 4.
Efficient Batch Adaptation. BAPO maintains training efficiency comparable to GRPO. While the periodic re-evaluation of X_2 introduces additional generation overhead, this cost is effectively offset by the reduced number of training samples, particularly during the initial stages of training.

6 CONCLUSION

In this paper, we propose BAPO, an off-policy RLVR framework for LLM post-training. It aims to better utilize historical training data and thereby improve training efficiency. Specifically, we appropriately delay the rollout policy to stabilize the policy discrepancies of buffer samples. More importantly, we construct training batches by re-evaluating difficult samples and reusing historical high-quality ones, thereby enhancing the efficiency of post-training. We validate the strong adaptability of the BAPO framework through experiments on three distinct reasoning tasks using different LLM backbones, and the results demonstrate that BAPO significantly outperforms baselines in both convergence performance and training efficiency. Nevertheless, exploring how to adapt BAPO to large models with MoE architectures, as well as to agentic RL frameworks, remains a significant challenge.

ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China under Grant 72571007 and Grant 72595830/72595831, and by the Beijing Nova Program (No. 20250484850).

ETHICS STATEMENT

All authors of this study strictly adhere to the ICLR Code of Ethics. Our research does not involve any potential conflicts of interest or sponsorship issues. We have carefully considered and addressed concerns related to discrimination, bias, and fairness in our methodology. The study raises no privacy or security concerns, maintains full legal compliance, and upholds the highest standards of research integrity.
All experimental procedures and data handling practices follow established ethi- cal guidelines for machine learning research. R E P RO D U C I B I L I T Y S T A T E M E N T T o ensure full reproducibility of our results, we provide comprehensiv e implementation details of the proposed B APO training algorithm in the supplementary materials. All experimental settings, hyperparameters, and dataset specifications are clearly documented. For our theoretical contribu- tions, complete proofs and clear explanations of all assumptions are included in the appendix. Code and data will be made av ailable upon acceptance to facilitate replication of our findings. T H E U S E O F L A R G E L A N G U AG E M O D E L S In this research, we employed LLMs solely as language editing tools to impro ve the clarity and read- ability of our manuscript. LLMs were used for grammar checking, style refinement, and language polishing purposes only . All core research ideas, experimental design, analysis, and conclusions are entirely the original work of the authors. The use of LLMs did not contribute to the conceptual or technical content of this study . R E F E R E N C E S Pranjal Aggarwal and Sean W elleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv pr eprint arXiv:2503.04697 , 2025. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin W ang, W enbin Ge, Sibo Song, Kai Dang, Peng W ang, Shijie W ang, Jun T ang, et al. Qwen2. 5-vl technical report. arXiv preprint , 2025. Y untao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nov a DasSarma, Dawn Drain, Stanisla v Fort, Deep Ganguli, T om Henighan, et al. T raining a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint , 2022. Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng W ang, Mengkang Hu, Y uhang Zhou, T e Gao, and W anxiang Che. T owards reasoning era: A surve y of long chain-of- thought for reasoning large language models. 
arXiv preprint , 2025a. Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian T ang, Ale xandre Pich ´ e, Nicolas Gontier, Y oshua Bengio, and Ehsan Kamalloo. Self-ev olving curriculum for llm reasoning. arXiv pr eprint arXiv:2505.14970 , 2025b. Y ang Chen, Zhuolin Y ang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catan- zaro, and W ei Ping. Acereason-nemotron: Adv ancing math and code reasoning through rein- forcement learning. arXiv pr eprint arXiv:2505.16400 , 2025c. Ganqu Cui, Y uchen Zhang, Jiacheng Chen, Lifan Y uan, Zhi W ang, Y uxin Zuo, Haozhan Li, Y uchen Fan, Huayu Chen, W eize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv pr eprint arXiv:2505.22617 , 2025. 11 Published as a conference paper at ICLR 2026 W ei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo W ei, Jun Mei, Jiashu W ang, et al. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv pr eprint arXiv:2505.24298 , 2025. Daya Guo, Dejian Y ang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi W ang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv pr eprint arXiv:2501.12948 , 2025. Chaoqun He, Renjie Luo, Y uzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Y ujie Huang, Y uxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-le vel bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008 , 2024. Jujie He, Jiacai Liu, Chris Y uhao Liu, Rui Y an, Chaojie W ang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, W ei Shen, et al. Skyw ork open reasoner 1 technical report. arXiv pr eprint arXiv:2505.22312 , 2025. Dan Hendrycks, Collin Burns, Saurav Kadav ath, Akul Arora, Ste ven Basart, Eric T ang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. 
arXiv pr eprint arXiv:2103.03874 , 2021. Jacob Hilton, Karl Cobbe, and John Schulman. Batch size-in variance for policy optimization. Ad- vances in Neural Information Pr ocessing Systems , 35:17086–17098, 2022. W enyi Hong, W enmeng Y u, Xiaotao Gu, Guo W ang, Guobing Gan, Haomiao T ang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1 v-thinking: T owards versatile multimodal reasoning with scalable reinforcement learning. arXiv pr eprint arXiv:2507.01006 , 2025. Nathan Lambert, Jacob Morrison, V alentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brah- man, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane L yu, et al. T ulu 3: Pushing frontiers in open language model post-training. arXiv pr eprint arXiv:2411.15124 , 2024. Aitor Le wkowycz, Anders Andreassen, David Dohan, Ethan Dyer , Henryk Michalewski, V inay Ra- masesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in neural information pr ocessing systems , 35:3843–3857, 2022. Siheng Li, Zhanhui Zhou, W ai Lam, Chao Y ang, and Chaochao Lu. Repo: Replay-enhanced policy optimization. arXiv pr eprint arXiv:2506.09340 , 2025. Jing Liang, Hongyao T ang, Y i Ma, Jin yi Liu, Y an Zheng, Shuyue Hu, Lei Bai, and Jianye Hao. Squeeze the soaked sponge: Efficient of f-policy reinforcement finetuning for large language model. arXiv pr eprint arXiv:2507.06892 , 2025. Ziru Liu, Cheng Gong, Xinyu Fu, Y aofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, and Dandan T u. Ghpo: Adapti ve guidance for stable and efficient llm reinforce- ment learning. arXiv pr eprint arXiv:2507.10628 , 2025. Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Arpo: End-to-end policy opti- mization for gui agents with experience replay . arXiv pr eprint arXiv:2505.16282 , 2025. Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-chun Zhu. 
Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6774–6786, 2021.

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, et al. Deepscaler: Surpassing o1-preview with a 1.5B model by scaling rl. Notion Blog, 2025.

Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, et al. Learning what reinforcement learning can't: Interleaved online fine-tuning for hardest questions. arXiv preprint, 2025.

Wenjia Meng, Qian Zheng, Gang Pan, and Yilong Yin. Off-policy proximal policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 9162–9170, 2023.

Youssef Mroueh. Reinforcement learning with verifiable rewards: Grpo's effective loss, dynamics, and success amplification. arXiv preprint arXiv:2503.06639, 2025.

Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan Greenewald, Jiri Navratil, Jerret Ross, and Jesus Rios. Revisiting group relative policy optimization: Insights into on-policy and off-policy training. arXiv preprint arXiv:2505.22257, 2025.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Yun Qu, Qi Cheems Wang, Yixiu Mao, Vincent Tao Hu, and Xiangyang Ji. Can prompt difficulty be online predicted for accelerating rl finetuning of reasoning models?
arXiv preprint arXiv:2507.04632, 2025.

James Queeney, Yannis Paschalidis, and Christos G Cassandras. Generalized proximal policy optimization with sample reuse. Advances in Neural Information Processing Systems, 34:11909–11919, 2021.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp. 1279–1297, 2025.

Marco Simoni, Aleksandar Fontana, Giulio Rossolini, and Andrea Saracino. Gtpo: Trajectory-based policy optimization in large language models. arXiv preprint arXiv:2508.03772, 2025.

Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, and Huan Zhang. Improving data efficiency for llm reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay. arXiv preprint, 2025.

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025.

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al.
Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

Xiaojiang Zhang, Jinghui Wang, Zifei Cheng, Wenhao Zhuang, Zheng Lin, Minglei Zhang, Shaojie Wang, Yinghan Cui, Chao Wang, Junyi Peng, et al. Srpo: A cross-domain implementation of large-scale reinforcement learning on llm. arXiv preprint, 2025.

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.

A APPENDIX

A.1 GLOSSARY OF TERMS AND NOTATIONS

$c_1, c_2, c_3$: Thresholds for classifying historical samples by difficulty (group mean reward).
$\mathcal{X}_1, \mathcal{X}_2, \mathcal{X}_3$: Subsets of the training batch: fresh, re-evaluated difficult, and historical high-quality samples.
$m$: Re-evaluation frequency for historically difficult samples.
$v$: Delay steps for updating the rollout policy.
$G$: Group size, i.e., the number of responses generated per prompt during rollout.
$\mathcal{B}$: Replay buffer storing historical samples.
$\hat{A}_{i,t}$: Estimated advantage for token $t$ in response $i$.
$\varepsilon$: Clipping parameter in PPO-style objectives.
$\beta$: Coefficient for the KL penalty in the objective function.
$I(x)$: Filter function for constructing BAPO's training batch.
$D_{\mathrm{KL}}$: Kullback–Leibler divergence, used to constrain policy deviation.
$\alpha$: Rollout policy for BAPO, which synchronizes to $\pi_\theta$ every $v$ steps.
$\pi_\theta$: LLM policy parameterized by $\theta$.
$\pi_{\mathrm{ref}}$: Reference policy (e.g., the initial pre-trained model).
$\rho(\theta)$: Importance sampling ratio $\frac{\pi_\theta(y|x)}{\pi_{\mathrm{old}}(y|x)}$.
$r(x, y)$: Reward function, set to binary (0/1) based on correctness.
$\mu_{\alpha,r}(x)$: Expected reward under policy $\alpha$ for input $x$. We approximate this value using the mean of $r(x, y)$ over the $G$ responses $y$ generated by the rollout policy $\alpha$ for each prompt $x$.
$\sigma_{\alpha,r,\varepsilon}(x)$: Standard deviation of rewards under policy $\alpha$ for input $x$, with smoothing $\varepsilon$.
$J(\pi(\cdot|x))$: Expected reward of policy $\pi$ for input $x$: $\mathbb{E}_{y \sim \pi(\cdot|x)}[r(x, y)]$.
$\mathcal{N}(\mu_{\alpha,r}(x) \mid \mu, \sigma^2)$: A sampling method that assigns weights to online rollouts based on a normal distribution centered at $\mu$ with standard deviation $\sigma$, used to filter samples by their group mean reward $\mu_{\alpha,r}(x)$.

A.2 THEORETICAL ANALYSIS

Lemma A.1 (Kantorovich–Rubinstein duality of total variation distance). The Kantorovich–Rubinstein (variational) representation of the total variation distance is as follows:

$$\mathrm{TV}(m_1, m_2) = \frac{1}{2L} \sup_{g \in \mathcal{G}_L} \left\{ \mathbb{E}_{Z \sim m_1}[g(Z)] - \mathbb{E}_{Z \sim m_2}[g(Z)] \right\}, \tag{14}$$

where $\mathcal{G}_L = \{ g : \mathcal{Z} \to \mathbb{R},\ \|g\|_\infty \le L \}$.

Theorem A.2 (Policy Improvement Lower Bound with Adaptive Training Batch). Assume rewards are bounded: $0 \le r \le 1$. Let $\pi_{\theta_t}$ be the current policy, $\alpha_1 = \pi_{\theta_{t-v}}$ the delayed rollout policy, $\alpha_2 = \pi_{\theta_t}$ the current policy used for re-evaluation, $\alpha_3 = \alpha_B$ the buffer policy distribution, and $I(x)$ the filtering function partitioning samples into $\mathcal{X}_1$, $\mathcal{X}_2$, and $\mathcal{X}_3$. Suppose $c_1, c_2, c_3 \in (0, 1)$ with $c_2 < c_3$, and that the following TV distance constraints hold:

$$\mathrm{TV}(\pi_{\theta_t}(\cdot|x), \pi_{\theta_{t-v}}(\cdot|x)) \le \delta_1 \quad \forall x \in \mathcal{X}_1 \tag{15}$$

$$\mathrm{TV}(\pi_{\theta_t}(\cdot|x), \alpha_B(\cdot|x)) \le \delta_3 \quad \forall x \in \mathcal{X}_3 \tag{16}$$

where $\delta_1, \delta_3 > 0$ are sufficiently small such that the variance lower bounds remain positive.
Then, for the policy update objective in Equation 3, the expected policy improvement over filtered samples satisfies:

$$\mathbb{E}_{x \sim \rho_{\mathcal{X}}}\left[ I(x)\left( J(\pi_\theta(\cdot|x)) - J(\pi_{\theta_t}(\cdot|x)) \right) \right] \ge \sum_{i=1}^{3} \mathcal{L}_i(\pi_\theta, \alpha_i)$$

where:

$$\mathcal{L}_i(\pi_\theta, \alpha_i) = \mathbb{E}_{x \in \mathcal{X}_i}\left[ L^{\alpha_i}(\pi_\theta(\cdot|x)) - 2K_i \cdot \mathrm{TV}(\pi_\theta(\cdot|x), \alpha_i(\cdot|x)) - 2\,\mathrm{TV}(\pi_{\theta_t}(\cdot|x), \alpha_i(\cdot|x)) \right]$$

with $L^{\alpha_i}(\pi_\theta(\cdot|x)) = \frac{1}{\sigma_{\alpha_i,r,\varepsilon}(x)}\left( J(\pi_\theta(\cdot|x)) - J(\alpha_i(\cdot|x)) \right)$. The constants are:

$$K_1 = \frac{1 - \sqrt{\frac{G-1}{G^2} + \varepsilon}}{\sqrt{\frac{G-1}{G^2} + \varepsilon}} \tag{17}$$

$$K_2 = \frac{1 - \sqrt{c_1(1-c_1) + \varepsilon}}{\sqrt{c_1(1-c_1) + \varepsilon}} \tag{18}$$

$$K_3 = \frac{1 - \sqrt{\min(c_2(1-c_2),\, c_3(1-c_3)) + \varepsilon}}{\sqrt{\min(c_2(1-c_2),\, c_3(1-c_3)) + \varepsilon}} \tag{19}$$

Proof. We prove the bound by analyzing each filtered sample set separately, applying off-policy policy improvement bounds tailored to the reference distribution used in each region.

Step 1: Core inequality for off-policy samples. For any $x$ such that $I(x) = 1$, we establish the fundamental inequality:

$$J(\pi_\theta(\cdot|x)) - J(\pi_{\theta_t}(\cdot|x)) \ge L^{\alpha_i}(\pi_\theta(\cdot|x)) - 2K_i \cdot \mathrm{TV}(\pi_\theta(\cdot|x), \alpha_i(\cdot|x)) \tag{20}$$
$$\qquad - 2\,\mathrm{TV}(\pi_{\theta_t}(\cdot|x), \alpha_i(\cdot|x)) \tag{21}$$

where $K_i = \frac{1 - \sigma_{\alpha_i,r,\varepsilon}(x)}{\sigma_{\alpha_i,r,\varepsilon}(x)}$ is a constant that depends on the variance of rewards in each filtered subset.

First, we expand the advantage objective.
By definition:

$$L^{\alpha_i}(\pi_\theta(\cdot|x)) = \mathbb{E}_{y \sim \alpha_i(\cdot|x)}\left[ \frac{\pi_\theta(y|x)}{\alpha_i(y|x)} A^{\alpha_i}(x, y) \right] \tag{22}$$

$$= \mathbb{E}_{y \sim \alpha_i(\cdot|x)}\left[ \frac{\pi_\theta(y|x)}{\alpha_i(y|x)} \cdot \frac{r(x, y) - \mu_{\alpha_i,r}(x)}{\sigma_{\alpha_i,r,\varepsilon}(x)} \right] \tag{23}$$

$$= \frac{1}{\sigma_{\alpha_i,r,\varepsilon}(x)}\left( J(\pi_\theta(\cdot|x)) - J(\alpha_i(\cdot|x)) \right) \tag{24}$$

Next, we establish the key algebraic identity relating $L^{\alpha_i}(\pi_\theta(\cdot|x))$ to $J(\pi_\theta(\cdot|x)) - J(\pi_{\theta_t}(\cdot|x))$:

$$L^{\alpha_i}(\pi_\theta(\cdot|x)) - \left( J(\pi_\theta(\cdot|x)) - J(\pi_{\theta_t}(\cdot|x)) \right) \tag{25}$$

$$= \frac{1 - \sigma_{\alpha_i,r,\varepsilon}(x)}{\sigma_{\alpha_i,r,\varepsilon}(x)}\left( J(\pi_\theta(\cdot|x)) - J(\alpha_i(\cdot|x)) \right) + \left( J(\pi_{\theta_t}(\cdot|x)) - J(\alpha_i(\cdot|x)) \right) \tag{26}$$

Application of Kantorovich–Rubinstein duality: For bounded rewards with $\|r\|_\infty = 1$, the Kantorovich–Rubinstein duality (Lemma A.1) provides:

$$|J(\pi_\theta(\cdot|x)) - J(\alpha_i(\cdot|x))| \le 2 \cdot \mathrm{TV}(\pi_\theta(\cdot|x), \alpha_i(\cdot|x)) \tag{27}$$

$$|J(\pi_{\theta_t}(\cdot|x)) - J(\alpha_i(\cdot|x))| \le 2 \cdot \mathrm{TV}(\pi_{\theta_t}(\cdot|x), \alpha_i(\cdot|x)) \tag{28}$$

Since $0 \le r \le 1$, we have $\sigma_{\alpha_i,r,\varepsilon}(x) < 1$, ensuring $K_i = \frac{1 - \sigma_{\alpha_i,r,\varepsilon}(x)}{\sigma_{\alpha_i,r,\varepsilon}(x)} \ge 0$. Combining these bounds yields the desired inequality.

Step 2: Analysis for $\mathcal{X}_1$ (Filtered fresh samples). For $x \in \mathcal{X}_1$, samples are generated by the delayed rollout policy $\alpha_1 = \pi_{\theta_{t-v}}$ and selected via Gaussian sampling with group-level accuracy $\mu_{\alpha_1,r}(x) \in \{\frac{1}{G}, \frac{2}{G}, \ldots, \frac{G-1}{G}\}$, excluding the extremes $\{0, 1\}$.

Variance analysis on the discrete set: For the variance function $f(p) = p(1-p)$ over the discrete set $\{\frac{1}{G}, \frac{2}{G}, \ldots, \frac{G-1}{G}\}$, the minimum value occurs at the boundary points $p = \frac{1}{G}$ or $p = \frac{G-1}{G}$, both yielding $f(p) = \frac{G-1}{G^2}$. Therefore:

$$\sigma^2_{\alpha_1,r}(x) = \mu_{\alpha_1,r}(x)\left(1 - \mu_{\alpha_1,r}(x)\right) \ge \frac{G-1}{G^2} \tag{29}$$

Thus $\sigma_{\alpha_1,r,\varepsilon}(x) \ge \sqrt{\frac{G-1}{G^2} + \varepsilon}$, yielding:

$$K_1 = \frac{1 - \sqrt{\frac{G-1}{G^2} + \varepsilon}}{\sqrt{\frac{G-1}{G^2} + \varepsilon}}$$

Step 3: Analysis for $\mathcal{X}_2$ (Re-evaluated difficult samples).
For $x \in \mathcal{X}_2$, samples are generated by the current policy $\alpha_2 = \pi_{\theta_t}$ through re-evaluation of historically difficult samples. The selection criterion ensures that historically difficult samples ($\mu_{\alpha_B,r}(x) \le c_1$) now achieve improved performance ($c_1 < \mu_{\pi_{\theta_t},r}(x) < 1$) under the current policy. Since these samples are directly generated by $\pi_{\theta_t}$, we have $\alpha_2 = \pi_{\theta_t}$, and the constraint $c_1 < \mu_{\pi_{\theta_t},r}(x) < 1$ provides a natural lower bound, yielding:

$$\sigma^2_{\alpha_2,r}(x) = \mu_{\alpha_2,r}(x)\left(1 - \mu_{\alpha_2,r}(x)\right) > c_1(1-c_1) \tag{30}$$

Therefore $\sigma_{\alpha_2,r,\varepsilon}(x) > \sqrt{c_1(1-c_1) + \varepsilon}$, giving us:

$$K_2 = \frac{1 - \sqrt{c_1(1-c_1) + \varepsilon}}{\sqrt{c_1(1-c_1) + \varepsilon}}$$

Step 4: Analysis for $\mathcal{X}_3$ (Historical high-quality samples). For $x \in \mathcal{X}_3$, samples are generated by historical buffer policies $\alpha_3 = \alpha_B$ with $\mu_{\alpha_B,r}(x) \in [c_2, c_3]$. Since $\mu_{\alpha_3,r}(x)(1 - \mu_{\alpha_3,r}(x))$ achieves its minimum at the endpoints of the interval $[c_2, c_3]$:

$$\sigma^2_{\alpha_3,r}(x) \ge \min\left(c_2(1-c_2),\, c_3(1-c_3)\right) \tag{31}$$

Therefore $\sigma_{\alpha_3,r,\varepsilon}(x) \ge \sqrt{\min(c_2(1-c_2),\, c_3(1-c_3)) + \varepsilon}$, yielding:

$$K_3 = \frac{1 - \sqrt{\min(c_2(1-c_2),\, c_3(1-c_3)) + \varepsilon}}{\sqrt{\min(c_2(1-c_2),\, c_3(1-c_3)) + \varepsilon}}$$

Step 5: Combining the results. Taking expectations over $x \sim \rho_{\mathcal{X}}$ and applying the indicator function decomposition:

$$\mathbb{E}_{x \sim \rho_{\mathcal{X}}}\left[ I(x)\left( J(\pi_\theta(\cdot|x)) - J(\pi_{\theta_t}(\cdot|x)) \right) \right] \tag{32}$$

$$= \sum_{i=1}^{3} \mathbb{E}_{x \sim \rho_{\mathcal{X}}}\left[ \mathbb{1}\{x \in \mathcal{X}_i\}\left( J(\pi_\theta(\cdot|x)) - J(\pi_{\theta_t}(\cdot|x)) \right) \right] \tag{33}$$

$$\ge \sum_{i=1}^{3} \mathbb{E}_{x \in \mathcal{X}_i}\left[ L^{\alpha_i}(\pi_\theta(\cdot|x)) - 2K_i \cdot \mathrm{TV}(\pi_\theta(\cdot|x), \alpha_i(\cdot|x)) - 2\,\mathrm{TV}(\pi_{\theta_t}(\cdot|x), \alpha_i(\cdot|x)) \right] \tag{34}$$

$$= \sum_{i=1}^{3} \mathcal{L}_i(\pi_\theta, \alpha_i) \tag{35}$$

All constants $K_1, K_2, K_3$ are finite, since the denominators are strictly positive by construction and the numerators are bounded by 1 under $c_1, c_2, c_3 \in (0, 1)$, completing the proof.

Proposition A.3.
For binary reward tasks where $r(x, y) \in \{0, 1\}$, the contribution to the policy improvement lower bound is maximized when the expected group reward of the sample is $\mu = 0.5$.

Proof. Recalling Theorem 3.2, the lower bound for policy improvement on a specific data distribution involves the constant $K$, which scales the penalty for policy divergence. The tightness of this bound is governed by the standard deviation of the rewards $\sigma_{\alpha,r}(x)$.

Due to advantage standardization $\hat{A} \propto \frac{1}{\sigma}$, the effective step size in the advantage estimation, and consequently the gradient magnitude, is proportional to the inverse of the standard deviation. However, in the context of the lower bound analysis in Theorem 3.2, the stability constant $K$ is defined as:

$$K(\mu) = \frac{1 - \sigma(\mu)}{\sigma(\mu)} \tag{36}$$

where a smaller $K$ indicates a tighter bound and thus a larger guaranteed improvement step. For a binary reward function $r \in \{0, 1\}$, the reward distribution follows a Bernoulli distribution with parameter $\mu(x) = \mathbb{E}[r|x]$. The variance is given by:

$$\sigma^2(\mu) = \mu(1 - \mu) \tag{37}$$

To find the $\mu$ that maximizes variance, we take the derivative with respect to $\mu$:

$$\frac{d}{d\mu}(\mu - \mu^2) = 1 - 2\mu \tag{38}$$

Setting the derivative to zero:

$$1 - 2\mu = 0 \implies \mu = 0.5 \tag{39}$$

Since the second derivative $\frac{d^2}{d\mu^2} = -2 < 0$, this is a global maximum. At $\mu = 0.5$, the variance is maximized ($\sigma^2 = 0.25$, $\sigma = 0.5$). This corresponds to the state of maximum entropy, where the model is most "uncertain" about the outcome. Training on these samples provides the strongest gradient signal for distinguishing between correct and incorrect reasoning paths, effectively maximizing the information gain per step. Conversely, as $\mu \to 0$ or $\mu \to 1$, $\sigma \to 0$, causing numerical instability in the advantage estimates or a vanishing gradient signal. Therefore, selecting samples with $\mu = 0.5$ theoretically offers the most efficient learning signal and the most favorable stability bound.

A.3 ONLINE FILTER MECHANISM ANALYSIS

To investigate the impact of fresh sample selection on training stability and convergence, we conduct an ablation study using Qwen3 8B with a 4K response length limit. We compare three distinct filtering strategies for the online component ($\mathcal{X}_1$):

Mode 1 (Range Filter): Retains samples with group mean rewards $\mu \in [\frac{1}{G}, \frac{G-1}{G}]$. This effectively removes only the zero-advantage samples (all-correct or all-incorrect) that contribute minimal gradients.

Mode 2 (Gaussian Filter): A difficulty-weighted strategy that prioritizes samples with high variance (accuracy near 0.5) using a Gaussian distribution, thereby reducing the proportion of extremely easy or hard samples.

Mode 3 (Uniform Filter): A baseline that randomly selects 60% of the fresh samples regardless of their quality. This ratio was chosen to match the approximate data retention rates of Mode 1 and Mode 2 (approximately 40%–60%) for a fair comparison of data volume.

The Value of Quality over Randomness. As illustrated in Figure 10, the uniform filter mechanism exhibits severe instability, characterized by exploding gradient norms and a complete collapse in performance after 150 steps. Since this strategy blindly includes all-wrong samples (where $\mu = 0$), the model is forced to update based on low-quality, zero-advantage signals. Suppressing the token probabilities of incorrect responses without a corresponding positive signal introduces significant noise and uncertainty, ultimately destabilizing the policy. This failure highlights that the quality of the training batch, particularly the exclusion of zero-advantage noise, is crucial.

Convergence Speed and Final Performance. The Gaussian filter demonstrates faster convergence in the early stages.
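For concreteness, the three filtering modes can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names and interfaces are ours, not the released implementation, and each entry of `mu` stands for a prompt's group mean reward over $G$ rollouts.

```python
import numpy as np

def range_filter(mu, G=8):
    """Mode 1: keep prompts whose group mean reward lies in [1/G, (G-1)/G],
    i.e. drop only all-correct / all-incorrect (zero-advantage) groups."""
    mu = np.asarray(mu)
    return (mu >= 1.0 / G) & (mu <= (G - 1) / G)

def gaussian_filter(mu, center=0.5, std=0.2, rng=None):
    """Mode 2: keep each prompt with probability given by an unnormalised
    Gaussian weight centred at accuracy 0.5, prioritising high variance."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu = np.asarray(mu)
    weight = np.exp(-0.5 * ((mu - center) / std) ** 2)  # 1.0 at mu = 0.5
    return rng.random(mu.shape) < weight

def uniform_filter(mu, keep=0.6, rng=None):
    """Mode 3: keep a random fraction of prompts regardless of difficulty."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu = np.asarray(mu)
    return rng.random(mu.shape) < keep
```

Each function returns a boolean mask over prompts; only the range filter is deterministic, which is one reason it yields the most predictable batch composition in this ablation.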
By focusing heavily on samples with the highest variance (accuracy ≈ 0.5), it provides the steepest learning signal initially. However, its final convergence performance is lower than that of the range filter. We hypothesize that the Gaussian filter restricts sample diversity by aggressively filtering out samples that are slightly easier or harder but still informative. In contrast, the range filter retains a broader spectrum of valid samples. While it learns slightly slower initially, it maintains a rich distribution of training data, preventing premature plateauing and ultimately achieving the highest asymptotic performance.

Figure 10: Ablation on Online Filtering Strategies. Comparison of Range Filter, Gaussian Filter, and Uniform Filter on training stability (Grad Norm) and performance (Mean@8). The star symbol indicates the best checkpoint for BAPO.

A.4 TRAINING DYNAMICS AND TEST CURVES

As illustrated in Figure 11 and Figure 12, we present more detailed training dynamics and test curves for the Planning and Vision Geometry tasks. The results indicate that both BAPO and DAPO consistently outperform GRPO in terms of training rewards. Interestingly, BAPO exhibits higher entropy, reflecting better exploration capability compared to other algorithms, which also results in longer response lengths.

Figure 11: Training Dynamics during BAPO, GRPO, and DAPO post-training, including training rewards, training entropy, and response lengths.

Figure 12: Test Curves of Group Accuracy Changes for mathematics, planning, and geometry tasks on AMC, the CD-4 test set, and the Geo-3K test set, respectively.

A.5 COMPUTATION ANALYSIS

From Table 2, we observe that BAPO's computational overhead correlates with the number of samples requiring re-evaluation and the actual training batch size.
For the Planning task, BAPO (w/o $\mathcal{X}_2$) achieves the fastest training time by eliminating bad-case re-evaluation, but this comes at the cost of reduced performance. For the Mathematics task, the high number of bad cases (as shown by the 0/8 accuracy samples in Figure 7) means that under our re-evaluation frequency setting of $m = 5$, inference time exceeds that of GRPO. However, this additional time investment proves valuable, yielding better bad-case handling rates and overall test performance, as shown in Figure 4 and Table ??. We plan to explore lower re-evaluation frequencies to assess the performance trade-offs.

BAPO ($c_2 = 0.375$, $c_3 = 0.5$) runs significantly faster than BAPO ($c_2 = 0$, $c_3 = 0.25$) due to the larger historical data volume in the latter configuration. This causes BAPO ($c_2 = 0$, $c_3 = 0.25$) to maintain a larger effective batch size than BAPO ($c_2 = 0.375$, $c_3 = 0.5$). Training logs also confirm this observation: BAPO ($c_2 = 0$, $c_3 = 0.25$) consistently utilizes 100% of the configured batch size (equivalent to on-policy methods' batch size), while BAPO ($c_2 = 0.375$, $c_3 = 0.5$) operates at approximately 70% capacity.

Table 2: Computational Overhead Analysis. "Batch size" $(a, b)$ denotes the sample batch size $a$ and train mini-batch size $b$. "Time" is total training time (d = days, h = hours, m = minutes) on 8 A100 GPUs.

Tasks | Methods | Batch Size | Num Epoch | Time
Mathematics | GRPO | (256, 64) | 3 | 1d 16h 58m
Mathematics | DAPO | (256, 64) | 3 | 2d 15h 30m
Mathematics | BAPO | (256, 64) | 3 | 1d 22h 37m
Planning | GRPO | (256, 64) | 3 | 3h 47m
Planning | DAPO | (256, 64) | 3 | 6h 35m
Planning | BAPO | (256, 64) | 3 | 3h 23m
Planning | BAPO (w/o X2) | (256, 64) | 3 | 2h 38m
Planning | BAPO (w/o X3) | (256, 64) | 3 | 3h 4m
Planning | BAPO (c2 = 0, c3 = 0.25) | (256, 64) | 3 | 3h 54m
Planning | BAPO (c2 = 0.375, c3 = 0.5) | (256, 64) | 3 | 3h 4m
Visual Geometry | GRPO | (256, 64) | 30 | 7h 55m
Visual Geometry | DAPO | (256, 64) | 30 | 12h 19m
Visual Geometry | BAPO | (256, 64) | 30 | 5h 50m
Visual Geometry | BAPO (w/o X2) | (256, 64) | 30 | 3h 42m
Visual Geometry | BAPO (w/o X3) | (256, 64) | 30 | 4h 31m

A.6 VISUALIZATION

We present additional visualization details, including the sample accuracy tracking for the Countdown and Geometry3K datasets, as shown in Figure 13. Meanwhile, we visualize the source of samples in each training batch and their respective proportions during the training process, as illustrated in Figure 14. It can be observed that approximately 40–60% of the actual training samples for BAPO come from online samples $\mathcal{X}_1$, while the remaining samples are derived from $\mathcal{X}_2$ or $\mathcal{X}_3$.

Figure 13: Tracking changes in the Number of Different Accuracy Bins on the Countdown (upper) and Geometry3K (lower) training sets for the baseline model, GRPO, and our BAPO method. Special attention is paid to the change in the number of bad samples (red bars) that the base model fails to handle.

Figure 14: Batch Distribution Visualization of $\mathcal{X}_1$, $\mathcal{X}_2$, $\mathcal{X}_3$ for the Mathematics, Planning, and Visual Geometry tasks (left to right) during BAPO's training.

Figure 15: Accuracy Migration Matrix Analysis. We track a fixed subset of 1,000 randomly selected prompts from the training set and visualize their movement between accuracy bins (0/8 to 8/8) at Steps 0, 150, 300, and 471 (the last step). The y-axis represents the initial accuracy bin at Step 0, while the x-axis represents the current accuracy bin. The scarcity of samples in the lower triangle demonstrates that performance degradation is rare.

Stability of Historical High-Quality Samples. A potential concern regarding the reuse of historical high-quality samples ($\mathcal{X}_3$ in Eq. 5) is the assumption of policy consistency: specifically, whether samples that were high-quality under a past policy remain valid for the current policy. To address this, we visualize the evolution of sample difficulty in Figure 15 by tracking the accuracy migration of a training subset.

The heatmaps in Figure 15 reveal a distinct pattern: the mass is concentrated along the diagonal (performance maintenance) and the upper triangle (performance improvement). Crucially, the proportion of samples exhibiting significant performance degradation (migrating to the lower triangle) is negligible. For example, samples that initially achieved 8/8 accuracy predominantly remain in the high-accuracy bins throughout the training process, with minimal regression to lower bins. This empirical evidence demonstrates that high-quality reasoning paths learned by RL are robust and resistant to forgetting. Consequently, historical high-quality samples stored in the buffer likely remain high-quality under the current policy, validating the consistency of the $\mathcal{X}_3$ data source.

A.7 HYPERPARAMETER SETTING

Hyperparameters. The major hyperparameter choices are shown in Table 3.

Table 3: Hyperparameter Configuration for the BAPO Framework on the Mathematics Task. For the planning and visual geometry tasks, some parameters differ slightly; specific configuration scripts are provided in our code repository.
Rollout Configuration:
Top-p: 1 | Top-k: -1 | Temperature: 1
Group size (G): 8 | Max prompt length: 2048 | Max response length: 8192
Rollout workers: 8 | Sample batch size: 256 | Seed: 42

Training Configuration:
Learning rate: 1e-6 | Train mini-batch size: 64 | GAE lambda: 1.0
Training epochs: 3 | KL coefficient (β): 0.001 | Entropy coefficient: 0.001

Off-policy Configuration:
c1 threshold: 1/8 | c2 range: [1/8, 4/8] | c3 range: [2/8, 5/8]
Buffer size (|B|): 256 | Rollout delay (v): 5 | Re-evaluation freq (m): 5
Gaussian std (σ): 0.2 | Gaussian mean (µ): 0.5 | Max re-evaluate prompts: 128

Reward Function. To evaluate the impact of our method, we adopt the simple reward function below. All training experiments employ the same reward function.

$$r(x, y) = \begin{cases} 1, & \text{if } y \text{ is correct} \\ 0, & \text{otherwise} \end{cases}$$

Datasets and Benchmarks. To evaluate the models above, we use three training datasets and eight benchmarks categorized into mathematical, planning, and vision geometry reasoning benchmarks, as described in Table 4.

Table 4: Datasets and Benchmarks used in this study.

Training Datasets (Dataset | #Train | Task Type | Domain | License | Source):
DEEPSCALER-1.5B-PREVIEW | 40,000 | Math reasoning | Mathematics | Apache 2.0 | Link
COUNTDOWN-TASKS-3TO4 | 49,000 | Logic reasoning | Planning | Apache 2.0 | Link
GEOMETRY3K | 2,100 | Visual reasoning | Visual Geometry | Apache 2.0 | Link

Test Benchmarks (Benchmark | #Test | Task Type | Domain | License | Source):
AIME24 | 30 | Math competition | Mathematics | MIT | Link
AMC | 83 | Math competition | Mathematics | Apache 2.0 | Link
MATH500 | 500 | Math reasoning | Mathematics | - | Link
MINERVA | 272 | Math reasoning | Mathematics | Apache 2.0 | Link
OLYMPIAD | 674 | Math competition | Mathematics | Apache 2.0 | Link
COUNTDOWN-TASKS-3TO4 | 200* | Logic reasoning | Planning | Apache 2.0 | Link
COUNTDOWN-TASKS-4 | 200* | Logic reasoning | Planning | Apache 2.0 | Link
GEOMETRY3K | 901 | Visual reasoning | Visual Geometry | Apache 2.0 | Link

*We only use a random subset of this benchmark for faster ablation studies.

A.8 ALGORITHM

Algorithm 1 presents the proposed BAPO, which can be seamlessly integrated with any GRPO-like RLVR algorithm.

Algorithm 1: Batch Adaptation Policy Optimization (BAPO)
Require: Policy $\pi_{\theta_0}$, buffer $\mathcal{B} = \emptyset$, thresholds $c_1, c_2, c_3$, delay steps $v$, re-evaluate frequency $m$
1: for $t = 1$ to $T$ do
2:   // Off-policy Rollout Phase
3:   if $t \bmod v = 0$ then
4:     Synchronize the rollout policy's parameters with the trainer: $\alpha = \pi_{\theta_t}$
5:   end if
6:   Use the rollout policy $\alpha$ to generate $G$ responses $\{y_j\}_{j=1}^{G}$ for each question $x$
7:   Compute log probabilities $\alpha(y|x)$ and rewards $r$ to construct the online batch $\mathcal{X}_{\mathrm{on}}$
8:   Store samples into buffer: $\mathcal{B}_{\mathrm{bad}} \leftarrow \{(x, y, \alpha(y|x), r) \in \mathcal{X}_{\mathrm{on}} : \mu_{\alpha,r}(x) \le c_1\}$
9:   Store samples into buffer: $\mathcal{B}_{\mathrm{high}} \leftarrow \{(x, y, \alpha(y|x), r) \in \mathcal{X}_{\mathrm{on}} : c_2 \le \mu_{\alpha,r}(x) \le c_3\}$
10:  // Off-policy Training Phase
11:  $\mathcal{X}_1 \leftarrow$ online filter on $\mathcal{X}_{\mathrm{on}}$ with $\mu_{\alpha,r}(x) \in \{\frac{1}{G}, \ldots, \frac{G-1}{G}\}$ (Filtered Fresh Samples)
12:  $\mathcal{X}_2 \leftarrow \emptyset$
13:  if $t \bmod m = 0$ then
14:    Re-evaluate $\mathcal{B}_{\mathrm{bad}}$ with $\pi_{\theta_t}$ to get $\mathcal{X}_2$ using Equation 6 (Re-evaluated Difficult Samples)
15:  end if
16:  $\mathcal{X}_3 \leftarrow$ sample from $\{(x, y) \in \mathcal{B}_{\mathrm{high}} : \mu_{\alpha_B,r}(x) \in [c_2, c_3]\}$ (Historical High-quality Samples)
17:  Final batch $\leftarrow \mathcal{X}_1 \cup \mathcal{X}_2 \cup \mathcal{X}_3$
18:  Compute advantages and update critic/actor with the final batch
19:  Add $\mathcal{D}_t$ to buffer $\mathcal{B}$
20: end for

A.9 GENERALIZATION ANALYSIS

To demonstrate the algorithmic generalizability of our framework, we extended the Batch Adaptation paradigm to Proximal Policy Optimization (PPO), denoted as BA-PPO. In this experiment, both the Actor and Critic networks were initialized with the Qwen3-4B backbone and trained on the DeepScaleR dataset with a maximum response length of 4K tokens. We maintained consistency with the foundational BAPO configuration by applying standard zero-advantage filtering for $\mathcal{X}_1$ (removing only all-correct and all-wrong groups), utilizing the initial BAPO values for thresholds $c_1, c_2, c_3$, and setting the buffer size to 64.

Figure 16: Generalization to Actor-Critic Algorithms (BA-PPO). Performance comparison between standard PPO (orange triangles) and BA-PPO (purple circles) on the AIME 2024 benchmark using Qwen3-4B. The star (⋆) marks the peak performance of BA-PPO (0.325).

As illustrated in Figure 16, BA-PPO achieved a remarkable performance gain of +5.5 on the AIME 2024 benchmark compared to the standard PPO baseline. This result further confirms that the core principle of dynamic batch construction is effective not only for GRPO but also functions as a robust, algorithm-agnostic enhancement for actor-critic methods.
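The batch-adaptation principle that transfers from GRPO to BA-PPO reduces to partitioning prompts by their group mean reward into the three sources $\mathcal{X}_1$, $\mathcal{X}_2$, $\mathcal{X}_3$. A minimal sketch of this per-step partition follows; the function names are ours, the thresholds echo Table 3 ($c_1 = 1/8$, $[c_2, c_3] = [2/8, 5/8]$), and this is an illustration of the selection logic rather than the released implementation.

```python
import numpy as np

def group_mean_rewards(rewards, G=8):
    """Group-level accuracy mu_{alpha,r}(x): mean of G binary rewards per prompt."""
    return np.asarray(rewards, dtype=float).reshape(-1, G).mean(axis=1)

def build_batch(mu_online, mu_reeval, mu_buffer, G=8, c1=1/8, c2=2/8, c3=5/8):
    """Partition prompt indices into BAPO's three batch sources.

    X1: fresh prompts with non-degenerate group accuracy in [1/G, (G-1)/G],
    X2: historically difficult prompts that the current policy, on
        re-evaluation, now solves at a rate in (c1, 1),
    X3: buffered prompts whose historical accuracy lies in [c2, c3].
    """
    mu_online = np.asarray(mu_online)
    x1 = np.flatnonzero((mu_online >= 1 / G) & (mu_online <= (G - 1) / G))
    x2 = np.flatnonzero((np.asarray(mu_reeval) > c1) & (np.asarray(mu_reeval) < 1.0))
    x3 = np.flatnonzero((np.asarray(mu_buffer) >= c2) & (np.asarray(mu_buffer) <= c3))
    return x1, x2, x3
```

The returned index arrays correspond to lines 11, 14, and 16 of Algorithm 1; their union forms the final training batch, which is why the effective batch size shrinks when the buffer supplies few samples in $[c_2, c_3]$.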
