A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation

Reading time: 5 minutes

📝 Original Info

  • Title: A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation
  • ArXiv ID: 2512.06547
  • Date: 2025-12-06
  • Authors: Xiao-Can (Bruce) Li, Shi-Liang (Bruce) Wu, and Zheng Shen (Huawei Canada)

📝 Abstract

Decoupled PPO has been a successful reinforcement learning (RL) algorithm for handling high data staleness in the asynchronous RL setting. The decoupled loss improves the learning stability of coupled-loss algorithms (e.g., standard PPO, GRPO) by introducing a proximal policy that decouples the off-policy correction (importance weight) from the policy update constraint (trust region). However, the proximal policy requires an extra forward pass through the model at each training step, creating a computational overhead for large language model training. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead, achieving a 1.8× training speedup while maintaining comparable performance. Code & off-the-shelf example are available at: https://github.com/inclusionAI/AReaL/blob/main/docs/algorithms/prox_approx.md


📄 Full Content

A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation⋆

Xiao-Can (Bruce) Li, Shi-Liang (Bruce) Wu, and Zheng Shen (Huawei Canada)
hsiaotsan.li@alumni.utoronto.ca, {okwsl201210, zhengshencn}@gmail.com

Abstract. Decoupled PPO has been a successful reinforcement learning (RL) algorithm for handling high data staleness in the asynchronous RL setting. The decoupled loss improves the learning stability of coupled-loss algorithms (e.g., standard PPO, GRPO) by introducing a proximal policy that decouples the off-policy correction (importance weight) from the policy update constraint (trust region). However, the proximal policy requires an extra forward pass through the model at each training step, creating a computational overhead for large language model training. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead, achieving a 1.8× training speedup while maintaining comparable performance. Code & off-the-shelf example are available at: https://github.com/inclusionAI/AReaL/blob/main/docs/algorithms/prox_approx.md

Keywords: Reinforcement Learning · Policy Optimization · Large Language Models

1 Introduction

Reinforcement learning (RL) has become a central approach to improving the reasoning capabilities of large language models (LLMs) [21,26,28,27,17], with extensive surveys covering RL from human feedback methods and workflows [11,5,23] and various alternative approaches including AI feedback [15] and safety considerations [3]. Among RL algorithms, Proximal Policy Optimization (PPO) [20] has emerged as the dominant method due to its stable trust-region constraints, building upon earlier trust region methods like TRPO [19].
However, standard PPO performs a rollout-then-training loop: the training stage must wait until the rollout stage collects a predefined number of episodes, which limits throughput (measured in environment steps per unit of time) and under-utilizes computational resources.

⋆ The first and second authors contributed equally. (arXiv:2512.06547v2 [cs.LG], 9 Jan 2026)

To improve throughput and computational resource utilization, asynchronous RL [9,24,22,8,16,25,7,14,6,2] treats rollout and training as two independent engines that can execute in parallel. Nevertheless, the target policy on the training engine can be several updates ahead of the behavior policy on the rollout engine. Such staleness (off-policyness) caused by the asynchronous RL setting can lead to severe learning instability in standard PPO. To mitigate this, decoupled PPO [10] improves learning stability by introducing a proximal policy that decouples the off-policy correction (importance weight) from the policy update constraint (trust region). The decoupled loss empirically demonstrates improved learning stability in Atari games under high off-policyness. Beyond Atari games, AReaL [9], an LLM post-training framework, demonstrated the superior learning stability of the decoupled loss on LLM reasoning tasks under high off-policyness. Thanks to its asynchronous RL setup, AReaL also achieved up to 2.77× training speedup. However, the proximal policy in the decoupled loss requires an extra forward pass through the neural network at each training step, which is expensive for autoregressive LLMs. This overhead limits the potential speedups from asynchronous training. This raises a natural question: do we really need to compute the proximal policy explicitly?
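To make the decoupling concrete, here is a minimal NumPy sketch of a decoupled-PPO-style loss: the off-policy correction uses the ratio π_prox/π_behav, while the PPO trust-region clip is applied to π_θ/π_prox. This is an illustration of the idea, not the authors' implementation; the function and argument names are ours, and a real training loop would use a tensor library where only the target log-probs carry gradients.

```python
import numpy as np

def decoupled_ppo_loss(logp_target, logp_prox, logp_behav, advantages, eps=0.2):
    """Sketch of a decoupled PPO loss on per-token log-probabilities.

    logp_target: log pi_theta(a|s)   (the policy being updated)
    logp_prox:   log pi_prox(a|s)    (trust-region anchor)
    logp_behav:  log pi_behav(a|s)   (policy that generated the data)
    """
    # Off-policy correction: importance weight pi_prox / pi_behav.
    # In a real implementation this term is treated as a constant
    # (stop-gradient); NumPy has no autograd, so it is just a value here.
    iw = np.exp(logp_prox - logp_behav)
    # Trust-region ratio pi_theta / pi_prox, clipped as in standard PPO.
    ratio = np.exp(logp_target - logp_prox)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Negated because trainers minimize; PPO maximizes the clipped objective.
    return -(iw * np.minimum(unclipped, clipped)).mean()
```

When all three policies coincide, the importance weight and ratio are both 1 and the loss reduces to the standard on-policy policy-gradient surrogate, which is a quick sanity check for any implementation.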
Looking at the objective from first principles, the proximal policy simply serves as a trust region anchor: it does not need to be computed from the network; it just needs to lie somewhere between the behavior and target policies to prevent extreme importance weights. This insight leads to our solution: instead of computing the proximal policy through a forward pass, we approximate it by interpolating between the behavior policy and the target policy in log-probability space. Our staleness-aware interpolation weighs fresher data more heavily, maintaining the stabilizing effect of the decoupled loss while eliminating the computational overhead.

Our contributions are threefold:

1. A staleness-aware proximal probability interpolation method that eliminates the computation cost of proximal policies in the decoupled loss while retaining the trust-region structure of PPO.
2. Empirical evaluation across two model scales (1.5B and 8B parameters) demonstrating that our method achieves up to 1.8× speedup in training time while maintaining comparable task performance and superior training stability compared to both standard decoupled PPO and synchronous training baselines, with particular advan

…(Full text truncated)…
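The interpolation idea described above can be sketched in a few lines. The exact staleness weighting is not given in this excerpt, so the exponential schedule below is a hypothetical choice of ours: fresh data (staleness 0) anchors the proximal policy at the target policy, and staler data pulls it toward the behavior policy. Only the function shape, interpolation in log-probability space, follows the paper.

```python
import numpy as np

def approx_proximal_logp(logp_target, logp_behav, staleness, decay=0.5):
    """Approximate proximal log-probs by interpolation (A-3PO's core idea).

    staleness: how many policy updates old the sample is (0 = fresh).
    decay: assumed schedule parameter, NOT from the paper.
    """
    alpha = decay ** staleness  # hypothetical staleness-aware weight
    # Convex combination in log-probability space; no forward pass needed.
    return alpha * logp_target + (1 - alpha) * logp_behav
```

Because the proximal log-probs are now a cheap arithmetic combination of quantities the trainer already has (behavior log-probs from rollout, target log-probs from the update), the extra forward pass that decoupled PPO spends on the proximal policy disappears, which is where the reported 1.8× speedup comes from.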

📸 Image Gallery

Training curves from the paper (each provided as .png and .webp): eval rollout reward vs. steps and vs. time; PPO actor task reward (avg) vs. steps and vs. time; clipped tokens, entropy (avg), and importance weight (max and min) vs. steps; and log-prob recomputation time vs. steps.

Reference

This content was AI-processed from arXiv data.
