Turn-Based PPO Stabilization Techniques for Multi-Turn Dialogue Training
📝 Abstract
Proximal policy optimization (PPO) has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1) token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability. Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.
📄 Content
ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents

Chenliang Li¹*, Adel Elmahdy², Alex Boyd², Zhongruo Wang³, Alfredo Garcia¹, Parminder Bhatia², Taha Kass-Hout², Cao Xiao², Mingyi Hong⁴
¹Texas A&M University ²GE HealthCare ³Independent Researcher ⁴University of Minnesota
chenliangli@tamu.edu, {adel.elmahdy, alex.boyd}@gehealthcare.com, wangleft@gmail.com, alfredo.garcia@tamu.edu, {parminder.bhatia, taha.kass-hout, cao.xiao}@gehealthcare.com, mhong@umn.edu

November 27, 2025

*This work was done during an internship at GE HealthCare, Bellevue, WA. Corresponding authors: Chenliang Li and Adel Elmahdy.

arXiv:2511.20718v1 [cs.LG] 25 Nov 2025

$$
\nabla_\theta J_{\text{Token-PPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{|y|}\sum_{t=1}^{|y|} \mathbb{1}_{t \notin \mathcal{B}_c^{\text{token}}}\, w_t(\theta)\, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\, \hat{A}_t\right]
$$

$$
\nabla_\theta J_{\text{Turn-PPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{|y|}\sum_{k=1}^{K} \underbrace{\frac{w_k^{\text{turn}}(\theta)\, \hat{A}_k}{|y_k|}}_{\text{turn-level credit}}\, \nabla_\theta \log \pi_\theta(y_k \mid x, y_{<k})\right]
$$

$$
\nabla_\theta J_{\text{S-PPO}}(\theta) := \frac{1}{\|C_{\text{token}}(\theta)\|_2}\, \nabla_\theta J_{\text{PPO}}(\theta)
$$

$$
\nabla_\theta J_{\text{ST-PPO}}(\theta) := \frac{1}{\|C_{\text{turn}}(\theta)\|_2}\, \nabla_\theta J_{\text{Turn-PPO}}(\theta)
$$

[Figure 1 plot: gradient norm (log scale) over roughly 250 policy optimization steps for ST-PPO, S-PPO, Token-PPO, and Turn-PPO.]

Figure 1: Illustration of the four PPO variants. Token-level PPO becomes turn-level PPO by applying turn-level importance sampling (Eq. 4). Further adding the clipping-bias correction to normalize gradients yields S-PPO and ST-PPO (Eq. 8 and Eq. 7). Both variants significantly reduce the probability of extreme gradient spikes, leading to more stable training.

1 Introduction

Reinforcement learning (RL) has significantly advanced the reasoning capabilities of large language models (LLMs), enabling strong performance in domains such as mathematical problem solving (Jaech et al., 2024; Liu et al., 2024; Yu et al., 2025) and code generation (El-Kishky et al., 2025; Cui et al., 2025). Beyond these applications, RL has also shown promise in more agentic settings such as tool learning (Qian et al., 2025; Feng et al., 2025), where models learn to invoke external tools (e.g., web search engines), execute actions, and interact with real-world environments. Recent systems such as Deepseek V3 (Liu et al., 2024) and Kimi V2 (Team et al., 2025) have achieved state-of-the-art performance on both mathematical reasoning (e.g., AIME, Math-500) and agentic benchmarks (Jimenez et al., 2023).

Despite these successes, the computational demands of multi-turn RL training pose significant practical challenges. These gains rely on numerous interaction samples, which are costly due to the large number of rollouts and multi-turn tool calls required during training. In practice, hardware and memory limits force each batch of collected samples to be split into several mini-batches (Schulman et al., 2017) and updated sequentially.
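To make the difference between the four gradient estimators concrete, here is a minimal NumPy sketch of token-level versus turn-level importance weighting with PPO-style clipping, plus a clipping-bias downweighting factor. The geometric-mean turn ratio and the 1/(1 + clipped-fraction) normalizer are illustrative assumptions, not the paper's exact definitions of w_k^turn and C (Eqs. 4, 7, and 8).

```python
import numpy as np

def token_ppo_terms(logp_new, logp_old, adv, eps=0.2):
    """Per-token contribution w_t(theta) * A_t, zeroed for clipped tokens.
    A token is 'clipped' (no gradient) when its importance ratio has moved
    past 1 +/- eps in the direction the advantage is pushing."""
    ratio = np.exp(logp_new - logp_old)  # w_t(theta)
    clipped = ((ratio > 1.0 + eps) & (adv > 0)) | ((ratio < 1.0 - eps) & (adv < 0))
    return np.where(clipped, 0.0, ratio * adv), clipped

def turn_ppo_terms(logp_new, logp_old, adv_turn, turn_ids, eps=0.2):
    """Turn-level analogue: one importance ratio per turn, shared by all of
    that turn's tokens, so a whole turn is kept or clipped as a unit.
    The geometric mean of token ratios is one plausible length-normalized
    choice for w_k^turn; the paper's definition may differ."""
    terms = np.zeros_like(logp_new)
    n_turns = len(np.unique(turn_ids))
    clipped = np.zeros(n_turns, dtype=bool)
    for k in np.unique(turn_ids):
        m = turn_ids == k
        w_k = float(np.exp(np.mean(logp_new[m] - logp_old[m])))  # turn ratio
        a_k = adv_turn[k]
        c = (w_k > 1.0 + eps and a_k > 0) or (w_k < 1.0 - eps and a_k < 0)
        clipped[k] = c
        terms[m] = 0.0 if c else w_k * a_k
    return terms, clipped

def clipping_bias_scale(clip_mask):
    """S-/ST-PPO-style correction: shrink the whole update as the clipped
    (highly off-policy) fraction grows. 1/(1 + clipped_fraction) is a
    stand-in with the same monotonicity as the paper's normalizer."""
    return 1.0 / (1.0 + float(np.mean(clip_mask)))
```

With eps = 0.2, a token whose ratio is 1.5 under a positive advantage contributes nothing at the token level, while at the turn level the same decision is made once per turn, which is the granularity mismatch the paper targets.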
This naturally induces a hybrid update mechanism where later updates become increasingly off-policy (Chen et al., 2023). To maximize sample efficiency under computational constraints, practitioners often adopt off-policy pipelines, and the resulting distribution mismatch is typically corrected using importance sampling (Nachum et al., 2017). The sh
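The off-policy correction described above can be illustrated with a toy importance-sampling estimate: samples drawn under an older behavior policy q are reweighted by p/q to give an unbiased estimate of an expectation under the current policy p. The three-action distributions and per-action returns below are invented purely for the demonstration.

```python
import numpy as np

# After several sequential mini-batch updates, the policy p that we are
# optimizing no longer matches the behavior policy q that generated the
# rollouts. Importance weights p(a)/q(a) correct the mismatch.
rng = np.random.default_rng(0)
q_probs = np.array([0.5, 0.3, 0.2])   # behavior (rollout-time) policy
p_probs = np.array([0.2, 0.3, 0.5])   # current policy after off-policy drift
returns = np.array([1.0, 2.0, 3.0])   # per-action return

actions = rng.choice(3, size=200_000, p=q_probs)
naive = returns[actions].mean()                     # biased: estimates E_q[R]
weights = p_probs[actions] / q_probs[actions]       # importance ratios
corrected = (weights * returns[actions]).mean()     # unbiased for E_p[R]

true_p = float((p_probs * returns).sum())  # 2.3
true_q = float((q_probs * returns).sum())  # 1.7
```

The naive average concentrates near E_q[R] = 1.7, while the reweighted average concentrates near E_p[R] = 2.3; the cost, as the paper emphasizes, is that large ratios inflate the variance of the corrected estimate.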