Turn-Level PPO Stabilization Techniques for Multi-Turn Dialogue Training

Reading time: 4 minutes

📝 Original Info

  • Title: Turn-Level PPO Stabilization Techniques for Multi-Turn Dialogue Training (paper title: ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents)
  • ArXiv ID: 2511.20718
  • Date: 2025-11-25
  • Authors: Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Alfredo Garcia, Parminder Bhatia, Taha Kass-Hout, Cao Xiao, Mingyi Hong

📝 Abstract

Proximal policy optimization (PPO) has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1) token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability. Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.
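The first technique, turn-level importance sampling, can be illustrated with a minimal Python sketch. This is a hypothetical illustration, not the authors' reference implementation: the helper name `turn_level_ppo_loss` and the exact aggregation (summing token log-ratios within a turn to form one turn-level ratio, then applying the standard PPO clip at turn granularity) are assumptions consistent with the description above.

```python
import math

def turn_level_ppo_loss(logp_new, logp_old, advantages, turn_ids, eps=0.2):
    """Hypothetical sketch of a turn-level PPO surrogate: per-token log-prob
    ratios are aggregated within each turn, so the importance ratio and the
    clipping operate on whole turns instead of individual tokens."""
    num_turns = max(turn_ids) + 1
    turn_log_ratio = [0.0] * num_turns
    turn_adv = [0.0] * num_turns
    for ln, lo, a, t in zip(logp_new, logp_old, advantages, turn_ids):
        # Summing token log-ratios == log of the product of token ratios,
        # i.e. the importance ratio of the full turn.
        turn_log_ratio[t] += ln - lo
        turn_adv[t] += a
    losses = []
    for lr, a in zip(turn_log_ratio, turn_adv):
        r = math.exp(lr)
        r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        # Standard PPO clipped surrogate, now at turn granularity.
        losses.append(-min(r * a, r_clipped * a))
    return sum(losses) / len(losses)
```

When the new and behavior policies agree, every turn-level ratio is 1 and the loss reduces to the negative mean of the per-turn advantages, matching the usual PPO sanity check.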

📄 Full Content

ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents

Chenliang Li1∗, Adel Elmahdy2, Alex Boyd2, Zhongruo Wang3, Alfredo Garcia1, Parminder Bhatia2, Taha Kass-Hout2, Cao Xiao2, Mingyi Hong4

1Texas A&M University, 2GE HealthCare, 3Independent Researcher, 4University of Minnesota

chenliangli@tamu.edu, {adel.elmahdy, alex.boyd}@gehealthcare.com, wangleft@gmail.com, alfredo.garcia@tamu.edu, {parminder.bhatia, taha.kass-hout, cao.xiao}@gehealthcare.com, mhong@umn.edu

November 27, 2025

Abstract

Proximal policy optimization (PPO) has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1) token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability.
Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.

∗This work was done during an internship at GE HealthCare, Bellevue, WA. Corresponding authors: Chenliang Li and Adel Elmahdy.

arXiv:2511.20718v1 [cs.LG] 25 Nov 2025

$\nabla_\theta J_{\text{Token-PPO}}(\theta) = \mathbb{E}\!\left[ \frac{1}{|y|} \sum_{t=1}^{|y|} \mathbb{1}\{t \in B_c^{\text{token}}\}\, w_t(\theta)\, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right]$

…(Full text truncated)…
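The second technique, clipping-bias correction, can likewise be sketched in Python. This is a hypothetical interpretation of "normalizing gradients by downweighting unreliable, highly off-policy samples" (the function name and the specific mask-and-renormalize scheme are assumptions, not the paper's exact formulation): tokens whose importance ratio falls outside the trust region are masked out, and the surrogate is averaged over the remaining tokens only, so a few highly off-policy samples cannot dominate the update.

```python
def clipping_bias_corrected_weights(ratios, eps=0.2):
    """Hypothetical sketch of clipping-bias correction: mask tokens whose
    importance ratio lies outside [1 - eps, 1 + eps], then renormalize the
    per-token weights by the count of non-clipped tokens."""
    in_trust_region = [abs(r - 1.0) <= eps for r in ratios]
    n_valid = sum(in_trust_region)
    if n_valid == 0:
        # Nothing is sufficiently on-policy; contribute no gradient.
        return [0.0] * len(ratios)
    # Non-clipped tokens share the weight equally; clipped tokens get zero.
    return [1.0 / n_valid if ok else 0.0 for ok in in_trust_region]
```

For example, with ratios [1.0, 1.1, 2.0] and eps = 0.2, the third token is far off-policy and is dropped, while the first two split the normalization, receiving weight 0.5 each.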

📸 Image Gallery

  • 3algorithm_clipping_ratio_comparison.png
  • 3algorithm_kl_divergence_comparison.png
  • 3algorithm_minibatch128_comparison.png
  • HotpotQA_comparison.png
  • NQ_performance_curve.png
  • STPPO_formulation_revised.png
  • clipped_l2_norm.png
  • clipping_bias_comparison.png
  • four_algorithms_grad_norm_comparison.png
  • gae_advantage_with_variance.png
  • gradient_norm.png
  • non_clipped_l2_norm.png
  • success_rate.png
  • turn_token_comparison.png
  • turn_token_grad_norm_comparison.png
  • valid_action_ratio.png

Reference

This content is AI-processed based on ArXiv data.
