Turn-Level PPO Stabilization Techniques for Multi-Turn Dialogue Training

Reading time: 4 minutes

📝 Original Info

  • Title: Turn-Level PPO Stabilization Techniques for Multi-Turn Dialogue Training (paper title: ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents)
  • ArXiv ID: 2511.20718
  • Date: 2025-11-25
  • Authors: Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Alfredo Garcia, Parminder Bhatia, Taha Kass-Hout, Cao Xiao, Mingyi Hong

📝 Abstract

Proximal policy optimization (PPO) has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1) token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability. Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.
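The first technique, turn-level importance sampling, can be illustrated with a minimal Python sketch. This is a hypothetical illustration, not the authors' reference implementation: the helper name `turn_level_ppo_loss` and the exact aggregation (summing token log-ratios within a turn to form one turn-level ratio, then applying the standard PPO clip at turn granularity) are assumptions consistent with the description above.

```python
import math

def turn_level_ppo_loss(logp_new, logp_old, advantages, turn_ids, eps=0.2):
    """Hypothetical sketch of a turn-level PPO surrogate: per-token log-prob
    ratios are aggregated within each turn, so the importance ratio and the
    clipping operate on whole turns instead of individual tokens."""
    num_turns = max(turn_ids) + 1
    turn_log_ratio = [0.0] * num_turns
    turn_adv = [0.0] * num_turns
    for ln, lo, a, t in zip(logp_new, logp_old, advantages, turn_ids):
        # Summing token log-ratios == log of the product of token ratios,
        # i.e. the importance ratio of the full turn.
        turn_log_ratio[t] += ln - lo
        turn_adv[t] += a
    losses = []
    for lr, a in zip(turn_log_ratio, turn_adv):
        r = math.exp(lr)
        r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        # Standard PPO clipped surrogate, now at turn granularity.
        losses.append(-min(r * a, r_clipped * a))
    return sum(losses) / len(losses)
```

When the new and behavior policies agree, every turn-level ratio is 1 and the loss reduces to the negative mean of the per-turn advantages, matching the usual PPO sanity check.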

📄 Full Content

ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents

Chenliang Li1∗, Adel Elmahdy2, Alex Boyd2, Zhongruo Wang3, Alfredo Garcia1, Parminder Bhatia2, Taha Kass-Hout2, Cao Xiao2, Mingyi Hong4

1Texas A&M University, 2GE HealthCare, 3Independent Researcher, 4University of Minnesota

chenliangli@tamu.edu, {adel.elmahdy, alex.boyd}@gehealthcare.com, wangleft@gmail.com, alfredo.garcia@tamu.edu, {parminder.bhatia, taha.kass-hout, cao.xiao}@gehealthcare.com, mhong@umn.edu

November 27, 2025

Abstract

Proximal policy optimization (PPO) has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1) token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability.
Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.

∗This work was done during an internship at GE HealthCare, Bellevue, WA. Corresponding authors: Chenliang Li and Adel Elmahdy.

arXiv:2511.20718v1 [cs.LG] 25 Nov 2025

$\nabla_\theta J_{\text{Token-PPO}}(\theta) = \mathbb{E}\!\left[ \frac{1}{|y|} \sum_{t=1}^{|y|} \mathbb{1}\{t \in B_c^{\text{token}}\}\, w_t(\theta)\, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right]$

…(Full text truncated)…
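The second technique, clipping-bias correction, can likewise be sketched in Python. This is a hypothetical interpretation of "normalizing gradients by downweighting unreliable, highly off-policy samples" (the function name and the specific mask-and-renormalize scheme are assumptions, not the paper's exact formulation): tokens whose importance ratio falls outside the trust region are masked out, and the surrogate is averaged over the remaining tokens only, so a few highly off-policy samples cannot dominate the update.

```python
def clipping_bias_corrected_weights(ratios, eps=0.2):
    """Hypothetical sketch of clipping-bias correction: mask tokens whose
    importance ratio lies outside [1 - eps, 1 + eps], then renormalize the
    per-token weights by the count of non-clipped tokens."""
    in_trust_region = [abs(r - 1.0) <= eps for r in ratios]
    n_valid = sum(in_trust_region)
    if n_valid == 0:
        # Nothing is sufficiently on-policy; contribute no gradient.
        return [0.0] * len(ratios)
    # Non-clipped tokens share the weight equally; clipped tokens get zero.
    return [1.0 / n_valid if ok else 0.0 for ok in in_trust_region]
```

For example, with ratios [1.0, 1.1, 2.0] and eps = 0.2, the third token is far off-policy and is dropped, while the first two split the normalization, receiving weight 0.5 each.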

📸 Image Gallery

  • 3algorithm_clipping_ratio_comparison.png
  • 3algorithm_kl_divergence_comparison.png
  • 3algorithm_minibatch128_comparison.png
  • HotpotQA_comparison.png
  • NQ_performance_curve.png
  • STPPO_formulation_revised.png
  • clipped_l2_norm.png
  • clipping_bias_comparison.png
  • four_algorithms_grad_norm_comparison.png
  • gae_advantage_with_variance.png
  • gradient_norm.png
  • non_clipped_l2_norm.png
  • success_rate.png
  • turn_token_comparison.png
  • turn_token_grad_norm_comparison.png
  • valid_action_ratio.png

Reference

This content is AI-processed based on ArXiv data.
