QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning
GRPO-style reinforcement learning (RL)-based LLM fine-tuning algorithms have recently gained popularity. Because they rely on heuristic trust-region approximations, however, they can exhibit brittle optimization behavior: global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region Policy Optimization (QUATRO), which directly enforces trust-region constraints through a principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with stabilizer terms arising intrinsically from the exact trust-region formulation. Evaluated on diverse mathematical reasoning benchmarks, QUATRO trains stably under increased policy staleness and aggressive learning rates, maintaining well-controlled entropy throughout training.
💡 Research Summary
The paper addresses a fundamental weakness in recent reinforcement‑learning (RL) based fine‑tuning methods for large language models (LLMs) such as Group Relative Policy Optimization (GRPO) and Group Sequence Policy Optimization (GSPO). These methods rely on heuristic importance‑ratio clipping to enforce a trust‑region (TR) constraint, but clipping discards gradient information once the ratio exceeds a fixed threshold. Consequently, samples that drift far from the sampling policy receive no corrective signal, leading to gradient masking, instability under policy staleness (large learning rates or extensive rollout reuse), and rapid entropy collapse. Moreover, a single global KL‑divergence bound is applied uniformly across all queries, causing easy prompts to become overly deterministic while hard prompts suffer from insufficient exploration.
To overcome these issues, the authors propose QUery‑Adaptive TRust‑Region Optimization (QUATRO). The key idea is to formulate a query‑conditioned KL‑constrained optimization problem: for each prompt q, maximize the expected trajectory reward subject to KL(πθ(·|q)‖πold(·|q)) ≤ δ. By introducing dual variables λq ≥ 0 (for the KL constraint) and μq (for normalization), they derive a Lagrangian whose inner minimization yields an exact optimal policy:

π⋆(o|q) = πold(o|q) · exp((R(q,o) − μq) / λq),

where R(q,o) is the trajectory reward and μq normalizes π⋆(·|q) to a valid distribution. The dual variable λq is query-specific, so the strength of the trust region adapts to each prompt rather than being fixed globally.
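The closed-form solution above is an exponential tilting of πold, with λq controlled by the per-query KL budget δ. A minimal numerical sketch of this idea, assuming a discrete output space and a scalar reward R(q,o) — the helper names (`tilted_policy`, `solve_lambda`) and the bisection search for λq are illustrative, not taken from the paper:

```python
import numpy as np

def tilted_policy(pi_old, rewards, lam):
    # Closed-form solution of the per-query KL-constrained problem:
    # pi*(o) ∝ pi_old(o) * exp(R(o) / lam); normalization plays the role of mu_q.
    logits = np.log(pi_old) + rewards / lam
    logits -= logits.max()              # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def kl(p, q):
    # KL(p || q) for discrete distributions.
    return float(np.sum(p * np.log(p / q)))

def solve_lambda(pi_old, rewards, delta, lo=1e-4, hi=1e4, iters=100):
    # Bisection on the dual variable lambda_q: a larger lambda_q tightens
    # the tilting, hence a smaller KL(pi* || pi_old).
    for _ in range(iters):
        mid = np.sqrt(lo * hi)          # geometric bisection over a wide range
        if kl(tilted_policy(pi_old, rewards, mid), pi_old) > delta:
            lo = mid                    # KL too large -> increase lambda
        else:
            hi = mid
    return hi                           # hi always satisfies the KL budget

# Toy example: three candidate outputs for one prompt q.
pi_old = np.array([0.5, 0.3, 0.2])
rewards = np.array([1.0, 0.0, -1.0])
lam = solve_lambda(pi_old, rewards, delta=0.05)
pi_star = tilted_policy(pi_old, rewards, lam)
print(kl(pi_star, pi_old))  # ≈ 0.05: the per-query KL constraint is active
```

The sketch makes the "query-adaptive" part concrete: each prompt q gets its own λq chosen so that its own KL constraint binds, instead of one global clipping threshold shared across all prompts.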