Unifying Stable Optimization and Reference Regularization in RLHF


Reinforcement Learning from Human Feedback (RLHF) has advanced alignment capabilities significantly but remains hindered by two core challenges: reward hacking and unstable optimization. Current solutions address these issues independently through separate regularization strategies: a KL-divergence penalty against a supervised fine-tuned model (π₀) to mitigate reward hacking, and policy ratio clipping towards the current policy (πₜ) to promote stable updates. However, the implicit trade-off that arises from simultaneously regularizing towards both π₀ and πₜ remains under-explored. In this paper, we introduce a unified regularization approach that explicitly balances the objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss with a superior trade-off, improving alignment results while reducing implementation complexity. Extensive experiments across diverse benchmarks validate that our method consistently outperforms RLHF and online preference learning methods, achieving enhanced alignment performance and stability.


💡 Research Summary

The paper tackles two persistent problems in Reinforcement Learning from Human Feedback (RLHF): reward hacking, where the policy over‑optimizes a learned reward model and produces poor actual behavior, and unstable policy updates, which can cause catastrophic policy collapse. Existing RLHF pipelines typically address these issues separately: a KL‑divergence penalty against an initial supervised fine‑tuned model (π₀) mitigates reward hacking, while PPO‑style policy ratio clipping against the current policy (πₜ) enforces stable updates. The authors argue that this dual‑regularization creates an implicit conflict: the policy must stay simultaneously close to both π₀ and πₜ, which can overly restrict the feasible region, especially when high‑reward actions lie outside the support of π₀.
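
To make the two separate regularizers concrete, here is a minimal PyTorch-style sketch of how they typically enter a PPO-based RLHF loss. The function name, coefficient values, and the single-sample KL estimate are illustrative assumptions, not the paper's implementation:

```python
import torch

def ppo_kl_step_loss(logp_theta, logp_t, logp_0, advantages,
                     clip_eps=0.2, kl_coef=0.1):
    """Standard RLHF loss with the two regularizers applied separately.

    logp_theta: log-probs of sampled tokens under the policy being updated
    logp_t:     log-probs under the behavior policy pi_t (detached, no grad)
    logp_0:     log-probs under the SFT reference pi_0 (detached, no grad)
    """
    # PPO-style ratio clipping regularizes updates towards the current policy pi_t.
    ratio = torch.exp(logp_theta - logp_t)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # A separate KL penalty regularizes towards the SFT reference pi_0
    # (a simple single-sample estimate of KL(pi_theta || pi_0)).
    kl_to_ref = (logp_theta - logp_0).mean()

    return policy_loss + kl_coef * kl_to_ref
```

The conflict the authors highlight is visible here: the clipping term constrains the update relative to πₜ while the KL term simultaneously pulls the policy back towards π₀, and the two constraints can shrink the feasible region when they disagree.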

To resolve this, they propose a dual-KL regularization objective that combines KL penalties towards both π₀ and πₜ in a weighted manner. Schematically, with a penalty strength β and a mixing weight λ (notation chosen here to match the description; the paper's exact symbols may differ):

  J_dual-KL(θ) = E_{x∼D, y∼π_θ} [ r(x, y) ] − β [ (1 − λ) · KL(π_θ ‖ π₀) + λ · KL(π_θ ‖ πₜ) ]
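
Under the same illustrative assumptions, a minimal PyTorch-style sketch of a Monte-Carlo surrogate for this objective, using single-sample estimates of the two KL terms:

```python
import torch

def dual_kl_loss(logp_theta, logp_t, logp_0, rewards, beta=0.1, lam=0.5):
    """Sketch of the dual-KL objective, negated for gradient descent.

    logp_theta: log-probs of sampled responses under the policy being updated
    logp_t:     log-probs under the current policy pi_t (detached, no grad)
    logp_0:     log-probs under the SFT reference pi_0 (detached, no grad)
    rewards:    detached scalar rewards for the sampled responses
    """
    # REINFORCE-style surrogate: its gradient estimates grad E[r(x, y)].
    reward_term = (logp_theta * rewards).mean()

    # Single-sample estimates of the two KL penalties.
    kl_to_ref = (logp_theta - logp_0).mean()   # KL(pi_theta || pi_0)
    kl_to_cur = (logp_theta - logp_t).mean()   # KL(pi_theta || pi_t)

    objective = reward_term - beta * ((1 - lam) * kl_to_ref + lam * kl_to_cur)
    return -objective  # minimize the negative of the objective
```

In this reading, λ = 0 recovers the standard KL-to-π₀ penalty, λ = 1 penalizes only drift from πₜ, and intermediate values trade protection against reward hacking against stability of the update.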

