f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning algorithms, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (math reasoning) and PA tasks (safety alignment), demonstrating superior performance and flexibility compared to current methods.
💡 Research Summary
The paper unifies two dominant paradigms for large‑language‑model (LLM) alignment—Preference Alignment (PA) and Reinforcement Learning with Verifiable Rewards (RLVR)—under a single f‑divergence framework. Building on prior work that showed PA objectives can be interpreted as estimators of an f‑divergence between aligned (chosen) and unaligned (rejected) response distributions, the authors extend this perspective to the RLVR setting where only an external scalar reward r(x, y) is available.
The core contributions are two new loss families. First, f‑GRPO generalizes the existing GRPO on‑policy algorithm by embedding a generic f‑divergence variational representation into the advantage‑based update. Using importance sampling with the current policy π_{θ_old}, the method constructs truncated importance weights that select only responses whose rewards lie above (or below) the policy's average. These weights are combined with a monotone link function g and the convex conjugate f* to form a per‑sample term ψ(r_{θ,i}, a_i). The final loss, L_f‑GRPO(θ)=E_x
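To make the construction concrete, the sketch below illustrates how group-relative truncated importance weights and a per-sample term ψ might be assembled. It is a hypothetical illustration, not the paper's implementation: it assumes KL as the f-divergence (so f(t) = t log t with conjugate f*(u) = exp(u − 1)), uses the log importance ratio as an identity link g, and invents the function names `f_star_kl` and `f_grpo_per_sample_terms` for exposition.

```python
import numpy as np

def f_star_kl(u):
    # Convex conjugate of f(t) = t*log(t) (KL divergence): f*(u) = exp(u - 1).
    # This is an illustrative choice of f; the framework admits any f-divergence.
    return np.exp(u - 1.0)

def f_grpo_per_sample_terms(log_ratios, rewards):
    """Hypothetical per-sample terms of an f-GRPO-style loss for one group.

    log_ratios: log pi_theta(y_i|x) - log pi_theta_old(y_i|x), shape (G,)
    rewards:    scalar rewards r(x, y_i), shape (G,)
    """
    ratios = np.exp(log_ratios)
    # Split the group by the policy's average reward, as in GRPO-style baselines.
    above = rewards > rewards.mean()
    # Truncated importance weights: keep the ratio only on the selected side.
    w_pos = np.where(above, ratios, 0.0)   # responses better than average
    w_neg = np.where(~above, ratios, 0.0)  # responses worse than average
    # Variational f-divergence form E_p[g(T)] - E_q[f*(g(T))], with the
    # above-average responses playing p and the rest playing q.
    g = log_ratios  # identity link on the log-ratio (assumed for illustration)
    psi = w_pos * g - w_neg * f_star_kl(g)
    # Minimizing the negative estimate pushes probability mass toward
    # above-average responses and away from below-average ones.
    return -psi
```

Averaging these terms over the group and over prompts x would give a Monte Carlo estimate of one plausible instance of the loss family described above.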