APO: Alpha-Divergence Preference Optimization
Original Info
- Title: APO: Alpha-Divergence Preference Optimization
- ArXiv ID: 2512.22953
- Date: 2025-12-28
- Authors: Wang Zixian
Abstract
Two divergence regimes dominate modern alignment practice. Supervised fine-tuning and many distillation-style objectives implicitly minimize the forward KL, KL(p ∥ π_θ), yielding stable mode-covering updates but often under-exploiting high-reward modes. In contrast, PPO-style online RLHF behaves closer to the reverse KL, KL(π_θ ∥ p), enabling mode-seeking improvements but risking mode collapse. Recent anchored methods (e.g., ADPO) show that performing the projection in anchored coordinates can substantially improve stability, yet they typically commit to a single divergence. We introduce APO (α-Divergence Preference Optimization), an anchored framework that uses the Csiszár α-divergence to continuously interpolate between forward- and reverse-KL behavior within the same anchored geometry. We derive unified gradient dynamics parameterized by α, analyze gradient variance properties, and propose a practical reward + confidence guarded α schedule that transitions from coverage to exploitation only when the policy is both improving and confidently calibrated. Experiments on Qwen3-1.7B with math-level3 show that APO achieves competitive performance with GRPO and GSPO baselines while maintaining training stability.
Full Content
• Forward-KL (mode-covering) objectives, e.g., supervised fine-tuning (SFT) and distillation, which minimize KL(p ∥ π_θ) for a target distribution p. These methods are stable and “zero-avoiding” but can produce overly averaged behavior.
• Reverse-KL-like (mode-seeking) objectives, e.g., PPO [2] and GRPO [4]-style online RLHF, which emphasize high-reward modes and can achieve higher ceilings, but are sensitive to variance, overconfidence, and mode collapse.
The recently proposed ADPO [13] perspective suggests that a major source of instability is not only the divergence choice, but also where we perform the projection. By anchoring the policy in log-ratio coordinates relative to a reference policy (e.g., π_ref = π_old in on-policy RL), one obtains better-conditioned geometry and an implicit trust region via temperature scaling.
This paper pushes the idea further: on top of anchored coordinates, we replace the single forward KL with a family of divergences. Concretely, we propose APO: α-Divergence Preference Optimization. The same training pipeline can start in a forward-KL-like regime (coverage, safety, stability) and smoothly transition to a reverse-KL-like regime (exploitation, higher peak reward) by scheduling α.
Scope. We focus on online LLM RLHF with group sampling (multiple completions per prompt), because this is where reverse-KL-like updates (PPO/GRPO) are practically valuable and where collapse is most painful. We use a Boltzmann soft target over the sampled candidate set, which closely matches modern GRPO-style pipelines.
Preference Optimization and RLHF. Reinforcement Learning from Human Feedback (RLHF) typically involves learning a reward model from preferences and then optimizing a policy via PPO [2,5,6]. Direct Preference Optimization (DPO) [1] simplifies this by deriving a closed-form solution to the KL-constrained reward maximization problem, optimizing policy-reference log ratios directly. Recent variants extend this paradigm: IPO [7] adds a regularization term to prevent overfitting, SimPO [8] simplifies the reference-free objective, and KTO [9] uses Kahneman-Tversky value functions. APO differs by introducing a continuous divergence family rather than committing to a single objective.
Group-Relative Policy Optimization. GRPO [4] extends PPO to preference learning by normalizing advantages within groups of sampled completions, enabling efficient online RLHF without a separate reward model. GSPO [10] improves GRPO-style training stability by operating with sequence-level objectives/ratios. GTPO [11] analyzes GRPO instability (e.g., gradient conflicts and collapse) and introduces gradient/entropy control for stabilization. G²RPO-A [12] studies guided GRPO configurations and proposes an adaptive guidance mechanism that adjusts guidance during training. These methods share a common theme with APO: dynamically adjusting the aggressiveness of RL updates based on training signals. APO differs by controlling this through the divergence family (α) rather than clipping thresholds or guidance weights.
f-Divergences in Machine Learning. The f-divergence family [15,16] provides a unified framework for measuring distributional discrepancy. The Csiszár-Amari α-divergence [15,18] is particularly attractive because it continuously connects forward and reverse KL (distinct from Rényi divergence [17], which shares the name but has different properties). Prior work has explored f-divergences in variational inference [19,20], GANs [21], and imitation learning [22]. In RL, αPPO [14] systematically studied α-divergence as a trust-region constraint for PPO, finding that intermediate α values often outperform pure KL. APO builds on this insight but applies it to the objective function (not the constraint) and introduces a confidence-guarded schedule for LLM RLHF.
Trust-Region Methods. Ensuring stable policy updates is a central challenge in RL. TRPO [3] enforces stability via explicit KL constraints, while PPO [2] approximates this with ratio clipping. ADPO [13] shows that anchored coordinates provide an implicit trust region via temperature-scaled curvature. APO extends this by allowing the divergence itself to be scheduled, providing an additional degree of freedom for balancing exploration and exploitation.
We work in the standard group-based RLHF setting. For each prompt (context) x, we sample a candidate set Y_x = {y_1, . . . , y_G} of G completions from the current policy. Let ℓ_i = log π_θ(y_i | x) be the student log-probabilities and ℓ_i^ref = log π_ref(y_i | x) be the anchor log-probabilities. In online RLHF, we use on-policy anchoring, π_ref = π_old, where π_old is the sampling policy from the previous iteration.
Define anchored logits u_i = (ℓ_i − ℓ_i^ref) / τ_anc and the induced distribution over the candidate set p_θ(i | Y_x) = exp(u_i) / Σ_{j=1}^{G} exp(u_j).
The temperature τ_anc > 0 controls curvature scaling in anchored coordinates and plays the same stabilizing role as an implicit trust region: smaller τ_anc penalizes deviations from the anchor more strongly (see Section 4.4).
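A minimal PyTorch sketch of this construction (illustrative only; it assumes per-completion sequence log-probabilities as inputs, and the function name is not from the original implementation):

```python
import torch

def anchored_distribution(student_logps: torch.Tensor,
                          ref_logps: torch.Tensor,
                          tau_anc: float) -> torch.Tensor:
    """p_theta(. | Y_x): softmax over anchored logits u_i = (l_i - l_i^ref) / tau_anc.

    student_logps, ref_logps: shape (G,) sequence log-probabilities of the G sampled
    completions under the current policy and the anchor (e.g., pi_old).
    """
    u = (student_logps - ref_logps) / tau_anc
    return torch.softmax(u, dim=-1)
```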
We define a target distribution q(· | Y_x) on the candidate set using a Boltzmann (softmax) transformation of group-relative advantages. Compute group-relative advantages A_i via z-score normalization, A_i = (r_i − mean(r)) / (std(r) + ε), where r_i is the reward for completion y_i and ε > 0 is a small constant for numerical stability. The Boltzmann target is q(i | Y_x) = exp(A_i / β_T) / Σ_{j=1}^{G} exp(A_j / β_T), where β_T > 0 controls target sharpness. Smaller β_T makes q concentrate more on the best responses.
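An analogous sketch for the target (illustrative; works for binary or real-valued rewards):

```python
import torch

def boltzmann_target(rewards: torch.Tensor, beta_T: float, eps: float = 1e-6) -> torch.Tensor:
    """q(. | Y_x): softmax of z-scored group-relative advantages at temperature beta_T.

    Smaller beta_T concentrates q on the highest-reward completions in the group.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    return torch.softmax(advantages / beta_T, dim=-1)
```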
Let p_θ(·) = p_θ(· | Y_x) be the anchored student distribution over the candidate set (we write p_θ for brevity) and let q(·) = q(· | Y_x) be the Boltzmann target. We define the APO objective using the Csiszár α-divergence
D_α(q ∥ p_θ) = (1 / (α(1 − α))) (1 − Σ_i q(i)^α p_θ(i)^{1−α}),    (7)
where the sum runs over the candidate set Y_x.
The α-divergence continuously interpolates between forward and reverse KL:
lim_{α→1} D_α(q ∥ p) = KL(q ∥ p) (forward KL, mode-covering),
lim_{α→0} D_α(q ∥ p) = KL(p ∥ q) (reverse KL, mode-seeking).
We restrict α ∈ (0, 1) throughout this work, which already interpolates between forward and reverse KL and yields a monotone “mode-covering → mode-seeking” path. Extending to α outside (0, 1) is mathematically valid but leads to less interpretable optimization behavior in our RLHF setting, so we leave it to future work. This provides a principled way to transition between SFT-like stability (α → 1) and PPO-like exploitation (α → 0).
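The interpolation can be checked numerically on a toy candidate set; the following sketch (illustrative only) evaluates D_α near the two endpoints and compares against the corresponding KLs:

```python
import torch

def alpha_divergence(q: torch.Tensor, p: torch.Tensor, alpha: float) -> torch.Tensor:
    """Csiszar alpha-divergence D_alpha(q || p) over a finite candidate set."""
    return (1.0 - torch.sum(q**alpha * p**(1.0 - alpha))) / (alpha * (1.0 - alpha))

def kl(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.sum(a * (a / b).log())

torch.manual_seed(0)
q = torch.softmax(torch.randn(8), dim=-1)
p = torch.softmax(torch.randn(8), dim=-1)
print(alpha_divergence(q, p, 0.999).item(), kl(q, p).item())  # ~ forward KL(q || p)
print(alpha_divergence(q, p, 0.001).item(), kl(p, q).item())  # ~ reverse KL(p || q)
```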
Define the ratio w(i) = q(i) / p_θ(i).
Theorem 4.1 (Unified gradient for α-divergence). Assume q(i) > 0 whenever p_θ(i) > 0 on the candidate set. Then the gradient of Equation (7) w.r.t. θ can be written as
∇_θ D_α(q ∥ p_θ) = −(1/α) E_{i∼p_θ}[ w(i)^α ∇_θ log p_θ(i) ].    (11)
Proof. Let L_α = Σ_i q(i)^α p_θ(i)^{1−α}. Taking the derivative,
∇_θ L_α = (1 − α) Σ_i q(i)^α p_θ(i)^{−α} ∇_θ p_θ(i) = (1 − α) Σ_i p_θ(i) w(i)^α ∇_θ log p_θ(i),
so ∇_θ D_α(q ∥ p_θ) = −∇_θ L_α / (α(1 − α)) = −(1/α) E_{i∼p_θ}[ w(i)^α ∇_θ log p_θ(i) ],
where we used q(i)^α p_θ(i)^{1−α} = p_θ(i) w(i)^α and ∇_θ p_θ(i) = p_θ(i) ∇_θ log p_θ(i). □
Remark 4.2 (Limiting gradient forms). As α → 1, using w^α = w we recover the forward-KL gradient
∇_θ KL(q ∥ p_θ) = −E_{i∼q}[ ∇_θ log p_θ(i) ].
As α → 0, using the expansion w^α = 1 + α log w + o(α), the fact that E_{p_θ}[∇_θ log p_θ] = 0, and canceling the 1/α prefactor, we recover the reverse-KL gradient
∇_θ KL(p_θ ∥ q) = E_{i∼p_θ}[ (log p_θ(i) − log q(i)) ∇_θ log p_θ(i) ].
The detailed limiting analysis is standard and omitted for brevity.
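The unified gradient can also be verified mechanically: for a softmax parameterization of the candidate-set distribution, ∇_u log p_θ(i) = e_i − p_θ, so the closed form in Equation (11) can be compared against autograd. A short illustrative check (toy tensors, names not from the original code):

```python
import torch

def alpha_divergence(q: torch.Tensor, p: torch.Tensor, alpha: float) -> torch.Tensor:
    return (1.0 - torch.sum(q**alpha * p**(1.0 - alpha))) / (alpha * (1.0 - alpha))

torch.manual_seed(0)
alpha = 0.7
u = torch.randn(8, requires_grad=True)        # anchored logits for G = 8 candidates
p = torch.softmax(u, dim=-1)                  # p_theta over the candidate set
q = torch.softmax(torch.randn(8), dim=-1)     # an arbitrary fixed target

autograd_grad, = torch.autograd.grad(alpha_divergence(q, p, alpha), u)

# Closed form from Equation (11): grad_u D_alpha = -(1/alpha) sum_i p_i w_i^alpha (e_i - p),
# which collapses to the vector below.
w = q / p
s = p * w**alpha
closed_form = -(1.0 / alpha) * (s - s.sum() * p)
print(torch.allclose(autograd_grad, closed_form.detach(), atol=1e-5))  # expect True
```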
Interpretation. Equation (11) shows that α controls how the score function is reweighted by w^α:
• α → 1 (forward KL): w^α → w, so samples where q ≫ p_θ (under-represented by the policy) get high weight. This encourages coverage.
• α → 0 (reverse KL): w^α → 1, so all samples are weighted equally by the policy. Combined with the score function, this encourages concentration on modes where p_θ already has mass.
The choice of α affects not only the optimization landscape but also the variance of gradient estimates.
Let g_α(i) = w(i)^α ∇_θ log p_θ(i) be the per-sample gradient direction.
Under the heuristic assumption that ‖∇_θ log p_θ(i)‖ does not vary excessively across candidates (commonly adopted to isolate the effect of importance weighting), the variance of the gradient estimator scales with the second moment of the weights, E_{i∼p_θ}[ w(i)^{2α} ] (illustrated numerically after the list below).
When q and p_θ are close, w ≈ 1 and the variance is low for all α. When q and p_θ differ significantly:
• Large α (→ 1): High variance due to large w^{2α} terms when q(i) ≫ p_θ(i).
• Small α (→ 0): Lower variance but potentially higher bias (mode-seeking).
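A quick numeric illustration (toy distributions, illustrative only) of how the second moment E_{p_θ}[w^{2α}] grows with α when q and p_θ disagree:

```python
import torch

torch.manual_seed(0)
p = torch.softmax(torch.randn(8), dim=-1)        # student over G = 8 candidates
q = torch.softmax(3.0 * torch.randn(8), dim=-1)  # deliberately mismatched target
w = q / p
for alpha in (0.1, 0.5, 0.9):
    # Proxy for the gradient-variance scale; increases with alpha under mismatch.
    print(alpha, float((p * w**(2 * alpha)).sum()))
```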
Why not start with small α? A natural question is: if small α has lower variance, why not use it from the beginning? The answer lies in the bias-variance trade-off and the support-coverage requirement:
• α → 1 (forward KL): The gradient weights w^α → w = q/p_θ upweight samples where q ≫ p_θ, forcing the policy to cover the target’s support. This is essential early in training when p_θ may not yet overlap well with high-reward regions.
• α → 0 (reverse KL): The limiting gradient E_{p_θ}[(log p_θ − log q) ∇_θ log p_θ] penalizes “misplaced mass” where p_θ > q, driving concentration on modes where q is already high. This is effective after the policy has found the good modes, but dangerous early on: it can lock into suboptimal modes before discovering better ones.
Thus, the schedule from α_max → α_min implements a natural curriculum: first find the high-reward modes (coverage), then concentrate on them (exploitation).
Following ADPO [13], the anchored coordinates induce curvature scaling. The Fisher information of the anchored policy p_θ over the candidate set, taken with respect to the log-ratios ℓ − ℓ^ref, is (1/τ_anc²)(Diag(p_θ) − p_θ p_θ^⊤).
The local quadratic approximation of the α-divergence loss near the optimum u* is
D_α(q ∥ p_{u*+δ}) ≈ (1/2) δ^⊤ F_p δ,
where F_p = Diag(p) − p p^⊤ is the Fisher information metric at the optimum. For the standard normalized α-divergence (as defined in Equation (7)), the second-order expansion matches the KL divergence up to a constant scaling factor independent of the optimization direction δ. This theoretical property suggests that the local implicit trust region is predominantly governed by the anchor temperature τ_anc, providing a unified stability mechanism across the α family. The effect of α is thus largely orthogonal to local stability: it controls the global optimization trajectory (mode-covering vs. mode-seeking) when the policy is far from the target.
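For completeness, a short sketch of why the second-order term is α-independent, using the standard f-divergence expansion with the generator implied by Equation (7):

```latex
% f-divergence form of the normalized alpha-divergence and its local expansion.
\[
  D_\alpha(q\,\|\,p) \;=\; \sum_i p(i)\, f_\alpha\!\Big(\tfrac{q(i)}{p(i)}\Big),
  \qquad
  f_\alpha(t) \;=\; \frac{1 - t^{\alpha}}{\alpha(1-\alpha)},
  \qquad
  f_\alpha(1) = 0,\quad f_\alpha''(1) = 1 .
\]
% Both q and p are normalized, so the first-order term of the Taylor expansion
% around t = 1 sums to zero, leaving
\[
  D_\alpha(q\,\|\,p)
  \;=\; \tfrac{1}{2} \sum_i \frac{\big(q(i) - p(i)\big)^{2}}{p(i)}
  \;+\; O\!\big(\lVert q - p \rVert^{3}\big),
\]
% which is the same local quadratic as KL(q || p), independent of alpha.
```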
The key practical question is how to set α over training. We propose a simple rule: become more mode-seeking only when the policy is both (i) confident and (ii) improving. This avoids the “confident-but-wrong” failure mode, where entropy collapses while reward stagnates or decreases.
We compute two scalars per update:
• Confidence from the entropy of the candidate-set distribution, mapped to a score c_t ∈ [0, 1]: c_t → 0 indicates uncertain/high-entropy policies; c_t → 1 indicates confident/low-entropy policies.
• Improvement from reward gain (time-series signal), computed from the current reward relative to a baseline. Here b_t is an EMA reward baseline, s_r > 0 is a reward scale (e.g., a running std), and g_t ∈ [0, 1] discards negative gains, which is crucial for the guard behavior.
Let α_max ≲ 1 (coverage, e.g., 0.9) and α_min > 0 (exploitation, e.g., 0.35). For mathematical reasoning tasks with sparse binary rewards, we recommend a moderate α_min (our experiments use α_min = 0.35). The schedule interpolates from α_max toward α_min through the multiplicative gate c_t · g_t,
α_t = α_max − (α_max − α_min) · c_t · g_t,
optionally smoothed by an EMA over training steps (a concrete sketch follows Remark 5.1).
Remark 5.1 (Why the multiplicative gate matters). If we used an additive schedule driven by c_t + g_t, then a policy that is confident but not improving could still be pushed toward mode-seeking updates, risking collapse. The multiplicative gate c_t · g_t ensures that α only decreases when both conditions are met.
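One plausible instantiation of the guarded schedule (the concrete formulas for c_t and g_t, the state bookkeeping, and all names below are illustrative assumptions consistent with the description above, not a verbatim reproduction):

```python
import math
import torch

def guarded_alpha(p_theta: torch.Tensor, mean_reward: float, state: dict,
                  alpha_max: float = 0.9, alpha_min: float = 0.35,
                  ema: float = 0.1) -> float:
    """Reward + confidence guarded alpha schedule (one plausible instantiation).

    c_t: confidence, taken here as 1 minus the normalized entropy of p_theta(. | Y_x).
    g_t: improvement, reward gain over an EMA baseline scaled by s_r, clipped to [0, 1].
    alpha_t interpolates from alpha_max (coverage) to alpha_min (exploitation)
    through the multiplicative gate c_t * g_t.
    """
    G = p_theta.numel()
    entropy = -(p_theta * p_theta.clamp_min(1e-12).log()).sum().item()
    c_t = 1.0 - entropy / math.log(G)
    g_t = min(max((mean_reward - state["baseline"]) / max(state["scale"], 1e-6), 0.0), 1.0)
    alpha_t = alpha_max - (alpha_max - alpha_min) * c_t * g_t
    # EMA smoothing of the schedule and update of the reward baseline.
    state["alpha"] = (1.0 - ema) * state.get("alpha", alpha_max) + ema * alpha_t
    state["baseline"] = (1.0 - ema) * state["baseline"] + ema * mean_reward
    return state["alpha"]
```

Here `state` carries the EMA baseline b_t (key `baseline`), the reward scale s_r (key `scale`), and the smoothed α between updates.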
Behavior in different regimes.
• Early training (c_t low, g_t variable): α_t ≈ α_max, forward-KL-like, stable coverage.
• Confident and improving (c_t high, g_t high): α_t → α_min, reverse-KL-like, mode-seeking.
• Confident but stuck/degrading (c_t high, g_t ≈ 0): α_t → α_max, pulling back to coverage to escape the local minimum.
When α_t is small, the effective weights w^α can be heavy-tailed. In practice, one can clip w^α to an upper bound, similarly to PPO-style ratio clipping.
6 Algorithm
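As a consolidated reference, the following sketch assembles the pieces above (anchored distribution, Boltzmann target, detached-weight surrogate) into a single loss for one prompt's group. It is an illustrative reading of the algorithm, not the released implementation, and all names are our own:

```python
import torch

def apo_loss(policy_logps: torch.Tensor, ref_logps: torch.Tensor,
             rewards: torch.Tensor, alpha: float,
             tau_anc: float = 0.8, beta_T: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """APO loss for one prompt's group of G completions (sketch).

    The gradient of the returned scalar follows Equation (11):
    -(1/alpha) * E_{i ~ p_theta}[ w_i^alpha * grad log p_theta(i) ],  w = q / p_theta.
    """
    # Anchored student distribution over the candidate set.
    u = (policy_logps - ref_logps.detach()) / tau_anc
    log_p = torch.log_softmax(u, dim=-1)
    p = log_p.exp()
    # Boltzmann target from z-scored group-relative advantages.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    q = torch.softmax(advantages / beta_T, dim=-1)
    # Detached-weight surrogate so autograd only differentiates log p_theta.
    w = (q / p).detach()
    return -(1.0 / alpha) * torch.sum(p.detach() * w.pow(alpha) * log_p)

# Toy usage: one group of G = 8 completions with binary rewards.
policy_logps = torch.randn(8, requires_grad=True)
ref_logps = policy_logps.detach() + 0.1 * torch.randn(8)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
apo_loss(policy_logps, ref_logps, rewards, alpha=0.9).backward()
```

In a full training loop, α would be set per update by the guarded schedule of Section 5 and the per-group losses averaged across the batch before the optimizer step.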
Baselines.
• GSPO [10]: An improved GRPO variant that enhances training stability via sequence-level objectives.
• ADPO-Softmax: Standard ADPO using the softmax loss variant (loss_variant: softmax), with the cross-entropy formulation L = −Σ_i q(i) log p_θ(i).
Configuration. All methods share: learning rate 1.5 × 10⁻⁵ (cosine decay), batch size 8, gradient accumulation 16, G = 8 generations per prompt, max completion length 1024, 2 epochs. For APO: τ_anc = 0.8, β_T = 1.0, α_max = 0.9, α_min = 0.35, two schedule smoothing coefficients set to 0.1, and s_r initialized to 0.5 and updated via a running std.
Figure 1 shows the training curves for all five algorithms. We observe the following:
• Comparable performance: All five algorithms achieve a similar final reward of around 0.6-0.7. The APO variants (Adaptive ESS, Fixed α, Legacy) perform on par with GSPO and ADPO-Softmax.
• Stability: All anchored methods maintain stable training dynamics without significant reward collapse, validating the effectiveness of the reference-model anchoring.
• APO flexibility: While APO does not surpass GSPO in this particular setup, it provides a unified framework for exploring different divergence behaviors through α scheduling.
Discussion. The results suggest that for this specific task (mathematical reasoning with binary rewards), the choice of divergence within the anchored framework has limited impact on final performance. However, APO offers a theoretical unification of forward-KL and reverse-KL optimization, which may be beneficial in:
• Tasks with denser reward signals where mode-seeking exploration is more important.
• Settings where the reward landscape is more complex and the coverage-exploitation trade-off matters.
• Future work combining APO with other techniques such as reward shaping or curriculum learning.
We introduced APO (α-Divergence Preference Optimization), an anchored preference optimization framework that uses the Csiszár α-divergence to continuously interpolate between forward-KL-like (mode-covering) and reverse-KL-like (mode-seeking) optimization. Our key contribution is the reward + confidence guarded α schedule, which transitions from stable coverage to exploitation only when the policy is both confident and improving, preventing the “confident-but-wrong” collapse pattern. Experiments on Qwen3-1.7B with mathematical reasoning tasks demonstrate that APO can achieve competitive performance while maintaining training stability.