APO: Alpha-Divergence Preference Optimization


๐Ÿ“ Original Info

  • Title: APO: Alpha-Divergence Preference Optimization
  • ArXiv ID: 2512.22953
  • Date: 2025-12-28
  • Authors: Wang Zixian

๐Ÿ“ Abstract

Two divergence regimes dominate modern alignment practice. Supervised fine-tuning and many distillation-style objectives implicitly minimize the forward KL, KL(q ‖ π_θ), yielding stable mode-covering updates but often under-exploiting high-reward modes. In contrast, PPO-style online RLHF behaves closer to reverse KL, KL(π_θ ‖ q), enabling mode-seeking improvements but risking mode collapse. Recent anchored methods (e.g., ADPO) show that performing the projection in anchored coordinates can substantially improve stability, yet they typically commit to a single divergence. We introduce APO (α-Divergence Preference Optimization), an anchored framework that uses the Csiszár α-divergence to continuously interpolate between forward- and reverse-KL behavior within the same anchored geometry. We derive unified gradient dynamics parameterized by α, analyze gradient variance properties, and propose a practical reward + confidence guarded α schedule that transitions from coverage to exploitation only when the policy is both improving and confidently calibrated. Experiments on Qwen3-1.7B with math-level3 show that APO achieves competitive performance with GRPO and GSPO baselines while maintaining training stability.

📄 Full Content

Large language model (LLM) alignment is often practiced through two seemingly different optimization paradigms:

• Forward-KL (mode-covering) objectives, e.g., supervised fine-tuning (SFT) and distillation, which minimize KL(q ‖ π_θ) for a target distribution q. These methods are stable and “zero-avoiding” but can produce overly averaged behavior.

• Reverse-KL-like (mode-seeking) objectives, e.g., PPO [2] and GRPO [4]-style online RLHF, which emphasize high-reward modes and can achieve higher ceilings, but are sensitive to variance, overconfidence, and mode collapse.

The recently proposed ADPO [13] perspective suggests that a major source of instability is not only the divergence choice, but also where we perform the projection. By anchoring the policy in log-ratio coordinates relative to a reference policy (e.g., π_ref = π_old in on-policy RL), one obtains better-conditioned geometry and an implicit trust region via temperature scaling.

This paper pushes the idea further: on top of anchored coordinates, we replace the single forward KL with a family of divergences. Concretely, we propose APO: α-Divergence Preference Optimization. The same training pipeline can start in a forward-KL-like regime (coverage, safety, stability) and smoothly transition to a reverse-KL-like regime (exploitation, higher peak reward) by scheduling α.

Scope. We focus on online LLM RLHF with group sampling (multiple completions per prompt), because this is where reverse-KL-like updates (PPO/GRPO) are practically valuable and where collapse is most painful. We use a Boltzmann soft target over the sampled candidate set, which closely matches modern GRPO-style pipelines.

Preference Optimization and RLHF. Reinforcement Learning from Human Feedback (RLHF) typically involves learning a reward model from preferences and then optimizing a policy via PPO [2,5,6]. Direct Preference Optimization (DPO) [1] simplifies this by deriving a closed-form solution to the KL-constrained reward maximization problem, optimizing policy-reference log ratios directly. Recent variants extend this paradigm: IPO [7] adds a regularization term to prevent overfitting, SimPO [8] simplifies the reference-free objective, and KTO [9] uses Kahneman-Tversky value functions. APO differs by introducing a continuous divergence family rather than committing to a single objective.

Group-Relative Policy Optimization. GRPO [4] extends PPO to preference learning by normalizing advantages within groups of sampled completions, enabling efficient online RLHF without a separate reward model. GSPO [10] improves GRPO-style training stability by operating with sequence-level objectives/ratios. GTPO [11] analyzes GRPO instability (e.g., gradient conflicts and collapse) and introduces gradient/entropy control for stabilization. G²RPO-A [12] studies guided GRPO configurations and proposes an adaptive guidance mechanism that adjusts guidance during training. These methods share a common theme with APO: dynamically adjusting the aggressiveness of RL updates based on training signals. APO differs by controlling this through the divergence family (α) rather than clipping thresholds or guidance weights.

f-Divergences in Machine Learning. The f-divergence family [15,16] provides a unified framework for measuring distributional discrepancy. The Csiszár-Amari α-divergence [15,18] is particularly attractive because it continuously connects forward and reverse KL (distinct from the Rényi divergence [17], which shares the name but has different properties). Prior work has explored f-divergences in variational inference [19,20], GANs [21], and imitation learning [22]. In RL, αPPO [14] systematically studied the α-divergence as a trust-region constraint for PPO, finding that intermediate α values often outperform pure KL. APO builds on this insight but applies it to the objective function (not the constraint) and introduces a confidence-guarded schedule for LLM RLHF.

Trust-Region Methods. Ensuring stable policy updates is a central challenge in RL. TRPO [3] enforces stability via explicit KL constraints, while PPO [2] approximates this with ratio clipping. ADPO [13] shows that anchored coordinates provide an implicit trust region via temperature-scaled curvature. APO extends this by allowing the divergence itself to be scheduled, providing an additional degree of freedom for balancing exploration and exploitation.

We work in the standard group-based RLHF setting. For each prompt (context) x, we sample a candidate set S_x = {y_1, . . . , y_P} of P completions from the current policy. Let ℓ_i = log π_θ(y_i | x) be the student log-probabilities and ℓ_i^ref = log π_ref(y_i | x) be the anchor log-probabilities. In online RLHF, we use on-policy anchoring:

π_ref = π_old,

where π_old is the sampling policy from the previous iteration.

Define the anchored logits

u_i = (ℓ_i − ℓ_i^ref) / τ_anc,

and the induced distribution over the candidate set

π_θ(i | S_x) = exp(u_i) / Σ_j exp(u_j).

The temperature τ_anc > 0 controls curvature scaling in anchored coordinates and plays the same stabilizing role as an implicit trust region: a smaller τ_anc penalizes deviations from the anchor more strongly (see Section 4.4).
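To make the anchored construction concrete, here is a minimal NumPy sketch, assuming the ADPO-style form reconstructed above (anchored logits as temperature-scaled log-ratios, softmax over the candidate set). The function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def anchored_distribution(logp_student, logp_ref, tau_anc=0.8):
    """Anchored candidate-set distribution pi_theta(. | S_x).

    logp_student: log pi_theta(y_i | x) for each sampled completion (shape [P])
    logp_ref:     log pi_ref(y_i | x) under the anchor policy        (shape [P])
    tau_anc:      anchor temperature; smaller values penalize deviation more.
    """
    u = (np.asarray(logp_student) - np.asarray(logp_ref)) / tau_anc  # anchored logits
    u = u - u.max()                      # shift for numerical stability before softmax
    p = np.exp(u)
    return p / p.sum()

# toy example: 4 completions, student slightly diverged from the anchor
p = anchored_distribution(
    logp_student=[-12.0, -10.5, -11.0, -13.0],
    logp_ref=[-12.0, -11.5, -11.0, -12.5],
)
print(p)  # sums to 1; completions where the student gained log-prob get more mass
```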

We define a target distribution q(· | S_x) on the candidate set using a Boltzmann (softmax) transformation of group-relative advantages. Compute group-relative advantages A_i via z-score normalization:

A_i = (R_i − mean(R)) / (std(R) + ε),

where R_i is the reward for completion y_i and ε > 0 is a small constant for numerical stability. The Boltzmann target is:

q(i | S_x) = exp(A_i / β_r) / Σ_j exp(A_j / β_r),

where β_r > 0 controls target sharpness. A smaller β_r makes q concentrate more on the best responses.

Let ๐‘ ๐œƒ (โ€ข) = ฯ€๐œƒ (โ€ข | ๐‘† ๐‘ฅ ) be the anchored student distribution over the candidate set (we write ๐‘ ๐œƒ for brevity) and let ๐‘ž(โ€ข) = ๐‘ž(โ€ข | ๐‘† ๐‘ฅ ) be the Boltzmann target. We define the APO objective using the Csiszรกr ๐›ผ-divergence:

where

The ๐›ผ-divergence continuously interpolates between forward and reverse KL:

๐ท ๐›ผ (๐‘žโˆฅ ๐‘) = KL( ๐‘โˆฅ๐‘ž) (reverse KL, mode-seeking).

We restrict ๐›ผ โˆˆ (0, 1) throughout this work, which already interpolates between forward and reverse KL and yields a monotone “mode-covering โ†’ mode-seeking” path. Extending to ๐›ผ โˆ‰ (0, 1) is mathematically valid but leads to less interpretable optimization behavior in our RLHF setting, so we leave it to future work. This provides a principled way to transition between SFT-like stability (๐›ผ โ‰ˆ 1) and PPO-like exploitation (๐›ผ โ‰ˆ 0).
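The short script below evaluates the normalized α-divergence in the form written above and checks numerically that it approaches the forward KL as α → 1 and the reverse KL as α → 0. It is a verification sketch under that reconstructed form, not the paper's implementation.

```python
import numpy as np

def alpha_divergence(q, p, alpha):
    """Normalized Csiszar alpha-divergence D_alpha(q || p) for alpha in (0, 1)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return (1.0 - np.sum(q**alpha * p**(1.0 - alpha))) / (alpha * (1.0 - alpha))

def kl(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sum(a * np.log(a / b))

q = np.array([0.55, 0.25, 0.15, 0.05])   # sharp target
p = np.array([0.25, 0.25, 0.25, 0.25])   # uniform policy

print(alpha_divergence(q, p, 0.999), kl(q, p))  # ~ forward KL(q || p)
print(alpha_divergence(q, p, 0.001), kl(p, q))  # ~ reverse KL(p || q)
```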

Define the ratio

r(i) = q(i) / p_θ(i).

Theorem 4.1 (Unified gradient for α-divergence). Assume q(i) > 0 whenever p_θ(i) > 0 on the candidate set. Then the gradient of Equation (7) w.r.t. θ can be written as

∇_θ D_α(q ‖ p_θ) = −(1/α) E_{i∼p_θ} [ r(i)^α ∇_θ log p_θ(i) ].   (11)

Proof. Let L_α = Σ_i q(i)^α p_θ(i)^(1−α). Taking the derivative:

∇_θ L_α = (1 − α) Σ_i q(i)^α p_θ(i)^(−α) ∇_θ p_θ(i) = (1 − α) Σ_i p_θ(i) r(i)^α ∇_θ log p_θ(i),

so ∇_θ D_α = −∇_θ L_α / (α(1 − α)) = −(1/α) E_{p_θ}[ r^α ∇_θ log p_θ ], where we used q(i)^α p_θ(i)^(1−α) = p_θ(i) r(i)^α.

Remark 4.2 (Limiting gradient forms). As α → 1, using r^α = r we recover the forward-KL gradient:

−E_{i∼p_θ}[ r(i) ∇_θ log p_θ(i) ] = −Σ_i q(i) ∇_θ log p_θ(i) = ∇_θ KL(q ‖ p_θ).

As α → 0, using the expansion r^α = 1 + α log r + o(α) and canceling the 1/α prefactor (the O(1/α) term vanishes because E_{p_θ}[∇_θ log p_θ] = 0), we recover the reverse-KL gradient:

E_{i∼p_θ}[ (log p_θ(i) − log q(i)) ∇_θ log p_θ(i) ] = ∇_θ KL(p_θ ‖ q).

The detailed limiting analysis is standard and omitted for brevity.
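As a sanity check of the unified gradient (assuming the reconstructed form −(1/α) E_{p_θ}[r^α ∇_θ log p_θ] above), the sketch below parameterizes p directly through candidate-set logits u and compares the analytic gradient with a central finite-difference gradient of D_α. Names are illustrative.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def d_alpha(q, p, alpha):
    return (1.0 - np.sum(q**alpha * p**(1.0 - alpha))) / (alpha * (1.0 - alpha))

def analytic_grad(q, u, alpha):
    """-(1/alpha) * E_p[ r^alpha * grad_u log p ], using d log p_i / d u_j = 1[i=j] - p_j."""
    p = softmax(u)
    w = p * (q / p) ** alpha                   # p_i * r_i^alpha
    return -(w - p * w.sum()) / alpha

q = np.array([0.55, 0.25, 0.15, 0.05])
u = np.array([0.3, -0.1, 0.2, 0.0])
alpha, h = 0.5, 1e-6

# central finite differences of D_alpha with respect to each logit
num = np.array([
    (d_alpha(q, softmax(u + h * np.eye(len(u))[j]), alpha)
     - d_alpha(q, softmax(u - h * np.eye(len(u))[j]), alpha)) / (2 * h)
    for j in range(len(u))
])
print(np.allclose(analytic_grad(q, u, alpha), num, atol=1e-5))  # True
```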

Interpretation. Equation (11) shows that α controls how the score function is reweighted by r^α:

• α → 1 (forward KL): r^α → r, so samples where q ≫ p (under-represented by the policy) get high weight. This encourages coverage.

• α → 0 (reverse KL): r^α → 1, so all samples drawn from the policy are weighted equally. Combined with the score function, this encourages concentration on modes where p already has mass.

The choice of α affects not only the optimization landscape but also the variance of gradient estimates.

Let g_α(i) = r(i)^α ∇_θ log p_θ(i) be the per-sample gradient direction.

Under the heuristic assumption that ‖∇_θ log p_θ(i)‖ does not vary excessively across candidates (commonly adopted to isolate the effect of importance weighting), the variance of the gradient estimator scales as

Var(g_α) ∝ E_{p_θ}[ r^(2α) ] − ( E_{p_θ}[ r^α ] )².

When q and p_θ are close, r ≈ 1 and the variance is low for all α. When q and p_θ differ significantly:

• Large α (≈ 1): high variance due to large r^(2α) terms when q(i) ≫ p_θ(i).

• Small α (≈ 0): lower variance but potentially higher bias (mode-seeking).
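A quick numerical illustration of the E_{p_θ}[r^(2α)] scaling written above: when the target places mass where the policy does not, the second moment of the weights, and hence the gradient variance, grows sharply as α → 1. The numbers are purely illustrative.

```python
import numpy as np

p = np.array([0.70, 0.15, 0.10, 0.05])   # policy mass
q = np.array([0.05, 0.10, 0.15, 0.70])   # target mass concentrated where p is small
r = q / p

for alpha in (0.1, 0.35, 0.7, 0.9):
    second_moment = np.sum(p * r ** (2 * alpha))   # E_p[r^(2*alpha)]
    print(f"alpha={alpha:.2f}  E_p[r^(2a)]={second_moment:6.2f}")
# the weight second moment (hence gradient variance) grows sharply as alpha -> 1
```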

Why not start with small α? A natural question is: if small α has lower variance, why not use it from the beginning? The answer lies in the bias-variance trade-off and the support-coverage requirement:

• α → 1 (forward KL): the gradient weights r^α → r = q/p upweight samples where q ≫ p, forcing the policy to cover the target's support. This is essential early in training, when p_θ may not yet overlap well with high-reward regions.

• α → 0 (reverse KL): the limiting gradient E_p[(log p − log q) ∇ log p] penalizes “misplaced mass” where p > q, driving concentration on modes where q is already high. This is effective after the policy has found the good modes, but dangerous early on: it can lock into suboptimal modes before discovering better ones.

Thus, the schedule from α_max → α_min implements a natural curriculum: first find the high-reward modes (coverage), then concentrate on them (exploitation).

Following ADPO [13], the anchored coordinates induce curvature scaling. The Fisher information of the anchored policy p_θ on the candidate set, taken with respect to the anchored logits, is

F(p_θ) = Diag(p_θ) − p_θ p_θ^⊤,

so perturbations of the raw log-ratios ℓ − ℓ^ref enter the curvature scaled by 1/τ_anc².

The local quadratic approximation of the α-divergence loss near the optimum u★ is

D_α(q ‖ p_{u★ + δ}) ≈ (1/2) δ^⊤ F_q δ,

where F_q = Diag(q) − qq^⊤ is the Fisher information metric at the optimum. For the standard normalized α-divergence (as defined in Equation (7)), the second-order expansion matches the KL divergence up to a constant scaling factor independent of the optimization direction δ. This theoretical property suggests that the local implicit trust region is predominantly governed by the anchor temperature τ_anc, providing a unified stability mechanism across the α family. The effect of α is thus largely orthogonal to local stability: it controls the global optimization trajectory (mode-covering vs. mode-seeking) when the policy is far from the target.
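A small check of this α-independence, assuming the quadratic form (1/2) δ^⊤ F_q δ written above and using plain candidate-set logits as coordinates (with anchored logits the same curvature is rescaled by the anchor temperature): near the optimum, all α values agree with essentially the same quadratic.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def d_alpha(q, p, alpha):
    return (1.0 - np.sum(q**alpha * p**(1.0 - alpha))) / (alpha * (1.0 - alpha))

q = np.array([0.45, 0.30, 0.15, 0.10])
u_star = np.log(q)                         # logits whose softmax equals q
F_q = np.diag(q) - np.outer(q, q)          # Fisher metric at the optimum

rng = np.random.default_rng(0)
delta = 1e-3 * rng.standard_normal(4)      # small perturbation of the logits
quad = 0.5 * delta @ F_q @ delta

for alpha in (0.2, 0.5, 0.8):
    exact = d_alpha(q, softmax(u_star + delta), alpha)
    print(f"alpha={alpha}: exact={exact:.2e}  quadratic={quad:.2e}")
# near the optimum, every alpha gives nearly the same value as the quadratic form
```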

The key practical question is how to set α over training. We propose a simple rule: become more mode-seeking only when the policy is both (i) confident and (ii) improving. This avoids the “confident-but-wrong” failure mode, where entropy collapses while reward stagnates or decreases.

We compute two scalars per update:

• Confidence from entropy of the candidate-set distribution:

c_t = 1 − H( p_θ(· | S_x) ) / log P.

Here c_t ≈ 0 indicates uncertain/high-entropy policies; c_t ≈ 1 indicates confident/low-entropy policies.

• Improvement from reward gain (time-series signal):

g_t = clip( (R̄_t − b_t) / s_R , 0, 1 ).

Here R̄_t is the current mean reward, b_t is an EMA reward baseline, s_R > 0 is a reward scale (e.g., a running std), and g_t ∈ [0, 1] discards negative gains, which is crucial for the guard behavior.

Let ๐›ผ max โ‰ฒ 1 (coverage, e.g., 0.9) and ๐›ผ min > 0 (exploitation, e.g., 0.35). For mathematical reasoning tasks with sparse binary rewards, we recommend ๐›ผ min โˆˆ [0.

optionally smoothed by an EMA:

Remark 5.1 (Why the multiplicative gate matters). If we used an additive schedule α_t = φ(c_t + g_t), then a policy that is confident but not improving could still be pushed toward mode-seeking updates, risking collapse. The multiplicative gate c_t · g_t ensures that α only decreases when both conditions are met.

Behavior in different regimes.

• Early training (c_t low, g_t variable): α_t ≈ α_max; forward-KL-like, stable coverage.

• Confident and improving (c_t high, g_t high): α_t → α_min; reverse-KL-like, mode-seeking.

• Confident but stuck/degrading (c_t high, g_t ≈ 0): α_t ≈ α_max; pull back to coverage to escape the local minimum.

When ๐›ผ ๐‘ก is small, the effective weights ๐‘Ÿ ๐›ผ can be heavy-tailed. In practice, one can clip ๐‘Ÿ ๐›ผ similarly to PPO-style clipping:

6 Algorithm

• GSPO [10]: an improved GRPO variant that enhances training stability via sequence-level objectives.

• ADPO-Softmax: standard ADPO using the softmax loss variant (loss_variant: softmax), with the cross-entropy formulation L = −Σ_i q_i log p_i.

Configuration. All methods share: learning rate 1.5 × 10⁻⁵ (cosine decay), batch size 8, gradient accumulation 16, P = 8 generations per prompt, max completion length 1024, 2 epochs. For APO: τ_anc = 0.8, β_r = 1.0, α_max = 0.9, α_min = 0.35, ρ = 0.1, λ = 0.1, and s_R initialized to 0.5 and updated via a running std.

Figure 1 shows the training curves for all five algorithms. We observe the following:

• Comparable performance: all five algorithms achieve a similar final reward of around 0.6-0.7. The APO variants (Adaptive ESS, Fixed α, Legacy) perform on par with GSPO and ADPO-Softmax.

• Stability: all anchored methods maintain stable training dynamics without significant reward collapse, validating the effectiveness of the reference-model anchoring.

• APO flexibility: while APO does not surpass GSPO in this particular setup, it provides a unified framework for exploring different divergence behaviors through α scheduling.

Discussion. The results suggest that for this specific task (mathematical reasoning with binary rewards), the choice of divergence within the anchored framework has limited impact on final performance. However, APO offers a theoretical unification of forward-KL and reverse-KL optimization, which may be beneficial in:

• Tasks with denser reward signals, where mode-seeking exploration is more important.

• Settings where the reward landscape is more complex and the coverage-exploitation trade-off matters.

• Future work combining APO with other techniques such as reward shaping or curriculum learning.

We introduced APO (α-Divergence Preference Optimization), an anchored preference optimization framework that uses the Csiszár α-divergence to continuously interpolate between forward-KL-style (mode-covering) and reverse-KL-like (mode-seeking) optimization. Our key contribution is the reward + confidence guarded α schedule, which transitions from stable coverage to exploitation only when the policy is both confident and improving, preventing the “confident-but-wrong” collapse pattern. Experiments on Qwen3-1.7B with mathematical reasoning tasks demonstrate that APO can achieve competitive performance while maintaining training stability.

References

This content is AI-processed based on open access ArXiv data.
