APO: Alpha-Divergence Preference Optimization
Original Info
- Title: APO: Alpha-Divergence Preference Optimization
- ArXiv ID: 2512.22953
- Date: 2025-12-28
- Authors: Wang Zixian
Abstract
Two divergence regimes dominate modern alignment practice. Supervised fine-tuning and many distillation-style objectives implicitly minimize the forward KL, KL(p ∥ π_θ), yielding stable mode-covering updates but often under-exploiting high-reward modes. In contrast, PPO-style online RLHF behaves closer to the reverse KL, KL(π_θ ∥ p), enabling mode-seeking improvements but risking mode collapse. Recent anchored methods (e.g., ADPO) show that performing the projection in anchored coordinates can substantially improve stability, yet they typically commit to a single divergence. We introduce APO (α-Divergence Preference Optimization), an anchored framework that uses the Csiszár α-divergence to continuously interpolate between forward- and reverse-KL behavior within the same anchored geometry. We derive unified gradient dynamics parameterized by α, analyze gradient variance properties, and propose a practical reward + confidence guarded α schedule that transitions from coverage to exploitation only when the policy is both improving and confidently calibrated. Experiments on Qwen3-1.7B with math-level3 show that APO achieves competitive performance with GRPO and GSPO baselines while maintaining training stability.
Full Content
• Forward-KL (mode-covering) objectives, e.g., supervised fine-tuning (SFT) and distillation, which minimize KL(p ∥ π_θ) for a target distribution p. These methods are stable and “zero-avoiding” but can produce overly averaged behavior.
• Reverse-KL-like (mode-seeking) objectives, e.g., PPO [2] and GRPO [4]-style online RLHF, which emphasize high-reward modes and can achieve higher ceilings, but are sensitive to variance, overconfidence, and mode collapse.
The recently proposed ADPO [13] perspective suggests that a major source of instability is not only the divergence choice, but also where we perform the projection. By anchoring the policy in log-ratio coordinates relative to a reference policy (e.g., π_ref = π_old in on-policy RL), one obtains better-conditioned geometry and an implicit trust region via temperature scaling.
This paper pushes the idea further: on top of anchored coordinates, we replace the single forward KL with a family of divergences. Concretely, we propose APO: α-Divergence Preference Optimization. The same training pipeline can start in a forward-KL-like regime (coverage, safety, stability) and smoothly transition to a reverse-KL-like regime (exploitation, higher peak reward) by scheduling α.
Scope. We focus on online LLM RLHF with group sampling (multiple completions per prompt), because this is where reverse-KL-like updates (PPO/GRPO) are practically valuable and where collapse is most painful. We use a Boltzmann soft target over the sampled candidate set, which closely matches modern GRPO-style pipelines.
Preference Optimization and RLHF. Reinforcement Learning from Human Feedback (RLHF) typically involves learning a reward model from preferences and then optimizing a policy via PPO [2,5,6]. Direct Preference Optimization (DPO) [1] simplifies this by deriving a closed-form solution to the KL-constrained reward maximization problem, optimizing policy-reference log ratios directly. Recent variants extend this paradigm: IPO [7] adds a regularization term to prevent overfitting, SimPO [8] simplifies the reference-free objective, and KTO [9] uses Kahneman-Tversky value functions. APO differs by introducing a continuous divergence family rather than committing to a single objective.
Group-Relative Policy Optimization. GRPO [4] extends PPO to preference learning by normalizing advantages within groups of sampled completions, enabling efficient online RLHF without a separate reward model. GSPO [10] improves GRPO-style training stability by operating with sequence-level objectives/ratios. GTPO [11] analyzes GRPO instability (e.g., gradient conflicts and collapse) and introduces gradient/entropy control for stabilization. G²RPO-A [12] studies guided GRPO configurations and proposes an adaptive guidance mechanism that adjusts guidance during training. These methods share a common theme with APO: dynamically adjusting the aggressiveness of RL updates based on training signals. APO differs by controlling this through the divergence family (α) rather than clipping thresholds or guidance weights.
f-Divergences in Machine Learning. The f-divergence family [15,16] provides a unified framework for measuring distributional discrepancy. The Csiszár-Amari α-divergence [15,18] is particularly attractive because it continuously connects forward and reverse KL (distinct from Rényi divergence [17], which shares the name but has different properties). Prior work has explored f-divergences in variational inference [19,20], GANs [21], and imitation learning [22]. In RL, αPPO [14] systematically studied α-divergence as a trust-region constraint for PPO, finding that intermediate α values often outperform pure KL. APO builds on this insight but applies it to the objective function (not the constraint) and introduces a confidence-guarded schedule for LLM RLHF.
Trust-Region Methods. Ensuring stable policy updates is a central challenge in RL. TRPO [3] enforces stability via explicit KL constraints, while PPO [2] approximates this with ratio clipping. ADPO [13] shows that anchored coordinates provide an implicit trust region via temperature-scaled curvature. APO extends this by allowing the divergence itself to be scheduled, providing an additional degree of freedom for balancing exploration and exploitation.
We work in the standard group-based RLHF setting. For each prompt (context) x, we sample a candidate set Y_x = {y_1, . . . , y_G} of G completions from the current policy. Let ℓ_i = log π_θ(y_i | x) be the student log-probabilities and ℓ_i^ref = log π_ref(y_i | x) be the anchor log-probabilities. In online RLHF, we use on-policy anchoring, π_ref = π_old, where π_old is the sampling policy from the previous iteration.
Define anchored logits u_i = (ℓ_i − ℓ_i^ref) / τ_anc and the induced distribution over the candidate set p_θ(i | Y_x) = exp(u_i) / Σ_{j=1}^{G} exp(u_j).
The temperature τ_anc > 0 controls curvature scaling in anchored coordinates and plays the same stabilizing role as an implicit trust region: smaller τ_anc penalizes deviations from the anchor more strongly (see Section 4.4).
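A minimal PyTorch sketch of this construction (illustrative only; it assumes per-completion sequence log-probabilities as inputs, and the function name is not from the original implementation):

```python
import torch

def anchored_distribution(student_logps: torch.Tensor,
                          ref_logps: torch.Tensor,
                          tau_anc: float) -> torch.Tensor:
    """p_theta(. | Y_x): softmax over anchored logits u_i = (l_i - l_i^ref) / tau_anc.

    student_logps, ref_logps: shape (G,) sequence log-probabilities of the G sampled
    completions under the current policy and the anchor (e.g., pi_old).
    """
    u = (student_logps - ref_logps) / tau_anc
    return torch.softmax(u, dim=-1)
```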
We define a target distribution q(· | Y_x) on the candidate set using a Boltzmann (softmax) transformation of group-relative advantages. Compute group-relative advantages A_i via z-score normalization, A_i = (r_i − mean(r)) / (std(r) + ε), where r_i is the reward for completion y_i and ε > 0 is a small constant for numerical stability. The Boltzmann target is q(i | Y_x) = exp(A_i / β_T) / Σ_{j=1}^{G} exp(A_j / β_T), where β_T > 0 controls target sharpness. Smaller β_T makes q concentrate more on the best responses.
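An analogous sketch for the target (illustrative; works for binary or real-valued rewards):

```python
import torch

def boltzmann_target(rewards: torch.Tensor, beta_T: float, eps: float = 1e-6) -> torch.Tensor:
    """q(. | Y_x): softmax of z-scored group-relative advantages at temperature beta_T.

    Smaller beta_T concentrates q on the highest-reward completions in the group.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    return torch.softmax(advantages / beta_T, dim=-1)
```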
Let p_θ(·) = p_θ(· | Y_x) be the anchored student distribution over the candidate set (we write p_θ for brevity) and let q(·) = q(· | Y_x) be the Boltzmann target. We define the APO objective using the Csiszár α-divergence
D_α(q ∥ p_θ) = (1 / (α(1 − α))) (1 − Σ_i q(i)^α p_θ(i)^{1−α}),    (7)
where the sum runs over the candidate set Y_x.
The α-divergence continuously interpolates between forward and reverse KL:
lim_{α→1} D_α(q ∥ p) = KL(q ∥ p) (forward KL, mode-covering),
lim_{α→0} D_α(q ∥ p) = KL(p ∥ q) (reverse KL, mode-seeking).
We restrict α ∈ (0, 1) throughout this work, which already interpolates between forward and reverse KL and yields a monotone “mode-covering → mode-seeking” path. Extending to α outside (0, 1) is mathematically valid but leads to less interpretable optimization behavior in our RLHF setting, so we leave it to future work. This provides a principled way to transition between SFT-like stability (α → 1) and PPO-like exploitation (α → 0).
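The interpolation can be checked numerically on a toy candidate set; the following sketch (illustrative only) evaluates D_α near the two endpoints and compares against the corresponding KLs:

```python
import torch

def alpha_divergence(q: torch.Tensor, p: torch.Tensor, alpha: float) -> torch.Tensor:
    """Csiszar alpha-divergence D_alpha(q || p) over a finite candidate set."""
    return (1.0 - torch.sum(q**alpha * p**(1.0 - alpha))) / (alpha * (1.0 - alpha))

def kl(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.sum(a * (a / b).log())

torch.manual_seed(0)
q = torch.softmax(torch.randn(8), dim=-1)
p = torch.softmax(torch.randn(8), dim=-1)
print(alpha_divergence(q, p, 0.999).item(), kl(q, p).item())  # ~ forward KL(q || p)
print(alpha_divergence(q, p, 0.001).item(), kl(p, q).item())  # ~ reverse KL(p || q)
```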
Define the ratio w(i) = q(i) / p_θ(i).
Theorem 4.1 (Unified gradient for α-divergence). Assume q(i) > 0 whenever p_θ(i) > 0 on the candidate set. Then the gradient of Equation (7) w.r.t. θ can be written as
∇_θ D_α(q ∥ p_θ) = −(1/α) E_{i∼p_θ}[ w(i)^α ∇_θ log p_θ(i) ].    (11)
Proof. Let L_α = Σ_i q(i)^α p_θ(i)^{1−α}. Taking the derivative,
∇_θ L_α = (1 − α) Σ_i q(i)^α p_θ(i)^{−α} ∇_θ p_θ(i) = (1 − α) Σ_i p_θ(i) w(i)^α ∇_θ log p_θ(i),
so ∇_θ D_α(q ∥ p_θ) = −∇_θ L_α / (α(1 − α)) = −(1/α) E_{i∼p_θ}[ w(i)^α ∇_θ log p_θ(i) ],
where we used q(i)^α p_θ(i)^{1−α} = p_θ(i) w(i)^α and ∇_θ p_θ(i) = p_θ(i) ∇_θ log p_θ(i). □
Remark 4.2 (Limiting gradient forms). As α → 1, using w^α = w we recover the forward-KL gradient
∇_θ KL(q ∥ p_θ) = −E_{i∼q}[ ∇_θ log p_θ(i) ].
As α → 0, using the expansion w^α = 1 + α log w + o(α), the fact that E_{p_θ}[∇_θ log p_θ] = 0, and canceling the 1/α prefactor, we recover the reverse-KL gradient
∇_θ KL(p_θ ∥ q) = E_{i∼p_θ}[ (log p_θ(i) − log q(i)) ∇_θ log p_θ(i) ].
The detailed limiting analysis is standard and omitted for brevity.
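The unified gradient can also be verified mechanically: for a softmax parameterization of the candidate-set distribution, ∇_u log p_θ(i) = e_i − p_θ, so the closed form in Equation (11) can be compared against autograd. A short illustrative check (toy tensors, names not from the original code):

```python
import torch

def alpha_divergence(q: torch.Tensor, p: torch.Tensor, alpha: float) -> torch.Tensor:
    return (1.0 - torch.sum(q**alpha * p**(1.0 - alpha))) / (alpha * (1.0 - alpha))

torch.manual_seed(0)
alpha = 0.7
u = torch.randn(8, requires_grad=True)        # anchored logits for G = 8 candidates
p = torch.softmax(u, dim=-1)                  # p_theta over the candidate set
q = torch.softmax(torch.randn(8), dim=-1)     # an arbitrary fixed target

autograd_grad, = torch.autograd.grad(alpha_divergence(q, p, alpha), u)

# Closed form from Equation (11): grad_u D_alpha = -(1/alpha) sum_i p_i w_i^alpha (e_i - p),
# which collapses to the vector below.
w = q / p
s = p * w**alpha
closed_form = -(1.0 / alpha) * (s - s.sum() * p)
print(torch.allclose(autograd_grad, closed_form.detach(), atol=1e-5))  # expect True
```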
Interpretation. Equation (11) shows that α controls how the score function is reweighted by w^α:
• α → 1 (forward KL): w^α → w, so samples where q ≫ p_θ (under-represented by the policy) get high weight. This encourages coverage.
• α → 0 (reverse KL): w^α → 1, so all samples are weighted equally by the policy. Combined with the score function, this encourages concentration on modes where p_θ already has mass.
The choice of α affects not only the optimization landscape but also the variance of gradient estimates.
Let g_α(i) = w(i)^α ∇_θ log p_θ(i) be the per-sample gradient direction.
Under the heuristic assumption that ‖∇_θ log p_θ(i)‖ does not vary excessively across candidates (commonly adopted to isolate the effect of importance weighting), the variance of the gradient estimator scales with the second moment of the weights, E_{i∼p_θ}[ w(i)^{2α} ] (illustrated numerically after the list below).
When q and p_θ are close, w ≈ 1 and the variance is low for all α. When q and p_θ differ significantly:
• Large α (→ 1): High variance due to large w^{2α} terms when q(i) ≫ p_θ(i).
• Small α (→ 0): Lower variance but potentially higher bias (mode-seeking).
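A quick numeric illustration (toy distributions, illustrative only) of how the second moment E_{p_θ}[w^{2α}] grows with α when q and p_θ disagree:

```python
import torch

torch.manual_seed(0)
p = torch.softmax(torch.randn(8), dim=-1)        # student over G = 8 candidates
q = torch.softmax(3.0 * torch.randn(8), dim=-1)  # deliberately mismatched target
w = q / p
for alpha in (0.1, 0.5, 0.9):
    # Proxy for the gradient-variance scale; increases with alpha under mismatch.
    print(alpha, float((p * w**(2 * alpha)).sum()))
```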
Why not start with small α? A natural question is: if small α has lower variance, why not use it from the beginning? The answer lies in the bias-variance trade-off and the support-coverage requirement:
• α → 1 (forward KL): The gradient weights w^α → w = q/p_θ upweight samples where q ≫ p_θ, forcing the policy to cover the target’s support. This is essential early in training when p_θ may not yet overlap well with high-reward regions.
• α → 0 (reverse KL): The limiting gradient E_{p_θ}[(log p_θ − log q) ∇_θ log p_θ] penalizes “misplaced mass” where p_θ > q, driving concentration on modes where q is already high. This is effective after the policy has found the good modes, but dangerous early on: it can lock into suboptimal modes before discovering better ones.
Thus, the schedule from α_max → α_min implements a natural curriculum: first find the high-reward modes (coverage), then concentrate on them (exploitation).
Following ADPO [13], the anchored coordinates induce curvature scaling. The Fisher information of the anchored policy p_θ over the candidate set, taken with respect to the log-ratios ℓ − ℓ^ref, is (1/τ_anc²)(Diag(p_θ) − p_θ p_θ^⊤).
The local quadratic approximation of the α-divergence loss near the optimum u* is
D_α(q ∥ p_{u*+δ}) ≈ (1/2) δ^⊤ F_p δ,
where F_p = Diag(p) − p p^⊤ is the Fisher information metric at the optimum. For the standard normalized α-divergence (as defined in Equation (7)), the second-order expansion matches the KL divergence up to a constant scaling factor independent of the optimization direction δ. This theoretical property suggests that the local implicit trust region is predominantly governed by the anchor temperature τ_anc, providing a unified stability mechanism across the α family. The effect of α is thus largely orthogonal to local stability: it controls the global optimization trajectory (mode-covering vs. mode-seeking) when the policy is far from the target.
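For completeness, a short sketch of why the second-order term is α-independent, using the standard f-divergence expansion with the generator implied by Equation (7):

```latex
% f-divergence form of the normalized alpha-divergence and its local expansion.
\[
  D_\alpha(q\,\|\,p) \;=\; \sum_i p(i)\, f_\alpha\!\Big(\tfrac{q(i)}{p(i)}\Big),
  \qquad
  f_\alpha(t) \;=\; \frac{1 - t^{\alpha}}{\alpha(1-\alpha)},
  \qquad
  f_\alpha(1) = 0,\quad f_\alpha''(1) = 1 .
\]
% Both q and p are normalized, so the first-order term of the Taylor expansion
% around t = 1 sums to zero, leaving
\[
  D_\alpha(q\,\|\,p)
  \;=\; \tfrac{1}{2} \sum_i \frac{\big(q(i) - p(i)\big)^{2}}{p(i)}
  \;+\; O\!\big(\lVert q - p \rVert^{3}\big),
\]
% which is the same local quadratic as KL(q || p), independent of alpha.
```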
The key practical question is how to set α over training. We propose a simple rule: become more mode-seeking only when the policy is both (i) confident and (ii) improving. This avoids the “confident-but-wrong” failure mode, where entropy collapses while reward stagnates or decreases.
We compute two scalars per update:
• Confidence from the entropy of the candidate-set distribution, mapped to a score c_t ∈ [0, 1]: c_t → 0 indicates uncertain/high-entropy policies; c_t → 1 indicates confident/low-entropy policies.
• Improvement from reward gain (time-series signal), computed from the current reward relative to a baseline. Here b_t is an EMA reward baseline, s_r > 0 is a reward scale (e.g., a running std), and g_t ∈ [0, 1] discards negative gains, which is crucial for the guard behavior.
Let α_max ≲ 1 (coverage, e.g., 0.9) and α_min > 0 (exploitation, e.g., 0.35). For mathematical reasoning tasks with sparse binary rewards, we recommend a moderate α_min (our experiments use α_min = 0.35). The schedule interpolates from α_max toward α_min through the multiplicative gate c_t · g_t,
α_t = α_max − (α_max − α_min) · c_t · g_t,
optionally smoothed by an EMA over training steps (a concrete sketch follows Remark 5.1).
Remark 5.1 (Why the multiplicative gate matters). If we used an additive schedule driven by c_t + g_t, then a policy that is confident but not improving could still be pushed toward mode-seeking updates, risking collapse. The multiplicative gate c_t · g_t ensures that α only decreases when both conditions are met.
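One plausible instantiation of the guarded schedule (the concrete formulas for c_t and g_t, the state bookkeeping, and all names below are illustrative assumptions consistent with the description above, not a verbatim reproduction):

```python
import math
import torch

def guarded_alpha(p_theta: torch.Tensor, mean_reward: float, state: dict,
                  alpha_max: float = 0.9, alpha_min: float = 0.35,
                  ema: float = 0.1) -> float:
    """Reward + confidence guarded alpha schedule (one plausible instantiation).

    c_t: confidence, taken here as 1 minus the normalized entropy of p_theta(. | Y_x).
    g_t: improvement, reward gain over an EMA baseline scaled by s_r, clipped to [0, 1].
    alpha_t interpolates from alpha_max (coverage) to alpha_min (exploitation)
    through the multiplicative gate c_t * g_t.
    """
    G = p_theta.numel()
    entropy = -(p_theta * p_theta.clamp_min(1e-12).log()).sum().item()
    c_t = 1.0 - entropy / math.log(G)
    g_t = min(max((mean_reward - state["baseline"]) / max(state["scale"], 1e-6), 0.0), 1.0)
    alpha_t = alpha_max - (alpha_max - alpha_min) * c_t * g_t
    # EMA smoothing of the schedule and update of the reward baseline.
    state["alpha"] = (1.0 - ema) * state.get("alpha", alpha_max) + ema * alpha_t
    state["baseline"] = (1.0 - ema) * state["baseline"] + ema * mean_reward
    return state["alpha"]
```

Here `state` carries the EMA baseline b_t (key `baseline`), the reward scale s_r (key `scale`), and the smoothed α between updates.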
Behavior in different regimes.
• Early training (c_t low, g_t variable): α_t ≈ α_max, forward-KL-like, stable coverage.
• Confident and improving (c_t high, g_t high): α_t → α_min, reverse-KL-like, mode-seeking.
• Confident but stuck/degrading (c_t high, g_t ≈ 0): α_t → α_max, pulling back to coverage to escape the local minimum.
When α_t is small, the effective weights w^α can be heavy-tailed. In practice, one can clip w^α to an upper bound, similarly to PPO-style ratio clipping.
6 Algorithm
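As a consolidated reference, the following sketch assembles the pieces above (anchored distribution, Boltzmann target, detached-weight surrogate) into a single loss for one prompt's group. It is an illustrative reading of the algorithm, not the released implementation, and all names are our own:

```python
import torch

def apo_loss(policy_logps: torch.Tensor, ref_logps: torch.Tensor,
             rewards: torch.Tensor, alpha: float,
             tau_anc: float = 0.8, beta_T: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """APO loss for one prompt's group of G completions (sketch).

    The gradient of the returned scalar follows Equation (11):
    -(1/alpha) * E_{i ~ p_theta}[ w_i^alpha * grad log p_theta(i) ],  w = q / p_theta.
    """
    # Anchored student distribution over the candidate set.
    u = (policy_logps - ref_logps.detach()) / tau_anc
    log_p = torch.log_softmax(u, dim=-1)
    p = log_p.exp()
    # Boltzmann target from z-scored group-relative advantages.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    q = torch.softmax(advantages / beta_T, dim=-1)
    # Detached-weight surrogate so autograd only differentiates log p_theta.
    w = (q / p).detach()
    return -(1.0 / alpha) * torch.sum(p.detach() * w.pow(alpha) * log_p)

# Toy usage: one group of G = 8 completions with binary rewards.
policy_logps = torch.randn(8, requires_grad=True)
ref_logps = policy_logps.detach() + 0.1 * torch.randn(8)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
apo_loss(policy_logps, ref_logps, rewards, alpha=0.9).backward()
```

In a full training loop, α would be set per update by the guarded schedule of Section 5 and the per-group losses averaged across the batch before the optimizer step.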
Baselines.
• GSPO [10]: An improved GRPO variant that enhances training stability via sequence-level objectives.
• ADPO-Softmax: Standard ADPO using the softmax loss variant (loss_variant: softmax), with the cross-entropy formulation L = −Σ_i q(i) log p_θ(i).
Configuration. All methods share: learning rate 1.5 × 10⁻⁵ (cosine decay), batch size 8, gradient accumulation 16, G = 8 generations per prompt, max completion length 1024, 2 epochs. For APO: τ_anc = 0.8, β_T = 1.0, α_max = 0.9, α_min = 0.35, two schedule smoothing coefficients set to 0.1, and s_r initialized to 0.5 and updated via a running std.
Figure 1 shows the training curves for all five algorithms. We observe the following:
• Comparable performance: All five algorithms achieve a similar final reward of around 0.6-0.7. The APO variants (Adaptive ESS, Fixed α, Legacy) perform on par with GSPO and ADPO-Softmax.
• Stability: All anchored methods maintain stable training dynamics without significant reward collapse, validating the effectiveness of the reference-model anchoring.
• APO flexibility: While APO does not surpass GSPO in this particular setup, it provides a unified framework for exploring different divergence behaviors through α scheduling.
Discussion. The results suggest that for this specific task (mathematical reasoning with binary rewards), the choice of divergence within the anchored framework has limited impact on final performance. However, APO offers a theoretical unification of forward-KL and reverse-KL optimization, which may be beneficial in:
• Tasks with denser reward signals where mode-seeking exploration is more important.
• Settings where the reward landscape is more complex and the coverage-exploitation trade-off matters.
• Future work combining APO with other techniques such as reward shaping or curriculum learning.
We introduced APO (α-Divergence Preference Optimization), an anchored preference optimization framework that uses the Csiszár α-divergence to continuously interpolate between forward-KL-like (mode-covering) and reverse-KL-like (mode-seeking) optimization. Our key contribution is the reward + confidence guarded α schedule, which transitions from stable coverage to exploitation only when the policy is both confident and improving, preventing the “confident-but-wrong” collapse pattern. Experiments on Qwen3-1.7B with mathematical reasoning tasks demonstrate that APO can achieve competitive performance while maintaining training stability.