KL Divergence in Alignment: Covering Modes vs. Seeking Rewards

Reading time: 2 minutes
...

📝 Original Paper Info

- Title: APO: Alpha-Divergence Preference Optimization
- ArXiv ID: 2512.22953
- Date: 2025-12-28
- Authors: Wang Zixian

📝 Abstract

Two divergence regimes dominate modern alignment practice. Supervised fine-tuning and many distillation-style objectives implicitly minimize the forward KL divergence KL(q || pi_theta), yielding stable mode-covering updates but often under-exploiting high-reward modes. In contrast, PPO-style online reinforcement learning from human feedback behaves closer to the reverse KL divergence KL(pi_theta || q), enabling mode-seeking improvements but risking mode collapse. Recent anchored methods, such as ADPO, show that performing the projection in anchored coordinates can substantially improve stability, yet they typically commit to a single divergence. We introduce Alpha-Divergence Preference Optimization (APO), an anchored framework that uses the Csiszár alpha-divergence to continuously interpolate between forward and reverse KL behavior within the same anchored geometry. We derive unified gradient dynamics parameterized by alpha, analyze gradient variance properties, and propose a practical reward-and-confidence-guarded alpha schedule that transitions from coverage to exploitation only when the policy is both improving and confidently calibrated. Experiments on Qwen3-1.7B with math-level3 demonstrate that APO achieves competitive performance with GRPO and GSPO baselines while maintaining training stability.
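
For orientation, the block below records one standard (Amari-style) parameterization of the Csiszár alpha-divergence and its two KL limits. This is textbook material, not taken from the paper, whose exact convention may differ:

```latex
% One common (Amari-style) parameterization of the alpha-divergence family;
% the paper's exact convention may differ.
D_\alpha(p \,\|\, q) \;=\; \frac{1}{\alpha(1-\alpha)}
  \left( 1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx \right)

% Its two limits recover the KL regimes discussed in the abstract:
\lim_{\alpha \to 1} D_\alpha(p \,\|\, q) = \mathrm{KL}(p \,\|\, q)
\qquad
\lim_{\alpha \to 0} D_\alpha(p \,\|\, q) = \mathrm{KL}(q \,\|\, p)
```

Taking p = pi_theta and q as the anchor, alpha near 0 then corresponds to the forward (mode-covering) KL of SFT-style objectives, and alpha near 1 to the reverse (mode-seeking) KL of PPO-style RLHF.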

💡 Summary & Analysis

1. **Two Divergence Regimes, One Dial**: SFT and distillation-style objectives implicitly minimize the forward KL divergence KL(q || pi_theta), giving stable mode-covering updates that can under-exploit high-reward modes, while PPO-style RLHF behaves closer to the reverse KL divergence KL(pi_theta || q), which is mode-seeking but risks collapse. APO uses the Csiszár alpha-divergence to interpolate continuously between these two behaviors.
2. **Anchored Geometry with Unified Gradients**: Building on anchored methods such as ADPO, APO performs the projection in anchored coordinates and derives gradient dynamics parameterized by alpha, including an analysis of gradient variance across the interpolation.
3. **Guarded Alpha Schedule**: A reward-and-confidence-guarded schedule moves from coverage toward exploitation only when the policy is both improving and confidently calibrated (a minimal sketch of such a schedule follows this list). On Qwen3-1.7B with math-level3, APO is competitive with GRPO and GSPO baselines while remaining stable during training.
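
The abstract describes the guard only qualitatively, so the sketch below is one plausible realization. The guard statistics `reward_ema_delta` and `mean_confidence` and the cosine ramp are illustrative assumptions, not taken from the paper:

```python
import math

# Minimal sketch of a reward-and-confidence-guarded alpha schedule.
# Convention (matching the divergence block above): alpha near 0 is
# forward-KL-like (coverage), alpha near 1 is reverse-KL-like
# (exploitation). The guard statistics are hypothetical names.

def guarded_alpha(
    step: int,
    total_steps: int,
    reward_ema_delta: float,  # change in a reward EMA; > 0 means improving
    mean_confidence: float,   # e.g. mean top-token probability, in [0, 1]
    conf_threshold: float = 0.7,
    alpha_min: float = 0.0,
    alpha_max: float = 1.0,
) -> float:
    """Move alpha from coverage toward exploitation only while the
    policy is both improving and confidently calibrated."""
    if reward_ema_delta <= 0.0 or mean_confidence < conf_threshold:
        return alpha_min  # guard failed: stay mode-covering
    # Smooth cosine ramp from alpha_min to alpha_max over training.
    t = min(step / max(total_steps, 1), 1.0)
    return alpha_min + (alpha_max - alpha_min) * 0.5 * (1.0 - math.cos(math.pi * t))
```

A trainer would call `guarded_alpha` once per step and feed the result into the alpha-divergence objective, so the policy only earns the right to mode-seek after demonstrating stable improvement.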

📊 Figures

Figure 1

A Note of Gratitude

The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.
