Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment

Notice: This research summary and analysis were automatically generated using AI technology; for authoritative details, please refer to the original arXiv source.

Direct Preference Optimization (DPO) and related methods align large language models from pairwise preferences by regularizing updates against a fixed reference policy. As the policy drifts over training, however, a static reference can become increasingly miscalibrated, leading to distributional mismatch and amplifying spurious preference signals under noisy supervision. Conversely, reference-free variants avoid mismatch but often suffer from unconstrained reward drift. We propose Geometric Anchor Preference Optimization (GAPO), which replaces the fixed reference with a dynamic, geometry-aware anchor: an adversarial local perturbation of the current policy within a small radius that serves as a pessimistic baseline. This anchor enables an adaptive reweighting mechanism, modulating the importance of each preference pair based on its local sensitivity. We further introduce the Anchor Gap, the reward discrepancy between the policy and its anchor, and show under smoothness conditions that it approximates worst-case local margin degradation. Optimizing a logistic objective weighted by this gap downweights geometrically brittle instances while emphasizing robust preference signals. Across diverse noise settings, GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.


💡 Research Summary

The paper addresses a fundamental weakness in Direct Preference Optimization (DPO) and related preference‑learning methods for aligning large language models (LLMs). Traditional DPO relies on a fixed reference policy—usually the supervised‑fine‑tuned (SFT) model—to regularize updates. As training proceeds, the policy drifts away from this static reference, causing distributional mismatch. This mismatch weakens the implicit preference signal and makes the model vulnerable to noisy supervision, often leading it to exploit superficial cues such as length or style. Reference‑free alternatives like SimPO avoid the mismatch but discard the stabilizing effect of a reference, which can result in uncontrolled reward drift and degradation of general capabilities.

To reconcile these opposing forces, the authors propose Geometric Anchor Preference Optimization (GAPO). Instead of a static reference, GAPO constructs a dynamic geometric anchor by applying an adversarial perturbation to the current policy within a small ℓ₂‑ball of radius ρ. For each preference pair i, the margin is defined as
M_i(θ) = p_θ(x_i, y_{w,i}) − p_θ(x_i, y_{l,i}),
where p_θ is the length‑normalized log‑probability (the same implicit reward used by SimPO). The worst‑case perturbation ϵ_i* minimizes this margin subject to the norm constraint ‖ϵ‖₂ ≤ ρ. In practice, the authors approximate the per‑instance perturbation with a single batch‑shared direction:
ϵ_B = −ρ · ∇_θ M̄_B(θ) / ‖∇_θ M̄_B(θ)‖,  where M̄_B(θ) = (1/|B|) Σ_{j∈B} M_j(θ),
and define the anchor parameters θ̃ = θ + ϵ_B. The Anchor Gap for instance i is then
Γ_i(θ) = M_i(θ) − sg(M_i(θ̃)),
where sg denotes the stop‑gradient operator.
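The anchor construction above can be sketched with a toy example. This is a minimal numpy illustration, not the authors' implementation: the margins are stand‑in linear functions M_j(θ) = A_j·θ + b_j (chosen so the gradient is available in closed form), and A, b, and ρ are made‑up values.

```python
import numpy as np

# Toy sketch of GAPO's batch-shared adversarial anchor.
# Assumption: margins M_j(theta) = A[j] @ theta + b[j] stand in for the
# length-normalized log-probability margins; A, b, rho are illustrative.

def margins(theta, A, b):
    return A @ theta + b                      # one margin M_j per pair in the batch

def batch_perturbation(A, rho):
    g = A.mean(axis=0)                        # gradient of the mean margin w.r.t. theta
    return -rho * g / (np.linalg.norm(g) + 1e-12)   # worst-case direction, scaled to the ball

rng = np.random.default_rng(0)
A, b = rng.normal(size=(8, 4)), rng.normal(size=8)
theta = rng.normal(size=4)

eps_B = batch_perturbation(A, rho=0.05)       # epsilon_B
theta_anchor = theta + eps_B                  # anchor parameters, theta tilde
anchor_gap = margins(theta, A, b) - margins(theta_anchor, A, b)   # Gamma_i per pair
```

Because ϵ_B points against the gradient of the mean margin, the mean anchor gap is nonnegative by construction, while individual per‑pair gaps can still vary; that per‑pair variation is what the reweighting below exploits.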

GAPO optimizes a logistic loss on the Anchor Gap:
L_GAPO(θ) = – Σ_i log σ(β·Γ_i(θ) – γ),
with temperature β and target margin γ. Differentiating, and using the fact that the stop‑gradient makes ∇_θ Γ_i(θ) = ∇_θ M_i(θ), yields
∇_θ L_GAPO(θ) = −β · E_i[ σ(γ − β·Γ_i(θ)) · ∇_θ M_i(θ) ],
so each pair enters the update with an effective weight w_i = σ(γ − β·Γ_i(θ)). Pairs with a large Anchor Gap, whose margin degrades sharply at the adversarial anchor, are downweighted, while geometrically robust pairs dominate the gradient.
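A similarly minimal numpy sketch of the logistic objective and the per‑pair weighting it induces (again with made‑up margin values; the stop‑gradient is emulated by passing the anchor margins as plain constants):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sketch of the GAPO objective. Assumption: margin and anchor_margin are
# given arrays; in training, anchor margins come from the perturbed policy
# and are treated as constants via stop-gradient.
def gapo_loss_and_weights(margin, anchor_margin, beta=2.0, gamma=0.5):
    gap = margin - anchor_margin                       # Anchor Gap, Gamma_i
    loss = -np.log(sigmoid(beta * gap - gamma)).mean() # L_GAPO
    weight = sigmoid(gamma - beta * gap)               # per-pair weight in the gradient
    return loss, weight

margin = np.array([1.0, 1.0])
# pair 0 is robust (margin barely moves at the anchor); pair 1 is brittle
anchor_margin = np.array([0.9, -1.0])
loss, w = gapo_loss_and_weights(margin, anchor_margin)
```

The brittle pair (large Γ_i) receives a much smaller weight than the robust one, matching the downweighting behavior described above.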

