Latent Adversarial Regularization for Offline Preference Optimization


Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inductive biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with only minor computational overhead.


💡 Research Summary

The paper tackles a fundamental limitation of current preference‑based alignment methods for large language models (LLMs). Most offline preference optimization (OPO) techniques, such as Direct Preference Optimization (DPO), rely on a token‑level KL regularizer that forces the policy model to stay close to a reference model in probability space. While mathematically convenient, this token‑space constraint is a very coarse proxy for the true semantic or behavioral similarity between model outputs. Two sentences can be semantically identical yet far apart under token‑level divergences, or conversely, semantically different sentences may have a small token distance. This mismatch can lead to undesirable drift, reward hacking, and poor generalization when the model is fine‑tuned on human‑generated preference data.
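To make the baseline concrete, the token-level objective being criticized here can be sketched as DPO's pairwise log-sigmoid loss, where the implicit KL anchor lives entirely in log-probability ratios against the frozen reference model. The function name and example log-probabilities below are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise DPO loss on sequence log-probabilities.

    The regularization toward the reference model is implicit in the
    log-ratios, i.e. it operates purely in token/probability space.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# Illustrative values: the policy favors the chosen response more than
# the reference does, so the margin is positive and the loss drops
# below -log(0.5).
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5)
```

Note that two responses with identical meaning but different tokenizations can produce very different log-ratios here, which is exactly the coarseness the paper targets.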

To address this, the authors propose GANPO (Generative Adversarial Network Preference Optimization), a latent‑space regularization framework that aligns the internal representations of the policy and a frozen reference model. The key insight is that the final hidden‑state vectors of a transformer encode dense semantic information in a lower‑dimensional space, making them a more faithful substrate for measuring similarity. However, latent representations do not have an explicit probability density, so standard divergences (e.g., KL) cannot be computed directly.

The solution draws from the theory of Generative Adversarial Networks (GANs). A discriminator $D_{\phi}$ is trained to distinguish latent vectors produced by the reference model (treated as “real”) from those produced by the policy model (treated as “fake”). Because standard GAN training can be unstable, the authors adopt the Relativistic Average GAN (RaGAN) formulation, which estimates the probability that a real sample is more realistic than the average fake sample in the current batch. This yields a variational expression equivalent to the Jensen‑Shannon divergence (JSD) between the two latent distributions, denoted $D_{\text{Ra}}(p_{\theta} \,\|\, p_{\text{ref}})$. The discriminator loss reduces to a binary cross‑entropy (BCE) term, while the policy is penalized by the negative of this BCE, effectively minimizing the JSD.
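The RaGAN losses described above can be sketched numerically as follows. Each logit is shifted by the mean logit of the opposite class before the BCE is applied; the policy's penalty is the same objective with the labels reversed. This is a minimal illustration of the standard RaGAN formulation, not the paper's exact implementation:

```python
import numpy as np

def bce_with_logits(logits, target):
    # Numerically stable binary cross-entropy on raw logits
    # (the usual log-sum-exp form), averaged over the batch.
    return np.mean(np.maximum(logits, 0) - logits * target
                   + np.log1p(np.exp(-np.abs(logits))))

def ragan_losses(d_real, d_fake):
    """Relativistic average GAN losses on discriminator logits.

    d_real: discriminator scores on reference-model latents ("real").
    d_fake: discriminator scores on policy-model latents ("fake").
    """
    rel_real = d_real - d_fake.mean()  # "real beats the average fake"
    rel_fake = d_fake - d_real.mean()
    d_loss = bce_with_logits(rel_real, 1.0) + bce_with_logits(rel_fake, 0.0)
    # The policy (generator side) is penalized by the reversed labels,
    # which pushes the two latent distributions together.
    g_loss = bce_with_logits(rel_fake, 1.0) + bce_with_logits(rel_real, 0.0)
    return d_loss, g_loss

# A confident discriminator yields a small d_loss and a large g_loss.
d_loss, g_loss = ragan_losses(np.array([2.0, 1.5]), np.array([-1.0, -0.5]))
```

At the equilibrium where policy and reference latents are indistinguishable, both losses sit at $2\log 2$, consistent with the JSD interpretation.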

GANPO is not a simple binary GAN. Preference data consist of triples $(x, y_w, y_l)$ – a prompt, a chosen response, and a rejected response. The authors exploit this structure by constructing a quadruple of latent vectors for each training example: (1) the reference model’s representation of the chosen response, $h^+_{\text{ref}}$; (2) the reference model’s representation of the rejected response, $h^-_{\text{ref}}$; (3) the policy’s representation of the chosen response, $h^+_{\theta}$; and (4) the policy’s representation of the rejected response, $h^-_{\theta}$. Two discriminators are introduced: a positive discriminator $\phi_{\text{pos}}$ that learns to rank $h^+_{\text{ref}} > h^+_{\theta}$ and also $h^+_{\theta} > h^-_{\text{ref}}$, and a negative discriminator $\phi_{\text{neg}}$ that learns the analogous ordering for the “bad” side. This design allows the policy to receive dense structural feedback about both high‑quality and low‑quality regions of the latent space.
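The quadruple construction can be sketched as below. The mean-pooling of final hidden states into a single latent vector, the dimensions, and the pair labels are all assumptions for illustration; only the four roles and the positive-side ordering come from the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def pool(hidden_states):
    # Collapse per-token hidden states into one latent vector.
    # Mean-pooling is a modeling assumption, not the paper's stated choice.
    return hidden_states.mean(axis=0)

# Hypothetical final hidden states for one (x, y_w, y_l) example:
# 8 tokens, hidden size 16 (both arbitrary).
h_ref_pos = pool(rng.normal(size=(8, 16)))  # reference on chosen response
h_ref_neg = pool(rng.normal(size=(8, 16)))  # reference on rejected response
h_pol_pos = pool(rng.normal(size=(8, 16)))  # policy on chosen response
h_pol_neg = pool(rng.normal(size=(8, 16)))  # policy on rejected response

# Ranked pairs for the positive discriminator, following the ordering
# h_ref_pos > h_pol_pos and h_pol_pos > h_ref_neg (label 1 = "first
# element should score higher").
pos_pairs = [(h_ref_pos, h_pol_pos, 1), (h_pol_pos, h_ref_neg, 1)]
# The negative discriminator mirrors this on the "bad" side; the exact
# pairing here is an assumed analogue.
neg_pairs = [(h_ref_neg, h_pol_neg, 1)]
```

Each discriminator then scores these latent pairs with the relativistic BCE loss, giving the policy gradient signal from both the chosen and rejected regions of latent space.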

The overall training objective combines the original OPO loss (e.g., DPO’s pairwise log‑sigmoid term) with the adversarial regularizer, weighted by a hyper‑parameter $\lambda$:

$$\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{OPO}}(\theta) \;+\; \lambda\,\mathcal{L}_{\text{adv}}(\theta)$$
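Putting the pieces together, the combined policy loss is a simple weighted sum; the function name and the sample values for the two terms are illustrative:

```python
def ganpo_objective(opo_loss, adv_policy_loss, lam=0.1):
    """Total policy loss: the original offline preference-optimization
    term plus the latent-space adversarial regularizer, weighted by
    lambda. The default lambda is an assumption, not a paper value."""
    return opo_loss + lam * adv_policy_loss

# E.g., a DPO loss of 0.62 and an adversarial policy loss of 1.39:
total = ganpo_objective(0.62, 1.39, lam=0.1)
```

Because the regularizer enters only as an additive term, it can be attached to any offline preference objective (DPO or its variants) without changing the base loss.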

