Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization


Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model's dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner as an "alignment gap estimator" to evaluate the potential benefits of on-policy sampling relative to offline data. This guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard, and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing online annotation costs by 42%. Code is available at https://github.com/junming-yang/MetaAPO.


💡 Research Summary

The paper addresses a central problem in aligning large language models (LLMs) with human preferences: the distribution mismatch between static, pre‑collected offline preference datasets and the evolving policy of the model during training. While offline methods such as Reinforcement Learning from Human Feedback (RLHF) and its more efficient variants (DPO, SimPO, KTO) enjoy high‑quality data and low annotation cost, they suffer from out‑of‑distribution (OOD) issues because the data were generated by an earlier model. Conversely, online approaches (Iterative DPO, SPPO) generate on‑policy data that matches the current distribution but often lack diversity and quality, leading to inefficient learning and higher annotation costs.

Meta‑Weighted Adaptive Preference Optimization (MetaAPO) proposes a unified framework that dynamically couples data generation with model training through a lightweight meta‑learner. The meta‑learner, a two‑layer MLP denoted h_ϕ, receives a preference score ℓ(x, y_w, y_l) for each offline sample, computed from the current policy π_θ and a fixed reference model π_ref using the same log‑odds formulation employed in DPO. It maps this score to a meta‑weight w∈
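The per-sample computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the preference score uses the standard DPO implicit-reward margin, while the MLP shapes, the β value, the sigmoid output squashing the meta-weight into (0, 1), and the way the weight scales a DPO-style log-sigmoid loss are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def preference_score(logp_theta_w, logp_theta_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO-style log-odds margin l(x, y_w, y_l): the implicit reward of the
    chosen response y_w minus that of the rejected response y_l, each measured
    as the policy's log-prob shift relative to the reference model."""
    return beta * ((logp_theta_w - logp_ref_w) - (logp_theta_l - logp_ref_l))

class MetaLearner:
    """Two-layer MLP h_phi mapping a scalar preference score to a meta-weight.
    Hidden size and sigmoid output range (0, 1) are illustrative assumptions."""
    def __init__(self, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.5, size=(hidden, 1))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.5, size=(1, hidden))
        self.b2 = np.zeros(1)

    def __call__(self, score):
        h = np.tanh(self.W1 @ np.array([score]) + self.b1)  # hidden layer
        return sigmoid(float((self.W2 @ h + self.b2)[0]))   # meta-weight

def weighted_dpo_loss(score, weight):
    """Sample-wise meta-weighted DPO loss: the usual -log(sigmoid(margin))
    term, scaled by the meta-learner's weight for this sample."""
    return -weight * np.log(sigmoid(score))

# Toy log-probabilities for one (prompt, chosen, rejected) triple.
score = preference_score(-10.0, -14.0, -11.0, -12.0, beta=0.1)  # margin = 0.3
w = MetaLearner()(score)
loss = weighted_dpo_loss(score, w)
```

A weight near 0 would effectively drop an offline pair the policy has already internalized (signaling that fresh on-policy samples are more valuable), while a weight near 1 keeps the offline pair's full gradient contribution.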

