Holistic Utility Preference Learning for Listwise Alignment

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Aligning large language models with human preferences is essential for improving interaction quality and safety by ensuring outputs better reflect human values. A promising strategy is Reinforcement Learning from Human Feedback (RLHF), which starts by collecting and ranking responses generated by a supervised fine-tuning model to refine alignment. Existing methods such as Direct Preference Optimization (DPO) focus on pairwise comparisons, categorizing responses into preferred and less preferred pairs and optimizing pairwise margins. However, this pairwise approach cannot capture the holistic ranking relationships among multiple responses or effectively leverage the rich preference information available in listwise comparisons. To address this challenge, this paper introduces Direct Ranking Preference Optimization (DRPO), a novel method that views human preference alignment as a Learning-to-Rank (LTR) task. Unlike pairwise methods, DRPO optimizes the preference ranking of entire response lists by computing holistic utility scores through Normalized Discounted Cumulative Gain (NDCG), a standard LTR metric. To enable end-to-end optimization with the non-differentiable NDCG, the authors propose the diffNDCG loss, a differentiable approximation facilitated by a sorting network. Furthermore, they introduce a novel margin-based Adaptive Rank Policy Score to enhance the discriminative quality of generated responses. Extensive experiments show that DRPO outperforms existing methods, enhancing the quality of the generated responses.


💡 Research Summary

This paper tackles the problem of aligning large language models (LLMs) with human preferences by exploiting ranking‑style feedback rather than the more common pairwise comparisons. While recent methods such as Direct Preference Optimization (DPO) treat preference data as binary “preferred vs. less‑preferred” pairs and optimize a pairwise logistic loss, they ignore the richer relational information present when multiple responses are ordered by a human or a strong evaluator (e.g., GPT‑4). The authors propose Direct Ranking Preference Optimization (DRPO), a framework that casts human‑preference alignment as a Learning‑to‑Rank (LTR) task and directly optimizes the Normalized Discounted Cumulative Gain (NDCG) metric over entire response lists.

Key technical contributions

  1. Adaptive Rank Policy Score (ARPS).

    • Instead of the traditional policy‑reference ratio, ARPS computes a length‑normalized log‑likelihood for each response and adds a rank‑dependent margin γ(y) = τ·q(y) – β·V_q(y).
    • q(y) is the current rank (0 for the top response), τ controls the base margin between adjacent ranks, and V_q(y) is an exponential moving average of log‑likelihoods at that rank, providing a dynamic adjustment based on historical model performance.
    • This design yields finer discrimination for neighboring ranks while emphasizing larger gaps for more distant ranks, encouraging the model to assign higher absolute probabilities to truly preferred outputs.
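The ARPS bullet points above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's exact formulation: the function name, the argument shapes, and in particular the sign convention (subtracting the margin γ(y) from the length-normalized log-likelihood so that lower-ranked responses are pushed further down) are assumptions.

```python
import torch

def adaptive_rank_policy_score(logps, lengths, ranks, v_ema, tau=0.5, beta=0.1):
    """Sketch of an Adaptive-Rank-Policy-style score (illustrative only).

    logps   : summed token log-likelihoods of each response under the policy
    lengths : token counts of each response, used for length normalization
    ranks   : current rank q(y) of each response (0 = most preferred)
    v_ema   : exponential moving average of log-likelihoods per rank, V_q(y)
    """
    norm_logp = logps / lengths                 # length-normalized log-likelihood
    margin = tau * ranks - beta * v_ema[ranks]  # gamma(y) = tau*q(y) - beta*V_q(y)
    # Sign convention is an assumption: subtracting a margin that grows with
    # rank widens the gap between distant ranks, as the text describes.
    return norm_logp - margin
```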
  2. Differentiable sorting via sorting networks.

    • NDCG requires a sorted list, which is non‑differentiable. The authors employ a sorting network (e.g., NeuralSort or a Sinkhorn‑based doubly‑stochastic permutation) to produce a continuous permutation matrix P̂ from the ARPS scores.
    • The network is simple to implement, runs in O(K log K) time, and preserves probability mass, enabling gradient flow through the sorting operation.
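As one concrete example of such a sorting network, the NeuralSort relaxation (Grover et al., 2019) maps a score vector to a row-stochastic soft permutation matrix. The paper may use a different network; this sketch only illustrates the general mechanism of obtaining a differentiable P̂ from scores.

```python
import torch

def neuralsort(s, tau=1.0):
    """NeuralSort-style soft descending sort of a 1-D score tensor `s`.

    Returns a row-stochastic relaxation P_hat of the permutation matrix
    that sorts s in descending order; as tau -> 0 it approaches the hard
    permutation, while staying differentiable for tau > 0.
    """
    n = s.size(0)
    A = (s.unsqueeze(0) - s.unsqueeze(1)).abs()   # pairwise |s_i - s_j|
    B = A.sum(dim=1)                              # row sums of A
    # (n + 1 - 2i) coefficients for rows i = 1..n of the relaxation
    scaling = s.new_tensor([n + 1 - 2 * (i + 1) for i in range(n)])
    C = scaling.unsqueeze(1) * s.unsqueeze(0)     # (n+1-2i) * s_j
    return torch.softmax((C - B.unsqueeze(0)) / tau, dim=-1)
```

With a small temperature, each row of the result concentrates on the index of the correspondingly ranked score, so multiplying P̂ by a value vector yields an (approximately) sorted vector through which gradients can flow.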
  3. diffNDCG loss.

    • Using the soft permutation P̂, the authors construct a differentiable approximation of NDCG, called diffNDCG, which closely mirrors the true NDCG while being fully back‑propagatable.
    • The loss penalizes misplacements of high‑relevance responses more heavily, aligning the training objective with the evaluation metric used in practice (win‑rate, ranking accuracy).
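A differentiable NDCG along these lines can be computed by softly sorting the relevance labels with P̂ and applying the standard exponential-gain/logarithmic-discount formula. Treating this as the paper's exact diffNDCG is an assumption; it is a sketch of the general construction.

```python
import torch

def diff_ndcg(p_hat, rel):
    """Differentiable NDCG sketch.

    p_hat : (K, K) soft permutation matrix ordering responses by predicted score
    rel   : (K,) graded relevance labels (higher = more preferred)
    """
    k = rel.size(0)
    soft_sorted_rel = p_hat @ rel                 # relevances in predicted order
    discounts = 1.0 / torch.log2(torch.arange(k, dtype=rel.dtype) + 2.0)
    gains = 2.0 ** soft_sorted_rel - 1.0          # exponential gain stresses
    dcg = (gains * discounts).sum()               # high-relevance placements
    ideal = 2.0 ** rel.sort(descending=True).values - 1.0
    idcg = (ideal * discounts).sum()
    return dcg / idcg                             # to train, minimize 1 - diff_ndcg
```

The exponential gain and log discount are what give the loss its stated behavior: misplacing a high-relevance response near the top of the list costs far more than a mistake further down.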
  4. Learning pipeline.

    • For each prompt, a supervised fine‑tuned (SFT) model generates K responses. The current policy π_θ scores them with ARPS, the sorting network orders them, diffNDCG is computed against human‑provided relevance scores, and gradients update θ.
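Putting the pieces together, a toy end-to-end step shows that gradients do flow from the diffNDCG objective back to the scores. Here a trainable tensor stands in for the policy's ARPS scores; all names, shapes, and the temperature value are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
K = 4
scores = torch.randn(K, requires_grad=True)   # stand-in for ARPS scores
rel = torch.tensor([3.0, 2.0, 1.0, 0.0])      # human-provided relevance labels

# NeuralSort-style soft permutation (one of several possible sorting networks)
A = (scores.unsqueeze(0) - scores.unsqueeze(1)).abs()
scaling = scores.new_tensor([K + 1 - 2 * (i + 1) for i in range(K)])
p_hat = torch.softmax(
    (scaling.unsqueeze(1) * scores.unsqueeze(0) - A.sum(dim=1)) / 0.1, dim=-1
)

# diffNDCG from the softly sorted relevances, then the training loss
discounts = 1.0 / torch.log2(torch.arange(K, dtype=rel.dtype) + 2.0)
dcg = ((2.0 ** (p_hat @ rel) - 1.0) * discounts).sum()
idcg = ((2.0 ** rel.sort(descending=True).values - 1.0) * discounts).sum()
loss = 1.0 - dcg / idcg

loss.backward()                               # gradients reach the score tensor
```

In the full pipeline the scores would come from π_θ over K sampled responses and the backward pass would update the policy parameters rather than a raw tensor.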

Experimental validation
The authors evaluate DRPO on publicly available listwise preference datasets such as UltraFeedback and VLFeedback, as well as on synthetic pairwise datasets for comparison. Metrics include win‑rate (pairwise preference dominance), NDCG, and human‑rated quality scores. DRPO consistently outperforms DPO, RSO, PRO, LiPO, and other baselines across all metrics, with especially large gains in top‑rank NDCG (positions 1‑3). Ablation studies show that removing either ARPS or diffNDCG degrades performance, confirming that both components are essential.

Limitations and future work
The current formulation assumes a fixed number K of responses per prompt; extending to variable‑length lists or streaming generation scenarios remains open. The sorting‑network approximation may introduce bias for very large K, suggesting the need for more accurate differentiable sorting techniques. Moreover, the authors note that integrating active ranking (selectively querying the most informative comparisons) or semi‑supervised learning could further reduce the cost of obtaining high‑quality ranking data.

Overall impact
DRPO demonstrates that directly optimizing a listwise ranking metric, together with a carefully designed score function and differentiable sorting, yields a more faithful alignment of LLMs to human preferences than pairwise methods. By bridging the gap between evaluation (win‑rate/NDCG) and training objectives, the work offers a practical, computationally efficient alternative to traditional RLHF pipelines, potentially lowering training costs while improving safety and usefulness of generated text.

