Nearly Optimal Active Preference Learning and Its Application to LLM Alignment
Aligning large language models (LLMs) depends on high-quality datasets of human preference labels, which are costly to collect. Although active learning has been studied to improve sample efficiency relative to passive collection, many existing approaches adopt classical experimental design criteria such as G- or D-optimality. These objectives are not tailored to the structure of preference learning, leaving open the design of problem-specific algorithms. In this work, we identify a simple intuition specific to preference learning that calls into question the suitability of these existing design objectives. Motivated by this insight, we propose two active learning algorithms. The first provides the first instance-dependent label complexity guarantee for this setting, and the second is a simple, practical greedy method. We evaluate our algorithm on real-world preference datasets and observe improved sample efficiency compared to existing methods.
💡 Research Summary
The paper tackles the high cost of collecting human preference labels, which are essential for aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF). While prior work has applied classical experimental‑design criteria such as G‑optimality and D‑optimality to the active preference‑learning problem, these objectives are not tailored to the specific structure of pairwise preference data. The authors argue that the most informative queries are those where the two responses have nearly equal reward—i.e., “near‑ties”—because such pairs are the most ambiguous.
Motivated by this insight, they propose two complementary active‑learning algorithms. The first algorithm introduces a novel experimental‑design objective that explicitly incorporates both the width and the location of each arm’s confidence interval relative to zero. By solving an oracle allocation problem (assuming the true parameter θ* is known) they derive an optimal query distribution λ* that minimizes the worst‑case ratio of confidence‑interval width to the true score gap. They then design an adaptive procedure that approximates λ* without knowledge of θ*, and prove an instance‑dependent label‑complexity bound: the number of queries needed to guarantee a δ‑PAC classifier scales with the difficulty of the specific problem instance rather than a worst‑case bound. This is the first such guarantee for active preference learning.
The second algorithm is a practical greedy method based on a new uncertainty‑sampling heuristic. At each round it computes a confidence interval for each arm’s linear score z⊤θ̂ and selects the arm whose interval still contains zero (i.e., whose sign is ambiguous), regardless of the interval’s absolute width. This focuses queries on the most “risky” arms, reducing unnecessary sampling of arms that are already confidently classified even if their intervals are wide. The method requires no complex convex optimization and works naturally in batch settings.
Theoretical analysis includes a concentration result for the MLE under the Bradley‑Terry‑Luce (BTL) model, a derivation of the optimal allocation problem, and an information‑theoretic lower bound showing that their design is near‑optimal.
Empirically, the authors evaluate both algorithms on several real‑world preference datasets (e.g., human‑annotated response pairs from LLMs). Compared with prior D‑optimal, G‑optimal, and random baselines, their methods achieve the same classification accuracy with 20–30 % fewer queries. The improvement is especially pronounced on datasets containing many near‑tie pairs, confirming the intuition that focusing on ambiguous comparisons yields better sample efficiency. Moreover, the learned reward models exhibit finer discrimination between suboptimal responses, leading to modest but consistent gains (≈1–2 %) in downstream RLHF performance.
In summary, the paper makes three key contributions: (1) a novel experimental‑design objective that yields the first instance‑dependent label‑complexity guarantee for active preference learning; (2) a simple, scalable greedy algorithm that operationalizes the same intuition without heavy computation; and (3) extensive empirical validation showing substantial reductions in labeling cost while maintaining or improving reward‑model quality. The work provides a concrete, theoretically grounded pathway to more cost‑effective LLM alignment.
Comments & Academic Discussion
Loading comments...
Leave a Comment