Bayesian Preference Learning for Test-Time Steerable Reward Models


Reward models (RMs) are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time in both single- and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.


💡 Research Summary

The paper addresses a fundamental limitation of current reward models (RMs) used in reinforcement learning from human feedback (RLHF): once trained, they are static and cannot adapt to new or shifting preference distributions at test time. To overcome this, the authors introduce Variational In‑Context Reward Modeling (ICRM), a Bayesian framework that enables test‑time steerability of a single classifier‑type RM through few‑shot in‑context preference demonstrations.

ICRM treats the probability that a “chosen” response is preferred over a “rejected” one as a latent variable z. A Beta(α₀, β₀) prior is placed on z, which is conjugate to the Bernoulli likelihood of observed pairwise outcomes. Given a context C consisting of N demonstration triples (prompt, chosen, rejected), the model θ maps the inputs to two scalars per response: a utility score u and an evidence score s. These are transformed into the parameters of an approximate posterior Beta(α_q, β_q), where α_q = μτ and β_q = (1 − μ)τ, with μ = σ(u_w − u_l) (the usual Bradley‑Terry logistic) and τ = Softplus(s_w) + Softplus(s_l) + 1 controlling the concentration. Thus the posterior mean E[z] = α_q / (α_q + β_q) = μ recovers the Bradley‑Terry preference probability, while τ expresses how much confidence the model places in that estimate.
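The mapping from per-response scores to Beta posterior parameters described above can be sketched in a few lines of Python. This is an illustrative reconstruction from the formulas in the summary, not the authors' implementation; the function name `beta_posterior` and the numeric score values are made up for the example.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x: float) -> float:
    # Numerically safe enough for illustrative inputs
    return math.log1p(math.exp(x))

def beta_posterior(u_w: float, u_l: float, s_w: float, s_l: float):
    """Map utility scores (u) and evidence scores (s) of the winning (w)
    and losing (l) responses to approximate-posterior parameters."""
    mu = sigmoid(u_w - u_l)                    # Bradley-Terry win probability
    tau = softplus(s_w) + softplus(s_l) + 1.0  # concentration, always > 1
    alpha_q = mu * tau
    beta_q = (1.0 - mu) * tau
    return alpha_q, beta_q

# Hypothetical scores for one preference pair
alpha_q, beta_q = beta_posterior(u_w=1.2, u_l=0.4, s_w=0.5, s_l=-0.3)
mean = alpha_q / (alpha_q + beta_q)  # posterior mean E[z] = mu
```

Note that the posterior mean depends only on the utility gap u_w − u_l, while the evidence scores set the concentration τ = α_q + β_q, i.e. how sharply the Beta posterior peaks around that mean.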

