The Reward Model Selection Crisis in Personalized Alignment

Reading time: 27 minutes

๐Ÿ“ Original Info

  • Title: The Reward Model Selection Crisis in Personalized Alignment
  • ArXiv ID: 2512.23067
  • Date: 2025-12-28
  • Authors: Fady Rezk, Yuangang Pan, Chuan-Sheng Foo, Xun Xu, Nancy Chen, Henry Gouk, Timothy Hospedales

๐Ÿ“ Abstract

Personalized alignment from preference data has focused primarily on improving personal reward model (RM) accuracy, with the implicit assumption that better preference ranking translates to better personalized behavior. However, in deployment, computational constraints necessitate inference-time adaptation such as reward-guided decoding (RGD) rather than per-user policy fine-tuning. This creates a critical but overlooked requirement: reward models must not only rank preferences accurately but also effectively guide generation. We demonstrate that standard RM accuracy fails catastrophically as a selection criterion for deployment-ready personalized rewards. We introduce policy accuracy-a metric quantifying whether RGD-adapted LLMs correctly discriminate between preferred and dispreferred responses-and show that upstream RM accuracy correlates only weakly with downstream policy accuracy (Kendall's τ = 0.08-0.31). More critically, we introduce Pref-LaMP, the first personalized alignment benchmark with ground-truth user completions, enabling direct behavioural evaluation. On Pref-LaMP, we expose a complete decoupling between discriminative ranking and generation metrics: methods with 20-point RM accuracy differences produce almost identical output quality, and methods with high ranking accuracy can fail to generate behaviorally aligned responses. These findings reveal that the field has been optimizing for proxy metrics that do not predict deployment performance, and that current personalized alignment methods fail to operationalize preferences into behavioral adaptation under realistic deployment constraints. In contrast, we find simple in-context learning (ICL) to be highly effective -dominating all reward-guided methods for models ≥3B parameters, achieving ∼3 point ROUGE-1 gains over the best reward method at 7B scale.

๐Ÿ“„ Full Content

Recent advances in aligning large language models (LLMs) with human preferences have primarily focused on learning from aggregated feedback across diverse user populations (Rafailov et al., 2023;Ouyang et al., 2022). However, preferences are inherently pluralistic-varying across individuals, communities, and contexts (Santurkar et al., 2023;Sorensen et al., 2024). This reality motivates personalized alignment: adapting model behavior to heterogeneous, sometimes conflicting, user preferences rather than collapsing them into a single consensus objective.

Current personalized alignment research has converged on a common paradigm: collect user-specific preference data (pairwise comparisons), train personalized ranking/reward models to capture individual preferences, and assume that better reward models naturally translate to better policies (Bose et al., 2025; Chen et al., 2025a; Shenfeld et al., 2025; Li et al., 2024; Poddar et al., 2024). The last assumption is likely to break, as suggested by Goodhart’s law (El-Mhamdi and Hoang, 2024). Unlike standard RLHF, personalized alignment lacks downstream benchmarks that measure policy performance, such as MMLU (Hendrycks et al., 2021) and GSM8k (Cobbe et al., 2021).

Practical Deployment and the End-to-End Perspective Per-user policy fine-tuning using personal rewards via RL is computationally infeasible at scale. RL-based personalization requires per-user dynamic adapter management, RL instability mitigation, and orders of magnitude more compute than inference-time alternatives. One key scalable deployment path is inference-time adaptation through reward-guided decoding (RGD) (Khanov et al., 2024), maintaining a single base policy while using personalized rewards to guide generation. Another option is Best-of-N (BoN) sampling (Ichihara et al., 2025), but the high latency of BoN makes it unfit for personalized text generation.

This deployment reality demands we adopt an end-to-end behavioral perspective: personalized alignment is not merely reward modeling, but the complete process from preference data to actual generation behavior. We propose a key principle:

A personalized alignment method must specify not only how preferences are modeled, but how they are operationalized into behavioral adaptation.

A direct corollary of this is that papers proposing reward models are responsible for evaluating if improved RM accuracy translates to improved generation. Current evaluations ignore this responsibility, treating reward modeling and policy adaptation as independent. This obscures whether methods actually achieve their objective: making models generate responses aligned with user preferences.

We adopt an end-to-end perspective, studying the complete chain from preference modeling to generation behavior. We ask three questions: (1) Does RM accuracy predict policy accuracy under RGD? (2) Does policy accuracy predict generation quality? (3) How do reward-based alignment methods compare to simpler baselines? To answer these, we introduce (A) policy accuracy, measuring whether RGD scoring assigns higher scores to preferred responses, and (B) Pref-LaMP, a preference learning benchmark with ground-truth user completions enabling direct behavioral evaluation.

Our findings reveal a fundamental selection crisis: practitioners cannot reliably choose deployment-ready methods because standard metrics do not predict actual performance.

Finding 1: Upstream RM accuracy does not predict downstream policy accuracy under RGD (Kendall’s τ = 0.08-0.31). Methods with 20-point RM accuracy differences achieve nearly identical policy performance.

Finding 2: Response ranking quality does not predict response generation quality. On Pref-LaMP, methods with similar generation quality vary dramatically in RM and policy accuracy.

Finding 3: ICL dominates at scale. At 7B parameters, ICL-RAG, with RAG-selected preference demonstrations (Salemi et al., 2024), outperforms the best personalized reward model by ∼3 ROUGE-1 points.

Implications: For practitioners, use simple ICL-RAG in preference to published personal reward methods. For researchers, take an end-to-end perspective: co-design and co-evaluate reward personalization with policy adaptation strategies, and evaluate generation quality as well as ranking. Use Pref-LaMP and develop more benchmarks with ground-truth completions, analogous to GSM8K/MMLU for general RLHF.

To summarize, we (1) contribute Pref-LaMP, the first benchmark with ground-truth user completions, (2) demonstrate that the standard RM accuracy metric fails as a selection criterion across three datasets and four scales, (3) demonstrate that a simple ICL baseline outperforms published personal alignment work in end-to-end adaptation, and (4) provide actionable recommendations for practitioners and researchers.

Personalized Alignment Recent work has focused on learning user-specific reward models or policies for alignment under limited supervision (e.g., PAL, PReF, LoRE, P-DPO, VPL) (Chen et al., 2025a; Poddar et al., 2024; Bose et al., 2025; Shenfeld et al., 2025; Li et al., 2024). These approaches largely target reward-modeling accuracy (e.g., relative preference ranking) as a proxy for personalization quality (Chen et al., 2025a; Bose et al., 2025; Shenfeld et al., 2025). However, such metrics often fail to capture (i) whether RM adaptation translates to downstream policy adaptation, and (ii) whether, under realistic resource-constrained settings, a personalized policy is able to go beyond response ranking and actually generate responses reflective of a user’s preferences. This evaluation limitation leaves open the null hypothesis that prior personal alignment results, measured by RM accuracy, are due to unintended overoptimization, also known as reward hacking (Pan et al., 2022).

Multi-objective alignment (MOA) addresses the challenge of optimizing language models across multiple known and predefined reward dimensions simultaneously. Unlike personalized alignment, where the goal is to learn individual user preferences under limited supervision, MOA assumes access to distinct reward models for each objective dimension (e.g., helpfulness, harmlessness, factuality) and focuses on finding optimal policy trade-offs among these objectives. Prior work has explored weighted reward optimization (Zhou et al., 2024), model merging (Jang et al., 2024; Rame et al., 2023), auxiliary correction models (Ji et al., 2024; Yang et al., 2024a), and test-time reward-guided decoding (Chen et al., 2025b), among other methods (Yang et al., 2024b).

Evaluation Challenges in RLHF Existing evaluation practices in RLHF and personalized alignment rely on proxy metrics such as reward-model scores, which are susceptible to reward hacking and circularity (Tien et al., 2023; Pan et al., 2022). These metrics assess optimization success rather than behavioral quality (Wen et al., 2025; Gao et al., 2023; El-Mhamdi and Hoang, 2024). In contrast, our work introduces a framework for direct behavioral evaluation on ground-truth user completions, measuring whether generated responses match what users actually wrote rather than relying on proxy metrics (Section 5). Extended discussion of related evaluation pathologies appears in Appendix A.

Inference-Time Alignment Reward-guided decoding (Khanov et al., 2024) and Best-of-N sampling (Ichihara et al., 2025) enable policy steering without fine-tuning, making them computationally attractive for personalization. Recent work has explored their effectiveness (Wu, 2025), but standard personal alignment evaluation remains limited to reward-based metrics. Our work is the first to systematically evaluate test-time alignment (reward-guided decoding in particular) for personalized alignment with ground-truth behavioral assessment, revealing fundamental limitations in their ability to operationalize user preferences.

Consider a preference dataset $\mathcal{D} = \{(u_i, x_i, y_i^{(w)}, y_i^{(l)})\}_{i=1}^{N}$, where $u_i \in \{1, \dots, K\}$ is a user identifier, $x_i$ is a prompt, and $y_i^{(w)}, y_i^{(l)}$ are the chosen/winning and rejected/losing completions. We partition users into $\mathcal{U}_{\text{train}}$ (for learning shared preference structure) and $\mathcal{U}_{\text{adapt}}$ (for evaluating few-shot personalization). For each adaptation user $k$, we obtain user-specific parameters $z_k = \mathcal{A}(\mathcal{D}_k)$ from that user's few-shot preference data $\mathcal{D}_k$,

where $\mathcal{A}$ is the adaptation algorithm.

Given computational constraints prohibiting per-user policy fine-tuning, we deploy personalized alignment through inference-time guidance using Reward-Guided Decoding (Khanov et al., 2024).

Standard practice evaluates personalization methods by reward model accuracy, defined below.

Definition 1 (Reward Model Ranking Accuracy). For user $k$ with evaluation set $\mathcal{D}_k^{\text{query}} = \{(x_i, y_i^{(w)}, y_i^{(l)})\}$, we define the Reward Model Ranking Accuracy as

$$\mathrm{Acc}_{\mathrm{RM}}(k) = \frac{1}{|\mathcal{D}_k^{\text{query}}|} \sum_{(x, y^{(w)}, y^{(l)}) \in \mathcal{D}_k^{\text{query}}} \mathbb{1}\!\left[R_k(x, y^{(w)}) > R_k(x, y^{(l)})\right],$$

where $R_k$ is user $k$'s personalized reward model. This measures the reward model's pairwise ranking accuracy on complete responses. Later we will show that this standard metric has several issues and is not predictive of deployment performance.

We introduce a metric quantifying whether the RGD scoring function-not just the reward model in isolation-correctly ranks preferred over dispreferred responses.

Definition 2 (Policy Ranking Accuracy). Let $s : \mathcal{Y} \times \mathcal{X} \to \mathbb{R}$ be the scoring function used at generation time. The policy accuracy for user $k$ is given by

$$\mathrm{Acc}_{\mathrm{policy}}(k) = \frac{1}{|\mathcal{D}_k^{\text{query}}|} \sum_{(x, y^{(w)}, y^{(l)}) \in \mathcal{D}_k^{\text{query}}} \mathbb{1}\!\left[s(y^{(w)}, x) > s(y^{(l)}, x)\right],$$

where $y^{(w)}$ and $y^{(l)}$ denote the chosen (winning) and rejected (losing) completions.
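To make Definitions 1 and 2 concrete, here is a minimal sketch of the shared ranking-accuracy computation; `reward_model` and `rgd_score` in the usage comment are placeholder callables, not the paper's released code.

```python
from typing import Callable, Iterable, Tuple

Example = Tuple[str, str, str]  # (prompt x, chosen y_w, rejected y_l)

def ranking_accuracy(score: Callable[[str, str], float],
                     examples: Iterable[Example]) -> float:
    """Fraction of pairs for which `score` ranks the chosen response above the rejected one.

    With `score` = a personal reward model R_k(x, y) this is Definition 1 (RM ranking
    accuracy); with `score` = the RGD scoring function used at generation time it is
    Definition 2 (policy ranking accuracy).
    """
    examples = list(examples)
    correct = sum(score(x, y_w) > score(x, y_l) for x, y_w, y_l in examples)
    return correct / max(len(examples), 1)

# Hypothetical usage (placeholder callables, not the paper's released models):
# rm_acc     = ranking_accuracy(lambda x, y: reward_model(x, y), eval_pairs)
# policy_acc = ranking_accuracy(lambda x, y: rgd_score(y, x),    eval_pairs)
```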

We instantiate ๐‘  with three scoring functions, each revealing different aspects of the personalization pipeline.

Base Policy. The base policy's length-normalized log-likelihood, capturing its off-the-shelf, non-personalized zero-shot ranking ability.

Personalized RGD. The reward-guided decoding score, which combines the base policy with the user's personal reward model.

ICL. The base policy's length-normalized log-likelihood where user demonstrations are prepended to the input prompt.

Both personalized RGD and ICL leverage user-specific information ($z_k$ versus $\mathcal{D}_k^{\text{demo}}$), but through different mechanisms: learned reward shaping versus direct context conditioning. This allows us to compare whether parametric reward models or demonstration-based adaptation better capture user preferences, particularly as model scale increases.
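As an illustration of the likelihood-based scoring functions (base policy and ICL), a sketch of length-normalized log-likelihood scoring with a Hugging Face causal LM; Qwen2.5-0.5B is one of the backbones mentioned in this write-up but is otherwise a placeholder, and the prompt/response token split below assumes the prompt tokenizes identically as a prefix of the concatenation, which is an approximation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works here.
MODEL_NAME = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def length_normalized_loglik(prompt: str, response: str) -> float:
    """Average per-token log-probability of `response` given `prompt`.

    Without demonstrations this is the zero-shot base-policy score; prepending
    ICL demonstrations to `prompt` gives the ICL variant of the same score.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits                        # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predicts tokens 1..T-1
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Approximation: assumes `prompt` tokenizes to the first n_prompt tokens of full_ids.
    n_prompt = prompt_ids.shape[1]
    return token_lp[:, n_prompt - 1:].mean().item()        # keep only response tokens
```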

Outstanding Limitation. Policy accuracy measures how well the scoring function ranks static responses, not whether the policy will actually generate outputs that align with user preferences. A method might rank existing responses correctly while producing generations that differ substantially from what users would write. This motivates our behavioral evaluation in Section 5.

To enable direct measurement of behavioral alignment without circular reward-based metrics, we introduce Pref-LaMP-a personalized alignment benchmark providing both pairwise preferences and ground-truth user-authored completions.

Dataset Construction Pref-LaMP derives from LaMP-5 (Salemi et al., 2024), pairing researchers’ abstracts with their titles. Both are author-written, capturing individual style. We construct preferences via hard negative mining: (1) encode abstracts with Qwen3-Embedding-0.6B, (2) retrieve the top-$k$ most similar abstracts, (3) sample one retrieved abstract as $x$ and use its title as $y^{(l)}$, (4) use the original title as $y^{(w)}$. This ensures rejections are topically relevant but different in title formulation.¹ Pref-LaMP is the first benchmark enabling direct behavioural evaluation of personalization through user-authored completions, measurable via ROUGE and BERTScore.
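A sketch of this construction under our reading of steps (1)-(4); `encode` stands in for the Qwen3-Embedding-0.6B encoder and is assumed to return unit-norm embeddings, and the function and variable names are ours rather than the released pipeline's.

```python
import numpy as np

def build_pref_lamp(abstracts, titles, encode, k=5, seed=0):
    """Hard-negative mining per steps (1)-(4) above.

    abstracts[i] / titles[i] come from the same author-written record; `encode`
    maps a list of strings to a unit-norm (N, d) embedding matrix.
    Returns (x, y_w, y_l) preference triples.
    """
    rng = np.random.default_rng(seed)
    emb = encode(abstracts)                 # (1) encode abstracts
    sims = emb @ emb.T                      # cosine similarity for unit-norm rows
    np.fill_diagonal(sims, -np.inf)         # never retrieve the abstract itself
    triples = []
    for i in range(len(abstracts)):
        top_k = np.argsort(-sims[i])[:k]    # (2) retrieve top-k similar abstracts
        j = int(rng.choice(top_k))          # (3) sample one retrieved abstract as x ...
        x, y_l = abstracts[j], titles[j]    #     ... and use its title as the rejection
        y_w = titles[i]                     # (4) the original title is the chosen response
        triples.append((x, y_w, y_l))
    return triples
```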

We evaluate end-to-end behavioural alignment by comparing user-authored completions and personalized model responses.

Definition 3 (Behavioral Alignment). Let $G : \mathcal{X} \to \mathcal{Y}$ be a generation operator and $S : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a similarity measure. For user $k$ with test set $\mathcal{D}_k^{\text{test}} = \{(x_i, y_i^{*})\}$, behavioural alignment is the average similarity

$$\mathrm{Align}(k) = \frac{1}{|\mathcal{D}_k^{\text{test}}|} \sum_{(x, y^{*}) \in \mathcal{D}_k^{\text{test}}} S\!\left(G(x), y^{*}\right),$$

where $y^{*}$ is the user-authored ground-truth completion.

We instantiate $G$ with ARGS decoding (Eq. 1), zero-shot generation and ICL generation. Meanwhile, $S$ is instantiated with ROUGE-1/L (lexical overlap) and BERTScore (semantic similarity).
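A minimal sketch of Definition 3 with ROUGE-1 as the similarity $S$; the simplified unigram F1 below stands in for the standard rouge-score implementation, and `generate` is whatever generation operator $G$ is being evaluated.

```python
from collections import Counter

def rouge1_f(prediction: str, reference: str) -> float:
    """Simplified unigram-overlap ROUGE-1 F1 (a stand-in for the rouge-score package)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if not overlap:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)

def behavioral_alignment(generate, test_set) -> float:
    """Definition 3: mean similarity S(G(x), y*) over a user's test prompts.

    `generate` is the generation operator G (ARGS decoding, zero-shot, or ICL);
    `test_set` is a list of (prompt, ground_truth_completion) pairs.
    """
    scores = [rouge1_f(generate(x), y_star) for x, y_star in test_set]
    return sum(scores) / max(len(scores), 1)
```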

Reward models are trained with LoRA rank 8 on $\mathcal{U}_{\text{train}}$ to learn shared parameters $\theta$ and user-specific $\{z_k\}$. We evaluate six personalization methods: LoRE (Bose et al., 2025) (learns reward bases and a user-specific convex combination), LoRE-Alt (same as LoRE but alternating gradient steps between the bases and the user-specific parameters), PReF (Shenfeld et al., 2025) (collaborative filtering), PAL (Chen et al., 2025a), VPL (Poddar et al., 2024), MPU/MPU-Avg (simple baselines of per-user MLPs), and P-DPO (Li et al., 2024) (personalized direct preference optimization). Baselines include Global-RM (non-personalized Bradley-Terry model using the last-token embedding as input to the RM decoder), Global-RM-V2 (the sequence reward is the average reward over all tokens), GenARM (Xu et al., 2025) (autoregressive RM for token-level guidance), zero-shot generation, ICL (random demonstrations), and ICL-RAG (retrieved demonstrations).

Evaluation Protocol We measure: (1) RM accuracy on adaptation users’ held-out preferences (Eq. 2), (2) adaptation users’ policy accuracy vs. prior (no-reward) and global reward baselines (Eq. 3), (3) generation quality on Pref-LaMP via ROUGE-1/L and BERTScore against ground-truth completions (Eq. 8), and (4) win rates where each method’s RM judges its own outputs versus the zero-shot baseline.

In terms of scale, methods show minimal scaling gains, remaining in narrow bands (LoRE-Alt: 57.1-58.5%, VPL: 63.8-66.4%). Unlike TLDR/SmolLM2 where correlations degraded with scale (๐‘Ÿ = 0.41โ†’0.11), PRISM/Qwen2.5 shows strengthening correlations (๐‘Ÿ = 0.31โ†’0.48). Whether this reflects dataset differences, model architecture, or their interaction remains unclear. Regardless, even at 7B, correlations remain too weak for choosing methods based on RM accuracy.

Our careful control evaluation shows wide failure of prior personal alignment methods both in terms of beating global alignment baselines, and in terms of the standard metric of RM accuracy not corresponding to downstream policy accuracy. We attribute this to a mixture of released code not reproducing results, missing non-personal baselines, and inconsistent non-comparable choice of datasets in prior evaluations. See Appendix D and E for further discussion.

Implication: Personal RM accuracy does not reflect performance during policy inference and cannot guide choice of reward model for deployment: Methods with 10+ point RM gaps can perform identically as adapted policies; methods with near identical RM accuracy can have 10+ point gaps in policy accuracy; and alignment methods can invert in ranking between reward and policy evaluations.

Recommendation. Future personal alignment methods must specify a policy adaptation strategy, and assess downstream policy understanding across multiple datasets and scales-not just upstream personal reward accuracy. The RM-policy disconnect demands new metrics measuring reward models’ suitability for guiding generation, rather than pairwise ranking accuracy alone.

Given the weak correlation between RM accuracy and policy discrimination ability under RGD, we now ask: even when methods achieve high policy accuracy-demonstrably preferring chosen over rejected responses-do they actually generate outputs that behaviorally align with user preferences?

We first study Pref-LaMP with preference ranking evaluation in Table 4. The key observation is that for this challenging task, similarly to PRISM (Table 3), personal alignment methods struggle to surpass the Global RM baselines, both for the standard proxy metric of upstream RM accuracy and for our downstream policy accuracy. Only LoRE-Alt comes close to the global baselines in RM accuracy.

We next analyse the behavioural generation quality of the policies, as uniquely enabled by our Pref-LaMP dataset, in Figure 2, with raw results in Appendix C. We see that: (1) the top personal alignment methods for upstream ranking accuracy (LoRE-Alt in Table 4) tend to underperform in downstream generation quality; (2) a few personal alignment methods surpass the zero-shot baseline but only match the Global RM baseline. Moreover, the better methods for generation (e.g., MPU-Avg and LoRE) are worse for upstream ranking (Table 4). Both observations reflect a decoupling between upstream RM accuracy and downstream behavioural generation, showing that downstream generation quality evaluation is a crucial missing component of standard evaluation practice.

Our end-to-end behavioural evaluation also allows direct comparison between existing RM-focused personal alignment approaches and ICL. From Table 4, we can see that ICL actually achieves better policy preference ranking than the personal RMs. In terms of generation quality, Figure 2 shows that direct application of ICL surpasses both the baselines and prior personal alignment methods at 3-7B scale. This suggests that practitioners today should use simple ICL in favour of complex RM-based alignment approaches. How can we reconcile prior papers’ claims of successful RM+RGD-based non-personal alignment with the often negative results from our experiments? The evaluation protocol of prior RGD-based analyses involved guiding generation with an RM, and then evaluating the resulting generations using the same RM (Khanov et al., 2024). The issue with win rates scored in this way is circularity: if RGD adaptation can hack the RM (find a ‘false positive’ response that the RM accepts, while not actually reflecting user preferences), using the same RM to evaluate the result produces overly optimistic results.

To study this protocol, we report win rate vs. zero-shot in Table 5, which confirms the risk of ‘circular’ evaluation: using the same RM to guide decoding and to judge the outputs yields inflated win rates. Our policy accuracy (Tables 2 and 4) does not suffer from this, because the RM is not used to generate; neither does our ground-truth evaluation (Figure 2), because the RM-guided generation is compared to ground truth.

Additional Analysis: ICL We provide further analysis in Appendix C.1 showing that ICL and ICL-RAG improve further with more than the 8 shots used in the main text.

To summarise, we considered the standard RM accuracy, RGD win rate, and our policy-accuracy metrics, all of which are discriminative ranking metrics. Table 6 correlates each of these against our end-to-end generation quality metric. All exhibit negligible to negative correlations with generation quality (Kendall’s τ = -0.188 to -0.017), with the negative correlations suggesting reward hacking.
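Measuring this kind of (de)coupling only requires a rank correlation between per-method scores; a sketch using SciPy's Kendall's τ, with made-up numbers purely to illustrate the computation (they are not the paper's results).

```python
from scipy.stats import kendalltau

# Illustrative per-method numbers only (NOT results from the paper): one entry per
# personalization method, pairing a discriminative metric with generation quality.
rm_accuracy = [0.58, 0.66, 0.71, 0.52, 0.63]   # upstream ranking metric
rouge1      = [0.41, 0.38, 0.39, 0.44, 0.40]   # end-to-end generation quality

tau, p_value = kendalltau(rm_accuracy, rouge1)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
# A tau near zero (or negative) means the ranking metric carries no signal about
# which method will actually generate better-aligned outputs.
```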

Takeaway: No existing metric predicts whether personalization methods will generate aligned outputs. Ground-truth evaluation on user-authored completions is a necessary evaluation component.

Our findings reveal an evaluation crisis in personal alignment research: RM accuracy is uncorrelated with policy accuracy (τ = 0.08-0.31), and method rankings can completely invert between upstream and downstream evaluations. Using Pref-LaMP, the first benchmark with ground-truth user completions, we show that discriminative metrics fail to predict generation quality (Table 6): reward models claiming 99% win rates show no improvement over baselines in ground-truth similarity. The field has been optimizing proxy metrics divorced from deployment objectives.

On a more positive note, we highlight that in contrast to these issues with personal rewards and their evaluations, simple in-context learning dominates reward-guided methods for models โ‰ฅ3B parameters, while being easy and reliable to implement.

Practitioners should use ICL with retrieval for 3B+ models; reward modeling adds complexity without benefit at scale. Researchers should: (1) evaluate complete pipelines end-to-end, not just reward model accuracy, (2) include policy accuracy and ground-truth behavioral metrics, (3) test across model scales to detect scale-dependent effects, (4) build behavioral benchmarks with user-authored completions and (5) compare against ICL baselines and focus future research effort on developing such amortized approaches to personal alignment.

We focus on RGD because it represents a key scalable deployment path-per-user RL fine-tuning remains computationally infeasible for realistic populations. A fundamental challenge with RGD is that it assumes reward models can be token-wise factorized to provide local guidance at each generation step, which is a known source of error when this assumption is violated (Li et al., 2025). While GenARM is specifically designed to address this limitation through token-level autoregressive reward training, it still exhibits the same performance gaps we observe across other methods. This suggests the problem runs deeper than factorization alone-the disconnect between preference learning and generation guidance may be fundamental to the inference-time adaptation paradigm.

Our use of three datasets goes beyond most prior work, which often used only one or contrived datasets. However, our results do show some facets of dataset dependence, so future work should aim to establish larger multi-dataset benchmark suites to thoroughly test personalization across more dimensions of interest.

Incomparable Accuracy Metrics RLHF and DPO both use pairwise preference accuracy, but these metrics measure fundamentally different things. In RLHF, reward model accuracy measures how well $R_\theta$ ranks response pairs. However, the reward model is not the final artifact-it guides a policy through ARGS or RL fine-tuning. The critical question is: does the resulting policy generate aligned responses? Reward model accuracy cannot answer this. A reward model might perfectly rank static pairs while the derived policy fails to generate appropriate responses. In DPO, policy accuracy measures whether $\pi_\phi$ assigns higher probability to preferred responses, but only at the likelihood level-not generation quality. These metrics are not comparable across methods, and neither directly measures the ultimate goal: whether generated outputs align with user preferences.

Circular Evaluation Under Frozen Rewards A common RLHF practice adapts policies using reward models, then evaluates by measuring if adapted policies achieve higher rewards than baselines. This creates circularity: the reward model serves as both the training signal and evaluation metric. High performance only confirms the policy learned to exploit the reward model’s scoring function-not that it captures actual user preferences. If the reward model is misspecified, this circular evaluation systematically hides the failure. A policy could achieve high reward scores while generating responses users would disprefer, and the evaluation cannot detect this because both training and evaluation use the same potentially-flawed reward model.

Proxy-Based Evaluation with LLM-as-a-Judge Recent work uses frontier LLMs as judges, conditioning them on few-shot user examples to rank policy outputs. While appealing, LLM-asjudge remains a learned proxy, not a direct measure of user satisfaction. It provides only relative rankings between methods and cannot quantify whether even the best-ranked method produces satisfactory outputs for individual users.

Toward Comprehensive Evaluation These limitations motivate our evaluation framework, which:

(1) introduces comparable metrics for both reward model quality and policy understanding, (2) breaks circular evaluation by measuring behavioral alignment against ground-truth user completions rather than reward scores, (3) moves beyond proxies to evaluate actual generation quality, and (4) disentangles where personalization succeeds or fails across the reward modeling, policy guidance, and generation stages.

Please note that we only evaluate on a subset of the PRISM test split, because policy accuracy computation is expensive. Reward model performance on the full test split is in Table 8; Global RM still outperforms all other methods, so our conclusions in the main paper text do not change. Meanwhile, the data plotted in Figure 1 can be found in Table 7.

Raw results for the Pref-LaMP dataset can be found in Tables 9, 10, 11 and 12.

We further analyse our strong ICL baselines in terms of number of demonstrations. ICL-RAG improves steadily with demonstrations and scale, reaching โˆผ49 ROUGE-1 at 7B with 8 shots. Larger models show no saturation, effectively leveraging context without reward guidance. This shows that personal alignment is not only possible, but straightforward to implement. However, operationalizing standard but more complex RM-based personal alignment approaches with RGD is comparatively fraught. This is shown in Figure 3.

In our work, we leverage the core principles of PReF but introduce a key architectural modification and a novel initialization scheme. These changes are motivated by the need to adapt PReF from a pairwise preference model into a pointwise reward model, making it suitable for advanced applications such as reward-guided decoding.

The original PReF model is designed to predict a user’s preference for one complete response over another. It computes a single score for a pair of items, $(r_1, r_2)$. However, reward-guided decoding requires a scalar reward score for a single, often incomplete, sequence at each step of the generation process. The original PReF formulation is therefore unsuitable for this task.

To address this, we modified the PReF architecture to explicitly compute a user-specific reward for an individual response, $R(u, r)$. This allows us to score single candidate sequences during decoding.

The original PReF model calculates a preference score $s$ for a user $u$ and a pair of responses $(r_1, r_2)$ based on the difference of their feature representations:

$$s(u, r_1, r_2) = \mathbf{u}^\top \left(\phi(r_1) - \phi(r_2)\right).$$

Here, $\phi$ is a linear head that projects the LLM’s response embeddings into a latent feature space, and $\mathbf{u}$ is the user’s embedding vector.

Our modified architecture decomposes this calculation into two distinct reward computations:

$$R(u, r) = \mathbf{u}^\top \phi(r), \qquad s(u, r_1, r_2) = R(u, r_1) - R(u, r_2).$$

While these two formulations are mathematically equivalent in the final forward pass, this decomposition presents a significant challenge for the model’s initialization, which we address with a novel technique.

The PReF methodology uses Singular Value Decomposition (SVD) on a (response_pair × user) preference matrix to warm-start the model’s parameters. A key challenge arises because the SVD process yields a single feature vector, $\mathbf{v}_p$, for each response pair $p$. This vector $\mathbf{v}_p$ serves as a proxy for the latent feature difference, $\phi(r_1) - \phi(r_2)$.

The SVD provides no direct information about the individual feature vectors $\phi(r_1)$ and $\phi(r_2)$. To initialize a linear head $\phi$ that operates on individual responses, we leverage the linearity of the projection and work directly in the difference space.

We achieve this with the following direct regression algorithm:

  1. Perform SVD: Perform SVD on the preference matrix as in the original PReF to obtain the matrix of pairwise feature vectors $US$ (representing $\phi(r_1) - \phi(r_2)$ for each pair) and user embeddings $VS$.

  2. Compute Embedding Differences: For each unique response pair $(r_1, r_2)$ in the training data, compute the difference of their frozen LLM embeddings: $\mathbf{e}_{\text{diff}} = \mathbf{e}(r_1) - \mathbf{e}(r_2)$.

  3. Bias-Free Linear Regression: A bias-free (intercept-free) regression of the SVD pairwise feature vectors onto the embedding differences finds the optimal initial weights $W$ for the linear head $\phi$ (see the sketch below). This approach is mathematically sound because:

• by linearity, $W\left(\mathbf{e}(r_1) - \mathbf{e}(r_2)\right) = W\mathbf{e}(r_1) - W\mathbf{e}(r_2)$,

• learning $W$ from differences is therefore equivalent to learning it from individual features, and

• the lack of a bias term means reward values have an arbitrary global offset, which cancels out in pairwise comparisons.
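A minimal sketch of the bias-free warm start, assuming the pairwise SVD features and frozen-LLM embedding differences have already been stacked into matrices; variable names are ours, not the original code's.

```python
import numpy as np

def init_linear_head(e_diff: np.ndarray, svd_pair_features: np.ndarray) -> np.ndarray:
    """Bias-free least-squares warm start for the linear head phi.

    e_diff            : (P, d) embedding differences e(r1) - e(r2), one row per pair
    svd_pair_features : (P, m) SVD pairwise feature vectors (US), standing in for phi(r1) - phi(r2)
    Returns W of shape (d, m) with e_diff @ W ~= svd_pair_features. No intercept is fitted,
    so individual rewards keep an arbitrary global offset that cancels in pairwise comparisons.
    """
    W, *_ = np.linalg.lstsq(e_diff, svd_pair_features, rcond=None)
    return W

# Individual response features are then phi(r) ~= e(r) @ W, scored against user embeddings.
```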

Following this warm-start initialization of both the user embeddings (from $VS$) and the linear reward head (from our direct regression method), the model is trained end-to-end using backpropagation.

During training, the model computes individual rewards $R(u, r_1)$ and $R(u, r_2)$ for chosen and rejected responses. The Bradley-Terry preference learning loss is then computed:

$$\mathcal{L}_{\text{BT}} = -\log \sigma\!\left(R(u, r_1) - R(u, r_2)\right),$$

where $\sigma$ is the sigmoid function. Gradients are backpropagated to fine-tune $\phi$, $\mathbf{u}$, and optionally the LLM encoder if not frozen.
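A sketch of the modified pointwise PReF under the description above: a bias-free linear head $\phi$ over frozen response embeddings, a user embedding table, and the Bradley-Terry loss on reward differences. Module and argument names are ours, not the original codebase's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointwisePReF(nn.Module):
    """Pointwise reward R(u, r) = u . phi(e(r)), per the decomposition above.

    d_emb: frozen-LLM embedding size; d_lat: latent feature size; n_users: number
    of users. All names are ours.
    """
    def __init__(self, d_emb: int, d_lat: int, n_users: int):
        super().__init__()
        self.phi = nn.Linear(d_emb, d_lat, bias=False)  # warm-started from the regression W
        self.user_emb = nn.Embedding(n_users, d_lat)    # warm-started from the SVD user factors

    def reward(self, user: torch.Tensor, resp_emb: torch.Tensor) -> torch.Tensor:
        # (B,) scalar reward for each (user, response-embedding) pair.
        return (self.user_emb(user) * self.phi(resp_emb)).sum(-1)

def bt_loss(model: PointwisePReF, user, emb_chosen, emb_rejected) -> torch.Tensor:
    """Bradley-Terry loss: -log sigma(R(u, r_chosen) - R(u, r_rejected))."""
    margin = model.reward(user, emb_chosen) - model.reward(user, emb_rejected)
    return -F.logsigmoid(margin).mean()
```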

This procedure optimizes the true preference learning objective, with the SVD-based initialization serving as a high-quality starting point that accelerates convergence and improves stability. Any imprecision in initialization (such as the arbitrary offset in absolute reward values) is corrected during training. This enhanced methodology preserves the core insights of PReF’s SVD-based initialization while adapting its architecture to support rewardguided decoding.

The original PReF implementation uses a synthetically augmented version of PRISM rather than the natural conversational data, which is characterized by:

• Real human preferences expressed through dialogue

• Sparse preference matrix (unique conversation contexts)

• No systematic overlap between users and prompts

Our implementation extracts genuine conversational preferences, yielding ∼27K training samples.

PReF’s “PRISM” dataset differs fundamentally from real PRISM data across multiple dimensions. While PReF uses synthetic data generated by GPT-4o-mini with simulated demographic preferences, real PRISM captures authentic human conversations and actual user choices. The synthetic dataset exhibits a dense matrix structure with exactly 50 users per item, enabling high controllability and strong SVD performance, whereas real PRISM data is characterized by sparse, unique contexts with natural variation that yields weaker SVD results. PReF’s dataset contains over 90K samples compared to 27K in the real data, but this larger volume comes at the cost of realism-the synthetic patterns may not generalize to authentic human behavior in the way that real PRISM’s genuine user interactions do.

Key points:

  1. Not an apples-to-apples comparison-PReF’s dense setup differs fundamentally from real sparse data.

  2. SVD initialization performs better in dense synthetic matrices.

  3. Training dynamics differ due to uniform synthetic distribution.

  4. Evaluation on simulated preferences may not generalize to real human data.

PReF’s reliance on dense user-item overlap is intrinsic to collaborative filtering. Sparse real data poses challenges but reflects real-world personalization.

• Match PReF: Use synthetic PERSONA-style preferences for strong SVD initialization.

• Hybrid: Augment sparse real data with synthetic overlap.

Our Choice: We prioritize authenticity by using real PRISM conversational preferences in their natural sparse form, tackling the more difficult-but more realistic-personalization problem.

LoRE is a pairwise preference learning method introduced in LoRe: Personalizing LLMs via Low-Rank Reward Modeling (Bose et al., 2025). It learns a reward function from preference data, where each datapoint consists of a user input and two responses, one preferred over the other.

Unlike methods that train a binary classifier to predict which response is better, LoRE optimizes a logistic loss over the difference of reward values assigned to the preferred and dispreferred responses.

The LoRE architecture consists of two key components:

Feature Extractor A shared feature extractor $\phi$ (typically a pretrained language model) processes the input $x$ and response $y$ to produce $K$ base reward scores: $R_\phi(x, y) \in \mathbb{R}^K$.

User Weight Vector For each user, we learn a low-rank weight vector $w \in \mathbb{R}^K$ that linearly combines these base rewards to produce a personalized scalar reward:

$$R(x, y; w) = w^\top R_\phi(x, y).$$

This architecture allows the model to learn a shared representation of reward dimensions through $\phi$, while capturing individual user preferences through the lightweight weight vectors $w$.

For a preference pair $(x, y^+, y^-)$ where $y^+$ is preferred over $y^-$, the loss uses the difference of personalized rewards:

$$\mathcal{L}_{\text{LoRE}} = \log\!\left(1 + \exp\!\left(-w^\top \left[R_\phi(x, y^+) - R_\phi(x, y^-)\right]\right)\right) \qquad (10)$$

This encourages the model to assign a higher personalized reward to the preferred response $y^+$ over the dispreferred one $y^-$.

The paper introduces two variants. LoRE trains both the user-specific weights $w$ and the feature extractor $\phi$ simultaneously in a single optimization step; this approach was used for the TL;DR dataset in the original implementation.

LoRE-Alt uses an alternating optimization strategy: for each batch, it takes one gradient step on the user-specific weights $w$ (freezing the feature extractor $\phi$), then one step on the feature extractor $\phi$ (freezing the weights $w$). This approach was used for more complex datasets in the original implementation.

LoRE-Alt also leverages an off-the-shelf reward model (Skywork RM) and includes a regularization term to prevent the learned model from deviating too far from the pretrained baseline. However, since we train our Qwen2.5-0.5B model from scratch without a pretrained reward model, we omit this regularization.

Note: The original codebase does not successfully reproduce results on the PRISM dataset.

In our implementation, we instead use:

loss = -F.logsigmoid(reward_diff).mean()

where:

reward_diff = w.T @ (R_phi(x, y^+) - R_phi(x, y^-))

This is mathematically equivalent to the original logistic loss, since:

$$-\log \sigma(x) = \log\!\left(1 + \exp(-x)\right) \qquad (11)$$

The logsigmoid loss is a numerically stable, PyTorch-friendly implementation of the same core principle. This change does not affect the training dynamics or final optimization target-it is purely an implementation detail.
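For completeness, a self-contained version of the snippet above, assuming the $K$ base rewards for the chosen and rejected responses have already been computed; tensor names and shapes are ours.

```python
import torch
import torch.nn.functional as F

def lore_loss(r_phi_pos: torch.Tensor, r_phi_neg: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Eq. (10) in logsigmoid form: mean of log(1 + exp(-w^T [R_phi(x,y+) - R_phi(x,y-)])).

    r_phi_pos, r_phi_neg: (B, K) base reward vectors for preferred / dispreferred responses
    w                   : (K,)  user-specific combination weights
    """
    reward_diff = (r_phi_pos - r_phi_neg) @ w
    return -F.logsigmoid(reward_diff).mean()

# Shape check with random tensors (values are meaningless):
print(lore_loss(torch.randn(4, 8), torch.randn(4, 8), torch.randn(8)).item())
```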

In the original LoRE paper, the reward model can be formulated to take the difference of features directly:

$$w^\top \left[\phi(x, y^+) - \phi(x, y^-)\right] \qquad (12)$$

In our implementation, we compute the reward separately on each response using the $K$-dimensional feature extractor, then take the weighted difference:

$$w^\top R_\phi(x, y^+) - w^\top R_\phi(x, y^-).$$

This equivalence holds because the personalization layer (the ๐‘ค weights) is linear in the feature space.

LoRE also supports Alignment as Reward-Guided Search (ARGS), where generation is guided at decoding time using the learned reward model. In our implementation, we enable ARGS as a runtime decoding strategy by plugging in the learned reward model as a plug-and-play scoring function. This is implemented by scoring candidate continuations during beam or sampling-based decoding using the personalized reward.

This allows us to steer generation toward responses that maximize the learned user-aligned reward signal, without requiring reinforcement learning or sampling from a reward-shaped distribution.
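A simplified sketch of one such ARGS-style decoding step, combining the base policy's next-token log-probability with the personalized reward over a top-k candidate set; the callables (`lm_logprobs`, `reward`, `decode`) are placeholders for the actual model components, and the real implementation may differ.

```python
import torch

@torch.no_grad()
def rgd_step(lm_logprobs, reward, decode, context_ids, k=10, w=1.0) -> int:
    """One reward-guided decoding step over a top-k candidate set.

    lm_logprobs(ids) -> (vocab,) next-token log-probs from the base policy
    reward(text)     -> scalar personalized reward for a (partial) sequence
    decode(ids)      -> detokenized string; `context_ids` is a plain list of token ids
    Candidate score: log pi(v | context) + w * R(decode(context + [v])).
    """
    logp = lm_logprobs(context_ids)
    top_logp, top_ids = torch.topk(logp, k)         # restrict scoring to top-k tokens
    scores = [
        lp.item() + w * reward(decode(context_ids + [int(tid)]))
        for lp, tid in zip(top_logp, top_ids)
    ]
    return int(top_ids[scores.index(max(scores))])  # id of the highest-scoring token
```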

It is a known issue that LoRe’s released code does not reproduce their results on the PRISM dataset, due to issues in dataset preparation that conflated the reported results; see the linked GitHub issue.²

For the TLDR dataset, all models were trained with LoRA rank 8 and LoRA alpha 16, with rsLoRA initialization. The backbone (LoRA module) learning rate was $5 \times 10^{-5}$. Different models’ decoders used varying hyperparameters, as listed below:

Datasets and Models We consider three datasets. TLDR (Stiennon et al., 2020): binary stylistic preferences, 10 training users (2,097 prefs/user), 31 adaptation users (100 prefs/user). PRISM (Kirk et al., 2024): pluralistic preferences, 1,232 training users (22.1 prefs/user), 139 adaptation users (14.5 prefs/user). Pref-LaMP (ours): user-authored completions, 485 training users (48.8 prefs/user), 126 adaptation users (49.2 prefs/user).


We first investigate whether reward model accuracy predicts policy accuracy under reward-guided decoding, i.e., whether personal rewards that rank preferences well can guide policies to do the same. TLDR: Weak Correlation on Simple Data We first evaluate the popular TLDR dataset’s simple binary style preferences in Table 2. The main observation is that upstream RM accuracy correlates weakly with downstream policy accuracy (Kendall’s τ = 0.08-0.31), degrading as scale increases (Pearson r: 0.41 → 0.11 from 180M to 1.7B).


¹ Human-written rather than LLM-written negatives avoid shortcut learning. Initial LLM-generated rejections let linear probes detect generation artifacts rather than preference signals.

² https://github.com/facebookresearch/LoRe/issues/1


This content is AI-processed based on open access ArXiv data.
