PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation
Reward models are pivotal for aligning Large Language Models (LLMs) with human preferences. Existing approaches face two key limitations: discriminative reward models require large-scale annotated data because they cannot exploit the preference instruction-following capability of LLMs that generative reward models leverage, and reward models are particularly prone to reward overoptimization, where LLMs exploit weaknesses in the reward function instead of genuinely improving alignment. We introduce \textbf{PIRA}, a training paradigm that integrates three complementary strategies to address these challenges: (1) reformulating question-answer pairs into preference-task instructions to explicitly leverage LLMs’ preference instruction-following capability, (2) averaging the rewards obtained under diverse preference-task instructions for each sample, which mitigates task-specific bias and enhances robustness across evaluation perspectives, and (3) averaging value-head outputs under different dropout rates to stabilize reward estimation. Experiments on public datasets show that PIRA considerably improves performance, enhances generalization, and effectively mitigates reward overoptimization.
💡 Research Summary
The paper introduces PIRA (Preference‑Oriented Instruction‑tuned Reward Models with Dual Aggregation), a novel training paradigm for reward models used in aligning large language models (LLMs) with human preferences. Existing discriminative reward models suffer from two major drawbacks: they require large amounts of annotated preference data because they do not exploit the instruction‑following abilities of LLMs, and they are vulnerable to reward over‑optimization, where the policy learns to exploit quirks of the reward function rather than genuinely improving alignment. Generative reward models mitigate the first issue but incur high inference latency due to autoregressive generation.
PIRA addresses both problems through three complementary strategies. First, it reformulates each (question, preferred answer, rejected answer) triple into a set of “preference‑task instructions”. These instructions explicitly tell the model to evaluate a response, thereby leveraging the LLM’s innate capability to follow task‑level prompts. The instruction set T is created by a large language model and refined by human review; each instruction provides a holistic rubric rather than a list of dimensions, and multiple phrasings introduce diverse evaluation perspectives.
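A minimal sketch of this reformulation step, assuming illustrative instruction templates (the paper's actual instruction set T is LLM-generated and human-refined, and its exact wording is not reproduced here):

```python
# Hypothetical preference-task instruction templates standing in for the
# paper's instruction set T; each phrasing offers a different evaluation
# perspective on the same QA pair.
INSTRUCTION_TEMPLATES = [
    "Evaluate how helpful and harmless the following response is.",
    "Judge the overall quality of the assistant's reply to the user.",
    "Assess whether the response below answers the question well.",
]

def build_prompt(instruction: str, question: str, answer: str) -> str:
    """Combine one preference-task instruction with a QA pair into a prompt
    that a reward model can score."""
    return f"{instruction}\n\nQuestion: {question}\n\nResponse: {answer}"

prompt = build_prompt(
    INSTRUCTION_TEMPLATES[0],
    "What is RLHF?",
    "RLHF fine-tunes a model with human feedback.",
)
```

Each (question, answer) pair thus yields as many distinct prompts as there are instruction phrasings, which is what the averaging in the next step operates over.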
Second, during training and inference the model does not rely on a single instruction. For each sample, K instructions are randomly sampled from T, and the scalar reward rϕ(x, y | t_k) is computed for each. The final instruction‑averaged reward R_inst(x, y) is the mean over these K values. This aggregation reduces instruction‑specific bias, improves robustness across evaluation viewpoints, and yields a more stable estimate of preference.
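The instruction-averaging step can be sketched as follows, with a toy reward function standing in for the trained model's scalar head (the function and signature are assumptions for illustration, not the paper's implementation):

```python
import random

def instruction_averaged_reward(reward_fn, x, y, instruction_set, k, rng=None):
    """R_inst(x, y): mean of r_phi(x, y | t_k) over K instructions sampled
    without replacement from the instruction set T."""
    rng = rng or random.Random(0)
    sampled = rng.sample(instruction_set, k)
    return sum(reward_fn(x, y, t) for t in sampled) / k

# Hypothetical stand-in for the model's scalar reward head.
toy_reward = lambda x, y, t: float(len(t))

r = instruction_averaged_reward(
    toy_reward, "q", "a", ["aa", "bbbb", "cc"], k=2, rng=random.Random(0)
)
```

Because the mean is taken over several instruction phrasings, an idiosyncratic score under any single phrasing is damped rather than propagated into training.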
Third, PIRA applies stochastic dropout only to the lightweight value head gψ. For each instruction t, M different dropout rates δ_m (sampled uniformly from 0.1 to 0.4) are used to produce M reward samples r^{(m)}(x, y | t). Their mean R_stoc(x, y | t) approximates a Bayesian Monte‑Carlo dropout estimate, decreasing variance and preventing the model from over‑fitting to a particular head configuration.
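A toy numeric sketch of the stochastic value-head averaging, using a plain linear head with inverted dropout (the real g_psi sits on top of backbone features; the dimensions and weights here are illustrative assumptions):

```python
import random

def value_head(h, w, keep_prob, rng):
    """Toy linear value head with inverted dropout applied to the hidden
    features h: each feature is zeroed with probability (1 - keep_prob)
    and survivors are rescaled by 1 / keep_prob."""
    dropped = [hi / keep_prob if rng.random() < keep_prob else 0.0 for hi in h]
    return sum(wi * di for wi, di in zip(w, dropped))

def stochastic_head_reward(h, w, dropout_rates, rng):
    """R_stoc(x, y | t): mean of M value-head passes, one per dropout
    rate delta_m, approximating a Monte-Carlo dropout estimate."""
    samples = [value_head(h, w, 1.0 - d, rng) for d in dropout_rates]
    return sum(samples) / len(samples)

rng = random.Random(0)
# Per the summary, the rates delta_m are drawn uniformly from [0.1, 0.4];
# here M = 4 for brevity.
rates = [rng.uniform(0.1, 0.4) for _ in range(4)]
r = stochastic_head_reward([0.5, -1.0, 2.0], [1.0, 0.3, -0.2], rates, rng)
```

With all dropout rates set to zero, the estimator collapses to the plain dot product, which is a useful sanity check when implementing this.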
The overall reward for a (question, answer) pair is then computed as a double average: first across the K instructions, then across the M stochastic forward passes. The backbone hθ (the main language model) is fine‑tuned with a very low learning rate (1e‑6) to preserve its pretrained knowledge, while the value head is trained with a higher rate (5e‑4) for rapid adaptation to preference signals. Training uses the standard Bradley‑Terry pairwise loss, with each training instance paired with a randomly sampled instruction.
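The Bradley-Terry pairwise loss mentioned above can be written down directly; this is the standard formulation (the two-group learning-rate split is shown as data only, mirroring the 1e-6 / 5e-4 rates in the summary):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss -log sigmoid(r_chosen - r_rejected),
    written in a numerically stable form for both signs of the margin."""
    m = r_chosen - r_rejected
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))

# Hypothetical two-group optimizer configuration reflecting the summary's
# slow-backbone / fast-value-head split (illustrative, torch-style layout).
param_groups = [
    {"params": "backbone h_theta", "lr": 1e-6},
    {"params": "value head g_psi", "lr": 5e-4},
]
```

At zero margin the loss equals log 2, and it decreases monotonically as the chosen response's reward pulls ahead of the rejected one.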
Experiments were conducted on backbone models from three families (Mistral‑7B‑v0.1, LLaMA‑3‑8B, and Qwen2.5 at 1.5B and 7B) across six public preference datasets: HH, HH‑cleaned, SHP, Alpaca‑farm, OASST, and UltraFeedback. PIRA consistently outperformed baseline discriminative models, the “Thomas” dropout‑ensemble method, and the W‑ARM weight‑averaging technique. For example, on the HH‑cleaned set, PIRA achieved 75.5% accuracy (standard deviation 0.7), compared with 73.3% (1.2) for the best baseline. Ablation studies showed that instruction reformulation contributed the largest gain in accuracy, while instruction‑set averaging and stochastic value‑head averaging mainly reduced variance.
In an end‑to‑end RLHF pipeline using PPO, the authors observed that baseline reward models exhibited sharp spikes in KL divergence and rapid reward inflation, followed by a decline in gold‑reward scores—a classic sign of reward hacking. In contrast, PIRA‑trained reward models kept KL divergence bounded, prevented reward inflation, and showed monotonic improvement in gold‑reward values throughout training, indicating effective mitigation of over‑optimization.
The computational overhead of PIRA is modest because dropout is applied only to the value head; using M = 12 adds roughly 7% latency. Instruction‑set averaging incurs a larger cost proportional to K (e.g., sixfold when K = 6), which must be balanced against the robustness benefits.
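The cost structure above can be captured in a back-of-the-envelope model; this is an illustrative assumption consistent with the summary (each instruction needs its own backbone pass, while dropout passes only re-run the lightweight head), not a formula from the paper:

```python
def relative_inference_cost(k: int, head_overhead: float) -> float:
    """Inference cost relative to one backbone pass with a single
    value-head read.

    Assumes each of the K instructions requires a full backbone forward
    pass, while the M dropout passes reuse those features and only
    re-run the value head (head_overhead ~ 0.07 for M = 12 per the
    summary). Illustrative cost model, not from the paper.
    """
    return k * (1.0 + head_overhead)

cost = relative_inference_cost(6, 0.07)  # roughly 6.4x a single pass
```

This makes the trade-off explicit: K dominates the inference bill, so reducing K (or distilling, as the authors suggest) matters far more than trimming M.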
Limitations noted by the authors include the lack of evaluation on models larger than 13 B parameters, and the increased inference cost associated with multiple instruction passes. Future work could explore distilling the dual‑aggregated reward into a single‑pass model or dynamically selecting a subset of instructions at inference time.
Overall, PIRA presents a simple yet effective framework that combines explicit instruction reformulation with dual aggregation (across instructions and stochastic value‑head realizations) to produce more accurate, stable, and over‑optimization‑resistant reward models. The approach demonstrates strong generalization across domains and model sizes, making it a promising direction for scalable, preference‑aligned LLM training.