UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Multi-objective alignment aims to align LLM responses with multiple human preference objectives. Among existing methods, guiding the generation of frozen LLMs through autoregressive reward models (ARMs) to accomplish multi-objective test-time alignment is a low-cost solution. However, these methods typically rely on independent parameters for each preference objective, either by training ARMs independently across preference dimensions, which neglects interactions among preference features, or by training a single ARM with separate feature extraction modules for each preference, which can cause feature entanglement. Both strategies can result in misalignment between generated outputs and user preferences. To address this limitation, we propose Preference-Modulated & Shared Low-Rank Adaptation (MoSLoRA) for ARM training, which first extracts shared features via a preference-agnostic module and then applies affine transformations to shared features via a preference modulation module conditioned on mixed preference vectors. This design mitigates feature entanglement and enables precise control over preference trade-offs during inference. Building on this, we introduce the Unified Autoregressive Reward Model (UniARM), a novel framework for multi-objective test-time alignment. UniARM jointly models all preference dimensions in a single parameter space, eliminating the need for independent parameters for each preference objective. Moreover, a small UniARM can guide the generation of larger-scale LLMs, enhancing its practical usability.


💡 Research Summary

The paper tackles the problem of aligning large language models (LLMs) with multiple, potentially conflicting human preference dimensions at inference time, a setting known as multi‑objective test‑time alignment. Existing approaches either train separate autoregressive reward models (ARMs) for each preference—incurring high inference cost and risking contradictory guidance—or use a single ARM with separate feature‑extraction modules per preference (e.g., PBLoRA), which suffers from feature entanglement and limited control over trade‑offs.
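To make the general mechanism concrete, the sketch below shows how ARM-guided test-time alignment typically works at the decoding level: the frozen LLM's next-token logits are combined with token-level rewards from a reward model before sampling. The additive combination rule and the `beta` weight are assumptions about this family of methods (GenARM-style guidance), not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def guided_next_token(base_logits, reward_logits, beta=1.0):
    """Test-time alignment sketch: form a next-token distribution
    proportional to the frozen LLM's logits plus beta-scaled token-level
    rewards from an autoregressive reward model. The additive rule and
    beta are illustrative assumptions, not the paper's exact equation."""
    return softmax(base_logits + beta * reward_logits)

base = np.array([2.0, 1.0, 0.5, 0.0, -1.0])    # frozen LLM logits (toy vocab of 5)
reward = np.array([-3.0, 2.0, 0.0, 0.0, 0.0])  # ARM token-level rewards
p = guided_next_token(base, reward, beta=1.0)
print(int(p.argmax()))  # reward guidance shifts mass from token 0 to token 1
```

The base model's parameters are never touched; only the small reward model is trained, which is what makes this a low-cost alignment route.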

To overcome these limitations, the authors introduce Preference‑Modulated & Shared Low‑Rank Adaptation (MoSLoRA). MoSLoRA consists of two low‑rank components: (1) a preference‑agnostic module that extracts shared representations from a frozen LLM using a low‑rank adaptation (matrices A₁, B₁ and core tensor W₁), and (2) a preference‑modulation module that receives a mixed preference vector o′ = αᵀ o (where o contains semantic embeddings of each objective) and generates affine modulation parameters γ_o′ and η_o′ through shared low‑rank matrices A₂, B₂ and separate core tensors W_γ, W_η. The modulation applies a feature‑wise scaling and shifting to the shared representation, yielding a final representation \tilde{h} = (γ_o′ + 1)⊙h′ + η_o′. This design avoids learning separate preference‑specific features, maximizes parameter reuse, and keeps the total number of learnable parameters constant regardless of the number of objectives.
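The two-module design above can be sketched numerically. The snippet below is a minimal, self-contained illustration of the MoSLoRA forward pass using numpy: matrix shapes, initialization scales, and the stand-in frozen projection `W0` are all assumptions for illustration; only the structure (shared low-rank path, shared A₂/B₂ with separate cores W_γ and W_η, and the affine rule h̃ = (γ + 1) ⊙ h′ + η) follows the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, k = 16, 4, 3  # hidden size, LoRA rank, number of objectives (toy values)

# Stand-in for a frozen LLM projection (not trained).
W0 = rng.standard_normal((d, d)) * 0.02

# (1) Preference-agnostic shared module: low-rank A1, B1 with core W1.
A1 = rng.standard_normal((d, r)) * 0.02
W1 = rng.standard_normal((r, r)) * 0.02
B1 = rng.standard_normal((r, d)) * 0.02

# (2) Preference-modulation module: shared A2, B2 with separate cores
#     W_gamma, W_eta mapping the mixed preference vector to affine params.
o = rng.standard_normal((k, d))  # semantic embeddings of the k objectives
A2 = rng.standard_normal((d, r)) * 0.02
W_gamma = rng.standard_normal((r, r)) * 0.02
W_eta = rng.standard_normal((r, r)) * 0.02
B2 = rng.standard_normal((r, d)) * 0.02

def moslora_forward(h, alpha):
    """Modulate shared low-rank features by the preference mix alpha."""
    h_shared = h @ W0 + h @ A1 @ W1 @ B1   # preference-agnostic features h'
    o_mix = alpha @ o                      # mixed preference vector o' = alpha^T o
    gamma = o_mix @ A2 @ W_gamma @ B2      # feature-wise scale gamma_o'
    eta = o_mix @ A2 @ W_eta @ B2          # feature-wise shift eta_o'
    return (gamma + 1.0) * h_shared + eta  # h~ = (gamma + 1) ⊙ h' + eta

h = rng.standard_normal(d)
alpha = np.array([0.5, 0.3, 0.2])  # preference weights over the k objectives
out = moslora_forward(h, alpha)
print(out.shape)  # (16,)
```

Note that the parameter count is fixed by d, r, and the two cores, independent of k, which is what keeps the model's size constant as objectives are added.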

Building on MoSLoRA, the authors propose the Unified Autoregressive Reward Model (UniARM). UniARM receives a prompt x and the mixed preference vector o′ and outputs token‑level rewards constrained by all k preference dimensions. Training uses a preference‑conditioned negative log‑likelihood loss: for each preference dimension i, a dataset D_i contains pairs (x, y₁, y₂, z_i) where z_i indicates which response is preferred. The loss encourages the model to assign higher log‑probability to the preferred response after conditioning on o′. At inference time, a single UniARM model can be steered toward any point on the Pareto front simply by changing α, without needing to load or merge multiple ARM parameter subsets.
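The preference-conditioned pairwise objective can be sketched as follows. One common instantiation of "assign higher log-probability to the preferred response" is a Bradley-Terry-style loss on the difference of the two responses' conditioned log-probabilities; whether the paper uses exactly this form is an assumption, and the log-probability values below are toy numbers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_preference_loss(logp_preferred, logp_dispreferred):
    """-log sigma(logp_preferred - logp_dispreferred): pushes the ARM,
    conditioned on the mixed preference vector o', to score the preferred
    response higher. Bradley-Terry-style form (an assumption about the
    paper's exact loss)."""
    return -np.log(sigmoid(logp_preferred - logp_dispreferred))

# Toy summed token log-probs of (y1, y2) under the ARM conditioned on o'.
logp_y1, logp_y2 = -12.3, -14.8
z = 0  # z = 0 means y1 is preferred on dimension i in (x, y1, y2, z_i)
loss = (pairwise_preference_loss(logp_y1, logp_y2) if z == 0
        else pairwise_preference_loss(logp_y2, logp_y1))
print(round(float(loss), 4))  # small loss: the preferred response already wins
```

Summing this loss over all k preference datasets D_i, with each example conditioned on its own preference vector, trains the single shared parameter space described above.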

Experiments evaluate UniARM on two benchmark suites: a safety‑alignment task and a helpful‑assistant task. A 7‑billion‑parameter UniARM is used to guide both a 7B and a 65B frozen LLM (the latter being the “weak‑to‑strong” scenario). Results show substantial improvements over the previous state‑of‑the‑art P‑ARM and GenARM baselines: on safety alignment, UniARM raises Hypervolume (HV) by 18.5 % and Multi‑objective Improvement Percentage (MIP) by 30.2 %; in the weak‑to‑strong setting it adds 9.1 % HV and 6.8 % MIP. On the assistant task, gains of 5.4 % HV and 10.7 % MIP are reported. Crucially, these gains are achieved without increasing the number of learnable parameters or inference latency, confirming the efficiency of the low‑rank design.

Table 1 compares UniARM with prior multi‑objective alignment methods, highlighting that UniARM requires only a single small reward model (one shared parameter space) and no retraining of the base LLM, whereas other approaches need multiple reward models, parameter merging, or full‑model fine‑tuning. Figure 2 visualizes the performance gap, demonstrating UniARM’s superior Pareto‑efficient frontier.

In summary, the paper contributes: (1) a novel MoSLoRA architecture that cleanly separates shared feature extraction from preference‑specific modulation, (2) a unified ARM (UniARM) that jointly models all preference dimensions in a single parameter space, and (3) empirical evidence that this approach delivers significant multi‑objective alignment gains while remaining parameter‑efficient and latency‑neutral. Future work may explore scaling to higher‑dimensional preference spaces, dynamic user profiling, and deployment in real‑world interactive systems.

