Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment
Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA's robustness in reconciling conflicting user preferences, sustaining long-term personalization, and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.
💡 Research Summary
The paper tackles the problem of personalized alignment for large language models (LLMs) by introducing a reinforcement‑learning framework called RLPA (Reinforcement Learning for Personalized Alignment). Traditional approaches to personalization—prompt‑based methods that inject a static user profile into the prompt, and offline fine‑tuning methods such as supervised fine‑tuning (SFT) or Direct Preference Optimization (DPO)—are limited in cold‑start scenarios and cannot adapt to evolving user preferences over long conversations. To overcome these limitations, the authors formalize personalized dialogue as a multi‑turn Markov Decision Process (MDP). In this MDP, the LLM is the agent, the state consists of the accumulated dialogue history, the action is the model’s generated response, and the transition is governed by a simulated user that produces the next utterance conditioned on a hidden user profile.
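The MDP formulation above can be sketched in a few lines of Python. This is an illustrative skeleton under stated assumptions (the class and function names are ours, not the authors'): the state is the accumulated dialogue history, the action is the agent's response, and the transition appends the simulated user's next utterance, which is produced conditioned on a hidden profile.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    # State: the accumulated dialogue history as (speaker, utterance) pairs.
    history: list = field(default_factory=list)

def step(state: DialogueState, response: str, simulated_user) -> DialogueState:
    """One MDP transition: the agent emits a response (the action), and the
    simulated user, conditioned on its hidden profile, produces the next
    utterance (the environment transition)."""
    state.history.append(("assistant", response))
    user_utterance = simulated_user(state.history)
    state.history.append(("user", user_utterance))
    return state

# Toy usage with a stub user model standing in for the profile-grounded LLM.
stub_user = lambda history: "I prefer short answers."
s = DialogueState(history=[("user", "Hi!")])
s = step(s, "Hello! How can I help?", stub_user)
print(len(s.history))  # 3
```

In the actual framework the `simulated_user` callable would be an LLM whose system prompt embeds the hidden profile; the stub here only illustrates the interface.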
Two complementary reward signals guide learning. The Profile Reward evaluates how accurately the model’s inferred user profile matches the ground‑truth profile. Profiles are represented as slot‑value pairs; precision, recall, and their harmonic mean (F1) are computed at each turn, encouraging incremental and correct profile construction. The Response Reward measures the quality of the generated response with respect to the inferred profile. An external reward model (based on GPT‑4o‑mini) checks five binary criteria—naturalness, relevance, logical consistency, engagement, and informativeness—and only grants a reward when all are satisfied, thereby enforcing that responses are not only fluent but also aligned with the user’s preferences, style, goals, and persona. The combined reward (R_t = R_profile_t + R_response_t) is fed into Proximal Policy Optimization (PPO) to update the policy.
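The two reward components described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Profile Reward is slot-value F1 against the ground-truth profile, and the Response Reward is a binary gate that pays out only when all five criteria (judged by an external reward model in the paper) are satisfied.

```python
def profile_reward(inferred: dict, gold: dict) -> float:
    """F1 over slot-value pairs: a pair counts as correct only when both
    the slot and its value match the ground-truth profile."""
    correct = sum(1 for slot, value in inferred.items() if gold.get(slot) == value)
    if not inferred or not gold or correct == 0:
        return 0.0
    precision = correct / len(inferred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

def response_reward(criteria: dict) -> float:
    """Binary reward: 1 only if all five checks pass."""
    keys = ["naturalness", "relevance", "logical_consistency",
            "engagement", "informativeness"]
    return 1.0 if all(criteria.get(k, False) for k in keys) else 0.0

def total_reward(inferred: dict, gold: dict, criteria: dict) -> float:
    # R_t = R_profile_t + R_response_t, the quantity fed into PPO.
    return profile_reward(inferred, gold) + response_reward(criteria)

# Toy example: one slot correct, one wrong, one missed; all criteria pass.
gold = {"genre": "jazz", "diet": "vegetarian", "city": "Kyoto"}
inferred = {"genre": "jazz", "diet": "vegan"}
checks = dict.fromkeys(["naturalness", "relevance", "logical_consistency",
                        "engagement", "informativeness"], True)
print(round(total_reward(inferred, gold, checks), 3))  # 1.4
```

In the toy example precision is 1/2, recall is 1/3, so the profile F1 is 0.4, and the satisfied criteria add the full response reward of 1.0.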
The simulated user is built by embedding a predefined profile into the system prompt of a language model (selected as GPT‑4o‑mini after human evaluation). The user model is designed to be profile‑grounded (responses consistently reflect the injected attributes) and behaviorally consistent (stable preferences and conversational style across turns). Importantly, the user does not reveal the full profile immediately; instead, it reveals information gradually, forcing the agent to perform multi‑turn reasoning and profile refinement.
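A minimal sketch of setting up such a simulated user is shown below. The prompt wording is an assumption for illustration, not the paper's exact text; in the paper the underlying model is GPT-4o-mini, selected via human evaluation.

```python
def build_user_system_prompt(profile: dict) -> str:
    """Embed a hidden user profile into a system prompt and instruct the
    model to reveal it only gradually (illustrative wording)."""
    attrs = "\n".join(f"- {slot}: {value}" for slot, value in profile.items())
    return (
        "You are role-playing a user with the following hidden profile:\n"
        f"{attrs}\n"
        "Stay behaviorally consistent with these attributes across turns.\n"
        "Do NOT state the full profile at once; reveal at most one attribute "
        "per turn, and only when the conversation makes it natural."
    )

prompt = build_user_system_prompt({"hobby": "rock climbing", "tone": "casual"})
print("rock climbing" in prompt)  # True
```

The gradual-revelation instruction is what forces the agent to keep refining its inferred profile over many turns rather than extracting everything in one exchange.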
The authors instantiate RLPA by fine‑tuning Qwen‑2.5‑3B‑Instruct, producing Qwen‑RLPA. Experiments are conducted on two benchmark settings: Vanilla ALOE and Extended ALOE. Evaluation metrics include Alignment Score (higher is better), Normalized Improvement Rate (N‑IR), and Normalized R² (N‑R²). Qwen‑RLPA achieves an Alignment Score of 73.38 on Vanilla ALOE and 52.74 on Extended ALOE, surpassing strong baselines such as SFT (44.32), DPO (45.27), CoT (24.46), and even self‑critic versions of GPT‑4o‑mini (75.81) and Claude‑3.5‑Sonnet (69.19) in the respective settings. It also outperforms the commercial models Claude‑3.5 and DeepSeek‑V3, and matches GPT‑4o's performance while being more computationally efficient.
Further analysis demonstrates that Qwen‑RLPA maintains coherent, personalized responses over long dialogues (dozens of turns), can resolve conflicting user preferences by updating the inferred profile, and exhibits higher inference efficiency compared to recent reasoning‑focused LLMs such as DeepSeek‑R1 and OpenAI‑o3.
In summary, the paper makes three key contributions: (1) reframing personalized dialogue as a multi‑turn MDP to capture dynamic user modeling, (2) proposing a dual‑level reward mechanism that simultaneously supervises profile inference and response personalization, and (3) showing that a modest‑size model (3B parameters) fine‑tuned with RLPA can achieve state‑of‑the‑art personalized dialogue performance, rivaling or exceeding much larger proprietary systems. The work opens avenues for future research on real‑world online user feedback, automatic expansion of profile slots, and multimodal personalization.