Towards Proactive Personalization through Profile Customization for Individual Users in Dialogues
The deployment of Large Language Models (LLMs) in interactive systems necessitates a deep alignment with the nuanced and dynamic preferences of individual users. Current alignment techniques predominantly address universal human values or static, single-turn preferences, thereby failing to address the critical needs of long-term personalization and the initial user cold-start problem. To bridge this gap, we propose PersonalAgent, a novel user-centric lifelong agent designed to continuously infer and adapt to user preferences. PersonalAgent constructs and dynamically refines a unified user profile by decomposing dialogues into single-turn interactions, framing preference inference as a sequential decision-making task. Experiments show that PersonalAgent achieves superior performance over strong prompt-based and policy optimization baselines, not only in idealized but also in noisy conversational contexts, while preserving cross-session preference consistency. Furthermore, human evaluation confirms that PersonalAgent excels at capturing user preferences naturally and coherently. Our findings underscore the importance of lifelong personalization for developing more inclusive and adaptive conversational agents. Our code is available here.
💡 Research Summary
The paper addresses a critical gap in the deployment of large language models (LLMs) for interactive systems: the lack of mechanisms for long‑term, user‑specific personalization and the cold‑start problem where no prior user information is available. Existing alignment methods focus on universal human values (helpfulness, harmlessness) or single‑turn preferences, which limits their ability to maintain consistency across multiple turns or sessions. To overcome these limitations, the authors introduce PersonalAgent, a lifelong, user‑centric agent that continuously infers and adapts to individual preferences.
PersonalAgent decomposes a multi‑turn dialogue into a sequence of single‑turn interactions. At each turn t, the agent receives the current user utterance uₜ and the accumulated profile attributes p₁:ₜ₋₁, forming a state sₜ = (uₜ, p₁:ₜ₋₁). It then selects an action aₜ = pₜ, representing the newly inferred preference for that turn. This formulation is cast as a multi‑turn Markov Decision Process (MDP) with a deterministic transition to sₜ₊₁ = (uₜ₊₁, p₁:ₜ). The reward for each turn combines four criteria—completeness, no hallucination, informativeness, and consistency—weighted to produce a scalar Rₜ. The cumulative reward R_final = ∑ωₜRₜ drives policy learning.
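The turn-level MDP described above can be sketched in a few lines. The profile-inference and scoring functions below are hypothetical placeholders, and the equal criterion weights are illustrative only; the paper does not specify its weighting scheme.

```python
# Illustrative weights for the four reward criteria; the paper's actual
# weighting is not specified here.
CRITERIA_WEIGHTS = {
    "completeness": 0.25,
    "no_hallucination": 0.25,
    "informativeness": 0.25,
    "consistency": 0.25,
}

def turn_reward(scores: dict) -> float:
    """Combine per-criterion scores (each in [0, 1]) into a scalar R_t."""
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)

def run_episode(utterances, infer_preference, score_turn, turn_weights):
    """Decompose a multi-turn dialogue into single-turn interactions.

    infer_preference(u_t, profile) -> p_t   (the policy's action a_t)
    score_turn(p_t, u_t, profile)  -> dict of criterion scores for R_t
    Returns the accumulated profile and R_final = sum_t w_t * R_t.
    """
    profile = []          # accumulated attributes p_1:t-1
    total = 0.0
    for t, u_t in enumerate(utterances):
        p_t = infer_preference(u_t, profile)          # action a_t = p_t
        r_t = turn_reward(score_turn(p_t, u_t, profile))
        total += turn_weights[t] * r_t
        profile.append(p_t)   # deterministic transition to s_{t+1}
    return profile, total
```

The deterministic transition is just the profile append: the next state is the next utterance plus everything inferred so far.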
The user profile P is a unified representation built from 11 high‑level categories (basic information, education, personality, career, family, geography, consumption, digital behavior, social network, scenario features, culture) and over 300 sub‑categories, derived from the LMSYS‑Chat‑1M dataset. Each turn contributes a brief inferred attribute pₜ, and the session‑level profile is the aggregation P = ∑ₜpₜ. This lifelong profile is persisted across sessions, enabling the agent to answer future queries without re‑processing the entire dialogue history.
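One way to realize such a persisted, session-spanning profile is a simple keyed store. The category names below follow the paper's 11-category taxonomy, but the storage layout and API are illustrative assumptions, not the authors' implementation (the 300+ sub-categories are omitted for brevity).

```python
import json
from pathlib import Path

# The 11 high-level categories from the paper's taxonomy.
CATEGORIES = [
    "basic_information", "education", "personality", "career", "family",
    "geography", "consumption", "digital_behavior", "social_network",
    "scenario_features", "culture",
]

class LifelongProfile:
    """Aggregates per-turn attributes p_t into a profile P that persists
    across sessions, so future queries need not replay the full history."""

    def __init__(self, path: Path):
        self.path = path
        self.data = {c: [] for c in CATEGORIES}
        if path.exists():  # restore the profile from a previous session
            self.data.update(json.loads(path.read_text()))

    def add(self, category: str, attribute: str):
        """Record one inferred attribute p_t under its category."""
        if attribute not in self.data[category]:  # avoid duplicate entries
            self.data[category].append(attribute)

    def save(self):
        self.path.write_text(json.dumps(self.data, indent=2))
```

Because the profile is loaded at session start, the agent answers a new query against the aggregated P rather than the raw dialogue history.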
Training employs Group Relative Policy Optimization (GRPO), a recent reinforcement‑learning algorithm that samples a set of candidate outputs O = {o₁,…,o_G} from the current policy, evaluates them with the unified reward, and updates the policy based on relative rankings. The authors compare three training regimes—base, supervised fine‑tuning (SFT), and reinforcement learning (RL)—and find that the policy‑based approach best captures the dynamics of multi‑turn preference evolution.
To evaluate performance in cold‑start scenarios, the authors construct ALOE‑Unseen, a new benchmark consisting of 3,820 multi‑turn dialogues spanning diverse topics. In each dialogue, the initial profile inferred from prior sessions is deliberately insufficient, forcing the agent to proactively query the user for missing preferences before generating a response. Ground‑truth explanations are provided by GPT‑4.1 and human annotators. Experiments show that PersonalAgent significantly outperforms strong baselines, including prompt‑based methods, Direct Preference Optimization (DPO), and standard RLHF, both in ideal conditions and when irrelevant turns are injected to simulate noise. Human evaluations confirm that PersonalAgent captures user preferences more naturally, coherently, and consistently.
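The noisy condition above (irrelevant turns mixed into a dialogue) can be simulated with a small interleaving helper; the injection rate and the idea of drawing distractors from a fixed pool are illustrative assumptions, not the benchmark's exact construction procedure.

```python
import random

def inject_noise(dialogue, distractors, rate=0.3, seed=0):
    """Simulate a noisy conversational context: after each real turn,
    insert an irrelevant distractor turn with probability `rate`."""
    rng = random.Random(seed)  # seeded for reproducible evaluation
    noisy = []
    for turn in dialogue:
        noisy.append(turn)
        if rng.random() < rate:
            noisy.append(rng.choice(distractors))
    return noisy
```

Sweeping `rate` from 0 upward gives a simple robustness curve: a preference-inference agent should keep its profile accuracy stable as distractor turns accumulate.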
The paper’s contributions are threefold: (1) reformulating long‑context personalization as a turn‑level MDP, enabling unified optimization across turns; (2) maintaining a lifelong, session‑level user profile to ensure long‑term alignment; and (3) releasing the ALOE‑Unseen dataset to benchmark proactive personalization under cold‑start conditions. By demonstrating that a structured, policy‑driven approach can reliably infer and retain fine‑grained user preferences, the work provides a practical roadmap for building more inclusive, adaptive conversational agents that truly understand individual users over time.