Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs
Recommender systems (RecSys) are widely used across modern digital platforms and have garnered significant attention. Traditional recommender systems usually focus only on fixed and simple recommendation scenarios, making it difficult to generalize to new and unseen recommendation tasks in an interactive paradigm. Recently, the advancement of large language models (LLMs) has revolutionized the foundational architecture of RecSys, driving their evolution into more intelligent and interactive personalized recommendation assistants. However, most existing studies rely on fixed task-specific prompt templates to generate recommendations and evaluate the performance of personalized assistants, which limits comprehensive assessment of their capabilities. This is because commonly used datasets lack high-quality textual user queries that reflect real-world recommendation scenarios, making them unsuitable for evaluating LLM-based personalized recommendation assistants. To address this gap, we introduce RecBench+, a new benchmark dataset designed to assess LLMs' ability to handle intricate user recommendation needs. RecBench+ encompasses a diverse set of queries that span both hard conditions and soft preferences, with varying difficulty levels. We evaluated commonly used LLMs on RecBench+ and uncovered the following findings: (1) LLMs demonstrate preliminary abilities to act as recommendation assistants; (2) LLMs are better at handling queries with explicitly stated conditions, while facing challenges with queries that require reasoning or contain misleading information. Our dataset has been released at https://github.com/jiani-huang/RecBench.git.
💡 Research Summary
The paper addresses a critical gap in the evaluation of large language model (LLM)‑driven recommendation assistants. Traditional recommender systems excel at fixed, well‑defined tasks such as “users who bought X also bought Y,” but they struggle with the open‑ended, multi‑constraint queries that real users often pose (e.g., “a durable laptop for graphic design under $1500”). Recent advances in LLMs have opened the possibility of turning these models into interactive recommendation assistants capable of understanding natural‑language requests, performing multi‑step reasoning, and adapting to user profiles. However, existing research evaluates such assistants with static, task‑specific prompts and datasets that lack realistic, complex user queries, making it difficult to assess true conversational recommendation capabilities.
To fill this void, the authors introduce RecBench+, a benchmark comprising roughly 30,000 user queries derived from two well‑known recommendation corpora: MovieLens‑1M (movies) and Amazon‑Book (books). Queries are categorized into two high‑level types: (1) Condition‑based queries, which impose explicit constraints, and (2) User‑profile‑based queries, which rely on inferred user interests or demographic attributes. The condition‑based category is further split into three sub‑types: Explicit (clearly stated constraints), Implicit (requiring multi‑hop inference to uncover hidden constraints), and Misinformed (containing factual errors or misleading information). The profile‑based category includes Interest‑based and Demographics‑based queries. Each query is paired with a ground‑truth set of recommended items obtained via knowledge‑graph traversal and interaction‑history matching, and the dataset includes rich metadata such as the number of constraints, difficulty level, and error type.
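As a rough illustration, the query taxonomy and per-query metadata described above could be modeled as a small record type. All class and field names below are assumptions chosen for exposition, not the dataset's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class QueryType(Enum):
    CONDITION = "condition"        # explicit constraints stated in the query
    USER_PROFILE = "user_profile"  # relies on inferred interests or demographics

class ConditionSubtype(Enum):
    EXPLICIT = "explicit"          # clearly stated constraints
    IMPLICIT = "implicit"          # requires multi-hop inference
    MISINFORMED = "misinformed"    # contains factual errors or misleading info

@dataclass
class RecBenchQuery:
    """Hypothetical record for one benchmark query and its metadata."""
    text: str
    query_type: QueryType
    subtype: str                   # e.g. "explicit", "interest", "demographics"
    num_constraints: int
    difficulty: str                # e.g. "easy" / "medium" / "hard"
    ground_truth: set = field(default_factory=set)  # items matched via KG traversal / history
```

A condition-based query such as "a durable laptop for graphic design under $1500" would carry `query_type=QueryType.CONDITION`, `subtype="explicit"`, and a count of its constraints.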
The authors evaluate seven prominent LLMs—including GPT‑4o, Gemini‑1.5‑Pro, DeepSeek‑R1, LLaMA‑2‑70B, and Falcon‑180B—both in their off‑the‑shelf form and after a two‑stage fine‑tuning regimen: (i) Supervised Fine‑Tuning (SFT) on the RecBench+ query‑answer pairs, followed by (ii) Reinforcement Fine‑Tuning (RFT) using a reward model that scores recommendation relevance and factual correctness. Evaluation metrics cover Top‑k accuracy, Mean Reciprocal Rank (MRR), query‑type specific success rates, and a dedicated misinformation‑detection score.
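The two ranking metrics named above, Top-k accuracy and Mean Reciprocal Rank, can be sketched in a few lines. This is a generic implementation of the standard definitions, not the paper's actual evaluation code:

```python
def top_k_accuracy(ranked_items, relevant_items, k=5):
    """Return 1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return float(any(item in relevant_items for item in ranked_items[:k]))

def mean_reciprocal_rank(all_ranked, all_relevant):
    """Average of 1/rank of the first relevant item, over all queries.

    Queries with no relevant item in the ranking contribute 0.
    """
    total = 0.0
    for ranked, relevant in zip(all_ranked, all_relevant):
        for i, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / i
                break
    return total / len(all_ranked)
```

For example, a query whose first relevant item appears at rank 2 contributes a reciprocal rank of 0.5; averaging that with a query that surfaces no relevant item gives an MRR of 0.25.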
Key findings are:

1. All models perform strongly on explicit condition queries, with GPT‑4o and DeepSeek‑R1 achieving >90% Top‑5 accuracy, confirming that LLMs can map clear constraints to relevant items.
2. Implicit and misinformed queries expose notable weaknesses; Gemini‑1.5‑Pro shows the best multi‑hop reasoning (≈78% accuracy), while GPT‑4o lags behind, indicating that reasoning ability varies across architectures.
3. Profile‑based queries reveal demographic biases: recommendations for female users are on average 3% more accurate than for male users, and popular interests receive higher scores than niche ones.
4. The two‑stage fine‑tuning consistently improves performance, with SFT providing a 5‑12% boost and RFT adding another 3‑5%, especially for the harder implicit and misinformed categories.
5. Misinformation handling remains a challenge; models correctly identify and correct erroneous facts in only ~68% of cases, leading to potentially misleading recommendations.
The paper also discusses limitations: the query generation process relies heavily on templated constructions, which may not fully capture the spontaneity of real user dialogue; the benchmark currently covers only movies and books, leaving other domains (music, fashion, health) untested; and the observed demographic biases highlight the need for fairness‑aware training strategies.
In conclusion, RecBench+ provides the first large‑scale, high‑quality benchmark for assessing LLMs as personalized recommendation assistants in realistic, complex scenarios. The experimental results demonstrate that while LLMs show promising preliminary abilities, especially when conditions are explicit, significant work remains to improve multi‑step reasoning, misinformation correction, and bias mitigation. Suggested future directions include online learning from real user feedback, multimodal integration (text, image, audio) for richer context, and expanding the benchmark to diverse cultural and linguistic settings to ensure equitable recommendation performance across user groups.