LLM-Enhanced Reinforcement Learning for Long-Term User Satisfaction in Interactive Recommendation
Interactive recommender systems can dynamically adapt to user feedback, but often suffer from content homogeneity and filter bubble effects due to overfitting short-term user preferences. While recent efforts aim to improve content diversity, they predominantly operate in static or one-shot settings, neglecting the long-term evolution of user interests. Reinforcement learning provides a principled framework for optimizing long-term user satisfaction by modeling sequential decision-making processes. However, its application in recommendation is hindered by sparse, long-tailed user-item interactions and limited semantic planning capabilities. In this work, we propose LLM-Enhanced Reinforcement Learning (LERL), a novel hierarchical recommendation framework that integrates the semantic planning power of LLMs with the fine-grained adaptability of RL. LERL consists of a high-level LLM-based planner that selects semantically diverse content categories, and a low-level RL policy that recommends personalized items within the selected semantic space. This hierarchical design narrows the action space, enhances planning efficiency, and mitigates overexposure to redundant content. Extensive experiments on real-world datasets demonstrate that LERL significantly improves long-term user satisfaction compared with state-of-the-art baselines. The implementation of LERL is available at https://github.com/1163710212/LERL.
💡 Research Summary
The paper addresses two persistent challenges in interactive recommender systems: (1) the tendency to over‑fit short‑term user preferences, leading to content homogeneity and filter‑bubble effects, and (2) the difficulty of applying reinforcement learning (RL) in environments characterized by sparse, long‑tailed user‑item interactions. To tackle these issues, the authors propose LLM‑Enhanced Reinforcement Learning (LERL), a hierarchical framework that fuses the high‑level semantic planning capabilities of large language models (LLMs) with the fine‑grained, feedback‑driven adaptability of RL.
Framework Overview
LERL consists of two cooperating modules:
- High‑Level Semantic Planner (HSP) – Powered by a pretrained LLM, the planner receives a user’s category‑level interaction history and a sampled set of “reflections” derived from past sessions with high cumulative rewards. By prompting the LLM with a structured template that includes candidate categories, the user’s recent category selections, and the sampled reflections, the HSP outputs a subset of content categories cₜ to be explored in the current step. This high‑level decision explicitly balances relevance with diversity, deliberately de‑prioritizing categories that have become saturated for the user.
- Low‑Level Policy Learner (LPL) – An RL agent (trained with modern actor‑critic algorithms such as PPO or SAC) operates within the constrained item space I_{cₜ} defined by the categories selected by the HSP. The LPL encodes the user’s item‑level interaction sequence using a Transformer encoder, producing a preference vector eₚₜ. This vector parameterizes a Gaussian policy (mean μₜ, standard deviation σₜ) that samples a recommendation list aₜ. The reward signal combines immediate feedback (clicks, ratings, session continuation) with intrinsic diversity incentives (e.g., penalties for repeatedly exposing the same category). The objective is to maximize the discounted cumulative reward ∑ₜγ^{t‑1}rₜ.
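The two quantities at the heart of the LPL, the Gaussian policy sample and the discounted return, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the preference vector `e_pt` stands in for the Transformer encoder's output, and the linear heads `W_mu` / `W_sigma` are hypothetical stand-ins for whatever network maps it to μₜ and σₜ.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_policy_sample(pref_vec, W_mu, W_sigma):
    """Sample an action embedding from the Gaussian policy N(mu_t, sigma_t^2).

    pref_vec: preference vector e_pt (here, a stand-in for the encoder output)
    W_mu, W_sigma: hypothetical linear heads producing mu_t and log sigma_t
    """
    mu = W_mu @ pref_vec                    # policy mean mu_t
    sigma = np.exp(W_sigma @ pref_vec)      # std sigma_t, kept positive via exp
    return mu + sigma * rng.standard_normal(mu.shape)

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: sum_t gamma^(t-1) * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

d = 8
e_pt = rng.standard_normal(d)
W_mu = rng.standard_normal((d, d)) * 0.1
W_sigma = rng.standard_normal((d, d)) * 0.1
action = gaussian_policy_sample(e_pt, W_mu, W_sigma)
print(action.shape)                         # (8,)
print(discounted_return([1.0, 0.0, 1.0]))   # 1 + 0.99^2 = 1.9801
```

In practice the sampled embedding would be mapped to the nearest items inside I_{cₜ}; that retrieval step is omitted here.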
Reflection Pool and Prompt Design
After each user session, the HSP’s companion “high‑level critic” uses the same LLM to generate a textual reflection Fᵤ that summarizes the session’s trajectory, cumulative reward Sᵤ, and actionable insights (e.g., “reduce exposure to category X”). These reflections are stored in a pool R. When planning a new step, a subset of reflections is sampled with probability proportional to exp(αSᵤ), ensuring that high‑reward experiences are more likely to influence future decisions. Because LLM context windows are limited, only a small number Nₛ of reflections are inserted into the prompt, preserving relevance while keeping the prompt size manageable.
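The reward-weighted sampling and prompt assembly described above can be sketched as below. The sampling rule, probability proportional to exp(αSᵤ) with Nₛ reflections drawn, follows the text; the prompt template itself is an illustrative guess, not the paper's exact wording.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_reflections(pool, alpha=1.0, n_s=2):
    """Draw n_s reflections with probability proportional to exp(alpha * S_u).

    pool: list of (reflection_text, cumulative_reward S_u) pairs.
    Subtracting the max reward before exponentiating avoids overflow.
    """
    scores = np.array([s for _, s in pool], dtype=float)
    weights = np.exp(alpha * (scores - scores.max()))
    probs = weights / weights.sum()
    idx = rng.choice(len(pool), size=min(n_s, len(pool)), replace=False, p=probs)
    return [pool[i][0] for i in idx]

def build_prompt(candidates, recent, reflections):
    """Assemble a structured planning prompt (illustrative template only)."""
    return (
        "Candidate categories: " + ", ".join(candidates) + "\n"
        "Recent selections: " + ", ".join(recent) + "\n"
        "Reflections from past high-reward sessions:\n- " + "\n- ".join(reflections) + "\n"
        "Select a diverse subset of categories for the next step."
    )

pool = [("reduce exposure to Action", 9.1),
        ("explore Documentary content", 7.4),
        ("user responds well to Comedy", 3.2)]
refl = sample_reflections(pool, alpha=0.5, n_s=2)
print(build_prompt(["Action", "Comedy", "Documentary"], ["Action", "Action"], refl))
```

Capping the draw at Nₛ reflections keeps the prompt within the LLM's context window, as noted above.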
Action‑Space Reduction and Efficiency
By delegating category selection to the LLM, LERL dramatically shrinks the RL action space from the full item catalog (often millions of items) to a manageable subset defined by the chosen categories. This reduction yields faster convergence, lower sample complexity, and more interpretable exploration behavior. Moreover, the semantic guidance from the LLM helps the RL agent avoid myopic exploitation of popular items, encouraging exploration of semantically diverse regions of the catalog.
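The action-space reduction amounts to filtering the catalog down to the planner's chosen categories before the RL agent acts. A toy sketch, with a hypothetical 3,000-item catalog split evenly across three categories:

```python
# Hypothetical catalog: item id -> category (evenly split across 3 categories)
categories = ["Action", "Comedy", "Drama"]
catalog = {f"item{i}": categories[i % 3] for i in range(3000)}

def restrict_action_space(catalog, selected_categories):
    """Keep only items whose category was chosen by the high-level planner,
    yielding the constrained item space I_{c_t}."""
    return [item for item, cat in catalog.items() if cat in selected_categories]

c_t = {"Drama"}                              # categories chosen by the LLM planner
item_space = restrict_action_space(catalog, c_t)
print(len(catalog), "->", len(item_space))   # 3000 -> 1000
```

With a real catalog of millions of items, the same filter shrinks the set the policy must score at each step by orders of magnitude, which is where the reported sample-efficiency gains come from.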
Experimental Setup
The authors evaluate LERL on several real‑world datasets, including MovieLens‑1M, Amazon Books, and a proprietary log‑derived dataset. Because online user trials are costly, they construct a simulated offline environment that mimics user feedback dynamics (click probability, session dropout, exposure fatigue). Baselines include (i) traditional matrix‑factorization with diversity‑aware re‑ranking, (ii) RL‑only approaches (e.g., DQN, PPO), and (iii) LLM‑only category prediction models.
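The feedback dynamics mentioned above (click probability, session dropout, exposure fatigue) can be captured by a toy simulator along the following lines. This is purely illustrative; the paper's environment is log-derived and more elaborate, and all parameter values here are invented.

```python
import random

class SimulatedUser:
    """Toy offline user simulator with click probability, per-category
    exposure fatigue, and session dropout (illustrative parameters only)."""

    def __init__(self, base_click=0.6, fatigue=0.1, dropout=0.05, seed=0):
        self.base_click = base_click
        self.fatigue = fatigue      # click-probability decay per repeat exposure
        self.dropout = dropout      # chance the session ends after each step
        self.exposed = {}           # category -> exposure count so far
        self.rng = random.Random(seed)

    def step(self, category):
        """Return (reward, session_continues) for recommending `category`."""
        n = self.exposed.get(category, 0)
        self.exposed[category] = n + 1
        p_click = max(0.0, self.base_click - self.fatigue * n)  # fatigue effect
        reward = 1.0 if self.rng.random() < p_click else 0.0
        continues = self.rng.random() >= self.dropout
        return reward, continues

user = SimulatedUser()
total, alive, step = 0.0, True, 0
while alive and step < 20:
    r, alive = user.step("Action" if step % 2 == 0 else "Drama")
    total += r
    step += 1
print(f"session length={step}, total reward={total}")
```

A simulator of this shape lets the RL agent be trained and evaluated offline: longer sessions and higher cumulative reward correspond to the long-term satisfaction metrics reported in the paper.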
Results
Across all datasets, LERL outperforms baselines on both accuracy metrics (HR@10, NDCG@10) and diversity metrics (Diversity@10, Coverage). Notably, with a discount factor γ = 0.99, LERL increases average session length by ~20 % and reduces the “Category Saturation” filter‑bubble indicator by ~15 %. The hierarchical design yields a 2.5× speed‑up in training time compared to pure RL methods, confirming the benefit of action‑space narrowing. Ablation studies show that removing the reflection pool or replacing the LLM planner with a random category selector degrades performance, highlighting the importance of semantic planning and experience‑based reflections.
Discussion and Limitations
The paper acknowledges several practical concerns: (1) LLM prompt engineering and reflection sampling introduce hyper‑parameters (α, Nₛ) that affect stability; (2) real‑time inference with large LLMs incurs latency and cost, potentially limiting deployment in latency‑sensitive services; (3) the current implementation relies on a hosted GPT‑3.5‑style model, which may not be feasible for all organizations. The authors suggest future work on lightweight fine‑tuned LLMs (e.g., LoRA adapters), multi‑modal user context integration, and online A/B testing to validate the framework in production.
Conclusion
LERL demonstrates that coupling LLM‑driven high‑level semantic planning with RL‑driven low‑level item selection can simultaneously improve long‑term user satisfaction and content diversity in interactive recommendation. By narrowing the RL action space through semantically informed category constraints and leveraging textual reflections from high‑reward sessions, the framework achieves superior performance while mitigating filter‑bubble effects. The work opens avenues for further research on hierarchical decision‑making, efficient LLM integration, and real‑world deployment of long‑term, diversity‑aware recommender systems.