Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs), and its adoption for enhancing recommendation quality is growing rapidly. In this work, we critically examine this trend and argue that Long CoT is inherently ill-suited to the sequential recommendation domain. We attribute this misalignment to two primary factors: excessive inference latency and the lack of explicit cognitive reasoning patterns in user behavioral data. Driven by these observations, we propose pivoting away from the CoT structure and directly leveraging its underlying mechanism, Reinforcement Learning (RL), to explore the item space. However, applying RL directly faces significant obstacles, notably low sample efficiency (most actions fail to provide learning signals) and training instability. To overcome these limitations, we propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation. RISER transforms non-learnable trajectories into effective pairwise preference data for optimization. Furthermore, it incorporates specific strategies to ensure stability, including the prevention of redundant rollouts and the constraint of token-level update magnitudes. Extensive experiments on three real-world datasets show that RISER significantly outperforms competitive baselines, establishing a robust paradigm for RL-enhanced LLM recommendation. Our code will be available at https://anonymous.4open.science/r/RISER/.
💡 Research Summary
The paper begins by questioning the recent trend of applying Long Chain‑of‑Thought (Long CoT) reasoning to sequential recommendation. It argues that Long CoT suffers from two fundamental mismatches: (1) inference latency – generating extended reasoning chains dramatically slows down response time, which is unacceptable for real‑time recommender systems; and (2) cognitive pattern scarcity – user‑behavior logs contain only implicit actions and lack the explicit reasoning patterns (e.g., verification, back‑tracking) that LLMs have learned from domains such as mathematics or code. Consequently, the authors propose to discard the CoT structure altogether and to leverage its underlying mechanism, reinforcement learning (RL), for item‑space exploration.
Directly applying RL to a large language model (LLM) for recommendation faces two major obstacles. First, sample utilization is extremely low because the SFT‑pretrained policy is narrow; most rollouts fail to hit the ground‑truth item, yielding zero advantage and no gradient signal. Second, training is unstable: the policy tends to repeat the same items across rollouts, and the textual representation of item IDs creates token‑level imbalance, where a few distinguishing tokens receive disproportionately large updates, causing policy collapse.
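The zero-advantage failure mode above can be made concrete with a minimal sketch of group-normalized advantages, as used in GRPO-style training (the function name and the binary hit/miss reward are illustrative assumptions, not from the paper): when every rollout in a group misses the ground-truth item, all rewards are identical, the normalized advantages are all zero, and the policy gradient vanishes.

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization: (r - mean) / (std + eps).

    If all rewards in the group are equal (e.g., every rollout missed
    the ground-truth item and earned reward 0), every advantage is
    exactly zero and the rollout group contributes no gradient signal.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# All rollouts miss -> no learning signal at all:
wasted = group_advantages([0.0, 0.0, 0.0, 0.0])
# One rollout hits -> informative, non-zero advantages:
useful = group_advantages([1.0, 0.0, 0.0, 0.0])
```

This is exactly the waste RISER targets: with a narrow SFT-pretrained policy, the all-miss case dominates, so most rollout groups look like `wasted` above.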
To address these issues, the authors introduce RISER (Reinforced Item Space Exploration for Recommendation). RISER consists of two complementary modules. The first module enhances sample utilization by employing Simple Preference Optimization (SimPO). When all G rollouts for a prompt miss the correct item, the method constructs G preference pairs (ground‑truth vs. each incorrect generation) and optimizes a Bradley‑Terry‑style preference objective over them. This converts otherwise wasted trajectories into informative learning signals, dramatically improving sample efficiency.
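The pair-construction step can be sketched as follows. The loss is the standard SimPO objective, i.e. a negative log-sigmoid over length-normalized log-likelihood margins; the aggregation over G pairs follows the description above, but the function names, hyperparameter values, and the simple averaging are illustrative assumptions rather than the paper's exact formulation.

```python
import math


def simpo_pair_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO loss for one (winner, loser) pair.

    logp_w / logp_l: total sequence log-probabilities under the policy.
    len_w / len_l:   sequence lengths, used for length normalization
                     (SimPO needs no reference model).
    """
    margin = beta * (logp_w / len_w) - beta * (logp_l / len_l) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)


def riser_preference_loss(gt_logp, gt_len, rollouts, **kw):
    """Average SimPO loss over G pairs: ground truth vs. each miss.

    rollouts: list of (logp, length) for the G incorrect generations.
    """
    losses = [simpo_pair_loss(gt_logp, gt_len, lp, ln, **kw) for lp, ln in rollouts]
    return sum(losses) / len(losses)
```

As a sanity check, the loss shrinks as the policy assigns a higher per-token log-probability to the ground-truth item relative to the incorrect rollouts, which is precisely the signal an all-miss group would otherwise fail to provide.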
The second module stabilizes training. It first oversamples a larger set of completions and then de‑duplicates them, ensuring diverse rollouts and preventing the policy from converging on a few items. Next, a token‑level KL‑Cov regularizer computes the KL divergence for each token between the current and reference (SFT) policies, penalizing only those tokens with high confidence and high advantage, thereby avoiding abrupt shifts in the distribution. Finally, a loss‑mask down‑weights updates for highly predictable tokens, and the length penalty is removed to further reduce variance.
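The two stabilization ideas can be sketched in a few lines. The first function implements oversample-then-deduplicate; the second applies a per-token penalty only where the policy is both confident and high-advantage, in the spirit of the KL‑Cov regularizer. All names, thresholds, and the simple log-ratio KL estimate are illustrative assumptions, not the paper's exact procedure.

```python
def diverse_rollouts(sample_fn, g, oversample_factor=4):
    """Oversample up to g * oversample_factor completions, then
    de-duplicate, keeping the first occurrence of each distinct item
    until g unique rollouts are collected."""
    seen, kept = set(), []
    for _ in range(g * oversample_factor):
        item = sample_fn()
        if item not in seen:
            seen.add(item)
            kept.append(item)
            if len(kept) == g:
                break
    return kept


def selective_kl_penalty(logp_cur, logp_ref, advantages,
                         conf_thresh=-0.5, adv_thresh=0.0):
    """Penalize divergence from the reference (SFT) policy only on
    tokens where the current policy is confident (high log-prob) AND
    the advantage is large -- the tokens most likely to cause abrupt
    distribution shifts. Uses a simple per-token log-ratio KL estimate."""
    penalty = 0.0
    for lc, lr, a in zip(logp_cur, logp_ref, advantages):
        if lc > conf_thresh and a > adv_thresh:
            penalty += lc - lr
    return penalty
```

The deduplication step directly attacks the repeated-item collapse described above, while the selective penalty leaves low-confidence tokens free to move, avoiding the over-regularization of a uniform KL term.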
Experiments on three real‑world datasets (MovieLens, Amazon, Yelp) demonstrate that RISER outperforms strong baselines, including SFT‑GRPO, PPO‑based LLM recommenders, and recent CoT‑enhanced methods, achieving notable gains in HR@10 and NDCG@10. Ablation studies confirm that each component—SimPO, oversampling + deduplication, KL‑Cov, and loss masking—contributes positively to performance and stability. The paper also reports a substantial reduction in rollout redundancy (over 70% fewer duplicate generations) and smoother KL divergence curves during training.
In summary, the work provides a compelling critique of Long CoT for sequential recommendation and offers a well‑engineered RL framework that simultaneously tackles sample inefficiency and instability. RISER’s design is modular and could be integrated with various LLM backbones, paving the way for practical, RL‑driven recommendation systems. Remaining challenges include the high computational cost of fine‑tuning large LLMs and the need for efficient tokenization of massive item catalogs, which the authors acknowledge as future research directions.