REG4Rec: Reasoning-Enhanced Generative Model for Large-Scale Recommendation Systems

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Sequential recommendation aims to predict a user’s next action in large-scale recommender systems. While traditional methods often suffer from insufficient information interaction, recent generative recommendation models partially address this issue by directly generating item predictions. To better capture user intents, recent studies have introduced a reasoning process into generative recommendation, significantly improving recommendation performance. However, these approaches are constrained by the singularity of item semantic representations, facing challenges such as limited diversity in reasoning pathways and insufficient reliability in the reasoning process. To tackle these issues, we introduce REG4Rec, a reasoning-enhanced generative model that constructs multiple dynamic semantic reasoning paths alongside a self-reflection process, ensuring high-confidence recommendations. Specifically, REG4Rec utilizes an MoE-based parallel quantization codebook (MPQ) to generate multiple unordered semantic tokens for each item, thereby constructing a larger-scale diverse reasoning space. Furthermore, to enhance the reliability of reasoning, we propose a training reasoning enhancement stage, which includes Preference Alignment for Reasoning (PARS) and a Multi-Step Reward Augmentation (MSRA) strategy. PARS uses reward functions tailored for recommendation to enhance reasoning and reflection, while MSRA introduces future multi-step actions to improve overall generalization. During inference, Consistency-Oriented Self-Reflection for Pruning (CORP) is proposed to discard inconsistent reasoning paths, preventing the propagation of erroneous reasoning. Lastly, we develop an efficient offline training strategy for large-scale recommendation. Experiments on real-world datasets and online evaluations show that REG4Rec delivers outstanding performance and substantial practical value.


💡 Research Summary

REG4Rec addresses the limitations of existing generative recommendation (GR) models, which typically encode each item into a single fixed sequence of semantic tokens, leading to rigid representations and limited reasoning diversity. The proposed system introduces a Mixture‑of‑Experts based Parallel Quantization (MPQ) codebook that generates multiple unordered semantic tokens for each item. By employing several parallel auto‑encoder experts, each expert captures a distinct semantic facet (e.g., brand, style, price) and produces its own codebook. During inference, the model treats the tokens from all codebooks as parallel candidates and selects the token with the highest confidence at each reasoning step through Confidence‑based Reasoning Step Selection (CRSS). This design creates a combinatorial explosion of possible reasoning paths, allowing the model to adaptively explore diverse explanations for a user’s intent.
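The parallel-codebook idea and confidence-based step selection can be sketched in a few lines. This is a toy illustration under assumed names and sizes (`NUM_EXPERTS`, `CODEBOOK_SIZE`, the nearest-neighbor quantization rule, and the confidence scores are all stand-ins), not the paper's actual MPQ/CRSS implementation:

```python
import numpy as np

# Toy sketch of MPQ-style parallel quantization plus Confidence-based
# Reasoning Step Selection (CRSS). Names, sizes, and the scoring rule
# are illustrative assumptions, not the paper's exact formulation.

rng = np.random.default_rng(0)

NUM_EXPERTS = 3      # parallel auto-encoder experts (e.g. brand, style, price)
CODEBOOK_SIZE = 8    # codes per expert codebook
DIM = 4              # latent dimension per semantic facet

# One codebook per expert: each row is a code vector for that facet.
codebooks = [rng.normal(size=(CODEBOOK_SIZE, DIM)) for _ in range(NUM_EXPERTS)]

def quantize_item(facets):
    """Map each facet embedding to the nearest code in its expert's
    codebook, yielding an *unordered* set of tokens (expert_id, code_id)."""
    tokens = []
    for e, facet in enumerate(facets):
        dists = np.linalg.norm(codebooks[e] - facet, axis=1)
        tokens.append((e, int(np.argmin(dists))))
    return tokens

def crss_select(token_confidences):
    """At one reasoning step, emit the candidate token (drawn from any
    codebook) with the highest model confidence."""
    return max(token_confidences, key=token_confidences.get)

# Quantize a toy item described by one embedding per facet.
item_facets = [rng.normal(size=DIM) for _ in range(NUM_EXPERTS)]
tokens = quantize_item(item_facets)
print("semantic tokens:", tokens)

# The decoder would score candidate tokens; CRSS keeps the best one.
candidates = {tok: float(rng.random()) for tok in tokens}
print("selected step:", crss_select(candidates))
```

Because every step may draw its token from any of the parallel codebooks, the number of distinct token orderings grows combinatorially with sequence length, which is the source of the enlarged reasoning space described above.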

To improve the reliability of these paths, REG4Rec adopts a two‑stage training framework. In the pre‑training phase, besides the standard next‑item token prediction, an auxiliary category‑prediction task is added; the resulting category distribution serves as a proxy for user intent and helps assess path consistency. In the post‑training phase, Preference Alignment for Reasoning (PARS) uses reinforcement learning to shape a policy that favors paths with high confidence and strong alignment between predicted categories and the target item. Multi‑Step Reward Augmentation (MSRA) extends the reward horizon beyond a single step, incorporating future user actions with a time‑decay weighting, thereby capturing long‑term preferences and reducing noise from stochastic behavior.
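The reward shaping in this two-stage framework can be illustrated with a minimal sketch. The linear mix in `pars_reward`, the geometric decay `gamma**k` in `msra_reward`, and all numeric values are assumptions for demonstration; the paper's exact reward functions may differ:

```python
# Hedged sketch of PARS-style reward blending and MSRA-style multi-step
# aggregation with time-decay weighting. Functional forms are assumptions.

def pars_reward(path_confidence, category_match, alpha=0.5):
    """Blend a path's confidence with its category-alignment signal
    (assumed here to be a simple convex combination)."""
    return alpha * path_confidence + (1 - alpha) * category_match

def msra_reward(step_rewards, gamma=0.8):
    """Aggregate rewards from future user actions, down-weighting later
    steps so long-term preferences count while stochastic noise fades."""
    return sum((gamma ** k) * r for k, r in enumerate(step_rewards))

# Immediate match plus two weaker future signals:
rewards = [1.0, 0.5, 0.2]
total = msra_reward(rewards)
print(f"aggregated reward: {total:.3f}")  # 1.0 + 0.8*0.5 + 0.64*0.2 = 1.528
```

The decay factor plays the role of the time-decay weighting mentioned above: a smaller `gamma` trusts only near-term actions, while a larger one lets distant future behavior shape the policy.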

During inference, Consistency‑Oriented Self‑Reflection for Pruning (CORP) evaluates each generated path step‑by‑step, checking token‑level and category‑level consistency. Paths that fall below a predefined confidence threshold are pruned, preventing error propagation and ensuring that only reliable reasoning trajectories contribute to the final recommendation list.
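A threshold-based pruning pass of this kind might look as follows. The per-step consistency scores and the 0.6 cutoff are made-up values; the paper's CORP criterion combines token-level and category-level signals in a way this sketch does not reproduce:

```python
# Illustrative sketch of CORP-style pruning: a path survives only if
# every step clears the consistency threshold. Scores are hypothetical.

def corp_prune(paths, threshold=0.6):
    """Keep only paths whose every (token, score) step meets the
    threshold; a single failing step discards the whole path."""
    return [
        path for path in paths
        if all(score >= threshold for _, score in path)
    ]

paths = [
    [("tok_a", 0.9), ("tok_b", 0.8)],   # consistent  -> kept
    [("tok_a", 0.9), ("tok_c", 0.4)],   # fails step 2 -> pruned
]
survivors = corp_prune(paths)
print(len(survivors))  # → 1
```

Pruning at the first inconsistent step is what prevents an early reasoning error from propagating into the final candidate list.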

Because the model must be trained on billions of interactions, the authors also propose a Layer‑Adaptive Dynamic Quantization (LADQ) controller. LADQ profiles the sensitivity of each transformer layer and dynamically assigns precision (fp32, bf16, fp8) to meet a target accuracy budget while minimizing training time and hardware cost.
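A simple version of such a sensitivity-to-precision mapping is sketched below. The cutoff values and the per-layer sensitivity profile are invented for illustration; a real LADQ controller would derive them from profiling runs against an accuracy budget:

```python
# Sketch of a LADQ-style precision plan: the least sensitive layers run
# in fp8, moderately sensitive ones in bf16, the rest stay in fp32.
# Thresholds and the profile are hypothetical.

def assign_precision(sensitivities, fp8_cut=0.2, bf16_cut=0.5):
    """Map each layer's measured sensitivity to a compute precision."""
    plan = {}
    for layer, s in sensitivities.items():
        if s < fp8_cut:
            plan[layer] = "fp8"
        elif s < bf16_cut:
            plan[layer] = "bf16"
        else:
            plan[layer] = "fp32"
    return plan

profile = {"layer_0": 0.05, "layer_1": 0.35, "layer_2": 0.9}
print(assign_precision(profile))
# → {'layer_0': 'fp8', 'layer_1': 'bf16', 'layer_2': 'fp32'}
```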

Extensive experiments on three public datasets and a large‑scale industrial advertising dataset demonstrate that REG4Rec consistently outperforms six strong baselines, including ReaRec and STREAM, achieving up to 16.59% relative improvement in hit‑rate and NDCG. An online A/B test on Alibaba’s advertising platform shows practical gains of +5.60% in advertising revenue, +1.81% in click‑through rate, and +3.29% in gross merchandise volume.

In summary, REG4Rec contributes (1) a multi‑token item representation via MPQ that expands the reasoning space, (2) a reinforcement‑learning‑driven alignment and multi‑step reward scheme (PARS + MSRA) that enhances path reliability, (3) a self‑reflection pruning mechanism (CORP) that filters inconsistent reasoning, and (4) an efficient large‑scale training strategy (LADQ). By integrating these components, the paper demonstrates that reasoning‑enhanced generative models can achieve both superior recommendation accuracy and operational feasibility in real‑world, high‑traffic e‑commerce and advertising environments.

