RankGR: Rank-Enhanced Generative Retrieval with Listwise Direct Preference Optimization in Recommendation
Generative retrieval (GR) has emerged as a promising paradigm in recommendation systems by autoregressively decoding identifiers of target items. Despite its potential, current approaches typically rely on the next-token prediction schema, which treats each token of the next interacted item as the sole target. This narrow focus 1) limits their ability to capture the nuanced structure of user preferences, and 2) overlooks the deep interaction between decoded identifiers and user behavior sequences. In response to these challenges, we propose RankGR, a Rank-enhanced Generative Retrieval method that incorporates listwise direct preference optimization for recommendation. RankGR decomposes the retrieval process into two complementary stages: the Initial Assessment Phase (IAP) and the Refined Scoring Phase (RSP). In IAP, we incorporate a novel listwise direct preference optimization strategy into GR, facilitating a more comprehensive understanding of hierarchical user preferences and more effective partial-order modeling. RSP then refines the top-λ candidates generated by IAP through a lightweight scoring module that models their interactions with the input behavior sequence, leading to more precise candidate evaluation. Both phases are jointly optimized under a unified GR model, ensuring consistency and efficiency. Additionally, we implement several practical improvements in training and deployment, ultimately achieving a real-time system capable of handling nearly ten thousand requests per second. Extensive offline experiments on both research and industrial datasets, together with online gains on the “Guess You Like” section of Taobao, validate the effectiveness and scalability of RankGR.
💡 Research Summary
The paper introduces RankGR, a novel framework that enhances generative retrieval (GR) for large‑scale recommendation by explicitly modeling user preference hierarchies and enabling deep interactions between candidate items and the full behavior sequence. Existing GR approaches rely on next‑token prediction (NTP), which treats each token of the next interacted item as an isolated target. This design suffers from two critical drawbacks: (1) it cannot capture the partial order of user preferences (e.g., purchases > clicks > exposures) because training is performed at the token level, and (2) during inference the model selects the next token by a simple inner‑product between a single hidden state and the whole vocabulary, which resembles a two‑tower matching scheme and ignores richer candidate‑sequence relationships.
RankGR addresses these issues through a two‑stage pipeline that remains end‑to‑end trainable and is suitable for real‑time production.
- Initial Assessment Phase (IAP) – Built on the standard GR backbone, IAP replaces the point‑wise NTP loss with Listwise Direct Preference Optimization (LDPO). Each item is represented by a multi‑token semantic identifier (SID) generated via a residual‑quantized VAE. The model computes a log‑probability for each SID by aggregating the probabilities of its constituent codewords. LDPO then applies a sigmoid‑rank loss that directly contrasts positive feedback (purchases, clicks) against negative feedback (mere exposures) across the entire session, thereby learning the relative ordering of multiple items without an auxiliary reward model. This listwise formulation captures the hierarchical, partially ordered nature of real‑world user behavior.
- Refined Scoring Phase (RSP) – After IAP produces the top‑λ candidate SIDs, RSP employs a lightweight scoring module that takes the hidden states from IAP and each candidate SID as inputs. The module performs a candidate‑specific interaction (similar to a shallow attention) with the full behavior sequence, yielding a refined score that reflects deep semantic compatibility. By breaking the inner‑product‑only assumption of IAP, RSP corrects coarse rankings and improves precision.
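The IAP training signal described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the function names, the reference-policy terms, and the pairwise expansion of the listwise contrast are assumptions in the spirit of DPO-style objectives; the paper only states that SID log-probabilities aggregate codeword probabilities and that a sigmoid-rank loss contrasts positive against negative feedback.

```python
import numpy as np

def sid_log_prob(token_logits, sid_tokens):
    """Log-probability of one semantic ID (SID), aggregated over its codewords.

    token_logits: (L, V) decoder logits, one row per SID position.
    sid_tokens:   (L,)   codeword indices of the item's SID.
    """
    # Stable log-softmax over the codeword vocabulary at each position.
    z = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Sum the log-probabilities of the SID's constituent codewords.
    return log_probs[np.arange(len(sid_tokens)), sid_tokens].sum()

def listwise_dpo_loss(pos_logps, neg_logps, ref_pos, ref_neg, beta=0.1):
    """Hypothetical sigmoid-rank contrast of positive vs. negative SIDs.

    Each positive item (purchase/click) is contrasted against every
    negative item (mere exposure) in the session; `ref_*` are the
    corresponding log-probs under a frozen reference policy.
    """
    pos_adv = beta * (np.asarray(pos_logps) - np.asarray(ref_pos))  # (P,)
    neg_adv = beta * (np.asarray(neg_logps) - np.asarray(ref_neg))  # (N,)
    margins = pos_adv[:, None] - neg_adv[None, :]                   # (P, N)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), averaged over pairs.
    return np.mean(np.log1p(np.exp(-margins)))
```

Because the objective operates on whole-SID log-probabilities rather than individual tokens, the relative ordering of multiple items in one session enters the loss directly, without an auxiliary reward model.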
Both phases share the same GR parameters and are jointly optimized, which keeps the parameter count modest and enables efficient batch training.
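The RSP re-ranking step can be illustrated with a toy scorer. The summary only says the module performs a candidate-specific interaction "similar to a shallow attention" with the behavior sequence, so everything below (single-head dot-product attention, the `w_out` projection, the function names) is an assumed concretization rather than the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def refined_score(seq_hidden, cand_emb, w_out):
    """Score one candidate via shallow attention over the behavior sequence.

    seq_hidden: (T, d) hidden states of the user sequence from IAP.
    cand_emb:   (d,)   embedding of one candidate SID.
    w_out:      (d,)   hypothetical output projection of the scoring head.
    """
    # Candidate-specific attention weights over the T behavior steps.
    attn = softmax(seq_hidden @ cand_emb / np.sqrt(len(cand_emb)))  # (T,)
    context = attn @ seq_hidden                                     # (d,)
    return float(context @ w_out)

def rerank_top_candidates(seq_hidden, cand_embs, w_out):
    """Re-rank IAP's top-λ candidates by their refined scores, best first."""
    scores = [refined_score(seq_hidden, c, w_out) for c in cand_embs]
    return np.argsort(scores)[::-1]
```

The key point the sketch captures is that each candidate attends over the full behavior sequence before being scored, rather than being compared to a single pooled hidden state via one inner product.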
From an engineering perspective, the authors introduce several optimizations—tokenizer caching, batch sharding, GPU‑CPU pipeline tuning—to achieve a throughput of nearly 10 k requests per second with sub‑30 ms latency, meeting industrial real‑time constraints.
Empirical evaluation spans public benchmarks (e.g., MovieLens, Amazon) and massive proprietary datasets from Alibaba (billions of interactions). RankGR consistently outperforms state‑of‑the‑art GR models such as TIGER, FORGE, and COBRA on click‑through rate (CTR), conversion, and NDCG. In an online A/B test on Taobao’s “Guess You Like” homepage, deploying RankGR yielded a 1.08 % lift in item page views while maintaining low latency, demonstrating both effectiveness and scalability.
The key contributions are: (i) a listwise direct preference loss that models partial order without a separate reward network, (ii) a lightweight candidate‑sequence interaction module that refines rankings beyond simple inner‑product matching, and (iii) a production‑ready architecture that scales to nearly ten thousand queries per second. The work opens avenues for further research, such as extending LDPO to more expressive ranking losses, incorporating multimodal feedback, and exploring deeper interaction architectures while preserving real‑time performance.