Masked Diffusion for Generative Recommendation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Generative recommendation (GR) with semantic IDs (SIDs) has emerged as a promising alternative to traditional recommendation approaches due to its performance gains, its use of the semantic information in language-model embeddings, and its inference and storage efficiency. Existing GR-with-SIDs methods model the probability of the SID sequence corresponding to a user's interaction history autoregressively. While this has led to impressive next-item prediction performance in certain settings, these autoregressive models suffer from expensive inference due to sequential token-wise decoding, potentially inefficient use of training data, and a bias toward learning short-context relationships among tokens. Inspired by recent breakthroughs in NLP, we propose instead to model and learn the probability of a user's SID sequence with masked diffusion. Masked diffusion employs discrete masking noise to learn the sequence distribution, and models the probability of the masked tokens as conditionally independent given the unmasked tokens, allowing the masked tokens to be decoded in parallel. We demonstrate through thorough experiments that our proposed method consistently outperforms autoregressive modeling; the gap is especially pronounced in data-constrained settings and in coarse-grained recall, consistent with our intuitions. Moreover, our approach offers the flexibility of predicting multiple SIDs in parallel during inference while maintaining superior performance to autoregressive modeling.


💡 Research Summary

Generative recommendation (GR) has recently attracted attention for its ability to model user interaction sequences and predict the next item a user will engage with. A particularly promising variant of GR uses semantic IDs (SIDs), which are compact token tuples derived from the semantic embeddings of items (e.g., text or visual features processed by large pre‑trained language or vision models). By clustering these embeddings, each item is represented by a short sequence of discrete tokens, dramatically reducing vocabulary size while preserving rich semantic information. Existing works on SID‑based GR (e.g., TIGER) rely on autoregressive (AR) modeling: the probability of a SID sequence is factorized as a product of conditional probabilities, and the model is trained to predict the next token given all previous tokens. Although AR models have driven breakthroughs in natural language processing, they suffer from two major drawbacks when applied to recommendation. First, inference requires sequential token‑wise decoding, which leads to latency that scales linearly with the number of tokens to generate. Second, training on next‑token prediction biases the model toward short‑range dependencies and under‑utilizes the limited interaction data that is typical in recommender systems.
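The summary does not include code, but the SID construction it describes (clustering item embeddings into a short tuple of discrete tokens) can be sketched with a simple residual quantizer. This is a minimal illustration with fixed random codebooks, not the paper's actual pipeline; TIGER-style systems learn the codebooks end to end (e.g., with an RQ-VAE), and the function and variable names here are hypothetical.

```python
import numpy as np

def assign_sids(item_embs, codebooks):
    """Map each item embedding to a tuple of discrete codes (a semantic ID)
    via residual quantization: at each level, pick the nearest codebook
    vector, subtract it, and quantize the residual at the next level."""
    residual = item_embs.astype(float).copy()
    codes = []
    for cb in codebooks:                      # cb: (codebook_size, dim)
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)            # nearest code per item
        codes.append(idx)
        residual -= cb[idx]                   # pass residual to next level
    return np.stack(codes, axis=1)            # (num_items, num_levels)

# Toy example: 100 items, 16-dim embeddings, 3 levels of 8 codes each,
# so each item is represented by a 3-token SID over a vocabulary of 8 per level.
rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 16))
codebooks = [rng.normal(size=(8, 16)) for _ in range(3)]
sids = assign_sids(embs, codebooks)
print(sids.shape)  # (100, 3)
```

An autoregressive GR model (as in TIGER) would then flatten the per-item SIDs in a user's history into one token sequence and train with next-token prediction over that sequence.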

The paper “Masked Diffusion for Generative Recommendation” proposes to replace the AR paradigm with a masked diffusion framework, yielding a new model called MaskGR. Masked diffusion, originally introduced for discrete language modeling, replaces the continuous Gaussian noise of classic diffusion with a discrete masking operation. At each forward diffusion step, each token is independently replaced by a special mask token with a probability that grows over diffusion time; the model is trained to recover the original tokens from the partially masked sequence, and at inference the reverse process fills in the masked positions in parallel, conditioning on whatever tokens are currently unmasked.
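The forward masking and parallel reverse decoding described above can be sketched as follows. This is a toy illustration of the mechanism, not MaskGR itself: the `oracle` stands in for the trained transformer that predicts token distributions at masked positions, and the unmask-half-per-step schedule is an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
MASK = -1  # stand-in id for the special [MASK] token

def forward_mask(tokens, t):
    """Forward diffusion at noise level t in (0, 1]: each token is
    independently replaced by MASK with probability t."""
    noisy = tokens.copy()
    masked = rng.random(tokens.shape) < t
    noisy[masked] = MASK
    return noisy

def parallel_decode(seq, predict_fn, steps=4):
    """Reverse-process sketch: at each step, predict all positions at once
    (masked tokens are modeled as conditionally independent given the
    unmasked ones) and commit a fraction of the masked positions."""
    seq = seq.copy()
    for _ in range(steps):
        masked = np.where(seq == MASK)[0]
        if masked.size == 0:
            break
        n_fill = max(1, masked.size // 2)     # unmask half per step
        chosen = rng.choice(masked, size=n_fill, replace=False)
        preds = predict_fn(seq)               # one parallel forward pass
        seq[chosen] = preds[chosen]
    return seq

tokens = np.array([3, 1, 4, 1, 5, 9, 2, 6])   # a toy SID sequence
noisy = forward_mask(tokens, t=0.5)
oracle = lambda seq: tokens                    # placeholder "model"
recovered = parallel_decode(noisy, oracle)
print(np.array_equal(recovered, tokens))  # True
```

Training would compute cross-entropy only at the masked positions, so every masked subset of a sequence yields a supervision signal; decoding a whole sequence takes a handful of parallel steps rather than one step per token, which is the source of the inference speedup over AR decoding.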

