Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Sequential recommender systems rank relevant items by modeling a user’s interaction history and computing the inner product between the resulting user representation and stored item embeddings. To avoid the significant memory overhead of storing large item sets, the generative recommendation paradigm instead models each item as a series of discrete semantic codes. Here, the next item is predicted by an autoregressive model that generates the code sequence corresponding to the predicted item. However, despite promising ranking capabilities on small datasets, these methods have yet to surpass traditional sequential recommenders on large item sets, limiting their adoption in the very scenarios they were designed to address. To resolve this, we propose MSCGRec, a Multimodal Semantic and Collaborative Generative Recommender. MSCGRec incorporates multiple semantic modalities and introduces a novel self-supervised quantization learning approach for images based on the DINO framework. Additionally, MSCGRec fuses collaborative and semantic signals by extracting collaborative features from sequential recommenders and treating them as a separate modality. Finally, we propose constrained sequence learning that restricts the large output space during training to the set of permissible tokens. We empirically demonstrate on three large real-world datasets that MSCGRec outperforms both sequential and generative recommendation baselines and provide an extensive ablation study to validate the impact of each component.


💡 Research Summary

The paper introduces MSCGRec, a Multimodal Semantic and Collaborative Generative Recommender designed to overcome the scalability and performance limitations of existing generative recommendation models. Traditional sequential recommenders rely on dense item embeddings, which become prohibitive in memory when the catalog contains millions of items. Generative approaches such as TIGER replace each item with a sequence of discrete semantic codes, dramatically reducing storage requirements, but they have so far failed to surpass sequential baselines on large‑scale datasets. The authors identify two root causes: (1) an over‑reliance on textual information while real‑world items often contain rich visual, audio, or other modalities; and (2) the neglect of collaborative signals, which are only indirectly incorporated through auxiliary losses.

MSCGRec addresses these gaps through three complementary innovations. First, it adopts a truly multimodal encoding scheme. Textual attributes continue to be quantized via Residual Quantization (RQ). For images, the authors devise a self‑supervised quantization pipeline based on the DINO framework: a student network processes raw images, its intermediate embedding is quantized with RQ, and a cross‑entropy loss forces the quantized student representation to match the teacher's output. This eliminates the need for a reconstruction loss, focuses the codebook on semantically meaningful visual features, and works even when paired text is unavailable.
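As a rough illustration of the Residual Quantization step used for both text and image embeddings, the sketch below encodes a vector greedily, one codebook level at a time. The codebook sizes, depth, and random codebooks here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def residual_quantize(x, codebooks):
    # Residual Quantization: at each level, pick the nearest codebook
    # entry, then quantize the remaining residual at the next level.
    codes, residual = [], x.astype(np.float64)
    for cb in codebooks:                        # cb: (K, d) codebook matrix
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))             # nearest centroid at this level
        codes.append(idx)
        residual = residual - cb[idx]           # carry the leftover forward
    return codes

# Toy setup: 3 levels, 4 codes per level, 2-d embeddings (all illustrative).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(4, 2)) for _ in range(3)]
codes = residual_quantize(rng.normal(size=2), codebooks)
print(codes)  # one discrete code per quantization level
```

Because each level refines the residual of the previous one, the resulting code sequence is coarse-to-fine, which is what lets an autoregressive decoder generate it left to right.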

Second, collaborative information is treated as an additional modality rather than an auxiliary regularizer. A conventional sequential recommender (e.g., SASRec) is trained on user interaction sequences; its learned item embeddings are then quantized with RQ and appended to the multimodal code sequence. Consequently, the autoregressive decoder receives a unified stream of codes that simultaneously encodes textual, visual, and collaborative hierarchies, without requiring separate loss terms to align them.
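One plausible way to serialize the three code streams into the decoder's single vocabulary is to give each (modality, level) pair its own non-overlapping token-id range. The codebook size, per-modality depths, and ordering below are assumptions for illustration, not details taken from the paper:

```python
CODEBOOK_SIZE = 256                              # assumed codes per RQ level
LEVELS = {"text": 3, "image": 3, "collab": 3}    # assumed RQ depth per modality

def flatten_item_codes(item_codes):
    # Map per-modality RQ codes to one token sequence: token id =
    # modality offset + level offset + code, so id ranges never collide.
    tokens, offset = [], 0
    for modality in ("text", "image", "collab"):
        for level, code in enumerate(item_codes[modality]):
            assert 0 <= code < CODEBOOK_SIZE
            tokens.append(offset + level * CODEBOOK_SIZE + code)
        offset += LEVELS[modality] * CODEBOOK_SIZE
    return tokens

# Hypothetical item: RQ codes from the text, image, and SASRec pipelines.
item = {"text": [12, 200, 7], "image": [3, 45, 99], "collab": [250, 1, 30]}
print(flatten_item_codes(item))
```

With disjoint id ranges, the decoder needs no extra type embeddings to tell the modalities apart, and no auxiliary alignment losses are required: all three signals share one next-token objective.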

Third, the authors introduce “constrained sequence learning.” In a large vocabulary (hundreds of thousands of tokens), naïve next‑token prediction would waste computation on invalid codes. During training, the model’s softmax is masked so that only tokens belonging to actual items (i.e., permissible code prefixes) are considered. This restriction dramatically stabilizes training, speeds convergence, and improves final ranking metrics. The framework also supports missing modalities: during training, a random subset of modality codes can be replaced with learnable mask tokens, enabling the model to handle items lacking text or images at inference time.
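The masking idea behind constrained sequence learning can be sketched with a prefix index over the catalog's code sequences: given the codes generated so far, only tokens that extend some real item remain unmasked. The toy vocabulary and index structure below are assumptions; the paper's exact mechanism may differ:

```python
import numpy as np

def build_prefix_index(item_code_seqs):
    # Map every valid code prefix to the set of permissible next tokens.
    nxt = {}
    for seq in item_code_seqs:
        for i in range(len(seq)):
            nxt.setdefault(tuple(seq[:i]), set()).add(seq[i])
    return nxt

def constrained_log_softmax(logits, prefix, nxt):
    # Mask logits so probability mass goes only to tokens that extend
    # `prefix` into some real item's code sequence, then log-softmax.
    masked = np.full_like(logits, -np.inf)
    allowed = sorted(nxt.get(tuple(prefix), ()))
    masked[allowed] = logits[allowed]
    masked -= masked.max()                       # numerical stability
    return masked - np.log(np.exp(masked).sum())

# Toy catalog: 3 items, each a 3-code sequence over a 6-token vocabulary.
catalog = [[0, 2, 4], [0, 3, 5], [1, 2, 2]]
nxt = build_prefix_index(catalog)
lp = constrained_log_softmax(np.zeros(6), [0], nxt)
print(np.exp(lp))  # only tokens 2 and 3 receive probability mass
```

Applying the same mask at training time means the cross-entropy loss never penalizes the model over tokens it would never be allowed to emit, which is the source of the stability and convergence gains the authors report.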

Extensive experiments on three real‑world datasets—PixelRec (image‑rich e‑commerce), Amazon‑Book (text‑dominant), and a large video‑streaming catalog—demonstrate the efficacy of MSCGRec. Across HR@10 and NDCG@10, MSCGRec outperforms strong sequential baselines (SASRec, BERT4Rec) and state‑of‑the‑art generative models (TIGER, VQ‑Rec) by 5–8 % absolute. Memory consumption is reduced by more than 70 % compared to dense embedding tables when the item set exceeds one million entries. An ablation study confirms that removing either image quantization or the collaborative modality incurs a 4–6 % performance drop, while disabling constrained training slows convergence and lowers final scores by ~2 %.

In summary, MSCGRec delivers a scalable, memory‑efficient recommendation architecture that leverages multimodal semantics and collaborative patterns in a unified generative framework. By integrating self‑supervised image quantization, treating collaborative embeddings as a first‑class modality, and restricting the output space during training, the method achieves the first reported case where a generative recommender surpasses traditional sequential recommenders on large‑scale datasets. This work paves the way for future research on multimodal, code‑based recommendation systems that can handle ever‑growing catalogs without sacrificing accuracy.

