Multimodal Enhancement of Sequential Recommendation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We propose a novel recommender framework, MuSTRec (Multimodal and Sequential Transformer-based Recommendation), that unifies the multimodal and sequential recommendation paradigms. MuSTRec captures cross-item similarities and collaborative filtering signals by building item-item graphs from extracted textual and visual features. A frequency-based self-attention module additionally captures short- and long-term user preferences. Across multiple Amazon datasets, MuSTRec demonstrates superior performance (up to 33.5% improvement) over state-of-the-art multimodal and sequential baselines. Finally, we detail some interesting facets of this new recommendation paradigm, including the need for a new data-partitioning regime and a demonstration that integrating user embeddings into sequential recommendation drastically increases short-term metrics (up to 200% improvement) on smaller datasets. Our code is available at https://anonymous.4open.science/r/MuSTRec-D32B/ and will be made publicly available.


💡 Research Summary

MuSTRec (Multimodal and Sequential Transformer‑based Recommendation) is a unified framework that simultaneously leverages multimodal item content (text and images) and users’ sequential interaction histories. The authors first extract modality‑specific embeddings for every item using pretrained encoders (e.g., BERT for text, ResNet for images). For each modality they compute a cosine‑similarity matrix, keep only the top‑k (k=10) nearest neighbours per item, and symmetrically normalize the resulting sparse graph. Modality‑specific graphs are then linearly combined with learnable weights αₘ to form a single multimodal item‑item adjacency matrix S.
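The graph-construction step described above can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the function names are hypothetical, the graph is symmetrized before normalization (one common implementation choice the summary does not specify), and the learnable weights αₘ are emulated here with a softmax over fixed scores.

```python
import numpy as np

def knn_graph(feats, k=10):
    """Top-k cosine-similarity item-item graph, symmetrically normalized."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, 0.0)
    # keep only the k largest similarities per row
    idx = np.argpartition(-sim, k, axis=1)[:, :k]
    adj = np.zeros_like(sim)
    adj[np.arange(sim.shape[0])[:, None], idx] = 1.0
    adj = np.maximum(adj, adj.T)          # symmetrize (implementation choice)
    np.fill_diagonal(adj, 0.0)
    deg = adj.sum(axis=1)
    d = np.where(deg > 0, deg ** -0.5, 0.0)
    return d[:, None] * adj * d[None, :]  # D^{-1/2} A D^{-1/2}

def fuse(graphs, alphas):
    """Linear combination of modality graphs with normalized weights."""
    w = np.exp(alphas) / np.exp(alphas).sum()
    return sum(wi * g for wi, g in zip(w, graphs))
```

In practice the text graph (e.g. from BERT embeddings) and the visual graph (e.g. from ResNet embeddings) would each be built with `knn_graph` and combined with `fuse` to obtain the multimodal adjacency S.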

In parallel, a user‑item bipartite graph is built from the binary interaction matrix. To mitigate oversmoothing, edges incident to high‑degree nodes are probabilistically pruned using a degree‑sensitive retention probability pᵢⱼ = 1/(√ωᵢ √ωⱼ). After each epoch the pruned graph is re‑normalized, yielding adjacency matrix A.
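The degree-sensitive pruning above might look like the sketch below (hypothetical function name; the cap at probability 1 for low-degree nodes is an assumption, since pᵢⱼ can exceed 1 when both endpoint degrees are below 1... in practice when both equal 1):

```python
import numpy as np

def degree_prune(edges, num_nodes, rng):
    """Keep edge (u, i) with probability 1 / (sqrt(w_u) * sqrt(w_i)),
    where w is the node degree, so edges at high-degree nodes are
    pruned more aggressively (mitigating oversmoothing)."""
    deg = np.zeros(num_nodes)
    for u, i in edges:
        deg[u] += 1.0
        deg[i] += 1.0
    kept = []
    for u, i in edges:
        p = min(1.0, 1.0 / (np.sqrt(deg[u]) * np.sqrt(deg[i])))
        if rng.random() < p:
            kept.append((u, i))
    return kept
```

After pruning, the surviving graph would be re-normalized (as in the previous step) to yield A for the current epoch.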

Both graphs are processed with LightGCN. L_ui layers are applied to A to obtain collaborative embeddings for users and items (ĥᵤ and ĥᵢ), while L_ii layers are applied to S to obtain multimodal item embeddings h̃ᵢ. Final item representations are the sum of the two, hᵢ = h̃ᵢ + ĥᵢ, thereby fusing collaborative filtering signals with multimodal similarity.
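A minimal sketch of this two-graph propagation and fusion, assuming the standard LightGCN rule (parameter-free neighbourhood averaging, with the final embedding as the mean over layers) and toy placeholder adjacencies:

```python
import numpy as np

def lightgcn(adj, emb, num_layers):
    """LightGCN propagation: E^(l+1) = A_hat @ E^(l), no feature
    transforms or nonlinearities; output is the mean over all layers."""
    layers = [emb]
    for _ in range(num_layers):
        layers.append(adj @ layers[-1])
    return np.mean(layers, axis=0)

rng = np.random.default_rng(0)
num_users, num_items, d = 5, 4, 8
A_norm = np.eye(num_users + num_items)   # placeholder normalized user-item graph
S = np.eye(num_items)                    # placeholder multimodal item-item graph
e_ui = rng.normal(size=(num_users + num_items, d))
h_hat = lightgcn(A_norm, e_ui, num_layers=2)            # CF-side embeddings
h_tilde = lightgcn(S, e_ui[num_users:], num_layers=1)   # multimodal-side embeddings
h_items = h_tilde + h_hat[num_users:]                   # h_i = h~_i + h^_i
```

With real data, `A_norm` and `S` would be the (sparse) normalized adjacency matrices built in the previous steps rather than identity placeholders.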

The item embeddings are reshaped into chronological sequences for each user: Sᵤ = (h_{i₁}, h_{i₂}, …, h_{i|Sᵤ|}), ordered by interaction timestamp.
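The reshaping step can be sketched as below. This is a hypothetical helper, assuming the common convention of truncating each history to the most recent `max_len` interactions and left-padding shorter histories with zeros:

```python
import numpy as np

def build_sequences(interactions, item_emb, max_len):
    """Map each user's chronologically ordered item ids to a fixed-length
    (max_len, d) matrix of item embeddings, left-padded with zeros."""
    d = item_emb.shape[1]
    seqs = {}
    for user, items in interactions.items():
        items = items[-max_len:]                  # keep most recent max_len
        seq = np.zeros((max_len, d))
        seq[max_len - len(items):] = item_emb[items]
        seqs[user] = seq
    return seqs
```

These padded sequences are what a downstream self-attention module would consume.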

