Sequences as Nodes for Contrastive Multimodal Graph Recommendation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

To tackle cold-start and data sparsity issues in recommender systems, numerous multimodal, sequential, and contrastive techniques have been proposed. While the augmentations these techniques rely on can boost recommendation performance, they tend to add noise and disrupt useful semantics. To address this, we propose MuSICRec (Multimodal Sequence-Item Contrastive Recommender), a multi-view graph-based recommender that combines collaborative, sequential, and multimodal signals. We build a sequence-item (SI) view by attention pooling over each user's interacted items to form sequence nodes. Propagating over the SI graph yields a second view organically, as an alternative to artificial data augmentation, while simultaneously injecting sequential context signals. Additionally, to mitigate modality noise and align the multimodal information, the contributions of textual and visual features are modulated by an ID-guided gate. We evaluate under a strict leave-two-out split against a broad range of sequential, multimodal, and contrastive baselines. On the Amazon Baby, Sports, and Electronics datasets, MuSICRec outperforms state-of-the-art baselines across all model types. We observe the largest gains for short-history users, mitigating sparsity and cold-start challenges. Our code is available at https://anonymous.4open.science/r/MuSICRec-3CEE/ and will be made publicly available.


💡 Research Summary

MuSICRec (Multimodal Sequence‑Item Contrastive Recommender) tackles the persistent cold‑start and sparsity challenges in recommender systems by jointly leveraging collaborative filtering, sequential dynamics, and multimodal content within a unified multi‑view graph architecture. The core idea is to treat each user’s interaction history as a “sequence node” in a newly constructed Sequence‑Item (SI) bipartite graph. These sequence nodes are obtained via additive attention pooling over the user’s item embeddings, which yields length‑aware, salient representations that emphasize the most informative past items.
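The additive attention pooling step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the projection `W`, scoring vector `v`, and embedding sizes are placeholder assumptions, and the learned parameters would be trained jointly with the model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(item_embs, W, v):
    """Additive attention pooling over one user's item embeddings.

    item_embs: (n_items, d) embeddings of the user's interacted items.
    W: (d, d) projection and v: (d,) scoring vector -- hypothetical
    stand-ins for learned parameters.
    Returns a single (d,) sequence-node embedding that up-weights the
    most informative past items.
    """
    scores = np.tanh(item_embs @ W) @ v   # (n_items,) additive attention scores
    weights = softmax(scores)             # attention weights, sum to 1
    return weights @ item_embs            # weighted sum -> (d,)

# toy usage with random parameters
rng = np.random.default_rng(0)
d, n = 8, 5
seq_node = attention_pool(rng.normal(size=(n, d)),
                          rng.normal(size=(d, d)),
                          rng.normal(size=d))
```

Because the weights are a softmax over scores, the pooled vector stays in the convex hull of the item embeddings regardless of sequence length, which is what makes the representation length-aware.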

The SI graph connects sequence nodes to their constituent items (unit‑weight SI edges) and to other sequence nodes based on Jaccard similarity of item sets (SS edges). This structure captures intra‑sequence transitions, higher‑order co‑occurrences, and short‑range contextual signals that traditional user‑item (UI) bipartite graphs miss. By propagating over the SI graph with simple linear GCN layers, MuSICRec derives an organic alternative view of the data, avoiding heuristic augmentations (node/edge dropout, random walks) commonly used in contrastive learning. The UI graph remains a standard LightGCN‑style bipartite graph built directly from the interaction matrix, providing a global collaborative signal.
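The SI/SS edge construction described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the similarity threshold value below is a made-up example, and the paper's actual threshold may differ.

```python
def jaccard(a, b):
    """Jaccard similarity between two item sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def build_si_edges(sequences, threshold=0.4):
    """Build the two edge types of the sequence-item (SI) view.

    sequences: dict mapping sequence id -> list of item ids.
    Returns (si_edges, ss_edges):
      si_edges -- unit-weight edges linking each sequence node to its items;
      ss_edges -- sequence-sequence edges kept when the Jaccard similarity
                  of the item sets reaches the (illustrative) threshold.
    """
    si_edges = [(s, i) for s, items in sequences.items() for i in set(items)]
    seq_ids = list(sequences)
    ss_edges = []
    for x in range(len(seq_ids)):
        for y in range(x + 1, len(seq_ids)):
            sim = jaccard(sequences[seq_ids[x]], sequences[seq_ids[y]])
            if sim >= threshold:
                ss_edges.append((seq_ids[x], seq_ids[y], sim))
    return si_edges, ss_edges

# toy usage: s1 and s2 share items {2, 3}, s3 is disjoint
seqs = {"s1": [1, 2, 3], "s2": [2, 3, 4], "s3": [7, 8]}
si, ss = build_si_edges(seqs)
```

Here only `s1` and `s2` are linked by an SS edge (Jaccard 2/4 = 0.5), capturing the co-occurrence signal that a plain user-item bipartite graph would miss.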

To incorporate multimodal information without overwhelming the collaborative signal, MuSICRec adopts a static multimodal item‑item (MM) graph, following the FREEDOM paradigm. Each item possesses pre‑extracted visual and textual embeddings, linearly projected into a shared space. An ID‑guided gate, conditioned on the item’s ID embedding, dynamically balances the contribution of visual and textual features before they are injected into the frozen MM graph. This gating mitigates cross‑modal misalignment and prevents noisy modalities from dominating the learned representations. LightGCN‑style propagation on the frozen MM graph further stabilizes training.
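The ID-guided gating idea can be sketched as below. This is a hedged simplification: the gate here is a scalar sigmoid conditioned on the item-ID embedding via a hypothetical parameter vector `Wg`; the paper's actual gate parameterization may be richer (e.g. element-wise).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def id_guided_gate(id_emb, vis_feat, txt_feat, Wg, bg=0.0):
    """ID-conditioned gate balancing visual and textual features (sketch).

    id_emb: (d,) item ID embedding.
    vis_feat, txt_feat: (d,) modality features already linearly projected
    into the shared space.
    Wg (d,) and bg are hypothetical gate parameters, stand-ins for
    whatever the model learns.
    """
    g = sigmoid(id_emb @ Wg + bg)          # scalar gate in (0, 1)
    return g * vis_feat + (1.0 - g) * txt_feat

# toy usage
rng = np.random.default_rng(1)
d = 8
fused = id_guided_gate(rng.normal(size=d), rng.normal(size=d),
                       rng.normal(size=d), rng.normal(size=d))
```

Because the gate is bounded in (0, 1), a noisy modality can be attenuated per item without being discarded, which is the calibration behavior the summary describes.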

Training proceeds with three concurrent objectives: (1) a BPR ranking loss on the UI view, (2) an entity‑level contrastive loss aligning each user embedding from the UI view with its own sequence embedding from the SI view, and (3) regularization terms for the gating mechanism. The contrastive loss exploits the natural correspondence between a user and its sequence, eliminating the need for synthetic view generation.
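The two main objectives above, BPR ranking and the entity-level contrastive loss, can be sketched as follows. This is an illustrative NumPy version under assumed shapes; the temperature `tau` is a placeholder value, and the gate-regularization terms are omitted.

```python
import numpy as np

def bpr_loss(u, pos, neg):
    """BPR ranking loss for one (user, positive item, negative item) triple:
    -log sigmoid(score_pos - score_neg)."""
    return -np.log(1.0 / (1.0 + np.exp(-(u @ pos - u @ neg))))

def info_nce(ui_users, si_seqs, tau=0.2):
    """Entity-level contrastive loss between the two views.

    Row i of ui_users (UI-view user embedding) is pulled toward row i of
    si_seqs (that user's own SI-view sequence embedding); all other
    sequences in the batch act as negatives. tau is an illustrative
    temperature.
    """
    u = ui_users / np.linalg.norm(ui_users, axis=1, keepdims=True)
    s = si_seqs / np.linalg.norm(si_seqs, axis=1, keepdims=True)
    logits = (u @ s.T) / tau                       # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # diagonal = positive pairs

# toy usage
rng = np.random.default_rng(2)
B, d = 4, 8
cl_loss = info_nce(rng.normal(size=(B, d)), rng.normal(size=(B, d)))
u_vec, p_vec, n_vec = rng.normal(size=(3, d))
rank_loss = bpr_loss(u_vec, p_vec, n_vec)
```

The key point is that the positive pair (user i, sequence i) comes for free from the data itself, so no synthetic augmented view has to be generated.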

Extensive experiments on three Amazon domains (Baby, Sports, Electronics) using a strict leave‑two‑out protocol demonstrate that MuSICRec consistently outperforms a broad spectrum of baselines, including state‑of‑the‑art collaborative (LightGCN, LightGCL), sequential (SASRec, BERT4Rec, FEARec), multimodal (MMGCN, DualGNN, LGMRec), and contrastive (SGL, SimGCL, BM3) models. The most pronounced gains appear for users with short interaction histories (≤5 items), confirming that the SI graph effectively alleviates sparsity. Ablation studies reveal that removing the SI view drops performance by 7–9%, while disabling the ID‑guided gate reduces accuracy by 4–6%, underscoring the importance of both components. Sensitivity analyses show robustness to the number of propagation layers, Jaccard similarity threshold, and gate scaling parameters.

In summary, MuSICRec introduces three key innovations: (i) representing interaction sequences as graph nodes to obtain a natural contrastive view, (ii) employing ID‑conditioned multimodal gating to align and calibrate visual/textual signals, and (iii) integrating UI, SI, and frozen MM graphs in a unified framework that balances collaborative, sequential, and content signals. This design yields superior recommendation quality, especially for cold‑start users, and opens avenues for future work such as dynamic sequence node updates, user‑specific modality preference modeling, and extension to other domains like video or music recommendation.

