Towards Trustworthy Multimodal Recommendation
Recent advances in multimodal recommendation have demonstrated the effectiveness of incorporating visual and textual content into collaborative filtering. However, real-world deployments raise an increasingly important yet underexplored issue: trustworthiness. On modern e-commerce platforms, multimodal content can be misleading or unreliable (e.g., visually inconsistent product images or click-bait titles), injecting untrustworthy signals into multimodal representations and making existing recommenders brittle under modality corruption. In this work, we take a step towards trustworthy multimodal recommendation from both a method and an analysis perspective. First, we propose a plug-and-play modality-level rectification component that mitigates untrustworthy modality features by learning soft correspondences between items and multimodal features. Using lightweight projections and Sinkhorn-based soft matching, the rectification suppresses mismatched modality signals while preserving semantic consistency, and can be integrated into existing multimodal recommenders without architectural modifications. Second, we present two practical insights on interaction-level trustworthiness under noisy collaborative signals: (i) training-set pseudo interactions can help or hurt performance under noise depending on prior-signal alignment; and (ii) propagation-graph pseudo edges can also help or hurt robustness, as message passing may amplify misalignment. Extensive experiments on multiple datasets and backbones under varying corruption levels demonstrate improved robustness from modality rectification and validate the above interaction-level observations.
💡 Research Summary
This paper addresses a critical yet under‑explored challenge in modern multimodal recommender systems: trustworthiness of both content modalities (images, text) and collaborative interaction signals. While recent advances have shown that incorporating visual and textual features can alleviate data sparsity and improve recommendation quality, real‑world e‑commerce platforms often contain misleading or inconsistent multimodal content (e.g., mismatched product images and titles) and noisy implicit feedback (e.g., accidental clicks, exposure‑driven interactions). Such untrustworthy signals can severely degrade the performance of existing multimodal recommenders, especially when graph‑based message passing amplifies the errors.
The authors propose a two‑pronged approach. First, they introduce a plug‑and‑play modality‑level rectification module that can be inserted into any existing multimodal recommender without architectural changes. The module works offline: it learns soft correspondences between items and their modality features using lightweight linear projections followed by a Sinkhorn‑based soft matching procedure. An anchor embedding for each item is obtained by pre‑training a LightGCN encoder on the reliable interaction graph; this anchor serves as a trustworthy reference. Projected modality vectors are normalized and aligned to the anchor via a cosine regression loss, but only the lowest‑loss (most similar) subset in each mini‑batch is used for optimization, thereby reducing the influence of mismatched pairs. The resulting soft correspondence matrix re‑aggregates the original modality features into rectified representations that suppress inconsistent signals while preserving semantic information. Because the module only transforms the input features, it can be applied to a wide range of backbones—including VBPR, LATTICE, FREEDOM, MGCN, and SMORE—by simply swapping the original modality inputs with the rectified ones.
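The rectification pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `keep_ratio` subset fraction, and the Sinkhorn temperature `eps` are all illustrative choices, and the anchor embeddings are assumed to come from a pre-trained collaborative encoder such as LightGCN.

```python
# Minimal sketch (not the authors' code) of modality-level rectification:
# project modality features, align them to anchor item embeddings with a
# cosine loss on the lowest-loss subset, and re-aggregate the original
# features through a Sinkhorn-based soft correspondence matrix.
# `keep_ratio`, `eps`, and `n_iters` are illustrative assumptions.
import numpy as np

def sinkhorn(scores, n_iters=20, eps=0.05):
    """Turn a similarity matrix into an (approximately) doubly-stochastic
    soft matching via alternating row/column normalization."""
    K = np.exp(scores / eps)
    for _ in range(n_iters):
        K /= K.sum(axis=1, keepdims=True)  # row normalization
        K /= K.sum(axis=0, keepdims=True)  # column normalization
    return K

def rectify(modality_feats, anchors, W, keep_ratio=0.7):
    """Rectify modality features against trustworthy anchor embeddings.

    modality_feats: (n_items, d_m) raw visual or textual features
    anchors:        (n_items, d)   embeddings from a pre-trained CF encoder
    W:              (d_m, d)       lightweight linear projection
    """
    proj = modality_feats @ W
    proj /= np.linalg.norm(proj, axis=1, keepdims=True) + 1e-8
    anc = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + 1e-8)

    # Cosine regression loss per item; only the lowest-loss (most similar)
    # subset would be used for optimization, down-weighting mismatched pairs.
    loss = 1.0 - np.sum(proj * anc, axis=1)
    kept = np.argsort(loss)[: int(keep_ratio * len(loss))]

    # Soft correspondence between items and modality features via Sinkhorn,
    # then re-aggregation of the original features into rectified ones.
    P = sinkhorn(proj @ anc.T)
    rectified = P @ modality_feats
    return rectified, kept
```

Because the rectified features have the same shape as the originals, a backbone such as VBPR or FREEDOM could consume them by a simple input swap, which matches the plug-and-play claim in the summary.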
Second, the paper conducts a systematic analysis of interaction‑level trustworthiness. It distinguishes between two ways of injecting synthetic data: (i) adding pseudo interactions to the training set (e.g., generated from collaborative priors) and (ii) adding pseudo edges only to the propagation graph used during message passing. Experiments reveal that both strategies are double‑edged. When the synthetic edges align well with the true underlying preference patterns, they can provide useful regularization and improve performance under noisy conditions. However, when the priors are misaligned, the same synthetic data can mislead the model, and message passing can further amplify these errors, leading to a substantial drop in recommendation accuracy. This alignment‑dependent behavior underscores the need to assess prior‑signal alignment before employing graph‑enhancement or data‑augmentation techniques.
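The distinction between the two injection points can be made concrete with a small sketch. This is an illustrative toy setup, not the paper's protocol: the LightGCN-style symmetric normalization, the helper names, and the tiny graph are all assumptions for demonstration.

```python
# Illustrative sketch (not the paper's protocol) of the two injection points
# for synthetic interactions: (i) appending pseudo pairs to the training set
# consumed by the loss, vs (ii) adding pseudo edges to the normalized
# adjacency used for message passing. Helper names are illustrative.
import numpy as np

def normalized_adj(edges, n_users, n_items):
    """Symmetric-normalized bipartite adjacency (LightGCN-style)."""
    n = n_users + n_items
    A = np.zeros((n, n))
    for u, i in edges:
        A[u, n_users + i] = A[n_users + i, u] = 1.0
    deg = A.sum(axis=1)
    d = np.where(deg > 0, deg ** -0.5, 0.0)
    return d[:, None] * A * d[None, :]

def propagate(A_hat, emb, n_layers=2):
    """Average node embeddings over propagation layers."""
    out, h = emb.copy(), emb
    for _ in range(n_layers):
        h = A_hat @ h
        out += h
    return out / (n_layers + 1)

train_edges = [(0, 0), (1, 1)]
pseudo_pairs = [(0, 1)]  # synthetic user-item interaction

# (i) training-set injection: pseudo pairs only add positives to the loss;
# the propagation graph is left untouched.
loss_pairs = train_edges + pseudo_pairs

# (ii) propagation-graph injection: pseudo edges reshape message passing,
# so a misaligned edge shifts every downstream representation.
emb = np.random.default_rng(0).normal(size=(4, 8))  # 2 users + 2 items
clean = propagate(normalized_adj(train_edges, 2, 2), emb)
augmented = propagate(normalized_adj(train_edges + pseudo_pairs, 2, 2), emb)
```

In case (ii), `clean` and `augmented` differ even for items never paired with pseudo edges, which is the amplification-through-message-passing effect the analysis highlights.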
The authors evaluate their methods on several public datasets (e.g., Amazon Baby, Amazon Clothing, Yelp) and under varying corruption levels (0–50% modality misalignment, 0–30% interaction noise). Across all tested backbones, the modality‑level rectification consistently improves robustness: even at a 30% modality corruption rate, NDCG@10 and Recall@10 increase by roughly 8–12% compared to the unrectified baselines. When corruption reaches 50%, the performance degradation is markedly less severe than in the original models, demonstrating the module’s effectiveness in extreme scenarios. For interaction‑level experiments, the authors report that correctly aligned pseudo interactions can yield 5–7% gains, whereas misaligned ones cause up to a 12% loss, and that adding pseudo edges solely to the propagation graph often harms robustness unless strict filtering is applied.
Key contributions of the paper are: (1) formalizing trustworthiness failures in multimodal recommendation and providing reproducible stress‑test protocols; (2) proposing a lightweight, architecture‑agnostic modality rectification module based on soft correspondence and Sinkhorn matching; (3) uncovering practical insights about when synthetic interactions or graph augmentations help or hurt under noisy collaborative signals; and (4) demonstrating the generality and effectiveness of the approach across multiple datasets and recommendation backbones. The work bridges a gap between high‑performing multimodal recommenders and the reliability requirements of real‑world deployment, offering both a concrete mitigation technique for modality corruption and a nuanced understanding of interaction‑level trustworthiness.