RecGOAT: Graph Optimal Adaptive Transport for LLM-Enhanced Multimodal Recommendation with Dual Semantic Alignment

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Multimodal recommendation systems typically integrate user behavior with multimodal item data, thereby capturing more accurate user preferences. Concurrently, with the rise of large models (LMs), multimodal recommendation is increasingly leveraging their strengths in semantic understanding and contextual reasoning. However, LM representations are inherently optimized for general semantic tasks, while recommendation models rely heavily on sparse user/item unique identity (ID) features. Existing works overlook the fundamental representational divergence between large models and recommendation systems, resulting in incompatible multimodal representations and suboptimal recommendation performance. To bridge this gap, we propose RecGOAT, a novel yet simple dual semantic alignment framework for LLM-enhanced multimodal recommendation, which offers theoretically guaranteed alignment capability. RecGOAT first employs graph attention networks to enrich collaborative semantics by modeling item-item, user-item, and user-user relationships, leveraging user/item LM representations and interaction history. Furthermore, we design a dual-granularity progressive multimodality-ID alignment framework, which achieves instance-level and distribution-level semantic alignment via cross-modal contrastive learning (CMCL) and optimal adaptive transport (OAT), respectively. Theoretically, we demonstrate that the unified representations derived from our alignment framework exhibit superior semantic consistency and comprehensiveness. Extensive experiments on three public benchmarks show that our RecGOAT achieves state-of-the-art performance, empirically validating our theoretical insights. Additionally, the deployment on a large-scale online advertising platform confirms the model's effectiveness and scalability in industrial recommendation scenarios. Code available at https://github.com/6lyc/RecGOAT-LLM4Rec.


💡 Research Summary

RecGOAT addresses the fundamental mismatch between large‑model (LLM/LVM) embeddings and traditional ID‑based collaborative signals in multimodal recommendation. The framework consists of two main components. First, intra‑modal graph learning enriches collaborative semantics by constructing separate K‑nearest‑neighbor graphs for textual and visual item features, as well as user‑item and user‑user interaction graphs. High‑quality modality embeddings are obtained from state‑of‑the‑art models: Qwen‑Embedding‑8B for text, LLaVA‑1.5‑7B for images, and Qwen‑32B for user‑level preference reasoning via personalized prompts. Graph Attention Networks (GAT) and LightGCN propagate these embeddings across the graphs, capturing high‑order relations beyond simple ID aggregation.
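The intra-modal graph step above can be sketched in a few lines. The snippet below is a minimal numpy illustration, not the paper's implementation: it assumes cosine-similarity kNN graph construction and a single-head, dot-product-attention aggregation as a stand-in for the GAT propagation; all function names and the temperature parameter are our own.

```python
import numpy as np

def knn_item_graph(emb: np.ndarray, k: int) -> np.ndarray:
    """Build a K-nearest-neighbour adjacency over items from one modality's
    embeddings (cosine similarity), mirroring the intra-modal graph step."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)              # exclude self from neighbours
    adj = np.zeros_like(sim)
    topk = np.argsort(-sim, axis=1)[:, :k]      # k most similar items per row
    adj[np.arange(emb.shape[0])[:, None], topk] = 1.0
    return adj

def attention_aggregate(emb: np.ndarray, adj: np.ndarray,
                        temperature: float = 1.0) -> np.ndarray:
    """GAT-style propagation sketch: attention scores from dot-product
    similarity, softmax-normalised over each node's neighbours only."""
    scores = (emb @ emb.T) / temperature
    scores = np.where(adj > 0, scores, -np.inf)  # mask non-neighbours
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)                     # exp(-inf) -> 0 for non-edges
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ emb                         # neighbour-weighted features
```

In the full model, separate graphs of this kind are built per modality (text, image) and propagated alongside the user-item and user-user graphs.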

Second, a dual‑granularity semantic alignment aligns modality embeddings with ID embeddings at both instance and distribution levels. Instance‑level alignment uses cross‑modal contrastive learning (CMCL) with an InfoNCE loss to pull together the same item’s text, image, and ID vectors while pushing apart different items. Distribution‑level alignment introduces Optimal Adaptive Transport (OAT), which minimizes the 1‑Wasserstein distance between the empirical distributions of each modality and the ID space. A learnable transport matrix, regularized by adaptive parameters, is jointly optimized with the recommendation loss (BPR), allowing the OT process to be directly guided by downstream performance.
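Both alignment granularities can be sketched compactly. The numpy code below is an illustrative approximation, not the paper's code: `info_nce` is a standard InfoNCE loss with positives on the diagonal (instance-level CMCL), and `sinkhorn_w1` uses entropy-regularised Sinkhorn iterations with uniform marginals as a common numerical stand-in for the 1-Wasserstein distance, whereas the paper's OAT learns the transport matrix jointly with the BPR loss; all names and hyperparameters here are our own.

```python
import numpy as np

def info_nce(z_mod: np.ndarray, z_id: np.ndarray, tau: float = 0.2) -> float:
    """Instance-level sketch: pull each item's modality embedding toward its
    own ID embedding (diagonal positives), push apart other items'."""
    z1 = z_mod / np.linalg.norm(z_mod, axis=1, keepdims=True)
    z2 = z_id / np.linalg.norm(z_id, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                         # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def sinkhorn_w1(cost: np.ndarray, n_iters: int = 200,
                eps: float = 0.1) -> float:
    """Distribution-level sketch: entropic OT between uniform marginals,
    approximating the 1-Wasserstein distance minimised by OAT."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)                          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):                         # Sinkhorn scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]               # transport plan
    return float((plan * cost).sum())                # transport cost
```

In training, losses of this shape would be summed with the BPR recommendation loss so that the alignment is guided by downstream performance, as the paper describes for its jointly optimized transport matrix.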

The authors provide a theoretical guarantee: the unified representation’s expected error is lower than that of any single modality, and the error gap is bounded by the sum of the Wasserstein distance and the InfoNCE loss. This establishes that jointly minimizing CMCL and OAT yields representations that are both semantically consistent (instance‑wise) and comprehensively aligned (distribution‑wise).
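The paper's exact statement is not reproduced here; one way to write a bound of the stated shape, in notation that is entirely our own ($\mathcal{E}(\cdot)$ an expected task error, $z_{\mathrm{uni}}$ the unified representation, $z_m$ a single-modality representation, $\mu_m,\ \mu_{\mathrm{id}}$ the modality and ID embedding distributions), is:

```latex
% Our own notation, not the paper's; an illustrative bound of the stated shape.
\mathbb{E}\big[\mathcal{E}(z_{\mathrm{uni}})\big]
  \;\le\;
\min_{m}\,\mathbb{E}\big[\mathcal{E}(z_{m})\big],
\qquad
\min_{m}\,\mathbb{E}\big[\mathcal{E}(z_{m})\big]
  - \mathbb{E}\big[\mathcal{E}(z_{\mathrm{uni}})\big]
  \;\le\;
C_{1}\, W_{1}\big(\mu_{m},\, \mu_{\mathrm{id}}\big)
  + C_{2}\, \mathcal{L}_{\mathrm{CMCL}},
```

with $C_1, C_2 > 0$ constants. Read this way, driving both $W_1$ (via OAT) and $\mathcal{L}_{\mathrm{CMCL}}$ (via contrastive learning) down tightens the gap, matching the summary's claim that minimizing both objectives jointly is what certifies the unified representation.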

Extensive experiments on three public multimodal recommendation benchmarks (Baby, Clothing, Movie) and a large‑scale online advertising platform demonstrate state‑of‑the‑art results. RecGOAT improves NDCG@10 by 4–7 percentage points over strong baselines and shows consistent gains in Recall and Hit Rate. Ablation studies confirm that both CMCL and OAT are necessary; removing either component leads to noticeable performance drops. In an online A/B test, RecGOAT increases click‑through rate by 5.2% and reduces system latency by 12%, confirming industrial viability.

In summary, RecGOAT unifies large‑model semantic knowledge with collaborative filtering through (1) graph‑enhanced multimodal feature augmentation, (2) a two‑level alignment strategy combining contrastive learning and optimal transport, and (3) rigorous theoretical analysis guaranteeing superior fused representations. The work opens avenues for further research on dynamic prompt generation, more complex heterogeneous graph structures, and real‑time OT adaptation in large‑scale recommender systems.

