OptiMAG: Structure-Semantic Alignment via Unbalanced Optimal Transport
Multimodal Attributed Graphs (MAGs) have been widely adopted for modeling complex systems by integrating multi-modal information, such as text and images, on nodes. However, we identify a discrepancy between the implicit semantic structure induced by different modality embeddings and the explicit graph structure. For instance, neighbors in the explicit graph structure may be close in one modality but distant in another. Since existing methods typically perform message passing over the fixed explicit graph structure, they inadvertently aggregate dissimilar features, introducing modality-specific noise and impeding effective node representation learning. To address this, we propose OptiMAG, an Unbalanced Optimal Transport-based regularization framework. OptiMAG employs the Fused Gromov-Wasserstein distance to explicitly guide cross-modal structural consistency within local neighborhoods, effectively mitigating structural-semantic conflicts. Moreover, a KL divergence penalty enables adaptive handling of cross-modal inconsistencies. This framework can be seamlessly integrated into existing multimodal graph models, acting as an effective drop-in regularizer. Experiments demonstrate that OptiMAG consistently outperforms baselines across multiple tasks, ranging from graph-centric tasks (e.g., node classification, link prediction) to multimodal-centric generation tasks (e.g., graph2text, graph2image). The source code will be available upon acceptance.
💡 Research Summary
Multimodal Attributed Graphs (MAGs) combine graph topology with heterogeneous node features such as text and images, enabling rich representations for e‑commerce, social media, and other domains. Existing multimodal graph neural networks (GNNs) follow a two‑stage pipeline: (i) project each modality into a shared embedding space, optionally fuse them, and (ii) perform message passing over the fixed explicit graph. This pipeline implicitly assumes high homophily across all modalities—i.e., that neighboring nodes are similar in every modality. In practice, however, edges are often formed because nodes are similar in one modality while being dissimilar in another. The authors term this phenomenon “structural‑semantic conflict.” Such conflicts inject modality‑specific noise during aggregation and degrade the learned representations, although heterophilous edges can occasionally supply beneficial complementary cues. Current heterophilous‑graph techniques are designed for unimodal settings and do not exploit cross‑modal information.
To address these challenges, the paper introduces OptiMAG, a plug‑and‑play regularization framework based on Unbalanced Optimal Transport (UOT). The core idea is to align each modality’s implicit semantic topology with the explicit graph topology, while allowing the model to reject edges that are severely misaligned. The method proceeds in three steps:
- Metric Space Construction – For each modality (text T, image I), the authors compute cosine‑based pairwise distance matrices C_T and C_I from pretrained encoders (BERT, ResNet). For the graph, they compute a Personalized PageRank (PPR) diffusion matrix, take its negative logarithm to obtain a distance matrix C_G, and normalize all three matrices by their means.
- Unbalanced Fused Gromov‑Wasserstein (UFGW) Alignment – They define a transport plan π between modality nodes and graph nodes. The cost consists of:
  - Linear anchor term: a matrix M with a small off‑diagonal penalty τ that encourages diagonal (node‑to‑itself) transport but permits mass to flow to neighbors when doing so reduces the overall cost.
  - Quadratic relational term: a Gromov‑Wasserstein loss |C̄_m(i,k) − C̄_G(j,l)|² that penalizes mismatches between intra‑modality distances and graph distances for paired nodes.
  - KL‑based marginal relaxation: soft constraints on the row/column sums of π, expressed as KL divergences weighted by ρ. This lets the plan shrink mass for nodes with high alignment cost, effectively rejecting noisy edges.
  - Entropic regularization ε·H(π) for differentiability and efficient Sinkhorn iterations.
The resulting objective (Eq. 12) balances anchoring, structural consistency, marginal flexibility, and smoothness. The KL term yields a soft‑thresholding behavior: well‑aligned nodes retain full mass, while highly conflicting nodes have their mass exponentially damped.
- Regularization Integration – The UFGW loss L_reg is added to the task‑specific loss L_task (node classification, link prediction, graph‑to‑text, graph‑to‑image, etc.). Gradients from L_reg update the modality encoders, gradually pulling the semantic spaces toward the graph geometry.
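The three steps above can be sketched end to end in NumPy. This is a minimal illustration under our own assumptions, not the authors' implementation: it works on a small dense toy graph, handles the quadratic GW term by the standard trick of linearizing it at the current plan, and solves the resulting unbalanced linear OT problem with entropic Sinkhorn scaling, where the exponent ρ/(ρ+ε) produces the soft‑thresholding of marginals described above. All function and variable names are ours.

```python
import numpy as np

def cosine_dist(X):
    # mean-normalized pairwise cosine distances between row features
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - Xn @ Xn.T
    return D / D.mean()

def ppr_dist(A, alpha=0.15):
    # graph distance: negative log of the Personalized PageRank diffusion
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)           # row-stochastic transitions
    ppr = alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * P)
    D = -np.log(ppr + 1e-12)
    return D / D.mean()

def unbalanced_sinkhorn(C, a, b, eps=0.05, rho=1.0, n_iter=300):
    # entropic OT with KL-relaxed marginals; the exponent rho/(rho+eps)
    # is what damps mass on high-cost (conflicting) nodes
    K = np.exp(-C / eps)
    fi = rho / (rho + eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    return u[:, None] * K * v[None, :]

def ufgw_reg(Cm, Cg, tau=0.1, eps=0.05, rho=1.0, n_outer=10):
    # alternate: linearize the quadratic GW term at the current plan,
    # then re-solve the resulting unbalanced linear OT problem
    n = Cm.shape[0]
    a = b = np.full(n, 1.0 / n)
    M = tau * (1.0 - np.eye(n))                    # off-diagonal anchor penalty

    def lin_gw(pi):                                # gradient of sum |Cm-Cg|^2 pi
        mu, nu = pi.sum(axis=1), pi.sum(axis=0)
        return ((Cm ** 2) @ mu)[:, None] + ((Cg ** 2) @ nu)[None, :] \
               - 2.0 * Cm @ pi @ Cg.T

    pi = np.outer(a, b)
    for _ in range(n_outer):
        pi = unbalanced_sinkhorn(M + lin_gw(pi), a, b, eps, rho)
    return pi, float((pi * (M + lin_gw(pi))).sum())

# toy run: 6-node ring graph with random 16-d "text" features
rng = np.random.default_rng(0)
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
pi, loss = ufgw_reg(cosine_dist(rng.normal(size=(n, 16))), ppr_dist(A))
print(pi.shape, loss >= 0.0)
```

In a full training setup, `loss` would play the role of L_reg and be backpropagated through the encoders that produced the modality features; here everything is detached NumPy for clarity.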
Computationally, the authors avoid the cubic cost of full GW by sampling sub‑graphs per training batch and pre‑computing the PPR matrix, then applying Sinkhorn iterations on the batch‑level cost matrices. This makes OptiMAG scalable to graphs with hundreds of thousands of nodes.
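A rough illustration of this batching scheme (our own sketch; the array names are hypothetical and the matrices are random stand-ins for the precomputed distances): per batch, one samples a node subset and slices the full cost matrices, so all Sinkhorn/GW work happens at batch scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, batch_size = 1000, 64

# precomputed once before training: full PPR-based graph distances and
# one modality's distances (random stand-ins here)
C_G_full = rng.random((n_nodes, n_nodes))
C_T_full = rng.random((n_nodes, n_nodes))

# per training batch: sample a sub-graph and slice both matrices, so the
# alignment runs on batch_size x batch_size costs instead of n x n
idx = rng.choice(n_nodes, size=batch_size, replace=False)
C_G_batch = C_G_full[np.ix_(idx, idx)]
C_T_batch = C_T_full[np.ix_(idx, idx)]

# the batch-level UFGW regularizer would be computed on these slices and
# combined with the task loss, e.g. total = task_loss + lam * reg_loss
print(C_G_batch.shape)  # (64, 64)
```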
Empirical Evaluation – Six MAG benchmarks covering e‑commerce (product catalogs), Reddit posts, and other domains are used. Multiple backbones (GAT, GraphSAGE, UniGraph2, etc.) are evaluated on:
- Graph‑centric tasks: node classification, link prediction, clustering.
- Multimodal‑centric generation: graph‑to‑text captioning, graph‑to‑image synthesis.
Across all settings, OptiMAG consistently improves performance. Notable gains include up to +4.6% absolute accuracy on node classification and +4 CIDEr points on captioning. The improvement is especially pronounced for large pretrained encoders (e.g., UniGraph2), indicating that the regularizer effectively steers powerful models toward finer‑grained graph adaptation.
Key Contributions
- Formal definition and empirical analysis of structural‑semantic conflict in MAGs.
- Introduction of an Unbalanced Fused Gromov‑Wasserstein regularizer that aligns modality semantics with graph structure while adaptively rejecting noisy edges.
- Demonstration of a plug‑in regularizer that works with diverse backbones and tasks, delivering consistent gains.
In summary, OptiMAG provides a theoretically grounded, computationally efficient, and practically versatile solution to the pervasive mismatch between multimodal semantic similarity and graph connectivity. By leveraging UOT, FGW, and KL‑based marginal relaxation, it enables GNNs to preserve beneficial cross‑modal complementarity while suppressing harmful noise, all without requiring explicit conflict labels or extensive architectural changes. This work opens a promising direction for robust multimodal graph representation learning.