Toward Effective Multimodal Graph Foundation Model: A Divide-and-Conquer Based Approach
Graph Foundation Models (GFMs) have achieved remarkable success in generalizing across diverse domains. However, they mainly focus on Text-Attributed Graphs (TAGs), leaving Multimodal-Attributed Graphs (MAGs) largely untapped. Developing Multimodal Graph Foundation Models (MGFMs) allows for leveraging the rich multimodal information in MAGs, and extends applicability to broader types of downstream tasks. While recent MGFMs integrate diverse modality information, our empirical investigation reveals two fundamental limitations of existing MGFMs: (1) they fail to explicitly model modality interaction, essential for capturing intricate cross-modal semantics beyond simple aggregation, and (2) they exhibit sub-optimal modality alignment, which is critical for bridging the significant semantic disparity between distinct modal spaces. To address these challenges, we propose PLANET (graPh topoLogy-aware modAlity iNteraction and alignmEnT), a novel framework employing a Divide-and-Conquer strategy to decouple modality interaction and alignment across distinct granularities. At the embedding granularity, (1) Embedding-wise Domain Gating (EDG) performs local semantic enrichment by adaptively infusing topology-aware cross-modal context, achieving modality interaction. At the node granularity, (2) Node-wise Discretization Retrieval (NDR) ensures global modality alignment by constructing a Discretized Semantic Representation Space (DSRS) to bridge modality gaps. Extensive experiments demonstrate that PLANET significantly outperforms state-of-the-art baselines across diverse graph-centric and multimodal generative tasks.
💡 Research Summary
The paper addresses a critical gap in graph foundation models (GFMs): while most GFMs excel on text‑attributed graphs (TAGs), they largely ignore multimodal‑attributed graphs (MAGs) that contain heterogeneous data such as text, images, audio, and more. Existing multimodal graph foundation models (MGFMs) like UniGraph2 integrate multiple modalities but suffer from two fundamental shortcomings. First, they do not explicitly model modality interaction, meaning cross‑modal semantics are reduced to simple aggregation and important mutual information is lost. Second, they provide only weak modality alignment, failing to bridge the large semantic disparity between the latent spaces of different modality encoders.
To overcome these issues, the authors propose PLANET (graph topology‑aware modality interaction and alignment), a novel MGFM that adopts a divide‑and‑conquer strategy across two granularities. At the embedding level, PLANET introduces Embedding‑wise Domain Gating (EDG). EDG first passes each modality’s raw embeddings through a Mixture‑of‑Experts (MoE) network, where each expert learns to capture a specific pattern of cross‑modal mutual information (e.g., textual description ↔ visual texture). A soft‑gating mechanism dynamically selects the most relevant experts for each node based on its neighborhood’s multimodal context. The resulting expert‑weighted signals are then fed into a Graph Transformer that performs topology‑aware attention, explicitly incorporating graph structure into the cross‑modal interaction. This design yields a locally enriched, topology‑conditioned multimodal embedding for every node, addressing the first limitation.
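The soft-gating step of EDG can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the gate vectors, the expert functions, and the use of a single context vector to stand in for the node's neighborhood multimodal context are all hypothetical simplifications; the paper additionally feeds the gated signals through a topology-aware Graph Transformer, which is omitted here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_gated_moe(embedding, context, experts, gate_vectors):
    """Combine expert outputs with soft gating.

    `context` stands in for the node's neighborhood multimodal summary
    (hypothetical); each gate score is its dot product with a learnable
    gate vector, normalized by softmax, and the experts' outputs are
    mixed with those weights.
    """
    weights = softmax([dot(context, g) for g in gate_vectors])
    dim = len(experts[0](embedding))
    out = [0.0] * dim
    for w, expert in zip(weights, experts):
        for i, v in enumerate(expert(embedding)):
            out[i] += w * v
    return out, weights

# Toy usage: two experts (identity and sign flip); the context strongly
# favors the first gate, so the mixture stays close to the identity output.
experts = [lambda x: x, lambda x: [-v for v in x]]
gates = [[5.0, 0.0], [-5.0, 0.0]]
mixed, w = soft_gated_moe([1.0, 2.0], [1.0, 0.0], experts, gates)
```

In the full model, each expert would be a learned network specializing in one cross-modal pattern, and the gate would be trained end-to-end rather than fixed.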
At the node level, PLANET adds Node‑wise Discretization Retrieval (NDR). NDR defines a Discretized Semantic Representation Space (DSRS) consisting of C learnable token vectors. After EDG, each modality‑specific node embedding is mapped to its nearest DSRS token via Euclidean distance. By forcing different modalities of the same node to share the same token, NDR creates a global semantic consensus that dramatically reduces inter‑modal gaps. Alignment is further reinforced through a combination of contrastive, reconstruction, and token‑level losses (CMR, SR, Dec).
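The core NDR lookup, mapping each modality-specific embedding to its nearest DSRS token, reduces to a nearest-neighbor search under squared Euclidean distance. The sketch below assumes a small explicit codebook of token vectors; the real DSRS tokens are learnable and trained jointly with the alignment losses.

```python
def nearest_token(embedding, codebook):
    """Return the index of the codebook token nearest to `embedding`
    under squared Euclidean distance (a sketch of the NDR lookup)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda c: sq_dist(embedding, codebook[c]))

# Toy usage: a hypothetical 3-token codebook. A node's text and image
# embeddings that land near the same token are considered aligned.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
text_emb = [0.9, 1.2]
image_emb = [1.1, 0.8]
```

Because both embeddings of a node are snapped to a shared discrete token, downstream components see a modality-agnostic index, which is what closes the gap between the encoders' latent spaces.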
The final node representation concatenates the original modality‑specific embedding with the cross‑modal embedding from EDG and the aligned token from NDR, producing a unified multimodal vector that can be fed to downstream heads. For graph‑centric tasks (node classification, link prediction), the concatenated vectors are used directly; for generative tasks (multimodal text‑image generation), the DSRS tokens serve as conditioning inputs, ensuring consistent cross‑modal generation.
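The assembly of the final representation is a plain concatenation of the three components described above. The sketch below uses hypothetical toy dimensions; the paper does not specify the actual sizes of each component.

```python
def final_representation(raw_emb, edg_emb, ndr_token):
    """Concatenate the three components of a node's representation:
    the raw modality-specific embedding, the cross-modal embedding
    from EDG, and the aligned DSRS token vector from NDR.
    All vectors are plain lists of floats (hypothetical dimensions)."""
    return list(raw_emb) + list(edg_emb) + list(ndr_token)

# Toy usage with made-up component sizes.
unified = final_representation([0.1, 0.2], [0.3], [0.4, 0.5])
```

A downstream classification or link-prediction head would then consume `unified` as its input feature vector.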
Extensive experiments on seven public MAG datasets covering four downstream tasks (node classification, link prediction, multimodal retrieval, multimodal generation) demonstrate that PLANET consistently outperforms state‑of‑the‑art baselines. Compared with UniGraph2, PLANET improves average accuracy by 4.3 percentage points and surpasses GraphGPT‑O by 5.1 points, with larger gains observed as the number of modalities increases. Ablation studies reveal that removing EDG drops performance by 2.7 points, while removing NDR causes a 3.4‑point decline, confirming the complementary nature of the two modules. Visualizations of DSRS tokens show coherent clustering of different modalities, evidencing effective alignment.
In summary, PLANET provides a principled solution to the two core challenges of MGFM design: (1) fine‑grained, topology‑aware modality interaction via EDG, and (2) robust, global modality alignment via discretized token retrieval in NDR. By decoupling these problems across granularities, PLANET achieves superior generalization on multimodal graph data, opening avenues for more powerful applications in bioinformatics, multimedia recommendation, knowledge graph reasoning, and beyond.