MM-OpenFGL: A Comprehensive Benchmark for Multimodal Federated Graph Learning

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Multimodal-attributed graphs (MMAGs) provide a unified framework for modeling complex relational data by integrating heterogeneous modalities with graph structures. While centralized learning has shown promising performance, MMAGs in real-world applications are often distributed across isolated platforms and cannot be shared due to privacy concerns or commercial constraints. Federated graph learning (FGL) offers a natural solution for collaborative training under such settings; however, existing studies largely focus on single-modality graphs and do not adequately address the challenges unique to multimodal federated graph learning (MMFGL). To bridge this gap, we present MM-OpenFGL, the first comprehensive benchmark that systematically formalizes the MMFGL paradigm and enables rigorous evaluation. MM-OpenFGL comprises 19 multimodal datasets spanning 7 application domains, 8 simulation strategies capturing modality and topology variations, 6 downstream tasks, and 57 state-of-the-art methods implemented through a modular API. Extensive experiments investigate MMFGL from the perspectives of necessity, effectiveness, robustness, and efficiency, offering valuable insights for future research on MMFGL.


💡 Research Summary

The paper introduces MM‑OpenFGL, the first comprehensive benchmark dedicated to Multimodal Federated Graph Learning (MMFGL). Recognizing that multimodal‑attributed graphs (MMAGs) are increasingly prevalent yet often distributed across isolated platforms due to privacy or commercial constraints, the authors argue that traditional centralized graph learning is insufficient. Existing federated graph learning (FGL) research, however, focuses almost exclusively on single‑modality graphs and does not address the unique challenges that arise when heterogeneous modalities (e.g., text and images) coexist with graph structures in a federated setting.

To fill this gap, the authors formalize the MMFGL problem and construct a benchmark that spans four major dimensions: data, simulation, algorithms, and evaluation. The data component aggregates 19 real‑world multimodal graph datasets covering seven domains (e‑commerce, recommendation, social media, medical imaging, etc.). Unlike prior benchmarks that provide only pre‑processed features, MM‑OpenFGL supplies raw text and image data aligned with graph edges, enabling researchers to experiment with various multimodal encoders. The benchmark includes a diverse encoder pool ranging from lightweight models such as Llama‑3.2‑1B for text to larger models such as Qwen2‑7B‑Instruct and CLIP‑ViT‑Large, allowing systematic study of the trade‑off between feature quality and federated communication overhead.

Simulation strategies are organized along three orthogonal heterogeneity axes—modality, topology, and label—each with IID and Non‑IID variants. By combining these axes, eight distinct federated scenarios are generated (e.g., Modality‑NonIID with missing modalities per client, Topology‑Unavailable where edges are removed and reconstructed via SBM or RDPG, and Label‑NonIID where class distributions are skewed using Louvain or Metis partitions). This tri‑dimensional design enables stress‑testing of algorithms under realistic fragmentation, structural concealment, and class imbalance.
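The Label‑NonIID scenario can be made concrete with a small partitioning sketch. The paper derives client splits from Louvain or Metis community partitions; the Dirichlet‑based split below is a widely used stand‑in that produces the same kind of skewed per‑client class distribution. The function name and signature are illustrative, not part of the benchmark's API:

```python
import random
from collections import defaultdict

def dirichlet_label_partition(labels, num_clients, alpha=0.5, seed=0):
    """Split node indices across clients with Dirichlet-skewed class
    proportions. Smaller alpha -> stronger label Non-IID skew."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    clients = [[] for _ in range(num_clients)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Dirichlet(alpha) proportions via normalized Gamma draws.
        raw = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(raw)
        props = [r / total for r in raw]
        start = 0
        for c in range(num_clients):
            if c == num_clients - 1:
                take = len(idxs) - start   # last client absorbs the remainder
            else:
                take = min(int(props[c] * len(idxs)), len(idxs) - start)
            clients[c].extend(idxs[start:start + take])
            start += take
    return clients
```

With `alpha=100` the split is near-IID; with `alpha=0.1` some clients see almost no examples of certain classes, which is exactly the fragmentation the Label‑NonIID scenario stress-tests.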

Two learning pipelines are supported. The End‑to‑End pipeline follows the classic FL loop: the server broadcasts a global model, clients perform local updates on their multimodal graphs, and the server aggregates the updates (e.g., via FedAvg). The Two‑Stage pipeline mirrors modern foundation‑model workflows: clients collaboratively pre‑train a graph encoder using self‑supervised objectives (link prediction, masked feature reconstruction, cross‑modal contrastive learning), after which the global encoder is fine‑tuned locally for downstream tasks without further communication.
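The server-side aggregation step of the End‑to‑End pipeline can be sketched in a few lines. This is a generic FedAvg sketch over parameter dictionaries, not MM‑OpenFGL's actual implementation; parameters here are plain floats for clarity, where a real system would average tensors:

```python
def fedavg(client_params, client_sizes):
    """Size-weighted average of client parameter dicts (FedAvg).

    client_params: list of {name: value} dicts, one per client.
    client_sizes:  number of local training samples per client.
    """
    total = sum(client_sizes)
    global_params = {}
    for name in client_params[0]:
        global_params[name] = sum(
            params[name] * (n / total)
            for params, n in zip(client_params, client_sizes)
        )
    return global_params
```

One federated round then consists of broadcasting `global_params`, letting each client update its local copy, and calling `fedavg` on the returned dicts; the Two‑Stage pipeline runs the same loop during pre‑training only.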

Algorithmically, MM‑OpenFGL integrates 57 state‑of‑the‑art methods, categorized into four families: (1) MM‑GNN – multimodal‑specific GNNs such as MM‑GCN, MGA‑T, GraphMAE2 that serve as strong local baselines; (2) Standard FL – classic optimizers (FedAvg, FedProx, SCAFFOLD) and graph‑tailored variants (FedSPA, FedIIH); (3) Heterogeneous FL – approaches that tolerate client‑side model heterogeneity (MH‑pFLID, PEPSY, FedMVP), crucial for Modality‑NonIID settings; and (4) Graph Foundation Models – large‑scale pretrained models (GFT, OFA, GraphCLIP) evaluated in the Two‑Stage pipeline. The benchmark’s modular API allows seamless swapping of encoders, GNN backbones, and FL optimizers.
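The modular-API idea — seamless swapping of encoders, backbones, and optimizers — can be illustrated with a toy registry pattern. Every name below (`ENCODERS`, `register`, `build_encoder`, `SmallTextEncoder`) is hypothetical and not the benchmark's real interface:

```python
# Hypothetical component registry, in the spirit of a plug-and-play API.
ENCODERS = {}

def register(table, name):
    """Decorator that records a component class under a string key."""
    def deco(cls):
        table[name] = cls
        return cls
    return deco

@register(ENCODERS, "text-small")
class SmallTextEncoder:
    def encode(self, raw_texts):
        # Toy feature: text length stands in for a real embedding.
        return [float(len(t)) for t in raw_texts]

def build_encoder(name="text-small"):
    """Look up and instantiate a registered encoder by name."""
    return ENCODERS[name]()
```

Analogous tables for GNN backbones and FL optimizers would let an experiment be specified as three strings, which is what makes sweeping 57 methods across 8 scenarios tractable.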

Evaluation is conducted across four perspectives. Data analysis quantifies client‑wise feature distribution shifts (KL divergence), label‑topology correlation (multi‑level homophily), and structural statistics (degree, centrality). Effectiveness measures standard task metrics: node classification accuracy/F1, link prediction AUC, modality‑matching AP/AUC, retrieval Recall@K and MRR, and generation quality (BLEU, ROUGE‑L, CIDEr). Robustness tests five perturbations—noise, sparsity, homophily variation, cross‑scenario generalization, and differential privacy enforcement. Efficiency assesses convergence speed, communication volume, FLOPs, and runtime/memory footprints.

Key empirical insights include: (1) Naïve FedAvg degrades performance under Modality‑NonIID; cross‑modal contrastive objectives are essential to reconcile missing modalities. (2) Topology‑Unavailable scenarios benefit from synthetic edge reconstruction, yet incur higher communication costs. (3) Label‑NonIID heterogeneity is mitigated by regularization‑based FL methods (FedProx) and label‑aware weighted aggregation, yielding 3–5 % accuracy gains. (4) Large multimodal encoders achieve the best absolute performance, but lightweight encoder‑GNN combos dramatically reduce FLOPs and bandwidth with only ≤2 % accuracy loss, highlighting practical trade‑offs. (5) Differential privacy can be applied with modest utility loss if privacy budgets are tuned per heterogeneity dimension.

In sum, MM‑OpenFGL offers a unified, extensible platform for rigorous MMFGL research, bridging the methodological gap between multimodal representation learning and federated optimization. It supplies the community with realistic datasets, systematic simulation of heterogeneity, a rich algorithmic zoo, and multi‑faceted evaluation protocols. The authors release all code, data, and tutorials publicly, lowering the entry barrier for future work. They anticipate that subsequent research will explore modality‑synchronization protocols, topology‑preserving privacy mechanisms, and scalable training of graph‑LLM foundations in federated environments.

