GraphFM: A generalist graph transformer that learns transferable representations across diverse domains


Graph neural networks (GNNs) are often trained on individual datasets, requiring specialized models and significant hyperparameter tuning due to the unique structures and features of each dataset. This approach limits the scalability and generalizability of GNNs, as models must be tailored for each specific graph type. To address these challenges, we introduce GraphFM, a scalable multi-graph pretraining approach designed for learning across diverse graph datasets. GraphFM uses a Perceiver-based encoder with learned latent tokens to compress domain-specific features into a shared latent space, enabling generalization across graph domains. We propose new techniques for scaling up graph training on datasets of different sizes, allowing us to train GraphFM on 152 distinct graph datasets, containing a total of 7.4 million nodes and 189 million edges. This allows us to study the effect of scale on pretraining across domains such as molecules, citation networks, and product graphs, and show that training on diverse datasets improves performance over single-source pretraining. Additionally, pretraining with a mixture of synthetic and real graphs enhances adaptability and stability, leading to competitive performance with state-of-the-art models across various node classification tasks. This approach reduces the burden of dataset-specific training and provides a single generalist model capable of performing across multiple diverse graph structures and tasks. Code is available at https://github.com/nerdslab/GraphFM.


💡 Research Summary

GraphFM introduces a generalist graph transformer that can be pretrained on a large, heterogeneous collection of graphs and then fine‑tuned with minimal effort on a wide range of downstream node‑level tasks. The core architecture is a Perceiver‑style encoder: a fixed set of learnable latent tokens (K = 512) attends to the tokenized node sequence of each graph via cross‑attention, compressing the entire graph into a compact latent representation. This design decouples computational cost from graph size (complexity O(K·N + L·K²) versus O(N²) for full self‑attention) and creates a shared “vocabulary” across domains.
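The latent-bottleneck idea can be sketched in a few lines. This is a minimal single-head illustration (no learned projections, no layer norm, hypothetical shapes), not the paper's implementation: K learned latent tokens attend over N node tokens, so the output size is fixed at K regardless of how large the graph is.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_cross_attention(latents, nodes):
    """One cross-attention step: K latent tokens attend to N node tokens.

    latents: (K, d) learned latent tokens, shared across all graphs
    nodes:   (N, d) tokenized node features of one graph
    Cost is O(K * N), versus O(N^2) for full node-to-node self-attention.
    """
    d = latents.shape[1]
    scores = latents @ nodes.T / np.sqrt(d)   # (K, N) attention scores
    weights = softmax(scores, axis=-1)        # each latent's distribution over nodes
    return weights @ nodes                    # (K, d) compressed graph representation

rng = np.random.default_rng(0)
K, N, d = 512, 10_000, 64
latents = rng.standard_normal((K, d))
nodes = rng.standard_normal((N, d))
out = latent_cross_attention(latents, nodes)
assert out.shape == (K, d)  # latent representation size is independent of N
```

Because the subsequent L self-attention layers operate only on the K latents, the per-graph cost after this step no longer depends on graph size.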

Node features are first projected by an MLP and enriched with positional encodings derived from Laplacian eigenvectors. Because eigenvectors are only defined up to sign (and basis, for repeated eigenvalues), they are processed through SignNet, yielding stable sign‑invariant positional embeddings. The node tokens from all graphs in a mini‑batch are then packed into a single variable‑length sequence with no padding; FlashAttention processes this packed batch directly, dramatically reducing memory overhead.
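SignNet's core trick is easy to demonstrate: apply the same small network φ to an eigenvector and its negation and sum the results, which makes the output invariant to the arbitrary sign of the eigenvector. The sketch below uses a hypothetical two-layer φ, not the paper's actual network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical weights of a tiny per-eigenvector MLP phi.
W1 = rng.standard_normal((1, 16))
W2 = rng.standard_normal((16, 8))

def phi(v):
    """v: (N,) one Laplacian eigenvector -> (N, 8) embedding."""
    h = np.tanh(v[:, None] @ W1)   # (N, 16)
    return h @ W2                  # (N, 8)

def signnet_embed(v):
    # SignNet's key identity: phi(v) + phi(-v) is unchanged if v flips sign.
    return phi(v) + phi(-v)

v = rng.standard_normal(100)
assert np.allclose(signnet_embed(v), signnet_embed(-v))
```

In the full model, the per-eigenvector embeddings are further combined across eigenvectors before being added to the node tokens; this sketch shows only the sign-invariance mechanism.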

After the cross‑attention stage, L layers of self‑attention refine the latent tokens. For downstream prediction, a multi‑task node decoder builds a short sequence for each target node consisting of (i) the node’s own token, (ii) a small set of T neighbor tokens sampled via random walks, and (iii) all K latent tokens. A shallow transformer of depth M processes this sequence, producing a node‑specific embedding that is finally passed through a per‑dataset linear head for classification or regression. This decoder leverages both local context (the sampled neighbors) and global context (the latent tokens) while keeping computational cost low, roughly N·M·(K+T+1)² across N target nodes.
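The per-node decoder input can be sketched as follows. This is an illustrative reconstruction under stated assumptions (a toy adjacency list, a naive random walk, hypothetical helper names), not the paper's sampling code:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_walk_neighbors(adj_list, start, T):
    """Collect T context nodes via a simple random walk (illustrative only)."""
    ctx, cur = [], start
    while len(ctx) < T:
        cur = int(rng.choice(adj_list[cur]))
        ctx.append(cur)
    return ctx

def build_decoder_sequence(node_tokens, latent_tokens, adj_list, node, T=4):
    """Sequence fed to the shallow decoder transformer for one target node:
    its own token, T walk-sampled neighbor tokens, and all K latent tokens."""
    ctx = random_walk_neighbors(adj_list, node, T)
    seq = np.concatenate([node_tokens[[node]],   # (1, d) the node itself
                          node_tokens[ctx],      # (T, d) local context
                          latent_tokens])        # (K, d) global context
    return seq                                   # (K + T + 1, d)

d, K, T = 8, 16, 4
node_tokens = rng.standard_normal((6, d))
latent_tokens = rng.standard_normal((K, d))
adj_list = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3, 5], 5: [4]}
seq = build_decoder_sequence(node_tokens, latent_tokens, adj_list, 0, T)
assert seq.shape == (K + T + 1, d)
```

Since each decoder sequence has only K+T+1 tokens, running the depth-M transformer over all N target nodes costs roughly N·M·(K+T+1)², matching the complexity quoted above.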

Training on 152 datasets (80 real‑world graphs from chemistry, citation networks, product co‑purchase graphs, etc., plus 72 synthetic graphs designed to cover low‑homophily regimes) yields a corpus of 7.4 M nodes and 189 M edges. To balance GPU memory across graphs of vastly different sizes, the authors propose the Distributed Snake Strategy Sampler (DistributedSSSampler), which sorts graphs by size and distributes them in a snake‑like pattern across GPUs, pairing large and small graphs in each mini‑batch. This achieves near‑100 % GPU utilization even on 8‑GPU clusters.
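The snake (boustrophedon) assignment is simple to sketch: sort graphs by size, then deal them out across GPUs left-to-right, right-to-left, alternating, so each GPU receives a mix of large and small graphs. This is an illustration of the idea behind DistributedSSSampler, not the paper's actual sampler code:

```python
def snake_assign(graph_sizes, num_gpus):
    """Assign graph indices to GPUs in a snake pattern over the
    size-sorted list, balancing total nodes per GPU."""
    order = sorted(range(len(graph_sizes)),
                   key=lambda i: graph_sizes[i], reverse=True)
    buckets = [[] for _ in range(num_gpus)]
    for row_start in range(0, len(order), num_gpus):
        row = order[row_start:row_start + num_gpus]
        if (row_start // num_gpus) % 2 == 1:
            row = row[::-1]          # reverse direction on odd rows
        for gpu, g in zip(range(num_gpus), row):
            buckets[gpu].append(g)
    return buckets

# 8 graphs of decreasing size across 4 GPUs: the snake pairs the
# largest graphs with the smallest, equalizing per-GPU load.
sizes = [100, 90, 80, 70, 60, 50, 40, 30]
buckets = snake_assign(sizes, 4)
totals = [sum(sizes[g] for g in b) for b in buckets]
assert totals == [130, 130, 130, 130]  # perfectly balanced in this toy case
```

With a naive in-order split, one GPU would receive the two largest graphs and another the two smallest; the snake ordering avoids exactly that imbalance.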

Scaling experiments show monotonic gains: increasing model parameters from 0.4 M to 75 M and pretraining tokens from 0.2 M to 7.3 M improves accuracy on unseen graphs by up to 2.1 percentage points. Adding synthetic and biological graphs further boosts performance, indicating that diverse structural patterns help the model learn transferable inductive biases. Fine‑tuning is evaluated in two regimes: lightweight MLP‑only fine‑tuning (MFT) converges within 10–20 gradient steps and reaches strong out‑of‑the‑box performance, while full node‑decoder fine‑tuning (NFT) matches or exceeds state‑of‑the‑art graph transformers on standard benchmarks.

A thorough sensitivity analysis reveals that GraphFM is far less sensitive to learning‑rate, weight‑decay, and depth choices than GCNs or NAGphormer, reducing the need for extensive hyper‑parameter search.

In summary, GraphFM makes three key contributions: (1) a scalable Perceiver‑based encoder that can ingest graphs of arbitrary size and topology, (2) empirical evidence that multi‑domain pretraining—including synthetic low‑homophily graphs—significantly improves generalization to unseen graphs, and (3) practical engineering solutions (FlashAttention‑based multi‑graph packing and DistributedSSSampler) that enable efficient large‑scale training. The work demonstrates that, much like language and vision models, graph AI can benefit from massive, diverse pretraining, paving the way toward truly universal graph representation learning.

