Geometric Structural Knowledge Graph Foundation Model

Reading time: 28 minutes

📝 Original Info

  • Title: Geometric Structural Knowledge Graph Foundation Model
  • ArXiv ID: 2512.22931
  • Date: 2025-12-28
  • Authors: Ling Xin, Mojtaba Nayyeri, Zahra Makki Nayeri, Steffen Staab

📝 Abstract

Structural knowledge graph foundation models aim to generalize reasoning to completely new graphs with unseen entities and relations. A key limitation of existing approaches like ULTRA is their reliance on a single relational transformation (e.g., element-wise multiplication) in message passing, which can constrain expressiveness and fail to capture the diverse relational and structural patterns exhibited across graphs. In this paper, we propose GAMMA, a novel foundation model that introduces multi-head geometric attention to knowledge graph reasoning. GAMMA replaces the single relational transformation with multiple parallel ones, including real, complex, split-complex, and dual number based transformations, each designed to model different relational structures. A relation-conditioned attention fusion mechanism then adaptively fuses them at the link level via lightweight gating with entropy regularization, allowing the model to robustly emphasize the most appropriate relational bias for each triple pattern. We present a full formalization of these algebraic message functions and discuss how their combination increases expressiveness beyond any single space. Comprehensive experiments on 56 diverse knowledge graphs demonstrate that GAMMA consistently outperforms ULTRA in zero-shot inductive link prediction, with a 5.5% improvement in mean reciprocal rank on the inductive benchmarks and a 4.4% improvement across all benchmarks, highlighting the benefits of complementary geometric representations.

📄 Full Content

Knowledge Graphs (KGs) store factual information as triples (HEAD ENTITY, RELATION NAME, TAIL ENTITY), e.g., (PARIS, ISCAPITALOF, FRANCE) [1]. Reasoning over KGs (such as predicting a missing link) has been a longstanding challenge in AI [2], [3]. Recent structural knowledge graph foundation models (Structural KGFMs) [4], [5] seek to overcome the limitations of traditional transductive embedding methods [3], [6] by enabling fully inductive generalization to unseen entities and relations. The core idea is to learn transferable structural patterns instead of memorizing entities and relations. Structural KGFMs build universal relational representations by constructing a relation graph (a graph where nodes are relations) [4], [7] and applying message passing, thereby obtaining representations for new relations without any node or textual features.

While such foundation models have made important progress, they often inherit a key architectural limitation: the use of a single fixed relation transformation in the message-passing or scoring function. An element-wise multiplication (DistMult-style bilinear transform [4], [8]) is used to combine an entity with a relation when propagating messages. Relying on a single, uniform transformation constrains the model's capacity to represent the diverse relational patterns essential for structural knowledge graph foundation models trained across multiple heterogeneous graphs. For instance, relation-specific element-wise multiplication in message passing has an antisymmetric nature and cannot properly model symmetric relations, nor can it adequately represent hierarchical orderings. Relying on a single algebraic geometric transformation means the model is biased toward a particular class of relational structure, potentially leading to suboptimal generalization [9].

Geometric knowledge graph embeddings research [9], [10] has shown that different geometric spaces offer complementary strengths. Complex number embeddings (as in ComplEx and RotatE) can model symmetry, anti-symmetry, and cyclic composition via rotations in the complex plane [11]. Hyperbolic or split-complex representations can naturally capture partial orders and hierarchical relations due to their ability to represent infinite or unbounded distances. Dual-number embeddings introduce translational components (via nilpotent ϵ terms) that can model one-to-many relations or additive offsets, while also enabling non-commutative compositions [10], [12]. Each of these algebraic families (in real, complex, split-complex, and dual spaces) provides a unique bias: no single space is optimal for all relation types. This raises a crucial question:

Question: Can we combine multiple geometric transformations to create a more powerful, universally generalizing Structural KGFM trained on multiple KGs with diverse relational patterns?

In this paper, we answer this question by proposing GAMMA (Geometric Attention Multi-Message Aggregation), a novel structural knowledge graph foundation model that fuses multiple relational message-passing mechanisms. Instead of using one relation transform across the board, GAMMA leverages multi-head message functions in parallel (Figure 1), where each head operates in a different geometric space (real (flat), complex (sphere), split-complex (hyperbola), or dual (Galilean circle)). These heads produce diverse candidate messages for each triple, which are then aggregated by an attention module that learns to suitably combine them for the query at hand. Intuitively, GAMMA works as a stable and compact integrator to form complementary but not overly divergent features, which enables more expressive relational representations that lead to better generalization than any single-space message passing alone. We provide a full formalization of each message function and the attention-based fusion, and we derive insights into why this multi-head approach is more expressive. In particular, we show that the combined representation can model important relational properties like symmetry, antisymmetry, and composition more effectively than any single message function. From an architecture perspective, we carefully design GAMMA's fusion strategy to maintain separate "expert" branches for each message type through the layers, only merging their outputs before the final prediction stage, a choice that we empirically validate via ablation studies (demonstrating its superiority over naive layer-level fusion).

Empirically, we pre-train GAMMA on the three source graphs (FB15k-237, WN18RR, and CoDEx-M [13]-[15]) and evaluate zero-shot on 53 unseen target graphs covering transductive, inductive entity, and inductive entity-relation scenarios. GAMMA consistently outperforms the ULTRA [4] baseline (which uses only real element-wise multiplication message passing (DistMult-style bilinear transform)) in the link prediction task, with particularly large gains on the challenging inductive benchmarks. Notably, we show that these improvements do not rely completely on increasing the model's depth or width (i.e., increasing the model capacity); rather, the performance boost comes from learning complementary relational biases. For example, on average GAMMA improves MRR by 4.4% and Hits@10 by 2.5% across all 53 graphs compared to ULTRA, with up to 7.0% MRR on Inductive (e) sets, while maintaining the same results on transductive sets (no loss of performance on traditional tasks).

Our contributions are summarized as follows:

• Multi-Head Geometric Message Passing: We introduce a new KG reasoning architecture that parallelizes multiple algebraic message-passing heads (real, complex, split-complex, dual), greatly enhancing expressiveness. To our knowledge, GAMMA is the first foundation model to incorporate a mixture of geometric transformations for relational reasoning.

• Mathematical Formalism and Insight: We provide a unified formal description of each algebraic head and prove how the combination can represent a strictly broader class of relational patterns than single-head models. We also discuss examples of relational structures for which GAMMA has a potentially stronger modeling capability from a theoretical perspective (e.g., one-to-many mappings with hierarchy and simultaneity), which are difficult for single-space models like DistMult or even ComplEx alone.

• Performance Improvement: Experimentally, GAMMA outperforms ULTRA (the previous best foundation model) on 53 evaluation KGs on average, with particularly strong gains in the difficult inductive scenario. Through rigorous ablations, we show the necessity and effectiveness of stable multi-head attention gating: removing any component of the module degrades the model's capability, and simply increasing ULTRA's hidden dimension and feed-forward network width to match GAMMA's parameter count does not replicate our gains. We also find that the complex + split-complex and complex + DistMult combinations yield the best synergy among message types, aligning with our message-function complementarity hypothesis.

• Robustness and Generality: GAMMA's robust attention gating allows it to adapt to a variety of graphs without retraining and yield consistent improvements without explicit branch selection. Unlike strong routing, we apply regularization to prevent gate collapse, forming a stable and compact integrator: weakening the regularization causes the attention to shift to single-branch dominance and leads to performance degradation.

The remainder of this paper is structured as follows: Section II reviews existing literature on knowledge graph embeddings, geometric message passing, and foundation models, positioning our work within the current research landscape. Section III provides essential background and formal definitions relevant to knowledge graphs and the geometric algebra utilized in our model. Section IV details the proposed GAMMA model, elucidating its novel multi-head geometric message-passing architecture and attention-based fusion mechanism. Section V presents our comprehensive experimental evaluation, including dataset descriptions, baseline comparisons, performance results on zero-shot link prediction, and ablation studies. Finally, Section VI summarizes our findings, discusses the implications of GAMMA, and outlines promising directions for future work.

Traditional KG embedding models [2], [3] operate mainly in transductive and semi-inductive settings, where all relations and all (or at least most) entities are known during training, and the task is to infer missing links among them. Early translation models like TransE [16] are scalable but weak for one-to-many relations, while bilinear or complex embeddings such as DistMult [17] and ComplEx [18] improve expressiveness. Later designs like ConvE [14] enhance feature interaction with convolution, and GNNs like RGCN [19] and CompGCN [20] capture structural context through message passing yet remain tied to fixed embeddings. To address these limits, semi-inductive methods like GraIL [21] employ enclosing subgraphs to generalize to unseen entities, and NBFNET [22] integrates path reasoning with GNNs for stronger interpretability. A*Net [23] further improves efficiency, while RED-GNN [24] utilizes relational digraphs for richer structure. AdaProp [25] adaptively samples paths to reduce noise, and NodePiece [26] tokenizes entities via anchors, cutting dependence on large embedding tables.

However, semi-inductive methods still rely on fixed relation vocabularies, limiting their generalization. This motivates the development of structural KG foundation models, which move to a fully inductive setting where models generalize to unseen entities and relations across new graphs. Rather than memorizing embeddings, they exploit structural patterns, relation graphs, prompts, or motifs. INGRAM [27] pioneered the study of fully inductive reasoning by building weighted relation graphs for unseen entities and relations, while RMPI [28] broadened this direction through local subgraph extraction at significant computational expense. ULTRA [4] advanced the paradigm by learning universal structural motifs, inspiring successors such as TRIX [29] and MOTIF [30] to pursue more expressive motif-based reasoning, and GraphOracle [31] introduced relation-dependency graphs to enhance cross-relation generalization. At the same time, ISDEA [32] and MTDEA [33] established double equivariant formulations to guarantee invariance across nodes and relations, and KG-ICL [34] later demonstrated that in-context learning with subgraph prompts could scale inductive reasoning across diverse knowledge graphs.

Geometric relational transformation-based methods frame knowledge graph reasoning as learning geometric operations in embedding space. Translation-based methods like TransE [16] view relations as vector shifts, offering scalability but limited expressiveness. Rotation-based approaches like RotatE [11] extend this idea into the complex plane, enabling modeling of symmetry, anti-symmetry, and relation composition. Hypercomplex embeddings like ComplEx [18] and QuatE [35] enrich representation power through complex and quaternion spaces, respectively. Non-Euclidean methods such as MuRP [36] exploit hyperbolic geometry to capture hierarchical patterns, while region-based approaches like BoxE [37] encode relations as hyper-rectangles that naturally capture inclusion and intersection. Compared to entity-focused or inductive foundation models, geometric methods emphasize algebraic and spatial transformations, offering strong expressiveness in capturing relation semantics but often struggling with full inductive generalization. These methods provide inductive biases for modeling relational patterns, which structural KG foundation models build upon to achieve fully inductive generalization beyond fixed embeddings.

This section introduces the fundamental concepts and notation necessary for defining our proposed model, GAMMA, focusing on the structure of the knowledge graph foundation model and the task of inductive link prediction.

A Knowledge Graph (KG) is formally defined as a triple

G = (E, R, T),

where E is the set of entities (nodes), R is the set of relations (edge types), and T ⊆ E × R × E is the set of factual triples. A triple (h, r, t) ∈ T represents a fact, where h ∈ E is the head entity, t ∈ E is the tail entity, and r ∈ R is the relation connecting them.

The primary task addressed in this work is Inductive Link Prediction. Unlike the transductive setting, where all entities are known during training, the inductive setting requires the model to generalize knowledge learned from a source graph G_src = (E_src, R_src, T_src) to an entirely novel, unseen target graph G_tgt = (E_tgt, R_tgt, T_tgt). Specifically, G_src and G_tgt are disjoint in terms of their entity and relation sets:

E_src ∩ E_tgt = ∅ and R_src ∩ R_tgt = ∅.

The model, trained on G_src, must predict missing links (e.g., (h, r, ?) or (?, r, t)) within the new G_tgt. This scenario is crucial for assessing the transferability and universal generalization ability of foundation models.

Following the formulation in [11], the formal definitions of several relational patterns used in knowledge graph analysis, including symmetry, anti-symmetry, inversion, and composition, are as follows:

• Symmetry. A relation r is symmetric if (h, r, t) ∈ T implies (t, r, h) ∈ T. That is, whenever a triple holds in one direction, the reversed triple must also appear in the graph.

• Anti-symmetry. A relation r is anti-symmetric if (h, r, t) ∈ T implies (t, r, h) ∉ T. In other words, the validity of a triple precludes the existence of its reverse counterpart.

• Inversion. A relation r′ is an inverse of r if (h, r, t) ∈ T implies (t, r′, h) ∈ T. Moreover, if there exists some r′ ∈ R with r′ ≠ r such that r′ satisfies the above inverse property with respect to r, then r is classified as an inverse relation.

• Composition. A relation r is a composition of r_1 and r_2 if (h, r_1, m) ∈ T and (m, r_2, t) ∈ T together imply (h, r, t) ∈ T. That is, if a path of length two through r_1 and r_2 exists, then a direct triple following relation r must also be present. Whenever this condition holds, we refer to r as a composite relation.
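To make these definitions concrete, the following sketch checks them over a hypothetical toy triple set; the entity and relation names are illustrative only, not taken from any benchmark.

```python
# Minimal sketch: checking relational patterns over a toy triple set.
triples = {
    ("alice", "marriedTo", "bob"), ("bob", "marriedTo", "alice"),        # symmetric
    ("paris", "capitalOf", "france"),                                     # anti-symmetric
    ("alice", "motherOf", "carol"), ("carol", "motherOf", "dave"),
    ("alice", "grandmotherOf", "dave"),                                   # composition
}

def is_symmetric(r, T):
    """(h, r, t) in T implies (t, r, h) in T."""
    return all((t, r, h) in T for (h, rel, t) in T if rel == r)

def is_antisymmetric(r, T):
    """(h, r, t) in T precludes (t, r, h) in T (for h != t)."""
    return all((t, r, h) not in T for (h, rel, t) in T if rel == r and h != t)

def is_composition(r, r1, r2, T):
    """Every path h -r1-> m -r2-> t is matched by a direct triple (h, r, t)."""
    paths = {(h, t) for (h, a, m) in T if a == r1
                    for (m2, b, t) in T if b == r2 and m2 == m}
    return all((h, r, t) in T for (h, t) in paths)

print(is_symmetric("marriedTo", triples))                                 # True
print(is_antisymmetric("capitalOf", triples))                             # True
print(is_composition("grandmotherOf", "motherOf", "motherOf", triples))   # True
```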

Modern structural KG foundation models often utilize a message-passing paradigm to aggregate information from local neighborhoods and compute updated entity representations. In this context, the information passed from a head entity h to a tail entity t through relation r is often modeled via a relation-specific transformation function f_r(h) applied to the head representation.

A large class of effective models, including the baseline ULTRA, relies on a single algebraic geometric transformation to define f_r(h). Common examples of these single geometric operations include:

• Translation: f_r(h) = h + r, which models relations as simple translation vectors in R^d (e.g., TransE [16]). This is primarily effective for modeling compositional and hierarchical patterns.

• Rotation: f_r(h) = h • r, where • represents element-wise multiplication in the complex or quaternion space, thus modeling a rotation in the embedding space (e.g., RotatE [11], QuatE [35]). This is effective for modeling cyclic and asymmetric relations.

• Reflection/Projection: This class involves operations like matrix multiplication, f_r(h) = M_r h, or specific parameter-sharing mechanisms that effectively model reflections or projections in the embedding space. This is essential for capturing symmetric and inversion properties (e.g., RESCAL [38], SimplE [39]).

While these single-geometric approaches are expressive for specific relational properties, they introduce inherent biases that limit their ability to universally capture diverse relational structures, which is the primary challenge that GAMMA aims to overcome.
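For concreteness, here is a minimal PyTorch sketch of the three classes of single relation transformations f_r(h) listed above; the dimensions and tensors are arbitrary placeholders, and the DistMult-style product used by ULTRA belongs to the multiplicative family.

```python
import torch

d = 8
h = torch.randn(d)       # head entity embedding
r_vec = torch.randn(d)   # relation embedding (offset / elementwise factor)
M_r = torch.randn(d, d)  # relation matrix (RESCAL-style)

# Translation (TransE-style): relation as a vector offset, f_r(h) = h + r.
f_translate = h + r_vec

# Elementwise multiplication (DistMult-style), f_r(h) = h * r; the complex-space
# analogue of this multiplicative family is a rotation (RotatE-style).
f_multiply = h * r_vec

# Linear map / projection (RESCAL-style), f_r(h) = M_r h.
f_project = M_r @ h

print(f_translate.shape, f_multiply.shape, f_project.shape)
```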

This section presents a formal description of the proposed Geometric Attention MultiMessage Aggregation (GAMMA) model. As illustrated in Figure 1, at a high level, GAMMA extends the Neural Bellman-Ford network (NBFNET) framework [4], [22] by (i) learning a representation for every relation by propagating signals over a relation graph and (ii) employing several algebraically distinct message functions to propagate information over the entity graph. A trainable attention module then fuses the outputs of these message functions to produce a single representation conditioned on the query.

Let G = (E, R, T) be an input knowledge graph. We define an auxiliary relation graph

G_r = (R, E_r),

whose node set coincides with the relation set R. The edge set E_r captures how relations co-occur in the knowledge graph.

In order to distinguish different types of co-occurrence, four directed edge types are introduced as follows:

• Head-to-Head (H2H). There is a directed edge r_i --H2H--> r_j in E_r if there exist entities h ∈ E and t_1, t_2 ∈ E such that both (h, r_i, t_1) and (h, r_j, t_2) belong to T. This edge type indicates that r_i and r_j share a head entity.

• Head-to-Tail (H2T). There is an edge r_i --H2T--> r_j if there exist triples (h, r_i, m) and (m, r_j, t) in T. In other words, the tail entity of r_i coincides with the head entity of r_j, so that r_j may follow r_i along a path from a head entity to a tail entity.

• Tail-to-Head (T2H). An edge r_i --T2H--> r_j is introduced if there exist triples (m, r_i, h) and (t, r_j, m) in T. This type links relations whose heads and tails are connected in the reverse order of H2T.

• Tail-to-Tail (T2T). Finally, an edge r_i --T2T--> r_j is added when there exist triples (h_1, r_i, t) and (h_2, r_j, t) in T. In this case r_i and r_j share a tail entity.

Each edge η = (r_i, τ, r_j) ∈ E_r therefore has a type τ ∈ {H2H, H2T, T2H, T2T}.
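These four edge types can be extracted directly from the triple set by indexing relations by the entities they touch. The sketch below (plain Python; the function name and toy triples are our own) builds the typed edge set E_r in the way described above.

```python
from collections import defaultdict
from itertools import product

def build_relation_graph(triples):
    """Return the typed edges (r_i, tau, r_j) of the relation graph G_r.

    triples: iterable of (head, relation, tail).
    Edge types follow the H2H / H2T / T2H / T2T construction described above.
    """
    by_head, by_tail = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        by_head[h].add(r)   # relations having h as their head entity
        by_tail[t].add(r)   # relations having t as their tail entity

    edges = set()
    for entity in set(by_head) | set(by_tail):
        heads, tails = by_head[entity], by_tail[entity]
        edges |= {(ri, "H2H", rj) for ri, rj in product(heads, heads)}  # shared head entity
        edges |= {(ri, "T2T", rj) for ri, rj in product(tails, tails)}  # shared tail entity
        edges |= {(ri, "H2T", rj) for ri, rj in product(tails, heads)}  # tail of r_i = head of r_j
        edges |= {(ri, "T2H", rj) for ri, rj in product(heads, tails)}  # reverse order of H2T
    return edges

toy = [("paris", "capitalOf", "france"), ("france", "locatedIn", "europe")]
print(sorted(build_relation_graph(toy)))
```

This naive version also produces trivial self-pairs (r_i, τ, r_i); whether those are kept or filtered is a detail we leave open here.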

To obtain a vector representation for every relation r ∈ R, we apply NBFNET [4], [22] to the relation graph G_r. We denote by h_r^{(t)} the vector representation of relation r at iteration t. We initialize these representations with 1^d for the query relation r_q and with 0^d for all other relations. For each iteration t ≥ 1, the update rule reads

h_r^{(t)} = AGG( { MSG(h_{r'}^{(t-1)}, e_τ) : (r', τ, r) ∈ E_r } ),

where MSG is a neural message function and AGG is a permutation-invariant aggregator (e.g., a learnable sum or mean). Each edge type τ is associated with a trainable type embedding e_τ ∈ R^d. After L iterations, we set the relation representation r = h_r^{(L)}. These relation embeddings are subsequently used to modulate the message passing over the entity graph.
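A minimal sketch of this propagation, under the simplifying assumptions that MSG is an elementwise product with the edge-type embedding and AGG is a plain sum (both are learned modules in the actual model):

```python
import torch

def relation_representations(num_rel, rel_edges, query_rel, d=16, L=2):
    """Embed every relation by propagating over the relation graph G_r.

    rel_edges: list of typed edges (r_i, tau, r_j) with tau in {"H2H","H2T","T2H","T2T"}.
    MSG is simplified to an elementwise product with the edge-type embedding,
    and AGG to a sum.
    """
    e_tau = {tau: torch.randn(d) for tau in ("H2H", "H2T", "T2H", "T2T")}  # type embeddings
    h = torch.zeros(num_rel, d)
    h[query_rel] = 1.0                      # boundary condition: 1^d for r_q, 0^d otherwise
    for _ in range(L):
        new_h = torch.zeros_like(h)
        for ri, tau, rj in rel_edges:       # message flows from r_i to r_j
            new_h[rj] += h[ri] * e_tau[tau]
        h = new_h
    return h                                # row r is the representation of relation r

edges = [(0, "H2H", 1), (1, "H2T", 2), (2, "T2T", 0)]
print(relation_representations(3, edges, query_rel=0).shape)  # torch.Size([3, 16])
```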

Given a query triple q = (h, r q , ?) consisting of a head entity h and a query relation r q , GAMMA computes a representation of each candidate tail entity e conditioned on q. We achieve this by running several algebraically distinct message passing processes on the entity graph and then aggregating their outputs with attention.

a) Generalized relation transformations: Let θ denote the parameters of a particular message branch. Each branch defines a relation-specific transformation

f_r^{(k)} : K^d → K^d,

where K is an algebraic number system (e.g., the reals, complex numbers, split-complex numbers, or dual numbers). Given an entity embedding x ∈ K^d and a relation embedding r ∈ K^d, the message f_r^{(k)}(x) is computed by a fixed algebraic operation in K. For the branches considered in this work, we obtain the following specific forms:

• Real (DistMult) branch. Working over K = R, the relation transformation is defined elementwise as

f_r(x) = x ⊙ r,

where ⊙ denotes the Hadamard (elementwise) product. This recovers the bilinear DistMult operator.

• Complex branch. Over the field of complex numbers K = C, each vector is represented by its real and imaginary parts, x = (x_re, x_im) and r = (r_re, r_im). Complex multiplication yields

f_r(x) = ( x_re ⊙ r_re − x_im ⊙ r_im,  x_re ⊙ r_im + x_im ⊙ r_re ).

This branch, therefore, implements rotations in the complex plane and is well suited to modelling symmetric and anti-symmetric relations.

• Split-complex branch. The split-complex numbers introduce an imaginary unit j satisfying j² = +1. For vectors x = (x_re, x_im) and r = (r_re, r_im) we define the split-complex multiplication as

f_r(x) = ( x_re ⊙ r_re + x_im ⊙ r_im,  x_re ⊙ r_im + x_im ⊙ r_re ).

Because j² = +1, this operator can model hyperbolic rotations and thus enhances the ability to capture hierarchical and partial-order patterns.

• Dual branch. Dual numbers take the form a + εb with ε² = 0. Writing x = (x_re, x_im) and r = (r_re, r_im), multiplication is given by

f_r(x) = ( x_re ⊙ r_re,  x_re ⊙ r_im + x_im ⊙ r_re ).

The nilpotent nature of ε allows this branch to encode translational offsets and one-to-many relations.
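The four products differ only in how the two halves of each embedding interact. A compact PyTorch sketch, with x and r split into (x_re, x_im) and (r_re, r_im) as above; the real branch keeps a dummy imaginary half only to share the same signature, which is our choice for illustration, not the paper's.

```python
import torch

def real_branch(x_re, x_im, r_re, r_im):
    # DistMult: plain Hadamard product on the real part (imaginary half unused).
    return x_re * r_re, torch.zeros_like(x_im)

def complex_branch(x_re, x_im, r_re, r_im):
    # Complex multiplication, i^2 = -1: rotation in the complex plane.
    return x_re * r_re - x_im * r_im, x_re * r_im + x_im * r_re

def split_complex_branch(x_re, x_im, r_re, r_im):
    # Split-complex multiplication, j^2 = +1: hyperbolic rotation.
    return x_re * r_re + x_im * r_im, x_re * r_im + x_im * r_re

def dual_branch(x_re, x_im, r_re, r_im):
    # Dual-number multiplication, eps^2 = 0: nilpotent, translation-like offset.
    return x_re * r_re, x_re * r_im + x_im * r_re

d = 4
x_re, x_im, r_re, r_im = (torch.randn(d) for _ in range(4))
for branch in (real_branch, complex_branch, split_complex_branch, dual_branch):
    print(branch.__name__, [part.tolist() for part in branch(x_re, x_im, r_re, r_im)])
```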

b) Conditional message passing: Fix a branch k and its associated transformation f^{(k)}. To compute a representation of entities conditioned on the source entity h and query relation r_q, we run T steps of a Bellman-Ford-style message passing on the entity graph G. Let z_u^{(k,0)} ∈ K^d denote the initial representation of each entity u ∈ E for branch k. We set z_h^{(k,0)} = 1 ⊙ r_q (where 1 is a vector of ones) and z_u^{(k,0)} = 0 for every other entity u. At each iteration t ≥ 1, every entity u collects a message f_r^{(k)}(z_v^{(k,t-1)}) along each incoming edge (v, r, u) ∈ T and updates its representation with a permutation-invariant aggregator AGG over these messages. This iterative process implicitly sums over all paths of length up to T starting from the source entity h. After T iterations, we obtain a branch-specific representation z_e^{(k,T)} for every entity e ∈ E.
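Putting one branch together, the following sketch runs the conditional Bellman-Ford-style propagation on a toy entity graph, with a plain sum standing in for the learned aggregator AGG; the tensor layout and helper names are assumptions for illustration.

```python
import torch

def branch_propagation(num_ent, triples, rel_emb, head, r_q, branch_fn, T=3):
    """Run one geometric branch of the conditional entity-graph propagation.

    triples: list of (h, r, t) index triples; rel_emb: tensor (num_rel, 2, d) holding
    the (re, im) parts of every relation embedding; branch_fn: one of the four products.
    A plain sum stands in for the learned aggregator AGG.
    """
    z = torch.zeros(num_ent, 2, rel_emb.shape[-1])
    z[head] = rel_emb[r_q]                 # z_h^(0) = 1 * r_q; all other entities start at 0
    for _ in range(T):
        new_z = torch.zeros_like(z)
        for h, r, t in triples:            # message along each edge (h, r, t)
            m_re, m_im = branch_fn(z[h, 0], z[h, 1], rel_emb[r, 0], rel_emb[r, 1])
            new_z[t, 0] += m_re
            new_z[t, 1] += m_im
        z = new_z
    return z                               # z[e] is the branch-specific representation of e

# toy usage with a complex product
complex_mul = lambda a, b, c, e: (a * c - b * e, a * e + b * c)
triples = [(0, 0, 1), (1, 1, 2)]           # 3 entities, 2 relations
z = branch_propagation(3, triples, torch.randn(2, 2, 8), head=0, r_q=0, branch_fn=complex_mul)
print(z.shape)                             # torch.Size([3, 2, 8])
```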

c) Attention-based fusion: The final step of GAMMA combines the K branch outputs using an attention mechanism conditioned on the query. First, a linear map W_ctx projects the query relation in q = (h, r_q) to a context vector c ∈ R^{d_att}:

c = Norm( W_ctx r_q ),

where Norm(·) denotes L2-normalization and r_q is a learned embedding of the query relation from (2). For each branch k, we similarly project the entity representation z_e^{(k,T)} into the same attention space via W_key to obtain a key vector k_e^{(k)} = Norm( W_key z_e^{(k,T)} ). The attention weight of branch k for entity e is then

α_e^{(k)} = exp( ⟨c, k_e^{(k)}⟩ / κ ) / Σ_{k'=1..K} exp( ⟨c, k_e^{(k')}⟩ / κ ),

where κ is a temperature parameter controlling the sharpness of the attention distribution. To avoid over-focusing on a single branch, the final attention weights are mixed with a uniform distribution:

α̃_e^{(k)} = (1 − λ) α_e^{(k)} + λ / K,

where λ ∈ [0, 1] is a uniform mixing coefficient. Finally, the entity representation conditioned on the query is a concatenation of rescaled branch outputs:

z_e = [ α̃_e^{(1)} z_e^{(1,T)} ; … ; α̃_e^{(K)} z_e^{(K,T)} ].

To score a candidate tail entity e, we apply a feed-forward network ψ : R^d → R to the real part of z_e and compute score(h, r_q, e) = ψ(z_e).
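A sketch of this fusion and scoring step, following the equations as reconstructed above (L2-normalized context and keys, temperature-scaled softmax over branches, uniform mixing, rescaled concatenation); the weight shapes and the linear scorer are placeholders.

```python
import torch
import torch.nn.functional as F

def fuse_branches(branch_outputs, r_q_emb, W_ctx, W_key, psi, kappa=1.0, lam=0.1):
    """Attention-based fusion of K branch representations for every entity.

    branch_outputs: tensor (K, num_ent, d) of branch-specific entity representations.
    r_q_emb: (d_rel,) query-relation embedding; W_ctx: (d_att, d_rel); W_key: (d_att, d).
    psi: feed-forward scorer applied to the fused representation.
    """
    K, num_ent, d = branch_outputs.shape
    c = F.normalize(W_ctx @ r_q_emb, dim=-1)                # context vector from the query
    keys = F.normalize(branch_outputs @ W_key.T, dim=-1)    # (K, num_ent, d_att) key vectors
    logits = (keys * c).sum(-1) / kappa                     # (K, num_ent) similarity / temperature
    alpha = torch.softmax(logits, dim=0)                    # attention over branches
    alpha = (1 - lam) * alpha + lam / K                     # mix with a uniform distribution
    fused = alpha.unsqueeze(-1) * branch_outputs            # rescale each branch output
    fused = fused.permute(1, 0, 2).reshape(num_ent, K * d)  # concatenate branches per entity
    return psi(fused).squeeze(-1)                           # one score per candidate entity

# toy usage with random weights and a linear scorer
K, num_ent, d, d_att, d_rel = 4, 5, 8, 6, 8
scores = fuse_branches(torch.randn(K, num_ent, d), torch.randn(d_rel),
                       torch.randn(d_att, d_rel), torch.randn(d_att, d),
                       torch.nn.Linear(K * d, 1))
print(scores.shape)    # torch.Size([5])
```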

During training, to encourage diversity among attention distributions and mitigate branch collapse, an entropy-based regularization term is incorporated:

H(α̃_e) = − Σ_{k=1..K} α̃_e^{(k)} log α̃_e^{(k)}.

The overall training objective is to minimize L, the primary prediction loss L_pred (a negative log-likelihood over positive and negative triples as described in [22]) combined with the entropy regularizer:

L = L_pred − β · H(α̃_e),

where β is a small coefficient controlling the strength of regularization.
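A sketch of the regularized objective as reconstructed above: the entropy of the mixed attention weights is subtracted from the prediction loss, so maximizing entropy discourages branch collapse; averaging the entropy over entities is our assumption.

```python
import torch

def entropy_regularized_loss(nll, alpha, beta=0.01):
    """Total loss = prediction loss - beta * mean attention entropy.

    nll: scalar prediction loss; alpha: (K, num_ent) mixed attention weights.
    Subtracting the entropy (i.e., rewarding high entropy) discourages branch collapse.
    """
    entropy = -(alpha * alpha.clamp_min(1e-9).log()).sum(dim=0)  # per-entity entropy over branches
    return nll - beta * entropy.mean()

alpha = torch.softmax(torch.randn(4, 5), dim=0)   # toy attention weights over 4 branches
print(entropy_regularized_loss(torch.tensor(1.2345), alpha))
```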

The use of multiple algebraic branches endows GAMMA with the ability to model a wide range of relational patterns. Complex multiplication captures cyclic and anti-symmetric relations, split-complex multiplication models hierarchical and hyperbolic interactions, while dual multiplication accounts for translational offsets. The attention mechanism in (11) learns to weight these branches depending on the query, enabling the model to adaptively balance the contributions of the heterogeneous message-passing branches and thereby enhancing expressiveness, interpretability, and generalization across complex relational graphs.

By conducting evaluations across diverse knowledge graphs, we seek to answer the following research questions:

• RQ1: To what extent does multi-head geometric attention enhance the generalization ability of ULTRA?
• RQ2: What underlying benefits enable multi-head geometric attention to surpass a single message function?
• RQ3: Do the gains come from attention or simply from more parameters?
• RQ4: How do the fusion modules influence performance?

We build upon the open-source PyG (PyTorch Geometric) implementation of ULTRA [4] by modifying the entity model architecture to introduce a multi-head geometric attention mechanism. We leave improvements to the relation model architecture for future work.

Our experimental setup remains consistent with ULTRA: Besides replacing the original single DistMult [17] message function in the EntityModel with different combinations of message functions (selected from DistMult, complex [18], split-complex [40], and dual [12]), all other hyperparameters remain identical to ULTRA. We provide more details in Appendix B in the supplementary material.

For pre-training, we use the same three source graphs, and 53 datasets for zero-shot evaluation. Due to the extremely large scale of Hetionet [41] used in ULTRA [4], we exclude it from our evaluation: this dataset requires substantially more computational resources and a much longer time than our current setting allows. Importantly, the remaining datasets cover a wide range of domains and scales, providing a representative and comprehensive evaluation of model generalization. During evaluation, the best checkpoint is selected based on its performance on the validation sets, a procedure we refer to as the Validation-Guided Checkpoint Strategy (VGCS). In this setting, we use the 10 checkpoints saved after each epoch during pre-training, identify the one achieving the highest average MRR across the 53 validation sets, and use it to report the final results on the 53 test sets. The resulting model, GAMMA, contains approximately 359K parameters. Pre-training is carried out on four NVIDIA H200 GPUs, taking around 15 hours for 10 epochs. We provide the code in the supplementary material.

Our experiments cover 56 publicly available knowledge graph datasets from diverse domains and sizes. These datasets are organized into three generalization scenarios:

Inductive (e,r) datasets where both new entities and relations emerge at inference: FB-100 [51], FB-50 [51], FB-75 [51], FB-25 [51], WK-100 [51], WK-50 [51], WK-75 [51], WK-25 [51], NL-100 [51], NL-75 [51], NL-50 [51], NL-25 [51], NL-0 [51], WIKITOPICS-MT1:TAX [52], WIKITOPICS-MT1:HEALTH [52], WIKITOPICS-MT2:ORG [52], WIKITOPICS-MT2:SCI [52], WIKITOPICS-MT3:ART [52], WIKITOPICS-MT3:INFRA [52], WIKITOPICS-MT4:SCI [52], WIKITOPICS-MT4:HEALTH [52], METAFAM [52], FBNELL [52]. We provide the full description of these datasets in Appendix A in the supplementary material.

During evaluation, we apply GAMMA in a zero-shot setting, meaning the model is not trained or fine-tuned on the target datasets. The link prediction task involves predicting missing head or tail entities; however, consistent with ULTRA, only tail prediction is conducted for the three datasets (FB15k-237-10%, FB15k-237-20%, FB15k-237-50%) introduced by [44]. We adopt the filtered ranking protocol [53] and report Mean Reciprocal Rank (MRR) along with Hits@10 (H@10) as the primary evaluation metrics, computed against the entire set of entities in the inference graph and under the VGCS setting. According to [4], zero-shot inference produces deterministic results; each evaluation is executed once.
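For reference, a minimal sketch of the filtered ranking protocol and the resulting MRR and Hits@10; the candidate scores and filter set below are made up for illustration.

```python
import torch

def filtered_rank(scores, true_tail, known_tails):
    """Rank of the true tail after masking other known true tails (filtered setting)."""
    scores = scores.clone()
    mask = torch.tensor([i in known_tails and i != true_tail for i in range(len(scores))])
    scores[mask] = float("-inf")                      # filter out other correct answers
    return int((scores > scores[true_tail]).sum()) + 1

def mrr_hits_at_10(ranks):
    ranks = torch.tensor(ranks, dtype=torch.float)
    return (1.0 / ranks).mean().item(), (ranks <= 10).float().mean().item()

# toy example: 6 candidate entities, true tail is entity 2, entity 4 is another known answer
scores = torch.tensor([0.1, 0.3, 0.9, 0.2, 0.95, 0.5])
r = filtered_rank(scores, true_tail=2, known_tails={2, 4})
print(r, mrr_hits_at_10([r, 1, 3]))
```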

To highlight the benefits of multi-head geometric attention over using a single message function, we take ULTRA [4] as our baseline model for comparison. All baseline results are reproduced using ULTRA’s official PyG implementation. While the results reported in ULTRA’s original paper were based on the TorchDrug framework, we adopt the PyG framework as its tensor-based message passing paradigm and customizable operator interfaces align well with our design of multi-branch stacking and attention fusion in NBFNET [22]. Compared to TorchDrug’s higher-level graph abstraction, PyG offers more flexible low-level control and more efficient sparse computation support, which facilitates architectural extensions and performance optimization.

Table I shows the average zero-shot results of GAMMA and ULTRA [4] on 53 graphs. GAMMA consistently outperforms baselines on inductive datasets while maintaining the same performance on transductive datasets. Its advantage is most pronounced in the inductive setting; this can be attributed to the smaller graph scale of inductive datasets. Here, GAMMA achieves improvements of 7% in average MRR and 3.2% in average Hits@10 on Inductive (e) benchmarks, highlighting the effectiveness of multi-head geometric attention to compensate for the representational limitations of a single message function like DistMult [17]. Even when averaged across all 53 datasets, GAMMA still yields a consistent boost (4.4% in MRR, 2.5% in Hits@10), suggesting that its improvements are not isolated to a few graphs but robust across scales and domains. We provide detailed results for each dataset in Appendix D in the supplementary material.

In the field of knowledge graph link prediction, improvements such as the 6.1% MRR gain reported for Low-Dimensional Hyperbolic KG Embeddings [54] are already regarded as significant. This provides a basis for regarding the improvements in generalization ability over ULTRA achieved by our model as statistically and practically significant.

Across the 53 evaluation datasets, GAMMA outperforms ULTRA [4] on 47 datasets in terms of MRR and on 44 datasets in Hits@10. The improvements are particularly pronounced on smaller inductive benchmarks. For example, in the Inductive (e,r) setting, NL-75 [51] and WIKITOPICS-MT2:SCI [52] show MRR gains of 16.2% and 14.2%, respectively. Similarly, in the Inductive (e) setting, WN18RR:v1 [21], Hamaguchi-BM:3k [49], and Hamaguchi-BM:5k [49] exhibit substantial improvements of 62.0%, 20.9%, and 17.9% in MRR, respectively.

We relate this to the high prevalence of symmetric, strongly anti-symmetric, or strongly compositional relations in these datasets. In such settings, the limitations of DistMult [17] are exposed. By contrast, complex [18] and split-complex [40] numbers are by nature more capable in this respect. Under our conditional expert-selection mechanism, the complementarity between the complex and split-complex branches, in terms of their functional inductive biases and gradient diversity, should be effectively utilized. Hence, we hypothesize that such complementary branches may contribute to stronger zero-shot generalization.

To verify this hypothesis, we follow the methodology from [55] to identify symmetric, anti-symmetric, and compositional relational patterns across all test datasets. Based on these detected patterns, we partition the test triples into three corresponding subsets for evaluation. Since not all test sets contain enough triples of all three relational patterns, we only evaluate and report results for a relational pattern group when it includes at least 50 test triples to ensure statistical reliability. We provide the detailed subset statistics in Appendix A in the supplementary material.

We present the zero-shot average MRR and Hits@10 of ULTRA [4] and GAMMA on three relational pattern subsets derived from 53 test sets in Table II. On average, GAMMA outperforms ULTRA on all three relational pattern subsets.

Takeaway 1. The complementary inductive biases encoded by different branches allow the multi-head geometric attention mechanism to more effectively extract useful relational cues, thereby enhancing the model's expressive capacity.

To isolate the effect of model capacity from the contribution of multi-head geometric attention, we trained two additional variants of ULTRA [4]. One variant increases the hidden dimension and MLP width to approximately match GAMMA's parameter count, while another variant directly uses two parallel DistMult branches of identical structure. These two expansion strategies allow us to tease apart the improvements contributed by the attention mechanism and those stemming from the enhanced geometric expressiveness of the message functions.

As shown in Table III, models obtained by merely increasing the MLP width and hidden dimensionality do not exhibit the level of overall improvement achieved by GAMMA. In contrast, introducing multi-head attention yields a more substantial overall performance boost, and further incorporating the geometric enhancement of the message functions provides an additional layer of gains. These observations indicate that GAMMA's improvements stem from architectural and representational advances rather than from naively scaling the parameter count.

Takeaway 2. The observed gains cannot be attributed merely to a larger parameter budget. Instead, they highlight the representational advantage of multi-head geometric attention.

Beyond verifying that multi-head geometric attention yields consistent gains, it is essential to understand how the design choices of the fusion module influence performance. In particular, we investigate three key factors: (i) branch fusion mechanism: attention vs. weak or no-attention variants; (ii) attention fusion position: late fusion vs. early fusion; and (iii) branch composition and complementarity: the choice of message functions. By systematically varying these dimensions, we aim to identify the most effective mechanism for integrating heterogeneous relational signals.

  1. Ablation on Branch Fusion Mechanism: To examine whether the full attention mechanism is truly necessary, we (i) remove the attention module and directly concatenate the outputs of all branches, (ii) remove the query vector from the attention computation, (iii) exclude the node features from the attention context, and (iv) replace the final feature concatenation with summation. These variants allow us to quantify how each component contributes to the model’s performance.

Table IV reports the corresponding results. The largest drops occur when either the attention module, the query, or the key is removed (−0.008 or −0.009 total average MRR), indicating that the adaptive weighting driven by query-key interactions is the major contributor to GAMMA's performance gains. Replacing concatenation with summation shows a smaller decrease (−0.006), implying that concatenation mainly improves representational capacity rather than the adaptive weighting itself. Removing any of these components leads to a noticeable degradation in average MRR, confirming the necessity of the complete attention formulation.

  2. Ablation on Attention Fusion Position: We also investigate how the fusion position impacts model performance by training an early-fusion variant, which merges features after every layer. Table V compares the results: early fusion consistently underperforms late fusion. The pronounced performance drop across almost all scenarios indicates that fusing branch representations too early leads to a loss of predictive performance.

  3. Ablation on Branch Composition and Complementarity: We further explore the choice of message functions by training multiple GAMMA variants with all pairwise combinations. The results are summarized in Table VI, where the combination of complex and split-complex delivers the largest MRR gain (+0.012) over the best single-branch baseline. Interestingly, replacing the split-complex branch with DistMult [17] yields very similar performance, suggesting strong complementarity between phase-sensitive (complex [18]) and amplitude- or scale-oriented multiplicative behaviors (split-complex [40] or DistMult). The dual [12] and DistMult combination also shows a clear gain (+0.006), indicating that first-order shear or translation-like interactions (dual) complement diagonal multiplicative patterns (DistMult). In contrast, the combination of split-complex and DistMult (+0.005) and the combination of complex and dual (+0.004) exhibit moderate complementarity, while the combination of split-complex and dual (+0.002) is the weakest, likely due to higher overlap in the induced feature subspaces. The ablation results reveal that multi-branch message composition consistently improves link prediction performance compared to single-branch variants, confirming the effectiveness of our multi-head attention design.

Takeaway 3. The complete attention architecture with late, multi-branch fusion is crucial for achieving the strongest and most expressive performance.

In this work, we introduced GAMMA, a structural knowledge graph foundation model that overcomes the limitations of existing approaches by integrating multiple geometric transformations within a unified attention-based framework. Unlike prior models that rely on a single transformation defined in a single geometry and thus introduce structural biases, GAMMA exploits the complementarity of geometric message functions to dynamically adapt to the queries with different relational patterns present in the data.

Through extensive evaluation on 53 inductive link prediction benchmarks, GAMMA consistently outperforms ULTRA, achieving particularly large gains in the fully-inductive setting. These results highlight the importance of multi-geometric reasoning for enabling universal generalization in structural knowledge graph foundation models.

The current regularization settings remain relatively conservative, and the relation model architecture still requires refinement to better accommodate a broader spectrum of relational patterns. We also intend to reduce the model's parameter count, improve computational efficiency, extend GAMMA toward scalable pre-training on large heterogeneous knowledge sources, and explore its applicability to downstream tasks beyond link prediction, such as multi-hop reasoning and temporal knowledge graph completion. These directions represent promising avenues for future work.

We build upon the open-source PyG implementation of ULTRA¹, extending its original layer design to incorporate additional message functions, including split-complex, dual, mobius, mobius+, splitmobius, and transrotate. The dimensionality of relation embeddings is dynamically adjusted to match the requirements of each message function. Furthermore, we extend EntityNBFNet into a multi-branch attention fusion architecture, where each message function maintains an independent stack of NBFNET layers and performs forward propagation separately. All branches execute Bellman-Ford iterations in parallel to produce multi-channel representations. The final feature is obtained via an attention fusion mechanism: the query is projected to form a context vector, and each branch output is projected into the same attention space to form a key; the resulting attention weights rescale the branch outputs before they are concatenated and scored.

¹ Open-source code of ULTRA: https://github.com/DeepGraphLearning/ULTRA
