GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators


Authors: Mattia Rigotti, Nicholas Thumiger, Thomas Frick

All authors contributed equally. IBM Research. Correspondence to: Mattia Rigotti <mrg@zurich.ibm.com>.

Preprint. March 18, 2026.

Abstract

Adapting transformer positional encoding to meshes and graph-structured data presents significant computational challenges: exact spectral methods require cubic-complexity eigendecomposition and can inadvertently break gauge invariance through numerical solver artifacts, while efficient approximate methods sacrifice gauge symmetry by design. Both failure modes cause catastrophic generalization failures in inductive learning, where models trained with one set of numerical choices fail when encountering different spectral decompositions of similar graphs or discretizations of the same mesh. We propose GIST (Gauge-Invariant Spectral Transformers), a new graph transformer architecture that resolves this challenge by achieving end-to-end O(N) complexity through random projections while algorithmically preserving gauge invariance via inner-product-based attention on the projected embeddings. We prove that GIST achieves discretization-invariant learning with bounded mismatch error, enabling parameter transfer across arbitrary mesh resolutions for neural operator applications. Empirically, GIST matches state-of-the-art on standard graph benchmarks (e.g., achieving 99.50% micro-F1 on PPI) while uniquely scaling to mesh-based neural operator benchmarks with up to 750K nodes, achieving state-of-the-art aerodynamic prediction on the challenging DrivAerNet and DrivAerNet++ datasets.

1. Introduction

Following their incredible success in processing sequential data in Natural Language Processing, Transformers (Vaswani et al., 2017) have been demonstrating a remarkable capacity for handling data of increasing structural complexity. Lee et al. (2019a) proposed a variant of the transformer block for permutation-invariant data with their Set Transformer architecture; Dosovitskiy et al. (2021) adapted the self-attention mechanism to 2D images with the very influential Vision Transformer architecture; and Bertasius et al. (2021) extended transformers to video analysis with their Video Vision Transformer (ViViT), demonstrating how attention mechanisms capture both spatial and temporal dependencies across video frames. This progression from sequential text to increasingly structured data suggests that Transformers are poised to tackle even more complex data structures, including irregular meshes and graphs.

Indeed, recent developments in adapting Transformers to graphs have shown promising results in capturing long-range dependencies that traditional Graph Neural Networks (GNNs) struggle with due to their reliance on localized message passing (Dwivedi et al., 2022; Zhu et al., 2023). Unlike GNNs, which aggregate information from nearest neighboring nodes by iterating through layers, Transformers can directly capture global relationships across the whole graph through self-attention, enabling them to reason about distant node interactions in a single layer.

Two Barriers to Scalable Graph Transformers. However, adapting transformers to graphs introduces two distinct barriers that have not been simultaneously addressed. First, there is a computational barrier: exact spectral graph embeddings, while theoretically natural, require eigendecomposition of the graph Laplacian, scaling as O(N³) for dense graphs or O(N²) for sparse graphs (where N is the number of nodes), which is prohibitively expensive for large-scale graphs. Second, there is a gauge invariance barrier: both exact and approximate spectral methods may inadvertently break gauge invariance, i.e.
the inherent freedom to rotate eigenvectors, flip signs, or choose among degenerate eigenvectors. Exact methods may break it through numerical solver artifacts (sign choices, eigenspace ordering, degeneracy handling; see Bronstein et al. 2017), while approximate methods commit to a specific basis decomposition through their approximation scheme, sacrificing freedom in spectral choices. This gauge invariance breaking introduces spurious inductive biases tied to arbitrary numerical choices, causing models trained with one set of choices to fail catastrophically when evaluated with different spectral decompositions or numerical solvers, particularly in inductive learning tasks where models must generalize to unseen graph structures.

The Gauge Invariance Challenge. Consider a graph whose spectral embeddings are computed via a random projection matrix R. A neural network trained on these embeddings {Rϕ_i}_i will inevitably learn features correlated with the specific choice of R. Consequently, when evaluated on different graphs, with a different random projection R′, or with different numerical eigensolver choices (sign flips, eigenspace ordering, handling of eigenspace degeneracy), the learned features become meaningless. This gauge dependence fundamentally undermines generalization, a critical failure mode for inductive graph learning where models must transfer to unseen structures.

Beyond graphs, gauge invariance is essential to obtain neural operators with bounded discretization error. Physical problems (e.g., computational fluid dynamics, structural mechanics, shape analysis) are defined on continuous manifolds but discretized into computational meshes corresponding to graphs. Different mesh resolutions produce different graph Laplacians with different spectral decompositions, each involving arbitrary gauge choices (sign flips, eigenspace rotations, solver artifacts).
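The sign ambiguity is easy to exhibit numerically. The following toy check (illustrative; the 3-node complete-graph Laplacian and the flipped column are our own choices, not part of GIST) shows that a sign-flipped eigenbasis reconstructs the same Laplacian equally well, so an eigensolver is free to return either one:

```python
import numpy as np

# Laplacian of the complete graph on 3 nodes; note the repeated eigenvalue,
# which additionally allows arbitrary rotations within its eigenspace.
L = np.array([[ 2., -1., -1.],
              [-1.,  2., -1.],
              [-1., -1.,  2.]])
w, U = np.linalg.eigh(L)

U_flipped = U * np.array([1., -1., 1.])   # flip the sign of one eigenvector

# Both bases diagonalize L equally well ...
assert np.allclose(U @ np.diag(w) @ U.T, L)
assert np.allclose(U_flipped @ np.diag(w) @ U_flipped.T, L)

# ... so raw eigenvector coordinates are gauge-dependent, while inner products
# of node embeddings built from them are not.
print(np.allclose(U @ U.T, U_flipped @ U_flipped.T))  # True
```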
Without gauge invariance, parameters trained on one discretization fail to transfer to another, and the attention kernels computed from spectral embeddings at different resolutions cannot be compared meaningfully. Gauge invariance ensures that the learned operator converges to the same continuum limit regardless of discretization, enabling provably bounded discretization mismatch error that vanishes as resolution increases.

Existing approaches address only one barrier at a time: (1) spectral methods like SAN (Kreuzer et al., 2021) maintain gauge invariance but require full eigendecomposition; (2) approximate spectral methods achieve linear complexity but sacrifice gauge invariance; (3) generic linear transformers reduce attention complexity but ignore graph structure. Recent spectral invariant methods achieve gauge invariance through careful attention design, but still require full eigendecomposition (O(N³) complexity) or quadratic attention (O(N²)), limiting practical scalability. Moreover, they lack the discretization error controls needed for neural operator applications where consistency across mesh resolutions is critical.

Our Contribution: GIST. We propose Gauge-Invariant Spectral Transformers (GIST), which overcomes both barriers through a key insight: gauge invariance can be preserved by restricting attention to inner products between spectral embeddings, which remain invariant even under approximate spectral computations. We instantiate this framework using random projections for their computational efficiency and theoretical guarantees. For random projections, inner products between projected embeddings ⟨Rϕ_i, Rϕ_j⟩ = ⟨ϕ_i, (R^⊤R)ϕ_j⟩ ≈ ⟨ϕ_i, ϕ_j⟩ remain approximately invariant under gauge transformations.
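This inner-product preservation can be checked directly. In the minimal sketch below (dimensions N = 1000 and r = 256 are illustrative), the projected embeddings Rϕ depend entirely on R, yet their inner product tracks the exact one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "exact" spectral embeddings for two nodes (dimension chosen arbitrarily).
N, r = 1000, 256
phi_i, phi_j = rng.standard_normal(N), rng.standard_normal(N)

# Johnson-Lindenstrauss-style random projection, scaled so E[R^T R] = I.
R = rng.standard_normal((r, N)) / np.sqrt(r)

exact = phi_i @ phi_j
projected = (R @ phi_i) @ (R @ phi_j)

# Distortion is small relative to the embedding norms and shrinks as r grows,
# even though the vectors R @ phi themselves are arbitrary.
rel_err = abs(projected - exact) / (np.linalg.norm(phi_i) * np.linalg.norm(phi_j))
print(rel_err)  # small
```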
By restricting attention operations to only these inner products and combining with linear attention, we achieve end-to-end O(N) complexity while recovering gauge invariance algorithmically. Unlike prior approaches that address these barriers separately, GIST achieves gauge invariance and linear complexity simultaneously. Our contributions in short are:

1. identifying gauge invariance breaking as a fundamental limitation shared by both exact and approximate spectral methods, and characterizing when this breaks generalization and neural operator convergence;

2. proposing GIST, combining gauge-invariant spectral attention with multi-scale linear transformer blocks, achieving theoretical guarantees on complexity and invariance preservation, with competitive results on standard graph benchmarks (e.g., Cora, PubMed, PPI);

3. establishing GIST as a neural operator with provably bounded discretization mismatch error, achieving state-of-the-art on large-scale mesh problems.

2. Related Works

Graph Transformers. Graphormer (Ying et al., 2021) introduces the idea of integrating structural encodings such as shortest path distances and centrality in Transformers. Similarly, Dwivedi et al. (2022) propose LSPE (Learnable Structural and Positional Encodings), an architecture that decouples structural and positional representations. Kreuzer et al. (2021) propose the Spectral Attention Network (SAN), which introduces learned positional encodings from the full Laplacian spectrum. Park et al. (2022) develop Graph Relative Positional Encoding (GRPE), which extends relative positional encoding to graphs by considering features representing node-topology and node-edge interactions. Hierarchical Graph Transformer (Zhu et al., 2023) addresses scalability to million-node graphs through graph hierarchies and coarsening techniques. SpecFormer (Bo et al., 2023) and PolyFormer (Chen et al.
, 2025b) are recent spectral Transformers that improve over SAN by leveraging approximate spectral bases or low-rank polynomial Laplacian filters to enhance scalability and accuracy on graph tasks.

Several recent architectures aim to bridge the gap between local and global reasoning via structural encodings and scalable attention. GraphGPS (Rampášek et al., 2022) adopts a hybrid paradigm that decouples local message passing from global attention while integrating spectral positional encodings to capture multi-scale interactions. However, unlike GIST, it lacks a mechanism to algorithmically preserve gauge invariance, leaving the model susceptible to the arbitrary numerical artifacts and sign flips inherent in spectral decompositions. While models like Exphormer (Shirzad et al., 2023) attempt to address the resulting complexity by replacing full attention with sparse expander-based mechanisms, they still struggle with the high overhead of global coordinate systems. Similarly, tokenized approaches such as TokenGT (Hamilton et al., 2017) and NAGphormer (Chen et al., 2023) treat graphs as sets of tokens with [CLS]-style readouts, yet these models scale poorly due to memory-intensive tokenization and positional encoding costs that grow super-linearly with graph size. Consequently, these frameworks are rarely evaluated on large-scale inductive tasks like Elliptic or ogbn-arxiv, where the computational burden of maintaining structural encodings becomes prohibitive.

Scalable Attention Architectures. Recent advances tackled the quadratic scaling of self-attention through various approaches, including cross-attention bottlenecks that map inputs to fixed-size latent representations or concepts (Jaegle et al., 2021b; Rigotti et al., 2022), kernel-based attention mechanisms using random feature approximations (Choromanski et al.
, 2020), feature map decomposition methods that linearize the attention computation (Katharopoulos et al., 2020), and memory-efficient variants with sub-linear complexity (Likhosherstov et al., 2021). As noted by Dao & Gu (2024), many such linear transformer models are directly related to linear recurrent models such as state-space models (Gu et al., 2021; 2022; Gu & Dao, 2023; Chennuru Vankadara et al., 2024).

Neural Operators. Further addressing the scalability of these graph-based methods is essential for applying them to complex domains such as geometry meshes and point clouds. In these settings, graphs are induced by the connectivity of an underlying continuous object whose discretization is not unique: it can be sampled at arbitrarily many densities and resolutions. High-density discretizations can render the graph prohibitively large, undermining both efficiency and scalability in existing methods. As a result, efficient mesh downsampling and/or re-discretization onto regular lattices (e.g., via SDF-based volumetric grids), and task-aware coarsening learned by GNNs, were commonly required to make these problems tractable.

Recently, neural operators have shown success in learning maps between continuous function spaces rather than fixed-dimensional vectors. Two properties are crucial here: (i) discretization invariance, i.e., a single set of parameters applies across discretizations (meshes, resolutions, and sampling locations) of the same underlying continuum problem; and (ii) global integration, i.e., the ability to represent non-local interactions via learned integral kernels, rather than being limited to finite receptive fields. Formally, a neural operator composes learned integral operators with pointwise nonlinearities, yielding universal approximation results for continuous nonlinear operators and implementations that share weights across resolutions.
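The integral-operator view above can be sketched in a few lines. In this toy example (the Gaussian kernel, the 1-D domain, and the uniform quadrature are our own illustrative choices, not any of the operators discussed here), the same kernel layer (Kv)(x) = ∫ κ(x, y) v(y) dy is evaluated on two discretizations of different resolution without changing any parameters:

```python
import numpy as np

def kernel_layer(kappa, xs, vs):
    """Discretize (K v)(x) = ∫ κ(x, y) v(y) dy on nodes xs with values vs,
    using a simple uniform quadrature weight 1/n."""
    n = len(xs)
    K = kappa(xs[:, None], xs[None, :])  # n x n kernel matrix
    return K @ vs / n

# Illustrative smooth kernel and input function on [0, 1].
kappa = lambda x, y: np.exp(-(x - y) ** 2)
v = lambda x: np.sin(2 * np.pi * x)

# Evaluate the SAME kernel layer on a coarse and a fine discretization.
coarse = np.linspace(0, 1, 100)
fine = np.linspace(0, 1, 1000)
out_coarse = kernel_layer(kappa, coarse, v(coarse))
out_fine = kernel_layer(kappa, fine, v(fine))

# Outputs at shared locations agree up to a discretization error that
# shrinks as resolution increases: no re-training is needed.
print(abs(out_coarse[0] - out_fine[0]))  # small
```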
Our approach preserves these neural operator properties while improving scalability, allowing it to be applied in these settings (Kovachki et al., 2023).

Foundational operator families. The Fourier Neural Operator (FNO) parameterizes kernels in the spectral domain and evaluates them with FFT-based spectral convolutions, sharing weights across resolutions and enabling efficient nonlocal interactions on grids (Li et al., 2021). The Graph Neural Operator (GNO) realizes the kernel via message passing, supporting irregular meshes and geometry variation while keeping the learned map discretization-agnostic (Li et al., 2020). Convolutional Neural Operators (CNOs) define continuous convolutions with learnable kernels and interpolation, specifying the operator in the continuum and discretizing only at runtime (Raonić et al., 2023).

Hybrid designs pair geometry-aware encoders with operator layers to handle complex shapes. GINO couples a graph encoder/decoder with a latent FNO on a proxy grid from SDF or point-cloud inputs and shows convergence across large 3D, multi-geometry problems (Li et al., 2023). Encoder-decoder operator learners, such as DeepONet, use a branch network for inputs and a trunk network for coordinate queries, directly supporting heterogeneous sampling (Lu et al., 2021); U-NO adds a multi-resolution U-shaped backbone for multiscale effects (Rahman et al., 2022).

Transformers as neural operators. Self-attention is a learned, data-dependent kernel integral, and with suitable positional features it can approximate continuous maps on variable-length sets for discretization-invariant operator learning; cross-attention evaluates outputs at arbitrary coordinates (Tsai et al., 2019; Yun et al., 2020; Lee et al., 2019b; Jaegle et al., 2021a).
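A minimal sketch of such a cross-attention readout (illustrative only: distance-based logits stand in for learned query/key projections, and the 1-D "mesh" is a toy): queries are built from arbitrary output coordinates, while keys and values come from the input discretization, so the field can be evaluated anywhere.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_readout(query_coords, key_coords, values, scale=32.0):
    """Evaluate a field at arbitrary output coordinates: queries come from
    coordinates, keys/values from the input discretization."""
    sq_dist = ((query_coords[:, None, :] - key_coords[None, :, :]) ** 2).sum(-1)
    return softmax(-scale * sq_dist, axis=1) @ values

# Input field sampled on an irregular 1-D "mesh"; read out on a uniform grid.
rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0, 1, 200))[:, None]   # irregular sample locations
vs = np.sin(2 * np.pi * xs)                     # field values at those points

grid = np.linspace(0, 1, 50)[:, None]           # arbitrary query coordinates
field = cross_attention_readout(grid, xs, vs)
print(field.shape)  # (50, 1)
```

Because the attention weights form a convex combination, the readout stays within the range of the sampled values regardless of where it is queried.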
Transolver casts PDE operator learning as attention from query coordinates to context tokens built from input fields, yielding resolution-agnostic inference and strong generalization across meshes (Wu et al., 2024a). Recent operator transformers, e.g., GNOT, add geometric normalization and gating to stabilize training on irregular meshes and multi-condition PDEs (Hao et al., 2023).

Positioning GIST. Existing graph transformers and scalable attention methods address complementary but not simultaneous challenges. Spectral methods like SAN (Kreuzer et al., 2021) leverage the full Laplacian spectrum to maintain theoretical expressiveness but incur significant computational costs from full spectral methods. Approximate spectral methods achieve better scalability but completely forsake gauge invariance, exacerbating generalization failures when the arbitrary gauge choice differs between training and testing. Generic linear transformers reduce attention complexity but typically ignore graph structure entirely. GIST uniquely combines graph awareness through spectral embeddings, computational efficiency through random projections, gauge invariance through a modified attention mechanism, and linear attention for end-to-end linear scaling. It also preserves the discretization-invariance and global integration properties needed for neural operator applications on mesh regression, unifying graph learning and continuous function approximation in a single framework.

3. Approach

3.1. Preliminaries

Self-attention and positional encoding. Given query, key and value representations q_i, k_i, v_i of N tokens with i = 1, ..., N, self-attention (Vaswani et al., 2017) famously computes outputs as a weighted sum of values, with attention weights determined by query-key similarities:

    o_i = Σ_{j=1}^{N} α_{ij} v_j,  where  α_{ij} = softmax_j( q_i^⊤ k_j / √d ).    (1)

A key insight (Shaw et al.
, 2018) is that positional information can be injected through relative positional biases: e_ij = q_i^⊤ k_j / √d + b_ij, where b_ij reflects distances between positions. For graphs, this can be generalized by replacing b_ij with distance measures reflecting the graph structure.

Graph Laplacian and spectral embeddings. The (normalized) graph Laplacian L = I − D^{−1/2} A D^{−1/2} induces a natural metric via the resistance distance:

    Ω(i, j) = (e_i − e_j)^⊤ L† (e_i − e_j),

where e_i is the i-th standard basis vector and L† is the Moore-Penrose pseudoinverse (Klein & Randić, 1993). The resistance distance can be expressed via Laplacian eigenmaps, which satisfy:

    Ω(i, j) = ||ϕ_i − ϕ_j||²,  where  (ϕ_i)_k = (1/√λ_k) (u_k)_i,    (2)

with λ_k, u_k being the eigenvalues and eigenvectors of L. These eigenmaps are natural positional encodings for graphs because their pairwise distances preserve the graph's metric structure (Dwivedi & Bresson, 2021). However, exact computation requires O(N³) eigendecomposition, prohibitive for large graphs.

Approximate spectral embeddings and the gauge invariance problem. GIST's gauge-invariant attention requires approximate spectral methods that preserve inner products between embeddings. We use FastRP (Chen et al., 2019), which employs random projections R ∈ R^{r×N} with r = O(log(N)/ε²) and k power iterations to compute approximated eigenmaps ϕ̃_i = Rϕ_i ∈ R^r with O(N log N) complexity. By the Johnson-Lindenstrauss Lemma, while the individual embeddings ϕ̃_i depend on the arbitrary choice of R, the inner products ⟨ϕ̃_i, ϕ̃_j⟩ ≈ ⟨ϕ_i, ϕ_j⟩ remain approximately invariant. This inner-product preservation is essential for gauge-invariant attention, ensuring attention weights do not depend on arbitrary numerical choices in the projection.
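A schematic sketch of such random-projection embeddings in the spirit of FastRP (heavily simplified and illustrative: the actual algorithm uses sparse projections and a learned weighting of the power-iteration terms, which are omitted here):

```python
import numpy as np

def approx_spectral_embeddings(A, r=128, k=3, seed=0):
    """Schematic FastRP-style embeddings: project node identities with a
    random matrix, then smooth with k powers of the normalized adjacency.
    Cost is O(k * nnz(A) * r): linear in edges, no eigendecomposition."""
    rng = np.random.default_rng(seed)
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))               # D^{-1/2} A D^{-1/2}
    E = rng.standard_normal((A.shape[0], r)) / np.sqrt(r)  # random projection
    for _ in range(k):                            # power iterations emphasize
        E = S @ E                                 # low-frequency spectral content
    return E

# The Gram matrix of the embeddings is (approximately) independent of the
# random seed, even though the embeddings themselves are not.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
E1 = approx_spectral_embeddings(A, r=4096, seed=1)
E2 = approx_spectral_embeddings(A, r=4096, seed=2)
print(np.allclose(E1 @ E1.T, E2 @ E2.T, atol=0.3))  # True
```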
Importantly, exact spectral methods can also introduce gauge dependence through eigensolver artifacts (sign choices, eigenspace ordering), making gauge invariance a concern across both exact and approximate approaches (Bronstein et al., 2017).

Motivation for gauge-invariant operations. While the approximate eigenmaps ϕ̃_i are efficient and preserve distances, as mentioned their gauge dependence is problematic: neural networks trained on these embeddings will learn features correlated with the specific projection matrix R. This creates a generalization failure in inductive settings where different graphs or different numerical solvers produce different projections. Our approach addresses this by using approximate eigenmaps as positional encodings, but restricting the Transformer to operations that depend only on gauge-invariant quantities.

3.2. Our Approach: GIST

The key insight is that while the projection matrix R breaks gauge invariance, the inner products between projected embeddings remain approximately invariant: (Rϕ_i)^⊤(Rϕ_j) = ϕ_i^⊤(R^⊤R)ϕ_j ≈ ϕ_i^⊤ϕ_j. By taking care that all operations depend on the embeddings only through these inner products, we preserve gauge invariance by design while maintaining computational efficiency.

Gauge-Invariant Spectral Self-Attention. We now introduce our main contribution: the Gauge-Invariant Spectral Transformer (GIST). The first ingredient of GIST is Gauge-Invariant Spectral Self-Attention, which operates on approximate spectral embeddings ϕ̃_i = Rϕ_i ∈ R^r but restricts attention computations to inner products between embeddings. The key observation is that while the embeddings themselves depend on the arbitrary projection matrix R, the inner products (Rϕ_i)^⊤(Rϕ_j) = ϕ_i^⊤(R^⊤R)ϕ_j ≈ ϕ_i^⊤ϕ_j are approximately gauge-invariant because R^⊤R ≈ I by Johnson-Lindenstrauss.
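This restriction can be sketched numerically (a minimal illustration with toy dimensions and a plain softmax; the value path v_i = f_v(x_i) and all learned components are omitted). Using the projected embeddings as both queries and keys makes the attention weights depend on them only through inner products, so two different projections, i.e. two gauges, yield nearly identical weights:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gi_attention_weights(phi_tilde):
    """Attention weights with queries = keys = projected spectral embeddings:
    the logits depend on the embeddings only via their inner products."""
    return softmax(phi_tilde @ phi_tilde.T, axis=1)

rng = np.random.default_rng(0)
phi = rng.standard_normal((6, 8))        # toy "exact" eigenmaps for 6 nodes

# Two random projections = two arbitrary gauge choices.
r = 8192
R1 = rng.standard_normal((r, 8)) / np.sqrt(r)
R2 = rng.standard_normal((r, 8)) / np.sqrt(r)

W1 = gi_attention_weights(phi @ R1.T)
W2 = gi_attention_weights(phi @ R2.T)
print(np.abs(W1 - W2).max())  # small: weights are (approximately) gauge-invariant
```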
Thus, attention weights computed from these inner products do not depend on R and generalize across different projections.

Formally, for each node i = 1, ..., N, Gauge-Invariant Spectral Self-Attention modifies the standard self-attention mechanism as follows:

    q_i = ϕ̃_i,  k_i = ϕ̃_i,  v_i = f_v(x_i).

This ensures attention logits are based on inner products:

    e_ij = q_i^⊤ k_j / √d = ϕ̃_i^⊤ ϕ̃_j / √d = ϕ_i^⊤ R^⊤R ϕ_j / √d ≈ ϕ_i^⊤ ϕ_j / √d,

which are approximately gauge-invariant (see Fig. 1, left). By limiting the embeddings to the query-key computation and not using them as values, we ensure that downstream layers (which operate on node features) cannot access gauge-dependent information. Algorithm 1 in Section A.2 explains how we compute graph spectral positional embeddings and Algorithm 2 details the implementation.

Figure 1. Gauge-Invariant Spectral Transformer. Left: Gauge-Invariant Spectral Self-Attention operates on graph positional embeddings ϕ̃ as queries and keys, and node features x as values. The output of the self-attention operation is then combined with x through a residual connection. Limiting ϕ̃ to queries and keys preserves gauge invariance across the self-attention block. Right: Gauge-Invariant Spectral Self-Attention is embedded in a Multi-Scale Gauge-Invariant Spectral Transformer Block, which comprises 3 parallel branches inspired by EfficientViT.

Gauge-Equivariant Spectral Self-Attention.
The Gauge-Invariant Spectral Self-Attention operation thus preserves gauge invariance, but at the cost of giving up much of the flexibility of regular self-attention. In particular, there is no mechanism that allows for a modification of the vectors ϕ̃_i through learning. In fact, applying even just a linear operation to ϕ̃_i would again break gauge invariance. However, notice that rescaling each ϕ̃_i by a scalar, possibly depending on the node features, s(x_i) ∈ R, modifies the similarity between graph positional embeddings in the same way across gauge choices, since scalars commute with orthogonal projections; that is, it is an equivariant operation across gauges:

    (s(x_i) ϕ̃_i)^⊤ (s(x_j) ϕ̃_j) = s(x_i) s(x_j) (ϕ̃_i^⊤ ϕ̃_j) ≈ s(x_i) s(x_j) (ϕ_i^⊤ ϕ_j).

Remarkably, such a gauge-equivariant operation can also be straightforwardly implemented via a modification of self-attention, by setting in equation 1:

    q_i = f_q(x_i),  k_i = f_k(x_i),  v_i = ϕ̃_i,

and constraining the output of equation 1 to operate on ϕ̃_i, i.e., ϕ̃_i^{l+1} = Σ_{j=1}^{N} α_{ij} ϕ̃_j, where ϕ̃_i^{l+1} indicates the graph positional encoding that will be used in the next layer l+1. Algorithm 3 in Section A.2 details how the implementation of Gauge-Equivariant Spectral Self-Attention relates to regular Self-Attention.

Linear Self-Attention, and Multi-Scale Architecture. Gauge-Invariant Spectral Self-Attention ensures that we can compute reliable graph positional encodings with time complexity linear in the number of nodes N. In order to maintain that linear scaling end-to-end, the very last component of our architecture addresses the quadratic scaling of Transformers by implementing a linear version of self-attention. In particular, we implement the linear transformer by Katharopoulos et al. (2020).
Crucially, as feature map we use φ(x) = ReLU(x), which induces a kernel k₀(·) corresponding to the arc-cosine kernel (Cho & Saul, 2009). More specifically, for random features ϕ̃_i, ϕ̃_j ∈ R^r, the attention weights ⟨φ(ϕ̃_i), φ(ϕ̃_j)⟩ ≈ k₀(ϕ̃_i^⊤ ϕ̃_j) converge to a kernel function that depends only on the inner product ϕ̃_i^⊤ ϕ̃_j. Since ϕ̃_i^⊤ ϕ̃_j ≈ ϕ_i^⊤ ϕ_j by Johnson-Lindenstrauss (as established earlier), this preserves gauge invariance: attention weights depend only on gauge-invariant inner products between true spectral embeddings. For further considerations on the choice of the feature map φ(·), see the note in Appendix A.2.

In order to fully exploit the capabilities of linear attention and mitigate its drawbacks, such as the reported lack of sharp attention scores compared to softmax attention, we design a parallel architecture inspired by EfficientViT (Cai et al., 2022), a multi-scale linear attention architecture. Just like EfficientViT, our Multi-Scale Gauge-Invariant Spectral Transformer Block has 3 parallel branches: a feature branch, consisting of a linear transformer block acting on node features x alone; a local branch, consisting of a graph-convolution layer, also acting on x, followed by a linear transformer block; and a global branch, consisting of our Gauge-Invariant Spectral Self-Attention layer followed by a Gauge-Equivariant Spectral Self-Attention layer (which, as explained, act on both node features x and graph positional embeddings ϕ̃), then followed by a linear transformer. In keeping with the analogy with EfficientViT, the role of the graph-convolution layer (which simply averages node features across adjacent nodes) is to emphasize local information, which would otherwise be diffused by linear attention. Conversely, Gauge-Invariant Spectral Self-Attention integrates global information across the graph.
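The linearized attention used in these branches can be sketched as follows (a simplified single-head version with the ReLU feature map; dimensions and the small normalizer epsilon are illustrative). Summarizing keys and values before they interact with queries is what replaces the O(N²) weight matrix with O(N) work:

```python
import numpy as np

def linear_attention(q, k, v):
    """Linearized attention in the style of Katharopoulos et al. (2020) with
    the ReLU feature map: O(N) in the number of tokens instead of O(N^2),
    because keys/values are summarized before interacting with queries."""
    fq, fk = np.maximum(q, 0), np.maximum(k, 0)   # phi(x) = ReLU(x)
    kv = fk.T @ v                                 # (d_k, d_v) summary
    z = fq @ fk.sum(axis=0)                       # per-query normalizer
    return (fq @ kv) / (z[:, None] + 1e-6)

rng = np.random.default_rng(0)
N, d = 1000, 32
q = rng.standard_normal((N, d))
k = rng.standard_normal((N, d))
v = rng.standard_normal((N, d))

out = linear_attention(q, k, v)
print(out.shape)  # (1000, 32)
```

The result matches the explicit formulation with normalized weights φ(q_i)^⊤φ(k_j) / Σ_j φ(q_i)^⊤φ(k_j), but without ever materializing the N × N matrix.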
The Multi-Scale block is represented in the right panel of Fig. 1 and constitutes a unit layer that is sequentially replicated multiple times.

In Appendix A.5 we report ablation studies empirically showing that all branches meaningfully contribute to the final accuracy of the architecture; specifically, on PPI the full architecture achieves SOTA performance but would not if any of the branches were missing.

Complexity Scaling Analysis. GIST's computational complexity is dominated by two components. First, spectral embedding computation via FastRP scales as O(N·r·k), where r is the embedding dimension and k is the number of power iterations. Second, linear transformer blocks with Gauge-Invariant Spectral Self-Attention on d-dimensional node features scale as O(N·d²). Overall, this gives end-to-end O(N·d² + N·r·k) scaling, i.e., linear in the number of nodes N. This contrasts with O(N³) for exact eigendecomposition and O(N²·d) for standard quadratic attention. Empirical validation of this linear scaling is provided in Appendix A.4: Figure 3 demonstrates that both VRAM consumption and forward pass time grow linearly with graph size up to 500K nodes on DrivAerNet samples, confirming our theoretical analysis and enabling the large-scale neural operator experiments in Section 4.

4. Results

4.1. Theoretical Guarantees: GIST as a Discretization-Invariant Neural Operator

We now formalize how GIST's gauge-invariant design enables discretization-invariant learning with bounded error, a property essential for neural operator applications. Physical problems (computational fluid dynamics, structural mechanics, shape analysis) are defined on continuous manifolds but discretized into computational meshes corresponding to graphs.
Different mesh resolutions produce different graph Laplacians with different spectral decompositions, each involving arbitrary gauge choices (sign flips, eigenspace rotations, solver-dependent orderings). Without gauge invariance, parameters trained on one discretization fail to transfer to another, preventing convergence to a continuum limit. Gauge invariance ensures that attention weights converge to a well-defined continuum kernel, enabling provably bounded discretization mismatch error that vanishes as mesh resolution increases.

Proposition (Informal). Outputs of GIST applied to different discretizations of the same m-dimensional manifold, obtained by randomly sampling nodes, converge to each other with error O(n^{−1/(m+4)}), where n is the coarser resolution. This ensures learned parameters transfer across arbitrary mesh resolutions with bounded error that vanishes as resolution increases.

Proof sketch. The result follows from: (1) spectral convergence of graph Laplacians to manifold eigenfunctions (Belkin & Niyogi, 2008; Calder & García Trillos, 2022); (2) random projections preserve inner products (Johnson-Lindenstrauss); (3) gauge-invariant attention depends only on these inner products, ensuring parameter transfer. The full formulation and proof are found in Appendix A.

This result distinguishes GIST from prior spectral methods: while approximate methods achieve computational efficiency, they lack gauge invariance and thus cannot provide discretization error bounds; while exact methods could maintain invariance, their O(N³) cost prevents scaling to high-resolution meshes, where such bounds are most valuable. GIST uniquely combines gauge invariance with end-to-end linear complexity, achieving both theoretical guarantees and practical scalability for neural operator applications.

4.2. Discretization Invariance: Empirical Verification

We validate the discretization invariance property predicted by Proposition 4.1 with a controlled experiment isolating the effect of gauge invariance on cross-resolution transfer.

Experimental setup. We train two architectures with matched parameters on a single DrivAerNet (Elrefaie et al., 2024) car at 50% decimation (~250K vertices) and test on full resolution (~500K vertices), predicting surface pressure. The gauge-invariant architecture is a GIST block followed by a linear transformer block. The non-gauge-invariant baseline uses a linear transformer block with spectral embeddings summed to node features (treating them as standard positional encodings), which breaks gauge invariance by using the projected embeddings directly as features rather than restricting attention to their inner products.

Discretization Invariance Results. Table 1 shows both architectures achieve essentially the same training performance on the coarse mesh, confirming matched model capacity. However, when evaluated on the fine mesh, the gauge-invariant architecture maintains strong performance (0.840 R², a 15% relative drop), while the non-gauge-invariant baseline fails catastrophically (0.526 R², a 47% relative drop). This controlled experiment directly demonstrates that gauge invariance is essential for parameter transfer across discretizations, validating our theory.

Table 1. Discretization invariance verification. Models trained on a 50% decimated mesh and tested on the full-resolution mesh of the same car geometry. The gauge-invariant architecture transfers successfully; the non-gauge-invariant baseline fails (we show R-squared, averaged over 6 random seeds; confidence intervals are standard error).
Architecture              Train $R^2$ ↑ (coarse mesh)   Test $R^2$ ↑ (fine mesh)
GIST (gauge-invariant)    0.989 ± 0.005                 0.840 ± 0.021
Non-gauge-invariant       0.991 ± 0.004                 0.526 ± 0.033

4.3. Node Classification Tasks

To demonstrate the key advantages of GIST, we evaluate it on both transductive and inductive node classification datasets. Transductive tasks, a common graph neural network paradigm, consist of training and evaluating the model on the same graph, with the goal of predicting at test time node labels that were not provided during training (infilling). Inductive tasks, on the other hand, operate on a disjoint set of graphs and aim to predict properties of an entirely new graph.

Experiment setup. We evaluate our method on transductive graph benchmarks using the official training, validation, and test splits and evaluation protocols. For each method, we select optimal hyperparameters by optimizing over the validation split of each dataset.

GIST-specific parameters (FastRP $k$ and $r$) are optimized through HPO. Appendix A.3 shows that final accuracy is robust to variation of these parameters, with performance saturating at relatively low $r$, as predicted by Johnson-Lindenstrauss. For the final result, we train on the combined training and validation set and evaluate the model on the corresponding test set. We train across multiple random seeds and report the mean ± standard deviation of the relevant metric.

4.3.1. Transductive Tasks

We evaluate our method on the three standard Planetoid citation benchmarks for the transductive setting, where the whole graph is observed at train time: Cora (2,708 nodes, 5,429 edges, 1,433 bag-of-words features, seven classes), CiteSeer (3,327 nodes, 4,732 edges, 3,703 features, six classes), and PubMed (19,717 nodes, 44,338 edges, 500 features, three classes).
Train-val-test sets follow the Planetoid public split, and we report node-classification accuracy (Sen et al., 2008; Yang et al., 2016; Kipf & Welling, 2017).

Across these benchmarks, GIST is competitive with strong graph convolutional and transformer-style baselines (see Table 2). On PubMed, GIST attains the best mean accuracy among the reported methods ($81.20\% \pm 0.41$), narrowly surpassing enhanced GCN variants (e.g., $81.12\% \pm 0.52$) and outperforming the GAT/GraphSAGE families. On Cora and CiteSeer, GIST achieves results comparable to the top results (within ~1–2 points of GCNII/SGFormer and the enhanced GCN), landing at $84.00\% \pm 0.60$ and $71.31\% \pm 0.50$, respectively.

Table 2. Transductive node classification on the Planetoid benchmarks (Cora, CiteSeer, PubMed). We report test accuracy (%) as mean ± std across random seeds using the standard public split (higher is better). Benchmark results are taken from the following references (a "–" indicates no reported result): (Kipf & Welling, 2017; Hu et al., 2021; Luo et al., 2024; Veličković et al., 2018; Chiang et al., 2019; Zeng et al., 2020; Chen et al., 2020; Brody et al., 2022; Choi, 2022; Wu et al., 2024b).

Model            Cora (accuracy ↑)   CiteSeer (accuracy ↑)   PubMed (accuracy ↑)
GCN (base)       81.60 ± 0.40        71.80 ± 0.01            79.50 ± 0.30
GraphSAGE        71.49 ± 0.27        71.93 ± 0.85            79.41 ± 0.53
GIN              77.60 ± 1.10        –                       –
GAT              83.00 ± 0.70        69.30 ± 0.80            78.40 ± 0.90
GCNII            85.50 ± 0.50        72.80 ± 0.60            79.80 ± 0.30
GATv2            82.90               71.60                   78.70
SGFormer         84.82 ± 0.85        72.60 ± 0.20            80.30 ± 0.60
GCN (enhanced)   85.10 ± 0.67        73.14 ± 0.67            81.12 ± 0.52
GIST (ours)      84.00 ± 0.60        71.31 ± 0.50            81.20 ± 0.41

4.3.2. Inductive Tasks

We evaluate our method on four inductive benchmarks: PPI, Elliptic, Arxiv, and Photo. PPI is a collection of 24 disjoint tissue-specific protein–protein interaction graphs where nodes (proteins) have 50 features and 121 non-mutually-exclusive GO labels. Following the standard split (20 graphs for training, 2 for validation, and 2 for testing), we report micro-averaged F1 on the unseen test graphs. Elliptic is a time-evolving directed Bitcoin transaction graph with 203,769 transactions (nodes), 234,355 payment-flow edges, and 166 features across 49 snapshots, labeled licit/illicit, with many nodes unlabeled due to class imbalance. We train on the first 29 time steps, validate on the next 5, and test on the last 14, reporting micro-F1. For the Arxiv citation graph (269,343 nodes, 1,166,243 edges, 128 features, and 40 classes) and the Amazon Photo co-purchase network (7,650 nodes, 119,081 edges, 745 features, and 8 classes), we evaluate GIST in the inductive node classification setting using official splits (Fey & Lenssen, 2019; Fey et al., 2025).

GIST achieves strong performance across all datasets. On PPI, GIST reaches $99.50\% \pm 0.03$ micro-F1, matching the best large-scale sampling methods and deep residual GCNs (see Table 3): on par with GCNIII and within noise of the strongest GCNII setting ($99.53\%$). On the temporally inductive Elliptic dataset, GIST attains $94.70\% \pm 0.03$ micro-F1. While this trails the strongest GraphSAGE configuration, GIST maintains stable performance across future time steps. On Arxiv, GIST achieves a mean micro-F1 of $72.12\% \pm 0.21$, competitive with recent spectral transformers (PolyFormer: 72.42%, SpecFormer: 72.37%, Exphormer: 72.44%) while maintaining efficient $O(N)$ scaling through its linear attention formulation. On the smaller but feature-rich Photo graph, GIST attains $94.42\% \pm 0.40$ micro-F1, which is also competitive with recent spectral and polynomial transformer variants. Note that GraphGPS goes OOM on ArXiv, primarily due to its poor scalability properties and the lack of the gauge-invariance trick. These findings demonstrate GIST's effectiveness as a competitive graph learning approach, validating a successful trade-off between computational overhead and representational power.

Table 3. Inductive node classification on PPI, Elliptic, Arxiv, and Photo. Results are reported as micro-F1 (higher is better). Benchmark results are taken from the following references (a "–" indicates no reported result): (Chen et al., 2020; Weber et al., 2019; Chiang et al., 2019; Veličković et al., 2018; Zhang et al., 2018; Zeng et al., 2020; Chen et al., 2025b; Brody et al., 2022; Bo et al., 2023; Rampášek et al., 2022).

Model               PPI (micro-F1 ↑)   Elliptic BTC (micro-F1 ↑)   ArXiv (micro-F1 ↑)   Photo (micro-F1 ↑)
GCN (base)          51.50 ± 0.60       96.10                       71.74 ± 0.29         88.26 ± 0.73
GraphSAGE           61.20              97.70                       71.49 ± 0.27         –
GAT                 97.30 ± 0.02       96.90                       –                    90.94 ± 0.68
GATv2               96.30              –                           71.87 ± 0.25         –
GaAN                98.70              –                           –                    –
Cluster-GCN         99.36              –                           –                    –
GraphSAINT          99.50              –                           –                    –
GCNII               99.53 ± 0.01       –                           72.04 ± 0.19         89.94 ± 0.31
GCNIII              99.50 ± 0.03       –                           –                    –
GraphGPS            99.10              –                           OOM                  –
SGFormer            –                  –                           72.63 ± 0.13         –
SpecFormer          99.50              –                           72.37 ± 0.18         95.48 ± 0.32
PolyFormer          –                  –                           72.42 ± 0.19         –
Exphormer (LapPE)   –                  –                           72.44                91.59 ± 0.31
GIST (ours)         99.50 ± 0.03       94.70 ± 0.03                72.12 ± 0.21         94.42 ± 0.40

4.4.
Neural Operators

We now validate GIST's discretization-invariant properties (Section 4.1) on large-scale mesh-based regression using the DrivAerNet and DrivAerNet++ datasets (Elrefaie et al., 2024). DrivAerNet is a high-fidelity CFD dataset of parametric car geometries comprising 4,000 designs with approximately 500K surface vertices per car and accompanying aerodynamic fields. It is extended by DrivAerNet++, which increases the scale to over 10,000 designs and introduces greater geometric diversity across multiple vehicle classes, including SUVs, sedans, and hatchbacks. We model each car as a graph whose nodes are surface vertices and whose edges follow mesh connectivity. Our task is node-level regression of the surface pressure field on previously unseen cars, following the published train/validation/test split.

Table 4 shows GIST achieves state-of-the-art results on both benchmarks: 20.10% relative $\ell_2$ error on DrivAerNet (vs. 20.35% for the previous best) and 18.60% on DrivAerNet++ (vs. 20.05%). GIST's linear scaling enables direct processing of these 500K-node meshes without downsampling, while prior methods require lossy projection to regular grids or lower-dimensional latent spaces. GIST's spectral attention provides global receptive fields in a single layer, which aids performance on this task.

Table 4. Surface pressure prediction accuracy on DrivAerNet and DrivAerNet++. Baselines include RegDGCNN (Elrefaie et al., 2024), Transolver (Wu et al., 2024a), FigConvNet (Choy et al., 2025), TripNet (Chen et al., 2025a), and AdaField (Zou et al., 2026). MSE is reported in units of $10^{-2}$; Rel L2 is reported as a percentage (%). Lower is better for both metrics.

                    DrivAerNet            DrivAerNet++
Model               MSE ↓    Rel L2 ↓     MSE ↓    Rel L2 ↓
RegDGCNN            9.01     28.49        8.29     27.72
Transolver          5.37     22.52        7.15     23.87
FigConvNet          4.38     20.98        4.99     20.86
TripNet             4.23     20.35        4.55     20.05
GIST (ours)         4.16     20.10        3.63     18.60

5. Conclusions

We presented GIST, a gauge-invariant spectral transformer that achieves three properties: gauge invariance for principled generalization across graphs and meshes, $O(N)$ complexity for scalability, and global spectral attention for capturing long-range dependencies. Crucially, while random projections break gauge symmetry in individual embeddings, inner products between projected embeddings remain approximately invariant. By restricting attention to these gauge-invariant inner products, GIST recovers the symmetry algorithmically while maintaining end-to-end linear scaling.

Gauge invariance is essential for neural operator applications: it ensures learned parameters transfer across mesh resolutions with discretization mismatch error $O(n^{-1/(m+4)})$ that vanishes as resolution increases. Prior spectral methods address gauge invariance or efficiency separately; the novelty of the GIST architecture is the careful design that successfully combines these properties.

Empirically, GIST achieves competitive results on standard graph benchmarks (Cora, PubMed, PPI) and sets a new state-of-the-art on large-scale mesh regression (DrivAerNet and DrivAerNet++, up to 500K and 750K nodes, respectively), processing these graphs directly at full resolution where methods requiring quadratic attention or cubic eigendecomposition cannot scale.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning through improved geometric deep learning and neural operator methods. There are many potential societal consequences of our work; the methods presented could be beneficial for scientific computing applications (e.g., computational fluid dynamics) and could advance the state of graph neural networks.

References

Belkin, M. and Niyogi, P. Towards a Theoretical Foundation for Laplacian-Based Manifold Methods.
Journal of Computer and System Sciences, 74(8):1289–1308, August 2008. ISSN 0022-0000. doi: 10.1016/j.jcss.2008.04.001.

Bertasius, G., Wang, H., and Torresani, L. Is Space-Time Attention All You Need for Video Understanding?, June 2021.

Bo, D., Shi, C., Wang, L., and Liao, R. Specformer: Spectral graph neural networks meet transformers. In Proceedings of the International Conference on Learning Representations (ICLR), 2023.

Brody, S., Alon, U., and Yahav, E. How Attentive are Graph Attention Networks?, January 2022.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Cai, H., Li, J., Hu, M., Gan, C., and Han, S. EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction, February 2022.

Calder, J. and García Trillos, N. Improved spectral convergence rates for graph Laplacians on epsilon-graphs and k-NN graphs. Applied and Computational Harmonic Analysis, 60:123–175, 2022. doi: 10.1016/j.acha.2022.04.004.

Chen, H., Sultan, S. F., Tian, Y., Chen, M., and Skiena, S. Fast and Accurate Network Embeddings via Very Sparse Random Projection. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, pp. 399–408, New York, NY, USA, November 2019. Association for Computing Machinery. ISBN 978-1-4503-6976-3. doi: 10.1145/3357384.3357879.

Chen, J., Gao, K., Li, G., and He, K. NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs. In Proceedings of the International Conference on Learning Representations (ICLR), 2023.

Chen, M., Wei, Z., Huang, Z., Ding, B., and Li, Y. Simple and Deep Graph Convolutional Networks, July 2020.

Chen, Q., Elrefaie, M., Dai, A., and Ahmed, F. TripNet: Learning Large-scale High-fidelity 3D Car Aerodynamics with Triplane Networks, May 2025a.
Chen, Y., Yang, W., and Jiang, Z. Wide & Deep Learning for Node Classification, May 2025b.

Chennuru Vankadara, L., Xu, J., Haas, M., and Cevher, V. On Feature Learning in Structured State Space Models. Advances in Neural Information Processing Systems, 37:86145–86179, 2024.

Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C.-J. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 257–266, July 2019. doi: 10.1145/3292500.3330925.

Cho, Y. and Saul, L. Kernel Methods for Deep Learning. In Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009.

Choi, J. Personalized PageRank Graph Attention Networks, August 2022.

Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L., and Weller, A. Rethinking Attention with Performers, September 2020.

Choy, C., Kamenev, A., Kossaifi, J., Rietmann, M., Kautz, J., and Azizzadenesheli, K. Factorized Implicit Global Convolution for Automotive Computational Fluid Dynamics Prediction, February 2025.

Dao, T. and Gu, A. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, May 2024.

Dasgupta, S. and Gupta, A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, January 2003. ISSN 1042-9832, 1098-2418. doi: 10.1002/rsa.10073.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, June 2021.

Dwivedi, V. P. and Bresson, X. A Generalization of Transformer Networks to Graphs.
arXiv:2012.09699 [cs], January 2021.

Dwivedi, V. P., Luu, A. T., Laurent, T., Bengio, Y., and Bresson, X. Graph Neural Networks with Learnable Structural and Positional Representations, February 2022.

Elrefaie, M., Dai, A., and Ahmed, F. DrivAerNet: A Parametric Car Dataset for Data-Driven Aerodynamic Design and Graph-Based Drag Prediction, March 2024.

Fey, M. and Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

Fey, M., Sunil, J., Nitta, A., Puri, R., Shah, M., Stojanović, B., Bendias, R., Barghi, A., Kocijan, V., Zhang, Z., He, X., Lenssen, J. E., and Leskovec, J. PyG 2.0: Scalable learning on real world graphs. In Temporal Graph Learning Workshop @ KDD, 2025.

Gao, W., Xu, R., Deng, Y., and Liu, Y. Discretization-invariance? On the discretization mismatch errors in neural operators. In The Thirteenth International Conference on Learning Representations, 2025.

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in Neural Information Processing Systems, 34:572–585, 2021.

Gu, A., Goel, K., and Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces, August 2022.

Hamilton, W. L., Ying, R., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1024–1034, 2017.

Hao, Z., Wang, Z., Su, H., Ying, C., Dong, Y., Liu, S., Cheng, Z., Song, J., and Zhu, J. GNOT: A general neural operator transformer for operator learning.
In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research. PMLR, 2023. URL https://proceedings.mlr.press/v202/hao23c.html.

Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open Graph Benchmark: Datasets for Machine Learning on Graphs, February 2021.

Jaegle, A., Borgeaud, S., Alayrac, J.-B., Doersch, C., Ionescu, C., Ding, D., Koppula, S., Zoran, D., Brock, A., Shelhamer, E., Hénaff, O. J., Botvinick, M. M., Zisserman, A., Vinyals, O., and Carreira, J. Perceiver IO: A general architecture for structured inputs & outputs. arXiv:2107.14795, 2021a. ICLR 2022 version.

Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pp. 4651–4664. PMLR, 2021b.

Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv:2006.16236 [cs, stat], June 2020.

Kipf, T. N. and Welling, M. Semi-Supervised Classification with Graph Convolutional Networks, February 2017.

Klein, D. J. and Randić, M. Resistance distance. Journal of Mathematical Chemistry, 12(1):81–95, December 1993. ISSN 1572-8897. doi: 10.1007/BF01164627.

Kovachki, N., Li, Z., Liu, B., Azizzadenesheli, K., Bhattacharya, K., Stuart, A., and Anandkumar, A. Neural operator: Learning maps between function spaces with applications to PDEs. Journal of Machine Learning Research, 24:1–97, 2023. URL http://jmlr.org/papers/v24/21-1524.html.

Kreuzer, D., Beaini, D., Hamilton, W. L., Létourneau, V., and Tossou, P. Rethinking Graph Transformers with Spectral Attention, October 2021.

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks.
In International Conference on Machine Learning, pp. 3744–3753. PMLR, May 2019a.

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pp. 3744–3753. PMLR, 2019b. URL https://proceedings.mlr.press/v97/lee19d.html.

Li, P., Hastie, T. J., and Church, K. W. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 287–296, Philadelphia, PA, USA, August 2006. ACM. ISBN 978-1-59593-339-3. doi: 10.1145/1150402.1150436.

Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., and Anandkumar, A. Neural operator: Graph kernel network for partial differential equations. arXiv:2003.03485, 2020.

Li, Z., Kovachki, N. B., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., and Anandkumar, A. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=c8P9NQVtmnO.

Li, Z., Kovachki, N. B., Choy, C., Li, B., Kossaifi, J., Otta, S. P., Nabian, M. A., Stadler, M., Hundt, C., Azizzadenesheli, K., and Anandkumar, A. Geometry-informed neural operator for large-scale 3d PDEs. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023. arXiv:2309.00583.

Likhosherstov, V., Choromanski, K. M., Davis, J. Q., Song, X., and Weller, A. Sub-linear memory: How to make performers slim. Advances in Neural Information Processing Systems, 34:6707–6719, 2021.

Lu, L., Jin, P., Pang, G., Zhang, Z., and Karniadakis, G. E.
Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021. doi: 10.1038/s42256-021-00302-5.

Luo, Y., Shi, L., and Wu, X.-M. Classic GNNs are Strong Baselines: Reassessing GNNs for Node Classification, October 2024.

Park, W., Chang, W., Lee, D., Kim, J., and Hwang, S.-w. GRPE: Relative Positional Encoding for Graph Transformer, October 2022.

Rahman, M. A., Ross, Z. E., and Azizzadenesheli, K. U-NO: U-shaped neural operators, 2022. URL https://arxiv.org/abs/2204.11127.

Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a General, Powerful, Scalable Graph Transformer. Advances in Neural Information Processing Systems, 35, 2022.

Raonić, B., Molinaro, R., De Ryck, T., Rohner, T., Bartolucci, F., Alaifari, R., Mishra, S., and de Bézenac, E. Convolutional neural operators for robust and accurate learning of PDEs. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023. URL https://proceedings.neurips.cc/paper/2023/hash/f3c1951b34f7f55ffaecada7fde6bd5a-Abstract-Conference.html.

Rigotti, M., Miksovic, C., Giurgiu, I., Gschwind, T., and Scotton, P. Attention-based Interpretability with Concept Transformers. In International Conference on Learning Representations (ICLR), 2022.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.

Shaw, P., Uszkoreit, J., and Vaswani, A. Self-Attention with Relative Position Representations, April 2018.

Shirzad, H., Velingker, A., Venkatachalam, B., Sutherland, D. J., and Sinop, A. K. Exphormer: Sparse Transformers for Graphs. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.

Tsai, Y.-H. H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R.
Transformer dissection: An unified understanding for transformer's attention via the lens of kernel. In Proceedings of EMNLP-IJCNLP, pp. 4344–4353, Hong Kong, China, 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1443.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. arXiv:1706.03762 [cs], June 2017.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph Attention Networks, February 2018.

Weber, M., Domeniconi, G., Chen, J., Weidele, D. K. I., Bellei, C., Robinson, T., and Leiserson, C. E. Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics. In KDD '19 Workshop on Anomaly Detection in Finance, 2019.

Wu, H., Luo, H., Wang, H., Wang, J., and Long, M. Transolver: A fast transformer solver for PDEs on general geometries. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of ICML'24, pp. 53681–53705, Vienna, Austria, July 2024a. JMLR.org.

Wu, Q., Zhao, W., Yang, C., Zhang, H., Nie, F., Jiang, H., Bian, Y., and Yan, J. SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations, August 2024b.

Yang, Z., Cohen, W. W., and Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 40–48, 2016.

Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do Transformers Really Perform Bad for Graph Representation?, November 2021.

Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S. J., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations (ICLR), 2020.
Zeng, H., Zhou, H., Srivastava, A., Kannan, R., and Prasanna, V. GraphSAINT: Graph Sampling Based Inductive Learning Method, February 2020.

Zhang, J., Shi, X., Xie, J., Ma, H., King, I., and Yeung, D.-Y. GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs, March 2018.

Zhu, W., Wen, T., Song, G., Ma, X., and Wang, L. Hierarchical Transformer for Scalable Graph Learning, May 2023.

Zou, J., Qiu, W., Sun, Z., Zhang, X., Zhang, Z., and Zhu, X. AdaField: Generalizable surface pressure modeling with physics-informed pre-training and flow-conditioned adaptation. arXiv preprint, 2026. doi: 10.48550/arXiv.2601.07139.

A. Proofs and Technical Details

A.1. Proof of Proposition 4.1: Gauge-Invariant Spectral Self-Attention

We prove that the Gauge-Invariant Spectral Self-Attention mechanism is discretization-invariant with quantifiable discretization mismatch error through three stages of analysis. The proof establishes that the positional encodings used in the attention mechanism converge to the continuum Green's function, allowing us to bound the discretization mismatch error between any two discretizations of the same manifold.

For completeness, we restate the full proposition:

Proposition A.1 (Full Statement). Gauge-Invariant Spectral Self-Attention is a discretization-invariant Neural Operator with bounded discretization mismatch error. Let $\mathcal{M}$ be a compact $m$-dimensional Riemannian manifold and $G_n$ a sequence of graphs obtained by sampling $n$ nodes from $\mathcal{M}$ with $n \to \infty$. Let $\phi^n_i$ be the Laplacian eigenmaps of $G_n$ (Equation 2). Then:

(i) The inner products $\langle \phi^n_i, \phi^n_j \rangle$ converge to the Green's function $G_{\mathcal{M}}(x_i, x_j)$ of the Laplace-Beltrami operator at rate $O(n^{-1/(m+4)})$.
(ii) For any two discretizations $G_n$ and $G_{n'}$ of the same manifold $\mathcal{M}$ with $n \le n'$, the attention kernel values $\langle \tilde{\phi}^n_i, \tilde{\phi}^n_j \rangle$ at corresponding points $x_i, x_j \in \mathcal{M}$ satisfy
$$\left| \langle \tilde{\phi}^n_i, \tilde{\phi}^n_j \rangle - \langle \tilde{\phi}^{n'}_i, \tilde{\phi}^{n'}_j \rangle \right| = O(n^{-1/(m+4)}),$$
where $\tilde{\phi}^n_i = R_n \phi^n_i$ denotes the projected embeddings. This bounded kernel mismatch ensures learned parameters transfer across discretizations (discretization invariance).

The proof proceeds in three stages corresponding to the key technical components.

A.1.1. Stage 1: Spectral Convergence and Green's Function

Proposition A.2. Let $L_n$ be the normalized graph Laplacian of a graph $G_n$ obtained by sampling $n$ nodes from a compact $m$-dimensional Riemannian manifold $\mathcal{M}$. Let $\lambda^n_k, u^n_k$ be the eigenvalues and eigenvectors of $L_n$, and $\mu_k, \psi_k$ the eigenvalues and eigenfunctions of the Laplace-Beltrami operator $\Delta_{\mathcal{M}}$ on $\mathcal{M}$. Then:

(a) Spectral convergence: $|\lambda^n_k - \mu_k| = O(n^{-1/(m+4)})$ and $\|u^n_k - \psi_k\|_{L^2} = O(n^{-1/(m+4)})$ (up to log factors).

(b) Green's function convergence: The inner products of Laplacian eigenmaps converge to the Green's function:
$$\langle \phi^n_i, \phi^n_j \rangle = \sum_{\lambda^n_k > 0} \frac{1}{\lambda^n_k} (u^n_k)_i (u^n_k)_j \to G_{\mathcal{M}}(x_i, x_j) + O(n^{-1/(m+4)}),$$
where $G_{\mathcal{M}}$ is the Green's function of $\Delta_{\mathcal{M}}$ on $\mathcal{M}$.

Proof. Part (a) follows from the spectral convergence theory for graph Laplacians on manifolds (Calder & García Trillos, 2022). With appropriate graph construction, both eigenvalues and eigenvectors of $L_n$ converge to those of $\Delta_{\mathcal{M}}$ at rate $n^{-1/(m+4)}$ (up to log factors). See Calder & García Trillos (2022) for complete error estimates.

Part (b) follows from part (a) by the spectral theorem. The discrete Green's function (the pseudoinverse of the graph Laplacian) is given by the eigenfunction expansion:
$$\langle \phi^n_i, \phi^n_j \rangle = \sum_{\lambda^n_k > 0} \frac{1}{\lambda^n_k} (u^n_k)_i (u^n_k)_j.$$
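As a quick numerical sanity check of this identity (illustrative only, not part of the proof; we use the combinatorial Laplacian of a small hypothetical cycle graph for simplicity), the eigenfunction expansion coincides with the Moore-Penrose pseudoinverse of the graph Laplacian:

```python
import numpy as np

# Hypothetical example: combinatorial Laplacian of an 8-node cycle graph.
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Eigenmaps: coordinate k of phi_i is (u_k)_i / sqrt(lambda_k) over lambda_k > 0,
# so that <phi_i, phi_j> = sum_{lambda_k > 0} (1/lambda_k) (u_k)_i (u_k)_j.
lam, U = np.linalg.eigh(L)
pos = lam > 1e-10
Phi = U[:, pos] / np.sqrt(lam[pos])   # rows are the eigenmaps phi_i

gram = Phi @ Phi.T                    # all pairwise inner products <phi_i, phi_j>
green = np.linalg.pinv(L)             # discrete Green's function: pseudoinverse of L

assert np.allclose(gram, green)       # the expansion equals the pseudoinverse
```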
Using the convergence result from part (a), this sum converges to the continuum Green's function $G_{\mathcal{M}}(x_i, x_j) = \sum_{k=1}^{\infty} \frac{1}{\mu_k} \psi_k(x_i) \psi_k(x_j)$ at rate $O(n^{-1/(m+4)})$.

A.1.2. Stage 2: Random Projection Error

Proposition A.3. Let $R \in \mathbb{R}^{r \times N}$ be a random projection with $r = O(\log(N)/\varepsilon^2)$ constructed via FastRP. For any vectors $v, w \in \mathbb{R}^N$,
$$P\big( |\langle Rv, Rw \rangle - \langle v, w \rangle| \le \varepsilon \|v\| \|w\| \big) \ge 1 - 2e^{-c\varepsilon^2 r}.$$

Proof. This follows from the Johnson-Lindenstrauss Lemma (Dasgupta & Gupta, 2003). The key properties:

1. Random projections with $r = O(\log(N)/\varepsilon^2)$ distort distances by at most a factor $(1 \pm \varepsilon)$ with high probability.

2. For inner products, since distances are preserved, we have
$$\langle Rv, Rw \rangle = \tfrac{1}{2}\big( \|Rv\|^2 + \|Rw\|^2 - \|Rv - Rw\|^2 \big) \approx \tfrac{1}{2}\big( \|v\|^2 + \|w\|^2 - \|v - w\|^2 \big) = \langle v, w \rangle.$$

3. FastRP specifically uses sparse random matrices that maintain these guarantees while enabling efficient computation (Chen et al., 2019).

See Dasgupta & Gupta (2003) and Chen et al. (2019) for details on FastRP.

A.1.3. Stage 3: Gauge Invariance

Proposition A.4. GIST's learned parameters $\theta$ (projection matrix, transformer weights) do not depend on arbitrary gauge choices (sign flips, rotations) in the spectral decomposition, because GIST's computations depend only on gauge-invariant quantities.

Proof. GIST's core attention mechanism is
$$\alpha_{ij} = \mathrm{softmax}\!\left( \frac{\langle \tilde{\phi}_i, \tilde{\phi}_j \rangle}{\sqrt{r}} \right), \qquad \tilde{\phi}_i = R \phi_i,$$
where $R$ is the random projection matrix. The key insight is that while $\tilde{\phi}_i$ depends on $R$ (the arbitrary gauge choice), the inner products $\langle \tilde{\phi}_i, \tilde{\phi}_j \rangle$ do not (in the limit):
$$\langle \tilde{\phi}_i, \tilde{\phi}_j \rangle = \phi_i^\top (R^\top R) \phi_j \approx \phi_i^\top \phi_j,$$
by Johnson-Lindenstrauss. Thus $\alpha_{ij}$ converges to a gauge-invariant quantity (the continuum Green's function kernel) independent of the choice of $R$.
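Both approximations can be checked numerically. The sketch below (ours, on a hypothetical random geometric graph with the combinatorial Laplacian for simplicity) applies a gauge transformation, a sign flip of each Laplacian eigenvector, and verifies that the Gram matrix of the eigenmaps is exactly unchanged, while a Gaussian random projection with $E[R^\top R] = I$ preserves it approximately, with error shrinking as $r$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: Laplacian eigenmaps of a small random geometric graph.
n = 60
pts = rng.standard_normal((n, 3))
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
A = (d2 < 0.7).astype(float)                 # connect nearby points
np.fill_diagonal(A, 0.0)
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)
pos = lam > 1e-8
Phi = U[:, pos] / np.sqrt(lam[pos])          # eigenmaps: rows phi_i

# Gauge transformation: arbitrary sign flip of each eigenvector.
signs = rng.choice([-1.0, 1.0], size=Phi.shape[1])
Phi_gauge = Phi * signs

# Stage 3: inner products are exactly invariant under the gauge change.
assert np.allclose(Phi @ Phi.T, Phi_gauge @ Phi_gauge.T)

# Stage 2: a random projection preserves them approximately (JL).
r = 2048
R = rng.standard_normal((r, Phi.shape[1])) / np.sqrt(r)   # E[R^T R] = I
G_exact = Phi @ Phi.T
G_proj = (Phi @ R.T) @ (R @ Phi.T)           # <R phi_i, R phi_j>
rel_err = np.abs(G_proj - G_exact).max() / np.abs(G_exact).max()
assert rel_err < 0.5                         # loose bound; shrinks as r grows
```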
Since all downstream computations operate on gauge-invariant quantities, the learned parameters $\theta$ do not encode any information about the specific gauge choice in the eigenvector decomposition.

A.1.4. Discretization Mismatch Error Analysis

Combining all three stages, we establish the discretization mismatch error bound stated in Proposition A.1 (ii) (Gao et al., 2025). For two discretizations $G_n$ and $G_{n'}$ of the same manifold $\mathcal{M}$, the attention kernel mismatch is bounded by the triangle inequality:
$$\left| \langle \tilde{\phi}^n_i, \tilde{\phi}^n_j \rangle - \langle \tilde{\phi}^{n'}_i, \tilde{\phi}^{n'}_j \rangle \right| \le \left| \langle \tilde{\phi}^n_i, \tilde{\phi}^n_j \rangle - G_{\mathcal{M}}(x_i, x_j) \right| + \left| G_{\mathcal{M}}(x_i, x_j) - \langle \tilde{\phi}^{n'}_i, \tilde{\phi}^{n'}_j \rangle \right|.$$
Each term on the right decomposes into a spectral convergence error (Stage 1) and a random projection error (Stage 2), yielding the total bound $O(n^{-1/(m+4)})$, where $n$ is the coarser discretization. For typical manifold dimensions, random projection errors decay faster than spectral convergence errors, so the latter dominate the discretization mismatch.

Algorithm 1 Spectral Embeddings via FastRP (Chen et al., 2019)
Require: Graph adjacency matrix $A \in \mathbb{R}^{N \times N}$, embedding dimensionality $r$, iteration power $k$
Ensure: Matrix of $N$ node graph positional embeddings $\Phi \in \mathbb{R}^{N \times r}$
1: Produce a very sparse random projection $R \in \mathbb{R}^{N \times r}$ according to Li et al. (2006)
2: $P \leftarrow D^{-1} \cdot A$, the random-walk transition matrix, where $D$ is the degree matrix
3: $\Phi_1 \leftarrow P \cdot R$
4: for $i = 2$ to $k$ do
5:   $\Phi_i \leftarrow P \cdot \Phi_{i-1}$
6: end for
7: $\Phi \leftarrow \Phi_1 + \Phi_2 + \cdots + \Phi_k$
8: return $\Phi$

A.2. Pseudo-code

Note that in the pseudocode we use bold notation for matrices and vectors ($\mathbf{A}$, $\mathbf{\Phi}$, $\mathbf{Q}$) and follow the row-vector convention standard in machine learning: $\mathbf{\Phi} \in \mathbb{R}^{N \times r}$ has nodes as rows and embedding dimensions as columns.
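For reference, Algorithm 1 can be transcribed almost line-by-line in NumPy/SciPy. This is an illustrative sketch only: the function name and the guard for isolated nodes are ours, and we omit the additional normalization weights of the full FastRP method of Chen et al. (2019), which Algorithm 1 also leaves implicit:

```python
import numpy as np
from scipy import sparse

def fastrp_embeddings(A, r=128, k=3, s=3.0, seed=0):
    """Sketch of Algorithm 1: graph positional embeddings via FastRP."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    # Step 1: very sparse random projection of Li et al. (2006):
    # entries +/- sqrt(s) with probability 1/(2s) each, zero otherwise.
    R = sparse.csr_matrix(
        rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)], size=(N, r),
                   p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]))
    # Step 2: random-walk transition matrix P = D^{-1} A
    # (guarding isolated nodes, which Algorithm 1 implicitly excludes).
    deg = np.asarray(A.sum(axis=1)).ravel()
    P = sparse.diags(1.0 / np.maximum(deg, 1.0)) @ A
    # Steps 3-7: accumulate Phi = P R + P^2 R + ... + P^k R.
    Phi_i = P @ R
    Phi = Phi_i.copy()
    for _ in range(2, k + 1):
        Phi_i = P @ Phi_i
        Phi = Phi + Phi_i
    return np.asarray(Phi.todense())

# Usage on a tiny hypothetical star graph:
A = sparse.csr_matrix(np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float))
Phi = fastrp_embeddings(A, r=16, k=2)
assert Phi.shape == (3, 16)
```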
In the main text, we use non-bold notation for compactness, with φ_i ∈ R^r representing individual column vectors and upper-case characters denoting matrices. Below we provide pseudo-code for the core computations of GIST: the Gauge-Invariant Spectral Self-Attention block and the Gauge-Equivariant Spectral Self-Attention block. For illustration purposes, we compare the algorithms to a stripped-down implementation of self-attention. Modifications relative to vanilla self-attention are indicated inline, with the vanilla assignment shown before an arrow (→) and its GIST replacement after it.

Algorithm 2 Gauge-Invariant Spectral Self-Attention
Require: Node feature tokens X ∈ R^{N×d}, graph positional embeddings Φ ∈ R^{N×r}
Ensure: Output sequence O ∈ R^{N×d} to be applied to features X
1: // Compute attention matrices
2: Q ← X·W_Q where W_Q ∈ R^{d×d}  →  Q ← Φ
3: K ← X·W_K where W_K ∈ R^{d×d}  →  K ← Φ
4: V ← X·W_V where W_V ∈ R^{d×d}
5: // Compute linear attention with feature map φ(x) = ReLU(x)
6: Q̃, K̃ ← φ(Q), φ(K)
7: S ← K̃^⊤ V  {Compute key-value matrix: R^{r×d}}
8: Z ← 1/(Q̃(K̃^⊤ 1_N) + ϵ)  {Normalization factors: R^N}
9: O ← (Q̃S) ⊙ Z  {Normalized output: element-wise product}
10: return O

Note on the choice of Feature Map φ. While in Algorithm 3 we do not need to impose that restriction, in Algorithm 2 we use the feature map φ(x) = ReLU(x). This choice is theoretically motivated: when applied element-wise to random features, ReLU induces the arc-cosine kernel (Cho & Saul, 2009). Specifically, for vectors a, b ∈ R^r, the inner product ⟨φ(a), φ(b)⟩ converges (as r → ∞) to a kernel function k_0(a^⊤b) that depends only on the inner product a^⊤b.
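Algorithm 2 can be rendered directly in NumPy. A minimal single-head sketch (the tiny dimensions and random inputs are illustrative only):

```python
import numpy as np

def gauge_invariant_spectral_attention(X, Phi, W_V, eps=1e-6):
    """Algorithm 2: linear attention where queries and keys are the spectral embeddings Phi."""
    relu = lambda x: np.maximum(x, 0.0)       # feature map phi(x) = ReLU(x)
    Q_t, K_t = relu(Phi), relu(Phi)           # Q <- Phi, K <- Phi (lines 2-3, 6)
    V = X @ W_V                               # line 4
    S = K_t.T @ V                             # key-value matrix, shape (r, d)
    Z = 1.0 / (Q_t @ K_t.sum(axis=0) + eps)   # normalization factors, shape (N,)
    return (Q_t @ S) * Z[:, None]             # normalized output, shape (N, d)

rng = np.random.default_rng(0)
N, d, r = 6, 4, 8
X, Phi = rng.standard_normal((N, d)), rng.standard_normal((N, r))
W_V = rng.standard_normal((d, d))
O = gauge_invariant_spectral_attention(X, Phi, W_V)
print(O.shape)  # (6, 4)
```

Note that the N×N attention matrix is never materialized: S and Z are computed first, giving the O(N) scaling per layer discussed in Section A.4 of this appendix.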
This property is crucial for preserving gauge invariance: since the attention weights are computed as ⟨φ(φ̃_i), φ(φ̃_j)⟩ ≈ k_0(φ̃_i^⊤ φ̃_j), and φ̃_i^⊤ φ̃_j ≈ φ_i^⊤ φ_j by Johnson-Lindenstrauss (as established in Section A), the resulting attention pattern depends only on gauge-invariant inner products between the spectral embeddings.

We note that the original linear attention work by Katharopoulos et al. (2020) used φ(x) = elu(x) + 1. Empirically, this feature map also tends to work well in practice, and it is similar to ReLU in producing non-negative outputs. However, it is not known to correspond to any particular kernel function, and thus the theoretical guarantee of gauge invariance via kernel structure does not apply. Investigating other feature maps corresponding to different kernel functions (e.g., polynomial kernels, random Fourier features for RBF-like kernels) is left for future work.

Algorithm 3 Gauge-Equivariant Spectral Self-Attention
Require: Node feature tokens X ∈ R^{N×d}, graph positional embeddings Φ ∈ R^{N×r}
Ensure: Output sequence O ∈ R^{N×r} to be applied to graph positional embeddings Φ
1: // Compute attention matrices
2: Q ← X·W_Q where W_Q ∈ R^{d×d}
3: K ← X·W_K where W_K ∈ R^{d×d}
4: V ← X·W_V where W_V ∈ R^{d×d}  →  V ← Φ
5: // Compute linear attention (Katharopoulos et al., 2020)
6: Q̃, K̃ ← φ(Q), φ(K)
7: S ← K̃^⊤ V  {Compute key-value matrix: R^{d×r}}
8: Z ← 1/(Q̃(K̃^⊤ 1_N) + ϵ)  {Normalization factors: R^N}
9: O ← (Q̃S) ⊙ Z  {Normalized output: element-wise product}
10: return O

A.3.
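The kernel convergence invoked in the feature-map note above can be checked by Monte Carlo. The closed form below is the degree-1 arc-cosine kernel of Cho & Saul (2009), which is the one induced by ReLU; matching it to the k_0 notation of the note is our assumption, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 200_000
x, y = rng.standard_normal(n), rng.standard_normal(n)

# Monte Carlo estimate of E_w[ReLU(w.x) ReLU(w.y)] over Gaussian w,
# i.e. (1/r) <ReLU(Wx), ReLU(Wy)> for a large random feature matrix W
W = rng.standard_normal((r, n))
mc = np.maximum(W @ x, 0.0) @ np.maximum(W @ y, 0.0) / r

# Closed-form degree-1 arc-cosine kernel: depends only on norms and the angle
cos_t = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
closed = (np.linalg.norm(x) * np.linalg.norm(y) / (2 * np.pi)
          * (np.sin(theta) + (np.pi - theta) * np.cos(theta)))
print(mc, closed)  # the two values should agree to within a few percent
```

Since the closed form depends on x and y only through their norms and inner product, the induced attention weights inherit the gauge invariance argued above.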
GIST Hyperparameter Robustness

To study the sensitivity of GIST's performance to variations in its spectral embedding hyperparameters, we train multiple simplified GIST architectures (two-block Gauge-Invariant Spectral Self-Attention linear transformers) on the Cora benchmark while varying the power iteration parameter k and the embedding dimension r in the FastRP approximation. These parameters directly control the quality of the spectral embeddings while balancing computational efficiency.

As shown in Figure 2, GIST exhibits robust performance across a wide range of both parameters. The left panel sweeps k with r fixed, revealing a clear but relatively shallow peak in accuracy around the optimal value of k ≈ 32. On the one hand, this suggests that even modest iteration counts are sufficient to capture the essential spectral structure; on the other, it indicates that the choice of power iteration count is quite robust. The right panel varies the embedding dimension r with k fixed, demonstrating a smooth, monotonic improvement as r increases. Crucially, saturation occurs relatively quickly: performance gains beyond r = 256 are marginal, validating our choice of moderate embedding dimensions that maintain computational efficiency.

These results empirically validate two important properties: (1) GIST does not require extensive hyperparameter tuning around these spectral parameters, suggesting stable generalization; and (2) the linear end-to-end complexity achieved with modest k and r values is both computationally practical and empirically effective. Combined with the gauge-invariance guarantees that prevent dependence on arbitrary spectral choices, these hyperparameters provide a principled and empirically robust way to control the approximation quality of the spectral embeddings without sacrificing scalability.

A.4.
GIST Scalability Study

From a computational standpoint, the end-to-end cost of GIST is dominated by two components: spectral embedding generation and the subsequent transformer blocks. For the former, we employ a FastRP-style approximation in which the Laplacian spectral information is captured via repeated multiplication of a sparse random-walk matrix with a low-dimensional random projection. Each power iteration requires O(|E|r) operations, where |E| is the number of edges and r is the embedding dimension, and the total cost over k iterations is O(k|E|r). On meshes and graphs with bounded average degree, |E| = O(N), so the overall spectral embedding stage scales linearly in the number of nodes N. This embedding is computed once per graph and then reused across all GIST layers, so its cost is amortized over the full network depth.

The GIST layers themselves preserve this linear scaling. Each block combines: (i) a feature branch based on linear attention, (ii) a local branch using graph convolution followed by linear attention, and (iii) a global branch using Gauge-Invariant and Gauge-Equivariant Spectral Self-Attention followed by linear attention. In all cases, attention is implemented in the form φ(Q)(φ(K)^⊤V) with an element-wise feature map φ(·), which yields O(Nd²) complexity for d-dimensional features instead of the O(N²d) cost of quadratic attention. Together with the O(N) spectral embedding stage, this results in an overall complexity of O(N(d² + rk)) per forward pass, i.e., linear in the number of nodes.

The empirical VRAM and wall-clock measurements in Figure 3 corroborate this analysis: both memory usage and forward time grow approximately linearly with the number of graph nodes for all hidden dimensions, up to graphs with hundreds of thousands of nodes sampled from DrivAerNet.
[Figure 2: Graph Embeddings Parameter Sensitivity (Cora Dataset). Panel (a): Max Power Iteration Sweep (test accuracy vs. k); panel (b): Embedding Dimension Sweep (test accuracy vs. r).]

Figure 2. Sensitivity study of GIST spectral embedding parameters. The plots show the final test accuracy of a two-block Gauge-Invariant Spectral Self-Attention linear transformer trained on Cora while sweeping over the power iteration parameter k with r = 256 (left panel) and over the embedding dimension r with k = 32 (right panel). Test accuracy is fairly robust around the best value of either parameter. As expected, r is monotonically related to higher performance, as higher r corresponds to a better approximation of the eigenmaps. Accuracy conveniently saturates relatively fast, justifying the use of reasonably low r. The plots show mean test accuracy averaged across 10 seeds, with the corresponding standard deviation as error bars.

A.5. Multi-Scale Architecture Ablation Study

To validate the design choices in the Multi-Scale GIST architecture, we systematically ablate each of the three parallel branches shown in Figure 1 (right panel) on the PPI dataset, where the full architecture achieves state-of-the-art performance (see Table 3). For each ablation, we train with identical hyperparameters but remove one branch: (1) feature processing, (2) local graph convolution, or (3) global spectral attention. All experiments are repeated across 20 random seeds.

Table 5 shows that all three branches contribute meaningfully, with performance drops ranging from 4.29% to 7.90%. The local branch (Branch 2) has the strongest impact (−7.90%), validating the EfficientViT-inspired design principle that local operations provide focused information complementing the diffuse patterns from linear attention. The global spectral branch (Branch 3, −4.29%) confirms that long-range dependencies are essential, while the feature branch (Branch 1, −5.04%) provides a complementary signal beyond structural information. Overall, these results demonstrate that Multi-Scale GIST effectively integrates complementary information sources.

Table 5. Ablation study on the PPI dataset showing the contribution of each branch of the Multi-Scale GIST (see Figure 1, right panel). Test accuracy is reported as a percentage of baseline performance. Results are averaged over 10 seeds with standard deviations indicated as uncertainty intervals.

Ablation             Test Accuracy (% baseline)   ∆ (%)
No ablation          100.0                         0.0
Branch 1 (feature)    95.0 ± 2.8                  −5.0
Branch 2 (local)      92.1 ± 1.8                  −7.9
Branch 3 (global)     95.7 ± 3.7                  −4.3

[Figure 3: Scalability curves. Left: GIST VRAM usage (GB) vs. number of graph nodes; right: forward + backward pass time (s) vs. number of graph nodes, for hidden_dim = 64, 128, 256.]

Figure 3. Scalability study of GIST. All experiments use a fixed 3-layer model while varying the hidden dimensionality. VRAM consumption was measured as a function of the number of nodes in the input graph. Graph sizes were controlled using random node dropout applied to samples from the DrivAerNet dataset, enabling a systematic evaluation of memory scaling behavior.
