Physics-Informed Spatial-Temporal Transformer for Terahertz Near-Field Beam Tracking

Zhi Zeng, Student Member, IEEE, Chong Han, Senior Member, IEEE, and Emil Björnson, Fellow, IEEE

Abstract: Terahertz (THz) ultra-massive multiple-input multiple-output (UM-MIMO) promises ultra-high throughput, while its highly directional beams demand rapid and accurate beam tracking driven by precise user-state estimation. Moreover, large array apertures at high frequencies induce near-field propagation effects, where far-field modeling becomes inaccurate and near-field parametric channel estimation is costly. Bypassing near-field codebook, PAST-TT is proposed to bridge near-field tracking with low-overhead far-field codebook probing by exploiting parallax, amplified by widely spaced subarrays. With comb-type frequency-division multiplexing pilots, each subarray yields frequency-affine phase signatures whose frequency and temporal increments encode propagation delay and its variation between frames. Building on these signatures, a Parallax-Aware Spatial Transformer (PAST) compresses them and outputs per-frame position estimates with token reliability to downweight bad frames, regularized by a physics-in-the-loop consistency loss. A causal Temporal Transformer (TT) then performs reliability-aware filtering and prediction over a sliding window to initialize the beam of the next frame. Acting on short token sequences, PAST-TT avoids a monolithic spatial-temporal network over raw pilots, which keeps the model lightweight with a critical path latency of 0.61 ms. Simulations show that at 15 dB signal-to-noise ratio, PAST achieves 7.81 mm distance RMSE and 0.0588° angle RMSE. Even with a bad-frame rate of 0.1, TT reduces the distance and angle prediction RMSE by 23.1% and 32.8% compared with the best competing tracker.

Index Terms: Terahertz communications, near-field, beam tracking, physics-informed deep learning.

An earlier version of this paper will be presented in part at the IEEE ICASSP, May 2026 [1].
Zhi Zeng is with the Terahertz Wireless Communications (TWC) Laboratory, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: zhi.zeng@sjtu.edu.cn).
Chong Han is with the Terahertz Wireless Communications (TWC) Laboratory and also the Cooperative Medianet Innovation Center (CMIC), School of Information Science and Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: chong.han@sjtu.edu.cn).
Emil Björnson is with the Department of Communication Systems, KTH Royal Institute of Technology, 100 44 Stockholm, Sweden (e-mail: emilbjo@kth.se).

I. INTRODUCTION

Terahertz (THz) communications have emerged as an attractive component for future wireless systems, as the ultra-broad spectrum from 0.1 THz to 10 THz provides abundant bandwidth and opens the door to extreme-capacity wireless links [2]. Nevertheless, the notable path loss in the THz band can affect the link budget and limit coverage [3]. Fortunately, the short wavelength enables dense integration and ultra-large apertures, making ultra-massive multiple-input multiple-output (UM-MIMO) a natural companion to THz for harvesting substantial beamforming gains and compensating for the severe propagation loss [4]. To make THz UM-MIMO practical, hybrid beamforming (HBF) with a limited number of RF chains is typically adopted for hardware efficiency, which constrains how many beams can be probed simultaneously and makes low-overhead training imperative [5], [6]. However, the resulting pencil beams are highly sensitive to mobility and channel dynamics, and even modest pointing errors can cause severe throughput degradation [7]–[9]. Consequently, accurate beam alignment and tracking become essential, requiring frequent beam updates based on the time-varying user state under tight per-frame latency and training overhead constraints.
Furthermore, this alignment requirement becomes more challenging because large array apertures at high carrier frequencies confine THz links into the radiative near-field (NF), where the conventional far-field (FF) plane-wave model (PWM) is inaccurate due to non-negligible wavefront curvature, whereas element-wise spherical-wave model (SWM) can be computationally prohibitive for UM-MIMO [8]–[12]. Additionally, in THz links with a dominant line-of-sight (LoS) path, NF beams depend jointly on angle and distance, which substantially enlarges the search space for beam management and makes joint angle-distance sweeping prohibitively expensive for low-latency tracking [10], [13]. Meanwhile, practical training observations can be unreliable due to intermittent blockage and other impairments, further destabilizing online tracking if not properly detected and handled.

Taken together, these considerations underscore an urgent need for NF beam tracking mechanisms that simultaneously achieve high accuracy, strong robustness to occasional unreliable frames, and low control latency under practical THz UM-MIMO architectures.

A. Related Work

Extremely large apertures at THz bands move a non-negligible portion of propagation from purely FF into the radiative NF, where wavefront curvature breaks the FF plane-wave abstraction and makes beamforming and channel acquisition jointly depend on angle and distance [8]–[12]. Accordingly, extensive efforts revisit beam training and alignment under NF channel models and polar-domain representations that parameterize paths by direction and distance [8]–[12], [14], [15]. To reduce the prohibitive two-dimensional search burden, hierarchical training and structured NF codebooks have emerged, including spatial-chirp [16] and multi-stage refinement strategies that progressively narrow down the candidate region in the angle-distance domain [11], [14].
Related works also consider staged refinement that first leverages FF-like probing to confine angular candidates and then applies NF-aware processing to refine distance parameters, aiming to balance modeling accuracy and training complexity [17]. In addition, polar-domain codebook size and search overhead can scale sharply with the number of antennas, mobility and HBF constraints, which becomes particularly challenging for per-frame beam updates under tight control-latency budgets [15]. Beyond one-shot alignment, NF beam tracking has also started to attract attention, for example, by combining kinematic modeling with recursive estimators under hybrid architectures [12]. Nevertheless, many NF beam-management approaches still rely on searching over a large polar-domain candidate set to resolve distance-dependent focusing [11], [15]. This leaves a gap for fast beam management: extracting NF geometric evidence under a lightweight probing budget without committing to NF codebook sweeping.

In practical THz UM-MIMO deployments, unified probing over a wide distance range is favored, motivating NF tracking designs to operate with FF discrete Fourier transform (DFT) codebook probing [18]. Specifically, several works revisit the long-standing assumption that FF codebooks are unsuitable in the NF, and show that FF DFT probing can still expose distance-sensitive signatures through beam-response patterns, enabling NF alignment with reduced overhead [8], [9], [19]. In parallel, widely-spaced multi-subarray (WSMS) architectures have been advocated as practical THz front-ends that preserve high per-subarray gain while exploiting inter-subarray phase diversity, thereby enriching NF geometric information compared to co-located arrays [6]. Performance analyses further quantify such benefits by relating angle-distance accuracy to the multi-view aperture created by the subarray layout [20].
Closely related, position- or geometry-aided beam management leverages user-state information to reduce beam selection overhead in highly directional systems [21].

Despite these advances, existing solutions are often developed for one-shot alignment or isolated estimation tasks, and it remains underexplored how to convert low-overhead probing outcomes into a persistent and trackable measurement representation that can be robustly consumed by an online tracker under intermittent impairments. This motivates a second gap: a structured handoff from FF probing to NF tracking that carries forward geometry evidence together with a principled notion of measurement reliability.

After initial access, beam tracking aims to maintain alignment for mobile users with minimal overhead by exploiting temporal correlation in positions, path gains, or other low-dimensional channel parameters [22]. Kalman-type trackers perform recursive prediction and correction using sounding-beam observations, and robust extensions add realignment triggers and outlier rejection to mitigate loss of track under blockage and intermittent observations [23]–[26]. In parallel, deep learning (DL) has been increasingly explored for beam tracking, where sequence models map historical pilots or channel state information to future beams [27]. Beyond end-to-end predictors, learning-augmented filtering preserves the recursive Bayesian prediction-update structure, while using neural networks to learn key components that are difficult to model accurately, such as the Kalman gain [28]. More recently, DL has also been applied to NF and UM-MIMO beam training and tracking, typically by CNN-assisted hierarchical training and RNN-based tracking coupled with sweeping [29], [30]. Nevertheless, many learning-based designs still treat measurements as high-dimensional black box inputs or rely on additional probing to explore the joint angle-distance space.
While these methods can improve accuracy in challenging dynamics, online deployment in highly directional THz systems still requires strict causality, robustness against sporadic bad frames and tight inference latency within millisecond-scale control loops. Moreover, much of the tracking literature is rooted in FF angle-only abstractions [24], [25], whereas NF tracking involves angle-distance coupling and regime-dependent measurement structures. Therefore, a third gap remains underexplored: an end-to-end NF tracking loop that is explicitly reliability-aware and latency-conscious under practical THz beam management budgets.

B. Contributions

In this work, we enable THz NF beam tracking through parallax under lightweight FF DFT probing. To form a tracking-ready interface from comb-type frequency-division multiplexing (FDM) pilots, we design a parallax-aware spatial transformer (PAST) for one-shot NF localization and reliability-aware tokenization. Then, we develop a causal temporal transformer (TT) to close the online beam tracking loop, leading to PAST-TT. The main contributions are summarized as follows.

• We establish a parallax bridge that enables NF tracking under a lightweight FF probing budget. Under a widely-spaced-subarray architecture with comb-type FDM pilots, we consider a hybrid spherical- and planar-wave channel model (HSPM). The phase signature of each subarray exhibits an approximately frequency-affine structure, and its adjacent-tone and inter-frame increments encode the propagation delay and its variation between frames. This leads to structured space-frequency-time evidence that supports low-overhead geometric inference without exhaustive joint angle-distance sweeping.

• We propose the PAST as a physics-informed measurement encoder with explicit reliability-aware tokenization.
Leveraging the above evidence structure, high-dimensional pilots are distilled into compact physical tokens that preserve parallax and provide reliability statistics, suppressing intermittently corrupted observations without relying on fragile phase unwrapping. Geometry-biased, reliability-gated attention and a physics-in-the-loop consistency regularizer jointly improve the localization accuracy and measurement quality for each frame.

• We develop the TT as a causal reliability-aware filter-predictor to close the tracking loop under bad frames. Using the per-frame tokens from PAST, TT performs reliability-injected masked attention over a sliding window and outputs a filtered state and a one-step prediction, where the prediction directly initializes the next-frame probing beam under the same FF DFT codebook. An increment descriptor and a temporal physics loop further mitigate error propagation under bad frames.

• We evaluate the performance of PAST-TT compared with other representative methods. We carry out extensive simulations across SNRs, bad-frame rates and mobility. The results demonstrate that PAST-TT outperforms the compared solutions with improved estimation accuracy and robustness under intermittently corrupted observations, leading to higher spectral efficiency. Complexity analysis and GPU latency measurement further confirm that its critical-path latency at each frame stays below the frame interval and enables real-time tracking.

Fig. 1. THz UM-MIMO system model with fixed BS and mobile UE.

The remainder of this paper is organized as follows. Sec. II presents the system model with the parallax-aware NF channel and signal structure, and formulates the problem. Sec. III introduces PAST for per-frame localization and reliability-aware evidence extraction (an earlier version of the NF localization part was presented in [1]). Sec. IV develops the causal TT for online filtering and one-step prediction.
Sec. V evaluates the performance of the proposed methods. Finally, Sec. VI concludes the paper.

II. SYSTEM OVERVIEW AND PROBLEM FORMULATION

In this section, we first introduce the system model of THz UM-MIMO, followed by the HSPM with parallax effects and the observation signal structure. Then, we formulate the per-frame localization and online beam tracking problems.

A. System Model

As illustrated in Fig. 1, we consider a THz downlink UM-MIMO system where a base station (BS) with widely-spaced subarrays serves a user equipment (UE) in the radiative NF region. The enlarged inter-subarray spacing not only improves the spatial multiplexing capability, but more critically, amplifies the geometric parallax effect that the same UE position induces distinct propagation distances and angles across subarrays. Specifically, the BS employs $N_t$ transmit antennas designed as $K$ uniform planar arrays (UPAs) on the x-z plane. Each subarray contains $N_s = N_t/K$ antennas with inter-element spacing $d = \lambda/2$ at central carrier wavelength $\lambda$, and the inter-subarray spacing is much larger than $d$. The UE is equipped with a UPA of $N_r$ antennas with spacing $d$.

To reduce the hardware overhead, both the BS and UE adopt HBF structure. The BS uses a sub-connected architecture with $L_t = K$ RF chains [6], therefore, the analog precoder $\mathbf{F}_{\mathrm{RF}} \in \mathbb{C}^{N_t \times L_t}$ takes a block-diagonal form as

$$\mathbf{F}_{\mathrm{RF}} = \mathrm{blkdiag}\left(\mathbf{f}_{\mathrm{RF},1}, \mathbf{f}_{\mathrm{RF},2}, \ldots, \mathbf{f}_{\mathrm{RF},K}\right), \tag{1}$$

where $\mathbf{f}_{\mathrm{RF},k} \in \mathbb{C}^{N_s}$ denotes the analog precoding vector of the $k$th subarray for $k = 1, \ldots, K$. Since the analog precoder is implemented by phase shifters, each element in $\mathbf{f}_{\mathrm{RF},k}$ satisfies the constant-modulus constraint $|\mathbf{f}_{\mathrm{RF},k}(i)| = 1/\sqrt{N_s}$. At the UE, a fully-connected architecture is adopted, where each of the $L_r$ RF chains is connected to all $N_r$ antennas through phase shifters. The analog combiner $\mathbf{W}_{\mathrm{RF}} \in \mathbb{C}^{N_r \times L_r}$ also obeys the constant-modulus constraint.
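The block-diagonal structure in (1) can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: it assumes each subarray is a one-dimensional array of `N_S` elements (the paper uses UPAs), and the helper names `dft_beam` and `analog_precoder` as well as the sizes are hypothetical.

```python
import numpy as np

def dft_beam(n_s, idx):
    """Column `idx` of an n_s-point DFT codebook, scaled so |entry| = 1/sqrt(n_s)."""
    n = np.arange(n_s)
    return np.exp(-2j * np.pi * n * idx / n_s) / np.sqrt(n_s)

def analog_precoder(beams):
    """Place the per-subarray beams f_RF,k on the block diagonal, as in (1)."""
    num_sub = len(beams)
    n_s = beams[0].size
    F = np.zeros((num_sub * n_s, num_sub), dtype=complex)
    for k, f in enumerate(beams):
        # RF chain k drives only the antennas of subarray k (sub-connected HBF).
        F[k * n_s:(k + 1) * n_s, k] = f
    return F

K, N_S = 4, 16  # illustrative sizes, not taken from the paper
F_RF = analog_precoder([dft_beam(N_S, k) for k in range(K)])
```

Because every nonzero entry has modulus $1/\sqrt{N_s}$ and the columns have disjoint supports, the columns of `F_RF` are orthonormal in this sketch.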
The system operates with orthogonal frequency division multiplexing (OFDM) using $M$ subcarriers with spacing $\Delta f$. At the $m$th subcarrier, the BS transmits an $N_{st}$-dimensional symbol vector $\mathbf{s}[m] \in \mathbb{C}^{N_{st}}$. It is first processed by a digital baseband precoder $\mathbf{F}_{\mathrm{BB}}[m] \in \mathbb{C}^{L_t \times N_{st}}$ and then by the analog precoder $\mathbf{F}_{\mathrm{RF}}$. After that, it is propagated through the wideband THz channel $\mathbf{H}[m] \in \mathbb{C}^{N_r \times N_t}$. The UE applies the analog combiner $\mathbf{W}_{\mathrm{RF}}$ and a digital combiner $\mathbf{W}_{\mathrm{BB}}[m] \in \mathbb{C}^{L_r \times N_{sr}}$ to form the $N_{sr}$-dimensional baseband output as

$$\mathbf{y}[m] = \mathbf{W}_{\mathrm{eq}}^{H}[m]\,\mathbf{H}[m]\,\mathbf{F}_{\mathrm{eq}}[m]\,\mathbf{s}[m] + \mathbf{W}_{\mathrm{eq}}^{H}[m]\,\mathbf{n}[m], \tag{2}$$

where $\mathbf{F}_{\mathrm{eq}}[m] = \mathbf{F}_{\mathrm{RF}}\mathbf{F}_{\mathrm{BB}}[m] \in \mathbb{C}^{N_t \times N_{st}}$ and $\mathbf{W}_{\mathrm{eq}}[m] = \mathbf{W}_{\mathrm{RF}}\mathbf{W}_{\mathrm{BB}}[m] \in \mathbb{C}^{N_r \times N_{sr}}$ represent the precoding and combining matrices, respectively, with $\|\mathbf{F}_{\mathrm{eq}}[m]\|_F^2 = 1$ and $\|\mathbf{W}_{\mathrm{eq}}[m]\|_F^2 = 1$, and $\mathbf{n}[m] \sim \mathcal{CN}(\mathbf{0}, \sigma_n^2 \mathbf{I}_{N_r})$ is the additive white Gaussian noise. Moreover, the symbol vector is normalized as $\mathbb{E}\{\mathbf{s}[m]\mathbf{s}^{H}[m]\} = \rho\,\mathbf{I}_{N_{st}}$, where $\rho$ denotes the average transmit power per subcarrier. To satisfy the total power constraint $P_t$ across all $M$ subcarriers, we have $M\rho = P_t$.

During the pilot transmission stage for beam tracking, the above model specializes to a configuration that enables parallax extraction across subarrays which share a common local oscillator and baseband. We set $N_{st} = K$ and adopt an identity digital precoder $\mathbf{F}_{\mathrm{BB}}[m] = \frac{1}{\sqrt{K}}\mathbf{I}_K$, so that each stream is mapped to one subarray, while all subarrays are excited simultaneously toward the UE. To ensure that the UE can resolve the individual observation from each subarray, we employ FDM pilots with disjoint subcarrier subsets $\{\mathcal{M}_k\}_{k=1}^{K}$, where only the $k$th stream is active for $m \in \mathcal{M}_k$. Moreover, we adopt a comb-type allocation where each $\mathcal{M}_k$ uniformly spans the entire bandwidth with effective spacing of $K\Delta f$ for the observed tones of each subarray.
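A comb-type allocation of this kind can be sketched as follows. This is one simple interleaved assignment consistent with the description above; the exact offsets used in the paper are not given, so the function name `comb_pilot_sets` and the choice of offset (zero-based subarray `k` takes subcarriers `k, k+K, k+2K, ...`) are assumptions for illustration.

```python
import numpy as np

def comb_pilot_sets(num_subcarriers, num_subarrays):
    """Interleave subcarriers across subarrays: zero-based subarray k gets
    tones k, k+K, k+2K, ..., so its tones are K*delta_f apart."""
    return [np.arange(k, num_subcarriers, num_subarrays)
            for k in range(num_subarrays)]

# Illustrative sizes only: M = 16 subcarriers shared by K = 4 subarrays.
pilot_sets = comb_pilot_sets(16, 4)
```

Each set spans the whole band, the sets are disjoint, and the tone spacing seen by any single subarray is $K$ subcarriers.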
For pilot transmission, the analog beam $\mathbf{f}_{\mathrm{RF},k}$ is selected from a standard far-field DFT codebook. Under this design, the received pilots across frequency and subarrays implicitly encode the parallax information used by the subsequent transformer-based modules.

B. Near-Field Channel and Signal Structure

For the considered THz UM-MIMO system, the Rayleigh distance $D_R = \frac{2(S_b + S_u)^2}{\lambda}$ significantly expands, where $S_b$ and $S_u$ denote the array apertures at the BS and UE, respectively. As a result, a typical UE operates in the NF, where a global PWM fails to capture the non-negligible wavefront curvature across widely separated subarrays, whereas a full SWM is computationally heavy. To balance accuracy and complexity, we adopt the HSPM, approximating the wavefront as planar within each subarray and spherical between subarrays. The applicability conditions of HSPM are provided in [8], [9].

Within each subarray, for azimuth $\theta$ and elevation $\phi$, the generic UPA steering vector in PWM can be represented as

$$\mathbf{a}_N(\psi_x, \psi_z) = \left[1, \cdots, e^{j\frac{2\pi}{\lambda}\psi_n}, \cdots, e^{j\frac{2\pi}{\lambda}\psi_{N-1}}\right]^T, \tag{3}$$

where $\psi_n = d_n^x\psi_x + d_n^z\psi_z$, $\psi_x = \sin\theta\cos\phi$ and $\psi_z = \sin\phi$. Here $N$ is the number of elements in the subarray, and $(d_n^x, d_n^z)$ denotes the distances from the $n$th antenna to the reference antenna on the x- and z-axis. With $N_\ell$ paths between the BS and the UE, the receive and transmit steering vectors of the $k$th subarray for the $\ell$th path are denoted by $\mathbf{a}_k^{r\ell} = \mathbf{a}_{N_r}(\psi_{kx}^{r\ell}, \psi_{kz}^{r\ell})$ and $\mathbf{a}_k^{t\ell} = \mathbf{a}_{N_s}(\psi_{kx}^{t\ell}, \psi_{kz}^{t\ell})$, respectively.

For the LoS path in Fig. 1, the distance $D_k^1(\mathbf{p})$ and direction vector $\mathbf{u}_k^1(\mathbf{p})$ from the $k$th subarray (with reference position $\mathbf{s}_k = [x_k, 0, z_k]^T$) to the UE (at $\mathbf{p} = [x, y, z]^T$) are expressed as

$$D_k^1(\mathbf{p}) = \|\mathbf{p} - \mathbf{s}_k\|, \tag{4a}$$

$$\mathbf{u}_k^1(\mathbf{p}) = \frac{\mathbf{p} - \mathbf{s}_k}{D_k^1(\mathbf{p})} = [u_{k,x}, u_{k,y}, u_{k,z}]^T. \tag{4b}$$

In addition, the azimuth $\theta_k^t$ and elevation $\phi_k^t$ in the NF are uniquely determined by the specific geometry between $\mathbf{p}$ and $\mathbf{s}_k$, following $\cos\theta_k^t = \frac{x - x_k}{D_k^1(\mathbf{p})\cos\phi_k^t}$ and $\sin\phi_k^t = \frac{z - z_k}{D_k^1(\mathbf{p})}$.

However, across the widely separated subarrays, the spherical wavefront curvature is non-negligible and induces distinct phase shifts corresponding to the varying distances. Combining the PWM within subarrays and the SWM across subarrays, the frequency-domain subchannel $\mathbf{H}_k[m] \in \mathbb{C}^{N_r \times N_s}$ between the $k$th subarray and the UE at subcarrier $f_m$ is modeled as

$$\mathbf{H}_k[m] = \sum_{\ell=1}^{N_\ell} \alpha_k^\ell\, e^{-j2\pi f_m \frac{D_k^\ell(\mathbf{p})}{c}}\, \mathbf{a}_k^{r\ell}(\mathbf{p}) \left(\mathbf{a}_k^{t\ell}(\mathbf{p})\right)^T, \tag{5}$$

where $\alpha_k^\ell$ is the complex path gain and $D_k^\ell(\mathbf{p})$ is the path length of the $\ell$th path for the $k$th subarray. The aggregate wideband channel is obtained by concatenation as $\mathbf{H}[m] = [\mathbf{H}_1[m], \ldots, \mathbf{H}_K[m]] \in \mathbb{C}^{N_r \times N_t}$, and the FDM pilot scheme in Sec. II-A determines which subcarriers probe each $\mathbf{H}_k[m]$.

The enlarged inter-subarray spacing amplifies the geometric parallax. Taking the first subarray as the reference and defining the geo-arm vector $\mathbf{b}_{k,1} = \mathbf{s}_k - \mathbf{s}_1$, the geometry satisfies

$$D_k(\mathbf{p}) = \sqrt{D_1^2(\mathbf{p}) + \|\mathbf{b}_{k,1}\|^2 - 2D_1(\mathbf{p})\left(\mathbf{u}_1(\mathbf{p}) \cdot \mathbf{b}_{k,1}\right)}, \tag{6a}$$

$$\mathbf{u}_k(\mathbf{p}) = \frac{D_1(\mathbf{p})\,\mathbf{u}_1(\mathbf{p}) - \mathbf{b}_{k,1}}{D_k(\mathbf{p})}, \tag{6b}$$

which reveals that the set $\{D_k(\mathbf{p}), \mathbf{u}_k(\mathbf{p})\}_{k=1}^{K}$ forms a geometry-consistent fingerprint of the UE position. In particular, the inter-subarray Time Difference of Arrival (TDoA) $\Delta\tau_{k,1} = (D_k - D_1)/c$ and the power fingerprints reflecting path-loss variations jointly encode the UE position.
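The parallax relations (6a) and (6b) follow from writing $\mathbf{p} - \mathbf{s}_k = D_1\mathbf{u}_1 - \mathbf{b}_{k,1}$ and taking the norm. A minimal numerical check is sketched below; the function names, the UE position and the subarray coordinates are made-up illustrative values, not parameters from the paper.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def los_geometry(p, s_k):
    """Direct LoS distance and unit direction from subarray s_k to UE p, as in (4)."""
    d = p - s_k
    dist = np.linalg.norm(d)
    return dist, d / dist

def parallax_from_reference(D1, u1, b_k1):
    """Distance and direction at subarray k, derived from the reference subarray
    via the law-of-cosines form of (6a) and the vector identity (6b)."""
    Dk = np.sqrt(D1**2 + b_k1 @ b_k1 - 2.0 * D1 * (u1 @ b_k1))
    uk = (D1 * u1 - b_k1) / Dk
    return Dk, uk

# Illustrative geometry: array on the x-z plane, UE about 5 m away.
p = np.array([1.0, 5.0, 0.5])
s1 = np.array([0.0, 0.0, 0.0])
sk = np.array([0.4, 0.0, 0.2])

D1, u1 = los_geometry(p, s1)
Dk_direct, uk_direct = los_geometry(p, sk)
Dk, uk = parallax_from_reference(D1, u1, sk - s1)
tdoa = (Dk - D1) / C  # inter-subarray TDoA relative to subarray 1
```

Both routes give the same distance and direction, which is the geometric consistency that ties the per-subarray delays to a single UE position.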
Comb-FDM induces a delay periodicity $T_{\mathrm{amb}} = 1/(K\Delta f)$ in $\tau_k$, but $\Delta\tau_{k,1}$ is ambiguity-free since $|\Delta\tau_{k,1}| \le d_{\max}/c < T_{\mathrm{amb}}/2$.

When the UE moves, the channel parameters evolve over time $t = qT_0$, where $q \ge 1$ is the frame index and $T_0$ is the frame duration designed within the channel coherence time. Under the comb-type FDM pilot design, the pilot symbol vector on the $m$th subcarrier at the $q$th frame is denoted as

$$\mathbf{s}[m, q] = \begin{cases} s_p[m, q]\,\mathbf{e}_k, & m \in \mathcal{M}_k, \\ \mathbf{0}, & \text{otherwise}, \end{cases} \tag{7}$$

where $\mathbf{e}_k$ is the $k$th canonical basis vector and $s_p[m, q]$ is a known pilot symbol with $\mathbb{E}\{|s_p[m, q]|^2\} = K\rho$. Exploiting $\mathbf{F}_{\mathrm{BB}}[m] = \frac{1}{\sqrt{K}}\mathbf{I}_K$ and the block-diagonal structure of $\mathbf{F}_{\mathrm{RF}}$, the scalar observation on the $k$th stream is expressed as

$$y_k[m, q] = \mathbf{w}_{\mathrm{eq},k}^{H}[m]\,\mathbf{H}_k[m, q]\,\mathbf{f}_{\mathrm{RF},k}\,\frac{1}{\sqrt{K}}\,s_p[m, q] + n_k[m, q], \tag{8}$$

where $\mathbf{w}_{\mathrm{eq},k}[m]$ is the effective combining vector used to extract the $k$th pilot stream, and $n_k[m, q]$ is the post-combining noise. For LoS-dominant THz links and narrow pilot beams, the expression of (8) can be well approximated as

$$y_k[m, q] \approx \beta_k[m, q]\,e^{-j\Phi_k(m, q)} + n_k[m, q], \tag{9}$$

where $\beta_k[m, q]$ collects path loss, beamforming gain and pilot amplitude, and is slowly varying over frequency and frames. The phase admits a frequency-linear decomposition as

$$\Phi_k(m, q) = 2\pi f_m \tau_k(q) + \phi_{\mathrm{ce}}(q) - \vartheta_k(q) + \epsilon_k[m, q], \tag{10}$$

where $\tau_k(q) = D_k^1(\mathbf{p}(q))/c$ is the propagation delay, $\phi_{\mathrm{ce}}(q)$ is the common phase error (CPE) and $\vartheta_k(q) = 2\pi\sum_{i=0}^{q-1}\nu_k(i)T_0$ is the Doppler-induced accumulated phase with the instantaneous Doppler shift $\nu_k(q) = \frac{f_c}{c}\mathbf{v}^T(q)\,\mathbf{u}_k(\mathbf{p}(q))$. Residual mismatch is collected in $\epsilon_k[m, q]$, including weak multipath, beam-squint effects and other model misalignment to be handled in the subsequent uncertainty-aware learning framework.
Although $\Phi_k(m, q)$ is linear with $f_m$ for fixed $q$, its evolution over $q$ is nonlinear due to time-varying Doppler and motion. For robust beam tracking, we consider the inter-frame phase increment computed from $y_k^*[m, q]\,y_k[m, q-1]$, yielding

$$\begin{aligned} \Delta\Phi_k(m, q) &= \Phi_k(m, q) - \Phi_k(m, q-1) \\ &= 2\pi f_m \Delta\tau_k(q) - 2\pi\bar{\nu}_k(q)T_0 + \Delta\phi_{\mathrm{ce}}(q) + \Delta\epsilon_k[m, q], \end{aligned} \tag{11}$$

where $\Delta\tau_k(q)$, $\Delta\phi_{\mathrm{ce}}(q)$ and $\Delta\epsilon_k[m, q]$ denote the corresponding variations over the $q$th frame interval, and $\bar{\nu}_k(q)$ is the average Doppler over the $q$th frame.

In summary, both $\Phi_k(m, q)$ and $\Delta\Phi_k(m, q)$ are affine in $f_m$, with slopes governed by $\tau_k(q)$ or $\Delta\tau_k(q)$. Across subarrays, the geometric parallax constraints in (6) tightly couple these parameters through the UE position. Therefore, the collection of frequency-linear signatures $\{\Phi_k(m, q)\}_{k,m}$ and their increments form a geometry-consistent fingerprint of the UE. Over time, the evolution of their slopes and offsets reflects the UE dynamics. This space-frequency-time structure provides the physical foundation of our work.

C. Problem Formulation

1) Static Parallax-Aware Localization: We first focus on a single frame and omit its frame index for brevity. Under the comb-type FDM pilot design, stacking all pilot observations across the $K$ subarrays and their assigned subcarrier subsets $\{\mathcal{M}_k\}_{k=1}^{K}$ yields $\mathbf{y} = \left[\{y_k[m]\}_{m \in \mathcal{M}_k,\, k=1}^{K}\right]^T \in \mathbb{C}^{M}$. Combining the HSPM channel in (5), the parallax constraints in (6) and the scalar phase structure in (9)-(10), $\mathbf{y}$ follows the nonlinear parametric model as

$$\mathbf{y} = \mathbf{h}(\mathbf{p}, \boldsymbol{\eta}) + \mathbf{n}, \quad \mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \sigma_n^2\mathbf{I}_M), \tag{12}$$

where $\boldsymbol{\eta}$ collects nuisance parameters such as complex gains, beamforming gains and CPE. The maximum-likelihood (ML) estimator serves as a natural baseline, given by

$$\hat{\mathbf{p}}_{\mathrm{ML}} = \arg\min_{\mathbf{p} \in \mathcal{P}} \min_{\boldsymbol{\eta}} \left\|\mathbf{y} - \mathbf{h}(\mathbf{p}, \boldsymbol{\eta})\right\|_2^2, \tag{13}$$

where $\mathcal{P}$ is the BS coverage area.
Due to the hybrid spherical-planar wavefront, inter-subarray parallax coupling and wideband comb-type sampling, (13) is highly non-convex and high-dimensional, making direct ML search unsuitable for low-latency beam alignment.

We therefore use PAST as a learned surrogate of the intractable ML estimator. For each frame, it maps the raw observation $\mathbf{y}$ to a compact evidence packet as

$$\mathcal{E} = f_{\Theta_S}(\mathbf{y}), \tag{14}$$

which comprises a position estimate $\hat{\mathbf{p}}$, and parallax-aware evidence tokens extracted from the frequency-linear signatures, together with reliability indicators to quantify the frame-wise measurement quality. By design, $\mathcal{E}$ preserves the geometry-informative content of (12) while reducing the per-frame input dimension for temporal tracking.

2) Dynamic Beam Tracking: To capture mobility, we define the kinematic state $\mathbf{x}(q) = \left[\mathbf{p}^T(q), \mathbf{v}^T(q)\right]^T \in \mathbb{R}^6$ and adopt a constant-velocity evolution model, given by

$$\mathbf{x}(q+1) = \mathbf{F}\mathbf{x}(q) + \mathbf{w}(q), \quad \mathbf{w}(q) \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}), \tag{15}$$

where $\mathbf{F}$ is determined by $T_0$ and $\mathbf{Q}$ captures unmodeled accelerations. At the $q$th frame, stacking the FDM pilot observations yields the observation model $\mathbf{y}(q) = \mathbf{h}_q(\mathbf{x}(q)) + \mathbf{n}(q)$ with $\mathbf{n}(q) \sim \mathcal{CN}(\mathbf{0}, \sigma_n^2\mathbf{I}_M)$. Here, $\mathbf{h}_q(\cdot)$ encapsulates the delay, Doppler shift, the comb-type FDM sampling pattern and slowly varying gains. Equivalently, through (10) and (11), each subarray provides a frequency-affine phase signature whose slope and intercept evolve with $\mathbf{x}(q)$ under the geometric parallax, while residual mismatch is modeled as a disturbance.

The beam tracking task is to recursively infer $\mathbf{x}(q)$ from $\mathbf{y}(0{:}q)$. The Bayes-optimal solution would propagate the filtering posterior $p(\mathbf{x}(q) \mid \mathbf{y}(0{:}q))$ under (15) and the observation model, but exact nonlinear filtering is analytically intractable and can be computationally expensive for high-dimensional observations.
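For reference, the constant-velocity model in (15) can be sketched as below. The paper only states that $\mathbf{F}$ is determined by $T_0$; the block form used here (identity blocks with a $T_0$ coupling from velocity to position) is the standard constant-velocity transition and is an assumption, as are the function names and the optional process-noise argument.

```python
import numpy as np

def cv_transition(T0):
    """Standard constant-velocity transition for x = [p; v] in R^6 (assumed form)."""
    F = np.eye(6)
    F[:3, 3:] = T0 * np.eye(3)  # position advances by velocity * frame duration
    return F

def propagate(x, T0, Q=None, rng=None):
    """One step of (15). With Q=None the process noise w(q) is omitted."""
    x_next = cv_transition(T0) @ x
    if Q is not None:
        rng = np.random.default_rng() if rng is None else rng
        x_next = x_next + rng.multivariate_normal(np.zeros(6), Q)
    return x_next

# Illustrative state: UE at (1, 5, 0.5) m moving at 2 m/s along x, 10 ms frames.
x0 = np.array([1.0, 5.0, 0.5, 2.0, 0.0, 0.0])
x1 = propagate(x0, T0=0.01)
```

In the noise-free case the position moves by $T_0\mathbf{v}$ per frame and the velocity is unchanged, which is the prior that the learned tracker is meant to refine.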
Therefore, to obtain a low-latency and robust learned tracker, we adopt a cascaded formulation as

$$\mathcal{E}(q) = f_{\Theta_S}(\mathbf{y}(q)), \quad \hat{\mathbf{x}}(q) = f_{\Theta_T}\left(\mathcal{E}(0), \ldots, \mathcal{E}(q)\right), \tag{16}$$

where the TT $f_{\Theta_T}(\cdot)$ acts as a learned nonlinear filter-predictor operating on the compact evidence sequence.

This design mirrors the classical separation between measurement processing and state estimation, leveraging physics-structured tokenization and spatial-temporal attention rather than black-box modeling. Importantly, using $\mathcal{E}(q)$ avoids a monolithic spatial-temporal network over raw pilots, improving sample efficiency and reducing runtime while preserving geometry-informative signatures for tracking.

III. PAST FOR STATIC LOCALIZATION

In this section, we introduce the PAST, which leverages a deep neural network to extract and interpret the rich NF parallax signatures from lightweight FF codebook excitations, bypassing the need for prohibitive NF beam sweeping.

A. Feature Distillation and Physical Tokenization

PAST takes the received comb-type FDM pilots as input and serves as a physics-informed measurement encoder. It distills geometry-informative evidence consistent with the channel model, parallax constraints, and the per-frame frequency-affine phase structure. Specifically, it compresses the high-dimensional FDM measurements into reliability-aware physical tokens, which are fused into a compact evidence packet $\mathcal{E}$ for the downstream TT.

The pipeline below is defined for a single frame with its index omitted. We sort $\mathcal{M}_k = \{m_{k,1}, \ldots, m_{k,L_k}\}$ and define $y_{k,i} = y_k[m_{k,i}]$ and $f_{k,i} = f_{m_{k,i}}$ for $i = 1, \ldots, L_k$. To obtain stable phase-increment signatures with bounded token length and prevent unreliable patches from contaminating the wideband evidence, we partition $\{1, \ldots, L_k\}$ into $G$ disjoint contiguous groups $\{\mathcal{I}_{k,g}\}_{g=1}^{G}$.
The group energy is $E_{k,g} = \sum_{i \in \mathcal{I}_{k,g}} |y_{k,i}|^2$, and the normalized power weight for $i \in \mathcal{I}_{k,g}$ is $w_{k,g}[i] = \frac{|y_{k,i}|^2 + \varepsilon}{\sum_{j \in \mathcal{I}_{k,g}}\left(|y_{k,j}|^2 + \varepsilon\right)}$, where $\varepsilon > 0$ is a small constant. After that, the weighted frequency centroid $\bar{f}_{k,g}$ and spread $\sigma_{k,g}^2$ are represented as

$$\bar{f}_{k,g} = \sum_{i \in \mathcal{I}_{k,g}} w_{k,g}[i]\,f_{k,i}, \tag{17a}$$

$$\sigma_{k,g}^2 = \sum_{i \in \mathcal{I}_{k,g}} w_{k,g}[i]\,(f_{k,i} - \bar{f}_{k,g})^2. \tag{17b}$$

Since the Fisher information of $\tau_k$ scales with the power-weighted second central frequency moment, $(E_{k,g}, \sigma_{k,g}^2)$ are physics-grounded descriptors of groupwise delay information.

To exploit the frequency-affine structure without fragile phase unwrapping, we form adjacent-tone products. Defining $\mathcal{I}_{k,g}^{\Delta} = \{i \in \mathcal{I}_{k,g} : i+1 \in \mathcal{I}_{k,g}\}$ and $r_{k,i} = y_{k,i+1}\,y_{k,i}^*$ for $i \in \mathcal{I}_{k,g}^{\Delta}$, under (9)-(10), we can derive

$$-\angle r_{k,i} = \Phi_k(f_{k,i+1}) - \Phi_k(f_{k,i}) + \Delta\epsilon_k, \tag{18}$$

where the dominant term is the delay-induced increment $2\pi(f_{k,i+1} - f_{k,i})\tau_k$, while frequency-flat phase offsets within a frame are suppressed by differencing. The remaining perturbations are absorbed into $\Delta\epsilon_k$ and reflected in the reliability score. After that, with $u_{k,i} = r_{k,i}/|r_{k,i}|$, the magnitude-weighted circular mean is expressed as

$$\bar{u}_{k,g} = \frac{\sum_{i \in \mathcal{I}_{k,g}^{\Delta}} |r_{k,i}|\,u_{k,i}}{\sum_{i \in \mathcal{I}_{k,g}^{\Delta}} |r_{k,i}|}. \tag{19}$$

The reliability score and wrapped slope are defined as $\kappa_{k,g} = |\bar{u}_{k,g}| \in [0, 1]$ (higher $\kappa_{k,g}$ indicates a more consistent patch) and $\varphi_{k,g} = -\angle(\bar{u}_{k,g})$, respectively. Since $\varphi_{k,g}$ is wrapped under the comb spacing, it is embedded continuously with

$$\mathbf{q}_{k,g} = \begin{bmatrix} \cos\varphi_{k,g} \\ \sin\varphi_{k,g} \end{bmatrix} = \frac{1}{|\bar{u}_{k,g}| + \varepsilon} \begin{bmatrix} \Re\{\bar{u}_{k,g}\} \\ -\Im\{\bar{u}_{k,g}\} \end{bmatrix}. \tag{20}$$

Any remaining wrapping ambiguity is handled in the subsequent attention layers by parallax consistency, cross-group evidence and reliability scores, rather than explicit unwrapping.
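The per-group statistics in (17)-(20) can be sketched for a single subarray-group pair as follows. This is an illustrative NumPy version under stated assumptions: the function name `group_descriptor` is hypothetical, the pilots are assumed already sorted by frequency, and a small `eps` is added to the denominator of (19) purely as a numerical guard.

```python
import numpy as np

def group_descriptor(y, f, eps=1e-12):
    """Energy, frequency spread, slope embedding and reliability of one group.

    y: complex pilots of the group, sorted by frequency; f: their frequencies in Hz.
    """
    power = np.abs(y) ** 2
    energy = power.sum()                          # E_{k,g}
    w = (power + eps) / np.sum(power + eps)       # normalized power weights
    f_bar = np.sum(w * f)                         # centroid, (17a)
    spread = np.sum(w * (f - f_bar) ** 2)         # spread, (17b)

    r = y[1:] * np.conj(y[:-1])                   # adjacent-tone products r_{k,i}
    # |r| * (r / |r|) = r, so the magnitude-weighted circular mean is sum(r) / sum(|r|).
    u_bar = np.sum(r) / (np.sum(np.abs(r)) + eps)
    kappa = np.abs(u_bar)                         # reliability score in [0, 1]
    phi = -np.angle(u_bar)                        # wrapped slope
    q = np.array([u_bar.real, -u_bar.imag]) / (kappa + eps)  # embedding, (20)
    return energy, spread, q, kappa, phi

# Noise-free check: a pure delay gives phi = 2*pi*spacing*tau and kappa close to 1.
spacing, tau = 4e6, 20e-9                         # illustrative values
f = spacing * np.arange(8)
y = np.exp(-2j * np.pi * f * tau)
energy, spread, q, kappa, phi = group_descriptor(y, f)
```

With noisy or blocked tones the products $r_{k,i}$ point in inconsistent directions, so $\kappa_{k,g}$ drops below one, which is how the reliability score reflects patch quality.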
For each subarray-group pair $(k, g)$, a compact descriptor is represented as

$$\mathbf{t}_{k,g} = \left[\log(E_{k,g} + \varepsilon),\ \log(\sigma_{k,g}^2 + \varepsilon),\ \mathbf{q}_{k,g}^T,\ \kappa_{k,g}\right]^T \in \mathbb{R}^{d_t}. \tag{21}$$

Then it is mapped into the transformer latent space by

$$\mathbf{z}_{k,g} = \mathbf{W}_{\mathrm{emb}}\mathbf{t}_{k,g} + \mathbf{e}_k + \mathbf{e}_g^{\mathrm{r}} + \mathbf{e}_b, \tag{22}$$

where $\mathbf{W}_{\mathrm{emb}}$ is learnable, $\mathbf{e}_k$ and $\mathbf{e}_g^{\mathrm{r}}$ encode the subarray and group indices, and the fixed geometry encoding $\mathbf{e}_b$ is derived from (6). The resulting per-frame token sequence with length $KG$ is denoted as

$$\mathbf{Z} = \left[\mathbf{z}_{1,1}, \ldots, \mathbf{z}_{1,G}, \ldots, \mathbf{z}_{K,1}, \ldots, \mathbf{z}_{K,G}\right]. \tag{23}$$

This physics-driven tokenization preserves parallax evidence while suppressing nuisance variations in raw pilots, allowing the tracker to focus on filtering and prediction under the kinematic prior, rather than low-level denoising and feature learning from high-dimensional complex pilots.

B. Physics-Driven Attention Mechanism

For the token sequence $\mathbf{Z}$, PAST uses a physics-driven attention encoder to fuse multi-subarray and multi-band evidence into $\mathcal{E}$ and $\hat{\mathbf{p}}$. The attention weights explicitly incorporate the geometry prior and token-wise reliability in Sec. III-A.

To quantify token-wise reliability for evidence fusion, a scalar gate $c_{k,g} \in [0, 1)$ for the pair $(k, g)$ is defined as

$$c_{k,g} = \kappa_{k,g} \cdot \sigma\left(\gamma_0 + \gamma_1\log(E_{k,g} + \varepsilon) + \gamma_2\log(\sigma_{k,g}^2 + \varepsilon)\right), \tag{24}$$

where $\sigma(\cdot)$ is the sigmoid function and $\gamma_1, \gamma_2 \ge 0$ are learned scaling factors enforced by $\gamma_a = \log\left(1 + \exp(\tilde{\gamma}_a)\right)$ for $a \in \{1, 2\}$. It preserves the physical monotonicity that higher usable energy and larger effective frequency spread increase the confidence, yielding a bounded gate for stable scaling and gradients. Accordingly, $c_{k,g}$ serves as a learned confidence score to steer attention toward informative and phase-consistent patches.

Then, denoting the $i$th element in (23) as $\mathbf{z}_i$, we form query, key and value projections $\mathbf{q}_i^{(h)} = \mathbf{W}_Q^{(h)}\mathbf{z}_i$, $\mathbf{k}_i^{(h)} = \mathbf{W}_K^{(h)}\mathbf{z}_i$ and $\mathbf{v}_i^{(h)} = \mathbf{W}_V^{(h)}\mathbf{z}_i$ in each attention head $h$. The attention logit from token $i$ to token $j$ is defined as
The attention logit from token $i$ to token $j$ is defined as

$$ \ell^{(h)}_{ij} = \frac{\left( \mathbf{q}^{(h)}_i \right)^{T} \mathbf{k}^{(h)}_j}{\sqrt{d_h}} + b^{(h)}_{ij} + \delta \log\left( c_{k(j),g(j)} + \varepsilon \right), \tag{25} $$

where $d_h$ is the per-head dimension, $\delta \in [0, 1]$ balances source selection and content aggregation to avoid over-suppression, and $k(\cdot)$ and $g(\cdot)$ map each token to its subarray and group indices. The observation-independent geometry bias $b^{(h)}_{ij}$ is injected as

$$ b^{(h)}_{ij} = \mathrm{MLP}^{(h)}_{\mathrm{geo}}\!\left( \left[ \left( \mathbf{b}_{k(i),k(j)} / \Lambda \right)^{T}, \; \Delta g_{ij} \right]^{T} \right), \tag{26} $$

where $\mathbf{b}_{k(i),k(j)} = \mathbf{b}_{k,k'}$ is the known geo-arm vector, Λ = √( 2 / (K(K−1)) Σ_{1 ≤ k … q′(n) ) and allows bidirectional attention for tokens within each frame. The first case in (46) avoids an all-$(-\infty)$ mask row for padded queries, which would make the softmax ill-defined. These outputs are ignored at readout.

Each token $\mathbf{x}_m$ in $\mathbf{X}(q)$ is associated with a reliability weight $\varrho_m(q) \in (0, 1]$, given by

$$ \varrho_m(q) = \begin{cases} 1, & s_m(q) = 0, \\ 1, & s_m(q) = 1 \text{ and } k(m) = 0, \\ g_{k(m)}\!\left( q'(m) \right), & s_m(q) = 1 \text{ and } 1 \le k(m) \le K, \end{cases} \tag{47} $$

with $\boldsymbol{\varrho}(q) = [\varrho_0(q), \ldots, \varrho_{N-1}(q)]^{T}$. We set $\mathbf{H}^{(0)}(q) = \mathbf{X}(q)$ and employ $L_T$ identical causal encoder blocks. For $\ell_t = 1, \ldots, L_T$, the $\ell_t$th block follows the pre-normalization (Pre-LN) form

$$ \mathbf{H}'^{(\ell_t)}(q) = \mathbf{H}^{(\ell_t - 1)}(q) + \mathrm{MSA}\!\left( \mathrm{LN}\!\left( \mathbf{H}^{(\ell_t - 1)}(q) \right); \mathbf{M}(q), \boldsymbol{\varrho}(q) \right), \tag{48a} $$

$$ \mathbf{H}^{(\ell_t)}(q) = \mathbf{H}'^{(\ell_t)}(q) + \mathrm{FFN}\!\left( \mathrm{LN}\!\left( \mathbf{H}'^{(\ell_t)}(q) \right) \right), \tag{48b} $$

where $\mathrm{LN}(\cdot)$ is applied row-wise, and $\mathrm{FFN}(\mathbf{z}) = \mathrm{ReLU}(\mathbf{z}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$ is a two-layer position-wise FFN. Here, $\mathbf{W}_1 \in \mathbb{R}^{d_T \times d_f}$, $\mathbf{W}_2 \in \mathbb{R}^{d_f \times d_T}$, $d_f = 4 d_T$, and $\mathbf{b}_1$, $\mathbf{b}_2$ are learnable.
With $N_h$ attention heads and $d_h = d_T / N_h$, for head $h$ at layer $\ell_t$, the standard projections are expressed as

$$ \mathbf{Q}^{(\ell_t,h)} = \mathrm{LN}(\mathbf{H}^{(\ell_t - 1)}) \mathbf{W}^{(\ell_t,h)}_Q, \tag{49a} $$

$$ \mathbf{K}^{(\ell_t,h)} = \mathrm{LN}(\mathbf{H}^{(\ell_t - 1)}) \mathbf{W}^{(\ell_t,h)}_K, \tag{49b} $$

$$ \mathbf{V}^{(\ell_t,h)} = \mathrm{LN}(\mathbf{H}^{(\ell_t - 1)}) \mathbf{W}^{(\ell_t,h)}_V, \tag{49c} $$

where $\mathbf{W}^{(\ell_t,h)}_Q, \mathbf{W}^{(\ell_t,h)}_K, \mathbf{W}^{(\ell_t,h)}_V \in \mathbb{R}^{d_T \times d_h}$ are learnable. The masked attention weights are then calculated as

$$ \mathbf{A}^{(\ell_t,h)}(q) = \mathrm{softmax}\!\left( \frac{\mathbf{Q}^{(\ell_t,h)} \left( \mathbf{K}^{(\ell_t,h)} \right)^{T}}{\sqrt{d_h}} + \mathbf{M}(q) + \delta_T \mathbf{1} \left( \log(\boldsymbol{\varrho}(q) + \varepsilon) \right)^{T} \right), \tag{50} $$

where $\mathbf{1} \in \mathbb{R}^{N}$ is the all-one column vector, and $\mathrm{softmax}(\cdot)$ is applied row-wise. The head output is computed as

$$ \mathbf{O}^{(\ell_t,h)}(q) = \mathbf{A}^{(\ell_t,h)}(q) \left( \mathrm{diag}(\boldsymbol{\varrho}(q))^{1 - \delta_T} \, \mathbf{V}^{(\ell_t,h)} \right). \tag{51} $$

Therefore, TT injects $\boldsymbol{\varrho}(q)$ into both the attention competition and the aggregation magnitude, through the logit prior in (50) and the value scaling in (51), explicitly downweighting tokens with low reliability. Finally, we obtain the multi-head output as

$$ \mathrm{MSA}(\mathbf{H}; \mathbf{M}, \boldsymbol{\varrho}) = \left[ \mathbf{O}^{(\ell_t,1)}(q), \ldots, \mathbf{O}^{(\ell_t,N_h)}(q) \right] \mathbf{W}^{(\ell_t)}_O, \tag{52} $$

where $\mathbf{W}^{(\ell_t)}_O \in \mathbb{R}^{d_T \times d_T}$ is learnable. Under the time-major order in (45), the global token of the most recent frame has index $n_0(q) = (L_w - 1)(K + 1)$. Its final-layer representation $\mathbf{g}(q) = \mathbf{H}^{(L_T)}_{n_0(q)}(q) \in \mathbb{R}^{d_T}$ is used as the temporal summary.

D. Dual-Head Output and Physics Loop

Based on $\mathbf{g}(q)$, TT outputs a filtering estimate $\hat{\mathbf{x}}_{\mathrm{fil}}(q)$ of the current kinematic state and a one-step prediction $\hat{\mathbf{x}}_{\mathrm{pre}}(q+1)$ to initialize the beam tracking of the next frame. A differentiable physics loop further regularizes the outputs through NF geometry and drift-consistent increments.

The filtering head maps $\mathbf{g}(q)$ to a heteroscedastic Gaussian posterior proxy as

$$ \left( \boldsymbol{\mu}_{\mathrm{fil}}(q), \mathbf{s}_{\mathrm{fil}}(q) \right) = f_{\mathrm{fil}}\!\left( \mathbf{g}(q) \right), \tag{53} $$

where $\boldsymbol{\mu}_{\mathrm{fil}}(q) = [\hat{\mathbf{p}}_{\mathrm{fil}}(q) \; \hat{\mathbf{v}}_{\mathrm{fil}}(q)]^{T}$, $\mathbf{s}_{\mathrm{fil}}(q)$ is the log-variance vector and $\boldsymbol{\Sigma}_{\mathrm{fil}}(q) = \mathrm{diag}(\exp(\mathbf{s}_{\mathrm{fil}}(q)))$.
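The reliability-injected attention of (50)-(51) can be sketched for a single head with plain Python lists. This is a minimal illustration under stated assumptions, not the authors' PyTorch code: the function name, the value of `EPS`, and the toy inputs are hypothetical, and every mask row is assumed to keep at least one finite entry, as guaranteed by the first case of (46).

```python
import math

EPS = 1e-9  # small constant epsilon; this value is an assumption
NEG_INF = float("-inf")


def reliability_attention(Q, K, V, mask, rho, delta_t):
    """Single-head attention with a log-reliability logit prior and value scaling.

    Q, K, V: N rows of floats; mask: N x N entries in {0.0, -inf};
    rho: N reliabilities in (0, 1]; delta_t in [0, 1].
    """
    n, d_h = len(Q), len(Q[0])
    # Value scaling of (51): rows of V multiplied by rho_j^(1 - delta_t)
    scaled_v = [[(rho[j] ** (1.0 - delta_t)) * v for v in V[j]] for j in range(n)]
    out = []
    for i in range(n):
        # Logits of (50): content term + mask + delta_t * log(rho_j + eps)
        logits = [
            sum(a * b for a, b in zip(Q[i], K[j])) / math.sqrt(d_h)
            + mask[i][j]
            + delta_t * math.log(rho[j] + EPS)
            for j in range(n)
        ]
        m = max(logits)  # finite because each row keeps one unmasked entry
        exps = [math.exp(l - m) for l in logits]  # exp(-inf) evaluates to 0.0
        z = sum(exps)
        attn = [e / z for e in exps]
        out.append([sum(attn[j] * scaled_v[j][c] for j in range(n))
                    for c in range(len(V[0]))])
    return out


# Toy example: equal content scores, second token has low reliability.
Q = [[0.0], [0.0]]
K = [[1.0], [2.0]]
V = [[1.0], [0.0]]
rho = [1.0, 0.01]
no_mask = [[0.0, 0.0], [0.0, 0.0]]
causal = [[0.0, NEG_INF], [0.0, 0.0]]
out_prior = reliability_attention(Q, K, V, no_mask, rho, delta_t=1.0)
out_causal = reliability_attention(Q, K, V, causal, rho, delta_t=0.5)
```

With `delta_t = 1` the reliability acts only through the logits, so the attention weights become proportional to the reliabilities when the content scores are equal. With `delta_t = 0` it acts only through the value scaling, and intermediate values split the influence between the two paths.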
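For prediction, we follow the kinematic prior and learn a residual correction as

$$ \left( \Delta\mathbf{x}(q), \mathbf{s}_{\mathrm{pre}}(q+1) \right) = f_{\mathrm{pre}}\!\left( \mathbf{g}(q) \right), \tag{54a} $$

$$ \boldsymbol{\mu}_{\mathrm{pre}}(q+1) = \mathbf{F} \boldsymbol{\mu}_{\mathrm{fil}}(q) + \Delta\mathbf{x}(q), \tag{54b} $$

where $\boldsymbol{\mu}_{\mathrm{pre}}(q+1) = [\hat{\mathbf{p}}_{\mathrm{pre}}(q+1) \; \hat{\mathbf{v}}_{\mathrm{pre}}(q+1)]^{T}$, $\mathbf{F}$ is given in (35) and $\boldsymbol{\Sigma}_{\mathrm{pre}}(q+1) = \mathrm{diag}(\exp(\mathbf{s}_{\mathrm{pre}}(q+1)))$. With the ground truth $\mathbf{x}^{\star}(q) = [\mathbf{p}^{\star}(q) \; \mathbf{v}^{\star}(q)]^{T}$, the filtering and prediction loss functions are defined as

$$ \mathcal{L}_{\mathrm{fil}} = \sum_{q} \left( \left\| \mathbf{x}^{\star}(q) - \boldsymbol{\mu}_{\mathrm{fil}}(q) \right\|^2_{\boldsymbol{\Sigma}_{\mathrm{fil}}(q)^{-1}} + \log\det \boldsymbol{\Sigma}_{\mathrm{fil}}(q) \right), \tag{55a} $$

$$ \mathcal{L}_{\mathrm{pre}} = \sum_{q} \left( \left\| \mathbf{x}^{\star}(q+1) - \boldsymbol{\mu}_{\mathrm{pre}}(q+1) \right\|^2_{\boldsymbol{\Sigma}_{\mathrm{pre}}(q+1)^{-1}} + \log\det \boldsymbol{\Sigma}_{\mathrm{pre}}(q+1) \right), \tag{55b} $$

which capture the robustness at the current frame and the forecasting capability at the next frame, respectively.

To suppress common-mode terms, we use the relative Doppler $\nu^{\mathrm{rel}}_k(\mathbf{p}, \mathbf{v}) = \nu_k(\mathbf{p}, \mathbf{v}) - \nu_1(\mathbf{p}, \mathbf{v})$ for $k \ge 2$. Under the drift budgets in Sec. IV-A, the wrap-safe proxies $\widehat{\Delta\tau}_k(q)$ and $\hat{\zeta}^{\mathrm{rel}}_k(q)$ in (38) provide temporal cues with reliability, which are matched to the drift implied by the filtered state. We first define the normalized delay residual as

$$ \tilde{r}_{\tau,k}(q) = \frac{\widehat{\Delta\tau}_k(q) - \left( \tau_k(\hat{\mathbf{p}}_{\mathrm{fil}}(q)) - \tau_k(\hat{\mathbf{p}}_{\mathrm{fil}}(q-1)) \right)}{\Delta\tau_{\max}}. \tag{56} $$

Then, the relative Doppler phase mismatch is modeled by

$$ d_{\zeta,k}(q) = 1 - \cos\!\left( \hat{\zeta}^{\mathrm{rel}}_k(q) - 2\pi T_0 \, \nu^{\mathrm{rel}}_k\!\left( \hat{\mathbf{p}}_{\mathrm{fil}}(q), \hat{\mathbf{v}}_{\mathrm{fil}}(q) \right) \right), \tag{57} $$

which is invariant to $2\pi$ wrapping and acts as a mild consistency regularizer. Using $g_k(q)$ as a learned reliability proxy, we form the stop-gradient weight

$$ \omega_k(q) = \frac{\mathrm{sg}(g_k(q))}{\frac{1}{K} \sum_{k'=1}^{K} \mathrm{sg}(g_{k'}(q))}. $$

In this way, the physics regularizer is obtained as

$$ \mathcal{L}^{t}_{\mathrm{phy}} = \sum_{q \ge 1} \left( \sum_{k=1}^{K} \omega_k(q) \, \varpi\!\left( \tilde{r}^2_{\tau,k}(q) \right) + \sum_{k=2}^{K} \omega_k(q) \, \varpi\!\left( 2 d_{\zeta,k}(q) \right) \right), \tag{58} $$

where $\varpi(\cdot)$ and $\mathrm{sg}(\cdot)$ are the same as in (33).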
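The building blocks of (55)-(58) can be sketched with scalar Python code. This is an illustration under stated assumptions rather than the authors' training code: the function names are hypothetical, the robust penalty $\varpi(\cdot)$ of (33) is omitted because its form is not given in this section, and the stop-gradient has no counterpart in plain Python, so the weights are simply treated as constants.

```python
import math


def gaussian_nll_diag(x_true, mu, log_var):
    """Per-frame term of (55): Mahalanobis distance plus log-determinant,
    for a diagonal covariance Sigma = diag(exp(log_var))."""
    return sum((xt - m) ** 2 * math.exp(-s) + s
               for xt, m, s in zip(x_true, mu, log_var))


def doppler_mismatch(zeta_rel, nu_rel, t0):
    """Wrap-invariant phase mismatch of (57)."""
    return 1.0 - math.cos(zeta_rel - 2.0 * math.pi * t0 * nu_rel)


def normalized_weights(g):
    """Weights omega_k = g_k / mean(g); in training these would be detached
    from the gradient (stop-gradient), which is not modeled here."""
    mean_g = sum(g) / len(g)
    return [v / mean_g for v in g]


# Hypothetical values: T0 = 1 ms as in Table I, arbitrary Doppler and phase.
d_a = doppler_mismatch(0.3, 120.0, 1e-3)
d_b = doppler_mismatch(0.3 + 2.0 * math.pi, 120.0, 1e-3)  # same phase, wrapped
omega = normalized_weights([0.9, 0.5, 0.1])
nll_exact = gaussian_nll_diag([1.0, 2.0], [1.0, 2.0], [0.0, 0.0])
```

The mismatch depends on the phase only through its cosine, so adding any multiple of $2\pi$ leaves it unchanged, and the normalized weights always average to one, which keeps the overall scale of the regularizer independent of the absolute reliability level.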
Finally, the TT training objective is expressed as

$$ \mathcal{L}_{\mathrm{T}} = \mathcal{L}_{\mathrm{fil}} + \lambda_{\mathrm{pre}} \mathcal{L}_{\mathrm{pre}} + \lambda^{t}_{\mathrm{phy}} \mathcal{L}^{t}_{\mathrm{phy}}, \tag{59} $$

where $\lambda^{t}_{\mathrm{phy}}$ is annealed from 0 to its target value for stable early training.

V. PERFORMANCE EVALUATION

In this section, we evaluate the performance of the proposed PAST-TT framework for the considered THz NF scenario, and compare it with other representative methods.

A. Datasets and Simulation Setup

Under the system in Fig. 1, we generate both the static localization dataset for PAST and the sequential tracking dataset for TT with the simulation parameters in Table I. In this case, $S_b = 128\sqrt{10}\,\lambda$ and $S_u = \frac{3\sqrt{2}}{2}\lambda$, yielding a Rayleigh distance $D_R = \frac{2(S_b + S_u)^2}{\lambda} \approx 331$ m.

TABLE I
SIMULATION PARAMETERS

Notation    Definition                                          Value
f_c         Carrier frequency                                   0.3 THz
B           Bandwidth                                           4 GHz
M           Number of subcarriers                               1024
N_t, N_r    Number of transmit and receive antennas             512, 16
K_x, K_z    Number of subarrays at x- and z-axis of the BS      4, 2
L_t, L_r    Number of transmit and receive RF chains            8, 4
d_s         Subarray spacing                                    128λ
T_0         Frame interval                                      1 ms
q_a         Acceleration-noise strength in (36)                 4, 9, 16
L_w         Length of sliding window                            64

UE positions are sampled in 3D coordinates within the NF with $D^{1}_{1}(\mathbf{p}) \sim \mathcal{U}[35\ \mathrm{m}, 120\ \mathrm{m}]$. For each position, we include a dominant LoS path determined by NF geometry and two additional NLoS paths. The corresponding channel realization is generated based on QuaDRiGa [31], adjusted by measurement-based statistics in urban microcell (UMi) environments obtained by our group [32], including the K-factor, delay spread and angular spreads. In addition, the path powers are further scaled by frequency-dependent free-space loss and gaseous attenuation following ITU-R P.676. Comb-type FDM pilots are synthesized using the element-wise SWM, whereas the physics modules in Sec. II adopt the tractable HSPM to avoid the inverse crime. For tracking, trajectories are generated by the state evolution in Sec. IV-A with frame interval $T_0$ and three mobility regimes with $q_a \in \{4, 9, 16\}$, where each trajectory is initialized with a random direction and a speed $v \sim \mathcal{U}[10, 30]$ m/s. These implementations ensure that the datasets preserve the key NF parallax and THz propagation characteristics.

We generate $N_1 = 240{,}000$ static samples and $N_2 = 60{,}000$ trajectories, and divide them into train/validation/test sets (80%/10%/10%) at the sample level for the static dataset and at the trajectory level for the tracking dataset. Models are implemented in PyTorch and trained by AdamW with weight decay $10^{-2}$ and gradient clipping at 1.0. PAST is trained with batch size 256 for 200 epochs using a peak learning rate of $2 \times 10^{-4}$ with a 5-epoch linear warm-up followed by cosine decay to $10^{-6}$, and it is then frozen to generate $\mathcal{E}(q)$ for TT. TT is trained with a batch size of 64 trajectories for 150 epochs using a peak learning rate of $1 \times 10^{-4}$ with the same schedule, and the ground truth is not injected into its inputs. $\lambda^{t}_{\mathrm{phy}}$ is linearly annealed from 0 to 0.2 over the first 30 epochs. Results are averaged over $R = 10$ random seeds. Experiments are conducted on a workstation equipped with 8 × NVIDIA GeForce RTX 4090 GPUs (CUDA 12.4; Driver 550.90.07).

B. Performance of PAST

In this subsection, we evaluate the NF localization performance of PAST. For the static task, the UE state reduces to $\mathbf{x} = [\mathbf{p}^{T}, \mathbf{0}^{T}]^{T}$ and PAST outputs $\hat{\mathbf{p}}$. We compare it with four representative baselines:

(i) JARE [19]: a joint angle and range estimation method based on the DFT codebook, adapted by aggregating subarray-resolved pilot powers over the comb tones;
(ii) 2D-MUSIC [33]: a parametric estimator that performs a 2D spectral search on the NF manifold, with the sample covariance formed by wideband pilot snapshots;
(iii) BSA [34]: a block-sparse-aware approach on a distance-dependent orthogonal dictionary, where the dominant component is recovered and mapped to the UE location through NF geometry;
(iv) ConvNeXt regression [35]: a DL backbone baseline inspired by [35], trained to regress the distance and angles from the same FDM pilot tensor of magnitude and phase-increment features, without reproducing the CBS or TTD front-end.

Fig. 3. Static RMSE comparison of different methods with SNR. (a) Distance RMSE comparison. (b) Angle RMSE comparison.

All methods use the same array configuration, comb-type FDM pilots and FF DFT probing budget, and differ only in the inference procedure. The accuracy is characterized by the distance and angle RMSEs averaged over subarrays, given by

$$ \mathrm{RMSE}(D) = \sqrt{ \frac{1}{N_1 R K} \sum_{n=1}^{N_1} \sum_{r=1}^{R} \sum_{k=1}^{K} \left( \Delta D^{k}_{n,r} \right)^2 }, \tag{60a} $$

$$ \mathrm{RMSE}(\angle) = \frac{180}{\pi} \sqrt{ \frac{1}{N_1 R K} \sum_{n=1}^{N_1} \sum_{r=1}^{R} \sum_{k=1}^{K} \left( \Delta\angle^{k}_{n,r} \right)^2 }, \tag{60b} $$

where $\Delta D^{k}_{n,r} = D_k(\hat{\mathbf{p}}_{n,r}) - D_k(\mathbf{p}_n)$ and $\Delta\angle^{k}_{n,r} = \arccos\left( \mathbf{u}^{T}_k(\mathbf{p}_n) \, \mathbf{u}_k(\hat{\mathbf{p}}_{n,r}) \right)$.

As shown in Fig. 3, the accuracy of all methods improves with increasing SNR, while PAST consistently achieves the lowest distance and angle RMSE across the tested SNR range. At −5 dB, PAST attains 0.102 m distance RMSE and 0.883° angle RMSE, whereas the strongest baseline, JARE, attains 0.178 m and 1.35°. At 15 dB, the RMSE of PAST further improves to 7.81 mm and 0.0588°, and at 20 dB it reaches 5.52 mm and 0.0376°. In the high-SNR regime, the best baseline is 2D-MUSIC, while PAST maintains clear gaps of 4.69 mm and 0.0342° at 15 dB. Among the baselines, JARE is the strongest at low SNR because it estimates the angle from the angular support and the distance from DFT probing power ratios, which is less sensitive to phase statistics, but its improvement is limited at high SNR due to a finite probing resolution.
2D-MUSIC improves rapidly with SNR since more reliable sample covariance estimation yields cleaner subspace separation under the same pilot budget. BSA shows a similar SNR trend, but it can be affected by mismatch in the distance-dependent dictionary and the resulting leakage in block-sparse recovery. ConvNeXt regression also improves steadily with SNR and reaches 0.0125 m and 0.105° at 20 dB, but it remains inferior to the physics-structured PAST under the same pilot budget.

Fig. 4. Time evolution of TT prediction RMSE at SNR = 15 dB. (a) Distance RMSE. (b) Angle RMSE.

Fig. 5. Filtering RMSE comparison of different methods versus SNR with α = 0.1 (TT with α = 0 is shown as a reference). (a) Distance RMSE comparison. (b) Angle RMSE comparison.

C. Performance of TT

This subsection evaluates TT for online filtering and prediction in beam tracking, focusing on robustness to bad frames and the resulting closed-loop beam benefit. Bad frames occur with rate $\alpha$ and are generated by applying an additional 20 dB attenuation to the pilot observation in the corrupted frames.
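The bad-frame process described above can be sketched as follows. This is an assumption-laden illustration, not the authors' data generator: the text only states the rate $\alpha$ and the 20 dB attenuation, so drawing bad frames independently per frame, applying the attenuation to the pilot power (amplitude scaled by $10^{-20/20}$), and the function name `corrupt_frames` are all assumptions.

```python
import random


def corrupt_frames(frames, alpha, atten_db=20.0, seed=0):
    """Attenuate a random subset of frames.

    frames: list of frames, each a list of complex pilot observations
    alpha: bad-frame rate in [0, 1]; frames are drawn independently (assumed)
    atten_db: power attenuation in dB applied to corrupted frames (assumed to
              act on power, so amplitudes are scaled by 10^(-atten_db / 20))
    """
    rng = random.Random(seed)
    scale = 10.0 ** (-atten_db / 20.0)
    corrupted, flags = [], []
    for frame in frames:
        bad = rng.random() < alpha
        corrupted.append([v * scale for v in frame] if bad else list(frame))
        flags.append(bad)
    return corrupted, flags


# Toy pilots for two frames (hypothetical values)
frames = [[1 + 1j, 2 + 0j], [0.5j, 1 + 0j]]
all_bad, flags_bad = corrupt_frames(frames, alpha=1.0)
all_good, flags_good = corrupt_frames(frames, alpha=0.0)
```

Returning the flags alongside the corrupted pilots makes it easy to check afterwards how well a learned reliability score separates bad frames from clean ones.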
We compare TT with four representative trackers: (i) PAST+hold, which uses the current PAST output for both filtering and prediction; (ii) EKF, which applies an extended Kalman filter with the PAST output as the measurement [23]; (iii) LSTM, which performs causal recurrent temporal modeling for proactive tracking [27] under the same measurement interface; and (iv) KalmanNet, which learns Kalman-style updates under partially known dynamics [28] with the same state prior. All compared methods are causal and share the same frozen PAST front end with the same pilot budget.

Fig. 4 shows that at SNR = 15 dB, the TT prediction RMSE converges within a window and stays stable under bad frames and different mobilities. At $q = 200$ with $q_a = 9$, the distance RMSE is 8.92 mm for $\alpha = 0$, increases to 0.0227 m for $\alpha = 0.1$, and reaches 0.0345 m for $\alpha = 0.3$. The corresponding angle RMSE follows the same trend and converges to 0.0608°, 0.105° and 0.160°, respectively. With $\alpha = 0$, increasing the mobility from $q_a = 4$ to $q_a = 16$ raises the steady distance RMSE from 6.70 mm to 0.0146 m and the steady angle RMSE from 0.0461° to 0.0758°.

Fig. 5 compares the filtering accuracy versus SNR, where TT performs the best owing to its strong robustness to bad frames. When $\alpha = 0.1$ and SNR = −5 dB, TT attains RMSEs of 0.193 m and 0.829°, while the strongest baseline, PAST+hold, shows 25.4% higher distance RMSE and 60.9% higher angle RMSE. EKF is the weakest with RMSEs of 0.521 m and 2.25°. This ordering is consistent with how bad frames affect temporal inference. PAST+hold avoids error propagation but fails to exploit temporal smoothing, while EKF is more affected once an unreliable update occurs. KalmanNet improves upon EKF but still follows a recursive update flow [28]. At SNR = 20 dB, TT attains 0.0121 m and 0.0695°, with the best baseline, KalmanNet, exhibiting 31.2% and 42.7% higher distance and angle RMSE. The reference TT curve with $\alpha = 0$ reaches 4.82 mm and 0.0236° at SNR = 20 dB, indicating the ceiling imposed by the bad-frame process.

Fig. 6. Prediction RMSE comparison of different methods versus SNR with α = 0.1 (TT with α = 0 is shown as a reference). (a) Distance RMSE comparison. (b) Angle RMSE comparison.

Fig. 6 reports the prediction accuracy, which determines the beam initialization for the next frame. At SNR = −5 dB, TT attains 0.223 m and 0.872°, compared with 0.271 m and 1.34° for PAST+hold, while EKF yields the largest RMSEs of 0.670 m and 2.63°. At SNR = 20 dB, TT attains 0.0142 m and 0.0772°, while KalmanNet attains 0.0203 m and 0.118° as the best baseline. Prediction is more sensitive to occasional unreliable evidence and motion changes. TT maintains the best prediction accuracy by aggregating the causal window and downweighting unreliable evidence through its reliability-injected attention, whereas LSTM is more likely to be affected by bad-frame perturbations in the input without such explicit reliability handling [27].

The communication performance is evaluated by the single-stream spectral efficiency (SE), defined as

$$ \mathrm{SE}(q+1) = \frac{1}{M} \sum_{m=1}^{M} \log_2\!\left( 1 + \frac{\rho \left| \mathbf{w}^{H}_{\mathrm{eq}}[m] \, \mathbf{H}[m] \, \mathbf{f}_{\mathrm{eq}}[m] \right|^2}{\sigma^2_n} \right). \tag{61} $$

At the $q$th frame, based on the prediction $\hat{\mathbf{p}}_{\mathrm{pre}}(q+1)$, an effective unit-norm transmit beam $\mathbf{f}_{\mathrm{eq}}[m]$ is constructed using the HSPM in Sec. II, with intra-subarray steering from the predicted local angles and inter-subarray phase from the predicted distances, while $\mathbf{w}_{\mathrm{eq}}[m]$ is set as an ideal matched unit-norm combiner. The ideal NF alignment curve uses the true $\mathbf{p}(q+1)$, and the FF beams discard the distance and use only the global angle. As shown in Fig. 7, under a fixed transmit-power budget corresponding to −0.1 dBm per subcarrier, the proposed TT stays close to the optimal SE across the tested distances, and consistently outperforms FF steering and the other tracking approaches even when bad frames occur. The SE of TT with $\alpha = 0$ is only 0.31 bps/Hz and 0.37 bps/Hz below the optimal SE at 35 m and 120 m, respectively. Ignoring the distance incurs up to 4.15 bps/Hz SE loss at 45 m, a loss that generally decreases with distance. This quantifies the importance of considering distance in NF communication.

Fig. 7. Spectral efficiency comparison of different methods.

TABLE II
ABLATION STUDIES OF PAST (α = 0.1, SNR = 15 dB)

Method                        Distance RMSE (m)     Angle RMSE (°)
PAST (full)                   0.0232                0.147
w/o c_{k,g} (c_{k,g} = 1)     0.0351 (+51.29%)      0.239 (+62.59%)
w/o e_b, b_{ij}^{(h)}         0.0263 (+13.36%)      0.195 (+32.65%)
w/o factorized fusion         0.0276 (+18.97%)      0.178 (+21.09%)
w/o L^s_phy                   0.0329 (+41.81%)      0.203 (+38.10%)

D. Ablation Study

Ablation studies are conducted for PAST and TT. For module-level attribution, all TT variants are trained on the same evidence sequences $\{\mathcal{E}(q)\}$ from the frozen full PAST.

In Table II, we verify the necessity of the key PAST components. First, removing the reliability gate by setting $c_{k,g} = 1$ causes the largest degradation, increasing the distance and angle RMSE by 51.29% and 62.59%. This highlights the role of reliability-aware evidence competition and aggregation in (25) and (27) and pooling in (28) under bad frames. Second, without the physics-in-the-loop regularizer $\mathcal{L}^{s}_{\mathrm{phy}}$, the RMSE increases by 41.81% and 38.10%, demonstrating that the consistency constraint in (33) provides a strong physical anchor beyond supervised regression. Third, disabling the geometry priors $\mathbf{e}_b$ and $b^{(h)}_{ij}$ increases the angle RMSE by 32.65%, validating that the geo-arm bias is essential for extracting parallax information. Replacing the factorized fusion with a single encoder also worsens performance, suggesting that the intra- and inter-subarray factorization aligns better with the array structure while being computationally efficient.

TABLE III
ABLATION STUDIES OF TT (α = 0.1, q_a = 9, SNR = 15 dB)

                          Distance RMSE (m)                        Angle RMSE (°)
Method                    Filtering           Prediction           Filtering           Prediction
TT (full)                 0.0194              0.0227               0.0981              0.105
w/o ϱ(q)                  0.0230 (+18.56%)    0.0293 (+29.07%)     0.143 (+45.77%)     0.169 (+60.95%)
w/o d_k(q) (a_κ = 0)      0.0226 (+16.49%)    0.0289 (+27.31%)     0.108 (+10.09%)     0.155 (+47.62%)
w/o L^t_phy               0.0228 (+17.53%)    0.0302 (+33.04%)     0.151 (+53.92%)     0.172 (+63.81%)
w/o causal M(q)           0.0210 (+8.25%)     0.0253 (+11.45%)     0.112 (+14.17%)     0.129 (+22.86%)
w/o f_fil                 –                   0.0275 (+21.15%)     –                   0.163 (+55.24%)

In Table III, the filtering and prediction RMSEs are reported side by side for each metric. Without the reliability-injected attention $\boldsymbol{\varrho}(q)$, the filtering and prediction angle RMSEs increase by 45.77% and 60.95%, while removing the temporal physics loop $\mathcal{L}^{t}_{\mathrm{phy}}$ also yields an obvious degradation. These results verify that strong robustness requires both the reliability-aware aggregation in (50)-(51) and the drift-consistent regularization in (58). In addition, dropping the increment descriptor $\mathbf{d}_k(q)$ mainly affects prediction, consistent with its role of injecting inter-frame delay and offset cues from (38) to (40) for forecasting. Removing the causal constraint in $\mathbf{M}(q)$ results in a smaller yet consistent decline, supporting that strictly causal message passing with intra-frame bidirectional attention improves online generalization. Disabling the filtering head $f_{\mathrm{fil}}$ degrades prediction to 0.0275 m and 0.163°, highlighting the necessity of the dual-head design and the filtered-state anchor in (53) and (54b) for stable prediction.

TABLE IV
COMPUTATIONAL COMPLEXITY COMPARISON

Method                   Complexity
PAST (ours)              O(C_PAST)
JARE                     O(P N_g)
2D-MUSIC                 O(K^2 M + K^3 + N_g K^2)
BSA                      O(I P N_d)
ConvNeXt regression      O(C_cnn)
PAST+EKF                 O(C_PAST + n_a^3 + n_a^2 n_b)
PAST+LSTM                O(C_PAST + n_b d_r + d_r^2)
PAST+KalmanNet           O(C_PAST + n_b d_k + d_k^2)
PAST+TT (ours)           O(C_PAST + M + L_T (N^2 d_T + N d_T d_f))

E. Computational Complexity

Table IV compares the computational complexity of different methods under the same comb-type FDM pilot budget. For PAST, $d_P$ is the latent width, and $L_a$ and $L_b$ denote the numbers of layers in $\mathrm{Enc}_{\mathrm{intra}}(\cdot)$ and $\mathrm{Enc}_{\mathrm{inter}}(\cdot)$. Accordingly, the dominant per-frame PAST cost is $C_{\mathrm{PAST}} = O\left( M + L_a K G^2 d_P + L_b K^2 d_P \right)$. We report the attention-dominated asymptotic complexity, where linear projections, FFN terms, causal masking and reliability injection only affect constant factors.
For the baselines, $P$ is the pilot-feature dimension, $N_g$ is the polar-grid size, $N_d$ is the dictionary size, $I$ is the iteration number, $n_a$ and $n_b$ are the EKF state and measurement dimensions, $d_r$ and $d_k$ are the hidden widths of LSTM and KalmanNet, and $C_{\mathrm{cnn}}$ is the MAC count of ConvNeXt at the chosen input resolution. Specifically, we use $d_T = 128$, $L_T = 4$ and $N_h = 8$. Measured on an RTX 4090, the runtimes of PAST, TT frame-level tokenization and TT are $t_{\mathrm{PAST}} = 0.29$ ms, $t_{\mathrm{tok}} = 0.12$ ms and $t_{\mathrm{TT}} = 0.32$ ms, respectively. Scheduling PAST and frame-level tokenization in parallel, the critical-path latency is $t_{\mathrm{fr}} = \max\{ t_{\mathrm{PAST}}, t_{\mathrm{tok}} \} + t_{\mathrm{TT}} = 0.61$ ms. A conservative sequential upper bound is $t_{\mathrm{seq}} = t_{\mathrm{PAST}} + t_{\mathrm{tok}} + t_{\mathrm{TT}} = 0.73$ ms. Both are below the frame interval $T_0 = 1$ ms in Table I, leaving sufficient margin for beam update and data transmission.

VI. CONCLUSION

In this paper, we showed that the parallax effect can bridge low-overhead FF DFT codebook probing to accurate NF beam tracking in THz UM-MIMO without NF codebook sweeping. Comb-type FDM pilots create subarray-dependent frequency-affine phase signatures whose adjacent-tone and inter-frame increments provide delay and drift evidence while mitigating frame-wise common phase offsets. Guided by this physics-informed structure, PAST produces compact tokens with explicit reliability and delivers per-frame localization with a physics-in-the-loop consistency regularizer. TT then performs reliability-injected causal attention over a sliding window to jointly filter and predict the next-frame state for beam initialization, and it is further anchored by a temporal physics loop. Simulations confirm that at an SNR of 15 dB, PAST achieves 7.81 mm distance RMSE and 0.0588° angle RMSE, and with a bad-frame rate of $\alpha = 0.1$, TT achieves 0.0227 m distance and 0.105° angle prediction RMSE with a 0.61 ms critical-path latency, which is below the frame duration.
Future work will address stronger NLoS and blockage, measured datasets, hardware impairments, and multi-user settings.

REFERENCES

[1] Z. Zeng and C. Han, “Parallax-aware spatial transformer: Fusing physics and learning for terahertz near-field localization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 2026.
[2] A. Shafie, N. Yang, C. Han, J. M. Jornet, M. Juntti, and T. Kürner, “Terahertz communications for 6G and beyond wireless networks: Challenges, key advancements, and opportunities,” IEEE Netw., vol. 37, no. 3, pp. 162–169, May 2023.
[3] I. F. Akyildiz, C. Han, and S. Nie, “Combating the distance problem in the millimeter wave and terahertz frequency bands,” IEEE Commun. Mag., vol. 56, no. 6, pp. 102–108, Jun. 2018.
[4] C. Han, J. M. Jornet, and I. Akyildiz, “Ultra-massive MIMO channel modeling for graphene-enabled terahertz-band communications,” in Proc. IEEE 87th Veh. Technol. Conf. (VTC Spring), Jun. 2018, pp. 1–5.
[5] L. Yan, C. Han, and J. Yuan, “A dynamic array-of-subarrays architecture and hybrid precoding algorithms for terahertz wireless communications,” IEEE J. Sel. Areas Commun., vol. 38, no. 9, pp. 2041–2056, Sep. 2020.
[6] L. Yan, Y. Chen, C. Han, and J. Yuan, “Joint inter-path and intra-path multiplexing for terahertz widely-spaced multi-subarray hybrid beamforming systems,” IEEE Trans. Commun., vol. 70, no. 2, pp. 1391–1406, Feb. 2022.
[7] Y. Chen, L. Yan, C. Han, and M. Tao, “Millidegree-level direction-of-arrival estimation and tracking for terahertz ultra-massive MIMO systems,” IEEE Trans. Wireless Commun., vol. 21, no. 2, pp. 869–883, Feb. 2022.
[8] Y. Chen, C. Han, and E. Björnson, “Can far-field beam training be deployed for cross-field beam alignment in terahertz UM-MIMO communications?” IEEE Trans. Wireless Commun., vol. 23, no. 10, pp. 14 972–14 987, Oct. 2024.
[9] Y. Chen, H. Shen, and C. Han, “Cross far- and near-field beam management technologies in millimeter-wave and terahertz MIMO systems,” IEEE Open J. Veh. Technol., vol. 7, pp. 73–107, Nov. 2025.
[10] M. Cui and L. Dai, “Channel estimation for extremely large-scale MIMO: Far-field or near-field?” IEEE Trans. Commun., vol. 70, no. 4, pp. 2663–2677, Apr. 2022.
[11] Y. Lu, Z. Zhang, and L. Dai, “Hierarchical beam training for extremely large-scale MIMO: From far-field to near-field,” IEEE Trans. Commun., vol. 72, no. 4, pp. 2247–2259, Apr. 2024.
[12] K. Chen, C. Qi, C.-X. Wang, and G. Y. Li, “Beam training and tracking for extremely large-scale MIMO communications,” IEEE Trans. Wireless Commun., vol. 23, no. 5, pp. 5048–5062, May 2024.
[13] M. Cui, Z. Wu, Y. Lu, X. Wei, and L. Dai, “Near-field MIMO communications for 6G: Fundamentals, challenges, potentials, and future directions,” IEEE Commun. Mag., vol. 61, no. 1, pp. 40–46, Jan. 2023.
[14] K. Chen, C. Qi, O. A. Dobre, and G. Ye Li, “Triple-refined hybrid-field beam training for mmWave extremely large-scale MIMO,” IEEE Trans. Wireless Commun., vol. 23, no. 8, pp. 8556–8570, Aug. 2024.
[15] C. You, Y. Zhang, C. Wu, Y. Zeng, B. Zheng, L. Chen, L. Dai, and A. L. Swindlehurst, “Near-field beam management for extremely large-scale array communications,” arXiv preprint, Jun. 2023.
[16] X. Shi, J. Wang, Z. Sun, and J. Song, “Spatial-chirp codebook-based hierarchical beam training for extremely large-scale massive MIMO,” IEEE Trans. Wireless Commun., vol. 23, no. 4, pp. 2824–2838, Apr. 2024.
[17] Y. Zhang, X. Wu, and C. You, “Fast near-field beam training for extremely large-scale array,” IEEE Wireless Commun. Lett., vol. 11, no. 12, pp. 2625–2629, Dec. 2022.
[18] M. Giordani, M. Polese, A. Roy, D. Castor, and M. Zorzi, “A tutorial on beam management for 3GPP NR at mmWave frequencies,” IEEE Commun. Surveys Tuts., vol. 21, no. 1, pp. 173–196, Sep. 2018.
[19] X. Wu, C. You, J. Li, and Y. Zhang, “Near-field beam training: Joint angle and range estimation with DFT codebook,” IEEE Trans. Wireless Commun., vol. 23, no. 9, pp. 11 890–11 903, Sep. 2024.
[20] S. Yang, X. Chen, Y. Xiu, W. Lyu, Z. Zhang, and C. Yuen, “Performance bounds for near-field localization with widely-spaced multi-subarray mmWave/THz MIMO,” IEEE Trans. Wireless Commun., vol. 23, no. 9, pp. 10 757–10 772, Sep. 2024.
[21] G. E. Garcia, G. Seco-Granados, E. Karipidis, and H. Wymeersch, “Transmitter beam selection in millimeter-wave MIMO with in-band position-aiding,” IEEE Trans. Wireless Commun., vol. 17, no. 9, pp. 6082–6092, Sep. 2018.
[22] W. Yi, W. Zhiqing, and F. Zhiyong, “Beam training and tracking in mmWave communication: A survey,” China Commun., vol. 21, no. 6, pp. 1–22, Jun. 2024.
[23] S. Shaham, M. Kokshoorn, M. Ding, Z. Lin, and M. Shirvanimoghaddam, “Extended Kalman filter beam tracking for millimeter wave vehicular communications,” in Proc. IEEE Int. Conf. Commun. Workshops (ICC Wkshps), Jul. 2020, pp. 1–6.
[24] S. G. Larew and D. J. Love, “Adaptive beam tracking with the unscented Kalman filter for millimeter wave communication,” IEEE Signal Process. Lett., vol. 26, no. 11, pp. 1658–1662, Nov. 2019.
[25] C. Zhang, D. Guo, and P. Fan, “Tracking angles of departure and arrival in a mobile millimeter wave channel,” in Proc. IEEE Int. Conf. Commun. (ICC), Jul. 2016, pp. 1–6.
[26] H. Wang, H. Li, J. Fang, and H. Wang, “Robust Gaussian Kalman filter with outlier detection,” IEEE Signal Process. Lett., vol. 25, no. 8, pp. 1236–1240, Aug. 2018.
[27] S. H. Lim, S. Kim, B. Shim, and J. W. Choi, “Deep learning-based beam tracking for millimeter-wave communications under mobility,” IEEE Trans. Commun., vol. 69, no. 11, pp. 7458–7469, Nov. 2021.
[28] G. Revach, N. Shlezinger, X. Ni, A. L. Escoriza, R. J. G. van Sloun, and Y. C. Eldar, “KalmanNet: Neural network aided Kalman filtering for partially known dynamics,” IEEE Trans. Signal Process., vol. 70, pp. 1532–1547, Mar. 2022.
[29] Y. Wang, C. Qi, W. He, and A. Nallanathan, “Near-field beam training and tracking with deep learning for extremely large-scale MIMO,” IEEE Trans. Veh. Technol., vol. 74, no. 12, pp. 19 783–19 788, Dec. 2025.
[30] S. K. Dehkordi, M. Kobayashi, and G. Caire, “Adaptive beam tracking based on recurrent neural networks for mmWave channels,” in Proc. IEEE 22nd Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), Nov. 2021, pp. 1–5.
[31] S. Jaeckel, L. Raschkowski, K. Börner, and L. Thiele, “QuaDRiGa: A 3-D multi-cell channel model with time evolution for enabling virtual field trials,” IEEE Trans. Antennas Propag., vol. 62, no. 6, pp. 3242–3256, Jun. 2014.
[32] Y. Li, Y. Wang, Y. Lyu, Z. Yu, and C. Han, “220 GHz urban microcell channel measurement and characterization on a university campus,” in Proc. IEEE Globecom Workshops (GC Wkshps), Dec. 2024, pp. 1–5.
[33] D. Gürgünoğlu, A. Kosasih, P. Ramezani, Ö. T. Demir, E. Björnson, and G. Fodor, “Performance analysis of a 2D-MUSIC algorithm for parametric near-field channel estimation,” IEEE Wireless Commun. Lett., vol. 14, no. 5, pp. 1496–1500, May 2025.
[34] H. Wang, J. Fang, H. Duan, and H. Li, “Near/far-field channel estimation for terahertz systems with ELAAs: A block-sparse-aware approach,” arXiv preprint arXiv:2404.05544, Sep. 2024.
[35] H. Lei, J. Zhang, H. Xiao, D. W. K. Ng, and B. Ai, “Deep learning-based near-field user localization with beam squint in wideband XL-MIMO systems,” IEEE Trans. Wireless Commun., vol. 24, no. 2, pp. 1568–1583, Feb. 2025.
