Physics-Informed Spatial-Temporal Transformer for Terahertz Near-Field Beam Tracking

Terahertz (THz) ultra-massive multiple-input multiple-output (UM-MIMO) promises ultra-high throughput, while its highly directional beams demand rapid and accurate beam tracking driven by precise user-state estimation. Moreover, large array apertures…

Authors: Zhi Zeng, Chong Han, Emil Björnson

Zhi Zeng, Student Member, IEEE, Chong Han, Senior Member, IEEE, and Emil Björnson, Fellow, IEEE

Abstract: Terahertz (THz) ultra-massive multiple-input multiple-output (UM-MIMO) promises ultra-high throughput, while its highly directional beams demand rapid and accurate beam tracking driven by precise user-state estimation. Moreover, large array apertures at high frequencies induce near-field propagation effects, where far-field modeling becomes inaccurate and near-field parametric channel estimation is costly. Bypassing near-field codebooks, PAST-TT is proposed to bridge near-field tracking with low-overhead far-field codebook probing by exploiting parallax, amplified by widely spaced subarrays. With comb-type frequency-division multiplexing pilots, each subarray yields frequency-affine phase signatures whose frequency and temporal increments encode propagation delay and its variation between frames. Building on these signatures, a Parallax-Aware Spatial Transformer (PAST) compresses them and outputs per-frame position estimates with token reliability to downweight bad frames, regularized by a physics-in-the-loop consistency loss. A causal Temporal Transformer (TT) then performs reliability-aware filtering and prediction over a sliding window to initialize the beam of the next frame. Acting on short token sequences, PAST-TT avoids a monolithic spatial-temporal network over raw pilots, which keeps the model lightweight with a critical-path latency of 0.61 ms. Simulations show that at 15 dB signal-to-noise ratio, PAST achieves 7.81 mm distance RMSE and 0.0588° angle RMSE. Even with a bad-frame rate of 0.1, TT reduces the distance and angle prediction RMSE by 23.1% and 32.8% compared with the best competing tracker.

Index Terms: Terahertz communications, near-field, beam tracking, physics-informed deep learning.

An earlier version of this paper will be presented in part at the IEEE ICASSP, May 2026 [1]. Zhi Zeng is with the Terahertz Wireless Communications (TWC) Laboratory, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: zhi.zeng@sjtu.edu.cn). Chong Han is with the Terahertz Wireless Communications (TWC) Laboratory and also the Cooperative Medianet Innovation Center (CMIC), School of Information Science and Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: chong.han@sjtu.edu.cn). Emil Björnson is with the Department of Communication Systems, KTH Royal Institute of Technology, 100 44 Stockholm, Sweden (e-mail: emilbjo@kth.se).

I. INTRODUCTION

Terahertz (THz) communications have emerged as an attractive component for future wireless systems, as the ultra-broad spectrum from 0.1 THz to 10 THz provides abundant bandwidth and opens the door to extreme-capacity wireless links [2]. Nevertheless, the notable path loss in the THz band can affect the link budget and limit coverage [3]. Fortunately, the short wavelength enables dense integration and ultra-large apertures, making ultra-massive multiple-input multiple-output (UM-MIMO) a natural companion to THz for harvesting substantial beamforming gains and compensating for the severe propagation loss [4]. To make THz UM-MIMO practical, hybrid beamforming (HBF) with a limited number of RF chains is typically adopted for hardware efficiency, which constrains how many beams can be probed simultaneously and makes low-overhead training imperative [5], [6]. However, the resulting pencil beams are highly sensitive to mobility and channel dynamics, and even modest pointing errors can cause severe throughput degradation [7]–[9]. Consequently, accurate beam alignment and tracking become essential, requiring frequent beam updates based on the time-varying user state under tight per-frame latency and training overhead constraints.
Furthermore, this alignment requirement becomes more challenging because large array apertures at high carrier frequencies confine THz links into the radiative near-field (NF), where the conventional far-field (FF) plane-wave model (PWM) is inaccurate due to non-negligible wavefront curvature, whereas the element-wise spherical-wave model (SWM) can be computationally prohibitive for UM-MIMO [8]–[12]. Additionally, in THz links with a dominant line-of-sight (LoS) path, NF beams depend jointly on angle and distance, which substantially enlarges the search space for beam management and makes joint angle-distance sweeping prohibitively expensive for low-latency tracking [10], [13]. Meanwhile, practical training observations can be unreliable due to intermittent blockage and other impairments, further destabilizing online tracking if not properly detected and handled.

Taken together, these considerations underscore an urgent need for NF beam tracking mechanisms that simultaneously achieve high accuracy, strong robustness to occasional unreliable frames, and low control latency under practical THz UM-MIMO architectures.

A. Related Work

Extremely large apertures at THz bands move a non-negligible portion of propagation from purely FF into the radiative NF, where wavefront curvature breaks the FF plane-wave abstraction and makes beamforming and channel acquisition jointly depend on angle and distance [8]–[12]. Accordingly, extensive efforts revisit beam training and alignment under NF channel models and polar-domain representations that parameterize paths by direction and distance [8]–[12], [14], [15]. To reduce the prohibitive two-dimensional search burden, hierarchical training and structured NF codebooks have emerged, including spatial-chirp [16] and multi-stage refinement strategies that progressively narrow down the candidate region in the angle-distance domain [11], [14].
Related works also consider staged refinement that first leverages FF-like probing to confine angular candidates and then applies NF-aware processing to refine distance parameters, aiming to balance modeling accuracy and training complexity [17]. In addition, polar-domain codebook size and search overhead can scale sharply with the number of antennas, mobility and HBF constraints, which becomes particularly challenging for per-frame beam updates under tight control-latency budgets [15]. Beyond one-shot alignment, NF beam tracking has also started to attract attention, for example, by combining kinematic modeling with recursive estimators under hybrid architectures [12]. Nevertheless, many NF beam-management approaches still rely on searching over a large polar-domain candidate set to resolve distance-dependent focusing [11], [15]. This leaves a gap for fast beam management: extracting NF geometric evidence under a lightweight probing budget without committing to NF codebook sweeping.

In practical THz UM-MIMO deployments, unified probing over a wide distance range is favored, motivating NF tracking designs to operate with FF discrete Fourier transform (DFT) codebook probing [18]. Specifically, several works revisit the long-standing assumption that FF codebooks are unsuitable in the NF, and show that FF DFT probing can still expose distance-sensitive signatures through beam-response patterns, enabling NF alignment with reduced overhead [8], [9], [19]. In parallel, widely-spaced multi-subarray (WSMS) architectures have been advocated as practical THz front-ends that preserve high per-subarray gain while exploiting inter-subarray phase diversity, thereby enriching NF geometric information compared to co-located arrays [6]. Performance analyses further quantify such benefits by relating angle-distance accuracy to the multi-view aperture created by the subarray layout [20].
Closely related, position- or geometry-aided beam management leverages user-state information to reduce beam selection overhead in highly directional systems [21].

Despite these advances, existing solutions are often developed for one-shot alignment or isolated estimation tasks, and it remains underexplored how to convert low-overhead probing outcomes into a persistent and trackable measurement representation that can be robustly consumed by an online tracker under intermittent impairments. This motivates a second gap: a structured handoff from FF probing to NF tracking that carries forward geometry evidence together with a principled notion of measurement reliability.

After initial access, beam tracking aims to maintain alignment for mobile users with minimal overhead by exploiting temporal correlation in positions, path gains, or other low-dimensional channel parameters [22]. Kalman-type trackers perform recursive prediction and correction using sounding-beam observations, and robust extensions add realignment triggers and outlier rejection to mitigate loss of track under blockage and intermittent observations [23]–[26]. In parallel, deep learning (DL) has been increasingly explored for beam tracking, where sequence models map historical pilots or channel state information to future beams [27]. Beyond end-to-end predictors, learning-augmented filtering preserves the recursive Bayesian prediction-update structure, while using neural networks to learn key components that are difficult to model accurately, such as the Kalman gain [28]. More recently, DL has also been applied to NF and UM-MIMO beam training and tracking, typically by CNN-assisted hierarchical training and RNN-based tracking coupled with sweeping [29], [30]. Nevertheless, many learning-based designs still treat measurements as high-dimensional black-box inputs or rely on additional probing to explore the joint angle-distance space.
While these methods can improve accuracy in challenging dynamics, online deployment in highly directional THz systems still hinges on strict causality, robustness against sporadic bad frames and tight inference latency within millisecond-scale control loops. Moreover, much of the tracking literature is rooted in FF angle-only abstractions [24], [25], whereas NF tracking involves angle-distance coupling and regime-dependent measurement structures. Therefore, the third gap remains underexplored: an end-to-end NF tracking loop that is explicitly reliability-aware and latency-conscious under practical THz beam management budgets.

B. Contributions

In this work, we enable THz NF beam tracking through parallax under lightweight FF DFT probing. To form a tracking-ready interface from comb-type frequency-division multiplexing (FDM) pilots, we design a parallax-aware spatial transformer (PAST) for one-shot NF localization and reliability-aware tokenization. Then, we develop a causal temporal transformer (TT) to close the online beam tracking loop, leading to PAST-TT. The main contributions are summarized as follows.

• We establish a parallax bridge that enables NF tracking under a lightweight FF probing budget. Under a widely-spaced-subarray architecture with comb-type FDM pilots, we consider a hybrid spherical- and planar-wave channel model (HSPM). The phase signature of each subarray exhibits an approximately frequency-affine structure, and its adjacent-tone and inter-frame increments encode the propagation delay and its variation between frames. This leads to structured space-frequency-time evidence that supports low-overhead geometric inference without exhaustive joint angle-distance sweeping.

• We propose the PAST as a physics-informed measurement encoder with explicit reliability-aware tokenization.
Leveraging the above evidence structure, high-dimensional pilots are distilled into compact physical tokens that preserve parallax and provide reliability statistics, suppressing intermittently corrupted observations without relying on fragile phase unwrapping. Geometry-biased, reliability-gated attention and a physics-in-the-loop consistency regularizer jointly improve the localization accuracy and measurement quality for each frame.

• We develop the TT as a causal reliability-aware filter-predictor to close the tracking loop under bad frames. Using the per-frame tokens from PAST, TT performs reliability-injected masked attention over a sliding window and outputs a filtered state and a one-step prediction, where the prediction directly initializes the next-frame probing beam under the same FF DFT codebook. An increment descriptor and a temporal physics loop further mitigate error propagation under bad frames.

• We evaluate the performance of PAST-TT compared with other representative methods. We carry out extensive simulations across SNRs, bad-frame rates and mobility. The results demonstrate that PAST-TT outperforms the compared solutions with improved estimation accuracy and robustness under intermittently corrupted observations, leading to higher spectral efficiency. Complexity analysis and GPU latency measurement further confirm that its critical-path latency at each frame stays below the frame interval and enables real-time tracking.

Fig. 1. THz UM-MIMO system model with fixed BS and mobile UE.

The remainder of this paper is organized as follows. Sec. II presents the system model with the parallax-aware NF channel and signal structure, together with the problem formulation. Sec. III introduces PAST for per-frame localization and reliability-aware evidence extraction (an earlier version of the NF localization part was presented in [1]). Sec.
IV develops the causal TT for online filtering and one-step prediction. Sec. V evaluates the performance of the proposed methods. Finally, Sec. VI concludes the paper.

II. SYSTEM OVERVIEW AND PROBLEM FORMULATION

In this section, we first introduce the system model of THz UM-MIMO, followed by the HSPM with parallax effects and the observation signal structure. Then, we formulate the per-frame localization and online beam tracking problems.

A. System Model

As illustrated in Fig. 1, we consider a THz downlink UM-MIMO system where a base station (BS) with widely-spaced subarrays serves a user equipment (UE) in the radiative NF region. The enlarged inter-subarray spacing not only improves the spatial multiplexing capability, but more critically, amplifies the geometric parallax effect, whereby the same UE position induces distinct propagation distances and angles across subarrays. Specifically, the BS employs $N_t$ transmit antennas arranged as $K$ uniform planar arrays (UPAs) on the x-z plane. Each subarray contains $N_s = N_t/K$ antennas with inter-element spacing $d = \lambda/2$ at central carrier wavelength $\lambda$, and the inter-subarray spacing is much larger than $d$. The UE is equipped with a UPA of $N_r$ antennas with spacing $d$.

To reduce the hardware overhead, both the BS and UE adopt the HBF structure. The BS uses a sub-connected architecture with $L_t = K$ RF chains [6]. Therefore, the analog precoder $\mathbf{F}_{\mathrm{RF}} \in \mathbb{C}^{N_t \times L_t}$ takes a block-diagonal form as

$$\mathbf{F}_{\mathrm{RF}} = \mathrm{blkdiag}(\mathbf{f}_{\mathrm{RF},1}, \mathbf{f}_{\mathrm{RF},2}, \ldots, \mathbf{f}_{\mathrm{RF},K}), \tag{1}$$

where $\mathbf{f}_{\mathrm{RF},k} \in \mathbb{C}^{N_s}$ denotes the analog precoding vector of the $k$th subarray for $k = 1, \ldots, K$. Since the analog precoder is implemented by phase shifters, each element in $\mathbf{f}_{\mathrm{RF},k}$ satisfies the constant-modulus constraint $|\mathbf{f}_{\mathrm{RF},k}(i)| = 1/\sqrt{N_s}$. At the UE, a fully-connected architecture is adopted, where each of the $L_r$ RF chains is connected to all $N_r$ antennas through phase shifters.
The analog combiner $\mathbf{W}_{\mathrm{RF}} \in \mathbb{C}^{N_r \times L_r}$ also obeys the constant-modulus constraint.

The system operates with orthogonal frequency division multiplexing (OFDM) using $M$ subcarriers with spacing $\Delta f$. At the $m$th subcarrier, the BS transmits an $N_{st}$-dimensional symbol vector $\mathbf{s}[m] \in \mathbb{C}^{N_{st}}$. It is first processed by a digital baseband precoder $\mathbf{F}_{\mathrm{BB}}[m] \in \mathbb{C}^{L_t \times N_{st}}$ and then by the analog precoder $\mathbf{F}_{\mathrm{RF}}$. After that, it is propagated through the wideband THz channel $\mathbf{H}[m] \in \mathbb{C}^{N_r \times N_t}$. The UE applies the analog combiner $\mathbf{W}_{\mathrm{RF}}$ and a digital combiner $\mathbf{W}_{\mathrm{BB}}[m] \in \mathbb{C}^{L_r \times N_{sr}}$ to form the $N_{sr}$-dimensional baseband output as

$$\mathbf{y}[m] = \mathbf{W}_{\mathrm{eq}}^H[m]\, \mathbf{H}[m]\, \mathbf{F}_{\mathrm{eq}}[m]\, \mathbf{s}[m] + \mathbf{W}_{\mathrm{eq}}^H[m]\, \mathbf{n}[m], \tag{2}$$

where $\mathbf{F}_{\mathrm{eq}}[m] = \mathbf{F}_{\mathrm{RF}} \mathbf{F}_{\mathrm{BB}}[m] \in \mathbb{C}^{N_t \times N_{st}}$ and $\mathbf{W}_{\mathrm{eq}}[m] = \mathbf{W}_{\mathrm{RF}} \mathbf{W}_{\mathrm{BB}}[m] \in \mathbb{C}^{N_r \times N_{sr}}$ represent the precoding and combining matrices, respectively, with $\|\mathbf{F}_{\mathrm{eq}}[m]\|_F^2 = 1$ and $\|\mathbf{W}_{\mathrm{eq}}[m]\|_F^2 = 1$, and $\mathbf{n}[m] \sim \mathcal{CN}(\mathbf{0}, \sigma_n^2 \mathbf{I}_{N_r})$ is the additive white Gaussian noise. Moreover, the symbol vector is normalized as $\mathbb{E}\{\mathbf{s}[m] \mathbf{s}^H[m]\} = \rho \mathbf{I}_{N_{st}}$, where $\rho$ denotes the average transmit power per subcarrier. To satisfy the total power constraint $P_t$ across all $M$ subcarriers, we have $M\rho = P_t$.

During the pilot transmission stage for beam tracking, the above model specializes to a configuration that enables parallax extraction across subarrays that share a common local oscillator and baseband. We set $N_{st} = K$ and adopt an identity digital precoder $\mathbf{F}_{\mathrm{BB}}[m] = \frac{1}{\sqrt{K}} \mathbf{I}_K$, so that each stream is mapped to one subarray, while all subarrays are excited simultaneously toward the UE. To ensure that the UE can resolve the individual observation from each subarray, we employ FDM pilots with disjoint subcarrier subsets $\{\mathcal{M}_k\}_{k=1}^{K}$, where only the $k$th stream is active for $m \in \mathcal{M}_k$.
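As a quick consistency check on the pilot-stage configuration, the sketch below builds a block-diagonal analog precoder with constant-modulus entries and the scaled-identity digital precoder, then confirms that the equivalent precoder has unit Frobenius norm. The sizes `K` and `N_S` are illustrative values, and the one-dimensional DFT beam is a simplifying assumption (a UPA codebook would normally be a Kronecker product of two such beams); the helper names are hypothetical.

```python
import numpy as np

def dft_beam(n_s, index):
    """One column of an n_s-point DFT codebook; every entry has modulus 1/sqrt(n_s)."""
    n = np.arange(n_s)
    return np.exp(2j * np.pi * n * index / n_s) / np.sqrt(n_s)

def pilot_precoder(beams):
    """F_eq = F_RF @ F_BB with block-diagonal F_RF as in (1) and F_BB = I_K / sqrt(K)."""
    k = len(beams)
    n_s = beams[0].size
    f_rf = np.zeros((k * n_s, k), dtype=complex)
    for col, beam in enumerate(beams):
        # Each RF chain drives only the antennas of its own subarray.
        f_rf[col * n_s:(col + 1) * n_s, col] = beam
    f_bb = np.eye(k) / np.sqrt(k)
    return f_rf @ f_bb

K, N_S = 4, 16  # assumed sizes, not taken from the paper
F_eq = pilot_precoder([dft_beam(N_S, i) for i in range(K)])
frob_sq = np.linalg.norm(F_eq, "fro") ** 2
print(F_eq.shape, frob_sq)
```

Each column of the analog precoder has unit norm, so scaling by $1/\sqrt{K}$ spreads unit total power evenly over the $K$ streams, which matches the normalization $\|\mathbf{F}_{\mathrm{eq}}[m]\|_F^2 = 1$ stated with (2).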
Moreover, we adopt a comb-type allocation where each $\mathcal{M}_k$ uniformly spans the entire bandwidth with an effective spacing of $K\Delta f$ between the observed tones of each subarray. For pilot transmission, the analog beam $\mathbf{f}_{\mathrm{RF},k}$ is selected from a standard far-field DFT codebook. Under this design, the received pilots across frequency and subarrays implicitly encode the parallax information used by the subsequent transformer-based modules.

B. Near-Field Channel and Signal Structure

For the considered THz UM-MIMO system, the Rayleigh distance $D_R = \frac{2(S_b + S_u)^2}{\lambda}$ significantly expands, where $S_b$ and $S_u$ denote the array apertures at the BS and UE, respectively. As a result, a typical UE operates in the NF, where a global PWM fails to capture the non-negligible wavefront curvature across widely separated subarrays, whereas a full SWM is computationally heavy. To balance accuracy and complexity, we adopt the HSPM, approximating the wavefront as planar within each subarray and spherical between subarrays. The applicability conditions of HSPM are provided in [8], [9].

Within each subarray, for azimuth $\theta$ and elevation $\phi$, the generic UPA steering vector in PWM can be represented as

$$\mathbf{a}_N(\psi_x, \psi_z) = \left[ 1 \ \cdots \ e^{j\frac{2\pi}{\lambda}\psi_n} \ \cdots \ e^{j\frac{2\pi}{\lambda}\psi_{N-1}} \right]^T, \tag{3}$$

where $\psi_n = d_n^x \psi_x + d_n^z \psi_z$, $\psi_x = \sin\theta\cos\phi$ and $\psi_z = \sin\phi$. Here $N$ is the number of elements in the subarray, and $(d_n^x, d_n^z)$ denotes the distances from the $n$th antenna to the reference antenna on the x- and z-axis. With $N_\ell$ paths between the BS and the UE, the receive and transmit steering vectors of the $k$th subarray for the $\ell$th path are denoted by $\mathbf{a}_k^{r\ell} = \mathbf{a}_{N_r}(\psi_{kx}^{r\ell}, \psi_{kz}^{r\ell})$ and $\mathbf{a}_k^{t\ell} = \mathbf{a}_{N_s}(\psi_{kx}^{t\ell}, \psi_{kz}^{t\ell})$, respectively. For the LoS path in Fig.
1, the distance $D_k^1(\mathbf{p})$ and direction vector $\mathbf{u}_k^1(\mathbf{p})$ from the $k$th subarray (with reference position $\mathbf{s}_k = [x_k, 0, z_k]^T$) to the UE (at $\mathbf{p} = [x, y, z]^T$) are expressed as

$$D_k^1(\mathbf{p}) = \|\mathbf{p} - \mathbf{s}_k\|, \tag{4a}$$

$$\mathbf{u}_k^1(\mathbf{p}) = \frac{\mathbf{p} - \mathbf{s}_k}{D_k^1(\mathbf{p})} = [u_{k,x}, u_{k,y}, u_{k,z}]^T. \tag{4b}$$

In addition, the azimuth $\theta_k^t$ and elevation $\phi_k^t$ in the NF are uniquely determined by the specific geometry between $\mathbf{p}$ and $\mathbf{s}_k$, following $\cos\theta_k^t = \frac{x - x_k}{D_k^1(\mathbf{p}) \cos\phi_k^t}$ and $\sin\phi_k^t = \frac{z - z_k}{D_k^1(\mathbf{p})}$.

However, across the widely separated subarrays, the spherical wavefront curvature is non-negligible and induces distinct phase shifts corresponding to the varying distances. Combining the PWM within subarrays and the SWM across subarrays, the frequency-domain subchannel $\mathbf{H}_k[m] \in \mathbb{C}^{N_r \times N_s}$ between the $k$th subarray and the UE at subcarrier $f_m$ is modeled as

$$\mathbf{H}_k[m] = \sum_{\ell=1}^{N_\ell} \alpha_k^\ell\, e^{-j 2\pi f_m \frac{D_k^\ell(\mathbf{p})}{c}}\, \mathbf{a}_k^{r\ell}(\mathbf{p}) \left( \mathbf{a}_k^{t\ell}(\mathbf{p}) \right)^T, \tag{5}$$

where $\alpha_k^\ell$ is the complex path gain and $D_k^\ell(\mathbf{p})$ is the path length of the $\ell$th path for the $k$th subarray. The aggregate wideband channel is obtained by concatenation as $\mathbf{H}[m] = [\mathbf{H}_1[m], \ldots, \mathbf{H}_K[m]] \in \mathbb{C}^{N_r \times N_t}$, and the FDM pilot scheme in Sec. II-A determines which subcarriers probe each $\mathbf{H}_k[m]$.

The enlarged inter-subarray spacing amplifies the geometric parallax. Taking the first subarray as the reference and defining the geo-arm vector $\mathbf{b}_{k,1} = \mathbf{s}_k - \mathbf{s}_1$, the geometry satisfies

$$D_k(\mathbf{p}) = \sqrt{D_1^2(\mathbf{p}) + \|\mathbf{b}_{k,1}\|^2 - 2 D_1(\mathbf{p}) \left( \mathbf{u}_1(\mathbf{p}) \cdot \mathbf{b}_{k,1} \right)}, \tag{6a}$$

$$\mathbf{u}_k(\mathbf{p}) = \frac{D_1(\mathbf{p})\, \mathbf{u}_1(\mathbf{p}) - \mathbf{b}_{k,1}}{D_k(\mathbf{p})}, \tag{6b}$$

which reveals that the set $\{D_k(\mathbf{p}), \mathbf{u}_k(\mathbf{p})\}_{k=1}^{K}$ forms a geometry-consistent fingerprint of the UE position. In particular, the inter-subarray time difference of arrival (TDoA) $\Delta\tau_{k,1} = (D_k - D_1)/c$ and the power fingerprints reflecting path-loss variations jointly encode the UE position.
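The parallax relations in (6) can be checked numerically against the direct geometry in (4). The following is a minimal sketch under an assumed layout (four subarray references on the x-z plane and one UE position, all hypothetical values); it also evaluates the inter-subarray TDoA relative to the first subarray.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def direct_geometry(p, s_k):
    """Distance and unit direction from subarray reference s_k to UE position p, as in (4)."""
    diff = p - s_k
    dist = np.linalg.norm(diff)
    return dist, diff / dist

def parallax_geometry(d1, u1, b_k1):
    """Distance and direction at subarray k from subarray 1's view and the geo-arm b_k1, as in (6)."""
    d_k = np.sqrt(d1**2 + b_k1 @ b_k1 - 2.0 * d1 * (u1 @ b_k1))
    u_k = (d1 * u1 - b_k1) / d_k
    return d_k, u_k

# Assumed layout: subarray references with y = 0, UE in front of the array.
subarrays = np.array([[-0.3, 0.0, 0.0], [-0.1, 0.0, 0.0],
                      [0.1, 0.0, 0.0], [0.3, 0.0, 0.0]])
p = np.array([0.5, 3.0, 0.2])

d1, u1 = direct_geometry(p, subarrays[0])
tdoa, max_mismatch = [], 0.0
for s_k in subarrays:
    b_k1 = s_k - subarrays[0]
    d_k, u_k = parallax_geometry(d1, u1, b_k1)
    d_ref, u_ref = direct_geometry(p, s_k)
    # Compare the parallax form with the direct computation.
    max_mismatch = max(max_mismatch, abs(d_k - d_ref), np.max(np.abs(u_k - u_ref)))
    tdoa.append((d_k - d1) / C)

print("max mismatch:", max_mismatch)
print("TDoA relative to subarray 1 (ns):", np.array(tdoa) * 1e9)
```

By the triangle inequality, each $|D_k - D_1|$ is bounded by the geo-arm length $\|\mathbf{b}_{k,1}\|$, so the TDoA magnitudes in this sketch stay within the array extent divided by $c$.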
Comb-FDM induces a delay periodicity $T_{\mathrm{amb}} = 1/(K\Delta f)$ in $\tau_k$, but $\Delta\tau_{k,1}$ is ambiguity-free since $|\Delta\tau_{k,1}| \le d_{\max}/c < T_{\mathrm{amb}}/2$.

When the UE moves, the channel parameters evolve over time $t = qT_0$, where $q \ge 1$ is the frame index and $T_0$ is the frame duration designed within the channel coherence time. Under the comb-type FDM pilot design, the pilot symbol vector on the $m$th subcarrier at the $q$th frame is denoted as

$$\mathbf{s}[m,q] = \begin{cases} s_p[m,q]\, \mathbf{e}_k, & m \in \mathcal{M}_k, \\ \mathbf{0}, & \text{otherwise}, \end{cases} \tag{7}$$

where $\mathbf{e}_k$ is the $k$th canonical basis vector and $s_p[m,q]$ is a known pilot symbol with $\mathbb{E}\{|s_p[m,q]|^2\} = K\rho$. Exploiting $\mathbf{F}_{\mathrm{BB}}[m] = \frac{1}{\sqrt{K}} \mathbf{I}_K$ and the block-diagonal structure of $\mathbf{F}_{\mathrm{RF}}$, the scalar observation on the $k$th stream is expressed as

$$y_k[m,q] = \mathbf{w}_{\mathrm{eq},k}^H[m]\, \mathbf{H}_k[m,q]\, \mathbf{f}_{\mathrm{RF},k}\, \frac{1}{\sqrt{K}}\, s_p[m,q] + n_k[m,q], \tag{8}$$

where $\mathbf{w}_{\mathrm{eq},k}[m]$ is the effective combining vector used to extract the $k$th pilot stream, and $n_k[m,q]$ is the post-combining noise. For LoS-dominant THz links and narrow pilot beams, (8) can be well approximated as

$$y_k[m,q] \approx \beta_k[m,q]\, e^{-j\Phi_k(m,q)} + n_k[m,q], \tag{9}$$

where $\beta_k[m,q]$ collects path loss, beamforming gain and pilot amplitude, and is slowly varying over frequency and frames. The phase admits a frequency-linear decomposition as

$$\Phi_k(m,q) = 2\pi f_m \tau_k(q) + \phi_{\mathrm{ce}}(q) - \vartheta_k(q) + \epsilon_k[m,q], \tag{10}$$

where $\tau_k(q) = D_k^1(\mathbf{p}(q))/c$ is the propagation delay, $\phi_{\mathrm{ce}}(q)$ is the common phase error (CPE) and $\vartheta_k(q) = 2\pi \sum_{i=0}^{q-1} \nu_k(i) T_0$ is the Doppler-induced accumulated phase with the instantaneous Doppler shift $\nu_k(q) = \frac{f_c}{c} \mathbf{v}^T(q)\, \mathbf{u}_k(\mathbf{p}(q))$. Residual mismatch is collected in $\epsilon_k[m,q]$, including weak multipath, beam-squint effects and other model misalignment to be handled in the subsequent uncertainty-aware learning framework.
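To illustrate the frequency-affine phase in (9)-(10) and the comb-induced delay periodicity $T_{\mathrm{amb}} = 1/(K\Delta f)$, the sketch below generates noiseless pilots on one subarray's comb tones and recovers the delay from adjacent-tone phase increments. All numerical values (carrier, subcarrier spacing, number of tones, delay, common phase) are assumptions for illustration, and the helper names are hypothetical.

```python
import numpy as np

def comb_tones(f0, delta_f, n_sub, k, n_tones):
    """Pilot frequencies of subarray k: every n_sub-th subcarrier, starting at offset k."""
    return f0 + (k + n_sub * np.arange(n_tones)) * delta_f

def noiseless_pilots(freqs, tau, common_phase=0.0, beta=1.0):
    """y = beta * exp(-j * Phi) with Phi = 2*pi*f*tau + common_phase, i.e. (9)-(10) without noise."""
    return beta * np.exp(-1j * (2.0 * np.pi * freqs * tau + common_phase))

def delay_mod_ambiguity(y, tone_spacing):
    """Delay modulo 1/tone_spacing, read from the phase of summed adjacent-tone products."""
    r = y[1:] * np.conj(y[:-1])  # frequency-flat phase terms cancel here
    increment = np.mod(-np.angle(np.sum(r)), 2.0 * np.pi)
    return increment / (2.0 * np.pi * tone_spacing)

K, DELTA_F, F0 = 4, 1e6, 0.3e12      # assumed system values
tone_spacing = K * DELTA_F           # effective spacing seen by one subarray
t_amb = 1.0 / tone_spacing           # delay periodicity of the comb

freqs = comb_tones(F0, DELTA_F, K, k=1, n_tones=64)
tau = 3.2 / 299_792_458.0            # delay of a 3.2 m path (assumed)

est = delay_mod_ambiguity(noiseless_pilots(freqs, tau, common_phase=1.234), tone_spacing)
est_shifted = delay_mod_ambiguity(noiseless_pilots(freqs, tau + t_amb), tone_spacing)
print(est, est_shifted, t_amb)
```

Both estimates coincide with the true delay: the common phase drops out of the adjacent-tone product, and a delay shifted by $T_{\mathrm{amb}}$ produces the same increments, which is the ambiguity mentioned above.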
Although $\Phi_k(m,q)$ is linear in $f_m$ for fixed $q$, its evolution over $q$ is nonlinear due to time-varying Doppler and motion. For robust beam tracking, we consider the inter-frame phase increment computed from $y_k^*[m,q]\, y_k[m,q-1]$, yielding

$$\Delta\Phi_k(m,q) = \Phi_k(m,q) - \Phi_k(m,q-1) = 2\pi f_m \Delta\tau_k(q) - 2\pi \bar{\nu}_k(q) T_0 + \Delta\phi_{\mathrm{ce}}(q) + \Delta\epsilon_k[m,q], \tag{11}$$

where $\Delta\tau_k(q)$, $\Delta\phi_{\mathrm{ce}}(q)$ and $\Delta\epsilon_k[m,q]$ denote the corresponding variations over the $q$th frame interval, and $\bar{\nu}_k(q)$ is the average Doppler over the $q$th frame.

In summary, both $\Phi_k(m,q)$ and $\Delta\Phi_k(m,q)$ are affine in $f_m$, with slopes governed by $\tau_k(q)$ or $\Delta\tau_k(q)$. Across subarrays, the geometric parallax constraints in (6) tightly couple these parameters through the UE position. Therefore, the collection of frequency-linear signatures $\{\Phi_k(m,q)\}_{k,m}$ and their increments form a geometry-consistent fingerprint of the UE. Over time, the evolution of their slopes and offsets reflects the UE dynamics. This space-frequency-time structure provides the physical foundation of our work.

C. Problem Formulation

1) Static Parallax-Aware Localization: We first focus on a single frame and omit its frame index for brevity. Under the comb-type FDM pilot design, stacking all pilot observations across the $K$ subarrays and their assigned subcarrier subsets $\{\mathcal{M}_k\}_{k=1}^{K}$ yields $\mathbf{y} = \big[ \{y_k[m]\}_{m \in \mathcal{M}_k,\, k=1}^{K} \big]^T \in \mathbb{C}^{M}$. Combining the HSPM channel in (5), the parallax constraints in (6) and the scalar phase structure in (9)-(10), $\mathbf{y}$ follows the nonlinear parametric model

$$\mathbf{y} = \mathbf{h}(\mathbf{p}, \boldsymbol{\eta}) + \mathbf{n}, \quad \mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \sigma_n^2 \mathbf{I}_M), \tag{12}$$

where $\boldsymbol{\eta}$ collects nuisance parameters such as complex gains, beamforming gains and CPE. The maximum-likelihood (ML) estimator serves as a natural baseline, given by

$$\hat{\mathbf{p}}_{\mathrm{ML}} = \arg\min_{\mathbf{p} \in \mathcal{P}} \min_{\boldsymbol{\eta}} \left\| \mathbf{y} - \mathbf{h}(\mathbf{p}, \boldsymbol{\eta}) \right\|_2^2, \tag{13}$$

where $\mathcal{P}$ is the BS coverage area.
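To make the baseline in (13) concrete, here is a toy sketch of a grid-search ML estimator. It is a deliberately simplified stand-in, not the paper's full model: the signature of each subarray is reduced to its delay-induced phase ramp, the nuisance parameters are reduced to one complex gain per subarray (eliminated in closed form by least squares), the geometry is two-dimensional, and the pilots are noise-free. All numerical values are assumptions.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def concentrated_cost(p, pilots, freqs, subarrays):
    """Residual of (13) at position p after minimizing over one complex gain per subarray."""
    cost = 0.0
    for y_k, f_k, s_k in zip(pilots, freqs, subarrays):
        tau_k = np.linalg.norm(p - s_k) / C
        a_k = np.exp(-2j * np.pi * f_k * tau_k)   # delay signature on subarray k's tones
        alpha_k = np.vdot(a_k, y_k) / a_k.size    # closed-form least-squares gain
        cost += np.sum(np.abs(y_k - alpha_k * a_k) ** 2)
    return cost

# Assumed toy setup: 4 subarrays on a line, 64 comb tones each.
K, N_TONES, DELTA_F = 4, 64, 10e6
subarrays = np.array([[x, 0.0] for x in (-0.3, -0.1, 0.1, 0.3)])
freqs = [0.3e12 + (k + K * np.arange(N_TONES)) * DELTA_F for k in range(K)]

xs = np.linspace(-1.0, 1.0, 41)   # candidate x positions (m)
ys = np.linspace(2.0, 4.0, 41)    # candidate y positions (m)
p_true = np.array([xs[25], ys[12]])

# Noise-free pilots with an unknown complex gain per subarray.
gains = np.exp(1j * np.array([0.3, -1.1, 2.0, 0.7]))
pilots = [g * np.exp(-2j * np.pi * f * np.linalg.norm(p_true - s) / C)
          for g, f, s in zip(gains, freqs, subarrays)]

costs = np.array([[concentrated_cost(np.array([x, y]), pilots, freqs, subarrays)
                   for y in ys] for x in xs])
ix, iy = np.unravel_index(np.argmin(costs), costs.shape)
p_hat = np.array([xs[ix], ys[iy]])
print("true:", p_true, "estimate:", p_hat)
```

Even in this reduced setting the search evaluates every candidate position against every subarray; extending the grid to three dimensions at millimeter resolution and adding the remaining nuisance parameters makes the cost grow quickly.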
Due to the hybrid spherical-planar wavefront, inter-subarray parallax coupling and wideband comb-type sampling, (13) is highly non-convex and high-dimensional, making direct ML search unsuitable for low-latency beam alignment. We therefore use PAST as a learned surrogate of the intractable ML estimator. For each frame, it maps the raw observation $\mathbf{y}$ to a compact evidence packet as

$$\mathcal{E} = f_{\Theta_S}(\mathbf{y}), \tag{14}$$

which comprises a position estimate $\hat{\mathbf{p}}$ and parallax-aware evidence tokens extracted from the frequency-linear signatures, together with reliability indicators to quantify the frame-wise measurement quality. By design, $\mathcal{E}$ preserves the geometry-informative content of (12) while reducing the per-frame input dimension for temporal tracking.

2) Dynamic Beam Tracking: To capture mobility, we define the kinematic state $\mathbf{x}(q) = [\mathbf{p}^T(q), \mathbf{v}^T(q)]^T \in \mathbb{R}^6$ and adopt a constant-velocity evolution model, given by

$$\mathbf{x}(q+1) = \mathbf{F}\mathbf{x}(q) + \mathbf{w}(q), \quad \mathbf{w}(q) \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}), \tag{15}$$

where $\mathbf{F}$ is determined by $T_0$ and $\mathbf{Q}$ captures unmodeled accelerations. At the $q$th frame, stacking the FDM pilot observations yields the observation model $\mathbf{y}(q) = \mathbf{h}_q(\mathbf{x}(q)) + \mathbf{n}(q)$ with $\mathbf{n}(q) \sim \mathcal{CN}(\mathbf{0}, \sigma_n^2 \mathbf{I}_M)$. Here, $\mathbf{h}_q(\cdot)$ encapsulates the delay, Doppler shift, the comb-type FDM sampling pattern and slowly varying gains. Equivalently, through (10) and (11), each subarray provides a frequency-affine phase signature whose slope and intercept evolve with $\mathbf{x}(q)$ under the geometric parallax, while residual mismatch is modeled as a disturbance.

The beam tracking task is to recursively infer $\mathbf{x}(q)$ from $\mathbf{y}(0{:}q)$. The Bayes-optimal solution would propagate the filtering posterior $p(\mathbf{x}(q) \mid \mathbf{y}(0{:}q))$ under (15) and the observation model, but exact nonlinear filtering is analytically intractable and can be computationally expensive for high-dimensional observations.
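The constant-velocity model in (15) can be written out explicitly. The sketch below is one common instantiation and not necessarily the one used in the paper: the source only states that $\mathbf{F}$ is determined by $T_0$ and that $\mathbf{Q}$ captures unmodeled accelerations, so the white-noise-acceleration form of $\mathbf{Q}$, the frame duration and the acceleration level are all assumptions.

```python
import numpy as np

def cv_matrices(t0, accel_std):
    """F and Q for the state x = [p; v] in R^6, assuming piecewise-constant random acceleration."""
    eye3 = np.eye(3)
    F = np.block([[eye3, t0 * eye3],
                  [np.zeros((3, 3)), eye3]])
    # G maps an acceleration held constant over one frame to its effect on [p; v].
    G = np.vstack([0.5 * t0**2 * eye3, t0 * eye3])
    Q = accel_std**2 * (G @ G.T)
    return F, Q

T0 = 1e-3                                   # assumed frame duration in seconds
F, Q = cv_matrices(T0, accel_std=2.0)       # assumed acceleration std in m/s^2

x = np.array([0.5, 3.0, 0.2, 1.0, -0.5, 0.0])   # position (m) and velocity (m/s)
x_next = F @ x                                   # noise-free one-step prediction
print(x_next)
```

The one-step prediction simply advances the position by $T_0 \mathbf{v}(q)$ and leaves the velocity unchanged, which is the kinematic prior that the learned tracker is meant to refine.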
Therefore, to obtain a low-latency and robust learned tracker, we adopt a cascaded formulation as

$$\mathcal{E}(q) = f_{\Theta_S}(\mathbf{y}(q)), \quad \hat{\mathbf{x}}(q) = f_{\Theta_T}\big( \mathcal{E}(0), \ldots, \mathcal{E}(q) \big), \tag{16}$$

where the TT $f_{\Theta_T}(\cdot)$ acts as a learned nonlinear filter-predictor operating on the compact evidence sequence.

This design mirrors the classical separation between measurement processing and state estimation, leveraging physics-structured tokenization and spatial-temporal attention rather than black-box modeling. Importantly, using $\mathcal{E}(q)$ avoids a monolithic spatial-temporal network over raw pilots, improving sample efficiency and reducing runtime while preserving geometry-informative signatures for tracking.

III. PAST FOR STATIC LOCALIZATION

In this section, we introduce the PAST, which leverages a deep neural network to extract and interpret the rich NF parallax signatures from lightweight FF codebook excitations, bypassing the need for prohibitive NF beam sweeping.

A. Feature Distillation and Physical Tokenization

PAST takes the received comb-type FDM pilots as input and serves as a physics-informed measurement encoder. It distills geometry-informative evidence consistent with the channel model, parallax constraints, and the per-frame frequency-affine phase structure. Specifically, it compresses the high-dimensional FDM measurements into reliability-aware physical tokens, which are fused into a compact evidence packet $\mathcal{E}$ for the downstream TT.

The pipeline below is defined for a single frame with its index omitted. We sort $\mathcal{M}_k = \{m_{k,1}, \ldots, m_{k,L_k}\}$ and define $y_{k,i} = y_k[m_{k,i}]$ and $f_{k,i} = f_{m_{k,i}}$ for $i = 1, \ldots, L_k$. To obtain stable phase-increment signatures with bounded token length and prevent unreliable patches from contaminating the wideband evidence, we partition $\{1, \ldots, L_k\}$ into $G$ disjoint contiguous groups $\{\mathcal{I}_{k,g}\}_{g=1}^{G}$.
The group energy is $E_{k,g} = \sum_{i \in \mathcal{I}_{k,g}} |y_{k,i}|^2$, and the normalized power weight for $i \in \mathcal{I}_{k,g}$ is $w_{k,g}[i] = \frac{|y_{k,i}|^2 + \varepsilon}{\sum_{j \in \mathcal{I}_{k,g}} \left( |y_{k,j}|^2 + \varepsilon \right)}$, where $\varepsilon > 0$ is a small constant. After that, the weighted frequency centroid $\bar{f}_{k,g}$ and spread $\sigma_{k,g}^2$ are represented as

$$\bar{f}_{k,g} = \sum_{i \in \mathcal{I}_{k,g}} w_{k,g}[i]\, f_{k,i}, \tag{17a}$$

$$\sigma_{k,g}^2 = \sum_{i \in \mathcal{I}_{k,g}} w_{k,g}[i]\, (f_{k,i} - \bar{f}_{k,g})^2. \tag{17b}$$

Since the Fisher information of $\tau_k$ scales with the power-weighted second central frequency moment, $(E_{k,g}, \sigma_{k,g}^2)$ are physics-grounded descriptors of groupwise delay information.

To exploit the frequency-affine structure without fragile phase unwrapping, we form adjacent-tone products. Defining $\mathcal{I}_{k,g}^{\Delta} = \{ i \in \mathcal{I}_{k,g} : i+1 \in \mathcal{I}_{k,g} \}$ and $r_{k,i} = y_{k,i+1}\, y_{k,i}^*$ for $i \in \mathcal{I}_{k,g}^{\Delta}$, under (9)-(10), we can derive

$$-\angle r_{k,i} = \Phi_k(f_{k,i+1}) - \Phi_k(f_{k,i}) + \Delta\epsilon_k, \tag{18}$$

where the dominant term is the delay-induced increment $2\pi (f_{k,i+1} - f_{k,i}) \tau_k$, while frequency-flat phase offsets within a frame are suppressed by differencing. The remaining perturbations are absorbed into $\Delta\epsilon_k$ and reflected in the reliability score. After that, with $u_{k,i} = r_{k,i}/|r_{k,i}|$, the magnitude-weighted circular mean is expressed as

$$\bar{u}_{k,g} = \frac{\sum_{i \in \mathcal{I}_{k,g}^{\Delta}} |r_{k,i}|\, u_{k,i}}{\sum_{i \in \mathcal{I}_{k,g}^{\Delta}} |r_{k,i}|}. \tag{19}$$

The reliability score and wrapped slope are defined as $\kappa_{k,g} = |\bar{u}_{k,g}| \in [0,1]$ (a higher $\kappa_{k,g}$ indicates a more consistent patch) and $\varphi_{k,g} = -\angle(\bar{u}_{k,g})$, respectively. Since $\varphi_{k,g}$ is wrapped under the comb spacing, it is embedded continuously with

$$\mathbf{q}_{k,g} = \begin{bmatrix} \cos\varphi_{k,g} \\ \sin\varphi_{k,g} \end{bmatrix} = \frac{1}{|\bar{u}_{k,g}| + \varepsilon} \begin{bmatrix} \Re\{\bar{u}_{k,g}\} \\ -\Im\{\bar{u}_{k,g}\} \end{bmatrix}. \tag{20}$$

Any remaining wrapping ambiguity is handled in the subsequent attention layers by parallax consistency, cross-group evidence and reliability scores, rather than explicit unwrapping.
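The group statistics in (17)-(20) can be sketched in a few lines of NumPy. This is an illustrative reading of the equations rather than the authors' implementation: the function name, the equal-size contiguous grouping via `np.array_split`, and all numerical values are assumptions. One group is deliberately corrupted with random phases to show how the reliability score reacts.

```python
import numpy as np

EPS = 1e-12  # small constant epsilon (assumed value)

def group_stats(y, f):
    """Energy, spread, reliability and wrapped-slope embedding of one contiguous tone group."""
    power = np.abs(y) ** 2
    energy = power.sum()                              # E_{k,g}
    w = (power + EPS) / np.sum(power + EPS)           # normalized power weights
    f_bar = np.sum(w * f)                             # (17a)
    spread = np.sum(w * (f - f_bar) ** 2)             # (17b)
    r = y[1:] * np.conj(y[:-1])                       # adjacent-tone products inside the group
    u_bar = np.sum(r) / (np.sum(np.abs(r)) + EPS)     # (19), since |r| * (r / |r|) = r
    kappa = np.abs(u_bar)                             # reliability score in [0, 1]
    q = np.array([u_bar.real, -u_bar.imag]) / (kappa + EPS)   # (20)
    return {"energy": energy, "spread": spread, "kappa": kappa, "q": q}

rng = np.random.default_rng(0)
spacing, tau = 4e6, 10e-9                      # assumed comb spacing (Hz) and delay (s)
f = 0.3e12 + spacing * np.arange(64)
y_clean = np.exp(-2j * np.pi * f * tau)        # noiseless phase ramp as in (9)-(10)
groups = np.array_split(np.arange(f.size), 4)  # G = 4 contiguous groups

clean = [group_stats(y_clean[idx], f[idx]) for idx in groups]

y_bad = y_clean.copy()
y_bad[groups[2]] = np.exp(2j * np.pi * rng.random(groups[2].size))  # corrupted patch
bad = group_stats(y_bad[groups[2]], f[groups[2]])

phi = np.arctan2(clean[0]["q"][1], clean[0]["q"][0])   # wrapped slope of a clean group
print(phi, 2 * np.pi * spacing * tau, clean[2]["kappa"], bad["kappa"])
```

For the clean groups the wrapped slope matches $2\pi (f_{k,i+1} - f_{k,i}) \tau_k$ and the reliability score is essentially one, whereas the corrupted group yields a visibly lower score, which is the quantity the subsequent attention layers use for downweighting.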
For each subarray-group pair $(k, g)$, a compact descriptor is represented as

$$\mathbf{t}_{k,g} = \left[ \log(E_{k,g} + \varepsilon),\ \log(\sigma_{k,g}^2 + \varepsilon),\ \mathbf{q}_{k,g}^T,\ \kappa_{k,g} \right]^T \in \mathbb{R}^{d_t}. \tag{21}$$

Then it is mapped into the transformer latent space by

$$\mathbf{z}_{k,g} = \mathbf{W}_{\mathrm{emb}}\, \mathbf{t}_{k,g} + \mathbf{e}_k + \mathbf{e}_g^{r} + \mathbf{e}_b, \tag{22}$$

where $\mathbf{W}_{\mathrm{emb}}$ is learnable, $\mathbf{e}_k$ and $\mathbf{e}_g^{r}$ encode the subarray and group indices, and the fixed geometry encoding $\mathbf{e}_b$ is derived from (6). The resulting per-frame token sequence with length $KG$ is denoted as

$$\mathbf{Z} = \left[ \mathbf{z}_{1,1}, \ldots, \mathbf{z}_{1,G}, \ldots, \mathbf{z}_{K,1}, \ldots, \mathbf{z}_{K,G} \right]. \tag{23}$$

This physics-driven tokenization preserves parallax evidence while suppressing nuisance variations in raw pilots, allowing the tracker to focus on filtering and prediction under the kinematic prior, rather than low-level denoising and feature learning from high-dimensional complex pilots.

B. Physics-Driven Attention Mechanism

For the token sequence $\mathbf{Z}$, PAST uses a physics-driven attention encoder to fuse multi-subarray and multi-band evidence into $\mathcal{E}$ and $\hat{\mathbf{p}}$. The attention weights explicitly incorporate the geometry prior and token-wise reliability in Sec. III-A.

To quantify token-wise reliability for evidence fusion, a scalar gate $c_{k,g} \in [0, 1)$ for the pair $(k, g)$ is defined as

$$c_{k,g} = \kappa_{k,g} \cdot \sigma\big( \gamma_0 + \gamma_1 \log(E_{k,g} + \varepsilon) + \gamma_2 \log(\sigma_{k,g}^2 + \varepsilon) \big), \tag{24}$$

where $\sigma(\cdot)$ is the sigmoid function and $\gamma_1, \gamma_2 \ge 0$ are learned scaling factors enforced by $\gamma_a = \log\big(1 + \exp(\tilde{\gamma}_a)\big)$ for $a \in \{1, 2\}$. It preserves the physical monotonicity that higher usable energy and larger effective frequency spread increase the confidence, yielding a bounded gate for stable scaling and gradients. Accordingly, $c_{k,g}$ serves as a learned confidence score to steer attention toward informative and phase-consistent patches.

Then, denoting the $i$th element in (23) as $\mathbf{z}_i$, we form query, key and value projections $\mathbf{q}_i^{(h)} = \mathbf{W}_Q^{(h)} \mathbf{z}_i$, $\mathbf{k}_i^{(h)} = \mathbf{W}_K^{(h)} \mathbf{z}_i$ and $\mathbf{v}_i^{(h)} = \mathbf{W}_V^{(h)} \mathbf{z}_i$ in each attention head $h$.
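The reliability gate in (24) is simple enough to sketch directly. In the snippet below the raw parameters `gamma1_raw` and `gamma2_raw` stand in for the learned $\tilde{\gamma}_a$, and the chosen values (as well as the normalized energy and spread inputs) are placeholders rather than trained quantities.

```python
import numpy as np

def softplus(x):
    """log(1 + exp(x)), used to keep the learned scaling factors non-negative."""
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reliability_gate(kappa, energy, spread, gamma0, gamma1_raw, gamma2_raw, eps=1e-12):
    """Gate c = kappa * sigmoid(gamma0 + gamma1*log(E+eps) + gamma2*log(spread+eps)), as in (24)."""
    gamma1, gamma2 = softplus(gamma1_raw), softplus(gamma2_raw)
    logit = gamma0 + gamma1 * np.log(energy + eps) + gamma2 * np.log(spread + eps)
    return kappa * sigmoid(logit)

# Placeholder parameters and normalized inputs (assumptions, not trained values).
params = dict(gamma0=0.0, gamma1_raw=0.0, gamma2_raw=0.0)
gate_low = reliability_gate(kappa=0.9, energy=0.5, spread=1.0, **params)
gate_high = reliability_gate(kappa=0.9, energy=2.0, spread=1.0, **params)
gate_incoherent = reliability_gate(kappa=0.1, energy=2.0, spread=1.0, **params)
print(gate_low, gate_high, gate_incoherent)
```

Because the softplus keeps $\gamma_1$ and $\gamma_2$ non-negative, the gate can only grow with energy or spread, while a low phase-coherence score $\kappa_{k,g}$ caps it regardless of how strong the patch is.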
The attention logit from token $i$ to token $j$ is defined as

$$\ell_{ij}^{(h)} = \frac{\big(\mathbf{q}_i^{(h)}\big)^T \mathbf{k}_j^{(h)}}{\sqrt{d_h}} + b_{ij}^{(h)} + \delta \log\big( c_{k(j),g(j)} + \varepsilon \big), \tag{25}$$

where $d_h$ is the per-head dimension, $\delta \in [0, 1]$ balances source selection and content aggregation to avoid over-suppression, and $k(\cdot)$ and $g(\cdot)$ map each token to its subarray and group indices. The observation-independent geometry bias $b_{ij}^{(h)}$ is injected as

$$b_{ij}^{(h)} = \mathrm{MLP}_{\mathrm{geo}}^{(h)}\Big( \big[ (\mathbf{b}_{k(i),k(j)} / \Lambda)^T,\ \Delta g_{ij} \big]^T \Big), \tag{26}$$

where $\mathbf{b}_{k(i),k(j)} = \mathbf{b}_{k,k'}$ is the known geo-arm vector, Λ = √( 2/(K(K−1)) Σ_{1 ≤ k …
