BiFormer3D: Grid-Free Time-Domain Reconstruction of Head-Related Impulse Responses with a Spatially Encoded Transformer

Shaoheng Xu 1,*,**, Chunyi Sun 1,*, Jihui (Aimee) Zhang 2,1, Amy Bastine 1, Prasanga N. Samarasinghe 1, Thushara D. Abhayapala 1, Hongdong Li 1

1 The Australian National University, Australia
2 The University of Queensland, Australia
Shaoheng.Xu@anu.edu.au, Chunyi.Sun@anu.edu.au

* These authors contributed equally. ** indicates the corresponding author.

Abstract

Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense per-listener measurements are costly. We address HRIR spatial up-sampling from sparse per-listener measurements: given a few measured HRIRs for a listener, predict HRIRs at unmeasured target directions. Prior learning methods often work in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can degrade temporal fidelity and spatial continuity. We propose BiFormer3D, a time-domain, grid-free binaural Transformer for reconstructing HRIRs at arbitrary directions from sparse inputs. It uses sinusoidal spatial features, a Conv1D refinement module, and auxiliary interaural time difference (ITD) and interaural level difference (ILD) heads. On SONICOM, it improves normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors over prior methods; ablations validate the modules and show minimum-phase pre-processing is unnecessary.

Index Terms: head-related impulse responses, binaural rendering, spatial audio, HRIR up-sampling, transformer

1. Introduction

Spatial audio seeks to capture and reproduce realistic acoustic scenes by preserving both audio content and spatial cues, enabling an immersive listening experience [1, 2]. Among spatial-audio techniques, binaural recording and rendering are particularly effective because they aim to reproduce the signals received at the listener's two ears [1, 3, 4]. Binaural rendering typically relies on head-related transfer functions (HRTFs) or their time-domain counterparts, head-related impulse responses (HRIRs), to reproduce direction-dependent spectral and timing cues that support human spatial hearing. As a result, binaural spatial audio has been widely adopted in virtual/augmented reality (VR/AR) [5, 6, 7] and object-based audio production for film and other media [8]. However, acquiring individualized HRTFs/HRIRs remains challenging: per-listener measurements are tedious, costly, and time-consuming, require specialized anechoic facilities, and may require listeners to remain motionless for around an hour [9]. Accordingly, prior work has largely focused on two directions: HRTF personalization and spatial up-sampling/augmentation.

HRTF personalization estimates a listener's HRTF from side information such as anthropometric measurements of the head, torso, and pinna [10, 11, 12]. Early approaches relied on numerical simulation and similarity-based selection [13]. More recent methods employ deep neural networks (DNNs) to learn non-linear mappings from anthropometric descriptors to HRTFs [14, 15, 16, 17, 18, 19], including latent-space approaches that encode HRTFs into a low-dimensional representation and then decode them back to full HRTFs [20].
However, simulation-based methods often require high-quality 3D geometry of the listener's head and pinna [13], while DNN-based approaches are constrained by the limited amount of available measured data [20].

To reduce the time and practical complexity of HRTF measurements, spatial up-sampling and augmentation methods have been proposed. These methods aim to reconstruct a high angular-resolution HRTF set from only a small number of measurements at sparsely sampled directions [21]. Classical approaches typically represent HRTFs using a set of basis functions (BFs) and synthesize responses at unmeasured directions via interpolation or superposition [22, 23]. In practice, these methods often degrade when measurements are spatially sparse (e.g., with angular gaps of 30–40°) [24]. Spherical-harmonic (SH)-based interpolation is another common approach [25, 26, 27, 28, 29, 30, 31], but it can also perform poorly under sparse sampling [24]. The Listener Acoustic Personalization (LAP) challenge [24] has stimulated ML-based spatial augmentation, with results showing that learned models can outperform SH/BF interpolation under sparse sampling when reconstructing 793-direction HRTF sets from only 3–100 measurements [32, 33, 34, 35, 36]; however, most of these methods remain tied to reconstruction on a fixed grid of directions.

A further limitation of many existing ML-based pipelines is that they operate primarily in the frequency domain and focus on magnitude responses, since phase is difficult to model directly. Phase and timing cues are therefore often recovered using minimum-phase (MP) approximations and/or a separate estimate of the time-delay (time-of-arrival) term [35, 37]. In contrast, relatively few works perform personalization or spatial up-sampling directly in the time domain. For example, [38] proposes HRIR personalization using diffusion models guided by anthropometric inputs, but targets a fixed grid of source directions and does not support arbitrary directions. The time-domain decomposition and superposition methods in [39, 40] enable spatially continuous HRIR reconstruction, but are restricted to the median plane. Thus, time-domain, spatially continuous HRIR reconstruction over arbitrary 3D directions remains under-explored.

Figure 1: Illustration of HRIR spatial up-sampling: estimating HRIRs at unmeasured target directions (white) from a sparse set of measured directions (red).

Time-domain modeling is important for both room impulse responses (RIRs) and HRIRs, as phase and timing cues are central to spatial perception. Prior work shows that transformer architectures with positional encoding can capture long-range temporal structure and enable continuous spatial modeling of RIRs [41]. HRIR reconstruction, however, poses different challenges: the number of measured subjects is limited, inter-subject variability is substantial, and the model must preserve binaural cues such as interaural time difference (ITD) and interaural level difference (ILD). We therefore develop a framework tailored to continuous HRIR reconstruction over arbitrary 3D directions under realistic data constraints.

We propose BiFormer3D, a time-domain, grid-free binaural Transformer for continuous angular (azimuth–elevation) HRIR spatial up-sampling that reconstructs HRIRs at arbitrary target directions from sparse per-listener measurements.
The main contributions are: (1) Time-domain binaural reconstruction that preserves phase information naturally and does not impose an MP assumption. (2) Grid-free directional modeling via sinusoidal encoding of source positions, enabling inference at previously unseen target directions without requiring a fixed measurement grid. (3) Auxiliary binaural-cue supervision using ITD/ILD-prediction heads to encourage accurate interaural timing and level cues. (4) Temporal refinement via a post-Transformer Conv1D module to improve local directional consistency. (5) Time–frequency supervision via time-domain losses and a complex HRTF loss $\mathcal{L}_{\mathrm{HRTF}}$ for spectral alignment.

2. Problem Formulation

The objective of this work is to reconstruct HRIRs at arbitrary, unmeasured directions from a limited set of measured directions for a given subject. Consider a free-field HRIR measurement setup (without room reverberation). Let there be $L = M + N$ source directions $\mathbf{x}_\ell \triangleq (\phi_\ell, \theta_\ell, r_\ell)$ for $\ell = 1, \ldots, L$, where $\phi$ and $\theta$ denote azimuth and elevation (in degrees), and $r$ denotes the source radius (in meters) in a spherical coordinate system. Among these $L$ directions, $M$ are measured and $N$ are unmeasured target directions. The listener's head is centered at the global origin $O$, as illustrated in Fig. 1.

At each measured direction $\{\mathbf{x}^{\mathrm{meas}}_m\}_{m=1}^{M}$, we record a binaural HRIR pair sampled at rate $f_s$ and of length $K$, $\mathbf{h}^{(L)}_m, \mathbf{h}^{(R)}_m \in \mathbb{R}^{1 \times K}$, for the left and right ears. We form a binaural HRIR row vector by concatenation, $\mathbf{h}^{\mathrm{meas}}_m \triangleq [\mathbf{h}^{(L)}_m, \mathbf{h}^{(R)}_m] \in \mathbb{R}^{1 \times 2K}$. Stacking the $M$ measured binaural HRIRs yields
$$\mathbf{H} = \big[(\mathbf{h}^{\mathrm{meas}}_1)^\top \; (\mathbf{h}^{\mathrm{meas}}_2)^\top \; \cdots \; (\mathbf{h}^{\mathrm{meas}}_M)^\top\big]^\top \in \mathbb{R}^{M \times 2K}, \quad (1)$$
where $(\cdot)^\top$ denotes transpose.

Our goal is to estimate binaural HRIRs at the $N$ target directions $\{\mathbf{x}^{\mathrm{tgt}}_n\}_{n=1}^{N}$ at the same sampling rate $f_s$ and length $K$. Let $\bar{\mathbf{h}}^{(L)}_n, \bar{\mathbf{h}}^{(R)}_n$ and $\hat{\mathbf{h}}^{(L)}_n, \hat{\mathbf{h}}^{(R)}_n$ denote the ground-truth and estimated HRIRs for the two ears at direction $\mathbf{x}^{\mathrm{tgt}}_n$. Define
$$\bar{\mathbf{h}}^{\mathrm{tgt}}_n \triangleq [\bar{\mathbf{h}}^{(L)}_n, \bar{\mathbf{h}}^{(R)}_n], \quad \hat{\mathbf{h}}^{\mathrm{tgt}}_n \triangleq [\hat{\mathbf{h}}^{(L)}_n, \hat{\mathbf{h}}^{(R)}_n] \in \mathbb{R}^{1 \times 2K}. \quad (2)$$
Stacking these yields $\bar{\mathbf{H}}, \hat{\mathbf{H}} \in \mathbb{R}^{N \times 2K}$:
$$\bar{\mathbf{H}} = \big[(\bar{\mathbf{h}}^{\mathrm{tgt}}_1)^\top \cdots (\bar{\mathbf{h}}^{\mathrm{tgt}}_N)^\top\big]^\top, \quad \hat{\mathbf{H}} = \big[(\hat{\mathbf{h}}^{\mathrm{tgt}}_1)^\top \cdots (\hat{\mathbf{h}}^{\mathrm{tgt}}_N)^\top\big]^\top. \quad (3)$$

The HRIR spatial up-sampling task can be formulated as learning a mapping $\hat{\mathbf{H}} = \mathcal{F}(\mathbf{H}, \{\mathbf{x}^{\mathrm{meas}}_m\}_{m=1}^{M}, \{\mathbf{x}^{\mathrm{tgt}}_n\}_{n=1}^{N})$ such that $\hat{\mathbf{H}}$ closely matches $\bar{\mathbf{H}}$, i.e.,
$$\hat{\mathbf{H}} = \arg\min_{\hat{\mathbf{H}}} \big\|\bar{\mathbf{H}} - \hat{\mathbf{H}}\big\|_F^2 \quad \text{given } \mathbf{H}, \{\mathbf{x}^{\mathrm{meas}}_m\}, \{\mathbf{x}^{\mathrm{tgt}}_n\}, \quad (4)$$
where $\|\cdot\|_F$ denotes the Frobenius norm.

3. Proposed Method

We learn the mapping defined in Eq. 4 by casting HRIR spatial up-sampling as masked inpainting over the spherical direction set, where unmeasured directions are treated as masked tokens to be predicted from the measured context. The main challenges are threefold: spatial HRIR variation is globally correlated yet irregularly sampled; temporal responses contain high-frequency structure that is sensitive to small errors; and binaural consistency must be preserved to maintain physically meaningful interaural cues. Our design explicitly separates spatial reasoning from temporal reconstruction and introduces auxiliary cue supervision for stability under sparse measurements.

For each subject, measured and target directions are combined into a unified set of $L = M + N$ directions. We construct $\mathbf{H}_{\mathrm{full}} \in \mathbb{R}^{L \times 2K}$, where measured rows contain observed HRIRs and target rows are initialized as zeros. A binary mask $\mathbf{i} \in \{0, 1\}^L$ indicates observed directions: the $\ell$-th element $i_\ell$ indicates whether $\mathbf{h}_\ell$ is measured ($i_\ell = 1$) or unmeasured ($i_\ell = 0$). The network predicts HRIRs jointly for all directions, and measured rows are preserved via masked fusion.
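For concreteness, the following is a minimal PyTorch-style sketch of how the masked input $\mathbf{H}_{\mathrm{full}}$ and the observation mask $\mathbf{i}$ could be assembled; the function and variable names (`build_masked_input`, `meas_idx`) are ours, not from the paper.

```python
import torch

def build_masked_input(h_meas, meas_idx, L, K):
    """Assemble the network input H_full (L x 2K) and observation mask i.

    h_meas:   (M, 2K) measured binaural HRIRs (left/right ears concatenated)
    meas_idx: (M,) long tensor of measured-direction indices in the unified set
    Target (unmeasured) rows are zero-initialized, as described above.
    """
    H_full = torch.zeros(L, 2 * K, dtype=h_meas.dtype)
    H_full[meas_idx] = h_meas
    i_mask = torch.zeros(L, 1, dtype=h_meas.dtype)  # i_ell = 1 iff measured
    i_mask[meas_idx] = 1.0
    return H_full, i_mask
```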
Geometric Encoding: Each binaural HRIR $\mathbf{h}_\ell \in \mathbb{R}^{1 \times 2K}$ is masked as $\tilde{\mathbf{h}}_\ell = i_\ell \mathbf{h}_\ell$ and projected into a latent feature, $\mathbf{e}_\ell = \mathrm{Norm}(\sigma(\tilde{\mathbf{h}}_\ell \mathbf{W}_s))$, where $\sigma(\cdot)$ denotes the GELU activation and $\mathrm{Norm}(\cdot)$ denotes layer normalization.

Each direction coordinate $\mathbf{x}_\ell \in \mathbb{R}^3$ is encoded using a multi-frequency sinusoidal embedding [42],
$$\gamma(\mathbf{x}_\ell) = \big[\sin(2^0 \pi \mathbf{x}_\ell), \cos(2^0 \pi \mathbf{x}_\ell), \ldots, \sin(2^{P-1} \pi \mathbf{x}_\ell), \cos(2^{P-1} \pi \mathbf{x}_\ell)\big], \quad (5)$$
with $P = 6$. The geometric embedding is projected into the same latent space and combined with the signal feature:
$$\mathbf{p}_\ell = \mathrm{Norm}\big(\sigma([\mathbf{x}_\ell, \gamma(\mathbf{x}_\ell)] \mathbf{W}_g)\big), \quad \mathbf{o}_\ell = [\mathbf{e}_\ell, \mathbf{p}_\ell]. \quad (6)$$
Stacking all tokens $\mathbf{o}_\ell$ yields $\mathbf{O} \in \mathbb{R}^{L \times D}$.

Transformer Encoder: The token sequence $\{\mathbf{o}_\ell\}_{\ell=1}^{L}$ is processed by a multi-layer Transformer encoder. Each layer consists of multi-head self-attention followed by a feed-forward network with residual connections and normalization. Self-attention [42] enables each direction to attend to all others, capturing global spatial dependencies across irregular directional layouts. The mask $\mathbf{i}$ is applied such that missing directions do not contribute as keys during attention. After $T$ layers, we obtain contextual representations $\mathbf{C} \in \mathbb{R}^{L \times D}$.

Decoder and Conv1D Temporal Refinement: Each contextual feature $\mathbf{c}_\ell \in \mathbb{R}^{1 \times D}$ is mapped to a full-length binaural HRIR using a shared multi-layer perceptron (Fig. 2):
$$\hat{\mathbf{h}}_\ell = f_{\mathrm{dec}}(\mathbf{c}_\ell), \quad \hat{\mathbf{h}}_\ell \in \mathbb{R}^{1 \times 2K}. \quad (7)$$
Stacking the predictions $\hat{\mathbf{h}}_\ell$ yields $\hat{\mathbf{H}}_{\mathrm{raw}} \in \mathbb{R}^{L \times 2K}$. Measured rows are preserved through masked fusion:
$$\hat{\mathbf{H}}_{\mathrm{fused}} = \mathbf{i} \odot \mathbf{H}_{\mathrm{full}} + (1 - \mathbf{i}) \odot \hat{\mathbf{H}}_{\mathrm{raw}}. \quad (8)$$

Figure 2: Overview of the proposed BiFormer3D pipeline. Known HRIRs are projected into a latent space using an MLP-based signal encoder, while geometric embeddings derived from direction coordinates are independently projected and added to the signal features. The resulting tokens are processed jointly by a Transformer encoder to capture global spatial dependencies across measured and target directions. A shared MLP decoder maps contextual features to full-length binaural HRIRs for all directions, followed by masked fusion to preserve measured responses and a lightweight Conv1D for temporal refinement.

To improve local temporal coherence of the HRIRs, we apply a lightweight one-dimensional convolution. First, $\hat{\mathbf{H}}_{\mathrm{fused}}$ is reordered by rows (elevation descending, then azimuth ascending). Then, $\hat{\mathbf{H}} = \mathrm{Conv1D}(\hat{\mathbf{H}}_{\mathrm{fused}})$, with residual blending controlled by $\mathbf{i}$ to ensure exact preservation of the observed HRIRs.

Training Objective: We supervise reconstruction only at missing directions, and define the reconstruction loss as
$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{N} \sum_{\ell: i_\ell = 0} \big\|\hat{\mathbf{h}}_\ell - \bar{\mathbf{h}}_\ell\big\|_2^2. \quad (9)$$
To improve spectral fidelity, we compute the orthonormally normalized discrete Fourier transform (scaled by $1/\sqrt{K}$), denoted by $\mathrm{DFT}(\cdot)$, for the left (L) and right (R) ears separately:
$$\mathcal{L}_{\mathrm{HRTF}} = \sum_{c \in \{L, R\}} \big\|\mathrm{DFT}(\hat{\mathbf{H}}^{(c)}) - \mathrm{DFT}(\bar{\mathbf{H}}^{(c)})\big\|_2^2. \quad (10)$$
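To make Eqs. (9)–(10) concrete, here is a minimal PyTorch sketch with our own naming; `torch.fft.fft(..., norm="ortho")` implements the $1/\sqrt{K}$-scaled DFT. As written, the spectral term runs over all rows; since measured rows are preserved by masked fusion, their contribution vanishes. Whether the paper restricts Eq. (10) to missing rows is not stated, so treat this as our reading.

```python
import torch

def reconstruction_losses(H_hat, H_bar, i_mask, K):
    """Masked time-domain loss (Eq. 9) and complex spectral loss (Eq. 10).

    H_hat, H_bar: (L, 2K) estimated and ground-truth binaural HRIR matrices
    i_mask:       (L, 1) observation mask (1 = measured direction)
    """
    miss = i_mask.squeeze(-1) == 0                 # unmeasured directions only
    diff = H_hat[miss] - H_bar[miss]
    L_rec = (diff ** 2).sum(dim=-1).mean()         # average over N targets

    L_hrtf = H_hat.new_zeros(())
    for ear in (slice(0, K), slice(K, 2 * K)):     # left / right ear halves
        S_hat = torch.fft.fft(H_hat[:, ear], norm="ortho")  # 1/sqrt(K) scaling
        S_bar = torch.fft.fft(H_bar[:, ear], norm="ortho")
        L_hrtf = L_hrtf + ((S_hat - S_bar).abs() ** 2).sum()
    return L_rec, L_hrtf
```

With the weights reported in Sec. 4.1, these terms would enter Eq. (13) as `L_rec + 500 * L_hrtf + 0.05 * (L_itd + L_ild)`.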
Auxiliary heads $f_{\mathrm{head}}(\cdot)$ predict $d^{\mathrm{ITD}}_\ell$ and $d^{\mathrm{ILD}}_\ell$:
$$[d^{\mathrm{ITD}}_\ell, d^{\mathrm{ILD}}_\ell] = f_{\mathrm{head}}(\mathbf{c}_\ell). \quad (11)$$
The corresponding ITD/ILD loss terms, $\mathcal{L}_{\mathrm{ITD}}$ and $\mathcal{L}_{\mathrm{ILD}}$, are
$$\mathcal{L}_{\mathrm{head}} = \frac{1}{N} \sum_{\ell: i_\ell = 0} \big| d^{\mathrm{head}}_\ell - \mathrm{head}_\ell \big|, \quad \mathrm{head} \in \{\mathrm{ITD}, \mathrm{ILD}\}. \quad (12)$$
The total training loss $\mathcal{L}_{\mathrm{total}}$ combines all components:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{HRTF}} \mathcal{L}_{\mathrm{HRTF}} + \lambda_{\mathrm{ITD}} \mathcal{L}_{\mathrm{ITD}} + \lambda_{\mathrm{ILD}} \mathcal{L}_{\mathrm{ILD}}, \quad (13)$$
where $\lambda_{\mathrm{ITD}}$, $\lambda_{\mathrm{ILD}}$, and $\lambda_{\mathrm{HRTF}}$ are loss weights.

4. Experiment and Validation

4.1. Experimental Setup

We evaluate the binaural HRIR spatial up-sampling performance of BiFormer3D on the SONICOM database [45], which contains measurements for over 200 subjects with 793 source directions per subject. Each HRIR is sampled at 48 kHz and represented with $K = 256$ samples. For a fair comparison, we follow the protocol of [35]: the first 180 subjects are used for training and the subsequent 20 subjects are used for validation; subject P0079 is excluded due to atypical ITD behavior [35]. Since BiFormer3D reconstructs HRIRs in the time domain and does not impose an MP assumption, we train and evaluate it on SONICOM free-field HRIRs (i.e., free-field compensated HRIRs without MP filtering) [45]. During training, we set the loss weights $\lambda_{\mathrm{ITD}} = \lambda_{\mathrm{ILD}} = 0.05$ and $\lambda_{\mathrm{HRTF}} = 500$. We train BiFormer3D using AdamW [46] with learning rate $3 \times 10^{-4}$ and batch size 8 for 500 epochs on an NVIDIA A100 GPU.

4.2. Evaluation Metrics

To quantify binaural HRIR spatial up-sampling performance, we report four metrics: (1) ITD Error (ITD-E) [32], (2) ILD Error (ILD-E) [32], (3) Normalized Mean Squared Error (NMSE) [41], and (4) Cosine Distance (CD) [41, 47, 48]. Formal definitions follow the referenced works. NMSE and CD evaluate time-domain waveform fidelity and shape alignment between the estimated and ground-truth HRIRs; CD measures scale-invariant similarity and ranges in $[0, 2]$, where CD = 0 indicates identical waveform shapes [41]. ITD-E is the mean absolute error between reconstructed and ground-truth ITDs, reflecting binaural timing discrepancies relevant to localization [32]. ILD-E measures errors in interaural level differences across frequency and direction; larger ILD-E may lead to lateralization shifts or degraded spatial clarity [32]. Lower values indicate better performance for all metrics. All metrics are averaged over the $N$ target directions and over subjects.
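Since the formal metric definitions are deferred to [32, 41], the following sketch shows one common form of NMSE and CD that is consistent with the properties stated above (CD in $[0, 2]$, scale-invariant); the exact definitions used in the paper may differ, and the helper names are ours.

```python
import torch
import torch.nn.functional as F

def nmse_db(h_hat, h_bar, eps=1e-12):
    """NMSE in dB: 10*log10(||h_hat - h_bar||^2 / ||h_bar||^2) per direction,
    averaged over directions. One common form; see [41] for the exact one."""
    num = ((h_hat - h_bar) ** 2).sum(dim=-1)
    den = (h_bar ** 2).sum(dim=-1).clamp_min(eps)
    return (10.0 * torch.log10(num / den + eps)).mean()

def cosine_distance(h_hat, h_bar):
    """CD = 1 - cosine similarity: scale-invariant, in [0, 2], 0 = identical shape."""
    return (1.0 - F.cosine_similarity(h_hat, h_bar, dim=-1)).mean()
```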
4.3. Comparison Methods

We compare BiFormer3D with six baselines: Nearest Neighbor (Nbr), HRTF-Sel-ITD, HRTF-Sel-LSD [35], NF-CbC [43], NF-LoRA [44], and RANF [35]. Since these methods primarily reconstruct HRTF magnitude responses in the frequency domain, we focus on the shared binaural cues, ITD-E and ILD-E, for direct comparison with our time-domain approach. As we adopt the same training/evaluation protocol as [35], we use the ITD-E and ILD-E results reported in [35] for these baselines and summarize them in Table 1.

4.4. Results

Table 1 reports ITD-E and ILD-E under four measurement sparsity levels ($M = 3, 5, 19, 100$). BiFormer3D achieves the lowest ILD-E across all $M$, indicating consistently accurate interaural level cues, and attains the best ITD-E in the most sparse regimes ($M = 3$ and $M = 5$). For denser inputs ($M = 19$ and $M = 100$), BiFormer3D remains competitive, with ITD-E comparable to the strongest frequency-domain baselines. In contrast, the Nbr baseline yields substantially larger ITD-E and ILD-E across all $M$, highlighting the difficulty of spatial up-sampling from sparse measurements and the benefit of learned reconstruction [35].

Table 1: ITD-E (µs) and ILD-E (dB) under different measurement sparsity levels M on SONICOM. Bold indicates the best result per column. Baseline results are taken from [35].

| Method                | M=3: ITD-E / ILD-E | M=5: ITD-E / ILD-E | M=19: ITD-E / ILD-E | M=100: ITD-E / ILD-E |
|-----------------------|--------------------|--------------------|---------------------|----------------------|
| BiFormer3D (Proposed) | **18.5** / **1.14** | **16.4** / **1.10** | 15.4 / **0.95**     | 10.6 / **0.70**      |
| Nbr [35]              | 274.2 / 7.6        | 154.2 / 4.8        | 108.3 / 3.0         | 44.0 / 1.4           |
| HRTF-Sel-ITD [35]     | 26.3 / 1.4         | 24.5 / 1.5         | 23.2 / 1.6          | 20.0 / 1.4           |
| HRTF-Sel-LSD [35]     | 37.5 / 1.5         | 35.7 / 1.4         | 37.4 / 1.5          | 31.4 / 1.3           |
| NF-CbC [43]           | 22.1 / 1.5         | 20.7 / 2.0         | 14.8 / 1.8          | 11.8 / 1.7           |
| NF-LoRA [44]          | 28.6 / 1.3         | 24.7 / 1.4         | 14.7 / 1.1          | **9.1** / 1.1        |
| RANF [35]             | 20.5 / 1.2         | 18.7 / 1.2         | **14.2** / 1.0      | 10.0 / 0.8           |

Table 2 summarizes time-domain reconstruction accuracy. Across all $M$, BiFormer3D achieves NMSE below −6.90 dB and CD below 0.233, and both metrics improve as $M$ increases. Fig. 4 visualizes stacked binaural HRIRs across all 793 directions, showing that the estimated HRIR field closely matches the ground truth globally. Fig. 3 shows an example (left ear) under $M = 19$, where the predicted waveform closely matches the ground truth in both onset timing and fine structure.

Table 2: NMSE and CD of BiFormer3D under different measurement sparsity levels M.

| Metric      | M=3   | M=5   | M=19  | M=100  |
|-------------|-------|-------|-------|--------|
| NMSE (dB) ↓ | −6.90 | −7.35 | −8.24 | −10.20 |
| CD ↓        | 0.233 | 0.210 | 0.163 | 0.102  |

Figure 3: Left-ear HRIR reconstruction for subject P0187 at (φ, θ, r) = (90°, 20°, 1.5 m) under M = 19: estimated HRIR (dashed) versus ground-truth HRIR (solid). Angular distance to the nearest measured direction: 35.2°.

Figure 4: Stacked binaural HRIRs for subject P0187 under M = 19 over 793 directions (each column: one direction; left/right ears concatenated along time). Ground truth (left) and estimation (right).

4.5. Ablation Study

We conduct ablation studies under the $M = 5$ setting: (1) removing the sinusoidal direction-encoding module in Eq. 5, (2) removing the auxiliary ITD/ILD-prediction heads from the network outputs, (3) removing the post-Transformer Conv1D temporal refinement module, (4) removing the complex HRTF loss $\mathcal{L}_{\mathrm{HRTF}}$, and (5) training and testing on MP-preprocessed HRIRs. Table 3 summarizes the results. Removing any component degrades performance on multiple metrics, indicating that each design choice contributes to reconstruction quality.

Table 3: Ablation study of BiFormer3D under M = 5.

| Variant                      | NMSE (dB) ↓ | CD ↓  | ITD-E (µs) ↓ | ILD-E (dB) ↓ |
|------------------------------|-------------|-------|--------------|--------------|
| BiFormer3D (full)            | −7.35       | 0.210 | 16.4         | 1.10         |
| w/o sinusoidal encoding      | −4.99       | 0.298 | 24.2         | 1.66         |
| w/o ITD/ILD-prediction heads | −7.52       | 0.207 | 19.4         | 1.11         |
| w/o CNN temporal refinement  | −6.04       | 0.257 | 20.8         | 1.28         |
| w/o L_HRTF                   | −7.10       | 0.217 | 17.3         | 1.14         |
| w/ MP pre-processing         | −6.80       | 0.225 | 19.1         | 1.21         |

Removing the sinusoidal encoding causes the largest degradation across NMSE, CD, ITD-E, and ILD-E, highlighting the importance of rich geometric conditioning. In particular, sinusoidal encoding provides a continuous representation of $\mathbf{x}^{\mathrm{tgt}}_n = (\phi, \theta, r)$, enabling grid-free inference at previously unseen directions.
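A minimal sketch of the Eq. (5) embedding is given below. The ordering of sin/cos terms relative to Eq. (5) is immaterial up to a permutation absorbed by the projection $\mathbf{W}_g$; normalizing the coordinates to a bounded range before embedding is our assumption, not stated in the paper.

```python
import torch

def sinusoidal_embedding(x, P=6):
    """Multi-frequency sinusoidal embedding of Eq. (5).

    x: (..., 3) direction coordinates (azimuth, elevation, radius),
       assumed pre-normalized to roughly [-1, 1] (our choice).
    Returns a (..., 3 * 2P) feature; the model concatenates [x, gamma(x)]
    before the projection W_g (Eq. 6).
    """
    freqs = (2.0 ** torch.arange(P).float()) * torch.pi    # 2^p * pi, p = 0..P-1
    angles = x.unsqueeze(-1) * freqs                       # (..., 3, P)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., 3, 2P)
    return emb.flatten(-2)                                 # (..., 6P)
```

With the paper's $P = 6$, each 3-D coordinate expands to a 36-dimensional embedding, appended to the raw coordinates as in Eq. (6).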
Training and testing on MP-preprocessed HRIRs degrades performance relative to the default setting, indicating that BiFormer3D does not impose an MP assumption and does not benefit from MP pre-processing. The ITD/ILD-prediction heads primarily improve binaural-cue accuracy, while having a marginal effect on NMSE and CD. Finally, removing the Conv1D temporal refinement degrades all metrics, consistent with its role in improving local consistency across neighboring directions; as illustrated by the zoom-in region in Fig. 4, this refinement mitigates local discontinuities (e.g., from small measurement inconsistencies such as slight head motion) that could otherwise manifest as audible clicks during binaural rendering.

5. Conclusion

We presented BiFormer3D, a time-domain, grid-free binaural Transformer for continuous angular (azimuth–elevation) HRIR spatial up-sampling, reconstructing HRIRs at arbitrary target directions from sparse per-listener measurements. BiFormer3D combines sinusoidal direction encoding, auxiliary ITD/ILD-prediction heads, a post-Transformer Conv1D temporal refinement module, and a complex HRTF loss $\mathcal{L}_{\mathrm{HRTF}}$. Experiments on SONICOM free-field HRIRs show strong performance on time-domain metrics (NMSE, CD) and binaural-cue metrics (ITD-E, ILD-E). Ablations confirm the contribution of each component and suggest that explicit MP pre-processing is not required. Future work will include deeper model analysis and perceptual listening tests for binaural rendering.

6. Generative AI Use Disclosure

We used ChatGPT (GPT-5.2, OpenAI) for English-language editing and stylistic polishing. The tool was not used to generate novel technical content, experimental results, or conclusions. All text was reviewed and approved by the authors.

7. References

[1] W. Zhang, P. N. Samarasinghe, H. Chen, and T. D. Abhayapala, "Surround by sound: A review of spatial audio recording and reproduction," Appl. Sci., vol. 7, no. 5, p. 532, 2017.
[2] X. Sun, "Immersive audio, capture, transport, and rendering: a review," APSIPA Trans. Signal Inf. Process., vol. 10, p. e13, 2021.
[3] H. Møller, "Fundamentals of binaural technology," Appl. Acoust., vol. 36, no. 3-4, pp. 171–218, 1992.
[4] D. Hammershøi and H. Møller, "Methods for binaural recording and reproduction," Acta Acust. United Acust., vol. 88, no. 3, pp. 303–311, 2002.
[5] M. Naef, O. Staadt, and M. Gross, "Spatialized audio rendering for immersive virtual environments," in Proc. ACM Symp. Virtual Reality Softw. Technol. ACM, 2002, pp. 65–72.
[6] M. Gorzel, A. Allen, I. Kelly, J. Kammerl, A. Gungormusler, H. Yeh, and F. Boland, "Efficient encoding and decoding of binaural sound with Resonance Audio," in Proc. AES Int. Conf. Immersive Interact. Audio. Audio Engineering Society, 2019, pp. 349–360.
[7] T. Potter, Z. Cvetković, and E. De Sena, "On the relative importance of visual and spatial audio rendering on VR immersion," Front. Sig. Proc., vol. 2, p. 904866, 2022.
[8] R. Bleidt, A. Borsum, H. Fuchs, and S. M. Weiss, "Object-based audio: Opportunities for improved listening experience and increased listener involvement," SMPTE Motion Imaging J., vol. 124, no. 5, pp. 1–13, 2015.
[9] S. Li and J. Peissig, "Measurement of Head-Related Transfer Functions: A review," Appl. Sci., vol. 10, no. 14, p. 5014, 2020.
[10] D. N. Zotkin, J. Hwang, R. Duraiswami, and L. S. Davis, "HRTF personalization using anthropometric measurements," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA). IEEE, 2003, pp. 157–160.
[11] M. Zhang, R. A. Kennedy, T. D. Abhayapala, and W. Zhang, "Statistical method to identify key anthropometric parameters in HRTF individualization," in Proc. Joint Workshop Hands-free Speech Commun. Microphone Arrays (HSCMA). IEEE, 2011, pp. 213–218.
[12] D. Fantini, M. Geronazzo, F. Avanzini, and S. Ntalampiras, "A survey on machine learning techniques for Head-Related Transfer Function individualization," IEEE Open J. Signal Process., vol. 6, pp. 30–56, 2025.
[13] C. Guezenoc and R. Séguier, "A wide dataset of ear shapes and pinna-related transfer functions generated by random ear drawings," J. Acoust. Soc. Am., vol. 147, no. 6, pp. 4087–4096, Jun. 2020.
[14] Y. Wang, Y. Zhang, Z. Duan, and M. F. Bocko, "Global HRTF personalization using anthropometric measures," in Audio Eng. Soc. Conv. 150. Audio Engineering Society, 2021, paper 10502.
[15] J. Xi, W. Zhang, and T. D. Abhayapala, "Magnitude modelling of individualized HRTFs using DNN-based spherical harmonic analysis," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA). IEEE, 2021, pp. 266–270.
[16] H. Hu, L. Zhou, H. Ma, and Z. Wu, "HRTF personalization based on artificial neural network in individual virtual auditory space," Appl. Acoust., vol. 69, no. 2, pp. 163–172, 2008.
[17] T.-Y. Chen, T.-H. Kuo, and T.-S. Chi, "Autoencoding HRTFs for DNN based HRTF personalization using anthropometric features," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2019, pp. 271–275.
[18] M. Zhang, Z. Ge, T. Liu, X. Wu, and T. Qu, "Modeling of individual HRTFs based on spatial principal component analysis," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 785–797, 2020.
[19] D. Yao, J. Zhao, L. Cheng, J. Li, X. Li, X. Guo, and Y. Yan, "An individualization approach for head-related transfer function in arbitrary directions based on deep learning," JASA Express Lett., vol. 2, no. 6, p. 064401, Jun. 2022.
[20] R. Niu, S. Koyama, and T. Nakamura, "Head-related transfer function individualization using anthropometric features and spatially independent latent representation," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA). IEEE, 2025, pp. 1–5.
[21] B. Xie, Head-Related Transfer Function and Virtual Auditory Display, 2nd ed. Plantation, FL, USA: J. Ross Publishing, 2013.
[22] K. Hartung, J. Braasch, and S. J. Sterbing, "Comparison of different methods for the interpolation of head-related transfer functions," in Proc. AES 16th Int. Conf. Spatial Sound Reprod. Audio Engineering Society, 1999, pp. 319–329, paper 16-028.
[23] D. Poirier-Quinot and B. F. G. Katz, "The Anaglyph binaural audio engine," in Audio Eng. Soc. Conv. 144. Audio Engineering Society, 2018, pp. 1–4, convention e-Brief 431.
[24] A. O. T. Hogg, R. Barumerli, R. Daugintis, K. C. Poole, F. Brinkmann, L. Picinali, and M. Geronazzo, "Listener acoustic personalisation challenge – LAP24: Head-related transfer function upsampling," IEEE Open J. Signal Process., vol. 6, pp. 926–941, 2025.
[25] M. J. Evans, J. A. S. Angus, and A. I. Tew, "Analyzing head-related transfer function measurements using surface spherical harmonics," J. Acoust. Soc. Am., vol. 104, no. 4, pp. 2400–2411, 1998.
[26] R. Duraiswami, D. N. Zotkin, and N. A. Gumerov, "Interpolation and range extrapolation of HRTFs," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), vol. 4. IEEE, 2004, pp. IV-45–IV-48.
[27] W. Zhang, T. D. Abhayapala, R. A. Kennedy, and R. Duraiswami, "Insights into HRTF: Spatial dimensionality and continuous representation," J. Acoust. Soc. Am., vol. 127, no. 4, pp. 2347–2357, 2010.
[28] C. Pörschmann, J. M. Arend, and F. Brinkmann, "Directional equalization of sparse head-related transfer function sets for spatial upsampling," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 6, pp. 1060–1071, 2019.
[29] J. M. Arend, F. Brinkmann, and C. Pörschmann, "Assessing spherical harmonics interpolation of time-aligned head-related transfer functions," J. Audio Eng. Soc., vol. 69, no. 1/2, pp. 104–117, 2021.
[30] I. Engel, D. F. M. Goodman, and L. Picinali, "Assessing HRTF preprocessing methods for ambisonics rendering through perceptual models," Acta Acust., vol. 6, p. 4, 2022.
[31] J. M. Arend, C. Pörschmann, S. Weinzierl, and F. Brinkmann, "Magnitude-corrected and time-aligned interpolation of head-related transfer functions," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 3783–3799, 2023.
[32] M. Geronazzo, L. Picinali, A. O. T. Hogg, R. Barumerli, K. C. Poole, R. Daugintis, J. Pauwels, S. Ntalampiras, G. McLachlan, and F. Brinkmann, "Technical report: SONICOM/IEEE listener acoustic personalisation (LAP) challenge - 2024," TechRxiv preprint, 2024.
[33] A. O. T. Hogg, M. Jenkins, H. Liu, I. Squires, S. J. Cooper, and L. Picinali, "HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 32, pp. 2085–2099, 2024.
[34] Y. Ito, T. Nakamura, S. Koyama, and H. Saruwatari, "HRTF interpolation from spatially sparse measurements using autoencoder with source position conditioning," in Proc. Int. Workshop Acoust. Signal Enhancement (IWAENC). IEEE, 2022, pp. 1–5.
[35] Y. Masuyama, G. Wichern, F. G. Germain, C. Ick, and J. Le Roux, "Retrieval-augmented neural field for HRTF upsampling and personalization," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP). IEEE, 2025, pp. 1–5.
[36] X. Hu, J. Li, S. Zhang, S. Goetz, L. Picinali, O. B. Akan, and A. O. T. Hogg, "HRTFformer: A spatially-aware transformer for personalized HRTF upsampling in immersive audio rendering," 2025.
[37] W. W. Hugeng and D. Gunawan, "Improved method for individualization of head-related transfer functions on horizontal plane using reduced number of anthropometric measurements," J. Telecommun., vol. 2, no. 2, pp. 31–41, May 2010.
[38] J. C. Albarracín Sánchez, L. Comanducci, M. Pezzoli, and F. Antonacci, "Towards HRTF personalization using denoising diffusion models," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP). IEEE, 2025, pp. 1–5.
[39] S. Hwang, Y. Park, and Y.-S. Park, "Modeling and customization of head-related impulse responses based on general basis functions in time domain," Acta Acust. United Acust., vol. 94, no. 6, pp. 965–980, 2008.
[40] ——, "Customization of spatially continuous head-related impulse responses in the median plane," Acta Acust. United Acust., vol. 96, no. 2, pp. 351–363, 2010.
[41] S. Xu, C. Sun, J. Zhang, P. N. Samarasinghe, and T. D. Abhayapala, "RIR-Former: Coordinate-guided transformer for continuous reconstruction of room impulse responses," 2026, accepted to ICASSP 2026.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 30, 2017, pp. 5998–6008.
[43] Y. Zhang, Y. Wang, and Z. Duan, "HRTF field: Unifying measured HRTF magnitude representation with neural fields," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP). IEEE, 2023, pp. 1–5.
[44] Y. Masuyama, G. Wichern, F. G. Germain, Z. Pan, S. Khurana, C. Hori, and J. Le Roux, "NIIRF: Neural IIR filter field for HRTF upsampling and personalization," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP). IEEE, 2024, pp. 1016–1020.
[45] I. Engel, R. Daugintis, T. Vicente, A. O. T. Hogg, J. Pauwels, A. J. Tournier, and L. Picinali, "The SONICOM HRTF dataset," J. Audio Eng. Soc., vol. 71, no. 5, pp. 241–253, 2023.
[46] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2019.
[47] D. R. Morgan, J. Benesty, and M. M. Sondhi, "On the evaluation of estimated impulse responses," IEEE Signal Process. Lett., vol. 5, no. 7, pp. 174–176, 1998.
[48] S. Della Torre, M. Pezzoli, F. Antonacci, and S. Gannot, "DiffusionRIR: Room impulse response interpolation using diffusion models," arXiv preprint, 2025.