SoundCompass: Navigating Target Sound Extraction With Effective Directional Clue Integration In Complex Acoustic Scenes
Recent advances in target sound extraction (TSE) utilize directional clues derived from direction of arrival (DoA), an inherent spatial property of sound available in any acoustic scene. However, previous DoA-based methods rely on hand-crafted features or discrete encodings, which lose fine-grained spatial information and limit adaptability. We propose SoundCompass, an effective directional clue integration framework centered on a Spectral Pairwise INteraction (SPIN) module that captures cross-channel spatial correlations in the complex spectrogram domain, preserving the full spatial information in multichannel signals. The input feature, expressed in terms of spatial correlations, is fused with a DoA clue represented as a spherical harmonics (SH) encoding. The fusion is carried out across overlapping frequency subbands, inheriting the benefits reported for previous band-split architectures. We also incorporate an iterative refinement strategy, chain-of-inference (CoI), into the TSE framework, which recursively fuses the DoA clue with the sound event activation estimated in the previous inference stage. Experiments demonstrate that SoundCompass, combining SPIN, SH embedding, and CoI, robustly extracts target sources across diverse signal classes and spatial configurations.
💡 Research Summary
The paper introduces SoundCompass, a novel framework for target sound extraction (TSE) that leverages direction‑of‑arrival (DoA) information in a more expressive and efficient manner than prior work. Three core innovations are presented. First, the Spectral Pairwise Interaction (SPIN) module processes the complex spectrogram of a multichannel recording by multiplying the sine and cosine components of every channel pair, yielding a (2M)²‑dimensional feature that captures fine‑grained inter‑channel phase and amplitude relationships across all frequencies. Because the products are bounded within ±1, training remains stable and spatial cues are preserved without the discretization inherent in traditional IPD/ILD features.
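The pairwise interaction described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes the "sine and cosine components" are the unit-magnitude phase terms cos(φ) and sin(φ) of each channel's STFT bin, so all pairwise products stay bounded in [-1, 1].

```python
import numpy as np

def spin_features(spec):
    """Sketch of a SPIN-style pairwise interaction (assumed form).

    spec: complex STFT of shape (M, T, F) for M microphones.
    Returns features of shape ((2M)^2, T, F): the products of the
    per-channel cos/sin phase components for every channel pair.
    """
    phase = np.angle(spec)                                          # (M, T, F)
    comps = np.concatenate([np.cos(phase), np.sin(phase)], axis=0)  # (2M, T, F)
    # Outer product across the channel axis; each entry is bounded in [-1, 1].
    feats = np.einsum('itf,jtf->ijtf', comps, comps)                # (2M, 2M, T, F)
    n = comps.shape[0]
    return feats.reshape(n * n, *spec.shape[1:])

rng = np.random.default_rng(0)
spec = rng.standard_normal((4, 10, 32)) + 1j * rng.standard_normal((4, 10, 32))
f = spin_features(spec)
print(f.shape)  # (64, 10, 32) -- i.e. (2*4)^2 feature maps for a 4-mic array
```

For M = 4 microphones this yields (2·4)² = 64 feature maps per time-frequency bin, matching the (2M)² dimensionality stated above.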
Second, DoA cues are encoded using spherical harmonics (SH). By representing the azimuth‑elevation pair (θ, φ) with the real and imaginary parts of SH up to 5th order, the method obtains a continuous, rotation‑equivariant vector of size 2(N+1)². This eliminates the need for one‑hot or cyclic positional embeddings, which treat neighboring angles as independent categories and thus lose angular periodicity.
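A minimal sketch of such an SH embedding, assuming complex spherical harmonics evaluated with SciPy (the paper's exact basis and normalization may differ). Stacking the real and imaginary parts of all Yₙᵐ up to order N = 5 gives 2·(5+1)² = 72 dimensions:

```python
import numpy as np
from scipy.special import sph_harm

def sh_embedding(azimuth, elevation, order=5):
    """Spherical-harmonics DoA embedding (illustrative sketch).

    Returns a vector of size 2*(order+1)**2 built from the real and
    imaginary parts of Y_n^m up to the given order.
    """
    polar = np.pi / 2 - elevation  # convert elevation to polar (colatitude) angle
    vals = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            y = sph_harm(m, n, azimuth, polar)  # complex Y_n^m(azimuth, polar)
            vals.append(y.real)
            vals.append(y.imag)
    return np.asarray(vals)

emb = sh_embedding(np.deg2rad(30.0), np.deg2rad(10.0))
print(emb.shape)  # (72,)
```

Because each Yₙᵐ depends on azimuth through e^(imθ), the embedding is periodic: an azimuth of 0 and of 2π map to (numerically) the same vector, unlike one-hot angle bins.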
Third, the authors adopt a Chain‑of‑Inference (CoI) strategy. After an initial extraction stage, a sound‑event detection (SED) decoder produces a frame‑wise binary activation mask for the target source. This mask is linearly interpolated to match the spectrogram time axis and concatenated with the SH embedding, forming a time‑varying directional clue that is fed into a subsequent extraction stage. By recursively fusing spatial and temporal information, the system progressively refines its output, mitigating uncertainty about when the target is active.
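The clue construction in the CoI stage can be sketched as below. Shapes and the concatenation order are assumptions for illustration; the paper's fusion details may differ.

```python
import numpy as np

def time_varying_clue(sed_mask, sh_emb, n_frames):
    """Sketch of a CoI-style time-varying directional clue (assumed shapes).

    sed_mask: (T_sed,) frame-wise target-activity estimate in [0, 1]
              from the previous inference stage.
    sh_emb:   (D,) spherical-harmonics DoA embedding.
    Returns a (n_frames, D + 1) clue: the SED mask linearly interpolated
    to the spectrogram time axis, concatenated with the SH embedding
    broadcast over time.
    """
    src = np.linspace(0.0, 1.0, num=len(sed_mask))
    dst = np.linspace(0.0, 1.0, num=n_frames)
    mask_i = np.interp(dst, src, sed_mask)                    # (n_frames,)
    sh_t = np.broadcast_to(sh_emb, (n_frames, sh_emb.size))   # (n_frames, D)
    return np.concatenate([mask_i[:, None], sh_t], axis=1)

clue = time_varying_clue(np.array([0.0, 1.0, 1.0, 0.0]), np.ones(72), n_frames=100)
print(clue.shape)  # (100, 73)
```

Feeding this clue into the next extraction stage is what lets the directional conditioning vary over time, rather than being a single static DoA vector.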
The overall architecture builds on the DeepASA backbone, a state‑of‑the‑art universal source separation model. Input multichannel waveforms are transformed via STFT to a complex spectrogram (2M × T × F), passed through a 2‑D convolutional encoder to increase channel depth to D, then processed by the SPIN‑FiLM fusion module. The fusion is performed over 31 overlapping sub‑bands derived from a 12‑TET scale, allowing frequency‑dependent spatial conditioning. FiLM generates scale (γ) and shift (β) parameters from the SH vector, modulating SPIN outputs before they enter Feature Aggregation blocks that capture spectral and temporal dependencies. Two decoders reconstruct direct sound and reverberation separately, followed by inverse STFT.
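The FiLM conditioning step above can be sketched as follows. This is a generic FiLM layer with random stand-in weights, not the trained model: a linear map from the SH clue produces per-channel scale (γ) and shift (β) that modulate the spatial feature maps as γ·x + β.

```python
import numpy as np

def film(features, clue, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM conditioning sketch (weights here are random stand-ins
    for learned parameters).

    features: (D, T, F) feature maps (e.g. SPIN outputs after encoding).
    clue:     (C,) conditioning vector (e.g. the SH embedding).
    """
    gamma = clue @ w_gamma + b_gamma  # (D,) per-channel scale
    beta = clue @ w_beta + b_beta     # (D,) per-channel shift
    # Broadcast the modulation over the time and frequency axes.
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
D, T, F, C = 64, 10, 32, 72
x = rng.standard_normal((D, T, F))
clue = rng.standard_normal(C)
out = film(x, clue,
           rng.standard_normal((C, D)) * 0.01, np.zeros(D),
           rng.standard_normal((C, D)) * 0.01, np.zeros(D))
print(out.shape)  # (64, 10, 32)
```

In the paper this modulation is applied per sub-band, so each of the 31 overlapping frequency bands receives its own frequency-dependent spatial conditioning.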
Experiments were conducted on a regenerated version of the ASA2 dataset, using a 4‑channel tetrahedral microphone array in a simulated cuboid room (RT60 = 0.32 s). Training employed AdamW (lr = 5e‑4) for 100 epochs on four RTX 4090 GPUs. Evaluation metrics included SNR improvement (SNRi), scale‑invariant SNR improvement (SI‑SNRi), and mean absolute errors of inter‑channel level, phase, and time differences (ΔILD, ΔIPD, ΔITD).
Results show that inserting DoA clues before the Feature Aggregation (FA) blocks yields the best performance: SNRi = 17.86 dB and SI‑SNRi = 16.72 dB, surpassing the baseline DeepASA (15.64 dB / 12.98 dB) and the recent DoA‑based methods SSDQ (5.95 dB) and DSENet (16.42 dB). The model uses only 2.70 M parameters and 20.49 G MACs, lower than competing approaches. Ablation studies confirm the importance of each component: removing pairwise multiplication in SPIN drops SNRi to 5.66 dB; replacing SH with cyclic positional embeddings reduces performance slightly; eliminating the band‑split structure degrades accuracy; adding an SED decoder yields modest gains; and applying CoI twice further improves SNRi to 18.20 dB at the cost of additional parameters (+3.48 M) and compute (+24.01 G MACs).
Visualization of FiLM scale parameters via t‑SNE reveals circular manifolds for azimuthal variations and smooth trajectories for elevation, indicating that the SH embedding preserves angular periodicity. SI‑SNRi contour maps demonstrate sensitivity to DoA estimation errors, with performance peaking near the true direction and declining sharply beyond ±15°.
In summary, SoundCompass combines continuous angular encoding, full‑frequency pairwise spatial interaction, and iterative temporal refinement to achieve robust target sound extraction in reverberant, multi‑source environments while maintaining computational efficiency. Future work may explore real‑time deployment, moving‑source tracking, and generalization to arbitrary microphone array geometries.