LuSeeL: Language-queried Binaural Universal Sound Event Extraction and Localization
Most universal sound extraction algorithms focus on isolating a target sound event from single-channel audio mixtures. However, the real world is three-dimensional, and binaural audio, which mimics human hearing, can capture richer spatial information, including sound source location. This spatial context is crucial for understanding and modeling complex auditory scenes, as it inherently informs sound detection and extraction. In this work, we propose a language-driven universal sound extraction network that isolates text-described sound events from binaural mixtures by effectively leveraging the spatial cues present in binaural signals. Additionally, we jointly predict the direction of arrival (DoA) of the target sound using spatial features from the extraction network. This dual-task approach exploits complementary location information to improve extraction performance while enabling accurate DoA estimation. Experimental results on the in-the-wild AudioCaps dataset show that our proposed LuSeeL model significantly outperforms single-channel and uni-task baselines.
💡 Research Summary
The paper introduces LuSeeL, a novel framework that simultaneously performs language‑conditioned universal sound event extraction and direction‑of‑arrival (DoA) estimation from binaural audio mixtures. While most prior universal sound extraction methods operate on single‑channel waveforms and rely solely on textual or visual prompts, LuSeeL exploits the spatial cues inherent in binaural recordings—inter‑aural time and phase differences—to improve both extraction fidelity and source localization.
Core Architecture
- Text Encoder – A frozen T5 model converts the free‑form textual query (e.g., “car horns”) into a high‑dimensional embedding. This embedding is further refined by five transformer encoder layers (512‑dimensional, 2 heads, 1024 FFN) to align it with the audio modality.
- Dual‑Domain Audio Backbone – Inspired by HT‑Demucs, LuSeeL processes the mixture in parallel time‑domain (T‑audio) and frequency‑domain (F‑audio) streams. Each stream contains four convolutional layers that produce frame‑level embeddings, followed by three self‑attention blocks and two cross‑attention blocks. The two streams exchange information via cross‑attention, allowing temporal and spectral cues to complement each other.
- Conditional Fusion via FiLM – Before every self‑attention block, the text embedding modulates the audio features through Feature‑wise Linear Modulation (FiLM). This mechanism injects linguistic context directly into the feature space, guiding the network to attend to spectral patterns and temporal locations that correspond to the queried event.
- Signal Extractor – The outputs of the time and frequency decoders are summed element‑wise, yielding the final extracted waveform (\hat{s}). This design leverages the strengths of both domains: fine‑grained phase information from the time path and rich spectral detail from the frequency path.
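The FiLM conditioning described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name, the linear projections from the text embedding, and all parameter names are assumptions made for clarity.

```python
import numpy as np

def film(audio_feats, text_emb, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: predict a per-channel scale (gamma) and shift (beta) from the
    text embedding, then apply them to every audio frame.

    audio_feats: (T, D) frame-level audio features
    text_emb:    (E,)   pooled text embedding
    w_*, b_*:    learned projections mapping (E,) -> (D,)
    """
    gamma = text_emb @ w_gamma + b_gamma   # (D,) per-channel scale
    beta = text_emb @ w_beta + b_beta      # (D,) per-channel shift
    return gamma * audio_feats + beta      # broadcasts over the T frames
```

In LuSeeL this modulation is applied before every self-attention block in both audio streams, so the text query reshapes the feature space at each depth rather than only at the input.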
Localization Sub‑Network
- GCC‑PHAT Encoder computes generalized cross‑correlation with phase alignment from the raw binaural mixture, capturing inter‑aural time differences essential for azimuth estimation.
- Spectral Localization Encoder extracts location‑specific features from each self‑attention and cross‑attention output of the frequency stream using 1‑D convolutions (input 1256 → 100). These are concatenated and fed into the F‑DoA Encoder, a three‑layer 1‑D CNN that progressively reduces dimensionality while preserving discriminative spatial information.
- The flattened GCC‑PHAT vector is concatenated with the compressed F‑DoA representation, forming a joint descriptor (888 dimensions) that passes through a six‑layer fully‑connected decoder (final layer 360) to produce a probability distribution over 360 azimuth bins.
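The GCC-PHAT cue that feeds the localization sub-network can be sketched as follows. Note the paper feeds the whole correlation vector into the network; this NumPy sketch only extracts the peak lag to show the underlying inter-aural time-difference signal, and the function name and signature are illustrative assumptions.

```python
import numpy as np

def gcc_phat_lag(left, right, max_lag=None):
    """Estimate the inter-aural lag (in samples) via GCC-PHAT.

    The cross-spectrum is whitened (phase transform) so only phase,
    not magnitude, determines the correlation peak. A positive lag
    means the right channel lags the left.
    """
    n_fft = 2 * max(len(left), len(right))
    cross = np.fft.rfft(right, n=n_fft) * np.conj(np.fft.rfft(left, n=n_fft))
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n_fft)
    # re-center so index 0 of `lags` corresponds to lag -n_fft/2
    cc = np.concatenate((cc[-(n_fft // 2):], cc[:n_fft // 2]))
    lags = np.arange(-n_fft // 2, n_fft // 2)
    if max_lag is not None:                 # restrict to physically plausible ITDs
        keep = np.abs(lags) <= max_lag
        cc, lags = cc[keep], lags[keep]
    return int(lags[np.argmax(cc)])
```

For a 16 kHz binaural signal, plausible ITDs span only a few dozen samples, which is why restricting the search range (`max_lag`) is common practice.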
Loss Functions
- Extraction Loss (L_{\text{signal}}) combines a scale‑invariant SI‑SNR term with a multi‑resolution spectral delta loss, encouraging both waveform fidelity and accurate spectral dynamics.
- Localization Loss uses mean‑squared error between the predicted distribution (\hat{d}) and a Gaussian‑smoothed ground‑truth label over the 360 bins (σ² = 5).
- The total objective is (L_{\text{total}} = L_{\text{signal}} + \gamma L_{\text{MSE}}) with (\gamma = 10). By back‑propagating the localization loss through the extraction backbone, the model learns a shared audio‑text‑spatial representation that benefits both tasks.
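The two loss terms can be sketched in NumPy as follows. This is a simplified illustration: the paper's (L_{\text{signal}}) additionally includes a multi‑resolution spectral delta term, which is omitted here, and whether the Gaussian label is renormalized to sum to one is an assumption left out of this sketch.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB (higher is better); negate it for a loss."""
    ref_zm = ref - ref.mean()
    est_zm = est - est.mean()
    proj = (est_zm @ ref_zm) / (ref_zm @ ref_zm + eps) * ref_zm  # target component
    noise = est_zm - proj                                        # residual error
    return 10 * np.log10((proj @ proj) / (noise @ noise + eps) + eps)

def gaussian_doa_label(azimuth_deg, n_bins=360, var=5.0):
    """Gaussian-smoothed ground truth over azimuth bins (sigma^2 = 5),
    wrapped circularly so 359 deg and 0 deg are neighbors."""
    bins = np.arange(n_bins)
    diff = np.abs(bins - azimuth_deg)
    diff = np.minimum(diff, n_bins - diff)   # circular distance
    return np.exp(-diff ** 2 / (2 * var))

def total_loss(est, ref, doa_pred, azimuth_deg, gamma=10.0):
    """L_total = L_signal + gamma * L_MSE (SI-SNR term only in this sketch)."""
    l_signal = -si_snr(est, ref)
    l_mse = np.mean((doa_pred - gaussian_doa_label(azimuth_deg)) ** 2)
    return l_signal + gamma * l_mse
```

The circular wrap in the label matters: without it, a source near 0° would be penalized as if 359° were maximally far away.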
Dataset and Experimental Setup
AudioCaps (≈46 k clips) provides human‑written captions used as queries. For training, the authors synthesize binaural mixtures of 2 or 3 sources: each source is randomly assigned an azimuth (0–360°), normalized to a common energy level, and mixed with a random SNR between –5 dB and +5 dB. Head‑related impulse responses (HRIRs) generate realistic binaural signals, sampled at 16 kHz and 10 s long.
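The normalize-then-mix step can be sketched as below. This is a simplified single-channel sketch under stated assumptions: in the actual pipeline each source would first be convolved with the HRIR for its sampled azimuth, and the function and parameter names here are illustrative, not from the paper.

```python
import numpy as np

def rms_normalize(x, eps=1e-8):
    """Scale a waveform to (approximately) unit RMS energy."""
    return x / (np.sqrt(np.mean(x ** 2)) + eps)

def make_mixture(target, interferers, rng, snr_range=(-5.0, 5.0)):
    """Mix a target with interfering sources at random SNRs.

    All sources are first normalized to a common energy level; each
    interferer is then rescaled so the target-to-interferer ratio
    matches an SNR drawn uniformly from `snr_range` (in dB).
    """
    target = rms_normalize(target)
    mix = target.copy()
    for src in interferers:
        snr_db = rng.uniform(*snr_range)
        mix += rms_normalize(src) * 10 ** (-snr_db / 20)  # lower SNR -> louder interferer
    return mix, target
```

Drawing the SNR per interferer keeps the target's loudness relative to each competing source within a controlled ±5 dB band, which matches the mixture statistics described above.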
Baselines include:
- T‑HTDemucs – single‑channel, language‑conditioned extraction only.
- MLP‑GCC – language‑conditioned binaural DoA estimation only.
- LuSeeL† – extraction only (localization module removed).
- LuSeeL◦ – extraction + localization but without GCC‑PHAT.
Training uses AdamW (lr = 1e‑4), linear warm‑up, learning‑rate decay on plateau, and early stopping. Batch size 128 across four A800 GPUs.
Results
On 2‑source mixtures, the full LuSeeL model (denoted "Both": GCC‑PHAT plus the joint localization loss) achieves an SI‑SNR improvement of 20.3 dB and an SDR improvement of 21.6 dB, far surpassing T‑HTDemucs (7.7 dB / 8.7 dB) and the extraction‑only ablation (17.6 dB / 18.8 dB). DoA accuracy within ±5° reaches 89.9 % with a mean absolute error of 7.0°, compared to 41.1 % / 51.6° for MLP‑GCC. Similar trends hold for 3‑source mixtures, where the full model still outperforms all baselines, demonstrating robustness to increased source overlap. Ablation studies reveal that (i) GCC‑PHAT contributes substantially to localization precision, and (ii) the joint localization loss modestly improves extraction metrics, supporting the hypothesis that spatial cues act as an auxiliary constraint.
Key Contributions
- Unified Multi‑Task Framework – First model to jointly perform language‑driven universal sound extraction and binaural DoA estimation in an end‑to‑end fashion.
- Hybrid Time‑Frequency Transformer with FiLM Conditioning – Enables seamless integration of textual semantics into both temporal and spectral processing streams.
- Spatial Feature Fusion – Combines explicit inter‑aural cues (GCC‑PHAT) with learned spectral embeddings for accurate azimuth prediction.
- Cross‑Task Gradient Sharing – Demonstrates that back‑propagating localization loss through the extraction backbone yields mutual performance gains.
Implications and Future Work
LuSeeL showcases how spatial information inherent in binaural recordings can be harnessed alongside language cues to tackle the classic “cocktail‑party” problem more effectively. Potential applications span augmented/virtual reality, assistive hearing devices, and autonomous robots that need to both isolate and locate sound sources based on natural language commands. Future research directions include evaluating on real‑world binaural recordings (e.g., ear‑phone or binaural microphone captures), extending the query modality to multimodal inputs (images, video), handling dynamic sources with moving trajectories, and optimizing the architecture for low‑latency inference on edge devices.