Detect, Attend and Extract: Keyword Guided Target Speaker Extraction
Target speaker extraction (TSE) aims to extract the speech of a target speaker from mixtures containing multiple competing speakers. Conventional TSE systems predominantly rely on speaker cues, such as pre-enrolled speech, to identify and isolate the target speaker. However, in many practical scenarios, clean enrollment utterances are unavailable, limiting the applicability of existing approaches. In this work, we propose DAE-TSE, a keyword-guided TSE framework that specifies the target speaker through distinct keywords they utter. By leveraging keywords (i.e., partial transcriptions) as cues, our approach provides a flexible and practical alternative to enrollment-based TSE. DAE-TSE follows the Detect-Attend-Extract (DAE) paradigm: it first detects the presence of the given keywords, then attends to the corresponding speaker based on the keyword content, and finally extracts the target speech. Experimental results demonstrate that DAE-TSE outperforms standard TSE systems that rely on clean enrollment speech. To the best of our knowledge, this is the first study to utilize partial transcription as a cue for specifying the target speaker in TSE, offering a flexible and practical solution for real-world scenarios. Our code and demo page are now publicly available.
💡 Research Summary
The paper introduces DAE‑TSE, a novel target‑speaker extraction (TSE) system that eliminates the need for a clean enrollment utterance by using only short keyword transcripts as cues. Traditional TSE approaches rely on pre‑recorded speech from the target speaker to generate a speaker embedding, which limits their applicability in dynamic scenarios such as ad‑hoc meetings or voice‑assistant interactions where such enrollment data is unavailable. DAE‑TSE follows a three‑stage “Detect‑Attend‑Extract” (DAE) paradigm.
In the Detect stage, the system determines whether the supplied keywords appear in the mixed audio and localizes their temporal span. This is achieved through a cross‑attention map produced by the Keyword‑guided Cue Encoder (KCE) and a lightweight dynamic‑programming algorithm that finds the optimal alignment path between the keyword sequence and the acoustic frames. If the keywords are absent, the system outputs silence; otherwise it proceeds.
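The paper does not spell out the dynamic-programming algorithm, but the idea of finding a monotonic alignment path through a cross-attention score matrix can be illustrated with a minimal sketch. Here `score` is a hypothetical (keyword tokens × acoustic frames) similarity matrix, and the absence threshold is an assumed illustrative detail, not the paper's actual criterion:

```python
import numpy as np

def locate_keyword(score, threshold=0.0):
    """Illustrative sketch of keyword detection/localization.

    `score` has shape (K, T): K keyword tokens vs. T acoustic frames,
    e.g. rows of a cross-attention map. A DP finds the best monotonic
    path; if its average score falls below `threshold` (an assumed
    criterion), the keyword is judged absent and None is returned.
    Returns the (start_frame, end_frame) span otherwise.
    """
    K, T = score.shape
    NEG = -1e9
    # dp[k, t]: best cumulative score with token k aligned ending at frame t.
    dp = np.full((K, T), NEG)
    back_start = np.zeros((K, T), dtype=int)  # start frame of each path
    dp[0] = score[0]
    back_start[0] = np.arange(T)
    for k in range(1, K):
        for t in range(1, T):
            stay = dp[k, t - 1]       # token k spans one more frame
            move = dp[k - 1, t - 1]   # advance to the next keyword token
            if stay >= move:
                dp[k, t] = stay + score[k, t]
                back_start[k, t] = back_start[k, t - 1]
            else:
                dp[k, t] = move + score[k, t]
                back_start[k, t] = back_start[k - 1, t - 1]
    end = int(np.argmax(dp[K - 1]))
    if dp[K - 1, end] / K < threshold:
        return None  # keyword absent: the system would output silence
    return int(back_start[K - 1, end]), end
```

The monotonicity constraint (each step either stays on the current token or advances to the next) mirrors the fact that keyword phonemes appear in order in the audio.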
The Attend stage extracts a fixed‑dimensional speaker embedding from the mixture, conditioned on the detected keywords. KCE consists of two Transformer encoders: one processes log‑Mel filter‑bank features of the mixture, the other processes phoneme‑level embeddings of the keyword text. Cross‑attention between speech queries and keyword keys/values aligns the acoustic content with the textual cue at a fine granularity. KCE is trained jointly on an Automatic Speech Recognition (ASR) objective (CTC loss on the full transcription) and a Speaker Verification (SV) objective (cross‑entropy loss on the speaker identity of the keyword segment). A trainable layer‑wise weighting followed by average pooling aggregates information across all Transformer layers to produce a robust speaker representation.
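The layer aggregation step above can be sketched in a few lines. This is a generic reconstruction, assuming stacked per-layer hidden states of shape (layers, time, dim) and trainable per-layer scalars normalized by a softmax; the KCE's actual parameterization may differ:

```python
import numpy as np

def speaker_embedding(layer_states, layer_logits):
    """Sketch: fuse Transformer layer outputs into one speaker vector.

    layer_states: array of shape (L, T, D) -- hidden states of L layers.
    layer_logits: array of shape (L,) -- trainable per-layer scalars
                  (assumed parameterization; softmax-normalized here).
    """
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()                                    # softmax over layers
    weighted = np.tensordot(w, layer_states, axes=1)   # (T, D)
    return weighted.mean(axis=0)                       # average pool -> (D,)
```

With uniform logits this reduces to a plain mean over layers and time; training lets the model up-weight whichever layers carry the most speaker-discriminative information.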
In the Extract stage, the speaker embedding feeds a Band‑Split Recurrent Neural Network (BSRNN) backbone. BSRNN splits the complex spectrogram into sub‑bands, applies interleaved time‑ and frequency‑domain RNNs, and predicts a complex mask that isolates the target speech. A fusion module injects the speaker embedding into the BSRNN, allowing the network to condition its mask estimation on the identified speaker. The model is trained with a negative scale‑invariant signal‑to‑noise ratio (SI‑SNR) loss, directly optimizing perceptual speech quality.
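The SI-SNR objective used in the Extract stage is a standard metric with a closed form: project the estimate onto the target, then compare signal power to residual power. A minimal reference implementation (the training loss is the negative of this value):

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    # Zero-mean both signals so the measure ignores DC offset.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal projection of the estimate onto the target direction.
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is why SI-SNR is preferred over plain SNR for separation models whose output gain is arbitrary.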
Experiments use simulated mixtures derived from LibriSpeech (train‑clean‑360 and train‑other‑500 for KCE pre‑training; train‑clean‑100 for backbone training). The keyword cue comprises only 28.4 % of the full transcription, yet DAE‑TSE outperforms strong enrollment‑based baselines in SI‑SNR and SDR. Keyword detection/localization achieves an average temporal error of about 100 ms, demonstrating precise alignment. Ablation studies confirm the importance of cross‑attention, joint ASR‑SV training, and layer‑wise weighting.
The contributions are threefold: (1) a Detect‑Attend‑Extract framework that derives a global speaker embedding solely from short keyword transcripts; (2) a jointly trained ASR‑SV Cue Encoder that aligns text and audio via cross‑attention, enabling simultaneous keyword detection, localization, and speaker embedding generation; (3) empirical evidence that keyword‑guided TSE can surpass traditional enrollment‑based methods while using substantially less textual information.
By making code and a demo publicly available, the authors facilitate reproducibility and further research. The work opens a practical path for TSE in real‑world applications where pre‑enrollment is infeasible, and suggests future extensions to multilingual keywords, real‑world noisy recordings, and integration with downstream tasks such as diarization or speech‑to‑text pipelines.