Target speech extraction (TSE) typically relies on pre-recorded, high-quality enrollment speech, which disrupts the user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework in which the wake-word segment, captured naturally during human-machine interaction, is automatically used as the enrollment reference. This eliminates the need for pre-collected speech and enables a seamless experience. We present the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under realistic and diverse acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models suffer performance degradation in EoW-TSE, TTS-based augmentation significantly enhances the listening experience, though gaps remain in speech recognition accuracy.
Target speech extraction (TSE) aims to isolate a specific speaker's voice from a multi-talker or noisy acoustic environment by leveraging a set of auxiliary clues, such as a reference enrollment utterance from the target speaker. Traditional TSE frameworks typically operate under the assumption that a high-quality, pre-recorded enrollment signal is readily available to guide the extraction backbone in characterizing the speaker's unique embedding [1][2][3]. However, in practical human-machine dialogue scenarios, requiring users to provide a pre-collected enrollment sample beforehand significantly disrupts the fluidity of interaction and limits the system's feasibility for spontaneous or first-time users. To bridge this gap and enable truly seamless interaction, it is essential to move toward an "Enroll-on-Wakeup" (EoW) TSE paradigm. In this setting, the system must extract the target speech using only the brief and often noise- and interferer-corrupted wake-word segment captured during the initial triggering phase, posing a significant yet necessary challenge for current TSE research.
Existing methods for providing target speaker clues in TSE can be broadly categorized into three paradigms: audio-only, audio-visual (AV-TSE), and spatial-assisted auditory TSE. Audio-only enrollment-based TSE primarily relies on pretrained speaker verification models [4] or dedicated speaker encoders [5][6][7] to extract target speaker identity embeddings. Recently, speaker embedding-free methods [8][9][10] and waveform-level concatenation approaches [11] have demonstrated robust performance across various TSE tasks. Despite their success,
these methods still require pre-recorded enrollment. To leverage multi-modal information, AV-TSE [12][13][14] incorporates visual cues such as lip movements, while recent works [15] integrate the linguistic knowledge of large language models to compensate for acoustic degradation. However, AV-TSE is fundamentally limited by line-of-sight requirements and raises significant privacy and computational concerns for low-power embedded devices. Alternatively, spatial-assisted TSE [16,17] utilizes multi-channel features or direction-of-arrival information to localize the target speaker. While effective, these systems necessitate specific multi-microphone array configurations, limiting their versatility across diverse hardware.
Ultimately, regardless of the target speaker clues utilized, most existing frameworks overlook the practical challenge of obtaining high-quality references in spontaneous dialogue. The dependency on pre-collected data or specialized hardware remains a bottleneck for seamless human-machine interaction. To address these limitations, this paper investigates the feasibility of TSE using only the intrinsic information available during the interaction process itself. Specifically, we propose an EoW-TSE framework that directly adopts the wake-word segment as the enrollment reference, eliminating the user's burden of manual enrollment. The main contributions are summarized as follows:
• We introduce the Enroll-on-Wakeup (EoW) TSE paradigm, which utilizes the triggering wake-word as an automatic enrollment clue to facilitate zero-effort, seamless human-machine interaction.
• We present the first systematic study of EoW-TSE, providing a comprehensive evaluation of advanced generative and discriminative models under diverse acoustic conditions and complex interferences.
• We investigate enrollment augmentation methods using LLM-based TTS (see the conceptual sketch after this list), demonstrating that synthetic enrollment significantly enhances perceptual quality under severe acoustic degradation, while identifying the remaining challenges in balancing speech intelligibility with ASR accuracy.
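To make the third contribution concrete, the following is a minimal conceptual sketch of TTS-based enrollment augmentation, assuming a zero-shot TTS model that accepts a voice prompt. The `tts_model` object, its `synthesize` interface, and the carrier text are hypothetical placeholders for illustration, not the paper's actual pipeline.

```python
# Conceptual sketch of LLM-based TTS enrollment augmentation: the noisy
# wake-word segment serves as a zero-shot voice prompt, and the TTS model
# synthesizes a longer, clean utterance in the same voice to use as the
# enrollment reference. `tts_model` and its `synthesize` method are
# hypothetical placeholders.

import numpy as np

def augment_enrollment(wake_word_audio: np.ndarray,
                       sample_rate: int,
                       tts_model,  # hypothetical zero-shot TTS model
                       prompt_text: str = "Hi, Pandora") -> np.ndarray:
    """Synthesize a clean enrollment utterance from a short noisy wake-word."""
    # Condition the TTS on the wake-word segment so the synthetic speech
    # carries the target speaker's timbre.
    synthetic = tts_model.synthesize(
        text="The quick brown fox jumps over the lazy dog.",  # any carrier text
        voice_prompt=wake_word_audio,
        voice_prompt_text=prompt_text,
        sample_rate=sample_rate,
    )
    return synthetic  # replaces the raw wake-word as the enrollment clue
```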
This section defines the Enroll-on-Wakeup (EoW) TSE framework, illustrated in Figure 1. The goal of EoW-TSE is to extract target speech from a continuous noisy mixture by using only the transient wake-word segment as the enrollment reference.
In a traditional TSE task, the observed noisy mixture $x(t)$ in the time domain is typically modeled as
$$x(t) = s(t) + n(t),$$
where $s(t)$ represents the clean target speech and $n(t)$ denotes the sum of additive noise and interfering talkers. Standard models isolate $s(t)$ by leveraging a high-quality enrollment utterance $e_{\text{pre}}$, which is pre-recorded under controlled, clean conditions. The extraction process can be expressed as
$$\hat{s}(t) = \mathcal{F}\big(x(t), e_{\text{pre}}; \Theta\big),$$
where $\mathcal{F}$ is the TSE mapping function and $\Theta$ represents the model parameters. As discussed in Section 1, the reliance on a pre-collected $e_{\text{pre}}$ limits the system's spontaneity and disrupts the user experience.
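As a minimal illustration of this conventional interface, the sketch below implements a toy mapping $\mathcal{F}$ in PyTorch: a speaker encoder embeds the enrollment utterance, and the extractor is conditioned on that embedding at every time step. The architecture is illustrative only and does not correspond to any specific model evaluated here.

```python
# Toy PyTorch sketch of the conventional TSE interface
# s_hat(t) = F(x(t), e_pre; Theta). Module choices are illustrative.

import torch
import torch.nn as nn

class TSEModel(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.spk_encoder = nn.GRU(1, dim, batch_first=True)   # embeds e_pre
        self.extractor = nn.GRU(1 + dim, dim, batch_first=True)
        self.decoder = nn.Linear(dim, 1)                      # back to waveform

    def forward(self, x: torch.Tensor, e_pre: torch.Tensor) -> torch.Tensor:
        # x, e_pre: (batch, samples); returns s_hat: (batch, samples)
        _, h = self.spk_encoder(e_pre.unsqueeze(-1))          # speaker embedding
        spk = h[-1].unsqueeze(1).expand(-1, x.shape[1], -1)   # broadcast over time
        feats, _ = self.extractor(torch.cat([x.unsqueeze(-1), spk], dim=-1))
        return self.decoder(feats).squeeze(-1)
```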
Unlike the conventional approach, the proposed EoW-TSE derives the target speaker enrollment clue directly from the interaction process. As shown in Figure 1, the input stream contains the wake-up command (e.g., "Hi, Pandora") followed immediately by the target query (e.g., "What's the weather like today?"). The workflow of EoW-TSE can be defined as
$$\hat{s}(t) = \mathcal{F}\big(x(t), e_{\text{wake}}; \Theta\big),$$
where $e_{\text{wake}}$ denotes the short, potentially corrupted wake-word segment extracted from the input stream itself.
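Under this formulation, only the source of the enrollment changes: the wake-word segment is sliced from the incoming stream and passed to the same mapping $\mathcal{F}$. The sketch below, reusing the `TSEModel` from the previous example, assumes the wake-word boundaries are supplied by an external detector that is not specified here.

```python
# Sketch of the EoW-TSE inference workflow: the wake-word detector supplies
# the boundaries of the trigger phrase inside the stream, that segment
# becomes the enrollment clue e_wake, and the same TSE mapping F is applied.

import torch

def eow_tse(model: "TSEModel",
            stream: torch.Tensor,
            wake_start: int,
            wake_end: int) -> torch.Tensor:
    """Extract target speech using only the in-stream wake-word segment.

    stream: (batch, samples) noisy mixture containing wake-word + query.
    wake_start/wake_end: sample indices from the wake-word detector.
    """
    e_wake = stream[:, wake_start:wake_end]   # short, possibly noisy enrollment
    return model(stream, e_wake)              # s_hat(t) = F(x(t), e_wake; Theta)
```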