Test-Time Adaptation for Speech Enhancement via Domain Invariant Embedding Transformation
Deep learning-based speech enhancement models achieve remarkable performance when test distributions match training conditions, but often degrade when deployed in unpredictable real-world environments with domain shifts. To address this challenge, we present LaDen (latent denoising), the first test-time adaptation method specifically designed for speech enhancement. Our approach leverages powerful pre-trained speech representations to perform latent denoising, approximating clean speech representations through a linear transformation of noisy embeddings. We show that this transformation generalizes well across domains, enabling effective pseudo-labeling for target domains without labeled target data. The resulting pseudo-labels then drive test-time adaptation of speech enhancement models across diverse acoustic environments. We propose a comprehensive benchmark spanning multiple datasets with various domain shifts, including changes in noise types, speaker characteristics, and languages. Our extensive experiments demonstrate that LaDen consistently outperforms baseline methods across perceptual metrics, particularly for speaker and language domain shifts.
💡 Research Summary
The paper introduces LaDen (latent denoising), the first test‑time adaptation (TTA) method specifically designed for speech enhancement (SE). Traditional deep‑learning SE models perform well when the test distribution matches the training data, but they degrade under realistic domain shifts such as new noise types, unseen speakers, or different languages. Collecting labeled data for every possible target domain is infeasible, and many existing unsupervised domain adaptation (UDA) techniques either require access to source data at test time or rely on mechanisms (entropy minimization, feature alignment) that are unsuitable for regression‑based SE.
LaDen tackles the core challenge—producing reliable pseudo‑labels for unlabeled target data—by moving the problem into a semantic embedding space. A pre‑trained speech encoder (WavLM Large CNN, 512‑dimensional) maps both noisy and clean waveforms to embeddings y′ and x′. The authors hypothesize that the relationship between noisy and clean speech becomes approximately linear in this space. They therefore estimate a domain‑invariant linear transformation A (size d×d) using a modest number of source‑domain paired samples (K ≥ d) via a closed‑form Moore‑Penrose solution. Empirically, A generalizes across a variety of target domains, achieving cosine similarities above 0.96 between transformed noisy embeddings A y′ and true clean embeddings x′.
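The closed-form fit of A can be sketched with ordinary least squares. The dimensions, corruption matrix, and random embeddings below are synthetic stand-ins so the recipe is runnable (the paper fits A on real paired WavLM embeddings with d = 512):

```python
import numpy as np

# Minimal sketch of the closed-form estimate of the domain-invariant map A.
rng = np.random.default_rng(0)
d, K = 64, 256                       # embedding size, paired samples (K >= d)

X_clean = rng.standard_normal((K, d))                  # clean embeddings x'
M = np.eye(d) + 0.05 * rng.standard_normal((d, d))     # synthetic corruption
Y_noisy = X_clean @ M.T                                # noisy embeddings y'

# Least-squares fit of A such that A y' ~ x', solved via the Moore-Penrose
# pseudoinverse: A^T = pinv(Y) X, the minimizer of ||Y A^T - X||_F.
A = (np.linalg.pinv(Y_noisy) @ X_clean).T              # shape (d, d)

# Cosine similarity between pseudo clean embeddings A y' and true x'.
pseudo = Y_noisy @ A.T
cos = np.sum(pseudo * X_clean, axis=1) / (
    np.linalg.norm(pseudo, axis=1) * np.linalg.norm(X_clean, axis=1))
print(f"mean cosine similarity: {cos.mean():.4f}")
```

On this noiseless synthetic data the fit is essentially exact; the paper's reported cosine similarities above 0.96 are measured on real cross-domain embeddings, where the linear relationship only holds approximately.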
During test time, each incoming noisy utterance y is encoded to y′, transformed to A y′ (the “pseudo clean embedding”), and the SE model f_θ produces an enhanced waveform x̂. The enhanced waveform is re‑encoded to x̂′, and a cosine‑distance loss L_LD = 1 − cos(x̂′, A y′) is computed. This loss is used to adapt only a small subset of the SE model’s parameters (layer‑norm and output layers), preserving the bulk of the source model while allowing rapid online fine‑tuning.
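Treating the embeddings as plain vectors, the latent-denoising loss reduces to a few lines; the function and variable names here are illustrative, not taken from the paper's code:

```python
import numpy as np

def latent_denoising_loss(enh_emb, noisy_emb, A):
    """L_LD = 1 - cos(x_hat', A y'): cosine distance between the re-encoded
    enhanced output and the pseudo clean embedding."""
    pseudo_clean = A @ noisy_emb                       # A y'
    cos = float(enh_emb @ pseudo_clean) / (
        np.linalg.norm(enh_emb) * np.linalg.norm(pseudo_clean))
    return 1.0 - cos

# Toy check: an enhanced embedding aligned with A y' gives (near-)zero loss,
# while an orthogonal one gives a loss of 1.
A = np.eye(4)
aligned = latent_denoising_loss(np.array([1.0, 0, 0, 0]), np.array([2.0, 0, 0, 0]), A)
ortho = latent_denoising_loss(np.array([0.0, 1, 0, 0]), np.array([2.0, 0, 0, 0]), A)
```

In an autodiff framework, restricting adaptation to the layer-norm and output layers amounts to freezing every other parameter before backpropagating this loss.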
Because embeddings discard fine‑grained temporal detail, the authors add an envelope regularization term. They extract the signal envelope via the magnitude of the Hilbert transform for both the SE output and a spectral‑subtraction reference x̂_SS. Frame‑wise cosine similarity between these envelopes, weighted by frame energy, yields L_R. The total loss is L = I(L_LD ≤ γ)·(L_LD + λ·L_R), where the indicator zeroes out updates whenever L_LD exceeds γ = 0.05, suppressing outlier utterances. Additionally, after each gradient step they perform weight averaging, θ_t ← β θ_t + (1 − β) θ_S with β = 0.9, stabilizing adaptation and preventing catastrophic forgetting.
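The envelope extraction, gated loss, and weight averaging can be sketched as below. The FFT-based analytic signal is the standard construction (equivalent to `scipy.signal.hilbert` followed by `np.abs`); all function names and the toy signal are assumptions of this sketch, and the per-frame energy weighting of L_R is omitted for brevity:

```python
import numpy as np

def analytic_envelope(x):
    # Magnitude of the analytic signal (Hilbert-transform envelope),
    # computed via the standard FFT construction.
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spec * h))

def gated_total_loss(l_ld, l_r, lam=1.0, gamma=0.05):
    # L = I(L_LD <= gamma) * (L_LD + lam * L_R): utterances whose
    # latent-denoising loss exceeds gamma contribute no update.
    return (l_ld + lam * l_r) if l_ld <= gamma else 0.0

def weight_average(theta_t, theta_s, beta=0.9):
    # theta_t <- beta * theta_t + (1 - beta) * theta_s after each gradient
    # step, pulling adapted weights back toward the source model.
    return beta * theta_t + (1.0 - beta) * theta_s

# The envelope of an integer-cycle sinusoid is a constant equal to its amplitude.
t = np.arange(256)
env = analytic_envelope(0.5 * np.cos(2 * np.pi * 8 * t / 256))
```

`weight_average` would be applied element-wise to every adapted tensor, with θ_S held fixed as a copy of the source model's weights.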
The experimental protocol is extensive. The source model is trained on the EARS‑W dataset (100 h, 107 speakers, multiple speaking styles) mixed with WHAM! noise. Four target configurations cover three shift types: (1) noise shift (EARS‑DEMAND), (2) speaker+noise shift (VoiceBank+DEMAND and VoiceBank+WHAM), and (3) language shift (the DNS dataset covering six languages). Two SE architectures are evaluated: an amplitude‑masking residual‑block model and a Conv‑TasNet‑style model. Metrics include SI‑SDR and PESQ.
Results show that LaDen consistently outperforms baselines—source‑only, RemixIT (student‑teacher self‑training), and SSRA (representation‑based pseudo‑labeling)—by 1–2 dB SI‑SDR and 0.5–1.0 PESQ points across all domains. Notably, in speaker and language shifts LaDen reaches or exceeds the performance of models that are retrained directly on the target data, demonstrating the strength of the embedding‑based pseudo‑labeling.
The paper acknowledges limitations: only additive noise is addressed (no reverberation or compression artifacts), the linearity assumption for A may not hold under more extreme conditions, and the reliance on a frozen large encoder may be computationally demanding for some edge devices. Future work is suggested on non‑linear transformations, multi‑scale embeddings, lightweight encoders, and extending the framework to reverberant or multi‑modal scenarios.
In summary, LaDen offers a simple yet powerful solution: a pre‑computed linear mapping in a robust speech embedding space provides high‑quality pseudo‑labels, enabling efficient, online adaptation of speech enhancement models without any target‑domain labels or source data access. This advances the state of the art in test‑time adaptation for regression‑based audio tasks and opens avenues for practical deployment of SE systems in highly variable real‑world environments.