Investigation of Speech and Noise Latent Representations in Single-channel VAE-based Speech Enhancement


Recently, a variational autoencoder (VAE)-based single-channel speech enhancement system using Bayesian permutation training has been proposed, which uses two pretrained VAEs to obtain latent representations for speech and noise. Based on these pretrained VAEs, a noisy VAE learns to generate speech and noise latent representations from noisy speech for speech enhancement. Modifying the pretrained VAE loss terms affects the pretrained speech and noise latent representations. In this paper, we investigate how these different representations affect speech enhancement performance. Experiments on the DNS3, WSJ0-QUT, and VoiceBank-DEMAND datasets show that a latent space where speech and noise representations are clearly separated significantly improves performance over standard VAEs, which produce overlapping speech and noise representations.


💡 Research Summary

This paper investigates how the latent representations learned by pretrained variational autoencoders (VAEs) for clean speech and noise affect the performance of a single‑channel speech‑enhancement system that uses Bayesian permutation training (PV‑AE). The original PV‑AE framework consists of three VAEs: a clean‑speech VAE (CV‑AE), a noise VAE (NV‑AE), and a noisy‑speech VAE (NSV‑AE). The CV‑AE and NV‑AE are pretrained on clean speech and noise respectively, providing latent distributions q(zₓ|x) and q(zᵥ|v). During enhancement, the NSV‑AE encoder receives a noisy log‑power spectrum y and outputs two latent vectors zₓ and zᵥ. The training objective for the NSV‑AE is to minimise the KL divergence between its posteriors q(zₓ|y), q(zᵥ|y) and the pretrained posteriors q(zₓ|x), q(zᵥ|v). The latent vectors are then fed to the decoders of CV‑AE and NV‑AE, which reconstruct speech and noise spectra; a real‑valued mask finally yields the enhanced speech.
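Since the NSV‑AE objective is a KL divergence between diagonal‑Gaussian posteriors, it reduces to a closed‑form expression per latent dimension. The sketch below (our own illustrative code, not taken from the paper; function and argument names are ours) computes this closed form for two diagonal Gaussians parameterised by means and log‑variances, as a typical VAE encoder would output:

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ).

    Closed form for diagonal Gaussians, summed over latent dimensions:
    0.5 * sum( log(var_p/var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1 )
    """
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# In the PV-AE setting, (mu_q, logvar_q) would come from the NSV-AE encoder
# applied to noisy speech y, and (mu_p, logvar_p) from the pretrained CV-AE
# (for z_x) or NV-AE (for z_v); minimising this KL pulls the noisy posteriors
# toward the pretrained clean-speech and noise posteriors.
```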

The authors hypothesise that the quality and structure of the pretrained latent spaces are crucial for the overall system because the NSV‑AE is explicitly trained to mimic those spaces. To test this, they replace the standard ELBO loss used for pretraining CV‑AE and NV‑AE with the Disentangled Inferred Prior VAE (DIP‑VAE) loss. DIP‑VAE adds two regularisation terms to the ELBO: (i) an off‑diagonal penalty, weighted by λ_od, that drives the off‑diagonal entries of the covariance of the encoder mean vectors µ toward zero, and (ii) a diagonal penalty, weighted by λ_d, that pushes the diagonal entries toward 1. Additionally, a weighting factor β scales the KL term. By varying β, λ_od, and λ_d, four pretraining configurations are examined:

  1. Standard VAE (β = 1, λ_od = λ_d = 0) – baseline.
  2. DIP‑VAE (β = 1, λ_od = 10⁴, λ_d = 10²) – full disentanglement.
  3. Standard VAE without KL (β = 0) – reconstruction‑only.
  4. DIP‑VAE without KL (β = 0, λ_od = 10⁴, λ_d = 10²) – disentanglement‑only.
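The two DIP‑VAE regularisation terms above can be sketched directly from their definitions. The snippet below is a minimal NumPy illustration (our own, not the authors' implementation) of the penalty on the covariance of a batch of encoder means, using the paper's weights λ_od = 10⁴ and λ_d = 10² as defaults:

```python
import numpy as np

def dip_vae_penalty(mu, lam_od=1e4, lam_d=1e2):
    """DIP-VAE regulariser on a batch of encoder means.

    mu: array of shape (batch, latent_dim).
    Penalises the batch covariance of mu so that off-diagonal entries
    go to 0 (latent dimensions decorrelate) and diagonal entries go to 1.
    """
    cov = np.cov(mu, rowvar=False)                 # (L, L) covariance of means
    off_diag = cov - np.diag(np.diag(cov))         # zero out the diagonal
    od_term = lam_od * np.sum(off_diag ** 2)       # push off-diagonals to 0
    d_term = lam_d * np.sum((np.diag(cov) - 1.0) ** 2)  # push diagonals to 1
    return od_term + d_term
```

During pretraining, this penalty would simply be added to the (β‑weighted) ELBO of the CV‑AE or NV‑AE; with β = 0 it is the only regulariser, as in configuration 4.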

All VAEs share the same architecture: three fully‑connected layers (512 units each) followed by a unidirectional GRU (512 units). The latent dimension is L = 128. Training uses Adam (lr = 1e‑4), batch size 128, and early stopping after 20 epochs without validation loss improvement.

Experiments are conducted on three datasets: (i) the matched DNS‑3 synthetic test set, (ii) the mismatched WSJ0‑QUT set, and (iii) the VoiceBank‑DEMAND (VB‑DMD) set. Metrics are scale‑invariant SNR (SI‑SNR) and PESQ. The authors also report the reconstruction SI‑SNR of the pretrained CV‑AE and NV‑AE on DNS‑3.
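SI‑SNR is standard enough to show concretely. The sketch below (our own illustrative code; a common zero‑mean formulation, not necessarily the exact variant used in the paper) projects the estimate onto the reference to obtain a scale‑invariant target, then measures the energy ratio in dB:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    est = est - est.mean()                         # remove DC offsets
    ref = ref - ref.mean()
    # Project the estimate onto the reference: scaling est cannot change this.
    s_target = np.dot(est, ref) * ref / (np.dot(ref, ref) + eps)
    e_noise = est - s_target                       # residual error component
    return 10.0 * np.log10(
        np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps)
    )
```

Because of the projection, a perfectly rescaled estimate (e.g. twice the reference) still scores near‑perfectly, which is the point of the scale‑invariant formulation.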

Key findings:

  • Reconstruction quality differs across configurations. CV‑AE achieves its highest SI‑SNR when the KL term is removed (configuration 3), indicating that pure reconstruction loss favours speech modelling. NV‑AE reconstructs best with full DIP‑VAE (configuration 2), suggesting that disentanglement helps capture the more stochastic nature of noise.
  • For speech enhancement, the causal PV‑AE consistently outperforms a non‑causal recurrent VAE (R‑VAE) across all datasets. The best enhancement results are obtained with configuration 2 (full DIP‑VAE), where the latent space shows a clear separation between speech and noise clusters.
  • A strong correlation is observed between the CV‑AE reconstruction ability and the overall enhancement performance, confirming that a well‑trained speech latent space is pivotal.
  • The disentanglement regulariser alone (configuration 4) does not yield significant gains without the KL term, implying that both reconstruction fidelity (via KL) and latent independence are needed.
  • In mismatched conditions (WSJ0‑QUT, VB‑DMD), the advantage of the disentangled latent space persists, demonstrating improved generalisation.

The authors conclude that the design of the pretraining loss for the speech and noise VAEs is a decisive factor for PV‑AE performance. By enforcing latent independence (through DIP‑VAE) and appropriately weighting the KL term, the system achieves a latent representation where speech and noise are well separated, leading to more accurate posterior estimation by the NSV‑AE and, consequently, higher SI‑SNR and PESQ scores.

Future directions suggested include exploring higher‑dimensional latent spaces, non‑linear covariance regularisation, and lightweight encoder designs for real‑time deployment in low‑latency applications such as teleconferencing and AR/VR.

