I-DCCRN-VAE: An Improved Deep Representation Learning Framework for Complex VAE-based Single-channel Speech Enhancement
Recently, a complex variational autoencoder (VAE)-based single-channel speech enhancement system based on the DCCRN architecture has been proposed. In this system, a noise suppression VAE (NSVAE) learns to extract clean speech representations from noisy speech using pretrained clean speech and noise VAEs with skip connections. In this paper, we improve DCCRN-VAE by incorporating three key modifications: 1) removing the skip connections in the pretrained VAEs to encourage more informative speech and noise latent representations; 2) using $β$-VAE in pretraining to better balance reconstruction and latent space regularization; and 3) extending the NSVAE to generate both speech and noise latent representations. Experiments show that the proposed system achieves performance comparable to the DCCRN and DCCRN-VAE baselines on the matched DNS3 dataset but outperforms the baselines on mismatched datasets (WSJ0-QUT, VoiceBank-DEMAND), demonstrating improved generalization ability. In addition, an ablation study shows that similar performance can be achieved with classical fine-tuning instead of adversarial training, resulting in a simpler training pipeline.
💡 Research Summary
The paper introduces I‑DCCRN‑VAE, an improved version of the complex‑valued variational auto‑encoder (VAE) based single‑channel speech enhancement system originally built on the DCCRN architecture. The authors identify three limitations in the prior DCCRN‑VAE: (1) the pretrained clean‑speech VAE (CVAE) and noise VAE (NVAE) both contain skip connections, which allow the encoder‑decoder pair to bypass the latent bottleneck and thus limit the informativeness of the latent variables; (2) the standard VAE loss balances reconstruction and KL‑divergence equally, which can either over‑regularize or under‑regularize the latent space; (3) the noise‑suppression VAE (NSVAE) only extracts a clean‑speech latent representation, relying on the pretrained NVAE for noise information.
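The bypass effect described in limitation (1) can be illustrated with a toy sketch (this is not the authors' DCCRN code; the two-layer numpy encoder/decoder and all weight names here are illustrative assumptions): when the decoder receives an intermediate encoder feature through a skip connection, reconstruction can rely on that feature and largely ignore the latent bottleneck z.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not from the paper)
dim, hid, lat = 8, 16, 2
W_enc = rng.normal(size=(dim, hid))
W_lat = rng.normal(size=(hid, lat))
W_dec = rng.normal(size=(lat, hid))
W_out = rng.normal(size=(hid, dim))

def encoder(x):
    """Toy encoder: returns a latent code z and an intermediate feature h."""
    h = np.tanh(x @ W_enc)   # intermediate feature: the potential skip path
    z = h @ W_lat            # latent bottleneck
    return z, h

def decoder(z, skip=None):
    """Toy decoder. With a skip connection, the encoder feature h is added
    back in, so information can bypass the bottleneck z entirely."""
    h = np.tanh(z @ W_dec)
    if skip is not None:
        h = h + skip         # skip connection around the bottleneck
    return h @ W_out

x = rng.normal(size=(4, dim))
z, h = encoder(x)
x_hat_with_skip = decoder(z, skip=h)   # DCCRN-VAE-style: skip present
x_hat_no_skip = decoder(z)             # I-DCCRN-VAE: all info must flow through z
```

Removing the skip path (as in modification (1) below) forces every reconstruction to be a function of z alone, which is what pushes the latent code to be informative.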
To address these issues, the authors make three key modifications. First, they remove all skip connections from the pretrained CVAE and NVAE, forcing every input to pass through the latent bottleneck and encouraging richer complex‑valued latent codes (zₓ for speech, zᵥ for noise). Second, they adopt a $β$‑VAE formulation for pretraining, weighting the KL term so that the loss becomes $\mathcal{L} = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + β\, D_{\mathrm{KL}}(q(z|x)\,\|\,p(z))$, which lets the trade‑off between reconstruction fidelity and latent‑space regularization be tuned explicitly. Third, the NSVAE is extended to generate both the speech and the noise latent representations from noisy speech, rather than relying on the pretrained NVAE for noise information.
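The $β$-weighted objective in modification (2) can be sketched numerically. The code below is a minimal, generic β-VAE loss with the standard closed-form Gaussian KL term; it is not the paper's complex-valued DCCRN loss, and the toy inputs are assumptions for illustration only.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def beta_vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Negative ELBO with a beta weight on the KL term.

    beta > 1 strengthens latent regularization, beta < 1 favors
    reconstruction; beta = 1 recovers the standard VAE objective.
    """
    recon = np.sum((x - x_hat) ** 2)  # Gaussian NLL up to constants
    return recon + beta * gaussian_kl(mu, logvar)

# Toy numbers (illustrative, not from the paper)
x = np.array([1.0, -0.5])
x_hat = np.array([0.9, -0.4])
mu = np.array([0.1, 0.2])
logvar = np.array([-0.1, 0.05])

loss_std = beta_vae_loss(x, x_hat, mu, logvar, beta=1.0)  # standard VAE
loss_reg = beta_vae_loss(x, x_hat, mu, logvar, beta=4.0)  # stronger regularization
```

Since the KL term is non-negative, increasing β can only increase the loss for a fixed posterior, which is exactly the lever used to keep the latent space from being over- or under-regularized.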