Do We Need EMA for Diffusion-Based Speech Enhancement? Toward a Magnitude-Preserving Network Architecture
We study diffusion-based speech enhancement using a Schrödinger bridge formulation and extend the EDM2 framework to this setting. We employ time-dependent preconditioning of network inputs and outputs to stabilize training and explore two skip-connection configurations that allow the network to predict either environmental noise or clean speech. To control activation and weight magnitudes, we adopt a magnitude-preserving architecture and learn the contribution of the noisy input within each network block for improved conditioning. We further analyze the impact of exponential moving average (EMA) parameter smoothing by approximating different EMA profiles post-training, finding that, unlike in image generation, short or absent EMA consistently yields better speech enhancement performance. Experiments on VoiceBank-DEMAND and EARS-WHAM demonstrate competitive signal-to-distortion ratios and perceptual scores, with the two skip-connection variants exhibiting complementary strengths. These findings provide new insights into EMA behavior, magnitude preservation, and skip-connection design for diffusion-based speech enhancement.
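The post-training EMA analysis builds on the EDM2 idea that an EMA of the weights can be parameterized by a power function rather than a fixed decay constant. As a minimal sketch (not the paper's implementation), the power-function EMA updates a running average with a time-dependent retain factor β_t = (1 − 1/t)^(γ+1); γ = 0 reduces to a plain running mean over all checkpoints, while larger γ concentrates weight on recent steps, i.e., a shorter effective EMA:

```python
def power_ema(params_history, gamma):
    """Power-function EMA over a parameter trajectory (EDM2-style sketch).

    params_history: sequence of parameter values (floats or arrays), one per
    training step. gamma = 0 gives a uniform average of all steps; larger
    gamma weights recent steps more heavily (shorter effective EMA).
    """
    ema = None
    for t, theta in enumerate(params_history, start=1):
        beta = (1.0 - 1.0 / t) ** (gamma + 1.0)  # retain factor at step t
        ema = theta if ema is None else beta * ema + (1.0 - beta) * theta
    return ema


# A short or absent EMA corresponds to large gamma (or using the raw weights).
traj = [1.0, 2.0, 3.0, 4.0, 5.0]
print(power_ema(traj, 0.0))   # uniform average of the trajectory -> 3.0
print(power_ema(traj, 16.0))  # short EMA: stays close to the latest value
```

Sweeping γ after training, as the paper does by approximating different EMA profiles, then amounts to evaluating enhancement metrics at several effective EMA lengths.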
💡 Research Summary
This paper investigates diffusion‑based speech enhancement (SE) through a Schrödinger bridge (SB) formulation and extends the recent EDM2 framework to the audio domain. The authors first derive a forward SB process in the short‑time Fourier transform (STFT) domain, treating real and imaginary parts as separate channels and assuming independent Gaussian clean speech and environmental noise. The forward marginal is a Gaussian with time‑dependent mean µₜ = wₓ(t)·x₀ + w_y(t)·y and variance σₜ², where wₓ(t) and w_y(t) are analytically defined weighting functions. Training minimizes a data‑prediction loss J_SB(θ) = E[‖x̂_θ(xₜ, t, y) − x₀‖²], the expected squared error between the network's clean‑speech estimate and the target x₀.
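The forward marginal and the data-prediction objective above can be sketched as follows. This is a toy illustration, not the paper's code: the schedule values `w_x`, `w_y`, and `sigma` are assumed to be given by the analytic SB weighting functions at time t, and the STFT tensors are stand-ins with real and imaginary parts stacked as channels:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_forward_marginal(x0, y, w_x, w_y, sigma):
    """Draw x_t ~ N(w_x(t) * x0 + w_y(t) * y, sigma(t)^2 I).

    x0: clean-speech STFT (real/imag as channels), y: noisy observation.
    w_x, w_y, sigma: scalar schedule values at time t (assumed given).
    """
    eps = rng.standard_normal(x0.shape)
    return w_x * x0 + w_y * y + sigma * eps

def data_prediction_loss(x0_hat, x0):
    """One-sample estimate of J_SB: mean squared error against clean speech."""
    return float(np.mean((x0_hat - x0) ** 2))


# Toy example: 2 channels (real/imag) x 4 frequency bins x 8 frames.
x0 = rng.standard_normal((2, 4, 8))
y = x0 + 0.3 * rng.standard_normal((2, 4, 8))  # noisy mixture
x_t = sample_forward_marginal(x0, y, w_x=0.7, w_y=0.3, sigma=0.1)
print(data_prediction_loss(x_t, x0))  # loss of an un-denoised sample
```

A trained network x̂_θ would replace `x_t` in the loss call; at σ = 0 the marginal collapses to the deterministic interpolation wₓ(t)·x₀ + w_y(t)·y between clean and noisy signals.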