ArrayDPS-Refine: Generative Refinement of Discriminative Multi-Channel Speech Enhancement
Zhongweiyang Xu⋆, Ashutosh Pandey†, Juan Azcarreta†, Zhaoheng Ni†, Sanjeel Parekh†, Buye Xu†
⋆University of Illinois Urbana-Champaign, †Reality Labs Research at Meta
(⋆Work performed during Z. Xu's Meta internship.)

ABSTRACT

Multi-channel speech enhancement aims to recover clean speech from noisy multi-channel recordings. Most deep learning methods employ discriminative training, which can lead to non-linear distortions from regression-based objectives, especially under challenging environmental noise conditions. Inspired by ArrayDPS for unsupervised multi-channel source separation, we introduce ArrayDPS-Refine, a method designed to enhance the outputs of discriminative models using a clean speech diffusion prior. ArrayDPS-Refine is training-free, generative, and array-agnostic. It first estimates the noise spatial covariance matrix (SCM) from the enhanced speech produced by a discriminative model, then uses this estimated noise SCM for diffusion posterior sampling. This approach allows direct refinement of any discriminative model's output without retraining. Our results show that ArrayDPS-Refine consistently improves the performance of various discriminative models, including state-of-the-art waveform and STFT domain models. Audio demos are provided at https://xzwy.github.io/ArrayDPSRefineDemo/.

Index Terms: Diffusion, Multi-channel Speech Enhancement

1. INTRODUCTION

Speech enhancement addresses the challenge of recovering clean speech from noisy reverberant mixture(s). Deep learning has made remarkable progress in single-channel speech enhancement [1], with notable developments in both discriminative and generative modeling approaches. Discriminative models are trained with a regression loss to directly map noisy features to clean speech features. While these models often achieve impressive gains in objective metrics, they frequently introduce distortions that degrade perceptual quality [2] and can negatively impact the performance of downstream systems, such as automatic speech recognition (ASR) [3]. These distortions are generally attributed to the application of non-linear neural network processing and to the inherently ill-posed nature of speech enhancement, particularly in low-SNR conditions where the speech signal is heavily masked by noise [3, 4, 5].

In recent years, diffusion-based generative models have become increasingly popular for improving the perceptual quality of single-channel speech enhancement [6, 7, 8]. While these generative models achieve superior perceptual quality, they still lag behind state-of-the-art (SOTA) discriminative models in terms of objective metrics and ASR performance. To address this gap, one line of research explores using generative models to improve or refine the output of discriminative models [9]. In a similar vein, StoRM [10] leverages the enhanced output of a discriminative model as a warm initialization for a diffusion-based enhancement model, achieving impressive results. Diffiner [9] proposes using the denoising diffusion restoration model (DDRM) [11] to further enhance any discriminative model. Despite the strong performance of these approaches, their application has so far been limited to single-channel speech enhancement.
Unlike single-channel speech enhancement, multi-channel speech enhancement uses microphone arrays to apply spatial filtering, allowing more effective separation of speech and noise [12]. Deep-learning-based discriminative models have achieved great success in exploiting spatial information, particularly those targeting phase enhancement with either waveform [13, 14, 15] or short-time Fourier transform (STFT) domain processing [16, 17]. Most multi-channel speech enhancement methods are designed for fixed microphone arrays, limiting their use across different devices and necessitating device-specific model development. To address this, array-agnostic architectures have been proposed that incorporate processing modules agnostic to the order and number of microphones, eliminating the need for model retraining for each array configuration [15, 14, 17].

Although discriminative multi-channel speech enhancement methods generally outperform their single-channel counterparts, they still suffer from nonlinear distortions introduced by deep neural networks, much like single-channel approaches [4]. To address these distortions, multi-channel systems can integrate DNNs with traditional spatial filters, such as the minimum variance distortionless response (MVDR) beamformer [12], which ensures distortionless output. However, this distortionless property often comes at the expense of increased residual noise, necessitating additional post-processing [12]. In response to these challenges, there is a growing trend toward generative-modeling-based speech enhancement [18, 19, 20]. Recently, ArrayDPS [21] has demonstrated unsupervised, array-agnostic, and generative multi-channel speech separation by leveraging a pre-trained speech diffusion prior. Not only does this approach achieve performance on par with SOTA discriminative models, but it also introduces a generic diffusion posterior sampling method [22] that can be applied to a wide range of multi-channel blind inverse problems [23].

Building on these advances, we introduce ArrayDPS-Refine, a training-free, generative, and array-agnostic refinement method designed to enhance any discriminative multi-channel speech enhancement model. For a given discriminative model, ArrayDPS-Refine first estimates the noise spatial covariance matrix (SCM) using the noisy mixture and the enhanced speech from the model. The estimated noise SCM is then used to provide likelihood guidance within a modified ArrayDPS framework.

Extensive evaluations show that ArrayDPS-Refine delivers significant improvements in intelligibility, quality, and word error rate (WER) metrics across a range of discriminative models, underscoring its effectiveness and versatility. Furthermore, the observed gains in WER, along with demo samples, demonstrate that ArrayDPS-Refine effectively reduces the distortions introduced by discriminative models. ArrayDPS-Refine's main contributions are: (1) a training-free, generative, array-agnostic method that refines any discriminative enhancement model without retraining; (2) a novel approach that uses the noise spatial covariance matrix for likelihood guidance in diffusion posterior sampling; and (3) the first multi-channel generative method that can outperform SOTA discriminative models in perceptual, intelligibility, and WER metrics.
2. BACKGROUND AND PROBLEM FORMULATION

In noisy reverberant scenarios with a single target speaker and $C$ microphones, let $X(\ell,k) \in \mathbb{C}$ denote the clean anechoic speech at frame $\ell \in [0, L-1]$ and frequency $k \in [0, K-1]$ in the short-time Fourier transform (STFT) domain. The $C$-channel noisy observations in the STFT domain are modeled as

$$Y(\ell,k) = H(\ell,k) *_\ell X(\ell,k) + N(\ell,k), \quad (1)$$

where $Y(\ell,k) \in \mathbb{C}^C$ denotes the vector of observed multi-channel noisy signals, $N(\ell,k) \in \mathbb{C}^C$ denotes additive noise, $H(\ell,k) \in \mathbb{C}^C$ represents the room acoustic transfer functions (ATF) from the source to the $C$ microphones, and $*_\ell$ denotes convolution across frames. Note that $H \in \mathbb{C}^{N_H \times K \times C}$ is a multi-frame filter with frame length $N_H$. In this paper, we assume $N(\ell,k)$ follows a zero-mean complex Gaussian distribution with spatial covariance $\Phi_{NN}(\ell,k) = \mathbb{E}[N(\ell,k)\, N(\ell,k)^H]$. The likelihood of the observations is then

$$p(Y(\ell,k) \mid H(\ell,k), X(\ell,k)) = \mathcal{CN}(H(\ell,k) *_\ell X(\ell,k),\; \Phi_{NN}(\ell,k)). \quad (2)$$

Thus, the log-likelihood can be calculated analytically with $\Phi_{NN}$:

$$\mathcal{L} = \sum_{\ell,k} \log p(Y(\ell,k) \mid H(\ell,k), X(\ell,k)). \quad (3)$$

2.1. Diffusion Model

In this paper, we follow the Denoising Diffusion Probabilistic Model (DDPM) [24, 25]. Given a data distribution $p_{\text{data}}(x_0)$, DDPM defines a forward diffusion process that gradually transforms $x_0$ into $x_1, x_2, \ldots, x_T$, where $T$ is the final diffusion step and $x_T \sim \mathcal{N}(0, I)$, following

$$q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{\alpha_t}\, x_{t-1},\; \beta_t I), \quad (4)$$

where $\{\beta_t\}_{t=1}^T$ is a pre-defined noise variance schedule with $\beta_t \in (0, 1)$. With $\alpha_t := 1 - \beta_t$ and $\bar\alpha_t := \prod_{s=1}^t \alpha_s$, it follows that

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \quad (5)$$

DDPM proposes to train a model to reverse the diffusion process from $x_T \sim \mathcal{N}(0, I)$ to $x_0$ step by step, which enables sampling from the data distribution $p_{\text{data}}(x_0)$. Each reversal step is approximated by sampling from $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(\mu_\theta(x_t, t),\; \sigma_t^2 I)$, where $\sigma_t^2$ can be derived to be $\sigma_t^2 = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\beta_t$, and $\mu_\theta(x_t, t)$ is

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right), \quad (6)$$

where a neural network $\epsilon_\theta$ is trained to predict the forward noise by minimizing $\mathbb{E}_{t,\, x_0 \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I)}\big[\|\epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,\; t)\|_2^2\big]$. The noise estimation function $\epsilon_\theta(x_t, t)$ can also be used to denoise $x_t$ to $x_0$ as an MMSE denoiser:

$$\mathbb{E}[x_0 \mid x_t] \simeq \hat{x}_0(x_t, t) = \frac{x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}. \quad (7)$$

Fundamentally equivalent to DDPM, score-based diffusion [26] defines the forward diffusion process as a stochastic differential equation (SDE), and then uses a neural network $s_\theta(x_t, t)$ to model the score function $\nabla_{x_t} \log p_t(x_t)$, which is used in a derived reverse SDE for sampling. From Tweedie's formula, $\mathbb{E}[x_0 \mid x_t] = \frac{1}{\sqrt{\bar\alpha_t}}\big(x_t + (1 - \bar\alpha_t)\,\nabla_{x_t}\log p_t(x_t)\big)$, and Eq. (7), we can also approximate the score function in DDPM by

$$\nabla_{x_t} \log p_t(x_t) \simeq s_\theta(x_t, t) = -\frac{1}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t). \quad (8)$$
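To make these DDPM quantities concrete, here is a minimal NumPy sketch of the forward noising step (Eq. 5), the one-step MMSE denoiser (Eq. 7), and the implied score (Eq. 8). The `eps_model` argument is a stand-in for the trained network $\epsilon_\theta$; the schedule values match those reported in Sec. 4.1, and the 0-indexed arrays are our own convention.

```python
import numpy as np

# Linear schedule from Sec. 4.1: beta_1 = 1e-4, beta_T = 0.02, T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # index t-1 holds beta_t
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_noise(x0, t, eps):
    """Eq. (5): sample x_t directly from x_0 (t is 1-indexed)."""
    return np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1 - alpha_bars[t - 1]) * eps

def tweedie_denoise(xt, t, eps_model):
    """Eq. (7): one-step MMSE estimate of x_0 from x_t."""
    eps_hat = eps_model(xt, t)
    return (xt - np.sqrt(1 - alpha_bars[t - 1]) * eps_hat) / np.sqrt(alpha_bars[t - 1])

def score_from_eps(xt, t, eps_model):
    """Eq. (8): score function implied by the noise predictor."""
    return -eps_model(xt, t) / np.sqrt(1 - alpha_bars[t - 1])
```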
2.2. Diffusion Posterior Sampling and ArrayDPS

Following Eq. (1), abbreviated as $Y = H *_\ell X + N$, assume the room acoustic transfer function $H$ is known and $N \sim \mathcal{CN}(0, \Phi_{NN})$. To recover $X$ from the measured $Y$, diffusion posterior sampling (DPS) [22] proposes to sample from $p(X \mid Y)$ using a pre-trained score-based diffusion model for clean speech $X$. Following Sec. 2.1, $s_\theta(X_t, t)$ is trained to approximate $\nabla_{X_t} \log p(X_t)$, which can be directly used to sample from $p_{\text{data}}(X)$ via the reverse diffusion process. To sample from $p(X \mid Y)$ instead, $\nabla_{X_t} \log p(X_t \mid Y)$ is needed for the reverse diffusion process, so DPS decomposes it using Bayes' theorem:

$$\nabla_{X_t} \log p(X_t \mid Y) = \nabla_{X_t} \log p(X_t) + \nabla_{X_t} \log p(Y \mid X_t). \quad (9)$$

In Eq. (9), $\nabla_{X_t} \log p(X_t)$ is approximated by the pre-trained diffusion score model $s_\theta(X_t, t)$, and DPS proposes to approximate the likelihood score $\nabla_{X_t} \log p(Y \mid X_t)$ using

$$\nabla_{X_t} \log p(Y \mid X_t) \simeq \nabla_{X_t} \log p(Y \mid \hat{X}_0(X_t, t), H), \quad (10)$$

$$\hat{X}_0(X_t, t) = \frac{1}{\sqrt{\bar\alpha_t}}\big(X_t + (1 - \bar\alpha_t)\, s_\theta(X_t, t)\big). \quad (11)$$

Eq. (11) uses Tweedie's formula as mentioned in Sec. 2.1, which denoises $X_t$ to $\hat{X}_0$. The denoised $\hat{X}_0$ is then used to calculate the likelihood $p(Y \mid \hat{X}_0, H)$, which can be computed directly using Eq. (2) because $H$ is assumed known in DPS. However, in real-world scenarios, the room acoustic transfer function $H$ is never known. In the next section we discuss how ArrayDPS solves this problem.

2.3. ArrayDPS and FCP

ArrayDPS [21] uses diffusion posterior sampling for multi-channel speech separation under weak white noise. Our problem formulation is thus the same as that of ArrayDPS with the number of speakers set to one, under a Gaussian noise assumption. To handle the unknown room acoustic transfer function, ArrayDPS uses Forward Convolutive Prediction (FCP) [27] to estimate $H$ for each speech source at each DPS step where $H$ is needed for likelihood calculation.

Intuitively, FCP [27] is an STFT-domain filter estimation algorithm that finds the filter such that the filtered input best matches the target. To estimate the room acoustic transfer functions, a source signal $X(\ell,k) \in \mathbb{C}$ is the FCP input, and the $C$-channel noisy mixture $Y(\ell,k) \in \mathbb{C}^C$ is the FCP target. We denote $Y_c(\ell,k) \in \mathbb{C}$ as the $c$-th channel noisy mixture and $H_c(\ell,k)$ as the $c$-th channel acoustic transfer function. Then $\text{FCP}(X, Y)$ estimates each channel's room acoustic transfer function by solving

$$\hat{H}_c = \arg\min_{H_c} \sum_{m,k} \frac{1}{\hat\lambda_{m,k}} \left| Y_c(m,k) - \sum_{n=0}^{N_H - 1} H_c(n,k)\, X(m-n,k) \right|^2, \quad (12)$$

$$\hat\lambda_{m,k} = \frac{1}{C} \sum_{c=1}^{C} |Y_c(m,k)|^2 + \epsilon \cdot \max_{m,k} \frac{1}{C} \sum_{c=1}^{C} |Y_c(m,k)|^2. \quad (13)$$

As shown above, FCP solves a weighted least-squares problem, so it has an analytical solution as in [27]. $N_H$ is the number of frames of the acoustic transfer function. The weight $\hat\lambda_{m,k}$ prevents the estimated filters from overfitting to high-energy STFT bins, and $\epsilon$ is a tunable hyperparameter that adjusts the weight. Thus, similar to Eq. (10), at each DPS step ArrayDPS approximates the likelihood score $\nabla_{X_t} \log p(Y \mid X_t)$ with

$$\nabla_{X_t} \log p(Y \mid X_t) \simeq \nabla_{X_t} \log p(Y \mid \hat{X}_0(X_t, t), \hat{H}), \quad (14)$$

$$\hat{X}_0(X_t, t) = \frac{1}{\sqrt{\bar\alpha_t}}\big(X_t + (1 - \bar\alpha_t)\, s_\theta(X_t, t)\big), \qquad \hat{H} = \text{FCP}(\hat{X}_0(X_t, t), Y). \quad (15)$$

Eq. (15) first denoises $X_t$ to $\hat{X}_0$, and then uses $\hat{X}_0$ to estimate $\hat{H}$ via FCP. $\hat{H}$ is then used for the likelihood calculation in Eq. (14). A NumPy sketch of the FCP solve appears after Fig. 1.

[Fig. 1: ArrayDPS-Refine pipeline. A multi-channel noisy reverberant mixture is first processed by a discriminative model; the reference-channel enhanced speech feeds a noise spatial covariance matrix estimator (for the observation likelihood) and initializes the diffusion, which produces the refined speech.]
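As a concrete illustration of Eqs. (12)-(13), the following is a minimal NumPy sketch of the per-frequency weighted least-squares solve behind FCP. The array shapes and the small ridge term added for numerical stability are our assumptions, not details taken from [27].

```python
import numpy as np

def fcp(X, Y, n_taps=13, eps=1e-3):
    """Forward Convolutive Prediction (Eqs. 12-13), solved per frequency bin.

    X: (L, K) complex source estimate; Y: (L, K, C) multi-channel mixture.
    Returns H: (n_taps, K, C), the per-channel transfer-function filters.
    """
    L, K, C = Y.shape
    # Eq. (13): weights from the channel-averaged mixture power plus a floor.
    pow_avg = np.mean(np.abs(Y) ** 2, axis=-1)            # (L, K)
    lam = pow_avg + eps * pow_avg.max()
    H = np.zeros((n_taps, K, C), dtype=complex)
    for k in range(K):
        # Design matrix A[m, n] = X(m - n, k): delayed copies of the source.
        A = np.zeros((L, n_taps), dtype=complex)
        for n in range(n_taps):
            A[n:, n] = X[:L - n, k]
        w = 1.0 / lam[:, k]
        AhW = A.conj().T * w                               # A^H diag(w)
        G = AhW @ A + 1e-10 * np.eye(n_taps)               # small ridge: our addition
        H[:, k, :] = np.linalg.solve(G, AhW @ Y[:, k, :])  # weighted normal equations
    return H
```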
Algorithm 1: ArrayDPS-Refine

Require: $Y$, $\hat\Phi_{NN}$, $\tilde{X}$, $T'$, $\xi$, $\epsilon_\theta(\cdot)$, $\{\sigma_t^2\}_{t=1}^{T'}$, $\{\alpha_t\}_{t=1}^{T'}$, $\{\bar\alpha_t\}_{t=1}^{T'}$
1: $\tilde{X}' \leftarrow |\tilde{X}|^{0.5} \exp(j\angle\tilde{X})$ ▷ transform to compressive domain
2: Sample $\epsilon \sim \mathcal{CN}(0, 2I)$ ▷ $\mathcal{N}(0, I)$ for both real and imaginary parts
3: $X'_{T'} \leftarrow \sqrt{\bar\alpha_{T'}}\,\tilde{X}' + \sqrt{1 - \bar\alpha_{T'}}\,\epsilon$ ▷ initialize to step $T'$
4: for $t = T', \ldots, 1$ do
5:   $\hat\epsilon \leftarrow \epsilon_\theta(X'_t, t)$ ▷ diffusion model estimates the noise
6:   Sample $Z \sim \mathcal{CN}(0, 2I)$ ▷ $\mathcal{N}(0, I)$ for both real and imaginary parts
7:   $X'_{t-1} \leftarrow \frac{1}{\sqrt{\alpha_t}}\left(X'_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\hat\epsilon\right) + \sigma_t Z$ ▷ prior step
8:   $\hat{X}'_0 \leftarrow \frac{1}{\sqrt{\bar\alpha_t}}\left(X'_t - \sqrt{1 - \bar\alpha_t}\,\hat\epsilon\right)$ ▷ one-step denoising
9:   $\hat{X}_0 \leftarrow |\hat{X}'_0|^2 \exp(j\angle\hat{X}'_0)$ ▷ transform to STFT domain
10:  $\hat{H} \leftarrow \text{FCP}(\hat{X}_0, Y)$ ▷ room ATF estimation
11:  $\hat{X}_{\text{reverb}} \leftarrow \hat{H} *_\ell \hat{X}_0$ ▷ estimate multi-channel reverberant speech
12:  $\hat{N} \leftarrow Y - \hat{X}_{\text{reverb}}$ ▷ estimate multi-channel noise
13:  $G \leftarrow \nabla_{X'_t}\left[-\frac{1}{2}\sum_{\ell,k} \hat{N}(\ell,k)^H\, \hat\Phi_{NN}^{-1}(\ell,k)\, \hat{N}(\ell,k)\right]$ ▷ likelihood score
14:  $X'_{t-1} \leftarrow X'_{t-1} + \xi\, \frac{1 - \alpha_t}{\sqrt{\alpha_t}}\, G$ ▷ likelihood step
15: end for
16: $X_0 \leftarrow |X'_0|^2 \exp(j\angle X'_0)$ ▷ transform to STFT domain
17: $H_{\text{align}} \leftarrow \text{FCP}(X_0, \tilde{X})$ ▷ single-frame filter estimation for alignment
18: return $H_{\text{align}} *_\ell X_0$ ▷ return aligned signal

3. METHOD

We propose an ArrayDPS-based refiner for multi-channel speech enhancement; the overall pipeline is shown in Fig. 1. A discriminative enhancement model $f_\phi(\cdot)$ first processes a multi-channel mixture $Y \in \mathbb{C}^{L \times K \times C}$ (with $L$ time frames and $K$ STFT frequency bins), yielding $\tilde{X} = f_\phi(Y) \in \mathbb{C}^{L \times K}$, the reference-channel anechoic clean speech estimate. Then $\tilde{X}$ and $Y$ are used to estimate the noise spatial covariance matrix (SCM) $\Phi_{NN}(\ell,k) \in \mathbb{C}^{C \times C}$, $\ell \in [0, L-1]$, $k \in [0, K-1]$. With the noise SCM estimated, the likelihood in Eq. (2) can be calculated, which allows us to refine $\tilde{X}$ using ArrayDPS-Refine in Algorithm 1.

3.1. Spatial Covariance Matrix Estimation

With the reference-channel clean speech estimate $\tilde{X}$ produced by a discriminative model, we first estimate the room acoustic transfer function using FCP: $\tilde{H} = \text{FCP}(\tilde{X}, Y)$. This allows us to estimate the all-channel reverberant clean speech $\tilde{X}_{\text{reverb}} = \tilde{H} *_\ell \tilde{X}$, where $\tilde{X}_{\text{reverb}} \in \mathbb{C}^{L \times K \times C}$. The multi-channel noise can then be estimated as $\tilde{N} = Y - \tilde{X}_{\text{reverb}}$, which in turn is used to estimate the noise SCM $\Phi_{NN}$:

$$\hat\Phi_{NN}(\ell,k) = \alpha\, \hat\Phi_{NN}(\ell-1, k) + (1 - \alpha)\, \tilde{N}(\ell,k)\, \tilde{N}^H(\ell,k). \quad (16)$$

As in Eq. (16), we estimate the noise SCM with a recursive exponential moving average with smoothing coefficient $\alpha$. The estimated noise SCM is then used for the likelihood calculation in ArrayDPS-Refine.
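Below is a minimal NumPy sketch of this Sec. 3.1 procedure: FCP-based ATF estimation, reverberant-speech reconstruction via $\tilde{H} *_\ell \tilde{X}$, the noise residual, and the recursive averaging of Eq. (16). It reuses the hypothetical `fcp` helper from the earlier sketch; the initialization of the recursion at the first frame is our assumption.

```python
import numpy as np

def apply_filter(H, X):
    """Apply per-frequency convolutive filters along frames (H *_l X).

    H: (n_taps, K, C), X: (L, K) -> output (L, K, C).
    """
    L, K = X.shape
    n_taps, _, C = H.shape
    out = np.zeros((L, K, C), dtype=complex)
    for n in range(n_taps):
        out[n:] += H[n][None] * X[:L - n, :, None]   # out(m) += H(n) X(m - n)
    return out

def estimate_noise_scm(Y, X_tilde, alpha=0.95, n_taps=13):
    """Sec. 3.1: noise SCM from the discriminative estimate X_tilde.

    Y: (L, K, C) mixture; X_tilde: (L, K). Returns Phi: (L, K, C, C).
    """
    H = fcp(X_tilde, Y, n_taps=n_taps)               # ATF estimate via FCP (Sec. 2.3)
    N = Y - apply_filter(H, X_tilde)                 # multi-channel noise estimate
    outer = np.einsum('lkc,lkd->lkcd', N, N.conj())  # N N^H per TF bin
    Phi = np.empty_like(outer)
    Phi[0] = outer[0]                                # first-frame init: our assumption
    for l in range(1, Y.shape[0]):                   # Eq. (16): recursive EMA
        Phi[l] = alpha * Phi[l - 1] + (1 - alpha) * outer[l]
    return Phi
```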
3.2. ArrayDPS-Refine

The ArrayDPS-Refine algorithm is shown in Algorithm 1; it uses a pre-trained diffusion model to refine the discriminative model's enhanced speech $\tilde{X}$. Different from ArrayDPS, we train the prior diffusion model on the compressive-domain STFT of clean speech: given any clean speech STFT $X$ from a dataset, the compressive STFT is $X' = |X|^{0.5} \exp(j\angle X)$. Previous work has shown the effectiveness of magnitude compression for audio diffusion models [28].

Following Algorithm 1, ArrayDPS-Refine first transforms $\tilde{X}$ to the compressive domain $\tilde{X}'$ for diffusion initialization (line 1). Then, in lines 2-3, $X'_{T'}$ is initialized from $\tilde{X}'$ at diffusion time step $T' \in [1, T]$ (similar to StoRM [10], we start from an intermediate diffusion step $T'$ initialized from $\tilde{X}'$). In line 2, the noise variance is set to 2 because the diffusion takes place in the real and imaginary components of the compressive STFT, while the algorithm is written in the complex domain. Lines 4-14 form the diffusion posterior sampling process. Lines 5-7 use the pre-trained prior diffusion model for one reversal step. Line 8 then obtains the one-step denoised compressive STFT $\hat{X}'_0$, which is transformed to the STFT $\hat{X}_0$ in line 9. Just as ArrayDPS estimates $H$ for likelihood estimation, $\hat{X}_0$ is used to estimate the room ATF $\hat{H}$ in line 10. Using the likelihood model in Eq. (2) and the estimated noise SCM $\hat\Phi_{NN}$, lines 11-13 calculate $G$, the likelihood score approximation, similar to Eq. (14) for ArrayDPS. Line 14 then applies a likelihood score update, with a scalar $\xi$ controlling the likelihood guidance.

With the compressive-domain clean $X_0$ sampled, line 16 transforms it back to the STFT domain. One problem, however, is that no constraint in the algorithm ensures that the sampled $X_0$ aligns with the clean signal in the reference channel. To solve this, we use another FCP with $N_H = 1$ in line 17 to estimate a single-frame filter that aligns $X_0$ with $\tilde{X}$. Since the discriminative model is trained to output the reference-channel clean signal, $\tilde{X}$ is fully aligned. Finally, line 18 outputs the aligned version of $X_0$.
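For orientation, here is a schematic PyTorch sketch of the Algorithm 1 loop. It is a sketch under assumptions, not the authors' implementation: `eps_model`, `fcp_fn`, and `apply_fn` are stand-ins for the trained diffusion network, a differentiable FCP solver, and the $\hat{H} *_\ell \hat{X}_0$ convolution, and PyTorch's complex-autograd convention is glossed over in the likelihood step.

```python
import torch

def compress(X):      # line 1: |X|^0.5 exp(j∠X)
    return torch.abs(X).sqrt() * torch.exp(1j * torch.angle(X))

def decompress(Xc):   # lines 9/16: |X'|^2 exp(j∠X')
    return torch.abs(Xc) ** 2 * torch.exp(1j * torch.angle(Xc))

def refine(Y, Phi_inv, X_tilde, eps_model, fcp_fn, apply_fn,
           alphas, alpha_bars, sigmas, T_prime=300, xi=0.4):
    """Schematic sketch of Algorithm 1 (1-indexed DDPM constant tensors)."""
    Xc = compress(X_tilde)
    eps = torch.randn_like(Xc.real) + 1j * torch.randn_like(Xc.real)  # CN(0, 2I)
    Xt = alpha_bars[T_prime].sqrt() * Xc + (1 - alpha_bars[T_prime]).sqrt() * eps
    for t in range(T_prime, 0, -1):
        Xt = Xt.detach().requires_grad_(True)
        eps_hat = eps_model(Xt, t)
        z = torch.randn_like(Xt.real) + 1j * torch.randn_like(Xt.real)
        # Line 7: ancestral DDPM prior step.
        X_prev = (Xt - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps_hat) \
                 / alphas[t].sqrt() + sigmas[t] * z
        # Lines 8-13: one-step denoising, ATF/noise estimates, likelihood score.
        X0_hat = decompress((Xt - (1 - alpha_bars[t]).sqrt() * eps_hat)
                            / alpha_bars[t].sqrt())
        N = Y - apply_fn(fcp_fn(X0_hat, Y), X0_hat)
        loglik = -0.5 * torch.einsum('lkc,lkcd,lkd->', N.conj(), Phi_inv, N).real
        G = torch.autograd.grad(loglik, Xt)[0]
        # Line 14: likelihood-guided update with guidance scale xi.
        Xt = (X_prev + xi * (1 - alphas[t]) / alphas[t].sqrt() * G).detach()
    X0 = decompress(Xt)
    H_align = fcp_fn(X0, X_tilde)       # lines 17-18: N_H = 1 alignment filter
    return apply_fn(H_align, X0)
```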
4. EXPERIMENTS AND RESULTS

4.1. ArrayDPS-Refine Configurations

As discussed in Sec. 2.1, we follow the DDPM diffusion process [24], with the diffusion noise schedule $\beta_t$ increasing linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$, and $T = 1000$ steps. For the diffusion architecture, we use the same 2-D U-Net as Diffiner [9], which accommodates the real and imaginary channels of the STFT; in our case, the model input is the real and imaginary channels of the compressive STFT, as described in Sec. 3.2. We use an FFT size of 512, a hop size of 128, and a square-root Hann window. For diffusion model training, we train on 4-second 16 kHz clean speech segments (waveform normalized to $[-1, 1]$) from about 220 hours of clean speech from DNS-Challenge [29], using the Adam optimizer with learning rate $10^{-4}$ and batch size 64. The model is trained for $2.5 \times 10^6$ steps on 8 H100 GPUs. For inference, we use the exponential moving average (EMA) of the model weights with a decay of 0.9999.

For noise SCM estimation, we set the smoothing factor $\alpha$ in Eq. (16) to 0.95. For ArrayDPS-Refine in Algorithm 1 and Sec. 3.2, we set the starting diffusion step $T'$ to 300. The likelihood score guidance $\xi$ controls the trade-off between better quality (stronger prior) and better mixture consistency (stronger likelihood), so we evaluate results for $\xi \in \{0.4, 0.6, 0.8, 1.0, 1.2\}$. For the FCP in line 10 of Algorithm 1, we set the filter frame length $N_H$ to 13 and $\epsilon$ to $10^{-3}$, following the configuration in ArrayDPS [21]. For the alignment FCP in line 17 of Algorithm 1, $N_H$ is set to 1 for single-frame alignment.

Table 1. ArrayDPS-Refine evaluation on ad-hoc microphone arrays (4-channel and 8-channel).

4-channel:

| Row | Method | ξ | STOI | eSTOI | PESQ (NB/WB) | SI-SDR | WER (%) | DNSMOS | UTMOSv2 |
|-----|--------|---|------|-------|--------------|--------|---------|--------|---------|
| A0 | Noisy | - | 0.631 | 0.361 | 1.40/1.07 | -6.3 | 118 | 1.48 | 1.82 |
| A1 | TADRN [14] | - | 0.896 | 0.783 | 2.86/2.05 | 8.9 | 58.4 | 2.84 | 2.55 |
| A2 | TADRN+MCWF | - | 0.785 | 0.562 | 2.01/1.23 | 4.6 | 61.9 | 1.55 | 1.74 |
| A3 | Refined TADRN (ours) | 0.4 | 0.910 | 0.812 | 2.99/2.23 | 10.0 | 37.0 | 2.91 | 2.98 |
| A4 | Refined TADRN (ours) | 0.6 | 0.913 | 0.818 | 3.00/2.23 | 10.2 | 38.9 | 2.90 | 2.95 |
| A5 | Refined TADRN (ours) | 0.8 | 0.915 | 0.820 | 2.99/2.24 | 10.3 | 37.0 | 2.89 | 2.92 |
| A6 | Refined TADRN (ours) | 1.0 | 0.915 | 0.820 | 2.97/2.21 | 10.3 | 32.8 | 2.88 | 2.88 |
| A7 | Refined TADRN (ours) | 1.2 | 0.915 | 0.820 | 2.95/2.18 | 10.3 | 34.9 | 2.86 | 2.83 |
| B1 | FaSNet-TAC [15] | - | 0.839 | 0.677 | 2.52/1.65 | 5.4 | 68.9 | 2.57 | 1.82 |
| B2 | FaSNet-TAC+MCWF | - | 0.774 | 0.548 | 2.01/1.24 | 3.3 | 66.0 | 1.54 | 1.69 |
| B3 | Refined FaSNet-TAC (ours) | 0.4 | 0.867 | 0.736 | 2.73/1.88 | 6.7 | 56.8 | 2.77 | 2.59 |
| B4 | Refined FaSNet-TAC (ours) | 0.6 | 0.870 | 0.741 | 2.73/1.88 | 6.8 | 53.3 | 2.73 | 2.53 |
| B5 | Refined FaSNet-TAC (ours) | 0.8 | 0.871 | 0.742 | 2.71/1.86 | 6.9 | 59.3 | 2.71 | 2.48 |
| B6 | Refined FaSNet-TAC (ours) | 1.0 | 0.871 | 0.740 | 2.68/1.84 | 6.8 | 47.0 | 2.67 | 2.40 |
| B7 | Refined FaSNet-TAC (ours) | 1.2 | 0.870 | 0.737 | 2.65/1.81 | 6.8 | 51.2 | 2.65 | 2.32 |
| C1 | USES2 [17] | - | 0.923 | 0.833 | 3.13/2.42 | 6.0 | 36.6 | 2.85 | 2.90 |
| C2 | USES2+MCWF | - | 0.789 | 0.567 | 2.01/1.24 | 2.8 | 59.4 | 1.55 | 1.73 |
| C3 | Refined USES2 (ours) | 0.4 | 0.922 | 0.835 | 3.11/2.41 | 6.7 | 32.7 | 2.88 | 3.07 |
| C4 | Refined USES2 (ours) | 0.6 | 0.924 | 0.837 | 3.10/2.40 | 6.8 | 31.1 | 2.87 | 3.02 |
| C5 | Refined USES2 (ours) | 0.8 | 0.923 | 0.838 | 3.07/2.37 | 6.8 | 33.5 | 2.85 | 2.97 |
| C6 | Refined USES2 (ours) | 1.0 | 0.922 | 0.835 | 3.03/2.33 | 6.8 | 29.2 | 2.83 | 2.90 |
| C7 | Refined USES2 (ours) | 1.2 | 0.920 | 0.831 | 2.98/2.26 | 6.7 | 28.2 | 2.80 | 2.84 |

8-channel:

| Row | Method | ξ | STOI | eSTOI | PESQ (NB/WB) | SI-SDR | WER (%) | DNSMOS | UTMOSv2 |
|-----|--------|---|------|-------|--------------|--------|---------|--------|---------|
| A0 | Noisy | - | 0.631 | 0.361 | 1.40/1.07 | -6.3 | 118 | 1.48 | 1.82 |
| A1 | TADRN [14] | - | 0.913 | 0.814 | 2.98/2.21 | 9.9 | 46.9 | 2.85 | 2.68 |
| A2 | TADRN+MCWF | - | 0.845 | 0.655 | 2.17/1.32 | 6.8 | 47.0 | 1.66 | 1.88 |
| A3 | Refined TADRN (ours) | 0.4 | 0.928 | 0.844 | 3.12/2.43 | 11.1 | 32.0 | 2.91 | 3.02 |
| A4 | Refined TADRN (ours) | 0.6 | 0.931 | 0.849 | 3.14/2.44 | 11.3 | 31.1 | 2.91 | 3.02 |
| A5 | Refined TADRN (ours) | 0.8 | 0.933 | 0.853 | 3.14/2.44 | 11.4 | 26.7 | 2.90 | 2.98 |
| A6 | Refined TADRN (ours) | 1.0 | 0.933 | 0.853 | 3.13/2.44 | 11.5 | 26.1 | 2.88 | 2.94 |
| A7 | Refined TADRN (ours) | 1.2 | 0.930 | 0.840 | 3.10/2.41 | 11.3 | 27.8 | 2.86 | 2.87 |
| B1 | FaSNet-TAC [15] | - | 0.856 | 0.708 | 2.61/1.74 | 6.2 | 58.4 | 2.69 | 1.91 |
| B2 | FaSNet-TAC+MCWF | - | 0.830 | 0.632 | 2.16/1.32 | 5.2 | 53.0 | 1.64 | 1.83 |
| B3 | Refined FaSNet-TAC (ours) | 0.4 | 0.891 | 0.775 | 2.86/2.04 | 7.9 | 43.6 | 2.77 | 2.68 |
| B4 | Refined FaSNet-TAC (ours) | 0.6 | 0.894 | 0.781 | 2.87/2.05 | 8.0 | 44.4 | 2.75 | 2.61 |
| B5 | Refined FaSNet-TAC (ours) | 0.8 | 0.895 | 0.782 | 2.86/2.05 | 8.0 | 39.6 | 2.73 | 2.56 |
| B6 | Refined FaSNet-TAC (ours) | 1.0 | 0.896 | 0.781 | 2.83/2.03 | 8.0 | 41.6 | 2.71 | 2.49 |
| B7 | Refined FaSNet-TAC (ours) | 1.2 | 0.896 | 0.780 | 2.81/2.01 | 8.0 | 41.4 | 2.69 | 2.42 |
| C1 | USES2 [17] | - | 0.936 | 0.858 | 3.25/2.59 | 5.7 | 30.5 | 2.86 | 3.00 |
| C2 | USES2+MCWF | - | 0.849 | 0.660 | 2.19/1.33 | 3.8 | 53.5 | 1.66 | 1.89 |
| C3 | Refined USES2 (ours) | 0.4 | 0.938 | 0.863 | 3.26/2.63 | 6.2 | 24.2 | 2.89 | 3.10 |
| C4 | Refined USES2 (ours) | 0.6 | 0.940 | 0.867 | 3.25/2.62 | 6.4 | 24.0 | 2.86 | 3.08 |
| C5 | Refined USES2 (ours) | 0.8 | 0.940 | 0.868 | 3.23/2.60 | 6.4 | 22.2 | 2.85 | 3.03 |
| C6 | Refined USES2 (ours) | 1.0 | 0.941 | 0.868 | 3.21/2.57 | 6.4 | 22.0 | 2.84 | 3.01 |
| C7 | Refined USES2 (ours) | 1.2 | 0.940 | 0.866 | 3.17/2.53 | 6.3 | 22.7 | 2.81 | 2.96 |

4.2. Discriminative Baselines and Dataset

As we focus on array-agnostic settings, we use three array-agnostic discriminative models as the baselines that our method refines: FaSNet-TAC [15], TADRN [14], and USES2 [17]. FaSNet-TAC is a time-domain model using transform-and-concatenate (TAC) for array-agnostic multi-channel modeling [15]; we use the official implementation and configuration (https://github.com/yluo42/TAC/blob/master/FaSNet.py). TADRN is a state-of-the-art time-domain enhancement model that uses a triple-path network for frame-wise, chunk-wise, and channel-wise modeling; we follow the same MIMO configuration as in [14]. USES2 [17] is a SOTA array-agnostic STFT-domain model; we again follow the official implementation and configuration (https://github.com/espnet/espnet; USES2-Comp as in [17]), with FFT and hop sizes of 512 and 256, respectively, and a square-root Hann window.
For each enhancement model, we also use the enhanced speech to estimate a low-distortion single-frame time-invariant multi-channel Wiener filter (MCWF) [30] as a baseline. The MCWF uses an FFT size of 256 and a hop size of 64.

To train these array-agnostic discriminative models, we create ad-hoc microphone array datasets. For each sample in the dataset, we randomly draw a shoe-box room whose three dimensions range from 3 × 3 × 2 m to 10 × 10 × 5 m. The absorption coefficient is uniformly sampled from 0.3 to 0.7 ($T_{60} \in [0.13, 0.55]$ s). The positions of 8 microphones are randomly sampled inside a sphere of radius 0.1 m, forming an 8-channel ad-hoc microphone array. The number of interfering speakers is randomly sampled from 8-16, simulating babble noise, and the number of noise sources is sampled from 1-50, simulating a diverse spatial noise field. The locations of all sources and of the microphone array center are randomly sampled inside the room. The signal-to-noise ratio and signal-to-interference ratio are set to $[-10, 5]$ dB and $[5, 10]$ dB, respectively. All speech and noise sources are drawn from the DNS-Challenge dataset [29]. We use Pyroomacoustics [31]'s image method [32] (order 6) to simulate the noisy mixtures (see the sketch at the end of this subsection). We simulate 80,000 10-second samples for training, 1,000 4-second samples for validation, and 1,000 4-second samples for testing.

During training, the number of channels in each batch is randomly sampled from 2 to 8, allowing training on a variable number of channels; the models are trained on randomly chunked 4-second segments. A phase-constrained magnitude (PCM) loss [14] is used, with the anechoic clean speech as the training target. We use the Adam optimizer with a learning rate of $10^{-4}$. For FaSNet-TAC and TADRN, we train for 80 epochs with a batch size of 16; for USES2, we train for 40 epochs with a batch size of 8.
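To make the simulation setup concrete, here is a minimal Pyroomacoustics [31] sketch of generating one noisy mixture under the sampling ranges above. The source signals are white-noise placeholders (the paper draws speech and noise from DNS-Challenge [29]), and SNR/SIR scaling plus the 8-16 interfering speakers are omitted for brevity.

```python
import numpy as np
import pyroomacoustics as pra

rng = np.random.default_rng(0)
fs = 16000

# Room geometry and absorption sampled as described above.
dims = rng.uniform([3.0, 3.0, 2.0], [10.0, 10.0, 5.0])
room = pra.ShoeBox(dims, fs=fs,
                   materials=pra.Material(rng.uniform(0.3, 0.7)),
                   max_order=6)                         # image method, order 6

# 8-mic ad-hoc array: positions inside a 0.1 m sphere around a random center.
center = rng.uniform([0.5, 0.5, 0.5], dims - 0.5)
d = rng.normal(size=(3, 8))
d /= np.linalg.norm(d, axis=0)
r = 0.1 * rng.uniform(size=8) ** (1 / 3)                # uniform inside the ball
room.add_microphone_array(pra.MicrophoneArray(center[:, None] + d * r, fs))

# Target speech plus 1-50 noise sources at random positions in the room.
room.add_source(rng.uniform([0.5, 0.5, 0.5], dims - 0.5),
                signal=rng.standard_normal(4 * fs))     # placeholder speech
for _ in range(rng.integers(1, 51)):
    room.add_source(rng.uniform([0.5, 0.5, 0.5], dims - 0.5),
                    signal=rng.standard_normal(4 * fs)) # placeholder noise

room.simulate()
mixture = room.mic_array.signals                        # (8, num_samples)
```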
4.3. Results and Analysis

We evaluate all the discriminative baselines and ArrayDPS-Refine on both the 4-channel and 8-channel versions of the test dataset. For evaluation metrics, we use STOI [33], eSTOI [34], and word error rate (WER) to evaluate speech intelligibility; for WER, we use a Whisper [35] base model for speech recognition. We use SI-SDR [36] to evaluate sample-level consistency, and PESQ [37], DNSMOS [38] (overall score), and UTMOSv2 [39] to evaluate speech perceptual quality. Note that both DNSMOS and UTMOSv2 are non-intrusive. The results are shown in Table 1.

For TADRN's results in rows A1-A7, refinement greatly improves TADRN in all metrics. Specifically, in row A3, Refined TADRN improves over TADRN by about 0.03 in eSTOI, 0.2 in wide-band PESQ, 1 dB in SI-SDR, and 0.4 in UTMOSv2 in both the 4-channel and 8-channel cases. Refined TADRN also decreases the WER by more than 10 percentage points in the 4-channel case and about 15 percentage points in the 8-channel case, a remarkable improvement in speech intelligibility.

For FaSNet-TAC's results in rows B1-B7, the improvement from refinement is even larger. In row B3, with ξ = 0.4, Refined FaSNet-TAC improves over FaSNet-TAC by more than 0.06 in eSTOI, 0.2 in wide-band PESQ, 1.3 dB in SI-SDR, 12 percentage points in WER, and 0.7 in UTMOSv2, in both the 4-channel and 8-channel cases.

For USES2's results in rows C1-C7, USES2 achieves superior enhancement performance even compared with TADRN (better in all metrics except SI-SDR). In row C3, with ξ = 0.4, ArrayDPS-Refine still consistently improves over USES2. In the 4-channel case, Refined USES2 achieves performance similar to USES2 in STOI, eSTOI, PESQ, and DNSMOS, but improves over USES2 by about 0.8 dB in SI-SDR, 4 percentage points in WER, and 0.17 in UTMOSv2. In the 8-channel case, Refined USES2 improves over USES2 by 0.5 dB in SI-SDR, 6.3 percentage points in WER, and about 0.1 in UTMOSv2, without degradation in any metric.

One interesting observation is that ξ serves as a knob to balance perceptual quality against intelligibility: as ξ increases from 0.4 to 1.2, we generally observe improvements in intelligibility-based metrics such as STOI, eSTOI, and WER, alongside degradations in perceptual metrics such as PESQ, DNSMOS, and UTMOSv2.

5. CONCLUSION

We proposed ArrayDPS-Refine, a training-free, generative, and array-agnostic method to improve any discriminative model for multi-channel speech enhancement. We use the discriminative model's output to estimate the noise spatial covariance matrix, and then apply ArrayDPS with a pre-trained speech diffusion model. ArrayDPS-Refine shows consistent improvements over multiple discriminative models, including SOTA models, in perceptual, intelligibility, and WER metrics.

6. REFERENCES

[1] C. Zheng, H. Zhang, W. Liu, X. Luo, A. Li, X. Li, and B. C. Moore, "Sixty years of frequency-domain monaural speech enhancement: From traditional to deep learning methods," Trends in Hearing, vol. 27, p. 23312165231209913, 2023.
[2] A. R. Avila, M. J. Alam, D. O'Shaughnessy, and T. Falk, "Investigating speech enhancement and perceptual quality for speech emotion recognition," in INTERSPEECH, 2018, pp. 3663-3667.
[3] T. Ochiai, K. Iwamoto, M. Delcroix, R. Ikeshita, H. Sato, S. Araki, and S. Katagiri, "Rethinking processing distortions: Disentangling the impact of speech enhancement errors on speech recognition performance," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3589-3602, 2024.
[4] V. Tourbabin, P. Guiraud, S. Hafezi, P. A. Naylor, A. H. Moore, J. Donley, and T. Lunner, "The SPEAR challenge - review of results," 2023.
[5] R. Mira, B. Xu, J. Donley, A. Kumar, S. Petridis, V. K. Ithapu, and M. Pantic, "LA-VocE: Low-SNR audio-visual speech enhancement using neural vocoders," in ICASSP, 2023, pp. 1-5.
[6] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, "Speech enhancement and dereverberation with diffusion-based generative models," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351-2364, 2023.
[7] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, "Conditional diffusion probabilistic model for speech enhancement," in ICASSP, 2022, pp. 7402-7406.
[8] M. Yang, C. Zhang, Y. Xu, Z. Xu, H. Wang, B. Raj, and D. Yu, "uSee: Unified speech enhancement and editing with conditional diffusion models," in ICASSP, 2024, pp. 7125-7129.
[9] R. Sawata, N. Murata, Y. Takida, T. Uesaka, T. Shibuya, S. Takahashi, and Y. Mitsufuji, "Diffiner: A versatile diffusion-based generative refiner for speech enhancement," in INTERSPEECH, 2023, pp. 3824-3828.
[10] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, "StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724-2737, 2023.
[11] B. Kawar, M. Elad, S. Ermon, and J. Song, "Denoising diffusion restoration models," Advances in Neural Information Processing Systems, vol. 35, pp. 23593-23606, 2022.
[12] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, "A consolidated perspective on multimicrophone speech enhancement and source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 692-730, 2017.
[13] A. Pandey, B. Xu, A. Kumar, J. Donley, P. Calamia, and D. Wang, "TPARN: Triple-path attentive recurrent network for time-domain multichannel speech enhancement," in ICASSP, 2022, pp. 6497-6501.
[14] A. Pandey, B. Xu, A. Kumar, J. Donley, P. Calamia, and D. Wang, "Time-domain ad-hoc array speech enhancement using a triple-path network," in INTERSPEECH, 2022, pp. 729-733.
[15] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, "End-to-end microphone permutation and number invariant multi-channel speech separation," in ICASSP, 2020, pp. 6394-6398.
[16] W. Zhang, K. Saijo, Z.-Q. Wang, S. Watanabe, and Y. Qian, "Toward universal speech enhancement for diverse input conditions," in ASRU, 2023, pp. 1-6.
[17] W. Zhang, J.-w. Jung, and Y. Qian, "Improving design of input condition invariant speech enhancement," in ICASSP, 2024, pp. 10696-10700.
[18] S. Dowerah, R. Serizel, D. Jouvet, M. Mohammadamini, and D. Matrouf, "Joint optimization of diffusion probabilistic-based multichannel speech enhancement with far-field speaker verification," in Spoken Language Technology Workshop, 2023, pp. 428-435.
[19] S. Dowerah, A. Kulkarni, R. Serizel, and D. Jouvet, "Self-supervised learning with diffusion-based multichannel speech enhancement for speaker verification under noisy conditions," in INTERSPEECH, 2023, pp. 3849-3853.
[20] F. Chen, W. Lin, C. Sun, and Q. Guo, "A two-stage beamforming and diffusion-based refiner system for 3D speech enhancement," Circuits, Systems, and Signal Processing, vol. 43, no. 7, pp. 4369-4389, 2024.
[21] Z. Xu, X. Fan, Z.-Q. Wang, X. Jiang, and R. R. Choudhury, "ArrayDPS: Unsupervised blind speech separation with a diffusion prior," in ICML, 2025.
[22] H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye, "Diffusion posterior sampling for general noisy inverse problems," in ICLR, 2023.
[23] Y. Wu, Z. Xu, J. Chen, Z.-Q. Wang, and R. R. Choudhury, "Unsupervised multi-channel speech dereverberation via diffusion," 2025.
[24] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 6840-6851.
[25] A. Q. Nichol and P. Dhariwal, "Improved denoising diffusion probabilistic models," in ICML, 2021, pp. 8162-8171.
[26] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," in ICLR, 2021.
[27] Z.-Q. Wang, G. Wichern, and J. L. Roux, "Convolutive prediction for monaural speech dereverberation and noisy-reverberant speaker separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3476-3490, 2021.
[28] G. Zhu, Y. Wen, M.-A. Carbonneau, and Z. Duan, "EDMSound: Spectrogram based diffusion models for efficient and high-quality audio synthesis," 2023. [Online]. Available: https://arxiv.org/abs/2311.08667
[29] C. K. A. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results," arXiv:2005.13981, 2020.
[30] Z.-Q. Wang, H. Erdogan, S. Wisdom, K. Wilson, D. Raj, S. Watanabe, Z. Chen, and J. R. Hershey, "Sequential multi-frame neural beamforming for speech separation and enhancement," in Spoken Language Technology Workshop, 2021, pp. 905-911.
[31] R. Scheibler, E. Bezzam, and I. Dokmanić, "Pyroomacoustics: A Python package for audio room simulation and array processing algorithms," in ICASSP, 2018, pp. 351-355.
[32] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950, 1979.
[33] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in ICASSP, 2010, pp. 4214-4217.
[34] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009-2022, 2016.
[35] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in ICML, 2023, pp. 28492-28518.
[36] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR - half-baked or well done?" in ICASSP, pp. 626-630, 2018.
[37] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in ICASSP, vol. 2, 2001, pp. 749-752.
[38] C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in ICASSP, 2021, pp. 6493-6497.
[39] K. Baba, W. Nakata, Y. Saito, and H. Saruwatari, "The T05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech," in Spoken Language Technology Workshop, 2024.