Using Monte Carlo dropout for non-stationary noise reduction from speech



Nazreen P.M. and A.G. Ramakrishnan, Senior Member, IEEE

(Nazreen P.M. and A.G. Ramakrishnan are with the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India, 560012; e-mail: nazreenp@iisc.ac.in, agr@iisc.ac.in.)

Abstract—In this work, we propose the use of dropout as a Bayesian estimator to increase the generalizability of a deep neural network (DNN) for speech enhancement. Using Monte Carlo (MC) dropout, we show that the DNN performs better enhancement under unseen noise and SNR conditions. The DNN is trained on speech corrupted with Factory2, M109, Babble, Leopard and Volvo noises at SNRs of 0, 5 and 10 dB. Speech samples are obtained from the TIMIT database and noises from NOISEX-92. In another experiment, we train five DNN models separately on speech corrupted with Factory2, M109, Babble, Leopard and Volvo noises, at 0, 5 and 10 dB SNRs. The model precision (estimated using MC dropout) is used as a proxy for the squared error to dynamically select the best of the DNN models based on their performance on each frame of test data. We propose an algorithm with a threshold on the model precision to switch between a classifier-based model selection scheme and a model-precision-based selection scheme. Testing is done on speech corrupted with the unseen noises White, Pink and Factory1, and with all five seen noises.

Index Terms—speech enhancement, deep neural networks, DNN, dropout, unseen noise, Monte Carlo, model uncertainty.

I. INTRODUCTION

Speech enhancement techniques find several applications, such as automatic speech recognition, speaker recognition and hearing aids. Single-channel speech enhancement has been a challenging problem for decades, and several techniques have been proposed. Methods such as spectral subtraction [1], [2], Wiener filtering [3], minimum mean-square error (MMSE) estimators [4], estimators based on Gaussian prior distributions [5], [6] and residual-weighting schemes [7], [8], [9] fall into the category of unsupervised enhancement methods. Most of these methods fail when the background noise is non-stationary or the acoustic conditions are unexpected. Supervised learning methods are expected to perform better than the unsupervised ones, since prior information is used [10], [11], [12]. Neural networks have been shown to be useful for learning the complex mapping between noisy and clean speech, and several such models have been proposed [13], [14], [15]. However, these models are too small to properly learn the complex mapping. Deep architectures have been widely used in this area recently, as they have shown the ability to learn the complex mapping between noisy and clean features and hence give superior enhancement performance. Hinton et al. proposed a greedy layer-wise unsupervised learning algorithm [16], [17]. Maas et al. [18] use deep recurrent neural networks (DRNNs) for feature enhancement for noise-robust ASR.

One of the major issues encountered by DNN-based enhancement is the degradation of performance on noises to which the network is not adapted, referred to as the unseen condition. The model learns the mapping between noisy and clean speech well for the noises and SNRs on which it is trained, but performs poorly on speech corrupted by an unseen noise or SNR. This in itself can be treated as a challenging task in the speech enhancement scenario.
Though not dealt with separately, several techniques have been proposed in the past to address this problem. [19] proposed a regression-DNN-based speech enhancement framework, where a wide neural network is trained using a large collection of data, about 100 hours, covering various noise types. A DNN-SVM based system is proposed in [20], which is trained on a variety of acoustic data for a considerably large amount of time. A noise-aware training technique is adopted in [21], where a noise estimate is appended to the input feature for training; about 2500 hours of training data are used to train the network. We propose a new algorithm, based on the Monte Carlo dropout of Gal and Ghahramani [22], to improve the performance on unseen noises. Our experiments show that the algorithm gives superior performance in most of the unseen noise cases compared to conventional dropout [23], [24].

II. RELATED WORK

Hinton et al. [23], [24] introduced the concept of dropout to reduce overfitting during DNN training. Though dropout omits weights during training, it is inactive during the inference stage, whereby all the neurons contribute to the prediction. Gal and Ghahramani [22] show a theoretical relationship between dropout [24] and approximate inference in a Gaussian process, and introduced the method of using dropout during inference. Kendall et al. [25] show that by enabling dropout during inference and averaging the results of multiple stochastic forward passes, the predictions improve; [25] uses the term MC dropout to refer to this technique. These samples can be considered Monte Carlo samples from the posterior distribution of models [25]. Gal et al. [22] also show how model uncertainty can be estimated from these samples.

The focus of this work is to use the idea of MC dropout to improve the generalizability of speech models and to improve the enhancement performance in a highly mismatched condition. In [26], we show that for speech corrupted with unseen noises, MC dropout models can give a better denoised output than conventional dropout models. To show this, we train two DNN models on multiple noises and SNRs, one employing MC dropout and the other employing conventional dropout, and compare the performance of the two.

We also explore the use of model uncertainty in problems where multiple noise-specific DNN models are used. By using model uncertainty as an estimate of the prediction error for a sample, this technique enables selection of the model with the least prediction error on a frame basis. A similar approach of selecting the best model based on an error estimate was proposed in [27]; however, it was used for robust SNR estimation, with a separate DNN trained as a classifier to select a particular regression model. That approach does not ameliorate the original problem of mismatch between training and testing conditions. In our proposed algorithm, we use the intrinsic uncertainty of a model to estimate the prediction error. Since this method extracts information from the model itself, it has the potential to be a better representative of the prediction error. Our method also circumvents the issue of unseen testing conditions, since according to [22], the model uncertainty itself is an indicator of unseen data. This paper extends our preliminary results reported in [26].
We propose a predictive variance threshold based algorithm to switch between a model uncertainty based selection scheme and a classifier based model selection scheme, which compensates for the performance drop of the intrinsic uncertainty based algorithm on seen noises.

The baseline system is augmented with MC dropout as a Bayesian approximation. Using this approximation, a distribution over the weights can be learned, consequently giving an uncertainty on the output. The input $X$ is fed into the network with dropout active, the same as employed during training. Multiple passes are made through the network, dropping out different random units each time. Thus, $T$ repetitions are performed during testing, giving $T$ different outputs for a given input $X$: $\{\hat{S}_t(X)\}$, $1 \le t \le T$. [22] shows that averaging such forward passes through the network is equivalent to Monte Carlo integration over a Gaussian process posterior approximation. Empirical estimators of the predictive mean $E(S)$ and variance (uncertainty) $\mathrm{Var}(S)$ from these samples are:

$$E(S) \approx \frac{1}{T}\sum_{t=1}^{T} \hat{S}_t(X) \qquad (1)$$

$$\mathrm{Var}(S) \approx \tau^{-1} I_D + \frac{1}{T}\sum_{t=1}^{T} \hat{S}_t(X)^{T}\hat{S}_t(X) - E(S)^{T}E(S) \qquad (2)$$

where $\tau = l^2 p / (2N\lambda)$, with $l$ the defined prior length scale, $p$ the probability of the units not being dropped, $N$ the total number of input samples, and $\lambda$ the regularization weight decay, which is zero in our experiments.

III. DNN BASED SPEECH ENHANCEMENT

Under the additive model, the noisy speech can be represented as

$$x_t(m) = s_t(m) + n_t(m) \qquad (3)$$

where $x_t(m)$, $s_t(m)$ and $n_t(m)$ are the $m$-th samples of the noisy speech, clean speech and noise signals, respectively. Taking the short-time Fourier transform (STFT), we have

$$x(\omega_k) = s(\omega_k) + n(\omega_k) \qquad (4)$$

where $\omega_k = 2\pi k/R$, $k = 0, 1, \ldots, R-1$ is the frequency index and $R$ is the number of frequency bins. Taking the magnitude of the STFT, the noisy speech can be approximated as

$$X \approx S + N \qquad (5)$$

where $X$, $S$ and $N$ represent the magnitude spectra of the noisy speech, the clean speech and the noise, respectively.

A DNN based regression model is trained using the magnitude STFT features of clean and noisy speech. The noisy features are then fed to this trained DNN to predict the enhanced features $\hat{S}$. The enhanced speech signal is obtained by taking the inverse Fourier transform of $\hat{S}$ with the phase of the noisy speech signal, followed by the overlap-add method.

A. Baseline DNN architecture

The baseline DNN consists of 3 fully connected layers of 2048 neurons each and an output layer of 257 neurons. We use the ReLU non-linearity as the activation function in all the 3 layers. The output activation is also ReLU, to account for the non-negative nature of the STFT magnitude. Stochastic gradient descent is used to minimize the mean square logarithmic error $E_r$ between the estimated and clean magnitude spectra:

$$E_r = \frac{1}{R}\sum_{k=1}^{R}\Big(\log(S(k)+1) - \log(\hat{S}(k)+1)\Big)^2 \qquad (6)$$

where $\hat{S}(k)$ and $S(k)$ denote the estimated and reference spectral features, respectively, at frequency index $k$.

IV. PROPOSED METHODS FOR GENERALIZED SPEECH MODELS

Our approach to improving generalization involves two schemes. In the first, we show that the MC dropout estimate improves the generalization performance of the DNN, and we apply this to speech enhancement. In the second, we use model uncertainty to optimally choose among multiple DNN models so that the reconstruction error is minimized.
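Both schemes rest on the same primitive: $T$ stochastic forward passes with dropout left active at test time, from which the empirical mean (1) and predictive variance (2) are estimated. The following PyTorch sketch illustrates this primitive for the baseline architecture of Section III-A. It is a minimal illustration under stated assumptions: the class name, the dropout probability `p_drop`, and the placement of dropout after each hidden layer are our choices, not details reported in the paper.

```python
import torch
import torch.nn as nn

class EnhancementDNN(nn.Module):
    """Baseline regressor: 3 hidden ReLU layers of 2048 units, 257-dim ReLU output.
    p_drop is an assumed value; the paper does not report the dropout rate."""
    def __init__(self, dim_in=257, hidden=2048, dim_out=257, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, dim_out), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model, X, T=50):
    """T stochastic passes with dropout kept active, returning the empirical
    mean (Eq. 1) and the per-bin predictive variance (the diagonal of Eq. 2,
    with tau^{-1} = 0 since lambda = 0 in the paper's experiments)."""
    model.train()  # keeps the Dropout layers stochastic at test time
    samples = torch.stack([model(X) for _ in range(T)])  # shape (T, ..., 257)
    return samples.mean(dim=0), samples.var(dim=0, unbiased=False)
```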
The second approach involves two frameworks, explained in Secs. IV-B1 and IV-B2.

A. Single DNN model using MC dropout (single-MC)

In this method, we use MC dropout to improve the generalizability of the baseline model. To evaluate the proposed method, we train a DNN model using MC dropout and compare its performance against one using conventional dropout. A single DNN model is trained, employing MC dropout, on speech corrupted with various noises at various SNRs. The block diagram of the proposed approach is shown in Fig. 1. The input noisy speech is divided into frames and the STFT is applied. Let $X$ denote the magnitude STFT feature of a particular frame. Given a noisy speech frame $X$, multiple repetitions are performed, dropping out random units each time, giving $T$ different outputs $\{\hat{S}_t(X)\}$, $1 \le t \le T$. The empirical mean of these outputs is the estimated output $\hat{S}(X)$, and the enhanced speech is obtained as the inverse Fourier transform of $\hat{S}(X)$ with the phase of the noisy speech signal, followed by the overlap-add method:

$$\hat{S}(X) \approx \frac{1}{T}\sum_{t=1}^{T} \hat{S}_t(X) \qquad (7)$$

$$\hat{s}(x) = \mathrm{IDFT}\big(\hat{S}(X)\,\angle X\big) \qquad (8)$$

where $\hat{s}(x)$ denotes the enhanced speech estimate for a noisy speech input $x$, and $\angle X$ is the noisy phase.

Fig. 1. Enhancement using a single DNN-MC dropout model: the noisy speech is split into 32 ms Hamming-windowed frames; the magnitude DFT feature $X$ is passed through the MC dropout DNN to obtain $\{\hat{S}_t(X)\}$; the empirical mean gives $\hat{S}(X)$, and the IDFT with the noisy phase $\angle X$ gives the enhanced speech.
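As a usage sketch of (7) and (8): the enhanced magnitude of a frame is the MC mean, recombined with the noisy phase before inversion. This relies on the hypothetical `mc_dropout_predict` helper sketched above; the 512-point inverse FFT matches the analysis setup of Section V.

```python
import torch

def enhance_frame(model, noisy_mag, noisy_phase, T=50):
    """Single-MC enhancement of one frame (Eqs. 7-8): the MC mean of T
    stochastic passes is recombined with the noisy phase and inverted."""
    s_hat, _ = mc_dropout_predict(model, noisy_mag, T=T)  # Eq. (7)
    spectrum = s_hat * torch.exp(1j * noisy_phase)        # magnitude with angle(X)
    return torch.fft.irfft(spectrum, n=512)               # Eq. (8): 512-sample frame,
                                                          # later overlap-added
```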
B. Multiple noise-specific MC dropout models for enhancement

Model-specific enhancement techniques depend on a model selector, which ensures that the model chosen for enhancing each frame yields an overall improved performance [28], [29]. Given a framework of multiple DNN models for enhancement, one needs to select the appropriate noise model to enhance each input noisy speech frame. One option is to use a noise classifier [27] to select the model. However, when the input speech is corrupted with an unknown noise or SNR condition, the noise classifier might fail to pick the optimal model. In such cases, we need to ensure that the chosen model is the one that gives the lowest error, and hence the better enhancement performance. In our methods IV-B1 and IV-B2, we use the model uncertainty estimated from the output samples of each MC dropout model as an estimate of the prediction error, and choose the model based on it. Our experiments show that the stronger the correlation between the model uncertainty and the squared error, the better the enhancement performance. For evaluation, we compare our algorithms with a scheme where a classifier picks the noise model; the noise models may use MC dropout (class-MC) or conventional dropout (class-Conv).

1) Multiple models using MC dropout with the predictive variance (model uncertainty) as the selection scheme (Var-MC): Following [22], since the model uncertainty gives the intrinsic uncertainty of the model for a particular input, we use it as an estimate of the model error. The speech enhancement framework so designed is shown in Fig. 2.

Fig. 2. Enhancement using multiple DNN-MC dropout models with the predictive variance as the selection criterion: the noisy magnitude feature $X$ is fed to models $1$ to $M$, each run with MC dropout to give $\{\hat{S}^i_t(X)\}$; the minimum-variance model $i^* = \arg\min_i \mathrm{Var}(S^i)$ is selected, $\hat{S}(X) = E(S^{i^*})$, and the IDFT with the noisy phase $\angle X$ gives the enhanced speech.

$M$ different DNN models with MC dropout are trained on speech corrupted with various noises and SNRs. The architecture of each model is as described in Section III-A. The input noisy speech is first divided into frames, and the magnitude STFT is obtained. The noisy magnitude STFT feature $X$ of a frame is fed into each of the models. $T$ repetitions are performed by each model, dropping different units every time, giving results $\{\hat{S}^i_t(X)\}$, $1 \le t \le T$, $1 \le i \le M$, where $i$ is the model index. The predictive variance (model uncertainty) of each of these $M$ outputs is computed. The output with the minimum variance, $\{\hat{S}^{i^*}_t(X)\}$, is selected, and the corresponding model is considered the best for that particular input $X$. The enhanced output $\hat{S}$ is estimated as the empirical mean of the $T$ outputs:

$$\hat{S}(X) \approx \frac{1}{T}\sum_{t=1}^{T} \hat{S}^{i^*}_t(X) \qquad (9)$$

The enhanced speech signal is obtained as the inverse Fourier transform of $\hat{S}$ with the phase of the noisy speech signal, followed by the overlap-add method.

2) A predictive variance threshold (µ) based algorithm for enhancement using multiple models (µ-MC): The experimental results of the Var-MC algorithm show superior performance for most of the unseen noises. However, the performance on seen noises shows significant degradation. This can be rectified using a conditional selection criterion that switches the selection of the noise model from model uncertainty based to classifier based. A threshold is set on the variance of all the $M$ models, so that the model for enhancing a noisy frame is selected either by the minimum-variance scheme or by the prediction of a noise classifier, as shown in Fig. 3. The noisy feature $X$ of a frame is fed into all the MC dropout models ($M = 5$ in our experiments). The input is passed $T$ different times through each model, dropping out random units each time; the corresponding outputs are $\{\hat{S}^i_t(X)\}$, $1 \le t \le T$, $1 \le i \le M$. The predictive variance $\mathrm{Var}(S^i)$ of each of the $M$ outputs is then computed. If all the $M$ uncertainty values are above a threshold $\mu$, it is taken as an indication that the noise corrupting the given input speech belongs to none of the $M$ noise models. In that case, the model that gives the minimum uncertainty is considered the best model to enhance the input noisy feature $X$, and its output is $\{\hat{S}^{i^*}_t(X)\}$, $1 \le t \le T$. Taking the empirical mean of these $T$ outputs gives the enhanced output:

$$\hat{S}(X) \approx \frac{1}{T}\sum_{t=1}^{T} \hat{S}^{i^*}_t(X) \qquad (10)$$

The enhanced speech signal is obtained as the inverse Fourier transform of $\hat{S}$ with the phase of the noisy speech signal and the overlap-add method. On the other hand, if the uncertainty values are below the threshold $\mu$, the input feature $X$ is fed into a classifier to decide the best model for enhancing the frame. Let the corresponding output be $\{\hat{S}^{c^*}_t(X)\}$, $1 \le t \le T$, $1 \le c^* \le M$. As before, taking the empirical mean of these $T$ outputs gives the enhanced output:

$$\hat{S}(X) \approx \frac{1}{T}\sum_{t=1}^{T} \hat{S}^{c^*}_t(X) \qquad (11)$$

The enhanced speech is obtained as the inverse Fourier transform with the noisy phase information and the overlap-add method.
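A compact sketch of the per-frame µ-MC decision rule (cf. Fig. 3) follows, assuming the $T$ stochastic outputs of all $M$ models are stacked in one array. The scalar per-frame uncertainty is taken as the trace of the output covariance, as described in Section VI; `classifier_pick` stands in for the noise classifier's decision and is an assumed input, not the authors' implementation.

```python
import numpy as np

def select_and_enhance(mc_samples, classifier_pick, mu=0.16):
    """mu-MC selection for one frame (cf. Fig. 3 and Eqs. 10-11).

    mc_samples: (M, T, D) array of T stochastic outputs per model.
    classifier_pick: model index c* proposed by the noise classifier.
    Returns the enhanced magnitude frame.
    """
    # Scalar per-model uncertainty: trace of the sample covariance,
    # i.e. the per-bin variance summed over the D frequency bins.
    uncertainty = mc_samples.var(axis=1).sum(axis=-1)  # shape (M,)
    if np.all(uncertainty > mu):
        # All models highly uncertain -> likely unseen noise:
        # fall back to the minimum-variance (Var-MC) choice, Eq. (10).
        chosen = int(np.argmin(uncertainty))
    else:
        # At least one confident model -> trust the classifier, Eq. (11).
        chosen = classifier_pick
    # Enhanced magnitude: empirical mean over the T passes of the chosen model.
    return mc_samples[chosen].mean(axis=0)
```

In the paper's experiments, $M = 5$, $T = 50$ and $\mu = 0.16$.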
Fig. 3. The predictive variance threshold ($\mu$) based algorithm for enhancement using multiple models: the noisy feature $X$ is fed to models $1$ to $M$; if $\mathrm{Var}(S^i) > \mu$ for all $1 \le i \le M$, then $i^* = \arg\min_i \mathrm{Var}(S^i)$ and $\hat{S}(X) = E(S^{i^*})$; otherwise, a classifier picks the noise model and $\hat{S}(X) = E(S^{c^*})$.

V. EXPERIMENTAL SETUP

All experiments are carried out using the TIMIT speech corpus [30]. The noise data is obtained from the NOISEX-92 database [31]. In order to synthesize noisy test and training data, the noise files are downsampled to 16 kHz to match the sampling rate of TIMIT. The magnitude STFT is computed on frames of size 32 ms with a 10 ms frame shift, after applying a Hamming window. A 512-point FFT is taken, and only the first 257 points are used as input to the DNN, because of the symmetry of the spectrum. For our experiments, the number of repetitions $T$ is chosen as 50. The Adam optimizer [32] is used, whose default regularization weight decay $\lambda$ is zero; thus $\tau^{-1} = 0$ in (2).

A. single-MC

Each DNN based regression model is trained with the magnitude STFT of noisy speech as input and that of clean speech as target. For the single-DNN experiments of Sec. IV-A, a baseline DNN with conventional dropout [23], [24] and a DNN using MC dropout are trained on speech corrupted with Factory2, M109, Leopard, Babble and Volvo noises at 0, 5 and 10 dB SNRs. The architecture of both models is as described in Section III-A. The training uses the entire TIMIT training data, randomly divided into fifteen parts for adding the five noises at the three SNRs.

B. Var-MC and µ-MC

For the multiple-model experiments of Secs. IV-B1 and IV-B2, five DNN models are trained on speech corrupted with Factory2, M109, Leopard, Babble and Volvo noises, each at SNRs of 0, 5 and 10 dB. Each DNN model is trained using MC and conventional dropout, on the entire TIMIT training data, with the files randomly divided into three parts for adding noise at SNRs of 0, 5 and 10 dB. In this case also, the architecture of the models is as defined in Section III-A. For the experiments where a classifier picks the models (class-MC and class-Conv), the classifier is trained on speech corrupted with Factory2, Babble, Leopard, M109 and Volvo noises at SNRs of 0, 5 and 10 dB.
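The analysis/synthesis settings above map directly onto standard STFT routines. Below is a minimal sketch using librosa (16 kHz audio, 32 ms Hamming windows, 10 ms hop, 512-point FFT giving 257 bins); this is an illustrative reading of the setup, not the authors' code.

```python
import numpy as np
import librosa

SR, N_FFT = 16000, 512
WIN = int(0.032 * SR)   # 32 ms window -> 512 samples
HOP = int(0.010 * SR)   # 10 ms shift  -> 160 samples

def analyze(wav):
    """Waveform -> (magnitude, phase); each 257-dim magnitude column is a DNN input frame."""
    spec = librosa.stft(wav, n_fft=N_FFT, hop_length=HOP,
                        win_length=WIN, window="hamming")  # shape (257, n_frames)
    return np.abs(spec), np.angle(spec)

def synthesize(enhanced_mag, noisy_phase):
    """Enhanced magnitude + noisy phase -> waveform via inverse STFT (overlap-add)."""
    spec = enhanced_mag * np.exp(1j * noisy_phase)
    return librosa.istft(spec, hop_length=HOP, win_length=WIN, window="hamming")
```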
VI. RESULTS AND DISCUSSION

A. single-MC

Table I shows the results, in terms of sum squared error (SSE) and segmental SNR (SSNR) [33], of the single DNN-MC dropout model (single-MC) against the baseline (single-conv) for unseen and seen noises. We use White, Pink and Factory1 as unseen noises and Factory2 as a seen noise. The reported results are averages over 50 files randomly selected from the TIMIT test set [30]. From the table, it can be inferred that the MC dropout model achieves superior performance in most of the unseen noise cases. Notably, the improvement is significant for unseen noises like White, especially at the low SNRs of -10 and -5 dB. Interestingly, the gains diminish at higher SNRs, though the model continues to perform better than the baseline in terms of SSE. Though the proposed method does not yield significant improvement on seen noises, its performance is comparable to that of the baseline. Hence, the observations validate the proposed method of using MC dropout to improve the generalization performance on unseen noises.

TABLE I: Performance evaluation of the single DNN model with MC dropout (single-MC). Each cell: Noisy input / single-conv / single-MC.

| SNR (dB) | Metric | White (unseen) | Pink (unseen) | Factory1 (unseen) | Factory2 (seen) |
|---|---|---|---|---|---|
| -10 | SSE x10^4 | 3.64 / 3.36 / 3.14 | 3.96 / 0.874 / 0.848 | 3.69 / 0.720 / 0.70 | 4.13 / 0.0467 / 0.0461 |
| -10 | SSNR | -8.9 / -8.5 / -8.4 | -8.8 / -6.7 / -6.6 | -8.7 / -6.0 / -5.9 | -8.5 / 1.0 / 1.0 |
| -5 | SSE x10^4 | 1.12 / 0.960 / 0.913 | 1.22 / 0.270 / 0.251 | 1.12 / 0.213 / 0.200 | 1.29 / 0.0198 / 0.0197 |
| -5 | SSNR | -7.2 / -6.6 / -6.5 | -7.1 / -4.3 / -4.2 | -6.9 / -3.51 / -3.50 | -6.7 / 3.05 / 3.08 |
| 0 | SSE x10^3 | 3.41 / 2.81 / 2.60 | 3.71 / 0.858 / 0.843 | 3.41 / 0.682 / 0.671 | 4.01 / 0.104 / 0.104 |
| 0 | SSNR | -4.6 / -3.9 / -3.8 | -4.5 / -1.5 / -1.4 | -4.4 / -0.73 / -0.73 | -4.1 / 5.1 / 5.1 |
| 5 | SSE x10^3 | 1.03 / 0.844 / 0.827 | 1.12 / 0.291 / 0.288 | 1.02 / 0.244 / 0.242 | 1.24 / 0.069 / 0.069 |
| 5 | SSNR | -1.6 / -0.7 / -0.7 | -1.4 / 1.7 / 1.7 | -1.3 / 2.2 / 2.2 | -0.9 / 7.1 / 7.1 |
| 10 | SSE x10^2 | 3.08 / 2.70 / 2.67 | 3.41 / 1.18 / 1.16 | 3.09 / 1.07 / 1.06 | 3.82 / 0.56 / 0.55 |
| 10 | SSNR | 2.0 / 2.7 / 2.7 | 2.2 / 4.7 / 4.7 | 2.3 / 5.0 / 5.0 | 2.6 / 8.9 / 8.9 |

B. Var-MC

Tables II and III show the performance of our Var-MC algorithm in terms of SSE and SSNR. It can be inferred from the tables that Var-MC gives superior performance over class-conv and class-MC in most of the unseen noise cases. However, as the SNR improves, the improvement over the baseline drops. This performance drop can be explained by the weakening correlation between the squared error and the model uncertainty shown in Fig. 5.

TABLE II: Performance evaluation of the Var-MC and µ-MC algorithms. Each cell: Noisy input / Class-conv / Class-MC / Var-MC / µ-MC (µ = 0.16).

| SNR (dB) | Metric | White (unseen) | Pink (unseen) | Factory1 (unseen) | Factory2 (seen) |
|---|---|---|---|---|---|
| -10 | SSE x10^4 | 3.64 / 3.61 / 3.42 / 3.23 / 3.24 | 3.96 / 1.13 / 1.17 / 0.708 / 1.05 | 3.69 / 1.03 / 1.01 / 0.677 / 0.876 | 4.13 / 0.0406 / 0.0397 / 0.331 / 0.0458 |
| -10 | SSNR | -8.9 / -8.7 / -8.6 / -8.4 / -8.5 | -8.8 / -7.1 / -7.1 / -5.4 / -6.9 | -8.7 / -6.6 / -6.6 / -5.3 / -6.3 | -8.5 / 2.1 / 2.1 / 0.5 / 2.1 |
| -5 | SSE x10^4 | 1.12 / 1.02 / 0.976 / 0.936 / 0.956 | 1.22 / 0.312 / 0.322 / 0.261 / 0.311 | 1.12 / 0.285 / 0.285 / 0.20 / 0.260 | 1.29 / 0.0172 / 0.0171 / 0.257 / 0.0259 |
| -5 | SSNR | -7.2 / -6.7 / -6.6 / -6.5 / -6.6 | -7.1 / -4.5 / -4.5 / -3.7 / -4.5 | -6.9 / -4.1 / -4.1 / -3.3 / -4.0 | -6.7 / 4.0 / 4.0 / 1.3 / 3.9 |
| 0 | SSE x10^3 | 3.41 / 2.94 / 2.86 / 2.70 / 2.84 | 3.71 / 0.902 / 0.918 / 0.943 / 0.981 | 3.41 / 0.828 / 0.832 / 0.771 / 0.836 | 4.01 / 0.089 / 0.090 / 1.37 / 0.15 |
| 0 | SSNR | -4.6 / -4.1 / -4.0 / -3.8 / -4.0 | -4.5 / -1.6 / -1.6 / -1.3 / -1.6 | -4.4 / -1.1 / -1.1 / -0.83 / -1.1 | -4.1 / 5.8 / 5.8 / 3.3 / 5.8 |
| 5 | SSE x10^3 | 1.03 / 0.884 / 0.865 / 0.857 / 0.856 | 1.12 / 0.288 / 0.290 / 0.391 / 0.339 | 1.02 / 0.270 / 0.273 / 0.285 / 0.288 | 1.24 / 0.059 / 0.060 / 0.456 / 0.09 |
| 5 | SSNR | -1.6 / -0.8 / -0.8 / -0.7 / -0.7 | -1.4 / 1.7 / 1.7 / 1.6 / 1.7 | -1.3 / 2.0 / 2.0 / 2.0 / 2.0 | -0.9 / 7.7 / 7.7 / 5.8 / 7.6 |
| 10 | SSE x10^2 | 3.08 / 2.82 / 2.81 / 2.73 / 2.69 | 3.41 / 1.12 / 1.14 / 1.40 / 1.20 | 3.09 / 1.10 / 1.14 / 1.24 / 1.16 | 3.82 / 0.47 / 0.48 / 1.34 / 0.55 |
| 10 | SSNR | 2.0 / 2.6 / 2.6 / 2.7 / 2.7 | 2.2 / 4.8 / 4.8 / 4.5 / 4.7 | 2.3 / 4.9 / 4.9 / 4.8 / 4.9 | 2.6 / 9.5 / 9.5 / 8.1 / 9.5 |

TABLE III: Performance evaluation of the Var-MC and µ-MC algorithms on seen noises. Each cell: Noisy input / Class-conv / Class-MC / Var-MC / µ-MC (µ = 0.16).

| SNR (dB) | Metric | M109 (seen) | Leopard (seen) | Babble (seen) | Volvo (seen) |
|---|---|---|---|---|---|
| -10 | SSE x10^4 | 3.68 / 0.0411 / 0.0410 / 0.230 / 0.0499 | 3.63 / 0.0266 / 0.0281 / 0.0612 / 0.0473 | 3.55 / 0.0729 / 0.0718 / 0.131 / 0.0894 | 5.33 / 0.0094 / 0.0097 / 0.367 / 0.0107 |
| -10 | SSNR | -8.6 / 1.9 / 1.9 / 1.0 / 1.9 | -8.6 / 2.7 / 2.9 / 2.7 / 2.8 | -8.5 / 1.5 / 1.5 / 1.3 / 1.5 | -8.2 / 6.7 / 6.7 / 0.2 / 6.7 |
| -5 | SSE x10^4 | 1.13 / 0.0186 / 0.0187 / 0.124 / 0.0268 | 1.11 / 0.0128 / 0.0133 / 0.0235 / 0.0180 | 1.07 / 0.0356 / 0.0360 / 0.0662 / 0.0452 | 1.68 / 0.0047 / 0.0049 / 0.325 / 0.0083 |
| -5 | SSNR | -6.8 / 3.5 / 3.5 / 2.5 / 3.5 | -6.8 / 4.3 / 4.4 / 4.2 / 4.3 | -6.7 / 2.7 / 2.7 / 2.4 / 2.7 | -6.3 / 9.1 / 9.1 / 0.8 / 9.1 |
| 0 | SSE x10^3 | 3.51 / 0.102 / 0.103 / 0.360 / 0.129 | 3.35 / 0.076 / 0.082 / 0.133 / 0.10 | 3.21 / 0.191 / 0.197 / 0.298 / 0.236 | 5.28 / 0.036 / 0.037 / 2.47 / 0.070 |
| 0 | SSNR | -4.2 / 5.3 / 5.3 / 4.3 / 5.3 | -4.3 / 5.9 / 5.9 / 5.6 / 5.9 | -4.1 / 4.2 / 4.2 / 3.8 / 4.2 | -3.6 / 10.8 / 10.8 / 2.1 / 10.8 |
| 5 | SSE x10^3 | 1.08 / 0.067 / 0.069 / 0.134 / 0.075 | 0.999 / 0.055 / 0.062 / 0.083 / 0.069 | 0.956 / 0.115 / 0.121 / 0.153 / 0.128 | 1.66 / 0.033 / 0.034 / 1.45 / 0.076 |
| 5 | SSNR | -1.1 / 7.3 / 7.3 / 6.3 / 7.3 | -1.1 / 7.4 / 7.4 / 7.0 / 7.4 | -1.0 / 5.8 / 5.8 / 5.3 / 5.8 | -0.3 / 12.1 / 12.1 / 4.6 / 12.0 |
| 10 | SSE x10^2 | 3.30 / 0.52 / 0.54 / 0.78 / 0.55 | 2.95 / 0.48 / 0.50 / 0.64 / 0.52 | 2.84 / 0.87 / 0.87 / 0.90 / 0.83 | 5.18 / 0.33 / 0.34 / 5.01 / 0.51 |
| 10 | SSNR | 2.5 / 9.1 / 9.1 / 8.1 / 9.1 | 2.5 / 8.9 / 8.9 / 8.5 / 8.9 | 2.6 / 7.5 / 7.5 / 7.0 / 7.5 | 3.3 / 12.9 / 12.9 / 7.5 / 12.8 |

Figure 5 shows the correlation between the predictive variance and the squared error (SE) of the estimated output frames for all five MC models, for speech corrupted with White noise. The uncertainty is computed as the trace of the covariance matrix of each frame [25]. The plots show the weakening of the correlation between the SE and the model uncertainty as the SNR improves: the correlation is strong at -10 and -5 dB and weak at 0, 5 and 10 dB. This variation could be because the DNN is less adapted to low SNRs and highly adapted to high SNRs; this needs further exploration. It matches our results, since we find that there is not much improvement over class-conv and class-MC as the SNR increases, though the values remain comparable. This observation also matches the finding in [25] that test data far from the training set are likely to be more uncertain, as the network is less adapted to them.

Fig. 5. Correlation plots between the predictive variance and the squared error of the estimated output frames for the five MC models (Factory2, Leopard, M109, Babble and Volvo), for speech corrupted with White noise at -10, -5, 0, 5 and 10 dB SNR.

1) Observations: Tables II and III show that Var-MC gives poor performance for the seen noises Factory2, M109, Leopard, Babble and Volvo. The µ-MC algorithm compensates for this performance drop by using a per-frame predictive variance threshold µ to select between Var-MC and class-MC. The threshold is selected based on experiments on a validation set of 30 TIMIT files corrupted with the seen noises Factory2, M109, Leopard, Babble and Volvo and the unseen Pink noise, at SNRs of -10, -5, 0, 5 and 10 dB. For our experiments, this threshold is set to µ = 0.16.

C. µ-MC

Tables II and III show the performance improvements of the µ-MC algorithm over class-conv and class-MC, in terms of SSE and SSNR, for the unseen noises Pink, White and Factory1 and the seen noises Factory2, M109, Leopard, Babble and Volvo. It can be observed that µ-MC gives superior performance in most of the unseen noise cases, especially at lower SNRs. The algorithm also compensates for the poor performance of the Var-MC algorithm on seen noises. Figure 4 shows the variation of SSE with the predictive variance threshold µ, for test data corrupted with all five seen and three unseen noises at -10 dB SNR. It can be seen that, as the threshold increases, the performance on unseen noises degrades, while that on seen noises improves.
Thus, the threshold µ can be used to trade off between the performance on seen and unseen noise cases.

We also evaluated the performance of our algorithms by mixing two unseen noises, Factory1 and Pink, and corrupting the speech files with this new noise at SNRs varying from -10 dB to 10 dB. In another experiment, we divided a given speech waveform into three segments and added White, Factory2 and Factory1 noise to the respective segments. Table IV shows the performance evaluation for these two experiments. It can be observed that the µ-MC algorithm gives superior or comparable performance to class-conv and class-MC in all the cases. The Var-MC and µ-MC algorithms give superior performance in those cases for which the DNN is less adapted, and hence where the correlation between the squared error and the variance is stronger.

TABLE IV: Performance evaluation of the Var-MC and µ-MC algorithms on mixed noises. Each cell: Noisy input / Class-conv / Class-MC / Var-MC / µ-MC (µ = 0.16).

| SNR (dB) | Metric | Factory1 + Pink | White-Factory2-Factory1, added segment-wise |
|---|---|---|---|
| -10 | SSE x10^4 | 3.77 / 1.06 / 1.03 / 0.678 / 0.892 | 4.0 / 0.325 / 0.319 / 0.309 / 0.294 |
| -10 | SSNR | -8.8 / -6.8 / -6.8 / -5.5 / -6.5 | -7.2 / -3.5 / -3.5 / -2.8 / -3.4 |
| -5 | SSE x10^4 | 1.15 / 0.291 / 0.289 / 0.244 / 0.266 | 1.23 / 0.0975 / 0.0963 / 0.159 / 0.0920 |
| -5 | SSNR | -7.0 / -4.3 / -4.3 / -3.5 / -4.1 | -5.2 / -0.9 / -0.9 / -1.0 / -0.8 |
| 0 | SSE x10^3 | 3.49 / 0.843 / 0.840 / 0.873 / 0.870 | 3.82 / 0.318 / 0.317 / 0.939 / 0.326 |
| 0 | SSNR | -4.5 / -1.3 / -1.3 / -1.0 / -1.3 | -2.5 / 1.9 / 1.9 / 1.2 / 1.9 |
| 5 | SSE x10^3 | 1.05 / 0.273 / 0.273 / 0.302 / 0.290 | 1.17 / 0.127 / 0.128 / 0.416 / 0.138 |
| 5 | SSNR | -1.3 / 1.8 / 1.8 / 1.8 / 1.8 | 0.7 / 4.5 / 4.5 / 3.8 / 4.5 |
| 10 | SSE x10^2 | 3.17 / 1.10 / 1.13 / 1.27 / 1.16 | 3.58 / 0.71 / 0.72 / 1.67 / 0.75 |
| 10 | SSNR | 2.2 / 4.8 / 4.8 / 4.6 / 4.8 | 4.4 / 6.8 / 6.8 / 6.3 / 6.9 |

Fig. 4. Variation of SSE with the predictive variance threshold µ at -10 dB SNR; the panels plot SSE against µ (0 to 0.6) for the Pink, Factory1, White, Factory2, Leopard, M109, Babble and Volvo noises.

VII. CONCLUSION

In this work, we propose techniques that use dropout as a Bayesian estimator to improve the generalizability of DNN based speech enhancement algorithms. The first method uses the empirical mean of multiple stochastic passes through a DNN-MC dropout model trained on multiple noises to obtain the enhanced output. Our experiments show that this technique results in better enhancement performance, especially under unseen noise and SNR conditions. The second method examines the potential of the model uncertainty as an estimate of the squared error (SE), for frame-wise selection of one out of multiple DNN models. We devise a method based on a threshold µ on the predictive variance to switch between classifier based and predictive variance based model selection. We find that this method gives better enhancement performance than classifier based model selection for unseen noises. The main purpose of this work is to examine the effectiveness of MC dropout over standard dropout models; the approach can therefore be implemented on any state-of-the-art system employing dropout.

REFERENCES

[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[2] Y. Lu and P. C. Loizou, "A geometric approach to spectral subtraction," Speech Communication, vol. 50, no. 6, pp. 453–466, 2008.
[3] V. Stahl, A. Fischer, and R. Bippus, "Quantile based noise estimation for spectral subtraction and Wiener filtering," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, 2000, pp. 1875–1878.
[4] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[5] R. Martin, "Speech enhancement based on minimum mean-square error estimation and supergaussian priors," IEEE Trans. Speech and Audio Processing, vol. 13, no. 5, pp. 845–856, 2005.
[6] J. S. Erkelens, R. C. Hendriks, R. Heusdens, and J. Jensen, "Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 6, pp. 1741–1752, 2007.
[7] B. Yegnanarayana, C. Avendano, H. Hermansky, and P. S. Murthy, "Speech enhancement using linear prediction residual," Speech Communication, vol. 28, no. 1, pp. 25–42, 1999.
[8] W. Jin and M. S. Scordilis, "Speech enhancement by residual domain constrained optimization," Speech Communication, vol. 48, no. 10, pp. 1349–1364, 2006.
[9] P. Krishnamoorthy and S. M. Prasanna, "Enhancement of noisy speech by temporal and spectral processing," Speech Communication, vol. 53, no. 2, pp. 154–174, 2011.
[10] S. Srinivasan, J. Samuelsson, and W. B. Kleijn, "Codebook driven short-term predictor parameter estimation for speech enhancement," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 163–176, 2006.
[11] Y. Ephraim, "A Bayesian estimation approach for speech enhancement using hidden Markov models," IEEE Trans. Signal Processing, vol. 40, no. 4, pp. 725–735, 1992.
[12] H. Sameti, H. Sheikhzadeh, L. Deng, and R. L. Brennan, "HMM-based strategies for enhancement of speech signals embedded in nonstationary noise," IEEE Trans. Speech and Audio Processing, vol. 6, no. 5, pp. 445–455, 1998.
[13] S. Tamura, "An analysis of a noise reduction neural network," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), 1989, pp. 2001–2004.
[14] F. Xie and D. Van Compernolle, "A family of MLP based nonlinear spectral estimators for noise reduction," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, 1994, pp. II-53.
[15] E. A. Wan and A. T. Nelson, "Networks for speech enhancement," in Handbook of Neural Networks for Speech Processing. Boston, USA: Artech House, vol. 139, p. 1, 1999.
[16] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[17] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[18] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," in Proc. 13th Annual Conf. of the International Speech Communication Association (Interspeech), 2012.
[19] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
[20] Y. Wang and D. Wang, "Towards scaling up classification-based speech separation," IEEE Trans. Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381–1390, 2013.
[21] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
[22] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in Proc. Int. Conf. Machine Learning (ICML), 2016, pp. 1050–1059.
[23] G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8609–8613.
[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[25] A. Kendall and R. Cipolla, "Modelling uncertainty in deep learning for camera relocalization," in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2016, pp. 4762–4769.
[26] P. M. Nazreen and A. G. Ramakrishnan, "DNN based speech enhancement for unseen noises using Monte Carlo dropout," arXiv preprint arXiv:1806.00516, 2018.
[27] P. Papadopoulos, A. Tsiartas, and S. Narayanan, "Long-term SNR estimation of speech signals in known and unknown channel conditions," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2495–2506, 2016.
[28] Z.-Q. Wang, Y. Zhao, and D. Wang, "Phoneme-specific speech separation," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2016.
[29] P. M. Nazreen, A. G. Ramakrishnan, and P. K. Ghosh, "A class-specific speech enhancement for phoneme recognition: A dictionary learning approach," in Proc. Interspeech, 2016, pp. 3728–3732.
[30] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, p. 27403, Feb. 1993.
[31] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition II: NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247–251, Jul. 1993. [Online]. Available: http://dx.doi.org/10.1016/0167-6393(93)90095-3
[32] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015, pp. 1–13.
[33] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008.
