Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure


Authors: Morten Kolbæk, Zheng-Hua Tan, Jesper Jensen

MONAURAL SPEECH ENHANCEMENT USING DEEP NEURAL NETWORKS BY MAXIMIZING A SHORT-TIME OBJECTIVE INTELLIGIBILITY MEASURE

Morten Kolbæk, Zheng-Hua Tan, Jesper Jensen
Department of Electronic Systems, Aalborg University, Aalborg, Denmark
{mok,zt,jje}@es.aau.dk

ABSTRACT

In this paper we propose a Deep Neural Network (DNN) based Speech Enhancement (SE) system that is designed to maximize an approximation of the Short-Time Objective Intelligibility (STOI) measure. We formalize an approximate-STOI cost function, derive analytical expressions for the gradients required for DNN training, and show that these gradients have desirable properties when used together with gradient-based optimization techniques. We show through simulation experiments that the proposed SE system achieves large improvements in estimated speech intelligibility when tested on matched and unmatched natural noise types at multiple signal-to-noise ratios. Furthermore, we show that the SE system, when trained using an approximate-STOI cost function, performs on par with a system trained with a mean square error cost applied to short-time temporal envelopes. Finally, we show that the proposed SE system performs on par with a traditional DNN based Short-Time Spectral Amplitude (STSA) SE system in terms of estimated speech intelligibility. These results are important because they suggest that traditional DNN based STSA SE systems might be optimal in terms of estimated speech intelligibility.

Index Terms — Speech Enhancement, Deep Neural Networks, Speech Intelligibility, Speech Denoising, Deep Learning.

1. INTRODUCTION

Design and development of Speech Enhancement (SE) algorithms capable of improving speech quality and intelligibility has been a long-lasting goal in both academia and industry [1, 2]. Such algorithms are useful for a wide range of applications, e.g. mobile communication devices and hearing assistive devices [1].
Despite a large research effort for more than 30 years [1–3], modern single-microphone SE algorithms still perform unsatisfactorily in the complex acoustic environments which users of e.g. hearing assistive devices are exposed to on a daily basis, such as traffic noise, cafeteria noise, or competing speakers.

Traditionally, SE algorithms have been divided into at least two groups: statistical-model based techniques and data-driven techniques. The first group encompasses techniques such as spectral subtraction, the Wiener filter, and the short-time spectral amplitude minimum mean square error estimator [1–3]. These techniques make statistical assumptions about the probability distributions of the speech and noise signals that enable them to suppress the noise-dominated time-frequency regions of the noisy speech signal. In particular, for stationary noise types these algorithms may perform well in terms of speech quality, but in general they do not improve speech intelligibility [4–6]. The second group encompasses data-driven or machine learning techniques, e.g. based on non-negative matrix factorization [7], support vector machines [8], and Deep Neural Networks (DNNs) [9, 10]. These techniques make no statistical assumptions. Instead, they learn to suppress noise by observing a large number of representative pairs of noisy and noise-free speech signals in a supervised learning process. SE algorithms based on DNNs can, to some extent, improve speech intelligibility for hearing impaired and normal hearing people in noisy conditions, if sufficient a priori knowledge is available, e.g. the identity of the speaker or the noise type [11–13].

Although the techniques mentioned above are fundamentally different, they typically share at least two common properties.
First, they often aim to minimize a Mean Square Error (MSE) cost function, and secondly, they operate on short frames (≈ 20–30 ms) in the Short-Time discrete Fourier Transform (STFT) domain [1, 2]. However, it is well known [2, 14] that the human auditory system has a non-linear frequency sensitivity, which is often approximated using e.g. a Gammatone or a one-third octave filter bank [2]. Furthermore, it is known that preservation of modulation frequencies below 7 Hz is critical for speech intelligibility [14, 15]. This suggests that SE algorithms aimed at the human auditory system could benefit from incorporating such information. Numerous works exist, e.g. [10, 16–23] and [1, Sec. 2.2.3] and the references therein, where SE algorithms have been designed with perceptual aspects in mind. However, although these algorithms do take some perceptual aspects into account, they do not directly optimize for speech intelligibility.

In this paper we propose an SE system that maximizes an objective speech intelligibility estimator. Specifically, we design a DNN based SE system that maximizes an approximation of the Short-Time Objective Intelligibility (STOI) [24] measure. The STOI measure has been found to be highly correlated with intelligibility as measured in human listening tests [2, 24]. We derive analytical expressions for the gradients required for the DNN weight updates during training and use these closed-form expressions to identify desirable properties of the approximate-STOI cost function. Finally, we study the potential performance gap between the proposed approximate-STOI cost function and a classical MSE cost function. We note that our goal is not to achieve state-of-the-art STOI improvements per se, but rather to study and compare the proposed approximate-STOI based SE system to existing DNN based enhancement schemes.
Further improvements may straightforwardly be achieved with larger datasets and more complex models like long short-term memory recurrent, or convolutional, neural networks [25].

2. SPEECH ENHANCEMENT SYSTEM

In the following we introduce the approximate-STOI measure and present the DNN framework used to maximize it. Finally, we discuss techniques used to reconstruct the enhanced and approximate-STOI optimal speech signal in the time domain.

2.1. Approximating Short-Time Objective Intelligibility

Let x[n] be the n-th sample of the clean time-domain speech signal and let a noisy observation y[n] be defined as

  y[n] = x[n] + z[n],    (1)

where z[n] is an additive noise sample. Furthermore, let x(k, m) and y(k, m), k = 1, ..., K/2 + 1, m = 1, ..., M, be the single-sided magnitude spectra of the K-point Short-Time discrete Fourier Transforms (STFT) of x[n] and y[n], respectively, where M is the number of STFT frames. Also, let x̂(k, m) be an estimate of x(k, m) obtained as x̂(k, m) = ĝ(k, m) y(k, m), where ĝ(k, m) is an estimated gain value. In this study we use a 10 kHz sample frequency and a 256-point STFT, i.e. K = 256, with a Hann window size of 256 samples (25.6 ms) and a 128-sample frame shift (12.8 ms).

Similarly to STOI [24], we define a short-time temporal envelope vector of the j-th one-third octave band for the clean speech signal as

  x_{j,m} = [X_j(m-N+1), X_j(m-N+2), \ldots, X_j(m)]^T,    (2)

where

  X_j(m) = \sqrt{ \sum_{k=k_1(j)}^{k_2(j)-1} x(k, m)^2 },    (3)

and k_1(j) and k_2(j) denote the first and last STFT bin index of the j-th one-third octave band, respectively. Similarly, we define y_{j,m} and Y_j(m) for the noisy observation.
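As a concrete illustration, the band energies of Eq. (3) and the envelope vectors of Eq. (2) can be sketched as follows. This is our own helper (the function name and array bookkeeping are not from the paper), and the band edges k_1(j), k_2(j) are assumed to be supplied by a one-third octave filter bank definition:

```python
import numpy as np

def third_octave_envelopes(mag, band_edges, N=30):
    """Sketch of Eqs. (2)-(3): one-third octave band energies and
    short-time temporal envelope vectors.

    mag        : (K/2 + 1, M) single-sided STFT magnitude spectrogram
    band_edges : list of (k1, k2) STFT bin index pairs, one per band j
    N          : frames per envelope vector (N = 30 frames ~ 384 ms)
    """
    M = mag.shape[1]
    # Eq. (3): X_j(m) = sqrt(sum_{k=k1(j)}^{k2(j)-1} x(k, m)^2)
    X = np.stack([np.sqrt(np.sum(mag[k1:k2, :] ** 2, axis=0))
                  for (k1, k2) in band_edges])                 # shape (J, M)
    # Eq. (2): x_{j,m} = [X_j(m-N+1), ..., X_j(m)]^T, defined for m >= N
    env = np.stack([X[:, m - N + 1:m + 1] for m in range(N - 1, M)])
    return X, env   # env[i, j, :] is the envelope of band j ending at frame i+N-1
```

In this layout, one training target/input pair for the DNN of Sec. 2.2 is a single slice `env[i, j, :]` from the clean and noisy spectrograms, respectively.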
Also, let x̂_{j,m} = diag(ĝ_{j,m}) y_{j,m} be the short-time temporal one-third octave band envelope vector of the enhanced speech signal, where ĝ_{j,m} is a gain vector defined in the j-th one-third octave band and diag(ĝ_{j,m}) is a diagonal matrix with the elements of ĝ_{j,m} on the main diagonal. We use N = 30, such that the short-time temporal one-third octave band envelope vectors span a duration of 384 ms, which ensures that important modulation frequencies are captured [24]. In total, J = 15 one-third octave bands are used, with the first band having a center frequency of 150 Hz and the last one of approximately 3.8 kHz. These frequencies are chosen such that they span the frequency range in which human speech normally lies [24]. For mathematical tractability, we discard the clipping step¹ otherwise performed by STOI [24], and define the approximated STOI measure as

  \mathcal{L}(x_{j,m}, \hat{x}_{j,m}) = \frac{(x_{j,m} - \mu_{x_{j,m}})^T (\hat{x}_{j,m} - \mu_{\hat{x}_{j,m}})}{\| x_{j,m} - \mu_{x_{j,m}} \| \, \| \hat{x}_{j,m} - \mu_{\hat{x}_{j,m}} \|},    (4)

where ‖·‖ is the Euclidean ℓ2-norm and μ_{x_{j,m}} and μ_{x̂_{j,m}} are the sample means of x_{j,m} and x̂_{j,m}, respectively. Obviously, L(x_{j,m}, x̂_{j,m}) is simply the Envelope Linear Correlation (ELC) between the vectors x_{j,m} and x̂_{j,m}.

¹ It has been observed empirically that omitting the clipping step most often does not affect the performance of STOI, e.g. [20, 26–28].

2.2. Maximizing the Approximated STOI Measure using DNNs

The approximated STOI measure given by Eq. (4) is defined in a one-third octave band domain, and our goal is to find x̂_{j,m} = diag(ĝ_{j,m}) y_{j,m} such that Eq. (4) is maximized, i.e. to find an optimal gain vector ĝ_{j,m}. In this study we estimate these optimal gains using DNNs. Specifically, we use Eq. (4) as a cost function and train multiple feed-forward DNNs, one for each one-third octave band, to estimate gain vectors ĝ_{j,m}, such that the approximated
STOI measure is maximized. For the remainder of this paragraph we omit the subscripts j and m for convenience.

Most modern deep learning toolkits, e.g. Microsoft Cognitive Toolkit (CNTK) [29], perform automatic differentiation, which allows one to train a DNN with a custom cost function without the need to compute the gradients of the cost function explicitly [25]. Nevertheless, when working with cost functions that have not yet been exhaustively studied, such as the approximated STOI measure, an analytic expression of the gradient can be valuable for studying important properties, such as the gradient ℓ2-norm. It can be shown (details omitted due to space limitations) that the gradient of Eq. (4) with respect to the desired signal vector x̂ is given by

  \nabla \mathcal{L}(x, \hat{x}) = \left[ \frac{\partial \mathcal{L}(x, \hat{x})}{\partial \hat{x}_1}, \frac{\partial \mathcal{L}(x, \hat{x})}{\partial \hat{x}_2}, \ldots, \frac{\partial \mathcal{L}(x, \hat{x})}{\partial \hat{x}_N} \right]^T,    (5)

where

  \frac{\partial \mathcal{L}(x, \hat{x})}{\partial \hat{x}_m} = \mathcal{L}(x, \hat{x}) \frac{x_m - \mu_x}{(\hat{x} - \mu_{\hat{x}})^T (x - \mu_x)} - \mathcal{L}(x, \hat{x}) \frac{\hat{x}_m - \mu_{\hat{x}}}{(\hat{x} - \mu_{\hat{x}})^T (\hat{x} - \mu_{\hat{x}})}    (6)

is the partial derivative of L(x, x̂) with respect to entry m of x̂. Furthermore, it can be shown that the ℓ2-norm of the gradient, as formulated by Eqs. (5) and (6), is given by

  \| \nabla \mathcal{L}(x, \hat{x}) \| = \sqrt{1 - \mathcal{L}(x, \hat{x})^2} \, \| \hat{x} \|^{-1}.    (7)

[Fig. 1: ℓ2-norm of the gradient in Eq. (5) as a function of the cost function value.]

The norm is shown in Fig. 1 as a function of L(x, x̂) for the complete range [−1, 1], and for ‖x̂‖ = 1. We see from Fig. 1 that the ℓ2-norm of the gradient is a concave function with a global maximum at L(x, x̂) = 0 and is symmetric around zero. We also observe that ‖∇L(x, x̂)‖ decreases monotonically as |L(x, x̂)| increases, with ‖∇L(x, x̂)‖ = 0 when x and x̂ are either perfectly correlated or perfectly anti-correlated.
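To make Eqs. (4)–(6) concrete, the sketch below (our own helper names, not from the paper) implements the ELC cost and its analytic gradient, and verifies the gradient against central finite differences. Note that because x − μ_x and x̂ − μ_x̂ are zero-mean, Eq. (6) coincides with the full gradient even when the dependence of the sample mean μ_x̂ on x̂ is taken into account:

```python
import numpy as np

def elc(x, x_hat):
    """Eq. (4): Envelope Linear Correlation between x and x_hat."""
    a, b = x - x.mean(), x_hat - x_hat.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def elc_grad(x, x_hat):
    """Eqs. (5)-(6): analytic gradient of the ELC w.r.t. x_hat."""
    a, b = x - x.mean(), x_hat - x_hat.mean()
    L = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return L * (a / (b @ a) - b / (b @ b))

rng = np.random.default_rng(0)
x, x_hat = rng.standard_normal(30), rng.standard_normal(30)   # N = 30

# Central finite-difference check of Eqs. (5)-(6)
eps = 1e-6
num = np.array([(elc(x, x_hat + eps * e) - elc(x, x_hat - eps * e)) / (2 * eps)
                for e in np.eye(30)])
assert np.allclose(elc_grad(x, x_hat), num, atol=1e-6)

# Gradient norm, cf. Eq. (7), with the sample mean removed from x_hat
L = elc(x, x_hat)
assert np.isclose(np.linalg.norm(elc_grad(x, x_hat)),
                  np.sqrt(1 - L ** 2) / np.linalg.norm(x_hat - x_hat.mean()))
```

In practice, as noted below, SGD is applied to −L(x, x̂), whose gradient is simply the negative of `elc_grad`.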
Since ‖∇L(x, x̂)‖ is large when x and x̂ are uncorrelated, zero when they are perfectly correlated, and ‖∇L(x, x̂)‖ ≠ 0 otherwise, Eq. (4) is well suited as a cost function for gradient-based optimization techniques such as Stochastic Gradient Descent (SGD) [25], since it guarantees non-zero step lengths for all inputs during optimization except at the optimal solution. In practice, to apply SGD we minimize −L(x, x̂).

2.3. Reconstructing Approximate-STOI Optimal Speech

When a gain vector ĝ_{j,m} has been estimated by a DNN, the enhanced speech envelope in the one-third octave band domain can be computed as x̂_{j,m} = diag(ĝ_{j,m}) y_{j,m}. However, what we are really interested in is x̂(k, m), i.e. the estimated speech signal in the STFT domain, since x̂(k, m) can straightforwardly be transformed into the time domain using the overlap-and-add technique [2]. We therefore seek a mapping from the gain vector ĝ_{j,m}, estimated in the one-third octave band domain, to the gain ĝ(k, m) for a single STFT coefficient. To do so, let ĝ_j(m) denote the gain value estimated by a DNN to be applied to the noisy one-third octave band amplitude in frame m. We can then derive the relationship between the gain value ĝ_j(m) ≥ 0 in the one-third octave band and the corresponding gain values ĝ(k, m) ≥ 0 in the STFT domain as

  \hat{X}_j(m) = \hat{g}_j(m) Y_j(m) = \sqrt{ \sum_{k=k_1(j)}^{k_2(j)-1} (\hat{g}(k, m) y(k, m))^2 }.    (8)

One solution to Eq. (8) is

  \hat{g}(k, m) = \hat{g}_j(m), \quad k = k_1(j), \ldots, k_2(j) - 1.    (9)

Generally, the solution in Eq. (9) is not unique; many choices of ĝ(k, m) exist that give rise to the same estimated one-third octave band amplitude X̂_j(m) (and hence the same value of L(x, x̂)). We choose, for convenience, a uniform gain across the STFT coefficients within a one-third octave band.
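The uniform-gain solution of Eq. (9) can be sketched as follows. The function is our own illustration; in particular, leaving STFT bins not covered by any of the J bands at unit gain is an assumption, since the paper does not specify how out-of-band bins are treated:

```python
import numpy as np

def expand_band_gains(g_band, band_edges, n_bins):
    """Eq. (9): map per-band gains g_j(m) to per-bin STFT gains g(k, m)
    by applying a uniform gain across the bins of each one-third octave band.

    g_band     : (J, M) DNN-estimated gains, one per band and frame
    band_edges : list of (k1, k2) STFT bin index pairs, one per band j
    n_bins     : number of single-sided STFT bins, K/2 + 1
    """
    J, M = g_band.shape
    g = np.ones((n_bins, M))          # assumption: out-of-band bins left at unity
    for j, (k1, k2) in enumerate(band_edges):
        g[k1:k2, :] = g_band[j, :]    # uniform gain within band j, Eq. (9)
    return g
```

Applying the resulting per-bin gains to the noisy magnitudes reproduces the estimated band amplitude X̂_j(m) of Eq. (8), since the common factor ĝ_j(m) can be pulled out of the square root.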
Since envelope estimates X̂_j(m) are computed for successive values of m, N estimates exist for each X̂_j(m), and these are averaged during enhancement. When reconstructing the enhanced speech signal in the time domain, we use the overlap-and-add technique with the phase of the noisy STFT coefficients [2].

3. EXPERIMENTAL DESIGN

To evaluate the performance of the approximate-STOI optimal DNN based SE system, we have conducted a series of experiments involving multiple matched and unmatched noise types at various SNRs.

3.1. Noisy Speech Mixtures

The clean speech signals used for training all models are from the Wall Street Journal corpus [30]. The utterances used for training and validation are generated by randomly selecting utterances from 44 male and 47 female speakers from the WSJ0 training set entitled si_tr_s. The training and validation sets consist of 20000 and 2000 utterances, respectively, which is equivalent to approximately 37 hours of training data and 4 hours of validation data. The test set is similarly generated using utterances from 16 speakers from the WSJ0 validation set si_dt_05 and evaluation set si_et_05, and consists of 1000 mixtures, or approximately 2 hours of data; see [31] for further details. Notice that the speakers in the test set are different from the speakers in the validation and training sets.

We use six different noise types: two synthetic signals and four noise signals recorded in real life. The synthetic noise signals encompass a stationary Speech Shaped Noise (SSN) signal and a highly non-stationary 6-speaker Babble (BBL) noise. For real-life noise signals we use the street (STR), cafeteria (CAF), bus (BUS), and pedestrian (PED) noise signals from the CHiME3 dataset [32]. The SSN noise signal is Gaussian white noise shaped according to the long-term spectrum of the TIMIT corpus [33]. Similarly, the BBL noise signal is constructed by mixing utterances from TIMIT.
Further details on the design of the SSN and BBL noise signals can be found in [13]. All noise signals are split into non-overlapping sequences: a 40 min training sequence, a 5 min validation sequence, and a 5 min test sequence, i.e. there is no overlap between the noise sequences used for training, validation, and test.

The noisy speech signals used for training and testing are constructed using Eq. (1), where a clean speech signal x[n] is added to a noise sequence z[n] of equal length. To achieve a certain SNR, the noise signal is scaled based on the active speech level of the clean speech signal as per ITU P.56 [34]. The SNRs used for the training and validation sets are chosen uniformly from [−5, 10] dB. The SNR range is chosen to ensure that SNRs are included where intelligibility ranges from degraded to perfectly intelligible.

Table 1. Training conditions for the different SE systems.

  ID:    S0   S1   S2   S3   S4   S5    S6    S7    S8    S9
  Cost:  ELC  ELC  ELC  ELC  ELC  EMSE  EMSE  EMSE  EMSE  EMSE
  Noise: SSN  BBL  CAF  STR  ALL  SSN   BBL   CAF   STR   ALL

3.2. Model Architecture and Training

To evaluate the performance of the proposed SE system, a total of ten systems, identified as S0–S9, have been trained using different cost functions and noise types as presented in Table 1. Five systems (S0–S4) have been trained using the ELC loss from Eq. (4) and five systems (S5–S9) have been trained using a standard MSE loss, denoted Envelope MSE (EMSE), since it operates on short-time temporal one-third octave band envelope vectors. This is to investigate the potential performance difference between models trained with an approximate-STOI loss and models trained with the commonly used MSE loss. Eight systems (S0–S3 and S5–S8) are trained as noise type specific systems, i.e. they are trained using only a single noise type. Two systems (S4 and S9) are trained as noise type general systems, i.e. they are trained on all noise types (Noise: "ALL" in Table 1).
This is to investigate the performance drop, if any, when a single system is trained to handle multiple noise types.

Each DNN consists of three hidden layers with 512 units using ReLU activation functions, and a sigmoid output layer. The DNNs are trained using SGD with the backpropagation technique and batch normalization [25]. The DNNs are trained for a maximum of 200 epochs with a minibatch size of 256 randomly selected short-time temporal one-third octave band envelope vectors, and the learning rates were initially set to 0.01 and 5·10⁻⁵ per sample for S0–S4 and S5–S9, respectively. The learning rates were scaled down by 0.7 whenever the training cost increased on the validation set, and training was terminated when the learning rate fell below 10⁻¹⁰. The different learning rates for the systems trained with the ELC cost function and those trained with the EMSE cost function were found from preliminary experiments. All models were implemented using CNTK [29], and the script files needed to reproduce the reported results can be found in [31].

4. EXPERIMENTAL RESULTS

We have evaluated the performance of the ten systems based on their average ELC and STOI scores computed on the test set. The STOI score is computed using the enhanced and reconstructed time-domain speech signal, whereas the ELC score is computed using short-time one-third octave band temporal envelope vectors.

4.1. Matched and Unmatched Noise Type Experiments

In Table 2 we compare the ELC scores for the noise type specific systems trained using the ELC (S0–S4) and EMSE (S5–S9) cost functions, tested in matched noise-type conditions (SSN, BBL, CAF, and STR) at input SNRs of −5, 0, and 5 dB. Results covering the SNR range from −10 to 20 dB can be found in [31].
All models achieve large improvements in ELC, with an average improvement of approximately 0.15–0.20 across all SNRs and noise types, compared to the ELC score of the noisy, unprocessed signals (denoted UP. in Tables 2 to 4). We also see that, as expected, models trained with the ELC cost function (S0–S4) in general achieve similar or slightly higher ELC scores compared to the models trained with EMSE (S5–S9). In Table 3 we report the STOI scores for the systems in Table 2 tested in identical conditions. We see moderate to large improvements in STOI in all conditions, with an average improvement of 0.07–0.13. We also observe that the systems trained with the EMSE cost function achieve similar improvements in STOI as the systems trained with the ELC cost function.

Table 2. ELC results for S0–S9 tested with SSN, BBL, CAF, and STR.

  SSN:  SNR [dB]   UP.    S0 (ELC)   S5 (EMSE)   S4 (ELC)   S9 (EMSE)
        -5         0.36   0.66       0.65        0.64       0.63
        0          0.52   0.77       0.76        0.75       0.74
        5          0.66   0.82       0.81        0.80       0.79
        Avg.       0.51   0.75       0.74        0.73       0.72

  BBL:  SNR [dB]   UP.    S1 (ELC)   S6 (EMSE)   S4 (ELC)   S9 (EMSE)
        -5         0.34   0.50       0.51        0.48       0.48
        0          0.50   0.69       0.69        0.67       0.67
        5          0.64   0.78       0.77        0.77       0.77
        Avg.       0.49   0.66       0.66        0.64       0.64

  CAF:  SNR [dB]   UP.    S2 (ELC)   S7 (EMSE)   S4 (ELC)   S9 (EMSE)
        -5         0.43   0.61       0.59        0.58       0.58
        0          0.57   0.73       0.71        0.72       0.70
        5          0.68   0.79       0.78        0.79       0.77
        Avg.       0.56   0.71       0.69        0.70       0.68

  STR:  SNR [dB]   UP.    S3 (ELC)   S8 (EMSE)   S4 (ELC)   S9 (EMSE)
        -5         0.45   0.70       0.68        0.68       0.66
        0          0.58   0.78       0.76        0.77       0.75
        5          0.69   0.82       0.80        0.81       0.79
        Avg.       0.57   0.77       0.75        0.75       0.73

Table 3. STOI results for S0–S9 tested with SSN, BBL, CAF, and STR.

  SSN:  SNR [dB]   UP.    S0 (ELC)   S5 (EMSE)   S4 (ELC)   S9 (EMSE)
        -5         0.61   0.78       0.78        0.76       0.76
        0          0.74   0.88       0.88        0.87       0.87
        5          0.85   0.93       0.93        0.92       0.92
        Avg.       0.73   0.86       0.86        0.85       0.85

  BBL:  SNR [dB]   UP.    S1 (ELC)   S6 (EMSE)   S4 (ELC)   S9 (EMSE)
        -5         0.59   0.66       0.67        0.65       0.65
        0          0.72   0.82       0.82        0.81       0.81
        5          0.83   0.90       0.90        0.89       0.90
        Avg.       0.71   0.79       0.80        0.78       0.79

  CAF:  SNR [dB]   UP.    S2 (ELC)   S7 (EMSE)   S4 (ELC)   S9 (EMSE)
        -5         0.67   0.76       0.76        0.75       0.75
        0          0.78   0.86       0.86        0.85       0.86
        5          0.87   0.91       0.92        0.91       0.92
        Avg.       0.77   0.84       0.85        0.84       0.84

  STR:  SNR [dB]   UP.    S3 (ELC)   S8 (EMSE)   S4 (ELC)   S9 (EMSE)
        -5         0.68   0.81       0.82        0.80       0.80
        0          0.78   0.88       0.89        0.88       0.88
        5          0.87   0.92       0.93        0.92       0.92
        Avg.       0.78   0.87       0.88        0.87       0.87

Table 4. ELC and STOI for S4 and S9 tested with BUS and PED.

                 ELC                             STOI
                 BUS              PED            BUS              PED
  SNR [dB]   UP.   S4    S9   UP.   S4    S9   UP.   S4    S9   UP.   S4    S9
  -5         0.56  0.71  0.68 0.35  0.55  0.53 0.77  0.84  0.84 0.60  0.71  0.71
  0          0.66  0.79  0.76 0.50  0.70  0.68 0.85  0.90  0.90 0.72  0.83  0.83
  5          0.74  0.83  0.81 0.64  0.78  0.76 0.91  0.94  0.94 0.83  0.90  0.90
  Avg.       0.65  0.78  0.75 0.50  0.68  0.66 0.84  0.89  0.89 0.72  0.81  0.81

In Table 4, the ELC and STOI scores for the noise type general systems (S4 and S9) tested with the unmatched BUS and PED noise types are summarized. We see average improvements in the order of 0.10–0.18 in terms of ELC and 0.05–0.09 in terms of STOI. We also see that the performance gap between the S4 system (trained with the ELC cost function) and the S9 system (trained with the EMSE cost function) is small, and that noise specific systems perform slightly better than the noise general ones. The results in Tables 2 to 4 are interesting since they show roughly identical global behavior, as measured by ELC and STOI, for systems trained with the ELC and EMSE cost functions.

4.2. Gain Similarities Between ELC and EMSE Based Systems

We now study to which extent ELC and EMSE based systems behave similarly on a more detailed level. Specifically, we compute correlation coefficients between the gain vectors produced by each of the two types of systems for the SSN, BBL, and STR noise types, and summarize them in Table 5. In Table 5 we observe that high sample correlations (> 0.90) are achieved for all noise types and both SNRs, which indicates that the gains produced by a system trained with the ELC cost function are quite similar to the gains produced by a system trained with the EMSE cost function. This supports the findings in Sec. 4.1.
Similar conclusions can be drawn for the remaining noise types (results omitted due to space limitations; see [31]).

4.3. Approximate-STOI Optimal DNN vs. Classical SE DNN

As a final study we compare the performance of an approximate-STOI optimal DNN based SE system with classical Short-Time Spectral Amplitude (STSA) DNN based enhancement systems that estimate ĝ(k, m) directly for each STFT frame (see e.g. [35, 36]). Similarly to S0–S9, these systems are three-layered feed-forward DNNs and use 30 STFT frames as input; but, differently from S0–S9, they minimize the MSE between STFT magnitude spectra, i.e. across frequency. The DNNs estimate five STFT frames per time step, and overlapping frames are averaged to construct the final gain. We have trained two of these classical systems, with 512 units and 4096 units, respectively, in each hidden layer, using the BBL noise corrupted training set. The results are presented in Table 6.

Table 5. Sample linear correlation between gain vectors.

  SNR [dB]   SSN    BBL    STR
  -5         0.93   0.91   0.92
  5          0.94   0.96   0.92

Table 6. STOI score for classical DNN, tested with BBL.

  SNR [dB]   UP.    512 units   4096 units
  -5         0.59   0.64        0.66
  5          0.83   0.91        0.92

From Table 6 we see, for example, that such classical STSA-DNN based SE systems trained and tested with BBL noise achieve a maximum STOI score of 0.66 at an input SNR of −5 dB, which is equivalent to the STOI score of 0.66 achieved by S1 in Table 3. We also see that the classical system performs on par with S1 at an input SNR of 5 dB, with a STOI score of 0.92 compared to 0.90 achieved by S1. Although surprising, this is an interesting result, since it indicates that no improvement in STOI can be gained by a DNN based SE system that is designed to maximize an approximate-STOI measure using short-time temporal one-third octave band envelope vectors.
The important implication of this is that traditional STSA-DNN based SE systems may be close to optimal from an estimated speech intelligibility perspective.

5. CONCLUSION

In this paper we proposed a Speech Enhancement (SE) system based on Deep Neural Networks (DNNs) that optimizes an approximation of the Short-Time Objective Intelligibility (STOI) estimator. We proposed an approximate-STOI cost function and derived closed-form expressions for the required gradients. We showed that DNNs designed to maximize approximate-STOI achieve large improvements in STOI when tested in matched and unmatched noise types at various SNRs. We also showed that approximate-STOI optimal systems do not outperform systems that minimize a mean square error cost. Finally, we showed that approximate-STOI DNN based SE systems perform on par with classical DNN based SE systems. Our findings suggest that a potential speech intelligibility gain of approximate-STOI optimal systems over MSE based systems is modest at best.

6. REFERENCES

[1] R. C. Hendriks, T. Gerkmann, and J. Jensen, "DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement: A Survey of the State of the Art," Synth. Lect. on Speech and Audio Process., vol. 9, no. 1, pp. 1–80, Jan. 2013.
[2] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Sig. Process., vol. 32, no. 6, pp. 1109–1121, 1984.
[4] Y. Hu and P. C. Loizou, "A comparative intelligibility study of single-microphone noise reduction algorithms," J. Acoust. Soc. Am., vol. 122, no. 3, pp. 1777–1786, Sep. 2007.
[5] H. Luts et al., "Multicenter evaluation of signal enhancement algorithms for hearing aids," J. Acoust. Soc. Am., vol. 127, no. 3, pp. 1491–1505, 2010.
[6] J. Jensen and R.
Hendriks, "Spectral Magnitude Minimum Mean-Square Error Estimation Using Binary and Continuous Gain Functions," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 92–102, Jan. 2012.
[7] E. M. Grais and H. Erdogan, "Single channel speech music separation using nonnegative matrix factorization and spectral masks," in Proc. ICDSP, 2011, pp. 1–6.
[8] Y. Wang and D. Wang, "Towards Scaling Up Classification-Based Speech Separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013.
[9] Y. Xu et al., "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 1, pp. 7–19, Jan. 2015.
[10] E. W. Healy et al., "An algorithm to increase speech intelligibility for hearing-impaired listeners in novel segments of the same noise type," J. Acoust. Soc. Am., vol. 138, no. 3, pp. 1660–1669, 2015.
[11] J. Chen et al., "Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises," J. Acoust. Soc. Am., vol. 139, no. 5, pp. 2604–2612, 2016.
[12] E. W. Healy et al., "An algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker," J. Acoust. Soc. Am., vol. 141, no. 6, pp. 4230–4239, 2017.
[13] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 1, pp. 153–167, 2017.
[14] B. Moore, An Introduction to the Psychology of Hearing, 6th ed. Brill, 2013.
[15] T. M. Elliott and F. E. Theunissen, "The Modulation Transfer Function for Speech Intelligibility," PLOS Computational Biology, vol. 5, no. 3, p. e1000302, Mar. 2009.
[16] Y. Hu and P. C. Loizou, "A perceptually motivated approach for speech enhancement," IEEE Trans. Speech, Audio Process.
, vol. 11, no. 5, pp. 457–465, 2003.
[17] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Sig. Process., vol. 33, no. 2, pp. 443–445, 1985.
[18] N. Virag, "Single channel speech enhancement based on masking properties of the human auditory system," IEEE Trans. Speech, Audio Process., vol. 7, no. 2, pp. 126–137, 1999.
[19] P. C. Loizou, "Speech Enhancement Based on Perceptually Motivated Bayesian Estimators of the Magnitude Spectrum," IEEE Trans. Speech, Audio Process., vol. 13, no. 5, pp. 857–869, 2005.
[20] L. Lightburn and M. Brookes, "SOBM - a binary mask for noisy speech that optimises an objective intelligibility metric," in Proc. ICASSP, 2015, pp. 5078–5082.
[21] W. Han et al., "Perceptual weighting deep neural networks for single-channel speech enhancement," in Proc. WCICA, 2016, pp. 446–450.
[22] P. G. Shivakumar and P. Georgiou, "Perception Optimized Deep Denoising AutoEncoders for Speech Enhancement," in Proc. INTERSPEECH, 2016, pp. 3743–3747.
[23] Y. Koizumi et al., "DNN-based source enhancement self-optimized by reinforcement learning using sound quality measurements," in Proc. ICASSP, 2017, pp. 81–85.
[24] C. H. Taal et al., "An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011.
[25] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[26] J. Jensen and C. H. Taal, "An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, 2016.
[27] A. H. Andersen et al., "Predicting the Intelligibility of Noisy and Nonlinearly Processed Binaural Speech," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol.
24, no. 11, pp. 1908–1920, 2016.
[28] C. H. Taal, R. C. Hendriks, and R. Heusdens, "Matching pursuit for channel selection in cochlear implants based on an intelligibility metric," in Proc. EUSIPCO, 2012, pp. 504–508.
[29] A. Agarwal et al., "An introduction to computational networks and the computational network toolkit," Microsoft Technical Report MSR-TR-2014-112, Tech. Rep., 2014.
[30] J. Garofolo et al., "CSR-I (WSJ0) Complete LDC93S6A," 1993, Philadelphia: Linguistic Data Consortium.
[31] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Supplemental Material." [Online]. Available: http://kom.aau.dk/~mok/icassp2018
[32] J. Barker et al., "The third 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines," in Proc. ASRU, 2015.
[33] J. S. Garofolo et al., "DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM," 1993.
[34] ITU, "Rec. P.56: Objective measurement of active speech level," 1993, https://www.itu.int/rec/T-REC-P.56/.
[35] F. Weninger et al., "Discriminatively trained recurrent neural networks for single-channel speech separation," in Proc. GlobalSIP, 2014, pp. 577–581.
[36] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Speech Enhancement using Long Short-Term Memory based Recurrent Neural Networks for Noise Robust Speaker Verification," in Proc. SLT, 2016, pp. 305–311.
