Using recurrences in time and frequency within U-net architecture for speech enhancement


Authors: Tomasz Grzywalski, Szymon Drgas

Tomasz Grzywalski (StethoMe, Poznan, Poland) and Szymon Drgas (Institute of Automation and Robotics, Poznan University of Technology, Poland)

ABSTRACT

When designing a fully-convolutional neural network, there is a trade-off between receptive field size, number of parameters, and spatial resolution of features in the deeper layers of the network. In this work we present a novel network design based on a combination of many convolutional and recurrent layers that resolves these dilemmas. We compare our solution with U-net-based models known from the literature and other baseline models on a speech enhancement task. We test our solution on TIMIT speech utterances combined with noise segments extracted from the NOISEX-92 database and show a clear advantage of the proposed solution in terms of SDR (signal-to-distortion ratio), SIR (signal-to-interference ratio), and STOI (spectro-temporal objective intelligibility) metrics compared to the current state of the art.

Index Terms — deep learning, speech enhancement, U-nets

1. INTRODUCTION

The single-channel speech enhancement problem is to reduce the noise present in a single-channel recording of speech. This technique has many applications: it can be employed as a pre-processing step in a speech or speaker recognition system [1], or to improve speech intelligibility, which is important for example in hearing aids such as cochlear implants [2].

Recently, data-driven approaches have become popular for speech enhancement. In these methods, training data is used to train a model whose aim is to reduce even nonstationary noise. Data-driven methods include approaches based on non-negative matrix factorization [3] and deep neural networks (DNNs) [4]. DNNs are used as a nonlinear transformation of a noisy signal to a clean one (mapping-based targets), or to a filtering mask (i.e.
a time-varying filter in the short-time Fourier transform domain) that can be used to recover speech (masking-based targets).

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Recent work [4] summarizes deep learning methods for speech enhancement. In the early DNN-based methods for speech enhancement, a neural network with fully-connected layers acted as a mapping from a spectrogram fragment of noisy speech (a given spectrogram frame and its context) to a spectrogram frame of enhanced speech [5, 6]. Convolutional neural networks (CNNs) were applied to speech enhancement in [7] by combining convolutional and fully-connected layers to estimate the ideal binary mask (IBM).

In [8] a fully convolutional network was used. In this case an encoder-decoder architecture was employed, where a number of convolutional layers interleaved with max-pooling act as the encoder, and the same number of convolutional layers, but interleaved with upsampling, act as the decoder. The decoder maps the activations (feature map) at the output of the encoder to the magnitude spectrum of enhanced speech. In this case, however, separation quality may be limited, as the max-pooling operation, which is irreversible, reduces the time-frequency resolution of subsequent feature maps.

A similar problem in the domain of medical image segmentation was mitigated in the U-net architecture [9] by using skip connections. In [10] the U-net architecture brought better singing voice separation quality in comparison to a network without skip connections. In the convolutional neural networks used in [11], an architecture similar to U-net was proposed, and the authors showed that removal of max-pooling and upsampling may be beneficial in terms of separation quality. In [12] a recurrent-convolutional architecture was proposed for speech enhancement.
It consists of convolutional layers followed by a bidirectional recurrent component; finally, a fully-connected layer is used to predict the clean spectrogram. The authors obtained improved performance in comparison to a DNN with fully-connected layers only and to recurrent neural networks.

In [13] a recurrent U-net architecture for speech enhancement was proposed. The results suggested that max-pooling in U-net introduces a loss of information at deeper levels, but it is needed to build a large enough receptive field (the context in the input spectrogram on which the corresponding element of the output spectrogram depends). However, recurrent layers can enlarge the receptive field so that max-pooling is no longer needed. On average, the best combination was to build a network without max-pooling and to use recurrent layers to extend the receptive field.

[Fig. 1. General architecture of the proposed U-nets: processing blocks PB 1, ..., PB L arranged in L/2 levels, with concatenation skip connections between same-level encoder and decoder blocks; the spectrogram Y of noisy speech enters PB 1, and the network outputs the estimated speech Ŝ and estimated noise N̂.]

In [14] the ReNet architecture was proposed, which replaces convolution and pooling layers with four recurrent layers that sweep horizontally and vertically in both directions across the image. The use of alternating horizontal and vertical layers scales better with the number of dimensions in comparison to multidimensional RNNs (recurrent neural networks) [15]. Evaluation performed on image data suggests that ReNet gives results comparable to CNNs.

In this work we further develop the idea presented in [13]. We introduce recurrent-convolutional (RC) pairs, which are the blocks from which the proposed network is built. Each output of an RC pair can potentially depend on a bigger context of its input than a convolutional layer, at the cost of a relatively small number of additional parameters.
Moreover, this context can be enhanced at many depths of the encoder and decoder.

2. DESCRIPTION OF THE PROPOSED ARCHITECTURE

The proposed neural network represents a function (Ŝ, N̂) = f(Y; Θ), where Y ∈ R^(B×N) denotes the spectrogram of noisy speech, B denotes the number of frequency channels, and N is the number of spectrogram frames. The neural network is trained to produce at its output matrices Ŝ ∈ R^(B×N) and N̂ ∈ R^(B×N) representing the magnitude spectrograms of clean speech and noise, respectively.

The general structure of the proposed neural network architecture is shown in Figure 1. The network consists of processing blocks PB 1, ..., PB L. Each processing block PB l, where l = 1, ..., L, accepts a feature map (tensor) of dimension B × N × K_in^(l) and outputs a feature map X^(l) ∈ R^(B×N×K_out^(l)), where K_in^(l) and K_out^(l) are the numbers of features in the input and output feature maps, respectively. The input to PB 1 is the input spectrogram Y represented as a tensor with K_in^(1) = 1; the inputs to PB l for l = 2, ..., L/2 + 1 are X^(l−1), while for l = L/2 + 2, ..., L the input is a concatenation of X^(l−1) and the output feature map of the processing block from the encoder at the same level. The concatenation is done along the third dimension of the feature maps.

In the original U-net [9], each processing block comprises convolutional layers and max-pooling or upsampling operations. In this work we propose to use RC pairs, described in the subsequent sections. In RC pairs, recurrences are used to enlarge the receptive field in the time or frequency dimension. In the proposed solution we use RC pairs with alternating dimensions in consecutive processing blocks. An example is shown in Figure 1, where the bidirectional arrows show the dimension in which the context is enhanced (vertical and horizontal arrows correspond to the frequency and time dimensions, respectively).
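The wiring of the processing blocks described above can be sketched in NumPy. This is a toy illustration of the tensor shapes and skip-connection concatenation only: the 1x1 linear mix below stands in for the real processing blocks (the RC pairs are introduced in the next section), and all sizes, names, and the extra output layer are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

B, N = 64, 100          # frequency channels, spectrogram frames
L = 10                  # number of processing blocks (5 levels)

def processing_block(x, k_out, rng):
    """Toy stand-in for PB_l: a 1x1 'convolution' (linear mix of features)
    mapping a B x N x K_in feature map to B x N x k_out. The real blocks
    use RC pairs; this only illustrates the shapes and wiring."""
    k_in = x.shape[2]
    w = rng.standard_normal((k_in, k_out)) * 0.1
    return np.tanh(x @ w)

rng = np.random.default_rng(0)
y = rng.standard_normal((B, N, 1))      # noisy spectrogram Y, K_in^(1) = 1

# Encoder: PB_1 .. PB_{L/2}; outputs are kept for the skip connections.
encoder_outputs = []
x = y
for l in range(L // 2):
    x = processing_block(x, k_out=48, rng=rng)
    encoder_outputs.append(x)

# Decoder: PB_{L/2+1} takes X^(L/2) directly; each later block takes the
# concatenation (along the feature axis) of the previous output and the
# encoder output at the same level.
x = processing_block(x, k_out=48, rng=rng)
for l in range(L // 2 - 1):
    skip = encoder_outputs[L // 2 - 2 - l]      # same-level encoder map
    x = np.concatenate([x, skip], axis=2)       # B x N x (48 + 48)
    x = processing_block(x, k_out=48, rng=rng)

# Final 1x1 linear layer with two outputs: estimated speech and noise.
out = processing_block(x, k_out=2, rng=rng)
s_hat, n_hat = out[..., 0], out[..., 1]
```

With L = 10 this yields five encoder maps, four decoder-side concatenations, and two B × N output spectrograms, matching the wiring of Figure 1.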
In this configuration, for each path between input and output, blocks with enhanced context in the time dimension are interleaved with blocks with enhanced context in the frequency dimension.

2.1. Recurrent-convolutional (RC) pairs

The recurrent-convolutional (RC) pair is shown in Figure 2 (left). Let I ∈ R^(B×N×K_in) be the input feature map to the RC block. In comparison to a processing block with a convolutional layer only (with nonlinearity), in the RC pair the map I is concatenated with additional features R ∈ R^(B×N×K_rec) extracted by the recurrent (BWR) layer, which results in C ∈ R^(B×N×(K_in+K_rec)). Because of this skip connection, even when the number of features extracted by the BWR (K_rec) is small, the context in I on which each feature in the output feature map O depends can be substantially increased.

2.2. Bidirectional weight-sharing recurrences

BWR layers are shown in Figure 2 (right). They consist of a pair of recurrent layers iterating in opposite directions, whose outputs are concatenated along the third dimension. There are two types of BWR layers, time (T) and frequency (F), depending on the spatial dimension in which they iterate over the input feature map I. BWR T accepts on input a sequence of N K_in-dimensional vectors representing a single frequency band b; the same weights are used to iterate over all B frequency bands. BWR F accepts on input a sequence of B K_in-dimensional vectors representing a single time frame n; the same weights are shared to iterate over all N time frames.

[Fig. 2. Left: block diagram of an RC pair (input features I, BWR features R, concatenated features C, CONV, output features O); right: block diagrams of the two types of BWR (top: BWR T, forward/backward GRUs with weights shared across frequency bands; bottom: BWR F, downward/upward GRUs with weights shared across time frames).]
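A minimal NumPy sketch of a BWR T layer and an RC pair follows. For brevity it uses a plain tanh recurrence instead of the GRUs used in the paper, and a 1x1 linear mix instead of the 3x3 convolution; the sizes and function names are illustrative. The point is the weight sharing across frequency bands and the concatenation skip that forms C.

```python
import numpy as np

def bwr_t(x, w_in, w_rec, b):
    """Bidirectional weight-sharing recurrence over the time axis (BWR T).
    x: B x N x K_in feature map. The same weights are applied to every
    frequency band; a simple tanh recurrence stands in for the paper's
    GRUs. Returns B x N x (2 * K_rec)."""
    B, N, _ = x.shape
    k_rec = w_rec.shape[0]

    def sweep(xs):                       # one directional pass over time
        h = np.zeros((B, k_rec))
        out = np.empty((B, N, k_rec))
        for n in range(N):
            # all B frequency bands are updated with the SAME weights
            h = np.tanh(xs[:, n, :] @ w_in + h @ w_rec + b)
            out[:, n, :] = h
        return out

    fwd = sweep(x)
    bwd = sweep(x[:, ::-1, :])[:, ::-1, :]   # backward pass, re-reversed
    return np.concatenate([fwd, bwd], axis=2)

def rc_pair_t(x, rng, k_rec=8, k_out=48):
    """RC pair: concatenate the input map I with the BWR T features R,
    then apply a 1x1 linear 'convolution' (a stand-in for 3x3 conv + ELU)."""
    k_in = x.shape[2]
    w_in = rng.standard_normal((k_in, k_rec)) * 0.1
    w_rec = rng.standard_normal((k_rec, k_rec)) * 0.1
    b = np.zeros(k_rec)
    r = bwr_t(x, w_in, w_rec, b)                 # R: B x N x 2*k_rec
    c = np.concatenate([x, r], axis=2)           # C: the skip connection
    w_conv = rng.standard_normal((c.shape[2], k_out)) * 0.1
    return np.tanh(c @ w_conv)                   # O: B x N x k_out

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 100, 16))           # I: B x N x K_in
o = rc_pair_t(x, rng)                            # O: B x N x 48
```

A BWR F layer is the same construction with the roles of the B and N axes swapped, i.e. the recurrence sweeps over frequency with weights shared across time frames.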
3. EXPERIMENTS

The noisy speech examples were obtained by mixing TIMIT [16] speech utterances with noise segments extracted from the NOISEX-92 database. The training and test datasets contain 2000 and 192 utterances, respectively. Both speech and noise signals were resampled to 8000 Hz. The speech enhancement quality was assessed for babble and factory noises, mixed with the speech utterances at an SNR of 0 dB.

3.1. Feature extraction

For all audio signals the short-time Fourier transform (STFT) was computed. The frame step was 10 ms, while the frame length was 25 ms. A Hann window was applied. For each frame a 512-point FFT was calculated. Afterwards, a 64-channel mel-scale filterbank was applied to the magnitude of the STFT. Finally, the element-wise logarithm was calculated from the resulting matrices.

3.2. Baseline architectures

We compared our proposed solution with two baseline models defined in [17]. Each model was optimized to get the best performance on our dataset.

The fully-connected layers network (FCLN) accepts the same spectrogram on its input as the proposed U-net architectures and uses a sliding-window technique with a window length of 23 frames (11 to the left and 11 to the right). The output of the network is an IRM (ideal ratio mask) [18]. The optimized version of this network featured 4 hidden layers with 512 neurons each. Additionally, we used ELU nonlinearities instead of ReLU and linear outputs instead of sigmoid for the output layer, which also helped to improve separation quality.

The recurrent neural network (RNN) uses the same sliding-window scheme as the FCLN. The RNN originally consisted of 4 hidden layers with 512 long short-term memory (LSTM) units. We found these numbers optimal, but for each forward recurrence we added a second one going in the opposite direction, effectively making it a bidirectional layer with 1024 units, which improved separation quality.
We also added gradient clipping above 100 and, as in the FCLN baseline, removed the sigmoid nonlinearities on the output. Similarly to the FCLN, this network also predicts an IRM mask.

We also considered training the baseline models (FCLN and RNN) without the IRM (for direct estimation of clean speech and noise, as in our proposed solution), but initial experiments showed very low performance, so we decided not to investigate this scenario further.

3.3. U-nets

Based on the scheme presented in Figure 1, we define four baseline U-net architectures and two that implement RC pairs. All networks feature 5 levels (L = 10 processing blocks). Figure 3 shows the architectures of all six networks along with their numbers of parameters and receptive field sizes. We use the following notation:

C48: 2D convolutional layer with 48 filters (all convolutional layers use 3x3 filters with ELU and batch normalization, except for the final layer, which is always 1x1 with linear output),
R_T16_C48: RC pair comprising a BWR layer with 8 units per direction, iterating along the time axis (weight sharing along the frequency axis), and a C48 layer (all recurrences were implemented using Gated Recurrent Units (GRUs) with gradient clipping above 100 and batch normalization),
R_F16_C48: frequency counterpart of R_T16_C48,
MP: max-pooling 2x2,
TC48: transposed convolution with filter size 6x6, stride 2 and crop 2 (an inverse of a standard 3x3 convolution with "same" padding followed by 2x2 max-pooling).

We also tested the ALL_RC model for predicting the IRM, which allowed a more direct comparison with the baseline models. In this scenario the final layer consisted of a single 1x1 filter with linear output.

3.4. Reconstruction and metrics

The log-scale magnitude mel-spectrogram of the clean speech was estimated using the proposed neural networks. After applying the exponential function, the mel-filtering was inverted by means of the pseudoinverse of the matrix with the characteristics of the mel-filters.
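Sections 3.1 and 3.4 together amount to a mel analysis/synthesis pair. The sketch below builds a textbook triangular 64-channel mel filterbank (a common construction that is not necessarily identical to the authors' filters), applies it to a magnitude spectrogram, and inverts it with the Moore-Penrose pseudoinverse; all variable names and the stand-in input are illustrative.

```python
import numpy as np

def mel_filterbank(sr=8000, n_fft=512, n_mels=64):
    """Standard triangular mel filterbank: 64 channels over the
    257 magnitude bins of a 512-point FFT at 8 kHz."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2),
                                n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):                 # rising slope
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):                 # falling slope
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

fb = mel_filterbank()                            # 64 x 257

# Analysis (Section 3.1): mel-filter the magnitude STFT, take the log.
rng = np.random.default_rng(0)
mag = np.abs(rng.standard_normal((257, 50)))     # stand-in |STFT|
log_mel = np.log(fb @ mag + 1e-8)                # 64 x 50

# Synthesis (Section 3.4): undo the log, then invert the mel filtering
# with the pseudoinverse of the filterbank matrix.
mel_mag = np.exp(log_mel) - 1e-8
mag_rec = np.clip(np.linalg.pinv(fb) @ mel_mag, 0.0, None)   # 257 x 50
```

The pseudoinverse gives a least-squares inverse, so the recovered linear-frequency magnitudes are an approximation; the clip keeps them non-negative before the phase of the noisy mixture is attached, as described next.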
Next, the result was combined with the phase of the noisy speech. This allowed the speech signal to be reconstructed.

[Fig. 3. Evaluated U-nets; the first four are reference baseline U-nets, the last two implement our proposed solution. Each architecture is described by defining each processing block from Figure 1 in the form of a 5x2 table (left column: encoder, right column: decoder). Summary:

Name     Num. of parameters  Receptive field
C48      230k                19x19
C64      409k                19x19
C48_C48  460k                39x39
C64_MP   704k                66x66
ALL_RC   328k                whole spectrogram
ODD_RC   407k                whole spectrogram]

In order to assess the quality of the separation for the different variants of the proposed architecture, SDR (signal-to-distortion ratio), SIR (signal-to-interference ratio), and SAR (signal-to-artifact ratio) were computed as defined in [19]. Additionally, we used the spectro-temporal objective intelligibility (STOI) measure as defined in [20].

3.5. Meta parameters

All networks were trained for 100 epochs using the Adam optimizer with a batch size of 15. For all networks the initial learning rate was set to 0.001, except for the networks with RC pairs, where a learning rate of 0.01 was used.
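For concreteness, the learning-rate schedule (initial rate decayed multiplicatively after every epoch) can be written as:

```python
def learning_rate(epoch, initial_lr):
    """Initial rate (0.001, or 0.01 for the RC-pair networks) multiplied
    by 0.99 after every completed epoch."""
    return initial_lr * 0.99 ** epoch
```

After the full 100 epochs the rate has decayed to roughly 37% of its initial value.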
The higher learning rate slightly improved results for these networks; no such improvement was observed for the remaining networks. The learning rate was multiplied by 0.99 after every epoch.

All presented U-nets (except for the ALL_RC IRM experiment) were trained to minimize the absolute difference between the actual and predicted spectrograms of clean speech and noise. 10% of the training data was held out as a validation set. The best model snapshot was selected based on the SDR obtained on the validation set. Validation was performed after each epoch.

3.6. Results

The results of the performed experiments are shown in Tables 1 and 2.

Table 1. Factory noise
Name              SDR   SIR   SAR   STOI
FCLN IRM          7.4   12.8  8.9   0.74
RNN IRM           7.5   12.2  9.3   0.76
U-net C48         7.9   14.7  8.9   0.73
U-net C64         8.0   14.2  9.1   0.73
U-net C48_C48     8.2   14.3  9.3   0.76
U-net C64_MP      8.1   14.5  9.3   0.76
U-net ALL_RC IRM  8.2   14.8  9.3   0.80
U-net ALL_RC      8.4   15.5  9.4   0.81
U-net ODD_RC      8.4   15.0  9.5   0.81

Table 2. Babble noise
Name              SDR   SIR   SAR   STOI
FCLN IRM          5.3   8.9   8.5   0.71
RNN IRM           5.6   9.2   8.7   0.72
U-net C48         6.1   11.9  7.8   0.69
U-net C64         6.1   11.6  7.8   0.69
U-net C48_C48     6.2   10.5  8.7   0.71
U-net C64_MP      6.3   11.2  8.4   0.73
U-net ALL_RC IRM  6.7   12.0  8.5   0.76
U-net ALL_RC      7.0   13.2  8.5   0.79
U-net ODD_RC      6.9   12.3  8.8   0.78

The proposed architecture with RC pairs gave the best separation quality in terms of all metrics. ALL_RC and ODD_RC gave the same SDRs and STOIs for factory noise; for babble noise the two architectures also had comparable results (the differences in SDR and STOI were 0.1 dB and 0.01, respectively). ALL_RC achieved a higher SIR than ODD_RC, by 0.5 dB and 0.9 dB for babble and factory noise, respectively. It was, however, slightly worse than ODD_RC in terms of SAR.

It can be noticed that the U-net architectures without recurrent layers provide higher SDRs and SIRs than the baseline models. This is not the case for SAR and STOI. However, U-nets with RC pairs give an improvement in terms of SDR, SIR, and STOI. In comparison to the best non-U-net baseline (RNN IRM), both ALL_RC and ODD_RC gave an improvement of 0.9 dB in SDR and 0.05 in STOI for the factory noise. In the case of babble noise, the difference is bigger: 1.4 dB for SDR and 0.07 for STOI. The best architectures with RC pairs also outperformed the U-net architectures without recurrences. For the factory noise the difference in SDR was 0.2 dB, while for the babble noise it was 0.7 dB. An improvement can also be observed for the STOI metric: 0.05 for factory and 0.06 for babble noise.

4. CONCLUSIONS

In this work we proposed U-net-based neural network architectures in which recurrent-convolutional pairs are used at different levels. The obtained results show that the proposed architectures outperform the baseline models (FCLN, RNN, and U-nets without recurrences). The results of the performed experiments also suggest that U-net-based architectures perform better with mapping-based than with masking-based targets.

5. REFERENCES

[1] Zixing Zhang, Jürgen Geiger, Jouni Pohjalainen, Amr El-Desoky Mousa, Wenyu Jin, and Björn Schuller, "Deep learning for environmentally robust speech recognition: An overview of recent developments," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 9, no. 5, pp. 49, 2018.

[2] Ying-Hui Lai, Yu Tsao, Xugang Lu, Fei Chen, Yu-Ting Su, Kuang-Chao Chen, Yu-Hsuan Chen, Li-Ching Chen, Lieber Po-Hung Li, and Chin-Hui Lee, "Deep learning-based noise reduction approach to improve speech intelligibility for cochlear implant recipients," Ear and Hearing, vol. 39, no. 4, pp. 795–809, 2018.
[3] Cédric Févotte, Emmanuel Vincent, and Alexey Ozerov, "Single-channel audio source separation with NMF: divergences, constraints and algorithms," in Audio Source Separation, pp. 1–24. Springer, 2018.

[4] DeLiang Wang and Jitong Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.

[5] Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis, "Deep learning for monaural speech separation," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1562–1566.

[6] Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis, "Joint optimization of masks and deep recurrent neural networks for monaural source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 12, pp. 2136–2147, 2015.

[7] L. Hui, M. Cai, C. Guo, L. He, W. Q. Zhang, and J. Liu, "Convolutional maxout neural networks for speech separation," in 2015 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Dec 2015, pp. 24–27.

[8] Emad M. Grais and Mark D. Plumbley, "Single channel audio source separation using convolutional denoising autoencoders," arXiv preprint arXiv:1703.08019, 2017.

[9] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[10] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde, "Singing voice separation with deep U-net convolutional networks," 2017.

[11] Se Rim Park and Jinwon Lee, "A fully convolutional neural network for speech enhancement," arXiv preprint arXiv:1609.07132, 2016.
[12] Han Zhao, Shuayb Zarar, Ivan Tashev, and Chin-Hui Lee, "Convolutional-recurrent neural networks for speech enhancement," arXiv preprint arXiv:1805.00579, 2018.

[13] Tomasz Grzywalski and Szymon Drgas, "Application of recurrent U-net architecture to speech enhancement," in Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA). IEEE, 2018, pp. 82–87.

[14] Francesco Visin, Kyle Kastner, Kyunghyun Cho, Matteo Matteucci, Aaron Courville, and Yoshua Bengio, "ReNet: A recurrent neural network based alternative to convolutional networks," arXiv preprint arXiv:1505.00393, 2015.

[15] Alex Graves and Jürgen Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in Advances in Neural Information Processing Systems, 2009, pp. 545–552.

[16] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic phonetic continuous speech corpus CDROM," 1993.

[17] Jitong Chen and DeLiang Wang, "Long short-term memory for speaker generalization in supervised speech separation," The Journal of the Acoustical Society of America, vol. 141, no. 6, pp. 4705–4714, 2017.

[18] Yuxuan Wang, Arun Narayanan, and DeLiang Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 12, pp. 1849–1858, 2014.

[19] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[20] Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
