Multi-View Networks for Denoising of Arbitrary Numbers of Channels

We propose a set of denoising neural networks capable of operating on an arbitrary number of channels at runtime, irrespective of how many channels they were trained on. We coin the proposed models multi-view networks since they operate using multiple views of the same data. We explore two such architectures and show how they outperform traditional denoising models in multi-channel scenarios. Additionally, we demonstrate how multi-view networks can leverage information provided by additional recordings to make better predictions, and how they are able to generalize to a number of recordings not seen in training.

Authors: Jonah Casebeer, Brian Luc, Paris Smaragdis

MULTI-VIEW NETWORKS FOR DENOISING OF ARBITRARY NUMBERS OF CHANNELS

Jonah Casebeer*, Brian Luc* (University of Illinois at Urbana-Champaign, {jonahmc2, luc2}@illinois.edu)
Paris Smaragdis (University of Illinois at Urbana-Champaign; Adobe Research)

ABSTRACT

We propose a set of denoising neural networks capable of operating on an arbitrary number of channels at runtime, irrespective of how many channels they were trained on. We coin the proposed models multi-view networks since they operate using multiple views of the same data. We explore two such architectures and show how they outperform traditional denoising models in multi-channel scenarios. Additionally, we demonstrate how multi-view networks can leverage information provided by additional recordings to make better predictions, and how they are able to generalize to a number of recordings not seen in training.

Index Terms — multichannel, denoising, deep learning

1. INTRODUCTION

Suppose you are provided with multiple noisy recordings of the same event and wish to produce a single clean recording. Historically, this problem has been addressed using microphone array techniques, which can seamlessly scale to an arbitrary number of channels if the necessary hardware is in place. Although such techniques work well, we have recently seen learning-based methods being used for such tasks because of their ability to resolve much more complex denoising problems. More specifically, we have seen a strong wave of deep learning-based methods that introduce powerful nonlinear models that can include more parameters and that can learn to resolve more challenging denoising problems.

Deep learning-based denoising and source separation have been explored in a variety of settings. Liu et al. [1] studied deep learning for single-channel denoising through spectral masking and regression applied independently over each spectral frame of the noisy input.
Representations capable of leveraging the temporal nature of audio were introduced by Huang et al. in 2014 and 2015 [2, 3]. In these works the authors constructed Recurrent Neural Networks (RNNs) to leverage the strong dependency between consecutive spectral frames in single-channel audio recordings. Others have since employed similar techniques [4, 5, 6, 7, 8, 9].

*These two authors contributed equally. This work was supported by NSF grant #1319708.

Multi-channel and deep learning techniques were combined by Swietojanski et al. [10] in 2013 and Araki et al. [11] in 2015, who both constructed multi-channel features for speech enhancement in the context of ASR. The authors of these works found that some multi-channel features could outperform conventional methods. Similarly, Nugraha et al. constructed a multi-channel framework using DNNs to estimate the spectra of sources and an EM algorithm to combine these into a multi-channel filter [12]. When several recordings are available these methods are able to capture new information. Li et al. and Xiao et al., both in 2016, were able to further leverage multiple channels with time-domain and frequency-domain neural beamformers respectively [13, 14]. More recently, with deep clustering, Wang et al. [15] performed multi-channel speaker-independent source separation.

A common problem with learning-based methods (whether based on deep learning or not) is that the training setup needs to be replicated during inference. That means that when one trains a multi-channel system to perform, e.g., 4-channel denoising, that system can only be straightforwardly deployed on a 4-channel system. That is in contrast to analytical approaches, like classical array methods, that make use of geometric information to perform their task. The price to pay for using more powerful deep learning models is that this analytical flexibility is lost, and one can only train and deploy using very similar setups.
Here, we address this problem by introducing two neural network architectures that can be trained on a different number of channels than the number of channels that they are deployed for. This allows us to train systems on, e.g., 4 channels and deploy them on an 8-channel (or 2-channel) array without having to retrain, or in any way modify, the learned models.

With evidence of RNNs surpassing feedforward networks, we propose two RNN architectures to extend current multi-channel techniques. Our first solution is to construct an RNN that, instead of unrolling over time, unrolls across the number of input channels. This allows us to train and deploy this model on an arbitrary number of input channels. We subsequently propose an additional model which unrolls both across time and channels. We call these multi-view networks (MVN), because they combine multiple views of the input. We find that they are able to consistently leverage information provided by additional recordings, as well as to generalize to a number of recordings not seen in training.

Fig. 1 (diagram omitted): 1D MVN unrolling across channels. x_{i,j} represents the j-th spectral frame of the i-th recording. h_{i,j} represents the hidden state produced by the RNN at the j-th spectral frame of the i-th recording. y_j is the predicted clean spectral frame.

2. MULTI-VIEW MODELS FOR DENOISING

Suppose you are provided with a noisy signal s(t) which you wish to denoise. In the classic denoising RNN setup we apply a Short-Time Fourier Transform (STFT) to s(t) to obtain a series of spectral magnitude frames x_i.
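This front end can be sketched in a few lines. The 1024-point DFT size matches Sec. 3.2 of this paper, but the sample rate, Hann window, and 50% hop below are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def magnitude_frames(s, n_fft=1024, hop=512):
    """Split signal s into windowed frames and return magnitude and
    phase spectra. A minimal STFT sketch; window and hop are assumed."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(s) - n_fft) // hop
    frames = np.stack([s[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=-1)      # complex STFT frames
    return np.abs(spectra), np.angle(spectra)   # magnitudes x_i and phases

s = np.random.randn(16000)                      # 1 s of audio at 16 kHz
mags, phases = magnitude_frames(s)
print(mags.shape)                               # (30, 513)
```

The phases are kept because, as described below, the MVN regresses magnitudes only and reuses the phase of one channel at synthesis time.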
An RNN would then unroll through these frames over time as follows:

    h_i = σ(W_h x_i + U_h h_{i−1})
    y_i = σ(W_x h_i)                                        (1)

and would find optimal matrices W_h and U_h to provide a set of denoised magnitude STFT frames y_i. The function σ can be any appropriate neural network activation. The unrolling scheme in the equations above takes advantage of temporal information that lies across spectral frames, and also lets us process inputs of arbitrary length irrespective of the training data.

Suppose now that you are provided with multiple noisy recordings s_{1:k}(t) of the same event and wish to produce a single clean recording. We define x_{i,j} to represent the j-th spectral frame in the i-th recording's STFT. If we wanted to use the aforementioned model and take advantage of the multiple channels, we could apply it to the averaged channel spectra, or to each input channel separately and then average all the outputs. Although this makes use of the multiple channels, these approaches are not particularly effective.

We propose using the RNN unrolling scheme across input recordings x_{1:k,t} at every spectral frame t. This approach allows us to take advantage of multi-channel information at each time frame, and has a number of advantages over averaging. For example, it is possible that at different points in time a different channel might provide the best input for denoising; unrolling in this fashion allows the model to leverage that instead of averaging the result with worse channels.

Fig. 2 (diagram omitted): 2D MVN unrolling across both channels and time, using the same notation as Fig. 1. Note how the last channel's hidden state feeds into the first channel of the next time step.

Additionally, this approach allows us to test on an arbitrary number of channels regardless of how many we used in training. To reconstruct the denoised spectra, we experimented with averaging each output of the RNN as it unrolls over channels, as well as with taking the output after the last channel is processed. We found using the last hidden state as the base for a prediction to work best. Figure 1 demonstrates how unrolling across channels works. The obvious disadvantage of this approach is that we no longer make use of temporal structure. To address this problem we introduce a 2-dimensional RNN that unrolls across both time and sources. This allows a model to leverage the temporal dependency between time steps as well as the mutual information between different channels. Now, if the noise source or the clean signal moves with respect to the microphones, the model can find the best recording to denoise and leverage previous spectral information about the sound. We accomplish this with the recurrence shown below:

    h_{i,j} = σ(W_h x_{i,j} + U_h h_{k,j−1})   if i = 1
    h_{i,j} = σ(W_h x_{i,j} + U_h h_{i−1,j})   otherwise
    y_j = σ(W_x h_{k,j})                                    (2)

Note that in the case of a single input recording the 2D MVN simply unrolls across time. Figure 2 illustrates the 2D unrolling over channels and time.

Given these two different unrolling schemes we construct two different networks. First, for the 1D case: a denoising MVN is composed of a fully-connected front layer, a recurrent layer, and a fully-connected back layer. The front layer is given the magnitude STFT of the input channels, the recurrent layer performs the unrolling operations of Eq. (1) across recordings, and the back layer regresses the RNN's output into the original STFT dimensions. To transform back to the time domain we use the phase STFT of the last channel.
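The recurrence of Eq. (2) can be sketched as follows. This is a minimal NumPy version with toy dimensions and random weights; the actual models use trained GRUs and the sizes given in Sec. 3.2, so everything below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the paper uses 1024-point DFT inputs and hidden size 512.
d_in, d_hid = 8, 4
W_h = rng.normal(scale=0.1, size=(d_hid, d_in))
U_h = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_x = rng.normal(scale=0.1, size=(d_in, d_hid))
softplus = lambda z: np.log1p(np.exp(z))   # the nonlinearity used in Sec. 3

def mvn_2d(x):
    """Eq. (2): unroll over channels within each frame, carry the last
    channel's hidden state h_{k,j-1} into the next frame, and predict
    y_j from the last channel's hidden state h_{k,j}."""
    k, n, _ = x.shape                      # (channels, time, features)
    h_prev_frame = np.zeros(d_hid)         # plays the role of h_{k,0}
    y = np.empty((n, d_in))
    for j in range(n):                     # unroll over time
        h = h_prev_frame                   # i = 1 case of Eq. (2)
        for i in range(k):                 # unroll over channels
            h = softplus(W_h @ x[i, j] + U_h @ h)
        h_prev_frame = h                   # h_{k,j} feeds frame j + 1
        y[j] = softplus(W_x @ h)
    return y

# The same weights work for any number of channels, without retraining:
for k in (2, 5, 8):
    print(mvn_2d(rng.normal(size=(k, 3, d_in))).shape)   # (3, 8) each time
```

Because the channel loop simply runs as many steps as there are inputs, channel count is a runtime property rather than an architectural one, which is the key to deploying on arrays of a different size than the training array.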
For the 2D case, a denoising MVN is identical to a 1D MVN except for the recurrence, which is defined in Eq. (2). Figure 3 broadly illustrates an MVN supplied with 3 noisy recordings containing mutual information. In practice, we used GRUs [16] instead of plain RNNs.

Fig. 3 (diagram omitted): Denoising pipeline using an MVN: noisy recordings → STFT → magnitude spectra → FC → MVN → FC → clean spectra → ISTFT → clean waveform.

We find that training MVNs with an SDR-proxy loss as defined in [17] works far better than traditional norm-based loss functions. The loss used is:

    SDRLoss(x, y) = −(xᵀy)² / (xᵀx)                          (3)

where y and x are vectors containing the target and output time-domain signals respectively.

3. EXPERIMENTS

Our system was evaluated on two different kinds of noisy mixtures. Both are created from speakers in the TIMIT data set and from segments of "Babble", "Airport", "Train", and "Subway" noises. From TIMIT we select 12 female speakers, each of which has 10 unique utterances [18]. From the 120 utterances, we randomly select 100 for training and 20 for validation. From each utterance we create a two-second noisy mixture by adding one of the "Babble", "Airport", "Train", and "Subway" noises. We propose two mixing techniques and evaluate our performance with the BSS-Eval metric Source-to-Distortion Ratio (SDR) [19]. In both setups we only show results for the 2D MVN, as it outperforms the 1D MVN.

3.1. Naive Averaging RNN

As a benchmark we used an averaging model. This model averages all channels, then passes the averaged STFT frames through a dense layer which reduces 1024-point DFTs to a size of 512. We then unroll a GRU with a hidden size of 512. The hidden state is expanded from 512 dimensions to 1024 with a dense layer. Finally, we perform an ISTFT to produce a denoised signal. We refer to this model as the averaging RNN. This model is trained with the softplus nonlinearity.

3.2. Model Parameters

We construct a 2D MVN which operates on 1024-point DFTs, has a fully-connected layer that goes from 1024 to 512 dimensions, and then a GRU over channels with a hidden size of 512. This hidden size is projected back to 1024 dimensions with a final fully-connected layer. All models used the softplus nonlinearity.

Fig. 4 (plot omitted): Testing how 2D MVNs leverage new channel information. The x-axis denotes the number of provided noisy channels, and the y-axis shows the average SDR on the validation set. In the case of decreasing SNR the MVN obtains peak performance after a few channels and then largely retains it, not being influenced by the latter poor-quality input channels. With increasing SNR, the MVN keeps performing better as additional cleaner channels are received. Ultimately, the MVN performs the same in both cases (which both exhibit recordings from −5 to 5 dB but in reverse order), showing its performance is permutation invariant.

3.3. Static Noise Setup

Because RNNs process inputs serially, we need to verify that the output we obtain is not mostly dependent on the last few observed channels. More specifically, we want to see that this model can leverage cleaner inputs (and ignore excessively noisy inputs) regardless of whether they are presented first or last. More generally, we want to ensure that performance is invariant to input presentation order. To do so, we simulate a static mixing scenario with a static target and noise, and randomly placed static microphones.

In this setup we examine two cases. In the first case we use k randomly-ordered input channels which exhibit SNRs ranging from −5 to −5 + k/3 dB, with k ∈ [1, ..., 30].
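The two SNR schedules can be written out directly. The endpoints are from the text; even spacing of the k channels between those endpoints is an assumption of this sketch:

```python
import numpy as np

def snr_schedule(k, increasing=True):
    """Per-channel SNRs (dB) for the static setup.

    Increasing case: SNRs span -5 to -5 + k/3 dB.
    Decreasing case: SNRs span  5 to  5 - k/3 dB.
    At k = 30 both cases cover the full -5 to 5 dB range.
    """
    if increasing:
        return np.linspace(-5.0, -5.0 + k / 3.0, k)
    return np.linspace(5.0, 5.0 - k / 3.0, k)

print(snr_schedule(6))             # six channels span -5 to -3 dB
print(snr_schedule(30)[[0, -1]])   # endpoints of the full range: -5 and 5 dB
```

Note that for k < 30 the increasing case only ever exposes the noisier end of the range while the decreasing case exposes the cleaner end, which is what makes the comparison in Fig. 4 informative about presentation-order effects.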
That way, for, e.g., six input channels we would observe SNRs ranging from −5 to −3 dB. Conversely, in the second case the input SNRs for k channels are ordered from 5 to 5 − k/3 dB. In both cases, once k = 30 we will see recordings varying from −5 to 5 dB, but for k < 30, depending on the case, we will either see a set of cleaner recordings or one of noisier recordings. The goal here is to see whether this approach gets swayed by the presentation order of the input channels.

For both increasing and decreasing SNR scenarios, we test a 2D MVN trained on five channels with randomly ordered SNRs. The results are shown in Figure 4. For reference we compare the 2D MVN with the baseline RNN. We observe that MVNs outperform the averaging RNN at leveraging new information, and that they are good at ignoring noisy channels and not being swayed by the input channel ordering. This is a very desirable behavior, as it means we can safely provide this denoiser with multiple recordings without having to worry about their ordering and whether the best inputs come first or last.

3.4. Dynamic Noise Setup

The second setup replicates a physically stationary target source, a noise source moving in a circle, and stationary microphones randomly placed within the circle formed by the noise source's path. We set the average SNR across each mixture to 0 dB, while the instantaneous SNR depends on the microphone-source geometry. Figure 5 illustrates this setup. Here we reward models capable of leveraging the dynamic nature of the instantaneous SNR of each recording. We call this the dynamic setup.

Fig. 5 (diagram omitted): Example dynamic noise setup with five channels. The noise source moves along a circle, changing which microphone gets the best input at each time.

In training we generated samples with five channels.
Every training sample simulates a new setup where mic placement and ordering have been randomized. This means our model cannot memorize any particular mic setup. However, at test time we maintain the same setup for each different number of microphones to more accurately show how providing additional noisy channels to our model changes performance.

Using this setup we generate a single result plot. The x-axis denotes the number of noisy channels provided to the model. The y-axis shows the average SDR on the validation set for that number of channels. The horizontal line across the graph denotes the performance of the averaging RNN model, which stays mostly constant regardless of the input channels. We observe that the 2D MVN outperforms the averaging RNN and is able to leverage numbers of channels far beyond the amount it was trained on. It does not, however, do as well when only observing one or two channels. We hypothesize that this is because it does not fully utilize the recurrent connection in these cases. Furthermore, the MVN can do so even when the "best" recording is different at every time step. This ability is exciting, since the new information in each channel can always be leveraged without necessitating changes to the model.

Fig. 6 (plot omitted): 2D MVN performance vs. number of recordings. Even though the network was trained on 5 channels (note the peak), the quality of outputs keeps improving past 5 channels. In contrast, the regular averaging RNN cannot take advantage of additional channel information. This plot corresponds to a bidirectional GRU across channels.

4. CONCLUSION

We have proposed a denoising RNN capable of operating on an arbitrary number of input recordings and leveraging new channel information. We show that the order of the channels does not influence the quality of the results, and that its denoising ability keeps improving as we provide more input channels, even past the amount we trained on.
Finally, we show that this network outperforms the alternative approach of averaging the input channels. Although not shown here due to space constraints, this model can also operate on an arbitrary number of recordings at every time step, allowing for deployment in settings with a dynamically changing number of sensors.

5. REFERENCES

[1] Ding Liu, Paris Smaragdis, and Minje Kim, "Experiments on deep learning for speech denoising," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[2] Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis, "Deep learning for monaural speech separation," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1562–1566.
[3] Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis, "Joint optimization of masks and deep recurrent neural networks for monaural source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 12, pp. 2136–2147, 2015.
[4] Felix Weninger, John R. Hershey, Jonathan Le Roux, and Björn Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in Proceedings 2nd IEEE Global Conference on Signal and Information Processing, GlobalSIP, Machine Learning Applications in Speech Processing Symposium, Atlanta, GA, USA, 2014.
[5] Jen-Tzung Chien and Kuan-Ting Kuo, "Variational recurrent neural networks for speech separation," Proc. Interspeech 2017, pp. 1193–1197, 2017.
[6] Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez, "Monoaural audio source separation using deep convolutional neural networks," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2017, pp. 258–266.
[7] Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florêncio, and Mark Hasegawa-Johnson, "Speech enhancement using Bayesian WaveNet," Proc. Interspeech 2017, pp. 2013–2017, 2017.
[8] Keiichi Osako, Yuki Mitsufuji, Rita Singh, and Bhiksha Raj, "Supervised monaural source separation based on autoencoders," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 11–15.
[9] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 1, pp. 7–19, 2015.
[10] Pawel Swietojanski, Arnab Ghoshal, and Steve Renals, "Hybrid acoustic models for distant and multichannel large vocabulary speech recognition," in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 285–290.
[11] Shoko Araki, Tomoki Hayashi, Marc Delcroix, Masakiyo Fujimoto, Kazuya Takeda, and Tomohiro Nakatani, "Exploring multi-channel features for denoising-autoencoder-based speech enhancement," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 116–120.
[12] Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent, "Multichannel audio source separation with deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1652–1664, 2016.
[13] Bo Li, Tara N. Sainath, Ron J. Weiss, Kevin W. Wilson, and Michiel Bacchiani, "Neural network adaptive beamforming for robust multichannel speech recognition," in INTERSPEECH, 2016, pp. 1976–1980.
[14] Xiong Xiao, Shinji Watanabe, Hakan Erdogan, Liang Lu, John Hershey, Michael L. Seltzer, Guoguo Chen, Yu Zhang, Michael Mandel, and Dong Yu, "Deep beamforming networks for multi-channel speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5745–5749.
[15] Zhong-Qiu Wang, Jonathan Le Roux, and John R. Hershey, "Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation," 2018.
[16] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint, 2014.
[17] Shrikant Venkataramani, Jonah Casebeer, and Paris Smaragdis, "Adaptive front-ends for end-to-end source separation," in Proc. NIPS, 2017.
[18] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon technical report, vol. 93, 1993.
[19] Cédric Févotte, Rémi Gribonval, and Emmanuel Vincent, "BSS Eval toolbox user guide, revision 2.0," 2005.
