An Exploration of Mimic Architectures for Residual Network Based Spectral Mapping

Peter Plantinga, Deblin Bagchi, Eric Fosler-Lussier
The Ohio State University

ABSTRACT

Spectral mapping uses a deep neural network (DNN) to map directly from noisy speech to clean speech. Our previous study [1] found that the performance of spectral mapping improves greatly when using helpful cues from an acoustic model trained on clean speech. The mapper network learns to mimic the input favored by the spectral classifier and cleans the features accordingly. In this study, we explore two new innovations: we replace a DNN-based spectral mapper with a residual network that is more attuned to the goal of predicting clean speech. We also examine how integrating long-term context in the mimic criterion (via wide-residual biLSTM networks) affects the performance of spectral mapping compared to DNNs. Our goal is to derive a model that can be used as a preprocessor for any recognition system; the features derived from our model are passed through the standard Kaldi ASR pipeline and achieve a WER of 9.3%, which is the lowest recorded word error rate for the CHiME-2 dataset using only feature adaptation.

Index Terms: mimic loss, spectral mapping, CHiME-2, residual network, WRBN

1. INTRODUCTION

Applying deep learning to the task of Automatic Speech Recognition (ASR) has shown great progress recently in clean environments. However, these ASR systems still suffer from performance degradation in the presence of acoustic interference, such as additive noise and room reverberation. One strategy to address this problem is to use a deep learning front-end for denoising the features, which are then fed to the ASR system. Some of these models attempt to estimate an ideal ratio mask (IRM) which is multiplied with the spectral features to remove noise from the speech signal [2].
Others utilize spectral mapping in the signal domain [3, 4] or in the feature domain [5, 6] to translate directly from noisy to clean speech without additional constraints.

When these pre-processing models were introduced, they could be easily decoupled from the rest of the ASR pipeline. This was useful, because these models provided a general-purpose speech denoising module that could be applied to any noisy data. With time, impressive gains in performance were noticed with the addition of noise-robust features and joint training of spectral mapper and acoustic model [7]. However, the front-end and back-end models in these approaches each depend on the presence of the other, i.e. one would not be able to re-use the mapper for another task or dataset without re-training it. Moreover, adding robust features increases the difficulty of feature creation and increases the number of parameters in the speech recognition pipeline.

Our previous work [1] introduced a form of knowledge transfer we dubbed mimic loss. Unlike student-teacher learning [8] or knowledge distillation [9, 10, 11], which transfer knowledge from a cumbersome model to a small model, the mimic approach transfers knowledge from a higher-level model (in this case, an acoustic model) to a lower-level model (a noisy-to-clean transformation). This can be seen in context in Figure 1.

In this work, we improve our results using the mimic loss framework in two ways. First, we propose a residual network [12] for spectral mapping. A residual network model is a natural fit for the task of speech denoising, because like the model, the task involves computing a residual, i.e. the noise contained in the features. We find that a residual network architecture by itself works well for the task of speech enhancement, surpassing the performance of other front-end-only systems. Second, we use a more sophisticated architecture for senone classification, since this is the backbone of mimic loss.
This provides a more informative error signal to the spectral mapper. To achieve this goal, we choose Wide Residual BiLSTM Networks (WRBN) [13] as the architecture for our senone classifier, which combines the effective feature extraction of residual networks [12] with the long-term context modeling of recurrent networks [14, 15].

During evaluation, a forward pass through the residual spectral mapper generates denoised features which are then fed to an off-the-shelf Kaldi recipe [16]. These features achieve a much lower WER on their own as compared to DNN spectral mappers trained without mimic loss [5, 6]. With the addition of the stronger feedback from a senone classifier, we achieve results beating the state-of-the-art system, which includes both additional noise-robust features and joint training of the front-end denoiser with the acoustic model back-end.

© 2018 IEEE. Published in the IEEE 2018 Workshop on Spoken Language Technology (SLT 2018), scheduled for 18-21 December 2018 in Athens, Greece.

2. PRIOR WORK

For the task of robust ASR, there has been some attention paid to strategies such as adding noise-robust features to acoustic models [7], using augmented training data [17], and recurrent neural network language models [17, 18]. Another approach is to use a more sophisticated acoustic model, such as Convolutional Neural Networks (CNNs) [19, 20], Recurrent Neural Networks (RNNs) [21], and Residual Memory Networks (RMNs) [22] that use residual connections with DNNs.

In terms of front-end models, DNNs are the most common approach [6], though RNNs have been used for speech enhancement as well, as in [23]. There have also been a few studies that used CNNs for front-end speech denoising [24, 25, 26]. In the last of these, the authors used a single "bypass" connection from the encoder to the decoder, but none of the models described here can be said to use residual connections.
In addition, none of these authors evaluated the output of their model for the task of ASR.

Residual networks have seen success in computer vision [12, 27] and speech recognition [28, 13]. These networks add shortcut connections to a neural network that pass the output of some layers to higher layers. The shortcut connections allow the network to compute a modification of the input, called the residual, rather than having to re-compute the important parts of the input at every layer. This model seems a natural fit for the task of spectral mapping, which seeks to reproduce the input with the noise removed. We use an architecture similar to Wide ResNet with a small change: convolutional (channel-wise) dropout rather than conventional dropout. Architectural details are in Section 3.2.

Senone classification in speech recognition systems has improved due to recurrent neural networks. The horizontal connections in LSTMs work well in modeling the temporal nature of speech. On the other hand, convolutional neural networks are good for extracting useful patterns from spectral features. DNNs further complement the performance of these models by warping the speech manifold so that it resembles the senone feature space. The CNN-LSTM-DNN combination (CLDNN) along with HMMs has seen good results [29, 30]. Recently, wide residual networks have been adapted for noise-robust speech recognition in the CHiME-4 setting and used with LSTMs and DNNs. This network, called WRBN, is reported by [13] to be a strong acoustic model.

Mimic loss, proposed in [1], is a kind of knowledge transfer that uses an acoustic model trained on clean speech to teach the speech enhancement model how to produce more realistic speech. Key to this idea is that the denoised speech should make a senone classifier behave like it is operating with clean speech.
In contrast to joint training, mimic loss does not tie the speech enhancement model to the particular acoustic model used; the enhancement module can be decoupled and used as a pure pre-processing unit with another recognizer. More details can be found in Section 4.

Fig. 1. System pipeline for spectral mapping with mimic loss (Step 1, denoising: STFT, spectral mapping, and mimic loss via acoustic modeling; Step 2, ASR: acoustic modeling and decoding). Bold text indicates training a model.

3. SPECTRAL MAPPING

Spectral mapping improves performance of the speech recognizer by learning a mapping from noisy spectral patterns to clean ones. In our previous work [5, 6], we have shown that a DNN spectral mapper, which takes a noisy spectrogram as input to predict clean filterbank features for ASR, yields good results on the CHiME-2 noisy and reverberant dataset. Specifically, we first divide the input time-domain signals into 25 ms windows with a 10 ms shift, and then apply the short-time Fourier transform (STFT) with a Hamming window to compute log spectral magnitudes in each time frame. For a 16 kHz signal, each window contains 400 samples, and we use a 512-point Fourier transform to compute the magnitudes, forming a 257-dimensional log magnitude vector.

Many speech recognition systems extend the input features using delta and double-delta. These features are a simple arithmetic function of the surrounding frames. CNNs naturally learn filters of a similar nature to the delta function, and can easily learn to approximate these features if necessary. We find that the model works better without these redundant features.
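The feature pipeline just described (25 ms Hamming windows, 10 ms shift, 512-point FFT, 257 log-magnitude bins, plus the ±5-frame context stacking used as mapper input) can be sketched in numpy. The function names, the log floor, and the edge-padding at utterance boundaries are our own illustrative choices, not details specified in the paper:

```python
import numpy as np

def log_magnitude_features(signal, sr=16000, win_ms=25, hop_ms=10,
                           n_fft=512, eps=1e-8):
    """Frame a waveform into 25 ms windows with a 10 ms shift, apply a
    Hamming window and a 512-point FFT, and keep the log magnitudes of
    the 257 non-negative frequency bins."""
    win = sr * win_ms // 1000                 # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000                 # 160 samples
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # (n_frames, 257)
    return np.log(mags + eps)

def stack_context(feats, context=5):
    """Concatenate each frame with 5 past and 5 future frames
    (edge-padded), giving 257 * 11 = 2827 inputs per frame."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i : i + len(feats)]
                      for i in range(2 * context + 1)])
```

For a one-second 16 kHz utterance this yields 98 frames of 257 log-magnitude values, and 2827 values per frame after stacking.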
We use 5 frames of stacked context (both past and future) for both the DNN and ResNets. Hence, the input feature dimension decreases to 2827 (257 × 11), compared to 8481 when delta features are included (257 × 3 × 11).

Fig. 2. Our residual network architecture consists of four blocks (two of 128 filters and two of 256 filters, all with 3 × 3 convolutions; the first layer of each block uses a stride of 2 × 2, the remaining two a stride of 1 × 1) followed by two fully-connected layers. Each block starts with a convolutional layer for down-sampling and increasing the number of filters. The output of this layer is used twice, once as input to the two convolutional layers that compute the residual, and again as the original signal that is modified by adding the computed residual.

3.1. Baseline Model

We use a baseline model for comparison, a DNN that is also a front-end-only system. Though this architecture is quite a bit simpler than the residual network architecture, similar architectures are commonly used in speech enhancement research [5, 7].

Unlike the proposed model, we add delta and double-delta features to the input for the baseline model, since a DNN cannot learn the delta function as easily. These features have been shown to dramatically improve ASR performance, and they improve spectral mapping performance as well.

Our baseline model is a 2-layer DNN with 2048 ReLU neurons in each layer, with an output layer of 257 neurons. We use batch norm and dropout to regularize the network. The batch norm uses the moving mean and variance at training time as well as test time. This is the same architecture that is used in [1].

3.2. ResNet Architecture

A residual network adds shortcut connections to neural network architectures, typically CNNs, in a way that causes the network to learn a modification of the original input, rather than being forced to reconstruct the important information at each layer.
This usually takes the form of blocks of several neural network layers, with the output of the first layer added to the output of the last layer, so that the interior layers can compute the residual.

Adding these connections has several advantages: the training time is decreased, the networks can grow deeper, and the model tends to behave more like an ensemble of smaller models [31]. In addition to all of those, however, we expect this model to be particularly good for the task of speech denoising, since the architecture matches the task at hand: reconstructing the input signal with the residual noisy signal removed.

In previous work using CNNs for speech enhancement, it has been noted that performance sometimes degrades with the addition of max pooling between convolution layers [25]. We also observe this phenomenon, and instead of doing max pooling, we use an additional CNN layer with stride 2 × 2 to learn a down-sampling function. This layer has the additional effect of increasing the number of filters, so the output can be directly added to the output of the last layer of the block as a residual connection.

Inspired by Wide ResNet [27], we use dropout instead of batch normalization, though we use convolutional (channel-wise) dropout rather than conventional dropout in order to better preserve the local structure within each filter. This results in a small gain in WER of around 0.2 percent. The authors also suggested that a shallower but wider network may work better than a very deep network; we use a network that is only 14 layers deep, with layers of a comparable width to Wide ResNet. Neither adding filters nor adding layers improved performance.

The full architecture of this model can be seen in Figure 2. The first part of the model uses four convolutional blocks: two blocks of 128 filters, and two of 256 filters. After the convolutional blocks, we append two fully-connected layers of 2048 neurons and an output layer.
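The convolutional (channel-wise) dropout mentioned above makes one keep/drop decision per feature map rather than per activation. A minimal numpy sketch follows; the tensor layout (batch, channels, freq, time) and the example rate of 0.2 are our own illustrative assumptions, not values from the paper:

```python
import numpy as np

def channel_dropout(x, rate=0.2, rng=None, training=True):
    """Convolutional (channel-wise) dropout: zero entire feature maps
    instead of individual activations, preserving the local structure
    within each surviving filter. x has shape (batch, channels, freq, time)."""
    if not training or rate == 0.0:
        return x
    rng = rng or np.random.default_rng()
    # One keep/drop decision per (example, channel), broadcast over freq/time.
    mask = (rng.random((x.shape[0], x.shape[1], 1, 1)) >= rate).astype(x.dtype)
    return x * mask / (1.0 - rate)   # inverted-dropout scaling
```

Each channel of the output is either zeroed out entirely or scaled by 1/(1 - rate), so the spatial pattern inside a kept feature map is untouched.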
The whole network uses ReLU neurons.

3.3. Training the Residual Network

We found that there was some sensitivity to the training procedure for the residual network, so we report our procedure here. We use the Adam optimizer [32] with an initial learning rate of 10^-4 and an exponential decay rate of 0.95 every 10^4 steps. Following the training procedure for ResNets in the field of computer vision, we experimented with learning rate drops, going to one-tenth of the initial learning rate after convergence. We found that this resulted in sizable improvements in the fidelity loss (see Table 1 in Section 6), but no improvements in the final WER.

Fig. 3. When using mimic loss, the enhancement system is trained in two stages. In the pretraining stage, the senone classifier is trained on clean speech to predict senone labels with the cross-entropy criterion (L_Cross-entropy), and the spectral mapper is pretrained to map from noisy speech to clean speech using the MSE criterion (fidelity loss, L_Fidelity). In the mimic loss training stage, the pretrained spectral mapper is trained further using both fidelity loss and mimic loss (L_Mimic), the loss between the two sets of outputs from the classifier when fed parallel clean and denoised utterances. The gray models have frozen weights.

Training a model to faithfully reproduce the input via fidelity loss does not teach the model exactly what parts of the signal are important to focus on reproducing correctly. A lower learning rate allows the model to make more precise adjustments to its parameters, reproducing small details in the spectrogram more faithfully.
However, the fact that these details don't help for the task of speech recognition indicates that they are mostly irrelevant for speech comprehension.

4. MIMIC LOSS TRAINING

In order to train with mimic loss, first the two component models must be pre-trained. We pre-train the spectral mapper to compute the function f(·) from a noisy spectral component x^k_m for frequency k at time slice m, augmented with a five-frame window (designated x̃^k_m = [x^k_{m±5}]), to the clean spectral slice y_m. This is called the fidelity loss, written as follows:

    L_Fidelity(x̃_m, y_m) = (1/K) Σ_{k=1}^{K} (y^k_m − f(x̃^k_m))²    (1)

While the residual network spectral mapper trained with only fidelity loss results in performance better than previous front-end-only systems, we add mimic loss for an additional gain in performance. This is done by training a senone classifier to learn a function g(·) from clean speech input ỹ_m to a set of D senones, and freezing the weights of the model. The spectral mapper is then trained to mimic the behavior of clean speech by backpropagating the L2 loss between clean and denoised input after being run through the acoustic model. The loss is computed at the output layer, before softmax is applied:

    L_Mimic(x̃_m, ỹ_m) = (1/D) Σ_{d=1}^{D} (g(ỹ_m)_d − g(f̃(x̃_m))_d)²    (2)

In early experiments, we found that using only mimic loss did not allow the model to converge, since it cares only about behavior and not the actual shape of the features. So we use a linear combination of fidelity loss and mimic loss:

    L_Joint = L_Fidelity + α L_Mimic    (3)

where α is a hyper-parameter controlling the ratio of fidelity and mimic losses. For our experiments, we use α = 0.1 when the mimic model is a DNN, and α = 0.05 when the mimic model is a WRBN. These values were chosen to ensure that the magnitudes of the fidelity loss and mimic loss were roughly equal. Higher or lower values of α do not usually produce better results.
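Equations (1)-(3) amount to a few lines of numpy once the mapper and classifier outputs are treated as precomputed arrays (the function and argument names below are ours, chosen for readability):

```python
import numpy as np

def fidelity_loss(clean, denoised):
    """Eq. (1): mean squared error between clean and denoised
    spectral frames, averaged over the K frequency bins."""
    return np.mean((clean - denoised) ** 2)

def mimic_loss(senone_clean, senone_denoised):
    """Eq. (2): L2 distance between the frozen classifier's pre-softmax
    outputs on clean vs. denoised input, averaged over the D senone units."""
    return np.mean((senone_clean - senone_denoised) ** 2)

def joint_loss(clean, denoised, senone_clean, senone_denoised, alpha=0.1):
    """Eq. (3): fidelity loss plus alpha-weighted mimic loss
    (alpha = 0.1 for the DNN mimic, 0.05 for the WRBN mimic)."""
    return (fidelity_loss(clean, denoised)
            + alpha * mimic_loss(senone_clean, senone_denoised))
```

In training, only the joint loss is backpropagated into the spectral mapper; the senone classifier's weights stay frozen.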
The entire process for training with mimic loss can be seen in Figure 3.

4.1. Senone Classification

In order to provide additional feedback to the spectral mapper, we train another model as a teaching model. This second model is trained for the task of senone classification, with clean speech as input. This model will ideally learn what parts of a speech signal are important for recognition and be able to help a spectral mapper model learn to reproduce these important speech structures faithfully.

The loss used to train the senone classifier is the typical acoustic model criterion: the cross-entropy loss between the outputs of the classifier, g(ỹ_m), and the senone label, z_m, where g(·) is the function computed by the classifier.

4.2. Senone Classifier Models

We experiment with two different senone classifier models that are separate from the one used in the off-the-shelf Kaldi recipe used for recognition. This separation exists both in terms of the architecture and the particular parameter values that are used, which gives some evidence for our claim that our front-end model is not tied to any particular acoustic model. For both models, we target 1999 senone classes.

For our first model, we use a 6-layer 1024-node DNN with batch norm and leaky ReLU neurons (with a leak factor of 0.3), the same model used in [1]. Our second model is a WRBN model that has recently been shown to perform well on the CHiME-4 challenge [13]. This allows us to add a sequential component to the training of the residual network via the senone classifier.

The WRBN model combines a wide residual network with a bi-directional LSTM model. The wide residual network consists of 3 residual blocks of 6 convolutional layers each, with 80, 160, and 320 channels. The first layer in the second and third blocks uses a stride of 2 × 2 to downsample, with a 1 × 1 convolutional layer bypass connection. Following these blocks is a linear layer.
The LSTM part of the model is a 2-layer network with 512 nodes per layer in each direction. After the first layer, the two directions are added together before being passed to the second layer, after which the two directions are concatenated. The last two layers in the network are linear. The entire network uses ELU activations [33], batch norm, and dropout.

Both classifier networks are trained using the Adam optimizer [32] with learning rate η = 10^-4 for the WRBN and η = 10^-5 for the DNN. We use 257-dimensional mean-normalized spectrogram features as input to the networks. Delta and delta-delta coefficients are not used. The DNN senone classifier uses a window of 5 context frames in the past and future, while the WRBN is trained on a per-utterance basis with full backpropagation through time.

The WRBN model achieved a cross-entropy loss of 1.1 on the clean speech development set, almost half the cross-entropy loss of the DNN model (2.1), so we expect it to provide much more helpful feedback to the spectral mapper model.

5. EXPERIMENTS

We evaluate the quality of the denoised features produced with our residual network spectral mapper by training an off-the-shelf Kaldi recipe for Track 2 of the CHiME-2 challenge [34].

5.1. Task and data description

CHiME-2 is a medium-vocabulary task for word recognition under reverberant and noisy environments without speaker movements. In this task, three types of data are provided based on the Wall Street Journal (WSJ0) 5K vocabulary read speech corpus: clean, reverberant, and reverberant+noisy.

Table 1. Fidelity loss on the development set for our baseline model and the residual network, both with and without mimic loss training.

    Enhancement Model          Fidelity loss
    DNN spectral mapper        0.52
      with DNN mimic           0.51
      with WRBN mimic          0.51
    Residual network mapper    0.47
      with learn rate drop     0.44
      with DNN mimic           0.48
      with WRBN mimic          0.49
The clean utterances are extracted from the WSJ0 database. The reverberant utterances are created by convolving the clean speech with binaural room impulse responses (BRIRs) corresponding to a frontal position in a family living room. Real-world non-stationary noise background recorded in the same room is mixed with the reverberant utterances to form the reverberant+noisy set. The noise excerpts are selected such that the signal-to-noise ratio (SNR) ranges among -6, -3, 0, 3, 6, and 9 dB without scaling. The multi-condition training, development, and test sets of the reverberant+noisy set contain 7138, 2454, and 1980 utterances respectively, which are the same utterances as in the clean set but with reverberation and noise at 6 different SNR conditions.

5.2. Description of the Kaldi recipe

In order to determine the effectiveness of our front-end system, we train the denoised features with an off-the-shelf Kaldi recipe for CHiME-2. The DNN-HMM hybrid system is trained using the clean WSJ0-5k alignments generated using the method stated above. The DNN acoustic model has 7 hidden layers, with 2048 sigmoid neurons in each layer and a softmax output layer. The splicing context size for the filterbank features was fixed at 11 frames (5 frames of past and 5 frames of future context), with a minibatch size of 1024. After that, we train the DNN with state-level minimum Bayes risk (sMBR) sequence training. We regenerate the lattices after the first iteration and train for 4 more iterations. We use the CMU pronunciation dictionary and the official 5k closed-vocabulary trigram language model in our experiments.

6. RESULTS

We report the best fidelity loss of all models on the development set in Table 1. Fidelity loss is a record of how well a model can exactly reproduce the clean speech signal, not taking into account whether the denoised signal is speech-like or not.

Table 2. Word error rates after generating denoised features and feeding them to the off-the-shelf Kaldi recipe for training. The first line for each model indicates WER for models trained with fidelity loss only; the following lines include the joint fidelity-mimic loss.

    Enhancement Model          WER
    No enhancement             17.3
    DNN spectral mapper        16.0
      with DNN mimic           14.4
      with WRBN mimic          14.0
    Residual network mapper    10.8
      with DNN mimic           10.5
      with WRBN mimic           9.3

In terms of fidelity loss, our residual networks gain about 10% over the baseline models. With the learning rate drop that is common in vision tasks, residual networks gain an additional 5%. However, this improvement in fidelity loss did not translate to any gain in WER. The last entries in the table show that the residual network performs slightly worse in terms of fidelity loss when mimic loss is added, which is to be expected given that the objective is split between fidelity loss and mimic loss.

In addition to our fidelity loss results, we present robust speech recognition results, generated by presenting our denoised spectral features to an off-the-shelf Kaldi recipe. The results are shown in Table 2. One point of note is that the features generated by the DNN spectral mapper without mimic loss perform only a little better than the original noisy features, likely due to introduced distortions [35].

It is also interesting to note that the WER gain for the residual network is much more significant than the fidelity loss alone would suggest, reaching around 30% relative improvement. This improvement holds whether the model is trained with or without mimic loss. Finally, we note that using a more sophisticated WRBN mimic leads to a large improvement in the performance of the residual network spectral mapper, but only a small gain for the DNN mapper. We speculate that the modeling power of the DNN may be limited, since it has only two layers.
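The "around 30% relative improvement" figure can be checked directly against Table 2; the helper below is a small sanity-check script of ours, not part of the system:

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction, expressed as a percentage."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# From Table 2: residual mapper vs. DNN mapper,
# without mimic loss (16.0 -> 10.8) and with the WRBN mimic (14.0 -> 9.3).
print(round(relative_improvement(16.0, 10.8), 1))  # 32.5
print(round(relative_improvement(14.0, 9.3), 1))   # 33.6
```

Both with and without mimic loss, the residual mapper's relative WER reduction over the DNN mapper is a little above 30%.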
Finally, we compare our best-performing model with other studies on the CHiME-2 test set that use only feature engineering and generation (e.g., more sophisticated language models are not included). Even without mimic loss, our model performs much better than all other systems that use no additional noise-robust features or joint training of the front-end speech enhancer and acoustic model. With the addition of mimic loss, our model also performs 10% better than the state-of-the-art, which uses both of these.

Table 3. Performance comparison with other studies on the CHiME-2 test set. "Additional NR features" indicates that noise-robust features are added. "Joint ASR training" indicates that the final ASR system and enhancement model are jointly tuned. Our previous system [1] was a DNN trained with the joint fidelity-mimic loss.

    Study                  Additional NR features    Joint ASR training    WER
    Chen et al. [21]       -                         X                     16.0
    Narayanan-Wang [2]     X                         X                     15.4
    Bagchi et al. [1]      -                         -                     14.7
    Weninger et al. [23]   X                         -                     13.8
    Wang et al. [7]        X                         X                     10.6
    Residual network       -                         -                     10.8
    ResNet + mimic loss    -                         -                      9.3

7. CONCLUSIONS

We have enhanced the performance of the mimic loss framework with the help of a ResNet-style architecture for spectral mapping and a more sophisticated senone classifier, with an almost 30% improvement over the DNN baseline, and achieve the best acoustic-only adaptation result without using additional noise-robust features or joint training of a speech enhancement module and ASR system.

One route to achieving improved WER may be to do mimic loss at a higher level, such as the word level rather than the senone level. Since other work has found that joint training all the way up to the word level has helped performance, we expect that this would help our denoiser.

For some tasks, targeting an ideal ratio mask which is then multiplied with the original signal has achieved higher performance than spectral mapping.
We plan to apply mimic loss to the technique of spectral masking; if successful, we could extend our work to the CHiME-3 and CHiME-4 challenges, where mask generation during the beamforming stage has achieved the state-of-the-art.

Our code is publicly available at https://github.com/OSU-slatelab/residual_mimic_net.

8. ACKNOWLEDGEMENTS

This work was supported by the National Science Foundation under Grant IIS-1409431. We also thank the Ohio Supercomputer Center (OSC) [36] for providing us with computational resources. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 GPU used for this research.

9. REFERENCES

[1] Deblin Bagchi, Peter Plantinga, Adam Stiff, and Eric Fosler-Lussier, "Spectral feature mapping with mimic loss for robust speech recognition," in Audio, Speech, and Signal Processing (ICASSP), International Conference on, 2018.

[2] Arun Narayanan and DeLiang Wang, "Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 92–101, 2015.

[3] Kun Han, Yuxuan Wang, and DeLiang Wang, "Learning spectral mapping for speech dereverberation," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4628–4632.

[4] Kun Han, Yuxuan Wang, DeLiang Wang, William S Woods, Ivo Merks, and Tao Zhang, "Learning spectral mapping for speech dereverberation and denoising," IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 982–992, 2015.

[5] Kun Han, Yanzhang He, Deblin Bagchi, Eric Fosler-Lussier, and DeLiang Wang, "Deep neural network based spectral feature mapping for robust speech recognition," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[6] Deblin Bagchi, Michael I Mandel, Zhongqiu Wang, Yanzhang He, Andrew Plummer, and Eric Fosler-Lussier, "Combining spectral feature mapping and multi-channel model-based source separation for noise-robust automatic speech recognition," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 496–503.

[7] Zhong-Qiu Wang and DeLiang Wang, "A joint training framework for robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 796–806, 2016.

[8] Jimmy Ba and Rich Caruana, "Do deep nets really need to be deep?," in Advances in Neural Information Processing Systems, 2014, pp. 2654–2662.

[9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.

[10] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik, "Unifying distillation and privileged information," arXiv preprint arXiv:1511.03643, 2015.

[11] Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong, "Learning small-size DNN with output-distribution-based criteria," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[13] Jahn Heymann, Lukas Drude, and Reinhold Haeb-Umbach, "Wide residual BLSTM network with discriminative speaker adaptation for robust speech recognition," in Proceedings of the 4th International Workshop on Speech Processing in Everyday Environments (CHiME16), 2016, pp. 12–17.

[14] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.
IEEE, 2013, pp. 6645–6649.

[15] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.

[16] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Dec. 2011, IEEE Signal Processing Society, IEEE Catalog No.: CFP11SRW-USB.

[17] Jun Du, Yan-Hui Tu, Lei Sun, Feng Ma, Hai-Kun Wang, Jia Pan, Cong Liu, Jing-Dong Chen, and Chin-Hui Lee, "The USTC-iFlytek system for CHiME-4 challenge," Proc. CHiME, pp. 36–38, 2016.

[18] Takuya Yoshioka, Nobutaka Ito, Marc Delcroix, Atsunori Ogawa, Keisuke Kinoshita, Masakiyo Fujimoto, Chengzhu Yu, Wojciech J Fabian, Miquel Espi, Takuya Higuchi, et al., "The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 436–443.

[19] Yanmin Qian and Philip C Woodland, "Very deep convolutional neural networks for robust speech recognition," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 481–488.

[20] Yu Zhang, William Chan, and Navdeep Jaitly, "Very deep convolutional networks for end-to-end speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4845–4849.

[21] Z Chen, S Watanabe, H Erdogan, and JR Hershey, "Integration of speech enhancement and recognition using long-short term memory recurrent neural network," in Proc. Interspeech, 2015.
[22] Murali Karthick Baskar , Martin Karafi ´ at, Luk ´ a ˇ s Bur- get, Karel V esel ` y, Franti ˇ sek Gr ´ ezl, and Jan ˇ Cernock ` y, “Residual memory networks: Feed-forward approach to learn long-term temporal dependencies, ” in Acoustics, Speech and Signal Pr ocessing (ICASSP), 2017 IEEE In- ternational Confer ence on . IEEE, 2017, pp. 4810–4814. [23] Felix W eninger, Hakan Erdogan, Shinji W atanabe, Em- manuel V incent, Jonathan Le Roux, John R Hershey , and Bj ¨ orn Schuller , “Speech enhancement with LSTM recurrent neural networks and its application to noise- robust ASR, ” in International Confer ence on La- tent V ariable Analysis and Signal Separation . Springer , 2015, pp. 91–99. [24] Like Hui, Meng Cai, Cong Guo, Liang He, W ei-Qiang Zhang, and Jia Liu, “Conv olutional maxout neural net- works for speech separation, ” in Signal Pr ocessing and Information T echnolo gy (ISSPIT), 2015 IEEE Interna- tional Symposium on . IEEE, 2015, pp. 24–27. [25] Szu-W ei Fu, Y u Tsao, and Xugang Lu, “SNR-a ware con v olutional neural network modeling for speech en- hancement., ” in Pr oc. Interspeech , 2016, pp. 3768– 3772. [26] Se Rim Park and Jin W on Lee, “ A fully con v olutional neural network for speech enhancement, ” Pr oc. Inter - speech 2017 , pp. 1993–1997, 2017. [27] Serge y Zagoruyko and Nikos Komodakis, “W ide resid- ual networks, ” in Pr oceedings of the British Machine V ision Confer ence (BMVC) , Edwin R. Hancock Richard C. W ilson and W illiam A. P . Smith, Eds. September 2016, pp. 87.1–87.12, BMV A Press. [28] W ayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer , Andreas Stolcke, Dong Y u, and Geoffre y Zweig, “The Microsoft 2016 conv ersational speech recognition system, ” in Acoustics, Speech and Signal Pr ocessing (ICASSP), 2017 IEEE International Confer ence on . IEEE, 2017, pp. 5255–5259. 
[29] T ara N Sainath, Oriol V inyals, Andrew Senior , and Has ¸ im Sak, “Con v olutional, long short-term memory , fully connected deep neural networks, ” in Acoustics, Speech and Signal Pr ocessing (ICASSP), 2015 IEEE In- ternational Confer ence on . IEEE, 2015, pp. 4580–4584. [30] T ara N Sainath, Ron J W eiss, Andrew Senior , K evin W W ilson, and Oriol V inyals, “Learning the speech front- end with ra w wa veform CLDNNs, ” in Sixteenth Annual Confer ence of the International Speech Communication Association , 2015. [31] Andreas V eit, Michael J W ilber , and Serge Belongie, “Residual netw orks beha ve like ensembles of relativ ely shallow networks, ” in Advances in Neural Information Pr ocessing Systems , 2016, pp. 550–558. [32] Diederik P Kingma and Jimmy Ba, “ Adam: A method for stochastic optimization, ” arXiv pr eprint arXiv:1412.6980 , 2014. [33] Djork-Arn ´ e Clevert, Thomas Unterthiner , and Sepp Hochreiter , “Fast and accurate deep network learning by exponential linear units (ELUs), ” arXiv pr eprint arXiv:1511.07289 , 2015. [34] Emmanuel V incent, Jon Barker , Shinji W atanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matas- soni, “The second CHiME speech separation and recognition challenge: Datasets, tasks and baselines, ” in Acoustics, Speech and Signal Pr ocessing (ICASSP), 2013 IEEE International Conference on . IEEE, 2013, pp. 126–130. [35] Arun Narayanan and DeLiang W ang, “In vestig ation of speech separation as a front-end for noise robust speech recognition, ” IEEE/A CM T ransactions on Au- dio, Speech, and Language Pr ocessing , vol. 22, no. 4, pp. 826–835, 2014. [36] Ohio Supercomputer Center , “Ohio supercom- puter center , ” http://osc.edu/ark:/19495/ f5s1ph73 , 1987.