Transferring Knowledge from a RNN to a DNN

William Chan*¹, Nan Rosemary Ke*¹, Ian Lane¹﹐²
Carnegie Mellon University
¹Electrical and Computer Engineering, ²Language Technologies Institute
*Equal contribution
williamchan@cmu.edu, rosemary.ke@sv.cmu.edu, lane@cmu.edu

Abstract

Deep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task, compared to a baseline of 4.54 WER, or more than 13% relative improvement.

Index Terms: Deep Neural Networks, Recurrent Neural Networks, Automatic Speech Recognition, Model Compression, Embedded Platforms

1. Introduction

Deep Neural Networks (DNNs) combined with Hidden Markov Models (HMMs) have been shown to perform well across many Automatic Speech Recognition (ASR) tasks [1, 2, 3]. DNNs accept an acoustic context (e.g., a window of fMLLR features) as input and model the posterior distribution of the acoustic model. The "deep" in DNN is critical: state-of-the-art DNN models often contain multiple layers of non-linearities, giving them powerful modelling capabilities [4, 5].

Recently, Recurrent Neural Networks (RNNs) have demonstrated even more potential than their DNN counterparts [6, 7, 8].
RNN models are neural network models that contain recurrent connections, or cycles, in the connectivity graph. When unrolled, an RNN can be seen as a very special case of DNN. The recurrent nature of the RNN allows it to model temporal dependencies, which are common in speech sequences. In particular, the recurrent structure allows the model to store temporal information (e.g., the cell state in an LSTM [9]). In [10], RNNs were shown to outperform DNNs in large commercial ASR systems, and in [8], RNNs were shown to provide better performance than DNNs in robust ASR.

Currently, there is much industry interest in ASR for embedded platforms, for example, mobile phones, tablets and smart watches. However, these platforms tend to have limited computational capacity (e.g., no or limited GPU and/or a low-performance CPU), limited power availability (e.g., small batteries) and latency requirements (e.g., asking a GPS system for driving directions should be responsive). Unfortunately, many state-of-the-art DNN and RNN models are simply too expensive or impractical to run on embedded platforms. Traditionally, the approach is simply to use a small DNN, reducing the number of layers and the number of neurons per layer; however, such approaches often suffer from Word Error Rate (WER) degradations [11]. In this paper, we seek to improve the WER of small models which can be applied to embedded platforms.

DNNs and RNNs are typically trained from forced alignments generated by a GMM-HMM system. We refer to this as a hard alignment: the posterior distribution is concentrated on a single acoustic state for each acoustic context. There is evidence that these GMM alignment labels are not the optimal training labels [12, 13]. The GMM alignments make various assumptions about the data, such as independence of acoustic frames given states [12].
In this paper, we show that soft distribution labels generated by an expert are potentially more informative than the GMM hard alignments, leading to WER improvements. The effects of poor GMM alignment quality may be hidden away in large deep networks, which have sufficient model capacity. However, in narrow, shallow networks, training with the same GMM alignments often hurts ASR performance [11].

One approach is to change the training criterion: rather than trying to match our DNN to the GMM alignments, we can instead try to match our DNN to the distribution of an expert model (e.g., a big DNN). In [14], a small DNN was trained to match the output distribution of a large DNN. The training labels are generated by passing labelled and unlabelled data through the large DNN, and the small DNN is trained to match the resulting output distribution. The results were promising: [14] achieved a 1.33% WER reduction over their baseline systems.

Another approach is to train a model to match the softmax logits of an expert model. In [15], an ensemble of experts was trained and used to teach a (potentially smaller) DNN. Their motivation was inference cost (e.g., computational cost grows linearly with the number of ensemble models), however the principle of model compression applies [16]. [15] also generalized the framework, showing that models can be trained to match the logits of the softmax, rather than directly modelling the distributions, which could yield more knowledge transfer.

In this paper, we want to maximize the performance of a small DNN targeted at embedded platforms. We transfer knowledge from an RNN expert to a small DNN. We first build a large RNN acoustic model, and we then let the small DNN learn the distribution, or soft alignment, from the large RNN model. We show that our technique yields improvements in WER compared to baseline models trained on the hard GMM alignments.

The paper is structured as follows.
Section 2 begins with an introduction of a state-of-the-art RNN acoustic model. In Section 3, we describe the methodology used to transfer knowledge from a large RNN model to a small DNN model. Section 4 gives experiments, results and analysis. We finish in Section 5 with our conclusions and a discussion of future work.

2. Deep Recurrent Neural Networks

There exist many implementations of RNNs [17]; the LSTM is a particular implementation of RNN that is easy to train and does not suffer from the vanishing or exploding gradient problem in Backpropagation Through Time (BPTT) [18]. We follow [19, 20] in our LSTM implementation:

\[ i_t = \phi(W_{xi} x_t + W_{hi} h_{t-1}) \tag{1} \]
\[ f_t = \phi(W_{xf} x_t + W_{hf} h_{t-1}) \tag{2} \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1}) \tag{3} \]
\[ o_t = \phi(W_{xo} x_t + W_{ho} h_{t-1}) \tag{4} \]
\[ h_t = o_t \odot \tanh(c_t) \tag{5} \]

where \(\phi\) is the logistic sigmoid and \(\odot\) denotes elementwise multiplication. This particular LSTM implementation omits the bias and peephole connections. We also apply cell clipping at 3 to ease optimization and avoid exploding gradients. LSTMs can also be extended to Bidirectional LSTMs (BLSTMs), to capture temporal dependencies in both directions [7].

RNNs (and LSTMs) can also be extended into deep RNN architectures [21]. There is evidence that deep RNN models can perform better than shallow RNN models [7, 21, 20]. The additional layers of nonlinearities give the network additional model capacity, similar to the multiple layers of nonlinearities in a DNN.

We follow [20] in building our deep RNN; to be exact, the particular RNN model is termed a TC-DNN-BLSTM-DNN model. The architecture begins with a Time Convolution (TC) over the input features (e.g., fMLLR) [22]. This is followed by a DNN signal processor which projects the features into a higher dimensional space. The projected features are then consumed by a BLSTM, modelling the acoustic context sequence. Finally, a DNN with a softmax layer is used to model the posterior distribution.
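The LSTM recurrence of Eqs. (1)-(5) can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the authors' code: the weight names (`xi`, `hi`, ...) simply mirror the equation subscripts, and cell clipping at 3 is applied as described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, clip=3.0):
    """One step of the bias-free, peephole-free LSTM of Eqs. (1)-(5).

    W maps subscript names ("xi", "hi", ...) to weight matrices; these
    names are our own labels, not from any specific toolkit.
    """
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev)  # input gate, Eq. (1)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev)  # forget gate, Eq. (2)
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev)  # Eq. (3)
    c_t = np.clip(c_t, -clip, clip)                  # cell clipping at 3
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev)  # output gate, Eq. (4)
    h_t = o_t * np.tanh(c_t)                         # Eq. (5)
    return h_t, c_t
```

A bidirectional layer would run this recurrence over the sequence in both directions and concatenate the two hidden-state streams.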
The model of [20] gave more than 8% relative improvement over previous state-of-the-art DNNs on the Wall Street Journal (WSJ) eval92 task. In this paper, we use the TC-DNN-BLSTM-DNN model as our deep RNN to generate the training alignments from which the small DNN will learn.

3. Methodology

Our goal is to transfer knowledge from the RNN expert to a small DNN. We follow an approach similar to [14]: we transfer knowledge by training the DNN to match the RNN's output distribution. Note that we train on the soft distribution of the RNN (e.g., the top k states) rather than just the top-1 state (e.g., realigning the model with the RNN). In this paper we will show that the distribution generated by the RNN is more informative than the GMM alignments. We will also show that the soft distribution of the RNN is more informative than taking just the top-1 state generated by the RNN.

3.1. KL Divergence

We can match the output distribution of our DNN to that of our RNN by minimizing the Kullback-Leibler (KL) divergence between the two distributions. Namely, given the RNN posterior distribution P and the DNN posterior distribution Q, we want to minimize the KL divergence \(D_{KL}(P \| Q)\):

\[ D_{KL}(P(s|x) \,\|\, Q(s|x)) = \sum_i P(s_i|x) \ln \frac{P(s_i|x)}{Q(s_i|x)} \tag{6} \]
\[ = H(P, Q) - H(P) \tag{7} \]

where \(s_i \in s\) are the acoustic states, \(H(P, Q) = -\sum_i P(s_i|x) \ln Q(s_i|x)\) is the cross entropy term and \(H(P) = -\sum_i P(s_i|x) \ln P(s_i|x)\) is the entropy term. We can safely ignore the \(H(P)\) entropy term since its gradient is zero with respect to the small DNN parameters. Thus, minimizing the KL divergence is equivalent to minimizing the Cross Entropy Error (CSE) between the two distributions:

\[ H(P, Q) = -\sum_i P(s_i|x) \ln Q(s_i|x) \tag{8} \]

which we can easily differentiate to compute the derivative with respect to the pre-softmax activation \(a\) (e.g., the softmax logits):

\[ \frac{\partial J}{\partial a_i} = Q(s_i|x) - P(s_i|x) \tag{9} \]
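Equation (9) is easy to verify numerically. The sketch below is our own illustration, not the authors' code: it implements the cross entropy of Eq. (8) against a softmax and the analytic logit gradient Q − P of Eq. (9).

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, a):
    """H(P, Q) of Eq. (8), with Q = softmax(a) and P given as a vector p."""
    return -np.sum(p * np.log(softmax(a)))

def ce_grad(p, a):
    """Analytic gradient of Eq. (9): dJ/da_i = Q(s_i|x) - P(s_i|x)."""
    return softmax(a) - p
```

A central finite-difference check on `cross_entropy` matches `ce_grad` to numerical precision, confirming Eq. (9).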
3.2. Alignments

In most ASR scenarios, DNNs and RNNs are trained to model the posterior distribution using forced alignments generated by GMM-HMM models. We refer to this alignment as a hard GMM alignment because the probability is concentrated on a single state. Furthermore, the alignment labels generated by the GMM-HMM model are not always optimal for training DNNs [12]. The GMM-HMM makes various assumptions that may not be true (e.g., independence of frames). One possible solution is to use labels or alignments from another expert model; for example, in [15] an ensemble of experts was used to teach one model. In this paper, we generate labels from an expert RNN which provide better training targets than the GMM alignments.

One possibility is to generate hard alignments from an RNN expert. This is done by first training the RNN with hard alignments from the GMM-HMM model. After the RNN is trained, we realign the data by taking hard alignments (e.g., the top-1 probability state) from the trained RNN. The alignment is hard because it takes only the most probable phoneme state for each acoustic context, and the probability is concentrated on a single phoneme state.

On the other hand, we could utilize the full distribution, or soft alignment, associated with each acoustic frame. More precisely, for each acoustic context, we take the full distribution over phonetic states and their probabilities. However, this suffers from several problems. First, during training, we need to either run the RNN in parallel or pre-cache the distribution on disk. Running the RNN in parallel is an expensive operation and undesirable. The alternative is caching the distribution on disk, which would require enormous amounts of storage (e.g., we typically have several thousand acoustic states). For example, in WSJ, it would take over 30 TiB to store the full distribution of the si284 dataset.
We also run into bandwidth issues when loading the training samples from the disk cache. Finally, the entire distribution may not be useful, as there will be many states with near-zero values; intuition suggests we can simply discard those states (e.g., lossy compression).

[Figure 1: We use the hard GMM alignments to first train a RNN, after which we use the soft alignments from the RNN to train our small DNN.]

Our solution sits in between the two extremes of taking only the top-1 state and taking the full distribution. We find that the posterior distributions are typically concentrated on only a few states. Therefore, we can make use of almost the full distribution by storing only a small portion of the state probability distribution. We take the states that contain the top 98% of the probability mass. Note this is different from taking the top-k states: we take the n most probable states needed to capture at least 98% of the distribution, and n varies per frame. We then re-normalize the probabilities per frame to ensure the distribution sums to 1. This lossy compression method loses up to 2% of the original probability mass.

4. Experiments and Results

We experiment with the WSJ dataset; we use si284 (approximately 81 hours of speech) as the training set, dev93 as our development set and eval92 as our test set. We observe the WER on our development set after every epoch, and we stop training once the development set no longer improves. We report the converged dev93 WER and the corresponding eval92 WER. We use the same fMLLR features generated by the Kaldi s5 recipe [23], and our decoding setup is exactly the same as the s5 recipe (e.g., big dictionary and pruned trigram language model). We use the tri4b GMM alignments as our hard forced alignment training targets; there are a total of 3431 acoustic states.
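The soft-alignment compression described in Section 3.2 (keep the fewest states covering 98% of the probability mass, then renormalize) can be sketched as follows. This is a minimal illustration assuming the posterior arrives as a dense NumPy vector; it is not the authors' implementation.

```python
import numpy as np

def compress_soft_alignment(posterior, mass=0.98):
    """Keep the smallest set of states covering `mass` of the posterior.

    Returns (state indices, renormalized probabilities). Note this is not
    a fixed top-k: the number of kept states varies per frame.
    """
    order = np.argsort(posterior)[::-1]            # states, most probable first
    csum = np.cumsum(posterior[order])
    n = int(np.searchsorted(csum, mass)) + 1       # smallest n covering `mass`
    idx = order[:n]
    probs = posterior[idx] / posterior[idx].sum()  # renormalize to sum to 1
    return idx, probs
```

For a frame with posterior [0.5, 0.3, 0.15, 0.04, 0.01], four states are kept (cumulative mass 0.99 ≥ 0.98) and the fifth is discarded, losing 1% of the mass.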
The GMM tri4b baseline achieved dev and test WERs of 9.39 and 5.39 respectively.

4.1. Optimization

In our DNN and RNN optimization procedure, we initialized our networks randomly (e.g., no pretraining) and used Stochastic Gradient Descent (SGD) with a minibatch size of 128. We apply no gradient clipping or gradient projection in our LSTM. We experimented with constant learning rates of [0.1, 0.01, 0.001] and geometrically decayed learning rates with initial values of [0.1, 0.01] and a decay factor of 0.5. We report the best WERs from these learning rate hyperparameter optimizations.

4.2. Big DNN and RNN

We first built several baseline (big) DNN and RNN systems. These are large networks and are not suitable for deployment on mobile platforms. We followed the Kaldi s5 recipe and built a 7 layer DNN with 2048 neurons per hidden layer and DBN pretraining, which achieves an eval92 WER of 3.81 [23]. We also followed [20] and built a 5 layer ReLU DNN with 2048 neurons per hidden layer, which achieves an eval92 WER of 3.79. Our RNN model follows [20]: it consists of 2048 neurons per layer for the DNN layers and 256 bidirectional cells for the BLSTM. The RNN model achieves an eval92 WER of 3.47, significantly better than both big DNN models. Each network has a softmax output of 3431 states matching the GMM model. Table 1 summarizes the results for our baseline big DNN and big RNN experiments.

Table 1: Wall Street Journal WERs for big DNN and RNN models.

Model        | dev93 WER | eval92 WER
GMM Kaldi    | 9.39      | 5.39
DNN Kaldi s5 | 6.68      | 3.81
DNN ReLU     | 6.84      | 3.79
RNN [20]     | 6.58      | 3.47

4.3. Small DNN

We want to build a small DNN that is easily computable by an embedded device. We decided on a 3 layer network (2 hidden layers), wherein each hidden layer has 512 ReLU neurons, with a final softmax over the 3431 acoustic states matching the GMM.
Since Matrix-Matrix Multiplication (MMM) is an O(n³) operation, the effect is approximately a 128-fold reduction in the number of computations for the hidden layers (comparing 4 hidden layers of 2048 neurons vs. 2 hidden layers of 512 neurons). This will allow us to perform fast inference on embedded platforms with limited CPU/GPU capacity.

We first trained a small ReLU DNN using the hard GMM alignments. We achieved a 4.54 WER, compared to the 3.79 WER of the big ReLU DNN model, on the eval92 task. The dev93 WER is 8.00 for the small model vs. 6.84 for the large model; the big gap in dev93 WER suggests the big DNN model is able to optimize substantially better. The large DNN model has significantly more model capacity, which yields its better results over the small DNN.

Next, we experimented with the hard RNN alignment. We take the top-1 state of the RNN model and train our DNN towards this alignment. We did not see any improvement: while the dev93 WER improves from 8.00 to 7.83, the eval92 WER degrades from 4.54 to 4.63. This suggests the RNN hard alignments are worse labels than the original GMM alignments. The information provided by the RNN when looking only at the top state is no more informative than the GMM hard alignments. One hypothesis is that our DNN model overfits towards the RNN hard alignments, since the dev93 WER improves while the model fails to generalize to the eval92 test set.

We now experiment with the RNN soft alignment, wherein we add the soft distribution characteristics of the RNN to the small DNN. We take the states covering the top 98% of the probability mass of the RNN distribution and renormalize them (e.g., ensure the distribution sums to 1). We minimize the KL divergence between the RNN soft alignments and the small DNN. We see a significant improvement in WER: we achieve a dev93 WER of 7.38 and an eval92 WER of 3.93.
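The 128-fold figure from Section 4.3 can be reproduced under the paper's per-layer O(n³) matrix-matrix cost model, as a quick sanity check. (Note that per-frame inference actually performs matrix-vector products, which scale as n² per layer; the cost model below is the one the paper states, not ours.)

```python
def hidden_layer_cost(n_layers, width):
    """Hidden-layer cost under the paper's per-layer O(n^3) matrix-matrix model."""
    return n_layers * width ** 3

big = hidden_layer_cost(4, 2048)    # 4 hidden layers, 2048 neurons each
small = hidden_layer_cost(2, 512)   # 2 hidden layers, 512 neurons each
ratio = big // small                # 128
```

Under a per-frame matrix-vector model (width² per layer) the same comparison gives a 32× reduction, so the exact factor depends on how inference is batched.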
In the eval92 scenario, our WER improves by over 13% relative compared to the baseline GMM hard alignment. We almost match the WER of the big DNN, 3.79 (off by 3.6% relative), despite the big DNN having many more layers and neurons. The RNN soft alignment adds considerable information to the training labels over the GMM hard alignments or the RNN hard alignments.

We also experimented with training on the big DNN soft alignments. The big DNN model is the DNN ReLU model mentioned in Table 1, which achieved an eval92 WER of 3.79. Once again, we generate the soft alignments and train our small DNN to minimize the KL divergence. We achieved a dev93 WER of 7.43 and an eval92 WER of 4.27. There are several things to note. First, we once again improve over the GMM baseline, by 5.9% relative. Next, the dev93 WER is very close to that of the RNN soft alignment (less than 1% relative); however, the gap widens on eval92 (more than 8% relative). This suggests the model overfits more under the big DNN soft alignments, and the RNN soft alignments provide better generalization. The quality of the RNN soft alignments is much better than that of the big DNN soft alignments. Table 2 summarizes the WERs for the small DNN model using the different training alignments.

Table 2: Small DNN WERs for Wall Street Journal based on different training alignments.

Alignment | dev93 WER | eval92 WER
Hard GMM  | 8.00      | 4.54
Hard RNN  | 7.83      | 4.63
Soft RNN  | 7.38      | 3.93
Soft DNN  | 7.43      | 4.27

4.4. Cross Entropy Error

We compute the CSE of our various models against the GMM alignment on the dev93 dataset. We measure the CSE against dev93 since that is our stopping criterion and optimization loss. The CSE gives a better indication of the optimization procedure and of how our models are overfitting. Table 3 summarizes our CSE measurements. There are several observations: first, the big RNN achieves a lower CSE than the big DNN.
The RNN model is able to optimize better than the DNN, as seen in the better WERs the RNN model provides. This is expected, since the big RNN model achieves the best WER. The next observation is that the small DNNs trained on the soft alignments from the large DNN or RNN achieved a lower CSE than the small DNN trained on the GMM hard alignment. This suggests the soft alignment labels are indeed better training labels for optimizing the model. The extra information contained in the soft alignments helps us optimize better on the dev93 dataset.

The small DNNs trained on the soft RNN alignments and soft DNN alignments give interesting results. These models achieved a lower CSE than the large RNN and large DNN models trained on the GMM alignments. However, their WERs are worse than those of the large RNN and large DNN models. This suggests the small models trained on the soft distributions are overfitting; it is unclear whether the overfitting occurs because the smaller model cannot generalize as well as the large model, or because of the quality of the soft alignment labels.

Table 3: Cross Entropy Error (CSE) on WSJ dev93 over our various models.

Alignment | Model     | CSE
GMM       | Big RNN   | 1.27620
GMM       | Big DNN   | 1.28431
Hard RNN  | Small DNN | 1.52135
Soft RNN  | Small DNN | 1.24617
Soft DNN  | Small DNN | 1.24723

5. Conclusion and Discussions

The motivation and application of our work is to extend ASR onto embedded platforms, where computational capacity is limited. In this paper we have introduced a method to transfer knowledge from an RNN to a small DNN. We minimize the KL divergence between the two distributions to match the DNN's output to the RNN's output. We improve the WER from 4.54, trained on GMM forced alignments, to 3.93 on the soft alignments generated by the RNN. Our method results in more than 13% relative improvement in WER with no additional inference cost.
One question we did not answer in this paper is whether the small DNN's model capacity or the RNN's soft alignment quality is the bottleneck for further WER improvement. We did not measure the effect of the small DNN's model capacity on the WER: would we get similar WERs if we increased or decreased the small DNN's size? If the bottleneck is the quality of the soft alignments, then in principle we could reduce the small DNN's size further without impacting WER (much); however, if model capacity is the issue, then we should not use smaller networks.

On a similar note, we did not investigate the impact of the top probability selection in the RNN alignment. We thresholded at the top 98% of the probability mass out of convenience; however, how would selecting more or less of the probability mass affect the quality of the alignments? In the extreme case, where we selected only the top-1 probability, we found the model to perform much worse than with the 98% soft alignments, and even worse than with the GMM alignments; this evidence shows the importance of the information contained in the soft alignments.

We could also extend our work, similar to [14], to utilize vast amounts of unlabelled data to improve our small DNN. In [14], unlabelled data was passed through the large DNN expert to generate vast quantities of soft alignment labels for the small DNN to learn from. In principle, one could extend this to an unbounded amount of training data with synthetic data generation, which has been shown to improve ASR performance [24].

Finally, we did not experiment with sequence training [25]. Sequence training has almost always been shown to help [26]; it would be interesting to see its effects on these small models, and whether we can further improve ASR performance.

6. Acknowledgements

We thank Won Kyum Lee for helpful discussions and proofreading this paper.

7. References

[1] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, pp. 30-42, January 2012.
[2] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams, Y. Gong, and A. Acero, "Recent advances in deep learning for speech research at Microsoft," May 2013.
[3] H. Soltau, G. Saon, and T. Sainath, "Joint Training of Convolutional and Non-Convolutional Neural Networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
[4] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. E. Hinton, "On Rectified Linear Units for Speech Processing," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[5] G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving Deep Neural Networks for LVCSR Using Rectified Linear Units and Dropout," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[6] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech Recognition with Deep Recurrent Neural Networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[7] A. Graves, N. Jaitly, and A.-r. Mohamed, "Hybrid Speech Recognition with Bidirectional LSTM," in Automatic Speech Recognition and Understanding Workshop, 2013.
[8] C. Weng, D. Yu, S. Watanabe, and F. Juang, "Recurrent Deep Neural Networks for Robust Speech Recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
[9] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, November 1997.
[10] H. Sak, A. Senior, and F. Beaufays, "Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling," in INTERSPEECH, 2014.
[11] X. Lei, A. Senior, A. Gruenstein, and J. Sorensen, "Accurate and Compact Large Vocabulary Speech Recognition on Mobile Devices," in INTERSPEECH, 2013.
[12] N. Jaitly, V. Vanhoucke, and G. Hinton, "Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models," in INTERSPEECH, 2014.
[13] A. Senior, G. Heigold, M. Bacchiani, and H. Liao, "GMM-Free DNN Training," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
[14] J. Li, R. Zhao, J.-T. Huang, and Y. Gong, "Learning Small-Size DNN with Output-Distribution-Based Criteria," in INTERSPEECH, 2014.
[15] G. Hinton, O. Vinyals, and J. Dean, "Distilling the Knowledge in a Neural Network," in Neural Information Processing Systems: Deep Learning and Representation Learning Workshop, 2014.
[16] C. Bucila, R. Caruana, and A. Niculescu-Mizil, "Model Compression," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
[17] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," in Neural Information Processing Systems: Deep Learning and Representation Learning Workshop, 2014.
[18] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies," 2011.
[19] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: A Neural Image Caption Generator," 2014.
[20] W. Chan and I. Lane, "Deep Recurrent Neural Networks for Acoustic Modelling," in INTERSPEECH (submitted), 2015.
[21] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, "How to Construct Deep Recurrent Neural Networks," in International Conference on Learning Representations, 2014.
[22] W. Chan and I. Lane, "Deep Convolutional Neural Networks for Acoustic Modeling in Low Resource Languages," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
[23] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi Speech Recognition Toolkit," in Automatic Speech Recognition and Understanding Workshop, 2011.
[24] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," 2014.
[25] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in INTERSPEECH, 2013.
[26] H. Sak, O. Vinyals, G. Heigold, A. Senior, E. McDermott, R. Monga, and M. Mao, "Sequence Discriminative Distributed Training of Long Short-Term Memory Recurrent Neural Networks," in INTERSPEECH, 2014.