Recurrent Neural Network Training with Dark Knowledge Transfer


Authors: Zhiyuan Tang, Dong Wang, Zhiyong Zhang

RECURRENT NEURAL NETWORK TRAINING WITH DARK KNOWLEDGE TRANSFER

Zhiyuan Tang 1,3, Dong Wang 1,2 *, Zhiyong Zhang 1,2
1. Center for Speech and Language Technologies (CSLT), RIIT, Tsinghua University
2. Tsinghua National Laboratory for Information Science and Technology
3. Chengdu Institute of Computer Applications, Chinese Academy of Sciences
{tangzy,zhangzy}@cslt.riit.tsinghua.edu.cn
* Corresponding author: wangdong99@mails.tsinghua.edu.cn

ABSTRACT

Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.

Index Terms — recurrent neural network, long short-term memory, knowledge transfer learning, automatic speech recognition

1. INTRODUCTION

Deep learning has gained significant success in a wide range of applications, for example, automatic speech recognition (ASR) [1].
A powerful deep learning model that has been reported effective in ASR is the recurrent neural network (RNN), e.g., [2, 3, 4]. An obvious advantage of RNNs compared to conventional deep neural networks (DNNs) is that RNNs can model long-term temporal properties and thus are suitable for modeling speech signals.

A simple training method for RNNs is the backpropagation through time algorithm [5]. This first-order approach, however, is rather inefficient for two main reasons: (1) the twists of the objective function caused by high nonlinearity; (2) the vanishing and explosion of gradients in backpropagation [6]. In order to address these difficulties (mainly the second), a modified architecture called long short-term memory (LSTM) was proposed in [7] and has been successfully applied to ASR [8]. In the echo state network (ESN) architecture proposed by [9], the hidden-to-hidden weights are not learned in training, so the problem of ill-behaved gradients does not exist. Recently, a special variant of the Hessian-free (HF) optimization approach was successfully applied to learn RNNs from random initialization [10, 11]. A particular problem of the HF approach is that its computation is demanding. Another recent study shows that a carefully designed momentum setting can significantly improve RNN training with limited computation and can reach the performance of the HF method [12]. Although these methods can address the difficulties of RNN training to some extent, they are either rather tricky (e.g., the momentum method) or less optimal (e.g., the ESN method). Particularly with limited data, RNN training remains difficult.

(This work was supported by the National Natural Science Foundation of China under Grant No. 61371136 and the MESTDC PhD Foundation Project No. 20130002120011. This paper was also supported by Huilan Ltd. and Sinovoice.)
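The gradient pathologies mentioned above are easy to reproduce numerically: in backpropagation through time the gradient is repeatedly multiplied by the recurrent Jacobian, so its norm shrinks or grows geometrically with the number of time steps. The following is a minimal sketch under our own assumptions (a linear recurrence with a random weight matrix; the dimensions and scales are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_norms(scale, steps=50, dim=32):
    """Track the gradient norm as it is propagated back `steps`
    time steps through a linear recurrence with weight matrix W."""
    W = scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)
    g = np.ones(dim)                 # gradient arriving at the last time step
    norms = []
    for _ in range(steps):
        g = W.T @ g                  # one step of backpropagation through time
        norms.append(np.linalg.norm(g))
    return norms

small = backprop_norms(scale=0.5)    # spectral radius < 1: gradient vanishes
large = backprop_norms(scale=2.0)    # spectral radius > 1: gradient explodes
print(small[-1], large[-1])
```

The spectral radius of the recurrent matrix determines which regime applies; LSTM's gating and the other remedies cited above are all ways of keeping this product of Jacobians well behaved.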
This paper focuses on the LSTM structure and presents a simple yet powerful training algorithm based on knowledge transfer. The algorithm is largely motivated by the recently proposed logit matching [13] and dark knowledge distiller [14]. The basic idea of the knowledge transfer approach is that a well-trained model involves rich knowledge of the target task and can be used to guide the training of other models. Current research focuses on learning simple models (in terms of structure) from a powerful yet complex model, or from an ensemble of models [13, 14], based on the idea of model compression [15]. In ASR, this idea has been employed to train small DNN models from a large and complex one [16].

In this paper, we conduct the opposite study, which employs a simple DNN model to train a more complex RNN. Different from the existing research that tries to distill knowledge from the teacher model, we treat the teacher model as a regularization so that the training process of the child model is smoothed, or as a pre-training step so that the supervised training can start from a good initial point. This in fact leads to a new training approach that is easy to perform and can be extended to any model architecture. We employ this idea to address the difficulties in RNN training. Experiments on an ASR task with the Aurora4 database verified that the proposed method can significantly improve RNN training.

The rest of the paper is organized as follows. Section 2 briefly discusses some related work, Section 3 presents the method, Section 4 presents the experiments, and the paper is concluded in Section 5.

2. RELATION TO PRIOR WORK

This study is directly motivated by the work on dark knowledge distillation [14].
The important aspect that distinguishes our work from others is that the existing methods focus on distilling knowledge from a complex model and using it to improve simple models, whereas our study uses simple models to teach complex models. The teacher model in our work in fact does not know very much, but it is sufficient to provide a rough guide, which is important for training complex models, such as the RNNs in the present study.

Another related work is the knowledge transfer between DNNs and RNNs proposed in [17]. However, it employs knowledge transfer to train DNNs with RNNs. This still follows the conventional idea described above, and so is different from ours.

3. RNN TRAINING WITH KNOWLEDGE TRANSFER

3.1. Dark knowledge distiller

The idea that a well-trained DNN model can be used as a teacher to guide the training of other models was proposed by several authors almost at the same time [13, 14, 16]. The basic assumption is that the teacher model encodes rich knowledge for the task at hand, and this knowledge can be distilled to boost the child model, which is often simpler and cannot learn many details without the teacher's guidance. There are a few ways to distill the knowledge. The logit matching approach proposed by [13] teaches a child model by encouraging its logits (activations before the softmax) to be close to those of the teacher model in terms of the ℓ2 norm, while the dark knowledge distiller proposed by [14] encourages the posterior probabilities (softmax output) of the child model to be close to those of the teacher model in terms of cross entropy. This transfer learning has been applied to train simple models to approach the performance of a complex model or a large model ensemble, for example, learning a small DNN from a large DNN [16] or a DNN from a more complex RNN [17].
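The two distillation criteria just described can be written down directly: logit matching penalizes the squared ℓ2 distance between teacher and student logits, while the dark knowledge distiller uses the cross entropy between the teacher's posteriors and the child's softmax output. A minimal NumPy sketch (the logit values are made up for illustration):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                  # numerical stability
    return z - np.log(np.exp(z).sum())

def logit_matching_loss(student_logits, teacher_logits):
    """Logit matching [13]: squared l2 distance between logits."""
    d = student_logits - teacher_logits
    return float(d @ d)

def dark_knowledge_loss(student_logits, teacher_logits):
    """Dark knowledge distiller [14]: cross entropy between the
    teacher's posteriors (soft targets) and the child's softmax."""
    p = np.exp(log_softmax(teacher_logits))      # teacher soft targets
    return float(-(p * log_softmax(student_logits)).sum())

teacher = np.array([2.0, 0.5, -1.0])             # hypothetical teacher logits
student = np.array([1.5, 1.0, -0.5])             # hypothetical child logits
print(logit_matching_loss(student, teacher))
print(dark_knowledge_loss(student, teacher))
```

By Gibbs' inequality the cross-entropy loss is minimized exactly when the child reproduces the teacher's posterior distribution.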
We focus on the dark knowledge distiller approach as it showed better performance in our experiments. Basically, a well-trained DNN model plays the role of a teacher and generates posterior probabilities of the training samples as new targets for training other models. These posterior probabilities are called 'soft targets' since the class identities are not as deterministic as the original one-hot 'hard targets'. To make the targets softer, a temperature T can be applied to scale the logits in the softmax, formulated as

p_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}

where i, j index the output units. The introduction of T allows more information about the non-target classes to be distilled. For example, a training sample with the hard target [1, 0, 0] does not involve any rank information for the second and third classes; with soft targets, e.g., [0.8, 0.15, 0.05], the rank information of the second and third classes is reflected. Additionally, with a larger T applied, the target is even softer, e.g., [0.6, 0.25, 0.15], which allows the non-target classes to be more prominent in the training. Note that the additional rank information on the non-target classes is not available in the original target, but is distilled from the teacher model. A larger T boosts the information of the non-target classes but at the same time reduces the information of the target classes. If T is very large, the soft target falls back to a uniform distribution and is not informative any more.¹ Therefore, T controls how the knowledge is distilled from the teacher model and hence needs to be set appropriately according to the task at hand.

3.2. Dark knowledge for complex model training

Dark knowledge, in the form of soft targets, can be used not only for boosting simple models, but also for training complex models.
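The effect of the temperature on the softmax above is easy to reproduce: with T = 1 the teacher's posteriors are relatively peaked, and raising T flattens them toward a uniform distribution. A sketch with made-up logits (not values from the paper):

```python
import numpy as np

def soft_targets(logits, T=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T): the teacher's
    temperature-scaled softmax output, used as training targets."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

z = [4.0, 2.0, 0.0]                  # hypothetical teacher logits
for T in (1, 2, 8, 100):
    print(T, np.round(soft_targets(z, T), 3))
```

As T grows, the probability mass assigned to the top class falls and the non-target classes become more visible; at very large T the distribution is nearly uniform and, as noted above, carries almost no information.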
We argue that training with soft targets offers at least two advantages: (1) it provides more information for model training and (2) it makes the training more reliable. These two advantages are particularly important for training complex models, especially when the training data is limited.

Firstly, soft targets offer probabilistic class labels which are not as 'definite' as hard targets. On one hand, this matches the real situation, where uncertainty always exists in classification tasks. For example, in speech recognition, it is often difficult to identify the phone class of a frame due to the effect of co-articulation. On the other hand, this uncertainty involves rich (but less discriminative) information within a single example. For example, the uncertainty in phone classes indicates which phones are similar to each other and easily confused. Making use of this information in the form of soft targets (posterior probabilities) helps improve the statistical strength of all phones in a collaborative way, and is therefore particularly helpful for phones with little training data.

Secondly, soft targets blur the decision boundaries between classes, which offers a smoother training. The smoothness associated with soft targets has been noticed in [14], which states that soft targets result in less variance in the gradient between training samples. This can be easily verified by looking at the gradient backpropagated to the logit layer, which is t_i − y_i for the i-th logit, where t_i is the target and y_i is the output of the child model in training. The accumulated variance is given by:

Var(t) = \sum_i \left\{ E_x (t_i - y_i)^2 - (E_x t_i - E_x y_i)^2 \right\}

where the expectation E_x is conducted over the training data x. If we assume that E_x t_i is identical for soft and hard targets (which is reasonable if the teacher model is well trained on the same data), then the variance is given by:

Var(t) = \sum_i E_x (t_i - y_i)^2 + const

where const is a constant term. If we assume that the child model can learn the teacher model well, the gradient variance approaches zero with soft targets, which is impossible with hard targets even when the training has converged.

The reduced gradient variance is highly desirable when training deep and complex models such as RNNs. We argue that it can mitigate the risk of gradient vanishing and explosion that is well known to hinder RNN training, leading to a more reliable training.

3.3. Regularization view

It has been known that including both soft and hard targets improves performance with an appropriate setting of a weight factor to balance their relative contributions [14]. This can be formulated as a regularized training problem, with the objective function given by:

L(\theta) = \alpha L_H(\theta) + L_S(\theta) = \sum_i \sum_j (\alpha t_{ij} + p_{ij}) \ln y_{ij}(\theta)

where θ represents the parameters of the model, L_H(θ) and L_S(θ) are the costs associated with the hard and soft targets respectively, and α is the weight factor. Additionally, t_ij and p_ij are the hard and soft targets for the i-th sample on the j-th class, respectively. Note that L_H(θ) is the objective function of the conventional supervised training, and so L_S(θ) plays the role of a regularization.

¹ This argument should not be confused with the conclusion in [14], where it was found that when T is also applied to the child net, a large T is equivalent to logit matching. The assumption of this equivalence is that T is large compared to the magnitude of the logit values, but not infinitely large. In fact, if T is very large, the gradient will approach zero, so no knowledge is distilled from the teacher model.
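The regularized objective above can be sketched as a weighted sum of the two cross-entropy terms. A minimal NumPy version, written as a loss to minimize (the α value and target arrays are illustrative, not from the paper):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def regularized_loss(logits, hard, soft, alpha=0.5):
    """-(alpha * t_ij + p_ij) * log y_ij, summed over samples i and
    classes j; hard targets t are one-hot rows, soft targets p come
    from the teacher. The paper states the objective; the negation
    here turns it into a loss for minimization."""
    log_y = log_softmax(logits)
    return -float(((alpha * hard + soft) * log_y).sum())

logits = np.array([[2.0, 0.1, -1.0]])        # child model outputs
hard   = np.array([[1.0, 0.0, 0.0]])         # one-hot hard target
soft   = np.array([[0.8, 0.15, 0.05]])       # hypothetical teacher posteriors
print(regularized_loss(logits, hard, soft, alpha=0.5))
```

Because the objective is linear in α, the weight factor smoothly interpolates between pure soft-target training (α = 0) and an equal mix of the two supervision signals.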
The effect of the regularization term is to force the model under training (the child model) to mimic the teacher model, a form of knowledge transfer. In this study, a DNN model is used as the teacher model to regularize the training of an RNN. With this regularization, the RNN training looks for optima which produce similar targets as the DNN does, so the risk of over-fitting and under-fitting can be largely reduced.

3.4. Pre-training view

Instead of training the model with soft and hard targets together, we can first train a reasonable model with soft targets, and then refine the model with hard targets. In this way, the transfer learning plays the role of pre-training, and the conventional supervised training plays the role of fine-tuning. The rationale is that the soft targets result in a reliable training and so can be used to conduct model initialization. However, since the information involved in soft targets is less discriminative, refinement with hard targets tends to be helpful. This can be informally interpreted as teaching the model with less but important discriminative information first; once the model is strong enough, more discriminative information can be learned. This leads to a new pre-training strategy based on dark knowledge transfer. In the conventional pre-training approaches based on either the restricted Boltzmann machine (RBM) [18] or the auto-encoder (AE) [19], simple models are trained and stacked to construct complex models. The dark knowledge pre-training functions in a different way: it makes a complex model trainable by using less discriminative information (soft targets), while the model structure does not change.
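The two-stage schedule can be sketched end-to-end on a toy problem: a softmax classifier is first trained against (synthetic) teacher posteriors, then fine-tuned on the one-hot labels. Everything here — the data, the blended stand-in for soft targets, the epoch counts — is ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

def train(W, X, T, epochs, lr=0.5):
    """Full-batch gradient descent on cross entropy against targets T
    (rows may be one-hot hard targets or teacher posteriors)."""
    for _ in range(epochs):
        Y = np.exp(log_softmax(X @ W))
        W = W - lr * X.T @ (Y - T) / len(X)   # dCE/dlogits = y - t
    return W

X = rng.standard_normal((90, 5))
labels = (X @ rng.standard_normal((5, 3))).argmax(axis=1)
hard = np.eye(3)[labels]
soft = 0.8 * hard + 0.2 / 3                   # smoothed stand-in for DNN soft targets

W = np.zeros((5, 3))
W = train(W, X, soft, epochs=30)              # stage 1: pre-train on soft targets
W = train(W, X, hard, epochs=30)              # stage 2: fine-tune on hard targets
acc = ((X @ W).argmax(axis=1) == labels).mean()
print(acc)
```

The only change between the two stages is the target matrix handed to the same training routine, which is precisely what makes this pre-training scheme architecture-agnostic.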
This approach possesses several advantages: (1) it is fully supervised and so more task-oriented; (2) it pre-trains the model as a whole, instead of layer by layer, and so tends to be fast; (3) it can be used to pre-train any complex model for which the layer structure is not clear, such as the RNN model that we focus on in this paper.

The pre-training view is related to the curriculum training method discussed in [20], where training samples that are easy to learn are selected first to train the model, while more difficult ones are selected later when the model has become fairly strong. In the dark knowledge pre-training, the soft targets can be regarded as easy samples for pre-training, and hard targets as difficult samples for fine-tuning.

Interestingly, the regularization view and the pre-training view are closely related. The pre-training is essentially a regularization that places the model at some location in the parameter space from which good local minima can be easily reached. This relationship between regularization and pre-training has been discussed in the context of DNN training [21].

4. EXPERIMENTS

To verify the proposed method, we use it to train RNN acoustic models for an ASR task that is known to be difficult. Note that all the RNNs we mention in this section are in fact LSTMs. The experiments are conducted on the Aurora4 database in noisy conditions, and the data profile is largely standard: 7137 utterances for model training, 4620 utterances for development and 4620 utterances for testing. The Kaldi toolkit [22] is used to conduct the model training and performance evaluation, and the process largely follows the Aurora4 s5 recipe for GPU-based DNN training. Specifically, the training starts from constructing a system based on Gaussian mixture models (GMM) with the standard 13-dimensional MFCC features plus the first- and second-order derivatives.
A DNN system is then trained with the alignment provided by the GMM system. The feature used for the DNN system is the 40-dimensional Fbanks. A symmetric 11-frame window is applied to concatenate neighboring frames, and an LDA transform is used to reduce the feature dimension to 200, which forms the DNN input. The DNN architecture involves 4 hidden layers, and each layer consists of 2048 units. The output layer is composed of 2008 units, equal to the total number of Gaussian mixtures in the GMM system. Cross entropy is used as the training criterion, and the stochastic gradient descent (SGD) algorithm is employed to perform the training.

In the dark knowledge transfer learning, the trained DNN model is used as the teacher model to generate soft targets for the RNN training. The RNN architecture involves 2 layers of LSTMs with 800 cells per layer. The unidirectional LSTM has a recurrent projection layer as in [4], while the non-recurrent one is discarded. The input features are the 40-dimensional Fbanks, and the output units correspond to the Gaussian mixtures as in the DNN. The RNN is trained with 4 streams and each stream contains 20 continuous frames. The momentum is empirically set to 0.9, and the starting learning rate is set to 0.0001 by default.

The experimental results are reported in Table 1. The performance is evaluated in terms of two criteria: the frame accuracy (FA) and the word error rate (WER). While FA is more related to the training criterion (cross entropy), WER is more important for speech recognition. In Table 1, the FAs are reported on both the training set (TR FA) and the cross-validation set (CV FA), and the WER is reported on the test set.

In Table 1, RNN-0 is the RNN baseline trained with hard targets. RNN-T1 and RNN-T2 are trained with dark knowledge transfer, where the temperature T is set to 1 and 2 respectively.
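Raising T shrinks the soft-target gradients: with T applied to both networks, the gradient of the soft cross entropy with respect to a student logit is (q_i − p_i)/T, which for large T falls off roughly as 1/T². This is why the soft gradients are rescaled by T² when mixed with hard-target gradients [14]. A sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def soft_target_grad(student_logits, teacher_logits, T):
    """Gradient of the soft-target cross entropy w.r.t. the student
    logits when temperature T is applied to both networks."""
    p = softmax(teacher_logits / T)   # teacher soft targets
    q = softmax(student_logits / T)   # student posteriors
    return (q - p) / T

s = np.array([3.0, 0.0, -1.0])        # hypothetical student logits
t = np.array([2.5, 0.5, -0.5])        # hypothetical teacher logits
for T in (1.0, 2.0, 4.0):
    g = soft_target_grad(s, t, T)
    print(T, np.linalg.norm(g), np.linalg.norm(T**2 * g))
```

Multiplying the soft gradients by T² keeps their contribution roughly comparable to the hard-target gradients as the temperature is varied.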
For each dark knowledge transfer model, the soft targets are employed in three ways: in the 'soft' way, only soft targets are used in RNN training; in the 'reg.' way, the soft and hard targets are used together, the soft targets playing the role of regularization, with the gradients of the soft targets scaled up by T² [14]; in the 'pretrain' way, the soft and hard targets are used sequentially, and the soft targets play the role of pre-training. The weight factor in the regularization approach is empirically set to 0.5.

                       Targets        TR FA%   CV FA%   WER%
DNN                    Hard           63.0     45.2     11.40
RNN-0                  Hard           67.3     51.9     13.57
RNN-T1 (soft)          Soft           59.4     49.9     11.46
RNN-T1 (reg.)          Soft + Hard    67.5     53.7     10.84
RNN-T1 (pretrain)      Soft, Hard     65.5     54.2     10.71
RNN-T2 (soft)          Soft           58.2     49.5     11.32
RNN-T2 (reg.)          Soft + Hard    65.8     53.3     10.88
RNN-T2 (pretrain)      Soft, Hard     64.6     54.1     10.57

Table 1: Results with different models and training methods.

It can be observed that the RNN baseline (RNN-0) cannot beat the DNN baseline in terms of WER, although much effort has been devoted to calibrating the training process, including various trials with different learning rates and momentum values. This is consistent with the results published with the Kaldi recipe. Note that this does not mean RNNs are inferior to DNNs. From the FA results, it is clear that the RNN model leads to better quality in terms of the training objective. Unfortunately, this advantage is not propagated to the WER on the test set. Additionally, the results shown here cannot be interpreted as meaning that RNNs are unsuitable for ASR (in terms of WER). In fact, several researchers have reported better WERs with RNNs, e.g., [3]. Our results just say that on the Aurora4 database, an RNN with the basic training method does not generalize well in terms of WER, although it works well in terms of the training criterion.
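For reference, the WER column in Table 1 is the standard edit-distance metric: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal implementation (ours, not the Kaldi scoring script):

```python
def wer(ref, hyp):
    """Word error rate: edit distance between word sequences,
    normalized by the reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                   # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                   # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Note that, unlike frame accuracy, WER can exceed 100% when the hypothesis contains many insertions.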
This problem can be largely solved by the dark knowledge transfer learning, as demonstrated by the results of the RNN-T1 and RNN-T2 systems. It can be seen that with the soft targets only, the RNN system obtains equal (T=1) or even better (T=2) performance in comparison with the DNN baseline, which means that the knowledge embedded in the DNN model has been transferred to the RNN model, and the knowledge can be arranged in a better form within the RNN structure. Looking at the FA results, it can be seen that the knowledge transfer learning does not improve accuracy on the training set, but leads to better or comparable FAs on the CV set compared to the DNN and RNN baselines. This indicates that transfer learning with soft targets sacrifices the FA performance on the training set a little, but leads to better generalization on the CV set. Additionally, the advantage on WER indicates that the generalization is improved not only across data sets, but also across evaluation metrics.

When combining soft and hard targets, either by way of regularization or pre-training, the performance in terms of both FA and WER is improved. This confirms the hypothesis that the knowledge transfer learning does play the roles of regularization and pre-training. Note that in all these cases, the FA results on the training set are lower than that of the RNN baseline, which confirms that the advantage of the knowledge transfer learning resides in improving the generalizability of the resultant model. When comparing the two dark knowledge RNN systems with different temperatures T, we see that T=2 leads to slightly worse FAs on the training and CV sets, but slightly better WERs. This confirms that a higher temperature generates smoother supervision and leads to better generalization.
5. CONCLUSION

We proposed a novel RNN training method based on dark knowledge transfer learning. The experimental results on the ASR task demonstrated that knowledge learned by simple models can be effectively used to guide the training of complex models. This knowledge can be used either as a regularization or for pre-training, and both approaches can lead to models that are more generalizable, a desired property for complex models. Future work involves applying this technique to more complex models that are difficult to train with conventional approaches, for example deep RNNs. Knowledge transfer between heterogeneous models is under investigation as well, e.g., between probabilistic models and neural models.

6. REFERENCES

[1] L. Deng and D. Yu, "Deep learning: Methods and applications," Foundations and Trends in Signal Processing, vol. 7, no. 3-4, pp. 197–387, 2013. [Online]. Available: http://dx.doi.org/10.1561/2000000039
[2] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.
[3] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1764–1772.
[4] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986. [Online]. Available: http://dx.doi.org/10.1038/323533a0
[6] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," Neural Networks, IEEE Transactions on, vol. 5, no. 2, pp. 157–166, 1994.
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[9] H. Jaeger and H. Haas, "Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication," Science, vol. 304, no. 5667, pp. 78–80, 2004.
[10] J. Martens, "Deep learning via Hessian-free optimization," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 735–742.
[11] J. Martens and I. Sutskever, "Learning recurrent neural networks with Hessian-free optimization," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1033–1040.
[12] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1139–1147.
[13] J. Ba and R. Caruana, "Do deep nets really need to be deep?" in Advances in Neural Information Processing Systems, 2014, pp. 2654–2662.
[14] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in NIPS 2014 Deep Learning Workshop, 2014.
[15] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 535–541.
[16] J. Li, R. Zhao, J.-T. Huang, and Y. Gong, "Learning small-size DNN with output-distribution-based criteria," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[17] W. Chan, N. R. Ke, and I. Lane, "Transferring knowledge from a RNN to a DNN," arXiv preprint arXiv:1504.01483, 2015.
[18] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[19] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle et al., "Greedy layer-wise training of deep networks," Advances in Neural Information Processing Systems, vol. 19, p. 153, 2007.
[20] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," arXiv preprint arXiv:1412.6550, 2014.
[21] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" The Journal of Machine Learning Research, vol. 11, pp. 625–660, 2010.
[22] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB.
