Improvements to deep convolutional neural networks for LVCSR
Tara N. Sainath1, Brian Kingsbury1, Abdel-rahman Mohamed2, George E. Dahl2, George Saon1, Hagen Soltau1, Tomas Beran1, Aleksandr Y. Aravkin1, Bhuvana Ramabhadran1

1 IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
2 Department of Computer Science, University of Toronto
1 {tsainath, bedk, gsaon, hsoltau, tberan, saravkin, bhuvana}@us.ibm.com, 2 {asamir, gdahl}@cs.toronto.edu

ABSTRACT

Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are better able to reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) of 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.

1.
INTRODUCTION

Deep Neural Networks (DNNs) are now the state of the art in acoustic modeling for speech recognition, showing tremendous improvements on the order of 10-30% relative across a variety of small and large vocabulary tasks [1]. Recently, deep convolutional neural networks (CNNs) [2, 3] have been explored as an alternative type of neural network which can reduce translational variance in the input signal. For example, in [4], deep CNNs were shown to offer a 4-12% relative improvement over DNNs across different LVCSR tasks. The CNN architecture proposed in [4] was a somewhat vanilla architecture that had been used in computer vision for many years. The goal of this paper is to analyze and justify what is an appropriate CNN architecture for speech, and to investigate various strategies to improve CNN results further.

First, the architecture proposed in [4] used multiple convolutional layers with full weight sharing (FWS), which was found to be beneficial compared to a single FWS convolutional layer. Because the locality of speech is known ahead of time, [3] proposed the use of limited weight sharing (LWS) for CNNs in speech. While LWS has the benefit that it allows each local weight to focus on parts of the signal which are most confusable, previous work with LWS had focused on a single LWS layer [3], [5]. In this work, we do a detailed analysis and compare multiple layers of FWS and LWS.

Second, there have been numerous improvements to CNNs in computer vision, particularly for small tasks. For example, using lp [6] or stochastic pooling [7] provides better generalization than the max pooling used in [4]. In addition, overlapping pooling [8] and pooling in time [9] also improve generalization to test data. Furthermore, multi-scale CNNs [6], that is, combining outputs from different layers of the neural network, have also been successful in computer vision.
We explore the effectiveness of these strategies for larger scale speech tasks.

Third, we investigate using better features for CNNs. Features for CNNs must exhibit locality in time and frequency. In [4] it was found that VTLN-warped log-mel features were best for CNNs. However, speaker-adapted features, such as feature-space maximum likelihood linear regression (fMLLR) features [10], typically give the best performance for DNNs. In [4], the fMLLR transformation was applied directly to a correlated VTLN-warped log-mel space. However, no improvement was observed, as fMLLR transformations typically assume uncorrelated features. In this paper, we propose a methodology to effectively use fMLLR with log-mel features. This involves transforming log-mel into an uncorrelated space, applying fMLLR in this space, and then transforming the new features back to a correlated space.

Finally, we investigate the role of rectified linear units (ReLU) and dropout for Hessian-free (HF) sequence training [11] of CNNs. In [12], ReLU+dropout was shown to give good performance for cross-entropy (CE) trained DNNs but was not employed during HF sequence training. However, sequence training is critical for speech recognition performance, providing an additional relative gain of 10-15% over a CE-trained DNN [11]. During CE training, the dropout mask changes for each utterance. However, during HF training, we are not guaranteed to get conjugate directions if the dropout mask changes for each utterance. Therefore, in order to make dropout usable during HF, we keep the dropout mask fixed per utterance for all iterations of conjugate gradient (CG) within a single HF iteration.

Results with the proposed strategies are first explored on a 50-hr English Broadcast News (BN) task. We find that there is no difference between LWS and FWS with multiple layers for an LVCSR task.
Second, we find that various pooling strategies that gave improvements in computer vision tasks do not help much in speech. Third, we observe that improving the CNN input features by including fMLLR gives improvements in WER. Finally, fixing the dropout mask during the CG iterations of HF lets us use dropout during HF sequence training and avoids destroying the gains from dropout accrued during CE training. Putting together improvements from fMLLR and dropout, we find that we are able to obtain a 2-3% relative reduction in WER compared to the CNN system proposed in [4]. In addition, on a larger 400-hr BN task, we can also achieve a 4-5% relative improvement in WER.

The rest of this paper is organized as follows. Section 2 describes the basic CNN architecture in [4] that serves as a starting point for the proposed modifications. In Section 3, we discuss experiments with LWS/FWS, pooling, fMLLR and ReLU+dropout for HF. Section 4 presents results with the proposed improvements on a 50 and 400-hr BN task. Finally, Section 5 concludes the paper and discusses future work.

2. BASIC CNN ARCHITECTURE

In this section, we describe the basic CNN architecture that was introduced in [4], as this will serve as the baseline system which we improve upon. In [4], it was found that having two convolutional layers and four fully connected layers was optimal for LVCSR tasks. We found that a pooling size of 3 was appropriate for the first convolutional layer, while no pooling was used in the second layer. Furthermore, the convolutional layers had 128 and 256 feature maps respectively, while the fully connected layers had 1,024 hidden units. The optimal feature set used was VTLN-warped log-mel filterbank coefficients, including delta + double delta. Using this architecture for CNNs, we were able to achieve between 4-12% relative improvement over DNNs across many different LVCSR tasks.
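As a rough illustration of the convolutional front end described above, the sketch below runs one full-weight-sharing convolutional layer over the frequency axis followed by non-overlapping pooling of size 3. The shapes (40 filter-bank bands, 8 feature maps, width-9 filters) are illustrative assumptions for this sketch, not values restated in this section.

```python
import numpy as np

def conv1d_freq(x, filters):
    """Valid convolution along frequency with full weight sharing.
    x: (n_freq,) band energies; filters: (n_maps, width)."""
    n_maps, width = filters.shape
    n_out = len(x) - width + 1
    out = np.empty((n_maps, n_out))
    for j in range(n_out):
        out[:, j] = filters @ x[j:j + width]   # same weights at every band
    return out

def max_pool_freq(a, size=3):
    """Non-overlapping max pooling of `size` along the frequency axis."""
    n_maps, n_out = a.shape
    n_pool = n_out // size
    return a[:, :n_pool * size].reshape(n_maps, n_pool, size).max(axis=2)

rng = np.random.default_rng(0)
x = rng.standard_normal(40)              # e.g. 40 filter-bank bands (assumed)
filters = rng.standard_normal((8, 9))    # 8 feature maps, width-9 filters (assumed)
h = max_pool_freq(conv1d_freq(x, filters), size=3)
print(h.shape)  # (8, 10): 40-9+1 = 32 conv outputs -> 10 pooled bands
```

In the actual baseline this is followed by a second convolutional layer with no pooling and then the fully connected stack.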
In this paper, we explore feature, architecture and optimization strategies to improve the CNN results further. Preliminary experiments are performed on a 50-hr English Broadcast News task [11]. The acoustic models are trained on 50 hours from the 1996 and 1997 English Broadcast News Speech Corpora. Results are reported on the EARS dev04f set. Unless otherwise noted, all CNNs are trained with cross-entropy, and results are reported in a hybrid setup.

3. ANALYSIS OF VARIOUS STRATEGIES FOR LVCSR

3.1. Optimal Feature Set

Convolutional neural networks require features which are locally correlated in time and frequency. This implies that Linear Discriminant Analysis (LDA) features, which are very commonly used in speech, cannot be used with CNNs, as they remove locality in frequency [3]. Mel filter-bank (FB) features are one type of speech feature which exhibit this locality property [?]. We explore whether any additional transformations can be applied to these features to further improve WER.

Table 1 shows the WER as a function of input feature for CNNs. The following can be observed:

• Using VTLN-warping to help map features into a canonical space offers improvements.
• Using fMLLR to further speaker-adapt the input does not help. One reason could be that fMLLR assumes the data is well modeled by a diagonal model, which would work best with decorrelated features. However, the mel FB features are highly correlated.
• Using delta and double-delta (d + dd) to capture further time-dynamic information in the feature helps.
• Using energy does not provide improvements.

In conclusion, it appears VTLN-warped mel FB + d + dd is the optimal input feature set to use. This feature set is used for the remainder of the experiments, unless otherwise noted.

Feature                                   WER
Mel FB                                    21.9
VTLN-warped mel FB                        21.3
VTLN-warped mel FB + fMLLR                21.2
VTLN-warped mel FB + d + dd               20.7
VTLN-warped mel FB + d + dd + energy      21.0

Table 1. WER as a function of input feature
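The delta and double-delta (d + dd) appends above are the standard regression-based time derivatives of the filter-bank features. A minimal sketch (the +/-2-frame regression window is a common choice, assumed here rather than stated in the paper):

```python
import numpy as np

def deltas(feats, N=2):
    """Regression-based delta features over a window of +/- N frames.
    feats: (n_frames, n_dims). Edge frames are handled by replication."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    T = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    d = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
    return d / denom

logmel = np.random.randn(100, 40)      # e.g. 100 frames of 40 mel bands
d = deltas(logmel)                     # delta
dd = deltas(d)                         # double-delta of the deltas
cnn_input = np.stack([logmel, d, dd])  # 3 "channels" fed to the CNN
print(cnn_input.shape)  # (3, 100, 40)
```

Stacking the static, delta and double-delta streams as separate input planes is one common way such features are presented to a CNN.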
3.2. Number of Convolutional vs. Fully Connected Layers

Most CNN work in image recognition makes use of a few convolutional layers before having fully connected layers. The convolutional layers are meant to reduce spectral variation and model spectral correlation, while the fully connected layers aggregate the local information learned in the convolutional layers to do class discrimination. However, the CNN work done thus far in speech [3] introduced a novel framework for modeling spectral correlations, but this framework only allowed for a single convolutional layer. We adopt a spatial modeling approach similar to the image recognition work, and explore the benefit of including multiple convolutional layers.

Table 2 shows the WER as a function of the number of convolutional and fully connected layers in the network. Note that for each experiment, the number of parameters in the network is kept the same. The table shows that increasing the number of convolutional layers up to 2 helps, and then performance starts to deteriorate. Furthermore, we can see from the table that CNNs offer improvements over DNNs for the same input feature set.

# of Convolutional vs. Fully Connected Layers    WER
No conv, 6 full (DNN)                            24.8
1 conv, 5 full                                   23.5
2 conv, 4 full                                   22.1
3 conv, 3 full                                   22.4

Table 2. WER as a function of # of convolutional layers

3.3. Number of Hidden Units

CNNs explored for image recognition tasks perform weight sharing across all pixels. Unlike images, the local behavior of speech features in low frequency is very different from that in high frequency regions. [3] addresses this issue by limiting weight sharing to frequency components that are close to each other. In other words, low and high frequency components have different weights (i.e., filters).
However, this type of approach limits adding additional convolutional layers [3], as filter outputs in different pooling bands are not related. We argue that we can apply weight sharing across all time and frequency components by using a larger number of hidden units in the convolutional layers than is typical for vision tasks, in order to capture the differences between low and high frequency components. This type of approach allows for multiple convolutional layers, something that has thus far not been explored in speech.

Table 3 shows the WER as a function of the number of hidden units in the convolutional layers. Again the total number of parameters in the network is kept constant for all experiments. We can observe that as we increase the number of hidden units up to 220, the WER steadily decreases. We do not increase the number of hidden units past 220, as this would require us to reduce the number of hidden units in the fully connected layers to fewer than 1,024 in order to keep the total number of network parameters constant, and we have observed that reducing the number of hidden units from 1,024 results in an increase in WER. We were able to obtain a slight improvement by using 128 hidden units for the first convolutional layer and 256 for the second layer. This is more hidden units in the convolutional layers than are typically used for vision tasks [2], [9], as many hidden units are needed to capture the locality differences between different frequency regions in speech.

Number of Hidden Units    WER
64                        24.1
128                       23.0
220                       22.1
128/256                   21.9

Table 3. WER as a function of # of hidden units

3.4. Limited vs. Full Weight Sharing

In speech recognition tasks, the characteristics of the signal in low-frequency regions are very different from those in high-frequency regions. This allows a limited weight sharing (LWS) approach to be used for convolutional layers [3], where weights only span a small local region in frequency.
LWS has the benefit that it allows each local weight to focus on parts of the signal which are most confusable, and to perform discrimination within just that small local region. However, one of the drawbacks is that it requires setting by hand the frequency region each filter spans. Furthermore, when many LWS layers are used, this limits adding additional full-weight-sharing convolutional layers, as filter outputs in different bands are not related and thus the locality constraint required for convolutional layers is not preserved. Thus, most work with LWS up to this point has looked at LWS with one layer [3], [5].

Alternatively, in [4], a full weight sharing (FWS) idea in convolutional layers was explored, similar to what was done in the image recognition community. With that approach, multiple convolutional layers were allowed, and it was shown that adding additional convolutional layers was beneficial. In addition, using a large number of hidden units in the convolutional layers better captures the differences between low and high frequency components.

Since multiple convolutional layers are critical for good performance in WER, in this paper we explore doing LWS with multiple layers. Specifically, the activations from one LWS layer have locality-preserving information, and can be fed into another LWS layer. Results comparing LWS and FWS are shown in Table 4. Note these results are with stronger VTLN-warped log-mel+d+dd features, as opposed to previous LWS work which used simpler log-mel+d+dd. For both LWS and FWS, we used 2 convolutional layers, as this was found in [4] to be optimal.

First, notice that as we increase the number of hidden units for FWS, there is an improvement in WER, confirming our belief that having more hidden units with FWS is important to help explain variations in frequency in the input signal.
Second, we find that if we use LWS but match the number of parameters to FWS, we get very slight improvements in WER (0.1%). It seems that both LWS and FWS offer similar performance. Because FWS is simpler to implement, as we do not have to choose filter locations for each limited weight ahead of time, we prefer to use FWS. Because FWS with 5.6M parameters (256/256 hidden units per convolutional layer) gives the best tradeoff between WER and number of parameters, we use this setting for subsequent experiments.

Method    Hidden Units in Conv Layers    Params    WER
FWS       128/256                        5.1M      19.3
FWS       256/256                        5.6M      18.9
FWS       384/384                        7.6M      18.7
FWS       512/512                        10.0M     18.5
LWS       128/256                        5.4M      18.8
LWS       256/256                        6.6M      18.7

Table 4. Limited vs. full weight sharing

3.5. Pooling Experiments

Pooling is an important concept in CNNs which helps to reduce spectral variance in the input features. Similar to [3], we explore pooling in frequency only and not time, as this was shown to be optimal for speech. Because pooling can be dependent on the input sampling rate and speaking style, we compare the best pooling size for two different 50-hr tasks with different characteristics: 8kHz speech, Switchboard Telephone Conversations (SWB), and 16kHz speech, English Broadcast News (BN). Table 5 indicates that not only is pooling essential for CNNs, but for all tasks pooling=3 is the optimal pooling size. Note that we did not run the experiment with no pooling for BN, as it was already shown to not help for SWB.

            WER-SWB    WER-BN
No pooling  23.7       -
pool=2      23.4       20.7
pool=3      22.9       20.7
pool=4      22.9       21.4

Table 5. WER vs. pooling size

3.5.1. Type of Pooling

The work in [4] explored using max pooling as the pooling strategy. Given a pooling region R_j and a set of activations {a_1, ..., a_|R_j|} ∈ R_j, the operation for max-pooling is shown in Equation 1.
s_j = max_{i ∈ R_j} a_i    (1)

One of the problems with max-pooling is that it can overfit the training data, and does not necessarily generalize to test data. Two pooling alternatives have been proposed to address some of the problems with max-pooling: l_p pooling [6] and stochastic pooling [7]. l_p pooling takes a weighted average of the activations a_i in pooling region R_j, as shown in Equation 2.

s_j = ( Σ_{i ∈ R_j} a_i^p )^{1/p}    (2)

p = 1 can be seen as a simple form of averaging, while p = ∞ corresponds to max-pooling. One of the problems with average pooling is that all elements in the pooling region are considered, so areas of low activation may downweight areas of high activation. l_p pooling for p > 1 is seen as a tradeoff between average and max-pooling. l_p pooling has been shown to give large improvements in error rate on computer vision tasks compared to max pooling [6].

Stochastic pooling is another pooling strategy that addresses the issues of max and average pooling. In stochastic pooling, first a set of probabilities p_i for each region j is formed by normalizing the activations across that region, as shown in Equation 3.

p_i = a_i / Σ_{k ∈ R_j} a_k    (3)

s_j = a_l  where  l ∼ P(p_1, p_2, ..., p_|R_j|)    (4)

A multinomial distribution is created from the probabilities, and the distribution is sampled based on p to pick the location l and corresponding pooled activation a_l, as shown in Equation 4. Stochastic pooling has the advantages of max-pooling but prevents overfitting due to the stochastic component. Stochastic pooling has also shown huge improvements in error rate in computer vision [7].

Given the success of l_p and stochastic pooling, we compare both of these strategies to max-pooling on an LVCSR task. Results for the three pooling strategies are shown in Table 6. Stochastic pooling seems to provide improvements over max and l_p pooling, though the gains are slight.
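The three pooling operators in Equations 1-4 can be sketched over a single pooling region as follows (stochastic pooling assumes non-negative activations, e.g. ReLU outputs, so the normalized values form a valid distribution):

```python
import numpy as np

def max_pool(a):
    """Equation 1: take the largest activation in the region."""
    return np.max(a)

def lp_pool(a, p=2):
    """Equation 2: (sum of a_i^p)^(1/p); p=1 ~ averaging, p=inf ~ max."""
    return np.sum(a ** p) ** (1.0 / p)

def stochastic_pool(a, rng):
    """Equations 3-4: normalize activations into probabilities, then
    sample a location l from the resulting multinomial distribution."""
    probs = a / np.sum(a)                 # Eq. 3
    l = rng.choice(len(a), p=probs)       # sample l ~ P(p_1, ..., p_|Rj|)
    return a[l]                           # Eq. 4

region = np.array([0.1, 0.5, 0.2])
rng = np.random.default_rng(0)
print(max_pool(region))                   # 0.5
print(lp_pool(region, p=2))               # sqrt(0.30) ~ 0.548
print(stochastic_pool(region, rng))       # one of 0.1, 0.5, 0.2
```

At test time stochastic pooling is typically replaced by the probability-weighted average of the region, per [7].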
Unlike vision tasks, it appears that in tasks such as speech recognition, which have a lot more data and thus better model estimates, generalization methods such as l_p and stochastic pooling do not offer great improvements over max pooling.

Method               WER
Max pooling          18.9
Stochastic pooling   18.8
l_p pooling          18.9

Table 6. Results with different pooling types

3.5.2. Overlapping Pooling

The work presented in [4] did not explore overlapping pooling in frequency. However, work in computer vision has shown that overlapping pooling can improve error rate by 0.3-0.5% compared to non-overlapping pooling [8]. One of the motivations of overlapping pooling is to prevent overfitting.

Table 7 compares overlapping and non-overlapping pooling on an LVCSR speech task. One thing to point out is that because overlapping pooling has many more activations, in order to keep the experiment fair, the number of parameters between non-overlapping and overlapping pooling was matched. The table shows that there is no difference in WER between overlapping and non-overlapping pooling. Again, on tasks with a lot of data such as speech, regularization mechanisms such as overlapping pooling do not seem to help, in contrast to smaller computer vision tasks.

Method                  WER
Pooling, no overlap     18.9
Pooling with overlap    18.9

Table 7. Pooling with and without overlap

3.5.3. Pooling in Time

Most previous CNN work in speech explored pooling in frequency only ([4], [3], [5]), though [13] did investigate CNNs with pooling in time, but not frequency. However, most CNN work in vision performs pooling in both space and time [6], [8]. In this paper, we do a deeper analysis of pooling in time for speech. One thing we must ensure with pooling in time in speech is that there is overlap between the pooling windows. Otherwise, pooling in time without overlap can be seen as subsampling the signal in time, which degrades performance.
Pooling in time with overlap can be thought of as a way to smooth out the signal in time, another form of regularization.

Table 8 compares pooling in time for max, stochastic and l_p pooling. We see that pooling in time helps slightly with stochastic and l_p pooling. However, the gains are not large, and are likely to be diminished after sequence training. It appears that for large tasks with more data, regularizations such as pooling in time are not helpful, similar to other regularization schemes such as l_p/stochastic pooling and pooling with overlap in frequency.

Method                        WER
Baseline                      18.9
Pooling in time, max          18.9
Pooling in time, stochastic   18.8
Pooling in time, l_p          18.8

Table 8. Pooling in time

3.6. Incorporating Speaker Adaptation into CNNs

In this section, we describe various techniques to incorporate speaker-adapted features into CNNs.

3.6.1. fMLLR Features

Since CNNs model correlation in time and frequency, they require the input feature space to have this property. This implies that commonly used feature spaces, such as Linear Discriminant Analysis, cannot be used with CNNs. In [4], it was shown that a good feature set for CNNs was VTLN-warped log-mel filter bank coefficients.

Feature-space maximum likelihood linear regression (fMLLR) [10] is a popular speaker-adaptation technique used to reduce the variability of speech due to different speakers. The fMLLR transformation applied to features assumes that either the features are uncorrelated and can be modeled by diagonal covariance Gaussians, or the features are correlated and can be modeled by full covariance Gaussians. While correlated features are better modeled by full-covariance Gaussians, full-covariance matrices dramatically increase the number of parameters per Gaussian component, oftentimes leading to parameter estimates which are not robust. Thus fMLLR is most commonly applied in a decorrelated space.
When fMLLR was applied to the correlated log-mel feature space with a diagonal covariance assumption, little improvement in WER was observed [4]. Semi-tied covariance matrices (STCs) [14] have been used to decorrelate the feature space so that it can be modeled by diagonal Gaussians. STC offers the added benefit that it allows a few full covariance matrices to be shared over many distributions, while each distribution has its own diagonal covariance matrix.

In this paper, we explore applying fMLLR to correlated features (such as log-mel) by first decorrelating them so that we can appropriately use a diagonal Gaussian approximation with fMLLR. We then transform the fMLLR features back to the correlated space so that they can be used with CNNs. The algorithm is described as follows.

First, starting from the correlated feature space f, we estimate an STC matrix S to map the features into an uncorrelated space. This mapping is given by transformation 5.

S f    (5)

Next, in the uncorrelated space, an fMLLR matrix M is estimated and applied to the STC-transformed features. This is shown by transformation 6.

M S f    (6)

Thus far, transformations 5 and 6 are standard transformations in speech with STC and fMLLR matrices. However, in speech recognition tasks, once features are decorrelated with STC, further transformations (i.e., fMLLR, fBMMI) are applied in this decorrelated space, as shown in transformation 6. The features are never transformed back into the correlated space.

However, for CNNs, using correlated features is critical. By multiplying the fMLLR-transformed features by an inverse STC matrix, we can map the decorrelated fMLLR features back to the correlated space, so that they can be used with a CNN. The transformation we propose is given in transformation 7.

S^{-1} M S f    (7)

3.7.
Multi-scale CNN/DNNs

The information captured in each layer of a neural network varies from more general to more specific concepts. For example, in speech, lower layers focus more on speaker adaptation and higher layers focus more on discrimination. In this section, we combine inputs from different layers of a neural network to explore whether complementarity between different layers could potentially improve results further. This idea, known as multi-scale neural networks [6], has been explored before in computer vision.

Specifically, we look at combining the output from 2 fully connected and 2 convolutional layers. This output is fed into 4 more fully connected layers, and the entire network is trained jointly. This can be thought of as combining features generated from a DNN-style and a CNN-style network. Note that for this experiment, the same input features (i.e., log-mel features) were used for both the DNN and CNN streams.

Results are shown in Table 5. A small gain is observed by combining DNN and CNN features, again much smaller than the gains observed in computer vision. However, given that this small improvement comes at the cost of such a large parameter increase, and the same gains can be achieved by increasing the feature maps in the CNN alone (see Table 4), we do not see huge value in this idea. It is possible, however, that combining CNNs and DNNs with different, complementary types of input features could show more improvements.

3.8. I-vectors

3.8.1. Results

Results with the proposed fMLLR idea are shown in Table 9. Notice that by applying fMLLR in a decorrelated space, we can achieve a 0.5% improvement over the baseline VTLN-warped log-mel system. This gain was not possible in [4], when fMLLR was applied directly to correlated log-mel features.

Feature                                       WER
VTLN-warped log-mel+d+dd                      18.8
proposed fMLLR + VTLN-warped log-mel+d+dd     18.3

Table 9. WER with improved fMLLR features

3.9.
Rectified Linear Units and Dropout

At IBM, two stages of neural network training are performed. First, DNNs are trained with a frame-discriminative stochastic gradient descent (SGD) cross-entropy (CE) criterion. Second, the CE-trained DNN weights are re-adjusted using a sequence-level objective function [15]. Since speech is a sequence-level task, this objective is more appropriate for the speech recognition problem. Numerous studies have shown that sequence training provides an additional 10-15% relative improvement over a CE-trained DNN [11], [4]. Using a 2nd-order Hessian-free (HF) optimization method is critical for performance gains with sequence training compared to SGD-style optimization, though not as important for CE training [11].

Rectified Linear Units (ReLU) and dropout [16] have recently been proposed as a way to regularize large neural networks. In fact, ReLU+dropout was shown to provide a 5% relative reduction in WER for cross-entropy-trained DNNs on a 50-hr English Broadcast News LVCSR task [12]. However, subsequent HF sequence training [11] that used no dropout erased some of these gains, and performance was similar to a DNN trained with a sigmoid non-linearity and no dropout. Given the importance of sequence training for neural networks, in this paper we propose a strategy to make dropout effective during HF sequence training. Results are presented in the context of CNNs, though this algorithm can also be used with DNNs.

3.9.1. Hessian-Free Training

One popular 2nd-order technique for DNNs is Hessian-free (HF) optimization [17]. Let θ denote the network parameters, L(θ) denote a loss function, ∇L(θ) denote the gradient of the loss with respect to the parameters, d denote a search direction, and B(θ) denote a Hessian approximation matrix characterizing the curvature of the loss around θ.
The central idea in HF optimization is to iteratively form a quadratic approximation to the loss and to minimize this approximation using conjugate gradient (CG).

L(θ + d) ≈ L(θ) + ∇L(θ)^T d + (1/2) d^T B(θ) d    (8)

During each iteration of the HF algorithm, first, the gradient is computed using all training examples. Second, since the Hessian cannot be computed exactly, the curvature matrix B is approximated by a damped version of the Gauss-Newton matrix, G(θ) + λI, where λ is set via Levenberg-Marquardt. Then, conjugate gradient (CG) is run for multiple iterations until the relative per-iteration progress made in minimizing the CG objective function falls below a certain tolerance. During each CG iteration, Gauss-Newton matrix-vector products are computed over a sample of the training data.

3.9.2. Dropout

Dropout is a popular technique to prevent over-fitting during neural network training [16]. Specifically, during the feed-forward operation in neural network training, dropout omits each hidden unit randomly with probability p. This prevents complex co-adaptations between hidden units, forcing hidden units to not depend on other units. Specifically, using dropout the activation y_l at layer l is given by Equation 9, where y_{l-1} is the input into layer l, W_l is the weight for layer l, b is the bias, f is the non-linear activation function (i.e., ReLU), and r is a binary mask whose entries are 1 with probability 1 − p (the unit is retained) and 0 with probability p (the unit is omitted). Since dropout is not used during decoding, the factor 1/(1 − p) used during training ensures that at test time, when no units are dropped out, the correct total input will reach each layer.

y_l = f( (1/(1 − p)) W_l (r_{l-1} ∗ y_{l-1}) + b_l )    (9)

3.9.3. Combining HF + Dropout

Conjugate gradient tries to minimize the quadratic objective function given in Equation 8.
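Minimizing the quadratic model of Equation 8 over d is equivalent to solving B d = −∇L, which is what the inner CG loop does. A minimal sketch (here B is a small explicit SPD matrix standing in for the damped Gauss-Newton matrix; in HF the products B p are computed matrix-free over a sample of training data):

```python
import numpy as np

def cg_minimize(B, g, n_iters=50, tol=1e-10):
    """Linear CG for the quadratic q(d) = g.T d + 0.5 d.T B d,
    i.e. iteratively solve B d = -g without inverting B."""
    d = np.zeros_like(g)
    r = -g - B @ d                 # residual of B d = -g
    p = r.copy()
    for _ in range(n_iters):
        Bp = B @ p                 # curvature-vector product (Gauss-Newton in HF)
        alpha = (r @ r) / (p @ Bp)
        d = d + alpha * p          # step along the conjugate direction
        r_new = r - alpha * Bp
        if r_new @ r_new < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
B = A @ A.T + 5 * np.eye(5)        # damped, symmetric positive definite
g = rng.standard_normal(5)
d = cg_minimize(B, g)
assert np.allclose(B @ d, -g)      # CG reached the minimizer of the model
```

The conjugacy of the successive directions p is exactly what the fixed-data (and, below, fixed-dropout-mask) requirement protects.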
For each CG iteration, the damped Gauss-Newton matrix, G(θ), is estimated using a subset of the training data. This subset is fixed for all iterations of CG, because if the data used to estimate G(θ) changes, we are no longer guaranteed to have conjugate search directions from iteration to iteration.

Recall that dropout produces a random binary mask for each presentation of each training instance. However, in order to guarantee good conjugate search directions, for a given utterance the dropout mask per layer cannot change during CG. The appropriate way to incorporate dropout into HF is to allow the dropout mask to change for different layers and different utterances, but to fix it for all CG iterations while working with a specific layer and specific utterance (although the masks can be refreshed between HF iterations). As the number of network parameters is large, saving out the dropout mask per utterance and layer is infeasible. Therefore, we randomly choose a seed for each utterance and layer and save this out. Using a randomization function with the same seed guarantees that the same dropout mask is used per layer/per utterance.

3.9.4. Results

We experimentally confirm that using a dropout probability of p = 0.5 in the 3rd and 4th layers is reasonable, with zero dropout in all other layers. For these experiments, we use 2K hidden units for the fully connected layers, as this was found to be more beneficial with dropout than 1K hidden units [12].

Results with different dropout techniques are shown in Table 10. Notice that if no dropout is used, the WER is the same as with a sigmoid, a result which was also found for DNNs in [12]. By using dropout but fixing the dropout mask per utterance across all CG iterations, we can achieve a 0.6% improvement in WER. Finally, if we compare this to varying the dropout mask per CG training iteration, the WER increases.
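The seed-per-(utterance, layer) trick above can be sketched as follows: store one integer seed per utterance and layer, and re-seed the generator whenever a mask is needed, so every CG iteration within an HF iteration sees an identical mask. The names, shapes and seed values are illustrative.

```python
import numpy as np

def dropout_mask(seed, shape, p=0.5):
    """Binary mask, entries 1 with probability 1-p (unit kept),
    fully reproducible from `seed` instead of being stored."""
    rng = np.random.RandomState(seed)
    return (rng.uniform(size=shape) >= p).astype(float)

def dropout_forward(y_prev, W, b, seed, p=0.5):
    """Equation 9: y_l = f((1/(1-p)) W (r * y_prev) + b), f = ReLU."""
    r = dropout_mask(seed, y_prev.shape, p)
    z = (1.0 / (1.0 - p)) * (W @ (r * y_prev)) + b
    return np.maximum(z, 0.0)            # ReLU non-linearity

# One stored seed per (utterance, layer) pair (hypothetical ids):
seeds = {("utt01", 3): 1234}

y = np.ones(8)
m1 = dropout_mask(seeds[("utt01", 3)], y.shape)   # CG iteration 1
m2 = dropout_mask(seeds[("utt01", 3)], y.shape)   # CG iteration 2
assert np.array_equal(m1, m2)   # identical mask -> conjugacy is preserved
```

Storing one integer per utterance and layer is negligible compared to storing the masks themselves, which is the point of the trick.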
Further investigation in Figure 1 shows that if we vary the dropout mask, the loss converges slowly during training, particularly when the number of CG iterations increases during the later part of HF training. This provides experimental evidence that if the dropout mask is not fixed, we cannot guarantee that CG iterations produce conjugate search directions for the loss function.

Table 10. WER of HF Sequence Training + Dropout

Non-Linearity                             WER
Sigmoid                                   15.7
ReLU, No Dropout                          15.6
ReLU, Dropout Fixed for CG Iterations     15.0
ReLU, Dropout Per CG Iteration            15.3

Fig. 1. Held-out Loss With Dropout Techniques (held-out loss vs. HF iteration, comparing "Dropout Fixed Per CG" against "Dropout Varied Per CG").

Finally, we explore whether we can reduce the number of CE iterations before moving to sequence training. A main advantage of sequence training is that it is more closely linked to the speech recognition objective function than cross-entropy is. Using this fact, we explore how many iterations of CE are actually necessary before moving to HF training. Table 11 shows the WER for different numbers of CE iterations, and the corresponding WER after HF training. Note that HF training is started, and lattices are dumped, using the CE weights from the iteration at which CE training is stopped. Notice that after annealing just two times, we can achieve the same WER after HF training as when the CE weights are run to convergence. This points to the fact that spending too much time in CE is unnecessary. Once the weights are in a relatively decent space, it is better to jump directly to HF sequence training, which is more closely matched to the speech objective function.

Table 11. HF Seq. Training WER Per CE Iteration

CE Iter   # Times Annealed   CE WER   HF WER
4         1                  20.8     15.3
6         2                  19.8     15.0
8         3                  19.4     15.0
13        7                  18.8     15.0

4. RESULTS

In this section, we analyze CNN performance with the additions proposed in Section 3, namely fMLLR and ReLU + dropout.
Results are shown on both a 50-hour and a 400-hour English Broadcast News task.

4.1. 50-hour English Broadcast News

4.1.1. Experimental Setup

Following the setup in [4], the hybrid DNN is trained using speaker-adapted VTLN+fMLLR features as input, with a context of 9 frames. A 5-layer DNN with 1,024 hidden units per layer and a sixth softmax layer with 2,220 output targets is used. All DNNs are pre-trained, followed by CE training and then HF sequence training [11]. The DNN-based feature system is also trained with the same architecture, but uses 512 output targets. A PCA is applied on top of the DNN, before the softmax, to reduce the dimensionality from 512 to 40. Using these DNN-based features, we apply maximum-likelihood GMM training, followed by feature- and model-space discriminative training using the BMMI criterion. In order to fairly compare results to the DNN hybrid system, no MLLR is applied to the DNN feature-based system. The old CNN systems are trained with VTLN-warped log-mel+d+dd features and a sigmoid non-linearity. The proposed CNN-based systems are trained with the fMLLR features described in Section 3.6, and the ReLU+Dropout scheme discussed in Section 3.9.

4.1.2. Results

Table 12 shows the performance of the proposed CNN-based feature and hybrid systems, and compares them to the DNN and old CNN systems. The proposed CNN hybrid system offers a 6-7% relative improvement over the DNN hybrid, and a 2-3% relative improvement over the old CNN hybrid system. While the proposed CNN-based feature system offers only a modest 1% improvement over the old CNN-based feature system, this slight improvement with the feature-based system is not surprising at all. We have observed huge relative improvements in WER (10-12%) on a hybrid sequence-trained DNN with 512 output targets, compared to a hybrid CE-trained DNN. However, after features are extracted from both systems, the gains diminish to 1-2% relative [18].
Feature-based systems use the neural network to learn a feature transformation, and seem to saturate in performance even when the hybrid system used to extract the features improves. Thus, as the table shows, there is more potential to improve a hybrid system than a feature-based system.

Table 12. WER on Broadcast News, 50 hours

model                          dev04f   rt04
Hybrid DNN                     16.3     15.8
Old Hybrid CNN [4]             15.8     15.0
Proposed Hybrid CNN            15.4     14.7
DNN-based Features             17.4     16.6
Old CNN-based Features [4]     15.5     15.2
Proposed CNN-based Features    15.3     15.1

4.2. 400-hour English Broadcast News

4.2.1. Experimental Setup

We explore the scalability of the proposed techniques on 400 hours of English Broadcast News [15]. Development is done on the DARPA EARS dev04f set. Testing is done on the DARPA EARS rt04 evaluation set. The DNN hybrid system uses fMLLR features with a 9-frame context, and five hidden layers each containing 1,024 sigmoidal units. The DNN-based feature system is trained with 512 output targets, while the hybrid system has 5,999 output targets. Results are reported after HF sequence training. Again, the proposed CNN-based systems are trained with the fMLLR features described in Section 3.6, and the ReLU+Dropout scheme discussed in Section 3.9.

4.2.2. Results

Table 13 shows the performance of the proposed CNN system compared to the DNNs and the old CNN system. While the proposed 512-target hybrid CNN-based feature system did improve (14.1 WER) over the old CNN (14.8 WER), performance slightly deteriorates after CNN-based features are extracted from the network. However, the 5,999-target hybrid CNN offers a 13-16% relative improvement over the DNN hybrid system, and a 4-5% relative improvement over the old CNN-based feature systems.
This helps to strengthen the hypothesis that hybrid CNNs have more potential for improvement, and that the proposed fMLLR and ReLU+dropout techniques provide substantial improvements over DNNs and over CNNs with a sigmoid non-linearity and VTLN-warped log-mel features.

Table 13. WER on Broadcast News, 400 hrs

model                          dev04f   rt04
Hybrid DNN                     15.1     13.4
DNN-based Features             15.3     13.5
Old CNN-based Features [4]     13.4     12.2
Proposed CNN-based Features    13.6     12.5
Proposed Hybrid CNN            12.7     11.7

5. CONCLUSIONS

In this paper, we explored various strategies to improve CNN performance. We incorporated fMLLR into CNN features, and made dropout effective during HF sequence training. We also explored various pooling and weight-sharing techniques popular in computer vision, but found they did not offer improvements for LVCSR tasks. Overall, with the proposed fMLLR+dropout ideas, we were able to improve our previous best CNN results by 2-5% relative.

6. REFERENCES

[1] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep Neural Networks for Acoustic Modeling in Speech Recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.

[2] Y. LeCun and Y. Bengio, "Convolutional Networks for Images, Speech, and Time-series," in The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.

[3] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, "Applying Convolutional Neural Network Concepts to Hybrid NN-HMM Model for Speech Recognition," in Proc. ICASSP, 2012.

[4] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep Convolutional Neural Networks for LVCSR," in Proc. ICASSP, 2013.

[5] L. Deng, O. Abdel-Hamid, and D. Yu, "A Deep Convolutional Neural Network using Heterogeneous Pooling for Trading Acoustic Invariance with Phonetic Confusion," in Proc. ICASSP, 2013.

[6] P. Sermanet, S.
Chintala, and Y. LeCun, "Convolutional neural networks applied to house numbers digit classification," in Pattern Recognition (ICPR), 2012 21st International Conference on, 2012.

[7] M. Zeiler and R. Fergus, "Stochastic Pooling for Regularization of Deep Convolutional Neural Networks," in Proc. of the International Conference on Learning Representations (ICLR), 2013.

[8] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, 2012.

[9] Y. LeCun, F. Huang, and L. Bottou, "Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting," in Proc. CVPR, 2004.

[10] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based Speech Recognition," Computer Speech and Language, vol. 12, no. 2, pp. 75-98, 1998.

[11] B. Kingsbury, T. N. Sainath, and H. Soltau, "Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization," in Proc. Interspeech, 2012.

[12] G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving Deep Neural Networks for LVCSR Using Rectified Linear Units and Dropout," in Proc. ICASSP, 2013.

[13] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme Recognition using Time-delay Neural Networks," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 3, pp. 328-339, 1989.

[14] M. J. F. Gales, "Semi-tied Covariance Matrices for Hidden Markov Models," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272-281, 1999.

[15] B. Kingsbury, "Lattice-Based Optimization of Sequence Classification Criteria for Neural-Network Acoustic Modeling," in Proc. ICASSP, 2009.

[16] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.
Salakhutdinov, "Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors," The Computing Research Repository (CoRR), vol. 1207.0580, 2012.

[17] J. Martens, "Deep learning via Hessian-free optimization," in Proc. Intl. Conf. on Machine Learning (ICML), 2010.

[18] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, "Auto-Encoder Bottleneck Features Using Deep Belief Networks," in Proc. ICASSP, 2012.