State-of-the-art Speech Recognition With Sequence-to-Sequence Models


Authors: Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani

Google, USA
{chungchengc,tsainath,yonghui,prabhavalkar,drpng,zhifengc,anjuli,ronw,kanishkarao,kgonina,ndjaitly,boboli,chorowski,michiel}@google.com

ABSTRACT

Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500 hour voice search task, we find that the proposed changes improve the WER from 9.2% to 5.6%, while the best conventional system achieves 6.7%; on a dictation task our model achieves a WER of 4.1% compared to 5% for the conventional system.
1. INTRODUCTION

Sequence-to-sequence models have been gaining in popularity in the automatic speech recognition (ASR) community as a way of folding the separate acoustic, pronunciation and language models (AM, PM, LM) of a conventional ASR system into a single neural network. A variety of sequence-to-sequence models have been explored in the literature, including the Recurrent Neural Network Transducer (RNN-T) [1], Listen, Attend and Spell (LAS) [2], the Neural Transducer [3], Monotonic Alignments [4] and the Recurrent Neural Aligner (RNA) [5]. While these models have shown promising results, thus far it has not been clear whether such approaches are practical enough to unseat the current state of the art: HMM-based neural network acoustic models, combined with a separate PM and LM in a conventional system. Such sequence-to-sequence models are fully neural, without finite state transducers, a lexicon, or text normalization modules. Training such models is simpler than training conventional ASR systems: they do not require bootstrapping from decision trees or time alignments generated from a separate system. To date, however, none of these models has been able to outperform a state-of-the-art ASR system on a large vocabulary continuous speech recognition (LVCSR) task.

The goal of this paper is to explore various structure and optimization improvements that allow sequence-to-sequence models to significantly outperform a conventional ASR system on a voice search task. Since previous work showed that LAS offered improvements over other sequence-to-sequence models [6], we focus on improvements to the LAS model in this work. The LAS model is a single neural network that includes an encoder, which is analogous to a conventional acoustic model, an attender, which acts as an alignment model, and a decoder, which is analogous to the language model in a conventional system.
We consider modifications both to the model structure and to the optimization process. On the structure side, we first explore word piece models (WPM), which have been applied to machine translation [7] and more recently to speech in RNN-T [8] and LAS [9]. We compare graphemes and WPM for LAS, and find a modest improvement with WPM. Next, we explore incorporating multi-head attention [10], which allows the model to learn to attend to multiple locations of the encoded features. Overall, we obtain a 13% relative improvement in WER with these structure improvements.

On the optimization side, we explore a variety of strategies as well. Conventional ASR systems benefit from discriminative sequence training, which optimizes criteria more closely related to WER [11]. Therefore, in the present work, we explore training our LAS models to minimize the expected number of word errors (MWER) [12], which significantly improves performance. Second, we include scheduled sampling (SS) [13, 2], which feeds the previous label prediction during training rather than the ground truth. Third, label smoothing [14] helps make the model less confident in its predictions; it is a regularization mechanism that has successfully been applied in both vision [14] and speech tasks [15, 16]. Fourth, while many of our models are trained with asynchronous SGD [17], synchronous training has recently been shown to improve neural systems [18]. We find that these four optimization strategies together provide an additional 27.5% relative improvement in WER on top of our structure improvements. Finally, we incorporate a language model to rescore N-best lists in the second pass, which results in a further 3.4% relative improvement in WER.

Taken together, the improvements in model structure and optimization, along with second-pass rescoring, allow us to improve a single-head attention, grapheme LAS system from a WER of 9.2% to a WER of 5.6% on a voice search task. This is a 16% relative reduction in WER compared to a strong conventional model baseline, which achieves a WER of 6.7%. We also observe a similar trend on a dictation task.

In Sections 2.2.1 and 2.4, we show how language models can be integrated. Section 2.2.2 further extends the model to multi-head attention. We explore discriminative training in Sections 2.3.1 and 2.3.2, and synchronous training regimes in Section 2.3.3. We use unidirectional encoders for low-latency streaming decoding.

2. SYSTEM OVERVIEW

In this section, we detail various structure and optimization improvements to the basic LAS model.

2.1. Basic LAS Model

[Fig. 1: Components of the LAS end-to-end model.]

The basic LAS model used for experiments in this work consists of three modules, as shown in Figure 1. The listener encoder module, which is similar to a standard acoustic model, takes the input features, x, and maps them to a higher-level feature representation, h^enc. The output of the encoder is passed to an attender, which determines which encoder features in h^enc should be attended to in order to predict the next output symbol, y_i, similar to a dynamic time warping (DTW) alignment module. Finally, the output of the attention module is passed to the speller (i.e., decoder), which takes the attention context, c_i, generated from the attender, as well as an embedding of the previous prediction, y_{i-1}, in order to produce a probability distribution, P(y_i | y_{i-1}, ..., y_0, x), over the current sub-word unit, y_i, given the previous units, {y_{i-1}, ..., y_0}, and input, x.

2.2. Structure Improvements

2.2.1. Wordpiece models

Traditionally, sequence-to-sequence models have used graphemes (characters) as output units, as this folds the AM, PM and LM into one neural network and side-steps the problem of out-of-vocabulary words [2].
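For concreteness, one step of the basic LAS speller described above can be sketched as follows. This is a minimal NumPy illustration, not the production model: the additive-attention form follows [31], but all parameter names, shapes, and the simple concatenation of context and previous-label embedding are assumptions for exposition.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def las_decoder_step(h_enc, s_dec, W_h, W_s, v_a, W_out, y_prev_emb):
    """One illustrative LAS decoder step (names and shapes are hypothetical).

    h_enc: (T, d) encoder features; s_dec: (d,) current decoder state.
    Returns the attention distribution and P(y_i | y_<i, x)."""
    # Additive attention scores over encoder frames.
    scores = np.tanh(h_enc @ W_h + s_dec @ W_s) @ v_a   # (T,)
    alpha = softmax(scores)                             # attention weights
    c_i = alpha @ h_enc                                 # context vector (d,)
    # The speller combines the context with the previous-label embedding
    # to produce a distribution over sub-word units.
    logits = W_out @ np.concatenate([c_i, y_prev_emb])
    return alpha, softmax(logits)
```

In a full model the decoder state would be produced by LSTM layers and the step would be unrolled over the output sequence; the sketch only shows how the attention context feeds the output distribution.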
Alternatively, one could use longer units such as word pieces, or shorter units such as context-independent phonemes [19]. One of the disadvantages of using phonemes is that it requires having an additional PM and LM, and phonemes were not found to improve over graphemes in our experiments [19].

Our motivation for looking at word piece models (WPM) is as follows. Typically, word-level LMs have a much lower perplexity than grapheme-level LMs [20]. Thus, we expect that modeling word pieces allows for a much stronger decoder LM than graphemes. In addition, modeling longer units improves the effective memory of the decoder LSTMs, and allows the model to potentially memorize pronunciations for frequently occurring words. Furthermore, longer units require fewer decoding steps, which speeds up inference in these models significantly. Finally, WPMs have shown good performance for other sequence-to-sequence models such as RNN-T [8].

The word piece models [21] used in this paper are sub-word units, ranging from graphemes all the way up to entire words. Thus, there are no out-of-vocabulary words with word piece models. The word piece models are trained to maximize the language model likelihood over the training set. As in [7], the word pieces are "position-dependent", in that a special word separator marker is used to denote word boundaries. Words are segmented deterministically and independently of context, using a greedy algorithm.

2.2.2. Multi-headed attention

[Fig. 2: Multi-headed attention mechanism; multiple projections of the encoded source are attended to by separate attention queries.]

Multi-head attention (MHA) was first explored in [10] for machine translation, and we extend this work to explore the value of MHA for speech. Specifically, as shown in Figure 2, MHA extends the conventional attention mechanism to have multiple heads, where each head can generate a different attention distribution.
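The greedy, context-independent word-piece segmentation described in Section 2.2.1 can be sketched as a longest-match scan. The vocabulary and the "▁" word-boundary marker below are toy assumptions for illustration; a real inventory is learned from data to maximize LM likelihood [21].

```python
def wordpiece_segment(word, vocab, marker="▁"):
    """Greedy left-to-right longest-match segmentation of one word.
    vocab is a set of word pieces; the marker makes pieces
    position-dependent by flagging the word boundary."""
    pieces, s = [], marker + word
    while s:
        # Take the longest vocabulary entry that prefixes the remainder.
        for end in range(len(s), 0, -1):
            if s[:end] in vocab:
                pieces.append(s[:end])
                s = s[end:]
                break
        else:
            return None  # cannot happen if every single character is in vocab
    return pieces
```

Because the inventory includes all single graphemes, segmentation always succeeds and there are no out-of-vocabulary words.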
This allows each head to take on a different role in attending to the encoder output, which we hypothesize makes it easier for the decoder to learn to retrieve information from the encoder. In the conventional single-headed architecture, the model relies more on the encoder to provide clearer signals about the utterance so that the decoder can pick up the information with attention. We observed that MHA tends to allocate one of the attention heads to the beginning of the utterance, which contains mostly background noise; we therefore hypothesize that MHA better distinguishes speech from noise when the encoded representation is less ideal, for example in degraded acoustic conditions.

2.3. Optimization improvements

2.3.1. Minimum Word Error Rate (MWER) Training

Conventional ASR systems are often trained to optimize a sequence-level criterion (e.g., state-level minimum Bayes risk (sMBR) [11]) in addition to CE or CTC training. Although the loss function that we optimize for the attention-based systems is a sequence-level loss function, it is not closely related to the metric we actually care about, namely word error rate. A variety of methods have been explored in the literature to address this issue in the context of sequence-to-sequence models [22, 23, 5, 12]. In this work, we focus on the minimum expected word error rate (MWER) training that we proposed in [12].

In the MWER strategy, our objective is to minimize the expected number of word errors. The loss function is given by Equation 1, where W(y, y*) denotes the number of word errors in the hypothesis, y, compared to the ground-truth label sequence, y*. This first term is interpolated with the standard cross-entropy based loss, which we find is important in order to stabilize training [12, 24].
    L_MWER = E_{P(y|x)}[ W(y, y*) ] + λ L_CE                         (1)

The expectation in Equation 1 can be approximated either via sampling [5] or by restricting the summation to an N-best list of decoded hypotheses, as is commonly done for sequence training [11]; the latter is found to be more effective in our experiments [12]. We denote by NBest(x, N) = {y_1, ..., y_N} the set of N-best hypotheses computed using beam-search decoding [25] for the input utterance x. The loss function can then be approximated as shown in Equation 2, which weights the normalized word errors from each hypothesis by the probability P̂(y_i | x) concentrated on it:

    L^{N-best}_MWER = (1/N) Σ_{y_i ∈ NBest(x,N)} [ W(y_i, y*) − Ŵ ] P̂(y_i | x) + λ L_CE     (2)

where P̂(y_i | x) = P(y_i | x) / Σ_{y_i ∈ NBest(x,N)} P(y_i | x) represents the distribution re-normalized over just the N-best hypotheses, and Ŵ is the average number of word errors over all hypotheses in the N-best list. Subtracting Ŵ is a form of variance reduction, since it does not affect the gradient [12].

2.3.2. Scheduled Sampling

We explore scheduled sampling [13] for training the decoder. Feeding the ground-truth label as the previous prediction (so-called teacher forcing) helps the decoder learn quickly at the beginning of training, but introduces a mismatch between training and inference. The scheduled sampling process, on the other hand, samples from the probability distribution of the previous prediction (i.e., from the softmax output) and uses the resulting token as the previous token when predicting the next label. This reduces the gap between training and inference behavior. Our training process uses teacher forcing at the beginning of training, and as training proceeds we linearly ramp up the probability of sampling from the model's prediction to 0.4 at a specified step, after which the probability is kept constant until the end of training.
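The N-best approximation to the MWER loss in Equation 2 can be sketched numerically as follows. The interpolation weight λ is a placeholder; the paper does not state its value.

```python
import numpy as np

def mwer_nbest_loss(log_probs, word_errors, ce_loss, lam=0.01):
    """Approximate MWER loss over an N-best list (Eq. 2).
    log_probs:   length-N array of log P(y_i | x) per hypothesis
    word_errors: length-N array of W(y_i, y*) per hypothesis
    lam is the CE interpolation weight (illustrative value)."""
    log_probs = np.asarray(log_probs, dtype=float)
    w = np.asarray(word_errors, dtype=float)
    # Re-normalize the model's distribution over just the N-best list.
    p_hat = np.exp(log_probs - log_probs.max())
    p_hat /= p_hat.sum()
    # Subtracting the mean word-error count is variance reduction only.
    w_bar = w.mean()
    n = len(w)
    return ((w - w_bar) * p_hat).sum() / n + lam * ce_loss
```

Note that if all hypotheses have the same number of word errors, the first term vanishes, as expected from the baseline subtraction.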
The step at which the probability reaches 0.4 is set to 1 million steps for asynchronous training and 100,000 steps for synchronous training (see Section 2.3.3).

2.3.3. Asynchronous and Synchronous Training

We compare asynchronous [17] and synchronous training [18]. As shown in [18], synchronous training can potentially provide faster convergence rates and better model quality, but also requires more effort to stabilize network training. Both approaches have high gradient variance at the beginning of training when using multiple replicas [17], and we explore different techniques to reduce this variance. In asynchronous training we use replica ramp-up: the system does not start all training replicas at once, but instead starts them gradually. In synchronous training we use two techniques: learning rate ramp-up and a gradient norm tracker. The learning rate ramp-up starts with the learning rate at 0 and gradually increases it, providing an effect similar to replica ramp-up. The gradient norm tracker keeps a moving average of the gradient norm and discards gradients whose norm is significantly higher than that moving average. Both techniques are crucial for making synchronous training stable.

2.3.4. Label smoothing

Label smoothing [14, 16] is a regularization mechanism that prevents the model from making over-confident predictions. It encourages the model to have higher entropy in its predictions, and therefore makes the model more adaptable. We followed the same design as [14], smoothing the ground-truth label distribution with a uniform distribution over all labels.

2.4. Second-Pass Rescoring

Since the LAS decoder topology is that of a neural language model (LM), it can function as a language model; but it is only exposed to the training transcripts.
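The uniform label smoothing described in Section 2.3.4 amounts to mixing the one-hot ground-truth distribution with a uniform distribution over all labels, as in [14]. The smoothing weight below is illustrative; the paper does not report its value.

```python
import numpy as np

def smooth_labels(y_index, num_classes, eps=0.1):
    """Return the smoothed target distribution for class y_index:
    (1 - eps) * one_hot + eps * uniform. eps=0.1 is an assumed value."""
    q = np.full(num_classes, eps / num_classes)  # uniform mass on every label
    q[y_index] += 1.0 - eps                      # remaining mass on the truth
    return q
```

Training against this softened target bounds the optimal confidence the model can assign to the correct label, which is the intended regularization effect.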
An external LM, on the other hand, can leverage large amounts of additional data for which we have only text (no audio). To address the potentially weak LM learned by the decoder, we incorporate an external LM during inference only. The external LM is a large 5-gram LM trained on text data from a variety of domains. Since domains have different predictive value for our LVCSR task, domain-specific LMs are first trained and then combined using Bayesian interpolation [26].

We incorporate the LM in the second pass by means of log-linear interpolation. In particular, given the N-best hypotheses produced by the LAS model via beam search, we determine the final transcript y* as:

    y* = argmax_y  log P(y | x) + λ log P_LM(y) + γ len(y)            (3)

where P_LM is provided by the LM, len(y) is the number of words in y, and λ and γ are tuned on a development set. Under this criterion, transcripts with low language model probability are demoted in the final ranked list. Additionally, the last term addresses the common observation that incorporating an LM leads to a higher rate of deletions.

3. EXPERIMENTAL DETAILS

Our experiments are conducted on a ~12,500 hour training set consisting of 15 million English utterances. The training utterances are anonymized and hand-transcribed, and are representative of Google's voice search traffic. This data set is created by artificially corrupting clean utterances using a room simulator, adding varying degrees of noise and reverberation such that the overall SNR is between 0 dB and 30 dB, with an average SNR of 12 dB. The noise sources are from YouTube and daily-life noisy environmental recordings. We report results on a set of ~14.8K utterances extracted from Google traffic, and also evaluate the resulting model, trained only on voice search data, on a set of 15.7K dictation utterances that have longer sentences than the voice search utterances.
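The log-linear second-pass rescoring criterion of Equation 3 can be sketched as follows. The λ and γ values are placeholders standing in for weights tuned on a development set, and the (words, score) hypothesis layout is an assumption for illustration.

```python
def rescore_nbest(hyps, lam=0.5, gamma=0.5):
    """Pick the best hypothesis from an N-best list via Eq. (3).
    hyps: list of (words, log_p_las, log_p_lm) triples, where words is
    the hypothesis word sequence. lam/gamma are illustrative values."""
    def score(h):
        words, log_p_las, log_p_lm = h
        # log P(y|x) + lam * log P_LM(y) + gamma * len(y)
        return log_p_las + lam * log_p_lm + gamma * len(words)
    return max(hyps, key=score)[0]
```

The length term rewards longer hypotheses, counteracting the tendency of LM interpolation to favor deletions.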
All experiments use 80-dimensional log-Mel features, computed with a 25 ms window and shifted every 10 ms. Similar to [27, 28], at the current frame, t, these features are stacked with 3 frames to the left and downsampled to a 30 ms frame rate. The encoder network consists of 5 long short-term memory (LSTM) [29] layers. We explored both unidirectional [30] and bidirectional LSTMs; the unidirectional LSTMs have 1,400 hidden units and the bidirectional LSTMs have 1,024 hidden units in each direction (2,048 per layer). Unless otherwise stated, results are reported with unidirectional encoders. Additive attention [31] is used for both single-headed and multi-headed attention experiments. All multi-headed attention experiments use 4 heads. The decoder network is a 2-layer LSTM with 1,024 hidden units per layer. All networks are trained with the cross-entropy criterion (which is used to initialize MWER training) using TensorFlow [32].

4. RESULTS

4.1. Structure Improvements

Our first set of experiments explores different structure improvements to the LAS model. Table 1 compares the performance of LAS models with graphemes (E1) and WPM (E2). The table indicates that WPM performs slightly better than graphemes. This is consistent with the finding in [8] that WPM provides a stronger decoder LM than graphemes, and results in roughly a 2% relative improvement in WER (WERR). Second, we compare the performance of MHA with WPM, shown as experiment E3 in the table. We see that MHA provides around an 11.1% improvement. This indicates that having the model focus on multiple points of attention in the input signal, which is similar in spirit to having a language model passed from the encoder, helps significantly. Since models with MHA and WPM perform best, we explore the proposed optimization methods on top of this model in the rest of the paper.
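The frame stacking and downsampling described above (stack each 10 ms frame with 3 left neighbors, then keep every third stacked frame for a 30 ms rate) can be sketched as follows. The boundary handling (repeating the first frame as left padding) is an assumption; the paper does not specify it.

```python
import numpy as np

def stack_and_downsample(feats, left=3, rate=3):
    """feats: (T, d) log-Mel frames at a 10 ms rate.
    Returns stacked frames of width (left+1)*d at a 10*rate ms rate."""
    T, d = feats.shape
    # Pad on the left by repeating the first frame (assumed edge handling).
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0), feats])
    # Each output row is the current frame plus its `left` left neighbors.
    stacked = np.stack([padded[t:t + left + 1].reshape(-1)
                        for t in range(T)])
    return stacked[::rate]  # 10 ms -> 30 ms when rate == 3
```

With 80-dimensional inputs this yields 320-dimensional frames every 30 ms, matching the pipeline of [27, 28] as described in the text.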
Table 1: Impact of word piece models and multi-head attention.

  Exp-ID  Model     WER  WERR
  E1      Grapheme  9.2  -
  E2      WPM       9.0  2.2%
  E3      + MHA     8.0  11.1%

4.2. Optimization improvements

We explore the performance of the various optimization improvements discussed in Section 2.3. Table 2 shows that including synchronous training (E4) on top of the WPM+MHA model provides a 3.8% improvement. Furthermore, including scheduled sampling (E5) gives an additional 7.8% relative improvement in WER; label smoothing gives an additional 5.6% relative improvement. Finally, MWER training provides 13.4%. Overall, the gain from optimizations is around 27.5%, moving the WER from 8.0% to 5.8%. We see that synchronous training, in our configuration, yields a better converged optimum at similar wall-clock time. Interestingly, while scheduled sampling and minimum word error rate training are both discriminative methods, we observe that their combination continues to yield additive improvements. Finally, regularization with label smoothing, even with large amounts of data, proves beneficial.

Table 2: Sync training, scheduled sampling (SS), label smoothing (LS) and minimum word error rate (MWER) training improvements.

  Exp-ID  Model   WER  WERR
  E2      WPM     9.0  -
  E3      + MHA   8.0  11.1%
  E4      + Sync  7.7  3.8%
  E5      + SS    7.1  7.8%
  E6      + LS    6.7  5.6%
  E7      + MWER  5.8  13.4%

4.3. Incorporating Second-Pass Rescoring

Next, we incorporate second-pass rescoring into our model. As can be seen in Table 3, second-pass rescoring improves the WER by 3.4% relative, from 5.8% to 5.6%.

Table 3: In second-pass rescoring, the log-linear combination with a larger LM results in a 0.2% absolute WER improvement.

  Exp-ID  Model                              WER
  E7      WPM + MHA + Sync + SS + LS + MWER  5.8
  E8      + LM                               5.6

4.4. Unidirectional vs. Bidirectional Encoders

Having established the improvements from structure, optimization and LM strategies, in this section we compare the gains
on unidirectional and bidirectional systems. Table 4 shows that the proposed changes give a 37.8% relative reduction in WER for a unidirectional system, and a slightly smaller improvement of 28.4% for a bidirectional system. This illustrates that most of the proposed methods offer improvements independent of model topology.

Table 4: Both unidirectional and bidirectional models benefit from the cumulative improvements.

  Exp-ID  Model      Unidi  Bidi
  E2      WPM        9.0    7.4
  E8      WPM + all  5.6    5.3
          WERR       37.8%  28.4%

4.5. Comparison with the Conventional System

Finally, we compare the proposed LAS model in E8 to a state-of-the-art, discriminatively sequence-trained low frame rate (LFR) system [28] in terms of WER. Table 5 shows that the proposed sequence-to-sequence model (E8) offers a 16% and 18% relative improvement in WER over our production system (E9) on the voice search (VS) and dictation (D) tasks, respectively. Furthermore, comparing the size of the first-pass models, the LAS model is around 18 times smaller than the conventional model. It is important to note that the second-pass model is 80 GB and still dominates model size.

Table 5: Resulting WER on voice search (VS)/dictation (D). The improved LAS outperforms the conventional LFR system while being more compact. Both models use second-pass rescoring.

  Exp-ID  Model                    VS/D     1st-pass Model Size
  E8      Proposed                 5.6/4.1  0.4 GB
  E9      Conventional LFR system  6.7/5.0  0.1 GB (AM) + 2.2 GB (PM) + 4.9 GB (LM) = 7.2 GB

5. CONCLUSION

We designed an attention-based model for sequence-to-sequence speech recognition. The model integrates acoustic, pronunciation, and language models into a single neural network, and does not require a lexicon or a separate text normalization component. We explored various structure and optimization mechanisms for improving the model. Cumulatively, structure improvements (WPM, MHA) yielded an 11% improvement in WER, while optimization improvements (MWER, SS, LS and synchronous training) yielded a further 27.5% improvement, and language model rescoring yielded another 3.4% improvement. Applied to a Google Voice Search task, we achieve a WER of 5.6%, while a hybrid HMM-LSTM system achieves 6.7% WER. Testing the same models on a dictation task, our model achieves 4.1% and the hybrid system achieves 5% WER. We note, however, that the unidirectional LAS system has the limitation that the entire utterance must be seen by the encoder before any labels can be decoded (although we encode the utterance in a streaming fashion). Therefore, an important next step is to revise this model with a streaming attention-based model, such as the Neural Transducer [33].

6. REFERENCES

[1] A. Graves, "Sequence transduction with recurrent neural networks," CoRR, vol. abs/1211.3711, 2012.
[2] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," CoRR, vol. abs/1508.01211, 2015.
[3] N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio, "An Online Sequence-to-sequence Model Using Partial Conditioning," in Proc. NIPS, 2016.
[4] C. Raffel, M. Luong, P. J. Liu, R. J. Weiss, and D. Eck, "Online and Linear-Time Attention by Enforcing Monotonic Alignments," in Proc. ICML, 2017.
[5] H. Sak, M. Shannon, K. Rao, and F. Beaufays, "Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping," in Proc. Interspeech, 2017.
[6] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, "A Comparison of Sequence-to-sequence Models for Speech Recognition," in Proc. Interspeech, 2017.
[7] Y. Wu, M. Schuster, et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," CoRR, vol. abs/1609.08144, 2016.
[8] K. Rao, R. Prabhavalkar, and H. Sak, "Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer," in Proc. ASRU, 2017.
[9] W. Chan, Y. Zhang, Q. V. Le, and N. Jaitly, "Latent Sequence Decompositions," in Proc. ICLR, 2017.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," CoRR, vol. abs/1706.03762, 2017.
[11] B. Kingsbury, "Lattice-Based Optimization of Sequence Classification Criteria for Neural-Network Acoustic Modeling," in Proc. ICASSP, 2009.
[12] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C. Chiu, and A. Kannan, "Minimum Word Error Rate Training for Attention-based Sequence-to-sequence Models," in Proc. ICASSP (accepted), 2018.
[13] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks," in Proc. NIPS, 2015, pp. 1171–1179.
[14] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[15] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-Based Models for Speech Recognition," in Proc. NIPS, 2015.
[16] J. K. Chorowski and N. Jaitly, "Towards Better Decoding and Language Model Integration in Sequence to Sequence Models," in Proc. Interspeech, 2017.
[17] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large Scale Distributed Deep Networks," in Proc. NIPS, 2012.
[18] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: training imagenet in 1 hour," CoRR, vol. abs/1706.02677, 2017.
[19] T. N. Sainath, R. Prabhavalkar, S. Kumar, S. Lee, D. Rybach, A. Kannan, V. Schogol, P. Nguyen, B. Li, Y. Wu, Z. Chen, and C. Chiu, "No need for a lexicon?
evaluating the value of the pronunciation lexica in end-to-end models," in Proc. ICASSP (accepted), 2018.
[20] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, "An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model," in Proc. ICASSP (accepted), 2018.
[21] M. Schuster and K. Nakajima, "Japanese and Korean voice search," in Proc. ICASSP, 2012.
[22] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-End Attention-based Large Vocabulary Speech Recognition," in Proc. ICASSP, 2016.
[23] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, "Sequence level training with recurrent neural networks," in Proc. ICLR, 2016.
[24] D. Povey and P. C. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," in Proc. ICASSP, 2002, vol. 1, pp. I-105.
[25] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. NIPS, 2014.
[26] C. Allauzen and M. Riley, "Bayesian language model interpolation for mobile speech input," in Proc. Interspeech, 2011.
[27] H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition," in Proc. Interspeech, 2015.
[28] G. Pundak and T. N. Sainath, "Lower Frame Rate Neural Network Acoustic Models," in Proc. Interspeech, 2016.
[29] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov 1997.
[30] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, Nov 1997.
[31] D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," in Proc. ICLR, 2015.
[32] M. Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," available online: http://download.tensorflow.org/paper/whitepaper2015.pdf, 2015.
[33] T. N. Sainath, C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and Z. Chen, "Improving the Performance of Online Neural Transducer Models," in Proc. ICASSP (accepted), 2018.
