An Online Attention-based Model for Speech Recognition

Ruchao Fan 1,3, Pan Zhou 2, Wei Chen 3, Jia Jia 2, Gang Liu 1
1 Beijing University of Posts and Telecommunications, Beijing, P.R. China
2 Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China
3 Voice Interaction Technology Center, Sogou Inc., Beijing, P.R. China
{fanruchao,liugang}@bupt.edu.cn, {zh-pan,jjia}@mail.tsinghua.edu.cn, chenwei@sogou-inc.com

Abstract

Attention-based end-to-end models such as Listen, Attend and Spell (LAS) simplify the whole pipeline of traditional automatic speech recognition (ASR) systems and have become popular in the field of speech recognition. In previous work, researchers have shown that such architectures can achieve results comparable to state-of-the-art ASR systems, especially when using a bidirectional encoder and a global soft attention (GSA) mechanism. However, the bidirectional encoder and GSA are two obstacles to real-time speech recognition. In this work, we aim to stream the LAS baseline by removing these two obstacles. On the encoder side, we use a latency-controlled (LC) bidirectional structure to reduce the delay of forward computation. Meanwhile, an adaptive monotonic chunk-wise attention (AMoChA) mechanism is proposed to replace GSA for the calculation of the attention weight distribution. Furthermore, we propose two methods to alleviate the large performance degradation caused by combining LC and AMoChA. Finally, we obtain an online LAS model, LC-AMoChA, which suffers only a 3.5% relative performance reduction against the LAS baseline on our internal Mandarin corpus.

Index Terms: attention, end-to-end, online speech recognition
1. Introduction

Recently, end-to-end (E2E) models have become increasingly popular in automatic speech recognition (ASR), as they allow one neural network to jointly learn the acoustic, pronunciation and language models, greatly simplifying the whole pipeline of conventional hybrid ASR systems. Connectionist Temporal Classification (CTC) [1, 2], Recurrent Neural Network Transducer (RNN-T) [3, 4, 5], Recurrent Neural Aligner [6, 7], Segment Neural Transduction [8] and attention-based E2E (A-E2E) models [9, 10, 11] are E2E models that are well explored in the literature. With a unidirectional encoder, these architectures are easy to deploy for an online ASR task, except for the attention-based model. The A-E2E model was first introduced to ASR in [9]. Later on, a paradigm named Listen, Attend and Spell (LAS) with a bidirectional long short-term memory (BLSTM) [12] encoder was examined on a large-scale speech task [10], and more recently it has shown performance superior to a conventional hybrid system [11]. Moreover, previous work showed that LAS offers improvements over CTC and RNN-T models [13]. However, LAS with a bidirectional encoder and global soft attention (GSA) must process the entire input sequence before producing an output, which makes it hard to use in online ASR scenarios. The goal of this paper is to endow LAS with the ability to decode in a streaming manner with as little degradation of its performance as possible.

The first problem to solve for online LAS is the computation delay of the bidirectional encoder. Latency-controlled BLSTM (LC-BLSTM) was proposed [14] as a trade-off between performance and delay. This structure looks ahead a fixed number of frames instead of using all future frames as BLSTM does. With lower latency, the LC-BLSTM hybrid system in [15] achieves performance comparable to BLSTM. However, it was unclear whether such structures are practical for A-E2E models.
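The latency-controlled splitting described above can be sketched as follows. The helper `lc_blocks`, the toy frame values, and the exact boundary handling are illustrative assumptions, not the paper's implementation; only the block/look-ahead idea comes from the text:

```python
def lc_blocks(frames, n_c=64, n_r=32):
    """Split a frame sequence into non-overlapping blocks of length n_c,
    each extended by up to n_r look-ahead frames (hypothetical helper;
    n_c / n_r follow the paper's notation for block and look-ahead sizes)."""
    blocks = []
    for start in range(0, len(frames), n_c):
        end = min(start + n_c, len(frames))
        # The forward LSTM carries its state across blocks; the backward
        # LSTM sees only the n_r look-ahead frames instead of the whole future.
        blocks.append((frames[start:end], frames[end:end + n_r]))
    return blocks

blocks = lc_blocks(list(range(200)), n_c=64, n_r=32)
print(len(blocks))        # 4 blocks for 200 frames
print(len(blocks[0][1]))  # 32 look-ahead frames for the first block
```

Each block can then be processed as soon as its n_r look-ahead frames arrive, which is what bounds the latency.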
For the GSA mechanism, the context vector for the decoder is computed over the entire sequence of encoder outputs. Thus, the model cannot produce the first character until all input speech frames have been consumed during inference. One straightforward idea for reducing the attention delay is to make the input sequence to be attended shorter. Neural Transducer (NT) [16, 17] is such a method: it cuts the input sequence into non-overlapping blocks and performs GSA over each block to predict the targets corresponding to that block. The ground truth for each block needs to be acquired by a process named alignment, which resembles forced alignment in conventional ASR systems and takes much time during training. Another idea is to take advantage of the monotonicity of speech. Hard monotonic attention, which attends to only one of the encoder outputs, was introduced to stream the attention part. Hard attention needs to calculate a probability for each encoder output to determine whether it should be attended at each step, which causes a non-differentiability problem. One solution is reinforcement learning [18, 19]; another is to estimate the expectation of the context vector during training [20]. To better exploit the information of the encoder, Chiu et al. proposed Monotonic Chunkwise Attention (MoChA) [21], which adaptively splits the encoder outputs into small chunks over which soft attention is applied, generating one output per chunk. However, MoChA uses a fixed chunk size for all output steps and all utterances, which is not appropriate for ASR because the pronunciation rate varies across output units and speakers.

In this work, we construct a LAS baseline system with a BLSTM encoder and the GSA mechanism, denoted BLSTM-GSA, and explore methods to stream it by overcoming the abovementioned problems. On the encoder side, we replace BLSTM with LC-BLSTM to reduce the delay of forward computation and name the model LC-GSA.
With initialization from BLSTM-GSA, LC-GSA shows almost no degradation from the LAS baseline. On the attention side, MoChA with a BLSTM encoder (BLSTM-MoChA) is explored and also behaves well. Meanwhile, we propose an adaptive monotonic chunk-wise attention (AMoChA) mechanism to adaptively generate the chunk length so as to better fit the properties of speech. AMoChA differs from MAtChA in [21], which calculates the expectation of the weight distribution over the encoder outputs to implement an adaptive chunk length; in contrast, we use several feed-forward layers to predict the chunk length directly. Furthermore, we offer another two revisions, applicable to both MoChA and AMoChA, that alleviate the degradation caused by combining the LC-BLSTM encoder with AMoChA (denoted LC-AMoChA). Finally, the model LC-AMoChA can decode in a streaming manner with an acceptable relative performance reduction of 3.5% compared to the offline model BLSTM-GSA.

The remainder of the paper is structured as follows. Section 2 introduces the basic offline LAS model. The details of our methods to stream the LAS baseline are described in Section 3. Experimental details and results are presented in Section 4, and the paper is concluded with our findings and future work in Section 5.

2. BASIC LAS MODEL

The basic LAS model consists of three modules and can be described by Eqs.(1)-(6). Given a sequence of input frames x = {x_1, x_2, ..., x_T}, a listener encoder module, which consists of a multi-layer BLSTM, is used to extract higher-level representations h = {h_1, h_2, ..., h_U} with U <= T:

    h = Listener(x)                                                  (1)
    s_i = SpellerRNN(s_{i-1}, y_{i-1}, c_{i-1})                      (2)
    e_{i,u} = Energy(s_i, h_u) = V^T tanh(W_h h_u + W_s s_i + b)     (3)
    alpha_{i,u} = exp(e_{i,u}) / sum_{u'} exp(e_{i,u'})              (4)
    c_i = sum_u alpha_{i,u} h_u                                      (5)
    P(y_i | x, y_{<i}) = CharacterDistribution(s_i, c_i)             (6)

In the monotonic attention mechanism used later for streaming, a selection probability p_{i,u} is computed for each listener output, and h_u is attended at the first position where p_{i,u} > 0.5 or, equivalently, e_{i,u} > 0.
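As a toy illustration of the GSA step in Eqs.(3)-(5), the following NumPy sketch computes additive energies over all listener outputs, softmax weights, and the context vector; the function name and the tiny dimensions are hypothetical, not the paper's configuration:

```python
import numpy as np

def global_soft_attention(s_i, h, W_h, W_s, b, v):
    """One GSA step: additive energies e_{i,u} (Eq. 3), softmax weights
    alpha_{i,u} (Eq. 4), and the context vector c_i (Eq. 5)."""
    e = np.tanh(h @ W_h.T + s_i @ W_s.T + b) @ v   # Eq.(3): one energy per h_u
    a = np.exp(e - e.max())                        # Eq.(4): numerically stable softmax
    a /= a.sum()
    return a @ h                                   # Eq.(5): weighted sum of listener outputs

rng = np.random.default_rng(0)
U, d_h, d_s, d_a = 6, 4, 3, 5   # toy sizes, not the paper's 256/512 units
h = rng.normal(size=(U, d_h))   # listener outputs h_1..h_U
s_i = rng.normal(size=d_s)      # speller state at step i
c_i = global_soft_attention(s_i, h,
                            rng.normal(size=(d_a, d_h)),
                            rng.normal(size=(d_a, d_s)),
                            np.zeros(d_a), rng.normal(size=d_a))
print(c_i.shape)   # (4,) — same dimension as one listener output
```

The key point for streaming is visible here: the softmax in Eq.(4) normalizes over all U listener outputs, so c_i cannot be formed until the whole utterance has been encoded.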
After we obtain the attended h_u, which can be viewed as the boundary of listener outputs for each output step of the speller, AMoChA computes c_i within an adaptive, self-learned chunk of W listener outputs ranging from h_{u-W+1} to h_u. The above process can be formulated as:

    e_{i,u} = Energy(s_i, h_u) = g (v^T / ||v||) tanh(W_s s_i + W_h h_u + b) + r   (7)
    p_{i,u} = sigmoid(e_{i,u})                                                     (8)
    z_{i,u} ~ Bernoulli(p_{i,u})                                                   (9)
    v = u - W + 1                                                                  (10)
    d_{i,k} = Energy(s_i, h_k),  k = v, v+1, ..., u                                (11)
    c_i = sum_{k=v}^{u} [ exp(d_{i,k}) / sum_{l=v}^{u} exp(d_{i,l}) ] h_k          (12)

where g and r are learnable scalars. The two energy functions above are similar but have different parameters.

The difference between MoChA and AMoChA is the acquisition of the chunk length W. MoChA uses a fixed chunk length W, which is hard to tune: this hyper-parameter varies with the language and the output unit and is also affected by the pronunciation rate. To solve this problem, AMoChA is proposed to avoid selecting the hyper-parameter W. The two mechanisms are visualized in Fig. 1b and 1c. For AMoChA, after we obtain the attended h_u, two approaches are proposed to learn the chunk length W from h_u and the speller state s_i at step i. The ideas are similar to [22], but that work uses only h_u as input, and its usage of W differs from ours.

Constrained chunk length prediction: the chunk length W is bounded by a maximum chunk length W_max with the following equation:

    W = W_max * sigmoid(V_p^T F(W_h h_u + W_s s_i + b))                            (13)

where W_h, W_s and V_p are parameters to learn, sigmoid is the sigmoid function, and the activation function F can be ReLU [23] or Tanh.

Unconstrained chunk length prediction: compared to Eq.(13), we investigate an unconstrained chunk length prediction network that removes the hyper-parameter W_max:

    W = exp(V_p^T F(W_h h_u + W_s s_i + b))                                        (14)

where the exponential function ensures that the chunk length W is positive.
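A minimal sketch of the two chunk-length predictors of Eq.(13) and Eq.(14), assuming ReLU as the activation F; the helper name and the toy dimensions are our assumptions, not the paper's 512-unit configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def chunk_length(h_u, s_i, W_h, W_s, b, v_p, w_max=None):
    """Predict the AMoChA chunk length W from the attended listener output
    h_u and the speller state s_i. With w_max set this is the constrained
    form (Eq. 13, bounded by W_max); otherwise the unconstrained
    exponential form (Eq. 14, only guaranteed positive)."""
    z = v_p @ np.maximum(0.0, W_h @ h_u + W_s @ s_i + b)  # V_p^T F(...), F = ReLU
    return w_max * sigmoid(z) if w_max is not None else np.exp(z)

rng = np.random.default_rng(1)
d_h, d_s, d_p = 4, 3, 5   # toy sizes
h_u, s_i = rng.normal(size=d_h), rng.normal(size=d_s)
W_h, W_s = rng.normal(size=(d_p, d_h)), rng.normal(size=(d_p, d_s))
b, v_p = np.zeros(d_p), rng.normal(size=d_p)
w_c = chunk_length(h_u, s_i, W_h, W_s, b, v_p, w_max=40)  # Eq.(13): in (0, 40)
w_u = chunk_length(h_u, s_i, W_h, W_s, b, v_p)            # Eq.(14): positive
print(w_c, w_u)
```

The only difference between the two variants is the output nonlinearity: a scaled sigmoid caps W at W_max, while the exponential leaves it unbounded above.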
In the original MoChA paper, the authors also discussed the limitation of a fixed-length chunk and proposed Monotonic Adaptive Chunkwise Attention (MAtChA). That method implements an adaptive chunk length by using the chunk between two adjacent attended listener outputs and, as in MoChA, calculates the expectation of c_i for training. In contrast, our method uses several neural layers to predict the chunk length directly.

3.2.2. Training

For MoChA training, the model cannot be trained with backpropagation because of the sampling process. To remedy this, Raffel et al. propose to compute c_i = sum_u beta_{i,u} h_u from the probability of the weight distribution over listener outputs, which can be formulated as:

    alpha_{i,u} = p_{i,u} ( (1 - p_{i,u-1}) alpha_{i,u-1} / p_{i,u-1} + alpha_{i-1,u} )           (15)
    beta_{i,u} = sum_{k=u}^{u+W-1} [ alpha_{i,k} exp(d_{i,u}) / sum_{l=k-W+1}^{k} exp(d_{i,l}) ]  (16)

The computation of alpha_{i,u} and beta_{i,u} is time-consuming; an efficient parallelized algorithm is given in [21], which we do not discuss here due to the page limitation.

To learn the parameters of the prediction network in AMoChA, we use a multi-task loss function composed of the standard cross-entropy (CE) loss and a mean square error (MSE) loss between the predicted chunk length and the ground-truth chunk length:

    Loss = (1 - lambda) * L_CE + lambda * L_W                                                     (17)

where lambda controls the ratio of CE to MSE. The ground-truth chunk length for each output character can be obtained from greedy decoding or beam search decoding of each utterance with BLSTM-GSA: we use the number of attention weights above a threshold (0.01) as the target chunk length. Character-level alignments from a hybrid system can also be used to obtain the true chunk length for each character.

4. EXPERIMENTS

Our experiments are conducted on ~1,000 hours of Sogou internal data from voice dictation. The results are the average WER over three test sets, which contain about 22,000 utterances in total.
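The target extraction described in Section 3.2.2 and the multi-task objective of Eq.(17) can be sketched as follows; the function names and the example numbers are illustrative assumptions, with only the 0.01 threshold and the (1 - lambda)/lambda weighting taken from the text:

```python
import numpy as np

def target_chunk_length(alpha, thresh=0.01):
    """Ground-truth chunk length for one output character: the number of
    GSA attention weights above the 0.01 threshold (Section 3.2.2)."""
    return int((np.asarray(alpha) > thresh).sum())

def amocha_loss(l_ce, l_w, lam=0.02):
    """Multi-task objective of Eq.(17): interpolate the cross-entropy loss
    and the MSE chunk-length loss with weight lambda."""
    return (1.0 - lam) * l_ce + lam * l_w

alpha = [0.001, 0.05, 0.6, 0.3, 0.04, 0.009]   # one row of GSA weights (toy values)
print(target_chunk_length(alpha))               # 4 weights exceed 0.01
print(amocha_loss(2.5, 9.0, lam=0.02))          # 0.98*2.5 + 0.02*9.0 = 2.63
```

lambda = 0.02 is used here because Table 2 later reports it as the best-performing weight.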
All experiments use 40-dimensional filter bank features, computed every 10ms within a 25ms window. First and second derivatives are not used. Similar to [24], we concatenate every 5 consecutive frames to form a 200-dimensional feature vector as the input to our network.

For our LAS baseline, a four-layer BLSTM with 256 hidden units per LSTM is used as the encoder. The third and fourth BLSTM layers use a pyramid structure that takes every two consecutive frames of their input as input; as a result, the final listener output representation is 4x subsampled. Additive attention [25] is used in the attender. The speller is a one-layer LSTM with 512 hidden units. To compare the performance of the LC listener, we also explore an LSTM-GSA model with a unidirectional LSTM encoder [26] of 640 hidden units, so that the number of parameters is approximately the same as BLSTM-GSA. The hidden sizes of the energy function and the prediction network in AMoChA are both 512.

The total number of modelling units is 6812, including 6809 Chinese characters, start of sequence (SOS), end of sequence (EOS) and the unknown character (UNK). All models are trained by optimizing the CE loss, except AMoChA, which is learned by optimizing Eq.(17). Scheduled sampling [27] and label smoothing [28] are both adopted during training to improve performance. BLSTM-GSA and LSTM-GSA are trained for 30 epochs. We use teacher forcing for the first 11 epochs and then feed the output of the last step back into the network, with a scheduled sampling rate that gradually increases to 0.3 from epoch 12 to epoch 17; from epoch 17 on, we fix the rate at 0.3. We use an initial learning rate of 0.0002 and halve it from epoch 24. The other models are trained for 14 epochs with similar scheduled sampling and learning rate settings. Beam search decoding without an external language model is used to evaluate our models on the test sets, with a beam width of 5.
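The input pipeline described at the start of this section (every 5 consecutive 40-dimensional filterbank frames concatenated into a 200-dimensional vector) might be sketched as follows; how trailing frames that do not fill a full group are handled is our assumption, as the paper does not specify it:

```python
import numpy as np

def stack_frames(feats, n=5):
    """Concatenate every n consecutive feature frames into one vector,
    e.g. 5 x 40-dim filterbank -> one 200-dim input vector. Trailing
    frames that do not fill a full group are dropped here (an assumption)."""
    t = (len(feats) // n) * n                       # largest multiple of n
    return feats[:t].reshape(-1, n * feats.shape[1])

feats = np.zeros((103, 40))    # 103 frames of 40-dim filterbank (toy input)
stacked = stack_frames(feats)
print(stacked.shape)           # (20, 200): 100 frames grouped in fives
```

Stacking reduces the sequence length by 5x before the listener, on top of the 4x pyramid subsampling inside the BLSTM stack.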
Temperature [28] is also used in decoding to smooth the output distribution for better beam search results. The weight decay is 1e-5 [29].

4.1. LC-GSA: Streaming the Listener

We start by training BLSTM-GSA (MDL1) and LSTM-GSA (MDL2) as our baselines. LSTM-GSA performs poorly, as seen in Table 1. We then explore the ability of LC-BLSTM to stream the listener. As stated in Section 3.1, the input sequence is split into non-overlapping blocks of length N_c, and N_r future frames are appended to each block; the value of N_r controls the latency of the model. Table 1 shows the results of our methods and indicates that initializing from BLSTM-GSA yields an impressive improvement over training from scratch. If N_c and N_r are chosen properly (64 and 32 for our task), LC-GSA (MDL3) behaves nearly the same as MDL1. We conjecture that with LC-GSA, the attention weights can be redistributed to attend to more future frames and obtain a context vector similar to that of BLSTM-GSA at each output step, even though LC-BLSTM loses some future information.

Table 1: Results of our methods to stream the listener.

    Model | N_c | N_r | Initial Model | CER(%)
    MDL1  |  -  |  -  | Null          | 17.88
    MDL2  |  -  |  -  | Null          | 20.4
    MDL3  | 32  | 16  | Null          | 20.13
    MDL3  | 32  | 16  | MDL1          | 18.63
    MDL3  | 64  | 32  | MDL1          | 17.99

4.2. BLSTM-AMoChA: Streaming the Attender

Next, we investigate the effectiveness of our proposed AMoChA with a BLSTM listener. The first experiment is the implementation of MoChA with a BLSTM listener, denoted BLSTM-MoChA (MDL4). Our best MDL4 is trained with MDL1 as the initial model, using a chunk length of 10 (40 frames of the raw input) for Chinese characters. It achieves a CER of 17.7%, which is comparable to MDL1.

Table 2: The results of BLSTM-AMoChA. F is the activation function in Eq.(13). SOL (source of label) indicates how we obtain the ground-truth chunk length: BS1 means greedy decoding with MDL1; BS5 means beam decoding with beam size 5 and an extra language model; HMM is the alignment result of a hybrid system.

    Model  | W_max | lambda | F    | SOL | CER(%)
    MDL1   |   -   | 0      | -    | -   | 17.88
    MDL4   |   -   | 0      | -    | -   | 17.77
    C-MDL5 |  30   | 0.1    | ReLU | BS1 | 17.8
    C-MDL5 |  40   | 0.1    | ReLU | BS1 | 17.75
    C-MDL5 |  60   | 0.1    | ReLU | BS1 | 17.92
    C-MDL5 |  40   | 0.2    | ReLU | BS1 | 17.9
    C-MDL5 |  40   | 0.02   | ReLU | BS1 | 17.51
    C-MDL5 |  40   | 0.02   | Tanh | BS1 | 17.54
    U-MDL5 |   -   | 0.02   | ReLU | BS1 | 17.45
    U-MDL5 |   -   | 0.02   | ReLU | HMM | 17.40
    U-MDL5 |   -   | 0.02   | ReLU | BS5 | 17.36

Table 2 compares the results of the two chunk length prediction methods proposed in Section 3.2. These models are all trained with MDL1 as the initial model. The constrained chunk length prediction method is denoted Constrained BLSTM-AMoChA (C-MDL5). The table shows that the best upper bound on the chunk length is 40 and that 0.02 is the best weight for the chunk length prediction task. For the activation function, ReLU behaves slightly better than Tanh. With the best settings, Unconstrained BLSTM-AMoChA (U-MDL5) is studied with different sources of ground-truth chunk lengths; BS5 seems to provide the most accurate chunk length supervision. The best U-MDL5 model achieves a CER of 17.36% and outperforms MDL4, which uses a fixed chunk length.

4.3. LC-AMoChA: Online LAS

After streaming the listener and the attender separately, we combine the two parts with their best settings and obtain LC-MoChA (MDL6) and LC-AMoChA (MDL7). These two models do not behave well if no extra methods are used. In this part, we investigate why this happens and explore how to solve the problem.

Following the experience of Section 4.1, we use pre-trained models to initialize MDL6 and MDL7. The results are shown in Table 3. Despite using pre-trained models, a huge degradation occurs (-13.1%). By observing the attention weight distributions of MDL1 and MDL6, we find that the boundary of the alignment clearly shifts toward future listener features; one example is shown in Fig. 2a and 2b.
MoChA computes a selection probability p_{i,u} from the listener features to determine the boundary of each output unit. The future information of the listener features lost through LC-BLSTM makes the boundary shift toward the future and eventually leads to the degradation of CER. To compensate for this degradation, we propose two methods.

Given the listener outputs h, suppose the attended probabilities of h at step i are p_i = {p_{i,1}, p_{i,2}, ..., p_{i,U}}. The first method (M1) appends w-1 future listener features to each h_u, using the following equation before calculating p_i:

    h_hat_u = (h_u + h_{u+1} + ... + h_{u+w-1}) / w,  for u = 1, 2, ..., U              (18)

The second method (M2) utilizes the future information of each attended probability p_{i,u} at output step i:

    p_hat_{i,u} = (p_{i,u} + p_{i,u+1} + ... + p_{i,u+w-1}) / w,  for u = 1, 2, ..., U  (19)

so that the averaged probability, rather than the original p_{i,u}, is used to judge the boundary. Eq.(19) can be inserted after Eq.(8) during both training and inference. By utilizing future information, we can drag the boundary of the alignment back to some extent, and the attention weights can be redistributed within the chunk to give better performance, as seen in Fig. 2c.

Table 3: Improvements to the entire LAS. A future window w means that w-1 future positions are averaged.

    Model | Methods | Future w | CER(%)
    MDL1  |    -    |    -     | 17.88
    MDL6  |    -    |    1     | 20.22 (-13.1%)
    MDL6  |   M1    |    8     | 19.49
    MDL6  |   M1    |   10     | 19.94
    MDL6  |   M2    |    8     | 18.87
    MDL6  |   M2    |   10     | 18.78 (-5%)
    MDL7  |   M2    |   10     | 18.5 (-3.5%)

Figure 2: Distribution of attention weight in different models: (a) MDL1, (b) MDL6, (c) MDL7. The dotted box contains the listener features with non-zero weight, and the chunk length is printed in the box. MDL6 is the model with only the pre-training method; MDL7 is the model in Table 3.
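Both M1 (Eq. 18, over listener features) and M2 (Eq. 19, over attend probabilities) are moving averages over future positions. A sketch of the shared operation, with the boundary handling near the end of the sequence assumed since the paper does not specify it:

```python
import numpy as np

def future_average(x, w):
    """Average each position with its next w-1 values: Eq.(18) when x holds
    listener features (M1), Eq.(19) when x holds attend probabilities
    p_{i,u} (M2). Positions near the end average over whatever remains
    (our assumption about boundary handling)."""
    x = np.asarray(x, dtype=float)
    return np.array([x[u:u + w].mean(axis=0) for u in range(len(x))])

p = [0.1, 0.2, 0.9, 0.8, 0.1]     # toy attend probabilities along u
print(future_average(p, w=2))     # [0.15, 0.55, 0.85, 0.45, 0.1]
```

Because the averaged probability rises earlier than the raw one (0.55 at u=1 versus 0.2 above), the p > 0.5 boundary fires sooner, which is how the methods drag the alignment back toward the past.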
Table 3 compares the results of our two revisions against MDL6 with only the pre-training method. The table indicates that M2, averaging over 10 consecutive future attended probabilities, streams MDL1 with the least degradation (-5%). Finally, together with AMoChA, the basic offline LAS can decode in a streaming manner with an acceptable 3.5% relative degradation of CER.

5. CONCLUSIONS

In this paper, we explore an online attention-based LAS system. On the listener side, we propose to use LC-BLSTM to decrease the latency. On the attention side, we propose the AMoChA method to stream the attention and better fit the properties of speech. By combining the two parts and utilizing our proposed methods, BLSTM-GSA can run online with only a 3.5% relative degradation of performance on our Mandarin corpus. In AMoChA, the chunk length at each step may be related to that of the last step; LSTMs may be better suited to modelling this, which we include in our future work. Furthermore, we use a one-layer LSTM speller in our system; how to use a multi-layer speller with AMoChA is another challenge for online ASR systems.

6. References

[1] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369-376.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173-182.
[3] A. Graves, "Sequence transduction with recurrent neural networks," arXiv preprint, 2012.
[4] E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z.
Zhu, "Exploring neural transducers for end-to-end speech recognition," in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 206-213.
[5] K. Rao, H. Sak, and R. Prabhavalkar, "Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer," in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193-199.
[6] H. Sak, M. Shannon, K. Rao, and F. Beaufays, "Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping," in Interspeech, 2017, pp. 1298-1302.
[7] L. Dong, S. Zhou, W. Chen, and B. Xu, "Extending recurrent neural aligner for streaming end-to-end speech recognition in mandarin," arXiv preprint arXiv:1806.06342, 2018.
[8] L. Yu, J. Buys, and P. Blunsom, "Online segment to segment neural transduction," arXiv preprint arXiv:1609.08194, 2016.
[9] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, "End-to-end continuous speech recognition using attention-based recurrent nn: First results," arXiv preprint arXiv:1412.1602, 2014.
[10] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960-4964.
[11] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., "State-of-the-art speech recognition with sequence-to-sequence models," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774-4778.
[12] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
[13] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N.
Jaitly, "A comparison of sequence-to-sequence models for speech recognition," in Interspeech, 2017, pp. 939-943.
[14] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, "Highway long short-term memory rnns for distant speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5755-5759.
[15] S. Xue and Z. Yan, "Improving latency-controlled blstm acoustic models for online speech recognition," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5340-5344.
[16] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio, "An online sequence-to-sequence model using partial conditioning," in Advances in Neural Information Processing Systems, 2016, pp. 5067-5075.
[17] T. N. Sainath, C.-C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and Z. Chen, "Improving the performance of online neural transducer models," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5864-5868.
[18] Y. Luo, C.-C. Chiu, N. Jaitly, and I. Sutskever, "Learning online alignments with continuous rewards policy gradient," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2801-2805.
[19] D. Lawson, C.-C. Chiu, G. Tucker, C. Raffel, K. Swersky, and N. Jaitly, "Learning hard alignments with variational inference," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5799-5803.
[20] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, "Online and linear-time attention by enforcing monotonic alignments," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2837-2846.
[21] C.-C. Chiu and C.
Raffel, "Monotonic chunkwise attention," arXiv preprint arXiv:1712.05382, 2017.
[22] A. Tjandra, S. Sakti, and S. Nakamura, "Local monotonic attention mechanism for end-to-end speech and language processing," arXiv preprint arXiv:1705.08091, 2017.
[23] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.
[24] H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," arXiv preprint arXiv:1507.06947, 2015.
[25] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[27] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Advances in Neural Information Processing Systems, 2015, pp. 1171-1179.
[28] J. Chorowski and N. Jaitly, "Towards better decoding and language model integration in sequence to sequence models," arXiv preprint arXiv:1612.02695, 2016.
[29] C. Cortes, M. Mohri, and A. Rostamizadeh, "L2 regularization for learning kernels," in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009, pp. 109-116.