Sequence-to-Sequence Models Can Directly Translate Foreign Speech



Ron J. Weiss¹, Jan Chorowski¹, Navdeep Jaitly²*, Yonghui Wu¹, Zhifeng Chen¹
¹Google Brain  ²Nvidia
{ronw,chorowski}@google.com, njaitly@nvidia.com, {yonghui,zhifengc}@google.com

Abstract

We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the ground truth source language transcription during training. We apply a slightly modified sequence-to-sequence with attention architecture that has previously been used for speech recognition and show that it can be repurposed for this more complex task, illustrating the power of attention-based models. A single model trained end-to-end obtains state-of-the-art performance on the Fisher Callhome Spanish-English speech translation task, outperforming a cascade of independently trained sequence-to-sequence speech recognition and machine translation models by 1.8 BLEU points on the Fisher test set. In addition, we find that making use of the training data in both languages by multi-task training sequence-to-sequence speech translation and recognition models with a shared encoder network can improve performance by a further 1.4 BLEU points.

Index Terms: speech translation, sequence-to-sequence model

1. Introduction

Sequence-to-sequence models were recently introduced as a powerful new method for translation [1, 2]. Subsequently, the model has been adapted and applied to various tasks such as image captioning [3, 4], pose prediction [5], and syntactic parsing [6]. It has also led to a new state of the art in Neural Machine Translation (NMT) [7].
The model has also recently achieved promising results on automatic speech recognition (ASR), even without the use of language models [8, 9, 10, 11]. These successes have only been possible because sequence-to-sequence (seq2seq) models can accurately model very complicated probability distributions. This makes it possible to apply them even in situations where a precise analytical model is difficult to intuit.

In this paper we show that a single sequence-to-sequence model is powerful enough to translate audio in one language directly into text in another language. Using a model similar to Listen, Attend and Spell (LAS) [9, 10], we process log mel filterbank input features using a recurrent encoder. The encoder features are then used, along with an attention model, to build a conditional next-step prediction model for text in the target domain. Unlike LAS, however, we use the text in the translated domain as the target; the source language text is not used.

Conventionally, this task is performed by pipelining results from an ASR system trained on the source language with a machine translation (MT) system trained to translate text from the source language into text in the target language. However, we motivate an end-to-end approach from several different angles.

First, by virtue of being an end-to-end model, all the parameters are jointly adjusted to optimize the results on the final goal. Training separate speech recognition and translation models may lead to a situation where the models perform well individually but do not work well together, because their error surfaces do not compose well. For example, typical errors in the ASR system may exacerbate the errors of the translation model, which has not seen such errors in its input during training.

* Work done at Google.
Another advantage of an end-to-end model is that, during inference, a single model can have lower latency than a cascade of two independent models. Additionally, end-to-end models have advantages in low-resource settings, since they can directly make use of corpora where the audio is in one language while the transcript is in another. This can arise, for example, in videos that have been captioned in other languages. It can also reduce labeling budgets, since speech would only need to be transcribed in one language. In extreme cases where the source language does not have a writing system, applying a separate ASR system would require first standardizing the writing system, a very significant undertaking [12, 13].

We make several interesting observations from experiments on conversational Spanish-to-English speech translation. As with LAS models, we find that the model performs surprisingly well without using independent language models in either the source or target language. While the model performs well without seeing source language transcripts during training, we find that we can leverage them in a multi-task setting to improve performance. Finally, we show that the end-to-end model outperforms a cascade of independent seq2seq ASR and NMT models.

2. Related work

Early work on speech translation (ST) [14], i.e. translating audio in one language into text in another, used lattices from an ASR system as inputs to translation models [15, 16], giving the translation model access to the speech recognition uncertainty. Alternative approaches explicitly integrated acoustic and translation models using a stochastic finite-state transducer, which can decode the translated text directly using Viterbi search [17, 18].

In this paper we compare our integrated model to results obtained from cascaded models on a Spanish-to-English speech translation task [19, 20, 21]. These approaches also use ASR lattices as MT inputs. Post et al.
[19] used a GMM-HMM ASR system. Kumar et al. [20] later showed that using a better ASR model improved overall ST results. Subsequently, [21] showed that modeling features at the boundary between the ASR and MT systems can further improve performance. We carry this notion much further by defining an end-to-end model for the entire task.

Other recent work on speech translation does not use ASR. Instead, [22] used an unsupervised model to cluster repeated audio patterns, which are then used to train a bag-of-words translation model. In [23], seq2seq models were used to align speech with translated text, but not to directly predict the translations. Our work is most similar to [24], which uses a LAS-like model for ST on data synthesized using a text-to-speech system. In contrast, we train on a much larger corpus composed of real speech.

3. Sequence-to-sequence model

We utilize a sequence-to-sequence with attention architecture similar to that described in [1]. The model is composed of three jointly trained neural networks: a recurrent encoder, which transforms a sequence of input feature frames x_{1..T} into a sequence of hidden activations h_{1..L}, optionally at a slower time scale:

    h_l = enc(x_{1..T})    (1)

The full encoded input sequence h_{1..L} is consumed by a decoder network, which emits a sequence of output tokens y_{1..K} via next-step prediction: emitting one output token (e.g. word or character) per step, conditioned on the token emitted at the previous time step as well as the entire encoded input sequence:

    y_k = dec(y_{k-1}, h_{1..L})    (2)

The dec function is implemented as a stacked recurrent neural network with D layers, which can be expanded as follows:

    o_k^1, s_k^1 = d^1(y_{k-1}, s_{k-1}^1, c_{k-1})    (3)
    o_k^n, s_k^n = d^n(o_k^{n-1}, s_{k-1}^n, c_k)    (4)

where d^n is a long short-term memory (LSTM) cell [25], which emits an output vector o^n into the following layer and updates its internal state s^n at each time step.
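The next-step prediction loop of Equations 1-4 can be sketched as follows. This is an illustrative NumPy sketch with toy dimensions and simple tanh stand-ins for the encoder and the decoder cell; none of the names or sizes come from the paper, which uses LSTM stacks as described below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only).
T, L, H, V = 12, 6, 8, 5   # input frames, encoded frames, hidden size, vocab size
SOS, EOS = 0, 1            # special start/end-of-sequence token ids

W_enc = rng.normal(size=(H, H))   # stand-in encoder weights
W_out = rng.normal(size=(V, H))   # output projection (W_y in Eq. 7)

def enc(x):
    """Eq. 1: map T input frames to L encoded frames at a slower time scale."""
    pooled = x.reshape(L, T // L, H).mean(axis=1)   # coarsen time by T/L
    return np.tanh(pooled @ W_enc.T)

def dec_step(y_prev, state, h):
    """Eq. 2: one next-step prediction conditioned on y_{k-1} and h_{1..L}."""
    context = h.mean(axis=0)                    # crude stand-in for attention
    state = np.tanh(state + context + y_prev)   # stand-in recurrent cell
    logits = W_out @ state
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), state

x = rng.normal(size=(T, H))                     # fake input feature frames
h = enc(x)
y, state, out = SOS, np.zeros(H), []
for _ in range(10):                             # greedy next-step decoding
    probs, state = dec_step(y, state, h)
    y = int(np.argmax(probs))
    if y == EOS:
        break
    out.append(y)
print(out)
```

During training, teacher forcing (as used in the paper) would feed the ground-truth y_{k-1} at each step instead of the model's own previous prediction.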
The decoder's dependence on the input is mediated through an attention network, which summarizes the entire input sequence as a fixed-dimensional context vector c_k that is passed to all subsequent layers using skip connections. c_k is computed from the first decoder layer output at each output step k:

    c_k = \sum_l \alpha_{kl} h_l    (5)
    \alpha_{kl} = softmax(a_e(h_l)^T a_d(o_k^1))    (6)

where a_e and a_d are small fully connected networks. The \alpha_{kl} probabilities compute a soft alignment between the input and output sequences. An example is shown in Figure 1. Finally, an output symbol is sampled from a multinomial distribution computed from the final decoder layer output:

    y_k ~ softmax(W_y [o_k^D, c_k] + b_y)    (7)

3.1. Speech model

We train seq2seq models for both end-to-end speech translation and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80-channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, which include English and Spanish lowercase letters.

The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the 'depth' dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4 and decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.

This downsampled feature sequence is then passed into a single bidirectional convolutional LSTM [27, 28, 10] layer using a 1 × 3 filter (i.e. convolving only across the frequency dimension within each time step). Finally, this is passed into a stack of three bidirectional LSTM layers of size 256 in each direction, interleaved with a 512-dimensional linear projection, followed by batch normalization and a ReLU activation, to compute the final 512-dimensional encoder representation, h_l.

The decoder input is created by concatenating a 64-dimensional embedding for y_{k-1}, the symbol emitted at the previous time step, and the 512-dimensional attention context vector c_k. The networks a_e and a_d used to compute c_k (see Equation 6) each contain a single hidden layer with 128 units. This is passed into a stack of four unidirectional LSTM layers with 256 units. Finally, the concatenation of the attention context and LSTM output is passed into a softmax layer, which predicts the probability of emitting each symbol in the output vocabulary.

The network contains 9.8m parameters. We implement it with TensorFlow [29] and train using teacher forcing on minibatches of 64 utterances. We use asynchronous stochastic gradient descent across 10 replicas using the Adam optimizer [30] with β1 = 0.9, β2 = 0.999, and ε = 10^-6. The initial learning rate is set to 10^-3 and decayed by a factor of 10 after 1m steps. L2 weight decay is used with a weight of 10^-6, and beginning from step 20k, Gaussian weight noise with standard deviation 0.125 is added to all LSTM weights and decoder embeddings. We tuned all hyperparameters to maximize performance on the Fisher/dev set.

We decode using beam search with rank pruning at 8 hypotheses and a beam width of 3, using the scoring function proposed in [7]. We do not utilize any language models.
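The attention computation of Equations 5 and 6 can be sketched directly. In the sketch below, single linear layers stand in for the small fully connected networks a_e and a_d (the paper uses one hidden layer of 128 units for each), and all dimensions are toy values, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, A = 6, 8, 4   # encoded frames, encoder dim, attention dim (toy sizes)

h = rng.normal(size=(L, H))   # encoder outputs h_{1..L}
o1_k = rng.normal(size=H)     # first decoder layer output at step k

# Linear stand-ins for the fully connected networks a_e and a_d in Eq. 6.
W_e = rng.normal(size=(A, H))
W_d = rng.normal(size=(A, H))

scores = (W_e @ h.T).T @ (W_d @ o1_k)   # a_e(h_l)^T a_d(o_k^1) for each l
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                    # softmax over input positions l (Eq. 6)
c_k = alpha @ h                         # context vector, Eq. 5

assert np.isclose(alpha.sum(), 1.0)     # alpha is a soft alignment over frames
```

The alpha vector here is one row of the alignment matrix visualized in Figure 1: a distribution over encoded input frames for a single output token.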
For the baseline ASR model we found that neither length normalization nor the coverage penalty from [7] were needed; however, it was helpful to permit emitting the end-of-sequence token only when its log-probability was 3.0 greater than that of the next most probable token. For speech translation we found that using length normalization of 0.6 improved performance by 0.6 BLEU points.

3.2. Neural machine translation model

We also train a baseline seq2seq text machine translation model following [7]. To reduce overfitting on the small training corpus, we significantly reduce the model size compared to those in [7]. The encoder network consists of four encoder layers (5 LSTM layers in total). As in the base architecture, the bottom layer is a bidirectional LSTM and the remaining layers are all unidirectional. The decoder network consists of 4 stacked LSTM layers. All encoder and decoder LSTM layers contain 512 units. The attention network uses a single hidden layer with 512 units. We use the same character-level vocabulary for input and output as the speech model described above emits.

As in [7], we apply dropout [31] with probability 0.2 during training to reduce overfitting. We train using SGD with a single replica. Training converges after about 100k steps using minibatches of 128 sentence pairs.

3.3. Multi-task training

Supervision from source language transcripts can be incorporated into the speech translation model by co-training an auxiliary model with shared parameters, e.g. an ASR model using a common encoder. This is equivalent to a multi-task configuration [32]. We use the models and training protocols described above, with these modifications: we use 16 workers that randomly select a model to optimize at each step, we introduce weight noise after 30k steps, and we decay the learning rate after 1.5m overall steps.
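The per-step task selection used in this multi-task setup can be sketched as follows. The helper functions are hypothetical stand-ins for one optimizer step on each task, and the 75% speech translation share is the split the paper reports using in Section 4.2.

```python
import random

random.seed(0)

# Hypothetical stand-ins for one optimizer step on each task; in the paper,
# 16 asynchronous workers each pick a task at random for every step.
def train_step_st():
    return "ST"    # speech translation: shared encoder + translation decoder

def train_step_asr():
    return "ASR"   # auxiliary recognition task: shared encoder + ASR decoder

def pick_task(p_st=0.75):
    """Sample the task for this step: 75% speech translation, 25% ASR."""
    return train_step_st if random.random() < p_st else train_step_asr

counts = {"ST": 0, "ASR": 0}
for _ in range(10000):
    counts[pick_task()()] += 1
print(counts)   # roughly a 7500 / 2500 split
```

Because only the task sampled at each step is optimized, the shared encoder receives gradients from both decoders over the course of training.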
[Figure 1: two attention heatmaps plotting encoded frame index l against output tokens y_k over roughly 3 seconds of input features. (a) Spanish speech recognition decoder attention. (b) Spanish-to-English speech translation decoder attention.]

Figure 1: Example attention probabilities α_{kl} from a multi-task model with two decoders. The ASR attention is roughly monotonic, whereas the translation attention contains an example of word reordering typical of seq2seq MT models, attending primarily to frames l = 58-70 while emitting "living here". The recognition decoder attends to these frames while emitting the corresponding Spanish phrase "vive aquí". The ASR attention is also more confident than the translation attention, which tends to be smoothed out across many input frames for each output token. This is a consequence of the ambiguous mapping between Spanish sounds and English translations.

4. Experiments

We conduct experiments on the Spanish Fisher and Callhome corpora of telephone conversations, augmented with English translations from [19]. We split training utterances according to the provided segment annotations, and preprocess the Spanish transcriptions and English translations by lowercasing and removing punctuation.
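A minimal sketch of the normalization just described (lowercasing plus punctuation removal, keeping apostrophes as noted in the token-set footnote); this is an illustrative stand-in, not the paper's actual preprocessing pipeline:

```python
def normalize(text, keep="'"):
    """Lowercase and strip punctuation, keeping apostrophes.
    A sketch of the preprocessing described above, not the exact pipeline."""
    text = text.lower()
    return "".join(ch for ch in text
                   if ch.isalnum() or ch.isspace() or ch in keep)

print(normalize("¿Hace mucho tiempo que vive aquí?"))
# -> "hace mucho tiempo que vive aquí"
```

Note that accented Spanish letters such as í pass the isalnum check and are kept, while punctuation such as ¿ and ? is dropped, matching the 90-token character vocabulary described next.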
All models use a common set of 90 tokens to represent Spanish and English characters, containing lowercase letters from both alphabets, digits 0-9, space, punctuation marks,¹ as well as special start-of-sequence, end-of-sequence, and unknown tokens. Following [19, 20, 21], we train all models on the 163-hour Fisher train set and tune hyperparameters on Fisher/dev. We report speech recognition results as word error rates (WER), and translation results using BLEU [33] scores computed with the Moses toolkit² multi-bleu.pl script, both on lowercase reference text after removing punctuation. We use the 4 provided Fisher reference translations and a single reference for Callhome.

¹ Most are not used in practice, since we remove punctuation aside from apostrophes from the training targets.
² http://www.statmt.org/moses/

4.1. Tuning decoder depth

It is common for ASR seq2seq models to use a shallow decoder, generally comprised of one recurrent layer [8, 9, 10]. In contrast, seq2seq NMT models often use much deeper decoders, e.g. 8 layers in [7]. In analogy to a traditional ASR system, one may think of the seq2seq encoder as behaving like the acoustic model while the decoder acts as the language model. The additional complexity of the translation task compared to monolingual language modeling motivates the use of a higher capacity decoder network. We therefore experiment with varying the depth of the stack of LSTM layers used in the decoder for speech translation, and find that performance improves as the decoder depth increases up to four layers; see Table 1. Despite this intuition, we obtained similar improvements in performance on the ASR task when increasing decoder depth, suggesting that tuning the decoder architecture is worth further investigation in other speech settings.

Table 1: Varying the number of decoder layers in the speech translation model. BLEU score on the Fisher/dev set.

    Num decoder layers D    1      2      3      4      5
    BLEU                    43.8   45.1   45.2   45.5   45.3

4.2. Tuning the multi-task model

We compare two multi-task training strategies: one-to-many, in which an encoder is shared between speech translation and recognition tasks, and many-to-one, in which a decoder is shared between speech and text translation tasks. We found the first strategy to perform better. We also found that performing updates more often on the speech translation task yields the best results. Specifically, we perform 75% of training steps on the core speech translation task, and the remainder on the auxiliary ASR task.

Finally, we vary how much of the encoder network parameters are shared across tasks. Intuitively, we expect that layers near the input will be less sensitive to the final classification task, so we always share all encoder layers through the conv LSTM but vary the amount of sharing in the final stack of LSTM layers. As shown in Table 2, we found that sharing all layers of the encoder yields the best performance. This suggests that the encoder learns to transform speech into a consistent interlingual subword unit representation, which the respective decoders are able to assemble into phrases in either language.

Table 2: Varying the number of shared encoder LSTM layers in the multi-task setting. BLEU score on the Fisher/dev set.

    Num shared encoder LSTM layers    3 (all)   2      1      0
    BLEU                              46.2      45.1   45.3   44.2

4.3. Baseline models

We construct a baseline cascade of a Spanish ASR seq2seq model whose output is passed into a Spanish-to-English NMT model. Our seq2seq ASR model attains state-of-the-art performance on the Fisher and Callhome datasets compared to previously reported results with HMM-GMM [19] and DNN-HMM [21] systems, as shown in Table 3.

Table 3: Speech recognition model performance in WER.

                         Fisher                 Callhome
                         dev    dev2   test    devtest   evltest
    Ours³                25.7   25.1   23.2    44.5      45.3
    Post et al. [19]     41.3   40.0   36.5    64.7      65.3
    Kumar et al. [21]    29.8   29.8   25.3    –         –
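The WER figures reported in Table 3 are the standard metric: word-level edit distance divided by reference length. A minimal dynamic-programming sketch (not the paper's evaluation code):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance between word sequences,
    divided by reference length. A minimal sketch for illustration."""
    r, h = ref.split(), hyp.split()
    # Standard edit-distance dynamic program over word positions.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("hace mucho tiempo", "hace poco tiempo"))   # 1 substitution / 3 words
```

BLEU, used for the translation results below, is instead computed with the Moses multi-bleu.pl script as described above.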
Performance on the Fisher task is significantly better than on Callhome, since Fisher contains more formal speech, consisting of conversations between strangers, while Callhome conversations were often between family members.

In contrast, our MT model slightly underperforms compared to previously reported results using phrase-based translation systems [19, 20, 21], as shown in Table 4. This may be because the amount of training data in the Fisher corpus is much smaller than is typically used for training NMT systems. Additionally, our models used characters as training targets instead of the word- and phrase-level tokens often used in machine translation systems, making them more vulnerable to e.g. spelling errors.

4.4. Speech translation

Table 5 compares the performance of different systems on the full speech translation task. Despite not having access to source language transcripts at any stage of training, the end-to-end model outperforms the baseline cascade, which passes the 1-best Spanish ASR output into the NMT model, by about 1.8 BLEU points on the Fisher/test set. We obtain an additional improvement of 1.4 BLEU points or more on all Fisher datasets in the multi-task configuration, in which the Spanish transcripts are used for additional supervision by sharing a single encoder sub-network across independent ASR and ST decoders.

Table 4: Translation BLEU score on ground truth transcripts.

                         Fisher                 Callhome
                         dev    dev2   test    devtest   evltest
    Ours                 58.7   59.9   57.9    28.2      27.9
    Post et al. [19]     –      –      58.7    –         27.8
    Kumar et al. [21]    –      65.4   62.9    –         –

Table 5: Speech translation model performance in BLEU score.

                             Fisher                 Callhome
    Model                    dev    dev2   test    devtest   evltest
    End-to-end ST³           46.5   47.3   47.3    16.4      16.6
    Multi-task ST / ASR³     48.3   49.1   48.7    16.8      17.4
    ASR → NMT cascade³       45.1   46.1   45.5    16.2      16.6
    Post et al. [19]         –      35.4   –       –         11.7
    Kumar et al. [21]        –      40.1   40.4    –         –

³ Averaged over three runs.
The ASR model converged after four days of training (1.5m steps), while the ST and multi-task models continued to improve, with the final 1.2 BLEU point improvement taking two more weeks.

Informal inspection of cascade system outputs yields many examples of compounding errors, where the ASR model makes an insertion or deletion that significantly alters the meaning of the sentence and the NMT model has no way to recover. This illustrates a key advantage of the end-to-end approach, in which the translation decoder can access the full latent representation of the speech without first collapsing it to an n-best list of hypotheses. A large performance gap of about 10 BLEU points remains between these results and those from Table 4, which assume perfect ASR, indicating significant room for improvement in the acoustic modeling component of the speech translation task.

5. Conclusion

We present a model that directly translates speech into text in a different language. One of its striking characteristics is that its architecture is essentially the same as that of an attention-based ASR neural system. Direct speech-to-text translation happens in the same computational footprint as speech recognition: the ASR and end-to-end ST models have the same number of parameters and utilize the same decoding algorithm, narrow beam search. The end-to-end trained model outperforms an ASR-MT cascade even though it never explicitly searches over transcriptions in the source language during decoding.

While we can interpret the proposed model's encoder and decoder networks, respectively, as acoustic and translation models, it does not have an explicit concept of source transcription. The two sub-networks exchange information as abstract high-dimensional real-valued vectors rather than as the discrete transcription lattices used in traditional systems.
In fact, reading out transcriptions in the source language from this abstract representation requires a separate decoder network. We find that jointly training decoder networks for multiple languages regularizes the encoder and improves overall speech translation performance. An interesting extension would be to construct a multilingual speech translation system following [34], in which a single decoder is shared across multiple languages, passing a discrete input token into the network to select the desired output language.

6. References

[1] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of ICLR, 2015.
[2] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[3] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[4] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of ICML, vol. 14, 2015, pp. 77–81.
[5] G. Gkioxari, A. Toshev, and N. Jaitly, "Chained predictions using convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 728–743.
[6] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton, "Grammar as a foreign language," in Advances in Neural Information Processing Systems, 2015, pp. 2773–2781.
[7] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint, 2016.
[8] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[9] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proceedings of ICASSP. IEEE, 2016, pp. 4960–4964.
[10] Y. Zhang, W. Chan, and N. Jaitly, "Very deep convolutional networks for end-to-end speech recognition," in Proceedings of ICASSP, 2017.
[11] J. Chorowski and N. Jaitly, "Towards better decoding and language model integration in sequence to sequence models," arXiv preprint arXiv:1612.02695, 2016.
[12] S. Bird, L. Gawne, K. Gelbart, and I. McAlister, "Collecting bilingual audio in remote indigenous communities," in Proceedings of COLING, 2014.
[13] A. Anastasopoulos, D. Chiang, and L. Duong, "An unsupervised probability model for speech-to-translation alignment of low-resource languages," arXiv preprint arXiv:1609.08139, 2016.
[14] F. Casacuberta, M. Federico, H. Ney, and E. Vidal, "Recent efforts in spoken language translation," IEEE Signal Processing Magazine, vol. 25, no. 3, 2008.
[15] H. Ney, "Speech translation: Coupling of recognition and translation," in Proceedings of ICASSP, vol. 1. IEEE, 1999, pp. 517–520.
[16] E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proceedings of Interspeech, 2005, pp. 3177–3180.
[17] E. Vidal, "Finite-state speech-to-speech translation," in Proceedings of ICASSP, vol. 1. IEEE, 1997, pp. 111–114.
[18] F. Casacuberta, H. Ney, F. J. Och, E. Vidal, J. M. Vilar, S. Barrachina, I. García-Varea, D. Llorens, C. Martínez, S. Molau et al., "Some approaches to statistical and finite-state speech-to-speech translation," Computer Speech & Language, vol. 18, no. 1, pp. 25–47, 2004.
[19] M. Post, G. Kumar, A. Lopez, D. Karakos, C. Callison-Burch, and S. Khudanpur, "Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus," in Proceedings of IWSLT, 2013.
[20] G. Kumar, M. Post, D. Povey, and S. Khudanpur, "Some insights from translating conversational telephone speech," in Proceedings of ICASSP. IEEE, 2014, pp. 3231–3235.
[21] G. Kumar, G. W. Blackwood, J. Trmal, D. Povey, and S. Khudanpur, "A coarse-grained model for optimal coupling of ASR and SMT systems for speech translation," in Proceedings of EMNLP, 2015, pp. 1902–1907.
[22] S. Bansal, H. Kamper, A. Lopez, and S. Goldwater, "Towards speech-to-text translation without speech recognition," arXiv preprint arXiv:1702.03856, Feb. 2017.
[23] L. Duong, A. Anastasopoulos, D. Chiang, S. Bird, and T. Cohn, "An attentional model for speech translation without transcription," in Proceedings of NAACL-HLT, 2016, pp. 949–959.
[24] A. Bérard, O. Pietquin, C. Servan, and L. Besacier, "Listen and translate: A proof of concept for end-to-end speech-to-text translation," in NIPS Workshop on End-to-end Learning for Speech and Audio Processing, 2016.
[25] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[26] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of ICML, 2015, pp. 448–456.
[27] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems, 2015, pp. 802–810.
[28] I. Bogun, A. Angelova, and N. Jaitly, "Object recognition from short videos for robotic perception," CoRR, vol. abs/1509.01602, 2015. [Online]. Available: http://arxiv.org/abs/1509.01602
[29] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[30] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of ICLR, 2015.
[31] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[32] M.-T. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, "Multi-task sequence to sequence learning," in Proceedings of ICLR, 2016.
[33] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[34] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado et al., "Google's multilingual neural machine translation system: Enabling zero-shot translation," arXiv preprint, 2016.
