Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling

Jaejin Cho¹‡, Murali Karthick Baskar²‡, Ruizhi Li¹‡, Matthew Wiesner¹, Sri Harish Mallidi³, Nelson Yalta⁴, Martin Karafiát², Shinji Watanabe¹, Takaaki Hori⁵

¹Johns Hopkins University, ²Brno University of Technology, ³Amazon, ⁴Waseda University, ⁵Mitsubishi Electric Research Laboratories (MERL)
{ruizhili,jcho52,shinjiw}@jhu.edu, {baskar,karafiat}@fit.vutbr.cz, thori@merl.com
‡ All three authors share equal contribution.

ABSTRACT

The sequence-to-sequence (seq2seq) approach to low-resource ASR is a relatively new direction in speech research. The approach benefits from model training without a lexicon or alignments; however, this poses the new problem of requiring more data than conventional DNN-HMM systems. In this work, we use data from 10 BABEL languages to build a multilingual seq2seq model as a prior model, and then port it towards 4 other BABEL languages using a transfer learning approach. We also explore different architectures for improving the prior multilingual seq2seq model. The paper further discusses the effect of integrating a recurrent neural network language model (RNNLM) with a seq2seq model during decoding. Experimental results show that transfer learning from the multilingual model yields substantial gains over monolingual models across all 4 BABEL languages. Incorporating an RNNLM also brings significant improvements in terms of %WER, achieving recognition performance comparable to models trained with twice as much training data.

Index Terms: Automatic speech recognition (ASR), sequence-to-sequence, multilingual setup, transfer learning, language modeling

1. INTRODUCTION

The sequence-to-sequence (seq2seq) model proposed in [1, 2, 3] is a neural architecture for sequence classification that was later adopted for speech recognition in [4, 5, 6]. The model integrates the main blocks of ASR, namely the acoustic model, the alignment model, and the language model, into a single framework. Recent ASR advances in connectionist temporal classification (CTC) [6, 5] and attention [4, 7] based approaches have created wide interest in seq2seq models in the speech community. However, matching or surpassing conventional hybrid RNN/DNN-HMM models with seq2seq models requires a large amount of data [8]. Intuitively, this is due to the wide-ranging role of the model, which must learn alignment and language modeling along with acoustic-to-character mapping.

In this paper, we explore the multilingual training approaches [9, 10, 11] used in hybrid DNN/RNN-HMMs and incorporate them into seq2seq models. In previous applications of multilingual approaches to seq2seq models, CTC has mainly been used instead of attention models. A multilingual CTC system is proposed in [12], which uses a universal phone set, an FST decoder, and a language model. The authors also use the learning hidden unit contributions (LHUC) [13] technique to rescale the hidden unit outputs for each language as a way of adapting to a particular language. Another work on multilingual CTC [14] shows the importance of language-adaptive vectors as auxiliary input to the encoder; the decoder used there is a simple argmax decoder.
An extensive analysis of multilingual CTC, focused mainly on improving performance under limited-data conditions, is performed in [15]; there, the authors integrate a word-level FST decoder with CTC during decoding.

On a similar front, attention models have been explored in a multilingual setup in [16, 17], where an attention-based seq2seq model is built from multiple languages. The data is simply pooled, assuming the target languages are seen during training; hence, no special transfer learning techniques are used there to address languages unseen during training.

The main motivation and contributions of this work are as follows:

• To incorporate existing multilingual approaches into a joint CTC-attention [18] (seq2seq) framework that uses a simple beam-search decoder, as described in Sections 2 and 4.
• To investigate the effectiveness of transferring a multilingual model to a target language under various data sizes. This is explained in Section 4.3.
• To tackle the low-resource data condition with both transfer learning and a character-based RNNLM trained on multiple languages. Section 4.4 explains this in detail.

2. SEQUENCE-TO-SEQUENCE MODEL

In this work, we use the attention-based approach [2], as it provides an effective methodology for sequence-to-sequence (seq2seq) training. Considering the limitations of attention in performing monotonic alignment [19, 20], we use the CTC loss function to aid the attention mechanism in both training and decoding. The basic network architecture is shown in Fig. 1.

Let X = (x_t | t = 1, ..., T) be a T-length speech feature sequence and C = (c_l | l = 1, ..., L) be an L-length grapheme sequence. The multi-objective learning framework L_MOL proposed in [18] is used in this work to unify the attention loss p_att(C|X) and the CTC loss p_ctc(C|X) with a linear interpolation weight λ, as follows:

    \mathcal{L}_{\mathrm{MOL}} = \lambda \log p_{\mathrm{ctc}}(C|X) + (1 - \lambda) \log p^{*}_{\mathrm{att}}(C|X)    (1)

The unified model yields both monotonicity and effective sequence-level training.

Fig. 1: Hybrid attention/CTC network with LM extension: the shared encoder (a deep CNN/VGG net followed by BLSTM layers) is trained by both the CTC and attention objectives simultaneously. The joint decoder predicts an output label sequence using the CTC, attention decoder, and RNN-LM.

p_att(C|X) represents the posterior probability of the character label sequence C given the input sequence X under the attention approach, decomposed with the probabilistic chain rule as follows:

    p^{*}_{\mathrm{att}}(C|X) \approx \prod_{l=1}^{L} p(c_l \mid c^{*}_1, \ldots, c^{*}_{l-1}, X),    (2)

where c*_l denotes the ground-truth history. The attention mechanism is described in detail later. Similarly, p_ctc(C|X) represents the posterior probability under the CTC approach:

    p_{\mathrm{ctc}}(C|X) \approx \sum_{Z \in \mathcal{Z}(C)} p(Z|X),    (3)

where Z = (z_t | t = 1, ..., T) is a CTC state sequence composed of the original grapheme set plus an additional blank symbol, and Z(C) is the set of all such sequences compatible with the character sequence C.

The following paragraphs explain the encoder, attention decoder, CTC, and joint decoding used in our approach.

Encoder:
In our approach, both CTC and attention use the same encoder function:

    h_t = \mathrm{Encoder}(X),    (4)

where h_t is the encoder output state at time t.
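Putting equations (1)-(4) together, below is a minimal PyTorch sketch of the multi-objective loss. It is an illustrative reconstruction under assumed tensor shapes and names, not the ESPnet implementation:

```python
import torch.nn.functional as F

def multi_objective_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths,
                         att_logits, att_targets, lam=0.5):
    """Linear interpolation of CTC and attention losses, cf. Eq. (1).

    ctc_log_probs: (T, B, V) log-probabilities from the shared encoder + CTC head.
    att_logits:    (B, L, V) decoder outputs scored against the ground-truth
                   history (teacher forcing), i.e. the p*_att term of Eq. (2).
    """
    # CTC marginalizes over all alignments Z(C), cf. Eq. (3).
    loss_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths)
    # Attention loss is per-label cross-entropy under the chain rule, cf. Eq. (2).
    loss_att = F.cross_entropy(att_logits.transpose(1, 2), att_targets)
    # lam corresponds to the interpolation weight lambda ("MOL" = 0.5 in Table 2).
    return lam * loss_ctc + (1.0 - lam) * loss_att
```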
As the encoder function Encoder(·), we use bidirectional LSTMs (BLSTM) or a deep CNN followed by BLSTMs. Convolutional neural networks (CNNs) have achieved great success in image recognition [21], and previous studies applying CNNs to seq2seq speech recognition [22] showed that a deep CNN in the encoder can further boost performance. In this work, we investigate the effect of convolutional layers in the joint CTC-attention framework in a multilingual setting. We use the initial 6 layers of the VGG net architecture [21], detailed in Table 2(a). Each speech feature sequence initially forms a single feature map; the VGG net then extracts 128 feature maps, each downsampled to (1/4 × 1/4) of the input along the time-frequency axes via the two max-pooling layers with stride 2.

Attention Decoder:
The location-aware attention mechanism [23] is used in this work. Equation (5) denotes the output of location-aware attention, where a_lt acts as an attention weight:

    a_{lt} = \mathrm{LocationAttention}\left(\{a_{l-1,t}\}_{t=1}^{T}, q_{l-1}, h_t\right).    (5)

Here, q_{l-1} denotes the decoder hidden state and h_t is the encoder output state from equation (4). The location attention function applies a convolution * as in equation (6), mapping the attention weights of the previous label, a_{l-1}, to a multi-channel view f_t for a better representation:

    f_t = K * a_{l-1},    (6)
    e_{lt} = g^{\top} \tanh\left(\mathrm{Lin}(q_{l-1}) + \mathrm{Lin}(h_t) + \mathrm{LinB}(f_t)\right),    (7)
    a_{lt} = \mathrm{Softmax}\left(\{e_{lt}\}_{t=1}^{T}\right).    (8)

Equation (7) gives the unnormalized attention scores, computed with a learnable vector g, a linear transformation Lin(·), and an affine transformation LinB(·). Equation (8) normalizes the attention weights with the softmax operation Softmax(·). Finally, the context vector r_l is obtained as the weighted sum of the encoder output states over all frames:

    r_l = \sum_{t=1}^{T} a_{lt} h_t.    (9)

The decoder function is an LSTM layer that predicts the next character label c_l from the previous label c_{l-1}, the decoder hidden state q_{l-1}, and the attention output r_l:

    p(c_l \mid c_1, \ldots, c_{l-1}, X) = \mathrm{Decoder}(r_l, q_{l-1}, c_{l-1}).    (10)

This equation is applied incrementally to form p*_att in equation (2).
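A sketch of the location-aware attention of equations (5)-(9), written for a single utterance for readability. The layer sizes follow Table 2, but the module itself is an assumed reimplementation rather than the authors' code:

```python
import torch
import torch.nn as nn

class LocationAwareAttention(nn.Module):
    def __init__(self, enc_dim=320, dec_dim=300, att_dim=320, n_filters=10, kernel=100):
        super().__init__()
        self.lin_q = nn.Linear(dec_dim, att_dim, bias=False)  # Lin(q_{l-1})
        self.lin_h = nn.Linear(enc_dim, att_dim, bias=False)  # Lin(h_t)
        self.lin_f = nn.Linear(n_filters, att_dim)            # LinB(f_t), with bias
        self.conv = nn.Conv1d(1, n_filters, kernel, padding=kernel // 2)  # K * a_{l-1}, Eq. (6)
        self.g = nn.Linear(att_dim, 1, bias=False)            # learnable vector g

    def forward(self, h, q_prev, a_prev):
        # h: (T, enc_dim), q_prev: (dec_dim,), a_prev: (T,) previous attention weights
        T = h.size(0)
        f = self.conv(a_prev.view(1, 1, -1))[:, :, :T]        # multi-channel view f_t
        f = f.squeeze(0).transpose(0, 1)                      # (T, n_filters)
        # Eq. (7): unnormalized scores e_{lt}
        e = self.g(torch.tanh(self.lin_q(q_prev) + self.lin_h(h) + self.lin_f(f)))
        a = torch.softmax(e.squeeze(-1), dim=0)               # Eq. (8)
        r = (a.unsqueeze(1) * h).sum(dim=0)                   # context vector, Eq. (9)
        return r, a

# Toy usage: 200 encoder frames, uniform initial attention.
att = LocationAwareAttention()
h = torch.randn(200, 320)
r, a = att(h, torch.zeros(300), torch.full((200,), 1.0 / 200))
```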
Connectionist temporal classification (CTC):
Unlike the attention approach, CTC does not use a specific decoder. Instead, it invokes two components to perform character-level training and decoding. The first component is an RNN-based encoding module p(Z|X); the second contains a language model and state transition module. The CTC formalism is a special case [6, 24] of the hybrid DNN-HMM framework, with Bayes' rule applied to obtain p(C|X).

Joint decoding:
Once the CTC and attention-based seq2seq models are trained, both are used jointly for decoding as follows:

    \log p_{\mathrm{hyp}}(c_l \mid c_1, \ldots, c_{l-1}, X) = \alpha \log p_{\mathrm{ctc}}(c_l \mid c_1, \ldots, c_{l-1}, X) + (1 - \alpha) \log p_{\mathrm{att}}(c_l \mid c_1, \ldots, c_{l-1}, X)    (11)

Here, log p_hyp is the final score used during beam search, and α controls the weight between the attention and CTC models. α and the multi-task learning weight λ in equation (1) are set differently in our experiments.

Table 1: Details of the BABEL data used for the multilingual experiments

Usage    Language      Train               Eval                # of
                       # spkrs   # hours   # spkrs   # hours   characters
Train    Cantonese       952     126.73      120     17.71     3302
         Bengali         720      55.18      121      9.79       66
         Pashto          959      70.26      121      9.95       49
         Turkish         963      68.98      121      9.76       66
         Vietnamese      954      78.62      120     10.90      131
         Haitian         724      60.11      120     10.63       60
         Tamil           724      62.11      121     11.61       49
         Kurdish         502      37.69      120     10.21       64
         Tokpisin        482      35.32      120      9.88       55
         Georgian        490      45.35      120     12.30       35
Target   Assamese        720      54.35      120      9.58       66
         Tagalog         966      44.00      120     10.60       56
         Swahili         491      40.00      120     10.58       56
         Lao             733      58.79      119     10.50       54

3. DATA DETAILS AND EXPERIMENTAL SETUP

In this work, the experiments are conducted using the BABEL speech corpus, collected under the IARPA Babel program. The corpus is mainly composed of conversational telephone speech (CTS), but some scripted and far-field recordings are present as well. Table 1 gives the details of the languages used in this work for training and evaluation.

80-dimensional Mel-filterbank (fbank) features are extracted from the speech samples using a sliding window of size 25 ms with a 10 ms stride; the Kaldi toolkit [25] is used for the feature processing. The fbank features are then fed to a seq2seq model with the configuration given in Table 2. The Bi-RNN [26] models mentioned there use LSTM [27] cells followed by a projection layer (BLSTMP).

In the experiments below, we use only a character-level seq2seq model trained with the CTC and attention decoder; we therefore report character error rate (%CER) as the measure of model performance. In Section 4.4, however, we integrate a character-level RNNLM [28] with the seq2seq model externally and report performance in terms of word error rate (%WER). In that case, words are obtained by concatenating the characters and spaces for scoring against the reference words. All experiments are implemented in ESPnet, an end-to-end speech processing toolkit [29].
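As a reference point for the front end described above, the same 80-dimensional fbank features can be approximated with torchaudio's Kaldi-compatible functions; the paper itself used Kaldi, and the file name below is a placeholder:

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# "utt.wav" is a placeholder; BABEL audio is conversational telephone speech.
waveform, sample_rate = torchaudio.load("utt.wav")

# 80 Mel bins, 25 ms window, 10 ms shift, matching the paper's configuration.
fbank = kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    num_mel_bins=80,
    frame_length=25.0,
    frame_shift=10.0,
)
print(fbank.shape)  # (num_frames, 80)
```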
4. MULTILINGUAL EXPERIMENTS

Multilingual approaches used in hybrid RNN/DNN-HMM systems [11] have been applied to tackle low-resource data conditions. These approaches include language-adaptive training and shared-layer retraining [30]; among them, the most beneficial is the parameter sharing technique [11]. To incorporate this approach into the encoder, CTC, and attention decoder, we performed the following experiments:

• Stage 0 - Naive training combining all languages.
• Stage 1 - Retraining the decoder (both CTC and attention) after initializing with the multilingual model from stage 0.
• Stage 2 - Further retraining the model obtained from stage 1 across both encoder and decoder.

Table 2: Experiment details

Model Configuration
  Encoder                Bi-RNN
  # encoder layers       5
  # encoder units        320
  # projection units     320
  Decoder                Bi-RNN
  # decoder layers       1
  # decoder units        300
  # projection units     300
  Attention              Location-aware
  # feature maps         10
  # window size          100

Training Configuration
  MOL (λ)                5e-1
  Optimizer              AdaDelta
  Initial learning rate  1.0
  AdaDelta ε             1e-8
  AdaDelta ε decay       1e-2
  Batch size             30

Decoding Configuration
  Beam size              20
  ctc-weight (α)         3e-1

(a) Convolutional layers in joint CTC-attention

CNN Model Configuration (2 components)
  Component 1 (2 convolution layers)
    Convolution 2D   in = 1,   out = 64,  filter = 3 × 3
    Convolution 2D   in = 64,  out = 64,  filter = 3 × 3
    Maxpool 2D       patch = 2 × 2, stride = 2 × 2
  Component 2 (2 convolution layers)
    Convolution 2D   in = 64,  out = 128, filter = 3 × 3
    Convolution 2D   in = 128, out = 128, filter = 3 × 3
    Maxpool 2D       patch = 2 × 2, stride = 2 × 2

4.1. Stage 0 - Naive approach

In this approach, a single seq2seq model is first trained on the 10 training languages listed in Table 1, amounting to roughly 600 hours of training data. The model is trained with a character label set composed of the characters of all train and target languages in Table 1, built as sketched below. This model provides better generalization across languages: languages with limited data, when trained together with other languages, become more robust, which helps recognition performance. In spite of its simplicity, the approach has the limitation that the target language data cannot be kept unseen during training.
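The shared label set can be built by merging the character inventories of all train and target languages. A hedged sketch, where `transcripts_by_language` and the special symbols are assumptions rather than the paper's exact recipe:

```python
def build_char_vocab(transcripts_by_language):
    """Merge the character inventories of all languages (Table 1) into one
    label set, plus CTC blank and sequence-delimiter symbols."""
    chars = set()
    for lang, transcripts in transcripts_by_language.items():
        for line in transcripts:
            chars.update(line)
    specials = ["<blank>", "<sos/eos>", "<unk>", "<space>"]
    vocab = specials + sorted(chars)
    return {c: i for i, c in enumerate(vocab)}

# Toy usage:
vocab = build_char_vocab({"turkish": ["merhaba"], "swahili": ["jambo"]})
```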
Table 3: Recognition performance (%CER) of the naive multilingual approach on the eval sets of the 10 BABEL training languages, trained on the train sets of the same languages

                         Bengali  Cantonese  Georgian  Haitian  Kurmanji  Pashto  Tamil  Turkish  Tokpisin  Vietnamese
Monolingual - BLSTMP      43.4     37.4       35.4      39.7     55.0     37.3    55.3    50.3     32.7      54.3
Multilingual - BLSTMP     42.9     36.3       38.9      38.5     52.1     39.0    48.5    36.4     31.7      41.0
  + VGG                   39.6     34.3       36.0      34.5     49.9     34.7    45.5    28.7     33.7      37.4

Comparison of VGG-BLSTM and BLSTMP:
Table 3 shows the recognition performance of the naive multilingual approach using the BLSTMP and VGG models against a monolingual model trained with BLSTMP. The results clearly indicate that a better architecture such as VGG-BLSTM helps multilingual performance: excluding Pashto, Georgian, and Tokpisin, the multilingual VGG-BLSTM model gave an average absolute gain of 8.8% over the monolingual model. For the multilingual BLSTMP, excluding Pashto and Georgian, an average absolute gain of 5.0% is observed over the monolingual model. Even though the VGG-BLSTM gave improvements, we were unable to perform stage-1 and stage-2 retraining with it due to time constraints; we therefore proceed with the multilingual BLSTMP model for the retraining experiments below.

4.2. Stage 1 - Retraining the decoder only

To alleviate the limitation of the previous approach, the final layers of the seq2seq model, which are mainly responsible for classification, are retrained towards the target language. In previous works on hybrid DNN/RNN models [11, 30] and CTC-based models [12, 15], only the softmax layer is adapted. In our case, however, both the attention decoder and the CTC layer have to be retrained towards the target language; only these layers receive gradient updates during this stage. We found that an SGD optimizer with an initial learning rate of 1e-4 works better for retraining than AdaDelta. The learning rate is decayed by a factor of 1e-1 whenever the validation accuracy drops. Table 4 shows the performance of simply retraining the last layers for a single target language, Assamese.

Table 4: Comparison of the naive approach and retraining only the last layers, using the Assamese language

Model type                  Retraining   %CER   % Absolute gain
Monolingual                 -            45.6   -
Multi. (after 4th epoch)    Stage 1      61.3   -15.7
Multi. (after 4th epoch)    Stage 2      44.0   1.6
Multi. (after 15th epoch)   Stage 2      41.3   4.3

4.3. Stage 2 - Fine-tuning both encoder and decoder

From the stage-1 results in Section 4.2, we found that simply retraining the decoder towards a target language degrades the %CER from 45.6 to 61.3. This is mainly due to the mismatch in distribution between encoder and decoder. To alleviate this mismatch, the encoder and decoder are retrained (fine-tuned) together, starting from the stage-1 model. The optimizer is again SGD as in stage 1, but the initial learning rate is set to 1e-2 and decayed based on validation performance. The resulting model gave an absolute gain of 1.6% when fine-tuning a multilingual model after the 4th epoch, and an absolute gain of 4.3% when fine-tuning a model after the 15th epoch.

To further investigate the performance of this approach across different target data sizes, we split the train set into ~5 hours, ~10 hours, ~20 hours, and the full set. Since the model is only fine-tuned from the stage-1 initialization, the architecture is fixed for all data sizes. Figure 2 shows the effectiveness of fine-tuning both encoder and decoder: the gains from 5 to 10 hours were larger than those from 20 hours to the full set.

Fig. 2: Performance when 5 hours, 10 hours, 20 hours, and the full set of target language data are used to retrain the multilingual model from stage 1.

Table 5: Stage-2 retraining across all languages with the full set of target language data

%CER on target language eval sets
                     Assamese  Tagalog  Swahili  Lao
Monolingual            45.6      43.1     33.1   42.1
Stage-2 retraining     41.3      37.9     29.1   38.7

Table 5 shows the %CER obtained by retraining the stage-1 model with the full set of target language data; an absolute gain from stage-2 retraining is observed across all languages compared to the monolingual model.
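In PyTorch terms, the stage-1 and stage-2 recipes above amount to controlling which parameters receive gradients and switching the optimizer to SGD. The `JointModel` stand-in below is an assumption about the model structure (including its layer shapes), not ESPnet internals:

```python
import torch
import torch.nn as nn

# A stand-in with the same structure as the joint model: shared encoder plus
# CTC and attention-decoder heads. Shapes are illustrative only; the output
# size 1000 is a placeholder for the merged character set.
class JointModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(80, 320, num_layers=5, bidirectional=True)
        self.ctc = nn.Linear(640, 1000)        # CTC head
        self.decoder = nn.LSTMCell(640, 300)   # attention decoder cell

model = JointModel()  # in practice: initialized from the stage-0 multilingual model

# Stage 1: freeze the shared encoder; only CTC and attention decoder get gradients.
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-4)  # SGD beat AdaDelta for retraining

# Decay the lr by a factor of 1e-1 whenever validation accuracy drops (Section 4.2);
# call scheduler.step(validation_accuracy) after each epoch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=0
)

# Stage 2 would then unfreeze the encoder and restart SGD with lr=1e-2.
```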
4.4. Multilingual RNNLM

In an ASR system, the language model (LM) plays an important role by incorporating external knowledge into the system. Conventional ASR systems combine an LM with the acoustic model via an FST, giving a large performance gain, and the same trend holds in general, including for hybrid ASR systems and neural sequence-to-sequence ASR systems. The following experiments show the benefit of using a language model in decoding with the stage-2 transferred models. Although gains in %CER are also generally observed over all target languages, the improvement in %WER is more distinctive; the results shown in Fig. 3 are therefore in %WER. "Whole" in each plot means all the available data for the target language, i.e., the full set explained before.

Fig. 3: Recognition performance in %WER after integrating the RNNLM during decoding, for different amounts of target language data (one panel per target language).

We used a character-level RNNLM, trained as a 2-layer LSTM on character sequences. We use all available paired text in the corresponding target language to train the LM for that language; no external text data were used, and all language models are trained separately from the seq2seq models. When building the dictionary, we combined the characters of all languages in Table 1 so that the LMs work with the transferred models. Regardless of the amount of data used for transfer learning, the RNNLM provides consistent gains across all languages and data sizes.

Table 6: Recognition performance in %WER using stage-2 retraining and the multilingual RNNLM

%WER on target languages
                     Assamese  Tagalog  Swahili  Lao
Stage-2 retraining     71.9      71.4     66.2   62.4
  + Multi. RNNLM       65.3      64.3     56.2   57.9

As explained above, the language models are trained separately and used to decode jointly with the seq2seq models. The intuition is that the separately trained language model acts as a complementary component to the implicit language model inside the seq2seq decoder. The RNNLM assists decoding according to the equation below:

    \log p(c_l \mid c_{1:l-1}, X) = \log p_{\mathrm{hyp}}(c_l \mid c_{1:l-1}, X) + \beta \log p_{\mathrm{lm}}(c_l \mid c_{1:l-1})    (12)

Here, β is a scaling factor that combines the joint decoding score of equation (11) with the RNN-LM score p_lm. This approach is called shallow fusion.

Our experiments on the target languages show that the gains from adding the RNNLM are consistent regardless of the amount of data used for transfer learning; in other words, the gap between the two curves in Fig. 3 is almost constant across all languages. The gain from adding the RNN-LM in decoding is also large: for Assamese, for example, decoding with the RNN-LM and a model retrained on 5 hours of target language data is almost comparable to the stage-2 model retrained on 20 hours of target language data. On average, an absolute gain of ~6% is obtained across all target languages, as noted in Table 6.
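One scoring step of equations (11) and (12), as it would appear inside the beam search. The function and its defaults are illustrative: α = 0.3 matches the ctc-weight in Table 2, while the RNNLM weight β is not reported in the paper and is an assumption here:

```python
import torch

def shallow_fusion_score(logp_ctc, logp_att, logp_lm, alpha=0.3, beta=0.3):
    """Per-step hypothesis score over the vocabulary.

    logp_ctc, logp_att, logp_lm: (V,) log-probabilities for the next character
    given the hypothesis prefix c_{1:l-1} (and X for the two ASR scores).
    """
    logp_hyp = alpha * logp_ctc + (1.0 - alpha) * logp_att  # Eq. (11)
    return logp_hyp + beta * logp_lm                        # Eq. (12), shallow fusion

# Toy usage: score the next character for one beam hypothesis.
V = 50
scores = shallow_fusion_score(torch.randn(V).log_softmax(0),
                              torch.randn(V).log_softmax(0),
                              torch.randn(V).log_softmax(0))
best_next = scores.argmax()
```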
5. CONCLUSION

In this work, we have shown the importance of transfer learning approaches such as stage-2 multilingual retraining in a seq2seq model setting. The careful selection of train and target languages from BABEL also provides a wide variety of recognition performance (%CER) and helps in understanding the efficacy of the seq2seq model. The experiments using a character-based RNNLM showed the importance of the language model in boosting recognition performance (%WER) over all amounts of target data available for transfer learning. Tables 5 and 6 summarize the effect of these techniques in terms of %CER and %WER. These methods are also flexible enough to be incorporated into attention- and CTC-based seq2seq models without loss in performance.

6. FUTURE WORK

We could use better architectures such as VGG-BLSTM as the multilingual prior model before transferring it to a new target language via stage-2 retraining. The naive multilingual approach could be improved by including language vectors as input or targets during training to reduce confusion between languages. Investigating multilingual bottleneck features [31] for seq2seq models may also provide better performance. Beyond the character-level language model used in this work, a word-level RNNLM could be connected during decoding to further improve %WER, and the attention-based decoder could be aided by an RNNLM via the cold fusion approach during training to obtain a better-trained model. In the near future, we will incorporate all the above techniques to reach performance comparable with state-of-the-art hybrid DNN/RNN-HMM systems.

7. REFERENCES

[1] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint, 2014.
[3] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[4] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[5] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in ICML, 2014, vol. 14, pp. 1764–1772.
[6] Alex Graves, "Supervised sequence labelling," in Supervised Sequence Labelling with Recurrent Neural Networks, pp. 5–13. Springer, 2012.
[7] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
[8] Andrew Rosenberg, Kartik Audhkhasi, Abhinav Sethy, Bhuvana Ramabhadran, and Michael Picheny, "End-to-end speech recognition and keyword search on low-resource languages," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5280–5284.
[9] Laurent Besacier, Etienne Barnard, Alexey Karpov, and Tanja Schultz, "Automatic speech recognition for under-resourced languages: A survey," Speech Communication, vol. 56, pp. 85–100, 2014.
[10] Zoltan Tuske, David Nolden, Ralf Schluter, and Hermann Ney, "Multilingual MRASTA features for low-resource keyword search and speech recognition systems," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7854–7858.
[11] Martin Karafiát, Murali Karthick Baskar, Pavel Matějka, Karel Veselý, František Grézl, and Jan Černocký, "Multilingual BLSTM and speaker-specific vector adaptation in 2016 BUT Babel system," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 637–643.
[12] Sibo Tong, Philip N. Garner, and Hervé Bourlard, "Multilingual training and cross-lingual adaptation on CTC-based acoustic model," arXiv preprint, 2017.
[13] Pawel Swietojanski and Steve Renals, "Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models," in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 171–176.
[14] Markus Müller, Sebastian Stüker, and Alex Waibel, "Language adaptive multilingual CTC speech recognition," in International Conference on Speech and Computer. Springer, 2017, pp. 473–482.
[15] Siddharth Dalmia, Ramon Sanabria, Florian Metze, and Alan W. Black, "Sequence-based multi-lingual low resource speech recognition," arXiv preprint, 2018.
[16] Shinji Watanabe, Takaaki Hori, and John R. Hershey, "Language independent end-to-end architecture for joint language identification and speech recognition," in Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017, pp. 265–271.
[17] Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, and Kanishka Rao, "Multilingual speech recognition with a single end-to-end model," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[18] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[19] Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, and Alex Waibel, "Self-attentional acoustic models," in 19th Annual Conference of the International Speech Communication Association (Interspeech 2018), 2018.
[20] Chung-Cheng Chiu and Colin Raffel, "Monotonic chunkwise attention," CoRR, vol. abs/1712.05382, 2017.
[21] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[22] Ying Zhang, Mohammad Pezeshki, Philémon Brakel, Saizheng Zhang, César Laurent, Yoshua Bengio, and Aaron Courville, "Towards end-to-end speech recognition with deep convolutional neural networks," in Interspeech, 2016.
[23] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, vol. 2015-January, pp. 577–585.
[24] Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan, "Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM," arXiv preprint arXiv:1706.02737, 2017.
[25] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in Automatic Speech Recognition and Understanding, 2011 IEEE Workshop on. IEEE, 2011, pp. 1–4.
[26] Mike Schuster and Kuldip K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[27] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[28] Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukas Burget, and Jan Cernocky, "RNNLM - recurrent neural network language modeling toolkit," in Proc. of the 2011 ASRU Workshop, 2011, pp. 196–201.
[29] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., "ESPnet: End-to-end speech processing toolkit," arXiv preprint arXiv:1804.00015, 2018.
[30] Sibo Tong, Philip N. Garner, and Hervé Bourlard, "An investigation of deep neural networks for multilingual speech recognition training and adaptation," Tech. Rep., 2017.
[31] František Grézl, Martin Karafiát, and Karel Veselý, "Adaptation of multilingual stacked bottle-neck neural network structure for new language," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.
