Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation
End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compo…
Authors: Ye Jia, Melvin Johnson, Wolfgang Macherey
Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J. Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, Yonghui Wu
Google Research
{jiaye,melvinp}@google.com

ABSTRACT

End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora of speech and translated transcript pairs. Previous studies have proposed the use of pre-trained components and multi-task learning in order to benefit from weakly supervised training data, such as speech-to-transcript or text-to-foreign-text pairs. In this paper, we demonstrate that using pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly supervised data into speech-to-translation pairs for ST training can be more effective than multi-task learning. Furthermore, we demonstrate that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance. Finally, we discuss methods for avoiding overfitting to synthetic speech with a quantitative ablation study.

Index Terms: Speech translation, sequence-to-sequence model, weakly supervised learning, synthetic training data.

1. INTRODUCTION

Recent advances in deep learning and more specifically in sequence-to-sequence modeling have led to dramatic improvements in ASR [1, 2] and MT [3, 4, 5, 6, 7] tasks. These successes naturally led to attempts to construct end-to-end speech-to-text translation systems as a single neural network [8, 9].
Such end-to-end systems have advantages over a traditional cascaded system that performs ASR and MT consecutively in that they 1) naturally avoid compounding errors between the two systems; 2) can directly utilize prosodic cues from speech to improve translation; 3) have lower latency by avoiding inference with two models; and 4) use less memory and fewer computational resources.

However, training such an end-to-end ST model typically requires a large set of parallel speech-to-translation training data. Obtaining such a large dataset is significantly more expensive than acquiring data for ASR and MT tasks. This is often a limiting factor for the performance of such end-to-end systems. Recently explored techniques to mitigate this issue include multi-task learning [9, 10] and pre-trained components [11] in order to utilize weakly supervised data, i.e. speech-to-transcript or text-to-translation pairs, in contrast to fully supervised speech-to-translation pairs.

Although multi-task learning has been shown to bring significant quality improvements to end-to-end ST systems, it has two constraints which limit the performance of the trained ST model: 1) the shared components have to compromise between multiple tasks, which can limit their performance on individual tasks; 2) for each training example, the gradients are calculated for a single task, so the parameters are updated independently for each task, which may lead to a sub-optimal solution for the entire multi-task optimization problem.

In this paper, we train end-to-end ST models on much larger datasets than previous work, spanning up to 100 million training examples, including 1.3K hours of translated speech and 49K hours of transcribed speech. We confirm that multi-task learning and pre-training are still beneficial at such a large scale.
We demonstrate that performance of our end-to-end ST system can be significantly improved, even outperforming multi-task learning, by using a large amount of data synthesized from weakly supervised data such as typical ASR or MT training sets. Similarly, we show that it is possible to train a high-quality end-to-end ST model without any fully supervised training data by leveraging pre-trained components and data synthesized from weakly supervised datasets. Finally, we demonstrate that data synthesized from fully unsupervised monolingual datasets can be used to improve end-to-end ST performance.

2. RELATED WORK

Early work on speech translation typically used a cascade of an ASR model and an MT model [12, 13, 14], giving the MT model access to the predicted probabilities and uncertainties from the ASR. Recent work has focused on training end-to-end ST in a single model [8, 9].

In order to utilize both fully supervised data and also weakly supervised data, [9, 10] use multi-task learning to train the ST model jointly with the ASR and/or the MT model. By doing so, both of them achieved better performance with the end-to-end model than the cascaded model. [11] conducts experiments on a larger 236 hour English-to-French dataset and pre-trains the encoder and decoder prior to multi-task learning, which further improves performance. However, the end-to-end model performs worse than the cascaded model in that work. [15] shows that pre-training a speech encoder on one language can improve ST quality on a different source language.

Using TTS synthetic data for training speech translation was a requirement when no direct parallel training data is available, such as in [8, 16]. In contrast, we show that even when a large fully supervised training set is available, using synthetic training data from a high quality multi-speaker TTS system can further improve the performance of an end-to-end ST model.
Synthetic data has also been used to improve ASR performance. [17] builds a cycle chain between TTS and ASR models, in which the output from one model is used to help the training of the other. Instead of using TTS, [18] synthesizes repeated phoneme sequence from unlabeled text to mimic temporal duration in acoustic input. Similarly, [19] trains a pseudo-TTS model to synthesize the latent representation from a pre-trained ASR model from text, and uses it for data augmentation in ASR training.

The MT synthetic data in this work helps the system in a manner similar to knowledge distillation [20], since the network is trained to predict outputs from a pretrained MT model. In contrast, synthesizing speech inputs using TTS is more similar to MT back-translation [21].

[Figure 1 (diagram labels): English speech; Bidi LSTM × 5 (pre-trained ASR encoder); Bidi LSTM × 3; ST encoder; 8-head attention; LSTM × 8 (pre-trained MT decoder); ST decoder; Spanish text.]
Fig. 1. Overview of the end-to-end speech translation model. Blue blocks correspond to pre-trained components, grey components are frozen, and green components are fine-tuned on the ST task.

3. MODELS

Similar to [9], we make use of three sequence-to-sequence models. Each one is composed of an encoder, a decoder, and an attention module. Besides the end-to-end ST model which is the major focus of this paper, we also build an ASR model and an MT model, which are used for building the baseline cascaded ST model, as well as for multi-task learning and encoder / decoder pre-training for ST. All three models represent text using the same shared English/Spanish Word Piece Model (WPM) [22] containing 16K tokens.

ASR model: Our ASR model follows the architecture of [2]. We use a 5 layer bidirectional LSTM encoder, with cell size 1024. The decoder is a 2 layer unidirectional LSTM with cell size 1024. The attention is 4-head additive attention. The model takes 80-channel log mel spectrogram features as input.
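As a compact restatement, the ASR hyperparameters listed above can be collected in a small configuration record. This is only an illustrative sketch: the class name, the field names, and the constant `ASR_CONFIG` are hypothetical, the vocabulary size is taken as a round 16,000 because only "16K" is given, and the helper assumes that the forward and backward LSTM outputs are concatenated, which the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seq2SeqConfig:
    """Hyperparameters of one attention-based encoder-decoder model (illustrative names)."""

    input_dim: int        # feature channels per input frame
    encoder_layers: int   # number of bidirectional LSTM layers
    decoder_layers: int   # number of unidirectional LSTM layers
    cell_size: int        # LSTM cell size
    attention_heads: int  # number of additive-attention heads
    vocab_size: int       # size of the shared word-piece vocabulary

    def encoder_output_dim(self) -> int:
        # Assumption: the two LSTM directions are concatenated.
        return 2 * self.cell_size


# Values stated in the ASR model paragraph; "16K" is approximated as 16,000.
ASR_CONFIG = Seq2SeqConfig(
    input_dim=80,
    encoder_layers=5,
    decoder_layers=2,
    cell_size=1024,
    attention_heads=4,
    vocab_size=16_000,
)
```

The same record type could describe the other two models by changing the field values.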
MT model: Our MT model follows the architecture of [23]. We use a 6 layer bidirectional LSTM encoder, with cell size of 1024. The decoder is an 8 layer unidirectional LSTM with cell size 1024, with residual connections across layers. We use 8-head additive attention.

ST model: The encoder has a similar architecture to the ASR encoder, and the decoder has a similar architecture to the MT decoder. Throughout this work we experiment with varying the number of encoder layers. The model with the best performance is visualized in Figure 1. It uses an 8 layer bidirectional LSTM for the encoder and an 8 layer unidirectional LSTM with residual connections for the decoder. The attention is 8-head additive attention, following the MT model.

4. SYNTHETIC TRAINING DATA

Acquiring large-scale parallel speech-to-translation training data is extremely expensive. The scarcity of such data is often a limiting factor on the quality of an end-to-end ST model. To overcome this issue, we make use of two forms of weakly supervised data: we synthesize input speech corresponding to the input text in a parallel text MT training corpus, and we synthesize translated text targets from the output transcript in an ASR training corpus.

4.1. Synthesis with TTS model

Recent TTS systems are able to synthesize speech with close to human naturalness [24], in varied speakers' voices [25], and to create novel voices by sampling from a continuous speaker embedding space [26]. In this work, we use the TTS model trained on LibriSpeech [27] from [26], except that we use a Griffin-Lim [28] vocoder as in [29], which has significantly lower cost but results in reduced audio quality (see footnote 1). We randomly sample from the continuous speaker embedding space for each synthesized example, resulting in wide diversity in speaker voices in the synthetic data.
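A minimal sketch of drawing a fresh point from a continuous speaker embedding space for each synthesized example. It rests on an assumption that this paper does not state: that speaker embeddings are unit-length vectors, so a new "speaker" can be drawn uniformly from the unit hypersphere by normalizing a Gaussian sample. The function name and the default dimension of 256 are illustrative only.

```python
import math
import random
from typing import List, Optional


def sample_speaker_embedding(dim: int = 256,
                             rng: Optional[random.Random] = None) -> List[float]:
    """Draw a random unit-length vector to use as a synthetic speaker embedding."""
    rng = rng or random.Random()
    # A normalized isotropic Gaussian sample is uniformly distributed on the sphere.
    vector = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# One new embedding per synthesized utterance, so no single voice dominates.
embeddings = [sample_speaker_embedding(rng=random.Random(i)) for i in range(3)]
```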
Sampling a new speaker for each example avoids unintentional bias toward a few synthetic speakers when using the data to train an ST model and encourages generalization to speakers outside the training set.

4.2. Synthesis with MT model

Another way to synthesize training data is to use an MT model to translate the transcripts in an ASR training set into the target language. In this work, we use the Google Translate service to obtain such translations. This procedure is similar to knowledge distillation [20], except that it uses the final predictions as training targets rather than the predicted probability distributions.

5. EXPERIMENTS

5.1. Datasets and metrics

We focus on an English speech to Spanish text conversational speech translation task. Our experiments make use of three proprietary datasets, all consisting of conversational language: (1) an MT training set of 70M English-Spanish parallel sentences; (2) an ASR training set of 29M hand-transcribed English utterances, collected from anonymized voice search logs; and (3) a substantially smaller English-to-Spanish speech-to-translation set of 1M utterances obtained by sampling a subset from the 70M MT set and crowd-sourcing human readers for the English sentences. The last of these datasets can be directly used to train the end-to-end ST model. We use data augmentation on both speech corpora by adding varying degrees of background noise and reverberation in the same manner as [2]. The WPM shared among all models is trained with the 70M MT set.

We use two datasets for evaluation: a held out subset of 10.8K examples from the 1M ST set, which contains read speech, and another 8.9K recordings of natural conversational speech in a domain different from both the 70M MT and 29M ASR sets. Both eval sets contain English speech, English transcript and Spanish translation triples, so they can be used for evaluating either ASR, MT, or ST.
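The two synthesis routes of Section 4 can be sketched as simple generators that turn weakly supervised pairs into speech-to-translation pairs. This is a sketch under stated assumptions, not the pipeline used in the paper: `tts` and `translate` stand for whatever TTS and MT systems are available, and `toy_tts` and `toy_translate` are hypothetical stand-ins so that the example runs on its own.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Audio = List[float]  # stand-in type for a waveform or spectrogram


def st_pairs_from_mt(mt_pairs: Iterable[Tuple[str, str]],
                     tts: Callable[[str], Audio]) -> Iterator[Tuple[Audio, str]]:
    """MT corpus (source text, target text) -> (synthetic source speech, target text)."""
    for source_text, target_text in mt_pairs:
        yield tts(source_text), target_text


def st_pairs_from_asr(asr_pairs: Iterable[Tuple[Audio, str]],
                      translate: Callable[[str], str]) -> Iterator[Tuple[Audio, str]]:
    """ASR corpus (speech, transcript) -> (real speech, machine-translated transcript)."""
    for speech, transcript in asr_pairs:
        yield speech, translate(transcript)


# Hypothetical stand-ins for real TTS and MT systems.
def toy_tts(text: str) -> Audio:
    return [float(len(word)) for word in text.split()]


def toy_translate(text: str) -> str:
    lexicon = {"hello": "hola", "world": "mundo"}
    return " ".join(lexicon.get(word, word) for word in text.split())


tts_synthetic = list(st_pairs_from_mt([("hello world", "hola mundo")], toy_tts))
mt_synthetic = list(st_pairs_from_asr([([0.1, 0.2], "hello world")], toy_translate))
```

In the first route the speech is synthetic and the translation is real; in the second the speech is real and the translation is synthetic.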
ASR performance is measured in terms of Word Error Rate (WER) and translation performance is measured in terms of BLEU [30] scores, both on case and punctuation sensitive reference text.

5.2. Baseline cascaded model

We build a baseline system by training an English ASR model and an English-to-Spanish MT model with the architectures described in Section 3, and cascading them together by feeding the predicted transcript from the ASR model as the input to the MT model. The ASR model is trained on a mixture of the 29M ASR set and the 1M ST set with 8:1 per-dataset sampling probabilities in order to better adapt to the domain of the ST set. The MT model is trained on the 70M MT set, which is a superset of the 1M ST set. As shown in Table 1, the ST BLEU is significantly lower than the MT BLEU, which is the result of cascading errors from the ASR model.

Task   Metric              In-domain   Out-of-domain
ASR    WER (footnote 2)    13.7%       30.7%
MT     BLEU                78.8        35.6
ST     BLEU                56.9        21.1

Table 1. Performance of the baseline cascaded ST system and the underlying ASR and MT components on both test sets.

Model                          In-domain   Out-of-domain
Cascaded                       56.9        21.1
Vanilla                        49.1        12.1
+ Pre-training                 54.6        18.2
+ Pre-training + Multi-task    57.1        21.3

Table 2. BLEU scores of baseline end-to-end ST and cascaded models. All end-to-end models use 5 encoder and 8 decoder layers.

Footnote 1: Synthetic waveforms are needed only for audio data augmentation. Otherwise, the mel-spectrogram predicted by the TTS model can be directly fed as input to the ST or ASR models, bypassing the vocoder.

5.3. Baseline end-to-end models

We train a vanilla end-to-end ST model with a 5-layer encoder and an 8-layer decoder directly on the 1M ST set. We then adopt pre-training and multi-task learning as proposed in previous literature [9, 10, 11, 15] in order to improve its performance. We pre-train the encoder on the ASR task, and the decoder on the MT task as described in Section 5.2.
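One plausible way to initialize an ST model from a pre-trained ASR encoder and a pre-trained MT decoder is to copy parameters by name and fall back to random values elsewhere. The sketch below is an assumption-laden illustration: parameters are plain Python lists, the `encoder/` and `decoder/` name prefixes and the uniform range of 0.1 are hypothetical, and nothing here reflects the actual checkpoint format used by the authors.

```python
import random
from typing import Dict, List

Params = Dict[str, List[float]]


def init_st_params(st_sizes: Dict[str, int], asr_params: Params,
                   mt_params: Params, seed: int = 0) -> Params:
    """Copy encoder weights from ASR and decoder weights from MT; randomize the rest."""
    rng = random.Random(seed)
    params: Params = {}
    for name, size in st_sizes.items():
        source = None
        if name.startswith("encoder/"):
            source = asr_params.get(name)
        elif name.startswith("decoder/"):
            source = mt_params.get(name)
        if source is not None and len(source) == size:
            params[name] = list(source)  # copy, so later fine-tuning leaves the source intact
        else:
            # Component without a matching pre-trained parameter: random values.
            params[name] = [rng.uniform(-0.1, 0.1) for _ in range(size)]
    return params
```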
After initialization with pre-trained components (or random values for components not being pre-trained), the ST model is fine-tuned on the 1M ST set. Finally, we make use of multi-task learning by jointly training a combined network on the ST, ASR, and MT tasks, using the 1M, 29M, 70M datasets, respectively. The ST sub-network shares the encoder with the ASR network, and shares the decoder with the MT network. For each training step, one task is sampled and trained with equal probability.

Performance of these baseline models is shown in Table 2. Consistent with previous literature, we find that pre-training and multi-task learning both significantly improve ST performance, because they increase the amount of data seen during training by two orders of magnitude. When both pre-training and multi-task learning are applied, the end-to-end ST model slightly outperforms the cascaded model.

Footnote 2: We report WER based on references which are case and punctuation sensitive in order to be consistent with the way BLEU is evaluated. The same ASR model obtains a WER of 6.9% (in-domain) and 14.1% (out-of-domain) if trained and evaluated on lower-cased transcripts without punctuation.

5.4. Using synthetic training data

We explore the effect of using synthetic training data as described in Section 4. To avoid overfitting to TTS synthesized audio (especially since audio synthesized using the Griffin-Lim algorithm contains obvious and unnatural artifacts), we freeze the pre-trained encoder but stack a few additional layers on top of it. The impact of adding different number of additional layers when fine-tuning with the 1M ST set is shown in Table 3. Even with no additional layer, this approach outperforms the fully trainable model (Table 2 row 3) on the out-of-domain eval set, which indicates that the frozen pre-trained encoder helps the ST model generalize better. The benefit of adding extra layers saturates after around 3 layers. Following this result, we use 3 extra layers for the following experiments, as visualized in Figure 1.

# additional encoder layers   0      1      2      3      4
In-domain                     54.5   55.7   56.1   55.9   56.1
Out-of-domain                 19.5   18.8   19.3   19.5   19.6

Table 3. BLEU scores of the extended ST model, varying the number of additional encoder layers on top of the frozen pre-trained ASR encoder. The decoder is pre-trained but kept trainable. Multi-task learning is not used.

We analyze the effect of using different synthetic datasets in Table 4 for fine-tuning. The TTS synthetic data is sourced from the 70M MT training set, by synthesizing English speech as described in Section 4.1. The synthesized speech is augmented with noise and reverberation following the same procedure as is used for real speech. The MT synthetic data is sourced from the 29M ASR training set, by synthesizing translation to Spanish as described in Section 4.2. The encoder and decoder are both pre-trained on ASR and MT tasks, respectively, but multi-task learning is not used.

Fine-tuning set          In-domain   Out-of-domain
Real                     55.9        19.5

Real + TTS synthetic     59.5        22.7
Real + MT synthetic      57.9        26.2
Real + both synthetic    59.5        26.7

Only TTS synthetic       53.9        20.8
Only MT synthetic        42.7        26.9
Only both synthetic      55.6        27.0

Table 4. BLEU scores of ST trained with synthetic data. All rows use the same model architecture as Figure 1.

The middle group in Table 4 presents the result of fine-tuning with both synthetic datasets and the 1M real dataset, sampled with equal probability. As expected, adding a large amount of synthetic training data (increasing the total number of training examples by 1 to 2 orders of magnitude), significantly improves performance on both in-domain and out-of-domain eval sets.
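Sampling several datasets "with equal probability" means that each training example is drawn by first choosing a dataset uniformly and then choosing an example from it, regardless of how large each dataset is. A minimal sketch, assuming in-memory lists and hypothetical dataset names; the function `sample_mixture` is not from the paper:

```python
import random
from typing import Dict, Iterator, Optional, Sequence, Tuple


def sample_mixture(datasets: Dict[str, Sequence],
                   num_examples: int,
                   weights: Optional[Dict[str, float]] = None,
                   seed: int = 0) -> Iterator[Tuple[str, object]]:
    """Yield (dataset name, example) pairs using per-dataset sampling probabilities."""
    rng = random.Random(seed)
    names = list(datasets)
    # Equal per-dataset probability unless explicit weights are given.
    probs = [1.0 if weights is None else weights[name] for name in names]
    for _ in range(num_examples):
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(datasets[name])


toy = {"real": ["r0", "r1"], "tts_synthetic": ["t0", "t1", "t2"], "mt_synthetic": ["m0"]}
batch = list(sample_mixture(toy, num_examples=6))
```

With equal probabilities, the much smaller real set is visited as often as each synthetic set. Passing `weights={"asr": 8, "st": 1}` for two datasets with those names would instead express the 8:1 mixture used for the ASR model in Section 5.2.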
The MT synthetic data improves performance on the out-of-domain eval set more than it does on the in-domain set, partially because it contains natural speech instead of read speech, which is better matched to the out-of-domain eval set, and partially because it introduces more diversity to the training set and thus generalizes better. Fine-tuning on the mixture of the three datasets results in dramatic gains on both eval sets, demonstrating that the two synthetic sources have complementary effects. It also significantly outperforms the cascaded model.

The bottom group in Table 4 fine-tunes using only synthetic data. Surprisingly, these models achieve very good performance and even outperform training with both synthetic and real collected data on the out-of-domain eval set. This can be attributed to the increased sampling weight of training data with natural speech (instead of read speech). This result demonstrates the possibility of training a high-quality end-to-end ST system with only weakly supervised data, by using such data for pre-training components and by generating synthetic parallel training data from it with high quality TTS and MT models or services.

5.5. Importance of frozen encoder and multi-speaker TTS

Fine-tuning set         In-domain   Out-of-domain
Real + TTS synthetic    58.7        21.4
Only TTS synthetic      35.1        9.8

Table 5. BLEU scores using fully trainable encoder, which performs worse than freezing lower encoder layers as in Table 4.

Fine-tuning set                     In-domain   Out-of-domain
Real + one-speaker TTS synthetic    59.5        19.5
Only one-speaker TTS synthetic      38.5        13.8

Table 6. BLEU scores when fine-tuning with synthetic speech data synthesized using a single-speaker TTS system, which performs worse than using the multi-speaker TTS as in Table 4.

To validate the importance of freezing the pre-trained encoder, we compare to a model where the encoder is fully fine-tuned on the ST task.
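The two settings compared here differ only in which parameters receive updates during fine-tuning. A minimal sketch of a gradient step that skips frozen parameters, using plain SGD and hypothetical parameter names (the optimizer and naming scheme used by the authors are not given in the text):

```python
from typing import Dict, Iterable, List

Params = Dict[str, List[float]]


def sgd_step(params: Params, grads: Params, learning_rate: float,
             frozen_prefixes: Iterable[str] = ()) -> Params:
    """Apply one SGD update, leaving parameters under a frozen prefix unchanged."""
    frozen = tuple(frozen_prefixes)
    updated: Params = {}
    for name, values in params.items():
        if name.startswith(frozen):  # always False when no prefix is frozen
            updated[name] = list(values)
        else:
            updated[name] = [v - learning_rate * g
                             for v, g in zip(values, grads[name])]
    return updated


params = {"pretrained_encoder/w": [1.0], "extra_encoder/w": [1.0], "decoder/w": [1.0]}
grads = {name: [0.5] for name in params}

# Frozen pre-trained encoder; extra encoder layers and decoder stay trainable.
frozen_run = sgd_step(params, grads, 0.5, frozen_prefixes=["pretrained_encoder/"])
# Fully trainable encoder.
full_run = sgd_step(params, grads, 0.5)
```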
As shown in Table 5, full encoder fine-tuning hurts ST performance, by overfitting to the synthetic speech. The ASR encoder learns a high quality latent representation of the speech content when pre-training on a large quantity of real speech with data augmentation. Additional fine-tuning on synthetic speech only hurts performance since the TTS data are not as realistic nor as diverse as real speech.

Similarly, to validate the importance of using a high quality multi-speaker TTS system to synthesize training data with wide speaker variation, we train models using data synthesized with the single speaker TTS model from [24]. This model generates more natural speech than the multi-speaker model used in Sec. 5.4 [26]. To ensure a fair comparison, we use a Griffin-Lim vocoder and a 16 kHz sampling rate. We use the same data augmentation procedure described above.

Results are shown in Table 6. Even though a frozen pre-trained encoder is used, fine-tuning on only single-speaker TTS synthetic data still performs much worse than fine-tuning with multi-speaker TTS data, especially on the out-of-domain eval set. However, when trained on the combination of real and synthetic speech, performance on the in-domain eval set is not affected. We conjecture that this is because the in-domain eval set consists of read speech, which is better matched to the prosodic quality of the single-speaker TTS model. The large performance degradation on the out-of-domain eval set again indicates worse generalization.

Incorporating recent advances in TTS to introduce more natural prosody and style variation [31, 32, 33] to the synthetic speech might further improve performance when training on synthetic speech. We leave such investigations as future work.

5.6. Utilizing unlabeled monolingual text or speech

In this section, we go further and show that unlabeled monolingual speech and text can be leveraged to improve performance of an end-to-end ST model, by using them to synthesize parallel speech-to-translation examples using available ASR, MT, and TTS systems. Even though such datasets are highly synthetic, they can still benefit an ST model trained with as many as 1M real training examples.

We take the English text from the 70M MT set as an unlabeled text set, synthesize English speech for it using a multi-speaker TTS model as in Section 5.4, and translate it to Spanish using the Google Translate service. Similarly, we take the English speech from the 29M ASR set as an unlabeled speech set, synthesize translation targets for it by using the cascaded model we build in Section 5.2. We use this cascaded model only to enable comparison to its own performance. Replacing it with a cascade of other ASR and MT models or services should not change the conclusion.

Training set                    In-domain   Out-of-domain
Real                            49.1        12.1
Real + Synthetic from text      55.9        19.4
Real + Synthetic from speech    52.4        15.3
Real + Synthetic from both      55.8        16.9

Table 7. BLEU scores for the vanilla model trained on synthetic data generated from unlabeled monolingual data, without pre-training.

Since there is no parallel training data for ASR or MT in this case, pre-training does not apply. We use the vanilla model with a 5-layer encoder as in Section 5.3. Results are presented in Table 7. Even though these datasets are highly synthetic, they still significantly improve performance over the vanilla model. Because the unlabeled speech is processed with weaker models, it doesn't bring as much gain as the synthetic set from unlabeled text. Since it essentially distills knowledge from the cascaded model, it is also understandable that it does not outperform it. Performance is far behind our best results in Section 5.4 since pre-training is not used.
Nevertheless, this result demonstrates that with access to high-quality ASR, MT, and/or TTS systems, one can leverage large sets of unlabeled monolingual data to improve the quality of an end-to-end ST system, even if a small amount of direct parallel training data are available.

6. CONCLUSIONS

We propose a weakly supervised learning procedure that leverages synthetic training data to fine-tune an end-to-end sequence-to-sequence ST model, whose encoder and decoder networks have been separately pre-trained on ASR and MT tasks, respectively. We demonstrate that this approach outperforms multi-task learning in experiments on a large scale English speech to Spanish text translation task. When utilizing synthetic speech inputs, we find that it is important to use a high quality multispeaker TTS model, and to freeze the pre-trained encoder to avoid overfitting to synthetic audio. We explore even more impoverished data scenarios, and show that it is possible to train a high quality end-to-end ST model by fine-tuning only on synthetic data from readily available ASR or MT training sets. Finally, we demonstrate that a large quantity of unlabeled speech or text can be leveraged to improve an end-to-end ST model when a small fully supervised training corpus is available.

7. ACKNOWLEDGMENTS

The authors thank Patrick Nguyen, Orhan Firat, the Google Brain team, the Google Translate team, and the Google Speech Research team for their helpful discussions and feedback, as well as Mengmeng Niu for her operational support on data collection.

8. REFERENCES

[1] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, 2016.
[2] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP, 2018.
[3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in NeurIPS, 2014.
[4] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.
[5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, 2015.
[6] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in NeurIPS, 2017.
[8] A. Bérard, O. Pietquin, C. Servan, and L. Besacier, “Listen and translate: A proof of concept for end-to-end speech-to-text translation,” in NeurIPS Workshop on End-to-end Learning for Speech and Audio Processing, 2016.
[9] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, “Sequence-to-sequence models can directly translate foreign speech,” in Proc. Interspeech, 2017.
[10] A. Anastasopoulos and D. Chiang, “Tied multitask learning for neural speech translation,” in Proc. NAACL-HLT, 2018.
[11] A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin, “End-to-end automatic speech translation of audiobooks,” in Proc. ICASSP, 2018.
[12] H. Ney, “Speech translation: Coupling of recognition and translation,” in Proc. ICASSP, 1999.
[13] E. Matusov, S. Kanthak, and H. Ney, “On the integration of speech recognition and statistical machine translation,” in European Conference on Speech Communication and Technology, 2005.
[14] M. Post, G. Kumar, A. Lopez, D. Karakos, C. Callison-Burch, and S. Khudanpur, “Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus,” in Proc. IWSLT, 2013.
[15] S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater, “Pre-training on high-resource speech recognition improves low-resource speech-to-text translation,” arXiv preprint arXiv:1809.01431, 2018.
[16] T. Kano, S. Sakti, and S. Nakamura, “Structured-based curriculum learning for end-to-end english-japanese speech translation,” in Proc. Interspeech, 2017.
[17] A. Tjandra, S. Sakti, and S. Nakamura, “Machine speech chain with one-shot speaker adaptation,” in Proc. Interspeech, 2018.
[18] A. Renduchintala, S. Ding, M. Wiesner, and S. Watanabe, “Multi-modal data augmentation for end-to-end ASR,” in Proc. Interspeech, 2018.
[19] T. Hayashi, S. Watanabe, Y. Zhang, T. Toda, T. Hori, R. Astudillo, and K. Takeda, “Back-translation-style data augmentation for end-to-end ASR,” arXiv preprint, 2018.
[20] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NeurIPS Deep Learning and Representation Learning Workshop, 2015.
[21] R. Sennrich, B. Haddow, and A. Birch, “Improving neural machine translation models with monolingual data,” in Proc. Association for Computational Linguistics (ACL), 2016.
[22] M. Schuster and K. Nakajima, “Japanese and Korean voice search,” in Proc. ICASSP, 2012.
[23] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, Y. Wu, and M. Hughes, “The best of both worlds: Combining recent advances in neural machine translation,” in Proc. Association for Computational Linguistics (ACL), 2018.
[24] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., “Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions,” in Proc. ICASSP, 2017.
[25] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: 2000-speaker neural text-to-speech,” in Proc. ICLR, 2018.
[26] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Advances in NeurIPS, 2018.
[27] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015.
[28] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, 1984.
[29] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, et al., “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech, 2017.
[30] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. Association for Computational Linguistics (ACL), 2002.
[31] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron,” in Proc. ICML, 2018.
[32] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in Proc. ICML, 2018.
[33] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” in Proc. ICLR, 2019, to appear.