Multilingual Speech Recognition with Corpus Relatedness Sampling
Xinjian Li, Siddharth Dalmia, Alan W. Black, Florian Metze
Language Technologies Institute, Carnegie Mellon University; Pittsburgh, PA; U.S.A.
{xinjianl|sdalmia|awb|fmetze}@cs.cmu.edu

Abstract

Multilingual acoustic models have been successfully applied to low-resource speech recognition. Most existing works combine many small corpora and pretrain a multilingual model by sampling from each corpus uniformly; the model is eventually fine-tuned on each target corpus. This approach, however, fails to exploit the relatedness and similarity among corpora in the training set. For example, the target corpus might benefit more from a corpus in the same domain or a corpus in a closely related language. In this work, we propose a simple but effective sampling strategy that takes advantage of this relatedness. We first compute corpus-level embeddings and estimate the similarity between each pair of corpora. We then start training the multilingual model by sampling uniformly from each corpus, and gradually increase the probability of sampling from related corpora based on their similarity to the target corpus. Finally, the model is automatically fine-tuned on the target corpus. Our sampling strategy outperforms the baseline multilingual model on 16 low-resource tasks. Additionally, we demonstrate that our corpus embeddings capture the language and domain information of each corpus.

1. Introduction

In recent years, Deep Neural Networks (DNNs) have been successfully applied to Automatic Speech Recognition (ASR) for many well-resourced languages, including Mandarin and English [1, 2]. However, only a small fraction of languages have clean labeled speech corpora. As a result, there is increasing interest in building speech recognition systems for low-resource languages.
To address this issue, researchers have successfully exploited multilingual speech recognition models that take advantage of labeled corpora in other languages [3, 4]. Multilingual speech recognition enables acoustic models to share parameters across multiple languages, so low-resource acoustic models can benefit from richer resources.

While prior low-resource multilingual works have proposed various acoustic models, they tend to combine several low-resource corpora without paying attention to the variety of the corpora themselves. One common training approach is to first pretrain a multilingual model on the combination of all training corpora, and then fine-tune the pretrained model on the target corpus [5]. During training, each corpus in the training set is treated equally and sampled uniformly. We argue, however, that this approach does not take the characteristics of each corpus into account and therefore fails to exploit the relations between corpora. For example, a conversational corpus might be more beneficial to another conversational corpus than an audiobook corpus would be.

In this work, we propose an effective sampling strategy (Corpus Relatedness Sampling) that takes advantage of the relations among corpora. First, we introduce a corpus-level embedding that can be used to compute the similarity between corpora; the embedding is estimated by training it jointly with the acoustic model. Next, we compute the similarity between each corpus and the target corpus, and use this similarity to optimize the model with respect to the target corpus. During training, we start by sampling uniformly from each corpus, and then gradually update the sampling distribution so that more related corpora are sampled more frequently. Eventually, only the target corpus is sampled from the training set, as the target corpus is the most related corpus to itself.
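The strategy just described can be sketched in a few lines. This is a minimal illustration under assumed toy values, not the paper's implementation: `scores` holds hypothetical cosine similarities between each training corpus and the target, and the sampling probabilities are a temperature-scaled softmax over them.

```python
import math

def sampling_probs(scores, T):
    """Softmax over temperature-scaled similarity scores.

    scores[i] is the similarity between corpus i and the target
    corpus; T controls how strongly sampling is biased toward
    related corpora."""
    m = max(T * s for s in scores)               # numerical stability
    exps = [math.exp(T * s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical similarities of 4 training corpora to the target;
# the target itself (index 2) always has similarity 1.0.
scores = [0.3, 0.8, 1.0, 0.5]

p_uniform = sampling_probs(scores, T=0.0)    # early training: uniform
p_focused = sampling_probs(scores, T=100.0)  # late training: ~fine-tuning
```

With T = 0 every corpus is sampled with probability 1/n; as T grows, the probability mass concentrates on the target corpus, recovering fine-tuning.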
While our approach differs from the pretrained and fine-tuned models, we can show that those models are special cases of our sampling strategy. To evaluate the strategy, we compare it with the pretrained model and the fine-tuned model on 16 different corpora. The results show that our approach outperforms those baselines on all corpora, achieving a 1.6% lower phone error rate on average. Additionally, we demonstrate that our corpus-level embeddings capture the characteristics of each corpus, especially language and domain information. The main contributions of this paper are as follows:

1. We propose a corpus-level embedding that captures the language and domain information of each corpus.
2. We introduce the Corpus Relatedness Sampling strategy to train multilingual models. It outperforms the pretrained model and the fine-tuned model on all of our test corpora.

2. Related Work

Multilingual speech recognition has explored various models that share parameters across languages in different ways. For example, parameters can be shared by using posterior features from other languages [6], applying the same GMM components across different HMM states [7], training shared hidden layers in DNNs [3, 4] or LSTMs [5], or using language-independent bottleneck features [8, 9]. Some models share only their hidden layers but use separate output layers to predict their phones [3, 4]. Other models have a single shared output layer that predicts a universal phone set covering all languages [10, 11, 12]. While those works proposed multilingual models in different ways, few of them explicitly exploited the relatedness across languages and corpora. In contrast, our work computes the relatedness between different corpora using embedding representations and exploits it efficiently.

Embedding representations have been heavily used in multiple fields.
In particular, embeddings of multiple granularities have been explored in many NLP tasks, for example character embeddings [13], subword embeddings [14], sentence embeddings [15], and document embeddings [16]. However, few works have explored corpus-level embeddings, mainly because the number of corpora involved in most experiments is too small for corpus embeddings to be useful. The main exception is multitask learning, where many tasks and corpora are combined; for instance, language-level (corpus-level) embeddings can be generated along with the model in machine translation [17] and speech recognition [18]. However, those embeddings are used only as an auxiliary input feature, and few works exploit the embeddings themselves.

Another important aspect of our work is that we focus on the sampling strategy for speech recognition. While most previous speech works emphasized the acoustic modeling side, there have also been attempts focusing on sampling strategies. For instance, curriculum learning trains the acoustic model starting from easy training samples and increasingly adapts it to more difficult ones [1, 19]. Active learning tries to minimize the human cost of collecting transcribed speech data [20]. Furthermore, sampling strategies can also help speed up training [21]. However, most of these strategies aim to improve the acoustic model by modifying the sampling distribution within a single speech corpus for a single language. In contrast, our approach aims to optimize the multilingual acoustic model by modifying the sampling distribution across all the training corpora.

3. Approach

In this section, we describe our approach to computing corpus embeddings and our Corpus Relatedness Sampling strategy.

3.1.
Corpus Embedding

Suppose that C_t is the target low-resource corpus. We are interested in optimizing the acoustic model with a much larger training corpus set S = {C_1, C_2, ..., C_n}, where n is the number of corpora and C_t ∈ S. Each corpus C_i is a collection of (x, y) pairs, where x is the input features and y is its target.

Figure 1: The acoustic model to optimize corpus embeddings.

Our purpose here is to compute an embedding e_i for each corpus C_i, where e_i is expected to encode information about C_i. These embeddings can be trained jointly with the standard multilingual model [5]. First, the embedding matrix E for all corpora is initialized; the i-th row of E corresponds to the embedding e_i of corpus C_i. Next, during the training phase, e_i is used to bias the input feature x as follows:

h = Encoder(x + e_i; W, E)    (1)

where (x, y) ∈ C_i is an utterance sampled randomly from S, h is its hidden features, W is the parameters of the acoustic model, and Encoder is the stacked bidirectional LSTM shown in Figure 1. Next, we apply a language-specific softmax to compute logits l and optimize them with the CTC objective [30]. The embedding matrix E is optimized together with the model during training.

3.2. Corpus Relatedness Sampling

With the embedding e_i of each corpus C_i, we can compute the similarity score between any two corpora using the cosine similarity:

score(C_i, C_j) = (e_i · e_j) / (|e_i| |e_j|)    (2)

As this similarity reflects the relatedness between corpora in the training set, we would like to sample the training set based on it: corpora with higher similarity to the target corpus C_t should be sampled more frequently. Therefore, we treat the similarity scores as sampling logits and normalize them with a softmax.
Pr(C_i) = exp(T · score(C_i, C_t)) / Σ_j exp(T · score(C_j, C_t))    (3)

where Pr(C_i) is the probability of sampling C_i from S, and T is a temperature that shapes the distribution during the training phase. We argue that different temperatures create different training conditions: a lower temperature tends to sample each corpus equally, like uniform sampling, while a higher temperature biases the sampling distribution toward the target corpus, like fine-tuning.

Next, we show that both the pretrained model and the fine-tuned model can be realized with specific temperatures. In the case of the pretrained model, each corpus should be sampled equally, which can be implemented by setting T to 0:

Pr(C_i) = lim_{T→0} exp(T · score(C_i, C_t)) / Σ_j exp(T · score(C_j, C_t)) = 1/n    (4)

On the other hand, the fine-tuned model should only consider samples from the target corpus C_t, ignoring all other corpora. This condition can be approximated by setting T to a very large number. Since score(C_t, C_t) = 1.0 and score(C_i, C_t) < 1.0 for i ≠ t, we have:

Pr(C_i) = lim_{T→∞} exp(T · score(C_i, C_t)) / Σ_j exp(T · score(C_j, C_t)) = { 1.0 if C_i = C_t; 0.0 if C_i ≠ C_t }    (5)

While both the pretrained model and the fine-tuned model are special cases of our approach, our approach is more flexible: it can sample from related corpora by interpolating between these two extreme temperatures. In practice, we

Table 1: The collection of training corpora used in the experiments. Both the baseline model and the proposed model are trained and tested with 16 corpora across 10 languages. We assign a corpus id to each corpus after its corpus name so that we can distinguish corpora sharing the same language.
Language   Corpus Name               Domain     Utterances
English    TED (ted) [22]            broadcast  100,000
English    Switchboard (swbd) [24]   telephone  100,000
English    Librispeech (libri) [26]  read       100,000
English    Fisher (fisher) [27]      telephone  100,000
Mandarin   Hkust (hk) [23]           telephone  100,000
Mandarin   SLR18 (s18) [25]          read       13,388
Mandarin   LDC98S73 (hub)            broadcast  35,999
Mandarin   SLR47 (s47) [28]          read       50,384
Amharic    LDC2014S06 (babel)        telephone  41,403
Bengali    LDC2016S08 (babel)        telephone  60,663
Dutch      voxforge (vox)            read       8,492
German     voxforge (vox)            read       41,146
Spanish    LDC98S74 (hub)            broadcast  31,615
Swahili    LDC2017S05 (babel)        telephone  44,502
Turkish    LDC2012S06 (hub) [29]     broadcast  97,427
Zulu       LDC2017S19 (babel)        telephone  60,835

would like to start with a low temperature to sample broadly in the early training phase, then gradually increase the temperature so that training focuses more on the related corpora. Eventually, the temperature becomes high enough that the model is automatically fine-tuned on the target corpus. Specifically, in our experiments we start training with a very low temperature T_0 and increase it every epoch k as follows:

T_{k+1} = a · T_k    (6)

where T_k is the temperature at epoch k and a is a hyperparameter controlling the growth rate of the temperature.

4. Experiments

To demonstrate that our sampling approach improves the multilingual model, we conduct experiments on 16 corpora, comparing our approach with the pretrained model and the fine-tuned model.

4.1. Datasets

We first describe our corpus collection. Table 1 lists all corpora used in the experiments: 16 corpora from 10 languages. To increase the variety of corpora, we selected 4 English corpora and 4 Mandarin corpora in addition to the low-resource language corpora. As the target of this experiment is low-resource speech recognition, we randomly select at most 100,000 utterances from each corpus.
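The geometric schedule of Eq. (6) can be sketched directly. T_0 = 0.01 and a = 1.5 are the values reported in the experiment settings; the epoch count here is an arbitrary illustration.

```python
def temperature_schedule(T0, a, num_epochs):
    """Eq. (6): T_{k+1} = a * T_k, i.e. T_k = T0 * a**k."""
    temps = []
    T = T0
    for _ in range(num_epochs):
        temps.append(T)
        T *= a
    return temps

# Reported settings: initial temperature 0.01, growth rate 1.5.
temps = temperature_schedule(T0=0.01, a=1.5, num_epochs=25)
# The temperature starts near zero (near-uniform sampling) and grows
# geometrically, eventually biasing sampling toward the target corpus.
```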
All corpora are available from LDC, voxforge, openSLR, or other public websites. Each corpus is manually assigned one domain based on its speech style; the domain candidates are telephone, read, and broadcast.

4.2. Experiment Settings

We use EESEN [31] for acoustic modeling and epitran [32] as the g2p tool in this work. Every utterance is first re-sampled to 8000Hz, and then 40-dimensional MFCC features are extracted from each audio file. We use a recent multilingual CTC model as our acoustic architecture [5]: a 6-layer bidirectional LSTM with 320 cells in each layer. We use this architecture for both the baseline models and the proposed model.

Our baseline is the fine-tuned model: we first pretrain a model by sampling uniformly from all corpora, and after the loss converges, we fine-tune the model on each target corpus. To compare it with our sampling approach, we first train an acoustic model to compute the embeddings of all corpora; the embeddings are then used to estimate the similarities as described in the previous section. The initial temperature T_0 is set to 0.01, and the growth rate a is 1.5. We evaluate all models using the phone error rate (PER) instead of the word error rate (WER), because we mainly focus on the acoustic model in this experiment. Additionally, some corpora in this experiment (e.g., Dutch voxforge) have very little text, so it is difficult to build a reasonable language model without augmenting the text with other corpora, which is beyond the scope of this work.

4.3.
Results

Table 2: Phone error rate (%PER) of the pretrained model, the baseline model (fine-tuned model), and our CRS (Corpus Relatedness Sampling) approach on all 16 corpora.

Corpus            Pretrain PER   Base PER   CRS PER
English (ted)     19.2           11.6       10.3
English (swbd)    24.9           15.3       14.1
English (libri)   12.1           6.5        5.4
English (fisher)  34.7           23.5       22.5
Mandarin (hk)     32.6           16.1       14.5
Mandarin (s18)    8.5            6.6        5.6
Mandarin (hub)    10.2           5.1        4.4
Mandarin (s47)    13.0           10.9       9.2
Amharic (babel)   45.4           40.7       36.9
Bengali (babel)   47.0           41.9       40.0
Dutch (vox)       27.3           21.6       18.2
German (vox)      23.1           12.9       10.3
Swahili (babel)   17.7           16.1       14.5
Spanish (hub)     14.0           8.4        7.5
Turkish (hub)     49.2           45.6       44.9
Zulu (babel)      47.8           38.5       37.9
Average           26.7           20.1       18.5

Table 2 shows the results of our evaluation. The left-most column lists the corpus used in each experiment; the remaining columns give the phone error rates of the pretrained model, the fine-tuned model, and our proposed model. First, we can easily confirm that the fine-tuned model outperforms the pretrained model on all corpora. For instance, the fine-tuned model outperforms the pretrained model by 4.7% on the Amharic corpus. This is reasonable: the pretrained model is optimized on the entire training set, while the fine-tuned model is further adapted to the target corpus. Next, the table suggests that our Corpus Relatedness Sampling approach achieves better results than the fine-tuned model on all test corpora. For instance, the phone error rate improves from 40.7% to 36.9% on Amharic and from 41.9% to 40.0% on Bengali. On average, our approach outperforms the fine-tuned model by 1.6% phone error rate. These results demonstrate that our sampling approach is more effective at optimizing the acoustic model on the target corpus.
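As a numerical check that the rising temperature interpolates from uniform sampling (the pretrained model, Eq. 4) to sampling only the target (the fine-tuned model, Eq. 5), the probability of drawing the target corpus can be tracked across epochs. The similarity values are hypothetical; the schedule uses the reported T_0 = 0.01 and a = 1.5.

```python
import math

def target_prob(scores, target_idx, T):
    """Eq. (3): probability of sampling the target corpus at temperature T."""
    m = max(T * s for s in scores)               # numerical stability
    exps = [math.exp(T * s - m) for s in scores]
    return exps[target_idx] / sum(exps)

# Hypothetical similarities of 4 corpora to the target (index 2).
scores = [0.3, 0.8, 1.0, 0.5]
probs = [target_prob(scores, 2, T=0.01 * 1.5 ** k) for k in range(40)]
# probs[0] is close to 1/4 (uniform sampling); probs[-1] is close to 1,
# i.e. training has effectively become fine-tuning on the target corpus.
```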
We also trained baseline models that append the corpus embeddings to the input features, but the proposed model outperforms those baselines by similar margins.

One interesting trend in the table is that the improvements differ across target corpora. For instance, the improvement on the Dutch corpus is 3.4%, while the improvement on the Zulu dataset is a relatively smaller 0.6%. We believe the difference in improvements can be explained by the size of each corpus. The Dutch corpus is very small, as shown in Table 1, so the fine-tuned model is prone to overfit it very quickly; a larger corpus is less likely to be overfit. Compared with the fine-tuned model, our approach optimizes the model by gradually changing the temperature, avoiding such quick overfitting. This mechanism can be interpreted as a built-in regularization. As a result, our model achieves much better performance on small corpora by preventing overfitting.

Table 3: The similarity between training corpora. For each target corpus, we show its most related corpus and second most related corpus.
Target            1st Related Corpus   2nd Related Corpus
English (ted)     English (libri)      Turkish (hub)
English (swbd)    English (fisher)     Mandarin (hk)
English (libri)   English (ted)        German (vox)
English (fisher)  English (swbd)       Mandarin (hk)
Mandarin (hk)     Bengali (babel)      English (swbd)
Mandarin (s18)    Mandarin (s47)       Mandarin (hub)
Mandarin (s47)    Mandarin (s18)       Dutch (vox)
Mandarin (hub)    Spanish (hub)        Turkish (hub)
Amharic (babel)   Bengali (babel)      Swahili (babel)
Bengali (babel)   Mandarin (hk)        Amharic (babel)
Dutch (vox)       Mandarin (s47)       Mandarin (s18)
German (vox)      English (libri)      Mandarin (s47)
Swahili (babel)   Zulu (babel)         Amharic (babel)
Spanish (hub)     Mandarin (hub)       Turkish (hub)
Turkish (hub)     English (libri)      Mandarin (hub)
Zulu (babel)      Swahili (babel)      Amharic (babel)

To understand how the corpus embeddings contribute to our approach, we rank the embeddings and show the top-2 most similar corpora for each corpus in Table 3. Note that the target corpus itself is removed from the rankings because it is always the most related corpus to itself. The top half of the table shows very clearly that our embeddings capture language-level information: for most English and Mandarin corpora, the most related corpus is another English or Mandarin corpus. Additionally, the bottom half of the table indicates that the embeddings capture domain-level information as well. For instance, the top-2 related corpora for Amharic are Bengali and Swahili; according to Table 1, those three corpora all belong to the telephone domain. Likewise, Dutch is a read corpus, and its top-2 related corpora are from the same domain. This also explains why the 1st related corpus of Mandarin (hk) is Bengali: both are from the telephone domain.

Figure 2: Domain plot of 36 corpora; the corpus embeddings are reduced to 2 dimensions by t-SNE.
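A ranking like Table 3 can be produced from any corpus-embedding matrix via the cosine similarity of Eq. (2), excluding the target itself. The embeddings and corpus names below are toy stand-ins, not the paper's learned values.

```python
import math

def cosine(u, v):
    """Eq. (2): cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_related(embeddings, names, target, k=2):
    """Rank corpora by similarity to the target corpus, dropping the
    target itself since it is always its own best match."""
    sims = [(cosine(embeddings[i], embeddings[target]), i)
            for i in range(len(names)) if i != target]
    sims.sort(reverse=True)
    return [names[i] for _, i in sims[:k]]

# Toy 2-D embeddings for four hypothetical corpora.
names = ["en_ted", "en_libri", "zh_hk", "tr_hub"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.6, 0.6]]
top2 = top_related(embeddings, names, target=0)
```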
To further investigate the domain information contained in the corpus embeddings, we trained corpus embeddings on an even larger collection (36 corpora) and plot all of them in Figure 2. To create the plot, the dimension of each corpus embedding was reduced to 2 with t-SNE [33]. The figure demonstrates clearly that the corpus embeddings capture domain information: all corpora of the same domain are clustered together. This also means that our approach improves the model by sampling more frequently from corpora of the same speech domain.

5. Conclusion

In this work, we propose an approach to compute corpus-level embeddings. We also introduce the Corpus Relatedness Sampling approach to train multilingual speech recognition models based on those embeddings. Our experiments show that our approach outperforms the fine-tuned multilingual model on all 16 test corpora by 1.6 phone error rate on average. Additionally, we demonstrate that our corpus embeddings capture both the language and domain information of each corpus.

6. Acknowledgements

This project was sponsored by the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O), program: Low Resource Languages for Emergent Incidents (LORELEI), issued by DARPA/I2O under Contract No. HR0011-15-C-0114.

7. References

[1] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.
[2] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "Achieving human parity in conversational speech recognition," arXiv preprint arXiv:1610.05256, 2016.
[3] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y.
Gong, "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 7304–7308.
[4] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, "Multilingual acoustic models using distributed deep neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 8619–8623.
[5] S. Dalmia, R. Sanabria, F. Metze, and A. W. Black, "Sequence-based multi-lingual low resource speech recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4909–4913.
[6] A. Stolcke, F. Grezl, M.-Y. Hwang, X. Lei, N. Morgan, and D. Vergyri, "Cross-domain and cross-language portability of acoustic features estimated by multilayer perceptrons," in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1. IEEE, 2006, pp. I–I.
[7] L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, D. Povey et al., "Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2010, pp. 4334–4337.
[8] K. Veselý, M. Karafiát, F. Grézl, M. Janda, and E. Egorova, "The language-independent bottleneck features," in 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 336–341.
[9] S. Dalmia, X. Li, F. Metze, and A. W. Black, "Domain robust feature extraction for rapid low resource ASR development," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 258–265.
[10] T. Schultz and A.
Waibel, "Language-independent and language-adaptive acoustic modeling for speech recognition," Speech Communication, vol. 35, no. 1-2, pp. 31–51, 2001.
[11] S. Tong, P. N. Garner, and H. Bourlard, "An investigation of deep neural networks for multilingual speech recognition training and adaptation," Tech. Rep., 2017.
[12] N. T. Vu and T. Schultz, "Multilingual multilayer perceptron for rapid language adaptation between and across language families," in Interspeech, 2013, pp. 515–519.
[13] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[14] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[15] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, "Skip-thought vectors," in Advances in Neural Information Processing Systems, 2015, pp. 3294–3302.
[16] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," in International Conference on Machine Learning, 2014, pp. 1188–1196.
[17] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado et al., "Google's multilingual neural machine translation system: Enabling zero-shot translation," Transactions of the Association for Computational Linguistics, vol. 5, pp. 339–351, 2017.
[18] B. Li, T. N. Sainath, K. C. Sim, M. Bacchiani, E. Weinstein, P. Nguyen, Z. Chen, Y. Wu, and K. Rao, "Multi-dialect speech recognition with a single sequence-to-sequence model," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4749–4753.
[19] S. Braun, D. Neil, and S.-C.
Liu, "A curriculum learning method for improved noise robustness in automatic speech recognition," in 2017 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 548–552.
[20] G. Riccardi and D. Hakkani-Tur, "Active learning: Theory and applications to automatic speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 4, pp. 504–511, 2005.
[21] J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audhkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny et al., "Multilingual representations for low resource speech recognition and keyword search," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 259–266.
[22] A. Rousseau, P. Deléglise, and Y. Estève, "TED-LIUM: an automatic speech recognition dedicated corpus," in LREC, 2012, pp. 125–129.
[23] Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff, "HKUST/MTS: A very large scale Mandarin telephone speech corpus," in Chinese Spoken Language Processing. Springer, 2006, pp. 724–735.
[24] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," in 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), vol. 1. IEEE, 1992, pp. 517–520.
[25] D. Wang and X. Zhang, "THCHS-30: A free Chinese speech corpus," 2015. [Online]. Available: http://arxiv.org/abs/1512.01882
[26] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[27] C. Cieri, D. Miller, and K. Walker, "The Fisher corpus: a resource for the next generations of speech-to-text," in LREC, vol. 4, 2004, pp. 69–71.
[28]
Primewords Information Technology Co., Ltd., "Primewords Chinese corpus set 1," 2018, https://www.primewords.cn.
[29] M. Saraclar, "Turkish broadcast news speech and transcripts LDC2012S06," Philadelphia, Linguistic Data Consortium, Web Download, 2012.
[30] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
[31] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 167–174.
[32] D. R. Mortensen, S. Dalmia, and P. Littell, "Epitran: Precision G2P for many languages," in LREC, 2018.
[33] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.