Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes


Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, William Chan
Google
{boboli, ngyuzh, tsainath, yonghui, williamchan}@google.com

ABSTRACT

We present two end-to-end models, Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words or words as the unit of choice to model text. These units are difficult to scale to languages with large vocabularies, particularly in the case of multilingual processing. In this work, we model text via a sequence of Unicode bytes, specifically the UTF-8 variable-length byte sequence for each character. Bytes allow us to avoid large softmaxes in languages with large vocabularies, and to share representations in multilingual models. We show that bytes are superior to grapheme characters over a wide variety of languages in monolingual end-to-end speech recognition. Additionally, our multilingual byte model outperforms each respective single-language baseline by 4.4% relative on average. On Japanese-English code-switching speech, our multilingual byte model outperforms our monolingual baseline by 38.6% relative. Finally, we present an end-to-end multilingual speech synthesis model using byte representations which matches the performance of our monolingual baselines.

Index Terms — multilingual, end-to-end speech recognition, end-to-end speech synthesis

1. INTRODUCTION

Expanding the coverage of the world's languages in Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems has been attracting much interest in both academia and industry [1, 2]. Conventional phonetically-based speech processing systems require pronunciation dictionaries that map phonetic units to words. Building such resources requires expert knowledge for each language.
Even with the costly human effort involved, many languages do not have sufficient linguistic resources available for building such dictionaries. Additionally, the inconsistency between phonetic systems is challenging to resolve [3] when merging different languages.

Graphemes have been used as an alternative modeling unit to phonemes for speech processing [4-7]. For these systems, an orthographic lexicon instead of a pronunciation dictionary is used to provide a vocabulary list. With recent advances in end-to-end (E2E) modeling, graphemes have become a popular choice. For example, [8] built a Connectionist Temporal Classification (CTC) model to directly output graphemes, while [9-11] used graphemes in sequence-to-sequence (seq2seq) models. Sub-word units were used in seq2seq [12-14] and RNN-T [15] models, and word units were used by [16, 17]. Similarly, graphemes are also commonly used to build end-to-end TTS systems [18-20]. The use of graphemes brings model simplicity and enables end-to-end optimization, which has been shown to yield better performance than phoneme-based models [21].

However, unlike phonemes, the size of the grapheme vocabulary varies greatly across languages. For example, many eastern languages, such as Chinese, Japanese and Korean, have tens of thousands of graphemes. With limited amounts of training data, many graphemes may have little or no coverage. This label sparsity issue becomes even more severe for multilingual models, where one needs to pool all the distinct graphemes from all languages together, resulting in a very large vocabulary that often has long-tail graphemes with very poor coverage. To address these problems, [3] explored the use of features from Unicode character descriptions to construct decision trees for clustering graphemes. However, when the model changes to support a new language, the decision tree needs to be updated.
Recently, there has been work exploring the use of Unicode bytes to represent text. [22] presented an LSTM-based multilingual byte-to-span model, which consumes the input text byte by byte and outputs span annotations. Unicode bytes are language independent, hence a single model can be used for many languages. The vocabulary size of Unicode bytes is always 256 and does not increase when pooling more languages together, which makes bytes preferable to graphemes for multilingual applications.

In this work, we investigate the potential of representing text with the byte sequences introduced in [22] for speech processing. For ASR, we adopt the Listen, Attend and Spell (LAS) [9] model to convert input speech into sequences of Unicode bytes corresponding to the UTF-8 encoding of the target texts. This model is referred to as the Audio-to-Byte (A2B) model. For TTS, our model is based on the Tacotron 2 architecture [20] and generates speech signals from an input byte sequence. This model is referred to as the Byte-to-Audio (B2A) model. Since both the A2B and B2A models operate directly on Unicode bytes, they can handle any number of languages written in Unicode without any modification to the input processing. Due to the small vocabulary size, 256 in this case, our models can be very compact and hence well suited for on-device applications.

We report recognition results for the A2B model on 4 different languages: English, Japanese, Spanish and Korean. First, for each individual language, we compare our A2B models to Audio-to-Char (A2C) models which emit grapheme outputs. For English and Spanish, where the graphemes are single-byte characters, A2B has exactly the same performance as A2C, as expected.
However, for languages that have a large grapheme vocabulary, such as Japanese and Korean, the label sparsity issue hurts the performance of A2C models, whereas the A2B model shares bytes across graphemes and performs better. Benefiting from the language-independent representation of Unicode bytes, we find it is possible to progressively add support for new languages when building a multilingual A2B model. Specifically, we start with an A2B model trained on English and Japanese and add in a new language after convergence. When adding a new language, we usually make sure the new language has the highest mixing ratio while keeping a small portion for each of the existing languages to avoid forgetting older ones. We experiment with adding Spanish and Korean one at a time. In this way, we can reuse the previously built model and expand the language coverage without modifying the model structure. For multilingual ASR, we find that the A2B model trained in this way is better than one trained from scratch. In addition, by adding a 1-hot language vector to the A2B system, which has been shown to boost multi-dialect [23] and multilingual [24] system performance, we find that the multilingual A2B system outperforms all the language-dependent ones.

We evaluate the B2A model on 3 different languages: English, Mandarin and Spanish. Again, we compare B2A models with those that take graphemes as input. For all three languages, B2A performs similarly in quantitative subjective evaluations to grapheme models trained on single languages, thus providing a more compact multilingual TTS model.

2. MULTILINGUAL AUDIO-TO-BYTE (A2B)

2.1. Model Structure

The Audio-to-Byte (A2B) model is based on the Listen, Attend and Spell (LAS) [9] model, with the output target changed from graphemes to Unicode bytes.
The encoder network consists of 5 unidirectional Long Short-Term Memory (LSTM) [25] layers, each with 1,400 hidden units. The decoder network consists of 2 unidirectional LSTM layers with 1,024 hidden units. Additive content-based attention [26] with 4 attention heads is used to learn the alignment between the input audio features and the output target units. The output layer is a 256-dimensional softmax, corresponding to the 256 possible byte values.

Our front-end consists of 80-dimensional log-mel features, computed with a 25ms window and shifted every 10ms. Similar to [27, 28], at each current frame, these features are stacked with 3 consecutive frames to the left and then down-sampled to a 30ms frame rate.

The amount of training data usually varies across languages. For example, for English we have around 3.5 times the amount of data compared to the other languages. More details about the data can be found in Section 4. In this work, we adjust the data sampling ratio of the different languages to help tackle the data imbalance. We choose the sampling ratio based on intuition and empirical observations. Specifically, we start by mixing the languages equally and increase the ratio for a language where the performance needs more improvement. In addition, a simple 1-hot language ID vector has been found to be effective in improving multilingual systems [23, 24]. We also adopt this 1-hot language ID vector as an additional input to the A2B models, concatenating it to the inputs of all the layers, including both the encoder and decoder layers.

2.2. Output Unit

End-to-end speech recognition models have typically used characters [9], sub-words [12], word-pieces [15] or words [16] as the output unit of choice. Word-based units are difficult to scale for languages with large vocabularies, which makes the softmax prohibitively large, especially in multilingual models. One solution is to use data-driven word-piece models.
Word-pieces learned from data can be trained to have a fixed vocabulary size, but this requires building a new word-piece model whenever a new language or new data is added. Additionally, building a multilingual word-piece model is challenging due to the unbalanced grapheme distribution. Grapheme units give the smallest vocabulary size among these units; however, some languages still have very large vocabularies. For example, our Japanese vocabulary has over 4.8k characters.

In this work, we explore decomposing graphemes into sequences of Unicode bytes. Our A2B model generates the text sequence one Unicode byte at a time, representing text as a sequence of variable-length UTF-8 bytes. For languages with single-byte characters (e.g., English), byte outputs are equivalent to grapheme character outputs. However, for languages with multi-byte characters, such as Japanese and Korean, the A2B model needs to generate a sequence of correct bytes to emit one grapheme token. This requires the model to learn both the short-term within-grapheme byte dependencies and the long-term inter-grapheme or even inter-word/phrase dependencies, which makes it a harder task than grapheme-based modeling.

The main advantage of the byte representation is its language independence. Any script of any language representable in Unicode can be represented by a byte sequence, with no need to change the existing model structure. For grapheme models, by contrast, whenever a new symbol is added, the output softmax layer has to be changed. This language independence makes bytes preferable for modeling multiple languages, and also code-switching [29] speech, within a single model.

3. MULTILINGUAL BYTE-TO-AUDIO (B2A)

3.1. Model Structure

The Byte-to-Audio (B2A) model is based on the Tacotron 2 [20] model.
The input byte sequence embedding is encoded by three convolutional layers, which contain 512 filters with shape 5 × 1, followed by a bidirectional long short-term memory (LSTM) layer of 256 units for each direction. The resulting text encodings are accessed by the decoder through a location-sensitive attention mechanism, which takes attention history into account when computing a normalized weight vector for aggregation. The autoregressive decoder network takes as input the aggregated byte encoding, conditioned on a fixed speaker embedding for each speaker, which is essentially the language ID since our training data has only one speaker per language. Similar to Tacotron 2, we separately train a WaveRNN [30] to invert mel spectrograms to a time-domain waveform.

4. RESULTS

4.1. Byte for ASR

4.1.1. Data

Our speech recognition experiments are conducted on a human-transcribed supervised training set consisting of speech from 4 different languages, namely English (EN), Japanese (JA), Spanish (ES) and Korean (KO). The total amount of data is around 76,000 hours, and the language-specific information can be found in Table 2. These training utterances are anonymized and hand-transcribed, and are representative of Google's voice search and dictation traffic. They are further artificially corrupted using a room simulator [31], adding varying degrees of noise and reverberation such that the overall SNR is between 0dB and 30dB, with an average SNR of 12dB. The noise sources are from YouTube and daily-life noisy environmental recordings. For each utterance, we generated 10 different noisy versions for training. For evaluation, we report results on language-specific test sets, each containing roughly 15K anonymized, hand-transcribed utterances from Google's voice search traffic that do not overlap with the training data.
This amounts to roughly 20 hours of test data per language. Details of each language-dependent test set can be found in Table 2. We use word error rates (WERs) as the evaluation criterion for all the languages except Japanese, where token error rates (TERs) are used to exclude the ambiguity of word segmentation.

Table 1: Speech recognition performance of monolingual and multilingual Audio-to-Byte (A2B) and Audio-to-Char (A2C) models.

Model         ExpId  Configuration                Training Languages  English WER(%)  Japanese TER(%)  Spanish WER(%)  Korean WER(%)
Monolingual   A1     A2C                          EN/JA/ES/KO         6.9             13.8             11.2            26.5
              A2     A2B                          EN/JA/ES/KO         6.9             13.2             11.2            25.8
Multilingual  B1     A2C                          EN+JA               9.5             13.9             -               -
              B2     A2B                          EN+JA               8.9             13.3             -               -
              C1     A2B, Random Init             EN+JA+ES            9.7             13.6             11.1            -
              C2     A2B, Init From B2            EN+JA+ES            8.6             13.2             11.0            -
              D1     A2B, Init From C2            EN+JA+ES+KO         8.4             13.4             11.3            26.0
              B3     A2B, Larger Model            EN+JA               8.8             13.6             -               -
              B4     A2B, Larger Model, LangVec   EN+JA               7.5             13.3             -               -
              C3     A2B, Init From B4            EN+JA+ES            7.5             12.9             10.8            -
              D2     A2B, Larger Model, LangVec   EN+JA+ES+KO         8.6             13.5             11.2            25.4
              D3     A2B, Init From C3            EN+JA+ES+KO         7.0             12.8             10.8            25.0
              D4     A2B, Init From D3            EN+JA+ES+KO         6.6             12.6             10.7            24.7

Table 2: Statistics of the training and testing data used in our experiments. "utts" denotes the total number of utterances in each set and "time" is the total duration of audio for each set.

Languages      Train utts (M)  Train time (Kh)  Test utts (K)  Test time (h)
English (EN)   35.0            27.5             15.4           20.0
Japanese (JA)  9.9             16.5             17.6           22.2
Spanish (ES)   8.9             16.3             16.6           22.3
Korean (KO)    9.6             16.1             12.6           15.0

4.1.2. Language Dependent Systems

We first build language-dependent A2B models to investigate the performance of byte-based language representations for ASR. For comparison, we also build corresponding Audio-to-Char (A2C) models that have the same model structure but output graphemes. For all four languages, the byte-output model always has a 256-dimensional softmax output layer.
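The fixed 256-dimensional softmax works because UTF-8 reduces every script to the same byte alphabet. A minimal illustration of this fixed byte vocabulary (plain Python; the example strings are our own, not drawn from the paper's data):

```python
# Every character, from any script, becomes a sequence of byte values in 0..255.
for text in ["hello", "音声認識", "음성"]:
    b = text.encode("utf-8")
    print(text, "->", list(b), f"({len(b)} bytes for {len(text)} chars)")

# ASCII letters need one byte each, so for English the byte and grapheme
# sequences have the same length; each of these CJK characters needs three.
assert len("hello".encode("utf-8")) == len("hello")
assert all(len(c.encode("utf-8")) == 3 for c in "音声認識")

# The output vocabulary never grows: every byte value fits in 0..255.
assert all(0 <= v <= 255 for v in "音声認識".encode("utf-8"))
```

This is why A2B and A2C coincide for English and Spanish, while for Japanese or Korean the model must emit several correct bytes to produce one grapheme.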
However, for the grapheme models, different grapheme vocabularies have to be used for different languages. The grapheme set is complete for English and Spanish, as it contains all possible letters in each of these languages. For Japanese and Korean, on the other hand, we use the training data vocabularies, which are 4.8K and 2.7K respectively. The corresponding test set grapheme OOV rates are 2.1% and 1.0%. With byte outputs, in contrast, we have no OOV problem for any language. Experimental results are presented as A1 for the A2C models and A2 for the A2B models in Table 1.

The difference between grapheme and byte representations mainly lies in languages which use multi-byte characters, such as Japanese and Korean. Comparing A1 to A2, byte outputs give better results for Japanese and Korean, while for languages with single-byte characters, namely English and Spanish, the two have exactly the same performance, as expected. Byte outputs require the model to learn both the short-term within-grapheme byte dependencies and the long-term inter-grapheme or even inter-word/phrase dependencies; this is possibly a harder task than grapheme-based modeling. Nevertheless, the A2B model yields a 4.0% relative WER reduction on Japanese and 2.6% on Korean over the grapheme systems. It is interesting to see that even with the same model structure, we are able to get better performance with the byte representation.

4.1.3. Multilingual ASR Systems

In this experiment, we examine the effectiveness of byte-based models over graphemes for multilingual speech recognition. We first build a joint English and Japanese model by equally mixing the training data. For the grapheme system, we combine the grapheme vocabularies of English and Japanese, which leads to a large softmax layer. The A2B model uses the same model structure except for the softmax layer, where a 256-dimensional softmax is used.
Although the model now needs to recognize two languages, we keep the model size the same as the language-dependent ones. From Table 1, the multilingual byte system (B2) is better than the grapheme system (B1) on both the English and Japanese test sets. However, its performance is worse than the language-dependent ones, which we address later in this work. For the following experiments, we continue with only the A2B models, as they are better than the A2C models.

To increase the model's language coverage, e.g., to Spanish, one way is to start from a random initialization and train on all the training data, equally mixing the data from the three languages. The results are presented as C1 in Table 1. Due to the language independence of the byte representation, we can alternatively add a new language by simply continuing training on new data. Hence, we reuse the B2 model and continue training with Spanish data. To avoid the model forgetting the previous languages, namely English and Japanese, we also mix in those languages, but with a slightly lower mixing ratio of 3:3:4 for English, Japanese and Spanish. The results are presented as C2 in Table 1. With this method, the byte model not only trains faster but also achieves better performance than C1. Most importantly, C2 matches the performance of the language-dependent models on Japanese and is even slightly better on Spanish.

To add support for Korean, we simply continued the training of C2 with the new training data mixture. We use a ratio of 0.23:0.23:0.23:0.31, chosen heuristically to balance the existing languages while giving a higher ratio to the new language; we did not specifically tune the mixing ratio. The results (D1 in Table 1) show that we are able to get close to the language-dependent models except for English. Even though it is worse than the English-only model, D1 gives the best multilingual performance on English so far.
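A mixing ratio such as 3:3:4 amounts to drawing each training example's language from a fixed categorical distribution. A minimal sketch of such a sampler (`mixed_batches` and its arguments are our illustration, not the authors' training pipeline):

```python
import random

def mixed_batches(datasets, ratios, num_examples, seed=0):
    """Yield (language, example) pairs with languages drawn per `ratios`.

    datasets: dict mapping language -> list of examples
    ratios:   dict mapping language -> unnormalized sampling weight
    """
    rng = random.Random(seed)
    langs = list(datasets)
    weights = [ratios[l] for l in langs]
    for _ in range(num_examples):
        lang = rng.choices(langs, weights=weights)[0]
        yield lang, rng.choice(datasets[lang])

# The 3:3:4 EN:JA:ES mixture used when adding Spanish (model C2):
data = {"EN": ["en_utt"], "JA": ["ja_utt"], "ES": ["es_utt"]}
counts = {"EN": 0, "JA": 0, "ES": 0}
for lang, _ in mixed_batches(data, {"EN": 3, "JA": 3, "ES": 4}, 10000):
    counts[lang] += 1
print(counts)  # roughly 3000 / 3000 / 4000
```

The weights need not sum to one, so both the 3:3:4 and the 0.23:0.23:0.23:0.31 mixtures above fit the same scheme.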
Table 3: Results of A2B and A2C models on English-Japanese code-switching data.

Model         ExpId  Configuration                TER(%)
Monolingual   A1     A2C                          36.5
              A2     A2B                          22.4
Multilingual  B1     A2C                          21.4
              B2     A2B                          20.5
              D4     A2B, Larger Model, LangVec   21.3

To improve the performance of the multilingual systems, we first increase the number of decoder layers from 2 to 6, in consideration of the increased variation in byte sequences when mixing more languages. However, experimental results show that the larger model improves performance on English but degrades on Japanese due to potential over-fitting (comparing B3 to B2). To address this problem, we bring the 1-hot language ID vector into all the layers of the A2B model. This enables the learning of language-independent weight matrices together with language-dependent biases that cater to the specific needs of each language. Experiment B4 shows a dramatic error reduction with this simple 1-hot vector compared to B3.

Similarly, to support the recognition of Spanish, we continue the training of B4, mixing the languages at a ratio of 3:3:4 so that more weight is given to the new language. This gives us model C3, which outperforms the language-dependent ones on both Japanese and Spanish. Furthermore, we add Korean in a similar way with a ratio of 0.3:0.15:0.15:0.4. This time, while making sure the ratio for the new language, Korean, is the highest, we also increase the ratio for English, as we have more English training data. The model D3 wins over the language-dependent models except on English. One possible explanation for the degradation on English is that when mixing in other languages, the multilingual model sees less data from each language than the single-language models do. To test this, we continue the training of D3 with an increased English ratio in the mixture, specifically 2:1:1:1. We did not specifically tune these mixing ratios.
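The 1-hot conditioning introduced with B4 appends a language indicator to each layer's input, letting shared weight matrices learn language-dependent biases. A toy sketch of the concatenation itself (dimensions are illustrative only; the real model uses 80-dimensional features and 1,400-unit layers):

```python
LANGS = ["EN", "JA", "ES", "KO"]

def one_hot(lang):
    """4-dimensional 1-hot language ID vector."""
    return [1.0 if l == lang else 0.0 for l in LANGS]

def with_lang_vec(frame_features, lang):
    """Concatenate the language vector to one frame of layer input,
    as done for every encoder and decoder layer in the A2B model."""
    return frame_features + one_hot(lang)

frame = [0.1, -0.3, 0.7]           # a toy 3-dim acoustic feature frame
print(with_lang_vec(frame, "JA"))  # -> [0.1, -0.3, 0.7, 0.0, 1.0, 0.0, 0.0]
```

Because the vector is constant over an utterance, the same utterance-level conditioning is what later hurts D4 on code-switching speech.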
The final model, D4, wins over all the language-dependent ones by 4.4% relative on average. For comparison, we include the results for a randomly initialized model with an equal training data mixing ratio (D2), which is much worse.

4.1.4. Error Analysis

To further understand the gains of using bytes versus graphemes as language representations, we take Japanese as a case study and compare the decoding hypotheses of A1 and A2. Interestingly, the A2B model wins over the A2C model mainly on English words in utterances that mix English and Japanese, even though the Japanese test set was not particularly created to include code-switching utterances. The English words appearing in the Japanese test set are mostly proper nouns such as "Google", "wi-fi", "LAN", etc. In one such case, the A2B model generates the correct hypothesis "wi-fi オン" while the A2C model outputs "i-i オン". Another example is "google 音声認識", which the A2B model recognizes correctly, but where the A2C model drops the initial "g" and gives "oogle 音声認識".

One of the potential benefits of byte-based models is code-switching speech. Collecting such data is challenging, and the quality of artificially concatenated speech is far from real. In this study we use data filtered from the Japanese test set, keeping utterances whose transcripts contain 5 or more consecutive English characters. These utterances mostly contain only a single English word in Japanese text. Out of the 17.6K utterances, we get 476 code-switching sentences, and we report the TERs on this subset in Table 3. With the Japanese monolingual models (A1 and A2), our A2B model outperforms the A2C model by 38.6% relative.

Table 4: Speech naturalness Mean Opinion Score (MOS) with 95% confidence intervals across different languages and systems.

Languages         EN           CN           ES
Monolingual C2A   4.24 ± 0.12  3.48 ± 0.11  4.21 ± 0.11
Multilingual B2A  4.23 ± 0.14  3.42 ± 0.12  4.23 ± 0.10
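The code-switching subset above was obtained by a simple transcript rule: keep utterances containing 5 or more consecutive English characters. A sketch of such a filter (the regex and function name are our own, not the authors' tooling):

```python
import re

# Five or more consecutive ASCII letters signals an embedded English word.
_ENGLISH_RUN = re.compile(r"[A-Za-z]{5,}")

def is_code_switching(transcript):
    """True if a (mostly Japanese) transcript embeds an English word."""
    return bool(_ENGLISH_RUN.search(transcript))

print(is_code_switching("google 音声認識"))  # True: "google" has 6 letters
print(is_code_switching("こんにちは"))       # False: no run of ASCII letters
```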
With the English and Japanese multilingual models (B1 and B2), our A2B model wins over the A2C model by 4.2% relative. We also test system D4 on this code-switching data. However, because the 1-hot language vector used in D4 is utterance-level, its performance is worse than B2. Using frame- or segment-level language information may address this problem, which we will explore in future work.

4.2. Byte for TTS

4.2.1. Data

Text-to-speech models were trained on (1) 44 hours of North American English speech recorded by a female speaker; (2) 37 hours of Mandarin speech by a female speaker; and (3) 44 hours of North American Spanish speech by a female speaker. For all compared models, we synthesize raw audio at 24 kHz in 16-bit format. We rely on crowdsourced Mean Opinion Score (MOS) evaluations based on subjective listening tests. All our MOS evaluations are aligned to the Absolute Category Rating scale [32], with rating scores from 1 to 5 in 0.5 point increments.

4.2.2. Multilingual TTS System

Table 4 compares the subjective naturalness MOS of the proposed model to baselines using graphemes, for English, Mandarin and Spanish respectively. The results indicate that the proposed multilingual B2A model is comparable to the state-of-the-art monolingual models.¹ Moreover, we observed that the B2A model was able to read code-switching text. However, we do not have a good metric to evaluate the quality of code-switching for TTS; for example, the speech may be fluent while the speaker changes across languages. Future work may explore how to evaluate TTS in code-switching scenarios and how to disentangle language and speaker given more training data.

5. CONCLUSIONS

In this paper, we investigated the use of Unicode bytes as a language representation for both ASR and TTS. We proposed Audio-to-Byte (A2B) and Byte-to-Audio (B2A) as multilingual ASR and TTS end-to-end models.
The use of bytes allows us to build a single model for many languages without modifying the model structure for new ones. It brings representation sharing across graphemes, which is crucial for languages with large grapheme vocabularies, especially in multilingual processing. Our experiments show that byte models outperform grapheme models in both monolingual and multilingual settings. Moreover, our multilingual A2B model outperforms our monolingual baselines by 4.4% relative on average. The language independence of byte models provides a new perspective on the code-switching problem, where our multilingual A2B model achieves a 38.6% relative improvement over our monolingual baselines. Finally, we also show that our multilingual B2A models match the performance of our monolingual baselines in TTS.

¹MOS is worse than [20] because we have OOVs in the test set.

6. REFERENCES

[1] Tanja Schultz and Katrin Kirchhoff, Multilingual Speech Processing, Elsevier, 2006.
[2] Hervé Bourlard, John Dines, Mathew Magimai-Doss, Philip N. Garner, David Imseng, Petr Motlicek, Hui Liang, Lakshmi Saheer, and Fabio Valente, "Current trends in multilingual speech processing," Sadhana, vol. 36, no. 5, pp. 885–915, 2011.
[3] Mark J. F. Gales, Kate M. Knill, and Anton Ragni, "Unicode-based graphemic systems for limited resource languages," in ICASSP, 2015.
[4] Stephan Kanthak and Hermann Ney, "Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition," in ICASSP, 2002.
[5] Mirjam Killer, Sebastian Stüker, and Tanja Schultz, "Grapheme based speech recognition," in Eighth European Conference on Speech Communication and Technology, 2003.
[6] Sebastian Stüker and Tanja Schultz, "A grapheme based speech recognition system for Russian," in 9th Conference Speech and Computer, 2004.
[7] Willem D. Basson and Marelie H. Davel, "Comparing grapheme-based and phoneme-based speech recognition for Afrikaans," in PRASA, 2012.
[8] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in ICML, 2014.
[9] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, Attend and Spell: A neural network for large vocabulary conversational speech recognition," in ICASSP, 2016.
[10] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, "End-to-end attention-based large vocabulary speech recognition," in ICASSP, 2016.
[11] William Chan and Ian Lane, "On online attention-based speech recognition and joint Mandarin character-pinyin training," in INTERSPEECH, 2016.
[12] William Chan, Yu Zhang, Quoc Le, and Navdeep Jaitly, "Latent sequence decompositions," in ICLR, 2017.
[13] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani, "State-of-the-art speech recognition with sequence-to-sequence models," in ICASSP, 2018.
[14] Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney, "Improved training of end-to-end attention models for speech recognition," in INTERSPEECH, 2018.
[15] K. Rao, R. Prabhavalkar, and H. Sak, "Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-Transducer," in ASRU, 2017.
[16] Hagen Soltau, Hank Liao, and Hasim Sak, "Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition," 2016.
[17] Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, and Yifan Gong, "Advancing acoustic-to-word CTC model," in ICASSP, 2018.
[18] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio, "Char2Wav: End-to-end speech synthesis," in ICLR Workshop, 2017.
[19] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," arXiv preprint, 2017.
[20] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in ICASSP, 2018.
[21] Rohit Prabhavalkar, Kanishka Rao, Tara N. Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly, "A comparison of sequence-to-sequence models for speech recognition," in INTERSPEECH, 2017, pp. 939–943.
[22] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya, "Multilingual language processing from bytes," in NAACL, 2016.
[23] Bo Li, Tara N. Sainath, Khe Chai Sim, Michiel Bacchiani, Eugene Weinstein, Patrick Nguyen, Zhifeng Chen, Yonghui Wu, and Kanishka Rao, "Multi-dialect speech recognition with a single sequence-to-sequence model," in ICASSP, 2018.
[24] Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, and Kanishka Rao, "Multilingual speech recognition with a single end-to-end model," in ICASSP, 2018.
[25] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[26] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," 2015.
[27] Golan Pundak and Tara N. Sainath, "Lower frame rate neural network acoustic models," in INTERSPEECH, 2016.
[28] Haşim Sak, Andrew Senior, Kanishka Rao, and Françoise Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," arXiv preprint arXiv:1507.06947, 2015.
[29] Peter Auer, Code-Switching in Conversation: Language, Interaction and Identity, Routledge, 2013.
[30] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu, "Efficient neural audio synthesis," in ICML, 2018.
[31] C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bacchiani, "Generation of large-scale simulated utterances in virtual rooms to train deep neural networks for far-field speech recognition in Google Home," in INTERSPEECH, 2017.
[32] ITU-T Rec. P.800, "Methods for subjective determination of transmission quality," International Telecommunication Union, Geneva, 1996.
