TOWARDS UNSUPERVISED SPEECH-TO-TEXT TRANSLATION

Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass

Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
{andyyuan, ckbjimmy, st9, glass}@mit.edu

ABSTRACT

We present a framework for building speech-to-text translation (ST) systems using only monolingual speech and text corpora, in other words, speech utterances from a source language and independent text from a target language. As opposed to traditional cascaded systems and end-to-end architectures, our system does not require any labeled data (i.e., transcribed source audio or parallel source and target text corpora) during training, making it especially applicable to language pairs with very few or even zero bilingual resources. The framework initializes the ST system with a cross-modal bilingual dictionary inferred from the monolingual corpora, which maps every source speech segment corresponding to a spoken word to its target text translation. For unseen source speech utterances, the system first performs word-by-word translation on each speech segment in the utterance. The translation is then improved by leveraging a language model and a sequence denoising autoencoder, which provide prior knowledge about the target language. Experimental results show that our unsupervised system achieves BLEU scores comparable to those of supervised end-to-end models despite the lack of supervision. We also provide an ablation analysis to examine the utility of each component in our system.

Index Terms — speech-to-text translation, unsupervised speech processing, speech2vec, bilingual lexicon induction

1. INTRODUCTION

Conventional speech-to-text translation (ST) systems, which typically cascade automatic speech recognition (ASR) and machine translation (MT), impose significant requirements on training data.
They usually require hundreds of hours of transcribed audio and millions of words of parallel text from the source and target languages to train the individual components, which makes it difficult to use this approach on low-resource languages. Although recent works have shown the feasibility of building end-to-end systems that directly translate source speech to target text without using any intermediate source language transcriptions, they still require data in the form of source audio paired with target text translations for end-to-end training [1, 2, 3, 4].

In contrast to ST, which requires paired data for training, recent research in MT has explored fully unsupervised settings, relying only on monolingual corpora from each language. These studies have shown that unsupervised MT models can achieve results comparable (sometimes even superior) to supervised ones [5, 6]. A key principle behind these unsupervised MT approaches is to initialize an MT model with a bilingual dictionary inferred from monolingual corpora, without using cross-lingual signals [7, 8]. Given a source word, the initial MT model is able to perform word-by-word translation by looking up the dictionary, and can be further improved by leveraging other techniques such as back-translation [9].

Recently, [10] showed that these unsupervised bilingual dictionary induction algorithms can also be applied to scenarios where the source and target corpora are of different modalities, namely speech and text. The learned cross-modal bilingual dictionary, as we will show in this paper, is capable of performing word-by-word translation, with the difference being that the input, instead of text, is a speech segment corresponding to a spoken word in the source language.
In this paper, we propose a framework for building an ST system using only independent monolingual corpora of speech and text. The two corpora can be collected independently, which greatly reduces human labeling efforts. Our framework starts by initializing an ST system with a cross-modal bilingual dictionary inferred from the monolingual corpora to perform word-by-word translation. To further improve the quality of the translations, we incorporate a pre-trained language model (LM) and a sequence denoising autoencoder (DAE) [11, 12] that contain prior knowledge about the target language; their primary function is to consider context in lexical choices and to handle local reordering and multi-aligned words. To the best of our knowledge, this is the first work that tackles ST in an unsupervised setting. More importantly, experiments show that our unsupervised system achieves results comparable to supervised end-to-end models [3] despite the lack of supervision.

2. PROPOSED FRAMEWORK

Our framework builds on several recently developed techniques for unsupervised speech processing and MT. We first derive an ST system that can perform simple word-by-word translation. Next, we integrate a language model into the framework to introduce contextual information during the translation process. Finally, we post-process the translated results using a DAE to handle local reordering and multi-aligned words. Below we describe each step in detail.

2.1. Word-by-Word Translation System

In our framework, a speech corpus from the source language is first pre-processed using an unsupervised speech segmentation algorithm [13] to generate speech segments corresponding to spoken words.
We then apply a neural architecture called Speech2Vec [14, 15] to learn a speech embedding space from the set of speech segments, such that each vector corresponds to a word whose semantics has been captured. A text embedding space that captures word semantics can be learned by training Word2Vec [16] on a text corpus from the target language. Based on the assumption that monolingual word embedding spaces are approximately isomorphic, since languages are used to convey thematically similar information in similar contexts [17], it is theoretically possible to align these two spaces. To achieve this, one can use an unsupervised bilingual dictionary induction (BDI) algorithm to learn a cross-lingual mapping from the source embedding space to the target embedding space.

Two of the most representative BDI algorithms are MUSE [7] and VecMap [8], neither of which relies on cross-lingual signals. Note that both of these BDI algorithms were originally proposed for aligning two embedding spaces learned from text. In [10], however, the authors showed that MUSE can also be applied to learn a cross-modal alignment between embedding spaces learned from speech and text. In our experiments, we include the results of both algorithms for comparison.

We obtain a rudimentary ST system after deriving a cross-modal and cross-lingual mapping from the speech to the text corpora, which is essentially a linear transformation W. Given an unseen speech utterance, we first segment it into several speech segments using the speech segmentation algorithm mentioned previously. Then, for each speech segment that potentially corresponds to a spoken word, we map it from the speech embedding space to the text embedding space via W and apply nearest-neighbor search to decide its text translation.
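To make the lookup concrete, the following is a minimal sketch (not the authors' code) of the word-by-word step: speech segment embeddings are mapped through a learned linear transformation W and each mapped vector is translated as its cosine nearest neighbor in the text embedding space. All variable names here are illustrative.

```python
import numpy as np

def translate_word_by_word(speech_vecs, W, text_matrix, text_vocab):
    """Map each speech segment embedding into the text embedding space via
    the linear transformation W, then return, for each segment, the target
    word whose embedding is the cosine nearest neighbor of the mapped vector."""
    mapped = speech_vecs @ W.T                                        # (n_segments, d_text)
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)   # unit-normalize rows
    text_norm = text_matrix / np.linalg.norm(text_matrix, axis=1, keepdims=True)
    sims = mapped @ text_norm.T                                       # cosine similarities
    return [text_vocab[i] for i in sims.argmax(axis=1)]
```

In practice W would come from VecMap or MUSE, speech_vecs from Speech2Vec, and text_matrix/text_vocab from Word2Vec; retrieval methods such as CSLS can replace plain cosine search to mitigate hubness, but the plain version suffices to illustrate the dictionary lookup.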
However, the translations generated by this preliminary system are far from acceptable, since nearest-neighbor search does not consider the context of the current word. In many cases, the correct translation is not the nearest target word but a synonym or another close word with morphological variations, prompting us to incorporate further improvements.

2.2. Language Model for Context-Aware Beam Search

We incorporate contextual information into word-by-word translation by introducing a LM during the decoding process [18]. Let w_s be the word vector mapped from the speech to the text embedding space and w_t the word vector of a possible target word. Given a history h of target words before w_t, the score of w_t being the translation of w_s is computed as:

L(w_t; w_s, h) = log((f(w_s, w_t) + 1) / 2) + λ_LM · log p(w_t | h),

where λ_LM is a weight parameter that decides how context-aware the system is, and f(w_s, w_t) ∈ [−1, 1] is the cosine similarity between w_s and w_t, linearly scaled to the range [0, 1] to make it comparable with the output probability of the LM. Empirically, we found that setting λ_LM to 0.1 yields the best performance. Accumulating the scores per position, we perform a beam search to allow only reasonable translation hypotheses.

2.3. Sequence Denoising Autoencoder

We may achieve semantic correctness through learning an appropriate cross-modal bilingual dictionary and using a LM. However, to further improve the quality of the translations, it is also necessary to consider syntactic correctness. To this end, we apply a sequence DAE to correct the translated outputs. By injecting noise into the input sequence during the training process, the DAE learns to output the original (clean) sequence given a corrupted, noisy input.
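The corruption used to train such a DAE can be sketched as follows. This is a hypothetical minimal version of the three noise types our framework adopts from [18] (word deletion, insertion of a filler token, and local permutation); all parameter values and the filler token are illustrative, not taken from the paper.

```python
import random

def corrupt(tokens, p_drop=0.1, p_insert=0.1, k_shuffle=3, filler="<blank>", seed=0):
    """Corrupt a clean target sentence for DAE training: randomly delete
    words, randomly insert filler words, and locally permute the result so
    that no token moves more than roughly k_shuffle positions."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if rng.random() < p_drop:          # word deletion
            continue
        out.append(tok)
        if rng.random() < p_insert:        # word insertion
            out.append(filler)
    # local permutation: sort positions perturbed by a bounded random offset
    keys = [i + rng.uniform(0, k_shuffle) for i in range(len(out))]
    order = sorted(range(len(out)), key=lambda i: keys[i])
    return [out[i] for i in order]
```

Training pairs are then (corrupt(sentence), sentence), so the DAE learns to undo exactly the kinds of errors word-by-word translation introduces.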
In our framework, we adopt three noise simulation techniques proposed in [18]: word insertion, deletion, and permutation. We seek to simulate the noise introduced during the word-by-word translation process with these three techniques. Readers can refer to [18] for more details. Along with the context-aware LM, we found that adopting a DAE further boosts translation performance.

3. EXPERIMENTS

3.1. Datasets

We used an English-to-French speech translation dataset [19] augmented from the LibriSpeech ASR corpus [20]. The dataset is split into train, dev, and test sets; all come with a collection of English speech utterances and their corresponding French text translations. The train set contains 100 hours of speech, which was used to train Speech2Vec [14] to obtain the speech embedding space. For the text embedding space, we trained Word2Vec [16] on two different corpora: the parallel corpus that contains the text translations, and an independent corpus crawled from French Wikipedia. For evaluation, we merged the dev and test sets, resulting in speech data of about 6 hours. BLEU scores [21] were used as the evaluation metric.

3.2. Model Architectures and Training Details

We trained Speech2Vec following the same procedure used in [10]. The text embedding space was trained by Word2Vec using fastText [22] with default settings, without subword information. The dimension of both speech and text embeddings is 100. For both VecMap [8] and MUSE [7], we followed the default settings of the implementations released by their original authors. For the LM, we trained a 5-gram count-based LM using KenLM [23] with its default settings. Finally, we implemented the DAE, structured as a 6-layer Transformer [24], with an embedding and hidden layer size of 512, a feedforward sublayer size of 2,048, and 8 attention heads.
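The context-aware scoring rule of Section 2.2, which combines the scaled cosine similarity with the log-probability from a LM such as the 5-gram model above, can be sketched as follows. In this minimal, hypothetical version the LM log-probability is passed in as a number rather than queried from KenLM.

```python
import math

LAMBDA_LM = 0.1  # context-awareness weight; 0.1 worked best empirically (Section 2.2)

def translation_score(cos_sim, lm_log_prob, lambda_lm=LAMBDA_LM):
    """Score of a target word w_t as the translation of a mapped speech
    vector w_s, given a history h of target words:
        log((f(w_s, w_t) + 1) / 2) + lambda_LM * log p(w_t | h).
    cos_sim is f(w_s, w_t) in [-1, 1]; the linear rescaling to [0, 1] makes
    its log comparable with the LM log-probability."""
    scaled = (cos_sim + 1.0) / 2.0
    return math.log(scaled) + lambda_lm * lm_log_prob
```

During decoding, these scores are accumulated per position and a beam search keeps only the highest-scoring partial hypotheses.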
3.3. Results and Discussion

We first study the similarities between the different pairs of embedding spaces to be aligned in Section 3.3.1. We then present the main ST results in Section 3.3.2.

3.3.1. Eigenvector Similarity

Having approximately isomorphic embedding spaces is important for BDI. To quantify whether the embedding spaces are isomorphic, or similar in structure, we computed the eigenvector similarity, which is derived from Laplacian eigenvalues. Both our study and [25] demonstrate that the eigenvector similarity metric is correlated with the performance of the translation task, which implies that the metric reflects the distance between embedding spaces in a meaningful way.

The similarity is computed as follows. Let L1 and L2 be the Laplacians of the two nearest-neighbor embedding graphs. We search for the smallest value of k for each graph such that the sum of its largest k Laplacian eigenvalues accounts for at least 90% of the sum of all its eigenvalues. Then, we select the smaller k across the two graphs and compute the sum of squared differences between the largest k Laplacian eigenvalues of the two graphs. This sum is the eigenvector similarity we use to measure the similarity between embedding spaces. Note that a higher value of the eigenvector similarity metric indicates that the given two embedding spaces are less similar.

Table 1 presents the eigenvector similarity of different speech-text pairs. The eigenvector similarity of speech and text embedding space pairs is smaller when the speech embeddings were trained with the Speech2Vec algorithm than with the Audio2Vec [26] algorithm. These results are expected, since Speech2Vec utilizes the semantic context of the speech corpus, similarly to how Word2Vec uses that of the text corpus. Furthermore, we applied skip-gram as the training methodology for both algorithms, resulting in approximately isomorphic embedding spaces.
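The metric just described can be sketched in NumPy as follows. This is a hypothetical implementation: the neighborhood size and our reading of the 90% criterion (the smallest k whose top-k eigenvalues reach 90% of the spectrum's total) are assumptions, so it illustrates the computation rather than reproducing the paper's exact values.

```python
import numpy as np

def eigenvector_similarity(emb1, emb2, n_neighbors=5, energy=0.9):
    """Eigenvector similarity between two embedding spaces: build a
    nearest-neighbor graph per space, take its Laplacian eigenvalues, pick
    the smallest k whose top-k eigenvalues reach `energy` of the total (then
    the smaller k of the two graphs), and sum the squared differences of the
    top-k eigenvalues. Higher values mean less similar spaces."""
    def laplacian_eigs(emb):
        x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = x @ x.T
        n = len(x)
        adj = np.zeros((n, n))
        for i in range(n):
            for j in np.argsort(-sims[i])[1:n_neighbors + 1]:  # skip self
                adj[i, j] = 1.0
        adj = np.maximum(adj, adj.T)                           # symmetrize the graph
        lap = np.diag(adj.sum(axis=1)) - adj                   # unnormalized Laplacian
        return np.sort(np.linalg.eigvalsh(lap))[::-1]          # eigenvalues, descending

    def min_k(eigs):
        cum = np.cumsum(eigs)
        return int(np.searchsorted(cum, energy * eigs.sum()) + 1)

    e1, e2 = laplacian_eigs(emb1), laplacian_eigs(emb2)
    k = min(min_k(e1), min_k(e2))
    return float(np.sum((e1[:k] - e2[:k]) ** 2))
```

Two identical spaces score exactly 0, and increasingly dissimilar graph structures push the score up, matching the interpretation used in Table 1.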
In contrast, Audio2Vec focuses on similarities in acoustics rather than semantics, so the learned embedding space differs fundamentally. Embedding space pairs learned from comparable corpora also yield higher similarity, since the word distributions are more similar; for example, the distribution of English LibriSpeech speech embeddings is more similar to that of the French LibriSpeech text embeddings than to that of the French Wikipedia text embeddings.

Table 1: Embedding similarity of different speech and text embedding pairs, evaluated by eigenvector similarity. We denote the embedding training method and corpus name in upper and lower case, respectively. For each pair, the speech and text embedding spaces are on the left and right side, respectively. For example, A_libri - T_wiki represents the speech embedding space trained on the LibriSpeech corpus using Audio2Vec and the text embedding space trained on the Wikipedia corpus. A, S, T indicate Audio2Vec, Speech2Vec, and text (Word2Vec) embeddings.

    Speech & text embedding space pair    Eigenvector similarity
    A_libri - T_libri                     14.74
    A_libri - T_wiki                      15.02
    S_libri - T_libri                      6.43
    S_libri - T_wiki                       7.17

3.3.2. Speech-to-Text Translation

We present the results of our unsupervised approach as well as supervised baselines in Table 2. We trained every system 10 times and report both the best and average performance. In configurations (a-d), we replicate state-of-the-art supervised algorithms and arrive at the conclusion that cascaded systems perform better than their end-to-end counterparts and beam search performs better than greedy search. Note that cascaded systems require more supervision than end-to-end systems, whereas our approach makes no assumption of having speech-text pairs or comparable corpora across the two languages.
In configurations (e-l), we showcase the performance of our unsupervised approach, reported as (BLEU score of VecMap / BLEU score of MUSE) in the columns of Table 2.

Table 2: Different configurations for speech-to-text translation and their performance. The numbers for the unsupervised methods are denoted as BLEU score (%) of VecMap / BLEU score (%) of MUSE. The notation used in the table is the same as in Table 1. For cascaded systems, we followed the ASR and MT pipeline in [3]. E2E stands for end-to-end.

    System                                Best          Average
    Cascaded and end-to-end ST systems (supervised)
    (a) Cascaded + greedy                 13.7          13.0
    (b) Cascaded + beam                   14.2          13.2
    (c) E2E + greedy                      12.3          11.6
    (d) E2E + beam                        12.7          12.1
    Our alignment-based ST systems (unsupervised)
    (e) A_libri - T_libri                  0.0 / 0.0     0.0 / 0.0
    (f) A_libri - T_wiki                   0.0 / 0.0     0.0 / 0.0
    (g) S_libri - T_libri                  4.5 / 4.6     4.2 / 2.7
    (h) S_libri - T_wiki                   3.7 / 2.1     3.0 / 0.9
    (i) (g) + LM_libri                     5.2 / 5.0     4.7 / 2.9
    (j) (g) + LM_wiki                      9.5 / 8.8     9.0 / 5.7
    (k) (g) + LM_wiki + DAE_wiki          12.2 / 11.8   11.3 / 7.3
    (l) (h) + LM_wiki + DAE_wiki          11.5 / 9.1    10.8 / 6.2

Alignment Quality. Configurations (e-h) demonstrate that the eigenvector similarity of speech and text embedding space pairs has a strong positive correlation with the BLEU score of alignment-based ST; compare the relative performances with those shown in Table 1. The results from configurations (g) and (h) illustrate that using comparable corpora, and thus obtaining a better alignment, affects the quality of ST. They also hint that there may exist a threshold of usefulness in alignment performance: since configurations (e) and (f) lie beneath that threshold, they achieve scores of zero. These findings indicate that the eigenvector similarity of embedding spaces could serve as an indicator of unsupervised ST performance.

Unsupervised BDI. In all of our unsupervised experiments, we compared the performance of the two unsupervised BDI algorithms, VecMap and MUSE. VecMap outperforms MUSE in all but one experiment, demonstrating that VecMap can be applied to more difficult scenarios through its weak, fully unsupervised initialization with iterative mapping improvements, whereas MUSE, which maps embeddings to the shared space through adversarial training, succeeds only under a more limited set of conditions. Additionally, VecMap trains more stably and faster than MUSE, which has a similar best performance but much lower average performance.

Language Model Integration. Integrating a LM improves the performance of ST in all experimental configurations, regardless of the choice of corpus; compare configuration (g) with (i) and (j), while comparing (h) with (l) generalizes this result to different embedding spaces. By comparing configurations (i) and (j), we discover that the text corpus used to train the LM does not need to be the same as the one used to train the Word2Vec text embedding space. In fact, adopting the LM trained on the Wikipedia corpus (LM_wiki) produces better performance than using the one trained on the LibriSpeech corpus (LM_libri). Since introducing the LM grounds words into a context based on the previous words, the much larger LM_wiki, containing more words, topic contexts, and sentence structures, serves as a better approximation of the French language than LM_libri.

Sequence DAE. Comparing configurations (j) and (k), we show that applying a DAE on top of the baseline alignment architecture and LM can further enhance performance in unsupervised ST; the performance is now comparable to end-to-end supervised systems.
This also justifies our alignment and post-processing approach, since configuration (k) essentially has the same degree of supervision as configurations (c) and (d) and performs similarly well while employing a completely different approach. We attribute this to the DAE's ability to reconstruct corrupted data after translation. Since the semantic alignment method we used may retrieve synonyms based on context rather than the exact syntactically correct word [10], it is possible that the output, even when taking the LM into account, is still syntactically incorrect. Moreover, one of the key obstacles in training Speech2Vec lies in the limited performance of unsupervised speech segmentation methods. By incorporating a DAE, we can limit these negative effects after translation. Last but not least, the DAE was trained on the Wikipedia corpus rather than the LibriSpeech corpus. This design decision follows from the observation about the LM corpus choice: since the DAE should learn the French language, a larger, more diverse dataset performs better than the same dataset used for the Word2Vec text embeddings.

Scenario of Real-World ST. In configuration (l), we conducted experiments modeling a real-world setting where no comparable speech and text corpora exist; instead, they must be collected independently from different sources. Text data exists in greater abundance than speech data, so one would usually adopt a text embedding space learned from a larger corpus such as Wikipedia, which configuration (h) replicates to the best of our efforts. By comparing configurations (k) and (l), we demonstrate that the performance of our proposed framework under no supervision is only slightly inferior to the best performance achieved using unsupervised alignment, which requires comparable corpora for the speech and text embedding spaces and should be considered supervised.
The proposed unsupervised ST framework is thus promising for low-resource ST.

4. CONCLUSIONS

In this paper, we propose a framework capable of performing speech-to-text translation in a completely unsupervised manner. Since the system translates using an inferred cross-modal bilingual dictionary trained without parallel data between speech and text, it can be applied to low- or zero-resource languages. By incorporating knowledge of the target language, through adding a LM and a DAE, our system greatly enhances the translation performance: we achieved comparable performance with state-of-the-art end-to-end systems using parallel corpora, and only slightly lower scores without them. These results indicate that our approach could serve as a promising first step towards fully unsupervised speech-to-text translation. Future work includes testing the proposed framework on other language pairs and examining the relationship between embedding quality and translation performance in more detail.

5. REFERENCES

[1] Ron Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen, "Sequence-to-sequence models can directly translate foreign speech," in INTERSPEECH, 2017.

[2] Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater, "Low-resource speech-to-text translation," in INTERSPEECH, 2018.

[3] Alexandre Bérard, Laurent Besacier, Ali Can Kocabiyikoglu, and Olivier Pietquin, "End-to-end automatic speech translation of audiobooks," in ICASSP, 2018.

[4] Alexandre Bérard, Olivier Pietquin, Laurent Besacier, and Christophe Servan, "Listen and translate: A proof of concept for end-to-end speech-to-text translation," in NIPS Workshop on End-to-end Learning for Speech and Audio Processing, 2016.
[5] Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato, "Phrase-based & neural unsupervised machine translation," in EMNLP, 2018.

[6] Mikel Artetxe, Gorka Labaka, and Eneko Agirre, "Unsupervised statistical machine translation," in EMNLP, 2018.

[7] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou, "Word translation without parallel data," in ICLR, 2018.

[8] Mikel Artetxe, Gorka Labaka, and Eneko Agirre, "A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings," in ACL, 2018.

[9] Rico Sennrich, Barry Haddow, and Alexandra Birch, "Improving neural machine translation models with monolingual data," in ACL, 2016.

[10] Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass, "Unsupervised cross-modal alignment of speech and text embedding spaces," in NIPS, 2018.

[11] Ilya Sutskever, Oriol Vinyals, and Quoc Le, "Sequence to sequence learning with neural networks," in NIPS, 2014.

[12] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol, "Extracting and composing robust features with denoising autoencoders," in ICML, 2008.

[13] Herman Kamper, Karen Livescu, and Sharon Goldwater, "An embedded segmental k-means model for unsupervised segmentation and clustering of speech," in ASRU, 2017.

[14] Yu-An Chung and James Glass, "Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech," in INTERSPEECH, 2018.

[15] Yu-An Chung and James Glass, "Learning word embeddings from speech," in NIPS Workshop on Machine Learning for Audio Signal Processing, 2017.

[16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean, "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
[17] Antonio Valerio Miceli Barone, "Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders," in RepL4NLP, 2016.

[18] Yunsu Kim, Jiahui Geng, and Hermann Ney, "Improving unsupervised word-by-word translation with language model and denoising autoencoder," in EMNLP, 2018.

[19] Ali Kocabiyikoglu, Laurent Besacier, and Olivier Kraif, "Augmenting LibriSpeech with French translations: A multimodal corpus for direct speech translation evaluation," in LREC, 2018.

[20] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.

[21] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "BLEU: A method for automatic evaluation of machine translation," in ACL, 2002.

[22] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.

[23] Kenneth Heafield, "KenLM: Faster and smaller language model queries," in WMT, 2011.

[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NIPS, 2017.

[25] Anders Søgaard, Sebastian Ruder, and Ivan Vulić, "On the limitations of unsupervised bilingual dictionary induction," in ACL, 2018.

[26] Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee, "Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in INTERSPEECH, 2016.
