Multilingual bottleneck features for subword modeling in zero-resource languages
Authors: Enno Hermann, Sharon Goldwater
ILCC, School of Informatics, University of Edinburgh, UK
ehermann@inf.ed.ac.uk, sgwater@inf.ed.ac.uk

Abstract

How can we effectively develop speech technology for languages where no transcribed data is available? Many existing approaches use no annotated resources at all, yet it makes sense to leverage information from large annotated corpora in other languages, for example in the form of multilingual bottleneck features (BNFs) obtained from a supervised speech recognition system. In this work, we evaluate the benefits of BNFs for subword modeling (feature extraction) in six unseen languages on a word discrimination task. First we establish a strong unsupervised baseline by combining two existing methods: vocal tract length normalisation (VTLN) and the correspondence autoencoder (cAE). We then show that BNFs trained on a single language already beat this baseline; including up to 10 languages results in additional improvements which cannot be matched by just adding more data from a single language. Finally, we show that the cAE can improve further on the BNFs if high-quality same-word pairs are available.

Index Terms: multilingual bottleneck features, subword modeling, unsupervised feature extraction, zero-resource speech technology

1. Introduction

Recent years have seen increasing interest in "zero-resource" speech technology: systems developed for a target language without transcribed data or other hand-curated resources. One challenge for these systems, highlighted by the Zero Resource Speech Challenge (ZRSC) of 2015 [1] and 2017 [2], is to improve subword modeling, i.e., to extract speech features from the target language audio that work well for word discrimination or downstream tasks such as query-by-example.
The ZRSCs were motivated largely by questions in artificial intelligence and human perceptual learning, and focused on approaches where no transcribed data from any language is used. Yet from an engineering perspective it also makes sense to explore how training data from higher-resource languages can be used to improve speech features in a zero-resource language. There is considerable evidence that bottleneck features (BNFs) extracted using a multilingually trained deep neural network (DNN) can improve ASR for target languages with just a few hours of transcribed data [3-7]. However, there has been little work so far exploring supervised multilingual BNFs for target languages with no transcribed data at all. [8, 9] trained monolingual BNF extractors and showed that applying them cross-lingually improves word discrimination in a zero-resource setting. [10, 11] trained a multilingual DNN to extract BNFs for a zero-resource task, but the DNN itself was trained on untranscribed speech: an unsupervised clustering method was applied to each language to obtain phone-like units, and the DNN was trained on these unsupervised phone labels.

We know of only two previous studies of supervised multilingual BNFs for zero-resource speech tasks. In [12], the authors trained BNFs on either Mandarin, Spanish, or both, and used the trained DNNs to extract features from English (simulating a zero-resource language). On a query-by-example task, they showed that BNFs always performed better than MFCCs, and that bilingual BNFs performed as well as or better than monolingual ones. Further improvements were achieved by applying weak supervision in the target language using a correspondence autoencoder [13] trained on English word pairs. However, the authors did not experiment with more than two training languages, and only evaluated on English.
In the second study [14], the authors built multilingual systems using either seven or ten high-resource languages, and evaluated on the three "development" and two "surprise" languages of the ZRSC 2017. However, they included transcribed training data from four out of the five evaluation languages, so only one language's results (Wolof) are truly zero-resource.

This paper presents a more thorough evaluation of multilingual BNFs, trained on between one and ten languages from the GlobalPhone collection and evaluated on six others. We show that training on more languages consistently improves performance on word discrimination, and that the improvement is not simply due to more training data: an equivalent amount of data from one language fails to give the same benefit. Since BNF training uses no target language data at all, we also compare to methods that train unsupervised on the target language, either alone or in combination with the multilingual training.

We use a correspondence autoencoder (cAE) [13], which learns to abstract away from signal noise and variability by training on pairs of speech segments extracted using an unsupervised term discovery (UTD) system, i.e., pairs that are likely to be instances of the same word or phrase. In the setting with target language data only, we find that applying vocal tract length normalisation (VTLN) to the input of both the UTD and cAE systems improves the learned features considerably, suggesting that cAE and VTLN abstract over different aspects of the signal. Nevertheless, BNFs trained on just a single other language already outperform the cAE-only training, with multilingual BNFs doing better by a wide margin. We then tried fine-tuning the multilingual BNFs to the target language by using them as input to the cAE. When trained with UTD word pairs, we found no benefit to this fine-tuning.
However, training with manually labeled word pairs did yield benefits, suggesting that this type of supervision can help fine-tune the BNFs if the word pairs are sufficiently high-quality.

2. Experimental setup

2.1. Dataset

We use 16 languages from the GlobalPhone corpus of speech read from news articles [15]. The selected languages and dataset sizes are shown in Table 1. We consider the 10 languages in the top section, with a combined 198.3 hours of speech, as high-resource languages, where transcriptions are available to train a supervised automatic speech recognition (ASR) system. We treat the 6 languages in the bottom section as zero-resource languages on which we evaluate the new feature representations. In addition we use the English Wall Street Journal (WSJ) corpus [16], which is comparable to the GlobalPhone corpus. We either use the entire 81 hours or only a 15 hour subset, so that we can compare the effect of increasing the amount of data for one language with training on data from 4 GlobalPhone languages.

Table 1: Dataset sizes (hours). About 100 speakers per language, with 80% of these in the training set and no speaker overlap.

High-resource         Train   Dev   Test
Bulgarian (BG)         17.1   2.3   2.0
Czech (CS)             26.8   2.4   2.7
French (FR)            22.8   2.1   2.0
German (DE)            14.9   2.0   1.5
Korean (KO)            16.6   2.2   2.1
Polish (PL)            19.4   2.8   2.3
Portuguese (PT)        22.8   1.6   1.8
Russian (RU)           19.8   2.5   2.4
Thai (TH)              21.2   1.5   0.4
Vietnamese (VI)        16.9   1.4   1.5
English81 WSJ (EN)     81.3   1.1   0.7
English15 WSJ          15.1   -     -

Zero-resource          Train   Dev   Test
Croatian (HR)          12.1   2.0   1.8
Hausa (HA)              6.6   1.0   1.1
Mandarin (ZH)          26.6   2.0   2.4
Spanish (ES)           17.6   2.1   1.7
Swedish (SV)           17.4   2.1   2.2
Turkish (TR)           13.3   2.0   1.9

2.2. Baseline features

For baseline features, we use Kaldi [17] to extract MFCCs+Δ+ΔΔ and PLPs+Δ+ΔΔ with a window size of 25 ms and a shift of 10 ms, and we apply per-speaker cepstral mean normalization.
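The baseline front-end above (static cepstra, appended Δ and ΔΔ coefficients, then cepstral mean normalization) can be sketched in a few lines of numpy. This is an illustrative stand-in for Kaldi's add-deltas and per-speaker CMVN, not the actual implementation: the delta formula is the standard regression over a ±2 frame window, and for simplicity the mean is computed per utterance rather than per speaker.

```python
import numpy as np

def deltas(feats, width=2):
    """First-order regression deltas over a +/-width frame window,
    as in standard ASR front-ends (simplified stand-in for Kaldi's
    add-deltas)."""
    n = np.arange(1, width + 1)
    denom = 2 * np.sum(n ** 2)
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    out = np.zeros_like(feats)
    for t in range(feats.shape[0]):
        window = padded[t:t + 2 * width + 1]
        out[t] = sum(k * (window[width + k] - window[width - k]) for k in n) / denom
    return out

def add_deltas_and_cmn(mfcc):
    """Append delta and delta-delta coefficients, then subtract the
    cepstral mean (per utterance here, standing in for per speaker)."""
    d = deltas(mfcc)
    dd = deltas(d)
    full = np.hstack([mfcc, d, dd])      # 13 -> 39 dimensions
    return full - full.mean(axis=0)      # cepstral mean normalization

mfcc = np.random.RandomState(0).randn(100, 13)  # 100 frames of 13-dim MFCCs
feats = add_deltas_and_cmn(mfcc)
print(feats.shape)  # (100, 39)
```

After normalization each of the 39 dimensions has zero mean over the utterance, which removes constant channel effects from the cepstra.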
We also evaluated MFCCs and PLPs with vocal tract length normalisation (VTLN), a simple feature-space speaker adaptation technique that normalizes a speaker's speech by warping the frequency axis of the spectra. VTLN models are trained using maximum likelihood estimation under a given acoustic model, here a diagonal-covariance universal background model with 1024 components trained on each language's training data. Warp factors can then be extracted for both the training data and for unseen data.

2.3. Bottleneck features

For monolingual training of the high-resource languages, we follow the Kaldi recipes for the GlobalPhone and WSJ corpora and train a subspace Gaussian mixture model (SGMM) system for each language to get initial context-dependent state alignments; these states serve as targets for DNN training.

For multilingual training, we closely follow the existing Kaldi recipe for the Babel corpus. We train a time-delay neural network (TDNN) [18] with block softmax [19], i.e. all hidden layers are shared between languages, but there is a separate output layer for each language, and for each training instance only the error at the corresponding language's output layer is used to update the weights. The TDNN has six 625-dimensional hidden layers^1 followed by a 39-dimensional bottleneck layer with ReLU activations and batch normalization. Each language then has its own 625-dimensional affine and a softmax layer. The inputs to the network are 40-dimensional MFCCs with all cepstral coefficients, to which we append i-vectors for speaker adaptation. The network is trained with stochastic gradient descent for 2 epochs.

[Figure 1: Correspondence autoencoder training procedure (see Section 2.4): word pairs found by unsupervised term discovery (or forced alignment) are aligned at the frame level, and aligned frames serve as input x and target x' of the cAE, whose middle layer y yields the cAE features. Parts of this figure due to Herman Kamper, used with permission.]
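The block-softmax idea, shared hidden layers with one private output layer per language and error flowing only through the output layer of each frame's own language, can be sketched as follows. This is a minimal numpy illustration, not the Kaldi TDNN: the shared trunk is collapsed to one layer, and the language names and senone counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared trunk (stands in for the six TDNN hidden layers + bottleneck).
W_shared = rng.standard_normal((40, 39)) * 0.1   # 40-dim MFCC -> 39-dim bottleneck

# One private affine + softmax layer per training language.
n_states = {"FR": 2000, "PT": 1800}              # hypothetical senone counts
W_out = {lang: rng.standard_normal((39, n)) * 0.1 for lang, n in n_states.items()}

def block_softmax_loss(frame, lang, target_state):
    """Cross-entropy at the output layer of `lang` only; the other
    languages' output layers see no error for this frame, while the
    shared trunk receives gradient from every frame."""
    bottleneck = np.tanh(frame @ W_shared)
    post = softmax(bottleneck @ W_out[lang])     # private head of this language
    return -np.log(post[target_state])

loss = block_softmax_loss(rng.standard_normal(40), "FR", target_state=3)
print(loss > 0)
```

Because only `W_out[lang]` enters the loss for a given frame, backpropagation updates the shared trunk from all languages but each output head from its own language's data only, which is exactly what makes the bottleneck layer multilingual.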
In preliminary experiments we trained a separate i-vector extractor for each different-sized subset of training languages. However, results were similar to training on the pooled set of all 10 high-resource languages, so for expedience we used the 100-dimensional i-vectors from this pooled training for all reported experiments. Including i-vectors yielded a small performance gain over not doing so; we also tried applying VTLN to the MFCCs for TDNN training, but found no additional benefit.

2.4. Correspondence autoencoder

In several experiments we further adapt the baseline features or BNFs using a cAE network. The cAE attempts to normalize out non-linguistic factors such as speaker, channel, gender, etc., using top-down information from pairs of similar speech segments. Extracting cAE features requires three steps, as illustrated in Figure 1. First, an unsupervised term discovery (UTD) system is applied to the target language to extract pairs of speech segments that are likely to be instances of the same word or phrase. Each pair is then aligned at the frame level using dynamic time warping (DTW), and pairs of aligned frames are presented as the input x and target output x' of a DNN. After training, a middle layer y is used as the learned feature representation.

To obtain the UTD pairs, we used a freely available UTD system^2 [20] and extracted 36k word pairs for each target language. Published results with this system use PLP features as input, and indeed our preliminary experiments confirmed that MFCCs did not work as well. We therefore report results using only PLP or PLP+VTLN features as input to UTD.

To provide an upper bound on cAE performance, we also report results using gold standard same-word pairs for cAE training. As in [12, 13, 21], we force-align the target language data and extract all the same-word pairs that are at least 5 characters and 0.5 seconds long (between 89k and 102k pairs for each language).
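The frame-pairing step of the cAE pipeline, DTW-aligning two discovered segments and reading training pairs (x, x') off the alignment path, can be sketched as below. This is an illustrative textbook DTW with Euclidean frame distance, not the alignment code used in the paper; the two random segments stand in for two tokens of the same discovered word.

```python
import numpy as np

def dtw_align(a, b):
    """Dynamic time warping between two segments (frames x dims);
    returns the aligned frame-index pairs on the minimum-cost path."""
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda p: cost[p])
    return path[::-1]

# Two hypothetical tokens of the same word, with different durations.
rng = np.random.default_rng(1)
seg1, seg2 = rng.standard_normal((8, 13)), rng.standard_normal((10, 13))
pairs = [(seg1[i], seg2[j]) for i, j in dtw_align(seg1, seg2)]
print(len(pairs) >= 10)  # at least max(len(seg1), len(seg2)) aligned pairs
```

Each element of `pairs` is one (x, x') training example for the cAE: the network is asked to reconstruct one token's frame from the other's, which forces the hidden representation to discard speaker- and channel-specific detail.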
^1 The splicing indexes are -1,0,1 -1,0,1 -1,0,1 -3,0,3 -3,0,3 -6,-3,0 0.
^2 https://github.com/arenjansen/ZRTools

Following [9, 13], we train the cAE model^3 by first pre-training an autoencoder with eight 100-dimensional layers and a final layer of size 39 layer-wise on the entire training data for 5 epochs with a learning rate of 2.5 × 10^-4. We then fine-tune the network with same-word pairs as weak supervision for 60 epochs with a learning rate of 2.5 × 10^-5. Frame pairs are presented to the cAE using either MFCC, MFCC+VTLN, or BNF representation, depending on the experiment (preliminary experiments indicated that PLPs performed worse than MFCCs, so MFCCs are used as the stronger baseline). Features are extracted from the final hidden layer of the cAE.

2.5. Evaluation

We evaluate all speech features on the same-different task [22], which tests whether a given speech representation can correctly classify two speech segments as having the same word type or not. For each word pair in a pre-defined set S, the DTW cost between the acoustic feature vectors under a given representation is computed. Two segments are then considered a match if the cost is below a threshold. Precision and recall at a given threshold τ are defined as

P(τ) = M_SW(τ) / M_all(τ),    R(τ) = M_SWDP(τ) / |S_SWDP|

where M is the number of same-word (SW), same-word different-speaker (SWDP), or all discovered matches at that threshold, and |S_SWDP| is the number of actual SWDP pairs in S. By varying the threshold, a precision-recall curve can be computed; the final evaluation metric is the average precision (AP), i.e. the area under that curve. We generate evaluation sets of word pairs for the GlobalPhone development and test sets as above, from all words that are at least 5 characters and 0.5 seconds long, except that we now also include different-word pairs.
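The same-different metric above can be computed by sweeping the threshold over the sorted DTW costs. The sketch below follows the definitions of P(τ) and R(τ) directly, with recall counted over SWDP pairs only; the six-pair evaluation set is a toy example, not data from the paper.

```python
import numpy as np

def same_different_ap(costs, same_word, diff_speaker):
    """Average precision for the same-different task: a pair is a
    predicted match if its DTW cost is below the threshold; precision
    uses all same-word (SW) matches, recall only same-word
    different-speaker (SWDP) pairs."""
    order = np.argsort(costs)                 # sweep thresholds low -> high
    sw = same_word[order]
    swdp = (same_word & diff_speaker)[order]
    tp_sw = np.cumsum(sw)                     # SW matches below threshold
    tp_swdp = np.cumsum(swdp)                 # SWDP matches below threshold
    n_all = np.arange(1, len(costs) + 1)      # all matches below threshold
    precision = tp_sw / n_all
    recall = tp_swdp / swdp.sum()
    # Area under the precision-recall curve (trapezoidal rule).
    return float(np.sum(np.diff(recall) * (precision[1:] + precision[:-1]) / 2))

# Toy evaluation set: 6 pairs with DTW costs and labels (all hypothetical).
costs = np.array([0.1, 0.2, 0.3, 0.5, 0.7, 0.9])
same_word = np.array([1, 1, 0, 1, 0, 0], dtype=bool)
diff_speaker = np.array([1, 0, 1, 1, 1, 0], dtype=bool)
ap = same_different_ap(costs, same_word, diff_speaker)
print(0.0 < ap <= 1.0)
```

A representation that pushes same-word pairs (especially cross-speaker ones) to low DTW costs and different-word pairs to high costs yields high precision at every recall level, and hence a higher AP.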
^3 https://github.com/kamperh/speech_correspondence

We note that previous work [13, 22] computed recall with all SW pairs for easier computation, because their test sets included a negligible number of same-word same-speaker (SWSP) pairs. In our case the smaller number of speakers in the GlobalPhone corpora results in up to 60% of SW pairs being from the same speaker. We therefore explicitly compute the recall only for SWDP pairs to focus the evaluation of features on their speaker invariance. As a sanity check, we also provide word error rates (WER) for the ASR systems trained on the high-resource languages.

3. Results

3.1. Using target language data only

Our first set of experiments aims to find the best features that can be extracted using target language data only. Previous work has shown that cAE features are better than MFCCs, especially for cross-speaker word discrimination [9], but we know of no direct comparison between cAE features and VTLN, which can also be trained without transcriptions.

Table 2 shows AP results on all target languages for baseline features, cAE features learned using raw features as input (as in previous work), and cAE features learned using VTLN-adapted features as input to either the UTD system, the cAE, or both. We find that cAE features as trained previously are slightly better than MFCC+VTLN, but can be improved considerably by applying VTLN to the input of both UTD and cAE training; indeed, even when using gold pairs as cAE input, applying VTLN is beneficial. This suggests that cAE training and VTLN abstract over different aspects of the speech signal, and that both should be used when only target language data is available.

Table 2: Average precision scores on the same-different task (dev sets), showing the effects of applying VTLN to the input features for the UTD and/or cAE systems. cAE input is either MFCC or MFCC+VTLN. Topline results (rows 5-6) train the cAE on gold standard pairs rather than UTD output. Baseline results (final rows) directly evaluate acoustic features without UTD/cAE training. Best unsupervised result (row 4) marked with *.

UTD input    cAE input     ES     HA     HR     SV     TR     ZH
PLP          MFCC         28.6   39.9   26.9   22.2   25.2   20.4
PLP          MFCC+VTLN    46.2   48.2   36.3   37.9   31.4   35.7
PLP+VTLN     MFCC         40.4   45.7   35.8   25.8   25.9   26.9
PLP+VTLN     MFCC+VTLN   *51.5  *52.9  *39.6  *42.9  *33.4  *44.4
Gold pairs   MFCC         65.3   65.2   55.6   52.9   50.6   60.5
Gold pairs   MFCC+VTLN    68.9   70.1   57.8   56.9   56.3   69.5
Baseline: MFCC            18.3   19.6   17.6   12.3   16.8   18.3
Baseline: MFCC+VTLN       27.4   28.4   23.2   20.4   21.3   27.7

3.2. Multilingual training

Table 3 compares the WER of the monolingual SGMM systems, which provide the targets for TDNN training, to the WER of the final model trained on all 10 high-resource languages. The multilingual model shows small but consistent improvements for all languages except Vietnamese. Ultimately, though, we are not so much interested in the performance on typical ASR tasks, but in whether BNFs from this model also generalize to zero-resource applications on unseen languages.

Figure 2 shows AP on the same-different task of multilingual BNFs trained from scratch on an increasing number of languages in two randomly chosen orders. We provide two baselines for comparison, drawn from our results in Table 2. Firstly, our best cAE features trained with UTD pairs (from row 4 of Table 2) are a reference for a fully unsupervised system. Secondly, the best cAE features trained with gold standard pairs (from row 6 of Table 2) give an upper bound on the cAE performance. In all 6 languages, even BNFs from a monolingual TDNN already considerably outperform the cAE trained with UTD pairs. Adding another language usually leads to an increase in AP, with the BNFs trained on 8-10 high-resource languages performing the best, also always beating the gold cAE.
However, the biggest performance gain comes from adding a second training language; further increases are mostly smaller. The order of languages has only a small effect, although, for example, adding other Slavic languages is generally associated with an increase in AP on Croatian, suggesting that it may be beneficial to train on languages related to the zero-resource language.

Table 3: Word error rates (%) of the monolingual SGMM and 10-lingual TDNN ASR systems, evaluated on the development sets.

Language   Mono   Multi
BG         17.5   16.9
CS         17.1   15.7
DE          9.6    9.3
FR         24.5   24.0
KO         20.3   19.3
PL         16.5   15.1
PT         20.5   19.9
RU         27.5   26.9
TH         34.3   33.3
VI         11.3   11.6

[Figure 2: Same-different task evaluation on the development sets for BNFs trained on different amounts of data. We compare training on up to 10 different languages with additional data in one language (English). For multilingual training, languages were added in two different orders: FR-PT-DE-TH-PL-KO-CS-BG-RU-VI (BNFs 1) and RU-CS-VI-PL-KO-TH-BG-PT-DE-FR (BNFs 2). Each datapoint shows the result of adding an additional language. As baselines we include the best unsupervised cAE and the cAE trained on gold standard pairs from rows 4 and 6 of Table 2.]

To determine whether these gains come from the diversity of training languages or just the larger amount of training data, we trained models on the 15 hour subset and the full 81 hours of the English WSJ corpus, the latter corresponding to the amount of data of four GlobalPhone languages. More data does help to some degree, as Figure 2 shows, but except for Mandarin, training on just two languages (46 hours) already works better.

3.3. cAE results

Previous work [13] and our baselines in Table 2 show that a fully unsupervised system like the cAE generates features that can discriminate between words much better than standard acoustic features like MFCCs. Is the cAE also able to further improve on multilingual BNFs, which already have a much higher baseline performance? We trained the cAE with the same sets of same-word pairs as before, but replaced VTLN-adapted MFCCs with the 10-lingual BNFs as input features, without any other changes in the training procedure.

Table 4: AP on the same-different task when training the cAE on the 10-lingual BNFs from above (cAE-BNF) with UTD and gold standard word pairs (test set results). Baselines are MFCC+VTLN and the cAE models from rows 4 and 6 of Table 2 that use MFCC+VTLN as input features. Best result without target language supervision marked with *.

Features           ES     HA     HR     SV     TR     ZH
MFCC+VTLN         44.1   22.3   25.0   34.3   17.9   33.4
cAE UTD           72.1   41.6   41.6   53.2   29.3   52.8
cAE gold          85.1   66.3   58.9   67.1   47.9   70.8
10-lingual BNFs  *85.3  *71.0  *56.8   72.0  *65.3   77.5
cAE-BNF UTD       85.0   67.4   40.3  *74.3   64.6  *78.8
cAE-BNF gold      89.2   79.0   60.8   79.9   69.5   81.6

Table 4 shows that the cAE trained with UTD pairs is able to slightly improve on the BNFs in some cases, but this is not consistent across all languages, and for Croatian the cAE features are much worse. The limiting factor appears to be the quality of the UTD pairs. With gold standard pairs, the cAE features improve in all languages.

4. Conclusions

We evaluated multilingual BNFs trained on up to 10 high-resource languages on a word discrimination task in 6 zero-resource languages.
These BNFs outperform both standard acoustic features like MFCCs and cAE features trained in a fully unsupervised way. We showed that training on multiple languages helps the BNFs, and that just training on more data in a single language does not work as well. While the cAE is theoretically able to further improve on the BNFs, this does not work in practice if only word pairs discovered by a UTD system are available. In future work we would like to further analyze the complementary nature of VTLN and cAE training, and explore the benefits of these multilingual BNFs for downstream zero-resource applications like speech-to-text translation.

5. Acknowledgements

We thank Andrea Carmantini for helping to set up multilingual training for the GlobalPhone corpus in Kaldi, and Herman Kamper for helpful feedback. The research was funded in part by a James S. McDonnell Foundation Scholar Award.

6. References

[1] M. Versteegh, R. Thiolliere, T. Schatz, X. N. Cao, X. Anguera, A. Jansen, and E. Dupoux, "The zero resource speech challenge 2015," in Proc. Interspeech, 2015, pp. 3169-3173.
[2] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, "The zero resource speech challenge 2017," in Proc. ASRU, 2017, pp. 323-330.
[3] K. Veselý, M. Karafiát, F. Grézl, M. Janda, and E. Egorova, "The language-independent bottleneck features," in Proc. SLT, 2012, pp. 336-341.
[4] N. T. Vu, W. Breiter, F. Metze, and T. Schultz, "An investigation on initialization schemes for multilayer perceptron training using multilingual data and their effect on ASR performance," in Proc. Interspeech, 2012, pp. 2586-2589.
[5] S. Thomas, S. Ganapathy, and H. Hermansky, "Multilingual MLP features for low-resource LVCSR systems," in Proc. ICASSP, 2012, pp. 4269-4272.
[6] J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audhkhasi et al., "Multilingual representations for low resource speech recognition and keyword search," in Proc. ASRU, 2015, pp. 259-266.
[7] T. Alumäe, S. Tsakalidis, and R. M. Schwartz, "Improved multilingual training of stacked neural network acoustic models for low resource languages," in Proc. Interspeech, 2016, pp. 3883-3887.
[8] Y. Yuan, C.-C. Leung, L. Xie, B. Ma, and H. Li, "Learning neural network representations using cross-lingual bottleneck features with word-pair information," in Proc. Interspeech, 2016, pp. 788-792.
[9] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, "A comparison of neural network methods for unsupervised representation learning on the Zero Resource Speech Challenge," in Proc. Interspeech, 2015, pp. 3199-3203.
[10] Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, "Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation," in Proc. ASRU, 2017, pp. 734-739.
[11] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, "Multilingual bottle-neck feature learning from untranscribed speech," in Proc. ASRU, 2017, pp. 727-733.
[12] Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, "Pairwise learning using multi-lingual bottleneck features for low-resource query-by-example spoken term detection," in Proc. ICASSP, 2017, pp. 5645-5649.
[13] H. Kamper, M. Elsner, A. Jansen, and S. Goldwater, "Unsupervised neural network based feature extraction using weak top-down constraints," in Proc. ICASSP, 2015, pp. 5818-5822.
[14] H. Shibata, T. Kato, T. Shinozaki, and S. Watanabe, "Composite embedding systems for ZeroSpeech 2017 Track 1," in Proc. ASRU, 2017, pp. 747-753.
[15] T. Schultz, N. T. Vu, and T. Schlippe, "GlobalPhone: A multilingual text & speech database in 20 languages," in Proc. ICASSP, 2013, pp. 8126-8130.
[16] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proc. HLT, 1992, pp. 357-362.
[17] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011.
[18] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. Interspeech, 2015, pp. 3214-3218.
[19] F. Grézl, M. Karafiát, and K. Veselý, "Adaptation of multilingual stacked bottle-neck neural network structure for new language," in Proc. ICASSP, 2014, pp. 7704-7708.
[20] A. Jansen and B. Van Durme, "Efficient spoken term discovery using randomized algorithms," in Proc. ASRU, 2011, pp. 401-406.
[21] A. Jansen, S. Thomas, and H. Hermansky, "Weak top-down constraints for unsupervised acoustic model training," in Proc. ICASSP, 2013, pp. 8091-8095.
[22] M. A. Carlin, S. Thomas, A. Jansen, and H. Hermansky, "Rapid evaluation of speech representations for spoken term discovery," in Proc. Interspeech, 2011, pp. 828-831.