Multilingual Bottleneck Features for Query by Example Spoken Term Detection


Authors: Dhananjay Ram, Lesly Miculicich, Herve Bourlard

Idiap Research Institute, Martigny, Switzerland
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

ABSTRACT

State of the art solutions to query by example spoken term detection (QbE-STD) usually rely on a bottleneck feature representation of the query and audio document to perform dynamic time warping (DTW) based template matching. Here, we present a study of QbE-STD performance using several monolingual as well as multilingual bottleneck features extracted from feed forward networks. Then, we propose to employ residual networks (ResNet) to estimate the bottleneck features and show significant improvements over the corresponding feed forward network based features. The neural networks are trained on the GlobalPhone corpus and QbE-STD experiments are performed on the very challenging QUESST 2014 database.

Index Terms — Multilingual feature, Bottleneck feature, Residual network, Multitask learning, Query by example, Spoken term detection, DTW, CNN, ResNet, QbE, STD

1. INTRODUCTION

Query-by-example spoken term detection (QbE-STD) is the task of detecting, in an archive of audio documents, those which contain a spoken query provided by a user. In contrast to textual queries in keyword spotting, QbE-STD uses spoken queries, which enables language independent search without the need for a full speech recognition system. The search is performed in the acoustic feature domain without any language specific resources, making it a zero-resource task. QbE-STD systems primarily involve the following two steps: (i) extract acoustic feature vectors from both the query and the audio document, and (ii) employ those features to compute the likelihood of the query occurring somewhere in the audio document as a sub-sequence.
Different types of acoustic features have been used for this task: spectral features [1, 2], posterior features (posterior probability vectors for phone or phone-like units) [3, 4], as well as bottleneck features [5, 6]. The matching likelihood is generally obtained by computing a frame-level similarity matrix between the query and each audio document using the corresponding feature vectors and employing a dynamic time warping (DTW) [4, 5] or convolutional neural network (CNN) based matching technique [7]. Several variants of DTW have been used: segmental DTW [1, 3], slope-constrained DTW [8], sub-sequence DTW [9], subspace-regularized DTW [10, 11], etc. State of the art performance has been achieved using bottleneck features with DTW [5].

(This research is funded by the Swiss NSF project 'PHASER-QUAD', grant agreement number 200020-169398.)

Bottleneck features [12, 13, 14] are low-dimensional representations of data, generally obtained from a hidden bottleneck layer of a feed forward network (FFN). This bottleneck layer has a smaller number of hidden units than the other layers. The smaller layer constrains information flow through the network, which forces it to focus on the information necessary to optimize the final objective. Bottleneck features have commonly been estimated from auto-encoders [12] as well as from FFNs trained for classification [13]. Language independent bottleneck features can be obtained using a multilingual objective function [14].

In this work, we present a performance analysis of different types of bottleneck features for QbE-STD. For this purpose, we train FFNs for phone classification using five languages to estimate five distinct monolingual bottleneck features. We also train multilingual FFNs using the multitask learning principle [15] in order to obtain language independent features.
We use combinations of three and five languages to analyze the effect of increasing language variation in training.

Previous studies have shown the effectiveness of convolutional neural networks (CNN) for acoustic modeling in speech recognition [16, 17]. Residual networks (ResNet) are a special kind of CNN which is effective for learning deeper architectures and has been shown to be very successful for image classification [18] as well as speech recognition [19, 20]. This inspired us to use ResNets instead of FFNs to estimate monolingual and multilingual bottleneck features for QbE-STD. To the best of our knowledge, this is the first attempt to use ResNets for bottleneck feature estimation.

In the rest of the paper, we present the multitask learning approach used to train the multilingual networks in Section 2. Then, we explain the monolingual and multilingual architectures using FFNs and ResNets in Sections 3 and 4 respectively. Later, we describe the experimental setup in Section 5, and we evaluate and analyze the performance of our models on the QUESST 2014 database in Section 6. Finally, we present our conclusions in Section 7.

2. MULTITASK LEARNING

Multitask learning [14, 15] has been used to exploit similarities across tasks, resulting in improved learning efficiency compared to training each task separately. Generally, the network architecture consists of a shared part and several task-dependent parts. In order to obtain multilingual bottleneck features, we model phone classification for each language as a different task; thus we have a language independent part and a language dependent part. The language independent part is composed of the first layers of the network, which are shared by all languages, forcing the network to learn common characteristics.
The language dependent part is modeled by the output layers (marked in red in Figures 1 and 2), and enables the network to learn particular characteristics of each language. In the following sections we present the different architectures that we use to obtain the multilingual bottleneck features, as well as monolingual ones for comparison.

3. FEED FORWARD NETWORKS

Feed forward networks have traditionally been used to obtain bottleneck features for speech related tasks [5, 13, 14]. Here, we describe the different architectures employed in this study, as shown in Figure 1:

(a) Monolingual: our monolingual FFN architecture consists of 3 fully connected layers of 1024 neurons each, followed by a linear bottleneck layer of 32 neurons, and a fully connected layer of 1024 neurons. The final layer feeds into the output layer of size c_i, corresponding to the number of classes (e.g. phones) of the i-th language.

(b) Multilingual (3 languages): this architecture consists of 4 fully connected layers of 1024 neurons each, followed by a linear bottleneck layer of 32 neurons. Then, a fully connected layer of 1024 neurons feeds into 3 output layers corresponding to the different training languages. The 3 output layers are language dependent while the rest of the layers are shared among the languages.

(c) Multilingual (5 languages): this architecture is similar to the previous one except that it uses an additional fully connected layer of 1024 neurons, and two extra output layers corresponding to the 2 new languages. The increased number of layers is intended to model the extra training data gained by adding languages.

Fig. 1. Monolingual and multilingual feed forward network architectures for extracting bottleneck features using multiple languages. c_i is the number of classes for the i-th language and n is the size of the input vector.

Fig. 2. Monolingual and multilingual residual network architectures for extracting bottleneck features using multiple languages. c_i is the number of classes for the i-th language.
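The multilingual FFN of Figure 1(b) can be sketched in PyTorch, which the authors use for their implementation. This is an illustrative reconstruction from the text, not the released code: the layer normalization and dropout mentioned in Section 5.2 are omitted, and the class counts match the PT, ES and RU languages used later.

```python
import torch
import torch.nn as nn

class MultilingualBottleneckFFN(nn.Module):
    """Shared trunk + 32-d linear bottleneck + one output head per language."""

    def __init__(self, input_dim=507, hidden=1024, bottleneck=32,
                 num_classes=(145, 130, 151)):  # PT, ES, RU class counts
        super().__init__()
        # Language independent part: 4 fully connected ReLU layers (Figure 1b).
        layers, dim = [], input_dim
        for _ in range(4):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        self.trunk = nn.Sequential(*layers)
        self.bottleneck = nn.Linear(hidden, bottleneck)  # linear: no ReLU here
        self.expand = nn.Sequential(nn.Linear(bottleneck, hidden), nn.ReLU())
        # Language dependent part: one classifier head per training language.
        self.heads = nn.ModuleList(nn.Linear(hidden, c) for c in num_classes)

    def forward(self, x, lang):
        """Return phone-class logits for language index `lang`."""
        h = self.expand(self.bottleneck(self.trunk(x)))
        return self.heads[lang](h)

    def extract_features(self, x):
        """Bottleneck features later used for QbE-STD matching."""
        return self.bottleneck(self.trunk(x))
```

During multitask training, each mini-batch updates the shared trunk together with exactly one language head; at test time only `extract_features` is used.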
4. RESIDUAL NETWORKS

A residual network [18] is a CNN with shortcut connections between its stacked layers. Skipping layers effectively simplifies training and gives flexibility to the network. Given an input matrix x and an output matrix y, it models the function y = f(x) + x in each stacked layer, where f(.) represents two convolutional layers with a non-linearity in between. In case the size of the output of f does not match the size of x, one linear convolutional layer is applied to x (implemented using 1x1 convolutions) before the addition operation. Finally, a non-linearity is applied to the summed output y.

Similar to the FFNs, we implemented 3 different architectures depending on the number of languages used for training. These architectures are shown in Figure 2. We use 3x3 filters for all convolution layers throughout the network. Every time we reduce the feature map size by half (using a stride of 2), we double the number of filters. Then we perform global average pooling to obtain a 256 dimensional vector. These vectors are passed through a fully connected linear bottleneck layer which feeds into another layer of size 256. This goes to a single or multiple output layers depending on the type of network: monolingual or multilingual. A smaller number of layers is used here in comparison to [18] due to the limited amount of training data.

5. EXPERIMENTAL SETUP

In this section, we describe the databases and the pre-processing steps used to perform the experiments. Then, we present the details of training the different neural networks.

5.1. Databases

GlobalPhone Corpus: GlobalPhone [21] is a multilingual speech database consisting of high quality recordings of read speech with corresponding transcriptions and pronunciation dictionaries in 20 different languages.
In this work, we use French (FR), German (GE), Portuguese (PT), Spanish (ES) and Russian (RU) to train monolingual as well as multilingual networks and estimate the corresponding bottleneck features for the QbE-STD experiments. These languages were chosen to have a complete mismatch between the training and test languages. We have an average of ~20 hours of training and ~2 hours of development data per language.

Query by Example Search on Speech Task (QUESST): The QUESST dataset [22] is part of the MediaEval 2014 benchmarking initiative and is used here to evaluate the performance of different bottleneck features for QbE-STD. It consists of ~23 hours of audio recordings (12492 files) in 6 languages as search corpus: Albanian, Basque, Czech, non-native English, Romanian and Slovak. The development and evaluation sets include 560 and 555 queries respectively, which were recorded separately from the search corpus. The development queries are used to tune the hyperparameters of the different systems. Three types of query occurrence are defined as a match in this dataset. Type 1: exact match of the lexical representation of a query; Type 2: slight lexical variations at the start or end of a query; Type 3: multiword query occurrence with different word order or filler content between words. (See [22] for more details.)

5.2. Neural Network Training

We use mel frequency cepstral coefficients (MFCC) with corresponding ∆ and ∆∆ features as input to the neural networks. The outputs are mono-phone based tied states (also known as pdfs in Kaldi [23]) corresponding to each language as presented in Section 5.1. The training labels for these networks are generated using GMM-HMM based speech recognizers [24, 25]. The numbers of classes corresponding to French, German, Portuguese, Spanish and Russian are 124, 133, 145, 130 and 151 respectively.
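The ∆ and ∆∆ inputs are standard regression-based delta features; a minimal numpy sketch, assuming the common half-window n = 2 (the paper does not state its exact delta configuration). With 13 static coefficients this yields the 39-dimensional frames used below.

```python
import numpy as np

def delta(feat, n=2):
    """Regression-based delta features over a (frames x dims) matrix.

    d_t = sum_{k=1..n} k * (c_{t+k} - c_{t-k}) / (2 * sum_{k=1..n} k^2),
    with edge frames padded by repetition.
    """
    denom = 2.0 * sum(k * k for k in range(1, n + 1))
    padded = np.pad(feat, ((n, n), (0, 0)), mode="edge")
    out = np.zeros_like(feat, dtype=float)
    t = len(feat)
    for k in range(1, n + 1):
        out += k * (padded[n + k:n + k + t] - padded[n - k:n - k + t])
    return out / denom

def add_deltas(mfcc):
    """Stack static, delta and delta-delta features: (frames x 3*dims)."""
    d1 = delta(mfcc)
    return np.hstack([mfcc, d1, delta(d1)])
```

On a linear ramp the interior delta values recover the per-frame slope, which is a quick sanity check for the window arithmetic.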
Note that we also trained these networks using tri-phone based senone classes; however, they perform worse than the mono-phone based training. All neural network architectures in this work are implemented using PyTorch [26].

Feed Forward Networks: The input to the FFNs is MFCC features with a context of 6 frames (both left and right), resulting in a 507 dimensional vector. We apply layer normalization [27] before the linear transforms and use the rectified linear unit (ReLU) as non-linearity after each linear transform, except in the bottleneck layer. We train these networks with a batch size of 255 samples and dropout of 0.1. In case of multilingual training, we use an equal number of samples from each language under consideration. The Adam optimization algorithm [28] is used with an initial learning rate of 10^-3 to train all networks by optimizing the cross entropy loss. The learning rate is halved every time the development set loss increases compared to the previous epoch, until a value of 10^-4. All networks were trained for 50 epochs.

Residual Networks: We construct the input for ResNet training using MFCC features with a context of 12 frames (both left and right), resulting in a 39x25 matrix with a single channel, in contrast to the 3 channel RGB images generally used in image classification tasks. We also conducted experiments arranging the input MFCC features in 3 channels: static, ∆ and ∆∆ values [16]; however, the performance was worse. Batch normalization [29] is applied after every convolution layer and ReLU is used as non-linearity. The networks are trained with a batch size of 255 samples and dropout of 0.05 for 50 epochs. We use the same learning rate schedule as the FFNs, with initial and final learning rates of 10^-3 and 10^-4 respectively.
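Combining Section 4's description with these training details, the residual building block y = f(x) + x can be sketched in PyTorch. This is a hedged reconstruction, not the authors' code: the exact filter counts and layer counts were tuned on development data and are not fully specified in the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = relu(f(x) + x), where f is two 3x3 conv layers with batch norm
    and a non-linearity in between (Section 4).

    When stride > 1 or the channel count changes, a 1x1 convolution projects
    x so that the addition is well defined.
    """

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:
            # 1x1 convolution matching the size of x to the size of f(x).
            self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))
```

A stride-2 block halves the feature map while the channel count is doubled, matching the "halve the map, double the filters" rule above.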
The numbers of layers for both the FFN and ResNet architectures, for the different monolingual and multilingual networks, are optimized using the development queries to give the best QbE-STD performance. The input context size for these networks is optimized as well by varying it from 4 to 14. We observed that the optimal context sizes for the FFN and ResNet are 6 and 12 respectively. The performance gain of the ResNet models over the FFN models (as we will see in Section 6) indicates that ResNets are better equipped than FFNs to capture information from longer temporal context.

5.3. DTW for Template Matching

The trained neural networks are used to estimate bottleneck features for DTW. As a pre-processing step, we implement a speech activity detector (SAD) by utilizing silence and noise class posterior probabilities obtained from three different phone recognizers (Czech, Hungarian and Russian) [30] trained on the SpeechDAT(E) database [31]. Those posterior probabilities are averaged and compared with the rest of the phone class probabilities to find and remove the noisy frames. Any audio file with fewer than 10 frames after SAD is not considered for the experiments.

The DTW system presented in [4] is used here to compute the matching score for a query and audio document pair. It utilizes cosine similarity to obtain the frame-level distance matrix for a query and an audio document. This DTW algorithm is similar to slope-constrained DTW [8], where the optimal warping path is normalized by its partial path length at each step, and constraints are imposed so that the warping path can start and end at any point in the audio document. The scores generated by the DTW system are normalized to have zero mean and unit variance per query in order to reduce variability across different queries [4].
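A minimal numpy sketch of the matching pipeline just described: cosine similarity, a DTW pass, and per-query score normalization. This is an illustration, not the system of [4]: the slope constraints, the partial path length normalization at each step, and the free start/end points inside the document are omitted here.

```python
import numpy as np

def cosine_similarity_matrix(query, doc):
    """Frame-level similarity between query (m x d) and document (n x d) features."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    d = doc / np.linalg.norm(doc, axis=1, keepdims=True)
    return q @ d.T  # shape (m, n)

def dtw_cost(query, doc):
    """Length-normalized DTW cost over the cosine distance matrix (lower = better match)."""
    dist = 1.0 - cosine_similarity_matrix(query, doc)
    m, n = dist.shape
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # consume a query frame
                acc[i, j - 1],      # consume a document frame
                acc[i - 1, j - 1])  # consume both
    return acc[m, n] / (m + n)

def normalize_scores_per_query(scores):
    """Zero-mean, unit-variance normalization of one query's scores over all
    audio documents, to reduce variability across queries [4]."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()
```

Matching a query against itself gives a cost of (numerically) zero, which is a useful sanity check for the recursion.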
5.4. Evaluation Metrics

The minimum normalized cross entropy (Cnxe) is used as the primary metric and the maximum term weighted value (MTWV) as the secondary metric to compare the performance of different bottleneck features for QbE-STD [32]. The costs of false alarm (C_fa) and missed detection (C_m) for MTWV are set to 1 and 100 respectively. A one-tailed paired-samples t-test is conducted to evaluate the significance of performance improvements. Additionally, detection error tradeoff (DET) curves are used to compare the detection performance of different systems over a given range of false alarm probabilities.

6. EXPERIMENTAL ANALYSIS

In this section, we report and analyze the QbE-STD performance using various bottleneck features estimated from our FFN and ResNet models. Previously, the best performance on the QUESST 2014 database was obtained using monolingual bottleneck features estimated using FFNs [5]. We implemented those models to compare with the multilingual features as well as the corresponding ResNet based models.

6.1. Monolingual Feature Performance

We train five different monolingual networks for both architectures, FFN and ResNet, corresponding to the PT, ES, RU, FR and GE languages from the GlobalPhone database. We evaluate the features estimated with these networks using QbE-STD as described in Section 5.3. Similar to [5], we did not employ any specific strategies to deal with the different types of queries in QUESST 2014. The results are presented using the Cnxe and MTWV metrics in Table 1. We can see that the ResNet based bottleneck features perform better than most of the FFN based features in terms of the Cnxe metric, except for T3 queries with FR, ES and RU features, where the performances are close. We also observe that the PT features perform best for both FFN and ResNet.
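The MTWV values reported here come from sweeping a decision threshold over the normalized detection scores; a simplified sketch, assuming the standard term weighted value definition TWV(t) = 1 - P_miss(t) - beta * P_fa(t) with beta derived from the costs of Section 5.4 and an illustrative occurrence prior p_tar (the prior is not stated in this paper, and the official QUESST scoring tools additionally handle per-query weighting and the Cnxe computation).

```python
import numpy as np

def max_term_weighted_value(scores, labels, c_fa=1.0, c_miss=100.0, p_tar=0.0008):
    """Sweep thresholds over detection scores; labels are 1 for true occurrences.

    TWV(t) = 1 - P_miss(t) - beta * P_fa(t),
    beta   = (c_fa / c_miss) * (1 - p_tar) / p_tar.
    p_tar is an assumed prior of a query occurring (illustrative value).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    beta = (c_fa / c_miss) * (1.0 - p_tar) / p_tar
    best = -np.inf
    for t in np.unique(scores):  # every distinct score is a candidate threshold
        hits = scores >= t
        p_miss = np.mean(~hits[labels]) if labels.any() else 0.0
        p_fa = np.mean(hits[~labels]) if (~labels).any() else 0.0
        best = max(best, 1.0 - p_miss - beta * p_fa)
    return best
```

A perfectly separating scorer reaches an MTWV of 1.0; random scores are pulled sharply negative by the large beta, which is why MTWV is so sensitive to false alarms.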
6.2. Multilingual Feature Performance

We present the results of our multitask learning based multilingual systems and compare their performance with a simple monolingual feature concatenation approach.

Multitask Learning: We implement two multilingual networks for each of the FFN and ResNet architectures discussed in Sections 3 and 4, using 3 languages (PT, ES, RU) and 5 languages (PT, ES, RU, FR, GE). The 3 language network uses the best performing monolingual training languages. The performance of the features extracted from these networks is shown in Table 1. Clearly, the ResNet based bottleneck features provide significant improvement over the corresponding FFN based features. We also observe that the PT-ES-RU-FR-GE features significantly outperform the PT-ES-RU features for both the FFN and ResNet models, indicating that additional training languages provide better language independent features.

Feature Concatenation: Another way of utilizing training resources from multiple languages is to concatenate the monolingual bottleneck features to perform DTW. We perform two sets of experiments by concatenating monolingual features from the PT-ES-RU and FR-GE-PT-ES-RU languages for both FFN and ResNet. The results are presented in Table 1. We can see that there is marginal improvement over the best monolingual feature (PT) from the FFN model; a similar observation was presented in [5]. On the other hand, the ResNet based features (PT-ES-RU) perform significantly better than the corresponding PT features. However, there is no significant performance difference between the ResNet based 3 and 5 language feature concatenations.

Table 1. Performance of the QbE-STD system on the QUESST 2014 database using various monolingual and multilingual bottleneck features for different types of evaluation queries. Cnxe (lower is better) and MTWV (higher is better) are used as evaluation metrics.
                                    T1 Queries        T2 Queries        T3 Queries
Feature                   System    Cnxe    MTWV      Cnxe    MTWV      Cnxe    MTWV
Portuguese (PT)           FFN       0.5582  0.4671    0.6814  0.3048    0.8062  0.1915
                          ResNet    0.5405  0.4698    0.6607  0.2747    0.7954  0.1802
Spanish (ES)              FFN       0.5788  0.4648    0.7074  0.2695    0.8361  0.1612
                          ResNet    0.5718  0.4465    0.7043  0.2613    0.8465  0.1462
Russian (RU)              FFN       0.6119  0.4148    0.7285  0.2434    0.8499  0.1385
                          ResNet    0.5728  0.4405    0.7017  0.2481    0.8525  0.1346
French (FR)               FFN       0.6266  0.4242    0.7462  0.2086    0.8522  0.1249
                          ResNet    0.5957  0.4225    0.7017  0.2216    0.8540  0.1267
German (GE)               FFN       0.6655  0.3481    0.7786  0.1902    0.8533  0.1038
                          ResNet    0.6389  0.3803    0.7511  0.2230    0.8497  0.1166
Concat: PT-ES-RU          FFN       0.5450  0.4957    0.6665  0.2985    0.8053  0.1869
                          ResNet    0.5072  0.5164    0.6374  0.3162    0.7965  0.1899
Concat: PT-ES-RU-FR-GE    FFN       0.5457  0.4965    0.6715  0.2903    0.8079  0.1930
                          ResNet    0.5040  0.5201    0.6309  0.3212    0.7941  0.1914
Multiling: PT-ES-RU       FFN       0.4828  0.5459    0.6218  0.3626    0.7849  0.2057
                          ResNet    0.4554  0.5666    0.6009  0.3529    0.7650  0.2201
Multiling: PT-ES-RU-FR-GE FFN       0.4606  0.5663    0.6013  0.3605    0.7601  0.2138
                          ResNet    0.4345  0.5962    0.5703  0.3815    0.7387  0.2487

Table 2. Number of parameters for the different monolingual and multilingual models using the FFN and ResNet architectures.

Model                   FFN     ResNet
Monolingual             ~1.8M   ~663K
Multilingual: 3-lang    ~3.1M   ~1.4M
Multilingual: 5-lang    ~4.4M   ~3.0M

We also observe that the multitask learning based features significantly outperform the monolingual feature concatenation, indicating the importance of multitask learning for utilizing training resources from multiple languages.

The number of parameters for the different models is shown in Table 2. We observe that the FFNs have more parameters than the ResNet architectures. The improved performance of the ResNet models in comparison to the FFNs indicates that the ResNet architecture produces better bottleneck features in spite of having fewer parameters.
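Parameter counts like those in Table 2 can be obtained directly from a PyTorch model; a small helper for reference (the layer shown is illustrative, not one of the paper's full models):

```python
import torch.nn as nn

def count_parameters(model):
    """Total number of trainable parameters of a model, as compared in Table 2."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example: a single 1024 -> 32 linear bottleneck layer contributes
# 1024 * 32 weights + 32 biases = 32800 parameters.
bottleneck = nn.Linear(1024, 32)
```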
6.3. Monolingual vs Multilingual Features

The 3 language multilingual features provide an average absolute gain of 5.2% and 5.8% (in Cnxe) for the FFN and ResNet models respectively, in comparison to the corresponding best monolingual features. Further absolute improvements of 2.3% and 2.5% are observed when using 2 more languages for training. In order to compare the missed detection rates over a given range of false alarm rates, we present the DET curves corresponding to these systems in Figure 3. We see a similar trend of performance improvement here as well. We also observe that the performance gain is higher from 1 language to 3 languages than from 3 languages to 5 languages. This is due to our use of the best performing languages to train the 3 language network.

Fig. 3. DET curves showing the performance of monolingual and multilingual features estimated using FFNs and ResNets for T1 queries of QUESST 2014.

Fig. 4. Comparison of QbE-STD performance on language specific evaluation queries (T1 queries) using Cnxe values.

6.4. Language Specific Performance

We compare the language specific query performance of the ResNet based monolingual and multilingual features, as they perform better than their FFN counterparts. We use the Cnxe values of the T1 query performance to show this comparison in Figure 4. We observe that the performance improves with more training languages; however, the amount of improvement varies with the language of the query. The smaller performance gain from 3 to 5 languages for some queries (e.g. Albanian, Czech, Slovak) can be attributed to the much worse performance of the FR and GE features compared to the rest of the monolingual features.
7. CONCLUSIONS

We proposed a ResNet based neural network architecture to estimate monolingual as well as multilingual bottleneck features for QbE-STD, and presented a performance analysis of these features using both ResNets and FFNs. It shows that additional training languages improve performance and that ResNets perform better than FFNs for both monolingual and multilingual features. Further analysis shows that the improvement is consistent across queries of different languages. In the future, we plan to train deeper ResNets with more languages to compute and analyze the language independence of those features. The improved bottleneck features can also be used for other relevant tasks, e.g. unsupervised unit discovery.

8. REFERENCES

[1] Alex S. Park and James R. Glass, "Unsupervised pattern discovery in speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 186–197, 2008.

[2] Chun-an Chan and Lin-shan Lee, "Model-based unsupervised spoken term detection with spoken queries," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1330–1342, 2013.

[3] Yaodong Zhang and James R. Glass, "Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams," in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2009, pp. 398–403.

[4] Luis Javier Rodriguez-Fuentes, Amparo Varona, Mikel Penagarikano, Germán Bordel, and Mireia Diez, "High-performance query-by-example spoken term detection on the SWS 2013 evaluation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7819–7823.

[5] Igor Szöke, Miroslav Skácel, Lukáš Burget, and Jan Černocký, "Coping with channel mismatch in query-by-example — BUT QUESST 2014," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5838–5842.
[6] Hongjie Chen, Cheung-Chi Leung, Lei Xie, Bin Ma, and Haizhou Li, "Unsupervised bottleneck features for low-resource query-by-example spoken term detection," in INTERSPEECH, 2016, pp. 923–927.

[7] Dhananjay Ram, Lesly Miculicich, and Hervé Bourlard, "CNN based query by example spoken term detection," in Proceedings of the Nineteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018.

[8] Timothy J. Hazen, Wade Shen, and Christopher White, "Query-by-example spoken term detection using phonetic posteriorgram templates," in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2009, pp. 421–426.

[9] Meinard Müller, Information Retrieval for Music and Motion, vol. 2, Springer, 2007.

[10] Dhananjay Ram, Afsaneh Asaei, and Hervé Bourlard, "Sparse subspace modeling for query by example spoken term detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 6, pp. 1130–1143, June 2018.

[11] Dhananjay Ram, Afsaneh Asaei, and Hervé Bourlard, "Subspace regularized dynamic time warping for spoken query detection," in Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS), 2017.

[12] Geoffrey E. Hinton and Ruslan R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.

[13] Dong Yu and Michael L. Seltzer, "Improved bottleneck features using pretrained deep neural networks," in Twelfth Annual Conference of the International Speech Communication Association, 2011.

[14] Karel Veselý, Martin Karafiát, František Grézl, Miloš Janda, and Ekaterina Egorova, "The language-independent bottleneck features," in 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 336–341.

[15] Rich Caruana, "Multitask learning," Machine Learning, vol. 28, no. 1, pp. 41–75, 1997.
[16] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.

[17] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4580–4584.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[19] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, "Achieving human parity in conversational speech recognition," arXiv preprint arXiv:1610.05256, 2016.

[20] Yu Zhang, William Chan, and Navdeep Jaitly, "Very deep convolutional networks for end-to-end speech recognition," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4845–4849.

[21] Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe, "GlobalPhone: A multilingual text & speech database in 20 languages," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8126–8130.

[22] Xavier Anguera, Luis Javier Rodriguez-Fuentes, Igor Szöke, Andi Buzo, and Florian Metze, "Query by example search on speech at MediaEval 2014," in MediaEval, 2014.

[23] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[24] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[25] Sibo Tong, Philip N. Garner, and Hervé Bourlard, "An investigation of deep neural networks for multilingual speech recognition training and adaptation," in Proceedings of the Eighteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017.

[26] Adam Paszke, Sam Gross, and Soumith Chintala, "PyTorch," 2017, [online] http://pytorch.org/.

[27] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.

[28] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[29] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.

[30] Petr Schwarz, Phoneme Recognition Based on Long Temporal Context, Ph.D. thesis, Faculty of Information Technology, BUT, 2008.

[31] Petr Pollák, Jerome Boudy, Khalid Choukri, Henk van den Heuvel, Klara Vicsi, Attila Virag, Rainer Siemund, Wojciech Majewski, Piotr Staroniewicz, Herbert Tropf, et al., "SpeechDat(E) — Eastern European telephone speech databases," in Proc. of XLDB 2000, Workshop on Very Large Telephone Speech Databases. Citeseer, 2000.

[32] Luis J. Rodriguez-Fuentes and Mikel Penagarikano, "MediaEval 2013 Spoken Web Search task: system performance measures," Technical Report TR-2013-1, Department of Electricity and Electronics, University of the Basque Country, 2013.
