How to Improve Your Speaker Embeddings Extractor in Generic Toolkits
Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper we aim to facilitate its implementation on a more generic toolkit than Kaldi, which we anticipate to enable further improvements on the method.
Authors: Hossein Zeinali, Lukáš Burget, Johan Rohdin, Themos Stafylakis, Jan "Honza" Černocký
HOW TO IMPROVE YOUR SPEAKER EMBEDDINGS EXTRACTOR IN GENERIC TOOLKITS

Hossein Zeinali¹, Lukáš Burget¹, Johan Rohdin¹, Themos Stafylakis², Jan "Honza" Černocký¹
¹ Brno University of Technology, Speech@FIT and IT4I Center of Excellence, Czech Republic
² Omilia - Conversational Intelligence, Athens, Greece

ABSTRACT

Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper we aim to facilitate its implementation on a more generic toolkit than Kaldi, which we anticipate to enable further improvements on the method. We examine several tricks in training, such as the effects of normalizing input features and pooled statistics, different methods for preventing overfitting, as well as alternative non-linearities that can be used instead of Rectified Linear Units. In addition, we investigate the difference in performance between TDNN and CNN, and between two types of attention mechanism. Experimental results on Speakers in the Wild, SRE 2016 and SRE 2018 datasets demonstrate the effectiveness of the proposed implementation.

Index Terms — Deep neural network, speaker embedding, x-vector, TensorFlow, Kaldi.

1. INTRODUCTION

For several years, the i-vector representation of a variable-length speech signal together with Probabilistic Linear Discriminant Analysis (PLDA) has been the state of the art in text-independent speaker verification (TI-SV) [1, 2], yielding very good results in other tasks too, such as language identification [3], text-dependent SV [4, 5] and even in non-speech tasks such as online signature verification [6]. In recent years, novel deep learning approaches have emerged which outperform the traditional i-vector/PLDA framework.
Deep learning methods for speaker recognition can be summarized into four categories: (a) methods applied to fixed utterance-level representations (typically i-vectors), such as non-linear mappings and backend classifiers [7, 8]; (b) i-vectors with Baum-Welch statistics or frame-level features (e.g. bottleneck features) extracted with Deep Neural Networks (DNNs) trained for ASR (i.e. with phonetic recognition units as targets) [9, 10, 11]; (c) fully end-to-end DNN approaches, where Siamese DNNs learn directly to approximate the posterior probability of two or more utterances belonging to the same speaker [12]; and (d) semi end-to-end approaches, where DNNs with either a closed-set speaker identification architecture (using a softmax over a large number of training speakers) or with a Siamese architecture are trained, and utterance-level representations (embeddings) are extracted and fed to a trainable backend classifier (typically PLDA) [13, 14]. To the best of our knowledge, the performance of the latter category is the current state of the art in most (if not all) speaker recognition benchmarks [13].

In this paper, we demonstrate how to train a speaker embedding system in a general-purpose deep learning framework and attain comparable (or even better) performance compared to the original Kaldi version [13]. Developing new ideas and combining other proposed methods with the x-vector topology is easier in such toolkits, and this is the main motivation for sharing our experience with other researchers. Several papers have been published showing how to train speaker embedding systems in terms of different data augmentation methods and the amount of required training data [15, 16], but the aim of this paper is to show how to implement an x-vector topology in the TensorFlow toolkit, proposing several tricks to improve the performance of speaker embeddings, and empirically evaluating the effectiveness of each trick.

2. SYSTEM SETUP

In this paper, we focus on the speaker embedding training part of the x-vector pipeline; the Kaldi toolkit is used for the other parts of the pipeline. Our features are 23-dimensional MFCCs, extracted from 25 ms windows with short-time mean normalization. Unvoiced frames are eliminated using an energy-based VAD. For creating training archives¹ for TensorFlow, we use our own implementation, which produces archives very similar to Kaldi's, except that we save minibatches in numpy arrays which are stored in tar files. For a fair comparison, all configurations and the number of training archives are the same for both Kaldi and TensorFlow, and the same Kaldi backend is used for both implementations.

For training the network we use the Adam optimizer [17] in almost all cases. The initial learning rate is set to 0.001 and linearly reduced to 0.0001. We use 3 epochs for network training. We checked 6 epochs for some systems, but almost all of them overfitted more to the training speakers. In [15], it was mentioned that 6 epochs is better for Kaldi, and our experiments also confirm it, but this is not the case for our TensorFlow implementation.

2.1. Training data and augmentation

The training data we use in this paper is the list prepared for the NIST SRE 2018 closed condition and consists of: 1) SREs 4-8 and SRE12, 2) the telephony part of Mixer6, 3) Fisher English, 4) all Switchboard data, and 5) VoxCeleb 1 and 2. For both VoxCeleb sets, the concatenated version of each session is used.

The following data augmentation methods are used in this paper. Apart from the four augmentation methods used in [13], we also include audio compression using ogg and mp3 codecs. Finally, the training data consists of a 3-fold augmentation that combines clean data with 2 copies of augmented data, which are selected randomly.

• Reverberation: Artificially reverberated data using convolution with simulated RIRs.
• Babble: Several speakers are randomly selected from MUSAN [18] speech and their summation is added to the original signal with SNR between 13-20 dB.
• Music: A random music file from MUSAN is added to the original signal with random SNR between 5-15 dB.
• Noise: MUSAN noises are added at one-second intervals throughout the recording with random SNR between 0-15 dB.
• Compression: The original signal is randomly compressed (using the ogg or mp3 codec) and subsequently converted back to raw format.

¹ In Kaldi, the network training examples are split into several files called archives.

2.2. Evaluation data

We evaluate the different networks on three datasets: the Speakers in the Wild (SITW) Core-Core condition downsampled to 8 kHz [19], NIST SRE 2016, and the development set of NIST SRE 2018 for Tunisian Arabic (CMN2)². The SITW dataset contains recordings extracted from English-language videos, while both SRE 2016 and SRE 2018 are conversational telephone speech. We removed the overlapping speakers between SITW and VoxCeleb1 from the training data.

2.3. PLDA backend

We use a Gaussian PLDA model as backend classifier. For both SRE 2016 and SRE 2018, the PLDA model is adapted to the unlabeled development data using the unsupervised adaptation of Kaldi, and the mean of the unlabeled data is used to center the x-vectors prior to scoring. For training the PLDA model, a list containing all SREs, Switchboard, Mixer6 and their corresponding augmented versions is used, resulting in about 290 thousand utterances overall. This set is also used for SITW, although it is a suboptimal choice for this set. Moreover, no adaptation technique is used for this set apart from centering the x-vectors using the mean of the SITW development part.

3. TOPOLOGY AND TRICKS

Implementing DNN-based methods and replicating published results is a challenging task.
In this case, an additional burden is the fact that the original method (x-vector speaker embedding [13]) is implemented in a custom and carefully tuned toolkit (Kaldi) with several unclear tricks for achieving such good performance. Here, we try to keep the overall topology the same as the original Kaldi model, and we investigate the effect of several tricks for boosting its performance for TI-SV. Table 1 shows the overall topology, which is very close to the original paper [13] and is used as our baseline. In the following, the investigated parts of the network and the tricks are discussed.

² Note that these results will be replaced with the SRE 2018 test set.

Table 1. Deep neural network topology for x-vector extraction. Here CNNs are used for the second and third frame-level layers instead of TDNNs.

Layer         | Layer context | Kernel × Input × Output
Frame1        | [t−2, t+2]    | 5 × 23 × 512
Frame2        | [t−2, t+2]    | 5 × 512 × 512
Frame3        | [t−3, t+3]    | 7 × 512 × 512
Frame4        | [t]           | 1 × 512 × 512
Frame5        | [t]           | 1 × 512 × 1536
Stats pooling | [1, T]        | 1536 × 3072
Segment1      | –             | 3072 × 512
Segment2      | –             | 512 × 512
Softmax       | –             | 512 × N

3.1. Normalizing input features

It has been shown that normalizing input features has a positive effect on the performance of deep neural networks. Here, our MFCC features are mean-normalized using a sliding window. Therefore, the overall features are not normalized, and the question arises whether it is useful to normalize the input features. For this reason, two different methods are investigated here. In the first one, features are simply normalized before feeding them to the network, using mean and standard deviation calculated on a subset of the training data. In the second method, a Batch-Normalization (BN) layer is added to the input of the network. In the first method, the normalization parameters are kept fixed during training, while in the second method they are learned by the network.

3.2. Normalizing pooled statistics

In the original x-vector topology [13], all layers are followed by a BN layer except the statistics pooling layer. So, the question here is what happens if a BN layer is also added after the statistics pooling layer.

3.3. Order of non-linearity and BN

Batch-Normalization (BN) is a useful method that helps training deeper networks with fewer epochs and a higher learning rate. In [20], the BN layer is placed before the non-linearity, while in the x-vector topology it is placed after the non-linearity [13]. Here, we examine the order of the BN layer and the non-linearity to show the difference in performance.

3.4. Avoiding overfitting using dropouts and L2-regularization

After evaluating the first x-vector implementation in TensorFlow, we observed overfitting to the training speakers compared to the Kaldi version. Taking segment-level classification accuracy as the measure of overfitting, our implementation attains about 10% better segment accuracy than Kaldi for the same training data (i.e. about 95% compared to 85%, respectively), and the SV performance of the TensorFlow version is also inferior to that of the Kaldi version in some cases. We therefore examine several methods to prevent the network from overfitting.

The first regularization method we examine is dropout [21], where we test several dropout probabilities. A second method for preventing overfitting is L2-regularization (also known as L2 weight decay), which penalizes large weight values, i.e.

L′ = L + (β/2) ‖W‖₂²

where the best value for β should be found empirically. Here, our aim is to answer several questions: for which layers L2-regularization should be used, and how much it should participate in the optimization loss (i.e. the value of β).

3.5. Feature augmentation using Gaussian noise

As mentioned in the introduction, several papers investigate the effects of different data augmentations [15, 16].
Here we show the effect of adding Gaussian noise to the features during training. This augmentation is performed in order to reduce overfitting to the training speakers and has a long history in the literature [22, 23].

3.6. Different types of non-linearity

After adding L2-regularization, we observed sparse x-vector representations due to Rectified Linear Unit (ReLU) saturation. The problem occurred for some dimensions where the ReLU inputs were always negative, so the ReLU layer produced only zero outputs. With L2-regularization added, the optimizer then drives the corresponding weights to zero. As a result, the extracted x-vectors were sparse. Several alternative non-linearities have been proposed to replace ReLU, of which we test Leaky-ReLU (LReLU) and Parametric-ReLU (PReLU) [24, 25]. In LReLU, instead of having zero slope on the negative side of the non-linearity, a small constant slope is used, while in PReLU the slope for the negative region is a trainable parameter and can vary independently for each dimension (making it more vulnerable to overfitting).

3.7. Comparison between TDNN and CNN

In the original x-vector paper [13], Time Delay Neural Network (TDNN) layers are used in the second and third layers of the network. Here, we investigate the differences between TDNN and Convolutional Neural Network (CNN) layers, both in performance and in training and evaluation efficiency. A TDNN is a special case of a 1-dimensional CNN where, instead of using all frames in the context window (convolution window), only some specific frames are used (here the first, middle and last frames of the window).

3.8. Using two types of attention

Attention mechanisms for speaker verification have been investigated in recent papers. In [26], several methods were proposed for using attention in an LSTM-based text-dependent speaker verification system.
A slightly different strategy for adding attention to the x-vector topology was proposed in [27], where single- and multi-head attention were investigated for TI-SV. Here, we only consider single-head attention, in two modes. The first one is the same as [27], while for the second one we doubled the size of the last hidden layer before pooling and split its dimensions equally into two parts, as in [26], using the first part for calculating the attention weights (i.e. keys) and the second part for calculating the mean and standard deviation statistics (i.e. values), following the formulas suggested in [28].

4. EXPERIMENTS AND RESULTS

In order to draw reliable conclusions about each trick described in the previous section, we performed several experiments. Reporting results for all of them is not possible, hence we only report the most important ones in Table 2 and summarize the rest in the text.

The first set of experiments is related to normalizing the input features. Adding a BN layer to the input of the network degrades the performance in most cases, while normalizing features using global mean and variance normalization improves the performance in about half of the cases. Variance normalization of the input features is not important, which is in line with the Kaldi implementation, where only mean normalization is applied [13].

Normalizing statistics using a BN layer shows a similar trend to normalizing input features, and its results were not consistent across all cases. Adding a BN layer after the stats pooling layer when using the Adam optimizer slightly improves the performance in some cases. However, in our experiments with the SGD optimizer, normalizing the statistics using mean and standard deviation calculated during a few initial iterations improves the performance. So, it seems this trick depends on which optimizer is used. From here on, neither input feature normalization nor statistics normalization is used.
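For reference, the statistics pooling operation discussed above can be sketched in plain Python. This is a minimal illustration only: the actual system operates on TensorFlow tensors, and the function name here is ours, not from the paper.

```python
import math

def stats_pooling(frames):
    """Pool a variable-length sequence of frame-level feature vectors
    into a fixed utterance-level vector of per-dimension means and
    standard deviations, as in the x-vector statistics pooling layer."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 0.0))
            for d in range(dim)]
    return means + stds  # concatenation doubles the dimension (1536 -> 3072)

# Toy example: 3 frames of 2-dimensional features.
pooled = stats_pooling([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
```

Adding the BN layer investigated in Section 3.2 would simply standardize this pooled vector before the first segment-level layer.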
Investigating the order of the non-linearity and the BN layer showed that using BN immediately after the non-linearity yields better performance for speaker embeddings, while in other fields, such as image classification [29] and acoustic scene classification (ASC) [30, 31], the BN layer is usually placed immediately before the non-linearity. Our previous experiments in ASC also confirmed that for 2-dimensional CNN networks it is better to use BN before ReLU, while for the x-vector topology (i.e. a 1-dimensional network) it is better to put it after the non-linearity [32].

As explained in Section 3.4, we tried to use dropout to reduce overfitting to the training speakers. Dropout has been shown to improve generalization for classification tasks; however, our task is to learn speaker representations. Although we observed improved speaker classification performance on our cross-validation data, the speaker verification performance with the extracted x-vectors degraded for most of the tested dropout probabilities. Also, in our previous work on x-vector based ASC [32], dropout helped the performance. It seems that dropout is useful for classification tasks but not for learning utterance embeddings.

Table 2 reports a few results of different systems to better compare the gain attained by each technique. The first section of the table shows the results of the Kaldi toolkit. The first row shows the Kaldi original recipe for SRE16, to which SITW and the SRE18 CMN2 development set were added with exactly the same training data. By comparing the results of this row with the second row, it is clear that on average about 15% relative improvement can be attained by adding more training data and augmentation (or simply having more training speakers). The third row of the table shows the results of the Kaldi toolkit when CNN layers are used in the second and third layers of the network instead of TDNN.
In this case, the performance is quite similar to the TDNN, while training the CNN version needs about 35% more time and extracting embeddings from the network is about 20% slower.

Table 2. Comparison results of different systems and implementations. All networks use CNN except those explicitly marked TDNN. L2 means applying L2-regularization, Att means using the attention mechanism in the network, and Noise means adding Gaussian noise during training. For each condition the two columns are EER and DCFmin_0.01 (for SRE18 CMN2: EER and Cmin_prm).

System                    | SITW core-core | SRE16, All  | SRE16, Tagalog | SRE16, Cantonese | SRE18, CMN2
Kaldi recipe, ReLU, TDNN  | 6.45  0.543    | 8.84  0.604 | 12.72  0.764   | 5.02  0.409      | 9.16  0.578
Kaldi, ReLU, TDNN         | 5.03  0.482    | 8.02  0.566 | 11.79  0.738   | 4.38  0.383      | 7.30  0.501
Kaldi, ReLU               | 4.98  0.479    | 7.81  0.566 | 11.56  0.740   | 4.18  0.357      | 7.44  0.504
TF, ReLU, TDNN            | 5.08  0.500    | 7.72  0.573 | 11.47  0.743   | 4.08  0.359      | 7.92  0.531
TF, ReLU                  | 5.33  0.517    | 7.87  0.583 | 11.62  0.756   | 4.15  0.362      | 7.63  0.520
TF, L2, ReLU              | 4.84  0.471    | 7.59  0.568 | 11.24  0.747   | 4.02  0.355      | 7.57  0.517
TF, L2, PReLU             | 4.78  0.480    | 7.39  0.563 | 11.01  0.742   | 3.86  0.336      | 7.89  0.515
TF, L2, LReLU             | 4.73  0.467    | 7.40  0.550 | 11.08  0.722   | 3.79  0.340      | 7.51  0.485
TF, L2, LReLU, Att        | 4.54  0.448    | 7.06  0.539 | 10.70  0.716   | 3.47  0.324      | 7.42  0.517
TF, L2, LReLU, Att, Noise | 4.56  0.459    | 7.20  0.543 | 10.74  0.710   | 3.66  0.349      | 6.90  0.485

The second section of Table 2 shows the baseline results of our TF implementation. Comparing these results with the Kaldi version shows that our implementation is comparable with Kaldi: sometimes better and sometimes worse. Here, again, the difference between CNN and TDNN is small and they performed almost the same.

In the last section of the table, we report results using the different tricks for improving our x-vector system. In the first system, L2-regularization was applied to the CNN network (i.e. the sixth row of the table). We investigated several configurations for adding L2-regularization. In the simplest one, L2-regularization was applied to all weights of the network, while in the second case it was only added to the segment level of the network (i.e. all layers after pooling). Experimental results have shown that the latter case is better, and we only consider this case from now on. As explained before, after adding L2-regularization we obtained sparse x-vectors. To solve this problem, we first removed the L2-regularization of the embedding layer, which degraded the performance. We also tested a smaller coefficient β for this layer and found it to be better. Empirically, β was set to 0.00002 for the embedding layer and 0.0002 for the other weights in the segment level of the network. Comparing the results of the fifth and sixth rows of the table shows that this simple technique improves the performance by about 6% relative on average.

Although the smaller L2-regularization coefficient performs better, it did not solve the x-vector sparsity. To solve this, we evaluated two other versions of ReLU, whose results are shown in the second and third rows of this section. For LReLU, we simply selected 0.2 for the slope of the negative part and did not check other values. It is clear that both non-linearities perform better than ReLU, and LReLU performs slightly better. In theory, PReLU should perform better because it learns the slope from the data, but it seems to have overfitted more to the training speakers.

The two types of attention mechanism described in Section 3.8 were evaluated in this work, and we found the variant with separate activations for calculating the attention weights and the pooled statistics to perform better. The ninth row of the table shows the result of this configuration. This method improves the verification performance for most of the conditions, while it increases the computation cost by about 100% in our case.
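The better-performing attention variant (separate keys and values, Section 3.8) can be sketched as follows. This is a simplified plain-Python illustration, not the actual implementation: the small attention network is replaced here by a single linear scoring vector `w` (our assumption), and the weighted mean and standard deviation follow the attentive statistics pooling formulas of [28].

```python
import math

def attentive_stats_pooling(keys, values, w):
    """Single-head attentive statistics pooling: per-frame attention
    scores are computed from the 'keys' half of the hidden layer,
    softmax-normalized over time, and used to pool weighted mean and
    standard deviation statistics from the 'values' half."""
    scores = [sum(wi * ki for wi, ki in zip(w, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    z = sum(exps)
    alphas = [e / z for e in exps]             # weights sum to 1 over frames
    dim = len(values[0])
    mean = [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(dim)]
    # Weighted standard deviation, as in attentive statistics pooling [28].
    std = [math.sqrt(max(sum(a * v[d] ** 2 for a, v in zip(alphas, values))
                         - mean[d] ** 2, 0.0)) for d in range(dim)]
    return mean + std

# With equal scores the weights are uniform and this reduces to plain
# statistics pooling.
stats = attentive_stats_pooling(keys=[[0.0], [0.0], [0.0]],
                                values=[[1.0], [3.0], [5.0]],
                                w=[1.0])
```

The roughly doubled extraction cost reported above comes from the wider last hidden layer plus the extra attention computation.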
In the last row of the table, we report the effect of adding Gaussian noise to the features during training as an additional regularization method. For each feature dimension, zero-mean Gaussian noise is added with a standard deviation of 0.2 times the standard deviation of that dimension. This augmentation improves the performance for a few cases.

5. CONCLUSIONS

In this work, we have successfully implemented and trained an x-vector extractor using a general-purpose machine learning toolkit, namely TensorFlow. We have tested different configurations and modifications of the x-vector extractor topology. We show that, using the tricks and suggestions from this paper, similar or better performance can be obtained compared to the well-tuned original x-vector implementation from the highly optimized Kaldi toolkit.

We tested different normalizations applied to the input features and to the statistics in the pooling layer, but these experiments did not provide consistent improvements over all evaluation datasets. Similarly, we found dropout regularization ineffective when training our speaker embedding extractor. On the other hand, L2-regularization consistently improves the verification performance across all evaluation conditions.

Both LReLU and PReLU activation functions improved the verification performance consistently compared to the standard ReLU non-linearity. LReLU performs slightly better than PReLU, which seems to have overfitted more to the training data. The attention mechanism improved the performance for most conditions, while it increased the x-vector extraction time by about 100%. However, for the moment, it is not clear whether this improvement comes from the attention mechanism or from the increased number of parameters in the network. This still needs to be investigated in future work.
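The per-dimension Gaussian feature-noise augmentation described above (Section 3.5) can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the function name, the `factor` parameter and the list-of-frames layout are ours, not from the paper.

```python
import random

def add_feature_noise(features, stds, factor=0.2, rng=random):
    """To each feature dimension of each frame, add zero-mean Gaussian
    noise whose standard deviation is `factor` times that dimension's
    standard deviation (estimated beforehand on training data)."""
    return [[x + rng.gauss(0.0, factor * s) for x, s in zip(frame, stds)]
            for frame in features]

# `stds` would be estimated once on (a subset of) the training data.
feats = [[1.0, 2.0], [3.0, 4.0]]
noisy = add_feature_noise(feats, stds=[1.0, 2.0])
```

Setting `factor=0.0` disables the augmentation and returns the features unchanged, which is a convenient switch for ablation experiments.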
Like other augmentation methods, adding Gaussian noise to the input features during training has a positive effect on the performance for some conditions. In our experiments, we filter the speakers used for training by a minimum number of utterances available per speaker. Adding more augmentations increases the number of utterances available for individual speakers and, as a result, we include data from more speakers in our training set. Therefore, in future experiments, we should investigate whether the improvements obtained from the augmentations do not actually come only from having more speakers in the training data. We will also investigate other neural network architectures, new topologies and training objectives in our future work on learning speaker representations.

6. ACKNOWLEDGMENT

The work was supported by the Czech Ministry of Education, Youth and Sports from Project No. CZ.02.2.69/0.0/0.0/16 027/0008371, the National Programme of Sustainability (NPU II) project IT4Innovations excellence in science - LQ1602, the Marie Sklodowska-Curie project cofinanced by the South Moravian Region under grant agreement No. 665860, and by Czech Ministry of Interior project No. VI20152020025 "DRAPAK".

7. REFERENCES

[1] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[2] Simon J. D. Prince and James H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1-8.
[3] Najim Dehak, Pedro A. Torres-Carrasquillo, Douglas Reynolds, and Reda Dehak, "Language recognition via i-vectors and dimensionality reduction," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[4] Hossein Zeinali, Hossein Sameti, and Lukáš Burget, "HMM-based phrase-independent i-vector extractor for text-dependent speaker verification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1421-1435, 2017.
[5] Hossein Zeinali, Hossein Sameti, Lukáš Burget, et al., "Text-dependent speaker verification based on i-vectors, neural networks and hidden Markov models," Computer Speech & Language, vol. 46, pp. 53-71, 2017.
[6] Hossein Zeinali, Bagher BabaAli, and Hossein Hadian, "Online signature verification using i-vector representation," IET Biometrics, 2017.
[7] Themos Stafylakis, Patrick Kenny, Mohammed Senoussaoui, and Pierre Dumouchel, "Preliminary investigation of Boltzmann machine classifiers for speaker recognition," in Odyssey 2012 - The Speaker and Language Recognition Workshop, 2012.
[8] Timur Pekhovsky, Sergey Novoselov, Aleksei Sholohov, and Oleg Kudashev, "On autoencoders in the i-vector space for speaker recognition," in Proc. Odyssey, 2016, pp. 217-224.
[9] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1695-1699.
[10] Patrick Kenny, Vishwa Gupta, Themos Stafylakis, Pierre Ouellet, and Jahangir Alam, "Deep neural networks for extracting Baum-Welch statistics for speaker recognition," in Proc. Odyssey, 2014, pp. 293-298.
[11] Alicia Lozano-Diez, Anna Silnova, Pavel Matejka, Ondrej Glembek, Oldrich Plchot, Jan Pešán, Lukáš Burget, and Joaquin Gonzalez-Rodriguez, "Analysis and optimization of bottleneck features for speaker recognition," in Proceedings of Odyssey, 2016, vol. 2016, pp. 352-357.
[12] David Snyder, Pegah Ghahremani, Daniel Povey, Daniel Garcia-Romero, Yishay Carmiel, and Sanjeev Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 165-170.
[13] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," Submitted to ICASSP, 2018.
[14] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Interspeech, 2017.
[15] Mitchell McLaren, Diego Castan, Mahesh Kumar Nandwana, Luciana Ferrer, and Emre Yılmaz, "How to train your speaker embeddings extractor," in Odyssey: The Speaker and Language Recognition Workshop, Les Sables d'Olonne, 2018.
[16] Ondřej Novotný, Oldřich Plchot, Pavel Matějka, Ladislav Mošner, and Ondřej Glembek, "On the use of x-vectors for robust speaker recognition," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 168-175.
[17] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[18] David Snyder, Guoguo Chen, and Daniel Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[19] Mitchell McLaren, Luciana Ferrer, Diego Castan, and Aaron Lawson, "The 2016 speakers in the wild speaker recognition evaluation," in INTERSPEECH, 2016, pp. 823-827.
[20] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[21] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[22] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369-376.
[23] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645-6649.
[24] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, 2013, vol. 30, p. 3.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.
[26] F. A. Chowdhury, Quan Wang, Ignacio Lopez Moreno, and Li Wan, "Attention-based models for text-dependent speaker verification," arXiv preprint arXiv:1710.10470, 2017.
[27] Yingke Zhu, Tom Ko, David Snyder, Brian Mak, and Daniel Povey, "Self-attentive speaker embeddings for text-independent speaker verification," Proc. Interspeech 2018, pp. 3573-3577, 2018.
[28] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, "Attentive statistics pooling for deep speaker embedding," arXiv preprint arXiv:1803.10963, 2018.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[30] Yoonchang Han and Jeongsoo Park, "Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification," Tech. Rep., DCASE2017 Challenge, September 2017.
[31] Zheng Weiping, Yi Jiantao, Xing Xiaotao, Liu Xiangtao, and Peng Shaohu, "Acoustic scene classification using deep convolutional neural network and multiple spectrograms fusion," Tech. Rep., DCASE2017 Challenge, September 2017.
[32] Hossein Zeinali, Lukas Burget, and Jan Cernocky, "Convolutional neural networks and x-vector embedding for DCASE2018 acoustic scene classification challenge," arXiv preprint arXiv:1810.04273, 2018.