Generative x-vectors for text-independent speaker verification
Authors: Longting Xu, Rohan Kumar Das, Emre Yılmaz, Jichen Yang, Haizhou Li
GENERATIVE X-VECTORS FOR TEXT-INDEPENDENT SPEAKER VERIFICATION

Longting Xu, Rohan Kumar Das, Emre Yılmaz, Jichen Yang and Haizhou Li
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
xltggn@gmail.com, rohankd@nus.edu.sg

ABSTRACT

Speaker verification (SV) systems using deep neural network embeddings, the so-called x-vector systems, are becoming popular due to their performance, which is superior to that of i-vector systems. The fusion of these systems provides improved performance, benefiting both from the discriminatively trained x-vectors and from the generative i-vectors capturing distinct speaker characteristics. In this paper, we propose a novel method, called the generative x-vector, to include the complementary information of the i-vector and the x-vector. The generative x-vector utilizes a transformation model learned from the i-vector and x-vector representations of the background data. Canonical correlation analysis is applied to derive this transformation model, which is later used to transform the standard x-vectors of the enrollment and test segments into the corresponding generative x-vectors. The SV experiments performed on the NIST SRE 2010 dataset demonstrate that the system using generative x-vectors provides considerably better performance than the baseline i-vector and x-vector systems. Furthermore, the generative x-vectors outperform the fusion of the i-vector and x-vector systems for long-duration utterances, while yielding comparable results for short-duration utterances.

Index Terms — Speaker verification, speaker embeddings, transformation model, x-vector, canonical correlation analysis

1. INTRODUCTION

Speaker verification (SV) is the task of authenticating a person based on voice samples [1, 2]. Factor analysis approaches for SV led to a new era with their achievement of high performance [3, 4].
Later, the total variability model based i-vector system became a benchmark for SV studies in the current decade [5]. Recently, deep neural network (DNN) based systems have been the focus of the research community. Due to their good performance, DNN-based systems have been incorporated in most submissions to the latest NIST SRE challenge [6-8]. (This work is supported by the Neuromorphic Computing Programme under the RIE2020 Advanced Manufacturing and Engineering Programmatic Grant A1687b0033 in Singapore.)

The initial attempts with DNNs for SV were made in the context of i-vector speaker modeling, in terms of computing phonetic posteriors [9, 10]. Alternative approaches extract bottleneck features from DNN acoustic models that are combined with the acoustic features [11, 12]. However, such approaches require a large amount of transcribed data and may not be as effective for out-of-domain data [11]. This led to the exploration of end-to-end DNN systems for SV that learn the speaker models in a discriminative manner [13-17]. Recent work in this direction focuses on using speaker embeddings that are scored with a probabilistic linear discriminant analysis (PLDA) based back-end [18, 19]. These systems give results comparable to or better than those obtained with i-vector speaker modeling. Further, they have proven very effective in short-duration utterance scenarios [20]. A study on the score-level fusion of i-vector and embedding based systems [18] showed that the fused system outperforms the individual systems due to their complementary characteristics. Later on, the robustness of x-vectors was explored by applying data augmentation [21]. Another strategy to improve x-vector based SV is to include some input from generative models, as their fusion has been found promising. The embedding process is discriminative in nature, whereas the i-vector framework is a generative model.
Specifically, x-vector extraction is achieved by training a DNN to discriminate among different output labels, while the i-vector model relies on a universal background model (UBM) to collect sufficient statistics for deriving speaker models. However, directly concatenating these two models or fusing them at the score level may not be effective for application-oriented systems, as it increases the run-time computation and memory requirements. This motivated us to develop an efficient way of including information from the generative model based on total variability modeling in an embedding based SV system.

In this work, we propose a novel approach that learns a transformation matrix using the i-vectors and x-vectors from the background data, so as to utilize both generative and discriminative characteristics. Canonical correlation analysis (CCA) between these vectors is used to derive this transformation model. CCA has been used previously for the analysis of correlation among different features [22] and for the fusion of multi-modal features in SV [23]. Additionally, it has been used for co-whitening of short- and long-duration utterances in an i-vector system [24]. In this work, CCA is used to maximize the correlation between the two models, based on the generative and discriminative paradigms, to discover complementary attributes. The transformation model is then used to transform standard x-vectors so that they also benefit from the input of the generative model. Moreover, a comparison of the proposed system with the fusion of i-vector and x-vector systems is presented to highlight the impact of the work for practical systems.

In the following sections, we first introduce the fundamentals of the i-vector and x-vector approaches for SV in Section 2. Section 3 introduces the proposed framework of generative x-vectors. The results of the SV experiments using the proposed approach are reported in Section 4. Finally, Section 5 concludes the work.

2. SPEAKER RECOGNITION PARADIGMS: GENERATIVE VS. DISCRIMINATIVE

This section provides the basics of the i-vector and x-vector systems, as they are studied in the proposed framework of generative x-vectors. The detailed structure and the parameters used for the various modules of both systems are also given.

2.1. The i-vector: a generative model

An i-vector system is based on a generative model derived using the total variability model (TVM) [5]. The TVM is learned in an unsupervised manner and is used to represent each utterance as a compact low-dimensional vector as follows:

M = m + Tx    (1)

where M is the Gaussian mixture model (GMM) mean supervector of an utterance, m is the UBM mean supervector, and T is the total variability matrix used to obtain the i-vector x.

2.2. The x-vector: a discriminative model

Generative models are successful due to their strong mathematical representations. However, treating the goal as speaker discrimination helps to increase robustness. In this regard, researchers have recently paid more attention to discriminative training for speaker recognition, as discussed in the introduction. We consider the x-vector as the discriminative baseline system, since it is comparable with i-vector systems for text-independent speaker recognition, especially for short utterances. The DNN embedding structure in our work basically follows the work of [18, 21]. We do not use any data augmentation in the current work; this deserves future exploration.

Table 1. The time-delay configuration of the frame-level layers in the TDNN architecture

Layer index   Layer context    Output dimension
1             (-2,-1,0,1,2)    512
2             (-2,0,2)         512
3             (-3,0,3)         512
4             0                512
5             0                1500

A time-delay neural network (TDNN) [25] is trained using the same acoustic features as in the i-vector system. The TDNN model includes five frame-level hidden layers, all using rectified linear unit (ReLU) activation and batch normalization [26].
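The layer contexts in Table 1 imply a fixed total temporal context seen by each x-vector frame; a quick sketch of that arithmetic (illustration only, using the splice offsets listed above):

```python
# Receptive field implied by the TDNN layer contexts in Table 1.
# Each layer widens the temporal context of one output frame by the
# extent of its splice offsets; offsets of 0 add nothing.
contexts = [(-2, -1, 0, 1, 2), (-2, 0, 2), (-3, 0, 3), (0,), (0,)]

left = sum(-min(c) for c in contexts)   # frames of left context
right = sum(max(c) for c in contexts)   # frames of right context
receptive_field = left + right + 1      # total frames seen per output

print(left, right, receptive_field)     # -> 7 7 15
```

So each frame-level output at layer 5 summarizes a 15-frame window of the input features before statistics pooling aggregates over the whole segment.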
The specific time-delay configuration of these frame-level layers is listed in Table 1. A statistics pooling layer follows the output of the last frame-level layer; it computes the mean and standard deviation over the frames of the input segment. The mean and standard deviation are stacked such that the output dimension is doubled. The final two hidden layers are 512-dimensional layers operating at the segment level, prior to the softmax layer, which targets the speaker labels for each audio segment. The softmax and the second segment-level layer are removed during the testing phase, and 512-dimensional x-vectors are extracted at the output of the first segment-level layer.

3. GENERATIVE X-VECTORS: DNN EMBEDDINGS WITH GENERATIVE MODEL INPUT

In this work, we propose a novel approach that takes advantage of the correlation between the i-vectors and x-vectors to utilize their complementary ways of learning speaker models. A transformation model is learned using CCA, by considering the i-vectors and the corresponding x-vectors from the background speech data as input pairs. During the enrollment and testing phases, the i-vector system is excluded from the pipeline and only the x-vector system is used; its output is linearly transformed using the transformation matrix obtained from the CCA model. We refer to this transformed output as the generative x-vector, henceforth the x_g-vector, since it captures certain properties of the input generative model (the i-vector model) during the transformation.

Fig. 1 illustrates the steps to obtain the proposed x_g-vector representation of speakers. During the training stage, a transformation matrix is learned by applying CCA, and this matrix is later used for x_g-vector extraction. It is important to note that the TVM is only used for extracting the i-vectors of the background data and is computed once.
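The statistics pooling step described in Section 2.2 (mean and standard deviation stacked so that the dimension doubles) can be sketched in a few lines; the frame count and the random values below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

# Frame-level TDNN outputs for one utterance: T frames x 1500 dims
# (the output dimension of layer 5 in Table 1).
frames = rng.standard_normal((300, 1500))

mean = frames.mean(axis=0)            # per-dimension mean over frames
std = frames.std(axis=0)              # per-dimension standard deviation
pooled = np.concatenate([mean, std])  # stacked -> dimension doubled

print(pooled.shape)                   # -> (3000,)
```

The pooled vector is what the subsequent 512-dimensional segment-level layers consume, regardless of the utterance length T.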
There is no further i-vector extraction involved during the enrollment and test sessions. Hence, this kind of framework is expected to have relatively lower latency than feature concatenation or score-level fusion of these systems.

We first mathematically explain the left panel of Fig. 1. In order to take advantage of the generative model information, we seek a pair of matrices W_id and W_xg that satisfy

max_{W_id, W_xg} corr(W_id Φ_i, W_xg Φ_x)    (2)

Here, Φ_i and Φ_x contain the corresponding i-vectors and x-vectors from the same set of utterances.

The proposed transformation with CCA is hypothesized to transfer information from the generative model to the discriminative model and vice versa. The resultant transformation matrices for the i-vector and the x-vector are denoted as W_id and W_xg, respectively. Let N be the number of background utterances used to train the transformation models with CCA. The dimensions of the background i-vectors and x-vectors input to CCA are N x 600 and N x 512, respectively. On applying CCA, we obtain transformation matrices W_id of size 600 x 512 and W_xg of size 512 x 512.

During the SV experiments, we only use the x-vector pipeline, as shown in the middle panel of Fig. 1. Given an x-vector φ_x, the proposed vector is computed as

φ_xg = W_xg φ_x    (3)

where φ_xg denotes the x-vector with generative model input, which we refer to as the generative x-vector. Both the i-vectors and the x-vectors are zero-centered in all of the mathematical expressions in this section. The details of CCA and the transformation of x-vectors are discussed in the following subsections.

3.1. Canonical correlation analysis

As mentioned above, in this work we aim to maximize the linear relationship between a set of i-vectors and x-vectors.
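At run time, Eq. (3) reduces generative x-vector extraction to a single matrix-vector product; a shape-level sketch, with random placeholders standing in for the learned W_xg and an extracted x-vector:

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder values: in practice W_xg comes from CCA on background data
# and phi_x from the TDNN; only the shapes match the paper (512 x 512, 512).
W_xg = rng.standard_normal((512, 512))
phi_x = rng.standard_normal(512)       # zero-centered x-vector of a segment

phi_xg = W_xg @ phi_x                  # Eq. (3): generative x-vector
print(phi_xg.shape)                    # -> (512,)
```

This is why the proposed system adds essentially no run-time cost over a plain x-vector system: no UBM statistics or i-vector extraction is needed at test time.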
It should be mentioned that the dimensions of an i-vector and an x-vector are not the same. Given that a fixed number of background speech utterances is used to derive the background i-vectors and x-vectors, applying CCA maximizes the correlation between input vector pairs of different dimensions.

Mathematically, given random vectors X = (x_1, ..., x_n)^T and Y = (y_1, ..., y_m)^T, CCA defines new variables U and V via linear combinations of X and Y:

U = a^T X    (4)
V = b^T Y    (5)

CCA aims to find vectors a and b that maximize the correlation ρ = corr[a^T X, b^T Y], which can be written as

ρ = E(a^T X Y^T b) / ( sqrt(E(a^T X X^T a)) sqrt(E(b^T Y Y^T b)) )    (6)

With the constraints

a^T Σ_X a = 1    (7)
b^T Σ_Y b = 1    (8)

the correlation to be maximized becomes

ρ = a^T Σ_XY b    (9)

where Σ_X = E(X X^T), Σ_Y = E(Y Y^T) and Σ_XY = E(X Y^T) are the covariances.

Fig. 1. An overview of the proposed system, where the discriminative model (x-vector) benefits from the generative model (i-vector) input. The left panel shows the use of background data for CCA to train the transformation models W_xg and W_id. The middle panel shows the computation of the x_g-vector from the x-vector using the transformation model W_xg. The right panel shows the contrast system of the discriminative i-vector, which generates the i_d-vector using the transformation model W_id.

We then obtain the first pair of canonical variates (U_1, V_1) by maximizing ρ in Equation (9). The remaining canonical variates (U_l, V_l) maximize ρ subject to being uncorrelated with (U_k, V_k) for all k < l. This procedure is iterated up to min{m, n} times, based on the dimensions of the two random vectors.
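The canonical directions can be computed in closed form from the sample covariances; a toy numpy sketch on synthetic paired data (the dimensions here are hypothetical, not the paper's 600 and 512):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for zero-centered background i-vectors (X) and
# x-vectors (Y), built from a shared latent so they are correlated.
N, dx, dy = 2000, 8, 6
Z = rng.standard_normal((N, 4))
X = Z @ rng.standard_normal((4, dx)) + 0.1 * rng.standard_normal((N, dx))
Y = Z @ rng.standard_normal((4, dy)) + 0.1 * rng.standard_normal((N, dy))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

Sxx = X.T @ X / N
Syy = Y.T @ Y / N
Sxy = X.T @ Y / N

# Canonical directions for X: eigenvectors of Sxx^-1 Sxy Syy^-1 Syx;
# the paired direction for Y is proportional to Syy^-1 Syx a.
evals, A = np.linalg.eig(np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T))
order = np.argsort(-evals.real)
a1 = A[:, order[0]].real

u = X @ a1
v = Y @ (np.linalg.solve(Syy, Sxy.T) @ a1)
rho = np.corrcoef(u, v)[0, 1]
print(round(abs(rho), 2))  # close to 1 for strongly shared data
```

Stacking the leading eigenvectors row-wise gives the transformation matrices of Eq. (2); with the normalization of Eqs. (7)-(8) they also whiten the transformed vectors.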
Finally, a_k is obtained as the k-th eigenvector of Σ_X^{-1} Σ_XY Σ_Y^{-1} Σ_YX. Similarly, b_k is the k-th eigenvector of Σ_Y^{-1} Σ_YX Σ_X^{-1} Σ_XY.

3.2. CCA based x-vector transformation

In canonical correlation analysis, we aim to find mutually orthogonal pairs of maximally correlated linear combinations of the variables in X and Y. In our work, the random vectors X and Y discussed in Section 3.1 form the i-vector matrix Φ_i and the x-vector matrix Φ_x, respectively. Revisiting the objective function given in Equation (2), it can be solved with the following constraints:

W_id Σ_i W_id^T = I    (10)
W_xg Σ_x W_xg^T = I    (11)

so that the x-vectors are automatically whitened in the testing phase. Here, Σ_i and Σ_x denote the empirical covariances of the i-vectors and x-vectors, respectively.

3.3. t-SNE visualization

Together with the proposed x_g-vector system, a contrast system is also introduced, which derives another transformation model W_id for the generative i-vector model to take input from the x-vector based discriminative model. The transformed i-vector is denoted as the i_d-vector, since input from the discriminative model has been used. We visualize each speaker representation to examine the distribution over a subset of speakers using the t-Distributed Stochastic Neighbor Embedding (t-SNE) technique [27], which is widely used for the visualization of high-dimensional data.

We randomly chose 5 speakers from the database that have more than 20 utterances and extracted the corresponding i-vectors, x-vectors, x_g-vectors and i_d-vectors. Figure 2 shows the t-SNE distributions for the different representations.

Fig. 2. t-SNE visualization of the different representations (panels: x-vector, i-vector, x_g-vector, i_d-vector; 5 speakers, spk1-spk5).
It is observed that the proposed x_g-vectors benefit from the generative information, showing increased separability, while the distribution of the i_d-vectors highly resembles that of the original i-vectors.

A possible reason is that discriminative models like x-vectors learn the differences among speakers without learning the characteristics of each speaker. Thus, when information from a discriminative model is used as input to the generative i-vector model, it may not contribute towards better SV performance. On the other hand, generative models such as i-vectors learn the characteristics of each speaker and add speaker-specific information when used as input to a discriminative model. Additionally, discriminative models work well for a closed set of speakers, whereas there is no such constraint for generative models.

4. EXPERIMENTAL RESULTS

4.1. Database

The SV experiments in this work are performed using the NIST SRE 2010 database [28]. Common condition 5 (CC'5) has been chosen for the evaluation. We considered different enrollment and test scenarios under this task, namely coreext-coreext, core-10sec, and 10sec-10sec, where coreext and core consist of long-duration utterances, while 10sec denotes short-duration speech of 10 seconds. Additionally, the Switchboard 2 Corpus Phases 1, 2, and 3 as well as Switchboard Cellular, along with the NIST SREs from 2004 to 2008, are used as background data for learning the background models.

Table 2. EER and DCF under CC'5 on the NIST SRE 2010 database for different systems. Fusion results refer to score-level fusion of the i-vector and x-vector systems.

                 |                EER (%)                  |                  DCF
Tasks            | i-vec  x-vec  fusion  i_d-vec  x_g-vec  | i-vec  x-vec  fusion  i_d-vec  x_g-vec
coreext-coreext  | 2.20   2.96   2.19    2.23     1.51     | 0.42   0.42   0.36    0.44     0.35
core-10sec       | 6.07   6.39   4.71    6.00     4.41     | 0.85   0.72   0.78    0.84     0.70
10sec-10sec      | 11.46  11.51  8.92    11.56    8.93     | 0.98   0.85   0.88    0.96     0.89

4.2. Implementation details

In this work, 20-dimensional mel-frequency cepstral coefficient (MFCC) features, along with delta and acceleration coefficients, are extracted for each frame of 25 ms with a shift of 10 ms. The i-vector model is used as a baseline system for reference in our studies. A full-covariance gender-independent UBM with 2048 components is used in the i-vector framework to obtain 600-dimensional i-vectors. For both systems, the dimensionality is reduced to 200 with linear discriminant analysis (LDA). For the x-vector system, the TDNN is trained on the same 20-dimensional MFCC features. All non-linearities in the neural network are ReLUs.

We use PLDA for channel/session compensation and scoring in our experiments. Further, length normalization is applied before performing PLDA [29]. The PLDA is trained with 200 speaker factors and a full covariance, while the channel factor is ignored. The results are reported in terms of the equal error rate (EER) and the detection cost function (DCF), following the protocol of the NIST SRE 2010 evaluation plan [28]. We used Kaldi recipes for building the baseline systems in this work [30].

4.3. Results and discussion

In this section, the results provided by the individual baseline systems using i-vectors and x-vectors are compared with those of the proposed generative x-vectors. We further apply score fusion to the i-vector and x-vector systems and compare the result with the generative x-vectors to investigate their effectiveness in capturing the complementary information from the generative model.

Table 2 reports the performance of the different SV frameworks used in this study. Comparing the i-vector and x-vector baselines, it is clear that the i-vector works better when both the enrollment and test utterances are of long duration, i.e., for the coreext-coreext task.
On the other hand, the results for the core-10sec and 10sec-10sec tasks show that the x-vector system performs comparably to the i-vector system for short-duration test utterances, whether the enrollment data is short or long. Further, a score-level fusion of these two systems results in a gain for all considered tasks of the NIST SRE 2010 database. The system fusion results follow the trend reported by the authors of [18].

We then focus on the results provided by the proposed x_g-vector system and its contrast i_d-vector system. It is observed that the proposed x_g-vector system outperforms the standard x-vector system, reducing the EER from 2.96% to 1.51% on the coreext-coreext task. On the other hand, the performance of the contrast i_d-vector system is similar to that of the original i-vector system. Finally, we compare the performance of the proposed x_g-vector system with the score-level fusion. For the short-utterance cases, the performance of the two systems is comparable. The proposed system outperforms the score fusion for the core condition with long utterances. Hence, the proposed system achieves remarkable performance with lower latency and less computational burden compared to the fusion of the x-vector and i-vector systems. This highlights its importance as a field-deployable system in a practical setting.

The detection error tradeoff (DET) curves for the different systems on the coreext-coreext task are illustrated in Fig. 3. The superior performance of the proposed system is clearly reflected in this plot, with a DET curve that is well separated from those of the baseline individual systems as well as their fusion. In terms of EER, we observe 48.83%, 22.46% and 31.01% relative improvement over the original x-vector system for the three different tasks of CC'5 on the NIST SRE 2010 database discussed in this work. Future work will focus on extending this framework with data augmentation to overcome mismatch conditions [21, 31, 32].

5. CONCLUSIONS

This work focuses on an improved DNN embedding based SV system that considers input from generative models. Total variability speaker modeling is used as the generative model in the studies. A transformation model is learned by applying CCA to background data i-vectors and x-vectors. This model is then used to obtain the generative x-vectors, which are found to perform better than both the x-vector baseline and its i-vector counterpart. The studies are performed on the NIST SRE 2010 database under three different conditions and reveal 48.83%, 22.46% and 31.01% relative improvement in EER for the coreext-coreext, core-10sec and 10sec-10sec tasks, respectively. This confirms the importance of using inputs from generative models in the framework of discriminative DNN embedding models for SV. Additionally, the performance of the generative x-vectors is found to be superior for long utterances and competitive for short utterances compared to that obtained from the score-level fusion of the i-vector and x-vector systems. Thus, this kind of approach has lower latency than dimension concatenation or score-level fusion of systems, which makes it useful for application purposes.

Fig. 3. DET curves (miss probability vs. false alarm probability, in %) for the coreext-coreext task of the NIST SRE 2010 database; systems shown: i-vec, x-vec, fusion, i_d-vec, x_g-vec.

6. REFERENCES

[1] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, 2010.

[2] J. H. L. Hansen and T. Hasan, "Speaker recognition by machines and humans: A tutorial review," IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74-99, Nov 2015.

[3] P. Kenny, "Joint factor analysis of speaker and session variability: Theory and algorithms," Tech.
Rep. CRIM-06/08-13, CRIM, Montreal, 2005.

[4] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435-1447, May 2007.

[5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.

[6] K. A. Lee and SRE16 I4U Group, "The I4U mega fusion and collaboration for NIST speaker recognition evaluation 2016," in Proc. Interspeech 2017, 2017, pp. 1328-1332.

[7] P. A. Torres-Carrasquillo, F. Richardson, S. Nercessian, D. Sturim, W. Campbell, Y. Gwon, S. Vattam, N. Dehak, H. Mallidi, P. S. Nidadavolu, R. Li, and R. Dehak, "The MIT-LL, JHU and LRDE NIST 2016 speaker recognition evaluation system," in Proc. Interspeech 2017, 2017, pp. 1333-1337.

[8] N. Kumar, R. K. Das, S. Jelil, B. K. Dhanush, H. Kashyap, K. S. R. Murty, S. Ganapathy, R. Sinha, and S. R. M. Prasanna, "IITG-Indigo system for NIST 2016 SRE challenge," in Proc. Interspeech 2017, 2017, pp. 2859-2863.

[9] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2014, May 2014, pp. 1695-1699.

[10] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam, "Deep neural networks for extracting Baum-Welch statistics for speaker recognition," in Speaker Odyssey 2014, 2014, pp. 293-298.

[11] F. Richardson, D. Reynolds, and N. Dehak, "Deep neural network approaches to speaker and language recognition," IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671-1675, Oct 2015.

[12] M. McLaren, Y. Lei, and L.
Ferrer, "Advances in deep neural network approaches to speaker recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015, April 2015, pp. 4814-4818.

[13] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2014, May 2014, pp. 4052-4056.

[14] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016, March 2016, pp. 5115-5119.

[15] S. X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, "End-to-end attention based text-dependent speaker verification," in IEEE Spoken Language Technology Workshop (SLT) 2016, Dec 2016, pp. 171-178.

[16] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in IEEE Spoken Language Technology Workshop (SLT) 2016, Dec 2016, pp. 165-170.

[17] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep speaker: an end-to-end neural speaker embedding system," arXiv preprint [cs.CL], 2017.

[18] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Proc. Interspeech 2017, 2017, pp. 999-1003.

[19] N. Brummer, A. Silnova, L. Burget, and T. Stafylakis, "Gaussian meta-embeddings for efficient scoring of a heavy-tailed PLDA model," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 349-356.

[20] C. Zhang and K. Koishida, "End-to-end text-independent speaker verification with triplet loss on short utterances," in Proc. Interspeech 2017, August 2017, pp. 1487-1491.

[21] D. Snyder, D. Garcia-Romero, G.
Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018, April 2018, pp. 5329-5333.

[22] R. K. Das and S. R. M. Prasanna, "Exploring different attributes of source information for speaker verification with limited test data," Journal of the Acoustical Society of America, vol. 140, no. 1, pp. 184, 2016.

[23] M. E. Sargin, E. Erzin, Y. Yemez, and A. M. Tekalp, "Multimodal speaker identification using canonical correlation analysis," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2006, 2006, pp. I-I.

[24] L. Xu, K. A. Lee, H. Li, and Z. Yang, "Co-whitening of i-vectors for short and long duration speaker verification," in Proc. Interspeech 2018, September 2018, pp. 1066-1070.

[25] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, Mar 1989.

[26] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. ICML, 2015, pp. 448-456.

[27] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.

[28] "The NIST year 2010 speaker recognition evaluation plan," April 2010.

[29] S. J. D. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in IEEE 11th International Conference on Computer Vision (ICCV) 2007, 2007, pp. 1-8.

[30] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K.
Vesely, "The Kaldi speech recognition toolkit," in IEEE Workshop on Automatic Speech Recognition and Understanding 2011, Dec. 2011.

[31] M. McLaren, D. Castan, M. K. Nandwana, L. Ferrer, and E. Yılmaz, "How to train your speaker embeddings extractor," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 327-334.

[32] S. Novoselov, A. Shulipa, I. Kremnev, A. Kozlov, and V. Shchemelinin, "On deep speaker embeddings for text-independent speaker recognition," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 378-385.