Deep Segment Attentive Embedding for Duration Robust Speaker Verification

Bin Liu 1,2, Shuai Nie 1, Yaping Zhang 1,2, Shan Liang 1, Wenju Liu 1
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, China
{bin.liu2015, shuai.nie, yaping.zhang, sliang, lwj}@nlpr.ia.ac.cn

ABSTRACT

LSTM-based speaker verification usually uses a fixed-length local segment randomly truncated from an utterance to learn the utterance-level speaker embedding, while using the average embedding of all segments of a test utterance to verify the speaker, which results in a critical mismatch between training and testing. This mismatch degrades the performance of speaker verification, especially when the durations of training and testing utterances differ greatly. To alleviate this issue, we propose the deep segment attentive embedding method to learn unified speaker embeddings for utterances of variable duration. Each utterance is segmented by a sliding window and an LSTM is used to extract the embedding of each segment. Instead of using only one local segment, we use the whole utterance to learn the utterance-level embedding by applying attentive pooling to the embeddings of all segments. Moreover, a similarity loss on segment-level embeddings is introduced to guide the segment attention to focus on the segments with more speaker-discriminative information, and is jointly optimized with the similarity loss on utterance-level embeddings. Systematic experiments on Tongdun and VoxCeleb show that the proposed method significantly improves robustness to duration variation, achieving relative Equal Error Rate reductions of 50% and 11.54%, respectively.

Index Terms — deep segment attentive embedding, speaker verification, duration robustness, LSTM
1. INTRODUCTION

The key to speaker verification is extracting fixed-dimension utterance-level speaker vectors from utterances of variable duration. The extracted speaker vector is expected to be as close as possible to vectors of the same speaker while far from those of other speakers. Extracting robust speaker vectors for utterances of variable duration remains a challenge, especially when the utterance duration varies greatly. The i-vector/PLDA framework [1, 2, 3] can easily extract fixed-dimension speaker vectors for utterances of arbitrary duration using statistical modeling, but it suffers a performance reduction on short utterances [4, 5]. The reason is that the i-vector is a Gaussian-based statistical feature whose estimation needs sufficient samples, so a short utterance leads to uncertainty in the estimated i-vector.

Deep-learning-based speaker embedding [4, 6, 7] is another mainstream approach to speaker verification, which has been extensively studied recently and has achieved promising performance on short-duration text-independent tasks. There are two ways to extract speaker embeddings with deep models. One approach averages bottleneck features from frame-level speaker classification networks [6]. The other directly learns utterance-level speaker embeddings with a distance-based similarity loss, such as the triplet loss [4, 8] or the generalized end-to-end (GE2E) loss [7]. LSTM-based speaker embedding is one of the most important deep speaker verification methods and has been shown to be substantially promising [9, 10]. Owing to its powerful ability to model time-series data, an LSTM can effectively capture the local correlation information of speech, which is very important for speaker verification. But it is still challenging for an LSTM to model the long-term dependency of utterances, especially very long ones.
In addition, to facilitate batch training, LSTM-based speaker verification usually uses a fixed-length local segment randomly truncated from an utterance to learn the utterance-level speaker embedding in the training phase, while using the average embedding of all segments of a test utterance to verify the speaker in the testing phase, which leads to a critical mismatch between training and testing. The mismatch dramatically degrades the performance of speaker verification, especially when the durations of training and testing utterances differ greatly. Many methods have been proposed to handle duration variability; attention-based pooling [11, 12] is one of the most important. But most attention mechanisms operate at the frame level, which leads to the "over-average" problem, especially when the utterance is very long.

To alleviate this issue, we propose the deep segment attentive embedding method to learn unified speaker embeddings for utterances of variable duration. For both training and testing, we use a sliding window to divide utterances into fixed-length segments and then use an LSTM to extract the embedding of each segment. Finally, all segment-level embeddings of an utterance are pooled into a fixed-dimension vector through segment attention, which serves as the utterance-level speaker embedding. The similarity loss on utterance-level embeddings is used to train the whole network. In addition, to guide the segment attention to focus on the segments with more speaker-discriminative information, we further incorporate a similarity loss on segment-level embeddings. With the joint optimization of the segment-level and utterance-level similarity losses, both local details and global information of utterances are taken into account.
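As a concrete illustration of the sliding-window segmentation just described, the following NumPy sketch (not the authors' code) slices an utterance's frame matrix into fixed-length overlapping segments; the 100-frame window with 50% overlap mirrors the testing setup described later, while training draws the window length randomly from [80, 120] frames.

```python
import numpy as np

def segment_utterance(frames, win, overlap=0.5):
    """Slice a (T, F) matrix of frame features into fixed-length segments.

    Segments of `win` frames are taken with the given fractional overlap;
    a trailing partial segment is dropped for simplicity.
    """
    hop = max(1, int(win * (1.0 - overlap)))
    starts = range(0, frames.shape[0] - win + 1, hop)
    return np.stack([frames[s:s + win] for s in starts])  # (N, win, F)

# Example: a 400-frame utterance of 40-dim filter-banks, 100-frame window
utt = np.random.randn(400, 40)
segments = segment_utterance(utt, win=100, overlap=0.5)
print(segments.shape)  # (7, 100, 40)
```

Each of the N resulting segments is then fed through the LSTM to produce one segment-level embedding.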
Instead of using only one local segment, we use the whole utterance to learn the utterance-level embedding, which unifies the training and testing processes and avoids the mismatch between them.

2. RELATED WORK

There have been some efforts on the issue of duration variability. For example, in conventional i-vector systems, [13] proposed to propagate the uncertainty of the i-vector extraction process into the PLDA model, which better handles duration variability. Moreover, in deep-learning-based speaker embedding systems, a complementary center loss has been proposed [14, 15, 16] to address the large variation in text-independent utterances, including duration variation. It acts as a regularizer that reduces the intra-class variance of the final embedding vectors. However, these methods do not explicitly model the duration variability of utterances, so the mismatch between training and testing still exists. Furthermore, attention mechanisms have been utilized to capture long-term variations of speaker characteristics [11, 12]: an importance metric computed by an attention network is used to calculate the weighted mean of the frame-level embedding vectors. However, most of these attention mechanisms operate at the frame level, which leads to the "over-average" problem, especially when the utterance is very long.

3. PROPOSED APPROACH

It is still challenging for an LSTM to model the long-term dependency of utterances, especially very long ones. And the mismatch between the training and testing phases degrades the performance of speaker verification, especially when the durations of training and testing utterances differ greatly. We therefore propose the deep segment attentive embedding method to extract unified speaker embeddings for utterances of variable duration. As shown in Fig. 1, we use a sliding window with 50% overlap to divide utterances into fixed-length segments, and an LSTM is used to extract the embedding of each segment. Finally, all segment-level embeddings of an utterance are pooled into a fixed-dimension utterance-level speaker embedding through the segment attention mechanism. The whole network is trained with the joint supervision of the utterance-level and segment-level similarity losses. It can extract unified speaker embeddings for utterances of variable duration and take into account both local details and global information, especially for long utterances.

3.1. Deep segment attentive embedding

For both training and testing, we use a sliding window with 50% overlap to divide an utterance into fixed-length segments. Suppose we obtain N speech segments X = {x_1, x_2, ..., x_N}. The sliding-window length T is randomly chosen within [80, 120] frames, but the length of segments within a batch is fixed. The vector x_n^t represents the feature of segment n at frame t, which is fed into the network to produce the output h_n^t. The output at the last frame is used as the segment representation f(x_n; w) = h_n^T, where w denotes the network parameters. The segment-level speaker embedding is defined as the L2 normalization of the segment representation:

e_n = f(x_n; w) / ||f(x_n; w)||_2.   (1)

[Fig. 1. System overview: the segments x_1, ..., x_N of an utterance pass through the LSTM; the segment embeddings e_1, ..., e_N are attentively pooled with weights α_n into the speaker embedding ẽ, with cosine-similarity losses (ws + b) applied at both the segment level and the utterance level. For each training batch, there are Q × P utterances from Q different speakers, with P utterances per speaker; only one utterance is drawn for simplicity.]

We compute the embedding vector of each segment according to Eq. 1, obtaining E = {e_1, e_2, ..., e_N}. Let the dimension of the segment-level speaker embedding e_n be d_e. It is often the case that some segment-level embeddings are more relevant and important for discriminating speakers than others. We therefore apply an attention mechanism that integrates the segment embeddings by automatically calculating the importance of each segment. For each segment embedding e_n, we apply the multi-head attention mechanism [17] to learn a score α_n as follows:

α_n = softmax(g(e_n W_1) W_2),   (2)

where W_1 and W_2 are the parameters of the multi-head attention mechanism; W_1 is a matrix of size d_e × d_a and W_2 is a matrix of size d_a × d_r; d_a is the attention dimension and d_r is a hyperparameter giving the number of attention heads; g(·) is the ReLU activation function [18]. When d_r = 1, this reduces to basic attention. The normalized weight α_n ∈ [0, 1] is computed by the softmax function. The weight vector is then used in the attentive pooling layer to calculate the utterance-level speaker embedding ẽ:

ẽ = Σ_{n=1}^{N} α_n e_n.   (3)

When d_r = 1, ẽ is simply a weighted mean vector computed from E, which is expected to reflect one aspect of speaker discrimination in the given utterance. Obviously, speakers can be discriminated along multiple aspects, especially when the utterance is long. By increasing d_r, we can easily use multiple attention heads that focus on different aspects of an utterance.
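Eqs. (1)-(3) can be sketched in NumPy as follows. The random matrix seg stands in for the LSTM segment representations, and the dimensions (d_e = 256, d_a = 128, d_r = 5) follow the paper's settings; since this excerpt does not specify how multiple heads are fused back into a single d_e-dimensional embedding, averaging the per-head pooled vectors below is an assumption.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def segment_attentive_pooling(seg_repr, W1, W2):
    """Pool N segment representations into one utterance embedding.

    Eq. (1): L2-normalize each segment representation.
    Eq. (2): alpha = softmax(ReLU(e W1) W2), normalized over segments.
    Eq. (3): weighted sum of segment embeddings, one sum per head.
    """
    e = seg_repr / np.linalg.norm(seg_repr, axis=1, keepdims=True)  # (N, d_e)
    scores = np.maximum(e @ W1, 0.0) @ W2                           # (N, d_r)
    alpha = softmax(scores, axis=0)             # each head sums to 1 over N
    pooled = alpha.T @ e                        # (d_r, d_e), one vector per head
    return pooled.mean(axis=0), alpha           # head fusion by averaging (assumption)

rng = np.random.default_rng(0)
N, d_e, d_a, d_r = 8, 256, 128, 5
seg = rng.standard_normal((N, d_e))
utt_emb, alpha = segment_attentive_pooling(
    seg, 0.1 * rng.standard_normal((d_e, d_a)), 0.1 * rng.standard_normal((d_a, d_r)))
print(utt_emb.shape, alpha.shape)  # (256,) (8, 5)
```

With d_r = 1, pooled has a single row and the result is exactly the weighted mean of Eq. (3).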
To encourage diversity among the attention heads, [12] introduced a penalty term L_p for the case d_r > 1:

L_p = ||A^T A − I||_F^2,   (4)

where A = [α_1, ..., α_N] is the attention matrix, I is the identity matrix, and ||·||_F denotes the Frobenius norm of a matrix. L_p encourages each attention head to extract different information from the same utterance. Similar to L2 regularization, it is minimized together with the original cost of the system.

3.2. Loss function

After obtaining the utterance-level speaker embedding, we calculate the similarity loss using the generalized end-to-end (GE2E) formulation [7]. The GE2E loss processes a large number of utterances at once, minimizing the distances between embeddings of the same speaker while maximizing the distances between different speakers. For each training batch, we randomly choose Q × P utterances from Q different speakers, with P utterances per speaker, and calculate the utterance-level speaker embedding ẽ_ji of the j-th speaker's i-th utterance according to Equations 1, 2 and 3. The centroid of the embedding vectors from the j-th speaker is defined as

c_j = E_i[ẽ_ji] = (1/P) Σ_{i=1}^{P} ẽ_ji.   (5)

GE2E builds a similarity matrix S_{ji,k} of the scaled cosine similarities between each embedding vector ẽ_ji and all centroids c_k (1 ≤ j, k ≤ Q and 1 ≤ i ≤ P):

S_{ji,k} = w · cos(ẽ_ji, c_k) + b,   (6)

where w and b are learnable parameters. The weight is constrained to be positive, w > 0, so that the scaled similarity grows with the cosine similarity. During training, each utterance's embedding should be close to the centroid of its own speaker and far from other speakers' centroids. The loss on each speaker embedding ẽ_ji is defined as

L(ẽ_ji) = log Σ_{k=1}^{Q} exp(S_{ji,k}) − S_{ji,j}.   (7)
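A minimal NumPy sketch of the GE2E computation in Eqs. (5)-(7), together with the batch sum over all embeddings, is given below. It is an illustrative reimplementation rather than the authors' code; the (w, b) = (10, −5) initialization follows the GE2E formulation of [7], and the simple centroid of Eq. (5) is used for all terms.

```python
import numpy as np

def ge2e_loss(emb, w=10.0, b=-5.0):
    """GE2E loss on a batch of embeddings shaped (Q, P, d):
    Q speakers with P utterances each."""
    Q, P, _ = emb.shape
    emb = emb / np.linalg.norm(emb, axis=2, keepdims=True)
    c = emb.mean(axis=1)                           # Eq. (5): centroids, (Q, d)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    S = w * np.einsum('jid,kd->jik', emb, c) + b   # Eq. (6): scaled cosines, (Q, P, Q)
    own = S[np.arange(Q), :, np.arange(Q)]         # S_{ji,j}, shape (Q, P)
    L = np.log(np.exp(S).sum(axis=2)) - own        # Eq. (7), per embedding
    return L.sum()                                 # batch sum over j and i

# Two perfectly separated speakers: the loss is close to zero
emb = np.zeros((2, 3, 4))
emb[0, :, 0] = 1.0   # speaker 0's utterances all point along axis 0
emb[1, :, 1] = 1.0   # speaker 1's utterances all point along axis 1
print(ge2e_loss(emb))  # 2 * 3 * log(1 + exp(-10)) ~ 0.0003
```

The full objective described below additionally adds λ_s times the same loss computed on the segment-level embeddings and λ_p times the diversity penalty of Eq. (4).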
The utterance-level GE2E loss L_u is the sum of all losses over the similarity matrix:

L_u(x; w) = Σ_{j,i} L(ẽ_ji).   (8)

For text-independent speaker verification, each extracted segment-level embedding is expected to capture the speaker characteristics. To guide the segment attention to focus on the segments with more speaker-discriminative information, we further incorporate a similarity loss on the segment-level embeddings. The segment-level GE2E loss L_s is the same as the utterance-level GE2E loss L_u except that it takes all segment-level embeddings as input, which helps the proposed model learn more effective embedding fusion and accelerates convergence:

L_s(x; w) = Σ_{j,i} Σ_n L(e_n).   (9)

Finally, the utterance-level GE2E loss, the segment-level GE2E loss, and the penalty loss are combined into the total loss:

L = L_u + λ_s L_s + λ_p L_p,   (10)

where the hyperparameters λ_s and λ_p control the magnitudes of the segment-level GE2E loss and the penalty loss. With the joint optimization of the segment-level and utterance-level GE2E losses, both local details and global information of utterances are taken into account. The proposed method extracts unified speaker embeddings for utterances of variable duration, which unifies the training and testing processes and avoids the mismatch between them.

4. EXPERIMENTS

We report speaker verification performance on the Tongdun and VoxCeleb [19] corpora. The proposed deep segment attentive embedding is compared with the generalized end-to-end loss based embedding as well as the traditional i-vector. We use the Equal Error Rate (EER) to quantify system performance.

4.1. Data

Tongdun.
The corpus is from the speaker verification competition held by Tongdun Technology [20]. It consists of more than 120K utterances from 1,500 Chinese speakers in the training set, and 3,000 trial pairs are provided as test data. Most of the training data are short utterances with an average duration of 3.7 s, while the test utterances are very long, with an average duration of about 20 s.

VoxCeleb. The training set consists of more than 140K utterances from 1,251 speakers, and 37,720 trial pairs from 40 speakers are used as evaluation data for verification. The average durations of the training and evaluation data are 8.24 s and 8.28 s, respectively. For each speech utterance, a VAD [21, 22] is applied to prune out silence regions.

4.2. i-vector system

The i-vector system uses 20-dimensional MFCCs as front-end features, which are extended to 60-dimensional acoustic features with their first and second derivatives. Cepstral mean normalization is applied. A 400-dimensional i-vector is then extracted from the acoustic features using a 2048-mixture UBM and a total variability matrix. PLDA serves as the scoring back-end. Mean subtraction, whitening, and length normalization [23] are applied to the i-vectors as preprocessing steps, and similarity is measured using a PLDA model with a 400-dimensional speaker space.

4.3. Deep speaker embedding system

For the deep speaker embedding systems, we take 40-dimensional filter-banks with a 32-ms Hamming window and a 16-ms frame shift as the input features; each feature dimension is normalized to have zero mean and unit variance over the training set. A combination of a 3-layer LSTM and a linear projection layer is used to extract the speaker embeddings.

Table 1. Speaker Verification Results on Tongdun.
  Embedding       EER (%)
  i-vector/PLDA   3.0
  LSTM-GE2E       2.0
  DSAE-GE2E-1     1.5
  DSAE-GE2E-2     1.3
  DSAE-GE2E-5     1.0
Each LSTM layer contains 512 nodes, and the linear projection layer is connected to the last LSTM layer with an output size of 256. We therefore extract 256-dimensional speaker embeddings from the outputs of the linear projection layer. The cosine similarity score of a pair of embedding vectors is computed to verify the speaker. Following [7], the scaling factors w and b in Eq. 6 are initialized to 10 and −5, respectively.

We take the LSTM-based speaker embedding system proposed by Wan et al. [7], optimized with the GE2E loss, as the baseline, denoted "LSTM-GE2E". "LSTM-GE2E" uses local segments truncated from utterances to learn the utterance-level speaker embedding. The segment length is randomly chosen within [80, 120] frames, but the length of all segments within a batch is fixed. In the testing phase, each utterance is segmented by a sliding window of 100 frames with 50% overlap; we extract the embedding of each segment and then average them as the speaker embedding of the utterance. The embedding of each segment is obtained by applying a frame-level attention pooling operator to the outputs of the linear projection layer.

Compared to "LSTM-GE2E", the proposed deep segment attentive embedding system, denoted "DSAE-GE2E", uses the whole utterance to learn the utterance-level speaker embedding via segment attention. The segment attention is implemented by applying multi-head attention pooling to the segment-level embeddings. The attention dimension d_a is set to 128 and the number of attention heads d_r is chosen from {1, 2, 5}. In addition, "DSAE-GE2E" is jointly optimized by the utterance-level and segment-level GE2E losses, as shown in Eq. 10. The weights λ_s and λ_p in Eq. 10 are experimentally set to 0.2 and 0.001, respectively.

All deep speaker embedding models are trained from random initialization with the Adam optimizer [24]. The initial learning rate is set to 0.001 and is decayed according to validation-set performance. For each training batch, we randomly choose 640 utterances from 64 speakers, with 10 utterances per speaker. We note that the length of segments within a batch is fixed. About 15,000 batches are used to train the network. In addition, the L2 norm of the gradient is clipped at 3 to avoid gradient explosion [25].

4.4. Results

In the following results, "LSTM-GE2E" refers to the deep speaker embedding system trained with the GE2E loss, and "DSAE-GE2E-k" denotes the proposed deep segment attentive embedding system with a multi-head attention layer of k attention heads.

Table 1 shows the performance on the Tongdun test set. All deep-learning-based speaker embedding systems outperform the traditional i-vector system, which shows the effectiveness of deep speaker embeddings. In general, the proposed "DSAE-GE2E" consistently and significantly outperforms "LSTM-GE2E", and more attention heads yield greater improvement: "DSAE-GE2E-1" is 25% better in EER than "LSTM-GE2E", and "DSAE-GE2E-5" outperforms "LSTM-GE2E" by 50%. Note that the durations of Tongdun training and testing utterances differ greatly, and our systems can extract unified utterance-level speaker embeddings for utterances of variable duration, which significantly improves system performance. The results indicate that the proposed utterance-level speaker embedding is a duration-robust representation for speaker verification.

The performance on the VoxCeleb test set is shown in Table 2. The proposed "DSAE-GE2E" also outperforms the i-vector system and "LSTM-GE2E", which demonstrates the effectiveness of the proposed method: "DSAE-GE2E-1" is 6.5% better in EER than "LSTM-GE2E", and "DSAE-GE2E-5" outperforms "LSTM-GE2E" by 16.1%.

Table 2. Speaker Verification Results on VoxCeleb.
  Embedding       EER (%)
  i-vector/PLDA   8.9
  LSTM-GE2E       6.2
  DSAE-GE2E-1     5.8
  DSAE-GE2E-2     5.5
  DSAE-GE2E-5     5.2
The relative EER reduction is smaller than on the Tongdun corpus because there is little duration difference between VoxCeleb training and testing utterances. Our proposed method yields greater performance improvements when the durations of training and testing utterances differ more.

5. CONCLUSIONS

In this paper, we propose the deep segment attentive embedding method to learn unified speaker embeddings for utterances of variable duration. Each utterance is segmented by a sliding window and an LSTM is used to extract the embedding of each segment. Instead of using only one local segment, we use the whole utterance to learn the utterance-level embedding by applying attentive pooling to the embeddings of all segments. Moreover, a similarity loss on segment-level embeddings is introduced to guide the segment attention to focus on the segments with more speaker-discriminative information, and is jointly optimized with the similarity loss on utterance-level embeddings. Systematic experiments on Tongdun and VoxCeleb demonstrate the effectiveness of the proposed method. In future work, we will investigate different neural network architectures and attention strategies to obtain further performance improvements.

6. ACKNOWLEDGEMENTS

This work was supported by the China National Nature Science Foundation (No. 61573357, No. 61503382, No. 61403370, No. 61273267, No. 91120303).

7. REFERENCES

[1] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[2] Simon J. D. Prince and James H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in IEEE International Conference on Computer Vision, 2007, pp. 1–8.
[3] Sandro Cumani, Oldřich Plchot, and Pietro Laface, "Probabilistic linear discriminant analysis of i-vector posterior distributions," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7644–7648.
[4] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
[5] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in INTERSPEECH, 2017, pp. 999–1003.
[6] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 4052–4056.
[7] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4879–4883.
[8] Chunlei Zhang and Kazuhito Koishida, "End-to-end text-independent speaker verification with triplet loss on short utterances," in Proc. of Interspeech, 2017.
[9] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, pp. 4580–4584.
[10] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5115–5119.
[11] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, "Attentive statistics pooling for deep speaker embedding," in INTERSPEECH, 2018.
[12] Yingke Zhu, Tom Ko, David Snyder, Brian Mak, and Daniel Povey, "Self-attentive speaker embeddings for text-independent speaker verification," Proc. Interspeech 2018, pp. 3573–3577, 2018.
[13] Patrick Kenny, Themos Stafylakis, Pierre Ouellet, Md. Jahangir Alam, and Pierre Dumouchel, "PLDA for speaker verification with utterances of arbitrary duration," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7649–7653.
[14] Na Li, Deyi Tuo, Dan Su, Zhifeng Li, and Dong Yu, "Deep discriminative embeddings for duration robust speaker verification," in INTERSPEECH, 2018, pp. 2262–2266.
[15] Nam Le and Jean-Marc Odobez, "Robust and discriminative speaker embedding via intra-class distance variance regularization," in INTERSPEECH, 2018, pp. 2257–2261.
[16] Sarthak Yadav and Atul Rai, "Learning discriminative features for speaker identification and verification," in INTERSPEECH, 2018, pp. 2237–2241.
[17] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio, "A structured self-attentive sentence embedding," arXiv preprint, 2017.
[18] Vinod Nair and Geoffrey E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[19] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: a large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[20] "Tongdun Technology Speaker Verification Competition," https://www.kesci.com/home/competition/5b4eb2cfe87957000f9024a4/.
[21] Man Wai Mak and Hon Bill Yu, "A study of voice activity detection techniques for NIST speaker recognition evaluations," Computer Speech and Language, vol. 28, no. 1, pp. 295–313, 2014.
[22] Hon-Bill Yu and Man-Wai Mak, "Comparison of voice activity detectors for interview speech in NIST speaker recognition evaluation," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[23] Daniel Garcia-Romero and Carol Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in INTERSPEECH 2011, Conference of the International Speech Communication Association, Florence, Italy, August 2011, pp. 249–252.
[24] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint, 2014.
[25] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, "Understanding the exploding gradient problem," CoRR, abs/1211.5063, 2012.
