Deep Segment Attentive Embedding for Duration Robust Speaker Verification

Bin Liu 1,2, Shuai Nie 1, Yaping Zhang 1,2, Shan Liang 1, Wenju Liu 1
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, China
{bin.liu2015, shuai.nie, yaping.zhang, sliang, lwj}@nlpr.ia.ac.cn

ABSTRACT

LSTM-based speaker verification usually uses a fixed-length local segment randomly truncated from an utterance to learn the utterance-level speaker embedding, while using the average embedding of all segments of a test utterance to verify the speaker, which results in a critical mismatch between training and testing. This mismatch degrades the performance of speaker verification, especially when the durations of training and testing utterances differ greatly. To alleviate this issue, we propose the deep segment attentive embedding method to learn unified speaker embeddings for utterances of variable duration. Each utterance is segmented by a sliding window and an LSTM is used to extract the embedding of each segment. Instead of using only one local segment, we use the whole utterance to learn the utterance-level embedding by applying attentive pooling to the embeddings of all segments. Moreover, a similarity loss on segment-level embeddings is introduced to guide the segment attention to focus on the segments with more speaker-discriminative information, and is jointly optimized with the similarity loss on utterance-level embeddings. Systematic experiments on Tongdun and VoxCeleb show that the proposed method significantly improves robustness to duration variation, achieving relative Equal Error Rate reductions of 50% and 11.54%, respectively.

Index Terms — deep segment attentive embedding, speaker verification, duration robustness, LSTM
1. INTRODUCTION

The key to speaker verification is extracting fixed-dimension utterance-level speaker vectors from utterances of variable duration. The extracted speaker vector is expected to be as close as possible to vectors of the same speaker while far from those of other speakers. Extracting robust speaker vectors for utterances of variable duration remains a challenge, especially when the utterance duration varies greatly. The i-vector/PLDA framework [1, 2, 3] can easily extract fixed-dimension speaker vectors for utterances of arbitrary duration using statistical modeling, but it suffers a performance reduction on short utterances [4, 5]. The reason is that the i-vector is a Gaussian-based statistical feature whose estimation needs sufficient samples, so a short utterance leads to uncertainty in the estimated i-vector.

Deep-learning-based speaker embedding [4, 6, 7] is another mainstream approach to speaker verification, which has been extensively studied recently and has achieved promising performance on short-duration text-independent tasks. There are two ways to extract speaker embeddings with deep models. One approach averages bottleneck features from frame-level speaker classification networks [6]. The other directly learns utterance-level speaker embeddings with a distance-based similarity loss, such as the triplet loss [4, 8] or the generalized end-to-end (GE2E) loss [7]. LSTM-based speaker embedding is one of the most important deep speaker verification methods and has been shown to be substantially promising [9, 10]. Owing to its powerful ability to model time-series data, an LSTM can effectively capture the local correlation information of speech, which is very important for speaker verification. But it is still challenging for an LSTM to model the long-term dependency of utterances, especially very long ones.
In addition, to facilitate batch training, LSTM-based speaker verification usually uses a fixed-length local segment randomly truncated from an utterance to learn the utterance-level speaker embedding in the training phase, while using the average embedding of all segments of a test utterance to verify the speaker in the testing phase, which leads to a critical mismatch between training and testing. The mismatch dramatically degrades the performance of speaker verification, especially when the durations of training and testing utterances differ greatly. Many methods have been proposed to handle duration variability; attention-based pooling [11, 12] is one of the most important. But most attention mechanisms operate at the frame level, which leads to the "over-average" problem, especially when the utterance is very long.

To alleviate this issue, we propose the deep segment attentive embedding method to learn unified speaker embeddings for utterances of variable duration. For both training and testing, we use a sliding window to divide utterances into fixed-length segments and then use an LSTM to extract the embedding of each segment. Finally, all segment-level embeddings of an utterance are pooled into a fixed-dimension vector through segment attention, which serves as the utterance-level speaker embedding. The similarity loss on utterance-level embeddings is used to train the whole network. In addition, to guide the segment attention to focus on the segments with more speaker-discriminative information, we further incorporate a similarity loss on segment-level embeddings. With the joint optimization of the segment-level and utterance-level similarity losses, both local details and global information of utterances are taken into account.
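As a concrete illustration of the sliding-window segmentation just described, the following NumPy sketch (not the authors' code) slices an utterance's frame matrix into fixed-length overlapping segments; the 100-frame window with 50% overlap mirrors the testing setup described later, while training draws the window length randomly from [80, 120] frames.

```python
import numpy as np

def segment_utterance(frames, win, overlap=0.5):
    """Slice a (T, F) matrix of frame features into fixed-length segments.

    Segments of `win` frames are taken with the given fractional overlap;
    a trailing partial segment is dropped for simplicity.
    """
    hop = max(1, int(win * (1.0 - overlap)))
    starts = range(0, frames.shape[0] - win + 1, hop)
    return np.stack([frames[s:s + win] for s in starts])  # (N, win, F)

# Example: a 400-frame utterance of 40-dim filter-banks, 100-frame window
utt = np.random.randn(400, 40)
segments = segment_utterance(utt, win=100, overlap=0.5)
print(segments.shape)  # (7, 100, 40)
```

Each of the N resulting segments is then fed through the LSTM to produce one segment-level embedding.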
Instead of using only one local segment, we use the whole utterance to learn the utterance-level embedding, which unifies the training and testing processes and avoids the mismatch between them.

2. RELATED WORK

There have been some efforts on the issue of duration variability. For example, in conventional i-vector systems, [13] proposed to propagate the uncertainty of the i-vector extraction process into the PLDA model, which better handles duration variability. Moreover, in deep-learning-based speaker embedding systems, a complementary center loss has been proposed [14, 15, 16] to address the large variation in text-independent utterances, including duration variation. It acts as a regularizer that reduces the intra-class variance of the final embedding vectors. However, these methods do not explicitly model the duration variability of utterances, so the mismatch between training and testing still exists. Furthermore, attention mechanisms have been utilized to capture long-term variations of speaker characteristics [11, 12]: an importance metric computed by an attention network is used to calculate the weighted mean of the frame-level embedding vectors. However, most of these attention mechanisms operate at the frame level, which leads to the "over-average" problem, especially when the utterance is very long.

3. PROPOSED APPROACH

It is still challenging for an LSTM to model the long-term dependency of utterances, especially very long ones. And the mismatch between the training and testing phases degrades the performance of speaker verification, especially when the durations of training and testing utterances differ greatly. We therefore propose the deep segment attentive embedding method to extract unified speaker embeddings for utterances of variable duration. As shown in Fig. 1, we use a sliding window with 50% overlap to divide utterances into fixed-length segments, and an LSTM is used to extract the embedding of each segment. Finally, all segment-level embeddings of an utterance are pooled into a fixed-dimension utterance-level speaker embedding through the segment attention mechanism. The whole network is trained with the joint supervision of the utterance-level and segment-level similarity losses. It can extract unified speaker embeddings for utterances of variable duration and take into account both local details and global information, especially for long utterances.

3.1. Deep segment attentive embedding

For both training and testing, we use a sliding window with 50% overlap to divide an utterance into fixed-length segments. Suppose we obtain N speech segments X = {x_1, x_2, ..., x_N}. The sliding-window length T is randomly chosen within [80, 120] frames, but the length of segments within a batch is fixed. The vector x_n^t represents the feature of segment n at frame t, which is fed into the network to produce the output h_n^t. The output at the last frame is used as the segment representation f(x_n; w) = h_n^T, where w denotes the network parameters. The segment-level speaker embedding is defined as the L2 normalization of the segment representation:

e_n = f(x_n; w) / ||f(x_n; w)||_2.   (1)

[Fig. 1. System overview: the segments x_1, ..., x_N of an utterance pass through the LSTM; the segment embeddings e_1, ..., e_N are attentively pooled with weights α_n into the speaker embedding ẽ, with cosine-similarity losses (ws + b) applied at both the segment level and the utterance level. For each training batch, there are Q × P utterances from Q different speakers, with P utterances per speaker; only one utterance is drawn for simplicity.]

We compute the embedding vector of each segment according to Eq. 1, obtaining E = {e_1, e_2, ..., e_N}. Let the dimension of the segment-level speaker embedding e_n be d_e. It is often the case that some segment-level embeddings are more relevant and important for discriminating speakers than others. We therefore apply an attention mechanism that integrates the segment embeddings by automatically calculating the importance of each segment. For each segment embedding e_n, we apply the multi-head attention mechanism [17] to learn a score α_n as follows:

α_n = softmax(g(e_n W_1) W_2),   (2)

where W_1 and W_2 are the parameters of the multi-head attention mechanism; W_1 is a matrix of size d_e × d_a and W_2 is a matrix of size d_a × d_r; d_a is the attention dimension and d_r is a hyperparameter giving the number of attention heads; g(·) is the ReLU activation function [18]. When d_r = 1, this reduces to basic attention. The normalized weight α_n ∈ [0, 1] is computed by the softmax function. The weight vector is then used in the attentive pooling layer to calculate the utterance-level speaker embedding ẽ:

ẽ = Σ_{n=1}^{N} α_n e_n.   (3)

When d_r = 1, ẽ is simply a weighted mean vector computed from E, which is expected to reflect one aspect of speaker discrimination in the given utterance. Obviously, speakers can be discriminated along multiple aspects, especially when the utterance is long. By increasing d_r, we can easily use multiple attention heads that focus on different aspects of an utterance.
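Eqs. (1)-(3) can be sketched in NumPy as follows. The random matrix seg stands in for the LSTM segment representations, and the dimensions (d_e = 256, d_a = 128, d_r = 5) follow the paper's settings; since this excerpt does not specify how multiple heads are fused back into a single d_e-dimensional embedding, averaging the per-head pooled vectors below is an assumption.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def segment_attentive_pooling(seg_repr, W1, W2):
    """Pool N segment representations into one utterance embedding.

    Eq. (1): L2-normalize each segment representation.
    Eq. (2): alpha = softmax(ReLU(e W1) W2), normalized over segments.
    Eq. (3): weighted sum of segment embeddings, one sum per head.
    """
    e = seg_repr / np.linalg.norm(seg_repr, axis=1, keepdims=True)  # (N, d_e)
    scores = np.maximum(e @ W1, 0.0) @ W2                           # (N, d_r)
    alpha = softmax(scores, axis=0)             # each head sums to 1 over N
    pooled = alpha.T @ e                        # (d_r, d_e), one vector per head
    return pooled.mean(axis=0), alpha           # head fusion by averaging (assumption)

rng = np.random.default_rng(0)
N, d_e, d_a, d_r = 8, 256, 128, 5
seg = rng.standard_normal((N, d_e))
utt_emb, alpha = segment_attentive_pooling(
    seg, 0.1 * rng.standard_normal((d_e, d_a)), 0.1 * rng.standard_normal((d_a, d_r)))
print(utt_emb.shape, alpha.shape)  # (256,) (8, 5)
```

With d_r = 1, pooled has a single row and the result is exactly the weighted mean of Eq. (3).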
To encourage diversity among the attention heads, [12] introduced a penalty term L_p for the case d_r > 1:

L_p = ||A^T A − I||_F^2,   (4)

where A = [α_1, ..., α_N] is the attention matrix, I is the identity matrix, and ||·||_F denotes the Frobenius norm of a matrix. L_p encourages each attention head to extract different information from the same utterance. Similar to L2 regularization, it is minimized together with the original cost of the system.

3.2. Loss function

After obtaining the utterance-level speaker embedding, we calculate the similarity loss using the generalized end-to-end (GE2E) formulation [7]. The GE2E loss processes a large number of utterances at once, minimizing the distances between embeddings of the same speaker while maximizing the distances between different speakers. For each training batch, we randomly choose Q × P utterances from Q different speakers, with P utterances per speaker, and calculate the utterance-level speaker embedding ẽ_ji of the j-th speaker's i-th utterance according to Equations 1, 2 and 3. The centroid of the embedding vectors from the j-th speaker is defined as

c_j = E_i[ẽ_ji] = (1/P) Σ_{i=1}^{P} ẽ_ji.   (5)

GE2E builds a similarity matrix S_{ji,k} of the scaled cosine similarities between each embedding vector ẽ_ji and all centroids c_k (1 ≤ j, k ≤ Q and 1 ≤ i ≤ P):

S_{ji,k} = w · cos(ẽ_ji, c_k) + b,   (6)

where w and b are learnable parameters. The weight is constrained to be positive, w > 0, so that the scaled similarity grows with the cosine similarity. During training, each utterance's embedding should be close to the centroid of its own speaker and far from other speakers' centroids. The loss on each speaker embedding ẽ_ji is defined as

L(ẽ_ji) = log Σ_{k=1}^{Q} exp(S_{ji,k}) − S_{ji,j}.   (7)
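A minimal NumPy sketch of the GE2E computation in Eqs. (5)-(7), together with the batch sum over all embeddings, is given below. It is an illustrative reimplementation rather than the authors' code; the (w, b) = (10, −5) initialization follows the GE2E formulation of [7], and the simple centroid of Eq. (5) is used for all terms.

```python
import numpy as np

def ge2e_loss(emb, w=10.0, b=-5.0):
    """GE2E loss on a batch of embeddings shaped (Q, P, d):
    Q speakers with P utterances each."""
    Q, P, _ = emb.shape
    emb = emb / np.linalg.norm(emb, axis=2, keepdims=True)
    c = emb.mean(axis=1)                           # Eq. (5): centroids, (Q, d)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    S = w * np.einsum('jid,kd->jik', emb, c) + b   # Eq. (6): scaled cosines, (Q, P, Q)
    own = S[np.arange(Q), :, np.arange(Q)]         # S_{ji,j}, shape (Q, P)
    L = np.log(np.exp(S).sum(axis=2)) - own        # Eq. (7), per embedding
    return L.sum()                                 # batch sum over j and i

# Two perfectly separated speakers: the loss is close to zero
emb = np.zeros((2, 3, 4))
emb[0, :, 0] = 1.0   # speaker 0's utterances all point along axis 0
emb[1, :, 1] = 1.0   # speaker 1's utterances all point along axis 1
print(ge2e_loss(emb))  # 2 * 3 * log(1 + exp(-10)) ~ 0.0003
```

The full objective described below additionally adds λ_s times the same loss computed on the segment-level embeddings and λ_p times the diversity penalty of Eq. (4).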
The utterance-level GE2E loss L_u is the sum of all losses over the similarity matrix:

L_u(x; w) = Σ_{j,i} L(ẽ_ji).   (8)

For text-independent speaker verification, each extracted segment-level embedding is expected to capture the speaker characteristics. To guide the segment attention to focus on the segments with more speaker-discriminative information, we further incorporate a similarity loss on the segment-level embeddings. The segment-level GE2E loss L_s is the same as the utterance-level GE2E loss L_u except that it takes all segment-level embeddings as input, which helps the proposed model learn more effective embedding fusion and accelerates convergence:

L_s(x; w) = Σ_{j,i} Σ_n L(e_n).   (9)

Finally, the utterance-level GE2E loss, the segment-level GE2E loss, and the penalty loss are combined into the total loss:

L = L_u + λ_s L_s + λ_p L_p,   (10)

where the hyperparameters λ_s and λ_p control the magnitudes of the segment-level GE2E loss and the penalty loss. With the joint optimization of the segment-level and utterance-level GE2E losses, both local details and global information of utterances are taken into account. The proposed method extracts unified speaker embeddings for utterances of variable duration, which unifies the training and testing processes and avoids the mismatch between them.

4. EXPERIMENTS

We report speaker verification performance on the Tongdun and VoxCeleb [19] corpora. The proposed deep segment attentive embedding is compared with the generalized end-to-end loss based embedding as well as the traditional i-vector. We use the Equal Error Rate (EER) to quantify system performance.

4.1. Data

Tongdun.
The corpus is from the speaker verification competition held by Tongdun Technology [20]. It consists of more than 120K utterances from 1,500 Chinese speakers in the training set, and 3,000 trial pairs are provided as test data. Most of the training data are short utterances with an average duration of 3.7 s, while the test utterances are very long, with an average duration of about 20 s.

VoxCeleb. The training set consists of more than 140K utterances from 1,251 speakers, and 37,720 trial pairs from 40 speakers are used as evaluation data for verification. The average durations of the training and evaluation data are 8.24 s and 8.28 s, respectively. For each speech utterance, a VAD [21, 22] is applied to prune out silence regions.

4.2. i-vector system

The i-vector system uses 20-dimensional MFCCs as front-end features, which are extended to 60-dimensional acoustic features with their first and second derivatives. Cepstral mean normalization is applied. A 400-dimensional i-vector is then extracted from the acoustic features using a 2048-mixture UBM and a total variability matrix. PLDA serves as the scoring back-end. Mean subtraction, whitening, and length normalization [23] are applied to the i-vectors as preprocessing steps, and similarity is measured using a PLDA model with a 400-dimensional speaker space.

4.3. Deep speaker embedding system

For the deep speaker embedding systems, we take 40-dimensional filter-banks with a 32-ms Hamming window and a 16-ms frame shift as the input features; each feature dimension is normalized to have zero mean and unit variance over the training set. A combination of a 3-layer LSTM and a linear projection layer is used to extract the speaker embeddings.

Table 1. Speaker Verification Results on Tongdun.
  Embedding       EER (%)
  i-vector/PLDA   3.0
  LSTM-GE2E       2.0
  DSAE-GE2E-1     1.5
  DSAE-GE2E-2     1.3
  DSAE-GE2E-5     1.0
Each LSTM layer contains 512 nodes, and the linear projection layer is connected to the last LSTM layer with an output size of 256. We therefore extract 256-dimensional speaker embeddings from the outputs of the linear projection layer. The cosine similarity score of a pair of embedding vectors is computed to verify the speaker. Following [7], the scaling factors w and b in Eq. 6 are initialized to 10 and −5, respectively.

We take the LSTM-based speaker embedding system proposed by Wan et al. [7], optimized with the GE2E loss, as the baseline, denoted "LSTM-GE2E". "LSTM-GE2E" uses local segments truncated from utterances to learn the utterance-level speaker embedding. The segment length is randomly chosen within [80, 120] frames, but the length of all segments within a batch is fixed. In the testing phase, each utterance is segmented by a sliding window of 100 frames with 50% overlap; we extract the embedding of each segment and then average them as the speaker embedding of the utterance. The embedding of each segment is obtained by applying a frame-level attention pooling operator to the outputs of the linear projection layer.

Compared to "LSTM-GE2E", the proposed deep segment attentive embedding system, denoted "DSAE-GE2E", uses the whole utterance to learn the utterance-level speaker embedding via segment attention. The segment attention is implemented by applying multi-head attention pooling to the segment-level embeddings. The attention dimension d_a is set to 128 and the number of attention heads d_r is chosen from {1, 2, 5}. In addition, "DSAE-GE2E" is jointly optimized by the utterance-level and segment-level GE2E losses, as shown in Eq. 10. The weights λ_s and λ_p in Eq. 10 are experimentally set to 0.2 and 0.001, respectively.

All deep speaker embedding models are trained from random initialization with the Adam optimizer [24]. The initial learning rate is set to 0.001 and is decayed according to validation-set performance. For each training batch, we randomly choose 640 utterances from 64 speakers, with 10 utterances per speaker. We note that the length of segments within a batch is fixed. About 15,000 batches are used to train the network. In addition, the L2 norm of the gradient is clipped at 3 to avoid gradient explosion [25].

4.4. Results

In the following results, "LSTM-GE2E" refers to the deep speaker embedding system trained with the GE2E loss, and "DSAE-GE2E-k" denotes the proposed deep segment attentive embedding system with a multi-head attention layer of k attention heads.

Table 1 shows the performance on the Tongdun test set. All deep-learning-based speaker embedding systems outperform the traditional i-vector system, which shows the effectiveness of deep speaker embeddings. In general, the proposed "DSAE-GE2E" consistently and significantly outperforms "LSTM-GE2E", and more attention heads yield greater improvement: "DSAE-GE2E-1" is 25% better in EER than "LSTM-GE2E", and "DSAE-GE2E-5" outperforms "LSTM-GE2E" by 50%. Note that the durations of Tongdun training and testing utterances differ greatly, and our systems can extract unified utterance-level speaker embeddings for utterances of variable duration, which significantly improves system performance. The results indicate that the proposed utterance-level speaker embedding is a duration-robust representation for speaker verification.

The performance on the VoxCeleb test set is shown in Table 2. The proposed "DSAE-GE2E" also outperforms the i-vector system and "LSTM-GE2E", which demonstrates the effectiveness of the proposed method: "DSAE-GE2E-1" is 6.5% better in EER than "LSTM-GE2E", and "DSAE-GE2E-5" outperforms "LSTM-GE2E" by 16.1%.

Table 2. Speaker Verification Results on VoxCeleb.
  Embedding       EER (%)
  i-vector/PLDA   8.9
  LSTM-GE2E       6.2
  DSAE-GE2E-1     5.8
  DSAE-GE2E-2     5.5
  DSAE-GE2E-5     5.2
The relative EER reduction is smaller than on the Tongdun corpus because there is little duration difference between VoxCeleb training and testing utterances. Our proposed method yields greater performance improvements when the durations of training and testing utterances differ more.

5. CONCLUSIONS

In this paper, we propose the deep segment attentive embedding method to learn unified speaker embeddings for utterances of variable duration. Each utterance is segmented by a sliding window and an LSTM is used to extract the embedding of each segment. Instead of using only one local segment, we use the whole utterance to learn the utterance-level embedding by applying attentive pooling to the embeddings of all segments. Moreover, a similarity loss on segment-level embeddings is introduced to guide the segment attention to focus on the segments with more speaker-discriminative information, and is jointly optimized with the similarity loss on utterance-level embeddings. Systematic experiments on Tongdun and VoxCeleb demonstrate the effectiveness of the proposed method. In future work, we will investigate different neural network architectures and attention strategies to obtain further performance improvements.

6. ACKNOWLEDGEMENTS

This work was supported by the China National Nature Science Foundation (No. 61573357, No. 61503382, No. 61403370, No. 61273267, No. 91120303).

7. REFERENCES

[1] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[2] Simon J. D. Prince and James H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in IEEE International Conference on Computer Vision, 2007, pp. 1–8.
[3] Sandro Cumani, Oldřich Plchot, and Pietro Laface, "Probabilistic linear discriminant analysis of i-vector posterior distributions," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7644–7648.
[4] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
[5] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in INTERSPEECH, 2017, pp. 999–1003.
[6] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 4052–4056.
[7] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4879–4883.
[8] Chunlei Zhang and Kazuhito Koishida, "End-to-end text-independent speaker verification with triplet loss on short utterances," in Proc. of Interspeech, 2017.
[9] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, pp. 4580–4584.
[10] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5115–5119.
[11] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, "Attentive statistics pooling for deep speaker embedding," in INTERSPEECH, 2018.
[12] Yingke Zhu, Tom Ko, David Snyder, Brian Mak, and Daniel Povey, "Self-attentive speaker embeddings for text-independent speaker verification," Proc. Interspeech 2018, pp. 3573–3577, 2018.
[13] Patrick Kenny, Themos Stafylakis, Pierre Ouellet, Md. Jahangir Alam, and Pierre Dumouchel, "PLDA for speaker verification with utterances of arbitrary duration," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7649–7653.
[14] Na Li, Deyi Tuo, Dan Su, Zhifeng Li, and Dong Yu, "Deep discriminative embeddings for duration robust speaker verification," in INTERSPEECH, 2018, pp. 2262–2266.
[15] Nam Le and Jean-Marc Odobez, "Robust and discriminative speaker embedding via intra-class distance variance regularization," in INTERSPEECH, 2018, pp. 2257–2261.
[16] Sarthak Yadav and Atul Rai, "Learning discriminative features for speaker identification and verification," in INTERSPEECH, 2018, pp. 2237–2241.
[17] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio, "A structured self-attentive sentence embedding," arXiv preprint, 2017.
[18] Vinod Nair and Geoffrey E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[19] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: a large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[20] "Tongdun Technology Speaker Verification Competition," https://www.kesci.com/home/competition/5b4eb2cfe87957000f9024a4/.
[21] Man Wai Mak and Hon Bill Yu, "A study of voice activity detection techniques for NIST speaker recognition evaluations," Computer Speech and Language, vol. 28, no. 1, pp. 295–313, 2014.
[22] Hon-Bill Yu and Man-Wai Mak, "Comparison of voice activity detectors for interview speech in NIST speaker recognition evaluation," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[23] Daniel Garcia-Romero and Carol Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in INTERSPEECH 2011, Conference of the International Speech Communication Association, Florence, Italy, August 2011, pp. 249–252.
[24] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint, 2014.
[25] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, "Understanding the exploding gradient problem," CoRR, abs/1211.5063, 2012.
