RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification


Authors: Jee-weon Jung, Hee-Soo Heo, Ju-ho Kim, Hye-jin Shim, and Ha-Jin Yu

Jee-weon Jung, Hee-Soo Heo, Ju-ho Kim, Hye-jin Shim, and Ha-Jin Yu†
School of Computer Science, University of Seoul, South Korea
jeewon.leo.jung@gmail.com, zhasgone@naver.com, wngh1187@naver.com, shimhz6.6@gmail.com, hjyu@uos.ac.kr

Abstract

Recently, direct modeling of raw waveforms using deep neural networks has been widely studied for a number of tasks in audio domains. In speaker verification, however, utilization of raw waveforms is in its preliminary phase, requiring further investigation. In this study, we explore end-to-end deep neural networks that input raw waveforms to improve various aspects: front-end speaker embedding extraction including model architecture, pre-training scheme, additional objective functions, and back-end classification. Adjustment of the model architecture using a pre-training scheme can extract speaker embeddings, giving a significant improvement in performance. Additional objective functions simplify the process of extracting speaker embeddings by merging the conventional two-phase process: extracting utterance-level features such as i-vectors or x-vectors, and the feature enhancement phase, e.g., linear discriminant analysis. Effective back-end classification models that suit the proposed speaker embeddings are also explored. We propose an end-to-end system that comprises two deep neural networks, one front-end for utterance-level speaker embedding extraction and the other for back-end classification. Experiments conducted on the VoxCeleb1 dataset demonstrate that the proposed model achieves state-of-the-art performance among systems without data augmentation. The proposed system is also comparable to the state-of-the-art x-vector system that adopts data augmentation.

Index Terms: raw waveform, deep neural network, end-to-end, speaker embedding
1. Introduction

Direct modeling of raw waveforms using deep neural networks (DNNs) is increasingly prevalent in a number of tasks due to advances in deep learning [1-8]. In speech recognition, studies such as those of Palaz et al., Sainath et al., and Hoshen et al. deal with raw waveforms as input [1-3]. In speaker recognition, studies by Jung et al. and Muckenhirn et al. were the first to comprise systems that input raw waveforms [5-7]. Other domains such as spoofing detection and automatic music tagging are also adopting raw waveform inputs [4, 8].

† Corresponding author. This work was supported by the Technology Innovation Program (10076583, Development of free-running speech recognition technologies for embedded robot systems) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

DNNs that directly input raw waveforms have a number of advantages over conventional acoustic feature-based DNNs. First, minimizing pre-processing removes the need to explore various hyper-parameters such as the type of acoustic feature to use, window size, shift length, and feature dimension. This is expected to lower entry barriers to conducting studies and lessen the burden of follow-up studies. Additionally, with recent trends of DNNs replacing more sub-processes in various tasks, a raw waveform DNN is well positioned to benefit from future advances in deep learning.

Studies across various tasks have shown that an assembly of multiple frequency responses can be extracted when raw waveforms are processed by each kernel of convolutional layers [6, 9]. Spectrograms, in contrast, have linearly positioned frequency bands, meaning the first convolutional layer sees only adjacent frequency bands (although repetition of convolutions can aggregate various frequency responses at deeper layers).
In other words, a spectrogram-based CNN can only see fixed frequency regions determined by its internal pooling rule. This difference is hypothesized to increase the potential of directly modeling raw waveforms: as increasing amounts of data become available, this data-driven approach can extract an aggregation of informative frequency responses appropriate to the target task.

In this study, we improve various aspects of the raw waveform DNN proposed by Jung et al., which was the first end-to-end model in speaker verification using raw waveforms [5, 7]. This model extracts frame-level embeddings using residual blocks with convolutional neural networks (CNNs) [10, 11], and then aggregates features into an utterance-level embedding using long short-term memory (LSTM) [12, 13]. Key improvements made by our study include the following:

1. Model architecture: adjustments to various network configurations
2. CNN pre-training scheme: removal of inefficient aspects of the multi-step training scheme in [7, 14]
3. Objective function: additional objective functions to incorporate the speaker embedding extraction and feature enhancement phases
4. Back-end classification models: comparison of various DNN-based back-end classifiers and proposal of a simple, effective back-end DNN classifier

Through changing these aspects, performance is significantly enhanced. The equal error rate (EER) of the utterance-level speaker embedding DNN with cosine similarity on the VoxCeleb1 dataset is 4.8%, showing a 44.8% relative error rate reduction (RER) compared to the baseline [7]. An EER of 4.0% was achieved for the end-to-end model using two DNNs, showing an RER of 46.0%.

The rest of this paper is organized as follows. Section 2 describes the front-end speaker embedding extraction model. Section 3 addresses various back-end classification models. Experiments and results are in Sections 4 and 5, and the paper is concluded in Section 6.
2. Front-end: RawNet

We propose a model (referred to as "RawNet" for convenience) that improves the CNN-LSTM model in [5, 7] by changing architectural details (Section 2.1), proposing a modified pre-training scheme (Section 2.2), and incorporating a speaker embedding enhancement phase (Section 2.3).

2.1. Model architecture

The DNN used in this study comprises residual blocks, a gated recurrent unit (GRU) layer [15, 16], a fully-connected layer (used for extraction of the speaker embedding), and an output layer. In this architecture, input features are first processed using the residual blocks [10] to extract frame-level embeddings. The residual blocks comprise convolutional layers with identity mapping [11] to facilitate the training of deep architectures. A GRU is then employed to aggregate the frame-level features into a single utterance-level embedding. The utterance-level embedding is then fed into one fully-connected layer. The output of the fully-connected layer is used as the speaker embedding and is connected to the output layer, where the number of nodes is identical to the number of speakers in the training set.

The proposed RawNet architecture is depicted in Table 1. It includes a number of modifications to the CNN-LSTM model in [5, 7], allowing for further improvement. First, activation functions are changed from rectified linear units (ReLU) to leaky ReLU. Second, the LSTM layer is changed to a GRU layer. Third, the number of parameters is significantly decreased, including a lower dimensionality of the speaker embedding (from 1024 to 128).

2.2. CNN pre-training scheme

Extracting utterance-level speaker embeddings directly from raw waveforms often leads to overfitting toward the training set [7]. In [7], the multi-step training proposed in [14] was used to avoid this phenomenon.
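As a sanity check on the shapes in Table 1 (a sketch of ours, not the authors' code), the temporal downsampling of the front-end can be traced in pure Python: one stride-3 convolution followed by six residual blocks, each ending in MaxPool(3):

```python
def rawnet_lengths(n_samples: int = 59049):
    """Trace the sequence length after each downsampling stage of RawNet.

    Follows Table 1: a strided convolution (stride 3), then six residual
    blocks (two with 128 filters, four with 256), each ending in MaxPool(3).
    Filter counts and parameters are omitted; only lengths are tracked.
    """
    lengths = []
    n = n_samples // 3          # strided convolution, stride 3
    lengths.append(n)
    for _ in range(6):          # six residual blocks
        n = n // 3              # MaxPool(3) at the end of each block
        lengths.append(n)
    return lengths
```

With the 59,049-sample training input, the lengths after the strided convolution, the second residual block, and the sixth residual block are 19,683, 2,187, and 27, matching the output shapes listed in Table 1; the final 27 frames of 256-dimensional features are what the GRU aggregates.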
This training scheme first trains a CNN (for frame-level training), and then expands it to a CNN-LSTM (for utterance-level training). This scheme demonstrates significant improvement compared to training a CNN-LSTM from random initialization.

However, the multi-step training approach in [7] is inefficient because, after training 9 residual blocks, 3 residual blocks containing a number of layers are removed when expanding the trained CNN model to a CNN-LSTM model. In our study, a new approach that interprets the CNN training phase as pre-training is applied. This approach adopts fewer residual convolutional blocks, i.e., 6, connected to a global average pooling layer. After training the CNN, only the global average pooling layer is removed. The objective is to use a number of convolutional blocks appropriate for training with the recurrent layer and not to remove any parameters. This modification enables more efficient and faster training. Application of the model architecture modifications detailed in Section 2.1 and the CNN pre-training scheme exhibited an RER of 26.4% (see Table 2).

2.3. Additional objective functions for speaker embedding enhancement

For speaker verification, a number of studies enhance extracted utterance-level features through an additional process before back-end classification. Linear discriminant analysis (LDA) in i-vector/PLDA systems is one example [17, 18]. Well-known methods such as LDA, or recent DNN-based deep embedding enhancement including the discriminative auto-encoder (DCAE), have been applied for this purpose.

Table 1: RawNet architecture. For convolutional layers, the numbers inside parentheses refer to filter length, stride size, and number of filters. For the gated recurrent unit (GRU) and fully-connected layers, the number inside parentheses indicates the number of nodes. The input sequence of 59,049 samples is based on the training mini-batch configuration.
At the evaluation phase, the input sequence length differs. Center loss and between-speaker loss are omitted for simplicity. For residual blocks, layers under the dotted line are conducted after the residual connection.

Layer          | Configuration                  | Output shape
Input          | 59,049 samples                 |
Strided conv   | Conv(3,3,128), BN, LeakyReLU   | (19683, 128)
Res block (×2) | Conv(3,1,128), BN, LeakyReLU,  | (2187, 128)
               | Conv(3,1,128), BN, LeakyReLU,  |
               | MaxPool(3)                     |
Res block (×4) | Conv(3,1,256), BN, LeakyReLU,  | (27, 256)
               | Conv(3,1,256), BN, LeakyReLU,  |
               | MaxPool(3)                     |
GRU            | GRU(1024)                      | (1024,)
Speaker embed. | FC(128)                        | (128,)
Output         | FC(1211)                       | (1211,)

In such approaches, one of the main objectives is to minimize intra-class covariance and maximize inter-class covariance of the utterance-level features. In this study, we aim to incorporate the two phases of speaker embedding extraction and feature enhancement into a single phase, using two additional objective functions.

To consider both inter-class and intra-class covariance, we utilize center loss [19] and speaker basis loss [20] in addition to categorical cross-entropy loss for DNN training. We adopt center loss [19] to minimize intra-class covariance while the embedding in the last hidden layer remains discriminative. To achieve this goal, the center loss function was proposed as

    L_C = (1/2) * sum_{i=1}^{N} || x_i - c_{y_i} ||_2^2,    (1)

where x_i refers to the embedding of the i-th utterance, c_{y_i} refers to the center of class y_i, and N refers to the size of a mini-batch.

Speaker basis loss [20] aims to further maximize inter-class covariance.
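The center loss of Eq. (1) and the speaker basis loss (formulated in Eq. (2) below) are both inexpensive to evaluate. A minimal NumPy sketch, with function names and array shapes of our own choosing; note that in [19] the class centers are additionally updated by a separate rule during training, which is omitted here:

```python
import numpy as np

def center_loss(x, y, centers):
    """Eq. (1): L_C = 1/2 * sum_i ||x_i - c_{y_i}||_2^2 over a mini-batch.

    x: (N, D) batch of embeddings; y: (N,) speaker indices;
    centers: (num_speakers, D) per-speaker class centers.
    """
    diff = x - centers[y]            # each sample minus its class center
    return 0.5 * np.sum(diff * diff)

def speaker_basis_loss(W):
    """Sum of pairwise cosine similarities between the output-layer weight
    vectors (one basis vector per training speaker); i == j terms excluded.

    W: (M, D) matrix whose rows are the per-speaker basis vectors.
    """
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    cos = Wn @ Wn.T                                    # all pairwise cosines
    return np.sum(cos) - np.trace(cos)                 # drop the diagonal
```

Pushing the basis vectors toward mutual orthogonality drives `speaker_basis_loss` toward zero, which is the inter-class separation the text describes.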
This loss function considers the weight vector between the last hidden layer and a node of the softmax output layer as a basis vector for the corresponding speaker, and is formulated as:

    L_BS = sum_{i=1}^{M} sum_{j=1, j≠i}^{M} cos(w_i, w_j),    (2)

where w_i is the basis vector of speaker i and M is the number of speakers within the training set. Hence, the final objective function used in this study is

    L = L_CE + λ * L_C + L_BS,    (3)

where L_CE refers to the categorical cross-entropy loss and λ refers to the weight of L_C.

Table 2: Experimental results showing the effectiveness of the RawNet system. "Intermediate" refers to results after applying the modifications proposed in Sections 2.1 and 2.2. The back-end classifier is fixed to cosine similarity.

System       | Loss           | Enhance | EER (%)
Baseline [7] | Soft           | None    | 8.7
Intermediate | Soft           | None    | 6.8
             |                | LDA     | 6.5
             |                | DCAE    | 6.4
RawNet       | Soft+Center+BS | None    | 4.8
             |                | LDA     | 7.6
             |                | DCAE    | 4.9

Table 3: Comparison of equal error rates (EER) of various front-end speaker embedding extraction systems. The back-end classifier is fixed to cosine similarity.

System                      | EER (%)
i-vector                    | 13.8
i-vector/LDA                | 7.25
x-vector (w/o augment) [21] | 11.3
x-vector (w/ augment) [21]  | 9.9
RawNet                      | 4.8

3. DNN-based back-end classification

In speaker verification, cosine similarity and PLDA are widely used for back-end classification to determine whether two speaker embeddings belong to the same speaker [18]. Although PLDA has shown competitive results in a number of studies, DNN-based classifiers have also shown potential in previous research [5]. A number of DNN-based back-end classifiers have been explored: concatenation of speaker embeddings, and the b-vector and rb-vector systems [22, 23]. A novel DNN-based back-end classifier is introduced based on an analysis of the b-vector system.

The b-vector proposed by Lee et al. exploits element-wise binary operations on speaker embeddings to represent their relationship [22].
Operations include addition, subtraction, and multiplication. The results are concatenated, composing a b-vector that has three times the dimensionality of a single speaker embedding. Although the technique itself is simple, results demonstrate that binary operations effectively represent the relationship between the speaker embeddings.

The rb-vector is an expansion of the b-vector, where an additional r-vector is used together with the b-vector [23]. The main purpose of the r-vector approach is to represent the relationship of the speaker embeddings to the training set. To compose r-vectors, a fixed number of representative vectors are derived using k-means on the speaker embeddings of the training set. For every trial, b-vectors are composed by conducting binary operations between the representative vectors and the speaker embedding. These b-vectors are dimensionally reduced using principal component analysis (PCA) and then concatenated, producing r-vectors.

Table 4: Comparison of various back-end classifiers. Speaker embeddings extracted from RawNet are used. Results are reported in terms of equal error rate (EER, %).

Classifier        | EER (%)
Cosine similarity | 4.8
PLDA              | 4.8
LDA/PLDA          | 4.8
b-vector          | 4.1
rb-vector         | 4.1
concat&mul        | 4.0

The b-vector approach uses various element-wise binary operations to derive the relationship between the speaker embeddings. However, because weighted summation operations in a DNN can replace (or even find better combinations of) addition and subtraction, we hypothesized that the core term contributing to the success of the b-vector is the multiplication operation. Therefore, we propose an approach using the concatenation of the enrolment speaker embedding, the test utterance embedding, and their element-wise multiplication. Experimental results show that by only adding element-wise multiplication, performance exceeds that of the b-vector (see Table 4, 'concat&mul').
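The two trial-level representations discussed in this section differ only in which element-wise operations are kept; a minimal NumPy sketch (function names are ours):

```python
import numpy as np

def b_vector(e1, e2):
    """b-vector [22]: concatenated element-wise add, subtract, and multiply
    of the enrolment and test embeddings (3x the embedding dimension)."""
    return np.concatenate([e1 + e2, e1 - e2, e1 * e2])

def concat_mul(e1, e2):
    """Proposed back-end input: the two embeddings themselves plus their
    element-wise product. A DNN classifier can recover weighted sums and
    differences on its own, so only multiplication is kept explicitly."""
    return np.concatenate([e1, e2, e1 * e2])
```

With RawNet's 128-dimensional embeddings, both features are 384-dimensional and would feed the fully-connected back-end classifier described in Section 4.2.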
4. Experimental settings

Experiments in this study were conducted using Keras, a deep learning library in Python with a TensorFlow back-end [27-29]. Code used for the experiments is available at https://github.com/Jungjee/RawNet.

4.1. Dataset

We use the VoxCeleb1 dataset, which comprises approximately 330 hours of recordings from 1251 speakers in text-independent scenarios and has a number of comparable recent studies in the literature. All utterances are encoded at a 16 kHz sampling rate with 16-bit resolution. As the dataset comprises various utterances of celebrities from YouTube, it includes diverse background noise and varied durations. We followed the official guidelines, which divide the dataset into training and evaluation sets of 1211 and 40 speakers, respectively.

4.2. Experimental configurations

We did not apply any pre-processing, such as normalization, to the raw waveforms except pre-emphasis [30], which was experimentally shown to be effective in our internal comparison experiments. For mini-batch construction, utterances were either cropped or duplicated (concatenated until the target length is reached) to 59,049 samples (≈3.59 s) in the training phase, following [5, 7]. In the evaluation phase, no adjustments were made to length; the whole utterance was used.

RawNet comprises one strided convolutional layer, six residual blocks, one GRU layer, one fully-connected layer, and an output layer (see Table 1). Each residual block comprises two convolutional layers, two batch normalization (BN) layers [31], two leaky ReLU layers, and a max pooling layer, as shown in Table 1. The residual connection adds the input of each residual block to the output of the second BN layer. A GRU layer with 1024 nodes aggregates the frame-level embeddings into an utterance-level embedding.

Table 5: Overall results of speaker verification systems applied to the VoxCeleb1 dataset.

System                 | Input feature | Front-end | Back-end   | Loss              | Dims | Augment | EER (%)
Shon et al. [21]       | MFCC          | x-vector  | PLDA       | Softmax           | 600  | Light   | 6.0
Shon et al. [21]       | MFCC          | 1D-CNN    | PLDA       | Softmax           | 512  | Light   | 5.3
Hajibabaei et al. [24] | Spectrogram   | ResNet-20 | N/A        | A-Softmax         | 128  | Heavy   | 4.4
Hajibabaei et al. [24] | Spectrogram   | ResNet-20 | N/A        | AM-Softmax        | 128  | Heavy   | 4.3
Okabe et al. [25]      | MFCC          | x-vector  | PLDA       | Softmax           | 1500 | Heavy   | 3.8
Nagrani et al. [26]    | MFCC          | i-vector  | PLDA       | -                 | -    | -       | 8.8
Ours                   | MFCC          | i-vector  | PLDA       | -                 | 250  | -       | 5.1
Nagrani et al. [26]    | Spectrogram   | VGG-M     | Cosine     | Metric learning   | 256  | -       | 7.8
Jung et al. [7]        | Raw waveform  | CNN-LSTM  | Cosine     | Softmax           | 1024 | -       | 8.7
Jung et al. [7]        | Raw waveform  | CNN-LSTM  | b-vector   | Softmax           | 1024 | -       | 7.7
Shon et al. [21]       | MFCC          | x-vector  | PLDA       | Softmax           | 512  | -       | 7.1
Shon et al. [21]       | MFCC          | 1D-CNN    | PLDA       | Softmax           | 600  | -       | 5.9
Ours                   | Raw waveform  | RawNet    | Cosine     | Softmax+Center+BS | 128  | -       | 4.8
Ours                   | Raw waveform  | RawNet    | concat&mul | Softmax+Center+BS | 128  | -       | 4.0

One fully-connected layer is used to extract speaker embeddings. The output layer has 1211 nodes, representing the number of speakers in the training set. Back-end classifiers, including the b-vector, rb-vector, and concat&mul systems, comprise four fully-connected layers with 1024 nodes. L2 regularization (weight decay) with a weight factor of 10^-4 was applied to all layers. An AMSGrad optimizer with a learning rate of 10^-3 and a decay parameter of 10^-4 was used [32]. For center loss, λ = 10^-3 was used. Training was conducted using a mini-batch size of 102. Recurrent dropout at a rate of 0.3 was applied to the GRU layer [33].

5. Results and analysis

Table 2 demonstrates the effectiveness of the modifications made to the model architecture (Section 2.1) and the pre-training scheme (Section 2.2). It also shows the effect of the additional objective functions for incorporating explicit feature enhancement phases (Section 2.3). First, modifications to the model architecture and the pre-training scheme reduced the EER from 8.7% to 6.8%. Application of the additional objective functions further decreased the EER to 4.8%.
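The relative error rate reductions reported in this section follow directly from the EER pairs; for example (pure Python):

```python
def relative_error_reduction(eer_baseline: float, eer_system: float) -> float:
    """RER (%): relative reduction of the EER with respect to a baseline."""
    return 100.0 * (eer_baseline - eer_system) / eer_baseline

# Baseline CNN-LSTM [7] at 8.7% EER versus RawNet at 4.8% EER:
print(round(relative_error_reduction(8.7, 4.8), 1))  # -> 44.8
```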
The proposed RawNet demonstrates an RER of 44.8% compared to the baseline [7]. Intermediate, which does not include the additional objective functions, could benefit from explicit feature enhancement techniques. On the other hand, RawNet, which includes the additional objective functions, exhibits an EER of 4.8% without explicit feature enhancement techniques. RawNet also outperforms Intermediate with feature enhancement, showing that it has successfully incorporated the feature enhancement phase.

A comparison with state-of-the-art front-end embedding extraction systems with a cosine similarity back-end is shown in Table 3. The proposed RawNet demonstrates the lowest EER compared to both the i-vector and x-vector systems. For i-vector systems, we compared two configurations: with and without LDA feature enhancement. For x-vector systems, two configurations based on Shon et al. were compared, where the data augmentation scheme introduced in [34] was conducted using reverberation and various noises.

Table 4 describes the performance of various back-end classifiers with the proposed RawNet speaker embeddings. In our experiments, PLDA did not show improved results compared to the baseline cosine similarity. On the other hand, DNN-based back-end classifiers demonstrated significant improvement, with an RER of 16%. The rb-vector system did not show additional improvement over the b-vector system. Among DNN-based back-end classifiers, the proposed approach of using element-wise multiplication of the speaker embeddings from enrolment and test utterances performed best, with an EER of 4.0%.

Recent studies using the VoxCeleb1 dataset are compared in Table 5. The first five rows depict systems that utilize data augmentation techniques. "Heavy" refers to the augmentation scheme of [24, 25], with various noises from the PRISM dataset and reverberations from the REVERB challenge dataset.
"Light" refers to the scheme used in [21, 34] that doubles the size of the dataset using noise and reverberations. The i-vector system with a PLDA back-end, in two implementation versions (one from Nagrani et al. and the other from our implementation), demonstrates EERs of 8.8% and 5.1%, respectively. The x-vector system with a PLDA back-end without data augmentation, as studied by Shon et al., demonstrates an EER of 7.1%. The proposed RawNet system with concat&mul demonstrates the best performance among systems without data augmentation, exhibiting an EER of 4.0%. The x-vector/PLDA system of Okabe et al. [25] is the only system showing a lower EER than our proposed system, but the former is subjected to intensive data augmentation, which hinders direct comparison.

6. Conclusion

In this paper, we propose an end-to-end speaker verification system using two DNNs, one for speaker embedding extraction and the other for back-end classification. The proposed system has a simple, yet efficient, process pipeline where speaker embeddings are extracted directly from raw waveforms and verification results are directly produced using the two DNNs. Various techniques that compose the proposed RawNet have been explored, including the pre-training scheme and additional objective functions. RawNet with the concat&mul back-end classifier demonstrates an EER of 4.0% on the VoxCeleb1 dataset, which is state-of-the-art among systems without data augmentation, including the x-vector system.

The proposed RawNet with concat&mul inputs raw waveforms and outputs verification results. Such a simplified process pipeline is expected to lower barriers to research and provide opportunities for many researchers to apply new techniques.

7. References

[1] D. Palaz, M. Doss, and R.
Collobert, "Convolutional neural networks-based continuous speech recognition using raw speech signal," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4295–4299.
[2] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[3] Y. Hoshen, R. J. Weiss, and K. W. Wilson, "Speech acoustic modeling from raw multichannel waveforms," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4624–4628.
[4] H. Dinkel, N. Chen, Y. Qian, and K. Yu, "End-to-end spoofing detection with raw waveform CLDNNs," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4860–4864.
[5] J. Jung, H. Heo, I. Yang, H. Shim, and H. Yu, "A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5349–5353.
[6] H. Muckenhirn, M. Doss, and S. Marcel, "Towards directly modeling raw speech signal for speaker verification using CNNs," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4884–4888.
[7] J. Jung, H. Heo, I. Yang, H. Shim, and H. Yu, "Avoiding speaker overfitting in end-to-end DNNs using raw waveform for text-independent speaker verification," in Proc. Interspeech 2018, 2018, pp. 3583–3587.
[8] J. Lee, J. Park, K. L. Kim, and J. Nam, "Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms," arXiv preprint arXiv:1703.01789, 2017.
[9] M. Ravanelli and Y.
Bengio, "Speaker recognition from raw waveform with SincNet," arXiv preprint arXiv:1808.00158, 2018.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[11] ——, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[12] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[13] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," 1999.
[14] H.-s. Heo, J.-w. Jung, I.-h. Yang, S.-h. Yoon, and H.-j. Yu, "Joint training of expanded end-to-end DNN for text-dependent speaker verification," in INTERSPEECH, 2017, pp. 1532–1536.
[15] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[16] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[17] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[18] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in Odyssey, 2010, p. 14.
[19] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," in European Conference on Computer Vision. Springer, 2016, pp. 499–515.
[20] H.-S. Heo, J.-w. Jung, I.-H. Yang, S.-H. Yoon, H.-j. Shim, and H.-J.
Yu, "End-to-end losses based on speaker basis vectors and all-speaker hard negative mining for speaker verification," arXiv preprint arXiv:1902.02455, 2019.
[21] S. Shon, H. Tang, and J. Glass, "Frame-level speaker embeddings for text-independent speaker recognition and analysis of end-to-end model," arXiv preprint arXiv:1809.04437, 2018.
[22] H. S. Lee, Y. Tso, Y. F. Chang, H. M. Wang, and S. K. Jeng, "Speaker verification using kernel-based binary classifiers with binary operation derived features," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 1660–1664.
[23] H.-S. Heo, I.-H. Yang, M.-J. Kim, S.-H. Yoon, and H.-J. Yu, "Advanced b-vector system based deep neural network as classifier for speaker verification," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5465–5469.
[24] M. Hajibabaei and D. Dai, "Unified hypersphere embedding for speaker recognition," arXiv preprint arXiv:1807.08312, 2018.
[25] K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," arXiv preprint arXiv:1803.10963, 2018.
[26] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Interspeech, 2017.
[27] F. Chollet et al., "Keras," https://github.com/keras-team/keras, 2015.
[28] A. Martín, A. Ashish, B. Paul, B. Eugene et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," 2015. [Online]. Available: http://download.tensorflow.org/paper/whitepaper2015.pdf
[29] A. Martin, B. Paul, C. Jianmin, C. Zhifeng, D. Andy, D. Jeffrey, D. Matthieu, G. Sanjay, I. Geoffrey, I. Michael, K. Manjunath, L. Josh, M. Rajat, M. Sherry, M. G. Derek, S. Benoit, T. Paul, V. Vijay, W. Pete, W. Martin, Y. Yuan, and Z.
Xiaoqiang, "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283. [Online]. Available: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
[30] R. Vergin and D. O'Shaughnessy, "Pre-emphasis and speech recognition," in Electrical and Computer Engineering, 1995. Canadian Conference on, vol. 2. IEEE, 1995, pp. 1062–1065.
[31] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448–456.
[32] S. J. Reddi, S. Kale, and S. Kumar, "On the convergence of Adam and beyond," 2018.
[33] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 1019–1027.
[34] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
