A Study On Convolutional Neural Network Based End-To-End Replay Anti-Spoofing

Bhusan Chettri, Saumitra Mishra, Bob L. Sturm and Emmanouil Benetos
School of Electronic Engineering and Computer Science
Queen Mary University of London, United Kingdom
{b.chettri,saumitra.mishra,b.sturm,emmanouil.benetos}@qmul.ac.uk

Abstract

The second Automatic Speaker Verification Spoofing and Countermeasures challenge (ASVspoof 2017) focused on “replay attack” detection. The best deep-learning systems to compete in ASVspoof 2017 used Convolutional Neural Networks (CNNs) as a feature extractor. In this paper, we study their performance in an end-to-end setting. We find that these architectures show poor generalization on the evaluation dataset, but find a compact architecture that shows good generalization on the development data. We demonstrate that for this dataset it is not easy to obtain a similar level of generalization on both the development and evaluation data. This leads to a variety of open questions about what the differences are in the data; why these are more evident in an end-to-end setting; and how these issues can be overcome by increasing the training data.

1. Introduction

The Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof) focuses on techniques to make automatic speaker verification (ASV) systems robust against spoofing. A spoofing attack ‘fools’ an ASV system by using a speech utterance that imitates the vocal characteristics of the real speaker. The four commonly used methods to generate spoofed speech are (1) text-to-speech; (2) voice conversion; (3) mimicry; and (4) replay. The second edition of the challenge, referred to as ASVspoof 2017 (organised at Interspeech 2017), focused on text-dependent replay attack detection [1, 2].

The replay attack is a simple spoofing method that involves recording the original speech (through a recording device, e.g., a mobile phone) and then replaying it (through a playback device, e.g., speakers) to the biometric system. The challenge involved building an anti-spoofing system to identify an input speech utterance as genuine or spoofed. Building such a system is a challenging task. One reason could be the high quality of the replayed speech (if recorded and replayed through high-quality devices). Moreover, the ASVspoof 2017 challenge was particularly difficult as the challenge datasets were imbalanced (biased towards the ‘spoofed’ class) and had a large number of mutually exclusive spoofing configurations.

One way to design a replay attack detection system is by hand-crafting features that capture the cues that differentiate a genuine signal from a replayed one. However, this feature extraction approach often requires domain expertise and is especially challenging for modelling high-dimensional data (e.g., images, audio). In another direction, researchers propose to train deep neural network models (DNNs) to learn the desired features automatically from data [3]. Recent results claim that with a ‘large’ amount of training data (e.g., millions of images, several hundred hours of audio data) and ‘enough’ computing resources (e.g., powerful GPUs), the deep models outperform the shallow models.

Many competing systems [4, 5, 6, 7, 8] in ASVspoof 2017 used DNN models. For example, the best systems in the challenge, [4] and [5], used deep Convolutional Neural Network (CNN) models to learn better feature representations. The authors then trained shallow classifiers (a Gaussian Mixture Model (GMM) and a Support Vector Machine (SVM), respectively) over the extracted features to discriminate between genuine and spoofed recordings. The best system from [4], a fusion of three models (two of which use DNNs), achieves a remarkable performance (EER = 6.73%) on the evaluation dataset.

The success of deep learning in ASVspoof 2017 inspires our research to analyse and understand the behaviour of these models [9]. For example, given a trained neural network model, we plan to visualise the features that influence its decisions. A prerequisite for analysing these models is to train one that performs ‘fairly’ (better than the baseline) on the evaluation dataset.

In this work we report our experiments and the challenges in designing a deep anti-spoofing system that is trained and evaluated on the ASVspoof database. We first describe our work to replicate the state-of-the-art system [4] in an end-to-end setting. We train an end-to-end network as it is a preliminary requirement for using the feature visualisation and model analysis techniques from the literature [10, 11, 12]. We found that our CNN-based model generalises on the development dataset, but consistently under-performs on the evaluation dataset (section 3). We then explain our experiments to find a suitable architecture that generalises well to the unseen data (section 4). We explored a number of architectures, including the second best system in the challenge [5], but the performance on the evaluation dataset is always poor (EER > 26%). This raises several interesting questions about the possible differences in the dataset, why they are more evident in an end-to-end setting, and what the possible ways to tackle this problem are. We also propose a novel CNN architecture for the spoofing detection task that has about 5k free parameters.

2. Background

In this section we briefly introduce the ASVspoof 2017 challenge, the spoofing dataset and some of the top-performing deep-learning systems.

Table 1: The ASVspoof 2017 database statistics.
subset   # spkrs   # genuine   # spoofed   dur (hr)
train    10        1508        1508        2.22
dev      8         760         950         1.44
eval     24        1298        12922       11.95

2.1. The ASVspoof 2017 Challenge and the Database

The ASVspoof 2017 challenge was held as a special session at the Interspeech 2017 conference. The challenge attracted wide participation, with a total of 49 submitted systems. The first ASVspoof challenge, held in 2015, focused on text-independent text-to-speech and voice conversion spoofing. ASVspoof 2017 focused on text-dependent replay attack detection ‘in the wild’ with varying acoustic conditions [1]. Given a recorded speech utterance s, the main goal of the ASVspoof 2017 challenge is to build an anti-spoofing system that determines whether s is genuine speech.

In Table 1 we show the ASVspoof 2017 database statistics. More details on the database can be found in [13]. Our prior work in [14] reports issues found in this database. The model predictions are highly influenced by the initial silence frames of zeros present in the genuine signals but missing in the spoofed counterpart. We also found that the two audio files T_1001658.wav and T_1000150.wav do not contain any speech recording. Therefore, we have removed them in our study. Recently, an updated ASVspoof 2017 database version 2 has been released online (https://datashare.is.ed.ac.uk/handle/10283/3017). Our work in this paper, however, is based on the version 1 database.

In Table 2 we provide insights on how spoofed audio files are distributed between the training and the development set. A total of fifteen playback devices (P01-P15), sixteen recording devices (R01-R16) and six environments (E01-E06) are used to develop the ASVspoof 2017 replay database [2]. In the development set we find two new environments (E01 and E03), five new playback devices and six recording devices that do not appear in the training set. A spoofing configuration refers to a unique combination of playback device, recording device and the environment where the audio is replayed. We find three spoofing configurations in the training set and nine in the development set. One spoofing configuration, ‘E02 P02 R04’, appears in both the training and the development set.

Table 2: Spoofing configuration statistics on the ASVspoof 2017 version 1 database. The number inside the brackets indicates the number of audio files. The letters t, d, e refer to the training, development and evaluation subsets. env, pd and rd denote spoofing environment, playback device and recording device respectively.
sub | # env     | # pd      | # rd      | # configurations
t   | E02(335)  | P02(335)  | R04(1508) | E02 P02 R04 (335)
    | E05(1173) | P05(1169) |           | E05 P05 R04 (1168)
    |           | P10(4)    |           | E05 P10 R04 (4)
d   | E01(95)   | P02(95)   | R02(95)   | E02 P02 R04 (95)
    | E02(190)  | P04(95)   | R04(95)   | E05 P08 R07 (95)
    | E03(285)  | P07(95)   | R06(95)   | E05 P08 R11 (95)
    | E05(380)  | P09(95)   | R03(95)   | E03 P15 R08 (95)
    |           | P15(285)  | R07(190)  | E03 P15 R07 (95)
    |           | P08(285)  | R08(190)  | E03 P15 R11 (95)
    |           |           | R11(190)  | E05 P04 R03 (95)
    |           |           |           | E02 P07 R02 (95)
    |           |           |           | E05 P08 R08 (95)
    |           |           |           | E01 P09 R06 (95)
e   | meta-data not available

2.2. Published Deep-Learning Systems

We now provide a short description of the published deep-learning systems on the ASVspoof 2017 database.

• A [4]: This system used score-level fusion of three systems. The first is a GMM trained on features extracted from a CNN. The second is an i-vector based SVM system trained on linear prediction cepstral coefficients, and the third is an end-to-end CNN-RNN system.
• B [5]: The authors train a CNN to model spoofing configurations on tandem features obtained by combining Constant Q Cepstral Coefficients (CQCC) with High Frequency Cepstral Coefficients (HFCC). They then use it as a feature extractor to obtain high-dimensional features on which a binary SVM classifier is trained to discriminate between the genuine and spoofed classes.
• C [6]: This system employed score-level fusion of three systems. The first is a GMM trained on the CQCC features. The second and third systems are residual neural networks (ResNet) trained on the MFCC and CQCC features respectively.
• D [15]: This system used score-level fusion of three systems. The first is a GMM system trained on CQCC features. The second is also a GMM system, but trained on CQCC features obtained from the augmented data. The third system is a residual neural network.
• E [8]: This system uses score-level fusion of a GMM and a bi-directional long short-term memory network (BLSTM). The authors use their proposed Single Frequency Filter Cepstral Coefficients (SFFCC) based delta features to train the GMM and BLSTM models.
• LCNN_FFT [4]: This is one of the sub-systems of A [4]. It uses features extracted from a CNN to train a single-component GMM to model the spoofed and genuine classes. This system outperformed all other single systems on the evaluation data by a large margin.

Table 3: Performance (EER %) of the published deep-learning systems on the development and evaluation data.
system         dev    eval
A [4]          3.95   6.73
B [5]          7.6    11.5
C [6]          2.58   13.29
D [15]         3.52   16.39
E [8]          2.21   17.82
LCNN_FFT [4]   4.53   7.34

In Table 3, we present the results of these systems on the development and evaluation data. We observe a remarkable performance by system A, the state-of-the-art [4]. The top two systems (A and B) show a similar level of generalization on the development and the evaluation sets (the gap between the EER on the development and the evaluation dataset is nearly the same), which, however, contrasts with the performance shown by the other systems C, D and E.

Figure 1: Cross entropy loss on the training and development data for four different runs of replicating the LCNN_FFT of the state-of-the-art [4]. A dropout of 70% is applied during training. We observe some consistency in the loss pattern on the development data across runs, except for run 4.

2.3. Discussion and Motivation

Usually the success of deep-learning systems is attributed to the availability of large training data.
However, within the context of ASVspoof 2017, where the training data is significantly smaller than the test data, training a deep neural network that generalizes well can be challenging. We therefore outline the following questions that motivate our research.

1. Using only the available training and development data, is it possible to train an end-to-end CNN that generalizes well to the evaluation dataset (EER of approximately 7%)?
2. Why is there inconsistency in model generalization between the development and the evaluation datasets? Would this appear in end-to-end CNN systems too?
3. Lastly, given such a small amount of training data, can we design a deep architecture with as few trainable parameters as possible that neither underfits nor overfits the training data?

This work seeks answers to the above questions and discusses the possible outcomes. For this, we start by replicating the CNN architecture of the state-of-the-art.

3. Replicating the state-of-the-art CNN [4]

We describe our experiments in replicating the best CNN system, LCNN_FFT, of the state-of-the-art [4]. We train our CNN using their network parameterization but with a different input representation. We use 2048 FFT points and a window size of 2048 with a hop of 10 ms to compute the spectrograms. We use the Librosa library (http://librosa.github.io) for computing the spectrograms. Therefore, our input spectrogram is of dimension 400 × 1025 (time × frequency), in comparison to the 400 × 864 used by LCNN_FFT [4]. We expect that using an FFT size of 2048 should not deteriorate performance dramatically. The details of the LCNN_FFT architecture can be found in [4].

3.1. Model Training and Testing

The input to the network is a mean-variance normalized log power magnitude spectrogram of dimension 400 × 1025 (time × frequency), where time denotes the number of frames and frequency the number of bins. We compute the mean and variance on the training data.
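The normalization step can be illustrated with a minimal sketch. It assumes a single global mean and standard deviation computed over all training spectrogram values; the text does not state whether the statistics are global or per frequency bin, so this choice, the function names and the toy values are illustrative only.

```python
import math

def fit_mean_std(train_specs):
    """Global mean and standard deviation over all training spectrogram values.

    Assumption: one scalar mean/std for the whole training set (not per bin).
    """
    values = [v for spec in train_specs for frame in spec for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def normalize(spec, mean, std, eps=1e-8):
    """Apply the training-set statistics to one spectrogram (a list of frames)."""
    return [[(v - mean) / (std + eps) for v in frame] for frame in spec]

# Toy log-power spectrograms of 2 frames x 3 bins (real inputs are 400 x 1025).
train = [[[0.0, 2.0, 4.0], [2.0, 2.0, 2.0]],
         [[1.0, 3.0, 2.0], [2.0, 0.0, 4.0]]]
mean, std = fit_mean_std(train)

# Development and evaluation inputs reuse the training statistics.
dev_spec = [[2.0, 4.0, 0.0], [2.0, 2.0, 2.0]]
dev_norm = normalize(dev_spec, mean, std)
```

The point of the sketch is that the statistics are fitted once on the training subset and then reused unchanged for the development and evaluation inputs.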
We initialize our network weights using Xavier initialization [16] and the biases with zero. The network is trained to optimize the cross entropy loss between the genuine and spoofed classes. As specified in [4], we use the max-feature-map (MFM) non-linearity, a learning rate of 1e-4, a batch size of 32, and the ADAM optimizer with 0.9 momentum. However, the default value of the epsilon parameter did not work, and we used 0.1 instead. A dropout of 70% is applied to the inputs of the first fully connected layer during model training. We use the TensorFlow [17] framework for implementation. We used early stopping as the terminating criterion: if the validation loss does not improve for 30 epochs, we abort the training loop. We use a maximum of 300 training epochs and choose the model that shows the best performance on the validation data.

At inference time, for each audio spectrogram the model outputs a posterior probability distribution over the genuine and spoofed classes. We convert the posterior probabilities into a log-likelihood ratio and compute the EER using the official Bosaris toolkit [18].

Table 4: Performance (EER %) of our replicated CNNs under two settings: end-to-end and using a 1-mixture GMM trained on CNN features. * indicates EER on the training data. System F is trained on the development data and all other systems are trained on the training data.
System   End-to-End        GMM
         dev      eval     dev      eval
A        9.04     32.02    9.49     34
B        9.30     37.67    10.46    39
C        8.01     30.96    9.4      34.24
D        14.11    36.97    15.6     36.9
E        9.11     37.34    10.78    35.66
F        2.17*    38.83    na       na

Table 5: Performance (EER %) of our best end-to-end CNN systems on the development and evaluation data, ASVspoof 2017 database version 1. * denotes the best replicated system using LCNN_FFT.
System    dev    eval    # params   # params after dropout
C*        8.01   30.96   371K       138K
Model 1   5.47   25.28   4M         600K
Model 2   4.52   34.91   68K        35K
Model 3   4.98   33.11   7682       5089

3.2. Results and Discussion

In Figure 1 we show the loss curves of our four systems (A-D), depicting four different runs of model training. All these systems are trained on the training data and validated on the development data. For comparison with the state-of-the-art, we also trained a one-component GMM on the 32-dimensional features extracted from our trained CNNs. We present these results in Table 4. None of our systems could achieve a performance close to what LCNN_FFT [4] reported on the evaluation data. Our best performing system, C, shows an EER of 8.01% and 30.96% on the development and evaluation data in the end-to-end condition, and 9.4% and 34.24% using GMMs. The performance of our end-to-end models and of the GMMs does not differ greatly.

We made a further interesting observation when we trained our CNN, system F, on the development data and used the training data for model validation. The model shows impressive generalization on the training data, with an EER of about 2%, but gives a worse EER of 39% on the evaluation data. These experiments therefore suggest that it is quite difficult to achieve the same level of generalization on the two test sets. In the next section we investigate new CNN architectures to see if we can achieve the same level of generalization on both the development and the evaluation data.

4. Investigating CNN Architectures

4.1. Model 1 [5]

We now replicate the CNN architecture of the second best performing deep-learning based system. The authors in [5] trained a CNN on hand-crafted features (CQCC+HFCC) to model the spoofing configuration in a multi-class setting. We, however, train this CNN using two output targets to model the genuine and spoofed class distributions.

Inspired by [5], we chose to use the first one second of audio during training and evaluation. We use a 512-point FFT, a window size of 512 and a hop size of 10 ms to produce a unified input spectrogram of 100 × 257 (time × frequency) dimension.
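The spectrogram sizes quoted in the text follow from simple arithmetic, sketched below. The helper name is hypothetical, and counting frames as duration divided by hop is an assumption that matches the quoted sizes; a particular STFT implementation may produce one extra frame depending on its padding. The 4-second duration for the 400 × 1025 input is inferred from 400 frames at a 10 ms hop rather than stated in the text.

```python
def spectrogram_shape(duration_s, n_fft, hop_s=0.010):
    """Return (frames, bins) for a one-sided STFT of a fixed-length input.

    Assumption: frames = duration / hop; bins = n_fft // 2 + 1.
    """
    n_frames = int(round(duration_s / hop_s))
    n_bins = n_fft // 2 + 1
    return n_frames, n_bins

print(spectrogram_shape(1.0, 512))    # (100, 257)  one second, 512-point FFT
print(spectrogram_shape(4.0, 2048))   # (400, 1025) inferred 4 s, 2048-point FFT
print(spectrogram_shape(3.0, 256))    # (300, 129)  three seconds, 256-point FFT
```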
We train our CNN on these spectrograms. The details of the CNN architecture can be found in [5]. Since the implementation details of [5] are not disclosed in the paper, we chose to use the parameter initialization and training approach described in section 3.1.

The performance of Model 1 is shown in Table 5. We only report the best system we found with this architecture, which shows 5.47% and 25.28% EER on the development and evaluation data. However, this system uses high dropout rates: 90% on the inputs of fully connected (FC) layer 1, 80% on the inputs of the second and third FC layers, and 60% on the inputs of the output layer.

4.2. Model 2 [19]

This CNN architecture is motivated by the work of [19] on the Bird Audio Detection (BAD) challenge 2017. Though the objectives of BAD and ASVspoof 2017 are completely different, they exhibit some similarity in the proposed test conditions: both focus on wild and diverse test conditions. Therefore, we adapt one of their CNN architectures, ‘Bulbul’, to study whether changing the architecture helps reduce the gap in generalization between the development and the evaluation data. The details of the ‘Bulbul’ architecture can be found in [19].

We use split data, applying the algorithm described in appendix A, during model training and testing. We use a 256-point FFT and a one-second spectrogram window and window shift to obtain unified spectrograms of 100 × 129 dimension. At test time we take the average of the scores obtained for the different spectrogram parts and compute the EER. We use the parameterization and training recipe described in section 3.1.

We show the performance of Model 2 in Table 5. Though we experimented with different dropout rates, we found the best result using 50% dropout on the inputs of the fully connected (FC) layers. This model gives 4.52% and 34.91% EER on the development and the evaluation data respectively.
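The test-time scoring (averaging part-level scores per utterance, then computing the EER) can be sketched as follows. This is not the Bosaris computation used in the paper: the EER here is a simple threshold sweep, the log-odds of the genuine posterior stands in for the log-likelihood ratio, and averaging log-odds rather than posteriors is an assumption. All function names and toy posteriors are hypothetical.

```python
import math

def llr(p_genuine, eps=1e-12):
    """Log-odds of the genuine posterior, used as a stand-in for the LLR."""
    return math.log(p_genuine + eps) - math.log(1.0 - p_genuine + eps)

def utterance_score(part_posteriors):
    """Average the per-part scores of one utterance (assumed averaging rule)."""
    scores = [llr(p) for p in part_posteriors]
    return sum(scores) / len(scores)

def eer(genuine, spoofed):
    """Approximate EER: threshold where false-accept and false-reject rates are closest."""
    best_gap, best_eer = None, None
    for thr in sorted(set(genuine) | set(spoofed)):
        far = sum(s >= thr for s in spoofed) / len(spoofed)   # spoofed accepted
        frr = sum(s < thr for s in genuine) / len(genuine)    # genuine rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Toy genuine-class posteriors, one inner list per utterance (one value per part).
genuine_utts = [[0.9, 0.8], [0.7, 0.6], [0.4, 0.6]]
spoofed_utts = [[0.1, 0.2], [0.3, 0.2], [0.6, 0.8], [0.2, 0.4]]

genuine_scores = [utterance_score(u) for u in genuine_utts]
spoofed_scores = [utterance_score(u) for u in spoofed_utts]
result = eer(genuine_scores, spoofed_scores)
print(f"EER = {100 * result:.1f}%")  # about 29.2% on this toy data
```

With very few trials the sweep only approximates the crossing point; a toolkit such as Bosaris interpolates on the ROC curve instead.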
The generalization gap between the two test sets, however, is large.

4.3. Model 3

Our work so far has investigated different CNN architectures. These architectures range from medium to large in terms of the trainable parameters of the network. However, none of these systems showed a similar level of generalization on the development and evaluation data. Therefore, we now propose an architecture with the smallest number of parameters that neither underfits nor overfits the training data. This experiment seeks an answer to our third question in section 2.3.

This architecture has three convolutional layers and two fully connected layers. Each convolutional layer has 16 output filters (feature maps) and uses a small rectangular filter of 1 × 9 with a stride of 1 × 1 along time and frequency. We apply a max-pooling operation after each convolutional layer, using a 3 × 3 kernel and a 3 × 3 stride in all max-pooling layers. We use 32 neurons in the first fully connected layer, with linear activation, and two neurons in the output layer. All other layers use the max-feature-map activation. We apply 50% dropout on the fully connected layer inputs during training. We show the architecture of Model 3 in Figure 2. The input representation and the model training and testing approach are the same as for Model 2 in section 4.2.

[Figure 2 diagram labels: Input Spectrogram (shape 1 × 100 × 129), Conv1 + MP1, Conv2 + MP2, Conv3 + MP3, FC4, FC5, Output]
Figure 2: Architecture of the proposed model. The highlighted component shows a layer and its output feature map. For example, the shape of the feature map after the second convolutional and max pooling layer is 8 × 25 × 33 (number of channels × time × frequency). Conv: convolutional layer, FC: fully connected layer, MP: max pooling layer.

We show the performance of Model 3 in Table 5. Our proposed architecture seems to work quite well, giving about 5% EER on the development data. However, the model shows worse generalization on the evaluation data, yielding an EER of more than 30%.

5. Investigating Effect of Parameterization

5.1. Activation Function vs EER

Our work so far has used the MFM activation function, inspired by the impressive results of [4]. Here, we compare the performance of MFM with two other activations, RELU and ELU, that are often used in various deep learning tasks. We present the results in Table 6. The ELU activation shows the worst performance. On the development data, RELU and MFM give similar performance. However, on the evaluation data RELU outperforms MFM and ELU.

Table 6: EER % for different activation functions.
activation   dev    eval
MFM          4.98   33.11
RELU         5.29   31.7
ELU          8.66   40.78

5.2. Batch Size vs EER

All our CNN experiments so far use a batch size of 32. Here, we investigate how model performance compares when the network is trained using different batch sizes: 8, 16 and 64. We present the results in Table 7. We see worse performance on both the development and evaluation data for a batch size of 64. Similarly, a batch size of 16 does not seem to work well. Overall, a batch size of 32 performs best, giving the lowest EER on the evaluation data.

Table 7: EER % for different batch sizes.
batch size   dev    eval
8            4.62   36.02
16           5.64   35.35
32           4.98   33.11
64           5.96   36.6

5.3. Split Data vs EER

Our work explored representing an utterance for training the CNNs either by a single spectrogram or by multiple splits obtained using the approach described in appendix A. Here, we compare the performance of these two representations.

Table 8: EER % using split spectrograms and a single spectrogram.
spectrogram   dev    eval
split         4.98   33.11
single        6.5    35.56

For the single-spectrogram representation, we use three seconds of audio (truncating/appending the original audio samples) to obtain a spectrogram of 300 × 129 dimension. Using the split approach we obtain spectrograms of 100 × 129 dimension. For the split representation, we chose a one-second spectrogram window and spectrogram shift.
We used a 256-point FFT in both cases. We show the results in Table 8. The model trained on split spectrograms outperforms the single-spectrogram model on both the development and evaluation data.

6. Summary and conclusion

In this work, we discussed the ASVspoof challenge and its second edition (ASVspoof 2017), which focused on replay attack detection. We described the best-performing systems from the challenge, which were based on deep learning. We presented our motivation to implement an end-to-end model for the ASVspoof 2017 challenge and reported our experiments to implement one.

In our experiments, we explored four end-to-end deep CNN architectures to design a replay attack detection system. We started with the state-of-the-art systems from the challenge, and later proposed a novel light-weight architecture. In all our experiments the trained models failed to generalise on the evaluation dataset but achieved good performance on the development dataset. This intriguing result raises several interesting questions: why is it challenging to find an architecture that generalises on the evaluation dataset, and how different are the data distributions between the subsets of the ASVspoof database?

Our current work does not use the newly patched ASVspoof 2017 database version 2. All of our study is based on version 1. We plan to use the new database in our future work, which involves analysing the proposed model (Model 3) to understand what the model has learned about the genuine and spoofed signals. We aim to generate explanations for individual model predictions. Such an analysis will provide insight into the features that most strongly influence a prediction.

7. References

[1] T. Kinnunen et al., “ASVspoof 2017: Automatic speaker verification spoofing and countermeasures challenge evaluation plan,” 2017.
[2] T. Kinnunen et al., “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” Proc. Interspeech 2017, 2017.
[3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[4] G. Lavrentyeva et al., “Audio replay attack detection with deep learning frameworks,” Proc. Interspeech 2017, pp. 82–86, 2017.
[5] Parav Nagarsheth, Elie Khoury, Kailash Patil, and Matt Garland, “Replay attack detection using DNN for channel discrimination,” Proc. Interspeech 2017, pp. 97–101.
[6] Zhuxin Chen, Zhifeng Xie, Weibin Zhang, and Xiangmin Xu, “Resnet and model fusion for automatic spoofing detection,” Proc. Interspeech 2017, pp. 102–106.
[7] Marcin Witkowski, Stanislaw Kacprzak, Piotr Zelasko, Konrad Kowalczyk, and Jakub Galka, “Audio replay attack detection using high-frequency features,” Proc. Interspeech 2017, 2017.
[8] K N R K Alluri, Sivanand Achanta, Sudarsana Kadiri, Suryakanth V. Gangashetty, and Anil Vuppala, “SFF Anti-Spoofer: IIIT-H Submission for Automatic Speaker Verification Spoofing and Countermeasures Challenge 2017,” Proc. Interspeech 2017, pp. 107–111.
[9] G. Montavon, W. Samek, and K.-R. Müller, “Methods for Interpreting and Understanding Deep Neural Networks,” Digital Signal Processing, vol. 73, no. Supplement C, pp. 1–15, 2018.
[10] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” in Proc. ICLR, 2014.
[11] M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” in Proc. ECCV, 2014.
[12] S. Mishra, B. L. Sturm, and S. Dixon, “Local Interpretable Model-Agnostic Explanations for Music Content Analysis,” in Proc. ISMIR, 2017.
[13] T. Kinnunen et al., “RedDots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research,” in ICASSP 2017. IEEE, 2017.
[14] B. Chettri and B. L. Sturm, “A deeper look at Gaussian mixture model based anti-spoofing systems,” accepted in ICASSP 2018. IEEE, 2018.
[15] Weicheng Cai, Danwei Cai, Wenbo Liu, Gang Li, and Ming Li, “Countermeasures for automatic speaker verification replay spoofing attack: On data augmentation, feature representation, classification and fusion,” Proc. Interspeech 2017, pp. 17–21.
[16] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, vol. 9, pp. 249–256.
[17] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” Software available from tensorflow.org.
[18] N. Brümmer and E. de Villiers, “The BOSARIS toolkit: Theory, algorithms and code for surviving the new DCF,” arXiv preprint arXiv:1304.2865, 2013.
[19] T. Grill and J. Schlüter, “Two convolutional neural networks for bird detection in audio signals,” in 2017 25th European Signal Processing Conference (EUSIPCO), Aug 2017, pp. 1764–1768.

A. Data Split: Increasing the Data Points

We propose a very simple technique that helps increase the amount of training data. We call this approach ‘split data’. Using this approach we can generate a large number of training data points. Given an audio utterance s, we outline the algorithm for generating data points below.

1. Let $l = \mathrm{length}(s)$ be the original duration of s.
2. Update s by duplicating/truncating the samples such that $l_{new} = \lceil l \rceil$.
3. Compute the log power magnitude spectrogram $D = \log |\mathrm{STFT}(s)|^2$, where the matrix D has T frames and F frequency bins.
4. Let spec_wind and wind_shift be the desired window and shift size (in time) respectively. Split D into parts by moving spec_wind by wind_shift.
5. Return the list of spectrograms generated in step 4, where each spectrogram is of dimension spec_wind × F.
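Steps 2, 4 and 5 of the algorithm above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, duplication is assumed to mean repeating the signal from its start, window and shift are expressed in frames, and step 3 (the STFT) is replaced by a placeholder matrix so that the sketch needs no audio library.

```python
import math

def extend_to_whole_seconds(samples, sample_rate):
    """Step 2: repeat the signal from its start until it lasts ceil(l) seconds."""
    target = math.ceil(len(samples) / sample_rate) * sample_rate
    repeats = math.ceil(target / len(samples))
    return (samples * repeats)[:target]

def split_spectrogram(D, spec_wind, wind_shift):
    """Steps 4-5: slide a window of spec_wind frames over D in steps of wind_shift."""
    return [D[start:start + spec_wind]
            for start in range(0, len(D) - spec_wind + 1, wind_shift)]

# Toy signal: 2.5 s at a rate of 10 samples per second, extended to 3 s.
samples = [0.1 * i for i in range(25)]
padded = extend_to_whole_seconds(samples, 10)

# Placeholder for step 3: a 3 s spectrogram with a 10 ms hop (T = 300) and F = 129.
D = [[0.0] * 129 for _ in range(300)]
parts = split_spectrogram(D, spec_wind=100, wind_shift=100)
print(len(padded), len(parts), len(parts[0]), len(parts[0][0]))  # 30 3 100 129
```

Because step 2 rounds the duration up to a whole number of seconds, a one-second window with a one-second shift tiles the spectrogram without leaving a partial window at the end.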
