On the Use of DNN Autoencoder for Robust Speaker Recognition

Ondřej Novotný, Oldřich Plchot, Pavel Matějka, Ondřej Glembek
Brno University of Technology, Speech@FIT and IT4I Center of Excellence, Czechia
{inovoton, iplchot, matejkap, glembek}@fit.vutbr.cz

Abstract

In this paper, we present an analysis of a DNN-based autoencoder for speech enhancement, dereverberation and denoising. The target application is a robust speaker recognition system. We started by augmenting the Fisher database with artificially noised and reverberated data, and we trained the autoencoder to map noisy and reverberated speech to its clean version. We use the autoencoder as a preprocessing step for a state-of-the-art text-independent speaker recognition system. We compare results achieved with pure autoencoder enhancement, multi-condition PLDA training, and their simultaneous use. We present a detailed analysis on various conditions of NIST SRE 2010, PRISM, and an artificially corrupted NIST SRE 2010 telephone condition. We conclude that the proposed preprocessing significantly outperforms the baseline and that this technique can be used to build a robust speaker recognition system for reverberated and noisy data.

Index Terms: speaker recognition, signal enhancement, autoencoder

[This work was supported by Czech Ministry of Interior project No. VI20152020025 "DRAPAK", the Google Faculty Research Award program, the Czech Science Foundation under project No. GJ17-23870Y, and by the Czech Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project "IT4Innovations excellence in science - LQ1602".]

1. Introduction

In recent years, various techniques for speech and signal processing have been introduced to cope with the distortions caused by noise and reverberation. In the field of speaker recognition, one way to tackle this problem is to use multi-condition training of PLDA, where we introduce noise variability and reverberation variability into the within-class variability of speakers. Several techniques have also been introduced in the field of microphone arrays to solve this issue by active noise canceling, beamforming and filtering [1]. For single-microphone systems, front-ends utilize signal pre-processing methods such as Wiener filtering, adaptive voice activity detection (VAD), gain control, etc. [2]. Next, various designs of robust features [3] are used in combination with normalization techniques such as cepstral mean and variance normalization or short-time Gaussianization [4].

Recent years have also seen a rise of interest in NN-based signal pre-processing. An example of a classical approach to removing a room impulse response is proposed in [5], where the filter is estimated by an NN. NNs have also been used for speech separation in [6]. An NN-based autoencoder for speech enhancement was proposed in [7], with optimization in [8]; finally, reverberant speech recognition with signal enhancement by a deep autoencoder was tested in the CHiME Challenge and presented in [9].

In this paper, we investigate the use of a DNN autoencoder as an audio pre-processing front-end for speaker recognition. The autoencoder is trained to learn a mapping from noisy and reverberated speech to clean speech. The frame-by-frame aligned examples for DNN training are artificially created by adding noise and reverberation to the Fisher speech corpus.
The analysis in this paper extends our previous work presented in [10] and focuses on different autoencoders in more variable and harder conditions. These conditions are simulated by adding noise and reverberation to the NIST SRE 2010 telephone condition, and they extend the selection of test sets used in [10]. We confirm our conclusions from [10] and offer more experimental evidence and a thorough analysis to demonstrate that the proposed method increases the performance of a text-independent speaker recognition system. As it was already shown that multi-condition training with added noisy and reverberated data helps significantly in speaker recognition [11, 12], we also discuss the influence of the quantity, quality, and type of the autoencoder training data on the performance of the analyzed SRE system. In the end, we show that we can significantly profit from the combination of both techniques.

2. Autoencoder training and dataset design

Fisher English database parts 1 and 2 were used for training the autoencoder. They contain over 20,000 telephone conversational sides, or approximately 1800 hours of audio.

Our autoencoder consists of three hidden layers with 1500 neurons in each layer. The input of the autoencoder is the central frame of a log-magnitude spectrum with a context of +/- 15 frames (in total a 3999-dimensional input). The output is a 129-dimensional enhanced central frame. We used the Mean Square Error (MSE) as the objective function during training. (A sketch of this architecture is given at the end of Section 2.2.)

2.1. Adding noise

We prepared a noise dataset that consists of three sources of different types of noise:

• 272 samples (4 minutes long) taken from the Freesound library (http://www.freesound.org): real fan, HVAC, street, city, shop, crowd, library, office and workshop noises.

• 7 samples (4 minutes long) of artificially generated noises: various spectral modifications of white noise + 50 and 100 Hz hum.

• 25 samples (4 minutes long) of babbling noises, created by merging speech from 100 random speakers from the Fisher database using a speech activity detector.

The noises were divided into three disjoint groups for training (223 files), development (40 files) and test (41 files).

2.2. Reverberation

We prepared two sets of room impulse responses (RIRs). The first set consists of real room impulse responses from several databases: AIR [13], C4DM [14, 15], MARDY [16], OPENAIR [17], RVB 2014 [18], RWCP [19]. Together, they form a set covering all types of rooms (small rooms, big rooms, lecture rooms, restrooms, halls, stairs, etc.). All room models have more than one impulse response per room (a different RIR was used for the source of the signal and the source of the noise, to simulate different locations of the sources). The rooms were split into two disjoint sets, with 396 rooms for training and 40 rooms for test.

The second set consists of artificially generated room impulse responses, created using the "Room Impulse Response Generator" tool from E. Habets [20]. The tool can model the size of the room (3 dimensions), the reflectivity of each wall, the type of microphone, the positions of the source and the microphone, the orientation of the microphone towards the audio source, and the number of bounces (reflections) of the signal. We generated a pair of RIRs for each room model (one used for the source of the sound, one for the source of the noise). Again, we generated two disjoint sets, with 1594 RIRs for training and 250 RIRs for test.
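To make the topology described in Section 2 concrete, here is a minimal PyTorch sketch of such an enhancement autoencoder: 31 stacked frames of a 129-dimensional log-magnitude spectrum in (129 x 31 = 3999 inputs), one enhanced 129-dimensional central frame out, trained with the MSE objective. The paper does not specify the activation function, optimizer, or training schedule, so the tanh units and Adam settings below are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

CONTEXT = 15          # +/- 15 frames of context around the central frame
FFT_BINS = 129        # log-magnitude spectrum dimensionality per frame
IN_DIM = FFT_BINS * (2 * CONTEXT + 1)   # 129 * 31 = 3999

class EnhancementAutoencoder(nn.Module):
    """Maps a noisy/reverberated central frame (with context) to its clean version."""
    def __init__(self, hidden=1500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IN_DIM, hidden), nn.Tanh(),   # activation is an assumption
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, FFT_BINS),            # enhanced central frame
        )

    def forward(self, x):
        return self.net(x)

def stack_context(spectrogram):
    """Stack +/-CONTEXT frames around each frame of a (T, 129) log spectrogram,
    repeating the edge frames so every frame gets a full context window."""
    padded = torch.cat([spectrogram[:1].repeat(CONTEXT, 1),
                        spectrogram,
                        spectrogram[-1:].repeat(CONTEXT, 1)])
    return torch.stack([padded[t:t + 2 * CONTEXT + 1].flatten()
                        for t in range(spectrogram.shape[0])])

model = EnhancementAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer is an assumption
loss_fn = nn.MSELoss()                                     # objective used in the paper

# Dummy frame-aligned pair standing in for a corrupted/clean Fisher utterance.
noisy_spec = torch.randn(300, FFT_BINS)   # (frames, 129) log-magnitude spectrum
clean_spec = torch.randn(300, FFT_BINS)
loss = loss_fn(model(stack_context(noisy_spec)), clean_spec)
loss.backward()
optimizer.step()
```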
2.3. Composition of the training set

To mix the reverberation, noise and signal at a given SNR, we followed the procedure shown in Figure 1. The pipeline begins with two branches, in which speech and noise are reverberated separately. Different RIRs from the same room are used for the signal and the noise, to simulate different positions of their sources.

The next step is A-weighting. A-weighting is applied to simulate the perception of the added noise by the human ear [21]. With this filtering, the listener would be able to better perceive the SNR, because most of the noise energy comes from frequencies that the human ear is sensitive to.

In the following step, we set the ratio of the noise and signal energies to obtain the required SNR. The energies of the signal and the noise are computed only from frames given by the original signal's voice activity detection (VAD). This means that the computed SNR truly holds in the speech frames, which are the ones important for recognition (frames without voice activity are removed during processing).

After the combination, where signal and noise are summed together at the desired SNR, we filter the resulting signal with a telephone channel. This compensates for the fact that our noise samples do not come from a telephone channel, while the original clean data (Fisher, NIST tel-tel) do. The final output is a reverberated and noisy signal at the required SNR, which simulates a recording passing through the telephone channel (as the original signal did) in various acoustic environments. In case we want to add only noise or only reverberation, just the appropriate part of the pipeline is used. (A minimal sketch of the mixing step is given after Figure 1.)

[Figure 1: The process of data preparation (corruption) for autoencoder training or new SRE condition design. Diagram blocks: Signal -> RIR 1 and Noise -> RIR 2, A-weighting on both branches, VAD-based SNR estimation, combination (signal + noise * ratio), telephone channel, Output.]
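The mixing step of Figure 1 can be sketched in a few lines of NumPy/SciPy. This is our own illustration, not the authors' tooling: the A-weighting and telephone-channel filters are omitted for brevity, `vad_mask` is assumed to be a per-sample boolean mask obtained from the clean signal's VAD, and the noise clip is assumed to be at least as long as the speech.

```python
import numpy as np
from scipy.signal import fftconvolve

def mix_at_snr(speech, noise, rir_speech, rir_noise, vad_mask, snr_db):
    """Corrupt `speech` following Figure 1: reverberate speech and noise with
    different RIRs from the same room, then add the noise at `snr_db`,
    measured only on the speech (VAD) samples of the original signal.
    A-weighting and the telephone-channel filter are omitted here."""
    # Reverberate each branch separately to simulate different source positions.
    speech_rvb = fftconvolve(speech, rir_speech)[:len(speech)]
    noise_rvb = fftconvolve(noise, rir_noise)[:len(speech)]

    # Energies over voiced samples only, so the target SNR really holds
    # in the frames that matter for recognition.
    e_speech = np.sum(speech_rvb[vad_mask] ** 2)
    e_noise = np.sum(noise_rvb[vad_mask] ** 2) + 1e-12

    # Choose the noise gain so that e_speech / (gain^2 * e_noise) = 10^(snr/10).
    gain = np.sqrt(e_speech / (e_noise * 10.0 ** (snr_db / 10.0)))
    return speech_rvb + gain * noise_rvb
```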
3. Speaker recognition system

Our systems are based on i-vectors [22, 23]. To train the i-vector extractors, we always use a 2048-component diagonal-covariance Universal Background Model (GMM-UBM) and we set the dimensionality of the i-vectors to 600. We apply LDA to reduce the dimensionality to 200. The i-vectors are then transformed by global mean normalization and length-normalization [22, 24]. The speaker verification score is produced by comparing the two i-vectors corresponding to the segments in the verification trial by means of PLDA [23].

In our experiments, we used cepstral features extracted with a 25 ms Hamming window. We used 24 Mel-filter banks and limited the bandwidth to the 120–3800 Hz range. 19 MFCCs together with the zeroth coefficient were calculated every 10 ms. This 20-dimensional feature vector was subjected to short-time mean and variance normalization using a 3 s sliding window. Delta and double-delta coefficients were then calculated using a five-frame window, giving a 60-dimensional feature vector.

After feature extraction, voice activity detection (VAD) was performed by the BUT Czech phoneme recognizer [25], dropping all frames labeled as silence or noise. The recognizer was trained on Czech CTS data, but we added noise with varying SNR to 30% of the database.

3.1. Datasets

We used the PRISM [26] training dataset definition without added noise or reverberation to train the UBM and the i-vector transformation. Five variants of gender-independent PLDA were trained: one only on the clean training data, while the rest also included artificially corrupted data with different mixes of noise and reverberation. The artificially added noise and reverberation segments totaled approximately twenty-four thousand segments, or 30% of the number of clean segments available for PLDA training.

The PRISM set comprises Fisher 1 and 2, Switchboard phase 2 and 3, and Switchboard cellphone phases 1 and 2, along with a set of Mixer speakers. This includes the 66 held-out speakers from SRE10 (see Section III-B5 of [26]), and 965, 980, 485 and 310 speakers from SRE08, SRE06, SRE05 and SRE04, respectively. A total of 13,916 speakers are available in the Fisher data and 1,991 in the Switchboard data.

We evaluated our systems on the female portions of the following conditions in NIST SRE 2010 [27] and PRISM [26]:

• tel-tel: SRE 2010 extended telephone condition involving normal vocal effort conversational telephone speech in enrollment and test (known as condition 5).

• int-int: SRE 2010 extended interview condition involving interview speech from different microphones in enrollment and test (known as condition 2).

• int-mic: SRE 2010 extended interview-microphone condition involving interview enrollment speech and normal vocal effort conversational telephone test speech recorded over a room microphone channel (known as condition 4).

• prism,noi: Clean and artificially noised waveforms from both interview and telephone conversations recorded over lavalier microphones. Noise was added at different SNR levels and the recordings were tested against each other.

• prism,rev: Clean and artificially reverberated waveforms from both interview and telephone conversations recorded over lavalier microphones. Reverberation was added with different RTs and the recordings were tested against each other.

• prism,chn: English telephone speech with normal vocal effort recorded over different microphones from both SRE2008 and SRE2010, tested against each other.

Additionally, we created new artificially corrupted evaluation sets from the NIST SRE 2010 tel-tel condition. The process was the same as described in Section 2.3, while using the test portions of our noise and reverberation sets. We created seven new conditions:

• rev-tel-tel: SRE 2010 tel-tel condition corrupted by real room impulse responses (reverberation).

• noi-∗-tel-tel: SRE 2010 tel-tel condition corrupted by noise. We used three ranges of noise: 0-7 dB, 7-14 dB, and 14-21 dB (the range is written in place of the ∗, e.g. noi-0-7-tel-tel).

• rev-noi-∗-tel-tel: SRE 2010 tel-tel condition corrupted by noise and real room impulse responses. Again, we used three ranges of noise: 0-7 dB, 7-14 dB, and 14-21 dB.

The difference between these new conditions and the conditions based on the PRISM set lies in the more realistic reverberation. The prism,rev condition is created from clean microphone data corrupted with artificially generated RIRs, whereas the new conditions add real reverberation to telephone data. Similarly, while the prism,noi condition is created from microphone data by adding noise at three fixed SNR levels (8 dB, 15 dB, 20 dB), the new conditions use telephone data and randomly chosen SNR levels from the given intervals. Additionally, the selected telephone data tend to be more difficult than the microphone data used in the PRISM conditions.

The recognition performance is evaluated in terms of the equal error rate (EER).
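For reference, the EER is the operating point at which the miss rate equals the false-alarm rate. A self-contained sketch of its computation from raw trial scores is given below; this is our own illustration, as the paper does not describe its scoring tooling.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER (%) from verification scores of target and non-target trials:
    the operating point where the miss rate equals the false-alarm rate."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    tar = np.sort(target_scores)
    non = np.sort(nontarget_scores)
    miss = np.searchsorted(tar, thresholds) / len(tar)      # P(target < thr)
    fa = 1.0 - np.searchsorted(non, thresholds) / len(non)  # P(non-target >= thr)
    idx = np.argmin(np.abs(miss - fa))                      # closest crossing point
    return 100.0 * (miss[idx] + fa[idx]) / 2.0

# Synthetic sanity check: two unit-variance Gaussian score distributions
# centered at +2 and -2 give an EER of about 2.3%.
rng = np.random.default_rng(0)
print(equal_error_rate(rng.normal(2, 1, 20000), rng.normal(-2, 1, 20000)))
```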
4. Experiments and discussion

We provide a set of results to answer two questions: (i) How does the speaker recognition performance depend on the type of enhancement (denoising, dereverberation, both) and on the amount or type (real, artificial) of the autoencoder training data? (ii) How does using the autoencoder compare to using multi-condition data for SRE system training? In the end, we also combine the autoencoder with multi-condition training and find the best performing combination.

We trained five different autoencoders for signal enhancement. Two autoencoders were trained only for dereverberation: the first with artificially generated reverberation and the second with real reverberation. The third autoencoder was trained only for denoising. The last two autoencoders were trained simultaneously for denoising and dereverberation; again, one of them used artificially generated RIRs and the second one used the real ones.

Similarly, we created five different multi-condition training sets for PLDA. The approach is the same as in the autoencoder training: we used exactly the same noises and reverberation for segment corruption as in autoencoder training, allowing us to compare the performance of autoencoder enhancement against multi-condition training.

Our results are listed in Table 1 and are separated into three blocks: (a) PLDA trained on clean data, with and without autoencoder enhancement; (b) PLDA trained on multi-condition data without enhancement; and (c) the combination of both techniques. In block (a), the baseline column corresponds to the system where the PLDA was trained only on clean data without any enhancement. The next five columns represent results obtained with the different autoencoders: N - autoencoder trained only on noised data; AR - autoencoder trained on data corrupted with artificially generated RIRs; RR - autoencoder trained on data corrupted with real RIRs; N+(A/R)R - autoencoders trained simultaneously on data with both types of distortion (noise and reverberation). In block (b), we list the results for multi-condition training: we trained five different PLDAs, each time adding a different mix of corrupted data to the training list.

Neither PLDA nor the autoencoder on its own can fully profit from the added corrupted data. The autoencoder is able to partially remove the noise and reverberation from the data, while PLDA can learn the effect these distortions have on within- and across-speaker variability. Combining both techniques naturally brings the largest improvement, as can be seen from block (c) of Table 1. In these experiments, we again varied the data for multi-condition PLDA training, but all of the data were first processed by a single autoencoder. We decided to use the autoencoder trained simultaneously on noisy and reverberated data (using real RIRs). This autoencoder was chosen based on its good and consistent performance in various conditions, and we believe that it could serve as a universal preprocessing step, as there is only a negligible drop in performance when using it on clean data (compare, for example, the performance on the tel-tel condition of the baseline system versus the N+RR column in block (a) of Table 1).

Now, let us focus on comparing the baseline system and the system with enhanced data (PLDA trained only on clean and enhanced data).
In these experiments, we study which autoencoder training dataset is best for a given condition. Looking at the results globally, we can see that for most of the reverberation conditions (prism,rev, int-int, int-mic and rev-tel-tel, with the exception of prism,chn), the autoencoder trained on real reverberation provides the best results. A similar situation occurs for the noisy conditions (prism,noi, noi-∗-tel-tel) and the noisy and reverberated conditions (rev-noi-∗-tel-tel). These results confirm our intuition that it is best to use an autoencoder trained on the matching distortion to remove its effect from the data. We can also observe that, to remove reverberation, it is better to train on data reverberated with real RIRs than with artificially generated ones. This holds even for the condition containing only artificial reverberation (prism,rev). In general, when looking at block (a) of Table 1, all of the autoencoders trained using reverberation with real RIRs (columns RR, N+RR) are better than those trained using artificial RIRs (AR, N+AR). We can also see that the difference in performance between the RR autoencoder and the N+RR autoencoder is rather small, slightly in favor of the latter, in both reverberation and noisy conditions. This indicates that the N+RR autoencoder is a good universal choice and justifies its selection for the experiments combining audio enhancement with multi-condition training.

Table 1: Results (EER [%]) obtained in four scenarios. Block (a) corresponds to systems with PLDA trained only on clean data: the baseline column shows the system without enhancement, and the remaining columns show the same system with data enhanced by one of five autoencoders trained on: N - noise, AR/RR - artificial/real reverberation, or both (+). Block (b) corresponds to systems trained in a multi-condition fashion (with noised and reverberated data in PLDA); each column corresponds to a different PLDA multi-condition training set (N, AR, N+AR, RR, N+RR). Block (c) presents the combination of both techniques; for the combination, we selected the autoencoder trained on noised and reverberated data with real reverberation (N+RR).

(a) PLDA trained on clean data (baseline and autoencoder enhancement):

Condition              baseline      N     AR   N+AR     RR   N+RR
tel-tel                   2.062  2.075  2.093  2.074  1.999  2.063
prism,noi                 2.950  2.122  2.497  2.256  2.470  2.190
prism,rev                 2.071  1.748  1.621  1.608  1.511  1.559
int-int                   1.756  1.792  1.693  1.766  1.634  1.790
int-mic                   1.089  1.136  1.085  1.151  1.010  1.112
prism,chn                 0.795  0.523  0.599  0.402  0.596  0.428
rev-tel-tel              19.373 14.760 11.182 13.450  9.149  9.365
noi-14-21-tel-tel         4.959  3.298  4.009  3.943  3.721  3.703
noi-7-14-tel-tel          8.291  5.117  6.808  5.710  6.660  5.749
noi-0-7-tel-tel          18.953 10.681 15.518 11.276 15.868 12.280
rev-noi-14-21-tel-tel    16.517 15.099 11.044 11.356  9.398  7.631
rev-noi-7-14-tel-tel     19.543 19.899 13.985 15.615 12.352  9.610
rev-noi-0-7-tel-tel      27.834 28.154 22.193 24.523 21.442 16.841

(b) PLDA trained on multi-condition data (PLDA extension data):

Condition                   N     AR   N+AR     RR   N+RR
tel-tel                 2.458  2.071  2.728  2.035  2.796
prism,noi               2.265  3.080  2.518  2.926  2.456
prism,rev               2.220  1.537  1.620  1.613  1.632
int-int                 1.860  1.677  1.760  1.669  1.714
int-mic                 1.226  0.770  0.921  0.960  1.042
prism,chn               1.000  0.544  0.666  0.630  0.756
rev-tel-tel            17.835  9.461 10.151  5.246  6.598
noi-14-21-tel-tel       2.901  4.605  3.530  4.321  3.390
noi-7-14-tel-tel        3.941  8.026  4.920  7.540  4.715
noi-0-7-tel-tel         8.831 18.782  9.547 18.116  9.576
rev-noi-14-21-tel-tel  16.128 11.079  8.344  8.942  6.387
rev-noi-7-14-tel-tel   17.174 16.516 10.246 14.197  8.314
rev-noi-0-7-tel-tel    21.558 26.679 15.629 25.530 14.660

(c) N+RR autoencoder enhancement combined with multi-condition PLDA (PLDA extension data):

Condition                   N     AR   N+AR     RR   N+RR
tel-tel                 2.480  2.070  2.677  2.143  2.752
prism,noi               1.969  2.243  2.037  2.236  2.059
prism,rev               1.583  1.419  1.385  1.422  1.419
int-int                 1.806  1.697  1.714  1.705  1.761
int-mic                 0.981  1.000  0.848  0.982  0.943
prism,chn               0.456  0.344  0.371  0.277  0.400
rev-tel-tel             8.287  6.137  5.847  4.066  4.761
noi-14-21-tel-tel       2.689  3.205  2.962  3.021  2.980
noi-7-14-tel-tel        3.528  5.084  3.719  4.595  3.517
noi-0-7-tel-tel         6.080 11.402  6.252 10.014  6.382
rev-noi-14-21-tel-tel   6.379  6.097  4.798  4.948  4.143
rev-noi-7-14-tel-tel    7.169  8.184  5.626  7.033  5.126
rev-noi-0-7-tel-tel    10.533 15.609  8.770 14.369  8.149

When focusing on multi-condition training (block (b) of Table 1) and taking the global view, we can observe similar trends as in the pure enhancement task: to remove a given type of distortion, it is best to add the matching type of distortion to the PLDA training. Looking more closely, we can see a difference in the reverberation conditions based on the PRISM set, where (as opposed to the enhancement) the multi-condition systems using artificially generated RIRs achieve better results. This may indicate that it is easy for the PLDA to capture the channel variability caused by reverberating with the artificial RIRs, which results in better performance in this matched-condition scenario. This hypothesis is further strengthened by comparing AR with RR on the rev-tel-tel condition, where training on the matched-condition RR data almost halves the error rate.

If we analyze the difference in performance between pure signal enhancement and multi-condition training, we see that multi-condition training yields slightly better results, especially in the hardest conditions rev-∗-tel-tel. In the clean tel-tel condition, on the other hand, the autoencoder harms the performance less than multi-condition training does. Additionally, in some PRISM-based conditions (prism,rev, int-int, prism,chn), the autoencoder is also better than multi-condition training.

Finally, we look at the combination of both techniques (block (c) of Table 1). Here, we keep the same training lists for multi-condition PLDA training, but additionally, all data are enhanced by the autoencoder trained on noised and reverberated data with real RIRs. We can see that in most conditions, we improve on pure multi-condition training. We suffer a significant degradation in the clean tel-tel condition with respect to the baseline for the N+AR and N+RR training, but especially in the case of the latter, this degradation is compensated by excellent performance in the other conditions, especially the most difficult rev-noi-∗-tel-tel, where we gain more than 70% relative improvement over the baseline. The combination of both techniques can also eliminate the large gap between artificially generated reverberation and real reverberation, as can be seen by comparing the results of the N+AR and N+RR systems.
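As a sanity check, the headline relative improvements quoted here and in the conclusion can be recomputed directly from the published Table 1 values for the hardest condition (our own arithmetic):

```python
# Relative EER improvements on rev-noi-0-7-tel-tel, from Table 1.
baseline = 27.834   # block (a): clean PLDA, no enhancement
multicond = 14.660  # block (b): best multi-condition PLDA alone (N+RR)
combined = 8.149    # block (c): N+RR autoencoder + N+RR multi-condition PLDA

rel_vs_baseline = 100 * (baseline - combined) / baseline     # ~70.7 %
rel_vs_multicond = 100 * (multicond - combined) / multicond  # ~44.4 %
print(f"{rel_vs_baseline:.1f}% vs baseline, {rel_vs_multicond:.1f}% vs multi-condition")
```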
As we already saw for pure multi-condition training, the best results are again achieved by using the matched distortion for PLDA training, but the differences between the best possible results and multi-condition training with the N+RR data are small. This justifies our recommendation to use the combination of multi-condition training with N+RR data that were preprocessed by the N+RR autoencoder as a universal and robust system, especially when expecting reverberated and/or noisy test data.

5. Conclusion

In this paper, we analyzed several aspects of DNN-autoencoder enhancement for designing robust speaker recognition systems. We studied the influence of different training sets on the autoencoder's performance in speaker recognition, and we concluded that, in our case, using a smaller amount of quality real RIRs provided better results than using a much larger amount of artificial RIRs.

We also directly compared PLDA multi-condition training with audio enhancement. Our results suggest that introducing the corrupted data at the i-vector level into the PLDA training provides slightly better results for noisy and reverberated conditions, but at the same time causes more harm on clean data than the autoencoder does.

Finally, we conclude that the combination of both techniques can significantly improve system performance compared to the baseline and even to systems using only one of the two techniques. We obtained more than 70% relative improvement with respect to the baseline and approximately 40% relative improvement with respect to multi-condition PLDA training. Based on our results, and in light of the very good performance of MFCC-based systems in NIST SRE 2016, we can say that autoencoders are a viable option to consider when designing a system that is robust against various levels of reverberation and noise.

6. References

[1] K. Kumatani, T. Arakawa, K. Yamamoto, J. McDonough, B. Raj, R. Singh, and I. Tashev, "Microphone array processing for distant speech recognition: Towards real-world deployment," in APSIPA Annual Summit and Conference, Hollywood, CA, USA, December 2012.

[2] ETSI, "Speech processing, transmission and quality aspects (STQ)," European Telecommunications Standards Institute (ETSI), Tech. Rep. ETSI ES 202 050, 2007.

[3] O. Plchot, S. Matsoukas, P. Matějka, N. Dehak, J. Ma, S. Cumani, O. Glembek, H. Heřmanský, N. Mesgarani, M. M. Soufifar, S. Thomas, B. Zhang, and X. Zhou, "Developing a speaker identification system for the DARPA RATS project," in Proceedings of ICASSP 2013, Vancouver, Canada, 2013.

[4] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in Proceedings of Odyssey 2006: The Speaker and Language Recognition Workshop, Crete, Greece, 2006.

[5] B. Dufera and T. Shimamura, "Reverberated speech enhancement using neural networks," in Proc. International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS 2009), Jan. 2009, pp. 441–444.

[6] T. Yanhui, D. Jun, X. Yong, D. Lirong, and L. Chin-Hui, "Deep neural network based speech separation for robust speech recognition," in Proceedings of ICSP 2014, 2014, pp. 532–536.

[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, Jan. 2014.

[8] ——, "Global variance equalization for improving deep neural network based speech enhancement," in Proc. IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP), 2014, pp. 71–75.

[9] M. Mimura, S. Sakai, and T. Kawahara, "Reverberant speech recognition combining deep neural networks and deep autoencoders," in Proc. REVERB Challenge Workshop, Florence, Italy, 2014.

[10] O. Plchot, L. Burget, H. Aronowitz, and P. Matějka, "Audio enhancing with DNN autoencoder for speaker recognition," in Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016). IEEE Signal Processing Society, 2016, pp. 5090–5094. [Online]. Available: http://www.fit.vutbr.cz/research/view_pub.php?id=11139

[11] D. G. Martínez, L. Burget, T. Stafylakis, Y. Lei, P. Kenny, and E. Lleida, "Unscented transform for ivector-based noisy speaker recognition," in Proceedings of ICASSP 2014, Florence, Italy, 2014.

[12] Y. Lei, L. Burget, L. Ferrer, M. Graciarena, and N. Scheffer, "Towards noise-robust speaker recognition using probabilistic linear discriminant analysis," in Proceedings of ICASSP, Kyoto, Japan, 2012.

[13] "Aachen impulse response database," http://www.iks.rwth-aachen.de/en/research/tools-downloads/databases/aachen-impulse-response-database/.

[14] "C4DM (Centre for Digital Music) RIR database," http://isophonics.net/content/room-impulse-response-data-set.

[15] R. Stewart and M. Sandler, "Database of omnidirectional and B-format room impulse responses," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010, pp. 165–168.

[16] "Multichannel acoustic reverberation database at York (MARDY)," http://www.commsp.ee.ic.ac.uk/~sap/resources/mardy-multichannel-acoustic-reverberation-database-at-york-database/.

[17] "OpenAIR impulse response database," http://www.openairlib.net/auralizationdb.

[18] "REVERB challenge," http://reverb2014.dereverberation.com/index.html.

[19] "RWCP sound scene database," http://www.openslr.org/13/.

[20] E. A. Habets, "Room impulse response generator," https://www.audiolabs-erlangen.de/content/05-fau/professor/00-habets/05-software/01-rir-generator/rir_generator.pdf.

[21] R. M. Aarts, "A comparison of some loudness measures for loudspeaker listening tests," J. Audio Eng. Soc., vol. 40, no. 3, pp. 142–146, 1992, http://www.extra.research.philips.com/hera/people/aarts/RMA_papers/aar92a.pdf.

[22] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. PP, no. 99, pp. 1–1, 2010.

[23] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," keynote presentation, Proc. Odyssey 2010, Brno, Czech Republic, June 2010.

[24] D. Garcia-Romero, "Analysis of i-vector length normalization in Gaussian-PLDA speaker recognition systems," 2011.

[25] P. Matějka, L. Burget, P. Schwarz, and J. Černocký, "Brno University of Technology system for NIST 2005 language recognition evaluation," in Proceedings of Odyssey 2006, San Juan, Puerto Rico, 2006.

[26] L. Ferrer, H. Bratt, L. Burget, H. Černocký, O. Glembek, M. Graciarena, A. Lawson, Y. Lei, P. Matějka, O. Plchot, and N. Scheffer, "Promoting robustness for speaker modeling in the community: the PRISM evaluation set," in Proceedings of the SRE11 Analysis Workshop, Atlanta, Dec. 2011.

[27] "National Institute of Standards and Technology," http://www.nist.gov/speech/tests/spk/index.htm.