Convolutional Neural Networks and x-vector Embedding for DCASE2018 Acoustic Scene Classification Challenge
Authors: Hossein Zeinali, Lukáš Burget, Jan Černocký
Detection and Classification of Acoustic Scenes and Events 2018, 19-20 November 2018, Surrey, UK

CONVOLUTIONAL NEURAL NETWORKS AND X-VECTOR EMBEDDING FOR DCASE2018 ACOUSTIC SCENE CLASSIFICATION CHALLENGE

Hossein Zeinali, Lukáš Burget and Jan "Honza" Černocký
Brno University of Technology, Speech@FIT and IT4I Center of Excellence, Czech Republic

ABSTRACT

In this paper, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge are described, together with an analysis of the different methods on the leaderboard set. The proposed approach is a fusion of two different Convolutional Neural Network (CNN) topologies. The first is the common two-dimensional CNN mainly used in image classification. The second is a one-dimensional CNN for extracting fixed-length audio segment embeddings, so-called x-vectors, which has also been used in speech processing, especially for speaker recognition. In addition to the different topologies, two types of features were tested: log mel-spectrogram and CQT features. Finally, the outputs of the different systems are fused using simple output averaging in the best performing system. Our submissions ranked third among 24 teams in ASC sub-task A (task1a).

Index Terms — Audio scene classification, Convolutional neural networks, Deep learning, x-vectors, Regularized LDA

1. INTRODUCTION

This paper deals with the problem of classifying a scene or environment (see the examples listed in Table 4) based on acoustic cues, which are normally used by humans and animals to understand and react to different environmental conditions. Several methods have been proposed for Acoustic Scene Classification (ASC); nowadays, most of them are based on deep learning. The winner of last year's ASC challenge (i.e. DCASE2017 Task 1) used a Generative Adversarial Network (GAN) for data augmentation and a combination of a Support Vector Machine (SVM) and a CNN for classification [1]. The most used network topology in the previous challenges is the CNN, proven to provide very good performance for ASC [1, 2, 3, 4]. The winner of the DCASE2016 Challenge Task 1 [5] also used a CNN fused with an i-vector based method [6].

This report describes the Brno University of Technology (BUT) team submissions for the ASC challenge of DCASE 2018. We proposed two different deep neural network topologies for this task. The first one is a common two-dimensional CNN that processes audio segments as fixed-size two-dimensional images. This network is fed in two ways: with single-channel features and with 4-channel features. This type of CNN is useful for detecting audio events invariant to their position in the audio signal. The second network topology uses a one-dimensional CNN along the time axis and is used to extract fixed-length embeddings of (possibly variable-length) acoustic segments. This architecture has previously been found useful for other speech processing tasks such as speaker recognition [7], where the extracted embeddings were called x-vectors. Therefore, in the rest of the paper, we will also refer to such neural embeddings of acoustic segments as x-vectors. These networks were trained with two feature types: log mel-spectrogram and constant-Q transform (CQT) features. Our submissions are based on fusions of different networks and features trained on the original development data or using additional augmented data.

The current ASC challenge has three sub-tasks. In task1a, participants are allowed to use only the fixed development data for training. Task1b is similar to task1a except that the test files come from different mobile channels.
Finally, the task1c evaluation data is the same as in task1a, but additional data is allowed for training. We have participated in task1a only.

2. DATASET

In this work, the DCASE2018 data was used [8]. The dataset consists of recordings from 10 scene classes and was acquired in six large European cities, in different environments in each city. The development set consists of 864 segments for each acoustic scene, i.e. a total of 8640 audio segments. The evaluation set was collected in the same cities, but in different environments, and has 3600 audio segments. Each segment has an exactly 10-second duration, achieved by splitting longer audio recordings from each environment. The dataset includes a predefined validation fold. Each team can also create its own folds, but we used the single official fold for evaluation. The audio segments are 2-channel stereo files, recorded at a 48 kHz sampling rate.

3. DATA PROCESSING

3.1. Features

In this work, different features are used in single-channel and multi-channel modes. All features are extracted from zero-mean audio signals. The main features are log mel-scale spectrograms. To extract these features, first the short-time Fourier transform is computed on 40 ms Hamming-windowed frames with 20 ms overlap, using a 2048-point FFT. Next, the power spectrum is transformed to 80 mel-scale band energies and, finally, the log of these energies is taken. The second set of features is the 80-dimensional constant-Q transform of the audio signal [9]. These features are extracted using the librosa toolbox [10].

We used the features in two modes, single-channel and 4-channel. In single-channel mode, the audio signal is first converted to mono and single-channel features are extracted from it (these features are indicated by "M" in the tables).
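The log mel-spectrogram pipeline described above (zero-mean signal, 40 ms Hamming windows with 20 ms overlap, 2048-point FFT, power spectrum mapped to 80 mel bands, then log) can be sketched in plain NumPy. The paper does not specify the implementation and in practice a toolbox such as librosa would be used, so the filterbank construction below is an illustrative approximation, not the authors' exact code:

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filterbank (simplified HTK-style construction)."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:  # rising slope of the triangle
            fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:  # falling slope of the triangle
            fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def log_mel_spectrogram(signal, sr=48000, win_ms=40, hop_ms=20,
                        n_fft=2048, n_mels=80, eps=1e-10):
    """40 ms Hamming frames, 20 ms overlap, 2048-pt FFT -> 80 log mel bands."""
    win, hop = sr * win_ms // 1000, sr * hop_ms // 1000
    sig = signal - signal.mean()                       # zero-mean, as in the paper
    frames = np.lib.stride_tricks.sliding_window_view(sig, win)[::hop]
    frames = frames * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2  # power spectrum
    return np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + eps)
```

A 10-second 48 kHz segment then yields roughly 500 frames of 80 mel bands, matching the 80 x 500 network inputs described later.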
In 4-channel mode, four sets of features are extracted from the signal, similar to [2] (these features are indicated by "LRMS" in the tables): two feature sets from the left (L) and right (R) channels, one from the summation of both channels (i.e. M = L + R) and one from their subtraction (i.e. S = L − R). We use these 4 feature sets as a single input to the CNNs. This mode is similar to multi-channel images (e.g. RGB channels), which are the typical CNN inputs in image classification. In previous works [2, 5], each channel was processed separately and the final scores were obtained by fusing the scores of the different channels. Here, the network processes all channels at the same time and can thus exploit all the available information at once.

3.2. Data augmentation

Different methods have been proposed for data augmentation in audio processing. According to the rules of challenge task1a, external data cannot be used for data augmentation. Because of this limitation, and based on our initial experiments, we decided to use a simple method built on the assumption that a combination of two or more audio segments from the same scene is another valid sample of that scene, with a more complex pattern and more events. Two new segments were generated for each audio segment as a weighted sum of the segment and several other randomly selected segments from the same scene. This way, we have tripled the amount of training data.

4. CNN TOPOLOGIES

We have used two different CNN topologies for this challenge. The first one is the common two-dimensional CNN known from image processing; the second is a one-dimensional CNN for extracting x-vectors, i.e. neural network embeddings of an audio segment, as used, for example, in speaker recognition [7]. Both networks are described in more detail in the following sections.
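As a concrete illustration of the segment-mixing augmentation of Section 3.2: the paper only says "weighted sum of the audio and several other randomly selected audios from the same scene", so the convex Dirichlet weights and the number of mixed partners in this sketch are assumptions:

```python
import numpy as np

def mix_same_scene(segments, n_new=2, n_mix=2, rng=None):
    """Augment one scene class by weighted sums of randomly chosen segments.

    segments: array (n_segments, n_samples), all from the same scene.
    Returns n_new * n_segments extra segments; with n_new=2 the total
    training material (originals + mixes) is tripled, as in the paper.
    """
    rng = np.random.default_rng(rng)
    out = []
    for base in segments:
        for _ in range(n_new):
            idx = rng.choice(len(segments), size=n_mix, replace=False)
            others = segments[idx]
            w = rng.dirichlet(np.ones(n_mix + 1))  # convex weights, sum to 1 (assumption)
            out.append(w[0] * base + (w[1:, None] * others).sum(axis=0))
    return np.stack(out)
```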
4.1. Two-Dimensional CNN

We followed the common CNN framework proposed in [4] with some modifications. Table 1 shows the network architecture. The network contains 3 CNN blocks. The first layer is a two-dimensional convolutional layer with 32 filters with kernel size 7 × 11 and unitary depth and stride in both dimensions. This layer is followed by batch normalization and Rectified Linear Unit (ReLU) activations. The next layer is a max-pooling layer operating over 2 × 10 non-overlapping rectangles, followed by a dropout layer at the end of the CNN block. The output of this block forms the input to the next block, and so on. The filter counts and kernel sizes of each layer are shown in Table 1. The last max-pooling layer in the network operates over the entire time-sequence length (i.e. the output of this layer has dimension one along the time axis). The next layer after the third CNN block is a global average pooling (over the frequency axis), followed by a batch-normalization layer. Finally, the last layer of the network is a dense (fully connected) layer with 10 nodes and the softmax activation function. Compared to [4], where only one-channel features were used as the CNN input, we also train another CNN with 4-channel features (as indicated in the first line of Table 1).

Table 1: 2-Dimensional CNN topology. BN: Batch Normalization, ReLU: Rectified Linear Unit. The numbers in parentheses show the kernel size of a convolution layer and the number before BN shows the number of filters in that layer. The numbers before MaxPooling show the window size of that layer.

  Input: 80 × 500 × 1 or 80 × 500 × 4
  (7 × 11) Conv2D(pad=1, stride=1)-32-BN-ReLU
  (2 × 10) MaxPooling2D
  Dropout(0.3)
  (7 × 11) Conv2D(pad=1, stride=1)-64-BN-ReLU
  (2 × 5) MaxPooling2D
  Dropout(0.3)
  (7 × 11) Conv2D(pad=1, stride=1)-128-BN-ReLU
  (5 × 10) MaxPooling2D
  Dropout(0.3)
  GlobalAveragePooling2D
  BatchNormalization
  Dense-10-SoftMax

4.2. One-dimensional CNN for x-vector extraction

The CNNs extracting x-vectors use one-dimensional convolution along the time axis. Table 2 shows the network architecture. The network has three parts. The first part operates at the frame-by-frame level and outputs a sequence of activation vectors (one for each frame). The second part compresses the frame-by-frame information into a fixed-length vector of statistics describing the whole acoustic segment. More precisely, the mean and standard deviation of the input activation vectors are calculated over frames. The last part of the network consists of two dense ReLU layers followed by a dense softmax layer, as in the previous topology.

This network has been used in two ways. In the first case, the softmax output is used directly for classification, as before (i.e. we train an end-to-end ASC system). In the second case, the x-vectors extracted at the output of the first affine transform after the pooling are used as the input to another classifier.

A Linear Discriminant Analysis (LDA) transformation is used to precondition the x-vectors for the subsequent ASC classifier (i.e. the cosine similarity classifier). More specifically, it is used to whiten the within-class covariance and possibly reduce the dimensionality of the x-vectors. For this purpose, conventional LDA can be used; however, the number of preserved dimensions is then at most the number of classes minus one (9 in our case). Our previous work on text-dependent speaker verification [11, 12], as well as the ASC experiments here, indicates that such a dimensionality reduction hurts the performance. To overcome this limitation, we have proposed to use a Regularized version of LDA (RLDA), which enables us to keep as many dimensions as we need.
In RLDA, a small fraction of the identity matrix is added to both the within- and between-class covariance matrices, giving the following estimation formulas:

  S_w = \alpha I + \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{n=1}^{N_c} (w_c^n - w_c)(w_c^n - w_c)^T,

  S_b = \beta I + \frac{1}{C} \sum_{c=1}^{C} (w_c - w)(w_c - w)^T,

where I is the identity matrix, C is the total number of classes (i.e. scenes in this case), N_c is the number of training samples in class c, w_c^n is the n-th sample in class c, w_c = \frac{1}{N_c} \sum_{n=1}^{N_c} w_c^n is the mean of class c, w = \frac{1}{C} \sum_{c=1}^{C} w_c is the mean of the class means, and \alpha and \beta were empirically set to 0.001 and 0.01, respectively. This type of regularization makes the between-class covariance matrix full-rank, which allows us to freely choose the number of dimensions to preserve after the LDA transformation. In this work, we reduce the original 128-dimensional x-vectors to 100 dimensions. For more information about RLDA, we refer readers to our previous papers [12, 13].

Table 2: 1-Dimensional CNN topology for x-vector extraction. BN: Batch Normalization, ReLU: Rectified Linear Unit.

  Input: 500 × 80
  (3 × 1) Conv1D(pad=1, stride=1)-128-ReLU-BN
  Dropout(0.15)
  (3 × 1) Conv1D(pad=1, stride=1)-128-ReLU-BN
  Dropout(0.15)
  (5 × 1) Conv1D(pad=1, stride=1)-128-ReLU-BN
  Dropout(0.15)
  (1 × 1) Conv1D(pad=1, stride=1)-128-ReLU-BN
  Dropout(0.15)
  (1 × 1) Conv1D(pad=1, stride=1)-256-ReLU-BN
  Statistics Pooling (mean and standard deviation)
  Dense-128-ReLU-BN (x-vector)
  Dropout(0.15)
  Dense-128-ReLU-BN
  Dense-10-SoftMax

After applying RLDA, average class x-vectors are estimated on the training data and used as class representation vectors. Cosine similarity is calculated between each test x-vector and each class representation vector, and the class with the highest score is selected.
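A minimal NumPy sketch of the RLDA estimation above, followed by the cosine-similarity classification. The eigendecomposition route and the exact whitening normalization are illustrative choices (see [12, 13] for the authors' full recipe); the paper's setting would be 128-dimensional x-vectors reduced to 100 dimensions:

```python
import numpy as np

def rlda(X, y, alpha=0.001, beta=0.01, dim=100):
    """Regularized LDA: identity-smoothed scatter matrices, top-`dim` directions."""
    classes = np.unique(y)
    C, D = len(classes), X.shape[1]
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    gmean = means.mean(axis=0)                       # mean of the class means
    Sw = alpha * np.eye(D)
    for c, mu in zip(classes, means):
        Xc = X[y == c] - mu
        Sw += (Xc.T @ Xc) / len(Xc) / C              # within-class scatter
    Sb = beta * np.eye(D) + (means - gmean).T @ (means - gmean) / C
    # The beta*I term makes Sb full-rank, so dim may exceed C - 1
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)[:dim]
    V = evecs[:, order].real
    # Scale each direction so that v^T Sw v = 1 (whitened within-class covariance)
    V /= np.sqrt(np.einsum('id,de,ie->i', V.T, Sw, V.T))[None, :]
    return V

def cosine_classify(V, X_train, y_train, X_test):
    """Project, average class x-vectors, score each test vector by cosine similarity."""
    classes = np.unique(y_train)
    reps = np.stack([(X_train[y_train == c] @ V).mean(axis=0) for c in classes])
    reps /= np.linalg.norm(reps, axis=1, keepdims=True)
    Z = X_test @ V
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    return classes[np.argmax(Z @ reps.T, axis=1)]
```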
Alternatively, these similarity scores are fused with the other scores from the CNN outputs for the final decision making.

5. SYSTEMS AND FUSION

In this challenge, we fused the outputs of different systems to obtain the final results. For the two-dimensional CNNs, both the single-channel and the 4-channel variants are trained on both sets of features, which gives us 4 different classifiers. Further, two CNNs for x-vector extraction are trained, each on one set of features; these are trained only in the single-channel variant. The softmax outputs of all 6 neural networks are directly used for classification. The two sets of x-vectors produced by the two latter CNNs are further used to construct another two cosine-similarity based classifiers.

We trained these systems in two scenarios: the first using the data without any augmentation, and the second using augmented data. The scores of the resulting 16 systems (8 for each scenario) were fused to form the final submission. We used two different strategies for system fusion. First, a multiclass logistic regression classifier was trained on the scores of the different systems' outputs; the FoCal Multiclass toolbox [14] was used for the logistic regression training. As an alternative fusion approach, we simply averaged the scores of the different systems. We used this alternative strategy because we feared that the data available for training the logistic regression fusion might not be sufficient: the logistic regression classifier was trained on the validation set, which had already been used for early stopping of the CNN training and for model selection (i.e. the models performing best on the validation set were selected). Also, this set is rather small, which might lead to over-fitting during fusion training. The four final submissions to the challenge were system fusions obtained with the two fusion methods.
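The simple averaging fusion is just a mean over the per-system class-posterior (softmax) outputs followed by an arg-max; a sketch, assuming the scores are stacked as (n_systems, n_segments, n_classes):

```python
import numpy as np

def average_fusion(system_scores):
    """system_scores: array (n_systems, n_segments, n_classes) of softmax outputs.
    Returns the fused class index for each segment."""
    fused = np.asarray(system_scores).mean(axis=0)  # plain output averaging
    return fused.argmax(axis=1)
```

Unlike the FoCal logistic-regression fusion, this rule has no trainable parameters, which is exactly why it cannot over-fit the small validation set.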
Each method was used to fuse either 1) all the sub-systems trained only on the augmented data, or 2) all the sub-systems (i.e. also including the sub-systems trained only on the original data).

6. EXPERIMENTAL SETUPS

The experiments reported in this section were mainly carried out on the official challenge validation fold, which divides the development set into two subsets, a training set and an evaluation set, with 6122 and 2518 audio segments, respectively. The training set was further randomly divided into two separate parts with proportions of 70 and 30 percent. The bigger part was used for network training as well as classifier training (for the cosine distance based method); the smaller part was used for the stopping criterion in network training, for model selection, and also for fusion training. Finally, the evaluation part was used for reporting the results.

In addition to the results on the development set, some results are reported using the Kaggle leaderboard system¹ on a leaderboard set, which has 1200 segments. This set was divided into public and private leaderboard subsets by the organizers, and we report results only for the public subset². In this case, the whole development set was used for training and validation: about 90% of its audio segments, randomly selected, were used for training and the remaining segments for validation. For the final system training, the same data split was used as for the leaderboard results. The final decisions for the 3600 evaluation audio segments were submitted to the challenge website. For the final submitted systems, the results on the evaluation set are also reported.

Similar to the baseline system provided by the organizers, our network training was performed by optimizing the categorical cross-entropy using the Adam optimizer [15].
The initial learning rate was set to 0.001, and network training was early-stopped if the validation loss did not decrease for more than 20 epochs. Then, training was restarted from the best model, but with a halved learning rate. This training procedure is repeated 3 times, until the learning rate reaches 0.00025. The maximum number of epochs and the mini-batch size were set to 200 and 64, respectively.

7. RESULTS

7.1. Comparison of Results

Table 3 reports the public leaderboard results for the individual systems as well as several system combinations. We report results separately for the systems using the two different feature sets, in order to compare their performance. For the four systems submitted to the challenge, the table also provides the results on the evaluation set.

Comparing the results of the different features, we can see that the mel-spectrogram performs better for the ASC task in all cases. However, the fusion of both feature sets improves the performance considerably, which indicates their complementarity.

¹ https://www.kaggle.com/c/dcase2018-task1a-leaderboard
² The public subset has the same number of audio segments for each class and is included in the evaluation set. As the organizers mentioned, the results on the private subset are not valid because there are different numbers of audio segments per class.

Table 3: Comparison of the different methods and feature types, as well as of the two fusion strategies, with and without data augmentation. The star marks highlight the systems submitted as the four final submissions to the challenge. M: single-channel features, LRMS: 4-channel features, cos: cosine distance, MEL-All: all systems with MEL features, and similarly for CQT-All.

  Method                 Public ACC [%]   Eval. ACC [%]
  Baseline system             62.5             61.0

  Without data augmentation
  Mel-2D-CNN-M                71.0
  Mel-2D-CNN-LRMS             67.7
  Mel-1D-CNN                  65.3
  Mel-x-vector-cos            64.8
  CQT-2D-CNN-M                67.8
  CQT-2D-CNN-LRMS             68.8
  CQT-1D-CNN                  60.3
  CQT-x-vector-cos            60.2
  Fusion-Average              75.0
  Fusion-FoCal                71.5

  With data augmentation
  Mel-2D-CNN-M                68.2
  Mel-2D-CNN-LRMS             71.3
  Mel-1D-CNN                  67.8
  Mel-x-vector-cos            64.7
  CQT-2D-CNN-M                64.8
  CQT-2D-CNN-LRMS             68.5
  CQT-1D-CNN                  60.8
  CQT-x-vector-cos            58.2
  Fusion-Average*             76.8             78.1
  Fusion-FoCal*               73.3             75.1

  Fusions
  MEL-All-Average             72.5
  CQT-All-Average             71.3
  All-Average*                77.5             78.4
  All-FoCal*                  73.0             74.5

Generally, feeding the networks with 4-channel features improves the performance compared to the single-channel variant, especially when more training data is available through data augmentation. In some cases, however, this strategy degrades the performance. We believe it should generally improve it, so these cases deserve further investigation.

When comparing the first and second sections of Table 3, it is obvious that the augmentation helps in some situations but degrades the performance in others; the results are not consistent across all network types. As mentioned before, in the 4-channel mode the augmentation improves the performance in almost all cases.

The results of the two different fusion strategies show that the simple averaging performs considerably better in all cases. As we expected, the data for fusion training was not sufficient: the fusion training over-fitted to the validation data and did not generalize well to other datasets.

The results on both the leaderboard and evaluation sets show that the fusion of the 8 systems trained on the augmented data already achieves very good performance. When the systems with no data augmentation are also added to the fusion, only a slight improvement is obtained.

7.2. Results on the Official Fold

In this section, the results of the best final system (i.e. the All-Average system from Table 3) are reported for each scene. Table 4 shows the performance of the system for each scene separately, as well as the overall performance on the official challenge validation fold. The results indicate that our system performs well for all scene classes except the Public Square class, which deserves future investigation.

Table 4: Per-scene comparison of the final fused system with the baseline.

  Scene label          Our system Accuracy [%]   Baseline Accuracy [%]
  Airport                       91.6                     72.9
  Bus                           71.0                     62.9
  Metro                         78.4                     51.2
  Metro Station                 79.2                     55.4
  Park                          88.4                     79.1
  Public Square                 29.9                     40.4
  Shopping Mall                 77.5                     49.6
  Street Pedestrian             75.4                     50.0
  Street Traffic                82.0                     80.5
  Tram                          80.1                     55.1
  Average                       75.3                     59.7

8. CONCLUSIONS

We have described the systems submitted by the BUT team to the Acoustic Scene Classification (ASC) challenge of DCASE2018. Different systems were designed for this challenge, and the final systems were fusions of the output scores of the individual systems; simple score averaging and logistic regression were used for the fusion. The systems included 2-dimensional CNNs with single-channel and 4-channel features, and one-dimensional CNNs trained on mel-spectrogram and CQT features. Cosine similarity classifiers were also used to compare x-vectors extracted with the one-dimensional CNNs.

Our future work will include investigating the failures of the 4-channel CNN variants in some scenarios. We will also experiment with other methods for data augmentation, which, in our opinion, is crucial for good system performance. We would also like to investigate the use of bottleneck features for ASC.

9. ACKNOWLEDGMENT

The work was supported by the Czech Ministry of Education, Youth and Sports from Project No. CZ.02.2.69/0.0/0.0/16 027/0008371, Czech Ministry of Interior project No. VI20152020 025 "DRAPAK", and the National Programme of Sustainability (NPU II) project "IT4Innovations excellence in science - LQ1602". The authors would like to thank Mr. Hamid Eghbalzadeh for his valuable discussions. The authors are also thankful to the DCASE organizers for managing the annual challenges, which allows for rapid advances in ASC technology.

10. REFERENCES

[1] S. Mun, S. Park, D. Han, and H. Ko, "Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane," DCASE2017 Challenge, Tech. Rep., September 2017.
[2] Y. Han and J. Park, "Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification," DCASE2017 Challenge, Tech. Rep., September 2017.
[3] Z. Weiping, Y. Jiantao, X. Xiaotao, L. Xiangtao, and P. Shaohu, "Acoustic scene classification using deep convolutional neural network and multiple spectrograms fusion," DCASE2017 Challenge, Tech. Rep., September 2017.
[4] R. Hyder, S. Ghaffarzadegan, Z. Feng, and T. Hasan, "BUET Bosch consortium (B2C) acoustic scene classification systems for DCASE 2017," DCASE2017 Challenge, Tech. Rep., September 2017.
[5] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, "CP-JKU submissions for DCASE-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks," in IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
[6] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[7] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 165–170.
[8] A. Mesaros, T. Heittola, and T. Virtanen, "A multi-device dataset for urban acoustic scene classification," in IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2018.
[9] C. Schörkhuber and A. Klapuri, "Constant-Q transform toolbox for music processing," in 7th Sound and Music Computing Conference, Barcelona, Spain, 2010, pp. 3–64.
[10] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference, 2015, pp. 18–25.
[11] H. Zeinali, L. Burget, H. Sameti, O. Glembek, and O. Plchot, "Deep neural networks and hidden Markov models in i-vector-based text-dependent speaker verification," in Odyssey: The Speaker and Language Recognition Workshop, 2016, pp. 24–30.
[12] H. Zeinali, H. Sameti, and L. Burget, "HMM-based phrase-independent i-vector extractor for text-dependent speaker verification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1421–1435, 2017.
[13] H. Zeinali, H. Sameti, and N. Maghsoodi, "SUT submission for NIST 2016 speaker recognition evaluation: Description and analysis," in Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017), 2017, pp. 276–286.
[14] N. Brümmer, "FoCal multi-class: Toolkit for evaluation, fusion and calibration of multi-class recognition scores: tutorial and user manual," Software available at http://sites.google.com/site/nikobrummer/focalmulticlass, 2007.
[15] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.