Masked Conditional Neural Networks for Audio Classification

Fady Medhat, David Chesmore, and John Robinson
Department of Electronic Engineering, University of York, York, United Kingdom
{fady.medhat, david.chesmore, john.robinson}@york.ac.uk

Abstract. We present the ConditionaL Neural Network (CLNN) and the Masked ConditionaL Neural Network (MCLNN)¹ designed for temporal signal recognition. The CLNN takes into consideration the temporal nature of the sound signal, and the MCLNN extends the CLNN through a binary mask that preserves the spatial locality of the features and allows an automated exploration of feature combinations, analogous to hand-crafting the most relevant features for the recognition task. The MCLNN has achieved competitive recognition accuracies on the GTZAN and ISMIR2004 music datasets, surpassing several state-of-the-art neural-network-based architectures and hand-crafted methods applied to both datasets.

Keywords: Restricted Boltzmann Machine, RBM; Conditional Restricted Boltzmann Machine, CRBM; Music Information Retrieval, MIR; Conditional Neural Network, CLNN; Masked Conditional Neural Network, MCLNN; Deep Neural Network

1 Introduction

The success of deep neural network architectures in image recognition [1] has encouraged applying these models to audio recognition [2][3]. One of the main drivers for this adaptation is the need to eliminate the effort invested in hand-crafting the features required for classification. Several neural-network-based architectures have been proposed, but they are usually adapted to sound from other domains such as image recognition, which may not exploit sound-related properties.
The Restricted Boltzmann Machine (RBM) [4] treats sound as static frames, ignoring the inter-frame relation, and the weight sharing in the vanilla Convolutional Neural Network (CNN) [5] does not preserve the spatial locality of the learned features; limited weight sharing was proposed in [2] in an attempt to tackle this problem for sound recognition.

The Conditional Restricted Boltzmann Machine (CRBM) [6], shown in Fig. 1, extends the RBM [7] to the temporal dimension. This is achieved by including conditional links from the previous frames (v̂_{-1}, v̂_{-2}, ..., v̂_{-n}) to both the hidden nodes ĥ and the current visible nodes v̂_0, using the links (B̂_{-1}, B̂_{-2}, ..., B̂_{-n}) and the autoregressive links (Â_{-1}, Â_{-2}, ..., Â_{-n}), respectively, as depicted in Fig. 1.

¹ Code: https://github.com/fadymedhat/MCLNN

Fig. 1. Conditional Restricted Boltzmann Machine

The Interpolating CRBM (ICRBM) [8] achieved a higher accuracy than the CRBM for speech phoneme recognition by extending the CRBM to consider both the previous and the future frames.

The CRBM behaviour (and similarly this work) overlaps with that of a Recurrent Neural Network (RNN) such as the Long Short-Term Memory (LSTM) [9], an architecture designed for sequence labelling. The output of an RNN at a certain temporal instance depends on the current input and the hidden state of the network's internal memory from the previous input. Compared to an LSTM, a CRBM does not require an internal state, since the influence of the previous temporal input states is considered concurrently with the current input.
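To make the conditional links concrete, the hidden-layer computation of a CRBM-style unit can be sketched in NumPy. This is our own illustrative sketch: the function name, the stacking of the B̂ matrices into one tensor, and the shapes are assumptions for exposition, not taken from the paper's released code.

```python
import numpy as np

def crbm_hidden(v0, history, W, B, b):
    """Hidden activations of a CRBM-style unit (illustrative sketch).

    v0      : current visible frame, shape (l,)
    history : past frames [v_-1, ..., v_-n] stacked, shape (n, l)
    W       : visible-to-hidden weights, shape (l, e)
    B       : one conditional weight matrix per past frame, shape (n, l, e)
    b       : hidden bias, shape (e,)
    """
    pre = b + v0 @ W                     # standard RBM visible-to-hidden term
    for v_u, B_u in zip(history, B):
        pre = pre + v_u @ B_u            # conditional links from past frames
    return 1.0 / (1.0 + np.exp(-pre))    # sigmoid activation
```

The autoregressive links Â (past visible to current visible) would be applied analogously when reconstructing v̂_0.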
Additionally, increasing the order n does not carry the risk of the vanishing or exploding gradients associated with Back-Propagation Through Time (BPTT) in recurrent neural networks, which the LSTM was introduced to solve, since back-propagation in a CRBM depends on the number of layers, as in normal feed-forward neural networks.

Inspired by the human visual system, the Convolutional Neural Network (CNN) depends on two main operations, namely convolution and pooling. In the convolution operation, the input (usually a 2-dimensional representation) is scanned (convolved) by a small-sized weight matrix, referred to as a filter. Several small-sized filters, e.g. 5 × 5, scan the input to generate a number of feature maps equal to the number of filters scanning the input. A pooling operation generates lower-resolution feature maps through either a mean or a max pooling operation. The CNN depends on weight sharing, which allows applying it to images of large sizes without having a dedicated weight for each pixel, since similar patterns may appear at different locations within an image. This is not optimally suitable for time-frequency representations, which prompted attempts to tailor the CNN filters for sound [2],[10],[11].

2 Conditional Neural Networks

In this work, we introduce the ConditionaL Neural Network (CLNN). The CLNN adapts from the Conditional RBM the directed links between the previous visible and the hidden nodes, and extends to future frames as in the ICRBM. Additionally, the CLNN adopts a global pooling operation [12], which behaves as an aggregation operation found to enhance the classification accuracy in [13]. The CLNN allows the sequential relation across the temporal frames of a multi-dimensional signal to be considered collectively by processing a window of frames.

Fig. 2. Two CLNN layers with n = 1.

The CLNN has a hidden layer
in the form of a vector having e neurons, and it accepts an input of size [d, l], where l is the feature vector length and d = 2n + 1 (d is the number of frames in a window, n is the order giving the number of frames in each temporal direction, and the 1 accounts for the window's middle frame). Fig. 2 shows two CLNN layers, each having an order n = 1, where n is a tunable hyper-parameter controlling the window's width. Accordingly, each CLNN layer in the figure has a 3-dimensional weight tensor composed of one central matrix Ŵ^m_0 and two off-center weight matrices, Ŵ^m_{-1} and Ŵ^m_1 (m is the layer id). During the scanning of the signal across the temporal dimension, a frame in the window at index u is processed with its corresponding weight matrix Ŵ^m_u of the same index. The size of each Ŵ^m_u equals the feature vector length × the hidden layer width. The number of weight matrices is 2n + 1 (the 1 is for the central frame), which matches the number of frames in the window. The output of a single CLNN step over a window of frames is a single representative vector.

Several CLNN layers can be stacked on top of each other to form a deep architecture, as shown in Fig. 2. The figure also depicts a number k of extra frames remaining after the processing applied through the two CLNN layers. These k extra frames allow incorporating an aggregation operation within the network by pooling across the temporal dimension, or they can be flattened into a single vector before being fed to a fully connected network. The CLNN is trained over segments following (1)

    q = 2nm + k,    n, m and k ≥ 1        (1)

where q is the segment size, n is the order, m is the number of layers and k is the number of extra frames. The input at each CLNN layer has 2n fewer frames than the previous layer. For example, for n = 4, m = 3 and k = 5, the input is of size 29 frames. The output of the first layer is 29 − 2 × 4 = 21 frames.
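The frame bookkeeping of equation (1) can be verified with a short sketch (the helper name is our own):

```python
def clnn_frame_counts(n, m, k):
    """Frames entering each CLNN layer for order n, m layers, k extra frames.

    Per equation (1) the segment size is q = 2*n*m + k; each layer consumes
    n frames from each side of the window, so every output has 2*n fewer frames.
    """
    q = 2 * n * m + k
    counts = [q]
    for _ in range(m):
        counts.append(counts[-1] - 2 * n)
    return counts

print(clnn_frame_counts(4, 3, 5))  # [29, 21, 13, 5]: k = 5 extra frames remain
```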
Similarly, the output of the second and third layers is 13 and 5 frames, respectively. The 5 frames remaining after the third layer are the extra frames to be pooled or flattened.

The activation at a hidden node of a CLNN can be formulated as in (2)

    y_{j,t} = f( b_j + Σ_{u=−n}^{n} Σ_{i=1}^{l} x_{i,u+t} W_{i,j,u} )        (2)

where y_{j,t} is the activation at node j of a hidden layer for frame t in a segment of size q. This frame is also the window's middle frame at u = 0. The output y is given by the value of the activation function f applied to the summation of the bias b_j of node j and the product of W_{i,j,u} and x_{i,u+t}. The input x_{i,u+t} is the i-th feature in a single feature vector of size l at index u + t within a window, and W_{i,j,u} is the weight between the i-th input of a feature vector and the j-th hidden node. The index u (in W_{i,j,u} and x_{i,u+t}) ranges over the temporal window of frames to be considered, within [−n + t, n + t]. Reformulating (2) in vector form gives (3)

    ŷ_t = f( b̂ + Σ_{u=−n}^{n} x̂_{u+t} · Ŵ_u )        (3)

where ŷ_t, the activation vector observed at the hidden layer for the central frame conditioned on the input vectors in the interval [x̂_{−n+t}, x̂_{n+t}], is given by the activation function f applied to the summation of the bias vector b̂ and the sum of products between the vector x̂_{u+t} at index u + t (t is the window's middle frame at u = 0 and the index of the frame in the segment) and the corresponding weight matrix Ŵ_u at the same index, where u ranges over the considered window from −n up to n. The conditional distribution can be captured using a logistic function as in p(ŷ_t | x̂_{−n+t}, ..., x̂_{−1+t}, x̂_t, x̂_{1+t}, ..., x̂_{n+t}) = σ(·), where σ is the hidden-layer sigmoid or the output-layer softmax.

3 Masked Conditional Neural Networks

Fig. 3. Masking patterns: a) Bandwidth = 5 and Overlap = 3; b) the active links following the masking pattern in a; c) Bandwidth = 3 and Overlap = −1.

The Mel-scaled analysis applied in MFCC and Mel-scaled spectrograms, both used extensively as intermediate signal representations by sound recognition systems, exploits a filterbank (a group of signal processing filters). Considering a sound signal represented as a spectrogram, the energy of a certain frequency bin may smear across nearby frequency bins. Aggregating the energy across neighbouring frequency bins is a possible representation to overcome such frequency shifts, which is what filterbanks tackle. More general mixtures across the bins could be hand-crafted to select the most prominent features for the signal under consideration.

The Masked ConditionaL Neural Network (MCLNN), which we introduce in this work, embeds a filterbank-like behaviour and allows the exploration of a range of feature combinations concurrently, instead of manually hand-crafting the optimum mixture of features. Fig. 3 depicts the implementation of the filterbank-like behaviour through a binary mask enforced over the network's links, which activates different regions of a feature vector while deactivating others, following a band-like pattern. The mask is designed based on two tunable hyper-parameters: the bandwidth and the overlap. Fig. 3.a shows a binary mask having a bandwidth of 5 (the five consecutive ones in a column) and an overlap of 3 (the overlapping ones between two successive columns). A hidden node will act as an expert in a localized region of the feature vector without considering the rest of it, as depicted in Fig. 3.b. The figure shows the active connections for each hidden node over a local region of the input feature vector, matching the mask pattern in Fig. 3.a. The overlap can be assigned negative values, as shown in Fig. 3.c.
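One way to realize such a band-like binary mask, given a bandwidth and an overlap (possibly negative), is to place runs of ones at a fixed stride through the flattened, column-major mask. The sketch below reflects our own indexing assumptions, not the paper's released implementation:

```python
import numpy as np

def band_mask(l, e, bw, ov):
    """Binary mask of shape (l, e): feature length l, e hidden nodes,
    bandwidth bw, overlap ov (a negative ov leaves gaps between bands).

    Runs of bw ones are written at a stride of l + (bw - ov) through the
    flat mask, then the flat mask is read back in column-major order so
    each successive column's band is shifted by (bw - ov) rows.
    """
    flat = np.zeros(l * e, dtype=int)
    step = l + (bw - ov)
    for g in range((l * e) // step + 1):
        start = g * step
        flat[start:start + bw] = 1       # one band of bw consecutive ones
    return flat.reshape((l, e), order="F")
```

A weight matrix Ŵ_u of the same shape can then be masked element-wise as `W_u * band_mask(l, e, bw, ov)`.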
Fig. 3.c shows a mask with a bandwidth of 3 and an overlap of −1, depicted by the non-overlapping distance between the 1's of two successive columns. Additionally, the figure shows a further role introduced by the mask through the presence of shifted versions of the binary pattern across the first set of three columns compared to the second and third sets. This role involves the automatic exploration of a range of feature combinations concurrently. The columns in the figure map to hidden nodes. Therefore, for a single feature vector, the 1st node (corresponding to the 1st column) will consider the first 3 features of the feature vector, the 4th node will consider a different combination involving the first 2 features, and the 7th node will consider yet another combination using the first feature only. This behaviour embeds a mix-and-match operation within the network, allowing the hidden nodes to learn different properties through different combinations of the feature vector while preserving spatial locality. The positions of the binary values are specified through a linear index lx following (4)

    lx = a + (g − 1)(l + (bw − ov))        (4)

where lx is given by the bandwidth bw, the overlap ov and the feature vector length l. The term a takes values in [0, bw − 1] and g is in the interval [1, ⌈(l × e)/(l + (bw − ov))⌉]. The binary masking is enforced through an element-wise multiplication following (5)

    Ẑ_u = Ŵ_u ∘ M̂        (5)

where Ŵ_u is the original weight matrix at a certain index u and M̂ is the masking pattern applied. Ẑ_u is the new masked weight matrix that replaces the weight matrix in (3).

4 Experiments

We performed the MCLNN evaluation using the GTZAN [14] and ISMIR2004 datasets, widely used in the literature for benchmarking several MIR tasks including genre classification.
The GTZAN dataset consists of 1000 music files categorized across 10 music genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae and rock). The ISMIR2004 dataset comprises training and testing splits of 729 files each. The splits have 6 unbalanced categories of music genres (classical, electronic, jazz-blues, metal-punk, rock-pop and world) of full-length recordings. All files were resampled at 22050 Hz and chunks of 30 seconds were extracted. A logarithmic Mel-scaled spectrogram transformation with 256 frequency bins was applied, using an FFT window of 2048 samples (≈ 100 msec) and an overlap of 50%. The feature-wise z-score parameters of the training set were applied to both the validation and test sets. Segments of frames following (1) were extracted.

Table 1. Accuracies on the GTZAN

  Classifier and Features                    Acc.%
  CS + Multiple feat. sets [15]²             92.7
  SRC + LPNTF + Cortical features [16]²      92.4
  RBF-SVM + Scattering Trans. [17]²          91.4
  MCLNN + Mel-Spec. (this work)²             85.1
  RBF-SVM + Spec.-DBN [4]⁴                   84.3
  MCLNN + Mel-Spec. (this work)³             84.1
  Linear SVM + PSD on Octaves [18]³          83.4
  Random Forest + Spec.-DBN [19]⁵            83.0
  AdaBoost + Several features [13]¹          83.0
  RBF-SVM + Spectral Covar. [20]²            81.0
  Linear SVM + PSD on frames [18]³           79.4
  SVM + DWCH [21]²                           78.5

Table 2. Accuracies on the ISMIR2004

  Classifier and Features                    Acc.%
  SRC + NTF + Cortical features [16]⁹        94.4
  KNN + Rhythm & timbre [22]²                90.0
  SVM + Block-Level features [23]⁸           88.3
  MCLNN + Mel-Spec. (this work)²             86.0
  MCLNN + Mel-Spec. (this work)³             84.8
  MCLNN + Mel-Spec. (this work)⁹             84.8
  GMM + NMF [24]¹                            83.5
  MCLNN + Mel-Spec. (this work)⁶             83.1
  SVM + Symbolic features [25]²              81.4
  NN + Spectral Similarity FP [26]⁷          81.0
  SVM + High-Order SVD [27]²                 81.0
  SVM + Rhythm and SSD [28]⁶                 79.7

  ¹ 5-fold cross-validation  ² 10-fold cross-validation  ³ 10 × 10-fold cross-validation
  ⁴ 50% training, 20% validation, 30% testing  ⁵ 4 × (50% training, 25% validation, 25% testing)
  ⁶ 10 × (train 729 files, test 729 files)  ⁷ leave-one-out cross-validation
  ⁸ Not referenced  ⁹ train 729 files, test 729 files

Table 3. MCLNN parameters

  Layer   Hidden Nodes   MCLNN Order   Mask Bandwidth   Mask Overlap
  1       220            4             40               -10
  2       200            4             10               3

Table 4. GTZAN random and filtered splits

  Model                Random Acc.%   Filtered Acc.%
  MCLNN (this work)    84.4           65.8
  DNN [30]             81.2           42.0

The network was trained to minimize the categorical cross-entropy between the segment's predicted label and the target one. The final decision on a clip's genre is made by majority voting across the frames of the clip. The experiments for both datasets were carried out using 10-fold cross-validation repeated 10 times. An additional experiment used the ISMIR2004 dataset's original split (729 training, 729 testing), also repeated 10 times. We adopted a two-layered MCLNN, as listed in Table 3, followed by a single-dimensional global mean pooling layer [12] to pool across the k = 10 extra frames, and finally a 50-node fully connected layer before the softmax output layer. Parametric Rectified Linear Units (PReLU) [29] were used for all the model's neurons. We applied the same model to both datasets to gauge the generalization of the MCLNN to datasets of different distributions.
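The clip-level decision described above, a majority vote over the per-segment predictions of a clip, can be sketched in a few lines (the function name and labels are illustrative):

```python
from collections import Counter

def clip_genre(segment_predictions):
    """Clip-level genre by majority vote over per-segment predictions.

    Ties resolve to the first label reaching the top count, as returned
    by Counter.most_common; the paper does not specify tie handling.
    """
    votes = Counter(segment_predictions)
    return votes.most_common(1)[0][0]

print(clip_genre(["jazz", "rock", "jazz", "jazz", "blues"]))  # jazz
```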
Tables 1 and 2 list the accuracies achieved by the MCLNN along with other methods widely cited in the literature for the genre classification task on the GTZAN and ISMIR2004 datasets. The MCLNN surpasses several state-of-the-art methods dependent on hand-crafted features or neural networks, achieving accuracies of 85.1% and 86% over a 10-fold cross-validation for GTZAN and ISMIR2004, respectively. We repeated the 10-fold cross-validation 10 times to validate the stability of the MCLNN's accuracy; the MCLNN achieved 84.1% and 84.83% over the 100 training runs for GTZAN and ISMIR2004, respectively.

To further evaluate the MCLNN's performance, we adopted the publicly available splits released by Kereliuk et al. [30]. In their work, they released two versions of splits for the GTZAN files: a randomly stratified split (50% training, 25% validation and 25% testing) and a fault-filtered version, in which they cleared out the faults in GTZAN reported by Sturm [31], e.g. repetitions, distortion, etc. As listed in Table 4, the MCLNN achieved 84.4% and 65.8%, compared to 81.2% and 42% for the random and fault-filtered splits, respectively, achieved by Kereliuk et al. in their attempt to reproduce the work of Hamel et al. [4]. The experiments show that the MCLNN performs better than several neural-network-based architectures and comparably to some works dependent on hand-crafted features. The MCLNN achieved these accuracies without the rhythmic and perceptual properties [32] used by the methods that reported higher accuracies than the MCLNN. Finally, regarding the size of the training data: in the works [4,13,18,20,30], an FFT window of 50 msec was used.
On the other hand, the MCLNN achieved the mentioned accuracies using a 100 msec window, which decreases the number of feature vectors used in classification by 50% and consequently the training complexity, allowing the MCLNN to scale to larger datasets.

5 Conclusions and Future Work

We have introduced the ConditionaL Neural Network (CLNN) and its extension, the Masked ConditionaL Neural Network (MCLNN). The CLNN is designed to exploit the properties of multi-dimensional temporal signals by considering the sequential relationship across the temporal frames. The mask in the MCLNN enforces a systematic sparseness that follows a frequency-band-like pattern. Additionally, it plays the role of automating the exploration of a range of feature combinations concurrently, analogous to the exhaustive manual search for hand-crafted feature combinations. We have applied the MCLNN to the problem of genre classification. Through an extensive set of experiments, without any special rhythmic or timbral analysis, the MCLNN has sustained accuracies that surpass neural-network-based and several hand-crafted feature extraction methods referenced previously on both the GTZAN and ISMIR2004 datasets, achieving 85.1% and 86%, respectively. Meanwhile, the MCLNN preserves a generality that allows it to be adapted to any temporal signal. Future work will involve optimizing the mask patterns and considering different combinations of the order across the layers. We will also consider applying the MCLNN to other multi-channel temporal signals.

Acknowledgments. This work is funded by the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 608014 (CAPACITIE).

References

1. A. Krizhevsky, I. Sutskever, and G. E.
Hinton, "ImageNet classification with deep convolutional neural networks," in Neural Information Processing Systems, NIPS, 2012.
2. O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
3. J. Schlüter, Unsupervised Audio Feature Extraction for Music Similarity Estimation. Thesis, 2011.
4. P. Hamel and D. Eck, "Learning features from music audio with deep belief networks," in International Society for Music Information Retrieval Conference, ISMIR, 2010.
5. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
6. G. W. Taylor, G. E. Hinton, and S. Roweis, "Modeling human motion using binary latent variables," in Advances in Neural Information Processing Systems, NIPS, pp. 1345–1352, 2006.
7. P. Smolensky, Information Processing in Dynamical Systems: Foundations of Harmony Theory, pp. 194–281, 1986.
8. A.-R. Mohamed and G. Hinton, "Phone recognition using restricted Boltzmann machines," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2010.
9. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
10. J. Pons, T. Lidy, and X. Serra, "Experimenting with musically motivated convolutional neural networks," in International Workshop on Content-based Multimedia Indexing, CBMI, 2016.
11. K. J. Piczak, "Environmental sound classification with convolutional neural networks," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2015.
12. M. Lin, Q. Chen, and S. Yan, "Network in network," in International Conference on Learning Representations, ICLR, 2014.
13. J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B.
Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473–484, 2006.
14. G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, 2002.
15. K. K. Chang, J.-S. R. Jang, and C. S. Iliopoulos, "Music genre classification via compressive sampling," in International Society for Music Information Retrieval, ISMIR, 2010.
16. Y. Panagakis, C. Kotropoulos, and G. R. Arce, "Music genre classification using locality preserving non-negative tensor factorization and sparse representations," in International Society for Music Information Retrieval Conference, ISMIR, 2009.
17. J. Andén and S. Mallat, "Deep scattering spectrum," IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4114–4128, 2014.
18. M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun, "Unsupervised learning of sparse features for scalable audio classification," in International Society for Music Information Retrieval, ISMIR, 2011.
19. S. Sigtia and S. Dixon, "Improved music feature learning with deep neural networks," in International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2014.
20. J. Bergstra, M. Mandel, and D. Eck, "Scalable genre and tag prediction with spectral covariance," in International Society for Music Information Retrieval, ISMIR, 2010.
21. T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2003.
22. T. Pohle, D. Schnitzer, M. Schedl, P. Knees, and G. Widmer, "On rhythm and general music similarity," in International Society for Music Information Retrieval, ISMIR, 2009.
23. K. Seyerlehner, M. Schedl, T. Pohle, and P.
Knees, "Using block-level features for genre classification, tag classification and music similarity estimation," in Music Information Retrieval eXchange, MIREX, 2010.
24. A. Holzapfel and Y. Stylianou, "Musical genre classification using nonnegative matrix factorization-based features," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 2, pp. 424–434, 2008.
25. T. Lidy, A. Rauber, A. Pertusa, and J. M. Iñesta, "Improving genre classification by combination of audio and symbolic descriptors using a transcription system," in International Conference on Music Information Retrieval, 2007.
26. E. Pampalk, A. Flexer, and G. Widmer, "Improvements of audio-based music similarity and genre classification," in International Conference on Music Information Retrieval, ISMIR, 2005.
27. I. Panagakis, E. Benetos, and C. Kotropoulos, "Music genre classification: A multilinear approach," in International Society for Music Information Retrieval, ISMIR, 2008.
28. T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in International Conference on Music Information Retrieval, ISMIR, 2005.
29. K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in IEEE International Conference on Computer Vision, ICCV, 2015.
30. C. Kereliuk, B. L. Sturm, and J. Larsen, "Deep learning and music adversaries," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2059–2071, 2015.
31. B. L. Sturm, "The state of the art ten years after a state of the art: Future research in music information retrieval," Journal of New Music Research, vol. 43, no. 2, pp. 147–172, 2014.
32. J. P. Bello, Machine Listening of Music, pp. 159–184. New York, NY: Springer New York, 2014.