Combining High-Level Features of Raw Audio Waves and Mel-Spectrograms for Audio Tagging


Authors: Marcel Lederle, Benjamin Wilhelm

Detection and Classification of Acoustic Scenes and Events 2018, 19-20 November 2018, Surrey, UK

COMBINING HIGH-LEVEL FEATURES OF RAW AUDIO WAVES AND MEL-SPECTROGRAMS FOR AUDIO TAGGING

Marcel Lederle* (University of Konstanz, Konstanz, Germany, marcel.lederle@uni.kn)
Benjamin Wilhelm* (University of Konstanz, Konstanz, Germany, benjamin.wilhelm@uni.kn)

* Both authors contributed equally to this work.

ABSTRACT

In this paper, we describe our contribution to Task 2 of the DCASE 2018 Audio Challenge [1]. While it has become ubiquitous to utilize an ensemble of machine learning methods for classification tasks to obtain better predictive performance, the majority of ensemble methods combine predictions rather than learned features. We propose a single-model method that combines learned high-level features computed from log-scaled mel-spectrograms and raw audio data. These features are learned separately by two Convolutional Neural Networks, one for each input type, and then combined by densely connected layers within a single network. This relatively simple approach, along with data augmentation, ranks among the best two percent in the Freesound General-Purpose Audio Tagging Challenge on Kaggle.

Index Terms — audio tagging, convolutional neural network, raw audio, mel-spectrogram

1. INTRODUCTION

For humans, it seems effortless to associate sounds with events or categories that describe the perceived sound best. However, the complex structure and the large amount of information transmitted through sound make it particularly difficult to extract that information automatically.

Recognizing a wide variety of sounds has many applications in today's life. These include surveillance [2, 3, 4], acoustic monitoring [5], and automatic description of multimedia [6]. Due to the diversity of sounds belonging to the same category, reliable recognition of manifold sound categories is still the subject of ongoing research.
Carefully hand-crafted features such as Mel-Frequency Cepstral Coefficients (MFCCs) were long the dominant features used for speech recognition [7, 8] and music information retrieval [9], but the trend is now shifting toward deep learning [10]. Manual feature engineering has drawbacks compared to deep-learning-based methods because it requires considerable effort and expertise to create features for a specific purpose. In particular, most engineered features, such as MFCCs and spectral centroids [11], are non-task-specific, whereas deep learning approaches are task-specific due to their formulation as a minimization problem over task-specific training examples. Since Convolutional Neural Networks (CNNs) have shown remarkable progress in visual recognition tasks [12] over the last years, it has become common to use CNNs for feature extraction and classification in the audio domain [13, 14]. Several CNN architectures, such as AlexNet [15], VGG [16], ResNet [17], and Inception-v3 [18], have been proposed for image classification and are also well suited to the task of audio tagging.

The goal of Task 2 of the DCASE 2018 Challenge [1] was to predict the category of an audio clip belonging to one out of 41 heterogeneous classes, such as "Acoustic guitar", "Bark", "Bus", and "Telephone", drawn from the AudioSet Ontology [19]. Training and testing data contain a diverse set of user-generated audio clips from Freesound (https://freesound.org) [1].

In this paper, we mainly focus on building an audio-tagging system that uses both the raw audio data and the corresponding mel-spectrogram rather than ensembling [20] or stacking [21] multiple classifiers.

2. METHOD

Our audio-tagging system comprises two Convolutional Neural Networks, trained separately on raw audio and mel-spectrograms, respectively.
The learned high-level features are then combined by a densely connected neural network to form the complete system. In the following, we describe each model in detail.

2.1. CNN on Raw Audio (cnn-audio)

For the cnn-audio model, we use an architecture similar to common architectures for image classification, like VGG16 [16] or AlexNet [15], but with one-dimensional convolutions and one-dimensional max pooling. As described in Table 1, we use four blocks, each consisting of two convolutional layers and one max-pooling layer. The number of filters is increased in each consecutive block, while the kernel size is decreased. The pool size of the max-pooling layers is chosen to quickly reduce the large time dimension. After each block and after the dense layer, we apply batch normalization [22], as experiments have shown that it reduces the training time and increases the model accuracy. To introduce nonlinearities into our network, we apply a ReLU activation function [23] after each convolutional layer and dense layer.

2.2. CNN on Mel-Spectrogram (cnn-spec)

The cnn-spec model is a two-dimensional convolutional neural network taking mel-spectrograms as input. The architecture is again similar to common image classification architectures and is described in detail in Table 2. As with the one-dimensional model, we apply batch normalization after each block and the dense layer and use the ReLU activation function after convolutional and dense layers.
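The shape progression of Table 1 can be reproduced with a few lines of arithmetic (a sketch under the assumption, consistent with the listed shapes, that the convolutions use "same" padding, so only the pooling layers shrink the time axis):

```python
def cnn_audio_time_dims(n_samples, pool_sizes):
    """Track the time dimension of cnn-audio through its max-pooling
    layers: convolutions leave the length unchanged, and each pooling
    layer divides it by the pool size (floor division)."""
    dims = [n_samples]
    for p in pool_sizes:
        dims.append(dims[-1] // p)
    return dims

# 1-second input (44100 samples): the first pooling layer uses pool size 8.
print(cnn_audio_time_dims(44100, [8, 16, 16, 16]))    # [44100, 5512, 344, 21, 1]
# 3-second input (132300 samples): all pooling layers use pool size 16.
print(cnn_audio_time_dims(132300, [16, 16, 16, 16]))  # [132300, 8268, 516, 32, 2]
```

The final time dimension (1 or 2) times the 256 filters of the last block is what the 512-unit dense layer sees.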
Layer            | 1 sec shape | 2 sec shape | 3 sec shape
input            | (44100, 1)  | (88200, 1)  | (132300, 1)
conv1d, 11, 32   | (44100, 32) | (88200, 32) | (132300, 32)
conv1d, 11, 32   | (44100, 32) | (88200, 32) | (132300, 32)
max-pool1d, 8/16 | (5512, 32)  | (5512, 32)  | (8268, 32)
conv1d, 9, 64    | (5512, 64)  | (5512, 64)  | (8268, 64)
conv1d, 9, 64    | (5512, 64)  | (5512, 64)  | (8268, 64)
max-pool1d, 16   | (344, 64)   | (344, 64)   | (516, 64)
conv1d, 7, 128   | (344, 128)  | (344, 128)  | (516, 128)
conv1d, 7, 128   | (344, 128)  | (344, 128)  | (516, 128)
max-pool1d, 16   | (21, 128)   | (21, 128)   | (32, 128)
conv1d, 5, 256   | (21, 256)   | (21, 256)   | (32, 256)
conv1d, 5, 256   | (21, 256)   | (21, 256)   | (32, 256)
max-pool1d, 16   | (1, 256)    | (1, 256)    | (2, 256)
dense, 512       | (512)       | (512)       | (512)
softmax, 41      | (41)        | (41)        | (41)

Table 1: The architecture of the cnn-audio model. Note that for the one-second model, the first max-pooling layer uses a pool size of 8, while the other models use a pool size of 16.

The mel-spectrogram is extracted using librosa [24] with the original sampling frequency of 44.1 kHz, 2048 FFT points, 128 mel bins, and a hop length of 256. The amplitude of the mel-spectrogram is scaled logarithmically, and the scaled mel-spectrogram is resized in the time dimension to fit the model input size.

2.3. Joining CNNs (cnn-comb)

We remove the softmax and dense layers of both the trained cnn-audio and cnn-spec models and concatenate the output features of the preceding layer of both models to join them. The concatenated features are fed into a densely connected neural network with four hidden layers. The hidden dense layers have 512, 256, 256, and 128 neurons, respectively. The complete model is illustrated in Figure 1.

We train cnn-audio and cnn-spec from scratch. Afterward, the weights of these models are transferred to the cnn-comb model and only the newly added dense layers are trained.
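The joining step can be sketched in a few lines of numpy. The 256- and 1024-dimensional branch features correspond to flattening the last pooling outputs of the two-second models in Tables 1 and 2 (our reading of the architecture), and the random matrices are hypothetical stand-ins for the trained dense-head weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Flattened high-level features of the two pretrained branches
# (two-second models: (1, 256) -> 256 and (2, 2, 256) -> 1024).
audio_feat = rng.standard_normal(256)
spec_feat = rng.standard_normal(1024)
joint = np.concatenate([audio_feat, spec_feat])  # 1280-d combined feature

# The newly trained dense head: 512 -> 256 -> 256 -> 128, then 41-way softmax.
h = joint
for n_out in (512, 256, 256, 128):
    W = 0.01 * rng.standard_normal((n_out, h.size))  # stand-in for trained weights
    h = relu(W @ h)
probs = softmax(0.01 * rng.standard_normal((41, h.size)) @ h)
print(probs.shape)  # (41,)
```

Freezing the two branches and training only this head is what keeps the third training step cheap.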
Splitting the training of cnn-comb into these three steps facilitates the procedure.

2.4. Data Augmentation

To prevent our model from overfitting, we make use of extensive data augmentation during training (time shifting, cropping, padding, and blending clips of the same and different categories). Each of these augmentation techniques is applied to both the raw audio wave and the mel-spectrogram. In the remainder of this section, we explain the augmentation methods based on the raw audio wave.

First, we apply a uniformly random time shift to the audio clip. To ensure that the audio clips fit the size of the model input, crops are taken from audio files that are too long, and audio files that are too short are padded. For audio clips that are longer than the model input size, we use a crop with the size of the model input taken from a random position. If an audio sample fits multiple times (n times) into the input size, it is replicated such that it appears k ∈ {1, ..., n} times, each with a probability of 1/n, and the remaining space before, after, and in between the replications is filled with zeros.

Additionally, we enhance the robustness of our model by blending multiple audio clips of the same or different categories. This method is referred to as mixup [25]. We blend clips by assigning a random weight to each sample (the weights sum to one) and taking the weighted sum. If the blended samples are of the same class, the model should still predict the common class for the newly generated training sample. If the blended samples are of distinct classes, the model is trained to predict the weight of each included class.

While the spectrogram is computed in advance, the computationally inexpensive data augmentation techniques can be computed on the fly during training. This saves disk space and guarantees a large amount of diverse training data.

Layer               | 1 sec shape    | 2 sec shape    | 3 sec shape
input               | (128, 170, 1)  | (128, 300, 1)  | (128, 400, 1)
conv2d, 4×4, 64     | (128, 170, 64) | (128, 300, 64) | (128, 400, 64)
conv2d, 4×4, 64     | (128, 170, 64) | (128, 300, 64) | (128, 400, 64)
max-pool2d, 1×1/2×2 | (128, 170, 64) | (64, 150, 64)  | (64, 200, 64)
conv2d, 4×4, 64     | (128, 170, 64) | (64, 150, 64)  | (64, 200, 64)
max-pool2d, 2×2     | (64, 85, 64)   | (32, 75, 64)   | (32, 100, 64)
conv2d, 3×3, 128    | (64, 85, 128)  | (32, 75, 128)  | (32, 100, 128)
max-pool2d, 2×4     | (32, 21, 128)  | (16, 18, 128)  | (16, 25, 128)
conv2d, 3×3, 128    | (32, 21, 128)  | (16, 18, 128)  | (16, 25, 128)
max-pool2d, 2×2     | (16, 10, 128)  | (8, 9, 128)    | (8, 12, 128)
conv2d, 3×3, 256    | (16, 10, 256)  | (8, 9, 256)    | (8, 12, 256)
max-pool2d, 2×2     | (8, 5, 256)    | (4, 4, 256)    | (4, 6, 256)
conv2d, 3×3, 256    | (8, 5, 256)    | (4, 4, 256)    | (4, 6, 256)
max-pool2d, 2×2     | (4, 2, 256)    | (2, 2, 256)    | (2, 3, 256)
dense, 256          | (256)          | (256)          | (256)
softmax, 41         | (41)           | (41)           | (41)

Table 2: The architecture of the cnn-spec model. Note that the first max-pooling layer does not exist in the case of the one-second model.

2.5. Implementation Details

We implemented the described method using Keras [26] in Python. To monitor overfitting and the model performance during training, we exclude a part of the training data as validation data. To still make use of all training data, we train five models on stratified folds of the training data such that each training example is used once for validation and four times for training. For the final prediction, we accumulate the predictions of all five models using the geometric mean.

All models are trained using the Adam optimizer [27] with a fixed learning rate of 0.001 and a mini-batch size of 32 for a maximum of 300 epochs, stopping earlier if the validation loss has not improved for 35 epochs. We use the categorical cross-entropy loss function and weight the loss according to the distribution of training examples per class, thereby ensuring that the models pay more attention to samples from under-represented classes.
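The cropping, replication-with-padding, and mixup steps above can be sketched as follows (a simplified numpy illustration of the described scheme, not the authors' exact implementation; for brevity, the zeros are placed only before and after the replications rather than in between):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_to_input(clip, input_len):
    """Random-crop clips longer than the model input; replicate shorter
    clips k ∈ {1, ..., n} times and fill the rest with zeros."""
    if len(clip) >= input_len:
        start = rng.integers(0, len(clip) - input_len + 1)
        return clip[start:start + input_len]
    n = input_len // len(clip)       # the clip fits n times into the input
    k = int(rng.integers(1, n + 1))  # each k is drawn with probability 1/n
    reps = np.concatenate([clip] * k)
    out = np.zeros(input_len, dtype=clip.dtype)
    offset = rng.integers(0, input_len - len(reps) + 1)
    out[offset:offset + len(reps)] = reps
    return out

def mixup(x1, y1, x2, y2):
    """Blend two clips and their one-hot labels with weights summing to one."""
    w = rng.uniform(0.0, 1.0)
    return w * x1 + (1 - w) * x2, w * y1 + (1 - w) * y2
```

For two clips of the same class, the blended label is still the shared one-hot vector; for clips of distinct classes, it carries the blending weights, which is exactly the soft target described above.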
Because many clips contain silence, we cut off silent parts at the beginning and at the end of an audio clip that do not exceed a volume of 40 decibels.

[Figure 1: Illustrated architecture of the complete model.]

When predicting on the test data, we have to take the varying length of the audio files into account. It is not sufficient to predict on only one crop of a too-long track, because important features might not be present in the selected crop. Therefore, we run inference on many crops of the audio file with a step size of 5120 frames, which is approximately 0.12 seconds. For too-short audio tracks, the model might better recognize class-specific features in certain parts of the input. Therefore, we generate multiple inputs by padding the audio file with zeros such that the real audio appears at different positions in the input. Again, we use a step size of 5120 frames. Multiple predictions for one audio file are combined by means of the geometric mean.

3. EVALUATION

3.1. Dataset

We evaluate our method on the dataset provided for Task 2 of the DCASE 2018 Challenge [1], which comprises 9473 training and 1600 test samples. The test set has been manually verified, whereas the training data features labels of different reliability. Each mono audio file has a bit depth of 16 and a sampling rate of 44.1 kHz, and is associated with one out of 41 classes of the AudioSet Ontology [19].

The class distribution of both the training and test set is not balanced, ranging from 94 to 300 and from 25 to 110 samples per class, respectively. The duration of the shortest audio file is 300 ms and that of the longest clip is 30 seconds, while the average length is 6.8 seconds for the training set and 5.2 seconds for the test set.

3.2. Metric
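The multi-crop inference and the geometric-mean combination can be sketched as follows (a numpy sketch with hypothetical helper names; a small constant guards the logarithm against zero probabilities):

```python
import numpy as np

def crop_starts(n_samples, input_len, step=5120):
    """Start indices of the inference crops: one crop every 5120 frames
    (~0.12 s at 44.1 kHz), sliding over a clip longer than the input."""
    return list(range(0, max(n_samples - input_len, 0) + 1, step))

def geometric_mean(pred_list):
    """Combine per-crop (or per-fold) class probabilities by the
    geometric mean and renormalize."""
    p = np.exp(np.mean(np.log(np.stack(pred_list) + 1e-12), axis=0))
    return p / p.sum()

starts = crop_starts(2 * 44100, 44100)  # 2-second clip, 1-second model input
print(len(starts))  # 9 crops
```

The same `geometric_mean` combination is applied across the five fold models for the final prediction.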
The Challenge uses mean Average Precision at three (mAP@3) for evaluating test results, which allows up to three predictions per audio clip. Full credit is given if the first prediction matches the label of the clip, while less credit is given if one of the other predictions is correct. The evaluation metric is defined as

\[ \mathrm{mAP@3} = \frac{1}{U} \sum_{i=1}^{U} \sum_{j=1}^{\min(3,\,n_i)} \frac{\llbracket\, y_{ij} = \hat{y}_i \,\rrbracket}{j}, \]

where \(U\) is the total number of scored audio files, \(y_{ij}\) is the predicted label for file \(i\) at position \(j\), \(\hat{y}_i\) is the ground-truth label for file \(i\), \(n_i\) is the total number of predicted labels for file \(i\), and \(\llbracket \text{True} \rrbracket = 1\), \(\llbracket \text{False} \rrbracket = 0\). No label may be predicted multiple times for one audio file.

Model     | Crop length | Public score (mAP@3) | Private score (mAP@3) | Total (mAP@3)
cnn-audio | 1 sec       | 0.920                | 0.888                 | 0.894
cnn-audio | 2 sec       | 0.921                | 0.884                 | 0.891
cnn-audio | 3 sec       | 0.935                | 0.889                 | 0.898
cnn-spec  | 1 sec       | 0.930                | 0.923                 | 0.924
cnn-spec  | 2 sec       | 0.950                | 0.928                 | 0.932
cnn-spec  | 3 sec       | 0.935                | 0.930                 | 0.931
cnn-comb  | 1 sec       | 0.955                | 0.939                 | 0.942
cnn-comb  | 2 sec       | 0.966                | 0.944                 | 0.948
cnn-comb  | 3 sec       | 0.956                | 0.944                 | 0.946

Table 3: Evaluation results of the individual models on the public (301 samples), private (1299 samples), and full test set.

3.3. Results

We trained all models on inputs of one, two, and three seconds, as described in Section 2.5, and evaluated the model performance on the test set (see Table 3). We observed that for each crop length, the combined model performs significantly better than the models with a single input. The combined models with input sizes of two and three seconds perform best and rank in the upper two percent on the Private Leaderboard on Kaggle. Additionally, we determined the per-category mAP@3 on the complete test set, showing that some classes are more challenging to predict than others (see Table 4). Our model primarily struggles with the classes "Squeak", "Telephone", and "Fireworks", but it still beats the baseline system [1] in every per-category mAP@3 score.
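The metric can be implemented directly from this definition (a sketch; predictions are ranked lists of up to three distinct labels):

```python
def map_at_3(predictions, truths):
    """mean Average Precision at 3: credit 1, 1/2, or 1/3 when the
    correct label appears at rank 1, 2, or 3 of the prediction list."""
    total = 0.0
    for preds, truth in zip(predictions, truths):
        for j, label in enumerate(preds[:3], start=1):
            if label == truth:
                total += 1.0 / j
                break  # labels are distinct, so at most one term counts
    return total / len(truths)

print(map_at_3([["Bark", "Bus", "Meow"], ["Flute", "Oboe", "Gong"]],
               ["Bark", "Gong"]))  # (1 + 1/3) / 2 ~ 0.667
```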
To verify that the performance gain of the combined model results from combining the extracted high-level features of both models, we compare cnn-audio and cnn-spec to the cnn-comb model with one of its inputs set to zero (see Figure 2).

Name                  | Samples | Time (min) | mAP@3
Acoustic guitar       | 300     | 52.2       | 0.893
Applause              | 300     | 58.2       | 1.000
Bark                  | 239     | 44.6       | 0.982
Bass drum             | 300     | 12.8       | 1.000
Burping or eructation | 210     | 11.7       | 1.000
Bus                   | 109     | 28.4       | 0.953
Cello                 | 300     | 37.3       | 0.951
Chime                 | 115     | 23.8       | 0.891
Clarinet              | 300     | 34.7       | 0.991
Computer keyboard     | 119     | 23.0       | 1.000
Cough                 | 243     | 22.4       | 1.000
Cowbell               | 191     | 10.9       | 1.000
Double bass           | 300     | 16.9       | 0.946
Drawer open or close  | 158     | 18.0       | 0.925
Electric piano        | 150     | 25.5       | 1.000
Fart                  | 300     | 18.6       | 0.944
Finger snapping       | 117     | 5.9        | 1.000
Fireworks             | 300     | 48.2       | 0.786
Flute                 | 300     | 46.2       | 1.000
Glockenspiel          | 94      | 8.4        | 0.856
Gong                  | 292     | 41.8       | 0.968
Gunshot or gunfire    | 147     | 11.1       | 0.950
Harmonica             | 165     | 18.6       | 0.970
Hi-hat                | 300     | 18.6       | 0.957
Keys jangling         | 139     | 18.8       | 0.929
Knock                 | 279     | 19.6       | 0.957
Laughter              | 300     | 36.3       | 0.974
Meow                  | 155     | 18.7       | 1.000
Microwave oven        | 146     | 25.1       | 0.966
Oboe                  | 299     | 15.3       | 0.976
Saxophone             | 300     | 33.7       | 0.942
Scissors              | 95      | 15.7       | 0.927
Shatter               | 300     | 26.1       | 0.960
Snare drum            | 300     | 17.9       | 0.912
Squeak                | 300     | 38.2       | 0.603
Tambourine            | 221     | 10.1       | 0.975
Tearing               | 300     | 38.7       | 0.981
Telephone             | 120     | 16.2       | 0.788
Trumpet               | 300     | 28.3       | 0.959
Violin or fiddle      | 300     | 26.6       | 0.986
Writing               | 270     | 48.3       | 0.948

Table 4: Per-category mAP@3 score of the cnn-comb 2 sec model on the full test set, along with the number of samples and the total duration in minutes of the respective class in the training set.

[Figure 2: Comparison of per-category scores of single-input models, combined models with one input alternately set to zero, and the combined model with both inputs. The mAP@3 score is reported on a single fold for each model.]

For categories in which the single cnn-audio model outperforms the single cnn-spec model, the combined model performs better if the audio input is present and not set to zero. Conversely, if cnn-spec outperforms cnn-audio, cnn-comb achieves a higher score if the mel-spectrogram is present and not set to zero. Setting either the mel-spectrogram or the audio wave to zero forces the cnn-comb model to make predictions based on a single input. cnn-comb performs equally well or better than cnn-comb with one input set to zero because it has learned to jointly utilize meaningful high-level features of both inputs, which zeroed inputs do not provide. For the same reason, cnn-comb with one zeroed input usually performs worse than the corresponding model with a single input. We conclude that cnn-comb makes use of the learned high-level features from both cnn-audio and cnn-spec, but focuses more on the features belonging to the superior single model.

4. CONCLUSION

In this paper, we have proposed a method for audio tagging that extends current Convolutional Neural Network approaches using only a frequency representation by adding a second input that incorporates the raw audio wave. Adding this additional input has improved the mAP@3 score significantly. We have demonstrated the capabilities of our model by competing in the Freesound General-Purpose Audio Tagging Challenge on Kaggle and ranking in the top two percent of all participants.

5. ACKNOWLEDGMENT

We thank Christian Borgelt and Christoph Doell for motivating us to take part in the Kaggle competition, and Christoph Doell for his valuable comments on the manuscript.

6. REFERENCES

[1] E. Fonseca, M. Plakal, F. Font, D. P. Ellis, X. Favory, J. Pons, and X. Serra, "General-purpose tagging of Freesound audio with AudioSet labels: Task description, dataset, and baseline," arXiv preprint arXiv:1807.09902, 2018.
[2] A. Harma, M. F. McKinney, and J. Skowronek, "Automatic surveillance of the acoustic activity in our living environment," in Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on. IEEE, 2005.
[3] M. Crocco, M. Cristani, A. Trucco, and V. Murino, "Audio surveillance: A systematic review," ACM Computing Surveys (CSUR), vol. 48, no. 4, p. 52, 2016.
[4] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, "Scream and gunshot detection and localization for audio-surveillance systems," in Advanced Video and Signal Based Surveillance, 2007. AVSS 2007. IEEE Conference on. IEEE, 2007, pp. 21–26.
[5] S. Goetze, J. Schroder, S. Gerlach, D. Hollosi, J.-E. Appell, and F. Wallhoff, "Acoustic monitoring and localization for social care," Journal of Computing Science and Engineering, vol. 6, no. 1, pp. 40–50, 2012.
[6] Q. Jin and J. Liang, "Video description generation using audio and visual cues," in Proceedings of the 2016 ACM International Conference on Multimedia Retrieval. ACM, 2016, pp. 239–242.
[7] C. Ittichaichareon, S. Suksri, and T. Yingthawornsuk, "Speech recognition using MFCC," in International Conference on Computer Graphics, Simulation and Modeling (ICGSM 2012), July 2012, pp. 28–29.
[8] B. Logan et al., "Mel frequency cepstral coefficients for music modeling," in ISMIR, vol. 270, 2000, pp. 1–11.
[9] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
[10] O. Abdel-Hamid, L. Deng, and D. Yu, "Exploring convolutional neural network structures and optimization techniques for speech recognition," in Interspeech, 2013, pp. 1173–1175.
[11] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, "Audio-based context recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 321–329, 2006.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[13] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
[14] K. J. Piczak, "Environmental sound classification with convolutional neural networks," in Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on. IEEE, 2015, pp. 1–6.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[16] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[19] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 776–780.
[20] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[21] S. Džeroski and B. Ženko, "Is combining classifiers with stacking better than selecting the best one?" Machine Learning, vol. 54, no. 3, pp. 255–273, 2004.
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[23] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[24] B. McFee, M. McVicar, S. Balke, C. Thomé, C. Raffel, O. Nieto, E. Battenberg, D. Ellis, R. Yamamoto, J. Moore, R. Bittner, K. Choi, F.-R. Stöter, S. Kumar, S. Waloschek, Seth, R. Naktinis, D. Repetto, C. F. Hawthorne, C. Carr, hojinlee, W. Pimenta, P. Viktorin, P. Brossier, J. F. Santos, JackieWu, Erik, and A. Holovaty, "librosa/librosa: 0.6.0," Feb. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1174893
[25] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," in International Conference on Learning Representations, 2018.
[26] F. Chollet et al., "Keras," https://keras.io, 2015.
[27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
