A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification
Authors: Eduardo Fonseca, Rong Gong, Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona
name.surname@upf.edu

ABSTRACT

In the past, Acoustic Scene Classification systems have been based on hand-crafting audio features that are input to a classifier. Nowadays, the common trend is to adopt data-driven techniques, e.g., deep learning, where audio representations are learned from data. In this paper, we propose a system that consists of a simple fusion of two methods of the aforementioned types: a deep learning approach where log-scaled mel-spectrograms are input to a convolutional neural network, and a feature engineering approach, where a collection of hand-crafted features is input to a gradient boosting machine. We first show that both methods provide complementary information to some extent. Then, we use a simple late fusion strategy to combine both methods. We report the classification accuracy of each method individually and of the combined system on the TUT Acoustic Scenes 2017 dataset. The proposed fused system outperforms each of the individual methods and attains a classification accuracy of 72.8% on the evaluation set, improving the baseline system by 11.8%.

1. INTRODUCTION

Environmental sounds provide contextual information about where we are and the physical events occurring nearby. Humans have the ability to identify the environments or contexts where they are (e.g., park, beach or bus) leveraging only acoustic information. However, this task is not trivial for systems that attempt to automate it. One of the goals of machine listening is to have systems that perform similarly to humans in tasks like this, a topic that is receiving growing attention from the research community [1]. The consecutive editions of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenges have stimulated research in this field by benchmarking a variety of approaches to acoustic scene classification and acoustic event detection on common, publicly available datasets. Acoustic Scene Classification (ASC) can be defined as the task of associating a label to an audio stream, thereby identifying the context or environment where the audio stream was recorded, e.g., park or beach [2]. The acoustic scene consists of all the acoustic material that can be present in a given context, including both background noises and specific acoustic events that may occur either in the foreground or merged as part of the background.

The benefits of machine listening systems performing similarly to humans in recognizing acoustic scenes are manifold. Existing applications range from audio collection management [3] and intelligent wearable interfaces [4] to the development of context-aware applications [5]. Some concrete examples include automatic description of multimedia content, or optimization of hearing aid parameter settings based on the recognized scene.

Copyright: (c) 2018 Eduardo Fonseca, Rong Gong, and Xavier Serra. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
In the past, ASC systems have been based on a feature engineering approach, where pre-designed low-level features are extracted from the audio signal and input to a classifier. The most popular hand-crafted features in audio-related tasks are cepstral features, e.g., MFCCs. Initially taken from the speech recognition field, they are among the most widespread in ASC too [6, 7]. Typical examples of classifiers used in ASC include GMMs [6] and SVMs [7]. The feature engineering approach relies heavily on the capacity of the pre-designed features to capture relevant information from the signal, which in turn may require significant expertise and effort. In fact, this approach turned out to be neither efficient nor sustainable in many disciplines, given the high diversity of problems and particular cases encountered in the real world.

In recent years, we have witnessed a paradigm shift in ASC similar to those experienced in areas like computer vision or speech recognition. New techniques have arisen based on learning representations from data. These data-driven approaches, especially deep learning, have allowed significant research breakthroughs and have rapidly spread across the audio research community. In this case, the system learns internal representations from a simpler one at the input (typically, a time-frequency representation), and the two stages of the feature engineering approach (feature extraction and classification) are optimized jointly. Among the various deep learning approaches available, Convolutional Neural Networks (CNNs) have proved to be effective for several audio-related tasks, e.g., speech recognition [8], automatic music tagging [9] or environmental sound classification [10]. Specifically for ASC, CNNs have also been used successfully, e.g., [11, 12].

In this paper, we propose an acoustic scene recognizer that employs the fusion of the two presented trends. First, a simple 2-layer CNN designed using domain knowledge learns features from mel-spectrograms. Second, a pool of low-level audio features is extracted and input to a Gradient Boosting Machine (GBM). By combining both approaches with a simple fusion method, we obtain a system that takes advantage of the complementary information that they provide. The proposed system is an extension of our previous work [13], including improvements in both individual approaches and in the late fusion method, as well as further discussion. In particular, the main improvements are due to the usage of pre-activation in the CNN, LDA feature reduction in the GBM pipeline, and learning-based late fusion.

The remainder of this paper is organized as follows. Section 2 describes the CNN and GBM methods that compose the system. In Section 3 we present the dataset used and the evaluation setup. Results and discussion for each method individually and for the combined system can be found in Section 4. Section 5 summarizes and concludes this work.

2. METHOD

2.1 Convolutional Neural Network

When CNNs are presented with an audio time-frequency representation, they are able, in theory, to capture spectro-temporal patterns that can be useful for recognition tasks. Furthermore, the dimensions (width and height) of the convolutional filters can be related to the time and frequency axes, respectively. In this work, we explore how this relation can be exploited when designing the convolutional filters for ASC.
2.1.1 Audio Pre-processing

We consider two input representations for the CNN: mel-spectrograms and gammatone-based spectrograms. Both start with the computation of the power of the short-time Fourier transform (STFT) (using Hamming windows of 40 ms with 50% overlap) after down-mixing the 2-channel binaural files to mono. In short, the mel-spectrogram aggregates the power values using triangular filters (in the frequency domain) distributed according to the mel scale. In contrast, the gammatone-based spectrogram aggregates the power values using gammatone filters with center frequencies distributed according to the ERB-rate scale [14]. For the former we used the Librosa library (v0.5.1), while for the latter we used the Essentia implementation [15], which is in turn adapted from [16]. After preliminary experiments we chose mel-spectrograms as the input representation, whose computation is detailed next.

A mel filter bank consisting of 128 bands from 0 to 22050 Hz according to Slaney's formula [17] is applied to the power of the STFT. Our mel filter bank presents triangular filters with a peak value of one, as opposed to other filter banks where the filters have equal area. Finally, the mel energies are logarithmically scaled. We standardize the log-scaled mel-spectrograms by subtracting the mean and dividing by the standard deviation. We do this on whichever subset of data we use for training. Then, we keep the normalization values and subsequently apply them to standardize the corresponding test set (see Section 3.2).

Since the recordings of the dataset used are 10 s long, the dimensionality of the corresponding spectrograms is considered too high for the proposed architecture. Therefore, they are split into non-overlapping time-frequency patches (T-F patches) or segments of 1.5 s (i.e., 75 frames). We hence obtain 7 segments per recording, the last one being padded with the last original frame. Thus, the CNN learns from T-F patches in R^{75×128}.
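To make this front end concrete, the following is a minimal sketch using Librosa and NumPy under the parameters above (40 ms Hamming windows, 50% overlap, 128 mel bands with peak value one, log scaling, 75-frame patches). The epsilon inside the logarithm, the 'edge' padding of the last patch, and the use of a single global mean and standard deviation for standardization are our assumptions; the paper does not report these implementation details.

```python
# Sketch of the mel-spectrogram front end of Sec. 2.1.1 (not the authors' exact code).
import numpy as np
import librosa

SR = 44100
N_FFT = int(0.040 * SR)        # 40 ms Hamming window -> 1764 samples
HOP = N_FFT // 2               # 50% overlap
N_MELS = 128
PATCH_FRAMES = 75              # 1.5 s patches
N_PATCHES = 7

def log_mel_spectrogram(path):
    y, _ = librosa.load(path, sr=SR, mono=True)               # down-mix to mono
    S = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP,
                            window='hamming')) ** 2           # power STFT
    # Mel filter bank, Slaney formula (htk=False), peak value of one (norm=None)
    mel_fb = librosa.filters.mel(sr=SR, n_fft=N_FFT, n_mels=N_MELS,
                                 fmin=0, fmax=22050, htk=False, norm=None)
    return np.log(mel_fb.dot(S) + 1e-10)                      # log scaling (epsilon assumed)

def to_patches(log_mel):
    """Split a (128, n_frames) spectrogram into 7 non-overlapping 75-frame patches,
    padding the last patch by repeating the final frame."""
    needed = N_PATCHES * PATCH_FRAMES
    pad = needed - log_mel.shape[1]
    if pad > 0:
        log_mel = np.pad(log_mel, ((0, 0), (0, pad)), mode='edge')
    patches = [log_mel[:, i * PATCH_FRAMES:(i + 1) * PATCH_FRAMES].T   # (75, 128)
               for i in range(N_PATCHES)]
    return np.stack(patches)

def standardize(train_patches, test_patches):
    # Statistics computed on the training subset only and re-used on the test data (Sec. 3.2)
    mean, std = train_patches.mean(), train_patches.std()
    return (train_patches - mean) / std, (test_patches - mean) / std
```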
2.1.2 CNN Architecture

The proposed CNN architecture is depicted in Table 1 and illustrated in Fig. 1.

Table 1. Proposed CNN architecture.
  Input:        1 x (75, 128)
  Conv1:        48x(3,8) | 32x(3,32) | 16x(3,64) | 16x(3,90), + BN + ReLU
  Max-pooling:  (5, 5)
  Conv2:        224x(5,5), + BN + ReLU
  Max-pooling:  (11, 4)
  Dense:        15 units + softmax

The architecture is composed of two convolutional layers (Conv1 and Conv2) alternated with max-pooling operations, and it ends with a softmax layer. It can be regarded as a relatively simple network comprising standard operations. Also, the network can be regarded as wide, in contrast to the trend of building deeper networks with tens of layers (or more in other disciplines like image recognition).

One of the most distinctive aspects of this network is the convolutional filters in the first layer. We hypothesize that the spectro-temporal patterns that allow recognizing many of the scenes considered are more discriminative along the frequency dimension (rather than along the time dimension). We take this into account during the filters' design. That is, our approach attempts to prioritize the modeling of spectral envelope shapes and background noises, rather than onsets/offsets or attack-decay patterns of specific acoustic events. While most CNNs in the literature leverage squared filters and only one filter shape in the first convolutional layer [10, 18, 19], some recent works suggest employing rectangular filters and different shapes at the same time [20, 21]. In particular, we explore several configurations of filters with multiple vertical shapes in the first layer. We call vertical filters those whose frequency dimension is much larger than their time dimension. By using these filters, we intend to guide the learning process towards what we intuitively assume to be more important for ASC.

The first convolutional layer is implemented as the concatenation of several convolutional layers such that every layer has filters of one single and distinct shape. Using filters of different dimensions leads to feature maps of different dimensions as well. In order to come back to same-sized feature maps, two options exist: i) zero-pad the network's input appropriately, and ii) use filter-dependent max-pooling operations. Preliminary experiments were run with both options and no major difference in performance was observed. Hence the simpler zero-padding option was adopted. The filter shapes employed are listed in Table 1 as number of filters x (time, frequency). The first convolutional layer presents 112 filters. This number is doubled for the second layer. The proposed final network presents four different filter shapes in Conv1, as illustrated over the T-F patch of Fig. 1. All the filters in Conv1 have a time dimension of 3. In contrast, filters in Conv2 are squared 5x5.

Figure 1. Sketch of the proposed CNN architecture. Four vertical filter shapes co-exist in the first convolutional layer.

We apply batch normalization (BN) [22] and a Rectified Linear Unit (ReLU) [23] after every convolutional layer, followed by max-pooling operations. The latter downsample the feature maps while adding some invariance along the time-frequency dimensions. In particular, max-pooling is applied over squares of dimension 5 after Conv1. After Conv2, global time-domain pooling is applied in order to select only the most prominent feature [18]. Finally, after flattening the resulting feature maps, the predicted class (for the input T-F patch) is obtained by a dense layer with softmax activation and 15 output units (corresponding to the 15 acoustic scenes).

We also experiment with the concept of pre-activation [24]. This technique was initially devised for image recognition in the context of deep residual networks. In [24] a residual unit is proposed containing two paths: i) a clean information path for the information to propagate, and ii) another path with an additive residual function. In the latter path, BN and ReLU are applied as pre-activation of the convolutional layers (in addition to the common post-activation consisting of the same couple of BN and ReLU after the convolution operation). Reported advantages in the particular case of deep residual networks, with 100+ layers, include ease of optimization and improved regularization. Moreover, pre-activation has recently proved successful for ASC in [11], still with a deeper network than the one proposed here. We want to explore this technique in a fairly shallow network. Based on the results obtained in Section 4.1.2, we add BN and a ReLU non-linearity directly at the network's input of Fig. 1 (before the first convolutional layer) to form the final proposed CNN.
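The architecture of Table 1 can be sketched in Keras roughly as follows. The per-branch zero-padding ('same' convolutions), the placement of BN and ReLU after the concatenation, and the optional input-level pre-activation are our reading of the paper rather than the authors' exact code; the L2 regularization of 10^-5 anticipates the training details of Section 2.1.3.

```python
# Minimal Keras sketch of the CNN in Table 1 / Fig. 1 (a sketch, not the original code).
from tensorflow.keras import layers, models, regularizers

N_CLASSES = 15
CONV1_SHAPES = [(48, (3, 8)), (32, (3, 32)), (16, (3, 64)), (16, (3, 90))]  # CNN_4

def build_cnn(pre_activation=True):
    inp = layers.Input(shape=(75, 128, 1))                 # one 1.5 s T-F patch
    x = inp
    if pre_activation:                                      # BN + ReLU before Conv1
        x = layers.Activation('relu')(layers.BatchNormalization()(x))
    # Conv1: concatenation of four "vertical" filter shapes, noted (time, frequency)
    branches = []
    for n_filters, shape in CONV1_SHAPES:
        branches.append(layers.Conv2D(n_filters, shape, padding='same',
                                      kernel_regularizer=regularizers.l2(1e-5))(x))
    x = layers.Concatenate(axis=-1)(branches)               # 112 feature maps
    x = layers.Activation('relu')(layers.BatchNormalization()(x))
    x = layers.MaxPooling2D((5, 5))(x)
    # Conv2: 224 squared 5x5 filters
    x = layers.Conv2D(224, (5, 5), padding='same',
                      kernel_regularizer=regularizers.l2(1e-5))(x)
    x = layers.Activation('relu')(layers.BatchNormalization()(x))
    x = layers.MaxPooling2D((11, 4))(x)                      # global pooling along time
    x = layers.Flatten()(x)
    out = layers.Dense(N_CLASSES, activation='softmax')(x)
    return models.Model(inp, out)
```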
2.1.3 Training Strategy and Hyperparameters

Network weights are initialized with a uniform distribution. The loss function is categorical cross-entropy and the optimizer is Adam. The initial learning rate is 0.002, and it is reduced by a factor of 2 whenever the validation loss does not decrease for 5 epochs. We also experimented with i) halving the learning rate every fixed number of epochs and ii) using Adam with no learning rate scheduling. However, the best results were obtained by reducing the learning rate when the validation loss plateaus. Training is early-stopped if the validation loss does not improve for 15 epochs, up to a maximum of 200. For early stopping, a 15% validation set is randomly split from the training data of every class. The batch size is 64, and training samples are shuffled between epochs. In both convolutional layers, L2 regularization is applied with a parameter of 10^-5. The system is implemented using Keras (v2.1.3) and TensorFlow (v1.4.1).

2.2 Gradient Boosting Machine

A Gradient Boosting Machine [25] is a technique for constructing predictive models based on an ensemble of many weak learners, typically regression trees. The trees are added iteratively, in such a way that each new tree focuses on the misclassifications of the previous ensemble of trees. Predictions of multiple trees are combined in order to optimize an objective function, and the parameters of added trees are tuned by gradient descent. Two GBM frameworks are widely used: XGBoost [26] and LightGBM (https://github.com/Microsoft/LightGBM). Experiments on five public datasets show that LightGBM outperforms XGBoost in both efficiency and accuracy, with significantly lower memory consumption [27]. In our experiments, we also found that LightGBM trains faster and achieves a slightly better overall classification accuracy. Hence we use LightGBM in this work.

2.2.1 Feature Extraction and Pre-processing

We segment each 10 s recording into 7 non-overlapping segments. The first 6 segments last 1.5 s, and the last one 1 s. We then extract features on each segment using the FreesoundExtractor (http://essentia.upf.edu/documentation/freesound_extractor.html), an out-of-the-box feature extractor from the Essentia open-source library for audio analysis [15]. This extractor computes hundreds of features for sound and music analysis and is originally used by Freesound (https://freesound.org/) in order to provide its sound analysis API and search functionalities. The most musically-related features (e.g., key, chords, etc.) are discarded. The selected pool of features is listed in Table 2, along with their dimensionality. The features are calculated at frame level using the same frame and hop size mentioned in Section 2.1.1. All other parameters of the FreesoundExtractor are set to default values. We perform four statistical aggregations (mean, variance, and mean and variance of the derivative) on the frame-level feature vectors of each segment. Therefore, a feature vector in R^{820} (i.e., 205 x 4) is output for each segment. As in Section 2.1.1, we fit a mean and variance standardization scaler on whichever subset of data we use for training, and use it to standardize both train and test data.

Table 2. Selected features extracted by FreesoundExtractor and number of dimensions.
  Bark bands energy: 32      Tonal features: 3
  ERB bands energy: 23       Pitch features: 3
  Mel bands energy: 45       Silence rate: 3
  MFCC: 13                   Spectral features: 32
  HPCP: 38                   GFCC: 13
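As an illustration of the aggregation step, the sketch below builds the 820-dimensional segment descriptor from a frame-level feature matrix; the (n_frames, 205) layout and the use of a simple first-order frame difference as the derivative are assumptions on our side.

```python
# Sketch of the per-segment statistical aggregation of Sec. 2.2.1.
# `frame_features` is assumed to be a (n_frames, 205) array holding the concatenated
# frame-level features of Table 2 for one 1.5 s (or 1 s) segment.
import numpy as np

def aggregate_segment(frame_features):
    """Return the 820-dim segment descriptor: mean and variance of each feature,
    plus mean and variance of its first-order (frame-to-frame) derivative."""
    deriv = np.diff(frame_features, axis=0)
    return np.concatenate([frame_features.mean(axis=0),
                           frame_features.var(axis=0),
                           deriv.mean(axis=0),
                           deriv.var(axis=0)])        # shape: (4 * 205,) = (820,)

# As with the spectrograms, the standardization scaler is fit on training data only, e.g.:
#   from sklearn.preprocessing import StandardScaler
#   scaler = StandardScaler().fit(X_train); X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```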
2.2.2 Linear Discriminant Analysis Feature Reduction

Linear Discriminant Analysis (LDA) can be used as a dimensionality reduction technique after the feature extraction stage. The ultimate goal is to mitigate overfitting by projecting a high-dimensional dataset onto a lower-dimensional space. This is done by maximizing the variance of the data as well as the separability of the classes. Some of the features of Table 2 are computed in a similar way, e.g., several energy bands are computed with different psychoacoustic scales (e.g., Bark or Mel). While they may provide some complementary information, it is likely that they also have a considerable amount of redundancy. This, together with the high dimensionality of the feature vector, may cause overfitting and slow down model training. In order to mitigate this, while keeping the rich information of the extracted features, we perform LDA-based feature reduction. It is applied on any subset of data used for training, and the corresponding test set is then transformed accordingly (see Section 3.2).

2.2.3 Hyperparameter Tuning

Since ASC is a multi-class classification problem, we use logarithmic loss as the objective function. We do a grid search over 5 hyperparameters. Four of them relate to the GBM (learning rate, max bins, number of leaves, and min data in leaf), while the reduced feature dimension relates to the LDA. The number of leaves is the main parameter to control model complexity, whereas max bins and min data in leaf are two important parameters to deal with overfitting. All other hyperparameters are set to default values. We do the grid search in two cases, with and without LDA, and the hyperparameter values considered are listed in Table 3. The grid search is performed using cross-validation on the development set. The hyperparameter setting leading to the best cross-validation accuracy is kept for the final GBM model, which is used to predict acoustic scenes on the evaluation set.

Table 3. Hyperparameter grid search for GBM and LDA.
  Learning rate:              [0.01, 0.05, 0.1]
  Max bins:                   [128, 256, 512]
  Number of leaves:           [64, 128, 256]
  Min data in leaf:           [500, 1000, 2000]
  Reduced feature dimension:  [64, 128, 256, 512]

2.3 Late Fusion

In order to combine the predictions from both methods, we tried approaches with and without learning, all of them starting from the individual models' class probabilities computed on the development set using the proposed four-fold cross-validation setup. The simplest approach (i.e., without learning) consists of combining the prediction probabilities by taking their geometric mean, arithmetic mean, or rank averaging. Then, the final predicted label is selected by taking the argmax over the resulting values. The learning-based approach consists of two steps. First, using the models' prediction probabilities computed on the development set as training data, we fit a classifier or meta-learner. We experimented with logistic regression and SVMs with several kernels. The models' hyperparameters were determined by grid search on the training data using four-fold cross-validation, trying to restrict the parameter search to ranges providing strong regularization. Then, once the meta-learner is fit, we predict labels on the evaluation set by taking as input the pre-computed prediction probabilities from the CNN and GBM on this set. This approach is sometimes referred to as stacking.
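A minimal sketch of the learning-based fusion (stacking) with scikit-learn is given below: the 15-class probability vectors from the CNN and the GBM are concatenated into 30-dimensional meta-features and a regularized logistic regression is fit on the development-set predictions. The variable names and the grid of C values are hypothetical; the paper does not report the exact search ranges.

```python
# Sketch of the learning-based late fusion (stacking) of Sec. 2.3.
# p_cnn_dev / p_gbm_dev: (n_recordings, 15) class probabilities predicted on the
# development set with the four-fold setup; y_dev: ground-truth scene labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_meta_learner(p_cnn_dev, p_gbm_dev, y_dev):
    X_dev = np.hstack([p_cnn_dev, p_gbm_dev])          # 30-dimensional meta-features
    # Small C values favour strong regularization (assumed search range)
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        {'C': [0.01, 0.1, 1.0]}, cv=4)
    return grid.fit(X_dev, y_dev).best_estimator_

def fuse_predictions(meta, p_cnn_eval, p_gbm_eval):
    return meta.predict(np.hstack([p_cnn_eval, p_gbm_eval]))

# The no-learning alternative would simply be, e.g.,
#   np.argmax(np.sqrt(p_cnn_eval * p_gbm_eval), axis=1)   # geometric mean fusion
```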
3. EVALUATION

3.1 Dataset and Baseline

Systems are evaluated with TUT Acoustic Scenes 2017, a dataset that contains recordings made in 15 acoustic scenes. The dataset is split into a development and an evaluation set, of 4680 and 1620 audio recordings respectively (a list of the scenes together with more details about the dataset can be found at http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-acoustic-scene-classification). The development set contains 312 recordings per class. All recordings last 10 s and have a sampling rate of 44.1 kHz. A four-fold cross-validation setup is provided for the development set. The dataset presents a mismatch between the development and evaluation sets due to differences in the recording conditions. The average accuracy drop between both sets across all systems submitted to the ASC task of DCASE2017 is 20.1% (see http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-acoustic-scene-classification-results). A multilayer perceptron (MLP) is provided as the baseline system. Its input representation is 40 log mel-band energies over 5 consecutive frames, and the MLP has 2 layers with 50 hidden units each.

3.2 Evaluation Setup

The output of the CNN and GBM models for every input 1.5 s segment is a vector in R^{15} with the probabilities of the segment belonging to every class. The class prediction for each 10 s recording is computed by averaging per-class scores across segments and finding the class with the maximum average score. The development set is used for training/testing the CNN and GBM approaches according to the provided four-fold cross-validation setup (see Fig. 2).

Figure 2. Flowchart illustrating the workflow in development mode.

For predicting acoustic scenes on the evaluation set, the models are trained on the full development set and evaluated on the evaluation set (see Fig. 3). The metric used is classification accuracy, i.e., the number of correctly classified recordings divided by the total number of recordings.

Figure 3. Flowchart illustrating the workflow in evaluation mode. Models are trained on the full development set and predictions are computed on the evaluation set.
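The recording-level decision rule of Section 3.2 amounts to averaging the per-segment probabilities and taking the argmax, as in the short sketch below; the (n_recordings, 7, 15) array layout is assumed.

```python
# Recording-level prediction from per-segment probabilities (Sec. 3.2).
# `segment_probs` is assumed to have shape (n_recordings, 7, 15): one 15-class
# probability vector per segment of each 10 s recording.
import numpy as np

def predict_recordings(segment_probs):
    per_class_scores = segment_probs.mean(axis=1)   # average across the 7 segments
    return per_class_scores.argmax(axis=1)          # predicted class per recording

def accuracy(y_pred, y_true):
    """Classification accuracy: correctly classified recordings / total recordings."""
    return float(np.mean(y_pred == y_true))
```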
4. RESULTS AND DISCUSSION

4.1 Convolutional Neural Network

Two types of experiments were carried out with the CNN: i) experimenting with filter configurations in the first layer, and ii) exploring the concept of pre-activation. Since results obtained with a GPU are generally non-deterministic, the accuracies reported in this section are the result of averaging ten independent trials of every experiment. Confidence intervals are also shown.

4.1.1 Filter Configurations

We design filter configurations with several filter shapes in the first layer. The number of shapes is specified in Table 4 and Fig. 4 as CNN_x, where x denotes the number of different shapes (CNN_sq refers to the case where filters are squared, which is a specific case of CNN_1). Every shape (denoted by (time, frequency)) can be repeated a different number of times, as illustrated in Table 4, but in all cases the total number of filters is 112.

Table 4. Filter configurations in the first layer for the CNN of Fig. 1, as number of filters x (time, frequency).
  MLP:     -
  CNN_sq:  112x(5,5)
  CNN_1:   112x(3,40)
  CNN_2:   64x(3,20) | 48x(3,70)
  CNN_3:   48x(3,10) | 32x(3,30) | 32x(3,60)
  CNN_4:   48x(3,8)  | 32x(3,32) | 16x(3,64) | 16x(3,90)
  CNN_5:   36x(3,6)  | 22x(3,26) | 22x(3,48) | 16x(3,70) | 16x(3,96)

The motivation for designing filters with different vertical dimensions is, intuitively, to be able to cover diverse spectral patterns, ranging from narrow-band patterns to others that may spread over frequency. In order to establish a fair comparison among networks, the number of parameters was kept approximately constant by adjusting the number of filters per shape and the filter dimensions. The number of parameters in all cases lies in the range 656k-660k, with the exception of the squared-filters case, which has 648k (due to the smaller size of the squared filters). In particular, the top-performing case of CNN_4 has 657k parameters. The specific filter shapes in Table 4 were chosen through a number of preliminary experiments. While an exhaustive search may be desirable, it would require prohibitively long computation times.

Fig. 4 shows the classification accuracy values for the architecture of Fig. 1 and the filter configurations of Table 4. The accuracy of the MLP baseline is specified as well.

Figure 4. ASC performance using the CNN of Fig. 1 with the filter configurations in the first layer given by Table 4. No pre-activation is adopted in these experiments. Note that the y-axis differs for the development and evaluation sets.

It can be observed that the accuracy on the evaluation set increases overall with the diversity of the filter shapes, up to a point where this diversity no longer helps (CNN_5). We also carried out some preliminary experiments with horizontal filters, but results were slightly worse than those obtained with vertical ones.

4.1.2 Pre-activation

Fig. 5 shows the results obtained by adding pre-activation [24] to the top-performing case of Fig. 4, i.e., to CNN_4.

Figure 5. ASC performance when adopting pre-activation in the CNN of Fig. 1, i.e., adding BN and ReLU before the first convolutional layer. Note that the y-axis differs for the development and evaluation sets.

It can be seen that adding pre-activation improves the results slightly on the evaluation set (see the preact bar). However, the gap between development and evaluation accuracies is still substantial. Curiously, we found that this gap is reduced when we complement pre-activation with normalization of the input audio waveform (see the norm&preact bar). This is somewhat surprising, as the T-F patches input to the CNN were already standardized (see Section 2.1.1). Finally, we report the accuracy obtained by applying only time-domain normalization of the audio (without pre-activation), to confirm that it is the combination of both which yields the improvement (see the norm bar). We also experimented with pre-activation not only prior to the first convolutional layer, but also between every max-pooling operation and the next layer, following previous work [11]. The resulting accuracies were not higher.

It hence appears that the combination of pre-activation and normalization of the input waveform helps to improve the model's generalization, showing slightly lower development accuracy while increasing evaluation accuracy. Nevertheless, further experiments are needed to better assess and understand the benefits of pre-activation and its dependency on audio signal energy or dynamic range. For example, one aspect of the audio signal in acoustic scenes or field recordings is its small dynamic range. This often happens because sources can be far away from the microphone, since the goal is to capture the entirety of the acoustic context rather than specific acoustic events. Evaluating this approach on different datasets may be revealing.
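The paper does not specify which time-domain normalization is applied to the waveform; a common choice for field recordings is peak normalization, shown below purely as an assumption of what such a step could look like before computing the spectrograms.

```python
# Hypothetical time-domain normalization (the paper only states that the input audio
# waveform is normalized; peak normalization is our assumption, not the authors' method).
import numpy as np

def peak_normalize(y, eps=1e-9):
    """Scale a mono waveform so that its maximum absolute amplitude is 1."""
    return y / (np.max(np.abs(y)) + eps)
```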
4.2 Gradient Boosting Machine

The best hyperparameters found for the LDA and non-LDA cases are listed in Table 5. The dimensionality of the feature vector after LDA-based feature reduction is 64. This is 7.8% of the initial dimensionality (820), which indicates considerable information redundancy in the initial pool of features gathered from the FreesoundExtractor. After the feature dimension reduction, we observe a significant boost in training speed.

Table 5. Best hyperparameters in the LDA and non-LDA cases, found by grid search on the development set.
  Hyperparameter              non-LDA   LDA
  Learning rate               0.05      0.05
  Max bins                    128       128
  Number of leaves            128       128
  Min data in leaf            1000      500
  Reduced feature dimension   -         64

Table 6 shows the accuracy results. The performance using LDA feature reduction is greater than that without LDA and than the MLP baseline, resulting in small improvements of 1.7% and 2.6%, respectively, on the evaluation set. However, we still witness a significant accuracy drop between both sets in both cases. It is worth mentioning that, to tackle the overfitting problem, we also experimented with two other techniques, namely PCA and feature selection using feature importance. However, no significant improvements were observed. For the late fusion we use the GBM with LDA.

Table 6. ASC performance of the GBM model with and without LDA feature reduction.
  Approach       dev acc (%)   eval acc (%)
  Baseline       74.8          61.0
  GBM non-LDA    81.4          61.9
  GBM LDA        81.1          63.6

4.3 Models' Comparison

The CNN method clearly outperforms the GBM method. However, we wanted to assess the potential complementarity of these models, i.e., whether their output predictions are complementary or redundant. We follow the approach of [28], which consists of plotting the difference of the confusion matrices yielded by both systems, shown in Fig. 6. Looking at the main diagonal, positive red numbers illustrate scenes where the CNN performs better, whereas negative blue numbers represent scenes where the GBM achieves more correct predictions. The CNN yields better results in most of the acoustic scenes. However, despite the lower performance of the GBM, it interestingly yields better predictions for the 'park', 'beach' and 'cafe/restaurant' scenes. Off the diagonal, positive red numbers indicate that the CNN presents higher confusion between pairs of acoustic scenes, and, similarly, negative blue numbers indicate that the GBM suffers from higher confusion between pairs of acoustic scenes. Overall, it can be seen that the models get confused between different pairs of scenes. In summary, the methods present different behaviour to some extent, and hence their predictions may be complementary.

Figure 6. Difference between the confusion matrices produced by i) the CNN and ii) the GBM models (in this order), evaluated on the evaluation set.
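Such a confusion-matrix difference (following the approach of [28]) can be computed and plotted as sketched below; whether Fig. 6 shows raw counts or percentages is not stated in the paper, so raw counts and the blue-white-red colormap are our assumptions.

```python
# Sketch of the confusion-matrix difference plot of Fig. 6 (approach of [28]).
# y_true, cnn_pred, gbm_pred: recording-level labels/predictions on the evaluation set.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_difference(y_true, cnn_pred, gbm_pred, class_names):
    labels = np.arange(len(class_names))
    diff = (confusion_matrix(y_true, cnn_pred, labels=labels)
            - confusion_matrix(y_true, gbm_pred, labels=labels))
    fig, ax = plt.subplots(figsize=(8, 8))
    lim = np.abs(diff).max()
    im = ax.imshow(diff, cmap='bwr', vmin=-lim, vmax=lim)
    ax.set_xticks(labels); ax.set_xticklabels(class_names, rotation=90)
    ax.set_yticks(labels); ax.set_yticklabels(class_names)
    ax.set_xlabel('Predicted scene'); ax.set_ylabel('True scene')
    fig.colorbar(im)
    # On the diagonal, positive values mean the CNN classifies that scene better;
    # off the diagonal, positive values mean the CNN confuses that pair more often.
    return diff
```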
4.4 Late Fusion

After exploring the approaches described in Section 2.3, logistic regression led to the best results, which are listed in Table 7.

Table 7. ASC performance of the combined system.
  System               dev acc (%)   eval acc (%)
  MLP baseline         74.8          61.0
  Proposed CNN + GBM   83.3          72.8

The proposed combined system shows an improvement of 3.1% over the average score provided by the best CNN architecture, and an improvement of 11.8% over the MLP baseline. It also shows an improvement of 5.5% with respect to our previous work [13]. We consider as state of the art the top-performing submissions to the ASC task of the DCASE2017 Challenge (see the results page referenced in Section 3.1). Among them, there are a few systems that outperform the one proposed here. However, they have the burden of being more complex or computationally intensive, including Generative Adversarial Networks, ensembles of 4 or more systems (with several CNNs), data augmentation, or deeper networks. Compared to them, we consider that our system is simpler in overall terms. The proposed CNN is more interpretable, as domain knowledge was used in its design. The GBM can be trained on a standard desktop computer without the need for additional infrastructure, e.g., a GPU.

Figure 7 shows the confusion matrix for the proposed combined system, where it can be seen which acoustic scenes are misclassified the most. The worst case occurs when the system predicts 'residential area' while the true label is 'beach' or 'park'.

Figure 7. Confusion matrix for the proposed combined system evaluated on the evaluation set.

5. CONCLUSION

We have proposed the fusion of two systems of radically different kinds for ASC: a CNN designed with domain knowledge that learns from log mel-spectrograms, and a GBM that leverages audio features from the out-of-the-box FreesoundExtractor. Evaluated on the TUT Acoustic Scenes 2017 dataset, the CNN performs substantially better than the GBM, which is not able to generalize well on the evaluation set. Despite their difference in performance, the models provide somewhat complementary predictions, and their fusion leads to a slight improvement. The proposed system attains a classification accuracy of 72.8% on the evaluation set, which means an 11.8% improvement over the MLP baseline. Our experiments empirically show that adding pre-activation and waveform normalization helps the proposed CNN to reduce overfitting. Future work includes evaluating the properties of pre-activation on different datasets and networks, and exploring additional measures against overfitting.

Acknowledgments

This work is partially supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 688382 "AudioCommons", by the European Research Council under the European Union's Seventh Framework Program, as part of the CompMusic project (ERC grant agreement 267583), and by a Google Faculty Research Award 2017. We are grateful for the GPUs donated by NVidia.

6. REFERENCES

[1] T. Virtanen, M. D. Plumbley, and D. Ellis, Computational Analysis of Sound Scenes and Events. Springer, 2018.

[2] D. Barchiesi, D. Giannoulis, D. Stowell, and M. D. Plumbley, "Acoustic scene classification: Classifying environments from the sounds they produce," IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 16-34, 2015.

[3] C. Landone, J. Harrop, and J. Reiss, "Enabling access to sound archives through integration, enrichment and retrieval: The EASAIER project," in ISMIR, 2007, pp. 159-160.

[4] Y. Xu, W. J. Li, and K. K. Lee, Intelligent Wearable Interfaces. John Wiley & Sons, 2008.

[5] B. Schilit, N. Adams, and R. Want, "Context-aware computing applications," in Mobile Computing Systems and Applications. IEEE, 1994, pp. 85-90.
[6] J.-J. Aucouturier, B. Defreville, and F. Pachet, "The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music," The Journal of the Acoustical Society of America, vol. 122, no. 2, pp. 881-891, 2007.

[7] G. Roma, W. Nogueira, and P. Herrera, "Recurrence quantification analysis features for auditory scene classification," IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2013.

[8] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems, 2009, pp. 1096-1104.

[9] S. Dieleman and B. Schrauwen, "End-to-end learning for music audio," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 6964-6968.

[10] J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279-283, 2017.

[11] Y. Han, J. Park, and K. Lee, "Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification," in Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.

[12] Z. Weiping, Y. Jiantao, X. Xiaotao, L. Xiangtao, and P. Shaohu, "Acoustic scene classification using deep convolutional neural network and multiple spectrograms fusion," in Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.

[13] E. Fonseca, R. Gong, D. Bogdanov, O. Slizovskaia, E. Gómez Gutiérrez, and X. Serra, "Acoustic scene classification by ensembling gradient boosting machine and convolutional neural networks," in Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.

[14] D. Wang and G. Brown, Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley, 2006.

[15] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. R. Zapata, X. Serra et al., "Essentia: An audio analysis library for music information retrieval," in ISMIR, 2013, pp. 493-498.

[16] D. P. W. Ellis, "Gammatone-like spectrograms," http://www.ee.columbia.edu/~dpwe/resources/matlab/gammatonegram/, 2009.

[17] M. Slaney, "Auditory toolbox," Interval Research Corporation, Tech. Rep., vol. 10, p. 1998, 1998.

[18] M. Valenti, A. Diment, G. Parascandolo, S. Squartini, and T. Virtanen, "DCASE 2016 acoustic scene classification using convolutional neural networks," in Proc. Workshop Detection Classif. Acoust. Scenes Events, 2016, pp. 95-99.

[19] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, "CP-JKU submissions for DCASE-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks," IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.

[20] H. Phan, L. Hertel, M. Maass, and A. Mertins, "Robust audio event recognition with 1-max pooling convolutional neural networks," arXiv preprint arXiv:1604.06338, 2016.

[21] J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra, "Timbre analysis of music audio signals with convolutional neural networks," arXiv preprint arXiv:1703.06697, 2017.
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448-456.

[23] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.

[24] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630-645.

[25] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, pp. 1189-1232, 2001.

[26] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," CoRR, vol. abs/1603.02754, 2016.

[27] "LightGBM and XGBoost comparison experiment," https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst, accessed: 2018-04-06.

[28] J. Salamon, J. P. Bello, A. Farnsworth, and S. Kelling, "Fusing shallow and deep learning for bioacoustic bird species classification," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 141-145.