A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification
Authors: Eduardo Fonseca, Rong Gong, Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona
name.surname@upf.edu

ABSTRACT

In the past, Acoustic Scene Classification systems have been based on hand-crafting audio features that are input to a classifier. Nowadays, the common trend is to adopt data-driven techniques, e.g., deep learning, where audio representations are learned from data. In this paper, we propose a system that consists of a simple fusion of two methods of the aforementioned types: a deep learning approach where log-scaled mel-spectrograms are input to a convolutional neural network, and a feature engineering approach, where a collection of hand-crafted features is input to a gradient boosting machine. We first show that both methods provide complementary information to some extent. Then, we use a simple late fusion strategy to combine both methods. We report the classification accuracy of each method individually and of the combined system on the TUT Acoustic Scenes 2017 dataset. The proposed fused system outperforms each of the individual methods and attains a classification accuracy of 72.8% on the evaluation set, improving the baseline system by 11.8%.

1. INTRODUCTION

Environmental sounds provide contextual information about where we are and the physical events occurring nearby. Humans have the ability to identify the environments or contexts where they are (e.g., park, beach or bus) leveraging only acoustic information. However, this task is not trivial for systems that attempt to automate it. One of the goals of machine listening is to have systems that perform similarly to humans in tasks like this, a topic that is receiving growing attention from the research community [1]. The consecutive editions of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenges have stimulated research in this field by benchmarking a variety of approaches to acoustic scene classification and acoustic event detection on common, publicly available datasets. Acoustic Scene Classification (ASC) can be defined as the task of associating a label to an audio stream, thereby identifying the context or environment where the audio stream was recorded, e.g., park or beach [2]. The acoustic scene consists of all the acoustic material that can be present in a given context, including both background noises and specific acoustic events that may occur either in the foreground or merged as part of the background.

The benefits of machine listening systems performing similarly to humans in recognizing acoustic scenes are manifold. Existing applications range from audio collection management [3] and intelligent wearable interfaces [4] to the development of context-aware applications [5]. Some concrete examples include automatic description of multimedia content, or optimization of hearing aid parameter settings based on the recognized scene.

Copyright: (c) 2018 Eduardo Fonseca, Rong Gong, and Xavier Serra. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
In the past, ASC systems have been based on a feature engineering approach, where pre-designed low-level features are extracted from the audio signal and input to a classifier. The most popular hand-crafted features in audio-related tasks are cepstral features, e.g., MFCCs. Initially taken from the speech recognition field, they are among the most widespread in ASC too [6, 7]. Typical examples of classifiers used in ASC include GMMs [6] and SVMs [7]. The feature engineering approach relies heavily on the capacity of the pre-designed features to capture relevant information from the signal, which in turn may require significant expertise and effort. In fact, this approach turned out to be neither efficient nor sustainable in many disciplines, given the high diversity of problems and particular cases encountered in the real world.

In recent years, we have witnessed a paradigm shift in ASC similar to those experienced in areas like computer vision or speech recognition. New techniques have arisen based on learning representations from data. These data-driven approaches, especially deep learning, have allowed significant research breakthroughs and have rapidly spread across the audio research community. In this case, the system learns internal representations from a simpler one at the input (typically, a time-frequency representation), and the two stages of the feature engineering approach (feature extraction and classification) are optimized jointly. Among the various deep learning approaches available, Convolutional Neural Networks (CNNs) have proved to be effective for several audio-related tasks, e.g., speech recognition [8], automatic music tagging [9] or environmental sound classification [10]. Specifically for ASC, CNNs have also been used successfully, e.g., [11, 12].

In this paper, we propose an acoustic scene recognizer that employs the fusion of the two presented trends. First, a simple 2-layer CNN designed using domain knowledge learns features from mel-spectrograms. Second, a pool of low-level audio features is extracted and input to a Gradient Boosting Machine (GBM). By combining both approaches with a simple fusion method, we obtain a system that takes advantage of the complementary information that they provide. The proposed system is an extension of our previous work [13], including improvements in both individual approaches and in the late fusion method, as well as further discussion. In particular, the main improvements are due to the usage of pre-activation in the CNN, LDA feature reduction in the GBM pipeline, and learning-based late fusion.

The remainder of this paper is organized as follows. Section 2 describes the CNN and GBM methods that compose the system. In Section 3 we present the dataset used and the evaluation setup. Results and discussion for each method individually and for the combined system can be found in Section 4. Section 5 summarizes and concludes this work.

2. METHOD

2.1 Convolutional Neural Network

When CNNs are presented with an audio time-frequency representation, they are able, in theory, to capture spectro-temporal patterns that can be useful for recognition tasks. Furthermore, the dimensions (width and height) of the convolutional filters can be related to the time and frequency axes, respectively. In this work, we explore how this relation can be exploited when designing the convolutional filters for ASC.
2.1.1 Audio Pre-processing

We consider two input representations for the CNN: mel-spectrograms and gammatone-based spectrograms. Both start with the computation of the power of the short-time Fourier transform (STFT) (using Hamming windows of 40 ms with 50% overlap) after down-mixing the 2-channel binaural files to mono. In short, the mel-spectrogram aggregates the power values using triangular filters (in the frequency domain) distributed according to the mel scale. In contrast, the gammatone-based spectrogram aggregates the power values using gammatone filters with center frequencies distributed according to the ERB-rate scale [14]. For the former we used the Librosa library (v0.5.1), while for the latter we used the Essentia implementation [15], which is in turn adapted from [16]. After preliminary experiments we chose mel-spectrograms as the input representation, whose computation is detailed next.

A mel filter bank consisting of 128 bands from 0 to 22050 Hz according to Slaney's formula [17] is applied to the power of the STFT. Our mel filter bank presents triangular filters with a peak value of one, as opposed to other filter banks where the filters have equal area. Finally, the mel energies are logarithmically scaled. We standardize the log-scaled mel-spectrograms by subtracting the mean and dividing by the standard deviation. We do this on whichever subset of data we use for training. Then, we keep the normalization values and subsequently apply them to standardize the corresponding test set (see Section 3.2).

Since the recordings of the dataset used are 10 s long, the dimensionality of the corresponding spectrograms is considered too high for the proposed architecture. Therefore, they are split into non-overlapping time-frequency patches (T-F patches) or segments of 1.5 s (i.e., 75 frames). We hence obtain 7 segments per recording, the last one being padded with the last original frame. Thus, the CNN learns from T-F patches in R^{75×128}.
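To make this front end concrete, the following is a minimal sketch using Librosa and NumPy under the parameters above (40 ms Hamming windows, 50% overlap, 128 mel bands with peak value one, log scaling, 75-frame patches). The epsilon inside the logarithm, the 'edge' padding of the last patch, and the use of a single global mean and standard deviation for standardization are our assumptions; the paper does not report these implementation details.

```python
# Sketch of the mel-spectrogram front end of Sec. 2.1.1 (not the authors' exact code).
import numpy as np
import librosa

SR = 44100
N_FFT = int(0.040 * SR)        # 40 ms Hamming window -> 1764 samples
HOP = N_FFT // 2               # 50% overlap
N_MELS = 128
PATCH_FRAMES = 75              # 1.5 s patches
N_PATCHES = 7

def log_mel_spectrogram(path):
    y, _ = librosa.load(path, sr=SR, mono=True)               # down-mix to mono
    S = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP,
                            window='hamming')) ** 2           # power STFT
    # Mel filter bank, Slaney formula (htk=False), peak value of one (norm=None)
    mel_fb = librosa.filters.mel(sr=SR, n_fft=N_FFT, n_mels=N_MELS,
                                 fmin=0, fmax=22050, htk=False, norm=None)
    return np.log(mel_fb.dot(S) + 1e-10)                      # log scaling (epsilon assumed)

def to_patches(log_mel):
    """Split a (128, n_frames) spectrogram into 7 non-overlapping 75-frame patches,
    padding the last patch by repeating the final frame."""
    needed = N_PATCHES * PATCH_FRAMES
    pad = needed - log_mel.shape[1]
    if pad > 0:
        log_mel = np.pad(log_mel, ((0, 0), (0, pad)), mode='edge')
    patches = [log_mel[:, i * PATCH_FRAMES:(i + 1) * PATCH_FRAMES].T   # (75, 128)
               for i in range(N_PATCHES)]
    return np.stack(patches)

def standardize(train_patches, test_patches):
    # Statistics computed on the training subset only and re-used on the test data (Sec. 3.2)
    mean, std = train_patches.mean(), train_patches.std()
    return (train_patches - mean) / std, (test_patches - mean) / std
```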
2.1.2 CNN Architecture

The proposed CNN architecture is depicted in Table 1 and illustrated in Fig. 1.

Table 1. Proposed CNN architecture.
  Input:        1 x (75, 128)
  Conv1:        48x(3,8) | 32x(3,32) | 16x(3,64) | 16x(3,90), + BN + ReLU
  Max-pooling:  (5, 5)
  Conv2:        224x(5,5), + BN + ReLU
  Max-pooling:  (11, 4)
  Dense:        15 units + softmax

The architecture is composed of two convolutional layers (Conv1 and Conv2) alternated with max-pooling operations, and it ends with a softmax layer. It can be regarded as a relatively simple network comprising standard operations. Also, the network can be regarded as wide, in contrast to the trend of building deeper networks with tens of layers (or more in other disciplines like image recognition).

One of the most distinctive aspects of this network is the convolutional filters in the first layer. We hypothesize that the spectro-temporal patterns that allow recognizing many of the scenes considered are more discriminative along the frequency dimension (rather than along the time dimension). We take this into account during the filters' design. That is, our approach attempts to prioritize the modeling of spectral envelope shapes and background noises, rather than onsets/offsets or attack-decay patterns of specific acoustic events. While most CNNs in the literature leverage squared filters and only one filter shape in the first convolutional layer [10, 18, 19], some recent works suggest employing rectangular filters and different shapes at the same time [20, 21]. In particular, we explore several configurations of filters with multiple vertical shapes in the first layer. We call vertical filters those whose frequency dimension is much larger than their time dimension. By using these filters, we intend to guide the learning process towards what we intuitively assume to be more important for ASC.

The first convolutional layer is implemented as the concatenation of several convolutional layers such that every layer has filters of one single and distinct shape. Using filters of different dimensions leads to feature maps of different dimensions as well. In order to come back to same-sized feature maps, two options exist: i) zero-pad the network's input appropriately, and ii) use filter-dependent max-pooling operations. Preliminary experiments were run with both options and no major difference in performance was observed. Hence the simpler zero-padding option was adopted. The filter shapes employed are listed in Table 1 as number of filters x (time, frequency). The first convolutional layer presents 112 filters. This number is doubled for the second layer. The proposed final network presents four different filter shapes in Conv1, as illustrated over the T-F patch of Fig. 1. All the filters in Conv1 have a time dimension of 3. In contrast, filters in Conv2 are squared 5x5.

Figure 1. Sketch of the proposed CNN architecture. Four vertical filter shapes co-exist in the first convolutional layer.

We apply batch normalization (BN) [22] and a Rectified Linear Unit (ReLU) [23] after every convolutional layer, followed by max-pooling operations. The latter downsample the feature maps while adding some invariance along the time-frequency dimensions. In particular, max-pooling is applied over squares of dimension 5 after Conv1. After Conv2, global time-domain pooling is applied in order to select only the most prominent feature [18]. Finally, after flattening the resulting feature maps, the predicted class (for the input T-F patch) is obtained by a dense layer with softmax activation and 15 output units (corresponding to the 15 acoustic scenes).

We also experiment with the concept of pre-activation [24]. This technique was initially devised for image recognition in the context of deep residual networks. In [24] a residual unit is proposed containing two paths: i) a clean information path for the information to propagate, and ii) another path with an additive residual function. In the latter path, BN and ReLU are applied as pre-activation of the convolutional layers (in addition to the common post-activation consisting of the same couple of BN and ReLU after the convolution operation). Reported advantages in the particular case of deep residual networks, with 100+ layers, include ease of optimization and improved regularization. Moreover, pre-activation has recently proved successful for ASC in [11], still with a deeper network than the one proposed here. We want to explore this technique in a fairly shallow network. Based on the results obtained in Section 4.1.2, we add BN and a ReLU non-linearity directly at the network's input of Fig. 1 (before the first convolutional layer) to form the final proposed CNN.
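The architecture of Table 1 can be sketched in Keras roughly as follows. The per-branch zero-padding ('same' convolutions), the placement of BN and ReLU after the concatenation, and the optional input-level pre-activation are our reading of the paper rather than the authors' exact code; the L2 regularization of 10^-5 anticipates the training details of Section 2.1.3.

```python
# Minimal Keras sketch of the CNN in Table 1 / Fig. 1 (a sketch, not the original code).
from tensorflow.keras import layers, models, regularizers

N_CLASSES = 15
CONV1_SHAPES = [(48, (3, 8)), (32, (3, 32)), (16, (3, 64)), (16, (3, 90))]  # CNN_4

def build_cnn(pre_activation=True):
    inp = layers.Input(shape=(75, 128, 1))                 # one 1.5 s T-F patch
    x = inp
    if pre_activation:                                      # BN + ReLU before Conv1
        x = layers.Activation('relu')(layers.BatchNormalization()(x))
    # Conv1: concatenation of four "vertical" filter shapes, noted (time, frequency)
    branches = []
    for n_filters, shape in CONV1_SHAPES:
        branches.append(layers.Conv2D(n_filters, shape, padding='same',
                                      kernel_regularizer=regularizers.l2(1e-5))(x))
    x = layers.Concatenate(axis=-1)(branches)               # 112 feature maps
    x = layers.Activation('relu')(layers.BatchNormalization()(x))
    x = layers.MaxPooling2D((5, 5))(x)
    # Conv2: 224 squared 5x5 filters
    x = layers.Conv2D(224, (5, 5), padding='same',
                      kernel_regularizer=regularizers.l2(1e-5))(x)
    x = layers.Activation('relu')(layers.BatchNormalization()(x))
    x = layers.MaxPooling2D((11, 4))(x)                      # global pooling along time
    x = layers.Flatten()(x)
    out = layers.Dense(N_CLASSES, activation='softmax')(x)
    return models.Model(inp, out)
```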
2.1.3 Training Strategy and Hyperparameters

Network weights are initialized with a uniform distribution. The loss function is categorical cross-entropy and the optimizer is Adam. The initial learning rate is 0.002, and it is reduced by a factor of 2 whenever the validation loss does not decrease for 5 epochs. We also experimented with i) halving the learning rate every fixed number of epochs and ii) using Adam with no learning rate scheduling. However, the best results were obtained by reducing the learning rate when the validation loss plateaus. Training is early-stopped if the validation loss does not improve for 15 epochs, up to a maximum of 200. For early stopping, a 15% validation set is randomly split from the training data of every class. The batch size is 64, and training samples are shuffled between epochs. In both convolutional layers, L2 regularization is applied with a parameter of 10^-5. The system is implemented using Keras (v2.1.3) and TensorFlow (v1.4.1).

2.2 Gradient Boosting Machine

A Gradient Boosting Machine [25] is a technique for constructing predictive models based on an ensemble of many weak learners, typically regression trees. The trees are added iteratively, in such a way that each new tree focuses on the misclassifications of the previous ensemble of trees. Predictions of multiple trees are combined in order to optimize an objective function, and the parameters of added trees are tuned by gradient descent. Two GBM frameworks are widely used: XGBoost [26] and LightGBM (https://github.com/Microsoft/LightGBM). Experiments on five public datasets show that LightGBM outperforms XGBoost in both efficiency and accuracy, with significantly lower memory consumption [27]. In our experiments, we also found that LightGBM trains faster and achieves a slightly better overall classification accuracy. Hence we use LightGBM in this work.

2.2.1 Feature Extraction and Pre-processing

We segment each 10 s recording into 7 non-overlapping segments. The first 6 segments last 1.5 s, and the last one 1 s. We then extract features on each segment using the FreesoundExtractor (http://essentia.upf.edu/documentation/freesound_extractor.html), an out-of-the-box feature extractor from the Essentia open-source library for audio analysis [15]. This extractor computes hundreds of features for sound and music analysis and is originally used by Freesound (https://freesound.org/) in order to provide its sound analysis API and search functionalities. The most musically-related features (e.g., key, chords, etc.) are discarded. The selected pool of features is listed in Table 2, along with their dimensionality. The features are calculated at frame level using the same frame and hop size mentioned in Section 2.1.1. All other parameters of the FreesoundExtractor are set to default values. We perform four statistical aggregations (mean, variance, and mean and variance of the derivative) on the frame-level feature vectors of each segment. Therefore, a feature vector in R^{820} (i.e., 205 x 4) is output for each segment. As in Section 2.1.1, we fit a mean and variance standardization scaler on whichever subset of data we use for training, and use it to standardize both train and test data.

Table 2. Selected features extracted by FreesoundExtractor and number of dimensions.
  Bark bands energy: 32      Tonal features: 3
  ERB bands energy: 23       Pitch features: 3
  Mel bands energy: 45       Silence rate: 3
  MFCC: 13                   Spectral features: 32
  HPCP: 38                   GFCC: 13
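As an illustration of the aggregation step, the sketch below builds the 820-dimensional segment descriptor from a frame-level feature matrix; the (n_frames, 205) layout and the use of a simple first-order frame difference as the derivative are assumptions on our side.

```python
# Sketch of the per-segment statistical aggregation of Sec. 2.2.1.
# `frame_features` is assumed to be a (n_frames, 205) array holding the concatenated
# frame-level features of Table 2 for one 1.5 s (or 1 s) segment.
import numpy as np

def aggregate_segment(frame_features):
    """Return the 820-dim segment descriptor: mean and variance of each feature,
    plus mean and variance of its first-order (frame-to-frame) derivative."""
    deriv = np.diff(frame_features, axis=0)
    return np.concatenate([frame_features.mean(axis=0),
                           frame_features.var(axis=0),
                           deriv.mean(axis=0),
                           deriv.var(axis=0)])        # shape: (4 * 205,) = (820,)

# As with the spectrograms, the standardization scaler is fit on training data only, e.g.:
#   from sklearn.preprocessing import StandardScaler
#   scaler = StandardScaler().fit(X_train); X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```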
2.2.2 Linear Discriminant Analysis Feature Reduction

Linear Discriminant Analysis (LDA) can be used as a dimensionality reduction technique after the feature extraction stage. The ultimate goal is to mitigate overfitting by projecting a high-dimensional dataset onto a lower-dimensional space. This is done by maximizing the variance of the data as well as the separability of the classes. Some of the features of Table 2 are computed in a similar way, e.g., several energy bands are computed with different psychoacoustic scales (e.g., Bark or Mel). While they may provide some complementary information, it is likely that they also have a considerable amount of redundancy. This, together with the high dimensionality of the feature vector, may cause overfitting and slow down model training. In order to mitigate this, while keeping the rich information of the extracted features, we perform LDA-based feature reduction. It is applied on any subset of data used for training, and the corresponding test set is then transformed accordingly (see Section 3.2).

2.2.3 Hyperparameter Tuning

Since ASC is a multi-class classification problem, we use logarithmic loss as the objective function. We do a grid search over 5 hyperparameters. Four of them relate to the GBM (learning rate, max bins, number of leaves, and min data in leaf), while the reduced feature dimension relates to the LDA. The number of leaves is the main parameter to control model complexity, whereas max bins and min data in leaf are two important parameters to deal with overfitting. All other hyperparameters are set to default values. We do the grid search in two cases, with and without LDA, and the hyperparameter values considered are listed in Table 3. The grid search is performed using cross-validation on the development set. The hyperparameter setting leading to the best cross-validation accuracy is kept for the final GBM model, which is used to predict acoustic scenes on the evaluation set.

Table 3. Hyperparameter grid search for GBM and LDA.
  Learning rate:              [0.01, 0.05, 0.1]
  Max bins:                   [128, 256, 512]
  Number of leaves:           [64, 128, 256]
  Min data in leaf:           [500, 1000, 2000]
  Reduced feature dimension:  [64, 128, 256, 512]

2.3 Late Fusion

In order to combine the predictions from both methods, we tried approaches with and without learning, all of them starting from the individual models' class probabilities computed on the development set using the proposed four-fold cross-validation setup. The simplest approach (i.e., without learning) consists of combining the prediction probabilities by taking their geometric mean, arithmetic mean, or rank averaging. Then, the final predicted label is selected by taking the argmax over the resulting values. The learning-based approach consists of two steps. First, using the models' prediction probabilities computed on the development set as training data, we fit a classifier or meta-learner. We experimented with logistic regression and SVMs with several kernels. The models' hyperparameters were determined by grid search on the training data using four-fold cross-validation, trying to restrict the parameter search to ranges providing strong regularization. Then, once the meta-learner is fit, we predict labels on the evaluation set by taking as input the pre-computed prediction probabilities from the CNN and GBM on this set. This approach is sometimes referred to as stacking.
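A minimal sketch of the learning-based fusion (stacking) with scikit-learn is given below: the 15-class probability vectors from the CNN and the GBM are concatenated into 30-dimensional meta-features and a regularized logistic regression is fit on the development-set predictions. The variable names and the grid of C values are hypothetical; the paper does not report the exact search ranges.

```python
# Sketch of the learning-based late fusion (stacking) of Sec. 2.3.
# p_cnn_dev / p_gbm_dev: (n_recordings, 15) class probabilities predicted on the
# development set with the four-fold setup; y_dev: ground-truth scene labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_meta_learner(p_cnn_dev, p_gbm_dev, y_dev):
    X_dev = np.hstack([p_cnn_dev, p_gbm_dev])          # 30-dimensional meta-features
    # Small C values favour strong regularization (assumed search range)
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        {'C': [0.01, 0.1, 1.0]}, cv=4)
    return grid.fit(X_dev, y_dev).best_estimator_

def fuse_predictions(meta, p_cnn_eval, p_gbm_eval):
    return meta.predict(np.hstack([p_cnn_eval, p_gbm_eval]))

# The no-learning alternative would simply be, e.g.,
#   np.argmax(np.sqrt(p_cnn_eval * p_gbm_eval), axis=1)   # geometric mean fusion
```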
3. EVALUATION

3.1 Dataset and Baseline

Systems are evaluated with TUT Acoustic Scenes 2017, a dataset that contains recordings made in 15 acoustic scenes. The dataset is split into a development and an evaluation set, of 4680 and 1620 audio recordings respectively (a list of the scenes together with more details about the dataset can be found at http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-acoustic-scene-classification). The development set contains 312 recordings per class. All recordings last 10 s and have a sampling rate of 44.1 kHz. A four-fold cross-validation setup is provided for the development set. The dataset presents a mismatch between the development and evaluation sets due to differences in the recording conditions. The average accuracy drop between both sets across all systems submitted to the ASC task of DCASE2017 is 20.1% (see http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-acoustic-scene-classification-results). A multilayer perceptron (MLP) is provided as the baseline system. Its input representation is 40 log mel-band energies over 5 consecutive frames, and the MLP has 2 layers with 50 hidden units each.

3.2 Evaluation Setup

The output of the CNN and GBM models for every input 1.5 s segment is a vector in R^{15} with the probabilities of the segment belonging to every class. The class prediction for each 10 s recording is computed by averaging per-class scores across segments and finding the class with the maximum average score. The development set is used for training/testing the CNN and GBM approaches according to the provided four-fold cross-validation setup (see Fig. 2).

Figure 2. Flowchart illustrating the workflow in development mode.

For predicting acoustic scenes on the evaluation set, the models are trained on the full development set and evaluated on the evaluation set (see Fig. 3). The metric used is classification accuracy, i.e., the number of correctly classified recordings divided by the total number of recordings.

Figure 3. Flowchart illustrating the workflow in evaluation mode. Models are trained on the full development set and predictions are computed on the evaluation set.
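The recording-level decision rule of Section 3.2 amounts to averaging the per-segment probabilities and taking the argmax, as in the short sketch below; the (n_recordings, 7, 15) array layout is assumed.

```python
# Recording-level prediction from per-segment probabilities (Sec. 3.2).
# `segment_probs` is assumed to have shape (n_recordings, 7, 15): one 15-class
# probability vector per segment of each 10 s recording.
import numpy as np

def predict_recordings(segment_probs):
    per_class_scores = segment_probs.mean(axis=1)   # average across the 7 segments
    return per_class_scores.argmax(axis=1)          # predicted class per recording

def accuracy(y_pred, y_true):
    """Classification accuracy: correctly classified recordings / total recordings."""
    return float(np.mean(y_pred == y_true))
```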
4. RESULTS AND DISCUSSION

4.1 Convolutional Neural Network

Two types of experiments were carried out with the CNN: i) experimenting with filter configurations in the first layer, and ii) exploring the concept of pre-activation. Since results obtained with a GPU are generally non-deterministic, the accuracies reported in this section are the result of averaging ten independent trials of every experiment. Confidence intervals are also shown.

4.1.1 Filter Configurations

We design filter configurations with several filter shapes in the first layer. The number of shapes is specified in Table 4 and Fig. 4 as CNN_x, where x denotes the number of different shapes (CNN_sq refers to the case where filters are squared, which is a specific case of CNN_1). Every shape (denoted by (time, frequency)) can be repeated a different number of times, as illustrated in Table 4, but in all cases the total number of filters is 112.

Table 4. Filter configurations in the first layer for the CNN of Fig. 1, as number of filters x (time, frequency).
  MLP:     -
  CNN_sq:  112x(5,5)
  CNN_1:   112x(3,40)
  CNN_2:   64x(3,20) | 48x(3,70)
  CNN_3:   48x(3,10) | 32x(3,30) | 32x(3,60)
  CNN_4:   48x(3,8)  | 32x(3,32) | 16x(3,64) | 16x(3,90)
  CNN_5:   36x(3,6)  | 22x(3,26) | 22x(3,48) | 16x(3,70) | 16x(3,96)

The motivation for designing filters with different vertical dimensions is, intuitively, to be able to cover diverse spectral patterns, ranging from narrow-band patterns to others that may spread over frequency. In order to establish a fair comparison among networks, the number of parameters was kept approximately constant by adjusting the number of filters per shape and the filter dimensions. The number of parameters in all cases lies in the range 656k-660k, with the exception of the squared-filters case, which has 648k (due to the smaller size of the squared filters). In particular, the top-performing case of CNN_4 has 657k parameters. The specific filter shapes in Table 4 were chosen through a number of preliminary experiments. While an exhaustive search may be desirable, it would require prohibitively long computation times.

Fig. 4 shows the classification accuracy values for the architecture of Fig. 1 and the filter configurations of Table 4. The accuracy of the MLP baseline is specified as well.

Figure 4. ASC performance using the CNN of Fig. 1 with the filter configurations in the first layer given by Table 4. No pre-activation is adopted in these experiments. Note that the y-axis differs for the development and evaluation sets.

It can be observed that the accuracy on the evaluation set increases overall with the diversity of the filter shapes, up to a point where this diversity no longer helps (CNN_5). We also carried out some preliminary experiments with horizontal filters, but results were slightly worse than those obtained with vertical ones.

4.1.2 Pre-activation

Fig. 5 shows the results obtained by adding pre-activation [24] to the top-performing case of Fig. 4, i.e., to CNN_4.

Figure 5. ASC performance when adopting pre-activation in the CNN of Fig. 1, i.e., adding BN and ReLU before the first convolutional layer. Note that the y-axis differs for the development and evaluation sets.

It can be seen that adding pre-activation improves the results slightly on the evaluation set (see the preact bar). However, the gap between development and evaluation accuracies is still substantial. Curiously, we found that this gap is reduced when we complement pre-activation with normalization of the input audio waveform (see the norm&preact bar). This is somewhat surprising, as the T-F patches input to the CNN were already standardized (see Section 2.1.1). Finally, we report the accuracy obtained by applying only time-domain normalization of the audio (without pre-activation), to confirm that it is the combination of both which yields the improvement (see the norm bar). We also experimented with pre-activation not only prior to the first convolutional layer, but also between every max-pooling operation and the next layer, following previous work [11]. The resulting accuracies were not higher.

It hence appears that the combination of pre-activation and normalization of the input waveform helps to improve the model's generalization, showing slightly lower development accuracy while increasing evaluation accuracy. Nevertheless, further experiments are needed to better assess and understand the benefits of pre-activation and its dependency on audio signal energy or dynamic range. For example, one aspect of the audio signal in acoustic scenes or field recordings is its small dynamic range. This often happens because sources can be far away from the microphone, since the goal is to capture the entirety of the acoustic context rather than specific acoustic events. Evaluating this approach on different datasets may be revealing.
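The paper does not specify which time-domain normalization is applied to the waveform; a common choice for field recordings is peak normalization, shown below purely as an assumption of what such a step could look like before computing the spectrograms.

```python
# Hypothetical time-domain normalization (the paper only states that the input audio
# waveform is normalized; peak normalization is our assumption, not the authors' method).
import numpy as np

def peak_normalize(y, eps=1e-9):
    """Scale a mono waveform so that its maximum absolute amplitude is 1."""
    return y / (np.max(np.abs(y)) + eps)
```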
4.2 Gradient Boosting Machine

The best hyperparameters found for the LDA and non-LDA cases are listed in Table 5. The dimensionality of the feature vector after LDA-based feature reduction is 64. This is 7.8% of the initial dimensionality (820), which indicates considerable information redundancy in the initial pool of features gathered from the FreesoundExtractor. After the feature dimension reduction, we observe a significant boost in training speed.

Table 5. Best hyperparameters in the LDA and non-LDA cases, found by grid search on the development set.
  Hyperparameter              non-LDA   LDA
  Learning rate               0.05      0.05
  Max bins                    128       128
  Number of leaves            128       128
  Min data in leaf            1000      500
  Reduced feature dimension   -         64

Table 6 shows the accuracy results. The performance using LDA feature reduction is greater than that without LDA and than the MLP baseline, resulting in small improvements of 1.7% and 2.6%, respectively, on the evaluation set. However, we still witness a significant accuracy drop between both sets in both cases. It is worth mentioning that, to tackle the overfitting problem, we also experimented with two other techniques, namely PCA and feature selection using feature importance. However, no significant improvements were observed. For the late fusion we use the GBM with LDA.

Table 6. ASC performance of the GBM model with and without LDA feature reduction.
  Approach       dev acc (%)   eval acc (%)
  Baseline       74.8          61.0
  GBM non-LDA    81.4          61.9
  GBM LDA        81.1          63.6

4.3 Models' Comparison

The CNN method clearly outperforms the GBM method. However, we wanted to assess the potential complementarity of these models, i.e., whether their output predictions are complementary or redundant. We follow the approach of [28], which consists of plotting the difference of the confusion matrices yielded by both systems, shown in Fig. 6. Looking at the main diagonal, positive red numbers illustrate scenes where the CNN performs better, whereas negative blue numbers represent scenes where the GBM achieves more correct predictions. The CNN yields better results in most of the acoustic scenes. However, despite the lower performance of the GBM, it interestingly yields better predictions for the 'park', 'beach' and 'cafe/restaurant' scenes. Off the diagonal, positive red numbers indicate that the CNN presents higher confusion between pairs of acoustic scenes, and, similarly, negative blue numbers indicate that the GBM suffers from higher confusion between pairs of acoustic scenes. Overall, it can be seen that the models get confused between different pairs of scenes. In summary, the methods present different behaviour to some extent, and hence their predictions may be complementary.

Figure 6. Difference between the confusion matrices produced by i) the CNN and ii) the GBM models (in this order), evaluated on the evaluation set.
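Such a confusion-matrix difference (following the approach of [28]) can be computed and plotted as sketched below; whether Fig. 6 shows raw counts or percentages is not stated in the paper, so raw counts and the blue-white-red colormap are our assumptions.

```python
# Sketch of the confusion-matrix difference plot of Fig. 6 (approach of [28]).
# y_true, cnn_pred, gbm_pred: recording-level labels/predictions on the evaluation set.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_difference(y_true, cnn_pred, gbm_pred, class_names):
    labels = np.arange(len(class_names))
    diff = (confusion_matrix(y_true, cnn_pred, labels=labels)
            - confusion_matrix(y_true, gbm_pred, labels=labels))
    fig, ax = plt.subplots(figsize=(8, 8))
    lim = np.abs(diff).max()
    im = ax.imshow(diff, cmap='bwr', vmin=-lim, vmax=lim)
    ax.set_xticks(labels); ax.set_xticklabels(class_names, rotation=90)
    ax.set_yticks(labels); ax.set_yticklabels(class_names)
    ax.set_xlabel('Predicted scene'); ax.set_ylabel('True scene')
    fig.colorbar(im)
    # On the diagonal, positive values mean the CNN classifies that scene better;
    # off the diagonal, positive values mean the CNN confuses that pair more often.
    return diff
```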
4.4 Late Fusion

After exploring the approaches described in Section 2.3, logistic regression led to the best results, which are listed in Table 7.

Table 7. ASC performance of the combined system.
  System               dev acc (%)   eval acc (%)
  MLP baseline         74.8          61.0
  Proposed CNN + GBM   83.3          72.8

The proposed combined system shows an improvement of 3.1% over the average score provided by the best CNN architecture, and an improvement of 11.8% over the MLP baseline. It also shows an improvement of 5.5% with respect to our previous work [13]. We consider as state of the art the top-performing submissions to the ASC task of the DCASE2017 Challenge (see the results page referenced in Section 3.1). Among them, there are a few systems that outperform the one proposed here. However, they have the burden of being more complex or computationally intensive, including Generative Adversarial Networks, ensembles of 4 or more systems (with several CNNs), data augmentation, or deeper networks. Compared to them, we consider that our system is simpler in overall terms. The proposed CNN is more interpretable, as domain knowledge was used in its design. The GBM can be trained on a standard desktop computer without the need for additional infrastructure, e.g., a GPU.

Figure 7 shows the confusion matrix for the proposed combined system, where it can be seen which acoustic scenes are misclassified the most. The worst case occurs when the system predicts 'residential area' while the true label is 'beach' or 'park'.

Figure 7. Confusion matrix for the proposed combined system evaluated on the evaluation set.

5. CONCLUSION

We have proposed the fusion of two systems of radically different kinds for ASC: a CNN designed with domain knowledge that learns from log mel-spectrograms, and a GBM that leverages audio features from the out-of-the-box FreesoundExtractor. Evaluated on the TUT Acoustic Scenes 2017 dataset, the CNN performs substantially better than the GBM, which is not able to generalize well on the evaluation set. Despite their difference in performance, the models provide somewhat complementary predictions, and their fusion leads to a slight improvement. The proposed system attains a classification accuracy of 72.8% on the evaluation set, which means an 11.8% improvement over the MLP baseline. Our experiments empirically show that adding pre-activation and waveform normalization helps the proposed CNN to reduce overfitting. Future work includes evaluating the properties of pre-activation on different datasets and networks, and exploring additional measures against overfitting.

Acknowledgments

This work is partially supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 688382 "AudioCommons", by the European Research Council under the European Union's Seventh Framework Program, as part of the CompMusic project (ERC grant agreement 267583), and by a Google Faculty Research Award 2017. We are grateful for the GPUs donated by NVidia.

6. REFERENCES

[1] T. Virtanen, M. D. Plumbley, and D. Ellis, Computational Analysis of Sound Scenes and Events. Springer, 2018.

[2] D. Barchiesi, D. Giannoulis, D. Stowell, and M. D. Plumbley, "Acoustic scene classification: Classifying environments from the sounds they produce," IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 16-34, 2015.

[3] C. Landone, J. Harrop, and J. Reiss, "Enabling access to sound archives through integration, enrichment and retrieval: The EASAIER project," in ISMIR, 2007, pp. 159-160.

[4] Y. Xu, W. J. Li, and K. K. Lee, Intelligent Wearable Interfaces. John Wiley & Sons, 2008.

[5] B. Schilit, N. Adams, and R. Want, "Context-aware computing applications," in Mobile Computing Systems and Applications. IEEE, 1994, pp. 85-90.
[6] J.-J. Aucouturier, B. Defreville, and F. Pachet, "The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music," The Journal of the Acoustical Society of America, vol. 122, no. 2, pp. 881-891, 2007.

[7] G. Roma, W. Nogueira, and P. Herrera, "Recurrence quantification analysis features for auditory scene classification," IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2013.

[8] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems, 2009, pp. 1096-1104.

[9] S. Dieleman and B. Schrauwen, "End-to-end learning for music audio," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 6964-6968.

[10] J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279-283, 2017.

[11] Y. Han, J. Park, and K. Lee, "Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification," in Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.

[12] Z. Weiping, Y. Jiantao, X. Xiaotao, L. Xiangtao, and P. Shaohu, "Acoustic scene classification using deep convolutional neural network and multiple spectrograms fusion," in Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.

[13] E. Fonseca, R. Gong, D. Bogdanov, O. Slizovskaia, E. Gómez Gutiérrez, and X. Serra, "Acoustic scene classification by ensembling gradient boosting machine and convolutional neural networks," in Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.

[14] D. Wang and G. Brown, Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley, 2006.

[15] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. R. Zapata, X. Serra et al., "Essentia: An audio analysis library for music information retrieval," in ISMIR, 2013, pp. 493-498.

[16] D. P. W. Ellis, "Gammatone-like spectrograms," http://www.ee.columbia.edu/~dpwe/resources/matlab/gammatonegram/, 2009.

[17] M. Slaney, "Auditory toolbox," Interval Research Corporation, Tech. Rep., vol. 10, p. 1998, 1998.

[18] M. Valenti, A. Diment, G. Parascandolo, S. Squartini, and T. Virtanen, "DCASE 2016 acoustic scene classification using convolutional neural networks," in Proc. Workshop Detection Classif. Acoust. Scenes Events, 2016, pp. 95-99.

[19] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, "CP-JKU submissions for DCASE-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks," IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.

[20] H. Phan, L. Hertel, M. Maass, and A. Mertins, "Robust audio event recognition with 1-max pooling convolutional neural networks," arXiv preprint arXiv:1604.06338, 2016.

[21] J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra, "Timbre analysis of music audio signals with convolutional neural networks," arXiv preprint arXiv:1703.06697, 2017.
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448-456.

[23] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.

[24] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630-645.

[25] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, pp. 1189-1232, 2001.

[26] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," CoRR, vol. abs/1603.02754, 2016.

[27] "LightGBM and XGBoost comparison experiment," https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst, accessed: 2018-04-06.

[28] J. Salamon, J. P. Bello, A. Farnsworth, and S. Kelling, "Fusing shallow and deep learning for bioacoustic bird species classification," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 141-145.