DCASE 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features


Authors: Abelino Jimenez, Benjamin Elizalde, Bhiksha Raj

Detection and Classification of Acoustic Scenes and Events 2017, 16 November 2017, Munich, Germany

DCASE 2017 TASK 1: ACOUSTIC SCENE CLASSIFICATION USING SHIFT-INVARIANT KERNELS AND RANDOM FEATURES

Abelino Jiménez, Benjamín Elizalde and Bhiksha Raj
Carnegie Mellon University, Department of Electrical and Computer Engineering, Pittsburgh, PA, USA
abjimenez@cmu.edu, bmartin1@andrew.cmu.edu, bhiksha@cs.cmu.edu

ABSTRACT

Acoustic scene recordings are represented by different types of handcrafted or Neural Network-derived features. These features, typically of thousands of dimensions, are classified by state-of-the-art approaches using kernel machines such as Support Vector Machines (SVMs). However, the complexity of training these methods grows with the dimensionality of the input features and the size of the dataset. A solution is to map the input features to a randomized lower-dimensional feature space. The resulting random features can approximate non-linear kernels with faster linear kernel computation. In this work, we computed a set of 6,553 input features and used them to compute random features that approximate three types of kernels: Gaussian, Laplacian and Cauchy. We compared their performance using an SVM in the context of DCASE Task 1 - Acoustic Scene Classification. Experiments show that both input and random features outperformed the DCASE baseline by an absolute 4%. Moreover, the random features reduced the dimensionality of the input by more than three times with minimal loss of performance, and by more than six times while still outperforming the baseline. Hence, random features could be employed by state-of-the-art approaches to compute low-storage features and perform faster kernel computations.

Index Terms — Acoustic Scene Classification, Laplacian Kernel, Gaussian Kernel, Cauchy Kernel, Kernel Machines, Random Features, DCASE Challenge

1. INTRODUCTION AND RELATED WORK

DCASE Task 1 - Acoustic Scene Classification (ASC) aims to identify a recording as belonging to a predefined set of scene classes that characterize an environment, for example park, home, or office. Typically, ASC approaches capture the diverse characteristics of the audio signal by computing different types of features, either hand-crafted [1, 2, 3, 4, 5] or derived from Neural Networks [6, 7, 8]. These features are commonly of high dimensionality (up to tens of thousands), and state-of-the-art ASC approaches classify them using Support Vector Machines, the best-known member of the family of kernel methods.

Kernel methods rely on the kernel trick, which employs a non-linear kernel function to operate in a high-dimensional space by computing the inner products between all pairs of transformed input features. The inner products are computed and stored in the kernel (Gram) matrix, whose computation time and storage complexity grow with the dimensionality and number of input features. A solution is to compute random features [9], which have been well studied mainly for shift-invariant kernels because of their closed form. The process maps the input features into a lower-dimensional random space. The resulting random features then approximate non-linear kernels with linear kernel computations, hence speeding up the kernel matrix generation.

In this paper, we evaluated our random features in the context of the 2017 DCASE Task 1 - Acoustic Scene Classification [10]. First, we computed input features with over six thousand dimensions; then we computed random features to approximate three types of shift-invariant kernels: Gaussian, Laplacian and Cauchy. Both types of features, input and random, were classified using an SVM. Experiments show that the baseline is outperformed by 4% by all features.
Moreover, random features reduced the dimensionality by more than three times with minimal loss of performance, and by six times while still outperforming the baseline.

The paper is organized as follows: In Section 2 we describe in detail the kernel functions used. In Section 3 we present experiments and results for Task 1. Finally, in Section 4 we conclude by discussing the scope of the presented technique as well as future directions.

2. METHODS: SHIFT-INVARIANT KERNELS AND RANDOM FEATURES

In this section we describe the computation of random features for three types of shift-invariant kernels in the context of SVMs. Acoustic Scene Classification has been explored by state-of-the-art approaches based on kernel methods, which find non-linear decision boundaries using a kernel function. The function takes input features (extracted from the audio) in a space X and yields output scene classes in Y. In this paper, we consider X = R^N and Y = {1, 2, ..., C}. Moreover, the kernel function can be expressed as K : X × X → R, which is positive-definite and yields the value corresponding to the inner product between φ(x_1) and φ(x_2). The function φ maps R^N to some space H, which is generally of higher dimensionality and has better class separability.

However, computing the kernel function can become prohibitive if the dimensionality of the input, N, is large and if the size of the training set, n, is large. This happens because, in order to learn the decision boundary function f from the input audio and the corresponding labels in the dataset {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, we need to compute the value K(x_i, x_j) for every pair i, j ∈ {1, ..., n}.
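To make this cost concrete, the following minimal sketch (with placeholder random data, not the paper's features) computes a Gaussian-kernel Gram matrix directly, touching all n² pairs of inputs:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Naive Gram matrix: every pair (i, j) needs one kernel evaluation, so O(n^2) calls."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Gaussian kernel K(a, c) = exp(-gamma * ||a - c||_2^2)
gaussian = lambda a, c, gamma=0.1: np.exp(-gamma * np.sum((a - c) ** 2))

X = np.random.default_rng(0).standard_normal((100, 16))  # 100 inputs of dimension 16
K = gram_matrix(X, gaussian)                              # 100 x 100 = 10,000 evaluations
```

Each kernel evaluation itself costs O(N) in the input dimensionality, so the full matrix costs O(n²N) time and O(n²) storage, which is what random features are meant to avoid.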
Therefore, our solution to this problem is random features, which approximate a kernel function by finding a map Φ from R^N to a low-dimensional random space R^M, such that

    K(x_1, x_2) ≈ ⟨Φ(x_1), Φ(x_2)⟩    (1)

Although different random feature mappings have been proposed for different kernel functions [11, 12], we focused on random features for shift-invariant kernels. We say that a kernel is shift-invariant if for any x_1, x_2, z ∈ R^N

    K(x_1 + z, x_2 + z) = K(x_1, x_2)    (2)

which is equivalent to saying that, for any x_1 and x_2,

    K(x_1, x_2) = K(x_1 − x_2, 0)    (3)

Shift-invariant kernels have been proven to admit a closed form for computing random features, as stated by Bochner's theorem [9]. The function to compute random features, Φ : R^N → R^M, is given by

    Φ(x) = √(2/M) · cos(Wx + b)    (4)

where W is an M × N matrix, b is a vector with M components, and the cos function is applied element-wise. The randomness comes from the generation of the components of W and b: each b_i is drawn from a uniform distribution between 0 and 2π, and each w_ij is drawn from the distribution given by the Fourier transform of the function g(δ) = K(δ, 0). Therefore, the approximation stated in Equation (1) depends on the kernel function involved and the distribution used to generate the matrix W.

In this paper, we focus on three well-studied shift-invariant kernel functions: Gaussian, Laplacian and Cauchy. Their definitions and the corresponding distributions used to generate random features are described below.

2.1. Gaussian Kernel and Random Features

The Gaussian kernel, also known as the Radial Basis Function kernel, is perhaps the most popular after the linear kernel.
The Gaussian kernel employs the ℓ2 norm, and we define

    K(x_1, x_2) = exp(−γ ‖x_1 − x_2‖_2²)    (5)

To compute the random features, we generate the components of the matrix W according to a Gaussian distribution: w_ij ∼ N(0, 2γ).

2.2. Laplacian Kernel and Random Features

The Laplacian kernel is similar to the Gaussian, but the main difference is that it employs the ℓ1 norm, where ‖x‖_1 = Σ_{i=1}^N |x_i|. In this work, we consider the Laplacian kernel

    K(x_1, x_2) = exp(−γ ‖x_1 − x_2‖_1)    (6)

To compute the random features, we generate the components of the matrix W according to a Cauchy distribution: w_ij ∼ Cauchy(0, γ).

2.3. Cauchy Kernel and Random Features

The Cauchy kernel is less known than the previous two, and computing it can be an even more expensive task with high-dimensional vectors due to its mathematical form, hence benefiting more from the speed of random features. We define the kernel

    K(x_1, x_2) = ∏_{i=1}^N 1 / (1 + γ² (x_{1i} − x_{2i})²)    (7)

To compute the random features, we generate the components of the matrix W according to a Laplace distribution: w_ij ∼ Laplace(0, γ).

2.4. Training SVMs with Random Features

An SVM is a kernel method that can perform non-linear classification by solving the quadratic optimization of the dual form and taking advantage of the kernel trick [13]. The kernel trick uses a non-linear function to map the input features into a high-dimensional feature space by computing the kernel matrix.

An SVM using a non-linear shift-invariant kernel on the input features can be approximated by a linear SVM on the random features. The kernel matrix resulting from computing the inner products between the random features corresponds to an approximation of the kernel matrix computed from the input features and the shift-invariant kernel.
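The map of Equation (4) and the three kernel-specific sampling distributions of Sections 2.1-2.3 can be sketched as follows. This is a small numerical check on synthetic points (dimensions and γ chosen here only for illustration): for each kernel, the inner product of the random features should approximate the exact kernel value, with error shrinking as O(1/√M).

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 8192              # input dimensionality and random-feature dimensionality
gamma = 0.05
x1 = rng.standard_normal(N)
x2 = x1 + 0.5 * rng.standard_normal(N)

def phi(x, W, b):
    """Random feature map of Eq. (4): Phi(x) = sqrt(2/M) * cos(Wx + b)."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(W @ x + b)

b = rng.uniform(0.0, 2.0 * np.pi, size=M)   # b_i ~ Uniform[0, 2*pi]

# Kernel-specific distributions for the entries of W (Secs. 2.1-2.3).
samplers = {
    "gaussian":  lambda: rng.normal(0.0, np.sqrt(2.0 * gamma), (M, N)),  # N(0, 2*gamma)
    "laplacian": lambda: gamma * rng.standard_cauchy((M, N)),            # Cauchy(0, gamma)
    "cauchy":    lambda: rng.laplace(0.0, gamma, (M, N)),                # Laplace(0, gamma)
}

# Exact kernel values per Eqs. (5)-(7).
diff = x1 - x2
exact = {
    "gaussian":  np.exp(-gamma * np.dot(diff, diff)),
    "laplacian": np.exp(-gamma * np.abs(diff).sum()),
    "cauchy":    np.prod(1.0 / (1.0 + gamma**2 * diff**2)),
}

approx = {}
for name, sample in samplers.items():
    W = sample()  # the same W must be used for both points
    approx[name] = float(phi(x1, W, b) @ phi(x2, W, b))
    # approx[name] is close to exact[name] for large M
```

Note the pairing: the Laplacian kernel uses Cauchy-distributed weights and the Cauchy kernel uses Laplace-distributed weights, since each distribution is the Fourier transform of the other kernel's shape.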
The linear computation has an important practical implication, because there are libraries optimized for exactly these problems.

3. EXPERIMENTAL SETUP AND RESULTS

Our two sets of experiments addressed DCASE Task 1 - Acoustic Scene Classification [10]. We evaluated and compared the performance of the input features using SVMs with three non-linear shift-invariant kernels against the random features corresponding to the three kernel types using linear SVMs. Both pipelines are illustrated in Figure 1.

Figure 1: The acoustic scene dataset is used to extract input features for each recording. Then, the input features are used to train the SVM in two different ways. One is to pass the features directly to a non-linear shift-invariant kernel SVM, and the other is to first compute the random features and then pass them to a linear kernel SVM. Lastly, the trained SVM is used for multi-class classification on the test recordings.

3.1. Acoustic Scene Dataset

For our experiments we used the development set of the "DCASE: TUT Acoustic Scenes 2017" dataset [14]. It consists of recordings

Table 1: Class-wise accuracy of the four different kernels, all outperforming the baseline on the development set. *Note that the linear kernel does not use random features.
| Acoustic scene   | Baseline | Linear* | Gaussian | Laplacian | Cauchy |
|------------------|----------|---------|----------|-----------|--------|
| Beach            | 75.3 %   | 78.2 %  | 78.8 %   | 77.2 %    | 77.9 % |
| Bus              | 71.8 %   | 93.3 %  | 93.6 %   | 92.0 %    | 92.3 % |
| Cafe/Restaurant  | 57.7 %   | 79.2 %  | 76.9 %   | 82.7 %    | 78.5 % |
| Car              | 97.1 %   | 95.2 %  | 94.9 %   | 94.2 %    | 95.5 % |
| City center      | 90.7 %   | 92.0 %  | 91.0 %   | 92.3 %    | 89.4 % |
| Forest path      | 79.5 %   | 87.8 %  | 89.1 %   | 85.9 %    | 87.2 % |
| Grocery store    | 58.7 %   | 74.7 %  | 74.7 %   | 74.7 %    | 74.0 % |
| Home             | 68.6 %   | 66.9 %  | 66.3 %   | 67.3 %    | 66.3 % |
| Library          | 57.1 %   | 66.0 %  | 65.7 %   | 58.3 %    | 65.1 % |
| Metro station    | 91.7 %   | 81.4 %  | 82.7 %   | 83.7 %    | 83.3 % |
| Office           | 99.7 %   | 90.4 %  | 89.7 %   | 92.9 %    | 90.4 % |
| Park             | 70.2 %   | 62.2 %  | 65.1 %   | 61.5 %    | 60.9 % |
| Residential area | 64.1 %   | 62.2 %  | 65.7 %   | 68.3 %    | 63.5 % |
| Train            | 58.0 %   | 59.0 %  | 57.7 %   | 65.7 %    | 61.9 % |
| Tram             | 81.7 %   | 81.1 %  | 82.7 %   | 84.3 %    | 81.7 % |
| Overall          | 74.8 %   | 78.0 %  | 78.3 %   | 78.8 %    | 77.9 % |

from various acoustic scenes, 3-5 minutes long, divided into 4 cross-validation folds. The original recordings were split into segments with a length of 10 seconds. Recordings were made using a binaural microphone and a recorder with 44.1 kHz sampling rate and 24-bit resolution. The 15 acoustic scenes are: Bus, Cafe/Restaurant, Car, City center, Forest path, Grocery store, Home, Lakeside beach, Library, Metro station, Office, Residential area, Train, Tram, and Urban park.

3.2. Compute Input Features

We extracted a large set of audio features proposed in [3], which are later used to compute the random features. The set includes different features to capture different information from the acoustic scenes, which consist of multiple sound sources. The set is computed with the open-source feature extraction toolkit openSMILE [15] using the configuration file emolarge.conf. The features are divided into four categories: cepstral, spectral, energy-related, and voicing, and are extracted every 10 ms from 25 ms frames. Moreover, functionals are included, such as mean, standard deviation, percentiles and quartiles, linear regression functionals, and local minima/maxima.
The total dimensionality of the feature vector is 6,553.

3.3. Input Features and Non-linear SVM

The first set of experiments aimed to evaluate our large set of input features and non-linear SVMs for ASC. We used the input features to train the three types of non-linear shift-invariant kernel SVMs; we also included the linear kernel (without random features). The SVM parameter C was tuned with a grid search on the linear kernel and was fixed in all cases to C = 100, and performance was measured using accuracy. The accuracy is the average classification accuracy over the 4 validation folds provided for this challenge. Additionally, we explored different values of γ, obtaining the best results with γ = 2^−18 for the Gaussian kernel, γ = 2^−14 for the Laplacian kernel, and γ = 2^−8 for the Cauchy kernel. Before training the models, in each fold we normalized the input features with respect to the training set: we computed the mean and the standard deviation over the training feature files, and then subtracted the mean and divided by the standard deviation for every file in both the training and testing sets.

The classification performance for all kernel types was similar, as shown in Table 1. Generally, non-linear kernels tend to perform better than linear kernels for ASC [1]. However, it is not uncommon to see similar performance if the class separability given by the features is not very complex, which could be our case. Among our best classified scene classes are Bus, Cafe/Restaurant and Grocery store, with improvements of up to 25%.

3.4. Random Features and Linear SVM

The second set of experiments aimed to show that random features with a linear SVM achieve performance similar to the non-linear SVMs. For this, we used the training and testing input features to compute the random features corresponding to each of the three shift-invariant kernels described in Section 2. Then, these random features were used to train an SVM with a linear kernel.
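A minimal sketch of this pipeline, combining the per-fold normalization of Section 3.3 with the random-feature computation of Section 3.4, might look as follows. The feature matrices here are random placeholders (the real ones come from openSMILE); the dimensions and γ follow the paper, and the resulting M-dimensional features would then be passed to a linear SVM.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, gamma = 6553, 1024, 2.0**-18   # input dims, random-feature dims, Gaussian-kernel gamma

# Placeholder train/test feature matrices for one fold.
X_train = rng.standard_normal((200, N))
X_test = rng.standard_normal((50, N))

# Per-fold normalization: statistics come from the training set only (Sec. 3.3).
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma

# Random features approximating the Gaussian kernel (Sec. 2.1);
# the same W and b must be used for train and test.
W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(M, N))
b = rng.uniform(0.0, 2.0 * np.pi, size=M)
Z_train = np.sqrt(2.0 / M) * np.cos(X_train @ W.T + b)
Z_test = np.sqrt(2.0 / M) * np.cos(X_test @ W.T + b)
# Z_train / Z_test are 1024-dim and go to a linear SVM.
```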
The performance of the random features was indeed comparable to that of the input features with a non-linear SVM, as shown in Table 2. We can see that the results improve as M, the dimensionality of the random features, increases, showing minimal loss of performance compared to the previous non-linear SVMs. Notice that M is always lower than the original dimensionality of our input features. Had we further increased the value of M, performance would have kept improving until converging to the values in Table 1.

Table 2: Overall accuracy from computing random features and using a linear SVM, depending on the value of M, the dimensionality of the random features. Note that all M values are smaller than the input feature dimensionality (6,553), and the larger the value, the closer the results get to those in Table 1.

| M    | Gaussian (γ = 2^−18) | Laplacian (γ = 2^−14) | Cauchy (γ = 2^−8) |
|------|----------------------|-----------------------|-------------------|
| 2^5  | 50.4 %               | 49.8 %                | 48.7 %            |
| 2^6  | 57.3 %               | 56.0 %                | 56.2 %            |
| 2^7  | 64.4 %               | 61.5 %                | 62.9 %            |
| 2^8  | 69.1 %               | 66.0 %                | 67.9 %            |
| 2^9  | 73.0 %               | 67.2 %                | 72.7 %            |
| 2^10 | 75.3 %               | 70.3 %                | 75.1 %            |
| 2^11 | 76.1 %               | 73.0 %                | 75.7 %            |
| 2^12 | 77.2 %               | 75.8 %                | 76.9 %            |

3.5. Acoustic Scene Classification

The reported DCASE baseline (http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-acoustic-scene-classification) was tailored to a multi-class, single-label classification setup, with the network output layer consisting of softmax neurons representing the 15 classes; frame-based decisions were combined using majority voting to obtain a single label per classified segment. The classification resulted in 74.8% accuracy, which was outperformed by an absolute 4% using the input features and the SVM with the Laplacian kernel.

Regarding random features, we observed that already with a reduced dimensionality of M = 2^10 = 1024 we obtained performance similar to the DCASE baseline (74.8%) for the Gaussian (75.3%) and Cauchy (75.1%) kernels, thus reducing the dimensionality to about one sixth of the original 6,553 dimensions. Moreover, with a reduced dimensionality of M = 2^12 = 4096, we observed a minimal loss of an absolute 1% for the Gaussian and Cauchy kernels. Note that for the DCASE challenge we submitted a system using the input features and the Laplacian kernel SVM. Its overall classification accuracy was 60%, compared to the reported baseline of 61%.

The advantage of random features is that they can significantly reduce storage and computational cost by lowering the dimensionality and using linear inner products. Unlike other dimensionality reduction methods, such as PCA, the technique presented in this paper does not require heavy computation, such as computing eigenvectors; we only need to generate random numbers from the appropriate kernel-related distribution. Moreover, other machine learning algorithms that employ kernels could also benefit.

Multiple applications can take advantage of random features. For example, state-of-the-art techniques currently deal with features of over 10,000 dimensions and hundreds of thousands of segments [6, 7, 8], which are then passed to linear SVMs. Another example is when audio is recorded on local devices and sent to the cloud: this technique helps compress the information, reducing the cost of transmission and preserving privacy. For instance, we can compute the random features while keeping the parameters W and b private.
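This private-transmission idea can be sketched as follows (a minimal illustration with the paper's dimensions; the input vector here is a random placeholder). The device holds W and b, and only the M-dimensional random features leave it:

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, gamma = 6553, 1024, 2.0**-18

# Device-side parameters, kept private; the cloud never sees W, b, or x.
W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(M, N))
b = rng.uniform(0.0, 2.0 * np.pi, size=M)

def transmit(x):
    """Computed on the device; only this M-dimensional vector is sent to the cloud."""
    return np.sqrt(2.0 / M) * np.cos(W @ x + b)

x = rng.standard_normal(N)   # one 6,553-dim input feature vector (placeholder)
z = transmit(x)              # 1,024 dims: roughly 6.4x less to store and send
```

The cloud can then train and apply linear models directly on z, since inner products of these vectors approximate the chosen shift-invariant kernel.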
Thus, we can still process the transformed data in the cloud with linear models without revealing the actual data.

4. CONCLUSIONS

In this paper we have addressed Task 1 - Acoustic Scene Classification and have outperformed the baseline accuracy by 4% using a large set of acoustic features and non-linear SVMs. Additionally, we computed random features that approximate three types of shift-invariant kernels, which were passed to a linear SVM. We showed how the dimensionality can be reduced to one sixth with a minimal performance degradation of about 1%. The results may have significant implications in the big-data context, where high-dimensional features must be stored and quickly processed.

5. REFERENCES

[1] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: an IEEE AASP challenge," in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2013, pp. 1-4.
[2] Z. Zhang and B. Schuller, "Semi-supervised learning helps in sound event classification," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 333-336.
[3] J. T. Geiger, B. Schuller, and G. Rigoll, "Large-scale audio feature extraction and SVM for acoustic scene classification," in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2013, pp. 1-4.
[4] F. Metze, S. Rawat, and Y. Wang, "Improved audio features for large-scale multimedia event detection," in 2014 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2014, pp. 1-6.
[5] B. Elizalde, A. Kumar, A. Shah, R. Badlani, E. Vincent, B. Raj, and I. Lane, "Experiments on the DCASE Challenge 2016: Acoustic scene classification and sound event detection in real life recording," in DCASE2016 Workshop on Detection and Classification of Acoustic Scenes and Events, 2016.
[6] Z. Zhang, D. Liu, J. Han, and B. Schuller, "Learning audio sequence representations for acoustic event classification," arXiv preprint arXiv:1707.08729, 2017.
[7] R. Arandjelović and A. Zisserman, "Look, listen and learn," arXiv preprint arXiv:1705.08168, 2017.
[8] Y. Aytar, C. Vondrick, and A. Torralba, "SoundNet: Learning sound representations from unlabeled video," in Advances in Neural Information Processing Systems, 2016, pp. 892-900.
[9] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Information Processing Systems, 2008, pp. 1177-1184.
[10] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: Tasks, datasets and baseline system," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017, submitted.
[11] F. Li, C. Ionescu, and C. Sminchisescu, "Random Fourier approximations for skewed multiplicative histogram kernels," in DAGM 2010: Pattern Recognition, 2010, pp. 262-271.
[12] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 480-492, 2012.
[13] C. M. Bishop, Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
[14] A. Mesaros, T. Heittola, and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," in 24th European Signal Processing Conference (EUSIPCO 2016), Budapest, Hungary, 2016.
[15] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: the Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia. ACM, 2010, pp. 1459-1462.
