A simple model for detection of rare sound events


Authors: Weiran Wang, Chieh-chi Kao, Chao Wang

Amazon Alexa
101 Main St, Cambridge, MA 02142, USA
{weiranw,chiehchi,wngcha}@amazon.com

Abstract

We propose a simple recurrent model for detecting rare sound events, when the time boundaries of events are available for training. Our model optimizes the combination of an utterance-level loss, which classifies whether an event occurs in an utterance, and a frame-level loss, which classifies whether each frame corresponds to the event when it does occur. The two losses make use of a shared vectorial representation of the event, and are connected by an attention mechanism. We demonstrate our model on Task 2 of the DCASE 2017 challenge, and achieve competitive performance.

1. Introduction

The task of detecting rare sound events from audio has drawn much recent attention, due to its wide applicability in acoustic scene understanding and audio security surveillance. The goal of this task is to classify whether a certain type of event occurs in an audio segment and, when it does occur, to also detect the time boundaries (onset and offset) of the event instance.

Task 2 of the DCASE 2017 challenge provides an ideal testbed for detection algorithms [1]. The data set consists of isolated sound events for three target classes (baby crying, glass breaking, and gun shot) embedded in various everyday acoustic scenes as background. Each utterance contains at most one instance of the event type, and the data generation process provides the temporal position of the event, which can be used for modeling.

Perhaps the most direct solution to this problem is to model the hypothesis space of segments, and to predict whether each segment corresponds to the time span of the event of interest. This approach was adopted by [2] and [3], whose model architectures drew heavily on the region proposal networks [4] developed in the computer vision community.
Such models have a large number of hyper-parameters, which require much human guidance to tune. More importantly, this approach is generally slow to train and test, due to the large number of segments to be evaluated.

Another straightforward approach to this task is to generate a reference label for each frame indicating whether the frame corresponds to the event, and then train a classifier to predict the binary frame label. This was indeed the approach taken by many participants of the challenge (e.g., [5, 6]). The disadvantage of this approach is that it does not directly provide an utterance-level prediction (whether an event occurs at all), and thus requires heuristics to aggregate the frame-level evidence for one. Solving this issue is the motivation of our work.

We propose a simple model for detecting rare sound events without aggregation heuristics for the utterance-level prediction. Our learning objective combines a frame-level loss, similar to the approach above, with an utterance-level loss that automatically collects the frame-level evidence. The two losses share a single classifier, which can be seen as the vectorial representation of the event, and they are connected by an attention mechanism. Additionally, we use multiple layers of recurrent neural networks (RNNs) for feature extraction from the raw features, and we propose an RNN-based multi-resolution architecture that consistently improves over standard multi-layer bi-directional RNN architectures for our task.

In the rest of this paper, we discuss our learning objective in Section 2, introduce the multi-resolution architecture in Section 3, demonstrate both on the DCASE challenge in Section 4, and provide concluding remarks in Section 5.

2. Our model

Denote an input utterance by X = [x_1, ..., x_T], where x_i ∈ R^d contains the audio features for the i-th frame.
For our task (detecting a single event at a time), we are given the binary utterance label y, which indicates whether an event occurs (y = 1) or not (y = 0). If y = 1, we additionally have the onset and offset times of the event, or equivalently the frame labels 𝐲 = [y_1, ..., y_T], where y_t = 1 if the event is on at frame t and y_t = 0 otherwise. Our goal is to make accurate predictions at both the utterance level and the frame level.

Our model uses a multi-layer RNN architecture f to extract nonlinear features from X, which yields a new representation f(X) = [h_1, ..., h_T] ∈ R^{h×T} containing temporal information. We also learn a vectorial representation of the acoustic event w ∈ R^h, which serves the purpose of a classifier and will be used in predictions at both levels.

With the standard logistic regression model, we perform per-frame classification based on the frame-level representations and the classifier w: for t = 1, ..., T,

    p_t := P(y_t = 1 | X) = 1 / (1 + exp(-w^T h_t)) ∈ [0, 1],

and we measure the frame-level loss if the event occurs:

    L_frame(X, 𝐲) = (1/T) Σ_{t=1}^T [ y_t log p_t + (1 - y_t) log(1 - p_t) ]   if y = 1,
    L_frame(X, 𝐲) = 0                                                         if y = 0.

Note that we do not calculate the frame loss if no event occurs, even though one could consider the frame labels to be all 0's in this case. This design choice is consistent with the evaluation metric for rare events: if we believe no event occurs in an utterance, the onset/offset or frame labels are meaningless.

On the other hand, we make the utterance-level prediction by collecting evidence at the frame level.

Figure 1: Illustration of our RNN-based attention mechanism for rare sound event detection.

Since the above p_t's provide the alignment between each frame and the target event,
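As a concrete illustration, the frame-level branch above can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' TensorFlow implementation; the names `frame_probs` and `frame_loss` are ours, and the RNN outputs H = [h_1, ..., h_T] are taken as given:

```python
import numpy as np

def frame_probs(H, w):
    """Per-frame event probabilities p_t = sigmoid(w . h_t).

    H: (T, h) array of RNN frame representations [h_1, ..., h_T].
    w: (h,) vectorial event representation, acting as the classifier.
    """
    return 1.0 / (1.0 + np.exp(-H @ w))

def frame_loss(H, w, y_frames, y_utt):
    """Frame-level log-likelihood L_frame, computed only when the event
    occurs (y_utt = 1); for negative utterances the frame labels are
    meaningless and the loss is defined to be 0."""
    if y_utt == 0:
        return 0.0
    p = frame_probs(H, w)
    return np.mean(y_frames * np.log(p) + (1 - y_frames) * np.log(1 - p))
```

Note that, matching the sign convention of the formulas above, `frame_loss` is a log-likelihood (higher is better); a training loop would maximize it, or equivalently minimize its negation.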
we normalize them over the entire utterance to give the "attention" [7, 8]:

    a_t = p_t / Σ_{t'=1}^T p_{t'},   t = 1, ..., T,

and use these attention weights to combine the frame representations into the utterance representation

    h = Σ_{t=1}^T a_t h_t.

We make the utterance-level prediction by classifying h using w:

    p := P(y = 1 | X) = 1 / (1 + exp(-w^T h)) ∈ [0, 1],

and define the utterance-level loss based on it:

    L_utt(X, y) = y log p + (1 - y) log(1 - p).

This loss naturally encourages the attention to be peaked at the event frames (since they are better aligned with w), and low at the non-event frames. Our final objective function is a weighted combination of the two losses above:

    L(X, y, 𝐲) = L_utt(X, y) + α · L_frame(X, 𝐲),

where α > 0 is a trade-off parameter. During training, we optimize L(X, y, 𝐲) jointly over the parameters of the RNNs f and the event representation w. An illustration of our model is given in Figure 1.

2.1. Inference

For a test utterance, we first calculate p and predict that no event occurs if p ≤ thres_0; in the case p > thres_0, which indicates that an event occurs, we threshold [p_1, ..., p_T] by thres_1 to predict whether the event occurs at each frame. For Task 2 of the DCASE challenge, where we need to output the time boundary of a predicted event (and there is at most one event in each utterance), we simply return the boundary of the longest connected component of 1's in the thresholded frame predictions. We simply use thres_0 = thres_1 = 0.5 in our experiments.

3. Multi-resolution feature extraction

Different instances of the same event type may occur with somewhat different speeds and durations. To be robust to variations in the time axis, we propose a multi-resolution feature extraction architecture based on RNNs, as depicted in Figure 2, which is used as the f(X) mapping in our model. This architecture works as follows.
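The attention pooling, the utterance-level prediction, and the inference rule of Section 2.1 can be sketched together in numpy. This is again a sketch under the same assumptions as before (H and w given); `utterance_prob`, `longest_run_of_ones`, and `detect` are our hypothetical names, and a detected event is returned as a half-open frame index range:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def utterance_prob(H, w):
    """Attention-pooled utterance probability p.

    The per-frame probabilities p_t = sigmoid(w . h_t) are normalized
    into attention weights a_t, which pool the frame representations
    into a single vector h; h is classified by the same vector w.
    """
    p = sigmoid(H @ w)      # per-frame probabilities p_t
    a = p / p.sum()         # attention weights, summing to 1
    h = H.T @ a             # utterance representation h = sum_t a_t h_t
    return sigmoid(w @ h), p

def longest_run_of_ones(bits):
    """(start, end) of the longest connected run of 1's, end exclusive."""
    best, start = (0, 0), None
    for i, b in enumerate(list(bits) + [0]):  # sentinel closes a trailing run
        if b and start is None:
            start = i
        elif not b and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

def detect(H, w, thres0=0.5, thres1=0.5):
    """Inference as in Section 2.1: utterance decision first, then the
    boundary of the longest connected component of thresholded frames."""
    p_utt, p = utterance_prob(H, w)
    if p_utt <= thres0:
        return None  # no event predicted in this utterance
    return longest_run_of_ones(p > thres1)
```

With thres_0 = thres_1 = 0.5 as in the paper, `detect` returns None when no event is predicted, and otherwise the single event boundary required by the DCASE Task 2 output format.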
After running each recurrent layer, we perform subsampling along the time axis at a rate of 2, i.e., the outputs of the RNN cell for two neighboring frames are averaged, and the resulting sequence, whose length is half the input length of this layer, is then used as input to the next recurrent layer. In this way, the higher recurrent layers effectively view the original utterance at coarser resolutions (larger time scales), and extract information from increasingly larger context of the input.

After the last recurrent layer, we would like to obtain a representation for each of the input frames. This is achieved by upsampling (replicating) the subsampled output sequences from each recurrent layer, and summing them for corresponding frames. Therefore, the final frame representation produced by this architecture takes into account information at different resolutions.

Figure 2: RNN-based multi-resolution modeling.

We note that the idea of subsampling in deep RNN architectures is motivated by its use in speech recognition [9], and the idea of connecting lower-level features to higher layers is similar to that of ResNet [10]. We have implemented our model in the TensorFlow framework [11].

4. Experimental results

Data generation. We demonstrate our rare event detection model on Task 2 of the DCASE 2017 challenge [12]. The task data consist of isolated sound events for three target classes (babycry, glassbreak, gunshot), and recordings of 15 different audio scenes (bus, cafe, car, etc.) from the TUT Acoustic Scenes 2016 dataset [13], used as background sounds. The synthesizer provided as part of the DCASE challenge is used to generate the training set, and the mixing event-to-background ratios (EBR) are -6, 0, and 6 dB.
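The subsample-then-upsample wiring of Figure 2 can be sketched as follows. This is a structural sketch only: each element of `layers` stands in for a bi-directional GRU layer (here a plain callable, so the tensor bookkeeping stays visible), and sequence lengths are assumed to be powers of two for simplicity:

```python
import numpy as np

def subsample2(H):
    """Average each pair of neighboring frames, halving the sequence length."""
    T = (H.shape[0] // 2) * 2  # drop a trailing odd frame for simplicity
    return H[:T].reshape(-1, 2, H.shape[1]).mean(axis=1)

def upsample(H, T):
    """Replicate each frame so a subsampled sequence regains length T."""
    rate = int(np.ceil(T / H.shape[0]))
    return np.repeat(H, rate, axis=0)[:T]

def multi_resolution(X, layers):
    """Run the layers with rate-2 subsampling in between; upsample each
    layer's subsampled output back to the input length and sum them, so
    every input frame gets a representation mixing all resolutions."""
    T = X.shape[0]
    outputs, H = [], X
    for layer in layers:
        H = layer(H)               # recurrent layer at this resolution
        H = subsample2(H)          # halve the sequence for the next layer
        outputs.append(upsample(H, T))
    return sum(outputs)
```

In the real model each callable would be a bi-directional GRU layer; with identity stand-ins the sketch still demonstrates how coarser layers see larger time scales and how the per-frame sum combines them.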
The generated training set has 5000 or 15000 utterances for each target class, and each utterance contains either one target-class event or no event. We use the development and evaluation sets (each of about 500 utterances) provided by the DCASE challenge.

Feature extraction. The acoustic features used in this work are log filter bank energies (LFBEs). The feature extraction operates on mono audio signals sampled at 44.1 kHz. For each 30-second audio clip, we extract 64-dimensional LFBEs from frames of 46 ms duration with shifts of 23 ms.

Evaluation metrics. The evaluation metrics used for audio event detection in DCASE 2017 are event-based error rate (ER) and F1-score. These metrics are calculated under the onset-only condition with a collar of 500 ms, taking into account insertions, deletions, and substitutions of events. Details of these metrics can be found in [12].

Table 1: ER results of our model on the development set for different RNN architectures. Here the training set size is 5000, and we fix the number of GRU layers to 3.

                       babycry   glassbreak   gunshot
    uni-directional    0.24      0.06         0.31
    bi-directional     0.18      0.07         0.26
    multi-resolution   0.13      0.04         0.20

4.1. Training with 5K samples

For each type of event, we first explore different architectures and hyperparameters on training sets of 5000 utterances, 2500 of which contain the event. This training setup is similar to that of several participants of the DCASE challenge.

For the frame-level loss L_frame, instead of summing the cross-entropy over all frames in a positive utterance, we only consider frames near the event, in particular from 50 frames before the onset to 50 frames after the offset. In this way, we obtain a balanced set of frames (100 negative frames and a similar number of positive frames per positive utterance) for L_frame.

Our models are trained with the ADAM algorithm [14] with a minibatch size of 10 utterances and an initial stepsize of 0.0001, for 15 epochs. We tune the hyperparameter α over the grid {0.1, 0.5, 1, 5, 10} on the development set. For each α, we monitor the model's performance on the development set and select the epoch that gives the lowest ER.

4.1.1. Effect of RNN architectures

We explore the effect of RNN architectures on the frame feature transformation. We test 3 layers of uni-directional, bi-directional, and multi-resolution RNNs (described in Section 3) for f(X). The specific RNN cell we use is the standard gated recurrent unit [15], with 256 units in each direction. We observe that bi-directional RNNs tend to outperform uni-directional RNNs, and on top of that, the multi-resolution architecture brings further improvements on all event types.

4.1.2. Effect of the α parameter

Figure 3: Performance of different RNN architectures for a range of α. Here the training set size is 5000.

In Figure 3, we plot the performance of the different RNN architectures at different values of the trade-off parameter α. We observe that there exists a wide range of α for which the model achieves good performance, and for all three events the optimal α is close to 1, placing equal weight on the utterance loss and the frame loss.

Table 2: Performance of our model with 15000 training samples and 4 GRU layers.

                             babycry       glassbreak    gunshot       average
    Methods                  ER    F1(%)   ER    F1(%)   ER    F1(%)   ER    F1(%)
    Development set
      Ours                   0.11  94.3    0.04  97.8    0.18  90.6    0.11  94.2
      DCASE Baseline                                                   0.53  72.7
      DCASE 1st place [5]                                              0.07  96.3
      DCASE 2nd place [6]                                              0.14  92.9
    Evaluation set
      Ours                   0.26  86.5    0.16  92.1    0.18  91.1    0.20  89.9
      DCASE Baseline                                                   0.64  64.1
      DCASE 1st place                                                  0.13  93.1
      DCASE 2nd place                                                  0.17  91.0
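The balanced frame selection used for L_frame in Section 4.1 (frames from 50 before the onset to 50 after the offset) can be sketched as a boolean mask over the utterance; `frame_window_mask` is our hypothetical name, and clipping at the utterance boundaries is our assumption:

```python
import numpy as np

def frame_window_mask(T, onset, offset, context=50):
    """Boolean mask over T frames selecting the window from `context`
    frames before the onset to `context` frames after the offset
    (both inclusive), clipped to the utterance boundaries."""
    lo = max(0, onset - context)
    hi = min(T, offset + context + 1)
    mask = np.zeros(T, dtype=bool)
    mask[lo:hi] = True
    return mask
```

The frame cross-entropy would then be averaged only over masked frames, giving roughly 100 negative frames plus the positive frames for each positive utterance.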
4.2. Training with 15K samples

For each type of event, we then increase the training set to 15000 utterances, 7500 of which contain the event. We use 4 GRU layers in our multi-resolution architecture and set α = 1.0. Training stops after 10 epochs, and we perform early stopping on the development set as before.

The results of our method, in terms of both ER and F1-score, are given in Table 2. With the larger training set and deeper architecture, our development set ER is further improved on babycry and gunshot; the average ER of 0.11 is second only to the first place's result of 0.07 among all challenge participants.

5. Conclusion

We have proposed a new recurrent model for rare sound event detection, which achieves competitive performance on Task 2 of the DCASE 2017 challenge. The model is simple in that, instead of heuristically aggregating frame-level predictions, it is trained to directly make the utterance-level prediction, with an objective that combines losses at both levels through an attention mechanism. To be robust to variations in the time axis, we also propose a multi-resolution feature extraction architecture that improves over standard bi-directional RNNs. Our model can be trained efficiently in an end-to-end fashion, and thus can scale up to larger datasets and potentially to the simultaneous detection of multiple events.

6. Acknowledgements

The authors would like to thank Ming Sun and Hao Tang for useful discussions, and the anonymous reviewers for constructive feedback.

7. References

[1] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: Tasks, datasets and baseline system," in Proc. Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017.
[2] K. Wang, L. Yang, and B. Yang, "Audio events detection and classification using extended R-FCN approach," DCASE2017 Challenge, Tech. Rep., 2017.
[3] C. Kao, W. Wang, M. Sun, and C. Wang, "R-CRNN: Region-based convolutional recurrent neural network for audio event detection," in Proc. of Interspeech'18, Hyderabad, India, Sep. 2-6 2018, to appear.
[4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems (NIPS), C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., vol. 28. MIT Press, Cambridge, MA, 2015, pp. 91-99.
[5] H. Lim, J. Park, and Y. Han, "Rare sound event detection using 1D convolutional recurrent neural networks," DCASE2017 Challenge, Tech. Rep., 2017.
[6] E. Cakir and T. Virtanen, "Convolutional recurrent neural networks for rare sound event detection," DCASE2017 Challenge, Tech. Rep., 2017.
[7] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," Aug. 18 2015, arXiv:1508.04395 [cs.CL].
[8] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. of the 3rd Int. Conf. Learning Representations (ICLR 2015), San Diego, CA, May 7-9 2015.
[9] Y. Miao, J. Li, Y. Wang, S.-X. Zhang, and Y. Gong, "Simplifying long short-term memory acoustic models for fast training and decoding," in Proc. of the IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP'16), Shanghai, China, Mar. 20-25 2016.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. of the 2016 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR'16), Las Vegas, NV, Jun. 26-Jul. 1 2016.
[11] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. [Online]. Available: https://www.tensorflow.org
[12] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, p. 162, 2016.
[13] ——, "TUT database for acoustic scene classification and sound event detection," in Proc. EUSIPCO, 2016.
[14] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. of the 3rd Int. Conf. Learning Representations (ICLR 2015), San Diego, CA, May 7-9 2015.
[15] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proc. 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, Oct. 25-29 2014.
