Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection
Authors: Ke-Xin He, Yu-Han Shen, Wei-Qiang Zhang
Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection

Ke-Xin He*, Yu-Han Shen*, Wei-Qiang Zhang
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
hekexinchn@163.com, yhshen@hotmail.com, wqzhang@tsinghua.edu.cn

Abstract
Sound event detection with weakly labeled data is considered a multi-instance learning problem, and the choice of pooling function is key to solving it. In this paper, we propose a hierarchical pooling structure to improve the performance of weakly labeled sound event detection systems. The proposed pooling structure yields remarkable improvements for three types of pooling function without adding any parameters. Moreover, using the hierarchical pooling structure, our system achieves competitive performance on Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge.

Index Terms: sound event detection, weakly-labeled data, pooling function, hierarchical structure

1. Introduction
The aim of sound event detection (SED) is to detect what types of sound events occur in an audio stream and, furthermore, to locate the onset and offset times of those events. Traditional approaches to SED depend on strongly labeled data, which provide the type and timestamps (onset and offset times) of each sound event occurrence. Such annotation is too costly to acquire, so many researchers have begun to focus on detecting sound events using weakly labeled training data. A weak label means that the training data are annotated only with the presence of sound events; no timestamps are provided. Google released the weakly labeled Audio Set [1] in 2017, which boosted the development of the relevant research community.
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge launched a task of large-scale weakly supervised sound event detection for smart cars [2], which employed a subset of Audio Set. Common solutions to SED with weak labels are based on Multi-Instance Learning (MIL). In MIL, the ground-truth label of each instance is unknown; instead, we only know the ground-truth labels of bags, each containing many instances. A bag is labeled negative if all of its instances are negative; a bag is labeled positive if at least one instance in it is positive. As shown in Figure 1, in MIL for SED, an audio clip can be considered a bag consisting of several frames. For a specific class of sound events, a clip is labeled positive if the target sound event occurs in at least one frame.

To solve the MIL problem for SED, we usually use neural networks to predict the probability of each sound event class occurring in each frame. Then, we need to aggregate the frame-level probabilities into a clip-level probability for each class of sound events. Standard approaches to aggregating the probabilities include max-pooling and average-pooling, and there are also many variants and developments. Kong et al. [3] proposed an attention model as the pooling function, which has been adopted in many works [4, 5, 6]. McFee et al. [7] proposed a family of adaptive pooling operators. Wang et al. [8] compared five pooling functions for SED with weak labeling.

* The first two authors contributed equally. The corresponding author is Wei-Qiang Zhang. This work was supported by the National Natural Science Foundation of China under Grant No. U1836219.

Figure 1: Illustration of a Multi-Instance Learning system for sound event detection with weakly labeled data.
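The bag-labeling rule above can be sketched in a few lines of Python (a toy illustration, not part of the paper's system; the frame labels are made up and, in the weakly labeled setting, would be unobserved at training time):

```python
# Frame-level ground truth for one event class (1 = event present in that frame).
# Under weak labeling these frame labels are hidden; only the clip (bag) label is known.
frame_labels = [0, 0, 1, 1, 0, 0, 0, 0]

# MIL bag rule: a clip is positive iff at least one of its frames is positive.
clip_label = int(any(frame_labels))
print(clip_label)  # → 1
```

A clip whose frame labels are all zero would likewise yield `clip_label == 0`, matching the negative-bag rule.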
In this paper, we propose a hierarchical pooling structure to give better supervision for neural network learning. The proposed pooling structure improves the performance of three types of pooling functions without any added parameters. We evaluate our methods on DCASE 2017 Challenge Task 4, and our model shows excellent performance.

2. Methods

2.1. Baseline System
The mainstream Convolutional Recurrent Neural Network (CRNN) system is implemented as our baseline. An overview of the baseline system is given in Figure 2. We use the log mel spectrogram as the acoustic feature. The input feature passes through several convolutional layers, a Bi-directional Gated Recurrent Unit (Bi-GRU) and a dense layer with sigmoid activation to produce frame-level presence probabilities for each sound event class. The architecture of the neural networks in our work is similar to that in [4]. As shown in Figure 3, the Convolutional Neural Network (CNN) part consists of four convolutional blocks and a single convolutional layer. Each block contains two gated convolutional layers [9], batch normalization [10], dropout [11] and a max-pooling layer. Max-pooling is applied on both the time axis and the frequency axis. Note that the frame rate is reduced from 50 Hz to 12.5 Hz by the max-pooling operations on the time axis. The extracted features over the different convolutional channels are stacked along the frequency axis before being fed into the Recurrent Neural Network (RNN) part. The RNN part in our work is based on a Bi-GRU; the outputs of the forward and backward GRUs are concatenated to get the final outputs. The hyper-parameters are given in Figure 3.

Figure 2: Overview of the baseline system (audio → log mel spectrogram → CNNs → Bi-GRU → pooling function → timestamps for event classes such as Car, Train, Truck).

Figure 3: Architecture of the neural networks: four conv blocks (filters 32, 64, 128 and 128; pooling sizes 2×2, 2×2, 1×2 and 1×2), where each block comprises GCNN(3×3), batch normalization, GCNN(3×3), Dropout(0.2) and max-pooling; then CNN(3×3, 256), Max-Pooling(1×5), Bi-directional GRU(128×2), a dense layer (17, sigmoid) producing the frame-level predictions, and the pooling function producing the clip-level prediction. The first and second dimensions of the convolutional kernels and strides represent the time axis and the frequency axis respectively. The size of all convolutional kernels is 3×3.

Finally, a pooling function is adopted to calculate the presence probability of each sound event class in a 10-second audio clip. The choice and usage of the pooling function are explained in the remaining parts of this section. For testing, in order to locate the detected sound events, a threshold θ is applied to the frame-level predictions. Then, we use post-processing methods, including a median filter and noise removal, to get the onset and offset times of the detected events.

2.2. Pooling function
As mentioned above, the design of the pooling function is an essential issue in weakly labeled sound event detection. Wang et al. [8] made a comprehensive comparison of five pooling functions (max pooling, average pooling, linear softmax, exponential softmax and attention) in MIL for SED. These pooling functions are introduced as follows. Let x_i ∈ [0, 1] be the predicted probability of a specific event class occurring at the i-th frame. We need a pooling function to make a clip-level prediction.
Let y ∈ [0, 1] be the clip-level probability; then we have

y = (Σ_{i=1}^{N} w_i x_i) / (Σ_{i=1}^{N} w_i)    (1)

where w_i is the weight coefficient for x_i and N is the number of frames in a clip. Table 1 lists the formulas for the weight values w_i of the five types of pooling functions.

Table 1: Definition of five pooling functions

  Pooling function   Definition                                 Weight value
  Max pooling        y = max_i x_i                              w_i = 1 if i = argmax_i x_i, else 0
  Average pooling    y = (1/N) Σ_i x_i                          w_i = 1/N
  Linear softmax     y = (Σ_i x_i^2) / (Σ_i x_i)                w_i = x_i
  Exp. softmax       y = (Σ_i x_i exp(x_i)) / (Σ_i exp(x_i))    w_i = exp(x_i)
  Attention          y = (Σ_i w_i x_i) / (Σ_i w_i)              w_i = h(u)

In the case of the attention pooling function, the weight value w_i is learnt by a dense layer with softmax activation, and its input u is the same as the input of the dense layer producing x_i. It is obvious from Table 1 that w_i is a function of x_i or u, so we denote this function as

w_i = f(x_i; u)    (2)

2.3. Hierarchical pooling structure
Instead of aggregating all N frame-level predictions x_i into a clip-level prediction y at once, we first group the N frames into segments of length M and make segment-level predictions x̂_j. At the same time, the weight values w_i are weighted-averaged, using themselves as weights, to obtain segment-level weights ŵ_j. Finally, we use the segment-level predictions x̂_j and weights ŵ_j to get the clip-level prediction y. The entire process is described by the following formulas.
x̂_j = (Σ_{i=1+(j−1)M}^{jM} w_i x_i) / (Σ_{i=1+(j−1)M}^{jM} w_i),   j = 1, 2, ..., N/M    (3)

ŵ_j = (Σ_{i=1+(j−1)M}^{jM} w_i^2) / (Σ_{i=1+(j−1)M}^{jM} w_i),   j = 1, 2, ..., N/M    (4)

y = (Σ_{j=1}^{N/M} ŵ_j x̂_j) / (Σ_{j=1}^{N/M} ŵ_j)    (5)

Figure 4: Three-stage hierarchical pooling structure (output of Bi-GRU, shape (125, 256) → frame-level predictions, shape (125, 17) → segment-level predictions, shape (25, 17) → longer-segment-level predictions, shape (5, 17) → clip-level prediction, shape (17)). In linear and exponential softmax pooling, the frame-level weights w_i derive from the frame-level predictions x_i; in attention pooling, they are learnt from the output of the Bi-GRU. In the first stage, every five frames are aggregated to get segment-level predictions x̂_j, and the weights of every five frames are averaged to get segment-level weights ŵ_j. In the second stage, every five segments are aggregated to get longer-segment-level predictions x̃_k, and every five segment-level weights are averaged to get longer-segment-level weights w̃_k. In the end, x̃_k and w̃_k are aggregated to get the final clip-level prediction.

2.4. Analysis of the hierarchical pooling structure
Before we discuss this structure in depth, we would like to state a proposition: the accuracy of x̂_j is higher than that of x_i in a well-trained system. This proposition is intuitively reasonable because it is easier for the system to output correct predictions when the required time resolution becomes coarser. According to the theoretical discussion in [8], the process of weight updating is related to ∂y/∂x_i and ∂y/∂w_i. We take the linear softmax pooling function as an example to interpret the effect of the proposed pooling structure.
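Before working through the gradients, Equations (1)–(5) and the pooling functions of Table 1 can be made concrete with a small NumPy sketch (a toy illustration, not the paper's code: the frame probabilities are made up, and the learnt attention weights h(u) are replaced by a fixed vector):

```python
import numpy as np

x = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8])  # frame-level predictions x_i

# --- Single pooling structure: Eq. (1) with the weights of Table 1 ---
max_pool = x.max()
avg_pool = x.mean()
linear_softmax = (x**2).sum() / x.sum()                   # w_i = x_i
exp_softmax = (x * np.exp(x)).sum() / np.exp(x).sum()     # w_i = exp(x_i)
h_u = np.array([0.2, 0.5, 0.1, 0.2, 0.3, 0.1, 0.4, 0.2])  # stand-in for the learnt h(u)
attention = (h_u * x).sum() / h_u.sum()

# --- Hierarchical pooling structure: Eqs. (3)-(5), linear softmax case (w_i = x_i) ---
M = 4                                                     # segment length
w = x                                                     # linear softmax weights
seg_w, seg_x = w.reshape(-1, M), x.reshape(-1, M)
x_hat = (seg_w * seg_x).sum(axis=1) / seg_w.sum(axis=1)   # Eq. (3)
w_hat = (seg_w**2).sum(axis=1) / seg_w.sum(axis=1)        # Eq. (4)
y = (w_hat * x_hat).sum() / w_hat.sum()                   # Eq. (5)
print(x_hat, w_hat, round(y, 4))  # → [0.7 0.6] [0.7 0.6] 0.6538
```

Note that for linear softmax, Eq. (4) makes ŵ_j equal to x̂_j, so the clip-level prediction is itself a linear softmax over the segment-level predictions.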
In the case of the normal single pooling structure, w_i = x_i and

∂y/∂x_i = (2x_i − y) / Σ_{k=1}^{N} x_k    (6)

In the case of the hierarchical pooling structure,

ŵ_j = (Σ_{i=1+(j−1)M}^{jM} w_i^2) / (Σ_{i=1+(j−1)M}^{jM} w_i) = (Σ_{i=1+(j−1)M}^{jM} x_i^2) / (Σ_{i=1+(j−1)M}^{jM} x_i)    (7)

∂y/∂x_i = Σ_{l=1}^{N/M} [ (∂y/∂x̂_l)(∂x̂_l/∂x_i) + (∂y/∂ŵ_l)(∂ŵ_l/∂x_i) ]
        = [ x_i (4x̂_j − 2y) − 2x̂_j^2 + y x̂_j ] / ( Σ_{l=1}^{N/M} x̂_l · Σ_{n=1+(j−1)M}^{jM} x_n ),   j = ⌈i/M⌉    (8)

As shown in Equations 6 and 8, compared with the single pooling structure, the segment-level prediction x̂_j also contributes to the update of the frame-level prediction x_i in the hierarchical pooling structure. As the segment-level prediction is more accurate than the frame-level prediction, we believe the proposed hierarchical pooling structure can provide better supervision for neural network learning. Detailed mathematical derivation and analysis of all five pooling functions are available in the appendix. We prove there that the proposed structure makes no difference for max and average pooling, so we conducted our experiments using the other three pooling functions.

The hierarchical pooling structure used in our work is illustrated in Figure 4. It is a three-stage pooling structure: the number of predicted probabilities for a given class of sound events in an audio clip decreases from 125 to 25, then to 5, and finally to 1.

3. Experiments

3.1. Dataset
We conducted our experiments on Task 4 of the DCASE 2017 Challenge [2]. This task contains 17 classes of sound events. The dataset is a subset of Audio Set [1]. The training set has weak labels denoting the presence of a given sound event in the video's soundtrack; no timestamps are provided. For testing and evaluation, strong labels with timestamps are provided for the purpose of evaluating performance.

3.2. Experimental Setup
To extract the log mel spectrogram feature, each audio clip is divided into frames of 40 ms duration with 50% overlap.
The input of our system is a 500 × 80 matrix, where 500 denotes the number of frames and 80 the number of mel-filter bins. Our model is trained using the Adam optimizer [12]. The initial learning rate is 0.001 and the mini-batch size is 128. The loss function is categorical cross-entropy based on clip-level labels. We use an early-stopping strategy: training stops when the validation loss has not improved for 10 epochs.

Table 2: Performance of the single and hierarchical pooling structures, in terms of ER (lower is better) and F1-score (%) (higher is better).

Development set:
          Single Pooling Structure                     Hierarchical Pooling Structure
          Sub.  Del.  Ins.  ER    Pre.   Rec.   F1     Sub.  Del.  Ins.  ER            Pre.   Rec.   F1
  Linear  0.25  0.18  0.36  0.79  39.00  47.01  42.63  0.19  0.40  0.17  0.76 (3.8%↓)  53.07  41.31  46.46 (9.0%↑)
  Exp.    0.29  0.35  0.18  0.82  44.67  37.24  40.62  0.27  0.26  0.26  0.79 (3.7%↓)  45.90  45.72  45.81 (12.8%↑)
  Att.    0.30  0.34  0.19  0.83  44.68  36.97  40.46  0.25  0.33  0.21  0.79 (4.8%↓)  48.17  42.51  45.16 (11.6%↑)

Evaluation set:
  Linear  0.21  0.36  0.18  0.76  53.40  43.19  47.76  0.19  0.30  0.20  0.69 (9.2%↓)  56.39  50.78  53.44 (11.8%↑)
  Exp.    0.23  0.35  0.23  0.81  48.35  43.78  45.95  0.23  0.28  0.22  0.73 (9.9%↓)  53.40  51.38  52.37 (14.0%↑)
  Att.    0.21  0.31  0.27  0.79  46.12  44.43  45.26  0.21  0.28  0.24  0.73 (7.6%↓)  52.27  50.58  51.41 (13.6%↑)

Figure 5: Frame-level predictions of the three systems on an evaluation audio clip: (a) predictions of the linear softmax system, (b) predictions of the exponential softmax system, (c) predictions of the attention system (probability vs. time in seconds).

3.3. Metrics
According to the official instructions of the DCASE 2017 Challenge [2], our method is evaluated with two kinds of segment-based metrics: the primary metric is the segment-based micro-averaged error rate (ER), and the secondary metric is the segment-based micro-averaged F1-score. ER is the sum of substitution, deletion and insertion errors, and the F1-score is the harmonic mean of precision and recall.
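These two metrics can be illustrated from segment-level error counts (a toy sketch with made-up counts; this simplified Python mirrors the segment-based definitions, while the official numbers come from the sed_eval toolbox [13]):

```python
# Segment-based error counts over a test set (made-up numbers for illustration)
S, D, I = 10, 15, 12      # substitution, deletion and insertion errors
N_ref = 100               # number of active reference segments
TP = N_ref - S - D        # correctly detected segments

ER = (S + D + I) / N_ref  # error rate: sum of the three error types, normalized

precision = TP / (TP + I + S)   # false positives are insertions plus substitutions
recall = TP / (TP + D + S)      # false negatives are deletions plus substitutions
F1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(ER, 2), round(F1, 3))  # → 0.37 0.761
```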
Each segment-based metric is calculated in one-second segments over the entire test set. Detailed information can be found in [2]. We use the sed_eval toolbox [13] to compute the metrics.

4. Results

4.1. Experimental results
We apply the single pooling structure and the proposed hierarchical pooling structure to three types of pooling functions. The performance on the development and evaluation datasets is shown in Table 2. The percentages in parentheses represent the change rate from the single pooling structure to the hierarchical pooling structure. The proposed structure achieves remarkable improvements in all situations without adding any parameters. It is safe to conclude that the hierarchical pooling structure can significantly improve the performance of a weakly-labeled sound event detection system. Besides, the linear softmax pooling function outperforms the other pooling functions in all conditions, which agrees with the experimental results in [8].

Table 3: Comparison with other methods, in terms of ER and F1-score (%). We compare the proposed system with the following systems: (1) EMSI: 1st place in DCASE 2017; (2) Surrey: 2nd place in DCASE 2017; (3) MLMS8: 3rd place in DCASE 2017; (4) GCCaps: a capsule routing network proposed in 2018; (5) Wang: the linear softmax system proposed in 2018.

                 Development       Evaluation
                 ER      F1        ER      F1
  EMSI* [14]     0.71    47.1      0.66    55.5
  Surrey* [15]   0.72    49.7      0.73    51.8
  MLMS8* [16]    0.84    34.2      0.75    47.1
  GCCaps [17]    --      --        0.76    46.3
  Wang [8]       0.79    45.4      --      --
  Proposed       0.76    46.5      0.69    53.4

  * system using model ensemble; -- results not presented in the paper.

Figure 5 illustrates the frame-level predictions of the single and hierarchical pooling structures for the three pooling functions. In this audio clip, the sound of a train occurs from 7.574 s to 10 s. With linear and exponential softmax, the single pooling structure cannot output any positive predictions; on the contrary, the hierarchical pooling structure correctly detects the target event.
In attention pooling, the predicted probabilities of the hierarchical structure are also higher than those of the single structure where the event occurs. Besides, the linear and exponential softmax are more likely to produce deletion errors, while attention results in more insertion errors. This also complies with the analysis in [8].

4.2. Comparison with other methods
Compared with other methods, the performance of our system is also competitive. We compare the proposed system with the top three teams in the DCASE 2017 Challenge and two methods proposed in 2018. The proposed system outperforms most of these methods, except the top system in the DCASE 2017 Challenge [14]. Note that the top team utilized an ensemble of multiple systems, which significantly improved its performance. Our system achieves comparable performance without an ensemble.

5. Conclusion
In this paper, we proposed a hierarchical pooling structure to address the problem of Multi-Instance Learning. We applied this strategy to develop a weakly-labeled sound event detection system. Our proposed method effectively improves the performance of three types of pooling functions without adding any parameters. Besides, our best system achieves comparable performance with state-of-the-art systems without ensemble techniques. We believe our method can be applied in more applications of Multi-Instance Learning beyond the field of weakly labeled sound event detection.

6. References
[1] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[2] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: tasks, datasets and baseline system," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017, pp. 85–92.
[3] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Audio Set classification with attention model: A probabilistic perspective," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 316–320.
[4] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, "Large-scale weakly supervised audio classification using gated convolutional neural network," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 121–125.
[5] L. JiaKai, "Mean teacher convolution system for DCASE 2018 task 4," Detection and Classification of Acoustic Scenes and Events, 2018.
[6] Q. Kong, Y. Xu, I. Sobieraj, W. Wang, and M. D. Plumbley, "Sound event detection and time–frequency segmentation from weakly labelled data," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 4, pp. 777–787, 2019.
[7] B. McFee, J. Salamon, and J. P. Bello, "Adaptive pooling operators for weakly labeled sound event detection," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 11, pp. 2180–2193, 2018.
[8] Y. Wang and F. Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling," arXiv preprint, 2018.
[9] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 1243–1252.
[10] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 448–456.
[11] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[12] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint, 2014.
[13] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, 2016.
[14] D. Lee, S. Lee, Y. Han, and K. Lee, "Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input," Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
[15] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, "Surrey-CVSSP system for DCASE2017 challenge task4," arXiv preprint arXiv:1709.00551, 2017.
[16] J. Lee, J. Park, S. Kum, Y. Jeong, and J. Nam, "Combining multi-scale features using sample-level deep convolutional neural networks for weakly supervised sound event detection," Proc. DCASE, pp. 69–73, 2017.
[17] T. Iqbal, Y. Xu, Q. Kong, and W. Wang, "Capsule routing for sound event detection," in 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2255–2259.

A. Erratum
Comment: We found some errors in our paper, which has been published in the proceedings of Interspeech 2019. To correct the errors, we have updated the arXiv version. If you are interested in our work, please refer to the latest version on arXiv. If you have any further questions, please feel free to contact the authors. The main error in our paper is the formula for the segment-level weights ŵ_j in the hierarchical pooling structure, i.e. Equation (4) in the body of this paper.
The original formula is

ŵ_j = (Σ_{i=1+(j−1)M}^{jM} w_i) / M,   j = 1, 2, ..., N/M    (A-1)

The corrected formula is

ŵ_j = (Σ_{i=1+(j−1)M}^{jM} w_i^2) / (Σ_{i=1+(j−1)M}^{jM} w_i),   j = 1, 2, ..., N/M    (A-2)

Our motivation is that a segment-level prediction is more accurate than a frame-level prediction, and it is easier to get correct predictions when the required time resolution is coarser. So we let the ground-truth clip-level labels supervise the training of the segment-level predictions, obtaining accurate segment-level predictions first, instead of directly supervising the training of each frame. Besides, there are many other ways to obtain ŵ_j in the hierarchical pooling structure. For example, we can add two extra dense layers after the Bi-GRU to produce ŵ_j and w̃_k in Figure 4. This achieves similar effects but requires a small number of additional parameters.

We also ran some experiments based on the wrong formula in the paper, and the average ER is similar to that of the single pooling structure. However, during these experiments we found that the system performance fluctuates considerably. For example, over five runs with the attention pooling function, the ER on the evaluation dataset was 0.80, 0.85, 0.79, 0.78 and 0.80 respectively. Meanwhile, in order to locate the detected sound events, a threshold is applied to the frame-level predictions, and we use post-processing methods, including a median filter and noise removal, to get the onset and offset times of the detected events. The evaluation performance is also sensitive to the threshold and post-processing parameters. We think our system may suffer from overfitting. In future work, we will evaluate whether our proposed method is general and robust on larger datasets.

B. Appendix
Detailed mathematical derivation and analysis of all five pooling functions are given in this appendix.
The loss function we use is the cross-entropy loss:

L = −t log y − (1 − t) log(1 − y)    (A-3)

where t ∈ {0, 1} is the ground-truth label for a specific sound event in an audio clip, and y ∈ [0, 1] is the predicted clip-level probability for the same event. We decompose the gradient of L with respect to the frame-level predictions x_i and the frame-level weights w_i using the chain rule:

∂L/∂x_i = (∂L/∂y)(∂y/∂x_i),   ∂L/∂w_i = (∂L/∂y)(∂y/∂w_i)    (A-4)

Considering the term ∂L/∂y, we have:

∂L/∂y = −t/y + (1 − t)/(1 − y) = { 1/(1 − y) if t = 0;  −1/y if t = 1 }    (A-5)

It is obvious that this term is decided by the label t, so we focus on ∂y/∂x_i and ∂y/∂w_i in the following discussion. Before proceeding to the calculation, let us review the expression of y in our hierarchical pooling structure:

x̂_j = (Σ_{i=1+(j−1)M}^{jM} w_i x_i) / (Σ_{i=1+(j−1)M}^{jM} w_i),   j = 1, 2, ..., N/M    (A-6)

ŵ_j = (Σ_{i=1+(j−1)M}^{jM} w_i^2) / (Σ_{i=1+(j−1)M}^{jM} w_i),   j = 1, 2, ..., N/M    (A-7)

y = (Σ_{j=1}^{N/M} ŵ_j x̂_j) / (Σ_{j=1}^{N/M} ŵ_j)    (A-8)

So y is a weighted sum of the x̂_j with weights ŵ_j.

∂y/∂x_i = Σ_{l=1}^{N/M} [ (∂y/∂x̂_l)(∂x̂_l/∂x_i) + (∂y/∂ŵ_l)(∂ŵ_l/∂x_i) ]
        = (∂y/∂x̂_j)(∂x̂_j/∂x_i) + (∂y/∂ŵ_j)(∂ŵ_j/∂x_i),   j = ⌈i/M⌉    (A-9)

The four components are calculated as follows:

∂y/∂x̂_j = ŵ_j / Σ_{l=1}^{N/M} ŵ_l    (A-10)

∂x̂_j/∂x_i = w_i / Σ_{n=1+(j−1)M}^{jM} w_n + [(x_i − x̂_j) / Σ_{n=1+(j−1)M}^{jM} w_n] · (∂w_i/∂x_i)    (A-11)

∂y/∂ŵ_j = (x̂_j − y) / Σ_{l=1}^{N/M} ŵ_l    (A-12)

∂ŵ_j/∂x_i = (∂ŵ_j/∂w_i)(∂w_i/∂x_i) = [(2w_i − ŵ_j) / Σ_{n=1+(j−1)M}^{jM} w_n] · (∂w_i/∂x_i)    (A-13)

Here, ∂w_i/∂x_i depends on the choice of pooling function. Hence we summarize:

∂y/∂x_i = { ŵ_j w_i + [(x_i − x̂_j) ŵ_j + (x̂_j − y)(2w_i − ŵ_j)] ∂w_i/∂x_i } / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )    (A-14)

In the case of the average pooling function,

w_i = 1/N,   ∂w_i/∂x_i = 0    (A-15)

∂y/∂x_i = ŵ_j w_i / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n ) = 1/N    (A-16)

In the case of the max pooling function,

w_i = 1 if i = argmax_i x_i, else 0    (A-17)

so we have:

∂w_i/∂x_i = 0    (A-18)

∂y/∂x_i = 1 if i = argmax_i x_i, else 0    (A-19)

In the case of the linear softmax pooling function,

w_i = x_i,   ∂w_n/∂x_i = 1 if n = i, else 0    (A-20)

∂y/∂x_i = [ ŵ_j w_i + (x_i − x̂_j) ŵ_j + (x̂_j − y)(2w_i − ŵ_j) ] / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )
        = [ x_i (4x̂_j − 2y) − 2x̂_j^2 + y x̂_j ] / ( Σ_{l=1}^{N/M} x̂_l · Σ_{n=1+(j−1)M}^{jM} x_n )    (A-21)

In the case of the exponential softmax pooling function,

w_i = exp(x_i),   ∂w_n/∂x_i = exp(x_i) if n = i, else 0    (A-22)

∂y/∂x_i = { ŵ_j w_i + [(x_i − x̂_j) ŵ_j + (x̂_j − y)(2w_i − ŵ_j)] exp(x_i) } / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )
        = [ ŵ_j (1 + x_i − 2x̂_j + y) + 2 exp(x_i)(x̂_j − y) ] exp(x_i) / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} exp(x_n) )    (A-23)

In the case of the attention pooling function, w_i is decided by the input u of the last dense layer instead of by x_i, so

∂w_n/∂x_i = 0    (A-24)

∂y/∂x_i = ŵ_j w_i / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )    (A-25)

In this case, we should consider the term ∂y/∂w_i as well. It is calculated as follows:

∂y/∂w_i = Σ_{l=1}^{N/M} [ (∂y/∂ŵ_l)(∂ŵ_l/∂w_i) + (∂y/∂x̂_l)(∂x̂_l/∂w_i) ]
        = (∂y/∂ŵ_j)(∂ŵ_j/∂w_i) + (∂y/∂x̂_j)(∂x̂_j/∂w_i)
        = [ (x̂_j − y)(2w_i − ŵ_j) + ŵ_j (x_i − x̂_j) ] / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )
        = [ 2w_i (x̂_j − y) + ŵ_j (x_i + y − 2x̂_j) ] / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )    (A-26)

The single pooling structure can be considered a special case of the hierarchical pooling structure in which ŵ_j = w_i and x̂_j = x_i.
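The linear softmax case (A-21), which is Equation (8) in the body of the paper, can be verified against a central finite difference (a verification sketch, not from the paper; the frame values are made up):

```python
import numpy as np

def clip_prob(x, M):
    """Hierarchical linear softmax, Eqs. (3)-(5) with w_i = x_i (so w_hat == x_hat)."""
    seg = x.reshape(-1, M)
    x_hat = (seg**2).sum(axis=1) / seg.sum(axis=1)
    return (x_hat**2).sum() / x_hat.sum()

x = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8])
M, i = 4, 1                      # check the gradient w.r.t. x_1 (0-based frame index)
j = i // M                       # index of the segment containing frame i

seg = x.reshape(-1, M)
x_hat = (seg**2).sum(axis=1) / seg.sum(axis=1)
y = (x_hat**2).sum() / x_hat.sum()

# Analytic gradient, Eq. (A-21)
analytic = (x[i]*(4*x_hat[j] - 2*y) - 2*x_hat[j]**2 + y*x_hat[j]) \
           / (x_hat.sum() * seg[j].sum())

# Central finite difference approximation of dy/dx_i
eps = 1e-6
xp, xm = x.copy(), x.copy()
xp[i] += eps
xm[i] -= eps
numeric = (clip_prob(xp, M) - clip_prob(xm, M)) / (2 * eps)
assert abs(analytic - numeric) < 1e-6  # the two gradients agree
```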
According to the analysis above, it is easy to see that the proposed hierarchical pooling structure makes no difference when applied to the max and average pooling functions, so we only analyze the other three pooling functions in our paper. As shown in the above results, the segment-level prediction x̂_j also contributes to the weight updating during training, so we believe this kind of structure can give better supervision for neural network learning.
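The closing remark — that the single structure is the special case ŵ_j = w_i, x̂_j = x_i — can be confirmed numerically by setting the segment length to M = 1 (a sketch for the linear softmax choice w_i = x_i, with made-up frame values):

```python
import numpy as np

def hierarchical_linear_softmax(x, M):
    """Eqs. (3)-(5) with w_i = x_i; for linear softmax Eq. (4) gives w_hat == x_hat."""
    seg = x.reshape(-1, M)
    x_hat = (seg**2).sum(axis=1) / seg.sum(axis=1)
    return (x_hat**2).sum() / x_hat.sum()

x = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8])
single = (x**2).sum() / x.sum()   # single-structure linear softmax, Table 1

# With M = 1 every "segment" is one frame, so the hierarchy collapses to the single case:
assert np.isclose(hierarchical_linear_softmax(x, 1), single)

# With M > 1 the two structures generally differ:
print(round(single, 4), round(hierarchical_linear_softmax(x, 4), 4))  # → 0.65 0.6538
```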