Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection
Authors: Ke-Xin He, Yu-Han Shen, Wei-Qiang Zhang
Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection

Ke-Xin He*, Yu-Han Shen*, Wei-Qiang Zhang
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
hekexinchn@163.com, yhshen@hotmail.com, wqzhang@tsinghua.edu.cn

Abstract
Sound event detection with weakly labeled data is considered a multi-instance learning problem, and the choice of pooling function is key to solving it. In this paper, we propose a hierarchical pooling structure to improve the performance of weakly labeled sound event detection systems. The proposed pooling structure yields remarkable improvements for three types of pooling function without adding any parameters. Moreover, using the hierarchical pooling structure, our system achieves competitive performance on Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge.

Index Terms: sound event detection, weakly-labeled data, pooling function, hierarchical structure

1. Introduction
The aim of sound event detection (SED) is to detect what types of sound events occur in an audio stream and, furthermore, to locate the onset and offset times of those events. Traditional approaches to SED depend on strongly labeled data, which provide the type and timestamps (onset and offset times) of each sound event occurrence. Such annotation is too costly to acquire, so many researchers have begun to focus on detecting sound events using weakly labeled training data. A weak label means that the training data are annotated only with the presence of sound events; no timestamps are provided. Google released the weakly labeled Audio Set [1] in 2017, which boosted the development of the relevant research community.
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge launched a task of large-scale weakly supervised sound event detection for smart cars [2], which employed a subset of Audio Set. Common solutions to SED with weak labels are based on Multi-Instance Learning (MIL). In MIL, the ground-truth label of each instance is unknown; instead, we only know the ground-truth labels of bags, each containing many instances. A bag is labeled negative if all of its instances are negative; a bag is labeled positive if at least one instance in it is positive. As shown in Figure 1, in MIL for SED, an audio clip can be considered a bag consisting of several frames. For a specific class of sound events, a clip is labeled positive if the target sound event occurs in at least one frame.

To solve the MIL problem for SED, we usually use neural networks to predict the probability of each sound event class occurring in each frame. Then, we need to aggregate the frame-level probabilities into a clip-level probability for each class of sound events. Standard approaches to aggregating the probabilities include max-pooling and average-pooling, and there are also many variants and developments. Kong et al. [3] proposed an attention model as the pooling function, which has been adopted in many works [4, 5, 6]. McFee et al. [7] proposed a family of adaptive pooling operators. Wang et al. [8] compared five pooling functions for SED with weak labeling.

* The first two authors contributed equally. The corresponding author is Wei-Qiang Zhang. This work was supported by the National Natural Science Foundation of China under Grant No. U1836219.

Figure 1: Illustration of a Multi-Instance Learning system for sound event detection with weakly labeled data.
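The bag-labeling rule above can be sketched in a few lines of Python (a toy illustration, not part of the paper's system; the frame labels are made up and, in the weakly labeled setting, would be unobserved at training time):

```python
# Frame-level ground truth for one event class (1 = event present in that frame).
# Under weak labeling these frame labels are hidden; only the clip (bag) label is known.
frame_labels = [0, 0, 1, 1, 0, 0, 0, 0]

# MIL bag rule: a clip is positive iff at least one of its frames is positive.
clip_label = int(any(frame_labels))
print(clip_label)  # → 1
```

A clip whose frame labels are all zero would likewise yield `clip_label == 0`, matching the negative-bag rule.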
In this paper, we propose a hierarchical pooling structure to give better supervision for neural network learning. The proposed pooling structure improves the performance of three types of pooling functions without any added parameters. We evaluate our methods on DCASE 2017 Challenge Task 4, and our model shows excellent performance.

2. Methods

2.1. Baseline System
The mainstream Convolutional Recurrent Neural Network (CRNN) system is implemented as our baseline. An overview of the baseline system is given in Figure 2. We use the log mel spectrogram as the acoustic feature. The input feature passes through several convolutional layers, a Bi-directional Gated Recurrent Unit (Bi-GRU) and a dense layer with sigmoid activation to produce frame-level presence probabilities for each sound event class. The architecture of the neural networks in our work is similar to that in [4]. As shown in Figure 3, the Convolutional Neural Network (CNN) part consists of four convolutional blocks and a single convolutional layer. Each block contains two gated convolutional layers [9], batch normalization [10], dropout [11] and a max-pooling layer. Max-pooling is applied on both the time axis and the frequency axis. Note that the frame rate is reduced from 50 Hz to 12.5 Hz by the max-pooling operations on the time axis. The extracted features over the different convolutional channels are stacked along the frequency axis before being fed into the Recurrent Neural Network (RNN) part. The RNN part in our work is based on a Bi-GRU; the outputs of the forward and backward GRUs are concatenated to get the final outputs. The hyper-parameters are given in Figure 3.

Figure 2: Overview of the baseline system (audio → log mel spectrogram → CNNs → Bi-GRU → pooling function → timestamps for event classes such as Car, Train, Truck).

Figure 3: Architecture of the neural networks: four conv blocks (filters 32, 64, 128 and 128; pooling sizes 2×2, 2×2, 1×2 and 1×2), where each block comprises GCNN(3×3), batch normalization, GCNN(3×3), Dropout(0.2) and max-pooling; then CNN(3×3, 256), Max-Pooling(1×5), Bi-directional GRU(128×2), a dense layer (17, sigmoid) producing the frame-level predictions, and the pooling function producing the clip-level prediction. The first and second dimensions of the convolutional kernels and strides represent the time axis and the frequency axis respectively. The size of all convolutional kernels is 3×3.

Finally, a pooling function is adopted to calculate the presence probability of each sound event class in a 10-second audio clip. The choice and usage of the pooling function are explained in the remaining parts of this section. For testing, in order to locate the detected sound events, a threshold θ is applied to the frame-level predictions. Then, we use post-processing methods, including a median filter and noise removal, to get the onset and offset times of the detected events.

2.2. Pooling function
As mentioned above, the design of the pooling function is an essential issue in weakly labeled sound event detection. Wang et al. [8] made a comprehensive comparison of five pooling functions (max pooling, average pooling, linear softmax, exponential softmax and attention) in MIL for SED. These pooling functions are introduced as follows. Let x_i ∈ [0, 1] be the predicted probability of a specific event class occurring at the i-th frame. We need a pooling function to make a clip-level prediction.
Let y ∈ [0, 1] be the clip-level probability; then we have

y = (Σ_{i=1}^{N} w_i x_i) / (Σ_{i=1}^{N} w_i)    (1)

where w_i is the weight coefficient for x_i and N is the number of frames in a clip. Table 1 lists the formulas for the weight values w_i of the five types of pooling functions.

Table 1: Definition of five pooling functions

  Pooling function   Definition                                 Weight value
  Max pooling        y = max_i x_i                              w_i = 1 if i = argmax_i x_i, else 0
  Average pooling    y = (1/N) Σ_i x_i                          w_i = 1/N
  Linear softmax     y = (Σ_i x_i^2) / (Σ_i x_i)                w_i = x_i
  Exp. softmax       y = (Σ_i x_i exp(x_i)) / (Σ_i exp(x_i))    w_i = exp(x_i)
  Attention          y = (Σ_i w_i x_i) / (Σ_i w_i)              w_i = h(u)

In the case of the attention pooling function, the weight value w_i is learnt by a dense layer with softmax activation, and its input u is the same as the input of the dense layer producing x_i. It is obvious from Table 1 that w_i is a function of x_i or u, so we denote this function as

w_i = f(x_i; u)    (2)

2.3. Hierarchical pooling structure
Instead of aggregating all N frame-level predictions x_i into a clip-level prediction y at once, we first group the N frames into segments of length M and make segment-level predictions x̂_j. At the same time, the weight values w_i are weighted-averaged, using themselves as weights, to obtain segment-level weights ŵ_j. Finally, we use the segment-level predictions x̂_j and weights ŵ_j to get the clip-level prediction y. The entire process is described by the following formulas.
x̂_j = (Σ_{i=1+(j−1)M}^{jM} w_i x_i) / (Σ_{i=1+(j−1)M}^{jM} w_i),   j = 1, 2, ..., N/M    (3)

ŵ_j = (Σ_{i=1+(j−1)M}^{jM} w_i^2) / (Σ_{i=1+(j−1)M}^{jM} w_i),   j = 1, 2, ..., N/M    (4)

y = (Σ_{j=1}^{N/M} ŵ_j x̂_j) / (Σ_{j=1}^{N/M} ŵ_j)    (5)

Figure 4: Three-stage hierarchical pooling structure (output of Bi-GRU, shape (125, 256) → frame-level predictions, shape (125, 17) → segment-level predictions, shape (25, 17) → longer-segment-level predictions, shape (5, 17) → clip-level prediction, shape (17)). In linear and exponential softmax pooling, the frame-level weights w_i derive from the frame-level predictions x_i; in attention pooling, they are learnt from the output of the Bi-GRU. In the first stage, every five frames are aggregated to get segment-level predictions x̂_j, and the weights of every five frames are averaged to get segment-level weights ŵ_j. In the second stage, every five segments are aggregated to get longer-segment-level predictions x̃_k, and every five segment-level weights are averaged to get longer-segment-level weights w̃_k. In the end, x̃_k and w̃_k are aggregated to get the final clip-level prediction.

2.4. Analysis of the hierarchical pooling structure
Before we discuss this structure in depth, we would like to state a proposition: the accuracy of x̂_j is higher than that of x_i in a well-trained system. This proposition is intuitively reasonable because it is easier for the system to output correct predictions when the required time resolution becomes coarser. According to the theoretical discussion in [8], the process of weight updating is related to ∂y/∂x_i and ∂y/∂w_i. We take the linear softmax pooling function as an example to interpret the effect of the proposed pooling structure.
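Before working through the gradients, Equations (1)–(5) and the pooling functions of Table 1 can be made concrete with a small NumPy sketch (a toy illustration, not the paper's code: the frame probabilities are made up, and the learnt attention weights h(u) are replaced by a fixed vector):

```python
import numpy as np

x = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8])  # frame-level predictions x_i

# --- Single pooling structure: Eq. (1) with the weights of Table 1 ---
max_pool = x.max()
avg_pool = x.mean()
linear_softmax = (x**2).sum() / x.sum()                   # w_i = x_i
exp_softmax = (x * np.exp(x)).sum() / np.exp(x).sum()     # w_i = exp(x_i)
h_u = np.array([0.2, 0.5, 0.1, 0.2, 0.3, 0.1, 0.4, 0.2])  # stand-in for the learnt h(u)
attention = (h_u * x).sum() / h_u.sum()

# --- Hierarchical pooling structure: Eqs. (3)-(5), linear softmax case (w_i = x_i) ---
M = 4                                                     # segment length
w = x                                                     # linear softmax weights
seg_w, seg_x = w.reshape(-1, M), x.reshape(-1, M)
x_hat = (seg_w * seg_x).sum(axis=1) / seg_w.sum(axis=1)   # Eq. (3)
w_hat = (seg_w**2).sum(axis=1) / seg_w.sum(axis=1)        # Eq. (4)
y = (w_hat * x_hat).sum() / w_hat.sum()                   # Eq. (5)
print(x_hat, w_hat, round(y, 4))  # → [0.7 0.6] [0.7 0.6] 0.6538
```

Note that for linear softmax, Eq. (4) makes ŵ_j equal to x̂_j, so the clip-level prediction is itself a linear softmax over the segment-level predictions.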
In the case of the normal single pooling structure, w_i = x_i and

∂y/∂x_i = (2x_i − y) / Σ_{k=1}^{N} x_k    (6)

In the case of the hierarchical pooling structure,

ŵ_j = (Σ_{i=1+(j−1)M}^{jM} w_i^2) / (Σ_{i=1+(j−1)M}^{jM} w_i) = (Σ_{i=1+(j−1)M}^{jM} x_i^2) / (Σ_{i=1+(j−1)M}^{jM} x_i)    (7)

∂y/∂x_i = Σ_{l=1}^{N/M} [ (∂y/∂x̂_l)(∂x̂_l/∂x_i) + (∂y/∂ŵ_l)(∂ŵ_l/∂x_i) ]
        = [ x_i (4x̂_j − 2y) − 2x̂_j^2 + y x̂_j ] / ( Σ_{l=1}^{N/M} x̂_l · Σ_{n=1+(j−1)M}^{jM} x_n ),   j = ⌈i/M⌉    (8)

As shown in Equations 6 and 8, compared with the single pooling structure, the segment-level prediction x̂_j also contributes to the update of the frame-level prediction x_i in the hierarchical pooling structure. As the segment-level prediction is more accurate than the frame-level prediction, we believe the proposed hierarchical pooling structure can provide better supervision for neural network learning. Detailed mathematical derivation and analysis of all five pooling functions are available in the appendix. We prove there that the proposed structure makes no difference for max and average pooling, so we conducted our experiments using the other three pooling functions.

The hierarchical pooling structure used in our work is illustrated in Figure 4. It is a three-stage pooling structure: the number of predicted probabilities for a given class of sound events in an audio clip decreases from 125 to 25, then to 5, and finally to 1.

3. Experiments

3.1. Dataset
We conducted our experiments on Task 4 of the DCASE 2017 Challenge [2]. This task contains 17 classes of sound events. The dataset is a subset of Audio Set [1]. The training set has weak labels denoting the presence of a given sound event in the video's soundtrack; no timestamps are provided. For testing and evaluation, strong labels with timestamps are provided for the purpose of evaluating performance.

3.2. Experimental Setup
To extract the log mel spectrogram feature, each audio clip is divided into frames of 40 ms duration with 50% overlap.
The input of our system is a 500 × 80 matrix, where 500 denotes the number of frames and 80 the number of mel-filter bins. Our model is trained using the Adam optimizer [12]. The initial learning rate is 0.001 and the mini-batch size is 128. The loss function is categorical cross-entropy based on clip-level labels. We use an early-stopping strategy: training stops when the validation loss has not improved for 10 epochs.

Table 2: Performance of the single and hierarchical pooling structures, in terms of ER (lower is better) and F1-score (%) (higher is better).

Development set:
          Single Pooling Structure                     Hierarchical Pooling Structure
          Sub.  Del.  Ins.  ER    Pre.   Rec.   F1     Sub.  Del.  Ins.  ER            Pre.   Rec.   F1
  Linear  0.25  0.18  0.36  0.79  39.00  47.01  42.63  0.19  0.40  0.17  0.76 (3.8%↓)  53.07  41.31  46.46 (9.0%↑)
  Exp.    0.29  0.35  0.18  0.82  44.67  37.24  40.62  0.27  0.26  0.26  0.79 (3.7%↓)  45.90  45.72  45.81 (12.8%↑)
  Att.    0.30  0.34  0.19  0.83  44.68  36.97  40.46  0.25  0.33  0.21  0.79 (4.8%↓)  48.17  42.51  45.16 (11.6%↑)

Evaluation set:
  Linear  0.21  0.36  0.18  0.76  53.40  43.19  47.76  0.19  0.30  0.20  0.69 (9.2%↓)  56.39  50.78  53.44 (11.8%↑)
  Exp.    0.23  0.35  0.23  0.81  48.35  43.78  45.95  0.23  0.28  0.22  0.73 (9.9%↓)  53.40  51.38  52.37 (14.0%↑)
  Att.    0.21  0.31  0.27  0.79  46.12  44.43  45.26  0.21  0.28  0.24  0.73 (7.6%↓)  52.27  50.58  51.41 (13.6%↑)

Figure 5: Frame-level predictions of the three systems on an evaluation audio clip: (a) predictions of the linear softmax system, (b) predictions of the exponential softmax system, (c) predictions of the attention system (probability vs. time in seconds).

3.3. Metrics
According to the official instructions of the DCASE 2017 Challenge [2], our method is evaluated with two kinds of segment-based metrics: the primary metric is the segment-based micro-averaged error rate (ER), and the secondary metric is the segment-based micro-averaged F1-score. ER is the sum of substitution, deletion and insertion errors, and the F1-score is the harmonic mean of precision and recall.
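These two metrics can be illustrated from segment-level error counts (a toy sketch with made-up counts; this simplified Python mirrors the segment-based definitions, while the official numbers come from the sed_eval toolbox [13]):

```python
# Segment-based error counts over a test set (made-up numbers for illustration)
S, D, I = 10, 15, 12      # substitution, deletion and insertion errors
N_ref = 100               # number of active reference segments
TP = N_ref - S - D        # correctly detected segments

ER = (S + D + I) / N_ref  # error rate: sum of the three error types, normalized

precision = TP / (TP + I + S)   # false positives are insertions plus substitutions
recall = TP / (TP + D + S)      # false negatives are deletions plus substitutions
F1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(ER, 2), round(F1, 3))  # → 0.37 0.761
```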
Each segment-based metric is calculated in one-second segments over the entire test set. Detailed information can be found in [2]. We use the sed_eval toolbox [13] to compute the metrics.

4. Results

4.1. Experimental results
We apply the single pooling structure and the proposed hierarchical pooling structure to three types of pooling functions. The performance on the development and evaluation datasets is shown in Table 2. The percentages in parentheses represent the change rate from the single pooling structure to the hierarchical pooling structure. The proposed structure achieves remarkable improvements in all situations without adding any parameters. It is safe to conclude that the hierarchical pooling structure can significantly improve the performance of a weakly-labeled sound event detection system. Besides, the linear softmax pooling function outperforms the other pooling functions in all conditions, which agrees with the experimental results in [8].

Table 3: Comparison with other methods, in terms of ER and F1-score (%). We compare the proposed system with the following systems: (1) EMSI: 1st place in DCASE 2017; (2) Surrey: 2nd place in DCASE 2017; (3) MLMS8: 3rd place in DCASE 2017; (4) GCCaps: a capsule routing network proposed in 2018; (5) Wang: the linear softmax system proposed in 2018.

                 Development       Evaluation
                 ER      F1        ER      F1
  EMSI* [14]     0.71    47.1      0.66    55.5
  Surrey* [15]   0.72    49.7      0.73    51.8
  MLMS8* [16]    0.84    34.2      0.75    47.1
  GCCaps [17]    --      --        0.76    46.3
  Wang [8]       0.79    45.4      --      --
  Proposed       0.76    46.5      0.69    53.4

  * system using model ensemble; -- results not presented in the paper.

Figure 5 illustrates the frame-level predictions of the single and hierarchical pooling structures for the three pooling functions. In this audio clip, the sound of a train occurs from 7.574 s to 10 s. With linear and exponential softmax, the single pooling structure cannot output any positive predictions; on the contrary, the hierarchical pooling structure correctly detects the target event.
In attention pooling, the predicted probabilities of the hierarchical structure are also higher than those of the single structure where the event occurs. Besides, the linear and exponential softmax are more likely to produce deletion errors, while attention results in more insertion errors. This also complies with the analysis in [8].

4.2. Comparison with other methods
Compared with other methods, the performance of our system is also competitive. We compare the proposed system with the top three teams in the DCASE 2017 Challenge and two methods proposed in 2018. The proposed system outperforms most of these methods, except the top system in the DCASE 2017 Challenge [14]. Note that the top team utilized an ensemble of multiple systems, which significantly improved its performance. Our system achieves comparable performance without an ensemble.

5. Conclusion
In this paper, we proposed a hierarchical pooling structure to address the problem of Multi-Instance Learning. We applied this strategy to develop a weakly-labeled sound event detection system. Our proposed method effectively improves the performance of three types of pooling functions without adding any parameters. Besides, our best system achieves comparable performance with state-of-the-art systems without ensemble techniques. We believe our method can be applied in more applications of Multi-Instance Learning beyond the field of weakly labeled sound event detection.

6. References
[1] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[2] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: tasks, datasets and baseline system," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017, pp. 85–92.
[3] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Audio Set classification with attention model: A probabilistic perspective," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 316–320.
[4] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, "Large-scale weakly supervised audio classification using gated convolutional neural network," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 121–125.
[5] L. JiaKai, "Mean teacher convolution system for DCASE 2018 task 4," Detection and Classification of Acoustic Scenes and Events, 2018.
[6] Q. Kong, Y. Xu, I. Sobieraj, W. Wang, and M. D. Plumbley, "Sound event detection and time–frequency segmentation from weakly labelled data," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 4, pp. 777–787, 2019.
[7] B. McFee, J. Salamon, and J. P. Bello, "Adaptive pooling operators for weakly labeled sound event detection," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 11, pp. 2180–2193, 2018.
[8] Y. Wang and F. Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling," arXiv preprint, 2018.
[9] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 1243–1252.
[10] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 448–456.
[11] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[12] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint, 2014.
[13] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, 2016.
[14] D. Lee, S. Lee, Y. Han, and K. Lee, "Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input," Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
[15] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, "Surrey-CVSSP system for DCASE2017 challenge task4," arXiv preprint arXiv:1709.00551, 2017.
[16] J. Lee, J. Park, S. Kum, Y. Jeong, and J. Nam, "Combining multi-scale features using sample-level deep convolutional neural networks for weakly supervised sound event detection," Proc. DCASE, pp. 69–73, 2017.
[17] T. Iqbal, Y. Xu, Q. Kong, and W. Wang, "Capsule routing for sound event detection," in 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2255–2259.

A. Erratum
Comment: We found some errors in our paper, which has been published in the proceedings of Interspeech 2019. To correct the errors, we have updated the arXiv version. If you are interested in our work, please refer to the latest version on arXiv. If you have any further questions, please feel free to contact the authors. The main error in our paper is the formula for the segment-level weights ŵ_j in the hierarchical pooling structure, i.e. Equation (4) in the body of this paper.
The original formula is

ŵ_j = (Σ_{i=1+(j−1)M}^{jM} w_i) / M,   j = 1, 2, ..., N/M    (A-1)

The corrected formula is

ŵ_j = (Σ_{i=1+(j−1)M}^{jM} w_i^2) / (Σ_{i=1+(j−1)M}^{jM} w_i),   j = 1, 2, ..., N/M    (A-2)

Our motivation is that a segment-level prediction is more accurate than a frame-level prediction, and it is easier to get correct predictions when the required time resolution is coarser. So we let the ground-truth clip-level labels supervise the training of the segment-level predictions, obtaining accurate segment-level predictions first, instead of directly supervising the training of each frame. Besides, there are many other ways to obtain ŵ_j in the hierarchical pooling structure. For example, we can add two extra dense layers after the Bi-GRU to produce ŵ_j and w̃_k in Figure 4. This achieves similar effects but requires a small number of additional parameters.

We also ran some experiments based on the wrong formula in the paper, and the average ER is similar to that of the single pooling structure. However, during these experiments we found that the system performance fluctuates considerably. For example, over five runs with the attention pooling function, the ER on the evaluation dataset was 0.80, 0.85, 0.79, 0.78 and 0.80 respectively. Meanwhile, in order to locate the detected sound events, a threshold is applied to the frame-level predictions, and we use post-processing methods, including a median filter and noise removal, to get the onset and offset times of the detected events. The evaluation performance is also sensitive to the threshold and post-processing parameters. We think our system may suffer from overfitting. In future work, we will evaluate whether our proposed method is general and robust on larger datasets.

B. Appendix
Detailed mathematical derivation and analysis of all five pooling functions are given in this appendix.
The loss function we use is the cross-entropy loss:

L = −t log y − (1 − t) log(1 − y)    (A-3)

where t ∈ {0, 1} is the ground-truth label for a specific sound event in an audio clip, and y ∈ [0, 1] is the predicted clip-level probability for the same event. We decompose the gradient of L with respect to the frame-level predictions x_i and the frame-level weights w_i using the chain rule:

∂L/∂x_i = (∂L/∂y)(∂y/∂x_i),   ∂L/∂w_i = (∂L/∂y)(∂y/∂w_i)    (A-4)

Considering the term ∂L/∂y, we have:

∂L/∂y = −t/y + (1 − t)/(1 − y) = { 1/(1 − y) if t = 0;  −1/y if t = 1 }    (A-5)

It is obvious that this term is decided by the label t, so we focus on ∂y/∂x_i and ∂y/∂w_i in the following discussion. Before proceeding to the calculation, let us review the expression of y in our hierarchical pooling structure:

x̂_j = (Σ_{i=1+(j−1)M}^{jM} w_i x_i) / (Σ_{i=1+(j−1)M}^{jM} w_i),   j = 1, 2, ..., N/M    (A-6)

ŵ_j = (Σ_{i=1+(j−1)M}^{jM} w_i^2) / (Σ_{i=1+(j−1)M}^{jM} w_i),   j = 1, 2, ..., N/M    (A-7)

y = (Σ_{j=1}^{N/M} ŵ_j x̂_j) / (Σ_{j=1}^{N/M} ŵ_j)    (A-8)

So y is a weighted sum of the x̂_j with weights ŵ_j.

∂y/∂x_i = Σ_{l=1}^{N/M} [ (∂y/∂x̂_l)(∂x̂_l/∂x_i) + (∂y/∂ŵ_l)(∂ŵ_l/∂x_i) ]
        = (∂y/∂x̂_j)(∂x̂_j/∂x_i) + (∂y/∂ŵ_j)(∂ŵ_j/∂x_i),   j = ⌈i/M⌉    (A-9)

The four components are calculated as follows:

∂y/∂x̂_j = ŵ_j / Σ_{l=1}^{N/M} ŵ_l    (A-10)

∂x̂_j/∂x_i = w_i / Σ_{n=1+(j−1)M}^{jM} w_n + [(x_i − x̂_j) / Σ_{n=1+(j−1)M}^{jM} w_n] · (∂w_i/∂x_i)    (A-11)

∂y/∂ŵ_j = (x̂_j − y) / Σ_{l=1}^{N/M} ŵ_l    (A-12)

∂ŵ_j/∂x_i = (∂ŵ_j/∂w_i)(∂w_i/∂x_i) = [(2w_i − ŵ_j) / Σ_{n=1+(j−1)M}^{jM} w_n] · (∂w_i/∂x_i)    (A-13)

Here, ∂w_i/∂x_i depends on the choice of pooling function. Hence we summarize:

∂y/∂x_i = { ŵ_j w_i + [(x_i − x̂_j) ŵ_j + (x̂_j − y)(2w_i − ŵ_j)] ∂w_i/∂x_i } / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )    (A-14)

In the case of the average pooling function,

w_i = 1/N,   ∂w_i/∂x_i = 0    (A-15)

∂y/∂x_i = ŵ_j w_i / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n ) = 1/N    (A-16)

In the case of the max pooling function,

w_i = 1 if i = argmax_i x_i, else 0    (A-17)

so we have:

∂w_i/∂x_i = 0    (A-18)

∂y/∂x_i = 1 if i = argmax_i x_i, else 0    (A-19)

In the case of the linear softmax pooling function,

w_i = x_i,   ∂w_n/∂x_i = 1 if n = i, else 0    (A-20)

∂y/∂x_i = [ ŵ_j w_i + (x_i − x̂_j) ŵ_j + (x̂_j − y)(2w_i − ŵ_j) ] / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )
        = [ x_i (4x̂_j − 2y) − 2x̂_j^2 + y x̂_j ] / ( Σ_{l=1}^{N/M} x̂_l · Σ_{n=1+(j−1)M}^{jM} x_n )    (A-21)

In the case of the exponential softmax pooling function,

w_i = exp(x_i),   ∂w_n/∂x_i = exp(x_i) if n = i, else 0    (A-22)

∂y/∂x_i = { ŵ_j w_i + [(x_i − x̂_j) ŵ_j + (x̂_j − y)(2w_i − ŵ_j)] exp(x_i) } / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )
        = [ ŵ_j (1 + x_i − 2x̂_j + y) + 2 exp(x_i)(x̂_j − y) ] exp(x_i) / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} exp(x_n) )    (A-23)

In the case of the attention pooling function, w_i is decided by the input u of the last dense layer instead of by x_i, so

∂w_n/∂x_i = 0    (A-24)

∂y/∂x_i = ŵ_j w_i / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )    (A-25)

In this case, we should consider the term ∂y/∂w_i as well. It is calculated as follows:

∂y/∂w_i = Σ_{l=1}^{N/M} [ (∂y/∂ŵ_l)(∂ŵ_l/∂w_i) + (∂y/∂x̂_l)(∂x̂_l/∂w_i) ]
        = (∂y/∂ŵ_j)(∂ŵ_j/∂w_i) + (∂y/∂x̂_j)(∂x̂_j/∂w_i)
        = [ (x̂_j − y)(2w_i − ŵ_j) + ŵ_j (x_i − x̂_j) ] / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )
        = [ 2w_i (x̂_j − y) + ŵ_j (x_i + y − 2x̂_j) ] / ( Σ_{l=1}^{N/M} ŵ_l · Σ_{n=1+(j−1)M}^{jM} w_n )    (A-26)

The single pooling structure can be considered a special case of the hierarchical pooling structure in which ŵ_j = w_i and x̂_j = x_i.
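The linear softmax case (A-21), which is Equation (8) in the body of the paper, can be verified against a central finite difference (a verification sketch, not from the paper; the frame values are made up):

```python
import numpy as np

def clip_prob(x, M):
    """Hierarchical linear softmax, Eqs. (3)-(5) with w_i = x_i (so w_hat == x_hat)."""
    seg = x.reshape(-1, M)
    x_hat = (seg**2).sum(axis=1) / seg.sum(axis=1)
    return (x_hat**2).sum() / x_hat.sum()

x = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8])
M, i = 4, 1                      # check the gradient w.r.t. x_1 (0-based frame index)
j = i // M                       # index of the segment containing frame i

seg = x.reshape(-1, M)
x_hat = (seg**2).sum(axis=1) / seg.sum(axis=1)
y = (x_hat**2).sum() / x_hat.sum()

# Analytic gradient, Eq. (A-21)
analytic = (x[i]*(4*x_hat[j] - 2*y) - 2*x_hat[j]**2 + y*x_hat[j]) \
           / (x_hat.sum() * seg[j].sum())

# Central finite difference approximation of dy/dx_i
eps = 1e-6
xp, xm = x.copy(), x.copy()
xp[i] += eps
xm[i] -= eps
numeric = (clip_prob(xp, M) - clip_prob(xm, M)) / (2 * eps)
assert abs(analytic - numeric) < 1e-6  # the two gradients agree
```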
According to the analysis above, it is easy to see that the proposed hierarchical pooling structure makes no difference when applied to the max and average pooling functions, so we only analyze the other three pooling functions in our paper. As shown in the above results, the segment-level prediction x̂_j also contributes to the weight updating during training, so we believe this kind of structure can give better supervision for neural network learning.
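The closing remark — that the single structure is the special case ŵ_j = w_i, x̂_j = x_i — can be confirmed numerically by setting the segment length to M = 1 (a sketch for the linear softmax choice w_i = x_i, with made-up frame values):

```python
import numpy as np

def hierarchical_linear_softmax(x, M):
    """Eqs. (3)-(5) with w_i = x_i; for linear softmax Eq. (4) gives w_hat == x_hat."""
    seg = x.reshape(-1, M)
    x_hat = (seg**2).sum(axis=1) / seg.sum(axis=1)
    return (x_hat**2).sum() / x_hat.sum()

x = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8])
single = (x**2).sum() / x.sum()   # single-structure linear softmax, Table 1

# With M = 1 every "segment" is one frame, so the hierarchy collapses to the single case:
assert np.isclose(hierarchical_linear_softmax(x, 1), single)

# With M > 1 the two structures generally differ:
print(round(single, 4), round(hierarchical_linear_softmax(x, 4), 4))  # → 0.65 0.6538
```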