SAM-GCNN: A Gated Convolutional Neural Network with Segment-Level Attention Mechanism for Home Activity Monitoring


Authors: Yu-Han Shen, Ke-Xin He, Wei-Qiang Zhang

Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
yhshen@hotmail.com, hekexinchn@163.com, wqzhang@tsinghua.edu.cn

Abstract—In this paper, we propose a method for home activity monitoring. We demonstrate our model on the dataset of Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Challenge Task 5. This task aims to classify multi-channel audio into one of the provided pre-defined classes. All of these classes are daily activities performed in a home environment. To tackle this task, we propose a gated convolutional neural network with segment-level attention mechanism (SAM-GCNN). The proposed framework is a convolutional model with two auxiliary modules: a gated convolutional neural network and a segment-level attention mechanism. Furthermore, we adopt model ensemble to enhance the generalization capability of our model. We evaluated our work on the development dataset of DCASE 2018 Task 5 and achieved competitive performance, with the macro-averaged F1-score increasing from 83.76% to 89.33% compared with the convolutional baseline system.

Index Terms—acoustic activity classification, gated convolutional neural network, attention mechanism, model ensemble, DCASE

I. INTRODUCTION

Recently, sound event detection and classification has become increasingly popular in the field of acoustic signal processing, with wide applications in security surveillance, wildlife protection and smart homes. One important application of sound event classification in the smart home is home activity monitoring.

The Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge is one of the most important international challenges concerning acoustic event detection and classification and has been organized for several years.
The DCASE 2018 Challenge consists of five tasks, and we focus on Task 5 [1]. This task evaluates systems for monitoring domestic activities based on multi-channel acoustics. We can also refer to this task as acoustic activity classification. The main procedure of acoustic activity classification consists of four parts: pre-processing, extracting acoustic features, designing acoustic models as classifiers, and post-processing.

(This work was supported by the National Natural Science Foundation of China under Grant No. U183620001. The corresponding author is Wei-Qiang Zhang.)

In the pre-processing stage, different methods of data augmentation have been utilized [2] [3]. Data imbalance is a big challenge in acoustic event classification and detection, because different events may occur at completely imbalanced frequencies. In DCASE 2018 Challenge Task 5, Inoue et al. used shuffling and mixing to produce more training samples [2], and Tanabe et al. utilized dereverberation, blind source separation and data augmentation to improve the quality of the audio clips [3].

Mel Frequency Cepstral Coefficients (MFCC) are a common traditional acoustic feature and have been widely used, but log Mel-scale filter bank energies (fbank) have become more popular recently, and many works are based on fbank [1] [4] [5].

In recent years, Convolutional Neural Networks (CNNs) have achieved great success in many fields such as character recognition, image classification and speaker recognition, and many works based on CNNs have been done in acoustic event classification and detection [6] [7]. Besides, some researchers combined CNNs with Recurrent Neural Networks (RNNs) to capture the temporal context of audio signals for further improvements [4] [5].

Attention models have been widely used in image classification, object detection and natural language understanding. In the field of acoustic signal processing, Xu et al.
[8] proposed an attention model for weakly supervised audio tagging, and Kong et al. [9] improved this work by giving it a probabilistic perspective. Their work is based on the assumption that irrelevant sound frames such as background noise and silences should be ignored and given less attention. Both of their models are realized by a weighted sum over frames, where the attention values are automatically learned by a neural network.

In our work, acoustic activities may last for a longer period, and a single frame is not enough to identify whether it should be ignored. In an audio recording, acoustic activities may keep happening in a majority of frames, while an acoustic event only occurs in a few frames. So we propose a segment-level attention mechanism (SAM) to decide how much attention should be given based on the characteristics of segments. Here, a segment is comprised of several frames.

In this paper, we mainly adopt three ways to improve the performance of our model:
(1) We replace the currently popular CNN with a gated convolutional neural network to extract more temporal features of the audio;
(2) We propose a new segment-level attention mechanism to focus more on the audio segments with more energy;
(3) We utilize model ensemble to enhance the classification capability of our model.

TABLE I
AMOUNTS OF AUDIO CLIPS AND SESSIONS

Activity          #10s clips   #sessions
Absence           18860        42
Cooking           5124         13
Dishwashing       1424         10
Eating            2308         13
Other             2060         118
Social activity   4944         21
Vacuum cleaning   972          9
Watching TV       18648        9
Working           18644        33
Total             72984        268

The rest of this paper is organized as follows. In Section 2, we introduce our methods in detail, mainly including the acoustic feature, gated convolutional neural network, segment-level attention mechanism and model ensemble. The experiment setup, evaluation metric and our results are illustrated in Section 3.

978-1-5386-7568-7/18/$31.00 © 2018 IEEE
Finally, the conclusion of our work is presented in Section 4.

II. METHODS

A. Task Description

The DCASE 2018 Task 5 dataset [10] contains sound data recorded in a living room by individual devices with four-microphone arrays at seven undisclosed locations. The dataset is divided into a development dataset and an evaluation dataset. Four cross-validation folds are provided for the development dataset in order to make results reported on this dataset uniform. For each fold, a training, testing and evaluation subset is provided. In this paper, our work is based on the development dataset, and we use the provided cross-validation folds for training and evaluation.

The audio clips in this dataset can be classified into nine classes: absence, cooking, dishwashing, eating, other, social activity, vacuum cleaning, watching TV and working. All audio clips are derived from continuous recording sessions collected by seven microphone arrays, and each clip contains four channels. The duration of each audio clip is 10 seconds. Specific information about the dataset is shown in Table 1, and more details can be found in [10].

B. System Overview

Our proposed system is illustrated in Figure 1. The input of our system is log Mel-scale filter banks (fbank). It is fed into two structures: one is a Gated Convolutional Neural Network (GCNN) architecture, and the other is our proposed Segment-Level Attention Mechanism (SAM). Unlike most systems that output one probability score for an audio clip as a whole, we divide a 10-s audio clip into several segments.

Fig. 1. Overall architecture of the proposed system.

The output of our GCNN architecture is X ∈ R^(N×C) and represents the probability of each class for each segment, where N ∈ N is the number of segments in an audio clip and C ∈ N is the number of predefined classes. The output of SAM is a vector W ∈ R^N and represents the attention weight factor for each segment.
Then we multiply X by W for each segment to obtain weighted segment scores. These scores are averaged over segments to get a vector Y ∈ R^C, which then goes through a softmax to represent the normalized probability of each class. The class with the largest probability is taken as the classification result. The detailed explanations of our proposed system are given in the following parts of this section.

C. Acoustic Feature

We use fbank as the input of our system. Fbank is a two-dimensional time-frequency acoustic feature. It imitates the characteristics of human ears and concentrates more on the low-frequency components of audio signals. Compared with the traditional MFCC feature, more of the original information is kept in fbank, and it has been widely used in deep learning. To extract the fbank feature, each input audio clip is divided into 40 ms frames with 50% overlap, and 40 mel-scale filters are applied to the magnitude spectrum of each frame. Finally, we take the logarithm of the amplitude to get the fbank feature.

As mentioned in Section 1, the audio clips contain four channels, so our fbank feature contains four channels as well. In our work, the four channels are fed into the system separately during training, and the averaged output score of the four channels is used for evaluation.

D. Gated Convolutional Neural Network

The gated convolutional neural network was proposed by Dauphin et al. in [11] and has shown great power in machine translation and natural language processing. Our GCNN architecture consists of three main parts: 1) a convolutional neural network (CNN), 2) a gated convolutional neural network (GCNN), and 3) a feedforward neural network (FNN). The overall architecture is shown in Figure 2.

Fig. 2. Overall architecture of the gated convolutional neural network.

Fig. 3. Gated convolutional neural network.
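The fbank extraction described in Section II-C (40 ms frames, 50% overlap, 40 mel filters, log amplitude) can be sketched in plain NumPy as follows. The 16 kHz sample rate and the Hamming window are our assumptions; the paper does not state them, and a framework feature extractor would normally be used in practice.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def logmel(signal, sr=16000, frame_ms=40, overlap=0.5, n_mels=40):
    """Log mel filterbank (fbank): 40 ms frames, 50% overlap, 40 mel filters."""
    frame_len = int(sr * frame_ms / 1000)            # 640 samples at 16 kHz
    hop = int(frame_len * (1.0 - overlap))           # 50% overlap -> 320 samples
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)          # assumed analysis window
    mag = np.abs(np.fft.rfft(frames, frame_len))     # magnitude spectrum per frame
    # triangular mel filterbank spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ce, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fb[m - 1, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    return np.log(mag @ fb.T + 1e-10)                # (n_frames, n_mels)
```

With a 10-s clip at the assumed 16 kHz, this yields 499 frames of 40 coefficients; the 501 frames listed in Tables II and III presumably come from signal padding, which is omitted here.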
Before being fed into the GCNN architecture, the extracted fbank feature is normalized to zero mean and unit standard deviation (we call this global normalization, to distinguish it from the time normalization introduced below). The convolutional layers extract frequency features and connect the features of adjacent frames. The output of the convolutional layer is followed by batch normalization [12], a ReLU activation unit and a dropout layer [13]. Then a max-pooling layer is applied to keep the most important features.

The structure of the gated convolutional neural network is illustrated in Figure 3. In the gated convolutional neural network, the output of the convolutional layer is divided into two parts of the same size. The input of this structure is E = [e_1, e_2, ..., e_n]. E passes through a convolutional layer, and the output is divided into A and B. Then A passes through a sigmoid activation function and is multiplied element-wise with B. To strengthen the network, we add a residual connection from the input E to the output of this structure; the residual connection is introduced to avoid the vanishing gradient problem [14]. The specific formulas are as follows:

A = E ∗ W + b,      (1)
B = E ∗ V + c,      (2)
H = B ⊗ σ(A),       (3)
O = H + E,          (4)

where W and V represent convolutional kernels, and b and c are biases. ⊗ represents element-wise multiplication, and σ(·) is the sigmoid activation function. The gated convolutional layer is also followed by batch normalization, a ReLU activation unit, a dropout layer and a max-pooling layer.

After the gated convolutional neural network, the features on multiple channels are flattened into the frequency axis. Then two fully-connected layers are used to combine the extracted features and output nine scores for each segment. Our work differs from others in that we output scores for each segment, while most researchers output scores for an audio clip as a whole.
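For illustration, Eqs. (1)-(4) can be written out for a one-dimensional input in NumPy; the kernel length and "same" padding below are illustrative choices, not the paper's [40, 5, 64] configuration.

```python
import numpy as np

def gated_conv(E, W, V, b, c):
    """Gated convolution with residual connection, Eqs. (1)-(4):
    A = E*W + b, B = E*V + c, H = B (x) sigmoid(A), O = H + E."""
    A = np.convolve(E, W, mode='same') + b        # Eq. (1)
    B = np.convolve(E, V, mode='same') + c        # Eq. (2)
    H = B * (1.0 / (1.0 + np.exp(-A)))            # Eq. (3), element-wise gating
    return H + E                                  # Eq. (4), residual connection
```

When the gate saturates at zero (large negative A), only the residual path survives and the block passes E through unchanged, which is exactly what the residual connection is there to guarantee.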
We intend to focus on the segments with more energy and ignore segments with less energy, which we call "silence" segments. That is why we propose a segment-level attention mechanism.

E. Segment-Level Attention Mechanism

As mentioned in Section 1, the attention mechanism was introduced to ignore irrelevant sounds such as background noise and silences in audio event classification. In DCASE 2018 Task 5, an audio clip labeled as cooking may contain some segments of silence, and we should not pay too much attention to those segments, because audio clips labeled as other classes may also contain silence. Motivated by Xu et al. [8], we propose a segment-level attention mechanism. Our work differs from previous work in that we assign attention weight factors based on the characteristics of segments instead of frames.

Fig. 4. Segment-level attention mechanism.

The structure of the segment-level attention mechanism is shown in Figure 4. The input of this structure is the aforementioned fbank feature. It is first normalized along the time axis, which we call time normalization. The purpose of time normalization is to further differentiate the features among frames. A fully-connected layer is added to extract deeper features of the frames. As in the gated convolutional neural network, the fully-connected layer is followed by batch normalization, ReLU and dropout. Next, we calculate the sum along the frequency axis. An average-pooling layer is added to filter adjacent frames. Then a max-pooling layer is used to maintain the most important information of a segment. Finally, we use a sigmoid activation to limit the weight factors between 0 and 1. Based on our experiments, the duration of a segment is set to 1 second. The specific structure and hyperparameters are illustrated in Section 3.

F. Model Ensemble

Model ensemble is a common strategy in machine learning. In our work, we propose a strategy of model ensemble.
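The SAM pipeline above (time normalization, fully-connected layer with ReLU, sum along frequency, 1×5 average pooling, 1×10 max pooling, sigmoid) can be sketched as follows, together with the score fusion of Section II-B. This is a simplified sketch: batch normalization and dropout are omitted, and non-overlapping pooling windows (stride equal to width) are an assumption; with a 500-frame input it yields one weight per 1-s segment.

```python
import numpy as np

def segment_attention(fbank, W_fc, b_fc):
    """Segment-level attention weights from a (freq=40, time=T) fbank patch."""
    # time normalization: zero mean, unit std along the time axis
    x = (fbank - fbank.mean(axis=1, keepdims=True)) \
        / (fbank.std(axis=1, keepdims=True) + 1e-8)
    h = np.maximum(W_fc @ x + b_fc[:, None], 0.0)  # fully-connected layer + ReLU
    s = h.sum(axis=0)                              # sum along the frequency axis
    a = s.reshape(-1, 5).mean(axis=1)              # 1x5 average pooling
    m = a.reshape(-1, 10).max(axis=1)              # 1x10 max pooling: one value per 1-s segment
    return 1.0 / (1.0 + np.exp(-m))                # sigmoid -> weights in (0, 1)

def fuse(X, w):
    """Section II-B fusion: weighted segment scores, averaged, then softmax."""
    y = (X * w[:, None]).mean(axis=0)              # average the weighted segment scores
    e = np.exp(y - y.max())
    return e / e.sum()                             # normalized class probabilities
```

The class with the largest entry of the returned probability vector is then taken as the classification result.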
During our experiments, we noticed that absence, other and working are three sorts of activities that are often misclassified as each other. So we train a model in particular to classify those three classes of activities. When our main system classifies an audio clip as any of the three classes, we use the specially trained model for one more classification. If an audio clip is classified as a class other than class 0, 4 or 8 (absence, other and working) by our first system, that output is the final output. Otherwise, the audio clip is fed into our second system.

We denote the output of our first system as X^I ∈ R^9 and that of the second system as X^II ∈ R^3. X^N_i represents the output probability of the i-th class by the N-th system, where i ∈ [0, 8] and N is I or II. The final output Y ∈ R^9 of our ensemble system is calculated according to the following algorithm: we calculate the sum of X^I_0, X^I_4 and X^I_8 and redistribute it based on the output of our second system, X^II.

Algorithm 1 Model Ensemble (X^I, X^II)
  O ← argmax X^I
  Y ← X^I
  if O == 0 or O == 4 or O == 8 then
    S ← sum(X^I_0, X^I_4, X^I_8)
    Y_{0,4,8} ← S · X^II
  end if

III. EXPERIMENT, EVALUATION AND RESULTS

A. Experiment Setup

Our model is trained using Adam [15] for gradient-based optimization. Cross-entropy is used as the loss function. The structure of our system is shown in Table 2 and Table 3 along with its parameters. The initial learning rate is 0.001 and the batch size is 256 × 4 channels, because each channel is considered a different sample for training. We train the classifiers for 300 epochs. We select 5% of the testing data as a validation dataset and choose the models that achieve the best accuracy on the validation dataset for the final evaluation. In the evaluation process, the outputs of the 4-channel acoustics are averaged to get the final posterior probability.
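Algorithm 1 translates directly to code. The function below is a sketch, with x1 and x2 standing for X^I and X^II:

```python
import numpy as np

def ensemble(x1, x2, classes=(0, 4, 8)):
    """Algorithm 1: if the first system predicts absence/other/working
    (classes 0, 4, 8), redistribute their total probability mass S
    according to the second, three-class system."""
    y = x1.copy()
    if int(np.argmax(x1)) in classes:
        s = sum(x1[c] for c in classes)            # S <- X^I_0 + X^I_4 + X^I_8
        for c, p in zip(classes, x2):              # Y_{0,4,8} <- S * X^II
            y[c] = s * p
    return y
```

Note that when X^II sums to one, the total probability mass of Y equals that of X^I, so the redistribution only reshuffles probability among the three confusable classes.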
TABLE II
MODEL STRUCTURE AND PARAMETERS OF GATED CONVOLUTIONAL NEURAL NETWORK

Input 40 × 501 × 1                                     Output size
Conv (padding: valid, kernel: [40, 5, 64])             1, 497, 64
BN-ReLU-Dropout(0.2)                                   1, 497, 64
1 × 5 Max-Pooling (padding: valid)                     1, 99, 64
Gated Conv (padding: same, kernel: [1, 3, 128])        1, 99, 64
BN-ReLU-Dropout(0.2)                                   1, 99, 64
1 × 10 Max-Pooling (padding: same)                     1, 10, 64
Feature Flattening                                     10, 64
Fully-connected (unit num: 64)-ReLU-Dropout(0.2)       10, 64
Fully-connected (unit num: 9)                          10, 9

TABLE III
MODEL STRUCTURE AND PARAMETERS OF SEGMENT-LEVEL ATTENTION MECHANISM

Input 40 × 501 × 1                                     Output size
Fully-connected (unit num: 40)                         40, 501, 1
BN-ReLU-Dropout(0.2)                                   40, 501, 1
Sum along frequency axis                               1, 501, 1
1 × 5 Average-Pooling (padding: same)                  1, 100, 1
1 × 10 Max-Pooling (padding: same)                     1, 10, 1
Squeeze                                                10
Sigmoid                                                10

B. Evaluation Metric

The official evaluation metric for DCASE 2018 Challenge Task 5 is the macro-averaged F1-score. The F1-score is a measure of a test's accuracy: it is the harmonic mean of precision and recall. Macro-averaged means that the F1-score is calculated for each class separately and then averaged over all classes. For this task, a full 10-s multi-channel audio clip is considered one sample.

C. Results

We examine the following configurations:
(1) CNN: convolutional neural network as the baseline system;
(2) SAM-CNN: convolutional neural network with our proposed segment-level attention mechanism;
(3) GCNN: gated convolutional neural network;
(4) SAM-GCNN: gated convolutional neural network with our proposed segment-level attention mechanism;
(5) Ensemble: gated convolutional neural network with our proposed segment-level attention mechanism and model ensemble.
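The macro-averaged F1-score of Section III-B can be illustrated with a minimal implementation (in practice a library routine such as scikit-learn's f1_score with macro averaging would be used):

```python
def macro_f1(y_true, y_pred, n_classes=9):
    """Macro-averaged F1: per-class harmonic mean of precision and recall,
    averaged with equal weight over all classes."""
    f1s = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / n_classes
```

Because every class contributes equally regardless of its clip count, rare classes such as "vacuum cleaning" (972 clips) weigh as much as "absence" (18860 clips), which is why this metric suits the imbalanced dataset of Table I.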
TABLE IV
MACRO-AVERAGED F1-SCORE OF MULTIPLE SYSTEMS ON 4 FOLDS

System     Fold1    Fold2    Fold3    Fold4    Average
CNN        81.92%   82.58%   83.26%   87.29%   83.76%
GCNN       85.58%   84.22%   86.36%   88.83%   86.25%
SAM-CNN    83.68%   82.26%   84.56%   88.09%   84.65%
SAM-GCNN   88.49%   86.81%   86.51%   90.52%   88.08%
Ensemble   89.62%   88.11%   87.95%   91.63%   89.33%

As shown in Table 4, the macro-averaged F1-score of GCNN is 2.49% higher than that of CNN, and our proposed segment-level attention mechanism improves the classification performance of both CNN and GCNN. Moreover, our proposed ensemble strategy outperforms the previous systems and achieves an 89.33% F1-score.

The confusion matrices before and after ensemble are shown in Figure 5. On the left is the confusion matrix of SAM-GCNN, and on the right is the confusion matrix of SAM-GCNN with model ensemble. The element in the i-th row and j-th column of a matrix represents the number of audio clips that belong to class i and are classified as class j, so the elements on the diagonal represent the numbers of correctly classified audio clips. We find that the number of correctly classified audio clips increases after ensemble, especially for "absence", "other" and "working", showing that our model ensemble method does work. The class-wise performance of our final model is shown in Table 5.
TABLE V
CLASS-WISE PERFORMANCE OF PROPOSED MODEL

                  fold1    fold2    fold3     fold4     Average
Absence           94.43%   92.99%   93.15%    94.93%    93.88%
Cooking           95.92%   94.26%   93.75%    96.49%    95.10%
Dishwashing       87.45%   81.22%   81.81%    83.87%    83.59%
Eating            89.35%   89.66%   87.73%    90.56%    89.33%
Other             52.28%   53.51%   54.61%    67.15%    56.89%
Social activity   97.83%   95.85%   94.38%    98.50%    96.64%
Vacuum cleaning   99.99%   99.81%   100.00%   100.00%   99.95%
Watching TV       99.55%   99.86%   99.42%    99.91%    99.69%
Working           89.82%   85.85%   86.68%    93.22%    88.89%
Macro-Average     89.62%   88.11%   87.95%    91.63%    89.33%

To better evaluate our work, we compare the performance of the proposed model with the top-2 ranked teams in DCASE 2018 Challenge Task 5 and the official baseline system in Table 6. Both of the top-2 teams adopted complex methods of pre-processing, data augmentation and model ensemble. We achieve comparable performance without any data augmentation, and our system outperforms the official baseline significantly.

TABLE VI
COMPARISON WITH STATE-OF-THE-ART WORKS

                    Averaged F1-score
Proposed            89.3%
InouetMilk [2]      90.0%
HITfweight [3]      89.8%
Official Baseline   84.5%

IV. CONCLUSION

In this paper, we have introduced our work, and the results show that the performance of our proposed system is significantly superior to that of the baseline. Our proposed segment-level attention mechanism improves the performance of both the CNN and GCNN architectures. Furthermore, by using model ensemble, we have achieved competitive performance on the development dataset of DCASE 2018 Task 5. Note that both of the top two teams of this task utilized complex methods of data augmentation and model ensemble. Our system achieves comparable performance without data augmentation, which shows that our proposed attention mechanism can contribute a lot to home activity monitoring.

Fig. 5. Confusion matrices before and after ensemble on fold 4.
Since the ground truth labels of the evaluation dataset of the DCASE 2018 Challenge have not been published yet, future work is needed for further evaluation.

REFERENCES

[1] G. Dekkers, L. Vuegen, T. van Waterschoot, B. Vanrumste, and P. Karsmakers, "DCASE 2018 Challenge - Task 5: Monitoring of domestic activities based on multi-channel acoustics," Technical Report, KU Leuven, 2018. URL: https://arxiv.org/abs/1807.11246.
[2] T. Inoue, P. Vinayavekhin, S. Wang, D. Wood, N. Greco, and R. Tachibana, "Domestic activities classification based on CNN using shuffling and mixing data augmentation," DCASE 2018 Challenge, Technical Report, 2018.
[3] R. Tanabe, T. Endo, Y. Nikaido, T. Ichige, P. Nguyen, Y. Kawaguchi, and K. Hamada, "Multichannel acoustic scene classification by blind dereverberation, blind source separation, data augmentation, and model ensembling," DCASE 2018 Challenge, Technical Report, 2018.
[4] E. Cakir and T. Virtanen, "Convolutional recurrent neural networks for rare sound event detection," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017, pp. 803-806.
[5] H. Lim, J. Park, K. Lee, and Y. Han, "Rare sound event detection using 1D convolutional recurrent neural networks," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017, pp. 80-84.
[6] Y. Han, J. Park, and K. Lee, "Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017, pp. 46-50.
[7] W. Zheng, J. Yi, X. Xing, X. Liu, and S.
Peng, "Acoustic scene classification using deep convolutional neural network and multiple spectrograms fusion," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017, pp. 133-137.
[8] Y. Xu, Q. Kong, Q. Huang, W. Wang, and M. D. Plumbley, "Attention and localization based on a deep convolutional recurrent model for weakly supervised audio tagging," in Proceedings of INTERSPEECH, 2017, pp. 3083-3087.
[9] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Audio Set classification with attention model: A probabilistic perspective," arXiv preprint arXiv:1711.00927, 2017.
[10] G. Dekkers, S. Lauwereins, B. Thoen, M. W. Adhana, H. Brouckxon, T. van Waterschoot, B. Vanrumste, M. Verhelst, and P. Karsmakers, "The SINS database for detection of daily activities in a home environment using an acoustic sensor network," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany, November 2017, pp. 32-36.
[11] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," arXiv preprint, 2016.
[12] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 448-456.
[13] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[15] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
