Transformer for Emotion Recognition
Jean-Benoit Delbrouck
TCTS Lab, University of Mons, Belgium
jean-benoit.delbrouck@umons.ac.be

Abstract

This paper describes the UMONS solution for the OMG-Emotion Challenge [1]. We explore a context-dependent architecture where the arousal and valence of an utterance are predicted according to its surrounding context (i.e. the preceding and following utterances of the video). We report an improvement when taking context into account for both unimodal and multimodal predictions. Our code is made publicly available [2].

[1] https://www2.informatik.uni-hamburg.de/wtm/OMG-EmotionChallenge/
[2] https://github.com/jbdel/OMG_UMONS_submission/

1 Data

The organizers specially collected and annotated a One-Minute Gradual-Emotional Behavior dataset (OMG-Emotion dataset) for the challenge. The dataset is composed of YouTube videos chosen through keywords based on long-term emotional behaviors such as "monologues", "auditions", "dialogues" and "emotional scenes". An annotator has to watch a whole video in sequence, so that the contextual information is taken into consideration before annotating the arousal and valence of each utterance of the video. The dataset provided by the organizers contains a train split of 231 videos composed of 2442 utterances and a validation split of 60 videos composed of 617 utterances. For each utterance, the gold arousal and valence levels are given.

2 Architecture

Because context is taken into account during annotation, we propose a context-dependent architecture (Poria et al., 2017) where the arousal and valence of an utterance are predicted according to the surrounding context.
Our model consists of three successive stages:

- A context-independent unimodal stage to extract linguistic, visual and acoustic features per utterance
- A context-dependent unimodal stage to extract linguistic, visual and acoustic features per video
- A context-dependent multimodal stage to make a final prediction per video

2.1 Context-independent Unimodal stage

Firstly, the unimodal features are extracted from each utterance separately. We use the mean squared error as loss function:

    L_{mse} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2

where N is the number of utterances predicted, x the prediction vector for arousal or valence, and y the ground-truth vector. Below, we explain the linguistic, visual and acoustic feature extraction methods.

2.1.1 Convolutional Neural Networks for Sentences

For each utterance, a transcription is given as a written sentence. We train a simple CNN with one layer of convolution (Kim, 2014) on top of word vectors obtained from an unsupervised neural language model (Mikolov et al., 2013). More precisely, we represent an utterance (here, a sentence) as a sequence of concatenated k-dimensional word2vec vectors. Each sentence is wrapped to a window of 50 words, which serves as the input to the CNN. Our model has one convolutional layer with three kernels of size 3, 4 and 2, with 30, 30 and 60 feature maps respectively. We then apply a max-over-time pooling operation over each feature map and capture the most important feature, the one with the highest value, for each feature map. Each convolution and max-pooling operation is interleaved with a ReLU activation function. Finally, a fully connected layer FC_out of size [120 → 2] predicts both the arousal and valence of the utterance. We extract the 120-dimensional features of an utterance before the FC_out operation.

2.1.2 3D-CNN for visual input

In this section, we explain how we extract features from each utterance's video with a 3D-CNN (Ji et al., 2013).
A video is a sequence of frames of size W × H × 3. The 3D convolution is achieved by convolving a 3D kernel with the cube formed by stacking multiple successive video frames together. By this construction, each feature map in the convolutional layer is connected to multiple frames in the previous layer and is therefore able to capture temporal information. In our experiments, we sample 32 frames of size 32 × 32 per video, equally spaced, so that each video in the dataset lies in R^{32 × 32 × 32 × 3}. Our CNN consists of two convolutional layers of 32 filters of size 5 × 5 × 5, each followed by a max-pooling layer, of size 4 × 4 × 4 and 3 × 3 × 3 respectively. Afterwards, two fully connected layers FC_out1 [864 → 128] and FC_out2 [128 → 2] map the CNN outputs to a predicted arousal and valence level. We extract the 128-dimensional features of an utterance before the FC_out2 operation.

2.1.3 OpenSmile for audio input

For every utterance's video, we sample a Waveform Audio file at a 16 kHz frequency and use OpenSmile (Eyben et al., 2010) to extract 6373 features from the IS13-ComParE configuration file. To reduce this number, we only select the k best features based on univariate statistical regression tests where the arousal and valence levels are the targets. We pick k = 80 for both the arousal and valence tests and merge the feature indexes together, ending up with 121 unique features per utterance.

2.2 Context-dependent Unimodal stage

In this stage, we stack the utterances video-wise for each modality. Let us consider a modality m of utterance feature size k; a video V_i is the sequence of utterance vectors (x_1, x_2, ..., x_T)_i where x_j ∈ R^k and T is the number of utterances in V_i. We now have a set of modality videos V^m = (V_1, V_2, ..., V_N)^m, where N is the number of videos in the dataset.
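As an illustration (a sketch, not the authors' released code), the video-wise stacking described above can be written as follows, assuming the per-utterance features have already been extracted; the utterances of a video are stacked into a (T, k) matrix and zero-padded to a common length for batching:

```python
import numpy as np

def stack_video(utterance_feats, max_len):
    """Stack per-utterance feature vectors (each of size k) into a
    (max_len, k) matrix, zero-padding when the video has fewer than
    max_len utterances."""
    k = utterance_feats[0].shape[0]
    video = np.zeros((max_len, k), dtype=np.float32)
    n = min(len(utterance_feats), max_len)
    for t in range(n):
        video[t] = utterance_feats[t]
    return video

# Hypothetical example: a 3-utterance video with 120-dim linguistic features
feats = [np.random.randn(120).astype(np.float32) for _ in range(3)]
video = stack_video(feats, max_len=5)
print(video.shape)  # (5, 120)
```

The same routine applies unchanged to the visual (k = 128) and acoustic (k = 121) features.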
In previous similar work (Poria et al., 2017), the video matrix V_i was the input of a bi-directional LSTM network used to capture the preceding and following context. We argue that, especially if the video has many utterances, this context might be incomplete or inaccurate for a specific utterance. We tackle the problem by using self-attention (sometimes called intra-attention). This attention mechanism relates different positions of a single sequence in order to compute a representation of that sequence, and has been successfully used in a variety of tasks (Parikh et al., 2016; Lin et al., 2017; Vaswani et al., 2017). More specifically, we use the "transformer" encoder with multi-head self-attention to compute our context-dependent unimodal video features.

Figure 1: Overview of the Context-dependent Unimodal stage. Each utterance's arousal and valence levels are predicted through a whole video.

2.2.1 Transformer encoder

The encoder is composed of a stack of N identical blocks. Each block has two layers: the first is a multi-head self-attention mechanism, and the second is a fully connected feed-forward network. Each layer is followed by a normalization layer and employs a residual connection, so the output of each layer can be written as

    LayerNorm(x + layer(x))

where layer(x) is the function implemented by the layer itself (multi-head attention or feed-forward).

2.2.2 Multi-Head attention

Let d_k be the dimension of the queries and keys and d_v the dimension of the values. The attention function computes the dot products of the query with all keys, divides each by \sqrt{d_k}, and applies a softmax function to obtain the weights on the values:

    Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V

The authors found it beneficial to linearly project the queries, keys and values h times with different learned linear projections to d_k, d_k and d_v dimensions. The output of the multi-head attention is the concatenation of the h resulting d_v-dimensional values.
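As a minimal sketch (not the paper's implementation), the scaled dot-product attention above can be written in a few lines of NumPy; used with Q = K = V = X, it is exactly the self-attention over a video's utterances:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (T_q, T_k) similarity matrix
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # (T_q, d_v) context-aware vectors

# Self-attention over a hypothetical video of T = 4 utterances, d_k = d_v = 64
X = np.random.randn(4, 64)
out = attention(X, X, X)
print(out.shape)  # (4, 64)
```

Multi-head attention runs h such functions in parallel on learned linear projections of Q, K and V and concatenates their outputs.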
We pick d_k = d_v = 64, h = 8 and N = 2.

2.2.3 Dense output layer

The output of each utterance's transformer goes through a last fully connected layer FC_out of size [m → 2] to predict both the arousal and valence levels. Because we make our predictions per video, we propose to include the concordance correlation coefficient in our loss function. We define

    L_{ccc} = 1 - \rho_c, \quad \rho_c = \frac{2\sigma_{xy}}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}

We now minimize

    L_{total} = L_{mse} + 0.25 \times L_{ccc}

for both the arousal and valence values. In addition to leading to better results, we found this loss to give the model more stability between evaluations and better reproducibility between runs.

3 Context-dependent Multimodal stage

This stage is similar to the previous one, except that we now have only one set of videos V = (V_1, V_2, ..., V_N), where each video V_i is composed of multimodal utterances x_j = (x_linguistic, x_visual, x_audio). In our experiments, we tried two types of fusion.

A. Concatenation. We simply concatenate the modalities utterance-wise. The utterance x_j can be rewritten as x_j = x_linguistic ⊕ x_visual ⊕ x_audio, where ⊕ denotes concatenation.

B. Multimodal Compact Bilinear Pooling. We would like each feature of each modality to combine with every other. We would learn a model W (here linear), i.e. c = W[[x ⊗ y] ⊗ z], where ⊗ is the outer-product operation and [] denotes linearizing the matrix into a vector. In our experiments, the modality feature sizes are 120, 128 and 121. If we want c ∈ R^512, W would have 951 million parameters. A multimodal compact bilinear pooling model (Fukui et al., 2016) can instead be learned by relying on the Count Sketch projection function (Charikar et al., 2002) to project the outer product to a lower-dimensional space, which reduces the number of parameters in W.

4 Results

We report our preliminary results in terms of the concordance correlation coefficient metric.
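As a reference sketch for the metric reported below (an assumption here is the use of biased variance and covariance estimates, a common choice when computing the CCC over a sequence of predictions), the coefficient and the combined training loss can be computed as:

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient:
    rho_c = 2*sigma_xy / (sigma_x^2 + sigma_y^2 + (mu_x - mu_y)^2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()   # biased covariance estimate
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def total_loss(pred, gold):
    """Combined loss L_total = L_mse + 0.25 * L_ccc."""
    mse = np.mean((np.asarray(pred) - np.asarray(gold)) ** 2)
    return mse + 0.25 * (1.0 - ccc(pred, gold))

x = [0.1, 0.4, 0.3, 0.8]
print(round(ccc(x, x), 4))  # perfect agreement -> 1.0
```

Unlike the MSE alone, the CCC term also penalizes predictions whose scale or mean drifts away from the gold annotations, even when the pointwise errors are small.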
Model                           Mean CCC

Monomodal feature extraction
  Text  - CNN                   0.165
  Audio - OpenSmile             0.150
  Video - 3DCNN                 0.186

Contextual monomodal
  Text                          0.220
  Audio                         0.223
  Video                         0.227

Contextual multimodal
  T + A + V                     0.274
  T + A + V + CBP               0.301

References

[Charikar et al. 2002] Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693-703. Springer.

[Eyben et al. 2010] Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. Opensmile: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1459-1462. ACM.

[Fukui et al. 2016] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.

[Ji et al. 2013] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2013. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221-231.

[Kim 2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

[Lin et al. 2017] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

[Mikolov et al. 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

[Parikh et al. 2016] Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.
[Poria et al. 2017] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873-883.

[Vaswani et al. 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.