OMG-Emotion Challenge Solution

Yuqi Cui, Xiao Zhang, Yang Wang, Chenfeng Guo and Dongrui Wu
Huazhong University of Science and Technology, Wuhan, China
Email: drwu@hust.edu.cn

Abstract—This short paper describes our solution to the 2018 IEEE World Congress on Computational Intelligence One-Minute Gradual-Emotional Behavior Challenge, whose goal was to estimate continuous arousal and valence values from short videos. We designed four base regression models using visual and audio features, and then used a spectral approach to fuse them to obtain improved performance.

Index Terms—Affective computing, emotion estimation

I. PROBLEM STATEMENT

The One-Minute Gradual-Emotional Behavior Challenge¹ was a competition organized at the 2018 IEEE World Congress on Computational Intelligence² (IEEE WCCI 2018). The dataset was composed of 420 relatively long emotion videos with an average length of one minute, collected from a variety of YouTube channels. Videos were separated into clips based on utterances, and each utterance's valence and arousal levels were annotated by at least five independent subjects using the Amazon Mechanical Turk tool. The goal was to estimate the valence and arousal levels for each utterance, from modalities such as visual, audio, and text. The training dataset consisted of 2,442 utterances, the validation dataset of 621 utterances, and the testing dataset of 2,229 utterances.

The performance measure was the Concordance Correlation Coefficient (CCC). Let N be the number of testing samples, {y_i}_{i=1}^N be the true valence (arousal) levels, and {ŷ_i}_{i=1}^N be the estimated valence (arousal) levels. Let m and σ be the mean and standard deviation of {y_i}, respectively, m̂ and σ̂ be the mean and standard deviation of {ŷ_i}, respectively, and γ be the Pearson correlation coefficient between {y_i} and {ŷ_i}.
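These quantities map directly onto a few lines of code. A minimal NumPy sketch of the metric (the helper name `ccc` is ours, not part of any challenge kit; population standard deviations are assumed):

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between true and estimated
    valence (or arousal) levels, following Eq. (1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    m, m_hat = y_true.mean(), y_pred.mean()
    s, s_hat = y_true.std(), y_pred.std()        # population standard deviations
    gamma = np.corrcoef(y_true, y_pred)[0, 1]    # Pearson correlation
    return 2 * gamma * s * s_hat / (s**2 + s_hat**2 + (m - m_hat)**2)

# A perfect prediction gives CCC ≈ 1; a mean- and variance-matched but
# perfectly anti-correlated prediction gives CCC ≈ -1.
```

Unlike the Pearson correlation alone, the denominator penalizes any mismatch in mean or scale between predictions and ground truth.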
Then, the CCC is computed as

    CCC = 2γσσ̂ / (σ² + σ̂² + (m − m̂)²).   (1)

Clearly, CCC ∈ [−1, 1]. More information about the dataset and some baseline results can be found in [1].

¹ https://www2.informatik.uni-hamburg.de/wtm/OMG-EmotionChallenge/
² http://www.ecomp.poli.br/wcci2018/competitions/

II. OUR SOLUTION AND RESULTS

We developed four base regression models, and then aggregated their outputs with the spectral meta-learner for regression (SMLR) [9].

A. The CNN-Face Model

We used the face_recognition package³ to crop out the face of the actor in each frame of an utterance, and then performed emotion analysis on the faces only. Each face image was rescaled to 80 × 80 × 3 (height × width × channels). We extracted face features with Xception [3], with weights pre-trained on ImageNet. Each utterance gave n 2048-d feature vectors, where n is the number of frames. We then took the average of these n 2048-d feature vectors to obtain a single 2048-d feature vector for each utterance. These features were next passed through a three-layer multi-layer perceptron (MLP) for regression. The hidden layer had 1024 nodes with ReLU activation, and the output layer had a single node, with sigmoid activation for arousal and linear activation for valence. The MLP was optimized with Adadelta, with dropout rate 0.25. The validation CCC was used to determine when training should stop.

B. The CNN-Visual Model

This model was almost identical to CNN-Face, except that the entire frame, instead of only the face, was used to extract the features.

C. The LSTM-Visual Model

This regression model was inspired by the video classification model in [6].
For each utterance, we down-sampled 20 frames uniformly in time (if an utterance had fewer than 20 frames, the first frame was repeated to make up 20 frames), and then used InceptionV3 [8], pre-trained on ImageNet, to obtain a 20 × 2048 feature matrix. Next we applied a multi-layer long short-term memory (LSTM) network to extract the time-domain information, and an MLP with 512 hidden nodes and one output node for regression. Dropout and ReLU activation were used in both the LSTM and the MLP.

³ https://github.com/ageitgey/face_recognition

D. The SVR-Audio Model

We first converted the .mp4 audio format to .wav format, partitioned each utterance into frames, and then extracted the following features using moving windows (window length 200, sliding distance 80):

1) Low-level features, which describe the basic properties of audio in the time and frequency domains, including the spectral centroid, band energy ratio, delta spectrum magnitude, zero crossing rate, short-time average energy, and pitch. More details about these low-level features can be found in [5].
2) Silence ratio, which is the ratio of the number of silence frames to the time window [2]. A frame is considered a silence frame when its root mean square is less than 50% of the mean root mean square of the fixed-length audio fragments.
3) MFCCs and LPCCs. To combine the static and dynamic characteristics of the audio signals, 12 Mel-frequency cepstral coefficients (MFCCs), 11 linear predictive cepstral coefficients (LPCCs), and 12 first-order differential MFCC coefficients were calculated.
4) Formants, which reflect the resonant frequencies of the vocal tract. Formant frequencies F1-F5 in each frame were extracted.

We then computed the mean and/or variance of these frame-level features, resulting in a total of 76 audio features, as shown in Table I. These 76 features have been used in our previous research [4].
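To make the framing concrete, the sketch below implements the windowing and two of the low-level features, plus the silence-ratio rule above. It is an illustration only, assuming a raw 1-D waveform array and interpreting the stated window length 200 and sliding distance 80 in samples; all names and the synthetic input are ours:

```python
import numpy as np

def frame_signal(x, win=200, hop=80):
    """Split a 1-D waveform into overlapping frames (window 200, hop 80)."""
    n_frames = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Per-frame fraction of adjacent sample pairs whose signs differ."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def short_time_energy(frames):
    """Per-frame mean squared amplitude."""
    return np.mean(frames**2, axis=1)

def silence_ratio(frames):
    """Fraction of silence frames: RMS below 50% of the mean frame RMS [2]."""
    rms = np.sqrt(np.mean(frames**2, axis=1))
    return np.mean(rms < 0.5 * rms.mean())

# Synthetic 1-second waveform standing in for one utterance's audio.
x = np.random.default_rng(0).standard_normal(16000)
frames = frame_signal(x)
features = [zero_crossing_rate(frames).mean(), zero_crossing_rate(frames).var(),
            short_time_energy(frames).mean(), short_time_energy(frames).var(),
            silence_ratio(frames)]
```

Taking the mean and variance of each per-frame series, as in the last lines, is what turns a variable-length utterance into a fixed-length feature vector.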
TABLE I
THE 76 AUDIO FEATURES.

Feature category                              Number   Value
Spectral centroid, band energy ratio,
  delta spectrum magnitude, zero crossing
  rate, pitch, short-time average energy        12     Mean, variance
Silence ratio                                    1     Mean
MFCC coefficients                               24     Mean, variance
Delta MFCC                                      12     Mean
LPCC                                            22     Mean, variance
Formant                                          5     Mean

In this solution, instead of using these 76 features directly, we first clipped each feature into its [2, 98] percentile interval (i.e., all values smaller than the 2nd percentile were replaced by the value at the 2nd percentile, and all values larger than the 98th percentile were replaced by the value at the 98th percentile), normalized it to [0, 1], and then used RReliefF [7] to sort the features according to their importance. Next, we used support vector regression (SVR) and the validation dataset to determine the appropriate number of features to use. We performed feature clipping because many features had extreme values, which significantly deteriorated the estimation performance.

E. Model Fusion by SMLR

The above base regression models were then fused by our recently developed SMLR approach⁴ [9]. SMLR first uses a spectral approach to estimate the accuracies of the base regression models on the testing dataset, and then uses a weighted average to combine the base regression models (the weights are the accuracies of the base models) to obtain the final estimates on the testing dataset.

F. Results

The validation results on the CCC and mean squared error (MSE) are shown in Table II. Note that CNN-Visual was not used in SMLR fusion for arousal, since its performance was too low. We can observe from Table II that:

1) SVR-Audio achieved better CCCs than the other three base regression models, which used face or full-frame visual features.
2) SMLR achieved the best performance on both CCC and MSE, suggesting that the fusion was effective.

TABLE II
THE VALIDATION RESULTS.
Model          CCC (Arousal)   CCC (Valence)   MSE (Arousal)   MSE (Valence)
CNN-Face          0.3214          0.3606          0.0551          0.1163
CNN-Visual        0.2448          0.3568          0.0515          0.1045
LSTM-Visual       0.3383          0.3694          0.0431          0.1382
SVR-Audio         0.3693          0.4150          0.0543          0.1089
SMLR              0.3969          0.4411          0.0404          0.0910

⁴ We did not use the clustering step in [9] because we only had four base regression models here.

REFERENCES

[1] P. Barros and S. Wermter, "Developing crossmodal expression recognition based on a deep neural model," Adaptive Behavior, vol. 24, no. 5, pp. 373–396, 2016.
[2] L. Chen, S. Gunduz, and M. T. Ozsu, "Mixed type audio classification with support vector machine," in Proc. IEEE Int'l Conf. on Multimedia and Expo, Toronto, ON, Canada, July 2006, pp. 781–784.
[3] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," CoRR, vol. abs/1610.02357, 2016. [Online]. Available: http://arxiv.org/abs/1610.02357
[4] C. Guo and D. Wu, "Feature dimensionality reduction for video affect classification: A comparative study," in Proc. 1st Asian Conf. on Affective Computing and Intelligent Interaction, Beijing, China, May 2018.
[5] D. Li, I. Sethi, N. Dimitrova, and T. McGee, "Classification of general audio data for content-based retrieval," Pattern Recognition Letters, vol. 22, no. 5, pp. 533–544, 2001.
[6] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, June 2015, pp. 4694–4702.
[7] M. Robnik-Sikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, pp. 23–69, 2003.
[8] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, June 2016, pp. 2818–2826.
[9] D. Wu, V. J. Lawhern, S. Gordon, B. J. Lance, and C.-T. Lin, "Spectral meta-learner for regression (SMLR) model aggregation: Towards calibrationless brain-computer interface (BCI)," in Proc. IEEE Int'l Conf. on Systems, Man and Cybernetics, Budapest, Hungary, October 2016, pp. 743–749.