Aff-Wild2: A Large-Scale Multi-Task Emotion, Expression and Action Unit Database, and ArcFace-Based Models

To overcome the limitations of existing in-the-wild emotion databases, this paper constructs Aff-Wild2, an audiovisual collection of 558 videos (≈2.8M frames) annotated jointly for continuous affect (valence-arousal), the seven basic expressions, and eight action units. Multi-task and multi-modal CNN and CNN-RNN networks, as well as new models trained with the ArcFace loss, are then trained on this database and achieve state-of-the-art performance on ten publicly available databases.

Authors: Dimitrios Kollias, Stefanos Zafeiriou

Expression, Affect, Action Unit Recognition: Aff-Wild2, Multi-Task Learning and ArcFace

Dimitrios Kollias (1,2), dimitrios.kollias15@imperial.ac.uk
Stefanos Zafeiriou (1,2), s.zafeiriou@imperial.ac.uk
(1) Department of Computing, Imperial College London, London, UK
(2) FaceSoft.io

Abstract

Affective computing has been largely limited in terms of available data resources. The need to collect and annotate diverse in-the-wild datasets has become apparent with the rise of deep learning models as the default approach to any computer vision task. Some in-the-wild databases have recently been proposed. However: i) their size is small, ii) they are not audiovisual, iii) only a small part is manually annotated, iv) they contain a small number of subjects, or v) they are not annotated for all main behavior tasks (valence-arousal estimation, action unit detection and basic expression classification). To address these issues, we substantially extend the largest available in-the-wild database (Aff-Wild) to study continuous emotions such as valence and arousal. Furthermore, we annotate parts of the database with basic expressions and action units. As a consequence, for the first time, this allows the joint study of all three types of behavior states. We call this database Aff-Wild2. We conduct extensive experiments with CNN and CNN-RNN architectures that use visual and audio modalities; these networks are trained on Aff-Wild2 and their performance is then evaluated on 10 publicly available emotion databases. We show that the networks achieve state-of-the-art performance for the emotion recognition tasks. Additionally, we adapt the ArcFace loss function to the emotion recognition context and use it to train two new networks on Aff-Wild2, which are then re-trained on a variety of diverse expression recognition databases. These networks are shown to improve on the existing state-of-the-art. The database, emotion recognition models and source code are available at http://ibug.doc.ic.ac.uk/resources/aff-wild2.

1 Introduction

Until recently, affective computing has mostly been studied in controlled environments [14, 25, 39, 47, 48], with a limited number of participants [3, 26, 31, 33, 40], using pre-defined scenarios that users have to follow and depicting posed expressions [1, 28]. However, with the development of large and diverse datasets in computer vision (and the accompanying performance gains), it has become apparent that diversity of human participants and spontaneous expressions must become prerequisites for deploying affective computing models in practice. Large in-the-wild datasets have recently been collected to study facial expression (Expr) analysis [5, 7], facial action units (AUs) [11] and the continuous emotion dimensions of valence and arousal (VA) [35, 45]. In [29] a static in-the-wild database (AffectNet) was created that contains VA annotations for 1M images. However, of those images, only around 450K are manually annotated, and of those only 350K are valid faces. This number is moderate for training deep neural networks (DNNs).
Moreover, this database is static, meaning that it contains only images and no video/audio. It is worth mentioning that it also contains annotations for the seven basic expressions plus the contempt class for 290K images. In [2] a static in-the-wild database (Emotionet) was created that contains AU annotations for 1M images. However, of those images, only 50K are manually annotated; half of these constitute the validation set and the other half the test set. Again, these sets are too small to adequately train DNNs that generalize to other databases. The Emotionet database also contains annotations for 6 basic and 10 compound emotion categories. Nevertheless, their total size is less than 3K and the classes are heavily imbalanced, making it impossible to train DNNs. In [20, 50], the authors developed the largest existing audiovisual (A/V) in-the-wild database, annotated manually in terms of VA for around 1.25M frames. However, this database contains annotations only for VA and the number of subjects in the videos is moderate (298 in total).

Up to the present, there is no database that contains annotations for all main behavior tasks (VA estimation, AU detection, Expr classification). Also, most existing databases do not contain sufficiently large numbers of annotated samples to effectively train DNNs. The first contribution of this paper is the creation of a new dataset that contains 260 videos with around 1.4M frames, annotated for VA. We merge this dataset with Aff-Wild (since this database also contains annotated videos), generating the so-called Aff-Wild2 database. Next, we annotate parts of Aff-Wild2 with AUs and seven basic expression labels, creating about 398K AU and 403K Expr annotations, respectively. To the best of our knowledge, Aff-Wild2 is the first large-scale in-the-wild database containing annotations for all three main behavior tasks. It is also the first audiovisual database with annotations for AUs; all other AU-annotated databases contain only images or videos, without audio.

Next, we conduct multi-task experiments on this database for emotion recognition. Several questions arise: how can we combine these three tasks, and what loss function should we use? The first apparent answer is to use a loss function equal to the sum of the loss functions of each task. The binary cross entropy loss is used for AU detection; the MSE and Concordance Correlation Coefficient (CCC) losses are used for VA estimation; the standard loss for expression classification is the categorical cross entropy. The second contribution of the paper is the development of multi-task CNNs, multi-task CNN-RNNs and multi-modal, multi-task CNN-RNNs, which are trained on Aff-Wild2 and then applied to 10 publicly available databases (including Aff-Wild). The results are very promising, beating the state-of-the-art on emotion recognition in these databases; the exceptions are two databases annotated for Expr recognition. In one of them, the best performing network used a locality preserving loss function [22].
This, as well as the recent tendency to develop elaborate loss functions for specific tasks [18], has led us to search for a better loss function than the categorical cross entropy.

Figure 1: Frames of Aff-Wild2, showing subjects of different ethnicities, age groups, emotional states, head poses, illumination conditions and occlusions.

In fact, in the related field of face recognition, it has been shown [24, 44] that the categorical cross entropy loss is insufficient to acquire discriminating power for face classification. Several loss functions have been proposed for maximizing inter-class and minimizing intra-class variance. [4, 15] propose multi-loss learning to increase feature discriminating power; these require thorough mining of pair/triplet samples, which is a time-consuming procedure. [24] projects the original Euclidean feature space to an angular space, introducing an angular margin for larger inter-class variance. [43] directly adds a cosine margin penalty to the target logit, showing better performance than [24]. [8] further improved the discriminative power of face recognition models, stabilising the training process. As these losses boosted the performance of face recognition models, in this work we adopt the ArcFace loss [8] and adapt it for emotion recognition. To the best of our knowledge, this is the first time that such a loss, designed for face recognition, is used in the context of emotion recognition. Our final contribution is the design of two networks trained with the ArcFace loss. After training them on Aff-Wild2, we re-trained them on each of the examined databases. Our results outperformed all state-of-the-art networks, illustrating: i) the richness of Aff-Wild2 (which allows it to be used as a robust prior for network pre-training) and ii) that the ArcFace loss can be used in the emotion recognition field, yielding state-of-the-art results. In fact, this is the very first demonstration of the effectiveness of the additive angular margin in emotion recognition.

2 The Aff-Wild2 database

Aff-Wild2 is described next, presenting the newly collected dataset and its properties, the generated partition sets, their distributions and the annotation procedure.

Collected dataset and properties. We extend the Aff-Wild database [20, 50] by collecting a new dataset consisting of 260 YouTube videos, with 1,413,000 frames and a total length of 13 hours and 5 minutes. The videos were collected from the YouTube video sharing website. All collected videos are in MP4 format, with a frame rate of 30 fps, and are provided under the CC licence. Keywords for retrieving the videos were selected from the 2D Emotion Wheel, shown in Figure 2. The new videos cover a wide range of subjects' age (from babies to elderly people), ethnicity (Caucasian/Hispanic or Latino/Asian/Black/African American), profession (e.g. actors, athletes, politicians, journalists), head pose, illumination conditions, occlusions and emotions. Figure 1 shows frames of Aff-Wild2 verifying the ranges described above.
These videos show subjects who: react to a surprise, to something that brings them happiness or fulfillment, to flirting or rejection, to important political issues, or to funny or mean tweets; are stand-up comedians; give an interesting speech at a ceremony; are taking an oral exam; are giving lectures on depression or other serious disorders; or are performing passive, boring, apathetic or intense activities, etc. Four experts annotated the new dataset in terms of valence and arousal, as in the case of Aff-Wild. We then concatenated the Aff-Wild database with the new dataset, forming Aff-Wild2. In total, Aff-Wild2 consists of 558 videos with 2,786,201 frames, showing both subtle and extreme human behaviours in real-world settings. The total number of subjects is 458: 279 males and 179 females.

Two more annotation tasks were carried out, in which we annotated parts of Aff-Wild2 with AUs and Exprs. In the first, three very experienced annotators annotated 63 videos, with 397,800 frames and a total length of 3 hours and 41 minutes, in terms of AUs 1, 2, 4, 6, 12, 15, 20 and 25, described in Figure 2. These videos contain 31 male and 31 female subjects. In the second, three experts annotated 84 videos consisting of 403,758 frames, with a total length of 3 hours and 45 minutes, in terms of the 7 basic expressions. These videos show 42 male and 42 female subjects. Consequently, Aff-Wild2 contains three datasets (VA, AU, Expr), each with annotations for a respective behavior task (preliminary work regarding the three sets can be found in [16, 17]). Table 1 summarizes the attributes and properties of the three annotated sets of Aff-Wild2.

Figure 2: The 2D Emotion Wheel (left); the AUs annotated in Aff-Wild2 (right).

Table 1: General attributes of Aff-Wild2; in the VA set, the first value refers to the new dataset and the second to Aff-Wild.

| Aff-Wild2 | # frames | # videos | # annotators | Video Length | Mean Resolution |
|---|---|---|---|---|---|
| VA set | 1,413,000 / 1,373,201 | 260 / 298 | 4 / 8 | 0.03–26.22 mins / 0.10–14.47 mins | 1450×900 / 607×359 |
| AU set | 397,800 | 63 | 3 | 0.03–26.22 mins | 1500×900 |
| Expr set | 403,758 | 84 | 3 | 0.04–26.22 mins | 1350×800 |

Partition Sets and Distributions. Each set (VA, AU, Expr) is split into three subsets: training, validation and test. Partitioning is done in a subject-independent manner, in the sense that a person can appear in only one of the three subsets. In the VA set, the resulting training, validation and test subsets consist of 350, 70 and 138 videos, respectively. In the AU set, the respective subsets consist of 42, 7 and 14 videos. In the Expr set, the corresponding subsets consist of 51, 11 and 22 videos.
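As a concrete illustration of the subject-independent partitioning described above, the sketch below assigns whole subjects (rather than individual videos or frames) to the three subsets. It is only a sketch of the protocol: the split ratios, identifiers and data structures are hypothetical, since the paper specifies only the resulting video counts.

```python
import random

def subject_independent_split(video_subjects, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign each subject to exactly one of train/val/test, so that no person
    appears in more than one subset. `video_subjects` maps video id -> subject id."""
    subjects = sorted(set(video_subjects.values()))
    random.Random(seed).shuffle(subjects)
    n_train = int(ratios[0] * len(subjects))
    n_val = int(ratios[1] * len(subjects))
    split_of = {s: "train" for s in subjects[:n_train]}
    split_of.update({s: "val" for s in subjects[n_train:n_train + n_val]})
    split_of.update({s: "test" for s in subjects[n_train + n_val:]})
    # Every video inherits the split of its subject.
    return {video: split_of[subject] for video, subject in video_subjects.items()}
```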
Figure 3 shows the 2D VA histogram of the new dataset that was added to Aff-Wild, as well as the distribution of the seven emotion categories in Aff-Wild2. Table 2 shows the distribution of the activated AUs. We note that the Expr set of images could be extended to also contain AU annotations, according to Table 1 of [10]; however, this is beyond the scope of the current paper.

Figure 3: 2D VA histogram of the new data added to Aff-Wild (left); histogram of the seven basic expressions in Aff-Wild2 (right).

Table 2: Distribution of AU annotations in Aff-Wild2.

| Action Unit | AU 1 | AU 2 | AU 4 | AU 6 | AU 12 | AU 15 | AU 20 | AU 25 |
|---|---|---|---|---|---|---|---|---|
| Total number of activated AUs | 86,677 (43.9%) | 4,166 (2.1%) | 56,327 (28.5%) | 25,226 (12.8%) | 35,675 (18.1%) | 3,340 (1.7%) | 5,695 (2.9%) | 9,048 (4.6%) |

Annotation. Four experts performed the VA set annotation, using the method proposed in [6]. Valence and arousal values range continuously in [-1,1]. The final label values are the mean of the four annotations. The mean inter-annotation correlation is 0.63 for valence and 0.60 for arousal. For the AU set, three experts performed the annotation; for the Expr set, three further experts performed the annotation. In both cases, agreement between the annotators was not always 100%; we kept only the annotations on which all experts agreed.

3 Proposed Methods

Two pre-processing steps, on the visual and audio modalities, were applied to generate the input data for DNN-based emotion analysis. We developed the following deep network architectures for emotion recognition: i) CNN, single-/multi-task; ii) CNN-RNN, multi-task; iii) CNN-RNN, multi-modal (A/V) and multi-task; iv) new ArcFace networks with the respective loss function, as described below.

Visual Modality Pre-Processing. The SSH detector [30], based on ResNet and trained on the WiderFace dataset [46], was used to extract face bounding boxes from all images. Five facial landmarks (two eyes, nose and two mouth corners) were also extracted and used to perform a similarity transformation for face alignment. The resulting cropped faces are resized to 96×96×3, and the pixel intensities are normalized to [-1,1].

Audio Modality Pre-Processing. The audio signal (mono) is sampled at 44,100 Hz and spectrograms are extracted; spectrogram frames are computed over a 33 ms window with 11 ms overlap. The resulting intensity values are normalized to [-1,1] to be consistent with the visual modality.

CNN Single- & Multi-Task. We employ three state-of-the-art networks: SphereFace-20 [24], VGGFace [32] and Inception-ResNet [37] (denoted Inc.ResNet). We train these networks to perform one behavior task (VA estimation, AU detection, or Expr classification), or to jointly perform all three tasks. We call the multi-task VGGFace network MT-VGG. The predictions for all tasks are pooled from the same feature space.

CNN-RNN Multi-Task. As shown in the experimental section, MT-VGG has the best performance; we therefore construct a CNN-RNN multi-task network based on MT-VGG. In more detail, a 2-layer GRU with 128 cells per layer is stacked on top of the first fully connected layer of MT-VGG to capture temporal dynamics; the output layer sits on top of the GRU. We call this network MT-VGG-RNN.
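A minimal sketch of this CNN-RNN multi-task idea follows, assuming a TensorFlow/Keras implementation (the paper only states that experiments are implemented in TensorFlow). The simplified convolutional backbone stands in for VGGFace truncated at its first fully connected layer; apart from the quantities stated above (2 GRU layers of 128 units, 96×96×3 face crops, sequence length 90, three task heads sharing one feature space), all layer sizes are illustrative only.

```python
import tensorflow as tf

SEQ_LEN, H, W, C = 90, 96, 96, 3  # sequence length and face-crop size used in the paper

def frame_backbone():
    # Stand-in for VGGFace truncated at its first fully connected layer (MT-VGG);
    # the real network is much deeper.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(),
        tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(4096, activation="relu"),  # "first fc layer"
    ])

frames = tf.keras.Input(shape=(SEQ_LEN, H, W, C))
feats = tf.keras.layers.TimeDistributed(frame_backbone())(frames)  # per-frame features
x = tf.keras.layers.GRU(128, return_sequences=True)(feats)         # 2-layer GRU, 128 cells each
x = tf.keras.layers.GRU(128, return_sequences=True)(x)

# All task heads are pooled from the same feature space.
va = tf.keras.layers.Dense(2, activation="tanh", name="valence_arousal")(x)  # values in [-1, 1]
au = tf.keras.layers.Dense(8, activation="sigmoid", name="action_units")(x)  # 8 annotated AUs
expr = tf.keras.layers.Dense(7, name="expression_logits")(x)                 # 7 basic expressions

mt_vgg_rnn = tf.keras.Model(frames, [va, au, expr])
```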
CNN-RNN Multi-Modal (A/V) & Multi-Task. To handle both the video and audio modalities, we use a feature-level fusion strategy in our deep learning model, illustrated in Figure 4. This model consists of two identical streams that extract features directly from the raw input images and spectrograms, respectively. Each stream is an MT-VGG-RNN, described above, without the output layer. The features from the two streams are concatenated, forming a 256-dimensional feature vector that is passed through a 2-layer GRU with 128 units per layer, in order to fuse the information of the audio and visual streams. The output layer sits on top of it. We call this network A/V-MT-VGG-RNN.

Figure 4: A/V-MT-VGG-RNN: the multi-modal and multi-task developed model.

Standard Loss Functions. The objective function minimized during training of the multi-task networks is the sum of the individual task losses:

\mathcal{L}_{CCE} = \mathbb{E}\big[ -\log e^{p_p} + \log \textstyle\sum_{i=1}^{7} e^{p_i} \big]   (1)

\mathcal{L}_{BCE} = \mathbb{E}\big[ -\textstyle\sum_{i=1}^{17} \big( t_i \cdot \log p_i + (1 - t_i) \cdot \log(1 - p_i) \big) \big]   (2)

\mathcal{L}_{CCC} = 1 - 0.5 \cdot (\rho_a + \rho_v), \quad \text{with} \quad \rho_{a,v} = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}   (3)

where \mathcal{L}_{CCE} is the categorical cross entropy loss, \mathcal{L}_{BCE} is the binary cross entropy loss, p_p is the prediction of the positive class, p_i is the prediction of AU i, t_i ∈ {0, 1} is the label of AU i, \rho_{a,v} is the Concordance Correlation Coefficient (CCC) of arousal/valence, s_x and s_y are the variances of the arousal/valence labels and predicted values respectively, and s_{xy} is the corresponding covariance.

ArcFace Loss Function & Networks. Next, we focus on Expr recognition and introduce a new loss function. The softmax cross-entropy loss is modified as follows:

\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i}}{\sum_{j=1}^{7}e^{W_{j}^{T}x_i}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\lVert W_{y_i}\rVert\,\lVert x_i\rVert\cos\theta_{y_i}}}{\sum_{j=1}^{7}e^{\lVert W_{j}\rVert\,\lVert x_i\rVert\cos\theta_{j}}} \;\overset{\lVert x_i\rVert = s,\ \lVert W_j\rVert = 1}{=}\; -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{\sum_{j=1}^{7}e^{s\cos\theta_{j}}}   (4)

where the embedding feature x_i ∈ R^d denotes the deep feature of the i-th sample, belonging to the y_i-th class, W_j ∈ R^d denotes the j-th column of the weight matrix W ∈ R^{d×7}, N is the batch size, θ_j is the angle between the weight W_j and the feature x_i, ‖W_j‖ is fixed to 1 by l2 normalization, and ‖x_i‖ is fixed by l2 normalization and re-scaled to s. From eq. 4, it can be seen that the embedding features are distributed around each feature centre on the hypersphere. In our case, we adopt the ArcFace loss, in which an angular margin penalty m between x_i and W_{y_i} is added to simultaneously enhance intra-class compactness and inter-class discrepancy (in eq. 4: θ_{y_i} → θ_{y_i} + m). m is equal to the geodesic distance margin penalty on the normalised hypersphere. We refer the interested reader to [8] for more details on and an explanation of this loss.

Next, we develop two networks that use this loss. The first CNN architecture, called Multi-Task-ArcFace-Residual (MT-ArcRes), uses residual units and is depicted in Figure 5; 'bn' stands for batch normalization, the convolution layers are given in the format "filter height × filter width conv., number of output feature maps", the stride is 2 everywhere, the fc layer is the embedding layer, and the output layer provides the seven expression class logits (W_j^T x_i, j = 1..7). The second network is called Multi-Task-ArcFace-VGG (MT-ArcVGG); the difference from MT-ArcRes is that the rectangular area in the figure contains VGGFace's layers.

Figure 5: The MT-ArcRes network, trained with the ArcFace loss.
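To make the margin modification of eq. (4) concrete, here is a minimal sketch of an ArcFace-style loss for the seven expression classes, written with TensorFlow (an assumption; the paper states only that its experiments are implemented in TensorFlow). The default scale s and margin m are taken from the ranges reported in Table 3, and the clipping constant is a numerical-safety detail added here, not something the paper specifies.

```python
import tensorflow as tf

def arcface_expression_loss(features, labels, weights, s=32.0, m=0.1):
    """Additive angular margin loss: eq. (4) with theta_{y_i} -> theta_{y_i} + m,
    for 7 expression classes. `weights` has shape (d, 7), `features` shape (N, d)."""
    x = tf.math.l2_normalize(features, axis=1)    # ||x_i|| fixed, then re-scaled by s
    w = tf.math.l2_normalize(weights, axis=0)     # ||W_j|| = 1
    cos_theta = tf.clip_by_value(tf.matmul(x, w), -1.0 + 1e-7, 1.0 - 1e-7)
    theta = tf.acos(cos_theta)
    one_hot = tf.one_hot(labels, depth=7)
    logits = s * tf.cos(theta + m * one_hot)      # margin added to the target angle only
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```

Setting m = 0 recovers the plain normalised softmax on the right-hand side of eq. (4).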
4 Experimental Study

The experimental study consists of two parts. In the first, we train multi-task CNNs, CNN-RNNs and multi-modal CNN-RNNs for VA, AU and Expr recognition on Aff-Wild2; we then test the networks on 10 different databases, showing that Aff-Wild2 and the multi-task networks provide the best pre-trained framework for a large variety of emotion recognition settings. In the second, focusing on expression recognition, we first train the ArcFace networks on Aff-Wild2 and then re-train them on each expression database; their evaluation achieves state-of-the-art performance.

Implementation Details & Settings. Specific details about the hyperparameters of the developed architectures can be found in Table 3. All experiments in this paper are implemented in TensorFlow, on a Tesla V100 32GB GPU, using the Adam optimizer (with default values), or SGD with momentum (0.9) in the case of the ArcFace networks. Additional details follow:

I) CNN Single- & Multi-Task: The networks were first pre-trained for VA estimation on the Aff-Wild database; the output layer is then discarded and substituted by a new one for single- or multi-task learning, depending on the network type. The networks are then trained end-to-end on Aff-Wild2.

II) CNN-RNN Multi-Task: The CNN part is initialized with the weights of the MT-VGG CNN. The whole architecture is then trained end-to-end on Aff-Wild2.

III) CNN-RNN Multi-Modal (A/V) & Multi-Task: Training is divided into two phases: first the audio and visual streams are trained independently, and then the audiovisual network is trained end-to-end. To train each stream individually, we follow the same procedure as in the CNN-RNN multi-task case. Once the single streams are trained, they are used to initialize the corresponding streams in the multi-stream architecture. Finally, the entire audiovisual network is trained end-to-end.

IV) ArcFace Networks: Both networks are first trained on Aff-Wild2 and then re-trained end-to-end on each of the examined databases. During testing we keep the feature embedding layer and discard the output layer. For all training images, we extract features from the embedding layer and split them into 7 clusters. Then, for each test image, we compute its distance (based on cosine similarity) from all cluster centres and assign it to the centre for which this distance is minimum.

Table 3: Network configurations: ST = single-task, MT = multi-task.

| | ST- & MT-CNN | MT-CNN-RNN | A/V-MT-CNN-RNN | MT-ArcRes / MT-ArcVGG |
|---|---|---|---|---|
| learning rate | [10^-4, 10^-5], best: 10^-4 | [10^-4, 10^-6], best: 10^-5 | [10^-3, 10^-6], best: 10^-5 | [10^-4, 10^-5], best: 10^-4 |
| batch size / seq. length | 256 / - | 10 / 90 | 5 / 90 | 300 / - |
| parameters | dropout=0.4 | dropout=0.4 | dropout=0.4 | dropout=0.4, d ∈ {32, 512}, s ∈ {32, 64}, m ∈ {0.1, 0.5, 1, 1.5, 2, 2.5, 3}, best: 0.1/1 |

Databases. Table 4 shows the databases used in our experiments along with their properties. The BP4DS and BP4D+ datasets correspond to those used in the FERA 2015 [41] and FERA 2017 [42] challenges, respectively. All databases are in-the-wild, apart from DISFA, BP4DS and BP4D+, which are spontaneous. Note that for the AffectNet, BP4DS and BP4D+ databases the test set is not released; we therefore report performance on the validation set, which we use for testing.
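The testing procedure described in item IV above (per-class embedding centres, nearest centre by cosine similarity) can be sketched as follows. This is an illustrative NumPy reconstruction; in particular, it assumes the 7 clusters are simply the per-expression means of the training embeddings, a detail the paper does not spell out.

```python
import numpy as np

def expression_centres(train_embeddings, train_labels, num_classes=7):
    # One centre per expression class, computed from embedding-layer features of
    # all training images (assumed here to be the class means).
    return np.stack([train_embeddings[train_labels == c].mean(axis=0)
                     for c in range(num_classes)])

def predict_expression(test_embedding, centres):
    # Assign the test image to the centre with minimum cosine distance,
    # i.e. maximum cosine similarity.
    e = test_embedding / np.linalg.norm(test_embedding)
    c = centres / np.linalg.norm(centres, axis=1, keepdims=True)
    return int(np.argmax(c @ e))
```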
Table 4: Properties of the databases used in our experiments.

| Database | AFEW-VA [21] | AffectNet [29] | RAF-DB [22] | FER2013 [13] | IMFDB [36] | Emotionet [2] | DISFA [27] | BP4DS [51] | BP4D+ [52] |
|---|---|---|---|---|---|---|---|---|---|
| Model of affect | VA | VA, Expr | Expr | Expr | Expr | AUs | AUs | AUs | AUs |
| # of videos | 600 | - | - | - | - | - | 54 | 1,640 | 5,463 |
| # of frames | 30,050 | 450,000 | 15,200 | 35,887 | 34,512 | 50,000 | 261,630 | 222,573 | 967,570 |

Evaluation Metrics. The CCC, defined in eq. 3, is used for VA estimation, as it has been the evaluation criterion in all related challenges [20, 34]. The standard F1 score is adopted for the evaluation of AU detection and Expr classification. Exceptions are RAF-DB and FER2013, for which the mean diagonal value of the confusion matrix and the accuracy metric, respectively, are the default performance measures.

Results on static databases for VA & Expr Recognition. Table 5 presents the results of the different CNN single- and multi-task (ST- and MT-) networks in a cross-database setting (networks are trained on Aff-Wild2 and tested on AffectNet, RAF-DB, FER2013 and IMFDB). MT-VGG has the best performance for both VA and Expr recognition. In Table 5, we also compare MT-VGG's performance with that of the state-of-the-art on each of the tested databases (the results shown are taken from the respective papers). MT-VGG beats the state-of-the-art on all databases, illustrating the excellent cross-database performance of the generated framework; only for expression recognition on AffectNet is the obtained performance lower than the state-of-the-art.

Table 5: Cross-database evaluation (models trained on Aff-Wild2 and tested on other databases) for VA and Expr on static databases; VA evaluation is shown as (CCC_V - CCC_A); single values correspond to the expression performance metrics.

| Database | MT-VGG | ST-VGG | MT-SphereFace | ST-SphereFace | MT-Inc.ResNet | ST-Inc.ResNet | AlexNet [29] | VGGFACE [22] | VGG [12] |
|---|---|---|---|---|---|---|---|---|---|
| FER2013 | 0.76 | 0.73 | 0.72 | 0.72 | 0.74 | 0.71 | - | - | 0.75 |
| RAF-DB | 0.61 | 0.57 | 0.53 | 0.52 | 0.57 | 0.55 | - | 0.58 | - |
| IMFDB | 0.42 | 0.39 | 0.39 | 0.38 | 0.40 | 0.39 | - | - | - |
| AffectNet | (0.61-0.46) 0.54 | (0.51-0.42) 0.52 | (0.50-0.43) 0.50 | (0.50-0.40) 0.51 | (0.52-0.45) 0.52 | (0.50-0.42) 0.51 | (0.60-0.34) 0.58 | - | - |

Results on video databases for VA & Expr Recognition. Table 6 presents the results of the CNN and CNN-RNN multi-task, single- and multi-modal networks in a cross-database setting, testing on the Aff-Wild, Aff-Wild2 and AFEW-VA databases. The MT-VGG-RNN (trained on the visual modality) performs better than MT-VGG for VA and Expr recognition on all databases. Moreover, MT-VGG-RNN performs best for valence estimation when trained on the visual modality, whereas it performs best for arousal when trained on the audio modality. This is because audio tends to have thematic constancy. Consider, for example, two fight sequences in a movie, one a flashy fight scene and the other a one-sided fight in which a person is injured. In both cases, arousal can be high due to loud and pronounced music, but valence will be positive in the former and negative in the latter sequence. It can also be seen that A/V-MT-VGG-RNN outperforms MT-VGG-RNN, illustrating that the A/V combination improves network performance in valence, arousal and expression estimation.
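Since the CCC of eq. (3) serves both as the VA training loss and as the evaluation metric reported in the VA results above, a small NumPy sketch of it may be helpful; the function and variable names are illustrative, not taken from the released code.

```python
import numpy as np

def concordance_correlation_coefficient(labels, predictions):
    """CCC between annotations x and predictions y, as in eq. (3):
    2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2)."""
    x, y = np.asarray(labels, float), np.asarray(predictions, float)
    s_xy = np.mean((x - x.mean()) * (y - y.mean()))  # covariance of labels and predictions
    return 2 * s_xy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

def ccc_loss(valence, arousal, pred_valence, pred_arousal):
    # VA loss of eq. (3): 1 - 0.5 * (rho_arousal + rho_valence)
    rho_v = concordance_correlation_coefficient(valence, pred_valence)
    rho_a = concordance_correlation_coefficient(arousal, pred_arousal)
    return 1.0 - 0.5 * (rho_a + rho_v)
```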
Table 6 compares the performance of the proposed networks with the state-of-the-art on the examined databases; it is evident that MT-VGG and MT-VGG-RNN outperform the respective state-of-the-art.

Table 6: Cross-database evaluation for VA and Expr on video databases; VA evaluation is shown as (CCC_V - CCC_A); single values correspond to the F1 score.

| Database | MT-VGG | MT-VGG-RNN (visual modality) | MT-VGG-RNN (audio modality) | A/V-MT-VGG-RNN | best CNN [19, 20] | AffWildNet [19, 20] |
|---|---|---|---|---|---|---|
| Aff-Wild | (0.56-0.35) | (0.60-0.45) | (0.51-0.47) | (0.62-0.49) | (0.51-0.33) | (0.57-0.43) |
| Aff-Wild2 | (0.38-0.30) 0.40 | (0.40-0.33) 0.43 | (0.34-0.36) 0.43 | (0.42-0.38) 0.46 | (0.33-0.25) - | (0.35-0.28) - |
| AFEW-VA | (0.58-0.53) | (0.60-0.60) | - | - | (0.49-0.52) | (0.52-0.56) |

Results for AU Detection. Table 7 compares the performance of MT-VGG with state-of-the-art networks in a cross-database setting on Emotionet, DISFA, BP4DS and BP4D+; all reported results are for the AUs common to the testing database and Aff-Wild2. MT-VGG outperforms the winner [9] of the Emotionet 2017 challenge, the baseline [41] and the winner [49] of FERA 2015, the baseline [42] of FERA 2017, and the fine-tuned VGG (FVGG) and R-T1 methods of [23]. Apart from Emotionet, in all other cases there is a clear boost in performance. MT-VGG performs slightly worse than the winner [38] of FERA 2017.

Table 7: Cross-database evaluation for AU detection; the evaluation metric is the F1 score.

| Database | MT-VGG | MT-VGG-RNN | [9] | [49] | [41] | [38] | [42] | FVGG [23] | R-T1 [23] |
|---|---|---|---|---|---|---|---|---|---|
| Aff-Wild2 | 0.42 | 0.44 | - | - | - | - | - | - | - |
| Emotionet | 0.52 | - | 0.51 | - | - | - | - | - | - |
| DISFA | 0.61 | - | - | - | - | - | - | 0.52 | 0.60 |
| BP4DS | 0.66 | - | - | 0.54 | 0.53 | - | - | - | - |
| BP4D+ | 0.49 | - | - | - | - | 0.51 | 0.34 | - | - |

From the above tables, it is evident that Aff-Wild2 constitutes a very rich database for deep network training and further testing on very different and diverse emotion databases; the presented cross-database results validate our network developments.

Results with the ArcFace Loss for Expr Recognition. Table 8 presents a performance comparison between: i) MT-VGG (trained on Aff-Wild2); ii) a fine-tuned MT-VGG (FT-MT-VGG; pre-trained on Aff-Wild2 and then re-trained on each of the examined databases); iii) the two networks trained with the ArcFace loss (MT-ArcRes, MT-ArcVGG) on Aff-Wild2 and re-trained on each of the examined databases; and iv) the state-of-the-art on these databases (with results taken from the respective papers). FT-MT-VGG outperforms the state-of-the-art on all databases apart from RAF-DB, where DLP-CNN, which is trained with a locality preserving loss function, performs better. Table 8 also compares this network's performance with that of the MT-ArcRes and MT-ArcVGG networks trained with the ArcFace loss; these networks outperform all others, including DLP-CNN.

Table 8: Re-trained multi-task networks with the ArcFace loss, for expression recognition.

| Database | MT-ArcRes | MT-ArcVGG | FT-MT-VGG | MT-VGG | AlexNet [29] | DLP-CNN [22] | VGG [12] |
|---|---|---|---|---|---|---|---|
| AffectNet | 0.63 | 0.62 | 0.59 | 0.54 | 0.58 | - | - |
| RAF-DB | 0.75 | 0.76 | 0.71 | 0.61 | - | 0.74 | - |
| IMFDB | 0.55 | 0.56 | 0.51 | 0.42 | - | - | - |
| FER2013 | 0.80 | 0.79 | 0.78 | 0.76 | - | - | 0.75 |
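The FT-MT-VGG entry above follows a simple pre-train-then-re-train recipe: train on Aff-Wild2 first, then replace the output layer and re-train end-to-end on the target expression database. A hedged TensorFlow/Keras sketch of that recipe is shown below; the model constructor, weights file, optimiser settings and training schedule are placeholders, not the paper's exact configuration.

```python
import tensorflow as tf

def fine_tune_on_expression_db(build_model, pretrained_weights, train_ds, num_classes=7):
    """Load Aff-Wild2 pre-trained weights, replace the output layer for the target
    database, then re-train the whole network end-to-end (sketch only)."""
    base = build_model()                    # e.g. an MT-VGG-style backbone (assumed constructor)
    base.load_weights(pretrained_weights)   # Aff-Wild2 pre-training (assumed checkpoint file)
    features = base.layers[-2].output       # keep everything up to the last fc/embedding layer
    new_head = tf.keras.layers.Dense(num_classes, name="expression_logits")(features)
    model = tf.keras.Model(base.input, new_head)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    model.fit(train_ds, epochs=10)          # illustrative schedule only
    return model
```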
5 Conclusions

In this paper, we present the first and largest in-the-wild A/V database, called Aff-Wild2, that is annotated for VA, AUs and Exprs. We build and train multi-task and multi-modal CNNs and CNN-RNNs on Aff-Wild2 and test their performance on 10 databases, beating the state-of-the-art. We further train two new networks on Aff-Wild2, adopting the ArcFace loss function, and then re-train them on a variety of expression databases; the results improve on the existing state-of-the-art.

Acknowledgements

We would like to thank Viktoriia Sharmanska for our fruitful conversations during the preparation of this work. The work of S. Zafeiriou has been partially funded by the EPSRC Fellowship Deform (EP/S010203/1). The work of Dimitrios Kollias was funded by a Teaching Fellowship of Imperial College London.

References

[1] Niki Aifanti, Christos Papachristou, and Anastasios Delopoulos. The MUG facial expression database. In 11th International Workshop on Image Analysis for Multimedia Interactive Services WIAMIS 10, pages 1–4. IEEE, 2010.
[2] C.F. Benitez-Quiroz, R. Srinivasan, and A.M. Martinez. Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of IEEE International Conference on Computer Vision & Pattern Recognition (CVPR'16), Las Vegas, NV, USA, June 2016.
[3] Sanjay Bilakhia, Stavros Petridis, Anton Nijholt, and Maja Pantic. The MAHNOB mimicry database: A database of naturalistic human interactions. Pattern Recognition Letters, 66:52–61, 2015.
[4] Sumit Chopra, Raia Hadsell, Yann LeCun, et al. Learning a similarity metric discriminatively, with application to face verification. In CVPR (1), pages 539–546, 2005.
[5] Roddy Cowie and Randolph R Cornelius. Describing the emotional states that are expressed in speech. Speech Communication, 40(1):5–32, 2003.
[6] Roddy Cowie, Ellen Douglas-Cowie, Susie Savvidou, Edelle McMahon, Martin Sawey, and Marc Schröder. 'Feeltrace': An instrument for recording perceived emotion in real time. In ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000.
[7] Tim Dalgleish and Mick Power. Handbook of Cognition and Emotion. John Wiley & Sons, 2000.
[8] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
[9] Wan Ding, Dong-Yan Huang, Zhuo Chen, Xinguo Yu, and Weisi Lin. Facial action recognition using very deep networks for highly imbalanced class distribution. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1368–1372. IEEE, 2017.
[10] Shichuan Du, Yong Tao, and Aleix M Martinez. Compound facial expressions of emotion. Proceedings of the National Academy of Sciences, 111(15):E1454–E1462, 2014.
[11] Paul Ekman. Facial action coding system (FACS). A Human Face, 2002.
[12] Mariana-Iuliana Georgescu, Radu Tudor Ionescu, and Marius Popescu. Local learning with deep and handcrafted features for facial expression recognition. IEEE Access, 7:64827–64836, 2019.
[13] Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. Challenges in representation learning: A report on three machine learning contests. In International Conference on Neural Information Processing, pages 117–124. Springer, 2013.
[14] Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
[15] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
[16] Dimitrios Kollias and Stefanos Zafeiriou. Aff-Wild2: Extending the Aff-Wild database for affect recognition. arXiv preprint, 2018.
[17] Dimitrios Kollias and Stefanos Zafeiriou. A multi-task learning & generation framework: Valence-arousal, action units & primary expressions. arXiv preprint arXiv:1811.07771, 2018.
[18] Dimitrios Kollias and Stefanos Zafeiriou. Training deep neural networks with different datasets in-the-wild: The emotion recognition paradigm. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.
[19] Dimitrios Kollias, Mihalis A Nicolaou, Irene Kotsia, Guoying Zhao, and Stefanos Zafeiriou. Recognition of affect in the wild using deep neural networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1972–1979. IEEE, 2017.
[20] Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, Björn Schuller, Irene Kotsia, and Stefanos Zafeiriou. Deep affect prediction in-the-wild: Aff-Wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, 127(6-7):907–929, 2019.
[21] Jean Kossaifi, Georgios Tzimiropoulos, Sinisa Todorovic, and Maja Pantic. AFEW-VA database for valence and arousal estimation in-the-wild. Image and Vision Computing, 2017.
[22] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2852–2861, 2017.
[23] Wei Li, Farnaz Abtahi, and Zhigang Zhu. Action unit detection with region adaptation, multi-labeling learning and optimal temporal fusing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1841–1850, 2017.
[24] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017.
[25] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 94–101. IEEE, 2010.
[26] Michael J Lyons, Shigeru Akamatsu, Miyuki Kamachi, Jiro Gyoba, and Julien Budynek. The Japanese female facial expression (JAFFE) database. In Proceedings of Third International Conference on Automatic Face and Gesture Recognition, pages 14–16, 1998.
[27] S Mohammad Mavadati, Mohammad H Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F Cohn. DISFA: A spontaneous facial action intensity database. Affective Computing, IEEE Transactions on, 4(2):151–160, 2013.
[28] Gary McKeown, Michel Valstar, Roddy Cowie, Maja Pantic, and Marc Schroder. The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing, 3(1):5–17, 2011.
[29] Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor. AffectNet: A database for facial expression, valence, and arousal computing in the wild. arXiv preprint arXiv:1708.03985, 2017.
[30] Mahyar Najibi, Pouya Samangouei, Rama Chellappa, and Larry Davis. SSH: Single stage headless face detector. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[31] Maja Pantic, Michel Valstar, Ron Rademaker, and Ludo Maat. Web-based database for facial expression analysis. In Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, pages 5–pp. IEEE, 2005.
[32] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.
[33] Fabien Ringeval, Andreas Sonderegger, Jens Sauer, and Denis Lalanne. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1–8. IEEE, 2013.
[34] Fabien Ringeval, Björn Schuller, Michel Valstar, Jonathan Gratch, Roddy Cowie, Stefan Scherer, Sharon Mozgai, Nicholas Cummins, Maximilian Schmitt, and Maja Pantic. AVEC 2017: Real-life depression, and affect recognition workshop and challenge. 2017.
[35] James A Russell. Evidence of convergent validity on the dimensions of affect. Journal of Personality and Social Psychology, 36(10):1152, 1978.
[36] Shankar Setty, Moula Husain, Parisa Beham, Jyothi Gudavalli, Menaka Kandasamy, Radhesyam Vaddi, Vidyagouri Hemadri, JC Karure, Raja Raju, B Rajan, et al. Indian movie face database: A benchmark for face recognition under wide variations. In 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pages 1–5. IEEE, 2013.
[37] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
[38] Chuangao Tang, Wenming Zheng, Jingwei Yan, Qiang Li, Yang Li, Tong Zhang, and Zhen Cui. View-independent facial action unit detection. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 878–882. IEEE, 2017.
[39] Ying-li Tian, Takeo Kanade, and Jeffrey F Cohn. Recognizing action units for facial expression analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(2):97–115, 2001.
[40] Michel Valstar and Maja Pantic. Induced disgust, happiness and surprise: An addition to the MMI facial expression database. In Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, page 65, 2010.
[41] Michel F Valstar, Timur Almaev, Jeffrey M Girard, Gary McKeown, Marc Mehu, Lijun Yin, Maja Pantic, and Jeffrey F Cohn. FERA 2015: Second facial expression recognition and analysis challenge. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, volume 6, pages 1–8. IEEE, 2015.
[42] Michel F Valstar, Enrique Sánchez-Lozano, Jeffrey F Cohn, László A Jeni, Jeffrey M Girard, Zheng Zhang, Lijun Yin, and Maja Pantic. FERA 2017: Addressing head pose in the third facial expression recognition and analysis challenge. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 839–847. IEEE, 2017.
[43] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.
[44] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
[45] CM Whissel. The dictionary of affect in language. In Emotion: Theory, Research and Experience: Vol. 4, The Measurement of Emotions, R. Plutchik and H. Kellerman, Eds., New York: Academic, 1989.
[46] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. WIDER FACE: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[47] Lijun Yin, Xiaozhou Wei, Yi Sun, Jun Wang, and Matthew J Rosato. A 3D facial expression database for facial behavior research. In Automatic Face and Gesture Recognition, 2006. FGR 2006. 7th International Conference on, pages 211–216. IEEE, 2006.
[48] Lijun Yin, Xiaochen Chen, Yi Sun, Tony Worm, and Michael Reale. A high-resolution 3D dynamic facial expression database. In Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference On, pages 1–6. IEEE, 2008.
[49] Anıl Yüce, Hua Gao, and Jean-Philippe Thiran. Discriminant multi-label manifold embedding for facial action unit detection. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 6, pages 1–6. IEEE, 2015.
[50] Stefanos Zafeiriou, Dimitrios Kollias, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, and Irene Kotsia. Aff-Wild: Valence and arousal 'in-the-wild' challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34–41, 2017.
[51] Xing Zhang, Lijun Yin, Jeffrey F Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M Girard. BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing, 32(10):692–706, 2014.
[52] Zheng Zhang, Jeff M Girard, Yue Wu, Xing Zhang, Peng Liu, Umur Ciftci, Shaun Canavan, Michael Reale, Andy Horowitz, Huiyuan Yang, et al. Multimodal spontaneous emotion corpus for human behavior analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3438–3446, 2016.
