Temporal Attentive Alignment for Large-Scale Video Domain Adaptation



Min-Hung Chen¹*  Zsolt Kira¹  Ghassan AlRegib¹  Jaekwon Yoo²  Ruxin Chen²  Jian Zheng³*
¹Georgia Institute of Technology  ²Sony Interactive Entertainment LLC  ³Binghamton University

Abstract

Although various image-based domain adaptation (DA) techniques have been proposed in recent years, domain shift in videos is still not well-explored. Most previous works only evaluate performance on small-scale datasets which are saturated. Therefore, we first propose two large-scale video DA datasets with much larger domain discrepancy: UCF-HMDB_full and Kinetics-Gameplay. Second, we investigate different DA integration methods for videos, and show that simultaneously aligning and learning temporal dynamics achieves effective alignment even without sophisticated DA methods. Finally, we propose the Temporal Attentive Adversarial Adaptation Network (TA³N), which explicitly attends to the temporal dynamics using domain discrepancy for more effective domain alignment, achieving state-of-the-art performance on four video DA datasets (e.g., 7.9% accuracy gain over "Source only", from 73.9% to 81.8%, on "HMDB → UCF", and 10.3% gain on "Kinetics → Gameplay"). The code and data are released at http://github.com/cmhungsteve/TA3N.

1. Introduction

Domain adaptation (DA) [32] has been studied extensively in recent years [5] to address the domain shift problem [37, 34], i.e., that models trained on a labeled source dataset do not generalize well to target datasets and tasks. DA is categorized in terms of the availability of annotations in the target domain. In this paper, we focus on the harder unsupervised DA problem, which requires training models that can generalize to target samples without access to any target labels.
While many unsupervised DA approaches are able to diminish the distribution gap between source and target domains while learning discriminative deep features [25, 27, 11, 12, 24, 23, 39], most methods have been developed only for images, not videos. Furthermore, unlike image-based DA work, there are no well-organized datasets to evaluate and benchmark the performance of DA algorithms for videos. The most common datasets are UCF-Olympic and UCF-HMDB_small [44, 52, 17], which have only a few overlapping categories between the source and target domains. This introduces limited domain discrepancy, so a deep CNN architecture can achieve nearly perfect performance even without any DA method (details in Section 5.2 and Table 2). Therefore, we propose two larger-scale datasets to investigate video DA: 1) UCF-HMDB_full: we collect 12 overlapping categories between UCF101 [43] and HMDB51 [21], which is around three times larger than both UCF-Olympic and UCF-HMDB_small, and contains larger domain discrepancy (details in Section 5.2 and Tables 3 and 4).

*Work partially done as a SIE intern.

Figure 1: An overview of the proposed TA³N for video DA. In addition to spatial discrepancy between frame images, videos also suffer from temporal discrepancy between sets of time-ordered frames that contain multiple local temporal dynamics with different contributions to the overall domain shift, as indicated by the thickness of the green dashed arrows. Therefore, we propose to focus on aligning the temporal dynamics which have higher domain discrepancy, using a learned attention mechanism, to effectively align the temporal-embedded feature space for videos. Here we use the action basketball as the example.
2) Kinetics-Gameplay: we collect videos from several currently popular video games, with 30 categories overlapping with Kinetics-600 [19, 2]. This dataset is much more challenging than UCF-HMDB_full due to the significant domain shift between the distributions of virtual and real data.

Videos can suffer from domain discrepancy along both the spatial and temporal directions, bringing the need for alignment of the embedded feature spaces along both directions, as shown in Figure 1. However, most DA approaches have not explicitly addressed the domain shift problem in the temporal direction. Therefore, we first investigate different DA integration methods for video classification and show that: 1) aligning features that encode temporal dynamics outperforms aligning only spatial features; 2) to effectively align domains spatio-temporally, which features to align is more important than what DA approach to use. To support our claims, we then propose the Temporal Adversarial Adaptation Network (TA²N), which simultaneously aligns and learns temporal dynamics, outperforming other approaches that naively apply more sophisticated image-based DA methods to videos.

The temporal dynamics in videos can be represented as a combination of multiple local temporal features corresponding to different motion characteristics. Not all of the local temporal features contribute equally to the overall domain shift. We want to focus more on aligning those which contribute strongly to the overall domain shift, such as the local temporal features connected by the thicker green arrows in Figure 1. Therefore, we propose the Temporal Attentive Adversarial Adaptation Network (TA³N) to explicitly attend to the temporal dynamics by taking the domain distribution discrepancy into account. In this way, the temporal dynamics which contribute more to the overall domain shift are focused on, leading to more effective temporal alignment.
TA³N achieves state-of-the-art performance on all four investigated video DA datasets.

In summary, our contributions are three-fold:

1. Video DA Dataset Collection: We collect two large-scale video DA datasets, UCF-HMDB_full and Kinetics-Gameplay, to investigate the domain discrepancy problem across videos, which is an under-explored research problem. To our knowledge, they are by far the largest datasets for video DA problems.

2. Feature Alignment Exploration for Video DA: We investigate different DA integration approaches for videos and provide a strategy to effectively align domains spatio-temporally by aligning temporal relation features. We propose a simple but effective approach, TA²N, to demonstrate that determining what to align matters more than which DA method to use.

3. Temporal Attentive Adversarial Adaptation Network (TA³N): We propose TA³N, which simultaneously aligns domains, encodes temporal dynamics into video representations, and attends to the representations with domain distribution discrepancy. TA³N achieves state-of-the-art performance on both small- and large-scale cross-domain video datasets.

2. Related Works

Video Classification. With the rise of deep convolutional neural networks (CNNs), recent work on video classification mainly aims to learn compact spatio-temporal representations by leveraging CNNs for spatial information and designing various architectures to exploit temporal dynamics [18]. In addition to separating spatial and temporal learning, some works propose different architectures to encode spatio-temporal representations with consideration of the trade-off between performance and computational cost [46, 3, 36, 47]. Another branch of work utilizes optical flow to compensate for the lack of temporal information in raw RGB frames [42, 9, 49, 3, 29].
Moreover, some works extract temporal dependencies between frames for video tasks by utilizing recurrent neural networks (RNNs) [6], attention [28, 30], and relation modules [57]. Note that we focus on attending to the temporal dynamics to effectively align domains, and we consider other modalities, e.g., optical flow, to be complementary to our method.

Domain Adaptation. Most recent DA approaches are based on deep learning architectures designed to address the domain shift problem, given the fact that deep CNN features without any DA method outperform traditional DA methods using hand-crafted features [7]. Most DA approaches follow a two-branch (source and target) architecture and aim to find a common feature space between the source and target domains; the models are therefore optimized with a combination of classification and domain losses [5]. One of the main classes of methods is Discrepancy-based DA, whose metrics are designed to measure the distance between the source and target feature distributions, including variations of maximum mean discrepancy (MMD) [25, 26, 54, 53, 27] and the CORAL function [45]. By diminishing the distance between distributions, discrepancy-based DA methods reduce the gap across domains. Another common class, Adversarial-based DA, adopts a similar concept to GANs [13] by integrating domain discriminators into the architecture. Through the adversarial objectives, the discriminators are optimized to classify different domains, while the feature extractors are optimized in the opposite direction. ADDA [48] uses an inverted-label GAN loss to split the optimization into two parts: one for the discriminator and the other for the generator. In contrast, the gradient reversal layer (GRL) is used in some works [11, 12, 55] to invert the gradients so that the discriminator and generator are optimized simultaneously.
Additionally, Normalization-based DA [24, 23] adapts batch normalization [16] to DA problems by calculating two separate statistics, one for source and one for target, for normalization. Furthermore, Ensemble-based DA [10, 38, 39, 22] builds a target branch ensemble by incorporating multiple target branches. Recently, TADA [51] adopted the attention mechanism to adapt transferable regions. We extend these concepts to spatio-temporal domains, aiming to attend to the important parts of the temporal dynamics for alignment.

Video Domain Adaptation. Unlike image-based DA, video-based DA is still an under-explored area. Only a few works focus on small-scale video DA with only a few overlapping categories [44, 52, 17]. [44] improves domain generalizability by decreasing the effect of the background. [52] maps source and target features to a common feature space using shallow neural networks. AMLS [17] adapts pre-extracted C3D [46] features on a Grassmann manifold obtained using PCA. However, the datasets used in the above works are too small to exhibit enough domain shift for evaluating DA performance. Therefore, we propose two larger cross-domain datasets, UCF-HMDB_full and Kinetics-Gameplay, and provide benchmarks with different baseline approaches. Recently, TSRNet [56] transferred knowledge for action localization using MMD, but it only aligns the video-level features. Instead, our TA³N simultaneously attends, aligns, and encodes temporal dynamics into video features.

3. Technical Approach

We first introduce our baseline model, which simply extends image-based DA to videos using a temporal pooling mechanism (Section 3.1). We then investigate better ways to incorporate temporal dynamics for video DA (Section 3.2), and describe our final proposed method with the domain attention mechanism (Section 3.3).

3.1.
Baseline Model

Given the recent success of large-scale video classification using CNNs [18], we build our baseline on such architectures, as shown in the lower part of Figure 2.

Figure 2: Baseline architecture (TemPooling) with the adversarial discriminators Ĝ_sd and Ĝ_td. L_y is the class prediction loss, and L_sd and L_td are the domain losses. See the detailed architecture in the supplementary material.

We first feed the input video X_i = {x_i^1, x_i^2, ..., x_i^K}, extracted from a ResNet [14] pre-trained on ImageNet, into our model, where x_i^j is the j-th frame-level feature representation of the i-th video. The model can be divided into two parts: 1) the Spatial module G_sf(·; θ_sf), which consists of multilayer perceptrons (MLPs) that aim to convert the general-purpose feature vectors into task-driven feature vectors, where the task is video classification in this paper; 2) the Temporal module G_tf(·; θ_tf), which aggregates the frame-level feature vectors to form a single video-level feature vector for each video. In our baseline architecture, we conduct mean-pooling along the temporal direction to generate the video-level feature vectors, and denote this architecture TemPooling. Finally, another fully-connected layer G_y(·; θ_y) converts the video-level features into the final predictions, which are used to calculate the class prediction loss L_y.

As in image-based DA problems, the baseline approach is not able to generalize to data from different domains due to domain shift. Therefore, we integrate TemPooling with an unsupervised DA method inspired by one of the most popular adversarial-based approaches, DANN [11, 12]. The main idea is to add additional domain classifiers G_d(·; θ_d) to discriminate whether the data is from the source or the target domain.
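The TemPooling baseline with its DANN-style domain head can be sketched in a few lines of NumPy. This is a toy forward pass only: the dimensions, random weights, and function names are illustrative stand-ins, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: K sampled frames, 2048-d ResNet frame features,
# a 256-d task-driven space, and C action classes.
K, D_IN, D_HID, C = 5, 2048, 256, 12

def relu(x):
    return np.maximum(x, 0.0)

W_sf = rng.standard_normal((D_IN, D_HID)) * 0.01   # spatial module G_sf (MLP)
W_y = rng.standard_normal((D_HID, C)) * 0.01       # classifier G_y
W_d = rng.standard_normal((D_HID, 2)) * 0.01       # domain classifier G_d

def spatial_module(frames):           # (K, D_IN) -> (K, D_HID)
    return relu(frames @ W_sf)

def temporal_pooling(frame_feats):    # (K, D_HID) -> (D_HID,): TemPooling
    return frame_feats.mean(axis=0)

def class_logits(video_feat):         # (D_HID,) -> (C,)
    return video_feat @ W_y

def domain_logits(video_feat):        # (D_HID,) -> (2,), sits behind a GRL
    return video_feat @ W_d

# The GRL is the identity in the forward pass and negates gradients in the
# backward pass, so the domain head is trained adversarially:
def grl_backward(grad, lam=1.0):
    return -lam * grad

video = rng.standard_normal((K, D_IN))             # one video's frame features
feat = temporal_pooling(spatial_module(video))
print(class_logits(feat).shape, domain_logits(feat).shape)  # -> (12,) (2,)
```

In a real framework the gradient reversal would be implemented as a custom autograd operation; here it is shown as a bare backward rule only to make the sign flip explicit.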
Before back-propagating the gradients to the main model, a gradient reversal layer (GRL) is inserted between G_d and the main model to invert the gradients, as shown in Figure 2. During adversarial training, the parameters θ_sf are learned by maximizing the domain discrimination loss L_d, while the parameters θ_d are learned by minimizing L_d with the domain label d. The feature generator is therefore optimized to gradually align the feature distributions between the two domains.

In this paper, we denote the Adversarial Discriminator Ĝ_d as the combination of a gradient reversal layer (GRL) and a domain classifier, and insert Ĝ_d into TemPooling in two ways: 1) Ĝ_sd, which shows how directly applying image-based DA approaches can benefit video DA; 2) Ĝ_td, which indicates how DA on temporal-dynamics-encoded features benefits video DA. The prediction loss L_y, spatial domain loss L_sd, and temporal domain loss L_td can be expressed as follows (ignoring all parameter symbols throughout the paper to save space):

L_y^i = L_y(G_y(G_tf(G_sf(X_i))), y_i)    (1)

L_sd^i = (1/K) Σ_{j=1}^{K} L_d(G_sd(G_sf(x_i^j)), d_i)    (2)

L_td^i = L_d(G_td(G_tf(G_sf(X_i))), d_i)    (3)

where K is the number of frames sampled from each video, and L_d is the cross-entropy loss function. The overall loss can be expressed as follows:

L = (1/N_S) Σ_{i=1}^{N_S} L_y^i − (1/N_{S∪T}) Σ_{i=1}^{N_{S∪T}} (λ_s L_sd^i + λ_t L_td^i)    (4)

where N_S is the number of source samples, N_{S∪T} is the number of samples from both domains, and λ_s and λ_t are the trade-off weights for the spatial and temporal domain losses.

3.2. Integration of Temporal Dynamics with DA

One main drawback of directly integrating image-based DA approaches into our baseline architecture is that the feature representations learned by the model come mainly from the spatial features. Although we implicitly encode temporal information through the temporal pooling mechanism, the relation between frames is still missing.
Therefore, we would like to address two questions: 1) Does the video DA problem benefit from encoding temporal dynamics into features? 2) Instead of only modifying the feature encoding method, how can DA be further integrated while encoding temporal dynamics into features?

To answer the first question, given that humans can recognize actions by reasoning over observations across time, we propose the TemRelation architecture: we replace the temporal pooling mechanism with the Temporal Relation module, which is modified from [41, 57], as shown in Figure 4. The n-frame temporal relation is defined by the function:

R_n(V_i) = Σ_m g_φ(n)((V_i^n)_m)    (5)

where (V_i^n)_m = {v_i^a, v_i^b, ...}_m is the m-th set of frame-level representations sampled from n time-ordered frames, and a and b are frame indices. We fuse these time-ordered feature vectors with the function g_φ(n), which is an MLP with parameters φ(n). To capture temporal relations at multiple time scales, we sum all the n-frame relation features into the final video representation. In this way, the temporal dynamics are explicitly encoded into the features. We then insert Ĝ_d into TemRelation as we did for TemPooling.

Although aligning temporal-dynamics-encoded features benefits video DA, feature encoding and DA are still two separate processes, leading to sub-optimal DA performance. Therefore, we address the second question by proposing the Temporal Adversarial Adaptation Network (TA²N), which explicitly integrates Ĝ_d inside the Temporal module to align the model across domains while learning the temporal dynamics. Specifically, we pair each n-frame relation with a corresponding relation discriminator Ĝ_rd^n, because different n-frame relations represent different temporal characteristics, which correspond to different parts of actions.
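A minimal sketch of the n-frame relation in Eq. (5), assuming a random stand-in linear layer for each fusion MLP g_φ(n) and sampling only a few time-ordered frame subsets per scale (the real module's subset sampling and MLP depth may differ):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K, D = 5, 256   # K sampled frames with D-dim frame-level features

def relu(x):
    return np.maximum(x, 0.0)

# One fusion function g_phi(n) per relation scale n; a single random
# linear layer stands in for each MLP here.
W_rel = {n: rng.standard_normal((n * D, D)) * 0.01 for n in range(2, K + 1)}

def n_frame_relation(frames, n, max_sets=3):
    """Eq. (5): R_n = sum over time-ordered n-frame subsets of g_phi(n).
    itertools.combinations keeps indices in ascending (time) order;
    limiting to max_sets subsets keeps the sum small for the sketch."""
    subsets = list(itertools.combinations(range(len(frames)), n))[:max_sets]
    feats = [relu(np.concatenate([frames[j] for j in idx]) @ W_rel[n])
             for idx in subsets]
    return np.sum(feats, axis=0)

frames = rng.standard_normal((K, D))
# Multi-scale video representation: sum the n-frame relations for n = 2..K.
video_feat = sum(n_frame_relation(frames, n) for n in range(2, K + 1))
print(video_feat.shape)  # -> (256,)
```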
The relation domain loss L_rd can be expressed as follows:

L_rd^i = (1/(K−1)) Σ_{n=2}^{K} L_d(G_rd^n(R_n(G_sf(X_i))), d_i)    (6)

The experimental results show that our integration strategy can effectively align domains spatio-temporally for videos and outperforms approaches extended from sophisticated DA methods, even though TA²N is adopted from a simpler DA method (DANN) (see details in Tables 3 to 5).

Figure 3: The domain attention mechanism in TA³N. Thicker arrows correspond to larger attention weights.

3.3. Temporal Attentive Alignment for Videos

The final video representation of TA²N is generated by aggregating multiple local temporal features. Although aligning temporal features across domains benefits video DA, not all the features are equally important to align. In order to effectively align the overall temporal dynamics, we want to focus more on aligning the local temporal features which have larger domain discrepancy. Therefore, we represent the final video representation as a combination of local temporal features with different attention weights, as shown in Figure 3, and aim to attend to the features of interest that are domain-discriminative, so that the DA mechanism can focus on aligning those features. The main question becomes: how to incorporate domain discrepancy for attention?

To address this, we propose the Temporal Attentive Adversarial Adaptation Network (TA³N), as shown in Figure 4, by introducing the domain attention mechanism, which utilizes the entropy criterion to generate the domain attention value for each n-frame relation feature as below:

w_i^n = 1 − H(d̂_i^n)    (7)

where d̂_i^n is the output of G_rd^n for the i-th video.
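The attention weight of Eq. (7) and the residual aggregation it feeds (Eq. (8)) can be sketched as follows, where H is the Shannon entropy of the discriminator's softmaxed output. Feature sizes and discriminator outputs below are hand-picked illustrations:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """H(p) = -sum_k p_k log p_k (natural log)."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def domain_attention_weight(domain_probs):
    """Eq. (7): w = 1 - H(d_hat); confidently separable relations get
    larger weight."""
    return 1.0 - entropy(domain_probs)

def attend_and_aggregate(local_feats, domain_preds):
    """Eq. (8): h = sum_n (w_n + 1) * feature_n, with the +1 residual term."""
    return sum((domain_attention_weight(d) + 1.0) * f
               for f, d in zip(local_feats, domain_preds))

rng = np.random.default_rng(0)
feats = [rng.standard_normal(256) for _ in range(4)]        # scales n = 2..5
preds = [np.array([0.9, 0.1]), np.array([0.5, 0.5]),
         np.array([0.7, 0.3]), np.array([0.99, 0.01])]      # softmaxed d_hat
h = attend_and_aggregate(feats, preds)
print(h.shape)  # -> (256,)
```

An ambiguous domain prediction ([0.5, 0.5], entropy ln 2 ≈ 0.69) yields a weight near 1.31 after the residual, while a confident one ([0.99, 0.01]) yields a weight near 1.94, so relations carrying more domain-discriminative information dominate the aggregate.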
H(p) = −Σ_k p_k · log(p_k) is the entropy function, which measures uncertainty. w_i^n increases when H(d̂_i^n) decreases, which means the domains can be distinguished well. We also add a residual connection for more stable optimization. The final video feature representation h_i, generated from the attended local temporal features, which are learned by the local temporal modules G_tf^(n), can therefore be expressed as:

h_i = Σ_{n=2}^{K} (w_i^n + 1) · G_tf^(n)(G_sf(X_i))    (8)

Finally, we add a minimum entropy regularization to refine the classifier adaptation. However, we only want to minimize the entropy for the videos that are similar across domains. Therefore, we attend to the videos which have low domain discrepancy, so that we can focus on minimizing the entropy for these videos.

Figure 4: The overall architecture of the proposed Temporal Attentive Adversarial Adaptation Network (TA³N). In the temporal relation module, time-ordered frames are used to generate K−1 relation feature representations R = {R_2, ..., R_K}, where R_n corresponds to the n-frame relation (the numbers in the figure are examples of time indices). After attending with the domain predictions from the relation discriminators G_rd^n, the relation features are summed into the final video representation. The attentive entropy loss L_ae, which is calculated from the domain entropy H(d̂) and the class entropy H(ŷ), aims to enhance the certainty of those videos that are more similar across domains.
See the detailed architecture in the supplementary material.

The attentive entropy loss L_ae can be expressed as follows:

L_ae^i = (1 + H(d̂_i)) · H(ŷ_i)    (9)

where d̂_i and ŷ_i are the outputs of G_td and G_y, respectively. We also adopt the residual connection for stability. By combining Equations (1) to (3), (6) and (9), and replacing G_sf and G_tf with h_i via Equation (8), the overall loss of TA³N can be expressed as follows:

L = (1/N_S) Σ_{i=1}^{N_S} L_y^i + (γ/N_{S∪T}) Σ_{i=1}^{N_{S∪T}} L_ae^i − (1/N_{S∪T}) Σ_{i=1}^{N_{S∪T}} (λ_s L_sd^i + λ_r L_rd^i + λ_t L_td^i)    (10)

where λ_s, λ_r and λ_t are the trade-off weights for each domain loss, and γ is the weight for the attentive entropy loss. All the weights are chosen via grid search.

Our proposed TA³N and TADA [51] both utilize entropy functions for attention, but from different perspectives: TADA aims to focus on the foreground objects for image DA, while TA³N aims to find the important and discriminative parts of the temporal dynamics to align for video DA.

4. Datasets

There are very few benchmark datasets for video DA, and only small-scale datasets have been widely used [44, 52, 17]. Therefore, we specifically create two cross-domain datasets to evaluate the proposed approaches for the video DA problem, as shown in Table 1. For more details about the datasets, please refer to the supplementary material.

4.1. UCF-HMDB_full

We extend UCF-HMDB_small [44], which selects only 5 visually highly similar categories, by collecting all of the relevant and overlapping categories between UCF101 [43] and HMDB51 [21], resulting in 12 categories. We follow the official split method to separate the training and validation sets. This dataset, UCF-HMDB_full, includes more than 3000 video clips and is around 3 times larger than UCF-HMDB_small and UCF-Olympic.

4.2. Kinetics-Gameplay

In addition to real-world videos, we are also interested in virtual-world videos for DA.
While there are more than ten real-world video datasets, there is only a limited number of virtual-world datasets for video classification, mainly because rendering realistic human actions with game engines requires gaming graphics expertise and is time-consuming. Therefore, we create the Gameplay dataset by collecting gameplay videos from currently popular video games, Detroit: Become Human and Fortnite, to build our own video dataset for the virtual domain. For the real domain, we use one of the largest public video datasets, Kinetics-600 [19, 2]. We follow the closed-set DA setting [34] to select the 30 overlapping categories between the Kinetics-600 and Gameplay datasets to build the Kinetics-Gameplay dataset with both domains, including around 50K video clips. See the supplementary material for the complete statistics and example snapshots.

Table 1: The comparison of the cross-domain video datasets.

               UCF-HMDB_small  UCF-Olympic  UCF-HMDB_full  Kinetics-Gameplay
length (sec.)  1-21            1-39         1-33           1-10
class #        5               6            12             30
video #        1,171           1,145        3,209          49,998

5. Experiments

We evaluate DA approaches on four datasets: UCF-Olympic, UCF-HMDB_small, UCF-HMDB_full, and Kinetics-Gameplay.

5.1. Experimental Setup

UCF-Olympic and UCF-HMDB_small. First, we evaluate our approaches on UCF-Olympic and UCF-HMDB_small, and compare with all other works that also evaluate on these two datasets [44, 52, 17]. We follow the default settings; however, the method for splitting the UCF video clips into training and validation sets is not specified in these papers, so we follow the official split method from UCF101 [43].

UCF-HMDB_full and Kinetics-Gameplay. For the self-collected datasets, we follow the common experimental protocol of unsupervised DA [34]: the training data consists of labeled data from the source domain and unlabeled data from the target domain, and the validation data is all from the target domain.
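Under this protocol, one training step could assemble its loss roughly as sketched below. All probabilities, labels, and weight values are made-up stand-ins; the minus sign on the domain term mirrors how the GRL of Section 3.1 turns domain-loss minimization for the discriminator into maximization for the feature extractor (cf. Eq. (4)).

```python
import numpy as np

def cross_entropy(probs, label, eps=1e-12):
    """Cross-entropy on an already-softmaxed probability vector."""
    return -np.log(max(float(probs[label]), eps))

def batch_loss(src_cls_probs, src_labels, src_dom_probs, tgt_dom_probs,
               lam=0.5):
    """One illustrative training step: the class loss uses labeled source
    videos only, while the domain loss uses both the source batch and the
    unlabeled target batch (domain labels: 0 = source, 1 = target)."""
    L_y = np.mean([cross_entropy(p, y)
                   for p, y in zip(src_cls_probs, src_labels)])
    L_d = np.mean([cross_entropy(p, 0) for p in src_dom_probs] +
                  [cross_entropy(p, 1) for p in tgt_dom_probs])
    return L_y - lam * L_d

src_cls = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
src_y = [0, 1]                         # labels exist only for source videos
src_dom = [np.array([0.6, 0.4])] * 2
tgt_dom = [np.array([0.3, 0.7])] * 2   # target videos carry no class labels
loss = batch_loss(src_cls, src_y, src_dom, tgt_dom)
```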
However, unlike most image DA settings, our training and validation data in both domains are kept separate to avoid potential overfitting while aligning the different domains. To compare with image-based DA approaches, we extend several state-of-the-art methods [12, 27, 23, 39] to video DA with our TemPooling and TemRelation architectures, as shown in Tables 3 to 5. The difference between the "Target only" and "Source only" settings is the domain used for training: the "Target only" setting can be regarded as the upper bound without domain shift, while the "Source only" setting shows the lower bound, which directly applies the model trained with source data to the target domain without modification. See the supplementary material for full implementation details.

5.2. Experimental Results

UCF-Olympic and UCF-HMDB_small. On these two datasets, our approach outperforms all previous methods by at least 6.5% absolute difference (98.15% vs. 91.60%) on the "U → O" setting, and by 9% (99.33% vs. 90.25%) on the "U → H" setting, as shown in Table 2.

Table 2: The accuracy (%) for the state-of-the-art work on UCF-Olympic and UCF-HMDB_small (U: UCF, O: Olympic, H: HMDB).

Source → Target           U → O   O → U   U → H   H → U
W. Sultani et al. [44]    33.33   47.91   68.70   68.67
T. Xu et al. [52]         87.00   75.00   82.00   82.00
AMLS (GFK) [17]†          84.65   86.44   89.53   95.36
AMLS (SA) [17]†           83.92   86.07   90.25   94.40
DAAA [17]†‡               91.60   89.96   -       -
TemPooling                96.30   87.08   98.67   97.35
TemPooling + DANN [12]    98.15   90.00   99.33   98.41
Ours (TA²N)               98.15   91.67   99.33   99.47
Ours (TA³N)               98.15   92.92   99.33   99.47

† We only show their results fine-tuned with source data for a fair comparison; please refer to the supplementary material for more details. ‡ [17] did not test DAAA on UCF-HMDB_small.

These results also show that the performance on these datasets is saturated.
With a strong CNN backbone, even our baseline architecture TemPooling can achieve high accuracy without any DA method (e.g., 96.3% for "U → O"). This suggests that these two datasets are not sufficient to evaluate more sophisticated DA approaches, so larger-scale datasets for video DA are needed.

UCF-HMDB_full. We then evaluate our approaches and compare with other image-based DA approaches on the UCF-HMDB_full dataset, as shown in Tables 3 and 4. The accuracy difference between "Target only" and "Source only" indicates the domain gap. The gaps for the HMDB dataset are 11.11% for TemRelation and 10.28% for TemPooling (see Table 3), and the gaps for the UCF dataset are 21.01% for TemRelation and 17.16% for TemPooling (see Table 4). It is worth noting that the "Source only" accuracy of our baseline architecture (TemPooling) on UCF-HMDB_full is much lower than on UCF-HMDB_small (e.g., 28.39% lower for "U → H"), which implies that UCF-HMDB_full contains much larger domain discrepancy than UCF-HMDB_small. The value "Gain" is the difference from the "Source only" accuracy, which directly indicates the effectiveness of the DA approaches.

Table 3: The comparison of accuracy (%) with other approaches on UCF-HMDB_full (U → H). Gain represents the absolute difference from the "Source only" accuracy. TA²N and TA³N are based on the TemRelation architecture, so they are not applicable to TemPooling.

Temporal Module   TemPooling       TemRelation
                  Acc.    Gain     Acc.    Gain
Target only       80.56   -        82.78   -
Source only       70.28   -        71.67   -
DANN [12]         71.11   0.83     75.28   3.61
JAN [27]          71.39   1.11     74.72   3.05
AdaBN [23]        75.56   5.28     72.22   0.55
MCD [39]          71.67   1.39     73.89   2.22
Ours (TA²N)       N/A     -        77.22   5.55
Ours (TA³N)       N/A     -        78.33   6.66

Table 4: The comparison of accuracy (%) with other approaches on UCF-HMDB_full (H → U).

Temporal Module   TemPooling       TemRelation
                  Acc.    Gain     Acc.    Gain
Target only       92.12   -        94.92   -
Source only       74.96   -        73.91   -
DANN [12]         75.13   0.17     76.36   2.45
JAN [27]          80.04   5.08     79.69   5.79
AdaBN [23]        76.36   1.40     77.41   3.51
MCD [39]          76.18   1.23     79.34   5.44
Ours (TA²N)       N/A     -        80.56   6.66
Ours (TA³N)       N/A     -        81.79   7.88

We now answer the two questions for video DA from Section 3.2 (see Tables 3 and 4):

1. Does the video DA problem benefit from encoding temporal dynamics into features? From Tables 3 and 4, we see that for the same DA method, TemRelation outperforms TemPooling in most cases, especially in terms of the gain value. For example, "TemPooling + DANN" reaches 0.83% absolute accuracy gain on the "U → H" setting and 0.17% gain on the "H → U" setting, while "TemRelation + DANN" reaches 3.61% gain on "U → H" and 2.45% gain on "H → U". This means that applying DA approaches to video representations which encode the temporal dynamics improves the overall performance for cross-domain video classification.

2. How to further integrate DA while encoding temporal dynamics into features? Although integrating TemRelation with image-based DA approaches generally yields better alignment performance than the baseline (TemPooling), feature encoding and DA are still two separate processes; the alignment happens only before and after the temporal dynamics are encoded into features. In order to explicitly force alignment of the temporal dynamics across domains, we propose TA²N, which reaches 77.22% (5.55% gain) on "U → H" and 80.56% (6.66% gain) on "H → U". Tables 3 and 4 show that although TA²N is adopted from a simple DA method (DANN), it still outperforms other approaches which are extended from more sophisticated DA methods but do not follow our strategy.

Finally, with the domain attention mechanism, our proposed TA³N reaches 78.33% (6.66% gain) on "U → H" and 81.79% (7.88% gain) on "H → U", achieving state-of-the-art performance on UCF-HMDB_full in terms of accuracy and gain, as shown in Tables 3 and 4.

Kinetics-Gameplay. Kinetics-Gameplay is much more challenging than UCF-HMDB_full because the data comes from the real and virtual domains, which exhibit more severe domain shifts. Here we only utilize TemRelation as our backbone architecture, since it has proved to outperform TemPooling on UCF-HMDB_full. Table 5 shows that the accuracy gap between "Source only" and "Target only" is 47.27%, which is more than twice the gap on UCF-HMDB_full. On this dataset, TA³N also outperforms all the other DA approaches, increasing the "Source only" accuracy from 17.22% to 27.50%.

Table 5: The comparison of accuracy (%) with other approaches on Kinetics-Gameplay.

                  Acc.    Gain
Target only       64.49   -
Source only       17.22   -
DANN [12]         20.56   3.34
JAN [27]          18.16   0.94
AdaBN [23]        20.29   3.07
MCD [39]          19.76   2.54
Ours (TA²N)       24.30   7.08
Ours (TA³N)       27.50   10.28

5.3. Ablation Study and Analysis

Integration of Ĝ_d. We use UCF-HMDB_full to investigate the performance of integrating Ĝ_d in different positions. There are three ways to insert the adversarial discriminator into our architectures, each corresponding to different feature representations, leading to three types of discriminators, Ĝ_sd, Ĝ_td and Ĝ_rd, which are shown in Figure 4; the full experimental results are shown in Table 6. For the TemRelation architecture, utilizing Ĝ_td shows better performance than utilizing Ĝ_sd (0.58% absolute gain improvement on average across the two tasks), while the accuracies are the same for TemPooling. This means that the temporal relation module can encode temporal dynamics that help the video DA problem, but temporal pooling cannot. Utilizing the relation discriminator Ĝ_rd can further improve the performance (0.92% improvement), since we simultaneously align and learn the temporal dynamics across domains.
Finally, by combining all three discriminators, TA2N improves even more (4.20% improvement).

S → T             UCF → HMDB                    HMDB → UCF
Temporal Module   TemPooling     TemRelation    TemPooling     TemRelation
Target only       80.56 (-)      82.78 (-)      92.12 (-)      94.92 (-)
Source only       70.28 (-)      71.67 (-)      74.96 (-)      73.91 (-)
Ĝsd               71.11 (0.83)   74.44 (2.77)   75.13 (0.17)   74.44 (1.05)
Ĝtd               71.11 (0.83)   74.72 (3.05)   75.13 (0.17)   75.83 (1.93)
Ĝrd               - (-)          76.11 (4.44)   - (-)          75.13 (1.23)
All Ĝd            71.11 (0.83)   77.22 (5.55)   75.13 (0.17)   80.56 (6.66)

Table 6: The full evaluation of accuracy (%) for integrating Ĝd in different positions, without the attention mechanism. Gain values are in parentheses.

S → T                   UCF → HMDB                    HMDB → UCF
Temporal Module         TemPooling     TemRelation    TemPooling     TemRelation
Target only             80.56 (-)      82.78 (-)      92.12 (-)      94.92 (-)
Source only             70.28 (-)      71.67 (-)      74.96 (-)      73.91 (-)
All Ĝd                  71.11 (0.83)   77.22 (5.55)   75.13 (0.17)   80.56 (6.66)
All Ĝd + Domain Attn.   73.06 (2.78)   78.33 (6.66)   78.46 (3.50)   81.79 (7.88)

Table 7: The effect of the domain attention mechanism.

S → T               UCF → HMDB     HMDB → UCF
Target only         82.78 (-)      94.92 (-)
Source only         71.67 (-)      73.91 (-)
No Attention        77.22 (5.55)   80.56 (6.66)
General Attention   77.22 (5.55)   80.91 (7.00)
Domain Attention    78.33 (6.66)   81.79 (7.88)

Table 8: The comparison of different attention methods.

Attention mechanism. In addition to TemRelation, we also apply the domain attention mechanism to TemPooling by attending to the raw frame features instead of the relation features, and it improves the performance as well, as shown in Table 7. This implies that video DA can benefit from domain attention even if the backbone architecture does not encode temporal dynamics. We also compare the domain attention module with a general attention module, which calculates the attention weights via an FC-Tanh-FC-Softmax architecture.
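As a reference point, the general attention module just described (FC-Tanh-FC-Softmax over a set of features) can be sketched as follows. The weight shapes and the weighted-sum readout are illustrative assumptions, not the paper's exact implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def general_attention(features, W1, b1, w2, b2):
    """FC-Tanh-FC-Softmax attention: each feature gets a scalar score,
    the scores are normalized with softmax, and the output is the
    weighted sum of the inputs. Note the weights are computed within a
    single domain; no domain information enters the scoring."""
    scores = []
    for f in features:
        hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, f)) + bi)
                  for row, bi in zip(W1, b1)]          # FC + Tanh
        scores.append(sum(w * h for w, h in zip(w2, hidden)) + b2)  # FC
    weights = softmax(scores)
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(len(features[0]))]
```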
However, the general attention module performs worse since its weights are computed within a single domain, without considering the domain discrepancy, as shown in Table 8.

Visualization of distributions. To investigate how our approaches bridge the gap between the source and target domains, we visualize the distributions of both domains using t-SNE [31]. Figure 5 shows that TA3N groups the source data (blue dots) into denser clusters and also generalizes the distribution to the target domain (orange dots).

Figure 5: The comparison of t-SNE visualizations for (a) TemPooling + DANN [12] and (b) TA3N. The blue dots represent source data and the orange dots represent target data. See the supplementary material for more comparisons.

Domain discrepancy measures. To measure the alignment between different domains, we use Maximum Mean Discrepancy (MMD) and the domain loss, both calculated on the final video representations. Lower MMD values and higher domain loss both imply a smaller domain gap. TA3N reaches a lower discrepancy loss (0.0842) than the TemPooling baseline (0.1840), and shows a large improvement in terms of the domain loss (from 1.1163 to 1.9286), as shown in Table 9.

                         Discrepancy loss   Domain loss   Validation accuracy
TemPooling               0.1840             1.1163        70.28
TemPooling + DANN [12]   0.1604             1.2023        71.11
TemRelation              0.2626             1.7588        71.67
TA3N                     0.0842             1.9286        78.33

Table 9: The discrepancy loss (MMD), domain loss, and validation accuracy of our baselines and proposed approaches.

6. Conclusion and Future Work

In this paper, we present two large-scale datasets for video domain adaptation, UCF-HMDB_full and Kinetics-Gameplay, covering both real and virtual domains. We use these datasets to investigate the domain shift problem across videos, and show that simultaneously aligning and learning temporal dynamics achieves effective alignment without the need for sophisticated DA methods.
Finally, we propose the Temporal Attentive Adversarial Adaptation Network (TA3N) to simultaneously attend to, align, and learn temporal dynamics across domains, achieving state-of-the-art performance on all of the cross-domain video datasets investigated. The code and data are released at http://github.com/cmhungsteve/TA3N.

The ultimate goal of our research is to solve real-world problems. Therefore, in addition to integrating more DA approaches into our video DA pipeline, there are two main directions we would like to pursue in future work: 1) apply TA3N to different cross-domain video tasks, including video captioning, segmentation, and detection; 2) extend these methods to the open-set setting [1, 40, 34, 15], which has different categories between the source and target domains. The open-set setting is much more challenging but closer to real-world scenarios.

7. Supplementary

In this supplementary material, we show more detailed ablation studies, more implementation details, and a complete introduction to the datasets.

7.1. Visualization of distributions

We visualize the distributions of both domains using t-SNE [31] to investigate how our approaches bridge the gap between the source and target domains. Figures 6a and 6b show that models using the TemPooling architecture align the distributions between domains poorly, even with the integration of image-based DA approaches. Figure 6c shows that the temporal relation module helps to group the source data (blue) into denser clusters but is still not able to generalize the distribution to the target domain (orange). Finally, with TA3N, data from both domains are clustered and aligned with each other (Figure 6d).

Figure 6: The comparison of t-SNE visualizations with source (blue) and target (orange) distributions for (a) TemPooling, (b) TemPooling + DANN [12], (c) TemRelation, and (d) TA3N.

7.2. Domain Attention Mechanism

We also apply the domain attention mechanism to TemPooling by attending to the raw frame features, as shown in Figure 7. Tables 10 and 11 show that the domain attention mechanism improves the performance of both the TemPooling and TemRelation architectures, for all types of adversarial discriminators. This implies that video DA can benefit from domain attention even if the backbone architecture does not encode temporal dynamics.

Figure 7: Baseline architecture (TemPooling) equipped with the domain attention mechanism (the input feature parts are omitted to save space).

Temporal Module   TemPooling     TemPooling + Attn.   TemRelation    TemRelation + Attn.
Target only       80.56 (-)      80.56 (-)            82.78 (-)      82.78 (-)
Source only       70.28 (-)      70.28 (-)            71.67 (-)      71.67 (-)
Ĝsd               71.11 (0.83)   71.94 (1.66)         74.44 (2.77)   75.00 (3.33)
Ĝtd               71.11 (0.83)   72.78 (2.50)         74.72 (3.05)   76.94 (5.27)
Ĝrd               - (-)          - (-)                76.11 (4.44)   76.94 (5.27)
All Ĝd            71.11 (0.83)   73.06 (2.78)         77.22 (5.55)   78.33 (6.66)

Table 10: The evaluation of accuracy (%) for integrating Ĝd in different positions on "U → H". Gain values are in parentheses.

Temporal Module   TemPooling     TemPooling + Attn.   TemRelation    TemRelation + Attn.
Target only       92.12 (-)      92.12 (-)            94.92 (-)      94.92 (-)
Source only       74.96 (-)      74.96 (-)            73.91 (-)      73.91 (-)
Ĝsd               75.13 (0.17)   77.58 (2.62)         74.44 (1.05)   78.63 (4.72)
Ĝtd               75.13 (0.17)   78.46 (3.50)         75.83 (1.93)   81.44 (7.53)
Ĝrd               - (-)          - (-)                75.13 (1.23)   78.98 (5.07)
All Ĝd            75.13 (0.17)   78.46 (3.50)         80.56 (6.66)   81.79 (7.88)

Table 11: The evaluation of accuracy (%) for integrating Ĝd in different positions on "H → U". Gain values are in parentheses.

7.3. Implementation Details

7.3.1 Detailed architectures

The architecture with detailed notations for the baseline is shown in Figure 8. For our proposed TA3N, after generating the n-frame relation features R_n with the temporal relation module, we calculate the domain attention value w_n using the domain prediction d̂ from the relation discriminator Ĝ_rd^n, and then attend to R_n using w_n with a residual connection. To calculate the attentive entropy loss L_ae, since we only want to focus on the videos with low domain discrepancy, we attend to the class entropy loss H(ŷ) using the domain entropy H(d̂) as the attention value, again with a residual connection, as shown in Figure 9.

Figure 8: The detailed baseline architecture (TemPooling) with the adversarial discriminators Ĝsd and Ĝtd.

7.3.2 Optimization

Our implementation is based on the PyTorch [33] framework. We utilize a ResNet-101 model pre-trained on ImageNet as the frame-level feature extractor. For each video, we sample a fixed number K of frame-level feature vectors with equal spacing in the temporal direction (K = 5 in our setting, to limit computational resource requirements). For optimization, the initial learning rate is 0.03, and we follow one of the commonly used learning-rate-decreasing strategies shown in DANN [12]. We use stochastic gradient descent (SGD) as the optimizer, with momentum and weight decay set to 0.9 and 1 × 10^-4, respectively. The ratio between the source and target batch sizes is proportional to the scale ratio between the source and target datasets.
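Returning to the architecture of Sec. 7.3.1, the two entropy-based pieces, the domain attention value and the attentive entropy loss, can be sketched as below. The exact forms (w = 1 − H(d̂) and the residual 1 + H(d̂) weighting) are our reading of the description above, not code taken from the released implementation.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def domain_attention(feature, d_hat):
    """Weight a relation feature by domain discrepancy: a confident
    domain prediction d_hat (low entropy) indicates a large domain
    discrepancy, so that feature gets a larger attention value. The
    +1 residual keeps the original feature as a baseline."""
    w = 1.0 - entropy(d_hat)
    return [(w + 1.0) * x for x in feature]

def attentive_entropy(y_hat, d_hat):
    """Class-entropy minimization attended by domain entropy: videos
    with low domain discrepancy (high domain entropy) contribute more,
    again with a residual +1."""
    return (1.0 + entropy(d_hat)) * entropy(y_hat)
```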
The source batch size depends on the scale of the dataset: 32 for UCF-Olympic and UCF-HMDB_small, 128 for UCF-HMDB_full, and 512 for Kinetics-Gameplay. The optimized values of λ_s, λ_r and λ_t are found using a coarse-to-fine grid search. We first search a coarse grid with the geometric sequence [0, 10^-3, 10^-2, ..., 10^0, 10^1]. After finding the optimal range of values, [0, 1], we search again on a fine grid with the arithmetic sequence [0, 0.25, ..., 1]. The final values are 0.75 for λ_s, 0.5 for λ_r and 0.75 for λ_t. We search γ only on the coarse grid, and the best value is 0.3. For future work, we plan to adopt adaptive weighting techniques used in multi-task learning, such as uncertainty weighting [20] and GradNorm [4], to replace the manual grid search.

7.3.3 Comparison with other work

As mentioned in the experimental setup, we compare our proposed TA3N with other approaches by extending several state-of-the-art image-based DA methods [12, 27, 23, 39] to video DA with our TemPooling and TemRelation architectures, as follows:

1. DANN [12]: we add one adversarial discriminator Ĝsd right after the spatial module and another one, Ĝtd, right after the temporal module. We do not add an additional discriminator for the relation features, for a fair comparison between TemPooling and TemRelation.

2. JAN [27]: we add the Joint Maximum Mean Discrepancy (JMMD) to the final video representation and the class prediction.

3. AdaBN [23]: we integrate an adaptive batch-normalization layer into the feature generator G_sf. In the adaptive batch-normalization layer, the statistics (mean and variance) are calculated for both the source and target domains, but only the target statistics are used when validating on target data.

4. MCD [39]: we add another classifier G'_y and follow the adversarial training procedure of Maximum Classifier Discrepancy to iteratively optimize the generators (G_sf and G_tf) and the classifier (G_y).

7.4. Datasets

The full summary of all four datasets investigated in this paper is given in Table 12.

7.4.1 UCF-HMDB_full

We collect all of the relevant and overlapping categories between UCF101 [43] and HMDB51 [21], which results in 12 categories: climb, fencing, golf, kick ball, pullup, punch, pushup, ride bike, ride horse, shoot ball, shoot bow, and walk. Each category may correspond to multiple categories in the original UCF101 or HMDB51 dataset, as shown in Table 13. This dataset, UCF-HMDB_full, includes 1438 training videos and 571 validation videos from UCF, and 840 training videos and 360 validation videos from HMDB, as shown in Table 12. Most videos in UCF are from specific scenarios or similar environments, while the HMDB videos are captured in unconstrained environments and from different camera angles, as shown in Figure 10.

7.4.2 Kinetics-Gameplay

We create the Gameplay dataset by first collecting gameplay videos from two video games, Detroit: Become Human and Fortnite, to build our own action dataset for the virtual domain. The total length of the videos is 5 hours and 41 minutes.
We segment all of the raw, untrimmed videos into video clips according to human annotations, which results in 91 categories: argue, arrange object, assemble object, break, bump, carry, carve, chop wood, clap, climb, close door, close others, crawl, cross arm, crouch, crumple, cry, cut, dance, draw, drink, drive, eat, fall down, fight, fix hair, fly helicopter, get off, grab, hair cut, hit, hit break, hold, hug, juggle coin, jump, kick, kiss, kneel, knock, lick, lie down, lift, light up, listen, make bed, mop floor, news anchor, open door, open others, paint brush, pass object, pet, poke, pour, press, pull, punch, push, push object, put object, raise hand, read, row boat, run, shake hand, shiver, shoot gun, sit, sit down, slap, sleep, slide, smile, stand, stand up, stare, strangle, swim, switch, take off, talk, talk phone, think, throw, touch, walk, wash dishes, water plant, wave hand, and weld. The maximum length of each video clip is 10 seconds, and the minimum is 1 second. We also split the dataset into training, validation, and testing sets by randomly selecting videos in each category with the ratio 7:2:1. We build the Kinetics-Gameplay dataset by selecting the 30 categories that overlap between Gameplay and Kinetics-600 [19, 2], one of the largest public video datasets: break, carry, clean floor, climb, crawl, crouch, cry, dance, drink, drive, fall down, fight, hug, jump, kick, light up, news anchor, open door, paint brush, paraglide, pour, push, read, run, shoot gun, stare, talk, throw, walk, and wash dishes. Each category may also correspond to multiple categories in both datasets, as shown in Table 14. Kinetics-Gameplay includes 43378 training videos and 3246 validation videos from Kinetics, and 2625 training videos and 749 validation videos from Gameplay, as shown in Table 12. Kinetics-Gameplay is much more challenging than UCF-HMDB_full due to the significant domain shift between the distributions of virtual and real data.

Figure 9: The detailed architecture of the proposed TA3N.

                     UCF-HMDB_small          UCF-Olympic               UCF-HMDB_full           Kinetics-Gameplay
length (sec.)        1 - 21                  1 - 39                    1 - 33                  1 - 10
class #              5                       6                         12                      30
training video #     UCF: 482 / HMDB: 350    UCF: 601 / Olympic: 250   UCF: 1438 / HMDB: 840   Kinetics: 43378 / Gameplay: 2625
validation video #   UCF: 189 / HMDB: 150    UCF: 240 / Olympic: 54    UCF: 571 / HMDB: 360    Kinetics: 3246 / Gameplay: 749
resolution (all)     UCF: 320 × 240 / Olympic: vary / HMDB: vary × 240 / Kinetics: vary / Gameplay: 1280 × 720
frame rate (all)     UCF: 25 / Olympic: 30 / HMDB: 30 / Kinetics: vary / Gameplay: 30

Table 12: The summary of the cross-domain video datasets.

UCF-HMDB_full   UCF                                        HMDB
climb           RockClimbingIndoor, RopeClimbing           climb
fencing         Fencing                                    fencing
golf            GolfSwing                                  golf
kick ball       SoccerPenalty                              kick ball
pullup          PullUps                                    pullup
punch           Punch, BoxingPunchingBag, BoxingSpeedBag   punch
pushup          PushUps                                    pushup
ride bike       Biking                                     ride bike
ride horse      HorseRiding                                ride horse
shoot ball      Basketball                                 shoot ball
shoot bow       Archery                                    shoot bow
walk            WalkingWithDog                             walk

Table 13: The lists of all collected categories in UCF and HMDB.
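The per-category 7:2:1 train/validation/test split mentioned above can be sketched as follows; the dictionary input format is an assumption made for illustration.

```python
import random

def stratified_split(videos_by_class, ratios=(0.7, 0.2, 0.1), seed=0):
    """Randomly split the videos of each category into train/val/test
    subsets with the given ratios (7:2:1 in the paper), so every
    category is represented in every subset."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls, vids in videos_by_class.items():
        vids = list(vids)
        rng.shuffle(vids)
        n_train = round(len(vids) * ratios[0])
        n_val = round(len(vids) * ratios[1])
        train += [(cls, v) for v in vids[:n_train]]
        val += [(cls, v) for v in vids[n_train:n_train + n_val]]
        test += [(cls, v) for v in vids[n_train + n_val:]]
    return train, val, test
```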
Furthermore, the alignment between source and target data of imbalanced scale is another challenge. Some example snapshots are shown in Figure 11.

Figure 10: Snapshots of some example categories in UCF-HMDB_full: (a) fencing, (b) kick ball, (c) walk. For each category, the snapshots from UCF are shown in the upper row, and the snapshots from HMDB in the lower row.

Figure 11: Some example screenshots from YouTube videos in Kinetics-Gameplay (left two: Gameplay; right two: Kinetics).

7.5. More Details

7.5.1 JAN on Kinetics-Gameplay

JAN [27] does not perform as well on Kinetics-Gameplay as it does on UCF-HMDB_full. The main reason is the imbalanced size of the source and target data in Kinetics-Gameplay. The MMD discrepancy loss is calculated using the same number of source and target samples (which is not the case for the other types of DA approaches). Therefore, in each iteration, MMD is calculated using only part of the source batch together with the whole target batch. This means that the domain discrepancy is reduced only between part of the source data and the target data during training, so the learned model still overfits to the source domain. The MMD discrepancy loss works well when the source and target data are balanced, which is the case for most image DA datasets and for UCF-HMDB_full, but not for Kinetics-Gameplay.

7.5.2 Comparison with AMLS [17]

When evaluating on UCF-HMDB_small, AMLS [17] fine-tunes its networks on UCF and HMDB, respectively, before applying its DA approach. Here we only show their results fine-tuned with source data, because target labels should be unseen during training in unsupervised DA settings. For example, we do not compare against their results that test on HMDB data using models fine-tuned with HMDB data, since that is not unsupervised DA.

7.5.3 Other baselines

3D ConvNets [46] have also been used to extract video-level feature representations.
However, 3D ConvNets consume a great deal of GPU memory, and [47] also shows that 3D ConvNets are limited by efficiency and effectiveness issues when extracting temporal information.

Optical flow extracts motion characteristics between neighboring frames to compensate for the lack of temporal information in raw RGB frames. In this paper, we focus on attending to the temporal dynamics to effectively align domains even with only RGB frames. We consider optical flow to be complementary to our method.

7.5.4 Comparison with literature in other fields

Cycle-consistency. Some papers on cycle-consistency [50, 8] introduce self-supervised methods for learning visual correspondence between images or videos from unlabeled videos, using cycle-consistency as free supervision to learn video representations. The main difference from our approach is that we explicitly align the feature spaces between the source and target domains, while these self-supervised methods aim to learn general representations using only the source domain. We see cycle-consistency as a complementary method that can be integrated into our approach to achieve more effective domain alignment.

Robotics. In robotics, it is a common trend to transfer models trained in simulation to the real world. One effective method to bridge the domain gap is to randomize the dynamics of the simulator during training, improving robustness across different environments [35]. This setting differs from our task because we focus on feature learning rather than policy learning, and we see domain randomization as a complementary technique that could extend our approach to a more general version.
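For readers unfamiliar with dynamics randomization [35], the core idea is simply to resample the simulator's physical parameters for every training episode; a toy sketch (the parameter names are hypothetical, not taken from [35]):

```python
import random

def randomized_dynamics(base, scale=0.2, seed=None):
    """Sample one set of simulator dynamics per training episode by
    perturbing each nominal parameter by up to +/- scale (e.g. 20%),
    so the learned policy cannot overfit to one fixed simulator."""
    rng = random.Random(seed)
    return {name: value * rng.uniform(1.0 - scale, 1.0 + scale)
            for name, value in base.items()}
```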
Kinetics-Gameplay   Kinetics                                                                        Gameplay
break               breaking boards, smashing                                                       break, bump, hit break
carry               carrying baby                                                                   carry
clean floor         mopping floor                                                                   mop floor
climb               climbing a rope, climbing ladder, climbing tree, ice climbing, rock climbing    climb
crawl               crawling baby                                                                   crawl
crouch              squat, lunge                                                                    crouch, kneel
cry                 crying                                                                          cry
dance               belly dancing, krumping, robot dancing                                          dance
drink               drinking shots, tasting beer                                                    drink
drive               driving car, driving tractor                                                    drive
fall down           falling off bike, falling off chair, faceplanting                               fall down
fight               pillow fight, capoeira, wrestling, punching bag, punching person (boxing)       fight, strangle, punch, hit
hug                 hugging (not baby), hugging baby                                                hug
jump                high jump, jumping into pool, parkour                                           jump
kick                drop kicking, side kick                                                         kick
light up            lighting fire                                                                   light fire
news anchor         news anchoring                                                                  news anchor
open door           opening door, opening refrigerator                                              open door
paint brush         brush painting                                                                  paint brush
paraglide           paragliding                                                                     paraglide
pour                pouring beer                                                                    pour
push                pushing car, pushing cart, pushing wheelbarrow, pushing wheelchair, push up     push, push object
read                reading book, reading newspaper                                                 read
run                 running on treadmill, jogging                                                   run
shoot gun           playing laser tag, playing paintball                                            shoot gun
stare               staring                                                                         stare
talk                talking on cell phone, arguing, testifying                                      talk, argue, talk phone
throw               throwing axe, throwing ball (not baseball or American football), throwing knife, throwing water balloon   throw
walk                walking the dog, walking through snow, jaywalking                               walk
wash dishes         washing dishes                                                                  wash dishes

Table 14: The lists of all collected categories in Kinetics and Gameplay.

7.5.5 Failure cases for TemRelation

TemRelation shows limited improvement over TemPooling for some categories that are consistent across time. For example, with the same DA method (DANN), TemRelation has the same accuracy as TemPooling for ride bike (97%), and lower accuracy for ride horse (93% vs. 97%).
The possible reason is that temporal pooling can already model temporally consistent actions well, so modeling these actions at multiple time scales, as TemRelation does, may be redundant.

7.5.6 Testing time for TA3N

Unlike TA2N, TA3N passes data through all the domain discriminators during testing. However, since all our domain discriminators are shallow, the testing time is similar: in our experiments, TA3N takes only 10% more time than TA2N.

References

[1] Pau Panareda Busto and Juergen Gall. Open set domain adaptation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[2] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about Kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
[3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[4] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning (ICML), 2018.
[5] Gabriela Csurka. A comprehensive survey on domain adaptation for visual applications. In Domain Adaptation in Computer Vision Applications, pages 1-35. Springer, 2017.
[6] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[7] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
[8] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal cycle-consistency learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[9] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[10] Geoff French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation. In International Conference on Learning Representations (ICLR), 2018.
[11] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015.
[12] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096-2030, 2016.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] Yen-Chang Hsu, Zhaoyang Lv, and Zsolt Kira. Learning to cluster in order to transfer across domains and tasks. In International Conference on Learning Representations (ICLR), 2018.
[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
[17] Arshad Jamal, Vinay P Namboodiri, Dipti Deodhare, and KS Venkatesh. Deep domain adaptation in action space. In British Machine Vision Conference (BMVC), 2018.
[18] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[19] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[20] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[21] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In IEEE International Conference on Computer Vision (ICCV), 2011.
[22] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[23] Yanghao Li, Naiyan Wang, Jianping Shi, Xiaodi Hou, and Jiaying Liu. Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80:109-117, 2018.
[24] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. In International Conference on Learning Representations Workshop (ICLRW), 2017.
[25] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), 2015.
[26] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[27] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning (ICML), 2017.
[28] Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, and Shilei Wen. Attention clusters: Purely attention based local feature integration for video classification. In AAAI Conference on Artificial Intelligence is incorrect; In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[29] Chih-Yao Ma, Min-Hung Chen, Zsolt Kira, and Ghassan AlRegib. TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition. Signal Processing: Image Communication, 2018.
[30] Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, and Hans Peter Graf. Attend and interact: Higher-order object interactions for video understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[31] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. The Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.
[32] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(10):1345-1359, 2010.
[33] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems Workshop (NeurIPSW), 2017.
[34] Xingchao Peng, Ben Usman, Kuniaki Saito, Neela Kaushik, Judy Hoffman, and Kate Saenko. Syn2Real: A new benchmark for synthetic-to-real visual domain adaptation. arXiv preprint arXiv:1806.09755, 2018.
[35] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In IEEE International Conference on Robotics and Automation (ICRA), 2018.
[36] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[37] Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.
[38] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. Adversarial dropout regularization. In International Conference on Learning Representations (ICLR), 2018.
[39] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[40] Kuniaki Saito, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada. Open set domain adaptation by backpropagation. In European Conference on Computer Vision (ECCV), 2018.
[41] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[42] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
[43] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint, 2012.
[44] Waqas Sultani and Imran Saleemi. Human action recognition across datasets by foreground-weighted histogram decomposition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[45] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision Workshop (ECCVW), 2016.
[46] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
[47] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[48] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[49] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), 2016.
[50] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[51] Ximei Wang, Liang Li, Weirui Ye, Mingsheng Long, and Jianmin Wang. Transferable attention for domain adaptation. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
[52] Tiantian Xu, Fan Zhu, Edward K Wong, and Yi Fang. Dual many-to-one-encoder-based transfer learning for cross-dataset human action recognition. Image and Vision Computing, 55:127-137, 2016.
[53] Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[54] Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (CMD) for domain-invariant representation learning. In International Conference on Learning Representations (ICLR), 2017.
[55] Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. Collaborative and adversarial network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[56] Xiao-Yu Zhang, Haichao Shi, Changsheng Li, Kai Zheng, Xiaobin Zhu, and Lixin Duan. Learning transferable self-attentive representations for action recognition in untrimmed videos with weak supervision. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
[57] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In European Conference on Computer Vision (ECCV), 2018.
