A Proposal-based Approach for Activity Image-to-Video Retrieval

Ruicong Xu, Li Niu*, Jianfu Zhang, Liqing Zhang
MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China.
{ranranxu, utscnewly, c.sis}@sjtu.edu.cn, zhang-lq@cs.sjtu.edu.cn

Abstract

The activity image-to-video retrieval task aims to retrieve videos containing an activity similar to the one in the query image, which is challenging because videos generally have many background segments irrelevant to the activity. In this paper, we utilize the R-C3D model to represent a video by a bag of activity proposals, which can filter out background segments to some extent. However, noisy proposals still remain in each bag. Thus, we propose an Activity Proposal-based Image-to-Video Retrieval (APIVR) approach, which incorporates multi-instance learning into a cross-modal retrieval framework to address the proposal noise issue. Specifically, we propose a Graph Multi-Instance Learning (GMIL) module with a graph convolutional layer, and integrate this module with the classification loss, adversarial loss, and triplet loss in our cross-modal retrieval framework. Moreover, we propose a geometry-aware triplet loss based on point-to-subspace distance to preserve the structural information of activity proposals. Extensive experiments on three widely-used datasets verify the effectiveness of our approach.

1 Introduction

The cross-modal retrieval task has attracted considerable research attention in the retrieval field. With the rapid development of video applications, a specific type of retrieval task, Activity Image-to-Video Retrieval (AIVR), comes into our sight. The goal of the AIVR task is to retrieve the videos containing an activity similar to the one in the image query, which has value in widespread applications.
One everyday example is searching news videos with a provided photo containing a particular activity. Another example is recommending fitness videos based on a sports picture.

The key idea of cross-modal retrieval is to learn a common feature space in which cross-modal data with relevant semantics are close to each other. Although there are abundant methods for cross-modal retrieval such as text-image retrieval (Feng, Wang, and Li 2014; Hardoon, Szedmák, and Shawe-Taylor 2004; Peng, Huang, and Qi 2016; Wang et al. 2016; 2013), few methods (de Araújo and Girod 2018; Xu et al. 2017) have been proposed for image-video retrieval, and these methods are not specifically designed for the AIVR task. The AIVR task is in high demand of meaningful video representations, because a video may contain background segments irrelevant to the activity, and poor video representations that ignore noisy background segments lead to inferior performance on the AIVR task.

*Corresponding author. Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Recently, RNNs (Ng et al. 2015; Srivastava, Mansimov, and Salakhutdinov 2015) and 3D CNNs (Ji et al. 2013; Tran et al. 2015; Qiu, Yao, and Mei 2017) have been used to extract deep learning-based video representations. As an extension of 3D CNN, R-C3D (Xu, Das, and Saenko 2017) can generate candidate temporal regions containing activities and filter out noisy background segments to obtain superior activity video representations. Therefore, we take advantage of the R-C3D model to generate temporal proposals that are most likely to contain activities and extract one feature vector per proposal, leading to a bag of proposal features for each video. This paper is the first to target the AIVR task by utilizing activity proposals for videos.
In this paper, we propose an Activity Proposal-based Image-to-Video Retrieval (APIVR) approach for the AIVR task. The major innovation of our paper is incorporating a Graph Multi-Instance Learning (GMIL) module into a cross-modal retrieval framework to address the proposal noise issue. As illustrated in Figure 1, our cross-modal retrieval framework is based on Adversarial Cross-Modal Retrieval (ACMR) (Wang et al. 2017), in which image features and activity proposal-based video features are projected into a common feature space steered by triplet loss, classification loss, and adversarial loss. To address the proposal noise issue, we treat each video as a bag and the activity proposals in each bag as multiple instances, which coincides with the multi-instance learning (MIL) paradigm (Ilse, Tomczak, and Welling 2018). We assume that there is at least one clean instance in each bag, and employ a self-attention mechanism to learn different weights for multiple instances, with higher weights indicating clean activity proposals. To further consider the relations among multiple instances in each bag, we insert a graph convolutional layer into the MIL module, yielding a novel Graph MIL (GMIL) module.

Figure 1: The flowchart of our proposed approach. The image features and bags of activity proposal features for videos are extracted by the VGG (Simonyan and Zisserman 2014) and R-C3D (Xu, Das, and Saenko 2017) models respectively, and then projected into a common feature space. Our retrieval framework consists of triplet loss, classification loss, and adversarial loss. We incorporate a Graph Multi-Instance Learning (GMIL) module into the retrieval framework to address the proposal noise issue. We also design a geometry-aware triplet loss based on truncated bags of activity proposals. Best viewed in color.

After learning weights based on our GMIL module, we
use the weighted average of activity proposal features in each bag as the input for the classification loss and adversarial loss in the cross-modal retrieval framework, to suppress noisy activity proposals. For the remaining triplet loss, we propose a novel geometry-aware triplet loss, which calculates the point-to-subspace distance between an image and a bag of activity proposals. Considering that noisy activity proposals may mislead the point-to-subspace distance, we use a truncated bag of activity proposals based on the weights learnt by our GMIL module. Thus, our geometry-aware triplet loss can mitigate the proposal noise issue while preserving the geometric properties of activity proposals.

The contributions of our paper are summarized as follows:
• This work is the first activity proposal-based approach for the activity image-to-video retrieval task. Our major contribution is incorporating multi-instance learning into a cross-modal retrieval framework to address the proposal noise issue.
• Our two minor contributions are the Graph Multi-Instance Learning (GMIL) module with a graph convolutional layer and the geometry-aware triplet loss based on truncated bags of activity proposals.
• Experimental results on three datasets, i.e., the action-based THUMOS'14 and ActivityNet datasets and the event-based MED2017 Event dataset, demonstrate the superiority of our approach over state-of-the-art methods.

2 Related Work

In this section, we provide a brief overview of video representation, cross-modal retrieval, and multi-instance learning.

Video representations: Video representations play a crucial role in the image-to-video retrieval task. Recently, deep learning-based models, e.g., RNN (Jiang et al. 2018) and 3D CNN (Qiu, Yao, and Mei 2017), have been proposed to fully exploit spatio-temporal information across consecutive frames.
As an advanced 3D CNN model, R-C3D (Xu, Das, and Saenko 2017) can generate activity proposals across the temporal dimension to filter out noisy background segments. Hence, we adopt R-C3D to generate video representations, which significantly facilitates the AIVR task.

Cross-modal retrieval methods: Cross-modal retrieval methods fall into two major categories: binary-value based methods (Yu et al. 2014; Lin, Shen, and van den Hengel 2014; Ye et al. 2017; Ding, Guo, and Zhou 2014; Xu et al. 2017) and real-value based methods (Zhai, Peng, and Xiao 2014; Wang et al. 2016; Peng, Huang, and Qi 2016; Peng et al. 2018; Wang, Li, and Lazebnik 2016; Wang et al. 2017; Zhen et al. 2019). Our cross-modal retrieval framework is based on ACMR (Wang et al. 2017), which consists of classification loss, triplet loss, and adversarial loss. Our contribution is incorporating a graph multi-instance learning module into the cross-modal retrieval framework together with a geometry-aware triplet loss.

Multi-instance learning: Multi-instance learning (MIL) groups training samples into multi-instance bags, in which each bag contains at least one positive instance. Some early methods (Li et al. 2009) treat one bag as an entirety or infer instance labels within each bag. Recently, deep multi-instance learning methods (Zhu et al. 2017; Pappas and Popescu-Belis 2014; Ilse, Tomczak, and Welling 2018) employ pooling operators or trainable operators to aggregate multiple instances in each bag. Moreover, several graph MIL methods (Tu et al. 2019; Guo and Yi 2013) exploit the graph structure of training bags in different ways, but they cannot be easily integrated into our cross-modal retrieval framework.

3 Methodology

In this section, we introduce our activity proposal-based image-to-video retrieval approach.

3.1 Problem Definition

For concise mathematical expression, we denote a matrix (e.g., $\mathbf{A}$) and a vector (e.g.
, $\mathbf{a}$) using an uppercase and a lowercase boldface letter respectively, and denote a scalar (e.g., $a$) using a lowercase letter. We use $\mathbf{I}_k$ and $\mathbf{A}^T$ to denote an identity matrix of size $k$ and the transpose of $\mathbf{A}$ respectively. By $\mathrm{vec}(\cdot)$, we perform column-wise concatenation to transform a matrix into a column vector. Moreover, we use $\langle \mathbf{x}, \mathbf{y} \rangle$ to denote the inner product of $\mathbf{x}$ and $\mathbf{y}$.

In the AIVR task, our training process is based on mini-batches of video-image pairs $\{(\mathbf{V}_i, \mathbf{u}_i)\}_{i=1}^{n}$, in which $(\mathbf{V}_i, \mathbf{u}_i)$ is a pair of video and image with the same category label, and $n$ is the number of pairs in a mini-batch. Specifically, $\mathbf{V}_i = \{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_k\}$ with $\mathbf{h}_j \in \mathbb{R}^{d_1 \times 1}$ is a bag of proposal features of the $i$-th video, and $\mathbf{u}_i \in \mathbb{R}^{d_2 \times 1}$ is the feature of the $i$-th image. Note that the dimensionalities of the image feature and the activity proposal features are not equal in our problem, i.e., $d_1 \neq d_2$. Each pair $(\mathbf{V}_i, \mathbf{u}_i)$ is associated with a one-hot label vector $\mathbf{y}_i$ with the entry corresponding to its category set to one. In the testing stage, given an image query, the goal of the AIVR task is to retrieve the videos related to the activity in the image.

3.2 Activity Proposal-based Image-to-Video Retrieval (APIVR) Approach

As mentioned above, we represent each video as a bag of proposal features $\mathbf{V} = \{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_k\}$ and each image as a feature vector $\mathbf{u}$. Considering the different statistical properties and data distributions of videos and images, we project video and image features into a common feature space with the mapping functions $f_v(\cdot)$ and $f_u(\cdot)$ respectively. The mapping functions are defined as
$$f_v(\mathbf{V}) = \{f_v(\mathbf{h}_1), f_v(\mathbf{h}_2), \ldots, f_v(\mathbf{h}_k)\} = \{\bar{\mathbf{h}}_1, \bar{\mathbf{h}}_2, \ldots, \bar{\mathbf{h}}_k\} = \bar{\mathbf{V}}, \quad (1)$$
$$f_u(\mathbf{u}) = \bar{\mathbf{u}}, \quad (2)$$
where $f_v: \mathbb{R}^{d_1 \times k} \rightarrow \mathbb{R}^{r \times k}$ and $f_u: \mathbb{R}^{d_2 \times 1} \rightarrow \mathbb{R}^{r \times 1}$. The mapping functions $f_v(\cdot)$ (resp.
, $f_u(\cdot)$) are three fully-connected layers with model parameters denoted as $\theta_p$.

Based on the projected features $\bar{\mathbf{V}}$ and $\bar{\mathbf{u}}$, following ACMR (Wang et al. 2017), we employ three types of losses: triplet loss, classification loss, and adversarial loss. Concretely, the triplet loss pulls an image close to the videos of the same category while pushing it far away from the videos of different categories. The classification loss targets successfully separating the training samples of different categories regardless of modality, which preserves semantic information and simultaneously reduces the modality gap. The adversarial loss is involved in a minimax game: a modality classifier learns to discriminate the two modalities, while the projections learn modality-agnostic representations to confuse the modality classifier, which further reduces the modality gap. In summary, the above three types of losses jointly contribute to modality consistency and semantic distinguishability in the common feature space.

Graph Multi-Instance Learning Module  Although we use the R-C3D model to generate activity proposals that are very likely to contain the activity, some noisy activity proposals irrelevant to the activity still remain in each video. Hence, each video is comprised of a mixture of clean and noisy proposals. If we utilize these noisy activity proposals based on the video label, the quality of semantic learning will be greatly degraded. This problem can be formulated as multi-instance learning, in which each video is treated as a bag and the activity proposals in each bag are treated as instances. Based on the assumption that there is at least one clean instance in each bag, we expect to assign higher weights to the clean instances and lower weights to the noisy ones, so that the clean instances play a dominant role in the video bags.
Given a bag of instances $\bar{\mathbf{V}} = \{\bar{\mathbf{h}}_1, \bar{\mathbf{h}}_2, \ldots, \bar{\mathbf{h}}_k\}$, inspired by (Ilse, Tomczak, and Welling 2018), we employ a self-attention mechanism to learn different weights for different instances in each bag as in (3). In particular, we apply a fully-connected layer $\mathbf{L}_1 \in \mathbb{R}^{r \times r'}$ with the non-linearity $\tanh(\cdot)$ to $\bar{\mathbf{V}}$, producing $\tanh(\bar{\mathbf{V}}^T \mathbf{L}_1)$. Then, we apply another fully-connected layer $\mathbf{L}_2 \in \mathbb{R}^{r' \times 1}$ followed by a softmax layer to obtain the $k$-dim weight vector $\mathbf{a}$ for $\bar{\mathbf{V}}$:
$$\mathbf{a} = \mathrm{softmax}(\tanh(\bar{\mathbf{V}}^T \mathbf{L}_1) \mathbf{L}_2). \quad (3)$$

However, the above process ignores the relations among multiple instances in each bag. To take such relations into consideration, we insert a graph convolutional layer (Kipf and Welling 2017) into (3), which can leverage the graph structure of each bag. The graph convolutional layer was originally proposed for semi-supervised learning, and we employ it here for multi-instance learning. Following (Kipf and Welling 2017), we calculate the similarity graph $\mathbf{S}$ for each bag $\mathbf{V} = \{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_k\}$ during preprocessing, in which $S_{ij}$ is the cosine similarity between $\mathbf{h}_i$ and $\mathbf{h}_j$. Besides, we define $\mathbf{S}' = \mathbf{S} + \mathbf{I}_k$ and a diagonal matrix $\mathbf{D}$ with $D_{ii} = \sum_j S'_{ij}$. Then, the graph convolutional layer can be represented by a $1 \times 1$ convolution layer with parameters $\bar{\mathbf{S}} = \mathbf{D}^{-1/2} \mathbf{S}' \mathbf{D}^{-1/2}$. We insert two graph convolutional layers into (3) and arrive at
$$\hat{\mathbf{a}} = \mathrm{softmax}(\bar{\mathbf{S}} \tanh(\bar{\mathbf{S}} \bar{\mathbf{V}}^T \mathbf{L}_1) \mathbf{L}_2). \quad (4)$$
The generated $\hat{\mathbf{a}}$ is expected to be smoother than $\mathbf{a}$, i.e., the weights of two instances in a bag should be close when the two instances are similar. The theoretical justification and more details can be found in (Kipf and Welling 2017).

At last, we obtain the weighted average of instance features as the bag-level feature $Z(\bar{\mathbf{V}}) = \sum_{j=1}^{k} \hat{a}_j \bar{\mathbf{h}}_j$.
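The GMIL weighting in (4) can be sketched in a few lines of numpy. The toy sizes ($k = 5$, $r = 64$, $r' = 32$) and the randomly initialized $\mathbf{L}_1$, $\mathbf{L}_2$ are illustrative assumptions; also note that the paper builds the cosine-similarity graph from the raw proposal features during preprocessing, whereas this sketch reuses the projected features for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
k, r, r_prime = 5, 64, 32            # assumed toy sizes

H = rng.normal(size=(k, r))          # rows of V̄ᵀ: one projected proposal per row
L1 = rng.normal(scale=0.1, size=(r, r_prime))
L2 = rng.normal(scale=0.1, size=(r_prime, 1))

# Cosine-similarity graph over the proposals
norms = np.linalg.norm(H, axis=1, keepdims=True)
S = (H / norms) @ (H / norms).T

# S' = S + I_k, D_ii = sum_j S'_ij, and S̄ = D^{-1/2} S' D^{-1/2}
S_prime = S + np.eye(k)
d = S_prime.sum(axis=1)
S_bar = S_prime / np.sqrt(np.outer(d, d))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Eq. (4): â = softmax(S̄ tanh(S̄ V̄ᵀ L1) L2)
a_hat = softmax((S_bar @ np.tanh(S_bar @ H @ L1) @ L2).ravel())

# Bag-level feature Z(V̄) = Σ_j â_j h̄_j
Z = a_hat @ H
```

Multiplying by $\bar{\mathbf{S}}$ mixes each instance's score with those of its similar neighbors, which is what makes $\hat{\mathbf{a}}$ smoother than the plain attention weights of (3).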
By assigning different weights to different activity proposals, we aim to focus more on the clean proposals and obtain discriminative video features.

Geometry-aware Triplet Loss with GMIL  We use the triplet loss to preserve the semantic relevance of similar training samples across different modalities. As defined in (Schroff, Kalenichenko, and Philbin 2015), the triplet loss is based on an anchor sample $\mathbf{x}$, a positive sample $\mathbf{p}$, and a negative sample $\mathbf{n}$, where $\mathbf{x}$ has the same category label as $\mathbf{p}$ but a different category label from $\mathbf{n}$. Given a triplet $(\mathbf{x}, \mathbf{p}, \mathbf{n})$, the triplet loss enforces the distance between $\mathbf{x}$ and $\mathbf{p}$ to be smaller than that between $\mathbf{x}$ and $\mathbf{n}$ by a margin. Since our objective is to retrieve videos given an image query, the anchor sample $\mathbf{x}$ is an image while the positive sample $\mathbf{p}$ and the negative sample $\mathbf{n}$ are videos. In a mini-batch of video-image pairs $\{(\bar{\mathbf{V}}_i, \bar{\mathbf{u}}_i)\}_{i=1}^{n}$, with each image $\bar{\mathbf{u}}_i$ as an anchor sample, we use its paired video as the positive sample $\bar{\mathbf{V}}_i^+$ and one video from a different category as the negative sample $\bar{\mathbf{V}}_j^-$, leading to in total $n$ triplets per mini-batch. Then our triplet loss is formulated as
$$\mathcal{L}_{triplet} = \sum_{i,j} \left| d(\bar{\mathbf{u}}_i, \bar{\mathbf{V}}_i^+) - d(\bar{\mathbf{u}}_i, \bar{\mathbf{V}}_j^-) + m \right|_+, \quad (5)$$
in which $m$ is the margin, set to $0.1$ in our experiments, $d(\mathbf{x}, \mathbf{y})$ is the distance between $\mathbf{x}$ and $\mathbf{y}$, and $|x|_+ = x$ if $x > 0$ and $0$ otherwise. For $d(\bar{\mathbf{u}}, \bar{\mathbf{V}})$, a straightforward approach is calculating the distance between $\bar{\mathbf{u}}$ and the weighted average of activity proposal features $Z(\bar{\mathbf{V}})$, but that causes serious loss of the structural information in the activity proposals. As shown in (Xu et al. 2017), the point-to-subspace distance¹ is able to preserve the structural information and geometric properties. In our problem, an image can be seen as a high-dimensional data point and a video as a subspace spanned by its activity proposals.
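This point-versus-subspace view can be made concrete with a small numpy sketch: the distance is the squared norm of the component of the image point orthogonal to the span of the proposals. The sizes and random features below are illustrative assumptions; the sketch also checks that the explicit form $\|\bar{\mathbf{u}} - \tilde{\mathbf{V}}\bar{\mathbf{u}}\|_2^2$ agrees with the inner-product form used later in the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
r, b = 64, 10                        # common-space dim, number of proposals kept

V_trunc = rng.normal(size=(r, b))    # proposals as columns, spanning the subspace
u = rng.normal(size=(r,))            # projected image feature ū

# Orthogonal-projection matrix onto span(V_trunc): Ṽ = V(VᵀV)⁻¹Vᵀ
V_tilde = V_trunc @ np.linalg.inv(V_trunc.T @ V_trunc) @ V_trunc.T

# Point-to-subspace distance, explicit form: ||ū − Ṽū||²
d_explicit = np.sum((u - V_tilde @ u) ** 2)

# Equivalent inner-product form: ūᵀū − ⟨vec(Ṽ), vec(ūūᵀ)⟩
d_inner = u @ u - np.dot(V_tilde.ravel(), np.outer(u, u).ravel())

assert np.isclose(d_explicit, d_inner)
```

The equivalence holds because the projection matrix $\tilde{\mathbf{V}}$ is symmetric and idempotent, so the cross terms in $\|\bar{\mathbf{u}} - \tilde{\mathbf{V}}\bar{\mathbf{u}}\|_2^2$ collapse to a single $\bar{\mathbf{u}}^T\tilde{\mathbf{V}}\bar{\mathbf{u}}$ term.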
The point-to-subspace distance is then the Euclidean distance between an image point and its orthogonal projection onto the video subspace.

Considering that noisy proposals may mislead the point-to-subspace distance, we use a truncated bag of proposals in lieu of the intact bag. To be exact, we denote the truncated bag as $\bar{\mathbf{V}}' = \bar{\mathbf{V}}[:, S_b]$, in which $S_b$ is the index set of the proposals with the top-$b$ GMIL weights $\hat{a}_i$. That is, we use the top-$b$ clean proposals in the triplet loss. With simple mathematical derivation¹, the orthogonal projection of the point $\bar{\mathbf{u}}$ onto the subspace $\bar{\mathbf{V}}'$ can be calculated as $\tilde{\mathbf{V}} \bar{\mathbf{u}}$, where $\tilde{\mathbf{V}} = \bar{\mathbf{V}}' ((\bar{\mathbf{V}}')^T \bar{\mathbf{V}}')^{-1} (\bar{\mathbf{V}}')^T$. Then, the point-to-subspace distance between $\bar{\mathbf{u}}$ and $\bar{\mathbf{V}}'$, i.e., the Euclidean distance between $\bar{\mathbf{u}}$ and $\tilde{\mathbf{V}} \bar{\mathbf{u}}$, can be simplified as
$$d(\bar{\mathbf{u}}, \bar{\mathbf{V}}') = \| \bar{\mathbf{u}} - \tilde{\mathbf{V}} \bar{\mathbf{u}} \|_2^2 = \mathrm{Tr}\big((\mathbf{I}_r - \tilde{\mathbf{V}})^T (\mathbf{I}_r - \tilde{\mathbf{V}}) \bar{\mathbf{u}} \bar{\mathbf{u}}^T\big) = \bar{\mathbf{u}}^T \bar{\mathbf{u}} - \big\langle \mathrm{vec}(\tilde{\mathbf{V}}), \mathrm{vec}(\bar{\mathbf{u}} \bar{\mathbf{u}}^T) \big\rangle. \quad (6)$$
By using $\tilde{d}_{\bar{\mathbf{u}}, \tilde{\mathbf{V}}}$ to denote $\langle \mathrm{vec}(\tilde{\mathbf{V}}), \mathrm{vec}(\bar{\mathbf{u}} \bar{\mathbf{u}}^T) \rangle$ and substituting (6) into (5), we arrive at
$$\mathcal{L}_{triplet} = \sum_{i,j} \left| d(\bar{\mathbf{u}}_i, \bar{\mathbf{V}}'^+_i) - d(\bar{\mathbf{u}}_i, \bar{\mathbf{V}}'^-_j) + m \right|_+ = \sum_{i,j} \left| \tilde{d}_{\bar{\mathbf{u}}_i, \tilde{\mathbf{V}}_j^-} - \tilde{d}_{\bar{\mathbf{u}}_i, \tilde{\mathbf{V}}_i^+} + m \right|_+. \quad (7)$$

¹ https://en.wikipedia.org/wiki/Projection_(linear_algebra)

Following (Yao, Mei, and Ngo 2015), given an anchor sample $\bar{\mathbf{u}}_i$, we select its hardest negative sample $\bar{\mathbf{V}}_j^-$; the details are omitted here. Based on (7), we minimize $\mathcal{L}_{triplet}$ by optimizing the GMIL module parameters $\theta_m$ and the projection module parameters $\theta_p$.

Classification Loss with GMIL  To ensure that the training samples in each modality are semantically discriminative, we additionally use a semantic classifier to separate intra-modal training samples of different categories. To minimize the modality gap, we apply the same classifier to both images and videos.
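The shared softmax classifier can be sketched as follows; it is applied to both the bag-level video feature $Z(\bar{\mathbf{V}})$ and the image feature $\bar{\mathbf{u}}$, and the loss computed matches the cross-entropy form of the classification loss defined next. The batch size, feature dimension, class count, and random classifier weights `W_c` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, C = 4, 64, 18                  # batch size, common-space dim, #categories

W_c = rng.normal(scale=0.1, size=(r, C))   # shared softmax classifier θ_c

Z_V = rng.normal(size=(n, r))        # bag-level video features Z(V̄_i)
U = rng.normal(size=(n, r))          # image features ū_i
labels = rng.integers(0, C, size=n)
Y = np.eye(C)[labels]                # one-hot labels y_i

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# L_class = -(1/n) Σ_i y_iᵀ(log p(Z(V̄_i)) + log p(ū_i)),
# with the same classifier W_c applied to both modalities
L_class = -np.mean(np.sum(Y * (np.log(softmax(Z_V @ W_c)) +
                               np.log(softmax(U @ W_c))), axis=1))
```

Sharing `W_c` across modalities is what couples the two log-likelihood terms: both projected features must land in the same class-discriminative regions of the common space, which shrinks the modality gap.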
In particular, we add a softmax classification layer with model parameters $\theta_c$ on top of the image features $\bar{\mathbf{u}}$ and the weighted average of proposal features $Z(\bar{\mathbf{V}})$. Given a mini-batch of video-image pairs $\{(\bar{\mathbf{V}}_i, \bar{\mathbf{u}}_i)\}_{i=1}^{n}$ associated with one-hot labels $\{\mathbf{y}_i\}_{i=1}^{n}$, the classification loss is written as
$$\mathcal{L}_{class} = -\frac{1}{n} \sum_{i=1}^{n} \mathbf{y}_i^T \left( \log(p(Z(\bar{\mathbf{V}}_i))) + \log(p(\bar{\mathbf{u}}_i)) \right), \quad (8)$$
in which $p(\cdot)$ denotes the prediction scores of the softmax classification layer. Denoting the GMIL module parameters as $\theta_m = \{\mathbf{L}_1, \mathbf{L}_2\}$, we minimize $\mathcal{L}_{class}$ by optimizing the semantic classifier parameters $\theta_c$, the GMIL module parameters $\theta_m$, and the projection module parameters $\theta_p$.

Adversarial Loss with GMIL  To further minimize the modality gap between videos and images, adversarial learning (Goodfellow et al. 2014; Wang et al. 2017) is implemented as an interplay between discriminating modalities by learning a modality classifier and learning representations to confuse the modality classifier. In the process of discriminating modalities, we learn a modality classifier to discriminate the video modality from the image modality. The modality classifier is implemented as a binary classifier with model parameters $\theta_d$, in which we assume the label of the video (resp., image) modality is 1 (resp., 0). In the process of learning representations to confuse the modality classifier, we expect the projected video/image features in the common feature space to fool the modality classifier. Considering that clean proposals have a more representative feature distribution while noisy proposals are scattered throughout the feature space, we apply the modality classifier to the weighted average of proposal features $Z(\bar{\mathbf{V}})$ for videos.
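The modality classifier $\delta$ can be sketched as a simple logistic discriminator (an assumption for illustration; the paper only states that it is a binary classifier with parameters $\theta_d$). The binary cross-entropy below matches the adversarial loss defined next, with videos labeled 1 and images labeled 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 4, 64                         # batch size, common-space dim

w_d = rng.normal(scale=0.1, size=r)  # modality classifier θ_d (assumed logistic)

Z_V = rng.normal(size=(n, r))        # bag-level video features (label 1)
U = rng.normal(size=(n, r))          # image features (label 0)

def delta(x):
    """Predicted probability of belonging to the video modality."""
    return 1.0 / (1.0 + np.exp(-(x @ w_d)))

# L_adv = -(1/n) Σ_i [log δ(Z(V̄_i)) + log(1 − δ(ū_i))]
L_adv = -np.mean(np.log(delta(Z_V)) + np.log(1.0 - delta(U)))

# The discriminator θ_d minimizes L_adv; the projections θ_p and the
# GMIL module θ_m maximize it to produce modality-agnostic features.
```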
Similar to the classification loss, the adversarial loss is formally defined as
$$\mathcal{L}_{adv} = -\frac{1}{n} \sum_{i=1}^{n} \left( \log(\delta(Z(\bar{\mathbf{V}}_i))) + \log(1 - \delta(\bar{\mathbf{u}}_i)) \right), \quad (9)$$
where $\delta(\cdot)$ is the predicted probability of being from the video modality. As adversarial learning is an interplay between discriminating modalities and learning representations, in the process of discriminating modalities we minimize $\mathcal{L}_{adv}$ by optimizing the modality classifier parameters $\theta_d$. On the contrary, in the process of learning representations we maximize $\mathcal{L}_{adv}$ by optimizing the projection module parameters $\theta_p$ and the GMIL module parameters $\theta_m$.

The Whole Algorithm  We collect $\mathcal{L}_{triplet}$, $\mathcal{L}_{class}$, and $\mathcal{L}_{adv}$ in (7), (8), (9) into the following total training loss:
$$\mathcal{L}_{total} = \alpha \cdot \mathcal{L}_{triplet} + \beta \cdot \mathcal{L}_{class} - \mathcal{L}_{adv}, \quad (10)$$
where $\alpha$ and $\beta$ are trade-off parameters, empirically fixed as $0.1$ and $10$ respectively in our experiments. Due to the adversarial loss $\mathcal{L}_{adv}$ in (10), we play a minimax game by learning representations and discriminating modalities alternately. Using $\theta_g = \{\theta_p, \theta_m, \theta_c\}$ to denote the model parameters in learning representations, our objective can be written as
$$\min_{\theta_g} \max_{\theta_d} \; \alpha \cdot \mathcal{L}_{triplet} + \beta \cdot \mathcal{L}_{class} - \mathcal{L}_{adv}, \quad (11)$$
which can be optimized by updating $\theta_g$ and $\theta_d$ in an alternating manner. We leave the summary of our training algorithm to the Supplementary due to space limitations.

In testing, we pass the testing images and videos through the trained model, yielding the projected features $\bar{\mathbf{u}}$ (resp., $Z(\bar{\mathbf{V}})$) for images (resp., videos). Then, given a query image $\mathbf{u}_i$, we retrieve its relevant videos by ranking all the $\ell_2$ distances between $\bar{\mathbf{u}}_i$ and $Z(\bar{\mathbf{V}})$.

4 Experiments

In this section, we compare our APIVR approach with state-of-the-art methods on three datasets and provide extensive ablation studies.
4.1 Dataset Construction

To the best of our knowledge, there are no publicly available datasets of activity video-image pairs specifically designed for the AIVR task. Therefore, we construct video-image datasets for the AIVR task based on public video datasets, i.e., THUMOS'14², ActivityNet (Heilbron et al. 2015), and the MED2017 Event³ dataset, in which THUMOS'14 and ActivityNet are action-based datasets while MED2017 Event is an event-based dataset. The difference between "action" and "event" lies in that an event generally consists of a sequence of interactive or stand-alone actions. The details of the above three datasets are left to the Supplementary. Based on these three datasets, we aim to obtain activity images and activity video clips, which are used to construct our datasets for the AIVR task.

To obtain activity video clips, considering that long videos may belong to multiple activity categories, we divide each long video into multiple short videos based on the activity temporal annotations, to ensure that each short video belongs to only one activity category. Then, we sample a fixed number of consecutive key frames in each short video as a video clip. The number of key frames used in our experiments is 768 for all datasets, which is large enough to cover at least one activity instance.

To obtain activity images, we first locate the activity intervals in long videos according to the activity temporal annotations. Then, we sample images from those activity intervals so that each image belongs to one activity category.

With the obtained activity images and activity video clips, we sample video clips and images from each category to form training pairs and testing pairs. In particular, for the THUMOS'14 dataset, we form 1500 training pairs and 406 testing pairs.
For the ActivityNet dataset, we form 4800 training pairs and 1200 testing pairs. For the MED2017 Event dataset, we form 2200 training pairs and 404 testing pairs.

² http://crcv.ucf.edu/THUMOS14/
³ https://www.nist.gov/itl/iad/mig/med-2017-evaluation/

Table 1: Comparison of our full APIVR approach and its special cases in terms of mAP@K on THUMOS'14. Best results are denoted in boldface.

Method | mAP@10 | mAP@20 | mAP@50 | mAP@100
APIVR (w/o TL) | 0.3228 | 0.3096 | 0.2956 | 0.2875
APIVR (w/o AL) | 0.3278 | 0.3146 | 0.3026 | 0.2905
APIVR (w/o CL) | 0.2438 | 0.2389 | 0.2312 | 0.2306
APIVR (w/o GA) | 0.3531 | 0.3376 | 0.3204 | 0.3145
APIVR (w/o GMIL) | 0.3428 | 0.3368 | 0.3276 | 0.3102
APIVR (w/o Graph) | 0.3618 | 0.3521 | 0.3326 | 0.3285
Full APIVR approach | 0.3812 | 0.3645 | 0.3459 | 0.3314

Figure 2: The effect of the number of top-$b$ proposals chosen from video bags to represent videos on the THUMOS'14 dataset.

4.2 Implementation Details

For images, we employ the VGG model (Simonyan and Zisserman 2014) to extract fc7-layer features and then reduce the dimension from 4096-dim to 128-dim by PCA for the ease of memory and computation in our experiments. For video clips, we use the R-C3D model to generate activity proposals, which is pretrained on the Sports-1M dataset and finetuned on the UCF101 dataset (Tran et al. 2015). We extract a 4096-dim feature vector for each activity proposal, and each video is represented by a bag of top-60 proposal features, i.e., $k = 60$, ranked by the scores of containing activities. In our geometry-aware triplet loss, we use the top-50 proposals in each bag, i.e., $b = 50$. In the projection module, the mapping functions $f_v(\cdot)$ (resp., $f_u(\cdot)$) are implemented as three fully-connected layers as follows: $f_v: \mathbf{V}\,(d_1 = 4096) \rightarrow 500 \rightarrow 200 \rightarrow \bar{\mathbf{V}}\,(r = 64)$ and $f_u: \mathbf{u}\,(d_2 = 128) \rightarrow 100 \rightarrow 80 \rightarrow \bar{\mathbf{u}}\,(r = 64)$. In our experiments, we use mAP@K, i.e., mean Average Precision based on the top K retrieved results, as the evaluation metric.
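For reference, mAP@K can be sketched as follows. This uses a common definition in which each query's AP@K is normalized by the number of relevant items actually retrieved in the top K; the paper does not spell out its exact AP variant, and the toy relevance lists are made up.

```python
def average_precision_at_k(ranked_relevant, k):
    """AP@K for one query; ranked_relevant is a boolean list over the
    ranked retrieval list (True = relevant video at that rank)."""
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked_relevant[:k]):
        if rel:
            hits += 1
            precision_sum += hits / (i + 1)   # precision at this hit
    return precision_sum / hits if hits else 0.0

def map_at_k(all_ranked_relevant, k):
    """mAP@K: mean of AP@K over all queries."""
    return (sum(average_precision_at_k(q, k) for q in all_ranked_relevant)
            / len(all_ranked_relevant))

# Toy example: two queries with relevance of their top-4 retrieved videos
queries = [[True, False, True, False], [False, True, True, True]]
score = map_at_k(queries, 4)
```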
4.3 Ablation Studies

To explore the effectiveness of different components of our approach, we investigate some special cases. Specifically, we study the contributions of the three types of losses by comparing with APIVR (w/o TL), APIVR (w/o AL), and APIVR (w/o CL), which are our three special cases obtained by ablating the Triplet Loss (TL), Adversarial Loss (AL), and Classification Loss (CL) respectively. Besides, to verify the benefit of the geometry-aware triplet loss, we replace $d(\bar{\mathbf{u}}, \bar{\mathbf{V}}')$ in (6) with $\|\bar{\mathbf{u}} - Z(\bar{\mathbf{V}})\|_2^2$ and refer to this special case as APIVR (w/o GA). To demonstrate the effectiveness of our Graph Multi-Instance Learning (GMIL) module, we replace the GMIL module in (4) with the MIL module in (3), and name this case APIVR (w/o Graph). Furthermore, we also assign uniform weights to the proposals in each video instead of learning weights with the GMIL module and name this special case APIVR (w/o GMIL), which means that $Z(\bar{\mathbf{V}}) = \frac{1}{k}\sum_{s=1}^{k} \bar{\mathbf{h}}_s$ in (8) and (9) and intact bags of activity proposals are used in (7).

Figure 3: Illustration of activity proposal weights learnt by our GMIL module on the ActivityNet dataset. The clean proposal is assigned the highest weight (marked in red) and the other two noisy proposals are assigned the lowest weights.

Taking the THUMOS'14 dataset as an example, the experimental results are reported in Table 1. We can see that APIVR (w/o TL), APIVR (w/o AL), and APIVR (w/o CL) are all inferior to our full APIVR approach, which indicates that each type of loss plays an essential role in our cross-modal framework and contributes to the overall performance. Among the three losses, the classification loss has more influence on the performance than the adversarial loss and the triplet loss, which proves the significance of the semantic classifier in our approach.
When using the standard triplet loss instead of the geometry-aware triplet loss, APIVR (w/o GA) suffers a drop in performance, which demonstrates that it is beneficial to preserve the structural information and geometric properties of activity proposals. Moreover, the results of APIVR (w/o GMIL) are worse than those of the full APIVR approach, which proves the benefit of paying more attention to clean proposals based on our GMIL module. More results of ablating GMIL for each loss are provided in the Supplementary. Finally, we observe that APIVR (w/o Graph) underperforms the full APIVR approach, which shows the advantage of inserting the graph convolutional layer into the MIL module.

Recall that we use truncated bags of the top-$b$ clean proposals in our geometry-aware triplet loss. To investigate the impact of $b$, we vary $b$ and report the performance of our full APIVR approach in Figure 2. We observe that $b = 50$ achieves the best performance, and intact bags of proposals, i.e., $b = 60$, may harm the performance because of the included noisy proposals. When $b$ is very small (i.e., $b \leq 30$), too much useful information is discarded and thus the performance is also unsatisfactory.

4.4 Visualization of Retrieved Videos

To better demonstrate the effectiveness of our GMIL module in identifying clean proposals, we provide two representative qualitative results in Figure 3, in which the query image belongs to the category "surfing" (resp., "kick") in the top (resp., bottom) row. We list the top-2 retrieved videos for each query image. For each retrieved video, we show one proposal with the highest weight and another two proposals with the lowest weights.
It is obvious that the proposal with the highest weight captures the relevant activity while the other two proposals are less relevant or even background segments, which indicates the great advantage of our GMIL module in identifying clean proposals.

Table 2: mAP@K of different methods on the THUMOS'14 and MED2017 Event datasets. Best results are denoted in boldface.

Methods | THUMOS'14 @10 | @20 | @50 | @100 | MED2017 Event @10 | @20 | @50 | @100
ITQ | 0.2613 | 0.2572 | 0.2477 | 0.2340 | 0.2284 | 0.2168 | 0.2127 | 0.2034
SpH | 0.2131 | 0.2080 | 0.2033 | 0.1914 | 0.2044 | 0.1926 | 0.1878 | 0.1611
SKLSH | 0.2004 | 0.1974 | 0.1951 | 0.1847 | 0.1956 | 0.1924 | 0.1883 | 0.1774
CBE-opt | 0.2687 | 0.2601 | 0.2554 | 0.2483 | 0.2268 | 0.2128 | 0.2051 | 0.1984
MFH | 0.2402 | 0.2398 | 0.2188 | 0.2128 | 0.2246 | 0.2192 | 0.2108 | 0.1994
SCM | 0.2661 | 0.2576 | 0.2484 | 0.2395 | 0.2113 | 0.2041 | 0.1962 | 0.1924
CMFH | 0.2545 | 0.2513 | 0.2466 | 0.2331 | 0.2262 | 0.2169 | 0.2101 | 0.2088
BPBC | 0.2724 | 0.2706 | 0.2684 | 0.2571 | 0.2488 | 0.2501 | 0.2451 | 0.2402
JRL | 0.2770 | 0.2656 | 0.2526 | 0.2411 | 0.2347 | 0.2278 | 0.2203 | 0.2198
CCL | 0.3222 | 0.3188 | 0.3072 | 0.2949 | 0.2454 | 0.2417 | 0.2321 | 0.2267
JFSSL | 0.2367 | 0.2351 | 0.2325 | 0.2241 | 0.2292 | 0.2218 | 0.2131 | 0.2064
Corr-AE | 0.2266 | 0.2178 | 0.2096 | 0.2104 | 0.2032 | 0.2011 | 0.1971 | 0.1918
DSPE | 0.2632 | 0.2544 | 0.2443 | 0.2312 | 0.2312 | 0.2246 | 0.2161 | 0.2004
CMDN | 0.2927 | 0.2892 | 0.2754 | 0.2714 | 0.2328 | 0.2342 | 0.2250 | 0.2171
ACMR | 0.3361 | 0.3274 | 0.3107 | 0.3061 | 0.2518 | 0.2401 | 0.2373 | 0.2244
DSCMR | 0.3621 | 0.3523 | 0.3251 | 0.3188 | 0.2665 | 0.2576 | 0.2470 | 0.2381
APIVR | 0.3812 | 0.3645 | 0.3459 | 0.3314 | 0.3049 | 0.2973 | 0.2867 | 0.2771
Table 3: Performance of different methods in terms of mAP@K on the ActivityNet dataset. Best results are denoted in boldface.

Method | mAP@10 | mAP@20 | mAP@50 | mAP@100
ITQ | 0.1851 | 0.1704 | 0.1598 | 0.1414
SpH | 0.1885 | 0.1843 | 0.1617 | 0.1551
SKLSH | 0.1638 | 0.1595 | 0.1556 | 0.1474
CBE-opt | 0.2044 | 0.1970 | 0.1842 | 0.1768
MFH | 0.2155 | 0.2048 | 0.1977 | 0.1932
SCM | 0.2285 | 0.2230 | 0.2166 | 0.2011
CMFH | 0.2334 | 0.2318 | 0.2205 | 0.2155
BPBC | 0.2352 | 0.2296 | 0.2184 | 0.2071
JRL | 0.2266 | 0.2182 | 0.2177 | 0.2096
CCL | 0.2358 | 0.2208 | 0.2138 | 0.2082
JFSSL | 0.2166 | 0.2087 | 0.1958 | 0.1929
Corr-AE | 0.2024 | 0.2012 | 0.1924 | 0.1866
DSPE | 0.2212 | 0.2107 | 0.2079 | 0.2055
CMDN | 0.2422 | 0.2401 | 0.2288 | 0.2232
ACMR | 0.2318 | 0.2224 | 0.2111 | 0.2091
DSCMR | 0.2481 | 0.2344 | 0.2287 | 0.2122
APIVR | 0.2635 | 0.2545 | 0.2488 | 0.2319

4.5 Comparisons with the State-of-the-art Methods

We compare our APIVR approach with state-of-the-art methods including the single-modality hashing methods CBE-opt (Yu et al. 2014), ITQ (Gong and Lazebnik 2011), SKLSH (Raginsky and Lazebnik 2009), and SpH (Heo et al. 2012), the multi-modality hashing methods MFH (Ye et al. 2017), SCM (Zhang and Li 2014), CMFH (Ding, Guo, and Zhou 2014), and BPBC (Xu et al. 2017), and the cross-modal retrieval methods Corr-AE (Feng, Wang, and Li 2014), CMDN (Peng, Huang, and Qi 2016), ACMR (Wang et al. 2017), DSPE (Wang, Li, and Lazebnik 2016), JRL (Zhai, Peng, and Xiao 2014), JFSSL (Wang et al. 2016), CCL (Peng et al. 2018), and DSCMR (Zhen et al. 2019). Among them, BPBC is a hashing method targeting the image-to-video retrieval task. Although the method in (de Araújo and Girod 2018) also targets image-to-video retrieval, it focuses on improving video Fisher Vectors using bloom filters and thus cannot be directly applied to our problem.
Besides, Corr-AE, CMDN, ACMR, CCL, DSPE, and DSCMR are deep learning-based methods that have achieved remarkable results in the cross-modal retrieval task. For a fair comparison, for all baselines we take the average of the proposal features extracted by R-C3D as the video features and VGG fc7 features as the image features. The encoding length in the hashing methods is set to 128 bits. The experimental results are summarized in Tables 2 and 3.

Compared with ACMR (Wang et al. 2017), which has a framework similar to ours, our superior performance confirms the advantages of preserving structural information via geometric projection and attending to clean proposals via the GMIL module. Moreover, our approach achieves significant improvements over all baselines at every scope of K on both the action-based and event-based datasets. For example, on the ActivityNet dataset, which has the largest number of categories, APIVR outperforms the other methods by about 2% at every scope of K.

5 Conclusion

In this paper, we have proposed the first Activity Proposal-based Image-to-Video Retrieval (APIVR) approach for the activity image-to-video retrieval task. We have incorporated a graph multi-instance learning module into a cross-modal retrieval framework to address the proposal noise issue, and have also proposed a geometry-aware triplet loss. Experiments on three datasets have demonstrated the superiority of our approach compared to the state-of-the-art methods.

6 Supplementary

6.1 Details of Datasets

We construct our datasets based on three public datasets: THUMOS'14 [4], ActivityNet (Heilbron et al. 2015), and MED2017 Event [5], in which THUMOS'14 and ActivityNet are action-based datasets while MED2017 Event is an event-based dataset.
The details of the above three datasets are introduced as follows:

THUMOS'14 dataset: The THUMOS'14 dataset consists of 2765 trimmed training videos and 200 untrimmed validation videos from 20 different sport activities. We merge similar categories such as "cliff diving" and "diving", obtaining a total of 18 categories.

ActivityNet dataset: The ActivityNet dataset (v1.3) contains 200 activity categories. Due to the limits of GPU memory and speed, we only use the validation set, which has 4926 videos. Similar to the THUMOS'14 dataset, we merge similar categories such as "clean and jerk" and "snatch", leading to 156 categories in total.

MED2017 Event dataset: The Multimedia Event Detection (MED) task launched by TRECVID consists of more complicated events, which is also suitable for exploring the AIVR task. We use the resources for the Pre-Specified Event portion of the MED2017 evaluation. The dataset has 200 trimmed videos in total, distributed over 10 event categories.

6.2 Training Algorithm

Recall that our objective function is

min_{θ_g} max_{θ_d}  α · L_triplet + β · L_class − L_adv,   (12)

in which α and β are trade-off parameters, empirically fixed as 10 and 0.01 respectively. The problem in (12) is a minimax problem, which can be optimized by updating θ_g and θ_d in an alternating manner. The details of our training process are shown in Algorithm 1, in which we set the number of image-video pairs in a mini-batch n = 64, the number of generation steps t = 50, and the learning rate λ = 0.0001.

Algorithm 1 The training process of our APIVR approach.
Input: Mini-batches of video-image pairs {(V_i, u_i)}_{i=1}^{n} and the associated labels {y_i}_{i=1}^{n}; the number of steps t, learning rate λ, and trade-off parameters α, β.
Repeat until convergence:
1: for t steps do
2:   update θ_g by descending stochastic gradients:
       θ_g ← θ_g − λ · ∇_{θ_g} (α · L_triplet + β · L_class − L_adv).
3: end for
4: update θ_d by ascending stochastic gradients:
     θ_d ← θ_d + λ · ∇_{θ_d} (α · L_triplet + β · L_class − L_adv).
5: return model parameters θ_g and θ_d.

[4] http://crcv.ucf.edu/THUMOS14/
[5] https://www.nist.gov/itl/iad/mig/med-2017-evaluation/

References

de Araújo, A. F., and Girod, B. 2018. Large-scale video retrieval using image queries. T-CSVT 28(6):1406–1420.
Ding, G.; Guo, Y.; and Zhou, J. 2014. Collective matrix factorization hashing for multimodal data. In CVPR.
Feng, F.; Wang, X.; and Li, R. 2014. Cross-modal retrieval with correspondence autoencoder. In MM.
Gong, Y., and Lazebnik, S. 2011. Iterative quantization: A procrustean approach to learning binary codes. In CVPR.
Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A. C.; and Bengio, Y. 2014. Generative adversarial networks. CoRR abs/1406.2661.
Guo, Z., and Yi, Y. 2013. Graph-based multiple instance learning for action recognition. In ICIP.
Hardoon, D. R.; Szedmák, S.; and Shawe-Taylor, J. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Computation 16(12):2639–2664.
Heilbron, F. C.; Escorcia, V.; Ghanem, B.; and Niebles, J. C. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR.
Heo, J.; Lee, Y.; He, J.; Chang, S.; and Yoon, S. 2012. Spherical hashing. In CVPR.
Ilse, M.; Tomczak, J. M.; and Welling, M. 2018. Attention-based deep multiple instance learning. In ICML.
Ji, S.; Xu, W.; Yang, M.; and Yu, K. 2013. 3D convolutional neural networks for human action recognition. T-PAMI 35(1):221–231.
Jiang, Y.; Wu, Z.; Wang, J.; Xue, X.; and Chang, S. 2018. Exploiting feature and class relationships in video categorization with regularized deep neural networks. T-PAMI 40(2):352–364.
Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
Li, Y.; Kwok, J. T.; Tsang, I. W.; and Zhou, Z. 2009. A convex method for locating regions of interest with multi-instance learning. In ECML PKDD.
Lin, G.; Shen, C.; and van den Hengel, A. 2014. Supervised hashing using graph cuts and boosted decision trees. CoRR abs/1408.5574.
Ng, J. Y.; Hausknecht, M. J.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; and Toderici, G. 2015. Beyond short snippets: Deep networks for video classification. In CVPR.
Pappas, N., and Popescu-Belis, A. 2014. Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis. In EMNLP.
Peng, Y.; Qi, J.; Huang, X.; and Yuan, Y. 2018. CCL: Cross-modal correlation learning with multigrained fusion by hierarchical network. T-MM 20(2):405–420.
Peng, Y.; Huang, X.; and Qi, J. 2016. Cross-media shared representation by hierarchical learning with multiple deep networks. In IJCAI.
Qiu, Z.; Yao, T.; and Mei, T. 2017. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV.
Raginsky, M., and Lazebnik, S. 2009. Locality-sensitive binary codes from shift-invariant kernels. In NIPS.
Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. FaceNet: A unified embedding for face recognition and clustering. In CVPR.
Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
Srivastava, N.; Mansimov, E.; and Salakhutdinov, R. 2015. Unsupervised learning of video representations using LSTMs. In ICML.
Tran, D.; Bourdev, L. D.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3D convolutional networks. In ICCV.
Tu, M.; Huang, J.; He, X.; and Zhou, B. 2019. Multiple instance learning with graph neural networks. CoRR abs/1906.04881.
Wang, K.; He, R.; Wang, W.; Wang, L.; and Tan, T. 2013. Learning coupled feature spaces for cross-modal matching. In ICCV.
Wang, K.; He, R.; Wang, L.; Wang, W.; and Tan, T. 2016. Joint feature selection and subspace learning for cross-modal retrieval. T-PAMI 38(10):2010–2023.
Wang, B.; Yang, Y.; Xu, X.; Hanjalic, A.; and Shen, H. T. 2017. Adversarial cross-modal retrieval. In MM.
Wang, L.; Li, Y.; and Lazebnik, S. 2016. Learning deep structure-preserving image-text embeddings. In CVPR.
Xu, R.; Yang, Y.; Shen, F.; Xie, N.; and Shen, H. T. 2017. Efficient binary coding for subspace-based query-by-image video retrieval. In MM.
Xu, H.; Das, A.; and Saenko, K. 2017. R-C3D: Region convolutional 3D network for temporal activity detection. In ICCV.
Yao, T.; Mei, T.; and Ngo, C. 2015. Learning query and image similarities with ranking canonical correlation analysis. In ICCV.
Ye, D.; Li, Y.; Tao, C.; Xie, X.; and Wang, X. 2017. Multiple feature hashing learning for large-scale remote sensing image retrieval. ISPRS 6(11):364.
Yu, F. X.; Kumar, S.; Gong, Y.; and Chang, S. 2014. Circulant binary embedding. In ICML.
Zhai, X.; Peng, Y.; and Xiao, J. 2014. Learning cross-media joint representation with sparse and semisupervised regularization. T-CSVT 24(6):965–978.
Zhang, D., and Li, W. 2014. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI.
Zhen, L.; Hu, P.; Wang, X.; and Peng, D. 2019. Deep supervised cross-modal retrieval. In CVPR.
Zhu, W.; Lou, Q.; Vang, Y. S.; and Xie, X. 2017. Deep multi-instance networks with sparse label assignment for whole mammogram classification. In MICCAI.