Key Instance Selection for Unsupervised Video Object Segmentation


Authors: Donghyeon Cho, Sungeun Hong, Sungil Kang, Jiwon Kim

Donghyeon Cho* (SK T-Brain), Sungeun Hong* (SK T-Brain), Sungil Kang* (SK T-Brain), Jiwon Kim (SK T-Brain)

Abstract

This paper proposes key instance selection based on video saliency, covering objectness and dynamics, for unsupervised video object segmentation (UVOS). Our method takes frames sequentially and extracts object proposals with corresponding masks for each frame. We link objects according to their similarity until the M-th frame and then assign them unique IDs (i.e., instances). The similarity measure takes multiple properties into account, such as the ReID descriptor, expected trajectory, and semantic co-segmentation result. After the M-th frame, we select K IDs based on video saliency and frequency of appearance; only these key IDs are then tracked through the remaining frames. Thanks to these technical contributions, our results are ranked third on the leaderboard of the UVOS DAVIS challenge.

1. Introduction

Given a mask in the first frame, semi-supervised video object segmentation (SVOS) is the task of generating masks in the subsequent frames. Since the first SVOS challenge [11], SVOS has been gradually gaining attention, and datasets are continuously updated [12] or newly constructed [15]. Recently, interactive video object segmentation (IVOS) [2] and unsupervised video object segmentation (UVOS) [3] were introduced as new challenges. This paper tackles UVOS, which does not require any human supervision.

The basic approach to UVOS is to estimate a mask for the first frame and then apply conventional SVOS methods [6, 8]. However, not only is this greatly influenced by the quality of the first-frame result, there is also no guarantee that all targets appear in the first frame. To resolve these problems, we can continuously assign a new ID (i.e., instance) to any object that satisfies certain criteria.
However, this not only increases the number of IDs indiscriminately but also increases time complexity. It is therefore preferable to fix K, an appropriate maximum number of IDs. In this paper, we propose a method to select K IDs by exploring the video saliency of several frames at the beginning of the video.

*Equal contributors to this work.

Figure 1. Segmentation results with respect to K instances on the 'scooter-black' video, for (a) K = 5, (b) K = 10, (c) K = 15. Top: random instance selection. Bottom: our key instance selection.

Compared to random selection of K instances, our method captures the main objects well even for small K, as shown in Fig. 1-(a). The main contributions of this paper are as follows. Compared to adding new IDs continuously until the end of the video, our key instance selection approach significantly reduces time as well as memory complexity. Our model also captures regions of importance better than random instance selection. When measuring the similarity between assigned IDs and extracted proposals, we use the full set of positive ReID descriptors for each ID. Unlike conventional methods that use optical flow [5] to handle large appearance changes in a frame sequence, we use semantic co-segmentation [9, 13]. Finally, we automatically search hyperparameters through the Meta AI system developed by T-Brain in a scalable environment (e.g., 144 GPUs).

2. Method

Our objective is to assign IDs to the proposals without additional inputs such as a mask of the first frame. For this, we use an object pool that manages assigning, adding, and deleting IDs. Fig. 2 illustrates an overview of the proposed method. As a first step, we perform instance segmentation on the current frame and propagate the masks obtained in the previous frame. We then extract features and assign IDs to candidates that satisfy the criteria of the online tracker linked with the object pool.
Here, if the candidate matches an existing ID in the object pool, we assign the matched ID to the candidate and update the online tracker; otherwise, we assign a new ID to the candidate and add this new ID to the object pool. Adding new IDs is performed only up to the M-th frame. At the M-th frame, K instances among the accumulated IDs are selected, and only those selected IDs are tracked through the remaining frames.

Figure 2. Overview of the proposed method.

Candidate Generation: Given a frame $I_t$ at time $t$ ($t \in T$), we perform instance segmentation to obtain bounding boxes and masks using Mask R-CNN [7] and DeepLab [4]. Motion blur or occlusion, which often occur in videos, can result in poor segmentation for certain frames. To address this issue, we also use masks propagated from the previous frame as mask candidates. Concretely, we utilize RGMP [9, 13] in our experiments.

ID Assignment: To assign an ID to each candidate, we compute specific scores by comparing the candidates with the IDs registered in the object pool. The first score is $S_{iou}(l, n)$, the IoU between the mask of ID $l$ ($l \in L$) and the mask of candidate $n$ ($n \in N$), where $L$ is the number of IDs registered in the object pool and $N$ is the number of candidates. We omit the time index $t$ for simplicity. The second score, $S_{traj}(l, n)$, measures how far the candidate is from the predicted bounding box of the ID:

$S_{traj}(l, n) = 1 - \min\!\left( \mathrm{dot}(\vec{b}_l, \vec{b}_n) / \alpha_{traj},\, 1 \right)$,   (1)

where $\alpha_{traj}$ is a normalization factor, and $\vec{b}_l$ and $\vec{b}_n$ are vectors derived from the bounding boxes of the ID and the candidate, respectively.

Figure 3. Video saliency results. Note that only the moving motorcycle is focused on among the many objects in the scene.

The third score, $S_{reid}(l, n)$, considers the distance between the ReID descriptors [10] of ID $l$ and candidate $n$.
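Before turning to the ReID terms, the first two scores can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the boolean-mask representation is an assumption, and `dot(·,·)` follows Eq. (1) as written.

```python
import numpy as np

def iou_score(mask_l: np.ndarray, mask_n: np.ndarray) -> float:
    """S_iou: IoU between the mask of registered ID l and candidate n.

    Masks are boolean H x W arrays (an assumption; the paper does not
    specify the representation).
    """
    union = np.logical_or(mask_l, mask_n).sum()
    if union == 0:
        return 0.0
    inter = np.logical_and(mask_l, mask_n).sum()
    return float(inter) / float(union)

def traj_score(b_l: np.ndarray, b_n: np.ndarray, alpha_traj: float) -> float:
    """S_traj (Eq. 1): 1 - min(dot(b_l, b_n) / alpha_traj, 1).

    b_l and b_n are the vectors derived from the ID's predicted box and
    the candidate's box; alpha_traj is the normalization factor.
    """
    return 1.0 - min(float(np.dot(b_l, b_n)) / alpha_traj, 1.0)
```

For example, two masks whose intersection is half their union give `iou_score` of 0.5, and a dot product far above `alpha_traj` drives `traj_score` to 0.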
Unlike [6, 8], our method uses the nearest ReID descriptor among the full set of positive ReID descriptors of ID $l$:

$S_{reid}(l, n) = 1 - \min\!\left( \min_j \| d_l(j) - r_n \| / \alpha_{reid},\, 1 \right)$,   (2)

where $r_n$ is the ReID descriptor of candidate $n$, $d_l(\cdot)$ are the positive ReID descriptors of ID $l$, and $\alpha_{reid}$ is a normalization factor. The last score is a relative ReID score $S_{rel}(l, n)$:

$S_{rel}(l, n) = S_{reid}(l, n) / \max_l S_{reid}(l, n)$.   (3)

The total score is a weighted summation of the above four scores:

$S_{total}(l, n) = w_{iou} S_{iou}(l, n) + w_{traj} S_{traj}(l, n) + w_{reid} S_{reid}(l, n) + w_{rel} S_{rel}(l, n)$,   (4)

where $w_{iou}$, $w_{traj}$, $w_{reid}$, and $w_{rel}$ are the weight factors of each term. Finally, we assign ID $l$ to a candidate as follows:

$\hat{n} = \arg\max_n S_{total}(l, n)$ if $\max_n S_{total}(l, n) \ge \tau_c$, and None otherwise,   (5)

where $\tau_c$, $c \in \{1, 2\}$, is a threshold for cutting off objects with low confidence: before the K instances are selected ($t \le M$) we use $c = 1$, and afterwards $c = 2$. If a candidate $\hat{n}$ is assigned to ID $l$, then $d_{\hat{n}}$ is added to the pool of positive ReID descriptors of ID $l$, and $\vec{b}_l$ is updated using $\vec{b}_{\hat{n}}$.

Key Instance Selection: The pipeline described above is iterated at each frame, with an ID-addition process at the beginning of the video. In the first frame, object candidates with high confidence are added as new IDs. From the second to the M-th frame, new IDs are added when object candidates have a high objectness score and do not overlap with objects already assigned to existing IDs. After the M-th frame, we select at most K IDs.
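The per-frame scoring and assignment of Eqs. (2)-(5) can be sketched as follows. Again this is an illustrative sketch under our own assumptions: array shapes (IDs as rows, candidates as columns) and all function names are ours, not the authors'.

```python
import numpy as np

def reid_score(d_l: np.ndarray, r_n: np.ndarray, alpha_reid: float) -> float:
    """S_reid (Eq. 2): distance of the nearest positive ReID descriptor
    of ID l (rows of d_l) to the candidate descriptor r_n."""
    nearest = np.linalg.norm(d_l - r_n, axis=1).min()
    return 1.0 - min(nearest / alpha_reid, 1.0)

def rel_score(S_reid: np.ndarray) -> np.ndarray:
    """S_rel (Eq. 3): S_reid normalized by its maximum over IDs (rows)
    for each candidate (column)."""
    denom = S_reid.max(axis=0, keepdims=True)
    return np.divide(S_reid, denom, out=np.zeros_like(S_reid), where=denom > 0)

def total_score(S_iou, S_traj, S_reid, S_rel, w):
    """S_total (Eq. 4): weighted sum of the four L x N score matrices."""
    return (w["iou"] * S_iou + w["traj"] * S_traj
            + w["reid"] * S_reid + w["rel"] * S_rel)

def assign(S_total: np.ndarray, tau: float):
    """Eq. 5: for each ID l, pick the best candidate n-hat, or None if
    its total score falls below the confidence threshold tau
    (tau_1 before the K instances are selected, tau_2 afterwards)."""
    out = {}
    for l in range(S_total.shape[0]):
        n_hat = int(np.argmax(S_total[l]))
        out[l] = n_hat if S_total[l, n_hat] >= tau else None
    return out
```

After an assignment, the matched descriptor would be appended to the ID's pool of positive ReID descriptors and its box vector updated, as described above.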
As selection criteria, we use a weighted summation of the video saliency score [14] and the frequency of each ID:

$S_{sel}(l) = w_{sal} S_{sal}(l) + w_{freq} S_{freq}(l)$,   (6)

where $w_{sal}$ and $w_{freq}$ are the weight factors of each term, and $S_{sal}(l)$ and $S_{freq}(l)$ are computed over frames. Fig. 3 shows the effectiveness of the video-saliency-based approach. The frequency of an ID is the number of frames in which it appears up to the M-th frame. Because several IDs are not connected to any proposal at certain times due to the threshold $\tau_1$, $S_{freq}(l)$ can vary by ID.

Hyperparameter Search: We perform a hyperparameter search for $w_{iou}$, $w_{traj}$, $w_{reid}$, $w_{rel}$, $\tau_1$, $\tau_2$, $w_{sal}$, and $w_{freq}$ on the DAVIS validation dataset. The resulting Global Mean (J&F) is 0.599, with searched weight factors of 0.12, 0.575, 0.3, 0.0065, 0.55, 0.35, 0.5, and 1.0, respectively.

3. Experiments

We evaluate the proposed method on the UVOS DAVIS challenge dataset [3], which contains 30 videos without a first-frame mask. We submit our results directly to the CodaLab site [1] to obtain segmentation scores. The evaluation metrics are Region Jaccard (J) and Boundary F-measure (F) for each instance. As shown in Table 1, our method achieves third rank with respect to Global Mean as well as all the other metrics.

Table 1. Segmentation results on the UVOS DAVIS challenge dataset.

| Measure | RWTH Vision | Oxford-CASIA | SK T-Brain (ours) | UVOS-test | RWTH Vision 2 | ZXVIP | VIG | UOC-UPC-BSC |
|---|---|---|---|---|---|---|---|---|
| Ranking | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Global Mean ↑ | 0.564 | 0.562 | 0.516 | 0.504 | 0.481 | 0.471 | 0.448 | 0.412 |
| J Mean ↑ | 0.564 | 0.535 | 0.487 | 0.475 | 0.460 | 0.435 | 0.422 | 0.379 |
| J Recall ↑ | 0.609 | 0.613 | 0.551 | 0.542 | 0.514 | 0.490 | 0.476 | 0.413 |
| J Decay ↓ | 0.015 | -0.021 | 0.040 | 0.032 | 0.084 | 0.035 | 0.035 | 0.076 |
| F Mean ↑ | 0.594 | 0.590 | 0.545 | 0.533 | 0.503 | 0.506 | 0.474 | 0.444 |
| F Recall ↑ | 0.641 | 0.632 | 0.594 | 0.569 | 0.538 | 0.543 | 0.506 | 0.473 |
| F Decay ↓ | 0.058 | 0.001 | 0.077 | 0.055 | 0.118 | 0.067 | 0.068 | 0.117 |

Fig. 4 shows qualitative results on the DAVIS validation set. In the first row of Fig. 4, our method captures the blackswan well over time. The proposed method also shows robust segmentation results even in the relatively dynamic parkour example. In addition, our method works faithfully on multiple-object segmentation (from the third row to the last row of Fig. 4). Note that changes in scale (lab-coat) and appearance (mbike-trick) are handled appropriately by our method.

Figure 4. Qualitative results on the DAVIS validation set, containing single and multiple objects.

The proposed key instance selection method is quantitatively compared with random instance selection for varying K in Table 2. Since the proposed scheme focuses on semantically meaningful regions, it shows promising results even at a small K.

Table 2. Global Mean (J&F) scores on the DAVIS validation dataset according to the number of IDs, comparing our key instance selection with random instance selection.

| # of IDs | 5 | 10 | 15 | 20 |
|---|---|---|---|---|
| Random | 0.474 | 0.529 | 0.587 | 0.594 |
| Ours | 0.576 | 0.579 | 0.591 | 0.599 |

4. Conclusion

We have presented key instance selection for unsupervised video object segmentation (UVOS). Our method allows at most K instances to be tracked by considering video saliency and frequency of appearance. Focusing on objects that are in the spotlight reduces time complexity. In addition, objects in a frame sequence are linked by scores derived from ReID descriptors, trajectories, and co-segmentation results. Finally, we searched for the optimal hyperparameters of our model with the Meta AI system. Our method showed competitive results in the UVOS DAVIS challenge.

5. Acknowledgements

We thank Youngsoon Lee for help with the Meta AI system, which is effective in searching hyperparameters.

References

[1] CodaLab. https://competitions.codalab.org/competitions/21739#results
[2] Sergi Caelles, Alberto Montes, Kevis-Kokitsi Maninis, Yuhua Chen, Luc Van Gool, Federico Perazzi, and Jordi Pont-Tuset. The 2018 DAVIS challenge on video object segmentation. 2018.
[3] Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. The 2019 DAVIS challenge on VOS: Unsupervised multi-object segmentation. 2019.
[4] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proc. of European Conf. on Computer Vision (ECCV), 2018.
[5] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. of Computer Vision and Pattern Recognition (CVPR), 2017.
[6] J. Luiten, P. Voigtlaender, and B. Leibe. PReMVOS: Proposal-generation, refinement and merging for the DAVIS challenge on video object segmentation 2018. The 2018 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2018.
[7] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proc. of Int'l Conf. on Computer Vision (ICCV), 2017.
[8] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. PReMVOS: Proposal-generation, refinement and merging for video object segmentation. In Proc. of Asian Conf. on Computer Vision (ACCV), 2018.
[9] Seoung Wug Oh, Joon-Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. Fast video object segmentation by reference-guided mask propagation. In Proc. of Computer Vision and Pattern Recognition (CVPR), 2018.
[10] Aljosa Osep, Paul Voigtlaender, Jonathon Luiten, Stefan Breuers, and Bastian Leibe. Large-scale object discovery and detector adaptation from unlabeled video. CoRR, abs/1712.08832, 2017.
[11] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2016.
[12] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675, 2017.
[13] S. Wug Oh, J. Lee, N. Xu, and S. Joo Kim. Fast user-guided video object segmentation by deep networks. The 2018 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2018.
[14] W. Wang, J. Shen, and L. Shao. Video salient object detection via fully convolutional networks. IEEE Trans. Image Processing (TIP), 27(1):38-49, 2018.
[15] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas S. Huang. YouTube-VOS: A large-scale video object segmentation benchmark. CoRR, abs/1809.03327, 2018.
