Key Instance Selection for Unsupervised Video Object Segmentation


Authors: Donghyeon Cho, Sungeun Hong, Sungil Kang, Jiwon Kim

Donghyeon Cho* (SK T-Brain), Sungeun Hong* (SK T-Brain), Sungil Kang* (SK T-Brain), Jiwon Kim (SK T-Brain)

Abstract

This paper proposes key instance selection based on video saliency, covering objectness and dynamics, for unsupervised video object segmentation (UVOS). Our method takes frames sequentially and extracts object proposals with corresponding masks for each frame. We link objects according to their similarity until the M-th frame and then assign them unique IDs (i.e., instances). The similarity measure takes multiple properties into account, such as the ReID descriptor, expected trajectory, and semantic co-segmentation result. After the M-th frame, we select K IDs based on video saliency and frequency of appearance; only these key IDs are then tracked through the remaining frames. Thanks to these technical contributions, our results are ranked third on the leaderboard of the UVOS DAVIS challenge.

1. Introduction

Given a mask in the first frame, semi-supervised video object segmentation (SVOS) is the task of generating masks in the subsequent frames. Since the first SVOS challenge [11], SVOS has been gradually gaining attention, and datasets are continuously updated [12] or newly constructed [15]. Recently, interactive video object segmentation (IVOS) [2] and unsupervised video object segmentation (UVOS) [3] were introduced as new challenges. This paper tackles UVOS, which does not require any human supervision.

The basic approach to UVOS is to estimate a mask for the first frame and then apply conventional SVOS methods [6, 8]. However, not only is this greatly influenced by the quality of the first-frame result, there is also no guarantee that all targets appear in the first frame. To resolve these problems, we can continuously assign a new ID (i.e., instance) to any object that satisfies certain criteria.
However, this not only increases the number of IDs indiscriminately but also increases time complexity. It is therefore preferable to fix K, an appropriate maximum number of IDs. In this paper, we propose a method to select K IDs by exploring the video saliency of several frames at the beginning of the video.

*Equal contributors to this work.

Figure 1. Segmentation results with respect to K instances on the 'scooter-black' video, for (a) K = 5, (b) K = 10, (c) K = 15. Top: random instance selection. Bottom: our key instance selection.

Compared to random selection of K instances, our method captures the main objects well even for small K, as shown in Fig. 1-(a). The main contributions of this paper are as follows. Compared to adding new IDs continuously until the end of the video, our key instance selection approach significantly reduces time as well as memory complexity. Our model also captures regions of importance better than random instance selection. When measuring the similarity between assigned IDs and extracted proposals, we use the full set of positive ReID descriptors for each ID. Unlike conventional methods that use optical flow [5] to handle large appearance changes in a frame sequence, we use semantic co-segmentation [9, 13]. Finally, we automatically search hyperparameters through the Meta AI system developed by T-Brain in a scalable environment (e.g., 144 GPUs).

2. Method

Our objective is to assign IDs to the proposals without additional inputs such as a mask of the first frame. For this, we use an object pool that manages assigning, adding, and deleting IDs. Fig. 2 illustrates an overview of the proposed method. As a first step, we perform instance segmentation on the current frame and propagate the masks obtained in the previous frame. We then extract features and assign IDs to candidates that satisfy the criteria of the online tracker linked with the object pool.
Here, if the candidate matches an existing ID in the object pool, we assign the matched ID to the candidate and update the online tracker; otherwise, we assign a new ID to the candidate and add this new ID to the object pool. Adding new IDs is performed only up to the M-th frame. At the M-th frame, K instances among the accumulated IDs are selected, and only those selected IDs are tracked through the remaining frames.

Figure 2. Overview of the proposed method.

Candidate Generation: Given a frame $I_t$ at time $t$ ($t \in T$), we perform instance segmentation to obtain bounding boxes and masks using Mask R-CNN [7] and DeepLab [4]. Motion blur or occlusion, which often occur in videos, can result in poor segmentation for certain frames. To address this issue, we also use masks propagated from the previous frame as mask candidates. Concretely, we utilize RGMP [9, 13] in our experiments.

ID Assignment: To assign an ID to each candidate, we compute specific scores by comparing the candidates with the IDs registered in the object pool. The first score is $S_{iou}(l, n)$, the IoU between the mask of ID $l$ ($l \in L$) and the mask of candidate $n$ ($n \in N$), where $L$ is the number of IDs registered in the object pool and $N$ is the number of candidates. We omit the time index $t$ for simplicity. The second score, $S_{traj}(l, n)$, measures how far the candidate is from the predicted bounding box of the ID:

$S_{traj}(l, n) = 1 - \min\!\left( \mathrm{dot}(\vec{b}_l, \vec{b}_n) / \alpha_{traj},\, 1 \right)$,   (1)

where $\alpha_{traj}$ is a normalization factor, and $\vec{b}_l$ and $\vec{b}_n$ are vectors derived from the bounding boxes of the ID and the candidate, respectively.

Figure 3. Video saliency results. Note that only the moving motorcycle is focused on among the many objects in the scene.

The third score, $S_{reid}(l, n)$, considers the distance between the ReID descriptors [10] of ID $l$ and candidate $n$.
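Before turning to the ReID terms, the first two scores can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the boolean-mask representation is an assumption, and `dot(·,·)` follows Eq. (1) as written.

```python
import numpy as np

def iou_score(mask_l: np.ndarray, mask_n: np.ndarray) -> float:
    """S_iou: IoU between the mask of registered ID l and candidate n.

    Masks are boolean H x W arrays (an assumption; the paper does not
    specify the representation).
    """
    union = np.logical_or(mask_l, mask_n).sum()
    if union == 0:
        return 0.0
    inter = np.logical_and(mask_l, mask_n).sum()
    return float(inter) / float(union)

def traj_score(b_l: np.ndarray, b_n: np.ndarray, alpha_traj: float) -> float:
    """S_traj (Eq. 1): 1 - min(dot(b_l, b_n) / alpha_traj, 1).

    b_l and b_n are the vectors derived from the ID's predicted box and
    the candidate's box; alpha_traj is the normalization factor.
    """
    return 1.0 - min(float(np.dot(b_l, b_n)) / alpha_traj, 1.0)
```

For example, two masks whose intersection is half their union give `iou_score` of 0.5, and a dot product far above `alpha_traj` drives `traj_score` to 0.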
Unlike [6, 8], our method uses the nearest ReID descriptor among the full set of positive ReID descriptors of ID $l$:

$S_{reid}(l, n) = 1 - \min\!\left( \min_j \| d_l(j) - r_n \| / \alpha_{reid},\, 1 \right)$,   (2)

where $r_n$ is the ReID descriptor of candidate $n$, $d_l(\cdot)$ are the positive ReID descriptors of ID $l$, and $\alpha_{reid}$ is a normalization factor. The last score is a relative ReID score $S_{rel}(l, n)$:

$S_{rel}(l, n) = S_{reid}(l, n) / \max_l S_{reid}(l, n)$.   (3)

The total score is a weighted summation of the above four scores:

$S_{total}(l, n) = w_{iou} S_{iou}(l, n) + w_{traj} S_{traj}(l, n) + w_{reid} S_{reid}(l, n) + w_{rel} S_{rel}(l, n)$,   (4)

where $w_{iou}$, $w_{traj}$, $w_{reid}$, and $w_{rel}$ are the weight factors of each term. Finally, we assign ID $l$ to a candidate as follows:

$\hat{n} = \arg\max_n S_{total}(l, n)$ if $\max_n S_{total}(l, n) \ge \tau_c$, and None otherwise,   (5)

where $\tau_c$, $c \in \{1, 2\}$, is a threshold for cutting off objects with low confidence: before the K instances are selected ($t \le M$) we use $c = 1$, and afterwards $c = 2$. If a candidate $\hat{n}$ is assigned to ID $l$, then $d_{\hat{n}}$ is added to the pool of positive ReID descriptors of ID $l$, and $\vec{b}_l$ is updated using $\vec{b}_{\hat{n}}$.

Key Instance Selection: The pipeline described above is iterated at each frame, with an ID-addition process at the beginning of the video. In the first frame, object candidates with high confidence are added as new IDs. From the second to the M-th frame, new IDs are added when object candidates have a high objectness score and do not overlap with objects already assigned to existing IDs. After the M-th frame, we select at most K IDs.
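The per-frame scoring and assignment of Eqs. (2)-(5) can be sketched as follows. Again this is an illustrative sketch under our own assumptions: array shapes (IDs as rows, candidates as columns) and all function names are ours, not the authors'.

```python
import numpy as np

def reid_score(d_l: np.ndarray, r_n: np.ndarray, alpha_reid: float) -> float:
    """S_reid (Eq. 2): distance of the nearest positive ReID descriptor
    of ID l (rows of d_l) to the candidate descriptor r_n."""
    nearest = np.linalg.norm(d_l - r_n, axis=1).min()
    return 1.0 - min(nearest / alpha_reid, 1.0)

def rel_score(S_reid: np.ndarray) -> np.ndarray:
    """S_rel (Eq. 3): S_reid normalized by its maximum over IDs (rows)
    for each candidate (column)."""
    denom = S_reid.max(axis=0, keepdims=True)
    return np.divide(S_reid, denom, out=np.zeros_like(S_reid), where=denom > 0)

def total_score(S_iou, S_traj, S_reid, S_rel, w):
    """S_total (Eq. 4): weighted sum of the four L x N score matrices."""
    return (w["iou"] * S_iou + w["traj"] * S_traj
            + w["reid"] * S_reid + w["rel"] * S_rel)

def assign(S_total: np.ndarray, tau: float):
    """Eq. 5: for each ID l, pick the best candidate n-hat, or None if
    its total score falls below the confidence threshold tau
    (tau_1 before the K instances are selected, tau_2 afterwards)."""
    out = {}
    for l in range(S_total.shape[0]):
        n_hat = int(np.argmax(S_total[l]))
        out[l] = n_hat if S_total[l, n_hat] >= tau else None
    return out
```

After an assignment, the matched descriptor would be appended to the ID's pool of positive ReID descriptors and its box vector updated, as described above.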
As selection criteria, we use a weighted summation of the video saliency score [14] and the frequency of each ID:

$S_{sel}(l) = w_{sal} S_{sal}(l) + w_{freq} S_{freq}(l)$,   (6)

where $w_{sal}$ and $w_{freq}$ are the weight factors of each term, and $S_{sal}(l)$ and $S_{freq}(l)$ are computed over frames. Fig. 3 shows the effectiveness of the video-saliency-based approach. The frequency of an ID is the number of frames in which it appears up to the M-th frame. Because several IDs are not connected to any proposal at certain times due to the threshold $\tau_1$, $S_{freq}(l)$ can vary by ID.

Hyperparameter Search: We perform a hyperparameter search for $w_{iou}$, $w_{traj}$, $w_{reid}$, $w_{rel}$, $\tau_1$, $\tau_2$, $w_{sal}$, and $w_{freq}$ on the DAVIS validation dataset. The resulting Global Mean (J&F) is 0.599, with searched weight factors of 0.12, 0.575, 0.3, 0.0065, 0.55, 0.35, 0.5, and 1.0, respectively.

3. Experiments

We evaluate the proposed method on the UVOS DAVIS challenge dataset [3], which contains 30 videos without a first-frame mask. We submit our results directly to the CodaLab site [1] to obtain segmentation scores. The evaluation metrics are Region Jaccard (J) and Boundary F-measure (F) for each instance. As shown in Table 1, our method achieves third rank with respect to Global Mean as well as all the other metrics.

Table 1. Segmentation results on the UVOS DAVIS challenge dataset.

| Measure | RWTH Vision | Oxford-CASIA | SK T-Brain (ours) | UVOS-test | RWTH Vision 2 | ZXVIP | VIG | UOC-UPC-BSC |
|---|---|---|---|---|---|---|---|---|
| Ranking | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Global Mean ↑ | 0.564 | 0.562 | 0.516 | 0.504 | 0.481 | 0.471 | 0.448 | 0.412 |
| J Mean ↑ | 0.564 | 0.535 | 0.487 | 0.475 | 0.460 | 0.435 | 0.422 | 0.379 |
| J Recall ↑ | 0.609 | 0.613 | 0.551 | 0.542 | 0.514 | 0.490 | 0.476 | 0.413 |
| J Decay ↓ | 0.015 | -0.021 | 0.040 | 0.032 | 0.084 | 0.035 | 0.035 | 0.076 |
| F Mean ↑ | 0.594 | 0.590 | 0.545 | 0.533 | 0.503 | 0.506 | 0.474 | 0.444 |
| F Recall ↑ | 0.641 | 0.632 | 0.594 | 0.569 | 0.538 | 0.543 | 0.506 | 0.473 |
| F Decay ↓ | 0.058 | 0.001 | 0.077 | 0.055 | 0.118 | 0.067 | 0.068 | 0.117 |

Fig. 4 shows qualitative results on the DAVIS validation set. In the first row of Fig. 4, our method captures the blackswan well over time. The proposed method also shows robust segmentation results even in the relatively dynamic parkour example. In addition, our method works faithfully on multiple-object segmentation (from the third row to the last row of Fig. 4). Note that changes in scale (lab-coat) and appearance (mbike-trick) are handled appropriately by our method.

Figure 4. Qualitative results on the DAVIS validation set, containing single and multiple objects.

The proposed key instance selection method is quantitatively compared with random instance selection for varying K in Table 2. Since the proposed scheme focuses on semantically meaningful regions, it shows promising results even at a small K.

Table 2. Global Mean (J&F) scores on the DAVIS validation dataset according to the number of IDs, comparing our key instance selection with random instance selection.

| # of IDs | 5 | 10 | 15 | 20 |
|---|---|---|---|---|
| Random | 0.474 | 0.529 | 0.587 | 0.594 |
| Ours | 0.576 | 0.579 | 0.591 | 0.599 |

4. Conclusion

We have presented key instance selection for unsupervised video object segmentation (UVOS). Our method allows at most K instances to be tracked by considering video saliency and frequency of appearance. Focusing on objects that are in the spotlight reduces time complexity. In addition, objects in a frame sequence are linked by scores derived from ReID descriptors, trajectories, and co-segmentation results. Finally, we searched for the optimal hyperparameters of our model with the Meta AI system. Our method showed competitive results in the UVOS DAVIS challenge.

5. Acknowledgements

We thank Youngsoon Lee for help with the Meta AI system, which is effective in searching hyperparameters.

References

[1] CodaLab. https://competitions.codalab.org/competitions/21739#results
[2] Sergi Caelles, Alberto Montes, Kevis-Kokitsi Maninis, Yuhua Chen, Luc Van Gool, Federico Perazzi, and Jordi Pont-Tuset. The 2018 DAVIS challenge on video object segmentation. 2018.
[3] Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. The 2019 DAVIS challenge on VOS: Unsupervised multi-object segmentation. 2019.
[4] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proc. of European Conf. on Computer Vision (ECCV), 2018.
[5] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. of Computer Vision and Pattern Recognition (CVPR), 2017.
[6] J. Luiten, P. Voigtlaender, and B. Leibe. PReMVOS: Proposal-generation, refinement and merging for the DAVIS challenge on video object segmentation 2018. The 2018 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2018.
[7] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proc. of Int'l Conf. on Computer Vision (ICCV), 2017.
[8] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. PReMVOS: Proposal-generation, refinement and merging for video object segmentation. In Proc. of Asian Conf. on Computer Vision (ACCV), 2018.
[9] Seoung Wug Oh, Joon-Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. Fast video object segmentation by reference-guided mask propagation. In Proc. of Computer Vision and Pattern Recognition (CVPR), 2018.
[10] Aljosa Osep, Paul Voigtlaender, Jonathon Luiten, Stefan Breuers, and Bastian Leibe. Large-scale object discovery and detector adaptation from unlabeled video. CoRR, abs/1712.08832, 2017.
[11] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2016.
[12] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675, 2017.
[13] S. Wug Oh, J. Lee, N. Xu, and S. Joo Kim. Fast user-guided video object segmentation by deep networks. The 2018 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2018.
[14] W. Wang, J. Shen, and L. Shao. Video salient object detection via fully convolutional networks. IEEE Trans. Image Processing (TIP), 27(1):38-49, 2018.
[15] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas S. Huang. YouTube-VOS: A large-scale video object segmentation benchmark. CoRR, abs/1809.03327, 2018.
