Virtual Worlds as Proxy for Multi-Object Tracking Analysis
Adrien Gaidon (1,*), Qiao Wang (2,*), Yohann Cabon (1), Eleonora Vig (1,†)
1 Computer Vision Group, Xerox Research Centre Europe, France
2 School of Electrical, Computer, and Energy Engineering and School of Arts, Media, and Engineering, Arizona State University, USA
{adrien.gaidon,yohann.cabon}@xrce.xerox.com, qiao.wang@asu.edu, eleonora.vig@dlr.de
http://www.xrce.xerox.com/Research-Development/Computer-Vision/Proxy-Virtual-Worlds
* AG and QW contributed equally. † EV is currently at the German Aerospace Center.

Abstract

Modern computer vision algorithms typically require expensive data acquisition and accurate manual labeling. In this work, we instead leverage the recent progress in computer graphics to generate fully labeled, dynamic, and photo-realistic proxy virtual worlds. We propose an efficient real-to-virtual world cloning method, and validate our approach by building and publicly releasing a new video dataset, called "Virtual KITTI" (available at the URL above), automatically labeled with accurate ground truth for object detection, tracking, scene and instance segmentation, depth, and optical flow. We provide quantitative experimental evidence suggesting that (i) modern deep learning algorithms pre-trained on real data behave similarly in real and virtual worlds, and (ii) pre-training on virtual data improves performance. As the gap between real and virtual worlds is small, virtual worlds enable measuring the impact of various weather and imaging conditions on recognition performance, all other things being equal. We show that these factors may drastically affect otherwise high-performing deep models for tracking.

1. Introduction

Although cheap or even no annotations might be used at training time via weakly-supervised (resp. unsupervised) learning, experimentally evaluating the generalization performance and robustness of a visual recognition model requires accurate full labeling of large representative datasets. This is, however, challenging in practice for video understanding tasks like multi-object tracking (MOT), because of the high data acquisition and labeling costs that limit the quantity and variety of existing video benchmarks. For instance, the KITTI [1] multi-object tracking benchmark contains only 29 test sequences captured in similar good conditions and from a single source.

Figure 1: Top: a frame of a video from the KITTI multi-object tracking benchmark [1]. Middle: the corresponding rendered frame of the synthetic clone from our Virtual KITTI dataset with automatic tracking ground truth bounding boxes. Bottom: automatically generated ground truth for optical flow (left), scene- and instance-level segmentation (middle), and depth (right).

To the best of our knowledge, none of the existing benchmarks in computer vision contain the minimum variety required to properly assess the performance of video analysis algorithms: varying conditions (day, night, sun, rain, ...), multiple detailed object class annotations (persons, cars, license plates, ...), and different camera settings, among many other factors. Using synthetic data should in theory enable full control of the data generation pipeline, hence ensuring lower costs, greater flexibility, and limitless variety and quantity.
In this work, we leverage the recent progress in computer graphics (especially off-the-shelf tools like game engines) and commodity hardware (especially GPUs) to generate photo-realistic virtual worlds used as proxies to assess the performance of video analysis algorithms.

Our first contribution is a method to generate large, photo-realistic, varied datasets of synthetic videos, automatically and densely labeled for various video understanding tasks. Our main novel idea consists in creating virtual worlds not from scratch, but by cloning a few seed real-world video sequences. Using this method, our second and main contribution is the creation of the new Virtual KITTI dataset (cf. Figure 1), which at the time of publication contains 35 photo-realistic synthetic videos (5 cloned from the original real-world KITTI tracking benchmark [1], coupled with 7 variations each) for a total of approximately 17,000 high-resolution frames, all with automatic, accurate ground truth for object detection, tracking, depth, and optical flow, as well as scene and instance segmentation at the pixel level.

Our third contribution consists in quantitatively measuring the usefulness of these virtual worlds as proxies for multi-object tracking. We first propose a practical definition of transferability of experimental observations across real and virtual worlds. Our protocol rests on the comparison of real-world seed sequences with their corresponding synthetic clones using real-world pre-trained deep models (in particular Fast R-CNN [2]), hyper-parameter calibration via Bayesian optimization [3], and the analysis of task-specific performance metrics [4]. Second, we validate the usefulness of our virtual worlds for learning deep models by showing that virtual pre-training followed by real-world fine-tuning outperforms training only on real-world data. Our experiments therefore suggest that the recent progress in computer graphics technology allows one to easily build virtual worlds that are indeed effective proxies of the real world from a computer vision perspective.

Our fourth contribution builds upon this small virtual-to-real gap to measure the potential impact on recognition performance of varied weather conditions (like fog), lighting conditions, and camera angles, all other things being equal, something impractical or even impossible in real-world conditions. Our experiments show that these variations may significantly deteriorate the performance of normally high-performing models trained on large real-world datasets. This lack of generalization highlights the importance of open research problems like unsupervised domain adaptation and building more varied training sets, to move further towards applying computer vision in the wild.

The paper is organized as follows. Section 2 reviews related work on using synthetic data for computer vision. Section 3 describes our approach to build virtual worlds in general and Virtual KITTI in particular. Section 4 reports our multi-object tracking experiments using strong deep learning baselines (Section 4.1) to assess the transferability of observations across the real-to-virtual gap (Section 4.2), the benefits of virtual pre-training (Section 4.3), and the impact of various weather and imaging conditions on recognition performance (Section 4.4). We conclude in Section 5.
2. Related Work

Several works investigate the use of 3D synthetic data to tackle standard 2D computer vision problems such as object detection [5], face recognition, scene understanding [6], and optical flow estimation [7]. From early on, computer vision researchers leveraged 3D computer simulations to model articulated objects including human shape [8], face, and hand appearance [9], or even for scene interpretation and vision as inverse graphics [10, 11, 12]. However, these methods typically require controlled virtual environments, are tuned to constrained settings, and require the development of task-specific graphics tools. In addition, the lack of photorealism creates a significant domain gap between synthetic and real-world images, which in turn might render synthetic data too simplistic to tune or analyze vision algorithms [13].

The degree of photorealism allowed by the recent progress in computer graphics and modern high-level generic graphics platforms enables a more widespread use of synthetic data generated under less constrained settings. First attempts to use synthetic data for training were mainly limited to rough synthetic models or synthesized real examples (e.g., of pedestrians [14, 15]). In contrast, Marín et al. [16, 17, 18] went further and positively answered the intriguing question of whether one can learn appearance models of pedestrians in a virtual world and use the learned models for detection in the real world. A related approach is described in [19], but for scene- and scene-location-specific detectors with fixed calibrated surveillance cameras and a priori known scene geometry. In the context of video surveillance too, [20] proposes a virtual simulation test bed for system design and evaluation. Several other works use 3D CAD models for more general object pose estimation [21, 22] and detection [23, 24].

Only few works use photo-realistic imagery for evaluation purposes, and in most cases these works focus on low-level image and video processing tasks. Kaneva et al. [25] evaluate low-level image features, while Butler et al. [26] propose a synthetic benchmark for optical flow estimation: the popular MPI Sintel Flow Dataset. The recent work of Chen et al. [27] is another example for basic building blocks of autonomous driving. These approaches view photo-realistic imagery as a way of obtaining ground truth that cannot be easily obtained otherwise (e.g., optical flow). When ground truth can be collected, for instance via crowd-sourcing, real-world imagery is often preferred over synthetic data because of the artifacts the latter might introduce.

In this paper, we show that such issues can be partially circumvented using our approach, in particular for high-level video understanding tasks for which ground-truth data is tedious to collect. We believe current approaches face two major limitations that prevent broadening the scope of virtual data. First, the data generation is itself costly and time-consuming, as it often requires creating animation movies from scratch. This also limits the quantity of data that can be generated. An alternative consists in recording scenes from humans playing video games [16], but this faces similar time costs, and further restricts the variety of the generated scenes.
The second limitation lies in the usefulness of synthetic data as a proxy to assess real-world performance on high-level computer vision tasks, including object detection and tracking. It is indeed difficult to evaluate how conclusions obtained from virtual data could be applied to the real world in general.

Due to these limitations, only few of the previous works have so far exploited the full potential of virtual worlds: the possibility to generate endless quantities of varied video sequences on the fly. This would be especially useful in order to assess model performance, which is crucial for real-world deployment of computer vision applications. In this paper, we propose steps towards achieving this goal by addressing two main challenges: (i) automatic generation of arbitrary photo-realistic video sequences with ground truth by scripting modern game engines, and (ii) assessing the degree of transferability of experimental conclusions from synthetic data to the real world.

3. Generating Proxy Virtual Worlds

Our approach consists in five steps detailed in the following sections: (i) the acquisition of a small amount of real-world data as a starting point for calibration (Section 3.1), (ii) the "cloning" of this real-world data into a virtual world (Section 3.2), (iii) the automatic generation of modified synthetic sequences with different weather or imaging conditions (Section 3.3), (iv) the automatic generation of detailed ground truth annotations (Section 3.4), and (v) the quantitative evaluation of the "usefulness" of the synthetic data (Section 3.5). We describe both the method and the particular choices made to generate our Virtual KITTI dataset.

3.1. Acquiring real-world (sensor) data

The first step of our approach consists in the acquisition of a limited amount of seed data from the real world for the purpose of calibration. Two types of data need to be collected: videos of real-world scenes, and physical measurements of important objects in the scene, including the camera itself. The quantity of data required by our approach is much smaller than what is typically needed for training or validating current computer vision models, as we do not require a reasonable coverage of all possible scenarios of interest. Instead, we use a small fixed set of core real-world video sequences to initialize our virtual worlds, which in turn allows one to generate many varied synthetic videos. Furthermore, this initial seed real-world data results in higher quality virtual worlds (i.e., closer to real-world conditions) and allows us to quantify their usefulness for deriving conclusions that are likely to transfer to real-world settings.

Figure 2: Frames from 5 real KITTI videos (left, sequences 1, 2, 6, 18, 20 from top to bottom) and their rendered virtual clones (right).

In our experiments, we use the KITTI dataset [1] to initialize our virtual worlds. This standard public benchmark was captured from a car driving in the German city of Karlsruhe, mostly under sunny conditions. The sensors used to capture data include gray-scale and color cameras, a 3D laser scanner, and an inertial and GPS navigation system. From the point clouds captured by the 3D laser scanner, human annotators labeled 3D and 2D bounding boxes of several types of objects including cars and pedestrians. In our experiments we only consider cars as objects of interest, for simplicity and because they are the main object category in KITTI. The annotation data include the positions and sizes of cars, and their rotation angles about the vertical axis (yaw rotation). The movement of the camera itself was recorded via GPS (latitude, longitude, altitude) and its orientation (roll, pitch, yaw) via a GPS/IMU sensor, which has a fixed spatial relationship with the cameras.
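To make the structure of this seed data concrete, the following minimal Python sketch shows one plausible per-frame record combining the camera pose from the GPS/IMU track with the annotated 3D car boxes. The field names and container types are illustrative assumptions, not the actual KITTI or Virtual KITTI file formats.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class CameraPose:
        # GPS/IMU measurements for one frame (world frame)
        latitude: float
        longitude: float
        altitude: float
        roll: float    # radians
        pitch: float   # radians
        yaw: float     # radians

    @dataclass
    class CarAnnotation:
        track_id: int
        # 3D box center expressed in the camera frame (meters)
        x: float
        y: float
        z: float
        # 3D box dimensions (meters) and heading about the vertical axis
        height: float
        width: float
        length: float
        yaw: float

    @dataclass
    class SeedFrame:
        frame_index: int
        camera: CameraPose
        cars: List[CarAnnotation]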
3.2. Generating synthetic clones

The next step of our approach consists in semi-automatically creating photo-realistic dynamic 3D virtual worlds in which virtual camera paths follow those of the real-world seed sequences, to generate outputs we call synthetic video clones, which closely resemble the real-world data. To build Virtual KITTI, we select five training videos from the original KITTI MOT benchmark as "real-world seeds" to create our virtual worlds (cf. Figure 2): 0001 (crowded urban area), 0002 (road in urban area then busy intersection), 0006 (stationary camera at a busy intersection), 0018 (long road in the forest with challenging imaging conditions and shadows), and 0020 (highway driving scene).

We decompose a scene into different visual components, with which off-the-shelf computer graphics engines (e.g., game engines) and graphic assets (e.g., geometric and material models) can be scripted to reconstruct the scene. We use the commercial computer graphics engine Unity (http://unity3d.com) to create virtual worlds that closely resemble the original ones in KITTI. This engine has a strong community that has developed many "assets" publicly available on Unity's Asset Store. These assets include realistic 3D models and materials of objects. This allows for efficient crowd-sourcing of most of the manual labor in the initial setup of our virtual worlds, making the creation of each virtual world efficient (approximately one person-day in our experiments).

The positions and orientations of the objects of interest in the 3D virtual world are calculated from their positions and orientations relative to the camera, and from the position and orientation of the camera itself, both available from the acquired real-world data in the case of KITTI. The main roads are also placed according to the camera position, with minor manual adjustments in special cases (e.g., the road changing width). To build the Virtual KITTI dataset, we manually place secondary roads and other background objects such as trees and buildings in the virtual world, both for simplicity and because of the lack of position data for them. Note that this could be automated using visual SLAM or semantic segmentation.
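The world-frame placement described above is essentially a composition of rigid transforms. The following NumPy sketch illustrates the idea under simplifying assumptions (a shared right-handed coordinate convention, and a camera pose already converted to a rotation matrix and position vector); it is not the actual Virtual KITTI scripting code, which operates inside Unity.

    import numpy as np

    def euler_to_rotation(roll, pitch, yaw):
        """Rotation matrix from roll/pitch/yaw (radians), composed as Rz(yaw) Ry(pitch) Rx(roll)."""
        cr, sr = np.cos(roll), np.sin(roll)
        cp, sp = np.cos(pitch), np.sin(pitch)
        cy, sy = np.cos(yaw), np.sin(yaw)
        Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
        Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
        Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
        return Rz @ Ry @ Rx

    def object_pose_in_world(R_cam, t_cam, t_obj_cam, yaw_obj_cam):
        """Place an annotated object in the virtual world.

        R_cam, t_cam: camera rotation (3x3) and position (3,) in the world frame.
        t_obj_cam: object center in the camera frame (3,).
        yaw_obj_cam: object heading about the vertical axis, in the camera frame.
        """
        t_obj_world = R_cam @ t_obj_cam + t_cam
        R_obj_world = R_cam @ euler_to_rotation(0.0, 0.0, yaw_obj_cam)
        return R_obj_world, t_obj_world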
A directional light source together with ambient light simulates the sun. Its direction and intensity are set manually by comparing the brightness and the shadows in the virtual and real-world scenes, a simple process that only takes a few minutes per world in our experiments.

3.3. Changing conditions in synthetic videos

After the 3D virtual world is created, we can automatically generate not only the clone synthetic video, but also videos with changed components. This allows for the quantitative study of the impact of single factors ("ceteris paribus analysis"), including rare events or difficult-to-observe conditions that might occur in practice ("what-if analysis"). The conditions that can be changed to generate new synthetic videos include (but are not limited to): (i) the number, trajectories, or speeds of cars, (ii) their sizes, colors, or models, (iii) the camera position, orientation, and path, and (iv) the lighting and weather conditions. All components can be randomized or modified "on demand" by changing parameters in the scripts, or by manually adding, modifying, or removing elements in the scene.

To illustrate some of the vast possibilities, Virtual KITTI includes some simple changes to the virtual world that translate into complex visual changes, which would otherwise require the costly process of re-acquiring and re-annotating data in the real world. First, we turned the camera to the right and then to the left, which leads to a considerable change in the appearance of the cars. Second, we changed lighting conditions to simulate different times of the day: early morning and before sunset. Third, we used special effects and a particle system together with changed lighting conditions to simulate different weather: overcast, fog, and heavy rain. See Figure 3 for an illustration.

Figure 3: Simulated conditions. From top left to bottom right: clone, camera rotated to the right by 15 degrees, to the left by 15 degrees, "morning" and "sunset" times of day, overcast weather, fog, and rain.

3.4. Generating ground-truth annotations

As stated above, ground-truth annotations are essential for computer vision algorithms. In the KITTI dataset, the 2D bounding boxes used for evaluation were obtained from human annotators by drawing rectangular boxes on the video frames and manually labeling the truncation and occlusion states of objects. This common practice is, however, costly, does not scale to large volumes of videos and pixel-level ground truth, and incorporates varying degrees of subjectiveness and inconsistency. For example, the bounding boxes are usually slightly larger than the cars, and the margins differ from one car to another and from one annotator to another. The occlusion state ("fully visible", "partly occluded", or "largely occluded") is also subjective, and the underlying criterion may differ from case to case, yielding many important edge cases (occluded and truncated cars) with inconsistent labels.

In contrast, our approach can automatically generate accurate and consistent ground-truth annotations accompanying the synthetic video outputs, and this algorithm-based approach allows richer (e.g., pixel-level) and more consistent results than those obtained from human annotators. We render each moment of the scene four times. First, we do the photo-realistic rendering of the clone scene by leveraging the modern rendering engine of Unity. Second, the depth map is rendered by using the information stored in the depth buffer. Third, the per-pixel category- and instance-level ground truth is efficiently and directly generated by using unlit shaders on the materials of the objects. These modified shaders output a color which is not affected by the lighting and shading conditions, and a unique color ID is assigned to each object of interest (cf. Figure 4; a small sketch of this ID encoding is given below). Fourth, we compute the dense optical flow between the previous and the current frames by sending all Model, View, and Projection matrices for each object to a vertex shader, and interpolate the flow of each pixel using a fragment shader.

Figure 4: Rendered frame (left) and automatically generated scene- and instance-level segmentation ground truth (right) for two modified conditions: camera horizontally rotated to the left (top), rain (bottom).

Note that these multiple renderings are an efficient strategy to generate pixel-level ground truth, as they effectively leverage shaders offloading parallel computations to GPUs (most of the computation time is used to swap materials). For Virtual KITTI, with a resolution of around 1242 × 375, the full rendering and ground truth generation pipeline for segmentation, depth, and optical flow runs at 5-8 FPS on a single desktop with commodity hardware.
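As a rough illustration of the unique-color-ID idea (not the actual Unity shaders or the released Virtual KITTI encoding), one can map an integer instance ID to an RGB color and invert the mapping when reading the rendered segmentation image back:

    import numpy as np

    def id_to_color(instance_id):
        """Encode an integer instance ID into a unique 8-bit RGB triplet."""
        return ((instance_id >> 16) & 255, (instance_id >> 8) & 255, instance_id & 255)

    def color_to_id(rgb_image):
        """Decode a rendered H x W x 3 uint8 segmentation image back to instance IDs."""
        rgb = rgb_image.astype(np.uint32)
        return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]

    def instance_masks(rgb_image):
        """Return a {instance_id: boolean mask} dictionary, ignoring the background (ID 0)."""
        ids = color_to_id(rgb_image)
        return {int(i): ids == i for i in np.unique(ids) if i != 0}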
We generate 2D multi-object tracking ground truth by (i) computing the perspective projection of the 3D object bounding boxes from world coordinates onto the camera plane (clipping to the image boundaries in the case of truncated objects), (ii) associating the bounding boxes with their corresponding object IDs to differentiate object instances, and (iii) adding truncation and occlusion meta-data as described below. The truncation rate is approximated by dividing the volume of an object's 3D bounding box by the volume of the 3D bounding box of its visible part (computed by intersecting the original bounding box with the camera frustum planes). We also estimate the 2D occupancy rate of an object by dividing the number of ground-truth pixels in its segmentation mask by the area of the projected 2D bounding box, which includes the occluder, as it results from the perspective projection of the full 3D bounding box of the object. In the special case of fog, we additionally compute the visibility of each object from the fog formula used to generate the effect. To have comparable experimental protocols and reproducible ground truth criteria across real and virtual KITTI, we remove manually annotated "DontCare" areas from the original KITTI training ground truth (i.e., they may count as false alarms), and ignore all cars smaller than 25 pixels or heavily truncated / occluded during evaluation (as described in [1]). We set per-sequence global thresholds on the occupancy and truncation rates of virtual objects to be as close as possible to the original KITTI annotations.

3.5. Assessing the usefulness of virtual worlds

In addition to our data generation and annotation methods, a key novel aspect of our approach consists in the assessment of the usefulness of the generated virtual worlds for computer vision tasks. It is a priori unclear whether and when using photo-realistic synthetic videos is indeed a valid alternative to real-world data for computer vision algorithms. The transferability of conclusions obtained on synthetic data is likely to depend on many factors, including the tools used (especially graphics and physics engines), the quality of implementation (e.g., the degree of photo-realism and details of environments and object designs or animations), and the target video analysis tasks. Although using synthetic training data is common practice in computer vision, we are not aware of related works that systematically study the reverse, i.e., using real-world training data, which can be noisy or weakly labeled, and synthetic test data, which must be accurately labeled and where, therefore, synthetic data has obvious benefits.

To assess robustly whether the behavior of a recognition algorithm is similar in real and virtual worlds, we propose to compare its performance on the initial "seed" real-world videos and their corresponding virtual world clones. We compare multiple metrics of interest (depending on the target recognition task) obtained with fixed hyper-parameters that maximize recognition performance on both the real and virtual videos, while simultaneously minimizing the performance gap. In the case of MOT, we use Bayesian hyper-parameter optimization [3] to find fixed tracker hyper-parameters for each pair of real and clone videos. We use as objective function the sum of the multi-object tracking accuracies (MOTA [4]) over the original real-world videos and their corresponding virtual clones, minus their absolute differences, normalized by the mean absolute deviations of all other normalized CLEAR MOT metrics [4].
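Under one possible reading of this objective, and in our own notation (not reproduced from the implementation), writing r for a seed real video, v(r) for its clone, and theta for the tracker hyper-parameters, the calibration can be sketched in LaTeX as:

    \mathrm{MOTA} = 1 - \frac{\sum_t \big(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\big)}{\sum_t \mathrm{GT}_t},
    \qquad
    \theta_r^{\star} = \arg\max_{\theta}\;
      \Big[\, \mathrm{MOTA}_{r}(\theta) + \mathrm{MOTA}_{v(r)}(\theta)
      - \frac{\big|\mathrm{MOTA}_{r}(\theta) - \mathrm{MOTA}_{v(r)}(\theta)\big|}{\mathrm{MAD}_{r}(\theta)} \,\Big],

where MOTA follows the standard CLEAR MOT definition [4] (FN: false negatives, FP: false positives, IDSW: identity switches, GT: ground-truth objects per frame), and MAD_r(theta) denotes the mean absolute deviation, between r and v(r), of the other normalized CLEAR MOT metrics.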
This allows us to quantitatively and objectively measure the impact of the virtual world design, the degree of photo-realism, and the quality of other rendering parameters on the algorithm performance metrics of interest. Note that this simple technique is a direct benefit of our virtual world generation scheme based on synthetically cloning a small set of real-world sensor data. Although the comparisons depend on the tasks of interest, it is also possible to complement task-specific metrics with more general measures of discrepancy and domain mismatch [28].

Finally, note that our protocol is complementary to the more standard approach consisting of using synthetic training data and real-world test data. Therefore, in our experiments with Virtual KITTI we investigate both methods to assess the usefulness of virtual data: learning models in the virtual world and applying them in the real world, and evaluating real-world pre-trained models in both virtual and real worlds.

4. Experiments

In this section, we first describe the MOT models used in our experiments. We then report results regarding the differences between the original real-world KITTI videos and our Virtual KITTI clones. We then report our experiments on learning models in virtual worlds and applying them to real-world data. Finally, we conclude with experiments to measure the impact of camera, lighting, and weather on the recognition performance of real-world pre-trained MOT algorithms.

4.1. Strong Deep Learning Baselines for MOT

Thanks to the recent progress on object detection, association-based tracking-by-detection in monocular video streams is particularly successful and widely used for MOT [29, 30, 31, 32, 33, 34, 35, 36, 37, 38] (see [39] for a recent review). These methods consist in building tracks by linking object detections through time. In our experiments, the detector we use is the recent Fast R-CNN object detector from Girshick [2] combined with the efficient Edge Boxes proposals [40]. In all experiments (except for the virtual training ones), we follow the experimental protocol of [2] to learn a powerful VGG16-based Fast R-CNN car detector by fine-tuning successively from ImageNet, to Pascal VOC 2007 cars, to the KITTI object detection benchmark training images (http://www.cvlibs.net/datasets/kitti/eval_object.php).

To use this detector for association-based MOT, we consider two trackers. The first is based on the principled network flow algorithm of [41, 30], which does not require video training data. The maximum a posteriori (MAP) data association problem can indeed be elegantly formalized as a special integer linear program (ILP) whose global optimum can be found efficiently using max-flow min-cost network flow algorithms [41, 30].
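For context, a compact form of this ILP, roughly following the notation of [41, 30] (the exact cost definitions below are illustrative rather than those of our tracker), is:

    \min_{f}\; \sum_i C^{\mathrm{en}}_i f^{\mathrm{en}}_i
             + \sum_i C_i\, f_i
             + \sum_{(i,j)} C_{ij}\, f_{ij}
             + \sum_i C^{\mathrm{ex}}_i f^{\mathrm{ex}}_i
    \quad \text{s.t.} \quad
    f^{\mathrm{en}}_i + \sum_j f_{ji} \;=\; f_i \;=\; f^{\mathrm{ex}}_i + \sum_j f_{ij},
    \qquad
    f^{\mathrm{en}}_i, f^{\mathrm{ex}}_i, f_i, f_{ij} \in \{0, 1\},

where f_i indicates that detection i is kept, f_{ij} links detection i to a detection j in a later frame, f^en_i and f^ex_i start and end a track, and the unary costs C_i are typically negative log-likelihood ratios so that confident detections are rewarded. The flow-conservation constraints make the problem solvable exactly with min-cost flow algorithms.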
In particular, the dynamic programming min-cost flow (DP-MCF) algorithm of Pirsiavash et al. [30] is well-founded and particularly efficient. Although it obtains poor results on the KITTI MOT benchmark [42], it can be vastly improved by (i) using a better detector, (ii) replacing the binary pairwise costs in the network with intersection-over-union costs, and (iii) allowing for multiple time-skip connections in the network to better handle missed detections. Our DP_MCF_RCNN tracker reaches 57% MOTA on the KITTI MOT evaluation server [42], improving by +20% w.r.t. the original DP_MCF [30]. Note that this baseline tracker could be further improved, as shown recently by Wang and Fowlkes [36]. Their method indeed obtains 77% MOTA with a related algorithm, thanks to better appearance and motion modeling coupled with structured SVMs to learn hyper-parameters on training videos.
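The intersection-over-union used in the pairwise linking costs can be computed as in the generic sketch below, with boxes given as (x1, y1, x2, y2); the exact cost shaping and the time-skip handling of our tracker are not shown and the shaping here is only illustrative.

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def link_cost(box_a, box_b):
        """Pairwise linking cost: higher overlap gives a lower (more negative) cost."""
        return -iou(box_a, box_b)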
The second tracker we consider is the recent state-of-the-art Markov Decision Process (MDP) method of Xiang et al. [38]. It relies on reinforcement learning to learn a policy for data association from ground-truth training tracks. This method reaches 76% MOTA on the KITTI MOT test set using ConvNet-based detections. In our experiments requiring a pre-trained tracker, we learned the MDP parameters on the following 12 real-world KITTI training videos: 0000, 0003, 0004, 0005, 0007, 0008, 0009, 0010, 0011, 0012, 0014, 0015. (The remaining videos are either the seed sequences used to create the virtual worlds, or sequences containing no or very few cars.)

4.2. Transferability across the Real-to-Virtual Gap

Table 1 contains the multi-object tracking performance of our DP-MCF and MDP trackers on the Virtual KITTI clone videos and their original KITTI counterparts, following the protocol described in Section 3.5. See Figure 5 for some tracking visualizations.

According to the MOTA metric, which summarizes all aspects of MOT, the real-to-virtual performance gap is minimal for all real sequences and their respective virtual clones and for all trackers, and is below 0.5% on average for both trackers. All other metrics also show a limited gap. Consequently, the visual similarity of the sequences and the comparable performance and behavior of the trackers across real-world videos and their virtual-world counterparts suggest that similar causes in the real and virtual worlds are likely to produce similar effects in terms of recognition performance. The amount of expected "transferability of conclusions" from real to virtual and back can be quantified by the differences in the metrics reported in Table 1.

Figure 5: Predicted tracks on matching frames of two original videos (top) and their synthetic clones (bottom) for both DP-MCF (left) and MDP (right). Note the visual similarity of both the scenes and the tracks. Most differences are on occluded, small, or truncated objects.

DP-MCF   MOTA↑   MOTP↑   MT↑     ML↓     I↓    F↓    P↑      R↑
0001     73.9%   86.8%   73.3%   4.0%    38    52    93.1%   85.5%
v0001    73.1%   82.0%   58.2%   5.1%    30    47    98.4%   79.5%
0002     72.5%   84.1%   54.5%   27.3%   4     20    99.6%   75.5%
v0002    74.0%   78.7%   50.0%   20.0%   5     16    98.6%   77.9%
0006     88.3%   85.6%   90.9%   0.0%    1     12    98.8%   90.6%
v0006    88.3%   83.6%   100.0%  0.0%    3     7     94.6%   95.4%
0018     93.0%   87.2%   82.4%   0.0%    1     7     95.2%   98.7%
v0018    93.7%   73.0%   66.7%   0.0%    2     16    99.9%   94.4%
0020     81.0%   84.8%   68.6%   4.7%    88    150   94.4%   90.0%
v0020    82.0%   77.4%   44.8%   14.6%   67    142   99.3%   86.1%
AVG      81.7%   85.7%   73.9%   7.2%    26    48    96.2%   88.1%
v-AVG    82.2%   78.9%   63.9%   7.9%    21    45    98.2%   86.7%

MDP      MOTA↑   MOTP↑   MT↑     ML↓     I↓    F↓    P↑      R↑
0001     81.8%   85.3%   78.7%   13.3%   5     6     91.1%   92.5%
v0001    82.8%   81.9%   63.3%   13.9%   1     10    98.7%   85.8%
0002     80.7%   82.2%   63.6%   27.3%   0     1     99.0%   82.5%
v0002    81.1%   81.8%   60.0%   20.0%   0     2     98.4%   83.4%
0006     91.3%   84.3%   72.7%   9.1%    0     3     99.7%   92.3%
v0006    91.3%   84.4%   81.8%   9.1%    1     2     99.9%   92.0%
0018     91.1%   87.0%   52.9%   35.3%   1     1     96.7%   95.2%
v0018    90.9%   74.9%   44.4%   33.3%   0     0     99.1%   92.4%
0020     84.4%   85.1%   58.1%   25.6%   14    24    96.7%   88.7%
v0020    84.0%   79.4%   52.1%   34.4%   1     9     99.3%   85.6%
AVG      85.9%   84.8%   65.2%   22.1%   4     7     96.7%   90.3%
v-AVG    86.0%   80.5%   60.3%   22.1%   0     4     99.1%   87.9%

Table 1: DP-MCF (top) and MDP (bottom) MOT results on the original real-world KITTI training videos and their virtual-world video "clones" (prefixed by a "v"). AVG (resp. v-AVG) is the average over the real (resp. virtual) sequences. We report the CLEAR MOT metrics [4] – including MOT Accuracy (MOTA), MOT Precision (MOTP), ID Switches (I), and Fragmentation (F) – complemented by the Mostly Tracked (MT) and Mostly Lost (ML) ratios, as well as our detector's precision (P) and recall (R).

The most different metrics are the MOTP (average intersection-over-union of correct tracks with the matching ground truth) and the fraction of Mostly Tracked (MT) objects (fraction of ground-truth objects tracked at least 80% of the time), which are both generally lower in the virtual world. The main factor explaining this gap lies in the inaccurate and inconsistent manual annotations of the frequent "corner cases" in the real world (heavy truncation or occlusion, which in the original KITTI benchmark is sometimes labeled as "DontCare", ignored, or considered as true positives, depending on the annotator). In contrast, our Virtual KITTI ground truth is not subjective, but automatically determined by thresholding the aforementioned computed occupancy and truncation rates. This discrepancy is illustrated in Figure 5, and explains the small drop in recall for sequences 0001, 0018, and 0020 (which contain many occluded and truncated objects). Note, however, that the Fast R-CNN detector achieves similar F1 performance between real and virtual worlds, so this drop in recall is generally compensated by an increase in precision.

4.3. Virtual Pre-Training

As mentioned previously, our method to quantify the gap between real and virtual worlds from the perspective of computer vision algorithms is complementary to the more widely-used approach of leveraging synthetic data to train models applied in real-world settings. Therefore, we additionally conduct experiments to measure the usefulness of Virtual KITTI for training MOT algorithms.
We evaluated three different scenarios: (i) training only on the 5 real KITTI seed sequences (configuration 'r'), (ii) training only on the corresponding 5 Virtual KITTI clones (configuration 'v'), and (iii) training first on the Virtual KITTI clones and then fine-tuning on the real KITTI sequences, a special form of virtual data augmentation we call virtual pre-training (configuration 'v→r'). We split the set of real KITTI sequences not used during training in two: (i) a test set of 7 long, diverse videos (4, 5, 7, 8, 9, 11, 15) to evaluate performance, and (ii) a validation set of 5 short videos (0, 3, 10, 12, 14) used for hyper-parameter tuning. The Fast R-CNN detector was always pre-trained on ImageNet. The MDP association model is trained from scratch using reinforcement learning as described in [38].

             MOTA↑   MOTP↑   MT↑     ML↓     I↓   F↓   P↑      R↑
DP-MCF v     64.3%   75.3%   35.9%   31.5%   0    15   96.6%   71.0%
DP-MCF r     71.9%   79.2%   45.0%   24.4%   5    17   98.0%   76.5%
DP-MCF v→r   76.7%   80.9%   53.2%   12.3%   7    27   98.3%   81.1%
MDP v        63.7%   75.5%   35.9%   36.9%   5    12   96.0%   70.6%
MDP r        78.1%   79.2%   60.7%   22.0%   3    9    97.3%   82.5%
MDP v→r      78.7%   80.0%   51.7%   19.4%   5    10   98.3%   82.6%

Table 2: DP-MCF and MDP MOT results on seven held-out original real-world KITTI train videos (4, 5, 7, 8, 9, 11, 15), obtained by learning the models on (r) the five real seed KITTI videos (1, 2, 6, 18, 20), (v) the corresponding five Virtual KITTI clones, and (v→r) by successively training on the virtual clones and then on the real sequences (virtual pre-training). See Table 1 for details about the metrics.

Table 2 reports the average MOT metrics on the aforementioned real test sequences for all trackers trained with all configurations. Although training only on virtual data is not enough, we can see that the best results are obtained with configuration v→r. Therefore, virtual pre-training improves performance, which further confirms the usefulness of virtual worlds for high-level computer vision tasks. The improvement is particularly significant for the DP-MCF tracker, less so for the MDP tracker. MDP can indeed better handle missed detections and works in the high-precision regime of the detector (the best minimum detector score threshold found on the validation set is around 95%), which is not strongly improved by the virtual pre-training. On the other hand, DP-MCF is more robust to false positives but requires more recall (validation score threshold around 60%), which is significantly improved by virtual pre-training. In all cases, we found that validating an early-stopping criterion (maximum number of SGD iterations) for the second fine-tuning stage of the v→r configuration is critical to avoid overfitting on the small real training set after pre-training on the virtual one.
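The validated early-stopping step of the v→r configuration amounts to a simple model selection over candidate fine-tuning budgets. The sketch below is a generic, hedged illustration of that selection loop; the training and evaluation routines are passed in as callables because the actual scripts are not part of this paper.

    from typing import Callable, List, Tuple

    def select_fine_tuning_budget(train: Callable[[object, List[str], int], object],
                                  evaluate_mota: Callable[[object, List[str]], float],
                                  pretrained_model: object,
                                  real_seed_videos: List[str],
                                  validation_videos: List[str],
                                  candidate_budgets: List[int]) -> Tuple[object, int]:
        """Pick the fine-tuning budget (max SGD iterations) maximizing validation MOTA."""
        best_model, best_budget, best_mota = None, None, float("-inf")
        for budget in candidate_budgets:
            # Fine-tune the virtually pre-trained model on the real seeds for this budget.
            model = train(pretrained_model, real_seed_videos, budget)
            mota = evaluate_mota(model, validation_videos)
            if mota > best_mota:
                best_model, best_budget, best_mota = model, budget, mota
        return best_model, best_budget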
4.4. Impact of Weather and Imaging Conditions

Table 3 contains the performance of our real-world pre-trained trackers (Section 4.1) in altered conditions generated either by modifying the camera position, or by using special effects to simulate different lighting and weather conditions. As the trackers are trained on consistent, ideal sunny conditions, all modifications negatively affect all metrics for all trackers. In particular, bad weather (e.g., fog) causes the strongest degradation of performance. This is expected, but difficult to quantify in practice without re-acquiring data in different conditions. It also suggests that the empirical generalization performance estimated on the limited set of KITTI test videos is an optimistic upper bound at best. Note that the MDP tracker suffers from stronger overfitting than DP-MCF, as suggested by its larger performance degradation under all conditions.

DP-MCF     MOTA↑    MOTP↑   MT↑      ML↓     I↓    F↓    P↑      R↑
clone      82.2%    78.9%   63.9%    7.9%    21    45    98.2%   86.7%
+15deg     -2.9%    -0.8%   -10.6%   6.3%    -18   -31   0.5%    -3.9%
-15deg     -8.1%    -0.6%   -6.9%    -1.9%   -8    -9    -3.4%   -3.7%
morning    -2.8%    -0.3%   -6.0%    1.7%    -2    -3    1.0%    -3.9%
sunset     -6.8%    -0.0%   -13.7%   3.6%    -2    0     -0.6%   -6.1%
overcast   -2.0%    -1.3%   -12.3%   0.8%    -3    -5    0.5%    -2.7%
fog        -45.2%   4.0%    -55.3%   33.3%   -17   -29   1.1%    -43.3%
rain       -7.8%    -0.4%   -18.8%   3.3%    -9    -6    1.2%    -8.6%

MDP        MOTA↑    MOTP↑   MT↑      ML↓     I↓    F↓    P↑      R↑
clone      86.0%    80.5%   60.3%    22.1%   0     4     99.1%   87.9%
+15deg     -5.9%    -0.3%   -7.4%    6.2%    0     0     0.1%    -5.4%
-15deg     -4.5%    -0.5%   -4.8%    5.7%    0     3     -0.5%   -4.0%
morning    -5.1%    -0.4%   -6.1%    3.1%    1     1     0.1%    -4.9%
sunset     -6.3%    -0.5%   -6.4%    4.3%    0     2     -0.3%   -5.5%
overcast   -4.0%    -1.0%   -7.2%    4.6%    0     0     -0.2%   -3.6%
fog        -57.4%   1.2%    -57.4%   40.7%   0     -2    -0.0%   -53.9%
rain       -12.0%   -0.6%   -15.3%   5.7%    1     3     -0.2%   -10.9%

Table 3: Impact of variations on MOT performance in Virtual KITTI for the DP-MCF (top) and MDP (bottom) trackers. We report the average performance on the virtual clones and the difference caused by the modified conditions, in order to measure the impact of several phenomena, all other things being equal. "+15deg" (resp. "-15deg") corresponds to a camera rotation of 15 degrees to the right (resp. left). "morning" corresponds to typical lighting conditions after dawn on a sunny day. "sunset" corresponds to slightly before night time. "overcast" corresponds to lighting conditions in overcast weather, which causes diffuse shadows and strong ambient lighting. "fog" is implemented via a volumetric formula, and "rain" is a simple particle effect ignoring the refraction of water drops on the camera.

5. Conclusion

In this work we introduce a new, fully annotated, photo-realistic synthetic video dataset called Virtual KITTI, built using modern computer graphics technology and a novel real-to-virtual cloning method. We provide quantitative experimental evidence suggesting that the gap between real and virtual worlds is small from the perspective of high-level computer vision algorithms, in particular deep learning models for multi-object tracking. We also show that these state-of-the-art models suffer from over-fitting, which causes performance degradation under simulated modified conditions (camera angle, lighting, weather). Our approach is, to the best of our knowledge, the only one that enables scientifically measuring the potential impact of these important phenomena on the recognition performance of a statistical computer vision model.

In future work, we plan to expand Virtual KITTI by adding more worlds, and by also including pedestrians, which are harder to animate. We also plan to explore and evaluate domain adaptation methods and larger-scale virtual pre-training or data augmentation to build more robust models for a variety of video understanding tasks, including multi-object tracking and scene understanding.

References

[1] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012. http://www.cvlibs.net/datasets/kitti/eval_tracking.php
[2] R. Girshick. Fast R-CNN. In ICCV, 2015.
[3] J. Bergstra, D. Yamins, and D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In ICML, 2013.
[4] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008.
[5] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Teaching 3D geometry to deformable part models. In CVPR, 2012.
[6] S. Satkin, J. Lin, and M. Hebert. Data-driven scene understanding from 3D models. In ECCV, 2012.
[7] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012.
[8] K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3D structure with a statistical image-based shape model. In ICCV, 2003.
[9] L. Ballan, A. Taneja, J. Gall, L. Van Gool, and M. Pollefeys. Motion capture of hands in action using discriminative salient points. In ECCV, 2012.
[10] P. W. Battaglia, J. B. Hamrick, and J. B. Tenenbaum. Simulation as an engine of physical scene understanding. PNAS, 110(45):18327-18332, 2013.
[11] V. Mansinghka, T. D. Kulkarni, Y. N. Perov, and J. Tenenbaum. Approximate Bayesian image interpretation using generative probabilistic graphics programs. In NIPS, 2013.
[12] T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. Mansinghka. Picture: A probabilistic programming language for scene perception. In CVPR, 2015.
[13] T. Vaudrey, C. Rabe, R. Klette, and J. Milburn. Differences between stereo and motion behaviour on synthetic and real-world stereo sequences. In IVCNZ, 2008.
[14] A. Broggi, A. Fascioli, P. Grisleri, T. Graf, and M. Meinecke. Model-based validation approaches and matching techniques for automotive vision based pedestrian detection. In CVPR Workshops, 2005.
[15] M. Stark, M. Goesele, and B. Schiele. Back to the future: Learning shape models from 3D CAD data. In BMVC, 2010.
[16] J. Marin, D. Vazquez, D. Geronimo, and A. M. Lopez. Learning appearance in virtual scenarios for pedestrian detection. In CVPR, 2010.
[17] D. Vazquez, A. M. Lopez, and D. Ponsa. Unsupervised domain adaptation of virtual and real worlds for pedestrian detection. In ICPR, 2012.
[18] D. Vazquez, A. Lopez, J. Marin, D. Ponsa, and D. Geronimo. Virtual and real world adaptation for pedestrian detection. PAMI, 2014.
[19] H. Hattori, V. Boddeti, K. M. Kitani, and T. Kanade. Learning scene-specific pedestrian detectors without real data. In CVPR, 2015.
[20] G. R. Taylor, A. J. Chosak, and P. C. Brewer. OVVV: Using virtual worlds to design and evaluate surveillance systems. In CVPR, 2007.
[21] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, 2014.
[22] P. P. Busto, J. Liebelt, and J. Gall. Adaptation of synthetic data for coarse-to-fine viewpoint refinement. In BMVC, 2015.
[23] B. Sun and K. Saenko. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In BMVC, 2014.
[24] X. Peng, B. Sun, K. Ali, and K. Saenko. Learning deep object detectors from 3D models. In ICCV, 2015.
[25] B. Kaneva, A. Torralba, and W. T. Freeman. Evaluation of image features using a photorealistic virtual world. In ICCV, 2011.
[26] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012.
[27] C. Chen, A. Seff, A. L. Kornhauser, and J. Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. Technical report, arXiv:1505.00256, 2015.
[28] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13, 2012.
[29] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Online multi-person tracking-by-detection from a single, uncalibrated camera. PAMI, 2011.
[30] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
[31] A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multi-target tracking. PAMI, 2014.
[32] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun. 3D traffic scene understanding from movable platforms. PAMI, 2014.
[33] D. Hall and P. Perona. Online, real-time tracking using a category-to-individual detector. In ECCV, 2014.
[34] R. T. Collins and P. Carr. Hybrid stochastic/deterministic optimization for tracking sports players and pedestrians. In ECCV, 2014.
[35] A. Gaidon and E. Vig. Online domain adaptation for multi-object tracking. In BMVC, 2015.
[36] S. Wang and C. Fowlkes. Learning optimal parameters for multi-target tracking. In BMVC, 2015.
[37] W. Choi. Near-online multi-target tracking with aggregated local flow descriptor. In ICCV, 2015.
[38] Y. Xiang, A. Alahi, and S. Savarese. Learning to track: Online multi-object tracking by decision making. In ICCV, 2015.
[39] W. Luo, X. Zhao, and T.-K. Kim. Multiple object tracking: A review. arXiv preprint arXiv:1409.7618, 2014.
[40] C. L. Zitnick and P. Dollár. Edge Boxes: Locating object proposals from edges. In ECCV, 2014.
[41] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, 2008.
[42] A. Geiger, P. Lenz, and R. Urtasun. KITTI MOT benchmark results. http://www.cvlibs.net/datasets/kitti/eval_tracking.php. [Online; accessed 2016-04-08].