Mono-Camera 3D Multi-Object Tracking Using Deep Learning Detections and PMBM Filtering


Authors: Samuel Scheidegger, Joachim Benjaminsson, Emil Rosenberg

Mono-Camera 3D Multi-Object Tracking Using Deep Learning Detections and PMBM Filtering

Samuel Scheidegger*†, Joachim Benjaminsson*†, Emil Rosenberg†, Amrit Krishnan*, Karl Granström†
*Zenuity, †Department of Electrical Engineering, Chalmers University of Technology
*{firstname.lastname}@zenuity.com, karl.granstrom@chalmers.se

Abstract—Monocular cameras are one of the most commonly used sensors in the automotive industry for autonomous vehicles. One major drawback of using a monocular camera is that it only makes observations in the two-dimensional image plane and cannot directly measure the distance to objects. In this paper, we aim at filling this gap by developing a multi-object tracking algorithm that takes an image as input and produces trajectories of detected objects in a world coordinate system. We solve this by using a deep neural network trained to detect and estimate the distance to objects from a single input image. The detections from a sequence of images are fed into a state-of-the-art Poisson multi-Bernoulli mixture (PMBM) tracking filter. The combination of the learned detector and the PMBM filter results in an algorithm that achieves 3D tracking using only mono-camera images as input. The performance of the algorithm is evaluated both in 3D world coordinates and 2D image coordinates, using the publicly available KITTI object tracking dataset. The algorithm shows the ability to accurately track objects, correctly handles data associations even when there is a big overlap of the objects in the image, and is one of the top performing algorithms on the KITTI object tracking benchmark. Furthermore, the algorithm is efficient, running on average close to 20 frames per second.

I. INTRODUCTION

To enable a high level of automation in driving, it is necessary to accurately model the surrounding environment, a problem called environment perception.
Data from onboard sensors, such as cameras, radars and lidars, has to be processed to extract information about the environment needed to automatically and safely navigate the vehicle. For example, information about both the static environment, such as road boundaries and lane information, and the dynamic objects, like pedestrians and other vehicles, is of importance. The focus of this paper is the detection and tracking of multiple dynamic objects, specifically vehicles. Dynamic objects are often modeled by state vectors, and are estimated over time using a multi-object tracking (MOT) framework. MOT denotes the problem of, given a set of noisy measurements, estimating both the number of dynamic objects and the state of each dynamic object. Compared to the single object tracking problem, in addition to handling measurement noise and detection uncertainty, the MOT problem also has to resolve problems like object birth and object death (when an object first appears within, and departs from, the ego-vehicle's surveillance area, respectively); clutter detections (false detections, i.e., detections not corresponding to an actual object); and unknown measurement origin. A recent family of MOT algorithms is based on random finite sets (RFSs) [1]. The probability hypothesis density (PHD) [2] filter and the cardinalized PHD (CPHD) [3] filter are two examples of moment approximations of the multi-object density. The generalized labeled multi-Bernoulli (GLMB) [4], [5] and the Poisson multi-Bernoulli mixture (PMBM) [6], [7] filters are examples of MOT filters based on multi-object conjugate priors; these filters have been shown to outperform filters based on moment approximation. A recent comparison study published in [8] has shown that the filter based on the PMBM conjugate prior both achieves greater tracking performance and has favourable computational cost compared to GLMB; hence we use the PMBM filter in this work.
All of the aforementioned MOT algorithms take sets of object estimates, or detections, as their input. This implies that the raw sensor data, e.g., the images, should be pre-processed into detections. The recent development of deep neural networks has led to big improvements in the field of image processing. Indeed, considerable improvements have been achieved for the object detection problem, see, e.g., [9], [10], which is crucial to the tracking performance. Convolutional neural networks (CNNs) [11] have been shown to vastly outperform previous methods in image processing for tasks such as classification, object detection and semantic segmentation. CNNs make use of the spatial relation between neighbouring pixels in images by processing data in a convolutional manner. Each layer in a CNN consists of a filter bank with a number of convolutional kernels, where each element is a learnable parameter. The most common approach for object detection using deep neural networks is region-based CNNs (R-CNNs). R-CNNs are divided into two parts: a region proposal network (RPN), followed by a box regression and classification network. The RPN takes an image as input and outputs a set of general object proposals, which are fed into the following classification and box regression network. The box regression and classification network refines the size of the object and classifies it into one of the object classes. This type of deep neural network structure is used in, e.g., Fast R-CNN [12], and later in the improved Faster R-CNN [9]. Another approach to the object detection problem is you only look once (YOLO) [10]. Here, the region proposal step is omitted, and the box regression and classification are applied directly on the entire image.
In the automotive industry, the monocular camera is a well-studied and commonly used type of sensor for developing autonomous driving systems. A monocular camera is a mapping between 3D world coordinates and 2D image coordinates [13] where, in contrast to, e.g., radars and lidars, distance information is lost. However, to achieve a high level of automation, tracking in the image plane is not adequate. Instead, we need to track objects in world coordinates in order to obtain the relative pose between the ego vehicle and the detected objects, information that is crucial for automatic decision making and control. We refer to this as 3D tracking. Previous work on object tracking using monocular camera data is restricted to tracking in the image plane, see, e.g., [14], [15], [16], for some recent work. The main contribution of this paper is a multi-vehicle 3D tracking algorithm that takes as input mono camera data and outputs vehicle estimates in world coordinates. The proposed MOT algorithm is evaluated using the image sequences from the publicly available KITTI tracking dataset [17], and the results show that accurate 3D tracking is achieved. The presented 3D tracking filter has two main components: a detector and an object tracking filter. The detector is a deep neural network trained to not only extract a 2D bounding box for each detected object from an input image, but also to estimate the distance from the camera to the object. This is achieved by using object annotations in lidar data during the learning of the network parameters. The object tracking filter is a state-of-the-art PMBM object tracking filter [6], [7] that processes the detections and outputs estimates. The tracking filter is computationally efficient, and handles both false detections and missed detections. For each object, a position, as well as kinematical properties such as velocity, are estimated. The paper is structured as follows.
In Section II, we give a problem formulation and present an overview of the algorithm. In Section III we present the object detection, and in Section IV we present the object tracking. The results of an experimental evaluation using data sequences from the KITTI dataset are presented in Section V, and the paper is concluded in Section VI.

II. PROBLEM FORMULATION AND ALGORITHM OVERVIEW

The KITTI object tracking dataset [17] contains data from multiple sensors, e.g., four cameras and a lidar sensor. The data from such sensors can be used for environment perception, i.e., tracking of moving objects and mapping of the stationary environment. In this work we focus on data from a forward looking camera, with the objective to track the other vehicles that are in the environment.

[Fig. 1: Algorithm overview. The MOT algorithm has two modules, detection (left, red) and tracking (right, blue/green). The tracking module consists of a recursive tracking filter (prediction + update) and an object extraction.]

Each vehicle is represented by a state vector x that contains the relevant information about the object. For 3D object tracking, the following state vector is used,

  x = [x, y, z, v_x, v_y, v_z, w, h]^T,   (1)

where (x, y, z) is the 3D position in world coordinates, (v_x, v_y, v_z) is the corresponding velocity, and (w, h) is the width and height of the object's bounding box in the camera image. The position and velocity describe the tracked object's properties of interest; the width and height of the bounding box are used for evaluation analogous to the KITTI object tracking benchmark [17]. The number of vehicles in the environment is not known, and changes with time, so the task is to estimate both the number of vehicles, as well as each vehicle's state.
The vehicles at time step k are represented by a set X_k that contains the state vectors of all vehicles that are present in the vicinity of the ego-vehicle. The set of vehicles X_k is modeled as a random finite set (RFS) [1]. That is, the number of objects, or the cardinality of the set, is modeled as a time-varying discrete random variable, and each object's state is a multivariate random variable. The problem addressed in this paper is the processing of the sequence of images I_k into a sequence of estimates X̂_{k|k} of the set of vehicles,

  I_0, I_1, ..., I_k  ⇒  X̂_{0|0}, X̂_{1|1}, ..., X̂_{k|k},   (2)

where the sub-indices denote time. In other words, we wish to process the image sequence to gain information at each time step about the number of vehicles (the cardinality of the set X), and the state of each vehicle. The proposed MOT algorithm has two main parts: object detection and object tracking; an illustration of the algorithm is given in fig. 1. In the detection module, each image is processed to output a set of object detections Z_k,

  I_k  ==Detection==>  Z_k.   (3)

TABLE I: TABLE OF NOTATION
• Lower-case non-bold letters, e.g., a, b, γ, denote scalars.
• Lower-case bold letters, e.g., x, z, ξ, denote vectors.
• Capital non-bold letters, e.g., M, F, H, denote matrices.
• Capital bold letters, e.g., X, Y, Z, denote sets.
• |X| denotes the cardinality of set X, i.e., the number of elements in X.
• ⊎ denotes disjoint set union, i.e., X ⊎ Y = Z means X ∪ Y = Z and X ∩ Y = ∅.
• [h(·)]^X = ∏_{x∈X} h(x), and [h(·)]^∅ = 1 by definition.
• ⟨a; b⟩ = ∫ a(x) b(x) dx, the inner product of a(x) and b(x).

The set of detections Z_k, where each z^i_k ∈ Z_k is an estimated object, is also modeled as an RFS. The detection is based on a CNN, which is presented in detail in Section III.
The tracking module takes the image detections as input and outputs an object set estimate; it has three parts: prediction, update, and extraction. Together, the prediction and the update constitute a tracking filter that recursively estimates a multi-object set density,

  Z_k  ==PMBM filter==>  f_{k|k}(X_k | Z^k),   (4)

where Z^k denotes all measurement sets up to time step k, {Z_t}_{t=0}^{k}. Specifically, in this work we estimate a PMBM density [6]. The Chapman-Kolmogorov prediction

  f_{k|k-1}(X_k | Z^{k-1}) = ∫ g(X_k | X_{k-1}) f_{k-1|k-1}(X_{k-1} | Z^{k-1}) δX_{k-1},   (5a)

predicts the PMBM density to the next time step using the multi-object motion model g(X_k | X_{k-1}). We use the standard multi-object motion model [1], meaning that g(·|·) models a Markovian process for objects that remain in the field of view, combined with a Poisson point process (PPP) birth process. Using the set of detections Z_k and the multi-object measurement model h(Z_k | X_k), the updated PMBM density is computed using the Bayes update

  f_{k|k}(X_k | Z^k) = h(Z_k | X_k) f_{k|k-1}(X_k | Z^{k-1}) / ∫ h(Z_k | X_k) f_{k|k-1}(X_k | Z^{k-1}) δX_k.   (5b)

We use the standard multi-object measurement model [1], in which h(Z_k | X_k) models noisy measurements with detection uncertainty, combined with PPP clutter. The final part of the tracking is the object extraction, where object estimates are extracted from the PMBM density,

  f_{k|k}(X_k | Z^k)  ==Extraction==>  X̂_{k|k}.   (6)

The tracking is described further in Section IV. The integrals in (5) are set-integrals, defined in [1].

III. OBJECT DETECTION

In this section, we describe how deep learning, see, e.g., [18], is used to process the images {I_t}_{t=0}^{k} to output sets of detections {Z_t}_{t=0}^{k}.
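As intuition for (5): for a single object with linear-Gaussian models and no clutter, the set-integral prediction and update reduce to the familiar Kalman filter recursion. A minimal sketch of that special case follows; the matrices F, Q, H, R below are illustrative placeholders, not the nonlinear camera model used in the paper.

```python
import numpy as np

def predict(mean, cov, F, Q):
    """Chapman-Kolmogorov prediction: the single-object Gaussian analogue of (5a)."""
    return F @ mean, F @ cov @ F.T + Q

def update(mean, cov, z, H, R):
    """Bayes update: the single-object Gaussian analogue of (5b)."""
    S = H @ cov @ H.T + R             # innovation covariance
    K = cov @ H.T @ np.linalg.inv(S)  # Kalman gain
    return mean + K @ (z - H @ mean), cov - K @ S @ K.T

# Illustrative 1D constant-position model.
F = np.eye(1); Q = 0.1 * np.eye(1); H = np.eye(1); R = 0.5 * np.eye(1)
m, P = np.zeros(1), np.eye(1)
m, P = predict(m, P, F, Q)
m, P = update(m, P, np.array([1.0]), H, R)
```

The PMBM filter of Section IV propagates these same two steps, but over a whole set of Gaussian components together with existence probabilities and hypothesis weights.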
For an image I_t with a corresponding set of detections Z_t, each detection z ∈ Z_t consists of a 2D bounding box and a distance from the camera center to the center of the detected object,

  z = [x_min, y_min, x_max, y_max, d]^T,   (7)

where (x_min, y_min) and (x_max, y_max) are the pixel positions of the top left and bottom right corners of the bounding box, respectively, and d is the distance from the camera to the object. The bounding box encloses the object in the image. Using this information, the angle from the camera center to the center of the detected object can be inferred. This, together with the camera-to-object distance d, allows the camera to be transformed into a range/bearing sensor, which is suitable for object tracking in 3D world coordinates. The object detection is implemented using an improved version of the network developed in [19]. The network can be divided into two parts: the first part can be viewed as a feature extractor, and the second part consists of three parallel output headers. The feature extractor is identical to the DRN-C-26 [20] network, with the exception that the last two classification layers have been removed. The last two layers are structured for the original classification task of DRN-C-26, which is not suitable in this work. To represent objects using a bounding box and its distance, the network has three different types of output: classification score, bounding box and distance. Each header in the network has two 1×1 convolutional layers and finally a sub-pixel convolutional layer [21], upscaling the output to 1/4th of the input image resolution.
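The range/bearing interpretation above can be made concrete with a pinhole camera model: the bounding-box center gives the ray direction, and d scales that ray into a 3D point. The sketch below illustrates this; the intrinsics (focal length f, principal point (cx, cy)) are assumed example values, not parameters from the paper.

```python
import numpy as np

def detection_to_3d(z, f=721.5, cx=609.6, cy=172.9):
    """Convert a detection z = [x_min, y_min, x_max, y_max, d] into a 3D
    point in the camera frame, treating the camera as a range/bearing
    sensor. The intrinsics (f, cx, cy) are illustrative placeholders."""
    x_min, y_min, x_max, y_max, d = z
    # Center of the bounding box in pixel coordinates.
    u = 0.5 * (x_min + x_max)
    v = 0.5 * (y_min + y_max)
    # Pinhole model: a unit-depth ray through the pixel, rescaled so that
    # its length equals the measured camera-to-object distance d.
    ray = np.array([(u - cx) / f, (v - cy) / f, 1.0])
    return d * ray / np.linalg.norm(ray)

p = detection_to_3d(np.array([580.0, 160.0, 640.0, 200.0, 20.0]))
```

By construction the returned point lies at range d from the camera center, which is exactly the quantity the distance header of the network estimates.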
The bounding box header has 4 output channels, representing the top left and bottom right corners of the bounding box; the distance header has one output channel, representing the distance to the object; and the classification header has an additional softmax function and represents the different class scores using one-hot encoding, i.e., one output channel per class, where each channel represents the score for the corresponding class. For each pixel in the output layer there will be an estimated bounding box, i.e., there can be more than one bounding box per object. To address this, Soft-NMS [22] is applied. In this step, the box with the highest classification score is selected, and the scores of boxes intersecting the selected box are decayed according to a function of the intersection over union (IoU). This process is repeated until the classification scores of all remaining boxes are below a manually chosen threshold. The feature extractor is pre-trained on ImageNet [23], and the full network is fine-tuned using annotated object labels from the KITTI object data set [17]. The network is tuned using stochastic gradient descent with momentum. The classification task used a cross entropy loss function, while bounding box regression and distance estimation used a smooth L1 loss function [12].

IV. OBJECT TRACKING

To associate objects between consecutive frames and filter the object detections from the neural network, a PMBM tracking filter is applied. Both the set of objects X_k and the set of image detections Z_k are modeled as RFSs. The purpose of the tracking module is to process the sequence of detection sets, and output a sequence of estimates X̂_{k|k} of the true set of objects. We achieve this by using a PMBM filter to estimate the multi-object density f_{k|k}(X_k | Z^k), and to extract estimates from this density.
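Returning to the detection post-processing described above: the Soft-NMS step can be sketched as follows. The Gaussian score decay is one common variant of Soft-NMS [22]; the values of sigma and the score threshold are illustrative choices, not the paper's settings.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes [x_min, y_min, x_max, y_max]."""
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.05):
    """Greedy Soft-NMS: repeatedly keep the highest-scoring box and decay
    the scores of boxes overlapping it with a Gaussian of their IoU."""
    boxes, scores = boxes.copy(), scores.copy()
    keep = []
    while scores.max() > score_thresh:
        i = int(np.argmax(scores))
        keep.append(i)
        for j in range(len(scores)):
            if j != i and scores[j] > score_thresh:
                scores[j] *= np.exp(-iou(boxes[i], boxes[j]) ** 2 / sigma)
        scores[i] = 0.0  # remove the selected box from further consideration
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
keep = soft_nms(boxes, scores)
```

In this toy example the second box overlaps the first heavily, so its score is decayed but it survives; with hard NMS it would have been discarded outright.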
In this section, we first present some necessary RFS background, and the standard point object models that are used to model both the object motion and the detection process. Then, we present the PMBM filter.

A. RFS background

In this work, two types of RFSs are important: the PPP and the Bernoulli process. A general introduction to RFSs is given in, e.g., [1].

1) Poisson point process: A PPP is a type of RFS where the cardinality is Poisson distributed and all elements are independent and identically distributed (IID). A PPP can be parametrized by an intensity function, D(x), defined as

  D(x) = µ f(x).   (8)

The intensity function has two parameters, the Poisson rate µ > 0 and the spatial distribution f(x). The expected number of set members in a PPP S is ∫_{x∈S} D(x) dx. The PPP density is

  f(X) = e^{−⟨D; 1⟩} ∏_{x∈X} D(x) = e^{−µ} ∏_{x∈X} µ f(x).   (9)

PPPs are used to model object birth, undetected objects and clutter measurements.

2) Bernoulli process: A Bernoulli RFS is an RFS that with probability r contains a single element with probability density function (PDF) f(x), and with probability 1 − r is empty:

  f(X) = 1 − r    if X = ∅,
         r f(x)   if X = {x},
         0        if |X| > 1.   (10)

It is suitable to use a Bernoulli RFS to model objects in an MOT problem, since it models both the object's probability of existence r and the uncertainty in its state x. In MOT, the objects are typically assumed to be independent [6]. The disjoint union of a fixed number of independent Bernoulli RFSs, X = ⊎_{i∈I} X^i, where I is an index set, is a multi-Bernoulli (MB) RFS. The parameters {r^i, f^i(·)}_{i∈I} define the MB distribution. A multi-Bernoulli mixture (MBM) density is a normalized, weighted sum of MB densities.
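The set densities (9) and (10) can be evaluated directly. The sketch below does so for scalar states with a Gaussian spatial distribution f(x); the mean and standard deviation of that Gaussian are illustrative assumptions.

```python
import math

def gauss(x, m=0.0, s=1.0):
    """1D Gaussian PDF, used here as the spatial distribution f(x)."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def ppp_density(X, mu):
    """PPP set density, eq. (9): f(X) = exp(-mu) * prod_{x in X} mu * f(x)."""
    p = math.exp(-mu)
    for x in X:
        p *= mu * gauss(x)
    return p

def bernoulli_density(X, r):
    """Bernoulli set density, eq. (10)."""
    if len(X) == 0:
        return 1.0 - r
    if len(X) == 1:
        return r * gauss(X[0])
    return 0.0  # a Bernoulli RFS never contains more than one element
```

Note how the empty set has nonzero density in both cases: e^{−µ} for the PPP and 1 − r for the Bernoulli, which is what lets these densities represent undetected and possibly nonexistent objects.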
The MBM density is entirely defined by {w^j, {r^{j,i}, f^{j,i}(·)}_{i∈I^j}}_{j∈J}, where J is an index set for the MBs in the MBM, w^j is the probability of the j-th MB, and I^j is the index set for the Bernoulli distributions. In an MOT problem, the different MBs typically correspond to different data association sequences.

B. Standard models

Here we present the details of the standard measurement and motion models, under Gaussian assumptions.

1) Measurement model: Let x^i_k be the state of the i-th vehicle at the k-th time step. At time step k, given a set of objects X_k = {x^i_k}_{i∈I}, the set of measurements is Z_k = (⊎_{i∈I} W^i_k) ⊎ K_k, where W^i_k denotes the set of object-generated measurements from the i-th object, I is an index set, and K_k denotes the set of clutter measurements. The set K_k is modeled as a PPP with intensity κ(z) = λ c(z), where λ is the Poisson rate and the spatial distribution c(z) is assumed to be uniform. An object is correctly detected with probability of detection p_D. If the object is detected, the measurement z ∈ W^i_k has PDF φ_z(x^i_k) = N(z; a(x^i_k), R), where a(x^i_k) is a camera measurement model. The resulting measurement likelihood is

  ℓ_Z(x) = p(Z | x) = 1 − p_D       if Z = ∅,
                      p_D φ_z(x)   if Z = {z},
                      0            if |Z| > 1.   (11)

As can be seen in eq. (11), if multiple measurements are associated to one object, this has zero likelihood. This is a standard point object assumption, see, e.g., [1]. Because of the unknown measurement origin (an inherent property of MOT is that it is unknown which measurements are from objects and which are clutter, and, among the object-generated measurements, which object generated which measurement; the update must handle this uncertainty), it is necessary to discuss data association. Let the measurements in the set Z be indexed by m ∈ M,

  Z = {z^m}_{m∈M},   (12)

and let A^j be the space of all data associations A for the j-th predicted global hypothesis, i.e., the j-th predicted MB.
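The point-object likelihood (11) is simple enough to write out directly. In the sketch below, a linear map stands in for the camera measurement model a(x), and p_D is an assumed value; both are placeholders for illustration only.

```python
import numpy as np

def gauss_pdf(z, mean, R):
    """Multivariate Gaussian density N(z; mean, R)."""
    d = z - mean
    k = len(z)
    return (np.exp(-0.5 * d @ np.linalg.solve(R, d))
            / np.sqrt((2 * np.pi) ** k * np.linalg.det(R)))

def point_object_likelihood(Z, x, a, R, p_d=0.9):
    """Point-object measurement likelihood, eq. (11): an object is either
    missed (probability 1 - p_D) or generates exactly one measurement."""
    if len(Z) == 0:
        return 1.0 - p_d
    if len(Z) == 1:
        return p_d * gauss_pdf(Z[0], a(x), R)
    return 0.0  # two or more measurements from one object: zero likelihood

# Hypothetical stand-in for the camera model: observe the position part only.
a = lambda x: x[:2]
R = np.eye(2)
x = np.array([1.0, 2.0, 0.5, 0.0])
lik = point_object_likelihood([np.array([1.0, 2.0])], x, a, R)
```

The three branches mirror the three cases of (11); the zero-likelihood branch is what forces each index cell in a data association to contain at most one measurement.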
A data association A ∈ A^j is an assignment of each measurement in Z to a source: either to the background (clutter or a new object) or to one of the existing objects indexed by i ∈ I^j. Note that M ∩ I^j = ∅ for all j. The space of all data associations for the j-th hypothesis is A^j = P(M ∪ I^j), i.e., a data association A ∈ A^j is a partition of M ∪ I^j into non-empty disjoint subsets C ∈ A, called index cells. Due to the standard MOT assumption that the objects generate measurements independently of each other, an index cell contains at most one object index and at most one measurement index, i.e., |C ∩ I^j| ≤ 1 and |C ∩ M| ≤ 1 for all C ∈ A. Any association in which there is at least one cell with at least two object indices and/or at least two measurement indices has zero likelihood, because this violates the independence assumption and the point object assumption, respectively. If the index cell C contains an object index, let i_C denote the corresponding object index, and if the index cell C contains a measurement index, let m_C denote the corresponding measurement index.

For example, let M = {m_1, m_2, m_3} and I = {i_1, i_2}, i.e., three measurements and two objects. One valid partition of M ∪ I, i.e., one of the possible associations, has the following four cells: {m_1}, {m_2, i_1}, {m_3}, {i_2}. The meaning of this is that measurement m_2 is associated to object i_1, object i_2 is not detected, and measurements m_1 and m_3 are not associated to any previously detected object, i.e., measurements m_1 and m_3 are either clutter or from new objects.
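The example above can be checked mechanically: a valid association assigns each measurement either to the background or to a distinct existing object. The sketch below enumerates these assignments for three measurements and two objects (the names m1, m2, m3, i1, i2 are the example's hypothetical labels).

```python
from itertools import product

def valid_associations(measurements, objects):
    """Enumerate all data associations in which each measurement is mapped
    to the background (None) or to an object, and no object receives more
    than one measurement (the point-object assumption)."""
    assocs = []
    for choice in product([None] + list(objects), repeat=len(measurements)):
        assigned = [c for c in choice if c is not None]
        if len(assigned) == len(set(assigned)):  # each object used at most once
            assocs.append(dict(zip(measurements, choice)))
    return assocs

A = valid_associations(["m1", "m2", "m3"], ["i1", "i2"])
```

For |M| = 3 and |I| = 2 there are 13 valid associations, and the footnote's example, with m2 assigned to i1 and m1, m3 left as background, is one of them. This combinatorial growth is why practical PMBM implementations prune the association space.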
2) Standard dynamic model: The existing objects, both the detected and the undetected, survive from time step k to time step k + 1 with probability of survival p_S. The objects evolve independently according to a Markov process with Gaussian transition density g(x_{k+1} | x_k) = N(x_{k+1}; b(x_k), Q), where b(·) is a constant velocity (CV) motion model. New objects appear independently of the objects that already exist. The object birth is assumed to be a PPP with intensity D^b_{k+1}(x); cf. eq. (9).

C. PMBM filter

In this section, the time indexing has been omitted for notational simplicity. The PMBM filter is a combination of two RFSs: a PPP to model the objects that exist at the current time step but have not yet been detected, and an MBM to model the objects that have previously been detected at least once. The set of objects can be divided into two disjoint subsets, X = X^d ⊎ X^u, where X^d is the set of detected objects and X^u is the set of undetected objects. The PMBM density can be expressed as

  f(X) = Σ_{X^u ⊎ X^d = X} f^u(X^u) Σ_{j∈J} w^j f^j(X^d),   (13a)
  f^u(X^u) = e^{−⟨D^u; 1⟩} [D^u(·)]^{X^u},   (13b)
  f^j(X^d) = Σ_{⊎_{i∈I^j} X^i = X^d} ∏_{i∈I^j} f^{j,i}(X^i),   (13c)

where
• f^u(·) is the PPP density for the set of undetected objects X^u, with intensity D^u(·).
• J is an index set of MBM components. There are |J| MBs, where each MB corresponds to a unique global data association hypothesis. The probability of each component in the MBM is denoted w^j.
• For every component j in the MBM, there is an index set I^j, where each index i corresponds to a potentially detected object X^i.
• f^{j,i}(·) are Bernoulli set densities, defined in eq. (10). Each Bernoulli component corresponds to a potentially detected object, with a probability of existence and a state PDF.
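For the state vector in (1), the CV motion model b(·) is linear: positions are advanced by their velocities, while the bounding-box width and height are, in this sketch, assumed to stay constant. The sampling time dt and the state values below are illustrative.

```python
import numpy as np

def cv_transition(dt=0.1):
    """Linear constant-velocity (CV) transition matrix for the state
    x = [x, y, z, vx, vy, vz, w, h]^T: position components are advanced
    by velocity * dt; velocities, width and height are unchanged."""
    F = np.eye(8)
    F[0, 3] = F[1, 4] = F[2, 5] = dt
    return F

x = np.array([10.0, 0.0, 2.0, 1.0, 0.0, -0.5, 1.8, 1.5])
x_pred = cv_transition(0.1) @ x
```

In the Gaussian transition density g(x_{k+1} | x_k) = N(x_{k+1}; b(x_k), Q), this matrix plays the role of b(·), with Q adding process noise on top of the deterministic motion.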
(13) is defined by the involved parameters,

  D^u, {(w^j, {(r^{j,i}, f^{j,i})}_{i∈I^j})}_{j∈J}.   (14)

Further, the PMBM density is an MOT conjugate prior [6], meaning that for the standard point object models (Sections IV-B1 and IV-B2), the prediction and update in eq. (5) both result in PMBM densities. It follows that the PMBM filter propagates the multi-object density by propagating the set of parameters. In this work, we assume that the birth intensity D^b is a non-normalized Gaussian mixture. It follows from this assumption that the undetected intensity D^u is also a non-normalized Gaussian mixture, and all Bernoulli densities f^{j,i} are Gaussian densities. Below, we present the parameters that result from the prediction and the update, and we present a simple method for extracting target estimates from the set of parameters. To compute the predicted and updated Gaussian parameters, we use the UKF prediction and update, respectively; see, e.g., [24, Ch. 5].

1) Prediction: Given a posterior PMBM density with parameters

  D^u, {(w^j, {(r^{j,i}, f^{j,i})}_{i∈I^j})}_{j∈J},   (15)

and the standard dynamic model (Section IV-B2), the predicted density is a PMBM density with parameters

  D^u_+, {(w^j_+, {(r^{j,i}_+, f^{j,i}_+)}_{i∈I^j})}_{j∈J},   (16a)

where

  D^u_+(x) = D^b(x) + p_S ⟨D^u; g⟩,   (16b)
  r^{j,i}_+ = p_S r^{j,i},   (16c)
  f^{j,i}_+(x) = ⟨f^{j,i}; g⟩,   (16d)

and w^j_+ = w^j. For a Gaussian mixture intensity D^u and Gaussian densities f^{j,i}, the predictions ⟨·; g⟩ in eq. (16) are easily computed using the UKF prediction, see, e.g., [24, Ch. 5].
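Applied to one Gaussian Bernoulli component, the prediction (16c)-(16d) scales the existence probability by p_S and propagates the Gaussian through the motion model. The sketch below uses a linear Kalman prediction in place of the paper's UKF; the 2D state, F, Q and p_S are illustrative assumptions.

```python
import numpy as np

def predict_bernoulli(r, mean, cov, F, Q, p_s=0.99):
    """PMBM prediction for one Gaussian Bernoulli component, eqs. (16c)-(16d):
    scale the existence probability by the survival probability p_S and
    propagate the density through the (here linear) motion model."""
    return p_s * r, F @ mean, F @ cov @ F.T + Q

# Illustrative 2D state [position, velocity].
F = np.array([[1.0, 0.1], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
r_p, m_p, P_p = predict_bernoulli(0.8, np.array([0.0, 1.0]), np.eye(2), F, Q)
```

The hypothesis weights are untouched by the prediction (w^j_+ = w^j), and the undetected intensity D^u is predicted component-wise in the same way before the birth mixture D^b is appended, per (16b).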
2) Update: Given a prior PMBM density with parameters

  D^u_+, {(w^j_+, {(r^{j,i}_+, f^{j,i}_+)}_{i∈I^j_+})}_{j∈J_+},   (17)

a set of measurements Z, and the standard measurement model (Section IV-B1), the updated density is a PMBM density

  f(X | Z) = Σ_{X^u ⊎ X^d = X} f^u(X^u) Σ_{j∈J_+} Σ_{A∈A^j} w^j_A f^j_A(X^d),   (18a)
  f^u(X^u) = e^{−⟨D^u; 1⟩} ∏_{x∈X^u} D^u(x),   (18b)
  f^j_A(X^d) = Σ_{⊎_{C∈A} X^C = X^d} ∏_{C∈A} f^j_C(X^C),   (18c)

where the weights are

  w^j_A = w^j_+ ∏_{C∈A} L_C / ( Σ_{j'∈J_+} Σ_{A'∈A^{j'}} w^{j'}_+ ∏_{C'∈A'} L_{C'} ),   (18d)

  L_C = κ(z^{m_C}) + p_D ⟨D^u_+; φ_{z^{m_C}}⟩       if C ∩ I^j = ∅, C ∩ M ≠ ∅,
        1 − r^{j,i_C}_+ p_D                          if C ∩ I^j ≠ ∅, C ∩ M = ∅,
        r^{j,i_C}_+ p_D ⟨f^{j,i_C}_+; φ_{z^{m_C}}⟩   if C ∩ I^j ≠ ∅, C ∩ M ≠ ∅,   (18e)

the densities f^j_C(X) are Bernoulli densities with parameters

  r^j_C = p_D ⟨D^u_+; φ_{z^{m_C}}⟩ / (κ(z^{m_C}) + p_D ⟨D^u_+; φ_{z^{m_C}}⟩)   if C ∩ I^j = ∅, C ∩ M ≠ ∅,
          r^{j,i_C}_+ (1 − p_D) / (1 − r^{j,i_C}_+ p_D)                        if C ∩ I^j ≠ ∅, C ∩ M = ∅,
          1                                                                    if C ∩ I^j ≠ ∅, C ∩ M ≠ ∅,   (18f)

  f^j_C(x) = φ_{z^{m_C}}(x) D^u_+(x) / ⟨D^u_+; φ_{z^{m_C}}⟩           if C ∩ I^j = ∅, C ∩ M ≠ ∅,
             f^{j,i_C}_+(x)                                            if C ∩ I^j ≠ ∅, C ∩ M = ∅,
             φ_{z^{m_C}}(x) f^{j,i_C}_+(x) / ⟨f^{j,i_C}_+; φ_{z^{m_C}}⟩  if C ∩ I^j ≠ ∅, C ∩ M ≠ ∅,   (18g)

and the updated PPP intensity is D^u(x) = (1 − p_D) D^u_+(x). For a Gaussian mixture intensity D^u_+ and Gaussian densities f^{j,i}_+, the updates ⟨·; φ⟩ in (18) are easily computed using the UKF update, see, e.g., [24, Ch. 5].

3) Extraction: Let the set of updated PMBM parameters be

  D^u, {(w^j, {(r^{j,i}, f^{j,i})}_{i∈I^j})}_{j∈J}.   (19)

To extract a set of object estimates, the hypothesis with the highest probability is chosen,

  j* = argmax_{j∈J} w^j.   (20)

From the corresponding MB, with parameters

  {(r^{j*,i}, f^{j*,i})}_{i∈I^{j*}},   (21)

all Bernoulli components with probability of existence r^{j*,i} larger than a threshold τ are selected, and the expected value of the object state is included in the set of object estimates,

  X̂ = {x̂^{j*,i}}_{i∈I^{j*}: r^{j*,i} > τ},   (22a)
  x̂^{j*,i} = E_{f^{j*,i}}[x^{j*,i}] = ∫ x f^{j*,i}(x) dx.   (22b)

V. EXPERIMENTAL RESULTS

A. Setup

For evaluation, the KITTI object tracking dataset [17] is used. The dataset consists of 21 training sequences and 29 testing sequences that were collected using sensors mounted on a moving car. Each sequence has been manually annotated with ground truth information, e.g., in the images, objects from the classes Car, Pedestrian and Cyclist have been marked by bounding boxes. In this work, the training dataset was split into two parts: one for training the CNN, and one for validation. The sequences used for training are 0, 2, 3, 4, 5, 7, 9, 11, 17 and 20, and the remaining ones are used for validation.

B. Evaluation

In this work we are primarily interested in the 3D tracking results. However, the KITTI testing sequences evaluate the tracking in 2D, hence we present results in both 2D and 3D. Performance is evaluated using the IoU of the image-plane bounding boxes and the Euclidean distance as distance measures, for the 2D and 3D evaluation, respectively. For a valid correspondence between a ground truth (GT) object and an estimated object, the 2D IoU has to be at least 50%, and the 3D Euclidean distance has to be within 3 m, for the 2D and 3D evaluation, respectively. The performance is evaluated using the CLEAR MOT performance measures [25], including MOT accuracy (MOTA) and MOT precision (MOTP), with the addition of mostly tracked (MT), mostly lost (ML), identity switches (IDS) and fragmentations (FR) from [26], and F1 score (F1), precision (PRE), recall (REC) and false alarm rate (FAR). The F1 score is the weighted harmonic mean of the precision and recall.
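The extraction step (20)-(22) reduces to a few lines for Gaussian Bernoulli components, since the expected value in (22b) is then just the component mean. A minimal sketch; the hypothesis weights, existence probabilities and the threshold τ below are made-up illustration values.

```python
import numpy as np

def extract_estimates(hypotheses, tau=0.5):
    """Estimate extraction, eqs. (20)-(22): select the global hypothesis
    with the highest weight, then report the mean of every Bernoulli
    component whose existence probability exceeds the threshold tau.
    Each hypothesis is a pair (weight, [(r, mean), ...])."""
    weights = [w for w, _ in hypotheses]
    _, bernoullis = hypotheses[int(np.argmax(weights))]
    return [mean for r, mean in bernoullis if r > tau]

hyps = [
    (0.7, [(0.9, np.array([1.0, 2.0])), (0.3, np.array([5.0, 5.0]))]),
    (0.3, [(0.8, np.array([1.1, 2.1]))]),
]
est = extract_estimates(hyps)
```

Here the first hypothesis wins, and only its high-existence component is reported; the low-existence component is suppressed by the threshold.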
Note that, for the 2D IoU measure, a larger value is better, whereas for the 3D Euclidean distance, lower is better.

C. Results

Examples of the 3D tracking results are shown in fig. 2. The three examples show that the tracking algorithm successfully estimates the states of vehicles moving in the same direction as the ego-vehicle, vehicles moving in the opposite direction, as well as vehicles making sharp turns in intersections. In dense scenarios, such as in fig. 2b, there are big overlaps between the bounding boxes; this is handled without problem by the data association. Noteworthy is that the distance estimates are quite noisy. Sometimes this leads to incorrect initial estimates of the velocity vector, as can be seen at the beginning of the track of the oncoming vehicle labelled purple in fig. 2c. However, the tracking filter quickly converges to a correct estimate. Videos of these, and of additional sequences, can be seen at https://goo.gl/AoydgW.

Quantitative results from the evaluation on the validation sequences are shown in table II. Noteworthy is the low number of identity switches, in both 2D and 3D. Comparing the raw CNN detections and the MOT algorithm, the MOT precision is lower and the MOT recall is higher, leading to an F1 score that is higher for the MOT than for the CNN; in other words, the overall object detection performance is slightly improved. The runtime of the algorithm is on average 52 ms in total, 38 ms for the detection network and 14 ms for the tracking algorithm, on an Nvidia Tesla V100 SXM2 and a single thread on a 2.7 GHz Intel Core i7.

D. KITTI MOT benchmark

The MOT algorithm was also evaluated in 2D using the test sequences on the KITTI evaluation server. For these results, the full training set was used for training the detection CNN. The results are presented in table III; at the time of submission our algorithm was ranked 3rd in terms of MOTA among the published algorithms.
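As a reminder of how the headline MOTA numbers in tables II and III are formed, CLEAR MOT [25] combines missed objects, false positives and identity switches into MOTA = 1 − (FN + FP + IDS)/GT, summed over all frames. A minimal computation (the counts below are made up for illustration, not the paper's results):

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """CLEAR MOT accuracy: MOTA = 1 - (FN + FP + IDS) / GT,
    with all counts summed over the whole sequence."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# Illustrative counts only.
score = mota(false_negatives=120, false_positives=80, id_switches=19, num_gt=1200)
```

Because identity switches enter the numerator directly, the low IDS counts reported for the proposed algorithm translate into a direct MOTA benefit.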
Note that, even if not reaching the same MOTA performance, the runtime of our algorithm (frames per second (FPS)) is one order of magnitude faster, and it has a significantly lower number of identity switches than the two algorithms with higher MOTA.

VI. CONCLUSION

This paper presented an image-based MOT algorithm using deep learning detections and PMBM filtering. It was shown that a CNN and a subsequent PMBM filter can be used to detect and track objects. The algorithm can successfully track multiple objects in 3D from a single camera image, which can provide valuable information for decision making and control.

ACKNOWLEDGMENT

This work was partially supported by the Wallenberg Autonomous Systems and Software Program (WASP).

Fig. 2. Object estimates, panels (a)-(c), shown both projected to the image plane (top) and in top view with axes in meters (bottom). The dotted tracks illustrate the GT and the tracked object estimates at previous time steps. The color of a track corresponds to the identity of the tracked object.

TABLE II
Results of the evaluation of tracking performance in 2D and 3D on the Car class. ↑ and ↓ indicate that high and low values are better, respectively. The best values are marked with bold font.

2D:
Method | MOTA ↑ | MOTP ↑  | MT ↑    | ML ↓   | IDS ↓ | FR ↓ | F1 ↑    | PRE ↑   | REC ↑   | FAR ↓
CNN    | –      | 82.04 % | 74.59 % | 3.78 % | –     | –    | 91.16 % | 95.72 % | 87.02 % | 9.08 %
MOT    | 81.23 %| 81.63 % | 76.22 % | 3.78 % | 19    | 107  | 91.26 % | 94.76 % | 88.02 % | 11.46 %

3D:
Method | MOTA ↑ | MOTP ↓    | MT ↑    | ML ↓    | IDS ↓ | FR ↓ | F1 ↑    | PRE ↑   | REC ↑   | FAR ↓
CNN    | –      | 111.39 cm | 45.95 % | 10.27 % | –     | –    | 73.53 % | 78.74 % | 68.97 % | 41.90 %
MOT    | 47.20 %| 110.73 cm | 48.65 % | 11.35 % | 20    | 166  | 73.86 % | 78.18 % | 70.00 % | 44.32 %

TABLE III
KITTI MOT benchmark [17] results for the Car class. ↑ and ↓ indicate that high and low values are better, respectively. The best values are marked with bold font. [14] is TuSimple, [16] is IMMDP and [27] is MCMOT-CPD. Only results of published methods are reported.

Method  | MOTA ↑ | MOTP ↑ | MT ↑   | ML ↓   | IDS ↓ | FR ↓ | FPS ↑
[14] (a)| 86.6 % | 84.0 % | 72.5 % | 6.8 %  | 293   | 501  | 1.7
[16]    | 83.0 % | 82.7 % | 60.6 % | 11.4 % | 172   | 365  | 5.3
Our     | 80.4 % | 81.3 % | 62.8 % | 6.2 %  | 121   | 613  | 73
[27]    | 78.9 % | 82.1 % | 52.3 % | 11.7 % | 228   | 536  | 100
(a) The time for object detection is not included in the specified runtime.

REFERENCES

[1] R. Mahler, Statistical Multisource-Multitarget Information Fusion. Norwood, MA, USA: Artech House, Inc., 2007.
[2] ——, "Multitarget Bayes filtering via first-order multitarget moments," IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 4, pp. 1152–1178, October 2003.
[3] ——, "PHD filters of higher order in target number," IEEE Transactions on Aerospace and Electronic Systems, vol. 43, no. 4, pp. 1523–1543, October 2007.
[4] B. T. Vo and B. N. Vo, "Labeled Random Finite Sets and Multi-Object Conjugate Priors," IEEE Transactions on Signal Processing, vol. 61, no. 13, pp. 3460–3475, July 2013.
[5] S. Reuter, B. T. Vo, B. N. Vo, and K. Dietmayer, "The Labeled Multi-Bernoulli Filter," IEEE Transactions on Signal Processing, vol. 62, no. 12, pp. 3246–3260, June 2014.
[6] J. L. Williams, "Marginal multi-Bernoulli filters: RFS derivation of MHT, JIPDA, and association-based MeMBer," IEEE Transactions on Aerospace and Electronic Systems, vol. 51, no. 3, pp. 1664–1687, July 2015.
[7] Á. F. García-Fernández, J. L. Williams, K. Granström, and L. Svensson, "Poisson multi-Bernoulli mixture filter: direct derivation and implementation," 2017. [Online]. Available: http://arxiv.org/abs/1703.04264
[8] Y. Xia, K. Granström, L. Svensson, and Á. F. García-Fernández, "Performance Evaluation of Multi-Bernoulli Conjugate Priors for Multi-Target Filtering," in 2017 20th International Conference on Information Fusion (Fusion), July 2017, pp. 1–8.
[9] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, June 2017.
[10] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 779–788.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[12] R. Girshick, "Fast R-CNN," in 2015 IEEE International Conference on Computer Vision (ICCV), December 2015, pp. 1440–1448.
[13] R. Hartley and A. Zisserman, Multiple View Geometry, 2nd ed. New York, NY, USA: Cambridge University Press, 2004.
[14] W. Choi, "Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor," in 2015 IEEE International Conference on Computer Vision (ICCV), December 2015, pp. 3029–3037.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
[16] Y. Xiang, A. Alahi, and S. Savarese, "Learning to Track: Online Multi-object Tracking by Decision Making," in 2015 IEEE International Conference on Computer Vision (ICCV), December 2015, pp. 4705–4713.
[17] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2012, pp. 3354–3361.
[18] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[19] A. Krishnan and J. Larsson, "Vehicle detection and road scene segmentation using deep learning," 2016.
[20] F. Yu, V. Koltun, and T. Funkhouser, "Dilated Residual Networks," May 2017. [Online]. Available: http://arxiv.org/abs/1705.09914
[21] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 1874–1883.
[22] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, "Soft-NMS – Improving Object Detection with One Line of Code," in 2017 IEEE International Conference on Computer Vision (ICCV), October 2017, pp. 5562–5570.
[23] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009, pp. 248–255.
[24] S. Särkkä, Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
[25] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The CLEAR MOT metrics," EURASIP Journal on Image and Video Processing, vol. 2008, 2008.
[26] Y. Li, C. Huang, and R. Nevatia, "Learning to associate: HybridBoosted multi-target tracker for crowded scene," in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009, pp. 2953–2960.
[27] B. Lee, E. Erdenee, S. Jin, and P. K. Rhee, "Multi-Class Multi-Object Tracking Using Changing Point Detection," Lecture Notes in Computer Science, vol. 9914, pp. 68–83, August 2016.
