Object Tracking via Non-Euclidean Geometry: A Grassmann Approach

Sareh Shirazi, Mehrtash T. Harandi, Brian C. Lovell, Conrad Sanderson

NICTA, GPO Box 2434, Brisbane, QLD 4001, Australia
University of Queensland, School of ITEE, QLD 4072, Australia
Queensland University of Technology, Brisbane, QLD 4000, Australia

Abstract

A robust visual tracking system requires an object appearance model that is able to handle occlusion, pose, and illumination variations in the video stream. This can be difficult to accomplish when the model is trained using only a single image. In this paper, we first propose a tracking approach based on affine subspaces (constructed from several images) which are able to accommodate the abovementioned variations. We use affine subspaces not only to represent the object, but also the candidate areas that the object may occupy. We furthermore propose a novel approach to measure affine subspace-to-subspace distance via the use of non-Euclidean geometry of Grassmann manifolds. The tracking problem is then considered as an inference task in a Markov Chain Monte Carlo framework via particle filtering. Quantitative evaluation on challenging video sequences indicates that the proposed approach obtains considerably better performance than several recent state-of-the-art methods such as Tracking-Learning-Detection and MILTrack.

1. Introduction

Visual tracking is a fundamental task in many computer vision applications, including event analysis, visual surveillance, human behaviour analysis, and video retrieval [18]. It is a challenging problem, mainly because the appearance of tracked objects changes over time. Designing an appearance model that is robust against intrinsic object variations (e.g. shape deformation and pose changes) and extrinsic variations (e.g. camera motion, occlusion, illumination changes) has attracted a large body of work [4, 24].
Rather than relying on object models based on a single training image, more robust models can be obtained through the use of several images, as evidenced by the recent surge of interest in object recognition techniques based on image-set matching. Among the many approaches to image-set matching, superior discrimination accuracy, as well as increased robustness to practical issues (such as pose and illumination variations), can be achieved by modelling image-sets as linear subspaces [10, 11, 12, 20, 21, 22].

In spite of the above observations, we believe modelling via linear subspaces is not completely adequate for object tracking. We note that all linear subspaces of one specific order have a common origin. As such, linear subspaces are theoretically robust against translation, meaning a linear subspace extracted from a set of points does not change if the points are shifted equally. While the resulting robustness against small shifts is attractive for object recognition purposes, the task of tracking is generally to maintain precise locations of objects.

To account for the above problem, in this paper we first propose to model objects, as well as the candidate areas that the objects may occupy, through the use of generalised linear subspaces, i.e. affine subspaces, where the origin of the subspaces can be varied. As a result, the tracking problem can be seen as finding the affine subspace in a given frame that is most similar to the object's affine subspace. We furthermore propose a novel approach to measure distances between affine subspaces, via the use of non-Euclidean geometry of Grassmann manifolds, in combination with the Mahalanobis distance between the origins of the subspaces. See Fig. 1 for a conceptual illustration of our proposed distance measure. To the best of our knowledge, this is the first time that appearance is modelled by affine subspaces for object tracking.
The proposed approach is somewhat related to adaptive subspace tracking [13, 19]. Ho et al. [13] represent an object as a point in a linear subspace, which is constantly updated. As the subspace is computed using only recent tracking results, the tracker may drift if large appearance changes occur. In addition, the location of the tracked object is inferred by measuring a point-to-subspace distance, in contrast to the proposed method, where a more robust subspace-to-subspace distance is used.

Ross et al. [19] improved tracking robustness against large appearance changes by modelling objects in a low-dimensional subspace, updated incrementally using all preceding frames. Their method also involves a point-to-subspace distance measurement to localise the object.

The proposed method should not be confused with subspace learning on Grassmann manifolds as proposed by Wang et al. [25]. More specifically, in [25] an online subspace learning scheme using Grassmann manifold geometry is devised to learn/update the subspace of object appearances. In contrast to the proposed method, they also use the point-to-subspace distance to localise objects.

Figure 1. Difference between point-to-subspace and subspace-to-subspace distance measurement approaches. (a) Three groups of images, with each image represented as a point in space; the first group (top-left) contains three consecutive object images (frames 1, 2 and 3) used for generating the object model; the second group (bottom-left) contains tracked object images from frames t-2 and t-1; the third group (right) contains three candidate object regions from frame t. (b) Subspace generated based on object images from frames 1, 2 and 3, represented as a dashed line; the minimum point-to-subspace distance can result in selecting the wrong candidate region (i.e. wrong location).
(c) Generated subspaces, represented as points on a Grassmann manifold; the top-left subspace represents the object model; each of the remaining subspaces was generated by using tracked object images from frames t-2 and t-1, with the addition of a unique candidate region from frame t; using subspace-to-subspace distance is more likely to result in selecting the correct candidate region.

2. Proposed Affine Subspace Tracker (AST)

The proposed Affine Subspace Tracker (AST) is comprised of four components, overviewed below. A block diagram of the proposed tracker is shown in Fig. 2.

1. Motion Estimation. This component takes into account the history of object motion in previous frames and creates a set of candidates as to where the object might be found in the new frame. To this end, it parameterises the motion of the object between consecutive frames as a distribution via a particle filter framework [2]. Particle filters are sequential Monte Carlo methods and use a set of points to represent the distribution. As a result, instead of scanning the whole of the new frame to find the object, only highly probable locations will be examined.

2. Candidate Subspaces. This module encodes the appearance of a candidate (associated with a particle) by an affine subspace $\mathcal{A}_i^{(t)}$. This is achieved by taking into account the history of tracked images and learning the origin $\mu_i^{(t)}$ and basis $U_i^{(t)}$ of $\mathcal{A}_i^{(t)}$ for each particle.

3. Decision Making. This module measures the likelihood of each candidate subspace $\mathcal{A}_i^{(t)}$ against the stored object models in the bag $\mathcal{M}$. Since the object models are also encoded by affine subspaces, this module determines the similarity between affine subspaces. The candidate subspace most similar to the bag $\mathcal{M}$ is selected as the result of tracking.

4. Bag of Models. This module keeps a history of previously seen objects in a bag.
This is primarily driven by the fact that a more robust and flexible tracker can be attained if a history of variations in the object appearance is kept [15]. To understand the benefit of the bag of models, assume that tracking a person is desired, where the appearance of the whole body is encoded as an object model. Moreover, assume that at some point in time only the upper body of the person is visible (due to partial occlusion) and the tracker has successfully learned the new appearance. If the tracking system is only aware of the very last seen appearance (the upper body in our example), upon termination of the occlusion the tracker is likely to lose the object. Keeping a set of models (in our example, both upper body and whole body) can help the tracking system to cope with drastic changes.

Each of the components is elucidated in the following subsections.

2.1. Motion Estimation

In the proposed framework, we aim to obtain the location $x \in X$, $y \in Y$ and the scale $s \in S$ of an object in frame $t$ based on prior knowledge about previous frames.

Figure 2. Block diagram for the proposed Affine Subspace Tracker (AST).

A blind search in the X-Y-S space is obviously inefficient, since not all possible combinations of $x$, $y$ and $s$ are plausible. To efficiently search the X-Y-S space, we use a sequential Monte Carlo method known as the Condensation algorithm [14] to determine which combinations in the X-Y-S space are most probable at time $t$. The key idea is to represent the X-Y-S space by a density function and estimate it through a set of random samples (also known as particles). As the number of particles becomes large, the condensation method approaches the optimal Bayesian estimate of the density function (i.e. combinations in the X-Y-S space). Below, we briefly describe how the condensation algorithm is used within the proposed tracking approach.

Let $\mathcal{Z}^{(t)} = \left( x^{(t)}, y^{(t)}, s^{(t)} \right)$ denote a particle at time $t$.
By virtue of the principle of importance sampling [2], the density of the X-Y-S space (i.e. the most probable candidates) at time $t$ is estimated as a set of $N$ particles $\{\mathcal{Z}_i^{(t)}\}_{i=1}^{N}$, using the previous particles $\{\mathcal{Z}_i^{(t-1)}\}_{i=1}^{N}$ and their associated weights $\{w_i^{(t-1)}\}_{i=1}^{N}$, with $\sum_{i=1}^{N} w_i^{(t-1)} = 1$. For now we assume the associated weights of the particles are known; we later discuss how they can be determined.

In the condensation algorithm, to generate $\{\mathcal{Z}_i^{(t)}\}_{i=1}^{N}$, the set $\{\mathcal{Z}_i^{(t-1)}\}_{i=1}^{N}$ is first sampled (with replacement) $N$ times. The probability of choosing a given element $\mathcal{Z}_i^{(t-1)}$ is equal to its associated weight $w_i^{(t-1)}$. Therefore, particles with high weights may be selected several times, leading to identical copies of elements in the new set, while others with relatively low weights may not be chosen at all. Next, each chosen element undergoes an independent Brownian motion step. Here, the Brownian motion of a particle is modelled by a Gaussian distribution with a diagonal covariance matrix. As a result, for a particle $\mathcal{Z}_*^{(t-1)}$ chosen in the first step of the condensation algorithm, a new particle $\mathcal{Z}_*^{(t)}$ is obtained as a random sample of $\mathcal{N}\left( \mathcal{Z}_*^{(t-1)}, \Sigma \right)$, where $\mathcal{N}(\mu, \Sigma)$ denotes a Gaussian distribution with mean $\mu$ and covariance $\Sigma$. The covariance $\Sigma$ governs the speed of motion, and is a constant parameter over time in our framework.

2.2. Candidate Templates

To accommodate variations in object appearance, this module models the appearance of particles¹ by affine subspaces (see Fig. 3 for a conceptual example). An affine subspace is a subset of Euclidean space [23], formally described by a 2-tuple $\{\mu, U\}$ as:

  $\mathcal{A} = \left\{ z \in \mathbb{R}^D : z = \mu + U y \right\}$    (1)

where $\mu \in \mathbb{R}^D$ and $U \in \mathbb{R}^{D \times n}$ are the origin and basis of the subspace, respectively. Let $I(\mathcal{Z}_*^{(t)}, t)$ denote the vector representation of an $N_1 \times N_2$ patch extracted from frame $t$ by considering the values of particle $\mathcal{Z}_*^{(t)}$.
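The resample-then-diffuse step of the condensation algorithm described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the noise scales, and the three-element (x, y, s) state layout are our assumptions.

```python
import numpy as np

def condensation_step(particles, weights, sigma_xy=4.0, sigma_s=0.02, rng=None):
    """One condensation iteration: resample the (x, y, s) particles with
    replacement, proportionally to their weights, then apply an independent
    Brownian motion step drawn from a diagonal-covariance Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(particles)
    # Step 1: sample N indices with replacement; heavily weighted particles
    # may be duplicated, while lightly weighted ones may vanish.
    idx = rng.choice(n, size=n, p=weights)
    chosen = particles[idx]
    # Step 2: diagonal-covariance Gaussian diffusion (x, y, s move independently).
    noise = rng.normal(0.0, [sigma_xy, sigma_xy, sigma_s], size=(n, 3))
    return chosen + noise

particles = np.array([[10.0, 20.0, 1.0],
                      [12.0, 22.0, 1.0],
                      [50.0, 60.0, 1.0]])
weights = np.array([0.5, 0.5, 0.0])   # the third particle cannot survive resampling
new_particles = condensation_step(particles, weights)
```

Because the third particle has zero weight, every resampled particle is a diffused copy of one of the first two.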
That is, frame $t$ is first scaled appropriately based on the value $s_*^{(t)}$, and then a patch of $N_1 \times N_2$ pixels with the top-left corner located at $\left( x_*^{(t)}, y_*^{(t)} \right)$ is extracted. The appearance model for $\mathcal{Z}_*^{(t)}$ is generated from a set of $P+1$ images by considering the $P$ previous results of tracking. More specifically, let $\widehat{\mathcal{Z}}^{(t)}$ denote the result of tracking at time $t$, i.e. $\widehat{\mathcal{Z}}^{(t)}$ is the particle most similar to the bag of models at time $t$. Then the set

  $B_{\mathcal{Z}_*}^{(t)} = \left\{ I(\widehat{\mathcal{Z}}^{(t-P)}, t{-}P),\; I(\widehat{\mathcal{Z}}^{(t-P+1)}, t{-}P{+}1),\; \cdots,\; I(\mathcal{Z}_*^{(t)}, t) \right\}$

is used to obtain the appearance model for particle $\mathcal{Z}_*^{(t)}$. More specifically, the origin of the affine subspace associated with $\mathcal{Z}_*^{(t)}$ is the mean of $B_{\mathcal{Z}_*}^{(t)}$. The basis is obtained by computing the Singular Value Decomposition (SVD) of $B_{\mathcal{Z}_*}^{(t)}$ and choosing the $n$ dominant left-singular vectors.

2.3. Bag of Models

Although affine subspaces accommodate the object changes encoded by a set of images, to produce a robust tracker the object's model should be able to reflect the appearance changes that occur during the tracking process. Accordingly, we propose to keep a set of object models $m_j = \{\mu_j, U_j\}$ for coping with deformations, pose variations, occlusions, and other variations of the object during tracking.

¹ We loosely use "particle appearance" to mean the appearance of a candidate template described by a particle.

Figure 3. In the proposed approach, object appearance is modelled by an affine subspace. An affine subspace is uniquely described by its origin $\mu$ and basis $U$. Here, $\mu$ and $U$ are obtained by computing the mean and eigenbasis of a set of object images.

Fig. 4 shows two frames with a tracked object, the bag models used to localise the object, and the recent images of the image set used to generate each bag model. A bag $\mathcal{M} = \{ m_1, \cdots, m_k \}$ is defined as a set of $k$ object models, i.e. each $m_j$ is an affine subspace learned during the tracking process.
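The model construction of Sec. 2.2 (origin as the mean of the image set, basis as the dominant left-singular vectors) can be sketched as follows. The helper name is ours, and we centre the set before the SVD, a standard step when the origin is factored out as the mean.

```python
import numpy as np

def learn_affine_subspace(patches, n_basis=3):
    """Build an affine subspace model {mu, U} from vectorised image patches
    (one column per image): the origin mu is the mean of the set, and the
    basis U holds the n dominant left-singular vectors of the centred set."""
    X = np.asarray(patches, dtype=float)              # D x (P+1)
    mu = X.mean(axis=1)                               # origin of the subspace
    U, _, _ = np.linalg.svd(X - mu[:, None], full_matrices=False)
    return mu, U[:, :n_basis]                         # shapes (D,), (D, n)

# Toy usage: ten 8x8 "patches" flattened to 64-dimensional vectors.
patches = np.random.default_rng(0).standard_normal((64, 10))
mu, U = learn_affine_subspace(patches, n_basis=3)
```

The returned basis is orthonormal by construction, which is what the Grassmann machinery in Sec. 2.4 requires.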
The bag is updated every $W$ frames (see Fig. 5) by replacing the oldest model with the latest learned model (i.e. the latest result of tracking, specified by $\widehat{\mathcal{Z}}^{(t)}$). The size of the bag, $k$, determines the memory of the tracking system. Thus, a large bag with several models might be required to track an object in a challenging scenario. In all experiments, a bag of size 10 with the update rate $W = 5$ is used.

Having a set of models at our disposal, we will next address how the similarity between a particle's appearance and the bag can be determined.

Figure 4. (a) Two examples of a frame with a tracked object. (b) The first eigenbasis of ten sample template bags. (c) The recent frame in each of the 10 image sets used to generate the templates.

Figure 5. The model extraction procedure involves a sliding-window update scheme. The template is learned from a set of P consecutive frames. Template update occurs every W frames.

2.4. Decision Making

Given the previously learned affine subspaces as the input to this module, the aim is to find the nearest affine subspace to the bag templates. Although the minimal Euclidean distance is the simplest distance measure between two affine subspaces (i.e. the minimum distance between any pair of points of the two subspaces), this measure does not form a metric [5], and it does not consider the angular distance between affine subspaces, which can be a useful discriminator [16]. However, the angular distance ignores the origins of the affine subspaces and reduces the problem to the linear subspace case, which we wish to avoid.
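The sliding-window bag update described above can be sketched with a simple fixed-capacity buffer. The class and method names are ours; the paper's settings are capacity 10 and W = 5, while smaller values are used below for illustration.

```python
from collections import deque

class ModelBag:
    """Sliding-window bag of appearance models m_j = {mu_j, U_j}: every W-th
    frame the newest learned model is added and, once capacity is reached,
    the oldest model is discarded."""
    def __init__(self, capacity=10, update_rate=5):
        self.models = deque(maxlen=capacity)  # oldest entry drops off automatically
        self.update_rate = update_rate
        self.frame_count = 0

    def maybe_update(self, model):
        """Call once per frame with the latest learned model."""
        self.frame_count += 1
        if self.frame_count % self.update_rate == 0:
            self.models.append(model)

# Small capacity and rate for illustration: 12 frames, W = 2, capacity 3.
bag = ModelBag(capacity=3, update_rate=2)
for t in range(12):
    bag.maybe_update(("model", t))
```

After 12 frames, six candidate models were offered to the bag and only the three most recent survive, mirroring the fixed memory of the tracking system.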
To address the above limitations, we propose a distance measure of the following form:

  $\mathrm{dist}(\mathcal{A}_i, \mathcal{A}_j) = \mathrm{dist}_G(U_i, U_j) + \alpha \left( \mu_i - \mu_j \right)^T M \left( \mu_i - \mu_j \right)$    (2)

where $\mathrm{dist}_G$ is the geodesic distance between two points on a Grassmann manifold [7], $(\mu_i - \mu_j)^T M (\mu_i - \mu_j)$ is the Mahalanobis distance between the origins of $\mathcal{A}_i$ and $\mathcal{A}_j$, and $\alpha$ is a mixing weight. The components of the proposed distance are described below.

A Grassmann manifold (a special type of Riemannian manifold) is defined as the space of all $n$-dimensional linear subspaces of $\mathbb{R}^D$ for $0 < n < D$. A point on the Grassmann manifold $G_{D,n}$ is represented by an orthonormal basis, i.e. a $D \times n$ matrix. The length of the shortest smooth curve connecting two points on a manifold is known as the geodesic distance. For Grassmann manifolds, the geodesic distance is given by:

  $\mathrm{dist}_G(X, Y) = \| \Theta \|_2$    (3)

where $\Theta = [\theta_1, \theta_2, \cdots, \theta_n]$ is the principal angle vector, i.e.

  $\cos(\theta_l) = \max_{x \in X,\, y \in Y}\; x^T y = x_l^T y_l$    (4)

subject to $\|x\| = \|y\| = 1$ and $x^T x_i = y^T y_i = 0$ for $i = 1, \ldots, l-1$. The principal angles have the property $\theta_i \in [0, \pi/2]$ and can be computed through the SVD of $X^T Y$ [7].

We note that the linear combination of a Grassmann distance (between linear subspaces) and a Mahalanobis distance (between origins) of two affine subspaces has roots in probabilistic subspace distances [9]. More specifically, consider two normal distributions $\mathcal{N}_1(\mu_1, C_1)$ and $\mathcal{N}_2(\mu_2, C_2)$, with $C_i = \sigma^2 I + U_i U_i^T$ as the covariance matrix and $\mu_i$ as the mean vector. The symmetric Kullback-Leibler (KL) distance between $\mathcal{N}_1$ and $\mathcal{N}_2$ under the orthonormality condition (i.e.
$U_i^T U_i = I_n$) results in:

  $J_{KL} = \frac{1}{2\sigma^2} (\mu_1 - \mu_2)^T \left( 2 I_D - U_1 U_1^T - U_2 U_2^T \right) (\mu_1 - \mu_2) + \frac{1}{2\sigma^2(\sigma^2+1)} \left( 2n - 2\,\mathrm{tr}\!\left( U_1^T U_2 U_2^T U_1 \right) \right)$    (5)

The term $\mathrm{tr}( U_1^T U_2 U_2^T U_1 )$ in $J_{KL}$ is identified with the projection distance on the Grassmann manifold $G_{D,n}$ (defined as $\mathrm{dist}_{Proj}(U_1, U_2) = \| \sin(\Theta) \|_2$) [9], and the term $(\mu_1 - \mu_2)^T \left( 2 I_D - U_1 U_1^T - U_2 U_2^T \right) (\mu_1 - \mu_2)$ is the Mahalanobis distance with $M = 2 I_D - U_1 U_1^T - U_2 U_2^T$. Since the geodesic distance is a more natural choice for measuring lengths on Grassmann manifolds (compared to the projection distance), we have elected to combine it with the Mahalanobis distance from (5), resulting in the following instantiation of the general form given in Eqn. (2):

  $\mathrm{dist}(\mathcal{A}_i, \mathcal{A}_j) = \mathrm{dist}_G(U_i, U_j) + \alpha (\mu_i - \mu_j)^T \left( 2 I_D - U_i U_i^T - U_j U_j^T \right) (\mu_i - \mu_j)$

We measure the likelihood of a candidate subspace $\mathcal{A}_i^{(t)}$, given template $m_j$, as follows:

  $p\left( \mathcal{A}_i^{(t)} \mid m_j \right) = \exp\left( - \frac{ \mathrm{dist}( \mathcal{A}_i^{(t)}, m_j ) }{ \sigma } \right)$    (6)

where $\sigma$ indicates the standard deviation of the likelihood function and is a parameter of the tracking framework. The likelihoods are normalised such that $\sum_{i=1}^{N} p( \mathcal{A}_i^{(t)} \mid m_j ) = 1$. To measure the likelihood between a candidate affine subspace $\mathcal{A}_i^{(t)}$ and the bag $\mathcal{M}$, the individual likelihoods between $\mathcal{A}_i^{(t)}$ and the bag templates $m_j$ should be integrated. Based on [17], we opt for the sum rule:

  $p\left( \mathcal{A}_i^{(t)} \mid \mathcal{M} \right) = \sum\nolimits_{j=1}^{k} p\left( \mathcal{A}_i^{(t)} \mid m_j \right)$    (7)

The object state is then estimated as $\widehat{\mathcal{Z}}^{(t)} = \mathcal{Z}_j^{(t)}$, where

  $j = \operatorname{argmax}_i\; p\left( \mathcal{A}_i^{(t)} \mid \mathcal{M} \right)$    (8)

2.5. Computational Complexity

The computational complexity of the proposed tracking framework is associated with generating a new model and comparing a target candidate with a model. The model generation step requires $O(D^3 + 2Dn)$ operations. Computing the geodesic distance between two points on $G_{D,n}$ requires $O((D+1)n^2 + n^3)$ operations.
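The geodesic distance of Eqns. (3)-(4) and the combined affine distance derived from Eqn. (5) can be sketched as follows; the principal angles are obtained from the SVD of $U_1^T U_2$. The function names are ours, and $\alpha = 1$ is an arbitrary mixing weight, not a value from the paper.

```python
import numpy as np

def geodesic_distance(U1, U2):
    """Geodesic distance on the Grassmann manifold: the principal angles
    are the arccosines of the singular values of U1^T U2, and the distance
    is the 2-norm of the principal angle vector."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    theta = np.arccos(np.clip(s, -1.0, 1.0))   # clip guards rounding above 1
    return float(np.linalg.norm(theta))

def affine_distance(mu1, U1, mu2, U2, alpha=1.0):
    """Combined distance of Eqn. (2) using the KL-derived Mahalanobis matrix
    M = 2 I_D - U1 U1^T - U2 U2^T from Eqn. (5)."""
    D = mu1.shape[0]
    M = 2.0 * np.eye(D) - U1 @ U1.T - U2 @ U2.T
    d = mu1 - mu2
    return geodesic_distance(U1, U2) + alpha * float(d @ M @ d)
```

As a sanity check, identical subspaces with identical origins are at distance zero, while two orthogonal one-dimensional subspaces are a quarter-turn (pi/2) apart.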
Therefore, comparing an affine subspace candidate against each bag template needs $O\left( (2n+3)D^2 + (n^2+1)D + n^3 + n^2 \right)$ operations.

3. Experiments

In this section we evaluate and analyse the performance of the proposed AST method using eight publicly available videos⁶ covering two main tracking tasks: face tracking and object tracking. The sequences are: Occluded Face [1], Occluded Face 2 [4], Girl [6], Tiger 1 [4], Tiger 2 [4], Coke Can [4], Surfer [4], and Coupon Book [4]. Example frames from several videos are shown in Fig. 6.

⁶ The videos and the corresponding ground truth are available at http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml

Each video is composed of 8-bit grayscale images, resized to 320 × 240 pixels. We used the raw pixel values as image features. For the sake of computational efficiency in the affine subspace representation, we resized each candidate image region to 32 × 32, and the number of eigenvectors ($n$) used in all experiments is set to three. Furthermore, we only consider 2D translation and scaling in the motion modelling component. The batch size ($W$) for the template update is set to five, as a trade-off between computational efficiency and the effectiveness of modelling appearance change during fast motion.

We evaluated the proposed tracker based on (i) average center location error, and (ii) precision [4]. Precision is the percentage of frames for which the estimated object location is within a threshold distance of the ground truth. Following [4], we use a fixed threshold of 20 pixels.

To contrast the effect of affine subspace modelling against linear subspaces, we assessed the performance of the AST tracker against a tracker that only exploits linear subspaces, i.e. an AST where $\mu = 0$ for all models. The results, in terms of center location errors, are shown in Table 1.
The proposed AST method significantly outperforms the linear subspace approach, thereby confirming our idea of affine subspace modelling.

Algorithm 1: Affine Subspace Tracking

Input:
  - New frame, a set of updated candidate object states from the last frame, and the previous P-1 estimated object states { Ẑ^(τ) }, τ = t-P+1, ..., t-1

1: Initialisation (t = 1 : P):
  - Set the initial object state Ẑ^(t) in the first P frames.
  - Use a single state to indicate the location.

2: Begin:
  - Select candidate object states according to the dynamic model { Z_i^(t) }, i = 1, ..., N
  - For each sample, extract the corresponding image patch
  - For each Z_i^(t):
      - Generate the affine subspace A_i^(t) = { μ_i^(t), U_i^(t) } based on the image regions corresponding to Z_i^(t) and { Ẑ^(τ) }, τ = t-P+1, ..., t-1
      - Calculate the likelihood given each template in the bag via Eqn. (6)
      - Compute the final likelihood via Eqn. (7)
  - Determine the object state Ẑ^(t) by Maximum Likelihood (ML) estimation
  - Update the existing candidate object states according to their probabilities [14]

Output: current object state Ẑ^(t)

Figure 6. Examples of bounding boxes resulting from tracking on several video sequences. For the sake of clarity, we only demonstrate the results of the overall top four trackers. (a) Surfer [4]: includes large pose variations and occlusion; (b) Coupon Book [4]: contains severe appearance change, in addition to an imposter introduced to distract the tracker; (c) Occluded Face 2 [4]: contains various occlusions; (d) Girl [6]: involves partial and full occlusion, and large pose changes.

Table 1. Performance comparison between tracking based on affine and linear subspaces, in terms of average center location errors (pixels).

  Video             AST (proposed)   Linear subspace
  Surfer                  8                39
  Coke Can                9                31
  Girl                   19                29
  Tiger 1                22                38
  Tiger 2                15                42
  Coupon Book             8                25
  Occluded Face          14                27
  Occluded Face 2        13                24
  Average error          13.5              31.88

3.1.
Quantitative Comparison

To assess and contrast the performance of the AST tracker against state-of-the-art methods, we consider six competitors: the fragment-based tracker (FragTrack) [1], the multiple instance boosting-based tracker (MILTrack) [4, 3], online AdaBoost (OAB) [8], tracking-learning-detection (TLD) [15], incremental visual tracking (IVT) [19], and the Sparsity-based Collaborative Model tracker (SCM) [26]. We use the publicly available source codes for FragTrack¹, MILTrack², OAB², TLD³, IVT⁴ and SCM⁵.

Tables 2 and 3 show the performance, in terms of location error and precision respectively, for the proposed AST method as well as the competing trackers. Fig. 6 shows the resulting bounding boxes for several frames from the Surfer, Coupon Book, Occluded Face 2 and Girl sequences. On average, the proposed AST method obtains notably better performance than the competing trackers, with TLD being the second best tracker.

¹ http://www.cs.technion.ac.il/~amita/fragtrack/fragtrack.htm
² http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml
³ http://info.ee.surrey.ac.uk/Personal/Z.Kalal/
⁴ http://www.cs.toronto.edu/~dross/ivt/
⁵ http://ice.dlut.edu.cn/lu/Project/cvpr12_scm/cvpr12_scm.htm

Table 2. Comparison of the proposed AST method against competing trackers, in terms of average center location errors (pixels). Best performance is indicated by *, while second best by **.

  Video            AST (proposed)  TLD [15]  MILTrack [4]  SCM [26]  OAB [8]  IVT [19]  FragTrack [1]
  Surfer                 8*           9**        11           76        23       30        139
  Coke Can               9*          13**        20            9*       25       61         63
  Girl                  19**         28          32           10*       48       52         27
  Tiger 1               22           10*         16**         37        35       59         39
  Tiger 2               15*          15*         18**         43        33       43         37
  Coupon Book            8*          37          15**         36        25       17         56
  Occluded Face         14           16          27            4*       43        9          6**
  Occluded Face 2       13**         28          20            8*       21       17         45
  Average error         13.5*        19.49**     19.87        27.87     31.62    36.00      51.5

Table 3. Precision at a fixed threshold of 20 pixels, as per [4].
Best performance is indicated by *, while second best by **. The higher the precision, the better.

  Video             AST (proposed)  TLD [15]  MILTrack [4]  SCM [26]  OAB [8]  IVT [19]  FragTrack [1]
  Surfer                0.98*         0.97**     0.93         0.10      0.51     0.19       0.28
  Coke Can              0.99*         0.98**     0.55         0.97      0.45     0.13       0.14
  Girl                  0.73**        0.42       0.32         0.97*     0.11     0.50       0.51
  Tiger 1               0.54          0.92*      0.81**       0.35      0.48     0.32       0.28
  Tiger 2               0.83*         0.81**     0.83*        0.14      0.51     0.29       0.22
  Coupon Book           0.94*         0.66       0.69**       0.52      0.67     0.57       0.41
  Occluded Face         0.79          0.64       0.43         1.00*     0.22     0.94       0.95**
  Occluded Face 2       0.75**        0.18       0.60         0.95*     0.61     0.72       0.44
  Average precision     0.82*         0.69**     0.64         0.63      0.44     0.45       0.40

3.2. Qualitative Comparison

Heavy occlusions. Occlusion is one of the major issues in object tracking. Trackers such as SCM, FragTrack and IVT are designed to resolve this problem. Other trackers, including TLD, MILTrack and OAB, are less successful in handling occlusions, notably at frames 271, 529 and 741 of the Occluded Face sequence, and frames 176, 432 and 607 of Occluded Face 2. SCM obtains good performance mainly because it is capable of handling partial occlusions via a patch-based model. The proposed AST approach can tolerate occlusions to some extent, thanks to the properties of its appearance model. One prime example is Occluded Face 2, where AST accurately localised the severely occluded object at frame 730.

Pose variations. On the Tiger 2 sequence, most trackers, including SCM, IVT and FragTrack, fail to track the object from the early frames onwards. On Tiger 2, the proposed AST approach can accurately follow the object at frames 207 and 271, where all the other trackers have failed. In addition, compared to the other trackers, the proposed approach partly handles motion blurring (e.g.
frame 344), where the blurring is a side-effect of rapid pose variations. On Tiger 1, although TLD obtains the best performance, AST can successfully locate the object (in contrast to the other trackers) at frames 204 and 249, which are subject to occlusion and severe illumination changes.

Rotations. The Girl and Surfer sequences include drastic out-of-plane and in-plane rotations. On Surfer, FragTrack and SCM fail to track from the start. The proposed AST approach consistently tracks the surfer and outperforms the other trackers. On Girl, the IVT, OAB, and FragTrack methods fail to track in many frames. While IVT is able to track in the beginning, it fails after frame 230. The AST approach manages to track the correct person throughout the whole sequence, especially towards the end, where the other trackers fail due to heavy occlusion.

Illumination changes. The Coke Can sequence contains dramatic illumination changes. FragTrack fails from frame 20, where the first signs of illumination change appear. IVT and OAB fail from frame 40, where the frames include both severe illumination changes and slight motion blur. MILTrack fails after frame 179, where part of the object is almost washed out by the light. Since affine subspaces provide robustness to illumination changes, the proposed AST approach can accurately locate the object throughout the whole sequence.

Imposters/Distractors. The Coupon Book sequence contains a severe appearance change, as well as an imposter book to distract the tracker. FragTrack and TLD fail mainly where the imposter book appears. AST successfully tracks the correct book with notably better accuracy than the other methods.

4. Main Findings and Future Directions

In this paper we investigated the problem of object tracking in a video stream where the object's appearance can drastically change due to factors such as occlusions and/or variations in illumination and pose.
The selection of subspaces for target representation, in addition to a regular subspace update, is mainly driven by the need for an adaptive object template that reflects appearance changes. We argued that modelling the appearance by affine subspaces, and applying this notion to both the object templates and the query data, leads to more robustness. Furthermore, we maintain a record of $k$ previously observed templates for a more robust tracker.

We also presented a novel subspace-to-subspace measurement approach by reformulating the problem over Grassmann manifolds, which provides the target representation with more robustness against intrinsic and extrinsic variations. Finally, the tracking problem was considered as an inference task in a Markov Chain Monte Carlo framework, using particle filters to propagate sample distributions over time.

Comparative evaluation on challenging video sequences against several state-of-the-art trackers shows that the proposed AST approach obtains superior accuracy, effectiveness and consistency with respect to illumination changes, partial occlusions, and various appearance changes. Unlike the other methods, AST involves no training phase.

There are several challenges, such as drift and motion blurring, that still need to be addressed. A solution to drift could be to formulate the update process in a semi-supervised fashion, in addition to including a training stage for the detector. Future research directions also include enhancing the updating scheme by measuring the effectiveness of a newly learned model before adding it to the bag of models. To resolve motion blurring issues, the framework could be enhanced with blur-driven models and particle filter distributions. Furthermore, an interesting extension would be multi-object tracking and how to join multiple object models.
Acknowledgements

NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] A. Adam, E. Rivlin, and I. Shimshoni. Robust fragments-based tracking using the integral histogram. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 798-805, 2006.
[2] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Processing, 50(2):174-188, 2002.
[3] B. Babenko, M. Yang, and S. Belongie. Visual tracking with online multiple instance learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 983-990, 2009.
[4] B. Babenko, M. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1619-1632, 2011.
[5] R. Basri, T. Hassner, and L. Zelnik-Manor. Approximate nearest subspace search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):266-278, 2011.
[6] S. Birchfield. Elliptical head tracking using intensity gradients and color histograms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 232-237, 1998.
[7] A. Edelman, T. Arias, and S. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303-353, 1998.
[8] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In British Machine Vision Conference, volume 1, pages 47-56, 2006.
[9] J. Hamm and D. Lee. Extended Grassmann kernels for subspace-based learning. In Advances in Neural Information Processing Systems (NIPS), pages 601-608, 2009.
[10] M. Harandi, C. Sanderson, C. Shen, and B. C. Lovell. Dictionary learning and sparse coding on Grassmann manifolds: An extrinsic solution. In Int. Conference on Computer Vision (ICCV), 2013.
[11] M. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell. Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2705-2712, 2011.
[12] M. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell. Kernel analysis on Grassmann manifolds for action recognition. Pattern Recognition Letters, 34(15):1906-1915, 2013.
[13] J. Ho, K. Lee, M. Yang, and D. Kriegman. Visual tracking using learned linear subspaces. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 782-789, 2004.
[14] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. European Conference on Computer Vision (ECCV), pages 343-356, 1996.
[15] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409-1422, 2012.
[16] T. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image set classes using canonical correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1005-1018, 2007.
[17] J. Kittler, M. Hatef, R. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226-239, 1998.
[18] X. Li, A. Dick, C. Shen, A. van den Hengel, and H. Wang. Incremental learning of 3D-DCT compact representations for robust visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):863-881, 2013.
[19] D. Ross, J. Lim, R. Lin, and M. Yang. Incremental learning for robust visual tracking. Int. Journal of Computer Vision (IJCV), 77(1):125-141, 2008.
[20] C. Sanderson, M. Harandi, Y. Wong, and B. C. Lovell. Combined learning of salient local descriptors and distance metrics for image set face verification. In IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), pages 294-299, 2012.
[21] S. Shirazi, M. Harandi, C. Sanderson, A. Alavi, and B. C. Lovell. Clustering on Grassmann manifolds via kernel embedding with application to action analysis. In Int. Conference on Image Processing (ICIP), pages 781-784, 2012.
[22] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2273-2286, 2011.
[23] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
[24] S. Wang, H. Lu, F. Yang, and M.-H. Yang. Superpixel tracking. In Int. Conference on Computer Vision (ICCV), pages 1323-1330, 2011.
[25] T. Wang, A. Backhouse, and I. Gu. Online subspace learning on Grassmann manifold for moving object tracking in video. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 969-972, 2008.
[26] W. Zhong, H. Lu, and M.-H. Yang. Robust object tracking via sparsity-based collaborative model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1838-1845, 2012.