Learning where to Attend with Deep Architectures for Image Tracking


Authors: Misha Denil, Loris Bazzani, Hugo Larochelle, Nando de Freitas

Misha Denil (1), Loris Bazzani (2), Hugo Larochelle (3) and Nando de Freitas (1)
(1) University of British Columbia. (2) University of Verona. (3) University of Sherbrooke.

Keywords: Restricted Boltzmann machines, Bayesian optimization, bandits, attention, deep learning, particle filtering, saliency

Abstract

We discuss an attentional model for simultaneous object tracking and recognition that is driven by gaze data. Motivated by theories of perception, the model consists of two interacting pathways: identity and control, intended to mirror the what and where pathways in neuroscience models. The identity pathway models object appearance and performs classification using deep (factored)-Restricted Boltzmann Machines. At each point in time the observations consist of foveated images, with decaying resolution toward the periphery of the gaze. The control pathway models the location, orientation, scale and speed of the attended object. The posterior distribution of these states is estimated with particle filtering. Deeper in the control pathway, we encounter an attentional mechanism that learns to select gazes so as to minimize tracking uncertainty. Unlike in our previous work, we introduce gaze selection strategies which operate in the presence of partial information and on a continuous action space. We show that a straightforward extension of the existing approach to the partial information setting results in poor performance, and we propose an alternative method based on modeling the reward surface as a Gaussian Process. This approach gives good performance in the presence of partial information and allows us to expand the action space from a small, discrete set of fixation points to a continuous domain.
1 Introduction

Humans track and recognize objects effortlessly and efficiently, exploiting attentional mechanisms (Rensink, 2000; Colombo, 2001) to cope with the vast stream of data. We use the human visual system as inspiration to build a system for simultaneous object tracking and recognition from gaze data. An attentional strategy is learned online to choose fixation points which lead to low uncertainty in the location of the target object. Our tracking system is composed of two interacting pathways. This separation of responsibility is a common feature in models from the computational neuroscience literature, as it is believed to reflect a separation of information processing into ventral and dorsal pathways in the human brain (Olshausen et al., 1993a).

The identity pathway (ventral) is responsible for comparing observations of the scene to an object template using an appearance model and, at a higher level, for classifying the target object. The identity pathway consists of a two hidden layer deep network. The top layer corresponds to a multi-fixation Restricted Boltzmann Machine (RBM) (Larochelle & Hinton, 2010), as shown in Figure 1. It accumulates information from the first hidden layers at consecutive time steps. For the first layers, we use (factored)-RBMs (Hinton & Salakhutdinov, 2006; Ranzato & Hinton, 2010; Welling et al., 2005; Swersky et al., 2011), but autoencoders (Vincent et al., 2008), sparse coding (Olshausen & Field, 1996; Kavukcuoglu et al., 2009), two-layer ICA (Köster & Hyvärinen, 2007) and convolutional architectures (Lee et al., 2009) could also be adopted.

The control pathway (dorsal) is responsible for aligning the object template with the full scene so the remaining modules can operate independently of the object's position and scale.
This pathway is separated into a localization module and a fixation module which work cooperatively to accomplish this goal. The localization module is implemented with a particle filter (Doucet et al., 2001) which estimates the location, velocity and scale of the target object. We make no attempt to implement such states with neural architectures, but it seems clear that they could be encoded with grid cells (McNaughton et al., 2006) and retinotopic maps as in V1 and the superior colliculus (Rosa, 2002; Girard & Berthoz, 2005). The fixation module learns an attentional strategy to select fixation points relative to the object template.

Figure 1: From a sequence of gazes (v_t, v_{t+1}, ...), the model infers the hidden features h for each gaze (that is, the activation intensity of each receptive field), the hidden features for the fusion of the sequence of gazes, and the object class c. Only one time step of classification is kept in the figure for clarity. The location, size, speed and orientation of the gaze patch are encoded in the state x_t. The actions a_t follow a learned policy π_t that depends on the past rewards {r_1, ..., r_{t-1}}. This particular reward is a function of the belief state b_t = p(x_t | a_{1:t}, h_{1:t}), also known as the filtering distribution. Unlike in typical partially observed Markov decision processes (POMDPs), the reward is a function of the beliefs. In this sense, the problem is closer to one of sequential experimental design. With more layers in the ventral v − h − h^{[2]} − c pathway, other rewards and policies could be designed to implement higher-level attentional strategies.
These fixation points are the centres of partial template observations, and are compared with observations of the corresponding locations in the scene using the appearance model (see Figure 2). Reward is assigned to each fixation based on the uncertainty of the target location at each time step. The fixation module uses the reward signal to adapt its gaze selection policy to achieve good localization. Our previous work (Bazzani et al., 2010) used Hedge (Auer et al., 1998a; Freund & Schapire, 1997) to learn this policy. In this extended paper we show that a straightforward adaptation of our previous approach to the partial information setting results in poor performance, and we propose an alternative method based on modelling the reward surface as a Gaussian Process. This approach gives good performance in the presence of partial information and allows us to expand the action space from a small, discrete set of fixation points to a continuous domain.

The proposed system can be motivated from different perspectives. First, starting with Isard & Blake (1996), many particle filters have been proposed for image tracking, but these typically use simple observation models such as B-splines (Isard & Blake, 1996) and colour templates (Okuma et al., 2004). RBMs are more expressive models of shape, and hence we conjecture that they will play a useful role where simple appearance models fail. Second, from a deep learning computational perspective, this work allows us to tackle large images and video, which is typically not possible due to the number of parameters required to represent large images in deep models. The use of fixations, synchronized with information about the state (e.g. location and scale) of such fixations, eliminates the need to look at the entire image or video. Third, the system is invariant to image transformations encoded in the state, such as location, scale and orientation.
Fourth, from a dynamic sensor network perspective, this paper presents a very simple but efficient and novel way of deciding how to gather measurements dynamically. Lastly, in the context of psychology, the proposed model realizes to some extent the functional architecture for dynamic scene representation of Rensink (2000). The rate at which different attentional mechanisms develop in newborns (including alertness, saccades and smooth pursuit, attention to object features and high-level task driven attention) guided the design of the proposed approach and was a great source of inspiration (Colombo, 2001).

Figure 2: Left: A typical video frame with the estimated target region highlighted. To cope with the large image size our system considers only the target region at each time step. Centre left: A close-up of the template extracted from the first frame. The template is compared to the target region by selecting a fixation point for comparison as shown. Centre right: A visualization of a single fixation. In addition to covering only a very small portion of the original frame, the image is foveated with high resolution near the centre and low resolution on the periphery to further reduce the dimensionality. Right: The most active filters of the first layer (factored)-RBM when observing the displayed location. The control pathway compares these features to the features active at the corresponding scene location in order to update the belief state.

Our attentional model can be seen as building a saliency map (Koch & Ullman, 1985) over the target template. Previous work on saliency modelling has focused on identifying salient points in an image using a bottom-up process which looks for outliers under some local feature model (which may include a task dependent prior, global scene features, or various other heuristics).
These features can be computed from static images (Torralba et al., 2006), or from local regions of spacetime (Gaborski et al., 2004) for video. Additionally, a wide variety of different feature types have been applied to this problem, including engineered features (Gao et al., 2007) as well as features that are learned from data (Zhang et al., 2009). Core to these methods is the idea that saliency is determined by some type of novelty measure. Our approach is different in that, rather than identifying locally or globally novel features, our process identifies features which are useful for the task at hand. In our system the saliency signal for a location comes from a top-down process which evaluates how well the features at that location enable the system to localize the target object. The work of Gao et al. (2007) considers a similar approach to saliency by defining saliency to be the mutual information between the features at a location and the class label of an object being sought; however, in order to make their model tractable the authors are forced to use specifically engineered features. Our system is able to cope with arbitrary feature types, and although we consider only localization in this paper, our model is sufficiently general to be applied to identifying salient features for other goals.

Recently, a dynamic RBM state-space model was proposed in Taylor et al. (2010). Both the implementation and the intention behind that proposal are different from the approach discussed here. To the best of our knowledge, our approach is the first successful attempt to combine dynamic state estimation from gazes with online policy learning for gaze adaptation, using deep network models of appearance. Many other dual-pathway architectures have been proposed in computational neuroscience, including Olshausen et al. (1993b) and Postma et al.
(1997), but we believe ours has the advantage that it is very simple, modular (with each module easily replaceable), suitable for large datasets and easy to extend.

2 Identity Pathway

The identity pathway in our model mirrors the ventral pathway in neuroscience models. It is responsible for modelling the appearance of the target object and also, at a higher level, for classification.

2.1 Appearance Model

We use (factored)-RBMs to model the appearance of objects and perform object classification using the gazes chosen by the control module (see Figure 3). These undirected probabilistic graphical models are governed by a Boltzmann distribution over the gaze data v_t and the hidden features h_t ∈ {0, 1}^{n_h}. We assume that the receptive fields W, also known as RBM weights or filters, have been trained beforehand. We also assume that readers are familiar with these models and, if otherwise, refer them to Ranzato & Hinton (2010) and Swersky et al. (2010).

Figure 3: An RBM senses a small foveated image derived from the video. The level of activation of each filter is recorded in the h_t units. The RBM weights (filters) W are visualized in the upper left. We currently pre-train these weights.

2.2 Classification Model

The identity pathway also performs object recognition, classifying a sequence of gaze instances selected with the gaze policy. We implement a multi-fixation RBM very similar to the one proposed in Larochelle & Hinton (2010), where the binary variables z_t (see Figure 4) are introduced to encode the relative gaze location a_t within the multi-fixation RBM (a "1 in K" or "one hot" encoding of the gaze location was used for z_t). The multi-fixation RBM uses the relative gaze location information in order to aggregate the first hidden layer representations h_t at ∆ consecutive time steps into a single, higher level representation h^{[2]}_t.
More specifically, the energy function of the multi-fixation RBM is:

E(h_{t−∆+1:t}, z_{t−∆+1:t}, h^{[2]}_t) = −d^T h^{[2]}_t − Σ_{i=1}^{∆} [ b^T h_{t−∆+i} + Σ_{f=1}^{F} (P_{f,:} h^{[2]}_t)(W_{f,:} h_{t−∆+i})(V_{f,:} z_{t−∆+i}) ],

where the notation P_{f,:} refers to the f-th row vector of the matrix P. From this energy function, we define a distribution over h_{t−∆+1:t} and h^{[2]}_t (conditioned on z_{t−∆+1:t}) through the Boltzmann distribution:

p(h_{t−∆+1:t}, h^{[2]}_t | z_{t−∆+1:t}) = exp(−E(h_{t−∆+1:t}, z_{t−∆+1:t}, h^{[2]}_t)) / Z(z_{t−∆+1:t}),    (1)

where the normalization constant Z(z_{t−∆+1:t}) ensures that Equation 1 sums to 1. To sample from this distribution, one can use Gibbs sampling by alternating between sampling the top-most hidden layer h^{[2]}_t given all individual processed gazes h_{t−∆+1:t} and vice versa. To train the multi-fixation RBM, we collect a training set consisting of sequences of ∆ pairs (h_t, z_t) by randomly selecting ∆ gaze positions at which to fixate and computing the associated h_t. These sets are extracted from a collection of images in which the object to detect has been centred. Unsupervised learning using contrastive divergence can then be performed on this training set. See Larochelle & Hinton (2010) for more details.

Figure 4: Gaze accumulation and classification in the identity pathway. A multi-fixation RBM models the conditional distribution (given the gaze positions a_t) of ∆ consecutive hidden features h_t, extracted by the first layer RBM on the foveated images. In this illustration, ∆ = 2. The multi-fixation RBM encodes the gaze position a_t in a "one hot" representation denoted z_t. The activation probabilities of the second layer hidden units h^{[2]}_t are used by a classifier to predict the object's class.
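For concreteness, the energy function above, and the second-layer conditional it induces (the activation probabilities given in Section 2.2), can be sketched in NumPy. This is a minimal sketch, not the paper's implementation; the matrix shapes and variable names are assumptions, with W, V, P the factored weights and b, d the first/second layer biases.

```python
import numpy as np

def multifix_energy(H, Z, h2, W, V, P, b, d):
    """Energy E(h_{t-Delta+1:t}, z_{t-Delta+1:t}, h2) of the multi-fixation RBM.
    Rows of H are the Delta first-layer feature vectors, rows of Z the one-hot
    gaze codes; W:(F, n_h), V:(F, K), P:(F, n_h2), b:(n_h,), d:(n_h2,)."""
    # sum_i sum_f (P_{f,:} h2)(W_{f,:} h_i)(V_{f,:} z_i)
    interaction = (((H @ W.T) * (Z @ V.T)) @ (P @ h2)).sum()
    return -d @ h2 - (H @ b).sum() - interaction

def second_layer_probs(H, Z, W, V, P, d):
    """p(h2_j = 1 | h, z) = sigm(d_j + sum_i sum_f P_{f,j}(W_{f,:} h_i)(V_{f,:} z_i))."""
    s = ((H @ W.T) * (Z @ V.T)).sum(axis=0)        # (F,) summed over fixations
    return 1.0 / (1.0 + np.exp(-(d + P.T @ s)))    # elementwise sigmoid
```

By construction, the probability that unit j is on equals the sigmoid of the energy drop obtained by switching that unit on, which gives a quick consistency check on both functions.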
The main difference between this multi-fixation RBM and the one described in Larochelle & Hinton (2010) is that h^{[2]}_t does not explicitly model the class label c_t. Instead, a multinomial logistic regression classifier is trained separately to predict c_t from the aggregated representation extracted from h^{[2]}_t. More specifically, we use the vector of activation probabilities of all hidden units h^{[2]}_{t,j} in h^{[2]}_t, conditioned on h_{t−∆+1:t} and z_{t−∆+1:t}, as the aggregated representation:

p(h^{[2]}_{t,j} = 1 | h_{t−∆+1:t}, z_{t−∆+1:t}) = sigm( d_j + Σ_{i=1}^{∆} Σ_{f=1}^{F} P_{f,j} (W_{f,:} h_{t−∆+i})(V_{f,:} z_{t−∆+i}) ).

We experimented with a single fixation module, but found the multi-fixation module to increase classification accuracy. To improve the estimate of the class variable c_t over time, we accumulate the classification decisions at each time step.

Note that the process of pursuit (tracking) is essential to classification. As the target is tracked, the algorithm fixates at locations near the target's estimated location. The size and orientation of these fixations also depend on the corresponding state estimates. Note that we do not fixate exactly at the target location estimate, as this would provide only one distinct fixation over several time steps if the tracking policy has converged to a specific gaze. It should also be pointed out that instead of using random fixations, one could again use the control strategy proposed in this paper to decide where to look with respect to the track estimate so as to reduce classification uncertainty. We leave the implementation of this extra attentional mechanism for future work.

3 Control Pathway

The control pathway mirrors the responsibility of the dorsal pathway in human visual processing. It tracks the state of the target (position, speed, etc.) and normalizes the input so that other modules need not account for these variations.
At a higher level it is responsible for learning an attentional strategy which maximizes the amount of information gained with each fixation. The structure of the control pathway is shown in Figure 5.

3.1 State-space model

The standard approach to image tracking is based on the formulation of Markovian, nonlinear, non-Gaussian state-space models, which are solved with approximate Bayesian filtering techniques.

Figure 5: Influence diagram for the control pathway. The true state of the tracked object, x_t, generates some set of features h_t in the identity pathway. These features depend on the action chosen at time t and are used to update the belief state b_t. Statistics of the belief state are collected to compute the reward r_t, which is used to update the policy for the next time step.

In this setting, the unobserved signal (the object's position, velocity, scale, orientation or a discrete set of operations) is denoted {x_t ∈ X; t ∈ N}. This signal has initial distribution p(x_0) and transition equation p(x_t | x_{t−1}, a_{t−1}). Here a_t ∈ A denotes an action at time t, defined on a compact set A. For discrete policies A is finite, whereas for continuous policies A is a region in R^2. The observations {h_t ∈ H; t ∈ N*} are assumed to be conditionally independent given the process state {x_t; t ∈ N}. Note that from the state-space model perspective, the observations are the hidden units of the second layer of the appearance model in the identity pathway. In summary, the state-space model is described by the following distributions:

p(x_0),
p(x_t | x_{t−1}, a_{t−1})  for t ≥ 1,
p(h_t | x_t, a_t)  for t ≥ 1.

For the transition model, we will adopt a classical autoregressive process.
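An autoregressive transition prior of this kind can be sketched as a constant-velocity model with Gaussian noise. The state layout [px, py, vx, vy] and the noise scales below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def transition(x_prev, rng):
    """One draw from a simple autoregressive transition prior
    p(x_t | x_{t-1}) for a state [px, py, vx, vy]: position advances
    by velocity, then Gaussian noise perturbs all components."""
    A = np.array([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    noise = rng.normal(scale=[0.5, 0.5, 0.1, 0.1])  # looser on position
    return A @ x_prev + noise
```

Scale and orientation components would extend the state vector in the same way.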
Our aim is to estimate recursively in time the posterior distribution^1 p(x_{0:t} | h_{1:t}, a_{1:t}) and its associated features, including the marginal distribution b_t ≜ p(x_t | h_{1:t}, a_{1:t}), known as the filtering distribution or belief state. This distribution satisfies the following recurrence:

b_t ∝ p(h_t | x_t, a_t) ∫ p(x_t | x_{t−1}, a_{t−1}) p(dx_{t−1} | h_{1:t−1}, a_{1:t−1}).

Except for standard distributions (e.g. Gaussian or discrete), this recurrence is intractable.

After learning the observation model we will use it for tracking. The observation model is often defined in terms of the distance of the observations from a template τ,

p(h_t | x_t, a_t) ∝ exp(−d(h(x_t, a_t), τ)),

where d(·,·) denotes a distance metric and τ an object template (for example, a color histogram or spline). In this model, the observation h(x_t, a_t) is a function of the current state hypothesis and the selected action. The problem with this approach is eliciting a good template. Often color histograms or splines are insufficient. For this reason, we will construct the templates with (factored)-RBMs as follows. First, optical flow is used to detect new object candidates entering the visual scene. Second, we assign a template to the detected object candidate, as shown in Figure 2. The same figure also shows a typical foveated observation (higher resolution in the center and lower in the periphery of the gaze) and the receptive fields for this observation learned beforehand with an RBM. The control algorithm will be used to learn which parts of the template are most informative, either by picking from among a predefined set of fixation points, or by using a continuous policy. Finally, we define the likelihood of each observation directly in terms of the distance of the hidden units of the RBM, h(x_t, a_t, v_t), to the hidden units of the corresponding template region, h(x_1, a_t = k, v_1).
That is,

p(h_t | x_t, a_t = k) ∝ exp(−d(h(x_t, a_t = k, v_t), h(x_1, a_t = k, v_1))).

The above template is static, but conceivably one could adapt it over time.

3.2 Reward Function

A gaze control strategy specifies a policy π(·) for selecting fixation points. The purpose of this strategy is to select fixation points which maximize an instantaneous reward function r_t(·). The reward can encode any desired behaviour for the system, such as minimizing posterior uncertainty or achieving a more abstract goal. We focus on gathering observations so as to minimize the uncertainty in the estimate of the filtering distribution:

r_t(a_t | b_t) ≜ u[p̃(x_t | h_{1:t}, a_{1:t})].

More specifically, as discussed later, this reward will be a function of the variance of the importance weights of the particle filter approximation p̃(x_t | h_{1:t}, a_{1:t}) of the belief state. It is also useful to consider the cumulative reward

R_T = Σ_{t=1}^{T} r_t(a_t | b_t),

which is the sum of the instantaneous rewards received up to time T. The gaze control strategies we consider are all "no-regret", which means that the average gap between our cumulative reward and the cumulative reward from always picking the optimal action goes to zero as T → ∞. In our current implementation, each action is a different gaze location and the objective is to choose where to look so as to minimize the uncertainty about the belief state.

^1 We use the notation x_{0:t} ≜ {x_0, ..., x_t} to represent the past history of a variable over time.

4 Gaze control

We compare several different strategies for learning the gaze selection policy. In an earlier version of this work (Bazzani et al., 2010) we learned the gaze selection policy with a portfolio allocation algorithm called Hedge (Freund & Schapire, 1997; Auer et al., 1998b).
Hedge requires knowledge of the rewards for all actions at each time step, which is not realistic when gazes must be performed sequentially, since the target object will move between fixations. We compare this strategy, as well as two baseline methods, to two very different alternatives.

EXP3 is an extension of Hedge to partial information games (Auer et al., 2001). Unlike Hedge, EXP3 requires knowledge of the reward only for the action selected at each time step. EXP3 is more appropriate to the setting at hand, and is also more computationally efficient than Hedge; however, this comes at a cost of substantially weaker theoretical guarantees.

Both Hedge and EXP3 learn gaze selection policies which choose among a discrete set of predetermined fixation points. We can instead learn a continuous policy by estimating the reward surface using a Gaussian Process (Rasmussen & Williams, 2006). By assuming that the reward surface is smooth, we can draw on the tools of Bayesian optimization (Brochu et al., 2010) to search for the optimal gaze location using as few exploratory steps as possible. The following sections describe each of these approaches in more detail.

4.1 Baseline

We consider two baseline strategies, which we call random and circular. The random strategy samples gaze selections uniformly from a small discrete set of possibilities. The circular strategy also uses a small discrete set of gaze locations and cycles through them in a fixed order.

4.2 Hedge

To use Hedge (Freund & Schapire, 1997; Auer et al., 1998b) for gaze selection we must first discretize the action space by selecting a fixed finite number of possible fixation points. Hedge maintains an importance weight G(i) for each possible fixation point and uses them to form a stochastic policy at each time step. An action is selected according to this policy and the reward for each possible action is observed.
These rewards are then used to update the importance weights and the process repeats. Pseudocode for Hedge is shown in Algorithm 1.

Algorithm 1 Hedge
  Input: γ > 0
  Input: G_0(i) ← 0 for each i ∈ A
  for t = 1, 2, ... do
    for i ∈ A do
      p_t(i) ← exp(γ G_{t−1}(i)) / Σ_{j ∈ A} exp(γ G_{t−1}(j))
    a_t ∼ (p_t(1), ..., p_t(|A|))   // sample an action from the distribution (p_t(k))
    for i ∈ A do
      r_t(i) ← r_t(i | b_t)
      G_t(i) ← G_{t−1}(i) + r_t(i)

4.3 EXP3

EXP3 (Auer et al., 2001) is a generalization of Hedge to the partial information setting. In order to maintain estimates for the importance weights, Hedge requires reward information for each possible action at each time step. EXP3 works by wrapping Hedge in an outer loop which simulates a fully observed reward vector at each time step. EXP3 selects actions based on a mixture of the policy found by Hedge and a uniform distribution. EXP3 is able to function in the presence of partial information, but this comes at the cost of substantially worse theoretical guarantees. Pseudocode for EXP3 is shown in Algorithm 2.

Algorithm 2 EXP3
  Input: γ ∈ (0, 1]
  Initialize Hedge(γ)
  for t = 1, 2, ... do
    Receive p_t from Hedge
    p̂_t ← (1 − γ) p_t + γ / |A|
    a_t ∼ (p̂_t(1), ..., p̂_t(|A|))
    Simulate the reward vector for Hedge:
      r̂_t(j) ← r_t(j) / p̂_t(j)  if j = a_t,  0 otherwise

4.4 Bayesian Optimization

Both Hedge and EXP3 discretize the space of possible fixation points and learn a distribution over this finite set. In contrast, Bayesian optimization is able to treat the space of possible fixation points as fully continuous by placing a smoothness prior on how reward is expected to vary with location. Intuitively, if we know the reward at one location, then we expect other, nearby locations to produce similar rewards.
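Before developing the Gaussian Process alternative, the two bandit strategies above (Algorithms 1 and 2) can be sketched in NumPy. This is a minimal sketch: `reward_vector_fn` and `reward_fn` stand in for the particle-filter reward r_t(· | b_t), rewards are assumed to lie in [0, 1], and the γ values are illustrative.

```python
import numpy as np

def hedge_policy(G, gamma):
    """Softmax policy over cumulative rewards G (inner loop of Algorithm 1)."""
    logits = gamma * G
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    return p / p.sum()

def hedge(reward_vector_fn, n_actions, n_steps, gamma=0.1, seed=0):
    """Full-information Hedge: observes the reward of EVERY action at each
    step, which is what makes it unrealistic for sequential gazes."""
    rng = np.random.default_rng(seed)
    G = np.zeros(n_actions)
    for t in range(n_steps):
        p = hedge_policy(G, gamma)
        a = rng.choice(n_actions, p=p)   # a_t ~ (p_t(1), ..., p_t(|A|))
        G += reward_vector_fn(t)         # G_t(i) <- G_{t-1}(i) + r_t(i)
    return G

def exp3(reward_fn, n_actions, n_steps, gamma=0.1, seed=0):
    """Partial-information EXP3: mixes the Hedge policy with a uniform
    distribution and simulates a full reward vector by importance-weighting
    the single observed reward (zeros for the unchosen actions)."""
    rng = np.random.default_rng(seed)
    G = np.zeros(n_actions)
    for t in range(n_steps):
        p_hat = (1.0 - gamma) * hedge_policy(G, gamma) + gamma / n_actions
        a = rng.choice(n_actions, p=p_hat)
        G[a] += reward_fn(a) / p_hat[a]  # simulated reward vector entry
    return G
```

With a fixed reward vector, Hedge's cumulative rewards grow linearly in the true rewards, while EXP3 concentrates on the best arm while only ever observing the reward of the arm it pulled.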
Gaussian Process priors encode this type of belief (Rasmussen & Williams, 2006), and have been used extensively for optimization of cost functions when it is important to minimize the total number of function evaluations (Brochu et al., 2010).

We model the latent reward function r_t(a_t | b_t) ≜ r(a_t | b_t, θ_t) as a zero mean Gaussian Process

r(a_t | b_t, θ_t) ∼ GP(0, k(a_t, a'_t | b_t, θ_t)),

where b_t is the belief state (see Section 3.1) and θ_t are the model hyperparameters. The kernel function k(·,·) gives the covariance between the reward at any two gaze locations. To ease the notation, the explicit dependence of r(·) and k(·,·) on b_t and θ_t will be dropped.

We assume that the true reward function r(·) is not directly measurable, and that what we observe are measurements of this function corrupted by Gaussian noise. That is, at each time step the instantaneous reward r_t is given by r_t = r(a_t) + σ_n δ_n, where δ_n ∼ N(0, 1) and σ_n is a hyperparameter indicating the amount of observation noise, which we absorb into θ_t. Given a set of observations we can compute the posterior predictive distribution for r(·):

r(a | r_{1:t}, a_{1:t}) ∼ N(m_t(a), s_t^2(a)),    (2)
m_t(a) = k^T [K + σ_n^2 I]^{−1} r_{1:t},
s_t^2(a) = k(a, a) − k^T [K + σ_n^2 I]^{−1} k,

where

K = [ k(a_1, a_1) ⋯ k(a_1, a_t) ; ⋮ ⋱ ⋮ ; k(a_t, a_1) ⋯ k(a_t, a_t) ],
k = [ k(a_1, a) ⋯ k(a_t, a) ]^T,
r_{1:t} = [ r_1 ⋯ r_t ]^T.

It remains to specify the form of the kernel function k(·,·). We experimented with several possibilities, but found that the specific form of the kernel function is not critical to the performance of this method. For the experiments in this paper we used the squared exponential kernel,

k(a_i, a_j) = σ_m^2 exp( −(1/2) Σ_{k=1}^{D} ((a_{i,k} − a_{j,k}) / ℓ_k)^2 ),

where σ_m^2 and the {ℓ_1, ..., ℓ_D} are hyperparameters.

Equation 2 is a Gaussian Process estimate of the reward surface and can be used to select a fixation point for the next time step. This estimate gives both a predicted reward value and an associated uncertainty for each possible fixation point. This is the strength of Gaussian Processes for this type of optimization problem, since the predictions can be used to balance exploration (choosing a fixation point where the reward is highly uncertain) and exploitation (choosing a point we are confident will have high reward). There are many selection methods available in the literature which offer different tradeoffs between these two criteria. In this paper we use GP-UCB (Srinivas et al., 2010), which selects

a_{t+1} = argmax_a  m_t(a) + √(β_t) s_t(a),    (3)

where β_t is a parameter. The setting β_t = 2 log(t^3 π^2 / 3δ) (with δ = 0.001) is used throughout this paper.

Equation 3 must still be optimized to find a_{t+1}, which can be performed using standard global optimization tools. We use DIRECT (Jones et al., 1993) due to the existence of a readily available implementation.

The Gaussian Process regression is controlled by several hyperparameters (see Figure 6): σ_m^2 controls the overall magnitude of the covariance, and σ_n^2 controls the amount of observation noise. The remaining parameters {ℓ_1, ..., ℓ_D} are length scale parameters which control the range of the covariance effects in each dimension.

Treatment of the hyperparameters requires special consideration in this setting. The pure Bayesian approach is to put a prior on each parameter and integrate them out of the predictive distribution. However, since the integrals involved are not tractable analytically, this requires computationally expensive numerical approximations.
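Putting the pieces of Section 4.4 together, the posterior of Equation 2 with the squared exponential kernel and the GP-UCB rule of Equation 3 can be sketched as follows. This is a minimal sketch with point-estimate hyperparameters; the paper optimizes the acquisition with DIRECT, so the argmax over a candidate grid below is a simple stand-in, and all hyperparameter values are illustrative.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_m=1.0, lengths=1.0):
    """Squared-exponential kernel between gaze locations (rows of A and B)."""
    d2 = (((A[:, None, :] - B[None, :, :]) / lengths) ** 2).sum(-1)
    return sigma_m ** 2 * np.exp(-0.5 * d2)

def gp_posterior(A_obs, r_obs, A_query, sigma_n=0.1):
    """Posterior mean m_t(a) and variance s_t^2(a) of the reward surface
    (Equation 2) under a zero-mean GP prior with noisy reward observations."""
    K = sq_exp_kernel(A_obs, A_obs) + sigma_n ** 2 * np.eye(len(A_obs))
    k = sq_exp_kernel(A_obs, A_query)                      # (n_obs, n_query)
    mean = k.T @ np.linalg.solve(K, r_obs)
    var = sq_exp_kernel(A_query, A_query).diagonal() \
        - np.einsum('ij,ij->j', k, np.linalg.solve(K, k))
    return mean, var

def gp_ucb_select(A_obs, r_obs, A_candidates, t, delta=0.001):
    """GP-UCB (Equation 3): argmax over candidates of m_t(a) + sqrt(beta_t) s_t(a)."""
    mean, var = gp_posterior(A_obs, r_obs, A_candidates)
    beta_t = 2.0 * np.log(t ** 3 * np.pi ** 2 / (3.0 * delta))
    return int(np.argmax(mean + np.sqrt(beta_t * np.maximum(var, 0.0))))
```

Near observed gazes the posterior variance collapses and the mean tracks the observed rewards, while far from the data the variance reverts to the prior, which is exactly what drives the exploration term in GP-UCB.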
Speed is an issue here since GP-UCB requires that we optimize a function of the posterior process at each time step, so, for instance, computing Monte Carlo averages for each evaluation of Equation 2 is prohibitively slow.

An alternative approach is to choose parameter values via maximum likelihood. This can be done quickly, and allows us to make speedy predictions; however, in this case we suffer from problems of data scarcity, particularly early in the tracking process when few observations have been made. The length scale parameters are particularly prone to receiving very poor estimates when there is little data available.

Figure 6: Graphical model for Bayesian optimization. The ℓ_i are length scales in each dimension, σ_m^2 is the magnitude parameter and σ_n^2 is the noise level. In our model σ_m^2 and σ_n^2 follow a uniform prior and the ℓ_i follow independent Student-t priors.

We have found that using informative priors for the length scale parameters and making MAP, rather than ML, estimates at each time step provides a solution to the problems described above. MAP estimates can be made quickly using gradient optimization methods (Rasmussen & Williams, 2006), and informative priors provide resistance to the problems encountered with ML. The experiments in Section 6 place uniform priors on the magnitude and noise parameters and place independent Student-t priors on each length scale parameter. The experiments also use an initial data collection phase of 10 time steps before any adjustment of the parameters is made.

5 Algorithm

Since the belief state cannot be computed analytically, we will adopt particle filtering to approximate it. The full algorithm is shown in Algorithm 3. We refer readers to Doucet et al. (2001) for a more in-depth treatment of these sequential Monte Carlo methods.
Assume that at time $t-1$ we have $N \gg 1$ particles (samples) $\{x^{(i)}_{0:t-1}\}_{i=1}^N$ distributed according to $p(dx_{0:t-1} \mid h_{1:t-1}, a_{1:t-1})$. We can approximate this belief state with the following empirical distribution:
$$\hat p(dx_{0:t-1} \mid h_{1:t-1}, a_{1:t-1}) \triangleq \frac{1}{N} \sum_{i=1}^N \delta_{x^{(i)}_{0:t-1}}(dx_{0:t-1}).$$

Algorithm 3: Particle filtering algorithm with gaze control. The algorithm shown here is for partial information strategies. For full information strategies the importance sampling step is done independently for each possible action and the gaze control step is able to use reward information from each possible action to create the new strategy $\pi_{t+1}(\cdot)$.

1. Initialization
   For $i = 1, \ldots, N$: sample $x^{(i)}_0 \sim p(x_0)$.
   Initialize the policy $\pi_1(\cdot)$  // how this is done depends on the control strategy
   Then, for $t = 1, 2, \ldots$, repeat steps 2-4:

2. Importance sampling
   For $i = 1, \ldots, N$:  // predict the next state
     $\tilde x^{(i)}_t \sim q_t\big(dx^{(i)}_t \mid \tilde x^{(i)}_{0:t-1}, h_{1:t}, a_{1:t}\big)$
     $\tilde x^{(i)}_{0:t} \leftarrow \big(x^{(i)}_{0:t-1}, \tilde x^{(i)}_t\big)$
   $k^\star \sim \pi_t(\cdot)$  // select an action according to the policy
   For $i = 1, \ldots, N$:  // evaluate the importance weights
     $\tilde w^{(i)}_t \leftarrow \dfrac{p\big(h_t \mid \tilde x^{(i)}_t, a_t = k^\star\big)\, p\big(\tilde x^{(i)}_t \mid \tilde x^{(i)}_{0:t-1}, a_{t-1}\big)}{q_t\big(\tilde x^{(i)}_t \mid \tilde x^{(i)}_{0:t-1}, h_{1:t}, a_{1:t}\big)}$
   For $i = 1, \ldots, N$:  // normalize the importance weights
     $w^{(i)}_t \leftarrow \tilde w^{(i)}_t \Big/ \sum_{j=1}^N \tilde w^{(j)}_t$

3. Gaze control
   $r_t = \sum_{i=1}^N \big(w^{(i)}_t\big)^2$  // receive reward for the chosen action
   Incorporate $r_t$ into the policy to create $\pi_{t+1}(\cdot)$.

4. Selection
   Resample with replacement $N$ particles $\{x^{(i)}_{0:t};\, i = 1, \ldots, N\}$ from the set $\{\tilde x^{(i)}_{0:t};\, i = 1, \ldots, N\}$ according to the normalized importance weights $w^{(i)}_t$.
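One time step of the loop above can be sketched in code as follows. This is a toy numpy sketch, not the paper's implementation: `transition`, `likelihood`, `observe` and the policy object are hypothetical stand-ins for the motion model, observation model, foveated observation and gaze policy, and the transition prior is used as proposal so the weights reduce to the observation likelihood.

```python
import numpy as np

def pf_gaze_step(particles, policy, transition, likelihood, observe, rng):
    """One iteration of Algorithm 3 (partial information setting).

    particles : (N, d) array of current state samples
    policy    : object with .sample(rng) -> action and .update(action, reward)
    transition, likelihood, observe : stand-ins for the paper's models
    """
    N = len(particles)
    # 2. Importance sampling: propose from the transition prior, so the
    #    importance weights reduce to the observation likelihood.
    proposed = transition(particles, rng)
    k_star = policy.sample(rng)               # select an action (fixation point)
    h_t = observe(k_star)                     # foveated observation for that gaze
    w = np.array([likelihood(h_t, x, k_star) for x in proposed])
    w /= w.sum()                              # normalize the importance weights
    # 3. Gaze control: reward r_t = sum_i (w_t^(i))^2 for the chosen action.
    r_t = np.sum(w ** 2)
    policy.update(k_star, r_t)
    # 4. Selection: resample with replacement according to the weights.
    idx = rng.choice(N, size=N, p=w)
    return proposed[idx], r_t
```

Note that $r_t = \sum_i (w^{(i)}_t)^2$ always lies in $[1/N, 1]$: it is near $1/N$ when the weights are diffuse and near $1$ when they are degenerate.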
Particle filters combine sequential importance sampling with a selection scheme designed to obtain $N$ new particles $\{x^{(i)}_{0:t}\}_{i=1}^N$ distributed approximately according to $p(dx_{0:t} \mid h_{1:t}, a_{1:t})$.

5.1 Importance sampling step

The joint distributions $p(dx_{0:t-1} \mid h_{1:t-1}, a_{1:t-1})$ and $p(dx_{0:t} \mid h_{1:t}, a_{1:t})$ are of different dimension. We first modify and extend the current paths $x^{(i)}_{0:t-1}$ to obtain new paths $\tilde x^{(i)}_{0:t}$ using a proposal kernel $q_t(d\tilde x_{0:t} \mid x_{0:t-1}, h_{1:t}, a_{1:t})$. As our goal is to design a sequential procedure, we set
$$q_t(d\tilde x_{0:t} \mid x_{0:t-1}, h_{1:t}, a_{1:t}) = \delta_{x_{0:t-1}}(d\tilde x_{0:t-1})\, q_t(d\tilde x_t \mid \tilde x_{0:t-1}, h_{1:t}, a_{1:t}),$$
that is, $\tilde x_{0:t} = (x_{0:t-1}, \tilde x_t)$. The aim of this kernel is to obtain new paths whose distribution
$$q_t(d\tilde x_{0:t} \mid h_{1:t}, a_{1:t}) = p(d\tilde x_{0:t-1} \mid h_{1:t-1}, a_{1:t-1})\, q_t(d\tilde x_t \mid \tilde x_{0:t-1}, h_{1:t}, a_{1:t})$$
is as "close" as possible to $p(d\tilde x_{0:t} \mid h_{1:t}, a_{1:t})$. Since we cannot choose $q_t(d\tilde x_{0:t} \mid h_{1:t}, a_{1:t}) = p(d\tilde x_{0:t} \mid h_{1:t}, a_{1:t})$, because this is the quantity we are trying to approximate in the first place, it is necessary to weight the new particles so as to obtain consistent estimates. We perform this "correction" with importance sampling, using the weights
$$\tilde w_t = \tilde w_{t-1}\, \frac{p(h_t \mid \tilde x_t, a_t)\, p(d\tilde x_t \mid \tilde x_{0:t-1}, a_{t-1})}{q_t(d\tilde x_t \mid \tilde x_{0:t-1}, h_{1:t}, a_{1:t})}.$$

The choice of the transition prior as proposal distribution is by far the most common one. In this case, the importance weights reduce to the expression for the likelihood. However, it is possible to construct better proposal distributions, which make use of more recent observations, using object detectors (Okuma et al., 2004), saliency maps (Itti et al., 1998), optical flow, and approximate filtering methods such as the unscented particle filter.
One could also easily incorporate strategies to manage data association and other tracking-related issues.

After normalizing the weights,
$$w^{(i)}_t = \frac{\tilde w^{(i)}_t}{\sum_{j=1}^N \tilde w^{(j)}_t},$$
we obtain the following estimate of the filtering distribution:
$$\tilde p(dx_{0:t} \mid h_{1:t}, a_{1:t}) = \sum_{i=1}^N w^{(i)}_t\, \delta_{\tilde x^{(i)}_{0:t}}(dx_{0:t}).$$

Finally, a selection step is used to obtain an "unweighted" approximate empirical distribution $\hat p(dx_{0:t} \mid h_{1:t}, a_{1:t})$ of the weighted measure $\tilde p(dx_{0:t} \mid h_{1:t}, a_{1:t})$. The basic idea is to discard samples with small weights and multiply those with large weights. The use of a selection step is key to making the SMC procedure effective; see Doucet et al. (2001) for details on how to implement this black box routine.

6 Experiments

6.1 Full Information Policies

In this section, three experiments are carried out to evaluate the proposed approach quantitatively and qualitatively. The first experiment provides comparisons to other control policies on a synthetic dataset. The second experiment, on a similar synthetic dataset, demonstrates how the approach can handle large variations in scale, occlusion and multiple targets. The final experiment is a demonstration of tracking and classification performance on several real videos. For the synthetic digit videos, we trained the first-layer RBMs on the foveated images, while for the real videos we trained factored-RBMs on foveated natural image patches (Ranzato & Hinton, 2010).

The first experiment uses 10 video sequences (one for each digit) built from the MNIST dataset. Each sequence contains a moving digit and static digits in the background (to create distractions). The objective is to track and recognize the moving digit; see Figure 7. The gaze template had $K = 9$ gaze positions, chosen so that gaze G5 was at the center. The location of the template was initialized with optical flow.
We compare the Hedge learning algorithm against algorithms with deterministic and random policies. The deterministic policy chooses each gaze in sequence and in a particular pre-specified order, whereas the random policy selects a gaze uniformly at random. We adopted the Bhattacharyya distance in the specification of the observation model. A multi-fixation RBM was trained to map the first-layer hidden units of three consecutive time steps into a second hidden layer, and a logistic regressor was trained to further map to the 10 digit classes. We used the transition prior as proposal for the particle filter.

Tables 1 and 2 report the comparison results. Tracking accuracy was measured in terms of the mean and standard deviation (in brackets) over time of the distance between the target ground truth and the estimate, measured in pixels.

Table 1: Tracking error (in pixels) on several video sequences using different policies for gaze selection.

|                      | 0           | 1             | 2            | 3         | 4             | 5         | 6         | 7           | 8             | 9             | Avg.          |
|----------------------|-------------|---------------|--------------|-----------|---------------|-----------|-----------|-------------|---------------|---------------|---------------|
| Learned policy       | 1.2 (1.2)   | 3.0 (2.0)     | 2.9 (1.0)    | 2.2 (0.7) | 1.0 (1.9)     | 1.8 (1.9) | 3.8 (1.0) | 3.8 (1.5)   | 1.5 (1.7)     | 3.8 (2.8)     | 2.5 (1.6)     |
| Deterministic policy | 18.2 (29.6) | 536.9 (395.6) | 104.4 (69.7) | 2.9 (2.2) | 201.3 (113.4) | 4.6 (4.0) | 5.6 (3.1) | 64.4 (45.3) | 142.0 (198.8) | 144.6 (157.7) | 122.5 (101.9) |
| Random policy        | 41.5 (54.0) | 410.7 (329.4) | 3.2 (2.0)    | 3.3 (2.4) | 42.8 (60.9)   | 6.5 (9.6) | 5.7 (3.2) | 80.7 (48.6) | 38.9 (50.6)   | 225.2 (241.6) | 85.9 (80.2)   |

Table 2: Classification accuracy on several video sequences using different policies for gaze selection.

|                      | 0      | 1       | 2      | 3      | 4      | 5       | 6       | 7      | 8      | 9      | Avg.   |
|----------------------|--------|---------|--------|--------|--------|---------|---------|--------|--------|--------|--------|
| Learned policy       | 95.62% | 100.00% | 99.66% | 99.33% | 99.66% | 100.00% | 100.00% | 98.32% | 97.98% | 89.56% | 98.01% |
| Deterministic policy | 99.33% | 100.00% | 98.99% | 94.95% | 5.39%  | 98.32%  | 0.00%   | 29.63% | 52.19% | 0.00%  | 57.88% |
| Random policy        | 98.32% | 100.00% | 96.30% | 99.66% | 29.97% | 96.30%  | 89.56%  | 22.90% | 12.79% | 13.80% | 65.96% |

The analysis highlights that the error of the learned policy is always below the error of the other policies. In most of the experiments, the tracker fails when an occlusion occurs for the deterministic and the random policies, while the learned policy is successful. This is very clear in the videos at: http://www.youtube.com/user/anonymousTrack

The loss of track for the simple policies is mirrored by the high-variance results in Table 1 (experiments 0, 1, 4, and so on). The average mean and standard deviations (last column of Table 1) make it clear that the proposed strategy for learning a gaze policy can be of enormous benefit. The improvements in tracking performance are mirrored by improvements in classification performance (Table 2).

Figure 7 provides further anecdotal evidence for the policy learning algorithm. The top sequence shows the target and the particle filter estimate of its location over time. The middle sequence illustrates how the policy changes over time. In particular, it demonstrates that Hedge can effectively learn where to look in order to improve tracking performance (we chose this simple example as in this case it is obvious that the center of the eight (G5) is the most reliable gaze action). The classification results over time are shown in the third row.
Figure 7 (frames T = 20, 69, 142, 181, 215 and 260 of 300): Tracking and classification accuracy results with the learned policy. First row: position of the target and estimate over time. Second row: policy distribution over the 9 gazes; Hedge clearly converges to the most reasonable policy. Third row: cumulative class distribution for recognition.

The second experiment addresses a similar video sequence, but tracking multiple targets. The image scale of each target changes significantly over time, so the algorithm has to be invariant with respect to these scale transformations. In this case, we used a mixture proposal distribution consisting of motion detectors and the transition prior. We also tested a saliency proposal but found it to be less effective than the motion detectors for this dataset.
Figure 8 (top) shows some of the video frames and tracks. The videos allow one to better appreciate the performance of the multi-target tracking algorithm in the presence of occlusions. Tracking and classification results for the real videos are shown in Figure 8 and the accompanying videos.

Figure 8: Top: Multi-target tracking with occlusions and changes in scale on a synthetic video. Middle and bottom: Tracking in real video sequences.

6.2 Partial Information Policies

In this section, two experiments are carried out to evaluate the performance of the different gaze selection policies.

In the first experiment we compare the performance of each gaze selection method on a dataset of several videos of digits from the MNIST dataset moving on a black background. The target in each video encounters one or more partial occlusions which the tracking algorithm must handle gracefully. Additionally, each video sequence has been corrupted with 30% noise. We measure the error between the estimated track and the ground truth for each gaze selection method, and demonstrate that Bayesian optimization performs comparably to Hedge, but that EXP3 is not able to reach a satisfactory level of performance. We also demonstrate qualitatively that the Bayesian optimization approach learns good gaze selection policies on this dataset.

Table 3: Tracking error on several video sequences using different methods for gaze selection. The table shows mean tracking error as well as the error variance (in brackets) over a single test sequence.

|          | 0           | 1            | 2               | 3               | 4           | 5             | 6               | 7            | 8           | 9               | Avg.            |
|----------|-------------|--------------|-----------------|-----------------|-------------|---------------|-----------------|--------------|-------------|-----------------|-----------------|
| Bayesopt | 5.36 (2.32) | 7.92 (2.52)  | 2.62 (3.89)     | 4.05 (1.67)     | 1.70 (5.10) | 8.31 (3.35)   | 4.94 (2.28)     | 12.09 (3.53) | 1.52 (2.76) | 9.06 (1.66)     | 5.76 (2.91)     |
| Hedge    | 2.97 (1.56) | 3.20 (2.19)  | 2.97 (1.99)     | 2.92 (2.00)     | 3.14 (1.80) | 2.96 (2.08)   | 2.86 (1.96)     | 2.98 (1.76)  | 2.81 (1.64) | 3.15 (3.73)     | 3.00 (2.07)     |
| EXP3     | 3.18 (5.05) | 3.03 (10.08) | 65.46 (3212.16) | 91.81 (3671.66) | 2.62 (2.35) | 7.20 (303.29) | 67.54 (2346.82) | 2.97 (3.99)  | 3.06 (2.71) | 77.01 (3135.17) | 32.39 (1269.33) |
Our second experiment provides evidence that the Bayesian optimization method can generalize to real-world data.

Table 3 reports the results from our first experiment. The table shows the mean tracking error, measured by averaging the distance between the estimated and ground truth track over the entire video sequence. Here we see that the Bayesian optimization approach compares favorably to Hedge in terms of tracking performance, and that EXP3 performs substantially worse than the other two methods. Although Hedge performs marginally better than Bayesian optimization, it is important to remember that Bayesian optimization solves a significantly more difficult problem. Hedge relies on discretizing the action space, and must have access to the rewards for all possible actions at each time step. In contrast, Bayesian optimization considers a fully continuous action space, and receives reward information only for the chosen actions.

Figure 9: Top: Digit templates with the estimated reward surfaces superimposed. Markers indicate the best fixation point found in each of ten runs. Bottom: A visualization of the image found by averaging the best fixation points found across ten runs.

Figure 9 shows the reward surfaces learned for each digit by Bayesian optimization, as well as a visualization of the overall best fixation points using data aggregated across ten runs. The optimal fixation points found by the algorithm are tightly clustered, and the resulting observations are very distinguishable.

In our second experiment we use the YouTube celebrity dataset from Kim et al. (2008). This dataset consists of several videos of celebrities taken from YouTube and is challenging for tracking algorithms as the videos exhibit a wide variety of illuminations, expressions and face orientations. We run our tracking model using Bayesian optimization to learn a gaze selection policy on this dataset, and present some results in Figure 10.
Although we report only qualitative results from this experiment, it provides anecdotal evidence that Bayesian optimization is able to form a good gaze selection policy on real-world data.

Figure 10: Results on a real dataset. Far left: An example frame from the video sequence. Center left: The tracking template with the optimal fixation window highlighted. Center right: The reward surface produced by Bayesian optimization. The white markers show the centers of each fixation point in a single tracking run. Right: Input to the observation model when fixating on the best point. (Best viewed from a distance.)

7 Conclusions and Future Work

We have proposed a decision-theoretic probabilistic graphical model for joint classification, tracking and planning. The experiments demonstrate the significant potential of this approach. We examined several different strategies for gaze control in both the full and partial information settings. We saw that a straightforward generalization of the full information policy to partial information gave poor performance, and we proposed an alternative method which not only performs well in the presence of partial information but also allows us to expand the set of possible fixation points to a continuous domain.

There are many routes for further exploration. In this work we pre-trained the (factored)-RBMs. However, existing particle filtering and stochastic optimization algorithms could be used to train the RBMs online. Following the same methodology, we should also be able to adapt and improve the target templates and proposal distributions over time. This is essential to extend the results to long video sequences where the object undergoes significant transformations (e.g. as is done in the predator tracking system (Kalal et al., 2010)).
Deployment to more complex video sequences will require more careful and thoughtful design of the proposal distributions, transition distributions, control algorithms, template models, data-association and motion analysis modules. Fortunately, many of the solutions to these problems have already been engineered in the computer vision, tracking and online learning communities. Admittedly, much work remains to be done.

Saliency maps are ubiquitous in visual attention studies. Here, we simply used standard saliency tools and motion flow in the construction of the proposal distributions for particle filtering. There might be better ways to exploit the saliency maps, as neurophysiological experiments seem to suggest (Gottlieb et al., 1998).

One of the most interesting avenues for future work is the construction of more abstract attentional strategies. In this work, we focused on attending to regions of the visual field, but clearly one could attend to subsets of receptive fields or objects in the deep appearance model.

The current model has no ability to recover from a tracking failure. It may be possible to use information from the identity pathway (i.e. the classifier output) to detect and recover from tracking failure.

A closer examination of the exploration/exploitation tradeoff in the tracking setting is in order. For instance, the methods we considered assume that future rewards are independent of past actions. This assumption is clearly not true in our setting, since choosing a long sequence of very poor fixation points can lead to tracking failure. We can potentially solve this problem by incorporating the current tracking confidence into the gaze selection strategy. This would allow the exploration/exploitation tradeoff to be explicitly modulated by the needs of the tracker, e.g.
after choosing a poor fixation point the selection policy could be adjusted temporarily to place extra emphasis on exploiting good fixation points until confidence in the target location has been recovered. Contextual bandits provide a framework for integrating and reasoning about this type of side-information in a principled manner.

Acknowledgments

We thank Ben Marlin, Kenji Okuma, Marc'Aurelio Ranzato and Kevin Swersky. This work was supported by CIFAR's NCAP program and NSERC.

References

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R.E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science (FOCS), pp. 322. IEEE Computer Society, 1998a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R.E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2001. ISSN 0097-5397.

Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert E. Gambling in a rigged casino: the adversarial multi-armed bandit problem. Technical Report NC2-TR-1998-025, 1998b.

Bazzani, L., de Freitas, N., and Ting, J.A. Learning attentional mechanisms for simultaneous object tracking and recognition with deep networks. NIPS 2010 Deep Learning and Unsupervised Feature Learning Workshop, 2010.

Brochu, E., Cora, V.M., and de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical report, University of British Columbia, 2010.

Colombo, John. The development of visual attention in infancy. Annual Review of Psychology, pp. 337-367, 2001.

Doucet, A., de Freitas, N., and Gordon, N. Introduction to sequential Monte Carlo methods. In Doucet, A., de Freitas, N., and Gordon, N.J. (eds.), Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.

Freund, Yoav and Schapire, Robert E.
A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119-139, 1997.

Gaborski, R., Vaingankar, V., Chaoji, V., Teredesai, A., and Tentler, A. Detection of inconsistent regions in video streams. In Proc. SPIE Human Vision and Electronic Imaging, 2004.

Gao, D., Mahadevan, V., and Vasconcelos, N. The discriminant center-surround hypothesis for bottom-up saliency. Advances in Neural Information Processing Systems, 20, 2007.

Girard, B. and Berthoz, A. From brainstem to cortex: Computational models of saccade generation circuitry. Progress in Neurobiology, 77(4):215-251, 2005.

Gottlieb, Jacqueline P., Kusunoki, Makoto, and Goldberg, Michael E. The representation of visual salience in monkey parietal cortex. Nature, 391:481-484, 1998.

Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Isard, M. and Blake, A. Contour tracking by stochastic propagation of conditional density. In European Conference on Computer Vision, pp. 343-356, 1996.

Itti, L., Koch, C., and Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, 1998.

Jones, D.R., Perttunen, C.D., and Stuckman, B.E. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157-181, 1993. ISSN 0022-3239.

Kalal, Z., Mikolajczyk, K., and Matas, J. Face-TLD: Tracking-learning-detection applied to faces. In 17th IEEE International Conference on Image Processing (ICIP), pp. 3789-3792. IEEE, 2010.

Kavukcuoglu, K., Ranzato, M.A., Fergus, R., and LeCun, Yann. Learning invariant features through topographic filter maps. In Computer Vision and Pattern Recognition, pp. 1605-1612, 2009.

Kim, M., Kumar, S., Pavlovic, V., and Rowley, H.
Face tracking and recognition with visual constraints in real-world videos. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

Koch, C. and Ullman, S. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4(4):219-227, 1985.

Köster, Urs and Hyvärinen, Aapo. A two-layer ICA-like model estimated by score matching. In International Conference on Artificial Neural Networks, pp. 798-807, 2007.

Larochelle, Hugo and Hinton, Geoffrey. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Neural Information Processing Systems, 2010.

Lee, H., Grosse, R., Ranganath, R., and Ng, A.Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, 2009.

McNaughton, Bruce L., Battaglia, Francesco P., Jensen, Ole, Moser, Edvard I., and Moser, May-Britt. Path integration and the neural basis of the 'cognitive map'. Nature Reviews Neuroscience, 7(8):663-678, 2006.

Okuma, Kenji, Taleghani, Ali, de Freitas, Nando, and Lowe, David G. A boosted particle filter: Multitarget detection and tracking. In ECCV, 2004.

Olshausen, B.A. and Field, D.J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996.

Olshausen, B.A., Anderson, C.H., and Van Essen, D.C. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. The Journal of Neuroscience, 13(11):4700, 1993a. ISSN 0270-6474.

Olshausen, Bruno A., Anderson, Charles H., and Van Essen, David C. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13:4700-4719, 1993b.

Postma, Eric O., van den Herik, H. Jaap, and Hudson, Patrick T.W. SCAN: A scalable model of attentional selection.
Neural Networks, 10(6):993-1015, 1997.

Ranzato, M.A. and Hinton, G.E. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In Computer Vision and Pattern Recognition, pp. 2551-2558, 2010.

Rasmussen, C.E. and Williams, C.K.I. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006. ISBN 9780262182539. URL http://books.google.ca/books?id=vWtwQgAACAAJ.

Rensink, Ronald A. The dynamic representation of scenes. Visual Cognition, pp. 17-42, 2000.

Rosa, M.G.P. Visual maps in the adult primate cerebral cortex: Some implications for brain development and evolution. Brazilian Journal of Medical and Biological Research, 35:1485-1498, 2002.

Srinivas, N., Krause, A., Kakade, S.M., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. International Conference on Machine Learning, 2010.

Swersky, K., Chen, Bo, Marlin, B., and de Freitas, N. A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets. In ITA Workshop, pp. 1-10, 2010.

Swersky, K., Buchman, D., Marlin, B.M., and de Freitas, N. On autoencoders and score matching for energy based models. International Conference on Machine Learning, 2011.

Taylor, G.W., Sigal, L., Fleet, D.J., and Hinton, G.E. Dynamical binary latent variable models for 3D human pose tracking. In Computer Vision and Pattern Recognition, pp. 631-638, 2010.

Torralba, A., Oliva, A., Castelhano, M.S., and Henderson, J.M. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113(4):766, 2006.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, pp. 1096-1103, 2008.

Welling, M., Rosen-Zvi, M., and Hinton, G.
Exponential family harmoniums with an application to information retrieval. Neural Information Processing Systems, 17:1481-1488, 2005.

Zhang, L., Tong, M.H., and Cottrell, G.W. SUNDAy: Saliency using natural statistics for dynamic analysis of scenes. In Proceedings of the 31st Annual Cognitive Science Conference, Amsterdam, Netherlands, 2009.
