Modeling sequential data using higher-order relational features and predictive training

Authors: Vincent Michalski, Roland Memisevic, Kishore Konda

Vincent Michalski (vmichals@rz.uni-frankfurt.de), Goethe-University Frankfurt, Frankfurt, Germany
Roland Memisevic (roland.memisevic@umontreal.ca), University of Montreal, Montreal, Canada
Kishore Konda (konda@informatik.uni-frankfurt.de), Goethe-University Frankfurt, Frankfurt, Germany

Abstract

Bi-linear feature learning models, like the gated autoencoder, were proposed as a way to model relationships between frames in a video. By minimizing the reconstruction error of one frame, given the previous frame, these models learn "mapping units" that encode the transformations inherent in a sequence, and thereby learn to encode motion. In this work we extend bi-linear models by introducing "higher-order mapping units" that allow us to encode transformations between frames as well as transformations between transformations. We show that this makes it possible to encode temporal structure that is more complex and longer-range than the structure captured by standard bi-linear models. We also show that a natural way to train the model is to replace the commonly used reconstruction objective with a prediction objective, which forces the model to correctly predict the evolution of the input multiple steps into the future. Learning can be achieved by back-propagating the multi-step prediction error through time. We test the model on various temporal prediction tasks and show that higher-order mappings and predictive training both yield a significant improvement over bi-linear models in terms of prediction accuracy.

1. Introduction

We explore the application of relational feature learning models (e.g. Memisevic & Hinton, 2007; Taylor & Hinton, 2009) to sequence modeling. To this end, we propose a bi-linear model that describes frame-to-frame transitions in image sequences. In contrast to existing work on modeling relations, we propose a new training scheme, which we call predictive training: after a transformation is extracted from two frames of the video, the model tries to predict the next frame by assuming that the transformation stays constant through time.

We then introduce a deep bi-linear model as a natural application of predictive training, and as a way to relax the assumption of a constant transformation. The model learns relational features, as well as "higher-order relational features" that represent relations between the transformations themselves. To this end, the bottom-layer bi-linear model infers a representation of motion from two seed frames, as well as a representation of motion from two later frames. The top layer is itself a bi-linear model that learns to represent the relation between the inferred lower-level transformations. It can be thought of as learning a second-order "derivative" of the temporal evolution of the high-dimensional input time series. We show that an effective way to train these models is to first pre-train the layers individually, using pairs of frames for the bottom layer and pairs of inferred transformations for the next layer, and to subsequently fine-tune the parameters on complete sequences by back-propagating a multi-step look-ahead cost through time. The model as a whole may be thought of as modeling a dynamical system as a second-order partial difference equation.
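To make that difference-equation reading concrete, the display below is a schematic only: it treats first-order mappings as playing the role of discrete first differences and second-order mappings as second differences. The actual model uses multiplicative interactions (Equations (2)-(8)) rather than literal subtraction, so this is an analogy, not the parameterization.

```latex
% Schematic second-order difference-equation reading (analogy only).
\begin{align*}
  m_1^{(t-1:t)}   &\;\approx\; \text{``}x^{(t)} - x^{(t-1)}\text{''}
      && \text{first-order mapping $\sim$ first difference}\\
  m_2^{(t-2:t)}   &\;\approx\; \text{``}m_1^{(t-1:t)} - m_1^{(t-2:t-1)}\text{''}
      && \text{second-order mapping $\sim$ second difference}\\
  \hat{x}^{(t+1)} &\;=\; f\!\left(x^{(t)},\, m_1^{(t-1:t)},\, m_2^{(t-2:t)}\right)
      && \text{predict from the state and its relational ``derivatives''}
\end{align*}
```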
While in principle the model could be stacked to take into account differences of arbitrary order, we demonstrate that the two-layer model is surprisingly effective at modeling a variety of complicated image sequences.

Both layers of our model make use of multiplicative interactions between filter responses in order to model relations (Memisevic, 2013). Multiplicative interactions were recently shown to be useful in recurrent neural networks by Sutskever et al. (2011). In contrast to our work, Sutskever et al. (2011) use multiplicative interactions to gate the connections between hidden states, so that each observation can be thought of as blending in a separate hidden-state transition. A natural application of this is sequences of discrete symbols, and their model is consequently demonstrated on text. In our work, the role of multiplicative interactions is explicitly to yield encodings of the transformations between observations, such as the frames of a video, and we apply the model primarily to video data.

Our model also bears some similarity to that of Taylor & Hinton (2009), who model MOCAP data using a generatively trained three-way Restricted Boltzmann Machine, in which a second layer of hidden units can be used to model more "abstract" features of the time series. In contrast to that work, our higher-order units, which are three-way units too, are used expressly to model higher-order transformations (transformations between the transformations learned in the first layer). Furthermore, we show that predictive fine-tuning using backprop through time allows us to train the model discriminatively and yields much better performance than generative training by itself.

2. Relational feature learning

In order to learn features m that represent the relationship between two frames x^{(1)} and x^{(2)}, as shown in Figure 1, it is necessary to learn a basis that can represent the correlation structure across the frames.

Figure 1. The relational features m represent the correspondences between two inputs x^{(1)} and x^{(2)}.

In a video, given one frame x^{(1)}, there can be a multitude of potential next frames x^{(2)}. It is therefore common to use bi-linear models, like the Gated Boltzmann Machine (GBM) (Taylor & Hinton, 2009), the Gated Autoencoder (GAE) (Memisevic, 2011), and similar models (see Memisevic, 2013, for an overview), whose hidden variables can represent which transformation, out of the pool of many possible transformations, takes x^{(1)} to x^{(2)}. More formally, bi-linear models learn to represent the linear transformation L between two images x^{(1)} and x^{(2)}, where

x^{(2)} = L x^{(1)}    (1)

It can be shown that a weighted sum of products of filter responses is able to identify the transformation. The reason is that the weighted sum is large if the angle between the filters is similar to the angle (in "pixel space") between the two frames. That way, hidden units represent the observed transformation in the form of a set of phase differences in the invariant subspaces of the transformation class (Memisevic, 2013). Because the hidden units encode the transformation between the images, rather than the content of the images, they are commonly referred to as mapping units. We focus on the autoencoder variant of these models for the purposes of this paper, but one could equally use other models such as the GBM.

Figure 2. Graphical representation of the gated autoencoder. The two inputs x^{(1)} and x^{(2)} are projected onto features, and the mapping units pool over pairwise products of these features.
Formally, the response of a layer of mapping units can be written

m = \sigma( W ( U x^{(1)} \odot V x^{(2)} ) )    (2)

where U, V and W are parameter matrices, \odot denotes elementwise multiplication, and \sigma is an elementwise non-linearity, such as the logistic sigmoid.

Given the mapping unit activations m, as well as the first image, the second image can be reconstructed by applying the transformation encoded in m as follows (Memisevic, 2011):

\tilde{x}^{(2)} = V^T ( U x^{(1)} \odot W^T m )    (3)

As the model is symmetric, we can likewise define the reconstruction of the first image given the second:

\tilde{x}^{(1)} = U^T ( V x^{(2)} \odot W^T m )    (4)

from which one obtains the reconstruction error

L = || x^{(1)} - \tilde{x}^{(1)} ||^2 + || x^{(2)} - \tilde{x}^{(2)} ||^2    (5)

for training. It can be shown that minimizing the reconstruction error on image pairs turns each row of U and the corresponding row of V into a pair of phase-shifted filters. Together, the filters span the invariant subspaces of the transformation class inherent in the pairs on which the model was trained. As a result, each component of m is tuned to a phase delta after learning and is independent of the absolute phase of each image (Memisevic, 2013).
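To make Equations (2)-(5) concrete, here is a minimal NumPy sketch of the mapping-unit inference and the symmetric reconstruction objective. The function names (gae_mappings, gae_reconstruct_*, gae_reconstruction_loss) are our own illustration rather than code accompanying the paper, and details such as bias terms, filter factorization and normalization are omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gae_mappings(x1, x2, U, V, W):
    """Mapping-unit activations m = sigma(W((U x1) * (V x2))), Eq. (2)."""
    return sigmoid(W @ ((U @ x1) * (V @ x2)))

def gae_reconstruct_x2(x1, m, U, V, W):
    """Reconstruct the second frame from the first frame and the mappings, Eq. (3)."""
    return V.T @ ((U @ x1) * (W.T @ m))

def gae_reconstruct_x1(x2, m, U, V, W):
    """Reconstruct the first frame from the second frame and the mappings, Eq. (4)."""
    return U.T @ ((V @ x2) * (W.T @ m))

def gae_reconstruction_loss(x1, x2, U, V, W):
    """Symmetric reconstruction error, Eq. (5)."""
    m = gae_mappings(x1, x2, U, V, W)
    return (np.sum((x1 - gae_reconstruct_x1(x2, m, U, V, W)) ** 2)
            + np.sum((x2 - gae_reconstruct_x2(x1, m, U, V, W)) ** 2))
```

Here U and V each have one row per filter (for example 256 rows for the settings of Section 5.2, with columns matching the dimensionality of the whitened frames), and W has one row per mapping unit.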
3. Higher-order relational features

3.1. Approximating discrete-time dynamical systems

A natural extension of the concept of relational features can be motivated by viewing relational models as performing a kind of first-order Taylor approximation of the input sequence, in which the hidden representation models the partial first-order derivatives of the inputs with respect to time. Based on this view, we propose an approach that exploits correlations between subsequent sequence elements to model a dynamical system that approximates the sequence. This is a very different way to address long-range correlations than assuming memory units that explicitly keep state (Hochreiter & Schmidhuber, 1997). Instead, we assume that there is structure in the temporal evolution of the input stream, and we focus on capturing this structure.

As an intuitive example, consider a video that is known to be a sinusoidal signal, but with unknown frequency, phase and motion direction. The complete video is specified exactly by its first three seed frames. Given these three frames, we would therefore in principle be able to predict the rest of the video ad infinitum.

The first-order partial derivative of a multidimensional discrete-time dynamical system describes the correspondences between state vectors at subsequent time steps. Relational feature learning applied to subsequent elements of a sequence can be viewed as a way to learn these derivatives, suggesting that we may model higher-order partial derivatives with higher-order relational features. We model second-order derivatives by cascading relational features in a pyramid, as depicted in Figure 3 (the images in the figure are taken from the NORB data set described in LeCun et al., 2004). Given a sequence of inputs x^{(t-2)}, x^{(t-1)}, x^{(t)}, first-order relational features m_1^{(t-1:t)} describe the transformations between two subsequent inputs x^{(t-1)} and x^{(t)}. Second-order relational features m_2^{(t-2:t)} describe correspondences between two first-order relational features m_1^{(t-2:t-1)} and m_1^{(t-1:t)}, modeling the analog of the partial second-order derivatives of the inputs with respect to time. Section 5.3 presents experiments with two layers of relational features that support this view.

Figure 3. First-order relational features m_1^{(t-2:t-1)} and m_1^{(t-1:t)} describe correspondences between multiple entities, e.g. two frames of a video. The second-order relational features m_2^{(t-2:t)} describe correspondences between the first-order relational features.

3.2. The higher-order gated autoencoder

We implement a higher-order gated autoencoder (HGAE) using a modular approach: the second-order HGAE is constructed from two GAE modules, one that relates inputs and another that relates the mappings of the first GAE.

The first-layer GAE instance models correspondences between input pairs using filter matrices U_1, V_1 and W_1 (the subscript index refers to the layer). Using the first-layer GAE, mappings m_1^{(t-2:t-1)} and m_1^{(t-1:t)} are inferred for the overlapping input pairs (x^{(t-2)}, x^{(t-1)}) and (x^{(t-1)}, x^{(t)}), and this pair of first-layer mappings is used as input to a second GAE instance. The second GAE models correspondences between the first-layer mappings using filter matrices U_2, V_2 and W_2. For the two-layer model, inference amounts to computing first- and second-order mappings according to

m_1^{(t-2:t-1)} = \sigma( W_1 ( (U_1 x^{(t-2)}) \odot (V_1 x^{(t-1)}) ) )    (6)
m_1^{(t-1:t)} = \sigma( W_1 ( (U_1 x^{(t-1)}) \odot (V_1 x^{(t)}) ) )    (7)
m_2^{(t-2:t)} = \sigma( W_2 ( (U_2 m_1^{(t-2:t-1)}) \odot (V_2 m_1^{(t-1:t)}) ) )    (8)

Cascading GAE modules in this way can also be motivated from the perspective of orthogonal transformations as subspace rotations. As stated in (Memisevic, 2013), summing over filter-response products can yield transformation detectors that are invariant to the initial phase of the transformation and also partially invariant to the content of the images. The relative rotation angle (phase delta) between two projections is itself an angle, and the relation between two such phase deltas can be viewed as an "angular acceleration".

In contrast to the standard two-frame model, reconstruction error is not directly applicable here (although a naive way to train the model would be to minimize the reconstruction error for each pair of adjacent nodes in each layer). However, there is a more natural way to train the model if the training data forms a sequence, as we discuss next.

4. Predictive training of relational models

4.1. Single-step prediction

Given the first two frames of a sequence x^{(1)}, x^{(2)}, x^{(3)}, \ldots, one can use the GAE to compute a prediction of the third frame as follows. First, mappings m^{(1,2)} are inferred from x^{(1)} and x^{(2)} (see Equation (2)) and then used to compute a prediction \hat{x}^{(3)} by applying the inferred transformation m^{(1,2)} to frame x^{(2)}. Applying the transformation amounts to computing

\hat{x}^{(3)} = V^T ( U x^{(2)} \odot W^T m^{(1,2)} )    (9)

This will be a good prediction of x^{(3)} under the assumption that the frame-to-frame transformations from x^{(1)} to x^{(2)} and from x^{(2)} to x^{(3)} are approximately the same, in other words, if the transformations themselves are assumed to be approximately constant in time.

Figure 4. The assumption of similarity between the transformations from x^{(t-1)} to x^{(t)} and from x^{(t)} to x^{(t+1)} allows us to define a prediction \hat{x}^{(t+1)} by applying the inferred transformation m^{(t-1:t)} to x^{(t)}.

Under this assumption, one can train the GAE to minimize the prediction error

L = || \hat{x}^{(3)} - x^{(3)} ||_2^2    (10)

instead of the reconstruction error in Equation (5). This type of supervised training objective, in contrast to the standard GAE objective, can also guide the mapping representation towards invariance to image content, because encoding the content of x^{(2)} will not in general help to predict x^{(3)}.
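As a minimal sketch of single-step predictive training (Equations (9)-(10)), building on the hypothetical gae_mappings helper above; gae_predict_next and gae_prediction_loss are illustrative names, not code from the paper.

```python
def gae_predict_next(x_prev, x_curr, U, V, W):
    """Infer the transformation from (x_prev, x_curr) and re-apply it to x_curr, Eq. (9)."""
    m = gae_mappings(x_prev, x_curr, U, V, W)
    return V.T @ ((U @ x_curr) * (W.T @ m))

def gae_prediction_loss(x1, x2, x3, U, V, W):
    """Single-step prediction error, Eq. (10)."""
    x3_hat = gae_predict_next(x1, x2, U, V, W)
    return np.sum((x3_hat - x3) ** 2)
```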
When the assumption of constant transformations is violated, we can use a higher layer to model how the transformations themselves change over time. This requires looking farther ahead during predictive training, which we discuss next.

4.2. Multi-step prediction

One can iterate the inference-prediction process to look more than one frame ahead in time. To compute a prediction \hat{x}^{(4)}, one infers mappings from x^{(2)} and \hat{x}^{(3)}:

m^{(2:3)} = \sigma( W ( U x^{(2)} \odot V \hat{x}^{(3)} ) )    (11)

and computes the prediction

\hat{x}^{(4)} = V^T ( U \hat{x}^{(3)} \odot W^T m^{(2:3)} )    (12)

Then mappings can be inferred again from \hat{x}^{(3)} and \hat{x}^{(4)} to compute a prediction \hat{x}^{(5)}, and so on.

For the two-layer HGAE, this amounts to the assumption that the second-order relational structure of the sequence changes slowly over time. Under this assumption, we compute a prediction \hat{x}^{(t+1)} in two steps. First, a prediction is made of the first-order relational features describing the correspondence between x^{(t)} and x^{(t+1)}:

\hat{m}_1^{(t:t+1)} = V_2^T ( U_2 m_1^{(t-1:t)} \odot W_2^T m_2^{(t-2:t)} )    (13)

Using this prediction of the transformation between x^{(t)} and x^{(t+1)}, the prediction \hat{x}^{(t+1)} is then computed as

\hat{x}^{(t+1)} = V_1^T ( U_1 x^{(t)} \odot W_1^T \hat{m}_1^{(t:t+1)} )    (14)

As with the GAE, one can predict multiple steps ahead in time with the HGAE by repeating the inference-prediction process on x^{(t-1)}, x^{(t)} and \hat{x}^{(t+1)}, i.e. by appending the prediction to the sequence and increasing t by one. The prediction process simply consists of iteratively computing predictions of the next lower level's activations, beginning from the top. To compute the top-level activations themselves, one needs a number of seed frames corresponding to the depth of the model: while two frames are sufficient to infer the transformations in the case of the GAE, three frames are required in the case of the two-layer model.

Figure 5. A prediction is made in two steps (the dashed lines) from top to bottom. The second-order relational feature m_2^{(t-2:t)}, inferred from the sequence x^{(t-2)}, x^{(t-1)}, x^{(t)}, is assumed to change slowly and is used to make a prediction of the first-order relational feature that describes the correspondences between x^{(t)} and x^{(t+1)}. This prediction is then used to transform x^{(t)} into the prediction \hat{x}^{(t+1)}.

The models can be trained using backprop through time (Werbos, 1988) to compute the gradients of the k-step-ahead prediction error with respect to the parameters:

L = \sum_{i=1}^{k} || \hat{x}^{(t+i)} - x^{(t+i)} ||_2^2    (15)
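The following sketch spells out HGAE inference (Equations (6)-(8)), the top-down prediction (Equations (13)-(14)) and the k-step loss (Equation (15)), reusing the hypothetical gae_mappings helper above. In practice the gradients of Equation (15) would be obtained with an autodiff framework; the function names here are our own.

```python
def hgae_infer(x_tm2, x_tm1, x_t, params):
    """Infer first- and second-order mappings from three frames, Eqs. (6)-(8)."""
    U1, V1, W1, U2, V2, W2 = params
    m1_a = gae_mappings(x_tm2, x_tm1, U1, V1, W1)   # m_1^(t-2:t-1), Eq. (6)
    m1_b = gae_mappings(x_tm1, x_t, U1, V1, W1)     # m_1^(t-1:t),   Eq. (7)
    m2 = gae_mappings(m1_a, m1_b, U2, V2, W2)       # m_2^(t-2:t),   Eq. (8)
    return m1_b, m2

def hgae_predict_next(x_t, m1_b, m2, params):
    """Top-down prediction of the next frame, Eqs. (13)-(14)."""
    U1, V1, W1, U2, V2, W2 = params
    m1_hat = V2.T @ ((U2 @ m1_b) * (W2.T @ m2))     # predicted m_1^(t:t+1), Eq. (13)
    return V1.T @ ((U1 @ x_t) * (W1.T @ m1_hat))    # predicted x^(t+1),     Eq. (14)

def hgae_k_step_loss(frames, k, params):
    """k-step-ahead prediction error, Eq. (15), starting from three seed frames."""
    seq = list(frames[:3])                          # seed frames
    loss = 0.0
    for i in range(k):
        m1_b, m2 = hgae_infer(seq[-3], seq[-2], seq[-1], params)
        x_hat = hgae_predict_next(seq[-1], m1_b, m2, params)
        loss += np.sum((x_hat - frames[3 + i]) ** 2)
        seq.append(x_hat)                           # append prediction, advance t by one
    return loss
```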
In our experiments, we observed that starting with single-step prediction and iteratively increasing the number of prediction steps during training considerably stabilizes the dynamics of the model and helps to prevent explosions in the magnitude of the predictions.

5. Experiments

We tested and compared the models on videos with varying degrees of complexity, ranging from synthetic constant transformations, over synthetic accelerated transformations, to more complex real-world transformations.

5.1. Preprocessing and initialization

For all data sets, PCA whitening was used for dimensionality reduction, retaining around 95% of the variance. Predictive training of the HGAE only worked after layer-wise pre-training; without pre-training, the parameters did not converge to a useful configuration. We used gradient descent with a learning rate of 0.001 and momentum 0.9. The first-layer GAE was trained to reconstruct pairs of subsequent sequence elements (as described in Section 2). Pairs of mappings were then computed on three subsequent inputs using the pre-trained first-layer GAE, and these mapping pairs were used for reconstructive pre-training of the second-layer GAE.

5.2. Comparison of predictive and reconstructive training

To evaluate whether predictive training of the GAE yields better representations of transformations than training with the reconstruction objective, we performed a classification experiment on videos showing artificially transformed natural images. The 13 × 13 patches were cropped from the Berkeley Segmentation data set (Martin et al., 2001). Two data sets were generated, with videos featuring constant-velocity shifts (CONSTSHIFT) and rotations (CONSTROT). The elements of the shift vectors for the CONSTSHIFT data set were sampled uniformly from the interval [-3, 3] (in pixels). The rotation angles were sampled uniformly from the interval (-\pi, \pi). Labels for the CONSTSHIFT data set were generated by discretizing the shift vectors as shown in Figure 7; for CONSTROT, the angles were divided into 8 equally sized bins. Both data sets were partitioned into a training set of 100 000, a validation set of 20 000 and a test set of 50 000 sequences.

The numbers of filters and mapping units were chosen using a grid search. The setting with the best performance on the validation set was 256 filters and 256 mapping units, for both training objectives and both data sets. The models were each trained for 1 000 epochs using stochastic gradient descent (SGD) with a learning rate of 0.001 and momentum 0.9. For the evaluation, the mappings inferred from the first two inputs were used as input to a logistic regression classifier. The experiment was performed multiple times on both data sets, and the mean classification accuracies are reported in Table 1. In all trials, the GAE trained with 1-step predictive training achieved a higher accuracy than the GAE trained with the reconstruction objective.

Table 1. Classification accuracy (%) of the GAE on the constant rotations (CONSTROT) and constant shifts (CONSTSHIFT) data sets, for reconstructive and 1-step predictive training.

Model            CONSTROT    CONSTSHIFT
Rec. training    97.6        76.4
Pred. training   98.2        79.4

This suggests that predictive training yields a more explicit representation of transformations, one that is less contaminated by image content, as discussed in Section 4.1.
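For concreteness, a sketch of the label construction of Figure 7 and of the logistic-regression readout used in these experiments is given below. The binning function follows the quadrant-plus-magnitude-threshold description of Figure 7 (the particular bin numbering is our own choice), and the use of scikit-learn's LogisticRegression as well as the variable names are assumptions; the paper only states that inferred mappings are fed to a logistic regression classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def shift_bin(v, beta):
    """Map a 2D shift vector to one of 8 classes: quadrant (4 options) times
    magnitude above/below the threshold beta (2 options). beta is chosen so that
    the 8 bins are equally populated (cf. Figure 7); the numbering is illustrative."""
    quadrant = int(v[0] < 0) + 2 * int(v[1] < 0)      # 0..3
    outer = int(np.hypot(v[0], v[1]) > beta)          # 0 or 1
    return quadrant + 4 * outer

# Readout: mappings inferred from the first two (whitened) frames of each video
# serve as the descriptor for a logistic regression classifier.
# The arrays below (train_videos, train_shifts, ...) are hypothetical.
# X_train = np.stack([gae_mappings(v[0], v[1], U, V, W) for v in train_videos])
# y_train = np.array([shift_bin(s, beta) for s in train_shifts])
# clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# accuracy = clf.score(X_test, y_test)
```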
5.3. Detecting acceleration

To test the hypothesis that the HGAE learns to model second-order correspondences in sequences, image sequences with accelerated shifts (ACCSHIFT) and rotations (ACCROT) of natural image patches were generated. The patches were again cropped from the Berkeley Segmentation data set and artificially transformed with an initial (angular) velocity and a constant (angular) acceleration. The scalar angular accelerations were sampled uniformly from the interval [-\pi/12, \pi/12] degrees, and the initial angular velocities were sampled from the same interval. To get labels for classification, the angular accelerations were divided into 8 equally sized bins. For the accelerated-shifts data set, the elements of the velocity and acceleration vectors were sampled from the interval [-3, 3] (in pixels). The discretization of the acceleration vectors is the same as for the shift vectors in CONSTSHIFT (see Figure 7). The partition sizes are the same as for CONSTROT and CONSTSHIFT.

Figure 6. HGAE first-layer filter pairs (after multi-step predictive training): (a) ACCROT, (b) ACCSHIFT, (c) Bouncing Balls, (d) NORBvideos.

Figure 7. Discretization of shift vectors: the 2D plane is divided into the four quadrants, and a magnitude threshold β is chosen such that the distribution of samples over the 8 resulting bins (numbered 0-7) is uniform. α denotes the maximum magnitude in the respective data set.

The number of filters and mapping units was set to 512 and 256, respectively (after performing a grid search). After pre-training, the HGAE was trained with gradient descent using a learning rate of 0.0001 and a momentum of 0.9, first for 400 epochs on single-step prediction and then for 500 epochs on two-step prediction.

After training, first- and second-layer mappings were inferred from the first three frames of the test sequences. The classification accuracies of a logistic regression classifier using the second-layer mappings of the HGAE (m_2^{(1:3)}) as descriptor, using the individual first-layer mappings (m_1^{(1:2)} and m_1^{(2:3)}), and using the concatenation of both first-layer mappings are reported in Table 2 for both data sets, before and after predictive fine-tuning.

The second-layer mappings achieved a significantly higher accuracy on both data sets after predictive training. For the ACCROT data set, the concatenation of first-layer mappings performed better than the second-layer mappings before fine-tuning, which may be because the angular-acceleration data is based on a one-parameter transformation and is thus simpler than the shift-acceleration data, which is based on a two-parameter transformation. Predictive fine-tuning also helped improve the intermediate representation, as can be observed from the increase in accuracy for the concatenation of the first-layer mappings.

Figure 8. Two examples of seven prediction steps of the HGAE model on the ACCROT data set; shown are, from top to bottom, the ground truth and the predictions of the model after pre-training and after one-, two- and three-step predictive training.
Figure 9. Two examples of predictions of the HGAE model on the ACCSHIFT data set; shown are, from top to bottom, the ground truth and the predictions of the model after pre-training and after one- and two-step predictive training.

Table 2. Classification accuracies (%) using different layer mappings of the HGAE before and after 2-step fine-tuning on the accelerated rotations (ACCROT) and accelerated shifts (ACCSHIFT) data sets. (m_1^{(1:2)}, m_1^{(2:3)}) denotes the concatenation of both first-layer mappings.

Descriptor                      ACCROT    ACCSHIFT
Pre-trained
  m_1^{(1:2)}                   19.4      20.6
  m_1^{(2:3)}                   30.9      33.3
  (m_1^{(1:2)}, m_1^{(2:3)})    64.9      38.4
  m_2^{(1:3)}                   53.7      63.4
Fine-tuned
  m_1^{(1:2)}                   18.1      20.9
  m_1^{(2:3)}                   29.3      34.4
  (m_1^{(1:2)}, m_1^{(2:3)})    74.0      42.7
  m_2^{(1:3)}                   74.4      80.6

These results show that the second layer of the HGAE builds a much better representation of the second-order relational structure in the data than the single-layer GAE model. They further show that predictive training improves the capability of both models and is crucial for the two-layer model to work well.

5.4. Sequence prediction

In this experiment we test the capability of the models to predict previously unseen sequences multiple steps into the future. This allows us to assess to what degree modeling second-order "derivatives" makes it possible to capture the temporal evolution without resorting to an explicit representation of a hidden state. After training, predictions were generated by seeding the models with two (GAE) or three (HGAE) frames taken from the test sequences.
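Concretely, rolling the HGAE forward from its seed frames amounts to the following loop (reusing the hypothetical hgae_infer and hgae_predict_next helpers sketched in Section 4.2):

```python
def hgae_rollout(seed_frames, n_steps, params):
    """Seed with three frames, then repeatedly append the model's own predictions."""
    seq = list(seed_frames)                         # x^(1), x^(2), x^(3) from a test sequence
    for _ in range(n_steps):
        m1_b, m2 = hgae_infer(seq[-3], seq[-2], seq[-1], params)
        seq.append(hgae_predict_next(seq[-1], m1_b, m2, params))
    return seq[3:]                                  # the predicted frames
```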
Figure 6 shows some of the filter pairs learned by the HGAE on the different data sets after predictive training. Due to the large input dimensionality and the low number of training samples, a few of the filters shown in Figure 6(d) seem to overfit to the training data, while many others are localized Gabor-like features.

5.4.1. Accelerated transformations

Figures 8 and 9 show predictions of the HGAE model on the data sets introduced in Section 5.3 after different stages of training. As can be seen in the figures, the accuracy of the predictions increases significantly with multi-step training.

5.4.2. NORBvideos

The NORBvideos data set introduced in (Memisevic & Exarchakis, 2013) contains videos of objects from the NORB data set (LeCun et al., 2004). The objects are divided into 5 classes (four-legged animals, human figures, airplanes, trucks and cars), each with 9 instances. The 5 frames of each video in NORBvideos show incrementally changed viewpoints of one of the objects. We trained our sequence learning models on this data using the authors' original split: all videos of objects from instances 1-8 are in the training set, and instance-9 objects are in the test set. This yields 109 350 training examples and 12 150 test examples. The frame size is 96 × 96 and the videos are 5 frames long. The GAE and the HGAE were trained on the multi-step prediction task with a learning rate of 0.0001 and momentum 0.9. Both models used 2000 features and 1000 mapping units (per layer); the test performance of the GAE seemed to stop improving at 2000 features, while the HGAE was able to make use of the additional parameters.

Figure 10 shows predictions made by both models. The HGAE manages to generate predictions that correctly reflect the 3-D structure in the data, and in contrast to the GAE it is much better at extrapolating the observed transformations. Note that the seed frames are from test data.

Figure 10. Comparison of 10 prediction steps on the NORBvideos data set. The original sequences contain only 5 frames, providing only 2 frames of ground truth for the predictions.

5.4.3. Bouncing balls

We also trained the HGAE on the bouncing balls data set (the training and test sequences were generated using the script released with Sutskever et al., 2008) to see whether it captures the highly non-linear dynamics of this data. The number of features was set to 512 and the number of mappings to 256. Figure 11 shows two predictions on test data. The predictions show that the second-order model correctly captures reflections off the boundaries and off the other balls, and makes consistent predictions, in some cases over up to around 10-20 frames.

Figure 11. Seven HGAE prediction steps on two samples of the bouncing balls data set after training on 3-step predictions.

6. Discussion

A major long-standing problem in sequence modeling is how to deal with long-range correlations. It has been proposed that deep learning may help address this problem by finding representations that better capture the abstract, semantic content of the inputs (Bengio, 2009). In this work we propose learning representations with the explicit goal of enabling the prediction of the temporal evolution of the input stream multiple time steps ahead. We thus seek a hidden representation that captures exactly those aspects of the input data which allow us to make predictions about the future.

It is interesting to note that predictive training can also be viewed as an analogy-making task (Memisevic & Hinton, 2010): it amounts to taking the transformation that takes frame t to frame t+1 and applying it to a new observation at time t+1 or later. The difference is that in a genuine analogy-making task the target image may be unrelated to the source image pair, whereas here target and source are related. It would be interesting to apply the model to word representations, or to language in general, as this is a domain where both sequentially structured data and analogical relationships between data points play a crucial role (e.g. Mikolov et al., 2013).

Acknowledgments

This work was supported by the German Federal Ministry of Education and Research (BMBF) in project 01GQ0841 (BFNT Frankfurt), by an NSERC Discovery grant and by a Google faculty research award.

References

Bengio, Yoshua. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Also published as a book, Now Publishers, 2009.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

LeCun, Yann, Huang, Fu Jie, and Bottou, Leon. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition (CVPR 2004), Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pp. II-97. IEEE, 2004.

Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pp. 416–423, July 2001.
Memisevic, R. and Hinton, G. E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492, 2010.

Memisevic, Roland. Gradient-based learning of higher-order image features. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 1591–1598. IEEE, 2011.

Memisevic, Roland. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1829–1846, 2013.

Memisevic, Roland and Exarchakis, Georgios. Learning invariant features by harnessing the aperture problem. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), 2013.

Memisevic, Roland and Hinton, Geoffrey. Unsupervised learning of image transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

Sutskever, Ilya, Hinton, Geoffrey E., and Taylor, Graham W. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, pp. 1601–1608, 2008.

Sutskever, Ilya, Martens, James, and Hinton, Geoffrey. Generating text with recurrent neural networks. In Proceedings of the 2011 International Conference on Machine Learning (ICML 2011), 2011.

Taylor, Graham and Hinton, Geoffrey. Factored conditional restricted Boltzmann machines for modeling motion style. In Proceedings of the 26th International Conference on Machine Learning, Montreal, June 2009. Omnipress.

Werbos, Paul J. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
