Learning deep dynamical models from image pixels
Authors: Niklas Wahlström, Thomas B. Schön, Marc Peter Deisenroth
Learning deep dynamical models from image pixels

Niklas Wahlström¹, Thomas B. Schön², Marc Peter Deisenroth³
¹ Department of Electrical Engineering, Linköping University, Sweden
² Department of Information Technology, Uppsala University, Sweden
³ Department of Computing, Imperial College London, United Kingdom

Copyright 2014 by the authors.

Abstract

Modeling dynamical systems is important in many disciplines, e.g., control, robotics, or neurotechnology. Commonly the state of these systems is not directly observed, but only available through noisy and potentially high-dimensional observations. In these cases, system identification, i.e., finding the measurement mapping and the transition mapping (system dynamics) in latent space, can be challenging. For linear system dynamics and measurement mappings, efficient solutions for system identification are available. However, in practical applications the linearity assumption does not hold, requiring non-linear system identification techniques. If, additionally, the observations are high-dimensional (e.g., images), non-linear system identification is inherently hard. To address the problem of non-linear system identification from high-dimensional observations, we combine recent advances in deep learning and system identification. In particular, we jointly learn a low-dimensional embedding of the observations by means of deep auto-encoders and a predictive transition model in this low-dimensional space. We demonstrate that our model enables learning good predictive models of dynamical systems from pixel information only.

1 Introduction

High-dimensional time series include video streams, electroencephalography (EEG) and sensor network data.
Dynamical models describing such data are desired for forecasting (prediction) and controller design, both of which play an important role, e.g., in autonomous systems, machine translation, robotics and surveillance applications. A key challenge is system identification, i.e., finding a mathematical model of the dynamical system based on the information provided by measurements from the underlying system. In the context of state-space models this includes finding two functional relationships between (a) the states at different time steps (prediction/transition model) and (b) states and corresponding measurements (observation/measurement model). In the linear case, this problem is very well studied, and many standard techniques exist, e.g., subspace methods [18], expectation maximization [16, 4] and prediction-error methods [9]. However, in realistic and practical scenarios we require non-linear system identification techniques.

Learning non-linear dynamical models is an inherently difficult problem, and it has been one of the most active areas in system identification for the last decades [10, 17]. In recent years, sequential Monte Carlo (SMC) methods have received attention for identifying non-linear state-space models [15]; see also the recent survey [6]. While methods based on SMC are powerful, they are also computationally expensive. Learning non-linear dynamical models from very high-dimensional sensor data is even more challenging. First, finding (non-linear) functional relationships in very high dimensions is hard (non-identifiability, local optima, overfitting, etc.); second, the amount of data required to find a good function approximator is enormous. Fortunately, high-dimensional data often possesses an intrinsic lower dimensionality.
We will exploit this property for system identification by finding a low-dimensional representation of high-dimensional data and learning predictive models in this low-dimensional space. For this purpose, we need an automated procedure for finding compact low-dimensional representations/features.

The state of the art in learning parsimonious representations of high-dimensional data is currently defined by deep learning architectures, such as deep neural networks [5], stacked/deep auto-encoders [19] and convolutional neural networks [8], all of which have been successfully applied to image, text, speech and audio data in commercial products, e.g., by Google, Amazon and Facebook. Typically, these feature learning methods are applied to static data sets, e.g., for image classification. The auto-encoder gives explicit expressions of two generative mappings: 1) an encoder g⁻¹ mapping the high-dimensional data to the features, and 2) a decoder g mapping the features to high-dimensional reconstructions. In the machine learning literature, there exists a vast number of other well-studied non-linear dimensionality reduction methods, such as the Gaussian process latent variable model (GP-LVM) [7], kernel PCA [14], Laplacian eigenmaps [1] and locally linear embedding [12]. However, all of them provide at most one of the two mappings g and g⁻¹.

In this paper, we combine feature learning and dynamical systems modeling to obtain good predictive models for high-dimensional time series. In particular, we use deep auto-encoder neural networks for automatically finding a compact low-dimensional representation of an image. In this low-dimensional feature space, we use a neural network for modeling the non-linear system dynamics. The embedding and the predictive model in feature space are learned jointly.
A simplified illustration of our approach is shown in Fig. 1. An encoder g⁻¹ maps an image y_{t−1} at time step t−1 to a low-dimensional feature z_{t−1}. In this feature space, a prediction model l maps the feature forward in time to z_t. Subsequently, the decoder g can be used to generate a predicted image y_t at the next time step. This framework needs access to both the encoder g⁻¹ and the decoder g, which motivates our use of the auto-encoder as dimensionality reduction technique.

Consequently, the contributions of this paper are (a) a model for learning a low-dimensional dynamical representation of high-dimensional data, which can be used for long-term predictions; and (b) experimental evidence demonstrating that jointly learning the parameters of the latent embedding and the predictive model in latent space can increase the performance compared to separate training.

2 Model

We consider a dynamical system where control inputs are denoted by u and observations are denoted by y. In the context of this paper, the observations are pixel information from images. We assume that a low-dimensional latent variable z exists that compactly represents the relevant properties of y. Since we consider dynamical systems, a low-dimensional representation z of a (static) image y is insufficient to capture important dynamic information, such as velocities. Therefore, we introduce an additional latent variable x, the state. In our case, the state x_t contains features from multiple time steps (e.g., t−1 and t) to capture velocity (or higher-order) information. Therefore, our transition model does not map features at time t−1 to time t (as illustrated in Fig. 1); instead, the transition function f maps states x_{t−1} (and controls u_{t−1}) to states x_t at time t.
The full dynamical system is given as the state-space model

  x_{t+1} = f(x_t, u_t; θ) + w_t(θ),   (1a)
  z_t = h(x_t; θ) + v_t(θ),   (1b)
  y_t = g(z_t; θ) + e_t(θ),   (1c)

where each measurement y_t can be described by a low-dimensional feature representation z_t (1c). These features are in turn modeled with a low-dimensional state-space model in (1a) and (1b), where the state x_t contains the full information about the state of the system at time instant t; see also Fig. 2a. Here w_t(θ), v_t(θ) and e_t(θ) are sequences of independent random variables and θ are the model parameters.

2.1 Approximate predictor model

To identify parameters in dynamical systems, the prediction-error method [9] has been applied extensively within the system identification community during the last five decades. It is based on minimizing the error between the sequence of measurements y_t and the predictions ŷ_{t|t−1}(θ), usually the one-step ahead prediction. To achieve this, we need a predictor model that relates the prediction ŷ_{t|t−1}(θ) to all previous measurements, control inputs and the system parameters θ.

In general, it is difficult to derive a predictor model based on the non-linear state-space model (1), and a closed-form expression for the prediction is only available in a few special cases [9]. However, by approximating the optimal solution, a predictor model for the features z_t can be stated in the form

  ẑ_{t|t−1}(θ_M) = l(Z_{t−1}; θ_M),   (2)

where Z_{t−1} = (z_1, u_1, ..., z_{t−1}, u_{t−1}) includes all past features and control inputs, l is a non-linear function and θ_M are the corresponding model parameters. Note that the predictor model no longer has an explicit notion of the state x_t. The model introduced in (2) is very flexible, and in this work we have restricted this flexibility somewhat by working with a nonlinear autoregressive exogenous model (NARX) [9], which relates the predicted current value ẑ_{t|t−1}(θ_M) of the time series to the past n values of the time series z_{t−1}, z_{t−2}, ..., z_{t−n}, as well as the past n values of the control inputs u_{t−1}, u_{t−2}, ..., u_{t−n}. The resulting NARX predictor model is given by

  ẑ_{t|t−1}(θ_M) = l(z_{t−1}, u_{t−1}, ..., z_{t−n}, u_{t−n}; θ_M),   (3)

where l is a non-linear function, in our case a neural network.

Figure 1: Combination of deep learning architectures for feature learning and predictor models in feature space. A camera observes a robot approaching an object. A good low-dimensional feature representation of an image is important for learning a predictive model if the camera is the only sensor available.

Figure 2: (a) The general graphical model of the high-dimensional data y_t. Each data point y_t has a low-dimensional representation z_t, which is modeled using a state-space model with hidden state x_t and control input u_t. (b) The approximate predictor model, where the predicted feature ẑ_{t|t−1} is a function of the n past features z_{t−n} to z_{t−1} and the n past control inputs u_{t−n} to u_{t−1}. Each of the features z_{t−n} to z_{t−1} is computed from high-dimensional data y_{t−n} to y_{t−1} via the encoder g⁻¹. The predicted feature ẑ_{t|t−1} is mapped to predicted high-dimensional data via the decoder g.
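To make the NARX predictor (3) concrete, the following is a minimal numpy sketch of a one-hidden-layer network l acting on the n = 2 most recent features and controls. All dimensions, weights and the tanh squashing function are illustrative placeholders, not the architecture used in the paper.

```python
import numpy as np

# Minimal sketch of the NARX predictor (3): a one-hidden-layer network l
# maps the n = 2 most recent features z and controls u to the next feature.
# All sizes and random weights here are illustrative placeholders.
rng = np.random.default_rng(1)
dim_z, dim_u, n, hidden = 2, 1, 2, 4

W1 = 0.1 * rng.standard_normal((hidden, n * (dim_z + dim_u)))
b1 = np.zeros(hidden)
W2 = 0.1 * rng.standard_normal((dim_z, hidden))
b2 = np.zeros(dim_z)

def l(z_past, u_past):
    """One-step prediction z_hat_{t|t-1} from the n past features/controls."""
    # regressor: (z_{t-1}, u_{t-1}, ..., z_{t-n}, u_{t-n})
    phi = np.concatenate([v for zu in zip(z_past, u_past) for v in zu])
    hid = np.tanh(W1 @ phi + b1)   # squashing hidden layer
    return W2 @ hid + b2

z_hat = l([np.ones(dim_z), np.zeros(dim_z)],
          [np.ones(dim_u), np.zeros(dim_u)])
print(z_hat.shape)   # (2,)
```

The prediction has the same dimension as a feature vector, so it can be fed back into l for multi-step prediction, as done later in (11).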
The model parameters in the non-linear function are normally estimated by minimizing the sum of the prediction errors z_t − ẑ_{t|t−1}(θ_M). However, since we are interested in good predictive performance for the high-dimensional data y rather than for the features z, we transform the predictions back to the high-dimensional space and obtain a prediction ŷ_{t|t−1} = g(ẑ_{t|t−1}; θ_D), which we use in our error measure.

An additional complication is that we do not have access to the features z_t. Therefore, before training, the past values of the time series have to be replaced with their feature representation z = g⁻¹(y; θ_E), which we compute from the pixel information y. Here, g⁻¹ is an approximate inverse of g, which will be described in more detail in the next section. This gives the final predictor model

  ŷ_{t|t−1}(θ_E, θ_D, θ_M) = g(ẑ_{t|t−1}(θ_E, θ_M); θ_D),   (4a)
  ẑ_{t|t−1}(θ_E, θ_M) = l(z_{t−1}(θ_E), u_{t−1}, ..., z_{t−n}(θ_E), u_{t−n}; θ_M),   (4b)
  z_t(θ_E) = g⁻¹(y_t; θ_E),   (4c)

which is also illustrated in Fig. 2b. The corresponding prediction error is

  ε^P_t(θ_E, θ_D, θ_M) = y_t − ŷ_{t|t−1}(θ_E, θ_D, θ_M).   (5)

2.2 Auto-encoder

We use a deep auto-encoder neural network to parameterize the feature mapping and its inverse. It consists of a deep encoder network g⁻¹ and a deep decoder network g.

Figure 3: An auto-encoder consisting of an encoder g⁻¹ and a decoder g. The original image y_t = [y_{1,t}, ..., y_{M,t}]ᵀ is mapped to its low-dimensional representation z_t = [z_{1,t}, ..., z_{m,t}]ᵀ = g⁻¹(y_t) with the encoder, and then back to a high-dimensional representation ŷ_{t|t} = g(z_t) by the decoder g, where M ≫ m.

Each layer k of the encoder neural network g⁻¹ computes y^{(k+1)}_t = σ(A_k y^{(k)}_t + b_k), where σ is a squashing function and A_k and b_k are free parameters. The input to the first layer is the image, i.e., y^{(1)}_t = y_t. The last layer is the low-dimensional feature representation of the image z_t(θ_E) = g⁻¹(y_t; θ_E), where θ_E = [..., A_k, b_k, ...] are the parameters of all neural network layers. The decoder g consists of the same number of layers in reverse order, see Fig. 3. It can be considered an approximate inverse of the encoder, such that ŷ_{t|t}(θ_E, θ_D) ≈ y_t, where

  ŷ_{t|t}(θ_E, θ_D) = g(g⁻¹(y_t; θ_E); θ_D)   (6)

is the reconstructed version of y_t. In the classical setting, the encoder and the decoder are trained simultaneously to minimize the reconstruction error

  ε^R_t(θ_E, θ_D) = y_t − ŷ_{t|t}(θ_E, θ_D),   (7)

where the parameters of g and g⁻¹ can optionally be coupled to constrain the solution to some degree [19].

The auto-encoder suits our problem very well, since it provides an explicit expression of both the mapping from features to data g and its approximate inverse g⁻¹, which is convenient for forming the predictions in (4a). Many other non-linear dimensionality reduction methods, such as the Gaussian process latent variable model (GP-LVM) [7], kernel PCA [14], Laplacian eigenmaps [1] and locally linear embedding [12], do not provide an explicit expression of both mappings g and g⁻¹. This is the main motivation why we use the (deep) auto-encoder model for dimensionality reduction.
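The layer computation y^{(k+1)}_t = σ(A_k y^{(k)}_t + b_k) and the reconstruction error (7) can be sketched in numpy as follows. The layer sizes (8-4-2) and the tanh squashing function are placeholder choices for illustration, not the paper's networks.

```python
import numpy as np

# Sketch of the auto-encoder: each layer computes sigma(A_k y^(k) + b_k).
# Layer sizes (8-4-2) and tanh are placeholder choices for illustration.
rng = np.random.default_rng(2)

def init_layers(sizes):
    return [(0.1 * rng.standard_normal((m, k)), np.zeros(m))
            for k, m in zip(sizes[:-1], sizes[1:])]

theta_E = init_layers([8, 4, 2])   # encoder g^{-1}
theta_D = init_layers([2, 4, 8])   # decoder g (same layers, reverse order)

def forward(y, layers):
    for A, b in layers:
        y = np.tanh(A @ y + b)     # squashing function sigma
    return y

y_t = rng.random(8)                # "pixels" in [0, 1]
z_t = forward(y_t, theta_E)        # feature z_t = g^{-1}(y_t)
y_rec = forward(z_t, theta_D)      # reconstruction y_hat_{t|t} = g(z_t)
eps_R = y_t - y_rec                # reconstruction error (7)
print(z_t.shape, eps_R.shape)      # (2,) (8,)
```

With untrained random weights the reconstruction error is of course large; training adjusts (A_k, b_k) in both networks to minimize it.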
3 Training

To summarize, our model contains the following free parameters: the parameters of the encoder θ_E, the parameters of the decoder θ_D and the parameters of the predictor model θ_M. To train the model, we employ two cost functions, the sum of the prediction errors (5),

  V_P(θ_E, θ_D, θ_M) = log( (1/(2NM)) Σ_{t=1}^N ||ε^P_t(θ_E, θ_D, θ_M)||² ),   (8a)

and the sum of the reconstruction errors (7),

  V_R(θ_E, θ_D) = log( (1/(2NM)) Σ_{t=1}^N ||ε^R_t(θ_E, θ_D)||² ).   (8b)

Generally, there are two ways of finding the model parameters: (1) separate training and (2) joint training, both of which are explained below.

3.1 Separate training

Normally, when features are used for inference of dynamical models, they are first extracted from the data in a pre-processing step, and as a second step the predictor model is estimated based on these features. In our setting, this corresponds to sequentially training the model using the two cost functions (8a)–(8b): we first learn a compact feature representation by minimizing the reconstruction error

  θ̂_E, θ̂_D ∈ arg min_{θ_E, θ_D} V_R(θ_E, θ_D),   (9a)

and subsequently minimize the prediction error

  θ̂_M = arg min_{θ_M} V_P(θ̂_E, θ̂_D, θ_M)   (9b)

with fixed auto-encoder parameters θ̂_E, θ̂_D. The gradients of these cost functions with respect to the model parameters can be computed efficiently by back-propagation. The cost functions are then minimized by the BFGS algorithm [11].

3.2 Joint training

An alternative to separate training is to minimize the reconstruction error and the prediction error simultaneously by considering the optimization problem

  θ̂_E, θ̂_D, θ̂_M = arg min_{θ_E, θ_D, θ_M} ( V_R(θ_E, θ_D) + V_P(θ_E, θ_D, θ_M) ),   (10)

where we jointly optimize the free parameters of both the auto-encoder (θ_E, θ_D) and the predictor model θ_M.
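A rough numpy rendering of the cost functions (8a)–(8b) and the joint objective (10) is given below; the error sequences are random placeholders standing in for ε^P_t and ε^R_t, which in practice come from the predictor and auto-encoder of Section 2.

```python
import numpy as np

# Costs (8a)-(8b): log of the normalized summed squared errors; the joint
# objective (10) adds them. eps_P / eps_R are placeholder error sequences.
def V(errors, N, M):
    return np.log(sum(float(e @ e) for e in errors) / (2 * N * M))

rng = np.random.default_rng(3)
N, M = 20, 10
eps_P = [0.10 * rng.standard_normal(M) for _ in range(N)]  # prediction errors (5)
eps_R = [0.05 * rng.standard_normal(M) for _ in range(N)]  # reconstruction errors (7)

V_P = V(eps_P, N, M)
V_R = V(eps_R, N, M)
V_joint = V_R + V_P        # objective minimized in joint training (10)
print(V_P > V_R)           # larger errors give a larger cost
```

Because of the logarithm, the joint objective balances the relative (rather than absolute) magnitudes of the two error sums.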
Again, back-propagation is used for computing the gradients of this cost function.

3.3 Initialization

The auto-encoder has strong similarities with principal component analysis (PCA). More precisely, if we use a linear activation function and only consider a single layer, the auto-encoder and PCA are identical [2]. We exploit this relationship to initialize the parameters of the auto-encoder. The auto-encoder network is unfolded, each pair of layers in the encoder and the decoder is combined, and the corresponding PCA solution is computed for each of these pairs. By starting with the high-dimensional data at the top layer and using the principal components from that pair of layers as input to the next pair of layers, we recursively compute a good initialization for all parameters in the auto-encoder network. Similar pre-training routines are found in [5], where a restricted Boltzmann machine is used instead of PCA.

4 Results

We report results on identifying (1) the non-linear dynamics of a pendulum (1-link robot arm) moving in a horizontal plane, with the torque as control input, and (2) an object moving in the 2D plane, where the 2D velocity serves as control input. In both examples, we learn the dynamics solely based on pixel information. Each pixel y^{(i)}_t is a component of the measurement y_t = [y^{(1)}_t, ..., y^{(M)}_t]ᵀ and assumes a continuous gray value in [0, 1].

4.1 Experiment 1: Pendulum in the plane

We simulated 400 frames of a pendulum moving in a plane with 51 × 51 = 2601 pixels in each frame and the torque of the pendulum as control input. To speed up training, the image input was reduced to dim(y_t) = 50 prior to model learning (system identification) using PCA. With these 50-dimensional inputs, four layers were used for both the encoder g⁻¹ and the decoder g, with dimensions 50-25-12-6-2.
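The PCA-based initialization of Section 3.3 can be sketched as a recursive projection: compute the top principal directions of the current representation, use them as the initial weights of the corresponding layer pair, and repeat on the projected data. Data dimensions and layer widths below are hypothetical.

```python
import numpy as np

# Sketch of PCA-based layer-wise initialization (Section 3.3): each layer
# pair is initialized with the top principal directions of the current
# representation; the projected data is then fed to the next pair.
# Data dimensions and layer widths below are hypothetical.
rng = np.random.default_rng(4)
Y = rng.standard_normal((200, 16))   # 200 data points, 16-dim observations
layer_widths = [8, 2]                # widths of successive encoder layers

init_weights = []
X = Y - Y.mean(axis=0)               # center the data
for m in layer_widths:
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A0 = Vt[:m]                      # top-m principal directions -> init for A_k
    init_weights.append(A0)
    X = X @ A0.T                     # recurse on the principal components

print([A.shape for A in init_weights])   # [(8, 16), (2, 8)]
```

The resulting matrices serve only as starting points; the squashing non-linearities and the subsequent gradient-based training move the parameters away from the exact PCA solution.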
Hence, the features have dimension dim(z_t) = 2. The order of the dynamics was chosen as n = 2 to capture velocity information. For the predictor model l we used a two-layer neural network with a 6-4-2 architecture.

We evaluate the performance in terms of long-term predictions. These predictions are constructed by concatenating multiple one-step ahead predictions. More precisely, the p-step ahead prediction ŷ_{t+p|t} = g(ẑ_{t+p|t}) is computed iteratively as

  ẑ_{t+1|t} = l(ẑ_{t|t}, u_t, ...),   (11a)
  ⋮
  ẑ_{t+p|t} = l(ẑ_{t+p−1|t}, u_{t+p−1}, ...),   (11b)

where ẑ_{t|t} = g⁻¹(y_t) is the feature of the data point at time instance t. We assumed that the applied future torques were known.

Table 1: Exp. 1: Prediction error V_P and reconstruction error V_R for separate and joint training.

  Training               | V_P   | V_R
  Joint training (10)    | −6.91 | −6.92
  Separate training (9)  | −5.12 | −6.99

The predictive performance of our system identification models on two exemplary image sequences of the validation data is illustrated in Fig. 4, where control inputs (torques) are assumed known. The top rows show the ground-truth images, the center rows show the predictions based on a model using joint training (10), and the bottom rows show the corresponding predictions of a model where the auto-encoder and the predictive model were trained sequentially according to (9). For the model based on jointly training all parameters, we obtain good predictive performance for both one-step ahead and multiple-step ahead prediction. In contrast, the predictive performance of learning the features and the dynamics separately is worse than that of the model trained by jointly optimizing all parameters.
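The iterative multi-step prediction (11) can be sketched with a placeholder linear predictor l of order n = 2 and scalar controls (the paper uses a neural network); future controls are assumed known, as in the experiment.

```python
import numpy as np

# Sketch of the p-step prediction (11): chain one-step predictions, feeding
# predicted features back into the predictor. The linear l below is a
# placeholder for the paper's neural network; order n = 2, scalar controls.
def l(z1, u1, z2, u2):
    return 0.6 * z1 + 0.3 * z2 + 0.1 * (u1 + u2)   # hypothetical dynamics

def rollout(z_prev, z_curr, u_prev, future_u):
    """Compute z_hat_{t+1|t}, ..., z_hat_{t+p|t} for known future controls."""
    preds = []
    for u in future_u:
        z_next = l(z_curr, u, z_prev, u_prev)
        preds.append(z_next)
        z_prev, z_curr, u_prev = z_curr, z_next, u
    return preds

z_t = np.ones(2)              # z_hat_{t|t} = g^{-1}(y_t), placeholder value
z_tm1 = np.zeros(2)
preds = rollout(z_tm1, z_t, 0.0, [0.0, 0.0, 0.0])   # p = 3 steps ahead
print(len(preds))   # 3
```

Each predicted feature could then be decoded with g to obtain the predicted image ŷ_{t+k|t}, as in the figures.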
Although the auto-encoder does a perfect job (left-most frame, 0-step ahead prediction), already the (reconstructed) one-step ahead prediction is not similar to the ground-truth image. This can also be seen in Table 1, where the reconstruction error is equally good for both models, but the prediction error is better with joint training than with separate training. Let us have a closer look at the model based on separate training: since the auto-encoder performs well, the learned transition model is the reason for the bad predictive performance. We believe that the auto-encoder found a good feature representation for reconstruction, but this representation was not ideal for learning a transition model.

Fig. 5 displays the "decoded" images corresponding to the latent representations using joint and separate training, respectively. With joint training the feature values line up in a circular shape, enabling a low-dimensional dynamical description, whereas separate training finds feature values which are not even placed sequentially in the low-dimensional representation.
Separate training extracts the low-dimensional representations without context, i.e., without the knowledge that these features constitute a time series.

Figure 4: Exp. 1: Two typical image sequences and corresponding prediction results (validation data), computed according to (11). The top rows show nine consecutive ground-truth image frames from time instant t to t+8. The second and third rows display the corresponding long-term predictions based on measured images up to time t for both joint (center) and separate training (bottom) of the model parameters.

Figure 6: Exp. 1: Fitting quality (12) for joint and separate learning of features and dynamics for different prediction horizons p. The fit is compared with the naive prediction ŷ_{t|t−p} = y_{t−p}, where the most recent image is used for the prediction p steps ahead, and with a linear subspace-ID method.
On the other hand, joint training enables the extraction of features that can also model the dynamical behavior in a compact manner.

In this particular data set, the data points clearly reside on a one-dimensional manifold, encoded by the pendulum angle. However, a one-dimensional feature space would be insufficient since this one-dimensional manifold is cyclic, see Fig. 5; compare also with the 2π period of an angle. Therefore, we have used a two-dimensional latent space. Further, the decoder produces reasonable outputs only along the manifold in the latent space where the training data reside. This can be further inspected by zooming in on a smaller region, as displayed in Fig. 7.

To analyze the predictive performance of the two training methods, we define the fitting quality as

  FIT_p = 1 − sqrt( (1/(NM)) Σ_{t=1}^N ||y_t − ŷ_{t|t−p}||² ).   (12)

As a reference, the predictive performance is compared with a naive prediction that uses the previous frame at time step t−p as the prediction at t, i.e., ŷ_{t|t−p} = y_{t−p}. The results for prediction horizons ranging from p = 0 to p = 8 are displayed in Fig. 6.

Clearly, joint learning (blue) outperforms separate learning in terms of predictive performance for prediction horizons greater than 0. Even using the last available image frame for prediction (const. pred., brown), we obtain a better fit than with the model that learns its parameters sequentially (red). This is due to the fact that the dynamical model often predicts frames which do not correspond to any real pendulum, see Fig. 4, leading to a poor fit. Furthermore, joint training gives better predictions than the naive prediction. The predictive performance degrades slightly as the prediction horizon p increases, which is to be expected. Finally, we also compare with the subspace identification method [18] (black, starred), which is restricted to linear models.
Such a restriction does not capture the non-linear, embedded features and, hence, the predictive performance is sub-optimal.

Figure 5: Exp. 1: The feature space z ∈ [−1, 1] × [−1, 1] is divided into 9 × 9 grid points. For each grid point the decoded high-dimensional image is displayed, together with the feature values corresponding to the training (red) and validation (yellow) data. Feature spaces found by (a) joint and (b) separate parameter learning. A zoomed-in version of the green rectangle is presented in Fig. 7.

Figure 7: Exp. 1: A zoomed-in version of the feature space, corresponding to the green rectangle in Fig. 5a.

4.2 Experiment 2: Tile moving in the plane

In this experiment we simulated 601 frames of a moving tile in a 51 × 51 = 2601 pixel image. The control inputs are the increments in position in both Cartesian directions. As in the previous experiment, the image sequence was reduced to dim(y_t) = 50 prior to parameter learning using PCA. A four-layer 50-25-12-8-2 auto-encoder was used for feature learning, and an 8-5-2 neural network for the dynamics.

As in the previous example, we evaluate the performance in terms of long-term predictions. The performance of joint training is illustrated in Fig. 8 on a validation data set. The model predicts future frames of the tile with high accuracy. In Fig. 9, the feature representation of the data is displayed. The features reside on a two-dimensional manifold encoding the two-dimensional position of the moving tile.
The four corners of this manifold represent the four corners of the tile position within the image frame. This structure is induced by the dynamical description. The corresponding feature representation for the case of separate learning does not exhibit such a structure; see the supplementary material. In Fig. 10, the prediction performance is displayed, where our model achieves a substantially better fit than naively using the previous frame as prediction.

5 Discussion

From a system identification point of view, the prediction-error method, where we minimize the one-step ahead prediction error, is fairly standard. However, in a future control or reinforcement learning setting, we are primarily interested in good prediction performance over a longer horizon in order to do planning. Thus, we have also investigated whether to additionally include a multi-step ahead prediction error in the cost (5). These models achieved similar performance, but no significantly better prediction error could be observed, either for one-step ahead predictions or for longer prediction horizons.

Instead of computing the prediction errors in image space, see (5), we could compute errors in feature space to avoid a decoding step back to the high-dimensional space according to (4c). However, this would require an extra penalty term in order to avoid trivial solutions that map everything to zero, eventually resulting in a more complicated and less intuitive cost function.
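For reference, the image-space fit measure (12) used in both experiments can be computed as follows; the data here is random placeholder "video", and the baseline is the naive previous-frame predictor ŷ_{t|t−p} = y_{t−p}.

```python
import numpy as np

# Fit measure (12): FIT_p = 1 - sqrt( (1/(N*M)) * sum_t ||y_t - y_hat_{t|t-p}||^2 ).
# Y below is random placeholder "video" data, not the paper's pendulum/tile data.
def fit(Y_true, Y_pred):
    N, M = Y_true.shape
    return 1.0 - np.sqrt(np.sum((Y_true - Y_pred) ** 2) / (N * M))

rng = np.random.default_rng(5)
Y = rng.random((50, 30))         # 50 frames of 30 "pixels" in [0, 1]
p = 2

fit_naive = fit(Y[p:], Y[:-p])   # previous frame as prediction
fit_perfect = fit(Y[p:], Y[p:])  # exact predictions give FIT = 1
print(fit_perfect)   # 1.0
```

A perfect predictor scores 1 (100%), and any prediction error pulls the score below 1, matching the percentage axes of Figs. 6 and 10.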
Figure 8: Exp. 2: True and predicted video frames on validation data.

Although joint learning aims at finding a feature representation that is suitable for modeling the low-dimensional dynamical behavior, the pre-training initialization described in Section 3.3 does not. If this pre-training yields feature values far from "useful" ones for modeling the dynamics, joint training might not find a good model.

The network structure has to be chosen before the actual training starts. In particular, the dimension of the latent state and the order of the dynamics have to be chosen by the user, which requires some prior knowledge about the system to be identified. In our examples, we chose the latent dimensionality based on insights about the true dynamics of the problem. In general, a model selection procedure will be preferable to find both a good network structure and a good latent dimensionality.

6 Conclusions and future work

We have presented an approach to non-linear system identification from high-dimensional time series data. Our model combines techniques from both the system identification and the machine learning communities. In particular, we used a deep auto-encoder for finding low-dimensional features from high-dimensional data, and a nonlinear autoregressive exogenous model was used to describe the low-dimensional dynamics.
The framework has been applied to a pendulum moving in the horizontal plane. The proposed model exhibits good predictive performance, and a major improvement has been identified by training the auto-encoder and the dynamical model jointly instead of training them separately/sequentially.

Figure 9: Exp. 2: Feature space zoomed in to the region z ∈ [−0.35, 0.35] × [−0.25, 0.25]. The training (red) and validation (yellow) data reside on a two-dimensional manifold corresponding to the two-dimensional position of the tile.

Figure 10: Exp. 2: The fit (12) for different prediction horizons p.

Possible directions for future work include (a) robustifying learning by using denoising auto-encoders [19] to deal with noisy real-world data; (b) applying convolutional neural networks, which are often more suitable for images; (c) exploiting the learned model for learning controllers purely based on pixel information; and (d) investigating sequential Monte Carlo methods for a systematic treatment of such nonlinear probabilistic models, which are required in a reinforcement learning setting.

In a setting where we make decisions based on (one-step or multiple-step ahead) predictions, such as optimal control or model-based reinforcement learning, a probabilistic model is often needed for robust decision making, as we need to account for model errors [13, 3]. An extension of our present model to a probabilistic setting is non-trivial since random variables have to be transformed through the neural networks, and their exact probability density functions will be intractable to compute. Sampling-based approaches or deterministic approximate inference are two options that we will investigate in the future.

Acknowledgments

This work was supported by the Swedish Foundation for Strategic Research under the project Cooperative Localization and by the Swedish Research Council under the project Probabilistic modeling of dynamical systems (Contract number: 621-2013-5524). MPD was supported by an Imperial College Junior Research Fellowship.

Supplementary material

In the second experiment in the paper, results for the joint learning of prediction and reconstruction error have been reported. The joint learning brings structure to the feature values, which is not present if the auto-encoder is learned separately, see Fig. 11.

Figure 11: Exp. 2: The feature space in the region z ∈ [−1, 1] × [−1, 1] for separate learning. The training (red) and validation (yellow) data do not reside on a structured two-dimensional tile-formed manifold, as was the case for joint learning in Fig. 9.

References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[2] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
[3] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[4] Z. Ghahramani. Learning dynamic Bayesian networks. In C. Giles and M. Gori, editors, Adaptive Processing of Sequences and Data Structures, volume 1387 of Lecture Notes in Computer Science, pages 168–197. Springer, 1998.
[5] G. Hinton and R. Salakhutdinov.
Reducing the dimensionality of data with neural networks. Science, 313:504–507, 2006.
[6] N. Kantas, A. Doucet, S. Singh, and J. Maciejowski. An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. In Proceedings of the 15th IFAC Symposium on System Identification, pages 774–785, Saint-Malo, France, July 2009.
[7] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783–1816, 2005.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[9] L. Ljung. System Identification: Theory for the User. System Sciences Series. Prentice Hall, Upper Saddle River, NJ, USA, second edition, 1999.
[10] L. Ljung. Perspectives on system identification. Annual Reviews in Control, 34(1):1–12, 2010.
[11] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. New York, USA, second edition, 2006.
[12] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[13] J. G. Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. In Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann Publishers, 1997.
[14] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
[15] T. B. Schön, A. Wills, and B. Ninness. System identification of nonlinear state-space models. Automatica, 47(1):39–49, 2011.
[16] R. H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm.
Journal of Time Series Analysis, 3(4):253–264, 1982.
[17] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.-Y. Glorennec, H. Hjalmarsson, and A. Juditsky. Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31(12):1691–1724, 1995.
[18] P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer Academic Publishers, 1996.
[19] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
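The sampling-based treatment mentioned in the future-work discussion can be sketched in a few lines: draw samples from a Gaussian belief over the latent state, push them through a nonlinear network, and summarize the (analytically intractable) output distribution by its empirical moments. The network, its weights, and all dimensions below are illustrative assumptions, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)

# A small one-hidden-layer network standing in for a nonlinear transition.
W1 = rng.normal(size=(8, 2)); b1 = rng.normal(size=8)
W2 = rng.normal(size=(1, 8)); b2 = rng.normal(size=1)

def f(z):
    """Nonlinear mapping applied row-wise to latent states z of shape (n, 2)."""
    return np.tanh(z @ W1.T + b1) @ W2.T + b2

# Gaussian belief over the latent state (mean and covariance are made up).
mu = np.array([0.5, -0.2])
Sigma = 0.05 * np.eye(2)

# Sampling-based propagation: the exact pushforward density is intractable,
# but its mean and variance can be estimated from transformed samples.
samples = rng.multivariate_normal(mu, Sigma, size=5000)
out = f(samples)
mean_mc, var_mc = out.mean(), out.var()

# Naive alternative: propagate only the mean, which discards all uncertainty.
mean_naive = f(mu[None, :]).item()
print(mean_mc, var_mc, mean_naive)
```

The nonzero variance estimate is exactly the quantity a mean-only propagation throws away, which is why a probabilistic treatment matters for the control and reinforcement-learning settings discussed above [13, 3].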