PredNet and Predictive Coding: A Critical Review
Roshan Prakash Rane* (rane@uni-potsdam.de), University of Potsdam, Potsdam, Germany
Edit Szügyi* (szuegyi@uni-potsdam.de), University of Potsdam, Potsdam, Germany
Vageesh Saxena* (saxena@uni-potsdam.de), University of Potsdam, Potsdam, Germany
André Ofner (andre.ofner@ovgu.de), Otto von Guericke University, Magdeburg, Germany
Sebastian Stober (stober@ovgu.de), Otto von Guericke University, Magdeburg, Germany

*All authors contributed equally to this research.

ABSTRACT
PredNet, a deep predictive coding network developed by Lotter et al., combines a biologically inspired architecture based on the propagation of prediction error with self-supervised representation learning in video. While the architecture has drawn a lot of attention and various extensions of the model exist, a critical analysis is still lacking. We fill this gap by evaluating PredNet both as an implementation of the predictive coding theory and as a self-supervised video prediction model, using a challenging video action classification dataset. We design an extended model to test whether conditioning future frame predictions on the action class of the video improves model performance. We show that PredNet does not yet completely follow the principles of predictive coding. The proposed top-down conditioning leads to a performance gain on synthetic data, but does not scale up to the more complex real-world action classification dataset. Our analysis is aimed at guiding future research on similar architectures based on the predictive coding theory.

CCS CONCEPTS
• Computing methodologies → Machine learning; Semi-supervised learning settings; Hierarchical representations; Computer vision problems.

KEYWORDS
deep learning, convolutional neural networks, video classification, video prediction, semi-supervised, predictive coding

ACM Reference Format:
Roshan Prakash Rane, Edit Szügyi, Vageesh Saxena, André Ofner, and Sebastian Stober. 2020. PredNet and Predictive Coding: A Critical Review. In Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20), June 8–11, 2020, Dublin, Ireland. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3372278.3390694

1 INTRODUCTION
Learning a model of the visual world is a crucial prerequisite for reliably performing computer vision tasks like object detection and semantic segmentation. As illustrated by [15], self-supervised learning allows this complex structure of the real world to be extracted without the need for expensive labeled data. Videos contain information about how scenes evolve in time, and predicting the future frames of a video is therefore a popular method [7][20][29][30][31] of extracting this structure in a self-supervised manner.
Previous research [18][20][26] has hypothesized that to accurately predict how the visual world changes, a model should learn about object structure and the possible transformations objects can undergo. Among the various video prediction models, PredNet by Lotter et al. [18] achieves high video prediction accuracy with the additional benefit of using a biologically plausible architecture. The PredNet architecture is inspired by the predictive coding theory from the neuroscience literature [8][21][25] and attempts to implement it with deep neural networks.

Predictive coding is a promising self-supervised learning technique and has been shown to replicate some of the neuronal behavior seen in the mammalian visual cortex. It posits that the brain continually makes predictions of incoming sensory stimuli and uses the deviations from these predictions as a learning signal. It describes hierarchical networks consisting of top-down connections that carry predictions from higher to lower levels and bottom-up connections that carry sensory evidence from lower to higher levels. The prediction error at each layer is propagated upwards, eventually leading to better predictions in the future.

In this paper, we first evaluate PredNet's performance on video prediction by testing it on a demanding dataset. Then we examine its capability to learn latent features that are useful for downstream tasks. Specifically, our contributions are two-fold:

(1) Using visualization techniques and experiments, we review PredNet as an emulation of the predictive coding framework and as a video prediction model.
(2) We test the features extracted by PredNet by training it in a semi-supervised setup to perform video action classification. We also evaluate whether conditioning the top-down predictions on the action classes of the video, and vice versa, improves the model's accuracy. We further test this hypothesis on a simple synthetic dataset by conditioning the predictions of PredNet with informative top-down class labels.

The paper is organized as follows: Section 2 reviews predictive coding and its implementations. Section 3 describes our experiment setup, namely our data, the architecture, and the evaluation metrics used. Section 4 is dedicated to the first phase of our experiments, listing our observations while testing PredNet. Section 5 gives details on the second phase, the implementation and evaluation of PredNet+, our proposed extension of the architecture. Section 6 concludes the paper and lists possible directions for future research.

2 RELATED WORK
Rao and Ballard [21] replicated the 'extra-classical' receptive field effects observed in the early stages of cat and monkey visual cortex with an artificial predictive coding network. These observable effects are a direct result of the brain trying to efficiently encode sensory data using prediction. This was accompanied by a rich body of work in neuroscience [2][6][8][25][27] and computational modelling [1][18][24] that explored different interpretations and implementations of the basic idea, not only in the visual domain but also in various sensory-motor domains. Following the success of deep learning in the last decade, many researchers have attempted to implement the predictive coding model using deep learning [12][18][28][34]. Wen et al. [34] use predictive coding on static images to learn optimal feature vectors at each layer for object recognition.
Han et al. [12] build on this to develop a bidirectional and dynamic neural network with local recurrent processing. Van den Oord et al. [28] perform predictive coding in a latent space and use a probabilistic contrastive loss to learn useful representations. Lotter et al. [18] design a video prediction network using the principles of predictive coding. We chose to evaluate Lotter et al.'s PredNet architecture for two main reasons: First, it is designed to be structurally close to the biological predictive coding model, with its hierarchical structure, bottom-up error propagation, and top-down predictions. Second, PredNet achieves accuracy on par with state-of-the-art video prediction models and is a popular baseline used across different spatio-temporal prediction tasks.

Zhong et al. [35] extend PredNet into AFA-PredNet within the robotics domain. They integrate the motor actions of a robot as an additional signal to condition the top-down generative process. Following this, they design MTA-PredNet [36], which operates at different temporal scales at different layers in the hierarchy. MTA-PredNet was developed to compensate for PredNet's inability to perform reliable long-term predictions, which is a necessity for planning in robotics. Furthermore, researchers have tried to improve PredNet by adding skip connections alongside error propagation [22], by reducing the number of parameters through fewer gates in the top-down ConvLSTM units [5], and by using inception-type units within each PredNet layer [14]. Sato et al. [22] evaluate PredNet on a weather precipitation dataset, and Watanabe et al. [33] test PredNet's response to visual illusions to examine whether predictive coding models respond to them as humans do. However, none of the above work critically evaluates PredNet as an implementation of predictive coding and as a reliable self-supervised pretraining method. Our work aims to provide this in the form of a critical review of PredNet for future works that intend to use the architecture or design architectures inspired by it.

3 EXPERIMENT SETUP

3.1 Dataset
Most existing large-scale video classification datasets have coarse-grained labels [13][16][17]. This means that the models are trained on a relatively easy task where the label can be detected even from isolated frames; e.g., the 'soccer' label can be inferred from a green field. To overcome this issue and force models to learn better representations, the Something-something dataset [11] was introduced. This dataset contains 220,000 videos with 174 fine-grained action labels. For instance, 'putting something on a table', 'pretending to put something on a table', and 'putting something on a slanted surface so it slides down' are three different label classes in the dataset. Mahdisoltani et al. [19] provide evidence for the hypothesis that task granularity is strongly correlated with the quality and generalizability of learned features. As for the nature of the data, being crowd-sourced, it includes noise that closely resembles the real world: thousands of different objects, and variations in lighting conditions, background patterns, and camera motion.

3.2 PredNet architecture
The PredNet architecture is shown in Figure 1 [18]. The network is composed of stacked hierarchical layers, each of which attempts to make local predictions of its input. The difference between the actual input and this prediction is then passed up the hierarchy to the next layer. Information flows in three ways through the network: (1) the error signal flows in the bottom-up direction, as marked by the red arrows on the right; (2) the prediction signal flows in the top-down direction, as shown by the green arrow on the left; and (3) the local error signal and prediction estimation signal flow within each layer.

Every layer consists of four units: an input convolution unit (A_i), a recurrent representation unit (R_i) followed by a prediction unit (Â_i), and an error calculation unit (E_i), as labelled in Figure 1. The representation unit is made of a ConvLSTM [23] layer that estimates what the input will be on the next time step. This estimate is fed into the prediction unit, which generates the prediction Â_i. The error unit calculates the difference between the prediction and the input, which is fed as input to the next layer. The representation unit receives a copy of the error signal (red arrow) along with the up-sampled representation of the higher level (green arrow), which it uses along with its recurrent memory to perform future predictions.

Figure 1: The original PredNet architecture by Lotter et al. [18].
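The per-layer computation described above can be summarized in a short sketch. The following PyTorch code is a minimal illustration, not the reference implementation: the ConvLSTM cell is hand-rolled, all channel and kernel sizes are arbitrary, and the splitting of the error into rectified positive and negative populations follows Lotter et al.'s published code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell standing in for the representation unit R_i."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class PredNetLayer(nn.Module):
    """One PredNet layer: representation R_i, prediction Â_i, error E_i."""
    def __init__(self, a_ch, r_ch, r_top_ch=0):
        super().__init__()
        # R_i sees the previous error E_i (2*a_ch after the pos/neg split)
        # plus the up-sampled higher-level representation, if present.
        self.r_cell = ConvLSTMCell(2 * a_ch + r_top_ch, r_ch)
        self.pred = nn.Conv2d(r_ch, a_ch, 3, padding=1)  # prediction unit Â_i

    def forward(self, a, e_prev, h, c, r_top=None):
        x = e_prev
        if r_top is not None:  # top-down input (green arrow in Figure 1)
            x = torch.cat([x, F.interpolate(r_top, scale_factor=2)], dim=1)
        h, c = self.r_cell(x, h, c)          # update representation R_i
        a_hat = F.relu(self.pred(h))         # prediction Â_i
        # Error unit: rectified positive and negative prediction errors,
        # concatenated along channels; E_i is passed up (red arrows).
        e = torch.cat([F.relu(a - a_hat), F.relu(a_hat - a)], dim=1)
        return e, h, c
```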
3.3 Evaluation metrics
Defining a good evaluation metric for the quality of image predictions is a challenging task in itself [4][20]. There is no universally accepted measure of image quality and, consequently, none for image similarity. For the video prediction task, we employ the two metrics most commonly used in the literature: the Peak Signal-to-Noise Ratio (PSNR) [20] and the Structural Similarity Index Measure (SSIM) [32]. Like Mathieu et al. [20], we calculate PSNR and SSIM only for the frames that contain movement with respect to the previous frame and call these variants 'PSNR movement' and 'SSIM movement', respectively. In our case this is crucial, as action videos often contain very few frames with movement, and a metric should not reward a model for simply predicting a still frame. We use a third metric called 'conditioned SSIM', calculated as given in Equation 1. This metric quantifies how different the predictions are from the previous frame, and therefore measures how 'risky' the predictions of our model are in comparison to simply performing a 'last-frame copy':

SSIM_cond = (SSIM_max − SSIM(actual_{t−1}, pred_t)) · SSIM(actual_t, pred_t)    (1)
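For concreteness, conditioned SSIM can be computed with scikit-image's SSIM routine as below. This is a sketch under two assumptions not fixed by Equation 1: SSIM_max = 1 (the maximum attainable SSIM value) and single-channel frames scaled to [0, 1].

```python
from skimage.metrics import structural_similarity as ssim

def conditioned_ssim(actual_prev, actual_t, pred_t, ssim_max=1.0):
    """Conditioned SSIM (Equation 1): rewards predictions that are both
    close to the actual frame and different from the previous frame,
    penalizing trivial last-frame copies. Frames are 2-D float arrays
    in [0, 1]; ssim_max = 1.0 is an assumption."""
    risk = ssim_max - ssim(actual_prev, pred_t, data_range=1.0)
    quality = ssim(actual_t, pred_t, data_range=1.0)
    return risk * quality
```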
4 PROBING PREDNET
In the first phase, we review PredNet by evaluating its performance on the Something-something dataset and visualizing different states of the architecture. For our experiments, we use 10 different hyperparameter settings with varying numbers of layers, channels per layer, input image sizes, and frames-per-second (FPS) rates. These settings are listed in Table 1. Along with the predicted frame, we visualize the different states of PredNet at each layer by averaging the activations of all channels in a layer, similar to Han et al. [12]. We also plot the mean of the error signals E_i and representations R_i of every layer to visualize how they evolve over the span of the video. A sample video with visualizations is shown in Figure 2. In the following section, we dedicate one paragraph to each of our findings; Figure 2 and Table 1 provide further details.

Figure 2: Visualization of the evaluated model states during next-frame prediction. Each column corresponds to a single time step; rows show the computed states in each layer.

| Model | Frame rate (FPS) | Layers | Image size (pixels) | Param. (millions) | Loss  |
|-------|------------------|--------|---------------------|-------------------|-------|
| 0     | 3                | 4      | 48 × 56             | 6.9               | L_0   |
| 1     | 6                | 4      | 48 × 56             | 6.9               | L_0   |
| 2     | 12               | 4      | 48 × 56             | 6.9               | L_0   |
| 3     | 12               | 4      | 32 × 48             | 6.9               | L_0   |
| 4     | 12               | 5      | 48 × 80             | 5.3               | L_0   |
| 5     | 12               | 6      | 64 × 96             | 5.8               | L_0   |
| 6     | 12               | 7      | 128 × 192           | 6.2               | L_0   |
| 7     | 12               | 6      | 96 × 160            | 7.2               | L_0   |
| 8     | 12               | 5      | 48 × 80             | 5.3               | L_all |
| 9     | 12               | 6      | 64 × 96             | 5.8               | L_all |

Table 1: The frame rate, number of layers, image size, number of model parameters (in millions), and training loss (L_0 or L_all) of each evaluated model. Similar models are grouped together, differing only in the varied setting.

4.1 Observations
Comparing the frame predictions with the input frames in Figures 2, 3, and 4, we can summarize the working dynamics of PredNet on the action classification dataset as follows: the model performs a previous-frame copy if there are no cues for motion in the previous two frames. If there is a cue for motion, and the direction of the motion is continuous and the motion is smooth, it interpolates the object in the direction of the motion. Otherwise, it blurs the region containing the moving object to keep the L2 loss minimal, by virtue of regression to the mean.

The blurring is a result of PredNet's inability to learn multi-modal predictions: it learns to perform one ideal prediction. It is a typical characteristic of action-based video sequences that there are multiple possible future states. For instance, as seen in Figure 3, the thumb can move up, move down, or not move at all in the next frame. The blurring behaviour of PredNet is further confirmed by an experiment we conducted with different sharpness measures: the predictions of the model are always less sharp than the actual videos.

Figure 3: Example of a video with multiple possible future states.
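The regression-to-the-mean effect is easy to reproduce in isolation. The toy example below (our illustration, not an experiment from the paper) shows that when two futures are equally likely, the blurred average of the two modes attains a lower expected L2 error than committing to either mode:

```python
import numpy as np

# Toy 1-D "frames": the thumb either moves up or down with equal probability.
future_up   = np.array([0.0, 1.0, 0.0, 0.0])
future_down = np.array([0.0, 0.0, 1.0, 0.0])
blurred     = 0.5 * (future_up + future_down)   # the mean of the two modes

def expected_l2(pred):
    """Expected L2 loss over the two equally likely futures."""
    return 0.5 * np.sum((pred - future_up) ** 2) + \
           0.5 * np.sum((pred - future_down) ** 2)

print(expected_l2(future_up))   # 1.0 -> committing to one mode
print(expected_l2(blurred))     # 0.5 -> the blurry mean wins under L2
```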
PredNet learns relevant features only when trained on videos with continuous motion. The authors [18] designed and tested PredNet on videos with continuous motion, such as the KITTI dataset [9] and their synthetic 'Rotating Faces' dataset. This is in stark contrast to our action dataset, which can contain many still frames; see e.g. Figure 4. In this scenario, PredNet resolves to mere last-frame copying, as it is statistically beneficial to do just that. If the model is not pushed to learn the dynamics of how objects move and scenes evolve, then the features it learns will not be useful for downstream tasks, as hypothesized by Lotter et al. [18].

Figure 4: Example of a low-FPS video and the predictions made by PredNet.

PredNet's learning ability is sensitive to the frames-per-second (FPS) rate. When we compare the performance of models trained on videos with FPS rates of 3, 6, and 12 in Figure 5, we see that the performance varies greatly. In this and all following figures, we show the improvement in the given evaluation metric, i.e., the difference in the metric between the model being evaluated and a baseline model performing last-frame copying. Manual inspection of the predictions further confirms the large difference in prediction quality. At very high FPS there is minimal motion between two consecutive frames, while at low FPS rates there is abrupt movement between frames, which is challenging to predict. In both of these scenarios, the model completely resorts to last-frame copying. Therefore, the FPS of the video is one of the most important hyperparameters of PredNet.

Figure 5: Comparison of model performance with respect to the employed frames-per-second (FPS) rate. SSIM and PSNR scores show the model's improvement over a last-frame-copy baseline.

Two key insights indicate that PredNet is not a comprehensive emulation of hierarchical predictive coding. Firstly, the 'mean E activation' plot in Figure 2 shows that the mean bottom-up error increases as one goes up towards higher layers, a behavior observed in all visualized sample videos. This is contrary to the expectations of predictive coding, which posits that the error decreases up the hierarchy as parts of the incoming signal are iteratively 'explained away'. Secondly, Lotter et al. [18] demonstrate that models trained with the L_0 loss perform better than those trained with the L_all loss on the KITTI data. We cross-check this on our dataset and obtain similar results: as shown in Figure 6, models with the L_0 loss perform better on all metrics. Training with the L_0 loss means that we only minimize the error E_0 of the lowest layer (see Figure 1), while with the L_all loss the model is trained to minimize the prediction errors in all layers. Predictive coding suggests that each layer in the hierarchy minimizes its error signal iteratively; an accurate implementation of predictive coding should therefore improve when trained with the L_all loss.

Figure 6: Influence of the L_0 and L_all losses on model performance. SSIM and PSNR scores show the model's improvement over a last-frame-copy baseline.

Furthermore, the visualization of the mean activation of PredNet's states at different layers in Figure 2 shows that the states in the lowest layer differ from those of all higher layers. This indicates that the model operates with two sub-modules when trained with the L_0 loss: the lowest layer R_0 aims to generate realistic (t+1) predictions, while the rest of the layers operate as one deep network that regresses E_0 to generate the context R_1. This is also indicated by the fact that the mean R activation of the lowest layer is higher and follows a different trajectory than that of the rest of the layers in Figure 2.

We further evaluate PredNet by examining its ability to extrapolate and predict longer time steps into the future. As explained in Lotter et al. [18], PredNet can generate long-term predictions by simply feeding its prediction at time t back in as input at the next time step (t+1). This can be done iteratively for n time steps to obtain a (t+n) prediction.
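The rollout itself is only a few lines. The sketch below assumes a hypothetical single-step interface model.step(frame, state) returning the next-frame prediction and the updated recurrent state; the actual PredNet code exposes this loop differently:

```python
def extrapolate(model, observed_frames, n_steps):
    """Iterative (t+n) extrapolation: prime the network on observed
    frames, then feed each prediction back in as the next input.
    `model.step` and `model.init_state` are hypothetical interfaces."""
    state = model.init_state()
    pred = None
    for frame in observed_frames:        # open-loop priming on real frames
        pred, state = model.step(frame, state)
    rollout = []
    for _ in range(n_steps):             # closed-loop extrapolation
        pred, state = model.step(pred, state)
        rollout.append(pred)
    return rollout
```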
We test the extrapolation capability both of PredNet models trained only to perform (t+1) predictions and of PredNet models trained exclusively to perform (t+n) predictions. As expected, the results are marginally better in the latter case, as also demonstrated in Lotter et al. [18]. The extrapolated predictions of our best-performing model are given in Figure 7; extrapolation is started at different time points in the video, as shown by the red marker in the figure.

Figure 7: Extrapolation results for Model 7, with extrapolation started at different time steps. The red mark shows the start of the extrapolation.

Three observations can be made from these experiments. (1) After two time steps, the model resorts to last-frame copying. As already discussed, PredNet performs predictions by using the movement between consecutive input frames as an active cue. While extrapolating, when we feed the predictions back as input, the model receives the cue that the action has stopped and reverts to last-frame copying. (2) The predictions get blurrier over time, because the minor blur added by the down-sampling units in the bottom-up pass and the up-sampling units in the top-down pass accumulates exponentially over time. (3) From the metrics in Figure 8, we can infer that the models perform better if the extrapolation is started in the later stages of the video. This can be explained by the fact that, in our dataset, motion generally starts in the middle of or towards the end of the video. In conclusion, the extrapolation experiments suggest that the network design compels it to learn short-term interpolations instead of building long-term predictions.

Figure 8: Comparison of the metrics when starting extrapolation at different stages of the video; t denotes the total number of frames in the video. SSIM and PSNR scores show the model's improvement over a last-frame-copy baseline.

Finally, we notice that the model delivers improved predictions only when the topmost layers have a full receptive field. Only this setting allows the model to predict object movements instead of just blurring local regions of motion. The receptive field can be increased by adding deeper layers, by increasing the kernel size of the convolutions, or by reducing the image size. We experimented with each of these and found that the prediction scores improve with an increased receptive field. Figure 9 shows the results of experimenting with different numbers of layers; the prediction quality clearly improves with increasing depth.

Figure 9: Comparison of performance for models with 4, 5, 6, and 7 layers. SSIM and PSNR scores show the model's improvement over a last-frame-copy baseline.

5 LABEL CLASSIFICATION WITH PREDNET+
In this section, we describe the second phase of our experiments, in which we test the architecture by modifying it to perform supervised label classification simultaneously with video prediction. For a comparison of the architectures of PredNet+ and the vanilla PredNet, see Figures 10 and 1, respectively. The model design, the rationale behind it, and the results are discussed next.

5.1 PredNet+ Design
We modify the PredNet architecture such that it can perform video label classification along with next-frame prediction, and informally call this architecture PredNet+. The architecture is shown in Figure 10.

Figure 10: The proposed PredNet+ architecture, with an additional classification pathway attached to the deepest prediction layer.

As seen there, PredNet+ contains an additional label classification unit attached to the top-most representation layer, consisting of an encoder and a decoder part. The two ConvLSTM layers form the encoder, which transforms the output of the representation unit R_3 into label class probabilities. The two transposed convolution layers make up the decoder, which up-samples and transforms the label classes back to the image modality; this signal is fed back into the top-down pathway, as shown by the black arrow to R_2. The label classification unit makes a prediction at each incoming frame; the weighted sum of these predictions is passed through a softmax function to obtain the final class probabilities for the video. As the model does not have enough context to make meaningful predictions at the beginning of the video, the weighting over time follows an exponential function. PredNet+ is designed such that the latent features at the top-most representation layer are shared between its two tasks.
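A sketch of the exponentially weighted pooling just described, in PyTorch. The paper only states that the weighting over time is exponential, so the exact form and the time constant tau below are our assumptions:

```python
import torch

def video_class_probs(frame_logits, tau=4.0):
    """Pool per-frame label predictions into one video-level prediction.
    frame_logits: (T, num_classes). Later frames, where the recurrent
    model has accumulated more context, receive exponentially larger
    weights; tau=4.0 is an illustrative assumption."""
    T = frame_logits.shape[0]
    w = torch.exp(torch.arange(T, dtype=frame_logits.dtype) / tau)
    w = w / w.sum()                       # normalize the weights
    pooled = (w[:, None] * frame_logits).sum(dim=0)
    return torch.softmax(pooled, dim=-1)  # final class probabilities
```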
The future frame predictions are conditioned on the label predictions made by the label classification unit (shown in Figure 10 by the black arrow going into R_2). We hypothesized that this would improve the results on both sub-tasks, as observed in many multi-task training scenarios [3][10]. Even though we attach the label classification unit to the top-most layer, this is neither the only approach nor necessarily the best one under predictive coding. We decided on this setup because the top-most layer of the model has both a full receptive field and access to previous states. In summary, the label classification unit and the prediction units in PredNet+ are expected to work in tandem in a multi-task learning setup and form a synergy. However, this is not what we observe in our results.

5.2 Results
Table 2 shows our best classification accuracy in comparison to the baseline model scores of Goyal et al. [11] and the current state-of-the-art results of Mahdisoltani et al. [19] on the Something-something dataset.

| Model                    | Top-1 | Top-5 |
|--------------------------|-------|-------|
| Baseline [11]            | 11.5  | 30.0  |
| Ours                     | 28.2  | 57.0  |
| Mahdisoltani et al. [19] | 51.38 | –     |

Table 2: Classification accuracy on the Something-something dataset with 174 label categories.

We test the PredNet+ architecture on our best 4-layer, 5-layer, and 6-layer models from Table 1. Furthermore, we test the following minor variations of PredNet+ to further evaluate the model architecture: First, we remove the recurrent memory in the label classification unit by replacing the ConvLSTM with convolution layers. Next, we extend the label classification loss function such that the model is rewarded for predicting at least the correct verb in the label. For example, if the correct label is 'Pretending to put something behind something', the model is penalized twice as much for predicting 'Showing something to the camera' as for predicting 'Putting something behind something', which shares the verb of the correct label. Surprisingly, the classification results do not change at all (±0.6%) for any of these model variations. This suggests that the features from the top-most representation unit carry no additional information. The label classification scores show that PredNet+ is a long way from the state-of-the-art architectures.

Furthermore, the future frame predictions of PredNet+ degrade in comparison to its equivalent vanilla PredNet models, Model 5 (L_0 loss) and Model 8 (L_all loss). The metrics in Figure 11 and the visualization of the predictions both show this. To analyze this further, we experiment with different loss weights for the two tasks, which allows us to control the relative importance of each task during training. We find that the model's future prediction quality degrades when the label classification task is given increased importance, suggesting that the multi-task constraint leads to worse future frame predictions.

Figure 11: Comparison of the prediction quality of PredNet+ with equivalent PredNet models. SSIM and PSNR scores show the model's improvement over a last-frame-copy baseline.
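The verb-aware penalty described above can be read as a per-sample weight on the cross-entropy, doubled whenever the predicted class carries a different verb than the target. The sketch below encodes this reading; the verb_of lookup table mapping each of the 174 class indices to a verb id is hypothetical:

```python
import torch
import torch.nn.functional as F

def verb_aware_loss(logits, target, verb_of, wrong_verb_factor=2.0):
    """Cross-entropy whose per-sample weight doubles when the predicted
    class has a different verb than the target class. logits: (B, 174),
    target: (B,), verb_of: (174,) LongTensor (hypothetical lookup)."""
    ce = F.cross_entropy(logits, target, reduction="none")
    pred = logits.argmax(dim=-1)
    same_verb = verb_of[pred] == verb_of[target]
    weight = torch.where(same_verb,
                         torch.ones_like(ce),
                         torch.full_like(ce, wrong_verb_factor))
    return (weight * ce).mean()
```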
5.3 Representation learning with top-down conditioning and synthetic data
To further evaluate our hypothesis that conditioning the top-down predictions on the class labels of the video improves model accuracy, we evaluate PredNet+ on a synthetic dataset. We employ a moving-MNIST dataset with a static background of randomly generated overlapping geometric shapes and a single hand-written digit moving in one of eight directions. Each frame is annotated with a label representing the future direction of the digit's movement. Samples of the dataset are given in the upper rows of Figure 12.

We test whether adding semantic top-down information increases the prediction score of the network. We thus track spatiotemporal prediction performance while using the movement label classification as an auxiliary task. The generated predictions generally show lower confidence in the moving parts of the input frames, especially in the first frames, as seen in the lower row of Figure 12 (a). Predictions made by the model with the additional label classification pathway are presented in Figure 12 (b).

Figure 12: Model predictions on moving MNIST: (a) the original PredNet model and (b) the model with the additional classification pathway.

The scores achieved by a previous-frame-copy model, plain PredNet, and PredNet+ are displayed in Table 3. Multi-task learning with movement direction classification improves the MAE score and leads to sharper predictions in the non-stationary parts of the input images. This indicates that conditioning the top-down predictions with semantic information can improve model performance, especially when the additional information relates directly to predicted features in the input space.

| Model               | MAE     |
|---------------------|---------|
| Previous-frame-copy | 8.0e-05 |
| PredNet             | 7.6e-05 |
| PredNet+            | 7.3e-05 |

Table 3: Comparison of the predictive performance of the original and the multi-task PredNet model. Scores are for next-frame prediction on the moving MNIST dataset.
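A toy generator in the spirit of the synthetic dataset described in this section might look as follows. This is our illustration only: the digit source, frame size, and per-frame speed are assumptions, and since the direction is constant within a sequence, one label per sequence suffices in this sketch:

```python
import numpy as np

DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]  # eight movement directions

def make_sequence(digit, background, n_frames=10, rng=None):
    """Paste a digit crop onto a static geometric background and translate
    it along one randomly chosen direction; the direction index is the
    movement label. digit and background are 2-D float arrays."""
    rng = rng or np.random.default_rng()
    label = int(rng.integers(len(DIRECTIONS)))
    dy, dx = DIRECTIONS[label]
    H, W = background.shape
    h, w = digit.shape
    y, x = (H - h) // 2, (W - w) // 2     # start in the centre
    frames = []
    for _ in range(n_frames):
        frame = background.copy()
        frame[y:y + h, x:x + w] = np.maximum(frame[y:y + h, x:x + w], digit)
        frames.append(frame)
        y = int(np.clip(y + dy, 0, H - h))
        x = int(np.clip(x + dx, 0, W - w))
    return np.stack(frames), label
```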
6 CONCLUSION AND FUTURE WORK
We have evaluated PredNet [18] on a challenging action classification dataset in two phases. In the first phase of our work, we investigated PredNet and derived the following insights: (1) PredNet does not completely follow the principles of the predictive coding framework. (2) It performs only short-term next-frame interpolations rather than long-term video predictions, as further confirmed by the extrapolation experiments. (3) The representation units are unable to learn multi-modal distributions and produce blurry predictions. (4) The model's learning ability is sensitive to the continuity of motion and to the FPS rate of the videos.

In the second phase, we tested PredNet's ability to learn latent features that are useful for label classification. We used the features from the highest representation layer and found that they are not adequate for the task at hand, namely the classification of a complex action dataset: we achieve a classification accuracy of 28.2%, compared to the current state of the art of 51.38% [19], and the prediction accuracy also under-performs the vanilla PredNet. In a further step, we experimented on a synthetic dataset and showed that top-down conditioning can improve the prediction scores.

Our results lead to several suggestions for improved models. Firstly, the network should be trainable with the L_all loss; this can be achieved by designing error estimators that are local to each layer (see the sketch below). Secondly, the network should be redesigned such that it is encouraged to perform long-term predictions rather than just frame-to-frame interpolation. One way to do this is to add layers higher in the hierarchy that make predictions at different temporal scales. Additionally, PredNet's performance metrics show high variance, while PredNet+ is easily susceptible to over-fitting; these points signal the need for regularization techniques and model-averaging methods, such as dropout, within the architecture. Finally, the representation units should learn multi-modal probability distributions from which predictions can be sampled. This could be addressed, for example, with probabilistic representations in some or all layers.
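As a reference point for the first suggestion, the training objective can be written as a weighted sum of per-layer error activations. A minimal sketch; the weight vectors below ([1, 0, 0, 0] for L_0, [1, 0.1, 0.1, 0.1] for L_all) follow our reading of Lotter et al.'s reference code and should be treated as assumptions:

```python
import torch

def prednet_loss(layer_errors, layer_weights):
    """Weighted sum of the unit-averaged layer errors E_i. A zero weight
    detaches a layer's error from the training objective."""
    return sum(w * e.abs().mean() for w, e in zip(layer_weights, layer_errors))

# L_0 vs. L_all weighting on a 4-layer model (dummy E_i activations):
errors = [torch.rand(1, 8, 16, 16) for _ in range(4)]
l0   = prednet_loss(errors, [1.0, 0.0, 0.0, 0.0])  # lowest layer only
lall = prednet_loss(errors, [1.0, 0.1, 0.1, 0.1])  # all layers contribute
```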
REFERENCES
[1] Rakesh Chalasani and Jose C. Principe. 2013. Deep Predictive Coding Networks. arXiv:1301.3541 (2013). http://arxiv.org/abs/1301.3541
[2] Andy Clark. 2013. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences 36, 3 (2013), 181–204. https://doi.org/10.1017/S0140525X12000477
[3] Ronan Collobert and Jason Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of the 25th International Conference on Machine Learning (ICML '08). ACM, New York, NY, USA, 160–167. https://doi.org/10.1145/1390156.1390177
[4] Kanjar De and V. Masilamani. 2013. Image Sharpness Measure for Blurred Images in Frequency Domain. Procedia Engineering 64 (2013). https://doi.org/10.1016/j.proeng.2013.09.086
[5] N. Elsayed, A. S. Maida, and M. Bayoumi. 2019. Reduced-Gate Convolutional LSTM Architecture for Next-Frame Video Prediction Using Predictive Coding. In 2019 International Joint Conference on Neural Networks (IJCNN). 1–9. https://doi.org/10.1109/IJCNN.2019.8852480
[6] Lauren L. Emberson, John E. Richards, and Richard N. Aslin. 2015. Top-down modulation in the infant brain: Learning-induced expectations rapidly affect the sensory cortex at 6 months. Proceedings of the National Academy of Sciences 112, 31 (2015), 9585–9590. https://doi.org/10.1073/pnas.1510343112
[7] Chelsea Finn, Ian Goodfellow, and Sergey Levine. 2016. Unsupervised Learning for Physical Interaction through Video Prediction. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 64–72.
[8] Karl Friston. 2009. The free-energy principle: a rough guide to the brain? Trends in Cognitive Sciences 13, 7 (2009), 293–301. https://doi.org/10.1016/j.tics.2009.04.005
[9] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. 2013. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research (IJRR) (2013).
[10] Ross B. Girshick. 2015. Fast R-CNN. CoRR abs/1504.08083 (2015). http://arxiv.org/abs/1504.08083
[11] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fründ, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. 2017. The "something something" video database for learning and evaluating visual common sense. CoRR abs/1706.04261 (2017). http://arxiv.org/abs/1706.04261
[12] Kuan Han, Haiguang Wen, Yizhen Zhang, Di Fu, Eugenio Culurciello, and Zhongming Liu. 2018. Deep Predictive Coding Network with Local Recurrent Processing for Object Recognition. CoRR abs/1805.07526 (2018). http://arxiv.org/abs/1805.07526
[13] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 961–970.
[14] Matin Hosseini, Anthony S. Maida, Majid Hosseini, and Gottumukkala Raju. 2019. Inception-inspired LSTM for Next-frame Video Prediction. arXiv:1909.05622 (2019). http://arxiv.org/abs/1909.05622
[15] Longlong Jing and Yingli Tian. 2019. Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey. CoRR abs/1902.06162 (2019). http://arxiv.org/abs/1902.06162
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. 2014. Large-Scale Video Classification with Convolutional Neural Networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition. 1725–1732. https://doi.org/10.1109/CVPR.2014.223
[17] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. 2011. HMDB: A Large Video Database for Human Motion Recognition. In Proceedings of the 2011 International Conference on Computer Vision (ICCV '11). IEEE Computer Society, Washington, DC, USA, 2556–2563. https://doi.org/10.1109/ICCV.2011.6126543
[18] William Lotter, Gabriel Kreiman, and David D. Cox. 2016. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. CoRR abs/1605.08104 (2016). http://arxiv.org/abs/1605.08104
[19] Farzaneh Mahdisoltani, Guillaume Berger, Waseem Gharbieh, David J. Fleet, and Roland Memisevic. 2018. Fine-grained Video Classification and Captioning. CoRR abs/1804.09235 (2018). http://arxiv.org/abs/1804.09235
[20] Michaël Mathieu, Camille Couprie, and Yann LeCun. 2015. Deep multi-scale video prediction beyond mean square error. CoRR abs/1511.05440 (2015). http://arxiv.org/abs/1511.05440
[21] Rajesh Rao and Dana H. Ballard. 1999. Predictive Coding in the Visual Cortex: a Functional Interpretation of Some Extra-classical Receptive-field Effects. Nature Neuroscience 2 (1999), 79–87. https://doi.org/10.1038/4580
[22] Ryoma Sato, Hisashi Kashima, and Takehiro Yamamoto. 2018. Short-Term Precipitation Prediction with Skip-Connected PredNet. In Artificial Neural Networks and Machine Learning – ICANN 2018, Věra Kůrková, Yannis Manolopoulos, Barbara Hammer, Lazaros Iliadis, and Ilias Maglogiannis (Eds.). Springer International Publishing, Cham, 373–382.
[23] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. CoRR abs/1506.04214 (2015). http://arxiv.org/abs/1506.04214
[24] Z. Song, J. Zhang, G. Shi, and J. Liu. 2019. Fast Inference Predictive Coding: A Novel Model for Constructing Deep Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 30, 4 (2019), 1150–1165. https://doi.org/10.1109/TNNLS.2018.2862866
[25] M. W. Spratling. 2017. A review of predictive coding algorithms. Brain and Cognition 112 (2017), 92–97. https://doi.org/10.1016/j.bandc.2015.11.003
[26] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. 2015. Unsupervised Learning of Video Representations using LSTMs. CoRR abs/1502.04681 (2015). http://arxiv.org/abs/1502.04681
[27] Christopher Summerfield and Etienne Koechlin. 2008. A Neural Representation of Prior Information during Perceptual Inference. Neuron 59, 2 (2008), 336–347. https://doi.org/10.1016/j.neuron.2008.05.021
[28] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748 (2018). http://arxiv.org/abs/1807.03748
[29] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2015. Anticipating the future by watching unlabeled video. CoRR abs/1504.08023 (2015). http://arxiv.org/abs/1504.08023
[30] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. CoRR abs/1608.00859 (2016). http://arxiv.org/abs/1608.00859
[31] Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. 2018. PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning. CoRR abs/1804.06300 (2018). http://arxiv.org/abs/1804.06300
[32] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. 2004. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.
[33] Eiji Watanabe, Akiyoshi Kitaoka, Kiwako Sakamoto, Masaki Yasugi, and Kenta Tanaka. 2018. Illusory Motion Reproduced by Deep Neural Networks Trained for Prediction. Frontiers in Psychology 9 (2018), 345. https://doi.org/10.3389/fpsyg.2018.00345
[34] Haiguang Wen, Kuan Han, Junxing Shi, Yizhen Zhang, Eugenio Culurciello, and Zhongming Liu. 2018. Deep Predictive Coding Network for Object Recognition. CoRR abs/1802.04762 (2018). http://arxiv.org/abs/1802.04762
[35] Junpei Zhong, Angelo Cangelosi, Xinzheng Zhang, and Tetsuya Ogata. 2018. AFA-PredNet: The action modulation within predictive coding. CoRR abs/1804.03826 (2018). http://arxiv.org/abs/1804.03826
[36] Junpei Zhong, Tetsuya Ogata, and Angelo Cangelosi. 2018. Encoding Longer-term Contextual Multi-modal Information in a Predictive Coding Model. CoRR abs/1804.06774 (2018). http://arxiv.org/abs/1804.06774