Deep Predictive Coding Networks
Authors: Rakesh Chalasani, Jose C. Principe
Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611
rakeshch@ufl.edu, principe@cnel.ufl.edu

Abstract

The quality of data representation in deep learning methods is directly related to the prior model imposed on the representations; however, generally used fixed priors are not capable of adjusting to the context in the data. To address this issue, we propose deep predictive coding networks, a hierarchical generative model that empirically alters priors on the latent representations in a dynamic and context-sensitive manner. This model captures the temporal dependencies in time-varying signals and uses top-down information to modulate the representation in lower layers. The centerpiece of our model is a novel procedure to infer sparse states of a dynamic network, which is used for feature extraction. We also extend this feature extraction block to introduce a pooling function that captures locally invariant representations. When applied on natural video data, we show that our method is able to learn high-level visual features. We also demonstrate the role of the top-down connections by showing the robustness of the proposed model to structured noise.

1 Introduction

The performance of machine learning algorithms is dependent on how the data is represented. In most methods, the quality of a data representation is itself dependent on prior knowledge imposed on the representation. Such prior knowledge can be imposed using domain-specific information, as in SIFT [1], HOG [2], etc., or in learning representations using fixed priors like sparsity [3], temporal coherence [4], etc. The use of fixed priors became particularly popular while training deep networks [5-8]. In spite of the success of these general-purpose priors, they are not capable of adjusting to the context in the data.
On the other hand, there are several advantages to having a model that can "actively" adapt to the context in the data. One way of achieving this is to empirically alter the priors in a dynamic and context-sensitive manner. This will be the main focus of this work, with emphasis on visual perception. Here we propose a predictive coding framework, where a deep locally-connected generative model uses "top-down" information to empirically alter the priors used in the lower layers to perform "bottom-up" inference. The centerpiece of the proposed model is extracting sparse features from time-varying observations using a linear dynamical model. To this end, we propose a novel procedure to infer sparse states (or features) of a dynamical system. We then extend this feature extraction block to introduce a pooling strategy to learn invariant feature representations from the data.

In line with other "deep learning" methods, we use these basic building blocks to construct a hierarchical model using greedy layer-wise unsupervised learning. The hierarchical model is built such that the output from one layer acts as an input to the layer above. In other words, the layers are arranged in a Markov chain such that the states at any layer are only dependent on the representations in the layer below and above, and are independent of the rest of the model. The overall goal of the dynamical system at any layer is to make the best prediction of the representation in the layer below using the top-down information from the layers above and the temporal information from the previous states. Hence the name deep predictive coding networks (DPCN).

1.1 Related Work

The DPCN proposed here is closely related to the models proposed in [9, 10], where predictive coding is used as a statistical model to explain cortical functions in the mammalian brain.
Similar to the proposed model, they construct hierarchical generative models that seek to infer the underlying causes of the sensory inputs. While Rao and Ballard [9] use an update rule similar to a Kalman filter for inference, Friston [10] proposed a general framework considering all the higher-order moments in a continuous-time dynamic model. However, neither of these models is capable of extracting discriminative information, namely a sparse and invariant representation, from an image sequence that is helpful for high-level tasks like object recognition. Unlike these models, here we propose an efficient inference procedure to extract locally invariant representations from image sequences and progressively extract more abstract information at higher levels in the model.

Other methods used for building deep models, like the restricted Boltzmann machine (RBM) [11], auto-encoders [8, 12] and predictive sparse decomposition [13], are also related to the model proposed here. All these models are constructed on similar underlying principles: (1) like ours, they also use greedy layer-wise unsupervised learning to construct a hierarchical model, and (2) each layer consists of an encoder and a decoder. The key to these models is to learn both encoding and decoding concurrently (with some regularization like sparsity [13], denoising [8] or weight sharing [11]), while building the deep network as a feed-forward model using only the encoder. The idea is to approximate the latent representation using only the feed-forward encoder, while avoiding the decoder, which typically requires a more expensive inference procedure. However, in DPCN there is no encoder. Instead, DPCN relies on an efficient inference procedure to get a more accurate latent representation.
As we will show below, the use of reciprocal top-down and bottom-up connections makes the proposed model more robust to structured noise during recognition and also allows it to perform low-level tasks like image denoising. To scale to large images, several convolutional models have also been proposed in a similar deep learning paradigm [5-7]. Inference in these models is applied over an entire image, rather than small parts of the input. DPCN can also be extended to form a convolutional network, but this will not be discussed here.

2 Model

In this section, we begin with a brief description of the general predictive coding framework and proceed to discuss the details of the architecture used in this work. The basic block of the proposed model that is pervasive across all layers is a generalized state-space model of the form:

    y_t = F(x_t) + n_t
    x_t = G(x_{t-1}, u_t) + v_t                                                    (1)

where y_t is the data and F and G are some functions that can be parameterized, say by \theta. The terms u_t are called the unknown causes. Since we are usually interested in obtaining abstract information from the observations, the causes are encouraged to have a non-linear relationship with the observations. The hidden states, x_t, then "mediate the influence of the cause on the output and endow the system with memory" [10]. The terms v_t and n_t are stochastic and model uncertainty. Several such state-space models can now be stacked, with the output from one acting as an input to the layer above, to form a hierarchy. Such an L-layered hierarchical model at any time t can be described as (see footnote 1):

    u^(l-1)_t = F(x^(l)_t) + n^(l)_t
    x^(l)_t = G(x^(l)_{t-1}, u^(l)_t) + v^(l)_t,    for all l in {1, 2, ..., L}     (2)

The terms v^(l)_t and n^(l)_t form stochastic fluctuations at the higher layers and enter each layer independently.
In other words, this model forms a Markov chain across the layers, simplifying the inference procedure. Notice how the causes at the lower layer form the "observations" to the layer above: the causes form the link between the layers, and the states link the dynamics over time. The important point in this design is that the higher-level predictions influence the lower levels' inference. The predictions from a higher layer enter non-linearly into the state-space model by empirically altering the prior on the causes. In summary, the top-down connections and the temporal dependencies in the state space influence the latent representation at any layer. In the following sections, we will first describe a basic computational network, as in (1), with a particular form of the functions F and G.

Footnote 1: When l = 1, i.e., at the bottom layer, u^(0)_t = y_t, where y_t is the input data.

Figure 1: (a) A single-layered dynamic network, the basic computational block, shown on a group of small overlapping patches of the input video. The green bubbles indicate a group of inputs (y^(n)_t, for all n), the red bubbles indicate their corresponding states (x^(n)_t) and the blue bubbles indicate the causes (u_t) that pool all the states within the group. (b) The distributive hierarchical model formed by stacking several such basic blocks. For visualization no overlap is shown between the image patches here, but overlapping patches are considered in the actual implementation.
Specifically, we will consider a linear dynamical model with sparse states for encoding the inputs and the state transitions, followed by a non-linear pooling function to infer the causes. Next, we will discuss how to stack and learn a hierarchical model using several of these basic networks. We will also discuss how to incorporate the top-down information during inference in the hierarchical model.

2.1 Dynamic network

To begin with, we consider a dynamic network to extract features from a small part of a video sequence. Let {y_1, y_2, ..., y_t, ...} in R^P be a P-dimensional sequence of a patch extracted from the same location across all the frames in a video (see footnote 2). To process this, our network consists of two distinctive parts (see Figure 1a): feature extraction (inferring states) and pooling (inferring causes). For the first part, sparse coding is used in conjunction with a linear state-space model to map the inputs y_t at time t onto an over-complete dictionary of K filters, C in R^{P x K} (K > P), to get sparse states x_t in R^K. To keep track of the dynamics in the latent states we use a linear function with state-transition matrix A in R^{K x K}. More formally, inference of the features x_t is performed by finding a representation that minimizes the energy function:

    E_1(x_t, y_t, C, A) = ||y_t - C x_t||_2^2 + \lambda ||x_t - A x_{t-1}||_1 + \gamma ||x_t||_1    (3)

Notice that the second term, involving the state transition, is also constrained to be sparse to make the state-space representation consistent. Now, to take advantage of the spatial relationships in a local neighborhood, a small group of states x^(n)_t, where n in {1, 2, ..., N} represents a set of contiguous patches w.r.t. the position in the image space, are added (or sum pooled) together. Such pooling of the states may lead to local translation invariance.
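To make the terms of (3) concrete, the following is a minimal NumPy sketch that evaluates the state energy E_1 for a single patch: reconstruction error, sparse temporal innovation, and state sparsity. The function name and the dense-array setup are ours for illustration, not from the paper.

```python
import numpy as np

def energy_states(y_t, x_t, x_prev, C, A, lam=0.1, gamma=0.1):
    """Evaluate the state energy E_1 of Eq. (3) for one patch."""
    recon = np.sum((y_t - C @ x_t) ** 2)                 # ||y_t - C x_t||_2^2
    innovation = lam * np.sum(np.abs(x_t - A @ x_prev))  # lambda ||x_t - A x_{t-1}||_1
    sparsity = gamma * np.sum(np.abs(x_t))               # gamma ||x_t||_1
    return recon + innovation + sparsity
```

Inference amounts to minimizing this quantity over x_t; Section 2.2.1 describes how this is done despite the two non-smooth l1 terms.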
On top of this, D-dimensional causes u_t in R^D are inferred from the pooled states to obtain a representation that is invariant to more complex local transformations like rotation, spatial frequency, etc. In line with [14], this invariant function is learned such that it can capture the dependencies between the components in the pooled states. Specifically, the causes u_t are inferred by minimizing the energy function:

    E_2(u_t, x_t, B) = sum_{n=1}^{N} sum_{k=1}^{K} |\gamma_k . x^(n)_{t,k}| + \beta ||u_t||_1      (4)
    \gamma_k = \gamma_0 [1 + exp(-[B u_t]_k)] / 2

where \gamma_0 > 0 is some constant. Notice that here u_t multiplicatively interacts with the accumulated states through B, modeling the shape of the sparse prior on the states. Essentially, the invariant matrix B is adapted such that each component of u_t connects to a group of components in the accumulated states that co-occur frequently. In other words, whenever a component in u_t is active it lowers the coefficient of a set of components in x^(n)_t, for all n, making them more likely to be active. Since co-occurring components typically share some common statistical regularity, such activity of u_t typically leads to a locally invariant representation [14].

Footnote 2: Here y_t is a vectorized form of a sqrt(P) x sqrt(P) square patch extracted from a frame at time t.

Though the two cost functions are presented separately above, we can combine both to devise a unified energy function of the form:

    E(x_t, u_t, \theta) = sum_{n=1}^{N} [ (1/2) ||y^(n)_t - C x^(n)_t||_2^2 + \lambda ||x^(n)_t - A x^(n)_{t-1}||_1 + sum_{k=1}^{K} |\gamma_{t,k} . x^(n)_{t,k}| ] + \beta ||u_t||_1      (5)
    \gamma_{t,k} = \gamma_0 [1 + exp(-[B u_t]_k)] / 2

where \theta = {A, B, C}. As we will discuss next, both x_t and u_t can be inferred concurrently from (5) by alternately updating one while keeping the other fixed, using an efficient proximal gradient method.
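The cause-dependent sparsity weights \gamma_{t,k} of (5) can be sketched directly; this small NumPy helper (the function name is ours) shows the multiplicative interaction: with u_t = 0 the weight is exactly \gamma_0, and active causes push [B u_t]_k up, lowering the weight toward \gamma_0 / 2 and making the connected states cheaper to activate.

```python
import numpy as np

def state_sparsity_weights(u_t, B, gamma0=0.1):
    """Cause-modulated sparsity weights of Eq. (5):
    gamma_k = gamma0 * (1 + exp(-[B u_t]_k)) / 2."""
    return gamma0 * (1.0 + np.exp(-(B @ u_t))) / 2.0
```

This is the only place the causes enter the state energy, which is why inferring u_t reduces to reshaping the sparse prior on x_t.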
2.2 Learning

To learn the parameters in (5), we alternately minimize E(x_t, u_t, \theta) using a procedure similar to block coordinate descent. We first infer the latent variables (x_t, u_t) while keeping the parameters fixed and then update the parameters \theta while keeping the variables fixed. This is done until the parameters converge. We now discuss separately the inference procedure and how we update the parameters using a gradient descent method with the variables fixed.

2.2.1 Inference

We jointly infer both x_t and u_t from (5) using proximal gradient methods, taking alternating gradient descent steps to update one while holding the other fixed. In other words, we alternate between updating x_t and u_t using a single update step to minimize E_1 and E_2, respectively. However, updating x_t is relatively more involved. So, keeping aside the causes, we first focus on inferring the sparse states alone from E_1, and then return to the joint inference of both the states and the causes.

Inferring states: Inferring sparse states, given the parameters, from a linear dynamical system forms the crux of our model. This is performed by finding the solution that minimizes the energy function E_1 in (3) with respect to the states x_t (while keeping the sparsity parameter \gamma fixed). Here there are two priors on the states: the temporal dependence and the sparsity term. Although this energy function E_1 is convex in x_t, the presence of two non-smooth terms makes it hard to use standard optimization techniques developed for sparse coding alone. A similar problem is solved using dynamic programming [15], homotopy [16] and Bayesian sparse coding [17]; however, the optimization used in these models is computationally expensive for use in large-scale problems like object recognition.
To overcome this, inspired by the method proposed in [18] for structured sparsity, we propose an approximate solution that is consistent and able to use efficient solvers like the fast iterative shrinkage-thresholding algorithm (FISTA) [19]. The key to our approach is to first use Nesterov's smoothing method [18, 20] to approximate the non-smooth state-transition term. The resulting energy function is convex and continuously differentiable in x_t with a sparsity constraint, and hence can be efficiently solved using proximal methods like FISTA.

To begin, let \Omega(x_t) = ||e_t||_1 where e_t = (x_t - A x_{t-1}). The idea is to find a smooth approximation to this function \Omega(x_t) in e_t. Notice that, since e_t is a linear function of x_t, the approximation will also be smooth w.r.t. x_t. Now, we can re-write \Omega(x_t) using the dual norm of l_1 as

    \Omega(x_t) = max_{||\alpha||_inf <= 1} \alpha^T e_t

where \alpha in R^K. Using the smoothing approximation from Nesterov [20] on \Omega(x_t):

    \Omega(x_t) ~ f_\mu(e_t) = max_{||\alpha||_inf <= 1} [\alpha^T e_t - \mu d(\alpha)]      (6)

where d(\alpha) = (1/2) ||\alpha||_2^2 is a smoothing function and \mu is a smoothness parameter. From Nesterov's theorem [20], it can be shown that f_\mu(e_t) is convex and continuously differentiable in e_t, and the gradient of f_\mu(e_t) with respect to e_t takes the form

    \nabla_{e_t} f_\mu(e_t) = \alpha^*                                                       (7)

where \alpha^* is the optimal solution to max_{||\alpha||_inf <= 1} [\alpha^T e_t - \mu d(\alpha)] (see footnote 3). This implies, by the chain rule, that f_\mu(e_t) is also convex and continuously differentiable in x_t, with the same gradient.
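The paper defers the exact form of \alpha^* to its supplementary material. Under the stated smoothing with d(\alpha) = (1/2)||\alpha||_2^2, the inner maximization separates per coordinate and the standard closed form is coordinate-wise clipping of e_t / \mu onto [-1, 1] (the smoothed l1 is then the Huber function). A minimal NumPy sketch of that closed form, under this assumption:

```python
import numpy as np

def smoothed_l1_grad(e_t, mu=0.05):
    """Nesterov-smoothed l1 norm f_mu(e_t) of Eq. (6) and its gradient alpha*.
    alpha* = clip(e_t / mu, -1, 1) solves max_{||a||_inf <= 1} a^T e - (mu/2)||a||^2
    coordinate-wise, since the unconstrained maximizer e/mu is simply projected
    onto the box [-1, 1]."""
    alpha = np.clip(e_t / mu, -1.0, 1.0)
    f_mu = alpha @ e_t - 0.5 * mu * np.sum(alpha ** 2)
    return f_mu, alpha
```

For |e| >= mu each coordinate contributes |e| - mu/2 (the linear Huber tail), and for |e| < mu it contributes e^2 / (2 mu), so f_mu lower-bounds the l1 norm and approaches it as mu -> 0.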
With this smoothing approximation, the overall cost function from (3) can now be re-written as

    x_t = arg min_{x_t} (1/2) ||y_t - C x_t||_2^2 + \lambda f_\mu(e_t) + \gamma ||x_t||_1      (8)

with the smooth part h(x_t) = (1/2) ||y_t - C x_t||_2^2 + \lambda f_\mu(e_t), whose gradient with respect to x_t is given by

    \nabla_{x_t} h(x_t) = -C^T (y_t - C x_t) + \lambda \alpha^*                                (9)

Using the gradient information in (9), we solve for x_t from (8) using FISTA [19].

Inferring causes: Given a group of state vectors, u_t can be inferred by minimizing E_2, where we define a generative model that modulates the sparsity of the pooled state vector, sum_n |x^(n)_t|. Here we observe that FISTA can be readily applied to infer u_t, as the smooth part of the function E_2:

    h(u_t) = sum_{k=1}^{K} \gamma_0 [1 + exp(-[B u_t]_k)] / 2 . sum_{n=1}^{N} |x^(n)_{t,k}|    (10)

is convex, continuously differentiable and Lipschitz in u_t [21] (see footnote 4). Following [19], it is easy to obtain a bound on the convergence rate of the solution.

Joint inference: We have shown thus far that both x_t and u_t can be inferred from their respective energy functions using a first-order proximal method, FISTA. However, for joint inference we have to minimize the combined energy function in (5) over both x_t and u_t. We do this by alternately updating x_t and u_t while holding the other fixed, using a single FISTA update step at each iteration. It is important to point out that the internal FISTA step-size parameters are maintained between iterations. This procedure is equivalent to alternating minimization using gradient descent. Although this procedure no longer guarantees convergence of both x_t and u_t to the optimal solution, in all of our simulations it led to a reasonably good solution. Please refer to Algorithm 1 (in the supplementary material) for details.
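A single proximal-gradient step on the smoothed state problem (8) can be sketched as follows: take a gradient step on the smooth part h(x_t), then apply the l1 proximal operator (soft-thresholding). This is the basic ISTA step; FISTA adds a momentum extrapolation on top of it. The function names, the clipped form of \alpha^*, and the fixed step size are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def soft_threshold(v, thr):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def ista_state_step(x_t, y_t, x_prev, C, A, lam, gamma, mu, step):
    """One proximal-gradient step on Eq. (8): gradient of the smooth part
    h(x_t) as in Eq. (9), followed by soft-thresholding for gamma ||x_t||_1."""
    e_t = x_t - A @ x_prev
    alpha = np.clip(e_t / mu, -1.0, 1.0)        # gradient of the smoothed transition term
    grad = C.T @ (C @ x_t - y_t) + lam * alpha  # gradient of h(x_t), Eq. (9)
    return soft_threshold(x_t - step * grad, step * gamma)
```

Iterating this step (or its FISTA-accelerated variant) while periodically taking the analogous step on u_t gives the alternating joint-inference scheme described above.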
Note that, with the alternating update procedure, each x_t is now influenced by the feed-forward observations, the temporal predictions and the feedback connections from the causes.

Footnote 3: Please refer to the supplementary material for the exact form of \alpha^*.

Footnote 4: The matrix B is initialized with non-negative entries and remains non-negative without any additional constraints [21].

2.2.2 Parameter Updates

With x_t and u_t fixed, we update the parameters by minimizing E in (5) with respect to \theta. Since the inputs here form a time-varying sequence, the parameters are updated using dual estimation filtering [22]; i.e., we put an additional constraint on the parameters such that they follow a state-space equation of the form:

    \theta_t = \theta_{t-1} + z_t                                                   (11)

where z_t is Gaussian transition noise over the parameters. This keeps track of their temporal relationships. Along with this constraint, we update the parameters using gradient descent. Notice that, with x_t and u_t fixed, each of the parameter matrices can be updated independently. The matrices C and B are column-normalized after the update to avoid any trivial solution.

Mini-batch update: To get faster convergence, the parameters are updated after performing inference over a large sequence of inputs instead of at every time instance. With this "batch" of signals, more sophisticated gradient methods, like conjugate gradient, can be used, leading to more accurate and faster convergence.

2.3 Building a hierarchy

So far the discussion has focused on encoding a small part of a video frame using a single-stage network. To build a hierarchical model, we use this single-stage network as a basic building block and arrange the blocks to form a tree structure (see Figure 1b). To learn this hierarchical model, we adopt a greedy layer-wise procedure like many other deep learning methods [6, 8, 11].
Specifically, we use the following strategy to learn the hierarchical model. For the first (or bottom) layer, we learn a dynamic network as described above over a group of small patches from a video. We then take this learned network and replicate it at several places on a larger part of the input frames (similar to weight sharing in a convolutional network [23]). The outputs (causes) from each of these replicated networks are considered as inputs to the layer above. Similarly, in the second layer the inputs are again grouped together (depending on the spatial proximity in the image space) and are used to train another dynamic network. A similar procedure can be followed to build higher layers. We again emphasize that the model is learned in a layer-wise manner, i.e., there is no top-down information while learning the network parameters. Also note that, because of the pooling of the states at each layer, the receptive field of the causes becomes progressively larger with the depth of the model.

2.4 Inference with top-down information

With the parameters fixed, we now shift our focus to inference in the hierarchical model with top-down information. As we discussed above, the layers in the hierarchy are arranged in a Markov chain, i.e., the variables at any layer are only influenced by the variables in the layer below and the layer above. Specifically, the states x^(l)_t and the causes u^(l)_t at layer l are inferred from u^(l-1)_t and are influenced by x^(l+1)_t (through the prediction term C^(l+1) x^(l+1)_t) (see footnote 5). Ideally, to perform inference in this hierarchical model, all the states and the causes have to be updated simultaneously depending on the present state of all the other layers until the model reaches equilibrium [10]. However, such a procedure can be very slow in practice.
Instead, we propose an approximate inference procedure that requires only a single top-down flow of information followed by a single bottom-up inference pass using this top-down information.

Footnote 5: The suffixes n indicating the group are considered implicit here to simplify the notation.

For this, we consider that at any layer l a group of inputs u^(l-1,n)_t, for all n in {1, 2, ..., N}, are encoded using a group of states x^(l,n)_t, for all n, and the causes u^(l)_t by minimizing the following energy function:

    E_l(x^(l)_t, u^(l)_t, \theta^(l)) = sum_{n=1}^{N} [ (1/2) ||u^(l-1,n)_t - C^(l) x^(l,n)_t||_2^2 + \lambda ||x^(l,n)_t - A^(l) x^(l,n)_{t-1}||_1 + sum_{k=1}^{K} |\gamma^(l)_{t,k} . x^(l,n)_{t,k}| ] + \beta ||u^(l)_t||_1 + (1/2) ||u^(l)_t - \hat{u}^(l+1)_t||_2^2      (12)
    \gamma^(l)_{t,k} = \gamma_0 [1 + exp(-[B^(l) u^(l)_t]_k)] / 2

where \theta^(l) = {A^(l), B^(l), C^(l)}. Notice the additional term involving \hat{u}^(l+1)_t when compared to (5). This comes from the top-down information, where we call \hat{u}^(l+1)_t the top-down prediction of the causes of layer l computed using the previous states in layer (l+1). Specifically, before the "arrival" of a new observation at time t, at each layer l (starting from the top layer) we first propagate the most likely causes to the layer below using the states at the previous time instance, x^(l)_{t-1}, and the predicted causes \hat{u}^(l+1)_t. More formally, the top-down prediction at layer l is obtained as \hat{u}^(l)_t = C^(l) \hat{x}^(l)_t where

    \hat{x}^(l)_t = arg min_{x^(l)_t} \lambda^(l) ||x^(l)_t - A^(l) x^(l)_{t-1}||_1 + \gamma_0 sum_{k=1}^{K} |\hat{\gamma}_{t,k} . x^(l)_{t,k}|      (13)

and \hat{\gamma}_{t,k} = exp(-[B^(l) \hat{u}^(l+1)_t]_k) / 2. At the topmost layer L, a "bias" is set such that \hat{u}^(L)_t = \hat{u}^(L)_{t-1}, i.e., the top layer induces some temporal coherence on the final outputs.
From (13), it is easy to show that the predicted states for layer l can be obtained coordinate-wise as

    \hat{x}^(l)_{t,k} = [A^(l) x^(l)_{t-1}]_k    if \gamma_0 \hat{\gamma}_{t,k} < \lambda^(l)
    \hat{x}^(l)_{t,k} = 0                        if \gamma_0 \hat{\gamma}_{t,k} >= \lambda^(l)      (14)

These predicted causes \hat{u}^(l)_t, for all l in {1, 2, ..., L}, are substituted into (12) and a single layer-wise bottom-up inference pass is performed as described in Section 2.2.1 (see footnote 6). The combined prior now imposed on the causes, \beta ||u^(l)_t||_1 + (1/2) ||u^(l)_t - \hat{u}^(l+1)_t||_2^2, is similar to the elastic-net prior discussed in [24], leading to a smoother and biased estimate of the causes.

3 Experiments

3.1 Receptive fields of causes in the hierarchical model

First, we would like to test the ability of the proposed model to learn complex features in the higher layers of the model. For this we train a two-layered network on a natural video. Each frame in the video was first contrast-normalized as described in [13]. Then, we train the first layer of the model on 4 overlapping contiguous 15 x 15 pixel patches from this video; this layer has 400-dimensional states and 100-dimensional causes. The causes pool the states related to all 4 patches. The separation between the overlapping patches here was 2 pixels, implying that the receptive field of the causes in the first layer is 17 x 17 pixels. Similarly, the second layer is trained on 4 causes from the first layer obtained from 4 overlapping 17 x 17 pixel patches from the video. The separation between the patches here is 3 pixels, implying that the receptive field of the causes in the second layer is 20 x 20 pixels. The second layer contains 200-dimensional states and 50-dimensional causes that pool the states related to all 4 patches. Figure 2 shows the visualization of the receptive fields of the invariant units (columns of matrix B) at each layer.
Footnote 6: Note that the additional term (1/2) ||u^(l)_t - \hat{u}^(l+1)_t||_2^2 in the energy function only leads to a minor modification in the inference procedure, namely that it has to be added to h(u_t) in (10).

Figure 2: Visualization of the receptive fields of the invariant units learned in (a) layer 1 (invariant matrix B^(1)) and (b) layer 2 (invariant matrix B^(2)) when trained on natural videos. The receptive fields are constructed as a weighted combination of the dictionary of filters at the bottom layer.

We observe that each dimension of the causes in the first layer represents a group of primitive features (like inclined lines) that are localized in orientation or position (see footnote 7). The causes in the second layer, on the other hand, represent more complex features, like corners, angles, etc. These filters are consistent with previously proposed methods like Lee et al. [5] and Zeiler et al. [7].

3.2 Role of top-down information

In this section, we show the role of the top-down information during inference, particularly in the presence of structured noise. Video sequences consisting of objects of three different shapes (see Figure 3) were constructed. The objective is to classify each frame as coming from one of the three different classes. For this, several 32 x 32 pixel, 100-frame-long sequences were made using two objects of the same shape bouncing off each other and the "walls". Several such sequences were then concatenated to form a 30,000-frame-long sequence. We train a two-layer network using this sequence. First, we divided each frame into 12 x 12 patches with neighboring patches overlapping by 4 pixels; each frame is divided into 16 patches. The bottom layer was trained such that the 12 x 12 patches were used as inputs and were encoded using a 100-dimensional state vector. Four contiguous neighboring patches were pooled to infer the causes, which have 40 dimensions.
The second layer was trained with 4 first-layer causes as inputs, which were themselves inferred from 20 x 20 contiguous overlapping blocks of the video frames. The states here are 60-dimensional and the causes have only 3 dimensions. It is important to note here that the receptive field of the second-layer causes encompasses the entire frame.

We test the performance of the DPCN in two conditions. The first case is 300 frames of clean video, with 100 frames per shape, constructed as described above. We consider this as a single video without considering any discontinuities. In the second case, we corrupt the clean video with "structured" noise: we randomly pick a number of objects from the same three shapes, with the number drawn from a Poisson distribution (with mean 1.5), and add them to each frame independently at random locations. There is no correlation between any two consecutive frames regarding where the "noisy objects" are added (see Figure 3b).

First we consider the clean video and perform inference with only bottom-up inference, i.e., during inference we set \hat{u}^(l)_t = 0 for all l in {1, 2}. Figure 4a shows the scatter plot of the three-dimensional causes at the top layer. Clearly, there are 3 clusters recognizing the three different shapes in the video sequence. Figure 4b shows the scatter plot when the same procedure is applied on the noisy video. We observe that the 3 shapes cannot be clearly distinguished here. Finally, we use top-down information along with bottom-up inference, as described in Section 2.4, on the noisy data. We argue that, since the second layer learned class-specific information, the top-down information can help the bottom-layer units disambiguate the noisy objects from the true objects. Figure 4c shows the scatter plot for this case.
Clearly, with the top-down information, in spite of the largely corrupted sequence, the DPCN is able to separate the frames belonging to the three shapes (the trace from one cluster to the other is due to the temporal coherence imposed on the causes at the top layer).

Footnote 7: Please refer to the supplementary material for more results.

Figure 3: Shows part of the (a) clean and (b) corrupted video sequences constructed using three different shapes. Each row indicates one sequence.

Figure 4: Shows the scatter plot of the 3-dimensional causes at the top layer for (a) clean video with only bottom-up inference, (b) corrupted video with only bottom-up inference and (c) corrupted video with top-down flow along with bottom-up inference. At each point, the shape of the marker indicates the true shape of the object in the frame.

4 Conclusion

In this paper we proposed the deep predictive coding network, a generative model that empirically alters the priors in a dynamic and context-sensitive manner. This model comprises two main components: (a) linear dynamical models with sparse states used for feature extraction, and (b) top-down information used to adapt the empirical priors. The dynamic model captures the temporal dependencies and reduces the instability usually associated with sparse coding (see footnote 8), while the task-specific information from the top layers helps to resolve ambiguities in the lower layers, improving the data representation in the presence of noise. We believe that our approach can be extended with convolutional methods, paving the way for the implementation of high-level tasks like object recognition, etc., on large-scale videos or images.
Acknowledgments

This work is supported by the Office of Naval Research (ONR) grant #N0001410 10375. We thank Austin J. Brockmeier and Matthew Emigh for their comments and suggestions.

References

[1] David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision - Volume 2, ICCV '99, pages 1150–, 1999. ISBN 0-7695-0164-8.

[2] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, CVPR '05, pages 886–893, 2005. ISBN 0-7695-2372-2.

[3] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, June 1996. ISSN 0028-0836.

[4] L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.

[5] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 609–616, 2009. ISBN 978-1-60558-516-1.

[6] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. Advances in Neural Information Processing Systems, pages 1090–1098, 2010.

[7] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2528–2535. IEEE, 2010.

[8] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A.
Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.

[9] Rajesh P. N. Rao and Dana H. Ballard. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9:721–763, 1997.

[10] Karl Friston. Hierarchical models in the brain. PLoS Computational Biology, 4(11):e1000211, November 2008.

[11] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006.

[12] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2007.

[13] Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. CoRR, abs/1010.3467, 2010.

[14] Yan Karklin and Michael S. Lewicki. A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17:397–423, 2005.

[15] D. Angelosante, G. B. Giannakis, and E. Grossi. Compressed sensing of time-varying signals. In Digital Signal Processing, 2009 16th International Conference on, pages 1–8, July 2009.

[16] A. Charles, M. S. Asif, J. Romberg, and C. Rozell. Sparsity penalties in dynamical system estimation. In Information Sciences and Systems (CISS), 2011 45th Annual Conference on, pages 1–6, March 2011.

[17] D. Sejdinovic, C. Andrieu, and R. Piechocki. Bayesian sequential compressed sensing in sparse dynamical systems. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1730–1736, 2010. doi: 10.1109/ALLERTON.2010.5707125.

[18] X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing.
Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719–752, 2012.

[19] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, March 2009. ISSN 1936-4954. doi: 10.1137/080716542.

[20] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[21] Karol Gregor and Yann LeCun. Efficient learning of sparse invariant representations. CoRR, abs/1105.5307, 2011.

[22] Alex Nelson. Nonlinear estimation and modeling of noisy time-series by dual Kalman filtering methods. PhD thesis, 2000.

[23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, December 1989. ISSN 0899-7667. doi: 10.1162/neco.1989.1.4.541. URL http://dx.doi.org/10.1162/neco.1989.1.4.541.

[24] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

A Supplementary material for Deep Predictive Coding Networks

A.1 From Section 2.2.1: computing α*

The optimal solution of α in (6) is given by

α* = argmax_{‖α‖_∞ ≤ 1} [ α^T e_t − (µ/2) ‖α‖² ] = argmin_{‖α‖_∞ ≤ 1} ‖ α − e_t/µ ‖² = S(e_t/µ)    (15)

where S(·) is the function projecting e_t/µ onto the ℓ∞-ball, applied element-wise:

S(x) = x for −1 ≤ x ≤ 1;  S(x) = 1 for x > 1;  S(x) = −1 for x < −1.

A.2 Algorithm for joint inference of the states and the causes

Algorithm 1: Updating x_t and u_t simultaneously using a FISTA-like procedure [19].

Require: L_{x_{0,n}} > 0 ∀ n ∈ {1, 2, ..., N}, L_{u_0} > 0 and some η > 1.
1: Initialize x_{0,n} ∈ R^K ∀ n ∈ {1, 2, ..., N}, u_0 ∈ R^D and set ξ_1 = u_0, z_{1,n} = x_{0,n}.
2: Set the step-size parameter τ_1 = 1 and i = 1.
3: while not converged do
4:   Update γ = γ_0 (1 + exp(−[B u_i])) / 2.
5:   for n ∈ {1, 2, ..., N} do
6:     Line search: find the best step size L_{x_{i,n}}.
7:     Compute α* from (15).
8:     Update x_{i,n} using the gradient from (9) with a soft-thresholding function.
9:     Update the internal variables z_{i+1,n} with step-size parameter τ_i as in [19].
10:  end for
11:  Compute Σ_{n=1}^{N} |x_{i,n}|.
12:  Line search: find the best step size L_{u_i}.
13:  Update u_i using the gradient of (10) with a soft-thresholding function.
14:  Update the internal variables ξ_{i+1} with step-size parameter τ_i as in [19].
15:  Update τ_{i+1} = (1 + √(4 τ_i² + 1)) / 2.
16:  Check for convergence.
17:  i = i + 1.
18: end while
19: return x_{i,n} ∀ n ∈ {1, 2, ..., N} and u_i.

A.3 Inferring sparse states with known parameters

Figure 5: Performance of the inference algorithm with fixed parameters, compared with sparse coding and Kalman filtering: steady-state rMSE versus observation dimensionality.

For this, we first simulate a state sequence with only 20 non-zero elements in a 500-dimensional state vector, evolving with a permutation matrix that is different for every time instant, followed by a scaling matrix to generate a sequence of observations. We consider that both the permutation and the scaling matrices are known a priori. The observation noise is zero-mean Gaussian with variance σ² = 0.01. We consider sparse state-transition noise, which is simulated by choosing a subset of active elements in the state vector (the number of elements is chosen randomly via a Poisson distribution with mean 2) and switching each of them with a randomly chosen element (with uniform probability over the state vector). This resembles a sparse innovation in the states.
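The synthetic sequence described above can be generated along these lines. This is a minimal sketch under our own assumptions: the "scaling matrix" is taken here as a fixed random linear map, and all names and default dimensions are ours, not the authors':

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(T=50, d_state=500, k_active=20, d_obs=100, swap_mean=2, sigma2=0.01):
    """Sparse state evolving by a per-step permutation, with a
    Poisson-distributed number of element swaps as sparse innovation,
    observed through a fixed linear map plus Gaussian noise."""
    x = np.zeros(d_state)
    x[rng.choice(d_state, k_active, replace=False)] = rng.standard_normal(k_active)
    # assumed stand-in for the known "scaling" observation matrix
    C = rng.standard_normal((d_obs, d_state)) / np.sqrt(d_state)
    X, Y = [], []
    for _ in range(T):
        x = x[rng.permutation(d_state)]          # time-varying permutation
        active = np.flatnonzero(x)
        n_swap = min(rng.poisson(swap_mean), len(active))
        for i in rng.choice(active, n_swap, replace=False):
            j = rng.integers(d_state)            # uniformly chosen partner
            x[i], x[j] = x[j], x[i]              # swap keeps sparsity level
        X.append(x.copy())
        Y.append(C @ x + np.sqrt(sigma2) * rng.standard_normal(d_obs))
    return np.array(X), np.array(Y), C
```

Because the innovation only swaps entries, the number of active elements is preserved over time, which is what makes the sequence trackable by a sparsity-aware dynamic model.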
We use these generated observation sequences as inputs and use the a priori known parameters to infer the states from the dynamic model. Figure 5 shows the results obtained, where we compare the inferred states from the different methods with the true states in terms of the relative mean squared error (rMSE), defined as ‖x^est_t − x^true_t‖ / ‖x^true_t‖. The steady-state error (rMSE) after 50 time instances is plotted versus the dimensionality of the observation sequence. Each point is obtained after averaging over 50 runs. We observe that our model is able to converge to the true solution even for low-dimensional observations, where other methods like sparse coding fail. We argue that the temporal dependencies considered in our model are able to drive the solution to the right attractor basin, insulating it from the instabilities typically associated with sparse coding [24].

A.4 Visualizing the first layer of the learned model

Figure 6: Visualization of the parameters C and A of the model described in Section 3.1. (a) The learned observation matrix C; each square block indicates a column of the matrix, reshaped as a √p × √p pixel block. (b) The state-transition matrix A, shown through its connection strengths with the observation matrix C: on the left are the bases corresponding to a single active element in the state at time (t − 1), and on the right are the bases corresponding to the five most "active" elements in the predicted state at time t (ordered in decreasing order of magnitude).

Figure 7: Connections between the invariant units and the basis functions: (a) connections, (b) centers and orientations, (c) orientations and frequencies. (a) Shows the connections between the bases and the columns of B. Each row indicates one invariant unit.
Here the set of bases that are strongly correlated with an invariant unit is shown, arranged in decreasing order of magnitude. (b) Shows the spatially localized grouping of the invariant units. First, we fit a Gabor function to each of the basis functions. Each subplot here is then obtained by plotting a line indicating the center and the orientation of the Gabor function. The colors indicate the connection strengths with an invariant unit, red indicating stronger connections and blue indicating almost zero strength. We randomly select a subset of 25 invariant units here. We observe that the invariant units group bases that are local in spatial centers and orientations. (c) Similarly, we show the corresponding orientation and spatial-frequency selectivity of the invariant units. Here each plot indicates the orientation and frequency of each Gabor function, color coded according to the connection strengths with the invariant units. Each subplot is a half-polar plot, with the orientation plotted along the angle, ranging from 0 to π, and the distance from the center indicating the frequency. Again, we observe that the invariant units group bases that have similar orientation.
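The Gabor fits used in these visualizations presuppose a parameterization by center, orientation, and spatial frequency. A minimal sketch of evaluating such a function on a pixel grid (the parameter names and the isotropic Gaussian envelope are our assumptions; the paper does not specify its exact fitting form):

```python
import numpy as np

def gabor(size, x0, y0, theta, freq, sigma, phase=0.0):
    """Evaluate a 2D Gabor function on a size x size grid: a sinusoid of
    spatial frequency `freq` (cycles/pixel) at orientation `theta`,
    windowed by an isotropic Gaussian centered at (x0, y0)."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    # coordinate along the carrier direction
    xr = (xs - x0) * np.cos(theta) + (ys - y0) * np.sin(theta)
    env = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))
    return env * np.cos(2 * np.pi * freq * xr + phase)
```

Fitting such a template to each basis function (e.g., by least squares over these parameters) yields the centers, orientations, and frequencies plotted in panels (b) and (c).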