Convolutional Bipartite Attractor Networks
Authors: Michael Iuzzolino∗, Yoram Singer†, Michael C. Mozer∗†
∗University of Colorado, Boulder    †Google Research
ABSTRACT

In human perception and cognition, a fundamental operation that brains perform is interpretation: constructing coherent neural states from noisy, incomplete, and intrinsically ambiguous evidence. The problem of interpretation is well matched to an early and often overlooked architecture, the attractor network, a recurrent neural net that performs constraint satisfaction, imputation of missing features, and clean-up of noisy data via energy minimization dynamics. We revisit attractor nets in light of modern deep learning methods and propose a convolutional bipartite architecture with a novel training loss, activation function, and connectivity constraints. We tackle larger problems than have been previously explored with attractor nets and demonstrate their potential for image completion and super-resolution. We argue that this architecture is better motivated than ever-deeper feedforward models and is a viable alternative to more costly sampling-based generative methods on a range of supervised and unsupervised tasks.

1 INTRODUCTION

Under ordinary conditions, human visual perception is quick and accurate. Studying circumstances that give rise to slow or inaccurate perception can help reveal the underlying mechanisms of visual information processing. Recent investigations of occluded (Tang et al., 2018) and empirically challenging (Kar et al., 2019) scenes have led to the conclusion that recurrent brain circuits can play a critical role in object recognition. Further, recurrence can improve the classification performance of deep nets (Tang et al., 2018; Nayebi et al., 2018), specifically for the same images with which humans and animals have the most difficulty (Kar et al., 2019).
Recurrent dynamics allow the brain to perform pattern completion, constructing a coherent neural state from noisy, incomplete, and intrinsically ambiguous evidence. This interpretive process is well matched to attractor networks (ANs) (Hopfield, 1982; 1984; Krotov and Hopfield, 2016; Zemel and Mozer, 2001), a class of dynamical neural networks that converge to fixed-point attractor states (Figure 1a). Given evidence in the form of a static input, an AN settles to an asymptotic state (an interpretation, or completion) that is as consistent as possible with the evidence and with implicit knowledge embodied in the network connectivity. We show examples from our model in Figure 1b. ANs have played a pivotal role in characterizing computation in the brain (Amit, 1992; McClelland and Rumelhart, 1981), not only in perception (e.g., Sterzer and Kleinschmidt, 2007) but also in language (Stowe et al., 2018) and awareness (Mozer, 2009). We revisit attractor nets in light of modern deep learning methods and propose a convolutional bipartite architecture for pattern completion tasks with a novel training loss, activation function, and connectivity constraints.

Figure 1: (a) Hypothetical activation flow dynamics of an attractor net over a 2D state space; the contours depict an energy landscape. (b) Top to bottom: original image, completion, and evidence. (c) Bipartite architecture with layer update order. (d) Convolutional architecture with average pooling.

2 BACKGROUND AND RELATED RESEARCH

Although ANs have been mostly neglected in the recent literature, attractor-like dynamics can be seen in many models. For example, clustering and denoising autoencoders are used to clean up internal states and improve the robustness of deep models (Liao et al., 2016; Tang et al., 2018; Lamb et al., 2019).
In a range of image-processing domains, e.g., denoising, inpainting, and super-resolution, performance gains are realized by constructing deeper and deeper architectures (e.g., Lai et al., 2018). State-of-the-art results are often obtained using deep recursive architectures that replicate layers and weights (Kim et al., 2016; Tai et al., 2017), effectively implementing an unfolded-in-time recurrent net. This approach is sensible because image-processing tasks are fundamentally constraint-satisfaction problems: the value of any pixel depends on the values of its neighborhood, and iterative processing is required to converge on mutually consistent activation patterns. Because ANs are specifically designed to address constraint-satisfaction problems, our goal is to re-examine them from a modern deep-learning perspective. Interest in ANs seems to be narrow for two reasons. First, in both early (Hopfield, 1982; 1984) and recent (Li et al., 2015; Wu et al., 2018a;b; Chaudhuri and Fiete, 2017) work, ANs are characterized as content-addressable memories: activation vectors are stored and can later be retrieved with only partial information. However, memory retrieval does not well characterize the model's capabilities: like its probabilistic sibling the Boltzmann machine (Hinton, 2007; Welling et al., 2005), the AN is a general computational architecture for supervised and unsupervised learning. Second, ANs have been limited by training procedures. In Hopfield's work, ANs are trained with a simple procedure, an outer-product (Hebbian) rule, which cannot accommodate hidden units and the representational capacity they provide. Recent explorations have considered stronger training procedures (e.g., Wu et al., 2018b; Liao et al., 2018); however, as for all recurrent nets, training is complicated by the issue of vanishing/exploding gradients.
To facilitate training and increase the computational power of ANs, we propose a set of extensions to the architecture and training procedures. ANs are related to several popular architectures. Autoencoding models such as the VAE (Kingma and Welling, 2013) and denoising autoencoders (Vincent et al., 2008) can be viewed as approximating one step of attractor dynamics, directing the input toward the training-data manifold (Alain et al., 2012). These models can be applied recursively, though convergence is not guaranteed, nor is improvement in output quality over iterations. Flow-based generative models (FBGMs) (e.g., Dinh et al., 2016) are invertible density-estimation models that can map between observations and latent states. Whereas FBGMs require invertibility of mappings, ANs require only the weaker constraint that weights in one direction are the transpose of the weights in the other direction. Energy-based models (EBMs) are also density-estimation models; they learn a mapping from input data to energies and are trained to assign low energy values to the data manifold (LeCun et al., 2006; Han et al., 2018; Xie et al., 2016; Du and Mordatch, 2019). Whereas AN dynamics are determined by an implicit energy function, EBM dynamics are driven by optimizing or sampling from an explicit energy function. In the AN, lowering the energy for some states raises it for others, whereas the explicit EBM energy function requires well-chosen negative samples to ensure that it discriminates likely from unlikely states. Although the EBM and FBGM seem well suited to synthesis and generation tasks, owing to their probabilistic underpinnings, we show that ANs can be used for conditional generation (maximum-likelihood completion) tasks.
3 CONVOLUTIONAL BIPARTITE ATTRACTOR NETS

Various types of recurrent nets have been shown to converge to activation fixed points, including fully interconnected networks of asynchronous binary units (Hopfield, 1982) and networks of continuous-valued units operating in continuous time (Hopfield, 1984). Most relevant to modern deep learning, Koiran (1994) identified convergence conditions for synchronous update of continuous-valued units in discrete time: given a network with state x, parallel updates of the full state with the standard activation rule,

x \leftarrow f(xW + b),   (1)

will asymptote at either a fixed point or a limit cycle of length 2. Sufficient conditions for this result are: initial x \in [-1, +1]^n, W = W^T, w_{ii} \ge 0, and f(\cdot) piecewise continuous and strictly increasing with \lim_{\eta \to \pm\infty} f(\eta) = \pm 1. The proof is cast in terms of an energy function,

E(x) = -\tfrac{1}{2} x W x^T - x b^T + \sum_i \int_0^{x_i} f^{-1}(\xi) \, d\xi.   (2)

With f \equiv \tanh, we have the barrier function

\rho(x_i) \equiv \int_0^{x_i} f^{-1}(\xi) \, d\xi = \tfrac{1}{2} \left[ (1 + x_i) \ln(1 + x_i) + (1 - x_i) \ln(1 - x_i) \right].   (3)

To ensure a fixed point (no limit cycle of length greater than 1), asynchronous updates suffice, because the solution of \partial E / \partial x_i = 0 is the standard update for unit i (Equation 1). Because the energy function additively factorizes for units that have no direct connections, parallel updates of these units still ensure non-increasing energy, and hence attainment of a fixed point. We adopt the bipartite architecture of a stacked restricted Boltzmann machine (Hinton and Salakhutdinov, 2006), with bidirectional symmetric connections between adjacent layers of units and no connectivity within a layer (Figure 1c). We distinguish between visible layers, which contain inputs and/or outputs of the net, and hidden layers.
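These convergence conditions are easy to exercise numerically. The sketch below is our own illustration (not code from the paper); it assumes f = tanh, zero self-connections (which satisfies w_ii >= 0), and small random symmetric weights, and checks that asynchronous unit-by-unit updates never increase the energy of Equation 2 and settle to a fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = rng.normal(scale=0.15, size=(n, n))
W = (A + A.T) / 2            # symmetric weights: W = W^T
np.fill_diagonal(W, 0.0)     # w_ii = 0 satisfies the w_ii >= 0 condition
b = rng.normal(scale=0.1, size=n)

def rho(x):
    # barrier function for f = tanh (Equation 3)
    return 0.5 * ((1 + x) * np.log1p(x) + (1 - x) * np.log1p(-x))

def energy(x):
    # Equation 2 with the tanh barrier
    return -0.5 * x @ W @ x - b @ x + rho(x).sum()

x = rng.uniform(-0.9, 0.9, size=n)
energies = [energy(x)]
for sweep in range(500):
    for i in range(n):       # asynchronous, unit-by-unit updates (Equation 1)
        x[i] = np.tanh(W[i] @ x + b[i])
    energies.append(energy(x))

# energy is non-increasing, and the state reaches a fixed point
assert all(e2 <= e1 + 1e-9 for e1, e2 in zip(energies, energies[1:]))
assert np.max(np.abs(x - np.tanh(W @ x + b))) < 1e-5
```

Each asynchronous update exactly minimizes the energy with respect to one coordinate, which is why the energy trace is monotone even without any step-size tuning.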
The bipartite architecture allows units within a layer to be updated in parallel while guaranteeing strictly non-increasing energy and attainment of a local energy minimum. We thus perform layerwise updating of units, defining one iteration as a sweep from one end of the architecture to the other and back. The 8-step update sequence for the architecture in Figure 1c is shown above the network.

3.1 CONVOLUTIONAL WEIGHT CONSTRAINTS

The weight constraints required for convergence can be achieved within a convolutional architecture as well (Figure 1d). In a feedforward convolutional architecture, the connectivity from layer l to l+1 is represented by weights W^l = \{ w^l_{qrab} \}, where q and r are channel indices in the destination (l+1) and source (l) layers, respectively, and a and b specify the relative coordinate within the kernel, such that the weight w^l_{qrab} modulates the input to the unit in layer l+1, channel q, absolute position (\alpha, \beta), denoted x^{l+1}_{q\alpha\beta}, from the unit x^l_{r,\alpha+a,\beta+b}. If W^{l+1} = \{ w^{l+1}_{qrab} \} denotes the reverse weights to channel q in layer l from channel r in layer l+1, symmetry requires that

w^l_{q,r,a,b} = w^{l+1}_{r,q,-a,-b}.   (4)

This follows from the fact that the weights are translation invariant: the reverse mapping from x^{l+1}_{q,\alpha,\beta} to x^l_{r,\alpha+a,\beta+b} has the same weight as that from x^{l+1}_{q,\alpha-a,\beta-b} to x^l_{r,\alpha,\beta}, embodied in Equation 4. Implementation of the weight constraint is simple: W^l is unconstrained, and W^{l+1} is obtained by transposing the first two tensor dimensions of W^l and flipping the indices of the last two. The convolutional bipartite architecture has the energy function

E(x) = -\sum_{l=1}^{L-1} \sum_q x^{l+1}_q \bullet \left( W^l_q * x^l \right) + \sum_{l=1}^{L} \sum_{q,\alpha,\beta} \left[ \rho(x^l_{q\alpha\beta}) - b^l_q x^l_{q\alpha\beta} \right],   (5)

where x^l is the activation in layer l and b^l are the channel biases.
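The transpose-and-flip construction of Equation 4 can be verified numerically. The sketch below is our own illustration (hypothetical shapes and weights, and a hand-rolled zero-padded convolution rather than a deep-learning library); it checks that the bilinear interaction term of the energy gives the same value whether computed with the forward weights from layer l or the reverse weights from layer l+1:

```python
import numpy as np

rng = np.random.default_rng(1)
Cin, Cout, H, K = 2, 3, 6, 3             # channels, spatial size, kernel size
x_lo = rng.normal(size=(Cin, H, H))      # layer l state
x_hi = rng.normal(size=(Cout, H, H))     # layer l+1 state
W = rng.normal(size=(Cout, Cin, K, K))   # forward kernel; offsets a, b range over {-1, 0, 1}

def conv(w, x):
    # zero-padded cross-correlation: out[q, α, β] = Σ_{r,a,b} w[q,r,a,b] x[r, α+a, β+b]
    cin, h, _ = x.shape
    cout, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((cout, h, h))
    for a in range(k):
        for b in range(k):
            out += np.einsum('qr,rij->qij', w[:, :, a, b], xp[:, a:a + h, b:b + h])
    return out

# reverse weights: transpose the channel dimensions, flip both kernel dimensions (Equation 4)
W_rev = np.flip(W.transpose(1, 0, 2, 3), axis=(2, 3))

# symmetry: the bilinear interaction energy is identical in both directions
e_fwd = np.sum(x_hi * conv(W, x_lo))
e_rev = np.sum(x_lo * conv(W_rev, x_hi))
assert np.allclose(e_fwd, e_rev)
```

This equality is exactly what makes the convolutional energy of Equation 5 well defined: the same interaction term is recovered whichever layer drives the computation.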
In Equation 5, \rho(\cdot) is the barrier function (Equation 3), '\ast' is the convolution operator, and '\bullet' denotes the element-wise sum of the Hadamard product of tensors. The factor of \tfrac{1}{2} ordinarily found in energy functions is not present in the first term because, in contrast to Equation 2, each second-order term in x appears only once. For a similar formulation in stacked restricted Boltzmann machines, see Lee et al. (2009).

3.2 LOSS FUNCTIONS

Evidence provided to the CBAN consists of activation constraints on a subset of the visible units. The CBAN is trained to fill in, or complete, the activation pattern over the visible state. The manner in which evidence constrains activations depends on the nature of the evidence. In a scenario where all features are present but potentially noisy, one should treat them as soft constraints that can be overridden by the model; in a scenario where the evidence features are reliable but other features are entirely missing, one should treat the evidence as hard constraints. We have focused on the latter scenario in our simulations, although we discuss the use of soft constraints in Appendix A. For a hard constraint, we clamp the visible units to the value of the evidence, meaning that activation is set to the observed value and not allowed to change. Energy is minimized conditioned on the clamped values. One extension to clamping is to replicate all visible units and designate one set as input, clamped to the evidence, and one set as output, which serves as the network readout. We considered using the evidence merely to initialize the visible state, but initialization is inadequate to anchor the visible state, and it wanders. We also considered using the evidence as a fixed bias on the input to the visible state, but redundancy of the bias and top-down signals from the hidden layer can prevent the CBAN from achieving the desired activations.
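Hard-constraint clamping can be sketched in a minimal two-layer bipartite net (our own illustration, with hypothetical sizes, weights, and evidence values, not the paper's implementation): clamped visible units are reset to the evidence on every sweep, while the free units relax under the top-down signal.

```python
import numpy as np

rng = np.random.default_rng(2)
nv, nh = 6, 10
W = 0.15 * rng.normal(size=(nv, nh))    # one visible<->hidden weight matrix, used in both directions
bv, bh = np.zeros(nv), np.zeros(nh)

evidence = np.array([0.8, -0.8, 0.8])    # observed values for the first 3 visible units
clamped = np.zeros(nv, dtype=bool)
clamped[:3] = True

v = np.zeros(nv)
v[clamped] = evidence                    # hard constraint: set to the evidence, never updated
for _ in range(200):
    h = np.tanh(v @ W + bh)              # upward sweep
    v_free = np.tanh(W @ h + bv)         # value each visible unit would take if unclamped
    v = np.where(clamped, v, v_free)     # clamped units keep the evidence

assert np.allclose(v[clamped], evidence)             # evidence stays anchored throughout
h = np.tanh(v @ W + bh)
v_next = np.where(clamped, v, np.tanh(W @ h + bv))
assert np.max(np.abs(v_next - v)) < 1e-5             # settled to a conditional fixed point
```

Energy minimization here is conditioned on the clamped values: the layerwise sweeps are block-coordinate descent over the hidden layer and the free visible units only.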
An obvious loss function is the squared error, L_{SE} = \sum_i \| v_i - y_i \|^2, where i is an index over visible units, v is the visible state, and y is the target visible state. However, this loss misses a key source of error: the clamped units have zero error under it. Consequently, we replace v_i with \tilde{v}_i, the value that unit i would take were it unclamped, i.e., free to take on a value consistent with the hidden units driving it: L_{SE} = \sum_i \| \tilde{v}_i - y_i \|^2. An alternative loss, related to the contrastive loss of the Boltzmann machine (see Appendix B), explicitly aims to ensure that the energy of the current state is higher than that of the target state. With x = (y, h) being the complete state with all visible units clamped at their target values and the hidden units in some configuration h, and \tilde{x} = (\tilde{v}, h) being the complete state with the visible units unclamped, one can define the loss

L_{\Delta E} = E(x) - E(\tilde{x}) = \sum_i f^{-1}(\tilde{v}_i)(\tilde{v}_i - y_i) + \rho(y_i) - \rho(\tilde{v}_i).

We apply this loss by allowing the net to iterate for some number of steps given a partially clamped input, yielding a hidden state that is a plausible candidate to generate the target visible state. Note that \rho(y_i) is constant, and although it does not factor into the gradient computation, it helps interpret L_{\Delta E}: when L_{\Delta E} = 0, \tilde{v} = y. This loss is curious in that it is a function not just of the visible state but, through the term f^{-1}(\tilde{v}_i), depends directly on the hidden state in the adjacent layer and the weights between these layers. A variant of L_{\Delta E} is based on the observation that the goal of training is only to make the two energies equal, suggesting a soft hinge loss: L_{\Delta E+} = \log(1 + \exp(E(x) - E(\tilde{x}))).
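The three losses are straightforward to implement for a tanh net. In the sketch below (our own illustration; v_free plays the role of \tilde{v}, and the example arrays are hypothetical), note that because \tilde{v} exactly minimizes the per-unit energy, L_{\Delta E} is a Bregman divergence of the barrier \rho and is therefore non-negative, vanishing only when \tilde{v} = y:

```python
import numpy as np

def rho(x):
    # tanh barrier: integral of arctanh from 0 to x (Equation 3)
    return 0.5 * ((1 + x) * np.log1p(x) + (1 - x) * np.log1p(-x))

def loss_se(v_free, y):
    # squared error on the would-be-unclamped visible values
    return np.sum((v_free - y) ** 2)

def loss_dE(v_free, y):
    # L_ΔE = E(clamped) - E(unclamped); f^{-1} = arctanh for a tanh net
    return np.sum(np.arctanh(v_free) * (v_free - y) + rho(y) - rho(v_free))

def loss_dE_hinge(v_free, y):
    # soft hinge variant L_ΔE+
    return np.log1p(np.exp(loss_dE(v_free, y)))

y = np.array([0.9, -0.5, 0.2])           # hypothetical target visible state
v_free = np.array([0.7, -0.6, 0.1])      # hypothetical unclamped visible values

assert loss_se(y, y) == 0.0
assert abs(loss_dE(y, y)) < 1e-12                 # zero when the net reproduces the target
assert loss_dE(v_free, y) > 0.0                   # clamped target state has higher energy
assert abs(loss_dE_hinge(y, y) - np.log(2)) < 1e-12
```

The last assertion reflects that the hinge loss bottoms out at log 2 rather than 0 when the two energies are equal, which leaves the gradient well behaved near the solution.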
Both energy-based losses have an interpretation under the Boltzmann distribution: L_{\Delta E} is related to the conditional likelihood ratio of the clamped to the unclamped visible state, and L_{\Delta E+} is related to the conditional probability of the clamped versus unclamped visible state:

L_{\Delta E} = -\log \frac{p(y \mid h)}{p(\tilde{v} \mid h)} \quad \text{and} \quad L_{\Delta E+} = -\log \frac{p(y \mid h)}{p(\tilde{v} \mid h) + p(y \mid h)}.

3.3 PREVENTING VANISHING/EXPLODING GRADIENTS

Although gradient descent is a more powerful method for training the CBAN than Hopfield's Hebb rule or the Boltzmann machine's contrastive loss, vanishing and exploding gradients are a concern, as with any recurrent net (Hochreiter et al., 2001), particularly in the CBAN, which may take 50 steps to fully relax. We address the gradient issue in two ways: through intermediate training signals and through a soft sigmoid activation function. The aim of the CBAN is to produce a stable interpretation asymptotically. The appropriate way to achieve this is to apply the loss once activation converges. However, the loss can be applied prior to convergence as well, essentially training the net to achieve convergence as quickly as possible while also introducing loss gradients deep inside the unrolled net. Assume a stability criterion \theta that determines the iteration t^* at which the net has effectively converged:

t^* = \min \{ t : \max_i | x_i(t) - x_i(t-1) | < \theta \}.

Training can be logically separated into pre- and post-convergence phases, which we refer to as transient and stationary. In the stationary phase, the Almeida/Pineda algorithm (Pineda, 1987; Almeida, 1987) leverages the fact that activation is constant over iterations, permitting a computationally efficient gradient calculation with low memory requirements. In the transient phase, the loss can be injected at each step, which is exactly the temporal-difference method TD(1) (Sutton, 1988).
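The transient-phase training signal can be sketched as follows (our own illustration with hypothetical weights and a squared per-step loss, not the paper's training code): the loss is accumulated at every step of the unrolled settling process, and the stability criterion \theta determines t^*:

```python
import numpy as np

rng = np.random.default_rng(3)
nv, nh = 6, 10
W = 0.15 * rng.normal(size=(nv, nh))       # hypothetical visible<->hidden weights
y = np.tanh(rng.normal(size=nv))           # hypothetical target visible state
clamped = np.array([True, True, True, False, False, False])

theta = 1e-3                                # stability criterion
v = np.where(clamped, y, 0.0)               # clamped units start (and stay) at the evidence
prev = v.copy()
per_step_losses, t_star = [], None
for t in range(1, 501):
    h = np.tanh(v @ W)                      # upward sweep
    v_free = np.tanh(W @ h)                 # unclamped visible values at this step
    per_step_losses.append(np.sum((v_free - y) ** 2))   # TD(1): loss injected at every step
    v = np.where(clamped, v, v_free)
    if t_star is None and np.max(np.abs(v - prev)) < theta:
        t_star = t                          # first iteration satisfying the criterion
    prev = v.copy()

total_loss = sum(per_step_losses)           # the transient-phase training signal
assert t_star is not None                   # the net settled within the unrolled horizon
assert np.allclose(v[clamped], y[clamped])
```

In an actual training run the accumulated loss would be backpropagated through the unrolled sweeps; the point of the per-step injection is precisely that gradients enter at every depth of the unrolled net rather than only at the final, settled state.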
Casting training as temporal-difference learning, one might consider other values of \lambda in TD(\lambda); for example, TD(0) trains the model to predict the visible state at the next time step, encouraging the model to reach the target state as quickly as feasible while not penalizing it for being unable to reach the target immediately. Any of the losses L_{SE}, L_{\Delta E}, and L_{\Delta E+} can be applied with a weighted mixture of training in the stationary and transient phases. Although we do not report systematic experiments in this article, we consistently find that transient training with \lambda = 1 is as efficient and effective as weighted mixtures, including stationary-phase-only training, and that \lambda = 1 outperforms any \lambda < 1. In our reported results, we thus conduct simulations with transient-phase training and \lambda = 1.

We propose a second method of avoiding vanishing gradients specifically due to sigmoidal activation functions: a leaky sigmoid, analogous to a leaky ReLU, which allows gradients to propagate through the net more freely. The leaky sigmoid has activation and barrier functions

f(z) = \begin{cases} \alpha(z + 1) - 1 & z < -1 \\ z & -1 \le z \le 1 \\ \alpha(z - 1) + 1 & z > 1 \end{cases}
\qquad
\rho(x) = \begin{cases} \frac{1}{2\alpha}\left[ x^2 + (1 - \alpha)(1 + 2x) \right] & x < -1 \\ \frac{1}{2} x^2 & -1 \le x \le 1 \\ \frac{1}{2\alpha}\left[ x^2 + (1 - \alpha)(1 - 2x) \right] & x > 1 \end{cases}

The parameter \alpha specifies the slope of the piecewise-linear function outside the [-1, 1] interval. As \alpha \to 0, loss gradients become flat and the CBAN fails to train well. As \alpha \to 1, activation magnitudes can blow up and the CBAN fails to reach a fixed point. In Appendix C, we show that convergence to a fixed point is guaranteed when \alpha \| W \|_{1,\infty} < 1, where \| W \|_{1,\infty} = \max_i \| w_i \|_1. In practice, we have found that restricting W is unnecessary and that \alpha = 0.2 works well.

4 SIMULATIONS

We report on a series of simulation studies of increasing complexity.
First, we explore a fully connected bipartite attractor net (FBAN) on a bar imputation task and then on supervised MNIST image completion and classification. Second, we apply the CBAN to unsupervised image completion tasks on Omniglot and CIFAR-10 and compare the CBAN to CBAN variants and denoising VAEs. Lastly, we adapt the CBAN to the task of super-resolution and report promising results against competing models, such as DRCN and LapSRN. Details of architectures, parameters, and training are in Appendix D.

4.1 BAR TASK

We studied a simple inference task on partial images that have exactly one correct interpretation. Images are 5 × 5 binary pixel arrays consisting of two horizontal bars or two vertical bars. Twenty distinct images exist, shown in the top row of Figure 2. A subset of pixels is provided as evidence; examples are shown in the bottom row of Figure 2. The task is to fill in the masked pixels. Evidence is generated such that only one consistent completion exists. In some cases, a bar must be inferred without any white pixels as evidence (e.g., second column from the right). In other cases, the local evidence is consistent with both vertical and horizontal bars (e.g., first column from the left). An FBAN with one layer of 50 hidden units is sufficient for the task. Evidence is generated randomly on each trial, and evaluating on 10k random states after training, the model is 99.995% correct. The middle row in Figure 2 shows the FBAN response after one iteration. The net comes close to performing the task in a single shot; after a second iteration of clean-up, it reaches the asymptotic state shown in the top row.

Figure 2: Bar task: input consists of 5 × 5 pixel arrays, with the target being either two rows or two columns of pixels present.

Figure 3: Bar task: weights between visible and first hidden layers.

Figure 4: MNIST completions.
Row 1: target test examples, with the class label coded in the bottom row. Row 2: completions produced by the FBAN. Row 3: evidence, with masked regions (including class labels) in red. Row 4: the top-down 'dream' state produced by the hidden representation.

Figure 3 shows some visible-hidden weights learned by the FBAN. Each 5 × 5 array depicts the weights to/from one hidden unit. Weight sign and magnitude are indicated by coloring and area of the square, respectively. Units appear to select one row and one column, either with the same or opposite polarity. Same-polarity weights within a row or column induce coherence among pixels. Opposite-polarity weights between a row and a column allow the pixel at the intersection to activate either the row or the column, depending on the sign of the unit's activation.

4.2 SUPERVISED MNIST

We trained an FBAN with two hidden layers on a supervised version of MNIST in which the visible state consists of a 28 × 28 array for an MNIST digit and an additional vector to code the class label. For the sake of graphical convenience, we allocate 28 units to the label, using the first 20 to redundantly code the class label in pairs of units and ignoring the final 8 units. Our architecture had 812 inputs, 200 units in the first hidden layer, and 50 units in the second. During training, all bits of the label were masked, as well as one-third of the image pixels. The image was masked with thresholded Perlin coherent noise (Perlin, 1985), which produces missing patches that are far more difficult to fill in than the isolated pixels produced by Bernoulli masking. The third row of Figure 4 shows the evidence provided to the FBAN for 20 random test-set items. The red masks indicate unobserved pixels; the other pixels are clamped in the visible state. The unobserved pixels include those representing the class label, coded in the bottom row of the pixel array.
The top row of the figure shows the target visible representation, with class labels indicated by the isolated white pixels. Even though the training loss treats all pixels as equivalent, the FBAN does learn to classify unlabeled images. On the test set, the model achieves a classification accuracy of 87.5% on Perlin-masked test images and 89.9% on noise-free test images. Note that the 20 pixels indicating class membership are no different from any other missing pixels in the input; the model learns to classify by virtue of the systematic relationship between images and labels. We can also train the model with fully observed images and fully unobserved labels, and its performance is then like that of any fully connected MNIST classifier, achieving an accuracy of 98.5%. The FBAN does an excellent job of filling in missing features in Figure 4 and in further examples in Appendix E. The FBAN's interpretations of the input seem respectable in comparison to other recent recurrent associative memory models (Figures 8a,b). We mean no disrespect to other research efforts, which have very different foci than ours, but merely wish to indicate that we are obtaining state-of-the-art results for associative memory models. Figure 8c shows some weights between visible and hidden units. Note that the weights link image pixels with multiple digit labels. These weights stand apart from the usual hidden representations found in feedforward classification networks.

4.3 UNSUPERVISED OMNIGLOT

We trained a CBAN on the Omniglot images (Lake et al., 2015). Omniglot consists of multiple instances of 1623 characters from 50 different alphabets. The CBAN has one visible layer containing the character image, 28 × 28 × 1, and three successive hidden layers with dimensions 28 × 28 × 128, 14 × 14 × 256, and 7 × 7 × 256, all with average pooling between the layers and filters of size 3 × 3.
Other network parameters and training details are presented in Appendix D. To experiment with a different type of masking, we used random square patches of diameter 3–6, which remove on average roughly 30% of the white pixels in the image.

Figure 5: Omniglot image completion comparison examples (left) and quantitative results (right). The top two rows of the examples show the target image and the evidence provided to the model (with missing pixels depicted in red), respectively. The subsequent rows show the image completions produced by CBAN, CBAN-asym, CBAN-noTD, and the denoising VAE. The quantitative measures evaluate each model on the PSNR and SSIM metrics; black lines indicate +1 standard error of the mean.

We compared our CBAN to variants with critical properties removed: one without weight symmetry (CBAN-asym) and one in which the TD(1) training procedure is replaced by a standard squared loss at the final step (CBAN-noTD). We also compare to a convolutional denoising VAE (CD-VAE), which takes the masked image as input and outputs the completion. The CBAN with symmetric weights reaches a fixed point, whereas CBAN-asym appears to attain limit cycles of 2–10 iterations. Qualitatively, CBAN produces the best image reconstructions (Figure 5). CBAN-asym and CBAN-noTD tend to hallucinate additional strokes, and CBAN-noTD and CD-VAE produce less crisp edges. Quantitatively, we assess models with two measures of reconstruction quality, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM; Wang et al., 2004); larger is better on each measure. CBAN is strictly superior to the alternatives on both measures (Figure 5, right panel).
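For reference, PSNR is a simple function of mean squared error; the sketch below is our own illustration with hypothetical images (SSIM, which involves local luminance, contrast, and structure statistics, is omitted for brevity):

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    # peak signal-to-noise ratio in dB; higher is better
    mse = np.mean((ref - test) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(4)
ref = rng.uniform(size=(28, 28))                                  # hypothetical reference image
mild = np.clip(ref + rng.normal(scale=0.05, size=ref.shape), 0, 1)
harsh = np.clip(ref + rng.normal(scale=0.30, size=ref.shape), 0, 1)

assert psnr(ref, mild) > psnr(ref, harsh)   # less corruption gives a higher PSNR
```

Because PSNR is a monotone transform of MSE, it rewards averaging toward the mean, which is one reason it can disagree with SSIM on sharpened reconstructions.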
CBAN completions are not merely memorized instances; the CBAN has learned structural regularities of the images, allowing it to fill in big gaps in images that, with the missing pixels, are typically uninterpretable by both classifiers and humans. Additional CBAN image completion examples for Omniglot can be found in Appendix E.

4.4 UNSUPERVISED CIFAR-10

We trained a CBAN with one visible and three hidden layers on CIFAR-10 images. The visible layer is the size of the input image, 32 × 32 × 3. The successive hidden layers had dimensions 32 × 32 × 40, 16 × 16 × 120, and 8 × 8 × 440, all with filters of size 3 × 3 and average pooling between the hidden layers. Further details of architecture and training can be found in Appendix D. Figure 6 shows qualitative and quantitative comparisons of the alternative models. Here, CBAN-asym performs about the same as CBAN. However, CBAN-asym typically attains bi-phasic limit cycles, and it sometimes produces splotchy artifacts in background regions (e.g., third image from the left). CBAN-noTD and the CD-VAE are clearly inferior to CBAN. Additional CBAN image completions can be found in Appendix E.

Figure 6: CIFAR-10 image completion comparison examples (left) and quantitative results (right). Layout identical to that of Figure 5.

Algorithm                   Set5 PSNR/SSIM    Set14 PSNR/SSIM   BSD100 PSNR/SSIM   Urban100 PSNR/SSIM
Bicubic (baseline)          32.21 / 0.921     29.21 / 0.911     28.67 / 0.810      25.63 / 0.827
DRCN (Kim et al., 2016)     37.63 / 0.959     32.94 / 0.913     31.85 / 0.894      30.76 / 0.913
LapSRN (Lai et al., 2018)   37.52 / 0.959     33.08 / 0.913     31.80 / 0.895      30.41 / 0.910
CBAN (ours)                 34.18 / 0.947     30.79 / 0.953     30.12 / 0.872      27.49 / 0.915

Table 1: Quantitative benchmark presenting average PSNR/SSIM for scale factor ×2 on four test sets. Red indicates superior performance of CBAN.
CBAN consistently outperforms the bicubic baseline.

Figure 7: Examples of super-resolution, with the columns in a given image group comparing high-resolution ground truth, CBAN, and bicubic interpolation (the baseline method).

4.5 SUPER-RESOLUTION

Deep learning models have proliferated in many domains of image processing, perhaps none more than image super-resolution, which is concerned with recovering a high-resolution image from a low-resolution image. Many specialized architectures have been developed, and although common test data sets exist, comparisons are not as simple as one would hope due to subtle differences in methodology. (For example, even the baseline method, bicubic interpolation, yields different results depending on the implementation.) We set out to explore the feasibility of using CBANs for super-resolution. Our architecture processes 40 × 40 color image patches, and the visible state includes both the low- and high-resolution images, with the low-resolution version clamped and the high-resolution version read out from the net. Details can be found in Appendix D. Table 1 presents two measures of performance, PSNR and SSIM, for the CBAN and various published alternatives. CBAN beats the baseline, bicubic interpolation, on both measures and performs well on SSIM against some leading contenders (even beating LapSRN and DRCN on Set14 and Urban100), but poorly on PSNR. It is common for PSNR and SSIM to be in opposition: SSIM rewards crisp edges; PSNR rewards averaging toward the mean. The border sharpening and contrast enhancement that produce good perceptual quality and a high SSIM score (see Figure 7) arise because the CBAN comes to an interpretation of the images: it imposes edges and textures in order to make the features mutually consistent.
We believe that the CBAN warrants further investigation for super-resolution; regardless of whether it becomes the winner in this competitive field, one can argue that it performs a different type of computation than feedforward models like LapSRN and DRCN.

5 DISCUSSION

In comparison to recently published results on image completion with attractor networks, our CBAN produces far more impressive results (see Appendix, Figure 8, for a contrast). The computational cost and challenge of training CBANs is no greater than those of training deep feedforward nets. CBANs seem to produce crisp images, on par with those produced by generative (e.g., energy- and flow-based) models. CBANs have the potential to be applied in many contexts involving data interpretation, with the virtue that the computational resources they bring to bear on a task are dynamic and dependent on the difficulty of interpreting a given input. Although this article has focused on convolutional networks that have attractor dynamics between levels of representation, we have recently recognized the value of architectures that are fundamentally feedforward with attractor dynamics within a level. Our current research explores this variant of the CBAN as a biologically plausible account of intralaminar lateral inhibition.

REFERENCES

G. Alain, Y. Bengio, and S. Rifai. Regularized auto-encoders estimate local statistics. CoRR, abs/1211.4246, 2012.

L. B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In IEEE First International Conference on Neural Networks, pages 608–618. IEEE Press, San Diego, CA, 1987.

D. J. Amit. Modeling Brain Function: The World of Attractor Neural Networks. Cambridge University Press, Cambridge, England, 1992.

M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding.
In British Machine Vision Conference. BMVA Press, 2012.

R. Chaudhuri and I. Fiete. Associative content-addressable networks with exponentially many robust stable states. arXiv preprint arXiv:1704.02019 [q-bio.NC], 2017.

L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803 [cs.LG], 2016.

Y. Du and I. Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689 [cs.LG], 2019.

T. Han, E. Nijkamp, X. Fang, M. Hill, S.-C. Zhu, and Y. Nian Wu. Divergence triangle for joint training of generator model, energy-based model, and inference model. arXiv preprint arXiv:1812.10907, 2018.

G. E. Hinton. Boltzmann machine. Scholarpedia, 2(5):1668, 2007. doi: 10.4249/scholarpedia.1668.

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

S. Hochreiter, Y. Bengio, and P. Frasconi. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001.

J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. doi: 10.1073/pnas.79.8.2554.

J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092, 1984.

J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.

K. Kar, J. Kubilius, K. Schmidt, E. B. Issa, and J. J. DiCarlo. Evidence that recurrent circuits are critical to the ventral stream's execution of core object recognition behavior.
Nature Neuroscience, 2019. ISSN 1546-1726. doi: 10.1038/s41593-019-0392-5. URL https://www.nature.com/articles/s41593-019-0392-5.

J. Kim, J. Kwon Lee, and K. Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Computer Vision and Pattern Recognition, pages 1637–1645, 2016. doi: 10.1109/CVPR.2016.181.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

P. Koiran. Dynamics of discrete time, continuous state Hopfield networks. Neural Computation, 6(3):459–468, 1994. doi: 10.1162/neco.1994.6.3.459. URL https://doi.org/10.1162/neco.1994.6.3.459.

D. Krotov and J. J. Hopfield. Dense associative memory for pattern recognition. In Advances in Neural Information Processing Systems, pages 1172–1180, 2016.

W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015. ISSN 0036-8075. doi: 10.1126/science.aab3050.

A. Lamb, J. Binas, A. Goyal, S. Subramanian, I. Mitliagkas, D. Kazakov, Y. Bengio, and M. C. Mozer. State-reification networks: Improving generalization by modeling the distribution of hidden representations. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, Long Beach, CA, 2019.

Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F.-J. Huang. A tutorial on energy-based learning. In G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, and B. Taskar, editors, Predicting Structured Data. MIT Press, Boston, MA, 2006.

H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng.
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th International Conference on Machine Learning, 2009.

G. Li, K. Ramanathan, N. Ning, L. Shi, and C. Wen. Memory dynamics in attractor networks. Computational Intelligence and Neuroscience, 2015.

R. Liao, A. Schwing, R. Zemel, and R. Urtasun. Learning deep parsimonious representations. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 5076–5084. Curran Associates, Inc., 2016.

R. Liao, Y. Xiong, E. Fetaya, L. Zhang, K. Yoon, X. Pitkow, R. Urtasun, and R. Zemel. Reviving and improving recurrent back-propagation. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3082–3091, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR. URL http://proceedings.mlr.press/v80/liao18c.html.

D. Martin, C. Fowlkes, D. Tal, J. Malik, et al. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the International Conference on Computer Vision (ICCV), 2001.

J. L. McClelland and D. E. Rumelhart. An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88(5):375–407, 1981.

M. C. Mozer. Attractor networks. In P. Wilken, A. Cleeremans, and T. Bayne, editors, Oxford Companion to Consciousness, pages 86–89. Oxford University Press, Oxford, UK, 2009.

A. Nayebi, D. Bear, J. Kubilius, K. Kar, S. Ganguli, D. Sussillo, J. J. DiCarlo, and D. L. Yamins. Task-driven convolutional recurrent models of the visual system. In Advances in Neural Information Processing Systems, pages 5290–5301, 2018.

K. Perlin. An image synthesizer.
In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '85, pages 287–296, New York, NY, USA, 1985. ACM. ISBN 0-89791-166-0. doi: 10.1145/325334.325247. URL http://doi.acm.org/10.1145/325334.325247.

F. J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59:2229–2232, 1987.

P. Sterzer and A. Kleinschmidt. A neural basis for inference in perceptual ambiguity. Proceedings of the National Academy of Sciences, 104(1):323–328, 2007. ISSN 0027-8424. doi: 10.1073/pnas.0609006104. URL https://www.pnas.org/content/104/1/323.

L. A. Stowe, E. Kaan, L. Sabourin, and R. C. Taylor. The sentence wrap-up dogma. Cognition, 176:232–247, 2018.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988. doi: 10.1007/BF00115009. URL https://doi.org/10.1007/BF00115009.

Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2790–2798, 2017.

H. Tang, M. Schrimpf, W. Lotter, C. Moerman, A. Paredes, J. O. Caro, W. Hardesty, D. Cox, and G. Kreiman. Recurrent computations for visual pattern completion. Proceedings of the National Academy of Sciences, 115(35):8835–8840, 2018.

P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 1096–1103, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390294. URL http://doi.acm.org/10.1145/1390156.1390294.

Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. doi: 10.1109/TIP.2003.819861.

M. Welling, M. Rosen-zvi, and G.
E. Hinton. Exponential family harmoniums with an application to information retrieval. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1481–1488. MIT Press, 2005. URL http://papers.nips.cc/paper/2672-exponential-family-harmoniums-with-an-application-to-information-retrieval.pdf.

Y. Wu, G. Wayne, A. Graves, and T. Lillicrap. The Kanerva machine: A generative distributed memory. In International Conference on Learning Representations, 2018a. URL https://openreview.net/forum?id=S1HlA-ZAZ.

Y. Wu, G. Wayne, K. Gregor, and T. Lillicrap. Learning attractor dynamics for generative memory. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9379–9388. Curran Associates, Inc., 2018b. URL http://papers.nips.cc/paper/8149-learning-attractor-dynamics-for-generative-memory.pdf.

J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu. A theory of generative ConvNet. In Proceedings of the 33rd International Conference on Machine Learning, ICML '16, pages 2635–2644. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045668.

J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.

R. S. Zemel and M. C. Mozer. Localist attractor networks. Neural Computation, 13(5):1045–1064, 2001. doi: 10.1162/08997660151134325.

R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In Proceedings of the International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.

APPENDIX

A USING EVIDENCE

The CBAN is probed with an observation: a constraint on the activation of a subset of visible units.
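Concretely, probing can be sketched in a few lines. The following numpy sketch is illustrative rather than our implementation: a single visible-hidden layer pair with tanh units is updated synchronously, observed visible units are held fixed (the clamping scheme discussed below), and settling stops when no unit changes by more than 0.01, mirroring the stability criterion used in our experiments. All weights and dimensions are toy values.

```python
import numpy as np

def settle(W, bv, bh, v0, clamp_mask, tol=1e-2, max_iters=100):
    """Alternate visible/hidden tanh updates; clamped visible units stay fixed."""
    v = v0.copy()
    for _ in range(max_iters):
        h = np.tanh(v @ W + bh)              # hidden given visible
        v_new = np.tanh(h @ W.T + bv)        # visible given hidden (symmetric weights)
        v_new[clamp_mask] = v0[clamp_mask]   # clamping: observed units do not change
        if np.max(np.abs(v_new - v)) < tol:  # stability criterion
            return v_new, h
        v = v_new
    return v, h

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((8, 4))        # toy bipartite weights (visible x hidden)
bv, bh = np.zeros(8), np.zeros(4)
v0 = np.zeros(8)
v0[:4] = [0.9, -0.9, 0.9, -0.9]              # observed half of the visible layer
clamp = np.zeros(8, dtype=bool)
clamp[:4] = True
v_final, h_final = settle(W, bv, bh, v0, clamp)
```

With small weights the update is a contraction and the state settles in a handful of iterations; the unclamped visible units end up imputed from the clamped evidence.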
For any visible unit, we must specify how an observation is used to constrain activation. The possibilities include:

• The unit is clamped, meaning that its activation is set to the observed value and is not allowed to change. Convergence is still guaranteed, and the energy is minimized conditional on the clamped value. However, clamping a unit has the disadvantage that any error signal back-propagated to the unit is lost (because changing the unit's input does not change its output).

• The unit is initialized to the observed value instead of 0. This scheme has the disadvantage that activation dynamics can cause the network to wander away from the observed state. This problem occurs in practice, and the consequences are severe enough that it is not a viable approach.

• In principle, we might use an activation rule that sets the visible unit's activation to a convex combination of the observed value and the value that would be obtained via activation dynamics: α × observed + (1 − α) × f(net input). With α = 1, this is simply the clamping scheme; with α = 0 and an appropriate start state, it is the initialization scheme.

• The unit has an external bias proportional to the observation. In this scenario, the update for visible unit i is

x_i ← f(x w_i^T + b_i + e_i),   (6)

where e_i ∝ observation. The initial activation can be either 0 or the observation. One concern with this scheme is that the ideal input to a unit will depend on whether or not the unit has this additional bias, which argues for keeping the bias magnitude small; however, in order to have an impact, the bias must be large. These two requirements are in tension.

• We might replicate all visible units and designate one set for input (clamped) and one set for output (unclamped). The input set is clamped to the observation (which may be zero). The output set is allowed to settle.
The hidden layer(s) would reconcile the inputs and outputs, and this scheme can handle noisy inputs, which is not possible with clamping. Essentially, the input serves as a bias, but one applied through the hidden units rather than to the visible units directly.

In practice, we have found that external biases work but are not as effective as clamping. Partial clamping with 0 < α < 1 is partially effective relative to clamping. Initialization is not effective; the state wanders from the initialized values. However, the replicated-visible scheme seems very promising and should be explored further.

B LOSS FUNCTIONS

The training procedure for a Boltzmann machine aims to maximize the likelihood of the training data, which consist of a set of observations over the visible units. The complete states in a Boltzmann machine occur with probabilities specified by

p(x) ∝ e^{−E(x)/T},   (7)

where T is a computational temperature, and the likelihood of a visible state is obtained by marginalizing over the hidden states. Raising the likelihood of a visible state is achieved by lowering its energy. The Boltzmann machine learning algorithm has a contrastive loss: it tries to lower the energy of states with the visible units clamped to training observations and raise the energy of states with the visible units unclamped and free to take on whatever values they want. This contrastive loss is an example of an energy-based loss, which expresses the training objective in terms of the network energies.

In our model, we define an energy-based loss via matched pairs of states: x is a state with the visible units clamped to observed values, and x̃ is a state in which the visible units are unclamped, i.e., they are free to take on values consistent with the hidden units driving them. Although x̃ could be any unclamped state, it will be most useful for training if it is related to x (i.e., if it is a good point of contrast).
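The difference-of-energies loss derived in Section B.1 reduces to a simple per-unit form. As a sanity check, here is a minimal numpy sketch for a single visible-hidden pair with tanh units; the shapes and values are illustrative. Because the reduced form is a Bregman divergence of the convex potential ρ, the loss vanishes when the unclamped visible state equals the target and is positive otherwise.

```python
import numpy as np

def rho(s):
    """Convex potential whose derivative is arctanh (the inverse of tanh)."""
    return 0.5 * (1 + s) * np.log1p(s) + 0.5 * (1 - s) * np.log1p(-s)

def loss_delta_e(W, bv, h, v_target):
    """Energy difference between clamped (v) and unclamped (v~) states sharing h."""
    v_free = np.tanh(h @ W.T + bv)  # unclamped visibles driven by the hidden state
    # sum_i f^{-1}(v~_i)(v~_i - v_i) + rho(v_i) - rho(v~_i)
    return np.sum(np.arctanh(v_free) * (v_free - v_target)
                  + rho(v_target) - rho(v_free))

rng = np.random.default_rng(1)
W = 0.5 * rng.standard_normal((6, 3))            # toy visible x hidden weights
bv = np.zeros(6)
h = np.tanh(rng.standard_normal(3))              # a hidden state from settling
v_target = np.clip(rng.standard_normal(6), -0.999, 0.999)  # targets kept in (-1, 1)
print(loss_delta_e(W, bv, h, v_target))
```

Note that arctanh(v_free) is exactly the net input to the visible units, so no explicit inverse is needed in a real implementation; targets must lie strictly inside (−1, 1) for ρ to be finite, which is why we scale target activations away from ±1.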
To achieve this relationship, we propose to compute (x̃, x) pairs as follows:

1. Clamp some portion of the visible units with a training example.
2. Run the net to some iteration, at which point the full hidden state is h. (The point of this step is to identify a hidden state that is a plausible candidate to generate the target visible state.)
3. Set x to be the complete state in which the hidden component is h and the visible component is the target visible state.
4. Set x̃ to be the complete state in which the hidden component is h and the visible component is the fully unclamped activation pattern that would be obtained by propagating activities from the hidden units to the (unclamped) visible units.

Note that the contrastive pair at iteration i, (x̃_i, x_i), consists of states close to the activation trajectory that the network is following. We might train the net only after it has reached convergence, but we have found that defining the loss for every iteration i up until convergence improves training performance.

B.1 LOSS 1: THE DIFFERENCE OF ENERGIES

L_{\Delta E} = E(x) - E(\tilde{x})
 = \Big[ -\tfrac{1}{2} x W x^T - b x^T + \sum_j \int_0^{x_j} f^{-1}(\xi)\, d\xi \Big] - \Big[ -\tfrac{1}{2} \tilde{x} W \tilde{x}^T - b \tilde{x}^T + \sum_j \int_0^{\tilde{x}_j} f^{-1}(\xi)\, d\xi \Big]
 = \sum_i \Big[ (w_i x + b_i)(\tilde{v}_i - v_i) + \int_0^{v_i} f^{-1}(\xi)\, d\xi - \int_0^{\tilde{v}_i} f^{-1}(\xi)\, d\xi \Big]
 = \sum_i \Big[ f^{-1}(\tilde{v}_i)(\tilde{v}_i - v_i) + \rho(v_i) - \rho(\tilde{v}_i) \Big],

with \rho(s) = \tfrac{1}{2}(1+s)\ln(1+s) + \tfrac{1}{2}(1-s)\ln(1-s).

This reduction depends on x and x̃ sharing the same hidden state, a bipartite architecture in which visible and hidden units are interconnected but all visible-to-visible connections are zero, a tanh activation function f for all units, and symmetric weights.

B.2 LOSS 2: THE CONDITIONAL PROBABILITY OF CORRECT RESPONSE

This loss aims to maximize the log probability of the clamped state conditional on the choice between the unclamped and clamped states.
Framed as a loss, we have a negative log likelihood:

L_{\Delta E+} = -\ln P(x \mid x \vee \tilde{x}) = -\ln \frac{p(x)}{p(\tilde{x}) + p(x)} = \ln\Big( 1 + \exp \frac{E(x) - E(\tilde{x})}{T} \Big)

The last step is attained using the Boltzmann distribution (Equation 7).

C PROOF OF CONVERGENCE OF CBAN WITH LEAKY SIGMOID ACTIVATION FUNCTION

[The derivation in this appendix did not survive extraction; only its outline is recoverable. The update x_{t+1} = f(W x_t + b) applies the leaky activation function f elementwise, with a small slope outside the interval [-1, 1]. Assuming \|b\|_\infty and the row norms \|w_j\| are bounded so that \|x_t\|_\infty \le 1 + m for some m > 0, the argument shows this bound is preserved under the update. A degree of freedom in the parameterization allows the row-norm bound to be absorbed into a single parameter c \le 1, with b reparameterized accordingly. The conclusion: given c \le 1, the region of convergence must include the hypercube; if the leak barrier is set smaller than this region, convergence is not guaranteed.]

D NETWORK ARCHITECTURES AND HYPERPARAMETERS

D.1 BAR TASK

Our architecture was a fully connected bipartite attractor net (FBAN) with one visible layer and two hidden layers having 48 and 24 channels. We trained using L_{\Delta E+} with the transient TD(1) procedure, defining network stability as the condition in which all changes in unit activation on successive iterations are less than 0.01 for a given input, with tanh activation functions and batches of 20 examples (the complete data set), with masks randomly generated on each epoch subject to the constraint that only one completion is consistent with the evidence. Weights between layers l and l+1 and the biases in layer l are initialized from a mean-zero Gaussian with standard deviation 0.1(\tfrac{1}{2}n_l + \tfrac{1}{2}n_{l+1} + 1)^{-1/2}, where n_l is the number of units in layer l. Optimization is via stochastic gradient descent with an initial learning rate of 0.01, dropped to 0.001; the gradients in a given layer of weights are L2-renormalized to be 1.0 for a batch of examples, which we refer to as SGD-L2.

D.2 MNIST

Our architecture was a fully connected bipartite attractor net (FBAN) with one visible layer connected to a hidden layer with 200 units, connected in turn to a second hidden layer with 50 units. We trained using L_{\Delta E+} with the transient TD(1) procedure, defining network stability as the condition in which all changes in unit activation on successive iterations are less than 0.01 for a given input, with tanh activation functions and batches of 250 examples. Masks are generated randomly for each example on each epoch; they were produced by generating Perlin noise at frequency 7, thresholded such that one third of the pixels were obscured. Weights between layers l and l+1 and the biases in layer l are initialized from a mean-zero Gaussian with standard deviation 0.1(\tfrac{1}{2}n_l + \tfrac{1}{2}n_{l+1} + 1)^{-1/2}, where n_l is the number of units in layer l. Optimization is via stochastic gradient descent with learning rate 0.01; the gradients in a given layer of weights are L∞-renormalized to be 1.0 for a batch of examples, which we refer to as SGD-Linf. Target activations were scaled to lie in [-0.999, 0.999].

Figure 8: (a) Example of recurrent net clean-up dynamics from Liao et al. (2018). Left column is noisy input; right column is the cleaned representation. (b) Example of the associative memory model of Wu et al. (2018a). Column 1 is the target, column 2 is the input, and the remaining columns are retrieval iterations. (c) Some weights between the visible and first hidden layer in an FBAN trained on MNIST with labels.

D.3 CIFAR-10

The network architecture consists of four layers: one visible layer and three hidden layers.
The visible layer dimensions match the input image dimensions: (32, 32, 3). The three hidden layers have 40, 120, and 440 channels, respectively. We used filter sizes of 3 × 3 between all layers. Beyond the first hidden layer, we introduce a 2 × 2 average pooling operation followed by a half-padded convolution going from layer l to layer l+1, and a half-padded convolution followed by a 2 × 2 nearest-neighbor interpolation going from layer l+1 to layer l. Consequently, the spatial dimensions of the hidden states, from lowest to highest, are (32, 32), (16, 16), and (8, 8). A trainable bias is applied per channel to each layer. All biases are initialized to 0, whereas kernel weights are Gaussian initialized with a standard deviation of 0.0001. The CBAN used tanh activation functions and L_SE with TD(1) transient training, as described in the main text. We trained our model on the 50,000 training images of the CIFAR-10 dataset (test set: 10,000 images). The images are noised by online generation of Perlin noise that masks 40% of each image. We optimized our mean-squared error objective using Adam. The learning rate is initially set to 0.0005 and then decreased manually by a factor of 10 every 20 epochs beyond training epoch 150. For each batch, the network runs until the state stabilizes, where the condition for stabilization is that the maximum absolute difference of the full network state between steps t and t+1 is less than 0.01. The maximum number of stabilization steps was set to 100; the average over the course of training was 50 steps per batch.

D.4 OMNIGLOT

The network architecture consists of four layers: one visible layer and three hidden layers. The visible layer dimensions match the input image dimensions: (28, 28, 1). The three hidden layers have 128, 256, and 512 channels, respectively.
We used filter sizes of 3 × 3 between all layers. Beyond the first hidden layer, we introduce a 2 × 2 average pooling operation followed by a half-padded convolution going from layer l to layer l+1, and a half-padded convolution followed by a 2 × 2 nearest-neighbor interpolation going from layer l+1 to layer l. Consequently, the spatial dimensions of the hidden states, from lowest to highest, are (28, 28), (14, 14), and (7, 7). A trainable bias is applied per channel to each layer. All biases are initialized to 0, whereas kernel weights are Gaussian initialized with a standard deviation of 0.01. The CBAN used tanh activation functions and L_SE with TD(1) transient training, as described in the main text. We trained our model on 15,424 images from the Omniglot dataset (test set: 3,856). The images are noised by online generation of masks that obscure 20–40% of the white pixels in the image. We optimized our mean-squared error objective using Adam. The learning rate is initially set to 0.0005 and then decreased manually by a factor of 10 every 20 epochs after training epoch 100. For each batch, the network runs until the state stabilizes, where the condition for stabilization is that the maximum absolute difference of the full network state between steps i and i+1 is less than 0.01. The maximum number of stabilization steps was set to 100; the average over the course of training was 50 steps per batch. Masks were formed by selecting patches of diameter 3–6 uniformly, in random, possibly overlapping locations, stopping when at least 25% of the white pixels had been masked.

D.5 SUPER-RESOLUTION

The network architecture consists of four layers: one visible layer and three hidden layers. The visible layer spatial dimensions match the input patch dimensions, but it consists of 6 channels: (40, 40, 6).
The low-resolution evidence patch is clamped to the bottom 3 channels of the visible state; the top 3 channels serve as the unclamped output against which the high-resolution target patch is compared, with the loss computed as a mean-squared error. Each of the three hidden layers has 300 channels. We used filter sizes of 5 × 5 between all layers. All convolutions are half-padded, and no average pooling operations are introduced in the SR network scheme. Consequently, the spatial dimensions of the hidden states remain constant and match the input patch size of (40, 40). A trainable bias is applied per channel to each layer. All biases are initialized to 0, whereas kernel weights are Gaussian initialized with a standard deviation of 0.001. We trained our model on the 91 images of the T91 dataset (Yang et al., 2010) at ×2 scaling. We optimized our mean-squared error objective using Adam. The learning rate is initially set to 0.00005 and then decreased by a factor of 10 every 10 epochs. The stability conditions described for the CIFAR-10 and Omniglot models are repeated for the SR task, except that the stability threshold was set to 0.1 halfway through training. We evaluated on four test datasets at ×2 scaling: Set5 (Bevilacqua et al., 2012), Set14 (Zeyde et al., 2010), BSD100 (Martin et al., 2001), and Urban100 (Huang et al., 2015).

E ADDITIONAL RESULTS

Figure 9: Additional examples of noise completion from supervised MNIST.

Figure 10: Additional examples of noise completion from color CIFAR-10 images and the letter-like Omniglot symbols.
The rows of each array show the target image, the completion (reconstruction) produced by CBAN, the evidence provided to CBAN with missing pixels depicted in red, and the CBAN "dream" state encoded in the latent representation.