Density estimation using Real NVP
Authors: Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio
Published as a conference paper at ICLR 2017

DENSITY ESTIMATION USING REAL NVP

Laurent Dinh* (Montreal Institute for Learning Algorithms, University of Montreal, Montreal, QC H3T 1J4), Jascha Sohl-Dickstein (Google Brain), Samy Bengio (Google Brain)

ABSTRACT

Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful, stably invertible, and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact and efficient sampling, exact and efficient inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation, and latent variable manipulations.

1 Introduction

The domain of representation learning has undergone tremendous advances due to improved supervised learning techniques. However, unsupervised learning has the potential to leverage large pools of unlabeled data, and extend these advances to modalities that are otherwise impractical or impossible. One principled approach to unsupervised learning is generative probabilistic modeling. Not only do generative probabilistic models have the ability to create novel content, they also have a wide range of reconstruction-related applications including inpainting [61, 46, 59], denoising [3], colorization [71], and super-resolution [9]. As data of interest are generally high-dimensional and highly structured, the challenge in this domain is building models that are powerful enough to capture its complexity yet still trainable.
We address this challenge by introducing real-valued non-volume preserving (real NVP) transformations, a tractable yet expressive approach to modeling high-dimensional data. This model can perform efficient and exact inference, sampling and log-density estimation of data points. Moreover, the architecture presented in this paper enables exact and efficient reconstruction of input images from the hierarchical features extracted by this model.

2 Related work

Substantial work on probabilistic generative models has focused on training models using maximum likelihood. One class of maximum likelihood models are those described by probabilistic undirected graphs, such as Restricted Boltzmann Machines [58] and Deep Boltzmann Machines [53]. These models are trained by taking advantage of the conditional independence property of their bipartite structure to allow efficient exact or approximate posterior inference on latent variables. However, because of the intractability of the associated marginal distribution over latent variables, their training, evaluation, and sampling procedures necessitate the use of approximations like Mean Field inference and Markov Chain Monte Carlo, whose convergence time for such complex models

* Work was done when author was at Google Brain.

Figure 1: Real NVP learns an invertible, stable mapping between a data distribution p̂_X and a latent distribution p_Z (typically a Gaussian). Here we show a mapping that has been learned on a toy 2-d dataset. The function f(x) maps samples x from the data distribution in the upper left into approximate samples z from the latent distribution, in the upper right (inference: x ∼ p̂_X, z = f(x)). This corresponds to exact inference of the latent state given the data.
The inverse function, f^{-1}(z), maps samples z from the latent distribution in the lower right into approximate samples x from the data distribution in the lower left (generation: z ∼ p_Z, x = f^{-1}(z)). This corresponds to exact generation of samples from the model. The transformation of grid lines in X and Z space is additionally illustrated for both f(x) and f^{-1}(z).

remains undetermined, often resulting in generation of highly correlated samples. Furthermore, these approximations can often hinder their performance [7].

Directed graphical models are instead defined in terms of an ancestral sampling procedure, which is appealing both for its conceptual and computational simplicity. They lack, however, the conditional independence structure of undirected models, making exact and approximate posterior inference on latent variables cumbersome [56]. Recent advances in stochastic variational inference [27] and amortized inference [13, 43, 35, 49] allowed efficient approximate inference and learning of deep directed graphical models by maximizing a variational lower bound on the log-likelihood [45]. In particular, the variational autoencoder algorithm [35, 49] simultaneously learns a generative network, that maps Gaussian latent variables z to samples x, and a matched approximate inference network that maps samples x to a semantically meaningful latent representation z, by exploiting the reparametrization trick [68]. Its success in leveraging recent advances in backpropagation [51, 39] in deep neural networks resulted in its adoption for several applications ranging from speech synthesis [12] to language modeling [8]. Still, the approximation in the inference process limits its ability to learn high dimensional deep representations, motivating recent work in improving approximate inference [42, 48, 55, 63, 10, 59, 34].

Such approximations can be avoided altogether by abstaining from using latent variables.
Autoregressive models [18, 6, 37, 20] can implement this strategy while typically retaining a great deal of flexibility. This class of algorithms tractably models the joint distribution by decomposing it into a product of conditionals using the probability chain rule according to a fixed ordering over dimensions, simplifying log-likelihood evaluation and sampling. Recent work in this line of research has taken advantage of recent advances in recurrent networks [51], in particular long short-term memory [26], and residual networks [25, 24] in order to learn state-of-the-art generative image models [61, 46] and language models [32]. The ordering of the dimensions, although often arbitrary, can be critical to the training of the model [66]. The sequential nature of this model limits its computational efficiency. For example, its sampling procedure is sequential and non-parallelizable, which can become cumbersome in applications like speech and music synthesis, or real-time rendering. Additionally, there is no natural latent representation associated with autoregressive models, and they have not yet been shown to be useful for semi-supervised learning.

Generative Adversarial Networks (GANs) [21], on the other hand, can train any differentiable generative network by avoiding the maximum likelihood principle altogether. Instead, the generative network is associated with a discriminator network whose task is to distinguish between samples and real data. Rather than using an intractable log-likelihood, this discriminator network provides the training signal in an adversarial fashion. Successfully trained GAN models [21, 15, 47] can consistently generate sharp and realistic-looking samples [38]. However, metrics that measure the diversity in the generated samples are currently intractable [62, 22, 30].
Additionally, instability in their training process [47] requires careful hyperparameter tuning to avoid diverging behavior.

Training such a generative network g that maps latent variable z ∼ p_Z to a sample x ∼ p_X does not in theory require a discriminator network as in GANs, or approximate inference as in variational autoencoders. Indeed, if g is bijective, it can be trained through maximum likelihood using the change of variable formula:

p_X(x) = p_Z(z) \left| \det\left( \frac{\partial g(z)}{\partial z^T} \right) \right|^{-1}.   (1)

This formula has been discussed in several papers including the maximum likelihood formulation of independent components analysis (ICA) [4, 28], Gaussianization [14, 11] and deep density models [5, 50, 17, 3]. As the existence proof of nonlinear ICA solutions [29] suggests, autoregressive models can be seen as a tractable instance of maximum likelihood nonlinear ICA, where the residual corresponds to the independent components. However, naive application of the change of variable formula produces models which are computationally expensive and poorly conditioned, and so large scale models of this type have not entered general use.

3 Model definition

In this paper, we will tackle the problem of learning highly nonlinear models in high-dimensional continuous spaces through maximum likelihood. In order to optimize the log-likelihood, we introduce a more flexible class of architectures that enables the computation of log-likelihood on continuous data using the change of variable formula. Building on our previous work in [17], we define a powerful class of bijective functions which enable exact and tractable density evaluation and exact and tractable inference. Moreover, the resulting cost function does not rely on a fixed-form reconstruction cost such as squared error [38, 47], and generates sharper samples as a result.
Also, this flexibility helps us leverage recent advances in batch normalization [31] and residual networks [24, 25] to define a very deep multi-scale architecture with multiple levels of abstraction.

3.1 Change of variable formula

Given an observed data variable x ∈ X, a simple prior probability distribution p_Z on a latent variable z ∈ Z, and a bijection f : X → Z (with g = f^{-1}), the change of variable formula defines a model distribution on X by

p_X(x) = p_Z(f(x)) \left| \det\left( \frac{\partial f(x)}{\partial x^T} \right) \right|   (2)

\log(p_X(x)) = \log\left( p_Z(f(x)) \right) + \log\left( \left| \det\left( \frac{\partial f(x)}{\partial x^T} \right) \right| \right),   (3)

where ∂f(x)/∂x^T is the Jacobian of f at x.

Exact samples from the resulting distribution can be generated by using the inverse transform sampling rule [16]. A sample z ∼ p_Z is drawn in the latent space, and its inverse image x = f^{-1}(z) = g(z) generates a sample in the original space. Computing the density on a point x is accomplished by computing the density of its image f(x) and multiplying by the associated Jacobian determinant det(∂f(x)/∂x^T). See also Figure 1. Exact and efficient inference enables the accurate and fast evaluation of the model.

Figure 2: Computational graphs for forward propagation (a) and inverse propagation (b). A coupling layer applies a simple invertible transformation consisting of scaling followed by addition of a constant offset to one part x_2 of the input vector, conditioned on the remaining part of the input vector x_1. Because of its simple nature, this transformation is both easily invertible and possesses a tractable determinant. However, the conditional nature of this transformation, captured by the functions s and t, significantly increases the flexibility of this otherwise weak function. The forward and inverse propagation operations have identical computational cost.
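The change of variable formula above can be checked on a minimal example. The sketch below (not the paper's code) uses a one-dimensional affine bijection f(x) = (x − μ)/σ with a standard normal prior p_Z; Equation (3) must then reproduce the log-density of a N(μ, σ²) variable:

```python
import numpy as np

mu, sigma = 1.5, 2.0

def f(x):                      # bijection X -> Z
    return (x - mu) / sigma

def log_p_Z(z):                # standard normal prior on the latent space
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def log_p_X(x):
    # Equation (3): log p_X(x) = log p_Z(f(x)) + log |det df/dx|,
    # and here df/dx = 1/sigma everywhere.
    return log_p_Z(f(x)) + np.log(1.0 / sigma)

x = np.linspace(-3.0, 3.0, 7)
expected = -0.5 * ((x - mu) / sigma)**2 - 0.5 * np.log(2 * np.pi * sigma**2)
assert np.allclose(log_p_X(x), expected)
```

Sampling works in the opposite direction, as described in the text: draw z ∼ p_Z and push it through g(z) = σz + μ.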
3.2 Coupling layers

Computing the Jacobian of functions with high-dimensional domain and codomain and computing the determinants of large matrices are in general computationally very expensive. This, combined with the restriction to bijective functions, makes Equation 2 appear impractical for modeling arbitrary distributions. As shown however in [17], by careful design of the function f, a bijective model can be learned which is both tractable and extremely flexible. As computing the Jacobian determinant of the transformation is crucial to effectively train using this principle, this work exploits the simple observation that the determinant of a triangular matrix can be efficiently computed as the product of its diagonal terms.

We will build a flexible and tractable bijective function by stacking a sequence of simple bijections. In each simple bijection, part of the input vector is updated using a function which is simple to invert, but which depends on the remainder of the input vector in a complex way. We refer to each of these simple bijections as an affine coupling layer. Given a D dimensional input x and d < D, the output y of an affine coupling layer follows the equations

y_{1:d} = x_{1:d}   (4)
y_{d+1:D} = x_{d+1:D} \odot \exp(s(x_{1:d})) + t(x_{1:d}),   (5)

where s and t stand for scale and translation, and are functions from R^d to R^{D−d}, and ⊙ is the Hadamard product or element-wise product (see Figure 2(a)).

3.3 Properties

The Jacobian of this transformation is

\frac{\partial y}{\partial x^T} = \begin{bmatrix} \mathbb{I}_d & 0 \\ \frac{\partial y_{d+1:D}}{\partial x_{1:d}^T} & \mathrm{diag}(\exp[s(x_{1:d})]) \end{bmatrix},   (6)

where diag(exp[s(x_{1:d})]) is the diagonal matrix whose diagonal elements correspond to the vector exp[s(x_{1:d})]. Given the observation that this Jacobian is triangular, we can efficiently compute its determinant as exp(Σ_j s(x_{1:d})_j).
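Equations (4)-(6) can be sketched directly in numpy. In the sketch below, s and t are stand-ins (random linear maps with a tanh on the scale, an assumption; the paper uses deep convolutional networks), and the analytic log-determinant Σ_j s(x_{1:d})_j is checked against a finite-difference Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 6, 3
# Stand-ins for the learned networks s and t (hypothetical choices).
Ws = rng.normal(size=(D - d, d))
Wt = rng.normal(size=(D - d, d))
s = lambda x1: np.tanh(Ws @ x1)       # bounded scale
t = lambda x1: Wt @ x1

def coupling_forward(x):
    # Equations (4)-(5): identity on x_{1:d}, affine transform on the rest.
    x1, x2 = x[:d], x[d:]
    return np.concatenate([x1, x2 * np.exp(s(x1)) + t(x1)])

x = rng.normal(size=D)
# Triangular Jacobian (Eq. 6): log |det| = sum_j s(x_{1:d})_j.
log_det = s(x[:d]).sum()

# Numerical check with a central-difference Jacobian.
eps = 1e-6
J = np.stack([(coupling_forward(x + eps * e) - coupling_forward(x - eps * e)) / (2 * eps)
              for e in np.eye(D)], axis=1)
assert np.allclose(np.linalg.slogdet(J)[1], log_det, atol=1e-4)
```

Note that the check never differentiates through s or t themselves, which is exactly the point made in the text that follows.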
Since computing the Jacobian determinant of the coupling layer operation does not involve computing the Jacobian of s or t, those functions can be arbitrarily complex. We will make them deep convolutional neural networks. Note that the hidden layers of s and t can have more features than their input and output layers.

Another interesting property of these coupling layers in the context of defining probabilistic models is their invertibility. Indeed, computing the inverse is no more complex than the forward propagation

Figure 3: Masking schemes for affine coupling layers. On the left, a spatial checkerboard pattern mask. On the right, a channel-wise masking. The squeezing operation reduces the 4 × 4 × 1 tensor (on the left) into a 2 × 2 × 4 tensor (on the right). Before the squeezing operation, a checkerboard pattern is used for coupling layers while a channel-wise masking pattern is used afterward.

(see Figure 2(b)),

y_{1:d} = x_{1:d}
y_{d+1:D} = x_{d+1:D} \odot \exp(s(x_{1:d})) + t(x_{1:d})   (7)

⇔

x_{1:d} = y_{1:d}
x_{d+1:D} = (y_{d+1:D} − t(y_{1:d})) \odot \exp(−s(y_{1:d})),   (8)

meaning that sampling is as efficient as inference for this model. Note again that computing the inverse of the coupling layer does not require computing the inverse of s or t, so these functions can be arbitrarily complex and difficult to invert.

3.4 Masked convolution

Partitioning can be implemented using a binary mask b, and using the functional form for y,

y = b \odot x + (1 − b) \odot \big( x \odot \exp(s(b \odot x)) + t(b \odot x) \big).   (9)

We use two partitionings that exploit the local correlation structure of images: spatial checkerboard patterns, and channel-wise masking (see Figure 3). The spatial checkerboard pattern mask has value 1 where the sum of spatial coordinates is odd, and 0 otherwise. The channel-wise mask b is 1 for the first half of the channel dimensions and 0 for the second half.
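The masked form (Equation 9) and its inverse (Equation 8) can be sketched together; the example below builds the checkerboard mask as defined above and verifies the exact invertibility of the coupling layer. The networks s and t are again simple stand-ins, an assumption for illustration only:

```python
import numpy as np

def checkerboard_mask(h, w):
    # Value 1 where the sum of spatial coordinates is odd, 0 otherwise.
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return ((ii + jj) % 2).astype(float)

rng = np.random.default_rng(1)
b = checkerboard_mask(4, 4).ravel()
W = rng.normal(size=(16, 16))
s = lambda u: np.tanh(W @ u)          # stand-ins for the learned networks
t = lambda u: 0.5 * (W @ u)

def forward(x):
    # Equation (9): masked form of the affine coupling layer.
    return b * x + (1 - b) * (x * np.exp(s(b * x)) + t(b * x))

def inverse(y):
    # Equation (8) in masked form: since b*y = b*x, s and t receive the
    # same inputs in both directions, so no network inversion is needed.
    return b * y + (1 - b) * ((y - t(b * y)) * np.exp(-s(b * y)))

x = rng.normal(size=16)
assert np.allclose(inverse(forward(x)), x)
```

The masked positions pass through unchanged, which is what makes the inverse a single closed-form pass rather than an iterative solve.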
For the models presented here, both s(·) and t(·) are rectified convolutional networks.

3.5 Combining coupling layers

Although coupling layers can be powerful, their forward transformation leaves some components unchanged. This difficulty can be overcome by composing coupling layers in an alternating pattern, such that the components that are left unchanged in one coupling layer are updated in the next (see Figure 4(a)).

The Jacobian determinant of the resulting function remains tractable, relying on the fact that

\frac{\partial (f_b \circ f_a)}{\partial x_a^T}(x_a) = \frac{\partial f_a}{\partial x_a^T}(x_a) \cdot \frac{\partial f_b}{\partial x_b^T}\big(x_b = f_a(x_a)\big)   (10)

\det(A \cdot B) = \det(A)\det(B).   (11)

Similarly, its inverse can be computed easily as

(f_b \circ f_a)^{-1} = f_a^{-1} \circ f_b^{-1}.   (12)

Figure 4: Composition schemes for affine coupling layers. (a) In this alternating pattern, units which remain identical in one transformation are modified in the next. (b) Factoring out variables. At each step, half the variables are directly modeled as Gaussians, while the other half undergo further transformation.

3.6 Multi-scale architecture

We implement a multi-scale architecture using a squeezing operation: for each channel, it divides the image into subsquares of shape 2 × 2 × c, then reshapes them into subsquares of shape 1 × 1 × 4c. The squeezing operation transforms an s × s × c tensor into an (s/2) × (s/2) × 4c tensor (see Figure 3), effectively trading spatial size for number of channels.

At each scale, we combine several operations into a sequence: we first apply three coupling layers with alternating checkerboard masks, then perform a squeezing operation, and finally apply three more coupling layers with alternating channel-wise masking.
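The squeezing operation of Section 3.6 is a pure reshape and is itself invertible. A minimal sketch (not the paper's code), matching the 4 × 4 × 1 → 2 × 2 × 4 example of Figure 3:

```python
import numpy as np

def squeeze(x):
    # Turns an s x s x c tensor into an s/2 x s/2 x 4c tensor by folding
    # each 2x2 spatial block into the channel dimension.
    s1, s2, c = x.shape
    x = x.reshape(s1 // 2, 2, s2 // 2, 2, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(s1 // 2, s2 // 2, 4 * c)

def unsqueeze(y):
    # Exact inverse of squeeze.
    s1, s2, c4 = y.shape
    c = c4 // 4
    y = y.reshape(s1, s2, 2, 2, c).transpose(0, 2, 1, 3, 4)
    return y.reshape(2 * s1, 2 * s2, c)

x = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
y = squeeze(x)
assert y.shape == (2, 2, 4)            # matches Figure 3
assert np.allclose(unsqueeze(y), x)    # squeezing is volume-preserving
```

Because it is a permutation of entries, the squeeze contributes nothing to the Jacobian determinant of the overall transformation.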
The channel-wise masking is chosen so that the resulting partitioning is not redundant with the previous checkerboard masking (see Figure 3). For the final scale, we only apply four coupling layers with alternating checkerboard masks.

Propagating a D dimensional vector through all the coupling layers would be cumbersome, in terms of computational and memory cost, and in terms of the number of parameters that would need to be trained. For this reason we follow the design choice of [57] and factor out half of the dimensions at regular intervals (see Equation 14). We can define this operation recursively (see Figure 4(b)),

h^{(0)} = x   (13)
(z^{(i+1)}, h^{(i+1)}) = f^{(i+1)}(h^{(i)})   (14)
z^{(L)} = f^{(L)}(h^{(L-1)})   (15)
z = (z^{(1)}, \ldots, z^{(L)}).   (16)

In our experiments, we use this operation for i < L. The sequence of coupling-squeezing-coupling operations described above is performed per layer when computing f^{(i)} (Equation 14). At each layer, as the spatial resolution is reduced, the number of hidden layer features in s and t is doubled. All variables which have been factored out at different scales are concatenated to obtain the final transformed output (Equation 16).

As a consequence, the model must Gaussianize units which are factored out at a finer scale (in an earlier layer) before those which are factored out at a coarser scale (in a later layer). This results in the definition of intermediary levels of representation [53, 49] corresponding to more local, fine-grained features as shown in Appendix D.

Moreover, Gaussianizing and factoring out units in earlier layers has the practical benefit of distributing the loss function throughout the network, following a philosophy similar to guiding intermediate layers using intermediate classifiers [40]. It also significantly reduces the amount of computation and memory used by the model, allowing us to train larger models.
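The recursion of Equations (13)-(16) can be sketched with placeholder transforms. In the sketch below, each f^{(i)} is just a random permutation followed by a split (an assumption; in the paper each f^{(i)} is the coupling-squeeze-coupling sequence of Section 3.6), which is enough to show the bookkeeping of factoring out half the dimensions at each scale:

```python
import numpy as np

def f_i(h, rng):
    # Stand-in for one scale's transform: permute, then split in half.
    h = h[rng.permutation(h.size)]
    half = h.size // 2
    return h[:half], h[half:]          # (z^{(i+1)}, h^{(i+1)}), Eq. (14)

rng = np.random.default_rng(0)
L = 3
h = np.arange(16, dtype=float)         # h^{(0)} = x, Eq. (13)
zs = []
for i in range(L - 1):
    z_i, h = f_i(h, rng)               # factor out half the dimensions
    zs.append(z_i)
zs.append(h)                           # z^{(L)} = f^{(L)}(h^{(L-1)}), Eq. (15)
z = np.concatenate(zs)                 # z = (z^{(1)}, ..., z^{(L)}), Eq. (16)

assert z.size == 16                    # total dimensionality is preserved
assert [len(t) for t in zs] == [8, 4, 4]
```

Since each piece z^{(i)} is only a copy-and-reorder of coordinates, the factoring itself is trivially invertible and leaves the Jacobian determinant untouched.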
3.7 Batch normalization

To further improve the propagation of training signal, we use deep residual networks [24, 25] with batch normalization [31] and weight normalization [2, 54] in s and t. As described in Appendix E we introduce and use a novel variant of batch normalization which is based on a running average over recent minibatches, and is thus more robust when training with very small minibatches. We also apply batch normalization to the whole coupling layer output. The effects of batch normalization are easily included in the Jacobian computation, since it acts as a linear rescaling on each dimension. That is, given the estimated batch statistics μ̃ and σ̃², the rescaling function

x \mapsto \frac{x - \tilde{\mu}}{\sqrt{\tilde{\sigma}^2 + \epsilon}}   (17)

has a Jacobian determinant

\left( \prod_i (\tilde{\sigma}_i^2 + \epsilon) \right)^{-\frac{1}{2}}.   (18)

This form of batch normalization can be seen as similar to reward normalization in deep reinforcement learning [44, 65]. We found that the use of this technique not only allowed training with a deeper stack of coupling layers, but also alleviated the instability problem that practitioners often encounter when training conditional distributions with a scale parameter through a gradient-based approach.

4 Experiments

4.1 Procedure

The algorithm described in Equation 2 shows how to learn distributions on unbounded space. In general, the data of interest have bounded magnitude. For example, the pixel values of an image typically lie in [0, 256]^D after application of the recommended jittering procedure [64, 62]. In order to reduce the impact of boundary effects, we instead model the density of logit(α + (1 − α) ⊙ x/256), where α is picked here as 0.05. We take this transformation into account when computing log-likelihood and bits per dimension. We also augment the CIFAR-10, CelebA and LSUN datasets during training to include horizontal flips of the training examples.
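The logit preprocessing described above is itself a per-dimension bijection, so it contributes its own log-determinant to the bits-per-dimension computation. A sketch of both directions, under the stated α = 0.05 (the log-det bookkeeping here is our reconstruction, not code from the paper):

```python
import numpy as np

alpha = 0.05

def logit_preprocess(x):
    # Map pixels in [0, 256] to logit(alpha + (1 - alpha) * x / 256),
    # returning the per-sample log |det| of the transformation.
    p = alpha + (1 - alpha) * x / 256.0
    y = np.log(p) - np.log1p(-p)                      # logit(p)
    # dy/dx = (1 - alpha)/256 * 1/(p(1-p)) for each dimension.
    log_det = np.sum(np.log((1 - alpha) / 256.0) - np.log(p) - np.log1p(-p))
    return y, log_det

x = np.array([0.0, 64.0, 128.0, 255.0])
y, log_det = logit_preprocess(x)
assert np.all(np.isfinite(y)) and np.isfinite(log_det)

# Inverting the transform recovers the original pixel values.
p = 1.0 / (1.0 + np.exp(-y))
assert np.allclose((p - alpha) * 256.0 / (1 - alpha), x)
```

Pulling the boundary values 0 and 255 strictly inside (α, 1 − α + α·255/256) is what keeps the logit finite at the edges of the pixel range.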
We train our model on four natural image datasets: CIFAR-10 [36], Imagenet [52], Large-scale Scene Understanding (LSUN) [70], and CelebFaces Attributes (CelebA) [41]. More specifically, we train on the downsampled 32 × 32 and 64 × 64 versions of Imagenet [46]. For the LSUN dataset, we train on the bedroom, tower and church outdoor categories. The procedure for LSUN is the same as in [47]: we downsample the image so that the smallest side is 96 pixels and take random crops of 64 × 64. For CelebA, we use the same procedure as in [38]: we take an approximately central crop of 148 × 148 then resize it to 64 × 64.

We use the multi-scale architecture described in Section 3.6 and use deep convolutional residual networks in the coupling layers with rectifier nonlinearity and skip-connections as suggested by [46]. To compute the scaling functions s, we use a hyperbolic tangent function multiplied by a learned scale, whereas the translation function t has an affine output. Our multi-scale architecture is repeated recursively until the input of the last recursion is a 4 × 4 × c tensor. For datasets of images of size 32 × 32, we use 4 residual blocks with 32 hidden feature maps for the first coupling layers with checkerboard masking. Only 2 residual blocks are used for images of size 64 × 64. We use a batch size of 64. For CIFAR-10, we use 8 residual blocks, 64 feature maps, and downscale only once. We optimize with ADAM [33] with default hyperparameters and use an L2 regularization on the weight scale parameters with coefficient 5 · 10^{−5}.

We set the prior p_Z to be an isotropic unit norm Gaussian. However, any distribution could be used for p_Z, including distributions that are also learned during training, such as from an auto-regressive model, or (with slight modifications to the training objective) a variational autoencoder.
Dataset                  PixelRNN [46]   Real NVP      Conv DRAW [22]   IAF-VAE [34]
CIFAR-10                 3.00            3.49          < 3.59           < 3.28
Imagenet (32 × 32)       3.86 (3.83)     4.28 (4.26)   < 4.40 (4.35)
Imagenet (64 × 64)       3.63 (3.57)     3.98 (3.75)   < 4.10 (4.04)
LSUN (bedroom)                           2.72 (2.70)
LSUN (tower)                             2.81 (2.78)
LSUN (church outdoor)                    3.08 (2.94)
CelebA                                   3.02 (2.97)

Table 1: Bits/dim results for CIFAR-10, Imagenet, LSUN datasets and CelebA. Test results for CIFAR-10 and validation results for Imagenet, LSUN and CelebA (with training results in parenthesis for reference).

Figure 5: On the left column, examples from the dataset. On the right column, samples from the model trained on the dataset. The datasets shown in this figure are, in order: CIFAR-10, Imagenet (32 × 32), Imagenet (64 × 64), CelebA, LSUN (bedroom).

4.2 Results

We show in Table 1 that the number of bits per dimension, while not improving over the PixelRNN [46] baseline, is competitive with other generative methods. As we notice that our performance increases with the number of parameters, larger models are likely to further improve performance. For CelebA and LSUN, the bits per dimension for the validation set was decreasing throughout training, so little overfitting is expected.

We show in Figure 5 samples generated from the model with training examples from the dataset for comparison. As mentioned in [62, 22], maximum likelihood is a principle that values diversity over sample quality in a limited capacity setting. As a result, our model sometimes outputs highly improbable samples, as we can notice especially on CelebA. As opposed to variational autoencoders, the samples generated from our model look not only globally coherent but also sharp.

Figure 6: Manifold generated from four examples in the dataset. Clockwise from top left: CelebA, Imagenet (64 × 64), LSUN (tower), LSUN (bedroom).
Our hypothesis is that, as opposed to these models, real NVP does not rely on a fixed-form reconstruction cost like an L2 norm, which tends to reward capturing low frequency components more heavily than high frequency components. Unlike autoregressive models, sampling from our model is done very efficiently as it is parallelized over input dimensions. On Imagenet and LSUN, our model seems to have captured well the notion of background/foreground and lighting interactions such as luminosity and consistent light source direction for reflectance and shadows.

We also illustrate the smooth, semantically consistent meaning of our latent variables. In the latent space, we define a manifold based on four validation examples z^{(1)}, z^{(2)}, z^{(3)}, z^{(4)}, parametrized by two parameters φ and φ′, by

z = \cos(\phi)\left( \cos(\phi') z^{(1)} + \sin(\phi') z^{(2)} \right) + \sin(\phi)\left( \cos(\phi') z^{(3)} + \sin(\phi') z^{(4)} \right).   (19)

We project the resulting manifold back into the data space by computing g(z). Results are shown in Figure 6. We observe that the model seems to have organized the latent space with a notion of meaning that goes well beyond pixel space interpolation. More visualizations are shown in the Appendix. To further test whether the latent space has a consistent semantic interpretation, we trained a class-conditional model on CelebA, and found that the learned representation had a consistent semantic meaning across class labels (see Appendix F).

5 Discussion and conclusion

In this paper, we have defined a class of invertible functions with tractable Jacobian determinant, enabling exact and tractable log-likelihood evaluation, inference, and sampling. We have shown that this class of generative model achieves competitive performance, both in terms of sample quality and log-likelihood.
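Equation (19) describes a two-parameter trigonometric interpolation through four latent anchors; at the corners of the (φ, φ′) grid the anchors are recovered exactly. A minimal sketch of that property (the grouping of terms follows the reconstruction of Equation 19 above):

```python
import numpy as np

def manifold_point(phi, phi_p, z1, z2, z3, z4):
    # Equation (19): a 2-parameter manifold through four latent points.
    return (np.cos(phi) * (np.cos(phi_p) * z1 + np.sin(phi_p) * z2)
            + np.sin(phi) * (np.cos(phi_p) * z3 + np.sin(phi_p) * z4))

rng = np.random.default_rng(0)
z1, z2, z3, z4 = rng.normal(size=(4, 8))

# Corners of the (phi, phi') grid return the anchor points exactly.
assert np.allclose(manifold_point(0.0, 0.0, z1, z2, z3, z4), z1)
assert np.allclose(manifold_point(0.0, np.pi / 2, z1, z2, z3, z4), z2)
assert np.allclose(manifold_point(np.pi / 2, 0.0, z1, z2, z3, z4), z3)
assert np.allclose(manifold_point(np.pi / 2, np.pi / 2, z1, z2, z3, z4), z4)
```

In the paper, each interior grid point z is mapped back through g(z) to produce the image manifolds of Figure 6.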
Many avenues exist to further improve the functional form of the transformations, for instance by exploiting the latest advances in dilated convolutions [69] and residual network architectures [60].

This paper presented a technique bridging the gap between auto-regressive models, variational autoencoders, and generative adversarial networks. Like auto-regressive models, it allows tractable and exact log-likelihood evaluation for training. It allows however a much more flexible functional form, similar to that in the generative model of variational autoencoders. This allows for fast and exact sampling from the model distribution. Like GANs, and unlike variational autoencoders, our technique does not require the use of a fixed-form reconstruction cost, and instead defines a cost in terms of higher level features, generating sharper images. Finally, unlike both variational autoencoders and GANs, our technique is able to learn a semantically meaningful latent space which is as high dimensional as the input space. This may make the algorithm particularly well suited to semi-supervised learning tasks, as we hope to explore in future work.

Real NVP generative models can additionally be conditioned on additional variables (for instance class labels) to create a structured output algorithm. More so, as the resulting class of invertible transformations can be treated as a probability distribution in a modular way, it can also be used to improve upon other probabilistic models like auto-regressive models and variational autoencoders. For variational autoencoders, these transformations could be used both to enable a more flexible reconstruction cost [38] and a more flexible stochastic inference distribution [48]. Probabilistic models in general can also benefit from batch normalization techniques as applied in this paper.
The definition of powerful and trainable invertible functions can also benefit domains other than generative unsupervised learning. For example, in reinforcement learning, these invertible functions can help extend the set of functions for which an argmax operation is tractable for continuous Q-learning [23], or find representations where local linear Gaussian approximations are more appropriate [67].

6 Acknowledgments

The authors thank the developers of TensorFlow [1]. We thank Sherry Moore, David Andersen and Jon Shlens for their help in implementing the model. We thank Aäron van den Oord, Yann Dauphin, Kyle Kastner, Chelsea Finn, Maithra Raghu, David Warde-Farley, Daniel Jiwoong Im and Oriol Vinyals for fruitful discussions. Finally, we thank Ben Poole, Rafal Jozefowicz and George Dahl for their input on a draft of the paper.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Vijay Badrinarayanan, Bamdev Mishra, and Roberto Cipolla. Understanding symmetries in deep networks. arXiv preprint arXiv:1511.01029, 2015.
[3] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. Density modeling of images using a generalized normalization transformation. arXiv preprint, 2015.
[4] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.
[5] Yoshua Bengio. Artificial neural networks and their application to sequence recognition. 1991.
[6] Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS, volume 99, pages 400–406, 1999.
[7] Mathias Berglund and Tapani Raiko.
Stochastic gradient estimate variance in contrastive divergence and persistent contrastive divergence. arXiv preprint, 2013.
[8] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint, 2015.
[9] Joan Bruna, Pablo Sprechmann, and Yann LeCun. Super-resolution with deep convolutional sufficient statistics. arXiv preprint, 2015.
[10] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[11] Scott Shaobing Chen and Ramesh A Gopinath. Gaussianization. In Advances in Neural Information Processing Systems, 2000.
[12] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.
[13] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.
[14] Gustavo Deco and Wilfried Brauer. Higher order statistical decorrelation without information loss. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 247–254. MIT Press, 1995.
[15] Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1486–1494, 2015.
[16] Luc Devroye. Sample-based non-uniform random variate generation. In Proceedings of the 18th Conference on Winter Simulation, pages 260–265. ACM, 1986.
[17] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: non-linear independent components estimation.
arXiv preprint arXiv:1410.8516, 2014.

[18] Brendan J Frey. Graphical Models for Machine Learning and Digital Communication. MIT Press, 1998.

[19] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, pages 262–270, 2015.

[20] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: masked autoencoder for distribution estimation. CoRR, abs/1502.03509, 2015.

[21] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.

[22] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. arXiv preprint, 2016.

[23] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. arXiv preprint, 2016.

[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.

[26] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[27] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[28] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis, volume 46. John Wiley & Sons, 2004.
[29] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.

[30] Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic. Generating images with recurrent adversarial networks. arXiv preprint, 2016.

[31] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint, 2015.

[32] Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016.

[33] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[34] Diederik P Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint, 2016.

[35] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint, 2013.

[36] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.

[37] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In AISTATS, 2011.

[38] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. CoRR, abs/1512.09300, 2015.

[39] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

[40] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.

[41] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

[42] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther.
Auxiliary deep generative models. arXiv preprint, 2016.

[43] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

[44] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[45] Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.

[46] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[47] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.

[48] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

[49] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint, 2014.

[50] Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013.

[51] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

[52] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[53] Ruslan Salakhutdinov and Geoffrey E Hinton. Deep Boltzmann machines.
In International Conference on Artificial Intelligence and Statistics, pages 448–455, 2009.

[54] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint, 2016.

[55] Tim Salimans, Diederik P Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. arXiv preprint, 2014.

[56] Lawrence K Saul, Tommi Jaakkola, and Michael I Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4(1):61–76, 1996.

[57] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014.

[58] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.

[59] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2256–2265, 2015.

[60] Sasha Targ, Diogo Almeida, and Kevin Lyman. Resnet in resnet: Generalizing residual architectures. CoRR, abs/1603.08029, 2016.

[61] Lucas Theis and Matthias Bethge. Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems, pages 1918–1926, 2015.

[62] Lucas Theis, Aäron Van Den Oord, and Matthias Bethge. A note on the evaluation of generative models. CoRR, abs/1511.01844, 2015.

[63] Dustin Tran, Rajesh Ranganath, and David M Blei. Variational Gaussian process. arXiv preprint arXiv:1511.06499, 2015.

[64] Benigno Uria, Iain Murray, and Hugo Larochelle. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pages 2175–2183, 2013.

[65] Hado van Hasselt, Arthur Guez, Matteo Hessel, and David Silver.
Learning functions across many orders of magnitudes. arXiv preprint, 2016.

[66] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.

[67] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2728–2736, 2015.

[68] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[69] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[70] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint, 2015.

[71] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. arXiv preprint arXiv:1603.08511, 2016.

A Samples

Figure 7: Samples from a model trained on Imagenet (64 × 64).

Figure 8: Samples from a model trained on CelebA.

Figure 9: Samples from a model trained on LSUN (bedroom category).

Figure 10: Samples from a model trained on LSUN (church outdoor category).

Figure 11: Samples from a model trained on LSUN (tower category).

B Manifold

Figure 12: Manifold from a model trained on Imagenet (64 × 64). Images with red borders are taken from the validation set, and define the manifold.
The manifold was computed as described in Equation 19, where the x-axis corresponds to $\varphi$, the y-axis to $\varphi'$, and $\varphi, \varphi' \in \{0, \frac{\pi}{4}, \cdots, \frac{7\pi}{4}\}$.

Figure 13: Manifold from a model trained on CelebA. Images with red borders are taken from the training set, and define the manifold. The manifold was computed as described in Equation 19, where the x-axis corresponds to $\varphi$, the y-axis to $\varphi'$, and $\varphi, \varphi' \in \{0, \frac{\pi}{4}, \cdots, \frac{7\pi}{4}\}$.

Figure 14: Manifold from a model trained on LSUN (bedroom category). Images with red borders are taken from the validation set, and define the manifold. The manifold was computed as described in Equation 19, where the x-axis corresponds to $\varphi$, the y-axis to $\varphi'$, and $\varphi, \varphi' \in \{0, \frac{\pi}{4}, \cdots, \frac{7\pi}{4}\}$.

Figure 15: Manifold from a model trained on LSUN (church outdoor category). Images with red borders are taken from the validation set, and define the manifold. The manifold was computed as described in Equation 19, where the x-axis corresponds to $\varphi$, the y-axis to $\varphi'$, and $\varphi, \varphi' \in \{0, \frac{\pi}{4}, \cdots, \frac{7\pi}{4}\}$.

Figure 16: Manifold from a model trained on LSUN (tower category). Images with red borders are taken from the validation set, and define the manifold. The manifold was computed as described in Equation 19, where the x-axis corresponds to $\varphi$, the y-axis to $\varphi'$, and $\varphi, \varphi' \in \{0, \frac{\pi}{4}, \cdots, \frac{7\pi}{4}\}$.

C Extrapolation

Inspired by the texture generation work of [19, 61] and the extrapolation tests with DCGAN [47], we also evaluate the statistics captured by our model by generating images twice or ten times as large as those present in the dataset.
As we can observe in the following figures, our model seems to successfully create a "texture" representation of the dataset while maintaining spatial smoothness through the image. Our convolutional architecture is aware of the position of a given pixel only through edge effects in the convolutions, so our model behaves similarly to a stationary process. This also explains why these samples are more consistent on LSUN, where the training data was obtained using random crops.

Figure 17: We generate samples a factor bigger than the training set image size on Imagenet (64 × 64). (a) ×2; (b) ×10.

Figure 18: We generate samples a factor bigger than the training set image size on CelebA. (a) ×2; (b) ×10.

Figure 19: We generate samples a factor bigger than the training set image size on LSUN (bedroom category). (a) ×2; (b) ×10.

Figure 20: We generate samples a factor bigger than the training set image size on LSUN (church outdoor category). (a) ×2; (b) ×10.

Figure 21: We generate samples a factor bigger than the training set image size on LSUN (tower category). (a) ×2; (b) ×10.

D Latent variable semantics

As in [22], we further try to grasp the semantics of our learned layers' latent variables through ablation tests. We infer the latent variables and resample the lowest levels of latent variables from a standard Gaussian, increasing the highest level affected by this resampling. As we can see in the following figures, the semantics of our latent space seem to operate at a graphical level rather than at the level of higher-level concepts.
Although the heavy use of convolution improves learning by exploiting image prior knowledge, it is also likely to be responsible for this limitation.

Figure 22: Conceptual compression from a model trained on Imagenet (64 × 64). The leftmost column represents the original image; the subsequent columns were obtained by storing higher-level latent variables and resampling the others, storing less and less as we go right. From left to right: 100%, 50%, 25%, 12.5% and 6.25% of the latent variables are kept.

Figure 23: Conceptual compression from a model trained on CelebA. The leftmost column represents the original image; the subsequent columns were obtained by storing higher-level latent variables and resampling the others, storing less and less as we go right. From left to right: 100%, 50%, 25%, 12.5% and 6.25% of the latent variables are kept.

Figure 24: Conceptual compression from a model trained on LSUN (bedroom category). The leftmost column represents the original image; the subsequent columns were obtained by storing higher-level latent variables and resampling the others, storing less and less as we go right. From left to right: 100%, 50%, 25%, 12.5% and 6.25% of the latent variables are kept.

Figure 25: Conceptual compression from a model trained on LSUN (church outdoor category). The leftmost column represents the original image; the subsequent columns were obtained by storing higher-level latent variables and resampling the others, storing less and less as we go right. From left to right: 100%, 50%, 25%, 12.5% and 6.25% of the latent variables are kept.

Figure 26: Conceptual compression from a model trained on LSUN (tower category).
The leftmost column represents the original image; the subsequent columns were obtained by storing higher-level latent variables and resampling the others, storing less and less as we go right. From left to right: 100%, 50%, 25%, 12.5% and 6.25% of the latent variables are kept.

E Batch normalization

We further experimented with batch normalization by using a weighted average of a moving average of the layer statistics $\tilde{\mu}_t, \tilde{\sigma}^2_t$ and the current batch statistics $\hat{\mu}_t, \hat{\sigma}^2_t$:

$$\tilde{\mu}_{t+1} = \rho \tilde{\mu}_t + (1 - \rho)\,\hat{\mu}_t \qquad (20)$$
$$\tilde{\sigma}^2_{t+1} = \rho \tilde{\sigma}^2_t + (1 - \rho)\,\hat{\sigma}^2_t, \qquad (21)$$

where $\rho$ is the momentum. When using $\tilde{\mu}_{t+1}, \tilde{\sigma}^2_{t+1}$, we only propagate the gradient through the current batch statistics $\hat{\mu}_t, \hat{\sigma}^2_t$. We observe that using this lag helps the model train with very small minibatches. We used batch normalization with a moving average for our results on CIFAR-10.

F Attribute change

Additionally, we exploit the attribute information $y$ in CelebA to build a conditional model, i.e. the invertible function $f$ from image to latent variable uses the labels in $y$ to define its parameters. In order to observe the information stored in the latent variables, we choose to encode a batch of images $x$ with their original attributes $y$ and decode them using a new set of attributes $y'$, built by shuffling the original attributes inside the batch. We obtain the new images $x' = g\big(f(x; y); y'\big)$. We observe that, although the faces are changed so as to respect the new attributes, several properties remain unchanged, such as position and background.

Figure 27: Examples $x$ from the CelebA dataset.

Figure 28: From a model trained on pairs of images and attributes from the CelebA dataset, we encode a batch of images with their original attributes before decoding them with a new set of attributes.
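This encode–swap–decode procedure can be sketched as follows. The functions below are toy stand-ins, not the model's actual real NVP coupling layers: a single attribute-conditioned affine map plays the role of the conditional invertible function $f(x; y)$, with $g$ its exact inverse, purely to illustrate the attribute-shuffling mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

D, A, B = 6, 4, 5            # data dim, attribute dim, batch size
W = rng.normal(size=(A, D))  # fixed attribute-to-shift projection (toy)

def f(x, y):
    """Toy conditional encoder: attribute-dependent affine map."""
    scale = 1.0 + 0.1 * y.sum(axis=1, keepdims=True)
    shift = y @ W
    return (x - shift) / scale

def g(z, y):
    """Exact inverse of f for the same attributes y."""
    scale = 1.0 + 0.1 * y.sum(axis=1, keepdims=True)
    shift = y @ W
    return z * scale + shift

x = rng.normal(size=(B, D))                        # batch of "images"
y = rng.integers(0, 2, size=(B, A)).astype(float)  # binary attributes

# Encode with the original attributes, then decode with attributes
# shuffled inside the batch: x' = g(f(x; y); y').
z = f(x, y)
y_new = y[rng.permutation(B)]
x_new = g(z, y_new)

# Sanity check: decoding with the original attributes recovers x exactly.
assert np.allclose(g(f(x, y), y), x)
```

In the actual model, decoding with the permuted attributes $y'$ changes the attribute-dependent content while the latent code $z$ preserves the rest, which is what Figure 28 visualizes.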
We notice that the new images often share similar characteristics with those in Figure 27, including position and background.
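The lagged batch-normalization statistics of Section E (Equations 20 and 21) can be sketched as follows. This is a minimal NumPy sketch of the update rule only; the function and variable names are ours, and the gradient-stopping behaviour described in the text is only noted in a comment.

```python
import numpy as np

def update_moving_stats(mu_tilde, var_tilde, batch, rho=0.99):
    """One step of the lagged batch-norm statistics (Equations 20-21).

    mu_tilde, var_tilde: running mean and variance (moving averages).
    batch: current minibatch, shape (batch_size, features).
    rho: momentum of the moving average.
    """
    # Current batch statistics (mu_hat and sigma_hat^2 in the paper).
    mu_hat = batch.mean(axis=0)
    var_hat = batch.var(axis=0)
    # Weighted average with the running statistics.
    mu_tilde = rho * mu_tilde + (1.0 - rho) * mu_hat
    var_tilde = rho * var_tilde + (1.0 - rho) * var_hat
    # In the model, normalization uses the lagged (mu_tilde, var_tilde),
    # but gradients are propagated only through (mu_hat, var_hat).
    return mu_tilde, var_tilde
```

With $\rho$ close to 1, the normalization statistics change slowly across steps, which is what makes training with very small minibatches more stable.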