Deep Directed Generative Autoencoders
Sherjil Ozair, Indian Institute of Technology Delhi
Yoshua Bengio, Université de Montréal, CIFAR Fellow

Abstract

For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x \mid H = f(x))\, P(H = f(x))$ if $P(X \mid H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x') \neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X \mid H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h = f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$, which has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such an architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.

1 Introduction

Deep learning is an aspect of machine learning that regards the question of learning multiple levels of representation, associated with different levels of abstraction (Bengio, 2009).
These representations are distributed (Hinton, 1989), meaning that at each level there are many variables or features, which together can take a very large number of configurations. An important conceptual challenge of deep learning is the following question: what is a good representation? The question is most challenging in the unsupervised learning setup. Whereas we understand that features of an input $x$ that are predictive of some target $y$ constitute a good representation in a supervised learning setting, the question is less obvious for unsupervised learning.

1.1 Manifold Unfolding

In this paper we explore this question by following the geometrical inspiration introduced by Bengio (2014), based on the notion of manifold unfolding, illustrated in Figure 2. It was already observed by Bengio et al. (2013a) that representations obtained by stacking denoising autoencoders or RBMs appear to yield "flatter" or "unfolded" manifolds: if $x_1$ and $x_2$ are examples from the data-generating distribution $Q(X)$, and $f$ is the encoding function and $g$ the decoding function, then points on the line $h_\alpha = \alpha f(x_1) + (1 - \alpha) f(x_2)$ ($\alpha \in [0, 1]$) were experimentally found to correspond to probable input configurations, i.e., $g(h_\alpha)$ looks like training examples (and quantitatively often comes close to one). This property is not at all observed when $f$ and $g$ are the identity function: interpolating in input space typically gives rise to non-natural-looking inputs (we can immediately recognize such inputs as the simple addition of two plausible examples). This is illustrated in Figure 1. It means that the input manifold (near which the distribution concentrates) is highly twisted and curved and occupies a small volume in input space.
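The interpolation experiment described above can be sketched as follows. The encoder $f$ and decoder $g$ below are untrained placeholder networks with arbitrary shapes (all weights and sizes are our own illustrative choices, not the paper's), so this shows only the mechanics of interpolating at the representation level, not the visual result one gets with trained stacked autoencoders:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((784, 100)) * 0.01  # encoder weights (hypothetical)
V = rng.standard_normal((100, 784)) * 0.01  # decoder weights (hypothetical)

def f(x):
    """Encoder: map an input image to its representation h."""
    return np.tanh(x @ W)

def g(h):
    """Decoder: map a representation back to pixel space, in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(h @ V)))

x1, x2 = rng.random(784), rng.random(784)  # two "examples" (random stand-ins)

# Interpolate on the line h_alpha = alpha * f(x1) + (1 - alpha) * f(x2)
# and decode each point; for a trained stacked autoencoder, g(h_alpha)
# tends to look like a plausible training example.
decoded = [g(a * f(x1) + (1 - a) * f(x2)) for a in np.linspace(0.0, 1.0, 5)]
```

By contrast, interpolating directly in pixel space (replacing $f$ and $g$ by the identity) produces the superposition-of-two-digits artifacts the text describes.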
Instead, when mapped into the representation space of stacked autoencoders (the output of $f$), we find that the convex combination of high-probability points (i.e., training examples) is often also part of the high-probability manifold, i.e., the transformed manifold is flatter; it has become closer to a convex set.

[Figure 2: Linear interpolation in pixel space vs. at layers 1 and 2 of the representation; the 3's and 9's manifolds, twisted in pixel space, become flatter in representation space.]

The encoder output is discretized via $h_i = f_i(x) = \mathbf{1}_{a_i(x) > 0}$, where $a_i$ is the activation of the $i$-th output unit of the encoder before the discretizing non-linearity is applied. The actual encoder output is discretized, but we are interested in obtaining a "pseudo-gradient" for $a_i$, which we will back-propagate inside the encoder to update the encoder parameters. What we did in the experiments is to compute the derivative of the reconstruction loss and prior loss

$L = -\log P(x \mid h = f(x)) - \log P(h = f(x))$,   (6)

with respect to $f(x)$, as if $f(x)$ had been continuous-valued. We then used the Straight-Through Pseudo-Gradient: the update direction (or pseudo-gradient) for $a$ is simply set equal to the gradient with respect to $f(x)$: $\Delta a = \frac{\partial L}{\partial f(x)}$. The idea for this technique was proposed by Hinton (2012) and was used very successfully in Bengio et al. (2013b). It clearly has the right sign (per value of $a_i(x)$, but not necessarily overall) but does not take into account the magnitude of $a_i(x)$ explicitly.

Let us see how the prior negative log-likelihood bound can be written as a function of $f(x)$ in which we can pretend that $f(x)$ is continuous.
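A minimal numpy sketch of the straight-through pseudo-gradient, for a single linear encoder layer with illustrative sizes (the paper's encoder is a deep network): the binarization $h_i = \mathbf{1}_{a_i > 0}$ has zero gradient almost everywhere, so the gradient of the loss with respect to $f(x)$ is passed through to $a$ unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((20, 10)) * 0.1  # encoder weights (illustrative sizes)

def encode(x):
    a = x @ W                        # pre-binarization activations a_i(x)
    h = (a > 0).astype(float)        # f_i(x) = 1 if a_i(x) > 0 else 0
    return a, h

def straight_through_step(x, dL_df, lr=0.1):
    """One encoder update. dL_df is the gradient of the loss of Eq. 6
    w.r.t. f(x), computed as if f(x) were continuous. The straight-through
    trick sets the pseudo-gradient on a equal to dL_df, then back-propagates
    it through the (here linear) encoder as usual."""
    delta_a = dL_df                  # straight-through: ignore the binarization
    dL_dW = np.outer(x, delta_a)     # chain rule for a = x @ W
    return W - lr * dL_dW

x = rng.random(20)
dL_df = rng.standard_normal(10)      # stand-in for the real loss gradient
W_updated = straight_through_step(x, dL_df)
```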
For example, if $P(H)$ is a factorized Binomial, we write this negative log-likelihood as the usual cross-entropy:

$-\sum_i f_i(x) \log P(h_i = 1) + (1 - f_i(x)) \log(1 - P(h_i = 1))$

so that if $P(x \mid h)$ is also a factorized Binomial, the overall loss is

$L = -\sum_i f_i(x) \log P(h_i = 1) + (1 - f_i(x)) \log(1 - P(h_i = 1)) - \sum_j x_j \log P(x_j = 1 \mid h = f(x)) + (1 - x_j) \log(1 - P(x_j = 1 \mid h = f(x)))$   (7)

and we can compute $\frac{\partial L}{\partial f_i(x)}$ as if $f_i(x)$ had been a continuous-valued variable. All this is summarized in Algorithm 1.

Algorithm 1: Training procedure for the Directed Generative Autoencoder (DGA).

• Sample $x$ from the training set.
• Encode it via the feedforward network $a(x)$ and $h_i = f_i(x) = \mathbf{1}_{a_i(x) > 0}$.
• Update $P(h)$ with respect to training example $h$.
• Decode via the decoder network estimating $P(x \mid h)$.
• Compute (and average) the loss $L$, as per Eq. 6, e.g., in the case of factorized Binomials as per Eq. 7.
• Update the decoder in the direction of the gradient of $-\log P(x \mid h)$ w.r.t. the decoder parameters.
• Compute the gradient $\frac{\partial L}{\partial f(x)}$ as if $f(x)$ had been continuous.
• Compute the pseudo-gradient w.r.t. $a$ as $\Delta a = \frac{\partial L}{\partial f(x)}$.
• Back-propagate the above pseudo-gradients (as if they were true gradients of the loss on $a(x)$) inside the encoder and update the encoder parameters accordingly.

3 Greedy Annealed Pre-Training for a Deep DGA

For the encoder to twist the data into a form that fits the prior $P(H)$, we expect that a very strongly non-linear transformation will be required. As deep autoencoders are notoriously difficult to train (Martens, 2010), adding the extra constraint of making the output of the encoder fit $P(H)$, e.g., factorial, was found experimentally (and without surprise) to be difficult.
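The factorized-Binomial loss of Eq. 7 can be written as a short numpy function. This is a sketch with made-up toy values; in the paper, $h$, the prior probabilities, and the decoder outputs all come from the trained networks:

```python
import numpy as np

def binomial_nll(target, p, eps=1e-7):
    """Cross-entropy -sum_i [t_i log p_i + (1 - t_i) log(1 - p_i)]."""
    p = np.clip(p, eps, 1.0 - eps)   # guard against log(0)
    return -float(np.sum(target * np.log(p) + (1 - target) * np.log(1 - p)))

def dga_loss(x, h, prior_p, recon_p):
    """Eq. 7: L = -log P(h = f(x)) - log P(x | h = f(x)),
    both factors being factorized Binomials."""
    return binomial_nll(h, prior_p) + binomial_nll(x, recon_p)

# Toy values (illustrative only)
x = np.array([1.0, 0.0, 1.0, 1.0])          # input
h = np.array([1.0, 0.0])                    # code h = f(x)
prior_p = np.array([0.5, 0.5])              # P(h_i = 1) under the prior
recon_p = np.array([0.9, 0.1, 0.9, 0.8])    # P(x_j = 1 | h) from the decoder
loss = dga_loss(x, h, prior_p, recon_p)
```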
What we propose here is to use an annealing (continuation) method and a greedy pre-training strategy, similar to that previously proposed to train Deep Belief Networks (from a stack of RBMs) (Hinton et al., 2006) or deep autoencoders (from a stack of shallow autoencoders) (Bengio et al., 2007; Hinton and Salakhutdinov, 2006).

3.1 Annealed Training

Since the loss function of Eq. 1 is hard to optimize directly, we consider a generalization of it, adding a trade-off parameter between the two terms corresponding to the reconstruction and prior costs. A zero weight for the prior cost makes the loss function the same as that of a standard autoencoder, which is a considerably easier optimization problem.

$L = -\beta \sum_i f_i(x) \log P(h_i = 1) + (1 - f_i(x)) \log(1 - P(h_i = 1)) - \sum_j x_j \log P(x_j = 1 \mid h = f(x)) + (1 - x_j) \log(1 - P(x_j = 1 \mid h = f(x)))$.   (8)

3.1.1 Gradient Descent on the Annealed Loss Function

Training DGAs with fixed trade-off parameters is sometimes difficult because it is much easier for gradient descent to perfectly optimize the prior cost by making $f$ map all $x$ to a constant $h$. This may be a local optimum or a saddle point, from which escaping by gradient descent is difficult. Thus, we use the trade-off parameter to make the model first learn perfect reconstruction, by setting a zero weight for the prior cost. $\beta$ is then gradually increased to 1. The gradual increasing schedule for $\beta$ is also important, as any rapid growth in $\beta$'s value causes the system to 'forget' the reconstruction and prioritize only the prior cost. A slow schedule thus ensures that the model learns to reconstruct as well as to fit the prior.

3.1.2 Annealing in a Deep DGA

Above, we described the usefulness of annealed training of a shallow DGA. A similar trick is also useful when pretraining a deep DGA. We use different values of the trade-off parameter to control the degree of difficulty of each pretraining stage.
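The annealing of Section 3.1.1 amounts to a schedule for $\beta$; a minimal sketch follows. The warm-up and ramp lengths are hypothetical placeholders, not values from the paper:

```python
def beta_schedule(epoch, warmup=50, ramp=200):
    """Weight of the prior cost in Eq. 8: zero at first (pure autoencoder
    training), then a slow linear ramp up to 1."""
    if epoch < warmup:
        return 0.0
    return min(1.0, (epoch - warmup) / float(ramp))

def annealed_loss(prior_nll, recon_nll, epoch):
    """Eq. 8 with the annealed trade-off parameter."""
    return beta_schedule(epoch) * prior_nll + recon_nll
```

With $\beta = 0$ this is exactly the standard autoencoder loss; a step increase of $\beta$ instead of a slow ramp tends to make the model "forget" reconstruction, as described above.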
Initial stages have a high weight for reconstruction and a low weight for prior fitting, while the final stage has $\beta$ set to unity, giving back the original loss function. Note that the lower-level DGAs can sacrifice on prior fitting, but must make sure that the reconstruction is near-perfect, so that no information is lost. Otherwise, the upper-level DGAs can never recover from that loss, which will show up in the high entropy of $P(x \mid h)$ (both for the low-level decoder and for the global decoder).

4 Relation to the Variational Autoencoder (VAE) and Reweighted Wake-Sleep (RWS)

The DGA can be seen as a special case of the Variational Autoencoder (VAE), with various versions introduced by Kingma and Welling (2014); Gregor et al. (2014); Mnih and Gregor (2014); Rezende et al. (2014), and of the Reweighted Wake-Sleep (RWS) algorithm (Bornschein and Bengio, 2014). The main difference between the DGA and these models is that with the latter the encoder is stochastic, i.e., it outputs a sample from an encoding (or approximate inference) distribution $Q(h \mid x)$ instead of $h = f(x)$. This basically gives rise to a training criterion that is not the log-likelihood but a variational lower bound on it,

$\log p(x) \geq E_{Q(h \mid x)}[\log P(h) + \log P(x \mid h) - \log Q(h \mid x)]$.   (9)

Indeed, when $Q(h \mid x)$ is deterministic and discrete, putting probability 1 on $h = f(x)$, the term $\log Q(h \mid x)$ vanishes and the bound becomes $\log P(f(x)) + \log P(x \mid f(x))$, i.e., exactly the negated DGA loss of Eq. 6. Besides the fact that $h$ is now sampled, we observe that the training criterion has exactly the same first two terms as the DGA log-likelihood, but it also has an extra term that attempts to maximize the conditional entropy of the encoder output, i.e., encouraging the encoder to introduce noise, to the extent that it does not hurt the two other terms too much.
It will hurt them, but it will also help the marginal distribution $Q(H)$ (averaged over the data distribution $Q(X)$) to be closer to the prior $P(H)$, thus encouraging the decoder to contract the "noisy" samples that could equally arise from the injected noise in $Q(h \mid x)$ or from the broad (generally factorized) $P(H)$ distribution.

5 Experiments

In this section, we provide empirical evidence for the feasibility of the proposed model, and analyze the influence of various techniques on its performance. We used the binarized MNIST handwritten digits dataset, in the same binarized version as Murray and Larochelle (2014), with the same training-validation-test split. We trained shallow DGAs with 1, 2 and 3 hidden layers, and deep DGAs composed of 2 and 3 shallow DGAs. The dimension of $H$ was chosen to be 500, as this is considered sufficient for coding binarized MNIST. $P(H)$ is modeled as a factorized Binomial distribution. Parameters of the model were learnt using minibatch gradient descent, with a minibatch size of 100. Learning rates were chosen from 10.0, 1.0, 0.1, and halved whenever the average cost over an epoch increased. We did not use momentum or L1/L2 regularizers. We used tanh activation functions in the hidden layers and sigmoid outputs.

Figure 3: (a) Samples generated from a 5-layer deep DGA, composed of 3 shallow DGAs of 3 layers (1 hidden layer) each. The model was trained greedily by training the first shallow DGA on the raw data and training each subsequent shallow DGA on the output code of the previous DGA. (b) Samples generated from a 3-layer shallow DGA.

While training, a small amount of salt-and-pepper noise is added to the decoder input to make it robust to the inevitable mismatch between the encoder output $f(X)$ and samples from the prior, $h \sim P(H)$. Each bit of the decoder input is selected with 1% probability and changed to 0 or 1 at random.
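The decoder-input corruption just described can be sketched as follows (the function name and random-seed handling are our own; the 1% selection probability is the one used in the experiments):

```python
import numpy as np

def salt_and_pepper(h, p=0.01, seed=None):
    """Select each bit with probability p and replace it with a random
    value in {0, 1} (so a selected bit keeps its value half the time)."""
    rng = np.random.default_rng(seed)
    selected = rng.random(h.shape) < p
    random_bits = (rng.random(h.shape) < 0.5).astype(h.dtype)
    return np.where(selected, random_bits, h)

# A 500-dimensional binary code, as in the experiments above
h = (np.random.default_rng(0).random(500) < 0.5).astype(float)
h_noisy = salt_and_pepper(h, p=0.01, seed=1)
```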
5.1 Log-likelihood Estimator

When the autoencoder is not perfect, i.e., the encoder and decoder are not perfect inverses of each other, the log-likelihood estimates using Eq. 7 are biased estimates of the true log-likelihood. We treat these estimates as an unnormalized probability distribution, whose partition function would be one when the autoencoder is perfectly trained. In practice, we found that the partition function is less than one. Thus, to compute the log-likelihood for comparison, we estimate the partition function of the model, which allows us to compute normalized log-likelihood estimates. If $P^*(X)$ is the unnormalized probability distribution, and $\pi(X)$ is another tractable distribution on $X$ from which we can sample, the partition function can be estimated by importance sampling:

$Z = \sum_x P^*(x) = \sum_x \pi(x) \frac{P^*(x)}{\pi(x)} = E_{x \sim \pi(x)}\left[\frac{P^*(x)}{\pi(x)}\right]$.

As the proposal distribution $\pi(x)$, we took $N$ expected values $\mu_j = E[X \mid H_j]$ under the decoder distribution, for $H_j \sim P(H)$, and used them as the centroids of a mixture model,

$\pi(x) = \frac{1}{N} \sum_{j=1}^{N} \mathrm{FactorizedBinomial}(x; \mu_j)$.

Therefore, $\log P(X) = \log P^*(X) - \log Z$ gives us the estimated normalized log-likelihood.

5.2 Performance of Shallow vs. Deep DGA

The 1-hidden-layer shallow DGA gave a log-likelihood estimate of -118.12 on the test set. The 5-layer deep DGA, composed of two 3-layer shallow DGAs trained greedily, gave a test-set log-likelihood estimate of -114.29. We observed that the shallow DGA had better reconstructions than the deep DGA. The deep DGA sacrificed some reconstructibility, but was more successful in fitting the factorized Binomial prior. Qualitatively, we can observe in Figure 3 that samples from the deep DGA are much better than those from the shallow DGA.
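The partition-function estimate of Section 5.1 can be sketched in numpy, working in the log domain for numerical stability. Here `log_p_star` stands in for the model's unnormalized log-probability and the rows of `mu` for the decoder means $\mu_j$; both would come from the trained model:

```python
import numpy as np

def log_factorized_binomial(x, p, eps=1e-7):
    """log FactorizedBinomial(x; p) for binary rows x and mean vector p."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=-1)

def estimate_log_Z(log_p_star, mu, x_samples):
    """log Z = log E_{x~pi}[P*(x)/pi(x)], with pi a uniform mixture of
    factorized Binomials centred at the rows of mu, and x_samples drawn
    from pi. Uses log-sum-exp to avoid underflow."""
    comp = np.stack([log_factorized_binomial(x_samples, m) for m in mu])
    cmax = comp.max(axis=0)
    log_pi = cmax + np.log(np.exp(comp - cmax).sum(axis=0)) - np.log(len(mu))
    log_w = log_p_star(x_samples) - log_pi       # log importance weights
    wmax = log_w.max()
    return float(wmax + np.log(np.exp(log_w - wmax).mean()))
```

As a sanity check, if $P^*$ happens to equal $\pi$ itself, the estimator returns $\log Z = 0$ exactly, i.e., $Z = 1$, which is the value a perfectly trained autoencoder would give.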
The samples from the shallow DGA can be described as a mixture of incoherent MNIST features: although all the decoder units have the necessary MNIST features, the output of the encoder does not match the factorized Binomial $P(H)$ well, and so the decoder does not correctly map these unusual inputs $H$ from the Binomial prior to data-like samples. The samples from the deep DGA are of much better quality. However, we can see that some of the samples are non-digits. This is due to the fact that the autoencoder had to sacrifice some reconstructibility to fit $H$ to the prior. The reconstructions also include a small fraction of samples that are non-digits.

5.3 Entropy, Sparsity and Factorizability

We also compared the entropy of the encoder output and the raw data, under a factorized Binomial model. The entropy, reported in Table 1, is measured with logarithm base 2, so it counts the number of bits necessary to encode the data under the simple factorized Binomial distribution. A lower entropy under the factorized distribution means that fewer independent units are necessary to encode each sample. It means that the probability mass has been moved from a highly complex manifold, which is hard to capture under a factorized model and thus requires many dimensions to characterize, to a manifold that is aligned with a smaller set of dimensions, as in the cartoon of Figure 2. Practically, this happens when many of the hidden units take on a nearly constant value, i.e., the representation becomes extremely "sparse" (there is no explicit preference for 0 or 1 for $h_i$, but one could easily flip the sign of the weights of some $h_i$ in the output layer to make sure that 0 is the frequent value and 1 the rare one). Table 1 also contains a measure of sparsity of the representations based on such a bit flip (so as to make 0 the most frequent value).
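The three quantities reported in Table 1 can be computed as follows; a numpy sketch over a batch of binary codes (one row per example), with the bit flip being the one just described:

```python
import numpy as np

def entropy_bits(h):
    """Entropy (base 2) of a factorized Binomial fit to binary samples h:
    the number of bits needed under the factorized model."""
    p = np.clip(h.mean(axis=0), 1e-7, 1.0 - 1e-7)   # per-unit P(h_i = 1)
    return float(-np.sum(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def avg_active_bits(h):
    """Flip units whose frequent value is 1 so that 0 is the common value,
    then count the average number of 1's (rare bits) per example."""
    flipped = np.where(h.mean(axis=0) > 0.5, 1 - h, h)
    return float(flipped.sum(axis=1).mean())

def offdiag_corr_norm(h):
    """||Corr - diag(Corr)||_F over the unit dimensions."""
    c = np.nan_to_num(np.corrcoef(h, rowvar=False))  # constant units -> NaN
    return float(np.linalg.norm(c - np.diag(np.diag(c))))
```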
This flipping allows us to count the average number of 1's (rare bits) necessary to represent each example, on average (third column of the table). We can see from Table 1 that one autoencoder alone is not sufficient to reduce the entropy down to the lowest possible value. Adding the second autoencoder reduces the entropy by a significant amount. Since the prior distribution is factorized, the encoder has to map data samples with highly correlated dimensions to a code with independent dimensions. To measure this, we computed the Frobenius norm of the off-diagonal entries of the correlation matrix of the data represented in its raw form (Data) or at the outputs of the different encoders; see the last column of Table 1. We see that each autoencoder removes correlations from the data representation, making it easier for a factorized distribution to model.

Samples                                Entropy   Avg # active bits   $\|Corr - \mathrm{diag}(Corr)\|_F$
Data ($X$)                             297.6     102.1               63.5
Output of 1st encoder ($f_1(X)$)       56.9      20.1                11.2
Output of 2nd encoder ($f_2(f_1(X))$)  47.6      17.4                9.4

Table 1: Entropy, average number of active bits (the number of rarely active bits, or 1's if 0 is the most frequent bit value, i.e., a measure of non-sparsity), and inter-dimension correlations all decrease as we encode into higher levels of representation, making the representation easier to model.

6 Conclusion

We have introduced a novel probabilistic interpretation of autoencoders as generative models, for which the training criterion is similar to that of regularized (e.g., sparse) autoencoders and the sampling procedure is very simple (ancestral sampling, no MCMC). We showed that this training criterion is a lower bound on the likelihood and that the bound becomes tight as the capacity of the decoder (as an estimator of the conditional probability $P(x \mid h)$) increases.
Furthermore, if the encoder keeps all the information about the input $x$, then the optimal $P(x \mid h = f(x))$ is unimodal, i.e., a simple neural network with a factorial output suffices. Our experiments showed that minimizing the proposed criterion yielded good generated samples, and that even better samples could be obtained by pre-training a stack of such autoencoders, so long as the lower ones are constrained to have very low reconstruction error. We also found that a continuation method in which the weight of the prior term is only gradually increased yielded better results. These experiments are most interesting because they reveal a picture of the representation learning process that is in line with the idea of manifold unfolding and transformation from a complex, twisted region of high probability into one that is more regular (factorized) and occupies a much smaller volume (a small number of active dimensions). They also help in understanding the respective roles of the reconstruction error and the representation prior in the training criterion and training process of such regularized autoencoders.

Acknowledgments

The authors would like to thank Laurent Dinh, Guillaume Alain and Kush Bhatia for fruitful discussions, and also the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012). We acknowledge the support of the following agencies for research funding and computing support: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR.

References

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Bengio, Y. (2009). Learning Deep Architectures for AI. Now Publishers.

Bengio, Y. (2013). Estimating or propagating gradients through stochastic neurons. Technical Report arXiv:1305.2982, Université de Montréal.

Bengio, Y. (2014). How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS'2006.

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013a). Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning (ICML'13). ACM.

Bengio, Y., Léonard, N., and Courville, A. (2013b). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

Bornschein, J. and Bengio, Y. (2014). Reweighted wake-sleep. Technical report, arXiv preprint.

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In ICML'2014.

Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.

Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.

Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504–507.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

Martens, J. (2010). Deep learning via Hessian-free optimization. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML-10), pages 735–742. ACM.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In ICML'2014.

Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In ICML'2014.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. Technical report.