Deep Directed Generative Autoencoders


Authors: Sherjil Ozair (Indian Institute of Technology Delhi), Yoshua Bengio (Université de Montréal, CIFAR Fellow)

Abstract

For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized as $P(X = x) = P(X = x \mid H = f(x))\, P(H = f(x))$ if $P(X \mid H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x') \neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X \mid H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h = f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$, which has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such an architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.

1 Introduction

Deep learning is an aspect of machine learning that regards the question of learning multiple levels of representation, associated with different levels of abstraction (Bengio, 2009).
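The two-term objective and the straight-through trick from the abstract can be illustrated with a minimal sketch. Everything below is hypothetical: a single-layer sigmoid encoder hard-thresholded into a binary code $h = f(x)$, a factorized Bernoulli decoder $P(X \mid H)$ and prior $P(H)$, and a stand-in upstream gradient; the paper uses deep networks for both encoder and decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W):
    """Deterministic discrete encoder f(x): hard-threshold a sigmoid.
    (Hypothetical single-layer encoder; the paper uses deep networks.)"""
    a = sigmoid(x @ W)
    h = (a > 0.5).astype(float)
    return h, a

def bernoulli_log_prob(x, p, eps=1e-7):
    """Log-probability of a binary vector x under a factorized Bernoulli(p)."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

# Toy dimensions and parameters (all hypothetical):
d_x, d_h = 6, 3
W = rng.normal(size=(d_x, d_h))       # encoder weights
V = rng.normal(size=(d_h, d_x))       # decoder weights for P(X|H)
prior = np.full(d_h, 0.5)             # factorized prior P(H)

x = rng.integers(0, 2, size=d_x).astype(float)
h, a = encode(x, W)

# Training objective: log P(X=x | H=f(x)) + log P(H=f(x)).
recon_lp = bernoulli_log_prob(x, sigmoid(h @ V))  # reconstruction term
prior_lp = bernoulli_log_prob(h, prior)           # code-prior (regularizer) term
objective = recon_lp + prior_lp

# Straight-through estimator: the hard threshold has zero gradient almost
# everywhere, so back-propagation treats it as the identity -- the gradient
# w.r.t. the binary code h is passed straight to the pre-threshold sigmoid a.
grad_h = np.ones_like(h)                     # stand-in upstream gradient
grad_a = grad_h                              # straight-through: identity
grad_W = np.outer(x, grad_a * a * (1.0 - a)) # chain through the sigmoid
```

With real data, both terms would be maximized jointly by gradient ascent, the straight-through estimator supplying the otherwise-undefined encoder gradient.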
These representations are distributed (Hinton, 1989), meaning that at each level there are many variables or features, which together can take a very large number of configurations. An important conceptual challenge of deep learning is the following question: what is a good representation? The question is most challenging in the unsupervised learning setup. Whereas we understand that features of an input $x$ that are predictive of some target $y$ constitute a good representation in a supervised learning setting, the question is less obvious for unsupervised learning.

1.1 Manifold Unfolding

In this paper we explore this question by following the geometrical inspiration introduced by Bengio (2014), based on the notion of manifold unfolding, illustrated in Figure 2. It was already observed by Bengio et al. (2013a) that representations obtained by stacking denoising autoencoders or RBMs appear to yield "flatter" or "unfolded" manifolds: if $x_1$ and $x_2$ are examples from the data-generating distribution $Q(X)$, and $f$ is the encoding function and $g$ the decoding function, then points on the line $h_\alpha = \alpha f(x_1) + (1 - \alpha) f(x_2)$, $\alpha \in [0, 1]$, were experimentally found to correspond to probable input configurations, i.e., $g(h_\alpha)$ looks like training examples (and quantitatively often comes close to one). This property is not at all observed when $f$ and $g$ are the identity function: interpolating in input space typically gives rise to non-natural-looking inputs (we can immediately recognize such inputs as the simple addition of two plausible examples). This is illustrated in Figure 1. It means that the input manifold (near which the distribution concentrates) is highly twisted and curved and occupies a small volume in input space.
Instead, when mapped into the representation space of stacked autoencoders (the output of $f$), we find that the convex combination of high-probability points (i.e., training examples) is often also part of the high-probability manifold, i.e., the transformed manifold is flatter: it has become closer to a convex set.

[Figure 1: linear interpolation between a 3 and a 9 in pixel space versus in representation space at layers 1 and 2, contrasting the 3's and 9's manifolds.]
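The interpolation $h_\alpha = \alpha f(x_1) + (1 - \alpha) f(x_2)$ followed by decoding $g(h_\alpha)$ can be sketched as follows. The linear $f$ and $g$ here are hypothetical stand-ins for a trained stacked autoencoder, used only to show the mechanics; a real flattening encoder would make $g(h_\alpha)$ stay near the data manifold, unlike direct interpolation in input space.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical encoder f and decoder g standing in for a trained stacked
# autoencoder; plain linear maps here, purely for illustration.
A = rng.normal(size=(4, 2))   # "encoder" weights
B = rng.normal(size=(2, 4))   # "decoder" weights

def f(x):
    return x @ A              # map input into representation space

def g(h):
    return h @ B              # map representation back to input space

x1 = rng.normal(size=4)
x2 = rng.normal(size=4)

# Interpolate in representation space, then decode:
#   h_alpha = alpha * f(x1) + (1 - alpha) * f(x2),  x_hat = g(h_alpha)
alphas = np.linspace(0.0, 1.0, 5)
decoded = np.stack([g(alpha * f(x1) + (1 - alpha) * f(x2))
                    for alpha in alphas])

# For contrast, naive input-space interpolation (the "addition of two
# plausible examples" the paper warns about):
pixel_interp = np.stack([alpha * x1 + (1 - alpha) * x2 for alpha in alphas])
```

At the endpoints the code-space path reduces to $g(f(x_2))$ and $g(f(x_1))$; the interesting behavior of a trained model is at intermediate $\alpha$.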
