Is Joint Training Better for Deep Auto-Encoders?


Authors: Yingbo Zhou, Devansh Arpit, Ifeoma Nwogu

Yingbo Zhou, Devansh Arpit, Ifeoma Nwogu, Venu Govindaraju

Abstract—Traditionally, when generative models of data are developed via deep architectures, greedy layer-wise pre-training is employed. In a well-trained model, the lower layer of the architecture models the data distribution conditional upon the hidden variables, while the higher layers model the hidden distribution prior. But due to the greedy scheme of the layer-wise training technique, the parameters of the lower layers are fixed when training higher layers. This makes it extremely challenging for the model to learn the hidden distribution prior, which in turn leads to a suboptimal model for the data distribution. We therefore investigate joint training of deep autoencoders, where the architecture is viewed as one stack of two or more single-layer autoencoders. A single global reconstruction objective is jointly optimized, such that the objective for the single autoencoders at each layer acts as a local, layer-level regularizer. We empirically evaluate the performance of this joint training scheme and observe that it not only learns a better data model, but also learns better higher-layer representations, which highlights its potential for unsupervised feature learning. In addition, we find that the usage of regularization in the joint training scheme is crucial for achieving good performance. In the supervised setting, joint training also shows superior performance when training deeper models. The joint training framework can thus provide a platform for investigating more efficient usage of different types of regularizers, especially in light of the growing volumes of available unlabeled data.
I. INTRODUCTION

Deep learning algorithms have been the object of much attention in the machine learning and applied statistics literature in recent years, due to the state-of-the-art results achieved in various applications such as object recognition [1], [2], speech recognition [3], [4], [5], face recognition [6], etc. With the exception of convolutional neural networks (CNN), the training of multi-layered networks was in general unsuccessful until 2006, when breakthroughs were made by three seminal works in the field: Hinton et al. [7], Bengio et al. [8] and Ranzato et al. [9]. They introduced the notion of greedy layer-wise pre-training to initialize the weights of an entire network in an unsupervised manner, followed by a supervised back-propagation step. The inclusion of the unsupervised pre-training step appeared to be the missing ingredient that then led to significant improvements over conventional training schemes. Recent progress in deep learning has leaned more towards supervised learning algorithms [1], [10], [11]. While these algorithms obviate the need for the additional pre-training step in supervised settings, large amounts of labeled data are still critical. Given the ever-growing volumes of unlabeled data and the cost of labeling, it remains a challenge to develop better unsupervised learning techniques that exploit the copious amounts of unlabeled data. The unsupervised training of a network can be interpreted as learning the data distribution p(x) in a probabilistic generative model of the data. A typical method for accomplishing this is to decompose the generative model into a latent conditional generative model and a prior distribution over the hidden variables. When this interpretation is extended to a deep neural network [7], [12], [13], it implies that while the lower layers model the conditional distribution p(x|h), the higher layers model the distribution over the latent variables.
It has been shown that local-level learning accomplished via pre-training is important when training deep architectures (initializing every layer of an unsupervised deep Boltzmann machine (DBM) improves performance [14]; similarly, initializing every layer of a supervised multi-layer network also improves performance [8]). But this greedy layer-wise scheme has a major disadvantage: the higher layers have significantly less knowledge about the original data distribution than the bottom layers. Hence, if the bottom layer does not capture the data representation sufficiently well, the higher layers may learn something that is not useful. Furthermore, this error/bias will propagate through the layers, i.e. the higher the layer, the more error it incurs. To summarize, greedy layer-wise training focuses on the local constraints introduced by the learning algorithm (such as in auto-encoders), but loses sight of the original data distribution when training higher layers. To compensate for this disadvantage, all the levels of the deep architecture should be trained simultaneously. But this type of joint training can be very challenging and, if done naively, will fail to learn [14]. Thus in this paper, we present an effective method for jointly training a multi-layer auto-encoder end-to-end in an unsupervised fashion, and analyze its performance against greedy layer-wise training in various settings. This unsupervised joint training method consists of a global reconstruction objective, thereby learning a good data representation without losing sight of the original input data distribution. In addition, it can easily accommodate local regularization on the parameters of each hidden layer, in a similar way as in layer-wise training, and therefore allows us to use the more powerful regularizers proposed recently. This method can also be viewed as a generalization of single- to multi-layer auto-encoders.
Because our approach achieves a global reconstruction in a deep network, the feature representations learned are better. This attribute also makes it a good feature extractor, and we confirm this through extensive analysis. The representations learned from the joint training approach consistently outperform those obtained from greedy layer-wise pre-training algorithms in unsupervised settings. In the supervised setting, this joint training scheme also demonstrates superior performance for deeper models.

A. Motivation

Because supervised methods typically require large amounts of labeled data, and acquiring these labels can be an expensive and time-consuming task, it remains a challenge to continue to develop improved unsupervised learning techniques that can exploit large volumes of unlabeled data. Although the greedy layer-wise pre-training procedure has to date been very successful, from an engineering perspective it can be challenging to train, and monitoring the training process can be difficult. For the layer-wise method, apart from the bottom layer where the unsupervised model learns directly on the input, the training cost is measured with respect to the layer below. Hence, any change in error values from one layer to the next has little meaning to the user. But by having one global objective, the joint training technique has a training cost that is consistently measured with respect to the input layer. This way, one can readily monitor the changes in training errors, even with respect to post-training tasks such as classification or prediction. Also, as stated earlier, the unsupervised training of a network can be interpreted as learning the data distribution p(x) in a probabilistic generative model of the data. But in order for the unsupervised method to learn p(x), a common strategy is to decompose p(x) into p(x|h) and p(h) [13], [12].
There are now two models to optimize: the conditional generating model p(x|h) and the prior model p(h). Since p(h) covers a wide range of distributions and is hard to optimize in general, one tends to emphasize the optimization of p(x|h) and assume that the later learning of p(h) can compensate for the loss incurred due to imperfect modeling. Note that the prior p(h) can also be decomposed in exactly the same way as p(x), resulting in additional hidden layers. Thus, one can recursively apply this trick and delay the learning of the prior. The motivation behind this recursion is the expectation that, as learning progresses through the layers, the prior gets simpler, which makes the learning easier. Greedy layer-wise training employs this idea, but it may fail to learn the optimum data distribution for the following reason: in layer-wise training, the parameters of the bottom layers are fixed after training, and the prior model can only observe x through these fixed hidden representations. Hence, if the learning of p(x|h) does not preserve all information regarding x (which is very likely), then it is not possible to learn a prior p(h) that leads to the optimum model of the data distribution. For these reasons, we explore the possibility of jointly training deep autoencoders, where the joint training scheme makes the adjustment of p(h) and p(x|h) (and consequently p(h|x)) possible with respect to each other, thus alleviating the burden on both models. As a consequence, the prior model p(h) can now observe the input x, making it possible to fit the prior distribution better. Hence, joint training makes global optimization more feasible.

II. BACKGROUND

In this section, we briefly review the concept of autoencoders and some of their variants, and expand on the notion of the deep autoencoder, the primary focus of this paper.
A. Basic Autoencoders (AE)

A basic autoencoder is a one-hidden-layer neural network [15], [16], whose objective is to reconstruct the input using its hidden activations so that the reconstruction error is as small as possible. It passes the input through an encoding function to obtain an encoding, and then decodes the encoding through a decoding function to recover (an approximation of) the original input. More formally, let x ∈ R^d be the input, and

h = f_e(x) = s_e(W_e x + b_e)
x_r = f_d(h) = s_d(W_d h + b_d)

where f_e: R^d → R^h and f_d: R^h → R^d are the encoding and decoding functions respectively, W_e and W_d are the weights of the encoding and decoding layers, and b_e and b_d are the biases of the two layers. s_e and s_d are in general elementwise nonlinear functions, and common choices are sigmoidal functions like tanh or logistic. For training, we want to find a set of parameters Θ = {W_e, W_d, b_e, b_d} that minimizes the reconstruction error J_AE(Θ):

J_AE(Θ) = Σ_{x∈D} L(x, x_r)    (1)

where L: R^d × R^d → R is a loss function that measures the error between the reconstructed input x_r and the actual input x, and D denotes the training dataset. We can also tie the encoding and decoding weights by setting W_d = W_e^T. Common choices of L include the sum of squared errors for real-valued inputs and cross-entropy for binary-valued inputs. However, this model has a significant drawback: if the number of hidden units h is greater than the dimensionality of the input data d, the model can perform very well during training but fail at test time, because it can trivially copy the input layer to the hidden layer and then copy it back. As a work-around, one can set h < d to force the model to learn something meaningful, but the resulting representations are still not very effective.

B. Denoising Autoencoder (DAE)

Vincent et al.
[17] proposed a more effective way to overcome the shortcomings of the basic autoencoder, namely the denoising autoencoder. The idea is to corrupt the input before passing it to the network, but still require the model to reconstruct the uncorrupted input. In this way, the model is forced to learn useful representations, since trivially copying the input will not optimize the denoising objective (Equation 2). Formally, let x_c ∼ q(x_c|x) be the corrupted version of the input, where q(·|x) is some corruption process over the input x; the objective it tries to optimize is then:

J_DAE(Θ) = Σ_{x∈D} E_{x_c ∼ q(x_c|x)} [L(x, f_d ∘ f_e(x_c))]    (2)

Fig. 1. Illustration of the deep autoencoder considered in this paper. Left: an illustration of how greedy layer-wise training is employed when training a two-layered deep autoencoder: the bottom layer is trained first, followed by a second layer that uses the previous layer's encoding as input. The loss in layer-wise training is always measured with respect to the layer's input, so the intermediate layers try to reconstruct the previous representation (i.e. h_1), not the input data x. The final deep autoencoder can be viewed as a stack of encoding layers followed by decoding layers. Right: the joint training scheme, where one starts with a deep autoencoder and optimizes it jointly to minimize the loss between the input and its final reconstruction. Therefore, all parameters are tuned towards better representing the input data. In addition, regularization is enforced in each local layer to help learning. Note that for simplicity all biases are excluded from the illustration.

C. Contractive Autoencoders (CAE)

More recently, Rifai et al. [18] proposed adding a special penalty term to the original autoencoder objective to achieve robustness to small local perturbations. The penalty term was
the Frobenius norm of the Jacobian matrix of the hidden activations with respect to the input, and their modified objective is:

J_CAE(Θ) = Σ_{x∈D} [ L(x, x_r) + λ ||∇_x f_e(x)||²_F ]    (3)

This penalty term measures the change in the hidden activations with respect to the input. Thus, the penalty term alone would prefer hidden activations that stay constant as the input varies (i.e. the Frobenius norm would be zero). The loss term will be small only if the reconstruction is close to the original, which is not possible if the hidden activations do not change according to the input. Therefore, a hidden representation that captures the data manifold is preferred, since otherwise we would incur high costs in both terms.

D. Deep Autoencoders

A deep autoencoder is a multi-layer neural network that tries to reconstruct its input (see Figure 1). In general, an N-layer deep autoencoder with parameters Θ = {Θ_i | i ∈ {1, 2, ..., N}}, where Θ_i = {W_e^i, W_d^i, b_e^i, b_d^i}, can be formulated as follows:

h^i = f_e^i(h^{i−1}) = s_e^i(W_e^i h^{i−1} + b_e^i)    (4)
h_r^i = f_d^i(h_r^{i+1}) = s_d^i(W_d^i h_r^{i+1} + b_d^i)    (5)
h^0 = x    (6)

The deep autoencoder architecture therefore contains multiple encoding and decoding stages, made up of a sequence of encoding layers followed by a stack of decoding layers. Notice that the deep autoencoder therefore has a total of 2N layers. This type of deep autoencoder has been investigated in several previous works (see [19], [20], [21] for examples). A deep autoencoder can also be viewed as an unwrapped stack of multiple autoencoders, with the higher-layer autoencoder taking the lower-layer encoding as its input. Hence, when viewed as a stack of autoencoders, one can train the stack from bottom to top.
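As a concrete illustration of Equations (4)–(6), the following minimal numpy sketch (the layer sizes, initialization and tied-weight choice here are our own illustrative assumptions, not the paper's settings) builds a 2-layer deep autoencoder with logistic units and computes a reconstruction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, h1, h2 = 8, 6, 4  # illustrative layer sizes

# Encoding weights W_e^i and biases; decoding weights tied: W_d^i = (W_e^i)^T
We = [rng.normal(0, 0.1, (h1, d)), rng.normal(0, 0.1, (h2, h1))]
be = [np.zeros(h1), np.zeros(h2)]
bd = [np.zeros(d), np.zeros(h1)]

def encode(x):
    """h^i = s_e(W_e^i h^{i-1} + b_e^i), with h^0 = x (Eqs. 4 and 6)."""
    hs = [x]
    for W, b in zip(We, be):
        hs.append(sigmoid(W @ hs[-1] + b))
    return hs

def decode(hN):
    """h_r^i = s_d(W_d^i h_r^{i+1} + b_d^i), applied top-down (Eq. 5)."""
    h = hN
    for W, b in zip(reversed(We), reversed(bd)):
        h = sigmoid(W.T @ h + b)
    return h

x = rng.random(d)
hs = encode(x)                 # [h^0, h^1, h^2]
x_r = decode(hs[-1])           # reconstruction in the input space
recon_err = np.sum((x - x_r) ** 2)
```

Note the 2N layers in total: two encoding passes followed by two decoding passes, with the top code h^2 as the bottleneck.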
E. General Autoencoder Objective

Observing the formulations of Equations (1)–(3), we can rewrite the single-layer autoencoder objective in a more general form:

J_GAE(Θ) = Σ_{x∈D} E_{x_c ∼ Q(x_c, x)} [L(x, f_d ∘ f_e(x_c))] + λ R(Θ)    (7)

where λ ∈ [0, +∞), Q(x_c|x) is some conditional distribution over the input with Q(x_c, x) = Q(x_c|x) P(x), and R(Θ) is an arbitrary regularization function over the parameters. The choice of a good regularization term R(·) appears to be the ingredient behind the recent successes of several autoencoder variants (the models proposed in [22], [23], for example). It is straightforward to see that we can recover the previous autoencoder objectives from Equation 7. For example, the basic autoencoder is obtained by setting Q to a Dirac delta at the data point and λ = 0. Since this objective is more general, we will refer to it going forward, rather than to the specific autoencoder objectives presented earlier.

III. OUR JOINT TRAINING METHOD

As mentioned previously, a deep autoencoder can be optimized via layer-wise training, i.e. the bottom-layer objective J_GAE(Θ_i) is optimized before J_GAE(Θ_{i+1}) (i.e. from bottom to top; see also Figure 1, left) until the desired depth is reached. If one would like to jointly train all layers, a natural question would be: why not combine these objectives into one and train all the layers jointly? Directly combining all the objectives (i.e. optimizing Σ_i J_GAE(Θ_i)) is not appropriate, since the end goal is to reconstruct the input as accurately as possible, not the intermediate representations. The aggregated loss and regularization may hinder the training process, since the objective has deviated from this goal. Furthermore, during training, the representations of the lower layers vary continuously (since the parameters are changing), making the optimization very difficult¹.
¹We have tried this scheme on several datasets, but the results were very poor.

So, focusing on the goal of reconstructing the input, we explore the following joint training objective for an N-layered deep autoencoder:

J_Joint(Λ) = Σ_{x∈D} E_{Q(x, h_c^0, ..., h_c^N)} [L(x, x_r)] + Σ_{i=1}^{N} λ_i R_i(Θ_i)    (8)

where Λ = {∪_{i=1}^N Θ_i}, Q(x, h_c^0, ..., h_c^N) = P(x) Π_{i=1}^N Q_i(h_c^i | h^i) P(h^i); and h_c^i ∼ Q_i(h_c^i | h^i) with h^0 = x, and x_r = f_d^1 ∘ f_d^2 ∘ ··· ∘ f_d^N (h^N). In other words, this method optimizes the reconstruction error directly in a multi-layer neural network, while at the same time enabling us to conveniently apply more powerful regularizers to the single autoencoder at each layer. For example, to train a two-layered denoising autoencoder with this method, we corrupt the input and feed it into the first layer, then corrupt the hidden output from the first layer and feed it into the second layer; we then reconstruct from the second-layer hidden activations, followed by a reconstruction to the original input space using the first-layer reconstruction weights; finally, we measure the loss between the reconstruction and the input and perform a gradient update of all the parameters (see Figure 1). In this formulation, we not only train all layers jointly by optimizing one global objective (the first part of Equation 8), so that all hidden activations adjust according to the original input, but we also take into account the local constraints of the single-layer autoencoders (the second part of Equation 8). Locally, if we isolate each autoencoder from the stack and look at it individually, this training process optimizes almost exactly the single-layer objective as in the layer-wise case, but without the unnecessary constraint of reconstructing the intermediate representations.
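The two-layer denoising example just described can be sketched as follows. This is a toy evaluation of the joint objective for one sample, under our own assumed sizes, Gaussian corruption, tied weights, and an L2 weight penalty as a simple stand-in for the per-layer regularizers R_i; a real implementation would back-propagate through this computation to update all parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d, h1, h2 = 8, 6, 4                      # illustrative sizes
W1, W2 = rng.normal(0, 0.1, (h1, d)), rng.normal(0, 0.1, (h2, h1))
b1e, b2e = np.zeros(h1), np.zeros(h2)
b1d, b2d = np.zeros(d), np.zeros(h1)
lam1 = lam2 = 1e-4                       # per-layer weights lambda_i
noise = 0.3                              # Gaussian corruption level

def joint_objective(x):
    # Corrupt the input and encode with layer 1: h^1 from h_c^0
    h0c = x + noise * rng.normal(size=d)
    h1a = sigmoid(W1 @ h0c + b1e)
    # Corrupt the layer-1 output and encode with layer 2: h^2 from h_c^1
    h1c = h1a + noise * rng.normal(size=h1)
    h2a = sigmoid(W2 @ h1c + b2e)
    # Decode back to the input space: x_r = f_d^1 o f_d^2 (h^2)
    h1r = sigmoid(W2.T @ h2a + b2d)      # tied weights: W_d^i = (W_e^i)^T
    xr = sigmoid(W1.T @ h1r + b1d)
    # Global reconstruction loss plus local regularizers (Eq. 8)
    loss = np.sum((x - xr) ** 2)
    return loss + lam1 * np.sum(W1 ** 2) + lam2 * np.sum(W2 ** 2)

J = joint_objective(rng.random(d))
```

The key point is that the loss term compares x_r against x itself, never against an intermediate code, while each layer still receives its own corruption and regularization.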
Globally, because the loss is measured between the reconstruction and the input, all parameters must be tuned to minimize the global loss, and thus the resulting higher-level hidden representations are more representative of the input. This approach addresses some of the drawbacks of layer-wise training: since all the parameters are tuned simultaneously toward the reconstruction error of the original input, the optimization is no longer local to each layer. In addition, measuring the reconstruction error in the input space makes the training process much easier to monitor and interpret. To sum up, this formulation provides an easier and more modular way to train deep autoencoders. We now relate this joint objective to some of the existing, well-known deep learning paradigms that implement deterministic objective functions. We also relate this work to techniques in the literature that explain deep learning from a probabilistic perspective.

A. Relationship to Weight Decay in the Multi-layer Perceptron

It is easy to see that if we replace all the regularizers R_i(Θ_i) with L1 or L2 norms, and set Q_i(z̃|z) to a Dirac delta distribution at point z in Equation (8), we recover the standard multi-layer perceptron (MLP) with L1 or L2 weight decay, with one exception: the MLP is not commonly used for unsupervised learning. So if we replace the loss term with a supervised loss, the objective is identical to that of an ordinary MLP with the corresponding weight decay.

B. Relationship to Greedy Layer-wise Pre-training

It is straightforward to see that in the single-layer case, the proposed training objective in Equation (8) is equivalent to greedy layer-wise training. It is interesting to investigate the relationship between this method and layer-wise training in the multi-layer case.
We therefore construct a scenario that makes the two methods very similar, by modifying the joint training algorithm slightly: we set a training schedule with learning rates α_t^i and regularization weights λ_t^i, where i, t ∈ Z+ indicate the corresponding layer and training iteration. The objective from Equation 8 and the gradient update are then as follows:

J_S(Λ_t) = Σ_{x∈D} E_{Q(x, h_c^0, ..., h_c^N)} [L(x, x_r)] + Σ_{i=1}^{N} λ_t^i R_i(Θ_i)    (9)
Θ_{t+1}^i ← Θ_t^i − α_t^i ∆Θ_t^i    (10)

Let α_t^i ∈ {0, ε} and Σ_i α_t^i = ε, where ε ∈ R+ is the learning rate. We set the values of α in the following way: α_t^1 = ε for t ∈ [1, T], α_t^2 = ε for t ∈ [T+1, 2T], ..., α_t^N = ε for t ∈ [(N−1)T+1, NT], where T is the number of iterations used in greedy layer-wise training. In this way, the joint training scheme is set up very similarly to the layer-wise scheme, i.e. only the parameters of a single layer are tuned at a time, from bottom to top. However, since the loss is measured in the domain of x instead of h^i, learning will still behave differently. This becomes more apparent if we write down both the joint loss (L_J) and the layer-wise loss (L_L) for training a given layer j:

L_J(x, x_r) = L(f_d^{*1} ∘ ··· ∘ f_d^{*(j−1)} ∘ f_d^j ∘ f_e^j(h^{j−1}), x)    (11)
L_L(h^{j−1}, h_r^{j−1}) = L(f_d^j ∘ f_e^j(h^{j−1}), h^{j−1})    (12)

where f_d^{*i} and f_e^{*i} denote the trained decoding and encoding functions respectively; in other words, we do not optimize the parameters of these functions during training. Note that the loss and the resulting parameter updates will in general be very different between Equations 11 and 12, since the gradient for the joint loss is modified by the previous layers' parameters, but not in the layer-wise case. It is somewhat surprising that even in this carefully constructed, very similar situation, the two methods still behave very differently.
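The learning-rate schedule above can be written out directly; this small helper (names are ours) returns which single layer receives the learning rate ε at iteration t:

```python
def alpha(i, t, N, T, eps):
    """alpha_t^i = eps if layer i is active at iteration t, else 0.
    Layer i is active for t in [(i-1)T + 1, iT], so exactly one
    alpha_t^i is nonzero at any t and sum_i alpha_t^i = eps."""
    assert 1 <= i <= N and 1 <= t <= N * T
    active = (t - 1) // T + 1  # index of the layer being tuned at iteration t
    return eps if i == active else 0.0

# Example: N = 3 layers, T = 100 iterations per layer
rates = [alpha(i, 150, N=3, T=100, eps=0.01) for i in (1, 2, 3)]
# at t = 150, only layer 2 is active: rates == [0.0, 0.01, 0.0]
```

Even under this schedule, the update for layer j differs from layer-wise training, because the gradient of Equation 11 flows through the frozen decoders of the layers below.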
Note that even in the special case where one uses linear activations (which is not very practical), the two losses are still not equivalent. Hence, joint training will in general perform very differently from layer-wise training.

C. Relationship to Learning the Data Distribution

If we take a probabilistic view of learning a deep model as learning the data distribution p(x), a common interpretation is to decompose p(x) into:

p(x) = Σ_h p(x|h) p(h) = Σ_h p(x|h) Σ_y p(h|y) q(y)    (13)

where q(y) is the empirical distribution over the data. The model is thus decomposed such that the bottom layer models the distribution p(x|h) and the higher layers model the prior p(h). Notice that with layer-wise training it is only possible to learn the prior p(h) through a fixed p(h|x), and thus the prior will not be optimal with respect to p(x) if p(h|x) does not preserve all information regarding x. In joint training, on the other hand, both the generative distribution p(x|h) and the prior p(h) are tuned together, and we are therefore more likely to obtain a better estimate of the true p(x). In addition, when training layers greedily, we do not take into account the fact that more capacity may be added later to improve the prior over the hidden units. This problem is also alleviated by joint training, since the architecture is fixed at the beginning of training and all parameters are tuned towards a better representation of the data.

D. Relationship to Generative Stochastic Networks

Bengio et al. [24] recently proposed a new alternative to maximum likelihood for training probabilistic models: generative stochastic networks (GSNs). The idea is that learning the data distribution p(x) directly is hard, since p(x) is in general highly multi-modal. Instead, one can learn to approximate the Markov chain transition operator (e.g.
p(h|h_t, x_t), p(x|h_t)). The intuition is that the moves in a Markov chain are mostly local, and thus these distributions are likely to be less complex, or even unimodal, making them easier to learn. For example, the denoising autoencoder learns p(x|x̃), where x̃ ∼ c(x̃|x) is a corrupted example. They show that if one can obtain a consistent estimator of p(x|x̃), then by following the implied Markov chain, the stationary distribution of the chain will converge to the true data distribution p(x). Like the single-layer denoising autoencoder, a deep denoising autoencoder also defines a Markov chain:

X̃_{t+1} ∼ c(X̃ | X_t)    (14)
X_{t+1} ∼ p(X | X̃_{t+1})    (15)

Therefore, a deep denoising autoencoder also learns a data generating distribution within the GSN framework, and we can generate samples from it.

IV. EXPERIMENTS AND RESULTS

In this section, we empirically analyse the unsupervised joint training method, guided by the following questions: 1) does joint training lead to better data models? 2) does joint training result in better representations that are helpful for other tasks? 3) what role do the more recent regularizers for autoencoders play in joint training? 4) does joint training affect the performance of supervised fine-tuning? We tested our approach on MNIST [25], a digit classification dataset containing 60,000 training and 10,000 test examples, where we used the standard 50,000/10,000 training and validation split. In addition, we used the MNIST variation datasets [26], each with 10,000 data points for training, 2,000 for validation and 50,000 for testing. The additional shape datasets employed in [26] are also used; this set contains two shape classification tasks: one is to classify short versus tall rectangles, and the other is to classify convex versus non-convex shapes. All of these datasets have 50,000 test examples.
The rectangles dataset has 1,000 training and 200 validation examples, and the convex dataset has 6,000 training and 2,000 validation examples. The rectangles dataset also has a variation that uses images as the foreground rectangles, with 10,000 training and 2,000 validation examples (see Figure 2, left, for visual examples from these datasets). We tied the weights of the deep autoencoders (i.e. W_e^i = (W_d^i)^T) in this set of experiments, set each layer to 1,000 hidden units with logistic activations, and applied the cross-entropy loss as the training cost. We optimized the deep network using rms-prop [27] with a decay factor of 0.9 for the rms estimate and 100 samples per mini-batch. The hyper-parameters were chosen on the validation set, and the model that obtained the best validation result was used to obtain the test set result. The hyper-parameters we considered were the learning rate (from the set {0.001, 0.005, 0.01, 0.02}), the noise level (from the set {0.1, 0.3, 0.5, 0.7, 0.9}) for deep denoising autoencoders (deep-DAE), and the contraction level (from the set {0.01, 0.05, 0.15, 0.3, 0.6}) for deep contractive autoencoders (deep-CAE). Gaussian noise was applied for the DAE.

A. Does Joint Training Lead to Better Data Models?

As mentioned in previous sections, we hypothesize that joint training alleviates the burden on both the bottom-layer distribution p(x|h) and the top-layer prior p(h), and hence results in a better data model p(x). In this experiment, we inspect the goodness of the data distribution model through samples from the model. Since the deep denoising autoencoder follows the GSN framework, we can follow the implied Markov chain to generate samples.
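Generating samples then amounts to iterating the chain of Equations (14)–(15): corrupt the current sample, then reconstruct it. A minimal sketch follows; the single tied-weight layer and random (untrained) weights are our own simplifications, so these "samples" carry no structure — with a trained deep DAE the same loop walks the learned distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d, h1 = 8, 6
W1 = rng.normal(0, 0.1, (h1, d))  # toy weights; a real run uses trained ones

def reconstruct(x_tilde):
    """One pass through a (here single-layer, tied-weight) autoencoder."""
    return sigmoid(W1.T @ sigmoid(W1 @ x_tilde))

def sample_chain(x0, steps, noise=0.5):
    samples, x = [], x0
    for _ in range(steps):
        x_tilde = x + noise * rng.normal(size=d)  # Eq. 14: corrupt X_t
        x = reconstruct(x_tilde)                  # Eq. 15: reconstruct X_{t+1}
        samples.append(x)
    return samples

chain = sample_chain(rng.random(d), steps=10)
```

Consecutive states of such a chain are exactly the "consecutive samples" reported in Figure 2.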
The models were trained for 300 epochs using both the layer-wise and the joint training method², and the quality of samples was then estimated by measuring the log-likelihood of the test set under a Parzen window density estimator [28], [29]. This measurement can be seen as a lower bound on the true log-likelihood, and it converges to the true likelihood as the number of samples increases, given an appropriate Parzen window parameter. 10,000 consecutive samples were generated for each of the datasets with models trained using the layer-wise and the joint training method, and we used a Gaussian Parzen window for the density estimation. The estimated log-likelihoods on the respective test sets are shown in Table I³. We use the suffixes 'L' and 'J' to denote the layer-wise and the joint training method, respectively. For qualitative purposes, the generated samples for each dataset are shown in Figure 2; the quality of the samples is comparable in both cases; however, the models trained through joint training show faster mixing, with fewer spurious samples in general. It is also interesting to note that, in most cases (see Table I), the log-likelihood on the test set improved

²Each layer is trained for 300 epochs in layer-wise training.
³We note that this estimate has somewhat high variance, but to our knowledge it is the best available method for evaluating generative models that can generate samples but cannot compute the data likelihood directly.

TABLE I
TEST SET LOG-LIKELIHOOD ESTIMATED FROM THE PARZEN WINDOW DENSITY ESTIMATOR. SUFFIX 'L' DENOTES THE MODEL TRAINED BY THE LAYER-WISE METHOD AND 'J' DENOTES THE MODEL TRAINED BY THE JOINT TRAINING METHOD.
Dataset/Method   DAE-2L         DAE-2J         DAE-3L          DAE-3J
MNIST            204 ± 1.59     266 ± 1.32     181 ± 1.68      270 ± 1.26
basic            201 ± 1.69     205 ± 1.59     183 ± 1.62      217 ± 1.51
rot              174 ± 1.41     172 ± 1.49     163 ± 1.60      187 ± 1.36
bg-img           156 ± 1.41     154 ± 1.45     142 ± 1.56      155 ± 1.48
bg-rand          -267 ± 0.38    -252 ± 0.36    -275 ± 0.37     -249 ± 0.36
bg-img-rot       151 ± 1.46     152 ± 1.50     139 ± 1.52      149 ± 1.31
rect             160 ± 1.08     161 ± 1.12     152 ± 1.15      154 ± 1.12
rect-img         275 ± 1.35     273 ± 1.30     269 ± 1.40      272 ± 1.37
convex           -967 ± 8.99    -853 ± 8.85    -1011 ± 10.79   -704 ± 8.50

Fig. 2. Samples generated from the deep denoising autoencoders with the best estimated log-likelihood, trained on MNIST and the image classification datasets used in [26]. From top to bottom: samples from the MNIST, MNIST-rotation, MNIST-bg-image, MNIST-bg-random, MNIST-bg-rot-image, rectangles, rectangles-image and convex datasets. Left: training samples from the corresponding dataset. Middle: consecutive samples generated from the deep denoising autoencoder trained with the layer-wise scheme. Right: consecutive samples generated from the deep denoising autoencoder trained by joint training. Note that only every fourth sample is shown; the last column of each set of consecutive samples shows the closest image from the training set, to illustrate that the model is not memorizing the training data. See Figure 4 for samples from longer runs.

with deeper models in the joint training case, whereas in the layer-wise setting the likelihood dropped with additional layers. This illustrates one advantage of the joint training scheme for modeling data: it accommodates the additional capacity of the hidden prior p(h) while training the whole model. Since the joint training objective in Equation 8 focuses on reconstruction, it is expected that the reconstruction errors of models trained by joint training should be lower than those obtained from layer-wise training.
To confirm this, we also recorded the training and testing errors as training progressed. As can be seen from Figure 3, joint training clearly achieves better performance in all cases⁴. It is also interesting to note that the models from joint training are less prone to overfitting as compared to the layerwise case; this holds even with fewer training examples (see Figure 3g and i).

B. Does Joint Training Result in Better Representations?

In the previous experiment we illustrated the advantage of joint training over greedy layerwise training when learning the data distribution p(x). We also confirmed that joint training yields better reconstruction error on both the training and test sets as compared to the layerwise case, since it focuses on reconstruction. A natural follow-up question is whether tuning all the parameters of a deep unsupervised model towards better representing the input leads to better higher-level representations.

⁴ We do not show the remaining models since the trends are very similar.

To answer this question, the following experiment was conducted. We trained two- and three-layered deep autoencoders⁵ using both the layerwise and our joint training methods for 300 epochs⁶. We then fixed all the weights and used the deep autoencoders as feature extractors on the data, i.e., we used the top encoding-layer representation as the feature for the corresponding data. A support vector machine (SVM) with a linear kernel was then trained on those features, and the test set classification performance was evaluated. The model that achieved the best validation performance was used to report the test set error, and the performance for the two different models is shown in Tables II and III. We report the test set classification error (all in percentages) along with its 95% confidence interval for all classification experiments.
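The feature-extraction step just described can be sketched as follows: the trained encoder weights are frozen and the top encoding layer's activations are used as the input representation for a linear classifier. The sigmoid nonlinearity and the weight shapes here are illustrative assumptions, not the authors' exact architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(x, encoder_weights):
    """Propagate data through a frozen encoder stack and return the
    top encoding layer's activations as features.

    encoder_weights: list of (W, b) pairs, one per encoding layer.
    """
    h = x
    for W, b in encoder_weights:
        h = sigmoid(h @ W + b)  # weights are fixed; no finetuning
    return h
```

These features would then be fed, without any further tuning of the encoder, to a linear-kernel SVM (e.g. scikit-learn's `LinearSVC`), mirroring the fully unsupervised feature-learning setup of the experiment.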
Since the computation of the contractive penalty of higher layers with respect to the input is expensive for the deep contractive autoencoder, we set $R_i(\Theta_i) = \|\nabla_{h_{i-1}} h_i\|_F^2$ to save some computation. The results in Tables II and III suggest that the representations learned from joint training are generally better. It is also important to note that the models achieved low error rates without any finetuning using the labeled data; in other words, the feature extraction was completely unsupervised. Interestingly, the three-layered models seemed to perform worse than the two-layered models in most cases for all methods, which contradicts the generative performance.

⁵ Note that the actual depth of the network is double the number of hidden layers because of the presence of intermediate reconstruction layers.
⁶ Each layer is trained for 300 epochs in layerwise training.

Fig. 3. Training and testing reconstruction errors from 2-layered deep denoising autoencoders on various datasets using layerwise and joint training. a) MNIST, b) MNIST-basic, c) MNIST-rotation, d) MNIST-bg-image, e) MNIST-bg-random, f) MNIST-bg-rot-image, g) rectangle, h) rectangle-image and i) convex dataset.

This is because the goal of a good unsupervised data model is to capture all information regarding p(x), whereas for supervised tasks the goal is to learn a good model of p(y|x). p(x) contains useful information regarding p(y|x), but not all the information in p(x) is necessarily helpful. Therefore, good generative performance does not necessarily translate to good discriminative performance. Apart from performing joint training directly from random initialization as in equation 8, it is also possible to first apply layerwise training and then jointly train with the pre-trained weights.
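As an aside, the per-layer contractive penalty $\|\nabla_{h_{i-1}} h_i\|_F^2$ used above has a cheap closed form for a sigmoid layer, as in standard contractive autoencoders [18]: the squared Frobenius norm of the Jacobian reduces to $\sum_j h_j^2 (1-h_j)^2 \|W_{\cdot j}\|^2$. A minimal numpy sketch (illustrative, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(h_prev, W, b):
    """||d h_i / d h_{i-1}||_F^2, averaged over the batch, for a
    sigmoid layer h_i = sigmoid(h_prev @ W + b).

    Uses the closed form: sum_j (h_j * (1 - h_j))^2 * ||W[:, j]||^2,
    which avoids building the full Jacobian explicitly.
    """
    h = sigmoid(h_prev @ W + b)           # (batch, units)
    col_norms = np.sum(W ** 2, axis=0)    # ||W[:, j]||^2 per hidden unit
    per_example = np.sum((h * (1.0 - h)) ** 2 * col_norms, axis=1)
    return float(np.mean(per_example))
```

Restricting the penalty to each layer's immediate input, as in the text, is what keeps this cost linear in the number of layers.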
Therefore, to investigate whether joint training is beneficial in this situation, we performed another set of experiments: we initialized the weights of the deep autoencoders using the pre-trained weights from denoising and contractive autoencoders, and then performed unsupervised joint training for 300 epochs, with and without the corresponding regularizations. The results are shown in Tables II and III. We use the suffix 'U' to indicate results from this procedure when the subsequent joint training is performed without any regularization, and 'UJ' to indicate the

TABLE II
LINEAR SVM CLASSIFICATION PERFORMANCE ON TOP-LEVEL REPRESENTATIONS LEARNT UNSUPERVISED BY DEEP DENOISING AUTOENCODERS TRAINED FROM VARIOUS SCHEMES. SUFFIX 'L' DENOTES MODELS TRAINED FROM THE LAYERWISE METHOD, 'U' DENOTES MODELS JOINTLY TRAINED WITHOUT REGULARIZATION BUT INITIALIZED WITH PRE-TRAINED WEIGHTS FROM THE LAYERWISE SCHEME, 'J' DENOTES MODELS JOINTLY TRAINED WITH CORRESPONDING REGULARIZATIONS AND RANDOMLY INITIALIZED PARAMETERS, AND 'UJ' DENOTES MODELS JOINTLY TRAINED WITH CORRESPONDING REGULARIZATIONS AND INITIALIZED WITH PRE-TRAINED WEIGHTS FROM THE LAYERWISE SCHEME. SEE TEXT FOR DETAILS.
Dataset/Method   DAE-2L        DAE-2U        DAE-2J        DAE-2UJ       DAE-3L        DAE-3U        DAE-3J        DAE-3UJ
MNIST            1.48 ± 0.24   1.94 ± 0.27   1.39 ± 0.23   1.47 ± 0.24   1.40 ± 0.23   1.85 ± 0.26   1.41 ± 0.23   1.71 ± 0.25
basic            2.75 ± 0.14   2.72 ± 0.14   2.66 ± 0.14   2.66 ± 0.14   2.65 ± 0.14   2.44 ± 0.14   2.98 ± 0.15   2.57 ± 0.14
rot              15.75 ± 0.32  17.09 ± 0.33  14.23 ± 0.31  14.98 ± 0.31  14.33 ± 0.31  15.89 ± 0.32  15.05 ± 0.31  13.25 ± 0.30
bg-img           20.69 ± 0.36  17.01 ± 0.33  17.23 ± 0.33  17.44 ± 0.33  21.36 ± 0.36  17.75 ± 0.33  21.77 ± 0.36  19.00 ± 0.34
bg-rand          12.72 ± 0.29  13.65 ± 0.30  8.52 ± 0.24   9.51 ± 0.26   10.72 ± 0.27  12.11 ± 0.29  8.03 ± 0.24   8.70 ± 0.25
bg-img-rot       52.44 ± 0.44  52.92 ± 0.44  49.61 ± 0.44  50.93 ± 0.44  56.19 ± 0.43  56.61 ± 0.43  57.04 ± 0.43  52.49 ± 0.44
rect             1.14 ± 0.09   0.97 ± 0.09   0.83 ± 0.08   0.87 ± 0.08   1.30 ± 0.10   1.85 ± 0.12   0.98 ± 0.09   1.26 ± 0.10
rect-img         22.84 ± 0.37  22.81 ± 0.37  21.96 ± 0.36  21.98 ± 0.36  24.22 ± 0.38  23.46 ± 0.37  23.04 ± 0.37  21.86 ± 0.36
convex           28.65 ± 0.40  25.78 ± 0.38  22.18 ± 0.36  25.65 ± 0.38  27.24 ± 0.39  25.00 ± 0.38  20.86 ± 0.36  24.42 ± 0.38

TABLE III
LINEAR SVM CLASSIFICATION PERFORMANCE ON TOP-LEVEL REPRESENTATIONS LEARNT UNSUPERVISED BY DEEP CONTRACTIVE AUTOENCODERS TRAINED FROM VARIOUS SCHEMES. NOTATIONS ARE THE SAME AS IN TABLE II. SEE TEXT FOR DETAILS.
Dataset/Method   CAE-2L        CAE-2U        CAE-2J        CAE-2UJ       CAE-3L        CAE-3U        CAE-3J        CAE-3UJ
MNIST            1.55 ± 0.24   1.91 ± 0.27   1.33 ± 0.22   1.65 ± 0.25   1.47 ± 0.24   1.98 ± 0.27   1.55 ± 0.24   1.74 ± 0.26
basic            2.96 ± 0.15   2.80 ± 0.14   2.38 ± 0.13   2.70 ± 0.14   2.90 ± 0.15   2.92 ± 0.15   2.80 ± 0.14   2.95 ± 0.15
rot              15.10 ± 0.31  15.29 ± 0.32  13.17 ± 0.30  15.38 ± 0.32  15.23 ± 0.31  16.23 ± 0.32  15.14 ± 0.31  15.59 ± 0.32
bg-img           19.80 ± 0.35  17.48 ± 0.33  19.05 ± 0.34  19.63 ± 0.35  21.31 ± 0.36  19.65 ± 0.35  21.68 ± 0.36  19.49 ± 0.35
bg-rand          15.00 ± 0.31  12.50 ± 0.29  14.00 ± 0.30  13.18 ± 0.30  14.30 ± 0.31  12.05 ± 0.29  11.53 ± 0.28  12.68 ± 0.29
bg-img-rot       52.57 ± 0.44  53.92 ± 0.44  52.00 ± 0.44  52.29 ± 0.44  53.66 ± 0.44  55.61 ± 0.44  54.32 ± 0.44  53.57 ± 0.44
rect             1.62 ± 0.11   1.77 ± 0.12   1.07 ± 0.09   1.85 ± 0.12   1.36 ± 0.10   2.40 ± 0.13   0.93 ± 0.08   1.99 ± 0.12
rect-img         22.62 ± 0.37  22.40 ± 0.37  22.23 ± 0.36  22.76 ± 0.37  22.37 ± 0.37  22.56 ± 0.37  23.06 ± 0.37  22.53 ± 0.37
convex           27.45 ± 0.39  25.90 ± 0.38  24.91 ± 0.38  25.31 ± 0.38  28.49 ± 0.40  26.66 ± 0.39  24.66 ± 0.38  25.26 ± 0.38

case where the subsequent joint training is performed with the corresponding regularization. Results from the 'U' scheme are similar to the layerwise case, whereas the performance of 'UJ' is clearly better than the layerwise case. The above results also suggest that the performance improvement from joint training is more significant when combined with the corresponding regularizations, irrespective of the parameter initialization. In summary, the representations learned through unsupervised joint training are better than those from the layerwise case. In addition, it is important to apply appropriate regularizations during joint training in order to get the most benefit.

C. How Important is the Regularization?
From the previous experiments, it is clear that the joint training scheme has several advantages over the layerwise training scheme, and the results also suggest that the use of proper regularizations is important (see, for example, the performance of 'U' versus 'UJ' in Tables II and III). However, the role of regularization in training deep autoencoders is still unclear: is it possible to achieve good performance by applying joint training without any regularization? To investigate this, we trained two- and three-layered deep autoencoders for 300 epochs with L2 constraints on the weights⁷, for a fair comparison with the previous experiments. We again trained a linear SVM on the top-layer representation and report the results in Table IV. Performance was significantly worse compared to the case where the more powerful regularizers from autoencoders were employed (see Tables II and III), especially when noise was present in the dataset. Hence, the results strongly suggest that for unsupervised feature learning, more powerful regularization is required to achieve superior performance. In addition, it is more beneficial to incorporate those powerful regularizations, such as denoising or contractive, during joint training (see Tables II and III).

⁷ Training without any regularization resulted in very poor performance.

TABLE IV
LINEAR SVM CLASSIFICATION PERFORMANCE ON TOP-LEVEL REPRESENTATIONS LEARNT UNSUPERVISED BY PLAIN DEEP AUTOENCODERS WITH L2 WEIGHT REGULARIZATION. SEE TEXT FOR DETAILS.

Dataset/Method   AE-2          AE-3
MNIST            1.96 ± 0.27   2.36 ± 0.30
basic            3.20 ± 0.15   3.20 ± 0.15
rot              18.75 ± 0.34  66.70 ± 0.41
bg-img           26.71 ± 0.39  86.30 ± 0.30
bg-rand          84.24 ± 0.32  85.95 ± 0.30
bg-img-rot       58.33 ± 0.43  81.00 ± 0.34
rect             4.00 ± 0.17   4.15 ± 0.17
rect-img         24.30 ± 0.38  48.70 ± 0.44
convex           43.70 ± 0.43  45.08 ± 0.44

D.
How does Joint Training Affect Finetuning?

So far, we have compared joint training with layerwise training in unsupervised representation learning settings. We now turn our attention to the supervised setting and investigate how joint

TABLE V
CLASSIFICATION PERFORMANCE AFTER FINETUNING ON DEEP DENOISING AUTOENCODERS TRAINED WITH VARIOUS SCHEMES. NOTATIONS ARE THE SAME AS IN TABLE II. SEE TEXT FOR DETAILS.

Dataset/Method   DAE-2L        DAE-2U        DAE-2J        DAE-2UJ       DAE-3L        DAE-3U        DAE-3J        DAE-3UJ
MNIST            1.05 ± 0.20   1.17 ± 0.21   1.17 ± 0.21   1.21 ± 0.21   1.20 ± 0.21   1.15 ± 0.21   1.10 ± 0.21   1.26 ± 0.22
basic            2.55 ± 0.14   2.42 ± 0.13   2.44 ± 0.14   2.68 ± 0.14   3.14 ± 0.15   2.50 ± 0.14   2.65 ± 0.14   2.70 ± 0.14
rot              9.51 ± 0.26   9.77 ± 0.26   8.40 ± 0.24   9.57 ± 0.26   9.65 ± 0.26   9.32 ± 0.25   7.87 ± 0.24   8.96 ± 0.25
bg-img           15.21 ± 0.31  17.08 ± 0.33  17.11 ± 0.33  17.50 ± 0.33  24.01 ± 0.37  15.47 ± 0.32  18.65 ± 0.34  17.02 ± 0.33
bg-rand          12.16 ± 0.29  11.91 ± 0.28  8.98 ± 0.25   10.88 ± 0.27  18.11 ± 0.34  11.64 ± 0.28  8.04 ± 0.24   8.48 ± 0.24
bg-img-rot       46.35 ± 0.44  47.90 ± 0.44  46.94 ± 0.44  47.71 ± 0.44  56.69 ± 0.43  45.16 ± 0.44  46.95 ± 0.44  46.51 ± 0.44
rect             1.60 ± 0.11   1.45 ± 0.10   0.98 ± 0.09   0.98 ± 0.09   1.39 ± 0.10   1.89 ± 0.12   0.92 ± 0.08   0.82 ± 0.08
rect-img         21.87 ± 0.36  21.09 ± 0.36  21.87 ± 0.36  22.53 ± 0.37  24.79 ± 0.38  22.17 ± 0.36  22.49 ± 0.37  22.48 ± 0.37
convex           19.33 ± 0.35  19.24 ± 0.35  18.60 ± 0.34  19.88 ± 0.35  23.17 ± 0.37  18.03 ± 0.34  18.33 ± 0.34  19.20 ± 0.35

TABLE VI
CLASSIFICATION PERFORMANCE AFTER FINETUNING ON DEEP CONTRACTIVE AUTOENCODERS TRAINED WITH VARIOUS SCHEMES. NOTATIONS ARE THE SAME AS IN TABLE II. SEE TEXT FOR DETAILS.
Dataset/Method   CAE-2L        CAE-2U        CAE-2J        CAE-2UJ       CAE-3L        CAE-3U        CAE-3J        CAE-3UJ
MNIST            1.25 ± 0.22   1.35 ± 0.23   1.09 ± 0.20   1.26 ± 0.22   1.47 ± 0.24   1.53 ± 0.24   1.11 ± 0.21   1.24 ± 0.22
basic            2.74 ± 0.14   2.85 ± 0.15   2.63 ± 0.14   2.74 ± 0.14   3.23 ± 0.15   3.01 ± 0.15   2.78 ± 0.14   2.84 ± 0.15
rot              10.31 ± 0.27  10.47 ± 0.27  8.56 ± 0.25   10.18 ± 0.27  10.85 ± 0.27  10.04 ± 0.26  7.91 ± 0.24   9.36 ± 0.26
bg-img           15.82 ± 0.32  18.45 ± 0.34  18.34 ± 0.34  18.45 ± 0.34  25.32 ± 0.38  16.58 ± 0.33  17.24 ± 0.33  17.01 ± 0.33
bg-rand          13.12 ± 0.30  10.53 ± 0.27  12.99 ± 0.29  12.53 ± 0.29  18.21 ± 0.34  11.19 ± 0.28  13.67 ± 0.30  12.25 ± 0.29
bg-img-rot       46.37 ± 0.44  47.38 ± 0.44  47.77 ± 0.44  48.00 ± 0.44  58.93 ± 0.43  46.80 ± 0.44  47.19 ± 0.44  47.50 ± 0.44
rect             2.17 ± 0.13   1.99 ± 0.12   1.17 ± 0.09   1.39 ± 0.10   2.44 ± 0.14   2.15 ± 0.13   1.04 ± 0.09   1.51 ± 0.11
rect-img         21.83 ± 0.36  21.72 ± 0.36  22.13 ± 0.36  22.22 ± 0.36  24.56 ± 0.38  22.03 ± 0.36  23.22 ± 0.37  23.58 ± 0.37
convex           18.60 ± 0.34  18.63 ± 0.34  18.01 ± 0.34  19.23 ± 0.35  23.20 ± 0.37  18.60 ± 0.34  18.00 ± 0.34  19.20 ± 0.35

training affects finetuning. In this experiment, the unsupervised deep autoencoders were used to initialize the parameters of a multi-layer perceptron for supervised finetuning (the same way one would use layerwise training for supervised tasks). Finetuning was performed for all previously trained models for a maximum of 1,000 epochs, with early stopping on the validation set error. As expected, the performance of the standard deep autoencoder (see Table VII) was not very impressive, except on MNIST, which contains 'cleaner' samples and significantly more training examples. It is also reasonable to expect similar performance from layerwise and joint training, since the supervised finetuning process adjusts all parameters to better fit p(y|x). This is partially true, as can be observed from the results in Tables V and VI: the performance of the 2-layer models is close in almost all cases.
However, in the 3-layer case, the models trained with joint training appear to perform better. This holds for models pre-trained via joint training with or without regularization (i.e., schemes 'U' and 'UJ' respectively), which suggests that joint training is more beneficial for deeper models. Hence, even when one would tune all parameters of the model for supervised tasks, unsupervised joint training can still be beneficial, especially for deeper models. The results also suggest that, as long as appropriate regularization is employed in joint pre-training, initialization does not influence the supervised performance significantly.

TABLE VII
CLASSIFICATION PERFORMANCE OF THE PLAIN DEEP AUTOENCODER WITH L2 WEIGHT REGULARIZATION AFTER SUPERVISED FINETUNING. SEE TEXT FOR DETAILS.

Dataset/Method   AE-2          AE-3
MNIST            1.36 ± 0.23   1.30 ± 0.22
basic            3.18 ± 0.15   3.04 ± 0.15
rot              9.95 ± 0.26   9.43 ± 0.26
bg-img           24.38 ± 0.38  24.00 ± 0.37
bg-rand          14.47 ± 0.31  18.53 ± 0.34
bg-img-rot       52.00 ± 0.44  54.91 ± 0.44
rect             2.91 ± 0.15   3.08 ± 0.15
rect-img         23.45 ± 0.37  24.00 ± 0.37
convex           21.27 ± 0.36  22.25 ± 0.36

V. CONCLUSION

In this paper we presented an unsupervised method for jointly training all layers of a deep autoencoder and analysed its performance against greedy layerwise training in various circumstances. We formulated a single objective for the deep autoencoder, which consists of a global reconstruction objective with local constraints on the hidden layers, so that all layers can be trained jointly. This can also be viewed as a generalization from training single-layer to multi-layer autoencoders, and it provides a straightforward way to stack the different variants of autoencoders.
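The structure of this single objective (a global reconstruction loss plus per-layer autoencoder terms acting as local regularizers) can be sketched as follows. This is an illustrative reconstruction of the objective's shape only, not the authors' exact eq. 8; the tied decoder weights, sigmoid activations, and the layer weights `lambdas` are hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_objective(x, weights, lambdas):
    """Global reconstruction error of a deep autoencoder plus
    per-layer reconstruction penalties acting as local regularizers.

    weights: list of (W, b_enc, b_dec) per layer, with tied decoder W.T.
    lambdas: per-layer regularization strengths (hypothetical values).
    """
    h = x
    local = 0.0
    for (W, b_e, b_d), lam in zip(weights, lambdas):
        h_next = sigmoid(h @ W + b_e)
        # Local, layer-level term: how well this single-layer
        # autoencoder reconstructs its own input.
        h_rec = sigmoid(h_next @ W.T + b_d)
        local += lam * np.mean((h_rec - h) ** 2)
        h = h_next
    # Decode back down through the whole stack for the global term.
    for (W, b_e, b_d) in reversed(weights):
        h = sigmoid(h @ W.T + b_d)
    return np.mean((h - x) ** 2) + local
```

Optimizing this quantity with respect to all layers at once is what distinguishes the scheme from greedy layerwise training, where each layer's term would be minimized in isolation with the lower layers frozen.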
Empirically, we showed that the joint training method not only learned better data models, but also learned more representative features for classification as compared to the layerwise method, which highlights its potential for unsupervised feature learning. In addition, the experiments showed that the success of the joint training technique depends on the more powerful regularizations proposed in the more recent variants of autoencoders. In the supervised setting, joint training also showed superior performance when training deeper models. Going forward, this framework of jointly training deep autoencoders can provide a platform for investigating more efficient usage of different types of regularizers, especially in light of the growing volumes of available unlabeled data.

APPENDIX

A. Additional Qualitative Samples

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1106–1114.
[2] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in ECCV, 2014, pp. 818–833.
[3] A. Mohamed, G. E. Dahl, and G. E. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[4] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, "Phone recognition with the mean-covariance restricted boltzmann machine," in NIPS, 2010, pp. 469–477.
[5] L. Deng, M. L. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. E. Hinton, "Binary coding of speech spectrograms using a deep auto-encoder," in INTERSPEECH, 2010, pp. 1692–1695.
[6] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in CVPR, 2014.
[7] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[8] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in NIPS, 2006, pp. 153–160.
[9] M. Ranzato, C. Poultney, S. Chopra, and Y. Lecun, "Efficient learning of sparse representations with an energy-based model," in NIPS. MIT Press, 2006.
[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," CoRR, vol. abs/1207.0580, 2012.
[11] L. Wan, M. D. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, "Regularization of neural networks using dropconnect," in ICML, 2013, pp. 1058–1066.
[12] N. L. Roux and Y. Bengio, "Representational power of restricted boltzmann machines and deep belief networks," Neural Computation, vol. 20, no. 6, pp. 1631–1649, 2008.
[13] L. Arnold and Y. Ollivier, "Layer-wise learning of deep generative models," CoRR, vol. abs/1212.1524, 2012.
[14] R. Salakhutdinov and G. E. Hinton, "Deep boltzmann machines," in AISTATS, 2009, pp. 448–455.
[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, pp. 533–536, 1986.
[16] H. Bourlard and Y. Kamp, "Auto-association by multilayer perceptrons and singular value decomposition," Biological Cybernetics, vol. 59, no. 4-5, pp. 291–294, 1988.
[17] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol, "Extracting and composing robust features with denoising autoencoders," in ICML, 2008, pp. 1096–1103.
[18] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive auto-encoders: Explicit invariance during feature extraction," in ICML, 2011, pp. 833–840.
[19] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[20] J. Martens, "Deep learning via hessian-free optimization," in ICML, 2010, pp. 735–742.
[21] I. Sutskever, J. Martens, G. E.
Dahl, and G. E. Hinton, "On the importance of initialization and momentum in deep learning," in ICML, 2013, pp. 1139–1147.
[22] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in NIPS, 2012, pp. 350–358.
[23] M. Chen, K. Q. Weinberger, F. Sha, and Y. Bengio, "Marginalized denoising auto-encoders for nonlinear representations," in ICML, 2014, pp. 1476–1484.
[24] Y. Bengio, E. Thibodeau-Laufer, and J. Yosinski, "Deep generative stochastic networks trainable by backprop," in ICML, 2014.
[25] Y. Lecun and C. Cortes, "The MNIST database of handwritten digits." [Online]. Available: http://yann.lecun.com/exdb/mnist/
[26] H. Larochelle, D. Erhan, A. C. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in ICML, 2007, pp. 473–480.
[27] T. Tieleman and G. E. Hinton, "Lecture 6.5 - rmsprop," COURSERA: Neural Networks for Machine Learning, 2012.
[28] G. Desjardins, A. C. Courville, and Y. Bengio, "Adaptive parallel tempering for stochastic maximum likelihood learning of rbms," CoRR, vol. abs/1012.3476, 2010.
[29] O. Breuleux, Y. Bengio, and P. Vincent, "Unlearning for better mixing," Université de Montréal/DIRO, Tech. Rep. 1349, 2010.

Fig. 4. Expanded version of the samples shown in Figure 2: here we show every consecutive sample (from left to right, top to bottom) and with longer runs. The last column shows the closest sample from the training set, to illustrate that the model is not memorizing the training data. From top to bottom: samples from MNIST, MNIST-rotation, MNIST-bg-image, MNIST-bg-random, MNIST-bg-rot-image, rectangle, rectangle-image and convex datasets. Left: consecutive samples generated from a deep denoising autoencoder trained using the layerwise scheme. Right: consecutive samples generated from a deep denoising autoencoder trained by joint training.
The jointly trained models show sharper and more diverse samples in general.
