Deep unsupervised learning through spatial contrasting


Authors: Elad Hoffer, Itay Hubara, Nir Ailon

Under review as a conference paper at ICLR 2017

Elad Hoffer (Technion - Israel Institute of Technology, Haifa, Israel; ehoffer@tx.technion.ac.il)
Itay Hubara (Technion - Israel Institute of Technology, Haifa, Israel; itayh@tx.technion.ac.il)
Nir Ailon* (Technion - Israel Institute of Technology, Haifa, Israel; nailon@cs.technion.ac.il)

ABSTRACT

Convolutional networks have marked their place over the last few years as the best performing model for various visual tasks. They are, however, most suited for supervised learning from large amounts of labeled data. Previous attempts have been made to use unlabeled data to improve model performance by applying unsupervised techniques. These attempts require different architectures and training methods. In this work we present a novel approach for unsupervised training of convolutional networks that is based on contrasting between spatial regions within images. This criterion can be employed within conventional neural networks and trained using standard techniques such as SGD and back-propagation, thus complementing supervised methods.

1 INTRODUCTION

For the past few years convolutional networks (ConvNets, CNNs) LeCun et al. (1998) have proven themselves a successful model for vision-related tasks Krizhevsky et al. (2012); Mnih et al. (2015); Pinheiro et al. (2015); Razavian et al. (2014). A convolutional network is composed of multiple convolutional and pooling layers, followed by fully-connected affine transformations. As with other neural network models, each layer is typically followed by a non-linearity such as a rectified-linear unit (ReLU). A convolutional layer is applied by cross-correlating an image with a trainable weight filter.
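As a minimal numpy sketch of this basic operation (the function name and the averaging filter below are illustrative, not from the paper), 'valid' 2-D cross-correlation slides the filter over the image without flipping it:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' 2-D cross-correlation of a single-channel image
    with a weight filter (no kernel flipping, unlike convolution)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product of the filter with the image window at (i, j)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2)) / 4.0           # a simple averaging filter
print(cross_correlate2d(image, kernel))  # 3x3 feature map
```

In a trained ConvNet the kernel entries are learned parameters rather than the fixed averaging weights used here.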
This stems from the assumption of stationarity in natural images: features learned for one local region of an image can be shared across other regions and images.

Deep learning models, including convolutional networks, are usually trained in a supervised manner, requiring large amounts of labeled data (ranging from thousands to millions of examples per class for classification tasks) in almost all modern applications. These models are optimized using a variant of stochastic gradient descent (SGD) over batches of images sampled from the whole training dataset and their ground-truth labels. Gradient estimation for each of the optimized parameters is done by back-propagating the objective error from the final layer towards the input. This is commonly known as "backpropagation" Rumelhart et al..

One early well-known usage of unsupervised training of deep architectures was as part of a pre-training procedure used for obtaining an effective initial state of the model. The network was later fine-tuned in a supervised manner, as displayed by Hinton (2007). Such unsupervised pre-training procedures were later abandoned, since they provided no apparent benefit over other initialization heuristics in more careful, fully supervised training regimes. This led to the de-facto almost exclusive usage of neural networks in supervised environments.

* The author acknowledges the generous support of ISF grant number 1271/13.

In this work we present a novel unsupervised learning criterion for convolutional networks based on comparison of features extracted from regions within images. Our experiments indicate that by using this criterion to pre-train networks we can improve their performance and achieve state-of-the-art results.
2 PREVIOUS WORKS

Using unsupervised methods to improve performance has been the holy grail of deep learning for the last couple of years, and vast research efforts have been focused on it. We hereby give a short overview of the most popular recent methods that tried to tackle this problem.

AutoEncoders and reconstruction loss: These are probably the most popular models for unsupervised learning using neural networks, and ConvNets in particular. Autoencoders are NNs which aim to transform inputs into outputs with the least possible amount of distortion. An autoencoder is constructed using an encoder G(x; w_1) that maps an input to a hidden compressed representation, followed by a decoder F(y; w_2) that maps the representation back into the input space. Mathematically, this can be written in the following general form:

    x_hat = F(G(x; w_1); w_2)

The underlying encoder and decoder contain a set of trainable parameters that can be tied together and optimized for a predefined criterion. The encoder and decoder can have different architectures, including fully-connected neural networks, ConvNets and others. The criterion used for training is the reconstruction loss, usually the mean squared error (MSE) between the original input and its reconstruction Zeiler et al. (2010):

    min ||x - x_hat||^2

This allows an efficient training procedure using the aforementioned backpropagation and SGD techniques. Over the years autoencoders gained a fundamental role in unsupervised learning, and many modifications to the classic architecture were made. Ng (2011) regularized the latent representation to be sparse, and Vincent et al. (2008) substituted the input with a noisy version thereof, requiring the model to denoise while reconstructing. Kingma et al. (2014) obtained very promising results with variational autoencoders (VAE).
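The plain autoencoder objective above can be sketched in a few lines of numpy. This is an illustrative linear autoencoder with a toy SGD step on the decoder only; the dimensions, learning rate, and variable names are assumptions for the sketch, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear encoder G(x; W1) and decoder F(y; W2): 16-dim input, 4-dim code.
W1 = rng.normal(scale=0.1, size=(4, 16))
W2 = rng.normal(scale=0.1, size=(16, 4))

def reconstruct(x):
    code = W1 @ x      # G(x; w1): compressed representation
    return W2 @ code   # F(G(x; w1); w2): x_hat

def mse_loss(x):
    # reconstruction loss ||x - x_hat||^2
    return np.sum((x - reconstruct(x)) ** 2)

x = rng.normal(size=16)
loss_before = mse_loss(x)

# One SGD step on W2: gradient of the squared error w.r.t. W2 is -2 (x - W2 c) c^T
code = W1 @ x
grad_W2 = -2.0 * np.outer(x - W2 @ code, code)
W2 -= 0.01 * grad_W2

print(loss_before, mse_loss(x))  # the reconstruction error shrinks after the update
```

In practice both encoder and decoder parameters are updated jointly (and are often deep ConvNets), but the training loop has exactly this shape.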
A variational autoencoder model inherits the typical autoencoder architecture, but makes strong assumptions concerning the distribution of the latent variables. VAEs use a variational approach for latent representation learning, which results in an additional loss component and a specific training algorithm called Stochastic Gradient Variational Bayes (SGVB). The VAE assumes that the data is generated by a directed graphical model p(x | z) and requires the encoder to learn an approximation q_{w_1}(z | x) to the posterior distribution p_{w_2}(z | x), where w_1 and w_2 denote the parameters of the encoder and decoder. The objective of the variational autoencoder in this case has the following form:

    L(w_1, w_2, x) = -D_KL(q_{w_1}(z | x) || p_{w_2}(z)) + E_{q_{w_1}(z | x)}[log p_{w_2}(x | z)]

Recently, stacked sets of denoising autoencoder architectures showed promising results in both semi-supervised and unsupervised tasks. A stacked what-where autoencoder by Zhao et al. (2015) computes a set of complementary variables that enable reconstruction whenever a layer implements a many-to-one mapping. Ladder networks by Rasmus et al. (2015) use lateral connections to allow higher levels of an autoencoder to focus on invariant abstract features by applying a layer-wise cost function.

Exemplar Networks: The unsupervised method introduced by Dosovitskiy et al. (2014) takes a different approach to this task and trains the network to discriminate between a set of pseudo-classes. Each pseudo-class is formed by applying multiple transformations to a randomly sampled image patch. The number of pseudo-classes can be as big as the number of input samples. This criterion ensures that different input samples are distinguished while providing robustness to the applied transformations.

Context prediction: Another method for unsupervised learning by context was introduced by Doersch et al. (2015).
This method uses an auxiliary criterion of predicting the location of an image patch given another patch from the same image. This is done by classification into 1 of 9 possible locations.

Adversarial Generative Models: This is a recently introduced model that can be used in an unsupervised fashion Goodfellow et al. (2014). Adversarial generative models use a set of networks: one trained to discriminate between data sampled from the true underlying distribution (e.g., a set of images), and a separate generative network trained to be an adversary trying to confuse the first network. By propagating the gradient through the paired networks, the model learns to generate samples that are distributed similarly to the source data. As shown by Radford et al. (2015), this model can create useful latent representations for subsequent classification tasks.

Sampling Methods: Methods for training models to discriminate between a very large number of classes often use a noise-contrasting criterion. In these methods, roughly speaking, the posterior probability P(t | y_t) of the ground-truth target t given the model output y_t = F(x) on an input sampled from the true distribution is maximized, while the probability P(t | y_n) given a noise measurement y_n = F(n) is minimized. This was successfully used in the language domain to learn unsupervised representations of words; the most noteworthy case is the word2vec model introduced by Mikolov et al. (2013). When using this setting in language applications, a natural contrasting noise is a smooth approximation of the unigram distribution. A suitable contrasting distribution is less obvious when data points are sampled from a high-dimensional continuous space, such as in the case of image patches.

2.1 PROBLEMS WITH CURRENT APPROACHES

Only recently has the potential of ConvNets in an unsupervised environment begun to bear fruit, and we believe it is still not fully uncovered.
The majority of unsupervised optimization criteria currently used are based on variations of reconstruction losses. One limitation of this is that pixel-level reconstruction is non-compliant with the idea of a discriminative objective, which is expected to be agnostic to low-level information in the input. In addition, it is evident that MSE is not well suited as a measure for comparing images; consider, for example, the possibly large squared error between an image and a copy of it shifted by a single pixel.

Another problem with recent approaches such as Rasmus et al. (2015); Zeiler et al. (2010) is their need to extensively modify the original convolutional network model. This leads to a gap between unsupervised methods and the state-of-the-art supervised models for classification, which can hurt future attempts to reconcile them in a unified framework and to efficiently leverage unlabeled data within otherwise supervised regimes.

3 LEARNING BY COMPARISONS

The most common way to train a NN is by defining a loss function between the target values and the network output. Learning by comparison approaches the supervised task from a different angle: the main idea is to use distance comparisons between samples to learn useful representations. For example, we consider relative and qualitative examples of the form "X_1 is closer to X_2 than X_1 is to X_3". Using a comparative measure with a neural network to learn an embedding space was introduced in the "Siamese network" framework by Bromley et al. (1993) and later used in the work of Chopra et al. (2005). One use for these methods is when the number of classes is too large or expected to vary over time, as in the case of face verification, where a face contained in an image has to be compared against another image of a face. This problem was recently tackled by Schroff et al. (2015) by training a convolutional network model on triplets of examples.
There, one image serves as an anchor x, and an additional pair of images serves as a positive example x+ (containing an instance of the face of the same person) and a negative example x-, containing a face of a different person. The training objective acts on the embedded distances of the input faces, where the distance between the anchor and the positive example is adjusted to be smaller, by at least some constant alpha, than the negative distance. More precisely, the loss function used in this case was defined as

    L(x, x+, x-) = max{ ||F(x) - F(x+)||^2 - ||F(x) - F(x-)||^2 + alpha, 0 }        (1)

where F(x) is the embedding (the output of a convolutional neural network), and alpha is a predefined margin constant. A similar model was used by Hoffer & Ailon (2015) with triplet comparisons for classification, where examples from the same class were trained to have a lower embedded distance than that of two images from distinct classes. That work introduced the concept of a distance-ratio loss, where the defined measure amounts to:

    L(x, x+, x-) = -log [ e^{-||F(x) - F(x+)||^2} / ( e^{-||F(x) - F(x+)||^2} + e^{-||F(x) - F(x-)||^2} ) ]        (2)

This loss has the flavor of the probability of a biased coin flip. By 'pushing' this probability to zero, we express the objective that pairs of samples coming from distinct classes should be less similar to each other than pairs of samples coming from the same class. It was shown empirically by Balntas et al. (2016) to provide better feature embeddings than the margin-based distance loss (1).

4 OUR CONTRIBUTION: SPATIAL CONTRASTING

One implicit assumption in convolutional networks is that features are learned gradually and hierarchically, each level in the hierarchy corresponding to a layer in the network. Each spatial location within a layer corresponds to a region in the original image.
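The two triplet losses above, the margin loss of Eq. (1) and the distance-ratio loss of Eq. (2), can be sketched directly in numpy (the toy feature vectors and the margin value are illustrative):

```python
import numpy as np

def margin_loss(f, f_pos, f_neg, alpha=0.2):
    """Margin-based triplet loss (Eq. 1): the anchor-positive distance
    should be smaller than the anchor-negative distance by at least alpha."""
    d_pos = np.sum((f - f_pos) ** 2)
    d_neg = np.sum((f - f_neg) ** 2)
    return max(d_pos - d_neg + alpha, 0.0)

def ratio_loss(f, f_pos, f_neg):
    """Distance-ratio loss (Eq. 2): a softmax over the two negated
    squared distances, pushed toward the positive pair."""
    d_pos = np.sum((f - f_pos) ** 2)
    d_neg = np.sum((f - f_neg) ** 2)
    return -np.log(np.exp(-d_pos) / (np.exp(-d_pos) + np.exp(-d_neg)))

f     = np.array([0.0, 0.0])
f_pos = np.array([0.1, 0.0])   # close to the anchor
f_neg = np.array([1.0, 1.0])   # far from the anchor
print(margin_loss(f, f_pos, f_neg))  # 0.0: the margin is already satisfied
print(ratio_loss(f, f_pos, f_neg))   # small but nonzero
```

Note the qualitative difference: the margin loss is exactly zero once the margin is met, while the ratio loss always provides a (shrinking) gradient signal.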
It is empirically observed that deeper layers tend to contain more 'abstract' information about the image. Intuitively, features describing different regions within the same image are likely to be semantically similar (e.g., different parts of an animal), and indeed the corresponding deep representations tend to be similar. Conversely, regions from two probably unrelated images (say, two images chosen at random) tend to be far from each other in the deep representation. This logic is commonly used in modern deep networks such as Szegedy et al. (2015); Lin et al. (2013); He et al. (2015), where global average pooling is used to aggregate spatial features in the final layer used for classification.

Our suggestion is that this property, often observed as a side effect of supervised applications, can be used as a desired objective when learning deep representations in an unsupervised task. Later, the resulting representation can be used, as typically done, as a starting point for a supervised learning task. We call this idea, which we formalize below, Spatial contrasting. The spatial contrasting criterion is similar to noise contrasting estimation Gutmann & Hyvärinen (2010); Mnih & Kavukcuoglu (2013), in trying to train a model by maximizing the expected probability on desired inputs, while minimizing it on contrasting sampled measurements.

4.1 FORMULATION

We concern ourselves with samples of image patches x̃^(m) taken from an image x. Our convolutional network model, denoted by F(x), extracts spatial features f, so that f^(m) = F(x̃^(m)) for an image patch x̃^(m). We wish to optimize our model such that for two features representing patches taken from the same image, x̃_i^(1), x̃_i^(2) ∈ x_i, for which f_i^(1) = F(x̃_i^(1)) and f_i^(2) = F(x̃_i^(2)), the conditional probability P(f_i^(1) | f_i^(2)) will be maximized.
This means that features from a patch taken from a specific image can effectively predict, under our model, features extracted from other patches in the same image. Conversely, we want our model to minimize P(f_i | f_j) for i, j being two patches taken from distinct images. Following the logic presented above, we need to sample a contrasting patch x̃_j^(1) from a different image x_j such that P(f_i^(1) | f_i^(2)) > P(f_j^(1) | f_i^(2)), where f_j^(1) = F(x̃_j^(1)). In order to obtain contrasting samples, we use regions from two random images in the training set. We use the distance ratio, described earlier in (2) for the supervised case, to represent the probability that two feature vectors were taken from the same image. The resulting training loss for a pair of images is defined as

    L_SC(x_1, x_2) = -log [ e^{-||f_1^(1) - f_1^(2)||^2} / ( e^{-||f_1^(1) - f_1^(2)||^2} + e^{-||f_1^(1) - f_2^(1)||^2} ) ]        (3)

effectively minimizing a log-probability under the softmax measure. This formulation is portrayed in Figure 1. Since we sample the contrasting patch from the same underlying distribution, we can evaluate this loss symmetrically, considering each image patch as both the compared patch (anchor) and the contrast. The final loss is the average between these estimations:

    L̂_SC(x_1, x_2) = (1/2) [ L_SC(x_1, x_2) + L_SC(x_2, x_1) ]

Figure 1: Spatial contrasting depiction.

4.2 METHOD

Since training convolutional networks is done in batches of images, we can use the multiple samples in each batch to train our model. Each image serves as a source for both anchor and positive patches, for which the corresponding features should be closer, and also as a source of contrasting samples for all the other images in the batch. For a batch of N images, two samples are taken from each image, and N^2 different distance comparisons are made.
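A minimal numpy sketch of the pairwise loss of Eq. (3) and its symmetric average: the function names and the toy feature vectors are illustrative; in the paper f would be the ConvNet's sampled spatial features rather than hand-picked vectors:

```python
import numpy as np

def sc_loss(f1_a, f1_b, f2_a):
    """Spatial contrasting loss for one ordered image pair (Eq. 3).
    f1_a, f1_b: features of two patches from image 1 (anchor and positive);
    f2_a: a feature of a contrasting patch from image 2."""
    d_same  = np.sum((f1_a - f1_b) ** 2)
    d_other = np.sum((f1_a - f2_a) ** 2)
    return -np.log(np.exp(-d_same) / (np.exp(-d_same) + np.exp(-d_other)))

def sc_loss_symmetric(f1_a, f1_b, f2_a, f2_b):
    """Average of the two ordered losses, with each image serving
    once as anchor and once as contrast."""
    return 0.5 * (sc_loss(f1_a, f1_b, f2_a) + sc_loss(f2_a, f2_b, f1_a))

# Patches from the same image embed near each other; images are far apart.
f1_a, f1_b = np.array([0.0, 0.0]), np.array([0.1, 0.0])
f2_a, f2_b = np.array([2.0, 2.0]), np.array([2.1, 2.0])
print(sc_loss_symmetric(f1_a, f1_b, f2_a, f2_b))  # small: same-image pairs win
```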
The final loss is the average distance ratio for images in the batch:

    L̄_SC({x_i}_{i=1}^N) = (1/N) Σ_{i=1}^N L_SC(x_i, {x_j}_{j≠i})
                        = -(1/N) Σ_{i=1}^N log [ e^{-||f_i^(1) - f_i^(2)||^2} / Σ_{j=1}^N e^{-||f_i^(1) - f_j^(2)||^2} ]        (4)

Since the criterion is differentiable with respect to its inputs, it is fully compliant with standard methods for training convolutional networks, specifically backpropagation and gradient descent. Furthermore, SC can be applied to any layer in the network hierarchy; in fact, SC can be used at multiple layers within the same convolutional network. The spatial nature of the features means that we can also sample from the feature space, f̃^(m) ∈ f, instead of from the original image, which we use to simplify the implementation. The complete algorithm for batch training is described in Algorithm 1. This algorithm is also related to the batch normalization layer Ioffe & Szegedy (2015), a recent usage of batch statistics in neural networks. Spatial contrasting also uses the batch statistics, but to sample contrasting patches.

5 EXPERIMENTS

In this section we report empirical results showing that using the SC loss as an unsupervised pretraining procedure can improve state-of-the-art performance on subsequent classification.
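The batch loss of Eq. (4) can be sketched with vectorized numpy (an illustrative sketch, assuming the two sampled feature sets for the batch are given as N x d matrices; the function name and toy inputs are not from the paper):

```python
import numpy as np

def batch_sc_loss(F1, F2):
    """Batch spatial contrasting loss (Eq. 4).
    F1[i], F2[i]: sampled features of two patches from image i (N x d arrays)."""
    # Squared distances Dist(i, j) between every anchor F1[i] and contrast F2[j]
    diff = F1[:, None, :] - F2[None, :, :]   # shape N x N x d
    dist = np.sum(diff ** 2, axis=-1)        # shape N x N
    # log-softmax over negated distances; the "correct" match for row i is column i
    log_p = -dist - np.log(np.sum(np.exp(-dist), axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))

# Three images whose two patch features coincide per image and are well separated
F1 = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]])
F2 = F1.copy()
print(batch_sc_loss(F1, F2))  # close to 0: each patch is nearest its own image
```

When all features collapse to a single point, the softmax is uniform and the loss reaches its indifference value log N, which makes a convenient sanity check.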
Algorithm 1 Calculating the spatial contrasting loss
Require: X = {x_i}_{i=1}^N  # training batch of images
  # Get the spatial features for the whole batch of images
  # Size: N × W_f × H_f × C
  {f_i}_{i=1}^N ← ConvNet(X)
  # Sample spatial features and calculate embedded distances between all pairs of images
  for i = 1 to N do
      f̃_i^(1) ← sample(f_i)
      for j = 1 to N do
          f̃_j^(2) ← sample(f_j)
          Dist(i, j) ← ||f̃_i^(1) - f̃_j^(2)||^2
      end for
  end for
  # Calculate log-softmax normalized distances
  d_i ← -log [ e^{-Dist(i,i)} / Σ_{k=1}^N e^{-Dist(i,k)} ]
  # The spatial contrasting loss is the mean of the distance ratios
  return (1/N) Σ_{i=1}^N d_i

We experimented with the MNIST, CIFAR-10 and STL10 datasets. We used modified versions of well-studied networks such as those of Lin et al. (2013); Rasmus et al. (2015); a detailed description of our architectures can be found in Appendix A. In each of the experiments, we used the spatial contrasting criterion to train the network on the unlabeled images. Training was done using SGD with an initial learning rate of 0.1 that was decreased by a factor of 10 whenever the measured loss stopped decreasing. After convergence, we used the trained model as an initialization for supervised training on the complete labeled dataset. The supervised training followed the same regime, only starting with a lower initial learning rate of 0.01. We used mild data augmentations, such as small translations and horizontal mirroring.

The datasets we used are:

• STL10 (Coates et al. (2011)). This dataset consists of 100,000 96 × 96 colored, unlabeled images, together with another set of 5,000 labeled training images and 8,000 test images. The label space consists of 10 object classes.

• Cifar10 (Krizhevsky & Hinton (2009)). The well-known CIFAR-10 is an image classification benchmark dataset containing 50,000 training images and 10,000 test images.
The images are 32 × 32 pixels, in color. The classes are airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks.

• MNIST (LeCun et al. (1998)). The MNIST database of handwritten digits is one of the most studied benchmark datasets for image classification. The dataset contains 60,000 examples of handwritten digits from 0 to 9 for training and 10,000 additional examples for testing. Each sample is a 28 × 28 pixel gray-level image.

5.1 RESULTS ON STL10

Since STL10 is comprised of mostly unlabeled data, it is the most suitable dataset to highlight the benefits of the spatial contrasting criterion. The initial training was unsupervised, as described earlier, using the entire set of 105,000 samples (the union of the original unlabeled set and the labeled training set). The representation output by this training was used to initialize supervised training on the 5,000 labeled images. Evaluation was done on a separate test set of 8,000 samples. Comparing with state-of-the-art results (Table 1), we see an improvement of 7% in test accuracy over the best model by Zhao et al. (2015), setting SC as the best model at 81.3% test classification accuracy. We also compare with the same network without SC initialization, which achieves a lower classification accuracy of 72.6%. This is an indication that SC indeed managed to leverage the unlabeled examples to provide a better initialization point for the supervised model.

Table 1: State-of-the-art results on the STL-10 dataset

Model                                                   STL-10 test accuracy
Zero-bias Convnets - Paine et al. (2014)                70.2%
Triplet network - Hoffer & Ailon (2015)                 70.7%
Exemplar Convnets - Dosovitskiy et al. (2014)           72.8%
Target Coding - Yang et al. (2015)                      73.15%
Stacked what-where AE - Zhao et al. (2015)              74.33%
Spatial contrasting initialization (this work)          81.34% ± 0.1
The same model without initialization                   72.6% ± 0.1

Table 2: State-of-the-art results on the Cifar10 dataset with only 4000 labeled samples

Model                                                   Cifar10 (400 per class) test accuracy
Convolutional K-means Network - Coates & Ng (2012)      70.7%
View-Invariant K-means - Hui (2013)                     72.6%
DCGAN - Radford et al. (2015)                           73.8%
Exemplar Convnets - Dosovitskiy et al. (2014)           76.6%
Ladder networks - Rasmus et al. (2015)                  79.6%
Spatial contrasting initialization (this work)          79.2% ± 0.3
The same model without initialization                   72.4% ± 0.1

5.2 RESULTS ON CIFAR10

For Cifar10, we used a previously used setting Coates & Ng (2012); Hui (2013); Dosovitskiy et al. (2014) to test a model's ability to learn from unlabeled images. In this setting, only 4,000 samples from the available 50,000 are used with their label annotations, but the entire dataset is used for unsupervised learning. The final test accuracy is measured on the entire 10,000-image test set. In our experiments, we trained our model using the SC criterion on the entire dataset, and then used only 400 labeled samples per class (4,000 in total) in a supervised regime over the initialized network. The results are compared with previous efforts in Table 2. Using the SC criterion allowed an improvement of 6.8% over a non-initialized model, achieving a final test accuracy of 79.2%. This is competitive with the current state-of-the-art model of Rasmus et al. (2015).

5.3 RESULTS ON MNIST

The MNIST dataset is very different in nature from Cifar10 and STL10. The biggest difference, relevant to this work, is that spatial regions sampled from MNIST images usually provide very little or no information. Because of this, SC is much less suited for MNIST and was conjectured to have little benefit. We still, however, experimented with initializing a model with the SC criterion and continuing with a fully-supervised regime over all labeled examples.
We found again that this provided a benefit over training the same network without pre-initialization, improving results from 0.63% to 0.34% error on the test set. The results, compared with previous attempts, are included in Table 3.

Table 3: Results on the MNIST dataset

Model                                                   MNIST test error
Stacked what-where AE - Zhao et al. (2015)              0.71%
Triplet network - Hoffer & Ailon (2015)                 0.56%
Jarrett et al. (2009)                                   0.53%
Ladder networks - Rasmus et al. (2015)                  0.36%
DropConnect - Wan et al. (2013)                         0.21%
Spatial contrasting initialization (this work)          0.34% ± 0.02
The same model without initialization                   0.63% ± 0.02

6 CONCLUSIONS AND FUTURE WORK

In this work we presented spatial contrasting - a novel unsupervised criterion for training convolutional networks on unlabeled data. It is based on comparisons between spatial features sampled from a number of images. We have shown empirically that using spatial contrasting as a pretraining technique to initialize a ConvNet can improve its performance on subsequent supervised training. In cases where a lot of unlabeled data is available, such as the STL10 dataset, this translates to state-of-the-art classification accuracy in the final model.

Since the spatial contrasting loss is a differentiable estimation that can be computed within a network in parallel to supervised losses, future work will attempt to embed it in a semi-supervised model. Such usage will allow creating models that can leverage both labeled and unlabeled data, and can be compared with similar semi-supervised models such as the ladder network Rasmus et al. (2015). It is also apparent that contrasting can occur in dimensions other than the spatial one, the most straightforward being the temporal dimension. This suggests that a similar training procedure can be applied to segments of sequences to learn useful representations without explicit supervision.
REFERENCES

Vassileios Balntas, Edward Johns, Lilian Tang, and Krystian Mikolajczyk. PN-Net: Conjoined triple deep network for learning local image descriptors. arXiv preprint arXiv:1601.05030, 2016.

Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669-688, 1993.

Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pp. 539-546. IEEE, 2005.

Adam Coates and Andrew Y Ng. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade, pp. 561-580. Springer, 2012.

Adam Coates, Andrew Y Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pp. 215-223, 2011.

Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422-1430, 2015.

Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 766-774, 2014.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.
In International Conference on Artificial Intelligence and Statistics, pp. 297-304, 2010.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

Geoffrey E Hinton. To recognize shapes, first learn to generate images. Progress in brain research, 165:535-547, 2007.

Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In Similarity-Based Pattern Recognition, pp. 84-92. Springer, 2015.

Ka Y Hui. Direct modeling of complex invariances for visual object features. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 352-360, 2013.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pp. 448-456, 2015.

Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pp. 2146-2153. IEEE, 2009.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, pp. 1-9, 2012.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint, 2013.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality.
In Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.

Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pp. 2265-2273, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Andrew Ng. Sparse autoencoder. 2011.

Tom Le Paine, Pooya Khorrami, Wei Han, and Thomas S Huang. An analysis of unsupervised pre-training in light of recent advances. arXiv preprint, 2014.

Pedro O Pinheiro, Ronan Collobert, and Piotr Dollar. Learning to segment object candidates. In Advances in Neural Information Processing Systems, pp. 1981-1989, 2015.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint, 2015.

Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3532-3540, 2015.

Ali Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806-813, 2014.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815-823, 2015.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096-1103. ACM, 2008.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1058-1066, 2013.

Shuo Yang, Ping Luo, Chen Change Loy, Kenneth W Shum, and Xiaoou Tang. Deep representation learning with target coding. 2015.

Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2528-2535. IEEE, 2010.

Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann Lecun. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351, 2015.

7 APPENDIX

Table 4: Convolutional models used, based on Lin et al. (2013); Rasmus et al. (2015)

STL10 model (input: 96 × 96 RGB):
  5 × 5 conv. 64 BN ReLU
  1 × 1 conv. 160 BN ReLU
  1 × 1 conv. 96 BN ReLU
  3 × 3 max-pooling, stride 2
  5 × 5 conv. 192 BN ReLU
  1 × 1 conv. 192 BN ReLU
  1 × 1 conv. 192 BN ReLU
  3 × 3 max-pooling, stride 2
  3 × 3 conv. 192 BN ReLU
  1 × 1 conv. 192 BN ReLU
  1 × 1 conv. 192 BN ReLU
  Spatial contrasting criterion
  3 × 3 conv. 256 ReLU
  3 × 3 max-pooling, stride 2
  dropout, p = 0.5
  3 × 3 conv. 128 ReLU
  dropout, p = 0.5
  fully-connected 10
  10-way softmax

CIFAR-10 model (input: 32 × 32 RGB):
  3 × 3 conv. 96 BN LeakyReLU
  3 × 3 conv. 96 BN LeakyReLU
  3 × 3 conv. 96 BN LeakyReLU
  2 × 2 max-pooling, stride 2 BN
  3 × 3 conv. 192 BN LeakyReLU
  3 × 3 conv. 192 BN LeakyReLU
  3 × 3 conv. 192 BN LeakyReLU
  2 × 2 max-pooling, stride 2 BN
  Spatial contrasting criterion
  3 × 3 conv. 192 BN LeakyReLU
  1 × 1 conv. 192 BN LeakyReLU
  1 × 1 conv. 10 BN LeakyReLU
  global average pooling
  10-way softmax

MNIST model (input: 28 × 28 monochrome):
  5 × 5 conv. 32 ReLU
  2 × 2 max-pooling, stride 2 BN
  3 × 3 conv. 64 BN ReLU
  3 × 3 conv. 64 BN ReLU
  2 × 2 max-pooling, stride 2 BN
  Spatial contrasting criterion
  3 × 3 conv. 128 BN ReLU
  1 × 1 conv. 10 BN ReLU
  global average pooling
  10-way softmax

Figure 2: First layer convolutional filters after spatial-contrasting training
