Selfie: Self-supervised Pretraining for Image Embedding


Authors: Trieu H. Trinh, Minh-Thang Luong, Quoc V. Le

Selfie: Self-supervised Pretraining for Image Embedding

Trieu H. Trinh*  Minh-Thang Luong*  Quoc V. Le*
Google Brain
{thtrieu,thangluong,qvl}@google.com

* Equal contribution. Preprint. Under review.

Abstract

We introduce a pretraining technique called Selfie, which stands for SELF-supervised Image Embedding. Selfie generalizes the concept of masked language modeling of BERT (Devlin et al., 2019) to continuous data, such as images, by making use of the Contrastive Predictive Coding loss (Oord et al., 2018). Given masked-out patches in an input image, our method learns to select the correct patch, among other "distractor" patches sampled from the same image, to fill in the masked location. This classification objective sidesteps the need for predicting exact pixel values of the target patches. The pretraining architecture of Selfie includes a network of convolutional blocks to process patches, followed by an attention pooling network to summarize the content of unmasked patches before predicting masked ones. During finetuning, we reuse the convolutional weights found by pretraining. We evaluate Selfie on three benchmarks (CIFAR-10, ImageNet 32×32, and ImageNet 224×224) with varying amounts of labeled data, from 5% to 100% of the training sets. Our pretraining method provides consistent improvements to ResNet-50 across all settings compared to standard supervised training of the same network. Notably, on ImageNet 224×224 with 60 examples per class (5%), our method improves the mean accuracy of ResNet-50 from 35.6% to 46.7%, an improvement of 11.1 points in absolute accuracy. Our pretraining method also improves ResNet-50 training stability, especially in the low-data regime, by significantly lowering the standard deviation of test accuracies across different runs.

1 Introduction

A weakness of neural networks is that they often require a large amount of labeled data to perform well.
Although self-supervised/unsupervised representation learning (Hinton et al., 2006; Bengio et al., 2007; Raina et al., 2007; Vincent et al., 2010) has been proposed to address this weakness, most practical neural network systems today are trained with supervised learning (Hannun et al., 2014; He et al., 2016a; Wu et al., 2016). Making use of unlabeled data through unsupervised representation learning to improve the data-efficiency of neural networks remains an open challenge for the field.

Recently, language model pretraining has been suggested as a method for unsupervised representation learning in NLP (Dai and Le, 2015; Ramachandran et al., 2017; Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019). Most notably, Devlin et al. (2019) observed that bidirectional representations of input sentences are better than left-to-right or right-to-left representations alone. Based on this observation, they proposed the concept of masked language modeling, masking out words in a context to learn representations for text, also known as BERT. This is crucially achieved by replacing the LSTM architecture with the Transformer feedforward architecture (Vaswani et al., 2017). The feedforward nature of the architecture makes BERT more readily applicable to images. Yet BERT still cannot be used for images, because images are continuous objects, unlike the discrete words in sentences. We hypothesize that bridging this last gap is key to translating the success of language model pretraining to the image domain.

In this paper, we propose a pretraining method called Selfie, which stands for SELF-supervised Image Embedding. Selfie generalizes BERT to continuous spaces, such as images.
In Selfie, we continue to use a classification loss, because it is less sensitive to small changes in the image (such as the translation of an edge) than a regression loss, which is more sensitive to small perturbations. Similar to BERT, we mask out a few patches in an image and try to reconstruct the original image. To enable the classification loss, we sample "distractor" patches from the same image and ask the model to classify the right patch to fill in a target masked location. Our method can therefore be viewed as a combination of BERT (Devlin et al., 2019) and the Contrastive Predictive Coding loss (Oord et al., 2018), where the negative patches are sampled from the same image, similar to Deep InfoMax (Hjelm et al., 2018).

Experiments show that Selfie works well across many datasets, especially when the datasets have a small number of labeled examples. On CIFAR-10, ImageNet 32×32, and ImageNet 224×224, we observe consistent accuracy gains as we vary the amount of labeled data from 5% to 100% of the typical training sets. The gain tends to be bigger when the labeled set is smaller. For example, on ImageNet 224×224 with only 60 labeled examples per class, pretraining with our method improves the mean accuracy of ResNet-50 by 11.1%, going from 35.6% to 46.7%. Additional analysis on ImageNet 224×224 provides evidence that the benefit of self-supervised pretraining takes off significantly when there is at least an order of magnitude (10X) more unlabeled data than labeled data.

In addition to improving average accuracy, pretraining ResNet-50 on unlabeled data also stabilizes its training on the supervised task. We observe this by reporting the standard deviation of the final test accuracy across 5 different runs for all experiments. On CIFAR-10 with 400 examples per class, the standard deviation of the final accuracy is reduced 3 times compared to training with the original initialization method.
Similarly, on ImageNet rescaled to 32×32, our pretraining process gives an 8X reduction in test accuracy variability when training on 5% of the full training set.

2 Method

An overview of Selfie is shown in Figure 1. Similar to previous work in unsupervised/self-supervised representation learning, our method has two stages: (1) pretrain the model on unlabeled data, and then (2) finetune on the target supervised task. To make this easy to understand, let us first focus on the finetuning stage. In this paper, our goal is to improve ResNet-50, so we pretrain the first three blocks of this architecture.² Let us call this network P. The pretraining stage is therefore created to train this P network in an unsupervised fashion.

Now let us focus on the pretraining stage. During pretraining, P, a patch processing network, is applied to small patches of an image to produce one feature vector per patch, for both the encoder and the decoder. In the encoder, the feature vectors are pooled together by an attention pooling network A to produce a single vector u. In the decoder, no pooling takes place; instead, the feature vectors are sent directly to the loss computation to form an unsupervised classification task. The representations from the encoder and decoder networks are jointly trained during pretraining to predict which patch is being masked out at a particular location, among other distracting patches. In our implementation, to make sure the distracting patches are hard, we sample them from the same input image and also mask them out in the input image. Next, we describe in detail the interaction between the encoder and decoder networks during pretraining, as well as different design choices.

2.1 Pretraining Details

The main idea is to use one part of the input image to predict the rest of the image during this phase. To do so, we first sample different square patches from the input.
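To make the patch-sampling step concrete, here is a minimal NumPy sketch (not the authors' code; the helper name and shapes are illustrative) of cutting an image into the grid of non-overlapping square patches described above, using the 8×8 patch size the paper uses for 32×32 images:

```python
import numpy as np

def extract_patches(image, patch_size):
    """Cut an image of shape (H, W, C) into a grid of non-overlapping square
    patches. The patch size is assumed to divide the image evenly, as in the
    paper (8x8 patches for 32x32 images, 32x32 patches for 224x224 images).
    Returns an array of shape (num_patches, patch_size, patch_size, C)
    in row-major grid order."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)           # (gh, gw, ps, ps, c)
    return patches.reshape(gh * gw, patch_size, patch_size, c)

# A 32x32 RGB image yields a 4x4 grid of 16 patches of size 8x8.
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = extract_patches(image, 8)
```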
These patches are then routed into the encoder and decoder networks depending on whether they are randomly chosen to be masked out or not.

² In our experiments, we found that using the first three convolution blocks gives similar results to the full network (4 convolution blocks). During pretraining, therefore, only the first three blocks (i.e., ResNet-36) are used to save computation and memory.

Figure 1: An overview of Selfie. (Left) During pretraining, our method makes use of an encoder-decoder architecture: the encoder takes in a set of square patches from the input image, while the decoder takes in a different set. The encoder builds a single vector u that represents all of its input patches using a patch processing network P followed by an attention pooling network A. The decoder then takes u to predict its own input patches from their positions. Instead of predicting the actual pixel content, the decoder classifies the correct patch against negative examples (distractors) with a cross-entropy loss. In our implementation, we use the first three blocks of ResNet-50 (equivalent to ResNet-36) for P and Transformer layers (Vaswani et al., 2017) for A. Square patches are processed independently by P to produce one feature vector per patch. (Right) During finetuning, ResNet-50 is applied to the full image. Its first three blocks are initialized from the pretrained P, and the network is finetuned end-to-end.
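A minimal NumPy sketch of this random routing, splitting the patches into a visible set (for the encoder) and a masked set (for the decoder); the helper and shapes are hypothetical, not the authors' implementation:

```python
import numpy as np

def split_patches(patches, num_masked, rng):
    """Randomly choose which patches are masked out (sent to the decoder)
    and which remain visible (sent to the encoder).
    patches: (n, ...) array; returns (visible, masked, masked_idx)."""
    n = patches.shape[0]
    masked_idx = rng.choice(n, size=num_masked, replace=False)
    visible_idx = np.setdiff1d(np.arange(n), masked_idx)
    return patches[visible_idx], patches[masked_idx], masked_idx

rng = np.random.default_rng(0)
patches = np.zeros((9, 32, 32, 3))   # e.g. a 3x3 grid of 32x32 patches
visible, masked, masked_idx = split_patches(patches, num_masked=3, rng=rng)
```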
Let us take Figure 1 as an example, where Patch 1, Patch 2, Patch 5, Patch 6, Patch 7, and Patch 9 are sent into the encoder, whereas Patch 3, Patch 4, and Patch 8 are sent into the decoder. All the patches are processed by the same patch processing network P. On the encoder side, the output vectors produced by P are routed into the attention pooling network, which summarizes these representations into a single vector u. On the decoder side, P creates output vectors h1, h2, h3. The decoder then queries the encoder by adding to the output vector u the location embedding of a patch selected at random among the patches in the decoder (e.g., location 4), creating a vector v. The vector v is then used in a dot product to compute its similarity with each h. Having seen the dot products between v and the h's, the decoder has to decide which patch is most relevant to fill in the chosen location (location 4). A cross-entropy loss is applied to this classification task, and the encoder and decoder are trained jointly with gradients back-propagated from this loss.

During this pretraining process, the encoder network learns to compress the information in the input image into a vector u such that, when seeded with the location of a missing patch, it can recover that patch accurately. To perform this task successfully, the network needs to understand the global content of the full image, as well as the local content of each individual patch and their relative relationships. This ability proves useful in the downstream task of recognizing and classifying objects.

Patch sampling method. On small images of size 32×32, we use a patch size of 8×8, while on larger images of size 224×224, we use a patch size of 32×32. The patch size is intentionally selected to divide the image evenly, so that the image can be cut into a grid as illustrated in Figure 1.
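The decoder's query-and-classify step described above (form v = u plus a location embedding, take dot products with the candidate vectors h, then apply a cross-entropy loss) can be sketched in NumPy as follows. This is a simplified sketch, not the authors' code; names and shapes are illustrative assumptions:

```python
import numpy as np

def selfie_patch_loss(u, pos_embed, h, target_index):
    """Classification loss for one masked location.

    u:            (d,) summary vector produced by the encoder.
    pos_embed:    (d,) location embedding of the masked position being queried.
    h:            (k, d) feature vectors of the k candidate patches (the correct
                  patch plus distractors, all from the same image).
    target_index: index of the correct patch within h.
    Returns the softmax cross-entropy of picking the correct patch."""
    v = u + pos_embed                        # query vector for this location
    logits = h @ v                           # dot-product similarity per candidate
    logits = logits - logits.max()           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_index]

rng = np.random.default_rng(0)
d, k = 16, 3
u = rng.normal(size=d)
pos = rng.normal(size=d)
h = rng.normal(size=(k, d))
loss = selfie_patch_loss(u, pos, h, target_index=1)
```

When the correct candidate closely matches the query v, its logit dominates and the loss approaches zero; random candidates give a higher loss.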
To add more randomness to the positions of the image patches, we perform zero padding of 4 pixels on images of size 32×32 and then randomly crop the image back to its original size.

Figure 2: From left to right: improvement in predictions during our pretraining process on ImageNet 224×224. The patch size for this dataset is 32×32, which results in a grid of 7×7 patches. The masked-out patches are highlighted with a white or red border. The model is trained to put the masked-out patches back into their original slots; a red border indicates a wrong prediction from the model. Here we display four different samples with the same masking positions fixed throughout the training process. At the beginning, the orders of the patches are mostly incorrect due to the random initialization of the model. During training, the model learns to classify more correctly. As pretraining progresses from left to right, the model makes fewer errors, and the mistakes made in later stages usually confuse patches that have similar content. For example, the generic texture of the sky (row 3, step 2K), water (row 4, step 116K), or trees (row 2, step 10K) is generally interchangeable across locations.

Patch processing network. In this work, we focus on improving ResNet-50 (He et al., 2016a) on various benchmarks by pretraining it on unlabeled data. For this reason, we use ResNet-50 as the patch processing network P.³ As described before, only the first three blocks of ResNet-50 are used. Since the goal of P is to reduce any image patch to a single feature vector, we perform average pooling across the spatial dimensions of the output of ResNet-36.

Efficient implementation of mask prediction. For a more efficient use of computation, the decoder is implemented to predict multiple correct patches for multiple locations at the same time.
For example, in the example above, besides finding the right patch for location 4, the decoder also tries to find the right patches for locations 3 and 8. This way, we reuse three times as much computation from the encoder-decoder architecture. Our method is therefore analogous to solving a jigsaw puzzle in which a few patches are knocked out from the image and must be put back into their original locations. This procedure is demonstrated in Figure 2.

³ Our implementation of ResNet-50 achieves 76.9 ± 0.2% top-1 accuracy on ImageNet, which is in line with other results reported in the literature (He et al., 2016a; Zagoruyko and Komodakis, 2016; Huang et al., 2017).

2.2 Attention Pooling

In this section, we describe in detail the attention pooling network A introduced in Section 2.1 and the way positional embeddings are built for images in our work.

Transformer as pooling operation. We make use of Transformer layers to perform pooling. Given a set of input vectors {h1, h2, ..., hn} produced by applying the patch processing network P to different patches, we want to pool them into a single vector u that represents the entire image. There are multiple choices at this stage, including max pooling and average pooling. Here, we consider these choices special cases of the attention operation (where the softmax has a temperature approaching zero or infinity, respectively) and let the network learn to pool by itself. To do this, we learn a vector u0 with the same dimension as the h's and feed them together through the Transformer layers:

    u, h1_output, h2_output, ..., hn_output = TransformerLayers(u0, h1, h2, ..., hn)

The output u corresponding to input u0 is the pooling result; the hi_output are discarded for all i.
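A single-layer, single-head simplification of this learned pooling can be sketched as follows (the paper uses full Transformer layers; this sketch keeps only the attention step, and all weights and shapes are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(h, u0, w_q, w_k, w_v):
    """Prepend a learned vector u0 to the patch vectors h (n, d), run one
    self-attention step over the n+1 vectors, and return the output at u0's
    position as the pooled summary u. The outputs at the h positions are
    discarded, as in the paper."""
    x = np.vstack([u0[None, :], h])                  # (n+1, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    att = softmax(q @ k.T / np.sqrt(k.shape[-1]))    # scaled dot-product attention
    out = att @ v
    return out[0]

rng = np.random.default_rng(1)
n, d = 6, 8
h = rng.normal(size=(n, d))
u0 = rng.normal(size=d)
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
u = attention_pool(h, u0, w_q, w_k, w_v)
```

With uniform attention weights this reduces to average pooling, and with a sharply peaked softmax it approaches max pooling, which is the sense in which both are special cases of this operation.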
Attention block. Each self-attention block follows the design in BERT (Devlin et al., 2019), in which a self-attention layer is followed by two fully connected layers that sequentially project the input vector to an intermediate size and back to the original hidden size. The only non-linearity used is a GeLU, applied at the intermediate layer. We apply dropout with rate 0.1 on the output, followed by a residual connection from the block's input and, finally, layer normalization.

Positional embeddings. For images of size 32×32, we learn a positional embedding vector for each of the 16 patches of size 8×8. Images of size 224×224, on the other hand, are divided into a grid of 7×7 patches of size 32×32. Since there are significantly more positions in this case, we decompose each positional embedding into two components: a row embedding and a column embedding. The resulting embedding is the sum of these two components. For example, instead of learning 49 positional embeddings, we only need to learn 7 + 7 = 14 positional embeddings. This greatly reduces the number of parameters and helps regularize the model.

2.3 Finetuning Details

As mentioned above, in this phase the first three convolution blocks of ResNet-50 are initialized from the pretrained patch processing network. The last convolution block of ResNet-50 is initialized with the standard initialization method. ResNet-50 is then applied to the full image and finetuned end-to-end.

3 Experiments and Results

In the following sections, we investigate the performance of our proposed pretraining method, Selfie, on standard image datasets such as CIFAR-10 and ImageNet. To simulate the scenario where we have much more unlabeled data than labeled data, we sample small fractions of these datasets and use them as labeled datasets, while the whole dataset is used as unlabeled data for the pretraining task.
3.1 Datasets

We consider three different datasets: CIFAR-10, ImageNet resized to 32×32, and ImageNet at its original size (224×224). For each of these datasets, we simulate a scenario where an additional amount of unlabeled data is available beyond the labeled data used for the original supervised task. For that purpose, we create four different subsets of the supervised training data with approximately 5%, 10%, 20%, and 100% of the total number of training examples. On CIFAR-10, we replace the 10% subset with one of 4,000 training examples (8%), as this setting is used in (Oliver et al., 2018; Cubuk et al., 2018). In all cases, the whole training set is used for pretraining (50K images for CIFAR-10 and 1.2M images for ImageNet).

3.2 Experimental Setup

Model architecture. We reuse all settings for the ResNet convolution blocks from ResNet-50v2, including hidden sizes and initialization (He et al., 2016b). Batch normalization is performed at the beginning of each residual block. For self-attention layers, we apply dropout on the attention weights and before each residual connection, with a drop rate of 10%. The sizes of all of our models are chosen such that each architecture has roughly 25M parameters and 50 layers, the same size and depth as a standard ResNet-50. For attention pooling, three attention blocks with a hidden size of 1024, an intermediate size of 640, and 32 attention heads are added on top of the patch processing network P.

Model training. Both the pretraining and finetuning tasks are trained using Momentum Optimizer with a Nesterov coefficient of 0.9. We use a batch size of 512 for CIFAR-10 and 1024 for ImageNet. The learning rate is scheduled to decay with a cosine shape, with a warm-up phase of 100 steps; the maximum learning rate is tuned over the range [0.01, 0.02, 0.05, 0.1, 0.2, 0.4]. We do not use any extra regularization besides an L2 weight decay of magnitude 0.0001. The full training is done in 120,000 steps.
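The learning-rate schedule just described can be sketched as follows. The paper states only a cosine-shaped decay over 120,000 steps with a 100-step warm-up; the linear warm-up shape below is an assumption:

```python
import math

def learning_rate(step, max_lr, total_steps=120_000, warmup_steps=100):
    """Cosine-decay learning-rate schedule with a linear warm-up phase.
    Ramps from 0 to max_lr over warmup_steps, then follows a half cosine
    down to 0 at total_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

lr_start = learning_rate(0, max_lr=0.1)        # 0.0 at the first step
lr_peak = learning_rate(100, max_lr=0.1)       # max_lr right after warm-up
lr_end = learning_rate(120_000, max_lr=0.1)    # decays to 0 at the end
```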
Furthermore, as described in Section 2.1, we divide the images into non-overlapping square patches of size 8×8 or 32×32 during pretraining and sample a fraction p of these patches to predict the remainder. We try two values of p, 75% and 50%, and tune it as a hyper-parameter.

Reporting results. For each reported experiment, we first tune its hyper-parameters by using 10% of the training data as a validation set and training the neural network on the remaining 90%. Once we obtain the best hyper-parameter setting, the neural network is retrained on 100% of the training data 5 times with different random seeds. We report the mean and standard deviation of these five runs.

3.3 Results

We report the accuracies with and without pretraining across different labeled dataset sizes in Table 1. As can be seen from the table, Selfie yields consistent improvements in test accuracy across all three benchmarks (CIFAR-10, ImageNet 32×32, ImageNet 224×224) with varying amounts of labeled data. Notably, on ImageNet 224×224, a gain of 11.1% in absolute accuracy is achieved when we use only 5% of the labeled data.

We find that the pretrained models usually converge to a higher training loss but generalize significantly better on the test set than models with random initialization. This highlights the strong regularization effect of our proposed pretraining procedure. An example is shown in Figure 3, when training on the 10% subset of ImageNet 224×224.

Figure 3: Regularization effect of our pretraining method on the 10% subset of ImageNet 224×224.
We observe the values of training loss, test loss, and test accuracy during 120K steps of supervised training, comparing a randomly initialized model to one initialized from pretrained weights.

Besides the gain in mean accuracy, training stability is also enhanced, as evidenced by the reduction in standard deviation in almost all experiments. When the unlabeled dataset is the same as the labeled dataset (Labeled Data Percentage = 100%), the gain becomes small, as expected.

Baseline comparison. We want to emphasize that our ResNet baselines are very strong compared to those in (He et al., 2016a). In particular, on CIFAR-10, our ResNet with pure supervised learning on 100% labeled data achieves 95.5% accuracy, which is better than the 94.8% achieved by DenseNet (Huang et al., 2017) and close to the 95.6% obtained by Wide-ResNet (Zagoruyko and Komodakis, 2016). Likewise, on ImageNet 224×224, our baseline reaches 76.9% accuracy, which is on par with the result reported in (He et al., 2016a) and surpasses the 76.2% accuracy of DenseNet (Huang et al., 2017). Our pretrained models further improve on these strong baselines.

Table 1: Test accuracy (%) of ResNet-50 with and without pretraining across datasets and sizes.

                                Labeled Data Percentage
                        5%          8%          20%         100%
CIFAR-10
  Supervised            75.9 ± 0.7  79.3 ± 1.0  88.3 ± 0.3  95.5 ± 0.2
  Selfie Pretrained     75.9 ± 0.4  80.3 ± 0.3  89.1 ± 0.5  95.7 ± 0.1
  Δ                      0.0        +1.0        +0.8        +0.2

                        5%          10%         20%         100%
ImageNet 32×32
  Supervised            13.1 ± 0.8  25.9 ± 0.5  32.7 ± 0.4  55.7 ± 0.6
  Selfie Pretrained     18.3 ± 0.1  30.2 ± 0.5  33.5 ± 0.2  56.4 ± 0.6
  Δ                      +5.2       +4.3        +0.8        +0.7

ImageNet 224×224
  Supervised            35.6 ± 0.7  59.6 ± 0.2  65.7 ± 0.2  76.9 ± 0.2
  Selfie Pretrained     46.7 ± 0.4  61.9 ± 0.2  67.1 ± 0.2  77.0 ± 0.1
  Δ                      +11.1      +2.3        +1.4        +0.1
Contrast to other works. Note that our classification accuracy of 77.0% on ImageNet 224×224 is also significantly better than previously reported results in unsupervised representation learning (Pathak et al., 2016; Oord et al., 2018; Kolesnikov et al., 2019). For example, in a comprehensive study by Kolesnikov et al. (2019), the best accuracy on ImageNet across all pretraining methods is around 55.2%, which is well below the accuracy of our models. Similarly, the best accuracies reported by Context Autoencoders (Pathak et al., 2016) and Contrastive Predictive Coding (Oord et al., 2018) are 56.5% and 48.7%, respectively. We suspect that such poor performance is due to the fact that past works did not finetune the representations learned by unsupervised learning.

Concurrently with our work, there are also other attempts at using unlabeled data in semi-supervised learning settings. Hénaff et al. (2019) showed the effectiveness of pretraining in the low-data regime using a cross-entropy loss with negative samples similar to ours. However, their results are not comparable to ours because they employed a much larger network, ResNet-171, compared to the ResNet-50 architecture that we use throughout this work. Consistency training with label propagation has also achieved remarkable results. For example, the recent Unsupervised Data Augmentation (Xie et al., 2019) reported 94.7% accuracy on the 8% subset of CIFAR-10. We expect that our self-supervised pretraining method can be combined with label propagation to provide additional gains, as shown in (Zhai et al., 2019).

Finetuning on ResNet-36 + attention pooling. In the previous experiments, we finetune ResNet-50, which is essentially ResNet-36 with one convolution block on top, dropping the attention pooling network used in pretraining. We also explore finetuning on ResNet-36 + attention pooling and find that it slightly outperforms finetuning on ResNet-50 in some cases.⁴ More in Section 4.2.
Finetuning sensitivity and mismatch to pretraining. Despite the encouraging results, we found difficulties in transferring pretrained models across tasks, such as from ImageNet to CIFAR. For the 100% subset of ImageNet 224×224, additional tuning of the pretraining phase using a development set is needed to achieve the result reported in Table 1. There is also a slight mismatch between our pretraining and finetuning settings: during pretraining, we process image patches independently, whereas during finetuning the model sees the image as a whole. We hope to address these concerns in subsequent work.

⁴ We chose to use ResNet-50 for finetuning as it is faster and facilitates better comparison with past work.

4 Analysis

4.1 Pretraining benefits more when there is less labeled data

In this section, we conduct further experiments to better understand our method, Selfie, especially how it performs as we decrease the amount of labeled data. To do so, we evaluate test accuracy when finetuning on the 2%, 5%, 10%, 20%, and 100% subsets of ImageNet 224×224, as well as the accuracy of purely supervised training at each of the five marks. Similar to previous sections, we average results across five different runs for a more stable assessment.

As shown in Figure 4, the mean accuracy of ResNet improves drastically when there are at least an order of magnitude more unlabeled images than labeled ones (i.e., finetuning on the 10% subset). With less unlabeled data, the gain quickly diminishes: at the 20% mark there is still a slight improvement of 1.4% in mean accuracy, while at the 100% mark the positive gain becomes minimal, 0.1%.
Figure 4: Pretraining with Selfie benefits the most when there is much more unlabeled data than labeled data. Left: mean accuracy across five runs on ImageNet 224×224 for a purely supervised model versus one with pretraining. Right: mean accuracy gain from pretraining. The improvement quickly diminishes beyond the 10% mark, when there is 10 times more data than the labeled set.

4.2 Self-attention as the last layer helps finetuning performance

As mentioned in Section 3.3, we explore training ResNet-36 + attention pooling (both reused from the pretraining phase) on CIFAR-10 and ImageNet 224×224 in two settings: limited labeled data and full access to the labeled set. The architectures of the two networks are shown in Figure 5.

Figure 5: (Left) ResNet-50 architecture. (Right) ResNet-36 + attention pooling architecture.

Experimental results for these two architectures, with and without pretraining, are reported in Table 2. With pretraining on unlabeled data, ResNet-36 + attention pooling outperforms ResNet-50 on both datasets with limited data. On the full training set, this hybrid convolution-attention architecture gives a 0.5% gain on ImageNet 224×224. These results show great promise for this hybrid architecture, which we plan to explore further in future work.

Table 2: Accuracy (%) of ResNet-50 and ResNet-36 + attention pooling after finetuning from pretrained weights found by Selfie, on limited and full labeled sets.
The gain (Δ) indicates how much improvement is made by using attention pooling in place of the last convolution block.

Method            ResNet-50     ResNet-36 + attention pooling    Δ
CIFAR-10 8%       80.3 ± 0.3    81.3 ± 0.1                       +1.0
ImageNet 10%      61.8 ± 0.2    62.1 ± 0.2                       +0.3
CIFAR-10 100%     95.7 ± 0.1    95.4 ± 0.2                       -0.3
ImageNet 100%     77.0 ± 0.1    77.5 ± 0.1                       +0.5

5 Related Work

Unsupervised representation learning for text. Much of the success in unsupervised representation learning has been in NLP. First, using language models to learn embeddings for words is commonplace in many NLP applications (Mikolov et al., 2013; Pennington et al., 2014). Building on this success, similar methods were proposed for sentence and paragraph representations (Le and Mikolov, 2014; Kiros et al., 2015). Recent successful methods, however, focus on the use of language models or "masked" language models as pretraining objectives (Dai and Le, 2015; Ramachandran et al., 2017; Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019). A principle common to all of these successful methods is the idea of context prediction: given some adjacent data and their locations, predict the missing words.

Unsupervised representation learning for images and audio. Recent successful methods in unsupervised representation learning for images can be divided into four categories: 1) predicting the rotation angle of an original image (e.g., Gidaris et al. (2018)); 2) predicting whether a perturbed image belongs to the same category as an unperturbed one (Exemplar) (e.g., Dosovitskiy et al. (2016)); 3) predicting the relative locations of patches (e.g., Doersch et al. (2015)) or solving jigsaw puzzles (e.g., Noroozi and Favaro (2016)); and 4) inpainting (e.g., Huang et al. (2014); Pathak et al. (2016); Iizuka et al. (2017)). Their success, however, has been limited to small datasets or small settings, and some resort to expensive joint training to surpass their purely supervised counterparts.
On the challenging ImageNet benchmark, our method is the first to report gains with and without additional unlabeled data, as shown in Table 1. Selfie is also closely related to denoising autoencoders (Vincent et al., 2010), where various kinds of noise are applied to the input and the model is required to reconstruct the clean input. The main difference between our method and denoising autoencoders is how the reconstruction step is done: our method focuses only on the missing patches and tries to select the right patch among distracting patches. Our method also makes use of Contrastive Predictive Coding (Oord et al., 2018), where negative sampling was likewise used to classify continuous objects in a sequence. Concurrently with our work, wav2vec (Schneider et al., 2019) also employs Contrastive Predictive Coding to learn representations of audio data during pretraining and achieves improvements on downstream tasks.

Semi-supervised learning. Semi-supervised learning is another branch of representation learning methods that takes advantage of the existence of labeled data. Unlike pure unsupervised representation learning, semi-supervised learning does not need the separate finetuning stage, common in unsupervised representation learning, to improve accuracy. Successful recent semi-supervised learning methods for deep learning are based on consistency training (Miyato et al., 2018; Sajjadi et al., 2016; Laine and Aila, 2016; Verma et al., 2019; Xie et al., 2019).

6 Conclusion

We introduce Selfie, a self-supervised pretraining technique that generalizes the concept of masked language modeling to continuous data, such as images. Given the masked-out position of a square patch in the input image, our method learns to select the target masked patch from negative samples obtained from the same image. This classification objective therefore sidesteps the need to predict the exact pixel values of the target patches.
Experiments show that Selfie achieves significant gains when the labeled set is small compared to the unlabeled set. Besides the gain in mean accuracy across 9 different runs, the standard deviation of results is also reduced thanks to a better initialization from our pretraining method. Our analysis demonstrates the revived potential of unsupervised pretraining over supervised learning, and that a hybrid convolution-attention architecture shows promise.

References

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pages 153–160.

Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. 2018. AutoAugment: Learning augmentation policies from data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Carl Doersch, Abhinav Gupta, and Alexei A. Efros. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430.

Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. 2016. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734–1747.

Spyros Gidaris, Praveer Singh, and Nikos Komodakis. 2018. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations.
Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer.

Olivier J. Hénaff, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aäron van den Oord. 2019. Data-efficient image recognition with contrastive predictive coding. CoRR, abs/1905.09272.

Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708.

Jia-Bin Huang, Sing Bing Kang, Narendra Ahuja, and Johannes Kopf. 2014. Image completion using planar structure guidance. ACM Transactions on Graphics (TOG), 33(4):129.

Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2017. Globally and locally consistent image completion. ACM Trans. Graph., 36(4):107:1–107:14.
Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. 2019. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005.

Samuli Laine and Timo Aila. 2016. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Mehdi Noroozi and Paolo Favaro. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer.

Avital Oliver, Augustus Odena, Colin A. Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. 2018. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint.

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. 2007. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759–766. ACM.

Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pages 1163–1171.

Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. 2019. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. 2019. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848.

Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. In The British Machine Vision Conference.

Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. 2019. S4L: Self-supervised semi-supervised learning. arXiv preprint arXiv:1905.03670.
