Pruning Convolutional Neural Networks with Self-Supervision


Figure: Transfer learning performance of pruned representations from self-supervised methods as we vary the pruning rate. We show performance when training linear SVMs on VOC2007 and linear classifiers with mini-batch stochastic gradient descent on Places205 and ImageNet. For reference, on the VOC07 dataset, the performance of supervised ImageNet features and of random features is respectively 88 mAP and 7.7 mAP (numbers from Goyal et al.). We do not prune NPID networks at extreme rates (i.e. less than 10% of the weights remaining) because this is too computationally intensive.

Approach

In this work, our goal is to study how a standard pruning method, primarily developed for supervised learning, applies to networks trained without annotations. We use well-established methods from the unstructured pruning and self-supervised learning literature to do so. Specifically, we adopt the magnitude-based unstructured iterative pruning process of Han et al. to extract sparse subnetworks from an over-parameterized network. Following recent works, we reset the resulting subnetworks to a selected set of weights or randomly re-initialize them. The self-supervised tasks we consider are the rotation classification of Gidaris et al. and the “Exemplar” approach of Dosovitskiy et al. We provide details about our implementation at the end of this section.
Preliminaries. We represent a subnetwork $`(W, m)`$ by the association of a mask $`m`$ in $`\{0, 1\}^d`$ and weights $`W`$ in $`\mathbb{R}^{d}`$. The convolutional network, or convnet, function associated with a subnetwork is denoted by $`f_{m \odot W}`$, where $`\odot`$ is the element-wise product. We refer to the vector obtained at the penultimate layer of a convnet as the feature or representation. Such a representation is pruned if $`m`$ has at least one zero component. The subnetwork weights $`W`$ may be pre-trained, such that the corresponding feature function can be transferred to downstream tasks. Alternatively, the subnetwork weights $`W`$ may be initialization weights, such that the corresponding convnet $`f_{m \odot W}`$ can be re-trained from scratch.
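The subnetwork formalism above is easy to make concrete. The following is a minimal NumPy sketch (illustrative only; the variable names are ours, not from the paper) of a subnetwork $`(W, m)`$ and its effective parameters $`m \odot W`$:

```python
import numpy as np

# A subnetwork (W, m): weights W in R^d and a binary mask m in {0, 1}^d.
rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal(d)           # subnetwork weights
m = np.ones(d, dtype=np.int8)        # all-ones mask: unpruned network
m[[2, 5]] = 0                        # any zero component makes the representation pruned

# The parameters actually used by the convnet function f_{m ⊙ W}.
effective = m * W
print(effective)
```

Whether `W` holds pre-trained weights (for transfer) or initialization weights (for re-training from scratch), the mask is applied the same way.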

Unstructured magnitude-based pruning

Pruned mask. Han et al. propose an algorithm to prune networks by estimating which weights are important. This approach compresses networks by alternately minimizing a training objective and pruning the network parameters with the smallest magnitude, hence progressively reducing the network size. At each pruning iteration, the network is first trained to convergence, yielding weights $`W^*`$. Then, the mask $`m`$ is updated by setting to zero the elements already masked plus the smallest elements of $`\{ |W^*[j]| : m[j] \neq 0 \}`$.
Weight resetting. Frankle and Carbin refine this approach and propose to also find a good initialization $`W`$ for each subnetwork such that it may be re-trained from scratch. On small-scale computer vision datasets and with shallow architectures, they indeed show that sub-architectures found with iterative magnitude pruning can be re-trained from the start, as long as their weights are reset to their initial values. Further experiments, however, have shown that this observation does not exactly hold for more challenging benchmarks such as ImageNet. Specifically, Frankle et al. found that resetting weights to their value from an early stage in optimization can still lead to good trainable subnetworks. Formally, at each pruning iteration, the subnetwork is reset to weights $`W_k`$ obtained after $`k`$ weight updates from the first pruning iteration. Liu et al., on the other hand, argue that the mask $`m`$ alone is responsible for the good performance of the subnetwork and thus its weights $`W`$ may be randomly drawn at initialization. In our work, we consider both weight initialization schemes: the winning tickets of Frankle et al. or random re-initialization.
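Putting the mask update and the two resetting schemes together, the overall iterative procedure can be sketched as follows. This is our own simplified, dense-vector illustration: `train` is a stand-in for training $`f_{m \odot W}`$ to convergence, and the function and argument names are hypothetical, not from any of the cited implementations:

```python
import numpy as np

def iterative_pruning(W0, train, n_iters=3, rate=0.2,
                      reset="rewind", W_k=None, rng=None):
    """Iterative magnitude pruning with two weight-resetting schemes:
    - "rewind": reset surviving weights to W_k, the weights after k updates
      of the first pruning iteration (winning tickets with late resetting);
    - "random": redraw surviving weights at each iteration."""
    rng = rng or np.random.default_rng(0)
    d = W0.size
    m = np.ones(d, dtype=np.int8)
    W = W0.copy()
    for _ in range(n_iters):
        W_star = train(W, m)                             # train f_{m ⊙ W} to convergence
        surviving = np.flatnonzero(m)
        k = int(rate * surviving.size)
        cut = surviving[np.argsort(np.abs(W_star[surviving]))[:k]]
        m[cut] = 0                                       # update the mask
        if reset == "rewind":
            W = (W_k if W_k is not None else W0).copy()  # reset to early weights
        else:
            W = rng.standard_normal(d)                   # random re-initialization
    return m, W

# Toy "training" that rescales weights without changing their magnitude ranking.
toy_train = lambda W, m: m * W * 1.1
W0 = np.random.default_rng(1).standard_normal(20)
m, W = iterative_pruning(W0, toy_train)
print(int(m.sum()))
```

With 20 weights and three iterations at a 20% rate, 4, then 3, then 2 weights are cut, leaving 11.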

Self-supervised learning

We prune networks without supervision by simply setting the training objective in the method of Han et al. to a self-supervised pretext task. We consider two prominent self-supervised methods: RotNet and the Exemplar approach of Dosovitskiy et al., following the implementation of Doersch et al. RotNet consists of predicting which rotation was applied to the input image among a set of $`4`$ possible large rotations: $`\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}`$. Exemplar is a classification problem where each image and its transformations form a class, leading to as many classes as there are training examples. We choose these two self-supervised tasks because they have opposite characteristics: RotNet encourages features that discriminate between data transformations and has a small number of classes, while Exemplar encourages invariance to data transformations and has a large output space. We also investigate the non-parametric instance discrimination (NPID) approach of Wu et al., which is a variant of the Exemplar method that uses a non-parametric softmax layer and a memory bank of feature vectors.
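Generating the RotNet pretext labels is mechanical: each image yields four rotated copies, each labeled by its rotation index. A minimal sketch (the helper name `rotation_batch` is ours, not from the RotNet implementation):

```python
import numpy as np

def rotation_batch(img):
    """Return the four rotated copies of `img` (H, W, C) and their labels
    0..3, corresponding to rotations of 0°, 90°, 180°, 270°."""
    rotated = np.stack([np.rot90(img, k=k, axes=(0, 1)) for k in range(4)])
    labels = np.arange(4)                # 4-way classification target
    return rotated, labels

img = np.zeros((32, 32, 3), dtype=np.uint8)   # dummy square image
batch, labels = rotation_batch(img)
print(batch.shape, labels)
```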

Figure: Transfer learning performance of representations pruned during pre-training or during transfer, on the VOC07 classification task with full fine-tuning, as we vary the pruning rate. For reference, fine-tuning unpruned supervised ImageNet features gives 90.3 mAP while training from random initialization gives 48.4 mAP (numbers from Goyal et al.).

Implementation

Pruning. We closely follow the winning tickets setup of Morcos et al. At each pruning iteration, we globally prune $`20\%`$ of the remaining weights. The parameters of the last fully-connected layer and of the batch-norm layers are left unpruned. We apply up to $`30`$ pruning iterations to reach extreme pruning rates where only $`0.1\%`$ of the weights remain. Overall we report results for $`14`$ different pruning rates ranging from $`20\%`$ to $`99.9\%`$, thus covering both moderate and extreme sparsity. The late resetting parameter is set to $`3 \times 1.3`$M samples, which corresponds to $`3`$ epochs on full ImageNet. More details about this late resetting parameter are in the supplementary material.
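The arithmetic behind this schedule is worth making explicit: pruning 20% of the *remaining* weights at each iteration leaves a fraction $`0.8^t`$ of the weights after $`t`$ iterations, which is why roughly 30 iterations reach the extreme regime of about 0.1% remaining:

```python
# Fraction of weights remaining after t global-pruning iterations at 20% each.
remaining = [0.8 ** t for t in range(31)]
print(f"{remaining[1]:.3f}")    # after 1 iteration: 0.800 (20% pruned)
print(f"{remaining[30]:.5f}")   # after 30 iterations: ~0.00124, i.e. ~0.1% remaining
```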
Datasets and models. In this work, we choose to mostly work with the ImageNet dataset, though we also report some results for CIFAR-10. We use ResNet-50 on ImageNet and ResNet-18 on CIFAR-10. In the supplementary material, we also report results with more architectures: AlexNet and the modified VGG-19 of Morcos et al. (the multilayer perceptron (MLP) head is replaced by a single fully connected layer). Our experiments on ImageNet are computationally demanding since pruning can involve training deep networks from scratch up to $`31`$ times. For this reason, we distribute most of our runs across several GPUs. Models are trained with weight decay and stochastic gradient descent with a momentum of $`0.9`$. We use PyTorch version 1.0 for all our experiments. Full training details for each of our experiments are in the supplementary material. We run each experiment with $`6`$ (CIFAR-10) or $`3`$ (ImageNet) random seeds, and show the mean and standard error of the accuracy.

Conclusion

Our work takes a first step toward studying the pruning of networks trained with self-supervised tasks. We believe this is an emerging and important problem given the recent rise of highly over-parametrized unsupervised pre-trained networks. In our study, we provide several empirical insights about pruning self-supervised networks. Indeed, we show that a well-established pruning method for supervised learning actually works well for self-supervised networks too, in the sense that the quality of the pruned representation is not deteriorated and the pruned masks can be re-trained to good performance on ImageNet labels. This is somewhat surprising given that labels are not seen during pruning and that the goal of the pruning algorithm we use is to preserve performance on the training task, which is agnostic to downstream tasks and ground-truth labels.

We also note several limitations of our study. First, we have studied pruning only through the lens of unstructured magnitude-based pruning. Future work might generalize our observations to a wider range of pruning methods, in particular structured pruning. Second, we have observed while conducting our experiments that winning-ticket initializations are particularly sensitive to the late resetting parameter (see the supplementary material for a discussion of our choice of rewind parameter). The notion of “early in training” is somewhat ill-defined: network weights change much more during the first epochs than during the last ones, so weights reset to an early point in optimization may already contain a substantial amount of learned information. Third, we find that pruning large modern architectures on CIFAR-10 should be done with caution, as these networks tend to be sparse at convergence, making unstructured pruning at rates below $`80\%`$ particularly easy.