Pruning Convolutional Neural Networks with Self-Supervision
Approach
In this work, our goal is to study how a standard pruning method
primarily developed for supervised learning applies to networks trained
without annotations. We use well-established methods from the
unstructured pruning and self-supervised literatures to do so.
Specifically, we adopt the magnitude-based unstructured iterative
pruning process of Han et al. to extract sparse subnetworks from an
over-parameterized network. Following recent works, we reset the
resulting subnetworks to a selected set of weights or randomly
re-initialize them. The self-supervised tasks we consider are the rotation
classification of Gidaris et al. and the “Exemplar” approach of
Dosovitskiy et al. We provide details about our implementation at the end of
this section.
Preliminaries. We represent a subnetwork $`(W, m)`$ as the
association of a mask $`m \in \{0, 1\}^d`$ with weights
$`W \in \mathbb{R}^{d}`$. The convolutional network, or convnet, function
associated with a subnetwork is denoted by $`f_{m \odot W}`$, where
$`\odot`$ is the element-wise product. We refer to the vector obtained at
the penultimate layer of a convnet as the feature or representation. Such a
representation is pruned if $`m`$ has at least one zero component. The
subnetwork weights $`W`$ may be pre-trained, in which case the
corresponding feature function can be transferred to downstream tasks.
Alternatively, the subnetwork weights $`W`$ may be initialization
weights, in which case the corresponding convnet $`f_{m \odot W}`$ can be
re-trained from scratch.
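For concreteness, the mask application $`m \odot W`$ can be sketched in a few lines (a minimal NumPy illustration; the array shapes and function name are ours, not from the paper):

```python
import numpy as np

def masked_weights(W: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Apply a binary pruning mask element-wise: m ⊙ W."""
    assert m.shape == W.shape and set(np.unique(m)) <= {0, 1}
    return m * W

# Toy example: a 2x3 weight matrix with two pruned entries.
W = np.array([[0.5, -1.2, 0.3],
              [2.0,  0.1, -0.7]])
m = np.array([[1, 0, 1],
              [1, 1, 0]])
sparse_W = masked_weights(W, m)
# Any representation computed with sparse_W is "pruned", since m has zeros.
```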
Unstructured magnitude-based pruning
Pruned mask. Han et al. propose an algorithm to prune networks by
estimating which weights are important. The approach compresses
networks by alternately minimizing a training objective
and pruning the network parameters with the smallest magnitude, thus
progressively reducing the network size. At each pruning iteration, the
network is first trained to convergence, yielding weights
$`W^*`$. Then, the mask $`m`$ is updated by keeping the already-masked
elements at zero and additionally zeroing out the smallest elements of
$`\{\, |W^*[j]| \,:\, m[j] \neq 0 \,\}`$.
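One such mask update can be sketched as follows (a simplified NumPy illustration of the magnitude criterion; `update_mask` and its arguments are hypothetical names, and the actual procedure operates globally over all prunable layers of a trained network):

```python
import numpy as np

def update_mask(W_star: np.ndarray, m: np.ndarray,
                prune_frac: float = 0.2) -> np.ndarray:
    """One magnitude-pruning iteration: zero out the smallest-magnitude
    surviving weights, keeping already-pruned entries at zero."""
    surviving = np.flatnonzero(m)                      # indices with m[j] != 0
    n_prune = int(round(prune_frac * surviving.size))  # prune a fraction of survivors
    # Rank surviving weights by |W*| and drop the smallest ones.
    order = np.argsort(np.abs(W_star.ravel()[surviving]))
    new_m = m.copy().ravel()
    new_m[surviving[order[:n_prune]]] = 0
    return new_m.reshape(m.shape)

W_star = np.array([0.05, -2.0, 0.6, -0.01, 1.5])
m = np.array([1, 1, 1, 1, 0])          # last entry already pruned
m2 = update_mask(W_star, m, prune_frac=0.25)  # drops 1 of the 4 survivors
```

Here the smallest surviving magnitude is $`|{-0.01}|`$, so its mask entry is the one set to zero.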
Weight resetting. Frankle and Carbin refine this approach and
propose to also find a good initialization $`W`$ for each subnetwork
such that it may be re-trained from scratch. On small-scale computer
vision datasets and with shallow architectures, they indeed show that
sub-architectures found with iterative magnitude pruning can be
re-trained from the start, as long as their weights are reset to their
initial values. Further experiments, however, have shown that this
observation does not exactly hold for more challenging benchmarks such
as ImageNet. Specifically, Frankle et al. found that resetting weights to
their values from an early stage of optimization can still lead to good
trainable subnetworks. Formally, at each pruning iteration, the
subnetwork is reset to the weights $`W_k`$ obtained after $`k`$ weight
updates during the first pruning iteration. Liu et al., on the other hand,
argue that the mask $`m`$ alone is responsible for the good performance
of the subnetwork, and thus its weights $`W`$ may be randomly drawn at
initialization. In our work, we consider both weight initialization
schemes: the winning tickets of Frankle et al. and random re-initialization.
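The two resetting schemes can be illustrated schematically (a toy Python sketch; `reset_weights` and the checkpoint layout are our own simplifications, not the authors' code):

```python
import copy
import random

def reset_weights(checkpoints, scheme, rng=random):
    """Choose initial weights for re-training a subnetwork.

    checkpoints: dict mapping number of weight updates k -> weights W_k,
                 saved during the first pruning iteration.
    scheme: 'winning_ticket' rewinds to an early checkpoint W_k (with k > 0,
            as suggested for ImageNet-scale benchmarks);
            'random' draws fresh weights, keeping only the mask.
    """
    if scheme == 'winning_ticket':
        k = min(kk for kk in checkpoints if kk > 0)  # earliest saved W_k, k > 0
        return copy.deepcopy(checkpoints[k])
    elif scheme == 'random':
        template = next(iter(checkpoints.values()))
        return [rng.gauss(0.0, 0.01) for _ in template]  # toy re-initialization
    raise ValueError(scheme)

# Toy checkpoints: weights after 0, 1000 and 2000 updates.
ckpts = {0: [0.1, 0.2], 1000: [0.3, 0.4], 2000: [0.5, 0.6]}
W_reset = reset_weights(ckpts, 'winning_ticket')  # rewinds to W_1000
```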
Self-supervised learning
We prune networks without supervision by simply setting the training objective in the method of Han et al. to a self-supervised pretext task. We consider two prominent self-supervised methods: the RotNet approach of Gidaris et al. and the Exemplar approach of Dosovitskiy et al., following the implementation of Doersch . RotNet consists of predicting which rotation was applied to the input image among a set of $`4`$ possible large rotations: $`\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}`$. Exemplar is a classification problem where each image and its transformations form a class, leading to as many classes as there are training examples. We choose these two self-supervised tasks because they have opposite characteristics: RotNet encourages features that are discriminative with respect to data transformations and has a small number of classes, while Exemplar encourages invariance to data transformations and has a large output space. We also investigate the non-parametric instance discrimination (NPID) approach of Wu et al., a variant of the Exemplar method that uses a non-parametric softmax layer and a memory bank of feature vectors.
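The RotNet pretext task reduces to generating rotated copies of each image and using the quarter-turn index as the class label, which can be sketched as follows (a minimal NumPy illustration; the function name and shapes are ours):

```python
import numpy as np

def rotation_batch(images: np.ndarray):
    """Build RotNet training pairs: each image is rotated by 0/90/180/270
    degrees and labeled with its rotation index (a 4-way classification)."""
    rotated, labels = [], []
    for img in images:              # img: (H, W) or (H, W, C), H == W here
        for k in range(4):          # k quarter-turns
            rotated.append(np.rot90(img, k))
            labels.append(k)        # class in {0, 1, 2, 3}
    return np.stack(rotated), np.array(labels)

# Two toy 4x4 "images" yield 8 training pairs.
imgs = np.arange(2 * 4 * 4).reshape(2, 4, 4).astype(np.float32)
x, y = rotation_batch(imgs)  # x: (8, 4, 4), y: [0, 1, 2, 3, 0, 1, 2, 3]
```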
Implementation
Pruning. We closely follow the winning tickets setup of Morcos et al. At
each pruning iteration, we globally prune $`20\%`$ of the remaining
weights. The parameters of the last fully-connected layer and of the
batch-norm layers are left unpruned. We apply up to $`30`$ pruning
iterations to reach extreme pruning rates where only $`0.1\%`$ of the
weights remain. Overall, we report results for $`14`$ different pruning
rates ranging from $`20\%`$ to $`99.9\%`$, thus covering both moderate
and extreme sparsity. The weight resetting parameter is set to
$`3 \times 1.3`$M samples, which corresponds to $`3`$ epochs on full
ImageNet. More details about this late resetting parameter are given in
the supplementary material.
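This schedule is internally consistent: pruning $`20\%`$ of the surviving weights per iteration leaves a fraction $`0.8^t`$ of the network after $`t`$ iterations, reaching roughly $`0.1\%`$ after $`30`$ iterations. This can be checked directly:

```python
# Fraction of weights remaining after t global pruning iterations at 20% each.
remaining = [0.8 ** t for t in range(31)]
print(f"after  1 iteration : {remaining[1]:.1%}")   # 80.0%
print(f"after 30 iterations: {remaining[30]:.2%}")  # ~0.12%, i.e. the 0.1% regime
```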
Datasets and models. In this work, we mostly work with the
ImageNet dataset, though we also report some results on CIFAR-10. We
use ResNet-50 on ImageNet and ResNet-18 on CIFAR-10. In the
supplementary material, we also report results with additional architectures:
AlexNet and the modified VGG-19 of Morcos et al. (where the multilayer
perceptron (MLP) classifier is replaced by a single fully-connected layer).
Our experiments on ImageNet are computationally demanding since pruning
can involve training deep networks from scratch up to $`31`$ times. For this
reason, we distribute most of our runs across several GPUs. Models are trained
with weight decay and stochastic gradient descent with a momentum of
$`0.9`$. We use PyTorch version 1.0 for all our experiments. Full
training details for each of our experiments are given in the supplementary
material. We run each experiment with $`6`$ (CIFAR-10) or $`3`$
(ImageNet) random seeds, and report the mean and standard error of the
accuracy.
Conclusion
Our work takes a first step toward studying the pruning of networks trained with self-supervised tasks. We believe this is an emerging and important problem given the recent rise of highly over-parameterized unsupervised pre-trained networks. In our study, we provide several empirical insights about pruning self-supervised networks. Indeed, we show that a well-established pruning method for supervised learning also works well for self-supervised networks, in the sense that the quality of the pruned representation does not deteriorate and the pruned masks can be re-trained to good performance on ImageNet labels. This is somewhat surprising given that labels are not seen during pruning, and that the goal of the pruning algorithm we use is to preserve performance on the training task, which is agnostic to downstream tasks and ground-truth labels.
We also identify several limitations of our study. First, we have studied pruning only through the lens of unstructured magnitude-based pruning. Future work might generalize our observations to a wider range of pruning methods, in particular structured pruning. Second, we have observed while conducting our experiments that winning tickets initializations are particularly sensitive to the late resetting parameter (see the supplementary material for a discussion of our choice of rewind parameter). The notion of “early in training” is somewhat ill-defined: network weights change much more during the first epochs than during the last ones, so weights reset to an early point in optimization may already contain a substantial amount of information. Third, we find that pruning large modern architectures on CIFAR-10 should be done with caution, as these networks tend to be sparse at convergence, which makes unstructured pruning at rates below $`80\%`$ particularly easy.