Pruning by Explaining: A Novel Criterion for Deep Neural Network Pruning



Seul-Ki Yeom (a,i), Philipp Seegerer (a,h), Sebastian Lapuschkin (c), Alexander Binder (d,e), Simon Wiedemann (c), Klaus-Robert Müller (a,f,g,b,*), Wojciech Samek (c,b,*)

(a) Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
(b) BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany
(c) Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, 10587 Berlin, Germany
(d) ISTD Pillar, Singapore University of Technology and Design, Singapore 487372, Singapore
(e) Department of Informatics, University of Oslo, 0373 Oslo, Norway
(f) Department of Artificial Intelligence, Korea University, Seoul 136-713, Korea
(g) Max Planck Institut für Informatik, 66123 Saarbrücken, Germany
(h) Aignostics GmbH, 10557 Berlin, Germany
(i) Nota AI GmbH, 10117 Berlin, Germany

Abstract

The success of convolutional neural networks (CNNs) in various applications is accompanied by a significant increase in computation and parameter storage costs. Recent efforts to reduce these overheads involve pruning and compressing the weights of various layers while at the same time aiming to not sacrifice performance. In this paper, we propose a novel criterion for CNN pruning inspired by neural network interpretability: the most relevant units, i.e. weights or filters, are automatically found using their relevance scores obtained from concepts of explainable AI (XAI). By exploring this idea, we connect the lines of interpretability and model compression research. We show that our proposed method can efficiently prune CNN models in transfer-learning setups in which networks pre-trained on large corpora are adapted to specialized tasks. The method is evaluated on a broad range of computer vision datasets.
Notably, our novel criterion is not only competitive with or better than state-of-the-art pruning criteria when successive retraining is performed, but clearly outperforms these previous criteria in the resource-constrained application scenario in which the data of the task to be transferred to is very scarce and one chooses to refrain from fine-tuning. Our method is able to compress the model iteratively while maintaining or even improving accuracy. At the same time, it has a computational cost in the order of gradient computation and is comparatively simple to apply without the need for tuning hyperparameters for pruning.

Keywords: Pruning, Layer-wise Relevance Propagation (LRP), Convolutional Neural Network (CNN), Interpretation of Models, Explainable AI (XAI)

(*) Corresponding authors. Email addresses: yeom@tu-berlin.de (Seul-Ki Yeom), philipp.seegerer@tu-berlin.de (Philipp Seegerer), sebastian.lapuschkin@hhi.fraunhofer.de (Sebastian Lapuschkin), alexabin@uio.no (Alexander Binder), simon.wiedemann@hhi.fraunhofer.de (Simon Wiedemann), klaus-robert.mueller@tu-berlin.de (Klaus-Robert Müller), wojciech.samek@hhi.fraunhofer.de (Wojciech Samek)

Article published in Pattern Recognition 115. doi:10.1016/j.patcog.2021.107899. March 15, 2021.

1. Introduction

Deep CNNs have become an indispensable tool for a wide range of applications [1], such as image classification, speech recognition, natural language processing, chemistry, neuroscience, and medicine, and are even applied to playing games such as Go, poker or Super Smash Bros. They have achieved high predictive performance, at times even outperforming humans.
Furthermore, in specialized domains where limited training data is available, e.g., due to the cost and difficulty of data generation (medical imaging from fMRI, EEG, PET etc.), transfer learning can improve CNN performance by extracting knowledge from the source tasks and applying it to a target task with limited training data. However, the high predictive performance of CNNs often comes at the expense of high storage and computational costs, which are related to the energy expenditure of the fine-tuned network. These deep architectures are composed of millions of parameters to be trained, leading to overparameterization (i.e. having more parameters than training samples) of the model [2]. The run-times are typically dominated by the evaluation of convolutional layers, while dense layers are cheap but memory-heavy [3]. For instance, the VGG-16 model has approximately 138 million parameters, taking up more than 500 MB in storage space, and needs 15.5 billion floating-point operations (FLOPs) to classify a single image. ResNet-50 has approximately 23 million parameters and needs 4.1 billion FLOPs. Note that overparameterization is helpful for an efficient and successful training of neural networks; however, once a trained and well-generalizing network structure is established, pruning can help to reduce redundancy while still maintaining good performance [4].

Reducing a model's storage requirements and computational cost becomes critical for a broader applicability, e.g., in embedded systems, autonomous agents, mobile devices, or edge devices [5]. Neural network pruning has a decades-long history with interest from both academia and industry [6], aiming to eliminate the subset of network units (i.e. weights or filters) which is least important w.r.t. the network's intended task. For network pruning, it is crucial to decide how to identify the "irrelevant" subset of the parameters meant for deletion.
To address this issue, previous research has proposed specific criteria based on Taylor expansion, weight, gradient, and others, to reduce complexity and computation costs in the network. Related works are introduced in Section 2.

From a practical point of view, the full capacity (in terms of weights and filters) of an overparameterized model may not be required, e.g., when (1) parts of the model lie dormant after training (i.e., are permanently "switched off"), (2) a user is not interested in the model's full array of possible outputs, which is a common scenario in transfer learning (e.g. the user only has use for 2 out of 10 available network outputs), or (3) a user lacks data and resources for fine-tuning and running the overparameterized model. In these scenarios the redundant parts of the model will still occupy space in memory, and information will be propagated through those parts, consuming energy and increasing runtime. Thus, criteria able to stably and significantly reduce the computational complexity of deep neural networks across applications are relevant for practitioners.

In this paper, we propose a novel pruning framework based on Layer-wise Relevance Propagation (LRP) [7]. LRP was originally developed as an explanation method to assign importance scores, so-called relevance, to the different input dimensions of a neural network that reflect the contribution of an input dimension to the model's decision, and has been applied to different fields of computer vision (e.g., [8, 9, 10]). The relevance is backpropagated from the output to the input and hereby assigned to each unit of the deep model. Since relevance scores are computed for every layer and neuron from the model output to the input, these relevance scores essentially reflect the importance of every single unit of a model and its contribution to the information flow through the network — a natural candidate to be used as pruning criterion.
The LRP criterion can be motivated theoretically through the concept of Deep Taylor Decomposition (DTD) (cf. [11, 12, 13]). Moreover, LRP is scalable and easy to apply, and has been implemented in software frameworks such as iNNvestigate [14]. Furthermore, it has linear computational cost in terms of network inference cost, similar to backpropagation.

We systematically evaluate the compression efficacy of the LRP criterion compared to common pruning criteria for two different scenarios. Scenario 1: We prune pre-trained CNNs followed by subsequent fine-tuning. This is the usual setting in CNN pruning and requires a sufficient amount of data and computational power. Scenario 2: Here a pretrained model needs to be transferred to a related problem as well, but the data available for the new task is too scarce for proper fine-tuning, and/or the time consumption, computational power or energy consumption is constrained. Such transfer learning with restrictions is common in mobile or embedded applications.

Our experimental results on various benchmark datasets and four different popular CNN architectures show that the LRP criterion for pruning is more scalable and efficient, and leads to better performance than existing criteria regardless of data types and model architectures if retraining is performed (Scenario 1). Especially, if retraining is prohibited due to external constraints after pruning, the LRP criterion clearly outperforms previous criteria on all datasets (Scenario 2). Finally, we would like to note that our proposed pruning framework is not limited to LRP and image data, but can also be used with other explanation techniques and data types.

The rest of this paper is organized as follows: Section 2 summarizes related work on network compression and introduces the typical criteria for network pruning. Section 3 describes the framework and details of our approach.
The experimental results are illustrated and discussed in Section 4, while our approach is discussed in relation to previous studies in Section 5. Section 6 gives conclusions and an outlook to future work.

2. Related Work

We start the discussion of related research in the field of network compression with network quantization methods, which have been proposed for storage space compression by decreasing the number of possible and unique values for the parameters [15, 16]. Tensor decomposition approaches decompose network matrices into several smaller ones to estimate the informative parameters of deep CNNs with low-rank approximation/factorization [17]. More recently, [18] also proposed a framework of architecture distillation based on layer-wise replacement, called LightweightNet, for memory and time saving. Algorithms for designing efficient models focus more on acceleration instead of compression by optimizing convolution operations or architectures directly (e.g. [19]).

Network pruning approaches remove redundant or irrelevant units — i.e. nodes, filters, or layers — from the model which are not critical for performance [6, 20]. Network pruning is robust to various settings and gives reasonable compression rates while not (or minimally) hurting model accuracy. Also, it can support both training from scratch and transfer learning from pre-trained models. Early works have shown that network pruning is effective in reducing network complexity and simultaneously addressing over-fitting problems. Current network pruning techniques make weights or channels sparse by removing non-informative connections and require an appropriate criterion for identifying which units of the model are not relevant for solving a problem. Thus, it is crucial to decide how to quantify the relevance of the parameters (i.e.
weights or channels) in the current state of the learning process for deletion without sacrificing predictive performance. In previous studies, pruning criteria have been proposed based on the magnitude of their 1) weights, 2) gradients, 3) Taylor expansion/derivative, and 4) other criteria, as described in the following.

Taylor expansion: Early approaches towards neural network pruning — optimal brain damage [4] and optimal brain surgeon [21] — leveraged a second-order Taylor expansion based on the Hessian matrix of the loss function to select parameters for deletion. However, computing the inverse of the Hessian is computationally expensive. The works of [22, 23] used a first-order Taylor expansion as a criterion to approximate the change of loss in the objective function as an effect of pruning away network units. We contrast our novel criterion to the computationally more comparable first-order Taylor expansion from [22].

Gradient: Liu and Wu [24] proposed a hierarchical global pruning strategy by calculating the mean gradient of feature maps in each layer. They adopt a hierarchical global pruning strategy between layers with similar sensitivity. Sun et al. [25] propose a sparsified back-propagation approach for neural network training using the magnitude of the gradient to find essential and non-essential features in Multi-Layer Perceptron (MLP) and Long Short-Term Memory (LSTM) models, which can be used for pruning. We implement the gradient-based pruning criterion after [25].

Weight: A recent trend is to prune redundant, non-informative weights in pre-trained CNN models, based on the magnitude of the weights themselves. Han et al. [26] and Han et al. [27] proposed the pruning of weights for which the magnitude is below a certain threshold, and to subsequently fine-tune with an l_p-norm regularization.
This pruning strategy has been used on fully-connected layers and introduced sparse connections with BLAS libraries, supporting specialized hardware to achieve its acceleration. In the same context, Structured Sparsity Learning (SSL) added group sparsity regularization to penalize unimportant parameters by removing some weights [28]. Li et al. [29], against which we compare in our experiments, proposed a one-shot channel pruning method using the l_p norm of weights for filter selection, provided that those channels with smaller weights always produce weaker activations.

Other criteria: [30] proposed the Neuron Importance Score Propagation (NISP) algorithm to propagate the importance scores of the final responses before the softmax classification layer through the network. The method is based on — in contrast to our proposed metric — a per-layer pruning process which does not consider global importance in the network. Luo et al. [31] proposed ThiNet, a data-driven statistical channel pruning technique based on statistics computed from the next layer. Further hybrid approaches can be found in, e.g., [32], which suggests a fusion approach combining weight-based channel pruning and network quantization. More recently, Dai et al. [33] proposed an evolutionary paradigm of weight-based pruning and gradient-based growing to reduce the network heuristically.

3. LRP-Based Network Pruning

A feedforward CNN consists of neurons established in a sequence of multiple layers, where each neuron receives the input data from one or more previous layers and propagates its output to every neuron in the succeeding layers, using a potentially non-linear mapping. Network pruning aims to sparsify these units by eliminating weights or filters that are non-informative (according to a certain criterion).
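Such criteria, as surveyed in Section 2, can each be sketched in a few lines of NumPy. The exact definitions vary between the cited papers; the forms below (and all function names) are simplified for illustration only:

```python
import numpy as np

def weight_criterion(W):
    """Weight magnitude (cf. Li et al. [29]): l1 norm of each filter's
    weights, for W of shape (out_channels, ...). Static, i.e. independent
    of any reference data."""
    return np.abs(W).reshape(W.shape[0], -1).sum(axis=1)

def gradient_criterion(grads):
    """Gradient magnitude (cf. Sun et al. [25]): mean absolute gradient
    per unit over a batch, for grads of shape (batch, units)."""
    return np.abs(grads).mean(axis=0)

def taylor_criterion(acts, grads):
    """First-order Taylor expansion (cf. [22]): |batch mean of
    activation * gradient| per unit, i.e. an estimate of the change
    of the loss when the unit is removed."""
    return np.abs((acts * grads).mean(axis=0))
```

Units whose score is smallest under the chosen criterion become the candidates for removal.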
We specifically focus our experiments on transfer learning, where the parameters of a network pre-trained on a source domain are subsequently fine-tuned on a target domain, i.e., the final data or prediction task. Here, the general pruning procedure is outlined in Algorithm 1.

Algorithm 1 Neural Network Pruning
 1: Input: pre-trained model net, reference data x_r, training data x_t,
 2:        pruning threshold t, pruning criterion c, pruning ratio r
 3: while t not reached do
 4:   // Step 1: assess network substructure importance
 5:   for all layer in net do
 6:     for all units in layer do
 7:       compute importance of unit w.r.t. c (and x_r)
 8:     end for
 9:     if required for c then
10:       globally regularize importance per unit
11:     end if
12:   end for
13:   // Step 2: identify and remove least important units in groups of r
14:   remove r units from net where importance is minimal
15:   remove orphaned connections of each removed unit
16:   if desired then
17:     // Step 2.1: optional fine-tuning to recover performance
18:     fine-tune net on x_t
19:   end if
20: end while
21: // return the pruned network upon hitting threshold t (e.g. model performance or size)
22: return net

Even though most approaches use an identical process, choosing a suitable pruning criterion to quantify the importance of model parameters for deletion while minimizing the performance drop (Step 1) is of critical importance, governing the success of the approach.

3.1. Layer-wise Relevance Propagation

In this paper, we propose a novel criterion for pruning neural network units: the relevance quantity computed with LRP [7]. LRP decomposes a classification decision into proportionate contributions of each network unit to the overall classification score, called "relevances". When computed for the input dimensions of a CNN and visualized as a heatmap, these relevances highlight parts of the input that are important for the classification decision.
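The loop of Algorithm 1 can be condensed into a short, criterion-agnostic sketch; everything here (the function and variable names, the flat unit representation) is illustrative rather than taken from the paper's implementation:

```python
def iterative_prune(units, importance_fn, ratio, stop_size, fine_tune=None):
    """Sketch of Algorithm 1: repeatedly score all units with the chosen
    criterion and remove the `ratio` least important ones, until only
    `stop_size` units remain (a simple pruning threshold t)."""
    units = list(units)
    while len(units) > stop_size:                       # "while t not reached"
        scores = {u: importance_fn(u) for u in units}   # Step 1
        doomed = sorted(units, key=scores.get)[:ratio]  # Step 2
        units = [u for u in units if u not in doomed]
        if fine_tune is not None:                       # Step 2.1 (optional)
            fine_tune(units)
    return units

# Toy usage: units are (name, weight) pairs, criterion = |weight|
units = [("a", 0.9), ("b", 0.01), ("c", -0.5), ("d", 0.02)]
kept = iterative_prune(units, lambda u: abs(u[1]), ratio=1, stop_size=2)
```

In a real network, removing a unit also removes its orphaned connections; here that bookkeeping is abstracted away.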
LRP thus originally served as a tool for interpreting non-linear learning machines and has been applied as such in various fields, amongst others for general image recognition, medical imaging and natural language processing, cf. [34]. The direct linkage of the relevances to the classifier output, as well as the conservativity constraint imposed on the propagation of relevance between layers, makes LRP not only attractive for model explanation, but also lets it naturally serve as a pruning criterion (see Section 4.1).

Figure 1: Illustration of the LRP-based sequential process for pruning. A. Forward propagation of a given image (i.e. cat) through a pre-trained model. B. Evaluation of relevance for weights/filters using LRP. C. Iterative pruning by eliminating the least relevant units (depicted by circles) and fine-tuning if necessary. The units can be individual neurons, filters, or other arbitrary groupings of parameters, depending on the model architecture.

The main characteristic of LRP is a backward pass through the network during which the network output is redistributed to all units of the network in a layer-by-layer fashion. This backward pass is structurally similar to gradient backpropagation and therefore has a similar runtime. The redistribution is based on a conservation principle such that the relevances can immediately be interpreted as the contribution that a unit makes to the network output, hence establishing a direct connection to the network output and thus its predictive performance. Therefore, as a pruning criterion, the method is efficient and easily scalable to generic network structures. Independent of the type of neural network layer — that is, pooling, fully-connected, or convolutional layers — LRP allows to quantify the importance of units throughout the network, given a global prediction context.

3.2. LRP-based Pruning
The procedure of LRP-based pruning is summarized in Figure 1. In the first phase, a standard forward pass is performed by the network and the activations at each layer are collected. In the second phase, the score f(x) obtained at the output of the network is propagated backwards through the network according to the LRP propagation rules [7]. In the third phase, the current model is pruned by eliminating the irrelevant (w.r.t. the "relevance" quantity R obtained via LRP) units and is (optionally) further fine-tuned.

LRP is based on a layer-wise conservation principle that allows the propagated quantity (e.g. relevance for a predicted class) to be preserved between neurons of two adjacent layers. Let R^{(l)}_i be the relevance of neuron i at layer l and R^{(l+1)}_j be the relevance of neuron j at the next layer l+1. Stricter definitions of conservation that involve only subsets of neurons can further impose that relevance is locally redistributed in the lower layers, and we define R^{(l)}_{i \leftarrow j} as the share of R^{(l+1)}_j that is redistributed to neuron i in the lower layer. The conservation property always satisfies

    \sum_i R^{(l)}_{i \leftarrow j} = R^{(l+1)}_j ,    (1)

where the sum runs over all neurons i of the (during inference) preceding layer l. When using relevance as a pruning criterion, this property helps to preserve its quantity layer-by-layer, regardless of the hidden layer size and the number of iteratively pruned neurons per layer. At each layer l, we can extract node i's global importance as its attributed relevance R^{(l)}_i.

In this paper, we specifically adopt relevance quantities computed with the LRP-α1β0 rule as pruning criterion. The LRP-αβ rule was developed with feedforward DNNs with ReLU activations in mind and assumes positive (pre-softmax) logit activations f_logit(x) > 0 for decomposition. The rule has been shown to work well in practice in such a setting [35].
This particular variant of LRP is tightly rooted in DTD [11] and, other than the criteria based on network derivatives we compare against [25, 22], always produces continuous explanations, even if backpropagation is performed through the discontinuous (and commonly used) ReLU nonlinearity [12]. When used as a criterion for pruning, its assessment of network unit importance will change less abruptly with (small) changes in the choice of reference samples, compared to gradient-based criteria.

The propagation rule performs two separate relevance propagation steps per layer: one exclusively considering the activatory parts of the forward-propagated quantities (i.e. all a^{(l)}_i w_{ij} > 0) and another only processing the inhibitory parts (a^{(l)}_i w_{ij} < 0), which are subsequently merged in a sum with components weighted by α and β (s.t. α + β = 1), respectively. By selecting α = 1, the propagation rule simplifies to

    R^{(l)}_i = \sum_j \frac{\left( a^{(l)}_i w_{ij} \right)^+}{\sum_{i'} \left( a^{(l)}_{i'} w_{i'j} \right)^+} \, R^{(l+1)}_j ,    (2)

where R^{(l)}_i denotes the relevance attributed to the i-th neuron at layer l, as an aggregation of downward-propagated relevance messages R^{(l,l+1)}_{i \leftarrow j}. The terms (·)^+ indicate the positive part of the forward-propagated pre-activation from layer l to layer (l+1), and i' is a running index over all input activations a. Note that a choice of α = 1 only decomposes w.r.t. the parts of the inference signal supporting the model decision for the class of interest. Equation (2) is locally conservative, i.e. no quantity of relevance gets lost or injected during the distribution of R_j, where each term of the sum corresponds to a relevance message R_{j \leftarrow k}.
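For a single dense layer, Eq. (2) can be written down directly in NumPy. This is a minimal sketch for illustration; the small ε in the denominator is a common numerical-stability detail of LRP implementations, not part of Eq. (2). For convolutional layers, the per-filter importance is then obtained by summing the resulting relevance within each filter channel:

```python
import numpy as np

def lrp_alpha1_beta0(a, W, R_upper, eps=1e-12):
    """One dense-layer backward step of Eq. (2): redistribute the
    upper-layer relevance R_upper (shape (J,)) onto the lower-layer
    neurons, using only the positive pre-activations a_i * w_ij.
    a: (I,) lower-layer activations, W: (I, J) weight matrix."""
    Zp = np.clip(a[:, None] * W, 0.0, None)  # (a_i w_ij)^+, i.e. alpha = 1
    denom = Zp.sum(axis=0) + eps             # normalizer per upper neuron j
    return (Zp / denom) @ R_upper            # R_i = sum_j Zp_ij/denom_j * R_j
```

The column-wise normalization makes each backward step conservative in the sense of Eq. (1): the returned relevances sum to the same total as `R_upper`, as long as every upper-layer neuron receives at least one positive input contribution.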
For this reason, LRP has the following technical advantages over other pruning techniques such as gradient-based or activation-based methods: (1) Localized relevance conservation implicitly ensures a layer-wise regularized global redistribution of importances from each network unit. (2) By summing relevance within each (convolutional) filter channel, the LRP-based criterion is directly applicable as a measure of total relevance per node/filter, without requiring a post-hoc layer-wise renormalization, e.g., via an l_p norm. (3) The use of relevance scores is not restricted to a global application of pruning but can easily be applied to locally and (neuron- or filter-)group-wise constrained pruning without regularization.

Different strategies for selecting (sub-)parts of the model might still be considered, e.g., applying different weightings/priorities for pruning different parts of the model: should the aim of pruning be the reduction of FLOPs required during inference, one would prefer to focus primarily on pruning units of the convolutional layers. In case the aim is a reduction of the memory requirement, pruning should focus on the fully-connected layers instead.

In the context of Algorithm 1, Step 1 of the LRP-based assessment of neuron and filter importance is performed as a single LRP backward pass through the model, with an aggregation of relevance per filter channel as described above for convolutional layers, and does not require additional normalization or regularization. We would like to point out that, instead of backpropagating the model output f_c(x) for the true class c of any given sample x (as is commonly done when LRP is used for explaining a prediction [7, 8]), we initialize the algorithm with R^{(L)}_c = 1 at the output layer L.
We thus gain robustness against the model's (in)confidence in its predictions on the previously unseen reference samples x and ensure an equal weighting of the influence of all reference samples in the identification of relevant neural pathways.

4. Experiments

We start with an attempt to intuitively illuminate the properties of different pruning criteria, namely weight magnitude, Taylor, gradient and LRP, via a series of toy datasets. We then show the effectiveness of the LRP criterion for pruning on widely-used image recognition benchmark datasets — i.e. the Scene 15 [36], Event 8 [37], Cats & Dogs [38], Oxford Flower 102 [39], CIFAR-10^1, and ILSVRC 2012 [40] datasets — and four pre-trained feed-forward deep neural network architectures: AlexNet and VGG-16 with only a single sequence of layers, and ResNet-18 and ResNet-50 [41], which both contain multiple parallel branches of layers and skip connections.

The first scenario focuses specifically on the pruning of pre-trained CNNs with subsequent fine-tuning, as is common in pruning research [22]. We compare our method with several state-of-the-art criteria to demonstrate the effectiveness of LRP as a pruning criterion in CNNs. In the second scenario, we test whether the proposed pruning criterion also works well if only a very limited number of samples is available for pruning the model. This is relevant in the case of devices with limited computational power, energy and storage, such as mobile devices or embedded applications.

4.1. Pruning Toy Models

First, we systematically compare the properties and effectiveness of the different pruning criteria on several toy datasets in order to foster an intuition about the properties of all approaches, in a controllable and computationally inexpensive setting. To this end we evaluate all four criteria on different toy data distributions qualitatively and quantitatively.
We generated three k-class toy datasets ("moon" (k = 2), "circle" (k = 2) and "multi" (k = 4)), using the respective generator functions^2,3. Each generated 2D dataset consists of 1000 training samples per class. We constructed and trained the models as a sequence of three consecutive ReLU-activated dense layers with 1000 hidden neurons each. After the first linear layer, we added a DropOut layer with a dropout probability of 50%. The model receives inputs from R^2 and has — depending on the toy problem set — k ∈ {2, 4} output neurons:

Dense(1000) -> ReLU -> DropOut(0.5) -> Dense(1000) -> ReLU -> Dense(1000) -> ReLU -> Dense(k)

1: https://www.cs.toronto.edu/~kriz/cifar.html
2: https://scikit-learn.org/stable/datasets
3: https://github.com/seulkiyeom/LRP_Pruning_toy_example

We then sample a number of new datapoints (unseen during training) for the computation of the pruning criteria. During pruning, we removed a fixed number of 1000 of the 3000 hidden neurons that have the least relevance for prediction according to each criterion. This is equivalent to removing 1000 learned (yet insignificant, according to the criterion) filters from the model. After pruning, we observed the changes in the decision boundaries and re-evaluated the classification accuracy using the original training samples and re-sampled datapoints across criteria.

This experiment is performed with n ∈ [1, 2, 5, 10, 20, 50, 100, 200] reference samples for testing and the computation of the pruning criteria. Each setting is repeated 50 times, using the same set of random seeds (depending on the repetition index) for each n across all pruning criteria to uphold comparability. Figure 2 shows the data distributions of the generated toy datasets, an exemplary set of n = 5 samples generated for criteria computation, as well as the qualitative impact on the models' decision boundary when removing a fixed set of 1000 neurons as selected via the compared criteria.
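Removing hidden neurons, as done here, is structural: for a neuron in a dense layer, its incoming weight row and bias entry as well as its outgoing weight column are deleted. A small NumPy sketch (the helper name and weight layout are our own, for illustration):

```python
import numpy as np

def remove_neurons(W_in, b, W_out, keep):
    """Structurally prune the hidden neurons of a dense layer pair
    h = relu(W_in @ x + b); y = W_out @ h, keeping only the neuron
    indices in `keep` (those ranked most important by the criterion)."""
    return W_in[keep], b[keep], W_out[:, keep]
```

If the pruned neurons contributed (close to) nothing to the output, the pruned network computes (almost) the same function with fewer parameters, which is exactly what the toy experiments measure.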
Figure 3 investigates how the pruning criteria preserve the models' problem-solving capabilities as a function of the number of samples selected for computing the criteria. Figure 4 then quantitatively summarizes the results for specific numbers of unseen samples (n ∈ [1, 5, 20, 100]) for computing the criteria. Here we report the model accuracy on the training set in order to relate the preservation of the decision function as learned from data between the unpruned (2nd column) and pruned models and pruning criteria (remaining columns).

Figure 2: Qualitative comparison of the impact of the pruning criteria on the decision function on three toy datasets. 1st column: scatter plot of the training data and decision boundary of the trained model. 2nd column: data samples randomly selected for computing the pruning criteria. 3rd to 6th columns: changed decision boundaries after the application of pruning w.r.t. the different criteria.

Figure 3: Pruning performance (accuracy) comparison of the criteria depending on the number of reference samples per class used for criterion computation. 1st row: model evaluation on the training data. 2nd row: model evaluation on an unseen test dataset with added Gaussian noise (N(0, 0.3)), which has not been used for the computation of the pruning criteria. Columns: results over the different datasets ("circle", "moon", "multi"). Solid lines show the average post-pruning performance of the models pruned w.r.t. the evaluated criteria weight (black), Taylor (blue), grad(ient) (green) and LRP (red) over 50 repetitions of the experiment. The dashed black line indicates the model's evaluation performance without pruning. Shaded areas around the lines show the standard deviation over the repetitions of the experiment. Further results for noise levels N(0, 0.1) and N(0, 0.01) are available on github^3.
The results in Figure 4 show that, among all criteria based on reference samples for the computation of relevance, the LRP-based measure consistently outperforms all other criteria for all reference set sizes and datasets. Only in the case of n = 1 reference sample per class does the weight criterion preserve the model best. Note that using the weight magnitude as a measure of network unit importance is a static approach, independent of the choice of reference samples. Given n = 5 points of reference per class, the LRP-based criterion already also outperforms the weight magnitude as a criterion for pruning unimportant neural network structures, while successfully preserving the functional core of the predictor.

Figure 2 demonstrates how the toy models' decision boundaries change under the influence of pruning with all four criteria. We can observe that the weight criterion and LRP preserve the models' learned decision boundary well. Both the Taylor and gradient measures degrade the model significantly. Compared to the weight- and LRP-based criteria, models pruned by gradient-based criteria misclassify a large part of the samples.

The first row of Figure 3 shows that all (data-dependent) measures benefit from an increasing number of reference points. LRP is able to find and preserve the functionally important network components with only very little data, while at the same time being considerably less sensitive to the choice of reference points than the other metrics, visible in the measures' standard deviations.

Figure 4: Comparison of training accuracy after one-shot pruning of one third of all filters w.r.t. one of the four metrics on the toy datasets, with n ∈ [1, 5, 20, 100] reference samples used for criteria computation for Weight, Gradient, Taylor and LRP. The experiment is repeated 50 times. Note that the Weight criterion is not influenced by the number of reference samples n. Compare to Supplementary Table 1.
Both the gradient- and Taylor-based measures do not reach the performance of LRP-based pruning, even with 200 reference samples per class. The performance of pruning with the weight-magnitude-based measure is constant, as it depends only on the learned weights themselves. The bottom row of Figure 3 shows the test performance of the pruned models as a function of the number of samples used for criterion computation. Here, we tested on 500 samples per class, drawn from the datasets' respective distributions and perturbed with additional Gaussian noise (N(0, 0.3)) added after data generation. Due to the large amount of noise added to the data, the prediction performance of the pruned and unpruned models decreases in all settings. We can observe that in two out of three cases the LRP-pruned models outperform all other criteria. Only once, on the "moon" dataset, does pruning based on the weight criterion yield a higher performance than the LRP-pruned model. Most remarkably though, only the models pruned with the LRP-based criterion exhibit prediction performance and behavior (measured in mean and standard deviation of accuracies over all 50 random seeds per n reference samples on the deliberately heavily noisy data) highly similar to the original, unpruned model, from only n = 5 reference samples per class on, on all datasets. This is another strong indicator that LRP is, among the compared criteria, most capable of preserving the relevant core of the learned network function and of dismissing unimportant parts of the model during pruning. The strong results of LRP, and the partial similarity between the results of LRP and weight on the training datasets, raise the question of where and how both metrics (and Taylor and gradient) deviate, as it can be expected that both metrics select at least highly overlapping sets of network units for pruning and preservation.
We therefore investigate, in all three toy settings and across the different numbers of reference samples and random seeds, the (dis)similarities and (in)consistencies in neuron selection and ranking, by measuring the set similarities |S1 ∩ S2| / min(|S1|, |S2|) of the k neurons selected for pruning (ranked first) and preservation (ranked last) between and within criteria. Since the weight criterion is not influenced by the choice of reference samples for computation, it is expected that the resulting neuron order is perfectly consistent with itself in all settings (cf. Table 2). What is unexpected, however, given the results in Figures 3 and 4 indicating that similar model behavior after pruning is to be expected between the LRP- and weight-based criteria, at least on the training data, is the minimal set overlap between LRP and weight, given the higher set similarities between LRP and the gradient and Taylor criteria, as shown in Table 1. Overall, the set overlap between the neurons ranked in the extremes of the orderings shows that LRP-derived pruning strategies have very little in common with the ones originating from the other criteria. This observation can also be made for more complex networks in Figure 7, as shown and discussed later in this section.

Table 1: Similarity analysis of neuron selection between LRP and the other criteria, computed over 50 different random seeds. Higher values indicate higher similarity in neuron selection of the first/last k neurons for pruning compared to LRP. Note that the table below reports results only for n = 10 reference samples for criterion computation (Weight, Taylor, Gradient and LRP) and k = 250 and k = 1000. Similar observations have been made for n ∈ [1, 2, 5, 20, 50, 100, 200] and k ∈ [125, 500] and can be found on github³.
         first-250                  last-250                   first-1000                 last-1000
Dataset  W     T     G     L        W     T     G     L        W     T     G     L        W     T     G     L
moon     0.002 0.006 0.006 1.000    0.083 0.361 0.369 1.000    0.381 0.639 0.626 1.000    0.409 0.648 0.530 1.000
circle   0.033 0.096 0.096 1.000    0.086 0.389 0.405 1.000    0.424 0.670 0.627 1.000    0.409 0.623 0.580 1.000
mult     0.098 0.220 0.215 1.000    0.232 0.312 0.299 1.000    0.246 0.217 0.243 1.000    0.367 0.528 0.545 1.000

Table 2 reports the self-similarity in neuron selection in the extremes of the ranking across random seeds (and thus sets of reference samples), for all criteria and toy settings. While LRP yields a high consistency in neuron selection for both the pruning (first-k) and the preservation (last-k) of neural network units, both gradient and, even more so, Taylor exhibit lower self-similarities. The lower consistency of the latter two criteria in the model components ranked last (i.e., preserved in the model the longest during pruning) offers an explanation for the large variation in results observed earlier: although gradient and Taylor are highly consistent in the removal of neurons rated as irrelevant, their volatility in the preservation of the neurons which constitute the functional core of the network after pruning yields dissimilarities in the resulting predictor function. The high consistency reported for LRP in terms of neuron sets selected for pruning and preservation, given the relatively low Spearman correlation coefficient, points to only minor local perturbations of the pruning order due to the selection of reference samples.
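The overlap measure used in these comparisons is straightforward to compute. A minimal sketch (the function name and argument layout are hypothetical):

```python
def set_similarity(ranking_a, ranking_b, k, first=True):
    """Overlap |S1 ∩ S2| / min(|S1|, |S2|) of the k neurons ranked
    first (pruned earliest) or last (preserved longest) by two criteria.
    ranking_a / ranking_b: neuron indices ordered from first-pruned
    to last-pruned."""
    s1 = set(ranking_a[:k] if first else ranking_a[-k:])
    s2 = set(ranking_b[:k] if first else ranking_b[-k:])
    return len(s1 & s2) / min(len(s1), len(s2))
```

Identical rankings yield a similarity of 1.0 (as for the LRP-vs-LRP entries), disjoint selections yield 0.0.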
We find a direct correspondence between the (in)consistency of pruning behavior reported here for the three data-dependent criteria and the "explanation continuity" observed in [12] for LRP (and the discontinuity for gradient and Taylor) in neural networks containing the commonly used ReLU activation function; this provides an explanation for the high pruning consistency obtained with LRP and the extreme volatility of gradient and Taylor. A supplementary analysis of the neuron selection consistency of LRP over different counts of reference samples n, demonstrating that only very few reference samples per class are required to obtain stable pruning results, can be found in Supplementary Results 1. Taken together, the results of Tables 1 and 2 and Supplementary Tables 1 and 2 elucidate that LRP constitutes, compared to the other methods, an orthogonal pruning criterion which is very consistent in its selection of (un)important neural network units, while remaining adaptive to the selection of reference samples for criterion computation. Especially the similarity in post-pruning model performance to the static weight criterion indicates that both metrics are able to find valid, yet completely different, pruning solutions.

Table 2: A consistency comparison of neuron selection and ranking for network pruning with the criteria (Weight, Taylor, Gradient and LRP), averaged over all 1225 unique random seed combinations. Higher values indicate higher consistency in selecting the same sets of neurons and generating the same neuron rankings for different sets of reference samples. We report results for n = 10 reference samples and k = 250. Observations for n ∈ [1, 2, 5, 20, 50, 100, 200] and k ∈ [125, 500, 1000] are available on github³.

         first-250                  last-250                   Spearman Correlation
Dataset  W     T     G     L        W     T     G     L        W     T     G     L
moon     1.000 0.920 0.918 0.946    1.000 0.508 0.685 0.926    1.000 0.072 0.146 0.152
circle   1.000 0.861 0.861 0.840    1.000 0.483 0.635 0.936    1.000 0.074 0.098 0.137
mult     1.000 0.827 0.829 0.786    1.000 0.463 0.755 0.941    1.000 0.080 0.131 0.155

However, since LRP can still benefit from the influence of reference samples, we will show in Section 4.2.2 that our proposed criterion is able to outperform not only weight, but all other criteria in Scenario 2, where pruning is used instead of fine-tuning as a means of domain adaptation. This will be discussed in the following sections.

4.2. Pruning Deep Image Classifiers for Large-scale Benchmark Data

We now evaluate the performance of all pruning criteria on the CNNs VGG-16 and AlexNet, as well as ResNet-18 and ResNet-50 (popular models in compression research [42]), all of which are pre-trained on ILSVRC 2012 (ImageNet). VGG-16 consists of 13 convolutional layers with 4224 filters and 3 fully-connected layers, and AlexNet contains 5 convolutional layers with 1552 filters and 3 fully-connected layers. In the dense layers, there are 4,096 + 4,096 + k neurons (i.e., filters), respectively, where k is the number of output classes. In terms of model complexity, the ImageNet-pre-trained VGG-16 and AlexNet originally consist of 138.36 / 60.97 million parameters and 154.7 / 7.27 Giga Multiply-Accumulate operations (GMACs) (as a measure of FLOPs), respectively. ResNet-18 and ResNet-50 consist of 20 / 53 convolutional layers with 4,800 / 26,560 filters. In terms of model complexity, the ImageNet-pre-trained ResNet-18 and ResNet-50 originally consist of 11.18 / 23.51 million parameters and 1.82 / 4.12 GMACs (as a measure of FLOPs), respectively.
Furthermore, since LRP scores are not implementation-invariant and depend on the LRP rules used for the batch normalization (BN) layers, we convert a trained ResNet into a canonized version, which yields the same predictions up to numerical errors. The canonization fuses a sequence of a convolution and a BN layer into a single convolution layer with updated weights⁴ and resets the BN layer to the identity function. This effectively removes the BN layer by rewriting a sequence of two affine mappings into one updated affine mapping [43]. The second change replaced calls to torch.nn.functional methods and the summation in the residual connection by classes derived from torch.nn.Module, which were then wrapped by calls to torch.autograd.function to enable custom backward computations suitable for LRP rule computations. Experiments are performed within the PyTorch and torchvision frameworks on an Intel(R) Xeon(R) CPU E5-2660 @ 2.20GHz and an NVIDIA Tesla P100 with 12GB for GPU processing. We evaluated the criteria on six public datasets (Scene 15 [36], Event 8, Cats and Dogs [38], Oxford Flower 102 [39], CIFAR-10, and ILSVRC 2012 [40]). For more detail on the datasets and the preprocessing, see Supplementary Methods 1. Our complete experimental setup covering these datasets is publicly available at https://github.com/seulkiyeom/LRP_pruning.

⁴ See bnafterconv_overwrite_intoconv(conv, bn) in the file lrp_general6.py in https://github.com/AlexBinder/LRP_Pytorch_Resnets_Densenet

Figure 5: Comparison of test accuracy for the different criteria as the pruning rate increases on VGG-16 (top) and ResNet-50 (bottom) with five datasets. Pruning with fine-tuning.
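The conv-BN fusion underlying the canonization step above reduces to per-output-channel algebra. A NumPy sketch (names are hypothetical; the actual implementation referenced in footnote 4 operates on PyTorch modules):

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-mode BatchNorm into the preceding convolution:
    y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    becomes a single convolution with updated weights and bias.
    w: (F, C, kh, kw) conv weights, b: (F,) conv bias,
    gamma/beta/mean/var: (F,) BN parameters and running statistics."""
    scale = gamma / np.sqrt(var + eps)        # per-channel scaling factor
    w_fused = w * scale[:, None, None, None]  # rescale each filter
    b_fused = (b - mean) * scale + beta       # shift and rescale the bias
    return w_fused, b_fused
```

After fusion, the BN layer can be replaced by the identity and the network output is unchanged up to numerical error, which is exactly what the canonization exploits.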
Prematurely terminated lines in the top row of panels indicate that during pruning, the respective criterion removed filters vital to the network structure by disconnecting the model input from the output.

Table 3: A performance comparison between the criteria (Weight, Taylor, Gradient, each with ℓ2-norm, and LRP) and the Unpruned model for VGG-16 (top) and ResNet-50 (bottom) on five different image benchmark datasets. Criteria are evaluated at fixed pruning rates per model and dataset, identified as ⟨dataset⟩ @ ⟨percent pruned filters⟩%. We report test accuracy (in %), (training) loss (×10⁻²), number of remaining parameters (×10⁷) and FLOPs (in GMAC) per forward pass. For all measures except accuracy, lower outcomes are better.

VGG-16    Scene 15 @ 55%                      Event 8 @ 55%                       Cats & Dogs @ 60%
          U      W      T      G      L       U      W      T      G      L       U      W      T      G      L
Loss      2.09   2.27   1.76   1.90   1.62    0.85   1.35   1.01   1.18   0.83    0.19   0.50   0.51   0.57   0.44
Accuracy  88.59  82.07  83.00  82.72  83.99   95.95  90.19  91.79  90.55  93.29   99.36  97.90  97.54  97.19  98.24
Params    119.61 56.17  53.10  53.01  49.67   119.58 56.78  48.48  50.25  47.35   119.55 47.47  51.19  57.27  43.75
FLOPs     15.50  8.03   4.66   4.81   6.94    15.50  8.10   5.21   5.05   7.57    15.50  7.02   3.86   3.68   6.49

          Oxford Flower 102 @ 70%             CIFAR-10 @ 30%
          U      W      T      G      L       U      W      T      G      L
Loss      3.69   3.83   3.27   3.54   2.96    1.57   1.83   1.76   1.80   1.71
Accuracy  82.26  71.84  72.11  70.53  74.59   91.04  93.36  93.29  93.05  93.42
Params    119.96 39.34  41.37  42.68  37.54   119.59 74.55  97.30  97.33  89.20
FLOPs     15.50  5.48   2.38   2.45   4.50    15.50  11.70  8.14   8.24   9.93

ResNet-50 Scene 15 @ 55%                      Event 8 @ 55%                       Cats & Dogs @ 60%
          U      W      T      G      L       U      W      T      G      L       U      W      T      G      L
Loss      0.81   1.32   1.08   1.32   0.50    0.33   1.07   0.63   0.85   0.28    0.01   0.05   0.06   0.21   0.02
Accuracy  88.28  80.17  80.26  78.71  85.38   96.17  88.27  87.55  86.38  94.22   98.42  97.02  96.33  93.13  98.03
Params    23.54  14.65  12.12  11.84  13.73   23.52  13.53  11.85  11.93  14.05   23.51  12.11  10.40  10.52  12.48
FLOPs     4.12   3.22   2.45   2.42   3.01    4.12   3.16   2.48   2.47   3.10    4.12   3.04   2.40   2.27   2.89

          Oxford Flower 102 @ 70%             CIFAR-10 @ 30%
          U      W      T      G      L       U      W      T      G      L
Loss      0.82   3.04   2.18   2.69   0.83    0.003  0.002  0.004  0.009  0.003
Accuracy  77.82  51.88  58.62  53.96  76.83   93.55  93.37  93.15  92.76  93.23
Params    23.72  9.24   8.82   8.48   9.32    23.52  19.29  18.10  17.96  18.11
FLOPs     4.12   2.55   1.78   1.81   2.38    1.30   1.14   1.06   1.05   1.16

In order to prepare the models for evaluation, we first fine-tuned the models for 200 epochs with a constant learning rate of 0.001 and a batch size of 20. We used the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9. In addition, we also apply dropout to the fully-connected layers with a probability of 0.5. Fine-tuning and pruning are performed on the training set, while results are evaluated on each test dataset. Throughout the experiments, we iteratively prune 5% of all the filters in the network by eliminating units including their input and output connections. In Scenario 1, we subsequently fine-tune and re-evaluate the model to account for dependencies across parameters and to regain performance, as is common.

Figure 6: Performance comparison of the proposed method (i.e., LRP) and the other criteria on VGG-16 and ResNet-50 with five datasets. Each point in the scatter plot corresponds to the performance of two criteria at a specific pruning rate, where the vertical axis shows the performance of our LRP criterion and the horizontal axis the performance of a single other criterion (compare to Figure 5, which displays the same data for more than two criteria). The black dashed line shows the set of points where models pruned by one of the compared criteria would exhibit performance identical to LRP. For accuracy, higher values are better. For loss, lower values are better.

4.2.1. Scenario 1: Pruning with Fine-tuning

In the first scenario, we retrain the model after each iteration of pruning in order to regain lost performance. We then evaluate the performance of the different pruning criteria after each pruning-retraining step.
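The iterative procedure described above (score all filters, remove the globally lowest-rated 5% of the original filter count, fine-tune, repeat) can be sketched in plain Python over per-filter scores. The function names and data layout are hypothetical:

```python
def iterative_prune(scores_fn, finetune_fn, filters, steps, frac=0.05):
    """filters: list of (layer_id, filter_id) pairs still alive.
    scores_fn(filters) -> dict mapping each filter to its criterion score;
    finetune_fn(filters) retrains the reduced model (Scenario 1 only).
    Each step removes frac of the ORIGINAL filter count, ranked globally."""
    per_step = max(1, int(round(frac * len(filters))))
    alive = list(filters)
    for _ in range(steps):
        scores = scores_fn(alive)
        # global ranking across all layers; LRP needs no per-layer
        # normalization here, since relevance is conserved network-wide
        alive.sort(key=lambda f: scores[f])
        alive = alive[per_step:]      # drop the least important filters
        finetune_fn(alive)            # optional retraining step
    return alive
```

In Scenario 2 below, finetune_fn would simply be a no-op, turning the loop into pure criterion-driven pruning.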
That is, we quantify the importance of each filter by the magnitude of the respective criterion and iteratively prune 5% of all filters (w.r.t. the original number of filters in the model) rated least important in each pruning step. Then, we compute and record the training loss, test accuracy, number of remaining parameters and total estimated FLOPs. We assume that the least important filters should have only little influence on the prediction and thus incur the lowest performance drop if they are removed from the network. Figure 5 (and Supplementary Figure 2) depicts test accuracies with increasing pruning rate for VGG-16 and ResNet-50 (and AlexNet and ResNet-18, respectively) after fine-tuning, for each dataset and each criterion. It can be observed that LRP achieves higher test accuracies compared to the other criteria in a large majority of cases (see Figure 6 and Supplementary Figure 1). These results demonstrate that the performance of LRP-based pruning is stable and independent of the chosen dataset. Apart from performance, regularization by layer is a critical constraint which obstructs the extension of some of the criteria toward several pruning strategies such as local pruning, global pruning, etc. Except for the LRP criterion, all criteria perform substantially worse without ℓp regularization than with it, and result in unexpected interruptions during the pruning process due to the biased redistribution of importance in the network (cf. top rows of Figure 5 and Supplementary Figure 2).

Figure 7: An observation of per-layer pruning performed w.r.t. the different evaluated criteria on VGG-16 and two datasets (A: Cats and Dogs; B: Oxford Flower 102). Each colored line corresponds to a specific (global) ratio of filters pruned from the network (black (top): 0%, red: 15%, green: 30%, blue: 45%, violet: 75% and black (bottom): 90%). The dots on each line identify the ratio of pruning applied to specific convolutional layers, given a global ratio of pruning, depending on the pruning criterion.

Table 3 shows the predictive performance of the different criteria in terms of training loss, test accuracy, number of remaining parameters and FLOPs, for the VGG-16 and ResNet-50 models. Similar results for AlexNet and ResNet-18 can be found in Supplementary Table 2. Except for CIFAR-10, the highest compression rate (i.e., lowest number of parameters) could be achieved by the proposed LRP-based criterion (row "Params") for VGG-16, but not for ResNet-50. However, in terms of FLOPs, the proposed criterion only outperformed the weight criterion, but not the Taylor and Gradient criteria (row "FLOPs"). This is due to the fact that a reduction in the number of FLOPs depends on the location within the network where pruning is applied: Figure 7 shows that the LRP and weight criteria focus the pruning on upper layers closer to the model output, whereas the Taylor and Gradient criteria focus more on the lower layers. Throughout the pruning process, a gradual decrease in performance can usually be observed. However, with the Event 8, Oxford Flower 102 and CIFAR-10 datasets, pruning leads to an initial performance increase, until a pruning rate of approx. 30% is reached. This behavior has been reported before in the literature and might stem from improvements of the model structure through the elimination of filters related to classes in the source dataset (i.e., ILSVRC 2012) that are no longer present in the target dataset [44]. Supplementary Table 3 and Supplementary Figure 2 similarly show that LRP achieves the highest test accuracy for AlexNet and ResNet-18 at nearly all pruning ratios with almost every dataset. Figure 7 shows the number of remaining convolutional filters at each iteration.
We observe that, on the one hand, as the pruning rate increases, the convolutional filters in earlier layers, which are associated with very generic features such as edge and blob detectors, tend to be preserved, as opposed to those in later layers, which are associated with abstract, task-specific features. On the other hand, the LRP and weight criteria initially keep the filters in the early layers, but later aggressively prune filters near the input which have lost their functionality as input to later layers, compared to the gradient-based criteria such as the gradient- and Taylor-based approaches. Although the gradient-based criteria also adopt the greedy layer-by-layer approach, we can see that they pruned the less important filters almost uniformly across all layers, due to the re-normalization of the criterion in each iteration. However, this result contrasts with previous gradient-based works [22, 25], which have shown that units deemed unimportant in earlier layers contribute significantly compared to units deemed important in later layers. In contrast to this, LRP can efficiently preserve units in the early layers, as long as they serve a purpose, despite iterative global pruning.

4.2.2. Scenario 2: Pruning without Fine-tuning

In this section, we evaluate whether pruning works well if only a (very) limited number of samples is available for quantifying the pruning criteria. To the best of our knowledge, there are no previous studies that show the performance of pruning approaches when acting w.r.t. very small amounts of data. With large amounts of data available (and even though we can expect reasonable performance after pruning), an iterative pruning and fine-tuning procedure of the network can amount to a very time-consuming and computationally heavy process. From a practical point of view, this issue becomes a significant problem, e.g.
with limited computational resources (mobile devices or, in general, consumer-level hardware) and reference data (e.g., private photo collections), where capable and effective one-shot pruning approaches are desired and only little leeway (or none at all) for fine-tuning strategies after pruning is available. To investigate whether pruning is also possible in these scenarios, we performed experiments with a relatively small number of data points on 1) Cats & Dogs and 2) subsets of the ILSVRC 2012 classes. On the Cats & Dogs dataset, we only used 10 samples each from the "cat" and "dog" classes to prune the (ImageNet-)pre-trained AlexNet, VGG-16, ResNet-18 and ResNet-50 networks with the goal of domain/dataset adaptation. The binary classification (i.e., "cat" vs. "dog") is a subtask within the ImageNet taxonomy, and the corresponding output neurons can be identified by their WordNet⁵ associations. This experiment implements the task of domain adaptation. In a second experiment on the ILSVRC 2012 dataset, we randomly chose k = 3 classes for the task of model specialization, selected only n = 10 images per class from the training set and used them to compare the different pruning criteria. For each criterion, we used the same selection of classes and samples. In both experimental settings, we do not fine-tune the models after each pruning iteration, in contrast to Scenario 1 in Section 4.2.1. The obtained post-pruning model performance is averaged over 20 random selections of classes (ImageNet) and samples (Cats & Dogs) to account for randomness. Please note that before pruning, we first restructured the models' fully-connected output layers to only preserve the task-relevant k network outputs by eliminating the 1000 − k redundant output neurons. Furthermore, as our target datasets are relatively small and only have an extremely reduced set of target classes, the pruned models could still be very heavy w.r.t.
memory requirements if the pruning process were limited to the convolutional layers, as in Section 4.2.1. More specifically, while convolutional layers dominantly constitute the source of computation cost (FLOPs), fully-connected layers are proven to be more redundant [29]. In this respect, we applied the pruning procedure to both fully-connected layers and convolutional layers in combination for VGG-16.

⁵ http://www.image-net.org/archive/wordnet.is_a.txt

Figure 8: Test accuracy after pruning of n% of convolutional (rows) and m% of fully-connected (columns) filters on VGG-16 without fine-tuning, for a random subset of the classes from ILSVRC 2012 (k = 3), based on the different criteria (averaged over 20 repetitions). Each color represents a range of 5% in test accuracy. The brighter the color, the better the performance after a given degree of pruning.

For pruning, we iterate a sequence of first pruning filters from the convolutional layers, followed by a step of pruning neurons from the model's fully-connected layers. Note that both evaluated ResNet architectures mainly consist of convolutional and pooling layers and conclude in a single dense layer, whose set of input neurons is only affected, via their inputs, by pruning the convolutional stack below. We therefore restrict the iterative pruning of filters from the sequence of dense layers to the feed-forward architecture of VGG-16. The model performance after the application of each criterion for classifying a small number of classes (k = 3) from the ILSVRC 2012 dataset is shown in Figure 8 for VGG-16 and Figure 9 for the ResNets (note again that the ResNets conclude in only a single dense layer). During pruning of the fully-connected layers, no significant difference across different pruning ratios can be observed. Without further fine-tuning, pruning weights/filters in the fully-connected layers can retain performance efficiently.
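Restructuring the output layer to the k task-relevant classes, as done before pruning in this scenario, amounts to selecting k rows of the final fully-connected layer. A NumPy sketch (names hypothetical):

```python
import numpy as np

def restrict_output_layer(w, b, keep_classes):
    """Keep only the output neurons of the task-relevant classes.
    w: (1000, d) weight matrix of the final layer, b: (1000,) bias,
    keep_classes: indices of the k classes to preserve (e.g. the
    output indices mapped to 'cat' and 'dog' via WordNet)."""
    idx = np.asarray(keep_classes)
    return w[idx], b[idx]           # (k, d) weights, (k,) bias
```

The 1000 − k discarded rows vanish from the parameter count immediately, while the predictions for the kept classes are unchanged, since each output neuron depends only on its own row of the weight matrix.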
However, there is a certain difference between LRP and the other criteria with increasing pruning ratio of the convolutional layers for VGG-16 / ResNet-18 / ResNet-50, respectively (LRP vs. Taylor with ℓ2-norm: up to 9.6 / 61.8 / 51.8%; LRP vs. gradient with ℓ2-norm: up to 28.0 / 63.6 / 54.5%; LRP vs. weight with ℓ2-norm: up to 27.1 / 48.3 / 30.2%). Moreover, pruning convolutional layers needs to be managed more carefully than pruning fully-connected layers. We can observe that LRP is applicable for pruning any layer type (i.e., fully-connected, convolutional, pooling, etc.) efficiently. Additionally, as mentioned in Section 3.1, our method can be applied to general network architectures because it can automatically measure the importance of weights or filters in a global (network-wise) context without further normalization.

Figure 9: Test accuracy after pruning of n% of convolutional filters on ResNet-18 and ResNet-50 without fine-tuning, for a random subset of the classes from ILSVRC 2012 (k = 3), based on the criteria Weight, Taylor, Gradient with ℓ2-norm, and LRP (averaged over 20 repetitions). Compare to Figure 8.

Figure 10 shows the test accuracy as a function of the pruning ratio, in the context of a domain adaptation task from ImageNet towards the Cats & Dogs dataset, for all models. As the pruning ratio increases, we can see that even without fine-tuning, using LRP as the pruning criterion can keep the test accuracy not only stable, but close to 100%, given the extreme scarcity of data in this experiment. In contrast, the performance decreases significantly when using the other criteria, which require an application of the ℓ2-norm. Initially, the performance even increases slightly when pruning with LRP. During iterative pruning, unexpected changes in accuracy with LRP (for 2 out of 20 repetitions of the experiment) have been observed around 50-55% pruning ratio, but accuracy is regained quickly.
However, only the VGG-16 model seems to be affected, and no other model for this task. For both ResNet models, this phenomenon occurs for the other criteria instead. A series of in-depth investigations of this momentary decrease in performance did not lead to any insights and will be the subject of future work⁶. By pruning over 99% of the convolutional filters in the networks using our proposed method, we obtain 1) greatly reduced computational cost, 2) faster forward and backward processing (e.g., for the purpose of further training, inference or the computation of attribution maps), and 3) a lighter model even in the small-sample case, all while adapting off-the-shelf pre-trained ImageNet models towards a dog-vs.-cat classification task.

⁶ We consequently have to assume that this phenomenon marks the downloaded pre-trained VGG-16 model as an outlier in this respect. A future line of research will dedicate inquiries to the circumstances leading to the intermediate loss and later recovery of model performance during pruning.

Figure 10: Performance comparison of pruning without fine-tuning for AlexNet, VGG-16, ResNet-18 and ResNet-50 based on only few (10) samples per class from the Cats & Dogs dataset, as a means for domain adaptation. Additional results on further target domains can be found in Supplementary Figure 3.

5. Discussion

Our experiments demonstrate that the novel LRP criterion consistently performed well compared to the other criteria across various datasets, model architectures and experimental settings, and oftentimes outperformed the competing criteria. This is especially pronounced in our Scenario 2 (cf. Section 4.2.2), where only few resources are available for criterion computation and no fine-tuning after pruning is allowed. Here, LRP considerably outperformed the other metrics on toy data (cf. Section 4.1) and image processing benchmark data (cf. Section 4.2.2). The strongly
similar results between criteria observed in Scenario 1 (cf. Section 4.2.1) are also not surprising, as an additional fine-tuning step after pruning may allow the pruned neural network model to recover its original performance, as long as the model has the capacity to do so [22]. From the results of Table 3 and Supplementary Table 3 we can observe that, with a fixed pruning target of n% of filters removed, LRP might not always result in the cheapest sub-network after pruning in terms of parameter count and FLOPs per inference; however, it consistently is able to identify the network components for removal and preservation leading to the best-performing model after pruning. The latter results also resonate strongly in our experiments of Scenario 2 on both image and toy data, where, without the additional fine-tuning step, the LRP-pruned models vastly outperform their competitors. The results obtained in multiple toy settings verify that only the LRP-based pruning criterion is able to preserve the original structure of the prediction function (cf. Figures 2 and 3). Unlike the weight criterion, which is a static quantity once the network is no longer in training, the Taylor, gradient and LRP criteria require reference samples for computation, which in turn may affect the estimation of neuron importance. Of the latter three criteria, however, only LRP provides a continuous measure of network structure importance (cf. Sec. 7.2 in [12]) which does not suffer from abrupt changes in the estimated importance measures under only marginal steps between reference samples. This quality of continuity is reflected in the stability and quality of the LRP results reported in Section 4.1, compared to the high volatility in neuron selection for pruning, and in model performance after pruning, observable for the gradient and Taylor criteria.
From this observation it can also be deduced that LRP requires relatively few data points to converge to a pruning solution that exhibits a prediction behavior similar to the original model. Hence, we conclude that LRP is a robust pruning criterion that is broadly applicable in practice. Especially in a scenario where no fine-tuning is applied after pruning (see Sec. 4.2.2), the LRP criterion allows for pruning a large part of the model without significant accuracy drops. In terms of computational cost, LRP is comparable to the Taylor and Gradient criteria, because all of these criteria require both a forward and a backward pass for all reference samples. The weight criterion is substantially cheaper to compute, since it does not require the evaluation of any reference samples; however, its performance falls short in most of our experiments. Additionally, our experiments demonstrate that LRP requires fewer reference samples than the other criteria (cf. Figures 3 and 4); thus the required computational cost is lower in practical scenarios, and better performance can be expected if only low numbers of reference samples are available (cf. Figure 10). Unlike all other criteria, LRP does not require explicit regularization via ℓp-normalization, as it is naturally normalized via the relevance conservation principle enforced during relevance backpropagation, which leads to the preservation of important network substructures and bottlenecks in a global model context. In line with the findings of [22], our results in Figure 5 and Supplementary Figure 2 show that additional normalization after criterion computation for weight, gradient and Taylor is not only vital to obtain good performance, but also to avoid disconnected model segments, something which is prevented out-of-the-box with LRP. However, our proposed criterion still leaves several open questions that deserve a deeper investigation in future work.
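The relevance conservation mentioned above can be made concrete for a single dense layer. A minimal NumPy sketch of the α1β0 (z⁺) redistribution rule, simplified in that biases are ignored and the function name is hypothetical:

```python
import numpy as np

def lrp_alpha1beta0(a, w, relevance_out, eps=1e-9):
    """Redistribute output relevance to the inputs of one dense layer
    using only the positive contributions (alpha=1, beta=0):
    R_j = sum_k [ (a_j w_jk)^+ / sum_j' (a_j' w_j'k)^+ ] * R_k.
    a: (d_in,) input activations, w: (d_in, d_out) weights,
    relevance_out: (d_out,) relevance arriving from the layer above."""
    z = np.maximum(a[:, None] * w, 0.0)   # positive contributions only
    denom = z.sum(axis=0) + eps           # per-output normalization
    return z @ (relevance_out / denom)    # conserved redistribution
```

Summing the returned input relevances reproduces the sum of the incoming relevances (up to eps, and up to outputs without any positive contribution), which is exactly the conservation property that makes explicit ℓp-normalization of the pruning criterion unnecessary.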
First of all, LRP is not implementation-invariant, i.e., the structure and composition of the analyzed network might affect the computation of the LRP criterion, and “network canonization” (a functionally equivalent restructuring of the model) might be required for optimal results, as discussed earlier in Section 4 and in [43]. Furthermore, while our LRP criterion does not require additional hyperparameters, e.g., for normalization, the pruning result might still depend on the chosen LRP variant. In this paper, we chose the α1β0-rule in all layers, because this particular parameterization identifies the network's neural pathways positively contributing to the selected output neurons for which reference samples are provided, is robust to the detrimental effects of shattered gradients affecting especially very deep CNNs [11] (i.e., unlike gradient-based methods, it does not suffer from potential discontinuities in the backpropagated quantities), and has a mathematically well-motivated foundation in DTD [11, 12]. However, other works from the literature provide [14] or suggest [9, 8] alternative parameterizations to optimize the method for explanatory purposes. It is an interesting direction for future work to examine whether these findings also apply to LRP as a pruning criterion.

6. Conclusion

Modern CNNs typically have a high capacity with millions of parameters, as this allows to obtain good optimization results in the training process. After training, however, high inference costs remain, despite the fact that the number of effective parameters in the deep model is actually significantly lower (see e.g. [45]). To alleviate this, pruning aims at compressing and accelerating the given models without sacrificing much predictive performance.
In this paper, we have proposed a novel criterion for the iterative pruning of CNNs based on the explanation method LRP, linking for the first time two so far disconnected lines of research. LRP has a clearly defined meaning, namely the contribution of an individual network unit, i.e. weight or filter, to the network output. Removing units according to low LRP scores thus means discarding all aspects in the model that do not contribute relevance to its decision making. Hence, as a criterion, the computed relevance scores can easily and cheaply give efficient compression rates without further postprocessing, such as per-layer normalization. Besides, technically, LRP is scalable to general network structures and its computational cost is similar to that of a gradient backward pass. In our experiments, the LRP criterion has shown favorable compression performance on a variety of datasets both with and without retraining after pruning. Especially when pruning without retraining, our results for small datasets suggest that the LRP criterion outperforms the state of the art; its application is therefore especially recommended in transfer learning settings where only a small target dataset is available. In addition to pruning, the same method can be used to visually interpret the model and explain individual decisions as intuitive relevance heatmaps. Therefore, in future work, we propose to use these heatmaps to elucidate and explain which image features are most strongly affected by pruning, to additionally avoid that the pruning process leads to undesired Clever Hans phenomena [8].

Acknowledgements

This work was supported by the German Ministry for Education and Research (BMBF) through BIFOLD (refs. 01IS18025A and 01IS18037A), MALT III (ref. 01IS17058), Patho234 (ref. 031L0207D) and TraMeExCo (ref. 01IS18056A), as well as the Grants 01GQ1115 and 01GQ0850; by the Deutsche Forschungsgemeinschaft (DFG) under Grant Math+, EXC 2046/1, Project ID 390685689; by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government (No. 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University); and by the STE-SUTD Cyber Security Corporate Laboratory, the AcRF Tier 2 grant MOE2016-T2-2-154, the TL project Intent Inference, and the SUTD internal grant Fundamentals and Theory of AI Systems. The authors would like to express their thanks to Christopher J. Anders for insightful discussions.

References

[1] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, T. Chen, Recent advances in convolutional neural networks, Pattern Recognition 77 (2018) 354–377.
[2] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, N. de Freitas, Predicting parameters in deep learning, in: Advances in Neural Information Processing Systems (NIPS), 2013, pp. 2148–2156.
[3] V. Sze, Y. Chen, T. Yang, J. S. Emer, Efficient processing of deep neural networks: A tutorial and survey, Proceedings of the IEEE 105 (2017) 2295–2329.
[4] Y. LeCun, J. S. Denker, S. A. Solla, Optimal brain damage, in: Advances in Neural Information Processing Systems (NIPS), 1989, pp. 598–605.
[5] Y. Tu, Y. Lin, Deep neural network compression technique towards efficient digital signal modulation recognition in edge device, IEEE Access 7 (2019) 58113–58119.
[6] Y. Cheng, D. Wang, P. Zhou, T. Zhang, Model compression and acceleration for deep neural networks: The principles, progress, and challenges, IEEE Signal Processing Magazine 35 (2018) 126–136.
[7] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLoS ONE 10 (2015) e0130140.
[8] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, K.-R. Müller, Unmasking Clever Hans predictors and assessing what machines really learn, Nature Communications 10 (2019) 1096.
[9] M. Hägele, P. Seegerer, S. Lapuschkin, M. Bockmayr, W. Samek, F. Klauschen, K.-R. Müller, A. Binder, Resolving challenges in deep learning-based analyses of histopathological images using explanation methods, Scientific Reports 10 (2020) 6423.
[10] P. Seegerer, A. Binder, R. Saitenmacher, M. Bockmayr, M. Alber, P. Jurmeister, F. Klauschen, K.-R. Müller, Interpretable deep neural network to predict estrogen receptor status from haematoxylin-eosin images, in: Artificial Intelligence and Machine Learning for Digital Pathology: State-of-the-Art and Future Challenges, Springer International Publishing, Cham, 2020, pp. 16–37.
[11] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, K.-R. Müller, Explaining nonlinear classification decisions with deep Taylor decomposition, Pattern Recognition 65 (2017) 211–222.
[12] G. Montavon, W. Samek, K.-R. Müller, Methods for interpreting and understanding deep neural networks, Digital Signal Processing 73 (2018) 1–15.
[13] W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, K.-R. Müller, Toward interpretable machine learning: Transparent deep neural networks and beyond, arXiv preprint (2020).
[14] M. Alber, S. Lapuschkin, P. Seegerer, M. Hägele, K. T. Schütt, G. Montavon, W. Samek, K.-R. Müller, S. Dähne, P.-J. Kindermans, iNNvestigate neural networks!, Journal of Machine Learning Research 20 (2019) 93:1–93:8.
[15] S. Wiedemann, K.-R. Müller, W. Samek, Compact and computationally efficient representation of deep neural networks, IEEE Transactions on Neural Networks and Learning Systems 31 (2020) 772–785.
[16] F. Tung, G. Mori, Deep neural network compression by in-parallel pruning-quantization, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2020) 568–579.
[17] K. Guo, X. Xie, X. Xu, X. Xing, Compressing by learning in a low-rank and sparse decomposition form, IEEE Access 7 (2019) 150823–150832.
[18] T. Xu, P. Yang, X. Zhang, C. Liu, LightweightNet: Toward fast and lightweight convolutional neural networks via architecture distillation, Pattern Recognition 88 (2019) 272–284.
[19] X. Zhang, X. Zhou, M. Lin, J. Sun, ShuffleNet: An extremely efficient convolutional neural network for mobile devices, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6848–6856.
[20] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, J. Kautz, Importance estimation for neural network pruning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11264–11272.
[21] B. Hassibi, D. G. Stork, Second order derivatives for network pruning: Optimal brain surgeon, in: Advances in Neural Information Processing Systems (NIPS), 1992, pp. 164–171.
[22] P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz, Pruning convolutional neural networks for resource efficient transfer learning, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[23] C. Yu, J. Wang, Y. Chen, X. Qin, Transfer channel pruning for compressing deep domain adaptation models, International Journal of Machine Learning and Cybernetics 10 (2019) 3129–3144.
[24] C. Liu, H. Wu, Channel pruning based on mean gradient for accelerating convolutional neural networks, Signal Processing 156 (2019) 84–91.
[25] X. Sun, X. Ren, S. Ma, H. Wang, meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting, in: International Conference on Machine Learning (ICML), 2017, pp. 3299–3308.
[26] S. Han, J. Pool, J. Tran, W. J. Dally, Learning both weights and connections for efficient neural network, in: Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.
[27] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, W. J. Dally, EIE: Efficient inference engine on compressed deep neural network, in: International Symposium on Computer Architecture (ISCA), 2016, pp. 243–254.
[28] W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, in: Advances in Neural Information Processing Systems (NIPS), 2016, pp. 2074–2082.
[29] H. Li, A. Kadav, I. Durdanovic, H. Samet, H. P. Graf, Pruning filters for efficient convnets, in: International Conference on Learning Representations (ICLR), 2017.
[30] R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin, L. S. Davis, NISP: Pruning networks using neuron importance score propagation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9194–9203.
[31] J.-H. Luo, H. Zhang, H.-Y. Zhou, C.-W. Xie, J. Wu, W. Lin, ThiNet: Pruning CNN filters for a thinner net, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2019) 2525–2538.
[32] J. Gan, W. Wang, K. Lu, Compressing the CNN architecture for in-air handwritten Chinese character recognition, Pattern Recognition Letters 129 (2020) 190–197.
[33] X. Dai, H. Yin, N. K. Jha, NeST: A neural network synthesis tool based on a grow-and-prune paradigm, IEEE Transactions on Computers 68 (2019) 1487–1497.
[34] W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, K.-R. Müller (Eds.), Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of Lecture Notes in Computer Science, Springer, 2019.
[35] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, K.-R. Müller, Evaluating the visualization of what a deep neural network has learned, IEEE Transactions on Neural Networks and Learning Systems 28 (2017) 2660–2673.
[36] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 2169–2178.
[37] L. Li, F. Li, What, where and who? Classifying events by scene and object recognition, in: IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[38] J. Elson, J. R. Douceur, J. Howell, J. Saul, Asirra: A CAPTCHA that exploits interest-aligned manual image categorization, in: Proceedings of the 2007 ACM Conference on Computer and Communications Security (CCS), 2007, pp. 366–374.
[39] M. Nilsback, A. Zisserman, Automated flower classification over a large number of classes, in: Sixth Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP), 2008, pp. 722–729.
[40] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, F. Li, ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015) 211–252.
[41] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[42] H. Wang, Q. Zhang, Y. Wang, H. Hu, Structured probabilistic pruning for convolutional neural network acceleration, in: British Machine Vision Conference (BMVC), 2018, p. 149.
[43] M. Guillemot, C. Heusele, R. Korichi, S. Schnebert, L. Chen, Breaking batch normalization for better explainability of deep neural networks through layer-wise relevance propagation, CoRR abs/2002.11018 (2020).
[44] J. Liu, Y. Wang, Y. Qiao, Sparse deep transfer learning for convolutional neural network, in: AAAI Conference on Artificial Intelligence, 2017, pp. 2245–2251.
[45] N. Murata, S. Yoshizawa, S. Amari, Network information criterion-determining the number of hidden units for an artificial neural network model, IEEE Transactions on Neural Networks 5 (1994) 865–872.
[46] S. Bianco, R. Cadene, L. Celona, P. Napoletano, Benchmark analysis of representative deep neural network architectures, IEEE Access 6 (2018) 64270–64277.
[47] S. Maji, E. Rahtu, J. Kannala, M. B. Blaschko, A. Vedaldi, Fine-Grained Visual Classification of Aircraft, Technical Report, 2013.
[48] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 Dataset, Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[49] J. Krause, M. Stark, J. Deng, L. Fei-Fei, 3D object representations for fine-grained categorization, in: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.

Pruning by Explaining: A Novel Criterion for Deep Neural Network Pruning
— Supplementary Materials —

Supplementary Methods 1: Data Preprocessing

During fine-tuning, images are resized to 256 × 256 and randomly cropped to 224 × 224 pixels, and then horizontally flipped with a random chance of 50% for data augmentation. For testing, images are resized to 224 × 224 pixels.

Scene 15: The Scene 15 dataset contains about 4,485 images and consists of 15 natural scene categories obtained from the COREL collection, Google image search and personal photographs [36]. We fine-tuned four different models on 20% of the images from each class and achieved initial Top-1 accuracies of 88.59% for VGG-16, 85.48% for AlexNet, 83.96% for ResNet-18, and 88.28% for ResNet-50, respectively.

Event 8: Event-8 consists of 8 sports event categories, integrating scene and object recognition. We use 40% of the dataset's images for fine-tuning and the remaining 60% for testing. We adopted the common data augmentation method as in [26].
Cats and Dogs: This is the Asirra dataset provided by Microsoft Research, as given for the Kaggle competition (KMLC-Challenge-1) [38]. The training dataset contains 4,000 colored images of dogs and 4,005 colored images of cats, alongside 2,023 test images. We reached initial accuracies of 99.36% for VGG-16, 96.84% for AlexNet, 97.97% for ResNet-18, and 98.42% for ResNet-50 based on a transfer learning approach.

Oxford Flowers 102: The Oxford Flowers 102 dataset contains 102 species of flower categories found in the UK; it is a collection with over 2,000 training and 6,100 test images [39]. We fine-tuned models with networks pre-trained on ImageNet for transfer learning.

CIFAR-10: This dataset contains 50,000 training images and 10,000 test images spanning 10 categories of objects. The resolution of each image is 32 × 32 pixels and we therefore resize the images to 224 × 224 pixels.

ILSVRC 2012: In order to show the effectiveness of the pruning criteria in the small sample scenario, we pruned and tested all models on k = 3 randomly selected classes out of the 1000 classes and data of the ImageNet corpus [40].

Supplementary Results 1: Additional results on toy data

Supplementary Table 1: Comparison of training accuracy after one-shot pruning one third of all filters w.r.t. one of the four metrics on toy datasets, with n ∈ [1, 5, 20, 100] reference samples used for criteria computation for Weight (W), Gradient (G), Taylor (T) and LRP (L). The weight criterion is computed once, as it does not depend on reference samples. Compare to Figure 4.

Dataset | Unpruned | W     | n=1: T / G / L        | n=5: T / G / L        | n=20: T / G / L        | n=100: T / G / L
moon    | 99.90    | 99.60 | 79.80 / 83.07 / 85.01 | 84.70 / 86.07 / 99.86 | 86.99 / 85.87 / 99.85  | 94.77 / 93.53 / 99.85
circle  | 100.00   | 97.10 | 68.35 / 69.21 / 70.23 | 87.18 / 82.23 / 99.89 | 91.87 / 85.36 / 100.00 | 97.04 / 90.88 / 100.00
multi   | 94.95    | 91.00 | 34.28 / 34.28 / 62.98 | 77.34 / 67.96 / 91.85 | 83.21 / 77.39 / 91.59  | 84.76 / 82.68 / 91.25

Here, we discuss with Supplementary Table 2 the consistency of LRP-based neuron selection across reference sample sizes.
One can assume that the larger the choice of n, the less volatile is the choice of (un)important neurons by the criterion, as the influence of individual reference samples is marginalized out. We therefore compare the first and last ranked sets of neurons selected for a low (yet, per our observations, sufficient) number n = 10 of reference samples to all other reference sample set sizes m, over all unique random seed combinations. For all comparisons of n × m (except for m = 1) we observe a remarkable consistency in the selection of (un)important network substructures. With an increasing m, we see the consistency in neuron set selection gradually increase and then plateau for the “moon” and “circle” datasets, which means that the selected set of neurons remains consistent for larger sets of reference samples from that point on. For the “mult” toy dataset, we observe a gradual yet minimal decrease in the set similarity scores for m ≥ 10, which means that the results deviate from the neurons selected for n = 10, i.e. variability over the neuron sets selected for n = 10 is the source of the volatility between n-reference and m-reference selected neuron sets. In all cases, peak consistency is achieved at n ∈ {5, 10} reference samples, identifying low numbers of n ∈ {5, 10} as sufficient for consistently pruning our toy models.

Supplementary Table 2: A consistency comparison of neuron selection of LRP between reference sample sets sized n, averaged over all 1225 unique random seed combinations. Higher values indicate higher consistency. We report results for n = 10 reference samples in comparison to m ∈ [1, 2, 5, 10, 20, 50, 100, 200] reference samples per class and k = 250. Observations for all other combinations of n × m, k ∈ [125, 500, 1000] and all other criteria are available on github.
first-250 consistency (n = 10 vs. m = 1 / 2 / 5 / 10 / 20 / 50 / 100 / 200):

Dataset | m=1   | m=2   | m=5   | m=10  | m=20  | m=50  | m=100 | m=200
moon    | 0.687 | 0.865 | 0.942 | 0.947 | 0.951 | 0.952 | 0.950 | 0.950
circle  | 0.689 | 0.795 | 0.831 | 0.843 | 0.845 | 0.846 | 0.846 | 0.842
mult    | 0.142 | 0.625 | 0.773 | 0.791 | 0.779 | 0.765 | 0.763 | 0.762

last-250 consistency (n = 10 vs. m = 1 / 2 / 5 / 10 / 20 / 50 / 100 / 200):

Dataset | m=1   | m=2   | m=5   | m=10  | m=20  | m=50  | m=100 | m=200
moon    | 0.676 | 0.810 | 0.916 | 0.928 | 0.938 | 0.946 | 0.947 | 0.948
circle  | 0.698 | 0.874 | 0.919 | 0.937 | 0.946 | 0.951 | 0.953 | 0.954
mult    | 0.160 | 0.697 | 0.890 | 0.942 | 0.940 | 0.936 | 0.934 | 0.933

Supplementary Results 2: Additional results for image processing neural networks

Here, we provide results for AlexNet and ResNet-18, in addition to the VGG-16 and ResNet-50 results shown in the main paper, in Supplementary Figure 1 (cf. Figure 6), Supplementary Figure 2 (cf. Figure 5), and Supplementary Table 3 (cf. Table 3). These results demonstrate that the favorable pruning performance of our LRP criterion is not limited to any specific network architecture. We remark that the results for CIFAR-10 show a larger robustness to higher pruning rates. This is due to the fact that CIFAR-10 has the lowest resolution among the datasets and, as a consequence, little structure in its images. The images contain components with predominantly very low frequencies, so the filters covering higher frequencies are expected to be mostly idle for CIFAR-10. This makes the pruning task less challenging; therefore no method can clearly distinguish itself by a different pruning strategy that addresses those filters covering the higher frequencies in images. Furthermore, the Asirra dataset of cats and dogs may raise the concern that it is the MNIST analogue for pets, representing a rather simple problem, and that for this reason the results might not transfer to problems with larger variance. To validate our observation, we have chosen the more lightweight [46] ResNet-50 model and evaluated its pruning performance on three further datasets.
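The consistency scores in Supplementary Table 2 compare sets of selected neurons across reference sample set sizes. The exact similarity measure is not restated here, so the sketch below uses a plain overlap ratio between the k first-ranked (or last-ranked) neurons of two pruning orders as one plausible instantiation; the function name and signature are assumptions:

```python
def selection_consistency(ranking_a, ranking_b, k=250):
    """Overlap ratio of the k first-ranked neurons under two pruning
    orders: 1.0 means identical selections, 0.0 disjoint ones.  For the
    'last-250' columns, pass the rankings reversed."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / float(k)
```

Applied to the rankings produced from n = 10 and m reference samples, values near 1 would indicate that the pruning decision is stable against the choice of reference set size.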
Each of the three datasets is composed as a binary discrimination problem obtained by fusing two datasets chosen from the following selection: FGVC aircraft [47], CUB birds [48] and Stanford cars [49]. We have chosen these three datasets as they are known from the literature, have an intrinsic variability, as visible from their numbers of classes, and a medium sample size. Most importantly, we know for these datasets that the object categories defining their contents are matched by some classes in the ImageNet dataset, which is used to initialize the weights of the ResNet-50 network.

Supplementary Figure 1: Performance comparison of the proposed method (i.e. LRP) and other criteria on AlexNet and ResNet-18 with five datasets (Scene 15, CIFAR-10, Cats and Dogs, Event 8, Oxford Flowers 102). Each point in the scatter plot corresponds to the performance at a specific pruning rate of two criteria, where the vertical axis shows the performance of LRP and the horizontal axis the performance of one other criterion (compare to Supplementary Figure 2, which displays the same data for more than two criteria). The black dashed line shows the set of points where models pruned by one of the compared criteria would exhibit identical performance to LRP. For accuracy, higher values are better. For loss, lower values are better. Compare to Figure 6.

Supplementary Figure 2: Comparison of test accuracy for the different criteria as the pruning rate increases, on AlexNet (top) and ResNet-18 (bottom) with five datasets. Pruning with fine-tuning. Prematurely terminated lines indicate that during pruning, the respective criterion removed vital filters and thus disconnected the model input from the output. Compare to Figure 5.

Supplementary Figure 3 shows the results for the three composed discrimination problems, without fine-tuning after pruning.
We can observe that each pruning method is able to remove a certain amount of network filters without notable loss of discrimination performance. Weight-based pruning performs second best, while LRP-based pruning consistently allows the largest fraction of filters to be pruned before prediction accuracy starts to degrade.

Supplementary Table 3: A performance comparison between criteria (Weight, Taylor, Gradient with ℓ2-norm, and LRP) and the Unpruned model for AlexNet (top) and ResNet-18 (bottom) on five different image benchmark datasets. Criteria are evaluated at fixed pruning rates per model and dataset, identified as ⟨dataset⟩ @ ⟨percent pruned filters⟩%. We report test accuracy (in %), (training) loss (×10⁻²), number of remaining parameters (×10⁷) and FLOPs (in Mega Multiply-Accumulate Operations per Second (MMACS) for AlexNet and GMACS for ResNet-18) per forward pass. For all measures except accuracy, lower outcomes are better. Compare to Table 3.

AlexNet:

Scene 15 @ 55%  | U      | W      | T      | G      | L
Loss            | 2.49   | 2.78   | 2.31   | 2.46   | 2.02
Accuracy        | 85.48  | 78.43  | 79.40  | 78.63  | 80.76
Params          | 54.60  | 35.19  | 33.79  | 33.29  | 33.93
FLOPs           | 711.51 | 304.88 | 229.95 | 225.19 | 277.98

Event 8 @ 55%   | U      | W      | T      | G      | L
Loss            | 1.16   | 1.67   | 1.21   | 1.37   | 1.00
Accuracy        | 94.89  | 88.10  | 88.62  | 88.42  | 90.41
Params          | 54.57  | 34.25  | 33.00  | 33.29  | 32.79
FLOPs           | 711.48 | 301.90 | 241.18 | 238.36 | 291.95

Cats & Dogs @ 60% | U      | W      | T      | G      | L
Loss              | 0.50   | 0.77   | 0.78   | 0.87   | 0.70
Accuracy          | 96.84  | 95.86  | 95.23  | 94.89  | 95.81
Params            | 54.54  | 32.73  | 33.50  | 33.97  | 32.66
FLOPs             | 711.46 | 264.04 | 199.88 | 190.84 | 240.19

Oxford Flowers 102 @ 70% | U      | W      | T      | G      | L
Loss                     | 5.15   | 4.83   | 3.39   | 3.77   | 3.01
Accuracy                 | 78.74  | 63.93  | 64.10  | 64.11  | 65.69
Params                   | 54.95  | 28.13  | 29.19  | 28.72  | 28.91
FLOPs                    | 711.87 | 192.69 | 132.34 | 141.82 | 161.35

CIFAR-10 @ 30%  | U      | W      | T      | G      | L
Loss            | 1.95   | 2.44   | 2.46   | 2.46   | 2.33
Accuracy        | 87.83  | 90.03  | 89.48  | 89.58  | 89.87
Params          | 54.58  | 48.31  | 84.23  | 46.31  | 48.22
FLOPs           | 711.49 | 477.16 | 371.48 | 395.43 | 402.93

ResNet-18:

Scene 15 @ 50%  | U     | W     | T     | G     | L
Loss            | 1.32  | 1.98  | 1.28  | 1.03  | 0.85
Accuracy        | 83.97 | 69.95 | 78.14 | 78.24 | 81.61
Params          | 11.18 | 4.63  | 4.91  | 4.96  | 4.52
FLOPs           | 1.82  | 1.30  | 1.16  | 1.10  | 1.27

Event 8 @ 55%   | U     | W     | T     | G     | L
Loss            | 0.61  | 1.28  | 0.99  | 0.72  | 0.55
Accuracy        | 95.63 | 80.20 | 83.81 | 86.76 | 90.27
Params          | 11.18 | 3.99  | 4.17  | 4.26  | 3.89
FLOPs           | 1.82  | 1.22  | 1.11  | 1.07  | 1.20

Cats & Dogs @ 45% | U     | W     | T     | G     | L
Loss              | 0.03  | 0.04  | 0.05  | 0.04  | 0.02
Accuracy          | 97.97 | 97.17 | 96.34 | 94.13 | 97.91
Params            | 11.18 | 4.88  | 5.18  | 5.15  | 5.04
FLOPs             | 1.82  | 1.36  | 1.22  | 1.21  | 1.36

Oxford Flowers 102 @ 70% | U     | W     | T     | G     | L
Loss                     | 1.36  | 4.64  | 2.96  | 1.65  | 1.59
Accuracy                 | 71.23 | 34.58 | 49.56 | 62.41 | 65.60
Params                   | 11.23 | 2.19  | 3.07  | 3.07  | 2.45
FLOPs                    | 1.82  | 0.95  | 0.72  | 0.73  | 0.93

CIFAR-10 @ 30%  | U     | W     | T     | G     | L
Loss            | 0.000 | 0.002 | 0.012 | 0.016 | 0.004
Accuracy        | 94.67 | 94.21 | 92.71 | 92.55 | 94.03
Params          | 11.17 | 6.88  | 7.85  | 7.56  | 6.62
FLOPs           | 0.56  | 0.46  | 0.42  | 0.41  | 0.47

Supplementary Figure 3: Comparison of pruning performance without a subsequent fine-tuning step for a ResNet-50 network when pruned by criteria using weights, gradient, Taylor expansion and LRP. Each dataset is a binary classification problem created by combining two datasets taken from FGVC Aircraft [47], CUB-200-2011 birds [48] and Stanford Cars [49], which are covered by similar classes in the ImageNet initialization. Results are the average of 20 repetitions with randomly drawn samples. Each run relies on 20 samples for pruning, 10 from each of the two datasets, and 2048 samples for test accuracy evaluation. For a given repetition, all methods use the same set of samples for pruning, and they use a set of samples for evaluation which again is identical for all pruning methods, but disjoint from the pruning set. Compare to Figure 10.

When comparing cats versus dogs in Figure 10 against the three composed datasets in Supplementary Figure 3, we observe that there is less redundant capacity which can be pruned away for the composed datasets. This sanity check is in line with the higher variance of these composed datasets compared to cats versus dogs. As a side remark, this observation suggests measuring dataset complexity empirically, with respect to a neural network, via the area under the accuracy graph as a function of the amount of pruned filters.
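The dataset-complexity measure suggested in the closing remark, the area under the accuracy curve over pruning rates, could be computed with the trapezoid rule. The sketch below illustrates that suggestion only; it is not an experiment from the paper, and the function name and interface are assumptions:

```python
def pruning_auc(pruning_rates, accuracies):
    """Area under the accuracy curve as a function of the fraction of
    pruned filters, via the trapezoid rule.  pruning_rates must be sorted
    in ascending order; a larger area indicates more redundant capacity
    that can be removed before accuracy degrades."""
    area = 0.0
    for i in range(1, len(pruning_rates)):
        area += 0.5 * (accuracies[i - 1] + accuracies[i]) * (
            pruning_rates[i] - pruning_rates[i - 1])
    return area
```

For example, a model that keeps full accuracy up to 50% pruning and then decays would score higher than one whose accuracy drops immediately, matching the intuition that the former dataset/model pair has more prunable redundancy.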
