Neural Plasticity Networks
Authors: Yang Li, Shihao Ji
Yang Li, Department of Computer Science, Georgia State University, Atlanta, USA, yli93@student.gsu.edu
Shihao Ji, Department of Computer Science, Georgia State University, Atlanta, USA, sji@gsu.edu

Abstract—Neural plasticity is an important functionality of the human brain, in which the number of neurons and synapses can shrink or expand in response to stimuli throughout the life span. We model this dynamic learning process as an $L_0$-norm regularized binary optimization problem, in which each unit of a neural network (e.g., weight, neuron, or channel) is attached to a stochastic binary gate whose parameters determine the level of activity of that unit in the network. At the beginning, only a small portion of the binary gates (and therefore the corresponding neurons) are activated, while the remaining neurons are in a hibernation mode. As learning proceeds, some neurons may be activated or deactivated if doing so can be justified by the cost-benefit tradeoff measured by the $L_0$-norm regularized objective. As training matures, the probability of transition between activation and deactivation diminishes until a final hardening stage. We demonstrate that all of these learning dynamics can be modulated seamlessly by a single parameter $k$. Our Neural Plasticity Network (NPN) can prune or expand a network depending on the initial capacity provided by the user; it also unifies dropout (when $k = 0$) and traditional training of DNNs (when $k = \infty$), and interpolates between these two. To the best of our knowledge, this is the first learning framework that unifies network sparsification and network expansion in an end-to-end training pipeline. Extensive experiments on a synthetic dataset and multiple image classification benchmarks demonstrate the superior performance of NPN.
We show that both network sparsification and network expansion can yield compact models of similar architectures, while retaining competitive accuracies of the original networks.¹

I. INTRODUCTION

Deep Neural Networks (DNNs) have achieved great success in a broad range of applications in image recognition [1], natural language processing [2], and game playing [3]. Along with this success is a paradigm shift from feature engineering to architecture design. The latest DNN architectures, such as ResNet [4], DenseNet [5], and Wide-ResNet [6], incorporate hundreds of millions of parameters to achieve state-of-the-art predictive performance. However, the expanding number of parameters not only increases the risk of overfitting, but also leads to high computational costs. A practical solution to this problem is network sparsification, by which weights, neurons, or channels can be pruned or sparsified significantly with minor accuracy losses [7], [8], and sometimes sparsified networks can even achieve higher accuracies due to the regularization effects of the network sparsification algorithms [9], [10]. Driven by the widespread applications of DNNs in resource-limited embedded systems, there has been an increasing interest in sparsifying networks recently [7]–[14].

¹Our source code is available at https://github.com/leo-yangli/npns.

A less explored alternative is network expansion, by which weights, neurons, or channels can be gradually added to grow a small network and improve its predictive accuracy. This paradigm is the opposite of network sparsification, but might be more desirable because (1) we don't need to set an upper bound on the network capacity (e.g., number of weights, neurons, or channels)
to start with, and the network can shrink or expand as needed for a given task; (2) it's computationally more efficient to train a small network and expand it to a larger one, as redundant neurons are less likely to emerge during the whole training process; and (3) network expansion is more biologically plausible than network sparsification according to our current understanding of human brain development [15].

In this paper, we propose Neural Plasticity Networks (NPNs), which unify network sparsification and network expansion in an end-to-end training pipeline in which the number of neurons and synapses can shrink or expand as needed to solve a given learning task. We model this dynamic learning process as an $L_0$-norm regularized binary optimization problem, in which each unit of a neural network (e.g., weight, neuron, or channel) is attached to a stochastic binary gate whose parameters determine the level of activity of that unit in the whole network. The activation or deactivation of a unit is completely data-driven as long as doing so can be justified by the cost-benefit tradeoff measured by the $L_0$-norm regularized objective. As a result, given a network architecture either large or small, our NPN can automatically perform sparsification or expansion without human intervention until a suitable network capacity is reached.

NPN is built on top of our previous $L_0$-ARM algorithm [16]. However, the original $L_0$-ARM algorithm only explores network sparsification, where it demonstrates state-of-the-art performance at pruning networks; here we extend this framework to network expansion. On the algorithmic side, we further investigate Augment-Reinforce-Merge (ARM) [17], a recently proposed unbiased gradient estimator for binary latent variable models.
We show that due to the flexibility of ARM, many smooth or non-smooth parametric functions, such as the scaled sigmoid or hard sigmoid, can be used to parameterize the $L_0$-norm regularized binary optimization problem while the unbiasedness of the ARM estimator is retained, whereas the closely related hard concrete estimator [13] has to rely on the hard sigmoid function for binary optimization. It is this difference that gives NPN the capability of shrinking or expanding network capacity as needed for a given task. We also introduce a learning stage scheduler for NPN and demonstrate that the training stages of network sparsification and expansion, such as pre-training, sparsification/expansion, and fine-tuning, can be modulated seamlessly by a single parameter $k$; along the way, we also give a new interpretation of dropout [18]. Extensive experiments on a synthetic dataset and multiple public datasets demonstrate the superior performance of NPNs for network sparsification and network expansion with fully connected layers, convolutional layers, and skip connections. Our experiments show that both network sparsification and network expansion can converge to similar network capacities with similar accuracies even though they are initialized with networks of different sizes. To the best of our knowledge, this is the first learning framework that unifies network sparsification and network expansion in an end-to-end training pipeline modulated by a single parameter.

The remainder of the paper is organized as follows. In Sec. II we describe the $L_0$-norm regularized empirical risk minimization for NPN and its solver $L_0$-ARM [16] for network sparsification. A new learning stage scheduler for NPN is introduced in Sec. III. We then extend NPN to network expansion in Sec. IV, followed by related work in Sec. V. Experimental results are presented in Sec. VI. Conclusions and future work are discussed in Sec. VII.

II.
NEURAL PLASTICITY NETWORKS: FORMULATION

Our Neural Plasticity Network (NPN) is built on the basic framework of $L_0$-ARM [16], which was proposed in our previous work for network sparsification. We extend $L_0$-ARM to network expansion, and unify network sparsification and expansion in an end-to-end training pipeline. For the sake of clarity, we first introduce NPN in the context of network sparsification, and later extend it to network expansion. The formulation below largely follows that of $L_0$-ARM [16].

Given a training set $\mathcal{D} = \{(x_i, y_i), i = 1, 2, \cdots, N\}$, where $x_i$ denotes the input and $y_i$ denotes the target, a neural network is a function $h(x; \theta)$ parameterized by $\theta$ that fits the training data $\mathcal{D}$ with the goal of achieving good generalization to unseen test data. To optimize $\theta$, typically a regularized empirical risk is minimized, which contains two terms: a data loss over the training data and a regularization loss over the model parameters. Empirically, the regularization term can be weight decay or Lasso, i.e., the $L_2$ or $L_1$ norm of the model parameters.

Intuitively, network sparsification or expansion is a model selection problem, in which a suitable model capacity is selected for a given learning task. In this problem, how to measure model complexity is a core issue. The Akaike Information Criterion (AIC) [19] and the Bayesian Information Criterion (BIC) [20], well-known model selection criteria, measure model complexity by counting the number of non-zero parameters. Since the $L_2$ or $L_1$ norm only imposes shrinkage on large values of $\theta$, the resulting model parameters are often manifested by smaller magnitudes, but none of them are exactly zero. Therefore, the $L_2$ or $L_1$ norm is not suitable for measuring model complexity. A more appealing alternative is the $L_0$ norm of the model parameters, as it explicitly measures the number of non-zero parameters, which is the exact model complexity measured by AIC and BIC.
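As a concrete illustration of this difference (a minimal sketch; the parameter vector below is hypothetical, not from the paper), the $L_1$ and $L_2$ norms keep rewarding smaller magnitudes without ever reaching zero, while the $L_0$ norm only counts non-zero entries:

```python
import numpy as np

# A hypothetical parameter vector after L1/L2-regularized training:
# magnitudes are small, but almost nothing is exactly zero.
theta = np.array([0.0, 1e-3, -0.5, 2.0])

l2 = float(np.sum(theta ** 2))      # penalizes large magnitudes
l1 = float(np.sum(np.abs(theta)))   # shrinks, but near-zero entries still count
l0 = int(np.count_nonzero(theta))   # exact number of non-zero parameters: 3

print(l1, l0)  # 2.501 3  -> L0 is the count that AIC/BIC use
```

Note that shrinking the `1e-3` entry further changes `l1` and `l2` but leaves `l0` untouched; only setting it exactly to zero reduces the $L_0$ complexity.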
With the $L_0$ regularization, the empirical risk objective can be written as

$$R(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(h(x_i; \theta), y_i\big) + \lambda \|\theta\|_0, \quad (1)$$

where $\mathcal{L}(\cdot)$ denotes the data loss over the training data $\mathcal{D}$, such as the cross-entropy loss for classification or the mean squared error (MSE) for regression, $\|\theta\|_0$ denotes the $L_0$-norm of the model parameters, i.e., the number of non-zero weights, and $\lambda$ is a regularization hyperparameter that balances between data loss and model complexity. For network sparsification, minimizing Eq. 1 will drive the redundant or insignificant weights to exactly zero so that they are pruned away. For network expansion, adding neurons will increase model complexity (the second term) but can potentially reduce the data loss (the first term) and therefore the total loss. Thus, we will use Eq. 1 as our guiding principle for sparsifying or expanding a network.

To represent a sparsified network, we attach a binary random variable $z$ to each element of the model parameters $\theta$. Therefore, we can reparameterize the model parameters $\theta$ as an element-wise product of non-zero parameters $\tilde{\theta}$ and binary random variables $z$:

$$\theta = \tilde{\theta} \odot z, \quad (2)$$

where $z \in \{0, 1\}^{|\theta|}$ and $\odot$ denotes the element-wise product. As a result, Eq. 1 can be rewritten as

$$R(\tilde{\theta}, z) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(h(x_i; \tilde{\theta} \odot z), y_i\big) + \lambda \sum_{j=1}^{|\tilde{\theta}|} \mathbf{1}[z_j \neq 0], \quad (3)$$

where $\mathbf{1}[c]$ is an indicator function that is 1 if the condition $c$ is satisfied, and 0 otherwise. Note that neither the first term nor the second term of Eq. 3 is differentiable w.r.t. $z$. Therefore, further approximations need to be considered.

Fortunately, we can approximate Eq. 3 through an inequality from stochastic variational optimization [21]. Specifically, given any function $F(z)$ and any distribution $q(z)$, the following inequality holds:

$$\min_{z} F(z) \leq \mathbb{E}_{z \sim q(z)}[F(z)], \quad (4)$$

i.e., the minimum of a function is upper bounded by the expectation of the function.
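The gated reparameterization of Eqs. 2–3 can be sketched in a few lines of NumPy (the toy squared-error "data loss" and all values below are our own illustration, not the paper's):

```python
import numpy as np

def l0_objective(theta_tilde, z, data_loss, lam):
    """Eq. 3: data loss with gated parameters plus lambda * L0 norm of the gates."""
    theta = theta_tilde * z                       # element-wise product (Eq. 2)
    return data_loss(theta) + lam * np.count_nonzero(z)

# Toy example: the "data loss" is just a squared error to some target vector.
target = np.array([1.0, 0.0, -2.0, 0.0])
loss = lambda theta: float(np.sum((theta - target) ** 2))

theta_tilde = np.array([1.0, 3.0, -2.0, 0.5])
z_all_on = np.ones(4)                             # no pruning
z_sparse = np.array([1.0, 0.0, 1.0, 0.0])         # gates prune units 2 and 4

print(l0_objective(theta_tilde, z_all_on, loss, lam=0.1))  # 9.25 + 0.4 = 9.65
print(l0_objective(theta_tilde, z_sparse, loss, lam=0.1))  # 0.0  + 0.2 = 0.2
```

Here turning two gates off both removes the offending parameters from the data loss and lowers the $L_0$ penalty, which is exactly the cost-benefit tradeoff the objective encodes.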
With this result, we can derive an upper bound of Eq. 3 as follows. Since each $z_j$, $\forall j \in \{1, \cdots, |\theta|\}$, is a binary random variable, we assume $z_j$ follows a Bernoulli distribution with parameter $\pi_j \in [0, 1]$, i.e., $z_j \sim \mathrm{Ber}(z; \pi_j)$. Thus, we can upper bound $\min_z R(\tilde{\theta}, z)$ by the expectation

$$\hat{R}(\tilde{\theta}, \pi) = \mathbb{E}_{z \sim \mathrm{Ber}(z; \pi)}\big[R(\tilde{\theta}, z)\big] = \mathbb{E}_{z \sim \mathrm{Ber}(z; \pi)}\Bigg[\frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(h(x_i; \tilde{\theta} \odot z), y_i\big)\Bigg] + \lambda \sum_{j=1}^{|\tilde{\theta}|} \pi_j. \quad (5)$$

As we can see, the second term is now differentiable w.r.t. the new model parameters $\pi$, while the first term is still problematic, since the expectation over a large number of binary random variables $z \in \{0, 1\}^{|\theta|}$ is intractable, and so is its gradient. To minimize Eq. 5, $L_0$-ARM utilizes Augment-Reinforce-Merge (ARM) [17], an unbiased gradient estimator for this stochastic binary optimization problem. Specifically,

Theorem 1 (ARM) [17]. For a vector of $V$ binary random variables $z = (z_1, \cdots, z_V)$, the gradient of

$$\mathcal{E}(\phi) = \mathbb{E}_{z \sim \prod_{v=1}^{V} \mathrm{Ber}(z_v; g(\phi_v))}[f(z)] \quad (6)$$

w.r.t. $\phi = (\phi_1, \cdots, \phi_V)$, the logits of the Bernoulli distribution parameters, can be expressed as

$$\nabla_{\phi} \mathcal{E}(\phi) = \mathbb{E}_{u \sim \prod_{v=1}^{V} \mathrm{Uniform}(u_v; 0, 1)}\Big[\big(f(\mathbf{1}[u > g(-\phi)]) - f(\mathbf{1}[u < g(\phi)])\big)(u - 1/2)\Big], \quad (7)$$

where $\mathbf{1}[u > g(-\phi)] := \big(\mathbf{1}[u_1 > g(-\phi_1)], \cdots, \mathbf{1}[u_V > g(-\phi_V)]\big)$ and $g(\phi) = \sigma(\phi) = 1/(1 + \exp(-\phi))$ is the sigmoid function.

Parameterizing $\pi_j \in [0, 1]$ as $g(\phi_j)$, we can rewrite Eq. 5 as

$$\hat{R}(\tilde{\theta}, \phi) = \mathbb{E}_{z \sim \mathrm{Ber}(z; g(\phi))}[f(z)] + \lambda \sum_{j=1}^{|\tilde{\theta}|} g(\phi_j) = \mathbb{E}_{u \sim \mathrm{Uniform}(u; 0, 1)}\big[f(\mathbf{1}[u < g(\phi)])\big] + \lambda \sum_{j=1}^{|\tilde{\theta}|} g(\phi_j), \quad (8)$$

where $f(z) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(h(x_i; \tilde{\theta} \odot z), y_i\big)$. From Theorem 1, we can evaluate the gradient of Eq. 8 w.r.t.
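The ARM estimator of Eq. 7 amounts to evaluating $f$ on two antithetically coupled gate samples per uniform draw. Below is a minimal Monte Carlo sketch (the function names and the toy $f$ are ours, not from the paper); for $f(z) = \sum_v z_v$ and $\phi = 0$, the true gradient per coordinate is $\sigma(0)(1 - \sigma(0)) = 0.25$, which the estimate should approach:

```python
import numpy as np

rng = np.random.default_rng(42)

def arm_gradient(f, phi, n_samples=1000):
    """Monte Carlo estimate of grad_phi E_{z ~ Ber(sigmoid(phi))}[f(z)] via Eq. 7."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    grads = np.zeros_like(phi)
    for _ in range(n_samples):
        u = rng.uniform(size=phi.shape)
        z_true = (u > sigmoid(-phi)).astype(float)   # "true" antithetic arm
        z_false = (u < sigmoid(phi)).astype(float)   # "false" arm
        grads += (f(z_true) - f(z_false)) * (u - 0.5)
    return grads / n_samples

# Sanity check against the analytic gradient sigmoid'(0) = 0.25.
est = arm_gradient(lambda z: float(z.sum()), np.array([0.0]))
print(est)  # close to [0.25]
```

Because the two arms share the same uniform draw, their difference cancels much of the noise, which is the source of the low variance noted in the text.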
$\phi$ by

$$\nabla_{\phi} \hat{R}(\tilde{\theta}, \phi) = \mathbb{E}_{u \sim \mathrm{Uniform}(u; 0, 1)}\Big[\big(f(\mathbf{1}[u > g(-\phi)]) - f(\mathbf{1}[u < g(\phi)])\big)(u - 1/2)\Big] + \lambda \sum_{j=1}^{|\tilde{\theta}|} \nabla_{\phi_j} g(\phi_j), \quad (9)$$

which is an unbiased and low-variance estimator, as demonstrated in [17].

A. Choice of $g(\phi)$

Theorem 1 of ARM defines $g(\phi) = \sigma(\phi)$, where $\sigma(\cdot)$ is the sigmoid function. For the purpose of network sparsification and expansion, we find that this parametric function isn't very effective due to its fixed rate of transition between values 0 and 1. Thanks to the flexibility of ARM, we have large freedom to design this parametric function $g(\phi)$. Apparently, it's straightforward to generalize Theorem 1 to any parametric function (smooth or non-smooth) as long as $g: \mathbb{R} \to [0, 1]$ and $g(-\phi) = 1 - g(\phi)$.²

[Fig. 1. The plots of $g(\phi)$ with different $k$ for the sigmoid and hard sigmoid functions, together with the step function. A large $k$ tends to be more effective at sparsifying networks. Best viewed in color.]

Example parametric functions that work well in our experiments are the scaled sigmoid function

$$g_{\sigma_k}(\phi) = \sigma(k\phi) = \frac{1}{1 + \exp(-k\phi)}, \quad (10)$$

and the centered-scaled hard sigmoid

$$g_{\bar{\sigma}_k}(\phi) = \min\Big(1, \max\big(0, \tfrac{k}{7}\phi + 0.5\big)\Big), \quad (11)$$

where the 7 is introduced such that $g_{\bar{\sigma}_1}(\phi) \approx g_{\sigma_1}(\phi) = \sigma(\phi)$. See Fig. 1 for example plots of $g_{\sigma_k}(\phi)$ and $g_{\bar{\sigma}_k}(\phi)$ with different $k$s. Empirically, we find that $k = 7$ works well for network sparsification, and $k = 0.5$ for network expansion. More on this will be discussed when we present results.

One important difference between the hard concrete estimator from Louizos et al. [10] and $L_0$-ARM is that the hard concrete estimator has to rely on the hard sigmoid gate to zero out some parameters during training (a.k.a.
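The two gate functions of Eqs. 10–11 can be sketched directly (a minimal sketch; function names are ours). Note how $k = 0$ pins both gates at 0.5 and a very large $k$ makes the scaled sigmoid nearly a step function, which previews the learning stage scheduler of Sec. III:

```python
import numpy as np

def g_sigmoid(phi, k):
    """Scaled sigmoid gate (Eq. 10)."""
    return 1.0 / (1.0 + np.exp(-k * phi))

def g_hard_sigmoid(phi, k):
    """Centered-scaled hard sigmoid gate (Eq. 11); the 7 makes k=1 track sigmoid."""
    return np.minimum(1.0, np.maximum(0.0, (k / 7.0) * phi + 0.5))

print(g_sigmoid(0.3, 0))      # 0.5  -> dropout regime (Sec. III-A)
print(g_sigmoid(0.3, 5000))   # ~1.0 -> step-like gate, as in pre-training/fine-tuning
print(g_hard_sigmoid(-10.0, 7))  # 0.0 -> exactly zero gradient: unit cannot revive
```

The last line shows why the hard sigmoid alone is problematic for plasticity: its saturated tails are exactly flat, whereas the scaled sigmoid keeps a non-zero gradient everywhere.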
conditional computation [22]), while $L_0$-ARM achieves conditional computation naturally by sampling from the Bernoulli distribution parameterized by $g(\phi)$, where $g(\phi)$ can be any parametric function (smooth or non-smooth), as shown in Fig. 1. The consequence of using the hard sigmoid gate is that once a unit is pruned, the corresponding gradient will always be zero due to the exactly zero gradient at the left tail of the hard sigmoid gate (see Fig. 1), and therefore the unit can never be reactivated in the future. To mitigate this issue of the hard concrete estimator, $L_0$-ARM can utilize the scaled sigmoid gate (10), which has non-zero gradient everywhere on $(-\infty, \infty)$, and therefore a unit can be activated or deactivated freely and thus be plastic.

²The second condition is not necessary, but for simplicity we will impose this condition to select parametric functions $g(\phi)$ that are antithetic. Designing $g(\phi)$ without this constraint could be a potential area worthy of further investigation.

III. LEARNING STAGE SCHEDULER

As far as we know, all network sparsification algorithms either operate in three stages of pre-training, sparsification, and fine-tuning [7], [8], [11], or have only one sparsification stage from scratch [10]. It has been shown that the three-stage sparsification leads to better predictive accuracies than the one-stage alternatives [23]. To support this three-stage learning process, however, previous methods [7], [8], [11] manage this tedious process manually. Thanks to the flexibility of NPN, we can modulate these learning stages by simply adjusting $k$ of the $g_k(\phi)$ function at different training stages. Along the way, we also discover a new interpretation of dropout [18].

A. Dropout as $k = 0$

When $k = 0$, it is easy to verify that $g_{\sigma_k}(\phi) = g_{\bar{\sigma}_k}(\phi) = 0.5$, and the objective function (8) degenerates to

$$\hat{R}(\tilde{\theta}, \phi) = \mathbb{E}_{u \sim \mathrm{Uniform}(u; 0, 1)}\big[f(\mathbf{1}[u < 0.5])\big] + \lambda |\tilde{\theta}| / 2, \quad (12)$$

which is in fact standard dropout with a dropout probability of 0.5 [18]. Note that the value 0.5 is an artifact of the antithetic constraint on the parametric function $g(\phi)$. As we discussed in Sec. II-A, this constraint isn't necessary, and we have the freedom to design $g(0) = c$ with $c \in [0, 1]$, which corresponds to any dropout probability of standard dropout. From this point of view, dropout is just a special case of NPN when $k = 0$, and this is a new interpretation of dropout.

B. Pre-training as $k = \infty$ at the beginning of NPN training

At the beginning of NPN training, we initialize all $\phi$'s to some positive values, e.g., $\phi > 0.1$. If we set $k = \infty$, then $g_{\sigma_\infty}(\phi) = g_{\bar{\sigma}_\infty}(\phi) = 1$ and the objective function (8) becomes

$$\hat{R}(\tilde{\theta}, \phi) = \mathbb{E}_{u \sim \mathrm{Uniform}(u; 0, 1)}\big[f(\mathbf{1}[u < 1])\big] + \lambda |\tilde{\theta}|, \quad (13)$$

which corresponds to the standard training of DNNs with all neurons activated. Moreover, the gradient w.r.t. $\phi$ degenerates to

$$\nabla_{\phi} \hat{R}(\tilde{\theta}, \phi) = \mathbb{E}_{u \sim \mathrm{Uniform}(u; 0, 1)}\Big[\big(f(\mathbf{1}[u > 0]) - f(\mathbf{1}[u < 1])\big)(u - 1/2)\Big] + \lambda \sum_{j=1}^{|\tilde{\theta}|} \nabla_{\phi_j} 1 = 0, \quad (14)$$

such that $\phi$ will not be updated during training and the architecture is fixed. This corresponds to the pre-training of a network from scratch.

C. Fine-tuning as $k = \infty$ at the end of NPN training

At the end of NPN training, the histogram of $g(\phi)$ is typically split into two spikes of values around 0 and 1, as demonstrated in $L_0$-ARM [16]. If we set $k = \infty$, then the values of $g(\phi)$ will be exactly 0 or 1. In this case, the gradient w.r.t. $\phi$ is zero, the neurons with $g(\phi) = 1$ are activated, and the neurons with $g(\phi) = 0$ are deactivated. This corresponds to fine-tuning a fixed architecture without the $L_0$ regularization.

D.
Modulating learning stages by $k$

As discussed above, we can now integrate the three stages of pre-training, sparsification, and fine-tuning into one end-to-end pipeline, modulated by a single parameter $k$. At the beginning of training, we set $k = \infty$ to pre-train a network from scratch. Upon convergence, we set $k$ to some small value (e.g., $k = 7$) to enable the $L_0$ regularized optimization for network sparsification. After convergence, we set $k = \infty$ again to fine-tune the final learned architecture without the $L_0$ regularization. To the best of our knowledge, no other network sparsification algorithm supports this three-stage training in a native end-to-end pipeline. As an analogy to the common learning rate scheduler, we call $k$ a learning stage scheduler. We will demonstrate this when we present results.

IV. NETWORK EXPANSION

So far we have described NPN in the context of network sparsification. Thanks to the flexibility of $L_0$-ARM, it is straightforward to extend it to network expansion. Instead of starting from a large network for pruning, we can expand a small network by adding neurons during training, an analogy to growing a brain from small to large. Specifically, given that the level of activity of a neuron is determined by its $\phi$, a new neuron can be added to the network with a large $\phi$ such that it will be activated in future training epochs. If this neuron is useful at reducing the $L_0$-regularized loss function (8), its $\phi$ value will increase such that it will be activated more often in the future; otherwise, it will be gradually deactivated and pruned away. At each training iteration, we add a new neuron to a layer if (a) the validation loss is lower than its value when a neuron was last added, and (b) all neurons in the layer are activated (i.e., there are no redundant neurons in the layer).
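The per-iteration growth check in conditions (a) and (b) can be sketched as follows. This is a minimal illustration with hypothetical `Layer`/`Net` interfaces of our own; it is not the paper's implementation:

```python
class Layer:
    """Hypothetical layer tracking how many of its neurons are activated."""
    def __init__(self, active, total):
        self.active, self.total = active, total
    def all_active(self):
        return self.active == self.total   # (b): no redundant neurons left
    def add_neuron(self):
        self.active += 1                   # new neuron enters with a large phi
        self.total += 1

class Net:
    def __init__(self, layers):
        self.layers = layers

def maybe_expand(net, val_loss, state):
    """Grow only when validation improved (a) and a layer is saturated (b)."""
    if val_loss < state["loss_old"]:                 # (a)
        for layer in net.layers:
            if layer.all_active():                   # (b)
                layer.add_neuron()
        state["loss_old"] = val_loss

net = Net([Layer(active=3, total=3), Layer(active=2, total=5)])
state = {"loss_old": float("inf")}
maybe_expand(net, val_loss=0.9, state=state)
print(net.layers[0].total, net.layers[1].total)  # 4 5: only the saturated layer grew
```

A layer that still has hibernating neurons is left alone, since those neurons can be activated by the $L_0$-regularized optimization itself.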
Network expansion terminates when the $L_0$-regularized loss plateaus or some added neurons are deactivated due to the $L_0$-norm regularization. This network expansion procedure is detailed in Algorithm 1.

Algorithm 1 Network Expansion
Require: number of iterations $T$, deep network $f$, number of layers $L$, $loss_{old} = +\infty$, $loss_{val} = 0$
for $t = 1$ to $T$ do
    Train $f$ on a mini-batch of training examples
    $loss_{val}$ = Validate $f$ on the validation dataset
    if $loss_{val} < loss_{old}$ then
        for $l = 1$ to $L$ do
            if all neurons in layer $l$ are activated then
                Add a new neuron to layer $l$
            end if
        end for
        $loss_{old} = loss_{val}$
    end if
end for

Thanks to the learning stage scheduler discussed in Sec. III, we can simulate network expansion easily by manipulating $k$. Specifically, we initialize a very large network to represent an upper bound on network capacity. To start with a small network, we randomly select a small number of neurons and initialize the corresponding $\phi$'s to large positive values, and set all the remaining $\phi$'s to large negative values, such that only a small portion of the neurons are activated while the remaining neurons are in a hibernation mode. Since they will not be activated, these hibernating neurons consume no computational resources. To pre-train the initial small network, we train the NPN with $k = \infty$. Upon convergence, we randomly activate a few hibernating neurons (under the conditions discussed above) and switch $k$ to a small value (e.g., $k = 0.5$). Along with the original neurons, we optimize the expanded network by reducing the $L_0$-regularized loss. In this way, we can readily simulate network expansion. Once network expansion terminates, we set $k = \infty$ to fine-tune the final network architecture. In the experiments, we resort to this approach to simulate network expansion.

V.
RELATED WORK

Our NPN has built-in support for network sparsification and network expansion, and can automatically determine an appropriate network capacity for a given learning task. In this section, we review related work in these areas.

a) Network Sparsification: Driven by the widespread applications of DNNs in resource-limited embedded systems, there has recently been an increasing interest in network sparsification [7]–[14], [16]. One of the earliest sparsification methods prunes redundant weights based on their magnitudes [24], which has proved effective in modern CNNs [7]. Although weight sparsification can compress networks, it can barely improve computational efficiency due to unstructured sparsity. Therefore, magnitude-based group sparsity was proposed [11], [12], which can prune networks while reducing computation cost significantly. These works are mainly based on the $L_2$ or $L_1$ regularization to penalize the magnitude of weights. A more appealing approach is based on the $L_0$ regularization [10], [16], as this corresponds to well-known model selection criteria such as AIC [19] and BIC [20]. Our NPN is built on the basic framework of $L_0$-ARM [16] and extends it for network expansion. In addition, as far as we know, almost all network sparsification algorithms [7], [8], [11], [12] proceed in three stages manually: pre-train a full network, prune the redundant weights or filters, and fine-tune the pruned network. In contrast, our NPN supports this three-stage training by simply adjusting the learning stage scheduler $k$ at different stages in an end-to-end fashion.

b) Neural Architecture Search: Another closely related area is neural architecture search [25]–[27], which searches for an optimal network architecture for a given learning task. It attempts to determine the number of layers, types of layers, layer configurations, different activation functions, etc.
Given the extremely large search space, typically reinforcement learning algorithms are utilized for efficient implementations. Our NPN can be categorized as a subset of neural architecture search in the sense that we start with a fixed architecture and aim to determine an optimal capacity (e.g., number of weights, neurons, or channels) of a network.

c) Dynamic Network Expansion: Compared to network sparsification, network expansion is a relatively less explored area. A few existing works can dynamically increase network capacity during training. For example, DNC [28] sequentially adds neurons one at a time to the hidden layers of a network until the desired approximation accuracy is achieved. [29] proposes to train a denoising autoencoder (DAE) by adding in new neurons and later merging them with other neurons to prevent redundancy. For convolutional networks, [30] proposes to widen or deepen a pretrained network for better knowledge transfer. Recently, a boosting-style method named AdaNet [31] was used to adaptively grow the structure while learning the weights. However, all of these approaches either only add neurons or add/remove neurons manually. In contrast, our NPN can add or remove (deactivate) neurons during training as needed without human intervention, and is an end-to-end unified framework for network sparsification and expansion.

VI. EXPERIMENTAL RESULTS

We evaluate the performance of NPNs on multiple public datasets with different network architectures for network sparsification and network expansion. Specifically, we illustrate how an NPN evolves on a synthetic "moons" dataset [32] with a 2-hidden-layer MLP. We also demonstrate LeNet5-Caffe³ on the MNIST dataset [33], and ResNet56 [4] on the CIFAR10 and CIFAR100 datasets [34].
Similar to $L_0$-ARM [16] and $L_0$-HC [13], to achieve computational efficiency, only neuron-level (instead of weight-level) sparsification/expansion is considered, i.e., all weights of a neuron or filter are either pruned from or added to a network altogether. For a comparison to the state-of-the-art network sparsification algorithms [10], [13], [14], we refer the readers to $L_0$-ARM [16] for more details, since NPN is an extension of $L_0$-ARM for network sparsification and expansion.

As discussed in Sec. II, each neuron in an NPN is attached to a Bernoulli random variable parameterized by $g(\phi)$. Therefore, the level of activity of a neuron is determined by the value of $\phi$. To initialize an NPN, in our experiments we activate a neuron by setting $\phi = 3/k$. Since $g_{\sigma_k}(\phi) = \sigma(3) \approx 0.95$, the corresponding neuron has a 95% probability of being activated. Similarly, we set $\phi = -3/k$ to deactivate a neuron with 95% probability.

As discussed in Sec. III, all of our experiments are performed in three stages: (1) pre-training, (2) sparsification/expansion, and (3) fine-tuning, in an end-to-end training pipeline modulated by the parameter $k$. In the pre-training and fine-tuning stages, we set $k = 5000$ (as a close approximation to $k = \infty$) to train an NPN with a fixed architecture. In the sparsification/expansion stage, we set $k$ to a small value to allow NPNs to search for a suitable network capacity freely.
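The stage schedule and the $\pm 3/k$ initialization can be sketched as follows (a minimal sketch; the constant names are ours). Because $g_{\sigma_k}(\phi) = \sigma(k\phi)$, initializing $\phi = \pm 3/k$ always yields $\sigma(\pm 3) \approx 0.95 / 0.05$ regardless of which $k$ the current stage uses:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Stage schedule for k (Sec. III): k = 5000 closely approximates k = infinity.
K_PRETRAIN = K_FINETUNE = 5000
K_SPARSIFY = 7      # stage-2 value for sparsification
K_EXPAND = 0.5      # stage-2 value for expansion

def init_phi(k, active=True):
    """phi = +/-3/k, so g(phi) = sigmoid(k * phi) = sigmoid(+/-3) ~ 95% / 5%."""
    return (3.0 if active else -3.0) / k

for k in (K_SPARSIFY, K_EXPAND):
    p_active = sigmoid(k * init_phi(k, active=True))
    print(round(p_active, 3))  # 0.953 for both: activation prob. is k-independent
```

Scaling the initialization by $1/k$ is what keeps the initial activation probability fixed while $k$ alone controls how sharply the gates respond during each stage.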
³https://github.com/BVLC/caffe/tree/master/examples/mnist

[Fig. 2. The evolution of the decision boundaries of NPNs for network expansion (a: epoch 99, 70.60% accuracy, 15 parameters; b: epoch 199, 95.80%, 2600 parameters; c: epoch 1999, 99.60%, 3300 parameters) and network sparsification (d: epoch 499, 99.20%, 8160 parameters; e: epoch 999, 99.20%, 4275 parameters; f: epoch 1999, 99.00%, 3234 parameters). The videos can be found at https://npnsdemo.github.io.]

The final architecture of a network is influenced significantly by two hyperparameters: (1) the regularization strength $\lambda$, and (2) $k$ of the $g_{\sigma_k}(\cdot)$ function, which determine how aggressively to sparsify or expand a network. Typically, a positive $\lambda$ is used both for sparsification and expansion. However, in some expansion experiments, we notice that $\lambda = 0$ is beneficial because it essentially encourages more neurons to be activated, which is important for network expansion to achieve competitive accuracies. For the hyperparameter $k$ used in stage 2, in all of our experiments we set $k = 7$ for sparsification and $k = 0.5$ for expansion. The reason different $k$s are used is that for expansion we need to encourage the network to grow, and a small $k$ is more amenable to keeping neurons activated. The rest of the hyperparameters of NPNs are determined via cross-validation on the validation datasets.
Unless specifically noted, we use the Adam optimizer [35] with an initial learning rate of 0.001. Our experiments are performed on NVIDIA Titan Xp GPUs. Our source code is available at https://github.com/leo-yangli/npns.

A. Synthetic Dataset

To demonstrate that NPNs can adapt their capacities to a learning task, we visualize the learning process of NPNs on a synthetic "moons" dataset [32] for network sparsification and network expansion. The "moons" dataset contains 1000 data points distributed in two moon-shaped clusters for binary classification. We randomly pick 500 data points for training and use the remaining 500 data points for testing. Two different MLP architectures are used for network sparsification and network expansion, respectively. For network sparsification, we train an MLP with 2 hidden layers of 100 and 80 neurons, respectively. The input layer has 2 neurons corresponding to the 2-dim coordinates of each data point, which is transformed to a 100-dim vector by a fixed matrix. The output layer has two neurons for binary classification. The overall architecture of the MLP is 2-100 (fixed)-80-2 with the first weight matrix fixed, and the overall number of trainable model parameters is 8,160 (excluding biases for clarity). The binary gates are attached to the outputs of the two hidden layers for sparsification. For the expansion experiment, we start an MLP with a very small architecture of 2-100 (fixed)-3-2. Initially, only three neurons at each hidden layer are activated, and therefore the total number of trainable model parameters is 15 (excluding biases for clarity). Apparently, the first MLP is overparameterized for this synthetic binary classification task, while the second MLP is too small and doesn't have enough capacity to solve the classification task with high accuracy. To visualize the learning process of NPNs, in Fig.
2 we plot the decision boundaries and confidence contours of the NPN-sparsified MLP and the NPN-expanded MLP on the test set of "moons". We pick three snapshots from each experiment. The evolution of the decision boundaries of the NPN-expanded MLP is shown in Fig. 2 (a, b, c). At the end of pre-training (a: epoch 99), the decision boundary is mostly linear and the capacity of the network is clearly insufficient. During stage 2 expansion (b: epoch 199), more neurons are added to the network and the decision boundary becomes more expressive, manifested as a piece-wise linear function; at the same time, the accuracy is significantly improved to about 96%. At the end of stage 3 fine-tuning (c: epoch 1999), the accuracy reaches 99.6% with more neurons added. Similarly, Fig. 2 (d, e, f) demonstrates the evolution of the decision boundaries of the NPN-sparsified MLP on the test set. The model achieves an accuracy of 99.2% at the end of stage 1 pre-training (d: epoch 499). Then 47.6% of the weights are pruned without any accuracy loss during stage 2 sparsification (e: epoch 999). The model finally prunes 60.4% of the weights by the end of stage 3 fine-tuning (f: epoch 1999). In this sparsification experiment, across the different training stages, the shapes of the decision boundaries remain approximately the same even though a large number of neurons are pruned. Interestingly, the final architectures achieved by network expansion and network sparsification are very similar (3,300 vs. 3,234 parameters), as are their accuracies (99.6% vs. 99.0%), even though the initial network capacities are quite different (15 vs. 8,160). This experiment demonstrates that given an initial network architecture, either large or small, NPNs can adapt their capacity to solve a learning task with high accuracy.

B. MNIST

In the second part of the experiments, we run NPNs with LeNet5-Caffe on the MNIST dataset for network sparsification and expansion.
LeNet5-Caffe consists of two convolutional layers of 20 and 50 neurons, respectively, interspersed with max-pooling layers, followed by two fully-connected layers with 800 and 500 neurons. We start network sparsification from the full LeNet5 architecture (20-50-800-500, in short), while in the expansion experiment we start from a very small network with only 3 neurons in each of the first two convolutional layers and 48 and 3 neurons in the fully-connected layers (3-3-48-3, in short). We pre-train the NPNs for 100 epochs, followed by sparsification/expansion for 250 epochs and fine-tuning for 150 epochs. For both experiments, we use λ = (10, 0.5, 0.1, 10)/N, where N is the number of training images.

TABLE I
The network sparsification and expansion with LeNet5 on MNIST. The architecture and accuracy at the end of each stage are shown in the table.

                    Stage     Arch. (# of Parameters)    Accuracy (%)
  Baseline          -         20-50-800-500 (4.23e5)     99.40
  Sparsification    stage 1   20-50-800-500 (4.23e5)     99.36
                    stage 2   7-9-109-30 (5320)          98.90
                    stage 3   7-9-109-30 (5320)          98.91
  Expansion         stage 1   3-3-48-3 (474)             93.85
                    stage 2   8-8-53-8 (2304)            96.71
                    stage 3   8-8-53-8 (2304)            98.31

The results are shown in Table I and Fig. 3. For the sparsification experiment, the NPN achieves 99.36% accuracy at the end of pre-training and yields a sparse architecture with a minor accuracy drop at the end of sparsification. With fine-tuning at stage 3, the accuracy reaches 98.91%. For the expansion experiment, the NPN achieves a low accuracy of 93.85% after pre-training due to the insufficient capacity of the initial network; it then expands to a larger network and improves the accuracy to 96.71%. Finally, the accuracy reaches 98.31% after fine-tuning. Fig. 3 illustrates the learning processes of NPNs on MNIST for network sparsification and network expansion.
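The per-layer regularization strengths λ = (10, 0.5, 0.1, 10)/N can be computed directly; note that the mapping of the four values to LeNet5's layers (conv1, conv2, fc1, fc2, in that order) is an assumption made here for illustration:

```python
# Per-layer regularization strengths for LeNet5 on MNIST:
# lambda = (10, 0.5, 0.1, 10) / N. The mapping of the four values to
# layers (conv1, conv2, fc1, fc2, in that order) is assumed here.
N = 60_000  # number of MNIST training images
lambdas = {layer: v / N for layer, v in
           zip(("conv1", "conv2", "fc1", "fc2"), (10, 0.5, 0.1, 10))}
for layer, lam in lambdas.items():
    print(f"{layer}: lambda = {lam:.2e}")

# Sparsification result from Table I: 4.23e5 -> 5320 parameters,
# i.e. roughly 98.7% of the weights are pruned.
pruned_fraction = 1 - 5320 / 4.23e5
print(f"fraction pruned: {pruned_fraction:.1%}")
```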
It is interesting to note that network sparsification and expansion reach similar network capacities with similar classification accuracies at the end of training even though they start from two significantly different network architectures. It is worth emphasizing that both sparsification and expansion reach accuracies similar to the baseline model, while about 99% of the weights are pruned.

C. CIFAR-10/100

In the final part of the experiments, we evaluate NPNs with ResNet56 on CIFAR-10 and CIFAR-100 for network sparsification and network expansion. Due to the existence of skip connections, we do not sparsify the last convolutional layer of each residual block in order to keep the addition operation valid. To train NPNs with this modern CNN architecture, two optimizers are used: (1) SGD with momentum for the ResNet56 parameters, with an initial learning rate of 0.1, and (2) Adam for the binary gate parameters φ, with an initial learning rate of 0.001. The batch size, weight decay and momentum of SGD are set to 128, 5e-4 and 0.9, respectively. The learning rate of the SGD optimizer is multiplied by 0.1 every 60 epochs, while for the Adam optimizer the learning rate is multiplied by 0.1 at epochs 120 and 180. As before, the networks are trained in three stages by leveraging k. We pre-train the network for the first 20 epochs. In the expansion experiments, we pre-train a small network with only 20% of the neurons activated, while in the sparsification experiments we pre-train the full architecture. The network is then trained in stage 2 for 180 epochs and finally fine-tuned in stage 3 for 20 epochs. For the sparsification experiments, we use λ = 1e-5 for all layers, while for the expansion experiments we use λ = 0 to encourage more neurons to be activated. We compare the performance of NPNs with state-of-the-art pruning algorithms, including SFP [36], AMC [37], FPGM [38], TAS [39] and HRank [40], with the results shown in Table II.
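The two learning-rate schedules described above can be sketched as simple step functions; the function names and formulation below are illustrative, not taken from the released code:

```python
def sgd_lr(epoch, base=0.1):
    """SGD learning rate for the ResNet56 parameters:
    multiplied by 0.1 every 60 epochs."""
    return base * 0.1 ** (epoch // 60)

def adam_lr(epoch, base=0.001):
    """Adam learning rate for the binary gate parameters:
    multiplied by 0.1 at epochs 120 and 180."""
    return base * 0.1 ** sum(epoch >= m for m in (120, 180))

assert sgd_lr(59) == 0.1               # still the initial rate
assert abs(sgd_lr(60) - 0.01) < 1e-12  # first decay step
assert adam_lr(119) == 0.001           # still the initial rate
assert abs(adam_lr(180) - 1e-5) < 1e-12  # after both decay steps
```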
Since the baseline accuracies in the reference papers differ, we follow the common practice and compare all competing methods by their accuracy change ∆Acc and their pruning rates in terms of FLOPs and network parameters. It can be observed that NPN sparsification and NPN expansion achieve very competitive performance relative to the state of the art in terms of classification accuracy and prune rate. More interestingly, similar to the results on MNIST, both network sparsification and expansion reach similar network capacities with similar classification accuracies at the end of training even though they start from two significantly different network architectures, demonstrating the plasticity of NPNs for network sparsification and expansion.

Fig. 3. The evolution of network capacity (a: # of parameters) and test accuracy (b) as a function of epoch for NPN network sparsification and expansion with LeNet5 on MNIST.

TABLE II
The network sparsification and expansion with ResNet56 on CIFAR-10 and CIFAR-100. "∆Acc": '+' denotes accuracy gain; '-' denotes accuracy loss. "Params. (P.R. %)": prune ratio in parameters.

             Method               Acc. (%)       ∆Acc    FLOPs (P.R. %)   Params. (P.R. %)
  CIFAR-10   SFP [36]             93.6 → 93.4    -0.2    59.4M (53.1)     -
             AMC [37]             92.8 → 91.9    -0.9    62.5M (50.0)     -
             FPGM [38]            93.6 → 93.5    -0.1    59.4M (52.6)     -
             TAS [39]             94.5 → 93.7    -0.8    59.5M (52.7)     -
             HRank [40]           93.3 → 93.5    +0.2    88.7M (29.3)     0.71M (16.8)
             NPN Sparsification   93.2 → 93.0    -0.2    76.1M (40.1)     0.33M (61.2)
             NPN Expansion        93.2 → 92.7    -0.5    100.5M (20.9)    0.55M (35.3)
  CIFAR-100  SFP [36]             71.4 → 68.8    -2.6    59.4M (52.6)     -
             FPGM [38]            71.4 → 69.7    -1.7    59.4M (52.6)     -
             TAS [39]             73.2 → 72.3    -0.9    61.2M (51.3)     -
             NPN Sparsification   71.1 → 70.9    -0.2    101.8M (19.8)    0.61M (28.2)
             NPN Expansion        71.1 → 69.9    -1.2    102.8M (19.1)    0.60M (29.4)

VII. CONCLUSION

We propose neural plasticity networks (NPNs) for network sparsification and expansion by attaching each unit of a network to a stochastic binary gate, whose parameters are jointly optimized with the original network parameters. The activation or deactivation of a unit is completely data-driven and determined by an L0-regularized objective. Our NPN unifies dropout (when k = 0) and traditional training of DNNs (when k = ∞) and interpolates between the two. To the best of our knowledge, it is the first learning framework that unifies network sparsification and network expansion in an end-to-end training pipeline that supports pre-training, sparsification/expansion, and fine-tuning seamlessly. Along the way, we also give a new interpretation of dropout. Extensive experiments on multiple public datasets and multiple network architectures validate the effectiveness of NPNs for network sparsification and expansion in terms of model compactness and predictive accuracy. As for future extensions, we plan to design better (possibly non-antithetic) parametric functions g(φ) to improve the compactness of the learned networks. We also plan to extend the framework to prune or expand network layers to further improve model compactness and accuracy.
VIII. ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their comments and suggestions, which helped improve the quality of this paper. The authors also gratefully acknowledge the support of VMware Inc. through its university research fund.

REFERENCES

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–503, 2016.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[5] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[6] S. Zagoruyko and N. Komodakis, "Wide residual networks," in The British Machine Vision Conference (BMVC), 2016.
[7] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.
[8] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in International Conference on Learning Representations (ICLR), 2016.
[9] K. Neklyudov, D. Molchanov, A. Ashukha, and D. Vetrov, "Structured Bayesian pruning via log-normal multiplicative noise," in Advances in Neural Information Processing Systems (NIPS), 2017.
[10] C. Louizos, M. Welling, and D. P. Kingma, "Learning sparse neural networks through L0 regularization," in International Conference on Learning Representations (ICLR), 2018.
[11] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems (NIPS), 2016.
[12] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," arXiv preprint, 2016.
[13] C. Louizos, K. Ullrich, and M. Welling, "Bayesian compression for deep learning," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 3288–3298.
[14] D. Molchanov, A. Ashukha, and D. Vetrov, "Variational dropout sparsifies deep neural networks," arXiv preprint, 2017.
[15] S. Thuret, "You can grow new brain cells. Here's how," https://www.youtube.com/watch?v=B_tjKYvEziI, Oct. 2015.
[16] Y. Li and S. Ji, "L0-ARM: Network sparsification via stochastic binary optimization," in The European Conference on Machine Learning (ECML), 2019.
[17] M. Yin and M. Zhou, "ARM: Augment-REINFORCE-merge gradient for stochastic binary networks," in International Conference on Learning Representations (ICLR), 2019.
[18] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[19] H. Akaike, Selected Papers of Hirotugu Akaike. Springer, 1998, pp. 199–213.
[20] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, pp. 461–464, 1978.
[21] T. Bird, J. Kunze, and D. Barber, "Stochastic variational optimization," arXiv preprint arXiv:1809.04855, 2018.
[22] Y. Bengio, N. Leonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.
[23] J. Lee, S. Kim, J. Yoon, H. B. Lee, E. Yang, and S. J. Hwang, "Adaptive network sparsification with dependent variational beta-Bernoulli dropout," arXiv preprint arXiv:1805.10896, 2018.
[24] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems, 1990, pp. 598–605.
[25] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in International Conference on Learning Representations (ICLR), 2017.
[26] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[27] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Aging evolution for image classifier architecture search," in AAAI, 2019.
[28] T. Ash, "Dynamic node creation in backpropagation networks," Connection Science, vol. 1, pp. 365–375, 1989.
[29] G. Zhou, K. Sohn, and H. Lee, "Online incremental feature learning with denoising autoencoders," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2012, pp. 1453–1461.
[30] Y.-X. Wang, D. Ramanan, and M. Hebert, "Growing a brain: Fine-tuning by increasing model capacity," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[31] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang, "AdaNet: Adaptive structural learning of artificial neural networks," in International Conference on Machine Learning (ICML), 2017.
[32] T. Miyato, S.-i. Maeda, S. Ishii, and M. Koyama, "Virtual adversarial training: A regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, 1998, pp. 2278–2324.
[34] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Master's thesis, Department of Computer Science, University of Toronto, 2009.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[36] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," arXiv preprint arXiv:1808.06866, 2018.
[37] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–800.
[38] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, "Filter pruning via geometric median for deep convolutional neural networks acceleration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4340–4349.
[39] X. Dong and Y. Yang, "Network pruning via transformable architecture search," in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 760–771.
[40] M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian, and L. Shao, "HRank: Filter pruning using high-rank feature map," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1529–1538.