Controlling Model Complexity in Probabilistic Model-Based Dynamic Optimization of Neural Network Structures
Authors: Shota Saito, Shinichi Shirakawa
Shota Saito¹,² and Shinichi Shirakawa¹

¹ Yokohama National University, Kanagawa, Japan
  {saito-shota-bt, shirakawa-shinichi-bg}@{ynu, ynu.ac}.jp
² SkillUp AI Co., Ltd., Tokyo, Japan
  s_saito@skillupai.com

Abstract. A method of simultaneously optimizing both the structure of neural networks and the connection weights in a single training loop can reduce the enormous computational cost of neural architecture search. We focus on the probabilistic model-based dynamic neural network structure optimization that considers the probability distribution of structure parameters and simultaneously optimizes both the distribution parameters and connection weights based on gradient methods. Since the existing algorithm searches for structures that only minimize the training loss, it might find overly complicated structures. In this paper, we propose introducing a penalty term to control the model complexity of the obtained structures. We formulate a penalty term using the number of weights or units and derive its analytical natural gradient. The proposed method minimizes the objective function with the injected penalty term based on stochastic gradient descent. We apply the proposed method to the unit selection of a fully-connected neural network and the connection selection of a convolutional neural network. The experimental results show that the proposed method can control model complexity while maintaining performance.

Keywords: Neural Networks · Structure Optimization · Stochastic Natural Gradient · Model Complexity · Stochastic Relaxation

1 Introduction

Deep neural networks (DNNs) are making remarkable progress in a variety of tasks, such as image recognition and machine translation.
While various neural network structures have been developed to improve predictive performance, the selection or design of neural network structures remains the user's task. In general, tuning a neural network structure improves model performance. It is, however, tedious, because the user must design an appropriate structure for the target task through trial-and-error.

* Accepted as a conference paper at the 28th International Conference on Artificial Neural Networks (ICANN 2019). The final authenticated publication will be available in the Springer Lecture Notes in Computer Science (LNCS).

To automate neural network structure design processes, methods called neural architecture search have been developed. A popular approach is to regard the structure parameters (e.g., the numbers of layers and units, the type of layers, and the connectivity) as hyperparameters and optimize them through black-box optimization methods, such as evolutionary algorithms [15,19] and Bayesian optimization [9]. Another approach trains a neural network that generates the network architecture using policy gradient-based reinforcement learning methods [23]. However, these approaches require huge computational resources; several works conducted their experiments using more than 100 GPUs [15,23], as the evaluation of a candidate structure requires model training and takes several hours in the case of DNNs.

To remove this computational bottleneck, alternative methods that simultaneously optimize both the structure and the connection weights in a single training loop have been proposed [10,14,17]. These methods are promising because they can find structures with better prediction performance using only one GPU. In this paper, we employ the dynamic structure optimization framework introduced in [17] as the baseline algorithm.
This framework considers the probability distribution of structure parameters and simultaneously optimizes both the distribution parameters and the weights based on gradient methods.

The above-mentioned methods concentrate on finding neural network structures demonstrating high prediction performance; that is, they search for a structure that minimizes validation or training error. The structures found based on such criteria might become resource-hungry. To deploy such neural networks on limited computing resources, such as mobile devices, a compact yet high-performing structure is required. Several studies introduced model complexity into the objective function, using measures such as the total number of weights and/or FLOPs. Tan et al. [20] introduced latency (delay time with respect to (w.r.t.) data transfer) into the objective function as a penalty in policy gradient-based neural architecture search and searched for a platform-aware structure. Additionally, multi-objective optimization methods have been applied to obtain structures over a trade-off curve of performance and model complexity [3,4]. However, such methods require greater computational resources, as the existing methods are based on hyperparameter optimization.

For the purpose of obtaining compact structures, regularization-based connection pruning methods have been investigated. Han et al. [5] used L2-norm regularization of weights and iterated weight coefficient-based pruning and retraining. This method obtained a simpler structure with the same performance as the original structure. Liu et al. [11] proposed channel-wise pruning for use with convolutional neural networks (CNNs), adding new weights for each channel; these weights are penalized through L1-norm regularization. In general, regularization-based pruning methods impose a penalty on the weight values.
It is, therefore, difficult to directly use aspects of the network size, such as the number of weight parameters or units, as the penalty.

In this paper, we introduce a penalty term for controlling model complexity in the dynamic structure optimization method [17]. In accordance with the literature [17], we assume a binary vector as the structure parameters, which can represent the network structure, such as the selection of units or of connections between layers. Further, we consider the multivariate Bernoulli distribution and formulate the objective function to be minimized as the expectation of the loss function under the distribution. Then, the penalty term w.r.t. the number of weights or units is incorporated into the loss to control the model complexity. To investigate the effects of the proposed penalty term, we apply this method to the unit selection of a fully-connected neural network. The experimental result shows that the proposed method can control the model complexity and preferentially remove insignificant units and connections to maintain the performance.

2 The Baseline Algorithm

We will now briefly explain the dynamic structure optimization framework proposed in [17]. Neural networks are modeled as φ(W, M) by two types of parameters: the vector of connection weights W and the structure parameter M. The structure parameter M ∈ 𝓜 determines d hyperparameters, such as the connectivity of each layer or the existence of units. Let us consider that the structure parameter M is sampled from the probability distribution p(M | θ), which is parameterized by a vector θ ∈ Θ as a distribution parameter. We denote the loss to be minimized as L(W, M) = ∫ l(z, W, M) p(z) dz, where l(z, W, M) and p(z) indicate the loss of a datum z and the probability distribution of z, respectively.
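As a concrete illustration of these definitions, the following NumPy sketch samples a structure M from a multivariate Bernoulli distribution and approximates L(W, M) by a mini-batch average. The per-datum loss `l` is a hypothetical stand-in (with W omitted), not the loss used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                         # dimension of the structure parameter M
theta = np.full(d, 0.5)       # parameters of p(M | theta)

# Sample a structure M ~ p(M | theta): each bit is independently Bernoulli(theta_i)
M = (rng.random(d) < theta).astype(int)

def l(z, M):
    # Hypothetical per-datum loss l(z, W, M); the weights W are omitted for brevity
    return float(((z * M) ** 2).sum())

# Approximate L(W, M) = E_z[l(z, W, M)] with a mini-batch average
Z = rng.standard_normal((32, d))       # a mini-batch of 32 data points z
L_bar = np.mean([l(z, M) for z in Z])
print(M, L_bar)
```

In the actual algorithm, this mini-batch estimate of the loss is what enters the gradient approximations described next.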
Instead of directly optimizing L(W, M), the stochastic relaxation of M is considered; that is, the following expected loss under p(M | θ) is minimized:

    G(W, θ) = ∫ L(W, M) p(M | θ) dM ,                                           (1)

where dM is a reference measure on 𝓜. To optimize W and θ, we use the vanilla (Euclidean) gradient w.r.t. W and the natural gradient w.r.t. θ:

    ∇_W G(W, θ) = ∫ ∇_W L(W, M) p(M | θ) dM ,                                   (2)

    ∇̃_θ G(W, θ) = ∫ L(W, M) ∇̃_θ ln p(M | θ) p(M | θ) dM ,                      (3)

where ∇̃_θ = F(θ)⁻¹ ∇_θ is the so-called natural gradient [2] and F(θ) is the Fisher information matrix of p(M | θ). Optimizing θ using (3) works in the same way as information geometric optimization (IGO) [13], a unified framework for probabilistic model-based evolutionary algorithms. Differently from IGO, Shirakawa et al. [17] proposed simultaneously updating W and θ in the gradient directions given by (2) and (3), which produces dynamic structure optimization. In practice, the gradients (2) and (3) are approximated by Monte-Carlo methods using the mini-batch data samples and the λ structure parameters sampled from p(M | θ).

3 Introducing a Penalty Term in Dynamic Structure Optimization

In this section, we introduce a penalty term into dynamic structure optimization to control model complexity. We focus on the case where the structure parameter can be treated as a binary vector, as was done in [17].

Representation of the Structure Parameter. We denote neural networks as φ(W, M), modeled by two parameters: W is the weight vector, and M = (m₁, ..., m_d)ᵀ ∈ 𝓜 = {0, 1}^d is a d-dimensional binary vector that determines the neural network structure. We consider the multivariate Bernoulli distribution defined by p(M | θ) = ∏ᵢ₌₁^d θᵢ^{mᵢ}(1 − θᵢ)^{1−mᵢ} to be the probability distribution of the random variable M, where θ = (θ₁, ..., θ_d)ᵀ, θᵢ ∈ [0, 1], refers to the parameters of the Bernoulli distribution. For instance, in connection selection, the parameter mᵢ determines whether or not the i-th connection appears, and θᵢ corresponds to the probability that mᵢ becomes one.

Incorporating the Penalty Term into the Objective Function. We denote the original loss function of the neural network model by L(W, M), which depends on W and M. To penalize complicated structures, we introduce the penalty term R(M), which depends on M, and obtain the objective function L(W, M) + εR(M), where ε is a penalty coefficient. In this paper, we particularly focus on the case where the penalty term can be represented by a weighted sum of the mᵢ, i = 1, ..., d, namely R(M) = Σᵢ₌₁^d cᵢ mᵢ, where cᵢ is the coefficient representing the model complexity that corresponds to the i-th bit. Here, we assume that the model complexity increases if the bit mᵢ becomes one. This is a reasonable assumption, because the binary vector is usually used to determine the existence of connections, layers, and units; therefore, we can consider that the model complexity increases as the number of '1' bits increases.

As neither the original loss nor the penalty term is differentiable w.r.t. M, we employ stochastic relaxation by taking the expectation of the objective function. The expected objective function incorporating the penalty term under the Bernoulli distribution p(M | θ) is given by

    G(W, θ) = Σ_{M∈𝓜} L(W, M) p(M | θ) + ε Σᵢ₌₁^d cᵢ θᵢ .                       (4)

When ε = 0, the minimization of G(W, θ) recovers the same algorithm as [17].

Gradients for the Weights and Distribution Parameters. To simultaneously optimize W and θ, we derive the gradients of G(W, θ) w.r.t. W and θ. The vanilla gradient w.r.t. W is given by ∇_W G(W, θ) = Σ_{M∈𝓜} ∇_W L(W, M) p(M | θ), since the penalty term R(M) does not depend on W. Note that the gradient ∇_W L(W, M) can be computed through back-propagation.

Regarding the distribution parameters θ, we derive the natural gradient [2], defined as the product of the inverse of the Fisher information matrix and the vanilla gradient; it is the steepest direction of θ when the KL-divergence is considered as the pseudo-distance on θ. Since we are considering the Bernoulli distribution, F(θ)⁻¹ can be obtained analytically as F(θ)⁻¹ = diag(θ(1 − θ)), where the product of vectors indicates the element-wise product. We then obtain the analytical natural gradients of the log-likelihood and the penalty term as ∇̃_θ ln p(M | θ) = M − θ and ∇̃_θ Σᵢ cᵢθᵢ = cθ(1 − θ), respectively, where c = (c₁, ..., c_d)ᵀ is the vector representation of the model complexity coefficients and ∇̃_θ = F(θ)⁻¹∇_θ indicates the natural gradient operator. As a result, we obtain the following gradient:

    ∇̃_θ G(W, θ) = Σ_{M∈𝓜} L(W, M)(M − θ) p(M | θ) + εcθ(1 − θ) .               (5)

Gradient Approximation. In practice, the analytical gradients are approximated by the Monte-Carlo method using λ samples {M₁, ..., M_λ} drawn from p(M | θ). Moreover, the loss L(W, Mᵢ) is also approximated using N̄ mini-batch samples Z = {z₁, ..., z_N̄}. Following [17], we use the same mini-batch for the different Mᵢ to obtain an accurate ranking of losses. The approximated loss is given by L̄(W, Mᵢ) = N̄⁻¹ Σ_{z∈Z} l(z, W, Mᵢ), where l(z, W, Mᵢ) represents the loss of a datum. The gradient for W is estimated by the Monte-Carlo method using λ samples:

    ∇_W G(W, θ) ≈ (1/λ) Σᵢ₌₁^λ ∇_W L̄(W, Mᵢ) .                                   (6)

We can update W using any stochastic gradient descent (SGD) method with (6).
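The analytical natural gradients above can be checked numerically. A minimal NumPy sketch (the values of θ, M, and c are arbitrary illustrations):

```python
import numpy as np

theta = np.array([0.2, 0.5, 0.9])
M = np.array([1.0, 0.0, 1.0])
c = np.array([3.0, 1.0, 2.0])   # arbitrary model complexity coefficients

# Vanilla gradient of the Bernoulli log-likelihood ln p(M | theta)
vanilla = M / theta - (1.0 - M) / (1.0 - theta)

# Natural gradient: premultiply by F(theta)^-1 = diag(theta(1 - theta))
natural = theta * (1.0 - theta) * vanilla
print(np.allclose(natural, M - theta))      # matches the analytical form M - theta

# Natural gradient of the expected penalty sum_i c_i theta_i is c * theta * (1 - theta)
penalty_nat_grad = theta * (1.0 - theta) * c
print(penalty_nat_grad)

# Sanity check of (4): E[R(M)] under p(M | theta) equals sum_i c_i theta_i
rng = np.random.default_rng(0)
samples = (rng.random((200000, 3)) < theta).astype(float)
print(abs((samples @ c).mean() - c @ theta) < 0.02)
```

The first check confirms that multiplying the vanilla log-likelihood gradient by the inverse Fisher matrix collapses to M − θ; the last confirms the closed-form expectation of the penalty.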
To update the distribution parameters θ, we transform the loss value L̄(W, Mᵢ) into a ranking-based utility uᵢ, as was done in [17]: uᵢ = 1 if L̄(W, Mᵢ) is in the top ⌈λ/4⌉, uᵢ = −1 if it is in the bottom ⌈λ/4⌉, and uᵢ = 0 otherwise. The ranking-based utility transformation makes the algorithm invariant to order-preserving transformations of L. We note that this utility function transforms the original minimization problem into a maximization problem. With this utility transformation, the approximation of (5) is given by ∇̃_θ G(W, θ) ≈ (1/λ) Σᵢ₌₁^λ uᵢ(Mᵢ − θ) − εcθ(1 − θ). As a result, the update rule for θ at the t-th iteration is given by

    θ^(t+1) = θ^(t) + η_θ ( (1/λ) Σᵢ₌₁^λ uᵢ (Mᵢ − θ^(t)) − εcθ^(t)(1 − θ^(t)) ) ,   (7)

where η_θ is the learning rate for θ. If we use the binary vector to select input units and set c = (1, ..., 1)ᵀ, the algorithm works as feature selection [16]. The method introduced in this paper targets model complexity control and can also be applied in cases where each bit corresponds to a different number of weights, by introducing the model complexity coefficients c.

Algorithm 1: The training procedure of the proposed method.
  Input: Training data D and hyperparameters {λ, η_θ, ε₀}
  Output: Optimized parameters W and θ
  begin
    Initialize the weight and distribution parameters as W^(0) and θ^(0)
    t ← 0
    while the stopping criterion is not satisfied do
      Get N̄ mini-batch samples from D
      Sample M₁, ..., M_λ from p(M | θ^(t))
      Compute the losses L̄(W, Mᵢ) for i = 1, ..., λ
      Update the distribution parameters to θ^(t+1) by (7)
      Restrict the range of θ^(t+1)
      Update the weight parameters to W^(t+1) using (6) with any SGD method
      t ← t + 1

The optimization procedure of the proposed method is shown in Algorithm 1.
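A single iteration of the θ update above can be sketched as follows; this is a toy illustration (a hand-made loss, a larger λ than the paper's λ = 2, and an assumed ε), not the experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 6, 40                 # lam larger than the paper's lambda = 2, for a visible effect
eta_theta = 0.1
eps = 0.01                     # assumed penalty coefficient epsilon
c = np.ones(d)                 # unit-selection case: c = (1, ..., 1)^T
theta = np.full(d, 0.5)

def loss(M):
    # Hypothetical mini-batch loss: the first three bits are useful, the rest harmful
    return -M[:3].sum() + M[3:].sum()

M = (rng.random((lam, d)) < theta).astype(float)
L = np.array([loss(m) for m in M])

# Ranking-based utility: +1 for the best ceil(lam/4) samples, -1 for the worst
k = int(np.ceil(lam / 4))
order = np.argsort(L)          # ascending: smallest loss (best) first
u = np.zeros(lam)
u[order[:k]] = 1.0
u[order[-k:]] = -1.0

# Update rule (7): ascent on the utility, minus the penalty's natural gradient
theta = theta + eta_theta * ((u[:, None] * (M - theta)).mean(axis=0)
                             - eps * c * theta * (1.0 - theta))
print(theta)                   # typically rises for useful bits, falls for harmful ones
```

Note the sign flip: because the utility turns minimization into maximization, the update ascends the utility term while the penalty's natural gradient is subtracted.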
Prediction for Test Data. As proposed in [17], there are two options for predicting new data using the optimized θ and W. In the first method, binary vectors are sampled from p(M | θ), and the prediction results are averaged. This stochastic prediction method will produce an accurate prediction, but it is not desirable for obtaining a compact structure. The second way is to deterministically select the binary vector as M* = argmax_M p(M | θ), such that mᵢ = 1 if θᵢ ≥ 0.5 and mᵢ = 0 otherwise. In our experiments, we use the second option, deterministic prediction, and report those results.

Implementation Remark. We restrict the range of θ to [1/d, 1 − 1/d] to retain the possibility of generating any binary vector. To be precise, if the θ updated through (7) falls outside this range, its values are set to the boundary values. In addition, the coefficient ε is normalized as ε = ε₀ / max(c). The natural gradient corresponding to L is bounded within the range [−1, 1]^d due to the utility transformation. Under the above normalization, the one corresponding to the penalty term is bounded within [0, ε₀/4]^d. Therefore, both gradients are at approximately the same scale regardless of the encoding scheme (i.e., the usage of M).

4 Experiments and Results

We apply the proposed method to two neural network structure optimization problems with image classification datasets: unit selection of fully-connected neural networks and connection selection of DenseNet [7]. All algorithms are implemented using the Chainer framework [21] (version 4.5.0) with the CuPy backend [12] (version 4.5.0) and run on a single NVIDIA GTX 1070 GPU in the unit selection experiment and on a single NVIDIA GTX 1080Ti GPU in the connection selection experiment. In both experiments, the weights are optimized using SGD with Nesterov's momentum.
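The deterministic prediction and the implementation remarks above amount to a few lines of array operations; a NumPy sketch (the complexity coefficients c are hypothetical):

```python
import numpy as np

d = 8
rng = np.random.default_rng(2)
theta = rng.random(d)                # stand-in for distribution parameters after training

# Deterministic prediction: M* = argmax_M p(M | theta), i.e. threshold each theta_i at 0.5
M_star = (theta >= 0.5).astype(int)

# Implementation remark: keep theta inside [1/d, 1 - 1/d] so every binary
# vector retains a nonzero probability of being sampled
theta = np.clip(theta, 1.0 / d, 1.0 - 1.0 / d)

# Normalize the penalty coefficient: epsilon = epsilon_0 / max(c)
c = np.array([1.0, 4.0, 2.0, 8.0, 1.0, 1.0, 2.0, 4.0])  # hypothetical coefficients
eps0 = 2.0 ** -4
eps = eps0 / c.max()

# The penalty's natural gradient is then bounded by eps0 / 4 per coordinate
bound = (eps * c * theta * (1.0 - theta)).max()
print(M_star, eps, bound <= eps0 / 4)
```

The final line confirms the scale argument: after normalization, each coordinate of the penalty's natural gradient stays below ε₀/4.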
The momentum coefficient and the weight decay are set to 0.9 and 10⁻⁴, respectively. Following [7,17], the learning rate for the weights is divided by 10 at 1/2 and 3/4 of the maximum number of epochs. The weight parameters are initialized by He's initialization [6]. We use the cross-entropy loss as l(z, W, M). The experimental setting of the proposed method is based on [17]: the sample size is λ = 2, the learning rate is η_θ = 1/d, and the initial distribution parameters are θ^(0) = 0.5.

4.1 Unit Selection in the Fully-Connected Neural Network

Experimental Setting. In this experiment, we use a fully-connected neural network with three hidden layers of 784 units as the base structure, and we select the units in each hidden layer. The MNIST dataset, a 10-class handwritten digit dataset consisting of 60,000 training and 10,000 test data of 28 × 28 gray-scale images, is used. We use the pixel values as the inputs and determine the existence of the units in the hidden layers according to the binary vector M. The i-th unit is active if mᵢ = 1 and inactive if mᵢ = 0. The output of the i-th unit is represented by mᵢ F(Xᵢ), where F and Xᵢ denote the activation function and the input to the activation of the i-th unit, respectively. We use the rectified linear unit (ReLU) as F in the hidden layers and the softmax activation in the output layer. When mᵢ = 0, the connections to/from the i-th unit can be omitted; that is, the active number of weights decreases. The dimension of θ, the total number of hidden units, is d = 2352. This task is simple, but it lets us check how the proposed penalty term works.

We set the mini-batch size to N̄ = 32 and the number of training epochs to 500 in the proposed method, while the mini-batch size is set to N̄ = 64 and the number of training epochs to 1000 in the other methods.
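The unit masking mᵢ F(Xᵢ) described above can be sketched as a single masked forward pass; a minimal NumPy illustration with toy dimensions (not the 784-unit experimental network):

```python
import numpy as np

rng = np.random.default_rng(3)

def masked_hidden_layer(x, W, b, m):
    """One hidden layer in which unit i is active iff m[i] = 1.

    The output of unit i is m_i * F(X_i) with F = ReLU, so an inactive unit
    contributes nothing downstream and its connections can be omitted.
    """
    pre = x @ W + b                       # X_i: input to the activation of unit i
    return m * np.maximum(pre, 0.0)       # m_i * F(X_i)

x = rng.standard_normal(10)               # e.g. flattened pixel inputs
W = rng.standard_normal((10, 5))
b = np.zeros(5)
m = np.array([1, 0, 1, 1, 0])             # binary structure parameters for 5 units
h = masked_hidden_layer(x, W, b, m)
print(h)                                  # entries 1 and 4 are exactly zero
```

Multiplying after the activation keeps the masking differentiable w.r.t. W for the active units while zeroing the inactive ones exactly.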
Under these settings, the number of data samples used in one iteration and the number of parameter-update iterations are identical across all methods, where the number of iterations is about 9.5 × 10⁵. We initialize the learning rate for W to 0.01. In this experiment, we vary the coefficient ε₀ over 2⁻⁶, 2⁻⁷, 2⁻⁸, 2⁻⁹, 0, −2⁻³, and −2⁰ to check the effect of the penalty term.¹ Since each bit decides whether or not its corresponding unit is active, we simply use the same model complexity coefficient for each unit, c = (1, ..., 1)ᵀ.

To evaluate the proposed method's performance, we report the experimental results of fixed neural network structures with various numbers of units. We manually and uniformly remove the units in the hidden layers and thereby control the number of weights. As the network structures are stochastically sampled in the training phase of dynamic structure optimization, our method is somewhat similar to stochastic network models such as Dropout [18]. We also report the result using Dropout with a dropout rate of 0.5 for comparison.

¹ A negative value of ε₀ encourages an increase in the number of active units.

[Fig. 1. The relationship between the weight usage rate and the test error rates of (a) unit selection of the fully-connected neural network and (b) connection selection of DenseNet. The median values and the 25% and 75% quantile values over five independent trials are plotted.]
Note that the aim of Dropout is to prevent overfitting; thus, all units are kept in the test phase.

Results and Discussion. Figure 1(a) shows the relationship between the weight usage rate and the test error rates of the proposed method, the fixed structures, and Dropout. The median values and the 25% and 75% quantile values over five independent trials are plotted. Comparing the proposed method and the fixed structures, the proposed method outperforms the fixed structures above the 25% weight usage rate. For the fixed structures, the error rate gradually increases as the weight usage rate decreases, whereas the performance of the proposed method only deteriorates when its weight usage rate drops to approximately 6%. This indicates that the proposed method can control the weight usage rate by changing the penalty coefficient ε₀ and remove units while still maintaining performance. The structures obtained under different ε₀ settings create a trade-off curve between model complexity and performance.

Comparing the proposed method and the original structure (i.e., the fixed structure with 100% weight usage rate), the proposed method outperforms the original structure at usage rates of 25% to 100%. Remarkably, when ε₀ < 0, although all units are selected after the training procedure (i.e., the structure is the same as the original structure), the performance improves. Additionally, Dropout training also improves performance. Based on these results, stochastic training appears to improve prediction performance. Dropout, however, cannot control the weight usage rate, whereas the proposed method can reduce the number of used weights without significant performance deterioration.

Table 1 shows the median numbers of selected units in each hidden layer. We observe that the proposed method preferentially removes units in the second and third hidden layers. Therefore, the proposed method removes units selectively rather than at random.

Table 1. The numbers of selected units in each hidden layer in the unit selection experiment.

  Weight usage rate     1st layer  2nd layer  3rd layer
  6.6%  (ε₀ = 2⁻⁶)          92         31         24
  26.1% (ε₀ = 2⁻⁷)         340        141        126
  57.8% (ε₀ = 2⁻⁸)         599        380        379
  77.7% (ε₀ = 2⁻⁹)         704        544        567
  93.7% (ε₀ = 0)           770        717        724
  100%  (ε₀ = −2⁻³)        784        784        784

Table 2. The numbers of selected connections in each block in the connection selection experiment.

  Weight usage rate     1st block  2nd block  3rd block
  15.8% (ε₀ = 2⁻²)          36          6         20
  24.1% (ε₀ = 2⁻³)          45          8         49
  42.3% (ε₀ = 2⁻⁴)          57         39         67
  67.5% (ε₀ = 2⁻⁵)          59         61         76
  80.3% (ε₀ = 2⁻⁶)          65         64         80
  100%  (ε₀ = −2⁰)          91         91         91

The computational time for training with the proposed method is almost the same as that required for the fixed structure. Even if we run several different penalty coefficient ε₀ settings to obtain additional trade-off structures, the total computational time of the structure search increases only several times over. This is more reasonable than hyperparameter-optimization-based structure optimization.

4.2 Connection Selection in DenseNet

Experimental Setting. We use DenseNet [7] as the base network structure; it is composed of several dense blocks and transition layers. A dense block consists of L_block layers, each of which implements a non-linear transformation with batch normalization (BN) [8] followed by the ReLU activation and a 3 × 3 convolution. In a dense block, the l-th layer receives the outputs of all the preceding layers as inputs, concatenated along the channel dimension.
The size of the output feature-maps in a dense block is the same as that of the input feature-maps. A transition layer is located between the dense blocks and consists of batch normalization, ReLU activation, and a 1 × 1 convolution layer, followed by 2 × 2 average pooling. The detailed structure of DenseNet can be found in [7]. Unlike [7], however, we do not use Dropout.

We optimize the connections in the dense blocks using the CIFAR-10 dataset, which contains 50,000 training and 10,000 test data of 32 × 32 color images in 10 different classes. For preprocessing and data augmentation, we use per-channel standardization, padding, and cropping, followed by random horizontal flipping. The setting details are the same as in [17].

We determine the existence of the connections between the layers in each dense block according to the binary vector M. As was done in [17], we use a simple DenseNet structure with a depth of 40 that contains three dense blocks with L_block = 12 and two transition layers. In this setup, the dimension of M and θ becomes d = 273. We vary the coefficient ε₀ over 2⁻², 2⁻³, 2⁻⁴, 2⁻⁵, 2⁻⁶, 0, −2⁻³, and −2⁰ to assess the effect of the penalty term. Additionally, we set the model complexity coefficients cᵢ to the number of weights corresponding to the i-th connection.

For the proposed method, we set the mini-batch size to N̄ = 32 and the number of training epochs to 300. For the other methods, the mini-batch size is set to N̄ = 64 and the number of training epochs to 600. With these settings, the number of parameter-update iterations is identical across all methods, where the number of iterations is about 4.7 × 10⁵. We initialize the learning rate for W to 0.1. We also report the result when the connections are removed randomly.
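The random-selection baseline sampled at a target weight usage rate can be sketched by rejection sampling; the per-connection weight counts below are hypothetical stand-ins for the cᵢ values used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 273                                        # number of candidate connections, as in the text
c = rng.integers(100, 5000, d).astype(float)   # hypothetical per-connection weight counts c_i

def sample_at_usage(target, tol=0.02):
    """Draw binary vectors M until the weight usage rate hits target +/- tol.

    The usage rate is the weighted fraction sum_i c_i m_i / sum_i c_i, since
    each connection controls a different number of weights.
    """
    while True:
        M = (rng.random(d) < target).astype(int)
        rate = float((c * M).sum() / c.sum())
        if abs(rate - target) <= tol:
            return M, rate

M, rate = sample_at_usage(0.25)
print(rate)
```

Because the expected usage rate of a Bernoulli(target) draw is already the target, the rejection loop terminates after only a few draws on average.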
We repeatedly sample the binary vector M such that the weight usage rate reaches the target percentage, and then we train the fixed network.

Results and Discussion. Figure 1(b) shows the relationship between the weight usage rate and the test error rate. Comparing the proposed method and random selection, the proposed method outperforms random selection in the 15% to 40% weight usage range. With random selection, important connections might be lost when the weight usage rate is less than 40%. In contrast, the proposed method can selectively reduce the number of weights without increasing the test error, so it does not eliminate important connections. However, the difference between the test error rates of random selection and the proposed method is not significant when the weight usage rate exceeds 60%. This result indicates that a small number of connections in DenseNet can be removed randomly without performance deterioration, meaning that DenseNet might be redundant; the proposed method moderates the increase of the test error rate to within 1% at the 40% weight usage rate.

Table 2 summarizes the median numbers of selected connections in each block. The proposed method preferentially removes the connections in the first and second blocks when ε₀ = 0 to 2⁻⁵, but these deletions do not have a significant impact on performance. When ε₀ = 2⁻⁴ to 2⁻², the proposed method actively removes the connections in the second block, so the obtained structure suffers less performance deterioration than random selection.

Figure 2 shows the structure obtained by the proposed method with ε₀ = 2⁻⁴ on a typical single run. As we can see, the second block in this structure, which lies between the 'Trans1' and 'Trans2' cells, became sparser and wider than the first and third blocks.
Interestingly, the second block became a wide structure through the removal of the connections between its layers. This result might suggest that wide structures may be able to improve performance with limited computing resources. Several works, such as [22], report that widening layers improves predictive performance; our findings may also support these wide networks.

[Fig. 2. The obtained DenseNet structure in the case of ε₀ = 2⁻⁴ on a typical single run. The numbers in the cells represent the depth of each layer in the original DenseNet structure. Cells placed in the same column are located at the same depth, and the depth of this DenseNet structure is 32. Layer 19 is inactive.]

We would like to emphasize that it is not easy to manually design a structure such as that shown in Figure 2, due to the differing connectivities in each block. Finally, we note that the amount of computational time required to train our structure is not significantly greater than that required by random selection, meaning that our proposed structure optimization is computationally efficient.

5 Conclusion

In this paper, we proposed a method of controlling model complexity by adding a penalty term to the objective function involved in the dynamic structure optimization of DNNs. We incorporate a penalty term dependent on the structure parameters into the loss function and consider its expectation under the multivariate Bernoulli distribution to be the objective function. We derive a modified update rule that enables us to control model complexity. In the experiment on unit selection, the proposed method outperforms the fixed structures at 25 to 100% weight usage rates.
In the connection selection experiment, the prop osed metho d also outp erforms random selection in the small num b er of weigh ts and preferen tially removing insignificant connections during the training. Up on chec king the obtained structure, it is found that the in termediate blo c k b ecame a wide structure. As the increased amoun t of the computational time required by the prop osed metho d is not significant, w e can tak e the trade-off b et ween mo del complexit y and p erformance with an acceptable computational cost. Our method requires training only once, whereas the pruning metho ds, such as that in [11], require the retraining after pruning. In future work, w e will apply the prop osed method to the architecture searc h metho d for more complex neural netw ork structures, such as that prop osed in [1]. Additionally , w e should ev aluate the proposed p enalt y term using different datasets. Another possible future work is mo difying the prop osed metho d so that it can use other t yp es of the mo del complexity criteria, suc h as FLOPs of neural net works. 12 S. Shota and S. Shirak aw a Ac kno wledgment This work is partially supp orted by the SECOM Science and T echnology F oun- dation. References 1. Akimoto, Y., Shirak aw a, S., Y oshinari, N., Uchida, K., Saito, S., Nishida, K.: Adap- tiv e Stochastic Natural Gradien t Method for One-Shot Neural Architecture Searc h. In: International Conference on Mac hine Learning (ICML). pp. 171–180 (2019) 2. Amari, S.: Natural Gradien t W orks Efficien tly in Learning. Neural Computation 10 (2), 251–276 (1998). https://doi.org/10.1162/089976698300017746 3. Dong, J., Cheng, A., Juan, D., W ei, W., Sun, M.: PPP-Net: Platform-aw are Pro- gressiv e Search for Pareto-optimal Neural Architectures. In: In ternational Confer- ence on Learning Representations (ICLR) W orkshop (2018) 4. Elsk en, T., Metzen, J.H., Hutter, F.: Efficien t Multi-ob jective Neural Architec- ture Searc h via Lamarckian Ev olution. 
In: International Conference on Learning Representations (ICLR) (2019)
5. Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both Weights and Connections for Efficient Neural Networks. In: Neural Information Processing Systems (NIPS) (2015)
6. He, K., Zhang, X., Ren, S., Sun, J.: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In: IEEE International Conference on Computer Vision (ICCV). pp. 1026–1034 (2015). https://doi.org/10.1109/ICCV.2015.123
7. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely Connected Convolutional Networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2261–2269 (2017). https://doi.org/10.1109/CVPR.2017.243
8. Ioffe, S., Szegedy, C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In: International Conference on Machine Learning (ICML). pp. 448–456 (2015)
9. Kandasamy, K., Neiswanger, W., Schneider, J., Poczos, B., Xing, E.: Neural Architecture Search with Bayesian Optimisation and Optimal Transport. In: Neural Information Processing Systems (NIPS) (2018)
10. Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable Architecture Search. In: International Conference on Learning Representations (ICLR) (2019)
11. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning Efficient Convolutional Networks through Network Slimming. In: IEEE International Conference on Computer Vision (ICCV). pp. 2736–2744 (2017)
12. Okuta, R., Unno, Y., Nishino, D., Hido, S., Loomis, C.: CuPy: A NumPy-Compatible Library for NVIDIA GPU Calculations. In: Workshop on Machine Learning Systems (LearningSys) in The 31st Annual Conference on Neural Information Processing Systems (NIPS) (2017)
13. Ollivier, Y., Arnold, L., Auger, A., Hansen, N.: Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles.
Journal of Machine Learning Research 18(18), 1–65 (2017)
14. Pham, H., Guan, M., Zoph, B., Le, Q.V., Dean, J.: Efficient Neural Architecture Search via Parameters Sharing. In: International Conference on Machine Learning (ICML). pp. 4095–4104 (2018)
15. Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q., Kurakin, A.: Large-Scale Evolution of Image Classifiers. In: International Conference on Machine Learning (ICML). pp. 2902–2911 (2017)
16. Saito, S., Shirakawa, S., Akimoto, Y.: Embedded Feature Selection Using Probabilistic Model-Based Optimization. In: Genetic and Evolutionary Computation Conference (GECCO) Companion. pp. 1922–1925 (2018). https://doi.org/10.1145/3205651.3208227
17. Shirakawa, S., Iwata, Y., Akimoto, Y.: Dynamic Optimization of Neural Network Structures Using Probabilistic Modeling. In: The 32nd AAAI Conference on Artificial Intelligence (AAAI-18). pp. 4074–4082 (2018)
18. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1929–1958 (2014)
19. Suganuma, M., Shirakawa, S., Nagao, T.: A Genetic Programming Approach to Designing Convolutional Neural Network Architectures. In: Genetic and Evolutionary Computation Conference (GECCO). pp. 497–504 (2017). https://doi.org/10.1145/3071178.3071229
20. Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le, Q.V.: MnasNet: Platform-Aware Neural Architecture Search for Mobile. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
21. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a Next-Generation Open Source Framework for Deep Learning. In: Workshop on Machine Learning Systems (LearningSys) in Neural Information Processing Systems (NIPS). pp. 1–6 (2015)
22. Zagoruyko, S., Komodakis, N.: Wide Residual Networks. In: British Machine Vision Conference (BMVC). pp. 87.1–87.12 (2016). https://doi.org/10.5244/C.30.87
23. Zoph, B., Le, Q.V.: Neural Architecture Search with Reinforcement Learning. In: International Conference on Learning Representations (ICLR) (2017)