EENA: Efficient Evolution of Neural Architecture



Hui Zhu 1,2, Zhulin An 1*, Chuanguang Yang 1, Kaiqiang Xu 1, Erhu Zhao 1, Yongjun Xu 1
1 Institute of Computing Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
{zhuhui, anzhulin, yangchuanguang, xukaiqiang, zhaoerhu, xyj}@ict.ac.cn
* Corresponding author of this work.

Abstract

The latest algorithms for automatic neural architecture search perform remarkably well but are largely directionless in their exploration of the search space and computationally expensive in the training of every intermediate architecture. In this paper, we propose a method for efficient architecture search called EENA (Efficient Evolution of Neural Architecture). Thanks to elaborately designed mutation and crossover operations, the evolution process can be guided by the information that has already been learned; therefore, less computational effort is required, and the search and training time can be reduced significantly. On CIFAR-10 classification, EENA using minimal computational resources (0.65 GPU-days) can design a highly effective neural architecture that achieves 2.56% test error with 8.47M parameters. Furthermore, the best architecture discovered is also transferable to CIFAR-100.

1. Introduction

Convolutional neural networks achieve prominent performance in computer vision, object detection and other fields by extracting features through neural architectures that imitate mechanisms of the human brain. Human-designed neural architectures such as ResNet [14], DenseNet [17] and PyramidNet [13], which contain several effective blocks, have successively been proposed to increase the accuracy of image classification. In order to design neural architectures adaptable to various datasets, a growing number of researchers are studying algorithmic solutions, built on human experience, for automatic neural architecture search [38, 23, 24, 27, 3, 4, 33].

Many architecture search algorithms perform remarkably well but demand a great deal of computational effort. For example, obtaining a state-of-the-art architecture for CIFAR-10 required 7 days with 450 GPUs for an evolutionary algorithm [28], or 800 GPUs for 28 days of reinforcement learning [38]. The latest algorithms based on reinforcement learning (RL) [27], sequential model-based optimization (SMBO) [22] and Bayesian optimization [19] over a discrete space have been proposed to speed up the search process, but their basically directionless search still requires a large number of architecture evaluations. Several algorithms based on gradient descent over a continuous space, such as DARTS [24] and NAO [26], address this problem to some extent, but the training of every intermediate architecture remains computationally expensive.

In this work, we propose a method for efficient architecture search called EENA (Efficient Evolution of Neural Architecture), guided by the experience gained in prior learning to speed up the search process and thus consume less computational effort. The concept of guidance by gained experience is inspired by Net2Net [5], which generates large networks by transforming small networks via function-preserving operations.
There are several precedents [1, 35, 2] that build on this idea for neural architecture search, but their basic operations are limited to reusing experience stored in parameters and are relatively simple, so the algorithms may degenerate into random search. In our method, we absorb more basic blocks from classical networks, discard several ineffective blocks, and even extend the guidance of gained experience to prior architectures through a crossover operation. Because the loss continues to decrease and the evolution becomes directional, robust and near-optimal models can be discovered rapidly in the search space.

Our experiments (Sect. 4) with neural architecture search on CIFAR-10 show that our method, using minimal computational resources (0.65 GPU-days; all of our experiments were performed on a single NVIDIA Titan Xp GPU), can design a highly effective neural architecture that achieves 2.56% test error with 8.47M parameters. We further transfer the best architecture discovered on CIFAR-10 to the CIFAR-100 dataset, where the results are remarkable as well.

Our contributions are summarized as follows:

(1) We are the first to propose a crossover operation guided by gained experience to effectively reuse previously learned architectures and parameters.

(2) We study a large number of basic mutation operations absorbed from typical architectures and select the ones with significant effects.

(3) We achieve remarkable architecture search efficiency (2.56% error on CIFAR-10 in 0.65 GPU-days), which we attribute to the use of EENA.

(4) We show that the neural architectures searched by EENA on CIFAR-10 are transferable to the CIFAR-100 dataset.

Part of the code and several models searched on CIFAR-10 with EENA are available at https://github.com/ICCV-5-EENA/EENA.

2. Related Work

In this section, we review human-designed neural architectures and automatic neural architecture search, the two areas most related to this work.

Human-Designed Neural Architectures. Since convolutional neural networks were first used to deal with problems in the field of computer vision, human-designed neural architectures have constantly improved classification accuracy on specific datasets. As the crucial factor affecting the performance of a neural network, many excellent neural architectures have been designed. Chollet et al. [6] propose to replace Inception modules with depthwise separable convolutions to reduce the number of parameters. Grouped convolutions, introduced by Krizhevsky et al. [20] to distribute a model over two GPUs, were extended by Xie et al. [36], who propose that increasing cardinality is more effective than going deeper or wider. He et al. [14] solve the degradation problem of deep neural networks with residual blocks. Huang et al. [17] propose dense blocks to alleviate the vanishing-gradient problem and substantially reduce the number of parameters. Hu et al. [16] propose the Squeeze-and-Excitation block, which adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. In addition to improvements on convolutional layers, many other methods have been proposed to further optimize networks.
Lin et al. [21] use global average pooling to replace traditional fully connected layers, enforcing correspondences between feature maps and categories and avoiding overfitting. Regularization methods such as Dropout [30], DropBlock [11] and Shake-Shake [10] also improve the generalization ability of neural networks. Human-designed neural architectures rely heavily on expert experience, and designing the most suitable neural architecture for a new image task is extremely difficult. In this work, we fully draw on the excellent experience of these predecessors to design an efficient neural architecture search method.

Automatic Neural Architecture Search. Many different search strategies have been proposed to explore the space of neural architectures, including random search, evolutionary algorithms (EA), reinforcement learning (RL), Bayesian optimization (BO) and gradient-based methods. Early approaches [32, 31] use evolutionary algorithms to optimize both the neural architecture and its weights. However, faced with the scale of contemporary neural architectures with millions of weights, recent evolutionary approaches have been improved in several ways, such as using gradient-based methods for optimizing weights [29, 28, 23] or modifying the architecture by network morphisms [34, 5, 2]. The generation of a neural architecture can be regarded as an agent's action, so many approaches based on reinforcement learning [38, 39] have been proposed for neural architecture search. Kandasamy et al. [19] derive kernel functions for architecture search spaces in order to use classic GP-based BO methods, but compared with the others, the BO method shows no obvious advantage. In contrast to the gradient-free methods above, Liu et al. [24] propose a continuous relaxation of the search space and Luo et al. [26] use an encoder network to map neural architectures into a continuous space, both gradient-based approaches. We can observe that automatic architecture search algorithms pay increasing attention to the efficiency of the search (such as search time) besides the quality of the architectures discovered. In this work, we propose a different method, based on an evolutionary algorithm and network morphisms, for efficient architecture search, and achieve remarkable results on classification datasets.

3. Proposed Methods

In this section, we illustrate our basic mutation and crossover operations with an example of several connected layers taken from a simple convolutional neural network, and we describe the method for selecting and discarding individuals from the population during the evolution process.

3.1. Search Space and Mutation Operations

As mentioned in Sect. 2, better network architectures are usually born from local improvements, so a good design for automatic neural architecture search should build on a large number of excellent human-designed architectures. In addition, some of the existing methods [5, 35] based on function-preserving transformations are briefly reviewed in this section, and our method is built on them.
Specifically, in our method we absorb more blocks from classical networks, such as the dense block; add some effective modifications, such as noise on new parameters; and discard several ineffective operations, such as kernel widening. Our method explores the search space by mutation and crossover operations, and every mutation operation is a random change applied to one individual.

[Figure 1: Visualization of the teacher network (a) and several mutation operations (b-f). Rectangles and circles represent convolutional layers and feature maps or filters, respectively. The same color means identical values and white means zero. The parts in the red dashed box are equivalent.]

Let $x$ be the input to the network. The guidance of experience gained in parameters consists of choosing a new set of parameters $\theta'$ for a student network $g(x;\theta')$ transformed from the teacher network $h(x;\theta)$ such that:

$$\forall x: \; h(x;\theta) = g(x;\theta')\,. \tag{1}$$

(The equality here is not exact: noise may be added to make the student more robust.)

Assume that the $i$-th convolutional layer to be changed is represented by a $(k_1, k_2, c, f)$-shaped matrix $W^{(i)}$. The input to the convolution operation in layer $i$ is represented as $X^{(i)}$, and the BatchNorm-and-ReLU processing is expressed as $\varphi$. In this work, we consider the following mutation operations.

Widen a Layer. Fig. 1(b) is an example of this operation (a code sketch is given at the end of this subsection). $W^{(i)}$ is extended by replicating its parameters along the last axis at random, and the parameters in $W^{(i+1)}$ are divided along the third axis according to the replication counts of the corresponding filters in layer $i$. $U$ denotes the new parameter matrices and $f'$ is the number of filters in layer $i+1$. A noise $\delta$ is randomly added to every new parameter in $W^{(i+1)}$ to break symmetry:

$$g(j) = \begin{cases} j & j \le f \\ \text{random sample from } \{1, 2, \cdots, f\} & j > f\,, \end{cases} \tag{2}$$

$$U^{(i)}_{k_1,k_2,c,j} = W^{(i)}_{k_1,k_2,c,g(j)}\,, \tag{3}$$

$$U^{(i+1)}_{k_1,k_2,j,f'} = \frac{W^{(i+1)}_{k_1,k_2,g(j),f'}}{\mathrm{card}(x \mid g(x) = g(j))} \cdot (1 + \delta)\,, \quad \delta \in [0, 0.05]\,. \tag{4}$$

Branch a Layer. Fig. 1(c) is an example of this operation. It adds no further parameters and is always combined with other operations. $U$ and $V$ are the new parameter matrices:

$$U^{(i)}_{k_1,k_2,c,j} = W^{(i)}_{k_1,k_2,c,m}\,, \; m \in \left[0, \left\lfloor \tfrac{f}{2} \right\rfloor\right), \qquad V^{(i)}_{k_1,k_2,c,l} = W^{(i)}_{k_1,k_2,c,m}\,, \; m \in \left[\left\lfloor \tfrac{f}{2} \right\rfloor, f\right). \tag{5}$$

The convolutional layer is reformulated as:

$$\mathrm{Concat}\left(\varphi\left(X^{(i)} \cdot U^{(i)}\right), \varphi\left(X^{(i)} \cdot V^{(i)}\right)\right). \tag{6}$$

Insert a Single Layer. Fig. 1(d) is an example of this operation. The new layer weight matrix $U^{(i+1)}$ with a $k_1 \times k_2$ kernel is initialized to an identity matrix. $\mathrm{ReLU}(x) = \max\{x, 0\}$ satisfies the restriction on the activation function $\sigma$:

$$\forall x: \; \sigma(x) = \sigma(I\,\sigma(x))\,, \tag{7}$$

so this operation is possible, and the new matrix can be expressed as:

$$U^{(i+1)}_{j,l,a,b} = \begin{cases} 1 & j = \frac{k_1+1}{2} \,\wedge\, l = \frac{k_2+1}{2} \,\wedge\, a = b \\ 0 & \text{otherwise}\,. \end{cases} \tag{8}$$

Insert a Layer with Shortcut Connection. Fig. 1(e) is an example of this operation. All parameters of the new layer weight matrix $U^{(i+1)}$ are initialized to 0, and the convolutional layer is reformulated as:

$$\mathrm{Add}\left(\varphi\left(X^{(i+1)}\right), \varphi\left(X^{(i+1)} \cdot U^{(i+1)}\right)\right). \tag{9}$$

Insert a Layer with Dense Connection. Fig. 1(f) is an example of this operation. All parameters of the new layer weight matrix $U^{(i+1)}$ are initialized to 0, and the convolutional layer is reformulated as:

$$\mathrm{Concat}\left(\varphi\left(X^{(i+1)} \cdot U^{(i+1)}\right), \varphi\left(X^{(i+1)}\right)\right). \tag{10}$$

In addition, many other important methods, such as separable convolution, grouped convolution and bottleneck structures, can be absorbed into the mutation operations. We ran several simple tests with these and observed that the search space is expanded but the classification accuracy is not improved; therefore, we finally abandoned these operations in our experiments.
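To make the widening rule concrete, the following sketch applies Eqs. (2)-(4) to raw NumPy weight tensors. It is a minimal illustration of the function-preserving idea rather than the released EENA implementation; the function name and the use of per-parameter uniform noise are our own choices based on the description above.

```python
import numpy as np

def widen_layer(W_i, W_next, new_width, noise_high=0.05):
    """Function-preserving widening of layer i (Eqs. 2-4).

    W_i:     (k1, k2, c, f)  weights of the layer being widened.
    W_next:  (k1, k2, f, f') weights of the following layer.
    """
    f = W_i.shape[3]
    assert new_width > f, "widening must increase the filter count"
    # Eq. (2): g(j) is the identity for original filters and a random
    # source filter for each new one.
    g = np.concatenate([np.arange(f),
                        np.random.randint(0, f, new_width - f)])
    # Eq. (3): replicate the chosen filters along the last axis.
    U_i = W_i[:, :, :, g]
    # Eq. (4): divide the fan-in of layer i+1 by the replication count
    # card{x | g(x) = g(j)}, then add small multiplicative noise to break
    # symmetry (which slightly relaxes exact preservation, as noted above).
    counts = np.bincount(g, minlength=f)[g].astype(W_next.dtype)
    U_next = W_next[:, :, g, :] / counts[None, None, :, None]
    U_next *= 1.0 + np.random.uniform(0.0, noise_high, size=U_next.shape)
    return U_i, U_next
```

Up to the noise term, summing the divided copies in layer i+1 reproduces the teacher's pre-activation exactly, which is what makes retraining after this mutation cheap.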
[Figure 2: Visualization of the process by which two parents produce an offspring via the crossover operation. b-f correspond to the mutation operations in Fig. 1 and 1-7 are the serial numbers of layers. Rectangles of the same color represent identical layers and white means zero value. The parts in the red dashed box are equivalent.]

3.2. Crossover Operation

Crossover refers to the combination of prominent parents to produce offspring that may perform even better. The parents are architectures with high fitness (accuracy) that have already been discovered, and every offspring can be considered a new exploration of the search space. Although our mutation operations reduce the computational effort of repeated retraining, the exploration of the search space is still random, without taking advantage of the experience already gained in prior architectures. It is crucial, and difficult, to find a crossover operation that can effectively reuse the parameters already trained and produce the next generation guided by the experience of prior excellent architectures. NEAT [32], an existing method in the field of evolutionary algorithms, identifies which genes line up with which by assigning an innovation number to each node gene. However, this method is limited to fine-grained crossover of nodes and connections, and it destroys the parameters that have already been trained.

We notice that the architectures with high fitness all derive from the same ancestor at some point in the past (at worst, the ancestor is the initial architecture). Whenever a new architecture appears (through mutation operations), we record the type and location of the mutation operation. Based on these records, we can track the historical origins and find the common ancestor of two individuals with high fitness. The offspring then inherits the common architecture (the ancestor) and randomly inherits the differing parts of the parents' architectures. Fig. 2 is a visual example of the crossover operation in our experiments. Based on the records of previous mutation operations for each individual (for Parent1, mutations c, d, b, e occurred at layers 2, 4, 3, 3, respectively, and for Parent2, mutations c, d, f occurred at layers 2, 4, 3, respectively), the common ancestor of the parents (the Ancestor, with mutations c, d at layers 2, 4) can easily be found. The mutation operations in which the two parents differ are then added to the ancestor architecture with a certain probability (mutations b and f at layer 3 are inherited by the Offspring, while mutation e is randomly discarded).
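Once each individual carries its mutation history, this record-based crossover takes only a few lines. The sketch below is our reading of the procedure, under two assumptions not fixed by the text: each history is a linear chain of (mutation type, layer) records, so the common ancestor is the longest shared prefix, and the "certain probability" of inheriting a differing operation is a parameter we call p_inherit.

```python
import random

def crossover(parent1_ops, parent2_ops, p_inherit=0.5):
    """Recombine two mutation histories (Sect. 3.2).

    Each history is a list of (mutation_type, layer) records in the
    order the mutations were applied.
    """
    # Common ancestor: the longest shared prefix of the two lineages.
    ancestor = []
    for op1, op2 in zip(parent1_ops, parent2_ops):
        if op1 != op2:
            break
        ancestor.append(op1)
    # Operations in which the parents differ are inherited independently
    # with probability p_inherit.
    k = len(ancestor)
    extras = parent1_ops[k:] + parent2_ops[k:]
    return ancestor + [op for op in extras if random.random() < p_inherit]

# The Fig. 2 example: the shared prefix [('c', 2), ('d', 4)] is the
# Ancestor; mutations b, e (Parent1) and f (Parent2) are the candidates.
child = crossover([('c', 2), ('d', 4), ('b', 3), ('e', 3)],
                  [('c', 2), ('d', 4), ('f', 3)])
```

Because every inherited record corresponds to a function-preserving mutation, replaying the offspring's record list on the ancestor's weights reuses the trained parameters instead of starting from scratch.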
3.3. The Selection and Discard of Individuals in the Evolutionary Algorithm

The Selection of Individuals. Our evolutionary algorithm uses tournament selection [12] to choose an individual for mutation: k individuals are drawn from the population at random, and the one with the highest fitness among them is selected. For crossover, the two individuals with the highest fitness but different architectures are selected.

The Discard of Individuals. To constrain the size of the population, the discard of an individual accompanies the generation of each new individual once the population size reaches N. We regulate aging and non-aging evolution [28] via a variable λ that affects the convergence rate and overfitting: in each round, we discard the worst model with probability λ and the oldest model with probability 1 − λ.
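Both rules fit in a few lines; the sketch below is a minimal version assuming each individual exposes fitness and age attributes, with k = 3 and λ = 0.5 matching the values fixed in the experiments of Sect. 4.

```python
import random

def tournament_select(population, k=3):
    """Tournament selection [12]: best of k randomly drawn candidates."""
    return max(random.sample(population, k), key=lambda ind: ind.fitness)

def discard_one(population, lam=0.5):
    """Mixed aging/non-aging discard [28]: remove the worst model with
    probability lam, otherwise remove the oldest model."""
    if random.random() < lam:
        victim = min(population, key=lambda ind: ind.fitness)  # non-aging
    else:
        victim = max(population, key=lambda ind: ind.age)      # aging
    population.remove(victim)
```

Larger λ exploits fitness more aggressively (faster convergence, more overfitting risk), while smaller λ behaves like pure aging evolution.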
4. Experiments

In this section, we report the performance of EENA for neural architecture search on CIFAR-10 and the feasibility of transferring the best architecture discovered on CIFAR-10 to CIFAR-100; in addition, we statistically analyze the effect of the mutation and crossover operations. In our experiments, we start the evolution from a simple initial convolutional neural network to show the efficiency of EENA, using the selection and discard methods of Sect. 3.3 to pick individuals from the population and the mutation (Sect. 3.1) and crossover (Sect. 3.2) operations to improve the neural architectures.

[Figure 3: The initial model designed in our experiments.]

[Figure 4: Phylogenetic trees visualizing one search process on CIFAR-10. In the circular tree, the color of the outermost ring represents fitness and the color of the penultimate ring represents the ancestor; in the rectangular tree, the color on the right side represents fitness. From the inside outward (circular tree) and from left to right (rectangular tree), along the time axis, the connections represent the relationships from ancestors to offspring.]

Initial Model. The initial model (0.67M parameters) is sketched in Fig. 3. It starts with one convolutional layer, followed by three evolutionary blocks and two MaxPooling layers for down-sampling, connected alternately. Another convolutional layer is then added, followed by a GlobalAveragePooling layer and a Softmax layer for the transformation from feature maps to classification. Each MaxPooling layer has a stride of two and is followed by a DropBlock [11] layer with keep_prob = 0.8 (block_size = 7 for the first one and block_size = 5 for the second). The first convolutional layer contains 64 filters and the last contains 256 filters; each evolutionary block is initialized with a convolutional layer of 128 filters. Every convolutional layer mentioned here actually denotes a Conv-BatchNorm-ReLU block with a kernel size of 3 × 3. The weights are initialized from the He normal distribution [14], and L2 regularization of 0.0001 is applied to the weights.

Dataset. We randomly sample 10,000 images by stratified sampling from the original training set to form a validation set for evaluating the fitness of the individuals, and use the remaining 40,000 images for training the individuals during the evolution. We normalize the images using channel means and standard deviations for preprocessing and apply a standard data augmentation scheme (zero-padding with 4 pixels on each side to obtain a 40 × 40 image, randomly cropping it back to 32 × 32, and randomly flipping the image horizontally).

Search on CIFAR-10. The initial population consists of 12 individuals, each formed by a single mutation operation applied to the common initial model. During the evolution process, individual selection is determined by the fitness (accuracy) of the neural architecture evaluated on the validation set. In our experiments, the tournament size k is fixed to 3 and the discard variable λ is fixed to 0.5. We do not discard any individuals at the beginning, letting the population grow to a size of 20; we then use selection and discard together, that is to say, each individual produced by mutation or crossover is put back into the population after training, and at the same time a discard is executed. The mutation and crossover operations that improve the neural architectures are applied within the evolutionary blocks, and each mutation operation is selected with the same probability. The crossover operation is executed every 5 rounds, and for it we select as parents the two individuals with the highest fitness but different architectures in the population. All neural architectures are trained with a batch size of 128 using SGDR [25] with initial learning rate l_max = 0.05, T_0 = 1 and T_mult = 2. The initial model is trained for 63 epochs; 15 epochs are then trained after each mutation operation, and one round of 7 epochs plus another round of 15 epochs is trained after each crossover operation.
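The SGDR schedule above can be reproduced with the warm-restart cosine scheduler shipped with PyTorch; the pairing below is an illustration under that assumption, not the authors' released code. With T_0 = 1 and T_mult = 2, the restart periods are 1, 2, 4, 8, 16 and 32 epochs, so training the initial model for 63 = 1 + 2 + 4 + 8 + 16 + 32 epochs covers exactly six complete cycles.

```python
import torch

model = torch.nn.Conv2d(3, 64, 3)   # stand-in for a searched architecture
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)  # l_max = 0.05
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=1, T_mult=2)

for epoch in range(63):             # six complete warm-restart cycles
    # ... one training epoch over the CIFAR-10 training split ...
    scheduler.step()
```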
One search process on CIFAR-10 is visualized in Fig. 4. We can observe that the fitness of the population increases steadily and rapidly via the mutation and crossover operations, and that the population is quickly taken over by a highly prominent homologous group. After the search budget is exhausted, or once the highest fitness in the population has not increased for 25 rounds, the individual with the highest fitness is extracted as the best neural architecture for post-training.

Post-Training of the Best Neural Architecture Obtained. We conduct post-processing and post-training on the best neural architecture designed by EENA. The model is trained on the full training set until convergence using Cutout [8] and Mixup [37], with the same configurations as the original papers (a cutout size of 16 × 16 and α = 1 for mixup); a short sketch of both augmentations follows Table 1. To keep the comparison fair, we do not use the recent AutoAugment [7] method, which has significant effects but has not yet been widely adopted. The network is trained with a batch size of 128 using SGD with learning rate l = 0.1 for 50 epochs to accelerate convergence, and then with SGDR with initial learning rate l_max = 0.1, T_0 = 1 and T_mult = 2 for 511 or 1023 epochs. (We did not conduct extensive hyperparameter tuning due to limited computational resources.) Finally, the error on the test set is reported. The comparison against state-of-the-art recognition results on CIFAR-10 is presented in Table 1: using minimal computational resources (0.65 GPU-days), our method designs a highly effective neural architecture that achieves 2.56% test error with a small number of parameters (8.47M).

Table 1. Comparison against state-of-the-art recognition results on CIFAR-10. Results marked with † are NOT trained with Cutout [8]. The first block lists human-designed architectures; the second block lists automatically designed architectures. Our method uses minimal computational resources to achieve a low test error.

Method | Params (Mil.) | Search Time (GPU-days) | Test Error (%)
DenseNet-BC [17]† | 25.6 | − | 3.46
PyramidNet-Bottleneck [13]† | 26.0 | − | 3.31
ResNeXt + Shake-Shake [10]† | 26.2 | − | 2.86
AmoebaNet-A [28] | 3.2 | 3150 | 3.34
Large-scale Evolution [29]† | 5.4 | 2600 | 5.4
NAS-v3 [38] | 37.4 | 1800 | 3.65
NASNet-A [39] | 3.3 | 1800 | 2.65
Hierarchical Evolution [23]† | 15.7 | 300 | 3.75
PNAS [22]† | 3.2 | 225 | 3.41
Path-Level-EAS [3] | 14.3 | 200 | 2.30
NAONet [26] | 128 | 200 | 2.11
EAS [2]† | 23.4 | 10 | 4.23
DARTS [24] | 3.4 | 4 | 2.83
Neuro-Cell-based Evolution [35] | 7.2 | 1 | 3.58
ENAS [27] | 4.6 | 0.45 | 2.89
NAC [18] | 10 | 0.25 | 3.33
GDAS (FRC) [9] | 2.5 | 0.17 | 2.82
Ours | 8.47 | 0.65 | 2.56
Ours (adjusted number of channels) | 54.14 | − | 2.21
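Both post-training augmentations take only a few lines each; the sketch below, with the paper's settings (a 16 × 16 cutout square and α = 1 for mixup), is a stand-in for the reference implementations cited above [8, 37], and the function names are ours.

```python
import numpy as np

def cutout(image, size=16):
    """Cutout [8]: zero a random size x size square. image: (H, W, C)."""
    h, w = image.shape[:2]
    cy, cx = np.random.randint(h), np.random.randint(w)
    out = image.copy()
    out[max(cy - size // 2, 0):cy + size // 2,
        max(cx - size // 2, 0):cx + size // 2] = 0.0
    return out

def mixup(x, y, alpha=1.0):
    """Mixup [37]: convexly combine a batch x and one-hot labels y with a
    shuffled copy of themselves, lam ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    perm = np.random.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]
```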
Fine-Tuning of the Best Neural Network. To demonstrate the scalability of the network designed by EENA, we further fine-tune the best network by adjusting the number of channels: we change the number of filters in the three evolutionary blocks to 1, 2 and 4 times the original and apply the same post-processing and post-training method described above. The result after fine-tuning on CIFAR-10 is also presented in Table 1; a better performance (2.21% test error) can be achieved simply by widening the discovered network.

The Effect of Mutation and Crossover Operations. We conduct several experiments and statistically analyze the changes in fitness of an individual after the mutation and crossover operations. We collect the results of 310 random individuals for mutation, count the numbers of offspring with better or worse fitness separately, and roughly estimate the effect of this operation from the statistics. The effect of mutation on a randomly selected individual is presented in Table 2. We also collect the results of 120 individuals with Top-2 fitness for mutation and 55 pairs of individuals with Top-2 fitness for crossover, and count the numbers of offspring reaching Top-1 and Top-5 fitness in the population. The comparison of the effect of mutation and crossover is presented in Table 3. We can see that mutation and crossover are both effective, but the crossover operation is more likely to produce individuals with high (Top-1 or Top-5) fitness.

Table 2. The effect of mutation on a randomly selected individual: fractions of mutations after which the fitness becomes better, becomes worse, or is unchanged.

Better | Worse | No Change
118/310 (38.06%) | 182/310 (58.71%) | 10/310 (3.23%)

Table 3. Comparison of the effect of mutation and crossover: fractions of offspring of Top-2-fitness individuals that reach the Top-1 (second column) or Top-5 (third column) fitness of the population.

Operation | Top1 | Top5
Mutation (Top2) | 28/120 (23.33%) | 79/120 (65.83%)
Crossover (Top2) | 18/55 (32.73%) | 42/55 (76.36%)

Comparison to Search Without Crossover. Unlike the random search performed by mutation operations alone, crossover acts as a heuristic that makes the exploration directional. To further verify the effect of the crossover operation, we conduct another experiment removing the crossover operation from the search process, with all other configurations unchanged. We run this experiment 5 times with the same 0.65 GPU-day budget and obtain a mean classification error of 3.44% and a best classification error of 2.96%, worse than the original experiment with crossover. Combining this result with the statistical analysis of the mutation and crossover operations above, we confirm that the crossover operation is indeed effective.
Transfer of the Best Architecture Searched on CIFAR-10 to CIFAR-100. We further transfer the best architecture (the one with the highest fitness) searched on CIFAR-10 to CIFAR-100, where the results are remarkable as well. For CIFAR-100, several hyper-parameters are modified: block_size = 3 for the first DropBlock layer, block_size = 2 for the second, and a cutout size of 8 × 8. The comparison against state-of-the-art recognition results on CIFAR-100 is presented in Table 4.

Table 4. Comparison against state-of-the-art recognition results on CIFAR-100. The first block lists human-designed architectures; the second block lists automatically designed architectures; the last block shows the performance of transferring the best architecture discovered on CIFAR-10 to CIFAR-100.

Method | Params (Mil.) | Search Time (GPU-days) | Test Error (%)
DenseNet-BC [17] | 25.6 | − | 17.18
ResNeXt + Shake-Shake [10] | 26.2 | − | 15.20
AmoebaNet-B [28] | 34.9 | 3150 | 15.80
Large-scale Evolution [29] | 40.4 | 2600 | 23.70
NASNet-A [39] | 50.9 | 1800 | 16.03
PNAS [22] | 3.2 | 225 | 17.63
NAONet [26] | 128 | 200 | 14.75
Neuro-Cell-based Evolution [35] | 5.3 | 1 | 21.74
GDAS (FRC) [9] | 2.5 | 0.17 | 18.13
Ours (transferred from CIFAR-10) | 8.49 | − | 17.71

5. Conclusions and Ongoing Work

We design an efficient method for neural architecture search based on evolution guided by the experience gained in prior learning. This method takes repeatable CNN blocks (cells) as the basic units of evolution and achieves state-of-the-art accuracy on CIFAR-10 and other datasets with few parameters and little search time. We notice that the initial model and the basic operations strongly affect search speed and final accuracy. Therefore, we are experimenting with adding several effective blocks, such as the SE block [16], as mutation operations, combined with other potentially effective methods such as macro-search [15].

Appendix

The best architecture of CNN cells discovered by EENA is plotted in Fig. 5.

[Figure 5: The best architecture discovered by EENA; 64, 128 and 256 are the numbers of filters per convolutional layer.]

References

[1] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Reinforcement learning for architecture search by network transformation. CoRR, abs/1707.04873, 2017.
[2] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 2787-2794, 2018.
[3] H. Cai, J. Yang, W. Zhang, S. Han, and Y. Yu. Path-level network transformation for efficient architecture search. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pages 677-686, 2018.
[4] H. Cai, L. Zhu, and S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. CoRR, abs/1812.00332, 2018.
[5] T. Chen, I. J. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. In 4th International Conference on Learning Representations (ICLR 2016), 2016.
[6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 1800-1807, 2017.
[7] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018.
[8] T. Devries and G. W. Taylor. Improved regularization of convolutional neural networks with Cutout. CoRR, abs/1708.04552, 2017.
[9] X. Dong and Y. Yang. Searching for a robust neural architecture in four GPU hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1761-1770, 2019.
[10] X. Gastaldi. Shake-Shake regularization. CoRR, abs/1705.07485, 2017.
[11] G. Ghiasi, T. Lin, and Q. V. Le. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 10750-10760, 2018.
[12] D. E. Goldberg and K. Deb. A comparative analysis of selection schemes used in genetic algorithms. In Proceedings of the First Workshop on Foundations of Genetic Algorithms, pages 69-93, 1990.
[13] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 6307-6315, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pages 770-778, 2016.
[15] H. Hu, J. Langford, R. Caruana, E. Horvitz, and D. Dey. Macro neural architecture search revisited. 2018.
[16] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), pages 7132-7141, 2018.
[17] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 2261-2269, 2017.
[18] P. Kamath, A. Singh, and D. Dutta. Neural architecture construction using EnvelopeNets. CoRR, abs/1803.06744, 2018.
[19] K. Kandasamy, W. Neiswanger, J. Schneider, B. Póczos, and E. P. Xing. Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 2020-2029, 2018.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012), pages 1106-1114, 2012.
[21] M. Lin, Q. Chen, and S. Yan. Network in network. In 2nd International Conference on Learning Representations (ICLR 2014), 2014.
[22] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In Computer Vision - ECCV 2018, Part I, pages 19-35, 2018.
[23] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
[24] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. CoRR, abs/1806.09055, 2018.
[25] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations (ICLR 2017), 2017.
[26] R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 7827-7838, 2018.
[27] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pages 4092-4101, 2018.
[28] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.
[29] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pages 2902-2911, 2017.
[30] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.
[31] K. O. Stanley, D. B. D'Ambrosio, and J. Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185-212, 2009.
[32] K. O. Stanley and R. Miikkulainen. Efficient reinforcement learning through evolving neural network topologies. In GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pages 569-577, 2002.
[33] M. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), pages 6105-6114, 2019.
[34] T. Wei, C. Wang, Y. Rui, and C. W. Chen. Network morphism. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pages 564-572, 2016.
[35] M. Wistuba. Deep learning architecture search by neuro-cell-based evolution with function-preserving mutations. In Machine Learning and Knowledge Discovery in Databases - ECML PKDD 2018, Part II, pages 243-258, 2018.
[36] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 5987-5995, 2017.
[37] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
[38] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations (ICLR 2017), 2017.
[39] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), pages 8697-8710, 2018.
