AutoGrow: Automatic Layer Growing in Deep Convolutional Networks
Authors: Wei Wen (wei.wen@alumni.duke.edu), Duke University; Feng Yan (fyan@unr.edu), University of Nevada, Reno; Yiran Chen (yiran.chen@duke.edu), Duke University; Hai Li (hai.li@duke.edu), Duke University
ABSTRACT
Depth is a key component of Deep Neural Networks (DNNs); however, designing depth is heuristic and requires much human effort. We propose AutoGrow to automate depth discovery in DNNs: starting from a shallow seed architecture, AutoGrow grows new layers if the growth improves the accuracy; otherwise, it stops growing and thus discovers the depth. We propose robust growing and stopping policies that generalize to different network architectures and datasets. Our experiments show that, by applying the same policy to different network architectures, AutoGrow can always discover near-optimal depth on various datasets: MNIST, FashionMNIST, SVHN, CIFAR10, CIFAR100 and ImageNet. For example, in terms of the accuracy-computation trade-off, AutoGrow discovers a better depth combination in ResNets than human experts. AutoGrow is efficient: it discovers depth within a time similar to training a single DNN. Our code is available at https://github.com/wenwei202/autogrow.

KEYWORDS
automated machine learning; neural architecture search; depth; growing; neural networks; automation

1 INTRODUCTION
Layer depth is one of the decisive factors of the success of Deep Neural Networks (DNNs). For example, image classification accuracy keeps improving as the depth of network models grows [12, 15, 17, 32, 35]. Although shallow networks cannot ensure high accuracy, DNNs composed of too many layers may suffer from over-fitting and convergence difficulty in training. How to obtain the optimal depth for a DNN still remains mysterious. For instance, ResNet-152 [12] uses 3, 8, 36 and 3 residual blocks under output sizes of 56×56, 28×28, 14×14 and 7×7, respectively, which do not show an obvious quantitative relation. In practice, people usually rely on heuristic trials and tests to obtain the depth of a network: they first design a DNN with a specific depth, then train and evaluate the network on a given dataset, and finally change the depth and repeat the procedure until the accuracy meets the requirement. Besides the high computational cost induced by this iterative process, such trial-and-error iterations must be repeated whenever the dataset changes. In this paper, we propose AutoGrow, which automates depth discovery given a layer architecture. We will show that AutoGrow generalizes to different datasets and layer architectures.

There are some previous works which add or morph layers to increase the depth of DNNs. VggNet [32] and DropIn [33] added new layers into shallower DNNs; Network Morphism [5, 36, 37] morphed each layer into multiple layers to increase the depth while preserving the function of the shallower net. Table 1 summarizes the differences from this work. Their goal was to overcome the difficulty of training deeper DNNs or to accelerate it; our goal is to automatically find an optimal depth. Moreover, previous works applied layer growth once or a few times at pre-defined locations to grow a pre-defined number of layers; in contrast, ours automatically learns the number of new layers and the growth locations without limiting the number of growing times. We summarize more related works in Section 4.

Figure 1 illustrates an example of AutoGrow.
It starts from the shallowest backbone network and gradually grows sub-modules (a sub-module can be one or more layers, e.g., a residual block); the growth stops once a stopping policy is satisfied. We studied multiple initializers of new layers and multiple growing policies, and surprisingly find that: (1) a random initializer works equally well or better than the complicated Network Morphism; (2) it is more effective to grow before a shallow net converges. We hypothesize that this is because a converged shallow net is an inadequate initialization for training a deeper net, while random initialization can help to escape from a bad starting point. Motivated by this, we intentionally avoid full convergence during growing by using (1) random initialization of new layers, (2) a constant large learning rate, and (3) a short growing interval.

Figure 1: A simple example of AutoGrow: starting from a seed, sub-networks grow new sub-modules (via an initializer) over epochs until growing is stopped, yielding the discovered network.

Table 1: Comparison with previous works about layer growth.
              Previous works   Ours
Goal          Ease training    Depth automation
Times         Once or a few    Unlimited
Locations     Human defined    Learned
Layer #       Human defined    Learned

Our contributions are:
• We propose AutoGrow to automate DNN layer growing and depth discovery. AutoGrow is very robust: with the same hyper-parameters, it adapts network depth to various datasets including MNIST, FashionMNIST, SVHN, CIFAR10, CIFAR100 and ImageNet. Moreover, AutoGrow can also discover shallower DNNs when the dataset is a subset.
• AutoGrow demonstrates high efficiency and scales up to ImageNet, because the layer growing is as fast as training a single DNN. On ImageNet, it discovers new ResNets with a better trade-off between accuracy and computation complexity.
• We challenge the idea of Network Morphism, as random initialization works equally well or better when growing layers.
• We find that it is beneficial to rapidly grow layers before a shallower net converges, contradicting previous intuition.

2 AUTOGROW – A DEPTH GROWING ALGORITHM
Figure 1 gives an overview of the proposed AutoGrow. In this paper, we use network, sub-networks, sub-modules and layers to describe the architecture hierarchy. A network is composed of a cascade of sub-networks. A sub-network is composed of sub-modules, which typically share the same output size. A sub-module (e.g., a residual block) is an elementary growing block composed of one or a few layers. In this section, we formulate a generic version of AutoGrow, which is materialized in the following subsections. A deep convolutional network g(X_0) is a cascade of sub-networks obtained by composing functions as
$$g(X_0) = l\big(f_{M-1}(f_{M-2}(\cdots f_1(f_0(X_0))\cdots))\big),$$
where X_0 is an input image, M is the number of sub-networks, l(·) is a loss function, and X_{i+1} = f_i(X_i) is a sub-network that operates on an input image or a feature tensor X_i ∈ R^{c_i×h_i×w_i}. Here, c_i is the number of channels, and h_i and w_i are spatial dimensions. f_i(X_i) is a simplified notation of f_i(X_i; W_i), where W_i is the set of sub-modules' parameters within the i-th sub-network. Thus W = {W_i : i = 0 ... M−1} denotes the whole set of parameters in the DNN. To facilitate growing, the following properties are supported within a sub-network: (1) the first sub-module usually reduces the size of input feature maps,
e.g., using pooling or convolution with a stride; and (2) all sub-modules in a sub-network maintain the same output size. As such, our framework can support popular networks, including VggNet-like plain networks [32], GoogLeNet [35], ResNets [12] and DenseNets [15]. In this paper, we select ResNets and VggNet-like nets as representatives of DNNs with and without shortcuts, respectively.

With the above notation, Algorithm 1 describes the AutoGrow algorithm. In brief, AutoGrow starts with the shallowest net, where every sub-network has only one sub-module for spatial dimension reduction. AutoGrow loops over all growing sub-networks in order. For each sub-network, AutoGrow stacks a new sub-module. When the new sub-module does not improve the accuracy, the growth in the corresponding sub-network is permanently stopped.

Algorithm 1: AutoGrow Algorithm.
Input: a seed shallow network g(X_0) composed of M sub-networks F = {f_i(·; W_i) : i = 0 ... M−1}, where each sub-network has only one sub-module (a dimension-reduction sub-module); an epoch interval K to check the growing and stopping policies; the number of fine-tuning epochs N after growing.
Initialization: a circular linked list of sub-networks under growing:
    subNetList = f_0(·; W_0) → ··· → f_{M−1}(·; W_{M−1}) (circular);
the current growing sub-network: growingSub = subNetList.head() = f_0(·; W_0);
the last grown sub-network: grownSub = None.
Process:
    while subNetList.size() > 0 do          # while there exist growing sub-network(s)
        train(g(X_0), K)                    # train the whole network g(X_0) for K epochs
        if meetStoppingPolicy() then
            subNetList.delete(grownSub)     # remove a sub-network from the growing list
        end if
        if meetGrowingPolicy() and subNetList.size() > 0 then
            # the current growing sub-network is growingSub == f_i(·; W_i)
            W_i = W_i ∪ W_new               # stack a new sub-module W_new on top of f_i(·; W_i)
            initializer(W_new)              # initialize the new sub-module W_new
            grownSub = growingSub
            growingSub = subNetList.next(growingSub)
        end if
    end while
    Fine-tune the discovered network g(X_0) for N epochs.
Output: a trained neural network g(X_0) with learned depth.

The details of our method are materialized in the following subsections.

2.1 Seed Shallow Networks and Sub-modules
In this paper, on all datasets except ImageNet, we explore growing depth for four types of DNNs:
• Basic3ResNet: the same ResNet used for CIFAR10 in [12], which has 3 residual sub-networks with output spatial sizes of 32×32, 16×16 and 8×8, respectively;
• Basic4ResNet: a variant of the ResNet used for ImageNet in [12] built from basic residual blocks (each of which contains two convolutions and one shortcut). There are 4 sub-networks with output spatial sizes of 32×32, 16×16, 8×8 and 4×4, respectively;
• Plain3Net: a VggNet-like plain net obtained by removing the shortcuts in Basic3ResNet;
• Plain4Net: a VggNet-like plain net obtained by removing the shortcuts in Basic4ResNet.
In AutoGrow, the architectures of the seed shallow networks and sub-modules are pre-defined. In plain DNNs, a sub-module is a stack of convolution, Batch Normalization and ReLU; in residual DNNs, a sub-module is a residual block. In AutoGrow, a sub-network is a stack of all sub-modules with the same output spatial size. Unlike [12], which manually designed the depth, AutoGrow starts from a seed architecture in which each sub-network has only one sub-module and automatically learns the number of sub-modules.
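To make the architecture hierarchy and the growth step of Algorithm 1 concrete, below is a minimal PyTorch-style sketch. The names BasicBlock, SubNetwork and grow_one_block are our own illustrative assumptions, not the authors' released implementation, and the dimension-reduction sub-module is kept implicit.

```python
# Sketch: a sub-network as a stack of residual sub-modules that can grow by one block.
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """A residual sub-module: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)  # the "last BN" touched by the initializers in Section 2.2

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)


class SubNetwork(nn.Module):
    """A sub-network: a stack of sub-modules sharing one output spatial size."""
    def __init__(self, channels):
        super().__init__()
        # The seed has a single sub-module per sub-network.
        self.blocks = nn.ModuleList([BasicBlock(channels)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

    def grow_one_block(self, initializer):
        """Stack a new sub-module on top and initialize it (the growth step of Algorithm 1)."""
        new_block = BasicBlock(self.blocks[-1].bn2.num_features)
        initializer(new_block)   # e.g. ZeroInit or GauInit, sketched in Section 2.2
        self.blocks.append(new_block)
        return new_block
```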
On ImageNet, we apply the same backbones as in [12] for the seed architectures. A seed architecture has only one sub-module under each output spatial size. For a ResNet using basic residual blocks or bottleneck residual blocks [12], we name it Basic4ResNet or Bottleneck4ResNet, respectively. Plain4Net is again obtained by removing the shortcuts in Basic4ResNet.

2.2 Sub-module Initializers
Here we explain how to initialize a new sub-module W_new mentioned in Algorithm 1 (initializer(W_new)). Network Morphism changes the DNN architecture while preserving the loss function via special initialization of the new layers, that is,
$$g(X_0; W) = g(X_0; W \cup W_{\mathrm{new}}), \quad \forall X_0. \qquad (1)$$
A residual sub-module has a nice property: when stacking a residual block and initializing its last Batch Normalization layer to zeros, the function of the shallower net is preserved while the DNN is morphed into a deeper net. Thus, Network Morphism can be easily implemented by this zero initialization (ZeroInit).

In this work, all layers in W_new are initialized using default randomization, except for a special treatment of the last Batch Normalization layer in a residual sub-module. Besides ZeroInit, we propose a new AdamInit for Network Morphism. In AdamInit, we freeze all parameters except the last Batch Normalization layer in W_new, and then use the Adam optimizer [16] to optimize that layer for at most 10 epochs, until the training accuracy of the deeper net is as good as that of the shallower one. After AdamInit, all parameters are jointly optimized. We view AdamInit as a Network Morphism because the training loss is similar after AdamInit. We empirically find that AdamInit can usually find a solution in fewer than 3 epochs. We also study random initialization of the last Batch Normalization layer using uniform (UniInit) or Gaussian (GauInit) noise with a standard deviation of 1.0. We will show that GauInit obtains the best results, challenging the idea of Network Morphism [5, 36, 37].
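As a hedged sketch of the non-Adam initializers, assuming the BasicBlock sub-module above with bn2 as its last Batch Normalization layer (the function names and the uniform range are our assumptions, not the paper's code):

```python
# Sketch: ZeroInit / UniInit / GauInit applied to the last BN of a new residual block.
import torch.nn as nn


def zero_init(block: nn.Module) -> None:
    """Network Morphism (ZeroInit): zero the last BN scale so the new block acts as identity.
    The BN bias is zero by default, so the residual branch contributes nothing at first."""
    nn.init.zeros_(block.bn2.weight)


def uni_init(block: nn.Module) -> None:
    """UniInit: uniform noise on the last BN scale (the [-1, 1] range is an assumption)."""
    nn.init.uniform_(block.bn2.weight, -1.0, 1.0)


def gau_init(block: nn.Module) -> None:
    """GauInit: Gaussian noise with standard deviation 1.0 on the last BN scale."""
    nn.init.normal_(block.bn2.weight, mean=0.0, std=1.0)


# Example usage with the SubNetwork sketch above:
#   sub_network.grow_one_block(initializer=gau_init)
```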
2.3 Growing and Stopping Policies
In Algorithm 1, a growing policy refers to meetGrowingPolicy(), which returns true when the network should grow a sub-module. Two growing policies are studied here:
(1) Convergent Growth: meetGrowingPolicy() returns true when the improvement of validation accuracy is less than τ in the last K epochs. That is, under Convergent Growth, AutoGrow only grows when the current network has converged. A similar growing criterion was adopted in previous works [3, 4, 9].
(2) Periodic Growth: meetGrowingPolicy() always returns true, that is, the network always grows every K epochs; therefore K is also the growing period. In the best practice of AutoGrow, K is small (e.g., K = 3) so that it grows before the current network converges.
Our experiments will show that Periodic Growth outperforms Convergent Growth. We hypothesize that a fully converged shallower net is an inadequate initialization for training a deeper net. We will perform experiments to test this hypothesis and visualize the optimization trajectory to illustrate it.

A stopping policy denotes meetStoppingPolicy() in Algorithm 1. When Convergent Growth is adopted, meetStoppingPolicy() returns true if a recent growth does not improve the validation accuracy by more than τ within K epochs. We use a similar stopping policy for Periodic Growth; however, as it can grow rapidly with a small period K (e.g., K = 3) before convergence, we use a larger window size J for stopping. Specifically, when Periodic Growth is adopted, meetStoppingPolicy() returns true when the validation accuracy improves by less than τ in the last J epochs, where J ≫ K.

The hyper-parameters τ, J and K control the operation of AutoGrow; they are easy to set up and generalize well. τ denotes the significance of accuracy improvement for classification; we simply set τ = 0.05% in all experiments. J represents how many epochs to wait for an accuracy improvement before stopping the growth of a sub-network. It is more meaningful to consider stopping once the new net has been trained to some extent; as such, we set J to the number of epochs T under the largest learning rate when training a baseline. K determines how frequently AutoGrow checks the policies. In Convergent Growth, we simply set K = T, which is long enough to ensure convergence. In Periodic Growth, K is set to a small fraction of T to enable fast growth before convergence; more importantly, K = 3 is very robust across all networks and datasets. Therefore, all these hyper-parameters are very robust and directly tied to design considerations.
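A minimal sketch of these policy checks follows. It assumes a per-epoch history of validation accuracies is kept; the exact window bookkeeping and the helper names are our assumptions for illustration, not the authors' implementation.

```python
# Sketch of the growing/stopping checks from Section 2.3.
from typing import List

TAU = 0.05   # tau: significance of accuracy improvement, in percentage points
K = 3        # epochs between policy checks (also the growing period in Periodic Growth)
J = 100      # stopping window, set to the epochs T under the largest baseline learning rate
             # (e.g. 100 for the CIFAR baselines, whose first decay is at epoch 100)


def improvement_over_window(val_history: List[float], window: int) -> float:
    """Best accuracy in the last `window` epochs minus the best accuracy before that window."""
    if len(val_history) <= window:
        return float("inf")   # too early to judge
    return max(val_history[-window:]) - max(val_history[:-window])


def meet_growing_policy_periodic() -> bool:
    """Periodic Growth: always grow at every K-epoch check."""
    return True


def meet_growing_policy_convergent(val_history: List[float]) -> bool:
    """Convergent Growth: grow only once accuracy has plateaued over the last K epochs."""
    return improvement_over_window(val_history, K) < TAU


def meet_stopping_policy_periodic(val_history: List[float]) -> bool:
    """Stop a sub-network when accuracy improved by less than tau in the last J epochs."""
    return improvement_over_window(val_history, J) < TAU
```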
3 EXPERIMENTS
In this paper, we use Basic3ResNet-2-3-2, for instance, to denote a model architecture which contains 2, 3 and 2 sub-modules in the first, second and third sub-networks, respectively. Sometimes we simplify it to 2-3-2 for convenience. AutoGrow always starts from the shallowest depth of 1-1-1 and uses the maximum validation accuracy as the metric to guide growing and stopping. All DNN baselines are trained by SGD with momentum 0.9 using a staircase learning rate. The initial learning rate is 0.1 in ResNets and 0.01 in plain networks. On ImageNet, baselines are trained with batch size 256 for 90 epochs, within which the learning rate is decayed by 0.1× at epochs 30 and 60. On all other, smaller datasets, baselines are trained with batch size 128 for 200 epochs, and the learning rate is decayed by 0.1× at epochs 100 and 150.

Table 2: Network Morphism tested on CIFAR10. Δ = (accuracy of Network Morphism) − (accuracy of training the deeper net from scratch).
net backbone    shallower   deeper     initializer   accu %   Δ
Basic3ResNet    3-3-3       5-5-5      ZeroInit      92.71    -0.77
Basic3ResNet    3-3-3       5-5-5      AdamInit      92.82    -0.66
Basic3ResNet    5-5-5       9-9-9      ZeroInit      93.64    -0.27
Basic3ResNet    5-5-5       9-9-9      AdamInit      93.53    -0.38
Basic4ResNet    1-1-1-1     2-2-2-2    ZeroInit      94.96    -0.37
Basic4ResNet    1-1-1-1     2-2-2-2    AdamInit      95.17    -0.16

Table 3: Ablation study of c-AutoGrow. Found nets are Basic3ResNet; Δ = (accuracy of c-AutoGrow) − (accuracy of training the found net from scratch).
dataset    learning rate   initializer   found net   accu %   Δ
CIFAR10    staircase       ZeroInit      2-3-6       91.77    -1.06
CIFAR10    staircase       AdamInit      3-4-3       92.21    -0.59
CIFAR10    constant        ZeroInit      2-2-4       92.23     0.16
CIFAR10    constant        AdamInit      3-4-4       92.60    -0.41
CIFAR10    constant        UniInit       3-4-4       92.93    -0.08
CIFAR10    constant        GauInit       2-4-3       93.12     0.55
CIFAR100   staircase       ZeroInit      4-3-4       70.04    -0.65
CIFAR100   staircase       AdamInit      3-3-3       69.85    -0.65
CIFAR100   constant        ZeroInit      3-2-4       70.22     0.35
CIFAR100   constant        AdamInit      3-3-3       70.00    -0.50
CIFAR100   constant        UniInit       4-4-3       70.39     0.36
CIFAR100   constant        GauInit       3-4-3       70.66     0.91

Figure 2: An optimization trajectory comparison between (a) Network Morphism and (b) training from scratch. (The figure annotates the "Morphing" point, ResNet-32, and the learning rate decays.)

Our early experiments followed prior wisdom by growing layers with Network Morphism [3–5, 9, 36, 37], i.e., AutoGrow with ZeroInit (or AdamInit) and the Convergent Growth policy; however, it stopped early with very shallow DNNs, failing to find the optimal depth. We hypothesize that a converged shallow net with Network Morphism gives a bad initialization for training a deeper neural network. Section 3.1 experimentally tests that this hypothesis is valid. To tackle this issue, we intentionally avoid convergence during growing through three simple solutions, which are evaluated in Section 3.2. Finally, Sections 3.3 and 3.4 include extensive experiments to show the effectiveness of our final AutoGrow.

3.1 Suboptimum of Network Morphism and Convergent Growth
In this section, we study Network Morphism itself and its integration into our AutoGrow under Convergent Growth. When studying Network Morphism, we take the following steps: 1) train a shallower ResNet to convergence, 2) stack residual blocks on top of each sub-network to morph to a deeper net, 3) use ZeroInit or AdamInit to initialize the new layers, and 4) train the deeper net in a standard way. We compare the accuracy difference ("Δ") between Network Morphism and training the deeper net from scratch. Table 2 summarizes our results. Network Morphism has a lower accuracy (negative "Δ") in all cases, which validates our hypothesis that a converged shallow network with Network Morphism gives a bad initialization for training a deeper net.

We visualize the optimization trajectories to illustrate the hypothesis that a converged shallower net may not be an adequate initialization. Figure 2 visualizes and compares the optimization trajectories of Network Morphism and training from scratch. In this figure, the shallower net is Basic3ResNet-3-3-3 (ResNet-20) and the deeper one is Basic3ResNet-5-5-5 (ResNet-32) from Table 2. The initializer is ZeroInit. The visualization method is extended from [20]. Points on the trajectory are sampled evenly every few epochs. To maximize the variance along the trajectory, we use PCA to project from the high-dimensional parameter space to a 2D space and use the first two Principal Components (PCs) as the axes in Figure 2. The contours of the training loss function and the trajectory are visualized around the final minimum of the deeper net. When projecting a shallower net into a deeper net's space, zeros are padded for the parameters not existing in the shallower net. We must note that a loss increase along the trajectory does not truly represent the situation in the high-dimensional space, as the trajectory is just a projection: it is possible that the loss keeps decreasing in the high dimension while appearing to behave oppositely in the 2D space. The sharp detour at "Morphing" in Figure 2(a) may indicate that the shallower net converges to a point that the deeper net struggles to escape. In contrast, Figure 2(b) shows that the trajectory of direct optimization in the deeper space smoothly converges to a better minimum.

To further validate our hypothesis, we integrate Network Morphism as the initializer in AutoGrow with the Convergent Growth policy. We refer to this version of AutoGrow as c-AutoGrow, with "c-" denoting "Convergent." More specifically, we take ZeroInit or AdamInit as the sub-module initializer and the "Convergent Growth" policy in Algorithm 1. To recap, in this setting, AutoGrow trains a shallower net until it converges, then grows a sub-module which is initialized by Network Morphism, and repeats the same process until there is no further accuracy improvement. In every interval of K training epochs (train(g(X_0), K) in Algorithm 1), a "staircase" learning rate is used: the learning rate is reset to 0.1 at the first epoch and decayed by 0.1× at epochs K/2 and 3K/4.
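As an aside, the trajectory projection described above can be sketched as follows; this is our own illustration in the spirit of [20], assuming flattened parameter snapshots (numpy arrays) were saved every few epochs, with shallower-net snapshots zero-padded to the deeper net's size.

```python
# Sketch: project saved parameter snapshots onto their first two principal components.
import numpy as np
from sklearn.decomposition import PCA


def project_trajectory(snapshots):
    dim = max(len(s) for s in snapshots)
    padded = np.stack([np.pad(s, (0, dim - len(s))) for s in snapshots])  # zero-pad shallower nets
    pca = PCA(n_components=2)
    coords = pca.fit_transform(padded)   # one (PC1, PC2) point per snapshot
    return coords, pca                   # pca can also re-project a grid for loss contours

# Example usage: coords, _ = project_trajectory(list_of_flattened_weight_vectors);
# plotting `coords` as a path reproduces a Figure 2-style trajectory.
```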
The results are shown in the "staircase" rows of Table 3, which illustrate that c-AutoGrow can grow a DNN multiple times and finally find a depth. However, there are two problems: 1) the final accuracy is lower than training the found net from scratch, as indicated by "Δ", validating our hypothesis; 2) the depth learning stops too early with a relatively shallow net, while a deeper net beyond the found depth can achieve higher accuracy, as we will show in Table 8. These problems provide circumstantial evidence for the hypothesis that a converged shallow net with Network Morphism gives a bad initialization; thus, AutoGrow cannot receive signals to continue growing after a limited number of growths. Figure 3(a) visualizes the trajectory of c-AutoGrow corresponding to row "2-3-6" in Table 3. Along the trajectory, there are many attempts to detour and escape an initialization from a shallower net.

Figure 3: Optimization trajectories of AutoGrow, tested with Basic3ResNet on CIFAR10. (a) c-AutoGrow with staircase learning rate and ZeroInit during growing; (b) c-AutoGrow with constant learning rate and GauInit during growing; (c) p-AutoGrow with K = 50; and (d) p-AutoGrow with K = 3. For better illustration, the dots on the trajectories are plotted every 4, 20, 5 and 3 epochs in (a)-(d), respectively.

Table 4: p-AutoGrow with different growing intervals K. Found nets are Basic3ResNet; numbers in parentheses are accuracy %.
K     CIFAR10 found net (accu %)    CIFAR100 found net (accu %)
50    6-5-3 (92.95)                 8-5-7 (72.07)
20    7-7-7 (93.26)                 8-11-10 (72.93)
10    19-19-19 (93.46)              18-18-18 (73.64)
5     23-22-22 (93.98)              23-23-23 (73.70)
3     42-42-42 (94.27)              54-53-53 (74.72)
1     77-76-76 (94.30)              68-68-68 (74.51)

Table 5: p-AutoGrow under different initializers with K = 3. Found nets are Basic3ResNet; numbers in parentheses are accuracy %.
initializer   CIFAR10             CIFAR100
ZeroInit      31-30-30 (93.57)    26-25-25 (73.45)
AdamInit      37-37-36 (93.79)    27-27-27 (73.92)
UniInit       28-28-28 (93.82)    41-41-41 (74.31)
GauInit       42-42-42 (94.27)    54-53-53 (74.72)

3.2 Ablation Study for AutoGrow Design
Based on the findings in Section 3.1, we propose three simple but effective solutions to further enhance AutoGrow, and refer to the result as p-AutoGrow, with "p-" denoting "Periodic":
• Use a large constant learning rate for growing, i.e., 0.1 for residual networks and 0.01 for plain networks. Stochastic gradient descent with a large learning rate intrinsically introduces noise, which helps to avoid full convergence into a bad initialization from a shallower net. Note that a staircase learning rate is still used for fine-tuning after discovering the final DNN;
• Use random initialization (UniInit or GauInit) as noise to escape from an inadequate initialization;
• Grow rapidly before a shallower net converges, by adopting Periodic Growth with a small K.
p-AutoGrow is our final AutoGrow; a compact sketch of the resulting loop is given after this list.

Figure 4: p-AutoGrow on CIFAR10 (K = 3). The seed net is Basic3ResNet-1-1-1. (The figure plots training accuracy, validation accuracy, maximum validation accuracy and the number of layers over epochs, split into Growing and Fine-tuning phases.)

In the rest of this section, we perform an ablation study to show that the three solutions are effective. We start from c-AutoGrow and incrementally add the above solutions one by one, eventually obtaining p-AutoGrow.
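Putting the three ingredients together, the loop might look like the sketch below. It is not an exact transcription of Algorithm 1's circular-list bookkeeping; train_for_epochs and fine_tune_with_staircase_lr are hypothetical training helpers, and gau_init, improvement_over_window and grow_one_block refer to the earlier sketches.

```python
# Sketch of p-AutoGrow: grow every K epochs under a constant large learning rate with
# Gaussian-initialized new blocks; stop a sub-network when its last growth brought no gain;
# finally fine-tune with a staircase schedule.
def p_autogrow(model, sub_networks, K=3, tau=0.05, J=100, lr_grow=0.1):
    val_history = []                  # one validation accuracy per epoch
    growing = list(sub_networks)      # sub-networks whose depth is still being learned
    grown = None                      # sub-network grown in the previous interval
    current = 0                       # position of the next sub-network to grow
    while growing:
        # Train the whole network for K epochs with a constant large learning rate.
        val_history += train_for_epochs(model, epochs=K, lr=lr_grow)
        # Stopping policy: the recent growth improved accuracy by less than tau within J epochs.
        if grown is not None and improvement_over_window(val_history, J) < tau:
            growing.remove(grown)     # stop this sub-network permanently
            grown = None
            if not growing:
                break
        # Periodic Growth: always stack a new, Gaussian-initialized sub-module.
        current %= len(growing)
        growing[current].grow_one_block(initializer=gau_init)
        grown = growing[current]
        current += 1
    fine_tune_with_staircase_lr(model)   # staircase learning rate after the depth is discovered
    return model
```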
In Table 3, we first replace the staircase learning rate with a constant large learning rate; the accuracy of AutoGrow improves and therefore "Δ" improves. Second, we further replace Network Morphism (ZeroInit or AdamInit) with a random initializer (UniInit or GauInit), which yields a bigger gain. Overall, combining a constant learning rate with GauInit performs best. Thus, a constant learning rate and GauInit are adopted in the remaining experiments unless explicitly specified otherwise. Figure 3(b) visualizes the trajectory corresponding to row "2-4-3" in Table 3, which is much smoother than Figure 3(a), implying the advantages of a constant large learning rate and GauInit.

Note that, in this paper, we are primarily interested in automating depth discovery to find a final DNN ("found net") with high accuracy ("accu"). Ideally, the "found net" has a minimum depth, beyond which a larger depth cannot further improve "accu". We will show in Figure 6 that AutoGrow discovers a depth approximately satisfying this property. "Δ" is a metric indicating how well shallower nets initialize deeper nets: a negative "Δ" indicates that weight initialization from a shallower net hurts the training of a deeper net, while a positive "Δ" indicates that AutoGrow helps the training of a deeper net, which is a byproduct of this work.

Table 6: p-AutoGrow with different seed architectures. Dataset is CIFAR10; nets are Basic3ResNet or Basic4ResNet.
seed net    found net        accuracy %
1-1-1       42-42-42         94.27
5-5-5       46-46-46         94.16
1-1-1-1     22-22-22-22      95.49
5-5-5-5     23-22-22-22      95.62

Finally, we apply the last solution – Periodic Growth – and obtain our final p-AutoGrow. Our ablation study results for p-AutoGrow are summarized in Table 4, which analyzes the impact of the growing period K. In general, K is a hyper-parameter trading off speed and accuracy: a smaller K takes a longer learning time but discovers a deeper net, and vice versa. Our results validate the preference for faster growth (i.e., a smaller K). On CIFAR10/CIFAR100, the accuracy reaches a plateau/peak at K = 3; further reducing K produces a deeper net while the accuracy gain is marginal/nonexistent. In the following, we simply select K = 3 for the robustness tests. More importantly, our quantitative results in Table 4 show that p-AutoGrow finds much deeper nets, overcoming the very-early-stop issue of c-AutoGrow in Table 3. That is, the Periodic Growth proposed in this work is much more effective than the Convergent Growth used in previous work.

Figures 3(c) and 3(d) visualize the trajectories of p-AutoGrow with K = 50 and K = 3. The 2D projection gives limited information for revealing the advantages of p-AutoGrow over c-AutoGrow in Figure 3(b), although the trajectory of our final p-AutoGrow in Figure 3(d) is plausibly more similar to that of training from scratch in Figure 2(b). As a sanity check, we perform an ablation study of initializers for p-AutoGrow. The results are in Table 5, which further validates our choice of GauInit. The motivation of Network Morphism in previous work was to start a deeper net from a loss function that has been well optimized by a shallower net, so as not to restart the deeper net's training from scratch [3–5, 9, 36, 37]. In all our experiments, we find that simple random initialization can achieve the same goal. Figure 4 plots the convergence curves and learning process for "42-42-42" in Table 4.
Even with GauInit, the loss and accuracy rapidly recover and no restart is observed. The convergence pattern in the "Growing" stage is similar to that in the "Fine-tuning" stage under the same learning rate (the initial learning rate of 0.1). Similar results on ImageNet are shown in Figure 8. Our results challenge the necessity of Network Morphism when growing neural networks.

Finally, we perform an ablation study on the initial depth of the seed network. Table 6 demonstrates that the shallowest DNN works as well as a deeper seed. This implies that AutoGrow can stop appropriately regardless of the depth of the seed network. As the focus of this work is depth automation, we prefer starting with the shallowest seed to avoid a manual search over seed depths.

Table 7: AutoGrow improves the accuracy of plain nets. Dataset is CIFAR10.
net                     layer #   method             accu %
Plain4Net-6-6-6-6       26        baseline           93.90
Plain4Net-6-6-6-6       26        AutoGrow (K = 30)  95.17
Basic4ResNet-3-3-3-3    26        baseline           95.33
Plain3Net-11-11-10      34        baseline           90.45
Plain3Net-11-11-10      34        AutoGrow (K = 50)  93.13
Basic3ResNet-6-6-5      36        baseline           93.60

Figure 5: Loss surfaces around minima found by baselines and AutoGrow. Dataset is CIFAR10. (Panels compare baseline vs. AutoGrow minima: Plain3Net-23-22-22 at 84.33% vs. 90.82%, Plain3Net-11-11-10 at 90.45% vs. 93.13%, and Plain4Net-17-17-17-17 at 88.48% vs. 94.20%.)

Figure 6: AutoGrow vs. manual search obtained by training many baselines from scratch. The x-axis is the number of parameters (millions); the y-axis is accuracy. Dataset is CIFAR10. (Panels: Basic3ResNet, Basic4ResNet, Plain3Net and Plain4Net; curves compare baselines trained from scratch with AutoGrow at K = 3 and K = 50.)

3.3 Adaptability of AutoGrow
To verify the adaptability of AutoGrow, we use an identical configuration (p-AutoGrow with K = 3) and test it over 5 datasets and 4 seed architectures. Table 8 includes the results for all 20 combinations. Figure 6 compares AutoGrow with a manual search obtained by training many DNNs with different depths from scratch. The results lead to the following conclusions and contributions:
(1) In Table 8, AutoGrow discovers layer depth across all scenarios without any tuning, achieving the main goal of this work.
Manual design needs m·n·k trials, where m and n are, respectively, the numbers of datasets and sub-module categories, and k is the number of trials per dataset per sub-module category.
(2) For ResNets, a discovered depth (the AutoGrow markers in Figure 6) falls at the location where accuracy saturates. This means AutoGrow discovers a near-optimal depth: a shallower depth would lose accuracy while a deeper one gains little. The final accuracy of AutoGrow is as good as training the discovered net from scratch, as indicated by "Δ" in Table 8, showing that initialization from shallower nets does not hurt the training of deeper nets. As a byproduct, in plain networks there are large positive "Δ"s in Table 8. This implies that the baselines fail to train very deep plain networks even with Batch Normalization, but AutoGrow enables the training of these networks; Table 7 shows the accuracy improvement of plain networks obtained by tuning K, approaching the accuracy of ResNets with the same depth. Figure 5 visualizes the loss surfaces around minima found by AutoGrow and the baselines; intuitively, AutoGrow finds wider or deeper minima with less chaotic landscapes.
(3) For robustness and generalization study purposes, we stick to K = 3 in our experiments; however, K can be tuned to trade off accuracy and model size. As shown in Figure 6, AutoGrow discovers smaller DNNs when increasing K from 3 to 50. Interestingly, the accuracy of plain networks even increases at K = 50. This implies the possibility of discovering a better accuracy-depth trade-off by tuning K, although we stick to K = 3 for the generalizability study, and it generalizes well.
(4) In Table 8, AutoGrow discovers different depths under different sub-modules. The final accuracy is limited by the sub-module design, not by AutoGrow. Given a sub-module architecture, AutoGrow can always find a near-optimal depth.

Table 8: The adaptability of AutoGrow to datasets. Δ = (accuracy of AutoGrow) − (accuracy of training the found net from scratch).
net            dataset        found net       accu %   Δ
Basic3ResNet   CIFAR10        42-42-42        94.27    -0.03
Basic3ResNet   CIFAR100       54-53-53        74.72    -0.95
Basic3ResNet   SVHN           34-34-34        97.22     0.04
Basic3ResNet   FashionMNIST   30-29-29        94.57    -0.06
Basic3ResNet   MNIST          33-33-33        99.64    -0.03
Basic4ResNet   CIFAR10        22-22-22-22     95.49    -0.10
Basic4ResNet   CIFAR100       17-51-16-16     79.47     1.22
Basic4ResNet   SVHN           20-20-19-19     97.32    -0.08
Basic4ResNet   FashionMNIST   27-27-27-26     94.62    -0.17
Basic4ResNet   MNIST          11-10-10-10     99.66     0.01
Plain3Net      CIFAR10        23-22-22        90.82     6.49
Plain3Net      CIFAR100       28-28-27        66.34    31.53
Plain3Net      SVHN           36-35-35        96.79    77.20
Plain3Net      FashionMNIST   17-17-17        94.49     0.56
Plain3Net      MNIST          20-20-20        99.66     0.12
Plain4Net      CIFAR10        17-17-17-17     94.20     5.72
Plain4Net      CIFAR100       16-15-15-15     73.91    29.34
Plain4Net      SVHN           12-12-12-11     97.08     0.32
Plain4Net      FashionMNIST   13-13-13-13     94.47     0.72
Plain4Net      MNIST          13-12-12-12     99.57     0.03

Table 9: The adaptability of AutoGrow to dataset sizes.
setup                       dataset size   found net       accu %
Basic3ResNet on CIFAR10     100%           42-42-42        94.27
                            75%            32-31-31        93.54
                            50%            17-17-17        91.34
                            25%            21-12-7         88.18
Basic4ResNet on CIFAR100    100%           17-51-16-16     79.47
                            75%            17-17-16-16     77.26
                            50%            12-12-12-11     72.91
                            25%            6-6-6-6         62.53
Plain3Net on MNIST          100%           20-20-20        99.66
                            75%            12-12-12        99.54
                            50%            12-11-11        99.46
                            25%            10-9-9          99.33
Plain4Net on SVHN           100%           12-12-12-11     97.08
                            75%            9-9-9-9         96.71
                            50%            8-8-8-8         96.37
                            25%            5-5-5-5         95.68

Finally, our supposition is that when the size of the dataset is smaller, the optimal depth should be smaller.
Under this supposition, we test the effectiveness of AutoGrow by sampling a subset of a dataset and verifying whether AutoGrow can discover a shallower depth. Table 9 summarizes the results. In each set of experiments, the dataset is randomly down-sampled to 100%, 75%, 50% and 25%. For a fair comparison, K is divided by the down-sampling fraction such that the number of mini-batches between growths remains the same. As expected, our experiments show that AutoGrow adapts to shallower networks when the datasets are smaller.

3.4 Scaling to ImageNet and Efficiency
On ImageNet, K = 3 should generalize well, but we explore AutoGrow with K = 2 and K = 5 to obtain an accuracy-depth trade-off line for comparison with human experts. The larger K = 5 enables AutoGrow to obtain a smaller DNN for trading off accuracy against model size (computation), while the smaller K = 2 achieves higher accuracy. The results are shown in Table 10, which confirms that AutoGrow automatically finds a good depth without any tuning. As a byproduct, the accuracy is even higher than training the found net from scratch, indicating that the Periodic Growth in AutoGrow helps train deeper nets. The comparison of AutoGrow with manual depth design [12] is shown in Figure 7, which shows that AutoGrow achieves a better trade-off between accuracy and computation (measured in floating-point operations).

Table 10: Scaling up to ImageNet. Δ Top-1 = (Top-1 of AutoGrow) − (Top-1 of training the found net from scratch).
net                  K   found net      Top-1   Top-5   Δ Top-1
Basic4ResNet         2   12-12-11-11    76.28   92.79   0.43
Basic4ResNet         5   9-3-6-4        74.75   91.97   0.72
Bottleneck4ResNet    2   6-6-6-17       77.99   93.91   0.83
Bottleneck4ResNet    5   6-7-3-9        77.33   93.65   0.83
Plain4Net            2   6-6-6-6        71.22   90.08   0.70
Plain4Net            5   5-5-5-4        70.54   89.76   0.93

Table 11: The efficiency of AutoGrow.
net                           GPUs              growing      fine-tuning
Basic4ResNet-12-12-11-11      4× GTX 1080 Ti    56.7 hours   157.9 hours
Basic4ResNet-9-3-6-4          4× GTX 1080       47.9 hours   65.8 hours
Bottleneck4ResNet-6-6-6-17    4× TITAN V        45.3 hours   114.0 hours
Bottleneck4ResNet-6-7-3-9     4× TITAN V        61.6 hours   78.6 hours
Plain4Net-6-6-6-6             4× GTX 1080 Ti    11.7 hours   29.7 hours
Plain4Net-5-5-5-4             4× GTX 1080 Ti    25.6 hours   25.3 hours

Figure 7: AutoGrow vs. manual design [12] on ImageNet. The x-axis is GFlops and the y-axis is Top-1 accuracy %. Marker area is proportional to model size as determined by depth. "basic" ("bottleneck") refers to ResNets with basic (bottleneck) residual blocks.

Table 11 summarizes the breakdown of wall-clock time in AutoGrow. The growing/searching time is as efficient as (and often more efficient than) fine-tuning the single discovered DNN. The scalability of AutoGrow comes from two intrinsic features: (1) it grows quickly with a short period K and stops immediately if no improvement is sensed; and (2) the network is small at the beginning of growing. Figure 8 plots the growing and convergence curves for two DNNs in Table 11. Even with random initialization of the new layers, the accuracy converges stably.

Figure 8: The convergence curves and growing process on ImageNet for (a) Basic4ResNet-9-3-6-4 and (b) Plain4Net-6-6-6-6 in Table 11. (Both start from the 1-1-1-1 seed; the plots show validation accuracy, maximum validation accuracy and the number of layers over epochs, split into Growing and Fine-tuning phases.)
4 RELATED WORK
Neural Architecture Search (NAS) [42] and neural evolution [1, 22, 26, 30, 34] can search network architectures from a gigantic search space. In NAS, the depth of the DNNs in the search space is fixed, while AutoGrow learns the depth. Some NAS methods [2, 6, 23] can find DNNs with different depths; however, the maximum depth is pre-defined and shallower depths are obtained by padding zero operations or selecting shallower branches, while our AutoGrow learns the depth in an open domain to find a minimum depth beyond which no accuracy improvement can be obtained. Moreover, NAS is very computation- and memory-intensive. To accelerate NAS, one-shot models [2, 28, 31], DARTS [23] and NAS with transferable cells [21, 43] were proposed. The search time drops dramatically but is still long from a practical perspective, and it remains very challenging to deploy these methods on larger datasets such as ImageNet. In contrast, our AutoGrow scales up to ImageNet thanks to its short depth-learning time, which is as efficient as training a single DNN.

In addition to architecture search, which requires training many DNNs from scratch, there are also many studies on learning neural structures within a single training run. Structure pruning and growing were proposed for different goals, such as efficient inference [7, 8, 11, 13, 14, 18, 19, 24, 25, 27, 38–40], lifelong learning [41] and model adaptation [10, 29]. However, those works fixed the network depth and limited structure learning to the existing layers. Optimization over a DNN with fixed depth is easier, as the skeleton architecture is known. AutoGrow operates in a scenario where the DNN depth is unknown, so it must seek the optimal depth.

5 CONCLUSION
In this paper, we propose a simple but effective layer growing algorithm (AutoGrow) to automate the depth design of deep neural networks. We empirically show that AutoGrow adapts to different datasets and different layer architectures without tuning hyper-parameters, and can significantly reduce the human effort spent on searching layer depth. We surprisingly find that rapid growing (under a large constant learning rate with random initialization of new layers) outperforms more intuitively-correct growing methods, such as Network Morphism growing after a shallower net has converged. We believe our initial results can inspire future research on the structure growth of neural networks and related theory.

Acknowledgments. This work was supported in part by NSF CCF-1725456, NSF 1937435, NSF 1822085, NSF CCF-1756013, NSF IIS-1838024 and DOE DE-SC0018064. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF, DOE or its contractors. We also thank Dr. Yandan Wang (yandanw@unr.edu) at the University of Nevada, Reno for valuable feedback on this work.

REFERENCES
[1] Peter J Angeline, Gregory M Saunders, and Jordan B Pollack. 1994. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks 5, 1 (1994), 54–65.
[2] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. 2018. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning. 549–558.
[3] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Efficient architecture search by network transformation. In Thirty-Second AAAI Conference on Artificial Intelligence.
[4] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. 2018. Path-level network transformation for efficient architecture search. arXiv preprint arXiv:1806.02639 (2018).
[5] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. 2015. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641 (2015).
[6] Corinna Cortes, Xavier Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. 2017. AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 874–883.
[7] Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. 2017. NeST: A neural network synthesis tool based on a grow-and-prune paradigm. arXiv preprint arXiv:1711.02017 (2017).
[8] Xiaocong Du, Zheng Li, and Yu Cao. 2019. CGaP: Continuous Growth and Pruning for Efficient Deep Learning. arXiv preprint arXiv:1905.11533 (2019).
[9] Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. 2017. Simple and Efficient Architecture Search for CNNs. In Workshop on Meta-Learning (MetaLearn 2017) at NIPS.
[10] Jiashi Feng and Trevor Darrell. 2015. Learning the structure of deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 2749–2757.
[11] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. 2018. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1586–1595.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[13] Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1389–1397.
[14] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q Weinberger. 2018. CondenseNet: An efficient DenseNet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2752–2761.
[15] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[16] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[18] Vadim Lebedev and Victor Lempitsky. 2016. Fast ConvNets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2554–2564.
[19] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016. Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710 (2016).
[20] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. 2018. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems. 6391–6401.
[21] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. 2018. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV). 19–34.
[22] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. 2017. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436 (2017).
[23] Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
[24] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision. 2736–2744.
[25] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision. 5058–5066.
[26] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Daniel Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, et al. 2019. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing. Elsevier, 293–312.
[27] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. 2017. Faster CNNs with direct sparse convolutions and guided pruning. In International Conference on Learning Representations (ICLR).
[28] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. arXiv preprint (2018).
[29] George Philipp and Jaime G Carbonell. 2017. Nonparametric neural networks. arXiv preprint arXiv:1712.05440 (2017).
[30] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. 2017. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2902–2911.
[31] Shreyas Saxena and Jakob Verbeek. 2016. Convolutional neural fabrics. In Advances in Neural Information Processing Systems. 4053–4061.
[32] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[33] Leslie N Smith, Emily M Hand, and Timothy Doster. 2016. Gradual DropIn of layers to train very deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4763–4771.
[34] Kenneth O Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation 10, 2 (2002), 99–127.
[35] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[36] Tao Wei, Changhu Wang, and Chang Wen Chen. 2017. Modularized morphing of neural networks. arXiv preprint arXiv:1701.03281 (2017).
[37] Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. 2016. Network morphism. In International Conference on Machine Learning. 564–572.
[38] Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, and Hai Li. 2018. Learning intrinsic sparse structures within long short-term memory. In International Conference on Learning Representations (ICLR).
[39] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems. 2074–2082.
[40] Huanrui Yang, Wei Wen, and Hai Li. 2020. DeepHoyer: Learning sparser neural networks with differentiable scale-invariant sparsity measures. In International Conference on Learning Representations (ICLR).
[41] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. 2017. Lifelong learning with dynamically expandable networks. arXiv preprint (2017).
[42] Barret Zoph and Quoc V Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016).
[43] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8697–8710.