A scalable algorithm for the optimization of neural network architectures

Massimiliano Lupo Pasini (a, *), Junqi Yin (b), Ying Wai Li (c), Markus Eisenbach (b)

(a) Oak Ridge National Laboratory, Computational Sciences and Engineering Division, 1 Bethel Valley Road, Oak Ridge, TN, USA, 37831
(b) Oak Ridge National Laboratory, National Center for Computational Sciences, 1 Bethel Valley Road, Oak Ridge, TN, USA, 37831
(c) Los Alamos National Laboratory, Computer, Computational, and Statistical Sciences Division, Los Alamos, NM, 87545, USA

Abstract

We propose a new scalable method to optimize the architecture of an artificial neural network. The proposed algorithm, called Greedy Search for Neural Network Architecture, aims to determine a neural network with a minimal number of layers that is at least as performant, in terms of accuracy and computational cost, as neural networks of the same structure identified by other hyperparameter search algorithms. Numerical results on benchmark datasets show that, for these datasets, our method outperforms state-of-the-art hyperparameter optimization algorithms both in the predictive performance attainable by the selected neural network architecture and in the time-to-solution for the hyperparameter optimization to complete.

Keywords: deep learning, hyperparameter optimization, neural network architecture, random search, greedy constructive algorithms, adaptive algorithms

This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes.
DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

* Corresponding author. Email address: lupopasinim@ornl.gov (Massimiliano Lupo Pasini)

Introduction

Deep neural networks (NN) are nonlinear models used to approximate unknown functions based on observational data [1, 2, 3, 4]. Their broad applicability derives from their complex structure, which allows these techniques to reconstruct complex relations between quantities selected as inputs and outputs of the model [5]. From a mathematical perspective, a NN is a directed acyclic graph where the nodes (also called neurons) are organized in layers. The type of connectivity between different layers is essential for the NN to model complex dynamics between inputs and outputs. The structure, or architecture, of the graph is mainly summarized by the number of layers in the graph, the number of nodes at each layer, and the connectivity between nodes of adjacent layers.

The performance of a NN is very sensitive to the choice of the architecture for multiple reasons. Firstly, the architecture strongly impacts the prediction computed by a NN; indeed, NNs with different structures may produce different outputs for the same input. On the one hand, structures that are too simple may not be articulate enough to reproduce complex relations. This may result in underfitting the data, with high bias and low variance in the predictions. On the other hand, architectures that are too complex may cause numerical artifacts such as overfitting, leading to predictions with low bias and high variance. Secondly, the topology of a NN affects the computational complexity of the model, because an increase in layers and nodes leads to an increase in the floating point operations needed to train the model and to make predictions.
Therefore, identifying an appropriate architecture is an important step that can heavily impact both the computational complexity of training a deep learning (DL) model and the final attainable predictive power of the DL model itself. However, the parameter space of NN architectures is too large for an exhaustive search. In fact, the number of architectures grows exponentially with the number of layers, the number of neurons per layer, and the connections between layers.

Several approaches have been proposed in the literature for hyperparameter optimization (HPO) [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] with the goal of identifying a NN architecture that outperforms the others in terms of accuracy. Sequential Model-Based Optimization (SMBO) algorithms [7] are one category of HPO algorithm. Examples of SMBO algorithms are Bayesian Optimization (BO) [15, 17] and its less expensive variant, the Tree-Parzen Estimator (TPE), which rely on information available from previously trained models to guide the choice of models to build and train in following steps. The use of past information generally reduces the number of neural networks to train in the next iterations, and provides an assessment of uncertainty by incorporating the effect of data scarcity.

Preprint submitted to Elsevier

The efficacy of the results obtained with BO is highly sensitive to the choice of the prior distribution on the hyperparameter space, as well as to the acquisition function used to select new points to evaluate in the hyperparameter space. Another class of HPO methods is represented by genetic algorithms [18, 19, 20, 21, 22, 23] and evolutionary algorithms (EA) [24, 25], which evolve the topology of a NN by alternately adding or dropping nodes and connections based on results attained by previous NN models.
Incremental, adaptive approaches [26] and pruning algorithms [27, 28], or random dropout [29], can also be computationally convenient because they tend to minimize the number of NN models built and trained. All the SMBO, EA, and incremental approaches described above adopt theoretical expedients [30, 31] to reduce the uncertainty of the hyperparameter estimate, but this comes at the price of not being scalable.

Several scalable algorithms for hyperparameter search have been proposed in the literature. Grid Search (GS), or parameter sweep, searches exhaustively through a specified subset of hyperparameters. The subset of hyperparameters and the bounds of the search space are specified manually. Moreover, the search over continuous hyperparameters requires a manually prescribed discretization policy. Although this technique is straightforwardly parallelizable, it becomes increasingly prohibitive in terms of computational time and resources as the number of hyperparameters grows. Random Search (RS) [32] differs from GS mainly in that it explores hyperparameters stochastically instead of exhaustively. RS is likely to outperform GS in terms of time-to-solution [32, 33], especially when only a small number of hyperparameters affects the final predictive power of the DL model. The independence of the hyperparameter settings used by GS and RS makes these approaches appealing in terms of parallelization and obtainable scalability. However, both GS and RS require expensive computations to perform the hyperparameter search.

We present a scalable method to determine, within a given computational budget, the NN with the minimal number of layers that performs at least as well, in terms of accuracy and time-to-solution, as NN models of the same structure identified by other hyperparameter search algorithms.
The computational budget is an important aspect of NN training for two reasons: the available computational power, and the period of time during which that computational power is available. The former imposes obvious intrinsic limitations; the latter becomes important when critical decisions have to be made in a timely and accurate manner. We refer to our method as Greedy Search for NN Architecture (GSNNA). Although our algorithm increments the number of hidden layers adaptively, it differs from other incremental, adaptive algorithms proposed in the literature [34, 35, 36, 37] in that it performs a stratified (sliced) RS restricted to one hidden layer at each iteration. This stratified RS is the most important difference between GSNNA and previous methods. The selection of the NN models is driven by the validation score, which is used as a metric to quantify the predictive performance of the DL models. Starting with the first layer, a random search is performed in parallel on various instantiations of the DL model to determine the optimal number of neurons on each layer and the hyperparameters of the associated DL model. The random search identifies the hyperparameters for each of the instantiations, and the performance of the DL model determines the best number of neurons; the hyperparameters associated with the best performing model are retained. The same sliced RS procedure is applied to the next layers. The recycling of information from previously evaluated models guarantees a fine level of exploitation, and the stratified RS performed at each iteration still guarantees a thorough (albeit not exhaustive) exploration of the objective function landscape in the hyperparameter space, preventing stagnation at local minima. By performing a stratified RS at each iteration, our new approach retains a high level of parallelization, because the NN models can be trained concurrently at each step.
In this work we focus on two widely used NN architectures: multi-layer perceptrons (MLP) and convolutional NN models (CNN). The performance of the HPO algorithms is evaluated using five standard datasets, each of which is associated with its specifically tailored DL model. The validation of the method will be done by comparing the efficiency of the DL model on the determined NN architecture with the efficiency of the same type of NN identified by other algorithms.

The paper is organized in five sections. Section 1 introduces the DL background. Section 2 explains our novel optimization algorithm for the architecture of NN models. Section 3 describes the computational environment where the numerical experiments are performed, the benchmark datasets, the specifics of the implementations for each HPO algorithm considered, and the parameter setting for each HPO algorithm. Section 4 presents numerical experiments where we compare the performance of our HPO algorithm with Bayesian Optimization and the Tree-Parzen Estimator. Section 5 summarizes the results presented and describes future directions to possibly pursue.

1. Deep learning background

Given an unknown function f that relates inputs x and outputs y as follows,

    y = f(x),    (1)

a deep feedforward network, also called a feedforward neural network or multilayer perceptron (MLP) [5, 10], is a predictive statistical model that approximates the function f by composing together many different functions, such that

    \hat{f}(x) = f_{L+1}(\cdots f_{\ell+1}(f_{\ell}(f_{\ell-1}(\ldots f_0(x))))),    (2)

where \hat{f}: R^p \to R^b, and f_{\ell}: R^{p_{\ell}} \to R^{p_{\ell+1}} for \ell = 0, \ldots, L+1. The goal is to identify the proper number L so that the composition in Equation (2) resembles the unknown function f in (1). The composition in Equation (2) is modeled via a directed acyclic graph describing how the functions are composed together.
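The layer-by-layer composition in Equation (2) can be sketched numerically. In this illustrative snippet (our own sketch, not the paper's code), each f_l is taken to be an affine map followed by a tanh nonlinearity, anticipating the weight-matrix notation introduced below; the layer widths are arbitrary choices.

```python
import numpy as np

def make_layer(W):
    """Return one composed function f_l(h) = tanh(W h); the choice of
    tanh here is illustrative, not prescribed by Equation (2)."""
    return lambda h: np.tanh(W @ h)

rng = np.random.default_rng(1)
# Hypothetical widths: input p = 3, two hidden layers, output b = 2.
widths = [3, 5, 4, 2]
layers = [make_layer(rng.standard_normal((q, p)))
          for p, q in zip(widths[:-1], widths[1:])]

def f_hat(x):
    """hat{f}(x) = f_{L+1}( ... f_1( f_0(x) ) ), as in Equation (2)."""
    h = x
    for f in layers:
        h = f(h)
    return h

y_hat = f_hat(np.ones(3))
print(y_hat.shape)  # (2,)
```

The composition is just repeated function application, which is why the architecture is fully described by the graph connecting the layers.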
The number L that quantifies the complexity of the composition is equal to the number of hidden layers in the NN. We refer to the input layer as the layer with index \ell = 0. The indexing for hidden layers of the deep NN models starts with \ell = 1. In this section we consider a NN with a total of L hidden layers. The symbol p_{\ell} denotes the number of neurons at the \ell-th hidden layer. Therefore, p_0 coincides with the dimensionality of the input, that is, p_0 = p. The very last layer, with index L+1, represents the output layer, meaning that p_{L+1} = b coincides with the dimensionality of the output. We refer to w \in R^{N_{tot}} as the vector collecting all the regression coefficients, where N_{tot} is their total number. Following this notation, the function f_0 corresponds to the first layer of the NN, f_1 is the second layer (first hidden layer), and so on up to f_{L+1}, which represents the last layer (output layer). In other words, deep feedforward networks are nonlinear regression models, where the non-linearity is given by the composition in Equation (2) used to describe the relation between predictors x and targets y. This approach can be reinterpreted as searching for a mapping that minimizes the discrepancy between the values \hat{y} predicted by the model and the given observations y. Given a dataset with m data points, the process of predicting the outputs for given inputs via an MLP can thus be formulated as

    \hat{y} = F(x, w),    (3)

where the operator F: R^{p_0} \times R^{N_{tot}} \to R^b is

    F(x, w) = \varphi_{L+1}\Big( \sum_{k_L} w_{k_{L+1} k_L} \, \varphi_L\Big( \sum_{k_{L-1}} w_{k_L k_{L-1}} \, \varphi_{L-1}\big( \cdots \varphi_1\big( \sum_{i=1}^{p_0} w_{k_1 i} \, x_i \big) \cdots \big) \Big) \Big),    (4)

where \varphi_{\ell} (\ell = 1, \ldots, L+1) are activation functions used to generate non-linearity in the predictive model. Using the matrix notation for the weights connecting adjacent layers,

    W_{\ell,\ell-1} \in R^{p_{\ell} \times p_{\ell-1}},    (5)

we can rewrite (4) as

    F(x, w) = \varphi_{L+1}\big( W_{L+1,L} \, \varphi_L( \cdots \varphi_1( W_{1,0} \, x ) ) \big).    (6)

The composition of the activation functions \varphi_{\ell} with the tensor products using the matrices W_{\ell+1,\ell} at the \ell-th layer corresponds to the f_{\ell} used in Equation (2). The notation in (6) highlights that N_{tot} is the total number of regression weights used by the NN. This value must account for all the entries in the W_{\ell,\ell-1} matrices, that is,

    N_{tot} = \sum_{\ell=1}^{L+1} p_{\ell} \, p_{\ell-1}.    (7)

If the target values are continuous quantities, the very last activation \varphi_{L+1} is usually chosen to be linear, i.e., the identity function. If the target values are categorical, then \varphi_{L+1} is usually set to be the logit function. If the number of hidden layers is set to L = 0 and \varphi_1 is set to be the identity function, then the statistical model becomes a classical linear regression model. If the number of hidden layers is set to L = 0 and \varphi_1 is set to be the logit function, then the statistical model becomes a logistic regression model.

In order to exploit local correlations in the data, convolutional kernels can be composed with the activation functions \varphi_i. Convolution is a powerful mathematical tool that models local interactions between data points. As such, convolution uses the same set of regression coefficients to model local interactions across the entire data, instead of using several sets of regression coefficients, one specific to each neighbourhood, as a standard MLP architecture would require. The use of convolution thus significantly reduces the number of coefficients needed in DL models to reconstruct local features in regularly structured data. Well-known examples of data that respect these geometrical properties are images. NN models that exploit data locality for feature extraction are called Convolutional Neural Networks (CNN) [38, 39, 40]; they are characterized by a sparse connectivity, or sparse weights, that stems from the sparse interaction between data.
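The saving that comes from sharing one small kernel, rather than storing a dense weight matrix, can be quantified with a simple parameter count; the layer sizes below are our own illustrative choices, not taken from the paper.

```python
def dense_params(in_size, out_size):
    # A fully connected layer stores one weight per input-output pair.
    return in_size * out_size

def conv_params(kernel_size, in_channels, out_channels):
    # A convolutional layer reuses one small kernel across all positions,
    # so the count is independent of the spatial size of the input.
    return kernel_size * kernel_size * in_channels * out_channels

# Hypothetical 32x32 RGB input mapped to 16 feature maps of the same size.
d = dense_params(32 * 32 * 3, 32 * 32 * 16)
c = conv_params(3, 3, 16)  # 3x3 kernel
print(d, c)  # 50331648 432
```

The dense layer needs five orders of magnitude more weights than the shared 3x3 kernel, which is the memory and FLOP reduction referred to in the text.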
In essence, CNNs are the nonlinear generalization of kernel regression, and they inherit from the linear case the advantages of replacing dense matrix multiplications with sparse matrix multiplications. This benefits the computation by reducing the number of FLOPS required to perform matrix multiplications, and reduces the memory requirement to store the regression weights.

2. Adaptive selection of the number of hidden layers

The goal of our novel HPO algorithm is to determine, within a given computational budget, the NN with the minimal number of layers that performs at least as well on training datasets, in terms of accuracy and time-to-solution, as NN models of the same structure identified by other hyperparameter search algorithms. The HPO is performed over a set of hyperparameters which differs according to the type of NN architecture considered. For MLP models, the HPO is performed over the number of hidden layers, the number of neurons per layer, the type of nonlinear activation function at each hidden layer, and the batch size used to train the model with a first-order optimization algorithm. For CNN models, the number of neurons is replaced by the number of channels; in addition, the convolutional kernel, the dropout rate, and the pooling are optimized as well. In order for the HPO procedure to be applied, the region of the hyperparameter space explored must be bounded, to guarantee that the exploration is restrained within a computational budget for the number of layers and the other NN hyperparameters. The result of the procedure is dataset dependent, in that it aims to identify a customized neural network architecture that well predicts the input-output relation for the dataset at hand. The dataset is split into a training set, a validation set, and a test set. The training portion is used to train the instantiated NN models.
The performance of the DL models over the validation set is used to associate each model with a score, which is used to compare the performance of the instantiated NNs. The test set is used to quantify the predictive performance of the finally selected NN model by computing the test score. We refer to Section 4.1 for details about the metrics used to measure the performance of a NN.

The pseudo-code that describes GSNNA is presented below, in Algorithm 1. The method starts by performing RS over NN models with one hidden layer, and it selects the NN that attains the best predictive performance over the validation portion of the dataset. The random search identifies the hyperparameters for each of the instantiations, and the performance of the deep learning model determines the best number of neurons and the hyperparameters, associated with the best performing model on the respective dataset, to retain. The procedure continues by freezing the number of neurons and the hyperparameters in the previous hidden layers every time a new hidden layer is added, and the sliced RS is performed only on the hyperparameters of the last hidden layer in the architecture. This iterative procedure proceeds until either the validation score reaches a prescribed threshold or the maximum number of hidden layers is reached. An illustration that explains how GSNNA proceeds is shown in Figure 1. The number of neurons needed may vary from layer to layer in order for a NN architecture to attain a desired accuracy. It is thus possible that the NN may have to alternately expand and contract across the hidden layers to properly model the nonlinear relations between input and output data. GSNNA allows this, as the number of neurons, selected at each hidden layer through a stratified RS, may vary from one hidden layer to another.
Algorithm 1: Greedy Search for Neural Network Architecture (GSNNA)

Input:
  - L = maximum number of hidden layers
  - N_max_nodes = maximum number of nodes (neurons) per layer
  - score_threshold = prescribed threshold on the final performance
  - model_eval_iter = number of model evaluations per iteration
Output: best_model

Set number of hidden layers l = 1;
Set best_model as linear regression (for regression problems) or logistic regression (for classification problems);
Compute score;
while score < score_threshold and l <= L do
    Build model_eval_iter NN models with l hidden layers each;
    Set number of nodes and activation functions for the first (l - 1) hidden layers as in best_model;
    Perform random search for the number of nodes in the last hidden layer and for the remaining hyperparameters;
    Select best_model as the NN with the best performance;
    Retrieve best_model and store info about the number of nodes and activation functions per layer;
    l = l + 1;
end
return best_model

As noted above, the stratified RS is the most important difference between GSNNA and previous methods; a further key contrast is the greedy approach adopted in increasing the number of hidden layers. As more hidden layers are added to the NN architecture, the predictive power of the model increases, but so does the computational cost of training. Previous methods treat the number of hidden layers as any other hyperparameter, and they sometimes construct expensive neural networks at intermediate steps, only to discard these NNs later in favor of smaller ones. By taking a greedy approach on the number of hidden layers, GSNNA avoids this type of extreme situation where very expensive NNs are trained and discarded in intermediate steps, and this favors a lower computational cost per iteration.
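Algorithm 1 can be sketched in plain Python. The `evaluate` callback below is a stand-in we introduce for "build and train a NN with the given hidden layers, return its validation score"; only the greedy, layer-by-layer stratified-RS logic of GSNNA is reproduced, not the actual model training.

```python
import random

def gsnna(evaluate, L_max, n_max_nodes, score_threshold,
          model_evals_per_iter, seed=0):
    """Sketch of Algorithm 1 (GSNNA). `evaluate(layers)` must train a NN
    whose hidden layers are given as (n_nodes, activation) pairs and
    return its validation score (higher is better)."""
    activations = ("relu", "sigmoid", "tanh", "elu")
    rng = random.Random(seed)
    best_model = []                 # no hidden layers: linear/logistic baseline
    best_score = evaluate(best_model)
    while best_score < score_threshold and len(best_model) < L_max:
        frozen = best_model         # freeze the layers chosen so far
        candidates = []
        for _ in range(model_evals_per_iter):
            # Stratified (sliced) RS: only the newly appended layer varies.
            new_layer = (rng.randint(1, n_max_nodes), rng.choice(activations))
            candidates.append(frozen + [new_layer])
        # Select the best model of this iteration and proceed greedily.
        best_score, best_model = max((evaluate(m), m) for m in candidates)
    return best_model, best_score

# Toy validation score: rewards depth and (mildly) width, capped at 1.
def toy_eval(layers):
    return min(1.0, 0.2 * len(layers) + 0.001 * sum(n for n, _ in layers))

model, score = gsnna(toy_eval, L_max=5, n_max_nodes=20,
                     score_threshold=0.9, model_evals_per_iter=8)
print(len(model), score)  # 5 1.0
```

With this toy objective the loop adds one layer per iteration until the threshold is met, mirroring how GSNNA only ever trains models one layer deeper than the current best.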
The validation of the method will be shown by comparing the efficiency of the DL model on the determined NN architecture with the efficiency of the same type of DL model identified by other algorithms.

Figure 1: Illustration of the Greedy Search for Neural Network Architecture (GSNNA). The illustration explains how the architecture of the NN is enriched at each iteration. The NN models built at iteration (1) have only one hidden layer, and the number of neurons inside the hidden layer is chosen via RS. Every NN is trained, and the predictive performance over the validation set is measured. The NN with the best validation score is selected (circled in red). If the attained accuracy meets the requirements prescribed by the user, the algorithm stops and returns the selected NN. Otherwise, the hyperparameters of the first hidden layer are transferred to iteration (2). The NN models built at iteration (2) have the same number of neurons in the first hidden layer as the best NN from iteration (1), whereas the number of neurons at the second hidden layer is chosen with another stratified RS. The NN models are trained and the validation scores from each NN are collected. The NN with the best predictive performance is chosen (circled in red). If the performance meets the requirements, the algorithm stops and returns the selected NN. Otherwise, the information about the numbers of neurons in the first and second hidden layers is transferred to iteration (3), so that another stratified RS takes place on the number of neurons inside the third hidden layer.

2.1. Reduction of dimensionality in the hyperparameter search

Transferring information from smaller to bigger NN models across successive iterations, and restricting the RS to the hyperparameters associated only with the last hidden layer, reduces the dimension of the hyperparameter space to explore.
In this section we compare the dimensionality (number of elements in a set) of the hyperparameter space explored by a standard HPO algorithm (e.g., GS, RS, SMBO, EA) with the dimensionality of the hyperparameter space explored by GSNNA. Denote the maximum number of neurons per layer by N_max_nodes and the maximum number of hidden layers by L. The number of hidden layers and the number of neurons per layer are hyperparameters that affect the structure of the NN models, whereas all the other hyperparameters affect the training of the DL model. Because GSNNA differs from state-of-the-art HPO algorithms in the way the number of hidden layers is optimized, this is the only factor that determines a change in the dimensionality of the hyperparameter space. The stratified RS in GSNNA allows us to avoid the curse of dimensionality, because the number of NN architectures to span at each iteration decreases from N_max_nodes^L (as it is for a standard HPO algorithm) to N_max_nodes. The reduced dimensionality of the hyperparameter space also leads to a reduction of the uncertainty over the estimated attainable predictive performance. This is shown in the numerical experiments in Section 4, where the accuracy attained by the NN models selected from multiple runs of GSNNA has narrower confidence intervals than the ones obtained with BO and TPE, indicating that the estimates obtained with GSNNA are more reliable.

2.2. Computational complexity of GSNNA

Let us refer to C as the number of independent model evaluations performed in one iteration of an HPO algorithm, and to L as the number of HPO iterations performed. The computational complexity of one iteration of GSNNA is O(C) (and hence O(CL) for the whole algorithm), because the algorithm compares the predictive performance of C models and selects the best one to proceed to the next iteration.
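The reduction described in Section 2.1 can be made concrete by counting candidate width assignments for a standard HPO search and for GSNNA's one-layer-at-a-time search; the specific values of N_max_nodes and L below are illustrative.

```python
N_MAX_NODES = 100   # illustrative cap on neurons per layer
L = 5               # maximum number of hidden layers

# A standard HPO algorithm must span every assignment of widths to L
# hidden layers at once: N_max_nodes ** L candidate architectures.
standard_space = N_MAX_NODES ** L

# GSNNA varies only the width of the newly added layer, so each of the
# (at most) L iterations spans just N_max_nodes candidates.
gsnna_space = L * N_MAX_NODES

print(standard_space, gsnna_space)  # 10000000000 500
```

The exponential-versus-linear gap in the number of candidate architectures is what the text means by avoiding the curse of dimensionality.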
To put this value in perspective, we remind the reader that the computational complexity of one iteration of BO is cubic both in the number of independent model evaluations and in the number of iterations performed, that is O((CL)^3), whereas the computational complexity of one iteration of TPE is cubic only in the number of independent model evaluations, that is O(C^3). In terms of computational complexity, GSNNA thus provides a significant improvement with respect to BO and TPE, because its computational complexity per iteration is constant with respect to the iteration count, and the computational complexity of one iteration of HPO is reduced from cubic to linear. This benefit makes GSNNA appealing for scaling purposes with large numbers of independent model evaluations C. We also remind the reader that the independent model evaluations in each iteration can be performed concurrently, as we did in the numerical experiments described in Section 4 of this work.

3. Algorithm implementation

In this section we describe the computational environment where the numerical experiments were performed, the specifics of the implementations for each of the HPO algorithms considered, the benchmark datasets used, and the parameter setting for each HPO algorithm.

3.1. Hardware description

The numerical experiments were performed using Summit [41], a supercomputer at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory. Summit has a hybrid architecture; each node contains two IBM POWER9 CPUs and six NVIDIA Volta V100 GPUs, all connected together with NVIDIA's high-speed NVLink. Each node has over half a terabyte of coherent memory (high bandwidth memory + DDR4) addressable by all CPUs and GPUs, plus 800 GB of non-volatile RAM that can be used as a burst buffer or as extended memory. To provide a high rate of I/O throughput, the nodes are connected in a non-blocking fat tree using a dual-rail Mellanox EDR InfiniBand interconnect.

3.2. Dataset description

The datasets used are standard benchmark datasets in machine learning, open source and accessible to everyone, which guarantees reproducibility of the results presented. The datasets used for the numerical experiments of this section are summarized in Table 1.

Table 1: Description of the datasets.

    Name of dataset     | Nb. attributes | Nb. data points
    --------------------|----------------|----------------
    Eggbox              | 2              | 4,000
    Graduate admission  | 7              | 400
    Computer hardware   | 9              | 209
    Phishing websites   | 29             | 11,055
    CIFAR-10            | -              | 60,000

The dataset Eggbox is artificially constructed by evaluating the function f(x, y) = [2 + cos(x/2) cos(y/2)]^5 at 4,000 points in the square domain [0, 2\pi]^2, and it is used as a regression problem. The Graduate admission dataset [42] is a regression problem that relates the chances of a student's admission to GRE score, TOEFL score, university rating, and other performance metrics. The Computer hardware dataset [43, 44] is a regression problem that describes relative CPU performance in terms of cycle time, memory size, and other hardware properties. The Phishing websites dataset [44] is a classification problem that describes the properties of different websites and classifies them as authentic or fake. The CIFAR-10 dataset [45] requires solving a classification problem to classify object images into ten categories. The numerical experiments presented in this section are split between the use of MLP and CNN models. The choice of one type of architecture over the other is dictated by the structure of the dataset used to train the NN models. MLP models are used on the Eggbox, Graduate admission, Computer hardware, and Phishing websites datasets, whereas CNN models are used for the CIFAR-10 dataset.

3.3. Training, validation, and test data

The datasets are split into three components: the training set, the validation set, and the test set. The training set is used to train every instantiated DL model, the validation set is used to select the best performing model at each iteration, and the test set is used at the end to measure the predictive power of the NN selected by each HPO algorithm. For the datasets Eggbox, Graduate admission, Computer hardware, and Phishing websites, the test set is 10% of the original dataset; the remaining portion is partitioned into training and validation sets in the percentages of 90% and 10%, respectively. For classification problems, a stratified splitting is performed to ensure that the proportion between classes is preserved across training, validation, and test sets. The partitioning between training/validation set and test set for the CIFAR-10 dataset is performed as suggested by the online sources where the dataset can be downloaded.

The optimizer used to train the models is the Adam method [46] with an initial learning rate of 0.001. We highlight that the number of epochs used to train a neural network is different from the number of iterations performed by the hyperparameter optimization algorithm; the number of epochs relates to the computation needed to perform every single model evaluation. For all HPO methods (GSNNA, TPE, BO), the maximum number of epochs used to train the neural networks is set equal to n, i.e., the number of samples in the training set for each dataset. The actual number of epochs does not necessarily have to be equal to the number n of points in the dataset: if the training converges before n epochs, early stopping terminates the training; if the number of epochs reaches n, the neural network is still benefiting from the training.
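The split described in Section 3.3 (10% test, then a 90%/10% training/validation partition of the remainder, stratified by class for classification problems) can be sketched with scikit-learn; the toy binary labels below are our own illustration, standing in for a dataset such as Phishing websites.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative binary class labels: 80% class 0, 20% class 1.
y = np.array([0] * 800 + [1] * 200)
X = np.arange(1000).reshape(-1, 1)

# 10% held out as the test set, stratified to preserve class proportions.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)
# The remainder is split 90% / 10% into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.10, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 810 90 100
print(y_test.mean())  # class-1 fraction preserved: 0.2
```

The `stratify` argument is what enforces the class-proportion preservation mentioned in the text.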
Of course, if n is too large, which happens for very large datasets, this may impose an unwanted burden on the execution time; this can be mitigated by seeking a balance between optimal training and training time. The cost of training a neural network depends both on the size of the dataset and on the size of the neural network itself: the larger the neural network and the dataset, the longer the training takes. The longer time needed to train a larger neural network on a larger dataset translates into an increased computational time to perform every single model evaluation, and this impacts the total time-to-solution for all the HPO algorithms used.

We also want to point out that the size of a neural network should correlate with the complexity of the relation between inputs and outputs. Having a larger dataset does not necessarily imply needing a larger neural network. For example, one may have infinitely many points aligned on a straight line: the dataset is large, but the complexity required for the predictive model to capture the trend is still very low.

3.4. Setting of the hyperparameter space

The hypercube that delimits the hyperparameter search is defined so as to restrict the hyperparameter search within an affordable computational budget. Due to the computational budget constraint, we limit the maximum number of layers L to 5. The number of neurons (or channels) per layer spans from 1 to the largest integer smaller than \sqrt{n}, where n is the number of sample points. The choice of \sqrt{n} as the upper bound on the number of neurons per layer is a common practice adopted in DL to avoid overfitting. The set of activation functions comprises the sigmoid function (denoted as sigmoid in the tables), the hyperbolic tangent (tanh), the rectified linear unit (relu), and the exponential linear unit (elu). The kernel size for the CNN architecture spans between 2 and 5.
The discrete range for the batch size spans from 10 to the closest integer to n/10. Choosing n/10 as the maximum size of data batches is a recommendation commonly adopted by DL practitioners to cap the computational cost of each training iteration. The range of search for each hyperparameter is fixed in every HPO algorithm used for the study. Tables 2 and 3 describe the hyperparameters optimized for MLP and CNN architectures, with the ranges spanned by each hyperparameter during the optimization.

Hyperparameter                  Search range
Number of hidden layers         {1, 2, 3, 4, 5}
Number of neurons per layer     [1, √n]
Nonlinear activation function   {relu, sigmoid, tanh, elu}
Batch size                      [10, n/10]

Table 2: Hyperparameters optimized for MLP architectures. The value n refers to the size of the dataset.

Hyperparameter                  Search range
Number of hidden layers         {1, 2, 3, 4, 5}
Number of channels per layer    [1, √n]
Dropout rate                    [0, 1]
Pooling                         {1, 2}
Nonlinear activation function   {relu, sigmoid, tanh, elu}
Batch size                      [10, n/10]

Table 3: Hyperparameters optimized for CNN architectures. The value n refers to the size of the dataset.

3.5. Setting of the hyperparameter search algorithms

The code to perform GSNNA is implemented in Python 3.5, and the NN models are built using Keras [47], which calls the TensorFlow 2.0 backend. The training of the NN models is performed using the GPUs on Summit by calling cuDNN 9.0 for tensor algebra operations. We compare the GSNNA described in this paper with TPE and BO. The version of GSNNA that we implemented performs concurrent model evaluations for the RS at each step with a distributed-memory parallelization paradigm that uses mpi4py [48]. The versions of TPE and BO used are provided by the Ray Tune library [49] through the routines named HyperOptSearch and BayesOptSearch, respectively. The version of Ray Tune used is 0.3.1.
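The search ranges in Tables 2 and 3 can be collected into a plain dictionary for use by any of the search algorithms. The sketch below is illustrative only: the helper name and dictionary layout are ours, not part of the GSNNA code or the Ray Tune API.

```python
import math

def mlp_search_space(n):
    """Hyperparameter ranges from Table 2 for a dataset with n samples.

    Upper bounds follow the heuristics in the text: at most about sqrt(n)
    neurons per layer, and batches no larger than roughly n/10.
    """
    return {
        "num_hidden_layers": list(range(1, 6)),              # {1, ..., 5}
        "neurons_per_layer": (1, math.floor(math.sqrt(n))),  # [1, sqrt(n)]
        "activation": ["relu", "sigmoid", "tanh", "elu"],
        "batch_size": (10, round(n / 10)),                   # [10, n/10]
    }

space = mlp_search_space(500)
```

Keeping the ranges in one place makes it easy to feed the identical space to GSNNA, TPE, and BO, which the text identifies as a requirement for a fair comparison.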
For BayesOptSearch, the utility function is set to utility_kwargs = {"kind": "ucb", "kappa": 2.5, "xi": 0.0}. For both HyperOptSearch and BayesOptSearch, the model selection and evaluations are scheduled using the asynchronous version of HyperBand [50] called AsyncHyperBandScheduler. The time attribute for the scheduler is the training iteration, and the reward attribute is the validation score of the NN. The validation score is also used as the stopping criterion of the HPO algorithm. Additional parameters of Ray Tune's TPE and BO not mentioned here have been left at their default values. Our proposed method, GSNNA, is at its first implementation, whereas the Ray Tune library used to perform HPO with TPE and BO has undergone multiple stages of implementation optimization. Therefore, our comparison between GSNNA, TPE, and BO does not advantage GSNNA over the other HPO algorithms in terms of implementation.

4. Numerical results

In this section we present numerical experiments for the five benchmark datasets described above, and we focus on the type of neural network structure best suited to each of the selected datasets. Our numerical experiments compare the performance of GSNNA against BO and TPE in terms of the final attainable accuracy of the selected NN architecture and the time-to-solution to complete the hyperparameter search. The numerical tests described in this section focus on weak scaling, meaning that the performance of the HPO algorithms is monitored for increasing numbers of concurrent model evaluations, with each concurrent model evaluation mapped to a separate MPI process and a separate GPU for training, while the predictive performance of the model is assessed. Strong scaling tests are not included in the discussion for the following reasons.
For applications such as the ones considered in this paper, strong scaling requires fixing the number of concurrent model evaluations and progressively increasing the computational resources made available for each model evaluation. In our methodology, there is a one-to-one mapping between concurrent model evaluations and GPUs. When the total number of GPUs is less than the number of concurrent models, strong scaling boils down to the scaling of the job scheduler, which is outside the scope of this work. When the total number of GPUs exceeds the number of concurrent models, this would translate into using multiple GPUs to perform a single model evaluation instead of one GPU, as currently done in this work. In the deep learning community, this approach is known as model parallelization. Model parallelization would accelerate the model evaluations, and it would apply equally to all three methods TPE, BO, and GSNNA. However, model parallelization would not accelerate the execution of the hyperparameter optimization algorithms themselves; therefore, the comparison of TPE, BO, and GSNNA would not differ in relative terms from the one presented in this paper. Moreover, the small size of the neural networks and of the benchmark datasets used in this work does not justify model parallelization; strong scaling would bring only marginal benefits to the acceleration of model evaluations.

4.1. Comparison of predictive performance and computational time

The first set of numerical experiments compares the predictive power of GSNNA with TPE and BO. The metric used to quantify the predictive performance of a NN for regression problems is the R² score, defined as

R² = 1 − [ Σ_{i=1}^{m} (y_i − ŷ_i)² ] / [ Σ_{i=1}^{m} (y_i − ȳ)² ],   (8)

where y_i are the observations for the m data points in the test set, ŷ_i are the predictions obtained with the DL model over the test set, and ȳ is the sample mean of the data points over the test set.
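Equation (8) translates directly into code; the minimal reference implementation below is ours (it is not tied to the paper's code or to any library):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination, Eq. (8): 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect predictor yields R² = 1, while a model that always predicts the sample mean yields R² = 0; negative values indicate a fit worse than the mean.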
The metric used to quantify the predictive performance of a NN for classification problems is the F1 score, defined as

F1 = 2 · PPV · TPR / (PPV + TPR),   (9)

where PPV = true positives / (true positives + false positives) is the precision (or positive predicted value), and TPR = true positives / positives is the sensitivity (also called recall, hit rate, or true positive rate). For the datasets that require the use of MLP architectures, the number of concurrent model evaluations per iteration is set to 10, 25, 50, 75, and 100 for all three HPO algorithms. For the CIFAR-10 dataset, which requires the use of CNN architectures, the number of concurrent model evaluations per iteration is set to 150, 300, 450, and 600 to cope with the larger number of hyperparameters to tune. The maximum number of iterations is set to 5 for all three HPO algorithms, and the stopping criterion imposes a threshold of 0.99 on the R² score and F1 score. To guarantee a fair comparison between the different HPO algorithms, the implementations of the three HPO algorithms use the same number of concurrent model evaluations, and each implementation maps every concurrent model evaluation to a separate GPU. However, the complexity of (and thus the cost to train) each model per iteration varies according to the specific architectures that the HPO algorithms select at each iteration. Since different HPO algorithms select different architectures to construct and evaluate, this can lead to different computational times. Because Summit has six GPUs per compute node, the total number of Summit nodes used in a numerical experiment is equal to the number of concurrent model evaluations divided by 6, rounded up to the nearest integer. Figures 2, 3, 4, and 5 correspond to the test cases with MLP models.
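Equation (9), together with the definitions of PPV and TPR, can be computed from the confusion-matrix counts; a minimal sketch (our illustrative helper, not the paper's code):

```python
def f1_score(tp, fp, fn):
    """F1 score, Eq. (9): harmonic mean of precision (PPV) and recall (TPR).

    tp, fp, fn are the true-positive, false-positive, and false-negative
    counts; the number of actual positives is tp + fn.
    """
    ppv = tp / (tp + fp)  # precision
    tpr = tp / (tp + fn)  # recall / sensitivity / true positive rate
    return 2 * ppv * tpr / (ppv + tpr)
```

Because F1 is the harmonic mean of precision and recall, it penalizes a classifier that trades one for the other, which is why the text prefers it to plain accuracy under class imbalance.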
In these figures, the top panels show the scores obtained on the test set by the selected MLP model, and the bottom panels show the time-to-solution in wall-clock seconds. The performance is reported for each hyperparameter search algorithm, averaged over 10 runs, with 95% confidence intervals for the mean value of both the predictive performance and the time-to-solution. The experiments with the Eggbox dataset show better results for GSNNA than for TPE and BO in terms of the predictive power achieved by the selected NN. Moreover, we notice that the confidence band for GSNNA narrows as the number of concurrent evaluations increases. This happens because the inference on the attainable predictive performance becomes more accurate with a higher number of random samples for the stratified RS. A different trend is shown by the confidence bands of TPE and BO, which do not become narrower as the number of concurrent model evaluations increases. This highlights the benefit of using a stratified RS in GSNNA: the uncertainty of the random optimization is bounded by reducing the dimensionality of the search space. In terms of scalability, we notice that GSNNA has a flat weak-scaling curve, whereas BO and TPE significantly increase their computational time-to-solution with an increased number of concurrent model evaluations. Although BO and TPE finish in less time than GSNNA for 10 and 50 model evaluations, the final attained accuracy is significantly lower than the one obtained with GSNNA. This indicates that GSNNA explores the hyperparameter space better.

Figure 2: Eggbox dataset. Comparison between Greedy search, HyperOptSearch, and BayesOptSearch for test cases with MLP architectures. The graph at the top shows the performance obtained on the test set by the model selected by the hyperparameter search.
The graph at the bottom shows a comparison of the computational times.

Similar results in terms of final attainable accuracy and scalability have been obtained for the Graduate admission, Computer hardware, and Phishing websites datasets. Although different values of the tuning parameter of BayesOptSearch have been tested on the datasets considered in this paper, we noticed that the performance of BayesOptSearch on these datasets did not change significantly. We also noticed that for the Graduate admission dataset and the Phishing dataset, some HPO algorithms reduce the total time of the search for an increased number of concurrent model evaluations, which goes against intuitive reasoning. To better understand this phenomenon, we note that the number of concurrent models impacts the computational time in two ways: a higher number of concurrent model evaluations makes it more likely to identify a network that attains a desired accuracy quickly, but it also requires more time to coordinate the model evaluations. Whichever of these two factors prevails determines whether the total computational time decreases or increases.

Figure 3: Graduate admission dataset. Comparison between Greedy search, HyperOptSearch, and BayesOptSearch for test cases with MLP architectures. The graph at the top shows the performance obtained on the test set by the model selected by the hyperparameter search. The graph at the bottom shows a comparison of the computational times.

The results for the CIFAR-10 dataset using CNN in Figure 6 show that GSNNA outperforms both the TPE and BO algorithms in terms of best attainable predictive performance and computational time. The F1 score is more appropriate than the accuracy (percentage of data points correctly classified) for measuring the predictive performance of neural networks used for classification in the presence of class imbalance [51].
However, accuracy is still the most commonly used metric to report the predictive performance of a model on benchmark datasets such as CIFAR-10. To facilitate comparison with other results published in the literature, we also report the accuracy for CIFAR-10. Comparing the architecture selected by GSNNA with state-of-the-art architectures customized for CIFAR-10 [52], we see that our architecture has a test error of about 9%, whereas customized architectures currently provide errors below 0.1%. In view of this gap between the performance we obtained on CIFAR-10 and other results published in the literature, we emphasize that the goal of our research is to build an automatic selection of hyperparameters that is as agnostic as possible about the specifics of the dataset at hand. This makes the hyperparameter search more challenging, and the attainable accuracy is generally lower than the one obtained with customized approaches. Recent results obtained by other researchers [53] show a test error of around 12% when a Bayesian approach is used to optimize the architecture of a neural network for the CIFAR-10 dataset, which is in line with the results we present here.

Figure 4: Computer hardware dataset. Comparison between Greedy search, HyperOptSearch, and BayesOptSearch for test cases with MLP architectures. The graph at the top shows the performance obtained on the test set by the model selected by the hyperparameter search. The graph at the bottom shows a comparison of the computational times.

4.2. Sensitivity of GSNNA with respect to the number of concurrent model evaluations

In Figure 7 we show the performance obtained with GSNNA on the Eggbox dataset and the Computer hardware dataset as a function of the number of hidden layers, for different numbers of concurrent model evaluations (10, 50, and 100).
For both experiments it is clear that using a small number of concurrent model evaluations leads to significant fluctuations in the score, as the stratified RS does not explore enough architectures for a fixed number of hidden layers. A progressive increase in the number of concurrent model evaluations leads to better inference, because an exhaustive exploration of the stratified hyperparameter space reduces the uncertainty in the attainable best performance of the model. Moreover, a sufficient exploration of the stratified hyperparameter space highlights the dependence between the maximum attainable performance of the NN and the total number of hidden layers. Indeed, the examples displayed in Figure 7 confirm that nonlinear input-output relations can benefit from a higher number of hidden layers.

Figure 5: Phishing dataset. Comparison between Greedy search, HyperOptSearch, and BayesOptSearch for test cases with MLP architectures. The graph at the top shows the performance obtained on the test set by the model selected by the hyperparameter search. The graph at the bottom shows a comparison of the computational times.

In Figure 8 we present a similar analysis using CNN for the CIFAR-10 dataset. In this case, the numbers of concurrent model evaluations considered are 150, 300, and 600. The scalability tests for the CIFAR-10 dataset use a higher number of concurrent model evaluations than the previous datasets because there are more architectural hyperparameters to tune in CNN than in MLP models, as also shown by a comparison between Tables 2 and 3. Differently from the previous numerical examples, increasing the number of concurrent model evaluations does not benefit the identification of a better-performing architecture for the CIFAR-10 dataset, but a progressive increase in the number of hidden layers still leads to a progressive gain in attainable accuracy.
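The interplay between greedy growth over hidden layers and the per-layer stratified random search analyzed above can be sketched schematically. The code below is our illustrative reading of the algorithm, not the authors' implementation: a toy objective stands in for actual training, all names are hypothetical, and the layer ranges are simplified.

```python
import random

def gsnna_sketch(evaluate, max_layers=5, num_concurrent=50, target=0.99, seed=0):
    """Schematic greedy search: at each step, freeze the layers found so far
    (the "recycling" of hidden-layer configurations), draw num_concurrent
    random configurations for one new layer, keep the best candidate, and
    stop at the score target or when adding a layer stops helping."""
    rng = random.Random(seed)
    layers, best_score = [], float("-inf")
    for _ in range(max_layers):
        candidates = [layers + [{"neurons": rng.randint(1, 20),
                                 "activation": rng.choice(["relu", "tanh"])}]
                      for _ in range(num_concurrent)]
        best_candidate = max(candidates, key=evaluate)
        score = evaluate(best_candidate)
        if score <= best_score:  # no improvement: keep the smaller network
            break
        layers, best_score = best_candidate, score
        if best_score >= target:
            break
    return layers, best_score

# Toy objective: rewards depth (up to 3 layers) and wider layers.
toy = lambda arch: min(len(arch), 3) * 0.3 + sum(l["neurons"] for l in arch) / 1000
arch, score = gsnna_sketch(toy)
```

In the actual method each candidate evaluation is a full model training mapped to its own MPI rank and GPU, which is what makes the per-step random search embarrassingly parallel and the weak-scaling curve flat.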
Figure 6: Comparison between GSNNA, HyperOptSearch, and BayesOptSearch for test cases with CNN architectures on the CIFAR-10 dataset. The graph at the top shows the performance obtained on the test set by the selected model in terms of F1 score. The graph in the center shows the same performance in terms of accuracy. The graph at the bottom shows the computational time.

Figure 7: Greedy Search for Neural Network Architecture (GSNNA). Coefficient of determination as a function of the number of hidden layers for the Eggbox and Computer hardware datasets using 10, 50, and 100 concurrent model evaluations. Results are shown for a single run.

Figure 8: Greedy Search for Neural Network Architecture (GSNNA). Coefficient of determination as a function of the number of hidden layers for the CIFAR-10 dataset using 150, 300, and 600 concurrent model evaluations. Results are shown for a single run.

5. Concluding remarks and future developments

GSNNA aims to determine, in a scalable fashion and within a given computational budget, the NN with the minimal number of layers that performs at least as well as NN models of the same structure identified by other hyperparameter search algorithms. The algorithm adopts a greedy technique on the number of hidden layers, which reduces the computational time and cost of the hyperparameter search. This makes the algorithm not only appealing but sometimes strongly compelling when computational and memory resources are limited, or when DL-driven decisions have to be made in a timely manner. The recycling of hidden-layer configurations disregards an exponential number of architectures in the hyperparameter space; however, the smaller search space makes the optimization a much more tractable problem, with a significant reduction in computational complexity.
Moreover, our numerical results show that this does not compromise the final attainable accuracy of the model selected by the optimization procedure. CIFAR-10 is the largest dataset tested, with 60,000 images at 32x32 resolution. ImageNet or the Open Images Dataset have more than a million images and are commonly evaluated at 256x256 resolution. At the same efficiency, this could take 1,000x more time, and CIFAR-10 already takes about 8 hours. This is a limitation to the applicability of the method. However, the proposed research aims at improving the scalability of hyperparameter search algorithms under a constrained computational budget. Therefore, while the method is illustrated on modest-size datasets and neural networks, it holds promise for implementations on larger datasets and correspondingly larger neural networks under the same computational budget constraints.

For future developments we aim to extend the study to types of architectures other than multilayer perceptrons and CNN, such as residual neural networks (ResNet), recurrent neural networks (RNN), and long short-term memory neural networks (LSTM). We will also use GSNNA for specific problems by selecting customized attributes other than the score for the HPO, and we will conduct an uncertainty quantification analysis to estimate the sensitivity of the inference on the hyperparameters with respect to the dimension of the hyperparameter space and the number of concurrent model evaluations.

Acknowledgements

Massimiliano Lupo Pasini thanks Dr. Vladimir Protopopescu for his valuable feedback in the preparation of this manuscript, and three anonymous reviewers for their very useful comments and suggestions. This work is supported in part by the Office of Science of the US Department of Energy (DOE) and by the LDRD Program of Oak Ridge National Laboratory.
This work used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. Y. W. Li was partly supported by the LDRD Program of Los Alamos National Laboratory (LANL) under project number 20190005DR. LANL is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy (Contract No. 89233218CNA000001). This document number is LA-UR-21-20936.

References

[1] M. L. Minsky. Some universal elements for finite automata. In C. E. Shannon & J. McCarthy (Eds.), Automata Studies, Princeton: Princeton University Press, pages 117–128, 1956.
[2] J. von Neumann. The general and logical theory of automata. In L. A. Jeffress (Ed.), Cerebral Mechanisms in Behavior, New York, Wiley, pages 1–41, 1951.
[3] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 1958.
[4] F. Rosenblatt. The perceptron: a theory of statistical separability in cognitive systems. Buffalo: Cornell Aeronautical Laboratory, Inc. Report Number VG-1196-G-1, 1958.
[5] S. Haykin. Neural Networks and Learning Machines, Third Edition. Pearson Education Ltd, 2009.
[6] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using performance prediction. 2018 International Conference on Learning Representations, Workshop Track, 2018.
[7] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS'11), pages 2546–2554, 2011.
[8] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.
[9] S. Fahlman and C. Lebiere.
The cascade-correlation learning architecture. Advances in Neural Information Processing Systems, pages 524–532, 1990.
[10] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. The MIT Press, Cambridge, Massachusetts and London, England, 2016.
[11] W. Grathwohl, E. Creager, S. K. S. Ghasemipour, and R. Zemel. Gradient-based optimization of neural network architecture. ICLR 2018 Workshop Track, 2018.
[12] T. K. Gupta and K. Raza. Optimizing deep neural network architecture: a tabu search based approach. Neural Processing Letters, 51:2855–2870, 2020.
[13] C. Liu, B. Zoph, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[14] R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu. Neural architecture optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 7816–7827. Curran Associates, Inc., 2018.
[15] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12), 2:2951–2959, 2012.
[16] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. 2016.
[17] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary, Prabhat, and R. P. Adams. Scalable Bayesian optimization using deep neural networks. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2171–2180, Lille, France, 07–09 Jul 2015. PMLR.
[18] M. Ettaouil, M. Lazaar, and Y. Ghanou. Architecture optimization model for the multilayer perceptron and clustering.
Journal of Theoretical and Applied Information Technology, 10(1):64–72, 2013.
[19] T. K. Gupta and K. Raza. Optimization of ANN architecture: a review on nature-inspired techniques. Machine Learning in Bio-Signal and Diagnostic Imaging, pages 159–182, 2018.
[20] J. Holland. Genetic algorithms. Scientific American, 179:44–50, 1992.
[21] H. Kitano. Designing neural networks using genetic algorithms with graph generation system. Complex Systems Journal, 4:461–476, 1990.
[22] P. Koehn. Combining Genetic Algorithms and Neural Networks: The Encoding Problem. Master of Science Thesis, University of Tennessee, Knoxville, USA, 1991.
[23] J. T. Tsai, J. H. Chou, and T. K. Liu. Tuning the structure and parameters of a neural network by using hybrid Taguchi-genetic algorithm. IEEE Trans. Neural Networks, 17(1):69–80, 2006.
[24] S. R. Young, D. C. Rose, T. P. Karnowski, S.-H. Lim, and R. M. Patton. Optimizing deep learning hyper-parameters through an evolutionary algorithm. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, MLHPC '15, pages 1–5, New York, NY, USA, November 2015. Association for Computing Machinery.
[25] Oak Ridge National Laboratory. Multi-node evolutionary neural networks for deep learning (MENNDL). https://www.ornl.gov/division/csmd/projects/multi-node-evolutionary-neural-networks-deep-learning-menndl.
[26] N. K. Treadgold and T. D. Gedeon. Exploring constructive cascade networks. IEEE Transactions on Neural Networks, 10(6):1335–1350, 1999.
[27] S. W. Stepniewski and A. J. Keane. Pruning backpropagation networks using modern stochastic optimization techniques. Neural Computing and Applications, 5(2):76–98, 1997.
[28] Compressing and regularizing deep neural networks. https://www.oreilly.com/ideas/compressing-and-regularizing-deep-neural-networks.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R.
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[30] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI'15), pages 3460–3468, 2015.
[31] T. Hinz, N. Navarro-Guerrero, S. Magg, and S. Wermter. Speeding up the hyperparameter optimization of deep convolutional neural networks. International Journal of Computational Intelligence and Applications, 17(02):1850008, June 2018.
[32] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
[33] P. Liashchynskyi and P. Liashchynskyi. Grid search, random search, genetic algorithm: a big comparison for NAS. 2019.
[34] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. AdaNet: adaptive structural learning of artificial neural networks. Proceedings of the 34th International Conference on Machine Learning, PMLR, 70:874–883, 2017.
[35] T. Y. Kwok and D. Y. Yeung. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630–645, 1997.
[36] D. Liu, T. S. Chang, and Y. Zhang. A constructive algorithm for feedforward neural networks with incremental training. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 49(12):1876–1879, 2002.
[37] J. H. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–67, 1991.
[38] K. Fukushima. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193–202, 1980.
[39] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25(2), 2012.
[40] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
[41] Summit - Oak Ridge National Laboratory's 200 petaflop supercomputer. https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/.
[42] Kaggle: Your home for data science. https://www.kaggle.com.
[43] D. W. Aha, D. F. Kibler, and M. K. Albert. Instance-based prediction of real-valued attributes. Computational Intelligence, 5:51–57, 1989.
[44] UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/index.php.
[45] The CIFAR-10 dataset. https://www.cs.toronto.edu/~kriz/cifar.html.
[46] D. P. Kingma and J. L. Ba. Adam: a method for stochastic optimization. Conference Paper at the International Conference on Learning Representations, 2015.
[47] Keras: The Python Deep Learning library. https://keras.io.
[48] MPI for Python. https://mpi4py.readthedocs.io/en/stable/.
[49] Ray Tune: Hyperparameter Optimization Framework. https://ray.readthedocs.io/en/ray-0.3.1/tune.html.
[50] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: bandit-based configuration evaluation for hyperparameter optimization. ICLR Conference Proceedings, 2017.
[51] A. Tharwat. Classification assessment methods. Applied Computing and Informatics, 2020.
[52] Image classification on CIFAR-10. https://paperswithcode.com/sota/image-classification-on-cifar-10.
[53] J. Wu, X. Y. Chen, H. Zhang, L. D. Xiong, H. Lei, and S. H. Deng. Hyperparameter optimization for machine learning models based on Bayesian optimization. Journal of Electronic Science and Technology, 17(1):26–40, 2019.