Funneled Bayesian Optimization for Design, Tuning and Control of Autonomous Systems


Authors: Ruben Martinez-Cantin

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015

Abstract—In this paper, we tackle several problems that appear in robotics and autonomous systems: algorithm tuning, automatic control and intelligent design. All these problems have in common that they can be mapped to global optimization problems where evaluations are expensive. Bayesian optimization has become a fundamental global optimization algorithm in many problems where sample efficiency is of paramount importance. Bayesian optimization uses a probabilistic surrogate model to learn the response function and reduce the number of samples required. Gaussian processes have become a standard surrogate model for their flexibility to represent a distribution over functions. In a black-box setting, the common assumption is that the underlying function can be modeled with a stationary Gaussian process. In this paper, we present a novel kernel function specially designed for Bayesian optimization that allows nonstationary behavior of the surrogate model in an adaptive local region. This kernel is able to reconstruct nonstationarity even with the irregular sampling distribution that arises from Bayesian optimization. Furthermore, in our experiments, we found that this new kernel results in an improved local search (exploitation) without penalizing the global search (exploration) in many applications. We provide extensive results in well-known optimization benchmarks, machine learning hyperparameter tuning, reinforcement learning and control problems, and UAV wing optimization. The results show that the new method is able to outperform the state of the art in Bayesian optimization both in stationary and nonstationary problems.

Index Terms—Bayesian optimization, Gaussian processes, Global optimization, Reinforcement learning
I. INTRODUCTION

Many problems in autonomous systems and robotics require finding the extremum of an unknown real-valued function using as few evaluations as possible. In many cases, those functions represent actual expensive processes like building a prototype, physical trials like learning a controller by experimentation, or time-consuming computations and simulations like tuning some deep learning architecture. The optimization process must consider the actual budget and limitations of gathering new evaluations. Then, sample efficiency becomes the key element. Furthermore, those functions might be highly multimodal, requiring a global solution.

Bayesian optimization, also found in the literature under the names of Bayesian Sampling [1], Efficient Global Optimization (EGO) [2], Sequential Kriging Optimization (SKO) [3], Sequential Model-Based Optimization (SMBO) [4] or Bayesian guided pattern search [5], is a classic optimization method [6], [7] which has recently become quite popular for being a sample-efficient method of global optimization [2]. It has been applied with great success to autonomous algorithm tuning [8], especially for machine learning applications [9], [10], robot planning [11], control [12], task optimization [13], reinforcement learning [14], [15], structural design [16], sensor networks [17], [18], simulation design [19], circuit design [5], ecology [20], biochemistry [21], dynamical modeling of biological systems [22], etc. Recent works have found connections between Bayesian optimization and the way biological systems adapt and search in nature, such as human active search [23] or animal adaptation to injuries [24].

R. Martinez-Cantin is at Centro Universitario de la Defensa, Zaragoza, Spain and SigOpt, Inc. Email: rmcantin@unizar.es
Bayesian optimization uses a Bayesian surrogate model, that is, a distribution over target functions P(f), to incorporate the information available during the optimization procedure, such as any prior information and the information from each observation. This model can be updated recursively as outcomes become available from the evaluated trials y_t = f(x_t):

    P(f | X_{1:t}, y_{1:t}) = P(x_t, y_t | f) P(f | X_{1:t-1}, y_{1:t-1}) / P(x_t, y_t),   ∀ t = 2 ... T    (1)

where X_{1:t} is a matrix with all the inputs X_{1:t} = [x_1, ..., x_t] and y_{1:t} is a vector with all the outcomes y_{1:t} = [y_1, ..., y_t]^T. By using this method, the information is always up to date. Therefore, the surrogate model provides a memory [25] that improves the sample efficiency of the method by considering the whole history of trials and evaluations during the decision procedure of where to sample. Bayesian optimization computes the optimal decision/action u of selecting the next trial u = x_{t+1} by maximizing (minimizing) an expected utility (loss):

    u_{BO} = arg min_u ∫ δ_t(f, u) dP(f | X_{1:t}, y_{1:t})    (2)

where δ_t(f, u) is the optimality criterion or regret function that drives the optimization towards the optimum x*. For example, we can use the optimality gap δ_t(f, u) = f(u) − f(x*) to get the optimal outcome, the Euclidean distance error δ_t(f, u) = ||u − x*||_2 to get the optimal input, or the relative entropy δ_t(f, u) = H(x* | X_{1:t}) − H(x* | X_{1:t}, u) to maximize the information about the optimum.

In summary, Bayesian optimization is the combination of two main components: a surrogate model, which captures all prior and observed information, and a decision process, which selects the optimal action, i.e., where to sample next, based on the previous model. These two components also hide extra computational cost compared to other optimization methods.
On one hand, we need to update the surrogate model continuously. On the other hand, we need to optimize the criterion function. However, this additional cost can be compensated by the reduced number of target function evaluations thanks to the sample efficiency of the method. Therefore, Bayesian optimization is especially suitable for expensive black-box functions, trial-and-error tuning and experimental design. The methodology behind Bayesian optimization also appears in other areas. In the way points are selected, the optimization problem can be considered an active learning problem on the estimation of the optimum [26], [27]. Other authors have drawn connections between Bayesian optimization and some reinforcement learning setups such as multi-armed bandits [17], partially observable Markov decision processes (POMDPs) [28] or active reinforcement learning [14]. Surrogate models such as Gaussian processes are also used in other optimization methods, like evolutionary algorithms [29], [30], [31]. Some genetic algorithms even use the decision (acquisition) functions of Bayesian optimization as preselection criteria [32].

For Bayesian optimization, the quality of the surrogate model is of paramount importance, as it also affects the optimality of the decision process. The earliest versions of Bayesian optimization used Wiener processes [6] or Wiener fields [33] as surrogate models. Similar methods used radial basis functions [34] or branch and bound with polynomials [35]. It was the seminal paper of Jones et al. [2] that introduced the use of Gaussian processes, also called Kriging, as a Bayesian surrogate function. Jones also wrote an excellent review on this kind of surrogate models [36].
Recently, other Bayesian models have become popular, like Student's t processes [37], [38], treed Gaussian processes [5], [39], random forests [4], tree-structured Parzen estimators [40] or deep neural networks [41]. In the case of discrete inputs, the Beta-Bernoulli bandit setting provides an equivalent framework [42]. However, the Gaussian process (GP) is still the most popular model due to its accuracy, robustness and flexibility, because Bayesian optimization is mainly used in black- or grey-box scenarios. The range of applicability of a Gaussian process is defined by its kernel function, which sets the family of functions that it is able to represent through the reproducing kernel Hilbert space (RKHS) [43]. In fact, regret bounds for Bayesian optimization using Gaussian processes are always defined in terms of specific kernel functions [27], [17], [44]. From a practical point of view, the standard procedure is to select a generic kernel function, such as the Gaussian (squared exponential) or Matérn kernels, and estimate the kernel hyperparameters from data. One property of these kernels is that they are stationary. Although this might be a reasonable assumption in a black-box setup, we show in Section III that it reduces the efficiency of Bayesian optimization in most situations. It also limits the potential range of applications. On the other hand, nonstationary methods usually require extra knowledge of the function (e.g., the global trend or the space partition). However, gathering this knowledge from data usually requires global sampling, which is contrary to the Bayesian optimization methodology.

The main contribution of this paper is a new set of adaptive kernels for Gaussian processes that are specifically designed to model functions from nonstationary processes, focused on the region near the optimum. Thus, the new model maintains the global exploration/local exploitation trade-off.
This idea results in an improved efficiency and applicability of any Bayesian optimization based on Gaussian processes. We call this new method Spartan Bayesian Optimization (SBO). The algorithm has been extensively evaluated in many scenarios and applications. Besides some standard optimization benchmarks, the method has been evaluated in automatic algorithm tuning for machine learning applications, optimal policy learning in reinforcement learning scenarios, and autonomous wing design of an airplane using realistic CFD simulations. In our results, we have found that SBO reaches its best performance in problems that are clearly nonstationary, where the local and global shape of the function are different. However, our tests have also shown that SBO can improve the results of Bayesian optimization in all those scenarios. From an optimization point of view, these new kernels result in an improved local search while maintaining global exploration capabilities, similar to other locally-biased global optimization algorithms [45].

II. BAYESIAN OPTIMIZATION WITH GAUSSIAN PROCESSES

We start by describing the ingredients of a Bayesian optimization algorithm using Gaussian processes as surrogate model. Consider the problem of finding the minimum of an unknown real-valued function f : X → R, where X is a compact space, X ⊂ R^d, d ≥ 1. In order to find the minimum, the algorithm has a maximum budget of N evaluations of the target function f. The purpose of the algorithm is to select the best query points at each iteration such that the optimization gap or regret is minimal for the available budget.

A. Gaussian processes

Given a dataset at step t of points X = X_{1:t} and their respective outcomes y = y_{1:t}, the prediction of the Gaussian process at a new query point x_q, with a kernel or covariance function k_θ(·, ·) with hyperparameters θ, is a normal distribution y_q ∼ N(μ, σ² | x_q) where:

    μ(x_q) = k(x_q, X) K^{-1} y
    σ²(x_q) = k(x_q, x_q) − k(x_q, X) K^{-1} k(X, x_q)    (3)

where k(x_q, X) is the cross-correlation vector of the query point x_q with respect to the dataset X,

    k(x_q, X) = [k(x_q, x_1), ..., k(x_q, x_n)]^T    (4)

and K = K(X, X) is the Gram or kernel matrix:

    K = [ k(x_1, x_1)  ...  k(x_1, x_n)
          ...          ...  ...
          k(x_n, x_1)  ...  k(x_n, x_n) ] + σ_n² I    (5)

where σ_n² is a noise or nugget term to represent stochastic functions [3] or surrogate mismodeling [46].

B. Acquisition function

The regret functions that we introduced to select the next point at each iteration with equation (2) assume knowledge of the optimum x*. Thus, they cannot be used in practice. Instead, the Bayesian optimization literature uses acquisition functions, like the expected improvement (EI) criterion [7], as a proxy of the optimality gap criterion. EI is defined as the expectation of the improvement function I(x) = max(0, ρ − f(x)), where ρ is the incumbent of the optimum. In many applications ρ is taken to be the best outcome up to the current iteration, ρ = y_best. Other incumbent values could be considered, especially in the presence of noise, like the maximum predicted value.
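As a minimal sketch, the GP prediction of Eqs. (3)-(5) can be written in a few lines; the squared-exponential kernel and the length-scale and nugget values below are placeholder choices for illustration, not the settings used in the paper's experiments:

```python
import numpy as np

def se_kernel(A, B, ell=0.2):
    # Squared-exponential kernel matrix between row-wise point sets A and B.
    d = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.sum(d ** 2, axis=-1) / ell ** 2)

def gp_predict(X, y, Xq, sigma_n2=1e-4, ell=0.2):
    # Posterior mean and variance at query points Xq, following Eq. (3)-(5).
    K = se_kernel(X, X, ell) + sigma_n2 * np.eye(len(X))  # Gram matrix with nugget, Eq. (5)
    kq = se_kernel(Xq, X, ell)                            # cross-correlation vectors, Eq. (4)
    alpha = np.linalg.solve(K, y)                         # K^{-1} y
    mu = kq @ alpha                                       # posterior mean, Eq. (3)
    v = np.linalg.solve(K, kq.T)                          # K^{-1} k(X, x_q)
    var = 1.0 - np.sum(kq * v.T, axis=1)                  # posterior variance (k(x_q, x_q) = 1 for SE)
    return mu, var

X = np.array([[0.0], [0.5], [1.0]])
y = np.array([0.0, 1.0, 0.0])
mu, var = gp_predict(X, y, np.array([[0.5], [0.25]]))
```

At an observed point the posterior nearly interpolates the data, while the variance grows between observations, which is exactly the uncertainty signal that the acquisition function exploits.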
Taking the expectation over the predictive distribution, we can compute the expected improvement as:

    EI(x) = E_{p(y | x, θ)}[I(x)] = (ρ − μ) Φ(z) + σ φ(z)    (6)

where φ and Φ are the Gaussian probability density function (PDF) and cumulative distribution function (CDF), respectively, with z = (ρ − μ)/σ. In this case, (μ, σ²) are the prediction parameters computed with Equation (3). At each iteration n, we select the next query at the point that maximizes the expected improvement:

    x_n = arg max_x EI(x)    (7)

C. Kernel hyperparameters

The advantage of the Gaussian process is that the posterior can be computed in closed form due to the linear-Gaussian properties of all the components involved. This is true for known kernel hyperparameters. Uncertainty in the hyperparameters requires a nonlinear transformation which makes the computation of the predictive distribution or the acquisition function intractable. The most common approach is to use a point estimate of the kernel hyperparameters based on the maximum of the marginal likelihood p(y | X, θ) = N(0, K) [43]. Sometimes a prior on the hyperparameters p(θ) is defined, resulting in a maximum a posteriori point estimate. Instead, we propose a fully-Bayesian treatment, where we compute the integrated predictive distribution and the integrated acquisition function:

    ŷ_q = ∫ N(μ, σ | x_q) p(θ | y, X) dθ
    ÊI(x) = ∫ EI(x) p(θ | y, X) dθ    (8)

with respect to the posterior distribution on the hyperparameters p(θ | y, X) ∝ p(y | X, θ) p(θ). Those integrals are intractable in general; thus, we use Markov chain Monte Carlo (MCMC) to generate a set of samples Θ = {θ_i}_{i=1}^N with θ_i ∼ p(θ | y, X). We use the slice sampling algorithm, which has already been used in Bayesian optimization [9], although bootstrapping could equally be used [47].
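Eq. (6) has a simple closed form; the helper below is a sketch, where μ, σ and ρ are whatever the GP prediction and the incumbent provide:

```python
import math

def expected_improvement(mu, sigma, rho):
    # EI(x) = (rho - mu) * Phi(z) + sigma * phi(z), with z = (rho - mu) / sigma, Eq. (6).
    if sigma <= 0.0:
        return max(0.0, rho - mu)                              # degenerate noise-free case
    z = (rho - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)    # phi(z)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))           # Phi(z)
    return (rho - mu) * cdf + sigma * pdf
```

Note that EI grows with σ for a fixed mean, which is how the criterion trades off exploitation (low μ) against exploration (high σ).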
Compared to point estimates of θ [2], MCMC has a higher computational cost, but MCMC has been shown to be more robust [9], [38] in Bayesian optimization. Note that, because we use a sampling distribution of θ, the predictive distribution at any point x is a sum of Gaussians, that is:

    ŷ_q = Σ_{i=1}^N N(μ_i, σ_i | x_q)
    ÊI(x) = Σ_{i=1}^N (ρ − μ_i) Φ(z_i) + σ_i φ(z_i)    (9)

where each (μ_i, σ_i, z_i) is computed using the i-th sample of the kernel hyperparameters θ_i with Equation (3).

D. Initialization

Finally, in order to avoid bias and guarantee global optimality, we rely on an initial design of p points based on Latin Hypercube Sampling (LHS), following the recommendation in the literature [2], [44], [48]. Algorithm 1 summarizes the basic steps in Bayesian optimization.

Algorithm 1 Bayesian optimization (BO)
Input: Total budget T, initialization budget p.
1: X ← x_{1:p}, y ← y_{1:p}                  ▷ Initial design with LHS
2: for t = p ... T do                        ▷ Available budget of T queries
3:   Θ ← SampleHyperparameters(X, y)
4:   x_t = arg max_x ÊI(x | X, y, Θ) from Eq. (9)
5:   y_t ← f(x_t); X ← add(x_t); y ← add(y_t)

III. NONSTATIONARY GAUSSIAN PROCESSES

Many applications of Gaussian process regression, including Bayesian optimization, are based on the assumption that the process is stationary. This is a reasonable assumption for black-box optimization, as it does not assume any extra information on the evolution of the function in the space.

Definition 1. Let f(x) ∼ GP(0, k(x, x')) be a zero-mean Gaussian process. We say that f(x) is a stationary process if the kernel function k(x, x') is stationary, that is, it can be written as k(x, x') = k(τ) where τ = x − x'. This is equivalent to saying that the process is invariant to translation, that is, the statistical properties of the set of points {x_1, ..., x_n} are the same as for the points {x_1 + h, ..., x_n + h} for any real vector h.

In practice, a process is only used for limited distances. For example, in our definition of Bayesian optimization, we already limited our search for the minimum to a compact space X. Thus, if we define b as the diameter or the longest distance enclosed in X, then, for practical reasons, the process is stationary if it is invariant for |h| ≤ b. In other circumstances, we might find that our process is translation invariant in a smaller region X_c ⊂ X, with diameter b_c. In this case, we say that the process is locally stationary for |h| ≤ b_c. In the geostatistics community, this is also called quasi-stationarity [49]. Note that a locally stationary process is also stationary in any smaller subset X_s ⊆ X_c. Intuitively, even for a nonstationary process, the smaller the region defined by distance b_c, the more homogeneous the process will be relative to larger regions, being closer to locally stationary [50]. Finally, any locally stationary process is nonstationary for any set Y ⊆ X which contains elements outside the locally stationary region, ∃ x_f ∈ Y, x_f ∉ X_c.

Bayesian optimization is a locally homogeneous process. For most acquisition functions, it has a dual behavior of global exploration and local exploitation. Typically, both sampling and uncertainty estimation end up being distributed unevenly, with many samples and small uncertainty near the local optima, and sparse samples and large uncertainty everywhere else. Furthermore, many direct applications of Bayesian optimization are inherently nonstationary. Take for example the reinforcement learning problems of Section V-D:

Example 1. Let us consider a biped robot (agent) trying to find the walking pattern (policy) that maximizes the walking speed (reward).
In this setup, there are some policies that reach undesirable states or result in a failure condition, like the robot falling or losing the upright posture. Then, the system returns a null reward or an arbitrary penalty. In cases where finding a stable policy is difficult, the reward function may end up being almost flat, except for a small region of successful policies where the reward is actually informative in order to maximize the speed. The reward function is nonstationary precisely because of this dual behavior between failure and success. Modeling this kind of function with Gaussian processes requires kernels with different length-scales for the flat/non-flat regions, or specially designed kernels to capture that behavior.

There have been several attempts to model nonstationary functions with Gaussian processes, for example, the use of specific nonstationary kernels [43], space partitioning [51], Bayesian treed GP models [52] or projecting (warping) the input space to a stationary latent space [53]. The idea of treed GPs was used in Bayesian optimization combined with an auxiliary local optimizer [5]. A version of the warping idea was applied to Bayesian optimization [54]. Later, Assael et al. [39] built treed GPs where the warping model was used in the leaves. These methods try to model nonstationarity in a global way. For Bayesian optimization, the best fit of a global model might end up being inaccurate near the minimum, where we require more accuracy. Before explaining our approach to nonstationarity, we review the most common stationary kernels and the effect of the hyperparameters on the optimization process. More details can be found in Appendix B.

A. Stationary kernels

The most popular stationary kernels include the squared exponential (SE) kernel and the Matérn kernel family:

    k_{SE}(x, x') = exp(−r²/2)    (10)
    k_{M,ν}(x, x') = (2^{1−ν} / Γ(ν)) (√(2ν) r)^ν K_ν(√(2ν) r)    (11)

where r² = (x − x')^T Λ (x − x') with some positive semidefinite Λ. Most frequently, Λ = diag(θ_l^{−1}), where θ_l represents the length-scale hyperparameters that capture the smoothness or variability of the function. If θ_l is a scalar, the function is isotropic. If θ_l is a vector, we can estimate a different smoothness per dimension, which also represents the relevance of the data in that dimension; the latter case is therefore called automatic relevance determination (ARD) [43]. In the Matérn kernel, K_ν is a modified Bessel function [43]. The Matérn kernel is usually computed for values of ν that are half-integers, ν = p + 1/2 with p a non-negative integer, because the function then becomes simpler. For example:

    k_{M,ν=1/2}(x, x') = exp(−r)    (12)
    k_{M,ν=3/2}(x, x') = exp(−√3 r)(1 + √3 r)    (13)
    k_{M,ν=5/2}(x, x') = exp(−√5 r)(1 + √5 r + (5/3) r²)    (14)

The value of ν is related to the smoothness of the functions because it determines the differentiability of the underlying process: the process g(x) defined by the Matérn kernel k(·, ·) is η-times mean square differentiable if and only if ν > η [43]. Furthermore, for ν → ∞, we obtain the SE kernel of equation (10). In Bayesian optimization, the Matérn kernel with ν = 5/2 is frequently used for its performance in many benchmarks, because it can properly represent smooth and irregular functions without imposing excessive restrictions like the infinite differentiability of the SE kernel.

1) The effect of length-scales: The analysis of the kernel length-scale plays an important role for optimization.
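The half-integer Matérn forms of Eqs. (12)-(14) can be checked numerically; here r is the scaled distance of Eq. (11), with the length-scale already folded in:

```python
import math

def matern(r, nu):
    # Half-integer Matérn kernels k_{M,nu}(r) from Eqs. (12)-(14).
    if nu == 0.5:
        return math.exp(-r)
    if nu == 1.5:
        return math.exp(-math.sqrt(3.0) * r) * (1.0 + math.sqrt(3.0) * r)
    if nu == 2.5:
        return math.exp(-math.sqrt(5.0) * r) * (1.0 + math.sqrt(5.0) * r + (5.0 / 3.0) * r * r)
    raise ValueError("only nu in {1/2, 3/2, 5/2} implemented here")
```

All three equal 1 at r = 0 and decay with distance; for the same r, a larger ν stays closer to 1, reflecting the smoother process. Since r scales inversely with the length-scale, a smaller length-scale yields a smaller correlation, and hence a larger predictive variance, at the same input distance, which is the effect discussed next.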
Thus, having multiple length-scales in a single process opens new possibilities besides modeling nonstationarity. Small values of θ_l are more suitable to capture signals with high-frequency components, while large values of θ_l result in a model for low-frequency signals or flat functions. Therefore, for the same distance between points, a kernel with a smaller length-scale will result in a higher predictive variance, and the exploration will be more aggressive. This idea was previously explored by Wang et al. [55] by forcing smaller scale parameters to improve the exploration. More formally, in order to achieve no-regret convergence to the minimum, the target function must be an element of the reproducing kernel Hilbert space (RKHS) characterized by the kernel k(·, ·) [44], [17]. For a set of kernels like the SE or Matérn, it can be shown that:

Proposition 1. Given two kernels k_l and k_s with large and small length-scale hyperparameters respectively, any function f in the RKHS characterized by the kernel k_l is also an element of the RKHS characterized by k_s [55].

Thus, using k_s instead of k_l is safer in terms of guaranteeing no-regret convergence. However, if the small kernel is used everywhere, it might result in unnecessary sampling of smooth areas. Instead, we should combine both behaviors. The idea of combining kernels with different length-scales was previously hinted at as a potential advantage for Bayesian optimization [26], but this is the first work to exploit its advantages in practice.

B. Nonstationary combined kernels

Our approach to nonstationarity is based on previous works where the input space is partitioned into different regions [51], [56]. The resulting GP is the linear combination of local GPs: ξ(x) = Σ_j λ_j(x) ξ_j(x). This approach was inspired by the concept of quasi-stationarity from geostatistics [49], [50].
This is equivalent to having a single GP with a combined kernel function of the form k(x, x' | θ) = Σ_j λ_j(x) λ_j(x') k_j(x, x' | θ_j). The final kernel is a valid kernel function, as the sum and vertical scaling of kernels are also valid kernel functions [43]. Each local kernel k_j has its own specific hyperparameters θ_j and statistical properties associated with the region, making the final kernel nonstationary even if the local kernels are stationary. A related approach of additive GPs was used by Kandasamy et al. to optimize high-dimensional functions under the assumption that the actual function is a combination of lower-dimensional functions [48].

Fig. 1. Representation of the Spartan kernel in SBO. In this case, the kernel is just a combination of a single local and a global kernel. Typically, the local and global kernels have a small and large length-scale, respectively. The influence of each kernel is represented by the normalized weight at the bottom of the plot. Note how the kernel with small length-scale produces larger uncertainties, which is an advantage for fast exploitation, but it can perform poorly for global exploration as it tends to sample equally everywhere. On the other hand, the kernel with large length-scale provides a better global estimate, which improves focalized global exploration.

IV. SPARTAN BAYESIAN OPTIMIZATION

We present our new method, SBO, which combines a set of moving local kernels and a global kernel. This allows different control of the kernel length-scales in a local and global manner. Therefore, it is intrinsically able to deal with nonstationarity, but it also allows different local and global behaviors for optimization, as discussed in Section III-A1.
For example, it can be used to improve local exploration towards the optimum without increasing global exploration. The intuition behind SBO is the same as that of many acquisition functions in Bayesian optimization: the aim of the surrogate model is not to approximate the target function precisely at every point, but to provide information about the location of the minimum. Many optimization problems are difficult due to the fact that the region near the minimum is different from the rest of the function, e.g., it has higher variability than the rest of the space, like the function in Fig. 2. The method also allows a better estimate of the hyperparameters of the local kernels without resulting in global overconfidence in the hyperparameters. In those cases, SBO greatly improves the performance of the state of the art in Bayesian optimization.

We propose the combination of a kernel with global influence with several kernels with moving local influence. The influence is determined by a weighting function (see Figure 1). The influence of the local kernels is centered on a single point with multiple diameters, creating a funnel structure. We have called this kernel the Spartan kernel:

    k_S(x, x' | θ_S) = λ_g(x) λ_g(x') k_g(x, x' | θ_g) + Σ_{l=1}^M λ_l(x | θ_p) λ_l(x' | θ_p) k_l(x, x' | θ_l)    (15)

where the weighting function for the local kernel λ_l(x | θ_p) includes the parameters to move the center of the local kernels along the input space. In order to achieve a smooth interpolation between regions, each region has an associated weighting function ω_j(x), with its maximum in the corresponding region j and decreasing its value with distance to region j [56]. Then, we can set λ_j(x) = sqrt( ω_j(x) / Σ_p ω_p(x) ). The unnormalized weights ω are defined as:

    ω_g = N(ψ, I σ_g²)
    ω_l = N(θ_p, I σ_l²)    ∀ l = 1 ... M    (16)

where ψ and θ_p can be seen as the centers of the influence region of each kernel, while σ_g and σ_l are related to the diameter of the area of influence. Note that all the local kernels share the same position (mean value) but different sizes (variances), generating a funnel-like structure. The Spartan kernel with a single local kernel can be seen in Fig. 1.

a) Global kernel parameters: Unless we have prior knowledge of the function, the parameters of the global kernel are mostly irrelevant. In most applications, we can use a uniform distribution, which can be easily approximated by a large σ_g². For example, assuming a normalized input space X = [0, 1]^d, we can set ψ = [0.5]^d and σ_g² = 10.

b) Local kernel parameters: For the local kernels, we estimate the center of the funnel structure θ_p based on the data gathered. We propose to consider θ_p as part of the hyperparameters of the Spartan kernel, which also include the parameters of the local and global kernels, that is,

    θ_S = [θ_g, θ_{l_1}, ..., θ_{l_M}, θ_p]    (17)

The area of influence of each local kernel could also be adapted by including the terms {σ_l²}_{l=1}^M as part of the kernel hyperparameters θ_S. In practice, we found that the algorithm was learning the trivial case of very small values of σ_l² in many experiments. As discussed in Section III, for a small enough region, the behavior is stationary. However, a very small region ends up with not enough data points to properly learn the length-scale hyperparameters. Haas [50] used an automatic sizing strategy by combining multiple heuristics and a minimum number of samples inside. Because we use a funnel structure, we found it simpler to fix the values of σ_l² at different sizes. As can be seen in Section V, this simple heuristic provides excellent results. Alternatively, we can define σ_l² in terms of a fixed number of samples inside.
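A sketch of the Spartan kernel of Eqs. (15)-(16) in one dimension with a single local kernel (M = 1); the SE base kernels and all numeric values below are illustrative choices, not the paper's settings:

```python
import math

def se(x, xp, ell):
    # Stationary SE base kernel with length-scale ell.
    return math.exp(-0.5 * (x - xp) ** 2 / ell ** 2)

def weight(x, center, var):
    # Unnormalized Gaussian influence weight, Eq. (16).
    return math.exp(-0.5 * (x - center) ** 2 / var)

def spartan_kernel(x, xp, theta_p, psi=0.5, var_g=10.0, var_l=0.05,
                   ell_g=0.5, ell_l=0.05):
    # k_S(x, x') = lam_g(x) lam_g(x') k_g + lam_l(x) lam_l(x') k_l, Eq. (15),
    # with lam_j(x) = sqrt(w_j(x) / sum_p w_p(x)).
    def lam(z):
        wg, wl = weight(z, psi, var_g), weight(z, theta_p, var_l)
        s = wg + wl
        return math.sqrt(wg / s), math.sqrt(wl / s)
    lg_x, ll_x = lam(x)
    lg_xp, ll_xp = lam(xp)
    return lg_x * lg_xp * se(x, xp, ell_g) + ll_x * ll_xp * se(x, xp, ell_l)
```

Because λ_g² + λ_l² = 1 pointwise, k_S(x, x) = 1 everywhere, while the correlation between nearby points decays much faster inside the funnel (near θ_p) than far from it.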
This method has the advantage that, during exploitation, as the number of local samples increases, the funnel gets narrower, allowing better local refinement. As commented in Section II, when new data is available, all the parameters are updated using MCMC. Therefore, the position of the local kernels θ_p is moved each iteration to represent the posterior. Due to the sampling behavior in Bayesian optimization, we found that it intrinsically tends to move towards the more densely sampled areas in many problems, which correspond to the location of the function minima. Furthermore, as we have N MCMC samples, there are N different positions for the funnel of local kernels. On the other hand, for ARD local and global kernels with one length-scale per dimension d, the size of θ_S is (M + 2)d. Thus, the cost of the MCMC is O(N(M + 2)d). Nevertheless, this cost is still cheaper than alternative methods (see Section V-F).

Fig. 2. Left: Exponential 2D function from Gramacy [52]. The path below the surface represents the location of the local kernel as sampled by MCMC. Clearly, it converges to the nonstationary section of the function. For visualization, the path is broken into colors by the order in the path (blue → black → green → red).

Spartan Bayesian Optimization is simple to implement from standard Bayesian optimization: it only requires modifying the kernel function and the hyperparameter estimation. However, as we will see in Section V, the results show a large gain in terms of convergence and sample efficiency for many problems.
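The alternative of sizing σ_l² by a fixed number of samples inside the funnel might look as follows; the exact rule (the n nearest samples falling within one standard deviation of the local weight) is our assumption for illustration, not the paper's formula:

```python
def sigma_l2_from_samples(X, theta_p, n_inside=5):
    # Illustrative heuristic (our assumption, not the paper's exact rule):
    # choose sigma_l^2 so that the n_inside nearest samples to the funnel
    # center theta_p lie within one standard deviation of the local weight.
    dists = sorted(abs(x - theta_p) for x in X)
    n = min(n_inside, len(dists))
    return dists[n - 1] ** 2
```

As sampling concentrates around θ_p during exploitation, the nearest-sample radius shrinks and the funnel narrows, matching the refinement behavior described above.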
It is important to note that, although we have implemented SBO relying on Gaussian processes and expected improvement, the Spartan kernel also works with other popular kernel-based models such as Student-t processes [57], [58] and treed Gaussian processes [52], and with other criteria such as the upper confidence bound [17], relative entropy [59], [26], etc. A similar approach of using a moving local kernel was previously used in geostatistics for terrain modeling [50]. In that case, the local kernel was always centered at the query point, creating a sliding-window approach to prediction. Therefore, even though the model was nonstationary, there was no different behavior in different regions of the space.

V. EVALUATION AND RESULTS

We have selected a variety of benchmarks from different applications to test our method. The purpose is twofold. First, we highlight the broad applicability of Bayesian optimization in autonomous systems: design, control, software tuning, etc. Second, we show that in all those setups our method outperforms the state of the art in Bayesian optimization.

1) We have taken several well-known functions for testing global optimization algorithms. The set of functions includes both stationary and nonstationary problems.
2) For the algorithm tuning and perception problems, we have selected a set of machine learning tuning benchmarks. They include large image classification and clustering problems.
3) For the control/reinforcement learning problems, we have selected some classic problems and a highly complex benchmark to control a hovering aerobatic helicopter.
4) Finally, we address the automatic design and embodiment problem with the optimal design of a wing profile for a UAV. This is a highly complex scenario, due to the chaotic nature of fluid dynamics; hence this kind of problem is ubiquitous in the global optimization and evolutionary computation literature.
A. Implementation details

For evaluation purposes and to highlight the robustness of SBO, we have simplified the funnel structure to a single local and global kernel as in Fig. 1. We also took the simpler approach of fixing the variance of ω_l. We found that a single value of σ_l² = 0.05 was robust enough in all the experiments once the input space was normalized to the unit hypercube. For the experiments reported here, we used a Gaussian process with unit mean function as in [2]. Although SBO allows any combination of local and global kernels, for the purpose of evaluation we used the Matérn kernel from equation (14) with automatic relevance determination for both (local and global) kernels. Furthermore, the kernel hyperparameters were initialized with the same prior for both kernels. Therefore, we let the data determine which kernel has the smaller length-scale. We found that the typical result is the behavior from Fig. 1. However, in some problems, the method may learn a model where the local kernel has a larger length-scale (i.e., smoother and with smaller variance) than the global kernel, which may also improve convergence on plateau-like functions. Besides, if the target function is stationary, the system might end up learning a similar length-scale for both kernels, thus being equivalent to a single kernel. We can say that standard BO is a special case of SBO where the local and global kernels are the same. Given that, for a single Matérn kernel with ARD, the number of kernel hyperparameters is the dimensionality of the problem d, the number of hyperparameters for the Spartan kernel in this setup is 3d. As we will see in the experiments, this is the only drawback of SBO compared to standard BO, as it requires running MCMC in a higher-dimensional space, which results in higher computational cost. However, because SBO is more sample efficient, the extra computational cost can easily be compensated by a reduced number of samples.
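The Matérn 5/2 kernel with automatic relevance determination used for both components has a standard closed form. A dependency-free sketch (an illustration, not the BayesOpt implementation) is:

```python
import math

def matern52_ard(x1, x2, lengthscales, sigma2=1.0):
    """Matern 5/2 kernel with automatic relevance determination (ARD):
    one length-scale per input dimension."""
    # Squared distance, scaled per dimension by its length-scale.
    r2 = sum(((a - b) / l) ** 2 for a, b, l in zip(x1, x2, lengthscales))
    r = math.sqrt(5.0 * r2)
    return sigma2 * (1.0 + r + r * r / 3.0) * math.exp(-r)
```

With ARD, a d-dimensional problem contributes d length-scales per kernel, which is why the Spartan setup above (local, global, and the d-dimensional center θ_p) has 3d hyperparameters.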
For the MCMC part, we used the slice sampling algorithm with a small number of samples (10), while trying to decorrelate every resample with larger burn-in periods (100 samples), as in Snoek et al. [9]. We implemented Spartan Bayesian Optimization (SBO) using the BayesOpt library [38] as codebase. For comparison, we also implemented the input warping from Snoek et al. [54], which we call WARP. To our knowledge, this is the only Bayesian optimization algorithm that has dealt with nonstationarity using Gaussian processes in a fully correlated way. However, as presented in Section V-F, WARP is much more expensive than SBO in terms of computational cost. For the warping function, we used the cumulative density function of a Beta distribution, or Beta CDF. The (α, β) parameters of the Beta CDF were also computed using MCMC. We also ran some preliminary tests with random forests (RF) [4] and treed GPs [5], [39], which for clarity we have not included in the plots. For RF, we verified previous results in the literature which show their poor performance for continuous variables compared to GPs [38], [60]. For treed GPs, we found that they required a larger nugget or noise term σ_n² to avoid numerical instability. This is primarily due to the reduced number of samples per leaf and the reduced global correlation. Increasing the noise term considerably reduced the accuracy of the results. Furthermore, single nonstationary GPs like the input warping method or our Spartan kernel can be used as tree leaves [39]. Thus, our method can be used as an enhancement of treed GPs. All plots are based on 20 experiments using common random numbers between methods. As commented in Section II, the number of function evaluations in each plot includes the initial design generated using Latin hypercube sampling.
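The slice sampler used for the hyperparameter posteriors is a standard algorithm. A univariate sketch with stepping-out and shrinkage (a toy illustration of the technique, not the BayesOpt code; the target density here is a placeholder) is:

```python
import math
import random

def slice_sample(log_density, x0, n_samples, burn_in=100, width=1.0, seed=0):
    """Univariate slice sampling with stepping-out and shrinkage."""
    rng = random.Random(seed)
    x, samples = x0, []
    for i in range(burn_in + n_samples):
        # Draw the auxiliary slice level under the density at x.
        log_y = log_density(x) + math.log(1.0 - rng.random())
        # Step out an interval around x until it brackets the slice.
        left = x - width * rng.random()
        right = left + width
        while log_density(left) > log_y:
            left -= width
        while log_density(right) > log_y:
            right += width
        # Shrink the interval until a point inside the slice is found.
        while True:
            xp = rng.uniform(left, right)
            if log_density(xp) >= log_y:
                x = xp
                break
            if xp < x:
                left = xp
            else:
                right = xp
        if i >= burn_in:
            samples.append(x)
    return samples
```

In practice, each hyperparameter of θ_S is resampled this way, so the cost grows with the dimensionality of the hyperparameter space, which is the main computational difference between the methods (see Section V-F).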
B. Optimization benchmarks

We evaluated the algorithms on a set of well-known test functions for global optimization, both smooth and with sharp drops (see Appendix C). Figs. 3 and 4 show the results for the optimization benchmarks. Clearly, functions with sharp drops or flat regions benefit from nonstationary methods, while in the case of smooth surfaces the advantage is not as important. Bayesian optimization is well known to behave extremely well for smooth functions with wide valleys like the Branin and Hartmann functions. In this case, even plain BO performs well in general; therefore, there is barely room for improvement. However, even in this situation, SBO is equal to or better than stationary BO. For the Branin function, there is no penalty for the use of SBO. However, the WARP method may converge more slowly due to its extra complexity. For the Hartmann function, the nonstationary methods (SBO and WARP) achieve slightly better results and are more robust, as the variance in the plot is also reduced. We also use this case to show the potential of funneled kernels by adding an intermediate local kernel with σ_l = 0.1. For nonstationary functions with sharp drops or flat regions, our method (SBO) provides excellent results. As can be seen in Fig. 3, in the case of the Gramacy function, it reaches the optimum in fewer than 25 iterations (35 samples) for all tests. Because the function is clearly nonstationary, the WARP method outperforms standard BO, but its convergence is much slower than SBO's. The Michalewicz function is known to be a hard benchmark in global optimization. The function has a parameter to define the dimensionality d and the steepness m. It has many local minima (d!) and steep drops. We used d = 10 and m = 10, resulting in 3628800 minima with very steep edges. For this problem, SBO clearly outperforms the rest of the methods by a large margin.
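The Michalewicz benchmark has a standard closed form, f(x) = -Σ_i sin(x_i) sin(i x_i²/π)^{2m} on [0, π]^d, where m controls the steepness. A minimal sketch (not taken from the paper's experimental code) is:

```python
import math

def michalewicz(x, m=10):
    """Michalewicz test function on [0, pi]^d; steepness grows with m."""
    return -sum(
        math.sin(xi) * math.sin((i + 1) * xi ** 2 / math.pi) ** (2 * m)
        for i, xi in enumerate(x)
    )
```

With d = 10 and m = 10, as used above, the function has very narrow, steep valleys, which is exactly the kind of nonstationary landscape the Spartan kernel targets.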
C. Machine learning hyperparameter tuning

Our next set of experiments was based on a set of problems for the automatic tuning of machine learning algorithms. The results of the optimization can be seen in Fig. 5. The first problem consists of tuning the 4 parameters of a logistic regression classifier to recognize handwritten digits from the MNIST dataset [40]. As can be seen in Fig. 5, it is an easy problem for Bayesian optimization. Even the standard BO method was able to reach the minimum in fewer than 50 function evaluations. In this case, the warped method was the fastest, with almost 20 evaluations. The proposed method had similar performance in terms of convergence, but with one order of magnitude lower execution time (see Section V-F). The second problem is based on the setup defined in Snoek et al. [9] for learning topics of Wikipedia articles using online Latent Dirichlet Allocation (LDA). It requires tuning 3 parameters. Both standard BO and the WARP method got stuck, while our method was able to escape from the local minimum and outperform the other methods by a large margin. Again, SBO required a much lower computational cost than WARP. Finally, we evaluated the HP-NNET problem, based on a deep neural network by Bergstra et al. [40] to classify a modified MNIST dataset. In this case, the handwritten digits were arbitrarily rotated and combined with random background images as distractors [40]. The new database is called MRBI, for MNIST Rotated and with Background Images. In this case, due to the high dimensionality and heterogeneity of the input space (7 continuous + 7 categorical parameters), we tested two approaches. First, we applied a single fully-correlated model for all the variables. The categorical variables were mapped to integer values computed by rounding the query values. In this case, similarly to the Hartmann function, our method (SBO) is more precise and robust, having a lower average case and smaller variance.
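The rounding of continuous queries to categorical values can be done in several ways; the paper only states that rounding was used, so the scaling below is a hypothetical but common scheme (normalized queries in [0, 1] mapped to k choices per dimension):

```python
def decode_query(x, n_categories):
    """Map continuous BO query values in [0, 1] to categorical indices.

    One plausible rounding scheme; the exact mapping in the paper is
    not specified beyond "rounding the query values".
    """
    return [min(int(round(xi * (k - 1))), k - 1)
            for xi, k in zip(x, n_categories)]
```

For example, with 7 categorical dimensions of 7 choices each (as in HP-NNET/MRBI), each dimension of the query is rounded to one of the indices 0..6.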
We also tested a hierarchical model (see Appendix E for the details). For these benchmarks, a single evaluation can take hours or days of wall-time as a result of the complexity of the training process and the size of the datasets. In order to simplify the comparison and run more tests, we used the surrogate benchmarks provided by Eggensperger et al. [60]. They built surrogate functions of the actual behavior of the tuning process based on real trials of the algorithms with real datasets. Each surrogate function can be evaluated in seconds, compared to the actual training time of each machine learning algorithm. The authors provide different surrogates [60]. We selected Gradient Boosting, as it provides the lowest prediction error (RMSE) with respect to the actual data from each problem. We explicitly rejected the Gaussian process surrogate to avoid the advantage of perfect modeling.

D. Reinforcement learning and control

We evaluated SBO on several reinforcement learning problems. Reinforcement learning deals with the problem of how artificial agents perform optimal behaviors. An agent is defined by a set of variables s_t that capture the current state and configuration of the agent in the world. The agent can then perform an action a_t that modifies the agent state, s_{t+1} = T(s_t, a_t). At each time step, the agent collects the reward signal associated with its current state and action, r_t(s_t, a_t). The actions are selected according to a policy function that represents the agent behavior, a_t = π(s_t). The states, actions and transitions are modeled using probabilities to deal with uncertainty. Thus, the objective of reinforcement learning is to find the optimal policy π* that maximizes the
expected accumulated reward:

π* = arg max_π E_{s_{0:T}, a_{1:T}} [ Σ_{t=1}^{T} r_t ]    (18)

The problems we have selected are all episodic problems with a finite time horizon. However, the methodology can be extended to an infinite horizon by adding a discount factor γ^t. The expectation is evaluated using Monte Carlo, by sampling several episodes for a given policy. Reinforcement learning algorithms usually rely on variants of the Bellman equation to optimize the policy step by step, considering each instantaneous reward r_t separately. Some algorithms also rely on partial or total knowledge of the transition model s_{t+1} = T(s_t, a_t) in advance. Direct policy search methods [61] tackle the optimization problem directly, considering equation (18) as a stochastic optimization problem.

Fig. 3. Optimization benchmarks. From left to right: results on the Branin-Hoo, Gramacy and Michalewicz functions. For smooth functions (Branin), the advantage of the nonstationary methods (SBO and WARP) is minimal, although it is significant in the Hartmann function. For the nonstationary functions (Gramacy and Michalewicz) there is clearly an advantage in using nonstationary methods, with SBO taking the lead in every test.

Fig. 4. Hartmann 6D function results. Being a smooth function, the advantage of nonstationary methods is smaller. Still, they provide better results in general. Those results are improved with a Spartan kernel with a funnel structure (local σ_l = 0.05, local σ_l = 0.1 and global), named SBO-2.
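Under this formulation, each query of the optimizer evaluates a policy by simulating episodes and summing the rewards, as in equation (18). A schematic sketch (the `policy` and `step` callables are placeholders for a concrete simulator, not the paper's code) is:

```python
def episode_return(policy, step, s0, T):
    """Total reward of one episode of length T (finite horizon, no discount)."""
    s, total = s0, 0.0
    for _ in range(T):
        a = policy(s)          # a_t = pi(s_t)
        s, r = step(s, a)      # s_{t+1} = T(s_t, a_t) and r_t(s_t, a_t)
        total += r
    return total

def expected_return(policy, step, s0, T, n_episodes=20):
    """Monte Carlo estimate of the expectation in equation (18)."""
    returns = [episode_return(policy, step, s0, T) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```

Note that only the accumulated return is needed, not the individual rewards, states or transition model, which is what makes the black-box treatment possible.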
The use of Bayesian optimization for reinforcement learning was previously called active policy search [14] for its connection with active learning and the way samples are carefully selected based on the current information. The main advantage of using Bayesian optimization to compute the optimal policy is that it can be done with very little information. In fact, as soon as we are able to simulate scenarios and return the total reward Σ_{t=1}^{T} r_t, we do not need access to the dynamics, the instantaneous reward or the current state of the system. There is no need for space or action discretization, building complex features or tile coding [62]. For many problems, a simple, low-dimensional controller is able to achieve state-of-the-art performance if properly optimized. A frequent issue when applying general-purpose optimization algorithms to policy search is the occurrence of failure states or scenarios, which produce large discontinuities or flat regions due to penalties. This is opposed to the behavior of the reward near the optimal policy, where small variations of a suboptimal policy can considerably change the performance achieved. Therefore, the resulting reward function presents a nonstationary behavior with respect to the policy. We have compared our method on three well-known benchmarks with different levels of complexity. The first problem is learning the controller of a three-limb robot walker presented in Westervelt et al. [63]. The controller modulates the walking pattern of a simple biped robot. The desired behavior is a fast upright walking pattern; the reward is based on the walking speed, with a penalty for not maintaining the upright position. The dynamic controller has 8 continuous parameters. The walker problem was already used as a Bayesian optimization benchmark [59]. The second problem is the classic mountain car problem [62].
We have used a simple perceptron policy with 7 parameters to compute the action based on position, speed and acceleration (see Appendix D). The third problem is the hovering helicopter from the RL-competition (http://www.rl-competition.org/). This is one of the most challenging scenarios of the competition, having been presented in all its editions. This problem is based on a simulator of the XCell Tempest aerobatic helicopter. The simulator model was learned from actual data of the helicopter using apprenticeship learning [64]. The model was then used to learn a policy for the real robot. The simulator includes several difficult wind conditions. The state space is 12D (position, orientation, translational velocity and rotational velocity) and the action is 4D (forward-backward cyclic pitch, lateral cyclic pitch, main collective pitch and tail collective pitch). The reward is a quadratic function that penalizes both the state error (inaccuracy) and the action (energy). Each episode is run for 10 seconds (6000 control steps). If the simulator enters a terminal state (crash), a large negative reward is given, corresponding to getting the most negative reward achievable for the remaining time.

Fig. 5. Machine learning algorithm tuning. From left to right: logistic regression (4D continuous), online LDA (3D continuous), and deep neural network (HP-NNET with the MRBI dataset, 7D continuous, 7D categorical). In all cases SBO's performance was the best, except for the logistic regression problem, where the WARP method slightly outperformed SBO.

Fig. 6. Total reward for the three-limb walker (left), the mountain car and the hovering helicopter problem. For the first problem, SBO is able to achieve a higher reward, while the other methods get stuck in a local maximum. For the mountain car, SBO achieves maximum performance in all trials after just 27 policy trials (17 iterations + 10 initial samples). For the helicopter problem, BO and WARP have slow convergence, because many policies result in an early crash, providing almost no information. However, SBO is able to exploit good policies and quickly improve the performance.

We also used a weak baseline controller that was included with the helicopter model. This weak controller is a simple linear policy with 12 parameters (weights). In theory, this controller is enough to avoid crashing, but it is not very robust. We show how this policy can be easily improved within a few iterations. In this case, initial exploration of the parameter space is especially important, because the number of policies that do not crash within a few control steps is very small. For most policies, the reward is the most negative reward achievable. Thus, in this case, we have used Sobol sequences for the initial samples of Bayesian optimization.
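A deterministic low-discrepancy initial design of this kind can be sketched without dependencies using a Halton sequence (the paper uses Sobol sequences; Halton is shown here only because it fits in a few lines of plain Python, but it serves the same purpose of a deterministic, space-filling design in the unit hypercube):

```python
def van_der_corput(n, base):
    """Radical inverse of n in the given base (deterministic, in [0, 1))."""
    q, denom = 0.0, 1.0
    while n:
        denom *= base
        n, rem = divmod(n, base)
        q += rem / denom
    return q

def halton(n_samples, dim):
    """First n_samples points of the Halton sequence in [0, 1)^dim."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]  # enough for 12D
    assert dim <= len(primes)
    return [[van_der_corput(i, primes[j]) for j in range(dim)]
            for i in range(1, n_samples + 1)]
```

Because the sequence is deterministic, every algorithm and every trial starts from exactly the same initial policies, which is the property exploited above.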
Since the Sobol samples are deterministic, we guarantee that the same number of non-crashing policies is sampled for every trial and every algorithm. We also increased the number of initial samples to 40. Fig. 6 shows the performance for the three-limb walker, the mountain car and the helicopter problems. In all cases, the results obtained by SBO were more efficient in terms of number of trials and accuracy with respect to standard BO and WARP. Furthermore, we found that the results of SBO were comparable to those obtained by popular reinforcement learning solvers like SARSA [62], but with much less information and prior knowledge about the problem. For the helicopter problem, other solutions found in the literature require a larger number of scenarios/trials to achieve similar performance [65], [66].

E. Automatic wing design using CFD software

Computational fluid dynamics (CFD) software is a powerful tool for the design of mechanical and structural elements subject to interaction with fluids, as in aerodynamics, hydrodynamics or heating systems. Compared to physical design and testing in wind tunnels or artificial channels, the cost of simulation is almost negligible. Because simulated redesign is simple, CFD methods have been used for the autonomous design of mechanical elements following principles from experimental design. This simulation-based autonomous design is a special case of the design and analysis of computer experiments (DACE) [67]. Nevertheless, an average CFD simulation can still take days or weeks of CPU time. Therefore, the use of a sample-efficient methodology like BO is mandatory. We believe this methodology has enormous potential in the field of autonomous design. The experiment that we selected is the autonomous design of an optimal wing profile for a UAV [16].

Fig. 7. Vorticity plots of two different wing shapes. The left wing barely affects the trajectory of the wind, resulting in not enough lift, while the right wing is able to provide enough lift with minimum drag.

Fig. 8. Results for the wing design optimization. Each plot is based on 10 runs.

We simulated a wind tunnel using the xFlow(TM) CFD software. The objective of the experiment was to find the shape of the wing that minimizes the drag while maintaining enough lift. We assumed a 2D simulation of the fluid along the profile of the wing. This is a reasonable and common assumption for wings with a large aspect ratio (large span vs. chord), and it considerably reduces the computation time of each simulation, from days to hours. For the parametrization of the profile, we used Bezier curves; however, note that Bayesian optimization is agnostic to the geometric parametrization, and any other parametrization could also be used. The Bezier curve of the wing was based on 7 control points, which resulted in 14 parameters. However, adding some physical and manufacturing restrictions resulted in 5 degrees of freedom. Directly minimizing the drag presents the problem that the best solutions tend to generate flat wings that do not provide enough lift for the plane. Fluid dynamics also has chaotic properties: small changes in the conditions may produce a large variability in the outcome. For example, the flow near the trailing edge can transition from a laminar to a turbulent regime due to a small change in the wing shape. Thus, the resulting forces are completely different, increasing the drag and reducing the lift. Fig. 7 shows the comparison of a wing with no lift and the optimal design. Although there has been some recent work on Bayesian optimization with constraints [68], [69], we decided to use the simpler approach of adding a penalty, with two purposes.
First, the input space remains unconstrained, improving the performance of the optimization of the acquisition function. Second, safety is increased, because points on the constraint boundary also get partly penalized as a result of GP smoothing. Under these conditions, Fig. 8 shows how both BO and WARP fail to find the optimum wing shape. However, SBO finds a better wing shape, and it does so within few iterations.

F. Computational cost

Table III shows the average CPU time of the different experiments for the total number of function evaluations. Due to the extensive evaluation, we had to rely on different machines for running the experiments, although all the algorithms for a single experiment were compared on the same machine. Thus, CPU times of different experiments might not be directly comparable. The main difference between the three methods in terms of the algorithm is within the kernel function k(·,·), which includes the evaluation of the weights in SBO and the evaluation of the warping function in WARP. The rest of the algorithm is equivalent; that is, we reused the same code. After some profiling, we found that the time differences between the algorithms were mainly driven by the dimensionality of the hyperparameter space, because MCMC was the main bottleneck. Note that, for all algorithms and experiments, we used slice sampling as recommended by the authors of WARP [54]. On the one hand, the likelihood of the parameters of the Beta CDF was very narrow, and the slice sampling algorithm spent many iterations before finding a valid sample. Some methods could be applied to alleviate this issue, such as hybrid Monte Carlo or sequential Monte Carlo samplers, but that remains an open problem beyond the scope of this paper. On the other hand, the evaluation of the Beta CDF was much more involved and computationally expensive than the evaluation of the Matérn kernel or the Gaussian weights of the Spartan kernel.
That extra cost became an important factor, as the kernel function is called billions of times during each Bayesian optimization run. It is important to note that, although Bayesian optimization is intended for expensive functions and the cost per iteration is negligible in most applications (for example, CFD simulations, deep learning algorithm tuning, etc.), the difference between methods could mean hours of CPU time for a single iteration, changing the range of potential applications.

TABLE III
AVERAGE TOTAL CPU TIME IN SECONDS

Time (s)  Gramacy  Branin  Hartmann  Michalewicz  Walker  Mountain Car  Log Reg  Online LDA  HPNNET
#dims           2       2         6           10       8             7        4           3      14
#evals         60      40        70          210      40            40       50          70     110
BO            120     171       460         8360      47            38       28         112     763
SBO          2481    3732     10415       225313     440           797      730        2131   28442
WARP        13929   28474    188942      4445854   20271         18972     9149       21299  710974

VI. CONCLUSIONS

In this paper, we have presented a new algorithm called Spartan Bayesian Optimization (SBO), which combines local and global kernels in a single adaptive kernel to deal with the exploration/exploitation trade-off and the inherent nonstationarity of the search process during Bayesian optimization. We have shown that this new kernel increases the convergence speed and reduces the number of samples in many applications. For nonstationary problems, the method provides excellent results compared to standard Bayesian optimization and to the state-of-the-art method for dealing with nonstationarity. Furthermore, SBO also performs well in stationary problems by improving local refinement while retaining global exploration capabilities. We evaluated the algorithm extensively in standard optimization benchmarks, automatic wing design and machine learning applications, such as hyperparameter tuning problems and classic reinforcement learning scenarios. The results have shown that SBO outperforms the state of the art in Bayesian optimization in all the experiments and tests: it requires fewer samples or achieves a smaller optimization gap. In addition, we have shown that SBO was much more efficient in terms of CPU usage than other nonstationary methods for Bayesian optimization.

The results in reinforcement learning also highlight the potential of the active policy search paradigm for reinforcement learning. Our method is especially suitable for that paradigm. This fact opens a new opportunity for Bayesian optimization as an efficient and competitive reinforcement learning solver, without relying on the dynamics of the system, instantaneous rewards or discretization of the different spaces.

ACKNOWLEDGMENT

The authors would like to thank Javier García-Barcos for his help with the CFD simulator and Eduardo Montijano for his valuable comments. This project has been funded in part by project DPI2015-65962-R (MINECO/FEDER, UE).

REFERENCES

[1] B. E. Stuckman and E. E. Easom, "A comparison of Bayesian/sampling global optimization techniques," IEEE Transactions on Systems, Man, and Cybernetics, vol. 22, no. 5, pp. 1024-1032, 1992.
[2] D. R. Jones, M. Schonlau, and W. J. Welch, "Efficient global optimization of expensive black-box functions," Journal of Global Optimization, vol. 13, no. 4, pp. 455-492, 1998.
[3] D. Huang, T. T. Allen, W. I. Notz, and N. Zheng, "Global optimization of stochastic black-box systems via sequential kriging meta-models," Journal of Global Optimization, vol. 34, no. 3, pp. 441-466, 2006.
[4] F. Hutter, H. H. Hoos, and K.
Leyton-Brown, "Sequential model-based optimization for general algorithm configuration," in LION-5, 2011, pp. 507-523.
[5] M. A. Taddy, H. K. Lee, G. A. Gray, and J. D. Griffin, "Bayesian guided pattern search for robust local optimization," Technometrics, vol. 51, no. 4, pp. 389-401, 2009.
[6] H. J. Kushner, "A new method of locating the maximum of an arbitrary multipeak curve in the presence of noise," Journal of Basic Engineering, vol. 86, pp. 97-106, 1964.
[7] J. Mockus, Bayesian Approach to Global Optimization, ser. Mathematics and Its Applications. Kluwer Academic Publishers, 1989, vol. 37.
[8] N. Mahendran, Z. Wang, F. Hamze, and N. de Freitas, "Adaptive MCMC with Bayesian optimization," Journal of Machine Learning Research - Proceedings Track for Artificial Intelligence and Statistics (AISTATS), vol. 22, pp. 751-760, 2012.
[9] J. Snoek, H. Larochelle, and R. Adams, "Practical Bayesian optimization of machine learning algorithms," in NIPS, 2012, pp. 2960-2968.
[10] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, "Efficient and robust automated machine learning," in Advances in Neural Information Processing Systems, 2015, pp. 2944-2952.
[11] R. Martinez-Cantin, N. de Freitas, E. Brochu, J. Castellanos, and A. Doucet, "A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot," Autonomous Robots, vol. 27, no. 3, pp. 93-103, 2009.
[12] R. Calandra, A. Seyfarth, J. Peters, and M. Deisenroth, "Bayesian optimization for learning gaits under uncertainty," Annals of Mathematics and Artificial Intelligence (AMAI), vol. 11, pp. 1-19, 2015.
[13] O. Kroemer, R. Detry, J. Piater, and J. Peters, "Combining active learning and reactive control for robot grasping," Robotics and Autonomous Systems, vol. 58, no. 9, pp. 1105-1116, 2010.
[14] R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. A.
Castellanos, "Active policy learning for robot planning and exploration under uncertainty," in Robotics: Science and Systems, 2007.
[15] C. Daniel, O. Kroemer, M. Viering, J. Metz, and J. Peters, "Active reward learning with a novel acquisition function," Autonomous Robots, vol. 39, no. 3, pp. 389-405, 2015.
[16] A. I. Forrester, N. W. Bressloff, and A. J. Keane, "Optimization using surrogate models and partially converged computational fluid dynamics simulations," Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 462, no. 2071, pp. 2177-2204, 2006.
[17] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, "Gaussian process optimization in the bandit setting: No regret and experimental design," in Proc. International Conference on Machine Learning (ICML), 2010.
[18] R. Garnett, M. A. Osborne, and S. J. Roberts, "Bayesian optimization for sensor set selection," in Information Processing in Sensor Networks (IPSN 2010), 2010.
[19] E. Brochu, T. Brochu, and N. de Freitas, "A Bayesian interactive optimization approach to procedural animation design," in ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2010.
[20] F. Wario, B. Wild, M. J. Couvillon, R. Rojas, and T. Landgraf, "Automatic methods for long-term tracking and the detection and decoding of communication dances in honeybees," Frontiers in Ecology and Evolution, vol. 3, p. 103, 2015.
[21] W. M. Czarnecki, S. Podlewska, and A. J. Bojarski, "Robust optimization of SVM hyperparameters in the classification of bioactive compounds," Journal of Cheminformatics, vol. 7, no. 1, pp. 1-15, 2015.
[22] D. Ulmasov, C. Baroukh, B. Chachuat, M. P. Deisenroth, and R. Misener, "Bayesian optimization with dimension scheduling: Application to biological systems," in 26th European Symposium on Computer Aided Process Engineering, Computer-Aided Chemical Engineering, 2016.
[23] A. Borji and L.
Itti, “Bayesian optimization explains human active search, ” in Advances in Neural Information Processing Systems 26 , 2013, pp. 55–63. [24] A. Cully , J. Clune, D. T arapore, and J.-B. Mouret, “Robots that can adapt like animals, ” Nature , v ol. 521, p. 503507, 2015. [25] A. W . Moore and J. Schneider, “Memory-based stochastic optimization, ” in Advances in Neural Information Processing Systems , D. S. T ouretzky , M. C. Mozer , and M. E. Hasselmo, Eds., vol. 8. The MIT Press, 1996, pp. 1066–1072. [26] P . Hennig and C. J. Schuler , “Entropy search for information-ef ficient global optimization, ” Journal of Machine Learning Research , vol. 13, pp. 1809–1837, 2012. [27] N. de Freitas, A. Smola, and M. Zoghi, “Exponential regret bounds for Gaussian process bandits with deterministic observations, ” in Interna- tional Conference on Machine Learning (ICML) , 2012. [28] M. T oussaint, “The Bayesian search game, ” in Theory and Principled Methods for Designing Metaheuristics , A. M. Y ossi Borenstein, Ed. Springer , 2014. [29] D. Buche, N. N. Schraudolph, and P . K oumoutsakos, “ Accelerating ev olutionary algorithms with gaussian process fitness function models, ” IEEE Tr ansactions on Systems, Man, and Cybernetics, P art C (Applica- tions and Reviews) , vol. 35, no. 2, pp. 183–194, 2005. [30] R. Cheng, Y . Jin, K. Narukawa, and B. Sendhoff, “ A multiobjective ev olutionary algorithm using gaussian process-based in verse modeling, ” IEEE T ransactions on Evolutionary Computation , vol. 19, no. 6, pp. 838–856, 2015. JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, A UGUST 2015 12 [31] Y . S. Ong, P . B. Nair, and A. J. Keane, “Evolutionary optimization of computationally expensi ve problems via surrogate modeling, ” AIAA journal , vol. 41, no. 4, pp. 687–696, 2003. [32] Z. Zhou, Y . S. Ong, P . B. Nair, A. J. Keane, and K. Y . 
Lum, “Combining global and local surrogate models to accelerate ev olutionary optimiza- tion, ” IEEE Tr ansactions on Systems, Man, and Cybernetics, P art C (Applications and Reviews) , vol. 37, no. 1, pp. 66–76, 2007. [33] J. Mockus, V . T iesis, and A. Zilinskas, “The application of Bayesian methods for seeking the extremum, ” in T owards Global Optimisation 2 , L. Dixon and G. Szego, Eds. Elsevier , 1978, pp. 117–129. [34] H.-M. Gutmann, “ A radial basis function method for global optimiza- tion, ” Journal of Global Optimization , vol. 19, no. 3, pp. 201–227, 2001. [35] H. D. Sherali and V . Ganesan, “ A pseudo-global optimization approach with application to the design of containerships, ” Journal of Global Optimization , vol. 26, no. 4, pp. 335–360, 2003. [36] D. R. Jones, “ A taxonomy of global optimization methods based on response surfaces, ” Journal of Global Optimization , vol. 21, pp. 345– 383, 2001. [37] A. Shah, A. G. W ilson, and Z. Ghahramani, “Student-t processes as alternativ es to Gaussian processes, ” in AIST ATS, JMLR Pr oceedings. JMLR.or g , 2014. [38] R. Martinez-Cantin, “BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits, ” Journal of Machine Learning Researc h , vol. 15, pp. 3735–3739, 2014. [39] J. M. Assael, Z. W ang, and N. de Freitas, “Heteroscedastic treed bayesian optimisation, ” arXiv , T ech. Rep., 2014. [40] J. Bergstra, R. Bardenet, Y . Bengio, and B. Kgl, “ Algorithms for h yper- parameter optimization. ” in NIPS , 2011, pp. 2546–2554. [41] J. Snoek, O. Rippel, K. Swersky , R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary , Prabhat, and R. P . Adams, “Scalable bayesian optimization using deep neural networks, ” in International Conference on Machine Learning , 2015. [42] D. Russo and B. V an Roy , “Learning to optimize via posterior sampling, ” Mathematics of Operations Researc h , vol. 39, no. 4, pp. 1221–1243, 2014. [43] C. E. Rasmussen and C. K. 
W illiams, Gaussian Pr ocesses for Machine Learning . Cambridge, Massachusetts: The MIT Press, 2006. [44] A. D. Bull, “Con vergence rates of efficient global optimization algo- rithms, ” Journal of Machine Learning Researc h , vol. 12, pp. 2879–2904, 2011. [45] J. M. Gablonsky and C. T . Kelle y , “ A locally-biased form of the DIRECT algorithm, ” Journal of Global Optimization , vol. 21, no. 1, pp. 27–37, 2001. [46] R. B. Gramacy and H. K. Lee, “Cases for the nugget in modeling computer experiments, ” Statistics and Computing , vol. 22, pp. 713–722, 2012. [47] J. P . Kleijnen, W . van Beers, and I. V an Nieuwenhuyse, “Expected improvement in ef ficient global optimization through bootstrapped krig- ing, ” J ournal of Global Optimization , vol. 54, no. 1, pp. 59–73, 2012. [48] K. Kandasamy , J. Schneider , and B. Poczos, “High dimensional bayesian optimisation and bandits via additiv e models, ” in International Confer- ence on Machine Learning (ICML) , 2015. [49] A. G. Journel and C. J. Huijbregts, Mining geostatistics . Academic Press London, 1978. [50] T . C. Haas, “Kriging and automated variogram modeling within a moving window , ” Atmospheric Envir onment. P art A. General T opics , vol. 24, no. 7, pp. 1759 – 1769, 1990. [51] A. Krause and C. Guestrin, “Nonmyopic active learning of Gaussian processes: an exploration-e xploitation approach, ” in International Con- fer ence on Machine Learning (ICML) , Corvallis, Oregon, June 2007. [52] R. B. Gramacy , “Bayesian treed gaussian process models, ” Ph.D. dis- sertation, University of California, Santa Clara, 2005. [53] P . D. Sampson and P . Guttorp, “Nonparametric estimation of nonsta- tionary spatial co variance structure, ” Journal of the American Statistical Association , vol. 87, no. 417, pp. 108–119, 1992. [54] J. Snoek, K. Swersky , R. S. Zemel, and R. P . Adams, “Input warping for Bayesian optimization of non-stationary functions, ” in International Confer ence on Machine Learning , 2014. [55] Z. W ang, M. 
Zoghi, F . Hutter, D. Matheson, and N. de Freitas, “Bayesian optimization in high dimensions via random embeddings, ” in International Joint Conferences on Artificial Intelligence (IJCAI) - Extended version: http://arxiv .or g/abs/1301.1942 , 2013. [56] D. J. Nott and W . T . M. Dunsmuir, “Estimation of nonstationary spatial cov ariance structure, ” Biometrika , v ol. 89, no. 4, pp. 819–829, 2002. [57] B. J. Williams, T . J. Santner , and W . I. Notz, “Sequential design of computer experiments to minimize integrated response functions, ” Statistica Sinica , vol. 10, no. 4, pp. 1133–1152, 2000. [58] A. O’Hagan, “Some Bayesian numerical analysis, ” Bayesian Statistics , vol. 4, pp. 345–363, 1992. [59] J. M. Hern ´ andez-Lobato, M. W . Hoffman, and Z. Ghahramani, “Pre- dictiv e entropy search for efficient global optimization of black-box functions, ” in Advances in Neural Information Pr ocessing Systems 27 , 2014, pp. 918–926. [60] K. Eggensperger , F . Hutter , H. Hoos, and K. Leyton-Bro wn, “Ef ficient benchmarking of hyperparameter optimizers via surrogates, ” in Pr oceed- ings of the T wenty-Ninth AAAI Conference on Artificial Intelligence , Jan. 2015. [61] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning, ” Machine learning , vol. 8, no. 3-4, pp. 229–256, 1992. [62] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction . The MIT Press, 1998. [63] E. R. W estervelt, J. W . Grizzle, C. Chev allereau, J. H. Choi, and B. Morris, F eedback contr ol of dynamic bipedal r obot locomotion . CRC press, 2007, vol. 28. [64] P . Abbeel, A. Coates, M. Quigley , and A. Ng, “ An application of reinforcement learning to aerobatic helicopter flight, ” in NIPS , 2006. [65] A. Asbah, A. M. Barreto, C. Gehring, J. Pineau, and D. 
Precup, “Rein- forcement learning competition: Helicopter hovering with controllability and kernel-based stochastic factorization, ” in International Conference on Machine Learning (ICML), Reinfor cement Learning Competition W orkshop , 2013. [66] R. K oppejan and S. Whiteson, “Neuroev olutionary reinforcement learn- ing for generalized control of simulated helicopters, ” Evolutionary intelligence , vol. 4, no. 4, pp. 219–241, 2011. [67] J. Sacks, W . J. W elch, T . J. Mitchell, and H. P . W ynn, “Design and analysis of computer experiments, ” Statistical Science , v ol. 4, no. 4, pp. 409–423, 1989. [68] J. Gardner, M. Kusner , Z. Xu, K. W einberger , and J. Cunningham, “Bayesian optimization with inequality constraints, ” in Proceedings of the 31st International Conference on Machine Learning (ICML-14) , 2014, pp. 937–945. [69] M. A. Gelbart, J. Snoek, and R. P . Adams, “Bayesian optimization with unknown constraints, ” in Uncertainty in Artificial Intelligence (UAI) , 2014. [70] J. F . Lazenby , The defence of Gr eece: 490-479 BC . Aris & Phillips, 1993. [71] Z. W ang and N. de Freitas, “Theoretical analysis of bayesian optimi- sation with unknown gaussian process hyper-parameters, ” in BayesOpt workshop (NIPS) , 2014. [72] E. Brochu, V . Cora, and N. de Freitas, “ A tutorial on Bayesian opti- mization of expensi ve cost functions, with application to activ e user modeling and hierarchical reinforcement learning, ” arXiv .org, eprint arXiv:1012.2599, December 2010. [73] K. Swersky , D. Duvenaud, J. Snoek, F . Hutter, and M. A. Osborne, “Raiders of the lost architecture: Kernels for bayesian optimization in conditional parameter spaces, ” in NIPS workshop on Bayesian Optimiza- tion,arXiv preprint arXiv:1409.4011 , 2014. [74] F . Hutter, “ Automated configuration of algorithms for solving hard com- putational problems, ” Ph.D. dissertation, Univ ersity of British Columbia, V ancouver , Canada, 2009. 
Ruben Martinez-Cantin is a senior lecturer at the Centro Universitario de la Defensa, attached to the University of Zaragoza. He is also a research engineer at SigOpt, Inc. Before that, he was a postdoctoral researcher at the Institute for Systems and Robotics at the Instituto Superior Técnico in Lisbon. Previously, he was a visiting scholar at the Laboratory of Computational Intelligence at UBC. He received a PhD and an MSc in Computer Science and Electrical Engineering from the University of Zaragoza in 2008 and 2003, respectively. His research interests include Bayesian inference and optimization, machine learning, robotics and computer vision.

APPENDIX A
ORIGIN OF SPARTAN BAYESIAN OPTIMIZATION

The algorithm is called Spartan because it follows the same intuition and an analogous strategy to the Greek forces at the end of the Battle of Thermopylae. The most widely accepted theory claims that, on the last day of the battle, a small group of soldiers led by the Spartan king Leonidas stood in the narrow pass of Thermopylae to block the Persian cavalry, while the rest of the forces retreated to cover more terrain and avoid being surrounded by the Persians moving through a mountain path [70]. This dual strategy of allocating global resources sparsely while maintaining a dense local vanguard at a strategic location is mirrored in Spartan Bayesian Optimization.

APPENDIX B
EFFECT OF KERNEL LENGTH-SCALES FOR BAYESIAN OPTIMIZATION

Like many global optimization and bandit setups, Bayesian optimization requires controlling the bounds of the function space to drive exploration efficiently. In probabilistic terms, the upper and lower bounds defined by the Gaussian process in Bayesian optimization play the same role as the Lipschitz constant in classical optimization [27].
Wang and de Freitas [71] pointed out that, when the kernel hyperparameters are unknown, estimation procedures for those hyperparameters, such as MAP or MCMC, become overconfident as new data points become available. Thus, the length-scale hyperparameters may grow extremely wide, resulting in poor exploration and slow convergence towards the optimum. This behavior is partly due to the uneven distribution of queries produced by Bayesian optimization. They propose adaptive bounds on the kernel hyperparameters that guarantee the length-scales remain narrow enough to provide sufficient exploration. However, this approach may result in excessive global exploration when a limited budget is considered.

APPENDIX C
DEFINITIONS OF BENCHMARK FUNCTIONS

We evaluated the algorithms on a set of well-known test functions for global optimization, both smooth and with sharp drops. The functions are summarized in Table II.

TABLE II
OPTIMIZATION BENCHMARK FUNCTIONS

Branin-Hoo:
  f(x) = ( x_2 − (5.1 / 4π²) x_1² + (5/π) x_1 − 6 )² + 10 ( 1 − 1/(8π) ) cos(x_1) + 10
  x_1 ∈ [−5, 10], x_2 ∈ [0, 15]

Hartmann:
  f(x) = − Σ_{i=1}^{4} α_i exp( − Σ_{j=1}^{6} A_{ij} (x_j − P_{ij})² )
  x ∈ [0, 1]^6; see Section C-A for α, A and P

Gramacy [52]:
  f(x) = x_1 exp( −x_1² − x_2² )
  x_1, x_2 ∈ [−2, 18]

Michalewicz:
  f(x) = − Σ_{i=1}^{d} sin(x_i) sin^{2m}( i x_i² / π )
  d = 10, m = 10, x ∈ [0, π]^d

Fig. 9. Left: Mountain car scenario. The car is underactuated and cannot climb to the goal directly; instead, it needs to move backwards to gain inertia. The line on the left is an inelastic wall. Right: Policy used to control the mountain car. The inputs are the horizontal position x_t, velocity v_t = x_t − x_{t−1} and acceleration a_t = v_t − v_{t−1} of the car. The output is the car throttle, bounded to [−1, 1].
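The benchmark functions of Table II translate directly into code. A minimal NumPy sketch of the first, third and fourth entries (function names are our own; the Hartmann function additionally needs the α, A and P parameters given in Section C-A):

```python
import numpy as np

def branin(x1, x2):
    """Branin-Hoo function on x1 in [-5, 10], x2 in [0, 15]."""
    return ((x2 - 5.1 / (4 * np.pi**2) * x1**2 + 5 / np.pi * x1 - 6)**2
            + 10 * (1 - 1 / (8 * np.pi)) * np.cos(x1) + 10)

def gramacy(x1, x2):
    """2D exponential test function from Gramacy [52]."""
    return x1 * np.exp(-x1**2 - x2**2)

def michalewicz(x, m=10):
    """Michalewicz function on [0, pi]^d; steepness controlled by m."""
    x = np.asarray(x, dtype=float)
    i = np.arange(1, x.size + 1)
    return -np.sum(np.sin(x) * np.sin(i * x**2 / np.pi)**(2 * m))
```

For instance, at (x_1, x_2) = (π, 2.275) the quadratic term of Branin-Hoo vanishes and the function attains its global minimum 10/(8π) ≈ 0.3979.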
A. Parameters of the Hartmann 6D Function

The parameters of the Hartmann 6D function are:

α = (1.0, 1.2, 3.0, 3.2)^T,

A = [ 10    3    17   3.5  1.7  8
      0.05  10   17   0.1  8    14
      3     3.5  1.7  10   17   8
      17    8    0.05 10   0.1  14 ],

P = 10^{−4} [ 1312 1696 5569 124  8283 5886
              2329 4135 8307 3736 1004 9991
              2348 1451 3522 2883 3047 6650
              4047 8828 8732 5743 1091 381 ]

APPENDIX D
MOUNTAIN CAR

The second problem is the classic mountain car problem [62]. The state of the system is the horizontal position of the car. The action is the horizontal acceleration a ∈ [−1, 1]. Contrary to the many solutions that discretize both the state and action space, we can deal directly with continuous states and actions. The policy is a simple perceptron model inspired by Brochu et al. [72], as shown in Figure 9. The potentially unbounded policy parameters w = {w_i}_{i=1}^{7} are computed as

w = tan( (π − ε_π) w_{01} − π/2 ),

where w_{01} are the policy parameters bounded to the [0, 1]^7 space. The term ε_π ≈ 0 is used to avoid w_i → ∞.

APPENDIX E
DISCUSSION ON MIXED INPUT SPACES IN BAYESIAN OPTIMIZATION

Although Bayesian optimization started as a method to solve classic nonlinear optimization problems with box-bounded restrictions, its sample efficiency and the flexibility of the surrogate models have attracted the interest of other communities and expanded the potential applications of the method.

In many current Bayesian optimization applications, such as hyperparameter optimization, it is necessary to simultaneously optimize different kinds of input variables, for example continuous, discrete and categorical. While Gaussian processes are suitable for modeling those spaces by choosing a suitable kernel [73], [74], Bayesian optimization can become quite involved, as the acquisition function (criterion) must
be optimized in the same mixed input space. Available implementations of Bayesian optimization, such as Spearmint [9], use grid sampling and rounding tricks to combine different input spaces. However, this reduces the quality of the final result compared to proper nonlinear optimization methods [38]. Other authors have proposed heuristics specially designed for criterion maximization in Bayesian optimization [27], but their applicability to mixed input spaces remains an open question.

Algorithm 2 Hierarchical Bayesian Optimization
 1: Total budget N = N_c · N_d    ▷ Discrete iterations times continuous iterations
 2: Split the discrete and continuous components of the input space: x = [x_c, x_d]^T
 3: Initial samples for x_c
 4: for n = 1 ... N_c do    ▷ Outer loop
 5:     Update model (e.g., Gaussian process) for x_c
 6:     Find continuous component of the next query point x_c^n (e.g., maximize EI)
 7:     Initial samples for x_d
 8:     for k = 1 ... N_d do    ▷ Inner loop
 9:         Update model (e.g., random forest) for x_d | x_c^n
10:         Find discrete component of the next query point x_d^k | x_c^n
11:         Combine queries and evaluate the function: y_k ← f([x_c^n, x_d^k])
12:     Return optimal discrete query for the current continuous query: x_d^* | x_c^n
13: Return optimal continuous query and corresponding discrete match: x^* = [x_c^*, x_d^* | x_c^*]^T

TABLE III
AVERAGE TOTAL CPU TIME IN SECONDS

            HPNNET    HPNNET (h)
#dims       14        7+7
#evals      110       200 (20)
BO          763       20
SBO         28 442    146
WARP        710 974   2 853

We propose a hierarchical Bayesian optimization model, where the input space X is partitioned into homogeneous variables, for example continuous variables x_c and discrete variables x_d. That is:

X = X_c ∪ X_d := { x = [x_c, x_d]^T : x_c ∈ X_c ∨ x_d ∈ X_d }    (19)

Therefore, the evaluation of an element higher in the hierarchy implies the full optimization of the elements lower in the hierarchy.
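The nested structure of Algorithm 2 can be sketched as follows. This is only a minimal illustration with names of our own: random proposals stand in for the surrogate-based steps (Gaussian process with EI maximization in the outer loop, random forest in the inner loop), which a real implementation would use instead.

```python
import random

def hierarchical_bo(f, cont_bounds, disc_values, n_outer=20, n_inner=10):
    """Two-level optimization following the structure of Algorithm 2.

    cont_bounds: list of (lo, hi) bounds for the continuous part x_c.
    disc_values: list of candidate value sets for the discrete part x_d.
    Random proposals replace the model-update and acquisition steps.
    """
    best_x, best_y = None, float("inf")
    for _ in range(n_outer):                       # outer loop over x_c
        xc = [random.uniform(lo, hi) for lo, hi in cont_bounds]
        # inner loop: optimize the discrete part x_d conditioned on xc
        inner_best_xd, inner_best_y = None, float("inf")
        for _ in range(n_inner):
            xd = [random.choice(vals) for vals in disc_values]
            y = f(xc, xd)                          # one expensive evaluation
            if y < inner_best_y:
                inner_best_xd, inner_best_y = xd, y
        if inner_best_y < best_y:
            best_x, best_y = (xc, inner_best_xd), inner_best_y
    return best_x, best_y
```

Note that every inner iteration consumes one evaluation of the expensive function, which is why the total budget factorizes as N = N_c · N_d.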
In principle, that would require many more function evaluations; however, as the input space has been partitioned, the dimensionality of each separate problem is much lower. In practice, for the same number of function evaluations, the computational cost of the optimization algorithm is considerably reduced. We can also include conditional variables in the outer loop to select which computations to perform in the inner loop.

An advantage of this approach is that we can combine different surrogate models for different levels of the hierarchy. For example, random forests [4] or tree-structured Parzen estimators [40] could be more suitable surrogate models for certain discrete/categorical variables than Gaussian processes. We could also use specific kernels, such as the Hamming kernel used in this paper.

Fig. 10. Deep neural network problem using the fully correlated model (top) and the hierarchical model (bottom). Although the number of function evaluations is higher for the hierarchical model, the computational cost of the Bayesian optimization process was considerably reduced. As in the other examples, SBO performance was the best.

In contrast, we lose the correlation among variables in the inner loop, which may be counterproductive in certain situations. A similar alternative, in the case where the target function is actually a combination of lower-dimensional spaces, could be to use additive models, such as additive GPs [48].

For the HPNNET problem, we applied the hierarchical model to split the continuous and categorical variables in a two-layer optimization process. In this case, the nonstationary
algorithms (SBO or WARP) were only applied to the continuous variables (outer loop). For the categorical variables in the inner loop we used a Hamming kernel [74]:

k_H(x, x' | θ) = exp( −(θ/2) g( s(x), s(x') )² )    (20)

where s(·) is a function that maps continuous vectors to discrete vectors by scaling and rounding. The function g(·, ·) is the Hamming distance, g(x, x') = |{ i : x_i ≠ x'_i }|, so as not to impose an artificial ordering between the values of categorical parameters.

The results of the hierarchical model can be seen in Fig. 10. Note that the plots are with respect to target function evaluations. However, the results of the hierarchical model are based on only 20 iterations of the outer loop, as each iteration requires 10 function evaluations in the inner loop. At early stages, SBO was trying to find a good location for the local kernel and its performance was slightly worse. However, after some data was gathered, the local kernel jumped to a good spot and the convergence was faster. Fig. 10 shows that the method requires more function evaluations to achieve results similar to the fully correlated approach. Table III shows the average CPU time of the different experiments for the total number of function evaluations. HPNNET (h) is the HPNNET problem using the hierarchical model. Although it is tested with 200 function evaluations, only 20 iterations of the optimization loop are computed; thus, it is faster than HPNNET with a single model, which might open new applications. A similar approach has recently been proposed to deal with high-dimensional problems where evaluations are not expensive [22].
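For concreteness, the Hamming kernel of Eq. (20) takes only a few lines. This is a sketch with names of our own; it assumes the scaling-and-rounding map s(·) has already been applied, so the inputs are category vectors:

```python
import math

def hamming_distance(a, b):
    """g(x, x') = |{i : x_i != x'_i}| for equal-length category vectors."""
    return sum(ai != bi for ai, bi in zip(a, b))

def hamming_kernel(x, x_prime, theta=1.0):
    """k_H(x, x' | theta) = exp(-(theta / 2) * g(x, x')^2), cf. Eq. (20).

    The distance counts mismatched entries, so no artificial ordering
    is imposed between the values of categorical parameters.
    """
    g = hamming_distance(x, x_prime)
    return math.exp(-0.5 * theta * g**2)
```

The kernel equals 1 for identical category vectors and decays with the squared number of mismatches, regardless of which categories differ.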
