Practical Efficient Global Optimization is No-regret


Authors: Jingyi Wang (Lawrence Livermore National Laboratory, CA, USA), Haowei Wang (National University of Singapore, Singapore), Nai-Yuan Chiang (Lawrence Livermore National Laboratory, CA, USA), Juliane Mueller (National Laboratory of the Rockies, CO, USA), Tucker Hartland (Lawrence Livermore National Laboratory, CA, USA), Cosmin G. Petra (Lawrence Livermore National Laboratory, CA, USA)

Abstract

Efficient global optimization (EGO) is one of the most widely used noise-free Bayesian optimization algorithms. It comprises the Gaussian process (GP) surrogate model and the expected improvement (EI) acquisition function. In practice, when EGO is applied, a scalar matrix of a small positive value (also called a nugget or jitter) is usually added to the covariance matrix of the deterministic GP to improve numerical stability. We refer to this EGO with a positive nugget as practical EGO. Despite its wide adoption and empirical success, to date, cumulative regret bounds for practical EGO have yet to be established. In this paper, we present for the first time the cumulative regret upper bound of practical EGO. In particular, we show that practical EGO has sublinear cumulative regret bounds and thus is a no-regret algorithm for commonly used kernels, including the squared exponential (SE) and Matérn kernels (ν > 1/2). Moreover, we analyze the effect of the nugget on the regret bound and discuss the theoretical implications for its choice. Numerical experiments are conducted to support and validate our findings.

1 Introduction

Efficient global optimization (EGO) is a derivative-free optimization method that uses Gaussian process (GP) surrogate models to approximate and guide the optimization of black-box functions [22, 26, 49]. Given no observation noise, it is equivalent to Bayesian optimization (BO) with the expected improvement (EI) acquisition function.
EGO has seen enormous success in many applications, including machine learning [53], robotics [8], and aerodynamic design optimization [21], and has been extended to constrained BO [13], combinatorial problems [54], and multiple surrogates [47]. In its classic form, EGO aims to solve the optimization problem

$$\min_{\mathbf{x} \in C} f(\mathbf{x}), \quad (1)$$

where $\mathbf{x} \in \mathbb{R}^d$ is the decision variable, $C \subset \mathbb{R}^d$ represents the bound constraints on $\mathbf{x}$, and $f: \mathbb{R}^d \to \mathbb{R}$ is the black-box objective function.

EGO iteratively selects the candidate sample for the next observation using the EI acquisition function [23, 40]. EI computes the conditional expectation of an improvement function, leveraging both the posterior mean and variance of the GP, in search of the next sample. With a closed form, EGO is simple to implement and only requires the cumulative distribution function (CDF) and probability density function (PDF) of the standard normal distribution.

In practice, adding a small positive value, called a nugget or jitter [12, 17], to the diagonal of the GP covariance matrix in EGO is often beneficial or even necessary [3] (see Section 2.1 for details). We denote the nugget as ϵ > 0 in this paper. One of the main motivations for using ϵ is to improve the numerical stability of computations involving the inverse of the covariance matrix, e.g., calculating the posterior mean and variance. The covariance matrix is known to cause numerical issues due to ill-conditioning, which can occur when sample points are too close to each other [23]. The Cholesky decomposition, which is invariably used to solve linear systems involving the covariance matrix, is known to break down in finite-precision arithmetic for ill-conditioned matrices, since very small (both positive and negative) pivots occur due to round-off errors [18, 24, 27, 30, 42, 51].
This numerical instability has long been recognized by the numerical optimization community and thoroughly investigated: modified Cholesky decompositions have been proposed that employ positive semi-definite perturbations [14, 15, 39, 52] or positive diagonal regularization terms [16] to circumvent the factorization breakdown. Indeed, the use of ϵ is widely adopted in some of the most popular BO/EGO software packages. In the Surrogate Modeling Toolbox (SMT) [38], the EGO implementation with the GP (KRG) model uses a fixed ϵ = 2.220 × 10⁻¹⁴ with the option of larger values. In Pyro [4], the GP regression model (GPModel) employs ϵ = 10⁻⁶ to stabilize the Cholesky decomposition. In scikit-learn [33], the GP model (GPR) for EGO uses a default ϵ = 10⁻¹⁰. In BoTorch [3], which uses GPyTorch [12], examples of GP regression models (SingleTaskGP) with noise-free observations are presented with ϵ = 10⁻⁶. GPyOpt [43] uses a trial-and-error approach, where ϵ = 10⁻¹⁰ is added when the Cholesky decomposition of the covariance matrix fails. BayesOpt [29], which is used in MATLAB, similarly raises errors when the Cholesky decomposition fails. Finally, we mention that in GPyTorch [12, 48], a positive nugget is instrumental in the well-posedness and efficiency of the proposed preconditioned batched conjugate gradient algorithm.

Apart from improving numerical stability, in recent years the adoption of ϵ for deterministic GP fitting and EGO has been recommended for improved statistical properties. In [2], the authors studied the effect of the nugget and showed that, by choosing appropriate values for ϵ and the length-scale hyper-parameter of the squared exponential (SE) kernel, the approximation errors of the GP can be made arbitrarily small. In [17], the authors proposed adding ϵ in deterministic GP models for noise-free observations.
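The numerical motivation is easy to see concretely. The following minimal sketch (our own illustration, not from the paper) builds the SE covariance matrix for two nearly coincident sample points; in float64 the matrix rounds to a singular one, so the Cholesky factorization fails, and it succeeds once a nugget ϵI is added to the diagonal.

```python
import numpy as np

def se_kernel_matrix(X, l=1.0):
    """SE covariance matrix, k(x, x') = exp(-r^2 / (2 l^2))."""
    r2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-r2 / (2 * l ** 2))

# Two sample points so close that K is numerically singular in float64.
X = np.array([[0.0, 0.0], [1e-9, 0.0]])
K = se_kernel_matrix(X)

try:
    np.linalg.cholesky(K)  # breaks down: a zero pivot appears
    print("no nugget: Cholesky succeeded")
except np.linalg.LinAlgError:
    print("no nugget: Cholesky failed")

eps = 1e-10  # nugget; same order as scikit-learn's default alpha
L = np.linalg.cholesky(K + eps * np.eye(len(X)))  # now positive definite
print("with nugget: Cholesky succeeded")
```

The choice of points and length scale here is arbitrary; any pair of samples whose kernel correlation rounds to 1 in floating point triggers the same breakdown.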
They noted that its inclusion provided improved statistical properties of the GP model in many common scenarios. Similarly, [34] recommended using ϵ for noise-free observations; the authors argued that, with a nugget, the maximum likelihood estimate of the correlation parameter is more reliable and the condition number of the covariance matrix remains moderate. In [5], the authors developed an adaptive strategy for choosing ϵ to improve the accuracy and efficiency of fitting the hyper-parameters of a GP model. Given its wide adoption in both the literature and practical code implementations, we refer to EGO with a nugget ϵ > 0 as "practical EGO" in this paper.

While the simple regret bound of EGO has been studied in [7] in the frequentist setting, where f lies in a reproducing kernel Hilbert space (RKHS), existing works on the cumulative regret behavior of either EGO or practical EGO have clear limitations. Given t samples $\mathbf{x}_1, \dots, \mathbf{x}_t$, simple regret measures the error between the smallest observed function value and the optimal function value, i.e., $f_t^+ - f^*$, where $f_t^+ = \min_{i=1,\dots,t} f(\mathbf{x}_i)$ and $f^* = \min_{\mathbf{x} \in C} f(\mathbf{x})$. On the other hand, cumulative regret, denoted $R_T$ for T samples, measures the overall performance of the algorithm throughout the optimization process (see (9) for the definition), and is the preferred metric in the multi-armed bandit paradigm [1, 25, 36] and many real-world applications [6]. It is desirable for an algorithm to have a sublinear $R_T$ asymptotically, i.e., $\lim_{T\to\infty} R_T/T = 0$; this is called the no-regret property. Following the seminal work of [41], cumulative regret bounds for some BO algorithms, such as the upper confidence bound (UCB) and Thompson sampling (TS), have been studied extensively, including in the noise-free case [9, 28, 45].
However, cumulative regret upper bounds for either EGO or practical EGO have not been established, despite EI being one of the most popular acquisition functions [11]. From a technical perspective, this is partially due to the inclusion of an incumbent in EI and its non-convex, nonlinear nature [19, 37]. Existing works often make noticeable modifications to the EI acquisition function, introducing new hyper-parameters, in order to achieve sublinear cumulative regret bounds. In [50], the authors studied a modified EI function with additional hyper-parameters and the best posterior mean as the incumbent; they further used lower and upper bounds on the hyper-parameters to prove a sublinear cumulative regret bound. Similarly, [19] introduced an evaluation cost and modified EI. In [44], the authors also modified EI by including an additional control parameter and showed an upper bound for the sum of simple regrets, not the cumulative regret. In [32], the authors added a stopping criterion to bound the instantaneous regret; however, it is unclear whether the stopping criterion guarantees an optimal solution upon exit.

The lack of cumulative regret analysis despite the empirical success of practical EGO leaves two important open questions: Is practical EGO a no-regret algorithm? How does the nugget value affect regret behavior? In this paper, we provide an affirmative answer to the first question and guidance on the second by developing novel theoretical techniques. Our theoretical results can be used to explain and validate the empirical success of practical EGO. Our contributions in this paper are two-fold.

• First, we establish for the first time a cumulative regret upper bound $O(\log^{1/2}(T)\, T^{1/2} \sqrt{\gamma_T})$ for practical EGO, one of the most widely used noise-free BO algorithms, where $\gamma_T$ is the maximum information gain (Definition A.3).
Thus, we prove that practical EGO is a no-regret algorithm for different kernels, as long as the $\gamma_T$ of the kernel is sublinear. Specifically, the cumulative regret bounds are $O(T^{1/2} \log^{(d+2)/2}(T))$ and $O\!\left(T^{\frac{\nu+d}{2\nu+d}} \log^{\frac{2\nu+0.5d}{2\nu+d}}(T)\right)$ for the SE and Matérn kernels, respectively (see (3) for definitions).

• Second, we study the effect of the nugget ϵ on practical EGO and its cumulative regret bounds, thereby providing insight into the choice of ϵ.

This paper is organized as follows. In Section 2, we introduce GPs, EI, practical EGO, and other necessary background. In Section 3, we present the regret bound analysis, starting with preliminary results in Section 3.1. In Section 3.2, the novel instantaneous regret bound is established. The cumulative regret bound is provided in Section 3.3. The limitations of extending the regret bound analysis to EGO are also presented at the end of Section 3. We discuss the effect of ϵ in Section 4. Numerical experiments are used to validate our findings in Section 5. Conclusions are drawn in Section 6.

2 Background

In this section, we first provide the basics of GPs and the EI acquisition function. Then, other background relevant to the cumulative regret analysis is introduced.

2.1 Gaussian Process

Consider a zero-mean GP with kernel (i.e., covariance function) $k(\mathbf{x}, \mathbf{x}'): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. Given the t-th sample $\mathbf{x}_t \in C$, the observed function value is $f(\mathbf{x}_t)$. The $t \times t$ covariance matrix is denoted $K_t = [k(\mathbf{x}_i, \mathbf{x}_j)]_{i,j=1}^{t}$. The noise-free observations are $\mathbf{f}_{1:t} = [f(\mathbf{x}_1), \dots, f(\mathbf{x}_t)]^T$.
Without the nugget ϵ, the posterior mean, denoted $\mu_t^0$, and standard deviation, denoted $\sigma_t^0$, of the deterministic GP used in EGO are

$$\mu_t^0(\mathbf{x}) = \mathbf{k}_t(\mathbf{x})^T K_t^{-1} \mathbf{f}_{1:t}, \qquad (\sigma_t^0(\mathbf{x}))^2 = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}_t(\mathbf{x})^T K_t^{-1} \mathbf{k}_t(\mathbf{x}),$$

where $\mathbf{k}_t(\mathbf{x}) = [k(\mathbf{x}_1, \mathbf{x}), \dots, k(\mathbf{x}_t, \mathbf{x})]^T$. Accounting for ϵ > 0 and its corresponding scalar matrix ϵI, the posterior mean $\mu_t$ and variance $\sigma_t^2$ used in practical EGO are

$$\mu_t(\mathbf{x}) = \mathbf{k}_t(\mathbf{x})^T (K_t + \epsilon I)^{-1} \mathbf{f}_{1:t}, \qquad \sigma_t^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}_t(\mathbf{x})^T (K_t + \epsilon I)^{-1} \mathbf{k}_t(\mathbf{x}). \quad (2)$$

We emphasize that the observations $\mathbf{f}_{1:t}$ are noise-free in (2). The SE and Matérn kernels are among the most popular kernels for BO and GPs. Their definitions are

$$k_{SE}(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{r^2}{2l^2}\right), \qquad k_{\text{Matérn}}(\mathbf{x}, \mathbf{x}') = \frac{1}{\Gamma(\nu)\, 2^{\nu-1}} \left(\frac{\sqrt{2\nu}\, r}{l}\right)^{\nu} B_{\nu}\!\left(\frac{\sqrt{2\nu}\, r}{l}\right), \quad (3)$$

where l > 0 is the length-scale hyper-parameter, $r = \|\mathbf{x} - \mathbf{x}'\|_2$, ν > 0 is the smoothness parameter of the Matérn kernel, and $B_\nu$ is the modified Bessel function of the second kind.

2.2 Expected Improvement

The improvement function of f given t samples is defined as

$$I_t(\mathbf{x}) = \max\{f_t^+ - f(\mathbf{x}), 0\}, \quad (4)$$

where $f_t^+$ denotes the best current objective value. The sample point that attains $f_t^+$ is denoted $\mathbf{x}_t^+$. The EI acquisition function is defined as the expectation of (4) conditioned on the t samples, with the closed-form expression

$$EI_t(\mathbf{x}) = (f_t^+ - \mu_t(\mathbf{x}))\,\Phi(z_t(\mathbf{x})) + \sigma_t(\mathbf{x})\,\phi(z_t(\mathbf{x})), \quad (5)$$

where $z_t(\mathbf{x}) = \frac{f_t^+ - \mu_t(\mathbf{x})}{\sigma_t(\mathbf{x})}$. The functions ϕ and Φ are the PDF and CDF of the standard normal distribution, respectively. For ease of reference, we refer to $f_t^+ - \mu_t(\mathbf{x})$ and $\sigma_t(\mathbf{x})$ as the exploitation and exploration parts of $EI_t(\mathbf{x})$, respectively (see also Appendix A for more discussion). A commonly used function in the analysis of EI is $\tau: \mathbb{R} \to \mathbb{R}$, defined as

$$\tau(z) = z\,\Phi(z) + \phi(z). \quad (6)$$
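The closed form (5) and the function τ in (6) require only the standard normal CDF and PDF, so they can be implemented with the standard library alone. The sketch below is our own illustration (the example values of f⁺, μ, and σ are arbitrary); it also checks the identity EI = σ τ(z) used throughout the analysis.

```python
import math

def phi(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tau(z):
    """tau(z) = z Phi(z) + phi(z), eq. (6)."""
    return z * Phi(z) + phi(z)

def expected_improvement(f_best, mu, sigma):
    """EI for minimization, eq. (5); degenerates to max(f_best - mu, 0) at sigma = 0."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    return (f_best - mu) * Phi(z) + sigma * phi(z)

# tau form of EI: EI_t(x) = sigma_t(x) * tau(z_t(x))
f_best, mu, sigma = 0.2, 0.3, 0.5
z = (f_best - mu) / sigma
assert abs(expected_improvement(f_best, mu, sigma) - sigma * tau(z)) < 1e-12
```

Note that EI is strictly positive whenever σ > 0, even when the exploitation part f⁺ − μ is negative; this is exactly the property the nugget preserves in the later analysis.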
Thus, the τ form of EI can be written as $EI_t(\mathbf{x}) = \sigma_t(\mathbf{x})\,\tau(z_t(\mathbf{x}))$. The next sample is chosen by maximizing the acquisition function over C, i.e.,

$$\mathbf{x}_t = \operatorname*{argmax}_{\mathbf{x} \in C} \; EI_{t-1}(\mathbf{x}), \quad (7)$$

breaking ties arbitrarily [11]. The practical EGO algorithm is given in Algorithm 1.

Algorithm 1 Practical EGO
1: Choose k(·, ·) and T₀ initial samples $\mathbf{x}_i$, i = 0, ..., T₀. Observe $f_i$. Train the initial GP.
2: for t = T₀ + 1, T₀ + 2, ... do
3:   Choose $\mathbf{x}_t$ using (7).
4:   Observe $f(\mathbf{x}_t)$.
5:   Update the surrogate model using $\mathbf{x}_{1:t}$ and $\mathbf{f}_{1:t}$.
6:   if the evaluation budget is exhausted then
7:     Exit
8:   end if
9: end for

2.3 Additional Background

Denote the optimal function value as $f(\mathbf{x}^*)$, where $\mathbf{x}^*$ is a global minimum on C, i.e., $\mathbf{x}^* \in \operatorname{argmin}_{\mathbf{x} \in C} f(\mathbf{x})$. The instantaneous regret $r_t$ is defined as

$$r_t = f(\mathbf{x}_t) - f(\mathbf{x}^*) \ge 0. \quad (8)$$

The cumulative regret $R_T$ after T samples is defined as

$$R_T = \sum_{t=1}^{T} r_t = \sum_{t=1}^{T} \left[f(\mathbf{x}_t) - f(\mathbf{x}^*)\right]. \quad (9)$$

To derive the cumulative regret upper bound, we use the well-established maximum information gain results [9, 41]. Information gain measures the informativeness of a set of sample points in C about f. The maximum information gain $\gamma_t$ is defined in Definition A.3 in Appendix A. It is often used to bound the sum of posterior standard deviations $\sigma_{t-1}(\mathbf{x}_t)$ and depends on the choice of kernel [41]. We note that the upper bound on the sum of $\sigma_{t-1}(\mathbf{x}_t)$ depends on both $\gamma_t$ and ϵ. Moreover, the bound on $\gamma_t$ itself also depends on ϵ, the kernel, C, and d. The latest bounds on $\gamma_t$ in the literature for common kernels, such as the SE and Matérn kernels, can be found in [20, 46] and Lemma A.5.

3 Regret Bound

In this section, we present our regret bounds for practical EGO. We start with the preliminary results required for the analysis in Section 3.1.
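Algorithm 1 can be sketched in a few lines with scikit-learn, whose GaussianProcessRegressor exposes the nugget ϵ as its alpha parameter. The following is our own minimal sketch, not the authors' implementation: the 1D test function, the domain, the fixed length scale, and the budget are illustrative choices, and the argmax in (7) is approximated over a random candidate grid rather than solved exactly.

```python
import numpy as np
from math import erf, pi, sqrt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x[..., 0]) + x[..., 0] ** 2  # toy black-box objective
eps = 1e-6                                            # the nugget in (2)

# Initial design on C = [-2, 2] (Algorithm 1, step 1).
X = rng.uniform(-2, 2, size=(5, 1))
y = f(X)

def ei(f_best, mu, sigma):
    """EI (5) for minimization, vectorized over candidates."""
    sigma = np.maximum(sigma, 1e-300)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (f_best - mu) * Phi + sigma * phi

for t in range(20):                                   # evaluation budget
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5),
                                  alpha=eps,          # adds eps*I to K_t, eq. (2)
                                  optimizer=None)     # fixed hyper-parameters
    gp.fit(X, y)
    cand = rng.uniform(-2, 2, size=(2000, 1))         # crude argmax of (7)
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(ei(y.min(), mu, sigma))]
    X = np.vstack([X, x_next])                        # steps 4-5
    y = np.append(y, f(x_next))

print("best value found:", y.min())
```

In a serious implementation the inner argmax would use a multi-start local optimizer and the hyper-parameters would be re-estimated each iteration; the loop structure, however, is exactly Algorithm 1.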
Then, the new instantaneous regret bound is given in Section 3.2. Finally, the cumulative regret bound is established in Section 3.3.

3.1 Assumptions and Preliminary Results

Throughout this paper, we consider the frequentist setting; that is, f lies in the RKHS of k(·, ·), a common assumption in the literature [41], whose definition is given in Section A of the appendix. The formal assumption is given below.

Assumption 3.1. The function f lies in the RKHS, denoted $H_k(C)$, associated with the bounded kernel $k(\mathbf{x}, \mathbf{x}')$, with norm $\|\cdot\|_{H_k}$. The kernel satisfies $k(\mathbf{x}, \mathbf{x}') \le 1$ for all $\mathbf{x}, \mathbf{x}' \in C$, and $k(\mathbf{x}, \mathbf{x}) = 1$. The RKHS norm of f is bounded above by a constant B > 0, i.e., $\|f\|_{H_k} \le B$. The set C is compact.

To help analyze $r_t$, we present Lemmas B.1 to B.6 on the properties of GPs and EI. We briefly summarize some of them here and leave the formal statements and proofs to Appendix B. The monotonicity of the function τ (6) and of its derivative is given in Lemma B.1 [23]. Lemma B.2 states the relationship between Φ(z) and τ(z) when z < 0. In Lemma B.3, we establish simple but useful bounds on $EI_{t-1}(\mathbf{x})/\sigma_{t-1}(\mathbf{x})$. Since $EI_{t-1}(\mathbf{x}) = \sigma_{t-1}(\mathbf{x})\,\tau(z_{t-1}(\mathbf{x}))$, these three lemmas can be used to bound $EI_{t-1}(\mathbf{x})$. In Lemma B.4, $EI_{t-1}(\mathbf{x})$ is shown to be monotonically increasing with respect to both its exploitation part $f_{t-1}^+ - \mu_{t-1}(\mathbf{x})$ and its exploration part $\sigma_{t-1}(\mathbf{x})$. Lemmas B.2, B.3, and B.4 are used to quantify the exploration-exploitation trade-off properties in Section 3.2. A lower bound on $f_{t-1}^+ - \mu_{t-1}(\mathbf{x}_t)$ when $EI_{t-1}(\mathbf{x})$ is bounded below is given in Lemma B.5 [32]. The global lower bound for $\sigma_{t-1}(\mathbf{x})$ with the nugget is given in Lemma B.6.
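The role of the nugget behind Lemma B.6 can be seen directly from (2): at an already-sampled point the noise-free posterior variance collapses to zero, while the nugget keeps it strictly positive and, in fact, increasing in ϵ. A small numerical check (our own illustration, with arbitrary sample locations):

```python
import numpy as np

def se(a, b, l=1.0):
    """SE kernel value for a single pair of points."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * l ** 2))

X = np.array([[0.0], [0.7], [1.5]])            # three sampled points
K = np.array([[se(a, b) for b in X] for a in X])
k_x = K[:, 0]                                   # k_t(x) evaluated at x = x_1

def post_var(eps):
    """Posterior variance (2) at the previous sample x_1; eps = 0 gives (sigma_t^0)^2."""
    return 1.0 - k_x @ np.linalg.solve(K + eps * np.eye(3), k_x)

print(post_var(0.0))    # ~0: no uncertainty left at a sampled point
print(post_var(1e-6))   # strictly positive with the nugget
```

Because $(K_t + \epsilon I)^{-1}$ decreases in the positive semi-definite order as ϵ grows, the variance in (2) is monotonically increasing in ϵ, which is what furnishes the global lower bound on $\sigma_{t-1}$.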
These two lemmas are used to establish the lower bound for $EI_{t-1}(\mathbf{x}_t)$ and subsequently a lower bound for $f_{t-1}^+ - \mu_{t-1}(\mathbf{x}_t)$, which appears in an intermediate upper bound for $r_t$. Next, we establish the bound on $|I_{t-1}(\mathbf{x}) - EI_{t-1}(\mathbf{x})|$, an important step leading to the bound on $r_t$. In the noise-free frequentist setting, the upper bound on $|f(\mathbf{x}) - \mu_{t-1}(\mathbf{x})|$ has been established in the literature (see, e.g., [41]). We note that while practical EGO uses ϵ in its posterior calculations, the confidence interval $|f(\mathbf{x}) - \mu_{t-1}(\mathbf{x})| \le B\,\sigma_{t-1}(\mathbf{x})$ continues to hold [9]. The effect of ϵ on the upper bound is reflected in the increased $\sigma_{t-1}(\mathbf{x})$. Specifically, $|f(\mathbf{x}) - \mu_{t-1}(\mathbf{x})| \le B\,\sigma_{t-1}(\mathbf{x})$ holds at any given $\mathbf{x} \in C$ and $t \in \mathbb{N}$, as stated in Lemma B.7. Then, using this bound, we establish the bounds on $I_{t-1}(\mathbf{x})$ and $EI_{t-1}(\mathbf{x})$ in Lemma B.9 via Lemma B.8.

3.2 Instantaneous Regret Bound

In this section, we derive instantaneous regret upper bounds on $r_t$ in terms of the posterior standard deviation $\sigma_{t-1}(\mathbf{x}_t)$ and additional exploitation terms, where the sum of the former can be bounded with the maximum information gain $\gamma_t$.

Lemma 3.2. Practical EGO generates the instantaneous regret bound

$$r_t \le c_B^1 \max\{f_{t-1}^+ - f(\mathbf{x}_t), 0\} + \left(c_B^{\epsilon}(\epsilon, t) + B + c_B(B + \phi(0))\right)\sigma_{t-1}(\mathbf{x}_t), \quad (10)$$

where $c_B^{\epsilon}(\epsilon, t) = \log^{1/2}\!\left(\frac{t + \epsilon}{2\pi\,\epsilon\,\tau^2(-B)}\right)$, $c_B = \frac{\tau(B)}{\tau(-B)}$, and $c_B^1 = \max\left\{\frac{\tau(B)}{\tau(-B)} - 1,\, 0\right\}$.

Proof sketch for Lemma 3.2. We consider two cases: $f_{t-1}^+ - f(\mathbf{x}_t) \le 0$ and $f_{t-1}^+ - f(\mathbf{x}_t) > 0$. For the first case, where $f_{t-1}^+ - f(\mathbf{x}_t) \le 0$, we use the bound on $|f(\mathbf{x}) - \mu_{t-1}(\mathbf{x})|$ (Lemma B.7) and Lemma B.9 to obtain an upper bound on $r_t$: $\mu_{t-1}(\mathbf{x}_t) - f_{t-1}^+ + c_B\,EI_{t-1}(\mathbf{x}_t) + B\,\sigma_{t-1}(\mathbf{x}_t)$, where $c_B = \frac{\tau(B)}{\tau(-B)}$.
To derive an upper bound for $\mu_{t-1}(\mathbf{x}_t) - f_{t-1}^+$, we establish a positive lower bound of order $O(1/\sqrt{t})$ on $EI_{t-1}(\mathbf{x}_t)$ for all $t \in \mathbb{N}$. We consider $EI_{t-1}(\mathbf{x}^*)$ and show that $EI_{t-1}(\mathbf{x}^*) \ge \sigma_{t-1}(\mathbf{x}^*)\,\tau(-B)$ using the global lower bound on $\sigma_{t-1}(\mathbf{x})$ mentioned in Section 3.1 (Lemma B.6). Then, the upper bound for $\mu_{t-1}(\mathbf{x}_t) - f_{t-1}^+$ can be derived from the properties of EI (Lemma B.5). The upper bound on $r_t$ becomes $\left(c_B^{\epsilon}(\epsilon, t) + B + c_B(\phi(0) + B)\right)\sigma_{t-1}(\mathbf{x}_t)$, where $c_B^{\epsilon}(\epsilon, t) = \log^{1/2}\!\left(\frac{t + \epsilon}{2\pi\,\epsilon\,\tau^2(-B)}\right)$.

For the second case, where $f_{t-1}^+ - f(\mathbf{x}_t) > 0$, from Lemma B.9 we have the upper bound on $r_t$: $f(\mathbf{x}_t) - f_{t-1}^+ + c_B\,EI_{t-1}(\mathbf{x}_t)$. We can further bound $EI_{t-1}(\mathbf{x}_t)$ via the bound on $|f(\mathbf{x}) - \mu_{t-1}(\mathbf{x})|$ (Lemma B.7) and the properties of EI (Lemma B.3). Thus, the upper bound for $r_t$ becomes $(c_B - 1)(f_{t-1}^+ - f(\mathbf{x}_t)) + c_B(B + \phi(0))\,\sigma_{t-1}(\mathbf{x}_t)$. Combining the bounds in both cases leads to (10) in Lemma 3.2.

Remark 3.3 (Use of ϵ in Lemma 3.2). In addition to the improved numerical stability and statistical properties mentioned in Section 1, ϵ plays an important role in the analysis of $r_t$. Specifically, the upper bound on $\mu_{t-1}(\mathbf{x}_t) - f_{t-1}^+$ requires a positive lower bound on $EI_{t-1}(\mathbf{x}_t)$. The use of ϵ provides a positive global lower bound for the posterior standard deviation $\sigma_{t-1}(\mathbf{x})$, which leads to the positive lower bound on $EI_{t-1}(\mathbf{x}_t)$. Without ϵ, at the previous sample points $\mathbf{x}_i$, i = 1, ..., t − 1, we have $\sigma_{t-1}(\mathbf{x}_i) = 0$ and $EI_{t-1}(\mathbf{x}_i) = 0$. Since it is possible that $\mathbf{x}_i = \mathbf{x}^*$ for some i, we can no longer guarantee a positive lower bound for $EI_{t-1}(\mathbf{x}^*)$, a critical step toward the lower bound on $EI_{t-1}(\mathbf{x}_t)$. Thus, we can no longer obtain a desirable upper bound for $\mu_{t-1}(\mathbf{x}_t) - f_{t-1}^+$.
Intuitively, ϵ > 0 means that the GP model still recognizes some uncertainty, albeit decreasing, at previous sample points, making it more likely that the next sample is chosen close to existing samples when some of them are already close to $\mathbf{x}^*$.

Remark 3.4 (Exploitation term in the instantaneous regret bound). The instantaneous regret upper bound in Lemma 3.2 contains the exploitation term $\max\{f_{t-1}^+ - f(\mathbf{x}_t), 0\}$, which, as far as we know, is a novelty for instantaneous regret bounds. For instance, the instantaneous regret bound for UCB only has exploration terms involving $\sigma_{t-1}(\mathbf{x}_t)$. We elaborate our techniques to bound the sum of $\max\{f_{t-1}^+ - f(\mathbf{x}_t), 0\}$ in Section 3.3.

3.3 Cumulative Regret Bound

In this section, we present the cumulative regret bound of practical EGO. The following lemma establishes the bound based on Lemma 3.2.

Lemma 3.5. The cumulative regret of practical EGO satisfies

$$R_T \le 2 c_B^1 B + \left(c_B^{\epsilon}(\epsilon, T) + B + c_B(B + \phi(0))\right)\sqrt{C_{\gamma}(\epsilon)\, T\, \gamma_T}, \quad (11)$$

where $c_B^{\epsilon}(\epsilon, t) = \log^{1/2}\!\left(\frac{t + \epsilon}{2\pi\,\epsilon\,\tau^2(-B)}\right)$, $C_{\gamma}(\epsilon) = \frac{2}{\log(1 + 1/\epsilon)}$, $c_B = \frac{\tau(B)}{\tau(-B)}$, and $c_B^1 = \max\left\{\frac{\tau(B)}{\tau(-B)} - 1,\, 0\right\}$.

Proof sketch for Lemma 3.5. From Lemma 3.2 and the definition of $R_T$, we need to bound the sum of the exploitation term $\sum_{t=1}^{T} \max\{f_{t-1}^+ - f(\mathbf{x}_t), 0\}$ and the sum $\sum_{t=1}^{T} c_B^{\epsilon}(\epsilon, t)\,\sigma_{t-1}(\mathbf{x}_t)$, given that the sums of the remaining terms are straightforward. To bound the first sum, we construct the subsequence $\{\mathbf{x}_{t_i}\}$ of $\{\mathbf{x}_t\}$ consisting of all $\mathbf{x}_t$ that satisfy $f_{t-1}^+ - f(\mathbf{x}_t) > 0$. Using $f_{t_{i+1}-1}^+ \le f(\mathbf{x}_{t_i})$, which holds since $t_i \le t_{i+1} - 1$, we can write $f_{t_i-1}^+ - f(\mathbf{x}_{t_i}) + f_{t_{i+1}-1}^+ - f(\mathbf{x}_{t_{i+1}}) \le f_{t_i-1}^+ - f(\mathbf{x}_{t_{i+1}}) \le 2B$. Using this technique and summing over all $t_i$ leads to the bound on $\sum_{t=1}^{T} \max\{f_{t-1}^+ - f(\mathbf{x}_t), 0\}$.
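For concreteness, the right-hand side of (11) is cheap to evaluate numerically. The helper below is our own sketch: it assembles the constants $c_B$, $c_B^1$, $c_B^{\epsilon}(\epsilon, T)$, and $C_{\gamma}(\epsilon)$ exactly as defined in Lemmas 3.2 and 3.5, while the value of $\gamma_T$ must be supplied by the user since it is kernel-dependent (the placeholder below mimics the SE-kernel rate $\gamma_T = O(\log^{d+1} T)$ with d = 2 and unit constants).

```python
import math

def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def tau(z):  # eq. (6)
    return z * Phi(z) + phi(z)

def regret_bound(T, eps, gamma_T, B=1.0):
    """Right-hand side of (11) for practical EGO."""
    c_B = tau(B) / tau(-B)
    c_B1 = max(c_B - 1.0, 0.0)
    c_Beps = math.sqrt(math.log((T + eps) / (2 * math.pi * eps * tau(-B) ** 2)))
    C_gamma = 2.0 / math.log(1.0 + 1.0 / eps)
    return 2 * c_B1 * B + (c_Beps + B + c_B * (B + phi(0))) * math.sqrt(C_gamma * T * gamma_T)

# Sublinearity in action: the per-round average bound R_T / T shrinks with T.
for T in (100, 1000, 10000):
    g = math.log(T) ** 3  # illustrative SE-like gamma_T, d = 2, unit constants
    print(T, regret_bound(T, eps=1e-6, gamma_T=g) / T)
```

Since the bound behaves like $\sqrt{\log T \cdot \gamma_T / T}$ per round, any sublinear $\gamma_T$ drives the printed averages to zero, which is exactly the no-regret property of Theorem 3.6.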
For the latter term, using the maximum information gain $\gamma_t$ from Lemma A.4 suffices, since $\sum_{t=1}^{T} c_B^{\epsilon}(\epsilon, t)\,\sigma_{t-1}(\mathbf{x}_t) \le c_B^{\epsilon}(\epsilon, T) \sum_{t=1}^{T} \sigma_{t-1}(\mathbf{x}_t)$.

The rate of the cumulative regret upper bound of practical EGO is stated in the following theorem.

Theorem 3.6. The practical EGO algorithm leads to the cumulative regret upper bound $R_T = O(\log^{1/2}(T)\, T^{1/2} \sqrt{\gamma_T})$. For the SE kernel, $R_T = O(T^{1/2} \log^{(d+2)/2}(T))$. For Matérn kernels (ν > 1/2), $R_T = O\!\left(T^{\frac{\nu+d}{2\nu+d}} \log^{\frac{2\nu+0.5d}{2\nu+d}}(T)\right)$.

From Theorem 3.6, if the $\gamma_T$ of the chosen kernel is sublinear, practical EGO is no-regret.

Remark 3.7. Our analysis framework does not directly apply to the cumulative regret of EGO (without ϵ). As mentioned in Remark 3.3, EGO has $\sigma_{t-1}(\mathbf{x}_i) = 0$, i = 1, ..., t − 1, and thus $EI_{t-1}(\mathbf{x}_i) = 0$ at previous sample points. Therefore, we cannot obtain a lower bound on $EI_{t-1}(\mathbf{x}_t)$ via $EI_{t-1}(\mathbf{x}^*)$. Consequently, the current upper bound on $\mu_{t-1}(\mathbf{x}_t) - f_{t-1}^+$ cannot be obtained for EGO. Determining whether EGO is a no-regret algorithm is a topic for future research and can be considered a limitation of this paper.

4 Effect of the Nugget

In this section, we discuss the impact of the nugget ϵ on practical EGO. The effect of ϵ on the properties of GPs, such as the likelihood, has been studied previously [2]. Our focus is thus on how ϵ can change EI and the cumulative regret bounds of practical EGO. First, we briefly demonstrate the effect ϵ can have on the maximum of the EI function, i.e., $EI_{t-1}(\mathbf{x}_t)$, via a two-dimensional example. We use the Branin function (see example 4 in Section 5) and the GP model from scikit-learn [33].
We train the GP with ϵ = 10⁻², ϵ = 10⁻⁶, ϵ = 10⁻¹⁰, and without a nugget, using the same 50 samples (25 initial samples and 25 iterations of practical EGO with ϵ = 10⁻⁶), and plot the contours of EI in Figure 1. We note that the maximum level of the EI colorbar corresponds to the maximum of EI₅₀ in each contour plot.

Remark 4.1. We remark that while the GP with no nugget has no numerical issues for these 50 samples, we do encounter Cholesky decomposition failures when inverting the covariance matrix with 25 random initial samples and around 75 samples generated from practical EGO runs.

It is clear from Figure 1 that ϵ impacts $EI_{t-1}(\mathbf{x}_t)$, for which a positive lower bound is needed in the bound on $r_t$ (see the proof of Lemma 3.2). In this example, the impact is relatively small when ϵ is small, e.g., 10⁻¹⁰. Consistent with our analysis in Section 3, ϵ ensures a lower bound on $\sigma_{t-1}(\mathbf{x})$ and thus a lower bound on $EI_{t-1}(\mathbf{x}_t)$. This is reflected in the increased value of $EI_{t-1}(\mathbf{x}_t)$ as ϵ increases. We emphasize that the effect of ϵ depends on multiple factors, such as the sample set, the function, and the kernel; Figure 1 is therefore one illustrative example. For instance, if one uses 50 samples generated entirely by random sampling in the example above, the effect of ϵ is much less prominent, since the samples are more evenly spaced, as shown in Figure 5 in the appendix.

Next, we discuss the effect of the nugget on the regret bound from the theoretical perspective. From Lemma 3.5, ϵ appears in the cumulative regret upper bound

$$2 c_B^1 B + \left(c_B^{\epsilon}(\epsilon, T) + B + c_B(B + \phi(0))\right)\sqrt{C_{\gamma}(\epsilon)\, T\, \gamma_T(\epsilon)} \quad (12)$$

through $c_B^{\epsilon}(\epsilon, T)$, $C_{\gamma}(\epsilon)$, and $\gamma_T(\epsilon)$. To complicate matters, $c_B^{\epsilon}(\epsilon, T)$ decreases while $C_{\gamma}(\epsilon)$ increases as ϵ increases. Further, the dependence of $\gamma_T$ on ϵ is kernel-specific.
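A reader can reproduce the qualitative behavior behind Figure 1 along the following lines. This sketch is our own, not the paper's script: it uses scikit-learn's GaussianProcessRegressor (whose alpha parameter plays the role of ϵ), and it replaces the actual practical-EGO sample set with a small design clustered near a Branin minimizer, so only the qualitative behavior, not the exact EI₅₀ values reported in the caption, should be expected.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
branin = lambda x, y: ((y - 5.1 * x**2 / (4 * np.pi**2) + 5 * x / np.pi - 6) ** 2
                       + 10 * (1 - 1 / (8 * np.pi)) * np.cos(x) + 10)

# Clustered samples mimic practical-EGO runs concentrating near an optimum.
X = np.vstack([rng.uniform([-5, 0], [10, 15], size=(10, 2)),
               np.array([np.pi, 2.275]) + 0.05 * rng.standard_normal((15, 2))])
f_vals = branin(X[:, 0], X[:, 1])
grid = rng.uniform([-5, 0], [10, 15], size=(4000, 2))  # evaluation grid for max EI

def max_ei(eps):
    """Max of EI (5) over the grid for a GP trained with nugget eps."""
    gp = GaussianProcessRegressor(RBF(length_scale=2.0), alpha=eps,
                                  optimizer=None, normalize_y=True)
    gp.fit(X, f_vals)
    mu, sigma = gp.predict(grid, return_std=True)
    z = np.where(sigma > 0, (f_vals.min() - mu) / np.maximum(sigma, 1e-300), 0.0)
    return np.max((f_vals.min() - mu) * norm.cdf(z) + sigma * norm.pdf(z))

for eps in (1e-2, 1e-6, 1e-10):
    print(eps, max_ei(eps))
```

As in the paper's discussion, the larger nugget inflates the posterior standard deviation near the clustered samples, which tends to raise the attainable EI there; the exact magnitudes depend on the design and length scale.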
Thus, the impact of ϵ on (12) is complex and kernel-dependent. Here, we provide an answer to this challenging question for the SE and Matérn kernels via the bounds on $\gamma_T$ in [20], as stated in Lemma E.1. From Lemma E.1, for the SE and Matérn kernels, the upper bound on $\gamma_T(\epsilon)$ decreases as ϵ increases. For simplicity, we consider a constant length scale l in both kernels. Further, since the nugget is typically small, we focus on the case where T/ϵ is large. The effect of the nugget ϵ on (12) for the SE kernel is presented next.

Theorem 4.2. Under the conditions of Lemma E.2 and T/ϵ ≫ 1, for the SE kernel at given T:

(1) If
$$(d+1)\log(1 + 1/\epsilon) > \log(1 + T/\epsilon), \quad (13)$$
and $\log(1 + T/\epsilon) \gg \max\{C_{dl}^2, C_{dl}^3, C_R^2, C_R^3, d\}$, then (12) decreases as ϵ increases. The constants are defined in Lemma E.1 and Lemma E.2.

(2) If
$$(d+2)\log(1 + 1/\epsilon) < \frac{1 + \epsilon/T}{1 + \epsilon}\log(1 + T/\epsilon), \quad (14)$$
then (12) increases as ϵ increases.

Next, we consider the Matérn kernels (ν > 1/2).

Theorem 4.3. Under the conditions of Lemma E.3, where T/ϵ ≫ 1, for the Matérn kernel (ν > 1/2):

(1) If
$$\log^{d/(2\nu+d)}(1 + 1/\epsilon) > 1 + 1/C_{d\nu l}^2,$$
and
$$C_{\nu}^1 C_{d\nu l}^1 C_{\nu}^3 \log(1 + 1/\epsilon)\log(1 + 2T/\epsilon) > C, \quad (15)$$
then (12) decreases with increasing ϵ. The constants are defined in Lemma E.1 and Lemma E.3.

(2) If
$$\log(1 + 1/\epsilon)\left(\frac{d}{2\nu+d} + C_{\nu}^1 C_{\nu}^3 + \left(\frac{4\nu+d}{2\nu+d} + C_{\nu}^1\right)\frac{1}{\log(T/\epsilon)}\right) < \frac{1}{1+\epsilon}, \quad (16)$$
then (12) increases with increasing ϵ.

Figure 1: Illustrative example of EI contours for the Branin function with 50 samples. From left to right: contour plots for ϵ = 10⁻², ϵ = 10⁻⁶, ϵ = 10⁻¹⁰, and no nugget. The maximum EI₅₀ values from left to right are 2.53 × 10⁻², 1.90 × 10⁻⁴, 4.72 × 10⁻⁵, and 4.72 × 10⁻⁵.

Remark 4.4. The constants in Theorems 4.2 and 4.3 depend on d, C, and the fixed hyper-parameter l.
Readers are referred to Lemmas E.1, E.2, and E.3 for their definitions. When T/ϵ is sufficiently large, the conditions involving the constants are satisfied.

Remark 4.5. We note that case 2 in both theorems might not be satisfied when T is small (with ϵ also small so that T/ϵ remains large), as well as when d is large. Further, for any given T, for the SE kernel it is possible that neither (13) nor (14) is satisfied. In such cases, the constants play an important role in how ϵ affects (12). Similar conclusions can be drawn for the Matérn kernels.

Theorems 4.2 and 4.3 show that, to obtain a tighter cumulative regret bound, ϵ should stay within a reasonable range. When ϵ is small, the case-1 conditions for both kernels are more likely to be satisfied, and (12) increases as ϵ decreases. When ϵ is large, the case-2 conditions are more likely to be satisfied, and (12) increases as ϵ increases. Intuitively, if ϵ is too large, the posterior variance is inflated too much and EI could place too much emphasis on exploration. On the other hand, if ϵ is too small, practical EGO behaves closer to EGO, which might not be no-regret. In addition, an ϵ that is too small risks not resolving the numerical stability issues. We emphasize that our analysis is based on the state-of-the-art cumulative regret bound (12), not on the cumulative regret itself.

To better illustrate our results, we choose an example set of constants and plot the cumulative regret bound against ϵ. Let $C_{dl}^1 = C_{dl}^2 = C_{dl}^3 = 1$, d = 2, and B = 1 for the SE kernel. For the Matérn kernel, let ν = 2.5, d = 3, $C_\nu = 1$, $C_{d\nu l}^1 = 1$, $C_{d\nu l}^2 = 1$, and C = 1. The rest of the constants can be deduced from these chosen ones. We plot (12) against ϵ for the SE kernel in Figure 2 and for the Matérn kernel in Figure 4 in the appendix. We mark where the conditions for the two cases of Theorems 4.2 and 4.3 are met. The plots clearly demonstrate our theoretical conclusions on the effect of ϵ.
W e note that the inequali- Figure 2: Cumulativ e regret upp er b ound with nugget ϵ at differen t T and selected constants for SE kernel. The case “other” means neither the conditions for case 1 nor those for case 2 are satisfied. ties (13) , (14) , (15) , and (16) are muc h relaxed to allo w for simpler forms. It is clear that when T is small, the second case conditions for b oth kernels migh t not b e satisfied. This do es not mean that the cumulativ e regret b ound is not increasing with ϵ . How ever, when T is small the effect of ϵ b ecomes more complicated to summarize and highly dep endent on constan ts. 5 Numerical Exp eriments In this section, we perform n umerical exp eriments of practical EGO with v arying nugget v alues on widely used test problems to demonstrate the empirical v a- lidit y of our theories. W e consider tw o groups of functions. First, we test functions sampled from GPs with known SE and Mat´ ern kernels ( ν = 2 . 5) using fixed hyperparameters. These problems are designed to minimize the effect of missp ecification of kernels or hyperparameter optimization of GP . Second, w e consider commonly used synthetic functions. F or the first group of functions, w e use 20 and 40 initial design p oin ts for 2D and 4D problems, re- sp ectiv ely , follow ed by 200 additional observ ations acquired iteratively via practical EGO. W e use three ϵ 7 T able 1: The a verage accumulativ e regret for different iteration ov er 20 macro-replications. 
d   kernel   ϵ         t=1     t=50    t=100   t=200
2   SE       10^{-10}  0.596   0.091   0.058   0.040
2   SE       10^{-6}   0.596   0.141   0.101   0.076
2   SE       10^{-4}   0.596   0.153   0.120   0.097
2   Matérn   10^{-10}  0.393   0.061   0.031   0.016
2   Matérn   10^{-6}   0.393   0.055   0.028   0.015
2   Matérn   10^{-4}   0.393   0.054   0.028   0.016
4   SE       10^{-10}  1.422   0.775   0.626   0.440
4   SE       10^{-6}   1.422   0.788   0.640   0.440
4   SE       10^{-4}   1.422   0.799   0.693   0.528
4   Matérn   10^{-10}  0.857   0.458   0.296   0.183
4   Matérn   10^{-6}   0.857   0.493   0.291   0.184
4   Matérn   10^{-4}   0.857   0.474   0.292   0.184

values for each problem: 10^{-10}, 10^{-6}, and 10^{-4}. The GP hyperparameters were kept fixed at the original values used for sampling. The results are shown in Table 1. For each problem, the average cumulative regret, i.e., R_T/T, declines as optimization progresses. For the Matérn kernels and the SE kernel in 4D, the observed regret behavior with respect to ϵ aligns well with our theoretical regret bound predictions, namely that the regret bounds do not change monotonically with ϵ and there may exist a range from which ϵ should be chosen. The SE kernel in 2D, however, shows a smaller regret for smaller ϵ. We note that this does not contradict our conclusion, as the conclusion is based on the upper bound of the regret.

For the synthetic problems, we choose five examples and three nugget values, ϵ = 10^{-2}, ϵ = 10^{-4}, and ϵ = 10^{-6}, for each problem. From example 1 to 5, the objective functions are the Rosenbrock function, the six-hump camel function, the Hartmann6 function, the Branin function, and the Michalewicz function [31, 35]. The mathematical expressions of our test problems are given in Section F in the appendix. The results for the five examples are plotted in Figure 3. The synthetic examples are implemented using GP models in scikit-learn. Each example is run 100 times to account for stochasticity. For the two-dimensional examples 1, 2, 4, and 5, five initial Latin hypercube samples are used.
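For reference, the loop these experiments run can be sketched compactly. The following is a minimal illustrative implementation of practical EGO written directly in NumPy (rather than scikit-learn), with a fixed SE kernel, a random-search maximization of EI, and the six-hump camel objective of example 2; all settings here are illustrative, not the exact experimental configuration:

```python
import numpy as np
from math import erf, sqrt, pi, exp

def Phi(z):  # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):  # standard normal PDF
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def se_kernel(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def posterior(X, y, Xs, eps):
    """GP posterior mean/std with the nugget eps added to the covariance."""
    K = se_kernel(X, X) + eps * np.eye(len(X))
    Ks = se_kernel(Xs, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, np.sqrt(np.maximum(var, 1e-15))

def ei(mu, sigma, f_best):  # EI for minimization, as in (5)
    z = (f_best - mu) / sigma
    return (f_best - mu) * np.vectorize(Phi)(z) + sigma * np.vectorize(phi)(z)

def practical_ego(f, bounds, n_init=5, n_iter=30, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        cand = rng.uniform(lo, hi, size=(2000, len(bounds)))
        mu, sig = posterior(X, y, cand, eps)
        x_next = cand[np.argmax(ei(mu, sig, y.min()))]  # maximize EI
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X, y

def camel(x):  # six-hump camel, example 2
    x1, x2 = x
    return (4 - 2.1*x1**2 + x1**4/3)*x1**2 + x1*x2 + (-4 + 4*x2**2)*x2**2

X, y = practical_ego(camel, [(-3, 3), (-2, 2)], eps=1e-6)
print(y.min())  # best observed value; the global optimum is about -1.0316
```

The nugget enters only through `eps * np.eye(len(X))`, matching the practical EGO covariance matrix; swapping the objective and kernel reproduces the other settings of this section.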
For the six-dimensional problem, example 3, we choose 50 initial Latin hypercube samples due to the increase in dimension. For examples 1, 2, and 3, the SE kernel is used, while the Matérn kernel with ν = 2.5 is used for examples 4 and 5, to demonstrate both kernels covered by our theory. The median of the average cumulative regret among the 100 runs is reported against the number of iterations. The 25th and 75th percentile results are shown in Section F in the appendix. The computational budget is 200 optimization iterations for examples 1, 2, and 4 and 100 for example 3, due to the increased dimension and computational cost. For example 5, we also terminate at 100 optimization iterations, as R_T/T is already sufficiently small.

For all of our examples, practical EGO displays no-regret convergence behavior and is capable of finding the optimal solution rather efficiently, supporting our theoretical results. The different values of ϵ again align well with the regret bound theory, with ϵ = 10^{-4} appearing to generate the lowest regret in 3 out of the 5 examples.

6 Conclusions

In this paper, we establish a novel instantaneous regret bound and the first cumulative regret upper bound for practical EGO, which is the default implementation of EGO in many available software packages. We show that it is a no-regret algorithm for kernels including the SE and Matérn kernels. Our analysis thus provides cumulative regret theory for one of the most widely used BO algorithms. Further, we provide theoretical guidelines on the choice of the nugget, in that an ϵ that is too large can lead to a worse cumulative regret upper bound. In practice, we anticipate the choice of ϵ to be influenced by factors beyond our theoretical results, such as the computational budget and the kernel.

Figure 3: Median average cumulative regret for practical EGO with ϵ values 10^{-2}, 10^{-4}, and 10^{-6} for five examples.
From top to bottom: the Rosenbrock, six-hump camel, Hartmann6, Branin, and Michalewicz functions.

A Background

We first define an equivalent form of EI (5). We distinguish between its exploration and exploitation parts and define the trade-off form EI(a, b) : R × R → R as

\[ EI(a, b) = a\,\Phi\!\left(\frac{a}{b}\right) + b\,\phi\!\left(\frac{a}{b}\right), \quad (17) \]

where b ∈ (0, 1]. One can view a and b as two independent variables. For a given x, if a_t = f⁺_t − µ_t(x) and b_t = σ_t(x) ∈ [0, 1], then EI(a_t, b_t) = EI_t(x). Hence, we refer to f⁺_t − µ_t(x) and σ_t(x) as the exploitation and exploration parts of EI_t, respectively.

The definition of an RKHS is given below.

Definition A.1. Consider a positive definite kernel k : X × X → R with respect to a finite Borel measure supported on X. A Hilbert space H_k of functions on X with an inner product ⟨·, ·⟩_{H_k} is called an RKHS with kernel k if k(·, x) ∈ H_k for all x ∈ X, and ⟨f, k(·, x)⟩_{H_k} = f(x) for all x ∈ X and f ∈ H_k. The induced RKHS norm ∥f∥_{H_k} = √⟨f, f⟩_{H_k} measures the smoothness of f with respect to k.

The union bound is given in the next lemma.

Lemma A.2. For a countable set of events A_1, A_2, . . ., we have

\[ P\Big(\bigcup_{i=1}^{\infty} A_i\Big) \le \sum_{i=1}^{\infty} P(A_i). \]

To use the maximum information gain, we consider Gaussian observation noise η ∼ N(0, ϵ). If such an i.i.d. noise exists, at sample point x_t we have the observation y_t = f(x_t) + η_t. The maximum information gain can now be defined.

Definition A.3. Consider a set of sample points A ⊂ C. Given x_A and its function values f_A = [f(x)]_{x∈A}, the mutual information between f_A and the observation y_A is I(y_A; f_A) = H(y_A) − H(y_A | f_A), where H is the entropy. The maximum information gain γ_T after T samples is

\[ \gamma_T = \max_{A \subset C,\, |A| = T} I(y_A; f_A). \]

Readers are referred to [10, 41] for a detailed discussion of the maximum information gain.
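The trade-off form (17) and the derivative identities used for it later (Lemmas B.1 and B.4) can be checked numerically. The following is a small self-contained sketch; the tolerance and test point are arbitrary:

```python
import math

def Phi(z):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):  # standard normal PDF
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def EI(a, b):
    """Trade-off form (17): EI(a, b) = a Phi(a/b) + b phi(a/b), with b in (0, 1]."""
    return a * Phi(a / b) + b * phi(a / b)

# Lemma B.1: tau(z) = EI(z, 1) stays positive even for very negative z
assert EI(-5.0, 1.0) > 0.0

# Lemma B.4 via central finite differences: dEI/da = Phi(a/b), dEI/db = phi(a/b)
a, b, h = -0.3, 0.5, 1e-6
dEI_da = (EI(a + h, b) - EI(a - h, b)) / (2.0 * h)
dEI_db = (EI(a, b + h) - EI(a, b - h)) / (2.0 * h)
print(abs(dEI_da - Phi(a / b)) < 1e-6, abs(dEI_db - phi(a / b)) < 1e-6)
```

Both derivatives are positive, which is exactly the monotonicity in the exploitation part a and the exploration part b asserted by Lemma B.4.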
Here, we emphasize that practical EGO assumes no observation noise. However, we can still use the maximum information gain as an analytic tool to bound Σ_{t=1}^T σ_{t−1}(x_t). Indeed, σ_{t−1}(x) of practical EGO is the same as that of a GP with Gaussian noise N(0, ϵ). The following lemmas are well-established results from [41] on the information gain and variances.

Lemma A.4. The sum of the posterior standard deviations at the sample points satisfies

\[ \sum_{t=1}^{T} \sigma_{t-1}(x_t) \le \sqrt{C_\gamma(\epsilon)\, T\, \gamma_T(\epsilon)}, \quad (18) \]

where C_γ(ϵ) = 2/ log(1 + ϵ^{−1}).

Here, we emphasize that the maximum information gain depends on the nugget ϵ. The state-of-the-art rates of γ_t for two commonly used kernels are given below.

Lemma A.5 ([20, 46]). For a GP with t samples, the SE kernel has γ_t = O(log^{d+1}(t)), and the Matérn kernel with smoothness parameter ν > 0 has γ_t = O(t^{d/(2ν+d)} log^{2ν/(2ν+d)}(t)).

Before concluding the section, we state the straightforward bound of f on C as a lemma for easy reference.

Lemma A.6. The function f is bounded by B, i.e., |f(x)| ≤ B for all x ∈ C.

Proof. From our assumption that k(x, x) = 1, the reproducing property and the Cauchy–Schwarz inequality give

\[ |f(x)| = |\langle f, k(\cdot, x)\rangle_{H_k}| \le \|f\|_{H_k}\,\sqrt{k(x, x)} \le B. \quad (19) \]

B Preliminary Results

First, we state a property of τ(·) in (6).

Lemma B.1. The function τ(z) is monotonically increasing in z and τ(z) > 0 for all z ∈ R. The derivative of τ(z) is Φ(z).

Proof. From the definition of τ(z), and since zϕ(u) ≥ uϕ(u) for u ≤ z, we can write

\[ \tau(z) = z\,\Phi(z) + \phi(z) > \int_{-\infty}^{z} u\,\phi(u)\,du + \phi(z) = -\phi(u)\Big|_{-\infty}^{z} + \phi(z) = 0. \quad (20) \]

Given the definition of ϕ(u),

\[ \frac{d\phi(u)}{du} = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{u^2}{2}}\,(-u) = -\phi(u)\,u. \quad (21) \]

Thus, the derivative of τ is

\[ \frac{d\tau(z)}{dz} = \Phi(z) + z\,\phi(z) - \phi(z)\,z = \Phi(z) > 0. \quad (22) \]

Another lemma on τ and Φ is given below.

Lemma B.2. Given z > 0, Φ(−z) > τ(−z).

Proof. Define q(z) = Φ(−z) − τ(−z).
Using integration by parts, we have

\[ \Phi(z) = \int_{-\infty}^{z} \phi(u)\,du > \int_{-\infty}^{z} \phi(u)\left(1 - \frac{3}{u^4}\right) du = -\frac{\phi(z)}{z} + \frac{\phi(z)}{z^3}. \quad (23) \]

Replacing z with −z in (23),

\[ \phi(-z)\left(\frac{1}{z} - \frac{1}{z^3}\right) < \Phi(-z). \quad (24) \]

Multiplying both sides of (24) by 1 + z,

\[ (1 + z)\,\Phi(-z) > \phi(-z)\,\frac{z^2 - 1}{z^3}\,(1 + z) = \phi(-z)\left(1 + \frac{z^2 - z - 1}{z^3}\right). \quad (25) \]

Thus, if z > (1 + √5)/2, the right-hand side of (25) exceeds ϕ(−z) and

\[ q(z) := \Phi(-z) - \tau(-z) = (1 + z)\,\Phi(-z) - \phi(-z) > 0. \quad (26) \]

Therefore, in the following we focus on z ∈ (0, (1 + √5)/2]. We analyze q(z) using its derivatives. Taking the derivative of q, by Lemma B.1,

\[ \frac{dq(z)}{dz} = -\phi(-z) + \Phi(-z) =: q'(z). \quad (27) \]

Further, the derivative of q′(z) is

\[ \frac{d^2 q(z)}{dz^2} = \frac{dq'(z)}{dz} = -\phi(-z) + \phi(-z)\,z = \phi(z)\,(z - 1). \quad (28) \]

For z > 1, d²q(z)/dz² > 0; for 0 < z < 1, d²q(z)/dz² < 0. Thus, q′(z) is monotonically decreasing for 0 < z < 1 and monotonically increasing for z > 1. We first consider q′(z) for 0 < z < 1. By simple algebra, q′(0) = Φ(0) − ϕ(0) > 0 and q′(1) = Φ(−1) − ϕ(−1) < 0. Thus, there exists 0 < z̄ < 1 such that q′(z̄) = 0. Next, for z > 1, from Lemma B.1 we can write

\[ q'(z) = \Phi(-z) - \phi(-z) < z\,\Phi(-z) - \phi(-z) = -\tau(-z) < 0. \quad (29) \]

Therefore, 0 < z̄ < 1 < (1 + √5)/2 is the unique stationary point such that q′(z̄) = 0. Thus, q′(z) > 0 for 0 < z < z̄ and q′(z) < 0 for z > z̄. This means that for 0 < z < z̄, q(z) is monotonically increasing, and for z̄ < z < (1 + √5)/2, q(z) is monotonically decreasing. Therefore, q(z) > min{q(0), q((1 + √5)/2)} for z ∈ (0, (1 + √5)/2). Since q(0) > 0 and q((1 + √5)/2) > 0, we have q(z) > 0 for z ∈ (0, (1 + √5)/2). Combined with (26), the proof is complete.

The next lemma contains basic inequalities for EI_t.

Lemma B.3.
EI_{t−1}(x) satisfies EI_{t−1}(x) ≥ 0 and EI_{t−1}(x) ≥ f⁺_{t−1} − µ_{t−1}(x). Moreover,

\[ z_{t-1}(x) \le \frac{EI_{t-1}(x)}{\sigma_{t-1}(x)} = \tau(z_{t-1}(x)) < \begin{cases} \phi(z_{t-1}(x)), & z_{t-1}(x) < 0, \\ z_{t-1}(x) + \phi(z_{t-1}(x)), & z_{t-1}(x) \ge 0. \end{cases} \quad (30) \]

Proof. The first statement follows immediately from the definitions of I_{t−1} and EI_{t−1}. By (5),

\[ \frac{EI_{t-1}(x)}{\sigma_{t-1}(x)} = z_{t-1}(x)\,\Phi(z_{t-1}(x)) + \phi(z_{t-1}(x)). \quad (31) \]

If z_{t−1}(x) < 0, or equivalently f⁺_{t−1} − µ_{t−1}(x) < 0, (31) leads to EI_{t−1}(x)/σ_{t−1}(x) < ϕ(z_{t−1}(x)). If z_{t−1}(x) ≥ 0, Φ(·) < 1 gives EI_{t−1}(x)/σ_{t−1}(x) < z_{t−1}(x) + ϕ(z_{t−1}(x)). The left inequality in (30) is an immediate consequence of EI_{t−1}(x) ≥ f⁺_{t−1} − µ_{t−1}(x).

The monotonicity of the exploration and exploitation parts of (17), previously also shown in [23], is given next.

Lemma B.4. EI(a, b) is monotonically increasing in both a and b for b ∈ (0, 1].

Proof. We prove the lemma by taking the derivative of EI(a, b) with respect to both variables. First,

\[ \frac{\partial EI(a, b)}{\partial a} = \Phi\!\left(\frac{a}{b}\right) + a\,\phi\!\left(\frac{a}{b}\right)\frac{1}{b} + b\,\frac{\partial \phi(a/b)}{\partial a}. \quad (32) \]

From (21), (32) becomes

\[ \frac{\partial EI(a, b)}{\partial a} = \Phi\!\left(\frac{a}{b}\right) + \phi\!\left(\frac{a}{b}\right)\frac{a}{b} - \phi\!\left(\frac{a}{b}\right)\frac{a}{b} = \Phi\!\left(\frac{a}{b}\right) > 0. \quad (33) \]

Similarly,

\[ \frac{\partial EI(a, b)}{\partial b} = -a\,\phi\!\left(\frac{a}{b}\right)\frac{a}{b^2} + \phi\!\left(\frac{a}{b}\right) - b\,\phi\!\left(\frac{a}{b}\right)\frac{a}{b}\left(-\frac{a}{b^2}\right) = \phi\!\left(\frac{a}{b}\right) > 0. \quad (34) \]

The next lemma puts a lower bound on f⁺_{t−1} − µ_{t−1}(x) < 0 when EI_{t−1}(x) is bounded below by a positive sequence κ_t; it was previously shown in [32].

Lemma B.5. If EI_{t−1}(x) ≥ κ_t for some κ_t ∈ (0, 1/√(2π)) and f⁺_{t−1} − µ_{t−1}(x) < 0, then we have

\[ f^+_{t-1} - \mu_{t-1}(x) \ge -\sqrt{2\log\!\left(\frac{1}{\sqrt{2\pi}\,\kappa_t}\right)}\;\sigma_{t-1}(x). \quad (35) \]

Proof.
By the definition of EI_{t−1}(x),

\[ \kappa_t \le (f^+_{t-1} - \mu_{t-1}(x))\,\Phi(z_{t-1}(x)) + \sigma_{t-1}(x)\,\phi(z_{t-1}(x)) < \sigma_{t-1}(x)\,\phi(z_{t-1}(x)) = \sigma_{t-1}(x)\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2} z^2_{t-1}(x)} \le \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2} z^2_{t-1}(x)}. \quad (36) \]

Rearranging and taking the logarithm of (36), we have

\[ 2\log\!\left(\frac{1}{\sqrt{2\pi}\,\kappa_t}\right) > z^2_{t-1}(x) = \left(\frac{f^+_{t-1} - \mu_{t-1}(x)}{\sigma_{t-1}(x)}\right)^2. \quad (37) \]

Given that f⁺_{t−1} − µ_{t−1}(x) < 0, we recover (35).

A global lower bound on the posterior variance is given in the next lemma.

Lemma B.6. The posterior standard deviation (2) has the lower bound

\[ \sigma_t(x) \ge \sqrt{\frac{\epsilon}{t + \epsilon}}. \quad (38) \]

Proof. We invoke the fact that the minimum posterior standard deviation at x is attained when the previous t samples are all x. In this case, all entries of K_t are 1. It is easy to verify that

\[ (K_t + \epsilon I)^{-1} = -\frac{1}{t\epsilon + \epsilon^2}\,P + \frac{1}{\epsilon}\,I, \quad (39) \]

where P is the t × t matrix with all entries equal to 1. Thus, by (2),

\[ \sigma^2_t(x) \ge 1 - p^{\top}\left(-\frac{1}{t\epsilon + \epsilon^2}\,P + \frac{1}{\epsilon}\,I\right) p = \frac{\epsilon}{t + \epsilon}, \quad (40) \]

where p is the t-dimensional vector with all entries equal to 1.

Next, we present a well-established bound on f and the prediction µ_{t−1}(x).

Lemma B.7. For any given x ∈ C and t ≥ 1,

\[ |f(x) - \mu_{t-1}(x)| \le B\,\sigma_{t-1}(x). \quad (41) \]

The proof of Lemma B.7 can be found in Theorem 2 of [9]. Next, we extend the bounds to |I_{t−1}(x) − EI_{t−1}(x)|. An upper bound on I_{t−1}(x) − EI_{t−1}(x) is given in the next lemma.

Lemma B.8. For any given x ∈ C, t ∈ N, and w > 0,

\[ I_{t-1}(x) - EI_{t-1}(x) < \begin{cases} \sigma_{t-1}(x)\,w, & f^+_{t-1} - f(x) \le 0, \\ -f(x) + \mu_{t-1}(x), & f^+_{t-1} - f(x) > 0. \end{cases} \quad (42) \]

Proof. If f⁺_{t−1} − f(x) ≤ 0, we have by definition (4) and Lemma B.3,

\[ I_{t-1}(x) - EI_{t-1}(x) = -EI_{t-1}(x) < 0 < \sigma_{t-1}(x)\,w. \quad (43) \]

For f⁺_{t−1} − f(x) > 0, we can write via Lemma B.3,

\[ I_{t-1}(x) - EI_{t-1}(x) = f^+_{t-1} - f(x) - EI_{t-1}(x) \le f^+_{t-1} - f(x) - f^+_{t-1} + \mu_{t-1}(x) = -f(x) + \mu_{t-1}(x). \quad (44) \]

An upper bound relating I_{t−1}(x) and EI_{t−1}(x) is given in the next lemma.

Lemma B.9. The improvement function and the EI function satisfy

\[ I_{t-1}(x) \le \frac{\tau(B)}{\tau(-B)}\,EI_{t-1}(x), \quad \forall x \in C,\ \forall t \in \mathbb{N}. \quad (45) \]

Proof. We consider two cases. First, if f⁺_{t−1} − f(x) ≤ 0, then I_{t−1}(x) = 0. Since EI_{t−1}(x) ≥ 0, (45) holds. Second, if f⁺_{t−1} − f(x) > 0, then

\[ f^+_{t-1} - \mu_{t-1}(x) = f^+_{t-1} - f(x) + f(x) - \mu_{t-1}(x) > f(x) - \mu_{t-1}(x). \quad (46) \]

From the one-sided inequality in Lemma B.7, (46) implies

\[ f^+_{t-1} - \mu_{t-1}(x) > -B\,\sigma_{t-1}(x). \quad (47) \]

Then, from the monotonicity of τ(·) in Lemma B.1, we have

\[ \tau(z_{t-1}(x)) > \tau(-B), \quad (48) \]

where z_{t−1}(x) = (f⁺_{t−1} − µ_{t−1}(x))/σ_{t−1}(x). Since EI_{t−1}(x) = σ_{t−1}(x) τ(z_{t−1}(x)), we can write

\[ EI_{t-1}(x) = \sigma_{t-1}(x)\,\tau(z_{t-1}(x)) > \tau(-B)\,\sigma_{t-1}(x). \quad (49) \]

Next, we let w = B in Lemma B.8 and obtain

\[ I_{t-1}(x) - EI_{t-1}(x) \le -f(x) + \mu_{t-1}(x) \le B\,\sigma_{t-1}(x). \quad (50) \]

Combining (50) with (49) to eliminate σ_{t−1}(x), we have

\[ EI_{t-1}(x) > \frac{\tau(-B)}{B + \tau(-B)}\,I_{t-1}(x) = \frac{\tau(-B)}{\tau(B)}\,I_{t-1}(x). \quad (51) \]

A result similar to Lemma B.9 was previously shown in [7].

C Instantaneous Regret Bound Proof

The proof of Lemma 3.2 is given next.

Proof. We consider two cases based on the value of f⁺_{t−1} − f(x_t). First, f⁺_{t−1} ≤ f(x_t).
From Lemmas B.7 and B.9,

\[ \begin{aligned} r_t &= f(x_t) - f(x^*) = f(x_t) - f^+_{t-1} + f^+_{t-1} - f(x^*) \le f(x_t) - f^+_{t-1} + I_{t-1}(x^*) \\ &\le f(x_t) - \mu_{t-1}(x_t) + \mu_{t-1}(x_t) - f^+_{t-1} + \frac{\tau(B)}{\tau(-B)}\,EI_{t-1}(x^*) \\ &\le \mu_{t-1}(x_t) - f^+_{t-1} + c_B\,EI_{t-1}(x_t) + B\,\sigma_{t-1}(x_t), \end{aligned} \quad (52) \]

where c_B = τ(B)/τ(−B). Next, we aim to provide a lower bound for EI_{t−1}(x_t) by comparing EI_{t−1}(x_t) to EI_{t−1}(x*). For EI_{t−1}(x*), we can write via Lemma B.7 that

\[ \begin{aligned} EI_{t-1}(x^*) &= \sigma_{t-1}(x^*)\,\tau(z_{t-1}(x^*)) = \sigma_{t-1}(x^*)\,\tau\!\left(\frac{f^+_{t-1} - f(x^*) + f(x^*) - \mu_{t-1}(x^*)}{\sigma_{t-1}(x^*)}\right) \\ &\ge \sigma_{t-1}(x^*)\,\tau\!\left(\frac{f^+_{t-1} - f(x^*)}{\sigma_{t-1}(x^*)} - B\right) \ge \sigma_{t-1}(x^*)\,\tau(-B), \end{aligned} \quad (53) \]

where the second inequality uses f(x*) ≤ f(x). By definition, EI_{t−1}(x_t) ≥ EI_{t−1}(x*). By Lemma B.6, (53) implies

\[ EI_{t-1}(x_t) \ge \tau(-B)\,\sqrt{\frac{\epsilon}{t + \epsilon}}. \quad (54) \]

It is easy to see that τ(−B)√(ϵ/(t + ϵ)) < 1/√(2π). By Lemma B.5, (54) implies

\[ f^+_{t-1} - \mu_{t-1}(x_t) \ge -\log^{1/2}\!\left(\frac{t + \epsilon}{2\pi\,\tau^2(-B)\,\epsilon}\right)\sigma_{t-1}(x_t). \quad (55) \]

Equivalently,

\[ \mu_{t-1}(x_t) - f^+_{t-1} \le c_{B\epsilon}(\epsilon, t)\,\sigma_{t-1}(x_t), \quad (56) \]

where c_{Bϵ}(ϵ, t) = log^{1/2}((t + ϵ)/(2π τ²(−B) ϵ)). Using Lemma B.7, Φ(·) < 1, and f⁺_{t−1} − f(x_t) ≤ 0, we have

\[ \begin{aligned} EI_{t-1}(x_t) &= (f^+_{t-1} - \mu_{t-1}(x_t))\,\Phi(z_{t-1}(x_t)) + \sigma_{t-1}(x_t)\,\phi(z_{t-1}(x_t)) \\ &\le (f^+_{t-1} - f(x_t) + f(x_t) - \mu_{t-1}(x_t))\,\Phi(z_{t-1}(x_t)) + \phi(0)\,\sigma_{t-1}(x_t) \\ &\le B\,\sigma_{t-1}(x_t) + \phi(0)\,\sigma_{t-1}(x_t). \end{aligned} \quad (57) \]

Applying (57) and (56) to (52), we have

\[ r_t \le \left(c_{B\epsilon}(\epsilon, t) + B + c_B(B + \phi(0))\right)\sigma_{t-1}(x_t). \quad (58) \]

Second, if f⁺_{t−1} − f(x_t) ≥ 0, we have

\[ r_t = f(x_t) - f(x^*) = f(x_t) - f^+_{t-1} + f^+_{t-1} - f(x^*) \le f(x_t) - f^+_{t-1} + c_B\,EI_{t-1}(x^*) \le f(x_t) - f^+_{t-1} + c_B\,EI_{t-1}(x_t). \quad (59) \]

The first inequality in (59) is by Lemma B.9.
Further, we can write

\[ \begin{aligned} EI_{t-1}(x_t) &= (f^+_{t-1} - \mu_{t-1}(x_t))\,\Phi(z_{t-1}(x_t)) + \sigma_{t-1}(x_t)\,\phi(z_{t-1}(x_t)) \\ &\le (f^+_{t-1} - f(x_t) + f(x_t) - \mu_{t-1}(x_t))\,\Phi(z_{t-1}(x_t)) + \phi(0)\,\sigma_{t-1}(x_t) \\ &\le f^+_{t-1} - f(x_t) + B\,\sigma_{t-1}(x_t) + \phi(0)\,\sigma_{t-1}(x_t). \end{aligned} \quad (60) \]

Applying (60) to (59), we have

\[ r_t \le (c_B - 1)\left(f^+_{t-1} - f(x_t)\right) + c_B(B + \phi(0))\,\sigma_{t-1}(x_t). \quad (61) \]

Combining (58) and (61), we have

\[ r_t \le \max\{c_B - 1, 0\}\,\max\{f^+_{t-1} - f(x_t), 0\} + \left(c_{B\epsilon}(\epsilon, t) + B + c_B(B + \phi(0))\right)\sigma_{t-1}(x_t). \quad (62) \]

D Cumulative Regret Bound Proof

The proof of Lemma 3.5 is given next.

Proof. From Lemma 3.2, we consider the term Σ_{t=1}^T max{f⁺_{t−1} − f(x_t), 0}. Let P_T ⊆ {1, . . . , T} be the ordered index set such that f⁺_{t−1} − f(x_t) > 0, with t_i ∈ P_T and t_i < t_{i+1}. Then, using t_i − 1 ≥ t_{i−1}, we have

\[ \sum_{t=1}^{T} \max\{f^+_{t-1} - f(x_t), 0\} = \sum_{i=1}^{|P_T|} \left(f^+_{t_i - 1} - f(x_{t_i})\right) \le \sum_{i=1}^{|P_T|} \left(f(x_{t_{i-1}}) - f(x_{t_i})\right). \quad (63) \]

Since f⁺_t ≤ f(x_t) for all t ∈ N, the telescoping sum in (63) leads to

\[ \sum_{t=1}^{T} \max\{f^+_{t-1} - f(x_t), 0\} \le \sum_{i=1}^{|P_T|} \left(f(x_{t_{i-1}}) - f(x_{t_i})\right) \le f(x_{t_0}) - f(x_{t_{|P_T|}}) \le 2B, \quad (64) \]

where we used the boundedness of f in Lemma A.6. Using (64) in (10), we have

\[ \begin{aligned} R_T = \sum_{t=1}^{T} r_t &\le \sum_{t=1}^{T} c^1_B \max\{f^+_{t-1} - f(x_t), 0\} + \sum_{t=1}^{T} \left(c_{B\epsilon}(\epsilon, T) + B + c_B(B + \phi(0))\right)\sigma_{t-1}(x_t) \\ &\le 2\,c^1_B\,B + \left(c_{B\epsilon}(\epsilon, T) + B + c_B(B + \phi(0))\right)\sum_{t=1}^{T} \sigma_{t-1}(x_t). \end{aligned} \quad (65) \]

The proof of Theorem 3.6 is given next.

Proof. From Lemma A.4, we know Σ_{t=1}^T σ_{t−1}(x_t) = O(√(T γ_T)). From Lemma 3.5, the cumulative regret bound is

\[ O(R_T) = O\!\left(\log^{1/2}(T)\,\sqrt{T\,\gamma_T}\right). \quad (66) \]

Using γ_T = O(log^{d+1}(T)) for the SE kernel, we have R_T = O(T^{1/2} log^{(d+2)/2}(T)). Using γ_T = O(T^{d/(2ν+d)} log^{2ν/(2ν+d)}(T)) for the Matérn kernel from [20, 46], we have R_T = O(T^{(ν+d)/(2ν+d)} log^{(2ν+0.5d)/(2ν+d)}(T)).
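Both rates are sublinear in T, which can be seen by tracking only the dominant T and log T factors. The following sketch evaluates the two rates divided by T (constants dropped) to illustrate the no-regret property:

```python
import math

def se_regret_rate(T, d):
    """Dominant growth of the SE-kernel bound: T^(1/2) * log(T)^((d+2)/2)."""
    return math.sqrt(T) * math.log(T) ** ((d + 2) / 2)

def matern_regret_rate(T, d, nu):
    """Dominant growth of the Matern bound: T^((nu+d)/(2nu+d)) * log(T)^((2nu+0.5d)/(2nu+d))."""
    p = (nu + d) / (2 * nu + d)
    q = (2 * nu + 0.5 * d) / (2 * nu + d)
    return T ** p * math.log(T) ** q

# No-regret: the bound divided by T shrinks as T grows, so R_T / T -> 0
for T in (1e3, 1e6, 1e9):
    print(T, se_regret_rate(T, 2) / T, matern_regret_rate(T, 2, 2.5) / T)
```

Both columns decrease toward zero with growing T, consistent with Theorem 3.6; for the Matérn case the exponent (ν + d)/(2ν + d) < 1 requires ν > 1/2 in our setting.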
E Proof of the Nugget Effect

The maximum information gain bounds below are from [20]. We use the following lemma on the upper bound of γ_T.

Lemma E.1 (Maximum information gain upper bound, Theorem 7 in [20]). Assume the domain satisfies C = {x ∈ R^d : ∥x∥₂ ≤ 1}. At given d and T, for the SE kernel, if θ ≤ e² c_d and T ≥ (e − 1)ϵ,

\[ \gamma_T(\epsilon) \le C^1_d\,\theta^d \log^{d+1}(1 + T/\epsilon) + \log(1 + T/\epsilon) + C^2_d \exp\!\left(-\frac{2\theta + 1}{\theta^2}\right). \quad (67) \]

Further, for θ > e² c_d,

\[ \gamma_T(\epsilon) \le C^3_d \log^d\!\left(\frac{\theta}{e\,c_d}\right) \log^{d+1}\!\left(1 + \frac{T}{\epsilon}\right) + C^4_d \log\!\left(1 + \frac{T}{\epsilon}\right) + C^5_d, \quad (68) \]

where θ = 2l² and c_d = max{1, exp((1/e)(d/2 − 1))}. The constants C^i_d, i = 1, . . . , 5, depend only on d. For Matérn kernels with smoothness ν > 1/2, we have

\[ \gamma_T(\epsilon) \le C(T, \nu, \epsilon)\,\bar{\gamma}_T + C, \quad (69) \]

where C_ν > 0 depends only on ν, C is a constant, and

\[ C(T, \nu, \epsilon) = \max\left\{1,\; \log_2\!\left(1 + \frac{\Gamma(\nu)}{C_\nu}\log\!\left(\frac{T^2}{\epsilon}\right)\right) + \frac{1}{\nu}\log_2\!\left(\frac{T^2}{\nu\,\Gamma(\nu)\,\epsilon}\right) + 1\right\}. \quad (70) \]

Furthermore,

\[ \bar{\gamma}_T(\epsilon) = C^1_{d,\nu}\log\!\left(1 + \frac{2T}{\epsilon}\right) + C^2_{d,\nu}\left(\frac{T}{\epsilon\,l^{2\nu}}\right)^{\frac{d}{2\nu + d}} \log^{\frac{2\nu}{2\nu + d}}\!\left(1 + \frac{2T}{\epsilon}\right). \quad (71) \]

The constants C^i_{d,ν} depend only on d and ν.

Given this premise, we rewrite Lemma E.1 into the following two lemmas on the bounds of γ_T.

Lemma E.2. Assume the domain satisfies C = {x ∈ R^d : ∥x∥₂ ≤ 1}. For an SE kernel with fixed l, there exist constants C¹_dl, C²_dl, and C³_dl such that

\[ \gamma_T(\epsilon) \le s_T(\epsilon) = C^1_{dl}\left[\log^{d+1}(1 + T/\epsilon) + C^2_{dl}\log(1 + T/\epsilon) + C^3_{dl}\right], \quad (72) \]

where the constants depend only on the given d and l.

Lemma E.3. Assume the domain satisfies the condition in Theorem 7 of [20]. For a Matérn kernel with given l, ν > 1/2, and d, suppose T/ϵ is large enough that

\[ c^0_T(\epsilon) := C^1_\nu\left[\log\!\left(1 + C^2_\nu \log\!\left(\frac{T^2}{\epsilon}\right)\right) + C^3_\nu \log\!\left(\frac{T^2}{\epsilon}\right) + C^4_\nu\right] \ge 1, \quad (73) \]

where C¹_ν = 1/log(2), C²_ν = Γ(ν)/C_ν, C³_ν = 1/ν, C⁴_ν = (1/ν) log(1/(ν Γ(ν))) + log(2), C_ν > 0 depends only on ν, and Γ is the Gamma function.
Notice that C^i_ν > 0 for i = 1, . . . , 3. Further, there exist C¹_dνl and C²_dνl, depending only on d, ν, and l, such that

\[ \bar{\gamma}_T(\epsilon) = C^1_{d\nu l}\left[\log(1 + 2T/\epsilon) + C^2_{d\nu l}\,(T/\epsilon)^{\frac{d}{2\nu + d}}\log^{\frac{2\nu}{2\nu + d}}(1 + 2T/\epsilon)\right], \]

where C¹_dνl = C^{(1)}_{d,ν} and C²_dνl = C^{(2)}_{d,ν} l^{−2νd/(2ν+d)} / C^{(1)}_{d,ν} (see [20] for the constants C^{(i)}_{d,ν}, i = 1, 2). Then, there exists a constant C > 0 such that γ_T satisfies

\[ \gamma_T(\epsilon) \le s_T(\epsilon) = c^0_T(\epsilon)\,\bar{\gamma}_T(\epsilon) + C. \]

Note that C⁴_ν is not necessarily positive. The assumption that c⁰_T(ϵ) ≥ 1 can be achieved when T/ϵ ≫ 1. We further assume that T/ϵ is sufficiently large that c⁰_T(ϵ) γ̄_T(ϵ) ≥ C.

Next, we simplify the upper bound of R_T(ϵ).

Lemma E.4. Let C¹_R = 2 c¹_B B, C²_R = log(1/(2π τ²(−B))), C³_R = B + c_B(B + ϕ(0)), and C⁴_R = C²_R + (C³_R)² (see Lemma 3.5 for the constants). Suppose the upper bound on γ_T(ϵ) is γ_T(ϵ) ≤ s_T(ϵ). Define

\[ c_T(\epsilon) = \left[\log(1 + T/\epsilon) + 2 C^3_R\left(\log(1 + T/\epsilon) + C^2_R\right)^{1/2} + C^4_R\right]\frac{s_T(\epsilon)}{\log(1 + 1/\epsilon)}. \]

Then, the cumulative regret upper bound from Lemma 3.5 is R_T ≤ C¹_R + u_T(ϵ), where

\[ u_T(\epsilon) = \sqrt{2\,c_T(\epsilon)\,T}. \quad (74) \]

Proof. By Lemma 3.5,

\[ \begin{aligned} R_T &\le 2\,c^1_B\,B + \left(c_{B\epsilon}(\epsilon, T) + B + c_B(B + \phi(0))\right)\sqrt{C_\gamma(\epsilon)\,T\,\gamma_T(\epsilon)} \\ &\le C^1_R + \left(\sqrt{C^2_R + \log(1 + T/\epsilon)} + C^3_R\right)\sqrt{C_\gamma(\epsilon)\,T\,s_T(\epsilon)} = C^1_R + \sqrt{c_T(\epsilon)}\,\sqrt{2T}. \end{aligned} \quad (75) \]

First, we present the proof for the SE kernel. The proof of Theorem 4.2 is given below.

Proof. From Lemma E.4, at a given T, the upper bound on R_T(ϵ) changes in the same way as c_T(ϵ). In particular, by Lemma E.2 we only need to consider c_T(ϵ)/C¹_dl. Hence, in the following, we analyze how c_T(ϵ) changes with ϵ. For simplicity, let η = 1/ϵ.
We define the following shorthands for the terms in c_T(η):

\[ c^1_T(\eta) = \log(1 + T\eta) + 2 C^3_R\left(\log(1 + T\eta) + C^2_R\right)^{1/2} + C^4_R, \quad (76) \]

and

\[ c^2_T(\eta) = \log^{d+1}(1 + T\eta) + C^2_{dl}\log(1 + T\eta) + C^3_{dl}. \quad (77) \]

Thus, by Lemma E.2,

\[ c_T(\eta)/C^1_{dl} = c^1_T(\eta)\,\frac{1}{\log(1 + \eta)}\,c^2_T(\eta). \quad (78) \]

Define c³_T(η) := c¹_T(η) c²_T(η). We expand it into the sum of nine terms:

\[ \begin{aligned} &\log^{d+2}(1 + T\eta) + 2 C^3_R\left(\log(1 + T\eta) + C^2_R\right)^{1/2}\log^{d+1}(1 + T\eta) + C^4_R\log^{d+1}(1 + T\eta) \\ &\quad + C^2_{dl}\log^2(1 + T\eta) + 2 C^3_R C^2_{dl}\left(\log(1 + T\eta) + C^2_R\right)^{1/2}\log(1 + T\eta) + C^4_R C^2_{dl}\log(1 + T\eta) \\ &\quad + C^3_{dl}\log(1 + T\eta) + 2 C^3_R C^3_{dl}\left(\log(1 + T\eta) + C^2_R\right)^{1/2} + C^4_R C^3_{dl}. \end{aligned} \quad (79) \]

By (78), differentiating c_T(η)/C¹_dl gives

\[ \frac{dc_T(\eta)}{d\eta}\,\frac{1}{C^1_{dl}} = \frac{1}{\log^2(1 + \eta)}\left[\log(1 + \eta)\,\frac{dc^3_T(\eta)}{d\eta} - \frac{1}{1 + \eta}\,c^3_T(\eta)\right]. \quad (80) \]

Thus, the sign of dc_T(η)/dη is the sign of d_c defined as

\[ d_c(\eta) := \log(1 + \eta)\,\frac{dc^3_T(\eta)}{d\eta} - \frac{1}{1 + \eta}\,c^3_T(\eta). \quad (81) \]

For simplicity, we suppress the dependence on the given T. Next, we expand d_c(η) in (81) as the sum of nine terms, each corresponding to one term in (79). The first term is log^{d+2}(1 + Tη) and its contribution to d_c(η) is

\[ d^1_c(\eta) := \log^{d+1}(1 + T\eta)\left[(d + 2)\,\frac{\log(1 + \eta)}{1/T + \eta} - \frac{\log(1 + T\eta)}{1 + \eta}\right]. \quad (82) \]

If (13) is satisfied, then

\[ d^1_c(\eta) > \frac{1}{1 + \eta}\log^{d+1}(1 + T\eta)\left[(d + 2)\log(1 + \eta) - \log(1 + T\eta)\right] > \frac{1}{1 + \eta}\,\frac{1}{d + 1}\log^{d+2}(1 + T\eta). \quad (83) \]

Now we repeat this derivation for the second term. Its contribution to (81) is

\[ d^2_c(\eta) := 2 C^3_R\left(\log(1 + T\eta) + C^2_R\right)^{1/2}\log^d(1 + T\eta)\left[(d + 1)\,\frac{\log(1 + \eta)}{1/T + \eta} + \frac{1}{2(1/T + \eta)}\,\frac{\log(1 + \eta)}{1 + C^2_R/\log(1 + T\eta)} - \frac{\log(1 + T\eta)}{1 + \eta}\right]. \quad (84) \]

If (13) is satisfied,

\[ d^2_c(\eta) > 2 C^3_R\left(\log(1 + T\eta) + C^2_R\right)^{1/2}\log^d(1 + T\eta)\,\frac{1}{2(d + 1)(1 + \eta)}\,\frac{\log(1 + T\eta)}{1 + C^2_R/\log(1 + T\eta)}. \quad (85) \]

The third term adds to (81) as

\[ d^3_c(\eta) := C^4_R\log^d(1 + T\eta)\left[(d + 1)\,\frac{\log(1 + \eta)}{1/T + \eta} - \frac{\log(1 + T\eta)}{1 + \eta}\right]. \quad (86) \]

If (13) holds, then d³_c(η) > 0. The fourth term yields

\[ d^4_c(\eta) := C^2_{dl}\log(1 + T\eta)\left[2\,\frac{\log(1 + \eta)}{1/T + \eta} - \frac{\log(1 + T\eta)}{1 + \eta}\right]. \quad (87) \]

With (13), we can write

\[ d^4_c(\eta) > -C^2_{dl}\log^2(1 + T\eta)\,\frac{1}{1 + \eta}\,\frac{d - 1}{d + 1}. \quad (88) \]

For the fifth term, its role in (81) is

\[ d^5_c(\eta) := 2 C^3_R C^2_{dl}\left(\log(1 + T\eta) + C^2_R\right)^{1/2}\left[\frac{\log(1 + \eta)}{1/T + \eta} + \frac{1}{2(1/T + \eta)}\,\frac{\log(1 + \eta)}{1 + C^2_R/\log(1 + T\eta)} - \frac{\log(1 + T\eta)}{1 + \eta}\right]. \quad (89) \]

Given (13),

\[ d^5_c(\eta) > \frac{2 C^3_R C^2_{dl}}{1 + \eta}\left(\log(1 + T\eta) + C^2_R\right)^{1/2}\left[\frac{1}{2(d + 1)}\,\frac{\log(1 + T\eta)}{1 + C^2_R/\log(1 + T\eta)} - \frac{d}{d + 1}\log(1 + T\eta)\right]. \quad (90) \]

The sixth and seventh terms can be combined and yield for (81)

\[ d^{67}_c(\eta) := \left(C^4_R C^2_{dl} + C^3_{dl}\right)\left[\frac{\log(1 + \eta)}{1/T + \eta} - \frac{\log(1 + T\eta)}{1 + \eta}\right]. \quad (91) \]

By (13),

\[ d^{67}_c(\eta) > -\left(C^4_R C^2_{dl} + C^3_{dl}\right)\frac{1}{1 + \eta}\,\frac{d}{d + 1}\log(1 + T\eta). \quad (92) \]

The eighth term leads to

\[ d^8_c(\eta) := 2 C^3_R C^3_{dl}\left(\log(1 + T\eta) + C^2_R\right)^{1/2}\left[\frac{1}{2(1/T + \eta)}\,\frac{\log(1 + \eta)}{\log(1 + T\eta) + C^2_R} - \frac{1}{1 + \eta}\right]. \quad (93) \]

Using (13),

\[ d^8_c(\eta) > 2 C^3_R C^3_{dl}\left(\log(1 + T\eta) + C^2_R\right)^{1/2}\frac{1}{\eta + 1}\left[\frac{1}{2(d + 1)}\,\frac{1}{1 + C^2_R/\log(1 + T\eta)} - 1\right]. \quad (94) \]

The last term contributes to (81)

\[ d^9_c(\eta) := -C^4_R C^3_{dl}\,\frac{1}{1 + \eta}. \quad (95) \]

Since d_c(η) = Σ_{i=1}^9 d^i_c(η), we consider the order of the lower bound of each d^i_c(η) to determine the sign of d_c(η). We combine the lower bounds above and lift the common multiplier 1/(1 + η). Since log(1 + Tη) ≫ 1, the orders are as follows. The positive lower bounds are of order log^{d+2}(1 + Tη) from (83) and (1/d) C³_R log^{d+1.5}(1 + Tη) from (85). For the negative lower bounds, we have −C²_dl ((d − 1)/(d + 1)) log²(1 + Tη) from (88), −C³_R C²_dl log^{1.5}(1 + Tη) from (90), −(C⁴_R C²_dl + C³_dl) log(1 + Tη) from (92), −C³_R C³_dl log^{0.5}(1 + Tη) from (94), and −C⁴_R C³_dl from (95). It is easy to verify that, for d ≥ 2,

\[ \frac{1}{d}\,C^3_R\log^{d+1.5}(1 + T\eta) \gg C^3_R C^2_{dl}\log^{1.5}(1 + T\eta). \]

Further, the positive bound from (83) satisfies

\[ \begin{aligned} \log^{d+2}(1 + T\eta) &\gg C^2_{dl}\,\frac{d - 1}{d + 1}\log^2(1 + T\eta), & \log^{d+2}(1 + T\eta) &\gg \left(C^4_R C^2_{dl} + C^3_{dl}\right)\log(1 + T\eta), \\ \log^{d+2}(1 + T\eta) &\gg C^3_R C^3_{dl}\log^{0.5}(1 + T\eta), & \log^{d+2}(1 + T\eta) &\gg C^4_R C^3_{dl}. \end{aligned} \quad (96) \]

Thus, by (80) and (81), we arrive at d_c(η) > 0 and dc_T(η)/dη > 0. When d = 1, the conclusion remains the same, as d³_c(η) and d⁴_c(η) coincide and d⁴_c(η) > 0.

Next, we consider the case where (14) is true. From d¹_c(η) in (82) and (14), we have d¹_c(η) < 0. Indeed, it is easy to verify that all d^i_c(η) < 0 for i = 2, . . . , 9. Hence, d_c < 0 and dc_T(η)/dη < 0.

The proof of Theorem 4.3 is given below.

Proof. Recall from Lemma E.3 that s_T(ϵ) = c⁰_T(ϵ) γ̄_T(ϵ) + C. By Lemma E.4, the cumulative regret bound changes in the same way as c_T(ϵ)/(C¹_dνl C¹_ν). First, define

\[ \begin{aligned} c^1_T(\eta) &:= \log(1 + T\eta) + 2 C^3_R\left(\log(1 + T\eta) + C^2_R\right)^{1/2} + C^4_R, \\ c^2_T(\eta) &:= c^0_T(\eta)/C^1_\nu = \log\!\left(1 + C^2_\nu\log(T^2\eta)\right) + C^3_\nu\log(T^2\eta) + C^4_\nu, \\ c^3_T(\eta) &:= \bar{\gamma}_T(\eta)/C^1_{d\nu l} = \log(1 + 2T\eta) + C^2_{d\nu l}\,(T\eta)^{\frac{d}{2\nu + d}}\log^{\frac{2\nu}{2\nu + d}}(1 + 2T\eta). \end{aligned} \quad (97) \]

Let c^{123}_T(η) = c¹_T(η) c²_T(η) c³_T(η). By Lemma E.3, we have

\[ c_T(\eta) = C^1_\nu C^1_{d\nu l}\,c^{123}_T(\eta)\,\frac{1}{\log(1 + \eta)} + c^1_T(\eta)\,\frac{C}{\log(1 + \eta)}. \quad (98) \]

Consider the first part of (98). Its derivative is

\[ C^1_\nu C^1_{d\nu l}\,\frac{1}{\log^2(1 + \eta)}\left[\log(1 + \eta)\,\frac{dc^{123}_T(\eta)}{d\eta} - \frac{1}{1 + \eta}\,c^{123}_T(\eta)\right]. \quad (99) \]

The sign of (99) is the same as the sign of

\[ d_c(\eta) := \frac{dc^{123}_T(\eta)}{d\eta}\log(1 + \eta) - \frac{1}{1 + \eta}\,c^{123}_T(\eta). \quad (100) \]

Consider case (1) first.
Since c^i_T(η) > 0 and increases with η for i = 1, 2, 3, we have

\[ \frac{dc^{123}_T(\eta)}{d\eta} = \frac{dc^1_T(\eta)}{d\eta}\,c^2_T(\eta)\,c^3_T(\eta) + \frac{dc^2_T(\eta)}{d\eta}\,c^1_T(\eta)\,c^3_T(\eta) + \frac{dc^3_T(\eta)}{d\eta}\,c^1_T(\eta)\,c^2_T(\eta) > \frac{dc^3_T(\eta)}{d\eta}\,c^1_T(\eta)\,c^2_T(\eta). \quad (101) \]

The first two terms in the equality above will be used later. Using (101) in (100), we have the lower bound

\[ d_c(\eta) > c^1_T(\eta)\,c^2_T(\eta)\left[\frac{dc^3_T(\eta)}{d\eta}\log(1 + \eta) - \frac{1}{1 + \eta}\,c^3_T(\eta)\right]. \quad (102) \]

The derivative is

\[ \frac{dc^3_T(\eta)}{d\eta} = \frac{1}{\eta + 1/2T} + C^2_{d\nu l}\,\frac{d}{2\nu + d}\,(T\eta)^{\frac{d}{2\nu + d}}\,\frac{1}{\eta}\log^{\frac{2\nu}{2\nu + d}}(1 + 2T\eta) + C^2_{d\nu l}\,(T\eta)^{\frac{d}{2\nu + d}}\,\frac{2\nu}{2\nu + d}\log^{-\frac{d}{2\nu + d}}(1 + 2T\eta)\,\frac{1}{\eta + 1/2T}. \quad (103) \]

Note that the first and third terms in (103) are positive. Dropping them and using (103) in (102), we have

\[ d_c(\eta) > c^1_T(\eta)\,c^2_T(\eta)\,C^2_{d\nu l}\,(T\eta)^{\frac{d}{2\nu + d}}\log^{\frac{2\nu}{2\nu + d}}(1 + 2T\eta)\,\frac{1}{1 + \eta}\left[\frac{d}{2\nu + d}\log(1 + \eta) - \left(\frac{1}{C^2_{d\nu l}}\left(\frac{\log(1 + 2T\eta)}{T\eta}\right)^{\frac{d}{2\nu + d}} + 1\right)\right]. \quad (104) \]

Since log(1 + 2Tη)/(Tη) < 1 for Tη ≫ 1, from (15) we have

\[ d_c(\eta) > c^1_T(\eta)\,c^2_T(\eta)\,C^2_{d\nu l}\,(T\eta)^{\frac{d}{2\nu + d}}\log^{\frac{2\nu}{2\nu + d}}(1 + 2T\eta)\,\frac{1}{1 + \eta}\left[\frac{d\log(1 + \eta)}{2\nu + d} - \left(\frac{1}{C^2_{d\nu l}} + 1\right)\right] > 0. \quad (105) \]

Next, we consider the second part of (98). Its derivative is

\[ \frac{C}{\log^2(1 + \eta)}\left[\log(1 + \eta)\,\frac{dc^1_T(\eta)}{d\eta} - \frac{1}{1 + \eta}\,c^1_T(\eta)\right]. \quad (106) \]

Given (15), we know that the second term in (101) leads to

\[ C^1_\nu C^1_{d\nu l}\,\frac{dc^2_T(\eta)}{d\eta}\,c^1_T(\eta)\,c^3_T(\eta)\log(1 + \eta) > C^1_\nu C^1_{d\nu l}\,C^3_\nu\,\frac{1}{\eta}\,c^1_T(\eta)\log(1 + 2T\eta)\log(1 + \eta) > c^1_T(\eta)\,\frac{C}{1 + \eta}. \quad (107) \]

Hence, by (99), (105), (98), and (107), we have dc_T(η)/dη > 0.

Next, we consider case 2. We use a more compact proof procedure, given that c^{123}_T(η) has 18 terms.
Define c^{13}_T(η) := c¹_T(η) c³_T(η) = Σ_{p=1}^3 Σ_{q=1}^2 a¹_p(η) a³_q(η), where

\[ \begin{aligned} a^1_1(\eta) &= \log(1 + T\eta), \quad a^1_2(\eta) = 2 C^3_R\left(\log(1 + T\eta) + C^2_R\right)^{1/2}, \quad a^1_3(\eta) = C^4_R, \\ a^3_1(\eta) &= \log(1 + 2T\eta), \quad a^3_2(\eta) = C^2_{d\nu l}\,(T\eta)^{\frac{d}{2\nu + d}}\log^{\frac{2\nu}{2\nu + d}}(1 + 2T\eta). \end{aligned} \quad (108) \]

Figure 4: Cumulative regret upper bound with nugget ϵ at different T and selected constants for the Matérn kernel (ν = 2.5). The case "other" means that neither the conditions for case 1 nor those for case 2 are satisfied.

Thus, d_c(η) can be written as

\[ d_c(\eta) = \sum_{p=1}^{3}\sum_{q=1}^{2} c^2_T(\eta)\,a^1_p(\eta)\,a^3_q(\eta)\left[\left(\frac{da^1_p(\eta)}{d\eta}\Big/a^1_p(\eta) + \frac{da^3_q(\eta)}{d\eta}\Big/a^3_q(\eta) + \frac{dc^2_T(\eta)}{d\eta}\Big/c^2_T(\eta)\right)\log(1 + \eta) - \frac{1}{1 + \eta}\right]. \quad (109) \]

We note that c²_T(η) is not decomposed for an even more compact proof because it is possible that C⁴_ν < 0. Since a¹_p(η), a³_q(η), c²_T(η) > 0, to show d_c(η) < 0 it is sufficient to find the largest value of

\[ d^{pq}_\eta := \left[\frac{da^1_p(\eta)}{d\eta}\Big/a^1_p(\eta) + \frac{da^3_q(\eta)}{d\eta}\Big/a^3_q(\eta) + \frac{dc^2_T(\eta)}{d\eta}\Big/c^2_T(\eta)\right]\log(1 + \eta) - \frac{1}{1 + \eta} \quad (110) \]

over all p and q, a total of six terms, and show that it is negative. If this largest term yields d^{pq}_η < 0, then the other five (p, q) combinations must also have d^{pq}_η < 0 and, hence, the sum (109) is negative. Using elementary algebra, one can compare the six terms and verify that the largest d^{pq}_η occurs at p = 1 and q = 2, given Tη ≫ 1. The third term in (110) satisfies

\[ \frac{dc^2_T(\eta)}{d\eta}\Big/c^2_T(\eta) < \frac{1}{\eta}\,\frac{1}{c^2_T(\eta)}\left(\frac{1}{1/C^2_\nu + \log(T^2\eta)} + C^3_\nu\right) < \frac{1}{\eta}\,\frac{1}{c^2_T(\eta)}\left(\frac{1}{\log(T^2\eta)} + C^3_\nu\right). \quad (111) \]

Further,

\[ \frac{da^1_1(\eta)}{d\eta}\Big/a^1_1(\eta) = \frac{1}{1/T + \eta}\,\frac{1}{\log(1 + T\eta)}, \qquad \frac{da^3_2(\eta)}{d\eta}\Big/a^3_2(\eta) < \frac{1}{\eta}\left(\frac{d}{2\nu + d} + \frac{2\nu}{2\nu + d}\,\frac{1}{\log(1 + 2T\eta)}\right). \quad (112) \]

Thus, for any p, q, we can write

\[ \begin{aligned} d^{pq}_\eta &< \frac{1}{\eta}\left[\frac{1}{\log(T\eta)} + \frac{d}{2\nu + d} + \frac{2\nu}{2\nu + d}\,\frac{1}{\log(2T\eta)} + \frac{1}{c^2_T(\eta)}\,\frac{1}{\log(T^2\eta)} + \frac{C^3_\nu}{c^2_T(\eta)}\right]\log(1 + \eta) - \frac{1}{1 + \eta} \\ &< \frac{1}{\eta}\left[\frac{d}{2\nu + d} + \frac{1}{\log(T\eta)}\left(\frac{4\nu + d}{2\nu + d} + \frac{1}{c^2_T(\eta)}\right) + \frac{C^3_\nu}{c^2_T(\eta)}\right]\log(1 + \eta) - \frac{1}{1 + \eta}, \end{aligned} \quad (113) \]

where we use log(T²η) ≥ log(Tη). By (16), the right-hand side of (113) is negative. Thus, the other five (smaller) terms in (110) are also negative, and d_c < 0 by (109). Further, under (16), it is easy to verify that the second part of (98) is also decreasing, and thus dc_T(η)/dη < 0. By Lemma E.4, the cumulative regret bound changes with ϵ = 1/η in the same way as c_T.

Figure 5: Illustrative example of the EI contour of the Branin function with 50 random samples. From left to right: contour plots for ϵ = 10^{-2}, ϵ = 10^{-6}, ϵ = 10^{-10}, and no nugget. The maximum EI_{50} values from left to right are 0.40515, 0.40085, 0.40085, and 0.40085. The maximum level of the colorbar corresponds to the maximum of EI_{50} in each plot.

F Numerical Example and Additional Plots

F.1 Plot for the Nugget Effect

As mentioned in Section 4, we choose the 50 samples entirely through random sampling for the Branin function and plot the EI contour using different nugget values in Figure 5. The effect of ϵ on the values of EI is much less obvious here, as expected.

F.2 Numerical Example Setup

The mathematical expression for example 1, the two-dimensional Rosenbrock function, is given below:

\[ f(x) = \sum_{i=1}^{d-1}\left[100\left(x_{i+1} - x_i^2\right)^2 + (x_i - 1)^2\right], \quad x \in [-2.048, 2.048]^2. \quad (114) \]

The optimal objective function value is 0. The mathematical expression for example 2, the six-hump camel function, is given below:

\[ f(x) = \left(4 - 2.1 x_1^2 + \frac{x_1^4}{3}\right)x_1^2 + x_1 x_2 + \left(-4 + 4 x_2^2\right)x_2^2, \quad -3 \le x_1 \le 3,\ -2 \le x_2 \le 2. \quad (115) \]

The optimal objective function value is −1.0316. The mathematical expression for example 3, the Hartmann6 function, is given below.
$$f(x) = -\sum_{i=1}^4 \alpha_i\exp\left(-\sum_{j=1}^6 A_{ij}(x_j-P_{ij})^2\right), \quad x_j \in [0,1], \ j = 1,\dots,6,$$
$$\alpha = [1.0, 1.2, 3.0, 3.2]^\top, \quad A = \begin{pmatrix} 10 & 3.0 & 17 & 3.5 & 1.7 & 8.0 \\ 0.05 & 10 & 17 & 0.1 & 8.0 & 14 \\ 3.0 & 3.5 & 1.7 & 10 & 17 & 8.0 \\ 17 & 8.0 & 0.05 & 10 & 0.1 & 14 \end{pmatrix},$$
$$P = \begin{pmatrix} 0.131 & 0.170 & 0.557 & 0.012 & 0.828 & 0.587 \\ 0.233 & 0.414 & 0.831 & 0.374 & 0.100 & 0.999 \\ 0.235 & 0.145 & 0.352 & 0.288 & 0.305 & 0.665 \\ 0.405 & 0.883 & 0.873 & 0.574 & 0.109 & 0.038 \end{pmatrix}. \tag{116}$$
The optimal objective function value is $-3.32$. The mathematical expression for example 4, the Branin function, is given below:
$$f(x) = \left(x_2 - \frac{5.1}{4\pi^2}x_1^2 + \frac{5}{\pi}x_1 - 6\right)^2 + 10\left(1-\frac{1}{8\pi}\right)\cos(x_1) + 10, \quad x_1 \in [-5,10], \ x_2 \in [0,15]. \tag{117}$$
The optimal objective function value is $0.3979$. The mathematical expression for example 5, the Michalewicz function, is given below:
$$f(x) = -\sum_{i=1}^2 \sin(x_i)\sin^{20}\left(\frac{ix_i^2}{\pi}\right), \quad x \in [0,\pi]^2. \tag{118}$$
The optimal objective function value is $-1.8013$. The 25th and 75th percentiles of the 100 repeated runs are shown in the following figure.

Figure 6: 25th and 75th percentile average cumulative regret for practical EGO with nugget values $10^{-2}$, $10^{-4}$, and $10^{-6}$ for five examples: (a) example 1, Rosenbrock function; (b) example 2, six-hump camel function; (c) example 3, Hartmann6; (d) example 4, Branin function; (e) example 5, Michalewicz function.

References

[1] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1–39.26. JMLR Workshop and Conference Proceedings, 2012.
[2] I. Andrianakis and P. G. Challenor. The effect of the nugget on Gaussian process emulators of computer models. Computational Statistics & Data Analysis, 56(12):4215–4228, 2012.
[3] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. In Advances in Neural Information Processing Systems 33, 2020.
[4] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman. Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research, 2018.
[5] R. Bostanabad, T. Kearney, S. Tao, D. W. Apley, and W. Chen. Leveraging the nugget parameter for efficient Gaussian process modeling. International Journal for Numerical Methods in Engineering, 114(5):501–516, 2018.
[6] D. Bouneffouf, I. Rish, and C. Aggarwal. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pages 1–8. IEEE, 2020.
[7] A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(10), 2011.
[8] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 76:5–23, February 2016.
[9] S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pages 844–853. PMLR, 2017.
[10] T. M. Cover. Elements of Information Theory. John Wiley & Sons, 1999.
[11] P. I. Frazier. Bayesian optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems, pages 255–278, October 2018.
[12] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. Advances in Neural Information Processing Systems, 31, 2018.
[13] J. R. Gardner, M. J. Kusner, Z. Xu, K. Q. Weinberger, and J. P. Cunningham. Bayesian optimization with inequality constraints. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pages II-937–II-945. JMLR.org, 2014.
[14] P. E. Gill and W. Murray. Newton-type methods for unconstrained and linearly constrained optimization. Mathematical Programming, 7:311–350, 1974.
[15] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. SIAM, 2019.
[16] J. Gondzio. Matrix-free interior point method. Computational Optimization and Applications, 51:457–480, 2012.
[17] R. B. Gramacy and H. K. H. Lee. Cases for the nugget in modeling computer experiments. Statistics and Computing, 22:713–722, 2012.
[18] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2002.
[19] S. Hu, H. Wang, Z. Dai, B. K. Hsiang Low, and S. H. Ng. Adjusted expected improvement for cumulative regret minimization in noisy Bayesian optimization. Journal of Machine Learning Research, 26(46):1–33, 2025.
[20] S. Iwazaki. Improved regret bounds for Gaussian process upper confidence bound in Bayesian optimization. arXiv preprint arXiv:2506.01393, 2025.
[21] S. Jeong and S. Obayashi. Efficient global optimization (EGO) for multi-objective problem and data mining. In 2005 IEEE Congress on Evolutionary Computation, volume 3, pages 2138–2145. IEEE, 2005.
[22] D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345–383, 2001.
[23] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.
[24] A. Kiełbasiński. A note on rounding-error analysis of Cholesky factorization. Linear Algebra and its Applications, 88:487–494, 1987.
[25] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[26] D. J. Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, Edmonton, Alberta, Canada, 2008.
[27] C. J. Lourenco and E. Moreno-Centeno. Exactly solving sparse rational linear systems via roundoff-error-free Cholesky factorizations. SIAM Journal on Matrix Analysis and Applications, 43(1):439–463, 2022.
[28] Y. Lyu, Y. Yuan, and I. W. Tsang. Efficient batch black-box optimization with deterministic regret bounds. arXiv preprint arXiv:1905.10041, 2019.
[29] R. Martinez-Cantin. BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. Technical report, 2014.
[30] J. Meinguet. Refined error analyses of Cholesky factorization. SIAM Journal on Numerical Analysis, 20(6):1243–1250, 1983.
[31] M. Molga and C. Smutnicki. Test functions for optimization needs. Test Functions for Optimization Needs, 101(48):32, 2005.
[32] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh. Regret for expected improvement over the best-observed value and stopping condition. In Proceedings of the Ninth Asian Conference on Machine Learning, volume 77, pages 279–294. PMLR, 2017.
[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[34] A. Pepelyshev. The role of the nugget term in the Gaussian process method. In mODa 9 – Advances in Model-Oriented Design and Analysis: Proceedings of the 9th International Workshop in Model-Oriented Design and Analysis held in Bertinoro, Italy, June 14-18, 2010, pages 149–156. Springer, 2010.
[35] V. Picheny, T. Wagner, and D. Ginsbourger. A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, 48:607–626, 2013.
[36] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.
[37] I. O. Ryzhov. On the convergence rates of expected improvement methods. Operations Research, 64(6):1515–1528, 2016.
[38] P. Saves, R. Lafage, N. Bartoli, Y. Diouane, J. Bussemaker, T. Lefebvre, J. T. Hwang, J. Morlier, and J. R. R. A. Martins. SMT 2.0: A surrogate modeling toolbox with a focus on hierarchical and mixed variables Gaussian processes. Advances in Engineering Software, 188:103571, 2024. doi: https://doi.org/10.1016/j.advengsoft.2023.103571.
[39] R. B. Schnabel and E. Eskow. A new modified Cholesky factorization. SIAM Journal on Scientific and Statistical Computing, 11(6):1136–1158, 1990.
[40] M. Schonlau, W. J. Welch, and D. R. Jones. Global versus local search in constrained optimization of computer models. Lecture Notes–Monograph Series, pages 11–25, 1998.
[41] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
[42] J. Sun. Componentwise perturbation bounds for some matrix decompositions. BIT Numerical Mathematics, 32(4):702–714, 1992.
[43] The GPyOpt authors. GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt, 2016.
[44] H. Tran-The, S. Gupta, S. Rana, and S. Venkatesh. Regret bounds for expected improvement algorithms in Gaussian process bandit optimization. arXiv preprint arXiv:2203.07875, 2022.
[45] S. Vakili. Open problem: Regret bounds for noise-free kernel-based bandits. In Conference on Learning Theory, pages 5624–5629. PMLR, 2022.
[46] S. Vakili, K. Khezeli, and V. Picheny. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics, pages 82–90. PMLR, 2021.
[47] F. A. C. Viana, R. T. Haftka, and L. T. Watson. Efficient global optimization algorithm assisted by multiple surrogate techniques. Journal of Global Optimization, 56:669–689, 2013.
[48] K. Wang, G. Pleiss, J. Gardner, S. Tyree, K. Q. Weinberger, and A. G. Wilson. Exact Gaussian processes on a million data points. Advances in Neural Information Processing Systems, 32, 2019.
[49] X. Wang, Y. Jin, S. Schmitt, and M. Olhofer. Recent advances in Bayesian optimization. ACM Computing Surveys, 55(13s), July 2023.
[50] Z. Wang and N. de Freitas. Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters, 2014.
[51] J. H. Wilkinson. A priori error analysis of algebraic processes. In Intern. Congress Math, volume 19, pages 629–639, 1968.
[52] S. J. Wright. Modified Cholesky factorizations in interior-point algorithms for linear programming. SIAM Journal on Optimization, 9(4):1159–1191, 1999.
[53] J. Wu, X. Chen, H. Zhang, L. Xiong, H. Lei, and S. Deng. Hyperparameter optimization for machine learning models based on Bayesian optimization. Journal of Electronic Science and Technology, 17(1):26–40, 2019.
[54] M. Zaefferer, J. Stork, M. Friese, A. Fischbach, B. Naujoks, and T. Bartz-Beielstein. Efficient global optimization for combinatorial problems. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO '14, pages 871–878, New York, NY, USA, 2014. Association for Computing Machinery.
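For reproducibility, the five benchmark objectives of Section F.2 can be implemented directly from their formulas. The sketch below uses only the Python standard library; the minimizer locations in the spot-checks are the commonly published ones for these test problems (they are not stated in the paper), and the Branin spot-check uses the standard minimum of roughly 0.3979 for the formulation above.

```python
# Minimal sketch of the five benchmark objectives from Section F.2.
# Assumptions: minimizer locations below are the standard published ones.
import math

def rosenbrock(x):
    # Example 1: d-dimensional Rosenbrock, domain [-2.048, 2.048]^d.
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1) ** 2
               for i in range(len(x) - 1))

def six_hump_camel(x1, x2):
    # Example 2: six-hump camel, x1 in [-3, 3], x2 in [-2, 2].
    return ((4 - 2.1 * x1 ** 2 + x1 ** 4 / 3) * x1 ** 2
            + x1 * x2 + (-4 + 4 * x2 ** 2) * x2 ** 2)

# Example 3: Hartmann6, x in [0, 1]^6, with alpha, A, P from the paper.
ALPHA = [1.0, 1.2, 3.0, 3.2]
A = [[10, 3.0, 17, 3.5, 1.7, 8.0],
     [0.05, 10, 17, 0.1, 8.0, 14],
     [3.0, 3.5, 1.7, 10, 17, 8.0],
     [17, 8.0, 0.05, 10, 0.1, 14]]
P = [[0.131, 0.170, 0.557, 0.012, 0.828, 0.587],
     [0.233, 0.414, 0.831, 0.374, 0.100, 0.999],
     [0.235, 0.145, 0.352, 0.288, 0.305, 0.665],
     [0.405, 0.883, 0.873, 0.574, 0.109, 0.038]]

def hartmann6(x):
    return -sum(ALPHA[i] * math.exp(-sum(A[i][j] * (x[j] - P[i][j]) ** 2
                                         for j in range(6)))
                for i in range(4))

def branin(x1, x2):
    # Example 4: Branin, x1 in [-5, 10], x2 in [0, 15].
    return ((x2 - 5.1 / (4 * math.pi ** 2) * x1 ** 2
             + 5 / math.pi * x1 - 6) ** 2
            + 10 * (1 - 1 / (8 * math.pi)) * math.cos(x1) + 10)

def michalewicz(x):
    # Example 5: 2-D Michalewicz, x in [0, pi]^2 (steepness exponent 20).
    return -sum(math.sin(xi) * math.sin((i + 1) * xi ** 2 / math.pi) ** 20
                for i, xi in enumerate(x))

# Spot-check the optimal values at the usual (published) minimizers.
assert abs(rosenbrock([1.0, 1.0])) < 1e-12
assert abs(six_hump_camel(0.0898, -0.7126) + 1.0316) < 1e-3
assert hartmann6([0.2017, 0.1500, 0.4769, 0.2753, 0.3117, 0.6573]) < -3.25
assert abs(branin(math.pi, 2.275) - 0.3979) < 1e-3
assert abs(michalewicz([2.20, 1.57]) + 1.8013) < 1e-2
```

The Hartmann6 check uses a loose tolerance because the paper rounds the entries of P to three decimals.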
