Convergence rates of efficient global optimization algorithms
Adam D. Bull
Statistical Laboratory, University of Cambridge
a.bull@statslab.cam.ac.uk

Abstract

In the efficient global optimization problem, we minimize an unknown function $f$, using as few observations $f(x)$ as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of $f$. We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior.

Mathematics subject classification 2010: 90C26 (Primary); 68Q32, 62C10, 62L05 (Secondary).
Keywords: convergence rates, efficient global optimization, expected improvement, continuum-armed bandit, Bayesian optimization.

1 Introduction

Suppose we wish to minimize a continuous function $f \colon X \to \mathbb{R}$, where $X$ is a compact subset of $\mathbb{R}^d$. Observing $f(x)$ is costly (it may require a lengthy computer simulation or physical experiment), so we wish to use as few observations as possible. We know little about the shape of $f$; in particular, we will be unable to make assumptions of convexity or unimodality. We therefore need a global optimization algorithm, one which attempts to find a global minimum.

Many standard global optimization algorithms exist, including genetic algorithms, multistart, and simulated annealing (Pardalos and Romeijn, 2002), but these algorithms are designed for functions that are cheap to evaluate. When $f$ is expensive, we need an efficient algorithm, one which will choose its observations to maximize the information gained.

We can consider this a continuum-armed-bandit problem (Srinivas et al., 2010, and references therein), with noiseless data, and loss measured by the simple regret (Bubeck et al., 2009). At time $n$, we choose a design point $x_n \in X$, make an observation $z_n = f(x_n)$, and then report a point $x^*_n$ where we believe $f(x^*_n)$ will be low. Our goal is to find a strategy for choosing the $x_n$ and $x^*_n$, in terms of previous observations, so as to minimize $f(x^*_n)$.

We would like to find a strategy which can guarantee convergence: for functions $f$ in some smoothness class, $f(x^*_n)$ should tend to $\min f$, preferably at some fast rate. The simplest method would be to fix a sequence of $x_n$ in advance, and set $x^*_n = \arg\min \hat f_n$, for some approximation $\hat f_n$ to $f$. We will show that if $\hat f_n$ converges in supremum norm at the optimal rate, then $f(x^*_n)$ also converges at its optimal rate.
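As an illustration, here is a minimal sketch of this naive strategy (ours, not the paper's), assuming SciPy's thin-plate-spline interpolant stands in for $\hat f_n$, and random uniform points stand in for a quasi-uniform design:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def naive_minimize(f, n, d=2, seed=0):
    """Naive strategy: fix the design in advance, fit an interpolant f_hat to
    the data, and report x*_n as the minimizer of f_hat over a candidate set."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))             # stand-in for a quasi-uniform design
    z = np.array([f(x) for x in X])          # observations z_i = f(x_i)
    f_hat = RBFInterpolator(X, z)            # thin-plate-spline interpolant
    candidates = rng.uniform(size=(10_000, d))
    return candidates[np.argmin(f_hat(candidates))]
```

For example, `naive_minimize(lambda x: np.sum((x - 0.3) ** 2), 200)` should return a point near the minimizer $(0.3, 0.3)$. Note that the design here is entirely independent of the observations, which is exactly the weakness discussed next.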
However, while this strategy gives a good worst-case bound, on average it is clearly a poor method of optimization: the design points $x_n$ are completely independent of the observations $z_n$. We may therefore ask if there are more efficient methods, with better average-case performance, that nevertheless provide good guarantees of convergence.

The difficulty in designing such a method lies in the trade-off between exploration and exploitation. If we exploit the data, observing in regions where $f$ is known to be low, we will be more likely to find the optimum quickly; however, unless we explore every region of $X$, we may not find it at all (Macready and Wolpert, 1998).

Initial attempts at this problem include work on Lipschitz optimization (summarized in Hansen et al., 1992) and the DIRECT algorithm (Jones et al., 1993), but perhaps the best-known strategy is expected improvement. It is sometimes called Bayesian optimization, and first appeared in Močkus (1974) as a Bayesian decision-theoretic solution to the problem. Contemporary computers were not powerful enough to implement the technique in full, and it was later popularized by Jones et al. (1998), who provided a computationally efficient implementation. More recently, it has also been called a knowledge-gradient policy by Frazier et al. (2009). Many extensions and alterations have been suggested by further authors; a good summary can be found in Brochu et al. (2010).

Expected improvement performs well in experiments (Osborne, 2010, §9.5), but little is known about its theoretical properties. The behaviour of the algorithm depends crucially on the Gaussian process prior $\pi$ chosen for $f$. Each prior has an associated space of functions $H$, its reproducing-kernel Hilbert space. $H$ contains all functions $X \to \mathbb{R}$ as smooth as a posterior mean of $f$, and is the natural space in which to study questions of convergence.

Vazquez and Bect (2010) show that when $\pi$ is a fixed Gaussian process prior of finite smoothness, expected improvement converges on the minimum of any $f \in H$, and almost surely for $f$ drawn from $\pi$. Grunewalder et al. (2010) bound the convergence rate of a computationally infeasible version of expected improvement: for priors $\pi$ of smoothness $\nu$, they show convergence at a rate $O^*(n^{-(\nu \wedge 0.5)/d})$ on $f$ drawn from $\pi$. We begin by bounding the convergence rate of the feasible algorithm, and show convergence at a rate $O^*(n^{-(\nu \wedge 1)/d})$ on all $f \in H$. We go on to show that a modification of expected improvement converges at the near-optimal rate $O^*(n^{-\nu/d})$.

For practitioners, however, these results are somewhat misleading. In typical applications, the prior is not held fixed, but depends on parameters estimated sequentially from the data. This process ensures the choice of observations is invariant under translation and scaling of $f$, and is believed to be more efficient (Jones et al., 1998, §2). It has a profound effect on convergence, however: Locatelli (1997, §3.2) shows that, for a Brownian motion prior with estimated parameters, expected improvement may not converge at all.
We extend this result to more general settings, showing that for standard priors with estimated parameters, there exist smooth functions $f$ on which expected improvement does not converge. We then propose alternative estimates of the prior parameters, chosen to minimize the constants in the convergence rate. We show that these estimators give an automatic choice of parameters, while retaining the convergence rates of a fixed prior.

Table 1 summarizes the notation used in this paper. We say $f \colon \mathbb{R}^d \to \mathbb{R}$ is a bump function if $f$ is infinitely differentiable and of compact support, and $f \colon \mathbb{R}^d \to \mathbb{C}$ is Hermitian if $f(x) = \overline{f(-x)}$. We use the Landau notation $f = O(g)$ to denote $\limsup |f/g| < \infty$, and $f = o(g)$ to denote $f/g \to 0$. If $g = O(f)$, we say $f = \Omega(g)$, and if both $f = O(g)$ and $f = \Omega(g)$, we say $f = \Theta(g)$. If further $f/g \to 1$, we say $f \sim g$. Finally, if $f$ and $g$ are random, and $P(\sup |f/g| \le M) \to 1$ as $M \to \infty$, we say $f = O_p(g)$.

Table 1: Notation.

Section 1:
  $f$ : unknown function $X \to \mathbb{R}$ to be minimized
  $X$ : compact subset of $\mathbb{R}^d$ to minimize over
  $d$ : number of dimensions to minimize over
  $x_n$ : points in $X$ at which $f$ is observed
  $z_n$ : observations $z_n = f(x_n)$ of $f$
  $x^*_n$ : estimated minimum of $f$, given $z_1, \dots, z_n$

Section 2.1:
  $\pi$ : prior distribution for $f$
  $u$ : strategy for choosing $x_n$, $x^*_n$
  $\mathcal{F}_n$ : filtration $\mathcal{F}_n = \sigma(x_i, z_i : i \le n)$
  $z^*_n$ : best observation $z^*_n = \min_{i=1,\dots,n} z_i$
  $EI_n$ : expected improvement given $\mathcal{F}_n$

Section 2.2:
  $\mu$, $\sigma^2$ : global mean and variance of the Gaussian-process prior $\pi$
  $K$ : underlying correlation kernel for $\pi$
  $K_\theta$ : correlation kernel for $\pi$ with length-scales $\theta$
  $\nu$, $\alpha$ : smoothness parameters of $K$
  $\hat\mu_n$, $\hat f_n$, $s_n^2$, $\hat R_n^2$ : quantities describing the posterior distribution of $f$ given $\mathcal{F}_n$

Section 2.3:
  $EI(\pi)$ : expected-improvement strategy with fixed prior
  $\hat\sigma_n^2$, $\hat\theta_n$ : estimates of the prior parameters $\sigma^2$, $\theta$
  $c_n$ : rate of decay of $\hat\sigma_n^2$
  $\theta^L$, $\theta^U$ : bounds on $\hat\theta_n$
  $EI(\hat\pi)$ : expected-improvement strategy with estimated prior

Section 3.1:
  $H_\theta(S)$ : reproducing-kernel Hilbert space of $K_\theta$ on $S$
  $H^s(D)$ : Sobolev Hilbert space of order $s$ on $D$

Section 3.2:
  $L_n$ : loss suffered over an RKHS ball after $n$ steps

Section 3.3:
  $EI(\tilde\pi)$ : expected-improvement strategy with robust estimated prior

Section 3.4:
  $EI(\cdot, \varepsilon)$ : $\varepsilon$-greedy expected-improvement strategies

In Section 2, we briefly describe the expected-improvement algorithm, and detail our assumptions on the priors used. We state our main results in Section 3, and discuss implications for further work in Section 4. Finally, we give proofs in Appendix A.

2 Expected Improvement

Suppose we wish to minimize an unknown function $f$, choosing design points $x_n$ and estimated minima $x^*_n$ as in the introduction. If we pick a prior distribution $\pi$ for $f$, representing our beliefs about the unknown function, we can describe this problem in terms of decision theory. Let $(\Omega, \mathcal{F}, P)$ be a probability space, equipped with a random process $f$ having law $\pi$. A strategy $u$ is a collection of random variables $(x_n)$, $(x^*_n)$ taking values in $X$. Set $z_n := f(x_n)$, and define the filtration $\mathcal{F}_n := \sigma(x_i, z_i : i \le n)$. The strategy $u$ is valid if $x_n$ is conditionally independent of $f$ given $\mathcal{F}_{n-1}$, and likewise $x^*_n$ given $\mathcal{F}_n$. (Note that we allow random strategies, provided they do not depend on unknown information about $f$.)
When taking probabilities and expectations we will write $P^u_\pi$ and $E^u_\pi$, denoting the dependence on both the prior $\pi$ and strategy $u$. The average-case performance at some future time $N$ is then given by the expected loss,
\[ E^u_\pi[f(x^*_N) - \min f], \]
and our goal, given $\pi$, is to choose the strategy $u$ to minimize this quantity.

2.1 Bayesian Optimization

For $N > 1$ this problem is very computationally intensive (Osborne, 2010, §6.3), but we can solve a simplified version of it. First, we restrict the choice of $x^*_n$ to the previous design points $x_1, \dots, x_n$. (In practice this is reasonable, as choosing an $x^*_n$ we have not observed can be unreliable.) Secondly, rather than finding an optimal strategy for the problem, we derive the myopic strategy: the strategy which is optimal if we always assume we will stop after the next observation. This strategy is suboptimal (Ginsbourger et al., 2008, §3.1), but performs well, and greatly simplifies the calculations involved.

In this setting, given $\mathcal{F}_n$, if we are to stop at time $n$ we should choose $x^*_n := x_{i^*_n}$, where $i^*_n := \arg\min_{i=1,\dots,n} z_i$. (In the case of ties, we may pick any minimizing $i^*_n$.) We then suffer a loss $z^*_n - \min f$, where $z^*_n := z_{i^*_n}$. Were we to observe at $x_{n+1}$ before stopping, the expected loss would be
\[ E^u_\pi[z^*_{n+1} - \min f \mid \mathcal{F}_n], \]
so the myopic strategy should choose $x_{n+1}$ to minimize this quantity. Equivalently, it should maximize the expected improvement over the current loss,
\[ EI_n(x_{n+1}; \pi) := E^u_\pi[z^*_n - z^*_{n+1} \mid \mathcal{F}_n] = E^u_\pi[(z^*_n - z_{n+1})^+ \mid \mathcal{F}_n], \qquad (1) \]
where $x^+ = \max(x, 0)$.

So far, we have merely replaced one optimization problem with another. However, for suitable priors, $EI_n$ can be evaluated cheaply, and thus maximized by standard techniques. The expected-improvement algorithm is then given by choosing $x_{n+1}$ to maximize (1).

2.2 Gaussian Process Models

We still need to choose a prior $\pi$ for $f$. Typically, we model $f$ as a stationary Gaussian process: we consider the values $f(x)$ to be jointly Gaussian, with mean and covariance
\[ E_\pi[f(x)] = \mu, \qquad \mathrm{Cov}_\pi[f(x), f(y)] = \sigma^2 K_\theta(x - y). \qquad (2) \]
$\mu \in \mathbb{R}$ is the global mean of $f$; we place a flat prior on $\mu$, reflecting our uncertainty over the location of $f$. $\sigma > 0$ is the global scale of variation of $f$, and $K_\theta \colon \mathbb{R}^d \to \mathbb{R}$ its correlation kernel, governing the local properties of $f$.

In the following, we will consider kernels
\[ K_\theta(t_1, \dots, t_d) := K(t_1/\theta_1, \dots, t_d/\theta_d), \qquad (3) \]
for an underlying kernel $K$ with $K(0) = 1$. (Note that we can always satisfy this condition by suitably scaling $K$ and $\sigma$.) The $\theta_i > 0$ are the length-scales of the process: two values $f(x)$ and $f(y)$ will be highly correlated if each $x_i - y_i$ is small compared with $\theta_i$. For now, we will assume the parameters $\sigma$ and $\theta$ are fixed in advance.

For (2) and (3) to define a consistent Gaussian process, $K$ must be a symmetric positive-definite function. We will also make the following assumptions.

Assumption 1. $K$ is continuous and integrable.

$K$ thus has Fourier transform
\[ \hat K(\xi) := \int_{\mathbb{R}^d} e^{-2\pi i \langle x, \xi \rangle} K(x)\,dx, \]
and by Bochner's theorem, $\hat K$ is non-negative and integrable.

Assumption 2. $\hat K$ is isotropic and radially non-increasing.

In other words, $\hat K(x) = \hat k(\|x\|)$ for a non-increasing function $\hat k \colon [0, \infty) \to [0, \infty)$; as a consequence, $K$ is isotropic.
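To make (2) and (3) concrete, here is a minimal sketch (ours, not from the paper), assuming the Gaussian kernel of Section 2.2 below as the underlying $K$; the helper names are our own:

```python
import numpy as np

def gaussian_kernel(t):
    """Underlying kernel with K(0) = 1: the Gaussian kernel exp(-||t||^2 / 2)."""
    return np.exp(-0.5 * np.sum(np.square(t), axis=-1))

def prior_covariance(X, theta, sigma2):
    """Prior covariance (2) at design points X (n x d), under the rescaled
    kernel (3): Cov[f(x_i), f(x_j)] = sigma^2 K((x_i - x_j) / theta)."""
    diff = X[:, None, :] - X[None, :, :]
    return sigma2 * gaussian_kernel(diff / np.asarray(theta))
```

Each coordinate of the difference $x - y$ is divided by its length-scale $\theta_i$ before the kernel is applied, so the values $f(x)$ and $f(y)$ vary together exactly when each $x_i - y_i$ is small relative to $\theta_i$.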
Assumption 3. As $x \to \infty$, either:
(i) $\hat K(x) = \Theta(\|x\|^{-2\nu-d})$ for some $\nu > 0$; or
(ii) $\hat K(x) = O(\|x\|^{-2\nu-d})$ for all $\nu > 0$ (we will then say that $\nu = \infty$).

Note the condition $\nu > 0$ is required for $\hat K$ to be integrable.

Assumption 4. $K$ is $C^k$, for $k$ the largest integer less than $2\nu$, and at the origin, $K$ has $k$-th order Taylor approximation $P_k$ satisfying
\[ |K(x) - P_k(x)| = O\bigl(\|x\|^{2\nu}(-\log\|x\|)^{2\alpha}\bigr) \]
as $x \to 0$, for some $\alpha \ge 0$.

When $\alpha = 0$, this is just the condition that $K$ be $2\nu$-Hölder at the origin; when $\alpha > 0$, we instead require this condition up to a log factor. The rate $\nu$ controls the smoothness of functions from the prior: almost surely, $f$ has continuous derivatives of any order $k < \nu$ (Adler and Taylor, 2007, §1.4.2).

Popular kernels include the Matérn class,
\[ K_\nu(x) := \frac{2^{1-\nu}}{\Gamma(\nu)} \bigl(\sqrt{2\nu}\,\|x\|\bigr)^\nu k_\nu\bigl(\sqrt{2\nu}\,\|x\|\bigr), \qquad \nu \in (0, \infty), \]
where $k_\nu$ is a modified Bessel function of the second kind, and the Gaussian kernel,
\[ K_\infty(x) := e^{-\frac12\|x\|^2}, \]
obtained in the limit $\nu \to \infty$ (Rasmussen and Williams, 2006, §4.2). Between them, these kernels cover the full range of smoothness $0 < \nu \le \infty$. Both kernels satisfy Assumptions 1–4 for the $\nu$ given; $\alpha = 0$ except for the Matérn kernel with $\nu \in \mathbb{N}$, where $\alpha = \frac12$ (Abramowitz and Stegun, 1965, §9.6).

Having chosen our prior distribution, we may now derive its posterior. We find
\[ f(x) \mid z_1, \dots, z_n \sim N\bigl(\hat f_n(x; \theta),\ \sigma^2 s_n^2(x; \theta)\bigr), \]
where
\[ \hat\mu_n(\theta) := \frac{1^T V^{-1} z}{1^T V^{-1} 1}, \qquad (4) \]
\[ \hat f_n(x; \theta) := \hat\mu_n + v^T V^{-1}(z - \hat\mu_n 1), \qquad (5) \]
and
\[ s_n^2(x; \theta) := 1 - v^T V^{-1} v + \frac{(1 - 1^T V^{-1} v)^2}{1^T V^{-1} 1}, \qquad (6) \]
for $z = (z_i)_{i=1}^n$, $V = (K_\theta(x_i - x_j))_{i,j=1}^n$, and $v = (K_\theta(x - x_i))_{i=1}^n$ (Santner et al., 2003, §4.1.3). Equivalently, these expressions are the best linear unbiased predictor of $f(x)$ and its variance, as given in Jones et al. (1998, §2). We will also need the reduced sum of squares,
\[ \hat R_n^2(\theta) := (z - \hat\mu_n 1)^T V^{-1} (z - \hat\mu_n 1). \qquad (7) \]

2.3 Expected Improvement Strategies

Under our assumptions on $\pi$, we may now derive an analytic form for (1), as in Jones et al. (1998, §4.1). We obtain
\[ EI_n(x_{n+1}; \pi) = \rho\bigl(z^*_n - \hat f_n(x_{n+1}; \theta),\ \sigma s_n(x_{n+1}; \theta)\bigr), \qquad (8) \]
where
\[ \rho(y, s) := \begin{cases} y\Phi(y/s) + s\varphi(y/s), & s > 0, \\ \max(y, 0), & s = 0, \end{cases} \qquad (9) \]
and $\Phi$ and $\varphi$ are the standard normal distribution and density functions respectively.

For a prior $\pi$ as above, expected improvement chooses $x_{n+1}$ to maximize (8), but this does not fully define the strategy. Firstly, we must describe how the strategy breaks ties, when more than one $x \in X$ maximizes $EI_n$. In general, this will not affect the behaviour of the algorithm, so we allow any choice of $x_{n+1}$ maximizing (8).

Secondly, we must say how to choose $x_1$, as the above expressions are undefined when $n = 0$. In fact, Jones et al. (1998, §4.2) find that expected improvement can be unreliable given few data points, and recommend that several initial design points be chosen in a random quasi-uniform arrangement. We will therefore assume that until some fixed time $k$, points $x_1, \dots, x_k$ are instead chosen by some (potentially random) method independent of $f$.
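The posterior quantities (4)–(7) and the closed form (8)–(9) translate directly into code. The sketch below is our illustration, not the paper's; it assumes the Gaussian kernel (redefined here for self-containment), and for simplicity solves the linear systems directly rather than caching a factorization of $V$:

```python
import numpy as np
from scipy.stats import norm

def gaussian_kernel(t):
    """Underlying kernel K(t) = exp(-||t||^2 / 2), so that K(0) = 1."""
    return np.exp(-0.5 * np.sum(np.square(t), axis=-1))

def posterior(X, z, x, theta):
    """Posterior quantities (4)-(7) at a query point x, given design X, data z."""
    n = len(z)
    V = gaussian_kernel((X[:, None, :] - X[None, :, :]) / theta)
    v = gaussian_kernel((x - X) / theta)
    ones = np.ones(n)
    Vi_1, Vi_v, Vi_z = (np.linalg.solve(V, w) for w in (ones, v, z))
    mu_hat = (ones @ Vi_z) / (ones @ Vi_1)                            # (4)
    r = z - mu_hat
    Vi_r = np.linalg.solve(V, r)
    f_hat = mu_hat + v @ Vi_r                                         # (5)
    s2 = 1.0 - v @ Vi_v + (1.0 - ones @ Vi_v) ** 2 / (ones @ Vi_1)    # (6)
    R2_hat = r @ Vi_r                                                 # (7)
    return mu_hat, f_hat, max(s2, 0.0), R2_hat

def expected_improvement(X, z, x, theta, sigma):
    """Closed-form expected improvement (8), via rho of (9)."""
    _, f_hat, s2, _ = posterior(X, z, x, theta)
    y, s = np.min(z) - f_hat, sigma * np.sqrt(s2)
    if s == 0.0:
        return max(y, 0.0)                                # rho(y, 0)
    return y * norm.cdf(y / s) + s * norm.pdf(y / s)      # rho(y, s), s > 0
```

The clipping of $s_n^2$ at zero only guards against rounding error; analytically (6) is non-negative.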
We thus obtain the following strategy.

Definition 1. An $EI(\pi)$ strategy chooses:
(i) initial design points $x_1, \dots, x_k$ independently of $f$; and
(ii) further design points $x_{n+1}$ ($n \ge k$) from the maximizers of (8).

So far, we have not considered the choice of parameters $\sigma$ and $\theta$. While these can be fixed in advance, doing so requires us to specify characteristic scales of the unknown function $f$, and causes expected improvement to behave differently on a rescaling of the same function. We would prefer an algorithm which could adapt automatically to the scale of $f$.

A natural approach is to take maximum likelihood estimates of the parameters, as recommended by Jones et al. (1998, §2). Given $\theta$, the MLE is $\hat\sigma_n^2 = \hat R_n^2(\theta)/n$; for full generality, we will allow any choice $\hat\sigma_n^2 = c_n \hat R_n^2(\theta)$, where $c_n = o(1/\log n)$. Estimates of $\theta$, however, must be obtained by numerical optimization. As $\theta$ can vary widely in scale, this optimization is best performed over $\log\theta$; as the likelihood surface is typically multimodal, this requires the use of a global optimizer. We must therefore place (implicit or explicit) bounds on the allowed values of $\log\theta$.

We have thus described the following strategy.

Definition 2. Let $\hat\pi_n$ be a sequence of priors, with parameters $\hat\sigma_n$, $\hat\theta_n$ satisfying:
(i) $\hat\sigma_n^2 = c_n \hat R_n^2(\hat\theta_n)$ for constants $c_n > 0$, $c_n = o(1/\log n)$; and
(ii) $\theta^L \le \hat\theta_n \le \theta^U$ for constants $\theta^L, \theta^U \in \mathbb{R}^d_+$.
An $EI(\hat\pi)$ strategy satisfies Definition 1, replacing $\pi$ with $\hat\pi_n$ in (8).
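As an illustration of Definition 2 (ours, hedged: the concentrated likelihood below follows the standard derivation in Jones et al., 1998, §2, and a grid search over $\log\theta$ stands in for a global optimizer), with a single common length-scale and the MLE choice $c_n = 1/n$:

```python
import numpy as np

def _V(X, theta):
    """Correlation matrix V = (K_theta(x_i - x_j)), Gaussian kernel assumed."""
    return np.exp(-0.5 * np.sum(np.square((X[:, None, :] - X[None, :, :]) / theta), axis=-1))

def r2_hat(X, z, theta):
    """Reduced sum of squares (7) at length-scale theta."""
    V, ones = _V(X, theta), np.ones(len(z))
    mu_hat = (ones @ np.linalg.solve(V, z)) / (ones @ np.linalg.solve(V, ones))
    r = z - mu_hat
    return r @ np.linalg.solve(V, r)

def neg_profile_loglik(X, z, theta):
    """(n/2) log(R2_hat(theta)/n) + (1/2) log det V_theta, up to constants:
    the negative log-likelihood with mu and sigma^2 profiled out."""
    n = len(z)
    _, logdet = np.linalg.slogdet(_V(X, theta))
    return 0.5 * n * np.log(r2_hat(X, z, theta) / n) + 0.5 * logdet

def mle_parameters(X, z, theta_L, theta_U, grid_size=50):
    """Definition 2 with a common length-scale: theta_hat from a grid search
    over log theta within [theta_L, theta_U]; sigma2_hat with c_n = 1/n."""
    grid = np.exp(np.linspace(np.log(theta_L), np.log(theta_U), grid_size))
    theta_hat = min(grid, key=lambda t: neg_profile_loglik(X, z, t))
    return theta_hat, r2_hat(X, z, theta_hat) / len(z)
```

The bounds $[\theta^L, \theta^U]$ enter only through the search range, exactly as condition (ii) requires.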
3 Convergence Rates

To discuss convergence, we must first choose a smoothness class for the unknown function $f$. Each kernel $K_\theta$ is associated with a space of functions $H_\theta(X)$, its reproducing-kernel Hilbert space (RKHS) or native space. $H_\theta(X)$ contains all functions $X \to \mathbb{R}$ as smooth as a posterior mean of $f$, and is the natural space to study convergence of expected-improvement algorithms, allowing a tractable analysis of their asymptotic behaviour.

3.1 Reproducing-Kernel Hilbert Spaces

Given a symmetric positive-definite kernel $K$ on $\mathbb{R}^d$, set $k_x(t) = K(t - x)$. For $S \subseteq \mathbb{R}^d$, let $E(S)$ be the space of functions $S \to \mathbb{R}$ spanned by the $k_x$, for $x \in S$. Furnish $E(S)$ with the inner product defined by
\[ \langle k_x, k_y \rangle := K(x - y). \]
The completion of $E(S)$ under this inner product is the reproducing-kernel Hilbert space $H(S)$ of $K$ on $S$. The members $f \in H(S)$ are abstract objects, but we can identify them with functions $f \colon S \to \mathbb{R}$ through the reproducing property, $f(x) = \langle f, k_x \rangle$, which holds for all $f \in E(S)$. See Aronszajn (1950), Berlinet and Thomas-Agnan (2004), Wendland (2005) and van der Vaart and van Zanten (2008).

We will find it convenient also to use an alternative characterization of $H(S)$. We begin by describing $H(\mathbb{R}^d)$ in terms of Fourier transforms. Let $\hat f$ denote the Fourier transform of a function $f \in L^2$. The following result is stated in Parzen (1963, §2), and proved in Wendland (2005, §10.2); we give a short proof in Appendix A.

Lemma 1. $H(\mathbb{R}^d)$ is the space of real continuous $f \in L^2(\mathbb{R}^d)$ whose norm
\[ \|f\|^2_{H(\mathbb{R}^d)} := \int \frac{|\hat f(\xi)|^2}{\hat K(\xi)}\,d\xi \]
is finite, taking $0/0 = 0$.

We may now describe $H(S)$ in terms of $H(\mathbb{R}^d)$.

Lemma 2 (Aronszajn, 1950, §1.5). $H(S)$ is the space of functions $f = g|_S$ for some $g \in H(\mathbb{R}^d)$, with norm
\[ \|f\|_{H(S)} := \inf_{g|_S = f} \|g\|_{H(\mathbb{R}^d)}, \]
and there is a unique $g$ minimizing this expression.

These spaces are in fact closely related to the Sobolev Hilbert spaces of functional analysis. Say a domain $D \subseteq \mathbb{R}^d$ is Lipschitz if its boundary is locally the graph of a Lipschitz function (see Tartar, 2007, §12, for a precise definition). For such a domain $D$, the Sobolev Hilbert space $H^s(D)$ is the space of functions $f \colon D \to \mathbb{R}$, given by the restriction of some $g \colon \mathbb{R}^d \to \mathbb{R}$, whose norm
\[ \|f\|^2_{H^s(D)} := \inf_{g|_D = f} \int |\hat g(\xi)|^2 (1 + \|\xi\|^2)^s \, d\xi \]
is finite. Thus, for the kernel $K$ with Fourier transform $\hat K(\xi) = (1 + \|\xi\|^2)^{-s}$, this is just the RKHS $H(D)$. More generally, if $K$ satisfies our assumptions with $\nu < \infty$, these spaces are equivalent in the sense of normed spaces: they contain the same functions, and have norms $\|\cdot\|_1$, $\|\cdot\|_2$ satisfying
\[ C\|f\|_1 \le \|f\|_2 \le C'\|f\|_1, \]
for constants $0 < C \le C'$.

Lemma 3. Let $H_\theta(S)$ denote the RKHS of $K_\theta$ on $S$, and $D \subseteq \mathbb{R}^d$ be a Lipschitz domain.
(i) If $\nu < \infty$, $H_\theta(\bar D)$ is equivalent to the Sobolev Hilbert space $H^{\nu+d/2}(D)$.
(ii) If $\nu = \infty$, $H_\theta(\bar D)$ is continuously embedded in $H^s(D)$ for all $s$.

Thus if $\nu < \infty$, and $X$ is, say, a product of intervals $\prod_{i=1}^d [a_i, b_i]$, the RKHS $H_\theta(X)$ is equivalent to the Sobolev Hilbert space $H^{\nu+d/2}(\prod_{i=1}^d (a_i, b_i))$, identifying each function in that space with its unique continuous extension to $X$.

3.2 Fixed Parameters

We are now ready to state our main results. Let $X \subset \mathbb{R}^d$ be compact with non-empty interior. For a function $f \colon X \to \mathbb{R}$, let $P^u_f$ and $E^u_f$ denote probability and expectation when minimizing the fixed function $f$ with strategy $u$. (Note that while $f$ is fixed, $u$ may be random, so its performance is still probabilistic in nature.) We define the loss suffered over the ball $B_R$ in $H_\theta(X)$ after $n$ steps by a strategy $u$,
\[ L_n(u, H_\theta(X), R) := \sup_{\|f\|_{H_\theta(X)} \le R} E^u_f[f(x^*_n) - \min f]. \]
We will say that $u$ converges on the optimum at rate $r_n$ if $L_n(u, H_\theta(X), R) = O(r_n)$ for all $R > 0$. Note that we do not allow $u$ to vary with $R$; the strategy must achieve this rate without prior knowledge of $\|f\|_{H_\theta(X)}$.

We begin by showing that the minimax rate of convergence is $n^{-\nu/d}$.

Theorem 1. If $\nu < \infty$, then for any $\theta \in \mathbb{R}^d_+$, $R > 0$,
\[ \inf_u L_n(u, H_\theta(X), R) = \Theta(n^{-\nu/d}), \]
and this rate can be achieved by a strategy $u$ not depending on $R$.

The upper bound is provided by a naive strategy as in the introduction: we fix a quasi-uniform sequence $x_n$ in advance, and take $x^*_n$ to minimize a radial basis function interpolant of the data. As remarked previously, however, this naive strategy is not very satisfying; in practice it will be outperformed by any good strategy varying with the data. We may thus ask whether more sophisticated strategies, with better practical performance, can still provide good worst-case bounds.

One such strategy is the $EI(\pi)$ strategy of Definition 1. We can show this strategy converges at least at rate $n^{-(\nu\wedge 1)/d}$, up to log factors.

Theorem 2. Let $\pi$ be a prior with length-scales $\theta \in \mathbb{R}^d_+$. For any $R > 0$,
\[ L_n(EI(\pi), H_\theta(X), R) = \begin{cases} O(n^{-\nu/d}(\log n)^\alpha), & \nu \le 1, \\ O(n^{-1/d}), & \nu > 1. \end{cases} \]
For $\nu \le 1$, these rates are near-optimal. For $\nu > 1$, we are faced with a more difficult problem; we discuss this in more detail in Section 3.4.

3.3 Estimated Parameters

First, we consider the effect of the prior parameters on $EI(\pi)$. While the previous result gives a convergence rate for any fixed choice of parameters, the constant in that rate will depend on the parameters chosen; to choose well, we must somehow estimate these parameters from the data. The $EI(\hat\pi)$ strategy, given by Definition 2, uses maximum likelihood estimates for this purpose. We can show, however, that this may cause the strategy to never converge.

Theorem 3. Suppose $\nu < \infty$. Given $\theta \in \mathbb{R}^d_+$, $R > 0$, $\varepsilon > 0$, there exists $f \in H_\theta(X)$ satisfying $\|f\|_{H_\theta(X)} \le R$, and for some fixed $\delta > 0$,
\[ P^{EI(\hat\pi)}_f\Bigl(\inf_n f(x^*_n) - \min f \ge \delta\Bigr) > 1 - \varepsilon. \]

The counterexamples constructed in the proof of the theorem may be difficult to minimize, but they are not badly-behaved (Figure 1). A good optimization strategy should be able to minimize such functions, and we must ask why expected improvement fails.

[Figure 1: a counterexample from Theorem 3; the plot shows $f(x)$ against $x$.]

We can understand the issue by considering the constant in Theorem 2. Define
\[ \tau(x) := x\Phi(x) + \varphi(x). \]
From the proof of Theorem 2, the dominant term in the convergence rate has constant
\[ C(R + \sigma)\,\frac{\tau(R/\sigma)}{\tau(-R/\sigma)}, \qquad (10) \]
for $C > 0$ not depending on $R$ or $\sigma$. In Appendix A, we will prove the following result.

Corollary 1. $\hat R_n(\theta)$ is non-decreasing in $n$, and bounded above by $\|f\|_{H_\theta(X)}$.

Hence for fixed $\theta$, the estimate $\hat\sigma_n^2 = \hat R_n^2(\theta)/n \le R^2/n$, and thus $R/\hat\sigma_n \ge n^{1/2}$. Inserting this choice into (10) gives a constant growing exponentially in $n$, destroying our convergence rate.

To resolve the issue, we will instead try to pick $\sigma$ to minimize (10). The term $R + \sigma$ is increasing in $\sigma$, and the term $\tau(R/\sigma)/\tau(-R/\sigma)$ is decreasing in $\sigma$; we may balance the terms by taking $\sigma = R$. The constant is then proportional to $R$, which we may minimize by taking $R = \|f\|_{H_\theta(X)}$. In practice, we will not know $\|f\|_{H_\theta(X)}$ in advance, so we must estimate it from the data; from Corollary 1, a convenient estimate is $\hat R_n(\theta)$.

Suppose, then, that we make some bounded estimate $\hat\theta_n$ of $\theta$, and set $\hat\sigma_n^2 = \hat R_n^2(\hat\theta_n)$. As Theorem 3 holds for any $\hat\sigma_n^2$ of faster than logarithmic decay, such a choice is necessary to ensure convergence. (We may also choose $\theta$ to minimize (10); we might then pick $\hat\theta_n$ minimizing $\hat R_n(\theta)\prod_{i=1}^d \theta_i^{-\nu/d}$, but our assumptions on $\hat\theta_n$ are weak enough that we need not consider this further.)

If we believe our Gaussian-process model, this estimate $\hat\sigma_n$ is certainly unusual. We should, however, take care before placing too much faith in the model. The function in Figure 1 is a reasonable function to optimize, but as a Gaussian process it is highly atypical: there are intervals on which the function is constant, an event which in our model occurs with probability zero. If we want our algorithm to succeed on more general classes of functions, we will need to choose our parameter estimates appropriately.
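A quick numeric illustration (ours) of the constant (10) discussed above, with $C = R = 1$: the maximum-likelihood scaling $\sigma = R/\sqrt{n}$ blows the constant up exponentially, while the balanced choice $\sigma = R$, adopted in Definition 3 below, keeps it fixed.

```python
import numpy as np
from scipy.stats import norm

def tau(x):
    """tau(x) = x Phi(x) + phi(x)."""
    return x * norm.cdf(x) + norm.pdf(x)

def rate_constant(R, sigma, C=1.0):
    """The dominant constant (10): C (R + sigma) tau(R/sigma) / tau(-R/sigma)."""
    return C * (R + sigma) * tau(R / sigma) / tau(-R / sigma)

R = 1.0
for n in [10, 100, 1000]:
    print(f"n = {n:4d}: sigma = R/sqrt(n) gives {rate_constant(R, R / np.sqrt(n)):.3e}; "
          f"sigma = R gives {rate_constant(R, R):.3e}")
```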
To obtain good rates, we must add a further condition to our strategy. If $z_1 = \dots = z_n$, then $EI_n(\cdot\,; \hat\pi_n)$ is identically zero, and all choices of $x_{n+1}$ are equally valid. To ensure we fully explore $f$, we will therefore require that when our strategy is applied to a constant function $f(x) = c$, it produces a sequence $x_n$ dense in $X$. (This can be achieved, for example, by choosing $x_{n+1}$ uniformly at random from $X$ when $z_1 = \dots = z_n$.) We have thus described the following strategy.

Definition 3. An $EI(\tilde\pi)$ strategy satisfies Definition 2, except:
(i) we instead set $\hat\sigma_n^2 = \hat R_n^2(\hat\theta_n)$; and
(ii) we require the choice of $x_{n+1}$ maximizing (8) to be such that, if $f$ is constant, the design points are almost surely dense in $X$.

We cannot now prove a convergence result uniform over balls in $H_\theta(X)$, as the rate of convergence depends on the ratio $R/\hat R_n$, which is unbounded. (Indeed, any estimator of $\|f\|_{H_\theta(X)}$ must sometimes perform poorly: $f$ can appear from the data to have arbitrarily small norm, while in fact having a spike somewhere we have not yet observed.) We can, however, provide the same convergence rates as in Theorem 2, in a slightly weaker sense.

Theorem 4. For any $f \in H_{\theta^U}(X)$, under $P^{EI(\tilde\pi)}_f$,
\[ f(x^*_n) - \min f = \begin{cases} O_p(n^{-\nu/d}(\log n)^\alpha), & \nu \le 1, \\ O_p(n^{-1/d}), & \nu > 1. \end{cases} \]

3.4 Near-Optimal Rates

So far, our rates have been near-optimal only for $\nu \le 1$. To obtain good rates for $\nu > 1$, standard results on the performance of Gaussian-process interpolation (Narcowich et al., 2003, §6) require the design points $x_i$ to be quasi-uniform in a region of interest. It is unclear whether this occurs naturally under expected improvement, but there are many ways we can modify the algorithm to ensure it.

Perhaps the simplest, and most well-known, is an $\varepsilon$-greedy strategy (Sutton and Barto, 1998, §2.2). In such a strategy, at each step with probability $1 - \varepsilon$ we make a decision to maximize some greedy criterion; with probability $\varepsilon$ we make a decision completely at random. This random choice ensures that the short-term nature of the greedy criterion does not overshadow our long-term goal.

The parameter $\varepsilon$ controls the trade-off between global and local search: a good choice of $\varepsilon$ will be small enough not to interfere with the expected-improvement algorithm, but large enough to prevent it from getting stuck in a local minimum. Sutton and Barto (1998, §2.2) consider the values $\varepsilon = 0.1$ and $\varepsilon = 0.01$, but in practical work $\varepsilon$ should of course be calibrated to a typical problem set. We therefore define the following strategies; a sketch of a single step appears after Theorem 5 below.

Definition 4. Let $\cdot$ denote $\pi$, $\hat\pi$ or $\tilde\pi$. For $0 < \varepsilon < 1$, an $EI(\cdot, \varepsilon)$ strategy:
(i) chooses initial design points $x_1, \dots, x_k$ independently of $f$;
(ii) with probability $1 - \varepsilon$, chooses design point $x_{n+1}$ ($n \ge k$) as in $EI(\cdot)$; or
(iii) with probability $\varepsilon$, chooses $x_{n+1}$ ($n \ge k$) uniformly at random from $X$.

We can show that these strategies achieve near-optimal rates of convergence for all $\nu < \infty$.

Theorem 5. Let $EI(\cdot, \varepsilon)$ be one of the strategies in Definition 4. If $\nu < \infty$, then for any $R > 0$,
\[ L_n(EI(\cdot, \varepsilon), H_{\theta^U}(X), R) = O\bigl((n/\log n)^{-\nu/d}(\log n)^\alpha\bigr), \]
while if $\nu = \infty$, the statement holds for all $\nu < \infty$.
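A minimal sketch (ours) of one step of an $EI(\cdot, \varepsilon)$ strategy, assuming the `expected_improvement` helper from the Section 2.3 sketch, and a finite candidate grid standing in for maximization over $X$:

```python
import numpy as np

def ei_epsilon_step(X, z, candidates, theta, sigma, eps, rng):
    """One step of EI(., eps) (Definition 4): with probability eps, pick
    x_{n+1} uniformly at random; otherwise maximize EI (8) over candidates."""
    if rng.random() < eps:
        return candidates[rng.integers(len(candidates))]
    ei = [expected_improvement(X, z, x, theta, sigma) for x in candidates]
    return candidates[int(np.argmax(ei))]
```

The random branch is what guarantees the quasi-uniform design points needed for the interpolation bounds of Narcowich et al. (2003) to apply.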
Note that unlike a typical $\varepsilon$-greedy algorithm, we do not rely on random choice to obtain global convergence: as above, the $EI(\pi)$ and $EI(\tilde\pi)$ strategies are already globally convergent. Instead, we use random choice simply to improve upon the worst-case rate. Note also that the result does not in general hold when $\varepsilon = 1$; to obtain good rates, we must combine global search with inference about $f$.

4 Conclusions

We have shown that expected improvement can converge near-optimally, but a naive implementation may not converge at all. We thus echo Diaconis and Freedman (1986) in stating that, for infinite-dimensional problems, Bayesian methods are not always guaranteed to find the right answer; such guarantees can only be provided by considering the problem at hand.

We might ask, however, if our framework can also be improved. Our upper bounds on convergence were established using naive algorithms, which in practice would prove inefficient. If a sophisticated algorithm fails where a naive one succeeds, then the sophisticated algorithm is certainly at fault; we might, however, prefer methods of evaluation which do not consider naive algorithms so successful.

Vazquez and Bect (2010) and Grunewalder et al. (2010) consider a more Bayesian formulation of the problem, where the unknown function $f$ is distributed according to the prior $\pi$, but this approach can prove restrictive: as we saw in Section 3.3, placing too much faith in the prior may exclude functions of interest. Further, Grunewalder et al. find the same issues are present also within the Bayesian framework.

A more interesting approach is given by the continuum-armed-bandit problem (Srinivas et al., 2010, and references therein). Here the goal is to minimize the cumulative regret,
\[ R_n := \sum_{i=1}^n \bigl(f(x_i) - \min f\bigr), \]
in general observing the function $f$ under noise. Algorithms controlling the cumulative regret at rate $r_n$ also solve the optimization problem, at rate $r_n/n$ (Bubeck et al., 2009, §3). The naive algorithms above, however, have poor cumulative regret. We might, then, consider the cumulative regret to be a better measure of performance, but this approach too has limitations. Firstly, the cumulative regret is necessarily increasing, so cannot establish rates of optimization faster than $n^{-1}$. (This is not an issue under noise, where typically $r_n = \Omega(n^{1/2})$; see Kleinberg and Slivkins, 2010.) Secondly, if our goal is optimization, then minimizing the regret, a cost we do not incur, may obscure the problem at hand.

Bubeck et al. (2010) study this problem with the additional assumption that $f$ has finitely many minima, and is, say, quadratic in a neighbourhood of each. This assumption may suffice in practice, and allows the authors to obtain impressive rates of convergence. For optimization, however, a further weakness is that these rates hold only once the algorithm has found a basin of attraction; they thus measure local, rather than global, performance. It may be that convergence rates alone are not sufficient to capture the performance of a global optimization algorithm, and the time taken to find a basin of attraction is more relevant. In any case, the choice of an appropriate framework to measure performance in global optimization merits further study.

Finally, we should also ask how to choose the smoothness parameter $\nu$ (or the equivalent parameter in similar algorithms).
Van der Vaart and van Zanten (2009) show that Bayesian Gaussian-process models can, in some contexts, automatically adapt to the smoothness of an unknown function $f$. Their technique requires, however, that the estimated length-scales $\hat\theta_n$ tend to 0, posing both practical and theoretical challenges. The question of how best to optimize functions of unknown smoothness remains open.

Acknowledgements

We would like to thank the referees, as well as Richard Nickl and Steffen Grunewalder, for their valuable comments and suggestions.

A Proofs

We now prove the results in Section 3.

A.1 Reproducing-Kernel Hilbert Spaces

Proof of Lemma 1. Let $V$ be the space of functions described, and $W$ be the closed real subspace of Hermitian functions in $L^2(\mathbb{R}^d, \hat K^{-1})$. We will show $f \mapsto \hat f$ is an isomorphism $V \to W$, so we may equivalently work with $W$.

Given $\hat f \in W$, by Cauchy-Schwarz and Bochner's theorem,
\[ \int |\hat f| \le \Bigl(\int \hat K\Bigr)^{1/2} \Bigl(\int |\hat f|^2/\hat K\Bigr)^{1/2} < \infty, \]
and as $\|\hat K\|_\infty \le \|K\|_1$,
\[ \int |\hat f|^2 \le \|\hat K\|_\infty \int |\hat f|^2/\hat K < \infty, \]
so $\hat f \in L^1 \cap L^2$. $\hat f$ is thus the Fourier transform of a real continuous $f \in L^2$, satisfying the Fourier inversion formula everywhere. $f \mapsto \hat f$ is hence an isomorphism $V \to W$.

It remains to show that $V = H(\mathbb{R}^d)$. $W$ is complete, so $V$ is. Further, $E(\mathbb{R}^d) \subset V$, and by Fourier inversion each $f \in V$ satisfies the reproducing property,
\[ f(x) = \int e^{2\pi i \langle x, \xi\rangle} \hat f(\xi)\,d\xi = \int \frac{\hat f(\xi)\,\overline{\hat k_x(\xi)}}{\hat K(\xi)}\,d\xi = \langle f, k_x\rangle, \]
so $H(\mathbb{R}^d)$ is a closed subspace of $V$. Given $f \in H(\mathbb{R}^d)^\perp$, $f(x) = \langle f, k_x\rangle = 0$ for all $x$, so $f = 0$. Thus $V = H(\mathbb{R}^d)$.

Proof of Lemma 3. By Lemma 1, the norm on $H_\theta(\mathbb{R}^d)$ is
\[ \|f\|^2_{H_\theta(\mathbb{R}^d)} = \int \frac{|\hat f(\xi)|^2}{\hat K_\theta(\xi)}\,d\xi, \]
and $K_\theta$ has Fourier transform
\[ \hat K_\theta(\xi) = \hat K(\theta_1\xi_1, \dots, \theta_d\xi_d) \prod_{i=1}^d \theta_i. \]
If $\nu < \infty$, by assumption $\hat K(\xi) = \hat k(\|\xi\|)$, for a finite non-increasing function $\hat k$ satisfying $\hat k(\|\xi\|) = \Theta(\|\xi\|^{-2\nu-d})$ as $\xi \to \infty$. Hence
\[ C(1 + \|\xi\|^2)^{-(\nu+d/2)} \le \hat K_\theta(\xi) \le C'(1 + \|\xi\|^2)^{-(\nu+d/2)}, \]
for constants $C, C' > 0$, and we obtain that $H_\theta(\mathbb{R}^d)$ is equivalent to the Sobolev space $H^{\nu+d/2}(\mathbb{R}^d)$. From Lemma 2, $H_\theta(D)$ is given by the restriction of functions in $H_\theta(\mathbb{R}^d)$; as $D$ is Lipschitz, the same is true of $H^{\nu+d/2}$. $H_\theta(D)$ is thus equivalent to $H^{\nu+d/2}(D)$. Finally, functions in $H_\theta(\bar D)$ are continuous, so uniquely identified by their restriction to $D$, and
\[ H_\theta(\bar D) \simeq H_\theta(D) \simeq H^{\nu+d/2}(D). \]
If $\nu = \infty$, by a similar argument $H_\theta(\bar D)$ is continuously embedded in all $H^s(D)$.

From Lemma 1, we can derive results on the behaviour of $\|f\|_{H_\theta(S)}$ as $\theta$ varies. For small $\theta$, we obtain the following result.

Lemma 4. If $f \in H_\theta(S)$, then $f \in H_{\theta'}(S)$ for all $0 < \theta' \le \theta$, and
\[ \|f\|^2_{H_{\theta'}(S)} \le \Bigl(\prod_{i=1}^d \theta_i/\theta'_i\Bigr) \|f\|^2_{H_\theta(S)}. \]

Proof. Let $C = \prod_{i=1}^d (\theta'_i/\theta_i)$. As $\hat K$ is isotropic and radially non-increasing,
\[ \hat K_{\theta'}(\xi) = C\,\hat K_\theta\bigl((\theta'_1/\theta_1)\xi_1, \dots, (\theta'_d/\theta_d)\xi_d\bigr) \ge C\,\hat K_\theta(\xi). \]
Given $f \in H_\theta(S)$, let $g \in H_\theta(\mathbb{R}^d)$ be its minimum norm extension, as in Lemma 2. By Lemma 1,
\[ \|f\|^2_{H_{\theta'}(S)} \le \|g\|^2_{H_{\theta'}(\mathbb{R}^d)} = \int \frac{|\hat g|^2}{\hat K_{\theta'}} \le \int \frac{|\hat g|^2}{C\hat K_\theta} = C^{-1}\|f\|^2_{H_\theta(S)}. \]

Likewise, for large $\theta$, we obtain the following.
Lemma 5. If $\nu < \infty$ and $f \in H_\theta(S)$, then $f \in H_{t\theta}(S)$ for $t \ge 1$, and
\[ \|f\|^2_{H_{t\theta}(S)} \le C'' t^{2\nu} \|f\|^2_{H_\theta(S)}, \]
for a $C'' > 0$ depending only on $K$ and $\theta$.

Proof. As in the proof of Lemma 3, we have constants $C, C' > 0$ such that
\[ C(1 + \|\xi\|^2)^{-(\nu+d/2)} \le \hat K_\theta(\xi) \le C'(1 + \|\xi\|^2)^{-(\nu+d/2)}. \]
Thus for $t \ge 1$,
\[ \hat K_{t\theta}(\xi) = t^d \hat K_\theta(t\xi) \ge C t^d (1 + t^2\|\xi\|^2)^{-(\nu+d/2)} \ge C t^{-2\nu}(1 + \|\xi\|^2)^{-(\nu+d/2)} \ge C C'^{-1} t^{-2\nu} \hat K_\theta(\xi), \]
and we may argue as in the previous lemma.

We can also describe the posterior distribution of $f$ in terms of $H_\theta(S)$; as a consequence, we may deduce Corollary 1.

Lemma 6. Suppose $f(x) = \mu + g(x)$, $g \in H_\theta(S)$.
(i) $\hat f_n(x; \theta) = \hat\mu_n + \hat g_n(x)$ solves the optimization problem
\[ \text{minimize } \|\hat g\|^2_{H_\theta(S)}, \quad \text{subject to } \hat\mu + \hat g(x_i) = z_i, \ 1 \le i \le n, \]
with minimum value $\hat R_n^2(\theta)$.
(ii) The prediction error satisfies
\[ |f(x) - \hat f_n(x; \theta)| \le s_n(x; \theta)\,\|g\|_{H_\theta(S)}, \]
with equality for some $g \in H_\theta(S)$.

Proof. (i) Let $W = \mathrm{span}(k_{x_1}, \dots, k_{x_n})$, and write $\hat g = \hat g_\parallel + \hat g_\perp$ for $\hat g_\parallel \in W$, $\hat g_\perp \in W^\perp$. $\hat g_\perp(x_i) = \langle \hat g_\perp, k_{x_i}\rangle = 0$, so $\hat g_\perp$ affects the optimization only through $\|\hat g\|$. The minimal $\hat g$ thus has $\hat g_\perp = 0$, so $\hat g = \sum_{i=1}^n \lambda_i k_{x_i}$. The problem then becomes
\[ \text{minimize } \lambda^T V \lambda, \quad \text{subject to } \hat\mu 1 + V\lambda = z. \]
The solution is given by (4) and (5), with value (7).

(ii) By symmetry, the prediction error does not depend on $\mu$, so we may take $\mu = 0$. Then
\[ f(x) - \hat f_n(x; \theta) = g(x) - (\hat\mu_n + \hat g_n(x)) = \langle g, e_{n,x}\rangle, \]
for $e_{n,x} = k_x - \sum_{i=1}^n \lambda_i k_{x_i}$, and
\[ \lambda = \frac{V^{-1}1}{1^T V^{-1}1} + \Bigl(I - \frac{V^{-1}11^T}{1^T V^{-1}1}\Bigr) V^{-1} v. \]
Now, $\|e_{n,x}\|^2_{H_\theta(S)} = s_n^2(x; \theta)$, as given by (6); this is a consequence of Loève's isometry, but is easily verified algebraically. The result then follows by Cauchy-Schwarz.

A.2 Fixed Parameters

Proof of Theorem 1. We first establish the lower bound. Suppose we have $2n$ functions $\psi_m$ with disjoint supports. We will argue that, given $n$ observations, we cannot distinguish between all the $\psi_m$, and thus cannot accurately pick a minimum $x^*_n$.

To begin with, assume $X = [0,1]^d$. Let $\psi \colon \mathbb{R}^d \to [-1, 0]$ be a $C^\infty$ function, supported inside $X$ and with minimum $-1$. By Lemma 3, $\psi \in H_\theta(\mathbb{R}^d)$. Fix $k \in \mathbb{N}$, and set $n = (2k)^d/2$. For vectors $m \in \{0, \dots, 2k-1\}^d$, construct functions $\psi_m(x) = C(2k)^{-\nu}\psi(2kx - m)$, where $C > 0$ is to be determined. $\psi_m$ is given by a translation and scaling of $\psi$, so by Lemmas 1, 2 and 5, for some $C' > 0$,
\[ \|\psi_m\|_{H_\theta(X)} \le \|\psi_m\|_{H_\theta(\mathbb{R}^d)} = C(2k)^{-\nu}\|\psi\|_{H_{2k\theta}(\mathbb{R}^d)} \le CC'\|\psi\|_{H_\theta(\mathbb{R}^d)}. \]
Set $C = R/(C'\|\psi\|_{H_\theta(\mathbb{R}^d)})$, so that $\|\psi_m\|_{H_\theta(X)} \le R$ for all $m$ and $k$.

Suppose $f = 0$, and let $x_n$ and $x^*_n$ be chosen by any valid strategy $u$. Set $\chi = \{x_1, \dots, x_{n-1}, x^*_{n-1}\}$, and let $A_m$ be the event that $\psi_m(x) = 0$ for all $x \in \chi$. There are $n$ points in $\chi$, and the $2n$ functions $\psi_m$ have disjoint support, so $\sum_m \mathbb{1}(A_m) \ge n$. Thus
\[ \sum_m P^u_0(A_m) = E^u_0\Bigl[\sum_m \mathbb{1}(A_m)\Bigr] \ge n, \]
and we have some fixed $m$, depending only on $u$, for which $P^u_0(A_m) \ge \frac12$. On the event $A_m$, $\psi_m(x^*_{n-1}) - \min \psi_m = C(2k)^{-\nu}$, but on that event, $u$ cannot distinguish between $0$ and $\psi_m$ before time $n$, so
\[ C^{-1}(2k)^\nu\,E^u_{\psi_m}[f(x^*_{n-1}) - \min f] \ge P^u_{\psi_m}(A_m) = P^u_0(A_m) \ge \tfrac12. \]
As the minimax loss is non-increasing in $n$, for $(2(k-1))^d/2 \le n < (2k)^d/2$ we conclude
\[ \inf_u L_n(u, H_\theta(X), R) \ge \inf_u L_{(2k)^d/2-1}(u, H_\theta(X), R) \ge \inf_u \sup_m E^u_{\psi_m}\bigl[f\bigl(x^*_{(2k)^d/2-1}\bigr) - \min f\bigr] \ge \tfrac12 C(2k)^{-\nu} = \Omega(n^{-\nu/d}). \]
For general $X$ having non-empty interior, we can find a hypercube $S = x_0 + [0, \varepsilon]^d \subseteq X$, with $\varepsilon > 0$. We may then proceed as above, picking functions $\psi_m$ supported inside $S$.

For the upper bound, consider a strategy $u$ choosing a fixed sequence $x_n$, independent of the $z_n$. Fit a radial basis function interpolant $\hat f_n$ to the data, and pick $x^*_n$ to minimize $\hat f_n$. Then if $x^*$ minimizes $f$,
\[ f(x^*_n) - f(x^*) \le f(x^*_n) - \hat f_n(x^*_n) + \hat f_n(x^*) - f(x^*) \le 2\|\hat f_n - f\|_\infty, \]
so the loss is bounded by the error in $\hat f_n$. From results in Narcowich et al. (2003, §6) and Wendland (2005, §11.5), for suitable radial basis functions the error is uniformly bounded by
\[ \sup_{\|f\|_{H_\theta(X)} \le R} \|\hat f_n - f\|_\infty = O(h_n^\nu), \]
where the mesh norm
\[ h_n := \sup_{x \in X} \min_{i=1}^n \|x - x_i\|. \]
(For $\nu \notin \mathbb{N}$, this result is given by Narcowich et al. for the radial basis function $K_\nu$, which is $\nu$-Hölder at 0 by Abramowitz and Stegun, 1965, §9.6; for $\nu \in \mathbb{N}$, the result is given by Wendland for thin-plate splines.) As $X$ is bounded, we may choose the $x_n$ so that $h_n = O(n^{-1/d})$, giving
\[ L_n(u, H_\theta(X), R) = O(n^{-\nu/d}). \]

To prove Theorem 2, we first show that some observations $z_n$ will be well-predicted by past data.

Lemma 7. Set
\[ \beta := \begin{cases} \alpha, & \nu \le 1, \\ 0, & \nu > 1. \end{cases} \]
Given $\theta \in \mathbb{R}^d_+$, there is a constant $C' > 0$ depending only on $X$, $K$ and $\theta$ which satisfies the following. For any $k \in \mathbb{N}$, and sequences $x_n \in X$, $\theta_n \ge \theta$, the inequality
\[ s_n(x_{n+1}; \theta_n) \ge C' k^{-(\nu\wedge 1)/d}(\log k)^\beta \]
holds for at most $k$ distinct $n$.

Proof. We first show that the posterior variance $s_n^2$ is bounded by the distance to the nearest design point. Let $\pi_n$ denote the prior with variance $\sigma^2 = 1$, and length-scales $\theta_n$. Then for any $i \le n$, as $\hat f_n(x; \theta_n) = E_{\pi_n}[f(x) \mid \mathcal{F}_n]$,
\[ s_n^2(x; \theta_n) = E_{\pi_n}[(f(x) - \hat f_n(x; \theta_n))^2 \mid \mathcal{F}_n] = E_{\pi_n}[(f(x) - f(x_i))^2 - (f(x_i) - \hat f_n(x; \theta_n))^2 \mid \mathcal{F}_n] \le E_{\pi_n}[(f(x) - f(x_i))^2 \mid \mathcal{F}_n] = 2(1 - K_{\theta_n}(x - x_i)). \]
If $\nu \le \frac12$, then by assumption $|K(x) - K(0)| = O(\|x\|^{2\nu}(-\log\|x\|)^{2\alpha})$ as $x \to 0$. If $\nu > \frac12$, then $K$ is differentiable, so as $K$ is symmetric, $\nabla K(0) = 0$. If further $\nu \le 1$, then
\[ |K(x) - K(0)| = |K(x) - K(0) - x \cdot \nabla K(0)| = O\bigl(\|x\|^{2\nu}(-\log\|x\|)^{2\alpha}\bigr). \]
Similarly, if $\nu > 1$, then $K$ is $C^2$, so
\[ |K(x) - K(0)| = |K(x) - K(0) - x \cdot \nabla K(0)| = O(\|x\|^2). \]
We may thus conclude
\[ |1 - K(x)| = |K(x) - K(0)| = O\bigl(\|x\|^{2(\nu\wedge 1)}(-\log\|x\|)^{2\beta}\bigr), \]
and
\[ s_n^2(x; \theta_n) \le C^2 \|x - x_i\|^{2(\nu\wedge 1)}(-\log\|x - x_i\|)^{2\beta}, \]
for a constant $C > 0$ depending only on $X$, $K$ and $\theta$.

We next show that most design points $x_{n+1}$ are close to a previous $x_i$. $X$ is bounded, so can be covered by $k$ balls of radius $O(k^{-1/d})$. If $x_{n+1}$ lies in a ball containing some earlier point $x_i$, $i \le n$, then we may conclude
\[ s_n^2(x_{n+1}; \theta_n) \le C'^2 k^{-2(\nu\wedge 1)/d}(\log k)^{2\beta}, \]
for a constant $C' > 0$ depending only on $X$, $K$ and $\theta$. Hence as there are $k$ balls, at most $k$ points $x_{n+1}$ can satisfy
\[ s_n(x_{n+1}; \theta_n) \ge C' k^{-(\nu\wedge 1)/d}(\log k)^\beta. \]
Next, we provide bounds on the expected improvement when $f$ lies in the RKHS.

Lemma 8. Let $\|f\|_{H_\theta(X)} \le R$. For $x \in X$, $n \in \mathbb{N}$, set $I = (f(x^*_n) - f(x))^+$, and $s = s_n(x; \theta)$. Then for $\tau(x) := x\Phi(x) + \varphi(x)$, we have
\[ \max\Bigl(I - Rs,\ \frac{\tau(-R/\sigma)}{\tau(R/\sigma)}\,I\Bigr) \le EI_n(x; \pi) \le I + (R + \sigma)s. \]

Proof. If $s = 0$, then by Lemma 6, $\hat f_n(x; \theta) = f(x)$, so $EI_n(x; \pi) = I$, and the result is trivial. Suppose $s > 0$, and set $t = (f(x^*_n) - f(x))/s$, $u = (f(x^*_n) - \hat f_n(x; \theta))/s$. From (8) and (9),
\[ EI_n(x; \pi) = \sigma s\,\tau(u/\sigma), \]
and by Lemma 6, $|u - t| \le R$. As $\tau'(z) = \Phi(z) \in [0,1]$, $\tau$ is non-decreasing, and $\tau(z) \le 1 + z$ for $z \ge 0$. Hence
\[ EI_n(x; \pi) \le \sigma s\,\tau\Bigl(\frac{t^+ + R}{\sigma}\Bigr) \le \sigma s\Bigl(\frac{t^+ + R}{\sigma} + 1\Bigr) = I + (R + \sigma)s. \]
If $I = 0$, then as $EI$ is the expectation of a non-negative quantity, $EI \ge 0$, and the lower bounds are trivial. Suppose $I > 0$. Then as $EI \ge 0$, $\tau(z) \ge 0$ for all $z$, and $\tau(z) = z + \tau(-z) \ge z$. Thus
\[ EI_n(x; \pi) \ge \sigma s\,\tau\Bigl(\frac{t - R}{\sigma}\Bigr) \ge \sigma s\,\frac{t - R}{\sigma} = I - Rs. \]
Also, as $\tau$ is increasing,
\[ EI_n(x; \pi) \ge \sigma\,\tau(-R/\sigma)\,s. \]
Combining these bounds, and eliminating $s$, we obtain
\[ EI_n(x; \pi) \ge \frac{\sigma\,\tau(-R/\sigma)}{R + \sigma\,\tau(-R/\sigma)}\,I = \frac{\tau(-R/\sigma)}{\tau(R/\sigma)}\,I. \]

We may now prove the theorem. We will use the above bounds to show that there must be times $n_k$ when the expected improvement is low, and thus $f(x^*_{n_k})$ is close to $\min f$.

Proof of Theorem 2. From Lemma 7 there exists $C > 0$, depending on $X$, $K$ and $\theta$, such that for any sequence $x_n \in X$ and $k \in \mathbb{N}$, the inequality $s_n(x_{n+1}; \theta) > Ck^{-(\nu\wedge 1)/d}(\log k)^\beta$ holds at most $k$ times. Furthermore, $z^*_n - z^*_{n+1} \ge 0$, and for $\|f\|_{H_\theta(X)} \le R$,
\[ \sum_n (z^*_n - z^*_{n+1}) \le z^*_1 - \min f \le 2\|f\|_\infty \le 2R, \]
so $z^*_n - z^*_{n+1} > 2Rk^{-1}$ at most $k$ times. Since $z^*_n - f(x_{n+1}) \le z^*_n - z^*_{n+1}$, we have also $z^*_n - f(x_{n+1}) > 2Rk^{-1}$ at most $k$ times. Thus there is a time $n_k$, $k \le n_k \le 3k$, for which
\[ s_{n_k}(x_{n_k+1}; \theta) \le Ck^{-(\nu\wedge 1)/d}(\log k)^\beta \quad\text{and}\quad z^*_{n_k} - f(x_{n_k+1}) \le 2Rk^{-1}. \]
Let $f$ have minimum $z^*$ at $x^*$. For $k$ large, $x_{n_k+1}$ will have been chosen by expected improvement (rather than being an initial design point, chosen at random). Then as $z^*_n$ is non-increasing in $n$, for $3k \le n < 3(k+1)$ we have by Lemma 8,
\[ z^*_n - z^* \le z^*_{n_k} - z^* \le \frac{\tau(R/\sigma)}{\tau(-R/\sigma)}\,EI_{n_k}(x^*; \pi) \le \frac{\tau(R/\sigma)}{\tau(-R/\sigma)}\,EI_{n_k}(x_{n_k+1}; \pi) \le \frac{\tau(R/\sigma)}{\tau(-R/\sigma)}\Bigl(2Rk^{-1} + C(R + \sigma)k^{-(\nu\wedge 1)/d}(\log k)^\beta\Bigr). \]
This bound is uniform in $f$ with $\|f\|_{H_\theta(X)} \le R$, so we obtain
\[ L_n(EI(\pi), H_\theta(X), R) = O\bigl(n^{-(\nu\wedge 1)/d}(\log n)^\beta\bigr). \]

A.3 Estimated Parameters

To prove Theorem 3, we first establish lower bounds on the posterior variance.

Lemma 9. Given $\theta^L, \theta^U \in \mathbb{R}^d_+$, pick sequences $x_n \in X$, $\theta^L \le \theta_n \le \theta^U$. Then for open $S \subset X$,
\[ \sup_{x \in S} s_n(x; \theta_n) = \Omega(n^{-\nu/d}), \]
uniformly in the sequences $x_n$, $\theta_n$.

Proof. $S$ is open, so contains a hypercube $T$. For $k \in \mathbb{N}$, let $n = \frac12(2k)^d$, and construct $2n$ functions $\psi_m$ on $T$ with $\|\psi_m\|_{H_{\theta^U}(X)} \le 1$, as in the proof of Theorem 1. Let $C^2 = \prod_{i=1}^d (\theta^U_i/\theta^L_i)$; then by Lemma 4, $\|\psi_m\|_{H_{\theta_n}(X)} \le C$. Given $n$ design points $x_1, \dots, x_n$, there must be some $\psi_m$ such that $\psi_m(x_i) = 0$, $1 \le i \le n$. By Lemma 6, the posterior mean of $\psi_m$ given these observations is the zero function.
Thus for $x \in T$ minimizing $\psi_m$,
\[ s_n(x; \theta_n) \ge C^{-1} s_n(x; \theta_n)\,\|\psi_m\|_{H_{\theta_n}(X)} \ge C^{-1}|\psi_m(x) - 0| = \Omega(k^{-\nu}). \]
As $s_n(x; \theta)$ is non-increasing in $n$, for $\frac12(2(k-1))^d < n \le \frac12(2k)^d$ we obtain
\[ \sup_{x \in S} s_n(x; \theta_n) \ge \sup_{x \in S} s_{\frac12(2k)^d}(x; \theta_n) = \Omega(k^{-\nu}) = \Omega(n^{-\nu/d}). \]

Next, we bound the expected improvement when prior parameters are estimated by maximum likelihood.

Lemma 10. Let $\|f\|_{H_{\theta^U}(X)} \le R$, $x_n, y_n \in X$. Set $I_n(x) = z^*_n - f(x)$, $s_n(x) = s_n(x; \hat\theta_n)$, and $t_n(x) = I_n(x)/s_n(x)$. Suppose:
(i) for some $i < j$, $z_i \ne z_j$;
(ii) for some $T_n \to -\infty$, $t_n(x_{n+1}) \le T_n$ whenever $s_n(x_{n+1}) > 0$;
(iii) $I_n(y_{n+1}) \ge 0$; and
(iv) for some $C > 0$, $s_n(y_{n+1}) \ge e^{-C/c_n}$.
Then for $\hat\pi_n$ as in Definition 2, eventually
\[ EI_n(x_{n+1}; \hat\pi_n) < EI_n(y_{n+1}; \hat\pi_n). \]
If the conditions hold on a subsequence, so does the conclusion.

Proof. Let $\hat R_n^2(\theta)$ be given by (7), and set $\hat R_n^2 = \hat R_n^2(\hat\theta_n)$. For $n \ge j$, $\hat R_n^2 > 0$, and by Lemma 4 and Corollary 1,
\[ \hat R_n^2 \le \|f\|^2_{H_{\hat\theta_n}(X)} \le S^2 = R^2 \prod_{i=1}^d (\theta^U_i/\theta^L_i). \]
Thus $0 < \hat\sigma_n^2 \le S^2 c_n$. Then if $s_n(x) > 0$, for some $|u_n(x) - t_n(x)| \le S$,
\[ EI_n(x; \hat\pi_n) = \hat\sigma_n s_n(x)\,\tau(u_n(x)/\hat\sigma_n), \]
as in the proof of Lemma 8. If $s_n(x_{n+1}) = 0$, then $x_{n+1} \in \{x_1, \dots, x_n\}$, so
\[ EI_n(x_{n+1}; \hat\pi_n) = 0 < EI_n(y_{n+1}; \hat\pi_n). \]
When $s_n(x_{n+1}) > 0$, as $\tau$ is increasing we may upper bound $EI_n(x_{n+1}; \hat\pi_n)$ using $u_n(x_{n+1}) \le T_n + S$, and lower bound $EI_n(y_{n+1}; \hat\pi_n)$ using $u_n(y_{n+1}) \ge -S$. Since $s_n(x_{n+1}) \le 1$, and $\tau(x) = \Theta(x^{-2}e^{-x^2/2})$ as $x \to -\infty$ (Abramowitz and Stegun, 1965, §7.1),
\[ \frac{EI_n(x_{n+1}; \hat\pi_n)}{EI_n(y_{n+1}; \hat\pi_n)} \le \frac{\tau((T_n + S)/\hat\sigma_n)}{e^{-C/c_n}\,\tau(-S/\hat\sigma_n)} = O\Bigl((T_n + S)^{-2} e^{C/c_n - (T_n^2 + 2ST_n)/2\hat\sigma_n^2}\Bigr) = O\Bigl((T_n + S)^{-2} e^{-(T_n^2 + 2ST_n - 2CS^2)/2S^2c_n}\Bigr) = o(1). \]
If the conditions hold on a subsequence, we may similarly argue along that subsequence.

Finally, we will require the following technical lemma.

Lemma 11. Let $x_1, \dots, x_n$ be random variables taking values in $\mathbb{R}^d$. Given open $S \subseteq \mathbb{R}^d$, there exists an open $U \subseteq S$ for which $P(\bigcup_{i=1}^n \{x_i \in U\})$ is arbitrarily small.

Proof. Given $\varepsilon > 0$, fix $m \ge n/\varepsilon$, and pick disjoint open sets $U_1, \dots, U_m \subset S$. Then
\[ \sum_{j=1}^m E[\#\{x_i \in U_j\}] \le E[\#\{x_i \in \mathbb{R}^d\}] = n, \]
so there exists $U_j$ with
\[ P\Bigl(\bigcup_i \{x_i \in U_j\}\Bigr) \le E[\#\{x_i \in U_j\}] \le n/m \le \varepsilon. \]

We may now prove the theorem. We will construct a function $f$ on which the $EI(\hat\pi)$ strategy never observes within a region $W$. We may then construct a function $g$, agreeing with $f$ except on $W$, but having a different minimum. As the strategy cannot distinguish between $f$ and $g$, it cannot successfully find the minimum of both.

Proof of Theorem 3. Let the $EI(\hat\pi)$ strategy choose initial design points $x_1, \dots, x_k$, independently of $f$. Given $\varepsilon > 0$, by Lemma 11 there exists open $U_0 \subseteq X$ for which $P^{EI(\hat\pi)}(x_i \in U_0 \text{ for some } i \le k) \le \varepsilon$; we may choose $U_0$ so that $V_0 = X \setminus U_0$ has non-empty interior. Pick open $U_1$ such that $V_1 = \bar U_1 \subset U_0$, and set $f$ to be a $C^\infty$ function, $0$ on $V_0$, $1$ on $V_1$, and everywhere non-negative. By Lemma 1, $f \in H_{\theta^U}(X)$.
We work conditional on the event $A$, having probability at least $1 - \varepsilon$, that $z^*_k = 0$, and thus $z^*_n = 0$ for all $n \ge k$.

Suppose $x_n \in V_1$ infinitely often, so the $z_n$ are not all equal. By Lemma 7, $s_n(x_{n+1}; \hat\theta_n) \to 0$, so on a subsequence with $x_{n+1} \in V_1$, we have
\[ t_n = (z^*_n - f(x_{n+1}))/s_n(x_{n+1}; \hat\theta_n) = -s_n(x_{n+1}; \hat\theta_n)^{-1} \to -\infty \]
whenever $s_n(x_{n+1}; \hat\theta_n) > 0$. However, by Lemma 9, there are points $y_n \in V_0$ with $z^*_n - f(y_{n+1}) = 0$, and $s_n(y_{n+1}; \hat\theta_n) = \Omega(n^{-\nu/d})$. Hence by Lemma 10, $EI_n(x_{n+1}; \hat\pi_n) < EI_n(y_{n+1}; \hat\pi_n)$ for some $n$, contradicting the definition of $x_{n+1}$.

Hence, on $A$, there is a random variable $T$ taking values in $\mathbb{N}$, for which $n > T \implies x_n \notin V_1$. Hence there exists a constant $t \in \mathbb{N}$ for which the event $B = A \cap \{T \le t\}$ has $P^{EI(\hat\pi)}_f$-probability at least $1 - 2\varepsilon$. By Lemma 11, we thus have an open set $W \subset V_1$ for which the event
\[ C = B \cap \{x_n \notin W : n \in \mathbb{N}\} = B \cap \{x_n \notin W : n \le t\} \]
has $P^{EI(\hat\pi)}_f$-probability at least $1 - 3\varepsilon$.

Construct a smooth function $g$ by adding to $f$ a $C^\infty$ function which is $0$ outside $W$, and has minimum $-2$. Then $\min g = -1$, but on the event $C$, $EI(\hat\pi)$ cannot distinguish between $f$ and $g$, and $g(x^*_n) \ge 0$. Thus for $\delta = 1$,
\[ P^{EI(\hat\pi)}_g\Bigl(\inf_n g(x^*_n) - \min g \ge \delta\Bigr) \ge P^{EI(\hat\pi)}_g(C) = P^{EI(\hat\pi)}_f(C) \ge 1 - 3\varepsilon. \]
As the behaviour of $EI(\hat\pi)$ is invariant under rescaling, we may scale $g$ to have norm $\|g\|_{H_\theta(X)} \le R$, and the above remains true for some $\delta > 0$.

Proof of Theorem 4. As in the proof of Theorem 2, we will show there are times $n_k$ when the expected improvement is small, so $f(x_{n_k})$ must be close to the minimum. First, however, we must control the estimated parameters $\hat\sigma_n^2$, $\hat\theta_n$.

If the $z_n$ are all equal, then by assumption the $x_n$ are dense in $X$, so $f$ is constant, and the result is trivial. Suppose the $z_n$ are not all equal, and let $T$ be a random variable satisfying $z_T \ne z_i$ for some $i < T$. Set $U = \inf_{\theta^L \le \theta \le \theta^U} \hat R_T(\theta)$. $\hat R_T(\theta)$ is a continuous positive function, so $U > 0$. Let $S^2 = R^2\prod_{i=1}^d (\theta^U_i/\theta^L_i)$. By Lemma 4, $\|f\|_{H_{\hat\theta_n}(X)} \le S$, so by Corollary 1, for $n \ge T$,
\[ U \le \hat R_T(\hat\theta_n) \le \hat\sigma_n \le \|f\|_{H_{\hat\theta_n}(X)} \le S. \]
As in the proof of Theorem 2, we have a constant $C > 0$, and some $n_k$, $k \le n_k \le 3k$, for which
\[ z^*_{n_k} - f(x_{n_k+1}) \le 2Rk^{-1} \quad\text{and}\quad s_{n_k}(x_{n_k+1}; \hat\theta_{n_k}) \le Ck^{-(\nu\wedge 1)/d}(\log k)^\beta. \]
Then for $k \ge T$, $3k \le n < 3(k+1)$, arguing as in Theorem 2 we obtain
\[ z^*_n - z^* \le z^*_{n_k} - z^* \le \frac{\tau(S/\hat\sigma_{n_k})}{\tau(-S/\hat\sigma_{n_k})}\Bigl(2Rk^{-1} + C(S + \hat\sigma_{n_k})k^{-(\nu\wedge 1)/d}(\log k)^\beta\Bigr) \le \frac{\tau(S/U)}{\tau(-S/U)}\Bigl(2Rk^{-1} + 2CSk^{-(\nu\wedge 1)/d}(\log k)^\beta\Bigr). \]
We thus have a random variable $C'$ satisfying $z^*_n - z^* \le C'n^{-(\nu\wedge 1)/d}(\log n)^\beta$ for all $n$, and the result follows.

A.4 Near-Optimal Rates

To prove Theorem 5, we first show that the points chosen at random will be quasi-uniform in $X$.

Lemma 12. Let $x_n$ be i.i.d. random variables, distributed uniformly over $X$, and define their mesh norm,
\[ h_n := \sup_{x \in X} \min_{i=1}^n \|x - x_i\|. \]
For any $\gamma > 0$, there exists $C > 0$ such that
\[ P\bigl(h_n > C(n/\log n)^{-1/d}\bigr) = O(n^{-\gamma}). \]
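As a quick empirical check of Lemma 12 (our sketch, not part of the proof), in dimension $d = 1$ with $X = [0, 1]$: the mesh norm of $n$ uniform points stays within a constant multiple of $(n/\log n)^{-1/d}$.

```python
import numpy as np

rng = np.random.default_rng(1)

def mesh_norm(x, grid):
    """h_n = sup_x min_i |x - x_i| over X = [0, 1], approximated on a grid."""
    xs = np.sort(x)
    idx = np.searchsorted(xs, grid).clip(1, len(xs) - 1)
    return np.minimum(np.abs(grid - xs[idx - 1]), np.abs(grid - xs[idx])).max()

grid = np.linspace(0.0, 1.0, 100_001)
for n in [100, 1_000, 10_000]:
    h = np.mean([mesh_norm(rng.uniform(size=n), grid) for _ in range(20)])
    print(f"n = {n:6d}: mean h_n = {h:.5f}; h_n * (n / log n) = {h * n / np.log(n):.2f}")
```

The printed ratio should remain roughly constant as $n$ grows, consistent with the rate in the lemma.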
Proof. We will partition $X$ into $n$ regions of size $O(n^{-1/d})$, and show that with high probability we will place an $x_i$ in each one. Then every point $x$ will be close to an $x_i$, and the mesh norm will be small.

Suppose $X = [0,1]^d$, fix $k \in \mathbb{N}$, and divide $X$ into $n = k^d$ sub-cubes $X_m = \frac1k(m + [0,1]^d)$, for $m \in \{0, \dots, k-1\}^d$. Let $I_m$ be the indicator function of the event $\{x_i \notin X_m : 1 \le i \le \lfloor \gamma n \log n \rfloor\}$, and define
\[ \mu_n = E\Bigl[\sum_m I_m\Bigr] = nE[I_0] = n(1 - 1/n)^{\lfloor \gamma n \log n\rfloor} \sim ne^{-\gamma\log n} = n^{-(\gamma-1)}. \]
For $n$ large, $\mu_n \le 1$, so by the generalized Chernoff bound of Panconesi and Srinivasan (1997, §3.1),
\[ P\Bigl(\sum_m I_m \ge 1\Bigr) \le \Bigl(e^{\mu_n^{-1} - 1}\,\mu_n^{\mu_n^{-1}}\Bigr)^{\mu_n} \le e\mu_n \sim en^{-(\gamma-1)}. \]
On the event $\sum_m I_m < 1$, $I_m = 0$ for all $m$. For any $x \in X$, we then have $x \in X_m$ for some $m$, and $x_j \in X_m$ for some $1 \le j \le \lfloor \gamma n \log n \rfloor$. Thus
\[ \min_{i=1}^{\lfloor \gamma n \log n\rfloor} \|x - x_i\| \le \|x - x_j\| \le \sqrt{d}\,k^{-1}. \]
As this bound is uniform in $x$, we obtain $h_{\lfloor \gamma n \log n\rfloor} \le \sqrt{d}\,k^{-1}$. Thus for $n = k^d$,
\[ P\bigl(h_{\lfloor \gamma n \log n\rfloor} > \sqrt{d}\,k^{-1}\bigr) = O\bigl(k^{-d(\gamma-1)}\bigr), \]
and as $h_n$ is non-increasing in $n$, this bound holds also for $k^d \le n < (k+1)^d$. By a change of variables, we then obtain
\[ P\bigl(h_n > C(n/\gamma\log n)^{-1/d}\bigr) = O\bigl((n/\gamma\log n)^{-(\gamma-1)}\bigr), \]
and the result follows by choosing $\gamma$ large. For general $X$, as $X$ is bounded it can be partitioned into $n$ regions of size $\Theta(n^{-1/d})$, so we may argue similarly.

We may now prove the theorem. We will show that the points $x_n$ must be quasi-uniform in $X$, so posterior variances must be small. Then, as in the proofs of Theorems 2 and 4, we have times when the expected improvement is small, so $f(x^*_n)$ is close to $\min f$.

Proof of Theorem 5. First suppose $\nu < \infty$. Let the $EI(\cdot, \varepsilon)$ strategy choose $k$ initial design points independent of $f$, and suppose $n \ge 2k$.
We may now prove the theorem. We will show that the points $x_n$ must be quasi-uniform in $X$, so posterior variances must be small. Then, as in the proofs of Theorems 2 and 4, we have times when the expected improvement is small, so $f(x^*_n)$ is close to $\min f$.

Proof of Theorem 5. First suppose $\nu < \infty$. Let the $EI(\cdot, \varepsilon)$ choose $k$ initial design points independent of $f$, and suppose $n \ge 2k$. Let $A_n$ be the event that $\lfloor \varepsilon n/4 \rfloor$ of the points $x_{k+1}, \dots, x_n$ are chosen uniformly at random, so by a Chernoff bound,
\[
\mathbb{P}^{EI(\cdot, \varepsilon)}(A_n^c) \le e^{-\varepsilon n/16}.
\]
Let $B_n$ be the event that one of the points $x_{n+1}, \dots, x_{2n}$ is chosen by expected improvement, so $\mathbb{P}^{EI(\cdot, \varepsilon)}(B_n^c) = \varepsilon^n$. Finally, let $C_n$ be the event that $A_n$ and $B_n$ occur, and further the mesh norm $h_n \le C(n/\log n)^{-1/d}$, for the constant $C$ from Lemma 12.

Set $r_n = (n/\log n)^{-\nu/d}(\log n)^\alpha$. Then by Lemma 12, since $C_n \subset A_n$,
\[
\mathbb{P}^{EI(\cdot, \varepsilon)}_f(C_n^c) \le C' r_n,
\]
for a constant $C' > 0$ not depending on $f$.

Let $EI(\cdot, \varepsilon)$ have prior $\pi_n$ at time $n$, with (fixed or estimated) parameters $\sigma_n$, $\theta_n$. Suppose $\|f\|_{\mathcal{H}_{\theta^U}(X)} \le R$, and set $S^2 = R^2 \prod_{i=1}^d (\theta^U_i/\theta^L_i)$, so by Lemma 4, $\|f\|_{\mathcal{H}_{\theta_n}(X)} \le S$. If $\alpha = 0$, then by Narcowich et al. (2003, §6), $\sup_{x \in X} s_n(x; \theta) = O(M(\theta) h_n^\nu)$ uniformly in $\theta$, for $M(\theta)$ a continuous function of $\theta$. Hence on the event $C_n$,
\[
\sup_{x \in X} s_n(x; \theta_n) \le \sup_{x \in X} \sup_{\theta^L \le \theta \le \theta^U} s_n(x; \theta) \le C'' r_n,
\]
for a constant $C'' > 0$ depending only on $X$, $K$, $C$, $\theta^L$ and $\theta^U$. If $\alpha > 0$, the same result holds by a similar argument.

On the event $C_n$, we have some $x_m$ chosen by expected improvement, $n < m \le 2n$. Let $f$ have minimum $z^*$ at $x^*$. Then by Lemma 8,
\[
z^*_{m-1} - z^* \le EI_{m-1}(x^*; \cdot) + C'' S r_{m-1} \le EI_{m-1}(x_m; \cdot) + C'' S r_{m-1} \le (z^*_{m-1} - f(x_m))_+ + C''(2S + \sigma_{m-1}) r_{m-1} \le z^*_{m-1} - z^*_m + C'' T r_n,
\]
for a constant $T > 0$. (Under $EI(\pi, \varepsilon)$, we have $T = 2S + \sigma$; otherwise $\sigma_{m-1} \le S$ by Corollary 1, so $T = 3S$.) Thus, rearranging,
\[
z^*_{2n} - z^* \le z^*_m - z^* \le C'' T r_n.
\]
On the event $C_n^c$, we have $z^*_{2n} - z^* \le 2\|f\|_\infty \le 2R$, so
\[
\mathbb{E}^{EI(\cdot, \varepsilon)}_f[z^*_{2n+1} - z^*] \le \mathbb{E}^{EI(\cdot, \varepsilon)}_f[z^*_{2n} - z^*] \le 2R\, \mathbb{P}^{EI(\cdot, \varepsilon)}_f(C_n^c) + C'' T r_n \le (2C'R + C''T) r_n.
\]
As this bound is uniform in $f$ with $\|f\|_{\mathcal{H}_{\theta^U}(X)} \le R$, the result follows. If instead $\nu = \infty$, the above argument holds for any $\nu < \infty$.
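The strategy analysed above is simple to state in code. Here is a minimal sketch of $EI(\cdot, \varepsilon)$ in one dimension, our illustration only: it reuses the expected_improvement function sketched earlier, keeps the prior parameters fixed rather than estimated, maximizes expected improvement approximately over a candidate grid, and reports the best observed point.

```python
import numpy as np

def ei_epsilon(f, n_steps, eps=0.1, theta=0.3, k_init=3, seed=0):
    """EI(., eps) on [0, 1]: with probability eps the next point is
    uniform at random; otherwise it (approximately) maximizes EI."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=k_init)                  # initial design
    z = np.array([f(x) for x in X])
    candidates = np.linspace(0.0, 1.0, 201)
    for _ in range(n_steps):
        if rng.uniform() < eps:
            x_next = rng.uniform()                # exploration step
        else:
            ei = [expected_improvement(c, X, z, theta) for c in candidates]
            x_next = candidates[int(np.argmax(ei))]
        X = np.append(X, x_next)
        z = np.append(z, f(x_next))
    return X[np.argmin(z)], z.min()               # best point found

x_best, z_best = ei_epsilon(lambda x: np.sin(5 * x) + 0.5 * x, n_steps=30)
print(x_best, z_best)
```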
References

Abramowitz M and Stegun I A, editors. Handbook of Mathematical Functions. Dover, New York, 1965.
Adler R J and Taylor J E. Random Fields and Geometry. Springer Monographs in Mathematics. Springer, New York, 2007.
Aronszajn N. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68(3):337-404, 1950.
Berlinet A and Thomas-Agnan C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Boston, Massachusetts, 2004.
Brochu E, Cora M, and de Freitas N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. 2010. arXiv:1012.2599.
Bubeck S, Munos R, and Stoltz G. Pure exploration in multi-armed bandits problems. In Proc. 20th International Conference on Algorithmic Learning Theory (ALT '09), pages 23-37, Porto, Portugal, 2009.
Bubeck S, Munos R, Stoltz G, and Szepesvari C. X-armed bandits. 2010.
Diaconis P and Freedman D. On the consistency of Bayes estimates. Ann. Statist., 14(1):1-26, 1986.
Frazier P, Powell W, and Dayanik S. The knowledge-gradient policy for correlated normal beliefs. INFORMS J. Comput., 21(4):599-613, 2009.
Ginsbourger D, le Riche R, and Carraro L. A multi-points criterion for deterministic parallel global optimization based on Gaussian processes. 2008. hal-00260579.
Grunewalder S, Audibert J, Opper M, and Shawe-Taylor J. Regret bounds for Gaussian process bandit problems. In Proc. 13th International Conference on Artificial Intelligence and Statistics (AISTATS '10), pages 273-280, Sardinia, Italy, 2010.
Hansen P, Jaumard B, and Lu S. Global optimization of univariate Lipschitz functions: I. Survey and properties. Math. Program., 55(1):251-272, 1992.
Jones D R, Perttunen C D, and Stuckman B E. Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl., 79(1):157-181, 1993.
Jones D R, Schonlau M, and Welch W J. Efficient global optimization of expensive black-box functions. J. Global Optim., 13(4):455-492, 1998.
Kleinberg R and Slivkins A. Sharp dichotomies for regret minimization in metric spaces. In Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA '10), pages 827-846, Austin, Texas, 2010.
Locatelli M. Bayesian algorithms for one-dimensional global optimization. J. Global Optim., 10(1):57-76, 1997.
Macready W G and Wolpert D H. Bandit problems and the exploration/exploitation tradeoff. IEEE Trans. Evol. Comput., 2(1):2-22, 1998.
Močkus J. On Bayesian methods for seeking the extremum. In Proc. IFIP Technical Conference, pages 400-404, Novosibirsk, Russia, 1974.
Narcowich F J, Ward J D, and Wendland H. Refined error estimates for radial basis function interpolation. Constr. Approx., 19(4):541-564, 2003.
Osborne M. Bayesian Gaussian Processes for Sequential Prediction, Optimisation and Quadrature. DPhil thesis, University of Oxford, Oxford, UK, 2010.
Panconesi A and Srinivasan A. Randomized distributed edge coloring via an extension of the Chernoff-Hoeffding bounds. SIAM J. Comput., 26(2):350-368, 1997.
Pardalos P M and Romeijn H E, editors. Handbook of Global Optimization, Volume 2. Nonconvex Optimization and its Applications. Kluwer Academic Publishers, Dordrecht, the Netherlands, 2002.
Parzen E. Probability density functionals and reproducing kernel Hilbert spaces. In Proc. Symposium on Time Series Analysis, pages 155-169, Providence, Rhode Island, 1963.
Rasmussen C E and Williams C K I. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006.
Santner T J, Williams B J, and Notz W I. The Design and Analysis of Computer Experiments. Springer Series in Statistics. Springer, New York, 2003.
Srinivas N, Krause A, Kakade S M, and Seeger M. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proc. 27th International Conference on Machine Learning (ICML '10), Haifa, Israel, 2010.
Sutton R S and Barto A G. Reinforcement Learning: an Introduction. MIT Press, Cambridge, Massachusetts, 1998.
Tartar L. An Introduction to Sobolev Spaces and Interpolation Spaces, volume 3 of Lecture Notes of the Unione Matematica Italiana. Springer, New York, 2007.
van der Vaart A W and van Zanten J H. Reproducing kernel Hilbert spaces of Gaussian priors. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, volume 3 of Institute of Mathematical Statistics Collections, pages 200-222. Institute of Mathematical Statistics, Beachwood, Ohio, 2008.
van der Vaart A W and van Zanten J H. Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. Ann. Statist., 37(5B):2655-2675, 2009.
Vazquez E and Bect J. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J. Statist. Plann. Inference, 140(11):3088-3095, 2010.
Wendland H. Scattered Data Approximation. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge, UK, 2005.