Convergence rates of efficient global optimization algorithms
Adam D. Bull
Statistical Laboratory, University of Cambridge
a.bull@statslab.cam.ac.uk

Abstract

In the efficient global optimization problem, we minimize an unknown function $f$, using as few observations $f(x)$ as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of $f$. We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior.

Mathematics subject classification 2010: 90C26 (Primary); 68Q32, 62C10, 62L05 (Secondary).
Keywords: convergence rates, efficient global optimization, expected improvement, continuum-armed bandit, Bayesian optimization.

1 Introduction

Suppose we wish to minimize a continuous function $f \colon X \to \mathbb{R}$, where $X$ is a compact subset of $\mathbb{R}^d$. Observing $f(x)$ is costly (it may require a lengthy computer simulation or physical experiment), so we wish to use as few observations as possible. We know little about the shape of $f$; in particular, we will be unable to make assumptions of convexity or unimodality. We therefore need a global optimization algorithm, one which attempts to find a global minimum.

Many standard global optimization algorithms exist, including genetic algorithms, multistart, and simulated annealing (Pardalos and Romeijn, 2002), but these algorithms are designed for functions that are cheap to evaluate. When $f$ is expensive, we need an efficient algorithm, one which will choose its observations to maximize the information gained.

We can consider this a continuum-armed-bandit problem (Srinivas et al., 2010, and references therein), with noiseless data, and loss measured by the simple regret (Bubeck et al., 2009). At time $n$, we choose a design point $x_n \in X$, make an observation $z_n = f(x_n)$, and then report a point $x^*_n$ where we believe $f(x^*_n)$ will be low. Our goal is to find a strategy for choosing the $x_n$ and $x^*_n$, in terms of previous observations, so as to minimize $f(x^*_n)$.

We would like to find a strategy which can guarantee convergence: for functions $f$ in some smoothness class, $f(x^*_n)$ should tend to $\min f$, preferably at some fast rate. The simplest method would be to fix a sequence of $x_n$ in advance, and set $x^*_n = \arg\min \hat f_n$, for some approximation $\hat f_n$ to $f$. We will show that if $\hat f_n$ converges in supremum norm at the optimal rate, then $f(x^*_n)$ also converges at its optimal rate.
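As an illustration, here is a minimal sketch of this naive strategy (ours, not the paper's), assuming SciPy's thin-plate-spline interpolant stands in for $\hat f_n$, and random uniform points stand in for a quasi-uniform design:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def naive_minimize(f, n, d=2, seed=0):
    """Naive strategy: fix the design in advance, fit an interpolant f_hat to
    the data, and report x*_n as the minimizer of f_hat over a candidate set."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))             # stand-in for a quasi-uniform design
    z = np.array([f(x) for x in X])          # observations z_i = f(x_i)
    f_hat = RBFInterpolator(X, z)            # thin-plate-spline interpolant
    candidates = rng.uniform(size=(10_000, d))
    return candidates[np.argmin(f_hat(candidates))]
```

For example, `naive_minimize(lambda x: np.sum((x - 0.3) ** 2), 200)` should return a point near the minimizer $(0.3, 0.3)$. Note that the design here is entirely independent of the observations, which is exactly the weakness discussed next.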
However, while this strategy gives a good worst-case bound, on average it is clearly a poor method of optimization: the design points $x_n$ are completely independent of the observations $z_n$. We may therefore ask if there are more efficient methods, with better average-case performance, that nevertheless provide good guarantees of convergence.

The difficulty in designing such a method lies in the trade-off between exploration and exploitation. If we exploit the data, observing in regions where $f$ is known to be low, we will be more likely to find the optimum quickly; however, unless we explore every region of $X$, we may not find it at all (Macready and Wolpert, 1998).

Initial attempts at this problem include work on Lipschitz optimization (summarized in Hansen et al., 1992) and the DIRECT algorithm (Jones et al., 1993), but perhaps the best-known strategy is expected improvement. It is sometimes called Bayesian optimization, and first appeared in Močkus (1974) as a Bayesian decision-theoretic solution to the problem. Contemporary computers were not powerful enough to implement the technique in full, and it was later popularized by Jones et al. (1998), who provided a computationally efficient implementation. More recently, it has also been called a knowledge-gradient policy by Frazier et al. (2009). Many extensions and alterations have been suggested by further authors; a good summary can be found in Brochu et al. (2010).

Expected improvement performs well in experiments (Osborne, 2010, §9.5), but little is known about its theoretical properties. The behaviour of the algorithm depends crucially on the Gaussian process prior $\pi$ chosen for $f$. Each prior has an associated space of functions $H$, its reproducing-kernel Hilbert space. $H$ contains all functions $X \to \mathbb{R}$ as smooth as a posterior mean of $f$, and is the natural space in which to study questions of convergence.

Vazquez and Bect (2010) show that when $\pi$ is a fixed Gaussian process prior of finite smoothness, expected improvement converges on the minimum of any $f \in H$, and almost surely for $f$ drawn from $\pi$. Grunewalder et al. (2010) bound the convergence rate of a computationally infeasible version of expected improvement: for priors $\pi$ of smoothness $\nu$, they show convergence at a rate $O^*(n^{-(\nu \wedge 0.5)/d})$ on $f$ drawn from $\pi$. We begin by bounding the convergence rate of the feasible algorithm, and show convergence at a rate $O^*(n^{-(\nu \wedge 1)/d})$ on all $f \in H$. We go on to show that a modification of expected improvement converges at the near-optimal rate $O^*(n^{-\nu/d})$.

For practitioners, however, these results are somewhat misleading. In typical applications, the prior is not held fixed, but depends on parameters estimated sequentially from the data. This process ensures the choice of observations is invariant under translation and scaling of $f$, and is believed to be more efficient (Jones et al., 1998, §2). It has a profound effect on convergence, however: Locatelli (1997, §3.2) shows that, for a Brownian motion prior with estimated parameters, expected improvement may not converge at all.
We extend this result to more general settings, showing that for standard priors with estimated parameters, there exist smooth functions $f$ on which expected improvement does not converge. We then propose alternative estimates of the prior parameters, chosen to minimize the constants in the convergence rate. We show that these estimators give an automatic choice of parameters, while retaining the convergence rates of a fixed prior.

Table 1 summarizes the notation used in this paper. We say $f \colon \mathbb{R}^d \to \mathbb{R}$ is a bump function if $f$ is infinitely differentiable and of compact support, and $f \colon \mathbb{R}^d \to \mathbb{C}$ is Hermitian if $f(x) = \overline{f(-x)}$. We use the Landau notation $f = O(g)$ to denote $\limsup |f/g| < \infty$, and $f = o(g)$ to denote $f/g \to 0$. If $g = O(f)$, we say $f = \Omega(g)$, and if both $f = O(g)$ and $f = \Omega(g)$, we say $f = \Theta(g)$. If further $f/g \to 1$, we say $f \sim g$. Finally, if $f$ and $g$ are random, and $P(\sup |f/g| \le M) \to 1$ as $M \to \infty$, we say $f = O_p(g)$.

Table 1: Notation.

Section 1:
  $f$ : unknown function $X \to \mathbb{R}$ to be minimized
  $X$ : compact subset of $\mathbb{R}^d$ to minimize over
  $d$ : number of dimensions to minimize over
  $x_n$ : points in $X$ at which $f$ is observed
  $z_n$ : observations $z_n = f(x_n)$ of $f$
  $x^*_n$ : estimated minimum of $f$, given $z_1, \dots, z_n$

Section 2.1:
  $\pi$ : prior distribution for $f$
  $u$ : strategy for choosing $x_n$, $x^*_n$
  $\mathcal{F}_n$ : filtration $\mathcal{F}_n = \sigma(x_i, z_i : i \le n)$
  $z^*_n$ : best observation $z^*_n = \min_{i=1,\dots,n} z_i$
  $EI_n$ : expected improvement given $\mathcal{F}_n$

Section 2.2:
  $\mu$, $\sigma^2$ : global mean and variance of the Gaussian-process prior $\pi$
  $K$ : underlying correlation kernel for $\pi$
  $K_\theta$ : correlation kernel for $\pi$ with length-scales $\theta$
  $\nu$, $\alpha$ : smoothness parameters of $K$
  $\hat\mu_n$, $\hat f_n$, $s_n^2$, $\hat R_n^2$ : quantities describing the posterior distribution of $f$ given $\mathcal{F}_n$

Section 2.3:
  $EI(\pi)$ : expected-improvement strategy with fixed prior
  $\hat\sigma_n^2$, $\hat\theta_n$ : estimates of the prior parameters $\sigma^2$, $\theta$
  $c_n$ : rate of decay of $\hat\sigma_n^2$
  $\theta^L$, $\theta^U$ : bounds on $\hat\theta_n$
  $EI(\hat\pi)$ : expected-improvement strategy with estimated prior

Section 3.1:
  $H_\theta(S)$ : reproducing-kernel Hilbert space of $K_\theta$ on $S$
  $H^s(D)$ : Sobolev Hilbert space of order $s$ on $D$

Section 3.2:
  $L_n$ : loss suffered over an RKHS ball after $n$ steps

Section 3.3:
  $EI(\tilde\pi)$ : expected-improvement strategy with robust estimated prior

Section 3.4:
  $EI(\cdot, \varepsilon)$ : $\varepsilon$-greedy expected-improvement strategies

In Section 2, we briefly describe the expected-improvement algorithm, and detail our assumptions on the priors used. We state our main results in Section 3, and discuss implications for further work in Section 4. Finally, we give proofs in Appendix A.

2 Expected Improvement

Suppose we wish to minimize an unknown function $f$, choosing design points $x_n$ and estimated minima $x^*_n$ as in the introduction. If we pick a prior distribution $\pi$ for $f$, representing our beliefs about the unknown function, we can describe this problem in terms of decision theory. Let $(\Omega, \mathcal{F}, P)$ be a probability space, equipped with a random process $f$ having law $\pi$. A strategy $u$ is a collection of random variables $(x_n)$, $(x^*_n)$ taking values in $X$. Set $z_n := f(x_n)$, and define the filtration $\mathcal{F}_n := \sigma(x_i, z_i : i \le n)$. The strategy $u$ is valid if $x_n$ is conditionally independent of $f$ given $\mathcal{F}_{n-1}$, and likewise $x^*_n$ given $\mathcal{F}_n$. (Note that we allow random strategies, provided they do not depend on unknown information about $f$.)
When taking probabilities and expectations we will write $P^u_\pi$ and $E^u_\pi$, denoting the dependence on both the prior $\pi$ and strategy $u$. The average-case performance at some future time $N$ is then given by the expected loss,
\[ E^u_\pi[f(x^*_N) - \min f], \]
and our goal, given $\pi$, is to choose the strategy $u$ to minimize this quantity.

2.1 Bayesian Optimization

For $N > 1$ this problem is very computationally intensive (Osborne, 2010, §6.3), but we can solve a simplified version of it. First, we restrict the choice of $x^*_n$ to the previous design points $x_1, \dots, x_n$. (In practice this is reasonable, as choosing an $x^*_n$ we have not observed can be unreliable.) Secondly, rather than finding an optimal strategy for the problem, we derive the myopic strategy: the strategy which is optimal if we always assume we will stop after the next observation. This strategy is suboptimal (Ginsbourger et al., 2008, §3.1), but performs well, and greatly simplifies the calculations involved.

In this setting, given $\mathcal{F}_n$, if we are to stop at time $n$ we should choose $x^*_n := x_{i^*_n}$, where $i^*_n := \arg\min_{i=1,\dots,n} z_i$. (In the case of ties, we may pick any minimizing $i^*_n$.) We then suffer a loss $z^*_n - \min f$, where $z^*_n := z_{i^*_n}$. Were we to observe at $x_{n+1}$ before stopping, the expected loss would be
\[ E^u_\pi[z^*_{n+1} - \min f \mid \mathcal{F}_n], \]
so the myopic strategy should choose $x_{n+1}$ to minimize this quantity. Equivalently, it should maximize the expected improvement over the current loss,
\[ EI_n(x_{n+1}; \pi) := E^u_\pi[z^*_n - z^*_{n+1} \mid \mathcal{F}_n] = E^u_\pi[(z^*_n - z_{n+1})^+ \mid \mathcal{F}_n], \qquad (1) \]
where $x^+ = \max(x, 0)$.

So far, we have merely replaced one optimization problem with another. However, for suitable priors, $EI_n$ can be evaluated cheaply, and thus maximized by standard techniques. The expected-improvement algorithm is then given by choosing $x_{n+1}$ to maximize (1).

2.2 Gaussian Process Models

We still need to choose a prior $\pi$ for $f$. Typically, we model $f$ as a stationary Gaussian process: we consider the values $f(x)$ to be jointly Gaussian, with mean and covariance
\[ E_\pi[f(x)] = \mu, \qquad \mathrm{Cov}_\pi[f(x), f(y)] = \sigma^2 K_\theta(x - y). \qquad (2) \]
$\mu \in \mathbb{R}$ is the global mean of $f$; we place a flat prior on $\mu$, reflecting our uncertainty over the location of $f$. $\sigma > 0$ is the global scale of variation of $f$, and $K_\theta \colon \mathbb{R}^d \to \mathbb{R}$ its correlation kernel, governing the local properties of $f$.

In the following, we will consider kernels
\[ K_\theta(t_1, \dots, t_d) := K(t_1/\theta_1, \dots, t_d/\theta_d), \qquad (3) \]
for an underlying kernel $K$ with $K(0) = 1$. (Note that we can always satisfy this condition by suitably scaling $K$ and $\sigma$.) The $\theta_i > 0$ are the length-scales of the process: two values $f(x)$ and $f(y)$ will be highly correlated if each $x_i - y_i$ is small compared with $\theta_i$. For now, we will assume the parameters $\sigma$ and $\theta$ are fixed in advance.

For (2) and (3) to define a consistent Gaussian process, $K$ must be a symmetric positive-definite function. We will also make the following assumptions.

Assumption 1. $K$ is continuous and integrable.

$K$ thus has Fourier transform
\[ \hat K(\xi) := \int_{\mathbb{R}^d} e^{-2\pi i \langle x, \xi \rangle} K(x)\,dx, \]
and by Bochner's theorem, $\hat K$ is non-negative and integrable.

Assumption 2. $\hat K$ is isotropic and radially non-increasing.

In other words, $\hat K(x) = \hat k(\|x\|)$ for a non-increasing function $\hat k \colon [0, \infty) \to [0, \infty)$; as a consequence, $K$ is isotropic.
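To make (2) and (3) concrete, here is a minimal sketch (ours, not from the paper), assuming the Gaussian kernel of Section 2.2 below as the underlying $K$; the helper names are our own:

```python
import numpy as np

def gaussian_kernel(t):
    """Underlying kernel with K(0) = 1: the Gaussian kernel exp(-||t||^2 / 2)."""
    return np.exp(-0.5 * np.sum(np.square(t), axis=-1))

def prior_covariance(X, theta, sigma2):
    """Prior covariance (2) at design points X (n x d), under the rescaled
    kernel (3): Cov[f(x_i), f(x_j)] = sigma^2 K((x_i - x_j) / theta)."""
    diff = X[:, None, :] - X[None, :, :]
    return sigma2 * gaussian_kernel(diff / np.asarray(theta))
```

Each coordinate of the difference $x - y$ is divided by its length-scale $\theta_i$ before the kernel is applied, so the values $f(x)$ and $f(y)$ vary together exactly when each $x_i - y_i$ is small relative to $\theta_i$.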
Assumption 3. As $x \to \infty$, either:
(i) $\hat K(x) = \Theta(\|x\|^{-2\nu-d})$ for some $\nu > 0$; or
(ii) $\hat K(x) = O(\|x\|^{-2\nu-d})$ for all $\nu > 0$ (we will then say that $\nu = \infty$).

Note the condition $\nu > 0$ is required for $\hat K$ to be integrable.

Assumption 4. $K$ is $C^k$, for $k$ the largest integer less than $2\nu$, and at the origin, $K$ has $k$-th order Taylor approximation $P_k$ satisfying
\[ |K(x) - P_k(x)| = O\bigl(\|x\|^{2\nu}(-\log\|x\|)^{2\alpha}\bigr) \]
as $x \to 0$, for some $\alpha \ge 0$.

When $\alpha = 0$, this is just the condition that $K$ be $2\nu$-Hölder at the origin; when $\alpha > 0$, we instead require this condition up to a log factor. The rate $\nu$ controls the smoothness of functions from the prior: almost surely, $f$ has continuous derivatives of any order $k < \nu$ (Adler and Taylor, 2007, §1.4.2).

Popular kernels include the Matérn class,
\[ K_\nu(x) := \frac{2^{1-\nu}}{\Gamma(\nu)} \bigl(\sqrt{2\nu}\,\|x\|\bigr)^\nu k_\nu\bigl(\sqrt{2\nu}\,\|x\|\bigr), \qquad \nu \in (0, \infty), \]
where $k_\nu$ is a modified Bessel function of the second kind, and the Gaussian kernel,
\[ K_\infty(x) := e^{-\frac12\|x\|^2}, \]
obtained in the limit $\nu \to \infty$ (Rasmussen and Williams, 2006, §4.2). Between them, these kernels cover the full range of smoothness $0 < \nu \le \infty$. Both kernels satisfy Assumptions 1–4 for the $\nu$ given; $\alpha = 0$ except for the Matérn kernel with $\nu \in \mathbb{N}$, where $\alpha = \frac12$ (Abramowitz and Stegun, 1965, §9.6).

Having chosen our prior distribution, we may now derive its posterior. We find
\[ f(x) \mid z_1, \dots, z_n \sim N\bigl(\hat f_n(x; \theta),\ \sigma^2 s_n^2(x; \theta)\bigr), \]
where
\[ \hat\mu_n(\theta) := \frac{1^T V^{-1} z}{1^T V^{-1} 1}, \qquad (4) \]
\[ \hat f_n(x; \theta) := \hat\mu_n + v^T V^{-1}(z - \hat\mu_n 1), \qquad (5) \]
and
\[ s_n^2(x; \theta) := 1 - v^T V^{-1} v + \frac{(1 - 1^T V^{-1} v)^2}{1^T V^{-1} 1}, \qquad (6) \]
for $z = (z_i)_{i=1}^n$, $V = (K_\theta(x_i - x_j))_{i,j=1}^n$, and $v = (K_\theta(x - x_i))_{i=1}^n$ (Santner et al., 2003, §4.1.3). Equivalently, these expressions are the best linear unbiased predictor of $f(x)$ and its variance, as given in Jones et al. (1998, §2). We will also need the reduced sum of squares,
\[ \hat R_n^2(\theta) := (z - \hat\mu_n 1)^T V^{-1} (z - \hat\mu_n 1). \qquad (7) \]

2.3 Expected Improvement Strategies

Under our assumptions on $\pi$, we may now derive an analytic form for (1), as in Jones et al. (1998, §4.1). We obtain
\[ EI_n(x_{n+1}; \pi) = \rho\bigl(z^*_n - \hat f_n(x_{n+1}; \theta),\ \sigma s_n(x_{n+1}; \theta)\bigr), \qquad (8) \]
where
\[ \rho(y, s) := \begin{cases} y\Phi(y/s) + s\varphi(y/s), & s > 0, \\ \max(y, 0), & s = 0, \end{cases} \qquad (9) \]
and $\Phi$ and $\varphi$ are the standard normal distribution and density functions respectively.

For a prior $\pi$ as above, expected improvement chooses $x_{n+1}$ to maximize (8), but this does not fully define the strategy. Firstly, we must describe how the strategy breaks ties, when more than one $x \in X$ maximizes $EI_n$. In general, this will not affect the behaviour of the algorithm, so we allow any choice of $x_{n+1}$ maximizing (8).

Secondly, we must say how to choose $x_1$, as the above expressions are undefined when $n = 0$. In fact, Jones et al. (1998, §4.2) find that expected improvement can be unreliable given few data points, and recommend that several initial design points be chosen in a random quasi-uniform arrangement. We will therefore assume that until some fixed time $k$, points $x_1, \dots, x_k$ are instead chosen by some (potentially random) method independent of $f$.
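The posterior quantities (4)–(7) and the closed form (8)–(9) translate directly into code. The sketch below is our illustration, not the paper's; it assumes the Gaussian kernel (redefined here for self-containment), and for simplicity solves the linear systems directly rather than caching a factorization of $V$:

```python
import numpy as np
from scipy.stats import norm

def gaussian_kernel(t):
    """Underlying kernel K(t) = exp(-||t||^2 / 2), so that K(0) = 1."""
    return np.exp(-0.5 * np.sum(np.square(t), axis=-1))

def posterior(X, z, x, theta):
    """Posterior quantities (4)-(7) at a query point x, given design X, data z."""
    n = len(z)
    V = gaussian_kernel((X[:, None, :] - X[None, :, :]) / theta)
    v = gaussian_kernel((x - X) / theta)
    ones = np.ones(n)
    Vi_1, Vi_v, Vi_z = (np.linalg.solve(V, w) for w in (ones, v, z))
    mu_hat = (ones @ Vi_z) / (ones @ Vi_1)                            # (4)
    r = z - mu_hat
    Vi_r = np.linalg.solve(V, r)
    f_hat = mu_hat + v @ Vi_r                                         # (5)
    s2 = 1.0 - v @ Vi_v + (1.0 - ones @ Vi_v) ** 2 / (ones @ Vi_1)    # (6)
    R2_hat = r @ Vi_r                                                 # (7)
    return mu_hat, f_hat, max(s2, 0.0), R2_hat

def expected_improvement(X, z, x, theta, sigma):
    """Closed-form expected improvement (8), via rho of (9)."""
    _, f_hat, s2, _ = posterior(X, z, x, theta)
    y, s = np.min(z) - f_hat, sigma * np.sqrt(s2)
    if s == 0.0:
        return max(y, 0.0)                                # rho(y, 0)
    return y * norm.cdf(y / s) + s * norm.pdf(y / s)      # rho(y, s), s > 0
```

The clipping of $s_n^2$ at zero only guards against rounding error; analytically (6) is non-negative.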
We thus obtain the following strategy.

Definition 1. An $EI(\pi)$ strategy chooses:
(i) initial design points $x_1, \dots, x_k$ independently of $f$; and
(ii) further design points $x_{n+1}$ ($n \ge k$) from the maximizers of (8).

So far, we have not considered the choice of parameters $\sigma$ and $\theta$. While these can be fixed in advance, doing so requires us to specify characteristic scales of the unknown function $f$, and causes expected improvement to behave differently on a rescaling of the same function. We would prefer an algorithm which could adapt automatically to the scale of $f$.

A natural approach is to take maximum likelihood estimates of the parameters, as recommended by Jones et al. (1998, §2). Given $\theta$, the MLE is $\hat\sigma_n^2 = \hat R_n^2(\theta)/n$; for full generality, we will allow any choice $\hat\sigma_n^2 = c_n \hat R_n^2(\theta)$, where $c_n = o(1/\log n)$. Estimates of $\theta$, however, must be obtained by numerical optimization. As $\theta$ can vary widely in scale, this optimization is best performed over $\log\theta$; as the likelihood surface is typically multimodal, this requires the use of a global optimizer. We must therefore place (implicit or explicit) bounds on the allowed values of $\log\theta$.

We have thus described the following strategy.

Definition 2. Let $\hat\pi_n$ be a sequence of priors, with parameters $\hat\sigma_n$, $\hat\theta_n$ satisfying:
(i) $\hat\sigma_n^2 = c_n \hat R_n^2(\hat\theta_n)$ for constants $c_n > 0$, $c_n = o(1/\log n)$; and
(ii) $\theta^L \le \hat\theta_n \le \theta^U$ for constants $\theta^L, \theta^U \in \mathbb{R}^d_+$.
An $EI(\hat\pi)$ strategy satisfies Definition 1, replacing $\pi$ with $\hat\pi_n$ in (8).
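As an illustration of Definition 2 (ours, hedged: the concentrated likelihood below follows the standard derivation in Jones et al., 1998, §2, and a grid search over $\log\theta$ stands in for a global optimizer), with a single common length-scale and the MLE choice $c_n = 1/n$:

```python
import numpy as np

def _V(X, theta):
    """Correlation matrix V = (K_theta(x_i - x_j)), Gaussian kernel assumed."""
    return np.exp(-0.5 * np.sum(np.square((X[:, None, :] - X[None, :, :]) / theta), axis=-1))

def r2_hat(X, z, theta):
    """Reduced sum of squares (7) at length-scale theta."""
    V, ones = _V(X, theta), np.ones(len(z))
    mu_hat = (ones @ np.linalg.solve(V, z)) / (ones @ np.linalg.solve(V, ones))
    r = z - mu_hat
    return r @ np.linalg.solve(V, r)

def neg_profile_loglik(X, z, theta):
    """(n/2) log(R2_hat(theta)/n) + (1/2) log det V_theta, up to constants:
    the negative log-likelihood with mu and sigma^2 profiled out."""
    n = len(z)
    _, logdet = np.linalg.slogdet(_V(X, theta))
    return 0.5 * n * np.log(r2_hat(X, z, theta) / n) + 0.5 * logdet

def mle_parameters(X, z, theta_L, theta_U, grid_size=50):
    """Definition 2 with a common length-scale: theta_hat from a grid search
    over log theta within [theta_L, theta_U]; sigma2_hat with c_n = 1/n."""
    grid = np.exp(np.linspace(np.log(theta_L), np.log(theta_U), grid_size))
    theta_hat = min(grid, key=lambda t: neg_profile_loglik(X, z, t))
    return theta_hat, r2_hat(X, z, theta_hat) / len(z)
```

The bounds $[\theta^L, \theta^U]$ enter only through the search range, exactly as condition (ii) requires.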
3 Convergence Rates

To discuss convergence, we must first choose a smoothness class for the unknown function $f$. Each kernel $K_\theta$ is associated with a space of functions $H_\theta(X)$, its reproducing-kernel Hilbert space (RKHS) or native space. $H_\theta(X)$ contains all functions $X \to \mathbb{R}$ as smooth as a posterior mean of $f$, and is the natural space to study convergence of expected-improvement algorithms, allowing a tractable analysis of their asymptotic behaviour.

3.1 Reproducing-Kernel Hilbert Spaces

Given a symmetric positive-definite kernel $K$ on $\mathbb{R}^d$, set $k_x(t) = K(t - x)$. For $S \subseteq \mathbb{R}^d$, let $E(S)$ be the space of functions $S \to \mathbb{R}$ spanned by the $k_x$, for $x \in S$. Furnish $E(S)$ with the inner product defined by
\[ \langle k_x, k_y \rangle := K(x - y). \]
The completion of $E(S)$ under this inner product is the reproducing-kernel Hilbert space $H(S)$ of $K$ on $S$. The members $f \in H(S)$ are abstract objects, but we can identify them with functions $f \colon S \to \mathbb{R}$ through the reproducing property, $f(x) = \langle f, k_x \rangle$, which holds for all $f \in E(S)$. See Aronszajn (1950), Berlinet and Thomas-Agnan (2004), Wendland (2005) and van der Vaart and van Zanten (2008).

We will find it convenient also to use an alternative characterization of $H(S)$. We begin by describing $H(\mathbb{R}^d)$ in terms of Fourier transforms. Let $\hat f$ denote the Fourier transform of a function $f \in L^2$. The following result is stated in Parzen (1963, §2), and proved in Wendland (2005, §10.2); we give a short proof in Appendix A.

Lemma 1. $H(\mathbb{R}^d)$ is the space of real continuous $f \in L^2(\mathbb{R}^d)$ whose norm
\[ \|f\|^2_{H(\mathbb{R}^d)} := \int \frac{|\hat f(\xi)|^2}{\hat K(\xi)}\,d\xi \]
is finite, taking $0/0 = 0$.

We may now describe $H(S)$ in terms of $H(\mathbb{R}^d)$.

Lemma 2 (Aronszajn, 1950, §1.5). $H(S)$ is the space of functions $f = g|_S$ for some $g \in H(\mathbb{R}^d)$, with norm
\[ \|f\|_{H(S)} := \inf_{g|_S = f} \|g\|_{H(\mathbb{R}^d)}, \]
and there is a unique $g$ minimizing this expression.

These spaces are in fact closely related to the Sobolev Hilbert spaces of functional analysis. Say a domain $D \subseteq \mathbb{R}^d$ is Lipschitz if its boundary is locally the graph of a Lipschitz function (see Tartar, 2007, §12, for a precise definition). For such a domain $D$, the Sobolev Hilbert space $H^s(D)$ is the space of functions $f \colon D \to \mathbb{R}$, given by the restriction of some $g \colon \mathbb{R}^d \to \mathbb{R}$, whose norm
\[ \|f\|^2_{H^s(D)} := \inf_{g|_D = f} \int |\hat g(\xi)|^2 (1 + \|\xi\|^2)^s \, d\xi \]
is finite. Thus, for the kernel $K$ with Fourier transform $\hat K(\xi) = (1 + \|\xi\|^2)^{-s}$, this is just the RKHS $H(D)$. More generally, if $K$ satisfies our assumptions with $\nu < \infty$, these spaces are equivalent in the sense of normed spaces: they contain the same functions, and have norms $\|\cdot\|_1$, $\|\cdot\|_2$ satisfying
\[ C\|f\|_1 \le \|f\|_2 \le C'\|f\|_1, \]
for constants $0 < C \le C'$.

Lemma 3. Let $H_\theta(S)$ denote the RKHS of $K_\theta$ on $S$, and $D \subseteq \mathbb{R}^d$ be a Lipschitz domain.
(i) If $\nu < \infty$, $H_\theta(\bar D)$ is equivalent to the Sobolev Hilbert space $H^{\nu+d/2}(D)$.
(ii) If $\nu = \infty$, $H_\theta(\bar D)$ is continuously embedded in $H^s(D)$ for all $s$.

Thus if $\nu < \infty$, and $X$ is, say, a product of intervals $\prod_{i=1}^d [a_i, b_i]$, the RKHS $H_\theta(X)$ is equivalent to the Sobolev Hilbert space $H^{\nu+d/2}(\prod_{i=1}^d (a_i, b_i))$, identifying each function in that space with its unique continuous extension to $X$.

3.2 Fixed Parameters

We are now ready to state our main results. Let $X \subset \mathbb{R}^d$ be compact with non-empty interior. For a function $f \colon X \to \mathbb{R}$, let $P^u_f$ and $E^u_f$ denote probability and expectation when minimizing the fixed function $f$ with strategy $u$. (Note that while $f$ is fixed, $u$ may be random, so its performance is still probabilistic in nature.) We define the loss suffered over the ball $B_R$ in $H_\theta(X)$ after $n$ steps by a strategy $u$,
\[ L_n(u, H_\theta(X), R) := \sup_{\|f\|_{H_\theta(X)} \le R} E^u_f[f(x^*_n) - \min f]. \]
We will say that $u$ converges on the optimum at rate $r_n$ if $L_n(u, H_\theta(X), R) = O(r_n)$ for all $R > 0$. Note that we do not allow $u$ to vary with $R$; the strategy must achieve this rate without prior knowledge of $\|f\|_{H_\theta(X)}$.

We begin by showing that the minimax rate of convergence is $n^{-\nu/d}$.

Theorem 1. If $\nu < \infty$, then for any $\theta \in \mathbb{R}^d_+$, $R > 0$,
\[ \inf_u L_n(u, H_\theta(X), R) = \Theta(n^{-\nu/d}), \]
and this rate can be achieved by a strategy $u$ not depending on $R$.

The upper bound is provided by a naive strategy as in the introduction: we fix a quasi-uniform sequence $x_n$ in advance, and take $x^*_n$ to minimize a radial basis function interpolant of the data. As remarked previously, however, this naive strategy is not very satisfying; in practice it will be outperformed by any good strategy varying with the data. We may thus ask whether more sophisticated strategies, with better practical performance, can still provide good worst-case bounds.

One such strategy is the $EI(\pi)$ strategy of Definition 1. We can show this strategy converges at least at rate $n^{-(\nu\wedge 1)/d}$, up to log factors.

Theorem 2. Let $\pi$ be a prior with length-scales $\theta \in \mathbb{R}^d_+$. For any $R > 0$,
\[ L_n(EI(\pi), H_\theta(X), R) = \begin{cases} O(n^{-\nu/d}(\log n)^\alpha), & \nu \le 1, \\ O(n^{-1/d}), & \nu > 1. \end{cases} \]
For $\nu \le 1$, these rates are near-optimal. For $\nu > 1$, we are faced with a more difficult problem; we discuss this in more detail in Section 3.4.

3.3 Estimated Parameters

First, we consider the effect of the prior parameters on $EI(\pi)$. While the previous result gives a convergence rate for any fixed choice of parameters, the constant in that rate will depend on the parameters chosen; to choose well, we must somehow estimate these parameters from the data. The $EI(\hat\pi)$ strategy, given by Definition 2, uses maximum likelihood estimates for this purpose. We can show, however, that this may cause the strategy to never converge.

Theorem 3. Suppose $\nu < \infty$. Given $\theta \in \mathbb{R}^d_+$, $R > 0$, $\varepsilon > 0$, there exists $f \in H_\theta(X)$ satisfying $\|f\|_{H_\theta(X)} \le R$, and for some fixed $\delta > 0$,
\[ P^{EI(\hat\pi)}_f\Bigl(\inf_n f(x^*_n) - \min f \ge \delta\Bigr) > 1 - \varepsilon. \]

The counterexamples constructed in the proof of the theorem may be difficult to minimize, but they are not badly-behaved (Figure 1). A good optimization strategy should be able to minimize such functions, and we must ask why expected improvement fails.

[Figure 1: a counterexample from Theorem 3; the plot shows $f(x)$ against $x$.]

We can understand the issue by considering the constant in Theorem 2. Define
\[ \tau(x) := x\Phi(x) + \varphi(x). \]
From the proof of Theorem 2, the dominant term in the convergence rate has constant
\[ C(R + \sigma)\,\frac{\tau(R/\sigma)}{\tau(-R/\sigma)}, \qquad (10) \]
for $C > 0$ not depending on $R$ or $\sigma$. In Appendix A, we will prove the following result.

Corollary 1. $\hat R_n(\theta)$ is non-decreasing in $n$, and bounded above by $\|f\|_{H_\theta(X)}$.

Hence for fixed $\theta$, the estimate $\hat\sigma_n^2 = \hat R_n^2(\theta)/n \le R^2/n$, and thus $R/\hat\sigma_n \ge n^{1/2}$. Inserting this choice into (10) gives a constant growing exponentially in $n$, destroying our convergence rate.

To resolve the issue, we will instead try to pick $\sigma$ to minimize (10). The term $R + \sigma$ is increasing in $\sigma$, and the term $\tau(R/\sigma)/\tau(-R/\sigma)$ is decreasing in $\sigma$; we may balance the terms by taking $\sigma = R$. The constant is then proportional to $R$, which we may minimize by taking $R = \|f\|_{H_\theta(X)}$. In practice, we will not know $\|f\|_{H_\theta(X)}$ in advance, so we must estimate it from the data; from Corollary 1, a convenient estimate is $\hat R_n(\theta)$.

Suppose, then, that we make some bounded estimate $\hat\theta_n$ of $\theta$, and set $\hat\sigma_n^2 = \hat R_n^2(\hat\theta_n)$. As Theorem 3 holds for any $\hat\sigma_n^2$ of faster than logarithmic decay, such a choice is necessary to ensure convergence. (We may also choose $\theta$ to minimize (10); we might then pick $\hat\theta_n$ minimizing $\hat R_n(\theta)\prod_{i=1}^d \theta_i^{-\nu/d}$, but our assumptions on $\hat\theta_n$ are weak enough that we need not consider this further.)

If we believe our Gaussian-process model, this estimate $\hat\sigma_n$ is certainly unusual. We should, however, take care before placing too much faith in the model. The function in Figure 1 is a reasonable function to optimize, but as a Gaussian process it is highly atypical: there are intervals on which the function is constant, an event which in our model occurs with probability zero. If we want our algorithm to succeed on more general classes of functions, we will need to choose our parameter estimates appropriately.
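A quick numeric illustration (ours) of the constant (10) discussed above, with $C = R = 1$: the maximum-likelihood scaling $\sigma = R/\sqrt{n}$ blows the constant up exponentially, while the balanced choice $\sigma = R$, adopted in Definition 3 below, keeps it fixed.

```python
import numpy as np
from scipy.stats import norm

def tau(x):
    """tau(x) = x Phi(x) + phi(x)."""
    return x * norm.cdf(x) + norm.pdf(x)

def rate_constant(R, sigma, C=1.0):
    """The dominant constant (10): C (R + sigma) tau(R/sigma) / tau(-R/sigma)."""
    return C * (R + sigma) * tau(R / sigma) / tau(-R / sigma)

R = 1.0
for n in [10, 100, 1000]:
    print(f"n = {n:4d}: sigma = R/sqrt(n) gives {rate_constant(R, R / np.sqrt(n)):.3e}; "
          f"sigma = R gives {rate_constant(R, R):.3e}")
```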
To obtain good rates, we must add a further condition to our strategy. If $z_1 = \dots = z_n$, then $EI_n(\cdot\,; \hat\pi_n)$ is identically zero, and all choices of $x_{n+1}$ are equally valid. To ensure we fully explore $f$, we will therefore require that when our strategy is applied to a constant function $f(x) = c$, it produces a sequence $x_n$ dense in $X$. (This can be achieved, for example, by choosing $x_{n+1}$ uniformly at random from $X$ when $z_1 = \dots = z_n$.) We have thus described the following strategy.

Definition 3. An $EI(\tilde\pi)$ strategy satisfies Definition 2, except:
(i) we instead set $\hat\sigma_n^2 = \hat R_n^2(\hat\theta_n)$; and
(ii) we require the choice of $x_{n+1}$ maximizing (8) to be such that, if $f$ is constant, the design points are almost surely dense in $X$.

We cannot now prove a convergence result uniform over balls in $H_\theta(X)$, as the rate of convergence depends on the ratio $R/\hat R_n$, which is unbounded. (Indeed, any estimator of $\|f\|_{H_\theta(X)}$ must sometimes perform poorly: $f$ can appear from the data to have arbitrarily small norm, while in fact having a spike somewhere we have not yet observed.) We can, however, provide the same convergence rates as in Theorem 2, in a slightly weaker sense.

Theorem 4. For any $f \in H_{\theta^U}(X)$, under $P^{EI(\tilde\pi)}_f$,
\[ f(x^*_n) - \min f = \begin{cases} O_p(n^{-\nu/d}(\log n)^\alpha), & \nu \le 1, \\ O_p(n^{-1/d}), & \nu > 1. \end{cases} \]

3.4 Near-Optimal Rates

So far, our rates have been near-optimal only for $\nu \le 1$. To obtain good rates for $\nu > 1$, standard results on the performance of Gaussian-process interpolation (Narcowich et al., 2003, §6) require the design points $x_i$ to be quasi-uniform in a region of interest. It is unclear whether this occurs naturally under expected improvement, but there are many ways we can modify the algorithm to ensure it.

Perhaps the simplest, and most well-known, is an $\varepsilon$-greedy strategy (Sutton and Barto, 1998, §2.2). In such a strategy, at each step with probability $1 - \varepsilon$ we make a decision to maximize some greedy criterion; with probability $\varepsilon$ we make a decision completely at random. This random choice ensures that the short-term nature of the greedy criterion does not overshadow our long-term goal.

The parameter $\varepsilon$ controls the trade-off between global and local search: a good choice of $\varepsilon$ will be small enough not to interfere with the expected-improvement algorithm, but large enough to prevent it from getting stuck in a local minimum. Sutton and Barto (1998, §2.2) consider the values $\varepsilon = 0.1$ and $\varepsilon = 0.01$, but in practical work $\varepsilon$ should of course be calibrated to a typical problem set. We therefore define the following strategies; a sketch of a single step appears after Theorem 5 below.

Definition 4. Let $\cdot$ denote $\pi$, $\hat\pi$ or $\tilde\pi$. For $0 < \varepsilon < 1$, an $EI(\cdot, \varepsilon)$ strategy:
(i) chooses initial design points $x_1, \dots, x_k$ independently of $f$;
(ii) with probability $1 - \varepsilon$, chooses design point $x_{n+1}$ ($n \ge k$) as in $EI(\cdot)$; or
(iii) with probability $\varepsilon$, chooses $x_{n+1}$ ($n \ge k$) uniformly at random from $X$.

We can show that these strategies achieve near-optimal rates of convergence for all $\nu < \infty$.

Theorem 5. Let $EI(\cdot, \varepsilon)$ be one of the strategies in Definition 4. If $\nu < \infty$, then for any $R > 0$,
\[ L_n(EI(\cdot, \varepsilon), H_{\theta^U}(X), R) = O\bigl((n/\log n)^{-\nu/d}(\log n)^\alpha\bigr), \]
while if $\nu = \infty$, the statement holds for all $\nu < \infty$.
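A minimal sketch (ours) of one step of an $EI(\cdot, \varepsilon)$ strategy, assuming the `expected_improvement` helper from the Section 2.3 sketch, and a finite candidate grid standing in for maximization over $X$:

```python
import numpy as np

def ei_epsilon_step(X, z, candidates, theta, sigma, eps, rng):
    """One step of EI(., eps) (Definition 4): with probability eps, pick
    x_{n+1} uniformly at random; otherwise maximize EI (8) over candidates."""
    if rng.random() < eps:
        return candidates[rng.integers(len(candidates))]
    ei = [expected_improvement(X, z, x, theta, sigma) for x in candidates]
    return candidates[int(np.argmax(ei))]
```

The random branch is what guarantees the quasi-uniform design points needed for the interpolation bounds of Narcowich et al. (2003) to apply.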
Note that unlike a typical $\varepsilon$-greedy algorithm, we do not rely on random choice to obtain global convergence: as above, the $EI(\pi)$ and $EI(\tilde\pi)$ strategies are already globally convergent. Instead, we use random choice simply to improve upon the worst-case rate. Note also that the result does not in general hold when $\varepsilon = 1$; to obtain good rates, we must combine global search with inference about $f$.

4 Conclusions

We have shown that expected improvement can converge near-optimally, but a naive implementation may not converge at all. We thus echo Diaconis and Freedman (1986) in stating that, for infinite-dimensional problems, Bayesian methods are not always guaranteed to find the right answer; such guarantees can only be provided by considering the problem at hand.

We might ask, however, if our framework can also be improved. Our upper bounds on convergence were established using naive algorithms, which in practice would prove inefficient. If a sophisticated algorithm fails where a naive one succeeds, then the sophisticated algorithm is certainly at fault; we might, however, prefer methods of evaluation which do not consider naive algorithms so successful.

Vazquez and Bect (2010) and Grunewalder et al. (2010) consider a more Bayesian formulation of the problem, where the unknown function $f$ is distributed according to the prior $\pi$, but this approach can prove restrictive: as we saw in Section 3.3, placing too much faith in the prior may exclude functions of interest. Further, Grunewalder et al. find the same issues are present also within the Bayesian framework.

A more interesting approach is given by the continuum-armed-bandit problem (Srinivas et al., 2010, and references therein). Here the goal is to minimize the cumulative regret,
\[ R_n := \sum_{i=1}^n \bigl(f(x_i) - \min f\bigr), \]
in general observing the function $f$ under noise. Algorithms controlling the cumulative regret at rate $r_n$ also solve the optimization problem, at rate $r_n/n$ (Bubeck et al., 2009, §3). The naive algorithms above, however, have poor cumulative regret. We might, then, consider the cumulative regret to be a better measure of performance, but this approach too has limitations. Firstly, the cumulative regret is necessarily increasing, so cannot establish rates of optimization faster than $n^{-1}$. (This is not an issue under noise, where typically $r_n = \Omega(n^{1/2})$; see Kleinberg and Slivkins, 2010.) Secondly, if our goal is optimization, then minimizing the regret, a cost we do not incur, may obscure the problem at hand.

Bubeck et al. (2010) study this problem with the additional assumption that $f$ has finitely many minima, and is, say, quadratic in a neighbourhood of each. This assumption may suffice in practice, and allows the authors to obtain impressive rates of convergence. For optimization, however, a further weakness is that these rates hold only once the algorithm has found a basin of attraction; they thus measure local, rather than global, performance. It may be that convergence rates alone are not sufficient to capture the performance of a global optimization algorithm, and the time taken to find a basin of attraction is more relevant. In any case, the choice of an appropriate framework to measure performance in global optimization merits further study.

Finally, we should also ask how to choose the smoothness parameter $\nu$ (or the equivalent parameter in similar algorithms).
Van der Vaart and van Zanten (2009) show that Bayesian Gaussian-process models can, in some contexts, automatically adapt to the smoothness of an unknown function $f$. Their technique requires, however, that the estimated length-scales $\hat\theta_n$ tend to 0, posing both practical and theoretical challenges. The question of how best to optimize functions of unknown smoothness remains open.

Acknowledgements

We would like to thank the referees, as well as Richard Nickl and Steffen Grunewalder, for their valuable comments and suggestions.

A Proofs

We now prove the results in Section 3.

A.1 Reproducing-Kernel Hilbert Spaces

Proof of Lemma 1. Let $V$ be the space of functions described, and $W$ be the closed real subspace of Hermitian functions in $L^2(\mathbb{R}^d, \hat K^{-1})$. We will show $f \mapsto \hat f$ is an isomorphism $V \to W$, so we may equivalently work with $W$.

Given $\hat f \in W$, by Cauchy-Schwarz and Bochner's theorem,
\[ \int |\hat f| \le \Bigl(\int \hat K\Bigr)^{1/2} \Bigl(\int |\hat f|^2/\hat K\Bigr)^{1/2} < \infty, \]
and as $\|\hat K\|_\infty \le \|K\|_1$,
\[ \int |\hat f|^2 \le \|\hat K\|_\infty \int |\hat f|^2/\hat K < \infty, \]
so $\hat f \in L^1 \cap L^2$. $\hat f$ is thus the Fourier transform of a real continuous $f \in L^2$, satisfying the Fourier inversion formula everywhere. $f \mapsto \hat f$ is hence an isomorphism $V \to W$.

It remains to show that $V = H(\mathbb{R}^d)$. $W$ is complete, so $V$ is. Further, $E(\mathbb{R}^d) \subset V$, and by Fourier inversion each $f \in V$ satisfies the reproducing property,
\[ f(x) = \int e^{2\pi i \langle x, \xi\rangle} \hat f(\xi)\,d\xi = \int \frac{\hat f(\xi)\,\overline{\hat k_x(\xi)}}{\hat K(\xi)}\,d\xi = \langle f, k_x\rangle, \]
so $H(\mathbb{R}^d)$ is a closed subspace of $V$. Given $f \in H(\mathbb{R}^d)^\perp$, $f(x) = \langle f, k_x\rangle = 0$ for all $x$, so $f = 0$. Thus $V = H(\mathbb{R}^d)$.

Proof of Lemma 3. By Lemma 1, the norm on $H_\theta(\mathbb{R}^d)$ is
\[ \|f\|^2_{H_\theta(\mathbb{R}^d)} = \int \frac{|\hat f(\xi)|^2}{\hat K_\theta(\xi)}\,d\xi, \]
and $K_\theta$ has Fourier transform
\[ \hat K_\theta(\xi) = \hat K(\theta_1\xi_1, \dots, \theta_d\xi_d) \prod_{i=1}^d \theta_i. \]
If $\nu < \infty$, by assumption $\hat K(\xi) = \hat k(\|\xi\|)$, for a finite non-increasing function $\hat k$ satisfying $\hat k(\|\xi\|) = \Theta(\|\xi\|^{-2\nu-d})$ as $\xi \to \infty$. Hence
\[ C(1 + \|\xi\|^2)^{-(\nu+d/2)} \le \hat K_\theta(\xi) \le C'(1 + \|\xi\|^2)^{-(\nu+d/2)}, \]
for constants $C, C' > 0$, and we obtain that $H_\theta(\mathbb{R}^d)$ is equivalent to the Sobolev space $H^{\nu+d/2}(\mathbb{R}^d)$. From Lemma 2, $H_\theta(D)$ is given by the restriction of functions in $H_\theta(\mathbb{R}^d)$; as $D$ is Lipschitz, the same is true of $H^{\nu+d/2}$. $H_\theta(D)$ is thus equivalent to $H^{\nu+d/2}(D)$. Finally, functions in $H_\theta(\bar D)$ are continuous, so uniquely identified by their restriction to $D$, and
\[ H_\theta(\bar D) \simeq H_\theta(D) \simeq H^{\nu+d/2}(D). \]
If $\nu = \infty$, by a similar argument $H_\theta(\bar D)$ is continuously embedded in all $H^s(D)$.

From Lemma 1, we can derive results on the behaviour of $\|f\|_{H_\theta(S)}$ as $\theta$ varies. For small $\theta$, we obtain the following result.

Lemma 4. If $f \in H_\theta(S)$, then $f \in H_{\theta'}(S)$ for all $0 < \theta' \le \theta$, and
\[ \|f\|^2_{H_{\theta'}(S)} \le \Bigl(\prod_{i=1}^d \theta_i/\theta'_i\Bigr) \|f\|^2_{H_\theta(S)}. \]

Proof. Let $C = \prod_{i=1}^d (\theta'_i/\theta_i)$. As $\hat K$ is isotropic and radially non-increasing,
\[ \hat K_{\theta'}(\xi) = C\,\hat K_\theta\bigl((\theta'_1/\theta_1)\xi_1, \dots, (\theta'_d/\theta_d)\xi_d\bigr) \ge C\,\hat K_\theta(\xi). \]
Given $f \in H_\theta(S)$, let $g \in H_\theta(\mathbb{R}^d)$ be its minimum norm extension, as in Lemma 2. By Lemma 1,
\[ \|f\|^2_{H_{\theta'}(S)} \le \|g\|^2_{H_{\theta'}(\mathbb{R}^d)} = \int \frac{|\hat g|^2}{\hat K_{\theta'}} \le \int \frac{|\hat g|^2}{C\hat K_\theta} = C^{-1}\|f\|^2_{H_\theta(S)}. \]

Likewise, for large $\theta$, we obtain the following.
Lemma 5. If $\nu < \infty$ and $f \in H_\theta(S)$, then $f \in H_{t\theta}(S)$ for $t \ge 1$, and
\[ \|f\|^2_{H_{t\theta}(S)} \le C'' t^{2\nu} \|f\|^2_{H_\theta(S)}, \]
for a $C'' > 0$ depending only on $K$ and $\theta$.

Proof. As in the proof of Lemma 3, we have constants $C, C' > 0$ such that
\[ C(1 + \|\xi\|^2)^{-(\nu+d/2)} \le \hat K_\theta(\xi) \le C'(1 + \|\xi\|^2)^{-(\nu+d/2)}. \]
Thus for $t \ge 1$,
\[ \hat K_{t\theta}(\xi) = t^d \hat K_\theta(t\xi) \ge C t^d (1 + t^2\|\xi\|^2)^{-(\nu+d/2)} \ge C t^{-2\nu}(1 + \|\xi\|^2)^{-(\nu+d/2)} \ge C C'^{-1} t^{-2\nu} \hat K_\theta(\xi), \]
and we may argue as in the previous lemma.

We can also describe the posterior distribution of $f$ in terms of $H_\theta(S)$; as a consequence, we may deduce Corollary 1.

Lemma 6. Suppose $f(x) = \mu + g(x)$, $g \in H_\theta(S)$.
(i) $\hat f_n(x; \theta) = \hat\mu_n + \hat g_n(x)$ solves the optimization problem
\[ \text{minimize } \|\hat g\|^2_{H_\theta(S)}, \quad \text{subject to } \hat\mu + \hat g(x_i) = z_i, \ 1 \le i \le n, \]
with minimum value $\hat R_n^2(\theta)$.
(ii) The prediction error satisfies
\[ |f(x) - \hat f_n(x; \theta)| \le s_n(x; \theta)\,\|g\|_{H_\theta(S)}, \]
with equality for some $g \in H_\theta(S)$.

Proof. (i) Let $W = \mathrm{span}(k_{x_1}, \dots, k_{x_n})$, and write $\hat g = \hat g_\parallel + \hat g_\perp$ for $\hat g_\parallel \in W$, $\hat g_\perp \in W^\perp$. $\hat g_\perp(x_i) = \langle \hat g_\perp, k_{x_i}\rangle = 0$, so $\hat g_\perp$ affects the optimization only through $\|\hat g\|$. The minimal $\hat g$ thus has $\hat g_\perp = 0$, so $\hat g = \sum_{i=1}^n \lambda_i k_{x_i}$. The problem then becomes
\[ \text{minimize } \lambda^T V \lambda, \quad \text{subject to } \hat\mu 1 + V\lambda = z. \]
The solution is given by (4) and (5), with value (7).

(ii) By symmetry, the prediction error does not depend on $\mu$, so we may take $\mu = 0$. Then
\[ f(x) - \hat f_n(x; \theta) = g(x) - (\hat\mu_n + \hat g_n(x)) = \langle g, e_{n,x}\rangle, \]
for $e_{n,x} = k_x - \sum_{i=1}^n \lambda_i k_{x_i}$, and
\[ \lambda = \frac{V^{-1}1}{1^T V^{-1}1} + \Bigl(I - \frac{V^{-1}11^T}{1^T V^{-1}1}\Bigr) V^{-1} v. \]
Now, $\|e_{n,x}\|^2_{H_\theta(S)} = s_n^2(x; \theta)$, as given by (6); this is a consequence of Loève's isometry, but is easily verified algebraically. The result then follows by Cauchy-Schwarz.

A.2 Fixed Parameters

Proof of Theorem 1. We first establish the lower bound. Suppose we have $2n$ functions $\psi_m$ with disjoint supports. We will argue that, given $n$ observations, we cannot distinguish between all the $\psi_m$, and thus cannot accurately pick a minimum $x^*_n$.

To begin with, assume $X = [0,1]^d$. Let $\psi \colon \mathbb{R}^d \to [-1, 0]$ be a $C^\infty$ function, supported inside $X$ and with minimum $-1$. By Lemma 3, $\psi \in H_\theta(\mathbb{R}^d)$. Fix $k \in \mathbb{N}$, and set $n = (2k)^d/2$. For vectors $m \in \{0, \dots, 2k-1\}^d$, construct functions $\psi_m(x) = C(2k)^{-\nu}\psi(2kx - m)$, where $C > 0$ is to be determined. $\psi_m$ is given by a translation and scaling of $\psi$, so by Lemmas 1, 2 and 5, for some $C' > 0$,
\[ \|\psi_m\|_{H_\theta(X)} \le \|\psi_m\|_{H_\theta(\mathbb{R}^d)} = C(2k)^{-\nu}\|\psi\|_{H_{2k\theta}(\mathbb{R}^d)} \le CC'\|\psi\|_{H_\theta(\mathbb{R}^d)}. \]
Set $C = R/(C'\|\psi\|_{H_\theta(\mathbb{R}^d)})$, so that $\|\psi_m\|_{H_\theta(X)} \le R$ for all $m$ and $k$.

Suppose $f = 0$, and let $x_n$ and $x^*_n$ be chosen by any valid strategy $u$. Set $\chi = \{x_1, \dots, x_{n-1}, x^*_{n-1}\}$, and let $A_m$ be the event that $\psi_m(x) = 0$ for all $x \in \chi$. There are $n$ points in $\chi$, and the $2n$ functions $\psi_m$ have disjoint support, so $\sum_m \mathbb{1}(A_m) \ge n$. Thus
\[ \sum_m P^u_0(A_m) = E^u_0\Bigl[\sum_m \mathbb{1}(A_m)\Bigr] \ge n, \]
and we have some fixed $m$, depending only on $u$, for which $P^u_0(A_m) \ge \frac12$. On the event $A_m$, $\psi_m(x^*_{n-1}) - \min \psi_m = C(2k)^{-\nu}$, but on that event, $u$ cannot distinguish between $0$ and $\psi_m$ before time $n$, so
\[ C^{-1}(2k)^\nu\,E^u_{\psi_m}[f(x^*_{n-1}) - \min f] \ge P^u_{\psi_m}(A_m) = P^u_0(A_m) \ge \tfrac12. \]
As the minimax loss is non-increasing in $n$, for $(2(k-1))^d/2 \le n < (2k)^d/2$ we conclude
\[ \inf_u L_n(u, H_\theta(X), R) \ge \inf_u L_{(2k)^d/2-1}(u, H_\theta(X), R) \ge \inf_u \sup_m E^u_{\psi_m}\bigl[f\bigl(x^*_{(2k)^d/2-1}\bigr) - \min f\bigr] \ge \tfrac12 C(2k)^{-\nu} = \Omega(n^{-\nu/d}). \]
For general $X$ having non-empty interior, we can find a hypercube $S = x_0 + [0, \varepsilon]^d \subseteq X$, with $\varepsilon > 0$. We may then proceed as above, picking functions $\psi_m$ supported inside $S$.

For the upper bound, consider a strategy $u$ choosing a fixed sequence $x_n$, independent of the $z_n$. Fit a radial basis function interpolant $\hat f_n$ to the data, and pick $x^*_n$ to minimize $\hat f_n$. Then if $x^*$ minimizes $f$,
\[ f(x^*_n) - f(x^*) \le f(x^*_n) - \hat f_n(x^*_n) + \hat f_n(x^*) - f(x^*) \le 2\|\hat f_n - f\|_\infty, \]
so the loss is bounded by the error in $\hat f_n$. From results in Narcowich et al. (2003, §6) and Wendland (2005, §11.5), for suitable radial basis functions the error is uniformly bounded by
\[ \sup_{\|f\|_{H_\theta(X)} \le R} \|\hat f_n - f\|_\infty = O(h_n^\nu), \]
where the mesh norm
\[ h_n := \sup_{x \in X} \min_{i=1}^n \|x - x_i\|. \]
(For $\nu \notin \mathbb{N}$, this result is given by Narcowich et al. for the radial basis function $K_\nu$, which is $\nu$-Hölder at 0 by Abramowitz and Stegun, 1965, §9.6; for $\nu \in \mathbb{N}$, the result is given by Wendland for thin-plate splines.) As $X$ is bounded, we may choose the $x_n$ so that $h_n = O(n^{-1/d})$, giving
\[ L_n(u, H_\theta(X), R) = O(n^{-\nu/d}). \]

To prove Theorem 2, we first show that some observations $z_n$ will be well-predicted by past data.

Lemma 7. Set
\[ \beta := \begin{cases} \alpha, & \nu \le 1, \\ 0, & \nu > 1. \end{cases} \]
Given $\theta \in \mathbb{R}^d_+$, there is a constant $C' > 0$ depending only on $X$, $K$ and $\theta$ which satisfies the following. For any $k \in \mathbb{N}$, and sequences $x_n \in X$, $\theta_n \ge \theta$, the inequality
\[ s_n(x_{n+1}; \theta_n) \ge C' k^{-(\nu\wedge 1)/d}(\log k)^\beta \]
holds for at most $k$ distinct $n$.

Proof. We first show that the posterior variance $s_n^2$ is bounded by the distance to the nearest design point. Let $\pi_n$ denote the prior with variance $\sigma^2 = 1$, and length-scales $\theta_n$. Then for any $i \le n$, as $\hat f_n(x; \theta_n) = E_{\pi_n}[f(x) \mid \mathcal{F}_n]$,
\[ s_n^2(x; \theta_n) = E_{\pi_n}[(f(x) - \hat f_n(x; \theta_n))^2 \mid \mathcal{F}_n] = E_{\pi_n}[(f(x) - f(x_i))^2 - (f(x_i) - \hat f_n(x; \theta_n))^2 \mid \mathcal{F}_n] \le E_{\pi_n}[(f(x) - f(x_i))^2 \mid \mathcal{F}_n] = 2(1 - K_{\theta_n}(x - x_i)). \]
If $\nu \le \frac12$, then by assumption $|K(x) - K(0)| = O(\|x\|^{2\nu}(-\log\|x\|)^{2\alpha})$ as $x \to 0$. If $\nu > \frac12$, then $K$ is differentiable, so as $K$ is symmetric, $\nabla K(0) = 0$. If further $\nu \le 1$, then
\[ |K(x) - K(0)| = |K(x) - K(0) - x \cdot \nabla K(0)| = O\bigl(\|x\|^{2\nu}(-\log\|x\|)^{2\alpha}\bigr). \]
Similarly, if $\nu > 1$, then $K$ is $C^2$, so
\[ |K(x) - K(0)| = |K(x) - K(0) - x \cdot \nabla K(0)| = O(\|x\|^2). \]
We may thus conclude
\[ |1 - K(x)| = |K(x) - K(0)| = O\bigl(\|x\|^{2(\nu\wedge 1)}(-\log\|x\|)^{2\beta}\bigr), \]
and
\[ s_n^2(x; \theta_n) \le C^2 \|x - x_i\|^{2(\nu\wedge 1)}(-\log\|x - x_i\|)^{2\beta}, \]
for a constant $C > 0$ depending only on $X$, $K$ and $\theta$.

We next show that most design points $x_{n+1}$ are close to a previous $x_i$. $X$ is bounded, so can be covered by $k$ balls of radius $O(k^{-1/d})$. If $x_{n+1}$ lies in a ball containing some earlier point $x_i$, $i \le n$, then we may conclude
\[ s_n^2(x_{n+1}; \theta_n) \le C'^2 k^{-2(\nu\wedge 1)/d}(\log k)^{2\beta}, \]
for a constant $C' > 0$ depending only on $X$, $K$ and $\theta$. Hence as there are $k$ balls, at most $k$ points $x_{n+1}$ can satisfy
\[ s_n(x_{n+1}; \theta_n) \ge C' k^{-(\nu\wedge 1)/d}(\log k)^\beta. \]
Next, we provide bounds on the expected improvement when $f$ lies in the RKHS.

Lemma 8. Let $\|f\|_{H_\theta(X)} \le R$. For $x \in X$, $n \in \mathbb{N}$, set $I = (f(x^*_n) - f(x))^+$, and $s = s_n(x; \theta)$. Then for $\tau(x) := x\Phi(x) + \varphi(x)$, we have
\[ \max\Bigl(I - Rs,\ \frac{\tau(-R/\sigma)}{\tau(R/\sigma)}\,I\Bigr) \le EI_n(x; \pi) \le I + (R + \sigma)s. \]

Proof. If $s = 0$, then by Lemma 6, $\hat f_n(x; \theta) = f(x)$, so $EI_n(x; \pi) = I$, and the result is trivial. Suppose $s > 0$, and set $t = (f(x^*_n) - f(x))/s$, $u = (f(x^*_n) - \hat f_n(x; \theta))/s$. From (8) and (9),
\[ EI_n(x; \pi) = \sigma s\,\tau(u/\sigma), \]
and by Lemma 6, $|u - t| \le R$. As $\tau'(z) = \Phi(z) \in [0,1]$, $\tau$ is non-decreasing, and $\tau(z) \le 1 + z$ for $z \ge 0$. Hence
\[ EI_n(x; \pi) \le \sigma s\,\tau\Bigl(\frac{t^+ + R}{\sigma}\Bigr) \le \sigma s\Bigl(\frac{t^+ + R}{\sigma} + 1\Bigr) = I + (R + \sigma)s. \]
If $I = 0$, then as $EI$ is the expectation of a non-negative quantity, $EI \ge 0$, and the lower bounds are trivial. Suppose $I > 0$. Then as $EI \ge 0$, $\tau(z) \ge 0$ for all $z$, and $\tau(z) = z + \tau(-z) \ge z$. Thus
\[ EI_n(x; \pi) \ge \sigma s\,\tau\Bigl(\frac{t - R}{\sigma}\Bigr) \ge \sigma s\,\frac{t - R}{\sigma} = I - Rs. \]
Also, as $\tau$ is increasing,
\[ EI_n(x; \pi) \ge \sigma\,\tau(-R/\sigma)\,s. \]
Combining these bounds, and eliminating $s$, we obtain
\[ EI_n(x; \pi) \ge \frac{\sigma\,\tau(-R/\sigma)}{R + \sigma\,\tau(-R/\sigma)}\,I = \frac{\tau(-R/\sigma)}{\tau(R/\sigma)}\,I. \]

We may now prove the theorem. We will use the above bounds to show that there must be times $n_k$ when the expected improvement is low, and thus $f(x^*_{n_k})$ is close to $\min f$.

Proof of Theorem 2. From Lemma 7 there exists $C > 0$, depending on $X$, $K$ and $\theta$, such that for any sequence $x_n \in X$ and $k \in \mathbb{N}$, the inequality $s_n(x_{n+1}; \theta) > Ck^{-(\nu\wedge 1)/d}(\log k)^\beta$ holds at most $k$ times. Furthermore, $z^*_n - z^*_{n+1} \ge 0$, and for $\|f\|_{H_\theta(X)} \le R$,
\[ \sum_n (z^*_n - z^*_{n+1}) \le z^*_1 - \min f \le 2\|f\|_\infty \le 2R, \]
so $z^*_n - z^*_{n+1} > 2Rk^{-1}$ at most $k$ times. Since $z^*_n - f(x_{n+1}) \le z^*_n - z^*_{n+1}$, we have also $z^*_n - f(x_{n+1}) > 2Rk^{-1}$ at most $k$ times. Thus there is a time $n_k$, $k \le n_k \le 3k$, for which
\[ s_{n_k}(x_{n_k+1}; \theta) \le Ck^{-(\nu\wedge 1)/d}(\log k)^\beta \quad\text{and}\quad z^*_{n_k} - f(x_{n_k+1}) \le 2Rk^{-1}. \]
Let $f$ have minimum $z^*$ at $x^*$. For $k$ large, $x_{n_k+1}$ will have been chosen by expected improvement (rather than being an initial design point, chosen at random). Then as $z^*_n$ is non-increasing in $n$, for $3k \le n < 3(k+1)$ we have by Lemma 8,
\[ z^*_n - z^* \le z^*_{n_k} - z^* \le \frac{\tau(R/\sigma)}{\tau(-R/\sigma)}\,EI_{n_k}(x^*; \pi) \le \frac{\tau(R/\sigma)}{\tau(-R/\sigma)}\,EI_{n_k}(x_{n_k+1}; \pi) \le \frac{\tau(R/\sigma)}{\tau(-R/\sigma)}\Bigl(2Rk^{-1} + C(R + \sigma)k^{-(\nu\wedge 1)/d}(\log k)^\beta\Bigr). \]
This bound is uniform in $f$ with $\|f\|_{H_\theta(X)} \le R$, so we obtain
\[ L_n(EI(\pi), H_\theta(X), R) = O\bigl(n^{-(\nu\wedge 1)/d}(\log n)^\beta\bigr). \]

A.3 Estimated Parameters

To prove Theorem 3, we first establish lower bounds on the posterior variance.

Lemma 9. Given $\theta^L, \theta^U \in \mathbb{R}^d_+$, pick sequences $x_n \in X$, $\theta^L \le \theta_n \le \theta^U$. Then for open $S \subset X$,
\[ \sup_{x \in S} s_n(x; \theta_n) = \Omega(n^{-\nu/d}), \]
uniformly in the sequences $x_n$, $\theta_n$.

Proof. $S$ is open, so contains a hypercube $T$. For $k \in \mathbb{N}$, let $n = \frac12(2k)^d$, and construct $2n$ functions $\psi_m$ on $T$ with $\|\psi_m\|_{H_{\theta^U}(X)} \le 1$, as in the proof of Theorem 1. Let $C^2 = \prod_{i=1}^d (\theta^U_i/\theta^L_i)$; then by Lemma 4, $\|\psi_m\|_{H_{\theta_n}(X)} \le C$. Given $n$ design points $x_1, \dots, x_n$, there must be some $\psi_m$ such that $\psi_m(x_i) = 0$, $1 \le i \le n$. By Lemma 6, the posterior mean of $\psi_m$ given these observations is the zero function.
Thus for $x \in T$ minimizing $\psi_m$,
\[ s_n(x; \theta_n) \ge C^{-1} s_n(x; \theta_n)\,\|\psi_m\|_{H_{\theta_n}(X)} \ge C^{-1}|\psi_m(x) - 0| = \Omega(k^{-\nu}). \]
As $s_n(x; \theta)$ is non-increasing in $n$, for $\frac12(2(k-1))^d < n \le \frac12(2k)^d$ we obtain
\[ \sup_{x \in S} s_n(x; \theta_n) \ge \sup_{x \in S} s_{\frac12(2k)^d}(x; \theta_n) = \Omega(k^{-\nu}) = \Omega(n^{-\nu/d}). \]

Next, we bound the expected improvement when prior parameters are estimated by maximum likelihood.

Lemma 10. Let $\|f\|_{H_{\theta^U}(X)} \le R$, $x_n, y_n \in X$. Set $I_n(x) = z^*_n - f(x)$, $s_n(x) = s_n(x; \hat\theta_n)$, and $t_n(x) = I_n(x)/s_n(x)$. Suppose:
(i) for some $i < j$, $z_i \ne z_j$;
(ii) for some $T_n \to -\infty$, $t_n(x_{n+1}) \le T_n$ whenever $s_n(x_{n+1}) > 0$;
(iii) $I_n(y_{n+1}) \ge 0$; and
(iv) for some $C > 0$, $s_n(y_{n+1}) \ge e^{-C/c_n}$.
Then for $\hat\pi_n$ as in Definition 2, eventually
\[ EI_n(x_{n+1}; \hat\pi_n) < EI_n(y_{n+1}; \hat\pi_n). \]
If the conditions hold on a subsequence, so does the conclusion.

Proof. Let $\hat R_n^2(\theta)$ be given by (7), and set $\hat R_n^2 = \hat R_n^2(\hat\theta_n)$. For $n \ge j$, $\hat R_n^2 > 0$, and by Lemma 4 and Corollary 1,
\[ \hat R_n^2 \le \|f\|^2_{H_{\hat\theta_n}(X)} \le S^2 = R^2 \prod_{i=1}^d (\theta^U_i/\theta^L_i). \]
Thus $0 < \hat\sigma_n^2 \le S^2 c_n$. Then if $s_n(x) > 0$, for some $|u_n(x) - t_n(x)| \le S$,
\[ EI_n(x; \hat\pi_n) = \hat\sigma_n s_n(x)\,\tau(u_n(x)/\hat\sigma_n), \]
as in the proof of Lemma 8. If $s_n(x_{n+1}) = 0$, then $x_{n+1} \in \{x_1, \dots, x_n\}$, so
\[ EI_n(x_{n+1}; \hat\pi_n) = 0 < EI_n(y_{n+1}; \hat\pi_n). \]
When $s_n(x_{n+1}) > 0$, as $\tau$ is increasing we may upper bound $EI_n(x_{n+1}; \hat\pi_n)$ using $u_n(x_{n+1}) \le T_n + S$, and lower bound $EI_n(y_{n+1}; \hat\pi_n)$ using $u_n(y_{n+1}) \ge -S$. Since $s_n(x_{n+1}) \le 1$, and $\tau(x) = \Theta(x^{-2}e^{-x^2/2})$ as $x \to -\infty$ (Abramowitz and Stegun, 1965, §7.1),
\[ \frac{EI_n(x_{n+1}; \hat\pi_n)}{EI_n(y_{n+1}; \hat\pi_n)} \le \frac{\tau((T_n + S)/\hat\sigma_n)}{e^{-C/c_n}\,\tau(-S/\hat\sigma_n)} = O\Bigl((T_n + S)^{-2} e^{C/c_n - (T_n^2 + 2ST_n)/2\hat\sigma_n^2}\Bigr) = O\Bigl((T_n + S)^{-2} e^{-(T_n^2 + 2ST_n - 2CS^2)/2S^2c_n}\Bigr) = o(1). \]
If the conditions hold on a subsequence, we may similarly argue along that subsequence.

Finally, we will require the following technical lemma.

Lemma 11. Let $x_1, \dots, x_n$ be random variables taking values in $\mathbb{R}^d$. Given open $S \subseteq \mathbb{R}^d$, there exists an open $U \subseteq S$ for which $P(\bigcup_{i=1}^n \{x_i \in U\})$ is arbitrarily small.

Proof. Given $\varepsilon > 0$, fix $m \ge n/\varepsilon$, and pick disjoint open sets $U_1, \dots, U_m \subset S$. Then
\[ \sum_{j=1}^m E[\#\{x_i \in U_j\}] \le E[\#\{x_i \in \mathbb{R}^d\}] = n, \]
so there exists $U_j$ with
\[ P\Bigl(\bigcup_i \{x_i \in U_j\}\Bigr) \le E[\#\{x_i \in U_j\}] \le n/m \le \varepsilon. \]

We may now prove the theorem. We will construct a function $f$ on which the $EI(\hat\pi)$ strategy never observes within a region $W$. We may then construct a function $g$, agreeing with $f$ except on $W$, but having a different minimum. As the strategy cannot distinguish between $f$ and $g$, it cannot successfully find the minimum of both.

Proof of Theorem 3. Let the $EI(\hat\pi)$ strategy choose initial design points $x_1, \dots, x_k$, independently of $f$. Given $\varepsilon > 0$, by Lemma 11 there exists open $U_0 \subseteq X$ for which $P^{EI(\hat\pi)}(x_i \in U_0 \text{ for some } i \le k) \le \varepsilon$; we may choose $U_0$ so that $V_0 = X \setminus U_0$ has non-empty interior. Pick open $U_1$ such that $V_1 = \bar U_1 \subset U_0$, and set $f$ to be a $C^\infty$ function, $0$ on $V_0$, $1$ on $V_1$, and everywhere non-negative. By Lemma 1, $f \in H_{\theta^U}(X)$.
We work conditional on the event $A$, having probability at least $1 - \varepsilon$, that $z^*_k = 0$, and thus $z^*_n = 0$ for all $n \ge k$.

Suppose $x_n \in V_1$ infinitely often, so the $z_n$ are not all equal. By Lemma 7, $s_n(x_{n+1}; \hat\theta_n) \to 0$, so on a subsequence with $x_{n+1} \in V_1$, we have
\[ t_n = (z^*_n - f(x_{n+1}))/s_n(x_{n+1}; \hat\theta_n) = -s_n(x_{n+1}; \hat\theta_n)^{-1} \to -\infty \]
whenever $s_n(x_{n+1}; \hat\theta_n) > 0$. However, by Lemma 9, there are points $y_n \in V_0$ with $z^*_n - f(y_{n+1}) = 0$, and $s_n(y_{n+1}; \hat\theta_n) = \Omega(n^{-\nu/d})$. Hence by Lemma 10, $EI_n(x_{n+1}; \hat\pi_n) < EI_n(y_{n+1}; \hat\pi_n)$ for some $n$, contradicting the definition of $x_{n+1}$.

Hence, on $A$, there is a random variable $T$ taking values in $\mathbb{N}$, for which $n > T \implies x_n \notin V_1$. Hence there exists a constant $t \in \mathbb{N}$ for which the event $B = A \cap \{T \le t\}$ has $P^{EI(\hat\pi)}_f$-probability at least $1 - 2\varepsilon$. By Lemma 11, we thus have an open set $W \subset V_1$ for which the event
\[ C = B \cap \{x_n \notin W : n \in \mathbb{N}\} = B \cap \{x_n \notin W : n \le t\} \]
has $P^{EI(\hat\pi)}_f$-probability at least $1 - 3\varepsilon$.

Construct a smooth function $g$ by adding to $f$ a $C^\infty$ function which is $0$ outside $W$, and has minimum $-2$. Then $\min g = -1$, but on the event $C$, $EI(\hat\pi)$ cannot distinguish between $f$ and $g$, and $g(x^*_n) \ge 0$. Thus for $\delta = 1$,
\[ P^{EI(\hat\pi)}_g\Bigl(\inf_n g(x^*_n) - \min g \ge \delta\Bigr) \ge P^{EI(\hat\pi)}_g(C) = P^{EI(\hat\pi)}_f(C) \ge 1 - 3\varepsilon. \]
As the behaviour of $EI(\hat\pi)$ is invariant under rescaling, we may scale $g$ to have norm $\|g\|_{H_\theta(X)} \le R$, and the above remains true for some $\delta > 0$.

Proof of Theorem 4. As in the proof of Theorem 2, we will show there are times $n_k$ when the expected improvement is small, so $f(x_{n_k})$ must be close to the minimum. First, however, we must control the estimated parameters $\hat\sigma_n^2$, $\hat\theta_n$.

If the $z_n$ are all equal, then by assumption the $x_n$ are dense in $X$, so $f$ is constant, and the result is trivial. Suppose the $z_n$ are not all equal, and let $T$ be a random variable satisfying $z_T \ne z_i$ for some $i < T$. Set $U = \inf_{\theta^L \le \theta \le \theta^U} \hat R_T(\theta)$. $\hat R_T(\theta)$ is a continuous positive function, so $U > 0$. Let $S^2 = R^2\prod_{i=1}^d (\theta^U_i/\theta^L_i)$. By Lemma 4, $\|f\|_{H_{\hat\theta_n}(X)} \le S$, so by Corollary 1, for $n \ge T$,
\[ U \le \hat R_T(\hat\theta_n) \le \hat\sigma_n \le \|f\|_{H_{\hat\theta_n}(X)} \le S. \]
As in the proof of Theorem 2, we have a constant $C > 0$, and some $n_k$, $k \le n_k \le 3k$, for which
\[ z^*_{n_k} - f(x_{n_k+1}) \le 2Rk^{-1} \quad\text{and}\quad s_{n_k}(x_{n_k+1}; \hat\theta_{n_k}) \le Ck^{-(\nu\wedge 1)/d}(\log k)^\beta. \]
Then for $k \ge T$, $3k \le n < 3(k+1)$, arguing as in Theorem 2 we obtain
\[ z^*_n - z^* \le z^*_{n_k} - z^* \le \frac{\tau(S/\hat\sigma_{n_k})}{\tau(-S/\hat\sigma_{n_k})}\Bigl(2Rk^{-1} + C(S + \hat\sigma_{n_k})k^{-(\nu\wedge 1)/d}(\log k)^\beta\Bigr) \le \frac{\tau(S/U)}{\tau(-S/U)}\Bigl(2Rk^{-1} + 2CSk^{-(\nu\wedge 1)/d}(\log k)^\beta\Bigr). \]
We thus have a random variable $C'$ satisfying $z^*_n - z^* \le C'n^{-(\nu\wedge 1)/d}(\log n)^\beta$ for all $n$, and the result follows.

A.4 Near-Optimal Rates

To prove Theorem 5, we first show that the points chosen at random will be quasi-uniform in $X$.

Lemma 12. Let $x_n$ be i.i.d. random variables, distributed uniformly over $X$, and define their mesh norm,
\[ h_n := \sup_{x \in X} \min_{i=1}^n \|x - x_i\|. \]
For any $\gamma > 0$, there exists $C > 0$ such that
\[ P\bigl(h_n > C(n/\log n)^{-1/d}\bigr) = O(n^{-\gamma}). \]
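As a quick empirical check of Lemma 12 (our sketch, not part of the proof), in dimension $d = 1$ with $X = [0, 1]$: the mesh norm of $n$ uniform points stays within a constant multiple of $(n/\log n)^{-1/d}$.

```python
import numpy as np

rng = np.random.default_rng(1)

def mesh_norm(x, grid):
    """h_n = sup_x min_i |x - x_i| over X = [0, 1], approximated on a grid."""
    xs = np.sort(x)
    idx = np.searchsorted(xs, grid).clip(1, len(xs) - 1)
    return np.minimum(np.abs(grid - xs[idx - 1]), np.abs(grid - xs[idx])).max()

grid = np.linspace(0.0, 1.0, 100_001)
for n in [100, 1_000, 10_000]:
    h = np.mean([mesh_norm(rng.uniform(size=n), grid) for _ in range(20)])
    print(f"n = {n:6d}: mean h_n = {h:.5f}; h_n * (n / log n) = {h * n / np.log(n):.2f}")
```

The printed ratio should remain roughly constant as $n$ grows, consistent with the rate in the lemma.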
Proof. We will partition $X$ into $n$ regions of size $O(n^{-1/d})$, and show that with high probability we will place an $x_i$ in each one. Then every point $x$ will be close to an $x_i$, and the mesh norm will be small.

Suppose $X = [0,1]^d$, fix $k \in \mathbb{N}$, and divide $X$ into $n = k^d$ sub-cubes $X_m = \frac1k(m + [0,1]^d)$, for $m \in \{0, \dots, k-1\}^d$. Let $I_m$ be the indicator function of the event $\{x_i \notin X_m : 1 \le i \le \lfloor \gamma n \log n \rfloor\}$, and define
\[ \mu_n = E\Bigl[\sum_m I_m\Bigr] = nE[I_0] = n(1 - 1/n)^{\lfloor \gamma n \log n\rfloor} \sim ne^{-\gamma\log n} = n^{-(\gamma-1)}. \]
For $n$ large, $\mu_n \le 1$, so by the generalized Chernoff bound of Panconesi and Srinivasan (1997, §3.1),
\[ P\Bigl(\sum_m I_m \ge 1\Bigr) \le \Bigl(e^{\mu_n^{-1} - 1}\,\mu_n^{\mu_n^{-1}}\Bigr)^{\mu_n} \le e\mu_n \sim en^{-(\gamma-1)}. \]
On the event $\sum_m I_m < 1$, $I_m = 0$ for all $m$. For any $x \in X$, we then have $x \in X_m$ for some $m$, and $x_j \in X_m$ for some $1 \le j \le \lfloor \gamma n \log n \rfloor$. Thus
\[ \min_{i=1}^{\lfloor \gamma n \log n\rfloor} \|x - x_i\| \le \|x - x_j\| \le \sqrt{d}\,k^{-1}. \]
As this bound is uniform in $x$, we obtain $h_{\lfloor \gamma n \log n\rfloor} \le \sqrt{d}\,k^{-1}$. Thus for $n = k^d$,
\[ P\bigl(h_{\lfloor \gamma n \log n\rfloor} > \sqrt{d}\,k^{-1}\bigr) = O\bigl(k^{-d(\gamma-1)}\bigr), \]
and as $h_n$ is non-increasing in $n$, this bound holds also for $k^d \le n < (k+1)^d$. By a change of variables, we then obtain
\[ P\bigl(h_n > C(n/\gamma\log n)^{-1/d}\bigr) = O\bigl((n/\gamma\log n)^{-(\gamma-1)}\bigr), \]
and the result follows by choosing $\gamma$ large. For general $X$, as $X$ is bounded it can be partitioned into $n$ regions of size $\Theta(n^{-1/d})$, so we may argue similarly.

We may now prove the theorem. We will show that the points $x_n$ must be quasi-uniform in $X$, so posterior variances must be small. Then, as in the proofs of Theorems 2 and 4, we have times when the expected improvement is small, so $f(x^*_n)$ is close to $\min f$.

Proof of Theorem 5. First suppose $\nu < \infty$. Let the $EI(\cdot, \varepsilon)$ strategy choose $k$ initial design points independent of $f$, and suppose $n \ge 2k$.
We may now prove the theorem. We will show that the points $x_n$ must be quasi-uniform in $X$, so posterior variances must be small. Then, as in the proofs of Theorems 2 and 4, we have times when the expected improvement is small, so $f(x^*_n)$ is close to $\min f$.

Proof of Theorem 5. First suppose $\nu < \infty$. Let the $EI(\cdot, \varepsilon)$ choose $k$ initial design points independent of $f$, and suppose $n \ge 2k$. Let $A_n$ be the event that $\lfloor \varepsilon n/4 \rfloor$ of the points $x_{k+1}, \dots, x_n$ are chosen uniformly at random, so by a Chernoff bound,
\[
\mathbb{P}^{EI(\cdot, \varepsilon)}(A_n^c) \le e^{-\varepsilon n/16}.
\]
Let $B_n$ be the event that one of the points $x_{n+1}, \dots, x_{2n}$ is chosen by expected improvement, so $\mathbb{P}^{EI(\cdot, \varepsilon)}(B_n^c) = \varepsilon^n$. Finally, let $C_n$ be the event that $A_n$ and $B_n$ occur, and further the mesh norm $h_n \le C(n/\log n)^{-1/d}$, for the constant $C$ from Lemma 12.

Set $r_n = (n/\log n)^{-\nu/d}(\log n)^\alpha$. Then by Lemma 12, since $C_n \subset A_n$,
\[
\mathbb{P}^{EI(\cdot, \varepsilon)}_f(C_n^c) \le C' r_n,
\]
for a constant $C' > 0$ not depending on $f$.

Let $EI(\cdot, \varepsilon)$ have prior $\pi_n$ at time $n$, with (fixed or estimated) parameters $\sigma_n$, $\theta_n$. Suppose $\|f\|_{\mathcal{H}_{\theta^U}(X)} \le R$, and set $S^2 = R^2 \prod_{i=1}^d (\theta^U_i/\theta^L_i)$, so by Lemma 4, $\|f\|_{\mathcal{H}_{\theta_n}(X)} \le S$. If $\alpha = 0$, then by Narcowich et al. (2003, §6), $\sup_{x \in X} s_n(x; \theta) = O(M(\theta) h_n^\nu)$ uniformly in $\theta$, for $M(\theta)$ a continuous function of $\theta$. Hence on the event $C_n$,
\[
\sup_{x \in X} s_n(x; \theta_n) \le \sup_{x \in X} \sup_{\theta^L \le \theta \le \theta^U} s_n(x; \theta) \le C'' r_n,
\]
for a constant $C'' > 0$ depending only on $X$, $K$, $C$, $\theta^L$ and $\theta^U$. If $\alpha > 0$, the same result holds by a similar argument.

On the event $C_n$, we have some $x_m$ chosen by expected improvement, $n < m \le 2n$. Let $f$ have minimum $z^*$ at $x^*$. Then by Lemma 8,
\[
z^*_{m-1} - z^* \le EI_{m-1}(x^*; \cdot) + C'' S r_{m-1} \le EI_{m-1}(x_m; \cdot) + C'' S r_{m-1} \le (z^*_{m-1} - f(x_m))_+ + C''(2S + \sigma_{m-1}) r_{m-1} \le z^*_{m-1} - z^*_m + C'' T r_n,
\]
for a constant $T > 0$. (Under $EI(\pi, \varepsilon)$, we have $T = 2S + \sigma$; otherwise $\sigma_{m-1} \le S$ by Corollary 1, so $T = 3S$.) Thus, rearranging,
\[
z^*_{2n} - z^* \le z^*_m - z^* \le C'' T r_n.
\]
On the event $C_n^c$, we have $z^*_{2n} - z^* \le 2\|f\|_\infty \le 2R$, so
\[
\mathbb{E}^{EI(\cdot, \varepsilon)}_f[z^*_{2n+1} - z^*] \le \mathbb{E}^{EI(\cdot, \varepsilon)}_f[z^*_{2n} - z^*] \le 2R\, \mathbb{P}^{EI(\cdot, \varepsilon)}_f(C_n^c) + C'' T r_n \le (2C'R + C''T) r_n.
\]
As this bound is uniform in $f$ with $\|f\|_{\mathcal{H}_{\theta^U}(X)} \le R$, the result follows. If instead $\nu = \infty$, the above argument holds for any $\nu < \infty$.
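The strategy analysed above is simple to state in code. Here is a minimal sketch of $EI(\cdot, \varepsilon)$ in one dimension, our illustration only: it reuses the expected_improvement function sketched earlier, keeps the prior parameters fixed rather than estimated, maximizes expected improvement approximately over a candidate grid, and reports the best observed point.

```python
import numpy as np

def ei_epsilon(f, n_steps, eps=0.1, theta=0.3, k_init=3, seed=0):
    """EI(., eps) on [0, 1]: with probability eps the next point is
    uniform at random; otherwise it (approximately) maximizes EI."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=k_init)                  # initial design
    z = np.array([f(x) for x in X])
    candidates = np.linspace(0.0, 1.0, 201)
    for _ in range(n_steps):
        if rng.uniform() < eps:
            x_next = rng.uniform()                # exploration step
        else:
            ei = [expected_improvement(c, X, z, theta) for c in candidates]
            x_next = candidates[int(np.argmax(ei))]
        X = np.append(X, x_next)
        z = np.append(z, f(x_next))
    return X[np.argmin(z)], z.min()               # best point found

x_best, z_best = ei_epsilon(lambda x: np.sin(5 * x) + 0.5 * x, n_steps=30)
print(x_best, z_best)
```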
References

Abramowitz M and Stegun I A, editors. Handbook of Mathematical Functions. Dover, New York, 1965.
Adler R J and Taylor J E. Random Fields and Geometry. Springer Monographs in Mathematics. Springer, New York, 2007.
Aronszajn N. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68(3):337-404, 1950.
Berlinet A and Thomas-Agnan C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Boston, Massachusetts, 2004.
Brochu E, Cora M, and de Freitas N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. 2010. arXiv:1012.2599.
Bubeck S, Munos R, and Stoltz G. Pure exploration in multi-armed bandits problems. In Proc. 20th International Conference on Algorithmic Learning Theory (ALT '09), pages 23-37, Porto, Portugal, 2009.
Bubeck S, Munos R, Stoltz G, and Szepesvari C. X-armed bandits. 2010.
Diaconis P and Freedman D. On the consistency of Bayes estimates. Ann. Statist., 14(1):1-26, 1986.
Frazier P, Powell W, and Dayanik S. The knowledge-gradient policy for correlated normal beliefs. INFORMS J. Comput., 21(4):599-613, 2009.
Ginsbourger D, le Riche R, and Carraro L. A multi-points criterion for deterministic parallel global optimization based on Gaussian processes. 2008. hal-00260579.
Grunewalder S, Audibert J, Opper M, and Shawe-Taylor J. Regret bounds for Gaussian process bandit problems. In Proc. 13th International Conference on Artificial Intelligence and Statistics (AISTATS '10), pages 273-280, Sardinia, Italy, 2010.
Hansen P, Jaumard B, and Lu S. Global optimization of univariate Lipschitz functions: I. Survey and properties. Math. Program., 55(1):251-272, 1992.
Jones D R, Perttunen C D, and Stuckman B E. Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl., 79(1):157-181, 1993.
Jones D R, Schonlau M, and Welch W J. Efficient global optimization of expensive black-box functions. J. Global Optim., 13(4):455-492, 1998.
Kleinberg R and Slivkins A. Sharp dichotomies for regret minimization in metric spaces. In Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA '10), pages 827-846, Austin, Texas, 2010.
Locatelli M. Bayesian algorithms for one-dimensional global optimization. J. Global Optim., 10(1):57-76, 1997.
Macready W G and Wolpert D H. Bandit problems and the exploration/exploitation tradeoff. IEEE Trans. Evol. Comput., 2(1):2-22, 1998.
Močkus J. On Bayesian methods for seeking the extremum. In Proc. IFIP Technical Conference, pages 400-404, Novosibirsk, Russia, 1974.
Narcowich F J, Ward J D, and Wendland H. Refined error estimates for radial basis function interpolation. Constr. Approx., 19(4):541-564, 2003.
Osborne M. Bayesian Gaussian Processes for Sequential Prediction, Optimisation and Quadrature. DPhil thesis, University of Oxford, Oxford, UK, 2010.
Panconesi A and Srinivasan A. Randomized distributed edge coloring via an extension of the Chernoff-Hoeffding bounds. SIAM J. Comput., 26(2):350-368, 1997.
Pardalos P M and Romeijn H E, editors. Handbook of Global Optimization, Volume 2. Nonconvex Optimization and its Applications. Kluwer Academic Publishers, Dordrecht, the Netherlands, 2002.
Parzen E. Probability density functionals and reproducing kernel Hilbert spaces. In Proc. Symposium on Time Series Analysis, pages 155-169, Providence, Rhode Island, 1963.
Rasmussen C E and Williams C K I. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006.
Santner T J, Williams B J, and Notz W I. The Design and Analysis of Computer Experiments. Springer Series in Statistics. Springer, New York, 2003.
Srinivas N, Krause A, Kakade S M, and Seeger M. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proc. 27th International Conference on Machine Learning (ICML '10), Haifa, Israel, 2010.
Sutton R S and Barto A G. Reinforcement Learning: an Introduction. MIT Press, Cambridge, Massachusetts, 1998.
Tartar L. An Introduction to Sobolev Spaces and Interpolation Spaces, volume 3 of Lecture Notes of the Unione Matematica Italiana. Springer, New York, 2007.
van der Vaart A W and van Zanten J H. Reproducing kernel Hilbert spaces of Gaussian priors. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, volume 3 of Institute of Mathematical Statistics Collections, pages 200-222. Institute of Mathematical Statistics, Beachwood, Ohio, 2008.
van der Vaart A W and van Zanten J H. Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. Ann. Statist., 37(5B):2655-2675, 2009.
Vazquez E and Bect J. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J. Statist. Plann. Inference, 140(11):3088-3095, 2010.
Wendland H. Scattered Data Approximation. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge, UK, 2005.