X-Armed Bandits

Sébastien Bubeck (Sequel Project, INRIA Lille), sebastien.bubeck@inria.fr · Rémi Munos (Sequel Project, INRIA Lille), remi.munos@inria.fr · Gilles Stoltz (Ecole Normale Supérieure*, CNRS & HEC Paris, CNRS), gilles.stoltz@ens.fr · Csaba Szepesvári (University of Alberta, Department of Computing Science), szepesva@cs.ualberta.ca

October 22, 2018

Footnote *: This research was carried out within the INRIA project CLASSIC hosted by Ecole normale supérieure and CNRS.

Abstract

We consider a generalization of stochastic bandits where the set of arms, $\mathcal{X}$, is allowed to be a generic measurable space and the mean-payoff function is "locally Lipschitz" with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if $\mathcal{X}$ is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by $\sqrt{n}$, i.e., the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches.

1 Introduction

In the classical stochastic bandit problem a gambler tries to maximize his revenue by sequentially playing one of a finite number of slot machines that are associated with initially unknown (and potentially different) payoff distributions [26]. Assuming old-fashioned slot machines, the gambler pulls the arms of the machines one by one in a sequential manner, simultaneously learning about the machines' payoff distributions and gaining actual monetary reward. Thus, in order to maximize his gain, the gambler must choose the next arm by taking into consideration both the urgency of gaining reward ("exploitation") and acquiring new information ("exploration").

Maximizing the total cumulative payoff is equivalent to minimizing the (total) regret, i.e., minimizing the difference between the total cumulative payoff of the gambler and the one of another clairvoyant gambler who chooses the arm with the best mean-payoff in every round. The quality of the gambler's strategy can be characterized by the rate of growth of his expected regret with time. In particular, if this rate of growth is sublinear, the gambler in the long run plays as well as the clairvoyant gambler. In this case the gambler's strategy is called Hannan consistent.

Bandit problems have been studied in the Bayesian framework [19], as well as in the frequentist parametric [25; 2] and non-parametric settings [4], and even in non-stochastic scenarios [5; 10].
While in the Bayesian case the question is whether the optimal actions can be computed efficiently, in the frequentist case the question is how to achieve a low rate of growth of the regret in the lack of prior information, i.e., it is a statistical question. In this paper we consider the stochastic, frequentist, non-parametric setting.

Although the first papers studied bandits with a finite number of arms, researchers soon realized that bandits with infinitely many arms are also interesting, as well as practically significant. One particularly important case is when the arms are identified by a finite number of continuous-valued parameters, resulting in online optimization problems over continuous finite-dimensional spaces. Such problems are ubiquitous in operations research and control. Examples are "pricing a new product with uncertain demand in order to maximize revenue, controlling the transmission power of a wireless communication system in a noisy channel to maximize the number of bits transmitted per unit of power, and calibrating the temperature or levels of other inputs to a reaction so as to maximize the yield of a chemical process" [12]. Other examples are optimizing parameters of schedules, rotational systems, traffic networks, or online parameter tuning of numerical methods. During the last decades numerous authors have investigated such "continuum-armed" bandit problems [3; 21; 6; 22; 12]. A special case of interest, which forms a bridge between the case of a finite number of arms and the continuum-armed setting, is formed by bandit linear optimization; see [1] and the references therein.

In many of the above-mentioned problems, however, the natural domain of some of the optimization parameters is a discrete set, while other parameters are still continuous-valued. For example, in the pricing problem different product lines could also be tested while tuning the price, or in the case of transmission power control different protocols could be tested while optimizing the power. In other problems, such as in online sequential search, the parameter vector to be optimized is an infinite sequence over a finite alphabet [13; 7].

The motivation for this paper is to handle all these various cases in a unified framework. More precisely, we consider a general setting that allows us to study bandits with almost no restriction on the set of arms. In particular, we allow the set of arms to be an arbitrary measurable space. Since we allow non-denumerable sets, we shall assume that the gambler has some knowledge about the behavior of the mean-payoff function (in terms of its local regularity around its maxima, roughly speaking). This is because when the set of arms is uncountably infinite and absolutely no assumptions are made on the payoff function, it is impossible to construct a strategy that simultaneously achieves sublinear regret for all bandit problems (see, e.g., [9, Corollary 4]). When the set of arms is a metric space (possibly with the power of the continuum) previous works have assumed either the global smoothness of the payoff function [3; 21; 22; 12] or local smoothness in the vicinity of the maxima [6]. Here, smoothness means that the payoff function is either Lipschitz or Hölder continuous (locally or globally).
These smoothness assumptions are indeed reasonable in many practical problems of interest.

In this paper, we assume that there exists a dissimilarity function that constrains the behavior of the mean-payoff function, where a dissimilarity function is a measure of the discrepancy between two arms that need not be symmetric or reflexive, nor satisfy the triangle inequality. (The same notion was introduced simultaneously and independently of us by [23, Section 4.4] under the name "quasi-distance.") In particular, the dissimilarity function is assumed to locally set a bound on the decrease of the mean-payoff function at each of its global maxima. We also assume that the decision maker can construct a recursive covering of the space of arms in such a way that the diameters of the sets in the covering shrink at a known geometric rate when measured with this dissimilarity.

Relation to the literature. Our work generalizes and improves upon previous works on continuum-armed bandits. In particular, Kleinberg [21] and Auer et al. [6] focused on one-dimensional problems, while we allow general spaces. In this sense, the closest work to the present contribution is that of Kleinberg et al. [22], who considered generic metric spaces assuming that the mean-payoff function is Lipschitz with respect to the (known) metric of the space; its full version [23] relaxed this condition and only requires that the mean-payoff function is Lipschitz at some maximum with respect to some (known) dissimilarity (see footnote 1). Kleinberg et al. [23] proposed a novel algorithm that achieves essentially the best possible regret bound in a minimax sense with respect to the environments studied, as well as a much better regret bound if the mean-payoff function has a small "zooming dimension". Our contribution furthers these works in two ways: (i) our algorithms, motivated by the recent successful tree-based optimization algorithms [24; 18; 13], are easy to implement; (ii) we show that a version of our main algorithm is able to exploit the local properties of the mean-payoff function at its maxima only, which, as far as we know, was not investigated in the approach of Kleinberg et al. [22; 23].

The precise discussion of the improvements (and drawbacks) with respect to the papers by Kleinberg et al. [22; 23] requires the introduction of somewhat extensive notation and is therefore deferred to Section 5. However, in a nutshell, the following can be said. First, by resorting to a hierarchical approach, we are able to avoid the use of the doubling trick, as well as the need for the (covering) oracle, both of which the so-called zooming algorithm of Kleinberg et al. [22] relies on. This comes at the cost of slightly more restrictive assumptions on the mean-payoff function, as well as a more involved analysis. Moreover, the oracle is replaced by an a priori choice of a covering tree. In standard metric spaces, such as the Euclidean spaces, such trees are trivial to construct; in full generality, though, they may be difficult to obtain when their construction must start from (say) a distance function only. We also propose a variant of our algorithm that has a smaller computational complexity, of order $n \ln n$, compared to the quadratic complexity $n^2$ of our basic algorithm.
However, the cheaper algorithm requires the doubling trick to achieve an anytime guarantee (just like the zooming algorithm).

Second, we are also able to weaken our assumptions and to consider only properties of the mean-payoff function in the neighborhoods of its maxima; this leads to regret bounds scaling as $\tilde{O}(\sqrt{n})$ (see footnote 2) when, e.g., the space is the unit hypercube and the mean-payoff function has a finite number of global maxima $x^*$ around which it is locally equivalent to a function $\|x - x^*\|^\alpha$ with some known degree $\alpha > 0$. Thus, in this case, we get the desirable property that the rate of growth of the regret is independent of the dimensionality of the input space. (Comparable dimensionality-free rates are obtained under different assumptions in [23].)

Footnote 1: The present paper is a concurrent and independent work with respect to the paper of Kleinberg, Slivkins, and Upfal [23]. An extended abstract [22] of the latter was published in May 2008 at STOC'08, while the NIPS'08 version [8] of the present paper was submitted at the beginning of June 2008. At that time, we were not aware of the existence of the full version [23], which was released in September 2008.

Footnote 2: We write $u_n = \tilde{O}(v_n)$ when $u_n = O(v_n)$ up to a logarithmic factor.

Finally, in addition to the strong theoretical guarantees, we expect our algorithm to work well in practice, since the algorithm is very close to the recent, empirically very successful tree-search methods from the games and planning literature [16; 17; 27; 11; 15].

Outline. The outline of the paper is as follows:

1. In Section 2 we formalize the $\mathcal{X}$-armed bandit problem.

2. In Section 3 we describe the basic strategy proposed, called HOO (hierarchical optimistic optimization).

3. We present the main results in Section 4. We start by specifying and explaining our assumptions (Section 4.1) under which various regret bounds are proved. Then we prove a distribution-dependent bound for the basic version of HOO (Section 4.2). A problem with the basic algorithm is that its computational cost increases quadratically with the number of time steps. Assuming the knowledge of the horizon, we thus propose a computationally more efficient variant of the basic algorithm, called truncated HOO, and prove that it enjoys a regret bound identical to the one of the basic version (Section 4.3) while its computational complexity is only log-linear in the number of time steps. The first set of assumptions constrains the mean-payoff function everywhere. A second set of assumptions is therefore presented that puts constraints on the mean-payoff function only in a small vicinity of its global maxima; we then propose another algorithm, called local-HOO, which is proven to enjoy a regret again essentially similar to the one of the basic version (Section 4.4). Finally, we prove the minimax optimality of HOO in metric spaces (Section 4.5).

4. In Section 5 we compare the results of this paper with previous works.

2 Problem setup

A stochastic bandit problem $B$ is a pair $B = (\mathcal{X}, M)$, where $\mathcal{X}$ is a measurable space of arms and $M$ determines the distribution of rewards associated with each arm. We say that $M$ is a bandit environment on $\mathcal{X}$. Formally, $M$ is a mapping $\mathcal{X} \to \mathcal{M}_1(\mathbb{R})$, where $\mathcal{M}_1(\mathbb{R})$ is the space of probability distributions over the reals. The distribution assigned to arm $x \in \mathcal{X}$ is denoted by $M_x$.
We require that for each arm $x \in \mathcal{X}$, the distribution $M_x$ admits a first-order moment; we then denote by $f(x)$ its expectation ("mean payoff"),

$$f(x) = \int y \, \mathrm{d}M_x(y).$$

The mean-payoff function $f$ thus defined is assumed to be measurable. For simplicity, we shall also assume that all $M_x$ have bounded supports, included in some fixed bounded interval (see footnote 3), say, the unit interval $[0,1]$. Then, $f$ also takes bounded values, in $[0,1]$.

Footnote 3: More generally, our results would also hold when the tails of the reward distributions are uniformly sub-Gaussian.

A decision maker (the gambler of the introduction) that interacts with a stochastic bandit problem $B$ plays a game at discrete time steps according to the following rules. In the first round the decision maker can select an arm $X_1 \in \mathcal{X}$ and receives a reward $Y_1$ drawn at random from $M_{X_1}$. In round $n > 1$ the decision maker can select an arm $X_n \in \mathcal{X}$ based on the information available up to time $n$, i.e., $(X_1, Y_1, \ldots, X_{n-1}, Y_{n-1})$, and receives a reward $Y_n$ drawn from $M_{X_n}$, independently of $(X_1, Y_1, \ldots, X_{n-1}, Y_{n-1})$ given $X_n$. Note that a decision maker may randomize his choice, but can only use information available up to the point in time when the choice is made.

Formally, a strategy of the decision maker in this game ("bandit strategy") can be described by an infinite sequence of measurable mappings, $\varphi = (\varphi_1, \varphi_2, \ldots)$, where $\varphi_n$ maps the space of past observations,

$$\mathcal{H}_n = \bigl(\mathcal{X} \times [0,1]\bigr)^{n-1},$$

to the space of probability measures over $\mathcal{X}$. By convention, $\varphi_1$ does not take any argument. A strategy is called deterministic if for every $n$, $\varphi_n$ is a Dirac distribution.

The goal of the decision maker is to maximize his expected cumulative reward. Equivalently, the goal can be expressed as minimizing the expected cumulative regret, which is defined as follows. Let

$$f^* = \sup_{x \in \mathcal{X}} f(x)$$

be the best expected payoff in a single round. At round $n$, the cumulative regret of a decision maker playing $B$ is

$$\widehat{R}_n = n f^* - \sum_{t=1}^{n} Y_t,$$

i.e., the difference between the maximum expected payoff in $n$ rounds and the actual total payoff. In the sequel, we shall restrict our attention to the expected cumulative regret, which is defined as the expectation $\mathbb{E}[\widehat{R}_n]$ of the cumulative regret $\widehat{R}_n$.

Finally, we define the cumulative pseudo-regret as

$$R_n = n f^* - \sum_{t=1}^{n} f(X_t),$$

that is, the actual rewards used in the definition of the regret are replaced by the mean payoffs of the arms pulled. Since (by the tower rule)

$$\mathbb{E}[Y_t] = \mathbb{E}\bigl[\mathbb{E}[Y_t \mid X_t]\bigr] = \mathbb{E}\bigl[f(X_t)\bigr],$$

the expected values $\mathbb{E}[\widehat{R}_n]$ of the cumulative regret and $\mathbb{E}[R_n]$ of the cumulative pseudo-regret are the same. Thus, we focus below on the study of the behavior of $\mathbb{E}[R_n]$.

Remark 1 As is argued in [9], in many real-world problems, the decision maker is not interested in his cumulative regret but rather in its simple regret. The latter can be defined as follows. After $n$ rounds of play in a stochastic bandit problem $B$, the decision maker is asked to make a recommendation $Z_n \in \mathcal{X}$ based on the $n$ obtained rewards $Y_1, \ldots, Y_n$. The simple regret of this recommendation equals

$$r_n = f^* - f(Z_n).$$

In this paper we focus on the cumulative regret $R_n$, but all the results can be readily extended to the simple regret by considering the recommendation $Z_n = X_{T_n}$, where $T_n$ is drawn uniformly at random in $\{1, \ldots, n\}$. Indeed, in this case,

$$\mathbb{E}[r_n] \leq \frac{\mathbb{E}[R_n]}{n},$$

as is shown in [9, Section 3].
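To make the setup concrete, here is a minimal, self-contained Python sketch (not part of the paper; all names are ours) of a bandit environment on $\mathcal{X} = [0,1]$, together with the cumulative pseudo-regret of a deliberately naive uniform-random strategy. For concreteness it reuses the mean-payoff function that appears later in Figure 2, with Bernoulli-distributed payoffs.

```python
import math, random

# Toy bandit environment on X = [0, 1]: arm x yields a Bernoulli reward
# with mean f(x) in [0, 1]; this is the payoff function of Figure 2.
f = lambda x: 0.5 * (math.sin(13 * x) * math.sin(27 * x) + 1)

def pull(x):
    return float(random.random() < f(x))          # reward Y ~ M_x

f_star = max(f(k / 10_000) for k in range(10_001))  # numerical proxy for sup_x f(x)

# Cumulative pseudo-regret R_n = n f* - sum_t f(X_t) for uniform-random play.
n, pseudo_regret = 1000, 0.0
for t in range(n):
    x = random.random()          # X_t: the arm chosen at round t
    y = pull(x)                  # Y_t: observed, but R_n only uses f(X_t)
    pseudo_regret += f_star - f(x)
print(pseudo_regret / n)         # per-round regret: roughly constant for random play
```

The per-round pseudo-regret of this strategy does not vanish with $n$; a Hannan-consistent strategy such as HOO, described next, drives it to zero.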
3 The Hierarchical Optimistic Optimization (HOO) strategy

The HOO strategy (cf. Algorithm 1) incrementally builds an estimate of the mean-payoff function $f$ over $\mathcal{X}$. The core idea (as in previous works) is to estimate $f$ precisely around its maxima, while estimating it loosely in other parts of the space $\mathcal{X}$. To implement this idea, HOO maintains a binary tree whose nodes are associated with measurable regions of the arm space $\mathcal{X}$ such that the regions associated with nodes deeper in the tree (further away from the root) represent increasingly smaller subsets of $\mathcal{X}$. The tree is built in an incremental manner. At each node of the tree, HOO stores some statistics based on the information received in previous rounds. In particular, HOO keeps track of the number of times a node was traversed up to round $n$ and the corresponding empirical average of the rewards received so far. Based on these, HOO assigns an optimistic estimate (denoted by $B$) to the maximum mean payoff associated with each node. These estimates are then used to select the next node to "play". This is done by traversing the tree, beginning from the root, and always following the node with the highest $B$-value (cf. lines 4–14 of Algorithm 1). Once a node is selected, a point in the region associated with it is chosen (line 16) and is sent to the environment. Based on the point selected and the received reward, the tree is updated (lines 18–33).

The tree of coverings which HOO needs to receive as an input is an infinite binary tree whose nodes are associated with subsets of $\mathcal{X}$. The nodes in this tree are indexed by pairs of integers $(h, i)$; node $(h, i)$ is located at depth $h \geq 0$ from the root. The range of the second index, $i$, associated with nodes at depth $h$ is restricted by $1 \leq i \leq 2^h$. Thus, the root node is denoted by $(0, 1)$. By convention, $(h+1, 2i-1)$ and $(h+1, 2i)$ are used to refer to the two children of the node $(h, i)$. Let $P_{h,i} \subset \mathcal{X}$ be the region associated with node $(h, i)$. By assumption, these regions are measurable and must satisfy the constraints

$$P_{0,1} = \mathcal{X}, \tag{1a}$$

$$P_{h,i} = P_{h+1,2i-1} \cup P_{h+1,2i}, \quad \text{for all } h \geq 0 \text{ and } 1 \leq i \leq 2^h. \tag{1b}$$

As a corollary, the regions $P_{h,i}$ at any level $h \geq 0$ cover the space $\mathcal{X}$,

$$\mathcal{X} = \bigcup_{i=1}^{2^h} P_{h,i},$$

explaining the term "tree of coverings".

In the algorithm listing, the recursive computation of the $B$-values (lines 28–33) makes a local copy of the tree; of course, this part of the algorithm could be implemented in various other ways. Other arbitrary choices in the algorithm as shown here are how tie breaking in the node selection part is done (lines 9–12), or how a point in the region associated with the selected node is chosen (line 16). We note in passing that implementing these differently would not change our theoretical results.

To facilitate the formal study of the algorithm, we shall need some more notation. In particular, we shall introduce time-indexed versions ($\mathcal{T}_n$, $(H_n, I_n)$, $X_n$, $Y_n$, $\widehat{\mu}_{h,i}(n)$, etc.) of the quantities used by the algorithm.
The convention used is that the indexation by $n$ indicates the value taken at the end of the $n$th round. In particular, $\mathcal{T}_n$ is used to denote the finite subtree stored by the algorithm at the end of round $n$. Thus, the initial tree is $\mathcal{T}_0 = \{(0,1)\}$ and it is expanded round after round as

$$\mathcal{T}_n = \mathcal{T}_{n-1} \cup \{(H_n, I_n)\},$$

where $(H_n, I_n)$ is the node selected in line 15. We call $(H_n, I_n)$ the node played in round $n$. We use $X_n$ to denote the point selected by HOO in the region associated with the node played in round $n$, while $Y_n$ denotes the received reward.

Algorithm 1 The HOO strategy

Parameters: two real numbers $\nu_1 > 0$ and $\rho \in (0,1)$, a sequence $(P_{h,i})_{h \geq 0,\, 1 \leq i \leq 2^h}$ of subsets of $\mathcal{X}$ satisfying the conditions (1a) and (1b).
Auxiliary function Leaf($\mathcal{T}$): outputs a leaf of $\mathcal{T}$.
Initialization: $\mathcal{T} = \{(0,1)\}$ and $B_{1,1} = B_{1,2} = +\infty$.

 1: for n = 1, 2, ... do                          ⊲ Strategy HOO in round n ≥ 1
 2:   (h, i) ← (0, 1)                             ⊲ Start at the root
 3:   P ← {(h, i)}                                ⊲ P stores the path traversed in the tree
 4:   while (h, i) ∈ T do                         ⊲ Search the tree T
 5:     if B_{h+1,2i−1} > B_{h+1,2i} then         ⊲ Select the "more promising" child
 6:       (h, i) ← (h+1, 2i−1)
 7:     else if B_{h+1,2i−1} < B_{h+1,2i} then
 8:       (h, i) ← (h+1, 2i)
 9:     else                                      ⊲ Tie-breaking rule
10:       Z ∼ Ber(0.5)                            ⊲ e.g., choose a child at random
11:       (h, i) ← (h+1, 2i−Z)
12:     end if
13:     P ← P ∪ {(h, i)}
14:   end while
15:   (H, I) ← (h, i)                             ⊲ The selected node
16:   Choose arm X in P_{H,I} and play it         ⊲ Arbitrary selection of an arm
17:   Receive corresponding reward Y
18:   T ← T ∪ {(H, I)}                            ⊲ Extend the tree
19:   for all (h, i) ∈ P do                       ⊲ Update the statistics T and μ̂ stored in the path
20:     T_{h,i} ← T_{h,i} + 1                     ⊲ Increment the counter of node (h, i)
21:     μ̂_{h,i} ← (1 − 1/T_{h,i}) μ̂_{h,i} + Y/T_{h,i}   ⊲ Update the mean μ̂_{h,i} of node (h, i)
22:   end for
23:   for all (h, i) ∈ T do                       ⊲ Update the statistics U stored in the tree
24:     U_{h,i} ← μ̂_{h,i} + sqrt((2 ln n)/T_{h,i}) + ν₁ ρ^h   ⊲ Update the U-value of node (h, i)
25:   end for
26:   B_{H+1,2I−1} ← +∞                           ⊲ B-values of the children of the new leaf
27:   B_{H+1,2I} ← +∞
28:   T′ ← T                                      ⊲ Local copy of the current tree T
29:   while T′ ≠ {(0, 1)} do                      ⊲ Backward computation of the B-values
30:     (h, i) ← Leaf(T′)                         ⊲ Take any remaining leaf
31:     B_{h,i} ← min{U_{h,i}, max(B_{h+1,2i−1}, B_{h+1,2i})}   ⊲ Backward computation
32:     T′ ← T′ \ {(h, i)}                        ⊲ Drop updated leaf (h, i)
33:   end while
34: end for

Node selection works by comparing $B$-values and always choosing the node with the highest $B$-value. The $B$-value, $B_{h,i}(n)$, at node $(h,i)$ by the end of round $n$ is an estimated upper bound on the mean-payoff function at node $(h,i)$. To define it we first need to introduce the average of the rewards received in rounds when some descendant of node $(h,i)$ was chosen (by convention, each node is a descendant of itself):

$$\widehat{\mu}_{h,i}(n) = \frac{1}{T_{h,i}(n)} \sum_{t=1}^{n} Y_t \, \mathbb{I}_{\{(H_t, I_t) \in \mathcal{C}(h,i)\}}.$$

Here, $\mathcal{C}(h,i)$ denotes the set of all descendants of a node $(h,i)$ in the infinite tree,

$$\mathcal{C}(h,i) = \{(h,i)\} \cup \mathcal{C}(h+1, 2i-1) \cup \mathcal{C}(h+1, 2i),$$

and $T_{h,i}(n)$ is the number of times a descendant of $(h,i)$ is played up to and including round $n$, that is,

$$T_{h,i}(n) = \sum_{t=1}^{n} \mathbb{I}_{\{(H_t, I_t) \in \mathcal{C}(h,i)\}}.$$
A key quantity determining $B_{h,i}(n)$ is $U_{h,i}(n)$, an initial estimate of the maximum of the mean-payoff function in the region $P_{h,i}$ associated with node $(h,i)$:

$$U_{h,i}(n) = \begin{cases} \widehat{\mu}_{h,i}(n) + \sqrt{\dfrac{2 \ln n}{T_{h,i}(n)}} + \nu_1 \rho^h, & \text{if } T_{h,i}(n) > 0; \\ +\infty, & \text{otherwise.} \end{cases} \tag{2}$$

In the expression corresponding to the case $T_{h,i}(n) > 0$, the first term added to the average of rewards accounts for the uncertainty arising from the randomness of the rewards that the average is based on, while the second term, $\nu_1 \rho^h$, accounts for the maximum possible variation of the mean-payoff function over the region $P_{h,i}$. The actual bound on the maxima used in HOO is defined recursively by

$$B_{h,i}(n) = \begin{cases} \min\Bigl\{ U_{h,i}(n), \, \max\bigl(B_{h+1,2i-1}(n), B_{h+1,2i}(n)\bigr) \Bigr\}, & \text{if } (h,i) \in \mathcal{T}_n; \\ +\infty, & \text{otherwise.} \end{cases}$$

The role of $B_{h,i}(n)$ is to put a tight, optimistic, high-probability upper bound on the best mean payoff that can be achieved in the region $P_{h,i}$. By assumption, $P_{h,i} = P_{h+1,2i-1} \cup P_{h+1,2i}$. Thus, assuming that $B_{h+1,2i-1}(n)$ (resp., $B_{h+1,2i}(n)$) is a valid upper bound for region $P_{h+1,2i-1}$ (resp., $P_{h+1,2i}$), we see that $\max\bigl(B_{h+1,2i-1}(n), B_{h+1,2i}(n)\bigr)$ must be a valid upper bound for region $P_{h,i}$. Since $U_{h,i}(n)$ is another valid upper bound for region $P_{h,i}$, we get a tighter (less overoptimistic) upper bound by taking the minimum of these bounds.

Obviously, for leaves $(h,i)$ of the tree $\mathcal{T}_n$, one has $B_{h,i}(n) = U_{h,i}(n)$, while close to the root one may expect that $B_{h,i}(n) < U_{h,i}(n)$; that is, the upper bounds close to the root are expected to be less biased than the ones associated with nodes farther away from the root.

Note that at the beginning of round $n$, the algorithm uses $B_{h,i}(n-1)$ to select the node $(H_n, I_n)$ to be played (since $B_{h,i}(n)$ will only be available at the end of round $n$). It does so by following a path from the root node to an inner node with only one child or a leaf, and finally considering a child $(H_n, I_n)$ of the latter; at each node of the path, the child with the highest $B$-value is chosen, until the node $(H_n, I_n)$ with infinite $B$-value is reached.

Illustrations. Figure 1 illustrates the computation done by HOO in round $n$, as well as the correspondence between the nodes of the tree constructed by the algorithm and their associated regions. Figure 2 shows trees built by running HOO for a specific environment.

Figure 1: Illustration of the node selection procedure in round $n$. The tree represents $\mathcal{T}_n$. In the illustration, $B_{h+1,2i-1}(n-1) > B_{h+1,2i}(n-1)$; therefore, the followed path included the node $(h+1, 2i-1)$ rather than the node $(h+1, 2i)$, ending at the selected node $(H_n, I_n)$, from which the pulled point $X_n$ is drawn.

Computational complexity. At the end of round $n$, the size of the active tree $\mathcal{T}_n$ is at most $n$, making the storage requirements of HOO linear in $n$. In addition, the statistics and $B$-values of all nodes in the active tree need to be updated, which thus takes time $O(n)$. HOO runs in time $O(n)$ at each round $n$, making the algorithm's total running time up to round $n$ quadratic in $n$. In Section 4.3 we modify HOO so that, if the time horizon $n_0$ is known in advance, the total running time is $O(n_0 \ln n_0)$, while the modified algorithm will be shown to enjoy essentially the same regret bound as the original version.
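The following is a minimal, illustrative Python sketch of the HOO strategy of Algorithm 1, specialized to $\mathcal{X} = [0,1]$ with the tree of coverings formed by the dyadic intervals and arms chosen as interval midpoints (the setting of Figure 2). All class and variable names are ours. One deliberate simplification, flagged in the comments: Algorithm 1 refreshes the $U$- and $B$-values of the whole stored tree at every round (whence the quadratic total running time discussed above), whereas this sketch refreshes them only along the traversed path, in the spirit of the truncated variant of Section 4.3.

```python
import math, random

class HOO:
    """Sketch of basic HOO on X = [0, 1] with dyadic intervals.
    Node (h, i) covers [(i-1)/2^h, i/2^h); the arm played is the midpoint."""

    def __init__(self, nu1=1.0, rho=0.5):
        self.nu1, self.rho = nu1, rho
        self.tree = {(0, 1)}   # expanded nodes; T_0 = {(0, 1)}
        self.T = {}            # T[h, i]: times a descendant of (h, i) was played
        self.mu = {}           # mu[h, i]: empirical mean of rewards through (h, i)
        self.B = {}            # B-values; absent nodes default to +infinity
        self.n = 0

    def _children(self, h, i):
        return (h + 1, 2 * i - 1), (h + 1, 2 * i)

    def select(self):
        """Lines 2-15: follow highest-B children until leaving the tree."""
        h, i = 0, 1
        path = [(h, i)]
        while (h, i) in self.tree:
            left, right = self._children(h, i)
            bl, br = self.B.get(left, math.inf), self.B.get(right, math.inf)
            if bl > br: (h, i) = left
            elif bl < br: (h, i) = right
            else: (h, i) = random.choice([left, right])   # tie-breaking rule
            path.append((h, i))
        return (h, i), path

    def arm(self, h, i):
        return (i - 0.5) / 2 ** h      # midpoint of the dyadic interval P_{h,i}

    def update(self, path, y):
        """Lines 18-33, simplified: refresh U- and B-values along the path only,
        instead of over the whole tree as Algorithm 1 prescribes."""
        self.n += 1
        self.tree.add(path[-1])        # extend the tree with the played node
        for node in path:              # update counters and means on the path
            t = self.T.get(node, 0) + 1
            self.T[node] = t
            self.mu[node] = self.mu.get(node, 0.0) + (y - self.mu.get(node, 0.0)) / t
        for (h, i) in reversed(path):  # backward computation of the B-values
            u = (self.mu[(h, i)]
                 + math.sqrt(2 * math.log(self.n) / self.T[(h, i)])
                 + self.nu1 * self.rho ** h)
            left, right = self._children(h, i)
            bmax = max(self.B.get(left, math.inf), self.B.get(right, math.inf))
            self.B[(h, i)] = min(u, bmax)
```

A usage sketch, reproducing the experiment of Figure 2 with Bernoulli payoffs:

```python
f = lambda x: 0.5 * (math.sin(13 * x) * math.sin(27 * x) + 1)
hoo = HOO(nu1=1.0, rho=0.5)
for _ in range(1000):
    node, path = hoo.select()
    reward = float(random.random() < f(hoo.arm(*node)))
    hoo.update(path, reward)
```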
4 Main results

We start by describing and commenting on the assumptions that we need to analyze the regret of HOO. This is followed by stating the first upper bound, followed by some improvements on the basic algorithm. The section is finished by the statement of our results on the minimax optimality of HOO.

4.1 Assumptions

The main assumption will concern the "smoothness" of the mean-payoff function. However, somewhat unconventionally, we shall use a notion of smoothness that is built around dissimilarity functions rather than distances, allowing us to deal with function classes of highly different smoothness degrees in a unified manner. Before stating our smoothness assumptions, we define the notion of a dissimilarity function and some associated concepts.

Definition 2 (Dissimilarity) A dissimilarity $\ell$ over $\mathcal{X}$ is a non-negative mapping $\ell : \mathcal{X}^2 \to \mathbb{R}$ satisfying $\ell(x, x) = 0$ for all $x \in \mathcal{X}$.

Figure 2: The trees (bottom figures) built by HOO after 1,000 (left) and 10,000 (right) rounds. The mean-payoff function (shown in the top part of the figure) is $x \in [0,1] \mapsto \tfrac{1}{2}\bigl(\sin(13x)\sin(27x) + 1\bigr)$; the corresponding payoffs are Bernoulli-distributed. The inputs of HOO are as follows: the tree of coverings is formed by all dyadic intervals, $\nu_1 = 1$ and $\rho = 1/2$. The tie-breaking rule is to choose a child at random (as shown in Algorithm 1), while the points in $\mathcal{X}$ to be played are chosen as the centers of the dyadic intervals. Note that the tree is extensively refined where the mean-payoff function is near-optimal, while it is much less developed in other regions.

Given a dissimilarity $\ell$, the diameter of a subset $A$ of $\mathcal{X}$ as measured by $\ell$ is defined by

$$\mathrm{diam}(A) = \sup_{x, y \in A} \ell(x, y),$$

while the $\ell$-open ball of $\mathcal{X}$ with radius $\varepsilon > 0$ and center $x \in \mathcal{X}$ is defined by

$$\mathcal{B}(x, \varepsilon) = \{ y \in \mathcal{X} : \ell(x, y) < \varepsilon \}.$$

Note that the dissimilarity $\ell$ is only used in the theoretical analysis of HOO; the algorithm does not require $\ell$ as an explicit input. However, when choosing its parameters (the tree of coverings and the real numbers $\nu_1 > 0$ and $\rho < 1$) for the (set of) two assumptions below to be satisfied, the user of the algorithm probably has in mind a given dissimilarity. However, it is also natural to wonder what is the class of functions for which the algorithm (given a fixed tree) can achieve non-trivial regret bounds; a similar question for regression was investigated, e.g., by Yang [28]. We shall indicate below how to construct a subset of such a class, right after stating our assumptions connecting the tree, the dissimilarity, and the environment (the mean-payoff function). Of these, Assumption A2 will be interpreted, discussed, and equivalently reformulated below into (4), a form that might be more intuitive. The form (3) stated below will turn out to be the most useful one in the proofs.

Assumptions Given the parameters of HOO, that is, the real numbers $\nu_1 > 0$ and $\rho \in (0,1)$ and the tree of coverings $(P_{h,i})$, there exists a dissimilarity function $\ell$ such that the following two assumptions are satisfied.
A1. There exists $\nu_2 > 0$ such that for all integers $h \geq 0$,

(a) $\mathrm{diam}(P_{h,i}) \leq \nu_1 \rho^h$ for all $i = 1, \ldots, 2^h$;

(b) for all $i = 1, \ldots, 2^h$, there exists $x^\circ_{h,i} \in P_{h,i}$ such that $\mathcal{B}_{h,i} \stackrel{\mathrm{def}}{=} \mathcal{B}\bigl(x^\circ_{h,i}, \nu_2 \rho^h\bigr) \subset P_{h,i}$;

(c) $\mathcal{B}_{h,i} \cap \mathcal{B}_{h,j} = \emptyset$ for all $1 \leq i < j \leq 2^h$.

A2. The mean-payoff function $f$ satisfies that for all $x, y \in \mathcal{X}$,

$$f^* - f(y) \leq f^* - f(x) + \max\bigl\{ f^* - f(x), \, \ell(x, y) \bigr\}. \tag{3}$$

We show next how a tree induces in a natural way first a dissimilarity and then a class of environments. For this, we need to assume that the tree of coverings $(P_{h,i})$, in addition to (1a) and (1b), is such that the subsets $P_{h,i}$ and $P_{h,j}$ are disjoint whenever $1 \leq i < j \leq 2^h$ and that none of them is empty. Then, each $x \in \mathcal{X}$ corresponds to a unique path in the tree, which can be represented as an infinite binary sequence $x_0 x_1 x_2 \ldots$, where

$$x_0 = \mathbb{I}_{\{x \in P_{1,2}\}}, \qquad x_1 = \mathbb{I}_{\{x \in P_{2,\,2x_0+2}\}}, \qquad x_2 = \mathbb{I}_{\{x \in P_{3,\,4x_0+2x_1+2}\}}, \qquad \ldots$$

For points $x, y \in \mathcal{X}$ with respective representations $x_0 x_1 \ldots$ and $y_0 y_1 \ldots$, we let

$$\ell(x, y) = (1 - \rho)\, \nu_1 \sum_{h=0}^{\infty} \mathbb{I}_{\{x_h \neq y_h\}} \, \rho^h.$$

It is not hard to see that this dissimilarity satisfies A1. Thus, the associated class of environments $\mathcal{C}$ is formed by those with mean-payoff functions satisfying A2 with the so-defined dissimilarity. This is a "natural class" underlying the tree for which our tree-based algorithm can achieve non-trivial regret. (However, we do not know if this is the largest such class.)

In general, Assumption A1 ensures that the regions in the tree of coverings $(P_{h,i})$ shrink exactly at a geometric rate. The following examples show how to satisfy A1 when the domain $\mathcal{X}$ is a $D$-dimensional hyper-rectangle and the dissimilarity is some positive power of the Euclidean (or supremum) norm.

Example 1 Assume that $\mathcal{X}$ is a $D$-dimensional hyper-rectangle and consider the dissimilarity $\ell(x, y) = b \|x - y\|_2^a$, where $a > 0$ and $b > 0$ are real numbers and $\|\cdot\|_2$ is the Euclidean norm. Define the tree of coverings $(P_{h,i})$ in the following inductive way: let $P_{0,1} = \mathcal{X}$. Given a node $P_{h,i}$, let $P_{h+1,2i-1}$ and $P_{h+1,2i}$ be obtained from the hyper-rectangle $P_{h,i}$ by splitting it in the middle along its longest side (ties can be broken arbitrarily).

We now argue that Assumption A1 is satisfied. With no loss of generality we take $\mathcal{X} = [0,1]^D$. Then, for all integers $u \geq 0$ and $0 \leq k \leq D - 1$,

$$\mathrm{diam}\bigl(P_{uD+k,1}\bigr) = b \left( \frac{1}{2^u} \sqrt{D - \frac{3}{4} k} \right)^{\!a} \leq b \left( \frac{\sqrt{D}}{2^u} \right)^{\!a}.$$

It is now easy to see that Assumption A1 is satisfied for the indicated dissimilarity, e.g., with the choice of the parameters $\rho = 2^{-a/D}$ and $\nu_1 = b\bigl(2\sqrt{D}\bigr)^a$ for HOO, and the value $\nu_2 = b/2^a$.

Example 2 In the same setting, with the same tree of coverings $(P_{h,i})$ over $\mathcal{X} = [0,1]^D$, but now with the dissimilarity $\ell(x, y) = b \|x - y\|_\infty^a$, we get that for all integers $u \geq 0$ and $0 \leq k \leq D - 1$,

$$\mathrm{diam}\bigl(P_{uD+k,1}\bigr) = b \left( \frac{1}{2^u} \right)^{\!a}.$$

This time, Assumption A1 is satisfied, e.g., with the choice of the parameters $\rho = 2^{-a/D}$ and $\nu_1 = b\, 2^a$ for HOO, and the value $\nu_2 = b/2^a$.
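As an illustration of the construction in Examples 1 and 2 (and a numerical sanity check of Assumption A1(a)), here is a short Python sketch, with names of our own choosing, that splits $[0,1]^D$ along the longest side and verifies the geometric shrinkage of the diameters under $\ell(x, y) = b\|x - y\|_2^a$ with the parameters suggested in Example 1.

```python
import math

def split(box):
    """Split a hyper-rectangle (list of (lo, hi) intervals) in the middle
    along its longest side, as in Examples 1 and 2."""
    j = max(range(len(box)), key=lambda k: box[k][1] - box[k][0])
    lo, hi = box[j]
    left, right = list(box), list(box)
    left[j], right[j] = (lo, (lo + hi) / 2), ((lo + hi) / 2, hi)
    return left, right

def diam(box, a, b):
    """Diameter of the box under ell(x, y) = b * ||x - y||_2^a."""
    return b * math.sqrt(sum((hi - lo) ** 2 for lo, hi in box)) ** a

# Check A1(a) numerically for D = 2, a = 2: with rho = 2^(-a/D) and
# nu1 = b (2 sqrt(D))^a we expect diam(P_{h,i}) <= nu1 * rho^h at every depth h.
D, a, b = 2, 2.0, 1.0
rho, nu1 = 2 ** (-a / D), b * (2 * math.sqrt(D)) ** a
level = [[(0.0, 1.0)] * D]
for h in range(12):
    assert all(diam(box, a, b) <= nu1 * rho ** h + 1e-12 for box in level)
    level = [child for box in level for child in split(box)]
```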
The second assumption, A2, concerns the environment; when Assumption A2 is satisfied, we say that $f$ is weakly Lipschitz with respect to (w.r.t.) $\ell$. The choice of this terminology follows from the fact that if $f$ is 1-Lipschitz w.r.t. $\ell$, i.e., for all $x, y \in \mathcal{X}$ one has $|f(x) - f(y)| \leq \ell(x, y)$, then it is also weakly Lipschitz w.r.t. $\ell$.

On the other hand, weak Lipschitzness is a milder requirement. It implies local (one-sided) 1-Lipschitzness at any global maximum, since at any arm $x^*$ such that $f(x^*) = f^*$, the criterion (3) rewrites to $f(x^*) - f(y) \leq \ell(x^*, y)$. In the vicinity of other arms $x$, the constraint becomes milder as the arm $x$ gets worse (as $f^* - f(x)$ increases), since the condition (3) rewrites to

$$\forall y \in \mathcal{X}, \qquad f(x) - f(y) \leq \max\bigl\{ f^* - f(x), \, \ell(x, y) \bigr\}. \tag{4}$$

Here is another interpretation of these two facts; it will be useful when considering local assumptions in Section 4.4 (a weaker set of assumptions). First, concerning the behavior around global maxima, Assumption A2 implies that for any set $A \subset \mathcal{X}$ with $\sup_{x \in A} f(x) = f^*$,

$$f^* - \inf_{x \in A} f(x) \leq \mathrm{diam}(A). \tag{5}$$

Second, it can be seen that Assumption A2 is equivalent (see footnote 4) to the following property: for all $x \in \mathcal{X}$ and $\varepsilon > 0$,

$$\mathcal{B}\bigl(x, \, f^* - f(x) + \varepsilon\bigr) \subset \mathcal{X}_{2(f^* - f(x)) + \varepsilon}, \tag{6}$$

where $\mathcal{X}_\varepsilon = \{ x \in \mathcal{X} : f(x) \geq f^* - \varepsilon \}$ denotes the set of $\varepsilon$-optimal arms. This second property essentially states that there is no sudden and large drop in the mean-payoff function around the global maxima (note that this property can be satisfied even for discontinuous functions). Figure 3 presents an illustration of the two properties discussed above.

Footnote 4: That Assumption A2 implies (6) is immediate; for the converse, it suffices to consider, for each $y \in \mathcal{X}$, the sequence $\varepsilon_n = \bigl(\ell(x, y) - (f^* - f(x))\bigr)_+ + 1/n$, where $(\cdot)_+$ denotes the nonnegative part.

Figure 3: Illustration of the property of weak Lipschitzness (on the real line and for the distance $\ell(x, y) = |x - y|$). Around the optimum $x^*$ the values $f(y)$ should be above $f^* - \ell(x^*, y)$. Around any $\varepsilon$-optimal point $x$ the values $f(y)$ should be larger than $f^* - 2\varepsilon$ for $\ell(x, y) \leq \varepsilon$ and larger than $f(x) - \ell(x, y)$ elsewhere.

Before stating our main results, we provide a straightforward, though useful, consequence of Assumptions A1 and A2, which should be seen as an intuitive justification for the third term in (2). For all nodes $(h, i)$, let

$$f^*_{h,i} = \sup_{x \in P_{h,i}} f(x) \qquad \text{and} \qquad \Delta_{h,i} = f^* - f^*_{h,i}.$$

$\Delta_{h,i}$ is called the suboptimality factor of node $(h, i)$. Depending on whether it is positive or not, a node $(h, i)$ is called suboptimal ($\Delta_{h,i} > 0$) or optimal ($\Delta_{h,i} = 0$).

Lemma 3 Under Assumptions A1 and A2, if the suboptimality factor $\Delta_{h,i}$ of a region $P_{h,i}$ is bounded by $c \nu_1 \rho^h$ for some $c \geq 0$, then all arms in $P_{h,i}$ are $\max\{2c, c+1\}\, \nu_1 \rho^h$-optimal, that is,

$$P_{h,i} \subset \mathcal{X}_{\max\{2c,\, c+1\}\nu_1\rho^h}.$$

Proof For all $\delta > 0$, we denote by $x^*_{h,i}(\delta)$ an element of $P_{h,i}$ such that

$$f\bigl(x^*_{h,i}(\delta)\bigr) \geq f^*_{h,i} - \delta = f^* - \Delta_{h,i} - \delta.$$

By the weak Lipschitz property (Assumption A2), it then follows that for all $y \in P_{h,i}$,

$$f^* - f(y) \leq f^* - f\bigl(x^*_{h,i}(\delta)\bigr) + \max\Bigl\{ f^* - f\bigl(x^*_{h,i}(\delta)\bigr), \, \ell\bigl(x^*_{h,i}(\delta), y\bigr) \Bigr\} \leq \Delta_{h,i} + \delta + \max\bigl\{ \Delta_{h,i} + \delta, \, \mathrm{diam}(P_{h,i}) \bigr\}.$$

Letting $\delta \to 0$ and substituting the bounds on the suboptimality and on the diameter of $P_{h,i}$ (Assumption A1) concludes the proof.
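As a quick numerical illustration of condition (3) (ours, not from the paper), one can check weak Lipschitzness of, say, $f(x) = 1 - |x|$ on $[0,1]$ w.r.t. $\ell(x, y) = |x - y|$ over a grid of arm pairs:

```python
import itertools

# Grid check of the weak-Lipschitz condition (3) for f(x) = 1 - |x| on [0, 1]
# with ell(x, y) = |x - y| and f* = 1; a toy sanity check only.
f = lambda x: 1.0 - abs(x)
ell = lambda x, y: abs(x - y)
f_star = 1.0
grid = [k / 200 for k in range(201)]
ok = all(
    f_star - f(y) <= f_star - f(x) + max(f_star - f(x), ell(x, y)) + 1e-12
    for x, y in itertools.product(grid, grid)
)
print(ok)  # True: f is weakly Lipschitz w.r.t. ell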
4.2 Upper bound for the regret of HOO

Auer et al. [6, Assumption 2] observed that the regret of a continuum-armed bandit algorithm should depend on how fast the volumes of the sets of $\varepsilon$-optimal arms shrink as $\varepsilon \to 0$. Here, we capture this by defining a new notion, the near-optimality dimension of the mean-payoff function. The connection between these concepts, as well as with the zooming dimension defined by Kleinberg et al. [22], will be further discussed in Section 5.

We start by recalling the definition of packing numbers.

Definition 4 (Packing number) The $\varepsilon$-packing number $N(\mathcal{X}, \ell, \varepsilon)$ of $\mathcal{X}$ w.r.t. the dissimilarity $\ell$ is the size of the largest packing of $\mathcal{X}$ with disjoint $\ell$-open balls of radius $\varepsilon$. That is, $N(\mathcal{X}, \ell, \varepsilon)$ is the largest integer $k$ such that there exist $k$ disjoint $\ell$-open balls with radius $\varepsilon$ contained in $\mathcal{X}$.

We now define the $c$-near-optimality dimension, which characterizes the size of the sets $\mathcal{X}_{c\varepsilon}$ as a function of $\varepsilon$. It can be seen as some growth rate in $\varepsilon$ of the metric entropy (measured in terms of $\ell$ and with packing numbers rather than covering numbers) of the set of $c\varepsilon$-optimal arms.

Definition 5 (Near-optimality dimension) For $c > 0$ the $c$-near-optimality dimension of $f$ w.r.t. $\ell$ equals

$$\max\left\{ 0, \; \limsup_{\varepsilon \to 0} \frac{\ln N\bigl(\mathcal{X}_{c\varepsilon}, \ell, \varepsilon\bigr)}{\ln\bigl(\varepsilon^{-1}\bigr)} \right\}.$$

The following example shows that using a dissimilarity (rather than a metric, for instance) may sometimes allow for a significant reduction of the near-optimality dimension.

Example 3 Let $\mathcal{X} = [0,1]^D$ and let $f : [0,1]^D \to [0,1]$ be defined by $f(x) = 1 - \|x\|^a$ for some $a \geq 1$ and some norm $\|\cdot\|$ on $\mathbb{R}^D$. Consider the dissimilarity $\ell$ defined by $\ell(x, y) = \|x - y\|^a$. We shall see in Example 4 that $f$ is weakly Lipschitz w.r.t. $\ell$ (in a sense, however, slightly weaker than the one given by (5) and (6), but sufficiently strong to ensure a result similar to the one of the main result, Theorem 6, below). Here we claim that the $c$-near-optimality dimension (for any $c > 0$) of $f$ w.r.t. $\ell$ is 0. On the other hand, the $c$-near-optimality dimension (for any $c > 0$) of $f$ w.r.t. the dissimilarity $\ell'$ defined, for $0 < b < a$, by $\ell'(x, y) = \|x - y\|^b$ is $(1/b - 1/a) D > 0$. In particular, when $a > 1$ and $b = 1$, the $c$-near-optimality dimension is $(1 - 1/a) D$.

Proof (sketch) Fix $c > 0$. The set $\mathcal{X}_{c\varepsilon}$ is the $\|\cdot\|$-ball with center 0 and radius $(c\varepsilon)^{1/a}$, that is, the $\ell$-ball with center 0 and radius $c\varepsilon$. Its $\varepsilon$-packing number w.r.t. $\ell$ is bounded by a constant depending only on $D$, $c$ and $a$; hence the value 0 for the near-optimality dimension w.r.t. the dissimilarity $\ell$.

In the case of $\ell'$, we are interested in the packing number of the $\|\cdot\|$-ball with center 0 and radius $(c\varepsilon)^{1/a}$ w.r.t. $\ell'$-balls. The latter is of the order of

$$\left( \frac{(c\varepsilon)^{1/a}}{\varepsilon^{1/b}} \right)^{\!D} = c^{D/a}\, \bigl(\varepsilon^{-1}\bigr)^{(1/b - 1/a) D};$$

hence the value $(1/b - 1/a) D$ for the near-optimality dimension in the case of the dissimilarity $\ell'$. Note that in all these cases the $c$-near-optimality dimension of $f$ is independent of the value of $c$.

We can now state our first main result. The proof is presented in Section A.1.

Theorem 6 (Regret bound for HOO) Consider HOO tuned with parameters such that Assumptions A1 and A2 hold for some dissimilarity $\ell$. Let $d$ be the $4\nu_1/\nu_2$-near-optimality dimension of the mean-payoff function $f$ w.r.t. $\ell$.
Then, for all $d' > d$, there exists a constant $\gamma$ such that for all $n \geq 1$,

$$\mathbb{E}[R_n] \leq \gamma\, n^{(d'+1)/(d'+2)} \bigl(\ln n\bigr)^{1/(d'+2)}.$$

Note that if $d$ is infinite, then the bound is vacuous. The constant $\gamma$ in the theorem depends on $d'$ and on all other parameters of HOO and of the assumptions, as well as on the bandit environment $M$. (The value of $\gamma$ is determined in the analysis; it is in particular proportional to $\nu_2^{-d'}$.) The next section will exhibit a refined upper bound with a more explicit value of $\gamma$ in terms of all these parameters.

Remark 7 The tuning of the parameters of HOO is critical for the assumptions to be satisfied, and thus to achieve a good regret; given some environment, one should select the parameters of HOO such that the near-optimality dimension of the mean-payoff function is minimized. Since the mean-payoff function is unknown to the user, this might be difficult to achieve. Thus, ideally, these parameters should be selected adaptively based on the observation of some preliminary sample. For now, the investigation of this possibility is left for future work.

4.3 Improving the running time when the time horizon is known

A deficiency of the basic HOO algorithm is that its computational complexity scales quadratically with the number of time steps. In this section we propose a simple modification to HOO that achieves essentially the same regret as HOO and whose computational complexity scales only log-linearly with the number of time steps. The needed amount of memory is still linear. We work out the case when the time horizon, $n_0$, is known in advance. The case of unknown horizon can be dealt with by resorting to the so-called doubling trick, see, e.g., [10, Section 2.3], which consists of periodically restarting the algorithm for regimes of lengths that double at each such fresh start, so that the $r$th instance of the algorithm runs for $2^r$ rounds.

We consider two modifications to the algorithm described in Section 3. First, the quantities $U_{h,i}(n)$ of (2) are redefined by replacing the factor $\ln n$ by $\ln n_0$; that is, now

$$U_{h,i}(n) = \widehat{\mu}_{h,i}(n) + \sqrt{\frac{2 \ln n_0}{T_{h,i}(n)}} + \nu_1 \rho^h.$$

(This results in a policy which explores the arms with a slightly increased frequency.) The definition of the $B$-values in terms of the $U_{h,i}(n)$ is unchanged. A pleasant consequence of the above modification is that the $B$-value of a given node changes only when this node is part of a path selected by the algorithm. Thus, at each round $n$, only the nodes along the chosen path need to be updated according to the obtained reward.

However, and this is the reason for the second modification, in the basic algorithm a path at round $n$ may be of length linear in $n$ (because the tree could have a depth linear in $n$). This is why we also truncate the trees $\mathcal{T}_n$ at a depth $D_{n_0}$ of the order of $\ln n_0$. More precisely, the algorithm now selects the node $(H_n, I_n)$ to pull at round $n$ by following a path in the tree $\mathcal{T}_{n-1}$, starting from the root and choosing at each node the child with the highest $B$-value (with the new definition above using $\ln n_0$), and stopping either when it encounters a node which has not been expanded before or a node at depth equal to

$$D_{n_0} = \left\lceil \frac{(\ln n_0)/2 - \ln(1/\nu_1)}{\ln(1/\rho)} \right\rceil.$$

(It is assumed that $n_0 \geq 1/\nu_1^2$ so that $D_{n_0} \geq 1$.) These two bookkeeping devices are sketched in code below.
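Here is a small, self-contained Python sketch (function names ours) of the truncation depth $D_{n_0}$ just defined and of the doubling-trick regime schedule described at the beginning of this section (the $r$th instance running for $2^r$ rounds).

```python
import math

def truncation_depth(n0, nu1, rho):
    """Depth D_{n0} at which truncated HOO stops growing the tree;
    assumes n0 >= 1 / nu1**2 so that the result is at least 1."""
    return math.ceil((math.log(n0) / 2 - math.log(1 / nu1)) / math.log(1 / rho))

def doubling_regimes(n):
    """Doubling trick: restart in regimes r = 1, 2, ..., the r-th instance
    running for 2**r rounds, until n rounds are covered."""
    r, start = 1, 1
    while start <= n:
        yield r, start, min(start + 2 ** r - 1, n)
        start += 2 ** r
        r += 1

print(truncation_depth(10_000, nu1=1.0, rho=0.5))  # 7 for these parameters
for r, first, last in doubling_regimes(20):
    print(f"regime {r}: rounds {first}..{last}")    # 1..2, 3..6, 7..14, 15..20
```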
Note that since no child of a node $(D_{n_0}, i)$ located at depth $D_{n_0}$ will ever be explored, its $B$-value at round $n \leq n_0$ simply equals $U_{D_{n_0},i}(n)$.

We call this modified version of HOO the truncated HOO algorithm. The computational complexity of updating all $B$-values at each round $n$ is of the order of $D_{n_0}$, and thus of the order of $\ln n_0$. The total computational complexity up to round $n_0$ is therefore of the order of $n_0 \ln n_0$, as claimed in the introduction of this section. As the next theorem indicates, this new procedure enjoys almost the same cumulative regret bound as the basic HOO algorithm.

Theorem 8 (Upper bound on the regret of truncated HOO) Fix a horizon $n_0$ such that $D_{n_0} \geq 1$. Then, the regret bound of Theorem 6 still holds true at round $n_0$ for truncated HOO up to an additional additive $4\sqrt{n_0}$ factor.

4.4 Local assumptions

In this section we further relax the weak Lipschitz assumption and require it only to hold locally around the maxima. Doing so, we will be able to deal with an even larger class of functions, and in fact we will show that the algorithm studied in this section achieves a $\tilde{O}(\sqrt{n})$ bound on the regret when it is used for functions that are smooth around their maxima (e.g., equivalent to $\|x - x^*\|^\alpha$ for some known smoothness degree $\alpha > 0$).

For the sake of simplicity, and to derive exact constants, we also state in a more explicit way the assumption on the near-optimality dimension. We then propose a simple and efficient adaptation of the HOO algorithm suited for this context.

4.4.1 Modified set of assumptions

Assumptions Given the parameters of (the adaptation of) HOO, that is, the real numbers $\nu_1 > 0$ and $\rho \in (0,1)$ and the tree of coverings $(P_{h,i})$, there exists a dissimilarity function $\ell$ such that Assumption A1 (for some $\nu_2 > 0$) as well as the following two assumptions hold.

A2'. There exists $\varepsilon_0 > 0$ such that for all optimal subsets $A \subset \mathcal{X}$ (i.e., $\sup_{x \in A} f(x) = f^*$) with diameter $\mathrm{diam}(A) \leq \varepsilon_0$,

$$f^* - \inf_{x \in A} f(x) \leq \mathrm{diam}(A).$$

Further, there exists $L > 0$ such that for all $x \in \mathcal{X}_{\varepsilon_0}$ and $\varepsilon \in [0, \varepsilon_0]$,

$$\mathcal{B}\bigl(x, \, f^* - f(x) + \varepsilon\bigr) \subset \mathcal{X}_{L(2(f^* - f(x)) + \varepsilon)}.$$

A3. There exist $C > 0$ and $d \geq 0$ such that for all $\varepsilon \leq \varepsilon_0$,

$$N\bigl(\mathcal{X}_{c\varepsilon}, \ell, \varepsilon\bigr) \leq C \varepsilon^{-d}, \qquad \text{where } c = 4L\nu_1/\nu_2.$$

When $f$ satisfies Assumption A2', we say that $f$ is $\varepsilon_0$-locally $L$-weakly Lipschitz w.r.t. $\ell$. Note that this assumption was obtained by weakening the characterizations (5) and (6) of weak Lipschitzness. Assumption A3 is not a real assumption but merely a reformulation of the definition of near-optimality (with the small added ingredient that the limit can be achieved; see the second step of the proof of Theorem 6 in Section A.1).

Example 4 We consider again the domain $\mathcal{X}$ and function $f$ studied in Example 3 and prove (as announced beforehand) that $f$ is $\varepsilon_0$-locally $2^{a-1}$-weakly Lipschitz w.r.t. the dissimilarity $\ell$ defined by $\ell(x, y) = \|x - y\|^a$; this, in fact, holds for all $\varepsilon_0$.

Proof Note that $x^* = (0, \ldots, 0)$ is such that $f^* = 1 = f(x^*)$. Therefore, for all $x \in \mathcal{X}$,

$$f^* - f(x) = \|x\|^a = \ell(x^*, x),$$

which yields the first part of Assumption A2'. To prove that the second part is true for $L = 2^{a-1}$ and with no constraint on the considered $\varepsilon$, we first note that since $a \geq 1$, it holds by convexity that $(u + v)^a \leq 2^{a-1}(u^a + v^a)$ for all $u, v \geq 0$.
Now, for all $\varepsilon > 0$ and $y \in \mathcal{B}\bigl(x, \|x\|^a + \varepsilon\bigr)$, i.e., $y$ such that $\ell(x, y) = \|x - y\|^a \leq \|x\|^a + \varepsilon$,

$$f^* - f(y) = \|y\|^a \leq \bigl(\|x\| + \|x - y\|\bigr)^a \leq 2^{a-1}\bigl(\|x\|^a + \|x - y\|^a\bigr) \leq 2^{a-1}\bigl(2\|x\|^a + \varepsilon\bigr),$$

which concludes the proof of the second part of A2'.

4.4.2 Modified HOO algorithm

We now describe the proposed modifications to the basic HOO algorithm. We first consider, as a building block, the algorithm called $z$-HOO, which takes an integer $z$ as an additional parameter to those of HOO. Algorithm $z$-HOO works as follows: it never plays any node with depth smaller than or equal to $z - 1$ and starts directly the selection of a new node at depth $z$. To do so, it first picks the node at depth $z$ with the best $B$-value, chooses a path and then proceeds as the basic HOO algorithm. Note in particular that the initialization of this algorithm consists (in the first $2^z$ rounds) in playing once each of the $2^z$ nodes located at depth $z$ in the tree (since by definition a node that has not been played yet has a $B$-value equal to $+\infty$). We note in passing that when $z = 0$, algorithm $z$-HOO coincides with the basic HOO algorithm.

Algorithm local-HOO employs the doubling trick in conjunction with consecutive instances of $z$-HOO. It works as follows. The integers $r \geq 1$ will index different regimes. The $r$th regime starts at round $2^r - 1$ and ends when the next regime starts; it thus lasts for $2^r$ rounds. At the beginning of regime $r$, a fresh copy of $z_r$-HOO, where $z_r = \lceil \log_2 r \rceil$, is initialized and is then used throughout the regime. Note that each fresh start needs to pull each of the $2^{z_r}$ nodes located at depth $z_r$ at least once (the number of these nodes is $\approx r$). However, since regime $r$ lasts for $2^r$ time steps (which is exponentially larger than the number of nodes to explore), the time spent on the initialization of $z_r$-HOO in any regime $r$ is greatly outnumbered by the time spent in the rest of the regime.

In the rest of this section, we propose first an upper bound on the regret of $z$-HOO (with exact and explicit constants). This result will play a key role in proving a bound on the performance of local-HOO.

4.4.3 Adaptation of the regret bound

In the following we write $h_0$ for the smallest integer such that

$$2\nu_1 \rho^{h_0} < \varepsilon_0$$

and consider the algorithm $z$-HOO, where $z \geq h_0$. In particular, when $z = 0$ is chosen, the obtained bound is the same as the one of Theorem 6, except that the constants are given in analytic forms.

Theorem 9 (Regret bound for $z$-HOO) Consider $z$-HOO tuned with parameters $\nu_1$ and $\rho$ such that Assumptions A1, A2' and A3 hold for some dissimilarity $\ell$ and the values $\nu_2, L, \varepsilon_0, C, d$. If, in addition, $z \geq h_0$ and $n \geq 2$ is large enough so that

$$z \leq \frac{1}{d+2}\, \frac{\ln(4L\nu_1 n) - \ln(\gamma \ln n)}{\ln(1/\rho)}, \qquad \text{where} \qquad \gamma = \frac{4 C L \nu_1 \nu_2^{-d}}{(1/\rho)^{d+1} - 1} \left( \frac{16}{\nu_1^2 \rho^2} + 9 \right),$$

then the following bound holds for the expected regret of $z$-HOO:

$$\mathbb{E}[R_n] \leq \left( 1 + \frac{1}{\rho^{d+2}} \right) \bigl(4L\nu_1 n\bigr)^{(d+1)/(d+2)} \bigl(\gamma \ln n\bigr)^{1/(d+2)} + \bigl(2^z - 1\bigr)\left( \frac{8 \ln n}{\nu_1^2 \rho^{2z}} + 4 \right).$$

The proof, which is a modification of the proof of Theorem 6, can be found in Section A.3 of the Appendix. The main complication arises because the weakened assumptions do not allow one to reason about the smoothness at an arbitrary scale; this is essentially due to the threshold $\varepsilon_0$ used in the formulation of the assumptions.
This is why in the proposed variant of HOO we discard nodes located too close to the root (at depth smaller than $h_0 - 1$). Note that in the bound the second term arises from playing in regions corresponding to the descendants of "poor" nodes located at level $z$. In particular, this term disappears when $z = 0$, in which case we get a bound on the regret of HOO provided that $2\nu_1 < \varepsilon_0$ holds.

Example 5 We consider again the setting of Examples 2, 3, and 4. The domain is $\mathcal{X} = [0,1]^D$ and the mean-payoff function $f$ is defined by $f(x) = 1 - \|x\|_\infty^2$. We assume that HOO is run with parameters $\rho = (1/4)^{1/D}$ and $\nu_1 = 4$. We already proved that Assumptions A1, A2' and A3 are satisfied with the dissimilarity $\ell(x, y) = \|x - y\|_\infty^2$, the constants $\nu_2 = 1/4$, $L = 2$, $d = 0$, and $C = 128^{D/2}$ (see footnote 5), as well as any $\varepsilon_0 > 0$ (that is, with $h_0 = 0$). Thus, resorting to Theorem 9 (applied with $z = 0$), we obtain

$$\gamma = \frac{32 \times 128^{D/2}}{4^{1/D} - 1} \bigl( 4^{2/D} + 9 \bigr)$$

and get

$$\mathbb{E}[R_n] \leq \bigl(1 + 4^{2/D}\bigr) \sqrt{32 \gamma\, n \ln n} = \sqrt{\exp\bigl(O(D)\bigr)\, n \ln n}.$$

Under the prescribed assumptions, the rate of convergence is of order $\sqrt{n}$ no matter the ambient dimension $D$. Although the rate is independent of $D$, the latter impacts the performance through the multiplicative factor in front of the rate, which is exponential in $D$. This is, however, not an artifact of our analysis, since it is natural that exploration in a $D$-dimensional space comes at a cost exponential in $D$. (The exploration performed by HOO naturally combines an initial global search, which is bound to be exponential in $D$, and a local optimization, whose regret is of the order of $\sqrt{n}$.)

Footnote 5: To compute $C$, one can first note that $4L\nu_1/\nu_2 = 128$; the question at hand for Assumption A3 to be satisfied is therefore to upper bound the number of balls of radius $\sqrt{\varepsilon}$ (w.r.t. the supremum norm $\|\cdot\|_\infty$) that can be packed in a ball of radius $\sqrt{128\varepsilon}$, giving rise to the bound $C \leq \sqrt{128}^{\,D}$.

The following theorem is an almost straightforward consequence of Theorem 9 (the detailed proof can be found in Section A.4 of the Appendix). Note that local-HOO does not require the knowledge of the parameter $\varepsilon_0$ in A2'.

Theorem 10 (Regret bound for local-HOO) Consider local-HOO and assume that its parameters are tuned such that Assumptions A1, A2' and A3 hold for some dissimilarity $\ell$. Then the expected regret of local-HOO is bounded (in a distribution-dependent sense) as follows:

$$\mathbb{E}[R_n] = \tilde{O}\bigl( n^{(d+1)/(d+2)} \bigr).$$

4.5 Minimax optimality in metric spaces

In this section we provide two theorems showing the minimax optimality of HOO in metric spaces. The notion of packing dimension is key.

Definition 11 (Packing dimension) The $\ell$-packing dimension of a set $\mathcal{X}$ (w.r.t. a dissimilarity $\ell$) is defined as

$$\limsup_{\varepsilon \to 0} \frac{\ln N(\mathcal{X}, \ell, \varepsilon)}{\ln(\varepsilon^{-1})}.$$

For instance, it is easy to see that whenever $\ell$ is a norm, compact subsets of $\mathbb{R}^D$ with non-empty interiors have a packing dimension of $D$. We note in passing that the packing dimension provides a bound on the near-optimality dimension that depends only on $\mathcal{X}$ and $\ell$ but not on the underlying mean-payoff function.

Let $\mathcal{F}_{\mathcal{X},\ell}$ be the class of all bandit environments on $\mathcal{X}$ with a weakly Lipschitz mean-payoff function (i.e., satisfying Assumption A2). For the sake of clarity, we now denote, for a bandit strategy $\varphi$ and
a bandit environment $M$ on $\mathcal{X}$, the expectation of the cumulative regret of $\varphi$ over $M$ at time $n$ by $\mathbb{E}_M\bigl[R_n(\varphi)\bigr]$.

The following theorem provides a uniform upper bound on the regret of HOO over this class of environments. It is a corollary of Theorem 9; most of the efforts in the proof consist of showing that the distribution-dependent constant $\gamma$ in the statement of Theorem 9 can be upper bounded by a quantity (the $\gamma$ in the statement below) that only depends on $\mathcal{X}$, $\nu_1$, $\rho$, $\ell$, $\nu_2$, $D'$, but not on the underlying mean-payoff functions. The proof is provided in Section A.5 of the Appendix.

Theorem 12 (Uniform upper bound on the regret of HOO) Assume that $\mathcal{X}$ has a finite $\ell$-packing dimension $D$ and that the parameters of HOO are such that A1 is satisfied. Then, for all $D' > D$ there exists a constant $\gamma$ such that for all $n \geq 1$,

$$\sup_{M \in \mathcal{F}_{\mathcal{X},\ell}} \mathbb{E}_M\bigl[R_n(\mathrm{HOO})\bigr] \leq \gamma\, n^{(D'+1)/(D'+2)} \bigl(\ln n\bigr)^{1/(D'+2)}.$$

The next result shows that in the case of metric spaces this upper bound is optimal up to a multiplicative logarithmic factor. Similar lower bounds appeared in [21] (for $D = 1$) and in [22]. We propose here a weaker statement that suits our needs. Note that if $\mathcal{X}$ is a large enough compact subset of $\mathbb{R}^D$ with non-empty interior and the dissimilarity $\ell$ is some norm of $\mathbb{R}^D$, then the assumption of the following theorem is satisfied.

Theorem 13 (Uniform lower bound) Consider a set $\mathcal{X}$ equipped with a dissimilarity $\ell$ that is a metric. Assume that there exists some constant $c \in (0, 1]$ such that for all $\varepsilon \leq 1$, the packing numbers satisfy

$$N(\mathcal{X}, \ell, \varepsilon) \geq c\, \varepsilon^{-D} \geq 2.$$

Then, there exist two constants $N(c, D)$ and $\gamma(c, D)$, depending only on $c$ and $D$, such that for all bandit strategies $\varphi$ and all $n \geq N(c, D)$,

$$\sup_{M \in \mathcal{F}_{\mathcal{X},\ell}} \mathbb{E}_M\bigl[R_n(\varphi)\bigr] \geq \gamma(c, D)\, n^{(D+1)/(D+2)}.$$

The reader interested in the explicit expressions of $N(c, D)$ and $\gamma(c, D)$ is referred to the last lines of the proof of the theorem in the Appendix.

5 Discussion

In this section we would like to shed some light on the results of the previous sections. In particular we generalize the situation of Example 5, discuss the regret that we can obtain, and compare it with what could be obtained by previous works.

5.1 Examples of regret bounds for functions locally smooth at their maxima

We equip $\mathcal{X} = [0,1]^D$ with a norm $\|\cdot\|$. We assume that the mean-payoff function $f$ has a finite number of global maxima and that it is locally equivalent to the function $\|x - x^*\|^\alpha$ (with degree $\alpha \in [0, \infty)$) around each such global maximum $x^*$ of $f$; that is,

$$f(x^*) - f(x) = \Theta\bigl( \|x - x^*\|^\alpha \bigr) \quad \text{as } x \to x^*.$$

This means that there exist $c_1, c_2, \delta > 0$ such that for all $x$ satisfying $\|x - x^*\| \leq \delta$,

$$c_2 \|x - x^*\|^\alpha \leq f(x^*) - f(x) \leq c_1 \|x - x^*\|^\alpha.$$

In particular, one can check that Assumption A2' is satisfied for the dissimilarity defined by $\ell_{c,\beta}(x, y) = c \|x - y\|^\beta$, where $\beta \leq \alpha$ (and $c > c_1$ when $\beta = \alpha$). We further assume that HOO is run with parameters $\nu_1$ and $\rho$ and a tree of dyadic partitions such that Assumption A1 is satisfied as well (see Examples 1 and 2 for explicit values of these parameters in the case of the Euclidean or the supremum norms over the unit cube). The following statements can then be formulated on the expected regret of HOO.

• Known smoothness: If we know the true smoothness of $f$ around its maxima, then we set $\beta = \alpha$ and $c > c_1$.
This choice $\ell_{c_1,\alpha}$ of a dissimilarity is such that $f$ is locally weakly Lipschitz with respect to it and the near-optimality dimension is $d = 0$ (cf. Example 3). Theorem 10 thus implies that the expected regret of local-HOO is $\widetilde{O}(\sqrt{n})$, i.e., the rate of the bound is independent of the dimension $D$.

• Smoothness underestimated: Here, we assume that the true smoothness of $f$ around its maxima is unknown and that it is underestimated by choosing $\beta < \alpha$ (and some $c$). Then $f$ is still locally weakly Lipschitz with respect to the dissimilarity $\ell_{c,\beta}$ and the near-optimality dimension is $d = D(1/\beta - 1/\alpha)$, as shown in Example 3; the regret of HOO is $\widetilde{O}\bigl(n^{(d+1)/(d+2)}\bigr)$.

• Smoothness overestimated: Now, if the true smoothness is overestimated by choosing $\beta > \alpha$, or $\beta = \alpha$ and $c < c_1$, then the assumption of weak Lipschitzness is violated and we are unable to provide any guarantee on the behavior of HOO. The latter, when used with an overestimated smoothness parameter, may lack exploration and exploit too heavily from the beginning. As a consequence, it may get stuck in some local optimum of $f$, missing the global one(s) for a very long time (possibly indefinitely). Such a behavior is illustrated by the example provided in [13], which shows the possibly problematic behavior of the closely related algorithm UCT of [24]. UCT is an example of an algorithm overestimating the smoothness of the function; this is because the $B$-values of UCT are defined similarly to the ones of the HOO algorithm, but without the third term in the definition (2) of the $U$-values. This corresponds to an assumed infinite degree of smoothness (that is, to a locally constant mean-payoff function).
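To make the cost of underestimating the smoothness concrete, the following short sketch (our own illustration; the helper name and sample values are not from the paper) evaluates the near-optimality dimension $d = D(1/\beta - 1/\alpha)$ of Example 3 and the resulting rate exponent $(d+1)/(d+2)$.

```python
def rate_exponent(D: int, alpha: float, beta: float) -> float:
    """Exponent of n in the regret bound O~(n^((d+1)/(d+2))) when HOO is run with a
    dissimilarity c*||x-y||^beta and the true local smoothness degree is alpha >= beta
    (near-optimality dimension d = D*(1/beta - 1/alpha), cf. Example 3).
    Illustrative helper only."""
    assert 0 < beta <= alpha
    d = D * (1.0 / beta - 1.0 / alpha)
    return (d + 1.0) / (d + 2.0)

# beta = alpha recovers d = 0 and the sqrt(n) rate (exponent 1/2); the exponent
# degrades towards 1 as beta underestimates alpha more severely.
print(rate_exponent(D=2, alpha=2.0, beta=2.0))   # 0.5
print(rate_exponent(D=2, alpha=2.0, beta=1.0))   # 2/3
```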
5.2 Relation to previous works

Several works [3; 21; 12; 6; 22] have considered continuum-armed bandits in Euclidean or, more generally, normed or metric spaces, and provided upper and lower bounds on the regret for given classes of environments.

• Cope [12] derived a $\widetilde{O}(\sqrt{n})$ bound on the regret for compact and convex subsets of $\mathbb{R}^d$ and mean-payoff functions with a unique minimum and second-order smoothness.

• Kleinberg [21] considered mean-payoff functions $f$ on the real line that are Hölder continuous with degree $0 < \alpha \le 1$. The derived regret bound is $\Theta\bigl(n^{(\alpha+1)/(\alpha+2)}\bigr)$.

• Auer et al. [6] extended the analysis to classes of functions that are equivalent to $\|x - x^*\|^\alpha$ around their maxima $x^*$, where the allowed smoothness degree is also larger: $\alpha \in [0,\infty)$. They derived the regret bound $\Theta\bigl(n^{\frac{1+\alpha-\alpha\beta}{1+2\alpha-\alpha\beta}}\bigr)$, where the parameter $\beta$ is such that the Lebesgue measure of the set of $\varepsilon$-optimal arms is $O(\varepsilon^\beta)$.

• Another setting is the one of [22] and [23], who considered a space $(\mathcal{X},\ell)$ equipped with some dissimilarity $\ell$ and assumed that $f$ is Lipschitz w.r.t. $\ell$ at some maximum $x^*$ (when the latter exists, and a relaxed condition otherwise), that is,
$$\forall x \in \mathcal{X}, \qquad f(x^*) - f(x) \le \ell(x, x^*). \qquad (7)$$
The obtained regret bound is $\widetilde{O}\bigl(n^{(d+1)/(d+2)}\bigr)$, where $d$ is the zooming dimension. The latter is defined similarly to our near-optimality dimension, with the exceptions that in the definition of the zooming dimension (i) covering numbers instead of packing numbers are used and (ii) sets of the form $\mathcal{X}_\varepsilon \setminus \mathcal{X}_{\varepsilon/2}$ are considered instead of the set $\mathcal{X}_{c\varepsilon}$.

When $(\mathcal{X},\ell)$ is a metric space, covering and packing numbers are within a constant factor of each other, and therefore one may prove that the zooming and near-optimality dimensions are equal as well.

For an illustration, consider again the example of Section 5.1. The result of Auer et al. [6] shows that for $D = 1$ the regret is $\Theta(\sqrt{n})$ (since here $\beta = 1/\alpha$, with the notation above). Our result extends the $\sqrt{n}$ rate of the regret bound to any dimension $D$. On the other hand, the analysis of Kleinberg et al. [23] does not apply, because in this example $f(x^*) - f(x)$ is controlled only when $x$ is close in some sense to $x^*$ (i.e., when $\|x - x^*\| \le \delta$), while (7) requires such a control over the whole set $\mathcal{X}$. However, note that the local weak-Lipschitz assumption A2' requires an extra condition in the vicinity of $x^*$ compared to (7), as it is based on the notion of weak Lipschitzness. Thus, A2' and (7) are in general incomparable (each captures a different phenomenon at the maxima).

We now compare our results to those of [22] and [23] under Assumption A2 (which does not cover the example of Section 5.1 unless $\delta$ is large). Under this assumption, our algorithms enjoy essentially the same theoretical guarantees as the zooming algorithm of [22; 23]. Further, the following hold.

• Our algorithms do not require the oracle needed by the zooming algorithm.

• Our truncated HOO algorithm achieves a computational complexity of order $O(n\log n)$, whereas the complexity of a naive implementation of the zooming algorithm is likely to be much larger. (The zooming algorithm requires a covering oracle that is able to return a point which is not covered by the set of active strategies, if there exists one. A straightforward implementation of this covering oracle might be computationally expensive in general continuous spaces and would require a "global" search over the whole space.)

• Both truncated HOO and the zooming algorithm use the doubling trick. The basic HOO algorithm, however, avoids the doubling trick while meeting the computational complexity of the zooming algorithm. The fact that the doubling trick can be avoided is good news, since an algorithm that uses the doubling trick must restart from tabula rasa from time to time, which results in predictable, yet inevitable, sharp performance drops (a quite unpleasant property). In particular, for this reason algorithms that rely on the doubling trick are often neglected by practitioners. In addition, the fact that we avoid the oracle needed by the zooming algorithm is attractive, as this oracle might be difficult to implement for general (non-metric) dissimilarities.

Acknowledgements

We thank one of the anonymous referees for his valuable comments, which helped us to provide a fair and detailed comparison of our work to prior contributions. This work was supported in part by the French National Research Agency (ANR, project EXPLO-RA, ANR-08-COSI-004), the Alberta Ingenuity Centre for Machine Learning, Alberta Innovates Technology Futures (formerly iCore and AIF), NSERC, and the PASCAL2 Network of Excellence under EC grant no. 216886.

A Proofs

A.1 Proof of Theorem 6 (main upper bound on the regret of HOO)

We begin with three lemmas.
The proofs of Lemmas 15 and 16 rely on concentration-of-measure techniques, while that of Lemma 14 follows from a simple case study. Let us fix some path $(0,1), (1,i^*_1), (2,i^*_2), \ldots$ of optimal nodes, starting from the root. That is, denoting $i^*_0 = 1$, we mean that for all $j \ge 1$, the suboptimality factor of $(j, i^*_j)$ equals $\Delta_{j,i^*_j} = 0$ and $(j, i^*_j)$ is a child of $(j-1, i^*_{j-1})$.

Lemma 14  Let $(h,i)$ be a suboptimal node. Let $0 \le k \le h-1$ be the largest depth such that $(k, i^*_k)$ is on the path from the root $(0,1)$ to $(h,i)$. Then, for all integers $u \ge 0$, we have
$$\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le u + \sum_{t=u+1}^{n}\mathbb{P}\Bigl\{U_{s,i^*_s}(t) \le f^*\ \text{for some}\ s \in \{k+1,\ldots,t-1\},\ \text{or}\ T_{h,i}(t) > u\ \text{and}\ U_{h,i}(t) > f^*\Bigr\}.$$

Proof  Consider a given round $t \in \{1,\ldots,n\}$. If $(H_t, I_t) \in \mathcal{C}(h,i)$, then this is because the child $(k+1, i')$ of $(k, i^*_k)$ on the path to $(h,i)$ had a better $B$-value than its brother $(k+1, i^*_{k+1})$. Since by definition $B$-values can only increase along a chosen path, this entails that $B_{k+1,i^*_{k+1}}(t) \le B_{k+1,i'}(t) \le B_{h,i}(t)$. This in turn implies, again by definition of the $B$-values, that $B_{k+1,i^*_{k+1}}(t) \le U_{h,i}(t)$. Thus,
$$\bigl\{(H_t,I_t) \in \mathcal{C}(h,i)\bigr\} \subset \bigl\{U_{h,i}(t) \ge B_{k+1,i^*_{k+1}}(t)\bigr\} \subset \bigl\{U_{h,i}(t) > f^*\bigr\} \cup \bigl\{B_{k+1,i^*_{k+1}}(t) \le f^*\bigr\}.$$
But, once again by definition of the $B$-values,
$$\bigl\{B_{k+1,i^*_{k+1}}(t) \le f^*\bigr\} \subset \bigl\{U_{k+1,i^*_{k+1}}(t) \le f^*\bigr\} \cup \bigl\{B_{k+2,i^*_{k+2}}(t) \le f^*\bigr\},$$
and the argument can be iterated. Since up to round $t$ no more than $t$ nodes have been played (including the suboptimal node $(h,i)$), we know that $(t, i^*_t)$ has not been played so far and thus has a $B$-value equal to $+\infty$. (Some of the previous optimal nodes could also have had an infinite $U$-value, if not played so far.) We thus have proved the inclusion
$$\bigl\{(H_t,I_t) \in \mathcal{C}(h,i)\bigr\} \subset \bigl\{U_{h,i}(t) > f^*\bigr\} \cup \Bigl(\bigl\{U_{k+1,i^*_{k+1}}(t) \le f^*\bigr\} \cup \ldots \cup \bigl\{U_{t-1,i^*_{t-1}}(t) \le f^*\bigr\}\Bigr). \qquad (8)$$
Now, for any integer $u \ge 0$ it holds that
$$T_{h,i}(n) = \sum_{t=1}^{n}\mathbb{I}_{\{(H_t,I_t)\in\mathcal{C}(h,i),\ T_{h,i}(t) \le u\}} + \sum_{t=1}^{n}\mathbb{I}_{\{(H_t,I_t)\in\mathcal{C}(h,i),\ T_{h,i}(t) > u\}} \le u + \sum_{t=u+1}^{n}\mathbb{I}_{\{(H_t,I_t)\in\mathcal{C}(h,i),\ T_{h,i}(t) > u\}},$$
where we used for the inequality the fact that the quantities $T_{h,i}(t)$ are constant from $t$ to $t+1$, except when $(H_t,I_t) \in \mathcal{C}(h,i)$, in which case they increase by 1; therefore, on the one hand, at most $u$ of the $T_{h,i}(t)$ can be smaller than $u$ and, on the other hand, $T_{h,i}(t) > u$ can only happen if $t > u$. Using (8) and then taking expectations yields the result.

Lemma 15  Let Assumptions A1 and A2 hold. Then, for all optimal nodes $(h,i)$ and for all integers $n \ge 1$,
$$\mathbb{P}\bigl\{U_{h,i}(n) \le f^*\bigr\} \le n^{-3}.$$

Proof  On the event that $(h,i)$ was not played during the first $n$ rounds, one has, by convention, $U_{h,i}(n) = +\infty$. In the sequel, we therefore restrict our attention to the event $\bigl\{T_{h,i}(n) \ge 1\bigr\}$. Lemma 3 with $c = 0$ ensures that $f^* - f(x) \le \nu_1\rho^h$ for all arms $x \in \mathcal{P}_{h,i}$.
Hence,
$$\sum_{t=1}^{n}\bigl(f(X_t) + \nu_1\rho^h - f^*\bigr)\,\mathbb{I}_{\{(H_t,I_t)\in\mathcal{C}(h,i)\}} \ge 0$$
and therefore,
$$\begin{aligned}
\mathbb{P}\bigl\{U_{h,i}(n) \le f^*\ \text{and}\ T_{h,i}(n) \ge 1\bigr\}
&= \mathbb{P}\biggl\{\widehat{\mu}_{h,i}(n) + \sqrt{\frac{2\ln n}{T_{h,i}(n)}} + \nu_1\rho^h \le f^*\ \text{and}\ T_{h,i}(n) \ge 1\biggr\}\\
&= \mathbb{P}\Bigl\{T_{h,i}(n)\,\widehat{\mu}_{h,i}(n) + T_{h,i}(n)\bigl(\nu_1\rho^h - f^*\bigr) \le -\sqrt{2\,T_{h,i}(n)\ln n}\ \text{and}\ T_{h,i}(n) \ge 1\Bigr\}\\
&= \mathbb{P}\biggl\{\sum_{t=1}^{n}\bigl(Y_t - f(X_t)\bigr)\mathbb{I}_{\{(H_t,I_t)\in\mathcal{C}(h,i)\}} + \sum_{t=1}^{n}\bigl(f(X_t) + \nu_1\rho^h - f^*\bigr)\mathbb{I}_{\{(H_t,I_t)\in\mathcal{C}(h,i)\}} \le -\sqrt{2\,T_{h,i}(n)\ln n}\ \text{and}\ T_{h,i}(n) \ge 1\biggr\}\\
&\le \mathbb{P}\biggl\{\sum_{t=1}^{n}\bigl(f(X_t) - Y_t\bigr)\mathbb{I}_{\{(H_t,I_t)\in\mathcal{C}(h,i)\}} \ge \sqrt{2\,T_{h,i}(n)\ln n}\ \text{and}\ T_{h,i}(n) \ge 1\biggr\}.
\end{aligned}$$
We take care of the last term with a union bound and the Hoeffding-Azuma inequality for martingale differences. To do this in a rigorous manner, we need to define a sequence of (random) stopping times at which arms in $\mathcal{C}(h,i)$ were pulled:
$$T_j = \min\bigl\{t : T_{h,i}(t) = j\bigr\}, \qquad j = 1, 2, \ldots$$
Note that $1 \le T_1 < T_2 < \ldots$, hence it holds that $T_j \ge j$. We denote by $\widetilde{X}_j = X_{T_j}$ the $j$-th arm pulled in the region corresponding to $\mathcal{C}(h,i)$; its associated reward equals $\widetilde{Y}_j = Y_{T_j}$, and
$$\mathbb{P}\biggl\{\sum_{t=1}^{n}\bigl(f(X_t) - Y_t\bigr)\mathbb{I}_{\{(H_t,I_t)\in\mathcal{C}(h,i)\}} \ge \sqrt{2\,T_{h,i}(n)\ln n}\ \text{and}\ T_{h,i}(n) \ge 1\biggr\}
= \mathbb{P}\biggl\{\sum_{j=1}^{T_{h,i}(n)}\bigl(f(\widetilde{X}_j) - \widetilde{Y}_j\bigr) \ge \sqrt{2\,T_{h,i}(n)\ln n}\ \text{and}\ T_{h,i}(n) \ge 1\biggr\}
\le \sum_{t=1}^{n}\mathbb{P}\biggl\{\sum_{j=1}^{t}\bigl(f(\widetilde{X}_j) - \widetilde{Y}_j\bigr) \ge \sqrt{2t\ln n}\biggr\},$$
where we used a union bound to get the last inequality. We claim that
$$Z_t = \sum_{j=1}^{t}\bigl(f(\widetilde{X}_j) - \widetilde{Y}_j\bigr)$$
is a martingale w.r.t. the filtration $\mathcal{G}_t = \sigma\bigl(\widetilde{X}_1, Z_1, \ldots, \widetilde{X}_t, Z_t, \widetilde{X}_{t+1}\bigr)$. This follows, via optional skipping (see [14, Chapter VII, adaptation of Theorem 2.3]), from the facts that
$$\sum_{t=1}^{n}\bigl(f(X_t) - Y_t\bigr)\mathbb{I}_{\{(H_t,I_t)\in\mathcal{C}(h,i)\}}$$
is a martingale w.r.t. the filtration $\mathcal{F}_t = \sigma(X_1, Y_1, \ldots, X_t, Y_t, X_{t+1})$ and that the events $\{T_j = k\}$ are $\mathcal{F}_{k-1}$-measurable for all $k \ge j$. Applying the Hoeffding-Azuma inequality for martingale differences (see [20]), and using the boundedness of the ranges of the induced martingale difference sequence, we then get, for each $t \ge 1$,
$$\mathbb{P}\biggl\{\sum_{j=1}^{t}\bigl(f(\widetilde{X}_j) - \widetilde{Y}_j\bigr) \ge \sqrt{2t\ln n}\biggr\} \le \exp\Biggl(-\frac{2\bigl(\sqrt{2t\ln n}\bigr)^2}{t}\Biggr) = n^{-4},$$
which concludes the proof.

Lemma 16  For all integers $t \le n$, for all suboptimal nodes $(h,i)$ such that $\Delta_{h,i} > \nu_1\rho^h$, and for all integers $u \ge 1$ such that
$$u \ge \frac{8\ln n}{(\Delta_{h,i} - \nu_1\rho^h)^2},$$
one has $\mathbb{P}\bigl\{U_{h,i}(t) > f^*\ \text{and}\ T_{h,i}(t) > u\bigr\} \le t\,n^{-4}$.

Proof  The $u$ mentioned in the statement of the lemma are such that $(\Delta_{h,i} - \nu_1\rho^h)/2 \ge \sqrt{2\ln n/u}$, thus
$$\sqrt{\frac{2\ln t}{u}} + \nu_1\rho^h \le \frac{\Delta_{h,i} + \nu_1\rho^h}{2}.$$
Therefore,
$$\begin{aligned}
\mathbb{P}\bigl\{U_{h,i}(t) > f^*\ \text{and}\ T_{h,i}(t) > u\bigr\}
&= \mathbb{P}\biggl\{\widehat{\mu}_{h,i}(t) + \sqrt{\frac{2\ln t}{T_{h,i}(t)}} + \nu_1\rho^h > f^*_{h,i} + \Delta_{h,i}\ \text{and}\ T_{h,i}(t) > u\biggr\}\\
&\le \mathbb{P}\biggl\{\widehat{\mu}_{h,i}(t) > f^*_{h,i} + \frac{\Delta_{h,i} - \nu_1\rho^h}{2}\ \text{and}\ T_{h,i}(t) > u\biggr\}\\
&\le \mathbb{P}\biggl\{T_{h,i}(t)\bigl(\widehat{\mu}_{h,i}(t) - f^*_{h,i}\bigr) \ge \frac{\Delta_{h,i} - \nu_1\rho^h}{2}\,T_{h,i}(t)\ \text{and}\ T_{h,i}(t) > u\biggr\}\\
&= \mathbb{P}\biggl\{\sum_{s=1}^{t}\bigl(Y_s - f^*_{h,i}\bigr)\mathbb{I}_{\{(H_s,I_s)\in\mathcal{C}(h,i)\}} \ge \frac{\Delta_{h,i} - \nu_1\rho^h}{2}\,T_{h,i}(t)\ \text{and}\ T_{h,i}(t) > u\biggr\}\\
&\le \mathbb{P}\biggl\{\sum_{s=1}^{t}\bigl(Y_s - f(X_s)\bigr)\mathbb{I}_{\{(H_s,I_s)\in\mathcal{C}(h,i)\}} \ge \frac{\Delta_{h,i} - \nu_1\rho^h}{2}\,T_{h,i}(t)\ \text{and}\ T_{h,i}(t) > u\biggr\}.
\end{aligned}$$
Now it follows from the same arguments as in the proof of Lemma 15 (optional skipping, the Hoeffding-Azuma inequality, and a union bound) that
$$\begin{aligned}
\mathbb{P}\biggl\{\sum_{s=1}^{t}\bigl(Y_s - f(X_s)\bigr)\mathbb{I}_{\{(H_s,I_s)\in\mathcal{C}(h,i)\}} \ge \frac{\Delta_{h,i} - \nu_1\rho^h}{2}\,T_{h,i}(t)\ \text{and}\ T_{h,i}(t) > u\biggr\}
&\le \sum_{s'=u+1}^{t}\exp\Biggl(-\frac{2}{s'}\biggl(\frac{(\Delta_{h,i} - \nu_1\rho^h)\,s'}{2}\biggr)^{2}\Biggr)\\
&\le \sum_{s'=u+1}^{t}\exp\Bigl(-\tfrac{1}{2}\,s'\,\bigl(\Delta_{h,i} - \nu_1\rho^h\bigr)^2\Bigr)\\
&\le t\,\exp\Bigl(-\tfrac{1}{2}\,u\,\bigl(\Delta_{h,i} - \nu_1\rho^h\bigr)^2\Bigr) \le t\,n^{-4},
\end{aligned}$$
where we used the stated bound on $u$ to obtain the last inequality.

Combining the results of Lemmas 14, 15, and 16 leads to the following key result, bounding the expected number of visits to descendants of a "poor" node.

Lemma 17  Under Assumptions A1 and A2, for all suboptimal nodes $(h,i)$ with $\Delta_{h,i} > \nu_1\rho^h$, we have, for all $n \ge 1$,
$$\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le \frac{8\ln n}{(\Delta_{h,i} - \nu_1\rho^h)^2} + 4.$$

Proof  We take $u$ as the upper integer part of $(8\ln n)/(\Delta_{h,i} - \nu_1\rho^h)^2$ and use union bounds to get from Lemma 14 the bound
$$\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le \frac{8\ln n}{(\Delta_{h,i} - \nu_1\rho^h)^2} + 1 + \sum_{t=u+1}^{n}\Biggl(\mathbb{P}\bigl\{T_{h,i}(t) > u\ \text{and}\ U_{h,i}(t) > f^*\bigr\} + \sum_{s=1}^{t-1}\mathbb{P}\bigl\{U_{s,i^*_s}(t) \le f^*\bigr\}\Biggr).$$
Lemmas 15 and 16 further bound the quantity of interest as
$$\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le \frac{8\ln n}{(\Delta_{h,i} - \nu_1\rho^h)^2} + 1 + \sum_{t=u+1}^{n}\Biggl(t\,n^{-4} + \sum_{s=1}^{t-1}t^{-3}\Biggr)$$
and we now use the crude upper bounds
$$1 + \sum_{t=u+1}^{n}\Biggl(t\,n^{-4} + \sum_{s=1}^{t-1}t^{-3}\Biggr) \le 1 + \sum_{t=1}^{n}\bigl(n^{-3} + t^{-2}\bigr) \le 2 + \pi^2/6 \le 4$$
to get the proposed statement.

Proof (of Theorem 6)  First, let us fix $d' > d$. The statement will be proven in four steps.

First step.  For all $h = 0, 1, 2, \ldots$, denote by $\mathcal{I}_h$ the set of those nodes at depth $h$ that are $2\nu_1\rho^h$-optimal, i.e., the nodes $(h,i)$ such that $f^*_{h,i} \ge f^* - 2\nu_1\rho^h$. (Of course, $\mathcal{I}_0 = \{(0,1)\}$.) Then, let $\mathcal{I}$ be the union of these sets as $h$ varies. Further, let $\mathcal{J}$ be the set of nodes that are not in $\mathcal{I}$ but whose parent is in $\mathcal{I}$. Finally, for $h = 1, 2, \ldots$ we denote by $\mathcal{J}_h$ the nodes in $\mathcal{J}$ that are located at depth $h$ in the tree (i.e., whose parent is in $\mathcal{I}_{h-1}$). Lemma 17 bounds in particular the expected number of times each node $(h,i) \in \mathcal{J}_h$ is visited. Since for these nodes $\Delta_{h,i} > 2\nu_1\rho^h$, we get
$$\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le \frac{8\ln n}{\nu_1^2\rho^{2h}} + 4.$$

Second step.  We bound the cardinality $|\mathcal{I}_h|$ of $\mathcal{I}_h$. We start with the case $h \ge 1$. By definition, when $(h,i) \in \mathcal{I}_h$ one has $\Delta_{h,i} \le 2\nu_1\rho^h$, so that by Lemma 3 the inclusion $\mathcal{P}_{h,i} \subset \mathcal{X}_{4\nu_1\rho^h}$ holds. Since by Assumption A1 the sets $\mathcal{P}_{h,i}$ contain disjoint balls of radius $\nu_2\rho^h$, we have that
$$|\mathcal{I}_h| \le \mathcal{N}\Bigl(\textstyle\bigcup_{(h,i)\in\mathcal{I}_h}\mathcal{P}_{h,i},\ \ell,\ \nu_2\rho^h\Bigr) \le \mathcal{N}\bigl(\mathcal{X}_{4\nu_1\rho^h},\ \ell,\ \nu_2\rho^h\bigr) = \mathcal{N}\bigl(\mathcal{X}_{(4\nu_1/\nu_2)\,\nu_2\rho^h},\ \ell,\ \nu_2\rho^h\bigr).$$
We prove below that there exists a constant $C$ such that for all $\varepsilon \le \nu_2$,
$$\mathcal{N}\bigl(\mathcal{X}_{(4\nu_1/\nu_2)\varepsilon},\ \ell,\ \varepsilon\bigr) \le C\,\varepsilon^{-d'}. \qquad (9)$$
Thus we obtain the bound $|\mathcal{I}_h| \le C\bigl(\nu_2\rho^h\bigr)^{-d'}$ for all $h \ge 1$. We note that this bound is still valid for $h = 0$, since $|\mathcal{I}_0| = 1$.

It only remains to prove (9). Since $d' > d$, where $d$ is the near-optimality dimension of $f$, we have, by definition, that
$$\limsup_{\varepsilon\to 0}\frac{\ln\mathcal{N}\bigl(\mathcal{X}_{(4\nu_1/\nu_2)\varepsilon},\ \ell,\ \varepsilon\bigr)}{\ln(\varepsilon^{-1})} \le d,$$
and thus there exists $\varepsilon_{d'} > 0$ such that for all $\varepsilon \le \varepsilon_{d'}$,
$$\frac{\ln\mathcal{N}\bigl(\mathcal{X}_{(4\nu_1/\nu_2)\varepsilon},\ \ell,\ \varepsilon\bigr)}{\ln(\varepsilon^{-1})} \le d',$$
which in turn implies that for all $\varepsilon \le \varepsilon_{d'}$, $\mathcal{N}\bigl(\mathcal{X}_{(4\nu_1/\nu_2)\varepsilon},\ \ell,\ \varepsilon\bigr) \le \varepsilon^{-d'}$. The result is proved with $C = 1$ if $\varepsilon_{d'} \ge \nu_2$. Now, consider the case $\varepsilon_{d'} < \nu_2$.
Given the definition of packing numbers, it is straightforward that for all $\varepsilon \in [\varepsilon_{d'}, \nu_2]$,
$$\mathcal{N}\bigl(\mathcal{X}_{(4\nu_1/\nu_2)\varepsilon},\ \ell,\ \varepsilon\bigr) \le u_{d'} \stackrel{\mathrm{def}}{=} \mathcal{N}\bigl(\mathcal{X},\ \ell,\ \varepsilon_{d'}\bigr);$$
therefore, for all $\varepsilon \in [\varepsilon_{d'}, \nu_2]$,
$$\mathcal{N}\bigl(\mathcal{X}_{(4\nu_1/\nu_2)\varepsilon},\ \ell,\ \varepsilon\bigr) \le u_{d'}\,\nu_2^{d'}\,\varepsilon^{-d'} = C\,\varepsilon^{-d'}$$
for the choice $C = \max\bigl\{1,\ u_{d'}\nu_2^{d'}\bigr\}$. Because we take the maximum with 1, the stated inequality also holds for $\varepsilon \le \varepsilon_{d'}$, which concludes the proof of (9).

Third step.  Let $H \ge 1$ be an integer to be chosen later. We partition the nodes of the infinite tree $\mathcal{T}$ into three subsets, $\mathcal{T} = \mathcal{T}^1 \cup \mathcal{T}^2 \cup \mathcal{T}^3$, as follows. Let the set $\mathcal{T}^1$ contain the descendants of the nodes in $\mathcal{I}_H$ (by convention, a node is considered its own descendant, hence the nodes of $\mathcal{I}_H$ are included in $\mathcal{T}^1$); let $\mathcal{T}^2 = \bigcup_{0 \le h < H}\mathcal{I}_h$; and let $\mathcal{T}^3$ contain the nodes of the sets $\mathcal{J}_h$, for $1 \le h \le H$, and their descendants. We denote by $R_{n,1}$, $R_{n,2}$, and $R_{n,3}$ the regret resulting from playing nodes in $\mathcal{T}^1$, $\mathcal{T}^2$, and $\mathcal{T}^3$ respectively, so that $\mathbb{E}[R_n] = \mathbb{E}[R_{n,1}] + \mathbb{E}[R_{n,2}] + \mathbb{E}[R_{n,3}]$.

We first bound the regret resulting from $\mathcal{T}^1$. The nodes of $\mathcal{I}_H$ are $2\nu_1\rho^H$-optimal, so that by Lemma 3 the corresponding domains are included in $\mathcal{X}_{4\nu_1\rho^H}$; bounding crudely by $n$ the number of times nodes of $\mathcal{T}^1$ are played, we get $\mathbb{E}[R_{n,1}] \le 4\nu_1\rho^H n$.

For $h \ge 0$, consider a node $(h,i) \in \mathcal{T}^2$. It belongs to $\mathcal{I}_h$ and is therefore $2\nu_1\rho^h$-optimal. By Lemma 3, the corresponding domain is included in $\mathcal{X}_{4\nu_1\rho^h}$. By the result of the second step of this proof and using that each node is played at most once, one gets
$$\mathbb{E}\bigl[R_{n,2}\bigr] \le \sum_{h=0}^{H-1} 4\nu_1\rho^h\,|\mathcal{I}_h| \le 4C\nu_1\nu_2^{-d'}\sum_{h=0}^{H-1}\rho^{h(1-d')}.$$
We finish by bounding the contribution of $\mathcal{T}^3$. We first remark that since the parent of any element $(h,i) \in \mathcal{J}_h$ is in $\mathcal{I}_{h-1}$, by Lemma 3 again we have that $\mathcal{P}_{h,i} \subset \mathcal{X}_{4\nu_1\rho^{h-1}}$. We now use the first step of this proof to get
$$\mathbb{E}\bigl[R_{n,3}\bigr] \le \sum_{h=1}^{H} 4\nu_1\rho^{h-1}\sum_{i:(h,i)\in\mathcal{J}_h}\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le \sum_{h=1}^{H} 4\nu_1\rho^{h-1}\,|\mathcal{J}_h|\biggl(\frac{8\ln n}{\nu_1^2\rho^{2h}} + 4\biggr).$$
Now, it follows from the fact that the parent of any element of $\mathcal{J}_h$ is in $\mathcal{I}_{h-1}$ that $|\mathcal{J}_h| \le 2\,|\mathcal{I}_{h-1}|$ for $h \ge 1$. Substituting this and the bound on $|\mathcal{I}_{h-1}|$ obtained in the second step of this proof, we get
$$\mathbb{E}\bigl[R_{n,3}\bigr] \le \sum_{h=1}^{H} 4\nu_1\rho^{h-1}\Bigl(2C\bigl(\nu_2\rho^{h-1}\bigr)^{-d'}\Bigr)\biggl(\frac{8\ln n}{\nu_1^2\rho^{2h}} + 4\biggr) \le 8C\nu_1\nu_2^{-d'}\sum_{h=1}^{H}\rho^{h(1-d')+d'-1}\biggl(\frac{8\ln n}{\nu_1^2\rho^{2h}} + 4\biggr).$$

Fourth step.  Putting the obtained bounds together, we get
$$\mathbb{E}\bigl[R_n\bigr] \le 4\nu_1\rho^H n + 4C\nu_1\nu_2^{-d'}\sum_{h=0}^{H-1}\rho^{h(1-d')} + 8C\nu_1\nu_2^{-d'}\sum_{h=1}^{H}\rho^{h(1-d')+d'-1}\biggl(\frac{8\ln n}{\nu_1^2\rho^{2h}} + 4\biggr) = O\Biggl(n\rho^H + (\ln n)\sum_{h=1}^{H}\rho^{-h(1+d')}\Biggr) = O\Bigl(n\rho^H + \rho^{-H(1+d')}\ln n\Bigr) \qquad (10)$$
(recall that $\rho < 1$). Note that all the constants hidden in the $O$ symbol only depend on $\nu_1$, $\nu_2$, $\rho$ and $d'$. Now, by choosing $H$ such that $\rho^{-H(d'+2)}$ is of the order of $n/\ln n$, that is, $\rho^H$ is of the order of $(n/\ln n)^{-1/(d'+2)}$, we get the desired result, namely,
$$\mathbb{E}\bigl[R_n\bigr] = O\Bigl(n^{(d'+1)/(d'+2)}(\ln n)^{1/(d'+2)}\Bigr).$$
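For the reader who wishes to verify the final exponent, the substitution behind this last step can be spelled out as follows (our own elaboration of the choice of $H$ above, not part of the original argument): with $\rho^H \asymp (n/\ln n)^{-1/(d'+2)}$, the two terms of (10) indeed match, since
$$n\rho^H \asymp n\Bigl(\frac{\ln n}{n}\Bigr)^{1/(d'+2)} = n^{(d'+1)/(d'+2)}(\ln n)^{1/(d'+2)}
\qquad\text{and}\qquad
\rho^{-H(1+d')}\ln n \asymp \Bigl(\frac{n}{\ln n}\Bigr)^{(d'+1)/(d'+2)}\ln n = n^{(d'+1)/(d'+2)}(\ln n)^{1/(d'+2)}.$$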
A.2 Proof of Theorem 8 (regret bound for truncated HOO)

The proof follows from an adaptation of the proof of Theorem 6 and of its associated lemmas; for the sake of clarity and precision, we explicitly state the adaptations of the latter.

Adaptations of the lemmas.  Remember that $D_{n_0}$ denotes the maximum depth of the tree, given horizon $n_0$. The adaptation of Lemma 14 is done as follows. Let $(h,i)$ be a suboptimal node with $h \le D_{n_0}$ and let $0 \le k \le h-1$ be the largest depth such that $(k, i^*_k)$ is on the path from the root $(0,1)$ to $(h,i)$. Then, for all integers $u \ge 0$, one has
$$\mathbb{E}\bigl[T_{h,i}(n_0)\bigr] \le u + \sum_{t=u+1}^{n_0}\mathbb{P}\Bigl\{U_{s,i^*_s}(t) \le f^*\ \text{for some}\ s\ \text{with}\ k+1 \le s \le \min\{D_{n_0}, n_0\},\ \text{or}\ T_{h,i}(t) > u\ \text{and}\ U_{h,i}(t) > f^*\Bigr\}.$$
As for Lemma 15, its straightforward adaptation states that under Assumptions A1 and A2, for all optimal nodes $(h,i)$ with $h \le D_{n_0}$ and for all integers $1 \le t \le n_0$,
$$\mathbb{P}\bigl\{U_{h,i}(t) \le f^*\bigr\} \le t\,(n_0)^{-4} \le (n_0)^{-3}.$$
Similarly, the same changes yield from Lemma 16 the following result for truncated HOO. For all integers $t \le n_0$, for all suboptimal nodes $(h,i)$ such that $h \le D_{n_0}$ and $\Delta_{h,i} > \nu_1\rho^h$, and for all integers $u \ge 1$ such that
$$u \ge \frac{8\ln n_0}{(\Delta_{h,i} - \nu_1\rho^h)^2},$$
one has $\mathbb{P}\bigl\{U_{h,i}(t) > f^*\ \text{and}\ T_{h,i}(t) > u\bigr\} \le t\,(n_0)^{-4}$.

Combining these three results (using the same methodology as in the proof of Lemma 17) shows that under Assumptions A1 and A2, for all suboptimal nodes $(h,i)$ such that $h \le D_{n_0}$ and $\Delta_{h,i} > \nu_1\rho^h$, one has
$$\mathbb{E}\bigl[T_{h,i}(n_0)\bigr] \le \frac{8\ln n_0}{(\Delta_{h,i} - \nu_1\rho^h)^2} + 1 + \sum_{t=u+1}^{n_0}\Biggl(t\,(n_0)^{-4} + \sum_{s=1}^{\min\{D_{n_0},\,n_0\}}(n_0)^{-3}\Biggr) \le \frac{8\ln n_0}{(\Delta_{h,i} - \nu_1\rho^h)^2} + 3.$$
(We thus even improve slightly the bound of Lemma 17.)

Adaptation of the proof of Theorem 6.  The main change here comes from the fact that trees are cut at depth $D_{n_0}$. As a consequence, the sets $\mathcal{I}_h$, $\mathcal{I}$, $\mathcal{J}$, and $\mathcal{J}_h$ are defined only by referring to nodes of depth smaller than $D_{n_0}$. All steps of the proof can then be repeated, except the third step; there, while the bounds on the regret resulting from nodes of $\mathcal{T}^1$ and $\mathcal{T}^3$ go through without any changes (as these sets were constructed by considering all descendants of some base nodes), the bound on the regret $R_{n,2}$ associated with the nodes of $\mathcal{T}^2$ calls for a modified proof, since at that stage we used the property that each node is played at most once. This is no longer true for nodes $(h,i)$ located at depth $D_{n_0}$, which can be played several times. The proof is therefore modified as follows. Consider a node at depth $h = D_{n_0}$. Then, by definition of $D_{n_0}$,
$$h \ge D_{n_0} \ge \frac{(\ln n_0)/2 - \ln(1/\nu_1)}{\ln(1/\rho)}, \qquad\text{that is,}\qquad \nu_1\rho^h \le \frac{1}{\sqrt{n_0}}.$$
Since the considered nodes are $2\nu_1\rho^{D_{n_0}}$-optimal, the corresponding domains are $4\nu_1\rho^{D_{n_0}}$-optimal by Lemma 3, thus also $4/\sqrt{n_0}$-optimal. The instantaneous regret incurred when playing any of these nodes is therefore bounded by $4/\sqrt{n_0}$, and the associated cumulative regret (over $n_0$ rounds) can be bounded by $4\sqrt{n_0}$. In conclusion, with the notation of Theorem 6, we get the new bound
$$\mathbb{E}\bigl[R_{n,2}\bigr] \le \sum_{h=0}^{H-1} 4\nu_1\rho^h\,|\mathcal{I}_h| + 4\sqrt{n_0} \le 4\sqrt{n_0} + 4C\nu_1\nu_2^{-d'}\sum_{h=0}^{H-1}\rho^{h(1-d')}.$$
The rest of the proof goes through, and only this additional additive term of $4\sqrt{n_0}$ is suffered in the final regret bound. (The additional term can be included in the $O$ notation.)

A.3 Proof of Theorem 9 (regret bound for z-HOO)

We start with the following equivalent of Lemma 3 in this new local context. Remember that $h_0$ is the smallest integer such that $2\nu_1\rho^{h_0} < \varepsilon_0$.

Lemma 18  Under Assumptions A1 and A2', for all $h \ge h_0$, if the suboptimality factor $\Delta_{h,i}$ of a region $\mathcal{P}_{h,i}$ is bounded by $c\nu_1\rho^h$ for some $c \in [0,2]$, then all arms in $\mathcal{P}_{h,i}$ are $L\max\{2c,\,c+1\}\,\nu_1\rho^h$-optimal, that is, $\mathcal{P}_{h,i} \subset \mathcal{X}_{L\max\{2c,\,c+1\}\nu_1\rho^h}$. When $c = 0$, i.e., the node $(h,i)$ is optimal, the bound improves to $\mathcal{P}_{h,i} \subset \mathcal{X}_{\nu_1\rho^h}$.

Proof  We first deal with the general case of $c \in [0,2]$. By the hypothesis on the suboptimality of $\mathcal{P}_{h,i}$, for all $\delta > 0$ there exists an element $x \in \mathcal{X}_{c\nu_1\rho^h+\delta}\cap\mathcal{P}_{h,i}$.
If $\delta$ is small enough, e.g., $\delta \in \bigl(0,\ \varepsilon_0 - 2\nu_1\rho^{h_0}\bigr)$, then this element satisfies $x \in \mathcal{X}_{\varepsilon_0}$. Let $y \in \mathcal{P}_{h,i}$. By Assumption A1, $\ell(x,y) \le \mathrm{diam}(\mathcal{P}_{h,i}) \le \nu_1\rho^h$, which entails, denoting $\varepsilon = \max\bigl\{0,\ \nu_1\rho^h - (f^* - f(x))\bigr\}$,
$$\ell(x,y) \le \nu_1\rho^h \le f^* - f(x) + \varepsilon, \qquad\text{that is,}\qquad y \in \mathcal{B}\bigl(x,\ f^* - f(x) + \varepsilon\bigr).$$
Since $x \in \mathcal{X}_{\varepsilon_0}$ and $\varepsilon \le \nu_1\rho^h \le \nu_1\rho^{h_0} < \varepsilon_0$, the second part of Assumption A2' then yields
$$y \in \mathcal{B}\bigl(x,\ f^* - f(x) + \varepsilon\bigr) \subset \mathcal{X}_{L\left(2(f^*-f(x)) + \varepsilon\right)}.$$
It follows from the definition of $\varepsilon$ that $f^* - f(x) + \varepsilon = \max\bigl\{f^* - f(x),\ \nu_1\rho^h\bigr\}$, and this implies
$$y \in \mathcal{B}\bigl(x,\ f^* - f(x) + \varepsilon\bigr) \subset \mathcal{X}_{L\left(f^*-f(x) + \max\{f^*-f(x),\ \nu_1\rho^h\}\right)}.$$
But $x \in \mathcal{X}_{c\nu_1\rho^h+\delta}$, i.e., $f^* - f(x) \le c\nu_1\rho^h + \delta$; we thus have proved that $y \in \mathcal{X}_{L\left(\max\{2c,\,c+1\}\nu_1\rho^h + 2\delta\right)}$. In conclusion, $\mathcal{P}_{h,i} \subset \mathcal{X}_{L\max\{2c,\,c+1\}\nu_1\rho^h + 2L\delta}$ for all sufficiently small $\delta > 0$. Letting $\delta \to 0$ concludes the proof of the general case.

In the case $c = 0$, we resort to the first part of Assumption A2', which can be applied since $\mathrm{diam}(\mathcal{P}_{h,i}) \le \nu_1\rho^h \le \varepsilon_0$, as already noted above, and which can exactly be restated as indicating that for all $y \in \mathcal{P}_{h,i}$,
$$f^* - f(y) \le \mathrm{diam}(\mathcal{P}_{h,i}) \le \nu_1\rho^h; \qquad\text{that is,}\qquad \mathcal{P}_{h,i} \subset \mathcal{X}_{\nu_1\rho^h}.$$

We now provide an adaptation of Lemma 17 (actually based on adaptations of Lemmas 14 and 15), providing the same bound under local conditions that relax the assumptions of Lemma 17 to some extent.

Lemma 19  Consider a depth $z \ge h_0$. Under Assumptions A1 and A2', the algorithm $z$-HOO satisfies, for all $n \ge 1$ and all suboptimal nodes $(h,i)$ with $\Delta_{h,i} > \nu_1\rho^h$ and $h \ge z$,
$$\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le \frac{8\ln n}{(\Delta_{h,i} - \nu_1\rho^h)^2} + 4.$$

Proof  We consider some path $(z, i^*_z), (z+1, i^*_{z+1}), \ldots$ of optimal nodes, starting at depth $z$. We distinguish two cases, depending on whether there exists some $z \le k' \le h-1$ such that $(h,i) \in \mathcal{C}(k', i^*_{k'})$ or not.

In the first case, we denote by $k$ the largest such depth. The argument of Lemma 14 can be used without any change and shows that for all integers $u \ge 0$,
$$\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le u + \sum_{t=u+1}^{n}\mathbb{P}\Bigl\{U_{s,i^*_s}(t) \le f^*\ \text{for some}\ s \in \{k+1,\ldots,t-1\},\ \text{or}\ T_{h,i}(t) > u\ \text{and}\ U_{h,i}(t) > f^*\Bigr\}.$$
In the second case, we denote by $(z, i_h)$ the ancestor of $(h,i)$ located at depth $z$. By definition of $z$-HOO, $(H_t, I_t) \in \mathcal{C}(h,i)$ at some round $t \ge 1$ only if $B_{z,i^*_z}(t) \le B_{z,i_h}(t)$; and since $B$-values can only increase along a chosen path, $(H_t, I_t) \in \mathcal{C}(h,i)$ can only happen if $B_{z,i^*_z}(t) \le B_{h,i}(t)$. Repeating the argument of Lemma 14, we get that for all integers $u \ge 0$,
$$\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le u + \sum_{t=u+1}^{n}\mathbb{P}\Bigl\{U_{s,i^*_s}(t) \le f^*\ \text{for some}\ s \in \{z,\ldots,t-1\},\ \text{or}\ T_{h,i}(t) > u\ \text{and}\ U_{h,i}(t) > f^*\Bigr\}.$$
Now, notice that Lemma 16 is valid without any assumption. On the other hand, with the modified assumptions, Lemma 15 is still true, but only for optimal nodes $(h,i)$ with $h \ge h_0$. Indeed, the only point in its proof where the assumptions were used was in the fourth line, when applying Lemma 3; here, Lemma 18 with $c = 0$ provides the needed guarantee. The proof is concluded with the same computations as in the proof of Lemma 17.

Proof (of Theorem 9)  We follow the four steps of the proof of Theorem 6, with some slight adjustments. In particular, for $h \ge z$, we use the sets of nodes $\mathcal{I}_h$ and $\mathcal{J}_h$ defined therein.

First step.
Lemma 19 bounds the expected number of times each node $(h,i) \in \mathcal{J}_h$ is visited. Since for these nodes $\Delta_{h,i} > 2\nu_1\rho^h$, we get
$$\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le \frac{8\ln n}{\nu_1^2\rho^{2h}} + 4.$$

Second step.  We bound here the cardinality $|\mathcal{I}_h|$. By Lemma 18 with $c = 2$, when $(h,i) \in \mathcal{I}_h$ and $h \ge z$, one has $\mathcal{P}_{h,i} \subset \mathcal{X}_{4L\nu_1\rho^h}$. Now, by Assumption A1 and by using the same argument as in the second step of the proof of Theorem 6,
$$|\mathcal{I}_h| \le \mathcal{N}\bigl(\mathcal{X}_{(4L\nu_1/\nu_2)\,\nu_2\rho^h},\ \ell,\ \nu_2\rho^h\bigr).$$
Assumption A3 can be applied since $\nu_2\rho^h \le 2\nu_1\rho^h \le 2\nu_1\rho^{h_0} \le \varepsilon_0$, and yields the inequality $|\mathcal{I}_h| \le C\bigl(\nu_2\rho^h\bigr)^{-d}$.

Third step.  We consider some integer $H \ge z$, to be defined by the analysis in the fourth step. We define a partition of the nodes located at depth equal to or larger than $z$; more precisely,

• $\mathcal{T}^1$ contains the nodes of $\mathcal{I}_H$ and their descendants;

• $\mathcal{T}^2 = \bigcup_{z \le h \le H-1}\mathcal{I}_h$;

• $\mathcal{T}^3$ contains the nodes of $\bigcup_{z+1 \le h \le H}\mathcal{J}_h$ and their descendants;

• $\mathcal{T}^4$ is formed by the nodes $(z,i)$ located at depth $z$ not belonging to $\mathcal{I}_z$, i.e., such that $\Delta_{z,i} > 2\nu_1\rho^z$, and their descendants.

As in the proof of Theorem 6, we denote by $R_{n,i}$ the regret resulting from the selection of nodes in $\mathcal{T}^i$, for $i \in \{1,2,3,4\}$.

Lemma 18 with $c = 2$ yields the bound $\mathbb{E}[R_{n,1}] \le 4L\nu_1\rho^H n$, where we crudely bounded by $n$ the number of times that nodes in $\mathcal{T}^1$ were played.

Using that by definition each node of $\mathcal{T}^2$ can be played only once, we get
$$\mathbb{E}\bigl[R_{n,2}\bigr] \le \sum_{h=z}^{H-1}\bigl(4L\nu_1\rho^h\bigr)|\mathcal{I}_h| \le 4CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)}.$$
As for $R_{n,3}$, we also use here that nodes in $\mathcal{T}^3$ belong to some $\mathcal{J}_h$ with $z+1 \le h \le H$; in particular, they are children of some element of $\mathcal{I}_{h-1}$ and as such, firstly, they are $4L\nu_1\rho^{h-1}$-optimal (by Lemma 18) and secondly, their number is bounded as $|\mathcal{J}_h| \le 2\,|\mathcal{I}_{h-1}| \le 2C\bigl(\nu_2\rho^{h-1}\bigr)^{-d}$. Thus,
$$\mathbb{E}\bigl[R_{n,3}\bigr] \le \sum_{h=z+1}^{H}\bigl(4L\nu_1\rho^{h-1}\bigr)\sum_{i:(h,i)\in\mathcal{J}_h}\mathbb{E}\bigl[T_{h,i}(n)\bigr] \le 8CL\nu_1\nu_2^{-d}\sum_{h=z+1}^{H}\rho^{(h-1)(1-d)}\biggl(\frac{8\ln n}{\nu_1^2\rho^{2h}} + 4\biggr),$$
where we used the bound of Lemma 19. Finally, for $\mathcal{T}^4$, we use that it contains at most $2^z - 1$ base nodes, each of them being associated with a regret controlled by Lemma 19; therefore,
$$\mathbb{E}\bigl[R_{n,4}\bigr] \le \bigl(2^z - 1\bigr)\biggl(\frac{8\ln n}{\nu_1^2\rho^{2z}} + 4\biggr).$$

Fourth step.  Putting things together, we have proved that
$$\mathbb{E}\bigl[R_n\bigr] \le 4L\nu_1\rho^H n + \mathbb{E}\bigl[R_{n,2}\bigr] + \mathbb{E}\bigl[R_{n,3}\bigr] + \bigl(2^z - 1\bigr)\biggl(\frac{8\ln n}{\nu_1^2\rho^{2z}} + 4\biggr),$$
where (using that $\rho < 1$ for the second inequality)
$$\begin{aligned}
\mathbb{E}\bigl[R_{n,2}\bigr] + \mathbb{E}\bigl[R_{n,3}\bigr]
&\le 4CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)} + 8CL\nu_1\nu_2^{-d}\sum_{h=z+1}^{H}\rho^{(h-1)(1-d)}\biggl(\frac{8\ln n}{\nu_1^2\rho^{2h}} + 4\biggr)\\
&= 4CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)} + 8CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)}\biggl(\frac{8\ln n}{\nu_1^2\rho^2\rho^{2h}} + 4\biggr)\\
&\le 4CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)}\frac{1}{\rho^{2h}} + 8CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)}\biggl(\frac{8\ln n}{\nu_1^2\rho^2\rho^{2h}} + \frac{4}{\rho^{2h}}\biggr)\\
&= CL\nu_1\nu_2^{-d}\Biggl(\sum_{h=z}^{H-1}\rho^{-h(1+d)}\Biggr)\biggl(36 + \frac{64}{\nu_1^2\rho^2}\ln n\biggr).
\end{aligned}$$
Denoting
$$\gamma = \frac{4CL\nu_1\nu_2^{-d}}{(1/\rho)^{d+1} - 1}\biggl(\frac{16}{\nu_1^2\rho^2} + 9\biggr),$$
it follows that for $n \ge 2$,
$$\mathbb{E}\bigl[R_{n,2}\bigr] + \mathbb{E}\bigl[R_{n,3}\bigr] \le \gamma\,\rho^{-H(d+1)}\ln n.$$
It remains to define the parameter $H \ge z$. In particular, we propose to choose it such that the terms $4L\nu_1\rho^H n$ and $\rho^{-H(d+1)}\ln n$ are balanced.
To this end, let $H$ be the smallest integer $k$ such that $4L\nu_1\rho^k n \le \gamma\rho^{-k(d+1)}\ln n$; in particular,
$$\rho^H \le \biggl(\frac{\gamma\ln n}{4L\nu_1 n}\biggr)^{1/(d+2)} \qquad\text{and}\qquad 4L\nu_1\rho^{H-1} n > \gamma\rho^{-(H-1)(d+1)}\ln n,$$
implying $\gamma\rho^{-H(d+1)}\ln n \le 4L\nu_1\rho^H n\,\rho^{-(d+2)}$. Note from the first of these inequalities that this $H$ is such that
$$H \ge \frac{1}{d+2}\,\frac{\ln(4L\nu_1 n) - \ln(\gamma\ln n)}{\ln(1/\rho)},$$
and thus this $H$ satisfies $H \ge z$, in view of the assumption of the theorem indicating that $n$ is large enough. The final bound on the regret is then
$$\begin{aligned}
\mathbb{E}\bigl[R_n\bigr]
&\le 4L\nu_1\rho^H n + \gamma\rho^{-H(d+1)}\ln n + \bigl(2^z-1\bigr)\biggl(\frac{8\ln n}{\nu_1^2\rho^{2z}} + 4\biggr)\\
&\le \biggl(1 + \frac{1}{\rho^{d+2}}\biggr)4L\nu_1\rho^H n + \bigl(2^z-1\bigr)\biggl(\frac{8\ln n}{\nu_1^2\rho^{2z}} + 4\biggr)\\
&\le \biggl(1 + \frac{1}{\rho^{d+2}}\biggr)4L\nu_1 n\biggl(\frac{\gamma\ln n}{4L\nu_1 n}\biggr)^{1/(d+2)} + \bigl(2^z-1\bigr)\biggl(\frac{8\ln n}{\nu_1^2\rho^{2z}} + 4\biggr)\\
&= \biggl(1 + \frac{1}{\rho^{d+2}}\biggr)\bigl(4L\nu_1 n\bigr)^{(d+1)/(d+2)}\bigl(\gamma\ln n\bigr)^{1/(d+2)} + \bigl(2^z-1\bigr)\biggl(\frac{8\ln n}{\nu_1^2\rho^{2z}} + 4\biggr).
\end{aligned}$$
This concludes the proof.

A.4 Proof of Theorem 10 (regret bound for local-HOO)

Proof  We use the notation of the proof of Theorem 9. Let $r_0$ be a positive integer such that for all $r \ge r_0$, one has
$$z_r \stackrel{\mathrm{def}}{=} \lceil\log_2 r\rceil \ge h_0 \qquad\text{and}\qquad z_r \le \frac{1}{d+2}\,\frac{\ln(4L\nu_1 2^r) - \ln(\gamma\ln 2^r)}{\ln(1/\rho)};$$
we can therefore apply the result of Theorem 9 in the regimes indexed by $r \ge r_0$. For the previous regimes, we simply upper bound the regret by the number of rounds, that is, by $2^{r_0} - 2 \le 2^{r_0}$. For round $n$, we denote by $r_n$ the index of the regime containing $n$ (namely, $r_n = \lfloor\log_2(n+1)\rfloor$). Since regime $r_n$ terminates at round $2^{r_n+1} - 2$, we have
$$\begin{aligned}
\mathbb{E}\bigl[R_n\bigr] \le \mathbb{E}\bigl[R_{2^{r_n+1}-2}\bigr]
&\le 2^{r_0} + \sum_{r=r_0}^{r_n}\Biggl(\biggl(1 + \frac{1}{\rho^{d+2}}\biggr)\bigl(4L\nu_1 2^r\bigr)^{(d+1)/(d+2)}\bigl(\gamma\ln 2^r\bigr)^{1/(d+2)} + \bigl(2^{z_r}-1\bigr)\biggl(\frac{8\ln 2^r}{\nu_1^2\rho^{2z_r}} + 4\biggr)\Biggr)\\
&\le 2^{r_0} + C_1(\ln n)\sum_{r=r_0}^{r_n}\Bigl(\bigl(2^{(d+1)/(d+2)}\bigr)^r + \bigl(2/\rho^2\bigr)^{z_r}\Bigr)\\
&\le 2^{r_0} + C_2(\ln n)\Bigl(\bigl(2^{(d+1)/(d+2)}\bigr)^{r_n} + r_n\bigl(2/\rho^2\bigr)^{z_{r_n}}\Bigr) = (\ln n)\,O\Bigl(n^{(d+1)/(d+2)}\Bigr),
\end{aligned}$$
where $C_1, C_2 > 0$ denote constants depending only on the parameters but not on $n$. Note that for the last equality we used that the first of the two terms depending on $n$ dominates the second one.
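The regime schedule used in this proof is simple enough to state in a few lines of code. The following sketch (our own helper names; illustrative only) computes, for a given round $n$, the regime index $r_n = \lfloor\log_2(n+1)\rfloor$ and the depth parameter $z_r = \lceil\log_2 r\rceil$ of the $z$-HOO instance run during that regime.

```python
import math

def regime_index(n: int) -> int:
    """Index r_n of the doubling-trick regime containing round n:
    regime r covers rounds 2**r - 1, ..., 2**(r+1) - 2, so r_n = floor(log2(n+1))."""
    return (n + 1).bit_length() - 1

def zhoo_depth(r: int) -> int:
    """Depth parameter z_r = ceil(log2 r) of the z-HOO instance run during regime r >= 1."""
    return math.ceil(math.log2(r))

n = 100                       # e.g. round 100 lies in regime 6, during which 3-HOO is run
r = regime_index(n)
print(n, r, zhoo_depth(r))    # -> 100 6 3
```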
A.5 Proof of Theorem 12 (uniform upper bound on the regret of HOO against the class of all weak-Lipschitz environments)

Equations (5) and (6), which follow from Assumption A2, show that Assumption A2' is satisfied for $L = 2$ and all $\varepsilon_0 > 0$. We take, for instance, $\varepsilon_0 = 3\nu_1$. Moreover, since $\mathcal{X}$ has a packing dimension of $D$, all environments have a near-optimality dimension less than $D$. In particular, for all $D' > D$ (as shown in the second step of the proof of Theorem 6 in Section A.1), there exists a constant $C$ (depending only on $\ell$, $\mathcal{X}$, $\varepsilon_0 = 3\nu_1$, $\nu_2$, and $D'$) such that Assumption A3 is satisfied. We can therefore take $h_0 = 0$ and apply Theorem 9 with $z = 0$ and $M \in \mathcal{F}_{\mathcal{X},\ell}$; the fact that all the quantities involved in the bound depend only on $\mathcal{X}$, $\ell$, $\nu_2$, $D'$, and the parameters of HOO, but not on a particular environment in $\mathcal{F}_{\mathcal{X},\ell}$, concludes the proof.

A.6 Proof of Theorem 13 (minimax lower bound in metric spaces)

Let $K \ge 2$ be an integer, to be defined later. We first provide an overview of the proof. We exhibit a set $\mathcal{A}$ of environments for the $\{1,\ldots,K+1\}$-armed bandit problem and a subset $\mathcal{F}' \subset \mathcal{F}_{\mathcal{X},\ell}$ which satisfy the following properties.

(i) The set $\mathcal{A}$ contains "difficult" environments for the $\{1,\ldots,K+1\}$-armed bandit problem.

(ii) For any strategy $\varphi^{(\mathcal{X})}$ suited to the $\mathcal{X}$-armed bandit problem, one can construct a strategy $\psi^{(K+1)}$ for the $\{1,\ldots,K+1\}$-armed bandit problem such that
$$\forall M \in \mathcal{F}',\ \exists \nu \in \mathcal{A}, \qquad \mathbb{E}_M\bigl[R_n(\varphi^{(\mathcal{X})})\bigr] = \mathbb{E}_\nu\bigl[R_n(\psi^{(K+1)})\bigr].$$

We now provide the details.

Proof  We only deal with the case of deterministic strategies; the extension to randomized strategies can be done using Fubini's theorem (by integrating also w.r.t. the auxiliary randomizations used).

First step.  Let $\eta \in (0, 1/2)$ be a real number and $K \ge 2$ be an integer, both to be defined during the course of the analysis. The set $\mathcal{A}$ contains only $K$ elements, denoted by $\nu_1, \ldots, \nu_K$ and given by product distributions. For $1 \le j \le K$, the distribution $\nu_j$ is obtained as the product of the $\nu_{j,i}$ for $i \in \{1,\ldots,K+1\}$, where
$$\nu_{j,i} = \begin{cases}\mathrm{Ber}(1/2), & \text{if } i \ne j;\\ \mathrm{Ber}(1/2+\eta), & \text{if } i = j.\end{cases}$$
One can extract the following result from the proof of the lower bound of [10, Section 6.9].

Lemma 20  For all strategies $\psi^{(K+1)}$ for the $\{1,\ldots,K+1\}$-armed bandit problem (where $K \ge 2$), one has
$$\max_{j=1,\ldots,K}\mathbb{E}_{\nu_j}\bigl[R_n(\psi^{(K+1)})\bigr] \ge n\eta\Biggl(1 - \frac{1}{K} - \eta\sqrt{4\ln(4/3)}\,\sqrt{\frac{n}{K}}\Biggr).$$

Second step.  We now need to construct $\mathcal{F}'$ such that item (ii) is satisfied. We assume that $K$ is such that $\mathcal{X}$ contains $K$ disjoint balls of radius $\eta$. (We shall quantify later in this proof a suitable value of $K$.) Denoting by $x_1, \ldots, x_K$ the corresponding centers, these disjoint balls are then $\mathcal{B}(x_1,\eta), \ldots, \mathcal{B}(x_K,\eta)$. With each of these balls we now associate a bandit environment over $\mathcal{X}$, in the following way. For all $x^* \in \mathcal{X}$, we introduce the mapping $g_{x^*,\eta}$ on $\mathcal{X}$ defined by
$$g_{x^*,\eta}(x) = \max\bigl\{0,\ \eta - \ell(x,x^*)\bigr\} \qquad\text{for all } x \in \mathcal{X}.$$
This mapping is used to define an environment $M_{x^*,\eta}$ over $\mathcal{X}$ as follows: for all $x \in \mathcal{X}$,
$$M_{x^*,\eta}(x) = \mathrm{Ber}\Bigl(\tfrac{1}{2} + g_{x^*,\eta}(x)\Bigr).$$
Let $f_{x^*,\eta}$ be the corresponding mean-payoff function; its values equal
$$f_{x^*,\eta}(x) = \tfrac{1}{2} + \max\bigl\{0,\ \eta - \ell(x,x^*)\bigr\} \qquad\text{for all } x \in \mathcal{X}.$$
Note that the mean payoff is maximized at $x = x^*$ (with value $1/2 + \eta$) and is minimal for all points lying outside $\mathcal{B}(x^*,\eta)$, with value $1/2$. In addition, the fact that $\ell$ is a metric entails that these mean-payoff functions are 1-Lipschitz and thus also weakly Lipschitz. (This is the only point in the proof where we use that $\ell$ is a metric.) In conclusion, we consider
$$\mathcal{F}' = \bigl\{M_{x_1,\eta},\ \ldots,\ M_{x_K,\eta}\bigr\} \subset \mathcal{F}_{\mathcal{X},\ell}.$$

Third step.  We describe how to associate with each (deterministic) strategy $\varphi^{(\mathcal{X})}$ on $\mathcal{X}$ a (random) strategy $\psi^{(K+1)}$ on the finite set of arms $\{1,\ldots,K+1\}$. Each of these strategies is given by a sequence of mappings, $\varphi^{(\mathcal{X})}_1, \varphi^{(\mathcal{X})}_2, \ldots$ and $\psi^{(K+1)}_1, \psi^{(K+1)}_2, \ldots$, where for $t \ge 1$ the mappings $\varphi^{(\mathcal{X})}_t$ and $\psi^{(K+1)}_t$ may only depend on the past up to the beginning of round $t$. Since the strategy $\varphi^{(\mathcal{X})}$ is deterministic, the mapping $\varphi^{(\mathcal{X})}_t$ takes only into account the past rewards $Y_1, \ldots, Y_{t-1}$ and is therefore a mapping $[0,1]^{t-1} \to \mathcal{X}$. (In particular, $\varphi^{(\mathcal{X})}_1$ equals a constant.) We use the notations $I'_t$ and $Y'_t$ for, respectively, the arms pulled and the rewards obtained by the strategy $\psi^{(K+1)}$ at each round $t$. The arms $I'_t$ are drawn at random according to the distributions $\psi^{(K+1)}_t\bigl(I'_1,\ldots,I'_{t-1},\ Y'_1,\ldots,Y'_{t-1}\bigr)$, which we now define.
(Actually, they will depend on the obtained payoffs $Y'_1,\ldots,Y'_{t-1}$ only.) To do so, we need yet another mapping, $T$, that links elements of $\mathcal{X}$ to probability distributions over $\{1,\ldots,K+1\}$. Denoting by $\delta_k$ the Dirac probability on $k \in \{1,\ldots,K+1\}$, the mapping $T$ is defined, for all $x \in \mathcal{X}$, by
$$T(x) = \begin{cases}\delta_{K+1}, & \text{if } x \notin \bigcup_{j=1,\ldots,K}\mathcal{B}(x_j,\eta);\\[2mm] \Bigl(1 - \dfrac{\ell(x,x_j)}{\eta}\Bigr)\delta_j + \dfrac{\ell(x,x_j)}{\eta}\,\delta_{K+1}, & \text{if } x \in \mathcal{B}(x_j,\eta)\ \text{for some } j \in \{1,\ldots,K\}.\end{cases}$$
Note that this definition is legitimate because the balls $\mathcal{B}(x_j,\eta)$ are disjoint as $j$ varies between 1 and $K$.

Finally, $\psi^{(K+1)}$ is defined as follows: for all $t \ge 1$,
$$\psi^{(K+1)}_t\bigl(I'_1,\ldots,I'_{t-1},\ Y'_1,\ldots,Y'_{t-1}\bigr) = \psi^{(K+1)}_t\bigl(Y'_1,\ldots,Y'_{t-1}\bigr) = T\Bigl(\varphi^{(\mathcal{X})}_t\bigl(Y'_1,\ldots,Y'_{t-1}\bigr)\Bigr).$$
Before we proceed, we study the distribution of the reward $Y'$ obtained under $\nu_i$ (for $i \in \{1,\ldots,K\}$) by the choice of a random arm $I'$ drawn according to $T(x)$, for some $x \in \mathcal{X}$. Since $Y'$ can only take the values 0 or 1, its distribution is a Bernoulli distribution, whose parameter $\mu_i(x)$ we now compute. The computation is based on the fact that under $\nu_i$ the Bernoulli distribution corresponding to arm $j$ has expectation $1/2$, except if $j = i$, in which case the expectation is $1/2 + \eta$. Thus, for all $x \in \mathcal{X}$,
$$\mu_i(x) = \begin{cases}1/2, & \text{if } x \notin \mathcal{B}(x_i,\eta);\\[2mm] \Bigl(1 - \dfrac{\ell(x,x_i)}{\eta}\Bigr)\Bigl(\dfrac{1}{2}+\eta\Bigr) + \dfrac{\ell(x,x_i)}{\eta}\cdot\dfrac{1}{2} = \dfrac{1}{2} + \eta - \ell(x,x_i), & \text{if } x \in \mathcal{B}(x_i,\eta).\end{cases}$$
That is, $\mu_i = f_{x_i,\eta}$ on $\mathcal{X}$.
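The reduction above is constructive, and the mapping $T$ together with the induced mean reward $\mu_i$ can be written out directly. The following fragment is our own informal illustration (the helper names `T`, `mean_reward`, the generic metric `ell`, and the list of ball centers `xs` are hypothetical, not from the paper); it mirrors the two-case definition of $T$ and reproduces $\mu_i = f_{x_i,\eta}$.

```python
def T(x, xs, eta, ell):
    """Distribution over arms {0, ..., K} induced by the point x: arm K is the extra
    'dummy' arm; within the ball B(x_j, eta) the mass is split between arm j and the
    dummy arm proportionally to ell(x, x_j)/eta. Returns K+1 probabilities.
    Illustrative sketch only."""
    K = len(xs)
    probs = [0.0] * (K + 1)
    for j, xj in enumerate(xs):
        if ell(x, xj) < eta:            # x lies in the (disjoint) ball B(x_j, eta)
            w = ell(x, xj) / eta
            probs[j] = 1.0 - w
            probs[K] = w
            return probs
    probs[K] = 1.0                      # outside all balls: play the dummy arm
    return probs

def mean_reward(x, i, xs, eta, ell):
    """Expected reward of playing T(x) under nu_i (arm i pays 1/2 + eta in expectation,
    every other arm pays 1/2); this reproduces mu_i(x) = f_{x_i, eta}(x)."""
    probs = T(x, xs, eta, ell)
    return sum(p * (0.5 + eta if j == i else 0.5) for j, p in enumerate(probs))
```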
Fourth step.  We now prove that the distributions of the regrets of $\varphi^{(\mathcal{X})}$ under $M_{x_j,\eta}$ and of $\psi^{(K+1)}$ under $\nu_j$ are equal, for all $j = 1,\ldots,K$. On the one hand, the expectations of the rewards associated with the best arms equal $1/2 + \eta$ under the two environments. On the other hand, one can prove by induction that the sequences $Y_1, Y_2, \ldots$ and $Y'_1, Y'_2, \ldots$ have the same distribution. (In the argument below, conditioning on an empty sequence means no conditioning; this is the case only for $t = 1$.) For all $t \ge 1$, we denote $X'_t = \varphi^{(\mathcal{X})}_t\bigl(Y'_1,\ldots,Y'_{t-1}\bigr)$. Under $\nu_j$ and given $Y'_1,\ldots,Y'_{t-1}$, the distribution of $Y'_t$ is obtained, by definition, as the two-step random draw of $I'_t \sim T(X'_t)$ and then, conditionally on this first draw, $Y'_t \sim \nu_{j,I'_t}$. By the above results, the distribution of $Y'_t$ is thus a Bernoulli distribution with parameter $\mu_j(X'_t)$. At the same time, under $M_{x_j,\eta}$ and given $Y_1,\ldots,Y_{t-1}$, the choice of $X_t = \varphi^{(\mathcal{X})}_t\bigl(Y_1,\ldots,Y_{t-1}\bigr)$ yields a reward $Y_t$ distributed according to $M_{x_j,\eta}(X_t)$, that is, by definition and with the notation above, a Bernoulli distribution with parameter $f_{x_j,\eta}(X_t) = \mu_j(X_t)$. The argument is concluded by induction and by using the fact that rewards are drawn independently in each round.

Fifth step.  We summarize what we have proved so far. For $\eta \in (0,1/2)$, provided that there exist $K \ge 2$ disjoint balls $\mathcal{B}(x_j,\eta)$ in $\mathcal{X}$, we could construct, for every strategy $\varphi^{(\mathcal{X})}$ for the $\mathcal{X}$-armed bandit problem, a strategy $\psi^{(K+1)}$ for the $\{1,\ldots,K+1\}$-armed bandit problem such that, for all $j = 1,\ldots,K$ and all $n \ge 1$,
$$\mathbb{E}_{M_{x_j,\eta}}\bigl[R_n(\varphi^{(\mathcal{X})})\bigr] = \mathbb{E}_{\nu_j}\bigl[R_n(\psi^{(K+1)})\bigr].$$
But by the assumption on the packing dimension, there exists $c > 0$ such that for all $\eta < 1/2$, the choice of $K_\eta = \lceil c\,\eta^{-D}\rceil \ge 2$ guarantees the existence of such $K_\eta$ disjoint balls. Substituting this value, and using the results of the first and fourth steps of the proof, we get
$$\max_{j=1,\ldots,K_\eta}\mathbb{E}_{M_{x_j,\eta}}\bigl[R_n(\varphi^{(\mathcal{X})})\bigr] = \max_{j=1,\ldots,K_\eta}\mathbb{E}_{\nu_j}\bigl[R_n(\psi^{(K+1)})\bigr] \ge n\eta\Biggl(1 - \frac{1}{K_\eta} - \eta\sqrt{4\ln(4/3)}\,\sqrt{\frac{n}{K_\eta}}\Biggr).$$
The proof is concluded by noting that

• the left-hand side is smaller than the maximal regret w.r.t. all weakly Lipschitz environments;

• the right-hand side can be lower bounded and then optimized over $\eta < 1/2$ in the following way. By definition of $K_\eta$ and the fact that it is larger than 2, one has
$$n\eta\Biggl(1 - \frac{1}{K_\eta} - \eta\sqrt{4\ln(4/3)}\,\sqrt{\frac{n}{K_\eta}}\Biggr) \ge n\eta\Biggl(1 - \frac{1}{2} - \eta\sqrt{4\ln(4/3)}\,\sqrt{\frac{n}{c\,\eta^{-D}}}\Biggr) = n\eta\biggl(\frac{1}{2} - C\,\eta^{1+D/2}\sqrt{n}\biggr),$$
where $C = \sqrt{4\ln(4/3)/c}$. We can optimize the final lower bound over $\eta \in [0,1/2]$. To that end, we choose, for instance, $\eta$ such that $C\,\eta^{1+D/2}\sqrt{n} = 1/4$, that is,
$$\eta = \biggl(\frac{1}{4C\sqrt{n}}\biggr)^{1/(1+D/2)} = \biggl(\frac{1}{4C}\biggr)^{1/(1+D/2)} n^{-1/(D+2)}.$$
This gives the lower bound
$$\frac{1}{4}\biggl(\frac{1}{4C}\biggr)^{1/(1+D/2)} n^{1-1/(D+2)} = \underbrace{\frac{1}{4}\biggl(\frac{1}{4C}\biggr)^{1/(1+D/2)}}_{=\ \gamma(c,D)}\ n^{(D+1)/(D+2)}.$$
To ensure that this choice of $\eta$ is valid, we need to show that $\eta \le 1/2$. Since the latter requirement is equivalent to
$$n \ge \Biggl(2\biggl(\frac{1}{4C}\biggr)^{1/(1+D/2)}\Biggr)^{D+2},$$
it suffices to choose the right-hand side to be $N(c,D)$; we then get that $\eta \le 1/2$ indeed holds for all $n \ge N(c,D)$, thus concluding the proof of the theorem.

References

[1] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: an efficient algorithm for bandit linear optimization. In Proceedings of the 21st International Conference on Learning Theory. Omnipress, 2008.

[2] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Mathematics, 27:1054-1078, 1995.

[3] R. Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33:1926-1951, 1995.

[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 47(2-3):235-256, 2002.

[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

[6] P. Auer, R. Ortner, and C. Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th Conference on Learning Theory, pages 454-468, 2007.

[7] S. Bubeck and R. Munos. Open loop optimistic planning. In Proceedings of the 23rd International Conference on Learning Theory. Omnipress, 2010.

[8] S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári. Online optimization in X-armed bandits. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 201-208, 2009.

[9] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. Theoretical Computer Science, 2010. In press.

[10] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[11] G.M.J. Chaslot, M.H.M. Winands, H.J. van den Herik, J. Uiterwijk, and B. Bouzy. Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation, 4(3):343-357, 2008.
[12] E. Cope. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. IEEE Transactions on Automatic Control, 54(6):1243-1253, 2009.

[13] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 67-74, 2007.

[14] J. L. Doob. Stochastic Processes. John Wiley & Sons, 1953.

[15] H. Finnsson and Y. Björnsson. Simulation-based approach to general game playing. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pages 259-264, 2008.

[16] S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine Learning, pages 273-280. ACM, New York, NY, USA, 2007.

[17] S. Gelly and D. Silver. Achieving master level play in 9x9 computer Go. In Proceedings of AAAI, pages 1537-1540, 2008.

[18] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical Report RR-6062, INRIA, 2006.

[19] J. C. Gittins. Multi-armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization. Wiley, Chichester, NY, 1989.

[20] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.

[21] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems 18, 2004.

[22] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008.

[23] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces, September 2008. URL http://arxiv.org/abs/0809.4882.

[24] L. Kocsis and Cs. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 15th European Conference on Machine Learning, pages 282-293, 2006.

[25] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.

[26] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527-535, 1952.

[27] M.P.D. Schadd, M.H.M. Winands, H.J. van den Herik, and H. Aldewereld. Addressing NP-complete puzzles with Monte-Carlo methods. In Proceedings of the AISB 2008 Symposium on Logic and the Simulation of Interaction and Reasoning, volume 9, pages 55-61. The Society for the Study of Artificial Intelligence and Simulation of Behaviour, 2008.

[28] Y. Yang. How powerful can any regression learning procedure be? In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, volume 2, pages 636-643, 2007.
