Contextual Bandits with Similarity Information
Aleksandrs Slivkins ∗†

First version: February 2009. This revision: May 2014.

Abstract

In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms.

We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the context – a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the context-arm pairs which bounds from above the difference between the respective expected payoffs. Prior work on contextual bandits with similarity uses "uniform" partitions of the similarity space, so that each context-arm pair is approximated by the closest pair in the partition. Algorithms based on "uniform" partitions disregard the structure of the payoffs and the context arrivals, which is potentially wasteful. We present algorithms that are based on adaptive partitions, and take advantage of "benign" payoffs and context arrivals without sacrificing the worst-case performance. The central idea is to maintain a finer partition in high-payoff regions of the similarity space and in popular regions of the context space.
Our results apply to several other settings, e.g. MAB with constrained temporal change (Slivkins and Upfal, 2008) and sleeping bandits (Kleinberg et al., 2008a).

ACM Categories and subject descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems; F.1.2 [Computation by Abstract Devices]: Modes of Computation – Online computation.

General Terms: Algorithms, Theory.

Keywords: online learning, multi-armed bandits, contextual bandits, regret minimization, metric spaces.

∗ This is the full version of a conference paper in COLT 2011. A preliminary version of this manuscript has been posted to arxiv.org in February 2011. An earlier version on arxiv.org, which does not include the results in Section 6, dates back to July 2009. The present revision addresses various presentation issues pointed out by journal referees.

† Microsoft Research New York, New York, NY 10011, USA. Email: slivkins at microsoft.com.

1 Introduction

In a multi-armed bandit problem (henceforth, "multi-armed bandit" will be abbreviated as MAB), an algorithm is presented with a sequence of trials. In each round, the algorithm chooses one alternative from a set of alternatives (arms) based on the past history, and receives the payoff associated with this alternative. The goal is to maximize the total payoff of the chosen arms. The MAB setting was introduced in 1952 in Robbins (1952) and has been studied intensively since then in Operations Research, Economics and Computer Science. This setting is a clean model for the exploration-exploitation trade-off, a crucial issue in sequential decision-making under uncertainty. One standard way to evaluate the performance of a bandit algorithm is regret, defined as the difference between the expected payoff of an optimal arm and that of the algorithm. By now the MAB problem with a small finite set of arms is quite well understood, e.g.
see Lai and Robbins (1985), Auer et al. (2002b,a). However, if the arms set is exponentially or infinitely large, the problem becomes intractable unless we make further assumptions about the problem instance. Essentially, a bandit algorithm needs to find a needle in a haystack; for each algorithm there are inputs on which it performs as badly as random guessing.

Bandit problems with large sets of arms have been an active area of investigation in the past decade (see Section 2 for a discussion of related literature). A common theme in these works is to assume a certain structure on payoff functions. Assumptions of this type are natural in many applications, and often lead to efficient learning algorithms (Kleinberg, 2005). In particular, a line of work started in Agrawal (1995) assumes that some information on similarity between arms is available.

In this paper we consider similarity information in the setting of contextual bandits (Woodroofe, 1979, Auer, 2002, Wang et al., 2005, Pandey et al., 2007, Langford and Zhang, 2007), a natural extension of the basic MAB problem where before each round an algorithm is given the context – a hint about the payoffs in this round. Contextual bandits are directly motivated by the problem of placing advertisements on webpages, one of the crucial problems in sponsored search. One can cast it as a bandit problem so that arms correspond to the possible ads, and payoffs correspond to the user clicks. Then the context consists of information about the page, and perhaps the user this page is served to. Furthermore, we assume that similarity information is available on both the contexts and the arms. Following the work in Agrawal (1995), Kleinberg (2004), Auer et al. (2007), Kleinberg et al.
(2008b) on the (non-contextual) bandits, a particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the context-arm pairs, which gives an upper bound on the difference between the corresponding payoffs.

Our model: contextual bandits with similarity information. The contextual bandits framework is defined as follows. Let X be the context set and Y be the arms set, and let P ⊂ X × Y be the set of feasible context-arm pairs. In each round t, the following events happen in succession:

1. a context x_t ∈ X is revealed to the algorithm,
2. the algorithm chooses an arm y_t ∈ Y such that (x_t, y_t) ∈ P,
3. payoff (reward) π_t ∈ [0, 1] is revealed.

The sequence of context arrivals (x_t)_{t ∈ N} is fixed before the first round, and does not depend on the subsequent choices of the algorithm. With stochastic payoffs, for each pair (x, y) ∈ P there is a distribution Π(x, y) with expectation µ(x, y), so that π_t is an independent sample from Π(x_t, y_t). With adversarial payoffs, this distribution can change from round to round. For simplicity, we present the subsequent definitions for the stochastic setting only, whereas the adversarial setting is fleshed out later in the paper (Section 8).

In general, the goal of a bandit algorithm is to maximize the total payoff Σ_{t=1}^T π_t, where T is the time horizon. In the contextual MAB setting, we benchmark the algorithm's performance in terms of the context-specific "best arm". Specifically, the goal is to minimize the contextual regret:

R(T) := Σ_{t=1}^T [ µ*(x_t) − µ(x_t, y_t) ],   where µ*(x) := sup_{y ∈ Y: (x,y) ∈ P} µ(x, y).

The context-specific best arm is a more demanding benchmark than the best arm used in the "standard" (context-free) definition of regret.
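The round structure above can be simulated in a few lines. This is a minimal sketch: the finite context and arm sets, the particular payoff function, and the baseline policy are illustrative assumptions, not part of the paper's model.

```python
import random

# A minimal stochastic contextual-bandit simulator (illustrative assumptions:
# finite X and Y, all pairs feasible, a specific Lipschitz payoff function).
contexts = [0.0, 0.5, 1.0]           # context set X
arms = [0.0, 0.25, 0.5, 0.75, 1.0]   # arms set Y

def mu(x, y):
    """Expected payoff mu(x, y); 1-Lipschitz in |x - y| by construction."""
    return 1.0 - abs(x - y)

def play_round(t, policy, rng):
    x_t = contexts[t % len(contexts)]   # context arrivals fixed in advance
    y_t = policy(x_t)                   # step 2: the algorithm picks an arm
    pi_t = 1.0 if rng.random() < mu(x_t, y_t) else 0.0   # step 3: payoff in [0, 1]
    # per-round contribution to contextual regret, vs. the context-specific best arm
    regret_t = max(mu(x_t, y) for y in arms) - mu(x_t, y_t)
    return pi_t, regret_t

rng = random.Random(0)
closest_arm = lambda x: min(arms, key=lambda y: abs(x - y))  # a naive baseline
total_regret = sum(play_round(t, closest_arm, rng)[1] for t in range(300))
```

Note that the regret is measured against the context-specific benchmark µ*(x_t), not against a single best arm: here each context has a different best arm.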
The similarity information is given to an algorithm as a metric space (P, D), which we call the similarity space, such that the following Lipschitz condition [1] holds:

|µ(x, y) − µ(x′, y′)| ≤ D((x, y), (x′, y′)).   (1)

Without loss of generality, D ≤ 1. The absence of similarity information is modeled as D ≡ 1. An instructive special case is the product similarity space (P, D) = (X × Y, D), where (X, D_X) is a metric space on contexts (the context space), (Y, D_Y) is a metric space on arms (the arms space), and

D((x, y), (x′, y′)) = min( 1, D_X(x, x′) + D_Y(y, y′) ).   (2)

Prior work: uniform partitions. Hazan and Megiddo (2007) consider contextual MAB with similarity information on contexts. They suggest an algorithm that chooses a "uniform" partition S_X of the context space and approximates x_t by the closest point in S_X, call it x′_t. Specifically, the algorithm creates an instance A(x) of some bandit algorithm A for each point x ∈ S_X, and invokes A(x′_t) in each round t. The granularity of the partition is adjusted to the time horizon, the context space, and the black-box regret guarantee for A. Furthermore, Kleinberg (2004) provides a bandit algorithm A for the adversarial MAB problem on a metric space that has a similar flavor: pick a "uniform" partition S_Y of the arms space, and run a k-arm bandit algorithm such as EXP3 (Auer et al., 2002b) on the points in S_Y. Again, the granularity of the partition is adjusted to the time horizon, the arms space, and the black-box regret guarantee for EXP3. Applying these two ideas to our setting (with the product similarity space) gives a simple algorithm which we call the uniform algorithm.
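For concreteness, the product distance (2) can be written out as follows. Taking D_X and D_Y to be absolute differences on [0, 1] is an illustrative assumption; the paper allows arbitrary metrics on contexts and arms.

```python
def product_distance(p, q):
    """Product similarity distance (2): min(1, D_X(x, x') + D_Y(y, y')).

    D_X and D_Y are absolute differences here, an illustrative choice.
    """
    (x, y), (x2, y2) = p, q
    d_X = abs(x - x2)   # distance between contexts
    d_Y = abs(y - y2)   # distance between arms
    return min(1.0, d_X + d_Y)
```

Any payoff function µ that is 1-Lipschitz with respect to this distance satisfies condition (1); for instance, µ(x, y) = 1 − |x − y| does.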
Its contextual regret, even for adversarial payoffs, is

R(T) ≤ O( T^{1 − 1/(2 + d_X + d_Y)} ) (log T),   (3)

where d_X is the covering dimension of the context space and d_Y is that of the arms space.

Our contributions. Using "uniform" partitions disregards the potentially benign structure of expected payoffs and context arrivals. The central topic in this paper is adaptive partitions of the similarity space which are adjusted to frequently occurring contexts and high-paying arms, so that the algorithms can take advantage of the problem instances in which the expected payoffs or the context arrivals are "benign" ("low-dimensional"), in a sense that we make precise later.

We present two main results, one for stochastic payoffs and one for adversarial payoffs. For stochastic payoffs, we provide an algorithm called contextual zooming which "zooms in" on the regions of the context space that correspond to frequently occurring contexts, and the regions of the arms space that correspond to high-paying arms. Unlike the algorithms in prior work, this algorithm considers the context space and the arms space jointly – it maintains a partition of the similarity space, rather than one partition for contexts and another for arms. We develop provable guarantees that capture the "benign-ness" of the context arrivals and the expected payoffs. In the worst case, we match the guarantee (3) for the uniform algorithm. We obtain nearly matching lower bounds using the KL-divergence technique from (Auer et al., 2002b, Kleinberg, 2004).

[1] In other words, µ is a Lipschitz-continuous function on (P, D), with Lipschitz constant K_Lip = 1. Assuming K_Lip = 1 is without loss of generality (as long as K_Lip is known to the algorithm), since we can re-define D ← K_Lip D.
The lower bound is very general as it holds for every given (product) similarity space and for every fixed value of the upper bound.

Our stochastic contextual MAB setting, and specifically the contextual zooming algorithm, can be fruitfully applied beyond the ad placement scenario described above and beyond MAB with similarity information per se. First, writing x_t = t one can incorporate "temporal constraints" (across time, for each arm), and combine them with "spatial constraints" (across arms, for each time). The analysis of contextual zooming yields concrete, meaningful bounds for this scenario. In particular, we recover one of the main results in Slivkins and Upfal (2008). Second, our setting subsumes the stochastic sleeping bandits problem (Kleinberg et al., 2008a), where in each round some arms are "asleep", i.e. not available in this round. Here contexts correspond to subsets of arms that are "awake". Contextual zooming recovers and generalizes the corresponding result in Kleinberg et al. (2008a). Third, following the publication of a preliminary version of this paper, contextual zooming has been applied to bandit learning-to-rank in Slivkins et al. (2013).

For the adversarial setting, we provide an algorithm which maintains an adaptive partition of the context space and thus takes advantage of "benign" context arrivals. We develop provable guarantees that capture this "benign-ness". In the worst case, the contextual regret is bounded in terms of the covering dimension of the context space, matching (3). Our algorithm is in fact a meta-algorithm: given an adversarial bandit algorithm Bandit, we present a contextual bandit algorithm which calls Bandit as a subroutine.
Our setup is flexible: depending on what additional constraints are known about the adversarial payoffs, one can plug in a bandit algorithm from the prior work on the corresponding version of adversarial MAB, so that the regret bound for Bandit plugs into the overall regret bound.

Discussion. Adaptive partitions (of the arms space) for context-free MAB with similarity information have been introduced in (Kleinberg et al., 2008b, Bubeck et al., 2011a). This paper further explores the potential of the zooming technique in (Kleinberg et al., 2008b). Specifically, contextual zooming extends this technique to adaptive partitions of the entire similarity space, which necessitates a technically different algorithm and a more delicate analysis. We obtain a clean algorithm for contextual MAB with improved (and nearly optimal) bounds. Moreover, this algorithm applies to several other, seemingly unrelated problems and unifies some results from prior work.

One alternative approach is to maintain a partition of the context space, and run a separate instance of the zooming algorithm from Kleinberg et al. (2008b) on each set in this partition. Fleshing out this idea leads to the meta-algorithm that we present for adversarial payoffs (with Bandit being the zooming algorithm). This meta-algorithm is parameterized (and constrained) by a specific a priori regret bound for Bandit. Unfortunately, any a priori regret bound for the zooming algorithm would be a pessimistic one, which negates its main strength – the ability to adapt to "benign" expected payoffs.

Map of the paper. Section 2 covers related work, and Section 3 contains the preliminaries. Contextual zooming is presented in Section 4. Lower bounds are in Section 5. Some applications of contextual zooming are discussed in Section 6. The adversarial setting is treated in Section 8.
2 Related work

A proper discussion of the literature on bandit problems is beyond the scope of this paper. This paper follows the line of work on regret-minimizing bandits; a reader is encouraged to refer to (Cesa-Bianchi and Lugosi, 2006, Bubeck and Cesa-Bianchi, 2012) for background. A different (Bayesian) perspective on bandit problems can be found in (Gittins et al., 2011).

Most relevant to this paper is the work on bandits with large sets of arms, specifically bandits with similarity information (Agrawal, 1995, Kleinberg, 2004, Auer et al., 2007, Pandey et al., 2007, Kocsis and Szepesvari, 2006, Munos and Coquelin, 2007, Kleinberg et al., 2008b, Bubeck et al., 2011a, Kleinberg and Slivkins, 2010, Maillard and Munos, 2010). Another commonly assumed structure is linear or convex payoffs, e.g. (Awerbuch and Kleinberg, 2008, Flaxman et al., 2005, Dani et al., 2007, Abernethy et al., 2008, Hazan and Kale, 2009, Bubeck et al., 2012). Linear/convex payoffs is a much stronger assumption than similarity, essentially because it allows one to make strong inferences about far-away arms. Other assumptions have been considered, e.g. (Wang et al., 2008, Bubeck and Munos, 2010). The distinction between stochastic and adversarial payoffs is orthogonal to the structural assumption (such as Lipschitz-continuity or linearity). Papers on MAB with linear/convex payoffs typically allow adversarial payoffs, whereas papers on MAB with similarity information focus on stochastic payoffs, with the notable exceptions of Kleinberg (2004) and Maillard and Munos (2010).[2]

The notion of structured adversarial payoffs in this paper is less restrictive than the one in Maillard and Munos (2010) (which in turn specializes the notion from linear/convex payoffs), in the sense that the Lipschitz condition is assumed on the expected payoffs rather than on realized payoffs.
This is a non-trivial distinction, essentially because our notion generalizes stochastic payoffs whereas the other one does not.

Contextual MAB. In (Auer, 2002) and (Chu et al., 2011),[2] payoffs are linear in the context, which is a feature vector. (Woodroofe, 1979, Wang et al., 2005) and (Rigollet and Zeevi, 2010)[2] study contextual MAB with stochastic payoffs, under the name bandits with covariates: the context is a random variable correlated with the payoffs; they consider the case of two arms, and make some additional assumptions. Lazaric and Munos (2009)[2] consider an online labeling problem with stochastic inputs and adversarially chosen labels; inputs and hypotheses (mappings from inputs to labels) can be thought of as "contexts" and "arms" respectively. Bandits with expert advice (e.g. Auer (2002)) is the special case of contextual MAB where the context consists of the experts' advice; the advice of each expert is modeled as a distribution over arms. None of these papers is directly applicable to the present setting. Experimental work on contextual MAB includes (Pandey et al., 2007) and (Li et al., 2010, 2011).[2]

Lu et al. (2010)[2] consider the setting in this paper for a product similarity space and, essentially, recover the uniform algorithm and a lower bound that matches (3). The same guarantee (3) can also be obtained as follows. The "uniform partition" described above can be used to define "experts" for a bandits-with-expert-advice algorithm such as EXP4 (Auer et al., 2002b): for each set of the partition there is an expert whose advice is simply an arbitrary arm in this set. Then the regret bound for EXP4 yields (3). Instead of EXP4 one could use an algorithm in McMahan and Streeter (2009)[2] which improves over EXP4 if the experts are not "too distinct"; however, it is not clear if it translates into concrete improvements over (3).
If the context x_t is time-invariant, our setting reduces to the Lipschitz MAB problem as defined in (Kleinberg et al., 2008b), which in turn reduces to continuum-armed bandits (Agrawal, 1995, Kleinberg, 2004, Auer et al., 2007) if the metric space is a real line, and to MAB with stochastic payoffs (Auer et al., 2002a) if the similarity information is absent.

[2] This paper is concurrent and independent work w.r.t. the preliminary publication of this paper on arxiv.org.

3 Preliminaries

We will use the notation from the Introduction. In particular, x_t will denote the t-th context arrival, i.e. the context that arrives in round t, and y_t will denote the arm chosen by the algorithm in that round. We will use x_{(1..T)} to denote the sequence of the first T context arrivals (x_1, ..., x_T). The badness of a point (x, y) ∈ P is defined as ∆(x, y) := µ*(x) − µ(x, y). The context-specific best arm is

y*(x) ∈ argmax_{y ∈ Y: (x,y) ∈ P} µ(x, y),   (4)

where ties are broken in an arbitrary but fixed way. To ensure that the max in (4) is attained by some y ∈ Y, we will assume that the similarity space (P, D) is compact.

Metric spaces. Covering dimension and related notions are crucial throughout this paper. Let P be a set of points in a metric space, and fix r > 0. An r-covering of P is a collection of subsets of P, each of diameter strictly less than r, that cover P. The minimal number of subsets in an r-covering is called the r-covering number of P and denoted N_r(P).[3] The covering dimension of P (with multiplier c) is the smallest d such that N_r(P) ≤ c r^{−d} for each r > 0. In particular, if S is a subset of Euclidean space then its covering dimension is at most the linear dimension of S, but can be (much) smaller.

Covering is closely related to packing. A subset S ⊂ P is an r-packing of P if the distance between any two points in S is at least r.
The maximal number of points in an r-packing is called the r-packing number and denoted N^pack_r(P). It is well known that r-packing numbers are essentially the same as r-covering numbers, namely N_{2r}(P) ≤ N^pack_r(P) ≤ N_r(P).

The doubling constant c_DBL(P) of P is the smallest k such that any ball can be covered by k balls of half the radius. The doubling constant (and the doubling dimension log c_DBL) was introduced in Heinonen (2001) and has been a standard notion in the theoretical computer science literature since Gupta et al. (2003). It was used to characterize tractable problem instances for a variety of problems (e.g. see Talwar, 2004, Kleinberg et al., 2009, Cole and Gottlieb, 2006). It is known that c_DBL(P) ≥ c 2^d if d is the covering dimension of P with multiplier c, and that c_DBL(P) ≤ 2^d if P is a bounded subset of d-dimensional Euclidean space. A useful observation is that if the distance between any two points in S is > r, then any ball of radius r contains at most c_DBL points of S.

A ball with center x and radius r is denoted B(x, r). Formally, we will treat a ball as a (center, radius) pair rather than a set of points. A function f: P → R is a Lipschitz function on a metric space (P, D), with Lipschitz constant K_Lip, if the Lipschitz condition holds: |f(x) − f(x′)| ≤ K_Lip D(x, x′) for each x, x′ ∈ P.

Accessing the similarity space. We assume full and computationally unrestricted access to the similarity information. While the issues of efficient representation thereof are important in practice, we believe that a proper treatment of these issues would be specific to the particular application and the particular similarity metric used, and would obscure the present paper.
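The packing notions above can be made concrete with a short greedy routine; the finite grid of points and the absolute-difference metric below are illustrative assumptions.

```python
def greedy_r_packing(points, r, dist):
    """Greedily build an r-packing: keep a point iff it lies at distance >= r
    from every point kept so far. The result is a maximal (not necessarily
    maximum) r-packing, so every point of the set lies within distance < r
    of some kept point, and the size lower-bounds N^pack_r."""
    packing = []
    for p in points:
        if all(dist(p, q) >= r for q in packing):
            packing.append(p)
    return packing

# Example: a uniform grid on [0, 1] under the absolute-difference metric.
grid = [i / 100 for i in range(101)]
pack = greedy_r_packing(grid, 0.1, lambda a, b: abs(a - b))
```

By construction the kept points are pairwise at distance at least r (an r-packing), and maximality means the same points also form the centers of a covering by radius-r balls, illustrating the tight relationship between packing and covering numbers.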
One clean formal way to address this issue is to assume oracle access: an algorithm accesses the similarity space via a few specific types of queries, and invokes an "oracle" that answers such queries.

Time horizon. We assume that the time horizon is fixed and known in advance. This assumption is without loss of generality in our setting. This is due to the well-known doubling trick which converts a bandit algorithm with a fixed time horizon into one that runs indefinitely and achieves essentially the same regret bound. Suppose for any fixed time horizon T there is an algorithm ALG_T whose regret is at most R(T). The new algorithm proceeds in phases i = 1, 2, 3, ... of duration 2^i rounds each, so that in each phase i a fresh instance of ALG_{2^i} is run. This algorithm has regret O(log T) R(T) for each round T, and O(R(T)) in the typical case when R(T) ≥ T^γ for some constant γ > 0.

[3] The covering number can be defined via radius-r balls rather than diameter-r sets. This alternative definition lacks the appealing "robustness" property: N_r(P′) ≤ N_r(P) for any P′ ⊂ P, but (other than that) is equivalent for this paper.

4 The contextual zooming algorithm

In this section we consider the contextual MAB problem with stochastic payoffs. We present an algorithm for this problem, called contextual zooming, which takes advantage of both the "benign" context arrivals and the "benign" expected payoffs. The algorithm adaptively maintains a partition of the similarity space, "zooming in" on both the "popular" regions of the context space and the high-payoff regions of the arms space. Contextual zooming extends the (context-free) zooming technique in (Kleinberg et al., 2008b), which necessitates a somewhat more complicated algorithm.
In particular, the selection and activation rules are defined differently, there is a new notion of "domains", and there is a distinction between the "pre-index" and the "index". The analysis is more delicate, both the high-probability argument in Claim 4.3 and the subsequent argument that bounds the number of samples from suboptimal arms. Also, the key step of setting up the regret bounds is very different, especially for the improved regret bounds in Section 4.4.

4.1 Provable guarantees

Let us define the notions that express the performance of contextual zooming. These notions rely on the packing number N_r(·) in the similarity space (P, D), and on more refined versions thereof that take into account "benign" expected payoffs and "benign" context arrivals. Our guarantees have the following form, for some integer numbers {N_r}_{r ∈ (0,1)}:

R(T) ≤ C_0 inf_{r_0 ∈ (0,1)} ( r_0 T + Σ_{r = 2^{-i}: i ∈ N, r_0 ≤ r ≤ 1} (1/r) N_r log T ).   (5)

Here and thereafter, C_0 = O(1) unless specified otherwise. In the pessimistic version, N_r = N_r(P) is the r-packing number of P.[4] The main contribution is refined bounds in which N_r is smaller.

For every guarantee of the form (5), call it an N_r-type guarantee, prior work (e.g., Kleinberg (2004), Kleinberg et al. (2008b), Bubeck et al. (2011a)) suggests a more tractable dimension-type guarantee. This guarantee is in terms of the covering-type dimension induced by N_r, defined as follows: [5]

d_c := inf { d > 0 : N_r ≤ c r^{−d}  ∀r ∈ (0, 1) }.   (6)

Using (5) with r_0 = T^{−1/(d_c + 2)}, we obtain

R(T) ≤ O(C_0) ( c T^{1 − 1/(2 + d_c)} log T )   (∀c > 0).   (7)

For the pessimistic version (N_r = N_r(P)), the corresponding covering-type dimension d_c is the covering dimension of the similarity space. The resulting guarantee (7) subsumes the bound (3) from prior work (because the covering dimension of a product similarity space is d_X + d_Y), and extends this bound from product similarity spaces (2) to arbitrary similarity spaces.

To account for "benign" expected payoffs, instead of the r-packing number of the entire set P we consider the r-packing number of a subset of P which only includes points with near-optimal expected payoffs:

P_{µ,r} := { (x, y) ∈ P : µ*(x) − µ(x, y) ≤ 12 r }.   (8)

We define the r-zooming number as N_r(P_{µ,r}), the r-packing number of P_{µ,r}. The corresponding covering-type dimension (6) is called the contextual zooming dimension.

[4] Then (5) can be simplified to R(T) ≤ inf_{r ∈ (0,1)} O( r T + (1/r) N_r(P) log T ), since N_r(P) is non-increasing in r.

[5] One standard definition of the covering dimension is (6) for N_r = N_r(P) and c = 1. Following Kleinberg et al. (2008b), we include an explicit dependence on c in (6) to obtain a more efficient regret bound (which holds for any c).

The r-zooming number can be seen as an optimistic version of N_r(P): while equal to N_r(P) in the worst case, it can be much smaller if the set of near-optimal context-arm pairs is "small" in terms of the packing number. Likewise, the contextual zooming dimension is an optimistic version of the covering dimension.

Theorem 4.1. Consider the contextual MAB problem with stochastic payoffs. There is an algorithm (namely, Algorithm 1 described below) whose contextual regret R(T) satisfies (5) with N_r equal to N_r(P_{µ,r}), the r-zooming number. Consequently, R(T) satisfies the dimension-type guarantee (7), where d_c is the contextual zooming dimension.

In Theorem 4.1, the same algorithm enjoys the bound (7) for each c > 0. This is a useful trade-off since different values of c may result in drastically different values of the dimension d_c.
On the contrary, the "uniform algorithm" from prior work essentially needs to take the c as input. Further refinements that take into account "benign" context arrivals are deferred to Section 4.4.

4.2 Description of the algorithm

The algorithm is parameterized by the time horizon T. In each round t, it maintains a finite collection A_t of balls in (P, D) (called active balls) which collectively cover the similarity space. Adding active balls is called activating; balls stay active once they are activated. Initially there is only one active ball which has radius 1 and therefore contains the entire similarity space.

At a high level, each round t proceeds as follows. Context x_t arrives. Then the algorithm selects an active ball B and an arm y_t such that (x_t, y_t) ∈ B, according to the "selection rule". Arm y_t is played. Then one ball may be activated, according to the "activation rule". In order to state the two rules, we need to put forward several definitions.

Fix an active ball B and round t. Let r(B) be the radius of B. The confidence radius of B at time t is

conf_t(B) := 4 √( log T / (1 + n_t(B)) ),   (9)

where n_t(B) is the number of times B has been selected by the algorithm before round t. The domain of ball B in round t is a subset of B that excludes all balls B′ ∈ A_t of strictly smaller radius:

dom_t(B) := B \ ( ∪_{B′ ∈ A_t: r(B′) < r(B)} B′ ).   (10)

Let ν_t(B) denote the average payoff observed in the rounds in which B was selected before round t. The pre-index and the index of an active ball B in round t are defined as

I_t^pre(B) := ν_t(B) + r(B) + conf_t(B),   (11)

I_t(B) := r(B) + min_{B′ ∈ A_t} ( I_t^pre(B′) + D(B, B′) ).   (12)

The selection rule picks an active ball B with the maximal index I_t(B) among the active balls whose domain contains (x_t, y) for some arm y, and then an arm y_t with (x_t, y_t) ∈ dom_t(B). The activation rule fires once conf_t(B_t^sel) ≤ r(B_t^sel): a new ball with center (x_t, y_t) and radius r(B_t^sel)/2 is activated, and B_t^sel is called its parent.

4.3 Analysis

Throughout the analysis we will use the following notation. For a ball B with center (x, y) ∈ P, define the expected payoff of B as µ(B) := µ(x, y). Let B_t^sel be the active ball selected by the algorithm in round t. Recall that the badness of (x, y) ∈ P is defined as ∆(x, y) := µ*(x) − µ(x, y).

Claim 4.3. If ball B is active in round t, then with probability at least 1 − T^{−2} we have that

|ν_t(B) − µ(B)| ≤ r(B) + conf_t(B).   (13)

Proof.
Fix ball V with center ( x, y ) . Let S be the set of rounds s ≤ t when ball B was select ed by the algori thm, and let n = | S | be the number of such roun ds. Then ν t ( B ) = 1 n P s ∈ S π s ( x s , y s ) . Define Z k = P ( π s ( x s , y s ) − µ ( x s , y s )) , where the sum is tak en ov er the k smallest elements s ∈ S . Then { Z k ∧ n } k ∈ N is a martingale with bounded incremen ts. (Note that n here is a random varia ble.) S o by the Azuma-Hoef fding inequality w ith probab ility at least 1 − T − 3 it holds that 1 k | Z k ∧ n | ≤ conf t ( B ) , for each k ≤ T . T aking the Union Bound, it follo ws that 1 n | Z n | ≤ conf t ( B ) . Note that | µ ( x s , y s ) − µ ( B ) | ≤ r ( B ) for each s ∈ S , so | ν t ( B ) − µ ( B ) | ≤ r ( B ) + 1 n | Z n | , which completes the proof. Note that (13) implies I pre ( B ) ≥ µ ( B ) , so that I pre ( B ) is i ndeed a UCB on µ ( B ) . Call a run of the algorithm clean if (13) holds for each round. From no w on we will focus on a clean run, and ar gue deterministical ly using (13). The heart of the anal ysis is the follo wing lemma. Lemma 4.4. C onside r a clean run of the algorithm. T hen ∆( x t , y t ) ≤ 14 r ( B sel t ) in eac h r ound t . Pr oof. Fix round t . B y the cov ering in v ariant , ( x t , y ∗ ( x t )) ∈ B for some acti ve ball B . Recall from (12) that I t ( B ) = r ( B ) + I pre ( B ′ ) + D ( B , B ′ ) fo r some acti ve ball B ′ . Therefore I t ( B sel t ) ≥ I t ( B ) = I pre ( B ′ ) + r ( B ) + D ( B , B ′ ) (select ion rule, defn of inde x (12 )) ≥ µ ( B ′ ) + r ( B ) + D ( B , B ′ ) (“clean run”) ≥ µ ( B ) + r ( B ) ≥ µ ( x t , y ∗ ( x t )) = µ ∗ ( x t ) . 
On the other hand, letting $B^{\mathrm{par}}$ be the parent of $B^{\mathrm{sel}}_t$ and noting that by the activation rule

$$\max\big( \mathcal{D}(B^{\mathrm{sel}}_t, B^{\mathrm{par}}),\ \mathrm{conf}_t(B^{\mathrm{par}}) \big) \leq r(B^{\mathrm{par}}), \qquad (15)$$

we can upper-bound $I_t(B^{\mathrm{sel}}_t)$ as follows:

$$\begin{aligned}
I^{\mathrm{pre}}_t(B^{\mathrm{par}}) &= \nu_t(B^{\mathrm{par}}) + r(B^{\mathrm{par}}) + \mathrm{conf}_t(B^{\mathrm{par}}) &&\text{(defn of preindex (11))}\\
&\leq \mu(B^{\mathrm{par}}) + 2\, r(B^{\mathrm{par}}) + 2\, \mathrm{conf}_t(B^{\mathrm{par}}) &&\text{("clean run")}\\
&\leq \mu(B^{\mathrm{par}}) + 4\, r(B^{\mathrm{par}}) &&\text{("parenthood" (15))}\\
&\leq \mu(B^{\mathrm{sel}}_t) + 5\, r(B^{\mathrm{par}}) &&\text{(Lipschitz property (1))}
\end{aligned} \qquad (16)$$

$$\begin{aligned}
I_t(B^{\mathrm{sel}}_t) &\leq r(B^{\mathrm{sel}}_t) + I^{\mathrm{pre}}_t(B^{\mathrm{par}}) + \mathcal{D}(B^{\mathrm{sel}}_t, B^{\mathrm{par}}) &&\text{(defn of index (12))}\\
&\leq r(B^{\mathrm{sel}}_t) + I^{\mathrm{pre}}_t(B^{\mathrm{par}}) + r(B^{\mathrm{par}}) &&\text{("parenthood" (15))}\\
&\leq r(B^{\mathrm{sel}}_t) + \mu(B^{\mathrm{sel}}_t) + 6\, r(B^{\mathrm{par}}) &&\text{(by (16))}\\
&\leq \mu(B^{\mathrm{sel}}_t) + 13\, r(B^{\mathrm{sel}}_t) &&\text{($r(B^{\mathrm{par}}) = 2\, r(B^{\mathrm{sel}}_t)$)}\\
&\leq \mu(x_t, y_t) + 14\, r(B^{\mathrm{sel}}_t) &&\text{(Lipschitz property (1)).}
\end{aligned} \qquad (17)$$

Putting the pieces together, $\mu^*(x_t) \leq I_t(B^{\mathrm{sel}}_t) \leq \mu(x_t, y_t) + 14\, r(B^{\mathrm{sel}}_t)$.

Corollary 4.5. In a clean run, if ball $B$ is activated in round $t$ then $\Delta(x_t, y_t) \leq 10\, r(B)$.

Proof. By the activation rule, $B^{\mathrm{sel}}_t$ is the parent of $B$. Thus by Lemma 4.4 we immediately have $\Delta(x_t, y_t) \leq 14\, r(B^{\mathrm{sel}}_t) = 28\, r(B)$. To obtain the constant of 10 that is claimed here, we prove a more efficient special case of Lemma 4.4: if $B^{\mathrm{sel}}_t$ is a parent ball then

$$\Delta(x_t, y_t) \leq 5\, r(B^{\mathrm{sel}}_t). \qquad (18)$$

To prove (18), we simply replace (17) in the proof of Lemma 4.4 by a similar inequality in terms of $I^{\mathrm{pre}}_t(B^{\mathrm{sel}}_t)$ rather than $I^{\mathrm{pre}}_t(B^{\mathrm{par}})$:

$$\begin{aligned}
I_t(B^{\mathrm{sel}}_t) &\leq r(B^{\mathrm{sel}}_t) + I^{\mathrm{pre}}_t(B^{\mathrm{sel}}_t) &&\text{(defn of index (12))}\\
&= \nu_t(B^{\mathrm{sel}}_t) + 2\, r(B^{\mathrm{sel}}_t) + \mathrm{conf}_t(B^{\mathrm{sel}}_t) &&\text{(defn of preindex (11))}\\
&\leq \mu(B^{\mathrm{sel}}_t) + 3\, r(B^{\mathrm{sel}}_t) + 2\, \mathrm{conf}_t(B^{\mathrm{sel}}_t) &&\text{("clean run")}\\
&\leq \mu(x_t, y_t) + 5\, r(B^{\mathrm{sel}}_t).
\end{aligned}$$

For the last inequality, we use the fact that $\mathrm{conf}_t(B^{\mathrm{sel}}_t) \leq r(B^{\mathrm{sel}}_t)$ whenever $B^{\mathrm{sel}}_t$ is a parent ball.
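To make the selection and activation rules concrete, here is a minimal Python sketch of contextual zooming over a finite set of context-arm points. It is an illustration under simplifying assumptions, not the paper's implementation: the domain rule (10) is only approximated by a tie-breaking preference for smaller balls, the separation invariant is enforced with an explicit guard, and all class and function names are ours.

```python
import math

class Ball:
    """An active ball in the similarity space, with zooming statistics."""
    def __init__(self, center, radius):
        self.center = center      # a (context, arm) pair
        self.radius = radius
        self.n = 0                # times this ball has been selected
        self.payoff_sum = 0.0

    def conf(self, T):
        # confidence radius (9): conf_t(B) = 4 * sqrt(log T / (1 + n_t(B)))
        return 4.0 * math.sqrt(math.log(T) / (1 + self.n))

    def avg(self):
        # nu_t(B): average payoff observed while B was selected
        return self.payoff_sum / self.n if self.n else 0.0

class ContextualZooming:
    """Sketch of contextual zooming over a finite point set of (context, arm) pairs."""
    def __init__(self, points, dist, T):
        self.points, self.dist, self.T = points, dist, T
        # one initial ball of radius 1 covers the whole similarity space
        self.balls = [Ball(points[0], 1.0)]

    def preindex(self, B):
        # preindex (11): nu_t(B) + r(B) + conf_t(B)
        return B.avg() + B.radius + B.conf(self.T)

    def index(self, B):
        # index (12): r(B) + min over active B' of (preindex(B') + D(B, B'))
        return B.radius + min(self.preindex(B2) + self.dist(B.center, B2.center)
                              for B2 in self.balls)

    def select(self, context):
        # pick a maximal-index ball containing a feasible point for this context;
        # ties broken toward smaller balls, roughly mimicking the domain rule (10)
        relevant = [(B, p) for B in self.balls for p in self.points
                    if p[0] == context and self.dist(p, B.center) <= B.radius]
        B, (x, y) = max(relevant, key=lambda bp: (self.index(bp[0]), -bp[0].radius))
        return B, y

    def update(self, B, point, payoff):
        B.n += 1
        B.payoff_sum += payoff
        # activation rule: once conf_t(B) <= r(B), activate a child of half the
        # radius centered at the current point -- unless a smaller ball already
        # contains it (an explicit guard mimicking the separation invariant)
        if B.conf(self.T) <= B.radius and not any(
                b.radius <= B.radius / 2 and self.dist(point, b.center) <= b.radius
                for b in self.balls):
            self.balls.append(Ball(point, B.radius / 2.0))
```

On a toy instance, the root ball is played until its confidence radius drops below its radius, after which smaller balls are activated around frequently played high-payoff points.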
Now we are ready for the final regret computation. For a given $r = 2^{-i}$, $i \in \mathbb{N}$, let $F_r$ be the collection of all balls of radius $r$ that have been activated throughout the execution of the algorithm. Note that in each round, if a parent ball is selected then some other ball is activated. Thus, we can partition the rounds among active balls as follows: for each ball $B \in F_r$, let $S_B$ be the set of rounds which consists of the round when $B$ was activated and all rounds $t$ when $B$ was selected and was not a parent ball.⁶ It is easy to see that $|S_B| \leq O(r^{-2} \log T)$. Moreover, by Lemma 4.4 and Corollary 4.5 we have $\Delta(x_t, y_t) \leq 15 r$ in each round $t \in S_B$.

⁶ A given ball $B$ can be selected even after it becomes a parent ball, but in such a round some other ball $B'$ is activated, so this round is included in $S_{B'}$.

If ball $B \in F_r$ is activated in round $t$, then Corollary 4.5 asserts that its center $(x_t, y_t)$ lies in the set $\mathcal{P}_{\mu,r}$, as defined in (8). By the separation invariant, the centers of balls in $F_r$ are within distance at least $r$ from one another. It follows that $|F_r| \leq N_r$, where $N_r$ is the $r$-zooming number.

Fixing some $r_0 \in (0, 1)$, note that in each round $t$ when a ball of radius $< r_0$ was selected, regret is $\Delta(x_t, y_t) \leq O(r_0)$, so the total regret from all such rounds is at most $O(r_0 T)$. Therefore, contextual regret can be written as follows:

$$\begin{aligned}
R(T) = \sum_{t=1}^{T} \Delta(x_t, y_t) &= O(r_0 T) + \sum_{r = 2^{-i}:\ r_0 \leq r \leq 1}\; \sum_{B \in F_r}\; \sum_{t \in S_B} \Delta(x_t, y_t)\\
&\leq O(r_0 T) + \sum_{r = 2^{-i}:\ r_0 \leq r \leq 1}\; \sum_{B \in F_r} |S_B|\; O(r)\\
&\leq O\Big( r_0 T + \sum_{r = 2^{-i}:\ r_0 \leq r \leq 1} \tfrac{1}{r}\, N_r \log T \Big).
\end{aligned}$$

The $N_r$-type regret guarantee in Theorem 4.1 follows by taking the infimum over all $r_0 \in (0, 1)$.

4.4 Improved regret bounds

Let us provide regret bounds that take into account "benign" context arrivals.
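The final optimization over $r_0$ above is easy to mirror numerically. The sketch below evaluates the $N_r$-type bound $\inf_{r_0} \big( r_0 T + \sum_{r} \frac{1}{r} N_r \log T \big)$ over dyadic $r_0$; the zooming-number profile `N_of_r` and the constant `c` (standing in for the hidden absolute constants) are caller-supplied assumptions, not quantities from the paper.

```python
import math

def regret_bound(T, N_of_r, c=1.0):
    """Evaluate inf over r0 = 2^-i of
       c * ( r0*T + sum over dyadic r in [r0, 1] of (1/r) * N_of_r(r) * log T ),
    where N_of_r is a hypothetical r-zooming-number profile."""
    best = float("inf")
    i_max = int(math.log2(T)) + 1  # taking r0 below 1/T cannot help
    for i in range(i_max + 1):
        r0 = 2.0 ** (-i)
        # tail: sum over r = 2^-j with r0 <= r <= 1, i.e. j = 0..i
        tail = sum((2.0 ** j) * N_of_r(2.0 ** (-j)) * math.log(T)
                   for j in range(i + 1))
        best = min(best, c * (r0 * T + tail))
    return best
```

For instance, with a constant profile $N_r \equiv 1$ the bound is polylogarithmic-times-sublinear in $T$, far below the trivial linear regret.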
The main difficulty here is to develop the corresponding definitions; the analysis then carries over without much modification. The added value is two-fold: first, we establish the intuition that benign context arrivals matter; second, the specific regret bound is used in Section 6.2 to match the result in Slivkins and Upfal (2008).

A crucial step in the proof of Theorem 4.1 is to bound the number of active radius-$r$ balls by $N_r(\mathcal{P}_{\mu,r})$, which is accomplished by observing that their centers form an $r$-packing $S$ of $\mathcal{P}_{\mu,r}$. We make this step more efficient, as follows. An active radius-$r$ ball $B$ is called *full* if $\mathrm{conf}_t(B) \leq r$ for some round $t$. Note that each active ball is either full or a child of some other ball that is full. The number of children of a given ball is bounded by the doubling constant of the similarity space. Thus, it suffices to consider the number of active radius-$r$ balls that are full, which is at most $N_r(\mathcal{P}_{\mu,r})$, and potentially much smaller.

Consider active radius-$r$ balls that are full. Their centers form an $r$-packing $S$ of $\mathcal{P}_{\mu,r}$ with an additional property: each point $p \in S$ is assigned at least $1/r^2$ context arrivals $x_t$ such that $(x_t, y) \in B(p, r)$ for some arm $y$, and each context arrival is assigned to at most one point in $S$.⁷ A set $S \subset \mathcal{P}$ with this property is called *$r$-consistent* (with context arrivals). The *adjusted $r$-packing number* of a set $\mathcal{P}' \subset \mathcal{P}$, denoted $N^{\mathrm{adj}}_r(\mathcal{P}')$, is the maximal size of an $r$-consistent $r$-packing of $\mathcal{P}'$. It can be much smaller than the $r$-packing number of $\mathcal{P}'$ if most context arrivals fall into a small region of the similarity space.

We make one further optimization, tailored to the application in Section 6.2. Informally, we take advantage of context arrivals $x_t$ such that the expected payoff $\mu(x_t, y)$ is either optimal or very suboptimal.
A point $(x, y) \in \mathcal{P}$ is called an *$r$-winner* if for each $(x', y') \in B((x, y), 2r)$ it holds that $\mu(x', y') = \mu^*(x')$. Let $W_{\mu,r}$ be the set of all $r$-winners. It is easy to see that if $B$ is a radius-$r$ ball centered at an $r$-winner, and $B$ or its child is selected in a given round, then this round does not contribute to contextual regret. Therefore, it suffices to consider ($r$-consistent) $r$-packings of $\mathcal{P}_{\mu,r} \setminus W_{\mu,r}$. Our final guarantee is in terms of $N^{\mathrm{adj}}_r(\mathcal{P}_{\mu,r} \setminus W_{\mu,r})$, which we term the *adjusted $r$-zooming number*.

⁷ Namely, each point $p \in S$ is assigned all contexts $x_t$ such that the corresponding ball is chosen in round $t$.

Theorem 4.6. Consider the contextual MAB problem with stochastic payoffs. The contextual regret $R(T)$ of the contextual zooming algorithm satisfies (5), where $N_r$ is the adjusted $r$-zooming number and $C_0$ is the doubling constant of the similarity space times some absolute constant. Consequently, $R(T)$ satisfies the dimension-type guarantee (7), where $d_c$ is the corresponding covering-type dimension.

5 Lower bounds

We match the upper bound in Theorem 4.1 up to $O(\log T)$ factors. Our lower bound is very general: it applies to an arbitrary product similarity space, and moreover, for a given similarity space it matches, up to $O(\log T)$ factors, any fixed value of the upper bound (as explained below). We construct a distribution $\mathcal{I}$ over problem instances on a given metric space, so that the lower bound is for a problem instance drawn from this distribution. A single problem instance would not suffice to establish a lower bound, because a trivial algorithm that picks arm $y^*(x)$ for each context $x$ would achieve regret 0.
The distribution $\mathcal{I}$ satisfies the following two properties: the upper bound in Theorem 4.1 is uniformly bounded from above by some number $R$, and any algorithm must incur regret at least $\Omega(R / \log T)$ in expectation over $\mathcal{I}$. Moreover, we construct such an $\mathcal{I}$ for every possible value of the upper bound in Theorem 4.1 on a given metric space, i.e., not just for problem instances that are "hard" for this metric space.

To formulate our result, let $R^{\mathrm{UB}}_\mu(T)$ denote the upper bound in Theorem 4.1, i.e., the right-hand side of (5) where $N_r = N_r(\mathcal{P}_{\mu,r})$ is the $r$-zooming number. Let $R^{\mathrm{UB}}(T)$ denote the pessimistic version of this bound, namely the right-hand side of (5) where $N_r = N_r(\mathcal{P})$ is the packing number of $\mathcal{P}$.

Theorem 5.1. Consider the contextual MAB problem with stochastic payoffs. Let $(\mathcal{P}, \mathcal{D})$ be a product similarity space. Fix an arbitrary time horizon $T$ and a positive number $R \leq R^{\mathrm{UB}}(T)$. Then there exists a distribution $\mathcal{I}$ over problem instances on $(\mathcal{P}, \mathcal{D})$ with the following two properties:
(a) $R^{\mathrm{UB}}_\mu(T) \leq O(R)$ for each problem instance in $\mathrm{support}(\mathcal{I})$;
(b) for any contextual bandit algorithm it holds that $\mathbb{E}_{\mathcal{I}}[R(T)] \geq \Omega(R / \log T)$.

To prove this theorem, we build on the lower-bounding technique from Auer et al. (2002b), and its extension to (context-free) bandits in metric spaces in Kleinberg (2004). In particular, we use the basic needle-in-the-haystack example from Auer et al. (2002b), where the "haystack" consists of several arms with expected payoff $\frac12$, and the "needle" is an arm whose expected payoff is slightly higher.

The lower-bounding construction. Our construction is parameterized by two numbers: $r \in (0, \frac12]$ and $N \leq N_r(\mathcal{P})$, where $N_r(\mathcal{P})$ is the $r$-packing number of $\mathcal{P}$. Given these parameters, we construct a collection $\mathcal{I} = \mathcal{I}_{N,r}$ of $\Theta(N)$ problem instances as follows.
Let $N_{X,r}$ be the $r$-packing number of $X$ in the context space, and let $N_{Y,r}$ be the $r$-packing number of $Y$ in the arms space. Note that $N_r(\mathcal{P}) = N_{X,r} \times N_{Y,r}$. For simplicity, let us assume that $N = n_X\, n_Y$, where $1 \leq n_X \leq N_{X,r}$ and $2 \leq n_Y \leq N_{Y,r}$.

An *$r$-net* is a set $S$ of points in a metric space such that any two points in $S$ are at distance $> r$ from each other, and each point in the metric space is within distance $\leq r$ from some point in $S$. Recall that any $r$-net on the context space has size at least $N_{X,r}$. Let $S_X$ be an arbitrary set of $n_X$ points from one such $r$-net. Similarly, let $S_Y$ be an arbitrary set of $n_Y$ points from some $r$-net on the arms space. The sequence $x_{(1..T)}$ of context arrivals is any fixed permutation of the points in $S_X$, repeated indefinitely.

All problem instances in $\mathcal{I}$ have 0-1 payoffs. For each $x \in S_X$ we construct a needle-in-the-haystack example on the set $S_Y$. Namely, we pick one point $y^*(x) \in S_Y$ to be the "needle", and define $\mu(x, y^*(x)) = \frac12 + \frac{r}{4}$ and $\mu(x, y) = \frac12 + \frac{r}{8}$ for each $y \in S_Y \setminus \{y^*(x)\}$. We smoothen the expected payoffs so that far from $S_X \times S_Y$ the expected payoffs are $\frac12$ and the Lipschitz condition (1) holds:

$$\mu(x, y) \triangleq \max_{(x_0, y_0) \in S_X \times S_Y} \max\Big( \tfrac12,\; \mu(x_0, y_0) - \mathcal{D}_X(x, x_0) - \mathcal{D}_Y(y, y_0) \Big). \qquad (19)$$

Note that we obtain a distinct problem instance for each function $y^*(\cdot): S_X \to S_Y$. This completes our construction.

Analysis. The useful properties of the above construction are summarized in the following lemma:

Lemma 5.2. Fix $r \in (0, \frac12]$ and $N \leq N_r(\mathcal{P})$. Let $\mathcal{I} = \mathcal{I}_{N,r}$ and $T_0 = N r^{-2}$. Then:
(i) for each problem instance in $\mathcal{I}$ it holds that $R^{\mathrm{UB}}_\mu(T_0) \leq O(N/r)(\log T_0)$;
(ii) any contextual bandit algorithm has regret $\mathbb{E}_{\mathcal{I}}[R(T_0)] \geq \Omega(N/r)$ for a problem instance chosen uniformly at random from $\mathcal{I}$.
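The smoothed payoff function (19) is straightforward to implement. The sketch below builds $\mu(x, y)$ for one instance of the construction on a toy one-dimensional context space and arm space; the function names, the particular nets, and the needle assignment are illustrative assumptions.

```python
def make_instance(S_X, S_Y, needle, r, d_X, d_Y):
    """Expected-payoff function mu(x, y) for one instance of the lower-bound
    construction: for each context x0 in S_X, the needle arm needle[x0] gets
    payoff 1/2 + r/4 and the other arms in S_Y get 1/2 + r/8, smoothed per (19)
    so that payoffs are 1/2 far from S_X x S_Y and the Lipschitz condition
    (with respect to d_X + d_Y) holds."""
    def mu_on_net(x0, y0):
        return 0.5 + (r / 4 if y0 == needle[x0] else r / 8)

    def mu(x, y):
        # (19): max over net points of max(1/2, mu(x0, y0) - D_X - D_Y)
        return max(max(0.5, mu_on_net(x0, y0) - d_X(x, x0) - d_Y(y, y0))
                   for x0 in S_X for y0 in S_Y)
    return mu
```

Since each term inside the maximum is 1-Lipschitz in $\mathcal{D}_X(x, x_0) + \mathcal{D}_Y(y, y_0)$, and a pointwise maximum of Lipschitz functions is Lipschitz, the resulting $\mu$ obeys the required condition by construction.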
For the lower bound in Lemma 5.2, the idea is that in $T_0$ rounds each context in $S_X$ contributes $\Omega(|S_Y|/r)$ to contextual regret, resulting in total contextual regret $\Omega(N/r)$.

Before we proceed to prove Lemma 5.2, let us use it to derive Theorem 5.1. Fix an arbitrary time horizon $T$ and a positive number $R \leq R^{\mathrm{UB}}(T)$. Recall that since $N_r(\mathcal{P})$ is non-increasing in $r$, for some constant $C > 0$ it holds that

$$R^{\mathrm{UB}}(T) = C \times \inf_{r \in (0,1)} \Big( r\, T + \tfrac{1}{r}\, N_r(\mathcal{P}) \log T \Big). \qquad (20)$$

Claim 5.3. Let $r = \frac{R}{2\, C\, T\, (1 + \log T)}$. Then $r \leq \frac12$ and $T r^2 \leq N_r(\mathcal{P})$.

Proof. Denote $k(r) = N_r(\mathcal{P})$ and consider the function $f(r) \triangleq k(r)/r^2$. This function is non-increasing in $r$; $f(1) = 1$ and $f(r) \to \infty$ as $r \to 0$. Therefore there exists $r_0 \in (0, 1)$ such that $f(r_0) \leq T \leq f(r_0/2)$. Re-writing this, we obtain $k(r_0) \leq T r_0^2 \leq 4\, k(r_0/2)$. It follows that

$$R \leq R^{\mathrm{UB}}(T) \leq C \big( T r_0 + \tfrac{1}{r_0}\, k(r_0) \log T \big) \leq C\, T r_0\, (1 + \log T).$$

Thus $r \leq r_0/2$ and finally $T r^2 \leq T r_0^2 / 4 \leq k(r_0/2) \leq k(r) = N_r(\mathcal{P})$.

So, Lemma 5.2 with $r \triangleq \frac{R}{2\, C\, T\, (1 + \log T)}$ and $N \triangleq T r^2$ implies Theorem 5.1.

5.1 Proof of Lemma 5.2

Claim 5.4. Collection $\mathcal{I}$ consists of valid instances of the contextual MAB problem with similarity space $(\mathcal{P}, \mathcal{D})$.

Proof. We need to prove that each problem instance in $\mathcal{I}$ satisfies the Lipschitz condition (1). Assume the Lipschitz condition (1) is violated for some points $(x, y), (x', y') \in X \times Y$. For brevity, let $p = (x, y)$, $p' = (x', y')$, and let us write $\mu(p) \triangleq \mu(x, y)$. Then $|\mu(p) - \mu(p')| > \mathcal{D}(p, p')$. By (19), $\mu(\cdot) \in [\frac12, \frac12 + \frac{r}{4}]$, so $\mathcal{D}(p, p') < \frac{r}{4}$. Without loss of generality, $\mu(p) > \mu(p')$. In particular, $\mu(p) > \frac12$. Therefore there exists $p_0 = (x_0, y_0) \in S_X \times S_Y$ such that $\mathcal{D}(p, p_0) < \frac{r}{4}$. Then $\mathcal{D}(p', p_0) < \frac{r}{2}$ by the triangle inequality.
Now, for any other $p_0' \in S_X \times S_Y$ it holds that $\mathcal{D}(p_0, p_0') > r$, and thus by the triangle inequality $\mathcal{D}(p, p_0') > \frac{3r}{4}$ and $\mathcal{D}(p', p_0') > \frac{r}{2}$. It follows that (19) can be simplified as follows:

$$\begin{cases} \mu(p) = \max\big( \tfrac12,\; \mu(p_0) - \mathcal{D}(p, p_0) \big), \\ \mu(p') = \max\big( \tfrac12,\; \mu(p_0) - \mathcal{D}(p', p_0) \big). \end{cases}$$

Therefore

$$\begin{aligned}
|\mu(p) - \mu(p')| = \mu(p) - \mu(p') &= \big( \mu(p_0) - \mathcal{D}(p, p_0) \big) - \max\big( \tfrac12,\; \mu(p_0) - \mathcal{D}(p', p_0) \big)\\
&\leq \big( \mu(p_0) - \mathcal{D}(p, p_0) \big) - \big( \mu(p_0) - \mathcal{D}(p', p_0) \big)\\
&= \mathcal{D}(p', p_0) - \mathcal{D}(p, p_0) \leq \mathcal{D}(p, p').
\end{aligned}$$

So we have obtained a contradiction.

Claim 5.5. For each instance in $\mathcal{I}$ and $T_0 = N r^{-2}$ it holds that $R^{\mathrm{UB}}_\mu(T_0) \leq O(N/r)(\log T_0)$.

Proof. Recall that $R^{\mathrm{UB}}_\mu(T_0)$ is the right-hand side of (5) with $N_r = N_r(\mathcal{P}_{\mu,r})$, where $\mathcal{P}_{\mu,r}$ is defined by (8). Fix $r' > 0$. It is easy to see that $\mathcal{P}_{\mu,r'} \subset \cup_{p \in S_X \times S_Y}\, B(p, \frac{r}{4})$. It follows that $N_{r'}(\mathcal{P}_{\mu,r'}) \leq N$ whenever $r' \geq \frac{r}{4}$. Therefore, taking $r_0 = \frac{r}{4}$ in (5), we obtain $R^{\mathrm{UB}}_\mu(T_0) \leq O\big( r\, T_0 + \frac{N}{r} \log T_0 \big) = O(N/r)(\log T_0)$.

Claim 5.6. Fix a contextual bandit algorithm $\mathcal{A}$. This algorithm has regret $\mathbb{E}_{\mathcal{I}}[R(T_0)] \geq \Omega(N/r)$ for a problem instance chosen uniformly at random from $\mathcal{I}$, where $T_0 = N r^{-2}$.

Proof. Let $R(x, T)$ be the contribution of context $x \in S_X$ to contextual regret:

$$R(x, T) = \sum_{t:\ x_t = x} \mu^*(x) - \mu(x, y_t),$$

where $y_t$ is the arm chosen by the algorithm in round $t$. Our goal is to show that $R(x, T_0) \geq \Omega(r\, n_Y)$. We will consider each context $x \in S_X$ separately: the rounds when $x$ arrives form an instance $I_x$ of a context-free bandit problem that lasts for $T_0 / n_X = n_Y\, r^{-2}$ rounds, where expected payoffs are given by $\mu(x, \cdot)$ as defined in (19). Let $\mathcal{I}_x$ be the family of all such instances $I_x$.
A uniform distribution over $\mathcal{I}$ can be reformulated as follows: for each $x \in S_X$, pick the "needle" $y^*(x)$ independently and uniformly at random from $S_Y$. This induces a uniform distribution over instances in $\mathcal{I}_x$, for each context $x \in S_X$. Informally, knowing full or partial information about $y^*(x)$ for some $x$ reveals no information whatsoever about $y^*(x')$ for any $x' \neq x$. Formally, the contextual bandit algorithm $\mathcal{A}$ induces a bandit algorithm $\mathcal{A}_x$ for $I_x$, for each context $x \in S_X$: $\mathcal{A}_x$ simulates the problem instance for $\mathcal{A}$ for all contexts $x' \neq x$ (starting from "needles" $y^*(x')$ chosen independently and uniformly at random from $S_Y$). Then $\mathcal{A}_x$ has expected regret $R_x(T)$ which satisfies $\mathbb{E}[R_x(T)] = \mathbb{E}[R(x, T)]$, where the expectations on both sides are over the randomness in the respective algorithm and the random choice of the problem instance (resp., from $\mathcal{I}_x$ and from $\mathcal{I}$).

Thus, it remains to handle each $\mathcal{I}_x$ separately: i.e., to prove that the expected regret of any bandit algorithm on an instance drawn uniformly at random from $\mathcal{I}_x$ is at least $\Omega(r\, n_Y)$. We use the KL-divergence technique that originated in Auer et al. (2002b). If the set of arms were exactly $S_Y$, then the desired lower bound would follow from Auer et al. (2002b) directly. To handle the problem instances in $\mathcal{I}_x$, we use an extension of the technique from Auer et al. (2002b), which is implicit in Kleinberg (2004) and encapsulated as a stand-alone theorem in Kleinberg et al. (2013). We restate this theorem as Theorem A.2 in Appendix A.

It is easy to check that the family $\mathcal{I}_x$ of problem instances satisfies the preconditions in Theorem A.2. Fix $x \in S_X$. For a given choice of the "needle" $y^* = y^*(x) \in S_Y$, let $\mu(x, y \mid y^*)$ be the expected payoff of each arm $y$, and let $\nu_{y^*}(\cdot) = \mu(x, \cdot \mid y^*)$ be the corresponding payoff function for the bandit instance $I_x$.
Then $\{\nu_{y^*}\}_{y^* \in S_Y}$ is an "$(\epsilon, k)$-ensemble" for $\epsilon = \frac{r}{8}$ and $k = |S_Y|$.

6 Applications of contextual zooming

We describe several applications of contextual zooming: to MAB with slow adversarial change (Section 6.1), to MAB with stochastically evolving payoffs (Section 6.2), and to the "sleeping bandits" problem (Section 6.3). In particular, we recover some of the main results in Slivkins and Upfal (2008) and Kleinberg et al. (2008a). Also, in Section 6.4 we discuss a recent application of contextual zooming to bandit learning-to-rank, which has been published in Slivkins et al. (2013).

6.1 MAB with slow adversarial change

Consider the (context-free) adversarial MAB problem in which the expected payoffs of each arm change over time gradually. Specifically, we assume that the expected payoff of each arm $y$ changes by at most $\sigma_y$ in each round, for some a-priori known volatilities $\sigma_y$. The algorithm's goal here is to continuously adapt to the changing environment, rather than converge to the best fixed mapping from contexts to arms. We call this setting the *drifting MAB problem*.

Formally, our benchmark is a fictitious algorithm which in each round selects an arm with maximal expected payoff in that round. The difference in expected payoff between this benchmark and a given algorithm is called the *dynamic regret* of this algorithm. It is easy to see that the worst-case dynamic regret of any algorithm cannot be sublinear in time.⁸ We are primarily interested in the algorithm's long-term performance, as quantified by the average dynamic regret $\hat{R}(T) \triangleq R(T)/T$. Our goal is to bound the limit $\lim_{T \to \infty} \hat{R}(T)$ in terms of the parameters: the number of arms and the volatilities $\sigma_y$. (In general, such an upper bound is non-trivial as long as it is smaller than 1, since all payoffs are at most 1.)
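As a toy illustration of the dynamic-regret benchmark just defined (our own example, with hypothetical payoff values, not from the paper), the benchmark picks the per-round best arm:

```python
def dynamic_regret(mu, choices):
    """Dynamic regret after T rounds: sum over t of
    max_y mu[t][y] - mu[t][choices[t]],
    where mu[t] lists the expected payoffs of all arms in round t and
    choices[t] is the arm the algorithm played."""
    return sum(max(row) - row[c] for row, c in zip(mu, choices))
```

For example, if arm 1 is best in round 0 but arm 0 overtakes it in round 1, an algorithm that keeps playing arm 1 pays the gap only in the second round.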
We restate this setting as a contextual MAB problem with stochastic payoffs in which the $t$-th context arrival is simply $x_t = t$. Then $\mu(t, y)$ is the expected payoff of arm $y$ at time $t$, and dynamic regret coincides with contextual regret specialized to the case $x_t = t$. Each arm $y$ satisfies a "temporal constraint":

$$|\mu(t, y) - \mu(t', y)| \leq \sigma_y\, |t - t'| \qquad (21)$$

for some constant $\sigma_y$. To set up the corresponding similarity space $(\mathcal{P}, \mathcal{D})$, let $\mathcal{P} = [T] \times Y$ and

$$\mathcal{D}\big( (t, y),\, (t', y') \big) = \min\big( 1,\; \sigma_y\, |t - t'| + \mathbf{1}_{\{y \neq y'\}} \big). \qquad (22)$$

Our solution for the drifting MAB problem is the contextual zooming algorithm parameterized by the similarity space $(\mathcal{P}, \mathcal{D})$. To obtain guarantees for long-term performance, we run contextual zooming with a suitably chosen time horizon $T_0$ and restart it every $T_0$ rounds; we call this version *contextual zooming with period $T_0$*. Periodically restarting the algorithm is a simple way to prevent the change over time from becoming too large; it suffices to obtain strong provable guarantees.

The general provable guarantees are provided by Theorem 4.1 and Theorem 4.6. Below we work out some specific, tractable corollaries.

Corollary 6.1. Consider the drifting MAB problem with $k$ arms and volatilities $\sigma_y \equiv \sigma$. Contextual zooming with period $T_0$ has average dynamic regret $\hat{R}(T) = O(k\, \sigma \log T_0)^{1/3}$ whenever $T \geq T_0 \geq (k/\sigma^2)^{1/3} \log\frac{k}{\sigma}$.

Proof. It suffices to upper-bound regret in a single period. Indeed, if $R(T_0) \leq R$ for any problem instance, then $R(T) \leq R\, \lceil T/T_0 \rceil$ for any $T > T_0$. It follows that $\hat{R}(T) \leq 2\, \hat{R}(T_0)$. Therefore, from here on we can focus on analyzing contextual zooming itself, rather than contextual zooming with a period.

The main step is to derive the regret bound (5) with a specific upper bound on $N_r$. We will show that dynamic regret $R(\cdot)$ satisfies (5) with

$$N_r \leq k\, \lceil T \sigma / r \rceil. \qquad (23)$$
Plugging $N_r \leq k\, (1 + T\sigma/r)$ into (5) and taking $r_0 = (k\, \sigma \log T)^{1/3}$, we obtain⁹

$$R(T) \leq O(T)\, (k\, \sigma \log T)^{1/3} + O(k^2/\sigma)^{1/3} (\log T) \qquad \forall T \geq 1.$$

Therefore, for any $T \geq (k/\sigma^2)^{1/3} \log\frac{k}{\sigma}$ we have $\hat{R}(T) = O(k\, \sigma \log T)^{1/3}$.

It remains to prove (23). We use a pessimistic version of Theorem 4.1: (5) with $N_r = N_r(\mathcal{P})$, the $r$-packing number of $\mathcal{P}$. Fix $r \in (0, 1]$. For any $r$-packing $S$ of $\mathcal{P}$ and each arm $y$, each time interval $I$ of duration $\Delta_r \triangleq r/\sigma$ provides at most one point for $S$: there exists at most one time $t \in I$ such that $(t, y) \in S$. Since there are at most $\lceil T/\Delta_r \rceil$ such intervals $I$, it follows that $N_r(\mathcal{P}) \leq k\, \lceil T/\Delta_r \rceil \leq k\, (1 + T\sigma/r)$.

The restriction $\sigma_y \equiv \sigma$ is non-essential: it is not hard to obtain the same bound with $\sigma = \frac{1}{k} \sum_y \sigma_y$. Modifying the construction in Section 5 (details omitted from this version), one can show that Corollary 6.1 is optimal up to $O(\log T)$ factors.

⁸ For example, consider problem instances with two arms such that the payoff of each arm in each round is either $\frac12$ or $\frac12 + \sigma$ (and can change from round to round). Over this family of problem instances, dynamic regret in $T$ rounds is at least $\frac12 \sigma T$.
⁹ This choice of $r_0$ minimizes the inf expression in (5) up to constant factors by equating the two summands.

Drifting MAB with spatial constraints. The temporal version ($x_t = t$) of our contextual MAB setting with stochastic payoffs subsumes the drifting MAB problem, and furthermore allows one to combine the temporal constraints (21) described above (for each arm, across time) with "spatial constraints" (for each time, across arms). To the best of our knowledge, such MAB models are quite rare in the literature.¹⁰ A clean example is

$$\mathcal{D}\big( (t, y),\, (t', y') \big) = \min\big( 1,\; \sigma\, |t - t'| + \mathcal{D}_Y(y, y') \big), \qquad (24)$$

where $(Y, \mathcal{D}_Y)$ is the arms space.
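The packing argument behind (23) is easy to check numerically: build an $r$-packing of $\mathcal{P} = [T] \times Y$ greedily under the temporal distance (22) and compare its size against $k\, (1 + T\sigma/r)$. The helper names below are ours, and the greedy packing is only one of many valid $r$-packings.

```python
def drift_distance(p, q, sigma):
    # similarity distance (22): min(1, sigma * |t - t'| + 1{y != y'})
    (t1, y1), (t2, y2) = p, q
    return min(1.0, sigma * abs(t1 - t2) + (0.0 if y1 == y2 else 1.0))

def greedy_packing_size(T, k, sigma, r):
    """Greedily build an r-packing of P = [T] x [k] under (22): add a point
    whenever it is at distance > r from everything already packed. By the
    interval argument for (23), the result has size at most k * (1 + T*sigma/r)."""
    packing = []
    for t in range(T):
        for y in range(k):
            if all(drift_distance((t, y), q, sigma) > r for q in packing):
                packing.append((t, y))
    return len(packing)
```

For $r < 1$ the distance between different arms is 1, so each arm contributes its own chain of time points spaced more than $r/\sigma$ apart, exactly as in the proof.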
For the spatial example (24), we can obtain an analog of Corollary 6.1, where the regret bound depends on the covering dimension of the arms space $(Y, \mathcal{D}_Y)$.

Corollary 6.2. Consider the drifting MAB problem with spatial constraints (24), where $\sigma$ is the volatility. Let $d$ be the covering dimension of the arms space, with multiplier $k$. Contextual zooming with period $T_0$ has average dynamic regret $\hat{R}(T) = O(k\, \sigma \log T_0)^{\frac{1}{d+3}}$ whenever $T \geq T_0 \geq k^{\frac{1}{d+3}}\, \sigma^{-\frac{d+2}{d+3}} \log\frac{k}{\sigma}$.

Remark. We obtain Corollary 6.1 as a special case by setting $d = 0$.

Proof. It suffices to bound $\hat{R}(T_0)$ for (non-periodic) contextual zooming. First we bound the $r$-covering number of the similarity space $(\mathcal{P}, \mathcal{D})$:

$$N_r(\mathcal{P}) = N^X_r(X) \times N^Y_r(Y) \leq \lceil T\sigma/r \rceil\; k\, r^{-d},$$

where $N^X_r(\cdot)$ is the $r$-covering number in the context space, and $N^Y_r(\cdot)$ is that in the arms space. We worked out the former for Corollary 6.1. Plugging this into (5) and taking $r_0 = (k\, \sigma \log T)^{1/(3+d)}$, we obtain

$$R(T) \leq O(T)\, (k\, \sigma \log T)^{\frac{1}{d+3}} + O\big( k^{\frac{2}{d+3}}\, \sigma^{-\frac{d+1}{d+3}} \log T \big) \qquad \forall T \geq 1.$$

The desired bound on $\hat{R}(T_0)$ follows easily.

6.2 Bandits with stochastically evolving payoffs

We consider a special case of the drifting MAB problem in which the expected payoffs of each arm evolve over time according to a stochastic process with a uniform stationary distribution. We obtain improved regret bounds for contextual zooming, taking advantage of the full power of our analysis in Section 4. In particular, we address a version in which the stochastic process is a random walk with step $\pm\sigma$. This version has been previously studied in Slivkins and Upfal (2008) under the name "Dynamic MAB". For the main case ($\sigma_y \equiv \sigma$), our regret bound for Dynamic MAB matches that in Slivkins and Upfal (2008). To improve the flow of the paper, the proofs are deferred to Section 7.
¹⁰ The only other MAB model with this flavor that we are aware of, found in Hazan and Kale (2009), combines linear payoffs and bounded "total variation" (aggregate temporal change) of the cost functions.

Uniform marginals. First we address the general version, which we call *drifting MAB with uniform marginals*. Formally, we assume that the expected payoffs $\mu(\cdot, y)$ of each arm $y$ evolve over time according to some stochastic process $\Gamma_y$ that satisfies (21). We assume that the processes $\Gamma_y$, $y \in Y$, are mutually independent, and moreover that the marginal distributions $\mu(t, y)$ are uniform on $[0, 1]$, for each time $t$ and each arm $y$.¹¹ We are interested in $\mathbb{E}_\Gamma[\hat{R}(T)]$, the average dynamic regret in expectation over the processes $\Gamma_y$.

We obtain a stronger version of (23) via Theorem 4.6. To use this theorem, we need to bound the adjusted $r$-zooming number, call it $N_r$. We show that

$$\mathbb{E}_\Gamma[N_r] = O(k\, r)\, \lceil T\sigma/r \rceil \quad \text{and} \quad r < \sigma^{1/3} \Rightarrow N_r = 0. \qquad (25)$$

Then we obtain a different bound on dynamic regret, which is stronger than Corollary 6.1 for $k < \sigma^{-1/2}$.

Corollary 6.3. Consider drifting MAB with uniform marginals, with $k$ arms and volatilities $\sigma_y \equiv \sigma$. Contextual zooming with period $T_0$ satisfies $\mathbb{E}_\Gamma[\hat{R}(T)] = O(k\, \sigma^{2/3} \log T_0)$ whenever $T \geq T_0 \geq \sigma^{-2/3} \log\frac{1}{\sigma}$.

The crux of the proof is to show (25). Interestingly, it involves using all three optimizations in Theorem 4.6: $N_r(\mathcal{P}_{\mu,r})$, $N_r(\mathcal{P}_{\mu,r} \setminus W_{\mu,r})$, and $N^{\mathrm{adj}}_r(\cdot)$, whereas any two of them do not seem to suffice. The rest is a straightforward computation similar to the one in Corollary 6.1.

Dynamic MAB. Let us consider the Dynamic MAB problem from Slivkins and Upfal (2008). Here for each arm $y$ the stochastic process $\Gamma_y$ is a random walk with step $\pm\sigma_y$. To ensure that the random walk stays within the interval $[0, 1]$, we assume reflecting boundaries.
Formally, we assume that $1/\sigma_y \in \mathbb{N}$, and once a boundary is reached, the next step is deterministically in the opposite direction.¹² According to a well-known fact about random walks,¹³

$$\Pr\Big[ |\mu(t, y) - \mu(t', y)| \leq O\big( \sigma_y\, |t - t'|^{1/2} \log T_0 \big) \Big] \geq 1 - T_0^{-3} \quad \text{if } |t - t'| \leq T_0. \qquad (26)$$

We use contextual zooming with period $T_0$, but we parameterize it by a different similarity space $(\mathcal{P}, \mathcal{D}_{T_0})$ that we define according to (26). Namely, we set

$$\mathcal{D}_{T_0}\big( (t, y),\, (t', y') \big) = \min\big( 1,\; \sigma_y\, |t - t'|^{1/2} \log T_0 + \mathbf{1}_{\{y \neq y'\}} \big). \qquad (27)$$

The following corollary is proved using the same technique as Corollary 6.3:

Corollary 6.4. Consider the Dynamic MAB problem with $k$ arms and volatilities $\sigma_y \equiv \sigma$. Let $\mathrm{ALG}_{T_0}$ denote the contextual zooming algorithm with period $T_0$ which is parameterized by the similarity space $(\mathcal{P}, \mathcal{D}_{T_0})$. Then $\mathrm{ALG}_{T_0}$ satisfies $\mathbb{E}_\Gamma[\hat{R}(T)] = O(k\, \sigma \log^2 T_0)$ whenever $T \geq T_0 \geq \frac{1}{\sigma} \log\frac{1}{\sigma}$.

¹¹ E.g., this assumption is satisfied by any Markov chain on $[0, 1]$ with stationary initial distribution.
¹² Slivkins and Upfal (2008) have a slightly more general setup which does not require $1/\sigma_y \in \mathbb{N}$.
¹³ For example, this follows as a simple application of the Azuma–Hoeffding inequality.

6.3 Sleeping bandits

The sleeping bandits problem (Kleinberg et al., 2008a) is an extension of MAB where in each round some arms can be "asleep", i.e., not available in this round. One of the main results in Kleinberg et al. (2008a) is on sleeping bandits with stochastic payoffs. We recover this result using contextual zooming.

We model sleeping bandits as a contextual MAB problem where each context arrival $x_t$ corresponds to the set of arms that are "awake" in this round. More precisely, for every subset $S \subset Y$ of arms there is a distinct context $x_S$, and $\mathcal{P} = \{ (x_S, y) : y \in S \subset Y \}$ is the set of feasible context-arm pairs.
The similarity distance is simply $\mathcal{D}\big( (x, y),\, (x', y') \big) = \mathbf{1}_{\{y \neq y'\}}$. Note that the Lipschitz condition (1) is satisfied. For this setting, contextual zooming essentially reduces to the "highest awake index" algorithm in Kleinberg et al. (2008a). In fact, we can re-derive the result of Kleinberg et al. (2008a) on sleeping MAB with stochastic payoffs as an easy corollary of Theorem 4.1.

Corollary 6.5. Consider the sleeping MAB problem with stochastic payoffs. Order the arms so that their expected payoffs are $\mu_1 \leq \mu_2 \leq \ldots \leq \mu_n$, where $n$ is the number of arms. Let $\Delta_i = \mu_{i+1} - \mu_i$. Then

$$R(T) \leq \inf_{r > 0}\Big( r\, T + \sum_{i:\ \Delta_i > r} \frac{O(\log T)}{\Delta_i} \Big).$$

Proof. The $r$-zooming number $N_r(\mathcal{P}_{\mu,r})$ is equal to the number of distinct arms in $\mathcal{P}_{\mu,r}$, i.e., the number of arms $i \in Y$ such that $\Delta(x, i) \leq 12 r$ for some context $x$. Note that for a given arm $i$, the quantity $\Delta(x, i)$ is minimized when the set of awake arms is $S = \{i, i+1\}$. Therefore, $N_r(\mathcal{P}_{\mu,r})$ is equal to the number of arms $i \in Y$ such that $\Delta_i \leq 12 r$. It follows that $N_r(\mathcal{P}_{\mu,r}) = \sum_{i=1}^{n} \mathbf{1}_{\{\Delta_i \leq 12 r\}}$, and

$$\sum_{r > r_0} \tfrac{1}{r}\, N_r(\mathcal{P}_{\mu,r}) = \sum_{r > r_0} \sum_{i=1}^{n} \tfrac{1}{r}\, \mathbf{1}_{\{\Delta_i \leq 12 r\}} = \sum_{i=1}^{n} \sum_{r > r_0} \tfrac{1}{r}\, \mathbf{1}_{\{\Delta_i \leq 12 r\}} = \sum_{i:\ \Delta_i > r_0} O\big( \tfrac{1}{\Delta_i} \big).$$

Therefore,

$$R(T) \leq \inf_{r_0 > 0}\Big( r_0\, T + O(\log T) \sum_{r > r_0} \tfrac{1}{r}\, N_r(\mathcal{P}_{\mu,r}) \Big) \leq \inf_{r_0 > 0}\Big( r_0\, T + O(\log T) \sum_{i:\ \Delta_i > r_0} O\big( \tfrac{1}{\Delta_i} \big) \Big),$$

as required. (In the above equations, $\sum_{r > r_0}$ denotes the sum over all $r = 2^{-j} > r_0$ such that $j \in \mathbb{N}$.)

Moreover, the contextual MAB problem extends the sleeping bandits setting by incorporating similarity information on arms. The contextual zooming algorithm (and its analysis) applies, and is geared to exploit this additional similarity information.

6.4 Bandit learning-to-rank

Following a preliminary publication of this paper on arxiv.org, contextual zooming has been applied in Slivkins et al. (2013) to bandit learning-to-rank.
Interestingly, the "contexts" studied in Slivkins et al. (2013) are very different from what we have considered so far. The basic setting, motivated by web search, was introduced in Radlinski et al. (2008). In each round a new user arrives. The algorithm selects a ranked list of $k$ documents and presents it to the user, who clicks on at most one document, namely on the first document that (s)he finds relevant. A user is specified by a binary vector over documents. The goal is to minimize *abandonment*: the number of rounds with no clicks.

Slivkins et al. (2013) study an extension in which metric similarity information is available. They consider a version with stochastic payoffs: in each round, the user vector is an independent sample from a fixed distribution, and they assume a Lipschitz-style condition that connects expected clicks with the metric space. They run a separate bandit algorithm (e.g., contextual zooming) for each of the $k$ "slots" in the ranking. Without loss of generality, in each round the documents are selected sequentially, in top-down order. Since a document in slot $i$ is clicked in a given round only if all higher-ranked documents are not relevant, they treat the set of documents in the higher slots as a context for the $i$-th algorithm. The Lipschitz-style condition on expected clicks suffices to guarantee the corresponding Lipschitz-style condition on contexts.

7 Bandits with stochastically evolving payoffs: missing proofs

We prove Corollary 6.3 and Corollary 6.4, which address the performance of contextual zooming for stochastically evolving payoffs. In each corollary we bound from above the average dynamic regret $\hat{R}(T)$ of contextual zooming with period $T_0$, for any $T \geq T_0$. Since $\hat{R}(T) \leq 2\, \hat{R}(T_0)$, it suffices to bound $\hat{R}(T_0)$, which coincides with $\hat{R}(T_0)$ for (non-periodic) contextual zooming.
Therefore, we can focus on analyzing the non-periodic algorithm. We start with two simple auxiliary claims.

Claim 7.1. Consider the contextual MAB problem with a product similarity space. Let Δ(x,y) ≜ μ*(x) − μ(x,y) be the "badness" of point (x,y) in the similarity space. Then

|Δ(x,y) − Δ(x′,y)| ≤ 2 D_X(x,x′)  for all x,x′ ∈ X, y ∈ Y.  (28)

Proof. First we show that the benchmark payoff μ*(·) satisfies a Lipschitz condition:

|μ*(x) − μ*(x′)| ≤ D_X(x,x′)  for all x,x′ ∈ X.  (29)

Indeed, it holds that μ*(x) = μ(x,y) and μ*(x′) = μ(x′,y′) for some arms y, y′ ∈ Y. Then

μ*(x) = μ(x,y) ≥ μ(x,y′) ≥ μ(x′,y′) − D_X(x,x′) = μ*(x′) − D_X(x,x′),

and likewise for the other direction. Now,

|Δ(x,y) − Δ(x′,y)| ≤ |μ*(x) − μ*(x′)| + |μ(x,y) − μ(x′,y)| ≤ 2 D_X(x,x′).

Claim 7.2. Let Z_1, ..., Z_k be independent random variables distributed uniformly at random on [0,1]. Let Z* = max_i Z_i. Fix r > 0 and let S = {i : Z* > Z_i ≥ Z* − r}. Then E[|S|] = kr. This is a textbook result; we provide a proof for the sake of completeness.

Proof. Conditional on Z*, it holds that

E[|S| | Z*] = E[Σ_i 1{Z_i ∈ S} | Z*] = k Pr[Z_i ∈ S | Z*] = k Pr[Z_i ∈ S | Z_i < Z*] × Pr[Z_i < Z*] = k (r/Z*) ((k−1)/k) = (k−1) r / Z*.

Integrating over Z*, and letting F(z) ≜ Pr[Z* ≤ z] = z^k, we obtain E[1/Z*] = ∫_0^1 (1/z) F′(z) dz = k/(k−1), and hence E[|S|] = (k−1) r E[1/Z*] = kr.

Proof of Corollary 6.3: It suffices to bound R̂(T_0) for (non-periodic) contextual zooming. Let D_X(t,t′) ≜ σ|t−t′| be the context distance implicit in the temporal constraint (21). For each r > 0, pick a number T_r such that D_X(t,t′) ≤ r ⟺ |t−t′| ≤ T_r; clearly, T_r = r/σ. The crux is to bound the adjusted r-zooming number, call it N_r, namely to show (25).
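Claim 7.2 above is easy to verify empirically. A quick Monte-Carlo sketch (illustration only; the trial count, seed, and parameter values are arbitrary choices, not from the paper):

```python
import random

# Monte-Carlo check of Claim 7.2 (illustration; trial count and seed are
# arbitrary): with Z_1..Z_k i.i.d. uniform on [0,1], Z* = max_i Z_i, and
# S = {i : Z* > Z_i >= Z* - r}, the claim gives E|S| = k*r.
def expected_S(k, r, trials=20000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        z = [rng.random() for _ in range(k)]
        z_star = max(z)
        total += sum(1 for zi in z if z_star > zi >= z_star - r)
    return total / trials

est = expected_S(k=10, r=0.05)
assert abs(est - 10 * 0.05) < 0.05  # close to k*r = 0.5
```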
For the sake of convenience, let us restate (25) here (using the notation T_r):

E_Γ[N_r] = O(kr) ⌈T/T_r⌉, and T_r < 1/r² ⟹ N_r = 0.  (30)

Recall that N_r = N_adj(P_{μ,r} \ W_{μ,r}), where W_{μ,r} is the set of all r-winners (see Section 4.4 for the definition). Fix r ∈ (0,1] and let S be some r-packing of P_{μ,r} \ W_{μ,r}. Partition the time into ⌈T/T_r⌉ intervals of duration T_r. Fix one such interval I. Let S_I ≜ {(t,y) ∈ S : t ∈ I} be the set of points in S that correspond to times in I. Recall the notation Δ(x,y) ≜ μ*(x) − μ(x,y) and let

Y_I ≜ {y ∈ Y : Δ(t_I, y) ≤ 14r}, where t_I ≜ min(I).  (31)

All quantities in (31) refer to a fixed time t_I, which will allow us to use the uniform marginals property. Note that Y_I contains at least one arm, namely the best arm y*(t_I). We claim that

|S_I| ≤ 2 |Y_I \ {y*(t_I)}|.  (32)

Fix arm y. First, D_X(t,t′) ≤ r for any t,t′ ∈ I, so there exists at most one t ∈ I such that (t,y) ∈ S. Second, suppose such t exists. Since S ⊂ P_{μ,r}, it follows that Δ(t,y) ≤ 12r. By Claim 7.1 it holds that Δ(t_I, y) ≤ Δ(t,y) + 2 D_X(t, t_I) ≤ 14r. So y ∈ Y_I. It follows that |S_I| ≤ |Y_I|.

To obtain (32), we show that |S_I| = 0 whenever |Y_I| = 1. Indeed, suppose Y_I = {y} is a singleton set and |S_I| > 0. Then S_I = {(t,y)} for some t ∈ I. We will show that (t,y) is an r-winner, contradicting the definition of S. For any arm y′ ≠ y and any time t′ such that D_X(t,t′) ≤ 2r it holds that

μ(t_I, y) = μ*(t_I) > μ(t_I, y′) + 14r,
μ(t′, y) ≥ μ(t_I, y) − D_X(t′, t_I) ≥ μ(t_I, y) − 3r > μ(t_I, y′) + 11r ≥ μ(t′, y′) − D_X(t′, t_I) + 11r ≥ μ(t′, y′) + 8r,

and so μ(t′,y) = μ*(t′). Thus, (t,y) is an r-winner as claimed. This completes the proof of (32).
Now using (32) and Claim 7.2 we obtain that

E_Γ[|S_I|] ≤ 2 E_Γ[|Y_I \ {y*(t_I)}|] ≤ O(kr),
E_Γ[|S|] ≤ ⌈T/T_r⌉ E[|S_I|] ≤ O(kr) ⌈T/T_r⌉.

Taking the max over all possible S, we obtain E_Γ[N_adj(P_{μ,r} \ W_{μ,r})] ≤ O(kr) ⌈T/T_r⌉. To complete the proof of (30), we note that S cannot be r-consistent unless |I| ≥ 1/r².

Now that we have (30), the rest is a simple computation. We use Theorem 4.6: we take (5) with r_0 → 0, plug in (30), and recall that T_r ≥ 1/r² ⟺ r ≥ σ^{1/3}:

E_Γ[R(T)] ≤ Σ_{r=2^{-i} ≥ σ^{1/3}} (1/r) E_Γ[N_r] O(log T) ≤ Σ_{r=2^{-i} ≥ σ^{1/3}} O(k log T)(Tσ/r + 1) ≤ O(k log T)(T σ^{2/3} + log(1/σ)).

It follows that E_Γ[R̂(T)] ≤ O(k σ^{2/3} log T) for any T ≥ σ^{-2/3} log(1/σ).

Proof of Corollary 6.4: It suffices to bound R̂(T_0) for (non-periodic) contextual zooming. Recall that expected payoffs satisfy the temporal constraint (26). Consider the high-probability event that

|μ(t,y) − μ(t′,y)| ≤ σ |t−t′|^{1/2} log T_0  for all t,t′ ∈ [1,T_0], y ∈ Y.  (33)

Since the expected regret due to the failure of (33) is negligible, from here on we will assume that (33) holds deterministically. Let D_X(t,t′) ≜ σ|t−t′|^{1/2} log T_0 be the distance on contexts implicit in (33). For each r > 0, define T_r ≜ (r/(σ log T_0))². Then (30) follows exactly as in the proof of Corollary 6.3. We use Theorem 4.6 similarly: we take (5) with r_0 → 0, plug in (30), and note that T_r ≥ 1/r² ⟺ r ≥ (σ log T_0)^{1/2}. We obtain

E_Γ[R(T_0)] ≤ Σ_{r=2^{-i} ≥ (σ log T_0)^{1/2}} O(k log T_0)(T_0/T_r + 1) ≤ O(k log² T_0)(T_0 σ + log(1/σ)).

It follows that E_Γ[R̂(T)] ≤ O(k σ log² T_0) as long as T_0 ≥ (1/σ) log(1/σ).

8 Contextual bandits with adversarial payoffs

In this section we consider the adversarial setting.
We provide an algorithm which maintains an adaptive partition of the context space and thus takes advantage of "benign" context arrivals. It is in fact a meta-algorithm: given a bandit algorithm Bandit, we present a contextual bandit algorithm, called ContextualBandit, which calls Bandit as a subroutine.

8.1 Our setting

Recall that in each round t, the context x_t ∈ X is revealed, then the algorithm picks an arm y_t ∈ Y and observes the payoff π_t ∈ [0,1]. Here X is the context set and Y is the arms set. In this section, all context-arm pairs are feasible: P = X × Y.

Adversarial payoffs are defined as follows. For each round t, there is a payoff function π̂_t : X × Y → [0,1] such that π_t = π̂_t(x_t, y_t). The payoff function π̂_t is sampled independently from a time-specific distribution Π_t over payoff functions. The distributions Π_t are fixed by the adversary in advance, before the first round, and are not revealed to the algorithm. Denote μ_t(x,y) ≜ E[Π_t(x,y)].

Following Hazan and Megiddo (2007), we generalize the notion of regret for the context-free adversarial MAB to contextual MAB. The context-specific best arm is

y*(x) ∈ argmax_{y ∈ Y} Σ_{t=1}^T μ_t(x,y),  (34)

where ties are broken in an arbitrary but fixed way. We define adversarial contextual regret as

R(T) ≜ Σ_{t=1}^T μ*_t(x_t) − μ_t(x_t, y_t), where μ*_t(x) ≜ μ_t(x, y*(x)).  (35)

Similarity information is given to an algorithm as a pair of metric spaces: a metric space (X, D_X) on contexts (the context space) and a metric space (Y, D_Y) on arms (the arms space), which form the product similarity space (X × Y, D_X + D_Y). We assume that for each round t the functions μ_t and μ*_t are Lipschitz on (X × Y, D_X + D_Y) and (X, D_X), respectively, both with Lipschitz constant 1 (see Footnote 1).
We assume that the arms space is compact, in order to ensure that the max in (34) is attained by some y ∈ Y. Without loss of generality, diameter(X, D_X) ≤ 1.

Formally, a problem instance consists of the metric spaces (X, D_X) and (Y, D_Y), the sequence of context arrivals (denoted x_{(1..T)}), and a sequence of distributions (Π_t)_{t ≤ T}. Note that for a fixed distribution Π_t = Π, this setting reduces to the stochastic setting, as defined in the Introduction. For the fixed-context case (x_t = x for all t), this setting reduces to the (context-free) MAB problem with a randomized oblivious adversary.

8.2 Our results

Our algorithm is parameterized by a regret guarantee for Bandit for the fixed-context case, namely an upper bound on the convergence time (see Footnote 14). For a more concrete theorem statement, we will assume that the convergence time of Bandit is at most T_0(r) ≜ c_Y r^{-(2+d_Y)} log(1/r) for some constants c_Y and d_Y that are known to the algorithm. In particular, an algorithm in Kleinberg (2004) achieves this guarantee if d_Y is the c-covering dimension of the arms space and c_Y = O(c^{2+d_Y}).

This is a flexible formulation that can leverage prior work on adversarial bandits. For instance, if Y ⊂ R^d and for each fixed context x ∈ X the distributions Π_t randomize over linear functions π̂_t(x,·) : Y → R, then one could take Bandit from the line of work on adversarial bandits with linear payoffs. In particular, there exist algorithms with d_Y = 0 and c_Y = poly(d) (Dani et al., 2007; Abernethy et al., 2008; Bubeck et al., 2012). Likewise, for convex payoffs there exist algorithms with d_Y = 2 and c_Y = O(d) (Flaxman et al., 2005). For a bounded number of arms, algorithm EXP3 (Auer et al., 2002b) achieves d_Y = 0 and c_Y = O(√|Y|).
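To illustrate the convergence-time formulation, suppose (as an assumed regret model, not a claim from the paper) that Bandit has regret R(t) = √(|Y| t), as EXP3 does up to logarithmic factors. Then its r-convergence time works out to |Y|/r², matching the form T_0(r) = c_Y r^{-(2+d_Y)} with d_Y = 0 up to the log factor:

```python
import math

# Illustration (assumed regret model, not a claim from the paper): if
# Bandit has regret R(t) = sqrt(|Y| * t), then the r-convergence time --
# the smallest T0 with R(T) <= r*T for all T >= T0 -- is |Y| / r^2,
# i.e. T0(r) = cY * r^{-2} with dY = 0 and cY = |Y|.
def convergence_time(num_arms, r):
    # solve sqrt(num_arms * T0) = r * T0  =>  T0 = num_arms / r^2
    return num_arms / r ** 2

num_arms, r = 16, 0.1
T0 = convergence_time(num_arms, r)
assert math.isclose(math.sqrt(num_arms * T0), r * T0)  # regret = r*T exactly at T0
assert math.sqrt(num_arms * 4 * T0) <= r * (4 * T0)    # and below r*T beyond T0
```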
From here on, the context space (X, D_X) will be the only metric space considered; balls and other notions will refer to the context space only.

To quantify the "goodness" of context arrivals, our guarantees are in terms of the covering dimension of x_{(1..T)} rather than that of the entire context space. (This is the improvement over the guarantee (3) for the uniform algorithm.) In fact, we use a more refined notion which allows us to disregard a limited number of "outliers" in x_{(1..T)}.

Definition 8.1. Given a metric space and a multi-set S, the (r,k)-covering number of S is the r-covering number of the set {x ∈ S : |B(x,r) ∩ S| ≥ k} (see Footnote 15). Given a constant c and a function k : (0,1) → ℕ, the relaxed covering dimension of S with slack k(·) is the smallest d > 0 such that the (r, k(r))-covering number of S is at most c r^{-d} for all r > 0.

Our result is stated as follows:

Theorem 8.2. Consider the contextual MAB problem with adversarial payoffs, and let Bandit be a bandit algorithm. Assume that the problem instance belongs to some class of problem instances such that, for the fixed-context case, the convergence time of Bandit is at most T_0(r) ≜ c_Y r^{-(2+d_Y)} log(1/r) for some constants c_Y and d_Y that are known to the algorithm. Then ContextualBandit achieves adversarial contextual regret R(·) such that for any time T and any constant c_X > 0 it holds that

R(T) ≤ O( c²_DBL (c_X c_Y)^{1/(2+d_X+d_Y)} ) T^{1 − 1/(2+d_X+d_Y)} (log T),  (36)

where d_X is the relaxed covering dimension of x_{(1..T)} with multiplier c_X and slack T_0(·), and c_DBL is the doubling constant of x_{(1..T)}.

Remarks. For a version of (36) that is stated in terms of the "raw" (r, k_r)-covering numbers of x_{(1..T)}, see (38) in the analysis.

Footnote 14: The r-convergence time T_0(r) is the smallest T_0 such that regret is R(T) ≤ rT for each T ≥ T_0.
Footnote 15: By abuse of notation, here |B(x,r) ∩ S| denotes the number of points of S, with multiplicities, that lie in B(x,r).

8.3 Our algorithm

The contextual bandit algorithm ContextualBandit is parameterized by a (context-free) bandit algorithm Bandit, which it uses as a subroutine, and a function T_0(·) : (0,1) → ℕ.

The algorithm maintains a finite collection A of balls, called active balls. Initially there is one active ball of radius 1. Once activated, a ball B stays active. Upon activation of B, a fresh instance ALG_B of Bandit is created, whose set of "arms" is Y; ALG_B is parameterized by the time horizon T_0(r), where r is the radius of B.

The algorithm proceeds as follows. In each round t, the algorithm selects one active ball B ∈ A such that x_t ∈ B, calls ALG_B to select an arm y ∈ Y to be played, and reports the payoff π_t back to ALG_B. A given ball can be selected at most T_0(r) times, after which it is called full. B is called relevant in round t if it contains x_t and is not full. The algorithm selects a relevant ball (breaking ties arbitrarily) if such a ball exists. Otherwise, a new ball B′ is activated and selected. Specifically, let B be the smallest-radius active ball containing x_t. Then B′ = B(x_t, r/2), where r is the radius of B; B is then called the parent of B′. See Algorithm 2 for the pseudocode.

Algorithm 2 Algorithm ContextualBandit.
1: Input:
2:   Context space (X, D_X) of diameter ≤ 1; set Y of arms.
3:   Bandit algorithm Bandit and a function T_0(·) : (0,1) → ℕ.
4: Data structures:
5:   A collection A of "active balls" in (X, D_X).
6:   For each B ∈ A: counter n_B, instance ALG_B of Bandit on arms Y.
7: Initialization:
8:   B ← B(x, 1); A ← {B}; n_B ← 0; initiate ALG_B.   // center x ∈ X is arbitrary
9:   A* ← A   // active balls that are not full
10: Main loop: for each round t
11:   Input context x_t.
12:   relevant ← {B ∈ A* : x_t ∈ B}.
13:   if relevant ≠ ∅ then
14:     B ← any B ∈ relevant.
15:   else   // activate a new ball:
16:     r ← min_{B ∈ A : x_t ∈ B} r_B.
17:     B ← B(x_t, r/2).   // new ball to be added
18:     A ← A ∪ {B}; A* ← A* ∪ {B}; n_B ← 0; initiate ALG_B.
19:   y ← next arm selected by ALG_B.
20:   Play arm y, observe payoff π, report π to ALG_B.
21:   n_B ← n_B + 1.
22:   if n_B = T_0(radius(B)) then A* ← A* \ {B}.   // ball B is full

8.4 Analysis: proof of Theorem 8.2

First let us argue that algorithm ContextualBandit is well-defined. Specifically, we need to show that after the activation rule is invoked, there exists an active non-full ball containing x_t. Suppose not. Then the ball B′ = B(x_t, r/2) activated by the activation rule must be full. In particular, B′ must have been active before the activation rule was invoked, which contradicts the minimality in the choice of r. Claim proved.

We continue by listing several basic claims about the algorithm.

Claim 8.3. The algorithm satisfies the following basic properties:
(a) (Correctness) In each round t, exactly one active ball is selected.
(b) Each active ball of radius r is selected at most T_0(r) times.
(c) (Separation) For any two active balls B(x,r) and B(x′,r) we have D_X(x,x′) > r.
(d) Each active ball has at most c²_DBL children, where c_DBL is the doubling constant of x_{(1..T)}.

Proof. Part (a) is immediate from the algorithm's specification. For (b), simply note that by the algorithm's specification a ball is selected only when it is not full.

To prove (c), suppose that D_X(x,x′) ≤ r and suppose B(x′,r) is activated in some round t while B(x,r) is active. Then B(x′,r) was activated as a child of some ball B* of radius 2r.
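As an aside, Algorithm 2 above can be sketched in executable form. This is a minimal illustration under assumptions not in the paper: contexts are points of [0,1] under the metric |x − x′|, and a trivial follow-the-leader stub stands in for an arbitrary off-the-shelf Bandit.

```python
# Minimal executable sketch of Algorithm 2 (illustration; assumptions not
# in the paper: contexts lie in [0,1] with metric |x - x'|, and BanditStub
# is a trivial follow-the-leader stand-in for any context-free Bandit).

class BanditStub:
    """Tracks empirical means; plays each arm once, then the best one."""
    def __init__(self, arms):
        self.arms = list(arms)
        self.n = {a: 0 for a in self.arms}
        self.mean = {a: 0.0 for a in self.arms}

    def select(self):
        untried = [a for a in self.arms if self.n[a] == 0]
        if untried:
            return untried[0]
        return max(self.arms, key=lambda a: self.mean[a])

    def update(self, arm, payoff):
        self.n[arm] += 1
        self.mean[arm] += (payoff - self.mean[arm]) / self.n[arm]

class Ball:
    def __init__(self, center, radius, arms):
        self.center, self.radius = center, radius
        self.alg = BanditStub(arms)   # fresh ALG_B for this ball
        self.count = 0                # counter n_B

    def contains(self, x):
        return abs(x - self.center) <= self.radius

class ContextualBandit:
    def __init__(self, arms, T0):
        self.arms, self.T0 = arms, T0
        self.active = [Ball(0.5, 1.0, arms)]  # one active ball of radius 1

    def _full(self, ball):
        return ball.count >= self.T0(ball.radius)

    def play(self, x, payoff_fn):
        # lines 12-14: reuse a relevant (non-full) active ball if any
        relevant = [b for b in self.active
                    if b.contains(x) and not self._full(b)]
        if relevant:
            ball = relevant[0]
        else:
            # lines 15-18: activate B(x, r/2), r = smallest covering radius
            r = min(b.radius for b in self.active if b.contains(x))
            ball = Ball(x, r / 2.0, self.arms)
            self.active.append(ball)
        # lines 19-21: play ALG_B's chosen arm, report payoff, bump counter
        y = ball.alg.select()
        pi = payoff_fn(x, y)
        ball.alg.update(y, pi)
        ball.count += 1
        return y, pi
```

For example, with T_0(r) ≡ 4 and contexts cycling through {0, 0.1, ..., 0.9}, the root ball fills after four rounds and smaller balls are then activated around the arriving contexts. Returning to the proof of Claim 8.3(c):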
On the other hand, x′ = x_t ∈ B(x,r), so B(x,r) must have been full in round t (else no ball would have been activated), and consequently the radius of B* is at most r. Contradiction.

For (d), consider the children of a given active ball B(x,r). Note that by the activation rule the centers of these children are points in x_{(1..T)} ∩ B(x,r), and by the separation property any two of these points are at distance > r/2 from one another. By the doubling property, there can be at most c²_DBL such points.

Let us fix the time horizon T, and let R(T) denote the contextual regret of ContextualBandit. Partition R(T) into the contributions of active balls as follows. Let B be the set of all balls that are active after round T. For each B ∈ B, let S_B be the set of all rounds t when B has been selected. Then

R(T) = Σ_{B ∈ B} R_B(T), where R_B(T) ≜ Σ_{t ∈ S_B} μ*_t(x_t) − μ_t(x_t, y_t).

Claim 8.4. For each ball B = B(x,r) ∈ B, we have R_B(T) ≤ 3r T_0(r).

Proof. By the Lipschitz conditions on μ_t and μ*_t, for each round t ∈ S_B it is the case that

μ*_t(x_t) ≤ r + μ*_t(x) = r + μ_t(x, y*(x)) ≤ 2r + μ_t(x_t, y*(x)).

The t-round regret of Bandit is at most R_0(t) ≜ t T_0^{-1}(t). Therefore, letting n = |S_B| be the number of times algorithm ALG_B has been invoked, we have that

R_0(n) + Σ_{t ∈ S_B} μ_t(x_t, y_t) ≥ Σ_{t ∈ S_B} μ_t(x_t, y*(x)) ≥ Σ_{t ∈ S_B} μ*_t(x_t) − 2rn.

Therefore R_B(T) ≤ R_0(n) + 2rn. Recall that by Claim 8.3(b) we have n ≤ T_0(r). Thus, by the definition of convergence time, R_0(n) ≤ R_0(T_0(r)) ≤ r T_0(r), and therefore R_B(T) ≤ 3r T_0(r).

Let F_r be the collection of all full balls of radius r. Let us bound |F_r| in terms of the (r,k)-covering number of x_{(1..T)} in the context space, which we denote N(r,k).

Claim 8.5.
There are at most N(r, T_0(r)) full balls of radius r.

Proof. Fix r and let k = T_0(r). Let us say that a point x ∈ x_{(1..T)} is heavy if B(x,r) contains at least k points of x_{(1..T)}, counting multiplicities. Clearly, B(x,r) is full only if its center is heavy. By the definition of the (r,k)-covering number, there exists a family S of N(r,k) sets of diameter ≤ r that cover all heavy points in x_{(1..T)}. For each full ball B = B(x,r), let S_B be some set in S that contains x. By Claim 8.3(c), the sets S_B, B ∈ F_r, are all distinct. Thus, |F_r| ≤ |S| ≤ N(r,k).

Let B_r be the set of all balls of radius r that are active after round T. By the algorithm's specification, each ball in F_r has been selected T_0(r) times, so |F_r| ≤ T/T_0(r). Then, using Claim 8.3(d) and Claim 8.5, we have

|B_{r/2}| ≤ c²_DBL |F_r| ≤ c²_DBL min( T/T_0(r), N(r, T_0(r)) ),
Σ_{B ∈ B_{r/2}} R_B ≤ O(r) T_0(r) |B_{r/2}| ≤ O(c²_DBL) min( rT, r T_0(r) N(r, T_0(r)) ).  (37)

Trivially, for any full ball of radius r we have T_0(r) ≤ T. Thus, summing (37) over all such r, we obtain

R(T) ≤ O(c²_DBL) Σ_{r=2^{-i}: i ∈ ℕ, T_0(r) ≤ T} min( rT, r T_0(r) N(r, T_0(r)) ).  (38)

Note that (38) makes no assumptions on N(r, T_0(r)). Now, plugging T_0(r) = c_Y r^{-(2+d_Y)} and N(r, T_0(r)) ≤ c_X r^{-d_X} into (38) and optimizing over r, it is easy to derive the desired bound (36).

9 Conclusions

We consider a general setting for contextual bandit problems where the algorithm is given information on similarity between the context-arm pairs. The similarity information is modeled as a metric space with respect to which expected payoffs are Lipschitz-continuous. Our key contribution is an algorithm which maintains a partition of the metric space and adaptively refines this partition over time.
Due to this "adaptive partition" technique, one can take advantage of "benign" problem instances without sacrificing the worst-case performance; here "benignness" refers to both expected payoffs and context arrivals. We essentially resolve the setting where the expected payoff of every given context-arm pair either does not change over time or changes slowly. In particular, we obtain nearly matching lower bounds (for time-invariant expected payoffs and for an important special case of slow change).

We also consider the setting of adversarial payoffs. For this setting, we design a different algorithm that maintains a partition of contexts and adaptively refines it so as to take advantage of "benign" context arrivals (but not "benign" expected payoffs), without sacrificing the worst-case performance. Our algorithm can work with essentially any given off-the-shelf algorithm for standard (non-contextual) bandits, the choice of which can then be tailored to the setting at hand.

The main open questions concern relaxing the requirements on the quality of similarity information that are needed for the provable guarantees. First, it would be desirable to obtain similar results under weaker versions of the Lipschitz condition. Prior work (Kleinberg et al., 2008b; Bubeck et al., 2011a) obtained several such results for the non-contextual version of the problem, mainly because their main results do not require the full power of the Lipschitz condition. However, the analysis in this paper appears to make a heavier use of the Lipschitz condition; it is not clear whether a meaningful relaxation would suffice. Second, in some settings the available similarity information might not include any numeric upper bounds on the difference in expected payoffs; e.g., it could be given as a tree-based taxonomy on context-arm pairs, without any explicit numbers.
Yet, one wants to recover the same provable guarantees as if the numerical information were explicitly given. For the non-contextual version, this direction has been explored in Bubeck et al. (2011b) and Slivkins (2011) (see Footnote 16).

Another open question concerns our results for adversarial payoffs. Here it is desirable to extend our "adaptive partitions" technique to also take advantage of "benign" expected payoffs (in addition to "benign" context arrivals). However, to the best of our knowledge, such results are not even known for the non-contextual version of the problem.

Acknowledgements. The author is grateful to Ittai Abraham, Bobby Kleinberg and Eli Upfal for many conversations about multi-armed bandits, and to Sebastien Bubeck for help with the manuscript. Also, comments from anonymous COLT reviewers and JMLR referees have been tremendously useful in improving the presentation.

Footnote 16: Bubeck et al. (2011b) and Slivkins (2011) have been published after the preliminary publication of this paper on arxiv.org.

References

Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization. In 21st Conf. on Learning Theory (COLT), pages 263–274, 2008.

Rajeev Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33(6):1926–1951, 1995.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. of Machine Learning Research (JMLR), 3:397–422, 2002. Preliminary version in 41st IEEE FOCS, 2000.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a. Preliminary version in 15th ICML, 1998.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002b.
Preliminary version in 36th IEEE FOCS, 1995.

Peter Auer, Ronald Ortner, and Csaba Szepesvári. Improved Rates for the Stochastic Continuum-Armed Bandit Problem. In 20th Conf. on Learning Theory (COLT), pages 454–468, 2007.

Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. J. of Computer and System Sciences, 74(1):97–114, February 2008. Preliminary version in 36th ACM STOC, 2004.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Sébastien Bubeck and Rémi Munos. Open Loop Optimistic Planning. In 23rd Conf. on Learning Theory (COLT), pages 477–489, 2010.

Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online Optimization in X-Armed Bandits. J. of Machine Learning Research (JMLR), 12:1587–1627, 2011a. Preliminary version in NIPS 2008.

Sébastien Bubeck, Gilles Stoltz, and Jia Yuan Yu. Lipschitz bandits without the Lipschitz constant. In 22nd Intl. Conf. on Algorithmic Learning Theory (ALT), pages 144–158, 2011b.

Sébastien Bubeck, Nicolò Cesa-Bianchi, and Sham M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In 25th Conf. on Learning Theory (COLT), 2012.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge Univ. Press, 2006.

Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual Bandits with Linear Payoff Functions. In 14th Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), 2011.

Richard Cole and Lee-Ad Gottlieb. Searching dynamic point sets in spaces with bounded doubling dimension. In 38th ACM Symp. on Theory of Computing (STOC), pages 574–583, 2006.

Varsha Dani, Thomas P. Hayes, and Sham Kakade. The Price of Bandit Information for Online Optimization.
In 20th Advances in Neural Information Processing Systems (NIPS), 2007.

Abraham Flaxman, Adam Kalai, and H. Brendan McMahan. Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient. In 16th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 385–394, 2005.

John Gittins, Kevin Glazebrook, and Richard Weber. Multi-Armed Bandit Allocation Indices. John Wiley & Sons, 2011.

Anupam Gupta, Robert Krauthgamer, and James R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In 44th IEEE Symp. on Foundations of Computer Science (FOCS), pages 534–543, 2003.

Elad Hazan and Satyen Kale. Better algorithms for benign bandits. In 20th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 38–47, 2009.

Elad Hazan and Nimrod Megiddo. Online Learning with Prior Information. In 20th Conf. on Learning Theory (COLT), pages 499–513, 2007.

J. Heinonen. Lectures on analysis on metric spaces. Universitext. Springer-Verlag, New York, 2001.

Jon Kleinberg, Aleksandrs Slivkins, and Tom Wexler. Triangulation and Embedding Using Small Sets of Beacons. J. of the ACM, 56(6), September 2009. Preliminary version in 45th IEEE FOCS, 2004. The journal version includes results from Slivkins (2005).

Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In 18th Advances in Neural Information Processing Systems (NIPS), 2004.

Robert Kleinberg. Online Decision Problems with Large Strategy Sets. PhD thesis, MIT, 2005.

Robert Kleinberg and Aleksandrs Slivkins. Sharp Dichotomies for Regret Minimization in Metric Spaces. In 21st ACM-SIAM Symp. on Discrete Algorithms (SODA), 2010.

Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. In 21st Conf. on Learning Theory (COLT), pages 425–436, 2008a.

Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal.
Multi-Armed Bandits in Metric Spaces. In 40th ACM Symp. on Theory of Computing (STOC), pages 681–690, 2008b.

Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and Experts in Metric Spaces. Technical report, http://arxiv.org/abs/1312.1277, Dec 2013. Merged and revised version of papers in STOC 2008 and SODA 2010.

Levente Kocsis and Csaba Szepesvári. Bandit Based Monte-Carlo Planning. In 17th European Conf. on Machine Learning (ECML), pages 282–293, 2006.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6:4–22, 1985.

John Langford and Tong Zhang. The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits. In 21st Advances in Neural Information Processing Systems (NIPS), 2007.

Alessandro Lazaric and Rémi Munos. Hybrid Stochastic-Adversarial On-line Learning. In 22nd Conf. on Learning Theory (COLT), 2009.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In 19th Intl. World Wide Web Conf. (WWW), 2010.

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In 4th ACM Intl. Conf. on Web Search and Data Mining (WSDM), 2011.

Tyler Lu, Dávid Pál, and Martin Pál. Showing Relevant Ads via Lipschitz Context Multi-Armed Bandits. In 14th Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), 2010.

Odalric-Ambrym Maillard and Rémi Munos. Online Learning in Adversarial Lipschitz Environments. In European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 305–320, 2010.

H. Brendan McMahan and Matthew Streeter. Tighter Bounds for Multi-Armed Bandits with Expert Advice. In 22nd Conf. on Learning Theory (COLT), 2009.
Rémi Munos and Pierre-Arnaud Coquelin. Bandit algorithms for tree search. In 23rd Conf. on Uncertainty in Artificial Intelligence (UAI), 2007.

Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti, and Vanja Josifovski. Bandits for Taxonomies: A Model-based Approach. In SIAM Conf. on Data Mining (SDM), 2007.

Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In 25th Intl. Conf. on Machine Learning (ICML), pages 784–791, 2008.

Philippe Rigollet and Assaf Zeevi. Nonparametric Bandits with Covariates. In 23rd Conf. on Learning Theory (COLT), pages 54–66, 2010.

Herbert Robbins. Some Aspects of the Sequential Design of Experiments. Bull. Amer. Math. Soc., 58:527–535, 1952.

Aleksandrs Slivkins. Distributed Approaches to Triangulation and Embedding. In 16th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 640–649, 2005. Full version has been merged into Kleinberg et al. (2009).

Aleksandrs Slivkins. Multi-armed bandits on implicit metric spaces. In 25th Advances in Neural Information Processing Systems (NIPS), 2011.

Aleksandrs Slivkins and Eli Upfal. Adapting to a Changing Environment: the Brownian Restless Bandits. In 21st Conf. on Learning Theory (COLT), pages 343–354, 2008.

Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. Learning optimally diverse rankings over large document collections. J. of Machine Learning Research (JMLR), 14(Feb):399–436, 2013. Preliminary version in 27th ICML, 2010.

Kunal Talwar. Bypassing the embedding: Algorithms for low-dimensional metrics. In 36th ACM Symp. on Theory of Computing (STOC), pages 281–290, 2004.

Chih-Chun Wang, Sanjeev R. Kulkarni, and H. Vincent Poor. Bandit problems with side observations. IEEE Trans. on Automatic Control, 50(3):338–355, 2005.

Yizao Wang, Jean-Yves Audibert, and Rémi Munos. Algorithms for Infinitely Many-Armed Bandits.
In Advances in Neural Information Processing Systems (NIPS), pages 1729–1736, 2008.

Michael Woodroofe. A one-armed bandit problem with a concomitant variable. J. Amer. Statist. Assoc., 74(368), 1979.

Appendix A: The KL-divergence technique, encapsulated

To analyze the lower-bounding construction in Section 5, we use an extension of the KL-divergence technique from Auer et al. (2002b), which is implicit in Kleinberg (2004) and encapsulated as a stand-alone theorem in Kleinberg et al. (2013). To make the paper self-contained, we state the theorem from Kleinberg et al. (2013), along with the relevant definitions. The remainder of this section is copied from Kleinberg et al. (2013), with minor modifications.

Consider a very general MAB setting where the algorithm is given a strategy set X and a collection F of feasible payoff functions; we call it the feasible MAB problem on (X, F). For example, F can consist of all functions μ : X → [0,1] that are Lipschitz with respect to a given metric space. The lower bound relies on the existence of a collection of subsets of F with certain properties, as defined below. These subsets correspond to the children of a given tree node in the ball-tree.

Definition A.1. Let X be the strategy set and F be the set of all feasible payoff functions. An (ε, k)-ensemble is a collection of subsets F_1, ..., F_k ⊂ F such that there exist mutually disjoint subsets S_1, ..., S_k ⊂ X and a number μ_0 ∈ [1/3, 2/3] which satisfy the following. Let S = ∪_{i=1}^k S_i. Then
• on X \ S, any two functions in ∪_i F_i coincide, and are bounded from above by μ_0;
• for each i and each function μ ∈ F_i, it holds that μ = μ_0 on S \ S_i and sup(μ, S_i) = μ_0 + ε.

Assume the payoff function μ lies in ∪_i F_i.
The idea is that an algorithm needs to play arms in S_i for at least Ω(ε^{-2}) rounds in order to determine whether μ ∈ F_i, and each such step incurs regret ε if μ ∉ F_i. In our application, the subsets S_1, ..., S_k correspond to children u_1, ..., u_k of a given tree node in the ball-tree, and each F_i consists of payoff functions induced by the ends in the subtree rooted at u_i.

Theorem A.2 (Theorem 5.6 in Kleinberg et al. (2013)). Consider the feasible MAB problem with 0-1 payoffs. Let F_1, ..., F_k be an (ε, k)-ensemble, where k ≥ 2 and ε ∈ (0, 1/12). Then for any t ≤ (1/32) k ε^{-2} and any bandit algorithm there exist at least k/2 distinct i's such that the regret of this algorithm on any payoff function from F_i is at least (1/60) εt.

In Auer et al. (2002b), the authors analyzed a special case of an (ε, k)-ensemble in which there are k arms u_1, ..., u_k, and each F_i consists of a single payoff function that assigns expected payoff 1/2 + ε to arm u_i and 1/2 to all other arms.
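The Auer et al. (2002b) special case above is easy to write down concretely. The following sketch (illustrative; the values of k and ε are arbitrary choices) constructs that ensemble and checks the two properties of Definition A.1 with S_i = {u_i} and μ_0 = 1/2:

```python
# Illustration of the Auer et al. (2002b) special case: an (eps, k)-ensemble
# where each F_i = {mu_i}, with mu_i(u_i) = 1/2 + eps and mu_i(u) = 1/2 on
# every other arm. We check Definition A.1 with S_i = {u_i}, mu_0 = 1/2.
# (eps = 1/16 is an arbitrary choice in the allowed range (0, 1/12).)
def make_ensemble(k, eps):
    def payoff_fn(i):
        return lambda arm: 0.5 + eps if arm == i else 0.5
    return [payoff_fn(i) for i in range(k)]

k, eps, mu0 = 4, 0.0625, 0.5
fs = make_ensemble(k, eps)
for i, f in enumerate(fs):
    assert f(i) == mu0 + eps                             # sup over S_i is mu_0 + eps
    assert all(f(j) == mu0 for j in range(k) if j != i)  # equals mu_0 on S \ S_i
```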