Beyond the Hazard Rate: More Perturbation Algorithms for Adversarial Multi-armed Bandits

Authors: Zifan Li, Ambuj Tewari

Zifan Li (University of Michigan, zifanli@umich.edu) and Ambuj Tewari (University of Michigan, tewaria@umich.edu)

January 9, 2018

Abstract

Recent work on follow the perturbed leader (FTPL) algorithms for the adversarial multi-armed bandit problem has highlighted the role of the hazard rate of the distribution generating the perturbations. Assuming that the hazard rate is bounded, it is possible to provide regret analyses for a variety of FTPL algorithms for the multi-armed bandit problem. This paper pushes the inquiry into regret bounds for FTPL algorithms beyond the bounded hazard rate condition. There are good reasons to do so: natural distributions such as the uniform and Gaussian violate the condition. We give regret bounds for both bounded support and unbounded support distributions without assuming the hazard rate condition. We also disprove a conjecture that the Gaussian distribution cannot lead to a low-regret algorithm. In fact, it turns out that it leads to near optimal regret, up to logarithmic factors. A key ingredient in our approach is the introduction of a new notion called the generalized hazard rate.

Keywords: online learning, regret, multi-armed bandits, follow the perturbed leader, gradient-based algorithms

1 Introduction

Starting from the seminal work of Hannan [1957] and later developments due to Kalai and Vempala [2005], perturbation based algorithms (called "Follow the Perturbed Leader (FTPL)") have occupied a central place in online learning. Another major family of online learning algorithms, called "Follow the Regularized Leader (FTRL)", is based on the idea of regularization. In special cases, such as the exponential weights algorithm for the experts problem, it has been folk knowledge that regularization and perturbation ideas are connected. That is, the exponential weights algorithm can be understood as either using negative entropy regularization or Gumbel distributed perturbations (for example, see the discussion in Abernethy et al. [2014]). Recent work has begun to further uncover the connections between perturbation and regularization. For example, in online linear optimization, one can understand regularization and perturbation as simply two different ways to smooth a non-smooth potential function. The former corresponds to infimal convolution smoothing and the latter corresponds to stochastic (or integral convolution) smoothing [Abernethy et al., 2014]. Having a generic framework for understanding perturbations allows one to study a wide variety of online linear optimization games and a number of interesting perturbations.

FTRL and FTPL algorithms have also been used beyond "full information" settings. "Full information" refers to the fact that the learner observes the entire move of the adversary. The multi-armed bandit problem is one of the most fundamental examples of "partial information" settings. Regret analysis of the multi-armed bandit problem goes back to the work of Robbins [1952], who formulated the stochastic version of the problem. The non-stochastic, or adversarial, version was formulated by Auer et al. [2002], who provided the EXP3 algorithm achieving $O(\sqrt{NT \log N})$ regret in $T$ rounds with $N$ arms.
They also showed a lower bound of $\Omega(\sqrt{NT})$, which was later matched by the Poly-INF algorithm [Audibert and Bubeck, 2009, Audibert et al., 2011]. The Poly-INF algorithm can be interpreted as an FTRL algorithm with negative Tsallis entropy regularization [Audibert et al., 2011, Abernethy et al., 2015]. For a recent survey of both stochastic and non-stochastic bandit problems, see Bubeck and Cesa-Bianchi [2012].

For the non-stochastic multi-armed bandit problem, Kujala and Elomaa [2005] and Poland [2005] both showed that using the exponential (actually double exponential/Laplace) distribution in an FTPL algorithm, coupled with a standard unbiased estimation technique, yields near-optimal $O(\sqrt{NT \log N})$ regret. Unbiased estimation needs access to arm probabilities that are not explicitly available when using an FTPL algorithm. Neu and Bartók [2013] introduced the geometric resampling scheme to approximate these probabilities while still guaranteeing low regret. Recently, Abernethy et al. [2015] analyzed FTPL for adversarial multi-armed bandits and provided regret bounds under the condition that the hazard rate of the perturbation distribution is bounded. This condition allowed them to consider a variety of perturbation distributions beyond the exponential, such as Gamma, Gumbel, Frechet, Pareto, and Weibull.

Unfortunately, the bounded hazard rate condition is violated by two of the most widely known distributions: namely the uniform¹ and the Gaussian distributions. Therefore, the results of Abernethy et al. [2015] say nothing about the regret incurred in an adversarial multi-armed bandit problem when we use these distributions (without forced exploration) to generate perturbations. Contrast this to the full information experts setting where using these distributions as perturbations yields optimal $\sqrt{T}$ regret and even yields the optimal $\sqrt{\log N}$ dependence on the dimension in the Gaussian case [Abernethy et al., 2014].

¹The uniform distribution is also historically significant as it was used in the original FTPL algorithm of Hannan [1957].

The Gaussian distribution has lighter tails than the exponential. The hazard rate of a Gaussian increases linearly on the real line (and is hence unbounded) whereas the exponential has a constant hazard rate. Does having too light a tail make a perturbation inherently bad? The uniform is even worse from a light tail point of view: it has bounded support! In fact, Kujala and Elomaa [2005] had trouble dealing with the uniform distribution and remarked, "we failed to analyze the expert setting when the perturbation distribution was uniform." Does having a bounded support make a perturbation even worse? Or is the hazard rate condition just a sufficient condition without being anywhere close to necessary for a good regret bound to exist? The analysis of Abernethy et al. [2015] suggests that perhaps a bounded hazard rate is critical. They even made the following conjecture.

Conjecture 1. If a distribution $\mathcal{D}$ has a monotonically increasing hazard rate $h_{\mathcal{D}}(x)$ that does not converge as $x \to +\infty$ (e.g., Gaussian), then there is a sequence of gains that causes the corresponding FTPL algorithm to incur at least a linear regret.

The main contribution of this paper is to provide answers to the questions raised above. First, we show that boundedness of the hazard rate is certainly not a requirement for achieving sublinear (in $T$) regret.
Bounded support distributions, like the uniform, violate the boundedness condition on the hazard rate in the most extreme way. Their hazard rate blows up not just asymptotically at infinity, as in the Gaussian case, but as one approaches the right edge of the support. Yet, we can show (Corollary 3.3) that using the uniform distribution results in a regret bound of $O((NT)^{2/3})$. This bound is clearly not optimal. But optimality is not the point here. What is surprising, especially if one regards Conjecture 1 as plausible, is that a non-trivial sublinear bound holds at all. In fact, we show (Corollary 3.4) that using any continuous distribution with bounded support and bounded density results in a sublinear regret bound.

Second, moving beyond bounded support distributions to ones with unbounded support, we settle Conjecture 1 in the negative. In Theorem 4.6 we show that, instead of suffering linear regret as predicted by Conjecture 1, a perturbation algorithm using the Gaussian distribution enjoys a near optimal regret bound of $O(\sqrt{NT \log N}\,\log T)$. A key ingredient in our approach is a new quantity that we call the generalized hazard rate of a distribution. We show that a bounded generalized hazard rate is enough to guarantee sublinear regret in $T$ (Theorem 4.2).

Finally, we investigate the relationship between the tail behavior of random perturbations and the regret they induce. We show that heavy tails, along with some fairly mild assumptions, guarantee a bounded hazard rate (Theorem 4.9) and hence previous results can yield regret bounds for these perturbations. However, light tails can fail to have a bounded hazard rate. Nevertheless, we show that under reasonable conditions, light tailed distributions do have a bounded generalized hazard rate (Theorem 4.10). This result allows us to show that reasonably behaved light-tailed distributions lead to near optimal regret (Corollary 4.11). In particular, the exponential power (or generalized normal) family of distributions yields near optimal regret (Theorem 4.13).

2 Follow the Perturbed Leader Algorithm for Bandits

Recall the setting of the adversarial multi-armed bandit problem [Auer et al., 2002]. An adversary (or Nature) chooses gain vectors $g_t \in [-1, 0]^N$ for $1 \le t \le T$ ahead of the game. Such an adversary is called oblivious. At round $t = 1, \dots, T$ in a repeated game, the learner must choose a distribution $p_t \in \Delta_N$ over the set of $N$ available arms (or actions). The learner plays action $i_t$ sampled according to $p_t$ and accumulates the gain $g_{t, i_t} \in [-1, 0]$. The learner observes only $g_{t, i_t}$ and receives no information about the values $g_{t,j}$ for $j \ne i_t$. The learner's goal is to minimize the regret. Regret is defined to be the difference between the realized gains and the gains of the best fixed action in hindsight:
$$\mathrm{Regret}_T := \max_{i \in [N]} \sum_{t=1}^{T} \big(g_{t,i} - g_{t,i_t}\big). \qquad (1)$$
To be precise, we consider the expected regret, where the expectation is taken with respect to the learner's randomization. Note that, under an oblivious adversary, the only random variables in the above expression are the actions $i_t$ of the learner. For convenience, define the cumulative gain vectors $G_t := \sum_{s=1}^{t} g_s$ for $t = 1, 2, \dots, T$.
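To make the protocol concrete, the following is a minimal Python sketch (NumPy assumed; the function name and the learner interface are ours, not part of the paper) of one run of the game against an oblivious adversary, ending with the regret of Equation (1).

```python
import numpy as np

def play_bandit_game(gains, learner):
    """Run the adversarial bandit protocol and return the realized regret of Eq. (1).

    gains: (T, N) array with entries in [-1, 0], fixed in advance (oblivious adversary).
    learner: any object exposing act() -> arm index and update(arm, observed_gain).
    """
    T, N = gains.shape
    realized = 0.0
    for t in range(T):
        arm = learner.act()                # learner samples i_t from its current p_t
        g = gains[t, arm]                  # only g_{t, i_t} is revealed to the learner
        learner.update(arm, g)
        realized += g
    best_fixed = gains.sum(axis=0).max()   # cumulative gain of the best fixed arm in hindsight
    return best_fixed - realized
```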
2.1 The Gradient-Based Algorithmic Template

We will consider the algorithmic template described in Framework 1, which is the Gradient-Based Prediction Algorithm (GBPA) (see, for example, Abernethy et al. [2015]). Let $\Delta_N$ be the $(N-1)$-dimensional probability simplex in $\mathbb{R}^N$. Denote the standard basis vector along the $i$-th dimension by $e_i$. At any round $t$, the action choice $i_t$ is made by sampling from the distribution $p_t$, which is obtained by applying the gradient of a convex function $\tilde\Phi$ to the estimate $\hat G_{t-1}$ of the cumulative gain vector so far. The choice of $\tilde\Phi$ is flexible, but it must be a differentiable convex function whose gradient always lies in $\Delta_N$.

Framework 1: Gradient-Based Prediction Algorithm (GBPA) template for multi-armed bandits.

GBPA($\tilde\Phi$): $\tilde\Phi$ is a differentiable convex function such that $\nabla\tilde\Phi \in \Delta_N$.
Nature: Adversary chooses gain vectors $g_t \in [-1, 0]^N$ for $t = 1, \dots, T$.
Learner initializes $\hat G_0 = 0$.
for $t = 1$ to $T$ do
  Sampling: Learner chooses $i_t$ according to the distribution $p_t = \nabla\tilde\Phi(\hat G_{t-1})$.
  Cost: Learner incurs (and observes) gain $g_{t,i_t} \in [-1, 0]$.
  Estimation: Learner creates an estimate of the gain vector, $\hat g_t := \frac{g_{t,i_t}}{p_{t,i_t}}\, e_{i_t}$.
  Update: Cumulative gain estimate $\hat G_t = \hat G_{t-1} + \hat g_t$.
end for

Note that we do not require the range of $\nabla\tilde\Phi$ to be contained in the interior of the probability simplex. If we required the gradient to lie in the interior, we would not be able to deal with bounded support distributions such as the uniform distribution. Even though some entries of the probability vector $p_t$ might be 0, the estimation step is always well defined since $p_{t,i_t} > 0$. But allowing $p_{t,i}$ to be zero means that $\hat g_t$ is not exactly an unbiased estimator of $g_t$. Instead, it is an unbiased estimator on the support of $p_t$. That is, $\mathbb{E}[\hat g_{t,i} \mid i_{1:t-1}] = g_{t,i}$ for any $i$ such that $p_{t,i} > 0$. Here, $i_{1:t-1}$ is shorthand for $i_1, \dots, i_{t-1}$. Therefore, irrespective of whether $p_{t,i} = 0$ or not, we always have
$$\mathbb{E}[p_{t,i}\, \hat g_{t,i} \mid i_{1:t-1}] = p_{t,i}\, g_{t,i}. \qquad (2)$$
When $p_{t,i} = 0$, we have $\hat g_{t,i} = 0$ but $g_{t,i} \le 0$, which means that $\hat g_t$ overestimates $g_t$ outside the support of $p_t$. Hence, we also have
$$\mathbb{E}[\hat g_t \mid i_{1:t-1}] \succeq g_t, \qquad (3)$$
where $\succeq$ means element-wise greater than or equal.
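To illustrate Framework 1, here is a minimal Python sketch of a GBPA/FTPL learner with i.i.d. perturbations (NumPy assumed; the class and its interface are ours). The sampling step draws the perturbed leader, which is a single sample from $p_t = \nabla\tilde\Phi(\hat G_{t-1})$. Since $p_{t,i_t}$ has no closed form in general, this sketch plugs in a naive Monte Carlo estimate; Section 2.3 below discusses geometric resampling, the estimator whose bias is actually analyzed in the literature.

```python
import numpy as np

class GBPALearner:
    """Sketch of Framework 1 (GBPA) with i.i.d. perturbations drawn by sample_Z(rng, N)."""

    def __init__(self, N, sample_Z, n_mc=1000, seed=0):
        self.N = N
        self.sample_Z = sample_Z                  # e.g. lambda rng, n: eta * rng.standard_normal(n)
        self.n_mc = n_mc                          # Monte Carlo draws used to estimate p_{t, i_t}
        self.rng = np.random.default_rng(seed)
        self.G_hat = np.zeros(N)                  # cumulative gain estimate \hat G_{t-1}

    def act(self):
        # Sampling step: the perturbed leader argmax_i (\hat G_{t-1,i} + Z_i)
        # is one draw from p_t, the gradient of the stochastically smoothed potential.
        Z = self.sample_Z(self.rng, self.N)
        return int(np.argmax(self.G_hat + Z))

    def update(self, arm, observed_gain):
        # Estimation step: \hat g_t = (g_{t,i_t} / p_{t,i_t}) e_{i_t}, with p_{t,arm}
        # replaced by a crude Monte Carlo estimate (kept away from zero).
        hits = sum(
            int(np.argmax(self.G_hat + self.sample_Z(self.rng, self.N)) == arm)
            for _ in range(self.n_mc)
        )
        p_hat = max(hits, 1) / self.n_mc
        g_hat = np.zeros(self.N)
        g_hat[arm] = observed_gain / p_hat
        self.G_hat += g_hat                       # update step: \hat G_t = \hat G_{t-1} + \hat g_t
```

For example, `GBPALearner(N, lambda rng, n: rng.gumbel(size=n))` would correspond to a Gumbel (exponential-weights-like) perturbation; the bounded support and Gaussian choices studied below are obtained by passing the corresponding samplers.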
We now present a basic result bounding the expected regret of GBPA in the multi-armed bandit setting. It is basically just a simple modification of the arguments in Abernethy et al. [2015] to deal with the possibility that $p_{t,i} = 0$. We state and prove this result here for completeness without making any claim of novelty.

Lemma 2.1 (Decomposition of the Expected Regret). Define the non-smooth potential $\Phi(G) = \max_i G_i$. The expected regret of GBPA($\tilde\Phi$) can be written as
$$\mathbb{E}\,\mathrm{Regret}_T = \Phi(G_T) - \mathbb{E}\bigg[\sum_{t=1}^{T} \langle p_t, g_t \rangle\bigg]. \qquad (4)$$
Furthermore, the expected regret of GBPA($\tilde\Phi$) can be bounded by the sum of an overestimation, an underestimation, and a divergence penalty:
$$\mathbb{E}\,\mathrm{Regret}_T \le \underbrace{\tilde\Phi(0)}_{\text{overestimation penalty}} + \mathbb{E}\bigg[\underbrace{\Phi(\hat G_T) - \tilde\Phi(\hat G_T)}_{\text{underestimation penalty}}\bigg] + \mathbb{E}\bigg[\underbrace{\sum_{t=1}^{T} \mathbb{E}\big[D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) \mid i_{1:t-1}\big]}_{\text{divergence penalty}}\bigg], \qquad (5)$$
where the expectations are over the sampling of $i_t$ and $D_{\tilde\Phi}$ is the Bregman divergence induced by $\tilde\Phi$.

Proof. First, note that the regret, by definition, is
$$\mathrm{Regret}_T = \Phi(G_T) - \sum_{t=1}^{T} \langle e_{i_t}, g_t \rangle.$$
Under an oblivious adversary, only the summation on the right hand side is random. Moreover, $\mathbb{E}[\langle e_{i_t}, g_t \rangle \mid i_{1:t-1}] = \langle p_t, g_t \rangle$. This proves the claim in (4). From (2), we know that $\mathbb{E}[\langle p_t, \hat g_t \rangle \mid i_{1:t-1}] = \langle p_t, g_t \rangle$ even if some entries in $p_t$ might be zero. Therefore, we have
$$\mathbb{E}\,\mathrm{Regret}_T = \Phi(G_T) - \mathbb{E}\bigg[\sum_{t=1}^{T} \langle p_t, \hat g_t \rangle\bigg]. \qquad (6)$$
From (3), we know that $G_T \preceq \mathbb{E}[\hat G_T]$. This implies
$$\Phi(G_T) \le \Phi(\mathbb{E}[\hat G_T]) \le \mathbb{E}[\Phi(\hat G_T)], \qquad (7)$$
where the first inequality is because $G \succeq G'$ implies $\Phi(G) \ge \Phi(G')$, and the second inequality is due to the convexity of $\Phi$. Plugging (7) into (6) yields
$$\mathbb{E}\,\mathrm{Regret}_T \le \mathbb{E}\bigg[\Phi(\hat G_T) - \sum_{t=1}^{T} \langle p_t, \hat g_t \rangle\bigg]. \qquad (8)$$
Now, recalling the definition of the Bregman divergence,
$$D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) = \tilde\Phi(\hat G_t) - \tilde\Phi(\hat G_{t-1}) - \big\langle \nabla\tilde\Phi(\hat G_{t-1}),\, \hat G_t - \hat G_{t-1} \big\rangle,$$
we can write
$$-\sum_{t=1}^{T} \langle p_t, \hat g_t \rangle = -\sum_{t=1}^{T} \big\langle \nabla\tilde\Phi(\hat G_{t-1}),\, \hat g_t \big\rangle = -\sum_{t=1}^{T} \big\langle \nabla\tilde\Phi(\hat G_{t-1}),\, \hat G_t - \hat G_{t-1} \big\rangle \qquad (9)$$
$$= \sum_{t=1}^{T} \Big( D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) + \tilde\Phi(\hat G_{t-1}) - \tilde\Phi(\hat G_t) \Big) = \tilde\Phi(\hat G_0) - \tilde\Phi(\hat G_T) + \sum_{t=1}^{T} D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}). \qquad (10)$$
The proof ends by plugging (10) into (8) and noting that $\tilde\Phi(\hat G_0) = \tilde\Phi(0)$ is not random. □

2.2 Stochastic Smoothing of the Potential Function

Let $\mathcal{D}$ be a continuous distribution with finite expectation, probability density function $f$, and cumulative distribution function $F$. Consider GBPA with a potential function of the form
$$\tilde\Phi(G; \mathcal{D}) = \mathbb{E}_{Z_1, \dots, Z_N \overset{\text{i.i.d.}}{\sim} \mathcal{D}}\, \Phi(G + Z), \qquad (11)$$
which is a stochastic smoothing of the non-smooth function $\Phi(G) = \max_i G_i$. Note that $Z = (Z_1, \dots, Z_N) \in \mathbb{R}^N$. We will often hide the dependence on the distribution $\mathcal{D}$ if the distribution is obvious from the context or when the dependence on $\mathcal{D}$ is not of importance in the argument. Since $\Phi$ is convex, $\tilde\Phi$ is also convex. For stochastic smoothing, we have the following result to control the underestimation and overestimation penalties.

Lemma 2.2. For any $G$, we have
$$\Phi(G) + \mathbb{E}[Z_1] \le \tilde\Phi(G) \le \Phi(G) + \mathbb{E}_{\mathrm{MAX}}(N), \qquad (12)$$
where $\mathbb{E}_{\mathrm{MAX}}(N)$ is any function such that $\mathbb{E}_{Z_1,\dots,Z_N}[\max_i Z_i] \le \mathbb{E}_{\mathrm{MAX}}(N)$. In particular, this implies that the overestimation penalty $\tilde\Phi(0)$ is upper bounded by $\Phi(0) + \mathbb{E}_{\mathrm{MAX}}(N) = \mathbb{E}_{\mathrm{MAX}}(N)$ and the underestimation penalty $\Phi(\hat G_T) - \tilde\Phi(\hat G_T)$ is upper bounded by $-\mathbb{E}[Z_1]$.

Proof. We have
$$\Phi(G) + \mathbb{E}[Z_1] = \max_i G_i + \mathbb{E}[Z_i] = \max_i \big(G_i + \mathbb{E}[Z_i]\big) \le \mathbb{E}\big[\max_i (G_i + Z_i)\big] = \tilde\Phi(G) \le \mathbb{E}\big[\max_i G_i + \max_i Z_i\big] = \max_i G_i + \mathbb{E}[\max_i Z_i] = \Phi(G) + \mathbb{E}[\max_i Z_i].$$
Noting that $\mathbb{E}[\max_i Z_i] \le \mathbb{E}_{\mathrm{MAX}}(N)$ finishes the proof. □

Observe that $\Phi(G + Z)$, as a function of $G$, is differentiable with probability 1 (under the randomness of the $Z_i$'s) due to the fact that the $Z_i$'s are random variables with a density. By Proposition 2.3 of Bertsekas [1973], we can swap the order of differentiation and expectation:
$$\nabla\tilde\Phi(G; \mathcal{D}) = \mathbb{E}_{Z_1, \dots, Z_N \overset{\text{i.i.d.}}{\sim} \mathcal{D}}\, e_{i^*}, \quad \text{where } i^* = \arg\max_{i=1,\dots,N} \{G_i + Z_i\}. \qquad (13)$$
Note that, for any $G$, the random index $i^*$ is unique with probability 1. Hence, ties between arms can be resolved arbitrarily. It is clear from above that $\nabla\tilde\Phi$, being an expectation of vectors in the probability simplex, is in the probability simplex. Thus, it is a valid potential to be used in Framework 1.
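Equation (13) also suggests a direct way to approximate the full gradient $\nabla\tilde\Phi(G)$ numerically: average the one-hot indicator of the perturbed leader over many independent perturbation draws. A small sketch (ours, for illustration only):

```python
import numpy as np

def smoothed_gradient(G, sample_Z, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the gradient in Eq. (13): each draw of Z contributes
    the basis vector e_{i*} with i* = argmax_i (G_i + Z_i)."""
    rng = np.random.default_rng(seed)
    G = np.asarray(G, dtype=float)
    counts = np.zeros(len(G))
    for _ in range(n_samples):
        counts[np.argmax(G + sample_Z(rng, len(G)))] += 1.0
    return counts / n_samples   # a vector in the probability simplex by construction
```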
Now we derive an identity that writes the gradient of the smoothed potential function in terms of an expectation of the cumulative distribution function:
$$\nabla_i \tilde\Phi(G) = \frac{\partial \tilde\Phi}{\partial G_i} = \mathbb{E}_{Z_1,\dots,Z_N}\, \mathbf{1}\{G_i + Z_i > G_j + Z_j,\ \forall j \ne i\} = \mathbb{E}_{\tilde G_{-i}}\big[\mathbb{P}_{Z_i}[Z_i > \tilde G_{-i} - G_i]\big] = \mathbb{E}_{\tilde G_{-i}}\big[1 - F(\tilde G_{-i} - G_i)\big], \qquad (14)$$
where $\tilde G_{-i} = \max_{j \ne i} G_j + Z_j$. If $\mathcal{D}$ has unbounded support then this partial derivative is non-zero for all $i$ given any $G$. However, it can be zero if $\mathcal{D}$ has bounded support. Similarly, we have the following useful identity that writes the diagonal of the Hessian of the smoothed potential function in terms of an expectation of the probability density function:
$$\nabla^2_{ii} \tilde\Phi(G) = \frac{\partial}{\partial G_i} \nabla_i \tilde\Phi(G) = \frac{\partial}{\partial G_i}\, \mathbb{E}_{\tilde G_{-i}}\big[1 - F(\tilde G_{-i} - G_i)\big] = \mathbb{E}_{\tilde G_{-i}}\Big[\frac{\partial}{\partial G_i}\big(1 - F(\tilde G_{-i} - G_i)\big)\Big] = \mathbb{E}_{\tilde G_{-i}}\, f(\tilde G_{-i} - G_i). \qquad (15)$$

2.3 Connection to Follow the Perturbed Leader

The sampling step of Framework 1 with a stochastically smoothed $\Phi$ as the potential $\tilde\Phi$ (Equation 11) can be done efficiently. Instead of evaluating the expectation (Equation 13), we just take a random sample. Doing so gives us an equivalent of the Follow the Perturbed Leader algorithm (FTPL) [Kalai and Vempala, 2005] applied to the bandit setting. On the other hand, the estimation step is hard because generally there is no closed-form expression for $\nabla\tilde\Phi$. To address this issue, Neu and Bartók [2013] proposed Geometric Resampling (GR), an iterative resampling process to estimate $\nabla\tilde\Phi$ (with bias). They showed that stopping at $M$ iterations of GR introduces an estimation bias that costs at most an additive $\frac{NT}{eM}$ in the regret. That is, all GBPA regret bounds that we prove will hold for the corresponding FTPL algorithm that does $M$ iterations of GR at every time step, with an extra additive $\frac{NT}{eM}$ term. This extra term does not affect the regret rate as long as $M = \sqrt{NT}$, because the lower bound for any adversarial multi-armed bandit algorithm is of the order $\sqrt{NT}$.
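A minimal sketch of geometric resampling in the spirit of Neu and Bartók [2013] (the implementation details here are ours): redraw the perturbed leader with fresh perturbations until it coincides with the played arm, or a cap of $M$ iterations is hit. The number of draws needed is a (capped) geometric random variable whose mean approximates $1/p_{t,i_t}$, which is exactly the quantity needed by the estimation step.

```python
import numpy as np

def geometric_resampling(G_hat, arm, sample_Z, M, rng):
    """Estimate 1 / p_{t, arm} by redrawing the perturbed leader at most M times."""
    N = len(G_hat)
    for k in range(1, M + 1):
        if np.argmax(G_hat + sample_Z(rng, N)) == arm:
            return k          # K is geometric with mean 1 / p_{t, arm}
    return M                  # capping at M introduces the (small) bias discussed above

# In the GBPA sketch of Section 2.1, the estimation step would then read
#   g_hat[arm] = observed_gain * geometric_resampling(self.G_hat, arm, self.sample_Z, M, self.rng)
```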
2.4 The Role of the Hazard Rate and Its Limitation

In previous work, Abernethy et al. [2015] proved that for a continuous random variable $Z$ with finite and nonnegative expectation and support on the whole real line $\mathbb{R}$, if the hazard rate of the random variable is bounded, i.e.,
$$\sup_z \frac{f(z)}{1 - F(z)} < \infty,$$
then the expected regret of GBPA can be upper bounded as
$$\mathbb{E}\,\mathrm{Regret}_T = O\Big(\sqrt{NT \times \mathbb{E}_{\mathrm{MAX}}(N)}\Big).$$
Common families of distributions whose regret can be controlled in this way include Gumbel, Frechet, Weibull, Pareto, and Gamma (see Abernethy et al. [2015] for details).

However, there are many other families of distributions where the hazard rate condition fails. For example, if the random variable has bounded support, then the hazard rate certainly explodes at the end of the support. This is, in some sense, an extreme case of violation because the random variable does not even have a tail. There are also some random variables that do have support on $\mathbb{R}$ but have unbounded hazard rate, e.g. the Gaussian, whose hazard rate monotonically increases to infinity. How can we perform analyses of the expected regret of GBPA using those random variables as perturbations? To address these issues, we need to go beyond the hazard rate.

3 Perturbations with Bounded Support

In this section, we prove that GBPA with any continuous distribution that has bounded support and bounded density enjoys sublinear expected regret. From Lemma 2.1 we see that the expected regret can be upper bounded by the sum of three terms. The overestimation penalty can be bounded very easily via Lemma 2.2 for a distribution with bounded support. The underestimation penalty is non-positive as long as the distribution has non-negative expectation. The only term that needs to be controlled with some effort is the divergence penalty. We first present a general lemma that allows us to write the divergence penalty for a stochastically smoothed potential $\tilde\Phi$ as a sum involving certain double integrals.

Lemma 3.1. When using a stochastically smoothed potential as in (11), the divergence penalty can be written as
$$\mathbb{E}\big[D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) \mid i_{1:t-1}\big] = \sum_{i \in \mathrm{supp}(p_t)} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} \mathbb{E}_{\hat G_{-i}}\bigg[\int_0^{s} f(\hat G_{-i} - \hat G_{t-1,i} + r)\, dr\bigg]\, ds, \qquad (16)$$
where $p_t = \nabla\tilde\Phi(\hat G_{t-1})$, $\hat G_{-i} = \max_{j \ne i} \hat G_{t-1,j} + Z_j$ and $\mathrm{supp}(p_t) = \{i : p_{t,i} > 0\}$.

Proof. To reduce clutter, we drop the time subscripts: we use $\hat G$ to denote the cumulative estimate $\hat G_{t-1}$, $\hat g$ to denote the marginal estimate $\hat g_t = \hat G_t - \hat G_{t-1}$, $p$ to denote $p_t$, and $g$ to denote the true gain $g_t$. Note that by definition of Framework 1, $\hat g$ is a sparse vector with one non-zero and non-positive coordinate $\hat g_{i_t} = g_{i_t}/p_{i_t} = -|g_{i_t}/p_{i_t}|$. Moreover, conditioned on $i_{1:t-1}$, $i_t$ takes value $i$ with probability $p_i$. For any $i \in \mathrm{supp}(p)$, let $h_i(r) = D_{\tilde\Phi}(\hat G - r e_i, \hat G)$, so that
$$h_i'(r) = -\nabla_i \tilde\Phi(\hat G - r e_i) + \nabla_i \tilde\Phi(\hat G) \quad \text{and} \quad h_i''(r) = \nabla^2_{ii} \tilde\Phi(\hat G - r e_i).$$
Now we write:
$$\mathbb{E}\big[D_{\tilde\Phi}(\hat G + \hat g, \hat G) \mid i_{1:t-1}\big] = \sum_{i \in \mathrm{supp}(p)} p_i\, D_{\tilde\Phi}(\hat G + (g_i/p_i) e_i, \hat G) = \sum_{i \in \mathrm{supp}(p)} p_i\, D_{\tilde\Phi}(\hat G - |g_i/p_i|\, e_i, \hat G) = \sum_{i \in \mathrm{supp}(p)} p_i\, h_i(|g_i/p_i|)$$
$$= \sum_{i \in \mathrm{supp}(p)} p_i \int_0^{|g_i/p_i|} \int_0^{s} h_i''(r)\, dr\, ds = \sum_{i \in \mathrm{supp}(p)} p_i \int_0^{|g_i/p_i|} \int_0^{s} \nabla^2_{ii}\tilde\Phi(\hat G - r e_i)\, dr\, ds$$
$$= \sum_{i \in \mathrm{supp}(p)} p_i \int_0^{|g_i/p_i|} \int_0^{s} \mathbb{E}_{\hat G_{-i}}\, f(\hat G_{-i} - \hat G_i + r)\, dr\, ds = \sum_{i \in \mathrm{supp}(p_t)} p_{t,i} \int_0^{|g_i/p_i|} \mathbb{E}_{\hat G_{-i}}\bigg[\int_0^{s} f(\hat G_{-i} - \hat G_i + r)\, dr\bigg]\, ds.$$
The second equality implicitly used the assumption that $g_i \le 0$, i.e., the "gains" are non-positive. The fourth equality used that $h_i(0) = 0$ and $h_i'(0) = 0$, and the sixth equality used Equation (15). □

Note that each summand in the divergence penalty expression above involves an integral of the density function of the distribution $\mathcal{D}$ over an interval. The main idea to control the divergence penalty for a bounded support distribution is to truncate the interval at the end of the support. For points that are close to the end of the support, we bound the integral by the product of the bound on the density and the interval length. For points that are far from the end of the support, we bound the integral through the hazard rate as was done by Abernethy et al. [2015].
For a general continuous random variable $Z$ with bounded density and bounded support, we first shift it (which obviously does not change the distribution of the random action choice $i_t$ and hence the expected regret) and scale it so that the support is a subset of $[0, 1]$ with $\sup\{z : F(z) = 0\} = 0$ and $\inf\{z : F(z) = 1\} = 1$, where $F$ denotes the CDF of $Z$. A benefit of this normalization is that the expectation of the random variable becomes non-negative, so the underestimation penalty is guaranteed to be non-positive. After scaling, we assume that the bound on the density is $L$. We consider the perturbation $\eta Z$ where $\eta > 0$ is a tuning parameter. Write $F_\eta(x)$ and $f_\eta(x)$ to denote the CDF and PDF of the scaled random variable $\eta Z$ respectively. If $F$ is strictly increasing, we know that $F^{-1}$ exists. If not, define $F^{-1}(y) = \inf\{z : F(z) = y\}$. Elementary calculation gives the following useful facts:
$$F_\eta(z) = F(z/\eta), \qquad f_\eta(z) = \frac{f(z/\eta)}{\eta}, \qquad F_\eta^{-1}(y) = \eta\, F^{-1}(y).$$

Theorem 3.2 (Divergence Penalty Control, Bounded Support). The divergence penalty in the GBPA regret bound using the scaled perturbation $\eta Z$, where $Z$ is drawn from a bounded support distribution satisfying the conditions above, can be upper bounded, for any $\epsilon > 0$, by
$$NL\left(\frac{1}{2\eta\epsilon} + 1 - F^{-1}(1 - \epsilon)\right).$$

Proof. From Lemma 3.1 we have, with $\hat G_{-i} = \max_{j \ne i} \hat G_{t-1,j} + \eta Z_j$,
$$\mathbb{E}\big[D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) \mid i_{1:t-1}\big] = \sum_{i \in \mathrm{supp}(p_t)} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} \mathbb{E}_{\hat G_{-i}}\bigg[\int_0^{s} f_\eta(\hat G_{-i} - \hat G_{t-1,i} + r)\, dr\bigg]\, ds = \sum_{i \in \mathrm{supp}(p_t)} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} \mathbb{E}_{\hat G_{-i}}\bigg[\int_{\hat G_{-i} - \hat G_{t-1,i}}^{\hat G_{-i} - \hat G_{t-1,i} + s} f_\eta(z)\, dz\bigg]\, ds$$
$$\le \sum_{i \in \mathrm{supp}(p_t)} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} \Bigg( \mathbb{E}_{\hat G_{-i}}\Bigg[\underbrace{\int_{[\hat G_{-i} - \hat G_{t-1,i},\, \hat G_{-i} - \hat G_{t-1,i} + s] \setminus [F_\eta^{-1}(1-\epsilon),\, \eta]} f_\eta(z)\, dz}_{(\mathrm{I})}\Bigg] + \underbrace{\int_{[F_\eta^{-1}(1-\epsilon),\, \eta]} f_\eta(z)\, dz}_{(\mathrm{II})} \Bigg)\, ds. \qquad (17)$$
We bound the two integrals above differently. For the first integral, we add the restriction $f_\eta(z) > 0$ by intersecting the domain of integration with the support of the function $f_\eta(z)$, denoted $I_{f_\eta(z)}$, so that $1 - F_\eta(z)$ is not 0 on the interval to be integrated. Thus, we get
$$(\mathrm{I}) = \int_{([\hat G_{-i} - \hat G_{t-1,i},\, \hat G_{-i} - \hat G_{t-1,i} + s] \setminus [F_\eta^{-1}(1-\epsilon),\, \eta]) \cap I_{f_\eta(z)}} f_\eta(z)\, dz = \int_{([\hat G_{-i} - \hat G_{t-1,i},\, \hat G_{-i} - \hat G_{t-1,i} + s] \setminus [F_\eta^{-1}(1-\epsilon),\, \eta]) \cap I_{f_\eta(z)}} (1 - F_\eta(z)) \cdot \frac{f_\eta(z)}{1 - F_\eta(z)}\, dz$$
$$\le \int_{([\hat G_{-i} - \hat G_{t-1,i},\, \hat G_{-i} - \hat G_{t-1,i} + s] \setminus [F_\eta^{-1}(1-\epsilon),\, \eta]) \cap I_{f_\eta(z)}} (1 - F_\eta(z)) \cdot \frac{L}{\eta\epsilon}\, dz \le \big(1 - F_\eta(\hat G_{-i} - \hat G_{t-1,i})\big)\, \frac{sL}{\eta\epsilon}. \qquad (18)$$
The first inequality holds because $f_\eta(z) \le L/\eta$ and $1 - F_\eta(z) \ge \epsilon$ on the set of $z$'s over which we are integrating. The second inequality holds because on the set under consideration $1 - F_\eta(z) \le 1 - F_\eta(\hat G_{-i} - \hat G_{t-1,i})$ and the measure of the set is at most $s$. For the second integral, we use the bound $f_\eta(z) \le L/\eta$ again to get
$$(\mathrm{II}) = \int_{[F_\eta^{-1}(1-\epsilon),\, \eta]} f_\eta(z)\, dz \le \frac{L}{\eta}\big(\eta - F_\eta^{-1}(1-\epsilon)\big). \qquad (19)$$
Plugging (18) and (19) into (17), we can bound the divergence penalty by
$$\sum_{i \in \mathrm{supp}(p_t)} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} \left( \mathbb{E}_{\hat G_{-i}}\big[1 - F_\eta(\hat G_{-i} - \hat G_{t-1,i})\big]\, \frac{sL}{\eta\epsilon} + \frac{L\big(\eta - F_\eta^{-1}(1-\epsilon)\big)}{\eta} \right) ds = \sum_{i \in \mathrm{supp}(p_t)} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} \left( p_{t,i}\, \frac{sL}{\eta\epsilon} + L\big(1 - F^{-1}(1-\epsilon)\big) \right) ds$$
$$= \sum_{i \in \mathrm{supp}(p_t)} p_{t,i} \left( \frac{p_{t,i} L}{\eta\epsilon} \cdot \frac{g_{t,i}^2}{2 p_{t,i}^2} + L\big(1 - F^{-1}(1-\epsilon)\big)\, \frac{|g_{t,i}|}{p_{t,i}} \right) \le \sum_{i \in \mathrm{supp}(p_t)} \left( \frac{L}{2\eta\epsilon} + L\big(1 - F^{-1}(1-\epsilon)\big) \right) \le NL\left(\frac{1}{2\eta\epsilon} + 1 - F^{-1}(1-\epsilon)\right).$$
The second to last inequality holds because $|g_{t,i}| \le 1$ and the last inequality holds because the sum over $i$ is at most over all $N$ arms. □

The regret bound for the uniform distribution is now an easy corollary.

Corollary 3.3 (Regret Bound for Uniform). For GBPA run with a stochastically smoothed potential using an appropriately scaled $[0,1]$ uniform perturbation, where $\eta = (NT)^{2/3}$, the expected regret can be upper bounded by $3(NT)^{2/3}$.

Proof. For the $[0,1]$ uniform distribution, we have $L = 1$ and $F^{-1}(1 - \epsilon) = 1 - \epsilon$, so the divergence penalty over the $T$ rounds is upper bounded by $NT\big(\frac{1}{2\eta\epsilon} + \epsilon\big)$. If we let $\epsilon = \frac{1}{\sqrt{2\eta}}$, we see that the divergence penalty is upper bounded by $NT\sqrt{\frac{2}{\eta}}$. Together with the overestimation penalty, which is trivially bounded by $\eta$, and a non-positive underestimation penalty, we see that the final regret bound is
$$NT\sqrt{\frac{2}{\eta}} + \eta.$$
Setting $\eta = (NT)^{2/3}$ gives the desired result. □

For a general perturbation with bounded support and bounded density, the rate at which $1 - F^{-1}(1-\epsilon)$ goes to 0 as $\epsilon \to 0$ can vary, but we can always guarantee sublinear expected regret.

Corollary 3.4 (Asymptotic Regret Bound for Bounded Support). For stochastically smoothed GBPA using a general continuous random variable $\eta Z$, where $Z$ has bounded density and bounded support contained in $[0, 1]$ and $\eta = (NT)^{2/3}$, the expected regret grows sublinearly, i.e.,
$$\lim_{T \to \infty} \frac{\mathbb{E}\,\mathrm{Regret}_T}{T} = 0.$$

Proof. For a general distribution, let $\epsilon = \frac{1}{\sqrt{\eta}}$. Since the overestimation penalty is trivially bounded by $\eta$ and the underestimation penalty is non-positive, the expected regret can be upper bounded by
$$LNT\left(\frac{1}{2\sqrt{\eta}} + 1 - F^{-1}\Big(1 - \tfrac{1}{\sqrt{\eta}}\Big)\right) + \eta.$$
Setting $\eta = (NT)^{2/3}$, we see that the expected regret can be upper bounded by
$$\left(\frac{L}{2} + 1\right)(NT)^{2/3} + LNT\left(1 - F^{-1}\Big(1 - \tfrac{1}{\sqrt{\eta}}\Big)\right).$$
Since
$$\lim_{T \to \infty} 1 - F^{-1}\Big(1 - \tfrac{1}{\sqrt{\eta}}\Big) = \lim_{\eta \to \infty} 1 - F^{-1}\Big(1 - \tfrac{1}{\sqrt{\eta}}\Big) = 1 - F^{-1}(1) = 0,$$
we conclude that $\lim_{T \to \infty} \mathbb{E}\,\mathrm{Regret}_T / T = 0$. □
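As a usage example (hypothetical numbers, reusing the GBPALearner sketch from Section 2.1), Corollary 3.3's uniform perturbation and tuning can be instantiated as follows.

```python
N, T = 10, 100_000
eta = (N * T) ** (2.0 / 3.0)                      # tuning from Corollary 3.3

# Perturbation eta * Z with Z uniform on [0, 1].
uniform_Z = lambda rng, n: eta * rng.random(n)

learner = GBPALearner(N, uniform_Z)               # sketch class from Section 2.1
# For the idealized algorithm, Corollary 3.3 bounds the expected regret by 3 * (N * T) ** (2 / 3).
```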
4 Perturbations with Unbounded Support

Unlike perturbations with bounded support, perturbations with unbounded support (on the right) have non-zero right tail probabilities, ensuring that $p_{t,i} > 0$ always. However, the tail behavior may be such that the hazard rate is unbounded. Still, under mild assumptions, perturbations with unbounded support (on the right) can also be shown to have near optimal expected regret in $T$, using the notion of generalized hazard rate that we now introduce.

4.1 Generalized Hazard Rate

We already know how to control the underestimation and overestimation penalties via Lemma 2.2, so our main focus will be to control the divergence penalty. Towards this end, we define the generalized hazard rate for a continuous random variable $Z$ with support unbounded on the right, parameterized by $\alpha \in [0, 1)$, as
$$h_\alpha(z) := \frac{f(z)\, |z|^{\alpha}}{(1 - F(z))^{1-\alpha}}, \qquad (20)$$
where $f(z)$ and $F(z)$ denote the PDF and CDF of $Z$ respectively. Note that by setting $\alpha = 0$ we recover the standard hazard rate. One of the main results of this paper is the following. Note that it includes the result (Lemma 4.3) of Abernethy et al. [2015] as a special case.

Theorem 4.1 (Divergence Penalty Control via Generalized Hazard Rate). Let $\alpha \in [0, 1)$. Suppose we have $h_\alpha(z) \le C$ for all $z \in \mathbb{R}$. Then
$$\mathbb{E}\big[D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) \mid i_{1:t-1}\big] \le \frac{2C}{1 - \alpha} \times N.$$

Proof. Because of the unbounded support of $Z$, $\mathrm{supp}(p_t) = \{1, \dots, N\}$. Lemma 3.1 gives us:
$$\mathbb{E}\big[D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) \mid i_{1:t-1}\big] = \sum_{i=1}^{N} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} \mathbb{E}_{\tilde G_{-i}} \int_0^{s} f(\tilde G_{-i} - \hat G_{t-1,i} + r)\, dr\, ds = \sum_{i=1}^{N} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} \mathbb{E}_{\tilde G_{-i}} \int_{\tilde G_{-i} - \hat G_{t-1,i}}^{\tilde G_{-i} - \hat G_{t-1,i} + s} f(z)\, dz\, ds$$
$$\le C \sum_{i=1}^{N} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} \mathbb{E}_{\tilde G_{-i}} \int_{\tilde G_{-i} - \hat G_{t-1,i}}^{\tilde G_{-i} - \hat G_{t-1,i} + s} (1 - F(z))^{1-\alpha}\, |z|^{-\alpha}\, dz\, ds \le C \sum_{i=1}^{N} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} \mathbb{E}_{\tilde G_{-i}} \big(1 - F(\tilde G_{-i} - \hat G_{t-1,i})\big)^{1-\alpha} \int_{\tilde G_{-i} - \hat G_{t-1,i}}^{\tilde G_{-i} - \hat G_{t-1,i} + s} |z|^{-\alpha}\, dz\, ds,$$
where the last inequality uses the fact that $1 - F$ is non-increasing and $z \ge \tilde G_{-i} - \hat G_{t-1,i}$ on the domain of integration. Since the function $|z|^{-\alpha}$ is symmetric in $z$ and monotonically decreasing as $|z| \to \infty$, we have
$$\int_{\tilde G_{-i} - \hat G_{t-1,i}}^{\tilde G_{-i} - \hat G_{t-1,i} + s} |z|^{-\alpha}\, dz \le \int_{-s/2}^{s/2} |z|^{-\alpha}\, dz = \frac{2^{\alpha}}{1-\alpha}\, s^{1-\alpha}.$$
Also, note that $z^{1-\alpha}$ is a concave function of $z$. Hence, by Jensen's inequality,
$$\mathbb{E}_{\tilde G_{-i}}\Big[\big(1 - F(\tilde G_{-i} - \hat G_{t-1,i})\big)^{1-\alpha}\Big] \le \Big(\mathbb{E}_{\tilde G_{-i}}\big[1 - F(\tilde G_{-i} - \hat G_{t-1,i})\big]\Big)^{1-\alpha} = p_{t,i}^{1-\alpha}.$$
Therefore,
$$\mathbb{E}\big[D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) \mid i_{1:t-1}\big] \le \frac{2^{\alpha} C}{1-\alpha} \sum_{i=1}^{N} p_{t,i} \int_0^{|g_{t,i}/p_{t,i}|} p_{t,i}^{1-\alpha}\, s^{1-\alpha}\, ds = \frac{2^{\alpha} C}{1-\alpha} \sum_{i=1}^{N} p_{t,i}^{2-\alpha} \int_0^{|g_{t,i}/p_{t,i}|} s^{1-\alpha}\, ds$$
$$= \frac{2^{\alpha} C}{(1-\alpha)(2-\alpha)} \sum_{i=1}^{N} p_{t,i}^{2-\alpha}\, |g_{t,i}/p_{t,i}|^{2-\alpha} = \frac{2^{\alpha} C}{(1-\alpha)(2-\alpha)} \sum_{i=1}^{N} |g_{t,i}|^{2-\alpha} \le \frac{2^{\alpha} C}{(1-\alpha)(2-\alpha)}\, N \le \frac{2C}{1-\alpha}\, N. \qquad \Box$$
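Before moving on, a small numerical illustration of Definition (20) may help (Python with scipy assumed; the helper name is ours): for the exponential distribution the ordinary hazard rate ($\alpha = 0$) is already constant, for the standard Gaussian it grows without bound, but any $\alpha \in (0, 1)$ tames the Gaussian's generalized hazard rate (Lemma 4.4 below gives the explicit bound $2/\alpha$).

```python
import numpy as np
from scipy.stats import expon, norm

def generalized_hazard(pdf, sf, z, alpha):
    """h_alpha(z) = f(z) |z|^alpha / (1 - F(z))^(1 - alpha), as in Eq. (20)."""
    return pdf(z) * np.abs(z) ** alpha / sf(z) ** (1.0 - alpha)

z = np.linspace(0.1, 8.0, 200)

print(generalized_hazard(expon.pdf, expon.sf, z, 0.0).max())  # Exp(1): constant hazard rate 1
print(generalized_hazard(norm.pdf, norm.sf, z, 0.0).max())    # Gaussian: ~8 here, keeps growing with z
print(generalized_hazard(norm.pdf, norm.sf, z, 0.5).max())    # Gaussian, alpha = 0.5: stays below 2/alpha = 4
```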
A regret bound now easily follows.

Theorem 4.2 (Regret Bound via Generalized Hazard Rate). Suppose we use a stochastically smoothed GBPA with perturbation $\eta Z$, with $Z$'s generalized hazard rate being bounded, $h_\alpha(x) \le C$ for all $x \in \mathbb{R}$ and some $\alpha \in [0, 1)$, and
$$\mathbb{E}_{Z_1,\dots,Z_N}[\max_i Z_i] - \mathbb{E}[Z_1] \le Q(N),$$
where $Q(N)$ is some function of $N$. Then, if we set $\eta = \big(\frac{2CNT}{(1-\alpha)Q(N)}\big)^{1/(2-\alpha)}$, the expected regret of GBPA is no greater than
$$2 \times \left(\frac{2C}{1-\alpha}\right)^{1/(2-\alpha)} \times (NT)^{1/(2-\alpha)} \times Q(N)^{(1-\alpha)/(2-\alpha)}.$$
In particular, this implies that the algorithm has sublinear expected regret.

Proof. The divergence penalty can be controlled through Theorem 4.1 once we have a bounded generalized hazard rate. It remains to control the overestimation and underestimation penalties. By Lemma 2.2, they are at most $\mathbb{E}_{Z_1,\dots,Z_N}[\max_i Z_i]$ and $-\mathbb{E}[Z_1]$ respectively. Suppose we scale the perturbation $Z$ by $\eta > 0$, i.e., we add $\eta Z_i$ to each coordinate. It is easy to see that $\mathbb{E}[\max_i \eta Z_i] = \eta\, \mathbb{E}[\max_i Z_i]$ and $\mathbb{E}[\eta Z_1] = \eta\, \mathbb{E}[Z_1]$. For the divergence penalty, observe that $F_\eta(t) = F(t/\eta)$ and thus $f_\eta(t) = \frac{1}{\eta} f(t/\eta)$. Hence, the bound on the generalized hazard rate for the perturbation $\eta Z$ is $\eta^{\alpha-1} C$. Plugging the new bounds for the scaled perturbations into Lemma 2.1 gives us
$$\mathbb{E}\,\mathrm{Regret}_T \le \eta^{\alpha-1}\, \frac{2C}{1-\alpha} \times NT + \eta\, Q(N).$$
Setting $\eta = \big(\frac{2CNT}{(1-\alpha)Q(N)}\big)^{1/(2-\alpha)}$ finishes the proof. □

4.2 Gaussian Perturbation

In this section we prove that GBPA with the standard Gaussian perturbation incurs a near optimal expected regret in both $N$ and $T$. Let $F(z)$ and $f(z)$ denote the CDF and PDF of the standard Gaussian distribution.

Lemma 4.3 (Baricz [2008]). For a standard Gaussian random variable, we have
$$z < \frac{f(z)}{1 - F(z)} < \frac{z + \sqrt{z^2 + 4}}{2}.$$
This lemma, together with Example 2.6 in Thomas [1971], shows that the hazard rate of a standard Gaussian random variable increases monotonically to infinity. However, we can still bound the generalized hazard rate for strictly positive $\alpha$.

Lemma 4.4 (Generalized Hazard Bound for Gaussian). For any $\alpha \in (0, 1)$, we have
$$\frac{f(z)\, |z|^{\alpha}}{(1 - F(z))^{1-\alpha}} \le \frac{2}{\alpha}.$$
The proof of this lemma is deferred to the appendix. The bounded generalized hazard rate shown in the above lemma can be used to control the divergence penalty. Combined with other knowledge of the standard Gaussian random variable, we are able to give a bound on the expected regret.

Corollary 4.5. GBPA with an appropriately scaled standard Gaussian random variable as perturbation, where $\eta = \big(\frac{4NT}{\alpha(1-\alpha)\sqrt{2\log N}}\big)^{1/(2-\alpha)}$, has expected regret at most
$$2\,(C_1 C_2 N T)^{1/(2-\alpha)} \big(\sqrt{2\log N}\big)^{(1-\alpha)/(2-\alpha)}, \quad \text{where } C_1 = \frac{2}{\alpha},\ C_2 = \frac{2}{1-\alpha},$$
for any $\alpha \in (0, 1)$.

Proof. It is known that for the standard Gaussian random variable we have $\mathbb{E}[Z_1] = 0$ and $\mathbb{E}_{Z_1,\dots,Z_N}[\max_i Z_i] \le \sqrt{2\log N}$. Plugging into Theorem 4.2 gives the result. □

It remains to optimally tune $\alpha$ in the above bound.

Theorem 4.6 (Regret Bound for Gaussian). GBPA with an appropriately scaled standard Gaussian random variable as perturbation, where $\eta = \big(\frac{4NT}{\alpha(1-\alpha)\sqrt{2\log N}}\big)^{1/(2-\alpha)}$ and $\alpha = \frac{1}{\log T}$, has expected regret at most
$$96\,\sqrt{NT} \times N^{1/\log T}\, \sqrt{\log N}\, \log T \quad \text{for } T > 4.$$
If we assume that $T > N$, the expected regret can be upper bounded by $278\,\sqrt{NT} \times \sqrt{\log N}\, \log T$.

The proof of this theorem is also deferred to the appendix.
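For concreteness, the tuning prescribed by Theorem 4.6 can be computed as follows (a sketch; the function name and the reuse of the earlier sampler convention are ours).

```python
import numpy as np

def gaussian_ftpl_tuning(N, T):
    """alpha and eta from Theorem 4.6 for GBPA with standard Gaussian perturbations."""
    alpha = 1.0 / np.log(T)                       # requires T > e so that alpha < 1
    eta = (4.0 * N * T / (alpha * (1.0 - alpha) * np.sqrt(2.0 * np.log(N)))) ** (1.0 / (2.0 - alpha))
    return alpha, eta

alpha, eta = gaussian_ftpl_tuning(N=10, T=100_000)
gaussian_Z = lambda rng, n: eta * rng.standard_normal(n)   # perturbation eta * Z for the GBPA sketch
```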
4.3 Sufficient Condition for Near Optimal Regret

In Section 4.1 we showed that if the generalized hazard rate of a distribution is bounded, the expected regret of GBPA can be controlled. In this section, we prove that under reasonable assumptions on the distribution of the perturbation, FTPL enjoys near optimal expected regret. Note that most proofs in this section are deferred to the appendix.

Assumptions (a)-(c). Before we proceed, let us formally state our assumptions on the distributions we will consider. The distribution needs to (a) be continuous and have bounded density, (b) have finite expectation, and (c) have support unbounded in the $+\infty$ direction. Note that if the expectation of the random perturbation is negative, we shift it so that the expectation is zero. Hence the underestimation penalty is non-positive.

In addition to the assumptions we have made above, we make another assumption on the eventual monotonicity of the hazard rate.

Assumption (d). $h_0(z) = \frac{f(z)}{1 - F(z)}$ is eventually monotone.

"Eventually monotone" means that there exists $z_0 \ge 0$ such that for $z > z_0$, $\frac{f(z)}{1 - F(z)}$ is non-decreasing or non-increasing. This assumption might appear hard to check, but numerous theorems are available to establish the monotonicity of the hazard rate, which is much stronger than what we are assuming here. For example, see Theorem 2.4 in Thomas [1971] and Theorems 2 and 4 in Chechile [2003], Chechile [2009]. In fact, most natural distributions do satisfy this assumption [Bagnoli and Bergstrom, 2005].

Before we proceed, we mention a standard classification of random variables into two classes based on their tail properties.

Definition 4.7 (see, for example, Foss et al. [2009]). A function $f(z) \ge 0$ is said to be heavy-tailed if and only if $\limsup_{z \to \infty} f(z)\, e^{\lambda z} = \infty$ for all $\lambda > 0$. A distribution with CDF $F(z)$ and $\bar F(z) = 1 - F(z)$ is said to be heavy-tailed if and only if $\bar F(z)$ is heavy-tailed. If the distribution is not heavy-tailed, we say that it is light-tailed.

It turns out that under assumptions (a)-(d), if the distribution is also heavy-tailed, then the hazard rate itself is bounded. If the distribution is light-tailed, we need an additional assumption on the eventual monotonicity of a function similar to the generalized hazard rate to ensure the boundedness of the generalized hazard rate. But before we state and prove the main results, we introduce some functions and prove an intermediate lemma that will be useful for the main results. Define $R(z) = -\log \bar F(z)$, so that we have $\bar F(z) = e^{-R(z)}$ and $R'(z) = \frac{f(z)}{\bar F(z)} = h_0(z)$.

Lemma 4.8. Under assumptions (a)-(d), $\bar F(z)\, e^{\lambda z}$ is eventually monotone for all $\lambda > 0$.

Proof. Let $g(z) = \bar F(z)\, e^{\lambda z}$; then $g'(z) = e^{\lambda z}\, \bar F(z)\, \big(\lambda - \frac{f(z)}{\bar F(z)}\big)$. Since $\frac{f(z)}{\bar F(z)}$ is eventually monotone by assumption (d), $g'(z)$ is eventually positive, negative, or zero. The lemma immediately follows. □

We are finally ready to present the main results of this section.

Theorem 4.9 (Heavy Tail Implies Bounded Hazard). Under assumptions (a)-(d), if the distribution is also heavy-tailed, then the hazard rate is bounded, i.e., $\sup_z \frac{f(z)}{\bar F(z)} < \infty$.

Unlike heavy-tailed distributions, the hazard rate of light-tailed distributions might be unbounded. However, it turns out that if we make an additional assumption on the eventual monotonicity of a function similar to the generalized hazard rate, we can still guarantee the boundedness of the generalized hazard rate.

Assumption (e). There exists $\delta \in (0, 1]$ such that $\frac{f(z)}{(1 - F(z))^{1-\delta}}$ is eventually monotone.

Theorem 4.10 (Light Tail Implies Bounded Generalized Hazard). Under assumptions (a)-(e), if the distribution is also light-tailed, then for any $\alpha \in (\delta, 1)$, the generalized hazard rate $h_\alpha(z)$ is bounded, i.e.,
$$\sup_z \frac{f(z)\, |z|^{\alpha}}{(\bar F(z))^{1-\alpha}} < \infty.$$

Combining the above result with the control of the divergence penalty gives us the following corollary.

Corollary 4.11. Under assumptions (a)-(e), if the distribution is also light-tailed, the expected regret of GBPA with appropriately scaled perturbations drawn from that distribution is, for all $\alpha \in (\delta, 1)$ and $\xi > 0$,
$$O\Big((TN)^{1/(2-\alpha)}\, N^{\xi}\Big).$$
In particular, if assumption (e) holds for all $\delta \in (0, 1)$, then the expected regret of GBPA is
$$O\Big((TN)^{1/2 + \epsilon}\Big) \quad \text{for all } \epsilon > 0,$$
i.e., it is near optimal in both $N$ and $T$.

Next we consider a family of light-tailed distributions that do not have a bounded hazard rate.

Definition 4.12. The exponential power (or generalized normal) family of distributions, denoted $\mathcal{D}_\beta$ where $\beta > 1$, is defined via the density
$$f_\beta(z) = C_\beta\, e^{-z^{\beta}}, \quad z \ge 0.$$

The next theorem shows that GBPA with perturbations from this family of distributions enjoys near optimal expected regret in both $N$ and $T$.

Theorem 4.13 (Regret Bound for Exponential Power Family). For all $\beta > 1$, the expected regret of GBPA with appropriately scaled perturbations drawn from $\mathcal{D}_\beta$ is, for all $\epsilon > 0$,
$$O\Big((TN)^{1/2 + \epsilon}\Big).$$

5 Conclusion and Future Work

Previous work on providing regret guarantees for FTPL algorithms in the adversarial multi-armed bandit setting required a bounded hazard rate condition. We have shown how to go beyond the hazard rate condition, but a number of questions remain open. For example, what if we use FTPL with perturbations from discrete distributions such as the Bernoulli distribution? In the full information setting, Devroye et al. [2013] and Van Erven et al. [2014] have considered random walk perturbations and dropout perturbations, both leading to minimax optimal regret. But to the best of our knowledge those distributions have not been analyzed in the adversarial multi-armed bandit problem.

An unsatisfactory aspect of even the tightest bounds for FTPL algorithms from existing work, including ours, is that they never reach the minimax optimal $O(\sqrt{NT})$ bound. They come very close to it: up to logarithmic factors. It is known that FTRL algorithms, using the negative Tsallis entropy as the regularizer, can achieve the optimal bound [Audibert and Bubeck, 2009, Audibert et al., 2011, Abernethy et al., 2015]. Is there a perturbation that can achieve the optimal bound?

We only considered multi-armed bandits in this work. There has been some interest in using FTPL algorithms for combinatorial bandit problems (see, for example, Neu and Bartók [2013]). In future work, it will be interesting to extend our analysis to combinatorial bandit problems.

Acknowledgments. We thank Jacob Abernethy and Chansoo Lee for helpful discussions. We acknowledge the support of NSF CAREER grant IIS-1452099 and a Sloan Research Fellowship.

References

Jacob Abernethy, Chansoo Lee, Abhinav Sinha, and Ambuj Tewari. Online linear optimization via smoothing. In COLT, pages 807–823, 2014.

Jacob Abernethy, Chansoo Lee, and Ambuj Tewari. Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems 28, pages 2188–2196, 2015.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.

Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Minimax policies for combinatorial prediction games. In COLT, 2011.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32:48–77, 2002.

Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic Theory, 26(2):445–469, 2005.
Árpád Baricz. Mills' ratio: Monotonicity patterns and functional inequalities. Journal of Mathematical Analysis and Applications, 340(2):1362–1370, 2008.

Dimitri P. Bertsekas. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218–231, 1973.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Richard A. Chechile. Mathematical tools for hazard function analysis. Journal of Mathematical Psychology, 47:478–494, 2003.

Richard A. Chechile. Corrigendum to: Mathematical tools for hazard function analysis [J. Math. Psychol. 47 (2003) 478–494]. Journal of Mathematical Psychology, 53:298–299, 2009.

Luc Devroye, Gábor Lugosi, and Gergely Neu. Prediction by random-walk perturbation. In Conference on Learning Theory, pages 460–473, 2013.

Sergey Foss, Dmitry Korshunov, and Stan Zachary. An Introduction to Heavy-tailed and Subexponential Distributions. Springer, 2009.

J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, Volume III, pages 97–139, 1957.

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

Jussi Kujala and Tapio Elomaa. On following the perturbed leader in the bandit setting. In Algorithmic Learning Theory, pages 371–385. Springer, 2005.

Gergely Neu and Gábor Bartók. An efficient algorithm for learning with semi-bandit feedback. In Algorithmic Learning Theory, pages 234–248. Springer, 2013.

Jan Poland. FPL analysis for adaptive bandits. In Oleg B. Lupanov, Oktay M. Kasim-Zade, Alexander V. Chaskin, and Kathleen Steinhöfel, editors, Stochastic Algorithms: Foundations and Applications: Third International Symposium, SAGA 2005, Moscow, Russia, October 20–22, 2005. Proceedings, pages 58–69. Springer Berlin Heidelberg, 2005.

Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Ewart A. C. Thomas. Sufficient conditions for monotone hazard rate: an application to latency-probability curves. Journal of Mathematical Psychology, 8:303–332, 1971.

Tim Van Erven, Wojciech Kotlowski, and Manfred K. Warmuth. Follow the leader with dropout perturbations. In COLT, 2014.

A Proofs

A.1 Proof of Lemma 4.4

Proof. Since the numerator of the left hand side is an even function of $z$ and the denominator is a decreasing function, and the inequality is trivially true when $z = 0$, it suffices to prove it for $z > 0$, which we assume for the rest of the proof. From Lemma 4.3 we can derive that
$$\frac{f(z)}{1 - F(z)} < z + 1.$$
Therefore,
$$\frac{f(z)\, |z|^{\alpha}}{(1 - F(z))^{1-\alpha}} \le \frac{f(z)\, z^{\alpha}}{\big(\frac{f(z)}{z+1}\big)^{1-\alpha}} = \big(f(z)\, z\big)^{\alpha} (z+1)^{1-\alpha} \le f(z)^{\alpha}\, (z+1) \le z\, f(z)^{\alpha} + 1 = \Big(\tfrac{1}{2\pi}\Big)^{\alpha/2} z\, e^{-\alpha z^{2}/2} + 1.$$
Let $g(z) = z\, e^{-\alpha z^{2}/2}$; then $g'(z) = (1 - \alpha z^{2})\, e^{-\alpha z^{2}/2}$, so $g(z)$ is maximized at $z^{*} = \sqrt{1/\alpha}$. Therefore,
$$\frac{f(z)\, |z|^{\alpha}}{(1 - F(z))^{1-\alpha}} \le \Big(\tfrac{1}{2\pi}\Big)^{\alpha/2} z\, e^{-\alpha z^{2}/2} + 1 \le \Big(\tfrac{1}{2\pi}\Big)^{\alpha/2} z^{*} + 1 \le z^{*} + 1 = \sqrt{\tfrac{1}{\alpha}} + 1 \le \frac{2}{\alpha}. \qquad \Box$$
A.2 Proof of Theorem 4.6

Proof. From Corollary 4.5 we see that the expected regret can be upper bounded by $2(C_1 C_2 N T)^{1/(2-\alpha)} (\sqrt{2\log N})^{(1-\alpha)/(2-\alpha)}$, where $C_1 = \frac{2}{\alpha}$ and $C_2 = \frac{2}{1-\alpha}$. Note that
$$2(C_1 C_2 N T)^{1/(2-\alpha)} \big(\sqrt{2\log N}\big)^{(1-\alpha)/(2-\alpha)} \le 4\, (C_1 C_2)^{1/(2-\alpha)}\, N^{1/(2-\alpha)}\, \big(\sqrt{\log N}\big)^{(1-\alpha)/(2-\alpha)}\, T^{1/(2-\alpha)}$$
$$= 4\, N^{1/(2-\alpha)}\, \big(\sqrt{\log N}\big)^{(1-\alpha)/(2-\alpha)}\, T^{1/2} \times (C_1 C_2)^{1/(2-\alpha)}\, T^{\alpha/(4-2\alpha)}$$
$$\le 4\, N^{1/2}\, N^{\alpha/(4-2\alpha)}\, \sqrt{\log N}\, T^{1/2} \times \Big(\frac{4}{\alpha(1-\alpha)}\Big)^{1/(2-\alpha)} T^{\alpha/(4-2\alpha)} \le 4\, N^{1/2}\, N^{\alpha}\, \sqrt{\log N}\, T^{1/2} \times \frac{4\, T^{\alpha}}{\alpha(1-\alpha)} \le 16\, \sqrt{NT}\, N^{\alpha}\, \sqrt{\log N} \times \frac{T^{\alpha}}{\alpha(1-\alpha)}.$$
If we let $\alpha = \frac{1}{\log T}$, then $T^{\alpha} = T^{1/\log T} = e < 3$. Then we have, for $T > 4$,
$$\frac{T^{\alpha}}{\alpha(1-\alpha)} \le \frac{3\log T}{1 - \frac{1}{\log T}} = \frac{3\log^2 T}{\log T - 1} \le 6\log T.$$
Putting things together finishes the proof. □

A.3 Proof of Theorem 4.9

Proof. If the distribution is heavy-tailed, we have $\limsup_{z \to \infty} \bar F(z)\, e^{\lambda z} = \infty$ for all $\lambda > 0$. By Lemma 4.8, we can erase the supremum operator and just write $\lim_{z \to \infty} \bar F(z)\, e^{\lambda z} = \infty$ for all $\lambda > 0$. Hence,
$$\lim_{z \to \infty} \bar F(z)\, e^{\lambda z} = \lim_{z \to \infty} e^{-R(z) + \lambda z} = \infty \ \text{ for all } \lambda > 0 \ \Rightarrow\ \limsup_{z \to \infty} \frac{R(z)}{z} = 0.$$
Note that $R'(z) = \frac{f(z)}{\bar F(z)}$, which is eventually monotone by assumption. Therefore, we can conclude that
$$\limsup_{z \to \infty} R'(z) < \infty \ \Rightarrow\ \sup_z \frac{f(z)}{\bar F(z)} < \infty. \qquad \Box$$

A.4 Proof of Theorem 4.10

Proof. If the distribution is light-tailed, we have
$$\lim_{z \to \infty} \bar F(z)\, e^{\lambda^* z} < \infty \quad \text{for some } \lambda^* > 0. \qquad (21)$$
This immediately implies that
$$\lim_{z \to \infty} \bar F(z)^{a}\, z^{b} = 0 \quad \forall\, a, b > 0. \qquad (22)$$
Consider $\lim_{z \to \infty} \frac{f(z)}{\bar F(z)} = \lim_{z \to \infty} R'(z)$. If $\lim_{z \to \infty} R'(z) < \infty$ we can immediately conclude that $\sup_z \frac{f(z)}{1 - F(z)} < \infty$. If instead $\lim_{z \to \infty} R'(z) = \infty$, note that
$$\lim_{z \to \infty} \int_{-z}^{z} R'(t)\, e^{-\delta R(t)}\, dt = -\frac{1}{\delta}\, e^{-\delta R(z)} \Big|_{z = -\infty}^{z = +\infty} = \frac{1}{\delta} < \infty.$$
Moreover, since $\lim_{z \to \infty} R'(z) = \infty$, $R'(z)\, e^{-\delta R(z)}$ is strictly positive for all $z > z_0$ for some $z_0$. Furthermore, $R'(z)\, e^{-\delta R(z)} = \frac{f(z)}{(\bar F(z))^{1-\delta}}$ is eventually monotone by assumption (e). Therefore, we can conclude that
$$\lim_{z \to \infty} R'(z)\, e^{-\delta R(z)} = \lim_{z \to \infty} \frac{f(z)}{(\bar F(z))^{1-\delta}} = 0.$$
For all $\alpha \in (\delta, 1)$, from Equation (22) we have $\lim_{z \to +\infty} z^{\alpha}\, \bar F(z)^{\alpha - \delta} = 0$, so
$$\lim_{z \to +\infty} \frac{f(z)\, z^{\alpha}}{(\bar F(z))^{1-\alpha}} = \lim_{z \to +\infty} \frac{f(z)}{(\bar F(z))^{1-\delta}} \times z^{\alpha}\, \bar F(z)^{\alpha - \delta} = 0,$$
and hence
$$\sup_z \frac{f(z)\, z^{\alpha}}{(1 - F(z))^{1-\alpha}} < \infty \quad \forall\, \alpha \in (\delta, 1). \qquad \Box$$

A.5 Proof of Corollary 4.11

Proof. For a light-tailed distribution $\mathcal{D}$, we have $\lim_{z \to \infty} \bar F_{\mathcal{D}}(z)\, e^{\lambda^* z} < \infty$ for some $\lambda^* > 0$. This implies that
$$\bar F_{\mathcal{D}}(z) \le C\, e^{-\lambda^* z} \quad \text{for some } C > 0,\ z > z_0.$$
Let the random variable $Z$ follow distribution $\mathcal{D}$. Since $Z$ might take negative values, we define a new distribution $\mathcal{D}'$ that only takes non-negative values by
$$f_{\mathcal{D}'}(z) = \begin{cases} \frac{1}{p_{\mathcal{D}+}}\, f_{\mathcal{D}}(z) & \text{if } z \ge 0, \\ 0 & \text{otherwise}, \end{cases}$$
where $p_{\mathcal{D}+} = \mathbb{P}(Z \ge 0) > 0$ by the right-unbounded support assumption. Clearly, with this definition of $\mathcal{D}'$ we see that $\mathbb{E}_{Z_1,\dots,Z_N \sim \mathcal{D}}[\max_i Z_i] \le \mathbb{E}_{Z_1,\dots,Z_N \sim \mathcal{D}'}[\max_i Z_i]$ and, for $z > z_0$, we have $\bar F_{\mathcal{D}'}(z) = \frac{\bar F_{\mathcal{D}}(z)}{p_{\mathcal{D}+}} \le C'\, e^{-\lambda^* z}$ where $C' = \frac{C}{p_{\mathcal{D}+}}$. Note that
$$\mathbb{E}_{Z_1,\dots,Z_N \sim \mathcal{D}}[\max_i Z_i] \le \mathbb{E}_{Z_1,\dots,Z_N \sim \mathcal{D}'}[\max_i Z_i] = \int_0^{\infty} \mathbb{P}(\max_i Z_i > z)\, dz \le u + \int_u^{\infty} \mathbb{P}(\max_i Z_i > z)\, dz \le u + N \int_u^{\infty} \mathbb{P}(Z_i > z)\, dz \le u + N \int_u^{\infty} C'\, e^{-\lambda^* z}\, dz = u + \frac{C' N}{\lambda^*}\, e^{-\lambda^* u},$$
assuming $u > z_0$. If we let $u = \frac{\log N}{\lambda^*}$, obviously $u > z_0$ if $N$ is sufficiently large.
Thus, we see that
$$\mathbb{E}_{Z_1,\dots,Z_N \sim \mathcal{D}}[\max_i Z_i] \le \frac{\log N}{\lambda^*} + \frac{C'}{\lambda^*} = O(N^{\xi}) \quad \forall\, \xi > 0. \qquad (23)$$
From Theorem 4.10 we see that for all $\alpha \in (\delta, 1)$,
$$\frac{f(z)\, z^{\alpha}}{(1 - F(z))^{1-\alpha}} \le C_\alpha \quad \forall\, z \in \mathbb{R}. \qquad (24)$$
Plugging (23) and (24) into Theorem 4.2 gives the desired result. □

A.6 Proof of Theorem 4.13

Proof. By Corollary 4.11 we only need to check that assumptions (a)-(d) hold for the distribution $\mathcal{D}_\beta$, that the exponential power family is light-tailed, and that assumption (e) also holds for any $\delta \in (0, 1)$. By observing the density function $f_\beta$ we can trivially see that assumptions (a)-(c) hold and that the exponential power family is light-tailed. Therefore, defining
$$g_{\delta,\beta}(z) = \frac{f_\beta(z)}{(\bar F_\beta(z))^{1-\delta}} = \frac{f_\beta(z)}{(1 - F_\beta(z))^{1-\delta}},$$
it suffices to show that for all $\delta \in [0, 1)$, $g_{\delta,\beta}(z)$ is eventually monotone. Note that
$$g_{\delta,\beta}'(z) = \frac{f_\beta'(z)\,(1 - F_\beta(z))^{1-\delta} + (1-\delta)\,(1 - F_\beta(z))^{-\delta}\, f_\beta^2(z)}{(1 - F_\beta(z))^{2 - 2\delta}} = \frac{C_\beta^2\, e^{-z^{\beta}}}{(1 - F_\beta(z))^{2-\delta}} \times \left( (1-\delta)\, e^{-z^{\beta}} - \beta z^{\beta-1} \int_z^{\infty} e^{-t^{\beta}}\, dt \right).$$
It further suffices to show that
$$m_{\delta,\beta}(z) = (1-\delta)\, e^{-z^{\beta}} - \beta z^{\beta-1} \int_z^{\infty} e^{-t^{\beta}}\, dt$$
is eventually non-negative or non-positive for all $\beta > 1$, $\delta \in [0, 1)$. Note that since $\beta > 1$,
$$\beta z^{\beta-1} \int_z^{\infty} e^{-t^{\beta}}\, dt = \int_z^{\infty} \beta z^{\beta-1}\, e^{-t^{\beta}}\, dt < \int_z^{\infty} \beta t^{\beta-1}\, e^{-t^{\beta}}\, dt = e^{-z^{\beta}}. \qquad (25)$$
Therefore, $m_{0,\beta}(z) > 0$ for all $z \ge 0$, i.e., the hazard rate is always increasing and assumption (d) is satisfied. Now, we are left to show that $m_{\delta,\beta}(z)$ is eventually non-negative or non-positive for any $\delta \in (0, 1)$. Note that
$$\beta z^{\beta-1} \int_z^{\infty} e^{-t^{\beta}}\, dt = \beta \Big(\frac{z}{z+1}\Big)^{\beta-1} (z+1)^{\beta-1} \int_z^{\infty} e^{-t^{\beta}}\, dt \ge \beta \Big(\frac{z}{z+1}\Big)^{\beta-1} (z+1)^{\beta-1} \int_z^{z+1} e^{-t^{\beta}}\, dt \ge \Big(\frac{z}{z+1}\Big)^{\beta-1} \int_z^{z+1} \beta t^{\beta-1}\, e^{-t^{\beta}}\, dt = \Big(\frac{z}{z+1}\Big)^{\beta-1} \Big( e^{-z^{\beta}} - e^{-(z+1)^{\beta}} \Big).$$
Therefore,
$$\liminf_{z \to \infty} \frac{\beta z^{\beta-1} \int_z^{\infty} e^{-t^{\beta}}\, dt}{e^{-z^{\beta}}} \ge \liminf_{z \to \infty} \frac{\big(\frac{z}{z+1}\big)^{\beta-1} \big( e^{-z^{\beta}} - e^{-(z+1)^{\beta}} \big)}{e^{-z^{\beta}}} = \lim_{z \to \infty} \Big(\frac{z}{z+1}\Big)^{\beta-1} - \lim_{z \to \infty} \Big(\frac{z}{z+1}\Big)^{\beta-1} e^{z^{\beta} - (z+1)^{\beta}} = 1.$$
From Equation (25) we know that
$$\limsup_{z \to \infty} \frac{\beta z^{\beta-1} \int_z^{\infty} e^{-t^{\beta}}\, dt}{e^{-z^{\beta}}} \le 1.$$
Hence, we conclude that
$$\lim_{z \to \infty} \frac{\beta z^{\beta-1} \int_z^{\infty} e^{-t^{\beta}}\, dt}{e^{-z^{\beta}}} = 1,$$
which implies that $m_{\delta,\beta}(z)$ is eventually non-positive for any $\delta \in (0, 1)$, i.e., assumption (e) holds for any $\delta \in (0, 1)$. □
