Fighting Bandits with a New Kind of Smoothness
Authors: Jacob Abernethy, Chansoo Lee, Ambuj Tewari
Fighting Bandits with a New Kind of Smoothness

Jacob Abernethy, University of Michigan, jabernet@umich.edu
Chansoo Lee, University of Michigan, chansool@umich.edu
Ambuj Tewari, University of Michigan, tewaria@umich.edu

Abstract

We define a novel family of algorithms for the adversarial multi-armed bandit problem, and provide a simple analysis technique based on convex smoothing. We prove two main results. First, we show that regularization via the Tsallis entropy, which includes EXP3 as a special case, achieves the $\Theta(\sqrt{TN})$ minimax regret. Second, we show that a wide class of perturbation methods achieve a near-optimal regret as low as $O(\sqrt{TN \log N})$ if the perturbation distribution has a bounded hazard rate. For example, the Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this key property.

1 Introduction

The classic multi-armed bandit (MAB) problem, generally attributed to the early work of Robbins (1952), poses a generic online decision scenario in which an agent must make a sequence of choices from a fixed set of options. After each decision is made, the agent receives some feedback in the form of a loss (or gain) associated with her choice, but no information is provided on the outcomes of alternative options. The agent's goal is to minimize the total loss over time, and the agent is thus faced with the balancing act of both experimenting with the menu of choices while also utilizing the data gathered in the process to improve her decisions. The MAB framework is not only mathematically elegant, but useful for a wide range of applications including medical experiment design (Gittins, 1996), automated poker playing strategies (Van den Broeck et al., 2009), and hyperparameter tuning (Pacula et al., 2012). Early MAB results relied on stochastic assumptions (e.g., IID) on the loss sequence (Gittins et al., 2011; Lai and Robbins, 1985; Auer et al., 2002).
As researchers began to establish non-stochastic, worst-case guarantees for sequential decision problems such as prediction with expert advice (Littlestone and Warmuth, 1994), a natural question arose as to whether similar guarantees were possible for the bandit setting. The pioneering work of Auer, Cesa-Bianchi, Freund, and Schapire (2003) answered this in the affirmative by showing that their algorithm EXP3 possesses nearly-optimal regret bounds with matching lower bounds. Attention later turned to the bandit version of online linear optimization, and several associated guarantees were published the following decade (McMahan and Blum, 2004; Flaxman et al., 2005; Dani and Hayes, 2006; Dani et al., 2008; Abernethy et al., 2012).

Nearly all proposed methods have relied on a particular algorithmic blueprint: they reduce the bandit problem to the full-information setting, while using randomization to make decisions and to estimate the losses. A well-studied family of algorithms for the full-information setting is Follow the Regularized Leader (FTRL), which optimizes an objective function of the form

  $\arg\min_{x \in K}\ \langle L, x \rangle + \lambda R(x)$   (1)

where $K$ is the decision set, $L$ is (an estimate of) the cumulative loss vector, and $R$ is a regularizer, a convex function with suitable curvature to stabilize the objective. The choice of regularizer $R$ is critical to the algorithm's performance. For example, the EXP3 algorithm (Auer, 2003) regularizes with the entropy function and achieves a nearly optimal regret bound when $K$ is the probability simplex. For a general convex set, however, other regularizers such as self-concordant barrier functions (Abernethy et al., 2012) have tighter regret bounds.

Another class of algorithms for the full-information setting is Follow the Perturbed Leader (FTPL) (Kalai and Vempala, 2005), whose foundations date back to the earliest work in adversarial online learning (Hannan, 1957).
Here we choose a distribution $D$ on $\mathbb{R}^N$, sample a random vector $Z \sim D$, and solve the linear optimization problem

  $\arg\min_{x \in K}\ \langle L + Z, x \rangle$   (2)

FTPL is computationally simpler than FTRL due to the linearity of the objective, but it is analytically much more complex due to the randomness. For every different choice of $D$, an entirely new set of techniques had to be developed (Devroye et al., 2013; Van Erven et al., 2014). Rakhlin et al. (2012) and Abernethy et al. (2014) made some progress towards unifying the analysis framework. Their techniques, however, are limited to the full-information setting.

In this paper, we propose a new analysis framework for the multi-armed bandit problem that unifies the regularization and perturbation algorithms. The key element is a new kind of smoothness property, which we call differential consistency. It allows us to generate a wide class of both optimal and near-optimal algorithms for the adversarial multi-armed bandit problem. We summarize our main results:

1. We show that regularization via the Tsallis entropy leads to the state-of-the-art adversarial MAB algorithm, matching the minimax regret rate of Audibert and Bubeck (2009) with a tighter constant. Interestingly, our algorithm fully generalizes EXP3.

2. We show that a wide array of well-studied noise distributions lead to near-optimal regret bounds (matching those of EXP3). Furthermore, our analysis reveals a strikingly simple and appealing sufficient condition for achieving $O(\sqrt{T})$ regret: the hazard rate of the noise distribution must be bounded by a constant in the tail region. We conjecture that this requirement is in fact both necessary and sufficient.

2 Gradient-Based Prediction Algorithms for the Multi-Armed Bandit

Let us now introduce the adversarial multi-armed bandit problem. On each round $t = 1, \dots, T$, a learner must choose a distribution $p_t \in \Delta_N$ over the set of $N$ available actions.
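As a concrete illustration of the two full-information templates above, here is a minimal sketch on the probability simplex (the function names are ours, purely for illustration): the entropy-regularized objective (1) has a closed-form softmax minimizer, while the perturbed objective (2) simply adds noise to the cumulative losses and plays the best single action, i.e., a vertex of the simplex.

```python
import math
import random

def ftrl_entropy(L, lam):
    """FTRL (1) on the simplex with the (negative) Shannon entropy regularizer;
    the minimizer has the closed form p_i proportional to exp(-L_i / lam)."""
    m = min(L)
    w = [math.exp(-(Li - m) / lam) for Li in L]  # shift for numerical stability
    s = sum(w)
    return [wi / s for wi in w]

def ftpl_vertex(L, sample_z, rng):
    """FTPL (2) on the simplex: perturb the cumulative loss vector with
    Z ~ D and play the single best action (a vertex of the simplex)."""
    Z = [sample_z(rng) for _ in L]
    return min(range(len(L)), key=lambda i: L[i] + Z[i])

rng = random.Random(0)
L = [0.0, -1.0, -2.0]                          # cumulative losses in (-inf, 0]
p = ftrl_entropy(L, lam=1.0)                   # a full distribution over arms
arm = ftpl_vertex(L, lambda r: r.gauss(0.0, 1.0), rng)  # a single random arm
```

Note the structural difference the paper exploits: FTRL outputs a distribution deterministically, whereas FTPL's randomness lives inside the sampling itself.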
The adversary (Nature) chooses a loss vector $g_t \in [-1, 0]^N$. The learner plays action $i_t$ sampled according to $p_t$ and suffers the loss $g_{t,i_t}$. The learner observes only the single coordinate $g_{t,i_t}$ and receives no information as to the values $g_{t,j}$ for $j \neq i_t$. This limited-information feedback is what makes the bandit problem much more challenging than the full-information setting, in which the entire $g_t$ is observed.

The learner's goal is to minimize the regret, defined to be the difference between the realized loss and the loss of the best fixed action in hindsight:

  $\mathrm{Regret}_T := \max_{i \in [N]} \sum_{t=1}^{T} (g_{t,i} - g_{t,i_t})$.   (3)

To be precise, we consider the expected regret, where the expectation is taken with respect to the learner's randomization.

Loss vs. Gain Note: The maximization in (3) would imply that $g$ is, strictly speaking, a negative gain. Nevertheless, we use the term loss, as we impose the assumption that $g_t \in [-1, 0]^N$ throughout the paper.

2.1 The Gradient-Based Algorithmic Template

We study a particular algorithmic template described in Framework 1, which is a slight variation of the Gradient-Based Prediction Algorithm (GBPA) of Abernethy et al. (2014). Note that the algorithm (i) maintains an unbiased estimate $\hat{G}_t$ of the cumulative losses, (ii) updates $\hat{G}_t$ by adding a single-round estimate $\hat{g}_t$, and (iii) uses the gradient of a convex function $\tilde{\Phi}$ as the sampling distribution $p_t$. The choice of $\tilde{\Phi}$ is flexible; it must only be a differentiable convex function whose gradient is always a probability distribution.

Framework 1 may appear restrictive, but it has served as the basis for much of the published work on adversarial MAB algorithms (Auer et al., 2003; Kujala and Elomaa, 2005; Neu and Bartók, 2013), mainly for two reasons. First, the GBPA framework encompasses all FTRL and FTPL algorithms, which are the core techniques for sequential prediction algorithms (Abernethy et al., 2014).
Second, although there is some flexibility, any unbiased estimation scheme must involve some kind of inverse-probability scaling: an unbiased estimate of a quantity that is observed with only probability $p$ must necessarily involve fluctuations that scale as $O(1/p)$.

Framework 1: Gradient-Based Prediction Algorithm (GBPA) Template for Multi-Armed Bandits.

GBPA($\tilde{\Phi}$): $\tilde{\Phi}$ is a differentiable convex function such that $\nabla \tilde{\Phi} \in \Delta_N$ and $\nabla_i \tilde{\Phi} > 0$ for all $i$.
Initialize $\hat{G}_0 = 0$
for $t = 1$ to $T$ do
  Nature: A loss vector $g_t \in [-1, 0]^N$ is chosen by the adversary
  Sampling: Learner chooses $i_t$ according to the distribution $p(\hat{G}_{t-1}) = \nabla \tilde{\Phi}(\hat{G}_{t-1})$
  Cost: Learner "gains" loss $g_{t,i_t}$
  Estimation: Learner "guesses" $\hat{g}_t := \frac{g_{t,i_t}}{p_{i_t}(\hat{G}_{t-1})} e_{i_t}$
  Update: $\hat{G}_t = \hat{G}_{t-1} + \hat{g}_t$

Lemma 2.1. Define $\Phi(G) \equiv \max_i G_i$, so that we can write the expected regret of GBPA($\tilde{\Phi}$) as $\mathbb{E}\,\mathrm{Regret}_T = \Phi(G_T) - \sum_{t=1}^{T} \langle \nabla \tilde{\Phi}(\hat{G}_{t-1}), g_t \rangle$. Then the expected regret of GBPA($\tilde{\Phi}$) can be bounded as:

  $\mathbb{E}\,\mathrm{Regret}_T \le \underbrace{\tilde{\Phi}(0) - \Phi(0)}_{\text{overestimation penalty}} + \underbrace{\mathbb{E}\big[\Phi(\hat{G}_T) - \tilde{\Phi}(\hat{G}_T)\big]}_{\text{underestimation penalty}} + \sum_{t=1}^{T} \underbrace{\mathbb{E}_{i_t}\big[D_{\tilde{\Phi}}(\hat{G}_t, \hat{G}_{t-1}) \mid \hat{G}_{t-1}\big]}_{\text{divergence penalty}}$,   (4)

where the expectations are over the sampling of $i_1, \dots, i_T$.

Proof. Let $\tilde{\Phi}$ be a valid convex function for GBPA, and consider GBPA($\tilde{\Phi}$) run on the loss sequence $g_1, \dots, g_T$. The algorithm produces a sequence of estimated losses $\hat{g}_1, \dots, \hat{g}_T$. Now consider GBPA-FI($\tilde{\Phi}$), which is GBPA($\tilde{\Phi}$) run with full information on the deterministic loss sequence $\hat{g}_1, \dots, \hat{g}_T$ (there is no estimation step, and the learner updates $\hat{G}_t$ directly). The regret of this run can be written as

  $\Phi(\hat{G}_T) - \sum_{t=1}^{T} \langle \nabla \tilde{\Phi}(\hat{G}_{t-1}), \hat{g}_t \rangle$   (5)

and $\Phi(G_T) \le \mathbb{E}\,\Phi(\hat{G}_T)$ by the convexity of $\Phi$ (Jensen's inequality, since $\hat{G}_T$ is an unbiased estimate of $G_T$). Hence, Equation (5) is an upper bound on the regret.
The rest of the proof is a fairly well-known result in the online learning literature; see, for example, (Cesa-Bianchi and Lugosi, 2006, Theorem 11.6) or (Abernethy et al., 2014, Section 2). For completeness, we include the full proof in Appendix A.

2.2 A New Kind of Smoothness

A guiding principle that has emerged throughout machine learning is that the stability of an algorithm leads to performance guarantees: small modifications of the input data should not dramatically alter the output. In the context of GBPA, the algorithm's output (the prediction at each time step) is by definition the derivative $\nabla \tilde{\Phi}$, and its stability corresponds to the Lipschitz continuity of this gradient. Abernethy et al. (2014) proved that a uniform bound on the norm of $\nabla^2 \tilde{\Phi}$ directly gives a regret guarantee in the full-information setting.

In the bandit setting, however, a uniform bound on $\nabla^2 \tilde{\Phi}$ is insufficient; the regret (Lemma 2.1) involves terms of the form $D_{\tilde{\Phi}}(\hat{G}_{t-1} + \hat{g}_t, \hat{G}_{t-1})$, where the incremental quantity $\hat{g}_t$ can scale as large as the inverse of the smallest probability in $p(\hat{G}_{t-1})$. What is needed is a stronger notion of smoothness that bounds $\nabla^2 \tilde{\Phi}$ in correspondence with $\nabla \tilde{\Phi}$, and we propose the following definition:

Definition 2.2 (Differential Consistency). For constants $\gamma, C > 0$, we say that a convex function $\tilde{\Phi}(\cdot)$ is $(\gamma, C)$-differentially-consistent if for all $G \in (-\infty, 0]^N$,

  $\nabla^2_{ii} \tilde{\Phi}(G) \le C \big(\nabla_i \tilde{\Phi}(G)\big)^{\gamma}$.

In other words, the rate at which we decrease $p_i$ should approach 0 as $p_i$ approaches 0. This guarantees that the algorithm continues to explore. We now prove a generic bound that we will use in the following two sections to derive regret guarantees.

Theorem 2.3. Suppose $\tilde{\Phi}$ is $(\gamma, C)$-differentially-consistent for constants $C, \gamma > 0$.
Then the divergence penalty at time $t$ in Lemma 2.1 can be upper bounded as:

  $\mathbb{E}_{i_t}\big[D_{\tilde{\Phi}}(\hat{G}_t, \hat{G}_{t-1}) \mid \hat{G}_{t-1}\big] \le \frac{C}{2} \sum_{i=1}^{N} \big(\nabla_i \tilde{\Phi}(\hat{G}_{t-1})\big)^{\gamma - 1}$.

Proof. For the sake of clarity, we drop the subscripts; we use $\hat{G}$ to denote the cumulative estimate $\hat{G}_{t-1}$, $\hat{g}$ to denote the marginal estimate $\hat{g}_t = \hat{G}_t - \hat{G}_{t-1}$, and $g$ to denote the true loss $g_t$. Note that by the definition of Framework 1, $\hat{g}$ is a sparse vector with one non-zero (and non-positive) coordinate $\hat{g}_{i_t} = g_{t,i_t} / \nabla_{i_t} \tilde{\Phi}(\hat{G})$, and conditioned on $\hat{G}$ the index $i_t$ is drawn according to $\nabla \tilde{\Phi}(\hat{G})$.

For a fixed $i_t$, let $h(r) := D_{\tilde{\Phi}}(\hat{G} + r\,\hat{g}/\|\hat{g}\|, \hat{G})$, so that

  $h''(r) = (\hat{g}/\|\hat{g}\|)^{\top} \nabla^2 \tilde{\Phi}(\hat{G} + r\,\hat{g}/\|\hat{g}\|)\, (\hat{g}/\|\hat{g}\|) = e_{i_t}^{\top} \nabla^2 \tilde{\Phi}(\hat{G} - r\,e_{i_t})\, e_{i_t}$,

where the last equality uses $\hat{g}/\|\hat{g}\| = -e_{i_t}$ (recall that $\hat{g}_{i_t} \le 0$). Now we write:

  $\mathbb{E}_{i_t}\big[D_{\tilde{\Phi}}(\hat{G} + \hat{g}, \hat{G}) \mid \hat{G}\big] = \sum_{i=1}^{N} \mathbb{P}[i_t = i] \int_0^{\|\hat{g}\|} \int_0^s h''(r)\, dr\, ds$
  $= \sum_{i=1}^{N} \nabla_i \tilde{\Phi}(\hat{G}) \int_0^{\|\hat{g}\|} \int_0^s e_i^{\top} \nabla^2 \tilde{\Phi}(\hat{G} - r e_i)\, e_i\, dr\, ds$
  $\le \sum_{i=1}^{N} \nabla_i \tilde{\Phi}(\hat{G}) \int_0^{\|\hat{g}\|} \int_0^s C \big(\nabla_i \tilde{\Phi}(\hat{G} - r e_i)\big)^{\gamma}\, dr\, ds$
  $\le \sum_{i=1}^{N} \nabla_i \tilde{\Phi}(\hat{G}) \int_0^{\|\hat{g}\|} \int_0^s C \big(\nabla_i \tilde{\Phi}(\hat{G})\big)^{\gamma}\, dr\, ds$
  $= C \sum_{i=1}^{N} \big(\nabla_i \tilde{\Phi}(\hat{G})\big)^{1+\gamma} \int_0^{\|\hat{g}\|} \int_0^s dr\, ds = \frac{C}{2} \sum_{i=1}^{N} \big(\nabla_i \tilde{\Phi}(\hat{G})\big)^{\gamma - 1} g_i^2$,

where the last step uses $\|\hat{g}\| = |g_i| / \nabla_i \tilde{\Phi}(\hat{G})$; the claim follows since $g_i^2 \le 1$. The first inequality is by the differential consistency supposition. The second inequality is due to the convexity of $\tilde{\Phi}$, which guarantees that $\nabla_i \tilde{\Phi}$ is an increasing function in the $i$-th coordinate; this step critically depends on the loss-only assumption that $g$ is always non-positive.

3 A Minimax Bandit Algorithm via Tsallis Smoothing

Auer et al. (2003) proved that their EXP3 algorithm achieves $O(\sqrt{TN \log N})$ regret and that any multi-armed bandit algorithm suffers $\Omega(\sqrt{TN})$ regret. A few years later, Audibert and Bubeck (2009) resolved this gap with the Implicitly Normalized Forecaster (INF), which was later shown to be equivalent to Mirror Descent on the probability simplex (Audibert et al., 2011).
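As a quick sanity check of Definition 2.2: for the log-sum-exp potential behind EXP3, $\tilde{\Phi}(G) = \frac{1}{\eta} \log \sum_i \exp(\eta G_i)$, the $i$-th diagonal Hessian entry is $\eta\, p_i (1 - p_i) \le \eta\, p_i$, so this potential is $(1, \eta)$-differentially-consistent. A small numeric sketch (our own illustration, not code from the paper):

```python
import math

def exp3_probs(G, eta):
    # Gradient of (1/eta) * log(sum_i exp(eta * G_i)), i.e. the softmax.
    m = max(G)
    w = [math.exp(eta * (g - m)) for g in G]  # shift for numerical stability
    s = sum(w)
    return [wi / s for wi in w]

# Diagonal Hessian of the log-sum-exp potential: eta * p_i * (1 - p_i).
# Differential consistency with gamma = 1 and C = eta: hess_ii <= eta * p_i.
eta = 0.5
G = [-3.0, -1.0, -0.2, 0.0]
p = exp3_probs(G, eta)
hess_diag = [eta * pi * (1.0 - pi) for pi in p]
consistent = all(h <= eta * pi + 1e-12 for h, pi in zip(hess_diag, p))
```

The inequality is tight only when some $p_i$ is close to 1, which is exactly the regime where exploration is in danger of collapsing.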
EXP3 corresponds to INF with potential function $\psi(x) = \exp(\eta x)$, while using $\psi(x) = (-\eta x)^{-q}$ with $q > 1$ gives an optimal algorithm with regret at most $2\sqrt{2TN}$ (Bubeck and Cesa-Bianchi, 2012, Theorem 5.7).

What we present in this section is essentially a reformulation of a particular subfamily of INF, which includes INF with the above two potential functions. Our reformulation leads to a very simple and intuitive analysis based on differential consistency, and a natural interpolation between the two seemingly unrelated potential functions.

Let us first note that EXP3 is an instance of GBPA where the potential function $\tilde{\Phi}(\cdot)$ is the Fenchel conjugate of the (negative) Shannon entropy. For any $p \in \Delta_N$, the negative Shannon entropy is defined as $H(p) := \sum_i p_i \log p_i$, and its conjugate is $H^{\star}(G) = \sup_{p \in \Delta_N} \{\langle p, G \rangle - \eta H(p)\}$. In fact, we have a closed-form expression for the supremum: $H^{\star}(G) = \frac{1}{\eta} \log\big(\sum_i \exp(\eta G_i)\big)$. By inspecting the gradient of this expression, it is easy to see that EXP3 chooses the distribution $p_t = \nabla H^{\star}(G)$ every round.

We now replace the Shannon entropy with the Tsallis entropy¹ (Tsallis, 1988), defined for $0 < \alpha < 1$ as:

  $S_{\alpha}(p) = \frac{1}{1 - \alpha}\Big(1 - \sum_{i=1}^{N} p_i^{\alpha}\Big)$.

Interestingly, the Shannon entropy is a limiting special case of the Tsallis entropy: $S_{\alpha}(\cdot) \to H(\cdot)$ as $\alpha \to 1$.

Theorem 3.1. Let $\tilde{\Phi}(G) = \max_{p \in \Delta_N} \{\langle p, G \rangle - \eta S_{\alpha}(p)\}$. Then GBPA($\tilde{\Phi}$) has regret at most

  $\mathbb{E}\,\mathrm{Regret} \le \underbrace{\eta\, \frac{N^{1-\alpha} - 1}{1 - \alpha}}_{\text{overestimation penalty}} + \underbrace{\frac{N^{\alpha} T}{2 \eta \alpha}}_{\text{divergence penalty}}$.   (6)

Before proving the theorem, we note that it immediately recovers the EXP3 upper bound as the special case $\alpha \to 1$. An easy application of L'Hôpital's rule shows that as $\alpha \to 1$, $\frac{N^{1-\alpha} - 1}{1 - \alpha} \to \log N$ and $N^{\alpha}/\alpha \to N$. Choosing $\eta = \sqrt{(N \log N)/T}$, we see that the right-hand side of (6) tends to $2\sqrt{TN \log N}$.
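For general $\alpha$ there is no closed form for $\nabla \tilde{\Phi}$, but the sampling distribution can be computed to machine precision with a one-dimensional root-finder: the first-order conditions give $p_i = \big(\eta\alpha / ((1-\alpha)(\lambda - G_i))\big)^{1/(1-\alpha)}$, where the multiplier $\lambda > \max_i G_i$ is chosen so the $p_i$ sum to one. A sketch of the resulting GBPA loop (our own illustrative code, not the authors' implementation):

```python
import random

def tsallis_probs(G, eta, alpha=0.5, iters=80):
    # p_i = (eta*alpha / ((1-alpha)*(lam - G_i)))^(1/(1-alpha)), with the
    # Lagrange multiplier lam > max(G) found by bisection on sum(p) = 1.
    c = eta * alpha / (1.0 - alpha)
    def total(lam):
        return sum((c / (lam - g)) ** (1.0 / (1.0 - alpha)) for g in G)
    gmax = max(G)
    lo, hi = gmax + 1e-12, gmax + 1.0
    while total(hi) > 1.0:            # grow the bracket until total(hi) <= 1
        hi = gmax + 2.0 * (hi - gmax)
    for _ in range(iters):            # total(lam) is decreasing in lam
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) > 1.0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return [(c / (lam - g)) ** (1.0 / (1.0 - alpha)) for g in G]

def gbpa_tsallis(losses, eta, alpha, rng):
    # Framework 1: sample an arm, observe one coordinate, and update the
    # importance-weighted cumulative loss estimate G_hat.
    N = len(losses[0])
    G_hat, total_loss = [0.0] * N, 0.0
    for g in losses:
        p = tsallis_probs(G_hat, eta, alpha)
        i = rng.choices(range(N), weights=p)[0]
        total_loss += g[i]
        G_hat[i] += g[i] / p[i]       # unbiased loss estimate
    return total_loss

rng = random.Random(0)
losses = [[-1.0, 0.0, 0.0] for _ in range(50)]   # arm 0 is always bad
loss = gbpa_tsallis(losses, eta=1.0, alpha=0.5, rng=rng)
```

With $\alpha = 1/2$ this is the minimax-optimal instance discussed above; the bisection costs $O(N)$ per evaluation, so each round is cheap.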
However, the choice $\alpha \to 1$ is clearly not optimal, as we show in the following statement, which follows directly from the theorem once we note that $N^{1-\alpha} - 1 < N^{1-\alpha}$.

Corollary 3.2. For any $\alpha \in (0, 1)$, if we choose $\eta = \sqrt{\frac{T(1-\alpha)}{2\alpha}}\, N^{\alpha - \frac{1}{2}}$, then we have

  $\mathbb{E}\,\mathrm{Regret} \le \sqrt{\frac{2TN}{\alpha(1-\alpha)}}$.

In particular, the choice $\alpha = \frac{1}{2}$ gives a regret of no more than $2\sqrt{2TN}$, recovering (Bubeck and Cesa-Bianchi, 2012, Theorem 5.7).

Proof of Theorem 3.1. We bound each penalty term in Lemma 2.1. Since $S_{\alpha}$ is non-positive, the underestimation penalty is upper bounded by 0, and the overestimation penalty is at most $-\eta \min S_{\alpha}$. The minimum of $S_{\alpha}$ occurs at $(1/N, \dots, 1/N)$. Hence,

  (overestimation penalty) $\le -\frac{\eta}{1-\alpha}\Big(1 - \sum_{i=1}^{N} N^{-\alpha}\Big) = \eta\, \frac{N^{1-\alpha} - 1}{1-\alpha}$.   (7)

It remains to upper bound the divergence penalty. Straightforward calculus gives $\nabla^2 (\eta S_{\alpha})(p) = \eta\alpha\, \mathrm{diag}(p_1^{\alpha-2}, \dots, p_N^{\alpha-2})$. Let $I_{\Delta_N}(\cdot)$ be the indicator function with $I_{\Delta_N}(x) = 0$ for $x \in \Delta_N$ and $I_{\Delta_N}(x) = \infty$ otherwise, and define $\hat{S}_{\alpha}(\cdot) := \eta S_{\alpha}(\cdot) + I_{\Delta_N}(\cdot)$, which is the convex conjugate of $\tilde{\Phi}$. Following the setup of Penot (1994), $\nabla^2(\eta S_{\alpha})(p)$ is a sub-Hessian of $\hat{S}_{\alpha}$ at $p$; we now apply Proposition 3.2 of the same reference. Let $(p_G, G)$ be a pair such that $\nabla \tilde{\Phi}(G) = p_G$. Since $\nabla^2(\eta S_{\alpha})(p)$ is invertible, it follows that $(\nabla^2(\eta S_{\alpha})(p_G))^{-1}$ is a super-Hessian of $\tilde{\Phi}$ at $G$. Hence, for any $G$,

  $\nabla^2 \tilde{\Phi}(G) \preceq (\eta\alpha)^{-1}\, \mathrm{diag}\big((p_G)_1^{2-\alpha}, \dots, (p_G)_N^{2-\alpha}\big)$.

That is, $\tilde{\Phi}$ is $(2-\alpha, (\eta\alpha)^{-1})$-differentially-consistent, and thus applying Theorem 2.3 gives

  $\mathbb{E}_{i_t}\big[D_{\tilde{\Phi}}(\hat{G}_t, \hat{G}_{t-1}) \mid \hat{G}_{t-1}\big] \le (2\eta\alpha)^{-1} \sum_{i=1}^{N} p_i(\hat{G}_{t-1})^{1-\alpha}$.

¹More precisely, the function we give here is the negative Tsallis entropy according to its original definition.
Since the $\frac{1}{\alpha}$-norm and the $\frac{1}{1-\alpha}$-norm are dual to each other, we can apply Hölder's inequality to any probability distribution $(p_1, \dots, p_N)$ and obtain

  $\sum_{i=1}^{N} p_i^{1-\alpha} = \sum_{i=1}^{N} p_i^{1-\alpha} \cdot 1 \le \Big(\sum_{i=1}^{N} \big(p_i^{1-\alpha}\big)^{\frac{1}{1-\alpha}}\Big)^{1-\alpha} \Big(\sum_{i=1}^{N} 1^{\frac{1}{\alpha}}\Big)^{\alpha} = 1^{1-\alpha}\, N^{\alpha} = N^{\alpha}$.

So the divergence penalty is at most $(2\eta\alpha)^{-1} N^{\alpha}$ per round, which completes the proof.

4 Near-Optimal Bandit Algorithms via Stochastic Smoothing

Let $D$ be a continuous distribution with unbounded support, probability density function $f$, and cumulative distribution function $F$. Consider the GBPA with a potential function of the form

  $\tilde{\Phi}(G; D) = \mathbb{E}_{Z_1, \dots, Z_N \overset{iid}{\sim} D}\, \max_i \{G_i + Z_i\}$   (8)

which is a stochastic smoothing of the function $\max_i G_i$. Since the max function is convex, $\tilde{\Phi}$ is also convex. By Bertsekas (1973), we can swap the order of differentiation and expectation:

  $\nabla \tilde{\Phi}(G; D) = \mathbb{E}_{Z_1, \dots, Z_N \overset{iid}{\sim} D}\, [e_{i^*}]$, where $i^* = \arg\max_{i=1,\dots,N} \{G_i + Z_i\}$.   (9)

Even where the function is not differentiable everywhere, the swap remains possible with any subgradient under some mild conditions; hence ties between coordinates (which occur with probability zero anyway) can be resolved in an arbitrary manner. It is clear that $\nabla \tilde{\Phi}$ lies in the probability simplex, and note that

  $\frac{\partial \tilde{\Phi}}{\partial G_i} = \mathbb{E}_{Z_1,\dots,Z_N}\, \mathbb{1}\{G_i + Z_i > G_j + Z_j,\ \forall j \neq i\} = \mathbb{E}_{\tilde{G}_{j^*}}\big[\mathbb{P}_{Z_i}[Z_i > \tilde{G}_{j^*} - G_i]\big] = \mathbb{E}_{\tilde{G}_{j^*}}\big[1 - F(\tilde{G}_{j^*} - G_i)\big]$   (10)

where $\tilde{G}_{j^*} = \max_{j \neq i} (G_j + Z_j)$. The unbounded support condition guarantees that this partial derivative is non-zero for all $i$ given any $G$, so $\tilde{\Phi}(G; D)$ satisfies the requirements of Framework 1.

Despite the fact that perturbation-based algorithms provide a natural randomized decision strategy, they have seen few applications, mostly because they are hard to analyze.
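The gradient (9) can be estimated by straightforward Monte Carlo. A useful test case is Gumbel noise: by the Gumbel-max property, the smoothed gradient is then exactly the softmax of $G$, which is how EXP3 arises from a perturbation. A sketch (our own illustration; function names are ours):

```python
import math
import random

def smoothed_grad_mc(G, sample_z, rng, n=50000):
    # Equation (9): grad_i is the probability that coordinate i attains
    # max_j {G_j + Z_j} under iid draws Z ~ D.
    counts = [0] * len(G)
    for _ in range(n):
        Z = [sample_z(rng) for _ in G]
        counts[max(range(len(G)), key=lambda i: G[i] + Z[i])] += 1
    return [c / n for c in counts]

def gumbel(rng):
    # Standard Gumbel via inverse transform (guarded away from u = 0).
    return -math.log(-math.log(max(rng.random(), 1e-12)))

rng = random.Random(0)
G = [-1.0, -0.5, 0.0]
p_hat = smoothed_grad_mc(G, gumbel, rng)

# Gumbel-max property: the exact gradient is softmax(G).
w = [math.exp(g) for g in G]
s = sum(w)
p_exact = [wi / s for wi in w]
```

Replacing `gumbel` with any other sampler (Gaussian, Weibull, ...) changes the induced distribution, which is precisely the question the hazard-rate analysis below answers.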
But one should expect general results to be within reach: the EXP3 algorithm, for example, can be viewed through the lens of perturbations, where the noise is distributed according to the Gumbel distribution. Indeed, an early result of Kujala and Elomaa (2005) showed that a near-optimal MAB strategy comes about through the use of exponentially-distributed noise, and the same perturbation strategy has more recently been utilized in the work of Neu and Bartók (2013) and Kocák et al. (2014). However, a more general understanding of perturbation methods has remained elusive. For example, would Gaussian noise suffice for a guarantee? What about, say, the Weibull distribution?

4.1 Connection to Follow the Perturbed Leader

The sampling step of the bandit GBPA (Framework 1) with a stochastically smoothed potential (Equation 8) can be done efficiently: instead of evaluating the expectation (Equation 9), we simply take a single random sample. In fact, this is equivalent to the Follow the Perturbed Leader (FTPL) algorithm (Kalai and Vempala, 2005) applied to the bandit setting. The estimation step, on the other hand, is hard, because there is generally no closed-form expression for $\nabla \tilde{\Phi}$. To address this issue, Neu and Bartók (2013) proposed Geometric Resampling (GR). GR uses an iterative resampling process to estimate $\nabla_i \tilde{\Phi}$. They showed that if we stop after $M$ iterations, the extra regret due to the estimation bias is at most an additive $\frac{NT}{eM}$ term. That is, all the GBPA regret bounds in this section hold for the corresponding FTPL algorithm with an extra additive $\frac{NT}{eM}$ term. This term, however, does not affect the asymptotic regret rate as long as $M = \sqrt{NT}$, because the lower bound for any algorithm is of order $\sqrt{NT}$.

4.2 Hazard Rate Analysis

In this section, we show that the performance of GBPA($\tilde{\Phi}(G; D)$) can be characterized by the hazard function of the smoothing distribution $D$.
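Hazard rates (Definition 4.1 below) are straightforward to tabulate for the distributions considered here. As a quick illustrative check (our own code): the Exponential(1) hazard is identically 1, while the standard Gumbel hazard increases monotonically and approaches 1 from below.

```python
import math

def hazard(pdf, cdf, x):
    # Definition 4.1: h_D(x) = f(x) / (1 - F(x)).
    return pdf(x) / (1.0 - cdf(x))

# Exponential(1): f(x) = e^-x, F(x) = 1 - e^-x, so h(x) = 1 for all x.
exp_h = [hazard(lambda x: math.exp(-x), lambda x: 1.0 - math.exp(-x), x)
         for x in (0.1, 1.0, 5.0)]

# Standard Gumbel: F(x) = exp(-exp(-x)) and f(x) = exp(-x - exp(-x)); the
# hazard increases with x and tends to 1, so sup h is a constant.
gum_f = lambda x: math.exp(-x - math.exp(-x))
gum_F = lambda x: math.exp(-math.exp(-x))
gum_h = [hazard(gum_f, gum_F, x) for x in (0.0, 2.0, 6.0)]
```

A Gaussian, by contrast, has a hazard rate that grows without bound (roughly linearly in $x$), which is exactly the failure mode Conjecture 4.5 points at.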
The hazard rate is a standard tool in survival analysis for describing failures due to aging; for example, an increasing hazard rate models units that deteriorate with age, while a decreasing hazard rate models units that improve with age (a counterintuitive but not illogical possibility). To the best of our knowledge, the connection between hazard rates and the design of adversarial bandit algorithms has not been made before.

Definition 4.1 (Hazard rate function). The hazard rate function of a distribution $D$ is

  $h_D(x) := \frac{f(x)}{1 - F(x)}$.

For the rest of the section, we assume that $D$ is unbounded in the direction of $+\infty$, so that the hazard function is well-defined everywhere. This assumption is for clarity of presentation and can easily be removed (Appendix B).

Theorem 4.2. The regret of the GBPA with $\tilde{\Phi}(G) = \mathbb{E}_{Z_1,\dots,Z_N \sim D}\, \max_i \{G_i + \eta Z_i\}$ is at most:

  $\underbrace{\eta\, \mathbb{E}_{Z_1,\dots,Z_N \sim D}\big[\max_i Z_i\big]}_{\text{overestimation penalty}} + \underbrace{\frac{N (\sup h_D)}{\eta}\, T}_{\text{divergence penalty}}$

Proof. We analyze each penalty term in Lemma 2.1. Due to the convexity of $\Phi$, the underestimation penalty is non-positive. The overestimation penalty is clearly at most $\eta\, \mathbb{E}_{Z_1,\dots,Z_N \sim D}[\max_i Z_i]$, and Lemma 4.3 proves the $N(\sup h_D)$ upper bound on the per-round divergence penalty for an unscaled perturbation.

It remains to account for the tuning parameter $\eta$. Suppose we scale the perturbation by $\eta > 0$, i.e., we add $\eta Z_i$ to each coordinate. It is easy to see that $\mathbb{E}[\max_i \eta Z_i] = \eta\, \mathbb{E}[\max_i Z_i]$. For the divergence penalty, let $F_{\eta}$ be the CDF of the scaled random variable. Observe that $F_{\eta}(t) = F(t/\eta)$ and thus $f_{\eta}(t) = \frac{1}{\eta} f(t/\eta)$. Hence, the hazard rate scales by $1/\eta$, which completes the proof.

Lemma 4.3. The divergence penalty of the GBPA with $\tilde{\Phi}(G) = \mathbb{E}_{Z_1,\dots,Z_N \sim D}\, \max_i \{G_i + Z_i\}$ is at most $N (\sup h_D)$ each round.

Proof. Recall the gradient expression in Equation (10).
We upper bound the $i$-th diagonal entry of the Hessian as follows:

  $\nabla^2_{ii} \tilde{\Phi}(G) = \frac{\partial}{\partial G_i}\, \mathbb{E}_{\tilde{G}_{j^*}}\big[1 - F(\tilde{G}_{j^*} - G_i)\big] = \mathbb{E}_{\tilde{G}_{j^*}}\Big[\frac{\partial}{\partial G_i}\big(1 - F(\tilde{G}_{j^*} - G_i)\big)\Big] = \mathbb{E}_{\tilde{G}_{j^*}}\big[f(\tilde{G}_{j^*} - G_i)\big]$
  $= \mathbb{E}_{\tilde{G}_{j^*}}\big[h(\tilde{G}_{j^*} - G_i)\,\big(1 - F(\tilde{G}_{j^*} - G_i)\big)\big]$   (11)
  $\le (\sup h)\, \mathbb{E}_{\tilde{G}_{j^*}}\big[1 - F(\tilde{G}_{j^*} - G_i)\big] = (\sup h)\, \nabla_i \tilde{\Phi}(G)$,

where $\tilde{G}_{j^*} = \max_{j \neq i} \{G_j + Z_j\}$ is a random variable independent of $Z_i$. We now apply Theorem 2.3 with $\gamma = 1$ and $C = \sup h$ to complete the proof.

Corollary 4.4. Follow the Perturbed Leader with any of the distributions in Table 1 (restricted to a certain range of parameters), combined with Geometric Resampling (Section 4.1) with $M = \sqrt{NT}$, has an expected regret of order $O(\sqrt{TN \log N})$.

Table 1 provides the two terms we need to bound. We derive the third column of the table in Appendix C using extreme value theory (Embrechts et al., 1997). Note that our analysis in the proof of Lemma 4.3 is quite tight; the only place we have an inequality is where we upper bound the hazard rate. It is thus reasonable to pose the following conjecture:

Conjecture 4.5. If a distribution $D$ has a monotonically increasing hazard rate $h_D(x)$ that does not converge as $x \to +\infty$ (e.g., Gaussian), then there is a sequence of losses that will incur at least a linear regret.

Table 1: Distributions that give $O(\sqrt{TN \log N})$-regret FTPL algorithms. The parameterization follows the corresponding Wikipedia pages for easy lookup. We denote the Euler-Mascheroni constant ($\approx 0.58$) by $\gamma_0$. Each row lists $\sup_x h_D(x)$, $\mathbb{E}[\max_{i=1}^{N} Z_i]$, and the parameter choice yielding the $O(\sqrt{TN \log N})$ regret.

- Gumbel ($\mu = 1$, $\beta = 1$): sup hazard $1$ (approached as $x \to \infty$); $\mathbb{E}[\max] = \log N + \gamma_0$; parameter: N/A.
- Frechet ($\alpha > 1$): sup hazard at most $2\alpha$; $\mathbb{E}[\max] = N^{1/\alpha}\, \Gamma(1 - 1/\alpha)$; parameter: $\alpha = \log N$.
- Weibull* ($\lambda = 1$, $k \le 1$): sup hazard $k$ (at $x = 0$); $\mathbb{E}[\max] = O\big((\tfrac{1}{k})!\, (\log N)^{1/k}\big)$; parameter: $k = 1$ (Exponential).
- Pareto* ($x_m = 1$, $\alpha$): sup hazard $\alpha$ (at $x = 0$); $\mathbb{E}[\max] = \alpha N^{1/\alpha}/(\alpha - 1)$; parameter: $\alpha = \log N$.
- Gamma ($\alpha \ge 1$, $\beta$): sup hazard $\beta$ (approached as $x \to \infty$); $\mathbb{E}[\max] = \beta^{-1}\big(\log N + (\alpha - 1)\log\log N - \log \Gamma(\alpha) + \gamma_0\big)$; parameter: $\beta = \alpha = 1$ (Exponential).
Distributions marked with (*) need to be slightly modified using the conditioning trick explained in Appendix B.2. The maximum of the Frechet hazard function must be computed numerically (Elsayed, 2012, p. 47), but elementary calculations show that it is bounded by $2\alpha$ (Appendix D).

The intuition behind Conjecture 4.5 is that if the adversary keeps incurring a high loss on the $i$-th arm, then with high probability $\tilde{G}_{j^*} - G_i$ will be large, so the expectation in Equation (11) will be dominated by the hazard function evaluated at large values of $\tilde{G}_{j^*} - G_i$.

Acknowledgments. J. Abernethy acknowledges the support of NSF under CAREER grant IIS-1453304. A. Tewari acknowledges the support of NSF under CAREER grant IIS-1452099.

References

Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012.

Jacob Abernethy, Chansoo Lee, Abhinav Sinha, and Ambuj Tewari. Online linear optimization via smoothing. In COLT, pages 807–823, 2014.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.

Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Minimax policies for combinatorial prediction games. In COLT, 2011.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research, 3:397–422, 2003.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003.

Dimitri P. Bertsekas. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218–231, 1973.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint, 2012.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Varsha Dani and Thomas P. Hayes. Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. In SODA, pages 937–943, 2006.

Varsha Dani, Thomas Hayes, and Sham Kakade. The price of bandit information for online optimization. In NIPS, 2008.

Luc Devroye, Gábor Lugosi, and Gergely Neu. Prediction by random-walk perturbation. In Conference on Learning Theory, pages 460–473, 2013.

E. A. Elsayed. Reliability Engineering. Wiley Series in Systems Engineering and Management. Wiley, 2012.

P. Embrechts, C. Klüppelberg, and T. Mikosch. Modelling Extremal Events: For Insurance and Finance. Applications of Mathematics. Springer, 1997.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385–394, 2005.

John Gittins. Quantitative methods in the planning of pharmaceutical research. Drug Information Journal, 30(2):479–487, 1996.

John Gittins, Kevin Glazebrook, and Richard Weber. Multi-armed Bandit Allocation Indices. John Wiley & Sons, 2011.

J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, Volume III, pages 97–139, 1957.

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In NIPS, pages 613–621. Curran Associates, Inc., 2014.
Jussi Kujala and Tapio Elomaa. On following the perturbed leader in the bandit setting. In Algorithmic Learning Theory, pages 371–385. Springer, 2005.

T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

H. Brendan McMahan and Avrim Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In COLT, pages 109–123, 2004.

Gergely Neu and Gábor Bartók. An efficient algorithm for learning with semi-bandit feedback. In Algorithmic Learning Theory, pages 234–248. Springer, 2013.

Maciej Pacula, Jason Ansel, Saman Amarasinghe, and Una-May O'Reilly. Hyperparameter tuning in bandit-based adaptive operator selection. In Applications of Evolutionary Computation, pages 73–82. Springer, 2012.

Jean-Paul Penot. Sub-hessians, super-hessians and conjugation. Nonlinear Analysis: Theory, Methods & Applications, 23(6):689–702, 1994.

Sasha Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems, pages 2141–2149, 2012.

Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1-2):479–487, 1988.

Guy Van den Broeck, Kurt Driessens, and Jan Ramon. Monte-Carlo tree search in poker using expected reward distributions. In Advances in Machine Learning, pages 367–381. Springer, 2009.

Tim Van Erven, Wojciech Kotłowski, and Manfred K. Warmuth. Follow the leader with dropout perturbations. In COLT, 2014.
Algorithm 2: Gradient-Based Prediction Algorithm (GBPA) for the Full-Information Setting

Input: $\tilde{\Phi}$, a differentiable convex function such that $\nabla \tilde{\Phi} \in \Delta_N$ and $\nabla_i \tilde{\Phi} > 0$ for all $i$.
Initialize $G_0 = 0$
for $t = 1$ to $T$ do
  Sampling: The learner chooses arm $i_t$ with probability $p_i(G_{t-1}) = \nabla_i \tilde{\Phi}(G_{t-1})$
  The adversary chooses a loss vector $g_t \in [-1, 0]^N$, and the learner pays $g_{t,i_t}$
  Update: $G_t = G_{t-1} + g_t$

A Proof of the GBPA Regret Bound (Lemma 2.1)

Lemma A.1. The expected regret of Algorithm 2 can be written as:

  $\mathbb{E}\,\mathrm{Regret} = \underbrace{\tilde{\Phi}(0) - \Phi(0)}_{\text{overestimation penalty}} + \underbrace{\Phi(G_T) - \tilde{\Phi}(G_T)}_{\text{underestimation penalty}} + \sum_{t=1}^{T} \underbrace{D_{\tilde{\Phi}}(G_t, G_{t-1})}_{\text{divergence penalty}}$

Proof. Since $\Phi(0) = 0$, telescoping and the definition of the Bregman divergence give

  $\tilde{\Phi}(G_T) = \underbrace{\tilde{\Phi}(0) - \Phi(0)}_{\text{overestimation penalty}} + \sum_{t=1}^{T} \big(\tilde{\Phi}(G_t) - \tilde{\Phi}(G_{t-1})\big) = \underbrace{\tilde{\Phi}(0) - \Phi(0)}_{\text{overestimation penalty}} + \sum_{t=1}^{T} \big(\langle \nabla \tilde{\Phi}(G_{t-1}), g_t \rangle + D_{\tilde{\Phi}}(G_t, G_{t-1})\big)$.

Therefore,

  $\mathbb{E}\,\mathrm{Regret} = \Phi(G_T) - \sum_{t=1}^{T} \langle \nabla \tilde{\Phi}(G_{t-1}), g_t \rangle = \underbrace{\Phi(G_T) - \tilde{\Phi}(G_T)}_{\text{underestimation penalty}} + \tilde{\Phi}(G_T) - \sum_{t=1}^{T} \langle \nabla \tilde{\Phi}(G_{t-1}), g_t \rangle = \underbrace{\Phi(G_T) - \tilde{\Phi}(G_T)}_{\text{underestimation penalty}} + \underbrace{\tilde{\Phi}(0) - \Phi(0)}_{\text{overestimation penalty}} + \sum_{t=1}^{T} D_{\tilde{\Phi}}(G_t, G_{t-1})$.

B Relaxing Assumptions on the Distribution

B.1 Mirroring trick for extending the support

Let $X$ have support on $x > 0$ with density $f$ and CDF $F$, and define $Y$ by mirroring the density of $X$ around zero, i.e., $Y$ has density $g(y) = \frac{1}{2} f(|y|)$ and CDF $G(y) = \frac{1}{2}(1 + \mathrm{sign}(y) F(|y|))$. Note that $|Y|$ is distributed as $X$, and hence $\mathbb{E}[\max_i Y_i] \le \mathbb{E}[\max_i |Y_i|] = \mathbb{E}[\max_i X_i]$. The hazard rate $h_Y(y)$ equals $f(y)/(1 - F(y))$ for $y \ge 0$, and for $y < 0$ it equals $f(-y)/(1 + F(-y)) \le f(-y)/(1 - F(-y))$. Therefore, $\sup_y h_Y(y) = \sup_{x > 0} h_X(x)$. This proves the following lemma.

Lemma B.1.
Let a random variable X have support on the non-negative reals with density f(x), and define Y as the mirrored version with density g(y) = (1/2) f(|y|). Then we have

E[max_i Y_i] ≤ E[max_i X_i],   sup_y h_Y(y) = sup_{x>0} h_X(x),

where h_X, h_Y are the hazard rates of X and Y, respectively.

B.2  Conditioning trick for unbounded hazard rate near zero

Suppose F(x) is the CDF of a random variable X whose hazard rate is bounded for x ≥ 1 but blows up near zero. Then define Y as X − 1 conditioned on the event X > 1. That is, Y has CDF, for y > 0,

G(y) = P(X ≤ 1 + y | X > 1) = (F(1 + y) − F(1)) / (1 − F(1))

and density g(y) = f(1 + y)/(1 − F(1)), y > 0. So the hazard rate h_Y(y) is

g(y)/(1 − G(y)) = f(1 + y)/(1 − F(1 + y)) = h_X(1 + y).

Therefore, sup_{y>0} h_Y(y) = sup_{x>1} h_X(x), which makes the hazard rate of Y bounded. Thus we have proved the lemma below.

Lemma B.2. If the hazard rate of X is bounded for x > 1 and blows up only for small values of x, then we can condition on X > 1 to define a new random variable whose hazard rate is bounded.

The same technique can be applied with any other constant in place of 1, but for the family of random variables we consider, it suffices to condition on X ≥ 1.

C  Detailed Derivation of Extreme Value Behavior

C.1  Maximum of iid Gumbel

The CDF of the Gumbel distribution is exp(−exp(−x)) and its expected value is γ_0, the Euler–Mascheroni constant. Thus, the CDF of the maximum of N iid Gumbel random variables is

(exp(−exp(−x)))^N = exp(−exp(−(x − log N))),

which is also Gumbel, but with the mean increased by log N.

C.2  Maximum of iid Frechet

The CDF of the Frechet distribution is exp(−x^{−α}) and its mean is Γ(1 − 1/α) as long as α > 1 (otherwise the mean is infinite). Hence, the CDF of the maximum of N iid Frechet random variables is

(exp(−x^{−α}))^N = exp(−N x^{−α}) = exp(−(x/N^{1/α})^{−α}),

which is also Frechet, but with the mean scaled by N^{1/α}.
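The two closed-form identities above are purely algebraic and easy to verify mechanically. Below is a small Python check (not from the paper), evaluating both sides of each CDF identity at a few points; the choices of N, α, and the test points are arbitrary:

```python
import math

def gumbel_cdf(x):
    """CDF of the standard Gumbel distribution."""
    return math.exp(-math.exp(-x))

def frechet_cdf(x, alpha):
    """CDF of the Frechet distribution with shape alpha, for x > 0."""
    return math.exp(-x ** (-alpha))

N, alpha = 16, 2.5
for x in [0.5, 1.0, 2.0, 5.0]:
    # C.1: the max of N iid Gumbels is Gumbel shifted by log N
    assert abs(gumbel_cdf(x) ** N - gumbel_cdf(x - math.log(N))) < 1e-12
    # C.2: the max of N iid Frechets is Frechet scaled by N^(1/alpha)
    assert abs(frechet_cdf(x, alpha) ** N
               - frechet_cdf(x / N ** (1.0 / alpha), alpha)) < 1e-12
```

Both identities hold exactly; the tolerance only absorbs floating-point rounding.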
C.3  Maximum of iid Weibull

Let X_i have the modified Weibull distribution with CDF 1 − exp(−(x + 1)^k + 1). Thus,

P(max_i X_i > t) ≤ N P(X_1 > t) = N exp(−(t + 1)^k + 1).

For a non-negative random variable X and any u > 0, we have

E[X] = ∫_0^∞ P(X > x) dx ≤ u + ∫_u^∞ P(X > x) dx.

Assume k = 1/m where m ≥ 1 is a positive integer. Therefore,

E[max_i X_i] ≤ u + ∫_u^∞ N exp(−(x + 1)^k + 1) dx
             ≤ u + 3N ∫_u^∞ exp(−(x + 1)^k) dx
             = u + 3N ∫_{u+1}^∞ exp(−x^{1/m}) dx
             = u + 3N m Γ(m, (1 + u)^{1/m}),

where Γ(m, x) is the incomplete Gamma function, which for a positive integer m and x > 1 simplifies to

Γ(m, x) = (m − 1)! e^{−x} Σ_{k=0}^{m−1} x^k/k! ≤ (m − 1)! e^{−x} Σ_{k=0}^{m−1} x^m/k! = (m − 1)! e^{−x} x^m Σ_{k=0}^{m−1} 1/k! ≤ (m − 1)! e^{−x} x^m Σ_{k=0}^∞ 1/k! ≤ 3(m − 1)! e^{−x} x^m.

Plugging this back above, we get, for any u > 0,

E[max_i X_i] ≤ u + 9 N m! e^{−(1+u)^{1/m}} (1 + u).

Now choose u = log^m N + 1 to get

E[max_i X_i] ≤ log^m N + 9 N m! (log^m N)/N ≤ 10 m! log^m N.

C.4  Maximum of iid Gamma

Let Y be the maximum of N iid Gamma(α, β) random variables. Then (Y − d_N)/c_N asymptotically follows the Gumbel distribution, where c_N = β^{−1} and d_N = β^{−1}(log N + (α − 1) log log N − log Γ(α)). In the language of extreme value theory, the Gamma distribution belongs to the maximum domain of attraction of the Gumbel distribution with these parameters (Embrechts et al., 1997). As mentioned in Section C.1, the Gumbel distribution has mean γ_0.

C.5  Maximum of iid Pareto

Let X_i have the modified Pareto distribution with CDF 1 − 1/(1 + x)^α. Thus,

P(max_i X_i > t) ≤ N P(X_1 > t) = N/(1 + t)^α.

For a non-negative random variable X and any u > 0, we have

E[X] = ∫_0^∞ P(X > x) dx ≤ u + ∫_u^∞ P(X > x) dx.

Therefore, for α > 1,

E[max_i X_i] ≤ u + ∫_u^∞ N/(1 + x)^α dx = u + N/((α − 1)(1 + u)^{α−1}).

Setting u = N^{1/α} − 1 gives the bound

E[max_i X_i] ≤ (α/(α − 1)) N^{1/α}.
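As a numerical sanity check on the last bound (not part of the paper), E[max_i X_i] = ∫_0^∞ P(max_i X_i > x) dx can be estimated with a midpoint rule and compared against (α/(α − 1)) N^{1/α}; the truncation point and step size below are ad hoc choices:

```python
def expected_max_modified_pareto(N, alpha, upper=2000.0, steps=200000):
    """Midpoint-rule estimate of E[max of N iid modified Pareto(alpha)] using
    E[Y] = integral over x > 0 of P(Y > x) dx, with P(Y > x) = 1 - F(x)^N."""
    dx = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        cdf = 1.0 - 1.0 / (1.0 + x) ** alpha  # CDF of one modified Pareto
        total += (1.0 - cdf ** N) * dx        # tail probability of the maximum
    return total

N, alpha = 16, 2.0
bound = alpha / (alpha - 1.0) * N ** (1.0 / alpha)  # the C.5 bound (= 8 here)
assert expected_max_modified_pareto(N, alpha) < bound
```

For N = 16, α = 2, the estimate lands comfortably below the bound of 8; the truncation at x = 2000 discards a tail of at most N/2001 ≈ 0.008.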
D  Hazard Functions of Modified Distributions and the Frechet Case

D.1  Pareto distribution

Using the conditioning trick, we consider, for α > 1 (otherwise the mean is infinite), the modified Pareto distribution with pdf f(x) = α/(x + 1)^{α+1} supported on (0, ∞). Its CDF is 1 − 1/(x + 1)^α. Its hazard rate is h(x) = α/(x + 1), which decreases in x and is bounded by α. The expected maximum of N iid modified Pareto random variables is bounded by αN^{1/α}/(α − 1) (see Appendix C.5). This gives a regret bound of √(NT) · √(α² N^{1/α}/(α − 1)).

D.2  Frechet distribution

The CDF of the Frechet distribution is exp(−x^{−α}), x > 0, where α > 0 is a shape parameter. The hazard rate of the Frechet distribution is

h(x) = α x^{−α−1} exp(−x^{−α}) / (1 − exp(−x^{−α})),

which is hard to optimize analytically but can be upper bounded, for α > 1, by 2α via the elementary calculations given below. The CDF of the maximum of N iid Frechet random variables is exp(−(x/N^{1/α})^{−α}), which is also Frechet (with the mean scaled by N^{1/α}) and has expected value N^{1/α} Γ(1 − 1/α) (as long as α > 1; otherwise the expectation is infinite). Thus, the regret bound we get is

O(√(NT) · √(α N^{1/α} Γ(1 − 1/α))).

Setting α = log N makes the regret bound O(√(T N log N)). Our choice of α is larger than 1 as soon as N > 2.

D.2.1  Elementary calculations for bounding the Frechet distribution's hazard rate

For α > 1, we want to show that sup_{x>0} h(x) ≤ 2α, where

h(x) = α x^{−α−1} exp(−x^{−α}) / (1 − exp(−x^{−α})).

First, consider the case x ≥ 1. Define y = x^α and note that y ≥ 1. Then we have

h(x) = (α/(xy)) · exp(−1/y)/(1 − exp(−1/y)) ≤ (α/y) · exp(−1/y)/(1 − exp(−1/y)) ≤ (α/y) · 1/(1 − (1 − 1/(2y))) = 2α.

The first inequality holds because x ≥ 1. The second holds because exp(−1/y) < 1 and exp(−1/y) ≤ 1 − 1/(2y) for y ≥ 1. Next, consider the case x < 1. Define y = 1/x and note that y > 1.
Then we have

h(x) = (α/x^{α+1}) · exp(−x^{−α})/(1 − exp(−x^{−α})) ≤ (α/x^{α+1}) · exp(−x^{−α})/(1 − exp(−1)) = (α/(1 − e^{−1})) · y^{α+1} exp(−y^α) ≤ 2α y^{α+1} exp(−y^α).

To show an upper bound of 2α, it therefore suffices to show that sup_{y>1} g(y) ≤ 1, where g(y) = y^{α+1} exp(−y^α). We show this now. Note that

g′(y) = (α + 1) y^α exp(−y^α) − y^{α+1} α y^{α−1} exp(−y^α) = y^α exp(−y^α) ((α + 1) − α y^α),

which means that g(y) is monotonically increasing on the interval (1, y_0) and monotonically decreasing on the interval (y_0, +∞), where y_0 = ((α + 1)/α)^{1/α}. We therefore have

sup_{y>1} g(y) = g(y_0) = (1 + 1/α)^{1+1/α} exp(−(1 + 1/α)) ≤ 2² exp(−2) = 4/e² ≤ 1,

where the first inequality holds because α > 1: for α > 1, the function α ↦ (1 + 1/α)^{1+1/α} exp(−(1 + 1/α)) decreases monotonically, so it is bounded by its value at α = 1, namely 4/e².

D.3  Weibull distribution

The CDF of the Weibull distribution is 1 − exp(−x^k) for x > 0 (and 0 otherwise), where k > 0 is a shape parameter. The density is k x^{k−1} exp(−x^k) and the hazard rate is k x^{k−1}. For k > 1, the hazard rate monotonically increases and is therefore unbounded for large x. For k < 1, the hazard rate is unbounded for small values of x. Note that the Weibull includes the exponential as a special case when k = 1.

Let k = 1/m for some positive integer m ≥ 1 and, using the conditioning trick, consider a modified Weibull with CDF 1 − exp(−(x + 1)^k + 1). The density is k(x + 1)^{k−1} exp(−(x + 1)^k + 1) and the hazard rate is k(x + 1)^{k−1}, which is bounded by k. For k < 1, we get tails heavier than the exponential but not as heavy as those of a Pareto or a Frechet.

The expected value of the maximum of N iid (modified) Weibull random variables with parameter k = 1/m scales as O(m!(log N)^m) (see Appendix C.3). Thus, we get the regret bound O(√(NT) · √(m!(log N)^m)). Hence, the entire modified Weibull family yields O(√(N polylog(N)) · √T) regret bounds.
The best bound is obtained when m = 1, i.e., when the Weibull becomes an exponential.
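As a numerical sanity check on the calculations in Section D (not part of the original derivations), the two key claims of D.2.1 — sup_{x>0} h(x) ≤ 2α for the Frechet hazard rate, and sup_{y>1} g(y) = g(y_0) ≤ 4/e² for g(y) = y^{α+1} exp(−y^α) — can be verified on a grid in Python; the grid ranges and the choices α = 2 and α = 1.5 are arbitrary:

```python
import math

def frechet_hazard(x, alpha):
    """Hazard rate f(x) / (1 - F(x)) of the Frechet CDF F(x) = exp(-x^(-alpha))."""
    u = x ** (-alpha)
    # 1 - exp(-u) computed stably as -expm1(-u)
    return alpha * x ** (-alpha - 1.0) * math.exp(-u) / (-math.expm1(-u))

def g(y, alpha):
    """The function y^(alpha+1) * exp(-y^alpha) from Section D.2.1."""
    return y ** (alpha + 1.0) * math.exp(-y ** alpha)

# Claim 1 (D.2.1): sup over x > 0 of h(x) <= 2 * alpha, for alpha > 1.
alpha = 2.0
peak = max(frechet_hazard(0.01 + 0.001 * i, alpha) for i in range(20000))
assert peak < 2.0 * alpha

# Claim 2 (D.2.1): g peaks at y0 = ((alpha+1)/alpha)^(1/alpha) with g(y0) <= 4/e^2.
alpha = 1.5
y0 = ((alpha + 1.0) / alpha) ** (1.0 / alpha)  # stationary point from g'(y) = 0
grid_max = max(g(1.0 + 0.0001 * i, alpha) for i in range(100000))  # y in (1, 11]
assert grid_max <= g(y0, alpha) + 1e-9  # y0 maximizes g on (1, infinity)
assert g(y0, alpha) <= 4.0 / math.e ** 2 < 1.0
```

A grid search is of course only a sanity check, not a proof; it is reassuring that for α = 2 the observed hazard peak sits well below the 2α bound.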