Minimax Policies for Combinatorial Prediction Games


Authors: Jean-Yves Audibert, Sébastien Bubeck, Gábor Lugosi

Jean-Yves Audibert (Imagine, Univ. Paris Est, and Sierra, CNRS/ENS/INRIA, Paris, France; audibert@imagine.enpc.fr)
Sébastien Bubeck (Centre de Recerca Matemàtica, Barcelona, Spain; sbubeck@crm.cat)
Gábor Lugosi (ICREA and Pompeu Fabra University, Barcelona, Spain; lugosi@upf.es)

Abstract

We address the online linear optimization problem when the actions of the forecaster are represented by binary vectors. Our goal is to understand the magnitude of the minimax regret for the worst possible set of actions. We study the problem under three different assumptions for the feedback: full information, and the partial information models of the so-called "semi-bandit" and "bandit" problems. We consider both $L_\infty$- and $L_2$-type restrictions for the losses assigned by the adversary. We formulate a general strategy using Bregman projections on top of a potential-based gradient descent, which generalizes the ones studied in the series of papers György et al. (2007), Dani et al. (2008), Abernethy et al. (2008), Cesa-Bianchi and Lugosi (2009), Helmbold and Warmuth (2009), Koolen et al. (2010), Uchiya et al. (2010), Kale et al. (2010) and Audibert and Bubeck (2010). We provide simple proofs that recover most of the previous results. We propose new upper bounds for the semi-bandit game. Moreover we derive lower bounds for all three feedback assumptions. With the only exception of the bandit game, the upper and lower bounds are tight, up to a constant factor. Finally, we answer a question asked by Koolen et al. (2010) by showing that the exponentially weighted average forecaster is suboptimal against $L_\infty$ adversaries.

1 Introduction

In the sequential decision making problems considered in this paper, at each time instance $t = 1, \dots, n$, the forecaster chooses, possibly in a randomized way, an action from a given set $\mathcal{S}$, where $\mathcal{S}$ is a subset of the $d$-dimensional hypercube $\{0,1\}^d$. The action chosen by the forecaster at time $t$ is denoted by $V_t = (V_{1,t},\dots,V_{d,t}) \in \mathcal{S}$. Simultaneously to the forecaster, the adversary chooses a loss vector $\ell_t = (\ell_{1,t},\dots,\ell_{d,t}) \in [0,+\infty)^d$, and the loss incurred by the forecaster is $\ell_t^T V_t$. The goal of the forecaster is to minimize the expected cumulative loss $\mathbb{E}\sum_{t=1}^n \ell_t^T V_t$, where the expectation is taken with respect to the forecaster's internal randomization. This problem is an instance of an "online linear optimization" problem (footnote 1); see, e.g., Awerbuch and Kleinberg (2004), McMahan and Blum (2004), Kalai and Vempala (2005), György et al. (2007), Dani et al. (2008), Abernethy et al. (2008), Cesa-Bianchi and Lugosi (2009), Helmbold and Warmuth (2009), Koolen et al. (2010), Uchiya et al. (2010) and Kale et al. (2010).

Footnote 1: In online linear optimization problems, the action set is often not restricted to be a subset of $\{0,1\}^d$ but can be an arbitrary subset of $\mathbb{R}^d$. However, in the most interesting cases, actions are naturally represented by Boolean vectors and we restrict our attention to this case.

Parameters: set of actions $\mathcal{S} \subset \{0,1\}^d$; number of rounds $n \in \mathbb{N}$.
For each round $t = 1, 2, \dots, n$:
(1) the forecaster chooses $V_t \in \mathcal{S}$ with the help of an external randomization;
(2) simultaneously the adversary selects a loss vector $\ell_t \in [0,+\infty)^d$ (without revealing it);
(3) the forecaster incurs the loss $\ell_t^T V_t$. He observes
  – the loss vector $\ell_t$ in the full information game,
  – the coordinates $\ell_{i,t}\mathbb{1}_{V_{i,t}=1}$ in the semi-bandit game,
  – the instantaneous loss $\ell_t^T V_t$ in the bandit game.
Goal: the forecaster tries to minimize his cumulative loss $\sum_{t=1}^n \ell_t^T V_t$.

Figure 1: Combinatorial prediction games.
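As an illustration of the three feedback models in Figure 1, the following minimal sketch (not from the paper; the action set, loss values and sampling are hypothetical) simulates one round and shows what the forecaster observes in each game.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical action set S, a subset of {0,1}^d (here d = 4).
S = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1]])

# The forecaster draws V_t from some distribution p_t on S (uniform here).
p_t = np.full(len(S), 1.0 / len(S))
V_t = S[rng.choice(len(S), p=p_t)]

# The adversary picks a loss vector in [0, +inf)^d (here in [0, 1]^d).
loss_t = rng.uniform(0.0, 1.0, size=S.shape[1])

# Feedback received in each of the three games.
full_info   = loss_t                 # entire loss vector
semi_bandit = loss_t * V_t           # coordinates l_{i,t} with V_{i,t} = 1
bandit      = float(loss_t @ V_t)    # only the incurred total loss

print(V_t, full_info, semi_bandit, bandit)
```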
We consider three variants of the problem, distinguished by the type of information that becomes available to the forecaster at each time instance, after taking an action. (1) In the full information game the forecaster observes the entire loss vector $\ell_t$; (2) in the semi-bandit game only those components $\ell_{i,t}$ of $\ell_t$ are observable for which $V_{i,t} = 1$; (3) in the bandit game only the total loss $\ell_t^T V_t$ becomes available to the forecaster. We refer to these problems as combinatorial prediction games. All three prediction games are sketched in Figure 1.

For all three games, we define the regret (footnote 2) of the forecaster as
$$R_n = \mathbb{E}\sum_{t=1}^n \ell_t^T V_t - \min_{v\in\mathcal{S}} \mathbb{E}\sum_{t=1}^n \ell_t^T v.$$

Footnote 2: For the full information game, one can directly upper bound the stronger notion of regret $\mathbb{E}\sum_{t=1}^n \ell_t^T V_t - \mathbb{E}\min_{v\in\mathcal{S}}\sum_{t=1}^n \ell_t^T v$, which is always larger than $R_n$. However, for partial information games, this requires more work.

In order to make meaningful statements about the regret, one needs to restrict the possible loss vectors the adversary may assign. We work with two different natural assumptions that have been considered in the literature:

$L_\infty$ assumption: here we assume that $\|\ell_t\|_\infty \le 1$ for all $t = 1,\dots,n$.
$L_2$ assumption: assume that $\ell_t^T v \le 1$ for all $t = 1,\dots,n$ and $v \in \mathcal{S}$.

Note that, without loss of generality, we may assume that for all $i\in\{1,\dots,d\}$ there exists $v\in\mathcal{S}$ with $v_i = 1$, and then the $L_2$ assumption implies the $L_\infty$ assumption.

The goal of this paper is to study the minimax regret, that is, the performance of the forecaster that minimizes the regret for the worst possible sequence of loss assignments. This, of course, depends on the set $\mathcal{S}$ of actions. Our aim is to determine the order of magnitude of the minimax regret for the most difficult set to learn. More precisely, for a given game, if we write $\sup$ for the supremum over all allowed adversaries (that is, either $L_\infty$ or $L_2$ adversaries) and $\inf$ for the infimum over all forecaster strategies for this game, we are interested in the maximal minimax regret
$$\max_{\mathcal{S}\subset\{0,1\}^d} \inf\sup R_n.$$

Table 1: Bounds on the maximal minimax regret proved in this paper (up to constant factors). The new results are the bandit-game entries and the $L_2$ semi-bandit upper bound.

|             | $L_\infty$, Full Info | $L_\infty$, Semi-Bandit | $L_\infty$, Bandit | $L_2$, Full Info | $L_2$, Semi-Bandit | $L_2$, Bandit |
|-------------|-----------------------|--------------------------|---------------------|-------------------|---------------------|----------------|
| Lower Bound | $d\sqrt{n}$ | $d\sqrt{n}$ | $d^{3/2}\sqrt{n}$ | $\sqrt{dn}$ | $\sqrt{dn}$ | $d\sqrt{n}$ |
| Upper Bound | $d\sqrt{n}$ | $d\sqrt{n}$ | $d^{5/2}\sqrt{n}$ | $\sqrt{dn}$ | $\sqrt{dn\log d}$ | $d^{3/2}\sqrt{n}$ |

Table 2: Upper bounds on $R_n$ for specific forecasters.

|         | $L_\infty$, Full Info | $L_\infty$, Semi-Bandit | $L_\infty$, Bandit | $L_2$, Full Info | $L_2$, Semi-Bandit | $L_2$, Bandit |
|---------|-----------------------|--------------------------|---------------------|-------------------|---------------------|----------------|
| EXP2    | $d^{3/2}\sqrt{n}$ | $d^{3/2}\sqrt{n}$ | $d^{5/2}\sqrt{n}$ | $\sqrt{dn}$ | $d\sqrt{n}$ * | $d^{3/2}\sqrt{n}$ |
| LINEXP  | $d\sqrt{n}$ | $d\sqrt{n}$ | $d^2 n^{2/3}$ | $\sqrt{dn}$ | $d\sqrt{n}$ * | $d^2 n^{2/3}$ |
| LINPOLY | $d\sqrt{n}$ | $d\sqrt{n}$ | -- | $\sqrt{dn}$ | $\sqrt{dn\log d}$ | -- |

We also show that the bound for EXP2 in the full information game is unimprovable. Note that the bound for (Bandit, LINEXP) is very weak. The bounds marked with * become $\sqrt{dn\log d}$ if we restrict our attention to sets $\mathcal{S}$ that are "almost symmetric" in the sense that, for some $k$, $\mathcal{S} \subset \{v\in\{0,1\}^d : \sum_{i=1}^d v_i \le k\}$ and $\mathrm{Conv}(\mathcal{S}) \cap \big[\frac{k}{2d}, 1\big]^d \ne \emptyset$. Note that in this paper we do not restrict our attention to computationally efficient algorithms.
The following example illustrates the different games that we introduced above.

Example 1 Consider the well studied example of path planning in which, at every time instance, the forecaster chooses a path from one fixed vertex to another in a graph. At each time, a loss is assigned to every edge of the graph and, depending on the model of the feedback, the forecaster observes either the losses of all edges, the losses of each edge on the chosen path, or only the total loss of the chosen path. The goal is to minimize the total loss for any sequence of loss assignments. This problem can be cast as a combinatorial prediction game in dimension $d$, for $d$ the number of edges in the graph.

Our contribution is threefold. First, we propose a variant of the algorithm used to track the best linear predictor (Herbster and Warmuth, 1998) that is well suited to our combinatorial prediction games. This leads to an algorithm called CLEB that generalizes various approaches that have been proposed. This new point of view on algorithms that were defined for specific games (only the full information game, or only the standard multi-armed bandit game) allows us to generalize them easily to all combinatorial prediction games, leading to new algorithms such as LINPOLY. This algorithmic contribution leads to our second main result, the improvement of the known upper bounds for the semi-bandit game. This point of view also leads to a different proof of the minimax $\sqrt{nd}$ regret bound in the standard $d$-armed bandit game that is much simpler than the one provided in Audibert and Bubeck (2010). A summary of the bounds proved in this paper can be found in Table 1 and Table 2. In addition we prove several lower bounds. First, we establish lower bounds on the minimax regret in all three games and under both types of adversaries, whereas only the cases ($L_2/L_\infty$, Full Information) and ($L_2$, Bandit) were previously treated in the literature. Moreover we also answer a question of Koolen et al. (2010) by showing that the traditional exponentially weighted average forecaster is suboptimal against $L_\infty$ adversaries.

In particular, this paper leads to the following (perhaps unexpected) conclusions:

• The full information game is as hard as the semi-bandit game. More precisely, in terms of the maximal minimax regret, the price that one pays for the limited feedback of the semi-bandit game compared to the full information game is only a constant factor (or a $\sqrt{\log d}$ factor for the $L_2$ setting).

• In the full information and semi-bandit games, the traditional exponentially weighted average forecaster is provably suboptimal for $L_\infty$ adversaries, while it is optimal for $L_2$ adversaries in the full information game.

• Denote by $\mathcal{A}_2$ (respectively $\mathcal{A}_\infty$) the set of adversaries that satisfy the $L_2$ assumption (respectively the $L_\infty$ assumption). We clearly have $\mathcal{A}_2 \subset \mathcal{A}_\infty \subset d\,\mathcal{A}_2$. We prove that, in the full information game, the maximal minimax regret gains an additional factor of $\sqrt{d}$ at each inclusion. In the semi-bandit game, we show that the same statement remains true up to a logarithmic factor.

Notation. The convex hull of $\mathcal{S}$ is denoted $\mathrm{Conv}(\mathcal{S})$.

2 Combinatorial learning with Bregman projections

In this section we introduce a general forecaster that we call CLEB (Combinatorial LEarning with Bregman projections). Every forecaster investigated in this paper is a special case of CLEB. Let $\mathcal{D}$ be a convex subset of $\mathbb{R}^d$ with nonempty interior $\mathrm{Int}(\mathcal{D})$ and boundary $\partial\mathcal{D}$.
Definition 1 We call Legendre any function $F : \mathcal{D} \to \mathbb{R}$ such that
(i) $F$ is strictly convex and admits continuous first partial derivatives on $\mathrm{Int}(\mathcal{D})$;
(ii) for any $u \in \partial\mathcal{D}$ and any $v \in \mathrm{Int}(\mathcal{D})$, we have
$$\lim_{s\to 0,\, s>0} (u-v)^T \nabla F\big((1-s)u + sv\big) = +\infty.$$

The Bregman divergence $D_F : \mathcal{D}\times\mathrm{Int}(\mathcal{D}) \to \mathbb{R}$ associated to a Legendre function $F$ is defined by $D_F(u,v) = F(u) - F(v) - (u-v)^T\nabla F(v)$.

We consider the algorithm CLEB described in Figure 2. The basic idea is to use a potential-based gradient descent (1) followed by a projection (2), with respect to the Bregman divergence of the potential, onto the convex hull of $\mathcal{S}$, to ensure that the resulting weight vector $w_{t+1}$ can be viewed as $w_{t+1} = \mathbb{E}_{V\sim p_{t+1}} V$ for some distribution $p_{t+1}$ on $\mathcal{S}$. The combination of Bregman projections with potential-based gradient descent was first used in Herbster and Warmuth (1998). Online learning with Bregman divergences without the projection step has a long history (see Section 11.11 of Cesa-Bianchi and Lugosi (2006)). As discussed below, CLEB may be viewed as a generalization of the forecasters LINEXP and INF.

CLEB:
Parameters:
• a Legendre function $F$ defined on $\mathcal{D}$ with $\mathrm{Conv}(\mathcal{S}) \cap \mathrm{Int}(\mathcal{D}) \ne \emptyset$;
• $w_1 \in \mathrm{Conv}(\mathcal{S}) \cap \mathrm{Int}(\mathcal{D})$.
For each round $t = 1, 2, \dots, n$:
(a) Let $p_t$ be a distribution on the set $\mathcal{S}$ such that $w_t = \mathbb{E}_{V\sim p_t} V$.
(b) Draw a random action $V_t$ according to the distribution $p_t$ and observe
  – the loss vector $\ell_t$ in the full information game,
  – the coordinates $\ell_{i,t}\mathbb{1}_{V_{i,t}=1}$ in the semi-bandit game,
  – the instantaneous loss $\ell_t^T V_t$ in the bandit game.
(c) Estimate the loss $\ell_t$ by $\tilde\ell_t$. For instance, one may take
  – $\tilde\ell_t = \ell_t$ in the full information game,
  – $\tilde\ell_{i,t} = \frac{\ell_{i,t}}{\sum_{v\in\mathcal{S}: v_i=1} p_t(v)}\, V_{i,t}$ in the semi-bandit game,
  – $\tilde\ell_t = P_t^+ V_t V_t^T \ell_t$, with $P_t = \mathbb{E}_{v\sim p_t}(v v^T)$, in the bandit game.
(d) Let $w'_{t+1} \in \mathrm{Int}(\mathcal{D})$ satisfy
$$\nabla F(w'_{t+1}) = \nabla F(w_t) - \tilde\ell_t. \quad (1)$$
(e) Project the weight vector $w'_{t+1}$ defined by (1) on the convex hull of $\mathcal{S}$:
$$w_{t+1} \in \operatorname*{argmin}_{w\in\mathrm{Conv}(\mathcal{S})\cap\mathrm{Int}(\mathcal{D})} D_F(w, w'_{t+1}). \quad (2)$$

Figure 2: Combinatorial learning with Bregman projections (CLEB).

The Legendre conjugate $F^*$ of $F$ is defined by $F^*(u) = \sup_{v\in\mathcal{D}}\big(u^T v - F(v)\big)$. The following theorem establishes the first step of all upper bounds for the regret of CLEB.

Theorem 2 CLEB satisfies, for any $u \in \mathrm{Conv}(\mathcal{S}) \cap \mathcal{D}$,
$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \le D_F(u, w_1) + \sum_{t=1}^n D_{F^*}\big(\nabla F(w_t) - \tilde\ell_t,\ \nabla F(w_t)\big). \quad (3)$$

Proof By applying the definition of the Bregman divergences (or, equivalently, using Lemma 11.1 of Cesa-Bianchi and Lugosi (2006)), we obtain
$$\tilde\ell_t^T w_t - \tilde\ell_t^T u = (u-w_t)^T\big(\nabla F(w'_{t+1}) - \nabla F(w_t)\big) = D_F(u, w_t) + D_F(w_t, w'_{t+1}) - D_F(u, w'_{t+1}).$$
By the Pythagorean theorem (Lemma 11.3 of Cesa-Bianchi and Lugosi (2006)), we have $D_F(u, w'_{t+1}) \ge D_F(u, w_{t+1}) + D_F(w_{t+1}, w'_{t+1})$, hence
$$\tilde\ell_t^T w_t - \tilde\ell_t^T u \le D_F(u, w_t) + D_F(w_t, w'_{t+1}) - D_F(u, w_{t+1}) - D_F(w_{t+1}, w'_{t+1}).$$
Summing over $t$ then gives
$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \le D_F(u, w_1) - D_F(u, w_{n+1}) + \sum_{t=1}^n\big(D_F(w_t, w'_{t+1}) - D_F(w_{t+1}, w'_{t+1})\big). \quad (4)$$
By the nonnegativity of the Bregman divergences, we get
$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \le D_F(u, w_1) + \sum_{t=1}^n D_F(w_t, w'_{t+1}).$$
From Proposition 11.1 of Cesa-Bianchi and Lugosi (2006), we have $D_F(w_t, w'_{t+1}) = D_{F^*}\big(\nabla F(w_t) - \tilde\ell_t, \nabla F(w_t)\big)$, which concludes the proof.
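To make steps (1)-(2) of Figure 2 concrete, here is a minimal numerical sketch of one CLEB update with the (unnormalized) negative-entropy potential $F(w)=\sum_i w_i\log w_i$, for which the gradient step reduces to $w'_i = w_i e^{-\tilde\ell_i}$. The Bregman projection onto $\mathrm{Conv}(\mathcal{S})$ is done here by generic numerical optimization over convex-combination weights; this solver choice and the action set are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical action set S (rows are vertices of Conv(S)).
S = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])

def bregman_div(u, v):
    # D_F(u, v) for F(w) = sum_i w_i log w_i (unnormalized negative entropy).
    u = np.clip(u, 1e-12, None)
    return np.sum(u * np.log(u / v) - u + v)

def cleb_update(w_t, loss_est):
    # Step (1): potential-based gradient step, grad F(w')_i = grad F(w)_i - loss_est_i.
    w_prime = w_t * np.exp(-loss_est)
    # Step (2): Bregman projection onto Conv(S), parametrized as w = lam @ S, lam in the simplex.
    obj = lambda lam: bregman_div(lam @ S, w_prime)
    cons = [{"type": "eq", "fun": lambda lam: np.sum(lam) - 1.0}]
    res = minimize(obj, x0=np.full(len(S), 1.0 / len(S)),
                   method="SLSQP", bounds=[(0.0, 1.0)] * len(S), constraints=cons)
    return res.x @ S, res.x   # new weight vector w_{t+1} and a mixing distribution p_{t+1}

w1 = S.mean(axis=0)                       # a point of Conv(S) to start from
w2, p2 = cleb_update(w1, np.array([0.5, 0.1, 0.9, 0.2]))
print(w2, p2)
```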
As we will see below, by the equality $\mathbb{E}\sum_{t=1}^n \tilde\ell_t^T V_t = \mathbb{E}\sum_{t=1}^n \tilde\ell_t^T w_t$, and provided that $\tilde\ell_t^T V_t$ and $\tilde\ell_t^T u$ are unbiased estimates of $\mathbb{E}\,\ell_t^T V_t$ and $\mathbb{E}\,\ell_t^T u$, Theorem 2 leads to an upper bound on the regret $R_n$ of CLEB, which allows us to obtain the bounds of Table 2 by using appropriate choices of $F$. Moreover, if $F$ admits a Hessian, denoted $\nabla^2 F$, that is always invertible, then one can prove that, up to a third-order term (in $\tilde\ell_t$), the regret bound can be written as
$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \lesssim D_F(u, w_1) + \sum_{t=1}^n \tilde\ell_t^T\big(\nabla^2 F(w_t)\big)^{-1}\tilde\ell_t. \quad (5)$$

In this paper we restrict our attention to the combinatorial learning setting in which $\mathcal{S}$ is a subset of $\{0,1\}^d$. However, one should note that this specific form of $\mathcal{S}$ plays no role in the definition of CLEB, meaning that the algorithm of Figure 2 can be used to handle general online linear optimization problems, where $\mathcal{S}$ is any subset of $\mathbb{R}^d$.

3 Different instances of CLEB

In this section we describe several instances of CLEB and relate them to existing algorithms. Figure 3 summarizes the relationships between the various algorithms introduced below.

Figure 3: The figure sketches the relationship of the algorithms studied in this paper, with arrows representing "is a special case of". Dotted arrows indicate that the link is obtained by "expanding" $\mathcal{S}$, that is, seeing $\mathcal{S}$ as the set of basis vectors in $\mathbb{R}^{|\mathcal{S}|}$ rather than as a (structured) subset of $\{0,1\}^d$ (see Section 3.1). The six algorithms at the bottom use a Legendre function with a diagonal Hessian. On the contrary, the FTRL algorithm (see Section 3.3) may use Legendre functions more adapted to the geometry of the convex hull of $\mathcal{S}$. POLYINF is the algorithm considered in Theorem 22.

3.1 EXP2 (Expanded Exponentially weighted average forecaster)

The simplest approach to combinatorial prediction games is to consider each vertex of $\mathcal{S}$ as an independent expert, and then apply a strategy designed for the expert problem. We call EXP2 the resulting strategy when one uses the traditional exponentially weighted average forecaster (also called Hedge, Freund and Schapire (1997)); see Figure 4. In the full information game, EXP2 corresponds to Expanded Hedge defined in Koolen et al. (2010), where it was studied under the $L_\infty$ assumption. It was also studied in the full information game under the $L_2$ assumption in Dani et al. (2008). In the semi-bandit game, EXP2 was studied in György et al. (2007) under the $L_\infty$ assumption. Finally, in the bandit game, EXP2 corresponds to the strategy proposed by Dani et al. (2008) and also to the ComBand strategy, studied under the $L_\infty$ assumption in Cesa-Bianchi and Lugosi (2009) and under the $L_2$ assumption in Cesa-Bianchi and Lugosi (2010). (These last strategies differ in how the losses are estimated.)

EXP2 is a CLEB strategy in dimension $|\mathcal{S}|$ that uses $\mathcal{D} = [0,+\infty)^{|\mathcal{S}|}$ and the function $F : u \mapsto \frac{1}{\eta}\sum_{i=1}^{|\mathcal{S}|} u_i\log(u_i)$, for some $\eta > 0$ (this can be proved by using the fact that the Kullback-Leibler projection on the simplex is equivalent to an $L_1$-normalization).
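The following is a minimal sketch (not the paper's code) of the EXP2 idea just described: exponential weights maintained over the vertices of $\mathcal{S}$, shown here with full-information feedback ($\tilde\ell_t = \ell_t$); the action set, learning rate and adversary are hypothetical. The precise forecaster is restated in Figure 4 below.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical action set: each row of S is one "expert" (a vertex of {0,1}^d).
S = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
eta = 0.1
weights = np.full(len(S), 1.0 / len(S))   # w_1: uniform over the |S| experts
total_loss = 0.0

for t in range(100):
    # Play V_t drawn from p_t (= the current weights) and incur the loss l_t^T V_t.
    idx = rng.choice(len(S), p=weights)
    loss_t = rng.uniform(0.0, 1.0, size=S.shape[1])   # adversary's loss vector, ||l_t||_inf <= 1
    total_loss += loss_t @ S[idx]
    # Full-information feedback: the loss estimate is the loss vector itself.
    est = loss_t
    # Exponentially weighted update over the vertices (step (d) of Figure 4 below).
    weights = weights * np.exp(-eta * (S @ est))
    weights /= weights.sum()

print(total_loss, weights)
```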
The following theorem gives the regret bound that one can obtain for EXP2 (for instance with Theorem 5 applied to the case where $\mathcal{S}$ is replaced by $\mathcal{S}' = \{u\in\{0,1\}^{|\mathcal{S}|} : \sum_{v\in\mathcal{S}} u_v = 1\}$).

EXP2:
Parameter: learning rate $\eta$.
Let $w_1 = \big(\tfrac{1}{|\mathcal{S}|},\dots,\tfrac{1}{|\mathcal{S}|}\big) \in \mathbb{R}^{|\mathcal{S}|}$.
For each round $t = 1, 2, \dots, n$:
(a) Let $p_t$ be the distribution on $\mathcal{S}$ such that $p_t(v) = w_{v,t}$ for any $v\in\mathcal{S}$.
(b) Play $V_t \sim p_t$ and observe
  – the loss vector $\ell_t$ in the full information game,
  – the coordinates $\ell_{i,t}\mathbb{1}_{V_{i,t}=1}$ in the semi-bandit game,
  – the instantaneous loss $\ell_t^T V_t$ in the bandit game.
(c) Estimate the loss vector $\ell_t$ by $\tilde\ell_t$. For instance, one may take
  – $\tilde\ell_t = \ell_t$ in the full information game,
  – $\tilde\ell_{i,t} = \frac{\ell_{i,t}}{\sum_{v\in\mathcal{S}: v_i=1} p_{v,t}}\, V_{i,t}$ in the semi-bandit game,
  – $\tilde\ell_t = P_t^+ V_t V_t^T \ell_t$, with $P_t = \mathbb{E}_{v\sim p_t}(v v^T)$, in the bandit game.
(d) Update the weights: for all $v\in\mathcal{S}$,
$$w_{v,t+1} = \frac{\exp(-\eta\,\tilde\ell_t^T v)\, w_{v,t}}{\sum_{u\in\mathcal{S}}\exp(-\eta\,\tilde\ell_t^T u)\, w_{u,t}}.$$

Figure 4: EXP2 forecaster.

Theorem 3 For the EXP2 forecaster, provided that $\mathbb{E}\,\tilde\ell_t = \ell_t$, we have
$$R_n \le \frac{\log(|\mathcal{S}|)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n\sum_{v\in\mathcal{S}}\mathbb{E}\Big[p_t(v)\,(\tilde\ell_t^T v)^2\max\big(1, \exp(-\eta\,\tilde\ell_t^T v)\big)\Big].$$

3.2 LINEXP (Linear Exponentially weighted average forecaster)

We call LINEXP the CLEB strategy that uses $\mathcal{D} = [0,+\infty)^d$ and the function $F : u \mapsto \frac{1}{\eta}\sum_{i=1}^d u_i\log(u_i)$ associated to the Kullback-Leibler divergence, for some $\eta > 0$. In the full information game, LINEXP corresponds to Component Hedge defined in Koolen et al. (2010), where it was studied under the $L_\infty$ assumption. In the semi-bandit game, LINEXP was studied in Uchiya et al. (2010) and Kale et al. (2010) under the $L_\infty$ assumption, and for the particular set $\mathcal{S}$ with all vertices of $L_1$ norm equal to some value $k$.

3.3 FTRL (Follow the Regularized Leader)

If $\mathrm{Conv}(\mathcal{S}) \subset \mathcal{D}$ and $w_1 \in \operatorname*{argmin}_{w\in\mathcal{D}} F(w)$, steps (d) and (e) are equivalent to
$$w_{t+1} \in \operatorname*{argmin}_{w\in\mathrm{Conv}(\mathcal{S})}\Big(\sum_{s=1}^t \tilde\ell_s^T w + F(w)\Big),$$
showing that in this case CLEB can be interpreted as a regularized follow-the-leader algorithm. This type of algorithm was studied in Abernethy and Rakhlin (2009) in the full information and bandit settings (see also the lecture notes of Rakhlin and Tewari (2008)). A survey of FTRL strategies for the full information game can be found in Hazan (2010). In the bandit game, FTRL with $F$ a self-concordant barrier function and a different estimate than the one proposed in Figure 2 was studied in Abernethy et al. (2008).

3.4 LININF (Linear Implicitly Normalized Forecaster)

Let $f : \mathbb{R}^d \to \mathbb{R}$. The function $f$ has a diagonal Hessian if and only if it can be written as $f(u) = \sum_{i=1}^d f_i(u_i)$ for some twice differentiable functions $f_i : \mathbb{R}\to\mathbb{R}$, $i=1,\dots,d$. The Hessian is called exchangeable when the functions $f''_1,\dots,f''_d$ are identical. In this case, up to adding an affine function of $u$ (note that this alters neither the Bregman divergence nor CLEB), we have $f(u) = \sum_{i=1}^d g(u_i)$ for some twice differentiable function $g$. In this section we consider this type of Legendre function. To underline the surprising link (detailed in Appendix A) with the Implicitly Normalized Forecaster proposed in Audibert and Bubeck (2010), we consider $g$ of the form $x \mapsto \int^x \psi^{-1}(s)\,ds$, and we refer to the algorithm presented hereafter as LININF.

Definition 4 Let $\omega \ge 0$.
A function $\psi : (-\infty, a) \to \mathbb{R}_+^*$, for some $a \in \mathbb{R}\cup\{+\infty\}$, is called an $\omega$-potential if and only if it is convex, continuously differentiable, and satisfies
$$\lim_{x\to-\infty}\psi(x) = \omega, \qquad \lim_{x\to a}\psi(x) = +\infty, \qquad \psi' > 0, \qquad \int_\omega^{\omega+1}|\psi^{-1}(s)|\,ds < +\infty.$$

Theorem 5 Let $\omega \ge 0$ and let $\psi$ be an $\omega$-potential function. The function $F$ defined on $\mathcal{D} = [\omega,+\infty)^d$ by $F(u) = \sum_{i=1}^d \int_\omega^{u_i}\psi^{-1}(s)\,ds$ is Legendre. The associated CLEB satisfies, for any $u \in \mathrm{Conv}(\mathcal{S})\cap\mathcal{D}$,
$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \le D_F(u, w_1) + \frac12\sum_{t=1}^n\sum_{i=1}^d \tilde\ell_{i,t}^2\max\Big(\psi'\big(\psi^{-1}(w_{i,t})\big),\ \psi'\big(\psi^{-1}(w_{i,t}) - \tilde\ell_{i,t}\big)\Big), \quad (6)$$
where, for any $(u,v)\in\mathcal{D}\times\mathrm{Int}(\mathcal{D})$,
$$D_F(u,v) = \sum_{i=1}^d\Big(\int_{v_i}^{u_i}\psi^{-1}(s)\,ds - (u_i - v_i)\psi^{-1}(v_i)\Big). \quad (7)$$
In particular, when the estimates $\tilde\ell_{i,t}$ are nonnegative, we have
$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \le D_F(u, w_1) + \sum_{t=1}^n\sum_{i=1}^d\frac{\tilde\ell_{i,t}^2}{2\,(\psi^{-1})'(w_{i,t})}. \quad (8)$$

Proof It is easy to check that $F$ is a Legendre function and that (7) holds. We also have $\nabla F^*(u) = (\nabla F)^{-1}(u) = \big(\psi(u_1),\dots,\psi(u_d)\big)$, hence
$$D_{F^*}(u,v) = \sum_{i=1}^d\Big(\int_{v_i}^{u_i}\psi(s)\,ds - (u_i - v_i)\psi(v_i)\Big).$$
From the Taylor-Lagrange expansion, we have $D_{F^*}(u,v) \le \sum_{i=1}^d \max_{s\in[u_i,v_i]}\frac12\psi'(s)(u_i-v_i)^2$. Since the function $\psi$ is convex, we have $\max_{s\in[u_i,v_i]}\psi'(s) \le \psi'\big(\max(u_i,v_i)\big)$, which gives the desired results.

Note that LINEXP is an instance of LININF with $\psi : x \mapsto \exp(\eta x)$. On the other hand, Audibert and Bubeck (2010) recommend the choice $\psi(x) = (-\eta x)^{-q}$ with $\eta > 0$ and $q > 1$, since it leads to the minimax optimal rate $\sqrt{nd}$ for the standard $d$-armed bandit game (while the best bound for Exp3 is of order $\sqrt{nd\log d}$). This corresponds to a function $F$ of the form $F(u) = \frac{-q}{(q-1)\eta}\sum_{i=1}^d u_i^{(q-1)/q}$. We refer to the corresponding CLEB as LINPOLY. In Appendix A we show that a simple application of Theorem 5 proves that LINPOLY with $q = 2$ satisfies $R_n \le 2\sqrt{2nd}$. This improves on the bound $R_n \le 8\sqrt{nd}$ obtained in Theorem 11 of Audibert and Bubeck (2010).

4 Full Information Game

This section details the upper bounds of the forecasters EXP2, LINEXP and LINPOLY under the $L_2$ and $L_\infty$ assumptions for the full information game. All results are gathered in Table 2. The proofs can be found in Appendix B. Up to numerical constants, the results concerning (EXP2, $L_2$ and $L_\infty$) and (LINEXP, $L_\infty$) appeared in, or can easily be derived from, Dani et al. (2008) and Koolen et al. (2010), respectively.

Theorem 6 (LINEXP, $L_\infty$) Under the $L_\infty$ assumption, for LINEXP with $\tilde\ell_t = \ell_t$, $\eta = \sqrt{2/n}$ and $w_1 = \operatorname*{argmin}_{w\in\mathrm{Conv}(\mathcal{S})} D_F\big(w, (1,\dots,1)^T\big)$, we have $R_n \le d\sqrt{2n}$.

Theorem 7 (LINEXP, $L_2$) Under the $L_2$ assumption, for LINEXP with $\tilde\ell_t = \ell_t$, $\eta = \sqrt{2d/n}$ and $w_1 = \operatorname*{argmin}_{w\in\mathrm{Conv}(\mathcal{S})} D_F\big(w, (1,\dots,1)^T\big)$, we have $R_n \le \sqrt{2nd}$.

Theorem 8 (LINPOLY, $L_\infty$) Under the $L_\infty$ assumption, for LINPOLY with $\tilde\ell_t = \ell_t$, $\eta = \sqrt{\frac{2}{q(q-1)n}}$ and $w_1 = \operatorname*{argmin}_{w\in\mathrm{Conv}(\mathcal{S})} D_F\big(w, (1,\dots,1)^T\big)$, we have $R_n \le d\sqrt{\frac{2qn}{q-1}}$.

Theorem 9 (LINPOLY, $L_2$) Under the $L_2$ assumption, for LINPOLY with $\tilde\ell_t = \ell_t$, $\eta = \sqrt{\frac{2d}{q(q-1)n}}$ and $w_1 = \operatorname*{argmin}_{w\in\mathrm{Conv}(\mathcal{S})} D_F\big(w, (1,\dots,1)^T\big)$, we have $R_n \le \sqrt{\frac{2q\,dn}{q-1}}$.
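As a quick numerical illustration of the tuned rates in Theorems 6-9 (the values of $d$, $n$ and $q$ below are hypothetical, chosen only to make the bounds concrete):

```python
import math

d, n, q = 20, 10_000, 2          # hypothetical problem size and LINPOLY exponent

# Theorems 6 / 7: LINEXP under the L_inf and L_2 assumptions.
linexp_linf = d * math.sqrt(2 * n)              # with eta = sqrt(2/n)
linexp_l2   = math.sqrt(2 * n * d)              # with eta = sqrt(2d/n)

# Theorems 8 / 9: LINPOLY under the L_inf and L_2 assumptions.
linpoly_linf = d * math.sqrt(2 * q * n / (q - 1))   # with eta = sqrt(2/(q(q-1)n))
linpoly_l2   = math.sqrt(2 * q * d * n / (q - 1))   # with eta = sqrt(2d/(q(q-1)n))

print(f"LINEXP : L_inf <= {linexp_linf:.0f},  L_2 <= {linexp_l2:.0f}")
print(f"LINPOLY: L_inf <= {linpoly_linf:.0f},  L_2 <= {linpoly_l2:.0f}")
```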
Theorem 10 (EXP2, $L_\infty$) Under the $L_\infty$ assumption, for EXP2 with $\tilde\ell_t = \ell_t$, we have
$$R_n \le \frac{d\log 2}{\eta} + \frac{\eta\,n d^2}{2}.$$
In particular, for $\eta = \sqrt{\frac{2\log 2}{nd}}$, we have $R_n \le \sqrt{2d^3 n\log 2}$.

From Theorem 19, the above upper bound is tight, and consequently there exists $\mathcal{S}$ for which the algorithm EXP2 is not minimax optimal in the full information game under the $L_\infty$ assumption.

Theorem 11 (EXP2, $L_2$) Under the $L_2$ assumption, for EXP2 with $\tilde\ell_t = \ell_t$, we have
$$R_n \le \frac{d\log 2}{\eta} + \frac{\eta\,n}{2}.$$
In particular, for $\eta = \sqrt{\frac{2d\log 2}{n}}$, we have $R_n \le \sqrt{2dn\log 2}$.

5 Semi-Bandit Game

This section details the upper bounds of the forecasters EXP2, LINEXP and LINPOLY under the $L_2$ and $L_\infty$ assumptions for the semi-bandit game. These bounds are gathered in Table 2. The proofs can be found in Appendix C. Up to the numerical constant, the result concerning (EXP2, $L_\infty$) appeared in György et al. (2007) in the context of the online shortest path problem. Uchiya et al. (2010) and Kale et al. (2010) studied the semi-bandit problem under the $L_\infty$ assumption for action sets of the form $\mathcal{S} = \{v\in\{0,1\}^d : \sum_{i=1}^d v_i = k\}$ for some value $k$. Their common algorithm corresponds to LINEXP and the bounds are of order $\sqrt{knd\log(d/k)}$. Our upper bounds for the regret of LINEXP extend these results to more general sets of arms and to the $L_2$ assumption.

Theorem 12 (LINEXP, $L_\infty$) Under the $L_\infty$ assumption, for LINEXP with $\tilde\ell_{i,t} = \frac{\ell_{i,t} V_{i,t}}{w_{i,t}}$, $\eta = \sqrt{2/n}$ and $w_1 = \operatorname*{argmin}_{w\in\mathrm{Conv}(\mathcal{S})} D_F\big(w, (1,\dots,1)^T\big)$, we have $R_n \le d\sqrt{2n}$.

Since the $L_2$ assumption implies the $L_\infty$ assumption, we also have $R_n \le d\sqrt{2n}$ under the $L_2$ assumption. Let us now detail how LINEXP behaves for almost symmetric action sets, as defined below.

Definition 13 The set $\mathcal{S}\subset\{0,1\}^d$ is called almost symmetric if, for some $k\in\{1,\dots,d\}$, $\mathcal{S}\subset\{v\in\{0,1\}^d : \sum_{i=1}^d v_i \le k\}$ and $\mathrm{Conv}(\mathcal{S})\cap\big[\frac{k}{2d}, 1\big]^d \ne \emptyset$. The integer $k$ is called the order of the symmetry.

The set $\mathcal{S} = \{v\in\{0,1\}^d : \sum_{i=1}^d v_i = k\}$ considered in Uchiya et al. (2010) and Kale et al. (2010) is a particular almost symmetric set.

Theorem 14 (LINEXP, almost symmetric $\mathcal{S}$) Let $\mathcal{S}$ be an almost symmetric set of order $k\in\{1,\dots,d\}$. Consider LINEXP with $\tilde\ell_{i,t} = \frac{\ell_{i,t} V_{i,t}}{w_{i,t}}$ and $w_1 = \operatorname*{argmin}_{w\in\mathrm{Conv}(\mathcal{S})} D_F\big(w, (\frac kd,\dots,\frac kd)^T\big)$. Let $L = \max\big(\log\frac dk, 1\big)$.
• Under the $L_\infty$ assumption, taking $\eta = \sqrt{\frac{2kL}{nd}}$, we have $R_n \le \sqrt{2knd L}$.
• Under the $L_2$ assumption, taking $\eta = k\sqrt{\frac{L}{nd}}$, we have $R_n \le 2\sqrt{nd L}$.

In particular, it means that under the $L_2$ assumption there is a gain in the regret bound of a factor $\sqrt{d/L}$ when the set of actions is an almost symmetric set of order $k$.

Theorem 15 (LINPOLY, $L_\infty$) Under the $L_\infty$ assumption, for LINPOLY with $\tilde\ell_{i,t} = \frac{\ell_{i,t} V_{i,t}}{w_{i,t}}$, $\eta = \sqrt{\frac{2}{q(q-1)n}}$ and $w_1 = \operatorname*{argmin}_{w\in\mathrm{Conv}(\mathcal{S})} D_F\big(w, (1,\dots,1)^T\big)$, we have $R_n \le d\sqrt{\frac{2qn}{q-1}}$.

Theorem 16 (LINPOLY, $L_2$) Under the $L_2$ assumption, for LINPOLY with $\tilde\ell_{i,t} = \frac{\ell_{i,t} V_{i,t}}{w_{i,t}}$, $\eta = \sqrt{\frac{2 d^{1/q}}{q(q-1)n}}$ and $w_1 = \operatorname*{argmin}_{w\in\mathrm{Conv}(\mathcal{S})} D_F\big(w, (1,\dots,1)^T\big)$, we have
$$R_n \le \sqrt{\frac{2q\,nd}{q-1}\, d^{1-\frac1q}}.$$
In particular, for $q = 1 + (\log d)^{-1}$, we have $R_n \le \sqrt{2nd\,e\log(ed)}$.

Theorem 17 (EXP2, $L_\infty$) Under the $L_\infty$ assumption, for the EXP2 forecaster described in Figure 4 using $\tilde\ell_{i,t} = \frac{\ell_{i,t} V_{i,t}}{w_{i,t}}$, we have
$$R_n \le \frac{d\log 2}{\eta} + \frac{\eta\,n d^2}{2}.$$
In particular, for $\eta = \sqrt{\frac{2\log 2}{nd}}$, we have $R_n \le \sqrt{2d^3 n\log 2}$.
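The semi-bandit estimate $\tilde\ell_{i,t} = \ell_{i,t} V_{i,t}/w_{i,t}$ used throughout this section (with $w_{i,t} = \mathbb{E}_{V\sim p_t} V_{i,t}$ the probability that coordinate $i$ is played, hence observed) is unbiased. A minimal numerical check, with a hypothetical action set, distribution and loss vector (not from the paper):

```python
import numpy as np

# Hypothetical action set and a distribution p_t on it.
S = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
p_t = np.array([0.5, 0.3, 0.2])
loss_t = np.array([0.7, 0.2, 0.9, 0.4])

# w_{i,t} = sum of p_t(v) over actions v with v_i = 1 (probability coordinate i is observed).
w_t = p_t @ S

# Exact expectation of the estimate over V_t ~ p_t:
#   E[l_{i,t} V_{i,t} / w_{i,t}] = l_{i,t} * w_{i,t} / w_{i,t} = l_{i,t}.
expected_estimate = sum(p * (loss_t * v / w_t) for p, v in zip(p_t, S))
print(np.allclose(expected_estimate, loss_t))   # True: the estimate is unbiased
```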
The corresponding lower bound is given in Theorem 19.

Theorem 18 (EXP2, $L_2$) Under the $L_2$ assumption, for EXP2 with $\tilde\ell_{i,t} = \frac{\ell_{i,t} V_{i,t}}{w_{i,t}}$, we have
$$R_n \le \frac{d\log 2}{\eta} + \frac{\eta\,nd}{2}.$$
In particular, for $\eta = \sqrt{\frac{2\log 2}{n}}$, we have $R_n \le d\sqrt{2n\log 2}$.

Note that, as for LINEXP, we end up upper bounding $\sum_{i=1}^d \ell_{i,t}$ by $d$. In the case of an almost symmetric set $\mathcal{S}$ of order $k$, this sum can be bounded by $2d/k$, while $\log(|\mathcal{S}|)$ is upper bounded by $k\log(d+1)$. So, as for LINEXP, this leads to a regret bound of order $\sqrt{nd\log d}$ when the set of actions is an almost symmetric set.

6 Bandit Game

The upper bounds for EXP2 in the bandit case proposed in Table 2 are extracted from Dani et al. (2008). The approach proposed by the authors is to use EXP2 in the space described by a barycentric spanner. More precisely, let $m = \dim(\mathrm{Span}(\mathcal{S}))$ and let $e_1,\dots,e_m$ be a barycentric spanner of $\mathcal{S}$; for instance, take $(e_1,\dots,e_m) \in \operatorname*{argmax}_{(x_1,\dots,x_m)\in\mathcal{S}^m}|\det_{\mathrm{Span}(\mathcal{S})}(x_1,\dots,x_m)|$ (see Awerbuch and Kleinberg, 2004). We introduce the transformations $T_1 : \mathbb{R}^d \to \mathbb{R}^m$ such that, for $x\in\mathbb{R}^d$, $T_1(x) = (x^T e_1,\dots,x^T e_m)^T$, and $T_2 : \mathcal{S} \to [-1,1]^m$ such that, for $v\in\mathcal{S}$, $v = \sum_{i=1}^m (T_2(v))_i e_i$. Note that for any $v\in\mathcal{S}$ we have $\ell_t^T v = T_1(\ell_t)^T T_2(v)$. Then the loss estimate for $v\in\mathcal{S}$ is
$$\tilde\ell_t^T v = \big(Q_t^+ T_2(V_t) T_2(V_t)^T T_1(\ell_t)\big)^T T_2(v), \qquad \text{where } Q_t = \mathbb{E}_{V\sim p_t} T_2(V) T_2(V)^T.$$
Moreover, the authors also add a forced exploration which is uniform over the barycentric spanner.

A concurrent approach is the one proposed in Cesa-Bianchi and Lugosi (2009, 2010). There the authors study EXP2 directly in the original space, with the estimate described in Figure 4, and with an additional forced exploration which is uniform over $\mathcal{S}$. They work out several examples of sets $\mathcal{S}$ for which they improve the regret bound by a factor $\sqrt{d}$ with respect to Dani et al. (2008). Unfortunately, there exist sets $\mathcal{S}$ for which this approach fails to provide a bound polynomial in $d$. In general one needs to replace the uniform exploration over $\mathcal{S}$ by an exploration that is tailored to this set. How to do this in general is still an open question.

The upper bounds for LINEXP in the bandit case proposed in Table 2 are derived by using the trick of Dani et al. (2008), that is, by working with a barycentric spanner. The proof of this result is omitted, since it does not yield the optimal dependency in $n$. Moreover, we cannot analyze LINPOLY, since (1) is not well defined in this case because $\tilde\ell_t$ can be non-positive. In general we believe that the LININF approach is not sound for the bandit case, and that one needs to work with a Legendre function with non-diagonal Hessian. The only known CLEB with non-diagonal Hessian is the one proposed in Abernethy et al. (2008), where the authors use a self-concordant barrier function. In this case, they are able to propose a loss estimate related to the structure of the Hessian. This approach is powerful and, under the $L_2$ assumption, leads to a regret upper bound of order $d\sqrt{\theta n\log n}$ for $\theta > 0$ such that $\mathrm{Conv}(\mathcal{S})$ admits a $\theta$-self-concordant barrier function (see Abernethy et al., 2008, Section 5). When $\mathrm{Conv}(\mathcal{S})$ admits an $O(1)$-self-concordant barrier function, the upper bound matches the lower bound $O(d\sqrt{n})$. The open question is to determine for which sets $\mathcal{S}$ this occurs.
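A minimal sketch of the pseudo-inverse loss estimate $\tilde\ell_t = P_t^+ V_t V_t^T \ell_t$ used in the bandit versions of the forecasters above (step (c) of Figure 2). The action set, distribution and loss below are hypothetical; the check verifies that $\mathbb{E}[\tilde\ell_t^T v] = \ell_t^T v$ for every $v\in\mathcal{S}$, which is the property the analysis needs.

```python
import numpy as np

S = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
p_t = np.array([0.5, 0.3, 0.2])
loss_t = np.array([0.7, 0.2, 0.9, 0.4])

# P_t = E_{v ~ p_t}[v v^T] and its Moore-Penrose pseudo-inverse.
P_t = sum(p * np.outer(v, v) for p, v in zip(p_t, S))
P_plus = np.linalg.pinv(P_t)

def bandit_estimate(V_t):
    # Only the scalar loss l_t^T V_t is observed in the bandit game.
    observed = loss_t @ V_t
    return (P_plus @ V_t) * observed

# Exact expectation of the estimate over V_t ~ p_t ...
expected_est = sum(p * bandit_estimate(v) for p, v in zip(p_t, S))
# ... need not equal l_t itself, but it assigns the right loss to every action in S:
print([bool(np.isclose(expected_est @ v, loss_t @ v)) for v in S])   # all True
```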
7 Lower Bounds

We start this section with a result showing that EXP2 is suboptimal against $L_\infty$ adversaries. This answers a question of Koolen et al. (2010).

Theorem 19 Let $n \ge d$. There exists a subset $\mathcal{S}\subset\{0,1\}^d$ such that, in the full information game, for the EXP2 strategy (with any learning rate $\eta$), we have
$$\sup R_n \ge 0.02\, d^{3/2}\sqrt{n},$$
where the supremum is taken over all $L_\infty$ adversaries.

Proof For the sake of simplicity we assume here that $d$ is a multiple of 4 and that $n$ is even. We consider the following subset of the hypercube:
$$\mathcal{S} = \Big\{v\in\{0,1\}^d : \sum_{i=1}^{d/2} v_i = d/4 \ \text{ and }\ \big(v_i = 1\ \forall i\in\{d/2+1,\dots,d/2+d/4\}\big)\ \text{or}\ \big(v_i = 1\ \forall i\in\{d/2+d/4+1,\dots,d\}\big)\Big\}.$$
That is, choosing a point in $\mathcal{S}$ corresponds to choosing a subset of $d/4$ elements in the first half of the coordinates, and choosing one of the two disjoint intervals of size $d/4$ in the second half of the coordinates.

We will prove that for any parameter $\eta$ there exists an adversary such that EXP2 (with parameter $\eta$) has a regret of at least $\frac{nd}{16}\tanh\big(\frac{\eta d}{8}\big)$, and that there exists another adversary against which its regret is at least $\min\big(\frac{d\log 2}{12\eta}, \frac{nd}{12}\big)$. As a consequence, we have
$$\sup R_n \ge \max\Big(\frac{nd}{16}\tanh\big(\tfrac{\eta d}{8}\big),\ \min\big(\tfrac{d\log 2}{12\eta}, \tfrac{nd}{12}\big)\Big) \ge \min\Big(\max\Big(\frac{nd}{16}\tanh\big(\tfrac{\eta d}{8}\big), \frac{d\log 2}{12\eta}\Big),\ \frac{nd}{12}\Big) \ge \min\Big(A,\ \frac{nd}{12}\Big),$$
with $A = \min_{\eta\in[0,+\infty)}\max\big(\frac{nd}{16}\tanh\big(\frac{\eta d}{8}\big), \frac{d\log 2}{12\eta}\big)$. Distinguishing the cases $\eta d \ge 8$, where $\tanh(\eta d/8) \ge \tanh(1)$, and $\eta d < 8$, where $\tanh(\eta d/8)$ is lower bounded linearly in $\eta d$, one obtains
$$A \ge \min\Bigg(\frac{nd}{16}\tanh(1),\ \sqrt{\frac{nd^3\log 2}{128\times 12\times\tanh(1)}}\Bigg) \ge \min\big(0.04\,nd,\ 0.02\,d^{3/2}\sqrt{n}\big).$$

Let us first prove the lower bound $\frac{nd}{16}\tanh\big(\frac{\eta d}{8}\big)$. We define the following adversary:
$$\ell_{i,t} = \begin{cases} 1 & \text{if } i\in\{d/2+1,\dots,d/2+d/4\} \text{ and } t \text{ odd},\\ 1 & \text{if } i\in\{d/2+d/4+1,\dots,d\} \text{ and } t \text{ even},\\ 0 & \text{otherwise}. \end{cases}$$
This adversary always puts a zero loss on the first half of the coordinates, and alternates between a loss of $d/4$ for choosing the first interval (in the second half of the coordinates) and the second interval. At the beginning of odd rounds, any vertex $v\in\mathcal{S}$ has the same cumulative loss and thus EXP2 picks its expert uniformly at random, which yields an expected cumulative loss equal to $nd/16$. On the other hand, at even rounds the probability distribution used to select the vertex $v\in\mathcal{S}$ is always the same. More precisely, the probability of selecting a vertex which contains the interval $\{d/2+d/4+1,\dots,d\}$ (i.e., the interval with a $d/4$ loss at this round) is exactly $\frac{1}{1+\exp(-\eta d/4)}$. This adds an expected cumulative loss equal to $\frac{nd}{8}\,\frac{1}{1+\exp(-\eta d/4)}$. Finally, note that the loss of any fixed vertex is $nd/8$. Thus we obtain
$$R_n = \frac{nd}{16} + \frac{nd}{8}\,\frac{1}{1+\exp(-\eta d/4)} - \frac{nd}{8} = \frac{nd}{16}\tanh\Big(\frac{\eta d}{8}\Big).$$

We now move to the dependency in $1/\eta$. Here we consider the adversary defined by:
$$\ell_{i,t} = \begin{cases} 1-\varepsilon & \text{if } i\le d/4,\\ 1 & \text{if } i\in\{d/4+1,\dots,d/2\},\\ 0 & \text{otherwise}. \end{cases}$$
Note that against this adversary the choice of the interval (in the second half of the components) does not matter. Moreover, by symmetry, the weight of any coordinate in $\{d/4+1,\dots,d/2\}$ is the same (at any round). Finally, remark that this weight is decreasing with $t$.
Thus we have the following identities, where $\ell$ denotes the (time-independent) loss vector of this adversary and, in the large sums, $i$ represents the number of components selected among the first $d/4$ coordinates:
$$R_n = \mathbb{E}\Big[\varepsilon\sum_{t=1}^n\sum_{i=d/4+1}^{d/2} V_{i,t}\Big] = \varepsilon\,\frac d4\sum_{t=1}^n\mathbb{E}\,V_{d/2,t} \ \ge\ \frac{n\varepsilon d}{4}\,\mathbb{P}(V_{d/2,n}=1) = \frac{n\varepsilon d}{4}\,\frac{\sum_{v\in\mathcal{S}: v_{d/2}=1}\exp(-\eta\,n\,\ell^T v)}{\sum_{v\in\mathcal{S}}\exp(-\eta\,n\,\ell^T v)}$$
$$= \frac{n\varepsilon d}{4}\,\frac{\sum_{i=0}^{d/4-1}\binom{d/4}{i}\binom{d/4-1}{d/4-i-1}\exp\big(-\eta(nd/4 - i n\varepsilon)\big)}{\sum_{i=0}^{d/4}\binom{d/4}{i}\binom{d/4}{d/4-i}\exp\big(-\eta(nd/4 - i n\varepsilon)\big)} = \frac{n\varepsilon d}{4}\,\frac{\sum_{i=0}^{d/4-1}\binom{d/4}{i}\binom{d/4-1}{d/4-i-1}\exp(\eta\,i n\varepsilon)}{\sum_{i=0}^{d/4}\binom{d/4}{i}\binom{d/4}{d/4-i}\exp(\eta\,i n\varepsilon)}$$
$$= \frac{n\varepsilon d}{4}\,\frac{\sum_{i=0}^{d/4-1}\big(1-\frac{4i}{d}\big)\binom{d/4}{i}\binom{d/4}{d/4-i}\exp(\eta\,i n\varepsilon)}{\sum_{i=0}^{d/4}\binom{d/4}{i}\binom{d/4}{d/4-i}\exp(\eta\,i n\varepsilon)},$$
where we used $\binom{d/4-1}{d/4-i-1} = \big(1-\frac{4i}{d}\big)\binom{d/4}{d/4-i}$ in the last equality. Thus, taking $\varepsilon = \min\big(\frac{\log 2}{\eta n}, 1\big)$ yields
$$R_n \ge \min\Big(\frac{d\log 2}{4\eta}, \frac{nd}{4}\Big)\,\frac{\sum_{i=0}^{d/4-1}\big(1-\frac{4i}{d}\big)\binom{d/4}{i}^2\min\big(2,\exp(\eta n)\big)^i}{\sum_{i=0}^{d/4}\binom{d/4}{i}^2\min\big(2,\exp(\eta n)\big)^i} \ge \min\Big(\frac{d\log 2}{12\eta}, \frac{nd}{12}\Big),$$
where the last inequality follows from Lemma 23 (see Appendix E). This concludes the proof of the lower bound.

The next two theorems give lower bounds under the three feedback assumptions and the two types of adversaries. The cases ($L_2$, Full Information) and ($L_2$, Bandit) already appeared in Dani et al. (2008), while the case ($L_\infty$, Full Information) was treated in Koolen et al. (2010) (with more precise lower bounds for subsets $\mathcal{S}$ of particular interest). Note that the lower bounds for the semi-bandit case trivially follow from the ones for the full information game. Thus our main contribution here is the lower bound for ($L_\infty$, Bandit), which is technically quite different from the other cases. We also give explicit constants in all cases.

Theorem 20 Let $n \ge d$. Against $L_\infty$ adversaries, in the cases of the full information and semi-bandit games we have
$$R_n \ge 0.008\, d\sqrt{n},$$
and in the bandit game
$$R_n \ge 0.01\, d^{3/2}\sqrt{n}.$$

Proof In this proof we consider the following subset of $\{0,1\}^d$:
$$\mathcal{S} = \{v\in\{0,1\}^d : \forall i\in\{1,\dots,\lfloor d/2\rfloor\},\ v_{2i-1} + v_{2i} = 1\}.$$
Under full information, playing in $\mathcal{S}$ corresponds to playing $\lfloor d/2\rfloor$ independent standard full information games with 2 experts. Thus we can apply [Theorem 30, Audibert and Bubeck (2010)] to obtain
$$R_n \ge \lfloor d/2\rfloor\times 0.03\sqrt{n\log 2} \ge 0.008\, d\sqrt{n}.$$

We now move to the bandit game, for which the proof is more challenging. For the sake of simplicity, we assume in the following that $d$ is even. Moreover, we restrict our attention to deterministic forecasters; the extension to general forecasters can be done by a routine application of Fubini's theorem.

First step: definitions. We denote by $I_{i,t}\in\{1,2\}$ the random variable such that $V_{2i,t} = 1$ if and only if $I_{i,t} = 2$. That is, $I_{i,t}$ is the expert chosen at time $t$ in the $i$th game. We also define the empirical distribution of plays $q_n^i = (q_{1,n}^i, q_{2,n}^i)$ in game $i$ as $q_{j,n}^i = \frac{1}{n}\sum_{t=1}^n \mathbb{1}_{I_{i,t}=j}$. Let $J_{i,n}$ be drawn according to $q_n^i$.

In this proof we consider a set of $2^{d/2}$ adversaries. For $\alpha = (\alpha_1,\dots,\alpha_{d/2})\in\{1,2\}^{d/2}$ we define the $\alpha$-adversary as follows: for any $t\in\{1,\dots,n\}$, the loss of expert $\alpha_i$ in game $i$ is drawn from a Bernoulli of parameter $1/2$, while the loss of the other expert in game $i$ is drawn from a Bernoulli of parameter $1/2+\varepsilon$. We write $\mathbb{E}_\alpha$ when we integrate with respect to the reward generation process of the $\alpha$-adversary.
We denote by $\mathbb{P}_{i,\alpha}$ the law of $J_{i,n}$ when the forecaster plays against the $\alpha$-adversary. Remark that we have $\mathbb{P}_{i,\alpha}(J_{i,n}=j) = \mathbb{E}_\alpha\,\frac1n\sum_{t=1}^n\mathbb{1}_{I_{i,t}=j}$, hence, against the $\alpha$-adversary, we have
$$R_n = \mathbb{E}_\alpha\sum_{t=1}^n\sum_{i=1}^{d/2}\varepsilon\,\mathbb{1}_{I_{i,t}\ne\alpha_i} = n\varepsilon\sum_{i=1}^{d/2}\big(1 - \mathbb{P}_{i,\alpha}(J_{i,n}=\alpha_i)\big),$$
which implies (since the maximum is larger than the mean)
$$\sup_{\alpha\in\{1,2\}^{d/2}} R_n \ge n\varepsilon\sum_{i=1}^{d/2}\Big(1 - \frac{1}{2^{d/2}}\sum_{\alpha\in\{1,2\}^{d/2}}\mathbb{P}_{i,\alpha}(J_{i,n}=\alpha_i)\Big). \quad (9)$$

Second step: information inequality. Let $\mathbb{P}_{-i,\alpha}$ be the law of $J_{i,n}$ against the adversary which plays like the $\alpha$-adversary except that, in the $i$th game, the losses of both coordinates are drawn from a Bernoulli of parameter $1/2+\varepsilon$ (we call it the $(-i,\alpha)$-adversary). Now we use Pinsker's inequality, which gives
$$\mathbb{P}_{i,\alpha}(J_{i,n}=\alpha_i) \le \mathbb{P}_{-i,\alpha}(J_{i,n}=\alpha_i) + \sqrt{\tfrac12\,\mathrm{KL}(\mathbb{P}_{-i,\alpha}, \mathbb{P}_{i,\alpha})},$$
and thus (thanks to the concavity of the square root)
$$\frac{1}{2^{d/2}}\sum_{\alpha\in\{1,2\}^{d/2}}\mathbb{P}_{i,\alpha}(J_{i,n}=\alpha_i) \le \frac12 + \sqrt{\frac{1}{2^{d/2+1}}\sum_{\alpha\in\{1,2\}^{d/2}}\mathrm{KL}(\mathbb{P}_{-i,\alpha}, \mathbb{P}_{i,\alpha})}. \quad (10)$$

Third step: computation of $\mathrm{KL}(\mathbb{P}_{-i,\alpha}, \mathbb{P}_{i,\alpha})$ with the chain rule for Kullback-Leibler divergence. Note that, since the forecaster is deterministic, the sequence of observed losses (up to time $n$) $W_n\in\{0,1,\dots,d\}^n$ uniquely determines the empirical distribution of plays $q_n^i$; in particular, the law of $J_{i,n}$ conditionally on $W_n$ is the same for any adversary. Thus, if we denote by $\mathbb{P}^n_\alpha$ (respectively $\mathbb{P}^n_{-i,\alpha}$) the law of $W_n$ when the forecaster plays against the $\alpha$-adversary (respectively the $(-i,\alpha)$-adversary), then one can easily prove that $\mathrm{KL}(\mathbb{P}_{-i,\alpha}, \mathbb{P}_{i,\alpha}) \le \mathrm{KL}(\mathbb{P}^n_{-i,\alpha}, \mathbb{P}^n_\alpha)$. Now we use the chain rule for Kullback-Leibler divergence iteratively to introduce the laws $\mathbb{P}^t_\alpha$ of the observed losses $W_t$ up to time $t$. More precisely, we have
$$\mathrm{KL}(\mathbb{P}^n_{-i,\alpha}, \mathbb{P}^n_\alpha) = \mathrm{KL}(\mathbb{P}^1_{-i,\alpha}, \mathbb{P}^1_\alpha) + \sum_{t=2}^n\sum_{w_{t-1}\in\{0,1,\dots,d\}^{t-1}}\mathbb{P}^{t-1}_{-i,\alpha}(w_{t-1})\,\mathrm{KL}\big(\mathbb{P}^t_{-i,\alpha}(\cdot\,|\,w_{t-1}),\ \mathbb{P}^t_\alpha(\cdot\,|\,w_{t-1})\big)$$
$$= \mathrm{KL}\big(\mathcal{B}_\emptyset, \mathcal{B}'_\emptyset\big)\,\mathbb{1}_{I_{i,1}=\alpha_i} + \sum_{t=2}^n\sum_{w_{t-1}:\, I_{i,t}=\alpha_i}\mathbb{P}^{t-1}_{-i,\alpha}(w_{t-1})\,\mathrm{KL}\big(\mathcal{B}_{w_{t-1}}, \mathcal{B}'_{w_{t-1}}\big),$$
where $\mathcal{B}_{w_{t-1}}$ and $\mathcal{B}'_{w_{t-1}}$ are sums of $d/2$ Bernoulli distributions with parameters in $\{1/2, 1/2+\varepsilon\}$ and such that the number of Bernoullis with parameter $1/2+\varepsilon$ in $\mathcal{B}_{w_{t-1}}$ is equal to the number of Bernoullis with parameter $1/2+\varepsilon$ in $\mathcal{B}'_{w_{t-1}}$ plus one. Now, using Lemma 24 (see Appendix E), we obtain
$$\mathrm{KL}(\mathbb{P}^n_{-i,\alpha}, \mathbb{P}^n_\alpha) \le \frac{16\varepsilon^2}{d}\,\mathbb{E}_{-i,\alpha}\sum_{t=1}^n\mathbb{1}_{I_{i,t}=\alpha_i}.$$
Summing and plugging this into (10), we obtain $\frac{1}{2^{d/2}}\sum_{\alpha\in\{1,2\}^{d/2}}\mathbb{P}_{i,\alpha}(J_{i,n}=\alpha_i) \le \frac12 + 2\varepsilon\sqrt{\frac nd}$. To conclude the proof, one plugs this last inequality into (9), along with straightforward computations.

Theorem 21 Let $n \ge d$. Against $L_2$ adversaries, in the cases of the full information and semi-bandit games we have
$$R_n \ge 0.05\sqrt{dn},$$
and in the bandit game
$$R_n \ge 0.05\min(n,\ d\sqrt{n}).$$

References

J. Abernethy and A. Rakhlin. Beating the adaptive bandit with high probability. In 22nd Annual Conference on Learning Theory, 2009.
J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Rocco A. Servedio and Tong Zhang, editors, COLT, pages 263-274. Omnipress, 2008.
J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. JMLR, 11:2635-2686, 2010.
B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In STOC '04: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 45-53. ACM, 2004.
N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. In 22nd Annual Conference on Learning Theory, 2009.
N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Submitted, 2010.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. ISBN 0521841089.
V. Dani, T. Hayes, and S. M. Kakade. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, volume 20, pages 345-352, 2008.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119-139, 1997.
A. György, T. Linder, G. Lugosi, and G. Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369-2403, 2007.
E. Hazan. A survey: The convex optimization approach to regret minimization. Working draft, 2010.
D. P. Helmbold and M. K. Warmuth. Learning permutations with exponential weights. JMLR, 10:1705-1736, 2009.
M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32:151-178, 1998.
A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291-307, 2005.
S. Kale, L. Reyzin, and R. Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, pages 1054-1062, 2010.
W. M. Koolen, M. K. Warmuth, and J. Kivinen. Hedging structured concepts. In 23rd Annual Conference on Learning Theory, 2010.
H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of the 17th Annual Conference on Learning Theory, pages 109-123, 2004.
A. Rakhlin and A. Tewari. Lecture notes on online learning. 2008.
T. Uchiya, A. Nakamura, and M. Kudo. Algorithms for adversarial bandit problems with multiple plays. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, 2010.

A Standard prediction games

It is well known that the standard prediction games described in Figure 5 are specific cases of the combinatorial prediction games described in Figure 1. Indeed, consider $\mathcal{S} = \{a_1,\dots,a_d\}$, where $a_i\in\{0,1\}^d$ is the vector whose only nonzero component is the $i$th one. The standard and combinatorial prediction games are then equivalent by using $V_t = a_{A_t}$ and noticing that $\ell_t^T a_i = \ell_{i,t}$. In particular, the semi-bandit and bandit combinatorial prediction games are then both equivalent to the traditional multi-armed bandit game.

Parameters: set of actions $\mathcal{A} = \{1,\dots,d\}$; number of rounds $n\in\mathbb{N}$.
For each round $t = 1, 2, \dots, n$:
(1) the forecaster chooses $A_t\in\mathcal{A}$, with the help of an external randomization;
(2) simultaneously the adversary selects a loss vector $\ell_t = (\ell_{1,t},\dots,\ell_{d,t})^T\in\mathbb{R}^d$ (without revealing it);
(3) the forecaster incurs the loss $\ell_{A_t,t}$. He observes
  – the loss vector $\ell_t$ in the full information game,
  – the coordinate $\ell_{A_t,t}$ in the bandit game.
Goal: the forecaster tries to minimize his cumulative loss $\sum_{t=1}^n \ell_{A_t,t}$.

Figure 5: Standard prediction games.

We now show that INF (defined in Figure 2 of Audibert and Bubeck (2010)) is a special case of LININF.

Proof Indeed, suppose that the estimates $\tilde\ell_1,\dots,\tilde\ell_n$ are nonnegative (coordinate-wise), and take $w_1 = (\frac1d,\dots,\frac1d)$. Then the vector $w'_{t+1}$ satisfying (1) exists and is defined coordinate-wise by $\psi^{-1}(w'_{i,t+1}) = \psi^{-1}(w_{i,t}) - \tilde\ell_{i,t}$. Besides, the optimality condition of (2) implies the existence of $c_t\in\mathbb{R}$ (independent of $i$) such that $\psi^{-1}(w_{i,t+1}) = \psi^{-1}(w'_{i,t+1}) + c_t$. It implies $\psi^{-1}(w_{i,t}) = \psi^{-1}(w_{i,1}) - \sum_{s=1}^{t-1}(\tilde\ell_{i,s} - c_s)$ for any $t\ge 1$. So there exists $C_t\in\mathbb{R}$ such that $w_{i,t} = \psi\big(\sum_{s=1}^{t-1}(1-\tilde\ell_{i,s}) - C_t\big)$. Since $w_t\in\mathrm{Conv}(\mathcal{S})$, the constant $C_t$ must satisfy $\sum_{i=1}^d w_{i,t} = 1$. We thus recover INF with the estimate $1-\tilde\ell_{i,s}$ of the reward $1-\ell_{i,s}$. So the Bregman projection has here a simple solution, depending on a unique constant $C_t$ obtained by solving the equality $\sum_{i=1}^d \psi\big(\sum_{s=1}^{t-1}(1-\tilde\ell_{i,s}) - C_t\big) = 1$.
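The proof above reduces the Bregman projection to finding the normalizing constant $C_t$ with $\sum_{i=1}^d\psi\big(\sum_{s<t}(1-\tilde\ell_{i,s}) - C_t\big) = 1$. Here is a minimal sketch (not the paper's code) of this normalization for the polynomial potential $\psi(x) = (-\eta x)^{-q}$ used in Theorem 22 below, solving for $C_t$ by bisection, in a standard $d$-armed bandit loop; the adversary, horizon and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, q = 5, 2000, 2.0
eta = np.sqrt(2.0) * d ** (1.0 / q - 0.5) / np.sqrt((q - 1.0) * n)   # tuning from Theorem 22

def psi(x):
    # Polynomial potential psi(x) = (-eta x)^(-q), defined for x < 0.
    return (-eta * x) ** (-q)

def weights(cum_gains):
    # Find C such that sum_i psi(G_i - C) = 1 by bisection (the sum is decreasing in C).
    lo = cum_gains.max() + 1e-12
    hi = cum_gains.max() + 2.0 * d ** (1.0 / q) / eta   # large enough: the sum is < 1 there
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if psi(cum_gains - mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return psi(cum_gains - 0.5 * (lo + hi))

G = np.zeros(d)                       # cumulative estimated gains sum_s (1 - est_loss)
for t in range(n):
    w = weights(G)
    arm = rng.choice(d, p=w / w.sum())
    loss = rng.uniform(0.0, 1.0, size=d)      # hypothetical adversary
    est = np.zeros(d)
    est[arm] = loss[arm] / w[arm]             # bandit estimate of the played arm's loss
    G += 1.0 - est
print(weights(G))
```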
Next we show how to obtain the minimax $\sqrt{nd}$ regret bound, with a much simpler proof than the one proposed in Audibert and Bubeck (2010), as well as a better constant.

Theorem 22 Let $q > 1$. For the INF forecaster (that is, for CLEB with $w_1 = (\frac1d,\dots,\frac1d)^T$ and $\mathcal{S} = \{a_1,\dots,a_d\}$) using $\psi(x) = (-\eta x)^{-q}$ and $\tilde\ell_{i,t} = \frac{\ell_{i,t} V_{i,t}}{w_{i,t}}$, we have
$$R_n \le \frac{q\,d^{1/q}}{\eta(q-1)} + \frac{q\,\eta\,n\,d^{1-\frac1q}}{2}.$$
In particular, for $\eta = \sqrt{2}\,d^{\frac1q-\frac12}\big[(q-1)n\big]^{-\frac12}$, we have $R_n \le q\sqrt{\frac{2nd}{q-1}}$.

In view of this last bound, the optimal $q$ is $q = 2$, which leads to $R_n \le 2\sqrt{2nd}$. This improves on the bound $R_n \le 8\sqrt{nd}$ obtained in Theorem 11 of Audibert and Bubeck (2010). The INF forecaster with the above polynomial $\psi$ is referred to as POLYINF in Figure 3.

Proof We apply (8). First we bound the divergence term. We have $\psi^{-1}(s) = -\frac1\eta s^{-1/q}$ and $D_F(u,v) = \frac1\eta\sum_{i=1}^d\big(\frac{1}{q-1}v_i^{1-\frac1q} - \frac{q}{q-1}u_i^{1-\frac1q} + u_i v_i^{-\frac1q}\big)$, hence
$$\max_{u\in\mathrm{Conv}(\mathcal{S})} D_F(u, w_1) = D_F\big((1,0,\dots,0)^T,\ (\tfrac1d,\dots,\tfrac1d)^T\big) = \frac{q}{\eta(q-1)}\big(d^{\frac1q} - 1\big).$$
Combining this with $(\psi^{-1})'(w_{i,t}) = \frac{1}{q\eta}w_{i,t}^{-1-\frac1q}$ and (8), we obtain
$$R_n \le \frac{q\,d^{\frac1q}}{\eta(q-1)} + \frac{q\eta}{2}\,\mathbb{E}\sum_{t=1}^n\sum_{i=1}^d w_{i,t}^{1+\frac1q}\tilde\ell_{i,t}^2 = \frac{q\,d^{\frac1q}}{\eta(q-1)} + \frac{q\eta}{2}\sum_{t=1}^n\sum_{i=1}^d\mathbb{E}\big[w_{i,t}^{\frac1q-1}V_{i,t}\ell_{i,t}^2\big] \le \frac{q\,d^{\frac1q}}{\eta(q-1)} + \frac{q\eta}{2}\sum_{t=1}^n\sum_{i=1}^d\mathbb{E}\big[w_{i,t}^{\frac1q}\big] \le \frac{q\,d^{\frac1q}}{\eta(q-1)} + \frac{q\eta}{2}\,n\,d^{1-\frac1q},$$
where in the last step we use that, by Hölder's inequality, $\sum_{i=1}^d\big(w_{i,t}^{\frac1q}\times 1\big) \le \big(\sum_{i=1}^d w_{i,t}\big)^{\frac1q}\times d^{1-\frac1q}$.

B Proofs of Theorems in Section 4

Proof of Theorem 6 We have $D_F(u,v) = \frac1\eta\sum_{i=1}^d\big(u_i\log\frac{u_i}{v_i} - u_i + v_i\big)$, hence, from the Pythagorean theorem, $D_F(u, w_1) \le D_F\big(u, (1,\dots,1)^T\big) \le \frac d\eta$. Since we have $\sum_{i=1}^d w_{i,t} \le d$, Theorem 5 implies
$$R_n \le \frac d\eta + \frac\eta2\,\mathbb{E}\sum_{t=1}^n\sum_{i=1}^d w_{i,t}\ell_{i,t}^2 \le \frac d\eta + \frac{nd\eta}{2}.$$

Proof of Theorem 7 As in the previous proof, we have $D_F(u, w_1) \le \frac d\eta$, but under the $L_2$ constraint we can improve the bound on $\sum_{i=1}^d w_{i,t}\ell_{i,t}^2$ by using $\sum_{i=1}^d w_{i,t}\ell_{i,t} \le 1$ (since $w_t\in\mathrm{Conv}(\mathcal{S})$). This gives $R_n \le \frac d\eta + \frac{n\eta}{2}$.

Proof of Theorem 8 We have $D_F(u,v) = \frac1\eta\sum_{i=1}^d\big(\frac{1}{q-1}v_i^{1-\frac1q} - \frac{q}{q-1}u_i^{1-\frac1q} + u_i v_i^{-\frac1q}\big)$, hence $D_F(u, w_1) \le \frac{d}{\eta(q-1)}$. Since we have $w_{i,t}^{1+\frac1q}\ell_{i,t}^2 \le 1$, Theorem 5 implies
$$R_n \le \frac{d}{\eta(q-1)} + \frac{\eta q}{2}\,\mathbb{E}\sum_{t=1}^n\sum_{i=1}^d w_{i,t}^{1+\frac1q}\ell_{i,t}^2 \le \frac{d}{\eta(q-1)} + \frac{ndq\eta}{2}.$$

Proof of Theorem 9 As in the previous proof, we have $D_F(u, w_1) \le \frac{d}{\eta(q-1)}$. Under the $L_2$ constraint, we can improve the bound on $\sum_{i=1}^d w_{i,t}^{1+\frac1q}\ell_{i,t}^2$ by using $\sum_{i=1}^d w_{i,t}\ell_{i,t} \le 1$ (since $w_t\in\mathrm{Conv}(\mathcal{S})$). This gives
$$R_n \le \frac{d}{\eta(q-1)} + \frac{\eta q}{2}\,\mathbb{E}\sum_{t=1}^n\sum_{i=1}^d w_{i,t}^{1+\frac1q}\ell_{i,t}^2 \le \frac{d}{\eta(q-1)} + \frac{nq\eta}{2}.$$
Proof of Theorem 10 Using $\log(|\mathcal{S}|) \le d\log 2$, $0 \le \tilde\ell_t^T v \le d$ and $\sum_{v\in\mathcal{S}} w_{v,t} = 1$ in Theorem 3, we get the result.

Proof of Theorem 11 Using $\log(|\mathcal{S}|) \le d\log 2$, $0 \le \tilde\ell_t^T v \le 1$ and $\sum_{v\in\mathcal{S}} w_{v,t} = 1$ in Theorem 3, we get the result.

C Proofs of Theorems in Section 5

Proof of Theorem 12 We have again $D_F(u, w_1) \le \frac d\eta$. Since we have $\sum_{i=1}^d \ell_{i,t} \le d$, Theorem 5 implies
$$R_n \le \frac d\eta + \frac\eta2\,\mathbb{E}\sum_{t=1}^n\sum_{i=1}^d\frac{\ell_{i,t}^2 V_{i,t}}{w_{i,t}} = \frac d\eta + \frac\eta2\sum_{t=1}^n\sum_{i=1}^d\mathbb{E}(\ell_{i,t}^2) \le \frac d\eta + \frac{nd\eta}{2}.$$

Proof of Theorem 14 The starting point is Theorem 5, which, by using $\mathbb{E}\big[\frac{\ell_{i,t}^2 V_{i,t}}{w_{i,t}}\big] = \mathbb{E}\,\ell_{i,t}^2 \le \mathbb{E}\,\ell_{i,t}$, implies
$$R_n \le D_F(u, w_1) + \frac\eta2\,\mathbb{E}\sum_{t=1}^n\sum_{i=1}^d\frac{\ell_{i,t}^2 V_{i,t}}{w_{i,t}} \le D_F(u, w_1) + \frac\eta2\,\mathbb{E}\sum_{t=1}^n\sum_{i=1}^d\ell_{i,t}. \quad (11)$$
For any $u\in[0,1]^d$ such that $\sum_{i=1}^d u_i \le k$, we have
$$D_F(u, w_1) \le D_F\Big(u, \big(\tfrac kd,\dots,\tfrac kd\big)^T\Big) \le \frac1\eta\Big(k + \sum_{i=1}^d u_i\log\big(\tfrac{d u_i}{k e}\big)\Big) \le \frac{kL}{\eta}, \quad (12)$$
where the last inequality can be obtained by writing the optimality conditions. More precisely, two cases are considered depending on whether $\sum_{i=1}^d u_i = k$ holds at the optimum: when it is the case, the maximum is achieved for $u$ of the form $u = (1,\dots,1,0,\dots,0)^T$; otherwise, $u = (0,\dots,0)^T$ achieves the maximum. The desired results are then obtained by combining (11), (12) and an upper bound on $\sum_{i=1}^d\ell_{i,t}$: indeed, under the $L_\infty$ assumption we have $\sum_{i=1}^d\ell_{i,t} \le d$. Under the $L_2$ assumption, since $\mathcal{S}$ is an almost symmetric set of order $k$, there exists $z\in\mathrm{Conv}(\mathcal{S})\cap\big[\frac{k}{2d},1\big]^d$, and consequently $\sum_{i=1}^d\ell_{i,t} \le \sum_{i=1}^d\big(\frac{2d}{k}z_i\big)\ell_{i,t} \le \frac{2d}{k}$.

Proof of Theorem 15 We have again $D_F(u, w_1) \le \frac{d}{\eta(q-1)}$. Since we have $w_{i,t}^{\frac1q}\ell_{i,t}^2 \le 1$, Theorem 5 implies
$$R_n \le \frac{d}{\eta(q-1)} + \frac{\eta q}{2}\sum_{t=1}^n\sum_{i=1}^d\mathbb{E}\big[w_{i,t}^{\frac1q}\ell_{i,t}^2\big] \le \frac{d}{\eta(q-1)} + \frac{ndq\eta}{2}.$$

Proof of Theorem 16 We have again $D_F(u, w_1) \le \frac{d}{\eta(q-1)}$. From $\mathbb{E}\big[w_{i,t}^{1+\frac1q}\tilde\ell_{i,t}^2\big] = \mathbb{E}\big[w_{i,t}^{\frac1q}\ell_{i,t}^2\big] \le \mathbb{E}\big[(w_{i,t}\ell_{i,t})^{\frac1q}\big]$ and Theorem 5, we get
$$R_n \le \frac{d}{\eta(q-1)} + \frac{\eta q}{2}\sum_{t=1}^n\sum_{i=1}^d\mathbb{E}\big[(w_{i,t}\ell_{i,t})^{\frac1q}\big] \le \frac{d}{\eta(q-1)} + \frac{n\,d^{1-\frac1q}\,q\eta}{2},$$
where we use $\sum_{i=1}^d(w_{i,t}\ell_{i,t})^{\frac1q} \le \big(\sum_{i=1}^d w_{i,t}\ell_{i,t}\big)^{\frac1q}\times d^{1-\frac1q}$ in the last step.

Proof of Theorem 17 Let $q_{i,t} = \sum_{v\in\mathcal{S}: v_i=1} p_{v,t} = \mathbb{E}_{V_t\sim p_t} V_{i,t}$ for $i\in\{1,\dots,d\}$. We have
$$\mathbb{E}_{V_t\sim p_t}\sum_{v\in\mathcal{S}} p_{v,t}\,(\tilde\ell_t^T v)^2 = \mathbb{E}_{V_t\sim p_t, V'_t\sim p_t}(\tilde\ell_t^T V'_t)^2 = \mathbb{E}_{V_t\sim p_t, V'_t\sim p_t}\sum_{i,j}\frac{\ell_{i,t}V_{i,t}\,\ell_{j,t}V_{j,t}}{q_{i,t}\,q_{j,t}}V'_{i,t}V'_{j,t} \le \mathbb{E}_{V_t\sim p_t, V'_t\sim p_t}\sum_{i,j}\ell_{i,t}\ell_{j,t}\frac{V_{i,t}}{q_{i,t}}\frac{V'_{j,t}}{q_{j,t}} = \Big(\sum_{i=1}^d\ell_{i,t}\Big)^2 \le d^2.$$
Using $\log(|\mathcal{S}|) \le d\log 2$, the result then follows from Theorem 3.

Proof of Theorem 18 Let $q_{i,t} = \sum_{v\in\mathcal{S}: v_i=1} p_{v,t} = \mathbb{E}_{V_t\sim p_t} V_{i,t}$ for $i\in\{1,\dots,d\}$. We have
$$\mathbb{E}_{V_t\sim p_t}\sum_{v\in\mathcal{S}} p_{v,t}\,(\tilde\ell_t^T v)^2 = \mathbb{E}_{V_t, V'_t}\sum_{i,j}\frac{\ell_{i,t}V_{i,t}\,\ell_{j,t}V_{j,t}}{q_{i,t}\,q_{j,t}}V'_{i,t}V'_{j,t} \le \mathbb{E}_{V_t, V'_t}\sum_{i,j}\frac{\ell_{i,t}V_{i,t}}{q_{i,t}}\frac{V'_{j,t}}{q_{j,t}}\ell_{j,t}V_{j,t} = \mathbb{E}_{V_t}\sum_{i=1}^d\frac{\ell_{i,t}V_{i,t}}{q_{i,t}}\sum_{j=1}^d\ell_{j,t}V_{j,t} \le \mathbb{E}_{V_t}\sum_{i=1}^d\frac{\ell_{i,t}V_{i,t}}{q_{i,t}} = \sum_{i=1}^d\ell_{i,t} \le d.$$
Using $\log(|\mathcal{S}|) \le d\log 2$, the result then follows from Theorem 3.

D Proof of Theorem 21

We consider the bandit game first.
We use the notation and adversaries defined in the proof of Theorem 20. We modify these adversaries as follows: at each turn one selects $E_t\in\{1,\dots,d\}$ uniformly at random. Then, at time $t$, the losses of all coordinates but $E_t$ are set to $0$. This new adversary clearly satisfies the $L_2$ assumption. For this new set of adversaries, one has to make only two modifications in the proof of Theorem 20. First, (9) is replaced by
$$\sup_{\alpha\in\{1,2\}^{d/2}} R_n \ge \frac{n\varepsilon}{d}\sum_{i=1}^{d/2}\Big(1 - \frac{1}{2^{d/2}}\sum_{\alpha\in\{1,2\}^{d/2}}\mathbb{P}_{i,\alpha}(J_{i,n}=\alpha_i)\Big).$$
Second, $\mathcal{B}_{w_{t-1}}$ is now a Bernoulli with mean $\mu_t\in\big[\frac12+\frac\varepsilon d, \frac12+\frac\varepsilon2\big]$ and $\mathcal{B}'_{w_{t-1}}$ is a Bernoulli with mean $\mu_t - \frac\varepsilon d$, and thus we have
$$\mathrm{KL}\big(\mathcal{B}_{w_{t-1}}, \mathcal{B}'_{w_{t-1}}\big) \le \frac{4\varepsilon^2}{(1-\varepsilon^2)\,d^2}.$$
The proof is then concluded again with straightforward computations.

The proof for the full information game is exactly the same as the one for bandit information, except that the definition of $W_t$ is slightly different and implies that $\mathcal{B}_{w_{t-1}}$ is now a Bernoulli with mean $\frac1d\big(\frac12+\varepsilon\big)$ and $\mathcal{B}'_{w_{t-1}}$ is a Bernoulli with mean $\frac{1}{2d}$, which gives
$$\mathrm{KL}\big(\mathcal{B}_{w_{t-1}}, \mathcal{B}'_{w_{t-1}}\big) \le \frac{4\varepsilon^2}{2d-1}.$$

E Technical lemmas

We prove here two technical lemmas that were used in the proofs above.

Lemma 23 For any $k\in\mathbb{N}^*$ and any $1\le c\le 2$, we have
$$\frac{\sum_{i=0}^k(1-i/k)\binom ki^2 c^i}{\sum_{i=0}^k\binom ki^2 c^i} \ge \frac13.$$

Proof Let $f(c)$ denote the left-hand side of the inequality. Introduce the random variable $X$, equal to $i\in\{0,\dots,k\}$ with probability $\binom ki^2 c^i\big/\sum_{j=0}^k\binom kj^2 c^j$. We have
$$f'(c) = \frac1c\,\mathbb{E}\big[X(1-X/k)\big] - \frac1c\,\mathbb{E}(X)\,\mathbb{E}(1-X/k) = -\frac{1}{ck}\,\mathrm{Var}\,X \le 0.$$
So the function $f$ is decreasing on $[1,2]$ and, from now on, we consider $c = 2$. The numerator and the denominator of the left-hand side differ only by the $1-i/k$ factor. A lower bound for the left-hand side can thus be obtained by showing that the terms for $i$ close to $k$ are not essential to the value of the denominator. To prove this, we may use the Stirling formula: for any $n\ge 1$,
$$\Big(\frac ne\Big)^n\sqrt{2\pi n} < n! < \Big(\frac ne\Big)^n\sqrt{2\pi n}\,e^{1/(12n)}. \quad (13)$$
Indeed, this inequality implies that for any $k\ge 2$ and $i\in[1,k-1]$,
$$\Big(\frac ki\Big)^i\Big(\frac{k}{k-i}\Big)^{k-i}\frac{\sqrt k}{\sqrt{2\pi i(k-i)}}\,e^{-1/6} < \binom ki < \Big(\frac ki\Big)^i\Big(\frac{k}{k-i}\Big)^{k-i}\frac{\sqrt k}{\sqrt{2\pi i(k-i)}}\,e^{1/12},$$
hence
$$\Big(\frac ki\Big)^{2i}\Big(\frac{k}{k-i}\Big)^{2(k-i)}\frac{k\,e^{-1/3}}{2\pi i(k-i)} < \binom ki^2 < \Big(\frac ki\Big)^{2i}\Big(\frac{k}{k-i}\Big)^{2(k-i)}\frac{k\,e^{1/6}}{2\pi i}.$$
Introduce $\lambda = i/k$ and $\chi(\lambda) = \frac{2^\lambda}{\lambda^{2\lambda}(1-\lambda)^{2(1-\lambda)}}$. We have
$$[\chi(\lambda)]^k\,\frac{2e^{-1/3}}{\pi k} < \binom ki^2 2^i < [\chi(\lambda)]^k\,\frac{e^{1/6}}{2\pi\lambda}. \quad (14)$$
Lemma 23 can be numerically verified for $k\le 10^6$. We now consider $k > 10^6$. For $\lambda\ge 0.666$, since the function $\chi$ can be shown to be decreasing on $[0.666, 1]$, the inequality
$$\binom ki^2 2^i < [\chi(0.666)]^k\,\frac{e^{1/6}}{2\times 0.666\times\pi}$$
holds. We have $\chi(0.657)/\chi(0.666) > 1.0002$. Consequently, for $k > 10^6$, we have $[\chi(0.666)]^k < 0.001\times[\chi(0.657)]^k/k^2$. So, for $\lambda\ge 0.666$ and $k > 10^6$, we have
$$\binom ki^2 2^i < 0.001\times[\chi(0.657)]^k\,\frac{e^{1/6}}{2\pi\times 0.666\times k^2} < [\chi(0.657)]^k\,\frac{2e^{-1/3}}{1000\,\pi k^2} = \min_{\lambda\in[0.656,0.657]}[\chi(\lambda)]^k\,\frac{2e^{-1/3}}{1000\,\pi k^2} < \frac{1}{1000k}\max_{i\in\{1,\dots,k-1\}\cap[0,\,0.666k)}\binom ki^2 2^i, \quad (15)$$
where the last inequality comes from (14) and the fact that there exists $i\in\{1,\dots,k-1\}$ such that $i/k\in[0.656, 0.657]$. Summing over the at most $k$ indices with $i/k \ge 0.666$, inequality (15) implies that
$$\sum_{0.666k\le i\le k}\binom ki^2 2^i < \frac{1}{1000}\max_{i\in\{1,\dots,k-1\}\cap[0,\,0.666k)}\binom ki^2 2^i < \frac{1}{1000}\sum_{0\le i<0.666k}\binom ki^2 2^i.$$
To conclude, introducing $A = \sum_{0\le i<0.666k}\binom ki^2 2^i$, we have
$$\frac{\sum_{i=0}^k(1-i/k)\binom ki^2 2^i}{\sum_{i=0}^k\binom ki^2 2^i} > \frac{(1-0.666)\,A}{A + 0.001\,A} \ge \frac13.$$

Lemma 24 Let $\ell$ and $n$ be integers with $\frac12 \le \frac n2 \le \ell \le n$. Let $p, p', q, p_1,\dots,p_n$ be real numbers in $(0,1)$ with $q\in\{p, p'\}$, $p_1 = \dots = p_\ell = q$ and $p_{\ell+1} = \dots = p_n$. Let $\mathcal{B}$ (respectively $\mathcal{B}'$) be the sum of $n+1$ independent Bernoulli distributions with parameters $p, p_1,\dots,p_n$ (respectively $p', p_1,\dots,p_n$). We have
$$\mathrm{KL}(\mathcal{B}, \mathcal{B}') \le \frac{2(p'-p)^2}{(1-p')(n+2)\,q}.$$

Proof Let $Z, Z', Z_1,\dots,Z_n$ be independent Bernoulli random variables with parameters $p, p', p_1,\dots,p_n$. Define $S = \sum_{i=1}^\ell Z_i$, $T = \sum_{i=\ell+1}^n Z_i$ and $V = Z + S$. By a slight abuse of notation, merging in the same notation the distribution and the random variable, we have
$$\mathrm{KL}(\mathcal{B}, \mathcal{B}') = \mathrm{KL}\big((Z+S)+T,\ (Z'+S)+T\big) \le \mathrm{KL}\big((Z+S, T),\ (Z'+S, T)\big) = \mathrm{KL}(Z+S,\ Z'+S).$$
Let $s_k = \mathbb{P}(S=k)$ for $k = -1, 0, \dots, \ell+1$. Using the equalities
$$s_k = \binom\ell k q^k(1-q)^{\ell-k} = \frac{q}{1-q}\,\frac{\ell-k+1}{k}\binom{\ell}{k-1}q^{k-1}(1-q)^{\ell-k+1} = \frac{q}{1-q}\,\frac{\ell-k+1}{k}\,s_{k-1},$$
which hold for $1\le k\le\ell+1$, we obtain
$$\mathrm{KL}(Z+S,\ Z'+S) = \sum_{k=0}^{\ell+1}\mathbb{P}(V=k)\log\Big(\frac{\mathbb{P}(Z+S=k)}{\mathbb{P}(Z'+S=k)}\Big) = \sum_{k=0}^{\ell+1}\mathbb{P}(V=k)\log\Big(\frac{p\,s_{k-1} + (1-p)s_k}{p'\,s_{k-1} + (1-p')s_k}\Big)$$
$$= \sum_{k=0}^{\ell+1}\mathbb{P}(V=k)\log\Bigg(\frac{p\,\frac{1-q}{q}\,k + (1-p)(\ell-k+1)}{p'\,\frac{1-q}{q}\,k + (1-p')(\ell-k+1)}\Bigg) = \mathbb{E}\log\Big(\frac{(p-q)V + (1-p)\,q(\ell+1)}{(p'-q)V + (1-p')\,q(\ell+1)}\Big). \quad (16)$$

First case: $q = p'$. By Jensen's inequality, using that $\mathbb{E}V = p'(\ell+1) + p - p'$ in this case, we get
$$\mathrm{KL}(Z+S,\ Z'+S) \le \log\Big(\frac{(p-p')\,\mathbb{E}V + (1-p)\,p'(\ell+1)}{(1-p')\,p'(\ell+1)}\Big) = \log\Big(\frac{(p-p')^2 + (1-p')\,p'(\ell+1)}{(1-p')\,p'(\ell+1)}\Big) = \log\Big(1 + \frac{(p-p')^2}{(1-p')\,p'(\ell+1)}\Big) \le \frac{(p-p')^2}{(1-p')\,p'(\ell+1)}.$$

Second case: $q = p$. In this case $V$ is a binomial distribution with parameters $\ell+1$ and $p$. From (16), we have
$$\mathrm{KL}(Z+S,\ Z'+S) = -\mathbb{E}\log\Big(\frac{(p'-p)V + (1-p')\,p(\ell+1)}{(1-p)\,p(\ell+1)}\Big) = -\mathbb{E}\log\Big(1 + \frac{(p'-p)(V-\mathbb{E}V)}{(1-p)\,p(\ell+1)}\Big). \quad (17)$$
To conclude, we will use the following lemma.

Lemma 25 The following inequality holds for any $x\ge x_0$ with $x_0\in(0,1)$:
$$-\log(x) \le -(x-1) + \frac{(x-1)^2}{2x_0}.$$

Proof Introduce $f(x) = -(x-1) + \frac{(x-1)^2}{2x_0} + \log(x)$. We have $f'(x) = -1 + \frac{x-1}{x_0} + \frac1x$ and $f''(x) = \frac{1}{x_0} - \frac{1}{x^2}$. From $f'(x_0) = 0$, we get that $f'$ is nonpositive on $(x_0, 1)$ and nonnegative on $(1,+\infty)$; since $f(1) = 0$, this leads to $f$ nonnegative on $[x_0, +\infty)$.

Finally, from Lemma 25 and (17), using $x_0 = \frac{1-p'}{1-p}$, we obtain
$$\mathrm{KL}(Z+S,\ Z'+S) \le \Big(\frac{p'-p}{(1-p)\,p(\ell+1)}\Big)^2\frac{\mathbb{E}\big[(V-\mathbb{E}V)^2\big]}{2x_0} = \Big(\frac{p'-p}{(1-p)\,p(\ell+1)}\Big)^2\frac{(\ell+1)\,p(1-p)^2}{2(1-p')} = \frac{(p'-p)^2}{2(1-p')(\ell+1)\,p}.$$
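The proof of Lemma 23 notes that the inequality can be verified numerically for $k\le 10^6$; here is a minimal sketch of that check (the range is truncated for speed, and exact rational arithmetic is used to avoid overflow):

```python
from fractions import Fraction
from math import comb

def ratio(k, c=2):
    # Left-hand side of Lemma 23: sum_i (1 - i/k) C(k,i)^2 c^i / sum_i C(k,i)^2 c^i.
    terms = [comb(k, i) ** 2 * c ** i for i in range(k + 1)]
    num = sum(Fraction(k - i, k) * t for i, t in enumerate(terms))
    return num / sum(terms)

# The proof shows the ratio is decreasing in c on [1, 2], so c = 2 is the worst case.
ks = range(1, 301)                     # truncated range; the paper checks up to 10**6
assert all(ratio(k) >= Fraction(1, 3) for k in ks)
print(float(min(ratio(k) for k in ks)))
```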
