Sampled Fictitious Play is Hannan Consistent
Zifan Li (zifanli@umich.edu) and Ambuj Tewari (tewaria@umich.edu)

April 12, 2017

Abstract. Fictitious play is a simple and widely studied adaptive heuristic for playing repeated games. It is well known that fictitious play fails to be Hannan consistent. Several variants of fictitious play, including regret matching, generalized regret matching and smooth fictitious play, are known to be Hannan consistent. In this note, we consider sampled fictitious play: at each round, the player samples past times and plays the best response to previous moves of the other players at the sampled time points. We show that sampled fictitious play, using Bernoulli sampling, is Hannan consistent. Unlike several existing Hannan consistency proofs that rely on concentration of measure results, ours instead uses anti-concentration results from Littlewood-Offord theory.

Keywords: adaptive heuristics, learning, repeated games, Hannan consistency, fictitious play

1 Introduction

In the setting of repeated games played in discrete time, the (unconditional) regret of a player, at any time point, is the difference between the payoffs she would have received had she played the best, in hindsight, constant strategy throughout, and the payoffs she did in fact receive. Hannan [1957] showed the existence of procedures with a "no-regret" property: procedures for which the average regret per time goes to zero for a large number of time points. His procedure was a simple modification of fictitious play: random perturbations are added to the cumulative payoffs of every strategy so far, and the player picks the strategy with the largest perturbed cumulative payoff. No regret procedures are also called "universally consistent" [Fudenberg and Levine, 1998, Section 4.7] or "Hannan consistent" [Cesa-Bianchi and Lugosi, 2006, Section 4.2].

It is well known that smoothing the cumulative payoffs before computing the best response is crucial to achieve Hannan consistency. One way to achieve smoothness is through stochastic smoothing, or adding perturbations. Without perturbations, the procedure becomes identical to fictitious play, which fails to be Hannan consistent [Cesa-Bianchi and Lugosi, 2006, Exercise 3.8]. Besides Hannan's modification, other variants of fictitious play are also known to be Hannan consistent, including (unconditional) regret matching, generalized (unconditional) regret matching and smooth fictitious play (for an overview, see Hart and Mas-Colell [2013, Section 10.9]).

In this note, we consider another variant of fictitious play, namely sampled fictitious play. Here, the player samples past time points using some (randomized) sampling scheme and plays the best response to the moves of the other players restricted to the set of sampled time points. Sampled fictitious play has been considered by other authors in different contexts. Kaniovski and Young [1995] established convergence to Nash equilibrium in $2 \times 2$ games. Gilliland and Jung [2006] provided regret bounds for the game of matching pennies. Lambert III et al. [2005] considered games with identical payoffs for all players and used sampled fictitious play to solve large-scale optimization problems.
To the best of our knowledge, it is not known whether sampled fictitious play is Hannan consistent without making any assumptions on the form of the game and payoffs. The purpose of this note is to show that it is indeed Hannan consistent when used with a natural sampling scheme, namely Bernoulli sampling.

2 Preliminaries

Consider a game in strategic form where $M$ is the number of players, $S_i$ is the set of strategies for player $i$, and $u_i : \prod_{j=1}^M S_j \to \mathbb{R}$ is the payoff function for player $i$. For simplicity, assume that the payoff functions of all players are $[-1,1]$ bounded. We also assume that the number of pure strategies is the same for each player and that $S_i = \{1,\dots,N\}$. Let $S = \prod_{i=1}^M S_i$ be the set of $M$-tuples of player strategies. For $s = (s_i)_{i=1}^M \in S$, we denote the strategies of players other than $i$ by $s_{-i} = (s_j)_{1 \le j \le M,\, j \ne i}$.

The game is played repeatedly over (discrete) time $t = 1, 2, \dots$. A learning procedure for player $i$ is a procedure that maps the history $h_{t-1} = (s_\tau)_{\tau=1}^{t-1}$ of plays just prior to time $t$ to a strategy $s_{t,i} \in S_i$. The learning procedure is allowed to be randomized, i.e., player $i$ has access to a stream of random variables $\epsilon_1, \epsilon_2, \dots$ and she is allowed to use $\epsilon_1, \dots, \epsilon_{t-1}$, in addition to $h_{t-1}$, to choose $s_{t,i}$.

Player $i$'s regret at time $t$ is defined as
$$R_{t,i} = \max_{k \in S_i} \sum_{\tau=1}^{t} u_i(k, s_{\tau,-i}) - \sum_{\tau=1}^{t} u_i(s_\tau).$$
This compares the player's cumulative payoff with the payoff she could have received had she selected the best constant (over time) strategy $k$ with knowledge of the other players' moves. A learning procedure for player $i$ is said to be Hannan consistent if and only if
$$\limsup_{t \to \infty} \frac{R_{t,i}}{t} \le 0 \quad \text{almost surely}.$$
Hannan consistency is also known as the "no-regret" property and as "universal consistency". The term "universal" refers to the fact that the regret per time goes to zero irrespective of what the other players do.

Fictitious play is a (deterministic) learning procedure where player $i$ plays the best response to the plays of the other players so far. That is,
$$s_{t,i} \in \arg\max_{k \in \{1,\dots,N\}} \sum_{\tau=1}^{t-1} u_i(k, s_{\tau,-i}). \tag{1}$$
As mentioned earlier, fictitious play is not Hannan consistent. However, consider the following modification of fictitious play, called sampled fictitious play. At time $t$, the player randomly selects a subset $S_t \subseteq \{1,\dots,t-1\}$ of previous time points and plays the best response to the other players' moves only over $S_t$. That is,
$$s_{t,i} \in \arg\max_{k \in \{1,\dots,N\}} \sum_{\tau \in S_t} u_i(k, s_{\tau,-i}). \tag{2}$$
If multiple strategies achieve the maximum, then the tie is broken uniformly at random, and independently of all previous randomness. Also, if $S_t$ turns out to be empty (an event that happens with probability exactly $2^{-(t-1)}$ under the Bernoulli sampling described below), we adopt the convention that the argmax above includes all $N$ strategies.

In this note, we consider Bernoulli sampling, i.e., any particular round $\tau \in \{1,\dots,t-1\}$ is included in $S_t$ independently with probability $1/2$. More specifically, if $\epsilon^{(t)}_1, \dots, \epsilon^{(t)}_{t-1}$ are i.i.d. symmetric Bernoulli (or Rademacher) random variables taking values in $\{-1,+1\}$, then
$$S_t = \{\tau \in \{1,\dots,t-1\} : \epsilon^{(t)}_\tau = +1\} \tag{3}$$
and therefore
$$\sum_{\tau \in S_t} u_i(k, s_{\tau,-i}) = \sum_{\tau=1}^{t-1} \frac{1+\epsilon^{(t)}_\tau}{2}\, u_i(k, s_{\tau,-i}).$$
Note that the procedure defined by the combination of (2) and (3) is completely parameter free, i.e., there is no tuning parameter that has to be carefully tuned in order to obtain the desired convergence properties.

3 Result and Discussion

Our main result is the following.

Theorem 3.1. Sampled fictitious play (2) with Bernoulli sampling (3) is Hannan consistent.

Before we move on to the proof, a few remarks are in order.

Computational tractability. It is a simple but important observation that the form of the optimization problem solved by fictitious play (1) is exactly the same as the optimization problem solved by sampled fictitious play (2). This can be very useful when the player has a large strategy set and does not want to enumerate all strategies to solve the maximization involved in both fictitious play and its sampled version. For example, Lambert III et al. [2005] describe their computational experience with sampled fictitious play in the context of a dynamic traffic assignment problem.

Rate of convergence. Our proof gives the rate of convergence of (expected) average regret as $O(N^2 \sqrt{\log\log t / t})$, where the constant hidden in the $O(\cdot)$ notation is small and explicit. It is known that the optimal rate is $O(\sqrt{\log N / t})$ [Cesa-Bianchi and Lugosi, 2006, Section 2.10]. Therefore, our rate of convergence is almost optimal in $t$ but severely suboptimal in $N$. This raises several interesting questions. What is the best bound possible for sampled fictitious play with Bernoulli sampling? Is there a sampling scheme for which the sampled fictitious play procedure achieves the optimal rate of convergence? The first question is partially answered by Theorem B.1 in Appendix B, which states that the dependence on $N$ is likely to be polynomial instead of logarithmic, but there is still some gap between the lower bound and the upper bound we provide.

Asymmetric probabilities. Instead of using symmetric Bernoulli probabilities, we can choose $\epsilon^{(t)}_\tau$ such that $P(\epsilon^{(t)}_\tau = +1) = \alpha$. As $\alpha \to 1$, the learning procedure becomes fictitious play, and as $\alpha \to 0$, it selects strategies uniformly at random. Therefore, it is natural to expect that the regret bound will blow up near the two extremes $\alpha = 1$ and $\alpha = 0$. We can make this intuition precise, but only for $\{-1,0,1\}$-valued payoffs (instead of $[-1,1]$-valued). For details, see Appendix C in the supplementary material.

Follow the perturbed leader. Note that
$$\arg\max_{k \in \{1,\dots,N\}} \sum_{\tau=1}^{t-1} \frac{1+\epsilon^{(t)}_\tau}{2}\, u_i(k, s_{\tau,-i}) = \arg\max_{k \in \{1,\dots,N\}} \left[\sum_{\tau=1}^{t-1} u_i(k, s_{\tau,-i}) + \sum_{\tau=1}^{t-1} \epsilon^{(t)}_\tau u_i(k, s_{\tau,-i})\right].$$
Therefore, we can think of sampled fictitious play as adding a random perturbation to the expression that fictitious play optimizes. Such algorithms are referred to as "follow the perturbed leader" (FPL) in the computer science literature ("fictitious play" is known as "follow the leader"). This family was originally proposed by Hannan [1957] and popularized by Kalai and Vempala [2005]. Closer to this paper are the FPL algorithms of Devroye et al. [2013] and van Erven et al. [2014]. However, none of these papers considered sampled fictitious play.
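To make the procedure concrete, here is a minimal simulation sketch in Python (ours, not from the paper; the helper name `sampled_fp_move` and the matching-pennies demo are illustrative assumptions). The line computing `sampled_payoffs` is exactly the sum over $S_t$ written as $\sum_{\tau=1}^{t-1} \frac{1+\epsilon_\tau}{2} u_i(k, s_{\tau,-i})$, which also makes the perturbed-leader reading above visible.

```python
import numpy as np

def sampled_fp_move(G, rng):
    """One move of sampled fictitious play, eqs. (2)-(3).

    G is a (t-1) x N array with G[tau, k] = u_i(k, s_{tau,-i}).  Each past
    round enters the sample independently with probability 1/2; an empty
    sample (including t = 1) makes every strategy a best response, and
    ties are broken uniformly at random."""
    t_minus_1, _ = G.shape
    eps = rng.choice([-1, 1], size=t_minus_1)       # Rademacher draws, eq. (3)
    sampled_payoffs = ((1 + eps) / 2) @ G           # sum of payoffs over S_t
    best = np.flatnonzero(sampled_payoffs == sampled_payoffs.max())
    return rng.choice(best)                         # uniform tie-breaking

# Demo: player 1 in matching pennies against a uniformly random opponent.
rng = np.random.default_rng(0)
u = np.array([[1.0, -1.0], [-1.0, 1.0]])            # u[k, j]: payoff if we play k, opponent plays j
G, total, T = np.empty((0, 2)), 0.0, 5000
for t in range(T):
    k = sampled_fp_move(G, rng)
    j = rng.integers(2)
    total += u[k, j]
    G = np.vstack([G, u[:, j]])                     # record the payoff vector g_t
print(f"average regret: {(G.sum(axis=0).max() - total) / T:.4f}")
```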
Extension to conditional (or internal) regret. In this paper we focus on unconditional (or external) regret. Other notions of regret, especially conditional (or internal) regret, can also be considered. Internal regret measures the worst regret, over the $N(N-1)$ choices of $k \ne k'$, of the form "every time strategy $k$ was picked, strategy $k'$ should have been picked instead". There are generic conversions [Stoltz and Lugosi, 2005, Blum and Mansour, 2007] that will convert any learning procedure with small external regret to one with small internal regret. These conversions, however, require access to the probability distribution over strategies at each time point. This probability distribution can be approximated, to arbitrary accuracy, by making the choice of the strategy in (2) multiple times, each time selecting the random subset $S_t$ independently. However, doing so and using a generic conversion from external to internal regret will lead to a cumbersome overall algorithm. It would be nicer to design a simpler sampling based learning procedure with small internal regret.

4 Proof of the Main Result

We break the proof of our main result into several steps. The first and third steps involve fairly standard arguments in this area. Our main innovations are in step two.

4.1 Step 1: From Regret to Switching Probabilities

In this step, we assume that players other than player $i$ (the "opponents") are oblivious, i.e., they do not adapt to what player $i$ does. Mathematically, this means that the sequence $s_{t,-i}$ does not depend on the moves $s_{t,i}$ of player $i$. We will prove a uniform regret bound that holds for all deterministic payoff sequences $\{s_{t,-i}\}_{t=1}^T$, from which we can conclude that the same bound holds for oblivious but random payoff sequences as well.

Since player $i$ is fixed for the rest of the proof, we will not carry the index $i$ in our notation further. Let the vector $g_t \in [-1,1]^N$ be defined as $g_{t,k} = u_i(k, s_{t,-i})$ for $k \in \{1,\dots,N\}$. Moreover, we denote player $i$'s move $s_{t,i}$ as $k_t$. With this notation, the regret at time $T$ equals
$$R_T = \max_{k \in \{1,\dots,N\}} \sum_{t=1}^T g_{t,k} - \sum_{t=1}^T g_{t,k_t}.$$
In this step, we will look at the expected regret. Because the opponents are oblivious, this equals
$$\mathbb{E}[R_T] = \max_{k \in \{1,\dots,N\}} \sum_{t=1}^T g_{t,k} - \mathbb{E}\left[\sum_{t=1}^T g_{t,k_t}\right] = \max_{k \in \{1,\dots,N\}} \sum_{t=1}^T g_{t,k} - \sum_{t=1}^T \mathbb{E}[g_{t,k_t}].$$
Recall that
$$k_t \in \arg\max_{k \in \{1,\dots,N\}} \sum_{\tau=1}^{t-1} \frac{1+\epsilon^{(t)}_\tau}{2}\, g_{\tau,k}.$$
Since the $g_t$'s are fixed vectors, by independence we see that the distribution of $k_t$ is exactly the same whether or not we share the Rademacher random variables across time points. Therefore, we do not have to draw a fresh sample $\epsilon^{(t)}_1, \dots, \epsilon^{(t)}_{t-1}$ at time $t$. Instead, we fix a single stream $\epsilon_1, \epsilon_2, \dots$ of i.i.d. Rademacher random variables and set $(\epsilon^{(t)}_1, \dots, \epsilon^{(t)}_{t-1}) = (\epsilon_1, \dots, \epsilon_{t-1})$ for all $t$. With this reduction in the number of random variables used, we now have
$$k_t \in \arg\max_{k \in \{1,\dots,N\}} \sum_{\tau=1}^{t-1} (1+\epsilon_\tau)\, g_{\tau,k}. \tag{4}$$
We define $G_t = \sum_{\tau=1}^t g_\tau$, the cumulative payoff vector at time $t$. Define $\tilde{g}_t = (1+\epsilon_t)\, g_t$ and $\tilde{G}_t = \sum_{\tau=1}^t \tilde{g}_\tau$. We also define $g_{t,i \ominus j} = g_{t,i} - g_{t,j}$ and $\tilde{g}_{t,i \ominus j} = \tilde{g}_{t,i} - \tilde{g}_{t,j}$.
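As a quick empirical companion to Step 1 (our sketch, with an arbitrary illustrative payoff sequence), the following estimates $\mathbb{E}[R_T]$ for the single stream version (4) against a fixed oblivious payoff sequence; the vector `perturbed` below is exactly $\tilde{G}_{t-1}$.

```python
import numpy as np

def single_stream_regret(g, rng):
    """Regret of the single stream version (4) on a fixed T x N payoff
    array g, drawing one Rademacher stream up front and reusing it."""
    T, N = g.shape
    eps = rng.choice([-1, 1], size=T)
    payoff = 0.0
    for t in range(T):
        perturbed = (1 + eps[:t]) @ g[:t]           # tilde-G_{t-1}, an N-vector
        best = np.flatnonzero(perturbed == perturbed.max())
        payoff += g[t, rng.choice(best)]            # play k_t, collect g_{t,k_t}
    return g.sum(axis=0).max() - payoff

rng = np.random.default_rng(1)
g = rng.uniform(-1, 1, size=(2000, 3))              # fixed [-1,1]-bounded payoffs
print(f"estimated E[R_T]: {np.mean([single_stream_regret(g, rng) for _ in range(20)]):.1f}")
```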
With these definitions, we have
$$\tilde{G}_{t,i \ominus j} = \tilde{G}_{t,i} - \tilde{G}_{t,j} = \sum_{\tau=1}^t \tilde{g}_{\tau,i} - \sum_{\tau=1}^t \tilde{g}_{\tau,j} = \sum_{\tau=1}^t (1+\epsilon_\tau)(g_{\tau,i} - g_{\tau,j}) = \sum_{\tau=1}^t (1+\epsilon_\tau)\, g_{\tau,i \ominus j}.$$
The following result upper bounds the regret in terms of downward zero-crossings of the process $\tilde{G}_{t,i \ominus j}$, i.e., the times $t$ when it switches from being non-negative at time $t-1$ to non-positive at time $t$.

Theorem 4.1. We have the following upper bound on the expected regret:
$$\mathbb{E}[R_T] \le 2N^2 \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right).$$

The proof of this theorem can be found in Appendix A. We now focus on bounding the switching probabilities for a fixed pair $i, j$.

4.2 Step 2: Bounding Switching Probabilities Using Littlewood-Offord Theory

Our strategy is to do a "multi-scale" analysis and, within each scale, apply Littlewood-Offord theory to bound the switching probabilities. The need for a multi-scale argument arises from the requirement, in the Littlewood-Offord theorem (see Theorem 4.2 below), of a lower bound on the step sizes of random walks. We partition the set of $T$ time points $[T] := \{1,\dots,T\}$ into $K+1$ disjoint sets at different scales, denoted $\{A_k\}_{k=0}^K$, where
$$A_k = \begin{cases} \{t \in [T] : |g_{t,i \ominus j}| \le \tfrac{1}{\sqrt{T}}\} & k = 0, \\ \{t \in [T] : T^{-1/2^k} < |g_{t,i \ominus j}| \le T^{-1/2^{k+1}}\} & k = 1,\dots,K-1, \\ \{t \in [T] : T^{-1/2^K} < |g_{t,i \ominus j}| \le 2\} & k = K. \end{cases}$$
Note that $A_k$ actually depends on $i, j$ as well, but for the sake of clarity we drop this dependence from the notation. The cardinality of a finite set $A$ will be denoted by $|A|$. The number $K+1$ of different scales is determined by
$$K = \arg\min\{k \in \mathbb{N} : T^{-1/2^k} \ge 1/2\}.$$
For all $t, i$, we have $g_{t,i} \in [-1,1]$, so $|g_{t,i \ominus j}| \in [0,2]$. The scales here are chosen such that $K$ is not very large (of order $O(\log\log T)$) and the sets still cover the entire range of the payoffs. It easily follows that
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right) = \sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right)$$
$$= \sum_{k=0}^K \sum_{t \in A_k} |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right).$$
We now want to argue that the probabilities involved above are small. The crucial observation is that, if a switch occurs, then the random sum $\sum_{\tau=1}^t \epsilon_\tau g_{\tau,i \ominus j}$ has to lie in a sufficiently small interval. Such "small ball" probabilities are exactly what the classic Littlewood-Offord theorem controls.

Theorem 4.2 (Littlewood-Offord Theorem of Erdős, Theorem 3 of Erdős [1945]). Let $x_1, \dots, x_n$ be $n$ real numbers such that $|x_i| \ge 1$ for all $i$. For any given radius $\Delta > 0$, the small ball probability satisfies
$$\sup_B P(\epsilon_1 x_1 + \cdots + \epsilon_n x_n \in B) \le \frac{S(n)}{2^n} (\lfloor \Delta \rfloor + 1),$$
where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. Rademacher random variables, $B$ ranges over all closed balls (intervals) of radius $\Delta$, $\lfloor x \rfloor$ refers to the integral part of $x$, and $S(n)$ is the largest binomial coefficient belonging to $n$.

Using elementary calculations to upper bound $S(n)/2^n$ gives us the following corollary.

Corollary 4.2.1. Under the same notation and conditions as Theorem 4.2, we have
$$\sup_B P(\epsilon_1 x_1 + \cdots + \epsilon_n x_n \in B) \le C_{LO} (\lfloor \Delta \rfloor + 1) \frac{1}{\sqrt{n}},$$
where $C_{LO} = \frac{2\sqrt{2}\, e}{\pi} < 3$.

The proof of this corollary can be found in Appendix A.
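The following Monte Carlo check (ours, purely illustrative) compares an empirical small ball probability against the bound of Corollary 4.2.1 with $C_{LO} < 3$; the step sizes drawn from $[1,2]$ are an arbitrary choice satisfying $|x_i| \ge 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta, trials = 100, 1.0, 50_000
x = rng.uniform(1.0, 2.0, size=n)                   # step sizes with |x_i| >= 1
sums = rng.choice([-1, 1], size=(trials, n)) @ x    # Rademacher sums eps . x
# Probe closed balls [c - delta, c + delta] on a grid and keep the worst one.
centers = np.linspace(-3 * np.sqrt(n), 3 * np.sqrt(n), 301)
small_ball = max(np.mean(np.abs(sums - c) <= delta) for c in centers)
bound = 3 * (np.floor(delta) + 1) / np.sqrt(n)      # C_LO (floor(Delta)+1) / sqrt(n)
print(f"empirical sup small-ball prob ~ {small_ball:.4f} <= bound {bound:.4f}")
```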
The scale of payoffs for time periods in $A_0$ is so small that we do not need any Littlewood-Offord theory to control their contribution to the regret. Simply bounding the probabilities by 1 gives us the following.

Theorem 4.3. The following upper bound holds for switching probabilities for time periods within $A_0$:
$$\sum_{t \in A_0} |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right) \le \sqrt{|A_0|} \le 20\, C_{LO} \sqrt{|A_0|},$$
where $C_{LO} > 1$.

The proof of this theorem can also be found in Appendix A.

The real work lies in controlling the switching probabilities for payoffs at intermediate scales. The idea in the proof of the results is to condition on the $\epsilon_t$'s outside $A_k$. Then the probability of interest is written as a small ball event in terms of the $\epsilon_t$'s in $A_k$. Applying the Littlewood-Offord theorem concludes the argument.

Theorem 4.4. For any $k \in \{1,\dots,K\}$, we have
$$\sum_{t \in A_k} |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right) \le 20\, C_{LO} \sqrt{|A_k|}.$$

Again, the proof of this theorem is deferred to Appendix A. We finally have all the ingredients in place to control the switching probabilities.

Corollary 4.4.1. The following upper bound on the switching probabilities holds:
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right) \le 20\, C_{LO} \sqrt{T \log_2(4 \log_2 T)}.$$

Proof. Using Theorem 4.3 and Theorem 4.4 to bound, scale by scale, the decomposition displayed at the beginning of this step, we get
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right) = \sum_{k=0}^K \sum_{t \in A_k} |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right) \le \sum_{k=0}^K 20\, C_{LO} \sqrt{|A_k|}.$$
Since $\sum_{k=0}^K \sqrt{|A_k|} \le \sqrt{K+1} \cdot \sqrt{\sum_{k=0}^K |A_k|}$ and $\sum_{k=0}^K |A_k| = T$, we have
$$\sum_{k=0}^K 20\, C_{LO} \sqrt{|A_k|} \le 20\, C_{LO} \sqrt{(K+1)\, T}.$$
By definition of $K$, we have $T^{-1/2^{K-1}} < \frac{1}{2}$, so $K \le \log_2(\log_2(T)) + 1$ and hence $K + 1 \le \log_2(4 \log_2 T)$, which finishes the proof. ∎

Thus, for all $i, j \in \{1,\dots,N\}$ with $i \ne j$, we have
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right) \le 20\, C_{LO} \sqrt{T \log_2(4 \log_2 T)},$$
which, when plugged into Theorem 4.1, immediately yields the following corollary.

Corollary 4.4.2. Against an oblivious opponent, both the single stream version (4) and the fresh-randomization-at-each-round version (2) of sampled fictitious play enjoy the following bound on the expected regret:
$$\mathbb{E}[R_T] \le 40\, C_{LO} N^2 \sqrt{T \log_2(4 \log_2 T)}.$$
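For a feel of the quantities appearing in Step 2, this snippet (ours; the payoff-difference sequence is an arbitrary stand-in) builds the partition $\{A_k\}$, computes $K$, and evaluates the bound of Corollary 4.4.2 for sample values of $T$ and $N$.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 10_000, 5
C_LO = 2 * np.sqrt(2) * np.e / np.pi                # < 3, from Corollary 4.2.1

K = int(np.ceil(np.log2(np.log2(T))))               # smallest k with T**(-1/2**k) >= 1/2
g_diff = rng.uniform(-2, 2, size=T)                 # stand-in for the sequence g_{t,i (-) j}
edges = [0.0] + [T ** (-1 / 2 ** k) for k in range(1, K + 1)] + [2.0]
sizes = [int(np.sum((np.abs(g_diff) > lo) & (np.abs(g_diff) <= hi)))
         for lo, hi in zip(edges[:-1], edges[1:])]
print(f"K = {K}, |A_0|..|A_K| = {sizes}")           # K+1 scales, of order log log T

bound = 40 * C_LO * N ** 2 * np.sqrt(T * np.log2(4 * np.log2(T)))
print(f"Corollary 4.4.2: E[R_T] <= {bound:.0f} for T={T}, N={N}")
```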
4.3 Step 3: From Oblivious to Adaptive Opponents

Now we consider adaptive opponents. In this setting, we can no longer assume that player $i$ plays against a fixed sequence of payoff vectors $\{g_t\}_{t=1}^T$. Note that $g_{t,k}$ is just shorthand for $u_i(k, s_{t,-i})$, and opponents can react to player $i$'s moves $k_1, \dots, k_{t-1}$ in selecting their strategy tuple $s_{t,-i}$, possibly making use of their own private randomness. We denote all randomness used collectively by the other players over all time periods by $\omega$, which is drawn from some probability space $\Omega$. Thus, $g_t$ is a function $g_t(k_1, \dots, k_{t-1}, \omega)$. Faced with general adaptive opponents, the single stream version (4) can incur terrible expected regret, as stated below.

Theorem 4.5. The single stream version of the sampled fictitious play procedure (4) can incur linear expected regret against adaptive opponents.

The proof of this theorem can be found at the end of Appendix A. However, for the fresh-randomness-at-each-round procedure (2), we can apply Lemma 4.1 of Cesa-Bianchi and Lugosi [2006] along with Corollary 4.4.2 to derive our next result, which holds for adaptive opponents too. There are two conditions that we must verify before we apply that lemma. First, the learning procedure should use independent randomization at different time points. Second, the probability distribution of $s_{t,i}$ over the $N$ available strategies should be fully determined by $s_{1,-i}, \dots, s_{t-1,-i}$ and should not depend explicitly on player $i$'s own previous moves $s_{1,i}, \dots, s_{t-1,i}$. Both of these conditions are easily seen to hold for sampled fictitious play as defined in (2) and (3). Also note that Cesa-Bianchi and Lugosi [2006] consider deterministic adaptive opponents in their Lemma 4.1. The extension to our case is easy: we first get a high probability (w.r.t. player $i$'s randomness) regret bound for the deterministic adaptive opponent $g_t(k_1, \dots, k_{t-1}, \omega)$ for a fixed $\omega$. Since the bound holds for every $\omega$ and does not depend on $\omega$, the same high probability bound is true when $\omega$ is drawn from $\Omega$. This leads us to our final result.

Theorem 4.6. For any $T$ and any $\delta_T > 0$, with probability at least $1 - \delta_T$, the actual regret $R_T$ of sampled fictitious play as defined in (2) and (3) satisfies, for any adaptive opponent,
$$R_T \le 40\, C_{LO} N^2 \sqrt{T \log_2(4 \log_2 T)} + \sqrt{\frac{T}{2} \log \frac{1}{\delta_T}}.$$

Now pick $\delta_T = \frac{1}{T^2}$. Consider the events $E_T = \{R_T \ge 40\, C_{LO} N^2 \sqrt{T \log_2(4 \log_2 T)} + \sqrt{T \log T}\}$ with $P(E_T) \le \delta_T$. Since $\sum_{T=1}^\infty \delta_T < \infty$, we have $\sum_{T=1}^\infty P(E_T) < \infty$. Therefore, using the Borel-Cantelli lemma, the event "infinitely many $E_T$'s occur" has probability 0. That is, with probability 1, we have
$$\limsup_{T \to \infty} \frac{R_T}{\sqrt{T \log T}} \le C$$
for some constant $C$. In particular, with probability 1, $\limsup_{T \to \infty} R_T / T \le 0$, which proves Theorem 3.1.

5 Conclusion

We proved that a natural variant of fictitious play is Hannan consistent. In the variant we considered, the player plays the best response to the moves of her opponents at sampled time points in the history so far. We considered one particular sampling scheme, namely Bernoulli sampling. It will be interesting to consider other sampling strategies, including sampling with replacement. It will also be interesting to consider notions of regret, such as tracking regret [Cesa-Bianchi and Lugosi, 2006, Section 5.2], that are more suitable for non-stationary environments, by biasing the sampling to give more importance to recent time points.

Acknowledgements

We thank Jacob Abernethy, Gergely Neu and Manfred Warmuth for helpful discussions. We acknowledge the support of NSF via CAREER grant IIS-1452099.

References

Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8(Jun):1307–1324, 2007.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Luc Devroye, Gábor Lugosi, and Gergely Neu. Prediction by random-walk perturbation. In COLT, pages 460–473, 2013.

Paul Erdős. On a lemma of Littlewood and Offord.
Bulletin of the American Mathematical Society, 51:898–902, 1945.

Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press, 1998.

Dennis Gilliland and Inha Jung. Play against the random past for matching binary bits. Journal of Statistical Theory and Application, 5(3):282–291, 2006.

James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3(39):97–139, 1957.

Sergiu Hart and Andreu Mas-Colell. Simple Adaptive Strategies: From Regret Matching to Uncoupled Dynamics, volume 4 of World Scientific Series in Economic Theory. World Scientific Publishing, 2013.

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

Yuri M. Kaniovski and H. Peyton Young. Learning dynamics in games with stochastic perturbations. Games and Economic Behavior, 11(2):330–363, 1995.

Theodore J. Lambert III, Marina A. Epelman, and Robert L. Smith. A fictitious play approach to large-scale optimization. Operations Research, 53(3):477–489, 2005.

Gilles Stoltz and Gábor Lugosi. Internal regret in on-line portfolio selection. Machine Learning, 59(1-2):125–159, 2005.

Tim van Erven, Wojciech Kotłowski, and Manfred K. Warmuth. Follow the leader with dropout perturbations. In Proceedings of the Conference on Learning Theory (COLT), pages 949–974, 2014.

Manfred K. Warmuth. Follow the leader with dropout perturbations - additive versus multiplicative noise, June 2015. URL https://users.soe.ucsc.edu/~manfred/pubs/C93mwtalk.pdf.

Appendix A  Proofs

We first present a lemma that helps us in proving Theorem 4.1.

Lemma A.1. Let $k_t$ and $\tilde{g}_t$ be defined as in (4) and the text following that equation. We have
$$\sum_{t=1}^T \tilde{g}_{t,k_{t+1}} \ge \sum_{t=1}^T \tilde{g}_{t,k_{T+1}} = \max_{k \in \{1,\dots,N\}} \sum_{t=1}^T \tilde{g}_{t,k}.$$

Proof. This is a classical lemma; see, for example, Lemma 3.1 in [Cesa-Bianchi and Lugosi, 2006]. We follow the same idea, i.e., proving through induction, but adapt it to handle gains instead of losses. The statement is obvious for $T = 1$. Assume now that
$$\sum_{t=1}^{T-1} \tilde{g}_{t,k_{t+1}} \ge \sum_{t=1}^{T-1} \tilde{g}_{t,k_T}.$$
Since, by definition, $\sum_{t=1}^{T-1} \tilde{g}_{t,k_T} \ge \sum_{t=1}^{T-1} \tilde{g}_{t,k_{T+1}}$, the inductive assumption implies
$$\sum_{t=1}^{T-1} \tilde{g}_{t,k_{t+1}} \ge \sum_{t=1}^{T-1} \tilde{g}_{t,k_{T+1}}.$$
Add $\tilde{g}_{T,k_{T+1}}$ to both sides to obtain the result. ∎

Proof of Theorem 4.1. We will prove a result for Bernoulli sampling with general probabilities, i.e., when $P(\epsilon_t = +1) = \alpha$ where $\alpha$ is not necessarily $1/2$. We will show that
$$\mathbb{E}[R_T] \le \frac{N^2}{\alpha} \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right),$$
from which the theorem follows as a special case when $\alpha = 1/2$. Obviously we have $\mathbb{E}(\tilde{g}_{t,i}) = 2\alpha\, g_{t,i}$ because $\mathbb{E}(\epsilon_t) = 2\alpha - 1$. Furthermore, $\mathbb{E}[\tilde{g}_{t,k_t} \mid \epsilon_1, \dots, \epsilon_{t-1}] = 2\alpha\, g_{t,k_t}$ because $k_t$ is fully determined by the past randomness $\epsilon_1, \dots, \epsilon_{t-1}$ and the past payoffs $g_1, \dots, g_{t-1}$, which are given. This implies that $\mathbb{E}[\tilde{g}_{t,k_t}] = \mathbb{E}[\mathbb{E}[\tilde{g}_{t,k_t} \mid \epsilon_1, \dots, \epsilon_{t-1}]] = 2\alpha\, \mathbb{E}[g_{t,k_t}]$. We now have
$$\mathbb{E}[R_T] = \max_{k \in \{1,\dots,N\}} \sum_{t=1}^T g_{t,k} - \mathbb{E}\left[\sum_{t=1}^T g_{t,k_t}\right] = \frac{1}{2\alpha} \max_{k \in \{1,\dots,N\}} \mathbb{E}\left[\sum_{t=1}^T \tilde{g}_{t,k}\right] - \frac{1}{2\alpha} \mathbb{E}\left[\sum_{t=1}^T \tilde{g}_{t,k_t}\right] \le \frac{1}{2\alpha}\, \mathbb{E}\left[\max_{k \in \{1,\dots,N\}} \sum_{t=1}^T \tilde{g}_{t,k} - \sum_{t=1}^T \tilde{g}_{t,k_t}\right].$$
Using Lemma A.1, we can further upper bound the last expression as follows:
$$\mathbb{E}[R_T] \le \frac{1}{2\alpha}\, \mathbb{E}\left[\sum_{t=1}^T \tilde{g}_{t,k_{t+1}} - \sum_{t=1}^T \tilde{g}_{t,k_t}\right] = \frac{1}{2\alpha} \sum_{t=1}^T \mathbb{E}\left[(1+\epsilon_t)(g_{t,k_{t+1}} - g_{t,k_t})\right] \le \frac{1}{2\alpha} \sum_{t=1}^T \mathbb{E}\left[(1+\epsilon_t)\, |g_{t,k_{t+1}} - g_{t,k_t}|\right]$$
$$\le \frac{1}{\alpha} \sum_{t=1}^T \mathbb{E}\,|g_{t,k_t} - g_{t,k_{t+1}}| = \frac{1}{\alpha} \sum_{t=1}^T \sum_{1 \le i,j \le N} \mathbb{E}\left[|g_{t,i} - g_{t,j}|\, \mathbf{1}(k_t = i, k_{t+1} = j)\right] = \frac{1}{\alpha} \sum_{1 \le i,j \le N} \sum_{t=1}^T \mathbb{E}\left[|g_{t,i} - g_{t,j}|\, \mathbf{1}(k_t = i, k_{t+1} = j)\right]$$
$$\le \frac{N^2}{\alpha} \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i} - g_{t,j}|\, P(k_t = i, k_{t+1} = j) \le \frac{N^2}{\alpha} \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i} - g_{t,j}|\, P\left(\tilde{G}_{t-1,i} \ge \tilde{G}_{t-1,j},\ \tilde{G}_{t,i} \le \tilde{G}_{t,j}\right)$$
$$= \frac{N^2}{\alpha} \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right). \qquad ∎$$

The next lemma is useful to determine the appropriate constant in the Littlewood-Offord theorem.

Lemma A.2. Suppose $X_1, \dots, X_t$ are i.i.d. Bernoulli random variables that take the value 1 with probability $\alpha$ and 0 with probability $1-\alpha$. If $t > \max(\frac{2}{1-\alpha}, \frac{2}{\alpha}) \ge \max(\frac{2\alpha}{1-\alpha}, \frac{2}{\alpha})$, then for all $k$,
$$P\left(\sum_{i=1}^t X_i = k\right) \le \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}} \times t^{-\frac{1}{2}}.$$

Proof. Write $X = \sum_{i=1}^t X_i$. Note that for $0 \le k < t$,
$$\frac{P(X = k+1)}{P(X = k)} = \frac{\binom{t}{k+1} \alpha^{k+1} (1-\alpha)^{t-k-1}}{\binom{t}{k} \alpha^{k} (1-\alpha)^{t-k}} = \frac{\alpha (t-k)}{(1-\alpha)(k+1)}.$$
Therefore, the maximum point probability $P(X = k)$ of the binomial distribution is achieved at $k = \hat{k} = \lfloor (t+1)\alpha \rfloor$, where $\lfloor x \rfloor$ denotes the integral part of $x$. Clearly $\hat{k} \in [t\alpha - 1, (t+1)\alpha]$. Thus,
$$\sqrt{\hat{k}(t - \hat{k})} \ge \min\left(\sqrt{(t\alpha - 1)(t - t\alpha + 1)},\ \sqrt{(t+1)\alpha\, (t - t\alpha - \alpha)}\right) = t \cdot \min\left(\sqrt{\left(\alpha - \tfrac{1}{t}\right)\left(1 - \alpha + \tfrac{1}{t}\right)},\ \sqrt{\left(1 + \tfrac{1}{t}\right)\alpha\left(1 - \alpha - \tfrac{\alpha}{t}\right)}\right)$$
$$\ge t \cdot \min\left(\sqrt{\left(\alpha - \tfrac{\alpha}{2}\right)(1-\alpha)},\ \sqrt{\alpha\left(1 - \alpha - \tfrac{1-\alpha}{2}\right)}\right) = \sqrt{\frac{\alpha(1-\alpha)}{2}}\; t.$$
With this preliminary inequality, we are ready to prove the lemma. Using Stirling bounds on the factorials,
$$P\left(\sum_{i=1}^t X_i = k\right) \le P\left(\sum_{i=1}^t X_i = \hat{k}\right) = \binom{t}{\hat{k}}\, \alpha^{\hat{k}} (1-\alpha)^{t-\hat{k}} = \frac{t!}{\hat{k}!\, (t-\hat{k})!}\, \alpha^{\hat{k}} (1-\alpha)^{t-\hat{k}}$$
$$\le \frac{t^{t+\frac{1}{2}}\, e^{1-t}}{\sqrt{2\pi}\, \hat{k}^{\hat{k}+\frac{1}{2}} e^{-\hat{k}} \cdot \sqrt{2\pi}\, (t-\hat{k})^{t-\hat{k}+\frac{1}{2}} e^{-(t-\hat{k})}}\; \alpha^{\hat{k}} (1-\alpha)^{t-\hat{k}} = \frac{e}{2\pi} \times \frac{1}{\sqrt{\hat{k}(t-\hat{k})}} \times \frac{t^{t+\frac{1}{2}}}{\hat{k}^{\hat{k}} (t-\hat{k})^{t-\hat{k}}} \times \alpha^{\hat{k}} (1-\alpha)^{t-\hat{k}}$$
$$\le \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}} \times t^{t-\frac{1}{2}} \times \frac{\alpha^{\hat{k}} (1-\alpha)^{t-\hat{k}}}{\hat{k}^{\hat{k}} (t-\hat{k})^{t-\hat{k}}}.$$
Let $f(x) = \frac{\alpha^x (1-\alpha)^{t-x}}{x^x (t-x)^{t-x}}$, so that $f'(x) = \left[\log\left(\frac{\alpha}{1-\alpha}\right) - \log\left(\frac{x}{t-x}\right)\right] f(x)$. Obviously $f'(x)$ is 0 when $x = \alpha t$, positive when $x < \alpha t$, and negative when $x > \alpha t$. Thus,
$$f(x) \le \frac{\alpha^{\alpha t} (1-\alpha)^{t-\alpha t}}{(\alpha t)^{\alpha t} (t - \alpha t)^{t-\alpha t}} = t^{-t}.$$
Hence,
$$P\left(\sum_{i=1}^t X_i = k\right) \le \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}} \times t^{t-\frac{1}{2}} \times f(\hat{k}) \le \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}} \times t^{-\frac{1}{2}}. \qquad ∎$$

Proof of Corollary 4.2.1. Note that when $\alpha = \frac{1}{2}$, Lemma A.2 provides a bound on $S(n)/2^n$. Plugging $\alpha = \frac{1}{2}$ into Lemma A.2 and combining with Theorem 4.2, we see that if $n > 4$, the smaller constant $\frac{\sqrt{2}\, e}{\pi}$ already suffices. If $n \le 4$, then $\frac{2\sqrt{2}\, e}{\pi} \times n^{-\frac{1}{2}} > 1$ and the claimed bound with $C_{LO} = \frac{2\sqrt{2}\, e}{\pi}$ holds trivially. ∎
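A quick numerical check of Lemma A.2 (our illustration; $t = 400$ satisfies the condition $t > \max(\frac{2}{1-\alpha}, \frac{2}{\alpha})$ for the values of $\alpha$ used):

```python
from math import comb, e, pi, sqrt

# Largest binomial(t, alpha) point mass vs. the Lemma A.2 bound
# (e / (2*pi)) * sqrt(2 / (alpha*(1-alpha))) / sqrt(t).
t = 400
for alpha in (0.5, 0.2, 0.05):
    pmf_max = max(comb(t, k) * alpha ** k * (1 - alpha) ** (t - k)
                  for k in range(t + 1))
    bound = (e / (2 * pi)) * sqrt(2 / (alpha * (1 - alpha))) / sqrt(t)
    print(f"alpha={alpha}: max pmf = {pmf_max:.5f} <= {bound:.5f}")
```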
Proof of Theorem 4.3. We write $|A|$ to denote the cardinality of a finite set $A$. Bounding each probability by 1,
$$\sum_{t \in A_0} |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right) \le \sum_{t \in A_0} \frac{1}{\sqrt{T}} \times 1 = \frac{|A_0|}{\sqrt{T}} \le \sqrt{|A_0|}. \qquad ∎$$

Proof of Theorem 4.4. We write $\epsilon$ with a subset of $[T]$ as a subscript to denote the $\epsilon_t$'s at times within that subset. For example, $\epsilon_{[T]} = \{\epsilon_1, \dots, \epsilon_T\}$. We also write $\epsilon_{-A}$ to denote the set of $\epsilon_t$'s at times within the complement of $A$ with respect to $[T]$. For brevity, write $E_t$ for the switching event
$$E_t = \left\{\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right\}.$$

Case I: $k \in \{1, \dots, K-1\}$. Writing the probability as an expectation of an indicator and conditioning on $\epsilon_{-A_k}$,
$$\sum_{t \in A_k} |g_{t,i \ominus j}|\, P(E_t) = \sum_{t \in A_k} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{[T]}}\left[\mathbf{1}_{E_t}\right] = \sum_{t \in A_k} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{-A_k}}\left[\mathbb{E}_{\epsilon_{A_k}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}]\right]$$
$$= \mathbb{E}_{\epsilon_{-A_k}}\left[\sum_{t \in A_k} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{A_k}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}]\right] \le \sup_{\epsilon_{-A_k}} \sum_{t \in A_k} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{A_k}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}].$$
Let $A_k = \{t_{k,1}, \dots, t_{k,|A_k|}\}$ with elements listed in increasing order of time index. Also define
$$D_n = D_n(\epsilon_{-A_k}) = -\sum_{\tau=1,\ \tau \in -A_k}^{t_{k,n}-1} (1+\epsilon_\tau)\, g_{\tau,i \ominus j}.$$
Then we have
$$\sum_{t \in A_k} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{A_k}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}] = \sum_{n=1}^{|A_k|} |g_{t_{k,n},i \ominus j}|\, P\left(\sum_{s=1}^{n-1} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} \ge -\sum_{s=1}^{n-1} g_{t_{k,s},i \ominus j} + D_n,\ \sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} \le -\sum_{s=1}^{n} g_{t_{k,s},i \ominus j} + D_n \,\middle|\, \epsilon_{-A_k}\right).$$
By definition of the set $A_k$, we have $|g_{t_{k,s},i \ominus j}| \ge T^{-1/2^k}$, so $T^{1/2^k} |g_{t_{k,s},i \ominus j}| \ge 1$. Let $M_k = T^{1/2^k}$. Multiplying everything inside the probability by $M_k$, we have
$$\sum_{n=1}^{|A_k|} |g_{t_{k,n},i \ominus j}|\, P\left(\sum_{s=1}^{n-1} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} M_k \ge -\sum_{s=1}^{n-1} g_{t_{k,s},i \ominus j} M_k + D_n M_k,\ \sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} M_k \le -\sum_{s=1}^{n} g_{t_{k,s},i \ominus j} M_k + D_n M_k \,\middle|\, \epsilon_{-A_k}\right)$$
$$\le \sum_{n=1}^{|A_k|} |g_{t_{k,n},i \ominus j}|\, P\left(\sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} M_k \in B_{k,n} \,\middle|\, \epsilon_{-A_k}\right),$$
where
$$B_{k,n} = \left[-\sum_{s=1}^{n} g_{t_{k,s},i \ominus j} M_k + D_n M_k - 2|g_{t_{k,n},i \ominus j}| M_k,\ -\sum_{s=1}^{n} g_{t_{k,s},i \ominus j} M_k + D_n M_k\right]$$
is a one-dimensional closed ball with radius $\Delta = |g_{t_{k,n},i \ominus j}| M_k$. Note that this ball is fixed given $\epsilon_{-A_k}$. Since $|g_{t_{k,s},i \ominus j}| M_k \ge 1$, we can apply Corollary 4.2.1 to get
$$P\left(\sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} M_k \in B_{k,n} \,\middle|\, \epsilon_{-A_k}\right) \le \frac{C_{LO}(\Delta + 1)}{\sqrt{n}} = \frac{C_{LO}(|g_{t_{k,n},i \ominus j}| M_k + 1)}{\sqrt{n}}.$$
Now we continue the derivation:
$$\sum_{n=1}^{|A_k|} |g_{t_{k,n},i \ominus j}|\, P\left(\sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} M_k \in B_{k,n} \,\middle|\, \epsilon_{-A_k}\right) \le \sum_{n=1}^{|A_k|} |g_{t_{k,n},i \ominus j}|\, \frac{C_{LO}(|g_{t_{k,n},i \ominus j}| M_k + 1)}{\sqrt{n}} \le C_{LO}\left[\sum_{n=1}^{|A_k|} \frac{|g_{t_{k,n},i \ominus j}|^2 M_k}{\sqrt{n}} + \sum_{n=1}^{|A_k|} \frac{2}{\sqrt{n}}\right].$$
Since $|g_{t_{k,n},i \ominus j}| < T^{-1/2^{k+1}}$, we have $|g_{t_{k,n},i \ominus j}|^2 M_k = |g_{t_{k,n},i \ominus j}|^2\, T^{1/2^k} < 1$. Thus we have the bound
$$C_{LO}\left[\sum_{n=1}^{|A_k|} \frac{|g_{t_{k,n},i \ominus j}|^2 M_k}{\sqrt{n}} + \sum_{n=1}^{|A_k|} \frac{2}{\sqrt{n}}\right] \le 3\, C_{LO} \sum_{n=1}^{|A_k|} \frac{1}{\sqrt{n}} \le 6\, C_{LO} \sqrt{|A_k|}.$$
Case II: $k = K$. Similarly to the previous case, we have
$$\sum_{t \in A_K} |g_{t,i \ominus j}|\, P(E_t) \le \sup_{\epsilon_{-A_K}} \sum_{t \in A_K} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{A_K}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_K}]$$
and, writing the elements of $A_K$ in increasing order as $\{t_{K,1}, \dots, t_{K,|A_K|}\}$, we get
$$\sum_{t \in A_K} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{A_K}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_K}] \le \sum_{n=1}^{|A_K|} |g_{t_{K,n},i \ominus j}|\, P\left(\sum_{s=1}^{n} \epsilon_{t_{K,s}} g_{t_{K,s},i \ominus j} M_K \in B_{K,n} \,\middle|\, \epsilon_{-A_K}\right),$$
where
$$D_n = D_n(\epsilon_{-A_K}) = -\sum_{\tau=1,\ \tau \in -A_K}^{t_{K,n}-1} (1+\epsilon_\tau)\, g_{\tau,i \ominus j}, \qquad M_K = T^{1/2^K} \le 2,$$
and
$$B_{K,n} = \left[-\sum_{s=1}^{n} g_{t_{K,s},i \ominus j} M_K + D_n M_K - 2|g_{t_{K,n},i \ominus j}| M_K,\ -\sum_{s=1}^{n} g_{t_{K,s},i \ominus j} M_K + D_n M_K\right]$$
is a one-dimensional closed ball with radius $\Delta = |g_{t_{K,n},i \ominus j}| M_K$. Note that this ball is fixed given $\epsilon_{-A_K}$, and hence we can apply Corollary 4.2.1 to get
$$\sum_{n=1}^{|A_K|} |g_{t_{K,n},i \ominus j}|\, P\left(\sum_{s=1}^{n} \epsilon_{t_{K,s}} g_{t_{K,s},i \ominus j} M_K \in B_{K,n} \,\middle|\, \epsilon_{-A_K}\right) \le \sum_{n=1}^{|A_K|} |g_{t_{K,n},i \ominus j}|\, \frac{C_{LO}(|g_{t_{K,n},i \ominus j}| M_K + 1)}{\sqrt{n}} \le C_{LO}(4 M_K + 2) \sum_{n=1}^{|A_K|} \frac{1}{\sqrt{n}} \le 20\, C_{LO} \sqrt{|A_K|}.$$
Combining the two cases proves the theorem. ∎

Proof of Theorem 4.5. Consider a game with two strategies, i.e., $N = 2$. We refer to player $i$ as the "player" and the other players collectively as the "environment". On odd rounds, the environment plays the payoff vector $(0, 0)$. This ensures that after odd rounds, the environment will know exactly which strategy the player will choose, as long as there is no tie in the player's sampled cumulative payoffs: no matter whether the Rademacher random variable is $-1$ or $+1$, the next strategy played will be the same as the strategy the player just played. On even rounds $t$, the environment plays the payoff vector $(0, 1 - 0.1^t)$ if the player chose the first strategy in the previous round, and $(1 - 0.1^t, 0)$ if the player chose the second strategy in the previous round.

Under this scenario, we make the critical observation that, as long as the set of sampled time points is not empty (it is empty with probability $(\frac{1}{2})^{t-1}$ on round $t$), there will not be a tie in the cumulative payoffs of the two strategies. Moreover, without a tie, the player will not be able to switch strategy on even rounds, and so will not accumulate any payoff. Therefore, the total payoff acquired by the player by following the sampled fictitious play procedure will be at most $\sum_{t=1}^\infty (\frac{1}{2})^{t-1} = 2$. However, as is evident from the environment's procedure, the total payoff of the two strategies combined is at least $0.45\, T$, and thus the best strategy has a payoff no less than $0.225\, T$ by the pigeonhole principle. Hence, the expected regret of the player is at least $0.225\, T - 2$, which is linear in $T$. ∎
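To see Theorem 4.5 in action, here is a simulation sketch of the adaptive environment from the proof (our reading of the construction; the decaying bonus $1 - 0.1^t$ is taken from the text, and all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 2000
eps = rng.choice([-1, 1], size=T)                   # one shared stream, as in (4)
G, payoff, moves = np.zeros((0, 2)), 0.0, []
for t in range(1, T + 1):
    perturbed = (1 + eps[: t - 1]) @ G              # tilde-G_{t-1}
    best = np.flatnonzero(perturbed == perturbed.max())
    k = int(rng.choice(best))
    moves.append(k)
    g = np.zeros(2)                                 # odd rounds: payoff vector (0, 0)
    if t % 2 == 0:
        g[1 - moves[-2]] = 1 - 0.1 ** t             # reward the strategy NOT played at t-1
    payoff += g[k]
    G = np.vstack([G, g])
print(f"single-stream regret: {G.sum(axis=0).max() - payoff:.1f} over T={T} rounds")
```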
Supplementary Material for "Sampled Fictitious Play is Hannan Consistent"

Appendix B  Counterexample Showing Polynomial N Dependence

In this section, we present a counterexample which shows that the sampled fictitious play algorithm (2) with Bernoulli sampling (3) has a lower bound of $\Omega(N)$ on its expected regret when $T$ is $2N$. This is consistent with a lower bound for the expected regret of order $\Omega(\sqrt{NT})$. However, we are unable to extend the construction to arbitrary $T$. This counterexample is from [Warmuth, 2015] and we learned it in private communication with Manfred Warmuth and Gergely Neu.

Theorem B.1 (Warmuth [2015]). The sampled fictitious play algorithm has expected regret of $\Omega(N)$ when $T$ is $2N$ and $N \to \infty$.

Proof. Consider the $N \times 2N$ payoff matrix
$$\begin{pmatrix} 0 & -1 & -1 & -1 & -1 & -1 & \cdots & -1 \\ -1 & 0 & 0 & -1 & -1 & -1 & \cdots & -1 \\ -1 & 0 & -1 & 0 & 0 & -1 & \cdots & -1 \\ -1 & 0 & -1 & 0 & -1 & 0 & \cdots & \vdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 0 & -1 & 0 & -1 & 0 & \cdots & \end{pmatrix}.$$
Each row represents a strategy and each column represents the payoffs of the strategies in a particular round. In the $m$th odd round, i.e., in the $(2m-1)$-th round, the adversary assigns a payoff of $-1$ to all strategies except strategy $m$, which gets a payoff of 0. In the $m$th even round, i.e., the $2m$-th round, the adversary assigns a payoff of $-1$ to strategies 1 through $m$ and a payoff of 0 to the others. Note that, in all rounds after $2m$, strategy $m$ will always be given a payoff of $-1$. Overall, we have $N$ strategies and $2N$ rounds, with the best constant strategy being the last strategy, which accumulates a payoff of $-N$.

To analyze the expected regret, we consider even and odd rounds separately. Note that as long as round $2m-1$ is picked in the sampled history, which happens with probability $\frac{1}{2}$, the algorithm will not choose any strategy from $m+1$ through $N$ at round $2m$. This is because they all have payoffs identical to strategy $m$ prior to round $2m-1$, and strategy $m$ looks better on round $2m-1$. So the algorithm will pick a strategy from 1 through $m$ on round $2m$, all of which acquire a gain of $-1$. Therefore, the algorithm will acquire an expected payoff of at most $-\frac{1}{2}$ on even rounds.

Next we consider odd rounds. On round $2m-1$, we observe that the leader set (i.e., the argmax in (2)) either includes strategy $m$ or not. If it includes strategy $m$, it will additionally include strategies $m+1$ through $N$ as well, since they have had identical payoffs in the past. It may possibly also include some strategies in the set $\{1, \dots, m-1\}$. Since the algorithm randomly picks a strategy from the leader set, and all but one of them has a payoff of $-1$ on round $2m-1$, the expected gain of the algorithm is at most $-\frac{N-m}{N-m+1}$. If the leader set does not include strategy $m$, then the expected gain is exactly $-1$, since strategy $m$ is the only one with zero payoff at round $2m-1$. Therefore, the algorithm will acquire an expected payoff of at most $-\frac{N-m}{N-m+1}$ on odd rounds.

Hence, the expected regret of sampled fictitious play under this scenario with $N$ strategies and $2N$ rounds is at least, for some $c > 0$,
$$\mathbb{E}[R_T] \ge -N - \left(-\sum_{m=1}^N \frac{N-m}{N-m+1} - \frac{N}{2}\right) \ge \frac{N}{2} - c \log(N) = \Omega(N). \qquad ∎$$
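The construction is easy to run; the following sketch (ours, with hypothetical helper names) builds the payoff matrix above and plays sampled fictitious play (2)-(3) against it:

```python
import numpy as np

def warmuth_payoffs(N):
    """Round-by-round payoff vectors of the Appendix B construction:
    g[t - 1] is the payoff vector of round t, for t = 1, ..., 2N."""
    g = np.zeros((2 * N, N))
    for m in range(1, N + 1):
        g[2 * m - 2] = -1.0                         # round 2m-1: all -1 ...
        g[2 * m - 2, m - 1] = 0.0                   # ... except strategy m
        g[2 * m - 1, :m] = -1.0                     # round 2m: -1 for strategies 1..m
    return g

rng = np.random.default_rng(5)
N = 50
g = warmuth_payoffs(N)
payoff = 0.0
for t in range(2 * N):
    eps = rng.choice([-1, 1], size=t)               # fresh randomness each round
    sampled = ((1 + eps) / 2) @ g[:t]
    best = np.flatnonzero(sampled == sampled.max())
    payoff += g[t, rng.choice(best)]
print(f"N={N}: regret = {g.sum(axis=0).max() - payoff:.1f}  (Theorem B.1: Omega(N))")
```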
Appendix C  Asymmetric Probabilities

In this section we prove that for $\{-1,0,1\}$-valued payoffs and arbitrary $\alpha \in (0,1)$ instead of just $1/2$, the expected regret is $O(\sqrt{T})$, where the constant hidden in the $O(\cdot)$ notation blows up in either of the two extreme cases $\alpha \to 0$ and $\alpha \to 1$. Note that we are still considering the single stream version (4) of the learning procedure.

Theorem C.1. For $\alpha \in (0,1)$ and $g_t \in \{-1,0,1\}^N$, assuming that $T > \max(\frac{2}{1-\alpha}, \frac{2}{\alpha})$, the expected regret satisfies
$$\mathbb{E}[R_T] \le \frac{40 N^2 Q_\alpha}{\alpha} \sqrt{T},$$
where $Q_\alpha = \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}}$.

Proof. We begin with the inequality obtained in the proof of Theorem 4.1:
$$\mathbb{E}[R_T] \le \frac{N^2}{\alpha} \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right). \tag{5}$$
As before, we fix $i$ and $j$, and bound the expression
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} (1+\epsilon_\tau)\, g_{\tau,i \ominus j} \ge 0,\ \sum_{\tau=1}^{t} (1+\epsilon_\tau)\, g_{\tau,i \ominus j} \le 0\right).$$
The rest of the proof is similar to the proof of Theorem 4.4. Define the classes $A_k = \{t \in [T] : g_{t,i \ominus j} = k\}$ for $k \in \{-2,-1,1,2\}$. Since $|g_{t,i \ominus j}| \le 2$ and rounds with $g_{t,i \ominus j} = 0$ contribute nothing to the sum, we have
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} (1+\epsilon_\tau)\, g_{\tau,i \ominus j} \ge 0,\ \sum_{\tau=1}^{t} (1+\epsilon_\tau)\, g_{\tau,i \ominus j} \le 0\right) \le 2 \sum_{k \in \{-2,-1,1,2\}} \sum_{t \in A_k} P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right).$$
For any $k \in \{-2,-1,1,2\}$, writing $E_t$ for the switching event as in the proof of Theorem 4.4,
$$\sum_{t \in A_k} P(E_t) \le \sup_{\epsilon_{-A_k}} \sum_{t \in A_k} \mathbb{E}_{\epsilon_{A_k}}\left[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}\right].$$
Let $A_k = \{t_{k,1}, \dots, t_{k,|A_k|}\}$ with elements listed in increasing order of time index. Also define, for $n \in \{1, \dots, |A_k|\}$,
$$D_n = D_n(\epsilon_{-A_k}) = -\sum_{\tau=1,\ \tau \in -A_k}^{t_{k,n}-1} \epsilon_\tau g_{\tau,i \ominus j} - \sum_{\tau=1,\ \tau \in -A_k}^{t_{k,n}-1} g_{\tau,i \ominus j}.$$
We then proceed as follows:
$$\sum_{t \in A_k} \mathbb{E}_{\epsilon_{A_k}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}] = \sum_{n=1}^{|A_k|} P\left(\sum_{s=1}^{n-1} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} \ge -\sum_{s=1}^{n-1} g_{t_{k,s},i \ominus j} + D_n,\ \sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} \le -\sum_{s=1}^{n} g_{t_{k,s},i \ominus j} + D_n \,\middle|\, \epsilon_{-A_k}\right)$$
$$\le \sum_{n=1}^{|A_k|} P\left(\bigcup_{u=0}^{4} \left\{\sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} = -\sum_{s=1}^{n-1} g_{t_{k,s},i \ominus j} + D_n - u\right\} \,\middle|\, \epsilon_{-A_k}\right)$$
$$\le \sum_{n=1}^{|A_k|} \sum_{u=0}^{4} P\left(\sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} = -\sum_{s=1}^{n-1} g_{t_{k,s},i \ominus j} + D_n - u \,\middle|\, \epsilon_{-A_k}\right) \le 5 \sum_{n=1}^{|A_k|} \frac{Q_\alpha}{\sqrt{n}} \le 10\, Q_\alpha \sqrt{|A_k|},$$
where $Q_\alpha = \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}}$ comes from Lemma A.2. Putting things together, we have
$$\mathbb{E}[R_T] \le \frac{20 N^2 Q_\alpha}{\alpha} \sum_{k \in \{-2,-1,1,2\}} \sqrt{|A_k|} \le \frac{40 N^2 Q_\alpha}{\alpha} \sqrt{T}. \qquad ∎$$