Sampled Fictitious Play is Hannan Consistent
Zifan Li (zifanli@umich.edu) and Ambuj Tewari (tewaria@umich.edu)

April 12, 2017

Abstract. Fictitious play is a simple and widely studied adaptive heuristic for playing repeated games. It is well known that fictitious play fails to be Hannan consistent. Several variants of fictitious play, including regret matching, generalized regret matching and smooth fictitious play, are known to be Hannan consistent. In this note, we consider sampled fictitious play: at each round, the player samples past times and plays the best response to previous moves of the other players at the sampled time points. We show that sampled fictitious play, using Bernoulli sampling, is Hannan consistent. Unlike several existing Hannan consistency proofs that rely on concentration of measure results, ours instead uses anti-concentration results from Littlewood-Offord theory.

Keywords: adaptive heuristics, learning, repeated games, Hannan consistency, fictitious play

1 Introduction

In the setting of repeated games played in discrete time, the (unconditional) regret of a player, at any time point, is the difference between the payoffs she would have received had she played the best, in hindsight, constant strategy throughout, and the payoffs she did in fact receive. Hannan [1957] showed the existence of procedures with a "no-regret" property: procedures for which the average regret per time goes to zero for a large number of time points. His procedure was a simple modification of fictitious play: random perturbations are added to the cumulative payoffs of every strategy so far, and the player picks the strategy with the largest perturbed cumulative payoff. No regret procedures are also called "universally consistent" [Fudenberg and Levine, 1998, Section 4.7] or "Hannan consistent" [Cesa-Bianchi and Lugosi, 2006, Section 4.2].

It is well known that smoothing the cumulative payoffs before computing the best response is crucial to achieve Hannan consistency. One way to achieve smoothness is through stochastic smoothing, or adding perturbations. Without perturbations, the procedure becomes identical to fictitious play, which fails to be Hannan consistent [Cesa-Bianchi and Lugosi, 2006, Exercise 3.8]. Besides Hannan's modification, other variants of fictitious play are also known to be Hannan consistent, including (unconditional) regret matching, generalized (unconditional) regret matching and smooth fictitious play (for an overview, see Hart and Mas-Colell [2013, Section 10.9]).

In this note, we consider another variant of fictitious play, namely sampled fictitious play. Here, the player samples past time points using some (randomized) sampling scheme and plays the best response to the moves of the other players restricted to the set of sampled time points. Sampled fictitious play has been considered by other authors in different contexts. Kaniovski and Young [1995] established convergence to Nash equilibrium in $2 \times 2$ games. Gilliland and Jung [2006] provided regret bounds for the game of matching pennies. Lambert III et al. [2005] considered games with identical payoffs for all players and used sampled fictitious play to solve large-scale optimization problems.
To the best of our knowledge, it is not known whether sampled fictitious play is Hannan consistent without making any assumptions on the form of the game and payoffs. The purpose of this note is to show that it is indeed Hannan consistent when used with a natural sampling scheme, namely Bernoulli sampling.

2 Preliminaries

Consider a game in strategic form where $M$ is the number of players, $S_i$ is the set of strategies for player $i$, and $u_i : \prod_{j=1}^M S_j \to \mathbb{R}$ is the payoff function for player $i$. For simplicity, assume that the payoff functions of all players are $[-1,1]$ bounded. We also assume that the number of pure strategies is the same for each player and that $S_i = \{1,\dots,N\}$. Let $S = \prod_{i=1}^M S_i$ be the set of $M$-tuples of player strategies. For $s = (s_i)_{i=1}^M \in S$, we denote the strategies of players other than $i$ by $s_{-i} = (s_j)_{1 \le j \le M,\, j \ne i}$.

The game is played repeatedly over (discrete) time $t = 1, 2, \dots$. A learning procedure for player $i$ is a procedure that maps the history $h_{t-1} = (s_\tau)_{\tau=1}^{t-1}$ of plays just prior to time $t$ to a strategy $s_{t,i} \in S_i$. The learning procedure is allowed to be randomized, i.e., player $i$ has access to a stream of random variables $\epsilon_1, \epsilon_2, \dots$ and she is allowed to use $\epsilon_1, \dots, \epsilon_{t-1}$, in addition to $h_{t-1}$, to choose $s_{t,i}$.

Player $i$'s regret at time $t$ is defined as
$$R_{t,i} = \max_{k \in S_i} \sum_{\tau=1}^{t} u_i(k, s_{\tau,-i}) - \sum_{\tau=1}^{t} u_i(s_\tau).$$
This compares the player's cumulative payoff with the payoff she could have received had she selected the best constant (over time) strategy $k$ with knowledge of the other players' moves. A learning procedure for player $i$ is said to be Hannan consistent if and only if
$$\limsup_{t \to \infty} \frac{R_{t,i}}{t} \le 0 \quad \text{almost surely}.$$
Hannan consistency is also known as the "no-regret" property and as "universal consistency". The term "universal" refers to the fact that the regret per time goes to zero irrespective of what the other players do.

Fictitious play is a (deterministic) learning procedure where player $i$ plays the best response to the plays of the other players so far. That is,
$$s_{t,i} \in \arg\max_{k \in \{1,\dots,N\}} \sum_{\tau=1}^{t-1} u_i(k, s_{\tau,-i}). \tag{1}$$
As mentioned earlier, fictitious play is not Hannan consistent. However, consider the following modification of fictitious play, called sampled fictitious play. At time $t$, the player randomly selects a subset $S_t \subseteq \{1,\dots,t-1\}$ of previous time points and plays the best response to the other players' moves only over $S_t$. That is,
$$s_{t,i} \in \arg\max_{k \in \{1,\dots,N\}} \sum_{\tau \in S_t} u_i(k, s_{\tau,-i}). \tag{2}$$
If multiple strategies achieve the maximum, then the tie is broken uniformly at random, and independently of all previous randomness. Also, if $S_t$ turns out to be empty (an event that happens with probability exactly $2^{-(t-1)}$ under the Bernoulli sampling described below), we adopt the convention that the argmax above includes all $N$ strategies.

In this note, we consider Bernoulli sampling, i.e., any particular round $\tau \in \{1,\dots,t-1\}$ is included in $S_t$ independently with probability $1/2$. More specifically, if $\epsilon^{(t)}_1, \dots, \epsilon^{(t)}_{t-1}$ are i.i.d. symmetric Bernoulli (or Rademacher) random variables taking values in $\{-1,+1\}$, then
$$S_t = \{\tau \in \{1,\dots,t-1\} : \epsilon^{(t)}_\tau = +1\} \tag{3}$$
and therefore
$$\sum_{\tau \in S_t} u_i(k, s_{\tau,-i}) = \sum_{\tau=1}^{t-1} \frac{1+\epsilon^{(t)}_\tau}{2}\, u_i(k, s_{\tau,-i}).$$
Note that the procedure defined by the combination of (2) and (3) is completely parameter free, i.e., there is no tuning parameter that has to be carefully tuned in order to obtain the desired convergence properties.

3 Result and Discussion

Our main result is the following.

Theorem 3.1. Sampled fictitious play (2) with Bernoulli sampling (3) is Hannan consistent.

Before we move on to the proof, a few remarks are in order.

Computational tractability. It is a simple but important observation that the form of the optimization problem solved by fictitious play (1) is exactly the same as the optimization problem solved by sampled fictitious play (2). This can be very useful when the player has a large strategy set and does not want to enumerate all strategies to solve the maximization involved in both fictitious play and its sampled version. For example, Lambert III et al. [2005] describe their computational experience with sampled fictitious play in the context of a dynamic traffic assignment problem.

Rate of convergence. Our proof gives the rate of convergence of (expected) average regret as $O(N^2 \sqrt{\log\log t / t})$, where the constant hidden in the $O(\cdot)$ notation is small and explicit. It is known that the optimal rate is $O(\sqrt{\log N / t})$ [Cesa-Bianchi and Lugosi, 2006, Section 2.10]. Therefore, our rate of convergence is almost optimal in $t$ but severely suboptimal in $N$. This raises several interesting questions. What is the best bound possible for sampled fictitious play with Bernoulli sampling? Is there a sampling scheme for which the sampled fictitious play procedure achieves the optimal rate of convergence? The first question is partially answered by Theorem B.1 in Appendix B, which states that the dependence on $N$ is likely to be polynomial instead of logarithmic, but there is still some gap between the lower bound and the upper bound we provide.

Asymmetric probabilities. Instead of using symmetric Bernoulli probabilities, we can choose $\epsilon^{(t)}_\tau$ such that $P(\epsilon^{(t)}_\tau = +1) = \alpha$. As $\alpha \to 1$, the learning procedure becomes fictitious play, and as $\alpha \to 0$, it selects strategies uniformly at random. Therefore, it is natural to expect that the regret bound will blow up near the two extremes $\alpha = 1$ and $\alpha = 0$. We can make this intuition precise, but only for $\{-1,0,1\}$-valued payoffs (instead of $[-1,1]$-valued). For details, see Appendix C in the supplementary material.

Follow the perturbed leader. Note that
$$\arg\max_{k \in \{1,\dots,N\}} \sum_{\tau=1}^{t-1} \frac{1+\epsilon^{(t)}_\tau}{2}\, u_i(k, s_{\tau,-i}) = \arg\max_{k \in \{1,\dots,N\}} \left[\sum_{\tau=1}^{t-1} u_i(k, s_{\tau,-i}) + \sum_{\tau=1}^{t-1} \epsilon^{(t)}_\tau u_i(k, s_{\tau,-i})\right].$$
Therefore, we can think of sampled fictitious play as adding a random perturbation to the expression that fictitious play optimizes. Such algorithms are referred to as "follow the perturbed leader" (FPL) in the computer science literature ("fictitious play" is known as "follow the leader"). This family was originally proposed by Hannan [1957] and popularized by Kalai and Vempala [2005]. Closer to this paper are the FPL algorithms of Devroye et al. [2013] and van Erven et al. [2014]. However, none of these papers considered sampled fictitious play.
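To make the procedure concrete, here is a minimal simulation sketch in Python (ours, not from the paper; the helper name `sampled_fp_move` and the matching-pennies demo are illustrative assumptions). The line computing `sampled_payoffs` is exactly the sum over $S_t$ written as $\sum_{\tau=1}^{t-1} \frac{1+\epsilon_\tau}{2} u_i(k, s_{\tau,-i})$, which also makes the perturbed-leader reading above visible.

```python
import numpy as np

def sampled_fp_move(G, rng):
    """One move of sampled fictitious play, eqs. (2)-(3).

    G is a (t-1) x N array with G[tau, k] = u_i(k, s_{tau,-i}).  Each past
    round enters the sample independently with probability 1/2; an empty
    sample (including t = 1) makes every strategy a best response, and
    ties are broken uniformly at random."""
    t_minus_1, _ = G.shape
    eps = rng.choice([-1, 1], size=t_minus_1)       # Rademacher draws, eq. (3)
    sampled_payoffs = ((1 + eps) / 2) @ G           # sum of payoffs over S_t
    best = np.flatnonzero(sampled_payoffs == sampled_payoffs.max())
    return rng.choice(best)                         # uniform tie-breaking

# Demo: player 1 in matching pennies against a uniformly random opponent.
rng = np.random.default_rng(0)
u = np.array([[1.0, -1.0], [-1.0, 1.0]])            # u[k, j]: payoff if we play k, opponent plays j
G, total, T = np.empty((0, 2)), 0.0, 5000
for t in range(T):
    k = sampled_fp_move(G, rng)
    j = rng.integers(2)
    total += u[k, j]
    G = np.vstack([G, u[:, j]])                     # record the payoff vector g_t
print(f"average regret: {(G.sum(axis=0).max() - total) / T:.4f}")
```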
Extension to conditional (or internal) regret. In this paper we focus on unconditional (or external) regret. Other notions of regret, especially conditional (or internal) regret, can also be considered. Internal regret measures the worst regret, over the $N(N-1)$ choices of $k \ne k'$, of the form "every time strategy $k$ was picked, strategy $k'$ should have been picked instead". There are generic conversions [Stoltz and Lugosi, 2005, Blum and Mansour, 2007] that will convert any learning procedure with small external regret to one with small internal regret. These conversions, however, require access to the probability distribution over strategies at each time point. This probability distribution can be approximated, to arbitrary accuracy, by making the choice of the strategy in (2) multiple times, each time selecting the random subset $S_t$ independently. However, doing so and using a generic conversion from external to internal regret will lead to a cumbersome overall algorithm. It would be nicer to design a simpler sampling based learning procedure with small internal regret.

4 Proof of the Main Result

We break the proof of our main result into several steps. The first and third steps involve fairly standard arguments in this area. Our main innovations are in step two.

4.1 Step 1: From Regret to Switching Probabilities

In this step, we assume that players other than player $i$ (the "opponents") are oblivious, i.e., they do not adapt to what player $i$ does. Mathematically, this means that the sequence $s_{t,-i}$ does not depend on the moves $s_{t,i}$ of player $i$. We will prove a uniform regret bound that holds for all deterministic payoff sequences $\{s_{t,-i}\}_{t=1}^T$, from which we can conclude that the same bound holds for oblivious but random payoff sequences as well.

Since player $i$ is fixed for the rest of the proof, we will not carry the index $i$ in our notation further. Let the vector $g_t \in [-1,1]^N$ be defined as $g_{t,k} = u_i(k, s_{t,-i})$ for $k \in \{1,\dots,N\}$. Moreover, we denote player $i$'s move $s_{t,i}$ as $k_t$. With this notation, the regret at time $T$ equals
$$R_T = \max_{k \in \{1,\dots,N\}} \sum_{t=1}^T g_{t,k} - \sum_{t=1}^T g_{t,k_t}.$$
In this step, we will look at the expected regret. Because the opponents are oblivious, this equals
$$\mathbb{E}[R_T] = \max_{k \in \{1,\dots,N\}} \sum_{t=1}^T g_{t,k} - \mathbb{E}\left[\sum_{t=1}^T g_{t,k_t}\right] = \max_{k \in \{1,\dots,N\}} \sum_{t=1}^T g_{t,k} - \sum_{t=1}^T \mathbb{E}[g_{t,k_t}].$$
Recall that
$$k_t \in \arg\max_{k \in \{1,\dots,N\}} \sum_{\tau=1}^{t-1} \frac{1+\epsilon^{(t)}_\tau}{2}\, g_{\tau,k}.$$
Since the $g_t$'s are fixed vectors, by independence we see that the distribution of $k_t$ is exactly the same whether or not we share the Rademacher random variables across time points. Therefore, we do not have to draw a fresh sample $\epsilon^{(t)}_1, \dots, \epsilon^{(t)}_{t-1}$ at time $t$. Instead, we fix a single stream $\epsilon_1, \epsilon_2, \dots$ of i.i.d. Rademacher random variables and set $(\epsilon^{(t)}_1, \dots, \epsilon^{(t)}_{t-1}) = (\epsilon_1, \dots, \epsilon_{t-1})$ for all $t$. With this reduction in the number of random variables used, we now have
$$k_t \in \arg\max_{k \in \{1,\dots,N\}} \sum_{\tau=1}^{t-1} (1+\epsilon_\tau)\, g_{\tau,k}. \tag{4}$$
We define $G_t = \sum_{\tau=1}^t g_\tau$, the cumulative payoff vector at time $t$. Define $\tilde{g}_t = (1+\epsilon_t)\, g_t$ and $\tilde{G}_t = \sum_{\tau=1}^t \tilde{g}_\tau$. We also define $g_{t,i \ominus j} = g_{t,i} - g_{t,j}$ and $\tilde{g}_{t,i \ominus j} = \tilde{g}_{t,i} - \tilde{g}_{t,j}$.
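As a quick empirical companion to Step 1 (our sketch, with an arbitrary illustrative payoff sequence), the following estimates $\mathbb{E}[R_T]$ for the single stream version (4) against a fixed oblivious payoff sequence; the vector `perturbed` below is exactly $\tilde{G}_{t-1}$.

```python
import numpy as np

def single_stream_regret(g, rng):
    """Regret of the single stream version (4) on a fixed T x N payoff
    array g, drawing one Rademacher stream up front and reusing it."""
    T, N = g.shape
    eps = rng.choice([-1, 1], size=T)
    payoff = 0.0
    for t in range(T):
        perturbed = (1 + eps[:t]) @ g[:t]           # tilde-G_{t-1}, an N-vector
        best = np.flatnonzero(perturbed == perturbed.max())
        payoff += g[t, rng.choice(best)]            # play k_t, collect g_{t,k_t}
    return g.sum(axis=0).max() - payoff

rng = np.random.default_rng(1)
g = rng.uniform(-1, 1, size=(2000, 3))              # fixed [-1,1]-bounded payoffs
print(f"estimated E[R_T]: {np.mean([single_stream_regret(g, rng) for _ in range(20)]):.1f}")
```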
With these definitions, we have
$$\tilde{G}_{t,i \ominus j} = \tilde{G}_{t,i} - \tilde{G}_{t,j} = \sum_{\tau=1}^t \tilde{g}_{\tau,i} - \sum_{\tau=1}^t \tilde{g}_{\tau,j} = \sum_{\tau=1}^t (1+\epsilon_\tau)(g_{\tau,i} - g_{\tau,j}) = \sum_{\tau=1}^t (1+\epsilon_\tau)\, g_{\tau,i \ominus j}.$$
The following result upper bounds the regret in terms of downward zero-crossings of the process $\tilde{G}_{t,i \ominus j}$, i.e., the times $t$ when it switches from being non-negative at time $t-1$ to non-positive at time $t$.

Theorem 4.1. We have the following upper bound on the expected regret:
$$\mathbb{E}[R_T] \le 2N^2 \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right).$$

The proof of this theorem can be found in Appendix A. We now focus on bounding the switching probabilities for a fixed pair $i, j$.

4.2 Step 2: Bounding Switching Probabilities Using Littlewood-Offord Theory

Our strategy is to do a "multi-scale" analysis and, within each scale, apply Littlewood-Offord theory to bound the switching probabilities. The need for a multi-scale argument arises from the requirement, in the Littlewood-Offord theorem (see Theorem 4.2 below), of a lower bound on the step sizes of random walks. We partition the set of $T$ time points $[T] := \{1,\dots,T\}$ into $K+1$ disjoint sets at different scales, denoted $\{A_k\}_{k=0}^K$, where
$$A_k = \begin{cases} \{t \in [T] : |g_{t,i \ominus j}| \le \tfrac{1}{\sqrt{T}}\} & k = 0, \\ \{t \in [T] : T^{-1/2^k} < |g_{t,i \ominus j}| \le T^{-1/2^{k+1}}\} & k = 1,\dots,K-1, \\ \{t \in [T] : T^{-1/2^K} < |g_{t,i \ominus j}| \le 2\} & k = K. \end{cases}$$
Note that $A_k$ actually depends on $i, j$ as well, but for the sake of clarity we drop this dependence from the notation. The cardinality of a finite set $A$ will be denoted by $|A|$. The number $K+1$ of different scales is determined by
$$K = \arg\min\{k \in \mathbb{N} : T^{-1/2^k} \ge 1/2\}.$$
For all $t, i$, we have $g_{t,i} \in [-1,1]$, so $|g_{t,i \ominus j}| \in [0,2]$. The scales here are chosen such that $K$ is not very large (of order $O(\log\log T)$) and the sets still cover the entire range of the payoffs. It easily follows that
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right) = \sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right)$$
$$= \sum_{k=0}^K \sum_{t \in A_k} |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right).$$
We now want to argue that the probabilities involved above are small. The crucial observation is that, if a switch occurs, then the random sum $\sum_{\tau=1}^t \epsilon_\tau g_{\tau,i \ominus j}$ has to lie in a sufficiently small interval. Such "small ball" probabilities are exactly what the classic Littlewood-Offord theorem controls.

Theorem 4.2 (Littlewood-Offord Theorem of Erdős, Theorem 3 of Erdős [1945]). Let $x_1, \dots, x_n$ be $n$ real numbers such that $|x_i| \ge 1$ for all $i$. For any given radius $\Delta > 0$, the small ball probability satisfies
$$\sup_B P(\epsilon_1 x_1 + \cdots + \epsilon_n x_n \in B) \le \frac{S(n)}{2^n} (\lfloor \Delta \rfloor + 1),$$
where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. Rademacher random variables, $B$ ranges over all closed balls (intervals) of radius $\Delta$, $\lfloor x \rfloor$ refers to the integral part of $x$, and $S(n)$ is the largest binomial coefficient belonging to $n$.

Using elementary calculations to upper bound $S(n)/2^n$ gives us the following corollary.

Corollary 4.2.1. Under the same notation and conditions as Theorem 4.2, we have
$$\sup_B P(\epsilon_1 x_1 + \cdots + \epsilon_n x_n \in B) \le C_{LO} (\lfloor \Delta \rfloor + 1) \frac{1}{\sqrt{n}},$$
where $C_{LO} = \frac{2\sqrt{2}\, e}{\pi} < 3$.

The proof of this corollary can be found in Appendix A.
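The following Monte Carlo check (ours, purely illustrative) compares an empirical small ball probability against the bound of Corollary 4.2.1 with $C_{LO} < 3$; the step sizes drawn from $[1,2]$ are an arbitrary choice satisfying $|x_i| \ge 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta, trials = 100, 1.0, 50_000
x = rng.uniform(1.0, 2.0, size=n)                   # step sizes with |x_i| >= 1
sums = rng.choice([-1, 1], size=(trials, n)) @ x    # Rademacher sums eps . x
# Probe closed balls [c - delta, c + delta] on a grid and keep the worst one.
centers = np.linspace(-3 * np.sqrt(n), 3 * np.sqrt(n), 301)
small_ball = max(np.mean(np.abs(sums - c) <= delta) for c in centers)
bound = 3 * (np.floor(delta) + 1) / np.sqrt(n)      # C_LO (floor(Delta)+1) / sqrt(n)
print(f"empirical sup small-ball prob ~ {small_ball:.4f} <= bound {bound:.4f}")
```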
The scale of payoffs for time periods in $A_0$ is so small that we do not need any Littlewood-Offord theory to control their contribution to the regret. Simply bounding the probabilities by 1 gives us the following.

Theorem 4.3. The following upper bound holds for switching probabilities for time periods within $A_0$:
$$\sum_{t \in A_0} |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right) \le \sqrt{|A_0|} \le 20\, C_{LO} \sqrt{|A_0|},$$
where $C_{LO} > 1$.

The proof of this theorem can also be found in Appendix A.

The real work lies in controlling the switching probabilities for payoffs at intermediate scales. The idea in the proof of the results is to condition on the $\epsilon_t$'s outside $A_k$. Then the probability of interest is written as a small ball event in terms of the $\epsilon_t$'s in $A_k$. Applying the Littlewood-Offord theorem concludes the argument.

Theorem 4.4. For any $k \in \{1,\dots,K\}$, we have
$$\sum_{t \in A_k} |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right) \le 20\, C_{LO} \sqrt{|A_k|}.$$

Again, the proof of this theorem is deferred to Appendix A. We finally have all the ingredients in place to control the switching probabilities.

Corollary 4.4.1. The following upper bound on the switching probabilities holds:
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right) \le 20\, C_{LO} \sqrt{T \log_2(4 \log_2 T)}.$$

Proof. Using Theorem 4.3 and Theorem 4.4 to bound, scale by scale, the decomposition displayed at the beginning of this step, we get
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right) = \sum_{k=0}^K \sum_{t \in A_k} |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right) \le \sum_{k=0}^K 20\, C_{LO} \sqrt{|A_k|}.$$
Since $\sum_{k=0}^K \sqrt{|A_k|} \le \sqrt{K+1} \cdot \sqrt{\sum_{k=0}^K |A_k|}$ and $\sum_{k=0}^K |A_k| = T$, we have
$$\sum_{k=0}^K 20\, C_{LO} \sqrt{|A_k|} \le 20\, C_{LO} \sqrt{(K+1)\, T}.$$
By definition of $K$, we have $T^{-1/2^{K-1}} < \frac{1}{2}$, so $K \le \log_2(\log_2(T)) + 1$ and hence $K + 1 \le \log_2(4 \log_2 T)$, which finishes the proof. ∎

Thus, for all $i, j \in \{1,\dots,N\}$ with $i \ne j$, we have
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right) \le 20\, C_{LO} \sqrt{T \log_2(4 \log_2 T)},$$
which, when plugged into Theorem 4.1, immediately yields the following corollary.

Corollary 4.4.2. Against an oblivious opponent, both the single stream version (4) and the fresh-randomization-at-each-round version (2) of sampled fictitious play enjoy the following bound on the expected regret:
$$\mathbb{E}[R_T] \le 40\, C_{LO} N^2 \sqrt{T \log_2(4 \log_2 T)}.$$
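For a feel of the quantities appearing in Step 2, this snippet (ours; the payoff-difference sequence is an arbitrary stand-in) builds the partition $\{A_k\}$, computes $K$, and evaluates the bound of Corollary 4.4.2 for sample values of $T$ and $N$.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 10_000, 5
C_LO = 2 * np.sqrt(2) * np.e / np.pi                # < 3, from Corollary 4.2.1

K = int(np.ceil(np.log2(np.log2(T))))               # smallest k with T**(-1/2**k) >= 1/2
g_diff = rng.uniform(-2, 2, size=T)                 # stand-in for the sequence g_{t,i (-) j}
edges = [0.0] + [T ** (-1 / 2 ** k) for k in range(1, K + 1)] + [2.0]
sizes = [int(np.sum((np.abs(g_diff) > lo) & (np.abs(g_diff) <= hi)))
         for lo, hi in zip(edges[:-1], edges[1:])]
print(f"K = {K}, |A_0|..|A_K| = {sizes}")           # K+1 scales, of order log log T

bound = 40 * C_LO * N ** 2 * np.sqrt(T * np.log2(4 * np.log2(T)))
print(f"Corollary 4.4.2: E[R_T] <= {bound:.0f} for T={T}, N={N}")
```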
4.3 Step 3: From Oblivious to Adaptive Opponents

Now we consider adaptive opponents. In this setting, we can no longer assume that player $i$ plays against a fixed sequence of payoff vectors $\{g_t\}_{t=1}^T$. Note that $g_{t,k}$ is just shorthand for $u_i(k, s_{t,-i})$, and opponents can react to player $i$'s moves $k_1, \dots, k_{t-1}$ in selecting their strategy tuple $s_{t,-i}$, possibly making use of their own private randomness. We denote all randomness used collectively by the other players over all time periods by $\omega$, which is drawn from some probability space $\Omega$. Thus, $g_t$ is a function $g_t(k_1, \dots, k_{t-1}, \omega)$. Faced with general adaptive opponents, the single stream version (4) can incur terrible expected regret, as stated below.

Theorem 4.5. The single stream version of the sampled fictitious play procedure (4) can incur linear expected regret against adaptive opponents.

The proof of this theorem can be found at the end of Appendix A. However, for the fresh-randomness-at-each-round procedure (2), we can apply Lemma 4.1 of Cesa-Bianchi and Lugosi [2006] along with Corollary 4.4.2 to derive our next result, which holds for adaptive opponents too. There are two conditions that we must verify before we apply that lemma. First, the learning procedure should use independent randomization at different time points. Second, the probability distribution of $s_{t,i}$ over the $N$ available strategies should be fully determined by $s_{1,-i}, \dots, s_{t-1,-i}$ and should not depend explicitly on player $i$'s own previous moves $s_{1,i}, \dots, s_{t-1,i}$. Both of these conditions are easily seen to hold for sampled fictitious play as defined in (2) and (3). Also note that Cesa-Bianchi and Lugosi [2006] consider deterministic adaptive opponents in their Lemma 4.1. The extension to our case is easy: we first get a high probability (w.r.t. player $i$'s randomness) regret bound for the deterministic adaptive opponent $g_t(k_1, \dots, k_{t-1}, \omega)$ for a fixed $\omega$. Since the bound holds for every $\omega$ and does not depend on $\omega$, the same high probability bound is true when $\omega$ is drawn from $\Omega$. This leads us to our final result.

Theorem 4.6. For any $T$ and any $\delta_T > 0$, with probability at least $1 - \delta_T$, the actual regret $R_T$ of sampled fictitious play as defined in (2) and (3) satisfies, for any adaptive opponent,
$$R_T \le 40\, C_{LO} N^2 \sqrt{T \log_2(4 \log_2 T)} + \sqrt{\frac{T}{2} \log \frac{1}{\delta_T}}.$$

Now pick $\delta_T = \frac{1}{T^2}$. Consider the events $E_T = \{R_T \ge 40\, C_{LO} N^2 \sqrt{T \log_2(4 \log_2 T)} + \sqrt{T \log T}\}$ with $P(E_T) \le \delta_T$. Since $\sum_{T=1}^\infty \delta_T < \infty$, we have $\sum_{T=1}^\infty P(E_T) < \infty$. Therefore, using the Borel-Cantelli lemma, the event "infinitely many $E_T$'s occur" has probability 0. That is, with probability 1, we have
$$\limsup_{T \to \infty} \frac{R_T}{\sqrt{T \log T}} \le C$$
for some constant $C$. In particular, with probability 1, $\limsup_{T \to \infty} R_T / T \le 0$, which proves Theorem 3.1.

5 Conclusion

We proved that a natural variant of fictitious play is Hannan consistent. In the variant we considered, the player plays the best response to the moves of her opponents at sampled time points in the history so far. We considered one particular sampling scheme, namely Bernoulli sampling. It will be interesting to consider other sampling strategies, including sampling with replacement. It will also be interesting to consider notions of regret, such as tracking regret [Cesa-Bianchi and Lugosi, 2006, Section 5.2], that are more suitable for non-stationary environments, by biasing the sampling to give more importance to recent time points.

Acknowledgements

We thank Jacob Abernethy, Gergely Neu and Manfred Warmuth for helpful discussions. We acknowledge the support of NSF via CAREER grant IIS-1452099.

References

Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8(Jun):1307–1324, 2007.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Luc Devroye, Gábor Lugosi, and Gergely Neu. Prediction by random-walk perturbation. In COLT, pages 460–473, 2013.

Paul Erdős. On a lemma of Littlewood and Offord.
Bulletin of the American Mathematical Society, 51:898–902, 1945.

Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press, 1998.

Dennis Gilliland and Inha Jung. Play against the random past for matching binary bits. Journal of Statistical Theory and Application, 5(3):282–291, 2006.

James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3(39):97–139, 1957.

Sergiu Hart and Andreu Mas-Colell. Simple Adaptive Strategies: From Regret Matching to Uncoupled Dynamics, volume 4 of World Scientific Series in Economic Theory. World Scientific Publishing, 2013.

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

Yuri M. Kaniovski and H. Peyton Young. Learning dynamics in games with stochastic perturbations. Games and Economic Behavior, 11(2):330–363, 1995.

Theodore J. Lambert III, Marina A. Epelman, and Robert L. Smith. A fictitious play approach to large-scale optimization. Operations Research, 53(3):477–489, 2005.

Gilles Stoltz and Gábor Lugosi. Internal regret in on-line portfolio selection. Machine Learning, 59(1-2):125–159, 2005.

Tim van Erven, Wojciech Kotłowski, and Manfred K. Warmuth. Follow the leader with dropout perturbations. In Proceedings of the Conference on Learning Theory (COLT), pages 949–974, 2014.

Manfred K. Warmuth. Follow the leader with dropout perturbations - additive versus multiplicative noise, June 2015. URL https://users.soe.ucsc.edu/~manfred/pubs/C93mwtalk.pdf.

Appendix A  Proofs

We first present a lemma that helps us in proving Theorem 4.1.

Lemma A.1. Let $k_t$ and $\tilde{g}_t$ be defined as in (4) and the text following that equation. We have
$$\sum_{t=1}^T \tilde{g}_{t,k_{t+1}} \ge \sum_{t=1}^T \tilde{g}_{t,k_{T+1}} = \max_{k \in \{1,\dots,N\}} \sum_{t=1}^T \tilde{g}_{t,k}.$$

Proof. This is a classical lemma; see, for example, Lemma 3.1 in [Cesa-Bianchi and Lugosi, 2006]. We follow the same idea, i.e., proving through induction, but adapt it to handle gains instead of losses. The statement is obvious for $T = 1$. Assume now that
$$\sum_{t=1}^{T-1} \tilde{g}_{t,k_{t+1}} \ge \sum_{t=1}^{T-1} \tilde{g}_{t,k_T}.$$
Since, by definition, $\sum_{t=1}^{T-1} \tilde{g}_{t,k_T} \ge \sum_{t=1}^{T-1} \tilde{g}_{t,k_{T+1}}$, the inductive assumption implies
$$\sum_{t=1}^{T-1} \tilde{g}_{t,k_{t+1}} \ge \sum_{t=1}^{T-1} \tilde{g}_{t,k_{T+1}}.$$
Add $\tilde{g}_{T,k_{T+1}}$ to both sides to obtain the result. ∎

Proof of Theorem 4.1. We will prove a result for Bernoulli sampling with general probabilities, i.e., when $P(\epsilon_t = +1) = \alpha$ where $\alpha$ is not necessarily $1/2$. We will show that
$$\mathbb{E}[R_T] \le \frac{N^2}{\alpha} \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right),$$
from which the theorem follows as a special case when $\alpha = 1/2$. Obviously we have $\mathbb{E}(\tilde{g}_{t,i}) = 2\alpha\, g_{t,i}$ because $\mathbb{E}(\epsilon_t) = 2\alpha - 1$. Furthermore, $\mathbb{E}[\tilde{g}_{t,k_t} \mid \epsilon_1, \dots, \epsilon_{t-1}] = 2\alpha\, g_{t,k_t}$ because $k_t$ is fully determined by the past randomness $\epsilon_1, \dots, \epsilon_{t-1}$ and the past payoffs $g_1, \dots, g_{t-1}$, which are given. This implies that $\mathbb{E}[\tilde{g}_{t,k_t}] = \mathbb{E}[\mathbb{E}[\tilde{g}_{t,k_t} \mid \epsilon_1, \dots, \epsilon_{t-1}]] = 2\alpha\, \mathbb{E}[g_{t,k_t}]$. We now have
$$\mathbb{E}[R_T] = \max_{k \in \{1,\dots,N\}} \sum_{t=1}^T g_{t,k} - \mathbb{E}\left[\sum_{t=1}^T g_{t,k_t}\right] = \frac{1}{2\alpha} \max_{k \in \{1,\dots,N\}} \mathbb{E}\left[\sum_{t=1}^T \tilde{g}_{t,k}\right] - \frac{1}{2\alpha} \mathbb{E}\left[\sum_{t=1}^T \tilde{g}_{t,k_t}\right] \le \frac{1}{2\alpha}\, \mathbb{E}\left[\max_{k \in \{1,\dots,N\}} \sum_{t=1}^T \tilde{g}_{t,k} - \sum_{t=1}^T \tilde{g}_{t,k_t}\right].$$
Using Lemma A.1, we can further upper bound the last expression as follows:
$$\mathbb{E}[R_T] \le \frac{1}{2\alpha}\, \mathbb{E}\left[\sum_{t=1}^T \tilde{g}_{t,k_{t+1}} - \sum_{t=1}^T \tilde{g}_{t,k_t}\right] = \frac{1}{2\alpha} \sum_{t=1}^T \mathbb{E}\left[(1+\epsilon_t)(g_{t,k_{t+1}} - g_{t,k_t})\right] \le \frac{1}{2\alpha} \sum_{t=1}^T \mathbb{E}\left[(1+\epsilon_t)\, |g_{t,k_{t+1}} - g_{t,k_t}|\right]$$
$$\le \frac{1}{\alpha} \sum_{t=1}^T \mathbb{E}\,|g_{t,k_t} - g_{t,k_{t+1}}| = \frac{1}{\alpha} \sum_{t=1}^T \sum_{1 \le i,j \le N} \mathbb{E}\left[|g_{t,i} - g_{t,j}|\, \mathbf{1}(k_t = i, k_{t+1} = j)\right] = \frac{1}{\alpha} \sum_{1 \le i,j \le N} \sum_{t=1}^T \mathbb{E}\left[|g_{t,i} - g_{t,j}|\, \mathbf{1}(k_t = i, k_{t+1} = j)\right]$$
$$\le \frac{N^2}{\alpha} \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i} - g_{t,j}|\, P(k_t = i, k_{t+1} = j) \le \frac{N^2}{\alpha} \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i} - g_{t,j}|\, P\left(\tilde{G}_{t-1,i} \ge \tilde{G}_{t-1,j},\ \tilde{G}_{t,i} \le \tilde{G}_{t,j}\right)$$
$$= \frac{N^2}{\alpha} \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right). \qquad ∎$$

The next lemma is useful to determine the appropriate constant in the Littlewood-Offord theorem.

Lemma A.2. Suppose $X_1, \dots, X_t$ are i.i.d. Bernoulli random variables that take the value 1 with probability $\alpha$ and 0 with probability $1-\alpha$. If $t > \max(\frac{2}{1-\alpha}, \frac{2}{\alpha}) \ge \max(\frac{2\alpha}{1-\alpha}, \frac{2}{\alpha})$, then for all $k$,
$$P\left(\sum_{i=1}^t X_i = k\right) \le \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}} \times t^{-\frac{1}{2}}.$$

Proof. Write $X = \sum_{i=1}^t X_i$. Note that for $0 \le k < t$,
$$\frac{P(X = k+1)}{P(X = k)} = \frac{\binom{t}{k+1} \alpha^{k+1} (1-\alpha)^{t-k-1}}{\binom{t}{k} \alpha^{k} (1-\alpha)^{t-k}} = \frac{\alpha (t-k)}{(1-\alpha)(k+1)}.$$
Therefore, the maximum point probability $P(X = k)$ of the binomial distribution is achieved at $k = \hat{k} = \lfloor (t+1)\alpha \rfloor$, where $\lfloor x \rfloor$ denotes the integral part of $x$. Clearly $\hat{k} \in [t\alpha - 1, (t+1)\alpha]$. Thus,
$$\sqrt{\hat{k}(t - \hat{k})} \ge \min\left(\sqrt{(t\alpha - 1)(t - t\alpha + 1)},\ \sqrt{(t+1)\alpha\, (t - t\alpha - \alpha)}\right) = t \cdot \min\left(\sqrt{\left(\alpha - \tfrac{1}{t}\right)\left(1 - \alpha + \tfrac{1}{t}\right)},\ \sqrt{\left(1 + \tfrac{1}{t}\right)\alpha\left(1 - \alpha - \tfrac{\alpha}{t}\right)}\right)$$
$$\ge t \cdot \min\left(\sqrt{\left(\alpha - \tfrac{\alpha}{2}\right)(1-\alpha)},\ \sqrt{\alpha\left(1 - \alpha - \tfrac{1-\alpha}{2}\right)}\right) = \sqrt{\frac{\alpha(1-\alpha)}{2}}\; t.$$
With this preliminary inequality, we are ready to prove the lemma. Using Stirling bounds on the factorials,
$$P\left(\sum_{i=1}^t X_i = k\right) \le P\left(\sum_{i=1}^t X_i = \hat{k}\right) = \binom{t}{\hat{k}}\, \alpha^{\hat{k}} (1-\alpha)^{t-\hat{k}} = \frac{t!}{\hat{k}!\, (t-\hat{k})!}\, \alpha^{\hat{k}} (1-\alpha)^{t-\hat{k}}$$
$$\le \frac{t^{t+\frac{1}{2}}\, e^{1-t}}{\sqrt{2\pi}\, \hat{k}^{\hat{k}+\frac{1}{2}} e^{-\hat{k}} \cdot \sqrt{2\pi}\, (t-\hat{k})^{t-\hat{k}+\frac{1}{2}} e^{-(t-\hat{k})}}\; \alpha^{\hat{k}} (1-\alpha)^{t-\hat{k}} = \frac{e}{2\pi} \times \frac{1}{\sqrt{\hat{k}(t-\hat{k})}} \times \frac{t^{t+\frac{1}{2}}}{\hat{k}^{\hat{k}} (t-\hat{k})^{t-\hat{k}}} \times \alpha^{\hat{k}} (1-\alpha)^{t-\hat{k}}$$
$$\le \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}} \times t^{t-\frac{1}{2}} \times \frac{\alpha^{\hat{k}} (1-\alpha)^{t-\hat{k}}}{\hat{k}^{\hat{k}} (t-\hat{k})^{t-\hat{k}}}.$$
Let $f(x) = \frac{\alpha^x (1-\alpha)^{t-x}}{x^x (t-x)^{t-x}}$, so that $f'(x) = \left[\log\left(\frac{\alpha}{1-\alpha}\right) - \log\left(\frac{x}{t-x}\right)\right] f(x)$. Obviously $f'(x)$ is 0 when $x = \alpha t$, positive when $x < \alpha t$, and negative when $x > \alpha t$. Thus,
$$f(x) \le \frac{\alpha^{\alpha t} (1-\alpha)^{t-\alpha t}}{(\alpha t)^{\alpha t} (t - \alpha t)^{t-\alpha t}} = t^{-t}.$$
Hence,
$$P\left(\sum_{i=1}^t X_i = k\right) \le \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}} \times t^{t-\frac{1}{2}} \times f(\hat{k}) \le \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}} \times t^{-\frac{1}{2}}. \qquad ∎$$

Proof of Corollary 4.2.1. Note that when $\alpha = \frac{1}{2}$, Lemma A.2 provides a bound on $S(n)/2^n$. Plugging $\alpha = \frac{1}{2}$ into Lemma A.2 and combining with Theorem 4.2, we see that if $n > 4$, the smaller constant $\frac{\sqrt{2}\, e}{\pi}$ already suffices. If $n \le 4$, then $\frac{2\sqrt{2}\, e}{\pi} \times n^{-\frac{1}{2}} > 1$ and the claimed bound with $C_{LO} = \frac{2\sqrt{2}\, e}{\pi}$ holds trivially. ∎
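A quick numerical check of Lemma A.2 (our illustration; $t = 400$ satisfies the condition $t > \max(\frac{2}{1-\alpha}, \frac{2}{\alpha})$ for the values of $\alpha$ used):

```python
from math import comb, e, pi, sqrt

# Largest binomial(t, alpha) point mass vs. the Lemma A.2 bound
# (e / (2*pi)) * sqrt(2 / (alpha*(1-alpha))) / sqrt(t).
t = 400
for alpha in (0.5, 0.2, 0.05):
    pmf_max = max(comb(t, k) * alpha ** k * (1 - alpha) ** (t - k)
                  for k in range(t + 1))
    bound = (e / (2 * pi)) * sqrt(2 / (alpha * (1 - alpha))) / sqrt(t)
    print(f"alpha={alpha}: max pmf = {pmf_max:.5f} <= {bound:.5f}")
```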
Proof of Theorem 4.3. We write $|A|$ to denote the cardinality of a finite set $A$. Bounding each probability by 1,
$$\sum_{t \in A_0} |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right) \le \sum_{t \in A_0} \frac{1}{\sqrt{T}} \times 1 = \frac{|A_0|}{\sqrt{T}} \le \sqrt{|A_0|}. \qquad ∎$$

Proof of Theorem 4.4. We write $\epsilon$ with a subset of $[T]$ as a subscript to denote the $\epsilon_t$'s at times within that subset. For example, $\epsilon_{[T]} = \{\epsilon_1, \dots, \epsilon_T\}$. We also write $\epsilon_{-A}$ to denote the set of $\epsilon_t$'s at times within the complement of $A$ with respect to $[T]$. For brevity, write $E_t$ for the switching event
$$E_t = \left\{\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right\}.$$

Case I: $k \in \{1, \dots, K-1\}$. Writing the probability as an expectation of an indicator and conditioning on $\epsilon_{-A_k}$,
$$\sum_{t \in A_k} |g_{t,i \ominus j}|\, P(E_t) = \sum_{t \in A_k} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{[T]}}\left[\mathbf{1}_{E_t}\right] = \sum_{t \in A_k} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{-A_k}}\left[\mathbb{E}_{\epsilon_{A_k}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}]\right]$$
$$= \mathbb{E}_{\epsilon_{-A_k}}\left[\sum_{t \in A_k} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{A_k}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}]\right] \le \sup_{\epsilon_{-A_k}} \sum_{t \in A_k} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{A_k}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}].$$
Let $A_k = \{t_{k,1}, \dots, t_{k,|A_k|}\}$ with elements listed in increasing order of time index. Also define
$$D_n = D_n(\epsilon_{-A_k}) = -\sum_{\tau=1,\ \tau \in -A_k}^{t_{k,n}-1} (1+\epsilon_\tau)\, g_{\tau,i \ominus j}.$$
Then we have
$$\sum_{t \in A_k} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{A_k}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}] = \sum_{n=1}^{|A_k|} |g_{t_{k,n},i \ominus j}|\, P\left(\sum_{s=1}^{n-1} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} \ge -\sum_{s=1}^{n-1} g_{t_{k,s},i \ominus j} + D_n,\ \sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} \le -\sum_{s=1}^{n} g_{t_{k,s},i \ominus j} + D_n \,\middle|\, \epsilon_{-A_k}\right).$$
By definition of the set $A_k$, we have $|g_{t_{k,s},i \ominus j}| \ge T^{-1/2^k}$, so $T^{1/2^k} |g_{t_{k,s},i \ominus j}| \ge 1$. Let $M_k = T^{1/2^k}$. Multiplying everything inside the probability by $M_k$, we have
$$\sum_{n=1}^{|A_k|} |g_{t_{k,n},i \ominus j}|\, P\left(\sum_{s=1}^{n-1} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} M_k \ge -\sum_{s=1}^{n-1} g_{t_{k,s},i \ominus j} M_k + D_n M_k,\ \sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} M_k \le -\sum_{s=1}^{n} g_{t_{k,s},i \ominus j} M_k + D_n M_k \,\middle|\, \epsilon_{-A_k}\right)$$
$$\le \sum_{n=1}^{|A_k|} |g_{t_{k,n},i \ominus j}|\, P\left(\sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} M_k \in B_{k,n} \,\middle|\, \epsilon_{-A_k}\right),$$
where
$$B_{k,n} = \left[-\sum_{s=1}^{n} g_{t_{k,s},i \ominus j} M_k + D_n M_k - 2|g_{t_{k,n},i \ominus j}| M_k,\ -\sum_{s=1}^{n} g_{t_{k,s},i \ominus j} M_k + D_n M_k\right]$$
is a one-dimensional closed ball with radius $\Delta = |g_{t_{k,n},i \ominus j}| M_k$. Note that this ball is fixed given $\epsilon_{-A_k}$. Since $|g_{t_{k,s},i \ominus j}| M_k \ge 1$, we can apply Corollary 4.2.1 to get
$$P\left(\sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} M_k \in B_{k,n} \,\middle|\, \epsilon_{-A_k}\right) \le \frac{C_{LO}(\Delta + 1)}{\sqrt{n}} = \frac{C_{LO}(|g_{t_{k,n},i \ominus j}| M_k + 1)}{\sqrt{n}}.$$
Now we continue the derivation:
$$\sum_{n=1}^{|A_k|} |g_{t_{k,n},i \ominus j}|\, P\left(\sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} M_k \in B_{k,n} \,\middle|\, \epsilon_{-A_k}\right) \le \sum_{n=1}^{|A_k|} |g_{t_{k,n},i \ominus j}|\, \frac{C_{LO}(|g_{t_{k,n},i \ominus j}| M_k + 1)}{\sqrt{n}} \le C_{LO}\left[\sum_{n=1}^{|A_k|} \frac{|g_{t_{k,n},i \ominus j}|^2 M_k}{\sqrt{n}} + \sum_{n=1}^{|A_k|} \frac{2}{\sqrt{n}}\right].$$
Since $|g_{t_{k,n},i \ominus j}| < T^{-1/2^{k+1}}$, we have $|g_{t_{k,n},i \ominus j}|^2 M_k = |g_{t_{k,n},i \ominus j}|^2\, T^{1/2^k} < 1$. Thus we have the bound
$$C_{LO}\left[\sum_{n=1}^{|A_k|} \frac{|g_{t_{k,n},i \ominus j}|^2 M_k}{\sqrt{n}} + \sum_{n=1}^{|A_k|} \frac{2}{\sqrt{n}}\right] \le 3\, C_{LO} \sum_{n=1}^{|A_k|} \frac{1}{\sqrt{n}} \le 6\, C_{LO} \sqrt{|A_k|}.$$
Case II: $k = K$. Similarly to the previous case, we have
$$\sum_{t \in A_K} |g_{t,i \ominus j}|\, P(E_t) \le \sup_{\epsilon_{-A_K}} \sum_{t \in A_K} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{A_K}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_K}]$$
and, writing the elements of $A_K$ in increasing order as $\{t_{K,1}, \dots, t_{K,|A_K|}\}$, we get
$$\sum_{t \in A_K} |g_{t,i \ominus j}|\, \mathbb{E}_{\epsilon_{A_K}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_K}] \le \sum_{n=1}^{|A_K|} |g_{t_{K,n},i \ominus j}|\, P\left(\sum_{s=1}^{n} \epsilon_{t_{K,s}} g_{t_{K,s},i \ominus j} M_K \in B_{K,n} \,\middle|\, \epsilon_{-A_K}\right),$$
where
$$D_n = D_n(\epsilon_{-A_K}) = -\sum_{\tau=1,\ \tau \in -A_K}^{t_{K,n}-1} (1+\epsilon_\tau)\, g_{\tau,i \ominus j}, \qquad M_K = T^{1/2^K} \le 2,$$
and
$$B_{K,n} = \left[-\sum_{s=1}^{n} g_{t_{K,s},i \ominus j} M_K + D_n M_K - 2|g_{t_{K,n},i \ominus j}| M_K,\ -\sum_{s=1}^{n} g_{t_{K,s},i \ominus j} M_K + D_n M_K\right]$$
is a one-dimensional closed ball with radius $\Delta = |g_{t_{K,n},i \ominus j}| M_K$. Note that this ball is fixed given $\epsilon_{-A_K}$, and hence we can apply Corollary 4.2.1 to get
$$\sum_{n=1}^{|A_K|} |g_{t_{K,n},i \ominus j}|\, P\left(\sum_{s=1}^{n} \epsilon_{t_{K,s}} g_{t_{K,s},i \ominus j} M_K \in B_{K,n} \,\middle|\, \epsilon_{-A_K}\right) \le \sum_{n=1}^{|A_K|} |g_{t_{K,n},i \ominus j}|\, \frac{C_{LO}(|g_{t_{K,n},i \ominus j}| M_K + 1)}{\sqrt{n}} \le C_{LO}(4 M_K + 2) \sum_{n=1}^{|A_K|} \frac{1}{\sqrt{n}} \le 20\, C_{LO} \sqrt{|A_K|}.$$
Combining the two cases proves the theorem. ∎

Proof of Theorem 4.5. Consider a game with two strategies, i.e., $N = 2$. We refer to player $i$ as the "player" and the other players collectively as the "environment". On odd rounds, the environment plays the payoff vector $(0, 0)$. This ensures that after odd rounds, the environment will know exactly which strategy the player will choose, as long as there is no tie in the player's sampled cumulative payoffs: no matter whether the Rademacher random variable is $-1$ or $+1$, the next strategy played will be the same as the strategy the player just played. On even rounds $t$, the environment plays the payoff vector $(0, 1 - 0.1^t)$ if the player chose the first strategy in the previous round, and $(1 - 0.1^t, 0)$ if the player chose the second strategy in the previous round.

Under this scenario, we make the critical observation that, as long as the set of sampled time points is not empty (it is empty with probability $(\frac{1}{2})^{t-1}$ on round $t$), there will not be a tie in the cumulative payoffs of the two strategies. Moreover, without a tie, the player will not be able to switch strategy on even rounds, and so will not accumulate any payoff. Therefore, the total payoff acquired by the player by following the sampled fictitious play procedure will be at most $\sum_{t=1}^\infty (\frac{1}{2})^{t-1} = 2$. However, as is evident from the environment's procedure, the total payoff of the two strategies combined is at least $0.45\, T$, and thus the best strategy has a payoff no less than $0.225\, T$ by the pigeonhole principle. Hence, the expected regret of the player is at least $0.225\, T - 2$, which is linear in $T$. ∎
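To see Theorem 4.5 in action, here is a simulation sketch of the adaptive environment from the proof (our reading of the construction; the decaying bonus $1 - 0.1^t$ is taken from the text, and all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 2000
eps = rng.choice([-1, 1], size=T)                   # one shared stream, as in (4)
G, payoff, moves = np.zeros((0, 2)), 0.0, []
for t in range(1, T + 1):
    perturbed = (1 + eps[: t - 1]) @ G              # tilde-G_{t-1}
    best = np.flatnonzero(perturbed == perturbed.max())
    k = int(rng.choice(best))
    moves.append(k)
    g = np.zeros(2)                                 # odd rounds: payoff vector (0, 0)
    if t % 2 == 0:
        g[1 - moves[-2]] = 1 - 0.1 ** t             # reward the strategy NOT played at t-1
    payoff += g[k]
    G = np.vstack([G, g])
print(f"single-stream regret: {G.sum(axis=0).max() - payoff:.1f} over T={T} rounds")
```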
Supplementary Material for "Sampled Fictitious Play is Hannan Consistent"

Appendix B  Counterexample Showing Polynomial N Dependence

In this section, we present a counterexample which shows that the sampled fictitious play algorithm (2) with Bernoulli sampling (3) has a lower bound of $\Omega(N)$ on its expected regret when $T$ is $2N$. This is consistent with a lower bound for the expected regret of order $\Omega(\sqrt{NT})$. However, we are unable to extend the construction to arbitrary $T$. This counterexample is from [Warmuth, 2015] and we learned it in private communication with Manfred Warmuth and Gergely Neu.

Theorem B.1 (Warmuth [2015]). The sampled fictitious play algorithm has expected regret of $\Omega(N)$ when $T$ is $2N$ and $N \to \infty$.

Proof. Consider the $N \times 2N$ payoff matrix
$$\begin{pmatrix} 0 & -1 & -1 & -1 & -1 & -1 & \cdots & -1 \\ -1 & 0 & 0 & -1 & -1 & -1 & \cdots & -1 \\ -1 & 0 & -1 & 0 & 0 & -1 & \cdots & -1 \\ -1 & 0 & -1 & 0 & -1 & 0 & \cdots & \vdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 0 & -1 & 0 & -1 & 0 & \cdots & \end{pmatrix}.$$
Each row represents a strategy and each column represents the payoffs of the strategies in a particular round. In the $m$th odd round, i.e., in the $(2m-1)$-th round, the adversary assigns a payoff of $-1$ to all strategies except strategy $m$, which gets a payoff of 0. In the $m$th even round, i.e., the $2m$-th round, the adversary assigns a payoff of $-1$ to strategies 1 through $m$ and a payoff of 0 to the others. Note that, in all rounds after $2m$, strategy $m$ will always be given a payoff of $-1$. Overall, we have $N$ strategies and $2N$ rounds, with the best constant strategy being the last strategy, which accumulates a payoff of $-N$.

To analyze the expected regret, we consider even and odd rounds separately. Note that as long as round $2m-1$ is picked in the sampled history, which happens with probability $\frac{1}{2}$, the algorithm will not choose any strategy from $m+1$ through $N$ at round $2m$. This is because they all have payoffs identical to strategy $m$ prior to round $2m-1$, and strategy $m$ looks better on round $2m-1$. So the algorithm will pick a strategy from 1 through $m$ on round $2m$, all of which acquire a gain of $-1$. Therefore, the algorithm will acquire an expected payoff of at most $-\frac{1}{2}$ on even rounds.

Next we consider odd rounds. On round $2m-1$, we observe that the leader set (i.e., the argmax in (2)) either includes strategy $m$ or not. If it includes strategy $m$, it will additionally include strategies $m+1$ through $N$ as well, since they have had identical payoffs in the past. It may possibly also include some strategies in the set $\{1, \dots, m-1\}$. Since the algorithm randomly picks a strategy from the leader set, and all but one of them has a payoff of $-1$ on round $2m-1$, the expected gain of the algorithm is at most $-\frac{N-m}{N-m+1}$. If the leader set does not include strategy $m$, then the expected gain is exactly $-1$, since strategy $m$ is the only one with zero payoff at round $2m-1$. Therefore, the algorithm will acquire an expected payoff of at most $-\frac{N-m}{N-m+1}$ on odd rounds.

Hence, the expected regret of sampled fictitious play under this scenario with $N$ strategies and $2N$ rounds is at least, for some $c > 0$,
$$\mathbb{E}[R_T] \ge -N - \left(-\sum_{m=1}^N \frac{N-m}{N-m+1} - \frac{N}{2}\right) \ge \frac{N}{2} - c \log(N) = \Omega(N). \qquad ∎$$
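The construction is easy to run; the following sketch (ours, with hypothetical helper names) builds the payoff matrix above and plays sampled fictitious play (2)-(3) against it:

```python
import numpy as np

def warmuth_payoffs(N):
    """Round-by-round payoff vectors of the Appendix B construction:
    g[t - 1] is the payoff vector of round t, for t = 1, ..., 2N."""
    g = np.zeros((2 * N, N))
    for m in range(1, N + 1):
        g[2 * m - 2] = -1.0                         # round 2m-1: all -1 ...
        g[2 * m - 2, m - 1] = 0.0                   # ... except strategy m
        g[2 * m - 1, :m] = -1.0                     # round 2m: -1 for strategies 1..m
    return g

rng = np.random.default_rng(5)
N = 50
g = warmuth_payoffs(N)
payoff = 0.0
for t in range(2 * N):
    eps = rng.choice([-1, 1], size=t)               # fresh randomness each round
    sampled = ((1 + eps) / 2) @ g[:t]
    best = np.flatnonzero(sampled == sampled.max())
    payoff += g[t, rng.choice(best)]
print(f"N={N}: regret = {g.sum(axis=0).max() - payoff:.1f}  (Theorem B.1: Omega(N))")
```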
Appendix C  Asymmetric Probabilities

In this section we prove that for $\{-1,0,1\}$-valued payoffs and arbitrary $\alpha \in (0,1)$ instead of just $1/2$, the expected regret is $O(\sqrt{T})$, where the constant hidden in the $O(\cdot)$ notation blows up in either of the two extreme cases $\alpha \to 0$ and $\alpha \to 1$. Note that we are still considering the single stream version (4) of the learning procedure.

Theorem C.1. For $\alpha \in (0,1)$ and $g_t \in \{-1,0,1\}^N$, assuming that $T > \max(\frac{2}{1-\alpha}, \frac{2}{\alpha})$, the expected regret satisfies
$$\mathbb{E}[R_T] \le \frac{40 N^2 Q_\alpha}{\alpha} \sqrt{T},$$
where $Q_\alpha = \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}}$.

Proof. We begin with the inequality obtained in the proof of Theorem 4.1:
$$\mathbb{E}[R_T] \le \frac{N^2}{\alpha} \max_{1 \le i,j \le N} \sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\tilde{G}_{t-1,i \ominus j} \ge 0,\ \tilde{G}_{t,i \ominus j} \le 0\right). \tag{5}$$
As before, we fix $i$ and $j$, and bound the expression
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} (1+\epsilon_\tau)\, g_{\tau,i \ominus j} \ge 0,\ \sum_{\tau=1}^{t} (1+\epsilon_\tau)\, g_{\tau,i \ominus j} \le 0\right).$$
The rest of the proof is similar to the proof of Theorem 4.4. Define the classes $A_k = \{t \in [T] : g_{t,i \ominus j} = k\}$ for $k \in \{-2,-1,1,2\}$. Since $|g_{t,i \ominus j}| \le 2$ and rounds with $g_{t,i \ominus j} = 0$ contribute nothing to the sum, we have
$$\sum_{t=1}^T |g_{t,i \ominus j}|\, P\left(\sum_{\tau=1}^{t-1} (1+\epsilon_\tau)\, g_{\tau,i \ominus j} \ge 0,\ \sum_{\tau=1}^{t} (1+\epsilon_\tau)\, g_{\tau,i \ominus j} \le 0\right) \le 2 \sum_{k \in \{-2,-1,1,2\}} \sum_{t \in A_k} P\left(\sum_{\tau=1}^{t-1} \epsilon_\tau g_{\tau,i \ominus j} \ge -\sum_{\tau=1}^{t-1} g_{\tau,i \ominus j},\ \sum_{\tau=1}^{t} \epsilon_\tau g_{\tau,i \ominus j} \le -\sum_{\tau=1}^{t} g_{\tau,i \ominus j}\right).$$
For any $k \in \{-2,-1,1,2\}$, writing $E_t$ for the switching event as in the proof of Theorem 4.4,
$$\sum_{t \in A_k} P(E_t) \le \sup_{\epsilon_{-A_k}} \sum_{t \in A_k} \mathbb{E}_{\epsilon_{A_k}}\left[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}\right].$$
Let $A_k = \{t_{k,1}, \dots, t_{k,|A_k|}\}$ with elements listed in increasing order of time index. Also define, for $n \in \{1, \dots, |A_k|\}$,
$$D_n = D_n(\epsilon_{-A_k}) = -\sum_{\tau=1,\ \tau \in -A_k}^{t_{k,n}-1} \epsilon_\tau g_{\tau,i \ominus j} - \sum_{\tau=1,\ \tau \in -A_k}^{t_{k,n}-1} g_{\tau,i \ominus j}.$$
We then proceed as follows:
$$\sum_{t \in A_k} \mathbb{E}_{\epsilon_{A_k}}[\mathbf{1}_{E_t} \mid \epsilon_{-A_k}] = \sum_{n=1}^{|A_k|} P\left(\sum_{s=1}^{n-1} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} \ge -\sum_{s=1}^{n-1} g_{t_{k,s},i \ominus j} + D_n,\ \sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} \le -\sum_{s=1}^{n} g_{t_{k,s},i \ominus j} + D_n \,\middle|\, \epsilon_{-A_k}\right)$$
$$\le \sum_{n=1}^{|A_k|} P\left(\bigcup_{u=0}^{4} \left\{\sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} = -\sum_{s=1}^{n-1} g_{t_{k,s},i \ominus j} + D_n - u\right\} \,\middle|\, \epsilon_{-A_k}\right)$$
$$\le \sum_{n=1}^{|A_k|} \sum_{u=0}^{4} P\left(\sum_{s=1}^{n} \epsilon_{t_{k,s}} g_{t_{k,s},i \ominus j} = -\sum_{s=1}^{n-1} g_{t_{k,s},i \ominus j} + D_n - u \,\middle|\, \epsilon_{-A_k}\right) \le 5 \sum_{n=1}^{|A_k|} \frac{Q_\alpha}{\sqrt{n}} \le 10\, Q_\alpha \sqrt{|A_k|},$$
where $Q_\alpha = \frac{e}{2\pi} \times \sqrt{\frac{2}{\alpha(1-\alpha)}}$ comes from Lemma A.2. Putting things together, we have
$$\mathbb{E}[R_T] \le \frac{20 N^2 Q_\alpha}{\alpha} \sum_{k \in \{-2,-1,1,2\}} \sqrt{|A_k|} \le \frac{40 N^2 Q_\alpha}{\alpha} \sqrt{T}. \qquad ∎$$