Maximin Action Identification: A New Bandit Framework for Games
Aurélien Garivier — AURELIEN.GARIVIER@MATH.UNIV-TOULOUSE.FR
Institut de Mathématiques de Toulouse; UMR 5219, Université de Toulouse; CNRS UPS IMT, F-31062 Toulouse Cedex 9, France

Emilie Kaufmann — EMILIE.KAUFMANN@INRIA.FR
Laboratoire CRIStAL; Equipe SequeL, CNRS UMR; Inria Lille - Nord Europe, 59650 Villeneuve d'Ascq, France

Wouter M. Koolen — WMKOOLEN@CWI.NL
Centrum Wiskunde & Informatica, Amsterdam, the Netherlands

Abstract

We study an original problem of pure exploration in a strategic bandit model motivated by Monte Carlo Tree Search. It consists in identifying the best action in a game, when the player may sample random outcomes of sequentially chosen pairs of actions. We propose two strategies for the fixed-confidence setting: Maximin-LUCB, based on lower- and upper-confidence bounds; and Maximin-Racing, which operates by successively eliminating the sub-optimal actions. We discuss the sample complexity of both methods and compare their performance empirically. We sketch a lower bound analysis, and possible connections to an optimal algorithm.

Keywords: multi-armed bandit problems, games, best-arm identification, racing, LUCB

1. Setting: A Bandit Model for Two-Player Zero-Sum Random Games

We study a statistical learning problem inspired by the design of computer opponents for playing games. We are thinking about two-player zero-sum full-information games like Checkers, Chess, Go (Silver et al., 2016), and also games with randomness and hidden information like Scrabble or Poker (Bowling et al., 2015). At each step during game play, the agent is presented with the current game configuration, and is tasked with figuring out which of the available moves to play.
In most interesting games, an exhaustive search of the game tree is completely out of the question, even with smart pruning. Given that we cannot consider all states, the question is where and how to spend our computational effort. A popular approach is based on Monte Carlo Tree Search (MCTS) (Gelly et al., 2012; Browne et al., 2012). Very roughly, the idea of MCTS is to reason strategically about a tractable (say, up to some depth) portion of the game tree rooted at the current configuration, and to use (randomized) heuristics to estimate the values of states at the edge of the tractable area. One way to obtain such estimates is by 'rollouts': playing reasonable random policies for both players against each other until the game ends, and seeing who wins. MCTS methods are currently applied very successfully in the construction of game-playing agents, and we are interested in understanding and characterizing the fundamental complexity of such approaches. The existing picture is still rather incomplete. For example, there is no precise characterization of the number of rollouts required to identify a close-to-optimal action. Sometimes, cumulated regret minimizing algorithms (e.g. UCB derivatives) are used, whereas only the simple regret is relevant here.

© A. Garivier, E. Kaufmann & W.M. Koolen.

Figure 1: Game tree when there are two actions per player ($K = K_1 = K_2 = 2$): a max root over Player A's actions, min nodes over Player B's replies, and leaves $\mu_{1,1}, \mu_{1,2}, \mu_{2,1}, \mu_{2,2}$.

As a first step in this direction, we investigate in this paper an idealized version of the MCTS problem for games, for which we develop a theory that leads to sample complexity guarantees. More precisely, we study perhaps the simplest model incorporating both strategic reasoning and exploration. We consider a two-player two-round zero-sum game, in which player A has $K$ available actions.
For each of these actions, indexed by $i$, player B can then choose among $K_i$ possible actions, indexed by $j$. For $i \in \{1, \ldots, K\}$ and $j \in \{1, \ldots, K_i\}$, when player A chooses action $i$ and then player B chooses action $j$, the probability that player A wins is $\mu_{i,j}$. We investigate the situation (see Figure 1 for an example) from the perspective of Player A, who wants to identify a maximin action

$$i^* \in \operatorname*{argmax}_{i \in \{1,\ldots,K\}} \; \min_{j \in \{1,\ldots,K_i\}} \mu_{i,j}.$$

Assuming that Player B is strategic and picks, whatever A's action $i$, the action $j$ minimizing $\mu_{i,j}$, this is the best choice for A. The parameters of the game are unknown to Player A, but he can repeatedly choose a pair $P = (i,j)$ of actions for him and Player B, and subsequently observe a sample from a Bernoulli distribution with mean $\mu_{i,j}$. At this point we imagine the sample could be generated, e.g., by a single rollout estimate in an underlying longer game that we consider beyond tractable strategic consideration. Note that, in this learning phase, Player A is not playing a game: he chooses actions for himself and for his adversary, and observes the random outcome. The aim of this work is to propose a dynamic sampling strategy for Player A in order to minimize the total number of samples (i.e. rollouts) needed to identify $i^*$. Letting $\mathcal{P} = \{(i,j) : 1 \le i \le K,\ 1 \le j \le K_i\}$, we formulate the problem as the search for a particular arm in a stochastic bandit model with $\overline{K} = \sum_{i=1}^{K} K_i$ Bernoulli arms of respective expectations $\mu_P$, $P \in \mathcal{P}$. In this bandit model, parametrized by $\mu = (\mu_P)_{P \in \mathcal{P}}$, when the player chooses an arm (a pair of actions) $P_t$ at round $t$, he observes a sample $X_t$ drawn under a Bernoulli distribution with mean $\mu_{P_t}$. In contrast to best arm identification in bandit models (see, e.g., Even-Dar et al. (2006); Audibert et al.
(2010)), where the goal is to identify the arm(s) with highest mean, $\operatorname{argmax}_P \mu_P$, here we want to identify as quickly as possible the maximin action $i^*$ defined above. For this purpose, we adopt a sequential learning strategy (or algorithm) $(P_t, \tau, \hat{\imath})$. Denoting by $\mathcal{F}_t = \sigma(X_1, \ldots, X_t)$ the sigma-field generated by the observations made up to time $t$, this strategy is made of

– a sampling rule $P_t \in \mathcal{P}$ indicating the arm chosen at round $t$, such that $P_t$ is $\mathcal{F}_{t-1}$-measurable,
– a stopping rule $\tau$, after which a recommendation is to be made, which is a stopping time with respect to $\mathcal{F}_t$,
– a final guess $\hat{\imath}$ for the maximin action $i^*$.

For some fixed $\epsilon \ge 0$, the goal is to find as quickly as possible an $\epsilon$-maximin action, with high accuracy. More specifically, given $\delta \in\, ]0,1[$, the strategy should be $\delta$-PAC, i.e. satisfy

$$\forall \mu, \qquad \mathbb{P}_\mu\Big( \min_{j \in \{1,\ldots,K_{i^*}\}} \mu_{i^*,j} - \min_{j \in \{1,\ldots,K_{\hat{\imath}}\}} \mu_{\hat{\imath},j} \le \epsilon \Big) \ge 1 - \delta, \qquad (1)$$

while keeping the total number of samples $\tau$ as small as possible. This is known, in the best-arm identification literature, as the fixed-confidence setting; alternatively, one may consider the fixed-budget setting, where the total number of samples $\tau$ is fixed in advance, and where the goal is to minimize the probability that $\hat{\imath}$ is not an $\epsilon$-maximin action.

Related work. Tools from the bandit literature have been used in MCTS for around a decade (see Munos (2014) for a survey). Originally, MCTS was used to perform planning in Markov Decision Processes (MDPs), which is a slightly different setting with no adversary: when an action is chosen, the transition towards a new state and the reward observed are generated by some (unknown) random process.
A popular approach, UCT (Kocsis and Szepesvári, 2006), builds on Upper Confidence Bound algorithms, which are useful tools for regret minimization in bandit models (e.g., Auer et al. (2002)). In this slightly different setup (see Bubeck and Cesa-Bianchi (2012) for a survey), the goal is to maximize the sum of the samples collected during the interaction with the bandit, which amounts in our setting to favoring rollouts for which player A won (which is not necessary in the learning phase). This situation is from a certain perspective a little puzzling and arguably confusing because, as shown by Bubeck et al. (2011), regret minimization and best arm identification are incompatible objectives, in the sense that no algorithm can be simultaneously optimal for both. More recently, tools from the best-arm identification literature have been used by Szorenyi et al. (2014) in the context of planning in a Markov Decision Process with a generative model. The proposed algorithm builds on the UGapE algorithm of Gabillon et al. (2012) to decide for which action new trajectories in the MDP starting from this action should be simulated. Just like a best arm identification algorithm is a building block for such more complex algorithms performing planning in an MDP, we believe that understanding the maximin action identification problem is a key step towards more general algorithms in games, with provable sample complexity guarantees. For example, an algorithm for maximin action identification may be useful for planning in the competitive Markov Decision Processes of Filar and Vrieze (1996) that model stochastic games.

Contributions. In this paper, we propose two algorithms for maximin action identification in the fixed-confidence setting, inspired by the two dominant approaches used in best arm identification algorithms.
The first algorithm, Maximin-LUCB, is described in Section 2: it relies on the use of Upper and Lower Confidence Bounds. The second, Maximin-Racing, is described in Section 3: it proceeds by successive eliminations of the sub-optimal arms. We prove that both algorithms are $\delta$-PAC, and give upper bounds on their sample complexity. Along the way, we also propose some perspectives of improvement that are illustrated empirically in Section 4. Finally, we propose in Section 5, for the two-actions case, a lower bound on the sample complexity of any $\delta$-PAC algorithm, and sketch a strategy that may be optimal with respect to this lower bound. Most proofs are deferred to the Appendix.

Notation. To ease the notation, in the rest of the paper we assume that the actions of the two players are re-ordered so that for each $i$, $\mu_{i,j}$ is increasing in $j$, and $\mu_{i,1}$ is decreasing in $i$ (so that $i^* = 1$ and $\mu^* = \mu_{1,1}$). These assumptions are illustrated in Figure 2. With this notation, the action $\hat{\imath}$ is an $\epsilon$-maximin action if $\mu_{1,1} - \mu_{\hat{\imath},1} \le \epsilon$. We also introduce $\mathcal{P}_i = \{(i,j),\ j \in \{1, \ldots, K_i\}\}$ as the group of arms related to the choice of action $i$ for Player A.

Figure 2: Example 'normal form' mean configuration. Arrows point to smaller values.

2. First Approach: M-LUCB

We first describe a simple strategy based on confidence intervals, called Maximin-LUCB (M-LUCB). Confidence bounds have been successfully used for best-arm identification in the fixed-confidence setting (Kalyanakrishnan et al. (2012); Gabillon et al. (2012); Jamieson et al. (2014)).
The algorithm proposed in this section for maximin action identification is inspired by the LUCB algorithm of Kalyanakrishnan et al. (2012), based on Upper and Lower Confidence Bounds. For every pair of actions $P \in \mathcal{P}$, let $I_P(t) = [L_P(t), U_P(t)]$ be a confidence interval on $\mu_P$, built using observations from arm $P$ gathered up to time $t$. Such a confidence interval can be obtained by using the number of draws $N_P(t) := \sum_{s=1}^{t} \mathbb{1}(P_s = P)$ and the empirical mean of the observations for this pair,

$$\hat{\mu}_P(t) := \frac{\sum_{s=1}^{t} X_s\, \mathbb{1}(P_s = P)}{N_P(t)}.$$

The M-LUCB strategy aims at aligning the lower confidence bounds of arms that are in the same group $\mathcal{P}_i$. Arms to be drawn are chosen two by two: for any even time $t$, defining for every $i \in \{1, \ldots, K\}$

$$c_i(t) = \operatorname*{argmin}_{1 \le j \le K_i} L_{(i,j)}(t) \qquad \text{and} \qquad \hat{\imath}(t) = \operatorname*{argmax}_{i} \min_{j} \hat{\mu}_{i,j}(t),$$

the algorithm draws at rounds $t+1$ and $t+2$ the arms

$$H_t = \big(\hat{\imath}(t),\, c_{\hat{\imath}(t)}(t)\big) \qquad \text{and} \qquad S_t = \operatorname*{argmax}_{P \in \{(i, c_i(t))\}_{i \ne \hat{\imath}}} U_P(t).$$

Figure 3: Stopping rule (2). The algorithm stops because the lower bound of the green arm beats, up to slack $\epsilon$, the upper bound of at least one arm (marked red) in each other action. In this case action $\hat{\imath} = 2$ is recommended.

This is indeed a regular LUCB sampling rule on a time-dependent set of arms, each representing one action: $\{(i, c_i(t))\}_{i \in \{1,\ldots,K\}}$. In the two-actions case, one may alternatively draw at each time $t$ only the least drawn arm among these two, $P_{t+1} = \operatorname{argmin}_{P \in \{H_t, S_t\}} N_P(t)$.
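To make the sampling rule concrete, here is a minimal Python sketch (not part of the paper) of one M-LUCB selection step, assuming empirical means and draw counts are stored in dictionaries keyed by the pair $(i,j)$, and using Hoeffding-type intervals $\hat{\mu}_P \pm \sqrt{\beta/(2N_P)}$ as in Equation (3):

```python
import math

def mlucb_pick(mu_hat, counts, K_actions, beta):
    """One M-LUCB selection step (sketch): return the two arms H_t and S_t.

    mu_hat, counts: dicts mapping pairs (i, j) to empirical mean / number of draws.
    K_actions: dict mapping each action i to its number K_i of opponent replies.
    beta: exploration rate beta(t, delta), e.g. math.log((math.log(t) + 1) / delta).
    """
    def lcb(P):  # Hoeffding-type lower confidence bound
        return mu_hat[P] - math.sqrt(beta / (2 * counts[P]))

    def ucb(P):  # Hoeffding-type upper confidence bound
        return mu_hat[P] + math.sqrt(beta / (2 * counts[P]))

    # c_i(t): for each action i, the reply with the smallest lower confidence bound
    c = {i: min(range(K_actions[i]), key=lambda j, i=i: lcb((i, j)))
         for i in K_actions}
    # empirical maximin action \hat{i}(t)
    i_hat = max(K_actions,
                key=lambda i: min(mu_hat[(i, j)] for j in range(K_actions[i])))
    H = (i_hat, c[i_hat])  # current best action with its pessimistic reply
    # S_t: among the other actions' pessimistic replies, the most optimistic arm
    S = max(((i, c[i]) for i in K_actions if i != i_hat), key=ucb)
    return H, S
```

The function names and the dictionary representation are illustrative choices, not the paper's notation; actions are indexed from 0 here for convenience.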
Concerning the stopping rule, which depends on the parameter $\epsilon \ge 0$ ($\epsilon$ can be set to zero if $\mu_{1,1} > \mu_{2,1}$), it is defined as the first moment when, according to the confidence intervals, some action $\hat{\imath}$ is probably approximately better than all other actions' best responses:

$$\tau = \inf\Big\{ t \in 2\mathbb{N} : \min_{i} \Big[ \max_{i' \ne i} \min_{1 \le j' \le K_{i'}} U_{i',j'}(t) - \min_{1 \le j \le K_i} L_{i,j}(t) \Big] < \epsilon \Big\}. \qquad (2)$$

Then arm $\hat{\imath} = \hat{\imath}(\tau)$, the empirical maximin action at that time, is recommended to Player A. The stopping rule is illustrated in Figure 3. With the notation of the sampling rule, this amounts to stopping when $L_{H_t}(t) > U_{S_t}(t) - \epsilon$.

2.1. Analysis of the Algorithm

We analyze the algorithm under the assumptions $\mu_{1,1} > \mu_{2,1}$ and $\epsilon = 0$. We consider the Hoeffding-type confidence bounds

$$L_P(t) = \hat{\mu}_P(t) - \sqrt{\frac{\beta(t,\delta)}{2 N_P(t)}} \qquad \text{and} \qquad U_P(t) = \hat{\mu}_P(t) + \sqrt{\frac{\beta(t,\delta)}{2 N_P(t)}}, \qquad (3)$$

where $\beta(t,\delta)$ is some exploration rate. A choice of $\beta(t,\delta)$ that ensures the $\delta$-PAC property (1) is given below. In order to highlight the dependency of the stopping rule on the risk level $\delta$, we denote it by $\tau_\delta$.

Theorem 1. Let

$$H^*(\mu) = \sum_{(i,j) \in \mathcal{P}} \frac{1}{\max\Big[ \big(\mu_{i,1} - \frac{\mu_{1,1}+\mu_{2,1}}{2}\big)^2,\ \big(\mu_{i,j} - \mu_{i,1}\big)^2 \Big]}.$$

On the event $\mathcal{E} = \bigcap_{P \in \mathcal{P}} \bigcap_{t \in 2\mathbb{N}} \big\{ \mu_P \in [L_P(t), U_P(t)] \big\}$, the M-LUCB strategy returns the maximin action and uses a total number of samples upper-bounded by

$$T(\mu, \delta) = \inf\{ t \in \mathbb{N} : 4 H^*(\mu)\, \beta(t,\delta) < t \}.$$

According to Theorem 1, the exploration rate should be large enough to control $\mathbb{P}_\mu(\mathcal{E})$, and as small as possible so as to minimize $T(\mu,\delta)$. The self-normalized deviation bound of Cappé et al. (2013) gives a first solution (Corollary 2), whereas Lemma 7 of Kaufmann et al. (2015) yields Corollary 3. In both cases, explicit bounds on $T(\mu,\delta)$ are obtained using the technical Lemma 12 stated in Appendix A.
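As an illustration (not from the paper), the complexity term of Theorem 1 and the resulting sample bound $T(\mu,\delta)$ can be computed numerically; the sketch below reads the 'virtual arm' in $H^*(\mu)$ as having mean $(\mu_{1,1}+\mu_{2,1})/2$, in line with the discussion following Theorem 4, and uses the exploration rate $\beta(t,\delta) = \log(C t^{1+\alpha}/\delta)$ of Equation (4) with illustrative constants:

```python
import math

def h_star(mu):
    """Complexity term H*(mu) of Theorem 1 (sketch).

    mu: list of rows mu[i][j], with actions re-ordered as in the Notation
    paragraph (mu[i][j] increasing in j, mu[i][0] decreasing in i, so i* = 0).
    Each arm (i, j) contributes the inverse of the larger of two squared gaps:
    the gap from mu[i][0] to the 'virtual arm' (mu[0][0] + mu[1][0]) / 2,
    and the gap from mu[i][j] to its action's worst reply mu[i][0].
    """
    virtual = (mu[0][0] + mu[1][0]) / 2
    total = 0.0
    for row in mu:
        for x in row:
            total += 1.0 / max((row[0] - virtual) ** 2, (x - row[0]) ** 2)
    return total

def sample_bound(mu, delta, C=2.0, alpha=1.0):
    """T(mu, delta) = inf{ t : 4 H*(mu) beta(t, delta) < t }, with the
    exploration rate beta(t, delta) = log(C t^(1+alpha) / delta); the
    constants C and alpha here are placeholders, not the paper's choices."""
    H = h_star(mu)
    t = 1
    while 4 * H * math.log(C * t ** (1 + alpha) / delta) >= t:
        t += 1
    return t
```

For instance, on the matrix $\mu^1$ of Section 4, `h_star([[0.4, 0.5], [0.3, 0.35]])` sums contributions $400 + 100 + 400 + 400 = 1300$.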
Corollary 2. Let $\alpha > 0$ and $C = C_\alpha$ be such that $eK \sum_{t=1}^{\infty} \frac{(\log t)\, \log(C t^{1+\alpha})}{t^{1+\alpha}} \le C$, and let $\delta$ be such that $4(1+\alpha)(C/\delta)^{1/(1+\alpha)} > 4.85$. With probability larger than $1-\delta$, the M-LUCB strategy using the exploration rate

$$\beta(t,\delta) = \log\frac{C t^{1+\alpha}}{\delta} \qquad (4)$$

returns the maximin action within a number of steps upper-bounded as

$$\tau_\delta \le 4 H^*(\mu) \Big[ \log\frac{1}{\delta} + \log\big( C\, (4(1+\alpha) H^*(\mu))^{1+\alpha} \big) + 2(1+\alpha) \log\log\frac{4(1+\alpha) H^*(\mu)\, C^{\frac{1}{1+\alpha}}}{\delta^{\frac{1}{1+\alpha}}} \Big].$$

Corollary 3. For $b, c$ such that $c > 2$ and $b > c/2$, let the exploration rate be $\beta(t,\delta) = \log\frac{1}{\delta} + b \log\log\frac{1}{\delta} + c \log\log(et)$, and let

$$f_{b,c}(\delta) = K \sqrt{e}\, \frac{\pi^2}{3}\, \frac{1}{8^{c/2}}\, \frac{\big( \log\frac{1}{\delta} + b \log\log\frac{1}{\delta} + 2\sqrt{2} \big)^c}{\big( \log\frac{1}{\delta} \big)^b};$$

then with probability larger than $1 - f_{b,c}(\delta)\,\delta$, M-LUCB returns the maximin action and, for some positive constant $C_c$ and for $\delta$ small enough,

$$\tau_\delta \le 4 H^*(\mu) \Big[ \log\frac{1}{\delta} + \log\big(8 C_c H^*(\mu)\big) + 2 \log\log\frac{8 C_c H^*(\mu)}{\delta} \Big].$$

Elaborating on the same ideas, it is possible to obtain results in expectation, at the price of a less explicit bound that holds for a slightly larger exploration rate.

Theorem 4. The M-LUCB algorithm using $\beta(t,\delta)$ defined by (4), with $\alpha > 1$, is $\delta$-PAC and satisfies

$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} \le 4 H^*(\mu).$$

The complexity term $H^*(\mu)$ is easy to interpret: the number of draws of an arm $(i,j)$ is upper bounded by the typical number of samples needed either to discriminate $\mu_{i,j}$ from the smallest arm associated to the same action, $\mu_{i,1}$, or to discriminate $\mu_{i,1}$ from a 'virtual arm' with mean $(\mu_{1,1}+\mu_{2,1})/2$. We view this virtual arm (which corresponds to the choice of a parameter $c$ in Appendix A) as an artifact of our proof, and we conjecture that it could be replaced by $\mu_{2,1}$ for arms in $\mathcal{P}_1$ and by $\mu_{1,1}$ for other arms.
In the particular case of two actions per player, we propose the following finer result, which holds for the variant of M-LUCB that samples the least drawn arm among $H_t$ and $S_t$ at round $t+1$.

Theorem 5. Assume $K = K_1 = K_2 = 2$. The M-LUCB algorithm using $\beta(t,\delta)$ defined by (4) with $\alpha > 1$ is $\delta$-PAC and satisfies

$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} \le 8 \Big[ \frac{2}{(\mu_{1,1}-\mu_{2,1})^2} + \frac{1}{(\mu_{1,2}-\mu_{2,1})^2} + \frac{1}{\max\big[ (\mu_{1,1}-\mu_{2,1})^2,\ (\mu_{2,2}-\mu_{2,1})^2 \big]} \Big].$$

2.2. Improved Intervals and Stopping Rule

The symmetry and the simple form of the sub-Gaussian confidence intervals (3) are convenient for the analysis, but they can be greatly improved thanks to better deviation bounds for Bernoulli distributions. A simple improvement (see Kaufmann and Kalyanakrishnan (2013)) is to use Chernoff confidence intervals, based on the binary relative entropy function $d(x,y) = x \log(x/y) + (1-x) \log((1-x)/(1-y))$. Moreover, the use of a better stopping rule based on generalized likelihood ratio tests (GLRT) has been proposed recently for best-arm identification, leading to significant improvements. We propose here an adaptation of the Chernoff stopping rule of Garivier and Kaufmann (2016), valid for the case $\epsilon = 0$. This stopping rule is based on the statistic

$$Z_{P,Q}(t) := \log \frac{ \displaystyle\max_{\mu'_P \ge \mu'_Q} p_{\mu'_P}\big(X^P_{N_P(t)}\big)\, p_{\mu'_Q}\big(X^Q_{N_Q(t)}\big) }{ \displaystyle\max_{\mu'_P \le \mu'_Q} p_{\mu'_P}\big(X^P_{N_P(t)}\big)\, p_{\mu'_Q}\big(X^Q_{N_Q(t)}\big) },$$

where $X^P_s$ is a vector that contains the first $s$ observations of arm $P$, and $p_\mu(Z_1, \ldots, Z_s)$ is the likelihood of $s$ i.i.d. observations from a Bernoulli distribution with mean $\mu$.
Introducing the weighted sum of empirical means of two arms,

$$\hat{\mu}_{P,Q}(t) := \frac{N_P(t)}{N_P(t)+N_Q(t)}\, \hat{\mu}_P(t) + \frac{N_Q(t)}{N_P(t)+N_Q(t)}\, \hat{\mu}_Q(t),$$

it appears that, for $\hat{\mu}_P(t) \ge \hat{\mu}_Q(t)$,

$$Z_{P,Q}(t) = N_P(t)\, d\big(\hat{\mu}_P(t), \hat{\mu}_{P,Q}(t)\big) + N_Q(t)\, d\big(\hat{\mu}_Q(t), \hat{\mu}_{P,Q}(t)\big),$$

and $Z_{P,Q}(t) = -Z_{Q,P}(t)$. The stopping rule is defined as

$$\tau = \inf\big\{ t \in \mathbb{N} : \exists i \in \{1,\ldots,K\} : \forall i' \ne i,\ \exists j' \in \{1,\ldots,K_{i'}\} : \forall j \in \{1,\ldots,K_i\},\ Z_{(i,j),(i',j')}(t) > \beta(t,\delta) \big\}$$
$$= \inf\Big\{ t \in \mathbb{N} : \max_{i \in \{1,\ldots,K\}} \min_{i' \ne i} \max_{j' \in \{1,\ldots,K_{i'}\}} \min_{j \in \{1,\ldots,K_i\}} Z_{(i,j),(i',j')}(t) > \beta(t,\delta) \Big\}. \qquad (5)$$

Proposition 6. Using the stopping rule (5) with the exploration rate $\beta(t,\delta) = \log\frac{2 K_1 (K-1) t}{\delta}$, whatever the sampling rule, if $\tau$ is a.s. finite, the recommendation is correct with probability $\mathbb{P}_\mu(\hat{\imath} = i^*) \ge 1 - \delta$.

Sketch of Proof. Recall that in our notation the optimal action is $i^* = 1$. Then

$$\mathbb{P}_\mu(\hat{\imath} \ne 1) \le \mathbb{P}_\mu\big( \exists t \in \mathbb{N},\ \exists i \in \{1,\ldots,K\} \setminus \{1\},\ \exists j \in \{1,\ldots,K_1\} : Z_{(i,1),(1,j)}(t) > \beta(t,\delta) \big) \le \sum_{i=2}^{K} \sum_{j=1}^{K_1} \mathbb{P}_\mu\big( \exists t \in \mathbb{N} : Z_{(i,1),(1,j)}(t) > \beta(t,\delta) \big).$$

Note that for $i \ne 1$, $\mu_{(i,1)} < \mu_{(1,j)}$ for all $j \in \{1,\ldots,K_1\}$. The result follows from the following bound, proved in Garivier and Kaufmann (2016): whenever $\mu_P < \mu_Q$, for any sampling strategy,

$$\mathbb{P}_\mu\Big( \exists t \in \mathbb{N} : Z_{P,Q}(t) > \log\frac{2t}{\delta} \Big) \le \delta. \qquad (6)$$

3. A Racing Algorithm

We now propose a Racing-type algorithm for the maximin action identification problem, inspired by another line of algorithms for best arm identification (Maron and Moore, 1997; Even-Dar et al., 2006; Kaufmann and Kalyanakrishnan, 2013). Racing algorithms are simple and powerful methods that progressively concentrate on the best actions.
We give in this section an analysis of a Maximin-Racing algorithm that relies on the refined information-theoretic tools introduced in the previous section.

3.1. A Generic Maximin-Racing Algorithm

The Maximin-Racing algorithm maintains a set of active arms $\mathcal{R}$ and proceeds in rounds, in which all the active arms are sampled. At the end of round $r$, all active arms have been sampled $r$ times, and some arms may be eliminated according to some elimination rule. We denote by $\hat{\mu}_P(r)$ the average of the $r$ observations of arm $P$. The elimination rule relies on an elimination function $f(x,y)$ ($f(x,y)$ is large if $x$ is significantly larger than $y$) and on a threshold function $\beta(r,\delta)$. The Maximin-Racing algorithm presented below performs two kinds of eliminations: the largest arm in each set $\mathcal{R}_i$ may be eliminated if it appears to be significantly larger than the smallest arm in $\mathcal{R}_i$ (high arm elimination), and the group of arms $\mathcal{R}_i$ containing the smallest arm may be eliminated (all the arms in $\mathcal{R}_i$ are removed from the active set) if it contains one arm that appears significantly smaller than all the arms of another group $\mathcal{R}_j$ (action elimination).

Maximin-Racing algorithm
Parameters: elimination function $f$, threshold function $\beta$.
Initialization: for each $i \in \{1,\ldots,K\}$, $\mathcal{R}_i = \mathcal{P}_i$, and $\mathcal{R} := \mathcal{R}_1 \cup \cdots \cup \mathcal{R}_K$.
Main loop, at round $r$:
– All arms in $\mathcal{R}$ are drawn, and the empirical means $\hat{\mu}_P(r)$, $P \in \mathcal{R}$, are updated.
– High arm elimination step: for each action $i = 1, \ldots, K$, if $|\mathcal{R}_i| \ge 2$ and

$$r\, f\Big( \max_{P \in \mathcal{R}_i} \hat{\mu}_P(r),\ \min_{P \in \mathcal{R}_i} \hat{\mu}_P(r) \Big) \ge \beta(r,\delta), \qquad (7)$$

then remove $P_m = \operatorname{argmax}_{P \in \mathcal{R}_i} \hat{\mu}_P(r)$ from the active set: $\mathcal{R}_i = \mathcal{R}_i \setminus \{P_m\}$, $\mathcal{R} = \mathcal{R} \setminus \{P_m\}$.
– Action elimination step: let $(\tilde{\imath}, \tilde{\jmath}) = \operatorname{argmin}_{P \in \mathcal{R}} \hat{\mu}_P(r)$; if

$$r\, f\Big( \max_{i \ne \tilde{\imath}} \min_{P \in \mathcal{R}_i} \hat{\mu}_P(r),\ \hat{\mu}_{(\tilde{\imath},\tilde{\jmath})}(r) \Big) \ge \beta(r,\delta),$$

then remove $\tilde{\imath}$ from the possible maximin actions: $\mathcal{R} = \mathcal{R} \setminus \mathcal{R}_{\tilde{\imath}}$ and $\mathcal{R}_{\tilde{\imath}} = \emptyset$.

The algorithm stops when all but one of the $\mathcal{R}_i$ are empty, and outputs the index of the remaining set as the maximin action. If the stopping condition is not met for $r = r_0 := \frac{2}{\epsilon^2} \log\frac{4K}{\delta}$, then the algorithm stops and returns one of the empirical maximin actions.

3.2. Tuning the Elimination and Threshold Functions

In the best-arm identification literature, several elimination functions have been studied. The first idea, presented in the Successive Elimination algorithm of Even-Dar et al. (2006), is to use the simple difference $f(x,y) = (x-y)^2\, \mathbb{1}(x \ge y)$; in order to take into account possible differences in the deviations of the arms, the KL-Racing algorithm of Kaufmann and Kalyanakrishnan (2013) uses an elimination function equivalent to $f(x,y) = d^*(x,y)\, \mathbb{1}(x \ge y)$, where $d^*(x,y)$ is defined as the common value of $d(x,z)$ and $d(y,z)$ for the unique $z$ satisfying $d(x,z) = d(y,z)$. In this paper, we use the divergence function

$$f(x,y) = I(x,y) := \Big[ d\Big(x, \frac{x+y}{2}\Big) + d\Big(y, \frac{x+y}{2}\Big) \Big]\, \mathbb{1}(x \ge y), \qquad (8)$$

inspired by the deviation bounds of Section 2.2. In particular, using again Inequality (6) for the uniform sampling rule yields, whenever $\mu_P < \mu_Q$,

$$\mathbb{P}_\mu\Big( \exists r \in \mathbb{N} : r\, I\big(\hat{\mu}_P(r), \hat{\mu}_Q(r)\big) \ge \log\frac{2r}{\delta} \Big) \le \delta. \qquad (9)$$

Using this bound, Proposition 7 (proved in Appendix B.1) proposes a choice of the threshold function for which the Maximin-Racing algorithm is $\delta$-PAC.
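The divergence $I(x,y)$ of Equation (8) and the high-arm elimination test (7) can be sketched in a few lines of Python (an illustration, not the paper's code; for simplicity we assume all means lie strictly inside $(0,1)$, so the binary relative entropy needs no boundary cases):

```python
import math

def d(x, y):
    """Binary relative entropy d(x, y), assuming 0 < x, y < 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def I_div(x, y):
    """Elimination function I(x, y) of Equation (8): zero unless x >= y,
    otherwise the sum of the two KL divergences to the midpoint (x + y) / 2."""
    if x < y:
        return 0.0
    m = (x + y) / 2
    return d(x, m) + d(y, m)

def eliminate_high_arm(means_i, r, beta):
    """High-arm elimination check (7) for one action i (sketch): after r
    uniform draws, compare the largest and smallest empirical means in the
    group against the threshold beta(r, delta)."""
    return r * I_div(max(means_i), min(means_i)) >= beta
```

By Pinsker's inequality, `I_div(x, y)` exceeds $(x-y)^2$ whenever $x \ge y$, which is the comparison used in the discussion after Theorem 8.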
Proposition 7. With the elimination function $I(x,y)$ of Equation (8) and with the threshold function $\beta(t,\delta) = \log(4 C_K t / \delta)$, the Maximin-Racing algorithm satisfies

$$\mathbb{P}_\mu\big( \mu_{1,1} - \mu_{\hat{\imath},1} \le \epsilon \big) \ge 1 - \delta,$$

with $C_K \le \overline{K}^2$. If $\mu_{1,1} > \mu_{2,1}$ and if $\forall i,\ \mu_{i,1} < \mu_{i,2}$, then $C_K = \overline{K} \times \max_i K_i$.

3.3. Sample Complexity Analysis

We propose here an asymptotic analysis of the number of draws of each arm $(i,j)$ under the Maximin-Racing algorithm, denoted by $\tau_\delta(i,j)$. These bounds are expressed with the deviation function $I$, and hold for $\epsilon > 0$. For $\epsilon = 0$, one can provide similar bounds under the additional assumption that all arms are pairwise distinct.

Theorem 8. Assume $\mu_{1,1} > \mu_{2,1}$. For every $\epsilon > 0$, and for $\beta(t,\delta)$ chosen as in Proposition 7, the Maximin-Racing algorithm satisfies

$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta(1,1)]}{\log(1/\delta)} \le \frac{1}{\max\big[ \epsilon^2/2,\ I(\mu_{2,1}, \mu_{1,1}) \big]}$$

and, for any $(i,j) \ne (1,1)$,

$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta(i,j)]}{\log(1/\delta)} \le \frac{1}{\max\big[ \epsilon^2/2,\ I(\mu_{i,1}, \mu_{1,1}),\ I(\mu_{i,j}, \mu_{i,1}) \big]}.$$

It follows from Pinsker's inequality that $I(x,y) > (x-y)^2$, and hence Theorem 8 implies in particular that, for the M-Racing algorithm (for a sufficiently small $\epsilon$),

$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} \le \frac{1}{(\mu_{1,1}-\mu_{2,1})^2} + \sum_{j=2}^{K_1} \frac{1}{(\mu_{1,j}-\mu_{1,1})^2} + \sum_{i=2}^{K} \sum_{j=1}^{K_i} \frac{1}{(\mu_{1,1}-\mu_{i,1})^2 \vee (\mu_{i,j}-\mu_{i,1})^2}.$$

The complexity term on the right-hand side is reminiscent of the quantity $H^*(\mu)$ introduced in Theorem 1. The terms corresponding to arms in $\mathcal{P} \setminus \mathcal{P}_1$ are comparable to the corresponding terms in $H^*(\mu)$ (they are actually strictly smaller, since no 'virtual arm' with mean $(\mu_{1,1}+\mu_{2,1})/2$ has been introduced in the analysis of M-Racing). However, the terms corresponding to the arms $(1,j)$, $j \ge 2$, are strictly larger than the corresponding terms in $H^*(\mu)$.
But this is mitigated by the fact that there is no multiplicative constant in front of the complexity term. Besides, as Theorem 8 involves the deviation function $I(x,y) = d(x,(x+y)/2) + d(y,(x+y)/2)$ and not a sub-Gaussian approximation, its bounds can indeed be significantly better.

4. Numerical Experiments and Discussion

In the previous sections, we have proposed two different algorithms for the maximin action identification problem. The analysis that we have given does not clearly advocate the superiority of one or the other. The goal of this section is to propose a brief numerical comparison in different settings, and to compare with other possible strategies. We notably study empirically two interesting variants of M-LUCB. The first improvement that we propose is the M-KL-LUCB strategy, based on KL-based confidence bounds (Kaufmann and Kalyanakrishnan (2013)). The second variant, M-Chernoff, additionally improves the stopping rule as presented in Section 2.2. Whereas Proposition 6 justifies the use of the exploration rate $\beta(t,\delta) = \log(4K^2 t/\delta)$, which is over-conservative in practice, we use $\beta(t,\delta) = \log((\log(t)+1)/\delta)$ in all our experiments, as suggested by Corollary 3 (this appears to be already quite a conservative choice in practice). In the experiments, we set $\delta = 0.1$ and $\epsilon = 0$. To simplify the discussion and the comparison, we first focus on the particular case in which there are two actions for each player. As an element of comparison, one can observe that finding $i^*$ is at most as hard as finding the worst arm (or the three best) among the four arms $(\mu_{i,j})_{1 \le i,j \le 2}$. Thus, one could use standard best-arm identification strategies like the (original) LUCB algorithm.
For the latter, the complexity is of order

$$\frac{2}{(\mu_{1,1}-\mu_{2,1})^2} + \frac{1}{(\mu_{1,2}-\mu_{2,1})^2} + \frac{1}{(\mu_{2,2}-\mu_{2,1})^2},$$

which is much worse than the complexity term obtained for M-LUCB in Theorem 5 when $\mu_{2,2}$ and $\mu_{2,1}$ are close to one another. This is because a best arm identification algorithm does not only find the maximin action, but additionally figures out which of the arms in the other action is worst. Our algorithm does not need to discriminate between $\mu_{2,1}$ and $\mu_{2,2}$; it only tries to assess that one of these two arms is smaller than $\mu_{1,1}$. However, for specific instances in which the gap between $\mu_{2,2}$ and $\mu_{2,1}$ is very large, the difference vanishes. This is illustrated in the numerical experiments of Table 1, which involve the following three sets of parameters (the entry $(i,j)$ in each matrix is the mean $\mu_{i,j}$):

$$\mu^1 = \begin{pmatrix} 0.4 & 0.5 \\ 0.3 & 0.35 \end{pmatrix}, \qquad \mu^2 = \begin{pmatrix} 0.4 & 0.5 \\ 0.3 & 0.45 \end{pmatrix}, \qquad \mu^3 = \begin{pmatrix} 0.4 & 0.5 \\ 0.3 & 0.6 \end{pmatrix}.$$

Table 1: Number of draws of the different arms under the models parametrized by $\mu^1$, $\mu^2$, $\mu^3$ (from left to right), averaged over $N = 10000$ repetitions.

             τ11   τ12   τ21   τ22  |  τ11   τ12   τ21   τ22  |  τ11   τ12   τ21   τ22
M-LUCB      1762   198  1761   462  | 1761   197  1760   110  | 1755   197  1755    36
M-KL-LUCB    762    92   733   237  |  743    92   743    54  |  735    93   740    16
M-Chernoff   315    59   291   136  |  325    61   327    41  |  321    61   326    13
M-Racing     324   152   301   298  |  329   161   318   137  |  322   159   323    35
KL-LUCB      351    64  3074  2768  |  627    83   841   187  |  684    88   774    32

We also performed experiments in a model with 3×3 actions, with parameters

$$\mu = \begin{pmatrix} 0.45 & 0.5 & 0.55 \\ 0.35 & 0.4 & 0.6 \\ 0.3 & 0.47 & 0.52 \end{pmatrix}.$$

Figure 4 shows that the best three algorithms in the previous experiments behave as expected: the numbers of draws of the arms are ordered exactly as suggested by the bounds given in the analysis.

Figure 4: Number of draws of each arm under the bandit model $\mu$, averaged over $N = 10000$ repetitions:

$$\tau^{\text{M-KL-LUCB}} = \begin{pmatrix} 798 & 212 & 92 \\ 752 & 248 & 22 \\ 210 & 44 & 21 \end{pmatrix}, \quad \tau^{\text{M-Chernoff}} = \begin{pmatrix} 367 & 131 & 67 \\ 333 & 156 & 18 \\ 129 & 31 & 17 \end{pmatrix}, \quad \tau^{\text{M-Racing}} = \begin{pmatrix} 472 & 291 & 173 \\ 337 & 337 & 42 \\ 161 & 185 & 71 \end{pmatrix}.$$

These experiments tend to show that, in practice, the best two algorithms are M-Racing and M-Chernoff, with a slight advantage for the latter. However, we did not provide theoretical sample complexity bounds for M-Chernoff, and it is to be noted that the use of Hoeffding bounds in the M-LUCB algorithm (which has been analyzed) is a cause of sub-optimality. Among the algorithms for which we provide theoretical sample complexity guarantees, the M-Racing algorithm appears to perform best.

5. Perspectives

To finish, let us sketch the (still speculative) perspective of an important improvement. For simplicity, we focus on the case where each player chooses among only two possible actions, and we change our notation, using $\mu_1 := \mu_{1,1}$, $\mu_2 := \mu_{1,2}$, $\mu_3 := \mu_{2,1}$, $\mu_4 := \mu_{2,2}$. As we will see below, the optimal strategy is going to depend a lot on the position of $\mu_4$ relative to $\mu_1$ and $\mu_2$. Given $w = (w_1, \ldots, w_4) \in \Sigma_K = \{ w \in \mathbb{R}_+^4 : w_1 + \cdots + w_4 = 1 \}$, we define for $a, b, c$ in $\{1, \ldots, 4\}$:

$$\mu_{a,b}(w) = \frac{w_a \mu_a + w_b \mu_b}{w_a + w_b} \qquad \text{and} \qquad \mu_{a,b,c}(w) = \frac{w_a \mu_a + w_b \mu_b + w_c \mu_c}{w_a + w_b + w_c}.$$

Using a similar argument to the one of Garivier and Kaufmann (2016) in the context of best-arm identification, one can prove the following (non-explicit) lower bound on the sample complexity.
Theorem 9 Any $\delta$-PAC algorithm satisfies $\mathbb{E}_\mu[\tau_\delta] \geq T^*(\mu)\, d(\delta, 1-\delta)$, where
$$T^*(\mu)^{-1} := \sup_{w \in \Sigma_K} \; \inf_{\mu' : \mu'_1 \wedge \mu'_2 < \mu'_3 \wedge \mu'_4} \; \sum_{a=1}^{K} w_a\, d(\mu_a, \mu'_a) = \sup_{w \in \Sigma_K} \min\left[F_1(\mu, w), F_2(\mu, w)\right], \tag{10}$$
where
$$F_a(\mu, w) = \begin{cases} w_a\, d\big(\mu_a, \mu_{a,3}(w)\big) + w_3\, d\big(\mu_3, \mu_{a,3}(w)\big) & \text{if } \mu_4 \geq \mu_{a,3}(w), \\ w_a\, d\big(\mu_a, \mu_{a,3,4}(w)\big) + w_3\, d\big(\mu_3, \mu_{a,3,4}(w)\big) + w_4\, d\big(\mu_4, \mu_{a,3,4}(w)\big) & \text{otherwise.} \end{cases}$$

A particular case. When $\mu_4 > \mu_2$, for any $w \in \Sigma_K$ it holds that $\mu_4 \geq \mu_{1,3}(w)$ and $\mu_4 \geq \mu_{2,3}(w)$. Hence the complexity term can be rewritten as
$$T^*(\mu)^{-1} = \sup_{w \in \Sigma_K} \min_{a=1,2} \left[ w_a\, d\big(\mu_a, \mu_{a,3}(w)\big) + w_3\, d\big(\mu_3, \mu_{a,3}(w)\big) \right].$$
In that case it is possible to show that the quantity
$$w^*(\mu) = \operatorname*{argmax}_{w \in \Sigma_K} \min_{a=1,2} \left[ w_a\, d\big(\mu_a, \mu_{a,3}(w)\big) + w_3\, d\big(\mu_3, \mu_{a,3}(w)\big) \right]$$
is unique, and to give a more explicit expression for it. This quantity is to be interpreted as the vector of proportions of draws of the arms by a strategy matching the lower bound. In this particular case, one finds $w_4^*(\mu) = 0$, showing that an optimal strategy could draw arm 4 only an asymptotically vanishing proportion of the time as $\delta$ and $\epsilon$ go to 0.

Towards an Asymptotically Optimal Algorithm. Assume that the solution of the general optimization problem (10) is well-behaved (uniqueness of the solution, continuity in the parameters, ...) and that we can find an efficient algorithm to compute $w^*(\mu) = \operatorname{argmax}_{w \in \Sigma_K} \min[F_1(\mu, w), F_2(\mu, w)]$ for any given $\mu$. In particular, for fixed $w$ and $\mu$, we need to be able to compute
$$F(w, \mu) = \inf_{\mu' \in \mathrm{Alt}(\mu)} \sum_{a=1}^{4} w_a\, d(\mu_a, \mu'_a), \qquad \text{where } \mathrm{Alt}(\mu) = \{\mu' : i^*(\mu) \neq i^*(\mu')\}.$$
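To get a feel for the characteristic time in (10), the quantity $\sup_w \min[F_1, F_2]$ can be evaluated numerically. The sketch below is an illustration of ours, not the authors' implementation: it assumes Bernoulli arms (so that $d$ is the Bernoulli Kullback-Leibler divergence, as elsewhere in the paper) and replaces the supremum over the simplex by a crude random search.

```python
import math
import random

def kl(x, y):
    """Bernoulli KL divergence d(x, y)."""
    eps = 1e-12
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def wmean(mu, w, idx):
    """Weighted mean mu_{a,b,...}(w) over the arms listed in idx (0-based)."""
    s = sum(w[a] for a in idx)
    return sum(w[a] * mu[a] for a in idx) / s

def F(mu, w, a):
    """F_a(mu, w) from Theorem 9; a = 0 or 1 stands for mu_1 or mu_2,
    and arms mu_3, mu_4 are mu[2], mu[3]."""
    m_a3 = wmean(mu, w, [a, 2])
    if mu[3] >= m_a3:  # case mu_4 >= mu_{a,3}(w)
        return w[a] * kl(mu[a], m_a3) + w[2] * kl(mu[2], m_a3)
    m_a34 = wmean(mu, w, [a, 2, 3])  # otherwise project on mu_{a,3,4}(w)
    return (w[a] * kl(mu[a], m_a34) + w[2] * kl(mu[2], m_a34)
            + w[3] * kl(mu[3], m_a34))

def T_star_inv(mu, n_trials=20000, seed=0):
    """Crude random search for T*(mu)^{-1} = sup_w min(F_1, F_2)."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_trials):
        g = [rng.expovariate(1.0) for _ in range(4)]
        w = [x / sum(g) for x in g]  # a random point of the simplex
        best = max(best, min(F(mu, w, 0), F(mu, w, 1)))
    return best

# mu = (mu_{1,1}, mu_{1,2}, mu_{2,1}, mu_{2,2}), i.e. the matrix mu^1 of Table 1
print(T_star_inv([0.4, 0.5, 0.3, 0.35]))  # numerical estimate of T*(mu)^{-1}
```

A proper implementation would replace the random search by a convex solver over $\Sigma_K$, but this is enough to compare instances qualitatively.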
Then, if we can design a sampling rule ensuring that for all $a$, $N_a(t)/t$ tends to $w_a^*(\mu)$, and if we combine it with the stopping rule
$$\tau_\delta = \inf\left\{ t \in \mathbb{N} : F\big((N_a(t))_{a=1,\dots,4},\, \hat{\mu}(t)\big) > \log\left(\frac{Ct}{\delta}\right) \right\}$$
for some positive constant $C$, then one could expect the following asymptotic optimality property:
$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} \leq T^*(\mu).$$
But proving that this stopping rule does ensure a $\delta$-PAC algorithm is not straightforward, and the analysis remains to be done.

Acknowledgments

This work was partially supported by the CIMI (Centre International de Mathématiques et d'Informatique) Excellence program while Emilie Kaufmann visited Toulouse in November 2015. The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grants ANR-13-BS01-0005 (project SPADRO) and ANR-13-CORD-0020 (project ALICIA).

References

J-Y. Audibert, S. Bubeck, and R. Munos. Best Arm Identification in Multi-armed Bandits. In Proceedings of the 23rd Conference on Learning Theory, 2010.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.

M. Bowling, N. Burch, M. Johanson, and O. Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145-149, January 2015.

C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-49, 2012.

S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

S. Bubeck, R. Munos, and G. Stoltz. Pure Exploration in Finitely Armed and Continuous Armed Bandits. Theoretical Computer Science, 412:1832-1852, 2011.

O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516-1541, 2013.

E. Even-Dar, S. Mannor, and Y. Mansour. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 7:1079-1105, 2006.

J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer, 1996.

V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence. In Advances in Neural Information Processing Systems, 2012.

A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. arXiv, 2016.

S. Gelly, L. Kocsis, M. Schoenauer, M. Sebag, D. Silver, C. Szepesvári, and O. Teytaud. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM, 55(3):106-113, 2012.

K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil'UCB: an Optimal Exploration Algorithm for Multi-Armed Bandits. In Proceedings of the 27th Conference on Learning Theory, 2014.

S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML), 2012.

E. Kaufmann and S. Kalyanakrishnan. Information complexity in bandit subset selection. In Proceedings of the 26th Conference On Learning Theory, 2013.

E. Kaufmann, O. Cappé, and A. Garivier. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. Journal of Machine Learning Research (to appear), 2015.
L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML '06, pages 282-293. Springer-Verlag, 2006.

O. Maron and A. Moore. The Racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1-5):113-131, 1997.

R. Munos. From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, volume 7, 2014.

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484-489, 2016.

B. Szorenyi, G. Kedenburg, and R. Munos. Optimistic planning in Markov decision processes using a generative model. In Advances in Neural Information Processing Systems, 2014.

Appendix A. Analysis of the Maximin-LUCB algorithm

We define the event $E_t = \bigcap_{P \in \mathcal{P}} \big(\mu_P \in [L_P(t), U_P(t)]\big)$, so that the event $E$ defined in Theorem 1 rewrites as $E = \bigcap_{t \in 2\mathbb{N}} E_t$. Assume that the event $E$ holds. The recommended arm $\hat{\imath}$ satisfies, by definition of the algorithm, for all $i \neq \hat{\imath}$,
$$\min_{j \in \mathcal{K}_{\hat{\imath}}} L_{(\hat{\imath},j)}(\tau_\delta) > \min_{j \in \mathcal{K}_i} U_{(i,j)}(\tau_\delta) - \epsilon.$$
Using that $L_P(\tau_\delta) \leq \mu_P \leq U_P(\tau_\delta)$ for all $P \in \mathcal{P}$ (by definition of $E$) yields, for all $i$,
$$\mu_{\hat{\imath},1} = \min_{j \in \mathcal{K}_{\hat{\imath}}} \mu_{\hat{\imath},j} > \min_{j \in \mathcal{K}_i} \mu_{i,j} - \epsilon = \mu_{i,1} - \epsilon,$$
hence $\max_{i \neq \hat{\imath}} \mu_{i,1} - \mu_{\hat{\imath},1} < \epsilon$. Thus, either $\hat{\imath} = 1$ or $\hat{\imath}$ satisfies $\mu_{1,1} - \mu_{\hat{\imath},1} < \epsilon$.
In either case, $\hat{\imath}$ is $\epsilon$-optimal, which proves that M-LUCB is correct on $E$. We now analyze M-LUCB with $\epsilon = 0$. Our analysis is based on the following two key lemmas, whose proofs are given below.

Lemma 10 Let $c \in [\mu_{2,1}, \mu_{1,1}]$ and $t \in 2\mathbb{N}$. On $E_t$, if $\tau_\delta > t$, there exists $P \in \{H_t, S_t\}$ such that $c \in [L_P(t), U_P(t)]$.

Lemma 11 Let $c \in [\mu_{2,1}, \mu_{1,1}]$ and $t \in 2\mathbb{N}$. On $E_t$, for every $(i,j) \in \{H_t, S_t\}$,
$$c \in [L_{(i,j)}(t), U_{(i,j)}(t)] \;\Rightarrow\; N_{(i,j)}(t) \leq \min\left( \frac{2}{(\mu_{i,1} - c)^2}, \frac{2}{(\mu_{i,j} - \mu_{i,1})^2} \right) \beta(t, \delta).$$

Defining, for every arm $P = (i,j) \in \mathcal{P}$, the constant
$$c_P = \frac{1}{\max\left( \left(\mu_{i,1} - \frac{\mu_{1,1} + \mu_{2,1}}{2}\right)^2, \left(\mu_{i,j} - \mu_{i,1}\right)^2 \right)},$$
combining the two lemmas (for the particular choice $c = \frac{\mu_{1,1} + \mu_{2,1}}{2}$) yields the following key statement:
$$E_t \cap (\tau_\delta > t) \;\Rightarrow\; \exists P \in \{H_t, S_t\} : N_P(t) \leq 2 c_P\, \beta(t, \delta). \tag{11}$$
Note that $H^*(\mu) = \sum_{P \in \mathcal{P}} c_P$, from its definition in Theorem 1.

A.1. Proof of Theorem 1

Let $T$ be a deterministic time. On the event $E = \bigcap_{t \in 2\mathbb{N}} E_t$, using (11) and the fact that for every even $t$, $(\tau_\delta > t) = (\tau_\delta > t+1)$ by definition of the algorithm, one has
$$\min(\tau_\delta, T) = \sum_{t=1}^{T} \mathbb{1}(\tau_\delta > t) = 2 \sum_{\substack{t \in 2\mathbb{N} \\ t \leq T}} \mathbb{1}(\tau_\delta > t) \leq 2 \sum_{\substack{t \in 2\mathbb{N} \\ t \leq T}} \mathbb{1}\left(\exists P \in \{H_t, S_t\} : N_P(t) \leq 2 c_P\, \beta(t, \delta)\right)$$
$$\leq 2 \sum_{\substack{t \in 2\mathbb{N} \\ t \leq T}} \sum_{P \in \mathcal{P}} \mathbb{1}\left((P_{t+1} = P) \cup (P_{t+2} = P)\right) \mathbb{1}\left(N_P(t) \leq 2 c_P\, \beta(T, \delta)\right) \leq 4 \sum_{P \in \mathcal{P}} c_P\, \beta(T, \delta) = 4 H^*(\mu)\, \beta(T, \delta).$$
For any $T$ such that $4 H^*(\mu)\, \beta(T, \delta) < T$, one has $\min(\tau_\delta, T) < T$, which implies $\tau_\delta < T$. Therefore $\tau_\delta \leq T(\mu, \delta)$ for the $T(\mu, \delta)$ defined in Theorem 1.

A.2. Proof of Theorem 4

Let $\gamma > 0$, and let $T$ be a deterministic time.
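The constant $H^*(\mu)$ is easy to tabulate on a concrete instance. The helper below is our own illustration (not the authors' code); it uses the convention, as in the paper, that within each action the arms are indexed so that $\mu_{i,1} = \min_j \mu_{i,j}$, and that action 1 is the maximin action.

```python
def maximin_H_star(mu):
    """H*(mu) = sum over arms P=(i,j) of
    c_P = 1 / max((mu_{i,1} - c)^2, (mu_{i,j} - mu_{i,1})^2),
    with c = (mu_{1,1} + mu_{2,1}) / 2 the midpoint between the two
    largest row minima.  mu is a dict {(i, j): mean} of arm means in [0, 1]."""
    row_min = {}
    for (i, _), m in mu.items():
        row_min[i] = min(row_min.get(i, 1.0), m)
    top_two = sorted(row_min.values(), reverse=True)[:2]
    c = sum(top_two) / 2  # the threshold used in Lemmas 10 and 11
    return sum(
        1.0 / max((row_min[i] - c) ** 2, (m - row_min[i]) ** 2)
        for (i, _), m in mu.items()
    )

# matrix mu^1 from the experiments: action 1 = (0.4, 0.5), action 2 = (0.3, 0.35)
mu1 = {(1, 1): 0.4, (1, 2): 0.5, (2, 1): 0.3, (2, 2): 0.35}
print(maximin_H_star(mu1))  # c_P values 400 + 100 + 400 + 400, i.e. 1300
```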
On the event $G_T = \bigcap_{t \in 2\mathbb{N},\, \lfloor \gamma T \rfloor \leq t \leq T} E_t$, one can write
$$\min(\tau_\delta, T) \leq 2\gamma T + 2 \sum_{\substack{t \in 2\mathbb{N} \\ \lfloor \gamma T \rfloor \leq t \leq T}} \mathbb{1}(\tau_\delta > t) \leq 2\gamma T + 2 \sum_{\substack{t \in 2\mathbb{N} \\ \lfloor \gamma T \rfloor \leq t \leq T}} \mathbb{1}\left(\exists P \in \{H_t, S_t\} : N_P(t) \leq 2 c_P\, \beta(t, \delta)\right)$$
$$\leq 2\gamma T + 2 \sum_{\substack{t \in 2\mathbb{N} \\ \lfloor \gamma T \rfloor \leq t \leq T}} \sum_{P \in \mathcal{P}} \mathbb{1}\left((P_{t+1} = P) \cup (P_{t+2} = P)\right) \mathbb{1}\left(N_P(t) \leq 2 c_P\, \beta(T, \delta)\right) \leq 2\gamma T + 4 H^*(\mu)\, \beta(T, \delta).$$
Introducing $T_\gamma(\mu, \delta) := \inf\{T \in \mathbb{N} : 4 H^*(\mu)\, \beta(T, \delta) < (1 - 2\gamma) T\}$, for all $T \geq T_\gamma(\mu, \delta)$ one has $G_T \subseteq (\tau_\delta \leq T)$. One can then bound the expectation of $\tau_\delta$ in the following way (using notably the self-normalized deviation inequality of Cappé et al. (2013)):
$$\mathbb{E}_\mu[\tau_\delta] = \sum_{T=1}^{\infty} \mathbb{P}_\mu(\tau_\delta > T) \leq T_\gamma + \sum_{T=T_\gamma}^{\infty} \mathbb{P}_\mu(\tau_\delta > T) \leq T_\gamma + \sum_{T=T_\gamma}^{\infty} \mathbb{P}_\mu(G_T^c)$$
$$\leq T_\gamma + \sum_{T=1}^{\infty} \sum_{t=\lfloor \gamma T \rfloor}^{T} \sum_{P \in \mathcal{P}} \left[ \mathbb{P}_\mu\left(\mu_P > \hat{\mu}_P(t) + \sqrt{\tfrac{\beta(t,\delta)}{2 N_P(t)}}\right) + \mathbb{P}_\mu\left(\mu_P < \hat{\mu}_P(t) - \sqrt{\tfrac{\beta(t,\delta)}{2 N_P(t)}}\right) \right]$$
$$\leq T_\gamma + \sum_{T=1}^{\infty} \sum_{t=\lfloor \gamma T \rfloor}^{T} 2K\, \mathbb{P}_\mu\left(\mu_P > \hat{\mu}_P(t) + \sqrt{\tfrac{\beta(t,1)}{2 N_P(t)}}\right) \leq T_\gamma + \sum_{T=1}^{\infty} \sum_{t=\lfloor \gamma T \rfloor}^{T} 2K e \log(t)\, \beta(t,1) \exp(-\beta(t,1))$$
$$\leq T_\gamma + \sum_{T=1}^{\infty} 2K e T \log(T)\, \beta(T,1) \exp(-\beta(\gamma T, 1)) = T_\gamma + \sum_{T=1}^{\infty} \frac{2K e T \log(T) \log(C T^{1+\alpha})}{C\, \gamma^{1+\alpha}\, T^{1+\alpha}},$$
where the series is convergent for $\alpha > 1$. One has
$$T_\gamma(\mu, \delta) = \inf\left\{ T \in \mathbb{N} : \log\left(\frac{C T^{1+\alpha}}{\delta}\right) < \frac{(1 - 2\gamma) T}{4 H^*(\mu)} \right\}.$$
The technical Lemma 12 below permits to give an upper bound on $T_\gamma(\mu, \delta)$ for small values of $\delta$, which implies in particular
$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} \leq \frac{4 H^*(\mu)}{1 - 2\gamma}.$$
Letting $\gamma$ go to zero yields the result.

Lemma 12 If $\alpha, c_1, c_2 > 0$ are such that $a = (1+\alpha)\, c_2^{1/(1+\alpha)} / c_1 > 4.85$, then
$$x = \frac{1+\alpha}{c_1} \left[ \log(a) + 2 \log(\log(a)) \right]$$
is such that $c_1 x \geq \log(c_2 x^{1+\alpha})$.

Proof. One can check that if $a \geq 4.85$, then $\log^2(a) > \log(a) + 2\log(\log(a))$. Thus, $y = \log(a) + 2\log(\log(a))$ is such that $y \geq \log(ay)$.
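Lemma 12 is easy to sanity-check numerically. The short script below (an illustration we added, with arbitrary test values for $c_1$, $c_2$, $\alpha$) verifies both the auxiliary inequality $\log^2(a) > \log(a) + 2\log\log(a)$ from $a = 4.85$ upwards and the conclusion $c_1 x \geq \log(c_2 x^{1+\alpha})$.

```python
import math

def lemma12_x(c1, c2, alpha):
    """The x of Lemma 12, valid when a = (1+alpha) * c2**(1/(1+alpha)) / c1 > 4.85."""
    a = (1 + alpha) * c2 ** (1 / (1 + alpha)) / c1
    assert a > 4.85, "Lemma 12 requires a > 4.85"
    return (1 + alpha) / c1 * (math.log(a) + 2 * math.log(math.log(a)))

# auxiliary inequality used in the proof, checked on a grid starting at a = 4.85
for k in range(100):
    a = 4.85 + 0.5 * k
    assert math.log(a) ** 2 > math.log(a) + 2 * math.log(math.log(a))

# the conclusion c1 * x >= log(c2 * x**(1+alpha)) on a few arbitrary instances
for c1, c2, alpha in [(0.1, 5.0, 1.1), (0.5, 100.0, 2.0), (1.0, 50.0, 1.5)]:
    x = lemma12_x(c1, c2, alpha)
    assert c1 * x >= math.log(c2 * x ** (1 + alpha))
```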
Using $y = c_1 x / (1+\alpha)$ and $a = (1+\alpha)\, c_2^{1/(1+\alpha)} / c_1$, one obtains the result. $\Box$

A.3. Proof of Lemma 10

We show that on $E_t \cap (\tau_\delta > t)$, the following four statements cannot occur, which yields that the threshold $c$ is contained in one of the intervals $I_{H_t}(t)$ or $I_{S_t}(t)$:

1. $(L_{H_t}(t) > c) \cap (L_{S_t}(t) > c)$
2. $(U_{H_t}(t) < c) \cap (U_{S_t}(t) < c)$
3. $(U_{H_t}(t) < c) \cap (L_{S_t}(t) > c)$
4. $(L_{H_t}(t) > c) \cap (U_{S_t}(t) < c)$

Statement 1 implies that there exist two actions $i$ and $i'$ such that $\forall j \leq K_i$, $L_{i,j}(t) \geq c$ and $\forall j' \leq K_{i'}$, $L_{i',j'}(t) \geq c$. Because $E_t$ holds, one has in particular $\mu_{i,1} > c$ and $\mu_{i',1} > c$, which is excluded since $\mu_{1,1}$ is the only minimal arm larger than $c$.

Statement 2 implies that for all $i \in \{1, \dots, K\}$, $U_{(i, c_i(t))}(t) \leq c$. Thus, in particular, $U_{(1, c_1(t))} \leq c$ and, as $E_t$ holds, there exists $j \leq K_1$ such that $\mu_{1,j} < c$, which is excluded.

Statement 3 implies that there exists $i \neq \hat{\imath}(t)$ such that $\min_j \hat{\mu}_{i,j}(t) > \hat{\mu}_{H_t}(t) \geq \min_j \hat{\mu}_{(\hat{\imath}(t), j)}(t)$, which contradicts the definition of $\hat{\imath}(t)$.

Statement 4 implies that $L_{H_t}(t) > U_{S_t}(t)$, thus the algorithm must have stopped before the $t$-th round, which is excluded since $\tau_\delta > t$.

We proved that there exists $P \in \{H_t, S_t\}$ such that $c \in I_P(t)$.

A.4. Proof of Lemma 11

Assume that $E_t$ holds and that $c \in [L_{(i,j)}(t), U_{(i,j)}(t)]$. We first show that $\mu_{i,1}$ is also contained in $[L_{(i,j)}(t), U_{(i,j)}(t)]$. First, by definition of the algorithm, if $(i,j) = H_t$ or $S_t$, one has $(i,j) = (i, c_i(t))$, hence $L_{(i,j)}(t) \leq L_{(i,1)}(t) \leq \mu_{i,1}$, using that $E_t$ holds. Now, if we assume that $\mu_{i,1} > U_{(i,j)}(t)$, then, because $E_t$ holds, one has $\mu_{i,1} > \mu_{i,j}$, which is a contradiction. Thus $\mu_{i,1} \leq U_{(i,j)}(t)$.
As $c$ and $\mu_{i,1}$ are both contained in $[L_{(i,j)}(t), U_{(i,j)}(t)]$, whose diameter is $2\sqrt{\beta(t,\delta)/(2 N_{(i,j)}(t))}$, one has
$$|c - \mu_{i,1}| < 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(i,j)}(t)}} \;\Leftrightarrow\; N_{(i,j)}(t) \leq \frac{2\, \beta(t,\delta)}{(\mu_{i,1} - c)^2}.$$
Moreover, one can use again that $L_{(i,j)}(t) \leq L_{(i,1)}(t)$ to write
$$U_{(i,j)}(t) - 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(i,j)}(t)}} \leq L_{(i,1)}(t), \qquad \text{hence} \qquad \mu_{i,j} - 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(i,j)}(t)}} \leq \mu_{i,1},$$
which yields $N_{(i,j)}(t) \leq \frac{2\beta(t,\delta)}{(\mu_{i,j} - \mu_{i,1})^2}$ and concludes the proof.

A.5. Proof of Theorem 5

In the particular case of two actions per player, we analyze the version of LUCB that draws only one arm per round. More precisely, in this particular case, letting $X_t = \operatorname{argmin}_{j=1,2} L_{(1,j)}(t)$ and $Y_t = \operatorname{argmin}_{j=1,2} L_{(2,j)}(t)$, one has $P_{t+1} = \operatorname{argmin}_{P \in \{X_t, Y_t\}} N_P(t)$: the least drawn of the two candidate arms. The analysis follows the same lines as that of Theorem 4. First, we notice that the algorithm outputs the maximin action on the event $E = \bigcap_{t \in \mathbb{N}} E_t$, and thus the exploration rate defined in Corollary 2 guarantees a $\delta$-PAC algorithm. Then, the sample complexity analysis relies on a specific characterization of the draws of each of the arms, given in Lemma 13 below (which is a counterpart of Lemma 11). This result justifies the new complexity term that appears in Theorem 5.

Lemma 13 On the event $E$, for all $P \in \mathcal{P}$, one has $(P_{t+1} = P) \cap (\tau_\delta > t) \subseteq (N_P(t) \leq 8 c_P\, \beta(t,\delta))$, with
$$c_{(1,1)} = \frac{1}{(\mu_{1,1} - \mu_{2,1})^2}, \qquad c_{(1,2)} = \frac{1}{(\mu_{1,2} - \mu_{2,1})^2}, \qquad c_{(2,1)} = \frac{1}{(\mu_{1,1} - \mu_{2,1})^2},$$
$$c_{(2,2)} = \frac{1}{\max\left(4(\mu_{2,2} - \mu_{2,1})^2,\, (\mu_{1,1} - \mu_{2,1})^2\right)}.$$

Proof of Lemma 13. The proof of this result uses extensively the fact that the confidence intervals in (3) are symmetric: $U_P(t) = L_P(t) + 2\sqrt{\beta(t,\delta)/(2 N_P(t))}$. Assume that $P_{t+1} = (1,1)$.
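The constants of Lemma 13 make the earlier comparison with plain best arm identification concrete. The snippet below (our own illustration) computes them for the matrix $\mu^1$ used in the experiments, and shows that $c_{(2,2)}$ stays bounded when $\mu_{2,2}$ is close to $\mu_{2,1}$, unlike the $1/(\mu_{2,2}-\mu_{2,1})^2$ term paid by a best arm identification algorithm.

```python
def lemma13_constants(m11, m12, m21, m22):
    """Constants c_P of Lemma 13 for the 2x2 case (assumes m11 > m21,
    with m11 = mu_{1,1}, m12 = mu_{1,2}, m21 = mu_{2,1}, m22 = mu_{2,2})."""
    return {
        (1, 1): 1 / (m11 - m21) ** 2,
        (1, 2): 1 / (m12 - m21) ** 2,
        (2, 1): 1 / (m11 - m21) ** 2,
        (2, 2): 1 / max(4 * (m22 - m21) ** 2, (m11 - m21) ** 2),
    }

# matrix mu^1: mu_{2,2} = 0.35 is close to mu_{2,1} = 0.3 ...
c = lemma13_constants(0.4, 0.5, 0.3, 0.35)
# ... yet c_{(2,2)} is capped at 1/(mu_{1,1} - mu_{2,1})^2 = 100, whereas the
# best-arm-identification term 1/(mu_{2,2} - mu_{2,1})^2 = 400 blows up as the gap shrinks
```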
By definition of the sampling strategy, one has $L_{(1,1)}(t) \leq L_{(1,2)}(t)$ and $N_{(1,1)}(t) \leq N_{Y_t}(t)$. If $\tau_\delta > t$, one has $L_{(1,1)}(t) \leq U_{Y_t}(t)$, hence
$$U_{(1,1)}(t) - 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(1,1)}(t)}} \leq L_{Y_t}(t) + 2\sqrt{\frac{\beta(t,\delta)}{2 N_{Y_t}(t)}}.$$
On $E$, $\mu_{1,1} \leq U_{(1,1)}(t)$ and $L_{Y_t}(t) = \min(L_{(2,1)}(t), L_{(2,2)}(t)) \leq \min(\mu_{2,1}, \mu_{2,2}) = \mu_{2,1}$. Thus
$$\mu_{1,1} - \mu_{2,1} \leq 2\sqrt{\frac{\beta(t,\delta)}{2 N_{Y_t}(t)}} + 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(1,1)}(t)}} \leq 4\sqrt{\frac{\beta(t,\delta)}{2 N_{(1,1)}(t)}},$$
using that $N_{(1,1)}(t) \leq N_{Y_t}(t)$. This proves that
$$(P_{t+1} = (1,1)) \cap (\tau_\delta > t) \subseteq \left( N_{(1,1)}(t) \leq \frac{8\, \beta(t,\delta)}{(\mu_{1,1} - \mu_{2,1})^2} \right).$$
A very similar reasoning shows that
$$(P_{t+1} = (1,2)) \cap (\tau_\delta > t) \subseteq \left( N_{(1,2)}(t) \leq \frac{8\, \beta(t,\delta)}{(\mu_{1,2} - \mu_{2,1})^2} \right).$$
Assume that $P_{t+1} = (2,1)$. If $\tau_\delta > t$, one has $L_{X_t}(t) \leq U_{(2,1)}(t)$, hence
$$U_{X_t}(t) - 2\sqrt{\frac{\beta(t,\delta)}{2 N_{X_t}(t)}} \leq L_{(2,1)}(t) + 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(2,1)}(t)}}.$$
On $E$, $\mu_{1,1} \leq \mu_{X_t} \leq U_{X_t}(t)$ and $L_{(2,1)}(t) \leq \mu_{2,1}$. Thus
$$\mu_{1,1} - \mu_{2,1} \leq 2\sqrt{\frac{\beta(t,\delta)}{2 N_{X_t}(t)}} + 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(2,1)}(t)}} \leq 4\sqrt{\frac{\beta(t,\delta)}{2 N_{(2,1)}(t)}},$$
using that $N_{(2,1)}(t) \leq N_{X_t}(t)$. This proves that
$$(P_{t+1} = (2,1)) \cap (\tau_\delta > t) \subseteq \left( N_{(2,1)}(t) \leq \frac{8\, \beta(t,\delta)}{(\mu_{1,1} - \mu_{2,1})^2} \right).$$
Assume that $P_{t+1} = (2,2)$. First, using the fact that $L_{(2,2)}(t) \leq L_{(2,1)}(t)$ yields, on $E$,
$$U_{(2,2)}(t) - 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(2,2)}(t)}} \leq \mu_{2,1}, \qquad \text{hence} \qquad \mu_{2,2} - \mu_{2,1} \leq 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(2,2)}(t)}},$$
which leads to $N_{(2,2)}(t) \leq \frac{2\, \beta(t,\delta)}{(\mu_{2,2} - \mu_{2,1})^2}$.
Then, if $\tau_\delta > t$, on $E$ (using also that $L_{(2,2)}(t) \leq L_{(2,1)}(t)$), $L_{X_t}(t) \leq U_{(2,2)}(t)$, hence
$$U_{X_t}(t) - 2\sqrt{\frac{\beta(t,\delta)}{2 N_{X_t}(t)}} \leq L_{(2,2)}(t) + 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(2,2)}(t)}} \leq L_{(2,1)}(t) + 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(2,2)}(t)}},$$
so that
$$\mu_{1,1} - 2\sqrt{\frac{\beta(t,\delta)}{2 N_{X_t}(t)}} \leq \mu_{2,1} + 2\sqrt{\frac{\beta(t,\delta)}{2 N_{(2,2)}(t)}} \qquad \text{and} \qquad \mu_{1,1} - \mu_{2,1} \leq 4\sqrt{\frac{\beta(t,\delta)}{2 N_{(2,2)}(t)}}.$$
Thus, if $\mu_{2,2} < \mu_{1,1}$, one also has $N_{(2,2)}(t) \leq 8\beta(t,\delta)/(\mu_{1,1} - \mu_{2,1})^2$. Combining the two bounds yields
$$(P_{t+1} = (2,2)) \cap (\tau_\delta > t) \subseteq \left( N_{(2,2)}(t) \leq \frac{8\, \beta(t,\delta)}{\max\left(4(\mu_{2,2} - \mu_{2,1})^2,\, (\mu_{1,1} - \mu_{2,1})^2\right)} \right).$$

Appendix B. Analysis of the Maximin-Racing algorithm

B.1. Proof of Lemma 7

First note that for every $P \in \mathcal{P}$, introducing an i.i.d. sequence of successive observations from arm $P$, the sequence of associated empirical means $(\hat{\mu}_P(r))_{r \in \mathbb{N}}$ is defined independently of the arm being active. We introduce the event $E = E_1 \cap E_2$ with
$$E_1 = \bigcap_{i=1}^{K} \; \bigcap_{\substack{(i,j) \in \mathcal{P}_i \\ \mu_{i,j} = \mu_{i,1}}} \; \bigcap_{\substack{(i,j') \in \mathcal{P}_i \\ \mu_{i,j'} > \mu_{i,1}}} \left( \forall r \in \mathbb{N},\; r f\big(\hat{\mu}_{i,j}(r), \hat{\mu}_{i,j'}(r)\big) \leq \beta(r, \delta) \right),$$
$$E_2 = \bigcap_{\substack{i \in \{1,\dots,K\} \\ \mu_{i,1} < \mu_{1,1}}} \; \bigcap_{\substack{(i,j) \in \mathcal{A}_i \\ \mu_{i,j} = \mu_{i,1}}} \; \bigcap_{\substack{i' \in \{1,\dots,K\} \\ \mu_{i',1} = \mu_{1,1}}} \; \bigcap_{(i',j') \in \mathcal{A}_{i'}} \left( \forall r \in \mathbb{N},\; r f\big(\hat{\mu}_{i,j}(r), \hat{\mu}_{i',j'}(r)\big) \leq \beta(r, \delta) \right),$$
and the event $F = \bigcap_{P \in \mathcal{P}} \left( |\hat{\mu}_P(r_0) - \mu_P| \leq \frac{\epsilon}{2} \right)$. From (9) and a union bound, $\mathbb{P}(E^c) \leq \delta/2$. From Hoeffding's inequality and a union bound, using also the definition of $r_0$, one has $\mathbb{P}(F^c) \leq \delta/2$. Finally, $\mathbb{P}_\mu(E \cap F) \geq 1 - \delta$.

We now show that on $E \cap F$, the algorithm outputs an $\epsilon$-optimal arm. On the event $E$, the following two statements are true for any round $r \leq r_0$:

1. For all $i$, if $\mathcal{R}_i \neq \emptyset$, then there exists $(i,j) \in \mathcal{R}_i$ such that $\mu_{i,j} = \mu_{i,1}$.
2. If there exists $i$ such that $\mathcal{R}_i \neq \emptyset$, then there exists $i'$ with $\mu_{i',1} = \mu_{1,1}$ such that $\mathcal{R}_{i'} \neq \emptyset$.

Indeed, if statement 1 were not true, there would be a non-empty set $\mathcal{R}_i$ in which all the arms of the set $\{(i,j) \in \mathcal{P}_i : \mu_{i,j} = \mu_{i,1}\}$ have been discarded. Hence, in a previous round, at least one of these arms must have appeared strictly larger than one of the arms in the set $\{(i,j') \in \mathcal{P}_i : \mu_{i,j'} > \mu_{i,1}\}$ (in the sense of our elimination rule), which is not possible from the definition of $E_1$. Now, if statement 2 were not true, there would exist $i'$ with $\mu_{i',1} = \mu_{1,1}$ such that $\mathcal{R}_{i'}$ has been discarded at a previous round by some non-empty set $\mathcal{R}_i$ with $\mu_{i,1} < \mu_{1,1}$. Hence, there exists $(i',j') \in \mathcal{A}_{i'}$ that appears significantly smaller than all arms in $\mathcal{R}_i$ (in the sense of our elimination rule). As $\mathcal{R}_i$ contains, by statement 1, some arm $(i,j)$ with $\mu_{i,j} = \mu_{i,1}$, there exists $r$ such that $r f\big(\hat{\mu}_{(i,j)}(r), \hat{\mu}_{(i',j')}(r)\big) > \beta(r, \delta)$, which contradicts the definition of $E_2$.

From statements 1 and 2, on $E \cap F$, if the algorithm terminates before $r_0$, the last set in the race $\mathcal{R}_i$ must satisfy $\mu_{i,1} = \mu_{1,1}$, and the action $\hat{\imath}$ is in particular $\epsilon$-optimal. If the algorithm has not stopped at $r_0$, the arm $\hat{\imath}$ recommended is the empirical maximin action. Letting $\mathcal{R}_i$ be some set still in the race with $\mu_{i,1} = \mu_{1,1}$, one has
$$\min_{P \in \mathcal{R}_{\hat{\imath}}} \hat{\mu}_P(r_0) \geq \min_{P \in \mathcal{R}_i} \hat{\mu}_P(r_0).$$
As $F$ holds, and because there exist $(\hat{\imath}, \hat{\jmath}) \in \mathcal{R}_{\hat{\imath}}$ with $\mu_{\hat{\imath},\hat{\jmath}} = \mu_{\hat{\imath},1}$ and $(i,j) \in \mathcal{R}_i$ with $\mu_{i,j} = \mu_{1,1}$, one has
$$\min_{P \in \mathcal{R}_i} \hat{\mu}_P(r_0) \geq \min_{P \in \mathcal{R}_i} \left( \mu_P - \frac{\epsilon}{2} \right) = \mu_{i,j} - \frac{\epsilon}{2} = \mu_{1,1} - \frac{\epsilon}{2},$$
$$\min_{P \in \mathcal{R}_{\hat{\imath}}} \hat{\mu}_P(r_0) \leq \min_{P \in \mathcal{R}_{\hat{\imath}}} \left( \mu_P + \frac{\epsilon}{2} \right) = \mu_{\hat{\imath},\hat{\jmath}} + \frac{\epsilon}{2} = \mu_{\hat{\imath},1} + \frac{\epsilon}{2},$$
and thus $\hat{\imath}$ is $\epsilon$-optimal, since $\mu_{\hat{\imath},1} + \frac{\epsilon}{2} \geq \mu_{1,1} - \frac{\epsilon}{2} \Leftrightarrow \mu_{1,1} - \mu_{\hat{\imath},1} \leq \epsilon$. $\Box$

B.2. Proof of Theorem 8

Recall that $\mu_{1,1} > \mu_{2,1}$.
We present the proof assuming additionally that for all $i \in \{1, \dots, K\}$, $\mu_{i,1} < \mu_{i,2}$ (an assumption that can be relaxed, at the cost of more complex notation). Let $\alpha > 0$. The function $f$ defined in (8) is uniformly continuous on $[0,1]^2$, thus there exists $\eta_\alpha$ such that
$$\|(x,y) - (x',y')\|_\infty \leq \eta_\alpha \;\Rightarrow\; |f(x,y) - f(x',y')| \leq \alpha.$$
We introduce the event $G_{\alpha,r} = \bigcap_{P \in \mathcal{P}} \left( |\hat{\mu}_P(r) - \mu_P| \leq \eta_\alpha \right)$ and let $E$ be the event defined in the proof of Lemma 7, which rewrites in a simpler way under our assumptions on the arms:
$$E = \bigcap_{i=2}^{K} \bigcap_{j=1}^{K_1} \left( \forall r \in \mathbb{N},\; r f\big(\hat{\mu}_{i,1}(r), \hat{\mu}_{1,j}(r)\big) \leq \beta(r, \delta) \right) \;\cap\; \bigcap_{i=1}^{K} \bigcap_{j=2}^{K_i} \left( \forall r \in \mathbb{N},\; r f\big(\hat{\mu}_{i,1}(r), \hat{\mu}_{i,j}(r)\big) \leq \beta(r, \delta) \right).$$
Recall that on this event, arm $(1,1)$ is never eliminated before the algorithm stops, and whenever an arm $(i,j) \in \mathcal{R}$, we know that the corresponding minimal arm $(i,1) \in \mathcal{R}$. Let $(i,j) \neq (1,1)$ and recall that $\tau_\delta(i,j)$ is the number of rounds during which arm $(i,j)$ is drawn. One has
$$\mathbb{E}_\mu[\tau_\delta(i,j)] = \mathbb{E}_\mu[\tau_\delta(i,j) \mathbb{1}_E] + \mathbb{E}_\mu[\tau_\delta(i,j) \mathbb{1}_{E^c}] \leq \mathbb{E}_\mu[\tau_\delta(i,j) \mathbb{1}_E] + r_0 \frac{\delta}{2}.$$
On the event $E$, if arm $(i,j)$ is still in the race at the end of round $r$:

– it cannot be significantly larger than $(i,1)$: $r f\big(\hat{\mu}_{i,j}(r), \hat{\mu}_{i,1}(r)\big) \leq \beta(r, \delta)$;
– arm $(i,1)$ cannot be significantly smaller than $(1,1)$ (otherwise all arms in $\mathcal{R}_i$, including $(i,j)$, are eliminated): $r f\big(\hat{\mu}_{i,1}(r), \hat{\mu}_{1,1}(r)\big) \leq \beta(r, \delta)$.

Finally, one can write
$$\mathbb{E}_\mu[\tau_\delta(i,j) \mathbb{1}_E] \leq \mathbb{E}_\mu\left[ \mathbb{1}_E \sum_{r=1}^{r_0} \mathbb{1}\big((i,j) \in \mathcal{R} \text{ at round } r\big) \right] \leq \mathbb{E}_\mu\left[ \sum_{r=1}^{r_0} \mathbb{1}\left( r \max\left[ f\big(\hat{\mu}_{i,j}(r), \hat{\mu}_{i,1}(r)\big), f\big(\hat{\mu}_{i,1}(r), \hat{\mu}_{1,1}(r)\big) \right] \leq \beta(r,\delta) \right) \right]$$
$$\leq \mathbb{E}_\mu\left[ \sum_{r=1}^{r_0} \mathbb{1}\left( r \max\left[ f\big(\hat{\mu}_{i,j}(r), \hat{\mu}_{i,1}(r)\big), f\big(\hat{\mu}_{i,1}(r), \hat{\mu}_{1,1}(r)\big) \right] \leq \beta(r,\delta) \right) \mathbb{1}_{G_{\alpha,r}} \right] + \sum_{r=1}^{r_0} \mathbb{P}_\mu(G_{\alpha,r}^c)$$
$$\leq \sum_{r=1}^{r_0} \mathbb{1}\left( r \left( \max\left[ f(\mu_{i,j}, \mu_{i,1}), f(\mu_{i,1}, \mu_{1,1}) \right] - \alpha \right) \leq \log\left(\frac{4CKr}{\delta}\right) \right) + \sum_{r=1}^{\infty} \mathbb{P}_\mu(G_{\alpha,r}^c)$$
$$\leq T_{(i,j)}(\delta, \alpha) + \sum_{r=1}^{\infty} 2K \exp(-2 \eta_\alpha^2 r),$$
using Hoeffding's inequality and introducing
$$T_{(i,j)}(\delta, \alpha) := \inf\left\{ r \in \mathbb{N} : r \left( \max\left[ f(\mu_{i,j}, \mu_{i,1}), f(\mu_{i,1}, \mu_{1,1}) \right] - \alpha \right) > \log\left(\frac{4CKr}{\delta}\right) \right\}.$$
Some algebra (Lemma 12) shows that
$$T_{(i,j)}(\delta, \alpha) = \frac{1}{\max\left[ f(\mu_{i,j}, \mu_{i,1}), f(\mu_{i,1}, \mu_{1,1}) \right] - \alpha} \log\left(\frac{4CK}{\delta}\right) + o_{\delta \to 0}\left( \log\frac{1}{\delta} \right)$$
and finally, for all $\alpha > 0$,
$$\mathbb{E}_\mu[\tau_\delta(i,j)] \leq \frac{1}{\max\left[ f(\mu_{i,j}, \mu_{i,1}), f(\mu_{i,1}, \mu_{1,1}) \right] - \alpha} \log\left(\frac{4CK}{\delta}\right) + o\left( \log\frac{1}{\delta} \right).$$
As this holds for all $\alpha$, and keeping in mind the trivial bound $\mathbb{E}_\mu[\tau_\delta(i,j)] \leq r_0 = \frac{2}{\epsilon^2} \log\frac{4K}{\delta}$, one obtains
$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta(i,j)]}{\log(1/\delta)} \leq \frac{1}{\max\left[ \frac{\epsilon^2}{2},\, I^*(\mu_{i,j}, \mu_{i,1}),\, I^*(\mu_{i,1}, \mu_{1,1}) \right]}.$$
To upper bound the number of draws of arm $(1,1)$, one can proceed similarly and write, for all $\alpha > 0$,
$$\tau_\delta(1,1) \mathbb{1}_E = \sup_{(i,j) \in \mathcal{P} \setminus \{(1,1)\}} \tau_\delta(i,j) \mathbb{1}_E \leq \sup_{(i,j) \in \mathcal{P} \setminus \{(1,1)\}} \sum_{r=1}^{r_0} \mathbb{1}\left( r \max\left[ f\big(\hat{\mu}_{i,j}(r), \hat{\mu}_{i,1}(r)\big), f\big(\hat{\mu}_{i,1}(r), \hat{\mu}_{1,1}(r)\big) \right] \leq \beta(r,\delta) \right)$$
$$\leq \sup_{(i,j) \in \mathcal{P} \setminus \{(1,1)\}} \sum_{r=1}^{r_0} \mathbb{1}\left( r \left( \max\left[ f(\mu_{i,j}, \mu_{i,1}), f(\mu_{i,1}, \mu_{1,1}) \right] - \alpha \right) \leq \beta(r,\delta) \right) + \sum_{r=1}^{\infty} \mathbb{1}_{G_{\alpha,r}^c} \leq \sup_{(i,j) \in \mathcal{P} \setminus \{(1,1)\}} T_{(i,j)}(\delta, \alpha) + \sum_{r=1}^{\infty} \mathbb{1}_{G_{\alpha,r}^c}.$$
Taking the expectation and using the more explicit expression of the $T_{(i,j)}$ yields
$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta(1,1)]}{\log(1/\delta)} \leq \frac{1}{\max\left[ \frac{\epsilon^2}{2},\, I^*(\mu_{2,1}, \mu_{1,1}) \right]}.$$