The Max $K$-Armed Bandit: PAC Lower Bounds and Efficient Algorithms


Authors: Yahel David, Nahum Shimkin

Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel

Abstract

We consider the Max $K$-Armed Bandit problem, where a learning agent is faced with several stochastic arms, each a source of i.i.d. rewards of unknown distribution. At each time step the agent chooses an arm, and observes the reward of the obtained sample. Each sample is considered here as a separate item with the reward designating its value, and the goal is to find an item with the highest possible value. Our basic assumption is a known lower bound on the tail function of the reward distributions. Under the PAC framework, we provide a lower bound on the sample complexity of any $(\epsilon,\delta)$-correct algorithm, and propose an algorithm that attains this bound up to logarithmic factors. We analyze the robustness of the proposed algorithm and, in addition, we compare its performance to the variant in which the arms are not distinguishable by the agent and are chosen randomly at each stage. Interestingly, when the maximal rewards of the arms happen to be similar, the latter approach may provide better performance.

1 Introduction

In the classic stochastic multi-armed bandit (MAB) problem the learning agent faces a set $K$ of stochastic arms, and wishes to maximize its cumulative reward (in the regret formulation), or to find the arm with the highest expected reward (the pure exploration problem). This model has been studied extensively in the statistical and learning literature; see for example [5] for a comprehensive survey. We consider a variant of the MAB problem called the Max $K$-Armed Bandit problem (Max-Bandit for short).
In this variant, the objective is to obtain a sample with the highest possible reward (namely, the highest value in the support of the probability distribution of any arm). More precisely, considering the PAC setting, the objective is to return an $(\epsilon,\delta)$-correct sample, namely a sample whose value is $\epsilon$-close to the overall best with probability larger than $1-\delta$. In addition, we wish to minimize the sample complexity, namely the expected number of samples observed by the learning algorithm before it terminates. For the classical MAB problem, algorithms that find the best arm (in terms of its expected reward) in the PAC sense were presented in [11, 1, 12], and lower bounds on the sample complexity were presented in [13] and [1]. The essential difference with respect to this work is in the objective, which is to find an $(\epsilon,\delta)$-correct sample in our case. The scenario considered in the Max-Bandit model is relevant when a single best item needs to be selected from among several (large) clustered sets of items, with each set represented as a single arm. These sets may represent parts that come from different manufacturers or are produced by different processes, job candidates that are referred by different employment agencies, the best match to certain genetic characteristics in different populations, or the best channel among different frequency bands in a cognitive radio wireless network. The Max-Bandit problem was apparently first proposed in [9]. For reward distribution functions in a specific family, an algorithm with an upper bound on the sample complexity that increases as $\frac{-\ln(\delta)}{\epsilon^2}$ was provided in [15]. For the case of discrete rewards, another algorithm was presented in [16], without performance analysis.
Later, a similar model, in which the objective is to maximize the expected value of the largest sampled reward for a given number of samples ($n$), was studied in [6]. In that work the attained best reward is compared with the expected reward obtained by an oracle that samples the best arm $n$ times. An algorithm is suggested and shown to secure an upper bound of order $n^{-b/((b+1)\alpha)}$ on that difference, where $\alpha > 0$ and $b > 0$ are determined by the properties of the distribution functions, and $b$ decreases as they are further away from a specific family of functions. Recently, a similar model, in which the goal is to find the arm for which the value of a given quantile ($\tau$) is the largest, was studied in [17]. Their model can be compared to ours by allowing an error $\epsilon$ of the same size as the given quantile. In this case, the bound on the sample complexity provided in [17] increases as $\frac{-\ln(\tau)-\ln(\delta)}{\tau^2}$. Our basic assumption in the present paper is that a known lower bound ($G_*(\epsilon)$, formally defined in Section 2) is available on the tail distributions, namely on the probability that the reward of each given arm will be close to its maximum. A special case is a lower bound on the probability densities near the maximum. Under that assumption, we provide an algorithm for which the sample complexity increases at most as $\frac{-\ln(G_*(\epsilon)\delta)}{G_*(\epsilon)}$. In the context of [15], $G_*(\epsilon) \simeq \epsilon$, and in the context of [17], $G_*(\epsilon) = \tau$. Therefore, the proposed algorithm provides an improvement by a factor of $\epsilon^{-1}$ over the result of [15], which was obtained for a more specific model, and an improvement by the same factor over the result of [17], which was derived for a similar, but different, objective. To compare with the result in [6], we note that by considering the expected maximal value as the maximal possible value, it follows that $G_*(\epsilon) \simeq \epsilon^\alpha$.
With a choice of $\delta = 1/n^2$ in our algorithm, we obtain that the expected deficit of the largest sample with respect to the maximal possible reward is at most of order $O\big((\ln(n)/n)^{1/\alpha}\big)$ (as compared to $O(n^{-b/((b+1)\alpha)})$ with $b > 0$). Furthermore, we provide a lower bound on the sample complexity of every $(\epsilon,\delta)$-correct algorithm, which is shown to coincide, up to a logarithmic term, with the upper bound derived for the proposed algorithm. To the best of our knowledge, this is the first lower bound for the present problem. In addition, we analyze the robustness of the algorithm to our choice of the tail function bound $G_*(\epsilon)$, both for the case where this choice is too optimistic (i.e., the actual distributions do not obey the assumed bound) and for the case where our choice is overly conservative. A basic feature of the Max-Bandit problem (and the associated algorithms) is the goal of quickly focusing on the best arm (in terms of maximal reward), and sampling from that arm as much as possible. It is natural to compare the obtained results with an alternative approach, which ignores the distinction between arms, and simply draws a sample from a random arm at each round. This can be interpreted as mixing the items associated with each arm before sampling; we accordingly refer to this variant as the unified-arm problem. This problem actually coincides with the so-called infinitely-many-armed bandit model studied in [3, 18, 19, 8, 4] and, for the specific case of deterministic arms, in [10]. As may be expected, the unified-arm approach provides the best results when the reward distributions of all arms are identical. However, when many arms are suboptimal, the multi-armed approach provides superior performance. The paper proceeds as follows. In the next section we present our model.
In Section 3 we provide a lower bound on the sample complexity of every $(\epsilon,\delta)$-correct algorithm. In Section 4 we present an $(\epsilon,\delta)$-correct algorithm, and we provide an upper bound on its sample complexity. The algorithm is simple, and its bound has the same order as the lower bound up to a logarithmic term in $\frac{|K|}{\epsilon}$ (where $|K|$ stands for the number of arms). Then, in Section 5, we provide an analysis of the algorithm's performance for the case in which our assumption does not hold. In Section 6, we consider for comparison the unified-arm approach. In Section 7 we close the paper with some concluding remarks. Certain proofs are deferred to the Appendix due to space limitations.

2 Model Definition

We consider a finite set of arms, denoted by $K$. At each stage $t = 1, 2, \ldots$ the learning agent chooses an arm $k \in K$, and a real-valued reward is obtained from that arm. The rewards obtained from each arm $k$ are independent and identically distributed, with a distribution function (CDF) $F_k(\mu)$, $\mu \in \mathbb{R}$. We denote the maximal possible reward of each arm by $\mu_k^* = \inf_{\mu \in \mathbb{R}}\{\mu \,|\, F_k(\mu) = 1\}$, assumed finite, and the maximal reward among all arms by $\mu^* = \max_{k \in K} \mu_k^*$. The tail function $G_k(\epsilon)$ of each arm is defined as follows.

Definition 1. For every arm $k \in K$, the tail function $G_k(\epsilon)$ is defined by
$$G_k(\epsilon) \triangleq 1 - F_k(\mu_k^* - \epsilon), \quad \epsilon \geq 0.$$

For example, when $\mu$ is uniform on $[a,b]$, then $G(\epsilon) = \frac{\epsilon}{b-a}$. In addition, we note that CDFs are nondecreasing functions, and therefore the tail functions are nondecreasing in $\epsilon$. It should be observed that $G_k(\epsilon)$ does not reveal the maximal value $\mu_k^*$, which remains unknown. Throughout the paper, we shall use the following assumption.

Assumption 1.
There exists a known function $G_*(\epsilon)$ and a known constant $\epsilon_0 > 0$ such that, for every $k \in K$ and $0 \leq \epsilon \leq \epsilon_0$, it holds that
$$G_k(\epsilon) \geq G_*(\epsilon). \quad (1)$$

We note that for every $k \in K$, $P(\mu_k > \mu_k^* - \epsilon) \geq G_*(\epsilon)$, where $\mu_k$ stands for a random variable with distribution $F_k$. Furthermore, noting that the tail functions are non-negative and nondecreasing, we assume the same for their lower bound $G_*(\epsilon)$. Moreover, for convenience we shall assume that $G_*(\epsilon)$ is strictly increasing in $\epsilon$, and denote its inverse function by $G_*^{-1}$. An important special case is when one assumes that the probability density function (pdf) of each arm is lower bounded by a certain constant $A > 0$, so that $G_*(\epsilon) = A\epsilon$. We shall often use the more general bound of the form $G_*(\epsilon) = A\epsilon^\beta$ to illustrate our results. An algorithm for the Max-Bandit model samples an arm at each time step, based on the observed history so far (i.e., the previously selected arms and observed rewards). We require the algorithm to terminate after a random number $T$ of samples, which is finite with probability 1, and return a reward $V$ which is the maximal reward observed over the entire period. An algorithm is said to be $(\epsilon,\delta)$-correct if
$$P(V > \mu^* - \epsilon) > 1 - \delta.$$
The expected number of samples $E[T]$ taken by the algorithm is the sample complexity, which we wish to minimize.

3 A Lower Bound

Before turning to our proposed algorithm, we provide a lower bound on the sample complexity of any $(\epsilon,\delta)$-correct algorithm. The bound is established under Assumption 1, and the additional provision that $G_*(\epsilon)$ is concave. The case of non-concave $G_*(\epsilon)$ turns out to be more complicated for analysis, and it is currently unclear whether our lower bound holds in that case.
For example, when $G_*(\epsilon) = A\epsilon^\beta$ for some known constants $A > 0$ and $\beta > 0$, so that
$$P(\mu_k > \mu_k^* - \epsilon) \geq A\epsilon^\beta, \quad (2)$$
the required concavity holds for $\beta \leq 1$. The bound in Equation (2) is usually referred to as $\beta$-regularity, and is similar to those assumed in [3], [19], [10] and [7]. The following result specifies our lower bound.

Theorem 1. Let $k^*$ denote some optimal arm, such that $\mu_{k^*}^* = \mu^*$. Let Assumption 1 hold with a concave function $G_*(\epsilon)$, and let $\epsilon \leq \epsilon_0$ and $\delta \leq \frac{3}{20} e^{-3}$. Then, for every $(\epsilon,\delta)$-correct algorithm,
$$E[T] \geq \sum_{k \in K \setminus \{k^*\}} \frac{1}{32\, G_*(\Theta_k)} \ln\left(\frac{3}{20\delta}\right), \quad (3)$$
where $\Theta_k = \min\{\max(\epsilon,\ \mu^* - \mu_k^*),\ \epsilon_0\}$.

We note that the specific requirement on $\delta$ is not fundamental, and can be relaxed at the cost of a smaller constant in the bound. This lower bound can be interpreted as summing over the minimal number of times that each arm, other than the optimal arm $k^*$, needs to be sampled. It is important to observe that if there are several optimal arms, only one of them is excluded from the summation. Indeed, the bound is large when there are several optimal (or near-optimal) arms, as the denominator of the summand is small for such arms. This follows since the algorithm needs to obtain more samples to verify that a given arm is $\epsilon$-optimal. The proof of Theorem 1 proceeds by considering any given set of reward distributions that obeys the Assumption, and showing that if an algorithm samples some suboptimal arm less than a certain number of times, it cannot be $(\epsilon,\delta)$-correct for some related set of reward distributions for which this arm is optimal.

Proof of Theorem 1. We begin by defining the following set of hypotheses $\{H_0, H_1, \ldots, H_{|K|}\}$, where $F_l^{H_k}(\mu)$ stands for the CDF of arm $l$ under hypothesis $k$, and $1_\Theta$ stands for the indicator function of the set $\Theta$.
Hypothesis $H_0$ is the true hypothesis, namely, $F_k^{H_0}(\mu) = F_k(\mu)$ for all $k \in K$. For $k = 1, \ldots, |K|$, we define $H_k$ as follows. For each arm $l \neq k$, its CDF coincides with the true one, namely, $F_l^{H_k}(\mu) = F_l(\mu)$, $l \neq k$. For arm $k$, we construct a CDF $F_k^{H_k}$ such that its maximal value is $\mu_k^{*,H_k} = \mu^* + \epsilon$, while it still satisfies Assumption 1. To define $F_k^{H_k}$, we use the notation
$$F^*(\mu) = \begin{cases} 1 - G_*(\mu^* + \epsilon - \mu), & \mu < \mu^* + \epsilon, \\ 1, & \mu \geq \mu^* + \epsilon, \end{cases}$$
where $\epsilon$ is provided to the algorithm. We consider two cases.

Case 1: $\mu_k^* < \mu^* + \epsilon - \epsilon_0$. Let
$$F_k^{H_k}(\mu) = \gamma_{k,1} F_k(\mu)\, 1_{(-\infty,\, \mu_k^*)}(\mu) + \gamma_{k,1} F_k(\mu_k^*)\, 1_{[\mu_k^*,\, \mu^* + \epsilon - \epsilon_0)}(\mu) + F^*(\mu)\, 1_{[\mu^* + \epsilon - \epsilon_0,\, \infty)}(\mu),$$
where $\gamma_{k,1} = 1 - G_*(\epsilon_0)$.

Case 2: $\mu_k^* \geq \mu^* + \epsilon - \epsilon_0$. Define $P_k^\epsilon \triangleq 1 - G_*(\epsilon_0) + G_*(\mu^* + \epsilon - \mu_k^*) \leq 1$, and let $\tilde{\mu}_k = \sup_{\mu \leq \mu_k^*} \{\mu \,|\, F_k(\mu) \leq P_k^\epsilon\}$ denote the value at which $F_k$ reaches probability $P_k^\epsilon$. Set
$$F_k^{H_k}(\mu) = \gamma_{k,2} F_k(\mu)\, 1_{(-\infty,\, \tilde{\mu}_k)}(\mu) + \big(F_k(\mu) + (\gamma_{k,2} - 1) F_k(\tilde{\mu}_k)\big)\, 1_{[\tilde{\mu}_k,\, \mu_k^*)}(\mu) + F^*(\mu)\, 1_{[\mu_k^*,\, \infty)}(\mu),$$
where $\gamma_{k,2} = 1 - \frac{G_*(\mu^* + \epsilon - \mu_k^*)}{F_k(\tilde{\mu}_k)}$.

By Lemma 1, which is provided in Section 8.1 in the Appendix, it follows that Assumption 1 holds under all of the hypotheses $\{H_0, H_1, \ldots, H_{|K|}\}$. If hypothesis $H_k$ ($k \neq 0$) were true, then $\mu_k^{*,H_k} \geq \mu_l^* + \epsilon$ for all $l \neq k$; hence the algorithm should provide a reward from arm $k$ with probability larger than $1 - \delta$. We use $E_{H_k}$ and $P_{H_k}$ to denote the expectation and probability, respectively, under the algorithm being considered and hypothesis $H_k$. For every $k \in K$ let
$$t_k = \frac{1}{16\, \gamma_k} \ln\left(\frac{3}{20\delta}\right),$$
where $\gamma_k = G_*(\epsilon_0)$ if $\mu_k^* < \mu^* + \epsilon - \epsilon_0$, and $\gamma_k = G_*(\mu^* + \epsilon - \mu_k^*)$ if $\mu_k^* \geq \mu^* + \epsilon - \epsilon_0$. In addition, we let $T_k$ stand for the number of samples from arm $k$.
Suppose now that our algorithm is $(\epsilon,\delta)$-correct under $H_0$, and that $E_{H_0}[T_k] \leq t_k$ for some $k \in K$. We will show that this algorithm cannot be $(\epsilon,\delta)$-correct under hypothesis $H_k$. Therefore, an $(\epsilon,\delta)$-correct algorithm must have $E_{H_0}[T_k] > t_k$ for all $k \in K$. Define the following events, for $k \in K$:

• $A_k = \{T_k \leq 4 t_k\}$. It easily follows from $4 t_k \big(1 - P_{H_0}(A_k)\big) \leq E_{H_0}[T_k]$ that if $E_{H_0}[T_k] \leq t_k$, then $P_{H_0}(A_k) \geq \frac{3}{4}$.

• Let $B_k$ stand for the event under which the chosen arm at termination is $k$, and $B_k^C$ for its complement. Since $P_{H_0}(B_{k'}) > \frac{1}{2}$ can hold for at most one arm $k'$, it follows that $P_{H_0}(B_k^C) > \frac{1}{2}$ for every $k \neq k'$.

• Let $C_k$ be the event under which all the samples obtained from arm $k$ lie in the interval $(-\infty, \mu_k^*]$. Clearly, $P_{H_0}(C_k) = 1$.

• For $k \in K$ for which $\mu_k^* < \mu^* + \epsilon - \epsilon_0$, $\tilde{\mu}_k$ is still defined as before, so $\tilde{\mu}_k = \mu_k^*$ (and $F_k(\tilde{\mu}_k) = 1$). Now, for every $k \in K$, we let $D_k$ denote the event under which, for any number of samples $t \leq 4 t_k$ from arm $k$, the number of samples that lie in the interval $(-\infty, \tilde{\mu}_k]$ is bounded as follows:
$$D_k \triangleq \left\{ \max_{1 \leq t \leq 4 t_k} \sum_{i=1}^{t} \big(x_i^k - F_k(\tilde{\mu}_k)\big) < 15\, t_k F_k(\tilde{\mu}_k) \right\},$$
where $x_i^k$ is a random variable that equals 1 if the $i$-th sample from arm $k$ lies in that interval, and 0 otherwise. Below we upper bound $P_{H_0}(D_k^C)$ using Kolmogorov's inequality.

Kolmogorov's inequality states that the sum $S_t = \sum_{i=1}^{t} z_i$ of zero-mean i.i.d. random variables ($z_i$) satisfies $P\big(\max_{1 \leq t \leq n} |S_t| \geq a\big) \leq \frac{\mathrm{Var}[S_n]}{a^2}$ (Theorem 22.4, p. 287 of [14]). By applying it to the random variables $y_i^k = x_i^k - F_k(\tilde{\mu}_k)$, we obtain
$$P_{H_0}\big(D_k^C\big) \leq \frac{\mathrm{Var}\left(\sum_{i=1}^{4 t_k} y_i^k\right)}{\big(15\, t_k F_k(\tilde{\mu}_k)\big)^2} = \frac{4 t_k F_k(\tilde{\mu}_k)\big(1 - F_k(\tilde{\mu}_k)\big)}{\big(15\, t_k F_k(\tilde{\mu}_k)\big)^2},$$
where $D_k^C$ is the complement of $D_k$.
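As a quick numerical sanity check on this maximal inequality (not part of the paper's argument), the snippet below compares the empirical value of $P(\max_t |S_t| \geq a)$ with the bound $\mathrm{Var}[S_n]/a^2$ for sums of centered Bernoulli increments; all parameter values are arbitrary illustrations.

```python
import random

def kolmogorov_check(n=200, p=0.3, a=15.0, trials=5000, seed=1):
    """Compare the empirical probability P(max_t |S_t| >= a) with the
    Kolmogorov bound Var[S_n] / a^2, where S_t is a sum of centered
    Bernoulli(p) increments (hypothetical parameters, for illustration)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        s, peak = 0.0, 0.0
        for _ in range(n):
            s += (1.0 if rng.random() < p else 0.0) - p  # centered increment
            peak = max(peak, abs(s))                     # running max of |S_t|
        exceed += peak >= a
    return exceed / trials, n * p * (1 - p) / a ** 2

emp, bnd = kolmogorov_check()
```

With these values the bound is $42/225 \approx 0.19$, while the empirical frequency comes out well below it, as the inequality guarantees.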
So, for the case of $\mu_k^* < \mu^* + \epsilon - \epsilon_0$, by the fact that $F_k(\tilde{\mu}_k) = 1$, it follows that $P_{H_0}(D_k) = 1$. For the case of $\mu_k^* \geq \mu^* + \epsilon - \epsilon_0$, it follows that $G_*(\cdot) \leq 1$ by its definition; so, again by definition, we obtain that $F_k(\tilde{\mu}_k) \geq G_*(\mu^* + \epsilon - \mu_k^*) = \gamma_k$, and therefore $t_k F_k(\tilde{\mu}_k) \geq \frac{1}{16} \ln\left(\frac{3}{20\delta}\right)$. So, since $\delta \leq \frac{3}{20} e^{-3}$ by assumption, it follows that
$$P_{H_0}(D_k) \geq 1 - \frac{64}{225 \ln\left(\frac{3}{20\delta}\right)} \geq \frac{9}{10}.$$
For simplicity, we use the bound $P_{H_0}(D_k) \geq \frac{9}{10}$ for every $k \in K$. Define now the intersection event $S_k = A_k \cap B_k^C \cap C_k \cap D_k$. We have just shown that for every $k \neq k'$ it holds that $P_{H_0}(A_k) \geq \frac{3}{4}$, $P_{H_0}(B_k^C) > \frac{1}{2}$, $P_{H_0}(C_k) = 1$ and $P_{H_0}(D_k) \geq \frac{9}{10}$, from which it follows that $P_{H_0}(S_k) > \frac{3}{20}$ for $k \neq k'$. Now, let $h$ be the history of the process (the sequence of chosen arms and obtained rewards). For every $k \in K$, we denote the number of rewards not larger than $\tilde{\mu}_k$ by $N_k$. For a given history, at any time $t'$ and for every $k \in K$, the probability of choosing the next arm is the same under $H_0$ and under $H_k$. Also, by the definition of the hypotheses, the reward probability is the same, unless the chosen arm is $k$. Therefore, by the definition of the hypotheses,
$$\frac{dP_{H_k}}{dP_{H_0}}(h) = \left(1 - \frac{\gamma_k}{F_k(\tilde{\mu}_k)}\right)^{N_k} f_k(h),$$
where $\gamma_k$ is defined as before. By the fact that $F_k(\tilde{\mu}_k) = 1$ for $\mu_k^* < \mu^* + \epsilon - \epsilon_0$, it follows that $\gamma_{k,1} = 1 - \frac{\gamma_k}{F_k(\tilde{\mu}_k)}$ for $\mu_k^* < \mu^* + \epsilon - \epsilon_0$, and that $\gamma_{k,2} = 1 - \frac{\gamma_k}{F_k(\tilde{\mu}_k)}$ otherwise ($\gamma_{k,1}$ and $\gamma_{k,2}$ are defined above). In addition, $f_k(h)$ represents the contribution of samples from arm $k$ with rewards strictly larger than $\tilde{\mu}_k$. Now we assume that the intersection event $S_k$ occurs. Then, $C_k$ occurs, so $f_k(h) = 1$. Also, $A_k \cap D_k$ occurs, so $N_k \leq 16\, t_k F_k(\tilde{\mu}_k)$. Therefore, for $\alpha_k = \frac{\gamma_k}{F_k(\tilde{\mu}_k)} \leq 1$,
$$\frac{dP_{H_k}}{dP_{H_0}}(h)\, I(S_k) \geq (1 - \alpha_k)^{\frac{1}{\alpha_k} \ln\left(\frac{3}{20\delta}\right)} I(S_k).$$
Now, by the fact that $(1 - \epsilon)^{1/\epsilon} \geq e^{-1}$, we obtain the following inequalities:
$$P_{H_k}\big(B_k^C\big) \geq P_{H_k}(S_k) = E_{H_0}\left[\frac{dP_{H_k}}{dP_{H_0}}(h)\, I(S_k)\right] \geq E_{H_0}\left[(1 - \alpha_k)^{\frac{1}{\alpha_k} \ln\left(\frac{3}{20\delta}\right)} I(S_k)\right] \geq (1 - \alpha_k)^{\frac{1}{\alpha_k} \ln\left(\frac{3}{20\delta}\right)} P_{H_0}(S_k) > \frac{3}{20}\, e^{-\ln\left(\frac{3}{20\delta}\right)} \geq \delta.$$
We found that if an algorithm is $(\epsilon,\delta)$-correct under hypothesis $H_0$ and $E_{H_0}[T_k] \leq t_k$ for some $k \neq k'$, then, under hypothesis $H_k$, this algorithm returns a sample that is smaller by at least $\epsilon$ than the maximal possible reward with probability $\delta$ or more; hence the algorithm is not $(\epsilon,\delta)$-correct. Therefore, any $(\epsilon,\delta)$-correct algorithm must satisfy $E_{H_0}[T_k] > t_k$ for all arms except possibly one (namely, the one $k'$ for which $P_{H_0}\big(B_{k'}^C\big) \leq \frac{1}{2}$). In addition $t_{k^*} \geq t_{k'}$, where $k^*$ is the optimal arm (namely, $\mu_{k^*}^* = \mu^*$). Hence,
$$E[T] \geq \sum_{k \in K \setminus \{k^*\}} \frac{1}{16\, G_*\big(\min(\epsilon_0,\ \epsilon + \mu^* - \mu_k^*)\big)} \ln\left(\frac{3}{20\delta}\right).$$
Now, by the fact that $G_*$ is concave, it follows that $t\, G_*(y) + (1-t)\, G_*(0) \leq G_*(t y)$, where $y = \mu^* + \epsilon - \mu_k^*$. So, for the case of $\epsilon \geq \mu^* - \mu_k^*$, taking $t = \frac{\epsilon}{y}$ and using the fact that $G_*$ is non-negative, it follows that $G_*(y) \leq 2\, G_*(\epsilon)$; and for the case of $\epsilon < \mu^* - \mu_k^*$, taking $t = \frac{\mu^* - \mu_k^*}{y}$, it follows that $G_*(y) \leq 2\, G_*(\mu^* - \mu_k^*)$. Then, since $G_*$ is a nondecreasing function, the lower bound is obtained.

4 Algorithm

Here we provide an $(\epsilon,\delta)$-correct algorithm. The algorithm is based on sampling the arm that has the highest upper confidence bound on its maximal reward. The algorithm starts by sampling a fixed number of times from each arm. Then, it repeatedly calculates an index for each arm, which can be interpreted as an upper bound on the maximal reward of this arm, and samples once from the arm with the largest index.
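Before the formal statement, here is a compact Python sketch of this index rule; the constants ($L$, $N_0$, the confidence radius $U$) follow the Max-CB algorithm given below, while the two-arm uniform instance and the linear bound $G_*(\epsilon) = \epsilon/0.9$ are hypothetical illustrations, not taken from the paper.

```python
import math
import random

def max_cb(sample_fns, G_inv, G_eps, eps, eps0, delta):
    """Sketch of the index-based sampling rule (cf. Algorithm 1, Max-CB).

    sample_fns: one sampling function per arm.
    G_inv: inverse of the tail-function bound G_*.
    G_eps: the value G_*(eps).
    """
    K = len(sample_fns)
    L = 6 * math.log(K * (1 + (-math.log(delta)) / G_eps))
    N0 = int((L - math.log(delta)) / eps0) + 1
    counts = [N0] * K
    # V_k: largest reward seen from arm k after the N0 initial samples
    best = [max(f() for _ in range(N0)) for f in sample_fns]

    def U(c):
        # confidence radius: G_*^{-1}((L - ln(delta)) / C(k))
        return G_inv((L - math.log(delta)) / c)

    while True:
        # arm with the largest index Y_k = V_k + U(C(k))
        k = max(range(K), key=lambda j: best[j] + U(counts[j]))
        if U(counts[k]) < eps:
            return max(best)  # stop: leading arm's radius is below eps
        best[k] = max(best[k], sample_fns[k]())
        counts[k] += 1

# Hypothetical instance: two arms, uniform on [0, 0.9] and [0, 0.6],
# so G_k(e) = e / width >= G_*(e) = e / 0.9 for both arms.
rng = random.Random(0)
arms = [lambda: rng.uniform(0.0, 0.9), lambda: rng.uniform(0.0, 0.6)]
A = 1 / 0.9
v = max_cb(arms, G_inv=lambda y: y / A, G_eps=A * 0.05,
           eps=0.05, eps0=0.5, delta=0.1)
```

On this instance the returned value $v$ lands within $\epsilon = 0.05$ of the overall maximum $0.9$, with most samples concentrated on the better arm, as the analysis below suggests.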
Algorithm 1 Maximal Confidence Bound (Max-CB) Algorithm

1: Input: The tail function bound $G_* = \{G_*(\epsilon'),\ 0 \leq \epsilon' \leq \epsilon_0\}$ and its inverse function $G_*^{-1}$, constants $\delta > 0$ and $\epsilon > 0$. Define $L = 6 \ln\left(|K|\left(1 + \frac{-\ln(\delta)}{G_*(\epsilon)}\right)\right)$.
2: Initialization: Counters $C(k) = N_0$, $k \in K$, where $N_0 = \left\lfloor \frac{L - \ln \delta}{\epsilon_0} \right\rfloor + 1$.
3: Sample $N_0$ times from each arm.
4: Compute $Y_{C(k)}^k = V_{C(k)}^k + U(C(k))$ and set $k^* \in \arg\max_{k \in K} Y_{C(k)}^k$ (with ties broken arbitrarily), where $V_{C(k)}^k$ is the largest reward observed so far from arm $k$ and $U(C(k)) = G_*^{-1}\left(\frac{L - \ln(\delta)}{C(k)}\right)$.
5: If $U(C(k^*)) < \epsilon$, stop and return the largest sampled reward. Else, sample once from arm $k^*$, set $C(k^*) = C(k^*) + 1$ and return to step 4.

The algorithm terminates when the number of samples from the arm with the largest index is above a certain threshold. This idea is similar to that of the UCB1 algorithm of [2].

Theorem 2. Under Assumption 1, for any $\epsilon$ and $\delta$ such that $L \geq 10$, Algorithm 1 is $(\epsilon,\delta)$-correct, with a sample complexity of
$$E[T] \leq \sum_{k \in K} \frac{L - \ln(\delta)}{G_*(\Theta_k)} + |K|, \quad (4)$$
where $L = 6 \ln\left(|K|\left(1 + \frac{-\ln(\delta)}{G_*(\epsilon)}\right)\right)$ as defined in the algorithm, and $\Theta_k = \min\{\max(\epsilon,\ \mu^* - \mu_k^*),\ \epsilon_0\}$.

As observed by comparing the bounds in Equations (3) and (4), the upper bound in Theorem 2 has the same dependence on $\epsilon$ and $\ln(\delta^{-1})$, up to a logarithmic term. It should be noted, though, that while the lower bound is currently restricted to concave tail function bounds, the algorithm and its bound are not restricted to this case. To establish Theorem 2, we first bound the probability of the event under which the upper bound of the best arm is below the maximal reward, using an extreme value bound.
Then, we bound the largest number of samples after which the algorithm terminates, under the assumption that the upper bound of the best arm is above the maximal reward.

Proof of Theorem 2. We denote the time step of the algorithm by $t$, and the value of the counter $C(k)$ at time step $t$ by $C_t(k)$. Recall that $T$ stands for the random final time step. By the condition in step 5 of the algorithm, for every arm $k \in K$, it follows that
$$C_T(k) \leq \left\lfloor \frac{L - \ln(\delta)}{G_*(\epsilon)} \right\rfloor + 1. \quad (5)$$
Note that, since $\frac{d}{dx}\, 6\ln(x) \leq 1$ for $x \geq 6$, and since for $x_0 = \exp\left(\frac{5}{3}\right)$ it holds that $6\ln(x_0) = 10$, it is obtained that
$$L' \triangleq |K|\left(\frac{-\ln(\delta)}{G_*(\epsilon)} + 1\right) > 6\ln\left(|K|\left(\frac{-\ln(\delta)}{G_*(\epsilon)} + 1\right)\right) = L,$$
for $L \geq 10$. So, since $T = \sum_{k \in K} C_T(k)$, for $L \geq 10$ it follows that
$$T \leq |K|\left(\frac{L - \ln(\delta)}{G_*(\epsilon)} + 1\right) < |K|\left(\frac{L' - \ln(\delta)}{G_*(\epsilon)} + 1\right) \leq L'^2 = e^{L/3}. \quad (6)$$
Now, we begin with proving the $(\epsilon,\delta)$-correctness property of the algorithm. Recall that for every arm $k \in K$ the rewards are distributed according to the CDF $F_k(\mu)$. Let us assume w.l.o.g. that $\mu_1^* = \mu^*$. Then, for $N > 0$, by the fact that $(1 - \epsilon)^{1/\epsilon} \leq e^{-1}$ for every $\epsilon \in (0, 1]$, for $U(N) = G_*^{-1}\left(\frac{L - \ln(\delta)}{N}\right)$ it follows that
$$P\left(V_N^1 \leq \mu^* - U(N)\right) = \big(F_1(\mu^* - U(N))\big)^N \leq \left(1 - \frac{L - \ln(\delta)}{N}\right)^N \leq \delta\, e^{-L}, \quad (7)$$
where $V_N^k$ is the largest reward observed from arm $k$ after this arm has been sampled $N$ times. Hence, at every time step $t$, by the definition of $Y_{C_t(1)}^1$ and Equations (6) and (7), applying the union bound, it follows that
$$P\left(Y_{C_t(1)}^1 \leq \mu^*\right) = P\left(V_{C_t(1)}^1 \leq \mu^* - U(C_t(1))\right) \leq \sum_{N=1}^{\exp(L/3)} P\left(V_N^1 \leq \mu^* - U(N)\right) \leq \delta\, e^{-2L/3}. \quad (8)$$
Since, by the condition in step 5, when the algorithm stops it holds that
$$V_{C_t(k^*)}^{k^*} > Y_{C_t(k^*)}^{k^*} - \epsilon,$$
and since at every time step $Y_{C_t(k^*)}^{k^*} \geq Y_{C_t(1)}^1$, it follows by Equation (8) that
$$P\left(V_{C_t(k^*)}^{k^*} \leq \mu^* - \epsilon\right) \leq P\left(Y_{C_t(1)}^1 \leq \mu^*\right) \leq \delta\, e^{-2L/3}.$$
Therefore, the algorithm returns a reward greater than $\mu^* - \epsilon$ with probability larger than $1 - \delta$; so it is $(\epsilon,\delta)$-correct. For proving the bound on the expected sample complexity of the algorithm we define the following sets: $M(\epsilon) = \{l \in K \,|\, \mu^* - \mu_l^* < \epsilon\}$ and $N(\epsilon) = \{l \in K \,|\, \mu^* - \mu_l^* \geq \epsilon\}$. As before, we assume w.l.o.g. that $\mu_1^* = \mu^*$. For the case in which $E_1 \triangleq \bigcap$ … $\leq t_1$. Therefore, in this case the probability $P\big(Y_{C(k)}^k < \mu_k^*\big)$ is smaller than the value on which the proof of Theorem 2 relies. So, Algorithm 1 returns an $\epsilon$-optimal value with a larger probability. The probability of returning a false value is given in the following proposition.

Proposition 2. When Assumption 1 holds for $G_*(\epsilon)$, and also for $G'_*(\epsilon) = \alpha G_*(\epsilon)$ for some $\alpha > 1$, and $L \geq 10$, Algorithm 1 is $(\epsilon, \delta')$-correct, where $\delta' = \delta^{\alpha} e^{-(\alpha - 1) L}$ ($\epsilon$ and $\delta$ are provided to the algorithm), with the sample complexity provided in Theorem 2.

The proof of the above proposition is based on a minor variation of the proof of Theorem 2, and is provided in Section 8.3 in the Appendix.

6 Comparison with the Unified-Arm Model

In this section, we analyze the improvement in the sample complexity obtained by utilizing the multi-armed framework (the ability to choose which arm to sample at each time step), compared to a model in which all the arms are unified into a single arm, so that each sample is effectively obtained from a random arm.
In the unified-arm model, when the agent samples from this unified arm, one of the original arms is chosen uniformly at random, and a reward is sampled from this arm. The CDF of the unified arm is therefore $F(\mu) = \frac{1}{|K|} \sum_{k \in K} F_k(\mu)$, and the corresponding maximal reward is $\mu^* = \max_k \mu_k^*$. Assumption 1 implies that $1 - F(\mu) \geq \frac{G_*(\mu^* - \mu)}{|K|}$. In the remainder of this section, we provide a lower bound on the sample complexity and an $(\epsilon,\delta)$-correct algorithm that attains the same order of this bound for the unified-arm model. (Note that the lower bound in Theorem 1 is meaningless for $|K| = 1$.) Then, we discuss which approach (multi-armed or unified-arm) is better for different model parameters, and provide examples that illustrate these cases.

6.1 Lower Bound

The following theorem provides a lower bound on the sample complexity for the unified-arm model.

Theorem 3. For every $(\epsilon,\delta)$-correct algorithm, under Assumption 1, when $G_*(\epsilon)$ is concave and $\delta \leq \frac{3}{20} e^{-3}$, it holds that
$$E[T] \geq \frac{|K|}{16\, G_*(\epsilon)} \ln\left(\frac{3}{20\delta}\right). \quad (13)$$

The proof is provided in Section 8.4 in the Appendix and is based on a similar idea to that of Theorem 1.

6.2 Algorithm

Algorithm 2 Unified-Arm Algorithm

1: Input: The tail function bound $G_* = \{G_*(\epsilon'),\ 0 \leq \epsilon' \leq \epsilon_0\}$ and its inverse function $G_*^{-1}$, constants $\delta > 0$ and $\epsilon > 0$.
2: Sample $\left\lceil \frac{-\ln(\delta)\, |K|}{G_*(\epsilon)} \right\rceil + 1$ times from the unified arm.
3: Return the best sample.

In Algorithm 2, a fixed number of instances is sampled, and the algorithm chooses the best one among them. The following theorem provides a bound on the sample complexity achieved by Algorithm 2.

Theorem 4. Under Assumption 1, Algorithm 2 is $(\epsilon,\delta)$-correct, with a sample complexity bound of
$$E[T] \leq \frac{|K| \ln(\delta^{-1})}{G_*(\epsilon)} + 2.$$

The proof is provided in Section 8.5 in the Appendix.
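The bounds of Theorems 2-4 are straightforward to evaluate numerically. The sketch below does so for a linear tail bound $G_*(\epsilon) = A\epsilon$, assuming $\epsilon_0$ is large enough that $\Theta_k = \max(\epsilon, \mu^* - \mu_k^*)$ (an assumption on our part; the parameter values are those of the examples in Section 6.3).

```python
import math

def theorem2_upper(mu_stars, G, G_eps, eps, delta):
    """Upper bound of Theorem 2 for Algorithm 1 (Max-CB).
    Assumes eps0 is large enough that Theta_k = max(eps, mu* - mu*_k)."""
    K = len(mu_stars)
    mu_star = max(mu_stars)
    L = 6 * math.log(K * (1 + (-math.log(delta)) / G_eps))
    total = sum((L - math.log(delta)) / G(max(eps, mu_star - m))
                for m in mu_stars)
    return total + K

def theorem3_lower(K, G_eps, delta):
    """Lower bound of Theorem 3 for the unified-arm model."""
    return K / (16 * G_eps) * math.log(3 / (20 * delta))

def algorithm2_samples(K, G_eps, delta):
    """Number of samples drawn by Algorithm 2 (unified arm)."""
    return math.ceil(-math.log(delta) * K / G_eps) + 1

# Parameters of Example 1: |K| = 10^4, G_*(e) = 0.01 e, eps = 1e-4, delta = 1e-3
A, eps, delta = 0.01, 1e-4, 1e-3
G = lambda e: A * e
mus = [0.9] + [0.1] * 9999            # one good arm, many poor ones
ub = theorem2_upper(mus, G, G(eps), eps, delta)  # ~3.5e8
lb = theorem3_lower(10**4, G(eps), delta)        # ~3.1e9
n2 = algorithm2_samples(10**4, G(eps), delta)    # ~6.9e10
```

These expressions reproduce the figures quoted in Examples 1 and 2 of Section 6.3; swapping the roles of the good and poor arms (Example 2) drives the Theorem 2 bound above the sample count of Algorithm 2.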
Note that the upper bound on the sample complexity is of the same order as the lower bound in Theorem 3.

6.3 Comparison and Examples

To find when the multi-armed algorithm is useful, we may compare the upper bound on the sample complexity provided in Theorem 2 for Algorithm 1 (multi-armed case) with the lower bound for the unified-arm model in Theorem 3. We consider two extreme cases.

Case 1: Suppose that arm 1 is best: $\mu_1^* = \mu^*$, while all the other arms fall short significantly compared to the required accuracy $\epsilon$: $\mu_k^* \ll \mu^* - \epsilon$ for $k \neq 1$. Here $\frac{1}{\epsilon} \gg \frac{1}{\max(\epsilon,\, \mu^* - \mu_k^*)}$ for $k \neq 1$. Hence the upper bound on the sample complexity of Algorithm 1 (multi-armed case) will be smaller than the lower bound for the unified-arm model in Theorem 3. We now provide an example which illustrates Case 1 numerically.

Example 1 (Case 1). Let $|K| = 10^4$, $\mu_1^* = 0.9$, $\mu_k^* = 0.1$ for all $k \neq 1$, $G_*(\epsilon) = A\epsilon$ and $A = 0.01$. For $\epsilon = 10^{-4}$ and $\delta = 10^{-3}$, the sample complexity bound attained by Algorithm 1 is $3.52 \times 10^8$. The lower bound for the unified-arm model is $3.13 \times 10^9$. The sample complexity attained by Algorithm 2 (for the unified-arm model) is $6.9 \times 10^{10}$.

Case 2: Consider next the opposite case, where there are many optimal arms and few that are worse: say $\mu_1^* \ll \mu^* - \epsilon$, while $\mu_k^* = \mu^*$ for all $k \neq 1$. Here $\frac{1}{\epsilon} = \frac{1}{\max(\epsilon,\, \mu^* - \mu_k^*)}$ for $k \neq 1$. Hence, since there is a logarithmic-in-$\frac{|K|}{\epsilon}$ multiplicative factor in the upper bound on the sample complexity of Algorithm 1, this bound will be larger than the lower bound for the unified-arm model in Theorem 3. The following example illustrates Case 2 numerically.

Example 2 (Case 2). Let $|K|$, $G_*(\epsilon)$, $\delta$ and $\epsilon$ remain the same as in Example 1, and let $\mu_1^* = 0.1$ and $\mu_k^* = 0.9$ for $k \neq 1$. The sample complexity bound of Algorithm 1 is $1.56 \times 10^{12}$, which is larger than the sample complexity of Algorithm 2, which is $6.9 \times 10^{10}$.

As shown in Example 2, in some cases the bound on the sample complexity of the multi-armed Algorithm 1 is larger than that of the unified-arm Algorithm 2. We shall further comment on these findings in our concluding remarks.

7 Conclusion

We have considered in this paper the Max $K$-Armed Bandit problem in the PAC setting, under the assumption of a known lower bound on the tail function of the reward distributions. We provided a lower bound on the sample complexity of any algorithm, and a UCB-type sampling algorithm whose sample complexity is essentially of the same order, up to logarithmic terms. We have further analyzed the robustness of our algorithm to the violation of Assumption 1 on the tail functions, and bounded the resulting deterioration in performance, which is shown to be gradual. The performance of the multi-armed Algorithm 1 was compared to a simple unified-arm approach. The benefits of Algorithm 1, which aims to focus sampling on the best arms, are clear when there are few optimal arms (in terms of their maximal reward), but might diminish when many arms are close to optimal. Combining these two approaches into a single algorithm that excels in either case remains a challenge for future work.

References

[1] J.-Y. Audibert and S. Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, pages 41-53, 2010.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
[3] D. A. Berry, R. W. Chen, A. Zame, D. C. Heath, and L. A. Shepp. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103-2116, 1997.
[4] T. Bonald and A. Proutiere.
Two-target algo- rithms for infinite-ar med bandits with Berno ulli rewards. In A dvanc es in Neur al Information Pr o- c essing S ystems 26 , pages 2184– 2192. Curr a n As- so ciates, Inc., 2013. [5] S. Bubeck and N. Cesa-Bia nch i. Regret a nalysis of sto chastic a nd no nsto c hastic multi-armed bandit problems. Machine L e arning , 5(1):1–1 22, 2 012. [6] A. Carp entier and M. V a lko. Extreme bandits. In A dvanc es in N eur al In formation Pr o c essing Sys- tems 27 , pages 1 089–1 097. Cur ran Asso cia tes , Inc., 201 4. [7] A. Ca rp ent ier and M. V a lko. Simple r egret for infinitely many ar med bandits. In Pr o c e e dings of the 32nd International Confer enc e on Machine L e arning, ICML , pages 11 33–11 41, 201 5. [8] D. Chakra barti, R. K umar, F. Radlinsk i, and E. Upfal. Mortal multi-armed ba ndits. In A d- vanc es in Neur al Information Pr o c essing Systems 21 , page s 273 –280. Curran Asso cia tes, Inc., 200 9. [9] V. A. Cic irello and S. F. Smith. The max k- armed bandit: A new mo del of explora tion a p- plied to search heuristic selection. In Pr o c e e dings of the National Confer enc e on Artificia l Int el li- genc e , volume 2 0, page 1 355, 200 5. [10] Y. David and N. Shimkin. Infinitely many-armed bandits with unkno wn v alue distribution. In Machine L e arning and Know le dge D isc overy in Datab ases , pages 307 –322. Springer , 2 014. [11] E. Even-Dar, S. Ma nnor, and Y. Manso ur. P AC bo unds for multi-armed ba ndit a nd mar kov de- cision pro ces s es. In COL T-15th Confer enc e on L e arning The ory , pages 255– 270. 2002. [12] V. Gabillon, M. Ghav amzadeh, and A. Lazar ic. Best arm identification: A unified a pproach to fixed budget and fixed confidence. In A dvanc es in Neur al Information Pr o c essing Systems 25 , pages 3212– 3220. Curra n Ass o ciates, Inc., 2012 . [13] S. Mannor a nd J. N. Tsitsiklis. T he sample com- plexity of explor ation in the multi-armed bandit problem. 
8 Appendix

8.1 Lemma 1

Lemma 1. Assumption 1 holds under the hypotheses $\{H_0, H_1, \ldots, H_{|K|}\}$ defined in the proof of Theorem 1.

Proof. For the hypothesis $H_0$ the assumption holds since it is the true one. For $k \in K$ for which $\mu_k^* \geq \mu^* + \epsilon - \epsilon_0$, since the CDF is $F_*(\mu)$ on the interval $[\mu^* + \epsilon - \epsilon_0, \mu_k^{*,H_k}]$, which is of size $\epsilon_0$ and reaches the maximal value, it is easily obtained from the definition of $F_*(\mu)$ that Assumption 1 holds.

We now show that Assumption 1 holds in the last case, namely for $k \in K$ for which $\mu_k^* < \mu^* + \epsilon - \epsilon_0$. We need to show that
$$1 - F_k^{H_k}(\mu^* + \epsilon - \epsilon') \geq G_*(\epsilon') \quad (14)$$
for every $0 \leq \epsilon' \leq \epsilon_0$. For $0 \leq \epsilon' \leq \mu^* + \epsilon - \mu_k^*$, Equation (14) holds by the definition of $F_*$.
For $\mu^* + \epsilon - \mu_k < \epsilon' \leq \epsilon_0$, by the definition of $\mu_k$ it follows that $F_k(\mu) \leq P_k^{\epsilon}$ for $\mu < \mu_k$, so $F_k^{H_k}(\mu^* + \epsilon - \epsilon') \leq 1 - G_*(\epsilon_0)$, and Equation (14) follows by the monotonicity of $G_*(\cdot)$.

Finally, for the case $\mu^* + \epsilon - \mu_k^* < \epsilon' \leq \mu^* + \epsilon - \mu_k$ we use the concavity of $G_*$. For $\mu_k \leq \mu < \mu_k^*$ it follows that $F_k^{H_k}(\mu) = F_k(\mu) - G_*(\mu^* + \epsilon - \mu_k^*)$. Also, by the fact that Assumption 1 holds for $F_k$, we have $G_*(\mu_k^* - \mu) \leq 1 - F_k(\mu)$, so that
$$F_k^{H_k}(\mu) \leq 1 - \Delta_k \quad (15)$$
where $\Delta_k = G_*(\mu_k^* - \mu) + G_*(\mu^* + \epsilon - \mu_k^*)$. Then, by the assumed concavity of $G_*$, and noting that $G_*(0) = 0$, it follows that $\lambda G_*(\mu^* + \epsilon - \mu) \leq G_*(\lambda(\mu^* + \epsilon - \mu))$ for $\lambda \in [0, 1]$. Choosing $\lambda = \lambda_1 = \frac{\mu^* + \epsilon - \mu_k^*}{\mu^* + \epsilon - \mu}$ yields $\lambda_1 G_*(\mu^* + \epsilon - \mu) \leq G_*(\mu^* + \epsilon - \mu_k^*)$, and choosing $\lambda = 1 - \lambda_1$ yields $(1 - \lambda_1) G_*(\mu^* + \epsilon - \mu) \leq G_*(\mu_k^* - \mu)$. Therefore,
$$G_*(\mu^* + \epsilon - \mu) \leq G_*(\mu_k^* - \mu) + G_*(\mu^* + \epsilon - \mu_k^*) = \Delta_k. \quad (16)$$
So, by Equations (15) and (16) it follows that $F_k^{H_k}(\mu) \leq 1 - G_*(\mu^* + \epsilon - \mu)$, so that Assumption 1 holds.

8.2 Proof of Proposition 1

Proof. First, denote by $G_*'^{-1}$ the inverse function of $G_*'(\epsilon)$ (the function for which Assumption 1 holds), and note that
$$G_*'^{-1}(y) = G_*^{-1}\left(\frac{y}{\alpha}\right). \quad (17)$$
We begin by establishing the $(\epsilon', \delta')$-correctness of the algorithm. Assume w.l.o.g. that $\mu_1^* = \mu^*$.
Then, by Equation (17) it follows that $U(N) = G_*'^{-1}\left(\frac{\alpha(L - \ln(\delta))}{N}\right)$, and therefore, similarly to Equation (7), it follows that
$$P\left(V_N^1 \leq \mu^* - U(N)\right) = \left(F_1(\mu^* - U(N))\right)^N \leq \left(1 - \frac{\alpha(L - \ln(\delta))}{N}\right)^N \leq \delta^{\alpha} e^{-\alpha L}. \quad (18)$$
Hence, for every time step $t \leq t_T$, where $t_T = e^{\alpha L / 3}$, by applying the union bound, the definition of $Y^1_{C_t(1)}$ and Equation (18), it follows that
$$P\left(\bigcup_{1 \leq t \leq t_T} \left\{Y^1_{C_t(1)} \leq \mu^*\right\}\right) = P\left(\bigcup_{1 \leq t \leq t_T} \left\{V^1_{C_t(1)} \leq \mu^* - U(C_t(1))\right\}\right) \leq \sum_{t=1}^{t_T} P\left(V^1_N \leq \mu^* - U(N)\right) \leq \delta^{\alpha} e^{-\alpha L / 3}. \quad (19)$$

Recall that $T$ stands for the random final time step. Assume first that the algorithm terminated at a time step larger than $t_T$, namely $T > t_T$ (we consider the case $T \leq t_T$ later). Denote by $t_F$ the first time step at which an arm has been sampled $\frac{t_T}{|K|}$ times, and note that $t_F \leq t_T$. By the fact that $Y^{k^*}_{C_t(k^*)} \geq Y^1_{C_t(1)}$ holds at every time step for the chosen arm $k^*$, it follows that
$$V^{k^*}_{C_{t_F}(k^*)} + U(C_{t_F}(k^*)) = Y^{k^*}_{C_{t_F}(k^*)} \geq Y^1_{C_{t_F}(1)}. \quad (20)$$
Then, by Equation (6) it follows that
$$|K|^{\alpha - 1}\left(\frac{L - \ln(\delta)}{G_*(\epsilon)}\right)^{\alpha} \leq \frac{1}{|K|} e^{\frac{\alpha L}{3}} = \frac{t_T}{|K|}. \quad (21)$$
Now, by Equation (21), the fact that $C_{t_F}(k^*) = \frac{t_T}{|K|}$, and the facts that $G_*^{-1}$ is increasing (and $U$ is decreasing), it is obtained that $U(C_{t_F}(k^*)) \leq \epsilon'$, where $\epsilon' = G_*^{-1}\left(\left(|K|(L - \ln(\delta))\right)^{1 - \alpha} \left(G_*(\epsilon)\right)^{\alpha}\right)$. Therefore, by Equation (19), the $(\epsilon', \delta')$-correctness with $\delta' = \delta^{\alpha}$ and $\epsilon' = G_*^{-1}\left(\left(|K|(L - \ln(\delta))\right)^{1 - \alpha} \left(G_*(\epsilon)\right)^{\alpha}\right)$ is obtained for the case of $T > t_T$.

For the case of $T \leq t_T$, by the fact that the algorithm terminated and the condition in step 5, it follows that $U(C_T(k^*)) < \epsilon \leq \epsilon'$. Then, since $V^{k^*}_{C_T(k^*)} + U(C_T(k^*)) = Y^{k^*}_{C_T(k^*)} \geq Y^1_{C_T(1)}$, by Equation (19) the $(\epsilon', \delta')$-correctness is obtained for the case of $T \leq t_T$.
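The last step in Equation (18) relies on the standard bound $(1 - x/N)^N \leq e^{-x}$ for $0 \leq x \leq N$, which holds since $\ln(1 - x/N) \leq -x/N$. As a quick numerical sketch (the sample values of $x$ and $N$ below are arbitrary illustrations, not taken from the paper):

```python
import math

# Spot-check the bound (1 - x/N)^N <= exp(-x), which is used to pass from
# (1 - a(L - ln(d))/N)^N to d^a * exp(-a*L) in Equation (18).
for x in (0.5, 3.0, 6.9):
    for N in (10, 100, 10**6):
        assert (1 - x / N) ** N <= math.exp(-x)
```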
We now continue with analyzing the sample complexity. Recall that $T$ stands for the random final step. In addition, by the same considerations as in the proof of Theorem 2, it follows by Equation (6) that $T \leq e^{L/3}$.

8.4 Proof of Theorem 3

Proof. Suppose that an $(\epsilon, \delta)$-correct algorithm satisfies $E_0[T] \leq t$. Define the following events:

• $A = \{T \leq 4t\}$. By the same considerations as in the proof of Theorem 1 (for the events $\{A_k\}_{k \in K}$), it follows that if $E_0[T] \leq t$, then $P_0(A) \geq \frac{3}{4}$.

• Let $B$ stand for the event under which the chosen sample is strictly above $\mu^*$, and $B^C$ for its complement (the chosen sample is smaller than or equal to $\mu^*$). Clearly, $P_0\left(B^C\right) = 1$.

• We define the event $C$ to be the event under which all the samples obtained from the unified arm lie in the interval $(-\infty, \mu^*]$. Clearly, $P_0(C) = 1$.

• We let $D$ denote the event under which, for any number of samples $t \leq 4t'$ from the unified arm, the number of samples which lie in the interval $(-\infty, \mu]$ is bounded, namely,
$$D \triangleq \left\{\max_{1 \leq t \leq 4t'} \sum_{i=1}^{t} x_i - tF(\mu) < 15\, t' F(\mu)\right\}$$
where $x_i$ is a random variable which equals 1 if the $i$-th sample lies in the interval and 0 otherwise. Now, by the same considerations as in the proof of Theorem 1, it follows that $P_0(D) \geq \frac{9}{10}$.

Define now the intersection event $S = A \cap B^C \cap C \cap D$. We have shown that $P_0(A) \geq \frac{3}{4}$, $P_0\left(B^C\right) = 1$, $P_0(C) = 1$ and $P_0(D) \geq \frac{9}{10}$, from which it is obtained that $P_0(S) \geq \frac{13}{20}$.

Now, let $h$ be the history of the process. Then, by the same considerations as in the proof of Theorem 1, for $\alpha = \gamma' F(\mu)$, it follows that
$$\frac{dP_{H_1}}{dP_{H_0}}(h)\, I(S) \geq (1 - \alpha)^{\frac{1}{\alpha} \ln\left(\frac{3}{20\delta}\right)} I(S).$$
Therefore,
$$P_1\left(B^C\right) \geq P_1(S) = E_0\left[\frac{dP_1}{dP_0}(h)\, I(S)\right] \geq E_0\left[(1 - \alpha)^{\frac{1}{\alpha} \ln\left(\frac{3}{20\delta}\right)} I(S)\right] \geq (1 - \alpha)^{\frac{1}{\alpha} \ln\left(\frac{3}{20\delta}\right)} P_0(S) > \frac{3}{20}\, e^{-\ln\frac{3}{20\delta}} \geq \delta,$$
where in the last inequality we used the fact that $(1 - \alpha)^{\frac{1}{\alpha}} \geq e^{-1}$.
We have found that if an algorithm is $(\epsilon, \delta)$-correct under hypothesis $H_0$ and $E_0[T] \leq t$, then under hypothesis $H_1$ this algorithm returns a sample that is smaller by at least $\epsilon$ than the maximal possible reward with probability $\delta$ or more; hence the algorithm is not $(\epsilon, \delta)$-correct. Therefore, any $(\epsilon, \delta)$-correct algorithm must satisfy $E_0[T] > t$, and the lower bound is obtained.

8.5 Proof of Theorem 4

Proof. Since sampling from the unified arm consists of choosing one arm out of the $|K|$ arms (with equal probability) and then sampling from this arm, it follows that
$$F(\mu^* - \epsilon) \leq 1 - \frac{G_*(\epsilon)}{|K|}.$$
Also, we note that $(1 - \epsilon)^{\frac{1}{\epsilon}} \leq e^{-1}$ for every $\epsilon \in (0, 1]$. Therefore, for $N = \left\lceil \frac{-\ln(\delta)\,|K|}{G_*(\epsilon)} \right\rceil + 1$,
$$P\left(V^1_N < \mu^* - \epsilon\right) = \left(F(\mu^* - \epsilon)\right)^N \leq \left(1 - \frac{G_*(\epsilon)}{|K|}\right)^N < \delta, \quad (27)$$
where $V^1_N$ is the largest reward observed among the first $N$ samples. Hence, the algorithm is $(\epsilon, \delta)$-correct. The bound on the sample complexity is immediate from the definition of the algorithm.
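As a concrete numerical instantiation of the bound above (a sketch using the parameter values shared by Examples 1 and 2 in Section 6.3), $N = \lceil -\ln(\delta)\,|K| / G_*(\epsilon)\rceil + 1$ reproduces the reported unified-arm sample complexity of $6.9 \times 10^{10}$, and the failure probability bound from Equation (27) indeed falls below $\delta$:

```python
import math

# Parameters shared by Examples 1 and 2 in Section 6.3:
# |K| = 10^4, G*(eps) = A*eps with A = 0.01, eps = 10^-4, delta = 10^-3.
K, delta = 10**4, 1e-3
eps, A = 1e-4, 0.01
G_star_eps = A * eps                 # G*(eps) = 10^-6

# Sample bound of Algorithm 2 from the proof of Theorem 4.
N = math.ceil(-math.log(delta) * K / G_star_eps) + 1
assert abs(N / 6.9e10 - 1) < 0.005   # ~6.9e10, as reported in the examples

# Failure probability bound of Equation (27): (1 - G*(eps)/|K|)^N < delta,
# using (1 - g)^N <= exp(-g*N).
g = G_star_eps / K
assert math.exp(-g * N) < delta
```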
