PAC-Bayesian Analysis of Martingales and Multiarmed Bandits
Yevgeny Seldin (Max Planck Institute, Tübingen, Germany; seldin@tuebingen.mpg.de), François Laviolette (Université Laval, Québec, Canada; francois.laviolette@ift.ulaval.ca), John Shawe-Taylor (University College London; jst@cs.ucl.ac.uk), Jan Peters (Max Planck Institute, Tübingen, Germany; jan.peters@tuebingen.mpg.de), Peter Auer (Chair for Information Technology, University of Leoben, Austria; auer@unileoben.ac.at)

Abstract

We present two alternative ways to apply PAC-Bayesian analysis to sequences of dependent random variables. The first is based on a new lemma that makes it possible to bound expectations of convex functions of certain dependent random variables by expectations of the same functions of independent Bernoulli random variables. This lemma provides an alternative tool to the Hoeffding-Azuma inequality for bounding the concentration of martingale values. Our second approach is based on integration of the Hoeffding-Azuma inequality with PAC-Bayesian analysis. We also introduce a way to apply PAC-Bayesian analysis in situations of limited feedback. We combine the new tools to derive PAC-Bayesian generalization and regret bounds for the multiarmed bandit problem. Although our regret bound is not yet as tight as state-of-the-art regret bounds based on other well-established techniques, our results significantly expand the range of potential applications of PAC-Bayesian analysis and introduce a new analysis tool to reinforcement learning and many other fields where martingales and limited feedback are encountered.

1 Introduction

PAC-Bayesian analysis was introduced over a decade ago (Shawe-Taylor and Williamson, 1997, Shawe-Taylor et al., 1998, McAllester, 1998, Seeger, 2002) and has since made a significant contribution to the analysis and development of supervised learning methods. The power of the PAC-Bayesian approach lies in the successful marriage of the flexibility and intuitiveness of Bayesian models with the rigor of PAC analysis. PAC-Bayesian bounds provide an explicit and often intuitive and easy-to-optimize trade-off between model complexity and empirical data fit, where the complexity can be nailed down to the resolution of individual hypotheses via the definition of the prior. PAC-Bayesian analysis has been applied to derive generalization bounds and new algorithms for linear classifiers and maximum margin methods (Langford and Shawe-Taylor, 2002, McAllester, 2003, Germain et al., 2009), structured prediction (McAllester, 2007), and clustering-based classification models (Seldin and Tishby, 2010), to name just a few. However, the application of PAC-Bayesian analysis beyond the supervised learning domain has remained surprisingly limited. In fact, the only additional domain known to us is density estimation (Seldin and Tishby, 2010, Higgs and Shawe-Taylor, 2010).

Even within supervised learning the applications of PAC-Bayesian analysis were restricted to i.i.d. data for a long time. The issue of treating non-independent samples was partially addressed only recently by Ralaivola et al. (2010) and Lever et al. (2010) (their approaches are also suitable for density estimation (Higgs and Shawe-Taylor, 2010)). The solution of Ralaivola et al.
(2010) essentially boils down to breaking the sample into independent (or almost independent) subsets, which also reduces the effective sample size to the number of independent subsets. Such an approach is inapplicable to martingales due to the strong dependence of the cumulative sum on all of its components. Lever et al. (2010) employed Hoeffding's canonical decomposition of U-statistics into forward martingales and applied PAC-Bayesian analysis directly to these martingales. Our second approach to handling sequences of dependent samples, which combines PAC-Bayesian analysis with the Hoeffding-Azuma inequality, is based on similar ideas. Our first approach to sequences of dependent samples is based on a new lemma that allows us to bound expectations of functions of certain sequentially dependent random variables by expectations of the same functions of independent random variables.

One of the most prominent and important fields of application of martingales is reinforcement learning. Some potential advantages of applying PAC-Bayesian analysis in reinforcement learning were recently pointed out by several researchers, including Tishby and Polani (2010) and Fard and Pineau (2010). Tishby and Polani (2010) suggested that the mutual information between states and actions in a policy can be used as a natural regularizer in reinforcement learning. They showed that regularization by mutual information can be incorporated into Bellman equations and therefore can be computed efficiently. Tishby and Polani conjectured that PAC-Bayesian analysis can be applied to justify this form of regularization and to provide generalization guarantees for it. Fard and Pineau (2010) suggested a PAC-Bayesian analysis of batch reinforcement learning. They used the analysis to design an algorithm that leverages prior knowledge when it is informative and agrees with the data distribution, and ignores it when it is irrelevant. In the first case Bayesian learning algorithms perform well and in the second case PAC learning algorithms perform better, and Fard and Pineau showed that their algorithm performs on par with the better of the two in all situations. However, the analysis of Fard and Pineau does not address the exploration-exploitation trade-off, which is the key feature of reinforcement learning. In their batch analysis they assume that every action was sampled in every state some minimal number of times, and the bound decreases at the rate of the square root of the minimum, over states and actions, of the number of times an action was sampled in a state. Clearly, such an analysis is not applicable in the online setting: we do not want to sample "bad" actions many times, but then the bound does not improve with time.

One of the reasons for the difficulty of applying PAC-Bayesian analysis to the exploration-exploitation trade-off is limited feedback (the fact that we only observe the reward for the action taken, but not for all the rest). In supervised learning (and also in density estimation) the empirical error of each hypothesis within a hypothesis class can be evaluated on all the samples, and therefore the size of the sample available for evaluation of all the hypotheses is the same (and usually relatively large).
In the situation of limited feedback the sample from one action cannot be used to evaluate another action (that is the reason why the bound of Fard and Pineau (2010) depends on the minimum of the number of times any action was taken in any state, which is the minimal sample size available for evaluation of all state-action pairs). In the online setting the sample size of "bad" actions has to increase sublinearly in the number of game rounds, which results in slow or even no convergence of the bound. We resolve this issue by applying a weighted sampling strategy (Sutton and Barto, 1998), which is commonly used in the analysis of non-stochastic bandits (Auer et al., 2002b), but has not previously been applied to the analysis of stochastic bandits.

The usage of weighted sampling introduces two new difficulties. One is the dependence between the samples: the rewards we observe influence the distribution over the actions we play, and through this distribution influence the variance of the subsequent weighted sample variables. We handle this dependence using our new PAC-Bayesian approaches to sequences of dependent variables. At the moment both approaches yield comparable bounds; however, each of the approaches has its own potential advantages that can be exploited in future work. The second problem introduced by weighted sampling is the growing variance of the weighted sample variables. The martingale bounding techniques used in this work do not enable full control over the variance, which explains the gap between our results and state-of-the-art bounds for multiarmed bandits (Auer et al., 2002a, Auer and Ortner, 2010). Tighter bounds can be achieved by combining PAC-Bayesian analysis with a Bernstein-type inequality for martingales (Beygelzimer et al., 2010). Such a combination will be presented in future work.

The subsequent sections are organized as follows: Section 2 surveys the main results of the paper, Section 3 presents our bound on expectations of convex functions of sequentially dependent random variables and illustrates its application to the derivation of an alternative to the Hoeffding-Azuma inequality, Section 4 provides a PAC-Bayesian analysis of the weighted sampling strategy based on the bound from Section 3, Section 5 provides a PAC-Bayesian analysis of the weighted sampling strategy based on martingales, Section 6 derives a regret bound for the multiarmed bandit problem, and Section 7 concludes.

2 Main Results

One of the foundation stones of our paper is the following lemma, which enables us to bound expectations of convex functions of certain sequentially dependent random variables by expectations of the same functions of independent Bernoulli random variables. The lemma generalizes a preceding result of Maurer (2004) for independent random variables and might be of wide interest in its own right, far beyond PAC-Bayesian analysis. The lemma can be used to derive an alternative to the Hoeffding-Azuma inequality (Hoeffding, 1963, Azuma, 1967). This alternative can be much tighter in certain situations (see our derivation and discussion of Lemma 7 in the next section).
Lemma 1 Let $X_1, \dots, X_N$ be dependent random variables belonging to the $[0,1]$ interval and distributed by $p(x_i | X_1, \dots, X_{i-1})$, such that $\mathbb{E}[X_i | X_1, \dots, X_{i-1}] = p$ for all $i$. Let $Y_1, \dots, Y_N$ be independent Bernoulli random variables, such that $\mathbb{E}[Y_i] = p$ for all $i$. Then for any convex function $f : [0,1]^N \to \mathbb{R}$:
$$\mathbb{E}[f(X_1, \dots, X_N)] \le \mathbb{E}[f(Y_1, \dots, Y_N)].$$

We present the subsequent results in the context of the multiarmed bandit problem, which is probably the most common problem in machine learning where sequentially dependent variables are encountered. Let $\mathcal{A}$ be a set of actions (arms) of size $|\mathcal{A}| = K$ and let $a \in \mathcal{A}$ denote the actions. Denote by $R(a)$ the expected reward of action $a$. Let $\pi_t$ be a distribution over $\mathcal{A}$ that is played at round $t$ of the game. Let $\{A_1, A_2, \dots\}$ be the sequence of actions played independently at random according to $\{\pi_1, \pi_2, \dots\}$ respectively. Let $\{R_1, R_2, \dots\}$ be the sequence of observed rewards. Denote by $\mathcal{T}_t = \{\{A_1, \dots, A_t\}, \{R_1, \dots, R_t\}\}$ the set of taken actions and observed rewards up to round $t$ (by definition $\mathcal{T}_{t-1} \subset \mathcal{T}_t$). For $t \ge 1$ and $a \in \{1, \dots, K\}$ define a set of indicator random variables $\{I^a_t\}_{t,a}$:
$$I^a_t = \begin{cases} 1, & \text{if } A_t = a, \\ 0, & \text{otherwise.} \end{cases}$$
Define a set of random variables $R^a_t = \frac{1}{\pi_t(a)} I^a_t R_t$. In other words:
$$R^a_t = \begin{cases} \frac{1}{\pi_t(a)} R_t, & \text{if } A_t = a, \\ 0, & \text{otherwise.} \end{cases}$$
Define $\hat{R}_t(a) = \frac{1}{t}\sum_{\tau=1}^{t} R^a_\tau$. For a distribution $\rho$ over $\mathcal{A}$ define $R(\rho) = \mathbb{E}_{\rho(a)}[R(a)]$ and $\hat{R}_t(\rho) = \mathbb{E}_{\rho(a)}[\hat{R}_t(a)]$. For two distributions $\rho$ and $\mu$, let $KL(\rho\|\mu)$ denote the KL divergence between $\rho$ and $\mu$. For two Bernoulli random variables with biases $p$ and $q$ let $kl(p\|q) = p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}$ be an abbreviation for $KL([p, 1-p] \| [q, 1-q])$.

We present two alternative results: the first applies Lemma 1 to handle sequences of dependent random variables, and the second is based on a combination of PAC-Bayesian analysis with the Hoeffding-Azuma inequality. We then compare the results and present a regret bound for the multiarmed bandit problem based on the first solution.

2.1 PAC-Bayesian Analysis of Sequentially Dependent Variables Based on Lemma 1

Our first PAC-Bayesian theorem provides a bound on the divergence between $\hat{R}_t(\rho_t)$ and $R(\rho_t)$ for any playing strategy $\rho_t$ throughout the game.

Theorem 2 For any sequence of sampling distributions $\{\pi_1, \pi_2, \dots\}$ that are not zero for any $a \in \mathcal{A}$, where $\pi_t$ can depend on $\mathcal{T}_{t-1}$, and for any sequence of "reference" ("prior") distributions $\{\mu_1, \mu_2, \dots\}$ over $\mathcal{A}$, such that $\mu_t$ is independent of $\mathcal{T}_t$ (but can depend on $t$), for all possible distributions $\rho_t$ given $t$ and for all $t \ge 1$ simultaneously with probability greater than $1-\delta$:
$$kl\left(\pi^{lmin}_t \hat{R}_t(\rho_t) \,\middle\|\, \pi^{lmin}_t R(\rho_t)\right) \le \frac{KL(\rho_t\|\mu_t) + 3\ln(t+1) - \ln\delta}{t}, \qquad (1)$$
where $\pi^{lmin}_t \le \min_{a,\, 1\le\tau\le t} \pi_\tau(a)$.

The number $\pi^{lmin}_t$ lower-bounds the sampling probabilities of all the actions up to time $t$ ("lmin" stands for "left minimum", the minimum of $\pi_\tau(a)$ up to ["left of"] time $t$). The KL divergence $kl(p\|q)$ bounds the absolute difference between $p$ and $q$ as
$$|p - q| \le \sqrt{kl(p\|q)/2} \qquad (2)$$
(Cover and Thomas, 1991). Combined with (1) this relation yields (with probability greater than $1-\delta$):
$$R(\rho_t) - \hat{R}_t(\rho_t) \le \frac{1}{\pi^{lmin}_t}\sqrt{\frac{KL(\rho_t\|\mu_t) + 3\ln(t+1) - \ln\delta}{2t}}. \qquad (3)$$
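To make the quantities in Theorem 2 concrete, here is a minimal Python sketch (our illustration, not code from the paper) that builds the importance-weighted estimates $\hat{R}_t(a)$ from a played trajectory and evaluates the right-hand side of (3). The uniform sampling distribution, the Bernoulli reward model, and the particular posterior $\rho_t$ are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K, t = 5, 10_000
R_true = rng.uniform(0.2, 0.8, size=K)       # assumed expected rewards R(a)
pi = np.full(K, 1.0 / K)                      # uniform sampling, so pi_lmin = 1/K

# Play t rounds and form the importance-weighted estimates R_hat_t(a).
actions = rng.choice(K, size=t, p=pi)
rewards = rng.binomial(1, R_true[actions])    # Bernoulli rewards in [0, 1]
R_hat = np.zeros(K)
for a_t, r_t in zip(actions, rewards):
    R_hat[a_t] += r_t / pi[a_t]               # R^a_t = I(A_t = a) R_t / pi_t(a)
R_hat /= t                                    # unbiased: E[R_hat_t(a)] = R(a)

# Right-hand side of (3) for a posterior rho_t and prior mu_t.
def bound_3(rho, mu, pi_lmin, t, delta=0.05):
    KL = np.sum(rho * np.log(rho / mu))
    return (1.0 / pi_lmin) * np.sqrt((KL + 3 * np.log(t + 1) - np.log(delta)) / (2 * t))

mu = np.full(K, 1.0 / K)                      # data-independent prior mu_t
rho = np.exp(R_hat) / np.exp(R_hat).sum()     # an arbitrary data-dependent posterior
gap = rho @ R_true - rho @ R_hat              # R(rho_t) - R_hat_t(rho_t)
print(gap, "<=", bound_3(rho, mu, 1.0 / K, t))
```

Note that the prior must be chosen independently of the data, while the posterior may depend on it; this asymmetry is exactly what (1) and (3) license.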
2.2 Combination of PAC-Bayesian Analysis with the Hoeffding-Azuma Inequality

The result presented next is based on a combination of PAC-Bayesian analysis with the Hoeffding-Azuma inequality. We introduce one more definition:
$$\hat{R}^{w_t}_t(a) = \frac{\sum_{\tau=1}^{t} w^t_\tau R^a_\tau}{\sum_{\tau=1}^{t} w^t_\tau},$$
where $w^t_\tau \ge 0$ for all $t$ and $\tau$ and $\sum_{\tau=1}^{t} w^t_\tau > 0$ for all $t$. $\hat{R}^{w_t}_t(a)$ is a weighted average of the samples. For the special case where $w^t_\tau = \frac{1}{t}$ for all $\tau$, $\hat{R}^{w_t}_t(a) = \hat{R}_t(a)$.

Theorem 3 For any sequence of sampling distributions $\{\pi_1, \pi_2, \dots\}$ that are not zero for any $a \in \mathcal{A}$, where $\pi_t$ can depend on $\mathcal{T}_{t-1}$, for any sequence of "reference" ("prior") distributions $\{\mu_1, \mu_2, \dots\}$ over $\mathcal{A}$, such that $\mu_t$ is independent of $\mathcal{T}_t$ (but can depend on $t$), and for any sequences of positive parameters $\{\lambda_1, \lambda_2, \dots\}$ and weighting vectors $\{w^1, w^2, \dots\}$, such that $\lambda_t$ and $w^t$ are independent of $\mathcal{T}_t$ (but can depend on $t$), for all possible distributions $\rho_t$ given $t$ and for all $t \ge 1$ simultaneously with probability greater than $1-\delta$:
$$\left|\hat{R}^{w_t}_t(\rho_t) - R(\rho_t)\right| \le \frac{KL(\rho_t\|\mu_t) + \frac{1}{2}\lambda_t^2 \sum_{\tau=1}^{t}\left(\frac{w^t_\tau}{\pi^{min}_\tau}\right)^2 + 2\ln(t+1) + \ln\frac{2}{\delta}}{\lambda_t \sum_{\tau=1}^{t} w^t_\tau}, \qquad (4)$$
where $\pi^{min}_t \le \min_a \pi_t(a)$.

For the special case $w^t_\tau = \frac{1}{t}$ we obtain that with probability greater than $1-\delta$:
$$\left|\hat{R}_t(\rho_t) - R(\rho_t)\right| \le \frac{KL(\rho_t\|\mu_t) + \frac{1}{2}\frac{\lambda_t^2}{t^2}\sum_{\tau=1}^{t}\frac{1}{(\pi^{min}_\tau)^2} + 2\ln(t+1) + \ln\frac{2}{\delta}}{\lambda_t}. \qquad (5)$$
By taking
$$\lambda_t = \sqrt{2t^2\left(2\ln(t+1) + \ln\frac{2}{\delta}\right) \Big/ \sum_{\tau=1}^{t}\frac{1}{(\pi^{min}_\tau)^2}}$$
we obtain:
$$\left|\hat{R}_t(\rho_t) - R(\rho_t)\right| \le \sqrt{\frac{\frac{1}{t}\sum_{\tau=1}^{t}\frac{1}{(\pi^{min}_\tau)^2}}{2t}}\left(\frac{KL(\rho_t\|\mu_t)}{\sqrt{2\ln(t+1) + \ln\frac{2}{\delta}}} + 2\sqrt{2\ln(t+1) + \ln\frac{2}{\delta}}\right). \qquad (6)$$

2.3 Comparison of Theorem 2 with Theorem 3

It is interesting to compare Theorems 2 and 3, which result from the two different approaches. Inequality (3) depends on $\frac{1}{\pi^{lmin}_t} = \max_{1\le\tau\le t}\left\{\frac{1}{\pi^{min}_\tau}\right\}$, whereas (6) depends on $\sqrt{\frac{1}{t}\sum_{\tau=1}^{t}\frac{1}{(\pi^{min}_\tau)^2}}$. If the $\pi^{min}_\tau$ are approximately equal for all $\tau$, then the two terms are approximately identical. However, a single small value of $\pi^{min}_\tau$ increases the value of $\frac{1}{\pi^{lmin}_t}$ for all $t \ge \tau$, while its relative contribution to the average of $\frac{1}{(\pi^{min}_\tau)^2}$ decreases with time. This property gives an advantage to Theorem 3. On the other hand, the stronger kl form (1) of Theorem 2 can potentially be an advantage for the bound based on Lemma 1, but we did not exploit it in this work. Since for our choice of sampling strategy $\frac{1}{\pi^{lmin}_t} \approx \sqrt{\frac{1}{t}\sum_{\tau=1}^{t}\frac{1}{(\pi^{min}_\tau)^2}}$ up to small constants, we present a regret bound based on Theorem 2 only. A regret bound based on Theorem 3 can be derived in a similar way and is identical to the bound presented below up to small constants.

2.4 Regret Bound for Multiarmed Bandits

We applied Theorem 2 to derive the following regret bound for the multiarmed bandit problem.

Theorem 4 For $t < K^3$ let $\pi_t(a) = \frac{1}{K}$ for all $a$. Let $\gamma_t = K^{1/4} t^{1/4}$ and $\varepsilon_t = K^{-1/4} t^{-1/4}$, and for $t \ge (K^3 - 1)$ let
$$\pi_{t+1}(a) = \tilde{\rho}^{exp}_t(a) = (1 - K\varepsilon_{t+1})\rho^{exp}_t(a) + \varepsilon_{t+1}, \qquad (7)$$
where
$$\rho^{exp}_t(a) = \frac{1}{Z(\rho^{exp}_t)} e^{\gamma_t \hat{R}_t(a)} \qquad (8)$$
and $Z(\rho^{exp}_t) = \sum_a e^{\gamma_t \hat{R}_t(a)}$. Then for $t \ge K^3$ the per-round regret $R(a^*) - R(\tilde{\rho}^{exp}_t)$ (where $a^*$ is the best action) is bounded by
$$R(a^*) - R(\tilde{\rho}^{exp}_t) \le \frac{K^{3/4}}{(t+1)^{1/4}}\left(2.5 + \sqrt{\frac{\ln K + 3\ln(t+1) - \ln\delta}{2K}} + \sqrt{\frac{3\ln(t+1) - \ln\delta}{2K}}\right)$$
with probability greater than $1-\delta$ for all rounds $t$ simultaneously. This translates into a total regret of $\tilde{O}(K^{3/4} t^{3/4})$ (where $\tilde{O}$ hides logarithmic factors).
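The sampling scheme (7)-(8) of Theorem 4 is straightforward to implement. The following Python sketch is our illustration, not the authors' code: it assumes Bernoulli rewards and slightly simplifies the round-index bookkeeping relative to the theorem, but plays the strategy with the schedules $\gamma_t = K^{1/4}t^{1/4}$ and $\varepsilon_t = K^{-1/4}t^{-1/4}$.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 4, 50_000
R_true = rng.uniform(0.0, 1.0, size=K)   # assumed Bernoulli means, unknown to the player

S = np.zeros(K)                          # running sums of importance-weighted rewards
pi = np.full(K, 1.0 / K)                 # uniform play for t < K^3, as in Theorem 4
for t in range(1, T + 1):
    if t > K**3:
        gamma = (K * (t - 1)) ** 0.25            # gamma_{t-1} = K^{1/4} (t-1)^{1/4}
        eps = (K * t) ** -0.25                   # eps_t = K^{-1/4} t^{-1/4}
        R_hat = S / (t - 1)                      # importance-weighted estimate of R(a)
        w = np.exp(gamma * (R_hat - R_hat.max()))  # (8), shifted for numerical stability
        pi = (1 - K * eps) * (w / w.sum()) + eps   # (7): softmax mixed with exploration
    a = rng.choice(K, p=pi)
    r = rng.binomial(1, R_true[a])
    S[a] += r / pi[a]                    # R^a_t = I(A_t = a) R_t / pi_t(a)

print("best arm:", R_true.argmax(), "final policy:", np.round(pi, 3))
```

The $t \ge K^3$ warm-up is exactly what keeps the exploration mixture valid: $K\varepsilon_t = K^{3/4}t^{-1/4} \le 1$ if and only if $t \ge K^3$.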
Note that $\varepsilon_t$ bounds $\pi_t(a)$ from below for all $a$ and $t \ge K^3$. Furthermore, since $\varepsilon_t$ is a decreasing sequence, it actually bounds $\pi_\tau(a)$ from below for all $a$ and $\tau \le t$. Hence, for the prediction strategy selected in Theorem 4 and for $t \ge K^3$ we can substitute $\pi^{lmin}_t$ with $\varepsilon_t$ in (1) and (3).

3 Proof of Lemma 1 and an Example of its Application

We start with the proof of Lemma 1 and then illustrate how it can be applied to martingales.

Proof of Lemma 1: The proof follows the lines of the proof of Lemma 3 in Maurer (2004). Any point $\bar{x} = (x_1, \dots, x_N) \in [0,1]^N$ can be written as a convex combination of the extreme points $\bar{\eta} = (\eta_1, \dots, \eta_N) \in \{0,1\}^N$ in the following way:
$$\bar{x} = \sum_{\bar{\eta} \in \{0,1\}^N}\left(\prod_{i:\eta_i=0}(1 - x_i)\prod_{i:\eta_i=1} x_i\right)\bar{\eta}.$$
Convexity of $f$ therefore implies
$$f(\bar{x}) \le \sum_{\bar{\eta} \in \{0,1\}^N}\left(\prod_{i:\eta_i=0}(1 - x_i)\prod_{i:\eta_i=1} x_i\right) f(\bar{\eta}), \qquad (9)$$
with equality if $\bar{x} \in \{0,1\}^N$. At the next step Maurer (2004) uses independence of the $X_i$-s, whereas we use the fact that their conditional expectation is constant. Taking expectations of both sides of (9) we obtain:
$$\mathbb{E}_{X_1,\dots,X_N}[f(\bar{X})] \le \sum_{\bar{\eta} \in \{0,1\}^N} \mathbb{E}_{X_1,\dots,X_{N-1}}\left[\mathbb{E}_{X_N}\left[\prod_{i:\eta_i=0}(1 - X_i)\prod_{i:\eta_i=1} X_i \,\middle|\, X_1,\dots,X_{N-1}\right]\right] f(\bar{\eta})$$
$$= \sum_{\bar{\eta} \in \{0,1\}^N} \mathbb{E}_{X_1,\dots,X_{N-1}}\left[\prod_{i:\eta_i=0,\, i<N}(1 - X_i)\prod_{i:\eta_i=1,\, i<N} X_i\right] (1-p)^{1-\eta_N} p^{\eta_N} f(\bar{\eta}),$$
since $\mathbb{E}[X_N | X_1,\dots,X_{N-1}] = p$ allows the factor involving $X_N$ to be replaced by $p$ (if $\eta_N = 1$) or $1-p$ (if $\eta_N = 0$), exactly as for the independent Bernoulli variable $Y_N$. Repeating the same argument for $X_{N-1},\dots,X_1$ yields
$$\mathbb{E}[f(\bar{X})] \le \sum_{\bar{\eta} \in \{0,1\}^N}\left(\prod_{i:\eta_i=0}(1-p)\prod_{i:\eta_i=1} p\right) f(\bar{\eta}) = \mathbb{E}[f(Y_1,\dots,Y_N)],$$
where the final equality holds because (9) holds with equality on $\{0,1\}^N$.

For comparison we recall the Hoeffding-Azuma inequality.

Lemma 8 (Hoeffding-Azuma inequality) Let $S_N = \sum_{i=1}^{N} X_i$, where $X_1, \dots, X_N$ is a martingale difference sequence with $X_i \in [a_i, b_i]$. Then for any $\lambda > 0$:
$$\mathbb{E}\left[e^{\lambda S_N}\right] \le e^{(\lambda^2/8)\sum_{i=1}^{N}(b_i - a_i)^2}.$$
It is easy to verify, using the same procedure we applied before, that Lemma 8 implies that with probability greater than $1-\delta$:
$$|S_N| \le \frac{\frac{1}{8}\lambda^2 \sum_{i=1}^{N}(b_i - a_i)^2 + \ln\frac{2}{\delta}}{\lambda},$$
and that the above expression is minimized by $\lambda = \sqrt{8\ln\frac{2}{\delta} \big/ \sum_{i=1}^{N}(b_i - a_i)^2}$, yielding:
$$|S_N| \le \sqrt{\frac{1}{2}\left(\sum_{i=1}^{N}(b_i - a_i)^2\right)\ln\frac{2}{\delta}}. \qquad (15)$$
In the special case where $a_i = a$ for all $i$ and $b_i = b$ for all $i$, this further simplifies to:
$$|S_N| \le (b - a)\sqrt{\frac{1}{2} N \ln\frac{2}{\delta}}.$$
Now we are ready to make the comparison. If the $a_i$-s and $b_i$-s are equal (or almost equal) for all $i$, inequality (14) matches the Hoeffding-Azuma inequality up to a $\ln(N+1)$ factor (which can also be halved by using a tighter bound in (12)). If the $a_i$-s and $b_i$-s are not identical, inequality (14) can potentially be much worse, since a single large $(b_i - a_i)$ term will permanently increase $(b - a)$, whereas its relative contribution to (15) decreases as $N$ grows. However, when the empirical average is close to the lower or upper limit of the domain interval, the kl form of Lemma 7 in equation (13) is much tighter than the relaxed $L_1$ norm form in equation (14) (McAllester, 2003). Therefore, in situations where the analysis can be carried out using the kl form of the bound, it might be preferable.
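Lemma 1 is easy to probe numerically. The following Python sketch (our illustration, under an assumed form of dependence) builds a sequence whose conditional means are all $p$ but whose conditional distributions depend on the history, and compares $\mathbb{E}[f(\bar{X})]$ with $\mathbb{E}[f(\bar{Y})]$ for a convex test function $f$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, N, runs = 0.3, 8, 200_000

def convex_f(x):
    # a convex test function of the whole sequence
    return np.exp(2.0 * x.sum(axis=-1))

# Dependent sequence: each X_i has conditional mean exactly p, but its
# distribution (Bernoulli vs. a Beta with mean p) depends on the history.
X = np.zeros((runs, N))
for i in range(N):
    dep = X[:, :i].sum(axis=1) > i * p               # history-dependent switch
    bern = rng.binomial(1, p, size=runs).astype(float)
    beta = rng.beta(5 * p, 5 * (1 - p), size=runs)   # mean p, lower variance
    X[:, i] = np.where(dep, beta, bern)

Y = rng.binomial(1, p, size=(runs, N)).astype(float)  # independent Bernoulli

print(convex_f(X).mean(), "<=", convex_f(Y).mean())   # Lemma 1's prediction
```

The gap between the two averages reflects the fact that, among $[0,1]$-valued variables with a given mean, Bernoulli variables are extremal for convex functions.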
4 Proof of Theorem 2 (PAC-Bayesian Bound Based on Lemma 1)

Our proof uses the following lemma, which lies at the basis of PAC-Bayesian analysis from its inception and takes its roots back in information theory and statistical physics (Donsker and Varadhan, 1975, Dupuis and Ellis, 1997, Gray, 2011, Banerjee, 2006). The lemma allows us to relate all posterior distributions $\rho$ to a single prior distribution $\mu$.

Lemma 9 For any measurable function $\phi(h)$ on $\mathcal{H}$ and any distributions $\mu(h)$ and $\rho(h)$ on $\mathcal{H}$, we have:
$$\mathbb{E}_{\rho(h)}[\phi(h)] \le KL(\rho\|\mu) + \ln\mathbb{E}_{\mu(h)}\left[e^{\phi(h)}\right]. \qquad (16)$$

Proof of Theorem 2: First, we show that $R(a) = \mathbb{E}_{\mathcal{T}_t}[\hat{R}_t(a)]$. Let $p(r|a)$ be the distribution of the reward for playing arm $a$ and let $R^a$ be a random variable distributed according to $p(r|a)$. Then for any $t$:
$$R(a) = \mathbb{E}_{p(r|a)}[R^a] = \mathbb{E}_{p(r|a)}\left[\pi_t(a)\frac{1}{\pi_t(a)} R^a\right] = \mathbb{E}_{p(r|a)}\mathbb{E}_{\pi_t(a)}\left[\frac{1}{\pi_t(a)} I^a_t R^a\right] = \mathbb{E}_{p(r|a),\pi_t(a)}\left[\frac{1}{\pi_t(a)} I^a_t R_t\right] = \mathbb{E}_{p(r|a),\pi_t(a)}[R^a_t], \qquad (17)$$
where (17) holds since if $I^a_t = 1$ then $R_t$ is distributed according to $p(r|a)$, and otherwise $R_t$ is irrelevant. Hence, we obtain that $\mathbb{E}_{\mathcal{T}_t}[\hat{R}_t(a)] = \mathbb{E}_{\mathcal{T}_t}[\frac{1}{t}\sum_{\tau=1}^{t} R^a_\tau] = R(a)$ for all $a$ and $t$.

Note that $\hat{R}_t(a)$ is an average of $t$ random variables belonging to the $[0, \frac{1}{\pi^{lmin}_t}]$ interval. By scaling $R(a)$ and $\hat{R}_t(a)$ by a factor of $\pi^{lmin}_t$ we scale the random variables to the $[0,1]$ interval, where Lemmas 1 and 6 can be applied.

We apply PAC-Bayesian analysis to the scaled versions of $R(a)$ and $\hat{R}_t(a)$ for a fixed $t$:
$$t \cdot kl\left(\pi^{lmin}_t \hat{R}_t(\rho_t) \,\middle\|\, \pi^{lmin}_t R(\rho_t)\right) = t \cdot kl\left(\mathbb{E}_{\rho_t(a)}[\pi^{lmin}_t \hat{R}_t(a)] \,\middle\|\, \mathbb{E}_{\rho_t(a)}[\pi^{lmin}_t R(a)]\right)$$
$$\le \mathbb{E}_{\rho_t(a)}\left[t \cdot kl\left(\pi^{lmin}_t \hat{R}_t(a) \,\middle\|\, \pi^{lmin}_t R(a)\right)\right] \qquad (18)$$
$$\le KL(\rho_t\|\mu_t) + \ln\mathbb{E}_{\mu_t(a)}\left[e^{t \cdot kl(\pi^{lmin}_t \hat{R}_t(a) \| \pi^{lmin}_t R(a))}\right], \qquad (19)$$
where (18) is due to the convexity of $kl$ and (19) is by Lemma 9. The second term in (19) can be bounded with high probability:
$$\mathbb{E}_{\mu_t(a)}\left[e^{t \cdot kl(\pi^{lmin}_t \hat{R}_t(a) \| \pi^{lmin}_t R(a))}\right] \le \frac{1}{\delta_t}\mathbb{E}_{\mathcal{T}_t}\mathbb{E}_{\mu_t(a)}\left[e^{t \cdot kl(\pi^{lmin}_t \hat{R}_t(a) \| \pi^{lmin}_t R(a))}\right] \qquad (20)$$
$$= \frac{1}{\delta_t}\mathbb{E}_{\mu_t(a)}\mathbb{E}_{\mathcal{T}_t}\left[e^{t \cdot kl(\pi^{lmin}_t \hat{R}_t(a) \| \pi^{lmin}_t R(a))}\right] \qquad (21)$$
$$\le \frac{1}{\delta_t}(t+1), \qquad (22)$$
where (20) holds with probability greater than $1-\delta_t$ by Markov's inequality (Lemma 5), the interchange of expectations in (21) is possible since $\mu_t$ is independent of $\mathcal{T}_t$, and (22) is by Lemma 1 and Lemma 6. Substituting (22) into (19) yields, with probability greater than $1-\delta_t$:
$$kl\left(\pi^{lmin}_t \hat{R}_t(\rho_t) \,\middle\|\, \pi^{lmin}_t R(\rho_t)\right) \le \frac{KL(\rho_t\|\mu_t) + \ln\frac{t+1}{\delta_t}}{t}.$$
Finally, by setting $\delta_t = \frac{\delta}{t(t+1)} \ge \frac{\delta}{(t+1)^2}$ and applying the union bound we obtain (1) for all $t$ simultaneously (it is well known that $\sum_{t=1}^{\infty}\frac{1}{t(t+1)} = \sum_{t=1}^{\infty}\left(\frac{1}{t} - \frac{1}{t+1}\right) = 1$).

The key ingredient that made the proof of Theorem 2 possible was Lemma 1, which enabled us to bound $\mathbb{E}_{\mathcal{T}_t}[e^{t \cdot kl(\pi^{lmin}_t \hat{R}_t(a) \| \pi^{lmin}_t R(a))}]$ even though the variables $\{R^a_1, \dots, R^a_t\}$ are dependent.
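Lemma 9 can be sanity-checked numerically on a finite hypothesis set; the sketch below (our illustration, with an arbitrary assumed $\phi$, prior, and posterior) compares the two sides of (16).

```python
import numpy as np

rng = np.random.default_rng(3)
H = 6                                   # size of a finite hypothesis set
phi = rng.normal(size=H)                # arbitrary measurable function phi(h)
mu = rng.dirichlet(np.ones(H))          # prior
rho = rng.dirichlet(np.ones(H))         # any posterior

lhs = rho @ phi                                                  # E_rho[phi]
rhs = np.sum(rho * np.log(rho / mu)) + np.log(mu @ np.exp(phi))  # KL + ln E_mu[e^phi]
assert lhs <= rhs + 1e-12
print(f"E_rho[phi] = {lhs:.4f} <= {rhs:.4f} = KL(rho||mu) + ln E_mu[e^phi]")
```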
5 Proof of Theorem 3 (PAC-Bayesian Analysis Based on the Hoeffding-Azuma Inequality)

In this section we provide an alternative PAC-Bayesian bound on $|\hat{R}^{w_t}_t(\rho_t) - R(\rho_t)|$ by using the Hoeffding-Azuma inequality.

Proof of Theorem 3: Let
$$M^i_t(a) = \lambda_t \sum_{\tau=1}^{i} w^t_\tau\left(R^a_\tau - R(a)\right).$$
Observe that $M^1_t(a), \dots, M^t_t(a)$ is a martingale (since $\mathbb{E}_{R^a_i}[M^i_t(a)] = M^{i-1}_t(a)$) and $M^t_t(a) = \lambda_t\left(\sum_{\tau=1}^{t} w^t_\tau\right)(\hat{R}^{w_t}_t(a) - R(a))$. Note that $(M^i_t - M^{i-1}_t) \in \left[-\frac{\lambda_t w^t_i}{\pi^{min}_i}, \frac{\lambda_t w^t_i}{\pi^{min}_i}\right]$ and $\mathbb{E}[M^t_t] = 0$. Hence, by the Hoeffding-Azuma inequality (Lemma 8), for all $a$:
$$\mathbb{E}_{\mathcal{T}_t}\left[e^{\lambda_t(\sum_{\tau=1}^{t} w^t_\tau)(\hat{R}^{w_t}_t(a) - R(a))}\right] = \mathbb{E}_{\mathcal{T}_t}\left[e^{M^t_t(a)}\right] \le e^{\frac{1}{2}\lambda_t^2 \sum_{\tau=1}^{t}\left(\frac{w^t_\tau}{\pi^{min}_\tau}\right)^2}.$$
By going back to the proof of Theorem 2, replacing $kl(\pi^{lmin}_t \hat{R}_t(a) \| \pi^{lmin}_t R(a))$ with $\hat{R}^{w_t}_t(a) - R(a)$, and substituting the bound on $\mathbb{E}_{\mathcal{T}_t}[e^{t \cdot kl(\pi^{lmin}_t \hat{R}_t(a) \| \pi^{lmin}_t R(a))}]$ with the bound on $\mathbb{E}_{\mathcal{T}_t}[e^{\lambda_t(\sum_{\tau=1}^{t} w^t_\tau)(\hat{R}^{w_t}_t(a) - R(a))}]$ derived above, we obtain that with probability greater than $1 - \frac{\delta}{2}$, for all $\rho_t$:
$$\hat{R}^{w_t}_t(\rho_t) - R(\rho_t) \le \frac{KL(\rho_t\|\mu_t) + \frac{1}{2}\lambda_t^2\sum_{\tau=1}^{t}\left(\frac{w^t_\tau}{\pi^{min}_\tau}\right)^2 + 2\ln(t+1) + \ln\frac{2}{\delta}}{\lambda_t\sum_{\tau=1}^{t} w^t_\tau}$$
and, by a symmetric argument applied to $-M^1_t(a), \dots, -M^t_t(a)$:
$$R(\rho_t) - \hat{R}^{w_t}_t(\rho_t) \le \frac{KL(\rho_t\|\mu_t) + \frac{1}{2}\lambda_t^2\sum_{\tau=1}^{t}\left(\frac{w^t_\tau}{\pi^{min}_\tau}\right)^2 + 2\ln(t+1) + \ln\frac{2}{\delta}}{\lambda_t\sum_{\tau=1}^{t} w^t_\tau}.$$
Hence, both hold simultaneously with probability greater than $1-\delta$ and yield (4).

6 Proof of Theorem 4 (The Regret Bound)

In this section we derive a regret bound based on Theorem 2. We then discuss some possible ways to tighten the regret bound. The regret bound is derived for the special kind of posterior distribution $\tilde{\rho}^{exp}_t$ defined in (7) in Theorem 4, which is used as the sampling distribution $\pi_{t+1}$ for the next round of the game, as described in the theorem. Furthermore, we define a special kind of prior distribution $\mu^{exp}_t$ as:
$$\mu^{exp}_t(a) = \frac{1}{Z(\mu^{exp}_t)} e^{\gamma_t R(a)}. \qquad (23)$$
The prior $\mu^{exp}_t$ depends on the true expected rewards $R(a)$, but not on the sample, and hence it is a legal prior.

Proof of Theorem 4: Let $a^*$ be the action with the highest expected reward. The expected regret of the prediction strategy $\tilde{\rho}^{exp}_t$ at step $t+1$ can be written as follows:
$$R(a^*) - R(\tilde{\rho}^{exp}_t) = [R(a^*) - \hat{R}_t(a^*)] + [\hat{R}_t(a^*) - \hat{R}_t(\rho^{exp}_t)] + [\hat{R}_t(\rho^{exp}_t) - R(\rho^{exp}_t)] + [R(\rho^{exp}_t) - R(\tilde{\rho}^{exp}_t)]. \qquad (24)$$
We bound the terms in (24) one by one. $R(a^*)$ and $\hat{R}_t(a^*)$ are the expected and the empirical rewards of a prediction strategy that is a delta distribution on $a^*$. Hence, by Theorem 2:
$$R(a^*) - \hat{R}_t(a^*) \le \frac{1}{\varepsilon_t}\sqrt{\frac{-\ln\mu^{exp}_t(a^*) + 3\ln(t+1) - \ln\delta}{2t}} = \frac{1}{\varepsilon_t}\sqrt{\frac{\ln\frac{Z(\mu^{exp}_t)}{e^{\gamma_t R(a^*)}} + 3\ln(t+1) - \ln\delta}{2t}} \le \frac{1}{\varepsilon_t}\sqrt{\frac{\ln K + 3\ln(t+1) - \ln\delta}{2t}}, \qquad (25)$$
where in (25) we used the fact that $R(a^*) \ge R(a)$ for all $a$, and hence $e^{\gamma_t R(a^*)} \ge \frac{1}{K}\sum_a e^{\gamma_t R(a)} = \frac{1}{K} Z(\mu^{exp}_t)$.

For the second term in (24) we write:
$$\hat{R}_t(a^*) - \hat{R}_t(\rho^{exp}_t) = \sum_a (\hat{R}_t(a^*) - \hat{R}_t(a))\rho^{exp}_t(a) = \sum_a (\hat{R}_t(a^*) - \hat{R}_t(a))\frac{e^{\gamma_t\hat{R}_t(a)}}{Z(\rho^{exp}_t)} = \frac{\sum_a (\hat{R}_t(a^*) - \hat{R}_t(a)) e^{-\gamma_t(\hat{R}_t(a^*) - \hat{R}_t(a))}}{\sum_{a'} e^{-\gamma_t(\hat{R}_t(a^*) - \hat{R}_t(a'))}} \le \frac{K}{\gamma_t}, \qquad (26)$$
where (26) follows from the technical lemma below; the proof of the lemma is provided at the end of this section.

Lemma 10 Let $x_1 = 0$ and let $x_2, \dots, x_n$ be $n-1$ arbitrary numbers. For any $\alpha > 0$ and $n \ge 2$:
$$\frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{\sum_{j=1}^{n} e^{-\alpha x_j}} \le \frac{n}{\alpha}.$$
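Lemma 10 is likewise easy to check numerically. The sketch below (our illustration, restricted to nonnegative $x_i$ for the check) verifies the proved $n/\alpha$ bound on random instances and also tracks the conjectured $\ln(n)/\alpha$ refinement mentioned in Section 6.1.

```python
import numpy as np

rng = np.random.default_rng(4)
worst_ratio = 0.0
for _ in range(20_000):
    n, alpha = int(rng.integers(2, 20)), rng.uniform(0.1, 10.0)
    x = np.concatenate(([0.0], rng.uniform(0, 5, size=n - 1)))  # x_1 = 0, rest nonnegative
    lhs = (x * np.exp(-alpha * x)).sum() / np.exp(-alpha * x).sum()
    assert lhs <= n / alpha + 1e-12                 # the proved n/alpha bound
    worst_ratio = max(worst_ratio, lhs * alpha / np.log(n))
print("max of lhs * alpha / ln(n):", worst_ratio)   # stayed <= 1 in our runs (conjecture)
```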
The third term in (24) is bounded by the following lemma, adapted from Lever et al. (2010). The proof of this lemma is also provided at the end of this section.

Lemma 11 For $\mu^{exp}_t$ and $\rho^{exp}_t$ defined by (23) and (8), under the conditions of Theorem 2 the following holds simultaneously with the assertion of Theorem 2:
$$\hat{R}_t(\rho^{exp}_t) - R(\rho^{exp}_t) \le \frac{1}{\varepsilon_t\sqrt{2t}}\left(\frac{\gamma_t}{\varepsilon_t\sqrt{2t}} + \sqrt{3\ln(t+1) - \ln\delta}\right). \qquad (27)$$

Finally, for the last term in (24):
$$R(\rho^{exp}_t) - R(\tilde{\rho}^{exp}_t) = \sum_a (\rho^{exp}_t(a) - \tilde{\rho}^{exp}_t(a)) R(a) \le \frac{1}{2}\left\|\rho^{exp}_t - \tilde{\rho}^{exp}_t\right\|_1 \qquad (28)$$
$$= \frac{1}{2}\sum_a \left|\rho^{exp}_t(a) - (1 - K\varepsilon_{t+1})\rho^{exp}_t(a) - \varepsilon_{t+1}\right| = \frac{1}{2}\sum_a \left|K\varepsilon_{t+1}\rho^{exp}_t(a) - \varepsilon_{t+1}\right| \le \frac{1}{2}K\varepsilon_{t+1}\sum_a \rho^{exp}_t(a) + \frac{1}{2}K\varepsilon_{t+1} = K\varepsilon_{t+1}.$$
In (28) we used the fact that $R(a)$ is bounded by 1 and that $\rho^{exp}_t$ and $\tilde{\rho}^{exp}_t$ are probability distributions.

Gathering all the terms and substituting them back into (24) we obtain:
$$R(a^*) - R(\tilde{\rho}^{exp}_t) \le \frac{1}{\varepsilon_t}\sqrt{\frac{\ln K + 3\ln(t+1) - \ln\delta}{2t}} + \frac{K}{\gamma_t} + \frac{1}{\varepsilon_t\sqrt{2t}}\left(\frac{\gamma_t}{\varepsilon_t\sqrt{2t}} + \sqrt{3\ln(t+1) - \ln\delta}\right) + K\varepsilon_{t+1}.$$
Choosing $\gamma_t = K^{1/4} t^{1/4}$ and $\varepsilon_t = K^{-1/4} t^{-1/4}$ we get:
$$R(a^*) - R(\tilde{\rho}^{exp}_t) \le \frac{K^{3/4}}{(t+1)^{1/4}}\left(\sqrt{\frac{\ln K + 3\ln(t+1) - \ln\delta}{2K}} + 1 + \frac{1}{2} + \sqrt{\frac{3\ln(t+1) - \ln\delta}{2K}} + 1\right).$$
By integration over $t$ the total regret is bounded by $\tilde{O}(K^{3/4} t^{3/4})$, where $\tilde{O}$ hides logarithmic factors.

6.1 Proofs of Technical Lemmas for Section 6

We conclude this section with proofs of the two technical lemmas used in the proof of the regret bound.

Proof of Lemma 10: Since $x_1 = 0$ we have:
$$\frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{\sum_{j=1}^{n} e^{-\alpha x_j}} = \frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{1 + \sum_{j=2}^{n} e^{-\alpha x_j}} \le \sum_{i=1}^{n} x_i e^{-\alpha x_i} \le \frac{n}{\alpha},$$
where the last inequality follows from the fact that $x e^{-\alpha x} \le \frac{1}{\alpha}$. We note that numerical simulations suggest that the tighter bound $\frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{\sum_{j=1}^{n} e^{-\alpha x_j}} \le \frac{\ln n}{\alpha}$ holds, but we were unable to prove it analytically.

The proof of Lemma 11 is adapted with minor modifications from Lever et al. (2010) and is based on the following two lemmas, which are also adapted from Lever et al. (2010) and are proved right after the proof of Lemma 11.

Lemma 12 For $\mu^{exp}_t$ and $\rho^{exp}_t$ defined by (23) and (8):
$$KL(\rho^{exp}_t\|\mu^{exp}_t) \le \gamma_t\left([\hat{R}_t(\rho^{exp}_t) - R(\rho^{exp}_t)] + [R(\mu^{exp}_t) - \hat{R}_t(\mu^{exp}_t)]\right). \qquad (29)$$

Lemma 13 For $\mu^{exp}_t$ and $\rho^{exp}_t$ defined by (23) and (8), under the conditions of Theorem 2 the following holds simultaneously with the assertion of Theorem 2:
$$KL(\rho^{exp}_t\|\mu^{exp}_t) \le \left(\frac{\gamma_t}{\varepsilon_t\sqrt{2t}}\right)^2 + \frac{2\gamma_t}{\varepsilon_t\sqrt{2t}}\sqrt{3\ln(t+1) - \ln\delta}. \qquad (30)$$

Proof of Lemma 11: Substitution of (30) into (3) yields (27).

Proof of Lemma 12:
$$KL(\rho^{exp}_t\|\mu^{exp}_t) = \sum_a \rho^{exp}_t(a)\ln\left(\frac{e^{\gamma_t\hat{R}_t(a)} Z(\mu^{exp}_t)}{e^{\gamma_t R(a)} Z(\rho^{exp}_t)}\right) = \sum_a \rho^{exp}_t(a)\gamma_t(\hat{R}_t(a) - R(a)) - \ln\left(\frac{\sum_a e^{\gamma_t\hat{R}_t(a)}}{Z(\mu^{exp}_t)}\right)$$
$$= \gamma_t[\hat{R}_t(\rho^{exp}_t) - R(\rho^{exp}_t)] - \ln\left(\sum_a \mu^{exp}_t(a) e^{\gamma_t(\hat{R}_t(a) - R(a))}\right) \qquad (31)$$
$$\le \gamma_t[\hat{R}_t(\rho^{exp}_t) - R(\rho^{exp}_t)] + \gamma_t[R(\mu^{exp}_t) - \hat{R}_t(\mu^{exp}_t)]. \qquad (32)$$
In (31) we used the fact that $\frac{1}{Z(\mu^{exp}_t)} = \frac{\mu^{exp}_t(a)}{e^{\gamma_t R(a)}}$ (for any $a$) and in (32) we used the concavity of $\ln$: by Jensen's inequality, $\ln\left(\sum_a \mu^{exp}_t(a) e^{\gamma_t(\hat{R}_t(a) - R(a))}\right) \ge \gamma_t[\hat{R}_t(\mu^{exp}_t) - R(\mu^{exp}_t)]$.

Proof of Lemma 13: By Theorem 2, and simultaneously with it, we have:
$$\hat{R}_t(\rho^{exp}_t) - R(\rho^{exp}_t) \le \frac{1}{\varepsilon_t}\sqrt{\frac{KL(\rho^{exp}_t\|\mu^{exp}_t) + 3\ln(t+1) - \ln\delta}{2t}},$$
$$R(\mu^{exp}_t) - \hat{R}_t(\mu^{exp}_t) \le \frac{1}{\varepsilon_t}\sqrt{\frac{3\ln(t+1) - \ln\delta}{2t}}.$$
By substituting this into (29) we have:
$$KL(\rho^{exp}_t\|\mu^{exp}_t) \le \frac{\gamma_t}{\varepsilon_t\sqrt{2t}}\sqrt{KL(\rho^{exp}_t\|\mu^{exp}_t) + 3\ln(t+1) - \ln\delta} + \frac{\gamma_t}{\varepsilon_t\sqrt{2t}}\sqrt{3\ln(t+1) - \ln\delta}.$$
If $KL(\rho^{exp}_t\|\mu^{exp}_t) \le \frac{\gamma_t}{\varepsilon_t\sqrt{2t}}$ we are done, since then (30) already holds (recall that $2\sqrt{3\ln(t+1) - \ln\delta} \ge 1$). Otherwise, by rearranging the terms and squaring we obtain (writing $KL$ for $KL(\rho^{exp}_t\|\mu^{exp}_t)$):
$$KL^2 - 2\,KL\,\frac{\gamma_t}{\varepsilon_t\sqrt{2t}}\sqrt{3\ln(t+1) - \ln\delta} + \left(\frac{\gamma_t}{\varepsilon_t\sqrt{2t}}\right)^2(3\ln(t+1) - \ln\delta) \le \left(\frac{\gamma_t}{\varepsilon_t\sqrt{2t}}\right)^2 KL + \left(\frac{\gamma_t}{\varepsilon_t\sqrt{2t}}\right)^2(3\ln(t+1) - \ln\delta),$$
which together with the fact that $KL(\rho^{exp}_t\|\mu^{exp}_t) \ge 0$ implies the result.

7 Discussion

We presented a lemma that allows us to bound expectations of convex functions of certain sequentially dependent variables by expectations of the same functions of i.i.d. Bernoulli variables. We showed that this lemma can be used to derive an alternative to the Hoeffding-Azuma inequality for the convergence of martingale values.

We presented two different approaches to PAC-Bayesian analysis of martingale-type sequentially dependent random variables, which had been an important challenge for PAC-Bayesian analysis for a long time. Our contribution opens the possibility of applying PAC-Bayesian analysis in multiple domains where sequentially dependent variables are encountered. For example, Theorems 2 and 3 can be used to bound the convergence of an uncountable number of parallel martingale sequences, where a simple union bound does not apply.

We answered positively the important open question of whether PAC-Bayesian analysis can be applied under limited feedback and used to study the exploration-exploitation trade-off. Although our regret bound for the multiarmed bandit problem is still far from the state of the art, we believe that this gap can be closed in future work.

Multiarmed bandits are just the first tier in a whole hierarchy of reinforcement learning problems of increasing structural complexity, including continuum-armed bandits, contextual bandits, and reinforcement learning in discrete and continuous spaces. In many of these domains Bayesian approaches and the incorporation of prior knowledge have already proved beneficial in practice, but their rigorous analysis remains difficult to carry out. We believe that the PAC-Bayesian approach will prove as useful for this purpose as it has already proved itself in the domain of supervised learning.

Acknowledgements

We thank John Langford for helpful discussions at the early stages of this work and Andreas Maurer for his comments on this manuscript. We are also grateful to the anonymous reviewers for their insightful comments and useful references. This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views.

References

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55-65, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 2002a.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 2002b.

Kazuoki Azuma. Weighted sums of certain dependent random variables. Tôhoku Mathematical Journal, 19(3), 1967.

Arindam Banerjee. On Bayesian bounds. In Proceedings of the International Conference on Machine Learning (ICML), 2006.
Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandit algorithms with supervised learning guarantees. http://arxiv.org/abs/1002.4058, 2010.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.

Monroe D. Donsker and S.R. Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. Communications on Pure and Applied Mathematics, 28, 1975.

Paul Dupuis and Richard S. Ellis. A Weak Convergence Approach to the Theory of Large Deviations. Wiley-Interscience, 1997.

Mahdi Milani Fard and Joelle Pineau. PAC-Bayesian model selection for reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2010.

Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

Robert M. Gray. Entropy and Information Theory. Springer, 2nd edition, 2011.

Matthew Higgs and John Shawe-Taylor. A PAC-Bayes bound for tailored density estimation. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2010.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems (NIPS), 2002.

Guy Lever, François Laviolette, and John Shawe-Taylor. Distribution-dependent PAC-Bayes priors. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2010.

Andreas Maurer. A note on the PAC-Bayesian theorem. www.arxiv.org, 2004.

David McAllester. Some PAC-Bayesian theorems. In Proceedings of the International Conference on Computational Learning Theory (COLT), 1998.

David McAllester. Simplified PAC-Bayesian margin bounds. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2003.

David McAllester. Generalization bounds and consistency for structured labeling. In Gökhan Bakir, Thomas Hofmann, Bernhard Schölkopf, Alexander Smola, Ben Taskar, and S.V.N. Vishwanathan, editors, Predicting Structured Data. The MIT Press, 2007.

Liva Ralaivola, Marie Szafranski, and Guillaume Stempfel. Chromatic PAC-Bayes bounds for non-IID data: Applications to ranking and stationary β-mixing processes. Journal of Machine Learning Research, 2010.

Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 2002.

Matthias Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalization Error Bounds and Sparse Approximations. PhD thesis, University of Edinburgh, 2003.

Yevgeny Seldin and Naftali Tishby. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11, 2010.

John Shawe-Taylor and Robert C. Williamson. A PAC analysis of a Bayesian estimator. In Proceedings of the International Conference on Computational Learning Theory (COLT), 1997.
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

Naftali Tishby and Daniel Polani. Information theory of decisions and actions. In Vassilis Cutsuridis, Amir Hussain, John G. Taylor, and Daniel Polani, editors, Perception-Reason-Action Cycle: Models, Algorithms and Systems. Springer, 2010.