Worst-Case Regret Bounds for Exploration via Randomized Value Functions


Daniel Russo
Columbia University
djr2174@gsb.columbia.edu

November 27, 2024

Abstract

This paper studies a recent proposal to use randomized value functions to drive exploration in reinforcement learning. These randomized value functions are generated by injecting random noise into the training data, making the approach compatible with many popular methods for estimating parameterized value functions. By providing a worst-case regret bound for tabular finite-horizon Markov decision processes, we show that planning with respect to these randomized value functions can induce provably efficient exploration.

1 Introduction

Exploration is one of the central challenges in reinforcement learning (RL). A large theoretical literature treats exploration in simple finite state and action MDPs, showing that it is possible to efficiently learn a near optimal policy through interaction alone [5, 8, 10, 11, 13–16, 24, 25]. Overwhelmingly, this literature focuses on optimistic algorithms, with most algorithms explicitly maintaining uncertainty sets that are likely to contain the true MDP.

It has been difficult to adapt these exploration algorithms to the more complex problems investigated in the applied RL literature. Most applied papers seem to generate exploration through $\epsilon$-greedy or Boltzmann exploration. Those simple methods are compatible with practical value function learning algorithms, which use parametric approximations to value functions to generalize across high dimensional state spaces. Unfortunately, such exploration algorithms can fail catastrophically in simple finite state MDPs [see e.g. 22]. This paper is inspired by the search for principled exploration algorithms that both (1) are compatible with practical function learning algorithms and (2) provide robust performance, at least when specialized to simple benchmarks like tabular MDPs.

Our focus will be on methods that generate exploration by planning with respect to randomized value function estimates. This idea was first proposed in a conference paper by [21] and is investigated more thoroughly in the journal paper [22]. It is inspired by work on posterior sampling for reinforcement learning (a.k.a. Thompson sampling) [19, 26], which could be interpreted as sampling a value function from a posterior distribution and following the optimal policy under that value function for some extended period of time before resampling. A number of papers have subsequently investigated approaches that generate randomized value functions in complex reinforcement learning problems [6, 9, 12, 20, 23, 27, 28]. Our theory will focus on a specific approach of [21, 22], dubbed randomized least squares value iteration (RLSVI), as specialized to tabular MDPs. The name is a play on the classic least-squares policy iteration algorithm (LSPI) of [17]. RLSVI generates a randomized value function (essentially) by judiciously injecting Gaussian noise into the training data and then applying LSPI to this noisy dataset. One could naturally apply the same template while using other value learning algorithms in place of LSPI. This is a strikingly simple algorithm, but providing rigorous theoretical guarantees has proved challenging.
One challenge is that, despite the appealing conceptual connections, there are significant subtleties to any precise link between RLSVI and posterior sampling. The issue is that posterior sampling based approaches are derived from a true Bayesian perspective in which one maintains beliefs over the underlying MDP. The approaches of [6, 9, 12, 22, 23, 27, 28] model only the value function, so Bayes rule is not even well defined.¹

¹ The precise issue is that, even given a prior over value functions, there is no likelihood function. Given an MDP, there is a well specified likelihood of transitioning from one state $s$ to another $s'$, but a value function does not specify a probabilistic data-generating model.

The work of [21, 22] uses stochastic dominance arguments to relate the value function sampling distribution of RLSVI to a correct posterior in a Bayesian model where the true MDP is randomly drawn. This gives substantial insight, but the resulting analysis is not entirely satisfying as a robustness guarantee. It bounds regret on average over MDPs with transition kernels drawn from a particular Dirichlet prior, but one may worry that hard reinforcement learning instances are extremely unlikely under this particular prior.

This paper develops a very different proof strategy and provides a worst-case regret bound for RLSVI applied to tabular finite-horizon MDPs. The crucial proof steps are to show that each randomized value function sampled by RLSVI has a significant probability of being optimistic (see Lemma 4) and then to show that from this property one can reduce the regret analysis to concentration arguments pioneered by [13] (see Lemmas 6, 7). This approach is inspired by frequentist analysis of Thompson sampling for linear bandits [2] and especially the lucid description of [1]. However, applying these ideas in reinforcement learning appears to require novel analysis. The only prior extension of these proof techniques to tabular reinforcement learning was carried out by [3]. Reflecting the difficulty of such analyses, that paper does not provide regret bounds for a pure Thompson sampling algorithm; instead their algorithm samples many times from the posterior to form an optimistic model, as in the BOSS algorithm [4]. Also, unfortunately, there is a significant error in that paper's analysis, and the correction has not yet been posted online, making a careful comparison difficult at this time.

The established regret bounds are not state of the art for tabular finite-horizon MDPs. A final step of the proof applies techniques of [13], introducing an extra $\sqrt{S}$ in the bounds. I hope some smart reader can improve this by intelligently adapting the techniques of [5, 11]. However, the primary goal of the paper is not to give the tightest possible regret bound, but to broaden the set of exploration approaches known to satisfy polynomial worst-case regret bounds. To this author, it is both fascinating and beautiful that carefully adding noise to the training data generates sophisticated exploration, and proving this formally is worthwhile.

2 Problem formulation

We consider the problem of learning to optimize performance through repeated interactions with an unknown finite horizon MDP $M = (H, \mathcal{S}, \mathcal{A}, P, R, s_1)$. The agent interacts with the environment across $K$ episodes. Each episode proceeds over $H$ periods, where for period $h \in \{1, \ldots, H\}$ of episode $k$ the agent is in state $s_h^k \in \mathcal{S} = \{1, \ldots, S\}$, takes action $a_h^k \in \mathcal{A} = \{1, \ldots, A\}$, observes the reward $r_h^k \in [0, 1]$ and, for $h < H$, also observes the next state $s_{h+1}^k \in \mathcal{S}$. Let $\mathcal{H}_{k-1} = \{(s_h^i, a_h^i, r_h^i) : h = 1, \ldots, H,\ i = 1, \ldots, k-1\}$ denote the history of interactions prior to episode $k$.
The Markov transition kernel $P$ encodes the transition probabilities, with
$$P_{h, s_h^k, a_h^k}(s) = \mathbb{P}\left(s_{h+1}^k = s \mid a_h^k, s_h^k, \ldots, a_1^k, s_1^k, \mathcal{H}_{k-1}\right).$$
The reward distribution is encoded in $R$, with
$$R_{h, s_h^k, a_h^k}(dr) = \mathbb{P}\left(r_h^k = dr \mid a_h^k, s_h^k, \ldots, a_1^k, s_1^k, \mathcal{H}_{k-1}\right).$$
We usually instead refer to expected rewards encoded in a vector $R$ that satisfies $R_{h,s,a} = \mathbb{E}[r_h^k \mid s_h^k = s, a_h^k = a]$. We then refer to an MDP $(H, \mathcal{S}, \mathcal{A}, P, R, s_1)$, described in terms of its expected rewards rather than its reward distribution, as this is sufficient to determine the expected value accrued by any policy. The variable $s_1$ denotes a deterministic initial state, and we assume $s_1^k = s_1$ for every episode $k$. At the expense of complicating some formulas, the entire paper could also be written assuming initial states are drawn from some distribution over $\mathcal{S}$, which is more standard in the literature.

A deterministic Markov policy $\pi = (\pi_1, \ldots, \pi_H)$ is a sequence of functions, where each $\pi_h : \mathcal{S} \to \mathcal{A}$ prescribes an action to play in each state. We let $\Pi$ denote the space of all such policies. We use $V_h^\pi \in \mathbb{R}^S$ to denote the value function associated with policy $\pi$ in the sub-episode consisting of periods $\{h, \ldots, H\}$. To simplify many expressions, we set $V_{H+1}^\pi = 0 \in \mathbb{R}^S$. Then the value functions for $h \le H$ are the unique solution to the Bellman equations
$$V_h^\pi(s) = R_{h,s,\pi_h(s)} + \sum_{s' \in \mathcal{S}} P_{h,s,\pi_h(s)}(s')\, V_{h+1}^\pi(s'), \qquad s \in \mathcal{S},\ h = 1, \ldots, H.$$
The optimal value function is $V_h^*(s) = \max_{\pi \in \Pi} V_h^\pi(s)$.
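To make the Bellman recursion concrete, the following is a minimal sketch of policy evaluation by backward induction in a tabular finite-horizon MDP. It is illustrative only: the array conventions (an $H \times S \times A$ reward array, an $H \times S \times A \times S$ transition array, and 0-based periods) are assumptions of this sketch, not notation from the paper.

```python
import numpy as np

def policy_value(R, P, pi):
    """Evaluate V^pi by backward induction on the Bellman equations.

    R  : (H, S, A) array of expected rewards R_{h,s,a}.
    P  : (H, S, A, S) array of transition probabilities P_{h,s,a}(s').
    pi : (H, S) integer array; pi[h, s] is the action played in state s, period h.
    Returns V with shape (H + 1, S), where V[H] = 0 by convention.
    """
    H, S, _ = R.shape
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):          # h = H-1, ..., 0 (paper's h = H, ..., 1)
        for s in range(S):
            a = pi[h, s]
            # V_h(s) = R_{h,s,pi_h(s)} + <P_{h,s,pi_h(s)}, V_{h+1}>
            V[h, s] = R[h, s, a] + P[h, s, a] @ V[h + 1]
    return V
```

Computing the optimal value function replaces the prescribed action with a maximum over actions at each state; this is the backward recursion RLSVI applies to a perturbed MDP in Section 3.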
An episodic reinforcement learning algorithm Alg is a possibly randomized procedure that associates each history with a policy to employ throughout the next episode. Formally, a randomized algorithm can depend on random seeds $\{\xi_k\}_{k \in \mathbb{N}}$ drawn independently of the past from some prespecified distribution. Such an episodic reinforcement learning algorithm selects a policy $\pi_k = \mathrm{Alg}(\mathcal{H}_{k-1}, \xi_k)$ to be employed throughout episode $k$. The cumulative expected regret incurred by Alg over $K$ episodes of interaction with the MDP $M$ is
$$\mathrm{Regret}(M, K, \mathrm{Alg}) = \mathbb{E}_{\mathrm{Alg}}\left[\sum_{k=1}^K V_1^*(s_1^k) - V_1^{\pi_k}(s_1^k)\right],$$
where the expectation is taken over the random seeds used by a randomized algorithm and the randomness in the observed rewards and state transitions that influence the algorithm's chosen policy. This expression captures the algorithm's cumulative expected shortfall in performance relative to an omniscient benchmark, which knows and always employs the true optimal policy.

Of course, regret as formulated above depends on the MDP $M$ to which the algorithm is applied. Our goal is not to minimize regret under a particular MDP but to provide a guarantee that holds uniformly across a class of MDPs. This can be expressed more formally by considering a class $\mathcal{M}$ containing all MDPs with $S$ states, $A$ actions, $H$ periods, and reward distributions bounded in $[0, 1]$. Our goal is to bound the worst-case regret $\sup_{M \in \mathcal{M}} \mathrm{Regret}(M, K, \mathrm{Alg})$ incurred by an algorithm throughout $K$ episodes of interaction with an unknown MDP in this class. We aim for a bound on worst-case regret that scales sublinearly in $K$ and has some reasonable polynomial dependence on the size of the state space, action space, and horizon. We won't explicitly maximize over $\mathcal{M}$ in the analysis. Instead, we fix an arbitrary MDP $M$ and seek to bound regret in a way that does not depend on the particular transition probabilities or reward distributions under $M$. It is worth remarking that, as formulated, our algorithm knows $S$, $A$, and $H$ but does not have knowledge of the number of episodes $K$. Indeed, we study a so-called anytime algorithm that has good performance for all sufficiently long sequences of interaction.

Notation for empirical estimates. We define $n_k(h,s,a) = \sum_{\ell=1}^{k-1} \mathbf{1}\{(s_h^\ell, a_h^\ell) = (s,a)\}$ to be the number of times action $a$ has been sampled in state $s$, period $h$. For every tuple $(h,s,a)$ with $n_k(h,s,a) > 0$, we define the empirical mean reward and empirical transition probabilities at period $h$ by
$$\hat{R}^k_{h,s,a} = \frac{1}{n_k(h,s,a)} \sum_{\ell=1}^{k-1} \mathbf{1}\{(s_h^\ell, a_h^\ell) = (s,a)\}\, r_h^\ell \qquad (1)$$
$$\hat{P}^k_{h,s,a}(s') = \frac{1}{n_k(h,s,a)} \sum_{\ell=1}^{k-1} \mathbf{1}\{(s_h^\ell, a_h^\ell, s_{h+1}^\ell) = (s,a,s')\} \quad \forall s' \in \mathcal{S}. \qquad (2)$$
If $(h,s,a)$ was never sampled before episode $k$, we define $\hat{R}^k_{h,s,a} = 0$ and $\hat{P}^k_{h,s,a} = 0 \in \mathbb{R}^S$.

3 Randomized Least Squares Value Iteration

This section describes an algorithm called Randomized Least Squares Value Iteration (RLSVI). We describe RLSVI as specialized to a simple tabular problem in a way that is most convenient for the subsequent theoretical analysis. A mathematically equivalent definition, which views RLSVI as estimating a value function on randomized training data, extends more gracefully. This interpretation is given at the end of the section and more carefully in [22].

At the start of episode $k$, the agent has observed a history of interactions $\mathcal{H}_{k-1}$. Based on this, it is natural to consider an estimated MDP $\hat{M}_k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k, s_1)$ with empirical estimates of mean rewards and transition probabilities, as defined in Equations (1)–(2) and the surrounding text. We could use backward recursion to solve for the optimal policy and value functions under the empirical MDP, but applying this policy would not generate exploration. RLSVI builds on this idea, but to induce exploration it judiciously adds Gaussian noise before solving for an optimal policy.

We can define RLSVI concisely as follows. In episode $k$ it samples a random vector $w_k \in \mathbb{R}^{HSA}$ with independent components, where $w_k(h,s,a) \sim N(0, \sigma_k^2(h,s,a))$. We define $\sigma_k(h,s,a) = \sqrt{\frac{\beta_k}{n_k(h,s,a)+1}}$, where $\beta_k$ is a tuning parameter and the denominator shrinks like the standard deviation of the average of $n_k(h,s,a)$ i.i.d. samples. Given $w_k$, we construct a randomized perturbation $\overline{M}_k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k + w_k, s_1)$ of the empirical MDP by adding the Gaussian noise to the estimated rewards. RLSVI solves for the optimal policy $\pi_k$ under this MDP and applies it throughout the episode. This policy is, of course, greedy with respect to the (randomized) value functions under $\overline{M}_k$.
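To fix ideas, here is a minimal sketch of one episode of tabular RLSVI as just described: sample the reward noise $w_k$, perturb the empirical MDP, and solve the perturbed MDP by backward induction. The array conventions follow the policy_value sketch above and are assumptions of this sketch; unvisited tuples are assumed to carry $\hat{R} = 0$ and $\hat{P} = 0$, as in Equations (1)–(2).

```python
import numpy as np

def rlsvi_episode_policy(R_hat, P_hat, n, beta_k, rng):
    """Plan in a noise-perturbed empirical MDP (one episode of tabular RLSVI).

    R_hat  : (H, S, A) empirical mean rewards (0 where n == 0).
    P_hat  : (H, S, A, S) empirical transition probabilities (0 where n == 0).
    n      : (H, S, A) visit counts n_k(h, s, a).
    beta_k : tuning parameter; the analysis takes beta_k = O~(S H^3).
    Returns the greedy policy pi, shape (H, S), for the perturbed MDP.
    """
    H, S, _ = R_hat.shape
    sigma = np.sqrt(beta_k / (n + 1.0))      # sigma_k(h, s, a)
    w = rng.normal(0.0, sigma)               # w_k(h,s,a) ~ N(0, sigma_k^2), independent
    R_pert = R_hat + w                       # perturbed rewards define the MDP

    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):             # backward induction
        Q = R_pert[h] + P_hat[h] @ V[h + 1]  # (S, A) state-action values
        pi[h] = Q.argmax(axis=1)             # greedy policy under the perturbed MDP
        V[h] = Q.max(axis=1)
    return pi
```

A full simulation would replay the returned policy for one episode, update the counts and empirical estimates, and repeat with a fresh noise draw each episode.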
The random noise $w_k$ in RLSVI should be large enough to dominate the error introduced by performing a noisy Bellman update using $\hat{P}^k$ and $\hat{R}^k$. We set $\beta_k = \tilde{O}(SH^3)$ in the analysis, where the functions of $H$ offer a coarse bound on quantities like the variance of an empirically estimated Bellman update. For $\beta = \{\beta_k\}_{k \in \mathbb{N}}$, we denote this algorithm by $\mathrm{RLSVI}_\beta$.

RLSVI as regression on perturbed data. To extend beyond simple tabular problems, it is fruitful to view RLSVI, as in Algorithm 1, as an algorithm that performs recursive least squares estimation of the state-action value function. Randomization is injected into these value function estimates by perturbing observed rewards and by regularizing toward a randomized prior sample. This prior sample is essential, as otherwise there would be no randomness in the estimated value function in initial periods. This procedure is the LSPI algorithm of [17] applied with noisy data and a tabular representation. The paper [22] includes many experiments with non-tabular representations.

Algorithm 1: RLSVI for tabular, finite-horizon MDPs

input: $H$, $\mathcal{S}$, $\mathcal{A}$, tuning parameters $\{\beta_k\}_{k \in \mathbb{N}}$
for episodes $k = 1, 2, \ldots$ do
    // Define the squared temporal difference error
    $L(Q \mid Q_{\mathrm{next}}, \mathcal{D}) = \sum_{(s,a,r,s') \in \mathcal{D}} \left(Q(s,a) - r - \max_{a' \in \mathcal{A}} Q_{\mathrm{next}}(s',a')\right)^2$
    // Past data
    $\mathcal{D}_h = \{(s_h^\ell, a_h^\ell, r_h^\ell, s_{h+1}^\ell) : \ell < k\}$ for $h < H$;  $\mathcal{D}_H = \{(s_H^\ell, a_H^\ell, r_H^\ell, \varnothing) : \ell < k\}$
    // Randomly perturb data
    for time periods $h = 1, \ldots, H$ do
        sample array $\tilde{Q}_h \sim N(0, \beta_k I)$   // draw prior sample
        $\tilde{\mathcal{D}}_h \leftarrow \{\}$
        for $(s, a, r, s') \in \mathcal{D}_h$ do
            sample $w \sim N(0, \beta_k)$
            $\tilde{\mathcal{D}}_h \leftarrow \tilde{\mathcal{D}}_h \cup \{(s, a, r + w, s')\}$
        end
    end
    // Estimate Q on noisy data
    define the terminal value $\hat{Q}_{H+1}(s,a) \leftarrow 0$ for all $(s,a)$
    for time periods $h = H, \ldots, 1$ do
        $\hat{Q}_h \leftarrow \arg\min_{Q \in \mathbb{R}^{SA}} L(Q \mid \hat{Q}_{h+1}, \tilde{\mathcal{D}}_h) + \|Q - \tilde{Q}_h\|_2^2$
    end
    apply the greedy policy with respect to $(\hat{Q}_1, \ldots, \hat{Q}_H)$ throughout the episode
    observe data $s_1^k, a_1^k, r_1^k, \ldots, s_H^k, a_H^k, r_H^k$
end

To understand this presentation of RLSVI, it is helpful to understand an equivalence between posterior sampling in a Bayesian linear model and fitting a regularized least squares estimate to randomly perturbed data. We refer to [22] for a full discussion of this equivalence and review the scalar case here. Consider Bayesian updating of a scalar parameter $\theta \sim N(0, \beta)$ based on noisy observations $Y = (y_1, \ldots, y_n)$, where $y_i \mid \theta \sim N(\theta, \beta)$. The posterior distribution has the closed form
$$\theta \mid Y \sim N\left(\frac{1}{n+1}\sum_{i=1}^n y_i,\ \frac{\beta}{n+1}\right).$$
We could generate a sample from this distribution by fitting a least squares estimate to noisy data. Sample $W = (w_1, \ldots, w_n)$, where each $w_i \sim N(0, \beta)$ is drawn independently, and sample $\tilde\theta \sim N(0, \beta)$. Then
$$\hat\theta = \operatorname*{arg\,min}_{\theta \in \mathbb{R}} \sum_{i=1}^n \left(\theta - (y_i + w_i)\right)^2 + (\theta - \tilde\theta)^2 = \frac{1}{n+1}\left(\sum_{i=1}^n (y_i + w_i) + \tilde\theta\right) \qquad (3)$$
satisfies $\hat\theta \mid Y \sim N\left(\frac{1}{n+1}\sum_{i=1}^n y_i,\ \frac{\beta}{n+1}\right)$. For more complex models, where exact posterior sampling is impossible, we may still hope that estimation on randomly perturbed data generates samples that reflect uncertainty in a sensible way.
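As a quick sanity check on Equation (3), the short simulation below (a sketch written for this discussion, not code from the paper) compares the sampling distribution of the perturbed least-squares estimate $\hat\theta$ against the exact posterior; the reported means and variances should agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, n, trials = 2.0, 10, 200_000

theta = rng.normal(0.0, np.sqrt(beta))            # scalar parameter, theta ~ N(0, beta)
y = theta + rng.normal(0.0, np.sqrt(beta), n)     # observations, y_i | theta ~ N(theta, beta)

# Exact posterior given Y: N(sum(y) / (n + 1), beta / (n + 1)).
post_mean, post_var = y.sum() / (n + 1), beta / (n + 1)

# Perturbed least squares, Equation (3):
# theta_hat = (sum_i (y_i + w_i) + theta_tilde) / (n + 1).
w = rng.normal(0.0, np.sqrt(beta), (trials, n))       # reward perturbations
theta_tilde = rng.normal(0.0, np.sqrt(beta), trials)  # prior samples
theta_hat = ((y + w).sum(axis=1) + theta_tilde) / (n + 1)

print(f"posterior  mean={post_mean:.4f}  var={post_var:.4f}")
print(f"perturbed  mean={theta_hat.mean():.4f}  var={theta_hat.var():.4f}")
```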
As far as RLSVI is concerned, roughly the same calculation shows that in Algorithm 1, $\hat{Q}_h(s,a)$ is equal to an empirical Bellman update plus Gaussian noise:
$$\hat{Q}_h(s,a) \mid \hat{Q}_{h+1} \sim N\left(\hat{R}^k_{h,s,a} + \sum_{s' \in \mathcal{S}} \hat{P}^k_{h,s,a}(s') \max_{a' \in \mathcal{A}} \hat{Q}_{h+1}(s',a'),\ \frac{\beta_k}{n_k(h,s,a)+1}\right).$$

4 Main result

Theorem 1 establishes that RLSVI satisfies a worst-case polynomial regret bound for tabular finite-horizon MDPs. It is worth contrasting RLSVI with $\epsilon$-greedy and Boltzmann exploration, which are both widely used randomized approaches to exploration. Those simple methods explore by directly injecting randomness into the action chosen at each timestep. Unfortunately, they can fail catastrophically even on simple examples with a finite state space, requiring a time to learn that scales exponentially in the size of the state space. Instead, RLSVI generates randomization by training value functions with randomly perturbed rewards. Theorem 1 confirms that this approach generates a sophisticated form of exploration fundamentally different from $\epsilon$-greedy and Boltzmann exploration. The notation $\tilde{O}$ ignores poly-logarithmic factors in $H$, $S$, $A$, and $K$.

Theorem 1. Let $\mathcal{M}$ denote the set of MDPs with horizon $H$, $S$ states, $A$ actions, and rewards bounded in $[0,1]$. Then for a tuning parameter sequence $\beta = \{\beta_k\}_{k \in \mathbb{N}}$ with $\beta_k = \frac{1}{2} S H^3 \log(2HSAk)$,
$$\sup_{M \in \mathcal{M}} \mathrm{Regret}(M, K, \mathrm{RLSVI}_\beta) \le \tilde{O}\left(H^3 S^{3/2} \sqrt{AK}\right).$$

This bound is not state of the art, and tightening it is not the main goal of this paper. I conjecture that the extra factor of $S$ can be removed from this bound through a careful analysis, making the dependence on $S$, $A$, and $K$ optimal. This conjecture is supported by numerical experiments and (informally) by a Bayesian regret analysis [22]. One extra $\sqrt{S}$ appears to come from a step at the very end of the proof in Lemma 7, where we bound a certain $L_1$ norm as in the analysis style of [13]. For optimistic algorithms, some recent work has avoided directly bounding that $L_1$ norm, yielding a tighter regret guarantee [5, 11]. Another factor of $\sqrt{S}$ stems from the choice of $\beta_k$, which is used in the proof of Lemma 5. This seems similar to the extra $\sqrt{d}$ factor that appears in worst-case regret upper bounds for Thompson sampling in $d$-dimensional linear bandit problems [1].

Remark 1. Some translation is required to relate the dependence on $H$ to other literature. Many results are given in terms of the number of periods $T = KH$, which masks a factor of $H$. Also, unlike e.g. [5], this paper treats time-inhomogeneous transition kernels, so in some sense agents must learn about $H$ times as many state/action pairs. Roughly speaking then, our result exactly corresponds to what one would get by applying the UCRL2 analysis [13] to a time-inhomogeneous finite-horizon problem.

5 Proof of Theorem 1

The proof follows from several lemmas. Some are (possibly complex) technical adaptations of ideas present in many regret analyses. Lemmas 4 and 6 are the main discoveries that prompted this paper. Throughout we use the following notation: for any MDP $\tilde{M} = (H, \mathcal{S}, \mathcal{A}, \tilde{P}, \tilde{R}, s_1)$, let $V(\tilde{M}, \pi) \in \mathbb{R}$ denote the value of policy $\pi$ from the initial state $s_1$. In this notation, for the true MDP $M$ we have $V(M, \pi) = V_1^\pi(s_1)$.

A concentration inequality.
Through a careful application of Hoeffding's inequality, one can give a high probability bound on the error in applying a Bellman update to the (non-random) optimal value function $V^*_{h+1}$. Through this, and a union bound, Lemma 2 bounds the expected number of times the empirically estimated MDP falls outside the confidence set
$$\mathcal{M}_k = \left\{(H, \mathcal{S}, \mathcal{A}, P', R', s_1) : \left|(R'_{h,s,a} - R_{h,s,a}) + \langle P'_{h,s,a} - P_{h,s,a},\ V^*_{h+1}\rangle\right| \le \sqrt{e_k(h,s,a)}\ \ \forall (h,s,a)\right\},$$
where we define
$$\sqrt{e_k(h,s,a)} = H\sqrt{\frac{\log(2HSAk)}{2\left(n_k(h,s,a)+1\right)}}.$$
This set is only a tool in the analysis and cannot be used by the agent, since $V^*_{h+1}$ is unknown.

Lemma 2 (Validity of confidence sets). $\sum_{k=1}^\infty \mathbb{P}\left(\hat{M}_k \notin \mathcal{M}_k\right) \le \frac{\pi^2}{6}$.

From value function error to on-policy Bellman error. For some fixed policy $\pi$, the next simple lemma expresses the gap between the value functions under two MDPs in terms of the differences between their Bellman operators. Results like this are critical to many analyses in the RL literature. Notice the asymmetric roles of $\tilde{M}$ and $M$: the value functions correspond to one MDP while the state trajectory is sampled in the other. We'll apply the lemma twice: once where $\tilde{M}$ is the true MDP and $M$ is the estimated one used by RLSVI, and once where the roles are reversed.

Lemma 3. Consider any policy $\pi$ and two MDPs $\tilde{M} = (H, \mathcal{S}, \mathcal{A}, \tilde{P}, \tilde{R}, s_1)$ and $M = (H, \mathcal{S}, \mathcal{A}, P, R, s_1)$. Let $\tilde{V}^\pi_h$ and $V^\pi_h$ denote the respective value functions of $\pi$ under $\tilde{M}$ and $M$. Then
$$V^\pi_1(s_1) - \tilde{V}^\pi_1(s_1) = \mathbb{E}_{\pi, M}\left[\sum_{h=1}^H \left(R_{h,s_h,\pi_h(s_h)} - \tilde{R}_{h,s_h,\pi_h(s_h)}\right) + \left\langle P_{h,s_h,\pi_h(s_h)} - \tilde{P}_{h,s_h,\pi_h(s_h)},\ \tilde{V}^\pi_{h+1}\right\rangle\right],$$
where $\tilde{V}^\pi_{H+1} \equiv 0 \in \mathbb{R}^S$ and the expectation is over the sampled state trajectory $s_1, \ldots, s_H$ drawn from following $\pi$ in the MDP $M$.

Proof. We have
$$\begin{aligned}
V^\pi_1(s_1) - \tilde{V}^\pi_1(s_1) &= R_{1,s_1,\pi_1(s_1)} + \langle P_{1,s_1,\pi_1(s_1)},\ V^\pi_2\rangle - \tilde{R}_{1,s_1,\pi_1(s_1)} - \langle \tilde{P}_{1,s_1,\pi_1(s_1)},\ \tilde{V}^\pi_2\rangle \\
&= R_{1,s_1,\pi_1(s_1)} - \tilde{R}_{1,s_1,\pi_1(s_1)} + \langle P_{1,s_1,\pi_1(s_1)} - \tilde{P}_{1,s_1,\pi_1(s_1)},\ \tilde{V}^\pi_2\rangle + \langle P_{1,s_1,\pi_1(s_1)},\ V^\pi_2 - \tilde{V}^\pi_2\rangle \\
&= R_{1,s_1,\pi_1(s_1)} - \tilde{R}_{1,s_1,\pi_1(s_1)} + \langle P_{1,s_1,\pi_1(s_1)} - \tilde{P}_{1,s_1,\pi_1(s_1)},\ \tilde{V}^\pi_2\rangle + \mathbb{E}_{\pi, M}\left[V^\pi_2(s_2) - \tilde{V}^\pi_2(s_2)\right].
\end{aligned}$$
Expanding this recursion gives the result.

Sufficient optimism through randomization. There is always the risk that, based on noisy observations, an RL algorithm incorrectly forms a low estimate of the value function at some state. This may lead the algorithm to purposefully avoid that state, therefore failing to gather the data needed to correct its faulty estimate. To avoid such scenarios, nearly all provably efficient RL exploration algorithms purposefully build optimistic estimates. RLSVI does not do this, and instead generates a randomized value function. The following lemma is key to our analysis. It shows that, except in the rare event when it has grossly mis-estimated the underlying MDP, RLSVI has at least a constant chance of sampling an optimistic value function. Similar results can be proved for Thompson sampling with linear models [1]. Recall that $M$ is the unknown true MDP with optimal policy $\pi^*$, and $\overline{M}_k$ is RLSVI's noise-perturbed MDP, under which $\pi_k$ is an optimal policy.

Lemma 4. Let $\pi^*$ be an optimal policy for the true MDP $M$. If $\hat{M}_k \in \mathcal{M}_k$, then
$$\mathbb{P}\left(V(\overline{M}_k, \pi_k) \ge V(M, \pi^*) \mid \mathcal{H}_{k-1}\right) \ge \Phi(-1).$$
This result is more easily established through the following lemma, which avoids the need to carefully condition on the history $\mathcal{H}_{k-1}$ at each step. We conclude with the proof of Lemma 4 afterward.

Lemma 5. Fix any policy $\pi = (\pi_1, \ldots, \pi_H)$ and a vector $e \in \mathbb{R}^{HSA}$ with $e(h,s,a) \ge 0$. Consider the MDP $M = (H, \mathcal{S}, \mathcal{A}, P, R, s_1)$ and alternatives $\bar{R}$ and $\bar{P}$ obeying the inequality
$$-\sqrt{e(h,s,a)} \le \bar{R}_{h,s,a} - R_{h,s,a} + \langle \bar{P}_{h,s,a} - P_{h,s,a},\ V^\pi_{h+1}\rangle \le \sqrt{e(h,s,a)}$$
for every $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $h \in \{1, \ldots, H\}$. Take $W \in \mathbb{R}^{HSA}$ to be a random vector with independent components, where $W(h,s,a) \sim N(0, HS\, e(h,s,a))$. Let $\bar{V}^{\pi,W}_1$ denote the (random) value function of the policy $\pi$ under the MDP $\bar{M} = (H, \mathcal{S}, \mathcal{A}, \bar{P}, \bar{R} + W, s_1)$. Then
$$\mathbb{P}\left(\bar{V}^{\pi,W}_1(s_1) \ge V^\pi_1(s_1)\right) \ge \Phi(-1).$$

Proof. To start, we consider an arbitrary deterministic vector $w \in \mathbb{R}^{HSA}$ (thought of as a possible realization of $W$) and evaluate the gap in value functions $\bar{V}^{\pi,w}_1(s_1) - V^\pi_1(s_1)$. We can rewrite this quantity by applying Lemma 3. Let $s = (s_1, \ldots, s_H)$ denote a random sequence of states drawn by simulating the policy $\pi$ in the MDP $\bar{M}$ from the deterministic initial state $s_1$, and set $a_h = \pi_h(s_h)$ for $h = 1, \ldots, H$. Then
$$\begin{aligned}
\bar{V}^{\pi,w}_1(s_1) - V^\pi_1(s_1) &= \mathbb{E}\left[\sum_{h=1}^H w(h, s_h, \pi_h(s_h)) + \bar{R}_{h,s_h,\pi_h(s_h)} - R_{h,s_h,\pi_h(s_h)} + \left\langle \bar{P}_{h,s_h,\pi_h(s_h)} - P_{h,s_h,\pi_h(s_h)},\ V^\pi_{h+1}\right\rangle\right] \\
&\ge H\,\mathbb{E}\left[\frac{1}{H}\sum_{h=1}^H \left(w(h, s_h, \pi_h(s_h)) - \sqrt{e(h, s_h, \pi_h(s_h))}\right)\right],
\end{aligned}$$
where the expectation is taken over the sequence of states $s = (s_1, \ldots, s_H)$. Define $d(h,s) = \frac{1}{H}\,\mathbb{P}(s_h = s)$ for every $h \le H$ and $s \in \mathcal{S}$. Then the above inequality can be written as
$$\frac{1}{H}\left(\bar{V}^{\pi,w}_1(s_1) - V^\pi_1(s_1)\right) \ge \sum_{s \in \mathcal{S}, h \le H} d(h,s)\left(w(h, s, \pi_h(s)) - \sqrt{e(h, s, \pi_h(s))}\right) \ge \left(\sum_{s \in \mathcal{S}, h \le H} d(h,s)\, w(h, s, \pi_h(s))\right) - \sqrt{HS}\sqrt{\sum_{s \in \mathcal{S}, h \le H} d(h,s)^2\, e(h, s, \pi_h(s))} =: X(w),$$
where the second inequality applies Cauchy-Schwarz. Now, since $d(h,s)\, W(h, s, \pi_h(s)) \sim N\left(0,\ d(h,s)^2\, HS\, e(h, s, \pi_h(s))\right)$, we have
$$X(W) \sim N\left(-\sqrt{HS\sum_{s \in \mathcal{S}, h \le H} d(h,s)^2\, e(h, s, \pi_h(s))},\ \ HS\sum_{s \in \mathcal{S}, h \le H} d(h,s)^2\, e(h, s, \pi_h(s))\right).$$
By standardization, $\mathbb{P}(X(W) \ge 0) = \Phi(-1)$. Therefore $\mathbb{P}\left(\bar{V}^{\pi,W}_1(s_1) - V^\pi_1(s_1) \ge 0\right) \ge \Phi(-1)$.

Proof of Lemma 4. Consider some history $\mathcal{H}_{k-1}$ with $\hat{M}_k \in \mathcal{M}_k$. Recall that $\pi_k$ is the policy chosen by RLSVI, which is optimal under the MDP $\overline{M}_k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k + w_k, s_1)$. Since $\sigma_k^2(h,s,a) = HS\, e_k(h,s,a)$, applying Lemma 5 conditioned on $\mathcal{H}_{k-1}$ (with $\pi = \pi^*$) shows that with probability at least $\Phi(-1)$, $V(\overline{M}_k, \pi^*) \ge V(M, \pi^*)$. When this occurs, we always have $V(\overline{M}_k, \pi_k) \ge V(M, \pi^*)$, since by definition $\pi_k$ is optimal under $\overline{M}_k$.

Reduction to bounding online prediction error. The next lemma shows that the cumulative expected regret of RLSVI is bounded in terms of the total prediction error in estimating the value function of $\pi_k$. The critical feature of the result is that it only depends on the algorithm being able to estimate the performance of the policies it actually employs, and therefore gathers data about. From here, the regret analysis follows from concentration arguments alone.
For the purposes of analysis, we let $\tilde{M}_k$ denote an imagined second sample drawn from the same distribution as the perturbed MDP $\overline{M}_k$ under RLSVI. More formally, let $\tilde{M}_k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k + \tilde{w}_k, s_1)$, where $\tilde{w}_k(h,s,a) \mid \mathcal{H}_{k-1} \sim N(0, \sigma_k^2(h,s,a))$ is independent Gaussian noise. Conditioned on the history, $\tilde{M}_k$ has the same marginal distribution as $\overline{M}_k$, but it is statistically independent of the policy $\pi_k$ selected by RLSVI.

Lemma 6. For the absolute constant $c = \Phi(-1)^{-1} < 6.31$, we have
$$\mathrm{Regret}(M, K, \mathrm{RLSVI}_\beta) \le (c+1)\,\mathbb{E}\left[\sum_{k=1}^K \left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + c\,\mathbb{E}\left[\sum_{k=1}^K \left|V(\tilde{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + H\underbrace{\sum_{k=1}^K \mathbb{P}\left(\hat{M}_k \notin \mathcal{M}_k\right)}_{\le \pi^2/6}.$$

Online prediction error bounds. We complete the proof with concentration arguments. Set $\epsilon^k_R(h,s,a) = \hat{R}^k_{h,s,a} - R_{h,s,a} \in \mathbb{R}$ and $\epsilon^k_P(h,s,a) = \hat{P}^k_{h,s,a} - P_{h,s,a} \in \mathbb{R}^S$ to be the errors in estimating the mean reward and the transition vector corresponding to $(h,s,a)$. The next result follows by bounding each term in Lemma 6. This is done by using Lemma 3 to expand the terms $V(\overline{M}_k, \pi_k) - V(M, \pi_k)$ and $V(M, \pi_k) - V(\tilde{M}_k, \pi_k)$. We focus our analysis on bounding $\mathbb{E}\left[\sum_{k=1}^K |V(\overline{M}_k, \pi_k) - V(M, \pi_k)|\right]$; the other term can be bounded in an identical manner,² so we omit this analysis.

² In particular, an analogue of Lemma 7 holds where we replace $\overline{M}_k$ with $\tilde{M}_k$, $\overline{V}^k_{h+1}$ with the value function $\tilde{V}^k_{h+1}$ corresponding to policy $\pi_k$ in the MDP $\tilde{M}_k$, and the Gaussian noise $w_k$ with the fictitious noise terms $\tilde{w}_k$.

Lemma 7. Let $c = \Phi(-1)^{-1} < 6.31$. Then for any $K \in \mathbb{N}$,
$$\mathbb{E}\left[\sum_{k=1}^K \left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] \le \sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2}\ \sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\overline{V}^k_{h+1}\right\|_\infty^2} + \mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|\epsilon^k_R(h, s^k_h, a^k_h)\right|\right] + \mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|w_k(h, s^k_h, a^k_h)\right|\right],$$
where $\overline{V}^k_{h+1}$ denotes the value function of $\pi_k$ in the perturbed MDP $\overline{M}_k$.

The remaining lemmas complete the proof. At each stage, RLSVI adds Gaussian noise with standard deviation no larger than $\tilde{O}(H^{3/2}\sqrt{S})$. Ignoring extremely low probability events, we therefore expect $\|\overline{V}^k_{h+1}\|_\infty \le \tilde{O}(H^{5/2}\sqrt{S})$ and hence $\sum_{h=1}^{H-1}\|\overline{V}^k_{h+1}\|_\infty^2 \le \tilde{O}(H^6 S)$. The proof of Lemma 8 makes this precise by applying appropriate maximal inequalities.

Lemma 8. $\displaystyle\sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\overline{V}^k_{h+1}\right\|_\infty^2} = \tilde{O}\left(H^3\sqrt{SK}\right)$.

The next few lemmas are essentially a consequence of analysis in [13] and many subsequent papers. We give proof sketches in the appendix. The main idea is to apply known concentration inequalities to bound $\|\epsilon^k_P(h,s,a)\|_1^2$, $|\epsilon^k_R(h, s^k_h, a^k_h)|$, or $|w_k(h, s^k_h, a^k_h)|$ in terms of either $1/n_k(h, s^k_h, a^k_h)$ or $1/\sqrt{n_k(h, s^k_h, a^k_h)}$. The pigeonhole principle gives $\sum_{k=1}^K\sum_{h=1}^{H-1} 1/n_k(h, s^k_h, a^k_h) = O(HSA\log(K))$ and $\sum_{k=1}^K\sum_{h=1}^{H-1} 1/\sqrt{n_k(h, s^k_h, a^k_h)} = O(\sqrt{HSAK})$.

Lemma 9. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2\right] = \tilde{O}\left(S^2 AH\right)$.

Lemma 10. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|\epsilon^k_R(h, s^k_h, a^k_h)\right|\right] = \tilde{O}\left(\sqrt{SAKH}\right)$.

Lemma 11. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|w_k(h, s^k_h, a^k_h)\right|\right] = \tilde{O}\left(H^{3/2} S\sqrt{AKH}\right)$.

Acknowledgments. Much of my understanding of randomized value functions comes from a collaboration with Ian Osband, Ben Van Roy, and Zheng Wen.
Mark Sellke and Chao Qin each noticed the same error in the proof of Lemma 6 in the initial draft of this paper. The lemma has now been revised. I am extremely grateful for their careful reading of the paper.

References

[1] Marc Abeille, Alessandro Lazaric, et al. Linear Thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.

[2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

[3] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.

[4] John Asmuth, Lihong Li, Michael L. Littman, Ali Nouri, and David Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 19–26. AUAI Press, 2009.

[5] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 263–272. JMLR.org, 2017.

[6] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.

[7] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[8] Ronen I. Brafman and Moshe Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

[9] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. 2019.

[10] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.

[11] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.

[12] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. 2018.

[13] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[14] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

[15] Sham Machandranath Kakade et al. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, London, England, 2003.

[16] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

[17] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.
[18] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Preprint, 2018.

[19] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

[20] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

[21] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pages 2377–2386, 2016.

[22] Ian Osband, Benjamin Van Roy, Daniel Russo, and Zheng Wen. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.

[23] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.

[24] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.

[25] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.

[26] Malcolm Strens. A Bayesian framework for reinforcement learning. In ICML, volume 2000, pages 943–950, 2000.

[27] Ahmed Touati, Harsh Satija, Joshua Romoff, Joelle Pineau, and Pascal Vincent. Randomized value functions via multiplicative normalizing flows. arXiv preprint arXiv:1806.02315, 2018.

[28] Nikolaos Tziortziotis, Christos Dimitrakakis, and Michalis Vazirgiannis. Randomised Bayesian least-squares policy iteration. arXiv preprint arXiv:1904.03535, 2019.

[29] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.

A Omitted Proofs

A.1 Proof of Lemma 2

Lemma 2 (Validity of confidence sets). $\sum_{k=1}^\infty \mathbb{P}\left(\hat{M}_k \notin \mathcal{M}_k\right) \le \frac{\pi^2}{6}$.

Proof. The following construction is the standard way concentration inequalities are applied in bandit models and tabular reinforcement learning; see the discussion of what Lattimore and Szepesvári [18] call a "stack of rewards" model in Subsection 4.6. For every tuple $z = (h,s,a)$, generate two i.i.d. sequences of random variables $r_{z,n} \sim R_{h,s,a}$ and $s_{z,n} \sim P_{h,s,a}(\cdot)$. Here $r_{(h,s,a),n}$ denotes the reward and $s_{(h,s,a),n}$ the state transition generated the $n$th time action $a$ is played in state $s$ at period $h$. Set
$$Y_{z,n} = r_{z,n} + V^*_{h+1}(s_{z,n}), \qquad n \in \mathbb{N}.$$
These are i.i.d., with $Y_{z,n} \in [0, H]$ since $\|V^*_{h+1}\|_\infty \le H - 1$, and satisfy $\mathbb{E}[Y_{z,n}] = R_{h,s,a} + \langle P_{h,s,a}, V^*_{h+1}\rangle$. By Hoeffding's inequality, for any $\delta_n \in (0,1)$,
$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n Y_{(h,s,a),i} - R_{h,s,a} - \langle P_{h,s,a}, V^*_{h+1}\rangle\right| \ge H\sqrt{\frac{\log(2/\delta_n)}{2n}}\right) \le \delta_n.$$
For $\delta_n = \frac{1}{HSAn^2}$, a union bound over the $HSA$ values of $z = (h,s,a)$ and all possible $n$ gives
$$\mathbb{P}\left(\bigcup_{h,s,a,n}\left\{\left|\frac{1}{n}\sum_{i=1}^n Y_{(h,s,a),i} - R_{h,s,a} - \langle P_{h,s,a}, V^*_{h+1}\rangle\right| \ge H\sqrt{\frac{\log(2/\delta_n)}{2n}}\right\}\right) \le \sum_{n=1}^\infty HSA\,\delta_n = \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}.$$
Now, by definition, if $n_k(h,s,a) = n > 0$, we have
$$\hat{R}^k_{h,s,a} + \langle \hat{P}^k_{h,s,a},\ V^*_{h+1}\rangle = \frac{1}{n}\sum_{i=1}^n Y_{(h,s,a),i}.$$
Therefore, the above shows that
$$\mathbb{P}\left(\exists\, (k,h,s,a) : n_k(h,s,a) > 0,\ \left|\hat{R}^k_{h,s,a} - R_{h,s,a} + \langle \hat{P}^k_{h,s,a} - P_{h,s,a},\ V^*_{h+1}\rangle\right| \ge H\sqrt{\frac{\log\left(2HSA\,n_k(h,s,a)\right)}{2\,n_k(h,s,a)}}\right)$$
is upper bounded by $\pi^2/6$. Note that, by definition, when $n_k(h,s,a) > 0$ we have
$$\sqrt{e_k(h,s,a)} \ge H\sqrt{\frac{\log\left(2HSA\,n_k(h,s,a)\right)}{2\,n_k(h,s,a)}},$$
and hence this concentration inequality holds with $\sqrt{e_k(h,s,a)}$ on the right hand side. When $n_k(h,s,a) = 0$, we have the trivial bound
$$\left|\hat{R}^k_{h,s,a} - R_{h,s,a} + \langle \hat{P}^k_{h,s,a} - P_{h,s,a},\ V^*_{h+1}\rangle\right| = \left|R_{h,s,a} + \langle P_{h,s,a},\ V^*_{h+1}\rangle\right| \le H \le \sqrt{e_k(h,s,a)},$$
since we have defined the empirical estimates to satisfy $\hat{R}^k_{h,s,a} = 0$ and $\hat{P}^k_{h,s,a}(\cdot) = 0$ in the case that $(h,s,a)$ has never been played.

A.2 Proof of Lemma 6

Lemma 6. For the absolute constant $c = \Phi(-1)^{-1} < 6.31$, we have
$$\mathrm{Regret}(M, K, \mathrm{RLSVI}_\beta) \le (c+1)\,\mathbb{E}\left[\sum_{k=1}^K \left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + c\,\mathbb{E}\left[\sum_{k=1}^K \left|V(\tilde{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + H\underbrace{\sum_{k=1}^K \mathbb{P}\left(\hat{M}_k \notin \mathcal{M}_k\right)}_{\le \pi^2/6}.$$

Proof. Recall that $\mathcal{H}_{k-1} = \{(s^i_h, a^i_h, r^i_h) : h = 1, \ldots, H,\ i = 1, \ldots, k-1\}$. Conditioned on $\mathcal{H}_{k-1}$, the quantities $\overline{M}_k$, $\pi_k$, and $\tilde{M}_k$ are random only due to the internal randomness of the RLSVI algorithm. Set $\mathbb{E}_k[\cdot] = \mathbb{E}[\cdot \mid \mathcal{H}_{k-1}]$. Suppose that $\hat{M}_k \in \mathcal{M}_k$. Then by Lemma 4,
$$\mathbb{P}\left(V(\overline{M}_k, \pi_k) \ge V(M, \pi^*) \mid \mathcal{H}_{k-1}\right) \ge \Phi(-1). \qquad (4)$$
We begin with the regret decomposition
$$\mathbb{E}_k\left[V(M, \pi^*) - V(M, \pi_k)\right] = \mathbb{E}_k\left[V(M, \pi^*) - V(\overline{M}_k, \pi_k)\right] + \mathbb{E}_k\left[V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right]. \qquad (5)$$
We focus on the first term. We show
$$V(M, \pi^*) - \mathbb{E}_k\left[V(\overline{M}_k, \pi_k)\right] \le c\,\mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k\left[V(\overline{M}_k, \pi_k)\right]\right)^+\right]. \qquad (6)$$
The inequality is immediate if $V(M, \pi^*) < \mathbb{E}_k[V(\overline{M}_k, \pi_k)]$. We now show it when $a \equiv V(M, \pi^*) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)] \ge 0$. Then,
$$\begin{aligned}
\mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)^+\right] &\ge a\,\mathbb{P}_k\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)] \ge a\right) \\
&= \left(V(M, \pi^*) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)\mathbb{P}_k\left(V(\overline{M}_k, \pi_k) \ge V(M, \pi^*)\right) \\
&\ge \left(V(M, \pi^*) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)\Phi(-1),
\end{aligned}$$
where the first step applies Markov's inequality, the second simply plugs in for $a$, and the third uses Equation (4). Dividing each side by $\Phi(-1)$ gives Equation (6). Hence we have shown
$$\mathbb{E}_k\left[V(M, \pi^*) - V(M, \pi_k)\right] \le c\,\mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)^+\right] + \mathbb{E}_k\left[V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right]. \qquad (7)$$
We complete our argument by bounding $\mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)^+\right]$. For each fixed (nonrandom) policy $\pi$, define $\mu(\pi) \equiv \mathbb{E}_k\left[V(\tilde{M}_k, \pi)\right] = \mathbb{E}_k\left[V(\overline{M}_k, \pi)\right]$. Notice that $\mu(\pi_k) = \mathbb{E}_k\left[V(\tilde{M}_k, \pi_k) \mid \pi_k\right]$ almost surely. This relies on the fact that $\tilde{M}_k$ and $\pi_k$ are independent conditioned on the history $\mathcal{H}_{k-1}$. In general $\mu(\pi_k) \ne \mathbb{E}_k\left[V(\overline{M}_k, \pi_k) \mid \pi_k\right]$, since $\pi_k$ is the optimal policy under $\overline{M}_k$ and so these two are statistically dependent. Now, for every policy $\pi$,
$$\mu(\pi) = \mathbb{E}_k\left[V(\overline{M}_k, \pi)\right] \le \mathbb{E}_k\left[\sup_{\pi'} V(\overline{M}_k, \pi')\right] = \mathbb{E}_k\left[V(\overline{M}_k, \pi_k)\right].$$
So $\mu(\pi_k) \le \mathbb{E}_k\left[V(\overline{M}_k, \pi_k)\right]$ almost surely. Using this, we find
$$\begin{aligned}
\mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)^+\right] &\le \mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mu(\pi_k)\right)^+\right] \\
&\le \mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - \mu(\pi_k)\right|\right] \\
&= \mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - \mathbb{E}_k\left[V(\tilde{M}_k, \pi_k) \mid \pi_k, \overline{M}_k\right]\right|\right] \\
&\le \mathbb{E}_k\left[\mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - V(\tilde{M}_k, \pi_k)\right| \;\middle|\; \pi_k, \overline{M}_k\right]\right] \\
&= \mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - V(\tilde{M}_k, \pi_k)\right|\right] \\
&\le \mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + \mathbb{E}_k\left[\left|V(\tilde{M}_k, \pi_k) - V(M, \pi_k)\right|\right].
\end{aligned}$$
Plugging this into (7) shows that, for any history $\mathcal{H}_{k-1}$ with $\hat{M}_k \in \mathcal{M}_k$,
$$\mathbb{E}_k\left[V(M, \pi^*) - V(M, \pi_k)\right] \le (c+1)\,\mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + c\,\mathbb{E}_k\left[\left|V(\tilde{M}_k, \pi_k) - V(M, \pi_k)\right|\right].$$
In the unlikely event $\hat{M}_k \notin \mathcal{M}_k$, we have the worst case bound $0 \le V(M, \pi^*) - V(M, \pi_k) \le H$. Combining these two cases and taking expectations gives
$$\mathbb{E}\left[V(M, \pi^*) - V(M, \pi_k)\right] \le H\,\mathbb{P}\left(\hat{M}_k \notin \mathcal{M}_k\right) + (c+1)\,\mathbb{E}\left[\left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + c\,\mathbb{E}\left[\left|V(\tilde{M}_k, \pi_k) - V(M, \pi_k)\right|\right].$$
Summing over $k$ concludes the proof.

A.3 Proof of Lemma 7

Lemma 7. Let $c = \Phi(-1)^{-1} < 6.31$. Then for any $K \in \mathbb{N}$,
$$\mathbb{E}\left[\sum_{k=1}^K \left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] \le \sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2}\ \sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\overline{V}^k_{h+1}\right\|_\infty^2} + \mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|\epsilon^k_R(h, s^k_h, a^k_h)\right|\right] + \mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|w_k(h, s^k_h, a^k_h)\right|\right].$$

Proof. We bound each term in the bound in Lemma 6. By applying Lemma 3, with $M$ taken to be the true MDP and $\tilde{M} = \overline{M}_k$, the largest term is bounded, for any $k \in \mathbb{N}$, as
$$\begin{aligned}
\left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right| &= \left|\mathbb{E}\left[\sum_{h=1}^H \left\langle \hat{P}^k_{h,s^k_h,a^k_h} - P_{h,s^k_h,a^k_h},\ \overline{V}^k_{h+1}\right\rangle + \hat{R}^k_{h,s^k_h,a^k_h} + w_k(h, s^k_h, a^k_h) - R_{h,s^k_h,a^k_h} \;\middle|\; \pi_k, \mathcal{H}_{k-1}\right]\right| \\
&\le \mathbb{E}\left[\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1 \left\|\overline{V}^k_{h+1}\right\|_\infty \;\middle|\; \pi_k, \mathcal{H}_{k-1}\right] + \mathbb{E}\left[\sum_{h=1}^H \left(\left|\epsilon^k_R(h, s^k_h, a^k_h)\right| + \left|w_k(h, s^k_h, a^k_h)\right|\right) \;\middle|\; \pi_k, \mathcal{H}_{k-1}\right].
\end{aligned}$$
Taking expectations, summing over $k$, and applying Cauchy-Schwarz gives the result.

A.4 Proof of Lemma 8

The proof relies on the following maximal inequality.

Lemma 12 (Example 2.7 from [7]). If $X_1, \ldots, X_n$ are i.i.d. random variables following a $\chi^2_1$ distribution, then
$$\mathbb{E}\left[\max_{i \le n} X_i\right] \le 1 + \sqrt{2\log(n)} + 2\log(n).$$

Let us now recall Lemma 8.

Lemma 8. $\displaystyle\sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\overline{V}^k_{h+1}\right\|_\infty^2} = \tilde{O}\left(H^3\sqrt{SK}\right)$.

Proof. We have
$$\sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\overline{V}^k_{h+1}\right\|_\infty^2} \le \sqrt{HK\,\mathbb{E}\left[\max_{k \le K, h \le H} \left\|\overline{V}^k_{h+1}\right\|_\infty^2\right]}.$$
Now
$$\overline{V}^k_{h+1}(s') \le (H - h - 1)\left(1 + \max_{h,s,a} w_k(h,s,a)\right), \qquad \overline{V}^k_{h+1}(s') \ge (H - h - 1)\,\min_{h,s,a} w_k(h,s,a).$$
Together these give, for all $k \le K$ and $h \in \{1, \ldots, H-1\}$,
$$\left\|\overline{V}^k_{h+1}\right\|_\infty^2 \le H^2\left(1 + \max_{k \le K, h,s,a} |w_k(h,s,a)|\right)^2 \le 4H^2 + 4H^2 \max_{k \le K, h,s,a} |w_k(h,s,a)|^2.$$
We have $w_k(h,s,a) = \sigma_k(h,s,a)\,\xi^k_{h,s,a}$, where the $\xi^k_{h,s,a} \sim N(0,1)$ are drawn i.i.d. across $(h,s,a)$.
Set $X^k_{h,s,a} = (\xi^k_{h,s,a})^2$, each of which follows a chi-squared distribution with 1 degree of freedom. Then
$$\begin{aligned}
\mathbb{E}\left[\max_{k \le K, h,s,a} |w_k(h,s,a)|^2\right] &\le \left(\max_{k \le K, h,s,a} \sigma^2_k(h,s,a)\right)\mathbb{E}\left[\max_{k \le K, h,s,a} |\xi^k_{h,s,a}|^2\right] \\
&= \left(\max_{k \le K, h,s,a} \sigma^2_k(h,s,a)\right)\mathbb{E}\left[\max_{k \le K, h,s,a} X^k_{h,s,a}\right] \\
&\le SH^3\log(2SAHK)\,\mathbb{E}\left[\max_{k \le K, h,s,a} X^k_{h,s,a}\right] \\
&\le SH^3\log(2SAHK)\left(1 + \sqrt{2\log(SAHK)} + 2\log(SAHK)\right) \\
&= O\left(SH^3\log(2SAHK)^2\right).
\end{aligned}$$
This gives us
$$\sqrt{KH\,\mathbb{E}\left[\max_{k \le K, h \le H} \left\|\overline{V}^k_{h+1}\right\|_\infty^2\right]} = \tilde{O}\left(\sqrt{KH \cdot H^2 \cdot SH^3}\right) = \tilde{O}\left(H^3\sqrt{SK}\right).$$

A.5 Proof sketch of Lemma 9

This result relies on an inequality of Weissman et al. [29], which we now restate.

Lemma 13 (L1 deviation bound). If $p$ is a probability distribution over $\mathcal{S} = \{1, \ldots, S\}$ and $\hat{p}$ is the empirical distribution constructed from $n$ i.i.d. draws from $p$, then for any $\epsilon > 0$,
$$\mathbb{P}\left(\|\hat{p} - p\|_1 \ge \epsilon\right) \le \left(2^S - 2\right)\exp\left(-\frac{n\epsilon^2}{2}\right).$$

Lemma 9. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2\right] = \tilde{O}\left(S^2 AH\right)$.

Proof sketch. By picking an appropriate $\epsilon$ in Lemma 13, as in [13, Appendix C.1], together with a union bound over all $HSA$ possible values of the tuple $(h,s,a)$, there exists a numerical constant $c$ such that
$$\mathbb{P}\left(\bigcup_{s,a,h,k \le K}\left\{\left\|\hat{P}^k_{h,s,a} - P_{h,s,a}\right\|_1 \ge c\sqrt{\frac{S\log(1 + HSAK)}{n_k(h,s,a) + 1}}\right\}\right) \le \frac{1}{KH}. \qquad (8)$$
Set $\beta_k(h,s,a) = \frac{S\ell}{n_k(h,s,a)+1}$, where $\ell = c^2\log(1 + HSAK)$ denotes a logarithmic factor. Recall the definition $\epsilon^k_P(h,s,a) \equiv \hat{P}^k_{h,s,a} - P_{h,s,a}$. Let $B$ be the "bad event" that $\|\epsilon^k_P(h,s,a)\|_1^2 \ge \beta_k(h,s,a)$ for some $(h,s,a)$ and $k \le K$. Since $\|\epsilon^k_P(h,s,a)\|_1 \le 2$ always, and $\mathbb{P}(B) \le \frac{1}{KH}$ by (8), we have
$$\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2\,\mathbf{1}(B)\right] \le 4KH \cdot \frac{1}{KH} = 4. \qquad (9)$$
On the other hand, on the event $B^c$ we have the bound
$$\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2 \le \sum_{k=1}^K\sum_{h=1}^{H-1} \beta_k(h, s^k_h, a^k_h) = S\ell\sum_{k=1}^K\sum_{h=1}^{H-1} \frac{1}{n_k(h, s^k_h, a^k_h) + 1} \le S\ell\sum_{h,s,a}\sum_{n=0}^{n_K(h,s,a)} \frac{1}{n+1} = S\ell \cdot O(HSA\log(K)) = \tilde{O}\left(S^2AH\right).$$

A.6 Proof sketch of Lemma 10

Lemma 10. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|\epsilon^k_R(h, s^k_h, a^k_h)\right|\right] = \tilde{O}\left(\sqrt{SAKH}\right)$.

Proof sketch. The proof is similar to that of Lemma 9. By Hoeffding's inequality together with a union bound, we can ensure that $|\epsilon^k_R(h,s,a)| \le c\sqrt{\frac{\log(1+HSAK)}{n_k(h,s,a)+1}}$ for all $k \le K$ and all tuples $(h,s,a)$, except on some bad event that, as in (9), contributes at most a constant to the bound. Now the result follows from using the pigeonhole principle to conclude
$$\sum_{k=1}^K\sum_{h=1}^H \frac{1}{\sqrt{n_k(h, s^k_h, a^k_h)}} = O\left(\sqrt{HSAK}\right).$$
This kind of bound is standard in the RL and bandit literature; see [19, Appendix A] for one proof.

A.7 Proof sketch of Lemma 11

Lemma 11. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|w_k(h, s^k_h, a^k_h)\right|\right] = \tilde{O}\left(H^{3/2} S\sqrt{AKH}\right)$.

Proof. Recall $\sigma_k(h,s,a) = \sqrt{\frac{\beta_k}{n_k(h,s,a)+1}}$, where $\beta_k = \tilde{O}(SH^3)$. Write $w_k(h,s,a) = \sigma_k(h,s,a)\,\xi_k(h,s,a)$, where $\xi_k(h,s,a) \sim N(0,1)$ and the array of random variables $\{\xi_k(h,s,a) : 1 \le k \le K,\ 1 \le h \le H,\ s \in \mathcal{S},\ a \in \mathcal{A}\}$ is drawn independently. By Hölder's inequality,
$$\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|w_k(h, s^k_h, a^k_h)\right|\right] \le \mathbb{E}\left[\max_{k \le K, h,s,a} |\xi_k(h,s,a)|\right]\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \sigma_k(h, s^k_h, a^k_h)\right].$$
The (sub-)Gaussian maximal inequality gives $\mathbb{E}\left[\max_{k \le K, h,s,a} |\xi_k(h,s,a)|\right] = O\left(\sqrt{\log(HSAK)}\right)$. To simplify the next expression, note that $\beta_k \le \beta_K$.
On any sample path, by the same argument as in Lemma 10, we have
$$\sum_{k=1}^K\sum_{h=1}^H \sigma_k(h, s^k_h, a^k_h) \le \sqrt{\beta_K}\sum_{k=1}^K\sum_{h=1}^H \sqrt{\frac{1}{n_k(h, s^k_h, a^k_h)+1}} = O\left(\sqrt{\beta_K}\sqrt{HSAK}\right).$$
Multiplying the two factors, with $\sqrt{\beta_K} = \tilde{O}(H^{3/2}\sqrt{S})$, gives $\tilde{O}\left(H^{3/2}S\sqrt{AKH}\right)$, as claimed.
