Worst-Case Regret Bounds for Exploration via Randomized Value Functions


Daniel Russo
Columbia University
djr2174@gsb.columbia.edu

November 27, 2024

Abstract

This paper studies a recent proposal to use randomized value functions to drive exploration in reinforcement learning. These randomized value functions are generated by injecting random noise into the training data, making the approach compatible with many popular methods for estimating parameterized value functions. By providing a worst-case regret bound for tabular finite-horizon Markov decision processes, we show that planning with respect to these randomized value functions can induce provably efficient exploration.

1 Introduction

Exploration is one of the central challenges in reinforcement learning (RL). A large theoretical literature treats exploration in simple finite state and action MDPs, showing that it is possible to efficiently learn a near optimal policy through interaction alone [5, 8, 10, 11, 13–16, 24, 25]. Overwhelmingly, this literature focuses on optimistic algorithms, with most algorithms explicitly maintaining uncertainty sets that are likely to contain the true MDP.

It has been difficult to adapt these exploration algorithms to the more complex problems investigated in the applied RL literature. Most applied papers seem to generate exploration through $\epsilon$-greedy or Boltzmann exploration. Those simple methods are compatible with practical value function learning algorithms, which use parametric approximations to value functions to generalize across high dimensional state spaces. Unfortunately, such exploration algorithms can fail catastrophically in simple finite state MDPs [see e.g. 22]. This paper is inspired by the search for principled exploration algorithms that both (1) are compatible with practical function learning algorithms and (2) provide robust performance, at least when specialized to simple benchmarks like tabular MDPs.

Our focus will be on methods that generate exploration by planning with respect to randomized value function estimates. This idea was first proposed in a conference paper by [21] and is investigated more thoroughly in the journal paper [22]. It is inspired by work on posterior sampling for reinforcement learning (a.k.a. Thompson sampling) [19, 26], which could be interpreted as sampling a value function from a posterior distribution and following the optimal policy under that value function for some extended period of time before resampling. A number of papers have subsequently investigated approaches that generate randomized value functions in complex reinforcement learning problems [6, 9, 12, 20, 23, 27, 28]. Our theory will focus on a specific approach of [21, 22], dubbed randomized least squares value iteration (RLSVI), as specialized to tabular MDPs. The name is a play on the classic least-squares policy iteration algorithm (LSPI) of [17]. RLSVI generates a randomized value function (essentially) by judiciously injecting Gaussian noise into the training data and then applying LSPI to this noisy dataset. One could naturally apply the same template while using other value learning algorithms in place of LSPI. This is a strikingly simple algorithm, but providing rigorous theoretical guarantees has proved challenging.
One challenge is that, despite the appealing conceptual connections, there are significant subtleties to any precise link between RLSVI and posterior sampling. The issue is that posterior sampling based approaches are derived from a true Bayesian perspective in which one maintains beliefs over the underlying MDP. The approaches of [6, 9, 12, 22, 23, 27, 28] model only the value function, so Bayes rule is not even well defined.¹

¹ The precise issue is that, even given a prior over value functions, there is no likelihood function. Given an MDP, there is a well specified likelihood of transitioning from one state $s$ to another $s'$, but a value function does not specify a probabilistic data-generating model.

The work of [21, 22] uses stochastic dominance arguments to relate the value function sampling distribution of RLSVI to a correct posterior in a Bayesian model where the true MDP is randomly drawn. This gives substantial insight, but the resulting analysis is not entirely satisfying as a robustness guarantee. It bounds regret on average over MDPs with transition kernels drawn from a particular Dirichlet prior, but one may worry that hard reinforcement learning instances are extremely unlikely under this particular prior.

This paper develops a very different proof strategy and provides a worst-case regret bound for RLSVI applied to tabular finite-horizon MDPs. The crucial proof steps are to show that each randomized value function sampled by RLSVI has a significant probability of being optimistic (see Lemma 4) and then to show that from this property one can reduce the regret analysis to concentration arguments pioneered by [13] (see Lemmas 6, 7). This approach is inspired by frequentist analysis of Thompson sampling for linear bandits [2] and especially the lucid description of [1]. However, applying these ideas in reinforcement learning appears to require novel analysis. The only prior extension of these proof techniques to tabular reinforcement learning was carried out by [3]. Reflecting the difficulty of such analyses, that paper does not provide regret bounds for a pure Thompson sampling algorithm; instead their algorithm samples many times from the posterior to form an optimistic model, as in the BOSS algorithm [4]. Also, unfortunately, there is a significant error in that paper's analysis, and the correction has not yet been posted online, making a careful comparison difficult at this time.

The established regret bounds are not state of the art for tabular finite-horizon MDPs. A final step of the proof applies techniques of [13], introducing an extra $\sqrt{S}$ in the bounds. I hope some smart reader can improve this by intelligently adapting the techniques of [5, 11]. However, the primary goal of the paper is not to give the tightest possible regret bound, but to broaden the set of exploration approaches known to satisfy polynomial worst-case regret bounds. To this author, it is both fascinating and beautiful that carefully adding noise to the training data generates sophisticated exploration, and proving this formally is worthwhile.

2 Problem formulation

We consider the problem of learning to optimize performance through repeated interactions with an unknown finite horizon MDP $M = (H, \mathcal{S}, \mathcal{A}, P, R, s_1)$. The agent interacts with the environment across $K$ episodes. Each episode proceeds over $H$ periods, where for period $h \in \{1, \ldots, H\}$ of episode $k$ the agent is in state $s_h^k \in \mathcal{S} = \{1, \ldots, S\}$, takes action $a_h^k \in \mathcal{A} = \{1, \ldots, A\}$, observes the reward $r_h^k \in [0, 1]$ and, for $h < H$, also observes the next state $s_{h+1}^k \in \mathcal{S}$. Let $\mathcal{H}_{k-1} = \{(s_h^i, a_h^i, r_h^i) : h = 1, \ldots, H,\ i = 1, \ldots, k-1\}$ denote the history of interactions prior to episode $k$.
The Markov transition kernel $P$ encodes the transition probabilities, with
$$P_{h, s_h^k, a_h^k}(s) = \mathbb{P}\left(s_{h+1}^k = s \mid a_h^k, s_h^k, \ldots, a_1^k, s_1^k, \mathcal{H}_{k-1}\right).$$
The reward distribution is encoded in $R$, with
$$R_{h, s_h^k, a_h^k}(dr) = \mathbb{P}\left(r_h^k = dr \mid a_h^k, s_h^k, \ldots, a_1^k, s_1^k, \mathcal{H}_{k-1}\right).$$
We usually instead refer to expected rewards encoded in a vector $R$ that satisfies $R_{h,s,a} = \mathbb{E}[r_h^k \mid s_h^k = s, a_h^k = a]$. We then refer to an MDP $(H, \mathcal{S}, \mathcal{A}, P, R, s_1)$, described in terms of its expected rewards rather than its reward distribution, as this is sufficient to determine the expected value accrued by any policy. The variable $s_1$ denotes a deterministic initial state, and we assume $s_1^k = s_1$ for every episode $k$. At the expense of complicating some formulas, the entire paper could also be written assuming initial states are drawn from some distribution over $\mathcal{S}$, which is more standard in the literature.

A deterministic Markov policy $\pi = (\pi_1, \ldots, \pi_H)$ is a sequence of functions, where each $\pi_h : \mathcal{S} \to \mathcal{A}$ prescribes an action to play in each state. We let $\Pi$ denote the space of all such policies. We use $V_h^\pi \in \mathbb{R}^S$ to denote the value function associated with policy $\pi$ in the sub-episode consisting of periods $\{h, \ldots, H\}$. To simplify many expressions, we set $V_{H+1}^\pi = 0 \in \mathbb{R}^S$. Then the value functions for $h \le H$ are the unique solution to the Bellman equations
$$V_h^\pi(s) = R_{h,s,\pi_h(s)} + \sum_{s' \in \mathcal{S}} P_{h,s,\pi_h(s)}(s')\, V_{h+1}^\pi(s'), \qquad s \in \mathcal{S},\ h = 1, \ldots, H.$$
The optimal value function is $V_h^*(s) = \max_{\pi \in \Pi} V_h^\pi(s)$.
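To make the Bellman recursion concrete, the following is a minimal sketch of policy evaluation by backward induction in a tabular finite-horizon MDP. It is illustrative only: the array conventions (an $H \times S \times A$ reward array, an $H \times S \times A \times S$ transition array, and 0-based periods) are assumptions of this sketch, not notation from the paper.

```python
import numpy as np

def policy_value(R, P, pi):
    """Evaluate V^pi by backward induction on the Bellman equations.

    R  : (H, S, A) array of expected rewards R_{h,s,a}.
    P  : (H, S, A, S) array of transition probabilities P_{h,s,a}(s').
    pi : (H, S) integer array; pi[h, s] is the action played in state s, period h.
    Returns V with shape (H + 1, S), where V[H] = 0 by convention.
    """
    H, S, _ = R.shape
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):          # h = H-1, ..., 0 (paper's h = H, ..., 1)
        for s in range(S):
            a = pi[h, s]
            # V_h(s) = R_{h,s,pi_h(s)} + <P_{h,s,pi_h(s)}, V_{h+1}>
            V[h, s] = R[h, s, a] + P[h, s, a] @ V[h + 1]
    return V
```

Computing the optimal value function replaces the prescribed action with a maximum over actions at each state; this is the backward recursion RLSVI applies to a perturbed MDP in Section 3.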
An episodic reinforcement learning algorithm Alg is a possibly randomized procedure that associates each history with a policy to employ throughout the next episode. Formally, a randomized algorithm can depend on random seeds $\{\xi_k\}_{k \in \mathbb{N}}$ drawn independently of the past from some prespecified distribution. Such an episodic reinforcement learning algorithm selects a policy $\pi_k = \mathrm{Alg}(\mathcal{H}_{k-1}, \xi_k)$ to be employed throughout episode $k$. The cumulative expected regret incurred by Alg over $K$ episodes of interaction with the MDP $M$ is
$$\mathrm{Regret}(M, K, \mathrm{Alg}) = \mathbb{E}_{\mathrm{Alg}}\left[\sum_{k=1}^K V_1^*(s_1^k) - V_1^{\pi_k}(s_1^k)\right],$$
where the expectation is taken over the random seeds used by a randomized algorithm and the randomness in the observed rewards and state transitions that influence the algorithm's chosen policy. This expression captures the algorithm's cumulative expected shortfall in performance relative to an omniscient benchmark, which knows and always employs the true optimal policy.

Of course, regret as formulated above depends on the MDP $M$ to which the algorithm is applied. Our goal is not to minimize regret under a particular MDP but to provide a guarantee that holds uniformly across a class of MDPs. This can be expressed more formally by considering a class $\mathcal{M}$ containing all MDPs with $S$ states, $A$ actions, $H$ periods, and reward distributions bounded in $[0, 1]$. Our goal is to bound the worst-case regret $\sup_{M \in \mathcal{M}} \mathrm{Regret}(M, K, \mathrm{Alg})$ incurred by an algorithm throughout $K$ episodes of interaction with an unknown MDP in this class. We aim for a bound on worst-case regret that scales sublinearly in $K$ and has some reasonable polynomial dependence on the size of the state space, action space, and horizon. We won't explicitly maximize over $\mathcal{M}$ in the analysis. Instead, we fix an arbitrary MDP $M$ and seek to bound regret in a way that does not depend on the particular transition probabilities or reward distributions under $M$. It is worth remarking that, as formulated, our algorithm knows $S$, $A$, and $H$ but does not have knowledge of the number of episodes $K$. Indeed, we study a so-called anytime algorithm that has good performance for all sufficiently long sequences of interaction.

Notation for empirical estimates. We define $n_k(h,s,a) = \sum_{\ell=1}^{k-1} \mathbf{1}\{(s_h^\ell, a_h^\ell) = (s,a)\}$ to be the number of times action $a$ has been sampled in state $s$, period $h$. For every tuple $(h,s,a)$ with $n_k(h,s,a) > 0$, we define the empirical mean reward and empirical transition probabilities at period $h$ by
$$\hat{R}^k_{h,s,a} = \frac{1}{n_k(h,s,a)} \sum_{\ell=1}^{k-1} \mathbf{1}\{(s_h^\ell, a_h^\ell) = (s,a)\}\, r_h^\ell \qquad (1)$$
$$\hat{P}^k_{h,s,a}(s') = \frac{1}{n_k(h,s,a)} \sum_{\ell=1}^{k-1} \mathbf{1}\{(s_h^\ell, a_h^\ell, s_{h+1}^\ell) = (s,a,s')\} \quad \forall s' \in \mathcal{S}. \qquad (2)$$
If $(h,s,a)$ was never sampled before episode $k$, we define $\hat{R}^k_{h,s,a} = 0$ and $\hat{P}^k_{h,s,a} = 0 \in \mathbb{R}^S$.

3 Randomized Least Squares Value Iteration

This section describes an algorithm called Randomized Least Squares Value Iteration (RLSVI). We describe RLSVI as specialized to a simple tabular problem in a way that is most convenient for the subsequent theoretical analysis. A mathematically equivalent definition, which views RLSVI as estimating a value function on randomized training data, extends more gracefully. This interpretation is given at the end of the section and more carefully in [22].

At the start of episode $k$, the agent has observed a history of interactions $\mathcal{H}_{k-1}$. Based on this, it is natural to consider an estimated MDP $\hat{M}_k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k, s_1)$ with empirical estimates of mean rewards and transition probabilities, as defined in Equations (1)–(2) and the surrounding text. We could use backward recursion to solve for the optimal policy and value functions under the empirical MDP, but applying this policy would not generate exploration. RLSVI builds on this idea, but to induce exploration it judiciously adds Gaussian noise before solving for an optimal policy.

We can define RLSVI concisely as follows. In episode $k$ it samples a random vector $w_k \in \mathbb{R}^{HSA}$ with independent components, where $w_k(h,s,a) \sim N(0, \sigma_k^2(h,s,a))$. We define $\sigma_k(h,s,a) = \sqrt{\frac{\beta_k}{n_k(h,s,a)+1}}$, where $\beta_k$ is a tuning parameter and the denominator shrinks like the standard deviation of the average of $n_k(h,s,a)$ i.i.d. samples. Given $w_k$, we construct a randomized perturbation $\overline{M}_k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k + w_k, s_1)$ of the empirical MDP by adding the Gaussian noise to the estimated rewards. RLSVI solves for the optimal policy $\pi_k$ under this MDP and applies it throughout the episode. This policy is, of course, greedy with respect to the (randomized) value functions under $\overline{M}_k$.
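To fix ideas, here is a minimal sketch of one episode of tabular RLSVI as just described: sample the reward noise $w_k$, perturb the empirical MDP, and solve the perturbed MDP by backward induction. The array conventions follow the policy_value sketch above and are assumptions of this sketch; unvisited tuples are assumed to carry $\hat{R} = 0$ and $\hat{P} = 0$, as in Equations (1)–(2).

```python
import numpy as np

def rlsvi_episode_policy(R_hat, P_hat, n, beta_k, rng):
    """Plan in a noise-perturbed empirical MDP (one episode of tabular RLSVI).

    R_hat  : (H, S, A) empirical mean rewards (0 where n == 0).
    P_hat  : (H, S, A, S) empirical transition probabilities (0 where n == 0).
    n      : (H, S, A) visit counts n_k(h, s, a).
    beta_k : tuning parameter; the analysis takes beta_k = O~(S H^3).
    Returns the greedy policy pi, shape (H, S), for the perturbed MDP.
    """
    H, S, _ = R_hat.shape
    sigma = np.sqrt(beta_k / (n + 1.0))      # sigma_k(h, s, a)
    w = rng.normal(0.0, sigma)               # w_k(h,s,a) ~ N(0, sigma_k^2), independent
    R_pert = R_hat + w                       # perturbed rewards define the MDP

    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):             # backward induction
        Q = R_pert[h] + P_hat[h] @ V[h + 1]  # (S, A) state-action values
        pi[h] = Q.argmax(axis=1)             # greedy policy under the perturbed MDP
        V[h] = Q.max(axis=1)
    return pi
```

A full simulation would replay the returned policy for one episode, update the counts and empirical estimates, and repeat with a fresh noise draw each episode.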
The random noise $w_k$ in RLSVI should be large enough to dominate the error introduced by performing a noisy Bellman update using $\hat{P}^k$ and $\hat{R}^k$. We set $\beta_k = \tilde{O}(SH^3)$ in the analysis, where the functions of $H$ offer a coarse bound on quantities like the variance of an empirically estimated Bellman update. For $\beta = \{\beta_k\}_{k \in \mathbb{N}}$, we denote this algorithm by $\mathrm{RLSVI}_\beta$.

RLSVI as regression on perturbed data. To extend beyond simple tabular problems, it is fruitful to view RLSVI, as in Algorithm 1, as an algorithm that performs recursive least squares estimation of the state-action value function. Randomization is injected into these value function estimates by perturbing observed rewards and by regularizing toward a randomized prior sample. This prior sample is essential, as otherwise there would be no randomness in the estimated value function in initial periods. This procedure is the LSPI algorithm of [17] applied with noisy data and a tabular representation. The paper [22] includes many experiments with non-tabular representations.

Algorithm 1: RLSVI for tabular, finite-horizon MDPs

input: $H$, $\mathcal{S}$, $\mathcal{A}$, tuning parameters $\{\beta_k\}_{k \in \mathbb{N}}$
for episodes $k = 1, 2, \ldots$ do
    // Define the squared temporal difference error
    $L(Q \mid Q_{\mathrm{next}}, \mathcal{D}) = \sum_{(s,a,r,s') \in \mathcal{D}} \left(Q(s,a) - r - \max_{a' \in \mathcal{A}} Q_{\mathrm{next}}(s',a')\right)^2$
    // Past data
    $\mathcal{D}_h = \{(s_h^\ell, a_h^\ell, r_h^\ell, s_{h+1}^\ell) : \ell < k\}$ for $h < H$;  $\mathcal{D}_H = \{(s_H^\ell, a_H^\ell, r_H^\ell, \varnothing) : \ell < k\}$
    // Randomly perturb data
    for time periods $h = 1, \ldots, H$ do
        sample array $\tilde{Q}_h \sim N(0, \beta_k I)$   // draw prior sample
        $\tilde{\mathcal{D}}_h \leftarrow \{\}$
        for $(s, a, r, s') \in \mathcal{D}_h$ do
            sample $w \sim N(0, \beta_k)$
            $\tilde{\mathcal{D}}_h \leftarrow \tilde{\mathcal{D}}_h \cup \{(s, a, r + w, s')\}$
        end
    end
    // Estimate Q on noisy data
    define the terminal value $\hat{Q}_{H+1}(s,a) \leftarrow 0$ for all $(s,a)$
    for time periods $h = H, \ldots, 1$ do
        $\hat{Q}_h \leftarrow \arg\min_{Q \in \mathbb{R}^{SA}} L(Q \mid \hat{Q}_{h+1}, \tilde{\mathcal{D}}_h) + \|Q - \tilde{Q}_h\|_2^2$
    end
    apply the greedy policy with respect to $(\hat{Q}_1, \ldots, \hat{Q}_H)$ throughout the episode
    observe data $s_1^k, a_1^k, r_1^k, \ldots, s_H^k, a_H^k, r_H^k$
end

To understand this presentation of RLSVI, it is helpful to understand an equivalence between posterior sampling in a Bayesian linear model and fitting a regularized least squares estimate to randomly perturbed data. We refer to [22] for a full discussion of this equivalence and review the scalar case here. Consider Bayesian updating of a scalar parameter $\theta \sim N(0, \beta)$ based on noisy observations $Y = (y_1, \ldots, y_n)$, where $y_i \mid \theta \sim N(\theta, \beta)$. The posterior distribution has the closed form
$$\theta \mid Y \sim N\left(\frac{1}{n+1}\sum_{i=1}^n y_i,\ \frac{\beta}{n+1}\right).$$
We could generate a sample from this distribution by fitting a least squares estimate to noisy data. Sample $W = (w_1, \ldots, w_n)$, where each $w_i \sim N(0, \beta)$ is drawn independently, and sample $\tilde\theta \sim N(0, \beta)$. Then
$$\hat\theta = \operatorname*{arg\,min}_{\theta \in \mathbb{R}} \sum_{i=1}^n \left(\theta - (y_i + w_i)\right)^2 + (\theta - \tilde\theta)^2 = \frac{1}{n+1}\left(\sum_{i=1}^n (y_i + w_i) + \tilde\theta\right) \qquad (3)$$
satisfies $\hat\theta \mid Y \sim N\left(\frac{1}{n+1}\sum_{i=1}^n y_i,\ \frac{\beta}{n+1}\right)$. For more complex models, where exact posterior sampling is impossible, we may still hope that estimation on randomly perturbed data generates samples that reflect uncertainty in a sensible way.
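As a quick sanity check on Equation (3), the short simulation below (a sketch written for this discussion, not code from the paper) compares the sampling distribution of the perturbed least-squares estimate $\hat\theta$ against the exact posterior; the reported means and variances should agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, n, trials = 2.0, 10, 200_000

theta = rng.normal(0.0, np.sqrt(beta))            # scalar parameter, theta ~ N(0, beta)
y = theta + rng.normal(0.0, np.sqrt(beta), n)     # observations, y_i | theta ~ N(theta, beta)

# Exact posterior given Y: N(sum(y) / (n + 1), beta / (n + 1)).
post_mean, post_var = y.sum() / (n + 1), beta / (n + 1)

# Perturbed least squares, Equation (3):
# theta_hat = (sum_i (y_i + w_i) + theta_tilde) / (n + 1).
w = rng.normal(0.0, np.sqrt(beta), (trials, n))       # reward perturbations
theta_tilde = rng.normal(0.0, np.sqrt(beta), trials)  # prior samples
theta_hat = ((y + w).sum(axis=1) + theta_tilde) / (n + 1)

print(f"posterior  mean={post_mean:.4f}  var={post_var:.4f}")
print(f"perturbed  mean={theta_hat.mean():.4f}  var={theta_hat.var():.4f}")
```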
As far as RLSVI is concerned, roughly the same calculation shows that in Algorithm 1, $\hat{Q}_h(s,a)$ is equal to an empirical Bellman update plus Gaussian noise:
$$\hat{Q}_h(s,a) \mid \hat{Q}_{h+1} \sim N\left(\hat{R}^k_{h,s,a} + \sum_{s' \in \mathcal{S}} \hat{P}^k_{h,s,a}(s') \max_{a' \in \mathcal{A}} \hat{Q}_{h+1}(s',a'),\ \frac{\beta_k}{n_k(h,s,a)+1}\right).$$

4 Main result

Theorem 1 establishes that RLSVI satisfies a worst-case polynomial regret bound for tabular finite-horizon MDPs. It is worth contrasting RLSVI with $\epsilon$-greedy and Boltzmann exploration, which are both widely used randomized approaches to exploration. Those simple methods explore by directly injecting randomness into the action chosen at each timestep. Unfortunately, they can fail catastrophically even on simple examples with a finite state space, requiring a time to learn that scales exponentially in the size of the state space. Instead, RLSVI generates randomization by training value functions with randomly perturbed rewards. Theorem 1 confirms that this approach generates a sophisticated form of exploration fundamentally different from $\epsilon$-greedy and Boltzmann exploration. The notation $\tilde{O}$ ignores poly-logarithmic factors in $H$, $S$, $A$, and $K$.

Theorem 1. Let $\mathcal{M}$ denote the set of MDPs with horizon $H$, $S$ states, $A$ actions, and rewards bounded in $[0,1]$. Then for a tuning parameter sequence $\beta = \{\beta_k\}_{k \in \mathbb{N}}$ with $\beta_k = \frac{1}{2} S H^3 \log(2HSAk)$,
$$\sup_{M \in \mathcal{M}} \mathrm{Regret}(M, K, \mathrm{RLSVI}_\beta) \le \tilde{O}\left(H^3 S^{3/2} \sqrt{AK}\right).$$

This bound is not state of the art, and tightening it is not the main goal of this paper. I conjecture that the extra factor of $S$ can be removed from this bound through a careful analysis, making the dependence on $S$, $A$, and $K$ optimal. This conjecture is supported by numerical experiments and (informally) by a Bayesian regret analysis [22]. One extra $\sqrt{S}$ appears to come from a step at the very end of the proof in Lemma 7, where we bound a certain $L_1$ norm as in the analysis style of [13]. For optimistic algorithms, some recent work has avoided directly bounding that $L_1$ norm, yielding a tighter regret guarantee [5, 11]. Another factor of $\sqrt{S}$ stems from the choice of $\beta_k$, which is used in the proof of Lemma 5. This seems similar to the extra $\sqrt{d}$ factor that appears in worst-case regret upper bounds for Thompson sampling in $d$-dimensional linear bandit problems [1].

Remark 1. Some translation is required to relate the dependence on $H$ to other literature. Many results are given in terms of the number of periods $T = KH$, which masks a factor of $H$. Also, unlike e.g. [5], this paper treats time-inhomogeneous transition kernels, so in some sense agents must learn about $H$ times as many state/action pairs. Roughly speaking then, our result exactly corresponds to what one would get by applying the UCRL2 analysis [13] to a time-inhomogeneous finite-horizon problem.

5 Proof of Theorem 1

The proof follows from several lemmas. Some are (possibly complex) technical adaptations of ideas present in many regret analyses. Lemmas 4 and 6 are the main discoveries that prompted this paper. Throughout we use the following notation: for any MDP $\tilde{M} = (H, \mathcal{S}, \mathcal{A}, \tilde{P}, \tilde{R}, s_1)$, let $V(\tilde{M}, \pi) \in \mathbb{R}$ denote the value of policy $\pi$ from the initial state $s_1$. In this notation, for the true MDP $M$ we have $V(M, \pi) = V_1^\pi(s_1)$.

A concentration inequality.
Through a careful application of Hoeffding's inequality, one can give a high probability bound on the error in applying a Bellman update to the (non-random) optimal value function $V^*_{h+1}$. Through this, and a union bound, Lemma 2 bounds the expected number of times the empirically estimated MDP falls outside the confidence set
$$\mathcal{M}_k = \left\{(H, \mathcal{S}, \mathcal{A}, P', R', s_1) : \left|(R'_{h,s,a} - R_{h,s,a}) + \langle P'_{h,s,a} - P_{h,s,a},\ V^*_{h+1}\rangle\right| \le \sqrt{e_k(h,s,a)}\ \ \forall (h,s,a)\right\},$$
where we define
$$\sqrt{e_k(h,s,a)} = H\sqrt{\frac{\log(2HSAk)}{2\left(n_k(h,s,a)+1\right)}}.$$
This set is only a tool in the analysis and cannot be used by the agent, since $V^*_{h+1}$ is unknown.

Lemma 2 (Validity of confidence sets). $\sum_{k=1}^\infty \mathbb{P}\left(\hat{M}_k \notin \mathcal{M}_k\right) \le \frac{\pi^2}{6}$.

From value function error to on-policy Bellman error. For some fixed policy $\pi$, the next simple lemma expresses the gap between the value functions under two MDPs in terms of the differences between their Bellman operators. Results like this are critical to many analyses in the RL literature. Notice the asymmetric roles of $\tilde{M}$ and $M$: the value functions correspond to one MDP while the state trajectory is sampled in the other. We'll apply the lemma twice: once where $\tilde{M}$ is the true MDP and $M$ is the estimated one used by RLSVI, and once where the roles are reversed.

Lemma 3. Consider any policy $\pi$ and two MDPs $\tilde{M} = (H, \mathcal{S}, \mathcal{A}, \tilde{P}, \tilde{R}, s_1)$ and $M = (H, \mathcal{S}, \mathcal{A}, P, R, s_1)$. Let $\tilde{V}^\pi_h$ and $V^\pi_h$ denote the respective value functions of $\pi$ under $\tilde{M}$ and $M$. Then
$$V^\pi_1(s_1) - \tilde{V}^\pi_1(s_1) = \mathbb{E}_{\pi, M}\left[\sum_{h=1}^H \left(R_{h,s_h,\pi_h(s_h)} - \tilde{R}_{h,s_h,\pi_h(s_h)}\right) + \left\langle P_{h,s_h,\pi_h(s_h)} - \tilde{P}_{h,s_h,\pi_h(s_h)},\ \tilde{V}^\pi_{h+1}\right\rangle\right],$$
where $\tilde{V}^\pi_{H+1} \equiv 0 \in \mathbb{R}^S$ and the expectation is over the sampled state trajectory $s_1, \ldots, s_H$ drawn from following $\pi$ in the MDP $M$.

Proof. We have
$$\begin{aligned}
V^\pi_1(s_1) - \tilde{V}^\pi_1(s_1) &= R_{1,s_1,\pi_1(s_1)} + \langle P_{1,s_1,\pi_1(s_1)},\ V^\pi_2\rangle - \tilde{R}_{1,s_1,\pi_1(s_1)} - \langle \tilde{P}_{1,s_1,\pi_1(s_1)},\ \tilde{V}^\pi_2\rangle \\
&= R_{1,s_1,\pi_1(s_1)} - \tilde{R}_{1,s_1,\pi_1(s_1)} + \langle P_{1,s_1,\pi_1(s_1)} - \tilde{P}_{1,s_1,\pi_1(s_1)},\ \tilde{V}^\pi_2\rangle + \langle P_{1,s_1,\pi_1(s_1)},\ V^\pi_2 - \tilde{V}^\pi_2\rangle \\
&= R_{1,s_1,\pi_1(s_1)} - \tilde{R}_{1,s_1,\pi_1(s_1)} + \langle P_{1,s_1,\pi_1(s_1)} - \tilde{P}_{1,s_1,\pi_1(s_1)},\ \tilde{V}^\pi_2\rangle + \mathbb{E}_{\pi, M}\left[V^\pi_2(s_2) - \tilde{V}^\pi_2(s_2)\right].
\end{aligned}$$
Expanding this recursion gives the result.

Sufficient optimism through randomization. There is always the risk that, based on noisy observations, an RL algorithm incorrectly forms a low estimate of the value function at some state. This may lead the algorithm to purposefully avoid that state, therefore failing to gather the data needed to correct its faulty estimate. To avoid such scenarios, nearly all provably efficient RL exploration algorithms purposefully build optimistic estimates. RLSVI does not do this, and instead generates a randomized value function. The following lemma is key to our analysis. It shows that, except in the rare event when it has grossly mis-estimated the underlying MDP, RLSVI has at least a constant chance of sampling an optimistic value function. Similar results can be proved for Thompson sampling with linear models [1]. Recall that $M$ is the unknown true MDP with optimal policy $\pi^*$, and $\overline{M}_k$ is RLSVI's noise-perturbed MDP, under which $\pi_k$ is an optimal policy.

Lemma 4. Let $\pi^*$ be an optimal policy for the true MDP $M$. If $\hat{M}_k \in \mathcal{M}_k$, then
$$\mathbb{P}\left(V(\overline{M}_k, \pi_k) \ge V(M, \pi^*) \mid \mathcal{H}_{k-1}\right) \ge \Phi(-1).$$
This result is more easily established through the following lemma, which avoids the need to carefully condition on the history $\mathcal{H}_{k-1}$ at each step. We conclude with the proof of Lemma 4 afterward.

Lemma 5. Fix any policy $\pi = (\pi_1, \ldots, \pi_H)$ and a vector $e \in \mathbb{R}^{HSA}$ with $e(h,s,a) \ge 0$. Consider the MDP $M = (H, \mathcal{S}, \mathcal{A}, P, R, s_1)$ and alternatives $\bar{R}$ and $\bar{P}$ obeying the inequality
$$-\sqrt{e(h,s,a)} \le \bar{R}_{h,s,a} - R_{h,s,a} + \langle \bar{P}_{h,s,a} - P_{h,s,a},\ V^\pi_{h+1}\rangle \le \sqrt{e(h,s,a)}$$
for every $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $h \in \{1, \ldots, H\}$. Take $W \in \mathbb{R}^{HSA}$ to be a random vector with independent components, where $W(h,s,a) \sim N(0, HS\, e(h,s,a))$. Let $\bar{V}^{\pi,W}_1$ denote the (random) value function of the policy $\pi$ under the MDP $\bar{M} = (H, \mathcal{S}, \mathcal{A}, \bar{P}, \bar{R} + W, s_1)$. Then
$$\mathbb{P}\left(\bar{V}^{\pi,W}_1(s_1) \ge V^\pi_1(s_1)\right) \ge \Phi(-1).$$

Proof. To start, we consider an arbitrary deterministic vector $w \in \mathbb{R}^{HSA}$ (thought of as a possible realization of $W$) and evaluate the gap in value functions $\bar{V}^{\pi,w}_1(s_1) - V^\pi_1(s_1)$. We can rewrite this quantity by applying Lemma 3. Let $s = (s_1, \ldots, s_H)$ denote a random sequence of states drawn by simulating the policy $\pi$ in the MDP $\bar{M}$ from the deterministic initial state $s_1$, and set $a_h = \pi_h(s_h)$ for $h = 1, \ldots, H$. Then
$$\begin{aligned}
\bar{V}^{\pi,w}_1(s_1) - V^\pi_1(s_1) &= \mathbb{E}\left[\sum_{h=1}^H w(h, s_h, \pi_h(s_h)) + \bar{R}_{h,s_h,\pi_h(s_h)} - R_{h,s_h,\pi_h(s_h)} + \left\langle \bar{P}_{h,s_h,\pi_h(s_h)} - P_{h,s_h,\pi_h(s_h)},\ V^\pi_{h+1}\right\rangle\right] \\
&\ge H\,\mathbb{E}\left[\frac{1}{H}\sum_{h=1}^H \left(w(h, s_h, \pi_h(s_h)) - \sqrt{e(h, s_h, \pi_h(s_h))}\right)\right],
\end{aligned}$$
where the expectation is taken over the sequence of states $s = (s_1, \ldots, s_H)$. Define $d(h,s) = \frac{1}{H}\,\mathbb{P}(s_h = s)$ for every $h \le H$ and $s \in \mathcal{S}$. Then the above inequality can be written as
$$\frac{1}{H}\left(\bar{V}^{\pi,w}_1(s_1) - V^\pi_1(s_1)\right) \ge \sum_{s \in \mathcal{S}, h \le H} d(h,s)\left(w(h, s, \pi_h(s)) - \sqrt{e(h, s, \pi_h(s))}\right) \ge \left(\sum_{s \in \mathcal{S}, h \le H} d(h,s)\, w(h, s, \pi_h(s))\right) - \sqrt{HS}\sqrt{\sum_{s \in \mathcal{S}, h \le H} d(h,s)^2\, e(h, s, \pi_h(s))} =: X(w),$$
where the second inequality applies Cauchy-Schwarz. Now, since $d(h,s)\, W(h, s, \pi_h(s)) \sim N\left(0,\ d(h,s)^2\, HS\, e(h, s, \pi_h(s))\right)$, we have
$$X(W) \sim N\left(-\sqrt{HS\sum_{s \in \mathcal{S}, h \le H} d(h,s)^2\, e(h, s, \pi_h(s))},\ \ HS\sum_{s \in \mathcal{S}, h \le H} d(h,s)^2\, e(h, s, \pi_h(s))\right).$$
By standardization, $\mathbb{P}(X(W) \ge 0) = \Phi(-1)$. Therefore $\mathbb{P}\left(\bar{V}^{\pi,W}_1(s_1) - V^\pi_1(s_1) \ge 0\right) \ge \Phi(-1)$.

Proof of Lemma 4. Consider some history $\mathcal{H}_{k-1}$ with $\hat{M}_k \in \mathcal{M}_k$. Recall that $\pi_k$ is the policy chosen by RLSVI, which is optimal under the MDP $\overline{M}_k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k + w_k, s_1)$. Since $\sigma_k^2(h,s,a) = HS\, e_k(h,s,a)$, applying Lemma 5 conditioned on $\mathcal{H}_{k-1}$ (with $\pi = \pi^*$) shows that with probability at least $\Phi(-1)$, $V(\overline{M}_k, \pi^*) \ge V(M, \pi^*)$. When this occurs, we always have $V(\overline{M}_k, \pi_k) \ge V(M, \pi^*)$, since by definition $\pi_k$ is optimal under $\overline{M}_k$.

Reduction to bounding online prediction error. The next lemma shows that the cumulative expected regret of RLSVI is bounded in terms of the total prediction error in estimating the value function of $\pi_k$. The critical feature of the result is that it only depends on the algorithm being able to estimate the performance of the policies it actually employs, and therefore gathers data about. From here, the regret analysis follows from concentration arguments alone.
For the purposes of analysis, we let $\tilde{M}_k$ denote an imagined second sample drawn from the same distribution as the perturbed MDP $\overline{M}_k$ under RLSVI. More formally, let $\tilde{M}_k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k + \tilde{w}_k, s_1)$, where $\tilde{w}_k(h,s,a) \mid \mathcal{H}_{k-1} \sim N(0, \sigma_k^2(h,s,a))$ is independent Gaussian noise. Conditioned on the history, $\tilde{M}_k$ has the same marginal distribution as $\overline{M}_k$, but it is statistically independent of the policy $\pi_k$ selected by RLSVI.

Lemma 6. For the absolute constant $c = \Phi(-1)^{-1} < 6.31$, we have
$$\mathrm{Regret}(M, K, \mathrm{RLSVI}_\beta) \le (c+1)\,\mathbb{E}\left[\sum_{k=1}^K \left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + c\,\mathbb{E}\left[\sum_{k=1}^K \left|V(\tilde{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + H\underbrace{\sum_{k=1}^K \mathbb{P}\left(\hat{M}_k \notin \mathcal{M}_k\right)}_{\le \pi^2/6}.$$

Online prediction error bounds. We complete the proof with concentration arguments. Set $\epsilon^k_R(h,s,a) = \hat{R}^k_{h,s,a} - R_{h,s,a} \in \mathbb{R}$ and $\epsilon^k_P(h,s,a) = \hat{P}^k_{h,s,a} - P_{h,s,a} \in \mathbb{R}^S$ to be the errors in estimating the mean reward and the transition vector corresponding to $(h,s,a)$. The next result follows by bounding each term in Lemma 6. This is done by using Lemma 3 to expand the terms $V(\overline{M}_k, \pi_k) - V(M, \pi_k)$ and $V(M, \pi_k) - V(\tilde{M}_k, \pi_k)$. We focus our analysis on bounding $\mathbb{E}\left[\sum_{k=1}^K |V(\overline{M}_k, \pi_k) - V(M, \pi_k)|\right]$; the other term can be bounded in an identical manner,² so we omit this analysis.

² In particular, an analogue of Lemma 7 holds where we replace $\overline{M}_k$ with $\tilde{M}_k$, $\overline{V}^k_{h+1}$ with the value function $\tilde{V}^k_{h+1}$ corresponding to policy $\pi_k$ in the MDP $\tilde{M}_k$, and the Gaussian noise $w_k$ with the fictitious noise terms $\tilde{w}_k$.

Lemma 7. Let $c = \Phi(-1)^{-1} < 6.31$. Then for any $K \in \mathbb{N}$,
$$\mathbb{E}\left[\sum_{k=1}^K \left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] \le \sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2}\ \sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\overline{V}^k_{h+1}\right\|_\infty^2} + \mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|\epsilon^k_R(h, s^k_h, a^k_h)\right|\right] + \mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|w_k(h, s^k_h, a^k_h)\right|\right],$$
where $\overline{V}^k_{h+1}$ denotes the value function of $\pi_k$ in the perturbed MDP $\overline{M}_k$.

The remaining lemmas complete the proof. At each stage, RLSVI adds Gaussian noise with standard deviation no larger than $\tilde{O}(H^{3/2}\sqrt{S})$. Ignoring extremely low probability events, we therefore expect $\|\overline{V}^k_{h+1}\|_\infty \le \tilde{O}(H^{5/2}\sqrt{S})$ and hence $\sum_{h=1}^{H-1}\|\overline{V}^k_{h+1}\|_\infty^2 \le \tilde{O}(H^6 S)$. The proof of Lemma 8 makes this precise by applying appropriate maximal inequalities.

Lemma 8. $\displaystyle\sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\overline{V}^k_{h+1}\right\|_\infty^2} = \tilde{O}\left(H^3\sqrt{SK}\right)$.

The next few lemmas are essentially a consequence of analysis in [13] and many subsequent papers. We give proof sketches in the appendix. The main idea is to apply known concentration inequalities to bound $\|\epsilon^k_P(h,s,a)\|_1^2$, $|\epsilon^k_R(h, s^k_h, a^k_h)|$, or $|w_k(h, s^k_h, a^k_h)|$ in terms of either $1/n_k(h, s^k_h, a^k_h)$ or $1/\sqrt{n_k(h, s^k_h, a^k_h)}$. The pigeonhole principle gives $\sum_{k=1}^K\sum_{h=1}^{H-1} 1/n_k(h, s^k_h, a^k_h) = O(HSA\log(K))$ and $\sum_{k=1}^K\sum_{h=1}^{H-1} 1/\sqrt{n_k(h, s^k_h, a^k_h)} = O(\sqrt{HSAK})$.

Lemma 9. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2\right] = \tilde{O}\left(S^2 AH\right)$.

Lemma 10. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|\epsilon^k_R(h, s^k_h, a^k_h)\right|\right] = \tilde{O}\left(\sqrt{SAKH}\right)$.

Lemma 11. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|w_k(h, s^k_h, a^k_h)\right|\right] = \tilde{O}\left(H^{3/2} S\sqrt{AKH}\right)$.

Acknowledgments. Much of my understanding of randomized value functions comes from a collaboration with Ian Osband, Ben Van Roy, and Zheng Wen.
Mark Sellke and Chao Qin each noticed the same error in the proof of Lemma 6 in the initial draft of this paper. The lemma has now been revised. I am extremely grateful for their careful reading of the paper.

References

[1] Marc Abeille, Alessandro Lazaric, et al. Linear Thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.

[2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

[3] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.

[4] John Asmuth, Lihong Li, Michael L. Littman, Ali Nouri, and David Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 19–26. AUAI Press, 2009.

[5] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 263–272. JMLR.org, 2017.

[6] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.

[7] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[8] Ronen I. Brafman and Moshe Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

[9] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. 2019.

[10] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.

[11] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.

[12] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. 2018.

[13] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[14] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

[15] Sham Machandranath Kakade et al. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, London, England, 2003.

[16] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

[17] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.
[18] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Preprint, 2018.

[19] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

[20] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

[21] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pages 2377–2386, 2016.

[22] Ian Osband, Benjamin Van Roy, Daniel Russo, and Zheng Wen. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.

[23] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.

[24] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.

[25] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.

[26] Malcolm Strens. A Bayesian framework for reinforcement learning. In ICML, volume 2000, pages 943–950, 2000.

[27] Ahmed Touati, Harsh Satija, Joshua Romoff, Joelle Pineau, and Pascal Vincent. Randomized value functions via multiplicative normalizing flows. arXiv preprint arXiv:1806.02315, 2018.

[28] Nikolaos Tziortziotis, Christos Dimitrakakis, and Michalis Vazirgiannis. Randomised Bayesian least-squares policy iteration. arXiv preprint arXiv:1904.03535, 2019.

[29] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.

A Omitted Proofs

A.1 Proof of Lemma 2

Lemma 2 (Validity of confidence sets). $\sum_{k=1}^\infty \mathbb{P}\left(\hat{M}_k \notin \mathcal{M}_k\right) \le \frac{\pi^2}{6}$.

Proof. The following construction is the standard way concentration inequalities are applied in bandit models and tabular reinforcement learning; see the discussion of what Lattimore and Szepesvári [18] call a "stack of rewards" model in Subsection 4.6. For every tuple $z = (h,s,a)$, generate two i.i.d. sequences of random variables $r_{z,n} \sim R_{h,s,a}$ and $s_{z,n} \sim P_{h,s,a}(\cdot)$. Here $r_{(h,s,a),n}$ denotes the reward and $s_{(h,s,a),n}$ the state transition generated the $n$th time action $a$ is played in state $s$ at period $h$. Set
$$Y_{z,n} = r_{z,n} + V^*_{h+1}(s_{z,n}), \qquad n \in \mathbb{N}.$$
These are i.i.d., with $Y_{z,n} \in [0, H]$ since $\|V^*_{h+1}\|_\infty \le H - 1$, and satisfy $\mathbb{E}[Y_{z,n}] = R_{h,s,a} + \langle P_{h,s,a}, V^*_{h+1}\rangle$. By Hoeffding's inequality, for any $\delta_n \in (0,1)$,
$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n Y_{(h,s,a),i} - R_{h,s,a} - \langle P_{h,s,a}, V^*_{h+1}\rangle\right| \ge H\sqrt{\frac{\log(2/\delta_n)}{2n}}\right) \le \delta_n.$$
For $\delta_n = \frac{1}{HSAn^2}$, a union bound over the $HSA$ values of $z = (h,s,a)$ and all possible $n$ gives
$$\mathbb{P}\left(\bigcup_{h,s,a,n}\left\{\left|\frac{1}{n}\sum_{i=1}^n Y_{(h,s,a),i} - R_{h,s,a} - \langle P_{h,s,a}, V^*_{h+1}\rangle\right| \ge H\sqrt{\frac{\log(2/\delta_n)}{2n}}\right\}\right) \le \sum_{n=1}^\infty HSA\,\delta_n = \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}.$$
Now, by definition, if $n_k(h,s,a) = n > 0$, we have
$$\hat{R}^k_{h,s,a} + \langle \hat{P}^k_{h,s,a},\ V^*_{h+1}\rangle = \frac{1}{n}\sum_{i=1}^n Y_{(h,s,a),i}.$$
Therefore, the above shows that
$$\mathbb{P}\left(\exists\, (k,h,s,a) : n_k(h,s,a) > 0,\ \left|\hat{R}^k_{h,s,a} - R_{h,s,a} + \langle \hat{P}^k_{h,s,a} - P_{h,s,a},\ V^*_{h+1}\rangle\right| \ge H\sqrt{\frac{\log\left(2HSA\,n_k(h,s,a)\right)}{2\,n_k(h,s,a)}}\right)$$
is upper bounded by $\pi^2/6$. Note that, by definition, when $n_k(h,s,a) > 0$ we have
$$\sqrt{e_k(h,s,a)} \ge H\sqrt{\frac{\log\left(2HSA\,n_k(h,s,a)\right)}{2\,n_k(h,s,a)}},$$
and hence this concentration inequality holds with $\sqrt{e_k(h,s,a)}$ on the right hand side. When $n_k(h,s,a) = 0$, we have the trivial bound
$$\left|\hat{R}^k_{h,s,a} - R_{h,s,a} + \langle \hat{P}^k_{h,s,a} - P_{h,s,a},\ V^*_{h+1}\rangle\right| = \left|R_{h,s,a} + \langle P_{h,s,a},\ V^*_{h+1}\rangle\right| \le H \le \sqrt{e_k(h,s,a)},$$
since we have defined the empirical estimates to satisfy $\hat{R}^k_{h,s,a} = 0$ and $\hat{P}^k_{h,s,a}(\cdot) = 0$ in the case that $(h,s,a)$ has never been played.

A.2 Proof of Lemma 6

Lemma 6. For the absolute constant $c = \Phi(-1)^{-1} < 6.31$, we have
$$\mathrm{Regret}(M, K, \mathrm{RLSVI}_\beta) \le (c+1)\,\mathbb{E}\left[\sum_{k=1}^K \left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + c\,\mathbb{E}\left[\sum_{k=1}^K \left|V(\tilde{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + H\underbrace{\sum_{k=1}^K \mathbb{P}\left(\hat{M}_k \notin \mathcal{M}_k\right)}_{\le \pi^2/6}.$$

Proof. Recall that $\mathcal{H}_{k-1} = \{(s^i_h, a^i_h, r^i_h) : h = 1, \ldots, H,\ i = 1, \ldots, k-1\}$. Conditioned on $\mathcal{H}_{k-1}$, the quantities $\overline{M}_k$, $\pi_k$, and $\tilde{M}_k$ are random only due to the internal randomness of the RLSVI algorithm. Set $\mathbb{E}_k[\cdot] = \mathbb{E}[\cdot \mid \mathcal{H}_{k-1}]$. Suppose that $\hat{M}_k \in \mathcal{M}_k$. Then by Lemma 4,
$$\mathbb{P}\left(V(\overline{M}_k, \pi_k) \ge V(M, \pi^*) \mid \mathcal{H}_{k-1}\right) \ge \Phi(-1). \qquad (4)$$
We begin with the regret decomposition
$$\mathbb{E}_k\left[V(M, \pi^*) - V(M, \pi_k)\right] = \mathbb{E}_k\left[V(M, \pi^*) - V(\overline{M}_k, \pi_k)\right] + \mathbb{E}_k\left[V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right]. \qquad (5)$$
We focus on the first term. We show
$$V(M, \pi^*) - \mathbb{E}_k\left[V(\overline{M}_k, \pi_k)\right] \le c\,\mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k\left[V(\overline{M}_k, \pi_k)\right]\right)^+\right]. \qquad (6)$$
The inequality is immediate if $V(M, \pi^*) < \mathbb{E}_k[V(\overline{M}_k, \pi_k)]$. We now show it when $a \equiv V(M, \pi^*) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)] \ge 0$. Then,
$$\begin{aligned}
\mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)^+\right] &\ge a\,\mathbb{P}_k\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)] \ge a\right) \\
&= \left(V(M, \pi^*) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)\mathbb{P}_k\left(V(\overline{M}_k, \pi_k) \ge V(M, \pi^*)\right) \\
&\ge \left(V(M, \pi^*) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)\Phi(-1),
\end{aligned}$$
where the first step applies Markov's inequality, the second simply plugs in for $a$, and the third uses Equation (4). Dividing each side by $\Phi(-1)$ gives Equation (6). Hence we have shown
$$\mathbb{E}_k\left[V(M, \pi^*) - V(M, \pi_k)\right] \le c\,\mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)^+\right] + \mathbb{E}_k\left[V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right]. \qquad (7)$$
We complete our argument by bounding $\mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)^+\right]$. For each fixed (nonrandom) policy $\pi$, define $\mu(\pi) \equiv \mathbb{E}_k\left[V(\tilde{M}_k, \pi)\right] = \mathbb{E}_k\left[V(\overline{M}_k, \pi)\right]$. Notice that $\mu(\pi_k) = \mathbb{E}_k\left[V(\tilde{M}_k, \pi_k) \mid \pi_k\right]$ almost surely. This relies on the fact that $\tilde{M}_k$ and $\pi_k$ are independent conditioned on the history $\mathcal{H}_{k-1}$. In general $\mu(\pi_k) \ne \mathbb{E}_k\left[V(\overline{M}_k, \pi_k) \mid \pi_k\right]$, since $\pi_k$ is the optimal policy under $\overline{M}_k$ and so these two are statistically dependent. Now, for every policy $\pi$,
$$\mu(\pi) = \mathbb{E}_k\left[V(\overline{M}_k, \pi)\right] \le \mathbb{E}_k\left[\sup_{\pi'} V(\overline{M}_k, \pi')\right] = \mathbb{E}_k\left[V(\overline{M}_k, \pi_k)\right].$$
So $\mu(\pi_k) \le \mathbb{E}_k\left[V(\overline{M}_k, \pi_k)\right]$ almost surely. Using this, we find
$$\begin{aligned}
\mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mathbb{E}_k[V(\overline{M}_k, \pi_k)]\right)^+\right] &\le \mathbb{E}_k\left[\left(V(\overline{M}_k, \pi_k) - \mu(\pi_k)\right)^+\right] \\
&\le \mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - \mu(\pi_k)\right|\right] \\
&= \mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - \mathbb{E}_k\left[V(\tilde{M}_k, \pi_k) \mid \pi_k, \overline{M}_k\right]\right|\right] \\
&\le \mathbb{E}_k\left[\mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - V(\tilde{M}_k, \pi_k)\right| \;\middle|\; \pi_k, \overline{M}_k\right]\right] \\
&= \mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - V(\tilde{M}_k, \pi_k)\right|\right] \\
&\le \mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + \mathbb{E}_k\left[\left|V(\tilde{M}_k, \pi_k) - V(M, \pi_k)\right|\right].
\end{aligned}$$
Plugging this into (7) shows that, for any history $\mathcal{H}_{k-1}$ with $\hat{M}_k \in \mathcal{M}_k$,
$$\mathbb{E}_k\left[V(M, \pi^*) - V(M, \pi_k)\right] \le (c+1)\,\mathbb{E}_k\left[\left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + c\,\mathbb{E}_k\left[\left|V(\tilde{M}_k, \pi_k) - V(M, \pi_k)\right|\right].$$
In the unlikely event $\hat{M}_k \notin \mathcal{M}_k$, we have the worst case bound $0 \le V(M, \pi^*) - V(M, \pi_k) \le H$. Combining these two cases and taking expectations gives
$$\mathbb{E}\left[V(M, \pi^*) - V(M, \pi_k)\right] \le H\,\mathbb{P}\left(\hat{M}_k \notin \mathcal{M}_k\right) + (c+1)\,\mathbb{E}\left[\left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] + c\,\mathbb{E}\left[\left|V(\tilde{M}_k, \pi_k) - V(M, \pi_k)\right|\right].$$
Summing over $k$ concludes the proof.

A.3 Proof of Lemma 7

Lemma 7. Let $c = \Phi(-1)^{-1} < 6.31$. Then for any $K \in \mathbb{N}$,
$$\mathbb{E}\left[\sum_{k=1}^K \left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right|\right] \le \sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2}\ \sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\overline{V}^k_{h+1}\right\|_\infty^2} + \mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|\epsilon^k_R(h, s^k_h, a^k_h)\right|\right] + \mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|w_k(h, s^k_h, a^k_h)\right|\right].$$

Proof. We bound each term in the bound in Lemma 6. By applying Lemma 3, with $M$ taken to be the true MDP and $\tilde{M} = \overline{M}_k$, the largest term is bounded, for any $k \in \mathbb{N}$, as
$$\begin{aligned}
\left|V(\overline{M}_k, \pi_k) - V(M, \pi_k)\right| &= \left|\mathbb{E}\left[\sum_{h=1}^H \left\langle \hat{P}^k_{h,s^k_h,a^k_h} - P_{h,s^k_h,a^k_h},\ \overline{V}^k_{h+1}\right\rangle + \hat{R}^k_{h,s^k_h,a^k_h} + w_k(h, s^k_h, a^k_h) - R_{h,s^k_h,a^k_h} \;\middle|\; \pi_k, \mathcal{H}_{k-1}\right]\right| \\
&\le \mathbb{E}\left[\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1 \left\|\overline{V}^k_{h+1}\right\|_\infty \;\middle|\; \pi_k, \mathcal{H}_{k-1}\right] + \mathbb{E}\left[\sum_{h=1}^H \left(\left|\epsilon^k_R(h, s^k_h, a^k_h)\right| + \left|w_k(h, s^k_h, a^k_h)\right|\right) \;\middle|\; \pi_k, \mathcal{H}_{k-1}\right].
\end{aligned}$$
Taking expectations, summing over $k$, and applying Cauchy-Schwarz gives the result.

A.4 Proof of Lemma 8

The proof relies on the following maximal inequality.

Lemma 12 (Example 2.7 from [7]). If $X_1, \ldots, X_n$ are i.i.d. random variables following a $\chi^2_1$ distribution, then
$$\mathbb{E}\left[\max_{i \le n} X_i\right] \le 1 + \sqrt{2\log(n)} + 2\log(n).$$

Let us now recall Lemma 8.

Lemma 8. $\displaystyle\sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\overline{V}^k_{h+1}\right\|_\infty^2} = \tilde{O}\left(H^3\sqrt{SK}\right)$.

Proof. We have
$$\sqrt{\mathbb{E}\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\overline{V}^k_{h+1}\right\|_\infty^2} \le \sqrt{HK\,\mathbb{E}\left[\max_{k \le K, h \le H} \left\|\overline{V}^k_{h+1}\right\|_\infty^2\right]}.$$
Now
$$\overline{V}^k_{h+1}(s') \le (H - h - 1)\left(1 + \max_{h,s,a} w_k(h,s,a)\right), \qquad \overline{V}^k_{h+1}(s') \ge (H - h - 1)\,\min_{h,s,a} w_k(h,s,a).$$
Together these give, for all $k \le K$ and $h \in \{1, \ldots, H-1\}$,
$$\left\|\overline{V}^k_{h+1}\right\|_\infty^2 \le H^2\left(1 + \max_{k \le K, h,s,a} |w_k(h,s,a)|\right)^2 \le 4H^2 + 4H^2 \max_{k \le K, h,s,a} |w_k(h,s,a)|^2.$$
We have $w_k(h,s,a) = \sigma_k(h,s,a)\,\xi^k_{h,s,a}$, where the $\xi^k_{h,s,a} \sim N(0,1)$ are drawn i.i.d. across $(h,s,a)$.
Set $X^k_{h,s,a} = (\xi^k_{h,s,a})^2$, each of which follows a chi-squared distribution with 1 degree of freedom. Then
$$\begin{aligned}
\mathbb{E}\left[\max_{k \le K, h,s,a} |w_k(h,s,a)|^2\right] &\le \left(\max_{k \le K, h,s,a} \sigma^2_k(h,s,a)\right)\mathbb{E}\left[\max_{k \le K, h,s,a} |\xi^k_{h,s,a}|^2\right] \\
&= \left(\max_{k \le K, h,s,a} \sigma^2_k(h,s,a)\right)\mathbb{E}\left[\max_{k \le K, h,s,a} X^k_{h,s,a}\right] \\
&\le SH^3\log(2SAHK)\,\mathbb{E}\left[\max_{k \le K, h,s,a} X^k_{h,s,a}\right] \\
&\le SH^3\log(2SAHK)\left(1 + \sqrt{2\log(SAHK)} + 2\log(SAHK)\right) \\
&= O\left(SH^3\log(2SAHK)^2\right).
\end{aligned}$$
This gives us
$$\sqrt{KH\,\mathbb{E}\left[\max_{k \le K, h \le H} \left\|\overline{V}^k_{h+1}\right\|_\infty^2\right]} = \tilde{O}\left(\sqrt{KH \cdot H^2 \cdot SH^3}\right) = \tilde{O}\left(H^3\sqrt{SK}\right).$$

A.5 Proof sketch of Lemma 9

This result relies on an inequality of Weissman et al. [29], which we now restate.

Lemma 13 (L1 deviation bound). If $p$ is a probability distribution over $\mathcal{S} = \{1, \ldots, S\}$ and $\hat{p}$ is the empirical distribution constructed from $n$ i.i.d. draws from $p$, then for any $\epsilon > 0$,
$$\mathbb{P}\left(\|\hat{p} - p\|_1 \ge \epsilon\right) \le \left(2^S - 2\right)\exp\left(-\frac{n\epsilon^2}{2}\right).$$

Lemma 9. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2\right] = \tilde{O}\left(S^2 AH\right)$.

Proof sketch. By picking an appropriate $\epsilon$ in Lemma 13, as in [13, Appendix C.1], together with a union bound over all $HSA$ possible values of the tuple $(h,s,a)$, there exists a numerical constant $c$ such that
$$\mathbb{P}\left(\bigcup_{s,a,h,k \le K}\left\{\left\|\hat{P}^k_{h,s,a} - P_{h,s,a}\right\|_1 \ge c\sqrt{\frac{S\log(1 + HSAK)}{n_k(h,s,a) + 1}}\right\}\right) \le \frac{1}{KH}. \qquad (8)$$
Set $\beta_k(h,s,a) = \frac{S\ell}{n_k(h,s,a)+1}$, where $\ell = c^2\log(1 + HSAK)$ denotes a logarithmic factor. Recall the definition $\epsilon^k_P(h,s,a) \equiv \hat{P}^k_{h,s,a} - P_{h,s,a}$. Let $B$ be the "bad event" that $\|\epsilon^k_P(h,s,a)\|_1^2 \ge \beta_k(h,s,a)$ for some $(h,s,a)$ and $k \le K$. Since $\|\epsilon^k_P(h,s,a)\|_1 \le 2$ always, and $\mathbb{P}(B) \le \frac{1}{KH}$ by (8), we have
$$\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2\,\mathbf{1}(B)\right] \le 4KH \cdot \frac{1}{KH} = 4. \qquad (9)$$
On the other hand, on the event $B^c$ we have the bound
$$\sum_{k=1}^K\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2 \le \sum_{k=1}^K\sum_{h=1}^{H-1} \beta_k(h, s^k_h, a^k_h) = S\ell\sum_{k=1}^K\sum_{h=1}^{H-1} \frac{1}{n_k(h, s^k_h, a^k_h) + 1} \le S\ell\sum_{h,s,a}\sum_{n=0}^{n_K(h,s,a)} \frac{1}{n+1} = S\ell \cdot O(HSA\log(K)) = \tilde{O}\left(S^2AH\right).$$

A.6 Proof sketch of Lemma 10

Lemma 10. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|\epsilon^k_R(h, s^k_h, a^k_h)\right|\right] = \tilde{O}\left(\sqrt{SAKH}\right)$.

Proof sketch. The proof is similar to that of Lemma 9. By Hoeffding's inequality together with a union bound, we can ensure that $|\epsilon^k_R(h,s,a)| \le c\sqrt{\frac{\log(1+HSAK)}{n_k(h,s,a)+1}}$ for all $k \le K$ and all tuples $(h,s,a)$, except on some bad event that, as in (9), contributes at most a constant to the bound. Now the result follows from using the pigeonhole principle to conclude
$$\sum_{k=1}^K\sum_{h=1}^H \frac{1}{\sqrt{n_k(h, s^k_h, a^k_h)}} = O\left(\sqrt{HSAK}\right).$$
This kind of bound is standard in the RL and bandit literature; see [19, Appendix A] for one proof.

A.7 Proof sketch of Lemma 11

Lemma 11. $\displaystyle\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|w_k(h, s^k_h, a^k_h)\right|\right] = \tilde{O}\left(H^{3/2} S\sqrt{AKH}\right)$.

Proof. Recall $\sigma_k(h,s,a) = \sqrt{\frac{\beta_k}{n_k(h,s,a)+1}}$, where $\beta_k = \tilde{O}(SH^3)$. Write $w_k(h,s,a) = \sigma_k(h,s,a)\,\xi_k(h,s,a)$, where $\xi_k(h,s,a) \sim N(0,1)$ and the array of random variables $\{\xi_k(h,s,a) : 1 \le k \le K,\ 1 \le h \le H,\ s \in \mathcal{S},\ a \in \mathcal{A}\}$ is drawn independently. By Hölder's inequality,
$$\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \left|w_k(h, s^k_h, a^k_h)\right|\right] \le \mathbb{E}\left[\max_{k \le K, h,s,a} |\xi_k(h,s,a)|\right]\mathbb{E}\left[\sum_{k=1}^K\sum_{h=1}^H \sigma_k(h, s^k_h, a^k_h)\right].$$
The (sub-)Gaussian maximal inequality gives $\mathbb{E}\left[\max_{k \le K, h,s,a} |\xi_k(h,s,a)|\right] = O\left(\sqrt{\log(HSAK)}\right)$. To simplify the next expression, note that $\beta_k \le \beta_K$.
On any sample path, by the same argument as in Lemma 10, we have
$$\sum_{k=1}^K\sum_{h=1}^H \sigma_k(h, s^k_h, a^k_h) \le \sqrt{\beta_K}\sum_{k=1}^K\sum_{h=1}^H \sqrt{\frac{1}{n_k(h, s^k_h, a^k_h)+1}} = O\left(\sqrt{\beta_K}\sqrt{HSAK}\right).$$
Multiplying the two factors, with $\sqrt{\beta_K} = \tilde{O}(H^{3/2}\sqrt{S})$, gives $\tilde{O}\left(H^{3/2}S\sqrt{AKH}\right)$, as claimed.
