Why is Posterior Sampling Better than Optimism for Reinforcement Learning?


Authors: Ian Osband, Benjamin Van Roy

Abstract

Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms existing algorithms driven by optimism, such as UCRL2. We provide insight into the extent of this performance boost and the phenomenon that drives it. We leverage this insight to establish an Õ(H√(SAT)) Bayesian regret bound for PSRL in finite-horizon episodic Markov decision processes. This improves upon the best previous Bayesian regret bound of Õ(HS√(AT)) for any reinforcement learning algorithm. Our theoretical results are supported by extensive empirical evaluation.

1. Introduction

We consider the reinforcement learning problem in which an agent interacts with a Markov decision process with the aim of maximizing expected cumulative reward (Burnetas & Katehakis, 1997; Sutton & Barto, 1998). Key to performance is how the agent balances between exploration to acquire information of long-term benefit and exploitation to maximize expected near-term rewards. In principle, dynamic programming can be applied to compute the Bayes-optimal solution to this problem (Bellman & Kalaba, 1959). However, this is computationally intractable for anything beyond the simplest of toy problems, and direct approximations can fail spectacularly (Munos, 2014). As such, researchers have proposed and analyzed a number of heuristic reinforcement learning algorithms. The literature on efficient reinforcement learning offers statistical efficiency guarantees for computationally tractable algorithms.
These provably efficient algorithms (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002) predominantly address the exploration-exploitation trade-off via optimism in the face of uncertainty (OFU): at any state, the agent assigns to each action an optimistically biased estimate of future value and selects the action with the greatest estimate. If a selected action is not near-optimal, the estimate must be overly optimistic, in which case the agent learns from the experience. Efficiency relative to less sophisticated exploration arises as the agent avoids actions that can neither yield high value nor informative data.

(Affiliations: Stanford University, California, USA; DeepMind, London, UK. Correspondence to: Ian Osband. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).)

An alternative approach, based on Thompson sampling (Thompson, 1933), involves sampling a statistically plausible set of action values and selecting the maximizing action. These values can be generated, for example, by sampling from the posterior distribution over MDPs and computing the state-action value function of the sampled MDP. This approach, originally proposed in Strens (2000), is called posterior sampling for reinforcement learning (PSRL). Computational results from Osband et al. (2013) demonstrate that PSRL dramatically outperforms existing algorithms based on OFU. The primary aim of this paper is to provide insight into the extent of this performance boost and the phenomenon that drives it.

We show that, in Bayesian expectation and up to constant factors, PSRL matches the statistical efficiency of any standard algorithm for OFU-RL. We highlight two key shortcomings of existing state-of-the-art algorithms for OFU (Jaksch et al., 2010) and demonstrate that PSRL does not suffer from these inefficiencies.
We leverage this insight to produce an Õ(H√(SAT)) bound for the Bayesian regret of PSRL in finite-horizon episodic Markov decision processes, where H is the horizon, S is the number of states, A is the number of actions and T is the time elapsed. This improves upon the best previous bound of Õ(HS√(AT)) for any RL algorithm. We discuss why we believe PSRL satisfies an even tighter Õ(√(HSAT)) bound, though we have not proved that. We complement our theory with computational experiments that highlight the issues we raise; empirical results match our theoretical predictions.

More importantly, we highlight a tension in OFU-RL between statistical efficiency and computational tractability. We argue that any OFU algorithm that matches PSRL in statistical efficiency would likely be computationally intractable. We provide proof of this claim in a restricted setting. Our key insight, and the potential benefits of exploration guided by posterior sampling, are not restricted to the simple tabular MDPs we analyze.

2. Problem formulation

We consider the problem of learning to optimize a random finite-horizon MDP M* = (S, A, R*, P*, H, ρ) over repeated episodes of interaction, where S = {1, .., S} is the state space, A = {1, .., A} is the action space, H is the horizon, and ρ is the initial state distribution. In each time period h = 1, .., H within an episode, the agent observes state s_h ∈ S, selects action a_h ∈ A, receives a reward r_h ∼ R*(s_h, a_h), and transitions to a new state s_{h+1} ∼ P*(s_h, a_h). We note that this formulation, where the unknown MDP M* is treated as itself a random variable, is often called Bayesian reinforcement learning. A policy µ is a mapping from state s ∈ S and period h = 1, .., H to action a ∈ A.
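To make the episodic interaction protocol of Section 2 concrete, here is a minimal sketch of a finite-horizon tabular MDP in Python. This is our own illustration, not the paper's code; for simplicity the agent receives the mean reward rather than a sampled one.

```python
import numpy as np

class TabularMDP:
    """Finite-horizon MDP M = (S, A, R, P, H, rho) with tabular dynamics.
    Names are illustrative only (not from the paper's implementation)."""
    def __init__(self, S, A, H, R, P, rho, rng=None):
        self.S, self.A, self.H = S, A, H
        self.R = R            # R[s, a] = mean reward
        self.P = P            # P[s, a] = next-state distribution, shape (S, A, S)
        self.rho = rho        # initial state distribution
        self.rng = rng or np.random.default_rng(0)

    def run_episode(self, policy):
        """Roll out one episode under policy[h][s] -> action; return total reward."""
        s = self.rng.choice(self.S, p=self.rho)
        total = 0.0
        for h in range(self.H):
            a = policy[h][s]
            total += self.R[s, a]                     # mean reward for simplicity
            s = self.rng.choice(self.S, p=self.P[s, a])
        return total
```

An RL algorithm in the paper's sense then chooses a new `policy` before each episode as a function of the history observed so far.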
For each MDP M and policy µ we define the state-action value function for each period h:

Q^M_{µ,h}(s, a) := E_{M,µ}[ Σ_{j=h}^{H} r̄^M(s_j, a_j) | s_h = s, a_h = a ],   (1)

where r̄^M(s, a) = E[r | r ∼ R^M(s, a)]. The subscript µ indicates that actions over periods h+1, ..., H are selected according to the policy µ. Let V^M_{µ,h}(s) := Q^M_{µ,h}(s, µ(s, h)). We say a policy µ^M is optimal for the MDP M if µ^M ∈ arg max_µ V^M_{µ,h}(s) for all s ∈ S and h = 1, ..., H.

Let H_t denote the history of observations made prior to time t. To highlight the time evolution within episodes, with some abuse of notation, we let s_{kh} = s_t for t = (k−1)H + h, so that s_{kh} is the state in period h of episode k. We define H_{kh} analogously. An RL algorithm is a deterministic sequence {π_k | k = 1, 2, ...} of functions, each mapping H_{k1} to a probability distribution π_k(H_{k1}) over policies, from which the agent samples a policy µ_k for the kth episode.

We define the regret incurred by an RL algorithm π up to time T to be

Regret(T, π, M*) := Σ_{k=1}^{⌈T/H⌉} Δ_k,   (2)

where Δ_k denotes the regret over the kth episode, defined with respect to the true MDP M* by

Δ_k := Σ_{s∈S} ρ(s)(V^{M*}_{µ*,1}(s) − V^{M*}_{µ_k,1}(s))   (3)

with µ* = µ^{M*}. We note that the regret in (2) is random, since it depends on the unknown MDP M*, the learning algorithm π and, through the history H_t, on the sampled transitions and rewards. We define

BayesRegret(T, π, φ) := E[Regret(T, π, M*) | M* ∼ φ],   (4)

the Bayesian expected regret for M* distributed according to the prior φ. We will assess and compare algorithm performance in terms of the regret and the BayesRegret.

2.1. Relating performance guarantees

For the most part, the literature on efficient RL is sharply divided between the frequentist and Bayesian perspectives (Vlassis et al., 2012).
By volume, most papers focus on minimax regret bounds that hold with high probability for any M* ∈ M, for some class of MDPs M (Jaksch et al., 2010). Bounds on the BayesRegret are generally weaker analytical statements than minimax bounds on regret. A regret bound for any M* ∈ M implies an identical bound on the BayesRegret for any φ with support on M. A partial converse is available for M* drawn with non-zero probability under φ, but it does not hold in general (Osband et al., 2013).

Another common notion of performance guarantee is given by so-called "sample complexity" or PAC analyses that bound the number of ε-sub-optimal decisions taken by an algorithm (Kakade, 2003; Dann & Brunskill, 2015). In general, optimal bounds on regret Õ(√T) imply optimal bounds on sample complexity Õ(ε⁻²), whereas optimal bounds on sample complexity give only an Õ(T^{2/3}) bound on regret (Osband, 2016).

Our formulation focuses on the simple setting of finite-horizon MDPs, but there are several other problems of interest in the literature. Common formulations include the discounted setting¹ and problems with infinite horizon under some connectedness assumption (Bartlett & Tewari, 2009). This paper may contain insights that carry over to these settings, but we leave that analysis to future work.

Our analysis focuses upon Bayesian expected regret in finite-horizon MDPs. We find this criterion amenable to (relatively) simple analysis and use it to obtain actionable insight for the design of practical algorithms. We absolutely do not "close the book" on the exploration/exploitation problem; there remain many important open questions. Nonetheless, our work may help to develop understanding of some of the outstanding issues of statistical and computational efficiency in RL. In particular, we shed some light on how and why posterior sampling performs so much better than existing algorithms for OFU-RL.
Crucially, we believe that many of these insights extend beyond the stylized problem of finite tabular MDPs and can help to guide the design of practical algorithms for generalization and exploration via randomized value functions (Osband, 2016).

3. Posterior sampling as stochastic optimism

There is a well-known connection between posterior sampling and optimistic algorithms (Russo & Van Roy, 2014). In this section we highlight the similarity of these approaches. We argue that posterior sampling can be thought of as a stochastically optimistic algorithm.

¹Discount γ = 1 − 1/H gives an effective horizon O(H).

Before each episode, a typical OFU algorithm constructs a confidence set to represent the range of MDPs that are statistically plausible given prior knowledge and observations. Then, a policy is selected by maximizing value simultaneously over policies and MDPs in this set. The agent then follows this policy over the episode. It is interesting to contrast this approach with PSRL: instead of maximizing over a confidence set, PSRL samples a single statistically plausible MDP and selects a policy to maximize value for that MDP.

Algorithm 1 OFU RL
Input: confidence set constructor Φ
1: for episode k = 1, 2, .. do
2:   Construct confidence set M_k = Φ(H_{k1})
3:   Compute µ_k ∈ arg max_{µ, M ∈ M_k} V^M_{µ,1}
4:   for timestep h = 1, .., H do
5:     take action a_{kh} = µ_k(s_{kh}, h)
6:     update H_{kh+1} = H_{kh} ∪ (s_{kh}, a_{kh}, r_{kh}, s_{kh+1})
7:   end for
8: end for

Algorithm 2 PSRL
Input: prior distribution φ
1: for episode k = 1, 2, .. do
2:   Sample MDP M_k ∼ φ(· | H_{k1})
3:   Compute µ_k ∈ arg max_µ V^{M_k}_{µ,1}
4:   for timestep h = 1, .., H do
5:     take action a_{kh} = µ_k(s_{kh}, h)
6:     update H_{kh+1} = H_{kh} ∪ (s_{kh}, a_{kh}, r_{kh}, s_{kh+1})
7:   end for
8: end for
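As a concrete sketch, the per-episode step of Algorithm 2 (PSRL) can be written in a few lines for tabular MDPs. This is a hypothetical rendering, assuming a Dirichlet posterior over transitions and a conjugate N(0, 1) prior on mean rewards with assumed N(0, 1) observation noise, matching the priors used in the paper's experiments; the function names are ours.

```python
import numpy as np

def solve_finite_horizon(R, P, H):
    """Backward induction: optimal policy and value for a known MDP (R, P)."""
    S, A = R.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V            # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V

def psrl_sample_policy(trans_counts, rew_sum, visits, H, rng):
    """One PSRL episode: sample M_k from the posterior and solve it.
    trans_counts[s, a] are Dirichlet pseudo-counts; with a N(0, 1) prior and
    unit-variance noise, the posterior over the mean reward at (s, a) is
    N(rew_sum / (visits + 1), 1 / (visits + 1))."""
    S, A, _ = trans_counts.shape
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(trans_counts[s, a])
    R = rng.normal(rew_sum / (visits + 1), 1.0 / np.sqrt(visits + 1))
    return solve_finite_horizon(R, P, H)
```

The agent follows the returned policy for one episode, updates the counts, and repeats; the computational cost per episode is exactly that of solving one known MDP.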
3.1. The blueprint for OFU regret bounds

The general strategy for the analysis of optimistic algorithms follows a simple recipe (Strehl & Littman, 2005; Szita & Szepesvári, 2010; Munos, 2014):

1. Design confidence sets (via a concentration inequality) such that M* ∈ M_k for all k with probability at least 1 − δ.

2. Decompose the regret in each episode:
   Δ_k = V^{M*}_{µ*,1} − V^{M*}_{µ_k,1} = (V^{M*}_{µ*,1} − V^{M_k}_{µ_k,1}) + (V^{M_k}_{µ_k,1} − V^{M*}_{µ_k,1}) =: Δ^{opt}_k + Δ^{conc}_k,
   where M_k is the imagined optimistic MDP.

3. By step 1, Δ^{opt}_k ≤ 0 for all k with probability at least 1 − δ.

4. Use concentration results with a pigeonhole argument over all possible trajectories {H_{11}, H_{21}, ..} to bound, with probability at least 1 − δ,
   Regret(T, π, M*) ≤ Σ_{k=1}^{⌈T/H⌉} [Δ^{conc}_k | M* ∈ M_k] ≤ f(S, A, H, T, δ).

3.2. Anything OFU can do, PSRL can expect to do too

In this section, we highlight the connection between posterior sampling and any optimistic algorithm in the spirit of Section 3.1. Central to our analysis will be the following notion of stochastic optimism (Osband et al., 2014).

Definition 1 (Stochastic optimism). Let X and Y be real-valued random variables with finite expectation. We say that X is stochastically optimistic for Y if for any convex and increasing u: R → R,

E[u(X)] ≥ E[u(Y)].   (5)

We write X ≽_so Y for this relation.

This notion of optimism is dual to second-order stochastic dominance (Hadar & Russell, 1969): X ≽_so Y if and only if −Y ≽_ssd −X. We say that PSRL is a stochastically optimistic algorithm, since the random imagined value function V^{M_k}_{µ_k,1} is stochastically optimistic for the true optimal value function V^{M*}_{µ*,1} conditioned upon any possible history H_{k1} (Russo & Van Roy, 2014). This observation leads us to a general relationship between PSRL and the BayesRegret of any optimistic algorithm.

Theorem 1 (PSRL matches OFU-RL in BayesRegret).
Let π^opt be any optimistic algorithm for reinforcement learning in the style of Algorithm 1. Suppose π^opt satisfies regret bounds such that, for any M*, any T > 0 and any δ > 0, the regret is bounded with probability at least 1 − δ:

Regret(T, π^opt, M*) ≤ f(S, A, H, T, δ).   (6)

Then, if φ is the distribution of the true MDP M* and the proof of (6) follows Section 3.1, for all T > 0,

BayesRegret(T, π^PSRL, φ) ≤ 2 f(S, A, H, T, δ = T⁻¹) + 2.   (7)

Sketch proof. This result is established in Osband et al. (2013) for the special case π^opt = π^UCRL2. We include this small sketch as a refresher and a guide for high-level intuition. First, note that conditioned upon any data H_{k1}, the true MDP M* and the sampled M_k are identically distributed. This means that E[Δ^{opt}_k | H_{k1}] ≤ 0 for all k. Therefore, to establish a bound upon the Bayesian regret of PSRL, we just need to bound Σ_{k=1}^{⌈T/H⌉} E[Δ^{conc}_k | H_{k1}]. We can use that M* | H_{k1} =_D M_k | H_{k1} again in step 1 of Section 3.1 to say that both M* and M_k lie within M_k for all k with probability at least 1 − 2δ, via a union bound. This means we can bound the concentration error in PSRL:

BayesRegret(T, π^PSRL, φ) ≤ Σ_{k=1}^{⌈T/H⌉} E[Δ^{conc}_k | M*, M_k ∈ M_k] + 2δT.

The final step follows from decomposing Δ^{conc}_k by adding and subtracting the imagined optimistic value Ṽ_k generated by π^opt. Through an application of the triangle inequality,

Δ^{conc}_k ≤ |V^{M_k}_{µ_k,1} − Ṽ_k| + |Ṽ_k − V^{M*}_{µ_k,1}|,

we can mirror step 4 to bound the regret from concentration, Σ_{k=1}^{⌈T/H⌉} E[Δ^{conc}_k | M*, M_k ∈ M_k] ≤ 2 f(S, A, H, T, δ). This result (and proof strategy) was established for multi-armed bandits by Russo & Van Roy (2014). We complete the proof of Theorem 1 with the choice δ = T⁻¹ and the fact that the regret is uniformly bounded by T.
Theorem 1 suggests that, in terms of Bayesian expected regret, PSRL performs within a factor of 2 of any optimistic algorithm whose analysis follows Section 3.1. This includes the algorithms UCRL2 (Jaksch et al., 2010), UCFH (Dann & Brunskill, 2015), MORMAX (Szita & Szepesvári, 2010) and many more.

Importantly, and unlike existing OFU approaches, the algorithm's performance is separated from the analysis of the confidence sets M_k. This means that PSRL even attains the big-O scaling of as-yet-undiscovered approaches to OFU, all at a computational cost no greater than solving a single known MDP, even if the matched OFU algorithm π^opt is computationally intractable.

4. Some shortcomings of existing OFU-RL

In this section, we discuss how and why existing OFU algorithms forgo the level of statistical efficiency enjoyed by PSRL. At a high level, this lack of statistical efficiency emerges from a sub-optimal construction of the confidence sets M_k. We present several insights that may prove crucial to the design of improved algorithms for OFU. More worryingly, we raise the concern that the statistically optimal confidence sets M_k would likely be computationally intractable. We argue that PSRL offers a computationally tractable approximation to this unknown "ideal" optimistic algorithm.

Before we launch into a more mathematical argument, it is useful to take intuition from a simple estimation problem, without any decision making. Consider an MDP with A = 1, H = 2, S = 2N + 1 as described in Figure 1. Every episode the agent transitions from s = 0 uniformly to s ∈ {1, .., 2N} and receives a deterministic reward from {0, 1} depending upon this state. The simplicity of these examples means even a naive Monte Carlo estimate of the value should concentrate as 1/2 ± Õ(1/√n) after n episodes of interaction. Nonetheless, the confidence sets suggested by the state-of-the-art OFU-RL algorithm UCRL2 (Jaksch et al.
2010) become severely mis-calibrated as S grows.

To see how this problem occurs, consider any algorithm for model-based OFU-RL that builds up confidence sets for each state and action independently, such as UCRL2. Even if the estimates are tight in each state and action, the resulting optimistic MDP, simultaneously optimistic across each state and action, may be far too optimistic. Geometrically, these independent bounds form a rectangular confidence set. The corners of this rectangle will be a factor of √S misspecified relative to the underlying distribution, an ellipse, when combined across S independent estimates (Figure 3).

Figure 1. MDPs to illustrate the scaling with S.
Figure 2. MDPs to illustrate the scaling with H.
Figure 3. Union bounds give loose rectangular confidence sets.

Several algorithms for OFU-RL do exist which address this loose dependence upon S (Strehl et al., 2006; Szita & Szepesvári, 2010). However, these algorithms depend upon a partitioning of data for future value, which leads to a poor dependence upon the horizon H, or equivalently the effective horizon 1/(1−γ) in discounted problems. We can use the similar toy example of Figure 2 to understand why combining independently optimistic estimates through time will contribute to a loose bound in H.

The natural question to ask is, "Why don't we simply apply these observations to design an optimistic algorithm which is simultaneously efficient in S and H?". The first impediment is that designing such an algorithm requires some new intricate concentration inequalities and analysis. Doing this rigorously may be challenging, but we believe it will be possible through a more careful application of existing tools to the insights we raise above.
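The √S inflation is easy to reproduce numerically. The sketch below is our own construction, loosely modeled on the estimation problem described above: half of the S successor states pay reward 1, so the true value is 1/2, and we compare the plain Monte Carlo estimate against an optimistic value built from a UCRL2-style L1 confidence ball (radius Θ(√(S/n)); the exact constant here is an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

def optimism_gap(S, n):
    """Returns (Monte-Carlo estimation error, excess of the optimistic value)
    after n uniform transitions over S successor states with 0/1 rewards."""
    r = np.zeros(S)
    r[: S // 2] = 1.0                              # deterministic 0/1 rewards
    visits = rng.multinomial(n, np.full(S, 1.0 / S))
    p_hat = visits / n                             # empirical transition estimate
    mc_error = abs(p_hat @ r - 0.5)                # concentrates at rate 1/sqrt(n)
    beta = np.sqrt(14 * S * np.log(2 * n) / n)     # UCRL2-style L1 radius ~ sqrt(S/n)
    v_opt = min(1.0, p_hat @ r + beta / 2)         # shift beta/2 mass to reward-1 states
    return mc_error, v_opt - 0.5

for S in [4, 64, 1024]:
    mc_err, opt_excess = optimism_gap(S, n=10_000)
    print(S, round(mc_err, 3), round(opt_excess, 3))
```

Even with n = 10,000 episodes the plain estimate stays within roughly ±0.01 of 1/2, while the optimistic value inflates with √S and, for large S, saturates at the maximal value of 1: exactly the mis-calibration described above.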
The bigger challenge is that, even if one were able to formally specify such an algorithm, the result may in general not be computationally tractable.

A similar observation about this problem of optimistic optimization has been made in the setting of linear bandits (Dani et al., 2008; Russo & Van Roy, 2014). These works show that the problem of efficient optimization over ellipsoidal confidence sets can be NP-hard. This means that computationally tractable implementations of OFU have to rely upon inefficient rectangular confidence sets that give up a factor of √D, where D is the dimension of the underlying problem. By contrast, Thompson sampling approaches remain computationally tractable (since they require solving only a single problem instance) and so do not suffer from the loose confidence set construction. It remains an open question whether such an algorithm can be designed for finite MDPs. However, these previous results in the simpler bandit setting H = 1 show that these problems with OFU-RL cannot be overcome in general.

4.1. Computational illustration

In this section we present a simple series of computational results to demonstrate this looseness in both S and H. We sample K = 1000 episodes of data from the MDP and then examine the optimistic/sampled Q-values for UCRL2 and PSRL. We implement a version of UCRL2 optimized for finite-horizon MDPs and implement PSRL with a uniform Dirichlet prior over the initial dynamics P(0, 1) = (p_1, .., p_{2N}) and a N(0, 1) prior over rewards, updating as if rewards had N(0, 1) noise. For both algorithms, if we say that R or P is known, we mean that we use the true R or P inside UCRL2 or PSRL. In each experiment, the estimates guided by OFU become extremely mis-calibrated, while PSRL remains stable.

The results of Figure 5 are particularly revealing.
They demonstrate the potential pitfalls of OFU-RL even when the underlying transition dynamics are entirely known. Several OFU algorithms have been proposed to remedy the loose UCRL-style L1 concentration for transitions (Filippi et al., 2010; Araya et al., 2012; Dann & Brunskill, 2015), but none of these address the inefficiency from hyper-rectangular confidence sets. As expected, these loose confidence sets lead to extremely poor performance in terms of the regret. We push full results to Appendix C, along with comparisons to several other OFU approaches.

Figure 4. R known, P unknown, varying N in the MDP of Figure 1.
Figure 5. P known, R unknown, varying N in the MDP of Figure 1.
Figure 6. R, P unknown, varying H in the MDP of Figure 2.

5. Better optimism by sampling

Until now, all analyses of PSRL have come via comparison to some existing algorithm for OFU-RL. Previous work, in the spirit of Theorem 1, leveraged the existing analysis for UCRL2 to establish an Õ(HS√(AT)) bound upon the Bayesian regret (Osband et al., 2013). In this section, we present a new result that bounds the expected regret of PSRL by Õ(H√(SAT)). We also include a conjecture that improved analysis could yield a Bayesian regret bound of Õ(√(HSAT)) for PSRL, and that this result would be unimprovable (Osband & Van Roy, 2016).

5.1. From S to √S

In this section we present a new analysis that improves the bound on the Bayesian regret from S to √S. The proof of this result is somewhat technical, but the essential argument comes from the simple observation of the loose rectangular confidence sets of Section 4. The key to this analysis is a technical lemma on Gaussian-Dirichlet concentration (Osband & Van Roy, 2017).

Theorem 2. Let M* be the true MDP, distributed according to a prior φ with any independent Dirichlet prior over transitions.
Then the regret for PSRL is bounded:

BayesRegret(T, π^PSRL, φ) = Õ(H√(SAT)).   (8)

Our proof of Theorem 2 mirrors the standard OFU-RL analysis from Section 3.1. To condense our notation we write x_{kh} := (s_{kh}, a_{kh}) and V^k_{k,h} := V^{M_k}_{µ_k,h}. Let the posterior means of rewards and transitions be r̂_k(x) := E[r*(x) | H_{k1}] and P̂_k(x) := E[P*(x) | H_{k1}], with respective deviations from sampling noise w^R(x) := r_k(x) − r̂_k(x) and w^P_h(x) := (P_k(x) − P̂_k(x))^T V^k_{k,h+1}.

We note that, conditional upon the data H_{k1}, the true rewards and transitions are independent of the rewards and transitions sampled by PSRL, so that E[r*(x) | H_{k1}] = r̂_k(x) and E[P*(x) | H_{k1}] = P̂_k(x) for any x. However, E[w^R(x) | H_{k1}] and E[w^P_h(x) | H_{k1}] are generally non-zero, since the agent chooses its policy to optimize its reward under M_k. We can rewrite the regret from concentration via the Bellman operator (Section 5.2 of Osband et al. (2013)):

E[V^k_{k,1} − V^*_{k,1} | H_{k1}]
  = E[(r_k − r*)(x_{k1}) + P_k(x_{k1})^T V^k_{k,2} − P*(x_{k1})^T V^*_{k,2} | H_{k1}]
  = E[(r_k − r*)(x_{k1}) + (P_k(x_{k1}) − P̂_k(x_{k1}))^T V^k_{k,2} + E[(V^k_{k,2} − V^*_{k,2})(s′) | s′ ∼ P*(x_{k1})] | H_{k1}]
  = ...
  = E[ Σ_{h=1}^{H} (r_k(x_{kh}) − r̂_k(x_{kh})) + Σ_{h=1}^{H} (P_k(x_{kh}) − P̂_k(x_{kh}))^T V^k_{k,h+1} | H_{k1} ]
  ≤ E[ Σ_{h=1}^{H} |w^R(x_{kh})| + Σ_{h=1}^{H} |w^P_h(x_{kh})| | H_{k1} ].   (9)

We can bound the contribution from unknown rewards w^R(x_{kh}) with a standard argument from earlier work (Buldygin & Kozachenko, 1980; Jaksch et al., 2010).

Lemma 1 (Sub-Gaussian tail bounds). Let x_1, .., x_n be independent samples from sub-Gaussian random variables. Then, for any δ > 0,

P( |(1/n) Σ_{i=1}^{n} x_i| ≥ √(2 log(2/δ)/n) ) ≤ δ.
(10)

The key piece of our new analysis is to show that the contribution from the transition estimates, Σ_{h=1}^{H} |w^P_h(x_{kh})|, concentrates at a rate independent of S. At the root of our argument is the notion of stochastic optimism (Osband, 2016), which introduces a partial ordering over random variables. We make particular use of Lemma 2, which relates the concentration of a Dirichlet posterior to that of a matched Gaussian distribution (Osband & Van Roy, 2017).

Lemma 2 (Gaussian-Dirichlet dominance). For all fixed V ∈ [0, 1]^N and α ∈ [0, ∞)^N with α^T 1 ≥ 2, if X ∼ N(α^T V / α^T 1, 1/α^T 1) and Y = P^T V for P ∼ Dirichlet(α), then X ≽_so Y.

We can use Lemma 2 to establish a similar concentration bound on the error from sampling, w^P_h(x).

Lemma 3 (Transition concentration). For any independent prior over rewards with r ∈ [0, 1] and additive sub-Gaussian noise, and an independent Dirichlet prior over transitions at state-action pair x_{kh},

w^P_h(x_{kh}) ≤ 2H √( 2 log(2/δ) / max(n_k(x_{kh}) − 2, 1) )   (11)

with probability at least 1 − δ.

Sketch proof. Our proof relies heavily upon technical results from the note of Osband & Van Roy (2017). We cannot apply Lemma 2 directly to w^P_h, since the future value V^k_{k,h+1} is itself a random variable whose value depends on the sampled transition P_k(x_{kh}). However, although V^k_{k,h+1} can vary with P_k, the structure of the MDP means that the resultant w^P_h(x_{kh}) is still no more optimistic than under the most optimistic possible fixed V ∈ [0, H]^S.

We begin the proof for the simple family of MDPs with S = 2, which we call M_2. We write p := P_k(x_{kh})(1) for the first component of the unknown transition at x_{kh}, and similarly p̂ := P̂_k(x_{kh})(1).
We can then bound the transition concentration:

|w^P_h(x_{kh})| = |(P_k(x_{kh}) − P̂_k(x_{kh}))^T V^k_{k,h+1}|
  ≤ |p − p̂| |V^k_{k,h+1}(1) − V^k_{k,h+1}(2)|
  ≤ |p − p̂| sup_{R_k, P_k} |V^k_{k,h+1}(1) − V^k_{k,h+1}(2)|
  ≤ |p − p̂| H.   (12)

Lemma 2 now implies that for any α with α^T 1 ≥ 2, the random variables p ∼ Dirichlet(α) and X ∼ N(0, σ² = 1/α^T 1) are ordered:

X ≽_so p − p̂  ⟹  |X| H ≽_so |p − p̂| H ≽_so |w^P_h(x_{kh})|.   (13)

We conclude the proof for M ∈ M_2 through an application of Lemma 1. To extend this argument to multiple states S > 2, we consider the marginal distribution of P_k over any subset of states, which is Beta distributed, similar to (12). We push the details to Appendix A.

To complete the proof of Theorem 2 we combine Lemma 1 with Lemma 3. We rescale δ ← δ/(2SAT) so that these confidence sets hold at each R(s, a), P(s, a) via a union bound with probability at least 1 − 1/T:

E[ Σ_{h=1}^{H} (|w^R(x_{kh})| + |w^P_h(x_{kh})|) | H_{k1} ] ≤ Σ_{h=1}^{H} 2(H + 1) √( 2 log(4SAT) / max(n_k(x_{kh}) − 2, 1) ).   (14)

We can now use (14) together with a pigeonhole principle over the number of visits to each state and action:

BayesRegret(T, π^PSRL, φ) ≤ Σ_{k=1}^{⌈T/H⌉} Σ_{h=1}^{H} 2(H + 1) √( 2 log(4SAT) / n_k(x_{kh}) ) + 2SA + 1 ≤ 10 H √(SAT log(4SAT)).

This completes the proof of Theorem 2.

Prior work has designed similar OFU approaches that improve the learning scaling with S. MORMAX (Szita & Szepesvári, 2010) and delayed Q-learning (Strehl et al., 2006), in particular, come with sample complexity bounds that are linear in S and match lower bounds. But even in terms of sample complexity, these algorithms are not necessarily an improvement over UCRL2 or its variants (Dann & Brunskill, 2015).
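Lemma 2 can be sanity-checked by simulation. The sketch below (our own construction, not the paper's) estimates E[u(X)] − E[u(Y)] by Monte Carlo for one convex increasing test function u; under the lemma this gap should be nonnegative for every such u.

```python
import numpy as np

rng = np.random.default_rng(1)

def dominance_gap(alpha, V, u, m=200_000):
    """Monte-Carlo estimate of E[u(X)] - E[u(Y)] from Lemma 2:
    X ~ N(alpha.V / alpha.1, 1 / alpha.1) and Y = P.V with P ~ Dirichlet(alpha).
    Stochastic optimism X >=_so Y predicts a nonnegative gap."""
    a1 = alpha.sum()
    X = rng.normal(alpha @ V / a1, np.sqrt(1.0 / a1), size=m)
    Y = rng.dirichlet(alpha, size=m) @ V
    return u(X).mean() - u(Y).mean()

alpha = np.array([2.0, 3.0, 1.0])          # Dirichlet pseudo-counts, alpha.1 >= 2
V = np.array([0.0, 0.5, 1.0])              # fixed value vector in [0, 1]^N
u = lambda x: np.maximum(x - 0.4, 0.0)     # convex and increasing test function
gap = dominance_gap(alpha, V, u)
print(gap)
```

The matched Gaussian has the same mean as P^T V but a larger variance (1/α^T 1 versus the Dirichlet's smaller variance), which is what makes it optimistic for every convex increasing u.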
For clarity, we compare these algorithms in terms of T_π(ε) := min{ T | (1/T) BayesRegret(T, π, φ) ≤ ε }.

Table 1. Learning times compared in terms of T_π(ε).
  DelayQ:            Õ(H⁹SA/ε⁴)
  MORMAX:            Õ(H⁷SA/ε²)
  UCRL2:             Õ(H²S²A/ε²)
  PSRL (Theorem 2):  Õ(H²SA/ε²)
Theorem 1 implies T_PSRL(ε) = Õ(H²S²A/ε²). MORMAX and delayed Q-learning reduce the S-dependence of UCRL2, but this comes at the expense of a worse dependence on H, and the resulting algorithms are not practical.

5.2. From H to √H

Recent analyses (Lattimore & Hutter, 2012; Dann & Brunskill, 2015) suggest that simultaneously reducing the dependence on H to √H may be possible. They note that the "local value variance" satisfies a Bellman equation. Intuitively, this captures that if we transition to a bad state with V ≃ 0, then we cannot transition anywhere much worse during this episode. This relation means that Σ_{h=1}^{H} w^P_h(x_{kh}) should behave more as if the terms were independent and grow as O(√H), unlike our analysis, which crudely upper bounds each of them in turn by O(H). We present a sketch towards an analysis of Conjecture 1 in Appendix B.

Conjecture 1. For any prior over rewards with r ∈ [0, 1] and additive sub-Gaussian noise, and any independent Dirichlet prior over transitions, we conjecture that

E[Regret(T, π^PSRL, M*)] = Õ(√(HSAT)),   (15)

and that this matches the lower bound for any algorithm up to logarithmic factors.

The results of Bartlett & Tewari (2009), adapted to finite-horizon MDPs, would suggest a lower bound Ω(H√(SAT)) on the minimax regret for any algorithm. However, the associated proof is incorrect (Osband & Van Roy, 2016). The strongest lower bound with a correct proof is Ω(√(HSAT)) (Jaksch et al., 2010). It remains an open question whether such a lower bound applies to the Bayesian regret over the class of priors we analyze in Theorem 2.
One particularly interesting aspect of Conjecture 1 is that we can construct another algorithm that satisfies the proof of Theorem 2 but would not satisfy the argument for Conjecture 1 in Appendix B. We call this algorithm Gaussian PSRL, since it operates in a manner similar to PSRL but actually uses, as its sampling distribution, the Gaussian noise we use in the analysis of PSRL.

Algorithm 3 Gaussian PSRL
Input: posterior MAP estimates r_k, P̂_k, visit counts n_k
Output: random Q_{k,h}(s, a) ≽_so Q*_h(s, a) for all (s, a, h)
1: Initialize Q_{k,H+1}(s, a) ← 0 for all (s, a)
2: for timestep h = H, H−1, .., 1 do
3:   V_{k,h+1}(s) ← max_α Q_{k,h+1}(s, α)
4:   Sample w_k(s, a, h) ∼ N(0, (H+1)² / max(n_k(s, a) − 2, 1))
5:   Q_{k,h}(s, a) ← r_k(s, a) + P̂_k(s, a)^T V_{k,h+1} + w_k(s, a, h) for all (s, a)
6: end for

Algorithm 3 presents the method for sampling random Q-values according to Gaussian PSRL; the algorithm then follows these samples greedily for the duration of the episode, similar to PSRL. Interestingly, we find that our experimental evaluation is consistent with regret scalings Õ(HS√(AT)), Õ(H√(SAT)) and Õ(√(HSAT)) for UCRL2, Gaussian PSRL and PSRL respectively.

5.3. An empirical investigation

We now discuss a computational study designed to illustrate how learning times scale with S and H, and to empirically investigate Conjecture 1. The class of MDPs we consider involves a long chain of states with S = H = N and two actions: left and right. Each episode the agent begins in state 1. The optimal policy is to head right at every timestep; all other policies have zero expected reward. Inefficient exploration strategies will take Ω(2^N) episodes to learn the optimal policy (Osband et al., 2014).

Figure 7. MDPs that highlight the need for efficient exploration.

We evaluate several learning algorithms from ten random seeds and N = 2, .., 100 for up to ten million episodes each.
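A minimal sketch of Algorithm 3 in Python (our rendering, not the released code; `r_hat` and `P_hat` denote the posterior MAP estimates and the noise scale follows line 4 of the algorithm):

```python
import numpy as np

def gaussian_psrl(r_hat, P_hat, n_visits, H, rng):
    """Algorithm 3 sketch: backward induction with Gaussian exploration noise
    w ~ N(0, (H+1)^2 / max(n - 2, 1)) added to every Q-value, so that the
    sampled Q_{k,h} is stochastically optimistic for Q*_h."""
    S, A = r_hat.shape
    Q = np.zeros((H + 2, S, A))                       # Q[H+1] := 0
    policy = np.zeros((H + 1, S), dtype=int)
    scale = (H + 1) / np.sqrt(np.maximum(n_visits - 2, 1))
    for h in range(H, 0, -1):
        V = Q[h + 1].max(axis=1)                      # V_{k,h+1}(s)
        w = rng.normal(0.0, scale)                    # fresh draw per (s, a) and h
        Q[h] = r_hat + P_hat @ V + w
        policy[h] = Q[h].argmax(axis=1)
    return Q, policy
```

Note that, unlike PSRL, the randomness enters through independent Gaussian perturbations at every (s, a, h) rather than through a single coherent sampled MDP, which matches the empirical observation that Gaussian PSRL scales closer to Õ(H√(SAT)) while PSRL scales closer to Õ(√(HSAT)).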
Our goal is to investigate their empirical performance and scaling. We believe this is the first large-scale empirical investigation into the scaling properties of algorithms for efficient exploration.

We highlight results for three algorithms with Õ(√T) Bayesian regret bounds: UCRL2, Gaussian PSRL and PSRL. We implement UCRL2 with confidence sets optimized for finite-horizon MDPs. For the Bayesian algorithms we use a uniform Dirichlet prior for transitions and an N(0, 1) prior for rewards. We view these priors as simple ways to encode very little prior knowledge. Full details and a link to source code are available in Appendix D.

Figure 8 displays the regret curves for these algorithms for N ∈ {5, 10, 30, 50}. As suggested by our analysis, PSRL outperforms Gaussian PSRL, which outperforms UCRL2. These differences seem to scale with the length of the chain N, and even for relatively small MDPs, PSRL is many orders of magnitude more efficient than UCRL2.

Figure 8. PSRL outperforms other methods by large margins.

We investigate the empirical scaling of these algorithms with respect to N. The results of Theorem 2 and Conjecture 1 only bound the Bayesian regret according to the prior φ. The family of environments we consider in this example is decidedly not drawn from this uniform distribution; in fact the environments are chosen to be as difficult as possible. Nevertheless, the results of Theorem 2 and Conjecture 1 provide a remarkably good description of the behavior we observe. Define

    learning time(π, N) := min{ K | (1/K) Σ_{k=1}^K Δ_k ≤ 0.1 }

for the algorithm π on the MDP from Figure 7 with size N. For any B_π > 0, the regret bound Õ(√(B_π T)) would imply learning time(π, N) = Õ(B_π H), so that log(learning time(π, N)) grows linearly in log(N) with slope given by the polynomial degree of B_π H in N, up to o(log(N)) terms.
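These predicted slopes can be read off mechanically: with H = S = N and A treated as constant, a regret bound of the form Õ(H^a S^b √(AT)) corresponds to B_π = H^{2a} S^{2b} A, so the predicted log-log slope is the degree of B_π H in N. A small sketch of this bookkeeping (the helper and its exponent encoding are ours):

```python
def predicted_slope(h_exp, s_exp):
    """Degree in N of B_pi * H for a regret bound O(H^h_exp * S^s_exp * sqrt(A*T)).

    Such a bound is O(sqrt(B_pi * T)) with B_pi = H^(2*h_exp) * S^(2*s_exp) * A,
    so with H = S = N the degree of B_pi * H is 2*h_exp + 2*s_exp + 1.
    """
    return 2 * h_exp + 2 * s_exp + 1

# Bounds from Section 5: UCRL2 ~ O(HS*sqrt(AT)), Gaussian PSRL ~ O(H*sqrt(SAT)),
# PSRL (Conjecture 1) ~ O(sqrt(HSAT)).
slopes = {
    "UCRL2": predicted_slope(1, 1),            # -> 5
    "Gaussian PSRL": predicted_slope(1, 0.5),  # -> 4
    "PSRL": predicted_slope(0.5, 0.5),         # -> 3
}
```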
In the cases of Figure 7 with H = S = N, the bounds Õ(HS√(AT)), Õ(H√(SAT)) and Õ(√(HSAT)) would suggest a slope B_π of 5, 4 and 3 respectively. Remarkably, these high-level predictions match our empirical results almost exactly, as we show in Figure 9. These results provide some support for Conjecture 1 and even, since the spirit of these environments is similar to the examples used in existing proofs, for the ongoing questions of fundamental lower bounds (Osband & Van Roy, 2016). Further, we note that every single seed of PSRL and Gaussian PSRL learned the optimal policy for every single N. We believe this suggests it may be possible to extend our Bayesian analysis to provide minimax regret bounds in the style of UCRL2 for a suitable choice of diffuse uninformative prior.

Figure 9. Empirical scaling matches our conjectured analysis.

6. Conclusion

PSRL is orders of magnitude more statistically efficient than UCRL2, at the same computational cost as solving a known MDP. We believe that analysts will be able to formally specify an OFU approach to RL whose statistical efficiency matches PSRL. However, we argue that the resulting confidence sets, which must address the coupling over both H and S, may result in a computationally intractable optimization problem. Posterior sampling offers a computationally tractable approach to statistically efficient exploration.

We should stress that the finite tabular setting we analyze is not a reasonable model for most problems of interest. Due to the curse of dimensionality, RL in practical settings will require generalization between states and actions. The goal of this paper is not just to improve a mathematical bound in a toy example (although we do also do that). Instead, we hope this simple setting can highlight some shortcomings of existing approaches to "efficient RL" and provide insight into why algorithms based on sampling may offer important advantages.
We believe that these insights may prove valuable as we move towards algorithms that solve the problem we really care about: synthesizing efficient exploration with powerful generalization.

Acknowledgements

This work was generously supported by DeepMind, a research grant from Boeing, a Marketing Research Award from Adobe, and a Stanford Graduate Fellowship, courtesy of PACCAR. The authors would like to thank Daniel Russo for many hours of discussion and insight leading to this research, Shipra Agrawal and Tor Lattimore for pointing out several flaws in some early proof steps, anonymous reviewers for their helpful comments, and many more colleagues at DeepMind, including Remi Munos and Mohammad Azar, for inspirational conversations.

References

Araya, Mauricio, Buffet, Olivier, and Thomas, Vincent. Near-optimal BRL using optimistic local transitions. arXiv preprint arXiv:1206.4613, 2012.

Asmuth, John, Li, Lihong, Littman, Michael L, Nouri, Ali, and Wingate, David. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 19–26. AUAI Press, 2009.

Bartlett, Peter L. and Tewari, Ambuj. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), pp. 35–42, June 2009.

Bellman, Richard and Kalaba, Robert. On adaptive control processes. IRE Transactions on Automatic Control, 4(2):1–9, 1959.

Brafman, Ronen I. and Tennenholtz, Moshe. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.

Buldygin, Valerii V and Kozachenko, Yu V. Sub-Gaussian random variables. Ukrainian Mathematical Journal, 32(6):483–489, 1980.
Burnetas, Apostolos N and Katehakis, Michael N. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.

Dani, Varsha, Hayes, Thomas P., and Kakade, Sham M. Stochastic linear optimization under bandit feedback. In COLT, pp. 355–366, 2008.

Dann, Christoph and Brunskill, Emma. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pp. TBA, 2015.

Filippi, Sarah, Cappé, Olivier, and Garivier, Aurélien. Optimism in reinforcement learning and Kullback-Leibler divergence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 115–122. IEEE, 2010.

Fonteneau, Raphaël, Korda, Nathan, and Munos, Rémi. An optimistic posterior sampling strategy for Bayesian reinforcement learning. In NIPS 2013 Workshop on Bayesian Optimization (BayesOpt2013), 2013.

Gopalan, Aditya and Mannor, Shie. Thompson sampling for learning parameterized Markov decision processes. arXiv preprint arXiv:1406.7498, 2014.

Hadar, Josef and Russell, William R. Rules for ordering uncertain prospects. The American Economic Review, pp. 25–34, 1969.

Jaksch, Thomas, Ortner, Ronald, and Auer, Peter. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

Kakade, Sham. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.

Kearns, Michael J. and Singh, Satinder P. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

Kolter, J Zico and Ng, Andrew Y. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 513–520. ACM, 2009.

Lattimore, Tor and Hutter, Marcus. PAC bounds for discounted MDPs. In Algorithmic Learning Theory, pp. 320–334. Springer, 2012.

Munos, Rémi.
From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. 2014.

Osband, Ian. Deep Exploration via Randomized Value Functions. PhD thesis, Stanford, 2016.

Osband, Ian and Van Roy, Benjamin. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466–1474, 2014a.

Osband, Ian and Van Roy, Benjamin. Near-optimal reinforcement learning in factored MDPs. In Advances in Neural Information Processing Systems, pp. 604–612, 2014b.

Osband, Ian and Van Roy, Benjamin. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.

Osband, Ian and Van Roy, Benjamin. Gaussian-Dirichlet posterior dominance in sequential learning. arXiv preprint arXiv:1702.04126, 2017.

Osband, Ian, Russo, Daniel, and Van Roy, Benjamin. (More) efficient reinforcement learning via posterior sampling. In NIPS, pp. 3003–3011. Curran Associates, Inc., 2013.

Osband, Ian, Van Roy, Benjamin, and Wen, Zheng. Generalization and exploration via randomized value functions. arXiv preprint, 2014.

Russo, Daniel and Van Roy, Benjamin. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Strehl, Alexander L and Littman, Michael L. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning, pp. 856–863. ACM, 2005.

Strehl, Alexander L., Li, Lihong, Wiewiora, Eric, Langford, John, and Littman, Michael L. PAC model-free reinforcement learning. In ICML, pp. 881–888, 2006.

Strens, Malcolm J. A. A Bayesian framework for reinforcement learning. In ICML, pp. 943–950, 2000.

Sutton, Richard and Barto, Andrew. Reinforcement Learning: An Introduction. MIT Press, March 1998.

Szita, István and Szepesvári, Csaba.
Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038, 2010.

Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Vlassis, Nikos, Ghavamzadeh, Mohammad, Mannor, Shie, and Poupart, Pascal. Bayesian reinforcement learning. In Reinforcement Learning, pp. 359–386. Springer, 2012.

APPENDICES

A. Proof of Lemma 3

This section centers around the proof of Lemma 3, which we reproduce below for completeness. In the main paper we present a simple sketch for the special case of S = 2. We now extend this argument to general MDPs with S > 2. The main strategy for this proof is to proceed via an inductive argument and consider the contribution of each component of P_k in turn. We will see that, for any choice of component, the resultant random variable is dominated by a matched Gaussian random variable, just as in (12).

Lemma 3 (Transition concentration). For any independent prior over rewards with r ∈ [0, 1], additive sub-Gaussian noise and an independent Dirichlet prior over transitions at state-action pair x_kh,

    w^P_h(x_kh) ≤ 2H √( 2 log(2/δ) / max(n_k(x_kh) − 2, 1) )    (11)

with probability at least 1 − δ.

Our analysis of Lemma 3 will rely heavily upon the technical analysis of Osband & Van Roy (2017). We first reproduce Lemma 2 from Osband & Van Roy (2017) in terms of stochastic optimism, rather than second-order stochastic dominance.

Lemma 4 (Beta vs Dirichlet dominance). Let X = Pᵀv for the random variable P ∼ Dirichlet(α) and constants v ∈ R^S and α ∈ R^S_+. Without loss of generality, assume v_1 ≤ v_2 ≤ ··· ≤ v_S.
Let α̃ = Σ_{i=1}^S α_i (v_i − v_1)/(v_S − v_1) and β̃ = Σ_{i=1}^S α_i (v_S − v_i)/(v_S − v_1). Then, there exists a random variable P̃ ∼ Beta(α̃, β̃) such that, for X̃ = P̃ v_S + (1 − P̃) v_1,

    E[X̃ | X] = X   and   X̃ ≽_so X.

Proof. Let γ_i ∼ Gamma(α_i, 1) be independent and let γ = Σ_{i=1}^S γ_i, so that P ≡_D (γ_1, ..., γ_S)/γ. Let α⁰_i = α_i (v_S − v_i)/(v_S − v_1) and α¹_i = α_i (v_i − v_1)/(v_S − v_1), so that α = α⁰ + α¹. Define independent random variables γ⁰_i ∼ Gamma(α⁰_i, 1) and γ¹_i ∼ Gamma(α¹_i, 1), and couple these variables with γ_i so that γ_i = γ⁰_i + γ¹_i. Note that β̃ = Σ_{i=1}^S α⁰_i and α̃ = Σ_{i=1}^S α¹_i. Let γ⁰ = Σ_{i=1}^S γ⁰_i and γ¹ = Σ_{i=1}^S γ¹_i, so that 1 − P̃ ≡_D γ⁰/γ and P̃ ≡_D γ¹/γ; couple these variables so that 1 − P̃ = γ⁰/γ and P̃ = γ¹/γ. We can now say

    E[X̃ | X] = E[ (1 − P̃) v_1 + P̃ v_S | X ]
             = E[ (v_1 γ⁰ + v_S γ¹)/γ | X ]
             = E[ E[ (v_1 γ⁰ + v_S γ¹)/γ | γ, X ] | X ]
             = E[ (v_1 E[γ⁰ | γ] + v_S E[γ¹ | γ])/γ | X ]
             = E[ (v_1 Σ_{i=1}^S E[γ⁰_i | γ_i] + v_S Σ_{i=1}^S E[γ¹_i | γ_i])/γ | X ]
         (a) = E[ (v_1 Σ_{i=1}^S γ_i α⁰_i/α_i + v_S Σ_{i=1}^S γ_i α¹_i/α_i)/γ | X ]
             = E[ ( v_1 Σ_{i=1}^S γ_i (v_S − v_i) + v_S Σ_{i=1}^S γ_i (v_i − v_1) ) / ( γ (v_S − v_1) ) | X ]
             = E[ Σ_{i=1}^S γ_i v_i / γ | X ]
             = E[ Σ_{i=1}^S P_i v_i | X ] = X,

where (a) follows from elementary properties of the Gamma distribution (Osband & Van Roy, 2017). Therefore, X̃ is a mean-preserving spread of X and so, by the definition of stochastic optimism, X̃ ≽_so X. □
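The coupling in this proof can be checked numerically. The sketch below builds P from Gamma draws, splits each γ_i into γ⁰_i + γ¹_i exactly as in the proof, and verifies that the coupled X̃ matches X in mean while having (weakly) larger variance, i.e. a mean-preserving spread. The particular α and v are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 3.0])    # Dirichlet parameters (assumed example)
v = np.array([0.0, 1.0, 4.0])        # sorted payoffs v_1 <= ... <= v_S
n = 200_000

# Split alpha as in the proof: a0 routes mass toward v_1, a1 toward v_S.
a0 = alpha * (v[-1] - v) / (v[-1] - v[0])
a1 = alpha * (v - v[0]) / (v[-1] - v[0])
# Gamma draws; zero shapes are handled by masking the corresponding samples.
g0 = rng.gamma(np.where(a0 > 0, a0, 1e-12), 1.0, size=(n, 3)) * (a0 > 0)
g1 = rng.gamma(np.where(a1 > 0, a1, 1e-12), 1.0, size=(n, 3)) * (a1 > 0)
g = g0 + g1                                   # gamma_i = gamma^0_i + gamma^1_i
X = (g / g.sum(axis=1, keepdims=True)) @ v    # X = P^T v with P ~ Dirichlet(alpha)
P_tilde = g1.sum(axis=1) / g.sum(axis=1)      # coupled Beta(alpha~, beta~) draw
X_tilde = P_tilde * v[-1] + (1 - P_tilde) * v[0]

# Mean preserved, variance (weakly) increased: a mean-preserving spread.
assert abs(X.mean() - X_tilde.mean()) < 0.02
assert X_tilde.var() >= X.var()
```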
Next, consider any fixed P_k(x_kh) and let R_k and P_k(x ≠ x_kh) vary in any arbitrary way to maximize the variation from the transition term w^P_h(x_kh) = (P_k(x_kh) − P̂_k(x_kh))ᵀ V_{k,h+1} through their effects on the future value V_{k,h+1} ∈ [0, H]^S. We can then upper bound the deviation from transitions by the deviation under the worst possible v ∈ [0, H]^S:

    w^P_h(x_kh) ≤ max_{R_k, P_k(x ≠ x_kh)} (P_k(x_kh) − P̂_k(x_kh))ᵀ V_{k,h+1} ≤ max_{v ∈ [0,H]^S} (P_k(x_kh) − P̂_k(x_kh))ᵀ v.    (16)

We can then apply Lemma 4 to (16): for any possible value of v ∈ [0, H]^S there is a matched Beta random variable that is stochastically optimistic for w^P_h(x_kh). This means we can then apply Lemma 2 to show that there is a matched X ∼ N(0, H^2/(αᵀ𝟙)) with X ≽_so w^P_h(x_kh). To complete the proof of Lemma 3, we apply the Gaussian tail concentration of Lemma 1. □

B. Conjecture of Õ(√(HSAT)) bounds

The key remaining loose piece of our analysis concerns the summation Σ_{h=1}^H w^P_h(x_kh). Our current proof of Theorem 2 bounds each w^P_h(x_kh) independently. Each term is Õ(H/√(n_k(x_kh))), and we bound the resulting sum by Õ(H^2/√(n_k(x_kh))). However, this approach is very loose and presupposes that each timestep could be maximally bad during a single episode. To repeat our geometric intuition, we have assumed a worst-case hyper-rectangle over all H timesteps when the actual geometry should be an ellipse. We therefore suffer an additional factor of Õ(√H), in exactly the style of Figure 3. In fact, it is not even possible to sequentially incur the worst-case transitions O(H) at each and every timestep during an episode, since once your sample gets one such transition there will be no more future value to deplete.
Rather than being independent per timestep, which would already be enough for us to obtain an Õ(√H) saving, the terms actually have a kind of anti-correlation property through the law of total variance. A very similar observation is used by recent analyses in the sample-complexity setting (Lattimore & Hutter, 2012) and also in finite-horizon MDPs (Dann & Brunskill, 2015). This suggests that it should be possible to combine the insights of Lemma 3 with, for example, Lemma 4 of (Dann & Brunskill, 2015) to remove both the √S and the √H from our bounds and so prove Conjecture 1.

We note that this informal argument would not apply to Gaussian PSRL, since it generates w^P from a Gaussian posterior which does not satisfy the Bellman operators. Therefore, we should be able to find some evidence for this conjecture if we find domains where UCRL2, Gaussian PSRL and PSRL all demonstrate their (unique) predicted scalings. We present some evidence of this effect in Section 5.3 and find that our empirical results are consistent with this conjecture.

C. Estimation experiments

In this section we expand the simple examples given in Section 4.1 to a full decision problem with two actions. We define an MDP similar to Figures 1 and 2, but now with two actions. The first action is identical to Figure 1, but the second action modifies the transition probabilities to favor the rewarding states with probability 0.6/N, assigning only 0.4/N to the non-rewarding states. We now investigate the regret of several learning algorithms which we adapt to this setting. These algorithms are based upon BEB (Kolter & Ng, 2009), BOLT (Araya et al., 2012), ε-greedy with ε = 0.1, Gaussian PSRL (see Algorithm 3), Optimistic PSRL (which takes K = 10 samples and takes the maximum over sampled Q-values, similar to BOSS (Asmuth et al., 2009)), PSRL (Strens, 2000), UCFH (Dann & Brunskill, 2015) and UCRL2 (Jaksch et al., 2010).
We link to the full code for the implementation in Appendix D. We see that the loose estimates in OFU algorithms from Figures 4 and 5 lead to poor performance in a decision problem. This poor scaling with the number of successor states N occurs when either the rewards or the transition function is unknown. We note that in stochastic environments the PAC-Bayes algorithm BOLT, which relies upon optimistic fake prior data, can sometimes concentrate too quickly and so incur the maximum linear regret. In general, although BOLT is PAC-Bayes, it concentrates too fast to be PAC-MDP, just like BEB (Kolter & Ng, 2009). In Figure 12 we see a similar effect as we increase the episode length H. We note that the second-order UCFH modification improves upon UCRL2's miscalibration with H, as is reflected in their bounds (Dann & Brunskill, 2015). We note that both BEB and BOLT scale poorly with the horizon H.

Figure 10. Known rewards R and unknown transitions P, similar to Figure 4.
Figure 11. Unknown rewards R and known transitions P, similar to Figure 5.
Figure 12. Unknown rewards R and transitions P, similar to Figure 6.

D. Chain experiments

All of the code and experiments used in this paper are available in full on GitHub. As per the review request we have removed the link to this code, but instead include an anonymized excerpt of some of the code in our submission file. We hope that researchers will find this simple codebase useful for quickly prototyping and experimenting in tabular reinforcement learning simulations.

In addition to the results already presented, we also investigate the scaling of the similar Bayesian learning algorithms BEB (Kolter & Ng, 2009) and BOLT (Araya et al., 2012). We see that neither algorithm scales as gracefully as PSRL, although BOLT comes close.
However, as observed in Appendix C, BOLT can perform poorly in highly stochastic environments. BOLT also requires S-times more computational cost than PSRL or BEB. We include these algorithms in Figure 13.

Figure 13. Scaling of more learning algorithms.

D.1. Rescaling confidence sets

It is well known that provably-efficient OFU algorithms can perform poorly in practice. In response to this observation, many practitioners suggest rescaling confidence sets to obtain better empirical performance (Szita & Szepesvári, 2010; Araya et al., 2012; Kolter & Ng, 2009). In Figure 14 we present the performance of several algorithms with confidence sets rescaled by a factor in {0.01, 0.03, 0.1, 0.3, 1}. We can see that rescaling for tighter confidence sets can sometimes give better empirical performance. However, it does not change the fundamental scaling of the algorithm. Also, for aggressive rescalings, some seeds may not converge at all.

Figure 14. Rescaled proposed algorithms for more aggressive learning.

D.2. Prior sensitivities

We ran all of our Bayesian algorithms with uninformative independent priors for rewards and transitions. For rewards, we used r(s, a) ∼ N(0, 1) and updated as if the observed noise were Gaussian with precision τ = 1/σ^2 = 1. For transitions, we used a uniform Dirichlet prior P(s, a) ∼ Dirichlet(α). In Figures 15 and 16 we examine the performance of Gaussian PSRL and PSRL on a chain of length N = 10 as we vary τ and α = α₀𝟙.

Figure 15. Prior sensitivity in Gaussian PSRL.
Figure 16. Prior sensitivity in PSRL.

We find that both algorithms are extremely robust over several orders of magnitude. Only large values of τ (which mean that the agent updates its reward prior too quickly) caused problems for some seeds in this environment.
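The conjugate updates behind these priors are standard. A minimal sketch of the posterior update for a single (s, a) pair, with the paper's τ = 1 and a uniform Dirichlet prior (function and variable names are ours):

```python
import numpy as np

def update_reward_posterior(mu, tau, r_obs, tau_noise=1.0):
    """Gaussian conjugate update for the mean reward of one (s, a).

    Prior N(mu, 1/tau); the observation noise is treated as Gaussian
    with precision tau_noise (the paper uses tau_noise = 1).
    Returns the posterior mean and precision.
    """
    tau_post = tau + tau_noise
    mu_post = (tau * mu + tau_noise * r_obs) / tau_post
    return mu_post, tau_post

def update_transition_posterior(alpha, s_next):
    """Dirichlet conjugate update for one (s, a): increment the
    pseudo-count of the observed successor state."""
    alpha = alpha.copy()
    alpha[s_next] += 1.0
    return alpha

# One observed transition from some (s, a): reward 0.5, next state 2.
mu, tau = update_reward_posterior(0.0, 1.0, 0.5)    # -> (0.25, 2.0)
alpha = update_transition_posterior(np.ones(3), 2)  # -> [1., 1., 2.]
```

PSRL then samples r(s, a) ∼ N(mu, 1/tau) and P(s, a) ∼ Dirichlet(alpha) at the start of each episode, while Gaussian PSRL uses only the MAP estimates plus the noise of Algorithm 3.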
Developing a clearer frequentist analysis of these Bayesian algorithms is an important direction for future research.

D.3. Optimistic posterior sampling

We compare our implementation of PSRL with a similar optimistic variant which draws K ≥ 1 samples from the posterior and forms the optimistic Q-value over the envelope of sampled Q-values. This algorithm is sometimes called "optimistic posterior sampling" (Fonteneau et al., 2013). We experiment with this algorithm over several values of K but find that the resultant algorithm performs very similarly to PSRL, at an increased computational cost. We display this effect over several magnitudes of K in Figures 17 and 18.

Figure 17. PSRL with multiple samples is almost indistinguishable.
Figure 18. PSRL with multiple samples is almost indistinguishable.

This algorithm, "Optimistic PSRL", is spiritually very similar to BOSS (Asmuth et al., 2009), and previous work had suggested that K > 1 could lead to improved performance. We believe that an important difference is that PSRL, unlike Thompson sampling, should not resample every timestep; previous implementations had compared to this faulty benchmark (Fonteneau et al., 2013).
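The envelope construction described above is simple to state: draw K MDPs from the posterior, solve each, and act greedily with respect to the pointwise maximum of the sampled Q-values. A minimal sketch, where the posterior sampler and the finite-horizon solver are assumed callables rather than the paper's code:

```python
import numpy as np

def optimistic_psrl_q(sample_mdp, solve_q, K, rng):
    """Form the optimistic envelope over K posterior samples.

    sample_mdp(rng) -> (r, p): one MDP drawn from the posterior.
    solve_q(r, p)   -> Q of shape (H, S, A), e.g. by backward induction.
    K = 1 recovers plain PSRL; larger K gives "Optimistic PSRL".
    """
    qs = [solve_q(*sample_mdp(rng)) for _ in range(K)]
    return np.maximum.reduce(qs)   # pointwise max over the K sampled Q-values
```

Each additional sample multiplies the per-episode planning cost by K, which is why the near-identical regret curves in Figures 17 and 18 favor plain PSRL.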
