Thompson Sampling is Asymptotically Optimal in General Environments


Authors: Jan Leike, Tor Lattimore, Laurent Orseau, Marcus Hutter

Jan Leike, Australian National University, jan.leike@anu.edu.au
Tor Lattimore, University of Alberta, tor.lattimore@gmail.com
Laurent Orseau, Google DeepMind, lorseau@google.com
Marcus Hutter, Australian National University, marcus.hutter@anu.edu.au

Abstract

We discuss a variant of Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markov, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges in mean to the optimal value and (2) given a recoverability assumption its regret is sublinear.

Keywords. General reinforcement learning, Thompson sampling, asymptotic optimality, regret, discounting, recoverability, AIXI.

1 INTRODUCTION

In reinforcement learning (RL) an agent interacts with an unknown environment with the goal of maximizing rewards. Recently reinforcement learning has received a surge of interest, triggered by its success in applications such as simple video games [MKS+15]. However, theory is lagging behind application: most theoretical analyses have been done in the bandit framework and for Markov decision processes (MDPs). These restricted environment classes fall short of the full reinforcement learning problem, and theoretical results usually assume ergodicity and that every state is visited infinitely often. Needless to say, these assumptions are not satisfied for any but the simplest applications. Our goal is to lift these restrictions: we consider general reinforcement learning, a top-down approach to RL that aims to understand the fundamental underlying problems in full generality.
Our approach to general RL is nonparametric: we only assume that the true environment belongs to a given countable environment class.

We are interested in agents that maximize rewards optimally. Since the agent does not know the true environment in advance, it is not obvious what optimality should mean. We discuss two different notions of optimality: asymptotic optimality and worst-case regret.

Asymptotic optimality requires that asymptotically the agent learns to act optimally, i.e., that the discounted value of the agent's policy π converges to the optimal discounted value, V*_µ − V^π_µ → 0, for all environments µ from the environment class. This convergence is impossible for deterministic policies, since the agent has to explore infinitely often and for long stretches of time, but there are policies that converge almost surely in Cesàro average [LH11]. Bayes-optimal agents are generally not asymptotically optimal [Ors13]. However, asymptotic optimality can be achieved through an exploration component on top of a Bayes-optimal agent [Lat13, Ch. 5] or through optimism [SH15].

Asymptotic optimality in mean is essentially a weaker variant of probably approximately correct (PAC) that comes without a concrete convergence rate: for all ε > 0 and δ > 0, the probability that our policy is ε-suboptimal converges to zero (at an unknown rate); eventually this probability will be less than δ. Since our environment class can be very large and non-compact, concrete PAC/convergence rates are likely impossible.

Regret is how much expected reward the agent forfeits by not following the best informed policy. Different problem classes have different regret rates, depending on the structure and the difficulty of the problem class. Multi-armed bandits have a (problem-independent) worst-case regret bound of Ω(√(KT)), where K is the number of arms [BB12].
In Markov decision processes (MDPs) the lower bound is Ω(√(DSAT)), where S is the number of states, A the number of actions, and D the diameter of the MDP [AJO10]. For a countable class of environments given by state representation functions that map histories to MDP states, a regret of Õ(T^{2/3}) is achievable, assuming the resulting MDP is weakly communicating [NMRO13]. A problem class is considered learnable if there is an algorithm that has a sublinear regret guarantee.

This paper continues a narrative that started with the definition of the Bayesian agent AIXI [Hut00] and the proof that it satisfies various optimality guarantees [Hut02]. Recently it was revealed that these optimality notions are trivial or subjective [LH15]: a Bayesian agent does not explore enough to lose the prior's bias, and a particularly bad prior can make the agent conform to any arbitrarily bad policy as long as this policy yields some rewards. These negative results put the Bayesian approach to (general) RL into question. In this paper we remedy the situation by showing that using Bayesian techniques an agent can indeed be optimal in an objective sense.

The agent we consider is known as Thompson sampling, posterior sampling, or the Bayesian control rule [Tho33]. It samples an environment ρ from the posterior, follows the ρ-optimal policy for one effective horizon (a lookahead long enough to encompass most of the discount function's mass), and then repeats. We show that this agent's policy is asymptotically optimal in mean (and, equivalently, in probability). Furthermore, using a recoverability assumption on the environment and some (minor) assumptions on the discount function, we prove that the worst-case regret is sublinear. This is the first time convergence and regret bounds of Thompson sampling have been shown under such general conditions.
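The resampling loop described above can be sketched concretely. The following Python sketch is ours, not the paper's Algorithm 1: it works in a toy setting where the "environment class" is a finite set of two-armed Bernoulli bandits, the effective horizon is a fixed constant, and the posterior is updated exactly by Bayes' rule.

```python
import random

def thompson_sampling(envs, prior, true_env, horizon, steps, rng):
    """Toy version of the resampling loop: sample an environment rho
    from the posterior, follow the rho-optimal policy for one effective
    horizon, then resample. Environments are lists of Bernoulli means."""
    w = list(prior)                # posterior weights w(nu | history)
    t, rewards = 0, []
    while t < steps:
        # sample an environment rho from the posterior
        rho = rng.choices(range(len(envs)), weights=w)[0]
        for _ in range(horizon):   # follow the rho-optimal policy
            if t >= steps:
                break
            a = max(range(len(envs[rho])), key=lambda i: envs[rho][i])
            r = 1.0 if rng.random() < true_env[a] else 0.0
            rewards.append(r)
            # Bayes update: multiply each weight by the percept's likelihood
            for i, nu in enumerate(envs):
                w[i] *= nu[a] if r == 1.0 else 1.0 - nu[a]
            total = sum(w)
            w = [x / total for x in w]
            t += 1
    return rewards, w

# two hypothesis environments; the second one is the truth
envs = [[0.9, 0.1], [0.1, 0.9]]
rewards, w = thompson_sampling(envs, [0.5, 0.5], true_env=envs[1],
                               horizon=10, steps=500,
                               rng=random.Random(0))
```

With these numbers the posterior mass on the true environment approaches 1 and the agent ends up pulling the good arm almost always; in the paper's setting the horizon H_t(ε_t) grows over time and the class is countable rather than finite.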
Thompson sampling was originally proposed by Thompson as a bandit algorithm [Tho33]. It is easy to implement and often achieves quite good results [CL11]. In multi-armed bandits it attains optimal regret [AG11, KKM12]. Thompson sampling has also been considered for MDPs: as a model-free method relying on distributions over Q-functions with a convergence guarantee [DFR98], and as a model-based algorithm without theoretical analysis [Str00]. Bayesian and frequentist regret bounds have also been established [ORvR13, OR14, GM15]. PAC guarantees have been established for an optimistic variant of Thompson sampling for MDPs [ALL+09].

For general RL Thompson sampling was first suggested in [OB10] with resampling at every time step. The authors prove that the action probabilities of Thompson sampling converge to the action probabilities of the optimal policy almost surely, but require a finite environment class and two (arguably quite strong) technical assumptions on the behavior of the posterior distribution (akin to ergodicity) and the similarity of environments in the class. Our convergence results do not require these assumptions, but we rely on an (unavoidable) recoverability assumption for our regret bound.

Appendix A contains a list of notation and Appendix B contains omitted proofs.

2 PRELIMINARIES

The set X* := ∪_{n=0}^∞ X^n is the set of all finite strings over the alphabet X, and the set X^∞ is the set of all infinite strings over the alphabet X. The empty string is denoted by ǫ, not to be confused with the small positive real number ε. Given a string x ∈ X*, we denote its length by |x|. For a (finite or infinite) string x of length ≥ k, we denote with x_{1:k} the first k characters of x, and with x_{<k} the first k − 1 characters of x.

The agent interacts with an environment in cycles: at time step t it chooses an action a_t and receives a percept e_t containing a reward r_t ∈ [0, 1]; the history so far is æ_{<t} := a_1 e_1 ... a_{t−1} e_{t−1}. An environment ν assigns probabilities to percepts given the history, and a policy π maps histories to actions. Given a discount function γ with discount normalizer Γ_t := Σ_{k=t}^∞ γ_k, the (normalized) value of a policy π in an environment ν is

V^π_ν(æ_{<t}) := (1/Γ_t) E^π_ν[ Σ_{k=t}^∞ γ_k r_k | æ_{<t} ],

the optimal value is V*_ν := sup_π V^π_ν, and a ν-optimal policy π*_ν acts greedily with respect to it: Γ_t > 0 implies a_t ∈ arg max_a V*_ν(æ_{<t} a). The effective horizon H_t(ε) := min{k : Γ_{t+k}/Γ_t ≤ ε} is a lookahead long enough to encompass all but an ε-fraction of the discount function's remaining mass. The Bayesian mixture over the countable class M with prior w is ξ := Σ_{ν∈M} w(ν) ν, with posterior w(ν | æ_{<t}). For nonnegative and uniformly bounded random variables X_t, convergence in mean and convergence in probability coincide: E[X_t] → 0 if and only if P[X_t > ε] → 0 as t → ∞ for all ε > 0, because the random variables X_t are nonnegative and bounded.
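The gap between convergence in mean and almost sure convergence can be illustrated with a standard example (ours, not from the paper): independent X_t ~ Bernoulli(1/t) converge to 0 in mean and in probability, but since Σ 1/t diverges, X_t = 1 happens infinitely often with probability one.

```python
import random

def last_success(n, rng):
    """Largest t <= n with X_t = 1, where X_t ~ Bernoulli(1/t)
    independently. E[X_t] = 1/t -> 0, yet successes keep occurring at
    arbitrarily late times a.s. (second Borel-Cantelli lemma)."""
    last = 0
    for t in range(1, n + 1):
        if rng.random() < 1.0 / t:
            last = t
    return last

# across several runs, the last success tends to be late in the sequence
late = [last_success(100_000, random.Random(seed)) for seed in range(10)]
```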
However, this does not imply almost sure convergence (see Section 3.3).

Define the Bayes-expected total variation distance F^π_m(æ_{<t}) between the mixture ξ and the true environment. Fix β > 0 and let ε_t > 0 denote the sequence used to define π_T in Algorithm 1. We assume that t is large enough such that ε_k ≤ β for all k ≥ t, and that δ is small enough such that w(µ | æ_{<t}) ≥ 4δ for all t, which holds with probability at least 1 − δ since the posterior weight of the true environment converges; moreover F^{π_T}_∞(æ_{<t}) ≤ 4δ by assumption. It remains to show that with high probability the value V^{π*_ρ}_µ of the sample ρ's optimal policy π*_ρ is sufficiently close to the µ-optimal value V*_µ. The worst case is that we draw the worst sample from M′ ∩ M′′ twice in a row. From now on, let ρ denote the sample environment we draw at time step t_0, and let t denote some time step between t_0 and t_1 := t_0 + H_{t_0}(ε_{t_0}) (before the next resampling).

In the example of Section 3.3, with probability w(ν′ | æ_{<t}) > 0 the agent samples an environment ν′ under whose optimal policy the value difference V*_µ − V^{π_T}_µ stays bounded away from 0. This happens infinitely often with probability one and thus we cannot get almost sure convergence. ♦

We expect that strong asymptotic optimality can be achieved with Thompson sampling by resampling at every time step (with strong assumptions on the discount function).

4 REGRET

4.1 SETUP

In general environment classes worst-case regret is linear because the agent can get caught in a trap and be unable to recover [Hut05, Sec. 5.3.2]. To achieve sublinear regret we need to ensure that the agent can recover from mistakes. Formally, we make the following assumption.

Definition 7 (Recoverability). An environment ν satisfies the recoverability assumption iff

sup_π | E^{π*_ν}_ν[ V*_ν(æ_{<t}) ] − E^π_ν[ V*_ν(æ_{<t}) ] | → 0 as t → ∞.

The regret of a policy π in an environment µ is R_m(π, µ) := sup_{π′} ( E^{π′}_µ[ Σ_{t=1}^m r_t ] − E^π_µ[ Σ_{t=1}^m r_t ] ).

Assumption 10. The discount function γ is such that (a) γ_t > 0 for all t, (b) γ_t is monotone decreasing in t, and (c) H_t(ε) ∈ o(t) for all ε > 0.

This assumption demands that the discount function is somewhat well-behaved: the function has no oscillations, does not become 0, and the horizon is not growing too fast.
Assumption 10 is satisfied by geometric discounting γ_t := γ^t for some fixed constant γ ∈ (0, 1): it is positive (a) and monotone decreasing (b), Γ_t = γ^t/(1 − γ), and H_t(ε) = ⌈log_γ ε⌉ ∈ o(t) (c). The problem with geometric discounting is that it makes the recoverability assumption very strong: since the horizon is not growing, the environment has to enable faster recovery as time progresses; in this case weakly communicating partially observable MDPs are not recoverable. A choice with H_t(ε) → ∞ that satisfies Assumption 10 is γ_t := e^{−√t}/√t [Lat13, Sec. 2.3.1]. For this discount function Γ_t ≈ 2e^{−√t} and H_t(ε) ≈ −√t log ε + (log ε)^2 ∈ o(t), and thus H_t(ε) → ∞ as t → ∞.

4.2 SUBLINEAR REGRET

This subsection is dedicated to the following theorem.

Theorem 11 (Sublinear Regret). If the discount function γ satisfies Assumption 10, the environment µ ∈ M satisfies the recoverability assumption, and the policy π is asymptotically optimal in mean, i.e., E^π_µ[ V*_µ(æ_{<t}) − V^π_µ(æ_{<t}) ] → 0 as t → ∞, then the regret of π is sublinear: R_m(π, µ) ∈ o(m).

Lemma 12. Let ε > 0 and assume the discount function γ satisfies Assumption 10. Let (d_t)_{t∈N} be a sequence of numbers with |d_t| ≤ 1 for all t. If there is a time step t_0 with

(1/Γ_t) Σ_{k=t}^∞ γ_k d_k < ε  for all t ≥ t_0,   (5)

then

Σ_{t=1}^m d_t ≤ t_0 + ε(m − t_0 + 1) + ((1 + ε)/(1 − ε)) H_m(ε).

Proof. This proof essentially follows the proof of [Hut06, Thm. 17]; see Appendix B.

Proof of Theorem 11. Let (π_m)_{m∈N} denote any sequence of policies, such as a sequence of policies that attain the supremum in the definition of regret. We want to show that

E^{π_m}_µ[ Σ_{t=1}^m r_t ] − E^π_µ[ Σ_{t=1}^m r_t ] ∈ o(m).

For

d^{(m)}_k := E^{π_m}_µ[r_k] − E^π_µ[r_k]   (6)

we have −1 ≤ d^{(m)}_k ≤ 1 since we assumed rewards to be bounded between 0 and 1.
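The contrast between the two discount functions can be checked numerically. This Python sketch (ours; the `tail` parameter is a numerical convenience that truncates the infinite sums) computes H_t(ε) = min{k : Γ_{t+k}/Γ_t ≤ ε} and shows that the horizon is constant for geometric discounting but grows for γ_t = e^{−√t}/√t:

```python
import math

def effective_horizons(gamma, eps, ts, tail=100_000):
    """H_t(eps) = min{k : Gamma_{t+k}/Gamma_t <= eps}, with the
    normalizer Gamma_t = sum_{k >= t} gamma(k) truncated after `tail`."""
    Gam = [0.0] * (tail + 2)
    for k in range(tail, 0, -1):      # suffix sums for t = 1..tail
        Gam[k] = Gam[k + 1] + gamma(k)
    horizons = {}
    for t in ts:
        k = 0
        while Gam[t + k] / Gam[t] > eps:
            k += 1
        horizons[t] = k
    return horizons

# geometric discounting: H_t(0.01) = ceil(log_gamma 0.01), constant in t
geo = effective_horizons(lambda t: 0.9 ** t, eps=0.01, ts=[1, 100, 1000])
# sub-geometric discounting gamma_t = exp(-sqrt(t))/sqrt(t): H_t grows
sub = effective_horizons(lambda t: math.exp(-math.sqrt(t)) / math.sqrt(t),
                         eps=0.01, ts=[1, 100, 1000])
```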
Because the environment µ satisfies the recoverability assumption, we have, for every m,

| E^{π*_µ}_µ[ V*_µ(æ_{<t}) ] − E^{π_m}_µ[ V*_µ(æ_{<t}) ] | → 0 as t → ∞,

uniformly in m. Let ε > 0 and choose t_0 independent of m and large enough such that sup_m (1/Γ_t) Σ_{k=t}^∞ γ_k d^{(m)}_k < ε for all t ≥ t_0. Now we let m ∈ N be given and apply Lemma 12 to get

R_m(π, µ)/m = (Σ_{k=1}^m d^{(m)}_k)/m ≤ (t_0 + ε(m − t_0 + 1) + ((1 + ε)/(1 − ε)) H_m(ε))/m.

Since H_t(ε) ∈ o(t) according to Assumption 10c, we get lim sup_{m→∞} R_m(π, µ)/m ≤ 0.

Example 13 (Converse of Theorem 11 is False). Let µ be a two-armed Bernoulli bandit with means 0 and 1 and suppose we are using geometric discounting with discount factor γ ∈ [0, 1). This environment is recoverable. If our policy π pulls the suboptimal arm exactly on time steps 1, 2, 4, 8, 16, ..., regret will be logarithmic. However, on time steps t = 2^n for n ∈ N the value difference V*_µ − V^π_µ is deterministically at least 1 − γ > 0. ♦

4.3 IMPLICATIONS

We get the following immediate consequence.

Corollary 14 (Sublinear Regret for the Optimal Discounted Policy). If the discount function γ satisfies Assumption 10 and the environment µ satisfies the recoverability assumption, then R_m(π*_µ, µ) ∈ o(m).

Proof. From Theorem 11, since the policy π*_µ is (trivially) asymptotically optimal in µ.

If the environment does not satisfy the recoverability assumption, regret may be linear even for the optimal policy: the optimal policy maximizes discounted rewards, and this short-sightedness might incur a tradeoff that leads to linear regret later on if the environment does not allow recovery.

Corollary 15 (Sublinear Regret for Thompson Sampling). If the discount function γ satisfies Assumption 10 and the environment µ ∈ M satisfies the recoverability assumption, then R_m(π_T, µ) ∈ o(m) for the Thompson sampling policy π_T.

Proof. From Theorem 4 and Theorem 11.
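Example 13 can be checked with a few lines of Python (ours). The policy pulls the mean-0 arm exactly on the powers of two, so its regret after m steps is the number of powers of two up to m, which is logarithmic, while the normalized value gap at a step t = 2^n stays at least 1 − γ:

```python
def regret(m):
    """Regret of the policy that pulls the suboptimal arm (mean 0)
    exactly on steps 1, 2, 4, 8, ... in a two-armed bandit with
    means 0 and 1: each such pull forfeits exactly one reward."""
    return sum(1 for t in range(1, m + 1) if t & (t - 1) == 0)

def value_gap_at(t, gamma=0.9, lookahead=2000):
    """Normalized value gap V* - V^pi at step t: the fraction of the
    remaining discount mass that falls on future powers of two."""
    num = sum(gamma ** k for k in range(t, t + lookahead) if k & (k - 1) == 0)
    den = sum(gamma ** k for k in range(t, t + lookahead))
    return num / den

regrets = [regret(10 ** k) for k in range(1, 7)]    # grows like log(m)
gaps = [value_gap_at(2 ** n) for n in range(3, 8)]  # each at least 1 - gamma
```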
5 DISCUSSION

In this paper we introduced a reinforcement learning policy π_T based on Thompson sampling for general countable environment classes (Algorithm 1). We proved two asymptotic statements about this policy. Theorem 4 states that π_T is asymptotically optimal in mean: the value of π_T in the true environment converges to the optimal value. Corollary 15 states that the regret of π_T is sublinear: the difference of the expected average rewards between π_T and the best informed policy converges to 0. Both statements come without a concrete convergence rate because of the weak assumptions we made on the environment class.

Asymptotic optimality has to be taken with a grain of salt. It provides no incentive for the agent to avoid traps in the environment. Once the agent gets caught in a trap, all actions are equally bad and thus optimal: asymptotic optimality has been achieved. Even worse, an asymptotically optimal agent has to explore all the traps because they might contain hidden treasure. Overall, there is a dichotomy between the asymptotic nature of asymptotic optimality and the use of discounting to prioritize the present over the future. Ideally, we would want to give finite guarantees instead, but without additional assumptions this is likely impossible in this general setting. Our regret bound could be a step in the right direction, even though it is itself asymptotic in nature.

For Bayesians, asymptotic optimality means that the posterior distribution w(· | æ_{<t}) concentrates on environments that are indistinguishable from the true environment.

Proof of Lemma 12 (Appendix B). By Assumption 10a, γ_t > 0 for all t and hence Γ_t > 0 for all t. By Assumption 10b, γ is monotone decreasing, so we get for all n ∈ N

Γ_t = Σ_{k=t}^∞ γ_k ≤ Σ_{k=t}^{t+n−1} γ_t + Σ_{k=t+n}^∞ γ_k = nγ_t + Γ_{t+n},

and with n := H_t(ε) this yields

γ_t H_t(ε)/Γ_t ≥ 1 − Γ_{t+H_t(ε)}/Γ_t ≥ 1 − ε > 0.   (11)

In particular, this bound holds for all t and ε > 0.
Next, we define a series of nonnegative weights (b_t)_{t≥1} such that

Σ_{t=t_0}^m d_t = Σ_{t=t_0}^m (b_t/Γ_t) Σ_{k=t}^m γ_k d_k.

This yields the constraints

Σ_{k=t_0}^t (b_k/Γ_k) γ_t = 1  for all t ≥ t_0.

The solution to these constraints is

b_{t_0} = Γ_{t_0}/γ_{t_0}  and  b_t = Γ_t/γ_t − Γ_t/γ_{t−1} for t > t_0.   (12)

Thus we get

Σ_{t=t_0}^m b_t = Γ_{t_0}/γ_{t_0} + Σ_{t=t_0+1}^m (Γ_t/γ_t − Γ_t/γ_{t−1})
= Γ_{m+1}/γ_m + Σ_{t=t_0}^m (Γ_t/γ_t − Γ_{t+1}/γ_t)
= Γ_{m+1}/γ_m + m − t_0 + 1
≤ H_m(ε)/(1 − ε) + m − t_0 + 1

for all ε > 0 according to (11). Finally,

Σ_{t=1}^m d_t ≤ Σ_{t=1}^{t_0} d_t + Σ_{t=t_0}^m (b_t/Γ_t) Σ_{k=t}^m γ_k d_k
≤ t_0 + Σ_{t=t_0}^m (b_t/Γ_t) Σ_{k=t}^∞ γ_k d_k − Σ_{t=t_0}^m (b_t/Γ_t) Σ_{k=m+1}^∞ γ_k d_k

and, using the assumption (5) and d_t ≥ −1,

< t_0 + Σ_{t=t_0}^m b_t ε + Σ_{t=t_0}^m b_t Γ_{m+1}/Γ_t
≤ t_0 + εH_m(ε)/(1 − ε) + ε(m − t_0 + 1) + Σ_{t=t_0}^m b_t Γ_{m+1}/Γ_t.

For the latter term we substitute (12) to get

Σ_{t=t_0}^m b_t Γ_{m+1}/Γ_t = Γ_{m+1}/γ_{t_0} + Σ_{t=t_0+1}^m (Γ_{m+1}/γ_t − Γ_{m+1}/γ_{t−1}) = Γ_{m+1}/γ_m ≤ H_m(ε)/(1 − ε)

with (11).
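The constraint solved by (12) can be verified numerically. This Python sketch (ours; `tail` truncates the infinite sums Γ_t) builds the weights b_t and checks Σ_{k=t_0}^t b_k γ_t / Γ_k = 1 for all t ≥ t_0, for both discount functions from Section 4.1:

```python
import math

def check_constraint(gamma_fn, t0, m, tail=20_000):
    """Build b_{t0} = Gamma_{t0}/gamma_{t0} and
    b_t = Gamma_t/gamma_t - Gamma_t/gamma_{t-1} for t > t0, then return
    sum_{k=t0..t} b_k * gamma_t / Gamma_k for each t; all should be 1."""
    Gam = {t: sum(gamma_fn(k) for k in range(t, t + tail))
           for t in range(t0, m + 1)}
    b = {t0: Gam[t0] / gamma_fn(t0)}
    for t in range(t0 + 1, m + 1):
        b[t] = Gam[t] / gamma_fn(t) - Gam[t] / gamma_fn(t - 1)
    return [sum(b[k] * gamma_fn(t) / Gam[k] for k in range(t0, t + 1))
            for t in range(t0, m + 1)]

geo_check = check_constraint(lambda t: 0.8 ** t, t0=3, m=30)
sub_check = check_constraint(lambda t: math.exp(-math.sqrt(t)) / math.sqrt(t),
                             t0=3, m=30)
```

The check succeeds because b_k/Γ_k = 1/γ_k − 1/γ_{k−1} telescopes, independently of the particular discount function.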
