Easy Monotonic Policy Iteration


Author: Joshua Achiam

Joshua Achiam (jachiam@berkeley.edu), UC Berkeley

Abstract

A key problem in reinforcement learning for control with general function approximators (such as deep neural networks and other nonlinear functions) is that, for many algorithms employed in practice, updates to the policy or $Q$-function may fail to improve performance, or worse, may actually cause the policy performance to degrade. Prior work has addressed this for policy iteration by deriving tight policy improvement bounds; by optimizing the lower bound on policy improvement, a better policy is guaranteed. However, existing approaches suffer from bounds that are hard to optimize in practice because they include sup norm terms which cannot be efficiently estimated or differentiated. In this work, we derive a better policy improvement bound where the sup norm of the policy divergence has been replaced with an average divergence; this leads to an algorithm, Easy Monotonic Policy Iteration, that generates sequences of policies with guaranteed non-decreasing returns and is easy to implement in a sample-based framework.

1. Introduction

Following the success of the Deep Q-Network (DQN) approach (Mnih et al., 2013), there has been a surge of interest in using reinforcement learning for control with nonlinear function approximators, particularly with deep neural networks; in these methods, the policy or the $Q$-function is represented by the approximator. Examples include variants on DQN, such as Double-DQN (van Hasselt et al., 2015) and Deep Recurrent Q-Learning (Hausknecht & Stone, 2015), as well as methods that mix neural network policies and neural network value functions, such as the asynchronous advantage actor-critic algorithm (Mnih et al., 2016).
However, despite empirical successes, there are few algorithms that come with theoretical guarantees for continued policy improvement when training policies represented by arbitrary nonlinear function approximators. One approach, which serves as the inspiration for this work, seeks to maximize a lower bound on policy performance to guarantee an improvement. This method has its roots in conservative policy iteration (CPI) (Kakade & Langford, 2002), and was extended separately by Pirotta et al. (2013) and Schulman et al. (2015a). Both Pirotta et al. and Schulman et al. derived similar policy improvement bounds consisting of two parts: an expected advantage of the new policy with respect to the old policy, and a penalty on a divergence between the new policy and the old policy. The divergence penalty, in both cases, is quite steep: it involves the maximum policy divergence over all states.

This makes it particularly difficult to apply the bounds in the usual situations where function approximation is desirable: domains where the model (and hence the total set of states) is unknown, in which case it is not possible to evaluate the bound, and/or where the state space is large, in which case the bound may be unnecessarily conservative. Pirotta et al. developed algorithms that primarily apply to the case where the model is known or where approximation is present only in the advantage estimation, and so did not address this issue. Schulman et al. addressed the issue by proposing to solve an approximate form of the problem, where the maximum divergence penalty is replaced by a trust-region constraint on the average divergence; this algorithm is called trust region policy optimization (TRPO).
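The computational gap between the two kinds of penalty can be seen directly: the maximum divergence $\max_s D_{TV}(\pi' \| \pi)[s]$ requires access to every state, while the average $\mathbb{E}_{s \sim d^\pi}[D_{TV}(\pi' \| \pi)[s]]$ can be estimated from sampled states alone. A minimal Python sketch (the policies and the sampling here are made up for illustration; in practice, states would be sampled from $d^\pi$ by running the agent):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 1000, 4

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Two illustrative tabular policies over a discrete state space (rows sum to 1).
logits_old = rng.normal(size=(n_states, n_actions))
logits_new = logits_old + 0.1 * rng.normal(size=(n_states, n_actions))
pi_old, pi_new = softmax(logits_old), softmax(logits_new)

# Per-state total variation divergence: D_TV[s] = (1/2) * sum_a |pi'(a|s) - pi(a|s)|.
tv = 0.5 * np.abs(pi_new - pi_old).sum(axis=1)

# The max over all states needs the full table -- unavailable when the model is unknown.
max_tv = tv.max()

# The average under d^pi only needs sampled states; here we mimic sampling 100 of them.
sampled_states = rng.integers(0, n_states, size=100)
avg_tv_estimate = tv[sampled_states].mean()

print(max_tv, avg_tv_estimate)
```

The sample mean is an unbiased estimate of the average divergence under the sampling distribution, which is what makes the average (unlike the max) compatible with sample-based optimization.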
They found TRPO worked quite well in a number of domains, successfully optimizing neural network policies to play Atari from raw pixels and to control a simulated robot in locomotion tasks.

In this work, we derive a new policy improvement bound where the penalty on policy divergence goes as an average, rather than the maximum, divergence. This allows us to propose Easy Monotonic Policy Iteration (EMPI), an algorithm that exploits the bound to generate sequences of policies with guaranteed non-decreasing returns, and which is easy to implement in a sample-based framework. It also enables us to give a new theoretical justification for TRPO: we are able to show that each iteration of TRPO has a worst-case degradation of policy performance which depends on a hyperparameter of the algorithm. Our contributions at present are entirely theoretical, but empirical results from testing EMPI will appear in a future version of this work.

2. Preliminaries

A Markov decision process is a tuple $(S, A, R, P, \mu)$, where $S$ is the set of states, $A$ is the set of actions, $R : S \times A \times S \to \mathbb{R}$ is the reward function, $P : S \times A \times S \to [0,1]$ is the transition probability function (where $P(s'|s,a)$ is the probability of transitioning to state $s'$ given that the previous state was $s$ and the agent took action $a$ in $s$), and $\mu : S \to [0,1]$ is the starting state distribution. A policy $\pi : S \times A \to [0,1]$ is a distribution over actions per state, with $\pi(a|s)$ the probability of selecting $a$ in state $s$. We consider the problem of picking a policy $\pi$ that maximizes the expected infinite-horizon discounted total reward,

$$J(\pi) \doteq \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right],$$

where $\gamma \in [0,1)$ is the discount factor, $\tau$ denotes a trajectory ($\tau = (s_0, a_0, s_1, \ldots)$), and $\tau \sim \pi$ is shorthand for indicating that trajectories are drawn from distributions induced by $\pi$: $s_0 \sim \mu$, $a_t \sim \pi(\cdot|s_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$.

We define the on-policy value function $V^\pi : S \to \mathbb{R}$, action-value function $Q^\pi : S \times A \to \mathbb{R}$, and advantage function $A^\pi : S \times A \to \mathbb{R}$ in the usual way:

$$V^\pi(s) \doteq \mathbb{E}_{a_0, s_1, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t R_t \,\middle|\, s_0 = s\right],$$

$$Q^\pi(s,a) \doteq \mathbb{E}_{s_1, a_1, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t R_t \,\middle|\, s_0 = s,\, a_0 = a\right],$$

$$A^\pi(s,a) \doteq Q^\pi(s,a) - V^\pi(s),$$

where $R_t = R(s_t, a_t, s_{t+1})$. The $Q$ and $V$ functions are connected by

$$Q^\pi(s,a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[R(s,a,s') + \gamma V^\pi(s')\right].$$

Our analysis will make extensive use of the discounted future state distribution, $d^\pi$, which is defined as

$$d^\pi(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi).$$

It allows us to express the expected discounted total reward compactly as

$$J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}\left[R(s,a,s')\right], \qquad (1)$$

where by $a \sim \pi$ we mean $a \sim \pi(\cdot|s)$, and by $s' \sim P$ we mean $s' \sim P(\cdot|s,a)$. We drop the explicit notation for the sake of reducing clutter, but it should be clear from context that $a$ and $s'$ depend on $s$.

Next, we will examine some useful properties of $d^\pi$ that become apparent in vector form for finite state spaces. Let $p_\pi^t \in \mathbb{R}^{|S|}$ denote the vector with components $p_\pi^t(s) = P(s_t = s \mid \pi)$, and let $P_\pi \in \mathbb{R}^{|S| \times |S|}$ denote the transition matrix with components $P_\pi(s'|s) = \int P(s'|s,a)\, \pi(a|s)\, da$; then $p_\pi^t = P_\pi p_\pi^{t-1} = P_\pi^t \mu$ and

$$d^\pi = (1-\gamma) \sum_{t=0}^{\infty} (\gamma P_\pi)^t \mu = (1-\gamma)(I - \gamma P_\pi)^{-1} \mu. \qquad (2)$$

This formulation helps us easily obtain the following lemma.

Lemma 1. For any function $f : S \to \mathbb{R}$ and any policy $\pi$,

$$(1-\gamma)\, \mathbb{E}_{s \sim \mu}[f(s)] + \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}[\gamma f(s')] - \mathbb{E}_{s \sim d^\pi}[f(s)] = 0. \qquad (3)$$

Proof.
Multiply both sides of (2) by $(I - \gamma P_\pi)$ and take the inner product with the vector $f \in \mathbb{R}^{|S|}$.

Combining this with (1), we obtain the following, for any function $f$ and any policy $\pi$:

$$J(\pi) = \mathbb{E}_{s \sim \mu}[f(s)] + \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}\left[R(s,a,s') + \gamma f(s') - f(s)\right]. \qquad (4)$$

This identity is nice for two reasons. First: if we pick $f$ to be an approximator of the value function $V^\pi$, then (4) relates the true discounted return of the policy, $J(\pi)$, to the estimate of the policy return, $\mathbb{E}_{s \sim \mu}[f(s)]$, and to the on-policy average TD-error of the approximator; this is aesthetically satisfying. Second: it shows that reward-shaping by $\gamma f(s') - f(s)$ has the effect of translating the total discounted return by $\mathbb{E}_{s \sim \mu}[f(s)]$, a fixed constant independent of policy; this illustrates the finding of Ng et al. (1999) that reward shaping by $\gamma f(s') - f(s)$ does not change the optimal policy.

It is also helpful to introduce an identity for the vector difference of the discounted future state visitation distributions on two different policies, $\pi'$ and $\pi$. Define the matrices $G \doteq (I - \gamma P_\pi)^{-1}$, $\bar{G} \doteq (I - \gamma P_{\pi'})^{-1}$, and $\Delta = P_{\pi'} - P_\pi$. Then

$$G^{-1} - \bar{G}^{-1} = (I - \gamma P_\pi) - (I - \gamma P_{\pi'}) = \gamma \Delta;$$

left-multiplying by $G$ and right-multiplying by $\bar{G}$, we obtain

$$\bar{G} - G = \gamma \bar{G} \Delta G.$$

Thus

$$d^{\pi'} - d^\pi = (1-\gamma)\left(\bar{G} - G\right)\mu = \gamma(1-\gamma)\, \bar{G} \Delta G \mu = \gamma\, \bar{G} \Delta d^\pi. \qquad (5)$$

For simplicity in what follows, we will only consider MDPs with finite state and action spaces, although our attention is on MDPs that are too large for tabular methods.

3. Main Results

In this section, we will derive and present the new policy improvement bound. We will begin with a lemma.

Lemma 2. For any function $f : S \to \mathbb{R}$ and any policies $\pi'$ and $\pi$, define

$$L_{\pi,f}(\pi') \doteq \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}\left[\left(\frac{\pi'(a|s)}{\pi(a|s)} - 1\right)\left(R(s,a,s') + \gamma f(s') - f(s)\right)\right], \qquad (6)$$

and

$$\epsilon_f^{\pi'} \doteq \max_s \left| \mathbb{E}_{a \sim \pi',\, s' \sim P}\left[R(s,a,s') + \gamma f(s') - f(s)\right] \right|.$$

Then the following bound holds:

$$J(\pi') - J(\pi) \ge \frac{1}{1-\gamma}\left(L_{\pi,f}(\pi') - 2\,\epsilon_f^{\pi'}\, D_{TV}(d^{\pi'} \| d^\pi)\right), \qquad (7)$$

where $D_{TV}$ is the total variational divergence. Furthermore, the bound is tight: when $\pi' = \pi$, the LHS and RHS are identically zero.

Proof. First, for notational convenience, let $\delta_f(s,a,s') \doteq R(s,a,s') + \gamma f(s') - f(s)$. (The choice of $\delta$ to denote this quantity is intentionally suggestive: it bears a strong resemblance to a TD-error.) By (4), we obtain the identity

$$J(\pi') - J(\pi) = \frac{1}{1-\gamma}\left(\mathbb{E}_{\substack{s \sim d^{\pi'} \\ a \sim \pi' \\ s' \sim P}}[\delta_f(s,a,s')] - \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}[\delta_f(s,a,s')]\right).$$

Now, we restrict our attention to the first term in this equation. Let $\bar{\delta}_f^{\pi'} \in \mathbb{R}^{|S|}$ denote the vector with components $\bar{\delta}_f^{\pi'}(s) = \mathbb{E}_{a \sim \pi',\, s' \sim P}[\delta_f(s,a,s') \mid s]$. Observe that

$$\mathbb{E}_{\substack{s \sim d^{\pi'} \\ a \sim \pi' \\ s' \sim P}}[\delta_f(s,a,s')] = \left\langle d^{\pi'}, \bar{\delta}_f^{\pi'} \right\rangle = \left\langle d^\pi, \bar{\delta}_f^{\pi'} \right\rangle + \left\langle d^{\pi'} - d^\pi, \bar{\delta}_f^{\pi'} \right\rangle.$$

This term is then straightforwardly bounded by applying Hölder's inequality; for any $p, q \in [1, \infty]$ such that $1/p + 1/q = 1$, we have

$$\mathbb{E}_{\substack{s \sim d^{\pi'} \\ a \sim \pi' \\ s' \sim P}}[\delta_f(s,a,s')] \ge \left\langle d^\pi, \bar{\delta}_f^{\pi'} \right\rangle - \left\| d^{\pi'} - d^\pi \right\|_p \left\| \bar{\delta}_f^{\pi'} \right\|_q.$$

In particular, we choose $p = 1$ and $q = \infty$; however, we believe that this step is very interesting, and different choices for dealing with the inner product $\langle d^{\pi'} - d^\pi, \bar{\delta}_f^{\pi'} \rangle$ may lead to novel and useful bounds. With $\|d^{\pi'} - d^\pi\|_1 = 2\, D_{TV}(d^{\pi'} \| d^\pi)$ and $\|\bar{\delta}_f^{\pi'}\|_\infty = \epsilon_f^{\pi'}$, the bound is almost obtained.
The last step is to observe that, by the importance sampling identity,

$$\left\langle d^\pi, \bar{\delta}_f^{\pi'} \right\rangle = \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi' \\ s' \sim P}}[\delta_f(s,a,s')] = \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}\left[\frac{\pi'(a|s)}{\pi(a|s)}\, \delta_f(s,a,s')\right].$$

After grouping terms, the bound is obtained.

This lemma makes use of many ideas that have been explored before; for the special case of $f = V^\pi$, this strategy (after bounding $D_{TV}(d^{\pi'} \| d^\pi)$) leads directly to some of the policy improvement bounds previously obtained by Pirotta et al. and Schulman et al. The form given here is more general, however, because it allows for freedom in choosing $f$. (Although we do not report the results here, we note that this freedom allows for the derivation of analogous bounds involving Bellman errors of Q-function approximators, which is interesting and suggestive.)

Remark. It is reasonable to ask if there is a choice of $f$ which maximizes the lower bound here. This turns out to trivially be $f = V^{\pi'}$. Observe that $\mathbb{E}_{s' \sim P}[\delta_{V^{\pi'}}(s,a,s') \mid s,a] = A^{\pi'}(s,a)$. For all states, $\mathbb{E}_{a \sim \pi'}[A^{\pi'}(s,a)] = 0$ (by the definition of $A^{\pi'}$), thus $\bar{\delta}_{V^{\pi'}}^{\pi'} = 0$ and $\epsilon_{V^{\pi'}}^{\pi'} = 0$. Also, $L_{\pi, V^{\pi'}}(\pi') = -\mathbb{E}_{s \sim d^\pi,\, a \sim \pi}\left[A^{\pi'}(s,a)\right]$; from (4) with $f = V^{\pi'}$, we can see that this exactly equals $J(\pi') - J(\pi)$. Thus, for $f = V^{\pi'}$, we recover an exact equality. While this is not practically useful to us (because, when we want to optimize a lower bound with respect to $\pi'$, it is too expensive to evaluate $V^{\pi'}$ for each candidate), it provides insight: the penalty coefficient on the divergence captures information about the mismatch between $f$ and $V^{\pi'}$.

Next, we are interested in bounding the divergence term, $\|d^{\pi'} - d^\pi\|_1$. We give the following lemma; to the best of our knowledge, this is a new result.

Lemma 3. The divergence between discounted future state visitation distributions, $\|d^{\pi'} - d^\pi\|_1$, is bounded by an average divergence of the policies $\pi'$ and $\pi$:

$$\|d^{\pi'} - d^\pi\|_1 \le \frac{2\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi}\left[D_{TV}(\pi' \| \pi)[s]\right], \qquad (8)$$

where $D_{TV}(\pi' \| \pi)[s] = (1/2) \sum_a |\pi'(a|s) - \pi(a|s)|$.

Proof. First, using (5), we obtain

$$\|d^{\pi'} - d^\pi\|_1 = \gamma\, \|\bar{G} \Delta d^\pi\|_1 \le \gamma\, \|\bar{G}\|_1\, \|\Delta d^\pi\|_1.$$

$\|\bar{G}\|_1$ is bounded by

$$\|\bar{G}\|_1 = \|(I - \gamma P_{\pi'})^{-1}\|_1 \le \sum_{t=0}^{\infty} \gamma^t\, \|P_{\pi'}\|_1^t = (1-\gamma)^{-1}.$$

To conclude the lemma, we bound $\|\Delta d^\pi\|_1$:

$$\begin{aligned}
\|\Delta d^\pi\|_1 &= \sum_{s'} \left| \sum_s \Delta(s'|s)\, d^\pi(s) \right| \\
&\le \sum_{s,s'} |\Delta(s'|s)|\, d^\pi(s) \\
&= \sum_{s,s'} \left| \sum_a P(s'|s,a)\left(\pi'(a|s) - \pi(a|s)\right) \right| d^\pi(s) \\
&\le \sum_{s,a,s'} P(s'|s,a)\, |\pi'(a|s) - \pi(a|s)|\, d^\pi(s) \\
&= \sum_{s,a} |\pi'(a|s) - \pi(a|s)|\, d^\pi(s) \\
&= 2\, \mathbb{E}_{s \sim d^\pi}\left[D_{TV}(\pi' \| \pi)[s]\right].
\end{aligned}$$

The new policy improvement bound follows immediately.

Theorem 1. For any function $f : S \to \mathbb{R}$ and any policies $\pi', \pi$, with $L_{\pi,f}(\pi')$ as defined in (6) and $\epsilon_f^{\pi'} \doteq \max_s \left|\mathbb{E}_{a \sim \pi',\, s' \sim P}\left[R(s,a,s') + \gamma f(s') - f(s)\right]\right|$, the following bound holds:

$$J(\pi') - J(\pi) \ge \frac{1}{1-\gamma}\left(L_{\pi,f}(\pi') - \frac{2\gamma\, \epsilon_f^{\pi'}}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi}\left[D_{TV}(\pi' \| \pi)[s]\right]\right). \qquad (9)$$

Furthermore, the bound is tight: when $\pi' = \pi$, the LHS and RHS are identically zero.

Proof. Begin with the bound of Lemma 2 and bound the divergence $D_{TV}(d^{\pi'} \| d^\pi)$ by Lemma 3.

A few quick observations connect this result to prior work. Clearly, we could bound the expectation $\mathbb{E}_{s \sim d^\pi}[D_{TV}(\pi' \| \pi)[s]]$ by $\max_s D_{TV}(\pi' \| \pi)[s]$. Doing this, picking $f = V^\pi$, and bounding $\epsilon_{V^\pi}^{\pi'}$ to get a second factor of $\max_s D_{TV}(\pi' \| \pi)[s]$, we recover (up to assumption-dependent factors) the policy improvement bounds given by Pirotta et al. (2013) as Corollary 3.6, and by Schulman et al. (2015a) as Theorem 1a.

Because the choice of $f = V^\pi$ does allow for a nice simplification, we will give the bound with this choice as a corollary.

Corollary 1. For any policies $\pi', \pi$, with $\epsilon^{\pi'} \doteq \max_s |\mathbb{E}_{a \sim \pi'}[A^\pi(s,a)]|$, the following bound holds:

$$J(\pi') - J(\pi) \ge \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi'}}\left[A^\pi(s,a) - \frac{2\gamma\, \epsilon^{\pi'}}{1-\gamma}\, D_{TV}(\pi' \| \pi)[s]\right]. \qquad (10)$$

4. Easy Monotonic Policy Iteration

As with the weaker versions of the bound, this bound allows us to generate a sequence of policies with non-degrading performance in any restricted class of policies, i.e., policies that are represented by neural networks or other function approximators. We'll use $\Pi_\theta$ to denote an arbitrary restricted class of policies, and understand that this usually means policies smoothly parametrized by some set of parameters $\theta$. Algorithm 1 gives the general template for Easy Monotonic Policy Iteration (EMPI), which obtains monotonic improvements using (9) (and one additional small step of bounding to remove $\pi'$ from the penalty coefficient). To see that EMPI is indeed monotonic, observe that $\pi_i$ is a feasible point of the optimization defining $\pi_{i+1}$, with objective value 0; this is a certificate that the objective at optimum is $\ge 0$.

Although we do not specify how to solve the optimization problem, we note that the objective is differentiable with respect to the parameters of a candidate policy $\pi'$; as a result, a gradient-based method can be used. Furthermore, when we use neural network policies where the vector of $n$ parameters may take on any value in $\mathbb{R}^n$, the optimization is unconstrained. This is one sense in which we consider this algorithm 'easy' to implement.

Usually we would be interested in applying this method to problems with large or unknown state spaces, where exact calculation of the objective function is not feasible.
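In a small tabular MDP, every quantity in the bound can be computed exactly via (2), so the Corollary 1 guarantee can be checked numerically. The following sketch uses a randomly generated MDP (all transition and reward values are made up for illustration, and `policy_quantities` is a hypothetical helper, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9

# A made-up tabular MDP: P[s, a] is a distribution over next states, R[s, a, s'] a reward.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.normal(size=(nS, nA, nS))
mu = np.full(nS, 1.0 / nS)          # uniform start-state distribution

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def policy_quantities(pi):
    """Exact J(pi), d^pi, and A^pi for a tabular policy pi[s, a] (hypothetical helper)."""
    P_pi = np.einsum('sax,sa->sx', P, pi)          # transition matrix under pi
    r_pi = np.einsum('sax,sax,sa->s', P, R, pi)    # expected one-step reward per state
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = np.einsum('sax,sax->sa', P, R) + gamma * np.einsum('sax,x->sa', P, V)
    # Discounted future state distribution, eq. (2): d = (1-gamma)(I - gamma P_pi)^{-1} mu.
    d = (1 - gamma) * np.linalg.solve((np.eye(nS) - gamma * P_pi).T, mu)
    return mu @ V, d, Q - V[:, None]

pi     = softmax(rng.normal(size=(nS, nA)))        # current policy
pi_new = softmax(rng.normal(size=(nS, nA)))        # arbitrary candidate policy

J_old, d_old, A_old = policy_quantities(pi)
J_new, _, _ = policy_quantities(pi_new)

# Right-hand side of the Corollary 1 bound.
eps = np.abs((pi_new * A_old).sum(axis=1)).max()   # max_s |E_{a~pi'}[A^pi(s,a)]|
tv = 0.5 * np.abs(pi_new - pi).sum(axis=1)         # D_TV(pi'||pi)[s]
surrogate = (pi_new * A_old).sum(axis=1) - (2 * gamma * eps / (1 - gamma)) * tv
rhs = (d_old @ surrogate) / (1 - gamma)

assert J_new - J_old >= rhs - 1e-9                 # the bound holds
```

Since all expectations are exact here, the check holds up to floating-point error; in the sample-based setting targeted by EMPI, the same quantities would instead be replaced by Monte Carlo estimates.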
Because the objective is defined almost entirely in terms of expectations on policy $\pi_i$, however, we can use a sample-based approximation of it; this is the other sense in which this algorithm is 'easy.' The only challenge is estimating $\epsilon$, where we might apply a worst-case bound (potentially making the policy step too conservative) or a reasonable heuristic bound (potentially permitting degraded policy performance).

Algorithm 1: Easy Monotonic Policy Iteration: monotonic policy improvements in arbitrary policy classes

  Input: initial policy $\pi_0 \in \Pi_\theta$, max number of iterations $N$, stopping tolerance $\alpha$
  for $i = 1, 2, \ldots, N$, or until $J(\pi_i) - J(\pi_{i-1}) \le \alpha$, do:
    Choose a function $f_i : S \to \mathbb{R}$
    $\pi_{i+1} \leftarrow \arg\max_{\pi' \in \Pi_\theta}\; L_{\pi_i, f_i}(\pi') - C\, \mathbb{E}_{s \sim d^{\pi_i}}\left[D_{TV}(\pi' \| \pi_i)[s]\right]$,
      where $C = \frac{2\gamma\epsilon}{1-\gamma}$ and $\epsilon = \max_{s,a} \left|\mathbb{E}_{s' \sim P}\left[R(s,a,s') + \gamma f_i(s') - f_i(s)\right]\right|$
  end for

4.1. Implementing EMPI

A practical form of the general algorithm is obtained by choosing a particular form for the reward-shaping functions $f_i$, possibly approximating terms in the objective, and replacing the expectations in the objective by sample estimates. We'll choose $f_i = V^{\pi_i}$, so that the base optimization problem becomes

$$\max_{\pi' \in \Pi_\theta}\; \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi'}}\left[A^\pi(s,a) - \frac{2\gamma\epsilon}{1-\gamma}\, D_{TV}(\pi' \| \pi)[s]\right],$$

where $\epsilon = \max_{s,a} |A^\pi(s,a)|$.

Suppose that we use an estimator of the advantage, $\hat{A}^\pi(s,a)$, instead of the true advantage $A^\pi(s,a)$. For example, if we have learned a neural network value function approximator $\hat{V}^\pi$, we may use $\hat{A}^\pi(s,a) = R(s,a,s') + \gamma \hat{V}^\pi(s') - \hat{V}^\pi(s)$; or perhaps we might use the generalized advantage estimator (Schulman et al., 2015b).
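As a concrete illustration of the estimators mentioned above, the following sketch computes the one-step TD-residual advantage and the generalized advantage estimator (GAE) from a short rollout; the reward and value numbers are made up, and $\lambda$ is an illustrative choice:

```python
import numpy as np

gamma, lam = 0.99, 0.95   # discount and GAE parameter (illustrative values)

# Made-up rollout data: rewards r_t and learned value estimates V_hat(s_t) for T steps,
# plus the value estimate at the final next state (length T + 1).
rewards = np.array([0.0, 1.0, 0.0, 0.5, 2.0])
v_hat   = np.array([1.2, 1.0, 1.5, 0.8, 0.9, 0.0])

# One-step TD-residual estimator: A_hat_t = r_t + gamma * V_hat(s_{t+1}) - V_hat(s_t).
deltas = rewards + gamma * v_hat[1:] - v_hat[:-1]

# Generalized advantage estimator (Schulman et al., 2015b): a discounted sum of
# TD residuals, A_hat_t = sum_k (gamma * lam)^k * delta_{t+k}, computed backward.
gae = np.zeros_like(deltas)
running = 0.0
for t in reversed(range(len(deltas))):
    running = deltas[t] + gamma * lam * running
    gae[t] = running

print(deltas)
print(gae)
```

At $\lambda = 0$ the GAE reduces to the one-step TD residual; larger $\lambda$ trades estimator bias for variance, which (per Corollary 2 below in the text) would also change the appropriate divergence penalty.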
Observe that, because $\mathbb{E}_{a \sim \pi}[A^\pi(s,a)] = 0$, for every state $s$ we have

$$\mathbb{E}_{a \sim \pi'}[A^\pi(s,a)] = \mathbb{E}_{a \sim \pi'}\left[\hat{A}^\pi(s,a)\right] - \mathbb{E}_{a \sim \pi}\left[\hat{A}^\pi(s,a)\right] + \sum_a \left(\pi'(a|s) - \pi(a|s)\right)\left(A^\pi(s,a) - \hat{A}^\pi(s,a)\right).$$

From this, we derive a bound:

$$\mathbb{E}_{a \sim \pi'}[A^\pi(s,a)] \ge \mathbb{E}_{a \sim \pi'}\left[\hat{A}^\pi(s,a)\right] - \mathbb{E}_{a \sim \pi}\left[\hat{A}^\pi(s,a)\right] - 2 \max_a \left|A^\pi(s,a) - \hat{A}^\pi(s,a)\right| D_{TV}(\pi' \| \pi)[s],$$

which gives us the following corollary to Theorem 1.

Corollary 2 (Policy Improvement Bound with Arbitrary Advantage Estimators). For any policies $\pi', \pi$, and any advantage estimator $\hat{A}^\pi : S \times A \to \mathbb{R}$, with $c(s) \doteq \max_a |A^\pi(s,a) - \hat{A}^\pi(s,a)|$ and $\epsilon^{\pi'} \doteq \max_s |\mathbb{E}_{a \sim \pi'}[A^\pi(s,a)]|$, the following bound holds:

$$J(\pi') - J(\pi) \ge \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi'}}\left[\hat{A}^\pi(s,a) - \mathbb{E}_{\bar{a} \sim \pi}\left[\hat{A}^\pi(s,\bar{a})\right] - 2\left(c(s) + \frac{\gamma\, \epsilon^{\pi'}}{1-\gamma}\right) D_{TV}(\pi' \| \pi)[s]\right]. \qquad (11)$$

Furthermore, the bound is tight: when $\pi' = \pi$, the LHS and RHS are identically zero.

Corollary 2 tells us that we are theoretically justified in using any advantage estimator as long as we increase the penalty on the policy divergence appropriately. Also, we can see that if the estimator is high-quality ($|A^\pi(s,a) - \hat{A}^\pi(s,a)|$ small) we can take larger policy improvement steps, which makes sense. If the advantage estimator is poor, then we probably will not make much progress; this is reflected in the bound.

The base optimization problem for EMPI with an advantage estimator, which we obtain by using (11), dropping constants, and removing $\pi'$ from the penalty coefficient by bounding, is

$$\max_{\pi' \in \Pi_\theta}\; \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi'}}\left[\hat{A}^\pi(s,a) - C(s)\, D_{TV}(\pi' \| \pi)[s]\right],$$

where

$$C(s) = 2\left(c(s) + \frac{\gamma\epsilon}{1-\gamma}\right), \quad c(s) = \max_a \left|A^\pi(s,a) - \hat{A}^\pi(s,a)\right|, \quad \epsilon = \max_{s,a} |A^\pi(s,a)|.$$

Now, we will put the objective in a form which can be sampled on policy $\pi$.
First, we use the importance sampling identity so that the objective becomes

$$\mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi}}\left[\frac{\pi'(a|s)}{\pi(a|s)}\, \hat{A}^\pi(s,a) - C(s)\, D_{TV}(\pi' \| \pi)[s]\right].$$

Next, we rewrite the expectation in terms of trajectories:

$$\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(\frac{\pi'(a_t|s_t)}{\pi(a_t|s_t)}\, \hat{A}^\pi(s_t, a_t) - C(s_t)\, D_{TV}(\pi' \| \pi)[s_t]\right)\right]. \qquad (12)$$

After running an agent on policy $\pi$ to generate a set of sample trajectories, we can estimate (12) by averaging over the sample trajectories. The sample estimate of (12) then serves as the objective for the optimization step in EMPI. We do not give experimental results here, but plan to include them in a future version of this work.

5. Implications for Trust Region Policy Optimization

Schulman et al. proposed trust region policy optimization, where at every iteration the policy is updated from $\pi$ to $\pi'$ by solving the following optimization problem:

$$\pi' = \arg\max_{\pi' \in \Pi_\theta}\; \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi'}}[A^\pi(s,a)] \quad \text{subject to} \quad \mathbb{E}_{s \sim d^\pi}\left[D_{KL}(\pi \| \pi')[s]\right] \le \delta, \qquad (13)$$

where $D_{KL}(\pi \| \pi')[s] = \mathbb{E}_{a \sim \pi}\left[\log \frac{\pi(a|s)}{\pi'(a|s)}\right]$, and the policy divergence limit $\delta$ is a hyperparameter of the algorithm.

The KL-divergence and the TV-divergence of arbitrary distributions $p, q$ are related by Pinsker's inequality, $D_{TV}(p \| q) \le \sqrt{D_{KL}(p \| q)/2}$; combining this with Jensen's inequality, we obtain the following bound:

$$\mathbb{E}_{s \sim d^\pi}\left[D_{TV}(\pi' \| \pi)[s]\right] \le \mathbb{E}_{s \sim d^\pi}\left[\sqrt{\tfrac{1}{2} D_{KL}(\pi \| \pi')[s]}\right] \le \sqrt{\tfrac{1}{2}\, \mathbb{E}_{s \sim d^\pi}\left[D_{KL}(\pi \| \pi')[s]\right]}. \qquad (14)$$

By this bound and Corollary 1, we have a result for the worst-case performance of TRPO.

Corollary 3 (TRPO worst-case performance).
A lower bound on the policy performance difference between policies $\pi$ and $\pi'$, where $\pi'$ is given by (13) and $\pi \in \Pi_\theta$, is

$$J(\pi') - J(\pi) \ge -\frac{\sqrt{2\delta}\, \gamma\, \epsilon^{\pi'}}{(1-\gamma)^2}, \qquad (15)$$

where $\epsilon^{\pi'} = \max_s |\mathbb{E}_{a \sim \pi'}[A^\pi(s,a)]|$.

Proof. $\pi$ is a feasible point of the optimization problem with objective value 0, so $\mathbb{E}_{s \sim d^\pi,\, a \sim \pi'}[A^\pi(s,a)] \ge 0$. The rest follows by Corollary 1 and (14), noting that TRPO bounds the average KL-divergence by $\delta$.

By (15), we can see that TRPO is theoretically justified, in the sense that an appropriate selection of the hyperparameter $\delta$ can guarantee a worst-case loss.

6. Discussion

In this note, we derived a new policy improvement bound in which an average, rather than a maximum, policy divergence is penalized. We proposed Easy Monotonic Policy Iteration, an algorithm that exploits the bound to generate sequences of policies with guaranteed non-decreasing returns and which is easy to implement in a sample-based framework. We showed how to implement EMPI and theoretically justified the use of advantage estimators in the inner-loop optimization. Lastly, we showed that our policy improvement bound gives a new theoretical foundation to trust region policy optimization, an algorithm for approximate monotonic policy improvements proposed by Schulman et al. (2015a) which was shown to perform well empirically on a wide variety of tasks; here, we were able to bound the worst-case performance at each iteration of TRPO.

In a future version of this work, we will give experimental results from implementing EMPI on a range of reinforcement learning benchmarks, including high-dimensional domains like Atari.

References

Hausknecht, Matthew and Stone, Peter. Deep Recurrent Q-Learning for Partially Observable MDPs. arXiv preprint arXiv:1507.06527, 2015.

Kakade, Sham and Langford, John.
Approximately Optimal Approximate Reinforcement Learning. Proceedings of the 19th International Conference on Machine Learning, pp. 267–274, 2002. URL http://www.cs.cmu.edu/afs/cs/Web/People/jcl/papers/aoarl/Final.pdf.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with Deep Reinforcement Learning. arXiv preprint, pp. 1–9, 2013. URL http://arxiv.org/abs/1312.5602.

Mnih, Volodymyr, Badia, Adrià Puigdomènech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P., Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous Methods for Deep Reinforcement Learning. pp. 1–28, 2016. URL http://arxiv.org/abs/1602.01783.

Ng, Andrew Y., Harada, Daishi, and Russell, Stuart. Policy invariance under reward transformations: Theory and application to reward shaping. Sixteenth International Conference on Machine Learning, pp. 278–287, 1999.

Pirotta, Matteo, Restelli, Marcello, and Calandriello, Daniele. Safe Policy Iteration. Proceedings of the 30th International Conference on Machine Learning, 28, 2013.

Schulman, John, Moritz, Philipp, Jordan, Michael, and Abbeel, Pieter. Trust Region Policy Optimization. 2015a.

Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. High-Dimensional Continuous Control Using Generalized Advantage Estimation. pp. 1–9, 2015b.

van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep Reinforcement Learning with Double Q-learning. arXiv:1509.06461 [cs], 2015. URL http://arxiv.org/abs/1509.06461.
