Reinforcement Learning by Value Gradients


Authors: Michael Fairbank

25 March 2008

Michael Fairbank
michael.fairbank 'at' virgin.net
c/o Gatsby Computational Neuroscience Unit, Alexandra House, 17 Queen Square, London, UK

Editor: Leslie Pack Kaelbling

Abstract

The concept of the value-gradient is introduced and developed in the context of reinforcement learning, for deterministic episodic control problems that use a function approximator and have a continuous state space. It is shown that by learning the value-gradients, instead of just the values themselves, exploration or stochastic behaviour is no longer needed to find locally optimal trajectories. This is the main motivation for using value-gradients, and it is argued that learning the value-gradients is the actual objective of any value-function learning algorithm for control problems. It is also argued that learning value-gradients is significantly more efficient than learning just the values, and this argument is supported in experiments by efficiency gains of several orders of magnitude, in several problem domains. Once value-gradients are introduced into learning, several analyses become possible. For example, a surprising equivalence between a value-gradient learning algorithm and a policy-gradient learning algorithm is proven, and this provides a robust convergence proof for control problems using a value function with a general function approximator. Also, the issue of whether to include 'residual gradient' terms in the weight update equations is addressed. Finally, an analysis is made of actor-critic architectures, which finds strong similarities to back-propagation through time, and gives simplifications and convergence proofs for certain actor-critic architectures, while also making those actor-critic architectures redundant.
Unfortunately, by proving equivalence to policy-gradient learning, finding new divergence examples even in the absence of bootstrapping, and proving the redundancy of residual-gradients and actor-critic architectures in some circumstances, this paper does somewhat discredit the usefulness of using a value function.

Keywords: Reinforcement Learning, Control Problems, Value-gradient, Function approximators

1. Introduction

Reinforcement learning (RL) algorithms frequently make use of a value function (Sutton and Barto, 1998). On problem domains where the state space is large and continuous, the value function needs to be represented by a function approximator. In this paper, analysis is restricted to episodic control problems of this kind, with a known differentiable deterministic model. As Sutton and Barto (1998) stated: "The central role of value estimation is arguably the most important thing we have learned about reinforcement learning over the last few decades". However, the use of a value function introduces some major difficulties when combined with a function approximator, concerning the lack of convergence guarantees during learning. This problem has led to a major alternative RL approach which works without a value function at all, i.e. policy-gradient learning (PGL) algorithms (Williams, 1992; Baxter and Bartlett, 2000; Werbos, 1990), which do have the desired convergence guarantees. In this paper, a surprising equivalence between these two seemingly different approaches is shown, and this provides the basis for convergence guarantees for variants of value-function learning algorithms.
It is the central thesis of this paper that for value-function methods, it is not the values themselves that are important, but in fact the value-gradients (defined to be the gradient of the value function with respect to the state vector). We distinguish between methods that aim to learn a value function by explicitly updating value-gradients from those that do not, by referring to them as value-gradient learning (VGL) and value-learning (VL), respectively. The necessity of exploration for VL methods is demonstrated in Section 1.3, and becomes very apparent in our problem domains, where all functions are deterministic. We call the level of exploration that searches immediately neighbouring trajectories local exploration. This requirement for local exploration is not necessary with VGL methods, since the value-gradient automatically provides awareness of any superior neighbouring trajectories. This is shown for a specific example in Section 1.3, and proven in the general case in Appendix A. It is then argued that VGL methods are an idealised form of VL methods, are easier to analyse, and are more efficient (Sections 1.4 and 1.5). The VGL algorithms themselves are stated at the start of Section 2. One of these algorithms (Section 2.1) is proven to be equivalent to PGL. This is used as the basis for a VGL algorithm in a continuous-time formulation with convergence guarantees (Section 2.2). It also produces a tentative theoretical justification for the commonly used TD(λ) weight update, which from the author's point of view has always been a puzzling issue. The residual-gradients algorithm for VGL is given and analysed in Section 2.3, and new reasons are given for the ineffectiveness of residual-gradients in deterministic environments, both with VGL and VL.
In Section 3, actor-critic architectures are defined for VGL, and it is shown that the value-gradient analysis provides simplifications to certain actor-critic architectures. This allows new convergence proofs, but at the expense of making the actor-critic architecture redundant. In Section 4, experimental details are provided that justify the optimality claims and the efficiency claims (by several orders of magnitude). The problems we use include the Toy Problem, defined in Section 1.1.1, which is simple enough to be analysed in detail, and challenging enough to cause difficulties for VL. Examples showing diverging weights are given for all VL algorithms and some VGL algorithms in Section 4.3. Also, a Lunar-Lander neural-network experiment is included, which is a larger-scale neural-network problem that seems to defeat VL. Finally, Section 5 gives conclusions and a discussion of VGL, highlighting the contributions of this paper.

Value-gradients have already appeared in various forms in the literature. Dayan and Singh (1996) argue the importance of value-gradients over the values themselves, which is the central thesis of this paper. The target value-gradient we define is closely related to the "adjoint vector" that appears in Pontryagin's maximum principle, as discussed further in Appendix A. The equations of Back-Propagation Through Time (BPTT, Werbos (1990)) and Differential Dynamic Programming (Jacobson and Mayne, 1970) implicitly contain references to the target value-gradient, both with $\lambda = 1$ ($\lambda$ is the bootstrapping parameter, defined in Section 1.1). In continuous-time problem domains, Doya (2000) uses the value-gradient explicitly in the greedy policy, and Baird (1994) defines an "advantage" function that implicitly references the value-gradient.
Both of these are discussed further in Section 2.2. A value-gradient appears in the Hamilton-Jacobi-Bellman Equation, which is an optimality condition for continuous-time value functions; although there only its component parallel to the trajectory is used, and this component is not useful in obviating the need for local exploration. However, the most similar work on value-gradients is a family of algorithms (Werbos, 1998; White and Sofge, 1992, ch. 13) referred to as Dual Heuristic Programming (DHP). These are full VGL methods, but are based on actor-critic architectures specifically with $\lambda = 0$, and are more focussed towards unknown stochastic models.

1.1 Reinforcement Learning Problem Notation and Definitions

State Space. The state space, $S$, is a subset of $\Re^n$. Each state in the state space is denoted by a column vector $\vec{x}$.

Trajectory. A trajectory is a list of states $\{\vec{x}_0, \vec{x}_1, \ldots, \vec{x}_F\}$ through state space starting at a given point $\vec{x}_0$. The trajectory is parametrised by real actions $a_t$ for time steps $t$ according to a model. The model is comprised of two known smooth deterministic functions, $f(\vec{x}, a)$ and $r(\vec{x}, a)$. The first model function, $f$, links one state in the trajectory to the next, given action $a_t$, via the Markovian rule $\vec{x}_{t+1} = f(\vec{x}_t, a_t)$. The second model function, $r$, gives an immediate real-valued reward $r_t = r(\vec{x}_t, a_t)$ on arriving at the next state $\vec{x}_{t+1}$. Assume that each trajectory is guaranteed to reach a terminal state in some finite time (i.e. the problem is episodic). Note that in general, the number of time steps in a trajectory may be dependent on the actions taken. For example, a scenario like this could be an aircraft with limited fuel trying to land. For a particular trajectory, label the final time step $t = F$, so that $\vec{x}_F$ is the terminal state of that trajectory.
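The model interface just defined can be sketched as a short rollout loop. The helper below, and the one-dimensional model functions used to exercise it, are illustrative assumptions of ours, not taken from any of the paper's problem domains:

```python
import numpy as np

def rollout(x0, actions, f, r):
    """Roll out a trajectory from x_0 under model functions f and r.
    Returns the states [x_0, ..., x_F] and rewards [r_0, ..., r_{F-1}],
    with F = len(actions)."""
    states, rewards = [np.asarray(x0, dtype=float)], []
    for a in actions:
        rewards.append(r(states[-1], a))   # r_t = r(x_t, a_t)
        states.append(f(states[-1], a))    # x_{t+1} = f(x_t, a_t) (Markovian rule)
    return states, rewards

# Hypothetical 1-D model, purely for illustration: drift by a, quadratic action cost.
f = lambda x, a: x + a
r = lambda x, a: -a ** 2

states, rewards = rollout(1.0, [-0.5, -0.5], f, r)   # total reward R = -0.5
```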
Assume each action $a_t$ is a real number that, for some problems, may be constrained to $-1 \le a_t \le 1$.

For any trajectory starting at state $\vec{x}_0$ and following actions $\{a_0, a_1, \ldots, a_{F-1}\}$ until reaching a terminal state under the given model, the total reward encountered is given by the function:

$$R(\vec{x}_0, a_0, a_1, \ldots, a_{F-1}) = \sum_{t=0}^{F-1} r(\vec{x}_t, a_t) = r(\vec{x}_0, a_0) + R(f(\vec{x}_0, a_0), a_1, a_2, \ldots, a_{F-1}) \tag{1}$$

Thus $R$ is a function of the arbitrary starting state $\vec{x}_0$ and the actions, and this allows us to obtain the partial derivative $\frac{\partial R}{\partial \vec{x}}$.

Policy. A policy is a function $\pi(\vec{x}, \vec{w})$, parametrised by a weight vector $\vec{w}$, that generates actions as a function of state. Thus for a given trajectory generated by a given policy $\pi$, $a_t = \pi(\vec{x}_t, \vec{w})$. Since the policy is a pure function of $\vec{x}$ and $\vec{w}$, the policy is memoryless.

If a trajectory starts at state $\vec{x}_0$ and then follows a policy $\pi(\vec{x}, \vec{w})$ until reaching a terminal state, then the total reward is given by the function:

$$R^{\pi}(\vec{x}_0, \vec{w}) = \sum_{t=0}^{F-1} r(\vec{x}_t, \pi(\vec{x}_t, \vec{w})) = r(\vec{x}_0, \pi(\vec{x}_0, \vec{w})) + R^{\pi}(f(\vec{x}_0, \pi(\vec{x}_0, \vec{w})), \vec{w})$$

Approximate Value Function. We define $V(\vec{x}, \vec{w})$ to be the real-valued output of a smooth function approximator with weight vector $\vec{w}$ and input vector $\vec{x}$.¹ We refer to $V(\vec{x}, \vec{w})$ simply as the "value function" over state space, parametrised by weights $\vec{w}$.

Q Value Function. The Q value function (Watkins and Dayan, 1992) is defined as

$$Q(\vec{x}, a, \vec{w}) = r(\vec{x}, a) + V(f(\vec{x}, a), \vec{w}) \tag{2}$$

Trajectory Shorthand Notation. For a given trajectory through states $\{\vec{x}_0, \vec{x}_1, \ldots, \vec{x}_F\}$ with actions $\{a_0, a_1, \ldots, a_{F-1}\}$, and for any function defined on $S$ (e.g. including $V(\vec{x}, \vec{w})$, $G(\vec{x}, \vec{w})$, $R(\vec{x}, a_0, a_1, \ldots, a_{F-1})$, $R^{\pi}(\vec{x}, \vec{w})$, $r(\vec{x}, a)$, $V'(\vec{x}, \vec{w})$ and $G'(\vec{x}, \vec{w})$), we use a subscript of $t$ on the function to indicate that the function is being evaluated at $(\vec{x}_t, a_t, \vec{w})$. For example, $r_t = r(\vec{x}_t, a_t)$, $G_t = G(\vec{x}_t, \vec{w})$, $R^{\pi}_t = R^{\pi}(\vec{x}_t, \vec{w})$ and $R_t = R(\vec{x}_t, a_t, a_{t+1}, \ldots, a_{F-1})$. Note that this shorthand does not mean that these functions are functions of $t$, as that would break the Markovian condition.

Similarly, for any of these functions' partial derivatives, we use brackets with a subscripted $t$ to indicate that the partial derivative is to be evaluated at time step $t$. For example, $\left(\frac{\partial G}{\partial \vec{w}}\right)_t$ is shorthand for $\frac{\partial G}{\partial \vec{w}}\big|_{(\vec{x}_t, \vec{w})}$, i.e. the function $\frac{\partial G}{\partial \vec{w}}$ evaluated at $(\vec{x}_t, \vec{w})$. Also, for example, $\left(\frac{\partial f}{\partial a}\right)_t = \frac{\partial f}{\partial a}\big|_{(\vec{x}_t, a_t)}$; $\left(\frac{\partial R}{\partial \vec{x}}\right)_t = \frac{\partial R}{\partial \vec{x}}\big|_{(\vec{x}_t, a_t, a_{t+1}, \ldots, a_{F-1})}$; and similarly for other partial derivatives, including $\left(\frac{\partial r}{\partial \vec{x}}\right)_t$ and $\left(\frac{\partial R^{\pi}}{\partial \vec{w}}\right)_t$.

Greedy Policy. The greedy policy on $V$ generates $\pi(\vec{x}, \vec{w})$ such that

$$\pi(\vec{x}, \vec{w}) = \arg\max_a Q(\vec{x}, a, \vec{w}) \tag{3}$$

subject to the constraints (if present) that $-1 \le a_t \le 1$. The greedy policy is a one-step look-ahead that decides which action to take, based only on the model and $V$. A greedy trajectory is one that has been generated by the greedy policy.

Since for a greedy policy the actions are dependent on the value function and state, and $V = V(\vec{x}, \vec{w})$, it follows that $\pi = \pi(\vec{x}, \vec{w})$. This means that any modification to the weight vector $\vec{w}$ will immediately change $V(\vec{x}, \vec{w})$ and move all greedy trajectories. Hence we say the value function and greedy policy are tightly coupled.

For the greedy policy, when the constraints $-1 \le a_t \le 1$ are present, we say an action $a_t$ is saturated if $|a_t| = 1$ and $\left(\frac{\partial Q}{\partial a}\right)_t \ne 0$.
If either of these conditions is not met, or the constraints are not present, then $a_t$ is not saturated. We note two useful consequences of this:

Lemma 1. If $a_t$ is not saturated, then $\left(\frac{\partial Q}{\partial a}\right)_t = 0$ and $\left(\frac{\partial^2 Q}{\partial a^2}\right)_t \le 0$ (since it is a maximum).

Lemma 2. If $a_t$ is saturated, then, whenever they exist, $\left(\frac{\partial \pi}{\partial \vec{x}}\right)_t = 0$ and $\left(\frac{\partial \pi}{\partial \vec{w}}\right)_t = 0$.

Note that $\left(\frac{\partial \pi}{\partial \vec{x}}\right)_t$ and $\left(\frac{\partial \pi}{\partial \vec{w}}\right)_t$ may not exist, for example, if there are multiple joint maxima in $Q(\vec{x}, a, \vec{w})$ with respect to $a$.

¹ This differs slightly from some definitions in the RL literature, which would refer to this use of the function $V$ as an approximated value function for the greedy policy on $V$. To side-step this circularity of definition, we have treated $V(\vec{x}, \vec{w})$ simply as a smooth function on which a greedy policy can be defined.

ε-Greedy Policy. In our later experiments we implement VL algorithms that require some exploration. Hence we make use of a slightly modified version of the greedy policy, which we refer to as the ε-greedy policy:²

$$\pi(\vec{x}, \vec{w}) = \arg\max_a Q(\vec{x}, a, \vec{w}) + RND(\epsilon)$$

Here $RND(\epsilon)$ is defined to be a random number generator that returns a normally distributed random variable with mean 0 and standard deviation $\epsilon$.

The Value-Gradient: Definition. The value-gradient function $G(\vec{x}, \vec{w})$ is the derivative of the value function $V(\vec{x}, \vec{w})$ with respect to the state vector $\vec{x}$. Therefore $G(\vec{x}, \vec{w}) = \frac{\partial V(\vec{x}, \vec{w})}{\partial \vec{x}}$. Since $V(\vec{x}, \vec{w})$ is defined to be smooth, the value-gradient always exists.

Targets for V.
For a trajectory found by a greedy policy $\pi(\vec{x}, \vec{w})$ on a value function $V(\vec{x}, \vec{w})$, we define the function $V'(\vec{x}, \vec{w})$ recursively as

$$V'(\vec{x}, \vec{w}) = r(\vec{x}, \pi(\vec{x}, \vec{w})) + \lambda V'(f(\vec{x}, \pi(\vec{x}, \vec{w})), \vec{w}) + (1 - \lambda) V(f(\vec{x}, \pi(\vec{x}, \vec{w})), \vec{w}) \tag{4}$$

with $V'(\vec{x}_F, \vec{w}) = 0$, and where $0 \le \lambda \le 1$ is a fixed constant. To calculate $V'$ for a particular point $\vec{x}_0$ in state space, it is necessary to run and cache a whole trajectory starting from $\vec{x}_0$ under the greedy policy $\pi(\vec{x}, \vec{w})$, and then work backwards along it applying the above recursion; thus $V'(\vec{x}, \vec{w})$ is defined for all points in state space. Using shorthand notation, the above equation simplifies to

$$V'_t = r_t + \lambda V'_{t+1} + (1 - \lambda) V_{t+1}$$

$\lambda$ is a "bootstrapping" parameter, giving full bootstrapping when $\lambda = 0$ and none when $\lambda = 1$, as described by Sutton (1988). When $\lambda = 1$, $V'(\vec{x}, \vec{w})$ becomes identical to $R^{\pi}(\vec{x}, \vec{w})$. For any $\lambda$, $V'$ is identical to the "λ-return", or the "forward view of TD(λ)", described by Sutton and Barto (1998).

The use of $V'$ greatly simplifies the analysis of value functions and value-gradients. We refer to the values $V'_t$ as the "targets" for $V_t$. The objective of any VL algorithm is to achieve $V_t = V'_t$ for all $t > 0$ along all possible greedy trajectories. By Eq. 4, and for any $\lambda$, this objective becomes equivalent to the deterministic and undiscounted case of the Bellman Equation of dynamic programming (Sutton and Barto, 1998):

$$V_t = V'_t \ \forall t > 0 \iff V_t = r_t + V_{t+1} \ \forall t > 0 \tag{5}$$

Since the Bellman Equation needs satisfying at all points in state space, it is sometimes referred to as a global method, and this means VL algorithms always need to incorporate some form of exploration.

² This differs from the definition of the ε-greedy policy that Sutton and Barto (1998) use.
Also, in the definition we use, we have assumed the actions are unconstrained.

We point out that since $V'$ is dependent on the actions and on $V(\vec{x}, \vec{w})$, it is not a simple matter to attain the objective $V \equiv V'$, since changing $V$ infinitesimally will immediately move the greedy trajectories (since they are tightly coupled), and therefore change $V'$; these targets are moving ones. However, a learning algorithm that continually attempts to move the values $V_t$ infinitesimally and directly towards the values $V'_t$ is equivalent to TD(λ) (Sutton, 1988), as shown in Section 1.2.1. This further justifies Eq. 4.

Matrix-Vector Notation. Throughout this paper, a convention is used that all defined vector quantities are columns, and any vector becomes transposed (becoming a row) if it appears in the numerator of a differential. Upper indices indicate the component of a vector. For example, $\vec{x}_t$ is a column; $\vec{w}$ is a column; $G_t$ is a column; $\left(\frac{\partial R^{\pi}}{\partial \vec{w}}\right)_t$ is a column; $\left(\frac{\partial f}{\partial a}\right)_t$ is a row; $\left(\frac{\partial f}{\partial \vec{x}}\right)_t$ is a matrix with element $(i, j)$ equal to $\left(\frac{\partial f(\vec{x}, a)^j}{\partial \vec{x}^i}\right)_t$; and $\left(\frac{\partial G}{\partial \vec{w}}\right)_t$ is a matrix with element $(i, j)$ equal to $\left(\frac{\partial G^j}{\partial \vec{w}^i}\right)_t$. An example of a product is $\left(\frac{\partial f}{\partial a}\right)_t G_{t+1} = \sum_i \left(\frac{\partial f^i}{\partial a}\right)_t G^i_{t+1}$.

Target Value-Gradient, $G'_t$. We define $G'(\vec{x}, \vec{w}) = \frac{\partial V'(\vec{x}, \vec{w})}{\partial \vec{x}}$. Expanding this, using Eq. 4, gives:

$$G'_t = \left(\left(\frac{\partial r}{\partial \vec{x}}\right)_t + \left(\frac{\partial \pi}{\partial \vec{x}}\right)_t \left(\frac{\partial r}{\partial a}\right)_t\right) + \left(\left(\frac{\partial f}{\partial \vec{x}}\right)_t + \left(\frac{\partial \pi}{\partial \vec{x}}\right)_t \left(\frac{\partial f}{\partial a}\right)_t\right) \left(\lambda G'_{t+1} + (1 - \lambda) G_{t+1}\right) \tag{6}$$

with $G'_F = \vec{0}$. To obtain this total derivative we have used the fact that $a_t = \pi(\vec{x}_t, \vec{w})$, and that therefore changing $\vec{x}_t$ will immediately change all later actions and states.
This recursive formula takes a known target value-gradient at the end point of a trajectory ($G'_F = \vec{0}$), and works it backwards along the trajectory, rotating and incrementing it as appropriate, to give the desired value-gradient at each time step. This is the central equation behind all VGL algorithms; the objective for any VGL algorithm is to attain $G_t = G'_t$ for all $t > 0$ along a greedy trajectory. As with the target values, it should be noted that this objective is not straightforward to achieve, since the values $G'_t$ are moving targets and are highly dependent on $\vec{w}$.

The above objective, $G \equiv G'$, is a local requirement that only needs satisfying along a greedy trajectory, and is usually sufficient to ensure the trajectory is locally optimal. This is in stark contrast to the Bellman Equation for VL (Eq. 5), which is a global requirement. Consequently VGL is potentially much more efficient and effective than VL. This difference is justified and explained further in Section 1.3.

All terms of Eq. 6 are obtainable from knowledge of the model functions and the policy. For obtaining the term $\frac{\partial \pi}{\partial \vec{x}}$ it is usually preferable to have the greedy policy written in analytical form, as done in Section 2.2 and the experiments of Section 4. Alternatively, using a derivation similar to that of Eq. 17, it can be shown that, when it exists,

$$\left(\frac{\partial \pi}{\partial \vec{x}}\right)_t = \begin{cases} -\left(\frac{\partial^2 Q}{\partial \vec{x} \partial a}\right)_t \left(\frac{\partial^2 Q}{\partial a^2}\right)_t^{-1} & \text{if } a_t \text{ is unsaturated and } \left(\frac{\partial^2 Q}{\partial a^2}\right)_t^{-1} \text{ exists} \\ 0 & \text{if } a_t \text{ is saturated} \end{cases}$$

If $\lambda > 0$ and $\frac{\partial \pi}{\partial \vec{x}}$ does not exist at some time step, $t_0$, of the trajectory, then $G'_t$ is not defined for any $t \le t_0$. In some common situations, such as the continuous-time formulations (Section 2.2), $\frac{\partial \pi}{\partial \vec{x}}$ is always defined, so this is not a problem.
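Once the model and policy partial derivatives are available, Eq. 6 is a single backward pass along the cached trajectory. The sketch below assumes scalar states and actions; all numerical inputs are arbitrary illustrative values of ours, taken from no particular problem:

```python
def target_value_gradients(dr_dx, dr_da, df_dx, df_da, dpi_dx, G, lam):
    """Backward recursion of Eq. 6 for scalar states and actions.

    Each argument is a list indexed by time step t = 0..F-1, except G,
    which holds the approximated value-gradient at every state x_0..x_F.
    Returns the target value-gradients [G'_0, ..., G'_{F-1}]."""
    F = len(dr_dx)
    Gp = [0.0] * (F + 1)                       # G'_F = 0 at the terminal state
    for t in range(F - 1, -1, -1):
        blend = lam * Gp[t + 1] + (1 - lam) * G[t + 1]   # λ G'_{t+1} + (1-λ) G_{t+1}
        Gp[t] = (dr_dx[t] + dpi_dx[t] * dr_da[t]
                 + (df_dx[t] + dpi_dx[t] * df_da[t]) * blend)
    return Gp[:F]

# Arbitrary illustrative numbers: a two-step trajectory.
Gp = target_value_gradients(dr_dx=[0.5, -1.0], dr_da=[-0.2, 0.0],
                            df_dx=[1.0, 1.0], df_da=[1.0, 0.0],
                            dpi_dx=[0.3, 0.0], G=[0.0, 0.7, 0.4], lam=0.5)
```

Note that for a greedy action, $\left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t G_{t+1} = \left(\frac{\partial Q}{\partial a}\right)_t$, which is zero when $a_t$ is unsaturated (Lemma 1), while $\left(\frac{\partial \pi}{\partial \vec{x}}\right)_t = 0$ when $a_t$ is saturated (Lemma 2); this is why the $\frac{\partial \pi}{\partial \vec{x}}$ terms drop out of Eq. 6 in the $\lambda = 0$ case discussed below.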
Figure 1: Illustration of the 2-step Toy Problem. (The agent moves along a line from position $x_0$, via actions $a_0$ and $a_1$, through $x_1$ to $x_2$, towards the target at $x = 0$.)

1.1.1 Example: Toy Problem

Many experiments in this paper make use of the $n$-step Toy Problem with parameter $k$. This is a problem in which an agent can move along a straight line and must move towards the origin efficiently in a given number of time steps, illustrated in Fig. 1, and defined generically now.

State space is one-dimensional and continuous. The actions are unbounded. In this episodic problem, we define the model functions differently at each time step, and each trajectory is defined to terminate at time step $t = n + 1$. Strictly speaking, to satisfy the Markovian requirement and achieve these time-step dependencies, we should add one extra dimension to state space to hold $t$ and adjust the model functions accordingly. However, this complication is omitted in the interests of keeping the notation simple. Under this simplification, the model functions are:

$$f(x_t, a_t) = \begin{cases} x_t + a_t & \text{if } 0 \le t < n \\ x_t & \text{if } t = n \end{cases} \tag{7a}$$

$$r(x_t, a_t) = \begin{cases} -k a_t^2 & \text{if } 0 \le t < n \\ -x_t^2 & \text{if } t = n \end{cases} \tag{7b}$$

where $k$ is a real-valued non-negative constant, included to allow more varieties of problem types to be specified.
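Eqs. 7a and 7b are small enough to implement directly. The sketch below (function names are ours) checks one hand-computed case: with $n = 2$, $k = 1$ and $x_0 = 1$, the equal actions $a_0 = a_1 = -\frac{1}{3}$ give $R = -(\frac{1}{9} + \frac{1}{9}) - (\frac{1}{3})^2 = -\frac{1}{3}$, and perturbing an action can only reduce $R$:

```python
def toy_f(x, a, t, n):
    """Eq. 7a: the state moves by a_t for t < n and is frozen at t = n."""
    return x + a if t < n else x

def toy_r(x, a, t, n, k):
    """Eq. 7b: quadratic action cost for t < n, a final state-based reward at t = n."""
    return -k * a ** 2 if t < n else -x ** 2

def toy_total_reward(x0, actions, n, k):
    """Total reward R(x_0, a_0, ..., a_{n-1}); the trajectory terminates at t = n + 1."""
    x, R = x0, 0.0
    for t in range(n + 1):
        a = actions[t] if t < n else 0.0   # a_n has no effect on the model
        R += toy_r(x, a, t, n, k)
        x = toy_f(x, a, t, n)
    return R

# n = 2, k = 1, x_0 = 1: equal actions of -1/3 achieve R = -1/3.
R_opt = toy_total_reward(1.0, [-1 / 3, -1 / 3], n=2, k=1)
```

Since $R$ is a strictly concave quadratic in the actions here, this maximum is unique, which matches the straight-line optimal trajectories described next.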
The $(n+1)$th time step is present just to deliver a final reward based only on state. The model functions at time step $t = n$ are independent of the action $a_n$, and so each trajectory is completely parametrised by just $(x_0, a_0, a_1, \ldots, a_{n-1})$. This completes the definition of the Toy Problem.

Next we describe the optimal trajectories and optimal policy for the $n$-step Toy Problem. Since the total reward is

$$R(x_0, a_0, a_1, \ldots, a_{n-1}) = -k(a_0^2 + a_1^2 + \ldots + a_{n-1}^2) - (x_n)^2 = -k(a_0^2 + a_1^2 + \ldots + a_{n-1}^2) - (x_0 + a_0 + a_1 + \ldots + a_{n-1})^2$$

the actions that maximise this are all equal, and are given by

$$a_t = \frac{-x_0}{n + k} \quad \text{for all } 0 \le t < n.$$

Figure 2: The lines A to F are optimal trajectories for the Toy Problem. The trajectories are shown as continuous lines for illustration here, although time is defined to be discrete in this problem. The arrowheads of the trajectories show the positions of the terminal states.

Since the optimal actions are all equal and directed oppositely to the initial state $x_0$, optimal trajectories form straight lines towards the centre, as shown in Figure 2. Each optimal trajectory terminates at $x_n = \frac{k x_0}{n + k}$. The optimal policy, usually denoted $\pi^*(\vec{x}_t)$, is

$$\pi^*(x_t) = \frac{-x_t}{n - t + k} \quad \text{for all } 0 \le t < n \tag{8}$$

The optimal value function, usually denoted $V^*(\vec{x}_t)$, can be found for this problem by evaluating the total reward encountered on following the optimal policy until termination:

$$V^*(x_t) = \begin{cases} \frac{-k (x_t)^2}{n - t + k} & \text{if } t \le n \\ 0 & \text{if } t = n + 1 \end{cases} \tag{9}$$

For a simple example of a value function and greedy policy, the reader should see Appendix B (Equations 33 and 34).

1.2 Value-Learning Methods

Introducing the targets $V'$ simplifies the formulation of some learning algorithms.
The objective of all VL algorithms is to learn $V_t = V'_t$ for all $t \ge 1$. We do not need to consider the time step $t = 0$, since the greedy policy is independent of the value function at $t = 0$. A quick review of two common VL methods follows. All learning algorithms in this paper are "off-line"; that is, any weight updates are applied only after considering a whole trajectory. In all learning algorithms we take $\alpha$ to be a small positive constant.

1.2.1 TD(λ) Learning

The TD(λ) algorithm (Sutton, 1988) attempts to achieve $V_t = V'_t$ for all $t \ge 1$ by the following weight update:

$$\Delta \vec{w} = \alpha \sum_{t \ge 1} \left(\frac{\partial V}{\partial \vec{w}}\right)_t (V'_t - V_t) \tag{10}$$

The equivalence of this formulation to that used by Sutton (1988) is proven in Appendix C. This equivalence validates the $V'$ notation. Although not originally defined for control problems by Sutton (1988), it is shown in Appendix D that the TD(λ) weight update can be applied directly to control problems with a known model using the ε-greedy policy, and is then equivalent to Sarsa(λ) (Rummery and Niranjan, 1994). Unfortunately, there are no convergence guarantees for this equation when used with a general function approximator for the value function, and divergence examples abound. As shown by Sutton and Barto (1998), TD learning becomes Monte-Carlo learning when the bootstrapping parameter $\lambda = 1$.

1.2.2 Residual Gradients

With the aim of improving on the convergence guarantees of the previous method, the approach of Baird (1995) is to minimise the error $E = \frac{1}{2} \sum_{t \ge 1} (V'_t - V_t)^2$ by gradient descent on the weights:

$$\Delta \vec{w} = \alpha \sum_{t \ge 1} \left(\left(\frac{\partial V}{\partial \vec{w}}\right)_t - \left(\frac{\partial V'}{\partial \vec{w}}\right)_t\right) (V'_t - V_t)$$

The extra terms introduced by this method are referred to throughout this paper as the "residual gradient terms".
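As a concrete sketch of the Eq. 10 update, consider a linear value function $V(\vec{x}, \vec{w}) = \vec{w} \cdot \phi(\vec{x})$ (the linear form, the feature values, and the terminal-step handling below are our own illustrative assumptions): the targets $V'_t$ come from the backward recursion of Eq. 4 along one cached trajectory, and the weight change then accumulates over $t \ge 1$.

```python
import numpy as np

def td_lambda_update(phi, rewards, w, lam, alpha):
    """Off-line TD(lambda) weight update of Eq. 10 for a linear value
    function V(x, w) = w . phi(x), applied along one cached trajectory.

    phi: feature vectors [phi(x_0), ..., phi(x_F)]; rewards: [r_0, ..., r_{F-1}].
    Assumes the terminal value is not trained (V'_F = 0 is used only as the
    boundary condition), so the sum runs over 1 <= t < F."""
    F = len(rewards)
    V = [float(np.dot(w, p)) for p in phi]
    Vp = [0.0] * (F + 1)                       # V'_F = 0 at the terminal state
    for t in range(F - 1, -1, -1):             # backward pass of Eq. 4
        Vp[t] = rewards[t] + lam * Vp[t + 1] + (1 - lam) * V[t + 1]
    dw = np.zeros_like(w)
    for t in range(1, F):                      # t = 0 is excluded (Section 1.2)
        dw += alpha * phi[t] * (Vp[t] - V[t])  # (dV/dw)_t = phi(x_t) here
    return dw

# Illustrative numbers: a two-step trajectory, scalar feature, w = 0.
dw = td_lambda_update(phi=[np.array([1.0]), np.array([0.5]), np.array([0.0])],
                      rewards=[0.0, -1.0], w=np.array([0.0]), lam=1.0, alpha=0.1)
```

The residual-gradient variant would additionally subtract a $\left(\frac{\partial V'}{\partial \vec{w}}\right)_t$ term inside the sum; with the tightly coupled greedy policy that term also involves the policy's dependence on $\vec{w}$, which is part of what Section 2.3 addresses.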
We extend the residual-gradient method to value-gradients in Section 2.3, and extend it to work with a greedy policy that is tightly coupled with the value function.

1.3 Motivation for Value-Gradients

The above VL algorithms need to use some form of exploration. The required exploration could be implemented by randomly varying the start point in state space at each iteration, a technique known as "exploring starts". Alternatively, a stochastic model or policy could be used to force exploration within a single trajectory. Exploration introduces inefficiencies, which are discussed in Section 1.5.

Figure 3 demonstrates why VL without exploration can lead to suboptimal trajectories, whereas VGL will not. Understanding this is a very central point of this paper, as it is the central motivation for using value-gradients. If a fuller explanation of either of these cases is required, then Appendix B gives details, and goes further in showing that the trajectory with learned value-gradients will also be optimal. The conclusion of this example, and Appendix B, is that without exploration VL can, and in fact is likely to, converge to a suboptimal trajectory. Exploration is necessary since it enables the learning algorithm to become aware of the presence of any superior neighbouring trajectories (which is exactly the information that a value-gradient contains). Without this awareness, learning can terminate on a suboptimal trajectory, as this counterexample shows.

Figure 3: Illustration of the problem of VL without exploration. In this diagram the floating negative numbers in the bottom-left quadrant represent value-function spot values (for a value function which is constant throughout state space).
Hence, in the Toy Problem the greedy policy will choose zero actions, and the greedy trajectory (from A) is the straight line shown. The negative numbers above the $x$ axis give the final reward, $r_n$. Since the intermediate rewards $r_t$ are zero for all $t < n$, we have $V'_t = -8$ for all $t \le n$. So $V_t = V'_t = -8$ for all $t \le n$, and so VL is complete; yet the trajectory is not optimal (c.f. Fig. 2). This situation cannot happen with VGL, since if the value-gradients were learned then there would be a value-gradient perpendicular to the trajectory, and the greedy policy would not have chosen the trajectory shown.

Note that this requirement for exploration is separate from the issue of exploration that is sometimes also required to learn an unknown model. Similar counterexamples can be constructed for other problem domains. For example, Figure 3 can be applied to any problem where all the reward occurs at the end of an episode. Furthermore, it is proven in Appendix A that in a general problem, learning the value-gradients along a greedy trajectory is a sufficient condition for the trajectory to be locally extremal, and this also often ensures the trajectory is locally optimal (these terms are defined in the same appendix). This contrasts with VL in that there has been no need for exploration. Local exploration comes for free with value-gradients, since knowledge of the value-gradients automatically provides knowledge of the neighbouring trajectories.

1.4 Relationship of Value-Learning to Value-Gradient Learning

Learning the value-gradients along a trajectory learns the relative values of the value function along it and all of its immediately neighbouring trajectories. We refer to all these trajectories collectively as a tube.
Any VL algorithm that exhibits sufficient exploration to learn the target values fully throughout the entire tube would also achieve this goal, and therefore also achieve locally optimal trajectories. This shows the consistency of the two methods, and the equivalence of their objectives. We believe VGL techniques represent an idealised form of VL techniques, and that VL is a stochastic approximation to VGL. Both have similar objectives, i.e. to achieve an optimal trajectory by learning the relative target values throughout a tube; but whereas VL techniques rely on the scattergun approach of stochastic exploration to achieve this, VGL techniques go about it more methodically and directly. Figure 4 illustrates the contrasting approaches of the two methods.

Figure 4: Diagrams contrasting VL by stochastic control (left figure) against deterministic learning by VGL (right figure). In the VL case (left figure), stochastic control makes the trajectory zigzag, and the target value passed backwards along the trajectory (which in this case is approximately −7, whereas the deterministic value should be −8) will be passed back to the points in state space encountered along the trajectory. In the VGL case (right figure), the value-gradient at the end point (which will be a small vector pointing in the positive $x$ direction) is passed backwards, without any stochastic distortion, along the central trajectory, and therefore influences the value function along all three trajectories simultaneously.

Once the effects of exploration are averaged out in VL, the underlying weight update is often very similar to a VGL weight update, but possibly with some extra, and usually unwanted, terms. See Section 4.1 for an example analysis in a specific problem.
Hence, we would expect a large proportion of results obtained for one method to apply to the other. For example, using a value-gradient analysis, in Section 4.3 we derive an example that causes VGL to diverge, and then empirically find this example causes divergence in VL too. Also, we would expect the analyses on residual gradients and actor-critic architectures to place limitations on what can be achieved with these methods when used with VL (Sections 2.3 and 3).

If the value-gradient is learned throughout the whole of state space, i.e. if $\frac{\partial V}{\partial \vec x} = \frac{\partial V'}{\partial \vec x}$ for all $\vec x$, then this forms a differential equation of which the Bellman Equation ($V = V'$ for all $\vec x$; see Eq. 5) is a particular solution. This is illustrated in Equation 11.

$$G(\vec x, \vec w) = G'(\vec x, \vec w) \;\; \forall \vec x \iff V(\vec x, \vec w) = V'(\vec x, \vec w) + c \;\; \forall \vec x \qquad (11)$$

Of course the arbitrary constant, $c$, is not important as it does not affect the greedy policy. Hence we propose that the values themselves are not important at all; it is only the value-gradients that are important. This extreme view is consistent with the fact that it is value-gradients, not values, that appear in the optimality proof (Appendix A), in the relationship of value function learning to PGL (Section 2.1), and in the equation for $\frac{\partial \pi}{\partial \vec w}$ (Eq. 17).

The directness of the VGL method means it is more open to theoretical analysis than the VL method. The greedy policy is dependent on the value-gradient but not on the values (see Eq. 17 or Eq. 34), so it seems essential to consider value-gradients if we are to understand how an update to $\vec w$ will affect a greedy trajectory.
If we define a "whole system" to mean the conjunction of a tightly coupled value function with a greedy policy, then it is necessary to consider value-gradients if we are to understand the overall convergence properties of any "whole system" weight-update.

1.5 Efficiencies of Learning Value-Gradients

VGL introduces some significant efficiency improvements into value function learning.

• The removal of the need for exploration should be a major efficiency gain. As learning algorithms are already iterative, the need to explore neighbouring trajectories causes a nested layer of iteration. Also, the issue of exploration severely restricts standard algorithms, particularly those that work by learning the $Q(\vec x, a, \vec w)$ function (e.g. Sarsa($\lambda$), Q($\lambda$)-learning (Watkins, 1989)). For example, whenever an action $a_t$ in a trajectory is exploratory, i.e. non-greedy, the target values backed up to all previous time steps (i.e. the values $V'_k$ for all $k < t$) will be changed. This effectively sends the wrong learning targets back to the early time steps. This difficulty is dealt with in Sarsa($\lambda$) by making $\epsilon$ slowly tend to zero in the $\epsilon$-greedy policy. This difficulty is dealt with in Q($\lambda$)-learning by forcing $\lambda = 0$ or truncating learning beyond the exploratory time step. Both of these solutions have performance implications (we show in Section 4.1 that as $\epsilon \to 0$, learning grinds to a halt).

• Learning the value-gradients along a trajectory is similar to learning the value function's relative values along an entire group of immediately neighbouring trajectories (see Fig. 4). Thus the value-gradients encapsulate more relevant information than the values, and therefore learning by value-gradients should be faster (provided the function approximator for $V$ can be made to learn gradients efficiently, which is not a trivial problem).
• As described in Section 2.2, some VGL algorithms are doing true gradient ascent on $R^\pi$, so are viable to speed up through any of the fast neural-network optimisation algorithms available.

These informal arguments are backed up by the experiments in Section 4.

2. Learning Algorithms for Value-Gradients

The objective of any VGL algorithm is to ensure $G_t = G'_t$ for all $t > 0$ along a greedy trajectory. As proven in Appendix A, this will be sufficient to ensure a locally extremal trajectory (and often a locally optimal trajectory). This section looks at some learning algorithms that try to achieve this objective. In Section 2.2 we derive a VGL algorithm that is guaranteed to converge to a locally optimal trajectory.

Define the sum-of-squares error function for value-gradients as

$$E(\vec x_0, \vec w) = \frac{1}{2} \sum_{t \ge 1} (G_t - G'_t)^T \Omega_t (G_t - G'_t) \qquad (12)$$

for a given greedy trajectory. $\Omega_t$ is any arbitrarily chosen positive semi-definite matrix (as introduced by Werbos, 1998), included for generality, and often just taken to be the identity matrix for all $t$. A use for $\Omega_t$ could be to allow us to compensate explicitly for any rescaling of the dimensions of state-space. If bootstrapping is used then $\Omega_t$ should be chosen to be strictly positive definite.

One approach for VGL is to perform gradient descent on the above error function (giving the VGL counterpart to residual-gradients):

$$\Delta \vec w = -\alpha \frac{\partial E}{\partial \vec w} \qquad (13)$$

This equation is analysed in Section 2.3. However a simpler weight update is to omit the "residual gradient" terms (giving the VGL counterpart to TD($\lambda$)):

$$\Delta \vec w = \alpha \sum_{t \ge 1} \left( \frac{\partial G}{\partial \vec w} \right)_t \Omega_t (G'_t - G_t) \qquad (14)$$

In the next section we prove that this equation, with $\lambda = 1$, is equivalent to PGL, and show it leads to a successful algorithm with convergence guarantees.
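As an illustrative sketch (not code from the paper), the update of Eq. 14 amounts to a single pass over a stored trajectory, given the per-step Jacobians and value-gradient errors; the array names, shapes and numbers below are assumptions made for the example:

```python
import numpy as np

def vgl_update(dG_dw, G_target, G, Omega, alpha):
    """Sketch of the VGL weight update (Eq. 14):
    delta_w = alpha * sum_t (dG/dw)_t Omega_t (G'_t - G_t).

    dG_dw    : list of (n_w, n_x) Jacobians (dG/dw)_t, one per time step
    G_target : list of target value-gradients G'_t, each of shape (n_x,)
    G        : list of learned value-gradients G_t, each of shape (n_x,)
    Omega    : list of (n_x, n_x) positive semi-definite weighting matrices
    """
    delta_w = np.zeros(dG_dw[0].shape[0])
    for J, gp, g, om in zip(dG_dw, G_target, G, Omega):
        delta_w += J @ (om @ (gp - g))   # one term of the sum in Eq. 14
    return alpha * delta_w

# Tiny illustration: 2 time steps, 2 state dims, 3 weights, Omega = identity.
J = [np.ones((3, 2)), np.ones((3, 2))]
Gp = [np.array([1.0, 2.0]), np.array([1.0, 2.0])]
G = [np.zeros(2), np.zeros(2)]
Om = [np.eye(2), np.eye(2)]
dw = vgl_update(J, Gp, G, Om, alpha=0.1)
```

With $\Omega_t$ set to the identity this corresponds to the VGL($\lambda$) update, while supplying the $\Omega_t$ of Eq. 19 instead gives the VGL$\Omega$($\lambda$) update.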
Any VGL algorithm is going to involve using the matrices $\left(\frac{\partial G}{\partial \vec w}\right)_t$ and/or $\frac{\partial G}{\partial \vec x}$ which, for neural networks, involves second order back-propagation. This is described by White and Sofge (1992, ch. 10) and Coulom (2002, Appendix A). In fact, these matrices are only required when multiplied by a column vector, which can be implemented efficiently by extending the techniques of Pearlmutter (1994) to this situation.

2.1 Relationship to Policy-Gradient Learning

We now prove that the VGL update of Eq. 14, with $\lambda = 1$ and a carefully chosen $\Omega_t$ matrix, is equivalent to PGL on a greedy policy. It is this equivalence that provides convergence guarantees for Eq. 14. To make this demonstration clearest, it is easiest to start by considering a PGL weight update; although it should be pointed out that the discovery of this equivalence occurred the opposite way around, since forms of Eq. 14 date back prior to Werbos (1998).

PGL, sometimes also known as "direct" reinforcement learning, is defined to be gradient ascent on $R^\pi(\vec x_0, \vec w)$ with respect to the weight vector $\vec w$ of the policy:

$$\Delta \vec w = \alpha \left( \frac{\partial R^\pi}{\partial \vec w} \right)_0$$

Back-propagation through time (BPTT) is merely an efficient implementation of this formula, designed for architectures where the policy $\pi(\vec x, \vec w)$ is provided by a neural network (see Werbos, 1990). PGL methods will naturally find stationary points that are constrained locally optimal trajectories (see Appendix A for optimality definitions).
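The matrix-vector observation can be illustrated with a toy example (hypothetical, not the paper's setup): the product $(\partial G/\partial \vec w)\,\vec u$ equals the gradient with respect to $\vec w$ of the scalar $G^T \vec u$, so it can be obtained by one extra differentiation of a scalar rather than by forming the full Jacobian. A numeric check for the toy value function $V(\vec x, \vec w) = (\vec w \cdot \vec x)^2$:

```python
import numpy as np

def G(x, w):
    """Value-gradient of the toy value function V(x, w) = (w . x)**2."""
    return 2.0 * (w @ x) * w

def dGdw_times(x, w, u):
    """Full (n_w x n_x) Jacobian of G w.r.t. w, multiplied by u (the slow way)."""
    n = len(w)
    J = 2.0 * np.outer(x, w) + 2.0 * (w @ x) * np.eye(n)  # J[i, j] = dG_j/dw_i
    return J @ u

def grad_of_scalar(x, w, u):
    """Same product via the analytic gradient of the scalar G(x, w) . u
    with respect to w: d/dw [2 (w.x)(w.u)] = 2 (x (w.u) + (w.x) u)."""
    return 2.0 * (x * (w @ u) + (w @ x) * u)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.7, -1.1])
u = np.array([2.0, 0.0, 1.0])
slow = dGdw_times(x, w, u)
fast = grad_of_scalar(x, w, u)
```

The two computations agree, which is the property the Pearlmutter-style techniques exploit at scale for neural networks.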
For PGL, we therefore have:

$$\left( \frac{\partial R^\pi}{\partial \vec w} \right)_t = \frac{\partial \left( r(\vec x_t, \pi(\vec x_t, \vec w)) + R^\pi(f(\vec x_t, \pi(\vec x_t, \vec w)), \vec w) \right)}{\partial \vec w} = \left( \frac{\partial \pi}{\partial \vec w} \right)_t \left( \left( \frac{\partial r}{\partial a} \right)_t + \left( \frac{\partial f}{\partial a} \right)_t \left( \frac{\partial R^\pi}{\partial \vec x} \right)_{t+1} \right) + \left( \frac{\partial R^\pi}{\partial \vec w} \right)_{t+1}$$

$$\Rightarrow \Delta \vec w = \alpha \left( \frac{\partial R^\pi}{\partial \vec w} \right)_0 = \alpha \sum_{t \ge 0} \left( \frac{\partial \pi}{\partial \vec w} \right)_t \left( \left( \frac{\partial r}{\partial a} \right)_t + \left( \frac{\partial f}{\partial a} \right)_t \left( \frac{\partial R^\pi}{\partial \vec x} \right)_{t+1} \right) \qquad (15)$$

This equation is identical to the weight update performed by BPTT. It is defined for a general policy. We now switch to specifically consider PGL applied to a greedy policy. Initially we only consider the case where $\left(\frac{\partial R^\pi}{\partial \vec w}\right)_0$ exists for a greedy trajectory, and hence $\left(\frac{\partial \pi}{\partial \vec w}\right)_t$ exists for all $t$. Now in the summation of Eq. 15, we only need to consider the time steps where $a_t$ is not saturated, since for $a_t$ saturated, $\left(\frac{\partial \pi}{\partial \vec w}\right)_t = 0$ (by Lemma 2). The summation involves terms $\left(\frac{\partial \pi}{\partial \vec w}\right)_t$ and $\left(\frac{\partial r}{\partial a}\right)_t$ which can be reinterpreted under the greedy policy:

Lemma 3  The greedy policy implies, for an unsaturated action,

$$\left( \frac{\partial Q}{\partial a} \right)_t = \left( \frac{\partial r}{\partial a} \right)_t + \left( \frac{\partial f}{\partial a} \right)_t \left( \frac{\partial V}{\partial \vec x} \right)_{t+1} = 0 \;\Rightarrow\; \left( \frac{\partial r}{\partial a} \right)_t = -\left( \frac{\partial f}{\partial a} \right)_t G_{t+1} \qquad (16)$$

Lemma 4  When $\left(\frac{\partial \pi}{\partial \vec w}\right)_t$ exists for an unsaturated action $a_t$, the greedy policy implies $\left(\frac{\partial Q}{\partial a}\right)_t \equiv 0$; therefore,

$$0 = \frac{\partial}{\partial \vec w} \left( \frac{\partial Q(\vec x_t, \pi(\vec x_t, \vec w), \vec w)}{\partial a_t} \right) = \left( \frac{\partial}{\partial \vec w} + \left( \frac{\partial \pi}{\partial \vec w} \right)_t \frac{\partial}{\partial a_t} \right) \left( \frac{\partial Q(\vec x_t, a_t, \vec w)}{\partial a_t} \right)$$
$$= \frac{\partial}{\partial \vec w} \left( \left( \frac{\partial r}{\partial a} \right)_t + \left( \frac{\partial f}{\partial a} \right)_t G(\vec x_{t+1}, \vec w) \right) + \left( \frac{\partial \pi}{\partial \vec w} \right)_t \left( \frac{\partial^2 Q}{\partial a^2} \right)_t = \left( \frac{\partial G}{\partial \vec w} \right)_{t+1} \left( \frac{\partial f}{\partial a} \right)^T_t + \left( \frac{\partial \pi}{\partial \vec w} \right)_t \left( \frac{\partial^2 Q}{\partial a^2} \right)_t$$

$$\Rightarrow \left( \frac{\partial \pi}{\partial \vec w} \right)_t = -\left( \frac{\partial G}{\partial \vec w} \right)_{t+1} \left( \frac{\partial f}{\partial a} \right)^T_t \left( \frac{\partial^2 Q}{\partial a^2} \right)^{-1}_t, \quad \text{assuming } \left( \frac{\partial^2 Q}{\partial a^2} \right)_t \neq 0 \qquad (17)$$

It now becomes possible to analyse the PGL weight update with a greedy policy. Substituting the results of the above lemmas (Eq. 16 and Eq. 17), and $\left(\frac{\partial R^\pi}{\partial \vec x}\right)_t = G'_t$ with $\lambda = 1$ (see Eq. 6), into Eq.
15 gives:

$$\Delta \vec w = \alpha \left( \frac{\partial R^\pi}{\partial \vec w} \right)_0 = \alpha \sum_{t \ge 0} \left( -\left( \frac{\partial G}{\partial \vec w} \right)_{t+1} \left( \frac{\partial f}{\partial a} \right)^T_t \left( \frac{\partial^2 Q}{\partial a^2} \right)^{-1}_t \left( \frac{\partial f}{\partial a} \right)_t (-G_{t+1} + G'_{t+1}) \right) = \alpha \sum_{t \ge 0} \left( \frac{\partial G}{\partial \vec w} \right)_{t+1} \Omega_t (G'_{t+1} - G_{t+1}) \qquad (18)$$

where

$$\Omega_t = -\left( \frac{\partial f}{\partial a} \right)^T_t \left( \frac{\partial^2 Q}{\partial a^2} \right)^{-1}_t \left( \frac{\partial f}{\partial a} \right)_t, \qquad (19)$$

and is positive semi-definite, by the greedy policy (Lemma 1).

Equation 18 is identical to a VGL weight update equation (Eq. 14), with a carefully chosen matrix for $\Omega_t$, and $\lambda = 1$, provided $\left(\frac{\partial \pi}{\partial \vec w}\right)_t$ and $\left(\frac{\partial^2 Q}{\partial a^2}\right)^{-1}_t$ exist for all $t$. If $\left(\frac{\partial \pi}{\partial \vec w}\right)_t$ does not exist, then $\frac{\partial R^\pi}{\partial \vec w}$ is not defined either. Resolutions to these existence conditions are proposed at the end of this section. This completes the demonstration of the equivalence of a VGL algorithm (with the conditions stated above) to PGL (with greedy policy; when $\frac{\partial R^\pi}{\partial \vec w}$ exists). Unfortunately we could not find a similar analysis for $\lambda < 1$, and divergence examples in this case are given in Section 4.3.

This result for $\lambda = 1$ was quite a surprise. It justifies the omission of the "residual gradient terms" when forming the weight update equation (Eq. 14). Omitting these residual gradient terms is not, as it may have seemed, a puzzling modification to $\frac{\partial E}{\partial \vec w}$; it is really $\frac{\partial R^\pi}{\partial \vec w}$ (with $\lambda = 1$, and the given $\Omega_t$). This means using the $\Omega_t$ terms ensures an optimal value of $R^\pi$ is obtained, as shown in the experiment in Section 4.4. Also, it shows that VGL algorithms (and hence value function learning algorithms) are not that different from PGL algorithms after all. It was not known that a PGL weight update, when applied to a greedy policy on a value function, would be doing the same thing as a value function weight update, even if both had $\lambda = 1$. Of course they are usually not the same, unless this particular choice of $\Omega_t$ is chosen.
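For a scalar action, Eq. 19 reduces to an outer product rescaled by $-\left(\frac{\partial^2 Q}{\partial a^2}\right)^{-1}_t$, which makes its positive semi-definiteness easy to verify. A minimal numeric sketch (all values invented for illustration), following the convention that $(\partial f/\partial a)_t$ is a row vector:

```python
import numpy as np

def omega(df_da, d2Q_da2):
    """Omega_t of Eq. 19 for a scalar action:
    Omega = -(df/da)^T (d2Q/da2)^{-1} (df/da),
    where df_da has shape (n_x,) and d2Q_da2 < 0 by concavity of Q in a."""
    return -np.outer(df_da, df_da) / d2Q_da2

df_da = np.array([1.0, 0.5])      # illustrative model derivative
Om = omega(df_da, d2Q_da2=-2.0)   # concave Q => negative second derivative
eigvals = np.linalg.eigvalsh(Om)  # all eigenvalues should be >= 0
```

Because $Q$ is concave in $a$ at a greedy action, the second derivative is negative, so the rescaling factor is non-negative and $\Omega_t$ inherits positive semi-definiteness from the outer product.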
This also provides a tentative justification for the TD($\lambda$) weight update equation (Eq. 10). From the point of view of the author, this previously had no theoretical justification. It was seemingly chosen because it looks a bit like gradient descent on an error function, and the Bellman Equation happens to be a fixed point of it. This has been a hugely puzzling issue. There are no convergence guarantees for it and numerous divergence examples (in Section 4.3, we show it can even diverge with $\lambda = 1$). Our explanation for it is that it is a stochastic approximation to Eq. 14, which itself is an approximation to PGL when $\lambda = 1$.

Also it is our understanding that this is a particularly good form of VGL weight update to make, since it has good convergence guarantees. If an alternative is chosen, e.g. by replacing $\Omega_t$ by the identity matrix, then it might be possible to get much more aggressive learning.³ TD($\lambda$), being a stochastic approximation to Eq. 14, is fixed to implicitly use an identity matrix for $\Omega_t$. But this creates the unwanted problem of non-monotonic progress, in the same way that any aggressive modification to gradient ascent may do. It is also possible to get divergence in this case (see Section 4.3). It is our opinion that it is better to use a more theoretically justifiable acceleration method such as conjugate-gradients or RPROP.

This equivalence reduces the possible advantages of the value function architecture, in the case of $\lambda = 1$, down to being solely a sophisticated implementation of a policy function. This sophisticated policy architecture may just be easier to train than other policies, just as some neural-network architectures are easier to train than others. It is not the actual learning algorithm that is delivering any benefits.
We note that it also creates a difficulty in distinguishing between PGL and VGL. With $\lambda = 1$ and $\Omega_t$ as defined in Eq. 19, the equation can no longer be claimed as a new learning algorithm, since it is the same as BPTT with a greedy policy. Therefore the experimental results will be exactly the same as for BPTT. However, we will describe the above weight update as a VGL update; it is of the same form as Eq. 14. We also point out that forms of Eq. 14 came first (see Werbos, 1998), before the connection to PGL was realised, and that it itself is an idealised form of the TD($\lambda$) weight update.

This equivalence proof is almost a convergence proof for Eq. 18 with $\lambda = 1$, since for the majority of learning iterations there is smooth gradient ascent on $R^\pi$. The problem is that sometimes the terms $\left(\frac{\partial \pi}{\partial \vec w}\right)_t$ are not defined, and then learning progress jumps discontinuously. One solution to this problem could be to choose a function approximator for $V$ such that the function $Q(\vec x, a, \vec w)$ is everywhere strictly concave with respect to $a$, as is done in the Toy Problem experiments of Section 4. A more general solution is given in the next section. Both of these solutions also satisfy the requirement that $\left(\frac{\partial^2 Q}{\partial a^2}\right)^{-1}_t$ exists for all $t$.

2.2 Continuous-Time Formulation

In many problems time can be treated as a continuous variable, i.e. $t \in [0, F]$, as considered by Doya (2000) and Baird (1994). With continuous-time formulations, some extra difficulties can arise for VL as described and solved by Baird (1994), but these do not apply to VGL, for reasons described further below. We describe a continuous-time formulation here since, in some circumstances, the greedy policy $\pi(\vec x, \vec w)$ becomes a smooth function of $\vec x$ and $\vec w$.
This removes the problem of undefined $\left(\frac{\partial \pi}{\partial \vec w}\right)_t$ terms that was described in the previous section, and leads to a VGL algorithm for control problems with a function approximator that is guaranteed to converge.

We use bars over the previously defined functions to denote their continuous-time counterparts, so that $\bar f(\vec x, a)$ and $\bar r(\vec x, a)$ denote the continuous-time model functions. The trajectory is generated from a given start point $\vec x_0$ by the differential equation (DE),

$$\frac{d \vec x_t}{dt} = \bar f(\vec x_t, \pi(\vec x_t, \vec w)) \qquad (20)$$

3. In fact, in the continuous-time formulation of Eq. 23, "$\left(\frac{\partial^2 Q}{\partial a^2}\right)^{-1}_t$" $= -g'\left(\left(\frac{\partial \bar r_L}{\partial a}\right)_t + \left(\frac{\partial \bar f}{\partial a}\right)_t G_t\right)$, and so setting $\Omega_t$ to the identity matrix is analogous to giving the derivative of a sigmoid function an artificial boost (see Eq. 19). This is like the trick proposed by Fahlman (1988) that is sometimes used to speed up supervised learning in artificial neural networks, but at the expense of robustness.

The total reward for this trajectory is

$$R^\pi(\vec x_0, \vec w) = \int_0^F \bar r(\vec x_t, a_t)\, dt$$

The continuous-time $Q$ function is

$$\bar Q(\vec x, a, \vec w) = \bar r(\vec x, a) + \bar f^T(\vec x, a)\, G(\vec x, \vec w) + V(\vec x, \vec w) \qquad (21)$$

This is closely related to the "advantage function" (Baird, 1994), that is $\bar A(\vec x, a, \vec w) = \bar Q(\vec x, a, \vec w) - V(\vec x, \vec w)$. The major difference of the VGL approach over advantage functions is that "advantage learning" only learns the component of $G$ that is parallel to the trajectory, and so it is similar to all VL algorithms in that constant exploration of the neighbouring trajectories must take place. Also, as pointed out by Doya (2000), the problem of indistinguishable $Q$ values that advantage-learning is designed to address is not relevant when using the following policy: we use the same policy as proposed by Doya (2000).
The greedy policy does not need to look ahead in the continuous-time formulation, and instead relies only on the value-gradient at the current time. We assume the model functions are linear in $a$ (which is common in Newtonian models, since acceleration is proportional to the force), and then we introduce an extra "action-cost" non-linear term, $\bar r_C(\vec x, a)$, into the model's reward function, giving

$$\bar r(\vec x, a) = \bar r_L(\vec x, a) + \bar r_C(\vec x, a) \qquad (22)$$

where $\bar r_L(\vec x, a)$ is the original linear reward function. The action-cost term has the effect of ensuring the action chosen by the greedy policy is bound to $[-1, 1]$, and also that the actions $a_t$ are smooth functions of $G_t$. A suitable choice is $\bar r_C(\vec x, a) = -\int_0^a g^{-1}(x)\, dx$, where $g^{-1}$ is the inverse of $g(x) = \tanh(x/c)$, and $c$ is a positive constant. This idea is explained more fully by Doya (2000). Using this choice of $\bar r_C(\vec x, a)$, and substituting Eq. 21 and Eq. 22 into the greedy policy, gives:

$$a_t = \pi(\vec x_t, \vec w) = g\left( \left( \frac{\partial \bar r_L}{\partial a} \right)_t + \left( \frac{\partial \bar f}{\partial a} \right)_t G_t \right)$$

$$\Rightarrow \left( \frac{\partial \pi}{\partial \vec w} \right)_t = g'\left( \left( \frac{\partial \bar r_L}{\partial a} \right)_t + \left( \frac{\partial \bar f}{\partial a} \right)_t G_t \right) \left( \frac{\partial G}{\partial \vec w} \right)_t \left( \frac{\partial \bar f}{\partial a} \right)^T_t \qquad (23)$$
$$= \frac{1 - a_t^2}{c} \left( \frac{\partial G}{\partial \vec w} \right)_t \left( \frac{\partial \bar f}{\partial a} \right)^T_t$$

and

$$\left( \frac{\partial \pi}{\partial \vec x} \right)_t = \frac{1 - a_t^2}{c} \left( \frac{\partial \bar f}{\partial \vec x} \right)_t \left( \frac{\partial G}{\partial \vec x} \right)_t \left( \frac{\partial \bar f}{\partial a} \right)^T_t$$

For this policy, as $c \to 0$, the policy tends to "bang-bang" control. For $c > 0$, this policy function meets the objective of producing bound actions that are smooth functions of $G_t$, and therefore, since the function approximator is assumed smooth, are also smooth functions of $\vec w$. This solves the problem of discontinuities described in the previous section. The solution works for any $c > 0$, so can get arbitrarily close to the ideal of bang-bang control. Using this policy, the trajectory can be calculated via Eq. 20 using a suitable DE solver.
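A sketch of this bounded policy (the input value is invented for illustration) shows the properties claimed above: the action stays in $(-1, 1)$, approaches bang-bang control as $c \to 0$, and has the derivative $g'(x) = (1 - a^2)/c$ that appears in Eq. 23:

```python
import numpy as np

def greedy_action(drL_da, dfda_dot_G, c):
    """Continuous-time greedy action a = g(drL/da + (df/da) . G), g(x) = tanh(x/c)."""
    return np.tanh((drL_da + dfda_dot_G) / c)

q_a = 0.3                                   # illustrative value of drL/da + (df/da).G
a_smooth = greedy_action(0.0, q_a, c=1.0)   # well inside (-1, 1)
a_bang = greedy_action(0.0, q_a, c=1e-3)    # nearly bang-bang: close to +1

# Finite-difference check of g'(x) = (1 - a^2)/c with c = 1:
fd = (greedy_action(0.0, q_a + 1e-6, 1.0)
      - greedy_action(0.0, q_a - 1e-6, 1.0)) / 2e-6
```

The finite-difference slope matches $(1 - a_t^2)/c$, the scalar factor that multiplies both $(\partial \pi/\partial \vec w)_t$ and $(\partial \pi/\partial \vec x)_t$ above.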
For small $c$, the actions can rapidly alternate and so the DE may be "stiff" and need an appropriate solver (see Press et al., 1992, ch. 16.6).

For the learning equations in continuous-time, we use $\bar\lambda$ as the 'bootstrapping' parameter. This is related to the discrete-time $\lambda$ by $e^{-\bar\lambda \Delta t} = \lambda$, where $\Delta t$ is the discrete-time time-step. This means $\bar\lambda = 0$ gives no bootstrapping, and that bootstrapping increases as $\bar\lambda \to \infty$, i.e. this is the opposite way around to the discrete-time parameter $\lambda$.

The equations in the rest of this section were derived in a similar manner to the discrete-time case, and by letting the discrete time-step $\Delta t \to 0$. The continuous-time target-value has several different equivalent formulations:

$$V'_{t_0} = \int_{t_0}^F e^{-\bar\lambda (t - t_0)} \left( \bar r_t + \frac{\partial V_t}{\partial t} \right) dt + V_{t_0} = \int_{t_0}^F e^{-\bar\lambda (t - t_0)} \left( \bar r_t + \bar\lambda V_t \right) dt$$

$$\Rightarrow \frac{\partial V'_t}{\partial t} = -\bar r_t + \bar\lambda \left( V'_t - V_t \right) \quad \text{with boundary condition } V'_F = 0$$

The target value-gradient is given by:

$$\frac{\partial G'_t}{\partial t} = -\left( \left( \frac{D \bar r}{D \vec x} \right)_t + \left( \frac{D \bar f}{D \vec x} \right)_t G'_t \right) + \bar\lambda \left( G'_t - G_t \right) \qquad (24)$$

with boundary condition $G'_F = \vec 0$, and where $\frac{D}{D \vec x} \equiv \frac{\partial}{\partial \vec x} + \frac{\partial \pi}{\partial \vec x} \frac{\partial}{\partial a}$. This is a DE that needs solving with care equivalent to that given to the trajectory, and it may also be stiff. Note that in this equation, $\bar r$ is the full $\bar r$ as defined in Eq. 22. Also, in the case of an episodic problem where a final impulse of reward is given, an alternative boundary condition to this one may be required; see Appendix E.1 for a discussion and example.

For this policy, model and $\bar\lambda = 0$, the PGL weight update is:

$$\Delta \vec w = \alpha \left( \frac{\partial R^\pi}{\partial \vec w} \right)_0 = \alpha \int_0^F \left( \frac{\partial G}{\partial \vec w} \right)_t \bar\Omega_t (G'_t - G_t)\, dt \qquad (25)$$

where

$$\bar\Omega_t = \frac{1 - a_t^2}{c} \left( \frac{\partial \bar f}{\partial a} \right)^T_t \left( \frac{\partial \bar f}{\partial a} \right)_t$$

and is positive semi-definite. This integral is the exact equation for gradient ascent on $R^\pi$.
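Since Eq. 24 is a DE solved backwards from $G'_F = \vec 0$, a backward-Euler sweep is one simple (if crude) implementation. A minimal sketch for a deliberately trivial special case (scalar state, $\bar\lambda = 0$, constant $D\bar r/D\vec x = r_x$ and $D\bar f/D\vec x = 0$; all of these are assumptions of this example), where the exact solution is $G'_t = r_x (F - t)$:

```python
def backward_target_gradient(r_x, F, dt):
    """Integrate dG'/dt = -(Dr/Dx + (Df/Dx) G') backwards from G'(F) = 0,
    for the special case lambda-bar = 0, Dr/Dx = r_x constant, Df/Dx = 0."""
    n = int(round(F / dt))
    Gp = 0.0                   # boundary condition G'_F = 0
    for _ in range(n):         # step from t = F back to t = 0
        Gp = Gp + dt * r_x     # backward Euler: G'(t - dt) = G'(t) - dt * dG'/dt
    return Gp

Gp0 = backward_target_gradient(r_x=2.0, F=1.0, dt=0.001)  # exact answer: r_x * F = 2.0
```

In the general case the right-hand side of Eq. 24 depends on $G'_t$ itself (through the $(D\bar f/D\vec x)_t G'_t$ term), so a stiff solver may be needed, exactly as for the forward trajectory DE.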
Therefore, if implemented precisely, termination will occur at a constrained (with respect to $\vec w$) locally optimal trajectory (see Appendix A for optimality definitions). However, numerical methods are required to evaluate this integral and the other DEs in this section. For example, the above integral is most simply approximated as:

$$\Delta \vec w = \alpha \sum_{t=0}^F \left( \frac{\partial G}{\partial \vec w} \right)_t \bar\Omega_t (G'_t - G_t) \Delta t$$

which is very similar to the discrete-time case (Eq. 18). The fact that this algorithm is gradient ascent means it can be sped up with any of the fast optimisers available, e.g. RPROP (Riedmiller and Braun, 1993), which becomes very useful when $c$ is small and therefore $\Delta \vec w$ becomes very small.

Eq. 25 was derived for $\bar\lambda = 0$, although it can, in theory, be applied to other $\bar\lambda$. In this case, it is thought to be necessary to choose a full-rank version of $\bar\Omega$. However our results with bootstrapping with a function approximator are not good (see Section 4.5).

2.3 Residual Gradients

In this section we derive the formulae for full gradient descent on the value-gradient error function, according to Eq. 12 and Eq. 13. The particularly promising thing about this approach is that it has good convergence guarantees for any $\lambda$. This kind of full gradient descent method is known as residual gradients by Baird (1995), or as using the Galerkinized equations by Werbos (1998).

To calculate the total derivative of $E$, it is easier to first write $E$ in recursive form:

$$E(\vec x_t, \vec w) = \frac{1}{2} (G_t - G'_t)^T \Omega_t (G_t - G'_t) + E(\vec x_{t+1}, \vec w)$$

with $E(\vec x_F, \vec w) = 0$ at the terminal time step.
This gives a total derivative:

$$\left( \frac{\partial E}{\partial \vec w} \right)_t = \left( \left( \frac{\partial G}{\partial \vec w} \right)_t - \left( \frac{\partial G'}{\partial \vec w} \right)_t \right) \Omega_t (G_t - G'_t) + \left( \frac{\partial \pi}{\partial \vec w} \right)_t \left( \frac{\partial f}{\partial a} \right)_t \left( \frac{\partial E}{\partial \vec x} \right)_{t+1} + \left( \frac{\partial E}{\partial \vec w} \right)_{t+1}$$

with $\left(\frac{\partial E}{\partial \vec w}\right)_F = \vec 0$, and where $\left(\frac{\partial E}{\partial \vec x}\right)_{t+1}$ is found recursively by

$$\left( \frac{\partial E}{\partial \vec x} \right)_t = \left( \left( \frac{\partial G}{\partial \vec x} \right)_t - \left( \frac{\partial G'}{\partial \vec x} \right)_t \right) \Omega_t (G_t - G'_t) + \left( \left( \frac{\partial \pi}{\partial \vec x} \right)_t \left( \frac{\partial f}{\partial a} \right)_t + \left( \frac{\partial f}{\partial \vec x} \right)_t \right) \left( \frac{\partial E}{\partial \vec x} \right)_{t+1}$$

with $\left(\frac{\partial E}{\partial \vec x}\right)_F = \vec 0$. Note that this goes further than doing residual-gradients for VL in that there is a consideration for $\frac{\partial E}{\partial \vec x}$. This is necessary for true gradient-descent on $E$ with respect to the weights, since in this paper we say the value function and greedy policy are tightly coupled, and therefore updating $\vec w$ will immediately change the trajectory. This can be verified by evaluating $\frac{\partial E}{\partial \vec w}$ numerically.

Although this weight update performs relatively well in the experiments of Section 4, our general experience of this algorithm is that it often gets stuck in far-from-optimal local minima. This was quite puzzling and was not explained by previous criticisms of residual gradients (for example, see Baird, 1995), since these only applied to stochastic scenarios. It seems that many of the local minima of $E$ are not necessarily valid local maxima of $R^\pi$. We speculate that choosing to include the residual gradient terms is analogous to choosing to maximise a function $f(x)$ by gradient descent on $(f'(x))^2$. This makes it difficult to distinguish the maxima from the unwanted minima, inflections and saddle points, and although the situations are not identical, an effect like this may be reducing the effectiveness of the residual gradients weight update. This contrasts with the non-residual gradients approach (Eq. 14), where $\Delta \vec w = \alpha \frac{\partial R^\pi}{\partial \vec w}$, which is analogous to maximising $f(x)$ directly (if $\lambda = 1$; see Section 2.1).
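The backward recursion for $(\partial E/\partial \vec x)_t$ can be sketched as below, with all per-step arrays assumed to be precomputed elsewhere (the names and values are illustrative). A useful sanity check is that when $G_t = G'_t$ at every step, the recursion returns zero everywhere:

```python
import numpy as np

def dE_dx_backward(dG_dx, dGp_dx, Omega, G, Gp, transition):
    """Backward recursion for (dE/dx)_t:
    (dE/dx)_t = ((dG/dx)_t - (dG'/dx)_t) Omega_t (G_t - G'_t)
                + ((dpi/dx)_t (df/da)_t + (df/dx)_t) (dE/dx)_{t+1},
    with boundary condition (dE/dx)_F = 0.
    'transition' holds the combined matrix (dpi/dx)(df/da) + (df/dx) per step."""
    dE_dx = np.zeros(G[0].shape[0])   # boundary condition at t = F
    out = []
    for t in reversed(range(len(G))):
        dE_dx = (dG_dx[t] - dGp_dx[t]) @ (Omega[t] @ (G[t] - Gp[t])) \
                + transition[t] @ dE_dx
        out.append(dE_dx)
    return list(reversed(out))

# Illustration: 3 steps, 2 state dims, targets already matched (G = G').
T, n = 3, 2
grads = dE_dx_backward(
    dG_dx=[np.eye(n)] * T, dGp_dx=[np.zeros((n, n))] * T,
    Omega=[np.eye(n)] * T,
    G=[np.array([1.0, 2.0])] * T, Gp=[np.array([1.0, 2.0])] * T,
    transition=[0.5 * np.eye(n)] * T,
)
```

The recursion for $(\partial E/\partial \vec w)_t$ has the same backward structure, consuming the $(\partial E/\partial \vec x)_{t+1}$ values computed here.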
By the arguments of Section 1.4, we would expect this explanation to apply to VL with residual gradients too.

To illustrate this problem by a specific example, consider a variant of the 1-step Toy Problem with $k = 0$, modified so that the final reward is $R(x_1) = -x_1^2 + 4\cos(x_1)$ instead of the usual $R(x_1) = -x_1^2$. Then the optimal policy is $\pi^*(x_0) = -x_0$. Let the function approximator be $V(x_1, w) = -x_1^2 + w x_1$, so that the greedy policy with this model gives $\pi(x_0, w) = w/2 - x_0$. This gives $R^\pi(x_0, w) = -w^2/4 + 4\cos(w/2)$, which has just one maximum, at $w = 0$, corresponding to the optimal policy. The residual value-gradient error is

$$E(x_0, w) = \frac{1}{2}(G_1 - G'_1)^2 = \frac{1}{2}\left(w - 4\sin(w/2)\right)^2$$

which has many local minima (see Figure 5), only one of which corresponds to the optimal policy. The spurious minima in $E$ correspond to points of inflection in $R^\pi$, of which there are infinitely many. Therefore gradient descent on $E$ is more likely than not to converge to an incorrect solution, whereas gradient ascent on $R^\pi$ will converge to the correct solution.

Figure 5: Graphs of $y = R^\pi(w)$ and $y = E(w)$ against $w$, showing the spurious minima traps that can exist for Residual Gradient methods compared to direct methods.

3. Actor-Critic Architectures

This section discusses the use of actor-critic architectures with VGL. It shows that in some circumstances the actor-critic architecture can be shown to be equivalent to a simpler architecture. While this can be used to provide convergence guarantees for the actor-critic architecture, it also makes the actor-critic architecture redundant in these circumstances.
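The residual-gradient trap in the modified Toy Problem of Section 2.3 is easy to reproduce numerically from the two closed-form curves $R^\pi(w)$ and $E(w)$ (a sketch; the fixed learning rate and starting weight are arbitrary choices): gradient ascent on $R^\pi$ reaches the single maximum at $w = 0$, while gradient descent on $E$ from the same start becomes trapped in a spurious minimum far from it.

```python
import math

def E(w):        # residual value-gradient error for the modified Toy Problem
    return 0.5 * (w - 4.0 * math.sin(w / 2.0)) ** 2

def dE_dw(w):
    return (w - 4.0 * math.sin(w / 2.0)) * (1.0 - 2.0 * math.cos(w / 2.0))

def dRpi_dw(w):  # derivative of R_pi(w) = -w**2/4 + 4*cos(w/2)
    return -w / 2.0 - 2.0 * math.sin(w / 2.0)

w_rg, w_pg = 12.0, 12.0           # same starting weight for both methods
for _ in range(50000):
    w_rg -= 0.01 * dE_dw(w_rg)    # gradient descent on E (residual gradients)
    w_pg += 0.01 * dRpi_dw(w_pg)  # gradient ascent on R_pi (direct)
# w_pg converges to the single maximum of R_pi at w = 0, while
# w_rg settles in a spurious stationary point of E far from w = 0.
```

Both runs reach a stationary point of their own objective; only the direct run reaches the optimal policy.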
An actor-critic architecture (Barto et al., 1983; Werbos, 1998; Konda and Tsitsiklis, 2003) uses two neural networks, or more generally two function approximators, in a control problem. The first neural network, parametrised by weight vector $\vec z$, is the "actor" which provides the policy, $\pi(\vec x, \vec z)$. In an actor-critic architecture the greedy policy is not used, since the actor neural network is the policy. The second neural network, parametrised by a weight vector $\vec w$, is the "critic" and provides the value function $V(\vec x, \vec w)$.

For this section we extend the definitions of $V'$ and $G'$ to apply to trajectories found by policies other than the greedy policy. To define $V'$ for an arbitrary policy $\pi(\vec x, \vec z)$ and value function $V(\vec x, \vec w)$ we use:

$$V'(\vec x, \vec w, \vec z) = r(\vec x, \pi(\vec x, \vec z)) + \lambda V'(f(\vec x, \pi(\vec x, \vec z)), \vec w, \vec z) + (1 - \lambda) V(f(\vec x, \pi(\vec x, \vec z)), \vec w) \qquad (26)$$

with $V'(\vec x_F, \vec w, \vec z) = 0$. This gives $G'(\vec x, \vec w, \vec z) = \frac{\partial V'(\vec x, \vec w, \vec z)}{\partial \vec x}$, and for a given trajectory we can use the shorthand $V'_t = V'(\vec x_t, \vec w, \vec z)$ and $G'_t = G'(\vec x_t, \vec w, \vec z)$.

The value-gradient version of the actor training equation is as follows (Werbos, 1998, Eq. 11):

$$\Delta \vec z = \alpha \sum_{t \ge 0} \left( \frac{\partial \pi}{\partial \vec z} \right)_t \left( \left( \frac{\partial r}{\partial a} \right)_t + \left( \frac{\partial f}{\partial a} \right)_t G_{t+1} \right) \qquad (27)$$

where $\alpha$ is a small positive learning-rate and $G_{t+1}$ is calculated by the critic. This equation is almost identical to the PGL equation (BPTT, Eq. 15), except that $G_{t+1}$ has been substituted for $\left(\frac{\partial R^\pi}{\partial \vec x}\right)_{t+1}$. Eq. 27 is a non-standard actor training equation; however in Appendix F we prove it is equivalent to at least one other actor training equation, and demonstrate how it automatically incorporates exploration.
The critic training equation would attempt to attain $G_t = G'_t$ for all $t$ by some appropriate method, for example by Eq. 14, which is the only update we consider here.

In this section, it is useful to define the notion of a hypothetical ideal function approximator. This is a function approximator that can be assumed to have enough degrees of freedom, and a strong enough learning algorithm, to attain its desired objectives exactly, e.g. $G_t = G'_t$ for all $t$, exactly. We also refer to an ideal critic and an ideal actor, which are based on ideal function approximators.

Training an actor and a critic together gives several possibilities of implementation: either one could be fixed while training the other, or they could both be trained at the same time. We only analyse the situations of keeping one fixed while training the other. Doing a long iterative process (training one) within another long iterative process (training the other) is very bad for efficiency, which may make the cases analysed seem infeasible and therefore of little relevance. However, the analyses below show both of these situations have equivalent architectures that are efficient and feasible to implement, and therefore are relevant. It is noted that the results in this section should also apply to VL actor-critic systems, since, as discussed in Section 1.4, VL is a stochastic approximation to VGL.

3.1 Keeping the Actor Fixed while Training the Critic

In this scenario we keep the actor fixed while training an ideal critic fully to convergence, then apply one iteration of actor training, and then repeat until both have converged. It is shown that, if the critic is ideal, then this scenario is equivalent to back-propagation through time (BPTT) for any $\lambda$.
Since the actor is fixed, the trajectory is fixed; therefore an ideal critic will be able to attain the objective of $G_t = G'_t$ for all $t$ along the trajectory. Then since $G_t = G'_t$ for all $t$, we have $G_t = \left(\frac{\partial R^\pi}{\partial \vec x}\right)_t$ for all $t$ (proof is in Appendix A, Lemma 5). Therefore the actor's weight update equation (Eq. 27) becomes identical to Eq. 15. Therefore we could omit the critic and replace $G$ with $\frac{\partial R^\pi}{\partial \vec x}$ in the actor training equation (i.e. we remove the inner iterative process, and remove the actor-critic architecture), and we are left with BPTT. This shows that this actor-critic is guaranteed to converge, since it is the same as BPTT.

The above argument assumed the critic was ideal. This may be an unrealistic assumption, since a real function approximator can have only finite flexibility. However, the objective of the function approximator is to learn $G_t = G'_t$ for all $t$, and this goal can be achieved exactly, simply by removing the critic. In effect, by removing the critic, a virtual ideal function approximator is obtained. It is assumed that there would be no advantage in using a non-ideal critic.

Conclusion: BPTT is equivalent to the idealised version of this architecture, and therefore the idealised version of this architecture is guaranteed to converge to a constrained locally optimal trajectory (see Appendix A for optimality definitions). The idealised version of this architecture is attainable by removing the critic.

3.2 Keeping the Critic Fixed while Training the Actor

In this scenario we consider keeping the critic fixed while training an ideal actor fully to convergence, and then apply one iteration of training to the critic and repeat. The actor weight update equation (Eq. 27) tries to maximise $Q(\vec x_t, a_t, \vec w)$ with respect to $a_t$ at each time step $t$.
If this is fully achieved, then the greedy policy will be satisfied. Therefore the ideal actor can be removed and replaced by the greedy policy, to get the same algorithm. This again removes the innermost iterative step, and removes the actor-critic architecture. Again, it is assumed that there would be no advantage in using a non-ideal actor.

Now when it comes to the critic-training step (Eq. 14), do we allow for the fact that changing the critic weights is going to change the actor in a predictable way? If so, then we treat $\Omega_t$ as defined in Eq. 19. Otherwise we are working as if the actor and critic are fully separated, and we are free to choose $\Omega_t$. Having made this choice and substitution, the actor is redundant.

Conclusion: Keeping the critic fixed while training an idealised actor is equivalent to using just the critic with a greedy policy. This idealised architecture can be efficiently attained by removing the actor.

4. Experiments

In this section a comparison is made between the performance of several weight update strategies on various tasks. The weight update formulae considered are summarised in Table 1.

The first few experiments are all based on the $n$-step Toy Problem defined in Section 1.1.1. The choice of this problem domain was made because it is smooth, deterministic, concave, and possible to make the experiments easily describable and reproducible. Within this choice, it makes a level playing field for comparison between VL and VGL algorithms. The final experiment is a neural-network based experiment on a Lunar-Lander problem specified in Appendix E. This is a new benchmark problem defined for this paper; the problem with most currently existing benchmark problems was that they are only defined for continuing tasks, discrete state-space tasks, or tasks not well suited to local exploration.
In the Toy Problem experiments, all weight components were initialised with a uniform random distribution over the range −10 to +10. The experiments were based on 1000 trials. The stopping criteria used for each trial were as follows: a trial was considered a success when |w_i − w*_i| < 10⁻⁷ for all components i; a trial was considered a failure if any component of w became too large for the computer's accuracy, or when the number of iterations in the trial exceeded 10⁷. A function RND(ε) is defined to return a normally distributed random variable with mean 0 and standard deviation ε. The variables c_1, c_2, c_3 are real constants, specific to each experiment, designed to allow further variation in the problem specifications.

    Weight Update Formula                                                      Abbreviation
    Value-Learning update, TD(λ) (Eq. 10).                                     VL(λ)
    Value-Gradient Learning update (Eq. 18), using the full Ω_t matrix
      from Eq. 19.                                                             VGLΩ(λ)
    Value-Gradient Learning update (Eq. 14), but using Ω_t as the identity
      matrix for all t.                                                        VGL(λ)
    Value-Gradient Learning update with residual gradients (Eq. 13), but
      using Ω_t as the identity matrix for all t.                              VGLRG(λ)

    Table 1: Weight update formulae considered.

4.1 Experiment 1: One-step Toy Problem

In this experiment the one-step Toy Problem with k = 0 was considered, from a fixed start point of x_0 = 0. A function approximator for V with just two parameters (w_1, w_2) was used, defined separately⁴ for the two time steps:

    V(x_t, w_1, w_2) = −(x_1 − c_1)² + w_1 x_1 + w_2   if t = 1
                       0                               if t = 2

where c_1 is a real constant. Using this model and function approximator definition, it is possible to calculate the functions Q(x, a, w), G(x, w) and the ε-greedy policy π(x, w) (which again must be defined differently for each time step).
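As a sanity check on the greedy-policy derivation for this experiment, the closed-form maximiser of the quadratic Q can be compared against a brute-force search. The sketch below is our own illustration, not the paper's code; the constants are arbitrary test values:

```python
# Numerical check of the Experiment 1 greedy action.  Q is quadratic in the
# action, so the greedy policy (with epsilon = 0) has the closed form
# pi(x0) = (2*c1 - 2*x0 + w1) / 2.
def Q(x0, a0, c1, w1, w2):
    x1 = x0 + a0                      # model step: x_{t+1} = x_t + a_t
    return -(x1 - c1) ** 2 + w1 * x1 + w2

c1, w1, w2, x0 = 3.0, 0.8, -1.0, 0.5  # arbitrary test values
greedy = (2 * c1 - 2 * x0 + w1) / 2
# brute-force maximisation over a fine action grid agrees with the closed form
best = max((a / 1000.0 for a in range(-10000, 10001)),
           key=lambda a: Q(x0, a, c1, w1, w2))
print(abs(best - greedy))             # no larger than the grid step, 0.001
```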
These functions are listed in the left-hand column of Table 2. Using these formulae, and the model functions again, the full trajectory is calculated in the right-hand column of Table 2; there V′, G′ and Ω have also been calculated for each time step, using Eq. 4, Eq. 6 and Eq. 19 respectively. These formulae have to be evaluated sequentially from top to bottom. The ε-greedy policy was necessary for the VL experiments.

When applying the weight update formulae, expressions for ∂V/∂w and ∂G/∂w were calculated analytically from the functions given in the left column of Table 2. For example, for this function approximator we find (∂V/∂w_1)_1 = x_1, (∂V/∂w_2)_1 = 1 and (∂G/∂w_1)_1 = 1, etc. Note that as w_2 does not affect the trajectory, this component was not used as part of the stopping criteria.

    Function approximator and ε-greedy policy equations:
      Time step 1:  V(x_1, w_1, w_2) = −(x_1 − c_1)² + w_1 x_1 + w_2
                    ⇒ G(x_1, w_1, w_2) = 2 (c_1 − x_1) + w_1
                    Q(x_0, a_0, w) = −(x_0 + a_0 − c_1)² + w_1 (x_0 + a_0) + w_2
                    π(x_0, w) = (2 c_1 − 2 x_0 + w_1) / 2 + RND(ε)
      Time step 2:  V(x_2, w_1, w_2) = 0  ⇒  G(x_2, w_1, w_2) = 0
      Optimal policy:  π*(x_0) = −x_0        Optimal weights:  w_1* = −2 c_1

    Sequential trajectory equations (evaluated top to bottom):
      x_0 ← 0
      a_0 ← (2 c_1 − 2 x_0 + w_1) / 2 + RND(ε)
      x_1 ← x_0 + a_0
      V′_1 ← −x_1²                    G′_1 ← −2 x_1
      V_1 ← −(x_1 − c_1)² + x_1 w_1 + w_2
      G_1 ← 2 (c_1 − x_1) + w_1
      Ω_0 ← 1/2

    Table 2: Functions and trajectory variables for Experiment 1.

Results for these experiments using the VL(λ) and VGL(λ) algorithms are shown in Table 3. This set of experiments verifies the failure of VL when exploration is removed; the slowing down of VL when ε is too low; and the blowing-up of VL when ε is too high (in this case failure tended to occur because the size of the weights exceeded the computer's range). The efficiency and success rate of the VGL experiments are much better than those of the VL experiments, and this is true for both values of c_1 tested. In fact, the problem is trivially easy for the VGL(λ) algorithm, but causes the VL(λ) algorithm considerable problems.

                      α = 0.01                  α = 0.1                   α = 1.0
    c_1   ε     Success  Mean     s.d.    Success  Mean    s.d.     Success  Mean    s.d.
    Results for algorithm VL(λ):
     0    10     66.4%   1075.1   293.35    0.0%   —       —          0.0%   —       —
     0    1     100.0%   1715.8   343.31   87.6%   163.52  31.948     3.8%   134.86  59.643
     0    0.1   100.0%   172445   31007    89.5%   17160   3033.6    16.5%   1527.6  118.39
     0    0       0.0%   —        —         0.0%   —       —          0.0%   —       —
     10   1      99.4%   6048.5   270.86    0.0%   —       —          0.0%   —       —
    Results for algorithm VGL(λ):
     0    0     100.0%   1728.2   112.05  100.0%   166.15  11.481   100.0%   1       0
     10   0     100.0%   1898.5   51.986  100.0%   181.59  14.861   100.0%   1       0

    Table 3: Results for Experiment 1. Note that because this is a one-step problem, λ is irrelevant.

To gain some further insight into the different behaviour of these two algorithms, we can look at the differential equations that the weights obey. For the VGL(λ) system of this experiment, by going through the equations of the right-hand column of Table 2 and the VGL(λ) weight update equations, we can eliminate all variables except for the weights and constants, to obtain a self-contained pair of weight update equations:

    Δw_1 = −α (2 c_1 + w_1)
    Δw_2 = 0

Taking α to be sufficiently small, these become a pair of coupled differential equations. The solution is a straight line across the w-plane, directly to the optimal solution w_1* = −2 c_1.

4. See Section 1.1.1 for an explanation of this abuse of notation.
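This derivation can be checked numerically. The sketch below (our own illustration, with ε = 0 so the policy is greedy) computes the VGL update directly from the trajectory equations of Table 2, confirms that it matches the closed form Δw_1 = −α(2c_1 + w_1), and iterates it to the optimal weight:

```python
# Simulate the VGL(lambda) update for Experiment 1 from the Table 2 equations.
def vgl_update(w1, c1, alpha):
    x0 = 0.0
    a0 = (2 * c1 - 2 * x0 + w1) / 2   # greedy action (no exploration)
    x1 = x0 + a0
    G_dash = -2 * x1                  # target value-gradient G'_1
    G = 2 * (c1 - x1) + w1            # approximator's gradient G_1
    dG_dw1 = 1.0                      # (dG/dw_1)_1 for this approximator
    delta = alpha * dG_dw1 * (G_dash - G)
    # the simulated update matches the closed form -alpha*(2*c1 + w1)
    assert abs(delta - (-alpha * (2 * c1 + w1))) < 1e-12
    return w1 + delta

c1, alpha = 1.0, 0.1
w1 = 5.0                              # arbitrary starting weight
for _ in range(500):
    w1 = vgl_update(w1, c1, alpha)
print(w1)                             # very close to the optimum w1* = -2*c1
```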
Doing the same for the VL(λ) system, and integrating over the random variable RND(ε) to average out the effects of exploration, gives a similar pair of coupled weight update equations:

    ⟨Δw_1⟩ = −α (c_1 + w_1/2) ( 2ε² + (c_1² + 2 c_1 w_1 + w_1²)/2 + w_2 )
    ⟨Δw_2⟩ = −α ( (c_1² + 2 c_1 w_1 + w_1²)/2 + w_2 )

There is no known full analytical solution to this pair of equations. However, it is clear that the second equation is continually aiming to achieve w_2 = −(c_1² + 2 c_1 w_1 + w_1²)/2. In the case that this is achieved, both equations simplify to the VGL coupled equations, but with a magnitude proportional to ε². This shows that if ε = 0, the value-gradient part of these equations vanishes; and it is noted that in this case experiments show learning fails. Hence it is speculated that none of the other terms in the VL(λ) coupled equations is doing anything beneficial, and that it is unlikely they ever will be, even in more complicated systems. Very informally, this example illustrates how VGL applies just the "important bits" of a VL weight update (in this example at least).

4.2 Experiment 2: Two-step Toy Problem, with Sufficiently Flexible Function Approximator

In this experiment the two-step Toy Problem with k = 1 is considered, from a fixed start point of x_0 = 0. A function approximator is defined differently at each time step, by four weights in total:

    V(x_t, w_1, w_2, w_3, w_4) = −c_1 x_1² + w_1 x_1 + w_2   if t = 1
                                 −c_2 x_2² + w_3 x_2 + w_4   if t = 2
                                 0                           if t = 3

where c_1 and c_2 are real positive constants. The consequential functions and variables for this experiment are found and presented in the same manner as for Experiment 1, in Table 4.
For ease of implementation of the residual-gradients algorithm, the expressions in the right-hand column of Table 4 for G_t and G′_t were used to implement a sum-of-squares error function E(w) (Eq. 12), with Ω_t = 1. Numerical differentiation of this function was then used to implement the gradient descent. For a larger-scale system, it would be more efficient and accurate to use the recursive equations given in Section 2.3.

Results for the experiments are given in Table 5. These results show all VGL experiments performing significantly better than the corresponding VL experiments, in most cases by around two orders of magnitude. The results also show that for all of the VGL(λ) results, increasing α from 0.01 to 0.1 brings the number of iterations down by a factor of approximately 10, which hints that further efficiency gains for the VGL algorithms could be attained.

    Function approximator and ε-greedy policy equations:
      Time step 1:  V(x_1, w_1, w_2) = −c_1 x_1² + w_1 x_1 + w_2
                    ⇒ G(x_1, w_1, w_2) = −2 c_1 x_1 + w_1
                    Q(x_0, a_0, w) = −k a_0² − c_1 (x_0 + a_0)² + w_1 (x_0 + a_0) + w_2
                    π(x_0, w) = (w_1 − 2 c_1 x_0) / (2 (c_1 + k)) + RND(ε)
      Time step 2:  V(x_2, w_3, w_4) = −c_2 x_2² + w_3 x_2 + w_4
                    ⇒ G(x_2, w_3, w_4) = −2 c_2 x_2 + w_3
                    Q(x_1, a_1, w) = −k a_1² − c_2 (x_1 + a_1)² + w_3 (x_1 + a_1) + w_4
                    π(x_1, w) = (w_3 − 2 c_2 x_1) / (2 (c_2 + k)) + RND(ε)
      Time step 3:  V(x_3) = 0  ⇒  G(x_3) = 0
      Optimal weights:  w_1* = w_3* = 0

    Sequential trajectory equations (evaluated top to bottom):
      x_0 ← 0
      a_0 ← (w_1 − 2 c_1 x_0) / (2 (c_1 + k)) + RND(ε)
      x_1 ← x_0 + a_0
      a_1 ← (w_3 − 2 c_2 x_1) / (2 (c_2 + k)) + RND(ε)
      x_2 ← x_1 + a_1
      V′_2 ← −x_2²                              G′_2 ← −2 x_2
      V_2 ← −c_2 x_2² + w_3 x_2 + w_4           G_2 ← −2 c_2 x_2 + w_3
      Ω_1 ← 1 / (2 (c_2 + k))
      V′_1 ← −k a_1² + λ V′_2 + (1 − λ) V_2
      G′_1 ← (2 c_2 k a_1 + k (λ G′_2 + (1 − λ) G_2)) / (c_2 + k)
      V_1 ← −c_1 x_1² + w_1 x_1 + w_2           G_1 ← −2 c_1 x_1 + w_1
      Ω_0 ← 1 / (2 (c_1 + k))

    Table 4: Functions and trajectory variables for Experiment 2.

                                        α = 0.01                   α = 0.1
    Algorithm(λ)   ε     c_1   c_2   Success  Mean      s.d.    Success  Mean      s.d.
    VL(1)          1     0.5   1     100.0%   244122    252234   91.3%   736030    743920
    VL(1)          0.1   0.5   1     100.0%   135588    17360   100.0%   21406.6   8641
    VGL(1)         0     0.5   1     100.0%   1596.8    72.58   100.0%   152.5     13.79
    VGLΩ(1)        0     0.5   1     100.0%   6089.1    340.09  100.0%   600.7     47.06
    VGLRG(1)       0     0.5   1     100.0%   794.6     40.30   100.0%   72.2      5.36
    VL(0)          1     0.5   1     100.0%   244368    252114   91.6%   734029    742977
    VL(0)          0.1   0.5   1     100.0%   138073    17630    99.9%   21918     8664
    VGL(0)         0     0.5   1     100.0%   1743.7    103.41  100.0%   166.2     12.81
    VGLΩ(0)        0     0.5   1     100.0%   6516.5    375.99  100.0%   643.0     39.12
    VGLRG(0)       0     0.5   1     100.0%   1252.4    92.81   100.0%   118.2     11.63
    VL(1)          0.1   4     1     100.0%   228336    60829   100.0%   78364     62085
    VGL(1)         0     4     1     100.0%   5034.7    340.6   100.0%   495.1     35.7
    VL(1)          0.1   0.1   1     100.0%   134443    16614   100.0%   20974     8569
    VGL(1)         0     0.1   1     100.0%   1516.2    89.5    100.0%   144.4     13.8

    Table 5: Results for various algorithms on Experiment 2.

The optimal value function, denoted by V*, for this experiment is

    V*(x_t) = −(k / (1 + k)) x_1²   if t = 1
              −x_2²                 if t = 2

For this reason most experiments were done with c_1 = 1/2 and c_2 = 1. However, the only necessity is to have c_1 > 0 and c_2 > 0, since these are required to make the greedy policy produce continuous actions; a problematic issue for all value-function architectures.

4.3 Experiment 3: Divergence of Algorithms with the Two-step Toy Problem

We now study the Toy Problem to try to find a set of parameters that causes learning to become unstable. Surprisingly, the two-step Toy Problem is sufficiently complex to provide examples of divergence, both with and without bootstrapping. By the principles argued in Section 1.4, we would expect these examples, which were found for VGL, also to cause divergence with the corresponding VL methods. This is confirmed empirically.

If we take the previous experiment and consider the VGLΩ(λ) weight update, then the only two weights that change are w = (w_1, w_3)ᵀ. The weight update equation for these two weights can be found analytically, by substituting all the equations of the right-hand side of Table 4 into the VGLΩ(λ) weight update equation and using ε = 0, giving:

    Δw = α D E D w

with

    E = −2 ( k + λ (1 + b)(b (k + 1) + 1) − b k     λ (k + 1)(b + 1) − k )
           ( 1 + b (k + 1)                          k + 1                )

    b = (∂π/∂x)_1 = −c_2 / (c_2 + k)

and

    D = ( 1 / (2 (k + c_1))    0                 )
        ( 0                    1 / (2 (k + c_2)) )

We can consider more types of function approximator by defining the weight vector w to be a linear system of two new weights p = (p_1, p_2)ᵀ, such that w = F p, where F is a 2 × 2 constant real matrix. If the VGLΩ(λ) weight update equation is now recalculated for these new weights, then the dynamic system for p is:

    Δp = α (Fᵀ D E D F) p    (28)

Taking α to be sufficiently small, the weight vector p evolves according to a continuous-time linear dynamic system, and this system is stable if and only if the matrix product Fᵀ D E D F is stable (i.e. if the real part of every eigenvalue of this matrix product is negative). The VGL(λ) weight update can also be derived; that system is identical to Eq. 28 but with the leftmost D matrix omitted.

Choosing λ = 0, with c_1 = c_2 = k = 0.01 and

    F = D⁻¹ ( 10    1 )
            ( −1   −1 )

leads to divergence for both the VGL(λ) and VGLΩ(λ) systems. Empirically, we found that these parameters cause the VL(0) algorithm to diverge too.
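The stability test of Eq. 28 is easy to carry out numerically. The sketch below is our own illustration: with F = D⁻¹M the product FᵀDEDF simplifies to MᵀEM (D is diagonal, so it cancels), and a 2 × 2 continuous-time linear system is stable iff its trace is negative and its determinant positive:

```python
# Stability check for the Section 4.3 divergence example (lambda = 0,
# c1 = c2 = k = 0.01).  A 2x2 real matrix has all eigenvalue real parts
# negative iff trace < 0 and det > 0.
def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]

lam, c1, c2, k = 0.0, 0.01, 0.01, 0.01
b = -c2 / (c2 + k)                              # b = (d pi / d x)_1
E = [[-2 * (k + lam * (1 + b) * (b * (k + 1) + 1) - b * k),
      -2 * (lam * (k + 1) * (b + 1) - k)],
     [-2 * (1 + b * (k + 1)),
      -2 * (k + 1)]]
M = [[10.0, 1.0], [-1.0, -1.0]]                 # F = D^{-1} M
MT = [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]
A = matmul(MT, matmul(E, M))                    # = F^T D E D F for this F
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
print(trace > 0)   # True: an eigenvalue has positive real part, so VGLΩ(0) diverges
```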
This is a specific counterexample for the VL system, which is "on-policy" and equivalent to Sarsa. Previous examples of divergence for a function approximator with bootstrapping have usually been for "off-policy" learning (see, for example, Baird, 1995; Tsitsiklis and Van Roy, 1996b). Tsitsiklis and Van Roy (1996a) describe an "on-policy" counterexample for a non-linear function approximator, but this is not for the greedy policy.

Also, perhaps surprisingly, it is possible to get instability with λ = 1 with the VGL(λ) system. Substituting c_2 = k = 0.01, c_1 = 0.99 and

    F = D⁻¹ ( −1   −1 )
            ( 10    1 )

makes the VGL(1) system diverge. This result has been empirically verified to carry over to the VL(1) system too; i.e. this is a result where Sarsa(1) and TD(1) diverge. This highlights the difficulty of control problems in comparison to prediction tasks. A prediction task is easier since, as Sutton (1988) showed, the λ = 1 system is equivalent to gradient descent on the sum-of-squares error E = Σ_t (V′_t − V_t)², and so convergence is guaranteed for a prediction task. However, in a control problem, even when there is no bootstrapping, changing one value of V′_t affects the others by altering the greedy actions. This problem is resolved by using the VGLΩ(1) weight update. The results of this section are summarised in Table 6.

    Algorithm(λ)   Divergence example found   Proven to converge
    VL(1)          Yes                        No
    VL(0)          Yes                        No
    VGL(1)         Yes                        No
    VGL(0)         Yes                        No
    VGLΩ(1)        No                         Yes
    VGLΩ(0)        Yes                        No

    Table 6: Results for Experiment 3: which algorithms can be made to diverge?

4.4 Experiment 4: Two-step Toy Problem, with Insufficiently Flexible Function Approximator

In this experiment the two-step Toy Problem with k = 2 was considered, from a fixed start point of x_0 = 0.
A function approximator with just one weight component, w_1, was defined differently at each time step:

    V(x_t, w_1) = −c_1 x_1² + w_1 x_1            if t = 1
                  −c_2 x_2² + (w_1 − c_3) x_2    if t = 2
                  0                              if t = 3

Here c_1 = 2, c_2 = 0.1 and c_3 = 10 are constants, designed to create some conflict between the function approximator's requirements at each time step. The optimal actions are a_0 = a_1 = 0, and therefore the function approximator would only be optimal if it could achieve ∂V(x, w)/∂x = 0 at x = 0 for both time steps. The presence of the c_3 term makes this impossible, so a compromise must be made.

    Sequential trajectory equations (evaluated top to bottom):
      x_0 ← 0
      a_0 ← (w_1 − 2 c_1 x_0) / (2 (c_1 + k))
      x_1 ← x_0 + a_0
      a_1 ← (w_1 − c_3 − 2 c_2 x_1) / (2 (c_2 + k))
      x_2 ← x_1 + a_1
      G′_2 ← −2 x_2                G_2 ← −2 c_2 x_2 + w_1 − c_3
      Ω_1 ← 1 / (2 (c_2 + k))
      G′_1 ← (2 c_2 k a_1 + k (λ G′_2 + (1 − λ) G_2)) / (c_2 + k)
      G_1 ← −2 c_1 x_1 + w_1
      Ω_0 ← 1 / (2 (c_1 + k))

    Table 7: Trajectory variables for Experiment 4.

Only the value-gradient algorithms were considered in this experiment; hence there was no need for exploration or for the ε-greedy policy. The equations for the trajectory are very similar to those of Experiment 2, so only the key results are listed in Table 7. A different stopping criterion was used in this experiment, since each algorithm converges to a different fixed point: the stopping condition used was |Δw_1| < 10⁻⁷ α. Once the fixed point had been reached, we noted the value of R, the total reward for that trajectory, in Table 8. Each algorithm used α = 0.01, attained 100% convergence, and produced the same value of R on each run. In these trials α was not varied, and the iteration-count columns are not very meaningful, since it can be shown for each algorithm that Δw_1 is a linear function of w_1.
This indicates that any of the algorithms could be made arbitrarily fast by fine-tuning α, and also confirms that there is only one fixed point for each algorithm.

The different values of R in the results show that the different algorithms have varying degrees of optimality with respect to R. The first algorithm, VGLΩ(1), is the only one that is truly optimal with respect to R (subject to the constraints imposed by the awkward choice of function approximator), since it is equivalent to gradient ascent on R^π, as shown in Section 2.1.

It is interesting that the other algorithms converge to different suboptimal points, compared to VGLΩ(1). This shows that introducing the Ω_t terms and using λ = 1 balances the priorities between minimising (G′_1 − G_1)² and minimising (G′_2 − G_2)² appropriately, so as to finish with an optimal value for R. Different values for the constants c_1 and c_2 were chosen to emphasise this point. The relative rankings of the other algorithms may change in other problem domains, but it is expected that VGLΩ(1) would always be at the top. Hence we use this algorithm as our standard choice out of those listed in this paper, to be used in conjunction with a robust optimiser, e.g. RPROP.

4.5 Experiment 5: One-Dimensional Lunar-Lander Problem

The objective of this experiment was to learn the "Lunar-Lander" problem described in Appendix E. The value function was provided by a fully connected multi-layer perceptron (MLP) (see Bishop, 1995, for details). The MLP had 3 inputs, one hidden layer of 6 units, and one output unit in the final layer. Additional shortcut connections connected all input units directly to the output layer. The activation functions were standard sigmoid functions in the input and hidden layers, and an identity function (i.e. linear activation) in the output unit.
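A minimal sketch of a forward pass through a network of the kind just described is given below. It is our own illustration, assuming the architecture exactly as stated (sigmoid input and hidden layers, a linear output unit, and shortcut input-to-output connections); the weight values are random placeholders rather than the paper's trained network:

```python
# Forward pass of a 3-6-1 MLP with sigmoid input and hidden layers, a linear
# output unit, and shortcut connections from every input unit to the output.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)                        # placeholder weights, uniform in [-1, 1]
n_in, n_hid = 3, 6
W_hid = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
W_out = [random.uniform(-1, 1) for _ in range(n_hid + n_in + 1)]  # hidden + shortcuts + bias

def value(x):
    """Forward pass; x is the (already scaled) 3-dimensional state input."""
    x = [sigmoid(xi) for xi in x]     # sigmoid input layer
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x + [1.0]))) for row in W_hid]
    return sum(w * u for w, u in zip(W_out, h + x + [1.0]))   # linear output

print(value([0.5, -0.2, 0.1]))
```

In the paper the output of the network is further scaled by a factor of 100 to give the value function; that scaling is omitted here.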
    Algorithm(λ)    R           Iterations (Mean)   (s.d.)
    VGLΩ(1)        −2.65816     2327                 247
    VGLRG(1)       −2.68083      183                  19
    VGL(1)         −2.79905      500                  51
    VGLΩ(0)        −2.82344     3532                 352
    VGLRG(0)       −3.97316      256                  21
    VGL(0)         −5.76701     1077                  87

    Table 8: Results for Experiment 4, ranked by R.

The input to the neural network was (h/100, v/10, u/50)ᵀ, and the output was multiplied by 100 to give the value function. Each weight in the neural network was initially randomised to lie in [−1, 1], with uniform probability distribution.

In this section results are presented as pairs of diagrams. The left-hand diagrams show a graph of the total reward for all trajectories versus the number of training iterations, and compare performance to that of an optimal policy. The optimal policy's performance was calculated by the theory described in Appendix E.2; that appendix also shows example optimal trajectories. The right-hand diagrams show a cross-section through state space, with the y-axis showing height and the x-axis showing velocity. The final trajectories obtained are shown as curves starting at a diamond symbol and finishing at h = 0.

All algorithms used in this section were the continuous-time counterparts to those stated, and are described in Section 2.2. Also, all weight updates were combined with RPROP for acceleration. For implementing RPROP, the weight updates for all trajectories were first accumulated, and the resulting total weight update was then fed into RPROP at the end of each iteration. RPROP was used with the default parameters defined by Riedmiller and Braun (1993).

Results for the task of learning one trajectory from a fixed start point are shown in Figure 6 for the VGLΩ algorithm. The results were averaged over 10 runs. VGLΩ worked better on this task than VGL; VGL sometimes produced unstable or suboptimal solutions.
However, both could manage the task well with c = 1. It was not possible to get VL to work on this task at all. VL failed with a greedy policy, as expected, since there is then no exploration. However, VL also failed on this task when using the ε-greedy policy. We suspect the reason for this was that random exploration was producing more unwanted noise than useful information: random exploration makes the spacecraft fly the wrong way and learn the wrong values. We believe this makes a strong case for using value-gradients and a deterministic system instead of VL with random exploration.

The kink in the left diagram of Figure 6 was caused because c was low, so the gradient ∂R^π/∂w was tiny for the first few iterations. It seemed that RPROP would cause the weight change to build up momentum quickly and then temporarily overshoot its target, before bringing it back under control. It was very difficult to learn this task with such a small c value without RPROP.

    Figure 6: Learning performance for a single trajectory on the Lunar-Lander problem. Parameters: c = 0.01, Δt = 0.1, λ̄ = 0. The algorithm used was VGLΩ. The trajectory produced is very close to optimal (c.f. Fig. 9). [Left panel: total reward R versus training iterations, compared with the optimal policy. Right panel: height H versus velocity, showing the learned trajectory from its start point.]

Figure 7 shows the performance of VGL and VL on a task of learning 50 trajectories from fixed start points. The VGL learning graph is clearly smoother, more efficient and more optimal than the VL graph. VGLΩ and VGL could both manage this task, achieving close to optimal performance each time, but only VGLΩ could cope with the smaller c values in the range [0.01, 1).
    Figure 7: Learning performance when learning 50 trajectories simultaneously on the Lunar-Lander problem. Parameters: c = 1, Δt = 1, λ̄ = 0. The algorithms used were VL and VGL. Graphs were averaged over 20 runs; the averaging will have had a smoothing effect on both curves. The right graph shows a set of final trajectories obtained with one of the VGL trials, which are close to optimal. [Left panel: total reward R versus training iterations for VGL, VL and the optimal policy. Right panel: height H versus velocity, showing the 50 final trajectories from their start points.]

It was difficult to get VL working well on this problem at all, and the parameters chosen were those that appeared to be most favourable to VL in preliminary experiments. Having such a high c value makes the task much easier, but VL still could not produce close-to-optimal, or stable, trajectories. No stochastic elements were used in either the policy or the model: the large number of fixed trajectory start points provided the exploration required by VL, and gave a reasonable sampling of the whole of state space. Preliminary tests with the ε-greedy policy did not produce any successful learning. The policy used was the greedy policy described in Appendix E.

We could not get bootstrapping to work on this task, for either the VL or the VGL algorithms. With bootstrapping, the trajectories tended to oscillate continually. It was not clear why this was, but it is consistent with the lack of convergence results for bootstrapping.

It is desirable to have c as small as possible, to get fuel-efficient trajectories (see for example Figure 9), but a small c makes the continuous-time DE more stiff.
However, our experience was that the limiting factor in choosing a small c was not the stiffness of the DE, but the fact that as c → 0, ∂R^π/∂w → 0, which made learning with VGLΩ difficult. Using RPROP largely remedied this, since it copes well with small gradients and can respond quickly when the gradient suddenly changes. We did not need to use a stiff DE solver, and found the Euler method adequate to integrate the equations of Section 2.2.

5. Conclusions

This section summarises some of the issues and benefits raised by the VGL approach, and highlights the contributions of this paper.

• Several VGL algorithms have been stated: VGL(λ), VGLΩ(λ) and VGLRG(λ) (see Table 1), with their continuous-time counterparts, and actor-critic learning; all defined for any 0 ≤ λ ≤ 1. Results on the Toy Problem and the Lunar-Lander are better than the VL results by several orders of magnitude, in all cases.

• The value-gradient analysis goes a long way towards resolving the issue of exploration in value-function learning. Local exploration comes for free with VGL. Other than the problem of local versus global optimality, the problems of exploration are resolved, and value-function learning is put on an equal footing with PGL. For example, as discussed in Section 1.5, exploration had previously caused difficulties in Q(λ)-learning and Sarsa(λ).

• In Appendix A, definitions of extremal and optimal trajectories, and an optimality proof, are given for learning by value-gradients. The proof refers to Pontryagin's Maximum Principle (PMP), but in the case of "bang-bang" control, the conclusion of the proof goes slightly further than is implied solely by PMP.

• The value-gradient analysis provides an overall view that links several different areas within reinforcement learning.
For example, the connection between PGL and value-function learning is proven in Section 2.1. This provides an explanation of what happens when the "residual gradient terms" are left out of the weight update equations (i.e. ∂E/∂w → ∂R^π/∂w; see Section 2.1), and a tentative justification for the TD(λ) weight update equation (see Section 2.1). Also, the obvious similarity of form between Eq. 15 and Eq. 27 provides connections between PGL and actor-critic architectures, as discussed in Section 3.

• The use of a function approximator has been intrinsic throughout. This followed from the definition of V in Section 1.1, and has led to a robust convergence proof, for a general function approximator and value function, in Section 2.1. The use of a general function approximator is an advancement over simplified function approximators (e.g. linear approximators), or function approximators that require a hand-picked partitioning of state space.

• Most previous studies have separated the value-function update from the greedy-policy update, but this has been a severe limitation, because RL does not work this way; in practice it is necessary to alternate one with the other, otherwise the RL system would never improve. However, no previous convergence proof or divergence example applies to this "whole system". In this paper there has been a tight coupling of the value function to the greedy policy, and this has made it possible to successfully analyse what effect a value-function update has on the greedy policy. We have found convergence proofs and divergence examples for the whole system. We do not think it is possible to do this analysis without value-gradients, since the expression for ∂π/∂w in Eq. 17 depends on ∂G/∂w.
Hence value-gradients are necessary to understand value-function learning, whether by VL or VGL.

• By considering the "whole system", a divergence example is found in Section 4.3 for VL (including Sarsa(λ) and TD(λ)) and for VGL with λ = 1. This may be a surprising result, since it is generally thought that the case of λ = 1 is fully understood and always converges; but this is not so for the whole system on a control problem.

• It is proposed in Section 1.4 that the value-gradients approach is an idealised form of VL. We also believe that the approach of VL is not only haphazardly indirect, but also introduces some extra unwanted terms into the weight update equation, as demonstrated in the analysis at the end of Section 4.1.

• The continuous-time policy stated in Section 2.2 (taken from Doya, 2000) is an exact implementation of the greedy policy π(x, w) that is smooth with respect to x and w, provided that the function approximator used is smooth. This resolves one of the greatest difficulties of value-function learning, namely that of discontinuous changes to the actions chosen by the greedy policy.

• A new explanation is given of how residual-gradient algorithms can get trapped in spurious local minima, as described in Section 2.3. We think this is the main reason why residual gradients often fail to work in practice, even in the case of VGL and deterministic systems. Understanding this will hopefully save other researchers from losing too much time exploring this possibility.

It is the opinion of the author that there are several problematic issues with reinforcement learning using an approximated value function, which the algorithm VGLΩ(1) resolves. These problematic issues are:

• Learning progress is far from monotonic, when measured by either E (Eq. 12) or R^π, or any other metric currently known.
This problem is resolved by the proposed algorithm, when used in conjunction with a policy such as the one in Section 2.2.

• When learning a value function with a function approximator, the objective is to obtain G_t = G′_t for all t along a trajectory. In general, due to the nature of function approximation, this will never be attained exactly. However, even if it is very close to being attained, the value of R^π may be far from optimal. In short, minimising E and maximising R^π are not the same thing unless E = 0 can be attained exactly. As demonstrated by the experiment of Section 4.4, and the proofs in Section 2.1, this problem is resolved by the proposed algorithm.

• The success of learning can depend on how state space is scaled. The definition of Ω_t (Eq. 19) resolves this problem; other algorithms can become unstable without it.

• Making successful experiments reproducible in VL is very difficult. There are no convergence guarantees, either with or without bootstrapping, and success often depends upon well-chosen parameter choices made by the experimenter. For example, the Lunar-Lander problem in Section 4.5 seems to defeat VL with the given choices of state-space scaling and function approximator. With the proposed algorithm, convergence to some fixed point is assured, and so one major element of luck is removed.

All of the proposed algorithms are defined for any 0 ≤ λ ≤ 1. The results in Section 3.1 and Appendix A are valid proofs for any λ, but the main convergence result of this paper, Eq. 18, applies only to λ = 1. Unfortunately, divergence examples exist for λ < 1, as described in Section 4.3.
Also, by proving equivalence to policy-learning in the case of λ = 1, and finding a lack of robustness and divergence examples for λ < 1, the usefulness of the value function is somewhat discredited; both for VGL, and for its stochastic relative, VL.

Acknowledgments

I am very grateful to Peter Dayan, Andy Barto, Paul Werbos and Rémi Coulom for their discussions, suggestions and pointers for research on this topic.

Appendix A. Optimal Trajectories

In this appendix we define locally optimal trajectories and prove that if $G'_t = G_t$ for all t along a greedy trajectory then that trajectory is locally extremal, and in certain situations, locally optimal.

Locally Optimal Trajectories. We define a trajectory parametrised by values $\{\vec{x}_0, a_0, a_1, a_2, \ldots\}$ to be locally optimal if $R(\vec{x}_0, a_0, a_1, a_2, \ldots)$ is at a local maximum with respect to the parameters $\{a_0, a_1, a_2, \ldots\}$, subject to the constraints (if present) that $-1 \le a_t \le 1$.

Locally Extremal Trajectories (LET). We define a trajectory parametrised by values $\{\vec{x}_0, a_0, a_1, a_2, \ldots\}$ to be locally extremal if, for all t,

$$
\begin{cases}
\left(\frac{\partial R}{\partial a}\right)_t = 0 & \text{if } a_t \text{ is not saturated} \\
\left(\frac{\partial R}{\partial a}\right)_t > 0 & \text{if } a_t \text{ is saturated and } a_t = 1 \\
\left(\frac{\partial R}{\partial a}\right)_t < 0 & \text{if } a_t \text{ is saturated and } a_t = -1.
\end{cases}
\tag{29}
$$

In the case that all the actions are unbounded, this criterion for a LET simplifies to just requiring $\left(\frac{\partial R}{\partial a}\right)_t = 0$ for all t. Having the possibility of bounded actions introduces the extra complication of saturated actions. The second condition in Eq. 29 incorporates the idea that if a saturated action $a_t$ is fully "on", then we would normally like it to be on even more (if that were possible). In fact, in this definition R is locally optimal with respect to any saturated actions.
Consequently, if all of the actions are saturated (for example in the case of "bang-bang" control), then this definition of a LET provides a sufficient condition for a locally optimal trajectory.

Concave Model Functions. We say a model has concave model functions if all locally extremal trajectories are guaranteed to be locally optimal. In other words, if we define $\nabla_a R$ to be a column vector with $i$-th element equal to $\frac{\partial R}{\partial a_i}$, and $\nabla_a \nabla_a R$ to be the matrix with $(i,j)$-th element equal to $\frac{\partial^2 R}{\partial a_i \partial a_j}$, then the model functions are concave if $\nabla_a R = 0$ implies $\nabla_a \nabla_a R$ is negative definite. For example, for the two-step Toy Problem with k = 1, since

$$R(x_0, a_0, a_1) = -a_0^2 - a_1^2 - (x_0 + a_0 + a_1)^2,$$

we have

$$\nabla_a \nabla_a R = \begin{pmatrix} -4 & -2 \\ -2 & -4 \end{pmatrix},$$

which is constant and negative definite; and so the two-step Toy Problem with k = 1 has concave model functions. It can also be shown that the n-step Toy Problem, with any k ≥ 0 and any n ≥ 1, also has concave model functions.

Constrained Locally Optimal Trajectories. The previous two optimality criteria were independent of any policy. This weaker definition of optimality is specific to a particular policy, and is defined as follows: a constrained (with respect to $\vec{w}$) locally optimal trajectory is a trajectory parametrised by an arbitrary smooth policy function $\pi(\vec{x}_t, \vec{w})$, where $\vec{w}$ is the weight vector of some function approximator, such that $R^\pi(\vec{x}_0, \vec{w})$ is at a local maximum with respect to $\vec{w}$. If we assume the function $R^\pi(\vec{x}, \vec{w})$ is continuous and smooth everywhere with respect to $\vec{w}$, then this kind of optimality is naturally achieved at any stationary point found by gradient ascent on $R^\pi$ with respect to $\vec{w}$, i.e. by any PGL algorithm.
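The concavity claim above for the two-step Toy Problem can be checked numerically. The following sketch (not part of the original paper; it assumes NumPy is available) confirms that the Hessian of R with respect to the actions is negative definite:

```python
import numpy as np

# Total reward R(x0, a0, a1) for the two-step Toy Problem with k = 1,
# as defined above: R = -a0^2 - a1^2 - (x0 + a0 + a1)^2.
def total_reward(x0, a0, a1):
    return -a0**2 - a1**2 - (x0 + a0 + a1)**2

# Hessian of R with respect to the actions (a0, a1); it is constant here.
hessian = np.array([[-4.0, -2.0],
                    [-2.0, -4.0]])

# Negative definite <=> all eigenvalues strictly negative, so every
# locally extremal trajectory of this model is locally optimal.
eigenvalues = np.linalg.eigvalsh(hessian)
print(eigenvalues)              # [-6. -2.]
print(bool(np.all(eigenvalues < 0)))  # True
```

Since the Hessian is constant, negative definiteness holds at every stationary point of R, which is exactly the concave-model-functions condition.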
Lemma 5. If $G'_t \equiv G_t$ (for all t, and some λ) along a trajectory found by an arbitrary smooth policy $\pi(\vec{x}, \vec{z})$, then $G'_t \equiv G_t \equiv \left(\frac{\partial R^\pi}{\partial \vec{x}}\right)_t$ for all t.

This lemma is for a general policy, and is required by Section 3.1. Here we use the extended definitions of V′ and G′ that apply to any policy, given in Section 3 and Eq. 26. First we note that $\left(\frac{\partial R^\pi(\vec{x},\vec{z})}{\partial \vec{x}}\right)_t \equiv G'_t$ with λ = 1, since when λ = 1 any dependency of $G'(\vec{x}_t, \vec{w}, \vec{z})$ on $\vec{w}$ disappears. Also, by Eq. 6 and since $G'_t = G_t$, we get

$$G'_t = \left( \left(\frac{\partial r}{\partial \vec{x}}\right)_t + \left(\frac{\partial \pi}{\partial \vec{x}}\right)_t \left(\frac{\partial r}{\partial a}\right)_t \right) + \left( \left(\frac{\partial f}{\partial \vec{x}}\right)_t + \left(\frac{\partial \pi}{\partial \vec{x}}\right)_t \left(\frac{\partial f}{\partial a}\right)_t \right) G'_{t+1}.$$

Therefore $G'_t$ is independent of λ, and therefore $G'_t \equiv G_t \equiv \left(\frac{\partial R^\pi}{\partial \vec{x}}\right)_t$ for all t.

Lemma 6. If $G'_t \equiv G_t$ (for all t, and some λ) along a greedy trajectory, then $G'_t \equiv G_t \equiv \left(\frac{\partial R}{\partial \vec{x}}\right)_t \equiv \left(\frac{\partial R^\pi}{\partial \vec{x}}\right)_t$ for all t.

This is proved by induction. Note that this lemma differs from the previous lemma in that it is specifically for the greedy policy, and the conclusion is stronger. By Eq. 6 and since $G'_t = G_t$, we get

$$G_t = \left(\frac{\partial \pi}{\partial \vec{x}}\right)_t \left(\frac{\partial Q}{\partial a}\right)_t + \left(\frac{\partial r}{\partial \vec{x}}\right)_t + \left(\frac{\partial f}{\partial \vec{x}}\right)_t G_{t+1}.$$

The left term of this sum must be zero, since the greedy policy implies either $\left(\frac{\partial \pi}{\partial \vec{x}}\right)_t = \vec{0}$ (in the case that $a_t$ is saturated and $\left(\frac{\partial \pi}{\partial \vec{x}}\right)_t$ exists, by Lemma 2), or $\left(\frac{\partial Q}{\partial a}\right)_t = 0$ (in the case that $a_t$ is not saturated, by Lemma 1). If $\left(\frac{\partial \pi}{\partial \vec{x}}\right)_t$ does not exist, then it must be that λ = 0, since $G'_t$ exists, and when λ = 0 the definition is $G'_t = \left(\frac{\partial Q}{\partial \vec{x}}\right)_t$. Therefore, in all cases,

$$G_t = \left(\frac{\partial r}{\partial \vec{x}}\right)_t + \left(\frac{\partial f}{\partial \vec{x}}\right)_t G_{t+1}.$$

Also, differentiating Eq. 1 with respect to $\vec{x}$ gives

$$\left(\frac{\partial R}{\partial \vec{x}}\right)_t = \left(\frac{\partial r}{\partial \vec{x}}\right)_t + \left(\frac{\partial f}{\partial \vec{x}}\right)_t \left(\frac{\partial R}{\partial \vec{x}}\right)_{t+1} \tag{30}$$

So $\left(\frac{\partial R}{\partial \vec{x}}\right)_t$ and $G_t$ have the same recursive definition.
Also, their values at the final time step t = F are the same, since $\left(\frac{\partial R}{\partial \vec{x}}\right)_F = G_F = \vec{0}$. Therefore, by induction and Lemma 5, $G'_t \equiv G_t \equiv \left(\frac{\partial R}{\partial \vec{x}}\right)_t \equiv \left(\frac{\partial R^\pi}{\partial \vec{x}}\right)_t$ for all t.

Theorem 7. Any greedy trajectory satisfying $G'_t = G_t$ (for all t) must be locally extremal.

Proof: Since the greedy policy maximises $Q(\vec{x}_t, a_t, \vec{w})$ with respect to $a_t$ at each time step t, we know at each t,

$$
\begin{cases}
\left(\frac{\partial Q}{\partial a}\right)_t = 0 & \text{if } a_t \text{ is not saturated} \\
\left(\frac{\partial Q}{\partial a}\right)_t > 0 & \text{if } a_t \text{ is saturated and } a_t = 1 \\
\left(\frac{\partial Q}{\partial a}\right)_t < 0 & \text{if } a_t \text{ is saturated and } a_t = -1.
\end{cases}
\tag{31}
$$

These follow from Lemma 1 and the definition of saturated actions. Additionally, by Lemma 6, $G_t \equiv \left(\frac{\partial R}{\partial \vec{x}}\right)_t$ for all t. Therefore, since

$$\left(\frac{\partial R}{\partial a}\right)_t = \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t \left(\frac{\partial R}{\partial \vec{x}}\right)_{t+1} = \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t G_{t+1} = \left(\frac{\partial Q}{\partial a}\right)_t,$$

we have $\left(\frac{\partial R}{\partial a}\right)_t \equiv \left(\frac{\partial Q}{\partial a}\right)_t$ for all t. Therefore the consequences of the greedy policy (Eq. 31) become equivalent to the sufficient conditions for a LET (Eq. 29), which implies the trajectory is a LET.

Corollary 8. If, in addition to the conditions of Theorem 7, the model functions are concave, or if all of the actions are saturated (bang-bang control), then the trajectory is locally optimal.

This follows from the definitions given above of concave model functions and a LET.

Remark: In practice we often do not need to worry about the need for concave model functions, since any algorithm that works by gradient ascent on $R^\pi$ will tend to head towards local maxima, not saddle points or minima. This applies to all VGL algorithms listed in this paper, except for residual gradients.

Remark: We point out that the proof of Theorem 7 could almost be replaced by use of Pontryagin's maximum principle (PMP) (Bronshtein and Semendyayev, 1985), since Eq. 30 implies $\left(\frac{\partial R}{\partial \vec{x}}\right)_t$ is the "costate" (or "adjoint") vector of PMP, and Lemma 6 implies that the greedy policy is equivalent to the maximum condition of PMP. PMP on its own is not sufficient for the optimality proof without use of Lemma 6. Use of PMP would obviate the need for the bespoke definition of a LET that we have used. We did not use PMP because it is only described to be a "necessary" condition for optimality, and the way we have formulated the proof allows us to derive the corollary's extra conclusion for bang-bang control.

Appendix B. Detailed counterexample of the failure of value-learning without exploration, compared to the impossibility of failure for value-gradient learning

This section gives a more detailed example than that of Fig. 3, to show why exploration is necessary to VL but not to VGL. We consider the one-step Toy Problem with k = 1. For this problem, the optimal policy (Eq. 8) simplifies to

$$\pi^*(x_0) = -x_0/2 \tag{32}$$

Next we define a value function on which a greedy policy can be defined. Let the value function be linear, for simplicity, be approximated by just two parameters $(w_1, w_2)$, and be defined separately for the two time steps:

$$V(x_t, w_1, w_2) = \begin{cases} w_1 + w_2 x_1 & \text{if } t = 1 \\ 0 & \text{if } t = 2 \end{cases} \tag{33}$$

For the final time step, t = 2, we have assumed the value function is perfectly known, so that $V_2 \equiv R_2 \equiv 0$. At time step t = 0 it is not necessary to define the value function, since the greedy policy only looks ahead. Differentiating this value function gives the following value-gradient function:

$$G(x_t, w_1, w_2) = \begin{cases} w_2 & \text{if } t = 1 \\ 0 & \text{if } t = 2 \end{cases}$$

The greedy policy on this value function gives

$$a_0 = \pi(x_0, \vec{w}) = \arg\max_a \left( r(x_0, a, \vec{w}) + V(f(x_0, a), \vec{w}) \right) \quad \text{by Eqs. 2, 3}$$
$$= \arg\max_a \left( -a^2 + w_1 + w_2 (x_0 + a) \right) \quad \text{by Eqs. 33, 7}$$
$$= w_2/2 \tag{34}$$

Having defined a value function and found the greedy policy that acts on it, we next analyse the situations in the VL and VGL cases, each without exploration. The value function defined above is used in the following examples. Note that the conclusions of the following examples cannot be explained by the choice of function approximator for V. For example, Fig. 3 shows a counterexample for a different function approximator, and similar counterexamples for VL can easily be found for any function approximator of higher degree. A linear function approximator was chosen here since it is the simplest type of approximator that can be made to learn an optimal trajectory in this problem, as is illustrated in the VGL example below.

Value-Learning applied to Toy Problem (without exploration): Here the aim is to show that VL, without exploration, can be applied to the one-step Toy Problem (with k = 1) and converge to a sub-optimal trajectory. The target for the value function at t = 1 is given by:

$$V'_1 = r(x_1, a_1) + \lambda V'_2 + (1-\lambda) V_2 \quad \text{by Eq. 4}$$
$$= -x_1^2 \quad \text{by Eq. 7, and since } V'_2 = V_2 = 0$$

The value function at t = 1 is given by $V_1 = w_1 + w_2 x_1$. A simple counterexample can be chosen to show that if VL is complete (i.e. if $V_t = V'_t$ for all t > 0), then the trajectory may not be optimal. If $x_0 = 5$, $w_1 = -25$, $w_2 = 0$, then the greedy policy (Eq. 34) gives $a_0 = w_2/2 = 0$ and thus $x_1 = x_0 = 5$. Therefore $V_1 = V'_1 = -25$, and $V_2 = V'_2 = 0$, and so learning is complete. However, the trajectory is not optimal, since the optimal policy (Eq. 32) requires $a_0 = -5/2$.

Value-Gradient Learning applied to Toy Problem: The objective of VGL is to make the value-gradients match their target gradients.
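The sub-optimal VL fixed point just described can be verified numerically. The following sketch is illustrative only (not part of the original paper): it hard-codes the one-step Toy Problem with k = 1 and the counterexample values $x_0 = 5$, $w_1 = -25$, $w_2 = 0$:

```python
# One-step Toy Problem with k = 1 (Appendix B): x1 = x0 + a0,
# rewards r0 = -a0^2 and final reward -x1^2, so the trajectory
# return is R = -a0^2 - x1^2.

def greedy_action(w2):
    # arg max over a of (-a^2 + w1 + w2*(x0 + a)) = w2 / 2   (Eq. 34)
    return w2 / 2.0

def trajectory_return(x0, a0):
    x1 = x0 + a0
    return -a0**2 - x1**2

# The VL fixed point of the counterexample: x0 = 5, w1 = -25, w2 = 0.
x0, w1, w2 = 5.0, -25.0, 0.0
a0 = greedy_action(w2)          # 0.0: greedy policy takes no action
x1 = x0 + a0
V1 = w1 + w2 * x1               # -25.0
V1_target = -x1**2              # -25.0: targets match, so VL is "complete"
assert V1 == V1_target

# Yet the trajectory is sub-optimal: the optimal action is -x0/2 (Eq. 32).
print(trajectory_return(x0, a0))          # -25.0
print(trajectory_return(x0, -x0 / 2.0))   # -12.5 (strictly better)
```

The learned values are exactly self-consistent along the visited trajectory, yet the return is half of what the optimal action achieves, which is the failure mode VGL avoids.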
For the one-step Toy Problem (with k = 1), we get:

$$G'_1 = \left( \left(\frac{\partial r}{\partial x}\right)_1 + \left(\frac{\partial \pi}{\partial x}\right)_1 \left(\frac{\partial r}{\partial a}\right)_1 \right) + \left( \left(\frac{\partial f}{\partial x}\right)_1 + \left(\frac{\partial \pi}{\partial x}\right)_1 \left(\frac{\partial f}{\partial a}\right)_1 \right) \left( \lambda G'_2 + (1-\lambda) G_2 \right) \quad \text{by Eq. 6}$$
$$= (-2x_1 + 0) + (1 + 0)\left( \lambda G'_2 + (1-\lambda) G_2 \right) \quad \text{by Eq. 7}$$
$$= -2x_1 \quad \text{since } G'_2 = G_2 = 0$$

The value-gradient at t = 1 is given by $G_1 = w_2$. For these to be equal, i.e. for $G_t = G'_t$ for all t > 0, we must have $w_2 = -2x_1$. The greedy policy (Eq. 34) then gives

$$a_0 = w_2/2 = -x_1 = -(x_0 + a_0) \quad \Rightarrow \quad a_0 = -x_0/2,$$

which is the same as the optimal policy (Eq. 32). Therefore if the value-gradients are learned, then the trajectory will be optimal.

Appendix C. Equivalence of V′ notation in TD(λ)

The formulation of TD(λ) as presented in this paper (Eq. 10) uses the V′ notation. This can be proven to be equivalent to the formulation used by Sutton (1988) as follows. Expanding the recursion in Eq. 4 gives $V'_t = \sum_{k \ge t} \lambda^{k-t}(r_k + (1-\lambda)V_{k+1})$, so Eq. 10 becomes:

$$\Delta \vec{w} = \alpha \sum_{t \ge 1} \left(\frac{\partial V}{\partial \vec{w}}\right)_t \left( \sum_{k \ge t} \lambda^{k-t}(r_k + (1-\lambda)V_{k+1}) - V_t \right)$$
$$= \alpha \sum_{t \ge 1} \left(\frac{\partial V}{\partial \vec{w}}\right)_t \left( \sum_{k \ge t} \lambda^{k-t}(r_k + V_{k+1}) - \sum_{k \ge t} \lambda^{k-t} \lambda V_{k+1} - V_t \right)$$
$$= \alpha \sum_{t \ge 1} \left(\frac{\partial V}{\partial \vec{w}}\right)_t \left( \sum_{k \ge t} \lambda^{k-t}(r_k + V_{k+1}) - \sum_{k \ge t} \lambda^{k-t} V_k \right)$$
$$= \alpha \sum_{k \ge 1} \sum_{t=1}^{k} (r_k + V_{k+1} - V_k)\, \lambda^{k-t} \left(\frac{\partial V}{\partial \vec{w}}\right)_t$$
$$= \alpha \sum_{t \ge 1} (r_t + V_{t+1} - V_t) \sum_{k=1}^{t} \lambda^{t-k} \left(\frac{\partial V}{\partial \vec{w}}\right)_k$$

This last line is a batch-update version of the weight-update equation given by Sutton (1988). This validates the use of the notation V′.

Appendix D. Equivalence of Sarsa(λ) to TD(λ) for control problems with a known model

Sarsa(λ) is an algorithm for control problems that learns to approximate the $Q(\vec{x}, a, \vec{w})$ function (Rummery and Niranjan, 1994). It is designed for policies that are dependent on the $Q(\vec{x}, a, \vec{w})$ function (e.g.
the greedy policy or ε-greedy policy). The Sarsa(λ) algorithm is defined for trajectories where all actions after the first are found by the given policy; the first action $a_0$ can be arbitrary. The function-approximator update is then:

$$\Delta \vec{w} = \alpha \sum_{t \ge 0} \left(\frac{\partial Q}{\partial \vec{w}}\right)_t (Q'_t - Q_t) \tag{35}$$

where $Q'_t = r_t + V'_{t+1}$. Sarsa(λ) is designed to be able to work in problem domains where the model functions are not known; however, we can also apply it to the Q function as defined in Eq. 2, which relies upon our known model functions. This means we can rewrite Eq. 35 in terms of $V(\vec{x}, \vec{w})$ to become exactly the same as Eq. 10. For this reason, we are justified in working with the TD(λ) weight update on control problems using the ε-greedy policy in the experiments of this paper.

Appendix E. One-dimensional Lunar-Lander Problem

In this continuous-time problem, a spacecraft is constrained to move in a vertical line and its objective is to make a fuel-efficient gentle landing. The spacecraft is released from vertically above a landing pad in a uniform gravitational field, and has a single thruster that can produce upward accelerations. The state vector $\vec{x}$ has three components: height (h), velocity (v), and fuel remaining (u), so that $\vec{x}_t \equiv (h_t, v_t, u_t)^T$. Velocity and height have upwards defined to be positive. The spacecraft can perform upward accelerations $a_t$ with $a_t \in [0, 1]$. The continuous-time model functions for this problem are:

$$\bar{f}((h, v, u)^T, a) = (v,\ (a - k_g),\ -a)^T$$
$$\bar{r}((h, v, u)^T, a) = -(k_f)\, a + \bar{r}_C(a)$$

$k_g \in (0, 1)$ is a constant giving the acceleration due to gravity; the spacecraft can produce greater acceleration than that due to gravity. $k_f$ is a constant giving the fuel penalty. We used $k_g = 0.2$ and $k_f = 2$.
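The model functions above are easy to simulate. The following sketch (not part of the original paper) integrates the dynamics with forward Euler; the step size, the initial state, and the zero-thrust policy are illustrative assumptions, not choices made in the text:

```python
# Forward Euler integration of the Lunar-Lander model functions,
# with k_g = 0.2 and k_f = 2 as in the text. dt, the initial state,
# and the constant-thrust policy below are illustrative assumptions.
K_G, K_F = 0.2, 2.0

def f_bar(state, a):
    h, v, u = state
    return (v, a - K_G, -a)      # (dh/dt, dv/dt, du/dt)

def simulate(h0, v0, u0, policy, dt=0.01):
    h, v, u = h0, v0, u0
    t = 0.0
    while h > 0.0 and u > 0.0:   # terminal states: ground hit or fuel out
        a = min(max(policy(h, v, u), 0.0), 1.0)   # enforce a_t in [0, 1]
        dh, dv, du = f_bar((h, v, u), a)
        h, v, u = h + dt * dh, v + dt * dv, u + dt * du
        t += dt
    # final reward impulse: -v^2 - 2*k_g*h (kinetic + potential energy)
    return t, v, -v**2 - 2.0 * K_G * h

# Freefall from h = 5 with no thrust: the lander crashes with
# impact velocity close to -sqrt(2 * k_g * 5) ~ -1.414.
t_end, v_end, final_reward = simulate(5.0, 0.0, 10.0, lambda h, v, u: 0.0)
print(t_end, v_end, final_reward)
```

This is only a forward simulator; the paper's learning algorithms additionally need the model derivatives and the discontinuity corrections discussed next.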
Terminal states are where the spacecraft hits the ground (h = 0) or runs out of fuel (u = 0). In addition to the continuous-time reward $\bar{r}$ defined above, a final impulse of reward equal to $-v^2 - 2(k_g)h$ is given as soon as the lander reaches a terminal state. The terms in this final reward represent kinetic and potential energy respectively, which means that when the spacecraft runs out of fuel, it is as if it crashes to the ground by freefall.

$\bar{r}_C(a) = -\int_{0.5}^{a} g^{-1}(x)\, dx$ is the action-cost term of the reward function (as described in Section 2.2), where $g(x) = \frac{1}{2}(\tanh(x/c) + 1)$ and therefore $\bar{r}_C(a) = c\left[ x \operatorname{arctanh}(1-2x) - \frac{1}{2}\ln(1-x) \right]_{0.5}^{a}$. This means the continuous-time greedy policy is exactly $a_t = g\left( -k_f + \frac{\partial \bar{f}}{\partial a} G_t \right)$, and this ensures $a_t \in (0, 1)$. The derivatives of these model functions are:

$$\frac{\partial \bar{f}}{\partial \vec{x}} = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad \frac{\partial \bar{r}}{\partial \vec{x}} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad \frac{\partial \bar{f}}{\partial a} = \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix}, \quad \frac{\partial \bar{r}}{\partial a} = -k_f - c \operatorname{arctanh}(2a - 1)$$

E.1 Discontinuity Corrections for Continuous-Time Formulations (Clipping)

With a continuous-time model and episodic task, care needs to be taken in calculating gradients at any terminal states or points where the model functions are not smooth. Figure 8 illustrates this complication when the spacecraft reaches a terminal state. This problem means G′ changes discontinuously at the boundary of terminal states.

[Figure 8 omitted. Caption: The lines AB and CD are sections of two trajectories that approach a transition to a region of terminal states (the line h = 0, in this case). If the end A of AB is moved down then the end B will move down. However, if C moves down then D will move left, due to the presence of the barrier. This difference is what we call the problem of Discontinuity Corrections.]

Since the learning algorithms only use $G'_t$ for t < F, the last gradient we need is the limiting one as t → F. This can be calculated by considering the following one-sided limit:

$$\lim_{\Delta t \to 0^+} \left(\frac{\partial R^\pi}{\partial \vec{x}}\right)_{F - \Delta t} = \lim_{h \to 0^+} \left(\frac{\partial R^\pi}{\partial \vec{x}}\right)_{F + h/v} \quad \text{since for small } h,\ \Delta t \approx -h/v$$
$$= \lim_{h \to 0^+} \frac{\partial}{\partial \vec{x}_{F+h/v}} \left( \left( (k_f) a - \bar{r}_C(a) \right) \frac{h}{v} - \left( v - (a - k_g)\frac{h}{v} \right)^2 \right)_{F+h/v}$$
$$= \begin{pmatrix} \frac{(k_f) a_F - \bar{r}_C(a_F)}{v_F} + 2(a_F - k_g) \\ -2 v_F \\ 0 \end{pmatrix}$$

where $a_F = \lim_{t \to F^-} a_t$. Similarly it can be shown that $\lim_{t \to F^-} \left(\frac{\partial R^\pi}{\partial a}\right)_t = 0$, and therefore the boundary condition to use for the target value-gradient is given by

$$\lim_{t \to F^-} G'_t = \begin{pmatrix} \frac{(k_f) a_F - \bar{r}_C(a_F)}{v_F} + 2(a_F - k_g) \\ -2 v_F \\ 0 \end{pmatrix} \tag{36}$$

This limiting target value-gradient is the one to use, instead of the boundary condition given in Eq. 24 or a value-gradient based solely on the final reward impulse. If this issue is ignored then the first component in the above vector would be zero and learning would not find optimal trajectories. A similar correction needs making in the case of the spacecraft running out of fuel.

Also, in the calculation of the trajectory (Eq. 20) by an appropriate numerical method, we think it is best to use clipping in the final time step, so that the spacecraft cannot, for example, finish with h < 0. The use of clipping ensures that the total reward is a smooth function of the weights, and this should aid methods that work by local exploration.

E.2 Lunar-Lander Optimal Trajectories

It is useful to know the optimal trajectories, purely for comparison of the learned solutions. Here we only consider optimal trajectories where there is sufficient fuel to land gently.
An optimal policy is found using Pontryagin's maximum principle: the adjoint vector $\vec{p}(t)$ satisfies the differential equation (DE)

$$\frac{\partial \vec{p}_t}{\partial t} = -\left(\frac{\partial \bar{r}}{\partial \vec{x}}\right)_t - \left(\frac{\partial \bar{f}}{\partial \vec{x}}\right)_t \vec{p}_t$$

and the trajectory state-evolution equation (i.e. the model functions), and the action at each instant is found by $a_t = g\left( -k_f + \frac{\partial \bar{f}}{\partial a} \vec{p}_t \right)$. The trajectory has to be evaluated backwards from the end point at a given $v_F$ and h = 0. The boundary condition for this end point is $\vec{p}_F = \lim_{t \to F^-} G'_t$ (see Eq. 36), with $a_F = g(-k_f - 2v_F)$. Solving the adjoint-vector DE, and substituting into the expression for $a_t$, gives

$$\vec{p}_t = \begin{pmatrix} p^0_F \\ -2v_F + (F - t)\, p^0_F \\ 0 \end{pmatrix} \quad \text{with} \quad p^0_F = \frac{(k_f) a_F - \bar{r}_C(a_F)}{v_F} + 2(a_F - k_g)$$
$$\Rightarrow \quad a_t = g\left( -k_f - 2v_F + (F - t)\, p^0_F \right) \tag{37}$$

Numerical integration of the model functions with Eq. 37 gives the optimal trajectory backwards from the given end point. To find which $v_F$ produces the trajectory that passes through a given point $(h_0, v_0)$ is another problem, which requires solving numerically. Some optimal trajectories found using this method are shown in Figure 9.

[Figure 9 omitted. Caption: Lunar-Lander optimal trajectories, for c = 1 and c = 0.01. Shows height (y-axis) vs. velocity (x-axis). The fuel dimension of state space is omitted. Trajectories start at a diamond and finish at h = 0. As c → 0, trajectories become more fuel-efficient, and the transition between the freefall phase and the braking phase becomes sharper. It is most fuel-efficient to freefall for as long as possible (shown by the upper curved sections) and then to brake as quickly as possible (shown by the lower curved sections).]

Appendix F. Derivation of actor training equation

The actor training equation (Eq. 27) is non-standard.
However, it can be seen to be consistent with the more standard equations, while automatically including exploration, as follows. Barto et al. (1983) use the TD(0) error signal $\delta_t = (r_t + V_{t+1} - V_t)$ to train the actor, and also specify that some stochastic behaviour is required to force exploration. When the domain of actions to choose from is continuous, the simplest technique to force exploration is to add a small amount of random noise $n_t$ at time step t to the action chosen (as done by Doya, 2000), giving modified actions $a'_t = a_t + n_t$. The stochastic real-valued (SRV) unit algorithm (Gullapalli, 1990) is used to train the actor while efficiently compensating for the added noise:

$$\Delta \vec{z} = \alpha\, n_t \left(\frac{\partial \pi}{\partial \vec{z}}\right)_t (r_t + V_{t+1} - V_t)$$

Making a first-order Taylor series approximation to the above equation, by expanding the terms $V_{t+1}$ and $r_t$ about the values they would have had if there were no noise, gives

$$\Delta \vec{z} = \alpha\, n_t \left(\frac{\partial \pi}{\partial \vec{z}}\right)_t \left( r_t + V_{t+1} + n_t \left( \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t G_{t+1} \right) - V_t \right)$$

Integrating with respect to $n_t$ to find the mean weight update, and assuming $n_t \in [-\epsilon, \epsilon]$ is a uniformly distributed random variable over a small range centred on zero, gives

$$\langle \Delta \vec{z} \rangle = \int_{-\epsilon}^{\epsilon} \frac{1}{2\epsilon} \Delta \vec{z}\, dn_t = \alpha \frac{\epsilon^2}{3} \left(\frac{\partial \pi}{\partial \vec{z}}\right)_t \left( \left(\frac{\partial r}{\partial a}\right)_t + \left(\frac{\partial f}{\partial a}\right)_t G_{t+1} \right)$$

which is equivalent to Eq. 27 when summed over t. This justifies the use of Eq. 27 and explains how it automatically incorporates exploration.

References

L. C. Baird. Reinforcement learning in continuous time: Advantage updating. In Proceedings of the International Conference on Neural Networks, Orlando, FL, June 1994.

Leemon C. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pages 30–37, 1995.

A. G. Barto, R. S. Sutton, and C. W. Anderson.
Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:834–846, 1983.

J. Baxter and P. L. Bartlett. Direct gradient-based reinforcement learning (invited). In Proceedings of the International Symposium on Circuits and Systems, pages III–271–274, 2000.

Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

I. N. Bronshtein and K. A. Semendyayev. Handbook of Mathematics, chapter 3.2.2, pages 372–382. Van Nostrand Reinhold Company, 3rd edition, 1985.

Rémi Coulom. Reinforcement Learning Using Neural Networks, with Applications to Motor Control. PhD thesis, Institut National Polytechnique de Grenoble, 2002.

Peter Dayan and Satinder P. Singh. Improving policies without measuring merits. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 1059–1065. The MIT Press, 1996.

Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.

S. E. Fahlman. Faster-learning variations on back-propagation: An empirical study. In Proceedings of the 1988 Connectionist Summer School, pages 38–51, San Mateo, CA, 1988. Morgan Kaufmann.

V. Gullapalli. A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3:671–692, 1990.

D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, New York, NY, 1970.

V. R. Konda and J. N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.

Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery.
Numerical Recipes in C (2nd ed.): The Art of Scientific Computing. Cambridge University Press, New York, NY, USA, 1992. ISBN 0-521-43108-5.

Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proc. of the IEEE Intl. Conf. on Neural Networks, pages 586–591, San Francisco, CA, 1993.

G. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.

Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, USA, 1998.

John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Technical Report LIDS-P-2322, 1996a.

John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996b.

C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.

C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

Paul J. Werbos. Backpropagation through time: What it does and how to do it. In Proceedings of the IEEE, volume 78, no. 10, pages 1550–1560, 1990.

Paul J. Werbos. Stable adaptive control using new critic designs. eprint arXiv:adap-org/9810001, 1998. URL http://xxx.lanl.gov/html/adap-org/9810001.

White and D. Sofge, editors. Handbook of Intelligent Control. Van Nostrand, 1992.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–356, 1992.
