Investigating practical linear temporal difference learning

Adam White
Department of Computer Science
Indiana University
Bloomington, IN 47405, USA
adamw@indiana.edu

Martha White
Department of Computer Science
Indiana University
Bloomington, IN 47405, USA
martha@indiana.edu

ABSTRACT
Off-policy reinforcement learning has many applications including: learning from demonstration, learning multiple goal-seeking policies in parallel, and representing predictive knowledge. Recently there has been a proliferation of new policy-evaluation algorithms that fill a longstanding algorithmic void in reinforcement learning: combining robustness to off-policy sampling, function approximation, linear complexity, and temporal difference (TD) updates. This paper contains two main contributions. First, we derive two new hybrid TD policy-evaluation algorithms, which fill a gap in this collection of algorithms. Second, we perform an empirical comparison to elicit which of these new linear TD methods should be preferred in different situations, and make concrete suggestions about practical use.

Keywords
Reinforcement learning; temporal difference learning; off-policy learning

1. INTRODUCTION
Until recently, using temporal difference (TD) methods to approximate a value function from off-policy samples was potentially unstable without resorting to quadratic (in the number of features) computation and storage, even in the case of linear approximations. Off-policy learning involves learning an estimate of total future reward that we would expect to observe if the agent followed some target policy, while learning from samples generated by a different behavior policy.
This off-policy, policy-evaluation problem, when combined with a policy improvement step, can be used to model many different learning scenarios, such as learning from many policies in parallel [18], learning from demonstrations [1], learning from batch data [8], or simply learning about the optimal policy while following an exploratory policy, as in the case of Q-learning [25]. In this paper, we focus exclusively on the off-policy, policy evaluation problem, commonly referred to as value function approximation or simply the prediction problem. Over the past decade there has been a proliferation of new linear-complexity, policy-evaluation methods designed to be convergent in the off-policy case.

These novel algorithmic contributions have focused on different ways of achieving stable off-policy prediction learning. The first such methods were the gradient TD family of algorithms that perform approximate stochastic gradient descent on the mean squared projected Bellman error (MSPBE). The primary drawback of these methods is the requirement for a second set of learned weights, a second step-size parameter, and potentially high variance updates due to importance sampling. Empirically the results have been mixed, with some results indicating that TD can be superior in on-policy settings [17], and others concluding the exact opposite [4].

Appears in: Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), J. Thangarajah, K. Tuyls, C. Jonker, S. Marsella (eds.), May 9–13, 2016, Singapore. Copyright © 2016, International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

Later, provisional TD (PTD) was introduced [20] to rectify the issue that the bootstrap parameter λ used in gradient TD methods [11] does not correspond well with the same parameter used by conventional TD learning [19]. Specifically, for λ = 1, gradient TD methods do not correspond to any known variant of off-policy Monte Carlo. The PTD algorithm fixes this issue, and in on-policy prediction is exactly equivalent to the conventional TD algorithm. PTD does not use gradient corrections, and is only guaranteed to converge in the tabular off-policy prediction setting. Its empirical performance relative to TD and gradient TD, however, is completely unknown.

Recently Sutton et al. [21] observed that conventional TD does not correct its update based on the notion of a follow-on distribution. This distributional mismatch provides another way to understand the off-policy divergence of conventional off-policy TD. They derive the Emphatic TD (ETD) algorithm that surprisingly achieves convergence [27] without the need for a second set of weights, like those used by gradient TD methods. Like gradient TD methods, however, it seems that this algorithm also suffers from high variance due to importance sampling. Hallak et al. [7] introduced a variant of ETD that utilizes a scaling parameter β, which is meant to reduce the magnitude of the follow-on trace. Comparative empirical studies for ETD and ETD(β), however, have been limited.

The most recent contribution to this line of work explores a mirror-prox approach to minimizing the MSPBE [12, 13, 9]. The main benefit of this work was that it enabled the first finite sample analysis of an off-policy TD-based method with function approximation, and the application of advanced stochastic gradient optimizations. Liu et al.
[9] introduced two mirror-prox TD algorithms, one based on the GTD2 algorithm [17] and the other based on TDC [17] (see footnote 1), and showed that these methods outperform their base counterparts on Baird's counterexample [2], but did not extend their new methods with eligibility traces.

Footnote 1: The GTD2 and TDC algorithms are gradient TD methods that do not use eligibility traces; λ = 0.

A less widely known approach to the off-policy prediction problem is based on algorithms that perform precisely TD updates when the data is sampled on-policy, and corrected gradient-TD style updates when the data is generated off-policy. The idea is to exploit the supposed superior efficiency of TD in on-policy learning, while maintaining robustness in the off-policy case. These "hybrid" TD methods were introduced for state value-function based prediction [11], and state-action value-function based prediction [6], but have not been extended to utilize eligibility traces, nor compared with the recent developments in linear off-policy TD learning (many developed since 2014).

Meanwhile a separate but related thread of algorithmic development has sought to improve the operation of eligibility traces used in both on- and off-policy TD algorithms. This direction is based on another nonequivalence observation: the update performed by the forward view variant of the conventional TD is only equivalent to its backward view update at the end of sampling trajectories. The proposed true-online TD (TO-TD) prediction algorithm [24], and true-online GTD (TO-GTD) prediction algorithm [23] remedy this issue, and have been shown to outperform conventional TD and gradient TD methods respectively on chain domains. The TO-TD algorithm requires only a modest increase in computational complexity over TD; however, the TO-GTD algorithm is significantly more complex to implement and requires three eligibility traces compared to GTD.
Nevertheless, both TO-TD and TO-GTD achieve linear complexity, and can be implemented in a completely incremental way.

Although the asymptotic convergence properties of many of these methods have been rigorously characterized, empirically there is still much we do not understand about this now large collection of methods. A frequent criticism of gradient TD methods, for example, is that they are hard to tune and not well understood empirically. It is somewhat disappointing that perhaps the most famous application of reinforcement learning, learning to play Atari games [15], uses potentially divergent off-policy Q-learning. In addition, we have very little understanding of how these methods compare in terms of learning speed, robustness, and parameter sensitivity. By clarifying some of the empirical properties of these algorithms, we hope to promote more widespread adoption of these theoretically sound and computationally efficient algorithms.

This paper has two primary contributions. First, we introduce a novel extension of hybrid methods to eligibility traces resulting in two new algorithms, HTD(λ) and true-online HTD(λ). Second, we provide an empirical study of TD-based prediction learning with linear function approximation. The conclusions of our experiments are surprisingly clear:
1. GTD(λ) and TO-GTD(λ) should be preferred if robustness to off-policy sampling is required.
2. Between the two, GTD(λ) should be preferred if computation time is at a premium.
3. Otherwise, TO-ETD(λ, β) was clearly the best across our experiments except on Baird's counterexample.

2. BACKGROUND
This paper investigates the problem of estimating the discounted sum of future rewards online and with function approximation.
In the context of reinforcement learning we take online to mean that the agent makes decisions, the environment produces outcomes, and the agent updates its parameters in a continual, real-time interaction stream. We model the agent's interaction as a Markov decision process defined by a countably infinite set of states $\mathcal{S}$, a finite set of actions $\mathcal{A}$, and a scalar discount function $\gamma : \mathcal{S} \to \mathbb{R}$. The agent's observation of the current situation is summarized by the feature vector $x(S_t) \in \mathbb{R}^d$, where $S_t \in \mathcal{S}$ is the current state and $d \ll |\mathcal{S}|$. On each time step $t$, the agent selects an action according to its behavior policy $A_t \sim \mu(S_t, \cdot)$, where $\mu : \mathcal{S} \times \mathcal{A} \to [0, 1]$. The environment then transitions into a new state $S_{t+1} \sim P(S_t, A_t, \cdot)$, and emits a scalar reward $R_{t+1} \in \mathbb{R}$. The agent's objective is to evaluate a fixed target policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, or estimate the expected return for policy $\pi$:

$$v_\pi(s) \stackrel{\text{def}}{=} \mathbb{E}[G_t \mid S_t = s, A_t \sim \pi]$$

for return

$$G_t \stackrel{\text{def}}{=} \sum_{i=0}^{\infty} \Big( \prod_{j=1}^{i} \gamma_{t+j} \Big) R_{t+i+1}, \qquad \gamma_j \stackrel{\text{def}}{=} \gamma(s_j),$$

where $v_\pi(s)$ is called the state-value function for policy $\pi$.

All the methods evaluated in this study perform temporal difference updates, and most utilize eligibility traces. The TD(λ) algorithm is the prototypical example of these concepts and is useful for understanding all the other algorithms discussed in the remainder of this paper. TD(λ) estimates $v_\pi$ as a linear function of the weight vector $w \in \mathbb{R}^d$, where the estimate is formed as an inner product between the weight vector and the features of the current state: $w^\top x(s) \approx v_\pi(s)$. The algorithm maintains a memory trace of recently experienced features, called the eligibility trace $e \in \mathbb{R}^d$, allowing updates to assign credit to previously visited states.
The TD(λ) algorithm requires linear computation and storage, $O(d)$, and can be implemented incrementally as follows:

$$
\begin{aligned}
\delta_t &\leftarrow R_{t+1} + \gamma_{t+1} w_t^\top x(S_{t+1}) - w_t^\top x(S_t) \\
e_t &\leftarrow \lambda_t \gamma_t e_{t-1} + x(S_t) \\
\Delta w &\leftarrow \alpha \delta_t e_t \\
w_{t+1} &\leftarrow w_t + \Delta w.
\end{aligned}
$$

In the case when the data is generated by a behavior policy, $\mu$, with $\pi \neq \mu$, we say that the data is generated off-policy. In the off-policy setting we must estimate $v_\pi$ with samples generated by selecting actions according to $\mu$. This setting can cause the TD(λ) algorithm to diverge. The GTD(λ) algorithm solves the divergence issue by minimizing the MSPBE, resulting in a stochastic gradient descent algorithm that looks similar to TD(λ), with some important differences. GTD(λ) uses importance weights, $\rho_t \stackrel{\text{def}}{=} \frac{\pi(S_t, A_t)}{\mu(S_t, A_t)} \in \mathbb{R}$, in the eligibility trace to reweight the data and obtain an unbiased estimate of $\mathbb{E}[G_t]$. Note that in the policy iteration case (not studied here) it is still reasonable to assume knowledge of $\pi(s, a)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$; for example when $\pi$ is near greedy with respect to the current estimate of the state-action value function. GTD(λ) has an auxiliary set of learned weights, $h \in \mathbb{R}^d$, in addition to the primary weights $w$, which maintain a quasi-stationary estimate of a part of the MSPBE. Like the TD(λ) algorithm, GTD(λ) requires only linear computation and storage and can be implemented fully incrementally as follows:

$$
\begin{aligned}
\delta_t &\leftarrow R_{t+1} + \gamma_{t+1} w_t^\top x(S_{t+1}) - w_t^\top x(S_t) \\
e_t &\leftarrow \rho_t \big( \lambda_t \gamma_t e_{t-1} + x(S_t) \big) && \text{(weighted by } \rho_t \text{)} \\
\Delta w &\leftarrow \alpha \delta_t e_t - \underbrace{\alpha \gamma_{t+1} (1 - \lambda_{t+1}) (e_t^\top h_t)\, x(S_{t+1})}_{\text{correction term}} \\
\Delta h &\leftarrow \alpha_h \big[ \delta_t e_t - (x(S_t)^\top h_t)\, x(S_t) \big] && \text{(auxiliary weights)}
\end{aligned}
$$

The auxiliary weights also make use of a step-size parameter, $\alpha_h$, which is usually not equal to $\alpha$.
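
The GTD(λ) equations above can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gtd_lambda_step`, its argument names, and the use of plain Python lists (chosen only to avoid dependencies) are all hypothetical. Setting `rho = 1`, `alpha_h = 0`, and starting from `h = 0` makes the correction term vanish, so the same function then performs the TD(λ) update.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def gtd_lambda_step(w, h, e, x, x_next, reward, rho,
                    gamma, gamma_next, lam, lam_next, alpha, alpha_h):
    """One GTD(lambda) update; returns the new (w, h, e).

    w, h      : primary and auxiliary weight vectors at time t
    e         : eligibility trace from time t-1
    x, x_next : feature vectors x(S_t) and x(S_{t+1})
    rho       : importance weight pi(S_t, A_t) / mu(S_t, A_t)
    """
    # TD error: R_{t+1} + gamma_{t+1} w^T x(S_{t+1}) - w^T x(S_t)
    delta = reward + gamma_next * dot(w, x_next) - dot(w, x)
    # Importance-weighted trace: rho_t (lambda_t gamma_t e_{t-1} + x(S_t))
    e = [rho * (lam * gamma * ei + xi) for ei, xi in zip(e, x)]
    # Scalar part of the correction term: gamma_{t+1} (1 - lambda_{t+1}) e_t^T h_t
    corr = gamma_next * (1.0 - lam_next) * dot(e, h)
    x_h = dot(x, h)
    w_new = [wi + alpha * (delta * ei - corr * xn)
             for wi, ei, xn in zip(w, e, x_next)]
    h_new = [hi + alpha_h * (delta * ei - x_h * xi)
             for hi, ei, xi in zip(h, e, x)]
    return w_new, h_new, e


# Two illustrative transitions with two-dimensional tabular features.
w, h, e = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
w, h, e = gtd_lambda_step(w, h, e, [1.0, 0.0], [0.0, 1.0], 1.0, 1.0,
                          0.9, 0.9, 0.5, 0.5, 0.1, 0.05)
w, h, e = gtd_lambda_step(w, h, e, [0.0, 1.0], [1.0, 0.0], 0.0, 2.0,
                          0.9, 0.9, 0.5, 0.5, 0.1, 0.05)
```

Each call costs $O(d)$ time and memory, matching the linear complexity stated above.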

Due to space constraints we do not describe the other TD-based linear learning algorithms found in the literature and investigated in our study. We provide each algorithm's pseudocode in the appendix, and in the next section describe two new off-policy, gradient TD methods, before turning to empirical questions.

3. HTD DERIVATION
Conventional temporal difference updating can be more data efficient than gradient temporal difference updating, but the correction term used by gradient-TD methods helps prevent divergence. Previous empirical studies [17] demonstrated situations (specifically on-policy) where linear TD(0) can outperform gradient TD methods, and others [6] demonstrated that Expected Sarsa(0) can outperform multiple variants of the GQ(0) algorithm, even under off-policy sampling. On the other hand, TD(λ) can diverge on small, though somewhat contrived counterexamples.

The idea of hybrid-TD methods is to achieve sample efficiency closer to TD(λ) during on-policy sampling, while ensuring non-divergence under off-policy sampling. To achieve this, a hybrid algorithm could do conventional, uncorrected TD updates when the data is sampled on-policy, and use gradient corrections when the data is sampled off-policy. This approach was pioneered by Maei [11], leading to the derivation of the Hybrid Temporal Difference learning algorithm, or HTD(0). Later, Hackman [6] produced a hybrid version of the GQ(0) algorithm, estimating state-action value functions rather than state-value functions as we do here. In this paper, we derive the first hybrid temporal difference method to make use of eligibility traces, called HTD(λ).

The key idea behind the derivation of HTD learning methods is to modify the gradient of the MSPBE to produce a new learning algorithm. Let $\mathbb{E}_\mu$ represent the expectation according to samples generated under the behavior policy, $\mu$.
The MSPBE [17] can be written as

$$\text{MSPBE}(w) = \underbrace{\mathbb{E}_\mu[\delta_t e_t]}_{-A_\pi w + b_\pi}{}^{\top}\; \underbrace{\mathbb{E}_\mu[x(S_t) x(S_t)^\top]}_{C}{}^{-1}\; \mathbb{E}_\mu[\delta_t e_t],$$

where $e_t = \rho_t(\lambda_t \gamma_t e_{t-1} + x(S_t))$ and

$$
\begin{aligned}
A_\pi &\stackrel{\text{def}}{=} \mathbb{E}_\mu\big[e_t (x(S_t) - \gamma_{t+1} x(S_{t+1}))^\top\big] \qquad (1) \\
&= \sum_{s_t \in \mathcal{S}} d_\mu(s_t) \sum_{a_t \in \mathcal{A}} \underbrace{\mu(s_t, a_t)\rho_t}_{\pi(s_t, a_t)} \big(\gamma_t \lambda\, \mathbb{E}_\mu[e_{t-1} \mid s_t] + x(s_t)\big) \sum_{s_{t+1} \in \mathcal{S}} P(s_t, a_t, s_{t+1}) \big(x(s_t) - \gamma_{t+1} x(s_{t+1})\big)^\top \\
b_\pi &\stackrel{\text{def}}{=} \mathbb{E}_\mu[R_{t+1} e_t] \\
&= \sum_{s_t \in \mathcal{S}} d_\mu(s_t) \sum_{a_t \in \mathcal{A}} \pi(s_t, a_t) \big(\gamma_t \lambda\, \mathbb{E}_\mu[e_{t-1} \mid s_t] + x(s_t)\big) \sum_{s_{t+1} \in \mathcal{S}} P(s_t, a_t, s_{t+1})\, r_{t+1}.
\end{aligned}
$$

Therefore, the relative importance given to states in the MSPBE is weighted by the stationary distribution of the behavior policy, $d_\mu : \mathcal{S} \to \mathbb{R}$ (since it is generating samples), but the transitions are reweighted to reflect the returns that $\pi$ would produce.

The gradient of the MSPBE is:

$$-\tfrac{1}{2} \nabla_w \text{MSPBE}(w) = A_\pi^\top C^{-1} (-A_\pi w + b_\pi). \qquad (2)$$

Assuming $A_\pi$ is non-singular, we get the TD fixed-point solution:

$$0 = -\tfrac{1}{2} \nabla_w \text{MSPBE}(w) \implies -A_\pi w + b_\pi = 0. \qquad (3)$$

The value of $w$ for which (3) is zero is the solution found by linear TD(λ) and LSTD(λ) where $\pi = \mu$. The gradient of the MSPBE yields an incremental learning rule with the following general form (see [3]):

$$w_{t+1} \leftarrow w_t + \alpha (M w_t + b), \qquad (4)$$

where $M = -A_\pi^\top C^{-1} A_\pi$ and $b = A_\pi^\top C^{-1} b_\pi$. The update rule, in the case of TD(λ), will yield stable convergence if $A_\pi$ is positive definite (as shown by Tsitsiklis and van Roy [22]). In off-policy learning, we require $A_\pi^\top C^{-1} A_\pi$ to be positive definite to satisfy the conditions of the ordinary differential equation proof of convergence [10], which holds because $C^{-1}$ is positive definite and therefore $A_\pi^\top C^{-1} A_\pi$ is positive definite, because $A_\pi$ is full rank (true by assumption). See Sutton et al.
[21] for a nice discussion on why the $A_\pi$ matrix must be positive definite to ensure stable, non-divergent iterations.

The $C$ matrix in Equation (2) can be replaced by any positive definite matrix and the fixed point will be unaffected, but the rate of convergence will almost surely change. Instead of following the usual recipe for deriving GTD, let us try replacing $C^{-1}$ with $A_\mu^{-\top}$, where

$$A_\mu^{\top} \stackrel{\text{def}}{=} \mathbb{E}_\mu\big[(x(S_t) - \gamma_{t+1} x(S_{t+1}))\, e_t^{\mu\top}\big],$$

and $e^\mu$ is the regular on-policy trace for the behavior policy (i.e., no importance weights)

$$e^\mu_t = \gamma_t \lambda e^\mu_{t-1} + x(S_t).$$

The matrix $A_\mu^{-\top}$ is a positive definite matrix (proved by Tsitsiklis and van Roy [22]). Plugging $A_\mu^{-\top}$ into (2) results in the following expected update:

$$
\begin{aligned}
\tfrac{1}{\alpha} \mathbb{E}[\Delta w_t] &= A_\pi^\top A_\mu^{-\top} (-A_\pi w_t + b_\pi) \\
&= (A_\mu^\top - A_\mu^\top + A_\pi^\top) A_\mu^{-\top} (-A_\pi w_t + b_\pi) \\
&= (A_\mu^\top A_\mu^{-\top})(-A_\pi w_t + b_\pi) + (A_\pi^\top - A_\mu^\top) A_\mu^{-\top} (-A_\pi w_t + b_\pi) \\
&= (-A_\pi w_t + b_\pi) + (A_\pi^\top - A_\mu^\top) A_\mu^{-\top} (-A_\pi w_t + b_\pi) \\
&= (-A_\pi w_t + b_\pi) + \mathbb{E}_\mu\big[(x(S_t) - \gamma_{t+1} x(S_{t+1}))(e_t - e^\mu_t)^\top\big] A_\mu^{-\top} (-A_\pi w_t + b_\pi). \qquad (5)
\end{aligned}
$$

As in the derivation of GTD(λ) [11], let the vector $h_t$ form a quasi-stationary estimate of the final term, $A_\mu^{-\top}(-A_\pi w_t + b_\pi)$. Getting back to the primary weight update, we can sample the first term using the fact that $(-A_\pi w_t + b_\pi) = \mathbb{E}_\mu[\delta_t e_t]$ (see [11]) and use (1) to get the final stochastic update

$$\Delta w_t \leftarrow \alpha \big[ \delta_t e_t + (x_t - \gamma_{t+1} x_{t+1})(e_t - e^\mu_t)^\top h_t \big]. \qquad (6)$$

Notice that when the data is generated on-policy ($\pi = \mu$), $e_t = e^\mu_t$, and thus the correction term disappears and we are left with precisely linear TD(λ). When $\pi \neq \mu$, the TD update is corrected as in GTD: unsurprisingly, the correction is slightly different but has the same basic form.

To complete the derivation, we must derive an incremental update rule for $h_t$.
We have a linear system, because

$$h_t = A_\mu^{-\top} (-A_\pi w_t + b_\pi) \implies A_\mu^\top h_t = -A_\pi w_t + b_\pi.$$

Following the general expected update in (4),

$$h_{t+1} \leftarrow h_t + \alpha_h \big[ (-A_\pi w_t + b_\pi) - A_\mu^\top h_t \big], \qquad (7)$$

which converges if $A_\mu^\top$ is positive definite for any fixed $w_t$ and $\alpha_h$ is chosen appropriately (see Sutton et al.'s recent paper [21] for an extensive discussion of convergence in expectation). To sample this update, recall

$$A_\mu^\top h_t = \mathbb{E}_\mu\big[(x(S_t) - \gamma_{t+1} x(S_{t+1}))\, e_t^{\mu\top}\big] h_t,$$

giving the stochastic update rule for $h_t$:

$$\Delta h_t \leftarrow \alpha_h \big[ \delta_t e_t - (x_t - \gamma_{t+1} x_{t+1})\, e_t^{\mu\top} h_t \big].$$

As in GTD, $\alpha \in \mathbb{R}$ and $\alpha_h \in \mathbb{R}$ are step-size parameters, and $\delta_t \stackrel{\text{def}}{=} R_{t+1} + \gamma_{t+1} w_t^\top x_{t+1} - w_t^\top x_t$. This hybrid-TD algorithm should converge under off-policy sampling using a proof technique similar to the one used for GQ(λ) (see Maei & Sutton's proof [10]), but we leave this to future work. The HTD(λ) algorithm is completely specified by the following equations:

$$
\begin{aligned}
e_t &\leftarrow \rho_t (\lambda_t \gamma_t e_{t-1} + x_t) \\
e^\mu_t &\leftarrow \lambda_t \gamma_t e^\mu_{t-1} + x_t \\
\Delta w_t &\leftarrow \alpha \big[ \delta_t e_t + (\gamma_{t+1} x_{t+1} - x_t)(e^\mu_t - e_t)^\top h_t \big] \\
\Delta h_t &\leftarrow \alpha_h \big[ \delta_t e_t + (\gamma_{t+1} x_{t+1} - x_t)\, e_t^{\mu\top} h_t \big]
\end{aligned}
$$

This algorithm can be made more efficient by exploiting the common terms in $\Delta w_t$ and $\Delta h_t$, as shown in the appendix.

4. TRUE ONLINE HTD
Recently, a new forward-backward view equivalence has been proposed for online TD methods, resulting in true-online TD [24] and true-online GTD [23] algorithms. The original forward-backward equivalence was for offline TD(λ) (see footnote 2).

Footnote 2: The idea of defining a forward view objective and then converting this computationally impractical forward-view into an efficiently implementable algorithm using traces is extensively treated in Sutton and Barto's introductory text [16].

To derive a forward-backward equivalence under online updating, a new truncated return was proposed, which uses
the online weight vector that changes into the future,

$$G^{\lambda,\rho}_{k,t} \stackrel{\text{def}}{=} \rho_k \big( R_{k+1} + \gamma_{k+1} \big[ (1 - \lambda_{k+1})\, x_{k+1}^\top w_k + \lambda_{k+1} G^{\lambda,\rho}_{k+1,t} \big] \big),$$

with $G^{\lambda,\rho}_{t,t} \stackrel{\text{def}}{=} \rho_t x_t^\top w_{t-1}$. A forward-view algorithm can be defined that computes $w_k$ online assuming access to future samples, and then an exactly equivalent incremental backward-view algorithm can be derived that does not require access to future samples. This framework was used to derive the TO-TD algorithm for the on-policy setting, and TO-GTD for the more general off-policy setting. This new true-online equivalence is not only interesting theoretically, but also translates into improved prediction and control performance [24, 23].

In this section, we derive a true-online variant of HTD(λ). When used on-policy HTD(λ) behaves similarly to TO-TD(λ). Our goal in this section is to combine the benefits of both hybrid learning and true-online traces in a single algorithm. We proceed with a similar derivation to TO-GTD(λ) [23, Theorem 4], with the main difference appearing in the update of the auxiliary weights. Notice that the primary weights $w$ and the auxiliary weights $h$ of HTD(λ) have a similar structure. Recall from (5) that the modified gradient of the MSPBE, or expected primary-weight update, can be written as:

$$\tfrac{1}{\alpha} \mathbb{E}[\Delta w_t] = (-A_\pi w_t + b_\pi) + \mathbb{E}_\mu\big[(x(S_t) - \gamma_{t+1} x(S_{t+1}))(e_t - e^\mu_t)^\top\big] h_t.$$

Similarly, we can rewrite the expected update of the auxiliary weights by plugging $A_\mu^\top$ into (7):

$$\tfrac{1}{\alpha_h} \mathbb{E}[\Delta h_t] = (-A_\pi w_t + b_\pi) - \mathbb{E}_\mu\big[(x(S_t) - \gamma_{t+1} x(S_{t+1}))\, e_t^{\mu\top}\big] h_t.$$

As in the derivation of TO-GTD [23, Equations 17, 18], for TO-HTD we will sample the second part of the update using a backward-view and obtain forward-view samples for $(-A_\pi w_t + b_\pi)$.
The resulting TO-HTD(λ) algorithm is completely specified by the following equations:

$$
\begin{aligned}
e_t &\leftarrow \rho_t (\lambda_t \gamma_t e_{t-1} + x_t) \\
e^\mu_t &\leftarrow \lambda_t \gamma_t e^\mu_{t-1} + x_t \\
e^o_t &\leftarrow \rho_t \big( \lambda_t \gamma_t e^o_{t-1} + \alpha_t (1 - \rho_t \gamma_t \lambda_t x_t^\top e^o_{t-1})\, x_t \big) \\
d &= \delta_t e^o_t + (e^o_t - \alpha_t \rho_t x_t)(w_t - w_{t-1})^\top x_t \qquad (8) \\
w_{t+1} &\leftarrow w_t + d + \alpha_t (\gamma_{t+1} x_{t+1} - x_t)(e^\mu_t - e_t)^\top h_t \\
h_{t+1} &\leftarrow h_t + d + \alpha_h (\gamma_{t+1} x_{t+1} - x_t)\, e_t^{\mu\top} h_t
\end{aligned}
$$

In order to prove that this is a true-online update, we use the constructive theorem due to van Hasselt et al. [23].

Theorem 1 (True-online HTD(λ)). For any $t$, the weight vectors $w^t_t$, $h^t_t$ as defined by the forward view

$$
\begin{aligned}
w^t_{k+1} &= w^t_k + \alpha_k \big( G^{\lambda,\rho}_{k,t} - \rho_k x_t^\top w^t_k \big) x_k + \alpha_k (x_t - \gamma_{t+1} x_{t+1})(e_t - e^\mu_t)^\top h^t_k \\
h^t_{k+1} &= h^t_k + \alpha_{h,k} \big( G^{\lambda,\rho}_{k,t} - \rho_k x_t^\top w^t_k \big) x_k + \alpha_{h,k} (x_t - \gamma_{t+1} x_{t+1})\, e_t^{\mu\top} h^t_k
\end{aligned}
$$

are equal to $w_t$, $h_t$ as defined by the backward view in (8).

Proof. We apply [23, Theorem 1]. The substitutions are

$$
\begin{aligned}
\eta_t &= \rho_t \alpha_t \\
g_{w,k} &= \alpha_k (x_k - \gamma_{k+1} x_{k+1})(e_k - e^\mu_k)^\top h_k \\
g_{h,k} &= \alpha_{h,k} (x_k - \gamma_{k+1} x_{k+1})\, e_k^{\mu\top} h_k \\
Y^t_t &= w_{t-1}^\top x_t \\
Y^t_k &= R_{k+1} + \gamma_{k+1} (1 - \lambda_{k+1} \rho_{k+1})\, w_k^\top x_{k+1} + \gamma_{k+1} \lambda_{k+1} G^{\lambda,\rho}_{k+1,t}
\end{aligned}
$$

where $g_{w,k}$ is called $x_k$ in van Hasselt's Theorem 1 [23]. The proof then follows through in the same way as in van Hasselt's Theorem 4 [23], where we apply Theorem 1 to $w$ and $h$ separately.

Our TO-HTD(0) algorithm is equivalent to HTD(0), but TO-HTD(λ) is not equivalent to TO-TD(λ) under on-policy sampling. To achieve the latter equivalence, replace $\delta_t \stackrel{\text{def}}{=} R_{t+1} + \gamma_{t+1} w_t^\top x_{t+1} - w_{t-1}^\top x_t$ and $d \stackrel{\text{def}}{=} \delta_t e^o_t - \alpha_t \rho_t x_t (w_t - w_{t-1})^\top x_t$. We opted for the first equivalence for two reasons. In preliminary experiments, TO-HTD(λ) described in Equation (8) already exhibited similar performance compared to TO-TD(λ), and so designing for the second equivalence was unnecessary.
Further, TO-GTD(λ) was derived to ensure equivalence between TO-GTD(0) and GTD(0); this choice, therefore, better parallels that equivalence.

Given our two new hybrid methods, and the long list of existing linear prediction algorithms, we now focus on how these algorithms perform in practice.

5. EXPERIMENTAL STUDY
Our empirical study focused on three main aspects: (1) early learning performance with different feature representations, (2) parameter sensitivity, and (3) efficacy in on- and off-policy learning. The majority of our experiments were conducted on random MDPs (variants of those used in previous studies [14, 5]). Each random MDP contains 30 states, and three actions in each state. From each state, and for each action, the agent can transition to one of four next states, assigned randomly from the entire set without replacement. Transition probabilities for each MDP instance are randomly sampled from [0, 1] and the transitions were normalized to sum to one. The expected reward for each transition is also generated randomly in [0, 1] and the reward on each transition was sampled without noise. Two transitions are randomly selected to terminate: $\gamma(s_i, s_j) = 0$ for $i \neq j$. Each problem instance is held fixed during learning.

We experimented with three different feature representations. The first is a tabular representation where each state is represented with a binary vector with a single one corresponding to the current state index. This encoding allows perfect representation of the value function with no generalization over states. The second representation is computed by taking the tabular representation and aliasing five states to all have the same feature vector, so the agent cannot differentiate these states. These five states were selected randomly without replacement for each MDP instance.
The third representation is a dense binary encoding where the feature vector for each state is the binary encoding of the state index, and thus the feature vector for a 30 state MDP requires just five components. Although the binary representation appears to exhibit an inappropriate amount of generalization, we believe it to be more realistic than a tabular representation, because access to MDP state is rare in real-world domains (e.g., a robot with continuous sensor values). The binary representation should be viewed as an approximation to the poor, and relatively low-dimensional (compared to the number of states in the world) representations common in real applications. All feature encodings were normalized. Experiments conducted with the binary representation use γ = 0.99, and the rest use γ = 0.9.

To generate policies with purposeful behavior, we forced the agent to favor a single action in each state. The target policy is generated by randomly selecting an action and assigning it probability 0.9 (i.e., $\pi(s, a_i) = 0.9$) in each state, and then assigning the remaining actions the remaining probability evenly. In the off-policy experiments the behavior policy is modified to be slightly different than the target policy, by selecting the same base action, but instead assigning a probability of 0.8 (i.e., $\mu(s, a_i) = 0.8$). This choice ensures that the policies are related, but guarantees that $\rho_t$ is never greater than 1.5, thus avoiding inappropriately large variance due to importance sampling (see footnote 3).

Our experiment compared 12 different linear complexity value function learning algorithms, including: GTD(λ), HTD(λ), true-online GTD(λ), true-online HTD(λ), true-online ETD(λ), true-online ETD(λ, β), PTD(λ), GTD2-mp(λ), TDC-mp(λ), linear off-policy TD(0), TD(λ), and true-online TD(λ).
The latter two are only applicable in on-policy domains, and the two mirror-prox methods are straightforward extensions (described in the appendix) of the GTD2-mp and TDC-mp methods [13] to handle traces (λ > 0). We drop the λ designation of each method in the figure labels to reduce clutter.

Our results were generated by performing a large parameter sweep, averaged over many independent runs, for each random MDP instance, and then averaging the results over the entire set of MDPs. We tested 14 different values of the step-size parameter $\alpha \in \{0.1 \times 2^j \mid j = -8, -7, \ldots, 6\}$, seven values of $\eta \in \{2^j \mid j = -4, -2, -1, 0, 1, 2, 4\}$ ($\alpha_h \stackrel{\text{def}}{=} \alpha\eta$), and 20 values of $\lambda \in \{0, 0.1, \ldots, 0.9, 0.91, \ldots, 1.0\}$. We intentionally precluded smaller values of $\alpha_h$ from the parameter sweep because many of the gradient TD methods simply become their on-policy variants as $\alpha_h$ approaches zero, whereas in some off-policy domains values of $\alpha_h > \alpha$ are required to avoid divergence [26]. We believe this range of η fairly reflects how the algorithms would be used in practice if avoiding divergence was a priority. The β parameter of TO-ETD(λ, β) was set equal to $0.5\gamma_t$.

Each algorithm instance, defined by one combination of α, η, and λ, was evaluated using the mean absolute value error on each time step,

$$\epsilon_t \stackrel{\text{def}}{=} \sum_{s \in \mathcal{S}} d_\mu(s) \left| \frac{x(s)^\top w_t - V^*(s)}{V^*(s)} \right|,$$

averaged over 30 MDPs, each with 100 runs. Here $V^* : \mathcal{S} \to \mathbb{R}$ denotes the true state-value function, which can be easily computed with access to the parameters of the MDP.
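
To make the error measure concrete, the sketch below computes $V^*$, $d_\mu$, and the mean absolute value error on a toy problem. It is an illustration under assumptions rather than the procedure used in the study: the helper names, the two-state MDP, the constant discount, and the use of repeated sweeps (iterative policy evaluation and power iteration) in place of a direct linear solve are all choices made here for a dependency-free example.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def true_values(P, R, pi, gamma, sweeps=2000):
    """Iterative policy evaluation of the target policy pi.

    P[s][a][s2] are transition probabilities, R[s][a][s2] are rewards,
    pi[s][a] are action probabilities, gamma is a constant discount.
    """
    n = len(P)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [sum(pi[s][a] * P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                 for a in range(len(pi[s])) for s2 in range(n))
             for s in range(n)]
    return V


def stationary_distribution(P, mu, sweeps=2000):
    """Power iteration for the state distribution d_mu of the behavior policy."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(sweeps):
        d = [sum(d[s] * mu[s][a] * P[s][a][s2]
                 for s in range(n) for a in range(len(mu[s])))
             for s2 in range(n)]
    return d


def mean_abs_value_error(w, features, V, d_mu):
    """sum_s d_mu(s) * |(x(s)^T w - V*(s)) / V*(s)|"""
    return sum(d * abs((dot(x, w) - v) / v)
               for d, x, v in zip(d_mu, features, V))


# Toy 2-state, 2-action MDP: action 0 stays put, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[[1.0, 1.0], [1.0, 1.0]],
     [[1.0, 1.0], [1.0, 1.0]]]          # reward 1 on every transition
pi = [[0.9, 0.1], [0.9, 0.1]]           # target policy
mu = [[0.8, 0.2], [0.8, 0.2]]           # behavior policy
features = [[1.0, 0.0], [0.0, 1.0]]     # tabular representation

V_star = true_values(P, R, pi, gamma=0.9)
d_mu = stationary_distribution(P, mu)
error = mean_abs_value_error([9.0, 11.0], features, V_star, d_mu)
```

Because the measure divides by $V^*(s)$, it is only usable when the true values are nonzero.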

The graphs in Figures 1 and 2 include (a) learning curves with α, η, and λ selected to minimize the mean absolute value error, for each of the three different feature representations, and (b) parameter sensitivity graphs for α, η, and λ, in which the mean absolute value error is plotted against the parameter value, while the remaining two parameters are selected to minimize mean absolute value error. These graphs are included across feature representations, for on- and off-policy learning. Across all results the parameters are selected to optimize performance over the last half of the experiment to ensure stable learning throughout the run.

Footnote 3: See the recent study by Mahmood & Sutton [14] for an extensive treatment of off-policy learning domains with large variance due to importance sampling.

To analyze large variance due to importance sampling and off-policy learning we also investigated Baird's counterexample [2], a simple MDP that causes TD learning to diverge. This seven state MDP uses a target policy that is very different from the behavior policy, and a feature representation that allows perfect representation of the value function, but also causes inappropriate generalization. We used the variant of this problem described by Maei [11] and White [26, Figure 7.1]. We present results with the root mean squared error (see footnote 4),

$$\epsilon_t \stackrel{\text{def}}{=} \sqrt{\sum_{s \in \mathcal{S}} d_\mu(s) \big( x(s)^\top w_t - V^*(s) \big)^2},$$

in Figure 1. The experiment was conducted in the same way as the random MDPs, except we did not average over MDPs (there is only one) and we used different parameter ranges. We tested 11 different values of the step-size parameter $\alpha \in \{0.1 \times 2^j \mid j = -10, -9, \ldots, -1, 0\}$, 12 values of $\eta \in \{2^j \mid j = -16, -8, \ldots, -2, -1, 0, 1, 2, \ldots, 32\}$ ($\alpha_h \stackrel{\text{def}}{=} \alpha\eta$), and the same 20 values of λ.
We did not evaluate TD(0) on this domain because the algorithm will diverge and that has been shown many times before.

In addition to performance results in Figures 1 and 2, Table 1 summarizes the runtime comparison for these algorithms. Though the algorithms are all linear in storage and computation, they do differ in both implementation and runtime, particularly due to true-online traces. The appendix contains several plots of runtime versus value error illustrating the trade-off between computation and sample complexity for each algorithm. Due to space constraints, we have included the aliased tabular representation results for on-policy learning in the appendix, since they are similar to the tabular representation results in on-policy learning.

6. DISCUSSION
There are three broad conclusions suggested by our results. First, we could not clearly demonstrate the supposed superiority of TD(λ) over gradient TD methods in the on-policy setting. In both tabular and aliased feature settings GTD(λ) achieved faster learning and superior parameter sensitivity compared to TD, PTD, and HTD. Notably, the η-sensitivity of the GTD algorithm was very reasonable in both domains; however, large values of η were required to achieve good performance on Baird's for both GTD(λ) and TO-GTD(λ). Our on-policy experiments with binary features did indicate a slight advantage for TD(λ), PTD, and HTD, and that PTD and HTD exhibit zero sensitivity to the choice of $\alpha_h$ as expected. In off-policy learning there is little difference between GTD(λ) and PTD and HTD. Our results combined with the prior work of Dann et al. [4] suggest that the advantage of conventional TD(λ) over gradient TD methods, in on-policy learning, is limited to specific domains.

Footnote 4: In this counterexample the mean absolute value error is not appropriate because the optimal values for this task are zero.
The MSPBE is often used as a performance measure, but the MSPBE changes with λ; for completeness, we include results with the MSPBE in the appendix.

Our second conclusion is that the new mirror prox methods achieved poor performance in most settings except Baird's counterexample. Both GTD2-mp and TDC-mp achieved the best performance in Baird's counterexample. We hypothesize that the two-step gradient computation more effectively uses the transition to state 7, and so is ideally suited to the structure of the domain (footnote 5). However, the GTD2-mp method performed worse than off-policy TD(0) in all off-policy random MDP domains, while the learning curves of TDC-mp exhibited higher variance than other methods in all but the on-policy binary case and high parameter sensitivity across all settings except Baird's. This does not seem to be a consequence of the extension to eligibility traces because in all cases except Baird's, both TDC-mp and GTD2-mp performed best with λ > 0. Like GTD and HTD, the mirror prox methods would likely have performed better with values of α_h > α; however, this is undesirable because larger α_h is required to ensure good performance in some off-policy domains, such as Baird's (e.g., $\eta = 2^8$).

Third and finally, several methods exhibited non-convergent behavior on Baird's counterexample. All methods that exhibited reliable error reduction in Baird's did so with λ near zero, suggesting that eligibility traces are of limited use in these more extreme off-policy domains. In the case of PTD, non-convergent behavior is not surprising since our implementation of this algorithm does not include gradient correction, a possible extension suggested by the authors [20], and thus is only guaranteed to converge under off-policy sampling in the tabular case.
For the emphatic TD methods the performance on Baird's remains a concern, especially considering how well TO-ETD(λ, β) performed in all our other experiments. The addition of the β parameter appears to significantly improve TO-ETD in the off-policy domain with binary features, but could not mitigate the large variance in ρ produced by the counterexample. It is not clear if this bad behavior is inherent to emphatic TD methods (footnote 6), or could be solved by more careful specification of the state-based interest function. In our implementation, we followed the original authors' recommendation of setting the interest for each state to 1.0 [21], because all our domains were discounted and continuing. Additionally, both HTD(λ) and TO-HTD(λ) did not diverge on Baird's, but performance was less than satisfactory to say the least.

Overall, the conclusions implied by our empirical study are surprisingly clear. If guarding against large variance due to off-policy sampling is a chief concern, then GTD(λ) and TO-GTD(λ) should be preferred. Between the two, GTD(λ) should be preferred if computation is at a premium. If poor performance in problems like Baird's is not a concern, then TO-ETD(λ, β) was clearly the best across our experiments, and exhibited nearly the best runtime results. TO-ETD(λ), on the other hand, exhibited high variance in off-policy domains and sharp parameter sensitivity, indicating parameter tuning of emphatic methods may be an issue in practice.

7. APPENDIX

Additional results and analysis can be found in the full version of the paper:

Footnote 5: Baird's counterexample uses a specific initialization of the primary weights: far from one of the true solutions w = 0.

Footnote 6: The variance of TO-ETD has been examined before in two state domains [21]. ETD is thought to have higher variance than other TD algorithms due to the emphasis weighting.
Algorithm    On-policy   Off-policy
TD(0)        120.0       108.3
TD(λ)        132.7       -
TO-TD        150.1       -
PTD          172.4       158.7
GTD          204.6       175.2
TO-ETD       287.8       249.65
TO-ETD(β)    286.0       254.7
HTD          311.8       267.5
TO-GTD       366.2       316.2
GTD-MP       467.4       407.8
TDC-MP       466.2       395.7
TO-HTD       466.0       403.3

Table 1: Average runtime in microseconds for 500 steps of learning, averaged over 30 MDPs, with 100 runs each, with 30-dimensional tabular features.

[Figure 1 plots. Columns: tabular features, aliased tabular features, binary features. Rows: learning curves (error versus steps), then sensitivity to α (α = 0.1 × 2^j), η (α_h = ηα), and λ. Methods shown: TO-ETD, TO-ETD(β), TD(0), GTD2-mp, TDC-mp, GTD, TO-GTD, PTD, HTD, TO-HTD.]

Figure 1: Off-policy performance on random MDPs with three different representations. All plots report mean absolute value error averaged over 100 runs and 30 MDPs. The plots are organized in columns left to right corresponding to tabular, aliased, and binary features. The plots are organized in rows from top to bottom corresponding to learning curves, α, η, and λ sensitivity. The error bars are standard errors ($s/\sqrt{n}$) computed from 100 independent runs.

[Figure 2 plots. Columns: tabular features, binary features, Baird's counterexample. Rows: learning curves (error versus steps), then sensitivity to α (α = 0.1 × 2^j), η (α_h = ηα), and λ. Methods shown: TO-ETD, TO-ETD(β), PTD / HTD / TD(λ), TO-HTD / TO-TD, TD(0), GTD2-mp, TDC-mp, GTD, TO-GTD.]

Figure 2: On-policy performance on random MDPs with two different representations and off-policy performance on Baird's counterexample. All plots report mean absolute value error averaged over 100 runs and 30 random MDPs, and 500 runs for Baird's. The plots are organized in columns left to right corresponding to results on random MDPs with tabular and binary features, and results on Baird's counterexample. The plots are also organized in rows from top to bottom corresponding to learning curves, α, η, and λ sensitivity.

REFERENCES

[1] B. Argall, S. Chernova, M. M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 2009.
[2] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, 1995.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific Press, 1996.
[4] C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: a survey and comparison. The Journal of Machine Learning Research, 2014.
[5] M. Geist and B. Scherrer. Off-policy learning with eligibility traces: a survey. The Journal of Machine Learning Research, 2014.
[6] L. Hackman. Faster Gradient-TD Algorithms. PhD thesis, University of Alberta, 2012.
[7] A. Hallak, A. Tamar, R. Munos, and S. Mannor. Generalized emphatic temporal difference learning: Bias-variance analysis. arXiv preprint arXiv:1509.05172, 2015.
[8] L.-J. Lin. Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching. Machine Learning, 1992.
[9] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-Sample Analysis of Proximal Gradient TD Algorithms. Conference on Uncertainty in Artificial Intelligence, 2015.
[10] H. Maei and R. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In AGI, 2010.
[11] H. R. Maei. Gradient temporal-difference learning algorithms. University of Alberta, 2011.
[12] S. Mahadevan and B. Liu. Sparse Q-learning with mirror descent. In Conference on Uncertainty in Artificial Intelligence, 2012.
[13] S. Mahadevan, B. Liu, P. S. Thomas, W. Dabney, S. Giguere, N. Jacek, I. Gemp, and J. Liu. Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. CoRR abs/1405.6757, 2014.
[14] A. R. Mahmood and R. Sutton. Off-policy learning based on weighted importance sampling with linear computational complexity. In Conference on Uncertainty in Artificial Intelligence, 2015.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[16] R. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[17] R. Sutton, H. Maei, D. Precup, and S. Bhatnagar. Fast gradient-descent methods for temporal-difference learning with linear function approximation. International Conference on Machine Learning, 2009.
[18] R. Sutton, J. Modayil, M. Delp, T. Degris, P. Pilarski, A. White, and D. Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, 2011.
[19] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
[20] R. S. Sutton, A. R. Mahmood, D. Precup, and H. van Hasselt. A new Q(lambda) with interim forward view and Monte Carlo equivalence. ICML, 2014.
[21] R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 2015.
[22] J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.
[23] H. van Hasselt, A. R. Mahmood, and R. Sutton. Off-policy TD(λ) with a true online equivalence. In Conference on Uncertainty in Artificial Intelligence, 2014.
[24] H. van Seijen and R. Sutton. True online TD(lambda). In International Conference on Machine Learning, 2014.
[25] C. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, 1989.
[26] A. White. Developing a predictive approach to knowledge. PhD thesis, University of Alberta, 2015.
[27] H. Yu. On convergence of emphatic temporal-difference learning. In Annual Conference on Learning Theory, 2015.

APPENDIX

A. ALGORITHMS

The original ETD(λ) algorithm as proposed by Sutton et al. (2015) is a not entirely obvious manipulation of the true online ETD(λ) described above and used in our experiments. The difference is in the definition of the eligibility trace and the primary weight update.
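To achieve the original ETD(λ) algorithm, modify the above true-online ETD(λ) algorithm to use $e_t \leftarrow \rho_t(\gamma_t \lambda e_{t-1} + M_t x_t)$ and $\Delta w \leftarrow \alpha \delta_t e_t$.

In all the algorithms that follow, we assume $w_0$, $h_0$ are initialized arbitrarily, and eligibility traces are initialized to a vector of zeros (e.g., $e_{-1} = 0$).

TD(λ)

$$\begin{aligned}
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} x_{t+1}^\top w_t - x_t^\top w_t \\
e_t &\leftarrow \lambda_t \gamma_t e_{t-1} + x_t \\
\Delta w &\leftarrow \alpha \delta_t e_t
\end{aligned}$$

True-online TD(λ)

$$\begin{aligned}
v' &\overset{\text{def}}{=} w_t^\top x_{t+1} \\
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} v' - v \\
e_t &\leftarrow \gamma_t \lambda_t e_{t-1} + \alpha \left[ 1 - \gamma_t \lambda_t e_{t-1}^\top x_t \right] x_t \\
\Delta w &\leftarrow \delta_t e_t + \alpha \left[ v - w_t^\top x_t \right] x_t \\
v &\leftarrow v' \qquad \triangleright\ v \text{ initialized to } 0
\end{aligned}$$

GTD(λ)

$$\begin{aligned}
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} x_{t+1}^\top w_t - x_t^\top w_t \\
e_t &\leftarrow \rho_t (\lambda_t \gamma_t e_{t-1} + x_t) \\
\Delta w &\leftarrow \alpha \left[ \delta_t e_t - \gamma_{t+1} (1 - \lambda_{t+1}) (e_t^\top h_t) x_{t+1} \right] \\
\Delta h &\leftarrow \alpha_h \left[ \delta_t e_t - (x_t^\top h_t) x_t \right]
\end{aligned}$$

True-online GTD(λ)

$$\begin{aligned}
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} x_{t+1}^\top w_t - x_t^\top w_t \\
e_t &\leftarrow \rho_t \left[ \lambda_t \gamma_t e_{t-1} + \alpha_t \left( 1 - \rho_t \gamma_t \lambda_t (x_t^\top e_{t-1}) \right) x_t \right] \\
e^\mu_t &\leftarrow \rho_t (\lambda_t \gamma_t e^\mu_{t-1} + x_t) \\
e^h_t &\leftarrow \rho_{t-1} \lambda_t \gamma_t e^h_{t-1} + \alpha_h \left( 1 - \rho_{t-1} \gamma_t \lambda_t (x_t^\top e^h_{t-1}) \right) x_t \\
d &\overset{\text{def}}{=} \delta_t e_t + (e_t - \alpha \rho_t x_t)(w_t - w_{t-1})^\top x_t \\
\Delta w_t &\leftarrow d - \alpha \gamma_{t+1} (1 - \lambda_{t+1}) (h_t^\top e^\mu_t) x_{t+1} \\
\Delta h_t &\leftarrow \rho_t \delta_t e^h_t - \alpha_h (x_t^\top h_t) x_t
\end{aligned}$$

PTD(λ)

$$\begin{aligned}
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} x_{t+1}^\top w_t - x_t^\top w_t \\
e_t &\leftarrow \rho_t (\lambda_t \gamma_t e_{t-1} + x_t) \\
\Delta w_t &\leftarrow \alpha \delta_t e_t + (\rho_t - 1) h_t \\
h_{t+1} &\leftarrow \gamma_t \lambda_t (\rho_t h_t + \alpha \bar{\delta}_t e_t) \\
\bar{\delta}_t &\overset{\text{def}}{=} R_{t+1} + w_t^\top x_{t+1} - w_t^\top x_t
\end{aligned}$$

True-Online ETD(λ)

$$\begin{aligned}
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} x_{t+1}^\top w_t - x_t^\top w_t \\
F_t &\leftarrow \rho_{t-1} \gamma_t F_{t-1} + I_t \qquad \triangleright\ F_{-1} = 0 \\
M_t &\overset{\text{def}}{=} \lambda_t I_t + (1 - \lambda_t) F_t \\
e_t &\leftarrow \rho_t \gamma_t \lambda_t e_{t-1} + \rho_t \alpha M_t \left( 1 - \rho_t \gamma_t \lambda_t (x_t^\top e_{t-1}) \right) x_t \\
\Delta w &\leftarrow \delta_t e_t + (e_t - \alpha M_t \rho_t x_t)(w_t - w_{t-1})^\top x_t
\end{aligned}$$

True-Online ETD(β, λ)

$$\begin{aligned}
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} x_{t+1}^\top w_t - x_t^\top w_t \\
F_t &\leftarrow \rho_{t-1} \beta_t F_{t-1} + I_t \qquad \triangleright\ F_{-1} = 0 \\
M_t &\overset{\text{def}}{=} \lambda_t I_t + (1 - \lambda_t) F_t \\
e_t &\leftarrow \rho_t \gamma_t \lambda_t e_{t-1} + \rho_t \alpha M_t \left( 1 - \rho_t \gamma_t \lambda_t (x_t^\top e_{t-1}) \right) x_t \\
\Delta w &\leftarrow \delta_t e_t + (e_t - \alpha M_t \rho_t x_t)(w_t - w_{t-1})^\top x_t
\end{aligned}$$

GTD2(λ)-MP

$$\begin{aligned}
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} w_t^\top x_{t+1} - w_t^\top x_t \\
e_t &\leftarrow \rho_t (\lambda_t \gamma_t e_{t-1} + x_t) \\
h_{t+\frac{1}{2}} &\leftarrow h_t + \alpha_h \left[ \delta_t e_t - (h_t^\top x_t) x_t \right] \\
w_{t+\frac{1}{2}} &\leftarrow w_t + \alpha \left[ (h_t^\top x_t) x_t - \gamma_{t+1} (1 - \lambda_{t+1}) (h_t^\top e_t) x_{t+1} \right] \\
\delta_{t+\frac{1}{2}} &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} w_{t+\frac{1}{2}}^\top x_{t+1} - w_{t+\frac{1}{2}}^\top x_t \\
\Delta w &\leftarrow \alpha \left[ (h_{t+\frac{1}{2}}^\top x_t) x_t - \gamma_{t+1} (1 - \lambda_{t+1}) (h_{t+\frac{1}{2}}^\top e_t) x_{t+1} \right] \\
\Delta h &\leftarrow \alpha_h \left[ \delta_{t+\frac{1}{2}} e_t - (x_t^\top h_{t+\frac{1}{2}}) x_t \right]
\end{aligned}$$

TDC(λ)-MP

$$\begin{aligned}
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} w_t^\top x_{t+1} - w_t^\top x_t \\
e_t &\leftarrow \rho_t (\lambda_t \gamma_t e_{t-1} + x_t) \\
h_{t+\frac{1}{2}} &\leftarrow h_t + \alpha_h \left[ \delta_t e_t - (h_t^\top x_t) x_t \right] \\
w_{t+\frac{1}{2}} &\leftarrow w_t + \alpha \left[ \delta_t e_t - \gamma_{t+1} (1 - \lambda_{t+1}) (h_t^\top e_t) x_{t+1} \right] \\
\delta_{t+\frac{1}{2}} &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} w_{t+\frac{1}{2}}^\top x_{t+1} - w_{t+\frac{1}{2}}^\top x_t \\
\Delta w &\leftarrow \alpha \left[ \delta_{t+\frac{1}{2}} e_t - \gamma_{t+1} (1 - \lambda_{t+1}) (e_t^\top h_{t+\frac{1}{2}}) x_{t+1} \right] \\
\Delta h &\leftarrow \alpha_h \left[ \delta_{t+\frac{1}{2}} e_t - (x_t^\top h_{t+\frac{1}{2}}) x_t \right]
\end{aligned}$$

HTD(λ)

$$\begin{aligned}
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} w_t^\top x_{t+1} - w_t^\top x_t \\
e_t &\leftarrow \rho_t (\lambda_t \gamma_t e_{t-1} + x_t) \\
e^\mu_t &\leftarrow \lambda_t \gamma_t e^\mu_{t-1} + x_t \\
\Delta w_t &\leftarrow \alpha \left[ \delta_t e_t + (x_t - \gamma_{t+1} x_{t+1})(e_t - e^\mu_t)^\top h_t \right] \\
\Delta h_t &\leftarrow \alpha_h \left[ \delta_t e_t - (x_t - \gamma_{t+1} x_{t+1})(e^{\mu\top}_t h_t) \right]
\end{aligned}$$

True-online HTD(λ)

$$\begin{aligned}
\delta_t &\overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} w_t^\top x_{t+1} - w_t^\top x_t \\
e_t &\leftarrow \rho_t (\lambda_t \gamma_t e_{t-1} + x_t) \\
e^\mu_t &\leftarrow \lambda_t \gamma_t e^\mu_{t-1} + x_t \\
e^o_t &\leftarrow \rho_t \left( \lambda_t \gamma_t e^o_{t-1} + \alpha (1 - \rho_t \gamma_t \lambda_t x_t^\top e^o_{t-1}) x_t \right) \\
d &\overset{\text{def}}{=} \delta_t e^o_t + (e^o_t - \alpha \rho_t x_t)(w_t - w_{t-1})^\top x_t \\
\Delta w_t &\leftarrow d + \alpha (x_t - \gamma_{t+1} x_{t+1})(e_t - e^\mu_t)^\top h_t \\
\Delta h_t &\leftarrow d - \alpha_h (x_t - \gamma_{t+1} x_{t+1}) e^{\mu\top}_t h_t
\end{aligned}$$

B. ADDITIONAL EXPERIMENTS

This section includes additional results further analyzing the relative performance of linear TD-based policy evaluation algorithms. The runtime results are shown in Figures 5 and 6. The graphs indicate a sample efficiency versus time trade-off. For increasing c, the algorithms are given more time per sample to finish computing.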
If computation is not done within the allotted time c, then the agent continues to finish computation but has essentially paused interaction. If the algorithm is slow, several iterations may be required until it is done computing on that one sample, at which point it is given a new sample. This simulates a real-time decision making task, such as mobile robot control. New samples cannot be processed or buffered for off-line computation while the previous sample is being processed. However, multiple samples can be processed per iteration, where the iteration duration is denoted by $c \in \mathbb{R}$.

Typically, computationally frugal learning methods perform well when c is smaller, because more samples can be processed per iteration. For example, for c = 0.1 on-policy TD(0) processes 101 samples, TD(λ) processes 97 samples, TO-TD processes 85 samples and TO-HTD processes 31 samples. Even though the TD(0) algorithm was allowed to process more samples per iteration, it did not achieve the best performance trade-off, because TO-TD is both sample efficient and computationally efficient. For larger c more time is available to each algorithm on each iteration. In this case some of the other algorithms have better performance trade-offs. For example, in off-policy learning, HTD with c = 1.25 effectively ties PTD for the best performance.
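One way to read the protocol above is as a per-iteration time budget: a fast update fits several times into an iteration of length c, while a slow update spans several iterations and the agent sees no new sample until it finishes. The Python sketch below models only that bookkeeping. The function names and the per-update costs in the usage lines are hypothetical; they are not measurements from Table 1 and this is not the authors' timing harness. Times are kept as integer nanoseconds to avoid floating-point rounding in the divisions.

```python
def samples_per_iteration(cost_ns: int, budget_ns: int) -> int:
    """Whole updates that fit in one iteration of length budget_ns (0 if none fits)."""
    if cost_ns <= 0 or budget_ns <= 0:
        raise ValueError("cost and budget must be positive")
    return budget_ns // cost_ns


def iterations_per_sample(cost_ns: int, budget_ns: int) -> int:
    """Iterations one update occupies; interaction is paused until it completes."""
    if cost_ns <= 0 or budget_ns <= 0:
        raise ValueError("cost and budget must be positive")
    return -(-cost_ns // budget_ns)  # ceiling division on integers


def samples_seen(cost_ns: int, budget_ns: int, num_iterations: int) -> int:
    """Samples processed after num_iterations iterations under this simplified model."""
    per_iteration = samples_per_iteration(cost_ns, budget_ns)
    if per_iteration >= 1:
        return per_iteration * num_iterations
    return num_iterations // iterations_per_sample(cost_ns, budget_ns)


# Hypothetical per-update costs; c = 0.1 ms is written as 100_000 ns.
budget = 100_000
for name, cost in [("cheap update", 1_000), ("costly update", 250_000)]:
    print(name, samples_seen(cost, budget, num_iterations=500))
```

Under this model a cheaper update always sees at least as many samples, so any advantage of a costlier method has to come from extracting more from each sample, which is the trade-off the runtime figures are meant to expose.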
[Figure 3 plots. Panels: learning curves (error versus steps) and sensitivity to α (α = 0.1 × 2^j), η (α_h = ηα), and λ. Methods shown: TO-ETD, TO-ETD(β), PTD / HTD / TD(λ), TO-HTD / TO-TD, TD(0), GTD2-mp, TDC-mp, GTD, TO-GTD.]

Figure 3: On-policy performance on random MDPs with aliased tabular features. All plots report mean absolute value error averaged over 100 runs and 30 MDPs. The top left plot depicts the learning curves with the best parameters found, averaged over all random MDP instances. The remaining graphs depict each algorithm's parameter sensitivity in mean absolute value error for α, η, and λ.

[Figure 4 plots. Panels: learning curves (RMSPBE versus steps) and sensitivity to α (α = 0.1 × 2^j), η (α_h = ηα), and λ. Methods shown: GTD2-mp, TDC-mp, TO-ETD, TO-ETD(β), TO-GTD / GTD, PTD, HTD, TO-HTD.]

Figure 4: Off-policy performance on a variant of Baird's 7-state counterexample. All plots report the root mean square projected Bellman error (RMSPBE). See White's thesis (2015) for a detailed explanation of how to compute the MSPBE from the parameters of an MDP. The top left graph reports the RMSPBE, averaged over 200 [runs] and plotted against time, under the best parameter setting found over an extensive sweep. The remaining plots depict the parameter sensitivity, in RMSPBE, of each method with respect to the key algorithm parameters α, η, and λ.

[Figure 5 plots. Nine panels of error versus steps (0 to 500), one for each of c = 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75.]

Figure 5: Runtime analysis in on-policy random MDPs, with tabular features. Once the time per iteration is increased to 1.75 milliseconds, we obtain the original learning curve graphs: there are no runtime restrictions on the algorithms at that point since they are all fast enough with so much time per second. The line style and colors correspond exactly with the labels in the main paper. For a detailed discussion of the figure see the appendix text.

[Figure 6 plots. Nine panels of error versus steps (0 to 1000), one for each of c = 0.1, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0.]

Figure 6: Runtime analysis in off-policy random MDPs, with tabular features. Once the time per iteration is increased to 2 milliseconds, we obtain the original learning curve graphs: there are no runtime restrictions on the algorithms at that point since they are all fast enough with the time allotted. The line style and colors correspond exactly with the labels in the main paper. For a detailed discussion of the figure see the appendix text.
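To make the update rules in Appendix A concrete, the following Python sketch implements one step of TD(λ) and one step of GTD(λ) as they are listed there. It assumes constant γ and λ (the listings allow both to vary per time step), and the function names, argument order, and toy transition are illustrative choices rather than the authors' code.

```python
import numpy as np


def td_lambda_step(w, e, x, r, x_next, alpha, gamma, lam):
    """One TD(lambda) update with an accumulating trace; returns (w, e)."""
    delta = r + gamma * (x_next @ w) - x @ w
    e = lam * gamma * e + x
    w = w + alpha * delta * e
    return w, e


def gtd_lambda_step(w, h, e, x, r, x_next, rho, alpha, alpha_h, gamma, lam):
    """One GTD(lambda) update; rho is the importance sampling ratio.

    Both weight vectors are updated from the old w and h; returns (w, h, e).
    """
    delta = r + gamma * (x_next @ w) - x @ w
    e = rho * (lam * gamma * e + x)
    w_new = w + alpha * (delta * e - gamma * (1.0 - lam) * (e @ h) * x_next)
    h_new = h + alpha_h * (delta * e - (x @ h) * x)
    return w_new, h_new, e


# Hypothetical transition between two tabular states with reward 1.
x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w, e = td_lambda_step(np.zeros(2), np.zeros(2), x, 1.0, x_next,
                      alpha=0.5, gamma=0.9, lam=0.5)
print("TD(lambda) weights:", w)

w_g, h_g, e_g = gtd_lambda_step(np.zeros(2), np.zeros(2), np.zeros(2),
                                x, 1.0, x_next, rho=2.0,
                                alpha=0.5, alpha_h=0.25, gamma=0.9, lam=0.5)
print("GTD(lambda) weights:", w_g, "secondary weights:", h_g)
```

Each step uses only inner products and vector additions over the feature dimension, which is the linear per-step cost the paper refers to; GTD(λ) pays for a second weight vector h and a second step size α_h on top of TD(λ).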
