Eligibility Propagation to Speed up Time Hopping for Reinforcement Learning

Authors: Petar Kormushev, Kohei Nomoto, Fangyan Dong, and Kaoru Hirota

Abstract — A mechanism called Eligibility Propagation is proposed to speed up the Time Hopping technique used for faster Reinforcement Learning in simulations. Eligibility Propagation provides for Time Hopping similar abilities to what eligibility traces provide for conventional Reinforcement Learning. It propagates values from one state to all of its temporal predecessors using a state transitions graph. Experiments on a simulated biped crawling robot confirm that Eligibility Propagation accelerates the learning process more than 3 times.

Manuscript submitted March 31, 2009. This work was supported in part by the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT). P. S. Kormushev, F. Dong and K. Hirota are with the Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology, Yokohama, 226-8502, Japan (phone: +81-45-924-5686/5682; fax: +81-45-924-5676; e-mail: {petar, tou, hirota}@hrt.dis.titech.ac.jp). K. Nomoto is with the Industrial Design Center, Mitsubishi Electric Corporation, Tokyo, Japan (e-mail: Nomoto.Kohei@dw.MitsubishiElectric.co.jp).

I. INTRODUCTION

Reinforcement learning (RL) algorithms [16] address the problem of learning to select optimal actions when limited feedback (usually in the form of a scalar reinforcement function) from the environment is available. General RL algorithms like Q-learning [17], SARSA and TD(λ) [15] have been proved to converge to the globally optimal solution (under certain assumptions) [1][17]. They are very flexible, because they do not require a model of the environment, and have been shown to be effective in solving a variety of RL tasks. This flexibility, however, comes at a certain cost: these RL algorithms require extremely long training to cope with large state space problems.

Many different approaches have been proposed for speeding up the RL process. One possible technique is to use function approximation [8], in order to reduce the effect of the "curse of dimensionality". Unfortunately, using function approximation creates instability problems when used with off-policy learning. Significant speed-up can be achieved when a demonstration of the goal task is available [3], as in Apprenticeship Learning [7]. Although there is a risk of running dangerous exploration policies in the real world [10], there are successful implementations of apprenticeship learning for aerobatic helicopter flight [11].

Another possible technique for speeding up RL is to use some form of hierarchical decomposition of the problem [4]. A prominent example is the "MAXQ Value Function Decomposition" [2]. Hybrid methods using both apprenticeship learning and hierarchical decomposition have been successfully applied to quadruped locomotion [14][18]. Unfortunately, decomposition of the target task is not always possible, and sometimes it may impose additional burden on the users of the RL algorithm.

A state-of-the-art RL algorithm for efficient state space exploration is E3 [6]. It uses an active exploration policy to visit states whose transition dynamics are still inaccurately modeled. Because of this, running E3 directly in the real world might lead to a dangerous exploration behavior.

Instead of executing RL algorithms in the real world, simulations are commonly used. This approach has two main advantages: speed and safety.
Depending on its complexity, a simulation can run many times faster than a real-world experiment. Also, the time needed to set up and maintain a simulation experiment is far less compared to a real-world experiment. The second advantage, safety, is also very important, especially if the RL agent is a very expensive piece of equipment (e.g. a fragile robot) or a dangerous one (e.g. a chemical plant).

Whether the full potential of computer simulations has been utilized for RL, however, is an open question. A new trend in RL suggests that this might not be the case. For example, two techniques have been proposed recently to better utilize the potential of computer simulations for RL: Time Manipulation [12] and Time Hopping [13]. They share the concept of using the simulation time as a tool for speeding up the learning process.

The first technique, called Time Manipulation, suggests that doing backward time manipulations inside a simulation can significantly speed up the learning process and improve the state space exploration. Applied to failure-avoidance RL problems, such as the cart-pole balancing problem, Time Manipulation has been shown to increase the speed of convergence by 260% [12].

This paper focuses on the second technique, called Time Hopping, which can be applied successfully to continuous optimization problems. Unlike the Time Manipulation technique, which can only perform backward time manipulations, the Time Hopping technique can make arbitrary "hops" between states and traverse rapidly throughout the entire state space. It has been shown to accelerate the learning process more than 7 times on some problems [13]. Time Hopping possesses mechanisms to trigger time manipulation events, to make predictions about possible future rewards, and to select promising time hopping targets.

This paper proposes an additional mechanism called Eligibility Propagation to be added to the Time Hopping technique, in order to provide similar abilities to what eligibility traces provide for conventional RL. Eligibility traces are easy to implement for conventional RL methods with sequential time transitions, but in the case of Time Hopping, due to its non-sequential nature, a number of obstacles have to be overcome.

The following Section II gives a brief overview of the Time Hopping technique and its components. Section III explains why it is important (and not trivial) to implement some form of eligibility traces for Time Hopping and proposes the Eligibility Propagation mechanism to do this. Section IV presents the results from an experimental evaluation of Eligibility Propagation on a benchmark continuous-optimization problem: a biped crawling robot.

II. OVERVIEW OF TIME HOPPING

A. Basics of Time Hopping

Time Hopping is an algorithmic technique which allows maintaining a higher learning rate in a simulation environment by hopping to appropriately selected states [13]. For example, let us consider a formal definition of a RL problem, given by the Markov Decision Process (MDP) on Fig. 1. Each state transition has a probability associated with it. State 1 represents situations of the environment that are very common and learned quickly. The frequency with which state 1 is being visited is the highest of all. As the state number increases, the probability of being in the corresponding state becomes lower. State 4 represents the rarest situations and is therefore the most unlikely to be well explored and learned.
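To make the effect concrete, the following short Python sketch simulates a small chain MDP of this kind. The 0.9/0.1 transition split and the "fall back to state 1" topology are assumptions chosen to match the spirit of Fig. 1, not values taken from the paper, and the snippet is purely illustrative.

```python
import random

# Hypothetical 4-state chain MDP in the spirit of Fig. 1: from each state the agent
# advances towards state 4 with probability 0.1 and otherwise falls back to state 1.
# The topology and probabilities are illustrative assumptions, not the paper's exact MDP.
P_ADVANCE = 0.1
N_STATES = 4
STEPS = 100_000

visits = [0] * (N_STATES + 1)  # 1-indexed visit counters
state = 1
for _ in range(STEPS):
    visits[state] += 1
    if state < N_STATES and random.random() < P_ADVANCE:
        state += 1   # rare transition towards the poorly explored states
    else:
        state = 1    # common outcome: back to the well-known state

for s in range(1, N_STATES + 1):
    print(f"state {s}: visited {100 * visits[s] / STEPS:.2f}% of the time")
```

Under these assumed numbers, state 4 is visited only about 0.1% of the time, which is exactly the kind of state that the "shortcuts in time" described below are meant to reach.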
Fig. 1. An example of a MDP with uneven state probability distribution. Time Hopping can create "shortcuts in time" (shown with dashed lines) between otherwise distant states, i.e. states connected by a very low-probability path. This allows even the lowest-probability state 4 to be learned easily.

When applied to such a MDP, Time Hopping creates "shortcuts in time" by making hops (direct state transitions) between very distant states inside the MDP. Hopping to low-probability states makes them easier to learn, while at the same time it helps to avoid unnecessary repetition of already well-explored states [13]. The process is completely transparent for the underlying RL algorithm.

B. Components of Time Hopping

When applied to a conventional RL algorithm, Time Hopping consists of 3 components:
1) Hopping trigger – decides when the hopping starts;
2) Target selection – decides where to hop to;
3) Hopping – performs the actual hopping.

The flowchart on Fig. 2 shows how these 3 components of Time Hopping are connected and how they interact with the RL algorithm. When the Time Hopping trigger is activated, a target state and time have to be selected, considering many relevant properties of the states, such as probability, visit frequency, level of exploration, connectivity to other states (number of state transitions), etc. After a target state and time have been selected, hopping can be performed. It includes setting the RL agent and the simulation environment to the proper state, while at the same time preserving all the knowledge acquired by the agent.

Fig. 2. Time Hopping technique applied to a conventional RL algorithm. The lower group (marked with a dashed line) contains the conventional RL algorithm main loop, into which the Time Hopping components (the upper group) are integrated.

III. ELIGIBILITY PROPAGATION

A. The role of eligibility traces

Eligibility traces are one of the basic mechanisms for temporal credit assignment in reinforcement learning [16]. An eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the taking of an action. When a learning update occurs, the eligibility trace is used to assign credit or blame for the received reward to the most appropriate states or actions. For example, in the popular TD(λ) algorithm, the λ refers to the use of an eligibility trace. Almost any temporal-difference (TD) method, e.g. Q-learning or SARSA, can be combined with eligibility traces to obtain a more general method that may learn more efficiently. This is why it is important to implement some form of eligibility traces for Time Hopping as well, in order to speed up its convergence.

Eligibility traces are usually easy to implement for conventional RL methods. In the case of Time Hopping, however, due to its non-sequential nature, it is not trivial to do so. Since arbitrary hops between states are allowed, it is impossible to directly apply linear eligibility traces. Instead, we propose a different mechanism called Eligibility Propagation to do this.
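For contrast with the non-sequential setting addressed below, here is a minimal Python sketch of how an eligibility trace is typically maintained in a conventional, sequential tabular TD setting. The parameter values and the accumulating-trace variant are placeholder assumptions; the point is only that the decay step presumes consecutive time steps, which is exactly the assumption Time Hopping breaks.

```python
from collections import defaultdict

# Minimal sketch of a conventional (sequential) eligibility-trace update for tabular
# TD(lambda)-style learning. Parameter values are placeholders, for illustration only.
alpha, gamma, lam = 0.1, 0.9, 0.8

V = defaultdict(float)      # state-value estimates
trace = defaultdict(float)  # eligibility trace e(s)

def td_lambda_step(s, reward, s_next):
    """One sequential transition s -> s_next with the given reward."""
    delta = reward + gamma * V[s_next] - V[s]     # TD error
    trace[s] += 1.0                               # mark s as eligible (accumulating trace)
    for state in list(trace):
        V[state] += alpha * delta * trace[state]  # credit all recently visited states
        trace[state] *= gamma * lam               # decay eligibility as time moves forward
```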
B. Eligibility Propagation mechanism

Time Hopping is guaranteed to converge when an off-policy RL algorithm is used [13], because the learned policy is independent of the policy followed during learning. This means that the exploration policy does not converge to the optimal policy. In fact, Time Hopping deliberately tries to avoid convergence of the policy in order to maintain a high learning rate and minimize exploration redundancy. This poses a major requirement for any potential eligibility-trace mechanism: it has to be able to learn from independent non-sequential state transitions spread sparsely throughout the state space.

The proposed solution is to construct an oriented graph which represents the state transitions with their associated actions and rewards, and to use this data structure to propagate the learning updates. Because of the way Time Hopping works, the graph might be disconnected, consisting of many separate connected components. Regardless of the actual order in which Time Hopping visits the states, this oriented graph contains a record of the correct chronological sequence of state transitions. For example, each state transition can be considered to be from state S_t to state S_{t+1}, and the information about this state transition is independent from what happened before it and what will happen after it. This allows the separate pieces of information obtained during the randomized hopping to be collected efficiently and processed uniformly using the graph structure.

Once this oriented graph is available, it is used to propagate state value updates in the opposite direction of the state transition edges. This way, the propagation logically flows backwards in time, from state S_t to all of its temporal predecessor states S_{t-1}, S_{t-2} and so on. The propagation stops when the value updates become sufficiently small. The mechanism is illustrated on Fig. 3.

Fig. 3. Eligibility Propagation mechanism applied to the oriented graph of state transitions.

In summary, an explicit definition of the proposed mechanism is as follows: Eligibility Propagation is an algorithmic mechanism for Time Hopping to efficiently collect, represent and propagate information about states and transitions. It uses a state transitions graph and a wave-like propagation algorithm to propagate state values from one state to all of its temporal predecessor states. A concrete implementation of this mechanism within the Time Hopping technique is given in the following section.
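One possible way to store such a state transitions graph is sketched below in Python. The class name, the dictionary-based layout and the method names are illustrative assumptions rather than the authors' data structures; the essential property is that transitions can be recorded in any order and the predecessors of any state can be looked up later.

```python
from collections import defaultdict

# Illustrative sketch of an oriented state-transitions graph for Eligibility Propagation.
# Names and layout are assumptions; only the idea of recording transitions with their
# actions and rewards and looking up predecessors is taken from the text.
class TransitionsGraph:
    def __init__(self):
        # s -> {s_next: (action, reward)} for every recorded outgoing transition
        self.successors = defaultdict(dict)
        # s -> set of predecessor states s_prev with a recorded edge (s_prev, s)
        self.predecessors = defaultdict(set)

    def record(self, s, action, reward, s_next):
        """Store one observed transition, regardless of the order in which it was visited."""
        self.successors[s][s_next] = (action, reward)
        self.predecessors[s_next].add(s)

    def predecessor_transitions(self, s):
        """All immediate predecessor transitions (s_prev, s) of state s."""
        return [(s_prev, s) for s_prev in self.predecessors[s]]
```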
C. Implementation of Eligibility Propagation

The proposed implementation of Eligibility Propagation can be called "reverse graph propagation", because values are propagated inside the graph in the reverse (opposite) direction of the state transitions' directions. The process is similar to the wave-like propagation of a BFS (breadth-first search) algorithm. In order to give a more specific implementation description, Q-learning is used as the underlying RL algorithm. The following is the pseudo-code for the proposed Eligibility Propagation mechanism:

1. Construct an ordered set (queue) of state transitions called PropagationQueue and initialize it with the current state transition (S_t, S_{t+1}) in this way:

   PropagationQueue = { (S_t, S_{t+1}) }.   (1)

2. Take the first state transition (S_t, S_{t+1}) ∈ PropagationQueue and remove it from the queue.

3. Let Q_max be the current maximum Q-value of state S_t:

   Q_max = max_A { Q_{S_t, A} },   (2)

   where the transition from state S_t to state S_{t+1} is done by executing action A, and the reward R_{S_t, A} is received.

4. Update the Q-value for making the state transition (S_t, S_{t+1}) using the update rule:

   Q_{S_t, A} = R_{S_t, A} + γ max_{A'} { Q_{S_{t+1}, A'} }.   (3)

5. Let Q'_max be the new maximum Q-value of state S_t, calculated using formula (2).

6. If

   | Q'_max − Q_max | > ε,   (4)

   then construct the set of all immediate predecessor state transitions of state S_t:

   { (S_{t−1}, S_t) | (S_{t−1}, S_t) ∈ transitions graph },   (5)

   and append it to the end of the PropagationQueue.

7. If PropagationQueue ≠ ∅, then go to step 2.

8. Stop.

The decision whether further propagation is necessary is made in step 6. The propagation continues one more step backwards in time only if there is a significant difference between the old maximum Q-value and the new one, according to formula (4). This formula is based on the fact that Q'_max might be different than Q_max in exactly 3 out of 4 possible cases, which are:

- The transition (S_t, S_{t+1}) was the one with the highest value for state S_t and its new (bigger) value needs to be propagated backwards to its predecessor states.

- The transition (S_t, S_{t+1}) was the one with the highest value but it is not any more, because its value is reduced. Propagation of the new maximum value (which belongs to a different transition) is necessary.

- The transition (S_t, S_{t+1}) was not the one with the highest value but now it has become the one, so its value needs propagation.

The only case when propagation is not necessary is when the transition (S_t, S_{t+1}) was not the one with the highest value and it is still not the one after the update. In this case, Q'_max is equal to Q_max and formula (4) correctly detects it and skips propagation.

In the previous 3 cases the propagation is performed, provided that there is a significant change of the value, determined by the ε parameter. When ε is smaller, the algorithm tends to propagate the value changes further. When ε is bigger, it tends to propagate only the biggest changes just a few steps backwards, skipping any minor updates.

The depth of propagation also depends on the discount factor γ. The bigger γ is, the deeper the propagation is, because longer-term reward accumulation is stimulated. Still, due to the exponential attenuation of future rewards, the γ discount factor prevents the propagation from going too far and reduces the overall computational cost.

Fig. 4. Eligibility Propagation integrated as a 4th component in the Time Hopping technique.

The described Eligibility Propagation mechanism can be encapsulated as a single component and integrated into the Time Hopping technique as shown on Fig. 4. It is called immediately after a state transition takes place, in order to propagate any potential Q-value changes, and before a time hopping step occurs.
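A compact Python reading of steps 1-8 is sketched below. It assumes the TransitionsGraph sketched earlier, a tabular Q stored in a dictionary, and placeholder values for γ and ε; it is an illustrative interpretation of the pseudo-code, not the authors' implementation.

```python
from collections import deque, defaultdict

Q = defaultdict(float)  # (state, action) -> Q-value; tabular, for illustration only

def max_q(graph, s):
    """Maximum Q-value of state s over its recorded actions, as in formula (2)."""
    actions = (action for action, _ in graph.successors[s].values())
    return max((Q[(s, a)] for a in actions), default=0.0)

def eligibility_propagation(graph, s_t, s_next, gamma=0.95, epsilon=1e-3):
    """Reverse graph propagation after observing the transition (s_t, s_next)."""
    queue = deque([(s_t, s_next)])                            # step 1
    while queue:                                              # steps 2 and 7
        s, s2 = queue.popleft()
        old_max = max_q(graph, s)                             # step 3
        action, reward = graph.successors[s][s2]
        Q[(s, action)] = reward + gamma * max_q(graph, s2)    # step 4, formula (3)
        new_max = max_q(graph, s)                             # step 5
        if abs(new_max - old_max) > epsilon:                  # step 6, formula (4)
            queue.extend(graph.predecessor_transitions(s))    # predecessors of s, formula (5)
```

In this sketch the routine would be invoked right after graph.record(s_t, action, reward, s_next), i.e. immediately after every simulated transition and before a possible hopping step, matching the placement shown in Fig. 4.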
IV. APPLICATION OF ELIGIBILITY PROPAGATION TO A BIPED CRAWLING ROBOT

In order to evaluate the efficiency of the proposed Eligibility Propagation mechanism, experiments on a simulated biped crawling robot are conducted. The goal of the learning process is to find a crawling motion with the maximum speed. The reward function for this task is defined as the horizontal displacement of the robot after every action.

A. THEN experimental environment

A dedicated experimental software system called THEN (Time Hopping ENvironment) was developed for the purpose of this evaluation. A general view of the environment is shown on Fig. 5. THEN has a built-in physics simulation engine, an implementation of the Time Hopping technique, useful visualization modules (for the simulation, the learning data and the state transitions graph) and, most importantly, a prototype implementation of the Eligibility Propagation mechanism. To facilitate the analysis of the algorithm behavior, THEN displays detailed information about the current state, the previous state transitions, and a visual view of the simulation, and allows runtime modification of all important parameters of the algorithms and the simulation. There is manual and automatic control of the Time Hopping technique, as well as visualization of the accumulated data in the form of charts.

Fig. 5. General view of THEN (Time Hopping ENvironment). The built-in physics engine is running a biped crawling robot simulation.

B. Description of the crawling robot

The experiments are conducted on a physical simulation of a biped crawling robot. The robot has 2 limbs, each with 2 segments, for a total of 4 degrees of freedom (DOF). Every DOF is independent from the rest and has 3 possible actions at each time step: to move clockwise, to move anti-clockwise, or to stand still. Fig. 6 shows a typical learned crawling sequence of the robot as visualized in the simulation environment constructed for this task.

Fig. 6. Crawling robot with 2 limbs, each with 2 segments, for a total of 4 DOF. Nine different states of the crawling robot are shown from a typical learned crawling sequence.

When all possible actions of each DOF of the robot are combined, assuming that they can all move at the same time independently, this produces an action space with size 3^4 − 1 = 80 (we exclude the possibility that all DOF are standing still). Using an appropriate discretization for the joint angles (9 for the upper limbs and 13 for the lower limbs), the state space becomes divided into (9 × 13)^2 = 13689 states. For better analysis of the crawling motion, each limb has been colored differently and only the "skeleton" of the robot is displayed.
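The size calculations above can be verified with a few lines of Python. The joint layout follows the robot description; the choice of which action index means "stand still" and the itertools enumeration are assumptions made only for this check.

```python
from itertools import product

ACTIONS_PER_DOF = 3   # move clockwise, move anti-clockwise, or stand still
N_DOF = 4             # 2 limbs x 2 segments

# Enumerate all joint action combinations; index 0 is assumed to mean "stand still".
all_combinations = list(product(range(ACTIONS_PER_DOF), repeat=N_DOF))
action_space = [a for a in all_combinations if any(move != 0 for move in a)]
assert len(action_space) == 3 ** 4 - 1 == 80    # action space size quoted in the text

UPPER_BINS, LOWER_BINS = 9, 13                  # discretization of upper/lower joint angles
assert (UPPER_BINS * LOWER_BINS) ** 2 == 13689  # (9 x 13)^2 states for the two limbs
```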
C. Description of the experimental method

The conducted experiments are divided into 3 groups: experiments using conventional Q-learning, experiments using only the Time Hopping technique applied to Q-learning (as described in [13]), and experiments using Time Hopping with Eligibility Propagation. The implementations used for the Time Hopping components are shown in Table I.

TABLE I
IMPLEMENTATION USED FOR EACH TIME HOPPING COMPONENT

  Component name              Implementation used
  1 Hopping trigger           Gamma pruning
  2 Target selection          Lasso target selection
  3 Hopping                   Basic hopping
  4 Eligibility propagation   Reverse graph propagation

The experiments from all three groups are conducted in exactly the same way, using the same RL parameters (incl. discount factor γ, learning rate α, and the action selection method parameters). The initial state of the robot and the simulation environment parameters are also equal. The robot training continues up to a fixed number of steps (45000), and the achieved crawling speed is recorded at fixed checkpoints during the training. This process is repeated 10 times and the results are averaged, in order to ensure statistical significance.

D. Evaluation of Eligibility Propagation

The evaluation of Eligibility Propagation is done using 3 main experiments. In the first experiment, the learning speed of conventional Q-learning, Time Hopping, and Time Hopping with Eligibility Propagation is compared based on the best solution found (i.e. the fastest achieved crawling speed) for the same number of training steps. The comparison results are shown in Fig. 7, which shows the duration of training needed by each of the 3 algorithms to achieve a certain crawling speed. The achieved speed is displayed as a percentage of the globally optimal solution. The results show that Time Hopping with Eligibility Propagation is much faster than Time Hopping alone, which in turn is much faster than conventional Q-learning.

Fig. 7. Speed-of-learning comparison of conventional Q-learning, Time Hopping, and Time Hopping with Eligibility Propagation. It is based on the best solution achieved relative to the duration of training. The achieved crawling speed is measured as a percentage of the globally optimal solution, i.e. the fastest possible crawling speed of the robot.

Compared to Time Hopping alone, Eligibility Propagation achieves a significant speed-up of the learning process. For example, an 80%-optimal crawl is learned in only 5000 steps when Eligibility Propagation is used, while Time Hopping alone needs around 20000 steps to learn the same, i.e. in this case Eligibility Propagation needs 4 times fewer training steps to achieve the same result. The speed-up becomes even higher as the number of training steps increases. For example, Time Hopping with Eligibility Propagation reaches a 90%-optimal solution within 12000 steps, while Time Hopping alone needs more than 50000 steps to do the same. Compared to conventional Q-learning, Eligibility Propagation achieves an even higher speed-up. For example, it needs only 4000 steps to achieve a 70%-optimal solution, while conventional Q-learning needs 36000 steps to learn the same, i.e. in this case Eligibility Propagation is 9 times faster. Time Hopping alone also outperforms conventional Q-learning by a factor of 3 in this case (12000 steps vs. 36000 steps).

In the second experiment, the real computational time of conventional Q-learning, Time Hopping, and Time Hopping with Eligibility Propagation is compared. The actual execution time necessary for each of the 3 algorithms to reach a certain crawling speed is measured. The comparison results are shown in Fig. 8.

Fig. 8. Computational-time comparison of conventional Q-learning, Time Hopping, and Time Hopping with Eligibility Propagation. It is based on the real computational time of each algorithm required to reach a certain quality of the solution, i.e. a certain crawling speed.
The results show that Time Hopping with Eligibility Propagation achieves 99% of the maximum possible speed almost 3 times faster than Time Hopping alone, and more than 4 times faster than conventional Q-learning. This significant speed-up of the learning process is achieved despite the additional computational overhead of maintaining the transitions graph. The reason for this is the improved Gamma-pruning based on more precise future reward predictions, as confirmed by the third experiment.

The goal of this third experiment is to provide insights about the state exploration and the Q-value distribution, in order to explain the results from the previous two experiments. Conventional Q-learning, Time Hopping, and Time Hopping with Eligibility Propagation are compared based on the maximum Q-values achieved for all explored states after 45000 training steps. The Q-values are sorted in decreasing order and represent the distribution of Q-values within the explored state space. Fig. 9 shows the comparison results.

Fig. 9. State-exploration comparison of conventional Q-learning, Time Hopping, and Time Hopping with Eligibility Propagation. It shows the sorted sequence of maximum Q-values of all explored states after 45000 steps of training. Time Hopping with Eligibility Propagation has managed to find much higher maximum Q-values for the explored states. The conventional Q-learning has explored more states, but has found lower Q-values for them.

Firstly, the results show that Time Hopping with Eligibility Propagation has managed to find significantly higher maximum Q-values for the explored states compared to both conventional Q-learning and Time Hopping. The reason for this is that Eligibility Propagation manages to propagate the state value updates well among all explored states, therefore raising their maximum Q-values.

Secondly, the results show that both Time Hopping and Time Hopping with Eligibility Propagation have explored much fewer states than conventional Q-learning. The reason for this is the Gamma-pruning component of Time Hopping. It focuses the exploration of Time Hopping on the most promising branches and avoids unnecessary exploration. Conventional Q-learning does not have such a mechanism and therefore it explores more states, but finds lower Q-values for them.

Also, Time Hopping with Eligibility Propagation has explored slightly fewer states than Time Hopping alone. The reason for this is that, while both algorithms concentrate the exploration on the most promising parts of the state space, only Eligibility Propagation manages to propagate the Q-values well among the explored states. This improves the accuracy of the future reward estimation performed by the Gamma-pruning component of Time Hopping, which in turn detects unpromising branches of exploration better and triggers a time hopping step to avoid them.

The more purposeful exploration and better propagation of the acquired state information help Eligibility Propagation to make the best of every single exploration step. This is a very important advantage of the proposed mechanism, especially if the simulation involved is computationally expensive.
In this case, Eligibility Propagation can save real computational time by reducing the number of normal transition (simulation) steps in favor of Time Hopping steps.

V. CONCLUSION

The Eligibility Propagation mechanism is proposed to provide for Time Hopping similar abilities to what eligibility traces provide for conventional RL. During operation, Time Hopping completely changes the normal sequential state transitions into a rather randomized hopping behavior throughout the state space. This poses the challenge of how to efficiently collect, represent and propagate knowledge about actions, rewards, states and transitions. Since using sequential eligibility traces is impossible, Eligibility Propagation uses the transitions graph to obtain all predecessor states of the updated state. This way, the propagation logically flows backwards in time, from one state to all of its temporal predecessor states.

The proposed mechanism is implemented as a fourth component of the Time Hopping technique. This maintains the clear separation between the 4 Time Hopping components and makes it straightforward to experiment with alternative component implementations.

The biggest advantage of Eligibility Propagation is that it can speed up the learning process of Time Hopping more than 3 times. This is due to the improved Gamma-pruning ability based on more precise future reward predictions. This, in turn, increases the exploration efficiency by better avoiding unpromising branches and selecting more appropriate hopping targets. The conducted experiments on a biped crawling robot also show that the speed-up is achieved using significantly fewer training steps. As a result, the speed-up becomes even higher when the simulation is computationally more expensive, due to the more purposeful exploration. This property makes Eligibility Propagation very suitable for speeding up complex learning tasks which require costly simulation.

Another advantage of the proposed implementation of Eligibility Propagation is that no parameter tuning is necessary during the learning, which makes the mechanism easy to use.

Finally, an important drawback of the proposed technique is that it needs additional memory to store the transitions graph data. In other words, the speed-up is achieved by using more memory.

REFERENCES

[1] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Mach. Learn., vol. 14, no. 3, pp. 295-301, 1994.
[2] T. G. Dietterich, "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition," J. Artif. Intell. Res., vol. 13, pp. 227-303, 2000.
[3] A. Coates, P. Abbeel, and A. Ng, "Learning for Control from Multiple Demonstrations," ICML, vol. 25, 2008.
[4] A. Barto and S. Mahadevan, "Recent Advances in Hierarchical Reinforcement Learning," Discrete Event Dynamic Systems, vol. 13, pp. 341-379, 2003.
[5] M. Humphrys, "Action Selection methods using Reinforcement Learning," PhD Thesis, University of Cambridge, June 1997.
[6] M. Kearns and S. Singh, "Near-optimal reinforcement learning in polynomial time," Machine Learning, 2002.
[7] A. Ng, "Reinforcement Learning and Apprenticeship Learning for Robotic Control," Lecture Notes in Computer Science, vol. 4264, pp. 29-31, 2006.
[8] D. Precup, R. S. Sutton, and S. Dasgupta, "Off-policy temporal-difference learning with function approximation," in Proceedings of the Eighteenth Conference on Machine Learning (ICML 2001), Morgan Kaufmann, pp. 417-424, 2001.
Dasgupt a, "Off-policy temporal-differ ence learning with functi on ap proximation," I n Proceedi ngs of t he Eighte enth Con ference o n Machine Learni ng (ICML 2001) , ed. M. Kaufmann, pp.4 17-424, 2 001. [9] B. Price and C. Bout ilier, “A ccelerating Reinforcement Learning through Implicit Imitation,” Jou rnal of Artifi cial Intelli gence Research , vol . 19 , pp. 569 -629, 200 3. [10] P . Abbeel and A. Ng, “Exploration and apprenticeship learning in reinforceme nt learning”, ICML , 2005. [11] P . Abbeel, A. C oates, M. Quigley , and A. Ng, “An applicati on of reinforceme nt learning to aerobatic helicopter flight”, NIPS , vol. 1 9, 2007. [12] P . Kor mushev, K. Nomoto, F . Dong, and K. Hirota, “Time manipulati on t echnique for speeding up reinforce ment l earning in simulations”, In ternation al Journ al of Cybe rnetics and Informat ion Technol ogies , vol. 8, no. 1, pp. 12-24 , 2008. [13] P . Ko rmushev, K. Nomoto, F. Dong, and K. Hirota, “Time Hopping techniqu e for faster reinforcement learning in simulations”, IEEE Transactions on Systems, Man an d Cybernetics part B, su bmitted, 2009. [14] J . Kolter, M. Rodgers, and A. Ng, “A Control A rchitecture for Quadruped L ocomotion Ove r Rough Terr ain ”, IEEE Internation al Confere nce on Rob otics and Automatio n , 2008 . [15] R.S. Su tton, “Le arnin g t o p redict by the methods of temporal difference,” Mach. Learn. , vol. 3, pp. 9-44 , 1988. [16] R. S. Sut ton and A. G. Barto, Reinforc ement Learning: An Introduc tion . Camb ridge, MA : MIT Pr ess, 1998 . [17] C. J. C. H. W atki ns and P . Dayan, “Q- learning,” Mach. Learn . , vo l. 8, pp. 27 9-292, 19 92. [18] J . Kolter , P. Abbeel , and A. Ng, “Hier archic al Apprentic eship Le arning, with Applicat ion to Quadru ped Locomotion”, Neu ral Informatio n Processing Systems , vol. 20, 200 7.
