Eligibility Propagation to Speed up Time Hopping for Reinforcement Learning

Reading time: 5 minutes

📝 Original Info

  • Title: Eligibility Propagation to Speed up Time Hopping for Reinforcement Learning
  • ArXiv ID: 0904.0546
  • Date: 2009-04-03
  • Authors: Petar Kormushev, Kohei Nomoto, Fangyan Dong, Kaoru Hirota

📝 Abstract

A mechanism called Eligibility Propagation is proposed to speed up the Time Hopping technique used for faster Reinforcement Learning in simulations. Eligibility Propagation provides for Time Hopping similar abilities to what eligibility traces provide for conventional Reinforcement Learning. It propagates values from one state to all of its temporal predecessors using a state transitions graph. Experiments on a simulated biped crawling robot confirm that Eligibility Propagation accelerates the learning process more than 3 times.


📄 Full Content

Decomposition" [2]. Hybrid methods using both apprenticeship learning and hierarchical decomposition have been successfully applied to quadruped locomotion [14] [18]. Unfortunately, decomposition of the target task is not always possible, and sometimes it may impose additional burden on the users of the RL algorithm.

A state-of-the-art RL algorithm for efficient state space exploration is E3 [6]. It uses an active exploration policy to visit states whose transition dynamics are still inaccurately modeled. Because of this, running E3 directly in the real world might lead to dangerous exploration behavior.

Instead of executing RL algorithms in the real world, simulations are commonly used. This approach has two main advantages: speed and safety. Depending on its complexity, a simulation can run many times faster than a real-world experiment. Also, the time needed to set up and maintain a simulation experiment is far less than for a real-world experiment. The second advantage, safety, is also very important, especially if the RL agent controls very expensive equipment (e.g. a fragile robot) or a dangerous system (e.g. a chemical plant). Whether the full potential of computer simulations has been utilized for RL, however, is an open question.

A new trend in RL suggests that this might not be the case. For example, two techniques have been proposed recently to better utilize the potential of computer simulations for RL: Time Manipulation [12] and Time Hopping [13]. They share the concept of using the simulation time as a tool for speeding up the learning process. The first technique, called Time Manipulation, suggests that doing backward time manipulations inside a simulation can significantly speed up the learning process and improve the state space exploration. Applied to failure-avoidance RL problems, such as the cart-pole balancing problem, Time Manipulation has been shown to increase the speed of convergence by 260% [12].

This paper focuses on the second technique, called Time Hopping, which can be applied successfully to continuous optimization problems. Unlike the Time Manipulation technique, which can only perform backward time manipulations, the Time Hopping technique can make arbitrary “hops” between states and traverse rapidly throughout the entire state space. It has been shown to accelerate the learning process more than 7 times on some problems [13]. Time Hopping possesses mechanisms to trigger time manipulation events, to make prediction about possible future rewards, and to select promising time hopping targets.

This paper proposes an additional mechanism called Eligibility Propagation to be added to the Time Hopping technique. Section II gives a brief overview of the Time Hopping technique and its components. Section III explains why it is important (and not trivial) to implement some form of eligibility traces for Time Hopping and proposes the Eligibility Propagation mechanism to do this. Section IV presents the results of an experimental evaluation of Eligibility Propagation on a benchmark continuous-optimization problem: a biped crawling robot.
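Before going further, the core idea of Eligibility Propagation (as summarized in the abstract: propagating a value change from a state to all of its temporal predecessors through a state-transition graph) can be sketched in code. This is a minimal illustration, not the paper's implementation; the function name, the `predecessors` adjacency map, and the use of a `gamma * lam` decay per backward step are my own assumptions, chosen to mirror what eligibility traces do along a single trajectory.

```python
from collections import defaultdict, deque

def propagate_update(predecessors, V, start_state, delta,
                     gamma=0.9, lam=0.8, tol=1e-6):
    """Push a value change `delta` at `start_state` backward through the
    recorded state-transition graph, scaling by gamma * lam per backward
    step. Because gamma * lam < 1, the pushed update shrinks each step
    and the breadth-first traversal terminates even on cyclic graphs."""
    queue = deque([(start_state, delta)])
    while queue:
        state, d = queue.popleft()
        V[state] += d                       # apply the (decayed) update
        d_next = gamma * lam * d
        if abs(d_next) < tol:               # update too small to matter
            continue
        for pred in predecessors[state]:    # temporal predecessors of `state`
            queue.append((pred, d_next))
```

A breadth-first queue is used here so that states closer (in time) to the updated state receive their larger share of the update first, analogous to how a trace decays with temporal distance.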

Time Hopping is an algorithmic technique that maintains a higher learning rate in a simulation environment by hopping to appropriately selected states [13]. For example, consider a formal definition of an RL problem, given by the Markov Decision Process (MDP) in Fig. 1. Each state transition has a probability associated with it. State 1 represents situations of the environment that are very common and learned quickly. The frequency with which state 1 is visited is the highest of all. As the state number increases, the probability of being in the corresponding state becomes lower. State 4 represents the rarest situations and is therefore the most unlikely to be well explored and learned.

Fig. 1. An example of a MDP with uneven state probability distribution. Time Hopping can create “shortcuts in time” (shown with dashed lines) between otherwise distant states, i.e. states connected by a very low-probability path. This allows even the lowest-probability state 4 to be learned easily.
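The uneven visitation pattern described above is easy to reproduce. The following sketch (my own toy model, not the paper's MDP) simulates a 4-state chain where each step either advances to the next state with low probability or falls back to state 1, and counts how often each state is visited:

```python
import random

def visit_frequencies(n_states=4, p_advance=0.2, steps=100_000, seed=0):
    """Simulate a chain MDP: from state k we advance to k+1 with
    probability p_advance, otherwise we fall back to state 1.
    Returns the visit count of each state (index 0 = state 1)."""
    rng = random.Random(seed)
    counts = [0] * (n_states + 1)   # 1-indexed states for readability
    state = 1
    for _ in range(steps):
        counts[state] += 1
        if state < n_states and rng.random() < p_advance:
            state += 1
        else:
            state = 1
    return counts[1:]

freqs = visit_frequencies()
# Visitation drops sharply with the state index: state 1 dominates,
# while state 4 is reached only rarely.
```

Under this model, the expected fraction of time spent in state k falls off roughly as `p_advance ** (k - 1)`, which is exactly the situation where Time Hopping's "shortcuts in time" pay off: hopping directly to rarely visited states removes the need to wait for the low-probability path to occur naturally.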

When applied to such a MDP, Time Hopping creates “shortcuts in time” by making hops (direct state transitions) between very distant states inside the MDP. Hopping to low-probability states makes them easier to be learned, while at the same time it helps to avoid unnecessary repetition of already well-explored states [13]. The process is completely transparent for the underlying RL algorithm.

When applied to a conventional RL algorithm, Time Hopping consists of three components: 1) Hopping trigger, which decides when hopping starts; 2) Target selection, which decides where to hop; 3) Hopping, which performs the actual hop. The flowchart in Fig. 2 shows how these three components are connected and how they interact with the RL algorithm.
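The interaction of the three components can be sketched as a simulation loop. The interface below is hypothetical: the paper does not specify function signatures, so `hop_trigger`, `select_hop_target`, and `set_state` are illustrative stand-ins for the trigger, target-selection, and hopping components respectively.

```python
def time_hopping_episode(env_reset, env_step, policy, select_hop_target,
                         hop_trigger, set_state, n_steps=1000):
    """One learning run with Time Hopping (hypothetical interface):
    1) hop_trigger decides when to interrupt normal stepping,
    2) select_hop_target picks a promising state to continue from,
    3) set_state performs the hop by restoring the simulator to it.
    Yields (state, action, reward) transitions for the RL algorithm,
    which sees an ordinary stream of experience."""
    state = env_reset()
    for _ in range(n_steps):
        if hop_trigger(state):
            state = select_hop_target()
            set_state(state)            # the "hop": jump in simulation time
        action = policy(state)
        state, reward = env_step(state, action)
        yield state, action, reward
```

Note that the hop is invisible to the underlying RL algorithm, which matches the paper's claim that the process is completely transparent to it: the learner just consumes transitions, unaware that consecutive ones may come from distant parts of the state space.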

When the Time Hopping trigger is activated, a target state and time have to be selected, considering many relevant p

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
