A technique for speeding up reinforcement learning algorithms by using time manipulation is proposed. It is applicable to failure-avoidance control problems running in a computer simulation. Turning the time of the simulation backwards on failure events is shown to speed up the learning by 260% and improve the state space exploration by 12% on the cart-pole balancing task, compared to the conventional Q-learning and Actor-Critic algorithms.
Reinforcement Learning (RL) algorithms have been applied successfully for many years [4]. One of their main virtues is that they do not require a model of the device they are supposed to control [7]. Moreover, general RL algorithms such as Q-learning [10] and TD(λ) [8] provably converge to the globally optimal solution (under some assumptions) [1,10]. This convenience, however, comes at a cost: RL algorithms require long training [9]. Even for a relatively simple control task, such as the inverted pendulum balancing problem (also known as the cart-pole balancing problem [2]), a general RL algorithm requires hundreds or even thousands of trials to learn the task.
What is the reason for this slow convergence? To answer this question, let us focus on a subset of RL problems: learning a control policy that avoids failure. The inverted pendulum balancing task is a representative example. Here, failure is defined as the pole falling beyond a certain angle from the upright position, or the cart hitting the edges of the track. The aim of the RL algorithm is to find a control policy that keeps the pendulum from falling by moving the cart, without hitting the edges of the cart track. A fitting name for such problems is “failure-avoidance problems”. They follow the general “trial-and-error” paradigm of unsupervised learning [11].
The learning process is organized in separate trials (or episodes), each starting from the same initial position at the center of the cart track and finishing in a failure state (either when the pendulum has tilted too much, or when the cart has hit an edge). After every failure, the state of the pendulum is reset back to the initial position and the next trial begins.
Usually, the reward function in such problems is defined as -1 in case of a failure and 0 in all other cases. By maximizing the cumulative reward, the RL agent thus effectively learns to avoid failure states, which include tilting the pendulum beyond a certain angle as well as hitting the left or right edge of the cart track. The target is, for example, to keep the pendulum balanced for at least 100 000 steps in a single trial.
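The failure conditions and reward described above can be sketched as follows; the specific thresholds (`ANGLE_LIMIT`, `TRACK_LIMIT`) are illustrative values commonly used for cart-pole benchmarks, not figures taken from this paper.

```python
import math

# Illustrative cart-pole limits (assumed values, not from the paper).
ANGLE_LIMIT = 12 * math.pi / 180   # maximum pole tilt from upright, in radians
TRACK_LIMIT = 2.4                  # half-length of the cart track, in metres

def is_failure(theta: float, x: float) -> bool:
    """Failure: the pole tilts too far, or the cart hits a track edge."""
    return abs(theta) > ANGLE_LIMIT or abs(x) > TRACK_LIMIT

def reward(theta: float, x: float) -> float:
    """-1 on failure, 0 otherwise: maximising return means avoiding failure."""
    return -1.0 if is_failure(theta, x) else 0.0
```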
For this particular problem, a general RL algorithm like Q-learning will need more than 1000 trials and more than 200 000 total steps to reach the target. Why does it take so long?
One main reason is poor state space exploration. Practically all RL algorithms follow the same trial scheme: each time a failure occurs, a new trial begins from the initial state. As a consequence, the state space close to the initial state is explored very well, while the state space further from it is not. In this example, the RL algorithm quickly learns to balance the pendulum around the initial position, but as the pendulum moves further and reaches the ends of the track, it fails immediately. This is understandable, since the RL agent lacks “experience” in that part of the state space.
The main problem under investigation here is how to improve state space exploration, providing enough “experience” for the RL agent in a broader part of the state space. For example, instead of exploring the same states near the initial position over and over again, it is desirable for the RL algorithm to focus on the states that lead to failure and try harder to avoid them. This would dramatically improve state space exploration and speed up the learning process. The present paper proposes a time manipulation technique to achieve this goal.
The key idea is that the time of the simulation can be manipulated in a way that forces the RL algorithm to better explore the state space in the proximity of failures, while avoiding re-visits to already well-explored regions. The proposed time manipulation consists of turning the time of the simulation backwards when a failure event occurs, while preserving the learned policy as it was at the time of the failure. This technique is shown to substantially improve the learning speed and state space exploration of Q-learning and Actor-Critic algorithms, at the expense of additional memory. It also has the advantage of being completely transparent to the RL algorithm.
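The rewind idea can be sketched as an episode loop that saves simulation snapshots and, on failure, restores an earlier state instead of resetting to the initial one. This is a minimal sketch under assumed interfaces: `env.snapshot()`, `env.restore()`, `env.observe()`, and the `agent` methods are hypothetical names, not from the paper, and the rewind depth is an illustrative parameter.

```python
def run_with_rewind(env, agent, rewind_steps=20, max_steps=100_000):
    """Sketch: on failure, turn simulation time "backwards" by restoring a
    saved snapshot, while the agent keeps every update it has learned."""
    history = []                 # saved snapshots -> the extra memory cost
    state = env.reset()
    for _ in range(max_steps):
        history.append(env.snapshot())
        action = agent.act(state)
        next_state, reward, failed = agent_step = env.step(action)
        agent.learn(state, action, reward, next_state)
        if failed:
            # Rewind to a state shortly before the failure, keeping the policy.
            back = min(rewind_steps, len(history))
            env.restore(history[-back])
            del history[-back:]
            state = env.observe()
        else:
            state = next_state
```

Because the restored state lies close to the failure, the agent immediately re-enters the poorly explored region instead of spending steps travelling out from the initial state again.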
Section 2 describes the general RL algorithm for solving failure-avoidance problems. Section 3 is devoted to the proposed time manipulation technique for improving the previously mentioned algorithm. Section 4 describes the experimental evaluation of the proposed technique on a classical benchmark RL problem.
Any general RL algorithm for solving failure-avoidance problems can serve to explain the proposed time manipulation technique. Probably the most widely known and used such algorithm is Q-learning [10].
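For reference, the standard tabular Q-learning update is Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) − Q(s,a)]. A minimal sketch follows; the closure-based structure, ε-greedy exploration, and hyperparameter values are illustrative choices, not details from the paper.

```python
import random
from collections import defaultdict

def make_q_learner(actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal tabular Q-learning [10]; hyperparameters are illustrative."""
    Q = defaultdict(float)           # Q[(state, action)] -> value estimate

    def act(state):
        # Epsilon-greedy: explore with probability epsilon, else exploit.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def learn(state, action, reward, next_state):
        # One-step temporal-difference update toward the best next action.
        best_next = max(Q[(next_state, a)] for a in actions)
        td_error = reward + gamma * best_next - Q[(state, action)]
        Q[(state, action)] += alpha * td_error

    return Q, act, learn
```

Note that the rewind technique leaves this learner untouched: only the sequence of states the simulation feeds it changes, which is why the technique is transparent to the RL algorithm.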
…(Full text truncated)…