Dynamic Policy Programming

Reading time: 6 minutes
...

📝 Original Info

  • Title: Dynamic Policy Programming
  • ArXiv ID: 1004.2027
  • Date: 2011-09-09
  • Authors: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen

📝 Abstract

In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. We prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds for DPP in the presence of approximation/estimation error. The bounds are expressed in terms of the ℓ∞-norm of the average accumulated error, as opposed to the ℓ∞-norm of the error in the case of standard approximate value iteration (AVI) and approximate policy iteration (API). This suggests that DPP can achieve better performance than AVI and API, since it averages out the simulation noise caused by Monte-Carlo sampling throughout the learning process. We examine these theoretical results numerically by comparing the performance of the approximate variants of DPP with existing reinforcement learning (RL) methods on different problem domains. Our results show that, in all cases, DPP-based algorithms outperform other RL methods by a wide margin.


📄 Full Content

Many problems in robotics, operations research, and process control can be represented as a control problem that can be solved by finding the optimal policy using dynamic programming (DP). DP is based on estimating some measure of the value of a state-action pair, Q*(x, a), through the Bellman equation. For high-dimensional discrete systems or for continuous systems, computing the value function by DP is intractable. The common approach to making the computation tractable is to approximate the value function using function approximation and Monte-Carlo sampling (Szepesvári, 2010; Bertsekas and Tsitsiklis, 1996). Examples of such approximate dynamic programming (ADP) methods are approximate policy iteration (API) and approximate value iteration (AVI) (Bertsekas, 2007; Lagoudakis and Parr, 2003; Perkins and Precup, 2002; de Farias and Roy, 2000).
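To make the Bellman backup concrete, here is a minimal sketch of exact value iteration on a toy tabular MDP. The MDP itself (state/action counts, random transitions and rewards) is illustrative and not from the paper; only the Bellman optimality update is the standard construction the paragraph refers to.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (sizes and contents are illustrative).
n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a] = P(·|x, a)
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # r[x, a] = immediate reward

# Exact value iteration: repeatedly apply the Bellman optimality operator to Q.
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # (TQ)(x, a) = r(x, a) + γ Σ_y P(y|x, a) max_b Q(y, b)
    Q_next = r + gamma * P @ Q.max(axis=1)
    if np.abs(Q_next - Q).max() < 1e-10:
        break
    Q = Q_next
```

Since T is a γ-contraction in the ℓ∞-norm, the loop converges to Q* on any finite MDP; it is exactly this exact backup that becomes intractable in the high-dimensional settings the paragraph describes.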

ADP methods have been successfully applied to many real-world problems, and theoretical results have been derived in the form of finite-iteration and asymptotic performance guarantees on the induced policy (Farahmand et al., 2010; Thiery and Scherrer, 2010; Munos, 2005; Bertsekas and Tsitsiklis, 1996). The asymptotic ℓ∞-norm performance-loss bounds of API and AVI are expressed in terms of the supremum, with respect to (w.r.t.) the number of iterations, of the approximation errors:

lim sup_{k→∞} ‖Q* − Q^{π_k}‖ ≤ (2γ / (1 − γ)²) lim sup_{k→∞} ‖ε_k‖,

where γ denotes the discount factor and ‖·‖ is the ℓ∞-norm w.r.t. the state-action pair (x, a). Also, π_k and ε_k are the control policy and the approximation error at round k of the ADP algorithm, respectively. In many problems of interest, however, the supremum over the normed error ‖ε_k‖ can be large and hard to control due to the large variance of estimation caused by Monte-Carlo sampling. In those cases, a bound which instead depends on the average accumulated error ε̄_k = 1/(k + 1) Σ_{j=0}^{k} ε_j is preferable. This is because the errors associated with the variance of estimation can be considered instances of zero-mean random variables; one can therefore show, by a law-of-large-numbers argument, that those errors are asymptotically averaged out by accumulating the approximation errors of all iterations. In this paper, we propose a new mathematically justified approach to estimate the optimal policy, called dynamic policy programming (DPP). We prove finite-iteration and asymptotic performance-loss bounds for the policy induced by DPP in the presence of approximation. The asymptotic bound of approximate DPP is expressed in terms of the average accumulated error ε̄_k, as opposed to ε_k in the case of AVI and API. This result suggests that DPP may perform better than AVI and API in the presence of large estimation variance, since it can average out the estimation errors throughout the learning process. The dependency on the average error ε̄_k follows naturally from the incremental policy update of DPP, which at each round of policy update, unlike AVI and API, accumulates the approximation errors of the previous iterations, rather than just minimizing the approximation error of the current iteration.
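The law-of-large-numbers argument can be illustrated numerically: for zero-mean per-iteration errors ε_k, the norm of the running average ε̄_k shrinks even though the supremum over the individual error norms does not. The error model below (i.i.d. Gaussian noise over a small state-action table) is an assumption for illustration, not the paper's estimation error.

```python
import numpy as np

# Illustrative zero-mean errors ε_k over K iterations and a 10-entry state-action table.
rng = np.random.default_rng(1)
K, dim = 5000, 10
eps = rng.normal(0.0, 1.0, size=(K, dim))

# sup_k ‖ε_k‖_∞ — stays of order 1; this is the quantity governing the AVI/API bounds.
sup_norm = np.abs(eps).max()

# ‖ε̄_K‖_∞ with ε̄_K = (1/K) Σ_k ε_k — shrinks like O(1/√K); this governs the DPP bound.
avg_norm = np.abs(eps.mean(axis=0)).max()
```

With these sizes, `avg_norm` is roughly two orders of magnitude smaller than `sup_norm`, matching the intuition that accumulating errors across iterations averages out simulation noise.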

This article is organized as follows. In Section 2, we present the notation used in this paper. We introduce DPP and investigate its convergence properties in Section 3. In Section 4, we demonstrate the compatibility of our method with approximation techniques: we generalize the DPP bounds to the case of function approximation and Monte-Carlo simulation, and we introduce a new convergent RL algorithm, called DPP-RL, which relies on an approximate sample-based variant of DPP to estimate the optimal policy. Section 5 presents numerical experiments on several problem domains, including the optimal replacement problem (Munos and Szepesvári, 2008) and a stochastic grid world. In Section 6 we briefly review some related work. Finally, we discuss some of the implications of our work in Section 7.

In this section, we introduce some concepts and definitions from the theory of Markov decision processes (MDPs) and reinforcement learning (RL), as well as some standard notation. We begin with the definitions of the ℓ2-norm (Euclidean norm) and the ℓ∞-norm (supremum norm). Assume that Y is a finite set. Given a probability measure μ over Y and a real-valued function g : Y → ℝ, we denote the ℓ2-norm and the weighted ℓ2,μ-norm of g by ‖g‖₂² = Σ_{y∈Y} g(y)² and ‖g‖₂,μ² = Σ_{y∈Y} μ(y) g(y)², respectively. Also, the ℓ∞-norm of g is defined by ‖g‖∞ = max_{y∈Y} |g(y)|.
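The three norms above are straightforward to compute on a finite set; a minimal sketch (the function values and the measure μ are illustrative):

```python
import numpy as np

# A function g on a 4-element set Y, and a probability measure μ over Y.
g  = np.array([1.0, -2.0, 0.5, 3.0])
mu = np.array([0.1, 0.4, 0.3, 0.2])

l2    = np.sqrt((g ** 2).sum())        # ‖g‖₂   = (Σ_y g(y)²)^{1/2}
l2_mu = np.sqrt((mu * g ** 2).sum())   # ‖g‖₂,μ = (Σ_y μ(y) g(y)²)^{1/2}
l_inf = np.abs(g).max()                # ‖g‖_∞  = max_y |g(y)|
```

Note that ‖g‖₂,μ ≤ ‖g‖∞ always holds when μ is a probability measure, which is why supremum-norm bounds are the stronger guarantee in the analysis above.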

A discounted MDP is a quintuple (X, A, P, R, γ), where X and A are, respectively, the state space and the action space, P denotes the state-transition distribution, R denotes the reward kernel, and γ ∈ [0, 1) denotes the discount factor. The transition kernel P assigns a probability distribution over the next state upon taking action a from state x, which we denote by P(·|x, a). R is a set of real numbers; a reward r(x, a) ∈ R is associated with
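For a finite MDP, the quintuple (X, A, P, R, γ) can be held in a small container; a minimal sketch, where the class name, field layout, and validation are assumptions for illustration rather than anything defined in the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DiscountedMDP:
    """A finite discounted MDP (X, A, P, R, γ); names are illustrative."""
    P: np.ndarray   # shape (|X|, |A|, |X|): P[x, a] is the distribution P(·|x, a)
    r: np.ndarray   # shape (|X|, |A|): expected reward r(x, a)
    gamma: float    # discount factor γ ∈ [0, 1)

    def __post_init__(self):
        assert 0.0 <= self.gamma < 1.0
        # Each P(·|x, a) must be a probability distribution over next states.
        assert np.allclose(self.P.sum(axis=-1), 1.0)
```

Storing P as a dense |X|×|A|×|X| array is only feasible for small state spaces, which is precisely the tractability limitation that motivates the approximate methods discussed earlier.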

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
