Dynamic Policy Programming
In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. We prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds for DPP in the presence of approximation/estimation error. The bounds are expressed in terms of the ℓ∞-norm of the average accumulated error, as opposed to the ℓ∞-norm of the per-iteration error in the case of standard approximate value iteration (AVI) and approximate policy iteration (API). This suggests that DPP can achieve better performance than AVI and API, since it averages out the simulation noise caused by Monte-Carlo sampling throughout the learning process. We examine these theoretical results numerically by comparing the performance of approximate variants of DPP with existing reinforcement learning (RL) methods on different problem domains. Our results show that, in all cases, DPP-based algorithms outperform the other RL methods by a wide margin.
💡 Research Summary
The paper introduces Dynamic Policy Programming (DPP), a novel policy‑iteration scheme designed to compute the optimal policy for infinite‑horizon Markov decision processes (MDPs). Traditional approximate dynamic programming (ADP) methods—Approximate Value Iteration (AVI) and Approximate Policy Iteration (API)—measure performance loss directly in terms of the supremum (ℓ∞) norm of the per‑iteration approximation error εₖ. Because Monte‑Carlo sampling introduces high‑variance noise, the supremum norm can become large, slowing convergence and degrading the quality of the resulting policy.
DPP tackles this issue by basing its error analysis on the average accumulated error (\bar ε_k = \frac{1}{k+1}\sum_{j=0}^{k} ε_j) rather than the instantaneous error. Under standard stochastic assumptions (i.i.d. samples or martingale differences), the law of large numbers guarantees that the average error shrinks as the number of iterations grows, effectively “averaging out” simulation noise.
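This averaging effect is easy to illustrate numerically. The sketch below is a toy model of our own (not from the paper): the per-iteration error εₖ is simulated as i.i.d. zero-mean Gaussian noise over a handful of states, and the sup-norm of the running average \bar ε_k shrinks with k while the sup-norm of the instantaneous error does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_iters, n_states = 10_000, 10

# eps[k] plays the role of the per-iteration estimation error ε_k:
# zero-mean simulation noise from Monte-Carlo sampling (toy model).
eps = rng.normal(0.0, 1.0, size=(n_iters, n_states))

# Running average \bar ε_k = 1/(k+1) * Σ_{j<=k} ε_j.
avg = np.cumsum(eps, axis=0) / np.arange(1, n_iters + 1)[:, None]

inst_norm = np.abs(eps).max(axis=1)   # ||ε_k||_∞ — what AVI/API bounds depend on
avg_norm = np.abs(avg).max(axis=1)    # ||\bar ε_k||_∞ — what DPP bounds depend on

print(f"||eps_k||_inf at last iteration:   {inst_norm[-1]:.3f}")
print(f"||avg err||_inf at last iteration: {avg_norm[-1]:.3f}")
```

By the law of large numbers the averaged norm decays on the order of 1/√k here, while the instantaneous norm stays O(1), which is exactly the gap the DPP bounds exploit.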
The algorithm is derived by adding a relative‑entropy regularization term (g_{\pi\bar\pi}(x)=KL(\pi(\cdot|x)\,\|\,\bar\pi(\cdot|x))) to the reward and introducing a Lagrange multiplier η. This yields a soft‑max policy update that can be written in closed form (Equations 5‑6). When η→∞ the soft‑max collapses to a hard max, recovering standard value iteration; for finite η the policy updates are smooth and incremental.
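The η→∞ limit can be sketched for a single state. The function name `softmax_policy` and the preference vector below are illustrative choices of ours, not from the paper:

```python
import numpy as np

def softmax_policy(psi, eta):
    """Boltzmann policy π(a|x) ∝ exp(η Ψ(x,a)) over action preferences."""
    z = eta * (psi - psi.max())   # shift by the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

psi = np.array([1.0, 1.5, 0.5])   # toy action preferences for one state

for eta in (0.1, 1.0, 100.0):
    # Small η: near-uniform, incremental updates; large η: → hard max.
    print(f"eta={eta:6.1f}  pi={softmax_policy(psi, eta).round(3)}")
```

At η=100 essentially all probability mass sits on the argmax action, matching the hard-max limit; at η=0.1 the policy stays close to uniform, which is what makes the updates smooth.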
Crucially, DPP replaces the conventional double‑loop structure (policy update → value update) with a single‑loop fixed‑point iteration on an action‑preference function Ψₖ(x,a). The core recursion is (\Psi_{k+1}(x,a) = \Psi_k(x,a) - (M_\eta\Psi_k)(x) + r(x,a) + \gamma\sum_y P(y|x,a)\,(M_\eta\Psi_k)(y)), where ((M_\eta\Psi)(x) = \sum_a \frac{e^{\eta\Psi(x,a)}}{\sum_{a'} e^{\eta\Psi(x,a')}}\,\Psi(x,a)) is the Boltzmann‑weighted soft‑max of the action preferences, and the soft‑max policy is read off from Ψₖ at every iteration.
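A minimal tabular sketch of this single-loop iteration, assuming the Boltzmann-weighted soft-max operator M_η and a toy deterministic 2-state, 2-action MDP of our own construction (not from the paper), where action 1 pays reward 1 and the optimal value is V* = 1/(1−γ) = 10:

```python
import numpy as np

def m_eta(psi, eta):
    """Boltzmann-weighted average (M_η Ψ)(x) = Σ_a π(a|x) Ψ(x,a),
    with π(a|x) ∝ exp(η Ψ(x,a)); shifted by the row max for stability."""
    z = eta * (psi - psi.max(axis=1, keepdims=True))
    w = np.exp(z)
    pi = w / w.sum(axis=1, keepdims=True)
    return (pi * psi).sum(axis=1)

def dpp_step(psi, r, P, gamma, eta):
    """One application of the DPP operator:
    Ψ'(x,a) = Ψ(x,a) - (M_η Ψ)(x) + r(x,a) + γ Σ_y P(y|x,a) (M_η Ψ)(y)."""
    m = m_eta(psi, eta)
    return psi - m[:, None] + r + gamma * np.einsum("xay,y->xa", P, m)

# Toy MDP: action a moves deterministically to state a; a=1 pays reward 1.
P = np.zeros((2, 2, 2))          # P[x, a, y] = Pr(y | x, a)
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
r = np.array([[0.0, 1.0], [0.0, 1.0]])

psi = np.zeros((2, 2))           # action preferences Ψ_0 = 0
for _ in range(200):
    psi = dpp_step(psi, r, P, gamma=0.9, eta=1.0)

print("greedy actions:", psi.argmax(axis=1))
print("M_eta Psi:", m_eta(psi, 1.0).round(3))
```

The preference of the suboptimal action diverges to −∞ while the optimal action's preference stabilizes, so the induced soft-max policy concentrates on the optimal action and M_η Ψₖ approaches V* = 10, illustrating the single-loop convergence without a separate policy-evaluation phase.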