The Divergence of Reinforcement Learning Algorithms with Value-Iteration and Function Approximation


This paper gives specific divergence examples of value-iteration for several major Reinforcement Learning and Adaptive Dynamic Programming algorithms, when using a function approximator for the value function. These divergence examples differ from previous divergence examples in the literature, in that they are applicable for a greedy policy, i.e. in a “value iteration” scenario. Perhaps surprisingly, with a greedy policy, it is also possible to get divergence for the algorithms TD(1) and Sarsa(1). In addition to these divergences, we also achieve divergence for the Adaptive Dynamic Programming algorithms HDP, DHP and GDHP.


💡 Research Summary

The paper “The Divergence of Reinforcement Learning Algorithms with Value‑Iteration and Function Approximation” investigates a fundamental stability issue that arises when reinforcement‑learning (RL) and adaptive‑dynamic‑programming (ADP) algorithms learn a value function with a general smooth function approximator while the agent follows a greedy (value‑iteration) policy. While classic convergence proofs for TD(λ), Sarsa(λ), HDP, DHP, GDHP and related methods assume a fixed policy (policy‑evaluation) or a perfect value function, this work shows that the same algorithms can diverge under a greedy policy even when the approximator is only mildly nonlinear.

The authors construct a minimal deterministic test problem with a one‑dimensional state x, a one‑dimensional action a, and three time steps (t = 0, 1, 2). The transition dynamics are linear (x_{t+1} = x_t + a_t for t = 0, 1, and x_{t+1} = x_t for t = 2) and the reward is –k a_t^2 for the first two steps and –x_2^2 at the final step, with k > 0. Starting from x_0 = 0, the optimal policy is trivially a_0 = a_1 = 0.
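The three-step problem is small enough to sketch directly. The following is a minimal illustration of the dynamics and rewards described above (not the authors' code); the value k = 1 is chosen purely for illustration, since the analysis holds for any k > 0:

```python
# Minimal sketch of the paper's three-step test problem.
K = 1.0  # illustrative; the paper only requires k > 0

def step(x, a, t):
    """Deterministic transition and reward at time step t = 0, 1, 2."""
    if t < 2:
        return x + a, -K * a ** 2   # x_{t+1} = x_t + a_t, reward -k a_t^2
    return x, -x ** 2               # t = 2: state frozen, terminal cost -x_2^2

def rollout(x0, a0, a1):
    """Total undiscounted return of one episode for actions (a0, a1)."""
    x1, r0 = step(x0, a0, 0)
    x2, r1 = step(x1, a1, 1)
    _, r2 = step(x2, 0.0, 2)
    return r0 + r1 + r2
```

From x0 = 0 the return is −(k a0² + k a1² + (a0 + a1)²), so any nonzero action pair strictly lowers the return, consistent with the optimal policy a_0 = a_1 = 0.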

The critic is defined as a piece‑wise quadratic function with four parameters w = (w1,w2,w3,w4). For t = 1 the value is V(x, w)= –c1 x^2 + w1 x + w3, and for t = 2 it is V(x, w)= –c2 x^2 + w2 x + w4, where c1,c2>0. The corresponding value‑gradient G = ∂V/∂x is linear in x and directly depends on the weight vector.
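A direct transcription of this critic and its gradient might look as follows; the constants c1 = c2 = 1 are illustrative placeholders, not values taken from the paper:

```python
C1, C2 = 1.0, 1.0  # illustrative; the paper only requires c1, c2 > 0

def V(x, w, t):
    """Piece-wise quadratic critic with weights w = (w1, w2, w3, w4)."""
    w1, w2, w3, w4 = w
    if t == 1:
        return -C1 * x ** 2 + w1 * x + w3
    if t == 2:
        return -C2 * x ** 2 + w2 * x + w4
    raise ValueError("critic defined for t = 1 and t = 2 only")

def G(x, w, t):
    """Value-gradient dV/dx: linear in x and in the weights."""
    return -2 * C1 * x + w[0] if t == 1 else -2 * C2 * x + w[1]
```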

Using the greedy policy definition a = argmax_a Q(x,a,w) with Q(x,a,w)= r(x,a)+γ V(f(x,a),w) (γ=1), the authors derive closed‑form expressions for the greedy actions a0 and a1 as functions of the current weights. Substituting these actions back into the dynamics yields explicit formulas for the next states x1 and x2, and consequently for the gradients G1 and G2 along the greedy trajectory. Crucially, the Jacobian ∂π/∂x is a constant that depends only on the problem constants (c_t and k), which allows the authors to compute the total derivative Df/Dx = ∂f/∂x + (∂π/∂x) ∂f/∂a and Dr/Dx similarly.
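With the quadratic critic the argmax is available in closed form: maximising Q(x, a, w) = −k a² + V(x + a, w) over a, where V(y) = −c y² + w_i y + const, gives a = (w_i − 2 c x) / (2 (k + c)). A sketch (constants illustrative, as before):

```python
K = 1.0
C1, C2 = 1.0, 1.0  # illustrative constants

def greedy_action(x, w, t):
    """Closed-form greedy action at t = 0 or t = 1.

    Q(x, a, w) = -K a^2 + V(x + a, w, t + 1) with the quadratic critic
    V(y) = -c y^2 + w_i y + const; solving dQ/da = 0 (a maximum, since
    d^2Q/da^2 = -2(K + c) < 0) gives a = (w_i - 2 c x) / (2 (K + c)).
    """
    c, wi = (C1, w[0]) if t == 0 else (C2, w[1])
    return (wi - 2.0 * c * x) / (2.0 * (K + c))
```

Note that ∂a/∂x = −c/(K + c) is a constant depending only on the problem constants, which is exactly the property exploited when computing the total derivatives Df/Dx and Dr/Dx.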

The core of the analysis focuses on the Value‑Gradient Learning (VGL(λ)) algorithm. VGL updates the weight vector by Δw = α Σ_t (∂G_t/∂w) Ω_t (G⁰_t – G_t), where G⁰_t is a recursively defined target gradient that incorporates the λ‑return of gradients. By substituting the explicit expressions for G, G⁰, Df/Dx, Dr/Dx, and the constant Ω_t (often taken as the identity), the authors reduce the entire learning process to a discrete‑time dynamical system of the form w_{k+1} = F(w_k). They then show analytically that, for a wide range of parameter choices (e.g., a sufficiently large learning rate α, or certain ratios of c1, c2, and k), the Jacobian of F at the fixed point has eigenvalues with magnitude greater than one, guaranteeing that the iterates diverge rather than converge to the optimal weight vector.
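The divergence criterion, a spectral radius greater than one for the Jacobian of F at the fixed point, can also be checked numerically for any candidate weight-update map w ↦ F(w). A generic sketch of such a check (this is a numerical stand-in, not the paper's closed-form analysis):

```python
import numpy as np

def spectral_radius(F, w_star, eps=1e-6):
    """Estimate the Jacobian of the weight-update map w -> F(w) at a fixed
    point w_star by central differences and return its spectral radius.
    A value greater than 1 means the fixed point repels nearby iterates,
    i.e. the learning iteration diverges locally."""
    w_star = np.asarray(w_star, dtype=float)
    n = w_star.size
    J = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = (F(w_star + e) - F(w_star - e)) / (2.0 * eps)
    return float(max(abs(np.linalg.eigvals(J))))
```

For instance, the expanding map F(w) = 2w has spectral radius 2 at its fixed point 0, flagging local divergence, while F(w) = w/2 has radius 0.5 and is locally stable.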

Having identified a set of parameters that cause VGL(λ) to diverge, the authors empirically test the same parameters on TD(λ) and Sarsa(λ). Because the TD and Sarsa weight updates can be expressed in a form similar to VGL when the policy is greedy (the policy gradient enters the update in the same way), the same instability propagates, and both TD(1) and Sarsa(1) exhibit divergent behavior on the test problem. This is notable because TD(1) is traditionally regarded as true gradient descent on the mean‑squared error between the value estimate and the observed return when the policy is fixed; the paper demonstrates that the coupling introduced by the greedy policy destroys this guarantee.
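For reference, a generic accumulating-trace TD(λ) update along a sampled trajectory can be sketched as follows (an illustrative textbook form, not the paper's implementation); with λ = 1 this is the variant shown to diverge under the greedy policy:

```python
import numpy as np

def td_lambda_update(w, states, rewards, V, dVdw, alpha=0.1, lam=1.0, gamma=1.0):
    """One pass of accumulating-trace TD(lambda) along one trajectory.

    `states` has one more entry than `rewards`; V(x, w) is the critic and
    dVdw(x, w) its gradient with respect to the weights.
    """
    w = np.array(w, dtype=float)
    z = np.zeros_like(w)  # eligibility trace
    for t in range(len(rewards)):
        # TD error at step t along the (greedy) trajectory
        delta = rewards[t] + gamma * V(states[t + 1], w) - V(states[t], w)
        z = gamma * lam * z + dVdw(states[t], w)
        w = w + alpha * delta * z
    return w
```

Under a fixed policy this update is well behaved for λ = 1; the paper's point is that when `states` is regenerated greedily from the current w after every update, the closed loop can diverge.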

The paper also extends the divergence results to three classic ADP algorithms: Heuristic Dynamic Programming (HDP), Dual Heuristic Programming (DHP), and Globalized Dual Heuristic Programming (GDHP). HDP uses the TD(0) update on the value function, DHP uses the VGL(0) update on the value‑gradient, and GDHP combines both. Since the underlying updates are the same as those already shown to diverge under the greedy policy, the ADP methods inherit the same instability. The authors emphasize that earlier convergence proofs for these ADP methods assumed either a perfect value function (i.e., a critic able to represent the true value function everywhere) or a fixed policy, assumptions that are violated in the greedy, function‑approximation setting.

In the discussion, the authors reflect on the practical implications. Value‑iteration (or a greedy actor trained to completion between each critic update) can be attractive because it avoids the need for a separate exploration policy and can accelerate learning. However, the analysis shows that when the critic is a nonlinear approximator with limited capacity, the feedback loop between the policy and the critic can become unstable. They suggest that practitioners should either restrict the function class (e.g., linear approximators), employ regularization or projection methods to keep the weight updates within a stable region, or revert to policy‑iteration schemes where the policy is held fixed while the critic converges.

Overall, the paper provides a clear, mathematically rigorous demonstration that many widely used RL and ADP algorithms are not universally stable under greedy policies with function approximation. It fills a gap in the literature by moving the focus from fixed‑policy convergence to the more realistic scenario where the policy is continually updated to be greedy with respect to the current critic, and it offers concrete guidance on how to avoid divergence in practice.

