Adaptive Bases for Reinforcement Learning
We consider the problem of reinforcement learning using function approximation, where the approximating basis can change dynamically while interacting with the environment. A motivation for such an approach is to maximize how well the value-function representation fits the problem at hand. Three errors are considered: the approximation square error, the Bellman residual, and the projected Bellman residual. Algorithms under the actor-critic framework are presented and shown to converge. The advantage of such an adaptive basis is demonstrated in simulations.
💡 Research Summary
The paper addresses a fundamental limitation in reinforcement learning (RL) that relies on function approximation: the basis functions used to represent the value function are typically fixed throughout learning. Fixed bases may be ill‑suited to the specific dynamics and reward structure of a given environment, leading to suboptimal approximation quality and slower convergence. To overcome this, the authors propose an “adaptive basis” framework in which the parameters defining the basis functions are themselves learned online, concurrently with the policy and value‑function parameters.
Three distinct error measures are introduced as objectives for adapting the basis: (1) the approximation square error, which penalizes the (weighted) squared distance between the true value function Vπ and its parametric estimate V̂(s) = wᵀφθ(s); (2) the Bellman residual, which measures the discrepancy between the Bellman operator applied to the current estimate, TπV̂, and V̂ itself; and (3) the projected Bellman residual, which first applies the Bellman operator and then projects the result back onto the span of the current basis, thereby assessing how well the basis space can capture the Bellman‑transformed values. For each error measure, the authors derive explicit stochastic‑gradient expressions with respect to both the basis parameters θ and the linear weights w that combine the basis functions.
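The three error measures can be made concrete on a small Markov chain where Vπ is computable in closed form. The sketch below (a hypothetical two-state example, not taken from the paper) fixes a one-feature basis Φ and weight w, and evaluates the approximation error, the Bellman residual, and the projected Bellman residual in the norm weighted by the stationary distribution:

```python
import numpy as np

# Toy 2-state chain under a fixed policy (illustrative numbers):
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # transition matrix
r = np.array([1.0, 0.0])         # expected rewards
gamma = 0.9

# True value function: V^pi = (I - gamma P)^{-1} r
V = np.linalg.solve(np.eye(2) - gamma * P, r)

# One-feature linear basis and its weight
Phi = np.array([[1.0], [0.5]])
w = np.array([2.0])
V_hat = Phi @ w

# Stationary distribution d (left eigenvector of P for eigenvalue 1)
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d = d / d.sum()
D = np.diag(d)

def d_norm(x):
    """Weighted norm ||x||_D induced by the stationary distribution."""
    return np.sqrt(x @ D @ x)

T_V_hat = r + gamma * P @ V_hat                          # Bellman operator on the estimate
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)   # D-weighted projection onto span(Phi)

approx_err = d_norm(V_hat - V)                 # (1) approximation error
bellman_res = d_norm(T_V_hat - V_hat)          # (2) Bellman residual
proj_bellman_res = d_norm(Pi @ T_V_hat - V_hat)  # (3) projected Bellman residual
```

Because Π is non-expansive in the D-weighted norm and V̂ lies in the span of Φ, the projected Bellman residual never exceeds the plain Bellman residual, which the sketch confirms numerically.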
Building on these gradients, the paper presents a family of actor‑critic algorithms. In the critic step, the current basis φθ and weight vector w are used to compute a temporal‑difference (TD) error δt = r_{t+1} + γ wᵀφθ(s_{t+1}) − wᵀφθ(s_t). This δt drives updates of w in the usual TD(0) direction, δt·φθ(s_t), and simultaneously updates θ either via the gradient of the approximation error, δt·∇θ(wᵀφθ(s_t)), or via the gradient of one of the Bellman‑type residuals. The actor step updates the policy parameters ψ using the standard policy‑gradient term δt·∇ψ log πψ(a_t|s_t). Crucially, the learning rates for θ, w, and ψ operate on different time scales: the critic parameters (θ, w) adapt quickly, while the actor parameters ψ evolve more slowly.
The convergence analysis employs a two‑time‑scale stochastic approximation framework. Under standard assumptions—finite state and action spaces, bounded rewards, Robbins‑Monro step‑size conditions, and a stationary Markov chain induced by any fixed policy—the fast‑time‑scale ODE governing (θ, w) is shown to converge to a stable equilibrium that minimizes the chosen error measure for the current ψ. The slow‑time‑scale ODE for ψ then sees the critic as quasi‑static and converges to a locally optimal policy. The proof leverages Lyapunov stability arguments and establishes that the combined algorithm almost surely converges to a stationary point of the overall objective.
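The step-size requirements behind this analysis are easy to check numerically. The snippet below uses illustrative polynomial schedules (not necessarily the paper's choices) satisfying the standard two-time-scale conditions: each schedule sums to infinity with a summable square, and the slow (actor) rate vanishes relative to the fast (critic) rate:

```python
import numpy as np

t = np.arange(1, 100001, dtype=float)

# Fast (critic) step sizes: sum alpha_t diverges, sum alpha_t^2 converges.
alpha = t ** -0.6
# Slow (actor) step sizes: same Robbins-Monro properties, but decaying faster.
beta = t ** -0.9

# Two-time-scale condition: beta_t / alpha_t -> 0, here t ** -0.3.
ratio = beta / alpha
```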
Empirical validation is performed on two benchmark domains. In the continuous‑state Mountain Car problem, adaptive bases (implemented as parameterized radial basis functions) achieve a 30 % reduction in mean‑squared value error during early learning and reach higher average returns than fixed polynomial or static RBF bases. In a modified Gridworld with non‑linear reward shaping, the adaptive approach yields a 15 % improvement in final performance and demonstrates rapid re‑adjustment when the reward function is altered mid‑training. These experiments illustrate that the basis can automatically reshape itself to better capture the geometry of the value function, leading to faster learning and more robust policies.
In conclusion, the paper shows that allowing the representation itself to evolve—rather than treating it as a static design choice—significantly enhances the expressive power of function approximation in RL. The adaptive‑basis actor‑critic algorithms are theoretically sound, converge under realistic conditions, and empirically outperform traditional fixed‑basis methods. The authors suggest future work on integrating deep neural networks as adaptive bases, extending the framework to multi‑agent settings, and exploring meta‑learning strategies for initializing basis parameters.