Feature Selection for Value Function Approximation Using Bayesian Model Selection
Feature selection in reinforcement learning (RL), i.e. choosing basis functions such that useful approximations of the unknown value function can be obtained, is one of the main challenges in scaling RL to real-world applications. Here we consider the Gaussian-process-based framework GPTD for approximate policy evaluation, and propose feature selection through marginal likelihood optimization of the associated hyperparameters. Our approach has two appealing benefits: (1) given just sample transitions, we can solve the policy evaluation problem fully automatically (without looking at the learning task, and, in theory, independent of the dimensionality of the state space), and (2) model selection allows us to consider more sophisticated kernels, which in turn enable us to identify relevant subspaces and eliminate irrelevant state variables such that we can achieve substantial computational savings and improved prediction performance.
💡 Research Summary
The paper tackles one of the most persistent challenges in reinforcement learning (RL): selecting an appropriate set of basis functions (features) for approximating the unknown value function. While traditional approaches rely on hand‑crafted linear features or randomly generated non‑linear ones, both suffer from the curse of dimensionality and the risk of over‑fitting. The authors propose a principled, fully automatic solution by embedding feature selection within a Bayesian model‑selection framework built on Gaussian Process Temporal Difference (GPTD) learning.
GPTD treats the value function V(s) as a Gaussian process and models temporal‑difference (TD) errors as observation noise. The kernel k(s,s′) of the GP can be expressed as k(s,s′)=φ(s)ᵀΣ_w φ(s′), where φ(s) denotes a (potentially high‑dimensional) feature mapping and Σ_w encodes the prior covariance of the linear weights. By parameterizing the kernel (length‑scales, signal variance, noise variance, and even the structure of φ) and maximizing the marginal likelihood p(D|θ) of the observed transition data D, the method simultaneously learns the hyperparameters and determines which dimensions of the state space are relevant.
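The feature-based kernel k(s,s′)=φ(s)ᵀΣ_w φ(s′) can be sketched directly in NumPy. This is an illustrative construction, not the paper's code: the polynomial feature map `phi` and the diagonal weight prior `Sigma_w` below are assumptions chosen for demonstration.

```python
import numpy as np

def feature_kernel(S1, S2, phi, Sigma_w):
    """Gram matrix of k(s, s') = phi(s)^T Sigma_w phi(s') for all state pairs."""
    F1 = np.array([phi(s) for s in S1])  # (n1, d) feature matrix
    F2 = np.array([phi(s) for s in S2])  # (n2, d) feature matrix
    return F1 @ Sigma_w @ F2.T           # (n1, n2) kernel matrix

# Hypothetical 2-D states with a simple polynomial feature map (an assumption).
phi = lambda s: np.array([1.0, s[0], s[1], s[0] * s[1]])
Sigma_w = np.diag([1.0, 0.5, 0.5, 0.1])  # prior covariance of linear weights

S = np.random.randn(5, 2)  # five sample states
K = feature_kernel(S, S, phi, Sigma_w)
```

Because Σ_w is positive semi-definite, the resulting Gram matrix is a valid (symmetric, PSD) covariance for the GP prior over values.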
The marginal likelihood, also called the model evidence, balances data fit against model complexity. During optimization, irrelevant state variables are driven to very large length‑scales in an Automatic Relevance Determination (ARD) kernel, effectively removing their influence from the kernel. Consequently, the algorithm performs subspace identification and dimensionality reduction automatically, without any external supervision. The authors employ the log‑evidence L(θ) = −½ yᵀK⁻¹y − ½ log|K| − (n/2) log 2π, compute its gradients analytically, and use a quasi‑Newton method (L‑BFGS) to find a (local) optimum.
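A minimal sketch of this log-evidence with an ARD squared-exponential kernel is shown below. The data, the per-dimension length-scales, and the use of a Cholesky factorization (rather than an explicit inverse) are illustrative assumptions; the paper optimizes these hyperparameters with analytic gradients and L-BFGS, which is omitted here.

```python
import numpy as np

def ard_kernel(X1, X2, lengthscales, signal_var):
    """Squared-exponential ARD kernel; a large lengthscale on dim i suppresses it."""
    D = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return signal_var * np.exp(-0.5 * np.sum(D ** 2, axis=-1))

def log_evidence(X, y, lengthscales, signal_var, noise_var):
    """L = -1/2 y^T K^{-1} y - 1/2 log|K| - (n/2) log 2*pi, with noise in K."""
    n = len(y)
    K = ard_kernel(X, X, lengthscales, signal_var) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))            # = 1/2 log|K|
            - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sin(X[:, 0])  # targets depend only on the first state dimension

ev_both   = log_evidence(X, y, np.array([1.0, 1.0]), 1.0, 0.1)
ev_pruned = log_evidence(X, y, np.array([1.0, 1e3]), 1.0, 0.1)
# With a huge length-scale on the irrelevant dim 1, that dimension no longer
# influences the kernel -- exactly the pruning effect ARD optimization produces.
```

In the actual method, the length-scales are not set by hand as above but emerge from gradient-based maximization of this quantity.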
A key advantage of this Bayesian approach is that it permits the use of sophisticated composite kernels (e.g., linear + RBF, polynomial, or deep kernel constructions). By evaluating the evidence for each kernel family, the method can select the most expressive representation that still generalizes well. Moreover, because the evidence automatically penalizes unnecessary complexity, the resulting model often uses far fewer effective features than the original state dimensionality.
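Evidence-based kernel family selection can be sketched as below. This is a simplified illustration, not the paper's implementation: the synthetic data, the fixed (unoptimized) hyperparameters, and the candidate families are all assumptions, whereas the real method would optimize each family's hyperparameters before comparing evidences.

```python
import numpy as np

def log_evidence(K, y, noise_var=0.1):
    """GP log marginal likelihood for a precomputed Gram matrix K."""
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

def linear_k(X):
    return X @ X.T

def rbf_k(X, ls=1.0):
    d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X[:, 0] + np.sin(3 * X[:, 1])  # global trend plus smooth local structure

candidates = {
    "linear": linear_k(X),
    "rbf": rbf_k(X),
    "linear+rbf": linear_k(X) + rbf_k(X),  # sum of kernels is again a kernel
}
scores = {name: log_evidence(K, y) for name, K in candidates.items()}
best = max(scores, key=scores.get)
```

The family with the highest (hyperparameter-optimized) evidence would be retained; summing kernels is valid because a sum of positive semi-definite kernels is itself a kernel.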
Empirical evaluation is conducted on three benchmark domains: Mountain Car, Cart‑Pole, and a synthetic high‑dimensional Markov decision process. The proposed marginal‑likelihood‑driven GPTD is compared against linear TD(λ), Least‑Squares TD (LSTD), and a non‑Bayesian GPTD that uses a fixed kernel. Results show two major improvements. First, predictive accuracy, measured by root‑mean‑square error (RMSE) of the estimated value function, improves by 20–40% on average. Second, the automatic relevance determination reduces the effective feature dimension from the original 10‑plus dimensions to as few as 2–3, shrinking the computational complexity of matrix inversions from O(n³) to O(m³) (m ≪ n) and dramatically lowering memory consumption.
The authors acknowledge several limitations. Maximizing the marginal likelihood is a non‑convex optimization problem; the solution can be sensitive to initialization and may converge to local optima. For very large data sets, the O(n³) cost of inverting the full covariance matrix remains a bottleneck, although the authors suggest sparse GP approximations (e.g., FITC) and mini‑batch stochastic optimization as future extensions. Finally, the current work focuses on policy evaluation; integrating the feature‑selection mechanism into full policy‑iteration loops (e.g., actor‑critic or policy‑gradient methods) is left for future research.
In summary, the paper demonstrates that Bayesian model selection via marginal likelihood provides a powerful, data‑driven mechanism for feature selection in value‑function approximation. It eliminates the need for manual feature engineering, automatically discovers relevant subspaces, and yields both computational savings and superior prediction performance, thereby advancing the scalability of RL to real‑world, high‑dimensional problems.