Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-like Exploration

We present an implementation of model-based online reinforcement learning (RL) for continuous domains with deterministic transitions that is specifically designed to achieve low sample complexity. Since the environment is unknown, achieving low sample complexity requires an agent to intelligently balance exploration and exploitation and to generalize rapidly from observations. While a number of related sample-efficient RL algorithms have been proposed in the past, to allow theoretical analysis they mainly considered model learners with weak generalization capabilities. Here, we separate function approximation in the model learner (which does require samples) from the interpolation in the planner (which does not require samples). For model learning we apply Gaussian process regression (GP), which is able to automatically adjust itself to the complexity of the problem (via Bayesian hyperparameter selection) and is, in practice, often able to learn a highly accurate model from very little data. In addition, a GP provides a natural way to determine the uncertainty of its predictions, which allows us to implement the “optimism in the face of uncertainty” principle used to efficiently control exploration. Our method is evaluated on four common benchmark domains.


💡 Research Summary

This paper introduces a model‑based online reinforcement‑learning (RL) framework that is explicitly designed for high sample efficiency in continuous‑state, continuous‑action domains with deterministic dynamics. The authors observe that many existing sample‑efficient RL algorithms sacrifice generalization power in order to retain theoretical tractability, typically relying on simple tabular or linear function approximators. To overcome this limitation, they decouple two distinct components of the learning pipeline: (1) the model learner, which must ingest data and produce a predictive model of the environment, and (2) the planner, which uses the learned model to compute a policy but does not itself require new samples.

For the model‑learning component they employ Gaussian‑process (GP) regression. A GP provides a full posterior distribution over the transition function given a set of observed state‑action‑next‑state triples. Consequently, it yields both a mean prediction (the best estimate of the next state) and a variance that quantifies epistemic uncertainty. The variance is crucial because it enables the implementation of the “optimism in the face of uncertainty” principle that underlies many efficient exploration strategies. In practice, the GP’s hyper‑parameters (kernel length‑scales, signal variance, observation noise) are automatically tuned by maximizing the marginal likelihood, allowing the model to adapt its complexity to the data without manual intervention. This automatic Bayesian model selection often results in highly accurate dynamics models from only a handful of samples.
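As a rough sketch of this component, the minimal NumPy implementation below shows how exact GP regression with a squared-exponential kernel yields both a posterior mean and a predictive variance. The class name and fixed hyperparameters are illustrative only, and the marginal-likelihood tuning the paper relies on is omitted for brevity.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between row-vector sets A (n,d) and B (m,d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * sq / length_scale**2)

class GPRegressor:
    """Minimal exact GP regression (illustrative sketch, not the paper's code).

    Hyperparameters are fixed here; the paper instead tunes length scale,
    signal variance, and noise variance by maximizing the marginal likelihood.
    """
    def __init__(self, length_scale=1.0, signal_var=1.0, noise_var=1e-4):
        self.ls, self.sv, self.nv = length_scale, signal_var, noise_var

    def fit(self, X, y):
        self.X = X
        K = rbf_kernel(X, X, self.ls, self.sv) + self.nv * np.eye(len(X))
        self.L = np.linalg.cholesky(K)                 # O(N^3) factorization
        self.alpha = np.linalg.solve(
            self.L.T, np.linalg.solve(self.L, y))      # equals K^{-1} y
        return self

    def predict(self, Xs):
        Ks = rbf_kernel(Xs, self.X, self.ls, self.sv)  # cross-covariances (m, n)
        mean = Ks @ self.alpha                         # posterior mean
        v = np.linalg.solve(self.L, Ks.T)              # (n, m)
        var = self.sv + self.nv - (v ** 2).sum(0)      # predictive variance
        return mean, var
```

Near observed data the variance collapses toward the noise level, while far from the data it reverts to the prior signal variance, which is exactly the signal the exploration scheme exploits.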

Exploration is handled in an RMAX‑like fashion. In classic RMAX, each state that has been visited fewer than a threshold number of times is assigned the maximal possible reward, encouraging the agent to explore unknown regions. The authors translate this idea to continuous spaces by using the GP’s predictive variance as a proxy for “unknownness.” For any state‑action pair whose variance exceeds a pre‑defined confidence bound, the algorithm inflates the immediate reward (or the value of the resulting state) to an optimistic upper‑confidence value. This inflated reward drives the planner to select actions that reduce uncertainty, thereby balancing exploration and exploitation in a principled, sample‑efficient manner.
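The variance-gated optimism can be sketched in a few lines. Here `predict`, `reward_fn`, `var_threshold`, and `r_max` are illustrative names under the assumption that the model exposes a mean/variance prediction; they are not identifiers from the paper.

```python
import numpy as np

def optimistic_reward(predict, reward_fn, state, action,
                      var_threshold=0.05, r_max=1.0):
    """RMAX-style optimism: a state-action pair whose predictive variance
    exceeds the confidence bound is treated as maximally rewarding.
    (Illustrative sketch; names are assumptions, not the paper's API.)
    """
    x = np.concatenate([state, action])[None, :]  # (1, d) GP query point
    _, var = predict(x)
    if var[0] > var_threshold:        # poorly modeled region -> explore
        return r_max
    return reward_fn(state, action)   # well-modeled region -> true reward
```

As more transitions are observed, the variance shrinks below the threshold and the optimistic bonus automatically gives way to the true reward.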

Planning proceeds by feeding the GP mean predictions into a standard dynamic‑programming routine such as value iteration or policy iteration. Because the planner operates on the learned model rather than on raw environment interactions, no additional samples are required during this phase. The only source of new data is the online interaction loop: after each real transition, the observed tuple is added to the GP’s training set, the posterior is updated, and the variance‑based exploration bonus is recomputed.
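A minimal version of this planning step might look as follows: value iteration over a discretized state grid, driven entirely by the model's mean predictions rather than new environment samples. All names, the nearest-grid-cell discretization, and the hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def value_iteration(model, reward, states, actions, gamma=0.95, iters=200):
    """Value iteration on a discretized 1-D state grid using only the
    learned model's mean predictions (illustrative sketch)."""
    V = np.zeros(len(states))
    for _ in range(iters):
        Q = np.empty((len(states), len(actions)))
        for ai, a in enumerate(actions):
            nxt = np.array([model(s, a) for s in states])           # predicted successors
            idx = np.abs(states[None, :] - nxt[:, None]).argmin(1)  # snap to nearest cell
            Q[:, ai] = np.array([reward(s, a) for s in states]) + gamma * V[idx]
        V = Q.max(1)                                                # Bellman backup
    return V, Q.argmax(1)  # value function and greedy policy
```

For example, on a simple chain where only reaching the rightmost state pays off, the greedy policy recovered from the learned model moves right without any further interaction with the environment.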

The authors acknowledge the cubic computational complexity O(N³) of naïve GP inference (where N is the number of observed transitions). However, in the experimental regime considered—on the order of a few hundred samples—this cost remains manageable. They also discuss possible extensions such as sparse GP approximations, local kernel windows, or inducing‑point methods to scale the approach to larger data sets and higher‑dimensional problems.
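One of these scaling strategies can be sketched as a greedy subset-of-data filter: a new sample is added to the GP's training set only if it is not already well covered by the retained points. This is a simple stand-in for the sparse-GP extensions the summary mentions; the function, the kernel-similarity test, and the tolerance are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential similarity between two points."""
    return np.exp(-0.5 * np.sum((a - b) ** 2) / ls**2)

def sparsify(points, tol=0.1, ls=1.0):
    """Greedy subset-of-data selection (illustrative sketch): keep a point
    only if its kernel similarity to every retained point is below 1 - tol,
    i.e. it carries information the retained set does not."""
    kept = []
    for x in points:
        if all(rbf(x, z, ls) < 1.0 - tol for z in kept):
            kept.append(x)
    return kept
```

Keeping the retained set small caps the effective N in the O(N³) inference cost, at the price of some approximation error in regions represented only by discarded near-duplicates.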

Empirical evaluation is conducted on four widely used continuous control benchmarks (classic tasks such as mountain car, cart‑pole, acrobot, and pendulum swing‑up). In each domain the proposed GP‑RMAX algorithm is compared against other sample‑efficient model learners, including a conventional RMAX variant adapted to continuous spaces. The results show that the GP‑RMAX method reaches comparable or superior performance with significantly fewer environment interactions. Notably, the algorithm exhibits rapid convergence during the early learning phase, reflecting the GP’s ability to generalize from very few data points and the effectiveness of the variance‑driven optimistic exploration bonus.

In summary, the paper makes three principal contributions: (1) it demonstrates that Gaussian‑process regression can serve as a highly data‑efficient, automatically regularized model learner for deterministic continuous dynamics; (2) it shows how the GP’s predictive variance can be seamlessly integrated into an RMAX‑style optimism‑driven exploration scheme, thereby carrying RMAX’s exploration principle over to continuous domains; and (3) it validates the combined approach on standard benchmarks, achieving strong empirical sample efficiency. The work opens several avenues for future research, such as incorporating sparse GP techniques for scalability, extending the framework to stochastic or partially observable settings, and exploring more sophisticated planners (e.g., model‑predictive control or Monte‑Carlo tree search) that can exploit the rich uncertainty information supplied by the GP.