A Greedy Approximation of Bayesian Reinforcement Learning with Probably Optimistic Transition Model
Bayesian Reinforcement Learning (RL) is capable of not only incorporating domain knowledge, but also solving the exploration-exploitation dilemma in a natural way. As Bayesian RL is intractable except for special cases, previous work has proposed several approximation methods. However, these methods are usually too sensitive to parameter values, and finding an acceptable parameter setting is practically impossible in many applications. In this paper, we propose a new algorithm that greedily approximates Bayesian RL to achieve robustness in parameter space. We show that for a desired learning behavior, our proposed algorithm has a polynomial sample complexity that is lower than those of existing algorithms. We also demonstrate that the proposed algorithm naturally outperforms other existing algorithms when the prior distributions are not significantly misleading. On the other hand, the proposed algorithm cannot handle greatly misspecified priors as well as the other algorithms can. This is a natural consequence of the fact that the proposed algorithm is greedier than the other algorithms. Accordingly, we discuss a way to select an appropriate algorithm for different tasks based on the algorithms’ greediness. We also introduce a new way of simplifying Bayesian planning, based on which future work would be able to derive new algorithms.
💡 Research Summary
The paper addresses a long‑standing practical obstacle in Bayesian reinforcement learning (BRL): most existing approximations (e.g., Monte‑Carlo sparse sampling, R‑max, BEB, BOLT) achieve theoretical guarantees only when a small set of hyper‑parameters is finely tuned. In real‑world applications, such an exhaustive search is infeasible, and performance can degrade dramatically if the parameters are misspecified.
To overcome this, the authors propose a new algorithm called Probably Optimistic Transition (POT). The central idea is to replace the usual fixed optimism (e.g., assuming maximum reward or adding a constant bonus) with a probabilistically optimistic transition model derived from the current Bayesian posterior. Specifically, the transition probability used in planning is altered by adding a number of artificial observations $\theta$. Unlike BOLT, where $\theta$ (or its counterpart $\eta$) is a static hyper‑parameter, POT defines $\theta$ as a function of both a single user‑controlled parameter $\beta$ and the posterior statistics (mean $\alpha$ and variance $\sigma$).
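To make the mechanism concrete, here is a minimal Python sketch of the general idea: add $\theta$ artificial observations of the most promising next state (BOLT‑style optimism), with $\theta$ tied to the posterior uncertainty rather than held fixed. The function name and the specific $\theta$ schedule below ($\beta$ scaled by the total posterior standard deviation) are illustrative assumptions for this sketch, not the paper's exact formula.

```python
import numpy as np

def pot_like_transition(counts, q_next, beta):
    """Optimistically perturbed transition distribution for one (s, a) pair.

    `counts` are Dirichlet posterior counts over next states (must be > 0),
    `q_next` holds current value estimates for those next states, and `beta`
    is the single user-controlled optimism parameter. The theta schedule
    below (beta scaled by posterior uncertainty) is an assumption made for
    illustration; the paper's exact formula is not reproduced here.
    """
    alpha = np.asarray(counts, dtype=float)
    a0 = alpha.sum()
    # Posterior variance of each transition probability under Dirichlet(alpha).
    var = alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1.0))
    # Hypothetical schedule: more posterior uncertainty -> more optimism,
    # vanishing as observation counts accumulate.
    theta = beta * np.sqrt(var.sum())
    # BOLT-style optimism: add theta artificial observations of the most
    # promising next state, then renormalize to get a valid distribution.
    optimistic = alpha.copy()
    optimistic[int(np.argmax(q_next))] += theta
    return optimistic / optimistic.sum()

# Example: with few observations theta is large and the model is optimistic;
# as counts grow, the perturbed model converges to the posterior mean.
print(pot_like_transition([1.0, 1.0, 1.0], q_next=[0.2, 0.9, 0.1], beta=2.0))
```

Because $\theta$ shrinks as evidence accumulates, the optimism is strongest exactly where the posterior is most uncertain, which is what lets a single parameter $\beta$ stand in for the finely tuned hyper‑parameters of earlier methods.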