Variance-Based Rewards for Approximate Bayesian Reinforcement Learning
The explore–exploit dilemma is one of the central challenges in Reinforcement Learning (RL). Bayesian RL solves the dilemma by providing the agent with information in the form of a prior distribution over environments; however, full Bayesian planning is intractable. Planning with the mean MDP is a common myopic approximation of Bayesian planning. We derive a novel reward bonus that is a function of the posterior distribution over environments, which, when added to the reward in planning with the mean MDP, results in an agent which explores efficiently and effectively. Although our method is similar to existing methods when given an uninformative or unstructured prior, unlike existing methods, our method can exploit structured priors. We prove that our method results in a polynomial sample complexity and empirically demonstrate its advantages in a structured exploration task.
💡 Research Summary
The paper tackles the classic exploration‑exploitation dilemma in reinforcement learning from a Bayesian perspective. Bayesian RL (BRL) offers a principled way to balance exploration and exploitation by maintaining a posterior distribution over possible environments, but exact Bayesian planning is computationally intractable for all but the smallest problems. A common practical shortcut is to plan with the mean MDP—i.e., the MDP defined by the posterior means of transition probabilities and rewards. While computationally cheap, this “mean‑MDP” approach is essentially myopic: it ignores uncertainty and therefore fails to drive purposeful exploration.
To remedy this, the authors derive a variance‑based reward bonus (VBRB) that directly incorporates the posterior variance of the model parameters into the reward function used for mean‑MDP planning. For each state‑action pair $(s,a)$, they maintain a Bayesian posterior over the transition distribution and the immediate reward. From this posterior they compute the variance $\sigma_{s,a}$ (or a suitable scalar proxy) and define a bonus

$$b(s,a) = \beta\,\sigma_{s,a},$$

where $\beta > 0$ scales the strength of exploration; this bonus is added to the posterior‑mean reward before planning in the mean MDP.
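As a concrete illustration of the idea above, the following minimal sketch computes such a bonus for one state‑action pair, assuming a Dirichlet posterior (pseudo‑counts) over next states and a separately supplied posterior variance for the mean reward. The function name `variance_bonus` and the particular scalar proxy (reward‑mean variance plus the summed variances of the transition probabilities) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def variance_bonus(trans_counts, reward_mean_var, beta=1.0):
    """Illustrative scalar exploration bonus ~ beta * sigma_{s,a}.

    trans_counts: Dirichlet pseudo-counts alpha_i over next states for (s, a).
    reward_mean_var: posterior variance of the mean reward at (s, a).
    beta: exploration coefficient (assumed hyperparameter).
    """
    alpha = np.asarray(trans_counts, dtype=float)
    a0 = alpha.sum()
    # Variance of each transition probability p_i under Dirichlet(alpha):
    # Var[p_i] = alpha_i * (a0 - alpha_i) / (a0^2 * (a0 + 1)).
    p_var = alpha * (a0 - alpha) / (a0**2 * (a0 + 1.0))
    # Scalar proxy: reward-mean variance plus total transition variance.
    sigma_sq = reward_mean_var + p_var.sum()
    return beta * np.sqrt(sigma_sq)
```

Note the key qualitative property: as the agent gathers data, the pseudo‑counts grow, the posterior variance shrinks, and the bonus decays toward zero, so exploration is concentrated on poorly understood state‑action pairs.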