Robust Bayesian reinforcement learning through tight lower bounds
📝 Abstract
In the Bayesian approach to sequential decision making, exact calculation of the (subjective) utility is intractable. This extends to most special cases of interest, such as reinforcement learning problems. While utility bounds are known to exist for this problem, so far none of them were particularly tight. In this paper, we show how to efficiently calculate a lower bound, which corresponds to the utility of a near-optimal memoryless policy for the decision problem, which is generally different from both the Bayes-optimal policy and the policy which is optimal for the expected MDP under the current belief. We then show how these can be applied to obtain robust exploration policies in a Bayesian reinforcement learning setting.
📄 Content
arXiv:1106.3651v2 [cs.LG] 11 Nov 2011

Christos Dimitrakakis, EPFL, Lausanne, Switzerland (christos.dimitrakakis@epfl.ch)

1 Setting

We consider decision making problems where an agent acts in an environment that is possibly unknown to it. By choosing actions, the agent changes the state of the environment and in addition obtains scalar rewards. The agent acts so as to maximise the expectation of the utility function

$U_t \triangleq \sum_{k=t}^{T} \gamma^k r_k$,

where $\gamma \in [0, 1]$ is a discount factor and the instantaneous rewards $r_t \in [0, r_{\max}]$ are drawn from a Markov decision process (MDP) $\mu$, defined on a state space $S$ and an action space $A$, both equipped with a suitable metric and σ-algebra, with a set of transition probability measures $\{T^{s,a}_\mu : s \in S, a \in A\}$ on $S$ and a set of reward probability measures $\{R^{s,a}_\mu : s \in S, a \in A\}$ on $\mathbb{R}$, such that

$r_t \mid s_t = s, a_t = a \sim R^{s,a}_\mu, \qquad s_{t+1} \mid s_t = s, a_t = a \sim T^{s,a}_\mu$,   (1.1)

where $s_t \in S$ and $a_t \in A$ are the state of the MDP and the action taken by the agent at time $t$, respectively.

The environment is controlled via a policy $\pi \in P$. This defines a conditional probability measure on the set of actions, such that $P_\pi(a_t \in A \mid s^t, a^{t-1}) = \pi(A \mid s^t, a^{t-1})$ is the probability that the action taken at time $t$ lies in $A$. Here we use $P$, with appropriate subscripts, to denote probabilities of events, and $s^t \triangleq s_1, \ldots, s_t$ and $a^{t-1} \triangleq a_1, \ldots, a_{t-1}$ to denote sequences of states and actions respectively. We use $P_k$ to denote the set of $k$-order Markov policies. Important special cases are the set of blind policies $P_0$ and the set of memoryless policies $P_1$. A policy $\pi \in \bar{P}_k \subset P_k$ is stationary when $\pi(A \mid s^t_{t-k+1}, a^{t-1}_{t-k+1}) = \pi(A \mid s^k, a^{k-1})$ for all $t$.

The expected utility, conditioned on the policy, states and actions, is used to define a value function for the MDP $\mu$ and a stationary policy $\pi$ at stage $t$:

$Q^\pi_{\mu,t}(s, a) \triangleq E_{\mu,\pi}(U_t \mid s_t = s, a_t = a), \qquad V^\pi_{\mu,t}(s) \triangleq E_{\mu,\pi}(U_t \mid s_t = s)$,   (1.2)

where the expectation is taken with respect to the process defined jointly by $\mu, \pi$ on the set of all state-action-reward sequences $(S, A, \mathbb{R})^*$. The optimal value functions are denoted by $Q^*_{\mu,t} \triangleq \sup_\pi Q^\pi_{\mu,t}$ and $V^*_{\mu,t} \triangleq \sup_\pi V^\pi_{\mu,t}$. We denote the optimal policy for $\mu$ by $\pi^*_\mu$; then $Q^*_{\mu,t} = Q^{\pi^*_\mu}_{\mu,t}$ and $V^*_{\mu,t} = V^{\pi^*_\mu}_{\mu,t}$.

There are two ways to handle the case when the true MDP is unknown. The first is to consider a set of MDPs such that the probability of the true MDP lying outside this set is bounded from above [e.g. 20, 21, 4, 19, 28, 27]. The second is to use a Bayesian framework, whereby a full distribution over possible MDPs is maintained, representing our subjective belief, such that MDPs which we consider more likely have higher probability [e.g. 14, 10, 31, 2, 12]. Hybrid approaches are relatively rare [16].
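As a concrete illustration (not taken from the paper), the value functions in (1.2) can be computed for a finite MDP and a fixed stationary memoryless policy by iterating the Bellman expectation backup. The following sketch assumes tabular arrays `T`, `R`, and `policy`, names introduced here for illustration only, and the infinite-horizon discounted case:

```python
import numpy as np

def policy_evaluation(T, R, policy, gamma, tol=1e-10):
    """Compute Q^pi for a stationary memoryless policy on a finite MDP,
    as in equation (1.2), via iterated Bellman expectation backups.

    T      : transitions, shape (S, A, S), T[s, a, s'] = T^{s,a}(s')
    R      : expected rewards, shape (S, A)
    policy : shape (S, A), policy[s, a] = pi(a | s)
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    while True:
        # V^pi(s) = sum_a pi(a | s) Q(s, a)
        V = (policy * Q).sum(axis=1)
        # Q(s, a) <- R(s, a) + gamma * sum_{s'} T(s, a, s') V(s')
        Q_new = R + gamma * (T @ V)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```

For example, a single absorbing state with reward 1 and $\gamma = 0.5$ gives $Q = 1/(1 - \gamma) = 2$, matching the geometric series in $U_t$.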
In this paper, we derive a method for efficiently calculating near-optimal, robust policies in a Bayesian setting.

1.1 Bayes-optimal policies

In the Bayesian setting, our uncertainty about the Markov decision process (MDP) is formalised as a probability distribution on the class of allowed MDPs. More precisely, assume a probability measure $\xi$ over a set of possible MDPs $M$, representing our belief. The expected utility of a policy $\pi$ with respect to the belief $\xi$ is:

$E_{\xi,\pi} U_t = \int_M E_{\mu,\pi}(U_t) \, d\xi(\mu)$.   (1.3)

Without loss of generality, we may assume that all MDPs in $M$ share the same state and action space. For compactness, and with minor abuse of notation, we define the following value functions with respect to the belief:

$Q^\pi_{\xi,t}(s, a) \triangleq E_{\xi,\pi}(U_t \mid s_t = s, a_t = a), \qquad V^\pi_{\xi,t}(s) \triangleq E_{\xi,\pi}(U_t \mid s_t = s)$,   (1.4)

which represent the expected utility under the belief $\xi$, at stage $t$, of policy $\pi$, conditioned on the current state and action.

Definition 1 (Bayes-optimal policy). A Bayes-optimal policy $\pi^*_\xi$ with respect to a belief $\xi$ is a policy maximising (1.3).

Similarly to the known-MDP case, we use $Q^*_{\xi,t}$, $V^*_{\xi,t}$ to denote the value functions of the Bayes-optimal policy. Finding the Bayes-optimal policy is generally intractable [11, 14, 18]. It is important to note that a Bayes-optimal policy is not necessarily the same as the optimal policy for the true MDP. Rather, it is the optimal policy given that the true MDP was drawn from the belief $\xi$.
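The integral in (1.3) admits a simple Monte Carlo approximation: sample MDPs from the belief, estimate the return of the policy in each, and average. Because (1.3) evaluated at any one fixed policy can only fall short of the Bayes-optimal utility, such an estimate is also (up to sampling noise) a lower bound on $V^*_\xi$, which is the spirit of the bounds the paper pursues. The sketch below is illustrative only; the function and argument names are assumptions, not the paper's algorithm:

```python
import numpy as np

def expected_utility(policy, sampled_mdps, gamma, horizon,
                     n_rollouts=100, rng=None):
    """Monte Carlo approximation of E_{xi,pi} U_0 from equation (1.3).

    policy       : shape (S, A), a fixed memoryless policy pi(a | s)
    sampled_mdps : list of (T, R) pairs drawn from the belief xi,
                   with shapes (S, A, S) and (S, A)
    """
    rng = np.random.default_rng(rng)
    S, A = sampled_mdps[0][1].shape
    total = 0.0
    for T, R in sampled_mdps:            # outer integral over xi
        for _ in range(n_rollouts):      # inner expectation over trajectories
            s, ret = 0, 0.0
            for t in range(horizon):
                a = rng.choice(A, p=policy[s])
                ret += gamma ** t * R[s, a]
                s = rng.choice(S, p=T[s, a])
            total += ret
    return total / (len(sampled_mdps) * n_rollouts)
```

For a deterministic single-state MDP with reward 1, $\gamma = 0.5$ and horizon 3, every rollout returns $1 + 0.5 + 0.25 = 1.75$, the truncated version of the series defining $U_t$.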