On Bellman's principle with inequality constraints

Reading time: 5 minutes

📝 Original Info

  • Title: On Bellman's principle with inequality constraints
  • ArXiv ID: 1111.3271
  • Date: 2011-11-15
  • Authors: Not specified in the provided source.

📝 Abstract

We consider an example by Haviv (1996) of a constrained Markov decision process that, in some sense, violates Bellman's principle. We resolve this issue by showing how to preserve a form of Bellman's principle that accounts for a change of constraint at states that are reachable from the initial state.

💡 Deep Analysis

Figure 1: Haviv's constrained MDP example [6], with transient initial state x, decision state y, and three recurrent subchains.

📄 Full Content

The most celebrated result in Markov decision process (MDP) theory is Bellman's optimality principle, which can be stated as follows. (We assume that the reader is already generally familiar with MDPs.) Let $X_t$ be the state at (discrete) time $t$ and $r(X_t, a)$ the reward received if action $a$ is taken at state $X_t$ (the stagewise reward). Let $V^*(x)$ be the optimal cumulative reward starting at state $x$. Then, Bellman's principle states that for each time $t$,

$$V^*(X_t) = \max_a \left\{ r(X_t, a) + \mathbb{E}_{X_t, a}\!\left[ V^*(X_{t+1}) \right] \right\},$$

where $X_{t+1}$ is the random next state with distribution depending on $X_t$ and $a$. Moreover, replacing $\max$ by $\arg\max$ on the right-hand side gives the optimal action at $X_t$ (i.e., it characterizes the optimal policy). But Bellman's principle is more than just an equation: it embodies an idea that has become almost axiomatic in Markov decision theory. This idea is that the optimal policy solves the optimization problem not just at the initial state $X_0 = x$ but also at all states reachable from it.
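To make the recursion concrete, here is a minimal value-iteration sketch in Python. The MDP data (the transition tensor P and reward matrix r), the discount factor, and the use of a discounted criterion are all illustrative assumptions of ours, not taken from the paper; the paper's setting (and Haviv's example below) uses long-run visit frequencies rather than discounting.

```python
import numpy as np

# Minimal value-iteration sketch of the (unconstrained) Bellman recursion
#   V*(x) = max_a { r(x, a) + gamma * E_{x,a}[ V*(x') ] }.
# P and r are random placeholders, not data from the paper; a discounted
# criterion is used here only for simplicity.

n_states, n_actions, gamma = 3, 2, 0.9

rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)       # P[x, a, :] is a probability vector
r = rng.random((n_states, n_actions))   # stagewise rewards r(x, a)

V = np.zeros(n_states)
for _ in range(10_000):
    Q = r + gamma * (P @ V)             # Q[x, a] = r(x,a) + gamma * sum_x' P(x'|x,a) V(x')
    V_new = Q.max(axis=1)               # Bellman's equation: maximize over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)               # replacing max by argmax recovers the optimal action
print("V* ~", V, " optimal actions:", policy)
```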

In this paper, we consider MDPs with explicit constraints. Such constrained MDPs have been studied for at least a couple of decades and continue to draw interest (see, e.g., [1]–[11]). We are interested here in a particular paper by Haviv [6], who raises an issue that has not been addressed in the literature. Basically, Haviv constructs an example of a constrained MDP in which the optimal policy starting at the initial state x is no longer optimal at states other than x, not even at a state y that is reachable from x. He laments that this means that Bellman's principle is violated. We will explore Haviv's issue thoroughly. In particular, we will show that there is some preservation of Bellman's principle, provided we account for the fact that some of the “slackness” in the constraint is spent in going from x to a reachable y. So, if we consider the optimal policy $\pi^*$ starting at state x, the optimality of $\pi^*$ at state y is with respect to a different problem, one where the constraint is modified with the “residual slackness.” In analyzing Haviv's problem, we will present some known results, some new results, and some related examples along the way to help us understand and resolve the problem. Our analysis highlights the important maxim that when imposing constraints on a decision problem, the constraints should apply only to those things over which we have control.
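For concreteness, a constrained MDP of this kind can be written (in our notation, which may differ from the paper's) as an optimization over policies $\pi$ with an explicit inequality constraint:

$$\max_{\pi} \; J_x(\pi) \quad \text{subject to} \quad C_x(\pi) \le \alpha,$$

where $J_x(\pi)$ is the reward criterion starting at state $x$, $C_x(\pi)$ is a cost criterion (in Haviv's example below, the long-run expected frequency of visits to a “bad” set $S$), and $\alpha$ is the constraint level. The resolution sketched above is that if $\pi^*$ solves this problem at $x$, then at a reachable state $y$ it solves the analogous problem only after $\alpha$ is replaced by the residual slackness left over on the way from $x$ to $y$.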

In [6], Haviv gives an example (reproduced in Fig. 1) in which he shows that, given an optimal policy for an optimization problem starting at some state x, the policy is not optimal with respect to the same problem starting at a reachable state y. The structure of the problem is a multichain MDP with initial state x, which is transient. There are three recurrent subchains that could be reached from x. There is no reward for being in chain 1, while the stagewise reward is $10 at every state in chain 2 and $20 at every state in chain 3. The constraint is that the expected frequency of visits to states in $S = S_1 \cup S_2 \cup S_3$ must not exceed 0.125 (think of states in S as the “bad” states). While in chain i ($i = 1, 2, 3$), the frequency of visits to $S_i$ is as shown in Fig. 1 (e.g., 0.2 for $S_1$). There is only one state in which an action decision must be made: in state y, we can choose either action a or b.

A quick examination of Haviv's problem shows that there is only one feasible policy: at state y, select action a. Selecting action b at state y would violate the constraint, because the resulting Markov chain would visit states in S with frequency 0.5(0.2 + 0.1) = 0.15. However, if the starting state were y, we would want to pick action b, because this leads to chain 3, where the stagewise reward exceeds that of chain 2, and the frequency of visits to S in chain 3 ($S_3$) is 0.1, which does not exceed the constraint of 0.125. As noted before, this leads to Haviv's lament: Bellman's principle is violated, because the optimal policy starting at state x is no longer optimal starting at state y, even though y is reachable from x. As Haviv points out in [6], and as we will emphasize again later, the issue is related to the multichain nature of the example: there are transient states and recurrent subchains that are not reachable from each other.
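The feasibility arithmetic can be spelled out in a few lines. In the sketch below, the visit frequencies for $S_1$ (0.2) and $S_3$ (0.1), the bound 0.125, the rewards, and the 50/50 split at x implied by the calculation 0.5(0.2 + 0.1) = 0.15 are taken from the description above; the visit frequency for $S_2$ is not given in this excerpt, so the value 0.05 is a hypothetical placeholder chosen only so that action a comes out feasible, as stated. The final lines illustrate, informally, the “residual slackness” idea: the budget already spent on chain 1 shrinks the constraint that applies to the continuation problem at y.

```python
# Feasibility check for the two stationary policies in Haviv's example, starting from x.
# FREQ_S[2] = 0.05 is a hypothetical placeholder (not given in this excerpt).

P_CHAIN1 = 0.5          # probability of ending up in chain 1 from x
P_Y = 0.5               # probability of reaching decision state y from x
FREQ_S = {1: 0.2, 2: 0.05, 3: 0.1}   # long-run frequency of visits to S_i while in chain i
REWARD = {1: 0.0, 2: 10.0, 3: 20.0}  # stagewise reward in each chain
BOUND = 0.125

def evaluate(action):
    """Expected S-visit frequency and stagewise reward from x under the given action at y."""
    chain = 2 if action == "a" else 3            # a leads to chain 2, b leads to chain 3
    freq = P_CHAIN1 * FREQ_S[1] + P_Y * FREQ_S[chain]
    reward = P_CHAIN1 * REWARD[1] + P_Y * REWARD[chain]
    return freq, reward

for action in ("a", "b"):
    freq, reward = evaluate(action)
    print(f"from x, action {action}: S-frequency {freq:.3f}, "
          f"expected reward {reward:.1f}, feasible: {freq <= BOUND}")

# Starting from y instead of x, only the chosen chain matters:
for action, chain in (("a", 2), ("b", 3)):
    print(f"from y, action {action}: S-frequency {FREQ_S[chain]:.3f}, "
          f"feasible: {FREQ_S[chain] <= BOUND}")

# Residual slackness at y (an informal rendering of the paper's resolution):
# spending 0.5 * 0.2 = 0.1 of the budget on chain 1 leaves a tighter bound
# for the continuation problem at y, under which action a is again optimal.
residual_bound = (BOUND - P_CHAIN1 * FREQ_S[1]) / P_Y
print(f"modified constraint at y when reached from x: frequency <= {residual_bound}")
```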

More specifically, Haviv's problem illustrates that as far as optimality of a policy is concerned, arriving at state y from x is different from starting at y. From this point of view, the issue raised by Haviv appears to be related to that of time consistency in risk-averse multistage stochastic programming, identified in a recent paper by Shapiro [12]. The same issue is also discussed in the economics literature on multistage decision problems arising in dynamic portfolios; see, e.g., [13], [14]. This issue has been recognized for some time in the context of time-varying preferences [15], [16] and game-theoretic formalisms of such changing tastes [17], [18].

In lamenting the violation of Bellman’s pri


Reference

This content is AI-processed based on open access ArXiv data.
