Average optimality for risk-sensitive control with general state space
📝 Original Info
- Title: Average optimality for risk-sensitive control with general state space
- ArXiv ID: 0704.0394
- Date: 2016-08-14
- Authors: not captured in this extraction (see the ArXiv abstract page for the author list)
📝 Abstract
This paper deals with discrete-time Markov control processes on a general state space. A long-run risk-sensitive average cost criterion is used as a performance measure. The one-step cost function is nonnegative and possibly unbounded. Using the vanishing discount factor approach, the optimality inequality and an optimal stationary strategy for the decision maker are established.
📄 Full Content
Next, assuming that a certain family of functions is bounded [Condition (B)] and using Fatou’s lemma (for weakly or setwise convergent measures), we obtain the optimality inequality.
The predecessor of our result is Theorem 4.1 in [16], where the optimality inequality for risk-sensitive dynamic programming with a countable state space was established. Instead of boundedness assumption (B), Hernández-Hernández and Marcus [16] assume that there exists a stationary policy which induces a finite average cost equal to some constant in each state. On the other hand, it is well known that the optimal risk-sensitive average cost may depend on the initial state (see Example 1). This behavior occurs if the risk factor is too large. Instead of this restriction on the risk coefficient, we use Condition (B), which makes the process reach "good states" sufficiently fast.
There is a rich literature in risk-sensitive control, going back at least to the seminal works of Howard and Matheson [18] and Jacobson [19], which covered the finite horizon case. The average cost criterion on the infinite horizon was studied in [5,8,14,15,16,31] for a denumerable state space and in [10,11,20] for a general state space. It is also worth mentioning that risk-sensitive control finds natural applications in portfolio management, where the objective is to maximize the growth rate of the expected utility of wealth; see [3,4,30] and the references cited therein.
The paper is organized as follows. Below, a Markov control model with the long-run average cost criterion as a performance measure is described and some basic notation is set up. In Section 2 we introduce preliminaries and present the auxiliary discounted minimax problem, which is, in turn, solved in Section 3. The main result is established in Section 4. Section 5 contains a discussion of Condition (B), and in the Appendix a variational formula for the logarithmic moment-generating function is stated.
A discrete-time Markov control process is specified by the following objects:
(i) The state space X is a standard Borel space (i.e., a nonempty Borel subset of some Polish space).
(ii) A is a Borel action space.
(iii) K is a nonempty Borel subset of X ×A. We assume that, for each x ∈ X, the nonempty x-section A(x) = {a ∈ A : (x, a) ∈ K} of K is compact and represents the set of actions available in state x.
(iv) q is a regular conditional distribution from K to X.
(v) The one-step cost function c is a Borel measurable mapping from K to [0, +∞].
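For concreteness, here is one way the five primitives could be encoded for a finite toy model; the tiny example and all names below are hypothetical and only meant to fix ideas (the paper itself works with general Borel spaces and possibly unbounded costs).

```python
# Hypothetical finite encoding of a Markov control model (X, A, K, q, c).
X = [0, 1]                         # state space
A = ["keep", "reset"]              # action space

# K given through its x-sections: A_of[x] lists the actions available in state x.
A_of = {0: ["keep"], 1: ["keep", "reset"]}

# q: transition law; q[(x, a)][y] is the probability of moving to state y.
q = {
    (0, "keep"):  {0: 1.0, 1: 0.0},
    (1, "keep"):  {0: 0.2, 1: 0.8},
    (1, "reset"): {0: 0.9, 1: 0.1},
}

# c: one-step cost, nonnegative (and possibly unbounded in general).
c = {(0, "keep"): 0.0, (1, "keep"): 1.0, (1, "reset"): 3.0}
```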
Then the history spaces are defined as
The class of stationary policies is identified with the class F of measurable functions f from X to A such that f(x) ∈ A(x). It is well known that F is nonempty [6]. By the Ionescu-Tulcea theorem [24], for each policy π and each initial state x_0 = x, a probability measure P^π_x and a stochastic process {(x_k, a_k)} are defined on H_∞ in a canonical way, where x_k and a_k describe the state and the decision at stage k, respectively. By E^π_x we denote the expectation operator with respect to the probability measure P^π_x. Let γ > 0 be a given risk factor. For any initial state x ∈ X and policy π ∈ Π, we define the following risk-sensitive average cost criterion:
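The display itself is missing from the extracted text; presumably it is the standard long-run risk-sensitive (exponential) average cost,

J(x, π) = limsup_{n→∞} (1/(γn)) log E^π_x exp(γ ∑_{k=0}^{n−1} c(x_k, a_k)),

which is the form commonly used in the risk-sensitive literature cited above.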
Our aim is to minimize J(x, π) within the class of all policies and to find a policy π*, for which
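Again the display is missing; presumably the requirement is that π* be average optimal, that is,

J(x, π*) = inf_{π∈Π} J(x, π) =: J*(x) for every x ∈ X.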
Throughout the paper the following assumption will be supposed to hold true even without explicit reference:
Remark 1. Throughout the remainder, we assume that the risk factor γ > 0 is arbitrary and fixed. Therefore, here and subsequently, we shall not indicate that some quantities depend on γ [e.g., we write J(x, π) instead of J_γ(x, π), dropping the index γ].
Let Pr(X) be the set of all probability measures on X. Fix ν ∈ Pr(X). The relative entropy function R(· ‖ ν) is a mapping from Pr(X) into the extended real line R ∪ {+∞} defined as follows:
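The defining display is missing here; the standard definition (Dupuis and Ellis [12]), assumed in what follows, is

R(µ ‖ ν) = ∫_X log(dµ/dν)(y) µ(dy) if µ is absolutely continuous with respect to ν, and R(µ ‖ ν) = +∞ otherwise.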
It is well known that R(µ ‖ ν) is nonnegative for any µ ∈ Pr(X) and R(µ ‖ ν) = 0 if and only if µ = ν (consult Lemma 1.4.1 in [12]). We shall consider the following auxiliary minimax problem, associated with our original Markov control process. The set X is the state space, while A and Pr(X) are the action sets for the decision maker and the opponent, respectively. The process then operates as follows. In a state x_n, n = 0, 1, . . . , the controller chooses an action a_n ∈ A(x_n), while the opponent selects µ_n(·)[x_n, a_n] ∈ Pr(X). As a consequence, the controller pays γc(x_n, a_n) − R(µ_n(·)[x_n, a_n] ‖ q(·|x_n, a_n)) to his opponent, and the system moves to the next state according to the probability distribution µ_n(·)[x_n, a_n].
We shall deal with the following classes of strategies. It will cause no confusion if we continue to use the same letters to denote strategies for the controller. Namely, π stands for a randomized control strategy (policy), whereas f denotes a stationary strategy. We write Π and F to denote the sets of corresponding strategies. For the opponent, we confine ourselves to stationary strategies, identified with the class P of stochastic kernels p on X given K.
Let (Ω, F) be the measurable space consisting of the sample space Ω = (X × A)^∞ and its product σ-algebra F. Then, for an initial state x ∈ X and strategies π and p, there exists a unique probability measure P^{πp}_x and, again, a stochastic process {(x_k, a_k)} is defined on (Ω, F) in a canonical way, where x_k denotes the state at time k and a_k is the action of the controller. With some abuse of notation, we let h_k stand for the history of the process up to the kth state, that is,
The corresponding expectation operator is denoted by E^{πp}_x. For fixed x ∈ X, π ∈ Π and p ∈ P, we define the following functional costs:
where β ∈ (0, 1) is the discount factor, and
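Both displays are missing from the extracted text. Given the payoff of the game described above, they presumably take the form

V_β(x, π, p) = E^{πp}_x ∑_{k=0}^{∞} β^k [γc(x_k, a_k) − R(p(·|x_k, a_k) ‖ q(·|x_k, a_k))]

for the discounted cost [the display referred to as (1) later], and

j(x, π, p) = limsup_{n→∞} (1/n) E^{πp}_x ∑_{k=0}^{n−1} [γc(x_k, a_k) − R(p(·|x_k, a_k) ‖ q(·|x_k, a_k))]

for the long-run average cost; this reading (cf. [16]) is offered only for orientation.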
Note that, since the function R(· ‖ ·) is lower semicontinuous on Pr(X) × Pr(X) and p and q are stochastic kernels [i.e., measurable functions of (x, a)], it follows that the mapping (x, a) ↦ R(p(·|x, a) ‖ q(·|x, a)) is measurable (Lemma 1.4.3(f) in [12]). Observe that V_β(x, π, p) and j(x, π, p) might be undetermined, because c can be unbounded. We thus restrict the set of admissible strategies for the opponent in the following way.
Definition 1. Given π = {π k } ∈ Π, we say that p ∈ P is a π-admissible strategy iff
and moreover, there exists a constant C ≥ 0, possibly depending on π and p, such that
for all histories of the process h_k, k ≥ 0, induced by p and π. We denote this set by Q(π). [Note that this set is nonempty, since p = q ∈ Q(π) for any π ∈ Π.] Let us introduce the following notation. For any π ∈ Π, p ∈ Q(π) and n ≥ 1, define
and
Now we are ready to present the result that was originally proved in [16] for Markov strategies. However, it remains valid when arbitrary strategies for the decision maker are considered. Therefore, for the sake of clarity, we state the result with its proof. Proposition 1. Let x ∈ X, π ∈ Π and p ∈ Q(π). Then:
Proof. (a) Let p ∈ Q(π) be any stochastic kernel. For n = 1, we conclude

∫_A [γc(x, a_0) − R(p(·|x, a_0) ‖ q(·|x, a_0))] π_0(da_0|x) ≤ ∫_A γc(x, a_0) π_0(da_0|x) ≤ log ∫_A e^{γc(x, a_0)} π_0(da_0|x),

where the first inequality holds since the relative entropy is nonnegative, and the second one is due to Jensen's inequality. Now assume that the hypothesis is true for some n ≥ 1. Clearly,
Denote by π^{(1)} the "1-shifted" strategy, that is, the strategy that π induces from stage 1 onward, given the initial pair (x, a_0).
Then, we have
∫_A log(e^{γc(x, a_0)} ⋯) π_0(da_0|x) ≤ log ∫_A ∫_X e^{γc(x, a_0) + ∑_{k=1}^{n+1} γc(x_k, a_k)} q(dx_1|x, a_0) π_0(da_0|x).
Clearly, the first inequality follows from the induction hypothesis. The third inequality is due to Jensen’s inequality, whilst the second one follows from Lemma A in the Appendix. Since p ∈ Q(π) is arbitrary, we get the desired conclusion.
Part (b) follows directly from part (a).
Remark 2. Note that in the proof of Proposition 1 we did not really have to use the fact that p ∈ Q(π). The only assumption which plays an essential role is condition (2). Namely, it guarantees that j n (x, π, p) is well defined for all n ≥ 1, x ∈ X and π ∈ Π. However, in Definition 1 we restrict the opponent’s class of strategies to the set Q(π) in order to be able to apply the Hardy-Littlewood theorem. In actual fact, later on it will be clear that the set Q(π), where π ∈ Π, is sufficiently large. Namely, the supremum of certain discounted functional costs over the set Q(π) will not change if we add new elements to Q(π); see the proofs of Lemmas 1 and 2.
Let π be as in assumption (G) and let p ∈ Q( π). Then from the Hardy-Littlewood theorem (Theorem H.2 in [13]), we get lim sup
and from Proposition 1(b), lim sup
Combining these two inequalities, we conclude that lim sup
This in turn yields lim sup
where V_β(x) is the upper value of functional cost (1), that is,
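The display is missing; presumably the upper value is

V_β(x) = inf_{π∈Π} sup_{p∈Q(π)} V_β(x, π, p),

that is, the controller minimizes while the opponent maximizes the discounted payoff.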
Consequently, inequality (4) and assumption (G) together lead to the following:
for each x ∈ X and β ∈ (0, 1). In addition, V_β(x) ≥ 0. Now defining
and observing that lim sup
one can deduce that there exists a sequence of discount factors {β_n} converging to 1 for which
where l is a certain nonnegative constant.
A solution to the auxiliary discounted minimax problem.
The main thrust of this section is to solve the auxiliary discounted minimax problem introduced in the previous section. In other words, we look for a discounted functional equation whose solution is the function V_β. This is done by approximating the above-mentioned minimax model by models with bounded cost functions. These models are in turn solved by a fixed point argument in Proposition 2. Next, we show in Lemma 1 that the corresponding solutions equal the upper values of some discounted costs on the infinite horizon. Finally, the limit passage in Lemma 2 gives the desired discounted functional equation with the function V_β as a solution.
We shall need the following two sets of compactness-semicontinuity assumptions, which will be used alternatively.
Condition (S). (i) The set A(x) is compact. (ii) For each x ∈ X and every Borel set D ⊂ X, the function q(D|x, ·) is continuous on A(x).
(iii) The cost function c(x, •) is lower semicontinuous for each x ∈ X.
Condition (W). (i) The set A(x) is compact and the set-valued mapping x → A(x) is upper semicontinuous, that is, {x ∈ X : A(x) ∩ B ≠ ∅} is closed for every closed set B in A.
(ii) The transition law q is weakly continuous on K, that is, the function (x, a) ↦ ∫_X u(y) q(dy|x, a) is continuous on K for each bounded continuous function u.
(iii) The cost function c is lower semicontinuous on K.
By L_b(X) and B_b(X), we denote the sets of all bounded lower semicontinuous and all bounded Borel measurable functions on X, respectively. Further, let N stand for the set of positive integers. Choose N ∈ N and define the truncated cost function c_N := min{c, N}.
The following result was proved under Condition (W) for bounded cost functions by a fixed point argument; see page 72 in [10]. However, a simple and obvious modification of the proof gives the conclusion under Condition (S) as well.
Proposition 2. Under (W) [(S)], for any discount factor β ∈ (0, 1) and a number N ∈ N, there exists a unique function
for each x ∈ X, and
Moreover, there exists a stationary strategy f_0 ∈ F (possibly depending on β and N) that attains the minimum in (8).
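On a finite model, the equation of Proposition 2 [in its log form (11) below] can be solved numerically by iterating the corresponding operator. The sketch below uses hypothetical finite state and action sets and a simple truncation; it only illustrates the fixed-point computation behind Proposition 2, not the general Borel-space argument.

```python
import math

# Hypothetical finite model: states 0, 1; admissible actions per state.
A_of = {0: ["keep"], 1: ["keep", "reset"]}
q = {(0, "keep"):  {0: 1.0, 1: 0.0},
     (1, "keep"):  {0: 0.2, 1: 0.8},
     (1, "reset"): {0: 0.9, 1: 0.1}}
c = {(0, "keep"): 0.0, (1, "keep"): 1.0, (1, "reset"): 3.0}

gamma, beta, N = 0.5, 0.95, 10               # risk factor, discount factor, truncation level
c_N = {k: min(v, N) for k, v in c.items()}   # truncated cost c_N = min(c, N)

def T(w):
    """One application of the operator in (11):
       (Tw)(x) = min_a [ gamma*c_N(x,a) + log sum_y exp(beta*w(y)) q(y|x,a) ]."""
    new_w = {}
    for x, actions in A_of.items():
        vals = []
        for a in actions:
            log_term = math.log(sum(math.exp(beta * w[y]) * p
                                    for y, p in q[(x, a)].items() if p > 0))
            vals.append(gamma * c_N[(x, a)] + log_term)
        new_w[x] = min(vals)
    return new_w

# Iterate to numerical convergence; in the paper's setting Proposition 2
# guarantees a unique bounded solution w^N_beta.
w = {x: 0.0 for x in A_of}
for _ in range(10_000):
    w_next = T(w)
    if max(abs(w_next[x] - w[x]) for x in w) < 1e-12:
        break
    w = w_next

print(w)  # approximate w^N_beta on the toy model
```

Since the log-sum-exp term is a β-contraction in the sup norm, the iteration converges geometrically, which mirrors the fixed point argument cited from [10].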
Let β and N be fixed just in the next lemma.
Lemma 1. Assume (W) or (S). Then, it holds
for any initial state x ∈ X.
Proof. Note that (8) can be rewritten in the following equivalent form:
w^N_β(x) = min_{a∈A(x)} [γc_N(x, a) + log ∫_X e^{βw^N_β(y)} q(dy|x, a)].   (11)
Applying Lemma A in the Appendix to (11), we get
Moreover, the measure
µ_0(dy)[x, a] = e^{βw^N_β(y)} q(dy|x, a) / ∫_X e^{βw^N_β(y)} q(dy|x, a)
achieves the supremum in (12). Put p_0(·|x, a) := µ_0(·)[x, a] for (x, a) ∈ K.   (13)
Note that p_0 ∈ Q(π) for any strategy π ∈ Π. This follows directly from the definition of R(p_0(·|x, a) ‖ q(·|x, a)) and (9). Simple calculations give the upper bound
Let p_0 be defined as in (13). By (12), we then have
By iteration of this inequality n times, it follows
where π is any strategy for the controller. Now, letting n → ∞ and making use of (9), we conclude
Since π is arbitrary, we get
Note that inequality (14) is valid because p_0 ∈ Q(π).
On the other hand, by (12), we can write
with f_0 as in Proposition 2 and any p ∈ Q(f_0). Proceeding along the same lines, we infer
Since p ∈ Q(f_0) is arbitrary, we easily deduce
Finally, combining (14) with (15) completes the proof.
In the remainder of the paper, we shall use the following notation. Let L(X) denote the set of all lower semicontinuous functions on X, whereas B(X) stands for the set of all Borel measurable functions on X.
Lemma 2. Assume (W) or (S). Then:
(a) For each x ∈ X, the limit w_β(x) := lim_{N→∞} w^N_β(x) exists, is nonnegative and finite; moreover, w_β ∈ L(X) under (W) and w_β ∈ B(X) under (S).
(b) The function w_β satisfies
e^{w_β(x)} = min_{a∈A(x)} e^{γc(x, a)} ∫_X e^{βw_β(y)} q(dy|x, a)   (16)
for all x ∈ X. Furthermore, there exists a Borel measurable selector f_β ∈ F of the minima in (16).
(c) For any x ∈ X, w_β(x) = V_β(x).
Proof. Let x ∈ X and β ∈ (0, 1) be fixed. From (10), it is easily seen that the sequence {w^N_β(x)} is nondecreasing in N. Therefore, w_β(x) = lim_{N→∞} w^N_β(x) exists and, by (9), it is nonnegative. Clearly, under (S), w_β ∈ B(X), whereas, under (W), w_β ∈ L(X); see Proposition 10.1 in [26].
In order to prove that w β (x) is finite for each x ∈ X, observe first that, for any π ∈ Π, p ∈ Q(π) and N ∈ N,
Moreover, from Lemma 1, we obtain (17). By (5), V_β(x) is finite for each x ∈ X, and so is w_β(x). This finishes the proof of part (a).
In order to prove part (b), note that by (11) and part (a) the limit in (18) exists. Since the first and the second term in (18) are nondecreasing and (W) or (S) holds, we may interchange the limit with the minimum (see Proposition 10.1 in [26]). Furthermore, making use of the Lebesgue monotone convergence theorem, we conclude (16). The existence of a Borel measurable selector f_β ∈ F follows from the compactness-semicontinuity assumptions and Proposition D.5 in [17].
We now turn to proving part (c). Again, taking a logarithm on both sides of (16), it follows that
w_β(x) = min_{a∈A(x)} [γc(x, a) + log ∫_X e^{βw_β(y)} q(dy|x, a)].   (19)
Applying Lemma A in the Appendix to (19), we easily obtain
Observe that by (20), for any p ∈ Q(f_β), the following holds:
Iterating this inequality n times, we immediately obtain
Since p ∈ Q(f_β) is arbitrary, we see that
Inequalities (17) and (22) combined conclude the proof of part (c).
A solution to the risk-sensitive control problem.
For any x ∈ X and any discount factor β ∈ (0, 1), define
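The display defining h_β is missing; judging from Remark 3 below and the use of m_β in Section 5, it is presumably the usual vanishing-discount normalization,

h_β(x) := V_β(x) − m_β,  where m_β := inf_{y∈X} V_β(y).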
The following boundedness assumption is supposed to hold true. As mentioned in the Introduction, we put off discussing it until Section 5.
Condition (B). For any x ∈ X, sup_{β∈(0,1)} h_β(x) < +∞.
Remark 3. A similar assumption and its equivalent variants were used to study the expected average cost criterion for Markov decision processes in the risk-neutral setting [17,27,28]. Roughly speaking, Hernández-Lerma and Lasserre [17], Schäl [27] and Sennott [28] assume that the family of so-called normalized β-discounted cost functions is bounded. This assumption holds, in particular, for ergodic Markov decision processes. More precisely, if the n-step transition probabilities converge to the unique invariant probability measure geometrically fast, and the cost functions are bounded (or, more generally, satisfy a certain growth hypothesis), then the aforementioned family of functions is pointwise relatively compact [21,22]. It is worth pointing out that this requirement is crucial for obtaining the optimality inequality in the risk-neutral case; see [27,28]. In Section 5 we provide an example illustrating that Condition (B) cannot be weakened in the risk-sensitive case either.
We shall need the following two versions of Fatou’s lemma for converging measures.
Lemma 3. Let {µ_n} be a sequence of probability measures converging to µ ∈ Pr(X) and let {h_n} be a sequence of measurable nonnegative functions on X. Then,
∫_X h(y) µ(dy) ≤ lim inf_{n→∞} ∫_X h_n(y) µ_n(dy)
in the following cases:
(a) {µ_n} converges setwise to µ [i.e., ∫_X f(y) dµ_n(y) → ∫_X f(y) dµ(y) for all f ∈ B_b(X)], and h(x) = lim inf_{n→∞} h_n(x);
(b) {µ_n} converges weakly to µ, and h(x) = inf{lim inf_{n→∞} h_n(x_n) : x_n → x}; moreover, h ∈ L(X).
Proof. Part (a) is due to Royden [25], page 231, whereas part (b) was proved by Serfozo [29]. For the proof of lower semicontinuity of h, the reader is referred to Lemma 3.1 in [22]. Now we are in a position to state the main result of the paper. This theorem concerns a study of the risk-sensitive average cost optimality inequality, which is sufficient to establish the existence of an optimal stationary policy.
Theorem 1. Assume (B) and (W) [or (S)]. Then, for each risk factor γ > 0, there exist a constant l and a nonnegative function h on X such that
for all x ∈ X. Moreover,
In other words, l/γ is the optimal risk-sensitive average cost and f is a risk-sensitive average cost optimal stationary policy.
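The displays in the statement of Theorem 1 are missing from the extracted text. Judging from inequality (30) in the proof and from the concluding sentence, the optimality inequality (23) presumably reads

l + h(x) ≥ min_{a∈A(x)} [γc(x, a) + log ∫_X e^{h(y)} q(dy|x, a)] = γc(x, f(x)) + log ∫_X e^{h(y)} q(dy|x, f(x))

for some f ∈ F and all x ∈ X, while the "Moreover" part presumably asserts that J(x, f) = inf_{π∈Π} J(x, π) = l/γ for every x ∈ X.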
Remark 4. (a) There are two papers [16,27] that can be treated as predecessors of our work. They both deal with the optimality inequality but within two different frameworks. The first work [16] establishes the optimality inequality for risk-sensitive dynamic programming on a denumerable state space. In the other one, the result is obtained for Markov control processes on an uncountable state space in the risk-neutral case (γ = 0). From this point of view, our result is an extension of Theorem 4.1 in [16] to a general state space and of Theorem 3.8 in [27] to the risk-sensitive case. Moreover, the common feature of the discussed results is that their proofs are based on the vanishing discount factor approach. Our proof also relies on this method and, similarly as in [27] or [21,22], makes use of the Fatou lemmas for setwise and weakly convergent measures.
(b) Finally, it is also worth mentioning that there are papers studying the optimality equation in the risk-sensitive dynamic programming, which is of the following form:
l + h(x) = min_{a∈A(x)} [γc(x, a) + log ∫_X e^{h(y)} q(dy|x, a)].   (24)
The constant l/γ is (under suitable assumptions) an optimal cost with respect to the risk-sensitive average cost criterion. Let us mention and discuss a few representative papers that deal with equation (24). In [8,15] Markov control models satisfying a simultaneous Doeblin condition, on a finite and countable state space, respectively, are considered. The cost functions are supposed to be bounded and the risk factor must be sufficiently small. Otherwise, as argued in [8], the optimality equation need not have a solution.
In [10] Di Masi and Stettner extend the result to a general state space by retaining bounded cost functions and replacing a simultaneous Doeblin condition with a very strong assumption on transition probabilities. In [11], however, they replace this assumption by one imposed on the risk coefficient. Finally, the class of Markov control models that requires neither any ergodicity conditions nor the smallness of the risk factor was pointed out by Jaśkiewicz in [20].
Fairly recently, Borkar and Meyn [5] considered Markov decision processes with unbounded cost functions on a denumerable state space. Their result assumes the following: the state space is irreducible under all Markov policies, the costs are norm-like, and there exists a policy that induces a finite average risk-sensitive cost. Moreover, their proof is based on a multiplicative ergodic theorem, which was studied in more detail in [1].
Proof of Theorem 1. Let {β n } be a sequence of discount factors converging to 1 for which (7) holds. Defining
and applying (6), we note that
for any x ∈ X. Assume for a while that inequality (23) is satisfied and that there exists f ∈ F as in the statement of Theorem 1. We prove that f is an optimal policy. From (23), we have
By iteration of this inequality n times, we obtain
Since h is nonnegative, we infer
Hence, (25) and (26) together imply
for each x ∈ X.
We next focus on showing inequality (23). Let n ≥ 1 and put h_n := h_{β_n}, f_n := f_{β_n}. Note that (19) can be rewritten in the following form:
(1 − β_n)m_{β_n} + h_n(x) = min_{a∈A(x)} [γc(x, a) + log ∫_X e^{β_n h_n(y)} q(dy|x, a)]   (27)
= γc(x, f_n(x)) + log ∫_X e^{β_n h_n(y)} q(dy|x, f_n(x)).
(i) Assume first (S) and define h(x) := lim inf_{n→∞} h_n(x), x ∈ X.
Taking the lim inf on both sides of (27), we get
l + h(x) ≥ lim inf_{n→∞} min_{a∈A(x)} [γc(x, a) + log ∫_X e^{β_n h_n(y)} q(dy|x, a)].
Making use of Lemma 3(a) and the measurable selection theorem (see Proposition D.5(a) in [17]), one can prove that there exists f ∈ F such that (23) holds.
(ii) Now assume (W). Fix x_0 ∈ X and choose any x_n → x_0, n → ∞. Take a subsequence {n_k} of positive integers such that
lim inf_{n→∞} h_n(x_n) = lim_{k→∞} h_{n_k}(x_{n_k}).
Then by (27),
Note that G = {x_0} ∪ {x_n} is compact in X. From the upper semicontinuity of x → A(x), the compactness of every A(z) and Berge's theorem (see [2] or Theorem 7.4.2 in [23]), it follows that ⋃_{z∈G} A(z) is compact in A. Therefore, {f_{n_k}(x_{n_k})} has a subsequence converging to some a_0 ∈ A. By (W)(i), a_0 ∈ A(x_0), that is, (x_0, a_0) ∈ K. Without loss of generality, assume that f_{n_k}(x_{n_k}) → a_0 as k → ∞.   (28)
By the lower semicontinuity of the cost function c and (28), we have
lim inf_{k→∞} γc(x_{n_k}, f_{n_k}(x_{n_k})) ≥ γc(x_0, a_0).
This and Lemma 3(b) imply that
l + lim inf_{n→∞} h_n(x_n) ≥ γc(x_0, a_0) + log ∫_X e^{ĥ(y)} q(dy|x_0, a_0),
where e^{ĥ} is the generalized lim inf of the sequence e^{h_k} = e^{h_{n_k}}. Clearly, h ≤ ĥ. By Lemma 3(b), h ∈ L(X). Thus,
l + lim inf_{n→∞} h_n(x_n) ≥ γc(x_0, a_0) + log ∫_X e^{h(y)} q(dy|x_0, a_0).   (29)
Since x_n → x_0 was chosen arbitrarily, we infer from (29) that
l + h(x_0) ≥ γc(x_0, a_0) + log ∫_X e^{h(y)} q(dy|x_0, a_0).
The last inequality shows that, for any x ∈ X, there exists an a_x ∈ A(x) such that
l + h(x) ≥ γc(x, a_x) + log ∫_X e^{h(y)} q(dy|x, a_x)   (30)
≥ min_{a∈A(x)} [γc(x, a) + log ∫_X e^{h(y)} q(dy|x, a)].
By our compactness-semicontinuity assumptions and Proposition D.5(b) in [17], there exists some f ∈ F such that (23) holds.
A discussion.
This section is devoted to a discussion of Condition (B). We start by revisiting Example 3.1 in [8], where ρ ∈ (0, 1). Recall that the following was proved. Let us consider three cases for the risk factor γ:
Then, if (I) or (II) holds, the optimal risk-sensitive average cost equals 0 and is independent of the initial state. In case (III) we have J*(0) = 0 and J*(1) = 1 + log(1 − ρ)/γ > 0. In addition, it is interesting to observe that, in cases (II) and (III), there does not exist a function h : X → R such that optimality inequality (23) is satisfied. Indeed, to see this take x = 1 and consider (III). The optimality inequality is then as follows:
Note that the right-hand side is strictly greater than γ + log(e^{h(1)}(1 − ρ)), which equals the left-hand side. Similar calculations for case (II) also lead to a contradiction. Hence, although the optimal cost is constant, the optimality inequality need not have a solution. Now we turn to checking Condition (B). Let V_β be as in Lemma 2. Clearly, V_β = w^N_β for N ≥ 1 and V_β(0) = 0. Then, by (8) under (I), we get
for each x ∈ X. Subtracting m_β from both sides of (32), we obtain
Iteration of this inequality up to the stopping time τ yields
Since π ∈ Π is an arbitrary policy, we easily get the conclusion.
Note that the fact
has the following interpretation: before the process reaches the "good states," the costs incurred at "early stages" should not be too large. Indeed, let us define a set D as follows. In the other cases, (33) fails to hold and, in addition, the earlier calculations show that h_β(1) = +∞. Summing up, the presented example shows that, without Condition (B) imposed on the family of functions {h_β(x)}, β ∈ (0, 1), a solution to the optimality inequality need not exist and, moreover, the optimal risk-sensitive average cost may depend on the initial state. In view of the above discussion, Condition (B) is designed to prevent the accrual of infinite expected costs. Namely, the costs incurred at transient states, which may be occupied only at "early stages," have an important and definite influence on the long-run performance measure. Therefore, Condition (B) requires the model to be communicating in the sense that certain sets of "good states" are reached sufficiently fast. Then the optimal risk-sensitive average cost is constant and the optimality inequality holds. In addition, it is worth mentioning that ergodicity of the Markov process/chain itself does not help as much as in the risk-neutral case. In other words, for an ergodic Markov chain, it may happen that the optimal risk-sensitive average cost depends on the initial state, as in Example 1. Moreover, in this example one can even prove in a straightforward way that in case (I) [either under Condition (B) or for sufficiently small risk factors], the optimality equation (24) is satisfied. Therefore, it would be interesting to know whether Condition (B) (together with some compactness-continuity assumptions) is sufficient to obtain a solution to the optimality equation. There is a conjecture that, since in the risk-neutral case a counterpart of Condition (B) is not sufficient [7], neither is it in the risk-sensitive setting. But this question is beyond the scope of the paper and remains open.
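The extracted text does not reproduce the dynamics of Example 1. The numerical sketch below assumes the usual reading of Example 3.1 in [8]: two states, cost 0 at the absorbing state 0, cost 1 at state 1, and a jump from 1 to 0 with probability ρ. Under that assumption one can estimate J(1) = lim (1/(γn)) log E_1 exp(γ ∑_{k<n} c(x_k)) directly and observe the dependence on the initial state once γ exceeds −log(1 − ρ).

```python
import math

# Hypothetical reading of Example 1 (Example 3.1 in [8]): X = {0, 1},
# c(0) = 0, c(1) = 1, state 0 is absorbing, and from state 1 the chain
# moves to 0 with probability rho and stays at 1 with probability 1 - rho.
rho = 0.3
threshold = -math.log(1.0 - rho)       # critical risk factor -log(1 - rho)

def J1(gamma, n=100_000):
    """Approximate J(1) = (1/(gamma*n)) log E_1 exp(gamma * sum_{k<n} c(x_k)).
    m_k := E_1 exp(gamma*S_k) satisfies m_{k+1} = e^gamma*((1-rho)*m_k + rho);
    the recursion is run in log scale to avoid overflow."""
    log_m = 0.0                         # log m_0 = 0
    for _ in range(n):
        log_m = gamma + log_m + math.log((1.0 - rho) + rho * math.exp(-log_m))
    return log_m / (gamma * n)

for gamma in (0.5 * threshold, 2.0 * threshold):
    predicted = max(0.0, 1.0 + math.log(1.0 - rho) / gamma)   # value claimed in case (III)
    print(f"gamma = {gamma:.4f}: J(1) ~ {J1(gamma):.4f}, predicted {predicted:.4f}, J(0) = 0")
```

With this reading, the estimate is close to 0 below the threshold and close to 1 + log(1 − ρ)/γ above it, while J(0) = 0 always, matching the initial-state dependence discussed above.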
The lemma below establishes a variational formula for the logarithmic moment-generating function. The reader is referred to Theorem 4.5.1 and Proposition 1.4.2 in [12] for its proof.
Lemma A. Let X be a Polish space, h a measurable function from X into R that is either bounded from below or bounded from above, and ν a probability measure on X. Then the supremum in the variational formula is attained uniquely at µ_0.
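The displays of Lemma A are missing from the extraction; the result is the Donsker-Varadhan variational formula from [12], which reads

log ∫_X e^{h(y)} ν(dy) = sup_{µ∈Pr(X)} [∫_X h(y) µ(dy) − R(µ ‖ ν)],

with the supremum attained uniquely at µ_0(dy) = e^{h(y)} ν(dy) / ∫_X e^{h(z)} ν(dz). The finite-support check below (with arbitrary, hypothetical numbers) verifies the formula and the maximizer numerically.

```python
import math
import random

# Finite-support sanity check of the variational formula:
#   log sum_x exp(h(x)) nu(x) = sup_mu [ sum_x h(x) mu(x) - R(mu || nu) ],
# with maximizer mu0(x) = exp(h(x)) nu(x) / Z.
nu = [0.5, 0.3, 0.2]
h = [1.0, -2.0, 0.5]

def rel_entropy(mu, nu):
    """R(mu || nu) on a finite space (both given as probability vectors)."""
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

Z = sum(math.exp(hx) * nx for hx, nx in zip(h, nu))
lhs = math.log(Z)                                   # logarithmic moment-generating function
mu0 = [math.exp(hx) * nx / Z for hx, nx in zip(h, nu)]
value_at_mu0 = sum(hx * mx for hx, mx in zip(h, mu0)) - rel_entropy(mu0, nu)
print(abs(lhs - value_at_mu0) < 1e-12)              # True: mu0 attains the supremum

random.seed(0)
for _ in range(5):                                   # any other mu gives a value <= lhs
    w = [random.random() for _ in nu]
    mu = [wi / sum(w) for wi in w]
    assert sum(hx * mx for hx, mx in zip(h, mu)) - rel_entropy(mu, nu) <= lhs + 1e-12
```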