Efficient Approximation of Optimal Control for Markov Games


📝 Original Info

  • Title: Efficient Approximation of Optimal Control for Markov Games
  • ArXiv ID: 1011.0397
  • Date: 2011-07-11
  • Authors: Not provided in the source material.

📝 Abstract

We study the time-bounded reachability problem for continuous-time Markov decision processes (CTMDPs) and games (CTMGs). Existing techniques for this problem use discretisation techniques to break time into discrete intervals, and optimal control is approximated for each interval separately. Current techniques provide an accuracy of O(\epsilon^2) on each interval, which leads to an infeasibly large number of intervals. We propose a sequence of approximations that achieve accuracies of O(\epsilon^3), O(\epsilon^4), and O(\epsilon^5), that allow us to drastically reduce the number of intervals that are considered. For CTMDPs, the performance of the resulting algorithms is comparable to the heuristic approach given by Buckholz and Schulz, while also being theoretically justified. All of our results generalise to CTMGs, where our results yield the first practically implementable algorithms for this problem. We also provide positional strategies for both players that achieve similar error bounds.

📄 Full Content

Probabilistic models are being used extensively in the formal analysis of complex systems, including networked, distributed, and most recently, biological systems. Over the past 15 years, probabilistic model checking for discrete-time Markov decision processes (MDPs) and continuous-time Markov chains (CTMCs) has been successfully applied to these rich academic and industrial applications [9,8,11,3]. However, the theory for continuous-time Markov decision processes (CTMDPs), which mix the nondeterminism of MDPs with the continuous-time setting of CTMCs, is less well developed.

This paper studies the time-bounded reachability problem for CTMDPs and their extension to continuous-time Markov games, which is a model with both helpful and hostile nondeterminism. This problem is of paramount importance for model checking applications [5]. The nondeterminism in the system is resolved by providing a scheduler. The time-bounded reachability problem is to determine or to approximate, for a given set of goal locations G and time bound T, the maximal (or minimal) probability of reaching G before the deadline T that can be achieved by a scheduler.

Early work on this problem focused on restricted classes of schedulers, such as schedulers without any access to time in systems with uniform transition rates [1]. Recently, however, results have been proved for the more general class of late schedulers [15], which will be studied in this paper. The different classes of schedulers are contrasted by Neuhäußer et al. [14], who show that late schedulers are the most powerful class. Several algorithms have been given to approximate the time-bounded reachability probabilities for CTMDPs using this scheduler class [5,7,15,18].

The current state-of-the-art techniques for solving this problem are based on different forms of discretisation. This technique splits the time bound T into small intervals of length ε. Optimal control is approximated for each interval separately, and these approximations are combined to produce the final result. Current techniques can approximate optimal control on an interval of length ε with an accuracy of O(ε^2). However, to achieve a precision of π with these techniques, one must choose ε ≈ π/T, which leads to O(T^2/π) many intervals. Since the desired precision is often high (it is common to require that π ≤ 10^-6), this leads to an infeasibly large number of intervals that must be considered by the algorithms.
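The interval count above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that every constant hidden in the O-notation equals 1, so that ε = π/T exactly; the function name `intervals_first_order` is hypothetical and does not come from the paper.

```python
from fractions import Fraction
import math


def intervals_first_order(time_bound, precision):
    """Interval count T / eps for a technique with O(eps^2) error per interval.

    Assumption (not from the paper): all hidden constants are 1, so the
    interval length is eps = precision / time_bound and the count is
    time_bound^2 / precision. Exact rationals avoid floating-point rounding.
    """
    eps = Fraction(precision) / Fraction(time_bound)
    return math.ceil(Fraction(time_bound) / eps)


if __name__ == "__main__":
    # Time bound T = 10 and three target precisions.
    for exponent in (7, 9, 11):
        precision = Fraction(1, 10 ** exponent)
        count = intervals_first_order(10, precision)
        print(f"pi = 10^-{exponent}: {count:,} intervals")
```

For T = 10 and π = 10^-7 this gives 10^9 intervals, which illustrates why a quadratic per-interval accuracy becomes impractical at high precision.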

A recent paper of Buckholz and Schulz [6] has addressed this problem for practical applications, by allowing the interval sizes to vary. In addition to computing an approximation of the maximal time-bounded reachability probability, which provides a lower bound on the optimum, they also compute an upper bound. As long as the upper and lower bounds do not diverge too far, the interval can be extended indefinitely. In practical applications, where the optimal choice of action changes infrequently, this idea allows their algorithm to consider far fewer intervals while still maintaining high precision. However, from a theoretical perspective, their algorithm is not particularly satisfying. Their method for extending interval lengths depends on a heuristic, and in the worst case their algorithm may consider O(T^2/π) intervals, which is not better than other discretisation-based techniques.

In this paper we present a method of obtaining larger interval sizes that satisfies both theoretical and practical concerns. Our approach is to provide more precise approximations for each interval of length ε. While current techniques provide an accuracy of O(ε^2), we propose a sequence of approximations, called double ε-nets, triple ε-nets, and quadruple ε-nets, with accuracies O(ε^3), O(ε^4), and O(ε^5), respectively. Since these approximations are much more precise on each interval, they allow us to consider far fewer intervals while still maintaining high precision. For example, Table 1 gives the number of intervals considered by our algorithms, in the worst case, for a normed CTMDP with time bound T = 10.
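A rough way to see why higher per-interval accuracy reduces the interval count is to add up the per-interval errors. The derivation below is a heuristic sketch, not the paper's own analysis: it assumes that the error on each interval is at most c·ε^(k+1) for some constant c > 0 (a hypothetical symbol introduced here), and that errors accumulate additively over the N = T/ε intervals.

```latex
% Assumption: per-interval error <= c * eps^{k+1}, errors add over N = T/eps intervals.
\begin{align*}
  N \cdot c\,\varepsilon^{k+1}
    &= \frac{T}{\varepsilon}\, c\,\varepsilon^{k+1}
     = c\,T\,\varepsilon^{k} \;\le\; \pi
  \quad\Longrightarrow\quad
  \varepsilon \;\le\; \left(\frac{\pi}{c\,T}\right)^{1/k}, \\
  N &= \frac{T}{\varepsilon} \;\ge\; T \left(\frac{c\,T}{\pi}\right)^{1/k}.
\end{align*}
% k = 1 (accuracy O(eps^2)) recovers N = O(T^2 / pi);
% k = 2, 3, 4 correspond to accuracies O(eps^3), O(eps^4), O(eps^5).
```

Under these assumptions, the number of intervals grows like (1/π)^(1/k) rather than 1/π, so each additional order of accuracy lowers the exponent on the precision. The constants in the paper's own bounds may differ from this sketch.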

Table 1

Technique                     π = 10^-7        π = 10^-9          π = 10^-11
Current techniques, O(ε^2)    1,000,000,000    100,000,000,000    10,000,000,000,000

Of course, in order to become more precise, we must spend additional computational effort. However, the cost of using double ε-nets instead of using current techniques requires only an extra factor of log |Σ|, where Σ is the set of actions. Thus, in almost all cases, the large reduction in the number of intervals far outweighs the extra cost of using double ε-nets. Our worst case running times for triple and quadruple ε-nets are not so attractive: triple ε-nets require an extra |L| · |Σ^2| factor over double ε-nets, where L is the set of locations, and quadruple ε-nets require yet another |L| · |Σ^2| factor over triple ε-nets. However, these worst case running times only occur when the choice of optimal action changes frequently, and we speculate that the cost of using t
