Bounding Procedures for Stochastic Dynamic Programs with Application to the Perimeter Patrol Problem
One often encounters the curse of dimensionality in the application of dynamic programming to determine optimal policies for controlled Markov chains. In this paper, we provide a method to construct sub-optimal policies along with a bound for the deviation of such a policy from the optimum via a linear programming approach. The state-space is partitioned and the optimal cost-to-go or value function is approximated by a constant over each partition. By minimizing a non-negative cost function defined on the partitions, one can construct an approximate value function which also happens to be an upper bound for the optimal value function of the original Markov Decision Process (MDP). As a key result, we show that this approximate value function is *independent* of the non-negative cost function (or state dependent weights as it is referred to in the literature) and moreover, this is the least upper bound that one can obtain once the partitions are specified. Furthermore, we show that the restricted system of linear inequalities also embeds a family of MDPs of lower dimension, one of which can be used to construct a lower bound on the optimal value function. The construction of the lower bound requires the solution to a combinatorial problem. We apply the linear programming approach to a perimeter surveillance stochastic optimal control problem and obtain numerical results that corroborate the efficacy of the proposed methodology.
💡 Research Summary
The paper tackles the well‑known “curse of dimensionality” that hampers the use of dynamic programming (DP) for optimal control of Markov decision processes (MDPs) with large state spaces. The authors propose a state‑aggregation technique in which the original finite state set S is partitioned into M disjoint subsets {S₁, …, S_M}. Within each partition the value function V(x) is forced to be a constant v(i), i.e., V(x) = v(i) for all x ∈ S_i. This restriction reduces the number of decision variables in the linear programming (LP) formulation of the DP from |S| to M, while preserving the original Bellman inequality constraints.
The central theoretical contribution is the proof that, once the partitioning is fixed, the optimal solution of the restricted LP (RLP) is independent of the non‑negative cost vector c used in the LP objective. Consequently, the resulting piecewise‑constant value function v* is the least upper bound on the true optimal value function V* achievable with the given partition, regardless of how the cost weights are selected. Moreover, every feasible solution of the RLP dominates V* (i.e., is greater than or equal to V* component‑wise), which guarantees that v* is a genuine upper bound.
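Both claims — that the RLP solution dominates V* component‑wise and that it does not depend on the positive weights — can be checked numerically on a small instance. The sketch below is a minimal illustration on a toy 4‑state, 2‑action discounted reward‑maximization MDP (all numbers are made up for illustration, not taken from the paper), solved with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Toy 4-state, 2-action discounted MDP (illustrative numbers only).
alpha = 0.9
P = np.array([          # P[u][x, y]: transition probability under action u
    [[0.5, 0.5, 0.0, 0.0],
     [0.2, 0.3, 0.5, 0.0],
     [0.0, 0.4, 0.3, 0.3],
     [0.0, 0.0, 0.5, 0.5]],
    [[0.1, 0.2, 0.3, 0.4],
     [0.4, 0.1, 0.3, 0.2],
     [0.25, 0.25, 0.25, 0.25],
     [0.3, 0.3, 0.2, 0.2]],
])
R = np.array([[1.0, 0.5],   # R[x, u]: one-step reward
              [0.0, 1.5],
              [2.0, 1.0],
              [1.0, 2.0]])
n_states, n_actions = R.shape

def solve_lp(Phi, w):
    """min w^T v  s.t.  (Phi - alpha * P[u] @ Phi) v >= R[:, u] for every u.

    With Phi = I this is the exact LP, whose optimum is V*; with an
    aggregation matrix Phi it is the restricted LP (RLP) over the
    per-partition values v."""
    A_ub = np.vstack([-(Phi - alpha * P[u] @ Phi) for u in range(n_actions)])
    b_ub = np.concatenate([-R[:, u] for u in range(n_actions)])
    res = linprog(w, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * Phi.shape[1])
    return res.x

V_star = solve_lp(np.eye(n_states), np.ones(n_states))  # exact value function

# Partition S_1 = {0, 1}, S_2 = {2, 3}: value forced constant on each block.
Phi = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
v_a = solve_lp(Phi, np.array([1.0, 1.0]))   # one choice of positive weights
v_b = solve_lp(Phi, np.array([5.0, 0.3]))   # a very different choice

print("same RLP solution:", np.allclose(v_a, v_b, atol=1e-6))
print("dominates V*:", bool(np.all(Phi @ v_a >= V_star - 1e-6)))
```

One way to see the weight‑independence in this sketch: each RLP row has exactly one positive coefficient (in the column of the row's own partition) and non‑positive coefficients elsewhere, so the component‑wise minimum of two feasible points is again feasible; the feasible set therefore has a least element, which is the optimum for every strictly positive weight vector.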
To complement the upper bound, the authors introduce a disjunctive LP that uses only a subset of the RLP constraints. This LP can be interpreted as the exact LP of a lower‑dimensional MDP embedded within the original problem. Its optimal solution provides a lower bound on V*. Computing this lower bound requires solving a combinatorial selection problem over the possible transition patterns within each partition. While this problem is generally NP‑hard, the authors show that for certain structured applications—specifically the perimeter patrol problem—it can be solved efficiently.
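To convey the flavor of such a lower bound without reproducing the paper's combinatorial selection, the sketch below computes a deliberately conservative variant from the same aggregated ingredients: a max–min recursion that, at each aggregate state, backs up from the worst representative inside the partition. This is only an illustrative construction on toy data (the same kind of 4‑state, 2‑action MDP as before), not the paper's embedded‑MDP procedure, but its fixed point provably sits below V* on every partition, which the final check confirms:

```python
import numpy as np

# Toy 4-state, 2-action discounted MDP (illustrative numbers only; this is
# NOT the paper's combinatorial construction, just a conservative max-min
# variant built from the same aggregated transition probabilities).
alpha = 0.9
P = np.array([
    [[0.5, 0.5, 0.0, 0.0],
     [0.2, 0.3, 0.5, 0.0],
     [0.0, 0.4, 0.3, 0.3],
     [0.0, 0.0, 0.5, 0.5]],
    [[0.1, 0.2, 0.3, 0.4],
     [0.4, 0.1, 0.3, 0.2],
     [0.25, 0.25, 0.25, 0.25],
     [0.3, 0.3, 0.2, 0.2]],
])
R = np.array([[1.0, 0.5],
              [0.0, 1.5],
              [2.0, 1.0],
              [1.0, 2.0]])
parts = [[0, 1], [2, 3]]            # partition S_1, S_2
n_actions = R.shape[1]

# Aggregated transition mass q[u][x, j] = P(next state lands in S_j | x, u).
q = np.stack([
    np.stack([P[u][:, blk].sum(axis=1) for blk in parts], axis=1)
    for u in range(n_actions)
])

# Conservative recursion: v(i) <- max_u min_{x in S_i} [R(x,u) + alpha*sum_j q*v].
# Backing up from the worst representative keeps every iterate (and hence the
# fixed point) below V* on its partition.
v = np.zeros(len(parts))
for _ in range(300):
    backup = R + alpha * np.einsum('uxj,j->xu', q, v)       # per (x, u)
    v = np.array([backup[blk].min(axis=0).max() for blk in parts])

# Reference V* by standard value iteration, for comparison.
V = np.zeros(R.shape[0])
for _ in range(300):
    V = (R + alpha * np.einsum('uxy,y->xu', P, V)).max(axis=1)

print("lower bound:", np.round(v, 3))
print("V* by blocks:", [np.round(V[blk], 3) for blk in parts])
```

The paper's combinatorial search instead identifies a single embedded lower‑dimensional MDP whose exact solution certifies the bound, which is generally tighter than this worst‑case recursion.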
The paper also examines extensions such as variable lifting and the use of iterated Bellman inequalities (Wang & Boyd, 2010). It demonstrates that these extensions do not improve the upper bound beyond what the basic restricted LP already yields, confirming the optimality of the simple aggregation approach in terms of bound tightness.
The methodology is applied to a stochastic perimeter‑patrol scenario involving an unmanned aerial vehicle (UAV) that must monitor a circular boundary while stochastic intrusion events occur. The state includes the UAV’s location, remaining fuel, and the presence of intrusion alerts. The authors partition the state space by geographic zones, define transition probabilities that incorporate environmental noise and intrusion dynamics, and solve both the RLP (upper bound) and the disjunctive LP (lower bound). Numerical results are compared against value‑iteration approximations of the true optimal policy. The upper bound aligns almost exactly with the value‑iteration solution, while the lower bound remains close, confirming the practical effectiveness of the approach. Computationally, the aggregated LPs are dramatically smaller and faster to solve than the full LP, illustrating scalability to larger problems.
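The value‑iteration baseline used for such comparisons is standard; a minimal sketch for a discounted reward‑maximization MDP follows (toy data only — the actual patrol model couples UAV position, fuel, and alert status into a far larger state space):

```python
import numpy as np

# Minimal value-iteration baseline with greedy policy extraction.
# Illustrative 4-state, 2-action MDP; numbers are not from the paper.
alpha = 0.9
P = np.array([
    [[0.5, 0.5, 0.0, 0.0],
     [0.2, 0.3, 0.5, 0.0],
     [0.0, 0.4, 0.3, 0.3],
     [0.0, 0.0, 0.5, 0.5]],
    [[0.1, 0.2, 0.3, 0.4],
     [0.4, 0.1, 0.3, 0.2],
     [0.25, 0.25, 0.25, 0.25],
     [0.3, 0.3, 0.2, 0.2]],
])
R = np.array([[1.0, 0.5],
              [0.0, 1.5],
              [2.0, 1.0],
              [1.0, 2.0]])

V = np.zeros(R.shape[0])
for _ in range(1000):
    Q = R + alpha * np.einsum('uxy,y->xu', P, V)   # state-action values Q[x, u]
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:            # sup-norm stopping rule
        break
    V = V_new

policy = Q.argmax(axis=1)                          # greedy policy from V
print("V  =", np.round(V, 3))
print("pi =", policy)
```

Against such a converged baseline, the RLP value gives the guaranteed over‑estimate and the embedded‑MDP value the under‑estimate, so the true optimum is bracketed without ever solving the full‑size LP.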
In summary, the paper establishes that a simple state‑aggregation scheme yields a cost‑weight‑independent, least‑possible upper bound on the optimal value function, and that a related disjunctive LP can furnish a complementary lower bound. The theoretical results are backed by a realistic surveillance application, making the framework a valuable tool for researchers and practitioners dealing with large‑scale stochastic control problems where performance guarantees are required.