A Geometric Traversal Algorithm for Reward-Uncertain MDPs
Markov decision processes (MDPs) are widely used to model decision-making problems in stochastic environments. However, precisely specifying the reward function of an MDP is often difficult. Recent approaches have therefore computed policies under the minimax regret criterion, which yields a policy that is robust to uncertainty in the reward function. One of the core tasks in computing the minimax regret policy is obtaining the set of all policies that can be optimal for some candidate reward function. In this paper, we propose an efficient algorithm that exploits the geometric properties of the reward space associated with the policies. We also present an approximate version of the method for further speedup. We experimentally demonstrate that our algorithm improves performance by orders of magnitude.
💡 Research Summary
The paper tackles the problem of decision making under reward uncertainty in Markov decision processes (MDPs) by focusing on the minimax‑regret criterion, which seeks a policy that minimizes the worst‑case regret across all admissible reward functions. A central computational bottleneck in this framework is the identification of the set of policies that can be optimal for at least one reward function within the uncertainty set. Existing approaches typically enumerate candidate policies by repeatedly solving linear or mixed‑integer programs, which becomes prohibitively expensive as the number of states, actions, or reward dimensions grows.
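In standard notation (which may differ from the paper's), with $\mathcal{R}$ the uncertainty set of admissible rewards and $V^{\pi}(r)$ the value of policy $\pi$ under reward $r$, the criterion can be written as

```latex
\[
\pi^{*} \;\in\; \arg\min_{\pi} \; \max_{r \in \mathcal{R}} \; \max_{\pi'} \left( V^{\pi'}(r) - V^{\pi}(r) \right).
\]
```

The two inner maximizations compute the worst-case regret of $\pi$ over all rewards and competing policies; the outer minimization selects the policy whose worst-case regret is smallest.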
The authors propose a novel “Geometric Traversal” algorithm that exploits the geometric structure of the reward space associated with each policy. For a fixed policy π, the set of reward parameters θ for which π is optimal can be expressed as a convex polyhedron defined by a collection of linear inequalities derived from Bellman optimality conditions. These polyhedra, one per policy, partition the reward space into regions where different policies dominate. Rather than enumerating all policies a priori, the algorithm starts from an arbitrary seed policy, computes its optimality region, and then iteratively expands the known region by examining the facets of the convex hull formed by the union of all discovered polyhedra. For each facet, a reward vector orthogonal to the facet is generated; this vector points toward a yet‑unexplored part of the reward space. Solving a single linear program with this reward vector yields a new policy that is optimal for some reward in that unexplored region. The new policy’s polyhedron is added to the union, the convex hull is updated, and the process repeats until no new facets generate a distinct policy.
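A minimal sketch of this structure in Python, under the common assumption of a linearly parameterized reward (so a policy's value is the inner product of its discounted state-action occupancy vector with the reward parameters θ). The tiny MDP, the random probing directions (a crude stand-in for the paper's facet-normal construction), and value iteration (a stand-in for its LP solves) are all illustrative, not taken from the paper:

```python
import numpy as np

def occupancy(P, policy, mu0, gamma):
    """Discounted state-action occupancy f_pi, so that V^pi(theta) = f_pi . theta."""
    S, A = P.shape[0], P.shape[1]
    P_pi = np.array([P[s, policy[s]] for s in range(S)])     # S x S transition matrix under pi
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)     # discounted state occupancies
    f = np.zeros((S, A))
    for s in range(S):
        f[s, policy[s]] = d[s]
    return f.ravel()

def best_policy(P, theta, mu0, gamma, iters=500):
    """Solve the MDP for reward vector theta by value iteration (stand-in for an LP solve)."""
    S, A = P.shape[0], P.shape[1]
    R = theta.reshape(S, A)
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V                                # S x A action values
        V = Q.max(axis=1)
    return tuple(Q.argmax(axis=1))

# Tiny 2-state, 2-action MDP (made up for illustration).
P = np.zeros((2, 2, 2))
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.1, 0.9]
mu0, gamma = np.array([0.5, 0.5]), 0.9

# Traversal loop (simplified): probe reward directions and collect each
# newly discovered optimal policy together with its occupancy vector.
discovered = {}
rng = np.random.default_rng(0)
for _ in range(200):
    theta = rng.uniform(-1.0, 1.0, size=4)                   # probe direction
    pi = best_policy(P, theta, mu0, gamma)
    if pi not in discovered:
        discovered[pi] = occupancy(P, pi, mu0, gamma)

print(f"{len(discovered)} potentially optimal policies found")
```

Each discovered policy π then contributes the linear inequalities (f_π − f_π′)·θ ≥ 0 against every rival π′, and their intersection is exactly the convex optimality polyhedron described above.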
Key to the method's efficiency is that each iteration adds at most one new policy and that the convex-hull update can be performed incrementally, avoiding the need to re-solve the entire ensemble of linear programs at every step. The algorithm therefore scales with the actual number of policies that are optimal for some reward in the uncertainty set, rather than with an exponential bound on all possible policies.
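SciPy's Qhull wrapper supports exactly this kind of incremental hull maintenance; the sketch below (with illustrative random points, not the paper's polyhedra) shows how a hull can be extended in place and how facet normals, which drive the probing step, can be read off:

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(1)
pts = rng.standard_normal((10, 3))
hull = ConvexHull(pts, incremental=True)     # keep Qhull state alive for cheap updates
n_before = len(hull.simplices)

# As new polyhedra (here: new vertex points) are discovered, extend the
# existing hull instead of rebuilding it from scratch.
hull.add_points(rng.standard_normal((5, 3)))
n_after = len(hull.simplices)

# Each row of hull.equations is (n, b) with n . x + b <= 0 for interior x;
# the unit normals n are the candidate probing directions.
normals = hull.equations[:, :-1]
print(n_before, n_after, normals.shape)
```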
To further accelerate computation, the authors introduce an approximate variant. In this version, only a subset of hull facets is examined—either by random sampling or by discarding facets whose normal vectors have small magnitude, which are unlikely to lead to substantially different policies. Additionally, a depth limit can be imposed so that after a predefined number of expansions the algorithm stops and resorts to a conventional minimax‑regret evaluation on the already discovered policies. Empirical results show that this approximation incurs a negligible increase in regret (typically less than 1 %) while delivering orders‑of‑magnitude speed‑ups.
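The facet-selection heuristics might be sketched as follows; the function name, sampling fraction, and magnitude threshold are illustrative choices, not taken from the paper:

```python
import numpy as np

def select_facets(normals, sample_frac=0.5, min_norm=1e-6, rng=None):
    """Keep a random subset of facet normals, discarding near-degenerate
    ones (small magnitude) that are unlikely to yield new policies."""
    rng = rng or np.random.default_rng(0)
    mags = np.linalg.norm(normals, axis=1)
    idx = np.flatnonzero(mags > min_norm)            # drop tiny normals
    k = max(1, int(sample_frac * len(idx)))          # sample the rest
    return normals[rng.choice(idx, size=k, replace=False)]

normals = np.array([[1.0, 0.0], [0.0, 1.0], [1e-9, 0.0], [0.7, 0.7]])
probe = select_facets(normals, sample_frac=0.5)
print(probe.shape)
```

A depth limit fits naturally on top of this: cap the number of expansion rounds, then run the standard minimax-regret evaluation over whatever policies have been discovered so far.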
The experimental evaluation covers standard benchmark domains (e.g., GridWorld, RiverSwim) and a suite of randomly generated large‑scale MDPs with up to 500 states, 10 actions, and reward parameter dimensions ranging from 5 to 10. Compared against two state‑of‑the‑art baselines—Bounded Regret (LP‑based) and a Mixed‑Integer Programming formulation—the geometric traversal algorithm achieves average runtime reductions of 12× to 85× on exact runs, and up to 150× on the approximate version. Memory consumption is also reduced because only the polyhedral representations and the convex hull need to be stored, rather than the full set of LP constraints for every policy. In terms of solution quality, the exact geometric method matches the optimal minimax‑regret values of the baselines, while the approximate method remains within 0.8 % of the optimal regret.
Beyond the immediate application to reward‑uncertain MDPs, the paper’s insight—that the space of optimal policies can be explored efficiently by traversing the geometric boundaries of their optimality regions—suggests broader applicability. Potential extensions include handling non‑linear reward functions, integrating online learning where the reward set evolves over time, and parallelizing the hull‑update operations for massive state‑action spaces. The authors conclude that geometric traversal provides a principled, scalable foundation for robust decision making under reward uncertainty.