Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning
In this work, we contribute the first approach to solving infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated on the basis of a single trajectory. First, we provide fundamental results regarding policy optimization in the single-trial regime: we investigate which class of policies suffices for optimality, cast our problem as a particular MDP that is equivalent to the original problem, and study the computational hardness of policy optimization in this regime. Second, we show how online planning techniques, in particular a Monte-Carlo tree search algorithm, can be leveraged to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
💡 Research Summary
The paper tackles the problem of solving infinite-horizon discounted General-Utility Markov Decision Processes (GUMDPs) when the agent's performance is evaluated on a single trajectory (the single-trial regime). Traditional GUMDP formulations assume an infinite number of trials, which makes the objective depend only on the expected discounted occupancy. However, many real-world applications can afford only a single episode, and because the utility function f is generally non-linear, the expected utility of a single trajectory, E[f(d)] (the single-trial objective), differs from the utility of the expected occupancy, f(E[d]) (the infinite-trial objective).
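The gap between the two objectives can be seen with a tiny numeric sketch (the two-state occupancies and the entropy-style utility below are illustrative assumptions, not taken from the paper): a fair coin sends the agent to one of two occupancy vectors, and a non-linear f makes E[f(d)] differ from f(E[d]).

```python
import numpy as np

# Hypothetical setup: each trial ends in one of two occupancy vectors
# with equal probability.
d1 = np.array([0.9, 0.1])  # trajectory mostly visits state 0
d2 = np.array([0.1, 0.9])  # trajectory mostly visits state 1

def f(d, eps=1e-12):
    # Entropy-style utility f(d) = d^T log d (illustrative choice).
    return float(d @ np.log(d + eps))

single_trial = 0.5 * f(d1) + 0.5 * f(d2)  # E[f(d)]
infinite_trial = f(0.5 * (d1 + d2))       # f(E[d]), with E[d] uniform
# By Jensen's inequality the two values differ for non-linear f.
```

Here f(E[d]) evaluates f at the uniform occupancy, while E[f(d)] averages f over two peaked occupancies, so the two objectives disagree and can favor different policies.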
The authors first establish fundamental theoretical results. They examine which policy classes are sufficient for optimality in the single‑trial setting. While stationary Markov policies are optimal for standard MDPs, the single‑trial GUMDP may require policies that keep track of the accumulated occupancy. By constructing an “occupancy MDP” – an extended MDP whose state consists of the original environment state together with the current discounted occupancy vector – they prove that a stationary Markov policy in this extended MDP is equivalent to an optimal (possibly history‑dependent) policy for the original GUMDP. This reduction shows that the problem can be cast as a conventional MDP, but with a continuous, high‑dimensional state space.
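A minimal sketch of one transition of such an occupancy MDP, assuming a normalized discounted occupancy update of the form d[s, a] += (1 - γ)γᵗ (the function name and `env_step` interface are illustrative, not the paper's API):

```python
import numpy as np

def occupancy_mdp_step(env_step, s, d, a, t, gamma=0.99):
    """One transition of a sketched occupancy MDP.

    Extended state = (environment state s, discounted occupancy vector d),
    where d is indexed by (state, action) pairs. `env_step(s, a)` is assumed
    to return the next environment state.
    """
    d = d.copy()
    # Add the discounted contribution of taking action a in state s at depth t.
    # The (1 - gamma) factor normalizes d so its entries sum to 1 in the limit.
    d[s, a] += (1.0 - gamma) * gamma ** t
    s_next = env_step(s, a)
    return s_next, d
```

The continuous occupancy component d is exactly what makes this extended state space high-dimensional, which motivates the approximate planning approach that follows.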
Next, they analyze computational complexity. The occupancy MDP’s state space grows with the dimension |S|·|A| and the occupancy values are real‑valued, leading to a problem that is at least as hard as solving a continuous‑state MDP. Exact dynamic programming becomes infeasible for anything beyond toy examples, and the authors prove that policy optimization in the single‑trial regime is NP‑hard. Consequently, they turn to approximate online planning.
The core algorithmic contribution is a Monte‑Carlo Tree Search (MCTS) method tailored to the occupancy MDP. Each node in the search tree stores the current environment state and the accumulated occupancy vector. The four standard MCTS phases are adapted as follows:
- Selection uses a UCT‑style upper confidence bound that accounts for both visit counts and the current occupancy estimate.
- Expansion creates a child node for an unexplored (state, action) pair and updates the occupancy vector with the discounted contribution of the taken action.
- Simulation (rollout) runs a simple default policy (random or heuristic) up to a truncated horizon H, producing a final empirical occupancy dπ,H. The utility f(dπ,H) is evaluated and returned as the rollout reward.
- Backpropagation propagates the rollout reward up the tree, updating mean value estimates and visit counts.
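The simulation phase described above can be sketched as follows (a hedged illustration, not the paper's implementation; the function signature and `env_step` interface are assumptions):

```python
import numpy as np

def rollout_value(f, env_step, s, d, t, H, n_actions, gamma, rng):
    """Simulation-phase sketch: random rollout to a truncated horizon H.

    Continues from extended state (s, d) at depth t, accumulates the
    discounted empirical occupancy, and returns f applied to the final
    occupancy as the rollout value.
    """
    d = d.copy()
    for k in range(t, H):
        a = rng.integers(n_actions)               # uniform random rollout policy
        d[s, a] += (1.0 - gamma) * gamma ** k     # discounted occupancy update
        s = env_step(s, a)
    return f(d)
```

Note that the rollout scores the whole empirical occupancy with f at the end, rather than summing per-step rewards, which is the key difference from MCTS on a standard MDP.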
The authors prove that, under standard assumptions (a bounded utility function and sufficiently many simulations), the action selected at the root converges in probability to the optimal action of the occupancy MDP, and therefore to an optimal single-trial decision for the original GUMDP.
Experimental evaluation covers three representative non‑linear utility functions: (i) maximum state‑entropy exploration (f(d)=dᵀlog d), (ii) imitation learning (f(d)=‖d−dβ‖²), and (iii) adversarial MDPs (f(d)=maxₖ dᵀcₖ). For each task, they compare the proposed MCTS planner against (a) policies optimized for the infinite‑trial objective, (b) a baseline that solves the extended MDP via exact dynamic programming when tractable, and (c) simple heuristic policies. Results show that the MCTS approach consistently achieves lower expected utility loss on a single trajectory, often by a large margin, especially when the utility is highly non‑linear. Moreover, performance degrades gracefully as the truncation horizon H is varied, confirming robustness to practical rollout limits.
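The three utilities are cheap to evaluate on an occupancy vector. A sketch of cost-style implementations (function names are illustrative; an eps term is added to keep log well-defined at zero entries):

```python
import numpy as np

def entropy_cost(d, eps=1e-12):
    # f(d) = d^T log d: minimized by spread-out (high-entropy) occupancies.
    return float(d @ np.log(d + eps))

def imitation_cost(d, d_beta):
    # f(d) = ||d - d_beta||^2: squared distance to a target occupancy d_beta.
    return float(np.sum((d - d_beta) ** 2))

def adversarial_cost(d, costs):
    # f(d) = max_k d^T c_k: worst case over a finite set of cost vectors c_k.
    return float(max(d @ c for c in costs))
```

All three are non-linear in d, which is precisely why the single-trial and infinite-trial objectives diverge on these tasks.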
In summary, the paper makes three key contributions:
- Theoretical foundation – precise characterization of optimal policy classes and a reduction of single‑trial GUMDPs to an occupancy MDP.
- Complexity insight – proof that exact optimization is computationally intractable, motivating approximate methods.
- Practical algorithm – a Monte‑Carlo Tree Search scheme that efficiently solves the occupancy MDP online, with provable convergence and strong empirical performance.
These results open the door to applying GUMDPs in domains where only a single episode can be collected, such as safety‑critical robotics, medical treatment planning, or financial decision making under strict budget constraints. Future work may explore function‑approximation techniques for the high‑dimensional occupancy state, extensions to continuous action spaces, and integration of safety or risk constraints directly into the MCTS rollout policy.