Policy Iteration for Relational MDPs


Relational Markov Decision Processes are a useful abstraction for complex reinforcement learning problems and stochastic planning problems. Recent work developed representation schemes and algorithms for planning in such problems using the value iteration algorithm. However, exact versions of more complex algorithms, including policy iteration, have not been developed or analyzed. The paper investigates this potential and makes several contributions. First we observe two anomalies for relational representations showing that the value of some policies is not well defined or cannot be calculated for restricted representation schemes used in the literature. On the other hand, we develop a variant of policy iteration that can get around these anomalies. The algorithm includes an aspect of policy improvement in the process of policy evaluation and thus differs from the original algorithm. We show that despite this difference the algorithm converges to the optimal policy.


💡 Research Summary

The paper tackles the problem of applying policy iteration (PI) to Relational Markov Decision Processes (RMDPs), a class of models that capture complex, structured state and action spaces through first‑order logic representations. While recent work has focused on relational value iteration (RVI) and other value‑based methods, exact versions of more sophisticated algorithms such as PI have been lacking. The authors first identify two fundamental anomalies that arise when naïvely extending traditional PI to relational settings. The first anomaly is that, for certain policies, the expected return may be undefined or diverge because relational rules can be applied recursively without a guaranteed termination condition, especially in domains where objects can be created or destroyed dynamically. The second anomaly concerns the policy‑evaluation step: the restricted representation schemes used in prior relational planning (e.g., bounded variable scopes, fixed templates) do not guarantee closure under the Bellman backup, making it impossible to compute a well‑defined value function for some policies.
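The first anomaly is easiest to see in a toy propositional rendering. The sketch below is an invented illustration, not a domain from the paper: states are sets of object names, and the policy's action creates a fresh object each step, so a flat evaluation that tries to enumerate all reachable ground states can never terminate.

```python
from itertools import islice

def reachable_states(initial, policy_step):
    """Generate the states reachable under a fixed policy (possibly infinite)."""
    state = initial
    while True:
        yield state
        state = policy_step(state)

def create_object(state):
    # Invented action: add one fresh object to the state.
    return frozenset(state) | {f"obj{len(state)}"}

# Even the first few reachable states show unbounded growth, so exact
# tabular policy evaluation over this chain has no finite enumeration.
first_five = list(islice(reachable_states(frozenset(), create_object), 5))
sizes = [len(s) for s in first_five]
print(sizes)  # state size grows without bound: [0, 1, 2, 3, 4]
```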

To overcome these issues, the authors propose a novel variant called Integrated Policy Evaluation‑Improvement (IPEI). Rather than separating evaluation and improvement into distinct phases, IPEI interleaves a limited‑depth relational backup with a simultaneous policy‑improvement step. Concretely, each iteration consists of (1) performing a bounded‑depth relational Bellman update on the current policy to obtain a partial value estimate, (2) using this estimate to construct a new, locally optimal policy, and (3) repeating the process. Because the evaluation does not require reaching a fixed point, the algorithm avoids the infinite‑loop problem highlighted by the first anomaly, while the improvement step ensures monotonic progress toward higher‑value policies.
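In the propositional case, this interleaved loop resembles modified policy iteration. The following is a hedged sketch of that loop structure on a small tabular MDP; the transition and reward numbers are invented for illustration, and the paper's actual algorithm operates on relational value descriptions, not arrays.

```python
import numpy as np

n_states, n_actions, gamma, depth = 3, 2, 0.9, 5

# Invented toy MDP: P[a][s, s'] = transition probability, R[s, a] = reward.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],  # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 0.1], [0.0, 0.2], [1.0, 0.0]])

def bounded_eval(policy, v, k):
    """Step (1): k bounded-depth Bellman backups under a fixed policy.
    No fixed point is required, which sidesteps the non-termination anomaly."""
    for _ in range(k):
        v = np.array([R[s, policy[s]] + gamma * P[policy[s], s] @ v
                      for s in range(n_states)])
    return v

def improve(v):
    """Step (2): greedy improvement w.r.t. the partial value estimate."""
    q = np.array([[R[s, a] + gamma * P[a, s] @ v for a in range(n_actions)]
                  for s in range(n_states)])
    return q.argmax(axis=1)

v = np.zeros(n_states)
policy = np.zeros(n_states, dtype=int)
for _ in range(200):          # step (3): repeat evaluation and improvement
    v = bounded_eval(policy, v, depth)
    policy = improve(v)

print("learned policy:", policy)  # converges to the optimal policy here
```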

The theoretical contribution is a rigorous convergence proof. The authors show that the combined evaluation‑improvement operator is monotone with respect to the partial order induced by relational formulas and that it is a contraction mapping on the space of admissible policies. Consequently, despite deviating from the classic PI structure, IPEI is guaranteed to converge in a finite number of iterations to the optimal relational policy, even in domains where the value of some policies is otherwise undefined. The proof also establishes a lower bound on the achievable reward during the anomalous phases, demonstrating that the algorithm remains stable and does not diverge.
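For reference, the two properties the proof rests on can be stated for a discounted Bellman operator $T$; these are textbook MDP facts, and the paper's contribution lies in establishing them for the combined evaluation-improvement operator over relational value descriptions.

```latex
% Monotonicity: applying the operator preserves pointwise order
v \le v' \;\Longrightarrow\; T v \le T v'
% Contraction in the sup norm, with discount factor \gamma < 1
\lVert T v - T v' \rVert_\infty \le \gamma \, \lVert v - v' \rVert_\infty
```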

Empirical validation is performed on three benchmark relational domains: a classic Blocks World scenario with a variable number of blocks, a robot‑manipulation task involving pick‑and‑place actions and object creation, and a synthetic domain that explicitly creates and deletes objects during execution. The experiments compare IPEI against standard relational value iteration and a conventional (non‑relational) policy iteration operating on a flattened state space. Results indicate that IPEI converges reliably in all settings, while RVI fails to produce a value for policies that trigger the identified anomalies. Moreover, IPEI converges roughly 30% faster than RVI and yields policies with 10–15% higher expected returns. In scalability tests, the non‑relational PI becomes intractable as the number of objects grows, owing to state‑space explosion, whereas IPEI's relational abstraction keeps memory and runtime growth near‑linear.

In conclusion, the paper delivers the first exact policy‑iteration algorithm tailored to relational MDPs, demonstrating that integrating policy improvement into the evaluation phase resolves the representation‑induced anomalies that have hampered prior work. The approach preserves the expressive power of relational representations while providing strong theoretical guarantees and practical performance benefits. The authors suggest future directions such as combining IPEI with neural‑symbolic function approximators, extending the method to richer logical formalisms (e.g., probabilistic first‑order logic), and exploring distributed implementations to handle even larger relational planning problems.