Inductive Policy Selection for First-Order MDPs

We select policies for large Markov Decision Processes (MDPs) with compact first-order representations. We find policies that generalize well as the number of objects in the domain grows, potentially without bound. Existing dynamic-programming approaches based on flat, propositional, or first-order representations either are impractical here or do not naturally scale as the number of objects grows without bound. We implement and evaluate an alternative approach that induces first-order policies using training data constructed by solving small problem instances using PGraphplan (Blum & Langford, 1999). Our policies are represented as ensembles of decision lists, using a taxonomic concept language. This approach extends the work of Martin and Geffner (2000) to stochastic domains, ensemble learning, and a wider variety of problems. Empirically, we find “good” policies for several stochastic first-order MDPs that are beyond the scope of previous approaches. We also discuss the application of this work to the relational reinforcement-learning problem.


💡 Research Summary

The paper tackles the problem of generating policies for large Markov Decision Processes (MDPs) that are described using compact first‑order (relational) representations. Traditional dynamic‑programming (DP) approaches either flatten the relational structure into a propositional form—causing an exponential blow‑up as the number of objects grows—or operate directly on first‑order representations but remain impractical for domains where the object count can increase without bound. To overcome these limitations, the authors propose an inductive policy‑selection framework that learns policies from examples generated on small problem instances.
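The combinatorics behind this blow-up can be made concrete with a back-of-the-envelope sketch (the binary predicate `on` is just an illustration, not tied to any particular domain encoding):

```python
# Illustrative sketch: why propositionalizing a relational MDP blows up.
# Grounding a single binary predicate on(x, y) over n objects yields n*n
# ground atoms, so the flat state space has up to 2**(n*n) truth assignments.
def ground_atoms(n_objects, arity=2):
    """Number of ground atoms for one predicate of the given arity."""
    return n_objects ** arity

for n in (3, 5, 10):
    atoms = ground_atoms(n)
    print(f"{n} objects -> {atoms} atoms -> up to {2 ** atoms} flat states")
```

Even at 10 objects a single binary predicate already induces 2^100 candidate propositional states, which is why flat DP is hopeless when the object count is unbounded.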

The core idea is simple yet powerful: solve a set of modest‑size MDP instances exactly using PGraphplan, collect the resulting optimal state‑action pairs, and then induce a policy that generalizes to larger instances. The policy is represented as an ensemble of decision lists written in a taxonomic concept language. The taxonomic language allows the definition of hierarchical concepts over objects and relations (e.g., “a block that is on the table and has a clear top”), enabling compact expression of complex conditions. Decision lists consist of ordered “if‑condition then‑action” rules; when a condition matches the current state, the associated action is taken. This representation is both interpretable and naturally scalable because the same rule can apply regardless of how many objects instantiate the underlying concepts.
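As a rough illustration of the decision-list idea (not the paper's actual taxonomic syntax), the following sketch evaluates ordered rules over a relational state; all predicate and action names (`on`, `on-table`, `put-on-table`) are invented for the example:

```python
# Rough sketch of a decision-list policy over a relational state, assuming a
# toy blocks-world encoding.  Predicate and action names are illustrative.

# A relational state is a set of ground atoms, e.g. ("on", "a", "b").
def clear(state, b):
    """Concept 'clear': nothing is stacked on top of block b."""
    return not any(a[0] == "on" and a[2] == b for a in state)

def on_table(state, b):
    return ("on-table", b) in state

def blocks(state):
    return {a[1] for a in state if a[0] in ("on", "on-table")}

# Ordered rules: each pairs a concept (here a Python function returning a
# matching object or None) with an action template; the first match fires.
RULES = [
    # "If some block is clear but not on the table, put it on the table."
    (lambda s: next((b for b in sorted(blocks(s))
                     if clear(s, b) and not on_table(s, b)), None),
     lambda b: ("put-on-table", b)),
]

def decision_list_policy(state):
    for concept, action in RULES:
        match = concept(state)
        if match is not None:
            return action(match)
    return ("noop",)  # default when no rule applies

state = {("on", "a", "b"), ("on-table", "b")}
print(decision_list_policy(state))  # -> ("put-on-table", "a")
```

Note that the rule never mentions specific objects: it applies to whichever block satisfies the concept, which is exactly what makes the policy indifferent to the number of objects in the instance.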

Learning proceeds in two stages. First, a set of small MDPs is solved optimally with PGraphplan, producing a training corpus of state‑action examples. Second, the corpus is fed to a learning algorithm that constructs several decision‑list policies independently. These policies are then combined via a voting ensemble: each decision list votes for an action in a given state, and the majority vote determines the final choice. The ensemble mitigates over‑fitting of any single decision list and improves robustness across diverse situations. Moreover, the learning algorithm automatically generates and selects taxonomic concepts, reducing the need for hand‑crafted domain knowledge.
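The voting step described above can be sketched in a few lines; the stub policies here are stand-ins for independently learned decision lists, and the action tuples are invented for the example:

```python
from collections import Counter

# Minimal sketch of the ensemble vote: each independently learned decision-list
# policy proposes an action for the current state, and the plurality wins.
def ensemble_policy(policies, state):
    votes = Counter(policy(state) for policy in policies)
    return votes.most_common(1)[0][0]

# Illustrative stand-ins for learned decision-list policies.
p1 = lambda s: ("stack", "a", "b")
p2 = lambda s: ("stack", "a", "b")
p3 = lambda s: ("put-on-table", "a")

print(ensemble_policy([p1, p2, p3], state={}))  # -> ("stack", "a", "b")
```

Because each list is learned independently, their errors tend to be uncorrelated, so the plurality vote filters out idiosyncratic mistakes of any single list.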

The authors evaluate their method on several stochastic first‑order domains, including variants of the classic Blocks World, robot navigation tasks with relational obstacles, and relational puzzle problems. In each case, they train on instances with a modest number of objects (e.g., 3–4 blocks) and test on much larger configurations (e.g., 10–12 blocks). The induced policies achieve high success rates, often approaching optimal performance, and significantly outperform existing DP‑based planners and earlier relational policy‑learning approaches that cannot scale to the larger test sizes. The success is attributed to the object‑independent nature of the learned rules: because conditions are expressed in terms of concepts rather than specific object identifiers, the same rule set remains applicable when new objects appear.

Beyond the empirical results, the paper discusses the relevance of this approach to relational reinforcement learning (RRL). In RRL, the environment model is typically unknown, so policies must be learned directly from interaction data. The inductive framework presented here offers a sample‑efficient alternative: by solving a few small instances offline and extracting relational rules, one can bootstrap a policy that generalizes to larger, unseen problems without requiring exhaustive exploration. The authors argue that this bridges the gap between model‑based planning and model‑free reinforcement learning in relational settings.

Key contributions of the work are:

  1. A data‑driven, inductive policy‑learning pipeline that leverages exact solutions of small MDPs to produce scalable policies for larger relational domains.
  2. The introduction of a taxonomic concept language combined with decision‑list policies, yielding interpretable, object‑agnostic rules.
  3. The use of ensemble learning to enhance robustness and reduce over‑fitting of individual rule sets.
  4. Empirical validation on multiple stochastic first‑order MDP benchmarks, demonstrating performance beyond prior state‑of‑the‑art methods.
  5. A discussion of how the method can be extended to relational reinforcement‑learning scenarios, suggesting broader applicability to real‑world systems such as multi‑robot coordination and logistics optimization.

In conclusion, the paper presents a novel paradigm for relational policy synthesis: by grounding policy induction in exact solutions of tractable sub‑problems and representing the resulting knowledge with relational concepts and decision lists, it achieves both scalability and interpretability. Future work is outlined to handle richer stochastic structures, partial observability, and to integrate the approach into physical robotic platforms, thereby testing its practical impact on complex, dynamic environments.