Exploiting First-Order Regression in Inductive Policy Selection
We consider the problem of computing optimal generalised policies for relational Markov decision processes. We describe an approach combining some of the benefits of purely inductive techniques with those of symbolic dynamic programming methods. The latter reason about the optimal value function using first-order decision theoretic regression and formula rewriting, while the former, when provided with a suitable hypotheses language, are capable of generalising value functions or policies for small instances. Our idea is to use reasoning and in particular classical first-order regression to automatically generate a hypotheses language dedicated to the domain at hand, which is then used as input by an inductive solver. This approach avoids the more complex reasoning of symbolic dynamic programming while focusing the inductive solver’s attention on concepts that are specifically relevant to the optimal value function for the domain considered.
💡 Research Summary
The paper tackles the longstanding challenge of deriving optimal generalized policies for relational Markov decision processes (RMDPs). Traditional symbolic dynamic programming (SDP) methods compute the exact optimal value function by repeatedly applying first‑order decision‑theoretic regression and extensive formula rewriting. While mathematically precise, SDP suffers from severe scalability issues because the regression and normalization steps generate an explosion of logical clauses, making the approach infeasible for anything beyond toy domains.
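To make the scalability problem concrete, here is an illustrative sketch (not the paper's SDP implementation): if a first-order value function is kept as a set of (formula, value) cases, each symbolic Bellman backup crosses every case with every action outcome, so without aggressive simplification the number of cases grows geometrically. All formula and precondition names below are placeholders.

```python
from itertools import product

def backup(cases, outcomes, gamma=0.9):
    """One symbolic backup: cross every value case with every action outcome.

    cases:    list of (formula, value) pairs partitioning the state space
    outcomes: list of (precondition, probability) pairs for one action
    """
    new_cases = []
    for (phi, v), (pre, prob) in product(cases, outcomes):
        # Regressed case: the action precondition conjoined with the
        # regressed formula, valued by the discounted expectation.
        new_cases.append((f"({pre}) & regr({phi})", prob * gamma * v))
    return new_cases

cases = [("goal", 1.0)]                     # reward partition: 1 if goal holds
outcomes = [("pre1", 0.8), ("pre2", 0.2)]   # two hypothetical action outcomes
for _ in range(3):
    cases = backup(cases, outcomes)
print(len(cases))                           # 8 cases after 3 backups: 2**3 growth
```

Real SDP systems fight this growth with formula rewriting and logical simplification, which is exactly the expensive reasoning the paper's approach sidesteps.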
Conversely, purely inductive techniques—such as relational reinforcement learning or inductive logic programming (ILP)—can learn compact policies from a modest set of training instances. Their success, however, hinges on the availability of a rich yet relevant hypothesis language. Designing such a language manually is labor‑intensive and domain‑specific; a poorly chosen language either fails to capture the necessary relational structure or forces the learner to explore an enormous, mostly irrelevant search space.
The authors propose a hybrid framework that leverages the strengths of both paradigms while mitigating their weaknesses. The key insight is to use classical first‑order regression not to compute the full value function, but to automatically generate a domain‑specific hypothesis language. Starting from the reward description, the regression operator is applied backwards through the transition axioms, producing a set of logical preconditions that are sufficient for achieving the reward. These preconditions naturally encode the relational concepts that are most salient for optimal decision making (e.g., “block A is on top of block B”, “robot is adjacent to location X”). The collection of regressed formulas constitutes a compact, focused hypothesis space that is fed directly into an inductive learner.
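The core operation can be sketched for the deterministic STRIPS special case, which is enough to see how regression surfaces candidate concepts. The action schema and predicate names below (a Blocks World `stack` action) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of classical goal regression for a deterministic
# STRIPS-style action, used here only to generate candidate concepts
# for the hypothesis language.

def regress(goal, action):
    """Regress a conjunctive goal (a frozenset of literals) through an action.

    Returns the weakest precondition under which executing the action
    achieves the goal, or None if the action deletes a goal literal.
    """
    if goal & action["del"]:
        return None                     # action destroys part of the goal
    residual = goal - action["add"]     # literals the action must preserve
    return frozenset(action["pre"]) | residual

stack_a_b = {
    "pre": {"holding(a)", "clear(b)"},
    "add": {"on(a,b)", "clear(a)", "handempty"},
    "del": {"holding(a)", "clear(b)"},
}

concepts = regress(frozenset({"on(a,b)"}), stack_a_b)
print(sorted(concepts))   # ['clear(b)', 'holding(a)']
```

Each regressed formula such as `holding(a) & clear(b)` becomes a candidate feature in the hypothesis language handed to the inductive learner.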
In the inductive phase, the learner—implemented with standard ILP or relational Q‑learning machinery—receives a modest training set of state‑action‑reward tuples drawn from small instances of the RMDP. Because the hypothesis language already contains the essential relational features, the learner can quickly discover rules that approximate the optimal policy. Empirical evaluation on several benchmark relational domains (Blocks World, robot navigation, logistics) demonstrates that the hybrid method attains policies indistinguishable from those produced by full SDP, yet with dramatically reduced computational overhead. In particular, the approach avoids enumerating the entire state space, cuts memory consumption, and shortens learning time by orders of magnitude.
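The inductive phase can be sketched, under simplifying assumptions, as learning a decision list over the regressed features: each training state is described by the set of regressed formulas it satisfies, and a greedy separate-and-conquer loop pairs features with actions. The feature and action names are hypothetical, and real ILP systems use far richer search.

```python
def learn_decision_list(examples, features):
    """Greedily learn (feature, action) rules from (feature_set, action) examples."""
    rules, remaining = [], list(examples)
    while remaining:
        best = None
        for f in features:
            covered = [(s, a) for s, a in remaining if f in s]
            if not covered:
                continue
            # Score the feature by the purity of its majority action.
            actions = [a for _, a in covered]
            act = max(set(actions), key=actions.count)
            score = actions.count(act) / len(covered)
            if best is None or score > best[0]:
                best = (score, f, act)
        if best is None:
            break
        _, f, act = best
        rules.append((f, act))
        remaining = [(s, a) for s, a in remaining if f not in s]
    return rules

examples = [({"holding(a)", "clear(b)"}, "stack(a,b)"),
            ({"clear(b)"}, "pickup(a)"),
            ({"holding(a)"}, "stack(a,b)")]
rules = learn_decision_list(examples, ["holding(a)", "clear(b)"])
print(rules)   # [('holding(a)', 'stack(a,b)'), ('clear(b)', 'pickup(a)')]
```

The point of the paper's pipeline is that the `features` list is not hand-designed but produced by regression, so a loop this simple already has the right vocabulary to separate the training examples.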
A further contribution is the demonstration that the regression‑based language generation is domain‑agnostic. As long as the transition dynamics are expressed in first‑order logic, the same regression pipeline can be applied without any hand‑crafted features. This removes the bottleneck of manual hypothesis design that has limited the applicability of inductive methods to a narrow set of problems. To keep the generated language tractable, the authors also introduce simple normalization rules that prune logically redundant clauses and collapse equivalent preconditions.
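One simple normalisation policy of the kind described can be sketched as follows, treating each generated precondition as a set of literals; the paper's actual pruning rules may differ, and the pruning-toward-most-general choice here is an assumption.

```python
def normalise(preconditions):
    """Prune trivially redundant conjunctions of literals.

    - drop unsatisfiable conjunctions (containing both L and ~L)
    - drop exact duplicates
    - drop conjunctions subsumed by a more general one already kept
    """
    kept = []
    for conj in sorted(map(frozenset, preconditions), key=len):
        if any(lit.startswith("~") and lit[1:] in conj for lit in conj):
            continue                  # contradictory, never satisfiable
        if any(k <= conj for k in kept):
            continue                  # a more general condition already covers it
        kept.append(conj)
    return kept

raw = [{"clear(b)", "holding(a)"},
       {"holding(a)", "clear(b)"},     # duplicate of the first
       {"holding(a)"},                 # strictly more general
       {"on(a,b)", "~on(a,b)"}]        # contradictory
print([sorted(c) for c in normalise(raw)])   # [['holding(a)']]
```

Processing conjunctions in order of increasing size guarantees that the most general conditions are kept first, so every later superset is recognised as redundant.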
In summary, the paper presents a novel, two‑stage methodology: (1) use first‑order regression to derive a concise, task‑relevant hypothesis language; (2) apply an off‑the‑shelf inductive learner to a modest dataset to produce a generalized policy. This combination yields the exactness of symbolic reasoning where it matters most—identifying the right relational predicates—while exploiting the scalability of inductive learning for policy synthesis. The authors suggest future work on extending the framework to multi‑agent settings, continuous domains, and integrating the generated hypotheses with deep reinforcement learning architectures, thereby opening a promising avenue for scalable relational decision making.