Learning Partially Observable Deterministic Action Models
We present exact algorithms for identifying deterministic actions' effects and preconditions in dynamic, partially observable domains. They apply when one does not know the action model (the way actions affect the world) of a domain and must learn it from partial observations over time. Such scenarios are common in real-world applications. They are challenging for AI tasks because traditional domain structures that underlie tractability (e.g., conditional independence) fail there (e.g., world features become correlated). Our work departs from traditional assumptions about partial observations and action models. In particular, it focuses on problems in which actions are deterministic and have a simple logical structure, and in which every feature is observed with some frequency. For such domains we obtain tractable algorithms. Our algorithms take sequences of partial observations over time as input, and output deterministic action models that could have led to those observations. The algorithms output all such models or a single one (depending on our choice), and are exact in that no model is misclassified given the observations. Our algorithms take time polynomial in the number of time steps and state features for some traditional action classes examined in the AI-planning literature, e.g., STRIPS actions. In contrast, traditional approaches for HMMs and reinforcement learning are inexact and exponentially intractable for such domains. Our experiments verify the theoretical tractability guarantees, and show that we identify action models exactly. Several applications in planning, autonomous exploration, and adventure-game playing already use these results. They are also promising for probabilistic settings, partially observable reinforcement learning, and diagnosis.
💡 Research Summary
The paper tackles the challenging problem of learning deterministic action models in dynamic domains where the agent receives only partial observations of the state at each time step. Traditional planning and reinforcement‑learning approaches assume either full observability or probabilistic models such as hidden Markov models (HMMs). Under partial observability, the independence assumptions that make many planning algorithms tractable break down, because unobserved variables become correlated through the dynamics. The authors therefore restrict attention to two realistic assumptions: (1) actions are deterministic and have a simple logical structure (e.g., STRIPS‑style preconditions and effects), and (2) every state feature is observed with non‑zero frequency over the course of the interaction. Within this setting they develop exact learning algorithms that take a sequence of partial observations as input and output deterministic action models that could have generated the data.
The core of the method is a two‑phase constraint‑propagation process. Initially every action’s preconditions and effects are unconstrained. As each observation arrives, the algorithm discards any hypothesis that would contradict the observed transition and simultaneously tightens the remaining hypotheses by adding logical constraints derived from the observed before‑and‑after states. Because STRIPS actions can be represented as sets of literals, each update reduces to simple set operations, keeping the per‑step cost linear in the number of state features. The overall runtime is O(T·|F|·C), where T is the length of the observation trace, |F| the number of features, and C a factor that depends on the specific action class (for STRIPS‑like actions C is very small). Consequently the algorithm runs in polynomial time with respect to both time steps and state dimensionality, a stark contrast to the exponential or approximate behavior of HMM‑based learning and conventional reinforcement‑learning methods.
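The set-based update described above can be illustrated with a minimal sketch. This is not the paper's actual algorithm, only a simplified illustration of how one observed transition prunes candidate precondition, add, and delete sets for a single STRIPS-style action; all names (`update_hypotheses`, the fluent names) are assumptions for the example, and partial observability is modeled simply by an `observed` set of fluents whose values were seen at both time steps.

```python
def update_hypotheses(pre, add, delete, before, after, observed):
    """Tighten candidate precondition/effect sets from one observed transition.

    pre      : fluents still possibly required as preconditions
    add      : fluents still possibly in the add list
    delete   : fluents still possibly in the delete list
    before   : fluents observed true before the action
    after    : fluents observed true after the action
    observed : fluents whose value was observed at both time steps
    """
    # A precondition must hold whenever the action fires, so any observed
    # fluent that was false beforehand cannot be a precondition.
    pre -= observed - before
    # A fluent observed false afterwards cannot be an add effect of a
    # deterministic action that just executed.
    add -= observed - after
    # A fluent observed true afterwards cannot be a delete effect.
    delete -= observed & after
    return pre, add, delete
```

Each update is a handful of set differences and intersections, which is why the per-step cost stays linear in the number of fluents.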
Two output modes are supported. In “enumerate all models” mode the algorithm returns the complete set of deterministic action models that are consistent with the observations, providing a full picture of the remaining uncertainty. In “single model” mode it returns an arbitrary consistent model, which can be useful when a concrete model is needed quickly. Both modes are provably exact: no inconsistent model is ever returned (soundness), and every consistent model appears in the enumeration (completeness).
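The two modes can be sketched as filtering over a candidate space. The following toy example (not the paper's implementation; fluent names and helper names are illustrative) enumerates every (add-list, delete-list) pair over two fluents and keeps exactly those consistent with the observed transitions, which makes the soundness/completeness claim concrete: the "all models" mode returns precisely the surviving candidates, and the "single model" mode returns an arbitrary one of them.

```python
from itertools import combinations

FLUENTS = ("p", "q")

def subsets(items):
    """All subsets of a small fluent collection, as frozensets."""
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

# Candidate models: every disjoint (add-list, delete-list) pair.
CANDIDATES = [(a, d) for a in subsets(FLUENTS) for d in subsets(FLUENTS)
              if not (a & d)]

def consistent(model, transition):
    """A model explains a transition if applying it to the pre-state
    yields exactly the observed post-state."""
    add, delete = model
    before, after = transition
    return (before | add) - delete == after

def all_models(transitions):
    """'Enumerate all models' mode: every candidate consistent with the data."""
    return [m for m in CANDIDATES
            if all(consistent(m, t) for t in transitions)]

def one_model(transitions):
    """'Single model' mode: an arbitrary consistent candidate, if any."""
    models = all_models(transitions)
    return models[0] if models else None
```

For example, a single observed transition from {p} to {p, q} leaves exactly two consistent models (add-list {q} or {p, q}, empty delete-list), and further observations would shrink that set monotonically.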
Theoretical contributions include proofs of soundness, completeness, and polynomial‑time complexity for several traditional action representations examined in the AI‑planning literature, notably STRIPS, lifted operators, and certain classes of conditional effects. The authors also discuss how the approach can be extended to richer deterministic models that include delete lists, more general conditional effects, and limited numeric fluents, as long as the logical structure remains tractable for set‑based constraint propagation.
Empirical evaluation consists of two parts. First, synthetic benchmark domains are generated with varying numbers of fluents and actions. The experiments confirm that runtime scales linearly with both the number of time steps and the number of fluents, matching the theoretical bound, while baseline HMM learners exhibit exponential blow‑up and often fail to converge within reasonable time. Second, the method is applied to a text‑based adventure‑game environment where the agent receives sparse, noisy observations of the game state. Despite the sparsity, the algorithm exactly recovers the ground‑truth action model used by the game engine, whereas standard reinforcement‑learning agents struggle to learn a useful policy without full state feedback. These results demonstrate that the proposed technique is not only theoretically sound but also practically viable for real‑world scenarios such as autonomous exploration, automated game playing, and diagnostic reasoning.
Finally, the paper outlines future directions. The deterministic assumption can be relaxed to allow probabilistic effects, turning the exact constraint propagation into a Bayesian update over distributions. Integrating the learned models with partially observable reinforcement‑learning (POMDP) solvers could enable agents to both infer the dynamics and optimize policies simultaneously. Moreover, the authors suggest that the same framework could be adapted for online learning, where observations arrive continuously and the model is incrementally refined. In sum, the work delivers the first polynomial‑time, exact learning algorithm for deterministic action models under partial observability, opening new avenues for planning‑centric AI systems that must operate with incomplete sensory data.