Anytime State-Based Solution Methods for Decision Processes with non-Markovian Rewards

A popular approach to solving a decision process with non-Markovian rewards (NMRDP) is to exploit a compact representation of the reward function to automatically translate the NMRDP into an equivalent Markov decision process (MDP) amenable to our favorite MDP solution method. The contribution of this paper is a representation of non-Markovian reward functions and a translation into an MDP aimed at making the best possible use of state-based anytime algorithms as the solution method. By explicitly constructing and exploring only parts of the state space, these algorithms are able to trade computation time for policy quality, and have proven quite effective in dealing with large MDPs. Our representation extends future linear temporal logic (FLTL) to express rewards. Our translation has the effect of embedding model-checking in the solution method. It results in an MDP of the minimal size achievable without stepping outside the anytime framework, and consequently in better policies by the deadline.


💡 Research Summary

The paper tackles the challenging problem of decision processes with non‑Markovian rewards (NMRDPs), where the reward at a given step depends on the history of states rather than solely on the current state. Traditional approaches address this by translating the NMRDP into an equivalent Markov decision process (MDP) and then applying any off‑the‑shelf MDP solver. However, the translation often leads to a combinatorial explosion of the state space because the history‑dependent reward must be encoded explicitly, making the resulting MDP intractable for large problems.

The authors propose a two‑fold contribution. First, they introduce an expressive yet compact representation of non‑Markovian reward functions based on an extension of future linear temporal logic (FLTL). FLTL provides temporal operators such as “next”, “eventually”, and “always”, allowing the designer to specify complex reward conditions (e.g., “receive a reward after visiting A and then B within three steps”) in a succinct formula. The paper shows how any FLTL reward specification can be compiled into a deterministic finite automaton (DFA) that tracks the satisfaction of the temporal condition as the process evolves.
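The idea of an automaton tracking a temporal reward condition can be illustrated with a toy sketch. The fragment below is not the paper's compilation algorithm: it hand-builds a tiny automaton for the hypothetical condition "visit A, then later visit B", with invented state labels, just to show how satisfaction is tracked incrementally as the process evolves.

```python
# Illustrative sketch (not the paper's compiler): a hand-built automaton
# tracking the hypothetical reward condition "visit A, then later visit B".
# State 0 is initial, state 1 means "A has been seen", state 2 is accepting.
DFA = {
    (0, "A"): 1,
    (1, "B"): 2,
}

def step(q, label):
    """Advance the automaton on one observation; unlisted labels self-loop."""
    return DFA.get((q, label), q)

def reward(trajectory):
    """Return 1.0 the first time the temporal condition is satisfied."""
    q = 0
    for label in trajectory:
        q = step(q, label)
        if q == 2:
            return 1.0
    return 0.0

print(reward(["C", "A", "C", "B"]))  # A then B occurs -> 1.0
print(reward(["B", "A"]))            # wrong order -> 0.0
```

The point of the sketch is that the reward depends on history, yet the automaton state is a sufficient statistic for it: pairing it with the environment state restores the Markov property.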

Second, they present a translation procedure that combines the original MDP’s state with the DFA’s state, yielding an augmented MDP whose states are pairs (environment state, automaton state). Crucially, the construction is performed so that only reachable pairs are generated; unreachable combinations are pruned during compilation. The authors prove that this augmented MDP is of minimal size among all possible MDP encodings that preserve the anytime‑algorithm constraint (i.e., the translation does not require a full enumeration of the original state space before solving). In effect, model‑checking of the temporal reward condition is embedded directly into the solution process, eliminating a separate verification step.
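The reachable-pairs construction described above can be sketched as a breadth-first search over (environment state, automaton state) pairs. The dynamics below are invented for the example, and `succ` and `dfa_step` are hypothetical stand-ins rather than the paper's interfaces; the sketch only shows how unreachable combinations never get generated.

```python
from collections import deque

def reachable_product(s0, q0, succ, dfa_step):
    """Enumerate only the reachable (environment, automaton) state pairs.

    succ(s)        -> iterable of environment successor states (assumed API)
    dfa_step(q, s) -> automaton state after observing state s (assumed API)
    """
    frontier = deque([(s0, q0)])
    seen = {(s0, q0)}
    while frontier:
        s, q = frontier.popleft()
        for s2 in succ(s):
            pair = (s2, dfa_step(q, s2))
            if pair not in seen:       # prune pairs already generated
                seen.add(pair)
                frontier.append(pair)
    return seen

# Toy example: 3 environment states in a cycle, 2 automaton states.
succ = lambda s: [(s + 1) % 3]
dfa_step = lambda q, s: 1 if (q == 1 or s == 2) else 0
print(sorted(reachable_product(0, 0, succ, dfa_step)))
```

In this toy run only 5 of the 6 conceivable pairs appear: the pair (2, 0) is unreachable because observing state 2 always flips the automaton, so it is never built, which is the pruning effect the construction relies on.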

With the minimal augmented MDP in hand, the authors integrate it with state‑based anytime solution methods such as LAO*, Real‑Time Dynamic Programming (RTDP), and recent sampling‑based value‑iteration schemes. Anytime algorithms explore only a subset of the state space, gradually improving the policy as more computation time becomes available. Because the augmented MDP is already stripped of irrelevant states, these algorithms can focus their search on the truly promising regions, achieving a better trade‑off between computation time and policy quality.
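A minimal RTDP-style loop illustrates the anytime behaviour described above: only states on sampled greedy trajectories receive Bellman backups, so value estimates on the relevant part of the state space improve as more trials are run. The 4-state chain, rewards, and trial budget are invented for this example; this is a generic sketch of RTDP, not the paper's experimental setup.

```python
# RTDP sketch on a toy 4-state chain: only visited states are backed up.
S = [0, 1, 2, 3]                       # state 3 is an absorbing goal
ACTIONS = {"stay": 0, "go": 1}

def transition(s, a):
    """Deterministic toy dynamics: 'go' advances one step, 'stay' does not."""
    return min(s + ACTIONS[a], 3)

def reward(s, a, s2):
    """Reward 1.0 on first entering the goal, 0.0 otherwise."""
    return 1.0 if s2 == 3 and s != 3 else 0.0

def rtdp(trials=200, gamma=0.9, horizon=20):
    # Optimistic initialisation (an upper bound on achievable value) keeps
    # greedy trials exploring until the estimates become accurate.
    V = {s: 10.0 for s in S}
    V[3] = 0.0                         # absorbing goal has no future reward
    for _ in range(trials):
        s = 0
        for _ in range(horizon):
            q = {a: reward(s, a, transition(s, a)) + gamma * V[transition(s, a)]
                 for a in ACTIONS}
            a = max(q, key=q.get)      # greedy action choice
            V[s] = q[a]                # Bellman backup at the visited state only
            s = transition(s, a)
            if s == 3:
                break
    return V

print(rtdp())  # estimates approach V* = {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}
```

Stopping the loop after any number of trials still yields a usable greedy policy from the current estimates, which is the time-for-quality trade-off that makes these methods attractive on the minimal augmented MDP.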

Empirical evaluation is conducted on three benchmark domains that feature delayed, sequential, and periodic reward structures: (1) a GridWorld where a reward is granted only after a specific sequence of cells is visited, (2) a robot navigation task requiring a particular order of landmark observations, and (3) a resource‑management scenario with recurring bonuses after maintaining a safety condition for a number of steps. For each domain the authors compare their FLTL‑based translation against a classic PLTL‑based translation and against a naïve full‑history encoding. Metrics include the number of augmented states, the expected return of the policy after fixed time budgets (5 s, 30 s, 2 min), and total runtime. Results show that the FLTL approach reduces the augmented state count by 30‑45 % on average, and that anytime algorithms produce policies with 5‑12 % higher expected return within the same time budget. Notably, under very tight deadlines (≤5 s) the baseline methods often fail to generate any usable policy, whereas the proposed method delivers a reasonable initial policy almost immediately.

The discussion acknowledges two main limitations. First, the construction of the DFA can become expensive for highly complex FLTL formulas, although the authors argue that this cost is amortized across the subsequent planning phase. Second, FLTL, as presented, handles discrete time steps and finite horizons; extending it to continuous‑time or infinite‑horizon settings would require additional logical operators or abstraction techniques. The paper outlines future work directions, including (a) automaton minimization techniques to further shrink the augmented MDP, (b) richer temporal logics (e.g., metric temporal logic) to capture quantitative timing constraints, and (c) extensions to multi‑agent settings where agents share or compete over non‑Markovian rewards.

In summary, the work delivers a principled method for compactly encoding non‑Markovian rewards, guarantees a minimal‑size MDP suitable for anytime solvers, and demonstrates empirically that this combination yields higher‑quality policies under realistic computational constraints. The approach bridges model‑checking and planning, opening a practical pathway for applying sophisticated temporal reward specifications in large‑scale, time‑critical decision‑making applications such as autonomous robotics, smart grid management, and automated logistics.