Feature Markov Decision Processes

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

General purpose intelligent learning agents cycle through (complex, non-MDP) sequences of observations, actions, and rewards. On the other hand, reinforcement learning is well-developed for small finite state Markov Decision Processes (MDPs). So far it is an art performed by human designers to extract the right state representation out of the bare observations, i.e. to reduce the agent setup to the MDP framework. Before we can think of mechanizing this search for suitable MDPs, we need a formal objective criterion. The main contribution of this article is to develop such a criterion. I also integrate the various parts into one learning algorithm. Extensions to more realistic dynamic Bayesian networks are developed in a companion article.


💡 Research Summary

The paper tackles a fundamental bottleneck in modern reinforcement learning: the need to transform raw, often non‑Markovian streams of observations, actions, and rewards into a compact Markov Decision Process (MDP) that standard RL algorithms can handle. While human designers currently perform this state abstraction manually, the author proposes a formal, quantitative criterion for automatically selecting an appropriate state representation, which he calls a Feature Markov Decision Process (Feature‑MDP or FM‑MDP).

The core idea is to define a family of candidate feature mappings φ that map raw observations Oₜ to abstract states Sₜ. For each candidate φ, the induced MDP is estimated by learning transition probabilities P(s′|s,a) and reward distributions R(r|s,a) from data. The quality of a candidate is measured by a composite objective J(φ)=log L(φ)−λ·C(φ), where log L(φ) is the log‑likelihood of the observed sequence under the MDP defined by φ, and C(φ) quantifies the complexity of the feature mapping (e.g., number of parameters, depth of a Bayesian network). The regularization weight λ balances explanatory power against over‑fitting.
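As a rough illustration, the criterion J(φ) = log L(φ) − λ·C(φ) can be sketched in a few lines of Python. The data format (a list of observation–action pairs), the count‑based maximum‑likelihood transition estimate, and the free‑parameter count used as the complexity term C(φ) are all assumptions made for this sketch, not the paper's exact construction:

```python
from collections import defaultdict
from math import log

def score(phi, episode, lam=1.0):
    """Score a candidate feature map phi: J(phi) = log L(phi) - lam * C(phi).

    `episode` is a list of (observation, action) pairs; phi maps a raw
    observation to an abstract state. Transition probabilities are the
    maximum-likelihood estimates over the induced state sequence.
    """
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    states = [phi(o) for o, _ in episode]
    for (s, (_, a)), s_next in zip(zip(states, episode), states[1:]):
        counts[(s, a)][s_next] += 1

    log_lik = 0.0
    n_params = 0
    for nexts in counts.values():
        total = sum(nexts.values())
        for n in nexts.values():
            log_lik += n * log(n / total)            # n * log p_hat(s'|s,a)
        n_params += len(nexts) - 1                   # free parameters per (s,a) row

    return log_lik - lam * n_params
```

A deterministic candidate scores log L = 0 with no penalty, while a mapping whose induced transitions are noisy pays both a likelihood and a complexity cost, so the penalty can tip the balance toward a coarser φ.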

The learning algorithm proceeds in three stages. First, a set of candidate feature functions is generated using methods such as short n‑gram clustering, histogram‑based discretisation, or structure learning for dynamic Bayesian networks. Second, for each candidate the transition and reward models are fitted using maximum‑likelihood or Bayesian estimation, optionally also learning an observation model Oₜ|Sₜ. Third, the objective J(φ) is evaluated and the feature mapping φ* with the highest score is selected. The resulting FM‑MDP can then be plugged directly into any conventional RL method (Q‑learning, SARSA, actor‑critic, etc.).
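The three stages above can be sketched end to end. This is a minimal toy pipeline, assuming one simple family of candidates (the state is a window of the last k observations) rather than the richer n‑gram, histogram, or DBN generators the summary mentions; the function names and the parameter‑count complexity measure are illustrative choices:

```python
from collections import defaultdict
from math import log

def ml_log_likelihood(states, actions):
    """Stage 2: fit the ML transition model and return its log-likelihood
    together with the number of free parameters (the complexity term)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in zip(states, actions, states[1:]):
        counts[(s, a)][s_next] += 1
    ll, params = 0.0, 0
    for nexts in counts.values():
        total = sum(nexts.values())
        ll += sum(n * log(n / total) for n in nexts.values())
        params += len(nexts) - 1
    return ll, params

def select_phi(observations, actions, lam=1.0, max_window=3):
    """Stages 1-3: enumerate window candidates phi_k (state = last k
    observations), score each with J = log L - lam * C, keep the best."""
    best = None
    for k in range(1, max_window + 1):
        states = [tuple(observations[max(0, t - k + 1): t + 1])
                  for t in range(len(observations))]
        ll, params = ml_log_likelihood(states, actions)
        J = ll - lam * params
        if best is None or J > best[0]:
            best = (J, k)
    return best   # (score, window length of the winning candidate)
```

Ties break toward the shortest window, so when a one‑step state already makes the dynamics deterministic, the search returns k = 1; the winning abstraction could then be handed to any tabular RL method.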

A notable contribution is the seamless extension to more expressive dynamic Bayesian networks, allowing the framework to handle temporally correlated hidden variables and partial observability. In this generalized setting, the overall log‑likelihood decomposes into transition and observation components, each penalised by the same complexity term, preserving the unified selection criterion.
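The decomposition described here can be made concrete in a trivial helper. The argument names are hypothetical; the point is only that the transition term P(S′|S,A) and the observation term P(O|S) add in log space and share one complexity penalty, so the selection criterion keeps the same J = log L − λ·C shape:

```python
def dbn_score(ll_transition, ll_observation,
              n_params_transition, n_params_observation, lam=1.0):
    """Combined criterion in the partially observable / DBN setting:
    log-likelihood splits into transition and observation components,
    both penalised by the same complexity weight lam."""
    log_lik = ll_transition + ll_observation
    complexity = n_params_transition + n_params_observation
    return log_lik - lam * complexity
```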

Empirical evaluation on classic benchmarks (GridWorld, MountainCar) and on a more complex robotic simulation demonstrates that automatically discovered FM‑MDPs achieve faster convergence and higher final performance than hand‑crafted state abstractions. The experiments also illustrate how adjusting λ controls model complexity, preventing over‑fitting while retaining sufficient expressive power.

In conclusion, the paper provides a rigorous, computable answer to the question “what is the right state representation?” for reinforcement learning agents operating in arbitrary environments. By formalising the trade‑off between model fit and complexity, and by integrating candidate generation, parameter estimation, and model selection into a single algorithm, it lays a solid foundation for automated state abstraction, meta‑learning, and the broader integration of RL into general‑purpose AI systems. Future work is outlined to address continuous state‑action spaces, multi‑agent scenarios, and online adaptation of the feature mapping.

