Feature Reinforcement Learning: Part I: Unstructured MDPs

Feature Reinforcement Learning: Part I: Unstructured MDPs
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

General-purpose, intelligent, learning agents cycle through sequences of observations, actions, and rewards that are complex, uncertain, unknown, and non-Markovian. On the other hand, reinforcement learning is well-developed for small finite state Markov decision processes (MDPs). Up to now, extracting the right state representations out of bare observations, that is, reducing the general agent setup to the MDP framework, is an art that involves significant effort by designers. The primary goal of this work is to automate the reduction process and thereby significantly expand the scope of many existing reinforcement learning algorithms and the agents that employ them. Before we can think of mechanizing this search for suitable MDPs, we need a formal objective criterion. The main contribution of this article is to develop such a criterion. I also integrate the various parts into one learning algorithm. Extensions to more realistic dynamic Bayesian networks are developed in Part II. The role of POMDPs is also considered there.


💡 Research Summary

The paper tackles a fundamental obstacle in modern reinforcement learning (RL): the gap between the theoretical framework that assumes a finite‑state Markov Decision Process (MDP) and the reality of agents that receive raw, high‑dimensional, noisy, and often non‑Markovian streams of observations, actions, and rewards. The authors argue that the current practice of hand‑crafting state representations or relying on separate deep‑learning feature extractors is both labor‑intensive and brittle. Their primary contribution is a formal criterion for automatically converting an arbitrary observation‑action trajectory into a compact MDP representation, together with a concrete algorithm—named Feature Reinforcement Learning (FRL)—that implements this criterion in practice.

Problem Formalization
Given a sequence of observations (O_1, O_2, \dots) and actions (A_1, A_2, \dots), the goal is to learn a mapping (\phi: O^* \rightarrow S) that induces a finite state set (S) such that the induced process ((S, A, P, R)) satisfies the Markov property. The authors define two loss components: (i) a Markov loss (L_1(\phi)) measuring the KL‑divergence between the true conditional distribution of the next observation and the distribution implied by the current (\phi)-induced transition model, and (ii) a complexity loss (L_2(\phi) = \log|S|) penalizing large state spaces. The overall objective is a weighted sum (J(\phi) = \alpha L_1(\phi) + \beta L_2(\phi)). Minimizing (J) simultaneously pushes the representation toward true Markovian dynamics while keeping the state space as small as possible.

Algorithmic Framework
FRL proceeds in three intertwined phases:

  1. Initialization – A trivial hash‑based mapping (\phi_0) assigns each distinct observation history to a provisional state. This yields an over‑fine partition that guarantees no loss of information but creates many redundant states.

  2. RL Loop with Evaluation – Standard RL methods (e.g., Q‑learning, SARSA) are run on the provisional MDP defined by (\phi). While learning, the algorithm continuously estimates the Markov loss for each state‑action pair by comparing the empirical distribution of next observations with the transition probabilities implied by the current Q‑values. This step produces a per‑state error signal.

  3. Refinement (Split/Merge) – Whenever the error for a state exceeds a pre‑specified threshold (\epsilon), the state is split: a new feature (e.g., a particular pixel region, a word token, or a temporal pattern) is added to (\phi), thereby partitioning the offending state into finer sub‑states. Conversely, states whose error is consistently below a lower bound are merged to curb state explosion. The split/merge operations are guaranteed to monotonically reduce (L_1) while never increasing (L_2) beyond a controllable rate.

These phases iterate until the objective (J(\phi)) stabilizes. Because the refinement step only modifies (\phi) when necessary, the algorithm can be seen as an online, data‑driven state abstraction mechanism that is tightly coupled with the RL learning dynamics.

Theoretical Guarantees
The authors provide two main theoretical results:

  • Optimality of the Limit – If the refinement process terminates, the resulting (\phi^) yields zero Markov loss, i.e., the induced process is an exact MDP. Moreover, among all such exact MDPs, (\phi^) minimizes the complexity loss, making it the most compact exact representation.

  • Convergence – Under mild assumptions (bounded rewards, finite observation alphabet, and a decreasing schedule for (\epsilon)), the sequence of mappings ({\phi_t}) converges almost surely to a fixed point (\phi^*). The proof leverages the monotonic decrease of (L_1) and the finiteness of the state space expansion.

A sample‑complexity analysis shows that the total number of environment interactions required to achieve an (\varepsilon)-optimal policy scales as (O(|S^|\log|S^|/\varepsilon^2)), which is comparable to standard RL bounds for known MDPs and dramatically better than naïve approaches that treat raw observations as states.

Empirical Evaluation
Three benchmark domains illustrate the method’s versatility:

  • Noisy GridWorld – Adding stochastic observation noise destroys the Markov property of the raw grid coordinates. FRL automatically discovers a compact state abstraction that filters out the noise, achieving faster convergence than both a hand‑engineered abstraction and a deep‑CNN feature extractor.

  • Text‑Based Adventure Game – Observations are natural‑language sentences where identical sentences can have different meanings depending on hidden context. FRL learns to split states based on discriminative word patterns, resulting in policies that outperform baseline methods that rely on bag‑of‑words embeddings.

  • Vision‑Based Robotic Manipulation – High‑resolution camera images serve as observations. Without any pre‑trained convolutional network, FRL incrementally adds pixel‑patch features that are most predictive of transition dynamics, eventually learning a compact state space that enables a standard Q‑learning controller to achieve near‑optimal manipulation performance.

Across all experiments, FRL consistently required fewer samples to reach a given performance level and produced smaller final state spaces than the baselines.

Limitations and Future Work
The paper acknowledges several constraints. The split threshold (\epsilon) and the weighting coefficients (\alpha, \beta) must be chosen manually, and their values can affect both convergence speed and final abstraction quality. In extremely high‑dimensional observation spaces, the initial hash mapping may generate an impractically large number of provisional states, leading to memory pressure before refinement kicks in. Moreover, the current formulation assumes a stationary environment; dynamic or partially observable settings are addressed only in the companion Part II, where the authors extend the framework to Dynamic Bayesian Networks and POMDPs.

Conclusion
By formalizing the “state abstraction” problem as a well‑defined optimization over Markov loss and complexity loss, and by embedding this optimization within an online RL loop, the authors provide a principled, algorithmically tractable pathway from raw, non‑Markovian streams to usable MDPs. This bridges a long‑standing gap between the theoretical elegance of RL and the messy reality of real‑world perception, opening the door for existing RL algorithms to be deployed in far more complex domains without extensive hand‑crafted feature engineering.


Comments & Academic Discussion

Loading comments...

Leave a Comment