Optimistic Initialization and Greediness Lead to Polynomial Time Learning in Factored MDPs - Extended Version


In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in a conventional way and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that with suitable initialization (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps when the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step costs of the algorithm are also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties. This extended version contains the rigorous proofs of the main theorem. A version of this paper appeared in ICML'09.


💡 Research Summary

The paper introduces the Factored Optimistic Initial Model (FOIM), a model‑based reinforcement‑learning algorithm designed for factored Markov decision processes (FMDPs). FOIM follows a conventional empirical‑model approach: after each interaction the algorithm updates an estimate of the transition and reward functions. The distinctive element is the way the model is initialized. All state‑action pairs that have not yet been observed are given an “optimistic” transition distribution that points toward states with the highest possible reward. This optimistic bias forces the agent to explore unknown regions without the need for an explicit exploration schedule such as ε‑greedy or UCB bonuses.
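The initialization idea above can be illustrated with a minimal sketch. The names below (`OptimisticModel`, the fictitious max-reward state `EDEN`, `R_MAX`) are illustrative assumptions, not the paper's notation; the real FOIM maintains a factored transition model rather than a flat one.

```python
# Minimal sketch of optimistic model initialization (assumed, simplified,
# flat-state version): every unvisited (state, action) pair is assumed to
# transition to a fictitious state delivering the maximal reward forever,
# so its Q-value dominates until the pair is actually tried.
R_MAX = 1.0          # assumed upper bound on the per-step reward
EDEN = "eden"        # hypothetical maximal-reward absorbing state

class OptimisticModel:
    def __init__(self, actions):
        self.actions = actions
        self.counts = {}      # (s, a) -> visit count
        self.trans = {}       # (s, a) -> next-state visit counts
        self.rewards = {}     # (s, a) -> running mean of observed rewards

    def predict(self, s, a):
        """Return (next-state distribution, reward estimate)."""
        if self.counts.get((s, a), 0) == 0:
            # Optimistic prior for unseen pairs: jump to the max-reward state.
            return {EDEN: 1.0}, R_MAX
        n = self.counts[(s, a)]
        dist = {s2: c / n for s2, c in self.trans[(s, a)].items()}
        return dist, self.rewards[(s, a)]

    def update(self, s, a, r, s2):
        n = self.counts.get((s, a), 0)
        self.counts[(s, a)] = n + 1
        self.trans.setdefault((s, a), {})
        self.trans[(s, a)][s2] = self.trans[(s, a)].get(s2, 0) + 1
        # incremental update of the empirical mean reward
        self.rewards[(s, a)] = (self.rewards.get((s, a), 0.0) * n + r) / (n + 1)
```

A greedy planner run on this model automatically prefers unexplored pairs, which is exactly the mechanism that replaces ε-greedy or UCB-style bonuses.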

At every time step FOIM computes a policy by performing Approximate Value Iteration (AVI) on the current optimistic model. AVI exploits the factored structure: the global value function is represented as a sum of local value functions defined over small subsets of variables (the factors). Because each factor involves only a limited number of variables, the Bellman backup can be carried out efficiently, and the resulting approximate Q‑values are used to select a greedy action. Consequently the algorithm is purely greedy with respect to its own model; exploration is induced solely by the optimism in the initialization.
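The factored value representation and the greedy lookahead can be sketched as follows. This is a simplified illustration under assumed interfaces (a `model` callable returning a next-state distribution and a reward, and per-factor value tables keyed by projected variable assignments), not the paper's exact AVI procedure.

```python
# Sketch (assumed structure): the global value function is a sum of local
# components V_j, each depending only on a small subset ("scope") of the
# state variables. States are tuples of variable values.
def factored_value(state, local_values, scopes):
    """V(x) = sum_j V_j(x restricted to scope_j)."""
    total = 0.0
    for V_j, scope in zip(local_values, scopes):
        key = tuple(state[i] for i in scope)   # project the state onto the factor
        total += V_j.get(key, 0.0)
    return total

def greedy_action(state, actions, model, local_values, scopes, gamma=0.95):
    """Pick the action maximizing the one-step lookahead under the model."""
    def q(a):
        dist, r = model(state, a)   # (next-state distribution, reward estimate)
        return r + gamma * sum(
            p * factored_value(s2, local_values, scopes)
            for s2, p in dist.items()
        )
    return max(actions, key=q)
```

Because each `V_j` is indexed only by its own small scope, the tables stay small even when the joint state space is exponential; that is the efficiency the factored Bellman backup exploits.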

The authors prove three central results. First, with a sufficiently large optimism parameter the sequence of greedy policies generated by FOIM converges to the fixed point of AVI on the true MDP. In other words, even before the empirical model becomes accurate, the optimistic initialization guarantees that the value function remains an over‑estimate that stabilizes at the AVI solution. Second, the number of “non‑near‑optimal” steps—steps in which the greedy policy deviates from the ε‑optimal policy defined by the AVI fixed point—is bounded by a polynomial in all relevant quantities: the size of the factored state space, the number of actions, 1/ε, 1/δ (the allowed failure probability), and 1/(1‑γ) (the effective horizon). The bound has the form
O( (|X|·|A|)·(1/ε)²·log(1/δ)·(1/(1‑γ))³ ),
where |X| denotes the number of possible joint assignments to the state variables. Third, the per‑step computational cost is also polynomial. Because AVI updates only the local factors, each update costs O(w·|A|·|X|), where w is the tree‑width of the underlying factor graph; thus the total runtime over the learning horizon remains polynomial.

The proofs rely on standard concentration inequalities (Hoeffding bounds) to control the error of the empirical transition estimates, and on properties of factored MDPs that allow the transition model to be represented compactly when the graph has bounded tree‑width. The optimistic initialization ensures that, until enough samples have been collected for a given (s,a) pair, the estimated Q‑value is artificially high, guaranteeing that the agent will eventually try that pair. Once enough samples are gathered, the empirical model converges to the true dynamics, and the AVI fixed point becomes an accurate approximation of the optimal value function.
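As a back-of-the-envelope illustration of the Hoeffding argument (using the generic bound, not the paper's exact constants): to estimate the mean of a [0, 1]-bounded quantity within ±ε with failure probability at most δ, it suffices to take m ≥ ln(2/δ) / (2ε²) samples.

```python
import math

def hoeffding_samples(eps, delta):
    """Samples sufficient for a +/- eps estimate of a [0, 1]-bounded mean,
    with failure probability at most delta (standard Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# e.g. hoeffding_samples(0.1, 0.05) -> 185
```

Note the logarithmic dependence on 1/δ, which is why the failure probability enters the overall sample-complexity bound only through a log factor.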

Compared with classic PAC‑MDP algorithms such as R‑MAX or MBIE‑PAC, FOIM eliminates the need for separate exploration bonuses or confidence intervals in the policy‑selection step. The only hyper‑parameter is the magnitude of the optimistic prior, which can be set based on known bounds on rewards. The analysis shows that this simple design yields the same (or better) sample‑complexity guarantees while keeping the algorithmic structure extremely simple.

Although the extended version of the paper does not present empirical experiments, the theoretical results suggest that FOIM is well‑suited for large‑scale factored domains such as network routing, robotic manipulation, or smart‑grid control, where the state space grows exponentially but the factor graph remains sparse. In such settings the algorithm can learn near‑optimal policies in polynomial time with respect to the natural description length of the problem.

In summary, the paper makes a significant contribution by demonstrating that “optimistic initialization + greedy policy” is sufficient to achieve polynomial‑time learning in factored MDPs. FOIM is the first algorithm to provide simultaneous guarantees on convergence to the AVI fixed point, a polynomial bound on sub‑optimal actions, and polynomial per‑step computational cost, all without auxiliary exploration mechanisms. This both advances the theoretical understanding of exploration in structured RL and offers a practically attractive approach for real‑world factored decision‑making problems.

