MDPs with Unawareness
Markov decision processes (MDPs) are widely used for modeling decision-making problems in robotics, automated control, and economics. Traditional MDPs assume that the decision maker (DM) knows all states and actions. However, this may not be true in many situations of interest. We define a new framework, MDPs with unawareness (MDPUs), to deal with the possibility that a DM may not be aware of all possible actions. We provide a complete characterization of when a DM can learn to play near-optimally in an MDPU, and give an algorithm that learns to play near-optimally when it is possible to do so, as efficiently as possible. In particular, we characterize when a near-optimal solution can be found in polynomial time.
💡 Research Summary
The paper challenges the standard assumption in Markov decision processes (MDPs) that a decision maker (DM) knows the complete set of states and actions. In many realistic settings—such as robotics, automated control, and economic decision‑making—new actions may only become known through exploration or experimentation. To capture this phenomenon, the authors introduce a new formalism called MDPs with Unawareness (MDPUs). An MDPU retains the usual components of an MDP (finite state space S, transition kernel P, reward function R) but splits the action space into two parts: the actions currently known to the DM, A_t, and a set of hidden or potential actions, A*. The DM can execute special “exploratory actions” that may reveal previously unknown actions; each exploratory step incurs a cost and reveals any particular hidden action only with some probability.
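The split of the action space described above can be sketched as a small data structure. This is an illustrative toy model, not the paper's formal definition: the class name, the single discovery probability, and the unit exploration cost are all simplifying assumptions.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of an MDPU's components: known actions A_t grow as
# hidden actions in A* are revealed by a costly exploratory step.
@dataclass
class MDPU:
    states: list          # finite state space S
    known_actions: set    # A_t: actions currently known to the DM
    hidden_actions: set   # A*: actions not yet discovered
    discover_prob: float  # chance one exploratory step reveals an action
    explore_cost: float = 1.0

    def explore(self):
        """One exploratory step: pay the cost; with probability
        discover_prob, move a hidden action into the known set."""
        revealed = None
        if self.hidden_actions and random.random() < self.discover_prob:
            revealed = self.hidden_actions.pop()
            self.known_actions.add(revealed)
        return revealed, -self.explore_cost
```

With `discover_prob` near zero, the DM pays repeatedly for exploration while learning nothing, which is exactly the tension the learnability conditions below formalize.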
The central theoretical contribution is a complete characterization of when a DM can learn a near‑optimal policy in an MDPU. Two key notions are defined: discoverability, which requires that every hidden action be revealed with probability one after a finite expected number of exploratory attempts, and exploration efficiency, which demands that the cumulative cost of exploration grows only polynomially relative to the overall reward horizon. The authors prove that both conditions are jointly necessary and sufficient for the existence of a learning algorithm that achieves ε‑optimality with probability at least 1 − δ.
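The discoverability condition can be made concrete with a toy calculation (an illustration of the geometric-waiting-time intuition, not the paper's formal definition): if each exploratory step reveals a given hidden action independently with probability p, the number of steps until discovery is geometric, with expectation 1/p, which is finite exactly when p > 0.

```python
def expected_steps_to_discover(p: float) -> float:
    """Expected number of exploratory steps to reveal an action that
    each step uncovers independently with probability p (geometric
    distribution: mean 1/p). Infinite when p = 0, i.e. the action is
    undiscoverable."""
    if p <= 0.0:
        return float("inf")
    return 1.0 / p
```

For example, an action discovered with probability 0.25 per attempt takes 4 exploratory steps in expectation; as p → 0 the expectation diverges and discoverability fails.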
When the conditions hold, the paper presents an explicit algorithm that attains near‑optimal performance in polynomial time. The algorithm builds on Upper Confidence Reinforcement Learning (UCRL) but augments it with a meta‑level control of exploratory actions. At each decision epoch the DM computes an optimistic policy over the currently known actions A_t, while simultaneously adjusting the probability of taking an exploratory action based on a Bayesian update of the hidden‑action discovery model. A novel exploration‑bonus term is introduced to balance exploitation of known actions against the expected value of uncovering new actions. The authors show that the overall sample complexity is polynomial in |S|, |A_t|, 1/ε, and log(1/δ), and that the regret incurred by exploration is bounded by a polynomial function of the problem parameters.
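The meta-level control loop can be sketched as follows. This is a hedged, simplified reading of the summary above, not the paper's exact rule: the Beta posterior over the discovery probability, the linear exploration bonus, and the `plan_optimistic` callback are all illustrative stand-ins.

```python
import random

def epoch_step(a, b, n_hidden, explore, plan_optimistic, bonus_weight=1.0):
    """One decision epoch of the meta-level controller.

    (a, b): Beta(a, b) posterior over the per-step discovery probability.
    n_hidden: current estimate of how many actions remain hidden.
    explore(): runs one exploratory step, returns True if it revealed an action.
    plan_optimistic(): one UCRL-style optimistic step over the known actions A_t.
    """
    p_hat = a / (a + b)  # posterior mean discovery probability
    # Illustrative exploration bonus: scale with the chance of discovery
    # and with how many actions might still be hidden.
    p_explore = min(1.0, bonus_weight * p_hat * n_hidden)
    if random.random() < p_explore:
        revealed = explore()
        # Bayesian update on the success/failure of the discovery attempt.
        return (a + 1, b) if revealed else (a, b + 1)
    plan_optimistic()
    return a, b
```

As evidence accumulates that few discoveries remain (b grows, or n_hidden shrinks), the exploration probability decays and the controller behaves like plain UCRL over the known action set.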
The paper also clarifies the relationship between MDPUs and ordinary MDPs. When all actions are known from the start, the MDPU reduces to a standard MDP and the proposed algorithm coincides with existing optimal reinforcement‑learning methods. Conversely, if the discovery probability is too low or the exploration cost grows super‑polynomially, the authors prove an impossibility theorem: no algorithm can guarantee near‑optimal performance under such circumstances.
Empirical validation is carried out in two domains. In a robotic arm manipulation task, where new grasping motions must be discovered during learning, the MDPU‑aware algorithm outperforms a baseline UCRL implementation by 15–30% in average reward and converges 20–35% faster. In a financial portfolio management scenario, where novel investment opportunities appear intermittently, the algorithm achieves higher returns while maintaining comparable risk levels, demonstrating the practical benefit of explicitly modeling action unawareness.
In summary, the work provides a rigorous foundation for decision‑making problems where the action set is initially incomplete. By defining precise learnability conditions, delivering a polynomial‑time near‑optimal learning algorithm, and validating the approach experimentally, the authors open a new research avenue for reinforcement learning and control in environments with hidden or emergent actions. Future directions suggested include extensions to multi‑agent settings, continuous state‑action spaces, and more sophisticated models of exploration cost.