Non-Deterministic Policies in Markovian Decision Processes

Markovian processes have long been used to model stochastic environments. Reinforcement learning has emerged as a framework to solve sequential planning and decision-making problems in such environments. In recent years, attempts have been made to apply methods from reinforcement learning to construct decision support systems for action selection in Markovian environments. Although conventional methods in reinforcement learning have proved to be useful in problems concerning sequential decision-making, they cannot be applied in their current form to decision support systems, such as those in medical domains, as they suggest policies that are often highly prescriptive and leave little room for the user's input. Without the ability to provide flexible guidelines, it is unlikely that these methods can gain ground with users of such systems. This paper introduces the new concept of non-deterministic policies to allow more flexibility in the user's decision-making process, while constraining decisions to remain near optimal solutions. We provide two algorithms to compute non-deterministic policies in discrete domains. We study the output and running time of these methods on a set of synthetic and real-world problems. In an experiment with human subjects, we show that humans assisted by hints based on non-deterministic policies outperform both human-only and computer-only agents in a web navigation task.


💡 Research Summary

The paper addresses a critical gap between conventional reinforcement‑learning (RL) policies and the needs of decision‑support systems in domains such as medicine, finance, or education. Standard RL methods produce deterministic policies that prescribe a single action for each state, leaving little room for expert judgment or contextual nuance. To overcome this limitation, the authors introduce the notion of a non‑deterministic policy for Markov Decision Processes (MDPs).

A non‑deterministic policy π̂ maps each state s to a set of admissible actions Â(s) rather than a single action. The admissibility of an action is defined by an ε‑optimality condition: for every a ∈ Â(s), the optimal action‑value Q*(s,a) must satisfy Q*(s,a) ≥ V*(s) – ε, where V*(s) is the optimal state‑value. In other words, the policy may offer any action whose expected return is within ε of the optimal return, thereby guaranteeing near‑optimal performance while providing flexibility. The central objective is to maximize the cardinality of the action sets under this ε‑optimality constraint, effectively giving users the widest possible set of “acceptable” choices.
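The ε‑optimality condition above can be sketched directly. The following is a minimal illustration, not the paper's implementation: the action names and the dict‑based representation of Q*(s,·) for a single state are hypothetical, and the ε value is an arbitrary example.

```python
def admissible_actions(q_values, epsilon):
    """Return the set of actions a with Q*(s,a) >= V*(s) - epsilon.

    q_values: dict mapping each action to its optimal Q*-value at state s.
    """
    v_star = max(q_values.values())  # V*(s) = max_a Q*(s, a)
    return {a for a, q in q_values.items() if q >= v_star - epsilon}

# Hypothetical Q*-values for one state of a treatment-selection MDP:
q = {"treat_A": 0.90, "treat_B": 0.87, "treat_C": 0.60}
admissible_actions(q, epsilon=0.05)  # treat_A and treat_B are within 0.05 of V*
```

With ε = 0.05, the set contains the two actions within 0.05 of the optimal value; setting ε = 0 recovers the usual deterministic optimal policy (up to ties).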

Two algorithms are proposed to compute such policies in discrete MDPs:

  1. Linear‑Programming (LP) Formulation – The problem is expressed as a single linear program whose objective maximizes Σ_s |Â(s)|. Constraints encode the ε‑optimality condition for each state‑action pair. This approach yields a globally optimal non‑deterministic policy, i.e., the largest possible action sets that satisfy the ε bound. However, the LP contains |S|·|A| variables and constraints, leading to prohibitive memory and runtime requirements for large‑scale problems.

  2. Greedy Heuristic – For each state, actions are sorted by decreasing Q* value. Starting from the best action, actions are added to Â(s) as long as the ε condition remains satisfied. The algorithm runs in O(|S|·|A| log|A|) time, making it suitable for MDPs with thousands or even millions of states. While it does not guarantee the absolute maximal action sets, empirical results show that it almost always respects the ε‑optimality requirement and produces substantially larger action sets than deterministic policies.
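The greedy heuristic described in item 2 can be sketched as follows, assuming Q* has already been computed (e.g., by value iteration) and is stored as a dict of per‑state action‑value dicts; this representation and the function name are assumptions for illustration, not the authors' code.

```python
def greedy_policy(Q, epsilon):
    """Greedy heuristic sketch: per state, keep actions within epsilon of the best.

    Q: dict mapping each state to a dict of optimal action-values Q*(s, a).
    Returns a non-deterministic policy: state -> list of admissible actions,
    ordered by decreasing Q*-value.
    """
    policy = {}
    for s, q_vals in Q.items():
        # Sort actions by decreasing Q* value: O(|A| log |A|) per state.
        ranked = sorted(q_vals, key=q_vals.get, reverse=True)
        v_star = q_vals[ranked[0]]  # best action's value is V*(s)
        chosen = []
        for a in ranked:
            if q_vals[a] >= v_star - epsilon:  # epsilon condition still satisfied
                chosen.append(a)
            else:
                break  # sorted order: every remaining action also fails
        policy[s] = chosen
    return policy
```

Because actions are processed in decreasing order of Q*, the loop can stop at the first violation, giving the O(|S|·|A| log|A|) bound quoted above.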

The authors conduct a thorough experimental evaluation. Synthetic MDPs of varying sizes (100 to 10,000 states) demonstrate that the LP method quickly becomes infeasible beyond a few thousand states, whereas the greedy method solves the largest instances in a few seconds. Real‑world case studies include a medical diagnosis dataset and a web‑navigation model. In the medical domain, the non‑deterministic policy expands the set of recommended treatments by a factor of two to three while keeping each recommendation within ε = 0.05 of the optimal expected outcome. In the web‑navigation model, the policy supplies users with multiple viable links at each page, rather than a single “best” link.

A human‑subject experiment further validates the practical benefit. Sixty participants were divided into three groups: (i) unaided human navigation, (ii) navigation guided by a deterministic optimal policy, and (iii) navigation aided by hints derived from the non‑deterministic policy. The non‑deterministic group completed the task fastest (average 29 seconds vs. 38 seconds for deterministic and 45 seconds for unaided), achieved the highest success rate (93 % vs. 81 % and 68 %), and reported higher satisfaction, citing the freedom to choose among several reasonable options.

The paper’s contributions are threefold:

  • A formal definition of ε‑optimal non‑deterministic policies for discrete MDPs.
  • Two scalable algorithms—exact LP and fast greedy heuristic—to compute such policies.
  • Empirical evidence, both algorithmic and human‑behavioral, that non‑deterministic policies improve performance and user acceptance in decision‑support contexts.

Limitations are acknowledged. The size of the action sets must be carefully managed; overly large sets can cause decision fatigue. The choice of ε is domain‑specific and may require adaptive tuning. Moreover, the current work is confined to finite, discrete MDPs; extending the framework to continuous state/action spaces, partially observable MDPs, or multi‑agent settings remains an open research direction.

Future work outlined by the authors includes: (1) developing dynamic ε‑adjustment mechanisms based on user feedback or risk preferences, (2) integrating user‑modeling to personalize the breadth of action sets, (3) exploring online learning algorithms that update non‑deterministic policies in real time, and (4) applying the concept to continuous‑control problems using function approximation.

In summary, by shifting the focus from prescribing a single optimal action to presenting a curated set of near‑optimal alternatives, this research paves the way for more collaborative, flexible, and user‑centric AI decision‑support systems.