Finite-State Controllers for (Hidden-Model) POMDPs using Deep Reinforcement Learning


Solving partially observable Markov decision processes (POMDPs) requires computing policies under imperfect state information. Despite recent advances, the scalability of existing POMDP solvers remains limited. Moreover, many settings require a policy that is robust across multiple POMDPs, further aggravating the scalability issue. We propose the Lexpop framework for POMDP solving. Lexpop (1) employs deep reinforcement learning to train a neural policy, represented by a recurrent neural network, and (2) constructs a finite-state controller mimicking the neural policy through efficient extraction methods. Crucially, unlike neural policies, such controllers can be formally evaluated, providing performance guarantees. We extend Lexpop to compute robust policies for hidden-model POMDPs (HM-POMDPs), which describe finite sets of POMDPs. We associate every extracted controller with its worst-case POMDP. Using a set of such POMDPs, we iteratively train a robust neural policy and consequently extract a robust controller. Our experiments show that on problems with large state spaces, Lexpop outperforms state-of-the-art solvers for POMDPs as well as HM-POMDPs.


💡 Research Summary

The paper introduces Lexpop, a novel framework for solving both standard partially observable Markov decision processes (POMDPs) and their robust extension, hidden‑model POMDPs (HM‑POMDPs). Lexpop consists of three tightly coupled stages. First, a deep reinforcement learning (DRL) agent with a recurrent neural network (RNN) policy is trained on the given model(s). The authors employ Proximal Policy Optimization (PPO) together with a vectorized simulator (VecStorm) that can generate massive parallel trajectories for one or many POMDP instances, thereby exploiting the known model while keeping the learning process model‑free. The RNN provides an internal hidden state that serves as memory, allowing the policy to react to observation histories despite partial observability.
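As a rough illustration of how the RNN's hidden state serves as memory over observation histories, the sketch below runs a tiny recurrent policy step by step. The dimensions and randomly initialized weights are hypothetical stand-ins for a PPO-trained network, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; random weights stand in for a PPO-trained policy.
N_OBS, N_ACT, HIDDEN = 4, 3, 8
W_h = rng.normal(scale=0.3, size=(HIDDEN, HIDDEN))  # memory-to-memory weights
W_o = rng.normal(scale=0.3, size=(HIDDEN, N_OBS))   # observation-to-memory weights
W_a = rng.normal(scale=0.3, size=(N_ACT, HIDDEN))   # memory-to-action weights

def policy_step(h, obs):
    """One recurrent step: fold the new observation into the hidden state,
    then emit a distribution over actions."""
    x = np.zeros(N_OBS)
    x[obs] = 1.0                      # one-hot encode the observation
    h = np.tanh(W_h @ h + W_o @ x)    # hidden state summarizes the history
    logits = W_a @ h
    probs = np.exp(logits - logits.max())
    return h, probs / probs.sum()

h = np.zeros(HIDDEN)
for obs in [0, 2, 1]:                 # an example observation history
    h, probs = policy_step(h, obs)
```

Because the action distribution depends on `h` rather than on the last observation alone, two identical observations reached via different histories can yield different actions, which is exactly what partial observability demands.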

Second, the trained RNN policy is transformed into a finite‑state controller (FSC). Two extraction techniques are proposed. The first uses Alergia, a probabilistic automata learning algorithm, to approximate the stochastic action and memory‑update functions of the RNN by a Mealy‑style FSC. The second, called SIG (Scalable Interpretable Generator), learns a surrogate self‑interpretable network that directly controls the size of the resulting FSC, giving the practitioner explicit control over the number of memory nodes. Both methods treat the policy as a black box, making the extraction agnostic to the underlying RNN architecture.
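The shape of an extracted controller can be sketched as a Mealy machine: each (memory node, observation) pair determines an action and a successor node. The two-node controller and its observation/action names below are invented for illustration, not taken from the paper.

```python
# A finite-state controller as a Mealy machine. The memory node plays the
# role of the RNN's hidden state, but with finitely many values.
fsc = {
    # (memory_node, observation): (action, next_memory_node)
    (0, "clear"): ("forward", 0),
    (0, "wall"):  ("turn",    1),
    (1, "clear"): ("forward", 0),
    (1, "wall"):  ("turn",    1),
}

def run_fsc(fsc, observations, start_node=0):
    """Execute the controller on an observation sequence, returning actions."""
    node, actions = start_node, []
    for obs in observations:
        action, node = fsc[(node, obs)]
        actions.append(action)
    return actions

actions = run_fsc(fsc, ["clear", "wall", "clear"])
# actions == ["forward", "turn", "forward"]
```

Extraction methods such as Alergia fit the entries of a table like `fsc` from sampled input/output traces of the black-box policy, which is why they need no access to the RNN's internals.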

Third, the extracted FSC is evaluated using model‑checking tools such as Storm. Because an FSC defines explicit transition probabilities over the product of the original POMDP state space and the controller’s memory nodes, the resulting Markov chain can be analyzed exactly, yielding provable performance guarantees—something that raw neural policies cannot provide.
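A minimal sketch of this exact evaluation, assuming a toy three-state POMDP and a one-node FSC (both invented for illustration): the product of POMDP states and controller memory nodes forms a Markov chain, and its expected number of steps to the goal follows from solving a linear system — the same computation a model checker such as Storm performs at scale.

```python
import numpy as np

# Toy POMDP: states 0, 1 are transient, state 2 is an absorbing goal.
obs_of = {0: "far", 1: "near"}                 # observation function
T = {(0, "go"): {1: 1.0},                      # T[(state, action)] -> successor dist
     (1, "go"): {2: 0.5, 0: 0.5}}
fsc = {(0, "far"): ("go", 0), (0, "near"): ("go", 0)}  # (node, obs) -> (action, node')

# Enumerate transient product states (pomdp_state, memory_node).
prod = [(0, 0), (1, 0)]
idx = {ps: i for i, ps in enumerate(prod)}

# Build the product chain restricted to transient states and solve
# (I - P) v = r for the expected number of steps to the goal (cost 1 per step).
P = np.zeros((len(prod), len(prod)))
r = np.ones(len(prod))
for (s, n), i in idx.items():
    action, n2 = fsc[(n, obs_of[s])]
    for s2, p in T[(s, action)].items():
        if (s2, n2) in idx:                    # jumps to the goal leave P as zero
            P[i, idx[(s2, n2)]] += p
v = np.linalg.solve(np.eye(len(prod)) - P, r)
# v[idx[(0, 0)]] == 4.0, v[idx[(1, 0)]] == 3.0: exact expected steps to goal
```

The single memory node makes this product trivial, but the same construction applies to multi-node controllers; the solved values are exact rather than sampled, which is what makes formal guarantees possible.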

For HM‑POMDPs, the authors extend Lexpop with a robustness loop. An FSC is first extracted, then a deductive verification step identifies the “worst‑case” POMDP in the hidden‑model set (the one that yields the lowest expected reward for that FSC). This worst‑case model is added to a buffer, and the DRL agent is retrained on the entire buffer of models. The process repeats: extract a new FSC, find its worst‑case POMDP, augment the buffer, and retrain, until a timeout or convergence criterion is met. This iterative scheme produces a robust FSC whose guaranteed reward is maximized over the entire family of POMDPs, effectively solving a zero‑sum Stackelberg game between the controller and an adversarial model selector.
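The iterative scheme can be summarized in a short loop. Here `train`, `extract`, and `evaluate` are stubs standing in for PPO training, FSC extraction, and exact model checking, and the toy value table at the bottom is invented purely to exercise the loop.

```python
def robust_loop(family, train, extract, evaluate, max_iters=5):
    """Worst-case-driven loop: train on a buffer of adversarial models,
    extract an FSC, add its worst-case POMDP to the buffer, repeat."""
    buffer = [family[0]]                      # start from an arbitrary model
    best_fsc, best_worst = None, float("-inf")
    for _ in range(max_iters):
        policy = train(buffer)
        fsc = extract(policy)
        # Model checking identifies the worst-case POMDP for this controller.
        worst = min(family, key=lambda m: evaluate(fsc, m))
        worst_value = evaluate(fsc, worst)
        if worst_value > best_worst:
            best_fsc, best_worst = fsc, worst_value
        if worst in buffer:                   # no new adversarial model found
            break
        buffer.append(worst)
    return best_fsc, best_worst

# Toy instantiation: two models "A" and "B" with tabulated (hypothetical) values.
values = {(("A",), "A"): 1.0, (("A",), "B"): 0.2,
          (("A", "B"), "A"): 0.8, (("A", "B"), "B"): 0.7}
fsc, worst_value = robust_loop(
    family=["A", "B"],
    train=lambda buf: tuple(sorted(buf)),     # "policy" = which models it saw
    extract=lambda policy: policy,            # identity extraction for the toy
    evaluate=lambda fsc, m: values[(fsc, m)],
)
# worst_value == 0.7: guaranteed value across both models after one refinement
```

The loop terminates either on timeout or when the worst-case model is already in the buffer, i.e., the adversary can no longer produce a new counterexample.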

Empirical evaluation covers four large‑scale benchmark domains: grid‑world navigation, medical treatment planning, network routing protocols, and robotic manipulation. State spaces range from tens of thousands to several hundred thousand states. Lexpop is compared against belief‑space solvers (e.g., POMCP, SARSOP), model‑based FSC synthesis methods (e.g., AlphaZero‑FSC, bounded policy iteration), and prior robust HM‑POMDP approaches. Results show that Lexpop achieves 2–5× lower training time and memory consumption while producing FSCs that attain ≥99 % of the optimal value as measured by exact verification. In the robust setting, the iterative Lexpop improves the worst‑case reward by 15–30 % relative to baselines and limits performance degradation in the most adverse model to under 5 %.

The paper’s contributions are threefold: (1) demonstrating that DRL can be harnessed to solve large‑scale discrete POMDPs when combined with a formal extraction step; (2) providing two scalable, architecture‑agnostic FSC extraction methods that allow explicit trade‑offs between controller size and performance; (3) introducing a worst‑case‑driven learning loop that yields provably robust FSCs for HM‑POMDP families. The authors also discuss future directions, including extensions to continuous state/observation spaces, discounted and multi‑objective rewards, multi‑agent POMDPs, and further minimization/interpretability of extracted controllers. Lexpop bridges the gap between high‑performance deep learning and formal verification, offering a practical pathway to trustworthy decision‑making in complex, uncertain environments.

