On the Computational Complexity of Stochastic Controller Optimization in POMDPs


We show that the problem of finding an optimal stochastic ‘blind’ controller in a Markov decision process is NP-hard. The corresponding decision problem is NP-hard, in PSPACE, and SQRT-SUM-hard; placing it in NP would therefore imply breakthroughs in long-standing open problems in computer science. Our result establishes that the more general problem of stochastic controller optimization in POMDPs is also NP-hard. Nonetheless, we outline a special case that is convex and admits efficient global solutions.


💡 Research Summary

The paper investigates the computational difficulty of finding optimal stochastic “blind” controllers in Markov decision processes (MDPs) and, by extension, in partially observable MDPs (POMDPs). A blind controller is defined as a stationary stochastic policy that does not condition on the current state, observation, or time; it simply draws an action from a fixed probability distribution π at every step. By embedding this restriction into the standard occupancy‑measure linear program for discounted infinite‑horizon MDPs, the authors obtain a jointly constrained bilinear program (variables x for state occupancies and π for the action distribution).
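To make the objective concrete, here is a minimal sketch (with made-up transition probabilities and costs, not taken from the paper) of how the expected discounted cost J(π) of a blind controller can be evaluated: average the per-action transition matrices and costs under π, then solve the occupancy-measure linear system.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (all numbers illustrative).
P = np.array([  # P[a, s, s'] = transition probability under action a
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.7]],
    [[0.5, 0.5, 0.0], [0.0, 0.6, 0.4], [0.1, 0.1, 0.8]],
])
c = np.array([[1.0, 2.0, 0.5], [0.3, 1.5, 2.5]])  # c[a, s] = cost
mu = np.array([1.0, 0.0, 0.0])  # initial state distribution
gamma = 0.95                    # discount factor

def blind_cost(pi):
    """Expected discounted cost J(pi) of the blind controller pi."""
    M = np.einsum('a,ast->st', pi, P)  # state transition matrix under pi
    cbar = pi @ c                      # expected per-state cost under pi
    # Discounted occupancy measure x solves (I - gamma*M)^T x = mu.
    x = np.linalg.solve((np.eye(3) - gamma * M).T, mu)
    return x @ cbar

print(blind_cost(np.array([0.5, 0.5])))
```

Because π multiplies both the occupancies x and the averaged costs, fixing π makes the problem linear in x, while optimizing over both jointly gives the bilinear program described above.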

The central theoretical contribution is a series of hardness results for the decision version of the problem: given an MDP and a cost threshold r, does there exist a blind stochastic controller whose expected discounted cost J(π) does not exceed r?

  1. NP‑hardness – The authors reduce the classic Independent‑Set problem on cubic graphs to the blind‑controller decision problem. They construct an MDP whose states and actions correspond to the vertices of the graph, set deterministic transitions so that taking action a moves the system to state a, and define the cost matrix as (G + I)/γ where G is the adjacency matrix. The resulting cost function becomes J(π) = constant + πᵀ(G + I)π. By the Motzkin‑Straus theorem, the quadratic term encodes the reciprocal of the size of a maximum independent set. Choosing the threshold r appropriately makes the decision problem equivalent to asking whether the graph contains an independent set of size at least j, establishing NP‑hardness.

  2. PSPACE containment – The bilinear constraints can be rewritten as a system of polynomial equalities and inequalities over real variables. Feasibility of such a system (an instance of the existential theory of the reals) is decidable in PSPACE (Canny, 1988). Hence the blind‑controller decision problem lies in PSPACE.

  3. sqrt‑sum hardness – To connect the problem to the long‑standing open sqrt‑sum problem, the authors reduce an instance of sqrt‑sum (given integers c₁,…,cₙ and an integer d, decide whether Σ√cᵢ ≤ d) to a specially crafted MDP with n + 1 states and n actions. Each non‑absorbing state i has a self‑loop under every action except action i, which transitions to an absorbing state. Costs are set from the input integers, and the discount factor γ is tuned so that the expected cost of any blind controller can be expressed as a function proportional to (Σ√cᵢ)². Choosing the threshold r appropriately then makes the decision problem exactly the sqrt‑sum question, proving sqrt‑sum‑hardness. Consequently, placing the blind‑controller problem in NP would resolve deep open questions in numerical analysis.

  4. A tractable special case – The authors identify a structural restriction under which the problem becomes polynomial‑time solvable. If every action’s transition matrix Pₐ is symmetric and doubly stochastic, and the cost vector is proportional to the negative of the initial state distribution (c = –κ μ, κ > 0), then the objective can be rewritten in terms of f(π) = μᵀ(I – γ M(π))⁻¹μ where M(π) = ΣπₐPₐ. Lemma 1 shows that I – γ M(π) is always symmetric positive definite. Using the Schur complement, the epigraph of f is shown to be a linear matrix inequality (LMI), making f convex in π. Since c = –κ μ, minimizing the expected cost is equivalent to maximizing f, and the maximum of a convex function over the probability simplex is attained at a vertex, i.e., at a deterministic blind controller. Evaluating all k deterministic controllers thus yields the global optimum in O(k n³) time.

  5. Implications and open directions – The paper concludes that, in general, optimizing stochastic blind controllers is computationally intractable (NP‑hard, in PSPACE, and sqrt‑sum‑hard). The hardness constructions use costs depending on both state and action; the authors note that even state‑only costs remain NP‑hard via a reduction from the general case. Open problems include the complexity of approximation algorithms for blind controllers, extensions to the undiscounted infinite‑horizon setting, and the broader class of stochastic memoryless controllers (which subsume blind controllers). The identified convex special case suggests that exploiting structural properties of transition dynamics may lead to practical algorithms for restricted domains.
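The Motzkin–Straus identity behind step 1 — the minimum of πᵀ(G + I)π over the probability simplex equals 1/α(G), the reciprocal of the maximum independent-set size — can be checked numerically on a small graph. The sketch below uses a 5-cycle for illustration (the reduction itself uses cubic graphs):

```python
import numpy as np

# Adjacency matrix of the 5-cycle C5 (alpha(C5) = 2).
n = 5
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

Q = A + np.eye(n)

def quad(pi):
    """The quadratic form pi^T (G + I) pi from the reduction."""
    return pi @ Q @ pi

# Uniform distribution on a maximum independent set {0, 2} of C5
# attains the Motzkin-Straus minimum 1/alpha(G) = 1/2.
pi_star = np.zeros(n)
pi_star[[0, 2]] = 0.5
print(quad(pi_star))  # 0.5

# No randomly sampled point on the simplex does better.
rng = np.random.default_rng(0)
for _ in range(1000):
    pi = rng.dirichlet(np.ones(n))
    assert quad(pi) >= 0.5 - 1e-9
```

The cross-terms vanish exactly when the support of π is an independent set, which is why the quadratic form encodes independence.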
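The sqrt-sum comparison in step 3 looks innocuous but is not known to be decidable in polynomial time, because no polynomial bound is known on the number of bits of precision required. A hedged sketch using Python's arbitrary-precision decimals (the precision parameter is a heuristic choice of mine, not a proven bound):

```python
from decimal import Decimal, getcontext

def sqrt_sum_leq(cs, d, prec=50):
    """Decide whether sum(sqrt(c) for c in cs) <= d at fixed precision.

    Caveat: if the (irrational) sum is astronomically close to d,
    50 digits may not settle the comparison; the lack of a known
    polynomial precision bound is exactly why SQRT-SUM is open.
    """
    getcontext().prec = prec
    return sum(Decimal(c).sqrt() for c in cs) <= Decimal(d)

print(sqrt_sum_leq([1, 4, 9], 6))   # 1 + 2 + 3 = 6, so True
print(sqrt_sum_leq([2, 3], 3))      # sqrt(2) + sqrt(3) ≈ 3.146, so False
```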
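The vertex-optimality claim of step 4 can be sanity-checked numerically. The sketch below uses illustrative symmetric, doubly stochastic matrices of my own choosing, evaluates f(π) = μᵀ(I − γM(π))⁻¹μ, and confirms that no randomly sampled mixed controller beats the best deterministic one:

```python
import numpy as np

gamma = 0.9
mu = np.array([0.6, 0.3, 0.1])
# Symmetric, doubly stochastic transition matrices (illustrative).
P = np.array([
    [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.5, 0.5]],
])

def f(pi):
    """f(pi) = mu^T (I - gamma * M(pi))^{-1} mu, convex in pi."""
    M = np.einsum('a,ast->st', pi, P)
    return mu @ np.linalg.solve(np.eye(3) - gamma * M, mu)

# The maximum of the convex f over the simplex lies at a vertex,
# i.e., at one of the k deterministic blind controllers.
best_det = max(f(np.eye(2)[a]) for a in range(2))
rng = np.random.default_rng(1)
assert all(f(rng.dirichlet([1, 1])) <= best_det + 1e-9 for _ in range(500))
print(best_det)
```

Enumerating the k vertices costs one n×n linear solve each, matching the O(k n³) bound stated above.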

Overall, the work provides a rigorous complexity landscape for a natural subclass of POMDP controllers, clarifying both the limits of exact optimization and a promising avenue for efficient solutions under symmetry and cost‑distribution assumptions.

