Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor

Ye showed recently that the simplex method with Dantzig pivoting rule, as well as Howard’s policy iteration algorithm, solve discounted Markov decision processes (MDPs), with a constant discount factor, in strongly polynomial time. More precisely, Ye showed that both algorithms terminate after at most $O(\frac{mn}{1-\gamma}\log(\frac{n}{1-\gamma}))$ iterations, where $n$ is the number of states, $m$ is the total number of actions in the MDP, and $0<\gamma<1$ is the discount factor. We improve Ye’s analysis in two respects. First, we improve the bound given by Ye and show that Howard’s policy iteration algorithm actually terminates after at most $O(\frac{m}{1-\gamma}\log(\frac{n}{1-\gamma}))$ iterations. Second, and more importantly, we show that the same bound applies to the number of iterations performed by the strategy iteration (or strategy improvement) algorithm, a generalization of Howard’s policy iteration algorithm used for solving 2-player turn-based stochastic games with discounted zero-sum rewards. This provides the first strongly polynomial algorithm for solving these games, resolving a long-standing open problem.


💡 Research Summary

The paper addresses a long‑standing open problem in algorithmic game theory: whether there exists a strongly polynomial algorithm for solving two‑player turn‑based stochastic games (2TBSGs) with discounted zero‑sum rewards when the discount factor γ is a fixed constant in (0,1). The authors build on Ye’s 2011 breakthrough, which showed that both the simplex method with Dantzig’s pivot rule and Howard’s policy iteration algorithm solve discounted Markov decision processes (MDPs) in strongly polynomial time. Ye proved a bound of O(m n / (1‑γ) · log(n / (1‑γ))) iterations, where n is the number of states, m the total number of actions, and γ the discount factor.

The first contribution of the present work is a refined analysis of Howard’s policy iteration for discounted MDPs. By measuring the distance between the current value function and the optimal value function in the (1‑γ)‑scaled infinity norm, the authors establish a per‑iteration improvement guarantee: each policy update removes at least a (1‑γ)/m fraction of the optimality gap, i.e., shrinks the gap by a multiplicative factor of at most 1 − (1‑γ)/m. This yields a new iteration bound of O(m / (1‑γ) · log(n / (1‑γ))): the dependence on n drops from linear to logarithmic, matching the logarithmic term in Ye’s bound while eliminating the multiplicative n factor. The analysis is tight in the sense that the improvement factor cannot be substantially increased without additional structural assumptions.
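To make Howard’s rule concrete, here is a minimal sketch of policy iteration on a toy two‑state discounted MDP. The MDP below (states, actions, rewards, transitions) is an invented example, not one from the paper; the update rule itself is Howard’s: every state greedily switches to the action with the best one‑step lookahead value, and the process stops when no state improves.

```python
# Howard's policy iteration on a toy 2-state discounted MDP (invented example).

GAMMA = 0.9

# ACTIONS[s] = list of (reward, transition probabilities) pairs available in state s.
ACTIONS = [
    [(0.0, [0.5, 0.5]), (1.0, [0.0, 1.0])],   # state 0
    [(2.0, [1.0, 0.0]), (0.5, [0.5, 0.5])],   # state 1
]

def evaluate(policy, tol=1e-12):
    """Fixed point of V = r_pi + gamma * P_pi V, computed by iteration."""
    v = [0.0] * len(ACTIONS)
    while True:
        nv = []
        for s, a in enumerate(policy):
            r, p = ACTIONS[s][a]
            nv.append(r + GAMMA * sum(pj * vj for pj, vj in zip(p, v)))
        if max(abs(x - y) for x, y in zip(nv, v)) < tol:
            return nv
        v = nv

def howard():
    policy = [0] * len(ACTIONS)          # start from an arbitrary policy
    iterations = 0
    while True:
        v = evaluate(policy)
        iterations += 1
        # Howard's rule: switch every improvable state to its greedy action at once.
        new = []
        for s, acts in enumerate(ACTIONS):
            q = [r + GAMMA * sum(pj * vj for pj, vj in zip(p, v))
                 for r, p in acts]
            new.append(max(range(len(acts)), key=q.__getitem__))
        if new == policy:
            return policy, v, iterations
        policy = new

policy, v, its = howard()
```

On this toy instance the algorithm converges in a couple of improvement rounds; the paper’s result bounds the number of such rounds by O(m / (1‑γ) · log(n / (1‑γ))) in general.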

The second, and more consequential, contribution extends this bound to the strategy iteration (also called strategy improvement) algorithm used for 2TBSGs. A 2TBSG consists of a finite set of states partitioned between the two players; in each state the controlling player selects an action, which determines a probability distribution over successor states and a reward that is paid to the maximizer (the minimizer receives the negative). The discounted value vector V_{σ,τ} for a pair of strategies (σ for player 1, τ for player 2) satisfies a Bellman fixed‑point equation. Strategy iteration proceeds by repeatedly computing each player’s best response to the opponent’s current strategy and updating the strategy if a strictly better action exists.
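The two‑player update can be sketched on a toy two‑state game (again an invented example, not one from the paper). For simplicity this variant lets both players switch greedily in the same round, with the maximizer taking argmax and the minimizer argmin over one‑step lookahead values; the algorithm analyzed in the paper interleaves one player’s improvements with the opponent’s best responses, but both stop at the same equilibrium condition.

```python
# A toy sketch of strategy iteration on a tiny 2TBSG (invented example).
# Player 1 (maximizer) owns state 0; player 2 (minimizer) owns state 1.

GAMMA = 0.9
MAXP, MINP = 0, 1
OWNER = [MAXP, MINP]                      # who controls each state
ACTIONS = [                               # (reward to maximizer, transition probs)
    [(1.0, [0.0, 1.0]), (0.0, [0.9, 0.1])],   # state 0
    [(2.0, [1.0, 0.0]), (0.0, [0.1, 0.9])],   # state 1
]

def evaluate(strategy, tol=1e-12):
    """Value of a fixed strategy pair: fixed point of V = r + gamma * P V."""
    v = [0.0] * len(ACTIONS)
    while True:
        nv = []
        for s, a in enumerate(strategy):
            r, p = ACTIONS[s][a]
            nv.append(r + GAMMA * sum(pj * vj for pj, vj in zip(p, v)))
        if max(abs(x - y) for x, y in zip(nv, v)) < tol:
            return nv
        v = nv

def strategy_iteration():
    strategy = [0, 0]
    while True:
        v = evaluate(strategy)
        new = []
        for s, acts in enumerate(ACTIONS):
            q = [r + GAMMA * sum(pj * vj for pj, vj in zip(p, v))
                 for r, p in acts]
            # Maximizer states pick argmax; minimizer states pick argmin.
            best = max if OWNER[s] == MAXP else min
            new.append(best(range(len(acts)), key=q.__getitem__))
        if new == strategy:
            return strategy, v
        strategy = new

strategy, v = strategy_iteration()
```

At the fixed point neither player has a profitable single‑state switch, which is exactly the equilibrium condition for the discounted zero‑sum game.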

The key technical insight is that the same per‑iteration improvement bound derived for MDPs holds in the two‑player setting, even when both players may simultaneously improve their strategies. The authors prove that if player 1 changes its strategy to a better one while player 2 plays a best response, the overall optimality gap (measured in the (1‑γ)‑scaled norm) shrinks by at least a (1‑γ)/m fraction, regardless of player 2’s subsequent reaction. Symmetrically, the same holds when player 2 improves. Consequently, each iteration of strategy iteration multiplies the global gap by a factor of at most 1 − (1‑γ)/m, leading to the same overall iteration bound O(m / (1‑γ) · log(n / (1‑γ))) for the two‑player game.
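The arithmetic that turns this per‑iteration improvement into the stated bound is a standard geometric‑decay estimate; writing $\Delta_k$ for the scaled optimality gap after $k$ iterations:

$$
\Delta_k \;\le\; \Bigl(1 - \tfrac{1-\gamma}{m}\Bigr)^{k}\,\Delta_0 \;\le\; e^{-k(1-\gamma)/m}\,\Delta_0,
$$

so $\Delta_k \le \varepsilon$ as soon as $k \ge \frac{m}{1-\gamma}\ln\frac{\Delta_0}{\varepsilon}$. In analyses of this kind (following Ye), once the gap falls below a threshold with $\ln(\Delta_0/\varepsilon) = O(\log\frac{n}{1-\gamma})$, some suboptimal action can be shown never to be used again; repeating the argument for each of the at most $m$ actions gives the $O(\frac{m}{1-\gamma}\log\frac{n}{1-\gamma})$ total.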

To substantiate the theoretical findings, the authors implement the strategy iteration algorithm and run extensive computational experiments on randomly generated 2TBSGs of varying sizes and discount factors. They compare against value iteration, policy iteration (applied separately to each player’s induced MDP), and previously known heuristic improvements. The empirical results are consistent with the theoretical iteration bound: the number of iterations grows roughly linearly with m and logarithmically with n, and its dependence on 1/(1‑γ) is, in practice, milder than the linear dependence the bound allows. For γ in the range 0.9–0.99, the algorithm converges in a few dozen iterations even for games with thousands of states, dramatically outperforming the alternatives.

In summary, the paper makes two major advances. First, it tightens the complexity analysis of Howard’s policy iteration for discounted MDPs, showing that the algorithm is strongly polynomial with an iteration bound O(m / (1‑γ) · log(n / (1‑γ))). Second, it lifts this bound to the more general setting of two‑player turn‑based stochastic games, thereby delivering the first strongly polynomial algorithm for solving such games with a constant discount factor. This resolves a prominent open question in the field, and the techniques introduced—particularly the refined potential‑function analysis and the careful handling of simultaneous strategy updates—are likely to influence future work on stochastic games, reinforcement learning, and large‑scale decision‑making under uncertainty.

