Value Function Approximation in Zero-Sum Markov Games
This paper investigates value function approximation in the context of zero-sum Markov games, which can be viewed as a generalization of the Markov decision process (MDP) framework to the two-agent case. We generalize error bounds from MDPs to Markov games and describe generalizations of reinforcement learning algorithms to Markov games. We present a generalization of the optimal stopping problem to a two-player, simultaneous-move Markov game. For this special problem, we provide stronger bounds and can guarantee convergence for LSTD and temporal difference learning with linear value function approximation. We demonstrate the viability of value function approximation for Markov games by using the least-squares policy iteration (LSPI) algorithm to learn good policies for a soccer domain and a flow control problem.


💡 Research Summary

The paper tackles the problem of value‑function approximation (VFA) in zero‑sum Markov games, a natural two‑player extension of the classic Markov decision process (MDP). A zero‑sum Markov game is defined by a state space \(S\), two action spaces \(A_1\) and \(A_2\), a transition kernel \(P(s' \mid s, a_1, a_2)\), and a reward function \(r(s, a_1, a_2)\) that satisfies \(r_2 = -r_1\). The optimal strategies are given by the minimax principle: player 1 maximizes the worst‑case value while player 2 minimizes the best‑case value.
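To make the minimax principle concrete: the value of a single stage game \(r(s,\cdot,\cdot)\) can be found with a small linear program. A minimal sketch using SciPy (the matching-pennies payoff matrix is an illustrative example, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def minimax_solve(R):
    """Maximin mixed strategy for the row player of payoff matrix R.

    Player 1 (rows) maximizes, player 2 (cols) minimizes; returns
    (strategy x, game value v) with sum_i x_i R[i, j] >= v for all j.
    """
    m, n = R.shape
    # Variables: [x_1, ..., x_m, v]; linprog minimizes, so the objective is -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Constraints: v - sum_i x_i R[i, j] <= 0 for every column j.
    A_ub = np.hstack([-R.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one (v unconstrained in the equality).
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Matching pennies: value 0, uniform mixed strategy.
x, v = minimax_solve(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```

The same linear program is what makes minimax backups more expensive than the simple max used in MDP value iteration.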

The authors first generalize the well‑known error bounds for linear value‑function approximation in MDPs to the two‑player setting. By proving that the Bellman operator for a zero‑sum game remains a \(\gamma\)-contraction under the sup‑norm, they show that the distance between the true parameter vector \(w^*\) and any approximate vector \(w\) is bounded by \(\|w - w^*\| \le \frac{1}{1-\gamma}\,\epsilon\), where \(\epsilon\) is the Bellman residual in the chosen feature space. This result mirrors the classic MDP bound, confirming that the presence of an adversarial opponent does not fundamentally alter the convergence rate of linear VFA.
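The contraction property is easy to check numerically. The sketch below applies a game Bellman backup to two random value functions on a randomly generated game and verifies the sup-norm inequality; for simplicity it backs up with pure-strategy max-min rather than the mixed minimax value used in the paper (both are \(\gamma\)-contractions, since max and min are nonexpansive):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA1, nA2, gamma = 4, 3, 3, 0.9

# Random zero-sum game: rewards r(s,a1,a2) and transitions P(s'|s,a1,a2).
R = rng.normal(size=(nS, nA1, nA2))
P = rng.random(size=(nS, nA1, nA2, nS))
P /= P.sum(axis=-1, keepdims=True)  # normalize each transition distribution

def bellman(V):
    # Q(s,a1,a2) = r + gamma * E[V(s')]; back up with max over a1, min over a2.
    # (Pure-strategy max-min here; the mixed minimax value contracts identically.)
    Q = R + gamma * (P @ V)
    return Q.min(axis=2).max(axis=1)

V1, V2 = rng.normal(size=nS), rng.normal(size=nS)
lhs = np.abs(bellman(V1) - bellman(V2)).max()
rhs = gamma * np.abs(V1 - V2).max()
assert lhs <= rhs + 1e-12  # sup-norm contraction with modulus gamma
```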

A major technical contribution is the treatment of the optimal stopping problem when both players can decide to stop simultaneously. In this “dual‑stop” formulation, the game terminates only when both agents choose the stop action. The authors exploit the fact that the value function only changes at stopping points, allowing them to prove global convergence of both Least‑Squares Temporal Difference (LSTD) and standard Temporal‑Difference (TD) learning with linear function approximation. The proof hinges on the non‑singularity of the empirical Gram matrix in LSTD and the contraction property of the Bellman operator, guaranteeing that the parameter estimates converge to the unique fixed point.
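For intuition, the LSTD fixed point can be computed in a few lines. The following sketch runs LSTD on a hypothetical deterministic two-state chain with tabular features (the chain, rewards, and features are illustrative assumptions, not the paper's domains); with tabular features the solution recovers the exact value function:

```python
import numpy as np

gamma = 0.9
# Hypothetical deterministic two-state chain: 0 -> 1 (reward 1), 1 -> 0 (reward 0).
samples = [(0, 1.0, 1), (1, 0.0, 0)]  # (s, r, s') transitions
phi = np.eye(2)  # tabular (one-hot) features, so LSTD is exact here

# LSTD normal equations: A w = b with
#   A = sum_t phi(s_t) (phi(s_t) - gamma * phi(s_{t+1}))^T,  b = sum_t phi(s_t) r_t
A = sum(np.outer(phi[s], phi[s] - gamma * phi[s2]) for s, _, s2 in samples)
b = sum(phi[s] * r for s, r, _ in samples)
w = np.linalg.solve(A, b)  # w[s] = V(s); here V(0) = 1/(1 - gamma**2)
```

The convergence results in the paper guarantee that this system has a unique, well-behaved solution in the dual-stop setting, where the non-singularity of the matrix \(A\) is what makes the solve legitimate.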

Building on these theoretical foundations, the paper adapts the Least‑Squares Policy Iteration (LSPI) algorithm to zero‑sum games. In each iteration, LSPI evaluates a candidate joint policy by solving a linear system derived from LSTD‑Q, which approximates the Q‑function \(Q(s, a_1, a_2)\) with features \(\phi(s, a_1, a_2)\). The policy improvement step then computes a minimax policy directly from the approximated Q‑values:
\[
\pi_1(s) = \arg\max_{\rho \in \Delta(A_1)} \; \min_{a_2 \in A_2} \; \sum_{a_1 \in A_1} \rho(a_1)\, \hat{Q}(s, a_1, a_2)
\]
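In code, the improvement step amounts to solving one small linear program per state over the approximated Q-matrix. A hedged sketch using SciPy (the 2x2 Q-matrix is an illustrative stand-in for values assembled from learned features and weights):

```python
import numpy as np
from scipy.optimize import linprog

def minimax_policy(Q_s):
    """Mixed minimax policy for player 1 at one state, given the matrix Q(s, a1, a2)."""
    m, n = Q_s.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0  # linprog minimizes, so maximize the game value v via -v
    A_ub = np.hstack([-Q_s.T, np.ones((n, 1))])  # v <= sum_i x_i Q[i, j] for all j
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=np.append(np.ones(m), 0.0).reshape(1, -1), b_eq=[1.0],
                  bounds=[(0, 1)] * m + [(None, None)])
    return res.x[:m], res.x[-1]

# Illustrative 2x2 stage game (no pure saddle point, so the policy must mix):
Q_s = np.array([[0.0, 2.0], [3.0, 1.0]])
pi, value = minimax_policy(Q_s)  # pi = [0.5, 0.5], value = 1.5
```

Player 2's minimax policy is obtained symmetrically from the transposed, negated matrix; iterating evaluation and improvement yields the game-theoretic analogue of policy iteration.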