Leader-follower general-sum stochastic games (LF-GSSGs) model sequential decision-making under asymmetric commitment, where a leader commits to a policy and a follower best responds, yielding a strong Stackelberg equilibrium (SSE) with leader-favourable tie-breaking. This paper introduces a dynamic programming (DP) framework that computes SSEs by applying Bellman recursion over credible sets, state abstractions that formally represent all rational follower best responses under partial leader commitments. We first prove that any LF-GSSG admits a lossless reduction to a Markov decision process (MDP) over credible sets. We further establish that synthesising an optimal memoryless deterministic leader policy is NP-hard, motivating the development of ε-optimal DP algorithms with provable guarantees on leader exploitability. Experiments on standard mixed-motive benchmarks, including security games, resource allocation, and adversarial planning, demonstrate empirical gains in leader value and runtime scalability over state-of-the-art methods.
Leader-follower general-sum stochastic games (LF-GSSGs) model sequential planning under asymmetric commitment, where a leader commits to a policy and a follower best responds. The solution concept of interest is the strong Stackelberg equilibrium (SSE), which assumes that the follower resolves ties in favour of the leader. LF-GSSGs arise in numerous applications requiring robust decision-making under strategic uncertainty, including adversarial patrolling, network security, infrastructure protection, and cyber-physical planning (Yin et al. 2012; An et al. 2012; Xin Jiang et al. 2013; Basilico, Nittis, and Gatti 2016; Basilico, Coniglio, and Gatti 2016; Chung, Hollinger, and Isler 2011).
Dynamic programming (DP) has proven foundational in planning under uncertainty for Markov decision processes (MDPs) (Bellman 1957; Puterman 1994) and zero-sum stochastic games (zs-SGs) (Shapley 1953; Horák and Bošanský 2019; Zheng, Jung, and Lin 2022; Horák et al. 2023), where recursive decompositions and Markovian policies enable tractable value computation. However, in LF-GSSGs, the asymmetry of commitment and the general-sum structure fundamentally alter the nature of planning: the follower's response may depend on the entire interaction history, and the leader must anticipate all such best responses. Subgame decomposability fails, standard state-based recursions no longer hold, and Markov policies are not sufficient for optimality (Vorobeychik and Singh 2012; López et al. 2022).
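For concreteness, the following minimal sketch (standard value iteration for a single-agent MDP, written in Python; it is not part of the framework proposed here) shows the state-based Bellman recursion referred to above, which ceases to be sound once commitment asymmetry, history-dependent follower responses, and general-sum payoffs enter the picture.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Classical Bellman recursion for a finite MDP (illustrative only).

    P[a][s, s'] : probability of reaching s' from s under action a
    R[a][s]     : expected immediate reward in state s under action a
    Returns the optimal value function and a greedy Markov policy.
    """
    n_states = P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        # One Bellman backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(len(P))])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new
```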
These limitations manifest even in simple deterministic environments. In the centipede game of Figure 1, backward induction predicts that rational agents will terminate the game immediately, despite the existence of more rewarding cooperative trajectories (Rosenthal 1981; Binmore 1987; McKelvey and Palfrey 1992; Megiddo 1986; Reny 1988; Kreps 2020). This outcome illustrates the failure of Bellman-style reasoning in the presence of social dilemmas, a limitation that only intensifies under stochastic transitions and asymmetric commitment. The computational landscape reflects these structural difficulties. While SSEs can be computed in polynomial time for normal-form games (Conitzer and Sandholm 2006), the problem becomes NP-hard with succinct action representations (Korzhyk, Conitzer, and Parr 2010), PSPACE-complete in STRIPS-like planning domains (Behnke and Steinmetz 2024), and even NEXPTIME-complete in multi-objective arenas (Bruyère et al. 2024). Existing methods either encode the problem as large mixed-integer programs (Vorobeychik and Singh 2012; Letchford and Conitzer 2010; Vorobeychik et al. 2014) or rely on extensive linear programs (Conitzer and Sandholm 2006), both of which scale poorly with the horizon or state space. Simplifying assumptions, such as myopic or omniscient followers (Denardo 1967; Whitt 1980) or stationary Markov policies (López et al. 2022), limit practical applicability and fail to preserve generality.
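To make the centipede-game example concrete, here is a small Python sketch of backward induction on a four-stage centipede game; the payoff numbers are an illustrative doubling variant of our own choosing, not the exact values of Figure 1.

```python
# Illustrative four-stage centipede game (payoffs are NOT those of Figure 1).
# take_payoffs[t] = (payoff to player 0, payoff to player 1) if the game is
# stopped at stage t; players alternate, with player 0 moving first.
take_payoffs = [(2, 0), (1, 4), (8, 2), (4, 16)]
pass_payoffs = (32, 8)          # reached only if both players always pass

def backward_induction(t=0):
    """Subgame-perfect outcome of the subgame starting at stage t."""
    if t == len(take_payoffs):
        return pass_payoffs
    mover = t % 2                            # whose turn it is at stage t
    continuation = backward_induction(t + 1)
    take = take_payoffs[t]
    # The mover compares its own payoff from stopping now vs. continuing.
    return take if take[mover] >= continuation[mover] else continuation

print(backward_induction())   # -> (2, 0): play stops immediately, even though
                              # (32, 8) is better for both players.
```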
This work introduces a new value-based dynamic programming framework for LF-GSSGs, grounded in a structural reduction to what we call a credible Markov decision process (credible MDP). In this reformulation, states correspond to credible sets: finite collections of occupancy states induced by a fixed leader policy and all rational follower responses. Each occupancy state is a distribution over joint histories of environment states and actions, induced by a follower response to a partial leader policy. Transitions between credible sets are deterministic and governed by the leader's decision rules, while accounting for all follower responses that are admissible under SSE semantics, that is, responses that could be extended into an optimal policy with leader-favourable tie-breaking, under an extension of the leader policy consistent with the current prefix.
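As a rough illustration of these objects (not the paper's implementation), the following Python sketch encodes an occupancy state as a mapping from joint histories to probabilities and shows how a credible set could be advanced under a fixed leader decision rule. It assumes deterministic decision rules and that the set of SSE-admissible follower responses has already been computed elsewhere; all names are our own.

```python
from itertools import product

# A joint history is a tuple of (state, leader_action, follower_action) triples
# followed by the current state; an occupancy state maps histories to probabilities.

def advance_occupancy(occ, leader_rule, follower_resp, transition):
    """One-step update of an occupancy state under a leader decision rule and a
    single follower response. `transition(s, aL, aF)` returns {next_state: prob};
    `leader_rule` / `follower_resp` map a history to an action (deterministic here)."""
    nxt = {}
    for hist, prob in occ.items():
        s = hist[-1]
        aL, aF = leader_rule(hist), follower_resp(hist)
        for s_next, q in transition(s, aL, aF).items():
            new_hist = hist[:-1] + ((s, aL, aF), s_next)
            nxt[new_hist] = nxt.get(new_hist, 0.0) + prob * q
    return nxt

def advance_credible_set(credible_set, leader_rule, admissible_responses, transition):
    """Deterministic credible-set transition: the leader's decision rule is fixed,
    and we collect one successor occupancy state per admissible follower response
    (SSE-admissibility of the responses is assumed to be checked elsewhere)."""
    return [advance_occupancy(occ, leader_rule, resp, transition)
            for occ, resp in product(credible_set, admissible_responses)]
```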
Contributions. The reduction is shown to be lossless: the optimal leader value is preserved without requiring explicit follower enumeration. To characterise computational hardness, it is further proven that synthesising an optimal memoryless deterministic leader policy in an LF-GSSG is NP-hard. Nonetheless, the value function over credible sets exhibits uniform continuity, enabling Bellman-style recursion and point-based approximation with provable guarantees on leader exploitability. The resulting dynamic programming algorithms achieve ε-optimality without full policy enumeration and scale to large-horizon settings. Empirical evaluations on mixed-motive benchmarks, including security games, resource allocation, and adversarial planning, demonstrate improvements in leader value and runtime over MILP-based and DP-inspired baselines.
This section formalises leader-follower general-sum stochastic games (LF-GSSGs), outlines the interaction model, and recalls key concepts from dynamic programming relevant to our framework. A summary of the notation used throughout the paper is provided in Appendix A.

Definition 1. A leader-follower general-sum stochastic game (LF-GSSG) is a tuple M ≐ (I, S, A^L, A^F, p,