Adversarial Learning in Games with Bandit Feedback: Logarithmic Pure-Strategy Maximin Regret
Learning to play zero-sum games is a fundamental problem in game theory and machine learning. While significant progress has been made in minimizing external regret in the self-play setting or with full-information feedback, real-world applications often force learners to play against unknown, arbitrary opponents and restrict them to bandit feedback, where only the payoff of the realized action is observable. In such challenging settings, it is well known that $\Omega(\sqrt{T})$ external regret is unavoidable (where $T$ is the number of rounds). To overcome this barrier, we investigate adversarial learning in zero-sum games under bandit feedback, aiming to minimize the deficit against the maximin pure strategy, a metric we term Pure-Strategy Maximin Regret. We analyze this problem under two bandit feedback models: uninformed (only the realized reward is revealed) and informed (both the reward and the opponent's action are revealed). For uninformed bandit learning of normal-form games, we show that the Tsallis-INF algorithm achieves $O(c \log T)$ instance-dependent regret with a game-dependent parameter $c$. Crucially, we prove an information-theoretic lower bound showing that the dependence on $c$ is necessary. To overcome this hardness, we turn to the informed setting and introduce Maximin-UCB, which obtains a regret bound of the form $O(c' \log T)$ for a different game-dependent parameter $c'$ that can be much smaller than $c$. Finally, we generalize both results to bilinear games over arbitrary, potentially large action sets, proposing Tsallis-FTRL-SPM and Maximin-LinUCB for the uninformed and informed settings, respectively, and establishing similar game-dependent logarithmic regret bounds.
💡 Research Summary
The paper tackles the problem of learning to play zero‑sum games against an arbitrary, possibly adaptive opponent when the learner receives only bandit feedback. Classical external regret in this setting is known to be lower‑bounded by Ω(√T), making it impossible to achieve sub‑√T guarantees in the worst case. To bypass this barrier, the authors introduce a new performance measure called Pure‑Strategy Maximin Regret (PSMR). PSMR measures the cumulative shortfall of the learner's payoff relative to the pure‑strategy maximin value v* = max_x min_y u(x,y), i.e., the best guaranteed payoff a deterministic player can secure if the game were known. When the game possesses a pure‑strategy Nash equilibrium (PSNE), v* coincides with the Nash value, so PSMR reduces to the previously studied Nash‑value regret; otherwise it is a weaker but still meaningful safety metric.
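The pure-strategy maximin value is straightforward to compute when the payoff matrix is known. A minimal sketch, using a hypothetical 3×3 payoff matrix for illustration (the matrix is not from the paper):

```python
import numpy as np

# Hypothetical zero-sum payoff matrix for the row player (learner).
U = np.array([
    [0.6, 0.2, 0.3],
    [0.1, 0.7, 0.4],
    [0.5, 0.5, 0.5],
])

# Pure-strategy maximin value v* = max_x min_y u(x, y): the best payoff the
# row player can guarantee with a single deterministic row.
row_guarantees = U.min(axis=1)      # worst-case payoff of each pure row
v_star = row_guarantees.max()
x_star = int(row_guarantees.argmax())

print(f"v* = {v_star}, achieved by row {x_star}")  # here row 2 guarantees 0.5
```

PSMR then compares the learner's cumulative payoff against T·v*, rather than against the best fixed action in hindsight as external regret does.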
Two feedback models are considered. In the uninformed model the learner observes only the scalar reward r_t after each round, which is the standard adversarial bandit setting. In the informed model the learner also observes the opponent’s action y_t, a richer signal used in prior works on bandit learning in games.
Uninformed setting – Tsallis‑INF
The authors apply the Tsallis‑INF algorithm (Zimmert & Seldin, 2021) to finite normal‑form games under the uninformed model. Tsallis‑INF is an FTRL method that uses the negative Tsallis entropy (parameter α∈(0,1)) as regularizer, importance‑weighted reward estimators, and a time‑varying learning rate η_t = 1/(2√t). While previous logarithmic regret results for Tsallis‑INF required a stationary or self‑play environment, the paper shows that the algorithm still yields instance‑dependent logarithmic PSMR against any adaptive adversary.
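The ingredients above (Tsallis-entropy FTRL, importance weighting, η_t = 1/(2√t)) can be sketched as follows. This is an illustrative reconstruction of the standard Tsallis-INF recipe, not the authors' code; the Newton solve for the normalizing multiplier is the usual way to implement the FTRL step with the α = 1/2 Tsallis regularizer:

```python
import numpy as np

def tsallis_inf_weights(L_hat, eta, iters=50):
    """FTRL step with the negative Tsallis entropy (alpha = 1/2) regularizer.
    The minimizer has the form w_i = (eta * (L_hat_i + lam))**-2, with the
    Lagrange multiplier lam chosen so the weights sum to one; Newton's method
    on that normalization constraint finds lam."""
    lam = -L_hat.min() + 1.0 / eta            # start with all terms positive
    for _ in range(iters):
        inv = 1.0 / (eta * (L_hat + lam))
        grad = -2.0 * eta * (inv ** 3).sum()  # d/d(lam) of sum of weights
        lam -= ((inv ** 2).sum() - 1.0) / grad
    w = 1.0 / (eta * (L_hat + lam)) ** 2
    return w / w.sum()                        # guard against residual error

def play_uninformed(U, opponent, T, rng):
    """Minimal uninformed-bandit loop: only the realized payoff is observed."""
    K = U.shape[0]
    L_hat = np.zeros(K)                  # cumulative importance-weighted losses
    total = 0.0
    for t in range(1, T + 1):
        eta = 1.0 / (2.0 * np.sqrt(t))   # time-varying rate from the summary
        w = tsallis_inf_weights(L_hat, eta)
        a = rng.choice(K, p=w)
        r = U[a, opponent(t)]            # opponent may be arbitrary / adaptive
        L_hat[a] += (1.0 - r) / w[a]     # importance-weighted loss estimate
        total += r
    return total
```

The key point of the analysis is that this update, unchanged, yields logarithmic PSMR even though the opponent's column sequence is adversarial.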
If a strict PSNE (x*, y*) exists, define row‑wise gaps Δ_r(x)=u(x*,y*)−u(x,y*) and column‑wise gaps Δ_c(y)=u(x*,y)−u(x*,y*). Let Δ_r^min and Δ_c^min be the smallest positive gaps. The authors prove
PSMR_T = O( (1/Δ_c^min) · ∑_{x≠x*} (log T)/Δ_r(x) ),
which can be written as O(c log T) where the constant c depends only on the game’s gap structure, not on the opponent.
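The gap quantities, and hence the constant c in this bound, can be read off directly from the payoff matrix. A small worked example on a hypothetical matrix with a strict saddle point at (x*, y*) = (0, 0) (the matrix and the exact form of c, taken as (1/Δ_c^min)·Σ_{x≠x*} 1/Δ_r(x), are illustrative):

```python
import numpy as np

# Hypothetical payoff matrix with a strict PSNE at (x*, y*) = (0, 0).
U = np.array([
    [0.5, 0.8, 0.9],
    [0.2, 0.6, 0.1],
    [0.3, 0.4, 0.7],
])
x_star, y_star = 0, 0

# Strict saddle check: u(x*,y*) is the strict max of its column
# and the strict min of its row.
assert U[x_star, y_star] > np.delete(U[:, y_star], x_star).max()
assert U[x_star, y_star] < np.delete(U[x_star, :], y_star).min()

# Row gaps Delta_r(x) = u(x*,y*) - u(x,y*) for x != x*
delta_r = U[x_star, y_star] - np.delete(U[:, y_star], x_star)
# Column gaps Delta_c(y) = u(x*,y) - u(x*,y*) for y != y*
delta_c = np.delete(U[x_star, :], y_star) - U[x_star, y_star]

# Game-dependent constant in the O(c log T) bound (illustrative form).
c = (1.0 / delta_c.min()) * (1.0 / delta_r).sum()
print(delta_r, delta_c, c)
```

Note that c depends only on the matrix, so the guarantee is uniform over opponents.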
If no PSNE exists, the Nash value v_Nash exceeds v* by Δ_mix > 0. In this case the bound becomes
PSMR_T = O(m_x Δ_mix),
a T‑independent guarantee. An information‑theoretic lower bound is also proved, showing that the dependence on the game‑specific constant c is unavoidable and thereby establishing the optimality of the result.
Informed setting – Maximin‑UCB
When the opponent’s action is observable, the learner can construct confidence bounds for each pure row. The proposed Maximin‑UCB algorithm maintains an upper confidence bound U_i(t) for each row i based on importance‑weighted estimates and the number of times the row has been played. At each round the algorithm selects the row with the highest U_i(t). The analysis yields
PSMR_T = O(c′ log T),
where c′ is another game‑dependent constant that may be substantially smaller than c (e.g., involving products of the row and column gaps). Thus, the informed feedback model can lead to tighter constants while preserving logarithmic dependence on T.
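The optimistic rule described above can be sketched as follows. This is an illustrative reconstruction built from standard UCB bonuses, not necessarily the paper's exact algorithm: since the informed model reveals the opponent's column y_t alongside the payoff, the learner can refine entry-wise payoff estimates and play the row whose optimistic worst-case value is largest.

```python
import math
import numpy as np

def maximin_ucb(U, opponent, T):
    """Informed-model sketch: observe (r_t, y_t) each round, keep per-entry
    payoff estimates, and play the row maximizing an optimistic bound on
    min_y u(row, y)."""
    K, M = U.shape
    sums = np.zeros((K, M))       # cumulative observed payoff per entry
    counts = np.zeros((K, M))     # visits per entry
    total = 0.0
    for t in range(1, T + 1):
        means = np.divide(sums, counts, out=np.zeros_like(sums),
                          where=counts > 0)
        with np.errstate(divide="ignore"):
            bonus = np.sqrt(2.0 * math.log(t + 1) / counts)  # inf if unvisited
        ucb = np.minimum(means + bonus, 1.0)   # payoffs assumed in [0, 1]
        a = int(ucb.min(axis=1).argmax())      # optimistic worst case per row
        y = opponent(t)
        r = U[a, y]
        sums[a, y] += r
        counts[a, y] += 1
        total += r
    return total, counts.sum(axis=1)

# Toy run: a stubborn opponent who always plays column 0 of a matrix whose
# maximin row is row 0; the learner should concentrate its plays there.
U = np.array([[0.5, 0.8, 0.9],
              [0.2, 0.6, 0.1],
              [0.3, 0.4, 0.7]])
total, plays = maximin_ucb(U, lambda t: 0, T=2000)
```

Because suboptimal rows are eliminated once their confidence bonus falls below their gap, the number of bad plays grows only logarithmically in T, mirroring the O(c′ log T) guarantee.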
Extension to bilinear games
The paper generalizes both approaches to bilinear games with potentially huge or continuous action sets. For the uninformed model, Tsallis‑FTRL‑SPM extends Tsallis‑INF by employing sampling‑based approximations of the FTRL update, preserving the O(c log T) guarantee without explicit dependence on the number of actions. For the informed model, Maximin‑LinUCB adapts linear‑bandit confidence bounds to the bilinear setting, again achieving O(c′ log T) regret with no dependence on the cardinalities of the action spaces. These extensions demonstrate that the logarithmic, game‑dependent bounds scale to high‑dimensional settings.
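One way to see why informed bilinear feedback is tractable without enumerating actions: each observation (x_t, y_t, r_t) with r_t = x_tᵀAy_t is a linear measurement of vec(A) under the feature φ_t = x_t ⊗ y_t, so the payoff matrix can be estimated by ridge regression, which is the backbone of LinUCB-style confidence bounds. The sketch below shows only this estimation step (dimensions, the sampling scheme, and the regularizer are illustrative assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(2)

d1, d2 = 3, 4
A = rng.uniform(0.0, 1.0, size=(d1, d2))    # unknown payoff matrix, u(x,y) = x^T A y

lam = 1e-3
d = d1 * d2
V = lam * np.eye(d)                          # regularized Gram matrix
b = np.zeros(d)
for _ in range(500):
    x = rng.dirichlet(np.ones(d1))           # learner's mixed action (simplex)
    y = rng.dirichlet(np.ones(d2))           # opponent's action (arbitrary here)
    phi = np.kron(x, y)                      # feature of the bilinear payoff
    r = x @ A @ y                            # informed feedback: payoff observed
    V += np.outer(phi, phi)
    b += r * phi

# Ridge estimate of vec(A), reshaped back into matrix form.
A_hat = np.linalg.solve(V, b).reshape(d1, d2)
print(np.abs(A_hat - A).max())               # small after enough observations
```

A confidence ellipsoid around A_hat (via V) then supports the same optimistic maximin selection as in the finite case, with regret constants independent of the action-set cardinality.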
Related work and contributions
Prior work achieved logarithmic external or swap regret in self‑play with full‑information feedback, and recent studies obtained logarithmic Nash‑value regret for 2×2 games under bandit feedback. This paper’s contributions are:
- Introduction of PSMR, a safety‑oriented regret notion that aligns with Nash‑value regret when a PSNE exists.
- Demonstration that Tsallis‑INF attains instance‑dependent logarithmic PSMR in the uninformed model, together with a matching lower bound.
- Design of Maximin‑UCB for the informed model, achieving potentially smaller game‑dependent constants.
- Extension of both algorithms to general bilinear games, removing dependence on action‑set size.
- Comprehensive theoretical analysis that unifies bandit learning, game theory, and information‑theoretic optimality.
Practical implications
PSMR is particularly relevant in safety‑critical domains (e.g., security, finance, autonomous systems) where guaranteeing a minimum payoff against any opponent is more important than matching the best fixed action. Logarithmic dependence on the horizon T ensures that performance does not degrade over long deployments. Moreover, the constants c and c′ can be estimated from a preliminary analysis of the game matrix, allowing practitioners to assess worst‑case guarantees before deployment. The informed model’s advantage suggests that whenever opponent actions can be observed (e.g., in network routing or market trading), employing Maximin‑UCB can substantially reduce the regret constant.
Conclusion
The authors provide the first logarithmic‑regret guarantees for learning against arbitrary opponents under bandit feedback when the performance metric is the pure‑strategy maximin value. By leveraging Tsallis‑based FTRL in the uninformed setting and a UCB‑style confidence approach in the informed setting, they achieve optimal, game‑dependent bounds and extend them to large‑scale bilinear games. This work bridges a gap between adversarial bandit theory and game‑theoretic learning, offering both theoretical insight and practical algorithms for safe, efficient learning in competitive environments.