Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation


We study online learning in two-player uninformed Markov games, where the opponent’s actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They therefore turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even when the opponent follows a fixed policy, so that $O(\sqrt{K})$ external regret is well known to be achievable, their result still gives only the worse $O(K^{2/3})$ rate on a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min\{\sqrt{K} + (CK)^{1/3}, \sqrt{LK}\})$ regret bound, where $C$ quantifies the variance of the opponent’s policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes – $O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case – but also smoothly interpolate between them by automatically adapting to the opponent’s non-stationarity. We achieve this by first providing a new analysis of the epoch-based V-learning algorithm of Mao et al. (2022), establishing an $O(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. We then show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.


💡 Research Summary

The paper tackles online learning in two‑player uninformed Markov games, where the learner cannot observe the opponent’s actions or policies. Prior work (Tian et al., 2021) showed that no‑external‑regret learning is impossible in this setting without an exponential dependence on the episode length $H$, prompting a shift to the weaker notion of Nash‑value regret (Nr). Their V‑learning‑type algorithm attains an $O(K^{2/3})$ regret bound, but this bound does not adapt to problem difficulty: even when the opponent follows a fixed policy—where $O(\sqrt{K})$ external regret is known to be achievable—their result remains $O(K^{2/3})$ on a weaker metric.

The authors address two fundamental questions: (1) Is there a stronger regret notion that interpolates between external regret and Nash‑value regret as the opponent’s non‑stationarity increases? (2) Can an efficient algorithm automatically adapt to this notion, achieving rates that smoothly transition between $O(\sqrt{K})$ and $O(K^{2/3})$?

Empirical Nash‑value regret (Enr).
They introduce Empirical Nash‑value regret (Enr), defined by restricting the min‑player (the opponent) at each state to the set of policies actually used by the opponent over the $K$ episodes. Formally, the empirical state Nash values $V^*_h(s)$ are computed by a max‑over‑learner min‑over‑empirical‑opponent game. Enr measures the cumulative gap between these empirical Nash values at initial states and the learner’s actual cumulative reward. Enr is strictly stronger than the previously used Nash‑value regret (Nr) and, crucially, collapses to standard external regret when the opponent’s policy is fixed. Thus, any bound on Enr simultaneously bounds Nr and, in the fixed‑opponent case, external regret.
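Based on the description above, the definition can be sketched in display form as follows (the hatted quantities and the empirical policy set $\widehat{\mathcal{N}}_h(s)$ are our reconstructed notation, not necessarily the paper’s exact formulation):

$$
\mathrm{Enr}(K) \;=\; \sum_{k=1}^{K} \widehat{V}^{*}_{1}(s_1) \;-\; \sum_{k=1}^{K} V_{1}^{\pi^k,\,\nu^k}(s_1),
\qquad
\widehat{V}^{*}_{h}(s) \;=\; \max_{\mu}\;\min_{\nu \in \widehat{\mathcal{N}}_h(s)} V_{h}^{\mu,\nu}(s),
$$

where $\widehat{\mathcal{N}}_h(s)$ is the set of policies the opponent actually used at state $s$ over the $K$ episodes, $\pi^k, \nu^k$ are the policies played in episode $k$, and $s_1$ is the initial state. When the opponent plays one fixed policy, the inner min is over a singleton, and Enr collapses to external regret against that policy.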

Epoch‑based V‑learning analysis.
The starting point is the epoch‑based V‑learning algorithm of Mao et al. (2022), which partitions visits to each (step, state) pair into geometrically growing epochs and runs an adversarial bandit subroutine within each epoch. The authors modify the algorithm slightly and provide a new analysis yielding the bound $O(\eta C + \sqrt{K/\eta})$, where $\eta$ is the epoch incremental factor and $C$ quantifies the variance of the opponent’s policies.
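The per‑(step, state) epoch structure described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors’ implementation: the exact epoch schedule (here epoch $i$ adds $\lceil \eta^i \rceil$ visits) and the Exp3‑style bandit subroutine are assumptions consistent with the description of geometrically growing epochs with a fresh adversarial bandit per epoch.

```python
import math
import random


class EpochBandit:
    """Sketch of the epoch structure attached to one (step, state) pair:
    visits are split into geometrically growing epochs, and an Exp3-style
    adversarial bandit is restarted at the start of each epoch."""

    def __init__(self, n_actions, eta=2.0, lr=0.1):
        self.n, self.eta, self.lr = n_actions, eta, lr
        self.visits = 0       # total visits to this (step, state) pair
        self.epoch = 0        # index of the current epoch
        self.epoch_end = 1    # visit count at which the current epoch ends
        self._reset_weights()

    def _reset_weights(self):
        # Fresh bandit subroutine at the start of every epoch.
        self.w = [1.0] * self.n

    def _probs(self):
        z = sum(self.w)
        return [wi / z for wi in self.w]

    def act(self):
        # Sample an action from the current exponential-weights distribution.
        return random.choices(range(self.n), weights=self._probs())[0]

    def update(self, action, reward):
        # Exp3-style importance-weighted update (reward assumed in [0, 1]).
        p = self._probs()
        est = reward / p[action]
        self.w[action] *= math.exp(self.lr * est / self.n)
        self.visits += 1
        if self.visits >= self.epoch_end:
            # Epoch boundary: grow the next epoch geometrically (factor eta)
            # and restart the bandit subroutine.
            self.epoch += 1
            self.epoch_end += int(math.ceil(self.eta ** self.epoch))
            self._reset_weights()
```

With `eta=2.0`, epoch boundaries fall after visits 1, 3, 7, 15, …, so each restart roughly doubles the epoch length; larger `eta` means longer epochs, matching the $O(\eta C + \sqrt{K/\eta})$ trade-off between adapting to the opponent’s variance and bandit estimation error.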

