Efficient Uncoupled Learning Dynamics with $\tilde{O}\!\left(T^{-1/4}\right)$ Last-Iterate Convergence in Bilinear Saddle-Point Problems over Convex Sets under Bandit Feedback
Arnab Maiti*¹, Claire Jie Zhang*¹, Kevin Jamieson¹, Jamie Heather Morgenstern¹, Ioannis Panageas², Lillian J. Ratliff¹
¹University of Washington   ²University of California, Irvine

Abstract

In this paper, we study last-iterate convergence of learning algorithms in bilinear saddle-point problems, a preferable notion of convergence that captures the day-to-day behavior of learning dynamics. We focus on the challenging setting where players select actions from compact convex sets and receive only bandit feedback. Our main contribution is the design of an uncoupled learning algorithm that guarantees last-iterate convergence to the Nash equilibrium with high probability. We establish a convergence rate of $\tilde{O}(T^{-1/4})$ up to polynomial factors in problem parameters. Crucially, our proposed algorithm is computationally efficient, requiring only an efficient linear optimization oracle over the players' compact action sets. The algorithm is obtained by combining techniques from experimental design and the classic Follow-The-Regularized-Leader (FTRL) framework, with a carefully chosen regularizer function tailored to the geometry of the action set of each learner.

1 INTRODUCTION

Online learning in games is a well-studied area (Anagnostides et al. (2022a); Chen and Peng (2020); Daskalakis et al. (2021); Syrgkanis et al. (2015)) that investigates the convergence properties of learning algorithms in game-theoretic settings. This line of research has been instrumental in developing superhuman AI agents for competitive environments such as Go (Silver et al., 2017), Poker (Brown and Sandholm, 2018) and Diplomacy (Meta Fundamental AI Research Diplomacy Team (FAIR) et al., 2022). It is well known that standard algorithms such as Follow-the-Regularized-Leader (FTRL) and Mirror Descent converge to a Nash equilibrium in the average-iterate sense under self-play. In other words, while individual strategies may remain far from equilibrium, their average converges to the Nash equilibrium. The seminal works of Daskalakis et al. (2011) and Rakhlin and Sridharan (2013) further strengthened this understanding by establishing near-optimal convergence rates in the average-iterate sense. However, works by Bailey and Piliouras (2018); Mertikopoulos et al. (2018) showed that many standard algorithms that succeed in the average-iterate sense fail to converge in the last-iterate sense, which is often more desirable in practice as it reflects the day-to-day behavior of the learners.

Motivated by this negative result, a new line of work has focused on designing uncoupled learning algorithms that achieve last-iterate convergence to a Nash equilibrium in self-play (Cai et al. (2022, 2024); Daskalakis and Panageas (2018)). In particular, optimistic variants of classical algorithms have been shown to exhibit last-iterate convergence under the gradient feedback setting (Daskalakis et al. (2017); Liang and Stokes (2019); Wei et al. (2020)). Moreover, Wei et al. (2020) established last-iterate linear convergence for bilinear games with polytope action sets under gradient feedback.
Algorithms under bandit feedback, where only the payoff of the chosen action is observed in contrast to the richer gradient feedback, form a well-studied area in the multi-armed bandits literature due to their practical relevance (Auer et al. (2002); Bubeck et al. (2012); Neu (2015); Zimmert and Lattimore (2022)), but results on last-iterate convergence remain relatively sparse. Under a variant of the standard bandit feedback model, Cai et al. (2023) first showed convergence to a Nash equilibrium with high probability at a rate of $T^{-1/8}$ in matrix games, later improved to $T^{-1/5}$ in Cai et al. (2025).

While classical game theory has deep roots in discrete actions, many modern strategic interactions are inherently continuous. Players often select from a continuum of strategies rather than a finite list, as in applications such as algorithmic pricing, resource allocation, routing, and multi-agent robotics (Besbes and Zeevi, 2009; Den Boer, 2015; Krichene et al., 2015). Related ideas also appear in the alignment of language models (Munos et al., 2023). These settings are formally captured by compact convex action sets, for which no high-probability last-iterate guarantees under standard bandit feedback are currently known. The only established result under the standard bandit feedback model is due to Dong et al. (2024), who proposed an uncoupled learning algorithm whose iterates converge to a Nash equilibrium only in expectation at a rate of $T^{-1/6}$. This gap motivates our central question:

Given a bilinear function with compact and convex action sets, does there exist an uncoupled learning algorithm whose iterates converge to a Nash equilibrium with high probability in the self-play setting, under bandit feedback?

1.1 Problem Setting

In this paper, we answer the above question in the affirmative. To this end, we formalize the setting of last-iterate convergence in bilinear saddle-point problems under bandit feedback. Let $\mathcal{X} \subset \mathbb{R}^n$ and $\mathcal{Y} \subset \mathbb{R}^m$ be compact, convex sets, and let $A \in \mathbb{R}^{n \times m}$ be an input matrix. We assume $\mathrm{span}(\mathcal{X}) = \mathbb{R}^n$ and $\mathrm{span}(\mathcal{Y}) = \mathbb{R}^m$. For simplicity of presentation, throughout this paper we also assume that $\langle x, Ay \rangle \in [-1, 1]$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. In each round $k$, the row player selects $x_k \in \mathcal{X}$ and the column player selects $y_k \in \mathcal{Y}$. They then receive standard bandit feedback in the form of $\langle x_k, Ay_k \rangle$ and $-\langle x_k, Ay_k \rangle$, respectively. A variant of this feedback was studied by Cai et al. (2022, 2023) for probability simplices, where $i_k \sim x_k$ and $j_k \sim y_k$ are sampled, and the players observe only $A_{i_k, j_k}$ and $-A_{i_k, j_k}$. Even when $\mathcal{X}$ and $\mathcal{Y}$ are probability simplices, the two feedback types are fundamentally different, and the results are not directly comparable.

We focus on uncoupled learning algorithms, which operate entirely on a player's own action set and make no assumptions about the opponent: they do not observe the opponent's actions, do not know the opponent's action set, and do not even know the dimension of the opponent's action set. The goal is to design such algorithms for both players under standard bandit feedback so that the pair $(x_k, y_k)$ forms an $\varepsilon_k$-approximate Nash equilibrium (last-iterate convergence) with high probability, where $\varepsilon_k$ depends polynomially on $n$ and $m$ and satisfies $\lim_{k \to \infty} \varepsilon_k = 0$.

Recall that a pair $(\tilde{x}, \tilde{y})$ is an $\varepsilon$-approximate Nash equilibrium if, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,
$$\langle x, A\tilde{y} \rangle - \varepsilon \le \langle \tilde{x}, A\tilde{y} \rangle \le \langle \tilde{x}, Ay \rangle + \varepsilon.$$
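To make this target notion concrete, here is a minimal sketch of ours (not from the paper) that checks the duality gap of a candidate pair when the action sets are polytopes given by explicit vertex lists; vertex enumeration stands in for the linear optimization oracle that the paper assumes, and all names and values are illustrative.

```python
import numpy as np

def duality_gap(A, x_tilde, y_tilde, X_vertices, Y_vertices):
    """Duality gap of (x_tilde, y_tilde) for the bilinear game <x, A y>.

    A linear function over a polytope attains its optimum at a vertex, so
    enumerating vertices implements the best responses
    max_x <x, A y_tilde> and min_y <x_tilde, A y>.
    The gap is the sum of both players' deviation benefits, so a gap of at
    most eps certifies an eps-approximate Nash equilibrium.
    """
    best_row = max(float(x @ (A @ y_tilde)) for x in X_vertices)
    best_col = min(float(x_tilde @ (A @ y)) for y in Y_vertices)
    return best_row - best_col

# Example: matching pennies over the simplex; the uniform pair has gap 0.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
simplex = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(duality_gap(A, np.array([0.5, 0.5]), np.array([0.5, 0.5]), simplex, simplex))
```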
If only $(\mathbb{E}[x_k], \mathbb{E}[y_k])$ can be shown to form an $\varepsilon_k$-approximate Nash equilibrium, then the convergence is said to hold only in expectation, as in Dong et al. (2024). Such convergence guarantees are weaker, since they do not ensure convergence along a single trajectory and may require multiple runs to learn an equilibrium, which is often undesirable in practice.

1.2 Contributions

In this paper, we design the first uncoupled learning dynamics whose iterates exhibit last-iterate convergence with high probability under standard bandit feedback for bilinear saddle-point problems over convex sets. Formally, we construct uncoupled learning dynamics such that the pair $(x_k, y_k)$ forms an $\varepsilon_k$-approximate Nash equilibrium with probability at least $1 - \delta$, where $\varepsilon_k = \mathrm{poly}(n, m, \log(k/\delta)) \cdot k^{-1/4}$. This result also improves upon the $k^{-1/6}$ convergence rate of Dong et al. (2024), who established last-iterate convergence only in expectation. Moreover, if the action sets admit efficient linear optimization oracles, our dynamics can be implemented in polynomial time.

Our approach builds on the average-to-last-iterate framework recently introduced by Cai et al. (2025). The key challenge in adapting this framework to our setting is that, unlike Cai et al. (2025), who worked with probability simplices under a variant of bandit feedback where estimating rewards is relatively straightforward and the negative entropy regularizer is a natural choice (well aligned with the $(\|\cdot\|_1, \|\cdot\|_\infty)$ primal-dual pair), we consider arbitrary compact convex sets under standard bandit feedback. This makes reward estimation significantly more challenging, since strategies must remain approximate equilibria and we cannot freely explore suboptimal strategies. We address this difficulty through a carefully designed sampling procedure that leverages experimental design techniques from linear bandits.

In addition, we construct regularizers tailored to the geometry of the action sets. This is necessary to ensure that the regularizer is compatible with the norms naturally arising from our estimation guarantees. Finally, all of these steps must be carried out while preserving computational efficiency whenever the action sets admit efficient linear optimization oracles, which is ensured by our algorithm.

1.3 Related Work

Muthukumar et al. (2020) ruled out last-iterate convergence for certain well-known classes of uncoupled learning dynamics that would have served as natural candidates under the bandit feedback setting. The first results on last-iterate convergence rates for uncoupled learning dynamics in two-player zero-sum games under a variant of standard bandit feedback were presented by Cai et al. (2023). They showed that simple uncoupled dynamics based on mirror descent with KL-divergence, combined with carefully chosen subsets of action sets and suitable loss estimators, achieve a last-iterate convergence rate of $\tilde{O}(T^{-1/8})$ with high probability and $\tilde{O}(T^{-1/6})$ in expectation. In the same work, they also generalized their results to Markov games. The high-probability rate was later improved to $\tilde{O}(T^{-1/5})$ by Cai et al.
(2025), while a concurrent work by Fiegel et al. (2025) established a lower bound of $\Omega(T^{-1/3})$ for this setting. Recently, Chen et al. (2023, 2024) proposed smoothed best-response dynamics for two-player zero-sum stochastic games.

In the bilinear setting under standard bandit feedback, Dong et al. (2024) introduced mirror descent based uncoupled dynamics with appropriate gradient estimators, achieving an $O(T^{-1/6})$ last-iterate convergence rate, though only in expectation. In a broader class of monotone games, Tatarenko and Kamgarpour (2019) established asymptotic last-iterate convergence to Nash equilibrium, albeit without finite-time guarantees.

The literature on learning in games is extensive. Here, we primarily focused on works concerning last-iterate convergence under bandit feedback in two-player zero-sum games. For results on average-iterate convergence, we refer the reader to Anagnostides et al. (2022a); Chen and Peng (2020); Daskalakis et al. (2011, 2021); Rakhlin and Sridharan (2013); Syrgkanis et al. (2015) and the references therein. For results on last-iterate convergence under gradient feedback, see Abe et al. (2024); Anagnostides et al. (2022b); Cai et al. (2025); Daskalakis and Panageas (2018); Daskalakis et al. (2017); Liang and Stokes (2019); Wei et al. (2020) and the references therein. For other conditions such as strict equilibria and strong monotonicity, we refer the reader to Ba et al. (2025); Giannou et al. (2021); Jordan et al. (2025) and the references therein.

2 ALGORITHM WITH LAST-ITERATE CONVERGENCE

Recently, Cai et al. (2025) introduced a framework for zero-sum games over probability simplices that transforms uncoupled dynamics with average-iterate convergence guarantees into ones with last-iterate convergence guarantees. The framework runs an average-iterate algorithm over multiple phases, where a phase $t$ consists of $B_t$ rounds. In a phase $t$, if the average-iterate algorithm outputs $\tilde{x}_t$, the framework plays strategies $x_k$ close to $\bar{x}_t := \frac{1}{t}\sum_{s=1}^{t} \tilde{x}_s$ for each round $k$ in that phase. These strategies are then used to estimate $A\hat{y}_t$, where $\hat{y}_t$ denotes the expected strategy of the other player in phase $t$. This estimate defines a phase utility vector that, when fed back into the average-iterate algorithm, drives $\bar{x}_t$ toward equilibrium. Since the framework plays strategies near $\bar{x}_t$, last-iterate convergence is achieved.

We adapt this framework to our setting in order to achieve last-iterate convergence, with details given in Algorithm 1. The main challenge lies in sampling in each phase so as to estimate $A\hat{y}_t$ accurately with respect to a suitable dual norm, and in constructing a regularizer that is strongly convex with respect to a corresponding primal norm while maintaining low Bregman divergence. Our approaches to these challenges are presented in Sections 2.1 and 2.2, which form the core technical contributions of this paper. We state our main result in Section 2.3.

2.1 Sampling Method and Estimator

Analogous to the row player's algorithm, we can describe an algorithm for the column player, where in the $s$-th round of the $t$-th phase it selects $y_{t,s}$.
Denote the row player's true expected utility vector as $\bar{\theta}^x_t := A\hat{y}_t$, where
$$\hat{y}_t := \mathbb{E}[y_{t,s}] = \frac{1}{2}\left((1 - \lambda_t)\bar{y}_t + \lambda_t \mathbb{E}_{z \sim \mathcal{D}_\mathcal{Y}}[z]\right) + \frac{1}{2}\bar{y}_t$$
is the column player's averaged strategy in phase $t$, and $\mathcal{D}_\mathcal{Y}$ is the exploration distribution of the column player. In this section, we describe how to construct an estimator $\hat{\theta}^x_t$ of the utility vector $\bar{\theta}^x_t$ and establish meaningful concentration guarantees. An analogous estimator can be constructed for the column player.

Algorithm 1: Last-iterate algorithm for the row player under bandit feedback

Input: probability error term $\delta \in (0, \frac{1}{2}]$, step size $\eta > 0$, batch size $B_t \leftarrow \log(8t^2/\delta) \cdot t^3$, mixing parameter $\lambda_t \leftarrow t^{-2}$, exploration distribution $\mathcal{D}_\mathcal{X}$ over $\mathcal{X}$.
Initialization: round counter $k \leftarrow 1$; $\hat{\theta}^x_0 \leftarrow 0$; $\tilde{x}_1 \leftarrow \mathbb{E}_{x \sim \mathcal{D}_\mathcal{X}}[x]$.
1. For phase $t = 1, 2, \ldots$:
2.   Compute the running average $\bar{x}_t \leftarrow \frac{1}{t}\sum_{\ell=1}^{t} \tilde{x}_\ell$.
3.   For $s = 1, 2, \ldots, B_t$:
4.     With probability $1/2$, set $x_{t,s} \leftarrow \bar{x}_t$.
5.     With probability $1/2$, sample $z_{t,s} \sim \mathcal{D}_\mathcal{X}$ and set $x_{t,s} \leftarrow (1 - \lambda_t)\bar{x}_t + \lambda_t z_{t,s}$.
6.     Play strategy $x_k = x_{t,s}$, observe reward $r_{t,s}$, and update $k \leftarrow k + 1$.
7.   Construct the estimate $\hat{\theta}^x_t$ of the mean reward vector $\bar{\theta}^x_t$ from $(x_{t,s}, r_{t,s})$ as described in Section 2.1.
8.   Estimate the phase utility: $\hat{u}^x_t \leftarrow t \cdot \hat{\theta}^x_t - (t-1) \cdot \hat{\theta}^x_{t-1}$.
9.   Update via OFTRL with the regularizer $\phi(x)$ from Section 2.2:
$$\tilde{x}_{t+1} \leftarrow \arg\max_{x \in \mathcal{X}} \left\{ \left\langle x, \sum_{\ell=1}^{t} \hat{u}^x_\ell + \hat{u}^x_t \right\rangle - \frac{1}{\eta}\phi(x) \right\}.$$

We now begin the construction. Recall that in each round $s$ of phase $t$, the row player plays $x_{t,s} \in \mathcal{X}$ and receives the reward $r_{t,s} = \langle x_{t,s}, Ay_{t,s} \rangle$. We can decompose this reward as
$$r_{t,s} = \langle x_{t,s}, A\hat{y}_t \rangle + \langle x_{t,s}, A(y_{t,s} - \hat{y}_t) \rangle.$$
This yields a linear model, where the second term $\langle x_{t,s}, A(y_{t,s} - \hat{y}_t) \rangle$ is zero-mean $4\lambda_t^2$-subgaussian noise in each phase $t$ (which we show in Appendix A). For simplicity of exposition, assume that in each phase $t$, half of the $B_t$ rounds use $x_{t,s} = \bar{x}_t$; denote these indices by $\{s_1, s_2, \ldots, s_{B_t/2}\}$. In the remaining $B_t/2$ rounds, we set $x_{t,s} \leftarrow (1 - \lambda_t)\bar{x}_t + \lambda_t z_{t,s}$ with $z_{t,s} \sim \mathcal{D}_\mathcal{X}$, and denote the corresponding indices by $\{s'_1, s'_2, \ldots, s'_{B_t/2}\}$. We address all the other possible cases in Appendix A.1. We now form pairs $(s_i, s'_i)$ such that $x_{t,s'_i} = (1 - \lambda_t)x_{t,s_i} + \lambda_t z_{t,s'_i}$. Now consider the transformed reward
$$\hat{r}_{t,s'_i} := \frac{r_{t,s'_i} - (1 - \lambda_t) r_{t,s_i}}{\lambda_t} = \langle z_{t,s'_i}, \bar{\theta}^x_t \rangle + \hat{\eta}_{t,s'_i},$$
where $\hat{\eta}_{t,s'_i}$ is zero-mean $8$-subgaussian noise. Thus, from the pairs $(z_{t,s'_i}, \hat{r}_{t,s'_i})$, we obtain an unbiased estimator of $\bar{\theta}^x_t$:
$$\hat{\theta}^x_t = \left( \sum_{i=1}^{B_t/2} z_{t,s'_i} z_{t,s'_i}^\top \right)^{-1} \sum_{i=1}^{B_t/2} \hat{r}_{t,s'_i} z_{t,s'_i}.$$
Finally, if the vectors $z_{t,s'_i}$ are sampled from an exploration distribution $\mathcal{D}_\mathcal{X}$ that is uniform over a subset $S := \{x_1, \ldots, x_n\} \subset \mathcal{X}$ satisfying
$$\sup_{x \in \mathcal{X}} x^\top V^{-1} x \le 2n^2, \qquad V := \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top \succ 0,$$
then we obtain meaningful concentration guarantees, which are used in our analysis. We formalize this in the following lemma, with the proof provided in Appendix A.2.

Lemma 2.1 (Estimator concentration bound). The estimator $\hat{\theta}^x_t$ constructed in each phase $t$ satisfies
$$\Pr\left( \sup_{x \in \mathcal{X}} |\langle x, \hat{\theta}^x_t - \bar{\theta}^x_t \rangle| \le 48\sqrt{\frac{n^3}{t^3}} \right) \ge 1 - \frac{\delta}{4t^2}.$$
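As a concrete illustration of the transformation above, the following minimal NumPy sketch (ours, not from the paper) forms the transformed rewards from paired anchor/exploration rounds and solves the resulting least-squares problem; the synthetic data at the end is purely hypothetical.

```python
import numpy as np

def phase_estimator(z, r_explore, r_anchor, lam):
    """Unbiased least-squares estimate of theta from one phase's paired rounds.

    z:         (N, n) exploration points z_{t,s'_i}
    r_explore: (N,) rewards at the mixed points (1-lam)*x_bar + lam*z_i
    r_anchor:  (N,) rewards at the anchor x_bar in the paired rounds
    lam:       mixing parameter lambda_t

    The transformed reward (r' - (1-lam)*r) / lam has mean <z_i, theta>,
    so ordinary least squares on the pairs (z_i, r_hat_i) estimates theta.
    """
    r_hat = (r_explore - (1.0 - lam) * r_anchor) / lam
    V = z.T @ z                          # sum_i z_i z_i^T (assumed invertible)
    return np.linalg.solve(V, z.T @ r_hat)

# Sanity check on synthetic data (hypothetical theta and noise levels).
rng = np.random.default_rng(0)
n, N, lam = 3, 5000, 0.1
theta = np.array([0.2, -0.5, 0.3])
z = rng.standard_normal((N, n))
r_anchor = rng.normal(0.0, 0.05, N)
r_explore = (1 - lam) * r_anchor + lam * (z @ theta) + rng.normal(0.0, 0.05, N)
print(phase_estimator(z, r_explore, r_anchor, lam))  # close to theta
```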
Remark: In the previous work of Cai et al. (2025), where the action sets are probability simplices, after choosing $x_k$ the algorithm is allowed to sample $i_k \sim x_k$ and observe the corresponding reward, which simplifies the estimation process. In our setting, on the other hand, we only observe the reward of the actual strategy, $\langle x_{t,s}, Ay_{t,s} \rangle$. Thus, estimating $\bar{\theta}^x_t$ requires the transformation described in this section, and is made possible by the specific sampling scheme used in each phase.

2.2 Choice of Regularizer and the Corresponding Primal-Dual Norms

Recall that our algorithm updates the players' strategies using the OFTRL framework. The efficiency of OFTRL critically depends on the choice of the regularizer function $\phi(x)$. An ideal regularizer should be strongly convex and have a small diameter over the action set $\mathcal{X}$. To achieve this, we tailor the regularizer to the geometry of $\mathcal{X}$. We begin by defining a pair of primal-dual norms in Section 2.2.1, intrinsically tied to the action set through its symmetrization $K := \mathrm{conv}(\mathcal{X} \cup (-\mathcal{X}))$. We then construct an ellipsoid $E = \{x : x^\top H x \le 1\}$, which serves as a tight approximation of $K$ up to polynomial factors in the dimension. The regularizer is chosen to be half the squared norm induced by this ellipsoid, namely $\phi(x) := \frac{1}{2}x^\top H x$. As we show in Section 2.2.2, this choice yields a regularizer that is $1$-strongly convex with respect to the primal norm and whose Bregman divergence scales polynomially with the dimension.

2.2.1 Primal-Dual Norm Pair

In this section, we formally establish that the norms tailored to the action set $\mathcal{X}$ constitute a valid primal-dual pair. Analogous norms can be defined for the action set $\mathcal{Y}$.

Let $\|z\|_{*,\mathcal{X}} := \max_{x \in \mathcal{X}} |\langle x, z \rangle|$ and $\|z\|_\mathcal{X} := \max_{\|y\|_{*,\mathcal{X}} \le 1} \langle y, z \rangle$. Recall that $\mathcal{X}$ spans $\mathbb{R}^d$. Therefore, one can establish the following properties, the proof of which is provided in Appendix B for completeness.

Lemma 2.2 (Chandrasekaran et al. (2012)). The following properties hold for $\|\cdot\|_\mathcal{X}$ and $\|\cdot\|_{*,\mathcal{X}}$:
- $\|\cdot\|_\mathcal{X}$ and $\|\cdot\|_{*,\mathcal{X}}$ are both norms.
- $\|\cdot\|_{*,\mathcal{X}}$ is the dual norm of $\|\cdot\|_\mathcal{X}$.
- $\{y : \|y\|_\mathcal{X} \le 1\} = \mathrm{conv}(\mathcal{X} \cup -\mathcal{X})$.

Remark: This choice of primal-dual norms is mainly motivated by the following reasoning. For any choice of primal norm $\|\cdot\|$, one must ensure that $\|\hat{\theta}_t - \bar{\theta}_t\|_*$ remains small in order to obtain meaningful convergence guarantees when performing the OFTRL analysis. Since Lemma 2.1 establishes concentration guarantees for $\max_{x \in \mathcal{X}} |\langle x, \hat{\theta}_t - \bar{\theta}_t \rangle|$, defining the dual norm as $\|z\|_* := \max_{x \in \mathcal{X}} |\langle x, z \rangle|$ is a natural choice.
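The dual norm is directly computable from a linear optimization oracle. Below is a small sketch of ours (not from the paper) using vertex enumeration for a polytope $\mathcal{X}$; the simplex example confirms that $\|\cdot\|_{*,\mathcal{X}}$ reduces to the sup-norm there, so its primal partner is the $\ell_1$-norm, whose unit ball is exactly $\mathrm{conv}(\mathcal{X} \cup -\mathcal{X})$.

```python
import numpy as np

def dual_norm(z, X_vertices):
    """||z||_{*,X} = max_{x in X} |<x, z>| for a polytope X given by vertices.

    |<x, z>| over a polytope is maximized at a vertex, so enumerating the
    vertex list is a simple stand-in for a linear optimization oracle.
    """
    return max(abs(float(np.dot(x, z))) for x in X_vertices)

# For X = probability simplex, ||z||_{*,X} equals max_i |z_i|, and the unit
# ball of the primal norm, conv(X u -X), is the l1 cross-polytope.
z = np.array([0.3, -0.7, 0.2])
print(dual_norm(z, list(np.eye(3))), np.max(np.abs(z)))  # both 0.7
```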
2.2.2 Suitable Regularizer

In this section, we formally state the regularizer $\phi(x)$ for the row player. Analogously, a corresponding regularizer $\psi(y)$ can be constructed for the column player. The objective is to design a regularizer, based on an ellipsoid approximating the action set $\mathcal{X}$, that is well-suited for the OFTRL framework by being $1$-strongly convex with respect to the primal norm and having polynomially bounded Bregman divergence.

Let $K := \mathrm{conv}(\mathcal{X} \cup -\mathcal{X})$ and $\alpha := \sqrt{d(d+1)}$. One can compute an ellipsoid
$$E = \{x : x^\top H x \le 1\}, \qquad H \succ 0,$$
such that $E \subseteq K \subseteq \alpha E$, where $\alpha E := \{\alpha x : x \in E\}$ (see Theorem 4.6.3 in Grötschel et al. (2012)). Now we define our regularizer as
$$\phi(x) = \frac{1}{2}x^\top H x,$$
and the Bregman divergence with respect to the regularizer $\phi$ is defined as
$$D_\phi(u, v) = \phi(u) - \phi(v) - \langle \nabla\phi(v), u - v \rangle.$$
We now establish the properties of the regularizer $\phi(\cdot)$. We begin by proving the strong convexity of $\phi(\cdot)$ in the following lemma.

Lemma 2.3 (Strong convexity). The regularizer $\phi(x)$ is $1$-strongly convex with respect to the primal norm $\|\cdot\|_\mathcal{X}$.

Proof. Let the polar of a set $C \subset \mathbb{R}^d$ be defined as $C^\circ := \{z \in \mathbb{R}^d : \max_{x \in C} \langle x, z \rangle \le 1\}$. Observe that $A \subseteq B$ implies $B^\circ \subseteq A^\circ$, and $(\lambda A)^\circ = (1/\lambda)A^\circ$. Hence, we have $(1/\alpha)E^\circ \subseteq K^\circ \subseteq E^\circ$.

We now prove that $E^\circ = \{y : y^\top H^{-1} y \le 1\}$. Recall that $E = \{x : \|x\|_H \le 1\}$, where $\|x\|_H := \sqrt{x^\top H x}$. By definition, $E^\circ = \{y : \sup_{\|x\|_H \le 1} \langle y, x \rangle \le 1\}$. Consider $x \in E$. Observe that $\langle y, x \rangle = \langle H^{-1/2}y, H^{1/2}x \rangle$ and $\|H^{1/2}x\|_2^2 = x^\top H x \le 1$. By Cauchy-Schwarz, we have $\langle y, x \rangle \le \|H^{-1/2}y\|_2 \|H^{1/2}x\|_2 \le \|H^{-1/2}y\|_2$, with equality at $x = \frac{H^{-1}y}{\|H^{-1/2}y\|_2}$. Note that this choice of $x$ belongs to $E$, since $x^\top H x = \frac{y^\top H^{-1} y}{\|H^{-1/2}y\|_2^2} = 1$. Hence, we have $\sup_{\|x\|_H \le 1} \langle y, x \rangle = \|H^{-1/2}y\|_2 = \sqrt{y^\top H^{-1} y}$. Therefore, $E^\circ = \{y : \sqrt{y^\top H^{-1} y} \le 1\} = \{y : \|y\|_{H^{-1}} \le 1\}$. Using analogous calculations, we can also show that for any $z \in \mathbb{R}^d$, $\max_{y \in E^\circ} \langle y, z \rangle = \sqrt{z^\top H z}$.

Since $K^\circ \subseteq E^\circ$, we deduce the following:
$$\|z\|_\mathcal{X} = \max_{y : \max_{x \in \mathcal{X}} |\langle x, y \rangle| \le 1} \langle y, z \rangle = \max_{y : \max_{x \in K} \langle x, y \rangle \le 1} \langle y, z \rangle = \max_{y \in K^\circ} \langle y, z \rangle \le \max_{y \in E^\circ} \langle y, z \rangle = \sqrt{z^\top H z} = \|z\|_H.$$
For any $u, v \in \mathbb{R}^d$, $D_\phi(u, v) = \frac{1}{2}(u - v)^\top H (u - v) = \frac{1}{2}\|u - v\|_H^2$. Since $\|z\|_H \ge \|z\|_\mathcal{X}$ for all $z \in \mathbb{R}^d$, we have $D_\phi(u, v) \ge \frac{1}{2}\|u - v\|_\mathcal{X}^2$, so $\phi$ is $1$-strongly convex with respect to $\|\cdot\|_\mathcal{X}$. □

Next, we show in the following proposition that the Bregman divergence is bounded by a polynomial in the dimension.

Proposition 2.4 (Bregman divergence). For any $x, y \in \mathcal{X}$, we have $D_\phi(x, y) \le 2d(d+1)$.

Proof. For any $x, y \in \mathcal{X} \subseteq \alpha E$, we have $\|x - y\|_H \le \|x\|_H + \|y\|_H \le 2\alpha$. Therefore, $D_\phi(x, y) = \frac{1}{2}\|x - y\|_H^2 \le \frac{1}{2}(2\alpha)^2 = 2d(d+1)$. □

2.3 Main Result

We now state our main result, which is obtained by leveraging the properties of our estimator and regularizer, with technical details provided in the next section.

Theorem 2.5. Consider a two-player zero-sum game with action sets $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Y} \subseteq \mathbb{R}^m$, where both $\mathcal{X}$ and $\mathcal{Y}$ are convex and compact. Let $(x_k, y_k)$ denote the iterates generated by the two players running Algorithm 1 with step size $\eta = \frac{1}{6}$ in each round $k$. Then, with probability at least $1 - \delta$, for every $k \ge 1$, the iterate $(x_k, y_k)$ is an $\varepsilon_k$-approximate Nash equilibrium, where $\varepsilon_k = \mathrm{poly}(n, m, \log(k/\delta)) \cdot k^{-1/4}$.

3 TECHNICAL DETAILS FOR OUR MAIN RESULT

The update step in each phase of Algorithm 1 is an instance of OFTRL, a standard and widely used framework for online convex optimization. In each phase $t$, OFTRL outputs $\tilde{x}_t$ for the row player (and analogously $\tilde{y}_t$ for the column player). These outputs are then used to determine the actual strategies $x_k$ and $y_k$ chosen in each round of phase $t$.
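Since the OFTRL step recurs throughout the analysis, here is a minimal sketch of ours (not the paper's implementation) of one update with the quadratic regularizer $\phi(x) = \frac{1}{2}x^\top H x$ over a polytope, using cvxpy as a stand-in solver; the constraint format and the box example are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

def oftrl_update(u_sum, u_last, H, eta, A_ineq, b_ineq):
    """One OFTRL step: argmax_x <x, u_sum + u_last> - (1/eta) * 0.5 x^T H x
    over the polytope {x : A_ineq @ x <= b_ineq}. H must be PSD."""
    x = cp.Variable(H.shape[0])
    obj = cp.Maximize((u_sum + u_last) @ x
                      - (1.0 / (2.0 * eta)) * cp.quad_form(x, H))
    cp.Problem(obj, [A_ineq @ x <= b_ineq]).solve()
    return x.value

# Example over the box [0, 1]^2 (a simple compact convex action set).
H = np.eye(2)
A_box = np.vstack([np.eye(2), -np.eye(2)])
b_box = np.array([1.0, 1.0, 0.0, 0.0])
print(oftrl_update(np.array([1.0, -1.0]), np.array([0.5, 0.0]),
                   H, 1 / 6, A_box, b_box))   # approx [0.25, 0.0]
```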
Let $u^x_t := A\tilde{y}_t$ denote the phase-$t$ utility vector of the row player, and $u^y_t := -A^\top\tilde{x}_t$ denote the phase-$t$ utility vector of the column player. Recall that $\bar{x}_T = \frac{1}{T}\sum_{t=1}^{T}\tilde{x}_t$ and $\bar{y}_T = \frac{1}{T}\sum_{t=1}^{T}\tilde{y}_t$. We have the following guarantee on the duality gap of $(\bar{x}_T, \bar{y}_T)$:
$$\max_{x \in \mathcal{X}} \langle x, A\bar{y}_T \rangle - \min_{y \in \mathcal{Y}} \langle \bar{x}_T, Ay \rangle = \frac{1}{T}\max_{x \in \mathcal{X}}\sum_{t=1}^{T}\langle x, A\tilde{y}_t \rangle + \frac{1}{T}\max_{y \in \mathcal{Y}}\sum_{t=1}^{T}\langle -A^\top\tilde{x}_t, y \rangle = \frac{1}{T}\max_{x \in \mathcal{X}}\sum_{t=1}^{T}\langle x - \tilde{x}_t, A\tilde{y}_t \rangle + \frac{1}{T}\max_{y \in \mathcal{Y}}\sum_{t=1}^{T}\langle -A^\top\tilde{x}_t, y - \tilde{y}_t \rangle = \frac{1}{T}\left( \max_{x' \in \mathcal{X}}\sum_{t=1}^{T}\langle u^x_t, x' - \tilde{x}_t \rangle + \max_{y' \in \mathcal{Y}}\sum_{t=1}^{T}\langle u^y_t, y' - \tilde{y}_t \rangle \right).$$
If the duality gap above is upper bounded by $\beta_T$, then $(\bar{x}_T, \bar{y}_T)$ is an $O(\beta_T)$-approximate Nash equilibrium. Since the iterates in phase $T$ take the form $((1 - \lambda_T)\bar{x}_T + \lambda_T z_x, (1 - \lambda_T)\bar{y}_T + \lambda_T z_y)$ with $z_x \in \mathcal{X}$ and $z_y \in \mathcal{Y}$, the iterates in phase $T$ are $O(\beta_T + \lambda_T)$-approximate Nash equilibria.

Hence, our focus is on upper bounding the term $\max_{x' \in \mathcal{X}}\sum_{t=1}^{T}\langle u^x_t, x' - \tilde{x}_t \rangle$. To this end, we make use of the RVU property, which characterizes the performance of OFTRL. We state this property in the following lemma and include its proof in Appendix C for completeness.

Lemma 3.1 (RVU property, Syrgkanis et al. (2015)). Let $\mathcal{X} \subset \mathbb{R}^d$ be compact convex. Let $R : \mathcal{X} \to \mathbb{R}$ be $\sigma$-strongly convex w.r.t. a norm $\|\cdot\|$ with dual $\|\cdot\|_*$. Fix a step size $\eta > 0$ and initialize $\tilde{x}_0 = \tilde{x}_1 = \arg\min_{x \in \mathcal{X}} R(x)$. Assume $u_0 = 0$. For utilities $u_t \in \mathbb{R}^d$, define the OFTRL decisions
$$\tilde{x}_{t+1} \in \arg\max_{x \in \mathcal{X}}\left\{ \left\langle x, \sum_{\ell=1}^{t} u_\ell + u_t \right\rangle - \frac{1}{\eta}R(x) \right\}.$$
Then for every $x \in \mathcal{X}$,
$$\sum_{t=1}^{T}\langle u_t, x - \tilde{x}_t \rangle \le \frac{D_R(x, \tilde{x}_1)}{\eta} + \frac{\eta}{\sigma}\sum_{t=1}^{T}\|u_t - u_{t-1}\|_*^2 - \frac{\sigma}{4\eta}\sum_{t=1}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2.$$

While the RVU property is sufficient to bound $\max_{x' \in \mathcal{X}}\sum_{t=1}^{T}\langle u^x_t, x' - \tilde{x}_t \rangle$ when the actual phase-wise utilities are available, this is not the case in our algorithm. Instead, we use the estimates $\hat{u}^x_t$ in the row player's OFTRL updates, since we operate under the bandit feedback setting. To address this challenge, we rely on the following key observation concerning the actual phase-wise utilities:
$$u^x_t = A\tilde{y}_t = \sum_{\ell=1}^{t} A\tilde{y}_\ell - \sum_{\ell=1}^{t-1} A\tilde{y}_\ell = tA\bar{y}_t - (t-1)A\bar{y}_{t-1},$$
where $\bar{y}_t = \frac{1}{t}\sum_{s=1}^{t}\tilde{y}_s$. Also note that the estimate $\hat{u}^x_t$ has the corresponding form $t \cdot \hat{\theta}^x_t - (t-1) \cdot \hat{\theta}^x_{t-1}$. This allows us to upper bound $\langle u^x_t, x - \tilde{x}_t \rangle$ as
$$\langle \hat{u}^x_t, x - \tilde{x}_t \rangle - \langle \hat{u}^x_t - \mathbb{E}[\hat{u}^x_t], x - \tilde{x}_t \rangle + O\left(\frac{1}{t}\right).$$
Moreover, one can show that
$$\|\mathbb{E}[\hat{u}^x_t] - \mathbb{E}[\hat{u}^x_{t-1}]\|_{*,\mathcal{X}} \le \|u^x_t - u^x_{t-1}\|_{*,\mathcal{X}} + O\left(\frac{1}{t}\right).$$
These observations allow us to apply the RVU property to the sequence of estimates $\hat{u}^x_t$, yielding the following lemma. All missing proofs, including that of the lemma, are deferred to Appendix D.

Lemma 3.2 (RVU with estimation error). Let $\Delta^x_t := \hat{\theta}^x_t - \bar{\theta}^x_t$. Then for any $x \in \mathcal{X}$ and any $\eta > 0$, we have the following for the row player:
$$\sum_{t=1}^{T}\langle u^x_t, x - \tilde{x}_t \rangle \le \frac{D_\phi(x, \tilde{x}_1)}{\eta} + 2\|T\Delta^x_T\|_{*,\mathcal{X}} + 36\eta\sum_{t=1}^{T}\|t\Delta^x_t\|_{*,\mathcal{X}}^2 + 4\eta\sum_{t=1}^{T}\|u^x_t - u^x_{t-1}\|_{*,\mathcal{X}}^2 - \frac{3}{16\eta}\sum_{t=1}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|_\mathcal{X}^2 + O(\eta + \log T).$$
Analogously, we can provide a similar guarantee for the column player, with $\Delta^y_t$ defined accordingly.
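As a sanity check on Lemma 3.1, the following self-contained sketch of ours (with hypothetical choices: the probability simplex as action set and $R(x) = \frac{1}{2}\|x\|_2^2$, which is $1$-strongly convex w.r.t. $\|\cdot\|_2$) runs OFTRL on random utilities and verifies the RVU inequality numerically; for this regularizer each OFTRL step reduces to a Euclidean projection of $\eta$ times the optimistic cumulative utility onto the simplex.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + tau, 0.0)

rng = np.random.default_rng(1)
d, T, eta, sigma = 4, 200, 0.1, 1.0
U = rng.uniform(-1.0, 1.0, (T + 1, d))
U[0] = 0.0                                   # u_0 = 0, as in the lemma
xs = [np.full(d, 1.0 / d)] * 2               # x~_0 = x~_1 = argmin of R
for t in range(1, T + 1):
    g = U[1:t + 1].sum(axis=0) + U[t]        # sum_{l<=t} u_l + optimistic u_t
    xs.append(project_simplex(eta * g))      # OFTRL step for R = 0.5 ||x||^2
x_cmp = np.eye(d)[0]                         # any fixed comparator x
lhs = sum(U[t] @ (x_cmp - xs[t]) for t in range(1, T + 1))
rhs = (0.5 * np.sum((x_cmp - xs[1]) ** 2) / eta
       + (eta / sigma) * sum(np.sum((U[t] - U[t - 1]) ** 2)
                             for t in range(1, T + 1))
       - (sigma / (4 * eta)) * sum(np.sum((xs[t] - xs[t - 1]) ** 2)
                                   for t in range(1, T + 1)))
print(lhs <= rhs + 1e-9)                     # True: the RVU bound holds
```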
We are now ready to upper bound the duality gap of $(\bar{x}_T, \bar{y}_T)$ using Lemma 3.2. First, applying Lemma 3.2 to both players, we establish the following:
$$\max_{x' \in \mathcal{X}}\sum_{t=1}^{T}\langle u^x_t, x' - \tilde{x}_t \rangle + \max_{y' \in \mathcal{Y}}\sum_{t=1}^{T}\langle u^y_t, y' - \tilde{y}_t \rangle \le \frac{D_\phi(x, \tilde{x}_1)}{\eta} + 2\|T\Delta^x_T\|_{*,\mathcal{X}} + 36\eta\sum_{t=1}^{T}\|t\Delta^x_t\|_{*,\mathcal{X}}^2 + 4\eta\sum_{t=1}^{T}\|u^x_t - u^x_{t-1}\|_{*,\mathcal{X}}^2 - \frac{3}{16\eta}\sum_{t=1}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|_\mathcal{X}^2 + \frac{D_\psi(y, \tilde{y}_1)}{\eta} + 2\|T\Delta^y_T\|_{*,\mathcal{Y}} + 36\eta\sum_{t=1}^{T}\|t\Delta^y_t\|_{*,\mathcal{Y}}^2 + 4\eta\sum_{t=1}^{T}\|u^y_t - u^y_{t-1}\|_{*,\mathcal{Y}}^2 - \frac{3}{16\eta}\sum_{t=1}^{T}\|\tilde{y}_t - \tilde{y}_{t-1}\|_\mathcal{Y}^2 + O(\eta + \log T).$$
We group terms on the right side. We begin with
$$\text{Term I} = 4\eta\sum_{t=1}^{T}\left( \|u^x_t - u^x_{t-1}\|_{*,\mathcal{X}}^2 + \|u^y_t - u^y_{t-1}\|_{*,\mathcal{Y}}^2 \right) - \frac{3}{16\eta}\sum_{t=1}^{T}\left( \|\tilde{x}_t - \tilde{x}_{t-1}\|_\mathcal{X}^2 + \|\tilde{y}_t - \tilde{y}_{t-1}\|_\mathcal{Y}^2 \right).$$
Now observe that
$$\|u^x_t - u^x_{t-1}\|_{*,\mathcal{X}} = \sup_{x \in \mathcal{X}}|\langle x, A(\tilde{y}_t - \tilde{y}_{t-1}) \rangle| \le \sup_{x \in \mathcal{X}}\|A^\top x\|_{*,\mathcal{Y}}\,\|\tilde{y}_t - \tilde{y}_{t-1}\|_\mathcal{Y} \le \|\tilde{y}_t - \tilde{y}_{t-1}\|_\mathcal{Y},$$
where the first inequality follows from the fact that $|\langle x, y \rangle| \le \|x\|\,\|y\|_*$ for any primal-dual norm pair, and the second inequality follows from the fact that $|\langle x, Ay \rangle| \le 1$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Similarly, we can show that $\|u^y_t - u^y_{t-1}\|_{*,\mathcal{Y}} \le \|\tilde{x}_t - \tilde{x}_{t-1}\|_\mathcal{X}$. As $\eta = 1/6$, we have $\text{Term I} \le 0$.

Let $\text{Term II} = 2\|T\Delta^x_T\|_{*,\mathcal{X}} + 36\eta\sum_{t=1}^{T}\|t\Delta^x_t\|_{*,\mathcal{X}}^2 + 2\|T\Delta^y_T\|_{*,\mathcal{Y}} + 36\eta\sum_{t=1}^{T}\|t\Delta^y_t\|_{*,\mathcal{Y}}^2$. Due to Lemma 2.1, we get that $\|\Delta^x_t\|_{*,\mathcal{X}} \le 48\sqrt{n^3/t^3}$ with probability at least $1 - \frac{\delta}{4t^2}$. Analogously, we can show that $\|\Delta^y_t\|_{*,\mathcal{Y}} \le 48\sqrt{m^3/t^3}$ with probability at least $1 - \frac{\delta}{4t^2}$. Hence, by a union bound, if $\eta = 1/6$, we get the following with probability at least $1 - \delta$:
$$\text{Term II} \le O\left( \sqrt{\frac{n^3}{T}} + \sqrt{\frac{m^3}{T}} + \sum_{t=1}^{T}\frac{t^2(n^3 + m^3)}{t^3} \right) \le O\left( (n^3 + m^3)\log T \right).$$
Define $\text{Term III} = \frac{D_\phi(x, \tilde{x}_1)}{\eta} + \frac{D_\psi(y, \tilde{y}_1)}{\eta} + O(\eta + \log T)$. Due to Proposition 2.4, we have $D_\phi(x, \tilde{x}_1) \le O(n^2)$. Similarly, we can show that $D_\psi(y, \tilde{y}_1) \le O(m^2)$. Hence, $\text{Term III} \le O(n^2 + m^2 + \log T)$ if $\eta = 1/6$. Hence, we have the following:
$$\max_{x' \in \mathcal{X}}\sum_{t=1}^{T}\langle u^x_t, x' - \tilde{x}_t \rangle + \max_{y' \in \mathcal{Y}}\sum_{t=1}^{T}\langle u^y_t, y' - \tilde{y}_t \rangle \le \text{Term I} + \text{Term II} + \text{Term III} \le O\left( (n^3 + m^3)\log T \right).$$
Hence $(\bar{x}_T, \bar{y}_T)$ is an $O\left(\frac{(n^3 + m^3)\log T}{T}\right)$-approximate Nash equilibrium. Hence, the pair of strategies played in a round of phase $T$ is an $O\left(\frac{(n^3 + m^3)\log T}{T} + \lambda_T\right)$-approximate Nash equilibrium, which is also an $O\left(\frac{(n^3 + m^3)\log T}{T}\right)$-approximate Nash equilibrium since $\lambda_T = T^{-2}$.

Now consider a round $k$ and let it be part of phase $T_k$. Note that $T_k \le k$. As each phase $t$ has $B_t = \log(8t^2/\delta) \cdot t^3$ rounds, we have $\log(8k^2/\delta) \cdot T_k^4 \ge k$. Hence, $T_k \ge \left(\frac{k}{\log(8k^2/\delta)}\right)^{1/4}$. Hence, the iterate $(x_k, y_k)$ is an $\varepsilon_k$-approximate Nash equilibrium, where $\varepsilon_k$ is upper bounded as
$$\varepsilon_k \le O\left( (n^3 + m^3)\log(k)\log^{1/4}(8k^2/\delta) \cdot k^{-1/4} \right).$$

4 COMPUTATIONAL EFFICIENCY OF OUR ALGORITHM

In this section, we show that our proposed algorithm is computationally efficient, provided the action sets admit an efficient linear optimization oracle. We establish this by showing that its key building blocks can be implemented in polynomial time. First, recall that our exploration distribution $\mathcal{D}_\mathcal{X}$ is uniform over a subset $S := \{x_1, \ldots, x_n\} \subset \mathcal{X}$ such that
$$\sup_{x \in \mathcal{X}} x^\top V^{-1} x \le 2n^2, \qquad V := \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top \succ 0.$$
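A quick numerical check of this design condition (a sketch of ours; the simplex instance below is illustrative) evaluates $x^\top V^{-1} x$ on a finite sample of points from $\mathcal{X}$ standing in for the supremum:

```python
import numpy as np

def design_ok(S, X_sample):
    """Check sup_x x^T V^{-1} x <= 2 n^2 on a finite sample of points from X.

    S:        (n, n) rows are the exploration points x_1, ..., x_n
    X_sample: (M, n) points from X (e.g. vertices) approximating the supremum
    """
    n = S.shape[1]
    V = S.T @ S / S.shape[0]                 # V = (1/n) sum_i x_i x_i^T
    Vinv = np.linalg.inv(V)
    worst = max(float(x @ Vinv @ x) for x in X_sample)
    return worst <= 2 * n ** 2, worst

# Simplex with S = {e_1, ..., e_n}: V = I/n, so x^T V^{-1} x = n ||x||_2^2 <= n.
n = 4
print(design_ok(np.eye(n), np.eye(n)))       # (True, 4.0)
```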
Hazan and Karnin (2016) showed that such a subset can be computed in polynomial time, provided $\mathcal{X}$ admits an efficient linear optimization oracle. Hence, our sampling process is efficient.

Next, recall that we construct an ellipsoid $E$ such that $E \subseteq K \subseteq \sqrt{n(n+1)}\,E$, where $K = \mathrm{conv}(\mathcal{X} \cup -\mathcal{X})$. Theorem 4.6.3 of Grötschel et al. (2012) shows that such an ellipsoid can be computed in polynomial time, provided there is an efficient linear optimization oracle for $K$. For any $z \in \mathbb{R}^n$, we have
$$\max_{x \in K}\langle x, z \rangle = \max\left\{ \max_{x \in \mathcal{X}}\langle x, z \rangle, \; \max_{x \in \mathcal{X}}\langle -x, z \rangle \right\}.$$
Hence, $K$ admits an efficient linear optimization oracle whenever $\mathcal{X}$ does.
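As a tiny illustration (a sketch of ours; the box oracle below is a hypothetical example), the oracle for $K$ reduces to two calls to the oracle for $\mathcal{X}$:

```python
import numpy as np

def lin_opt_K(z, lin_opt_X):
    """max_{x in K} <x, z> for K = conv(X u -X), given an oracle for X.

    lin_opt_X(z) returns (value, argmax) of max_{x in X} <x, z>; the max
    over K is the better of the oracle's answers on z and -z (point negated).
    """
    v_pos, x_pos = lin_opt_X(z)
    v_neg, x_neg = lin_opt_X(-z)   # max_{x in X} <-x, z> = max_{x in X} <x, -z>
    return (v_pos, x_pos) if v_pos >= v_neg else (v_neg, -x_neg)

# Example with X = [0, 1]^n, whose oracle puts mass on positive coordinates.
def box_oracle(z):
    x = (np.asarray(z) > 0).astype(float)
    return float(x @ z), x

print(lin_opt_K(np.array([0.5, -2.0]), box_oracle))  # value 2.0 at x = (0, -1)
```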
Finally, our OFTRL update step is a convex optimization problem over an action set that admits efficient linear optimization. Such an OFTRL update can be implemented in polynomial time, provided we can compute the regularizer and its gradient efficiently, which we can in the case of our regularizer (see Chapter 2 of Grötschel et al. (2012)). These are the three main components of our algorithm. Therefore, the algorithm can be implemented in polynomial time, provided the action sets admit an efficient linear optimization oracle.

5 CONCLUSION

In this paper, we presented the first uncoupled learning dynamics whose iterates exhibit last-iterate convergence with high probability under bandit feedback for bilinear saddle-point problems over convex sets. We established a convergence rate of $\tilde{O}(T^{-1/4})$ and showed that our dynamics can be implemented efficiently, provided the action sets admit efficient linear optimization oracles. This work raises several interesting open questions. First, what is the tight lower bound on the last-iterate convergence rate for bilinear saddle-point problems under bandit feedback, and can we design uncoupled learning dynamics that achieve this rate? Next, does there exist a simpler dynamics that applies optimistic FTRL in each round and attains last-iterate convergence under bandit feedback, rather than relying on phased updates and the involved sampling procedure used in our algorithm? Finally, can these results be generalized to convex-concave functions and monotone games, and can last-iterate convergence be achieved in these broader settings under bandit feedback?

ACKNOWLEDGEMENTS

Ioannis Panageas was supported by National Science Foundation grant CCF-2454115. LJ Ratliff was supported in part by NSF 1844729, 2312775. KJ and AM were supported in part by NSF 2141511, 2023239, and a Singapore AI Visiting Professorship award. JM and CJZ were supported in part by NSF ID 2045402 and a Simons Collaboration on the Theory of Algorithmic Fairness.

References

Kenshi Abe, Mitsuki Sakamoto, Kaito Ariu, and Atsushi Iwasaki. Boosting perturbed gradient ascent for last-iterate convergence in games. arXiv preprint arXiv:2410.02388, 2024.

Ioannis Anagnostides, Gabriele Farina, Christian Kroer, Chung-Wei Lee, Haipeng Luo, and Tuomas Sandholm. Uncoupled learning dynamics with $O(\log T)$ swap regret in multiplayer games. Advances in Neural Information Processing Systems, 35:3292–3304, 2022a.

Ioannis Anagnostides, Ioannis Panageas, Gabriele Farina, and Tuomas Sandholm. On last-iterate convergence beyond zero-sum games. In International Conference on Machine Learning, pages 536–581. PMLR, 2022b.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Wenjia Ba, Tianyi Lin, Jiawei Zhang, and Zhengyuan Zhou. Doubly optimal no-regret online learning in strongly monotone games with bandit feedback. Operations Research, 2025.

James P Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 321–338, 2018.

Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420, 2009.

Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.

Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pages 41–1. JMLR Workshop and Conference Proceedings, 2012.

Yang Cai, Argyris Oikonomou, and Weiqiang Zheng. Finite-time last-iterate convergence for learning in multi-player games. Advances in Neural Information Processing Systems, 35:33904–33919, 2022.

Yang Cai, Haipeng Luo, Chen-Yu Wei, and Weiqiang Zheng. Uncoupled and convergent learning in two-player zero-sum Markov games with bandit feedback. Advances in Neural Information Processing Systems, 36:36364–36406, 2023.

Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo, and Weiqiang Zheng. Fast last-iterate convergence of learning in games requires forgetful algorithms. Advances in Neural Information Processing Systems, 37:23406–23434, 2024.

Yang Cai, Haipeng Luo, Chen-Yu Wei, and Weiqiang Zheng. From average-iterate to last-iterate convergence in games: A reduction and its applications. arXiv preprint arXiv:2506.03464, to appear at NeurIPS, 2025.

Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

Xi Chen and Binghui Peng. Hedging in games: Faster convergence of external and swap regrets. Advances in Neural Information Processing Systems, 33:18990–18999, 2020.

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games. Advances in Neural Information Processing Systems, 36:75826–75883, 2023.

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. Last-iterate convergence of payoff-based independent learning in zero-sum stochastic games. arXiv preprint arXiv:2409.01447, 2024.

Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. arXiv preprint arXiv:1807.04252, 2018.

Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 235–254. SIAM, 2011.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.

Constantinos Daskalakis, Maxwell Fishelson, and Noah Golowich.
Near-optimal no-regret learning in general games. Advances in Neural Information Processing Systems, 34:27604–27616, 2021.

Arnoud V Den Boer. Dynamic pricing and learning: Historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20(1):1–18, 2015.

Jing Dong, Baoxiang Wang, and Yaoliang Yu. Uncoupled and convergent learning in monotone games under bandit feedback. arXiv preprint arXiv:2408.08395, 2024.

Côme Fiegel, Pierre Menard, Tadashi Kozuno, Michal Valko, and Vianney Perchet. The harder path: Last iterate convergence for uncoupled learning in zero-sum games with bandit feedback. In 42nd International Conference on Machine Learning (ICML 2025), volume 267, 2025.

Angeliki Giannou, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Panayotis Mertikopoulos. On the rate of convergence of regularized learning in games: From bandits and uncertainty to optimism and beyond. Advances in Neural Information Processing Systems, 34:22655–22666, 2021.

Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2. Springer Science & Business Media, 2012.

Elad Hazan and Zohar Karnin. Volumetric spanners: An efficient exploration basis for learning. The Journal of Machine Learning Research, 17(1):4062–4095, 2016.

Michael Jordan, Tianyi Lin, and Zhengyuan Zhou. Adaptive, doubly optimal no-regret learning in strongly monotone and exp-concave games with gradient feedback. Operations Research, 73(3):1675–1702, 2025.

Walid Krichene, Benjamin Drighès, and Alexandre M Bayen. Online learning of Nash equilibria in congestion games. SIAM Journal on Control and Optimization, 53(2):1056–1081, 2015.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 907–915. PMLR, 2019.

Haipeng Luo. Lecture notes: Introduction to online optimization/learning. URL https://haipeng-luo.net/courses/CSCI659/2022fall/lectures/lecture3.pdf.

Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2703–2717. SIAM, 2018.

Meta Fundamental AI Research Diplomacy Team (FAIR), Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022.

Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, et al. Nash learning from human feedback. arXiv preprint arXiv:2312.00886, 2023.

Vidya Muthukumar, Soham Phade, and Anant Sahai. On the impossibility of convergence of mixed strategies with no regret learning. arXiv preprint arXiv:2012.02125, 2020.

Gergely Neu.
Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Advances in Neural Information Processing Systems, 28, 2015.

Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993–1019. PMLR, 2013.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. Advances in Neural Information Processing Systems, 28, 2015.

Tatiana Tatarenko and Maryam Kamgarpour. Learning Nash equilibria in monotone games. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3104–3109. IEEE, 2019.

Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergence in constrained saddle-point optimization. arXiv preprint arXiv:2006.09517, 2020.

Julian Zimmert and Tor Lattimore. Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits. In Conference on Learning Theory, pages 3285–3312. PMLR, 2022.

Supplementary Material

In this appendix, we provide the missing details from the main body. In Appendix A, we present the formal guarantees for our estimators along with their proofs. In Appendix B, we provide the missing proofs for our primal-dual norms. In Appendix C, we give the proof of the RVU property of OFTRL for completeness. In Appendix D, we establish the RVU property under estimation error.

A Estimation

Lemma A.1. Fix $x \in \mathcal{X}$ and $\lambda \in (0, 1)$. Let $\mathcal{D}$ be a distribution over $\mathcal{Y}$. Define the random variable $\tilde{y}$ as follows: with probability $1/2$, set $\tilde{y} = \bar{y}$ for some fixed $\bar{y} \in \mathcal{Y}$; with probability $1/2$, sample $z \sim \mathcal{D}$ and set $\tilde{y} = (1 - \lambda)\bar{y} + \lambda z$. If $\hat{y} = \mathbb{E}[\tilde{y}]$, then $\langle x, A(\tilde{y} - \hat{y}) \rangle$ is zero-mean and $4\lambda^2$-subgaussian.

Proof. By linearity of expectation, $\langle x, A(\tilde{y} - \hat{y}) \rangle$ has mean zero. Note that $\tilde{y}$ is always of the form $(1 - \lambda)\bar{y} + \lambda z'$, where $z' \in \mathcal{Y}$. Next, observe that $\hat{y} = (1 - \lambda)\bar{y} + \lambda\hat{z} \in \mathcal{Y}$, where $\hat{z} = \frac{1}{2}\bar{y} + \frac{1}{2}\mathbb{E}_{z \sim \mathcal{D}}[z] \in \mathcal{Y}$. Since $\langle x, Ay \rangle \in [-1, 1]$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, it follows that $\langle x, A(\tilde{y} - \hat{y}) \rangle \in [-2\lambda, 2\lambda]$. The result then follows from the fact that a bounded zero-mean random variable taking values in an interval of length $4\lambda$ is $4\lambda^2$-subgaussian. □

Lemma A.2 (Chernoff bound). Let $X_1, X_2, \ldots, X_n$ be i.i.d. samples from a Bernoulli distribution with mean $\mu$. Then we have the following for any $0 < \delta < 1$:
$$\Pr\left[ \frac{1}{n}\sum_{i=1}^{n} X_i \ge (1 + \delta)\mu \right] \le e^{-\frac{n\mu\delta^2}{3}} \qquad \text{and} \qquad \Pr\left[ \frac{1}{n}\sum_{i=1}^{n} X_i \le (1 - \delta)\mu \right] \le e^{-\frac{n\mu\delta^2}{2}}.$$

A.1 Estimates Using the Exploration Distribution

Let $\mathcal{X} \subset \mathbb{R}^n$ be convex and compact, with $\mathrm{span}(\mathcal{X}) = \mathbb{R}^n$. Consider a subset $S := \{x_1, \ldots, x_n\} \subset \mathcal{X}$ such that
$$\sup_{x \in \mathcal{X}} x^\top V^{-1} x \le 2n^2, \qquad V := \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top \succ 0.$$
Such a subset can be computed in polynomial time, provided $\mathcal{X}$ has an efficient linear optimization oracle (see Hazan and Karnin (2016)). Now collect $N$ samples by repeating each $x_i$ exactly $r := N/n$ times (assume $n$ divides $N$). The observations follow
$$y_t = \langle x_t, \theta \rangle + \eta_t, \qquad t = 1, \ldots, N,$$
where $\theta \in \mathbb{R}^n$ is fixed and $\{\eta_t\}_{t \in [N]}$ are independent, mean-zero, and $\sigma^2$-subgaussian (in the MGF sense).
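To illustrate the guarantee derived just below (Eq. (1)), here is a small simulation sketch of ours (all numbers are hypothetical) that builds the repeated design, runs ordinary least squares, and compares the empirical error quantile against the stated bound:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, sigma, trials, delta = 3, 3000, 0.5, 200, 0.05
S = np.eye(n)                                # exploration set: simplex vertices
theta = rng.uniform(-1.0, 1.0, n)
design = np.repeat(S, N // n, axis=0)        # each x_i repeated N/n times
V = design.T @ design / N                    # = (1/n) sum_i x_i x_i^T
errs = []
for _ in range(trials):
    y = design @ theta + rng.normal(0.0, sigma, N)
    theta_hat = np.linalg.solve(V, design.T @ y / N)   # OLS estimator
    # over the simplex, sup_x |<x, theta_hat - theta>| = max coordinate gap
    errs.append(np.max(np.abs(theta_hat - theta)))
bound = 6 * sigma * np.sqrt((n**3 + n**2 * np.log(1 / delta)) / N)
print(float(np.quantile(errs, 1 - delta)), "<=", float(bound))  # Eq. (1) holds
```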
Define the matrix
$$V = \frac{1}{N}\sum_{t=1}^{N} x_t x_t^\top = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top,$$
and the ordinary least squares estimator
$$\hat{\theta} = V^{-1}\left( \frac{1}{N}\sum_{t=1}^{N} x_t y_t \right).$$
As $y_t = \langle x_t, \theta \rangle + \eta_t$, we get the following:
$$\hat{\theta} - \theta = V^{-1}\left( \frac{1}{N}\sum_{t=1}^{N} x_t \eta_t \right).$$
Let $Z := V^{1/2}(\hat{\theta} - \theta) = \frac{1}{N}\sum_{t=1}^{N} V^{-1/2} x_t \eta_t$. For any $z \in \mathbb{R}^n$ and any $x \in \mathbb{R}^n$,
$$|\langle x, z \rangle| = |\langle V^{-1/2}x, V^{1/2}z \rangle| \le \|V^{-1/2}x\|_2 \|V^{1/2}z\|_2 = \sqrt{x^\top V^{-1} x}\,\|V^{1/2}z\|_2.$$
Taking $z = \hat{\theta} - \theta$ and taking the supremum over $x \in \mathcal{X}$ yields
$$\sup_{x \in \mathcal{X}} |\langle x, \hat{\theta} - \theta \rangle| \le \sup_{x \in \mathcal{X}} \sqrt{x^\top V^{-1} x}\,\|Z\|_2.$$
By the design choice, $\sup_{x \in \mathcal{X}} x^\top V^{-1} x \le 2n^2$, which implies $\sup_{x \in \mathcal{X}} |\langle x, \hat{\theta} - \theta \rangle| \le \sqrt{2}\,n\|Z\|_2$. Due to the results in Chapter 20 of Lattimore and Szepesvári (2020), we have the following with probability at least $1 - \delta$:
$$\|Z\|_2 \le 2\sigma\sqrt{\frac{2}{N}\left( n\ln 6 + \ln\frac{1}{\delta} \right)} \le 4\sigma\sqrt{\frac{n + \ln(1/\delta)}{N}}.$$
Hence, we have
$$\Pr\left( \sup_{x \in \mathcal{X}} |\langle x, \hat{\theta} - \theta \rangle| \le 6\sigma\sqrt{\frac{n^3 + n^2\ln(1/\delta)}{N}} \right) \ge 1 - \delta. \tag{1}$$

A.2 Sampling Method and Estimator

Recall that in the row player's algorithm, during the $s$-th round of phase $t$, it selects $x_{t,s}$. Analogous to the row player's algorithm, we can describe an algorithm for the column player, where in the $s$-th round of the $t$-th phase it selects $y_{t,s}$. Denote the row player's true expected utility vector as $\bar{\theta}^x_t := A\hat{y}_t$, where $\hat{y}_t := \mathbb{E}[y_{t,s}]$ is the column player's averaged strategy in phase $t$. Recall that $\mathcal{D}_\mathcal{X}$ is the uniform distribution over a subset $\{x_1, x_2, \ldots, x_n\}$.

Let us fix a sequence of vectors $x_{t,1}, x_{t,2}, \ldots, x_{t,B_t}$ such that $|\{s : x_{t,s} = \bar{x}_t\}| \ge B_t/4$ and, for all $i \in [n]$, $|\{s : x_{t,s} = (1 - \lambda_t)\bar{x}_t + \lambda_t x_i\}| \ge B_t/(4n)$. Conditioned on this sequence, we now construct an estimator $\hat{\theta}^x_t$ of $\bar{\theta}^x_t$ with desirable concentration guarantees. In each round $s$ of phase $t$, the row player plays $x_{t,s} \in \mathcal{X}$ and receives the reward $r_{t,s} = \langle x_{t,s}, Ay_{t,s} \rangle$. We can decompose this reward as
$$r_{t,s} = \langle x_{t,s}, A\hat{y}_t \rangle + \langle x_{t,s}, A(y_{t,s} - \hat{y}_t) \rangle = \langle x_{t,s}, \bar{\theta}^x_t \rangle + \eta_{t,s},$$
where the second term $\eta_{t,s} := \langle x_{t,s}, A(y_{t,s} - \hat{y}_t) \rangle$ is zero-mean $4\lambda_t^2$-subgaussian noise in each phase $t$ due to Lemma A.1. Note that the noises $\eta_{t,s}$ are independent, and we can correctly apply Lemma A.1 since the sequence $\{x_{t,s}\}_{s \in [B_t]}$ is fixed.

Let $\{s_1, s_2, \ldots, s_{B_t/4}\}$ be the first $B_t/4$ indices such that $x_{t,s} = \bar{x}_t$. Similarly, let $\{s'_1, s'_2, \ldots, s'_{B_t/4}\}$ be the set of indices consisting of the first $B_t/(4n)$ indices such that $x_{t,s} = (1 - \lambda_t)\bar{x}_t + \lambda_t x_i$, for each $i \in [n]$. We construct pairs $(s_i, s'_i)$ such that $x_{t,s'_i} = (1 - \lambda_t)x_{t,s_i} + \lambda_t z_{t,s'_i}$, where $z_{t,s'_i} \in \{x_1, x_2, \ldots, x_n\}$. Now consider the transformed reward
$$\hat{r}_{t,s'_i} := \frac{r_{t,s'_i} - (1 - \lambda_t) r_{t,s_i}}{\lambda_t} = \langle z_{t,s'_i}, \bar{\theta}^x_t \rangle + \hat{\eta}_{t,s'_i},$$
where $\hat{\eta}_{t,s'_i}$ is zero-mean $8$-subgaussian noise. Thus, from the pairs $(z_{t,s'_i}, \hat{r}_{t,s'_i})$, we obtain an unbiased estimator of $\bar{\theta}^x_t$:
$$\hat{\theta}^x_t = \left( \sum_{i=1}^{B_t/4} z_{t,s'_i} z_{t,s'_i}^\top \right)^{-1} \sum_{i=1}^{B_t/4} \hat{r}_{t,s'_i} z_{t,s'_i} = \frac{4}{B_t}\,V^{-1}\sum_{i=1}^{B_t/4} \hat{r}_{t,s'_i} z_{t,s'_i},$$
where the second equality holds since each $x_j$ appears exactly $B_t/(4n)$ times among the $z_{t,s'_i}$, so $\sum_i z_{t,s'_i} z_{t,s'_i}^\top = \frac{B_t}{4n}\sum_{j=1}^{n} x_j x_j^\top = \frac{B_t}{4}V$. Due to Eq. (1), conditioning on the sequence $\{x_{t,s}\}_{s \in [B_t]}$, we have the following:
$$\Pr\left( \sup_{x \in \mathcal{X}} |\langle x, \hat{\theta}^x_t - \bar{\theta}^x_t \rangle| \le 48\sqrt{\frac{n^3}{t^3}} \;\middle|\; \{x_{t,s}\}_{s \in [B_t]} \right) \ge 1 - \frac{\delta}{8t^2}. \tag{2}$$
Now consider a sequence of vectors $x_{t,1}, x_{t,2}, \ldots, x_{t,B_t}$ generated by our algorithm and define the random variables $N_{t,0} := |\{s : x_{t,s} = \bar{x}_t\}|$ and, for all $i \in [n]$, $N_{t,i} := |\{s : x_{t,s} = (1 - \lambda_t)\bar{x}_t + \lambda_t x_i\}|$. Observe that $\mathbb{E}[N_{t,0}] = B_t/2$ and $\mathbb{E}[N_{t,i}] = B_t/(2n)$ for all $i \in [n]$. Consider $i \in [n] \cup \{0\}$. Due to the Chernoff bound, for any phase $t$ such that $B_t \ge 32n^2\ln(8t^2/\delta)$, we have
$$\Pr(N_{t,i} < \mathbb{E}[N_{t,i}]/2) \le \exp\left(-\frac{\mathbb{E}[N_{t,i}]}{8}\right) \le \exp\left(-\frac{B_t}{16n}\right) \quad (\text{as } \mathbb{E}[N_{t,i}] \ge B_t/(2n)) \le \exp\left(-2n\ln(8t^2/\delta)\right) \quad (\text{as } B_t \ge 32n^2\ln(8t^2/\delta)) \le \frac{\delta}{16nt^2},$$
where the last step follows from the fact that $x\ln(1/y) \ge \ln(x/y)$ for all $x \ge 2$ and $0 < y \le 1/2$. Now, by a union bound, for any phase $t$ such that $B_t \ge 32n^2\ln(8t^2/\delta)$, we have
$$\Pr\left( N_{t,0} \ge B_t/4 \;\text{ and }\; N_{t,i} \ge B_t/(4n) \;\forall i \in [n] \right) \ge 1 - \frac{\delta}{8t^2}. \tag{3}$$
Hence, due to Eq. (2) and Eq. (3), we have
$$\Pr\left( \sup_{x \in \mathcal{X}} |\langle x, \hat{\theta}^x_t - \bar{\theta}^x_t \rangle| \le 48\sqrt{\frac{n^3}{t^3}} \right) \ge 1 - \frac{\delta}{4t^2}. \tag{4}$$
Note that if $B_t < 32n^2\ln(8t^2/\delta)$, then we set $\hat{\theta}^x_t = 0$ and the above inequality holds trivially.

B Proofs for Primal-Dual Norms

First, we prove that $\|z\|_{*,\mathcal{X}}$ is a norm.

(i) Positive definiteness. If $z = 0$, then $\|z\|_{*,\mathcal{X}} = 0$. If $\|z\|_{*,\mathcal{X}} = 0$, then $|\langle x, z \rangle| = 0$ for all $x \in \mathcal{X}$. As $\mathrm{span}(\mathcal{X}) = \mathbb{R}^d$, we have $\langle v, z \rangle = 0$ for all $v \in \mathbb{R}^d$. This implies that $\|z\|_2 = 0$, so $z = 0$.

(ii) Absolute homogeneity. For any scalar $\alpha$, $\|\alpha z\|_{*,\mathcal{X}} = \max_{x \in \mathcal{X}} |\langle x, \alpha z \rangle| = |\alpha|\max_{x \in \mathcal{X}} |\langle x, z \rangle| = |\alpha|\,\|z\|_{*,\mathcal{X}}$.

(iii) Triangle inequality. For any $z, w$,
$$\|z + w\|_{*,\mathcal{X}} = \max_{x \in \mathcal{X}} |\langle x, z + w \rangle| \le \max_{x \in \mathcal{X}}\left( |\langle x, z \rangle| + |\langle x, w \rangle| \right) \le \|z\|_{*,\mathcal{X}} + \|w\|_{*,\mathcal{X}}.$$
Thus $\|\cdot\|_{*,\mathcal{X}}$ is a norm. Next, we prove that $\|z\|_\mathcal{X}$ is a norm.

(i) Positive definiteness. If $z = 0$, then $\|z\|_\mathcal{X} = 0$. If $z \ne 0$, then $\|z\|_\mathcal{X} \ge \left\langle \frac{z}{\|z\|_{*,\mathcal{X}}}, z \right\rangle = \frac{\|z\|_2^2}{\|z\|_{*,\mathcal{X}}} > 0$.

(ii) Absolute homogeneity. For any scalar $\alpha$,
$$\|\alpha z\|_\mathcal{X} = \max_{\|y\|_{*,\mathcal{X}} \le 1} \langle y, \alpha z \rangle = \max_{\|y\|_{*,\mathcal{X}} \le 1} |\langle y, \alpha z \rangle| = |\alpha|\max_{\|y\|_{*,\mathcal{X}} \le 1} |\langle y, z \rangle| = |\alpha|\max_{\|y\|_{*,\mathcal{X}} \le 1} \langle y, z \rangle = |\alpha|\,\|z\|_\mathcal{X},$$
where the second equality holds since $\|y\|_{*,\mathcal{X}} = \|-y\|_{*,\mathcal{X}}$.

(iii) Triangle inequality. For any $z, w$,
$$\|z + w\|_\mathcal{X} = \max_{\|y\|_{*,\mathcal{X}} \le 1} |\langle y, z + w \rangle| \le \max_{\|y\|_{*,\mathcal{X}} \le 1}\left( |\langle y, z \rangle| + |\langle y, w \rangle| \right) \le \|z\|_\mathcal{X} + \|w\|_\mathcal{X}.$$
Thus $\|\cdot\|_\mathcal{X}$ is a norm.

Now we show that $\|z\|_\mathcal{X}$ and $\|z\|_{*,\mathcal{X}}$ form a primal-dual norm pair. It suffices to show $\max_{x \in \mathcal{X}} |\langle x, z \rangle| = \max_{\|y\|_\mathcal{X} \le 1} \langle y, z \rangle$ for any $z$. Let $K := \mathrm{conv}(\mathcal{X} \cup -\mathcal{X})$ and define $B := \{y : \|y\|_\mathcal{X} \le 1\}$. We begin by showing $B = K$.

(i) $K \subseteq B$. Consider $y \in K$. Then, due to Carathéodory's theorem, there exists a subset $\{x_1, x_2, \ldots, x_\ell\} \subseteq \mathcal{X}$ such that $y = \sum_{i=1}^{\ell} \lambda_i s_i x_i$, where $\lambda_i \ge 0$, $\sum_{i=1}^{\ell} \lambda_i = 1$, and $s_i \in \{-1, +1\}$. Now we have the following:
$$\|y\|_\mathcal{X} = \max_{\|z\|_{*,\mathcal{X}} \le 1} \langle z, y \rangle = \max_{\|z\|_{*,\mathcal{X}} \le 1} \sum_i \lambda_i s_i \langle z, x_i \rangle \le \max_{\|z\|_{*,\mathcal{X}} \le 1} \sum_i \lambda_i |\langle z, x_i \rangle| \le \max_{\|z\|_{*,\mathcal{X}} \le 1} \sum_i \lambda_i \cdot 1 = 1.$$
Hence $y \in B$. Since $y$ was chosen arbitrarily, we have $K \subseteq B$.

(ii) $B \subseteq K$. Consider $y \notin K$. As $K$ is compact and convex, by the hyperplane separation theorem there exists a vector $z$ such that $\langle y, z \rangle > \max_{x \in K} \langle x, z \rangle$. Set $t := \max_{x \in K} \langle x, z \rangle = \max_{x \in \mathcal{X}} |\langle x, z \rangle|$.
The last equality follows since $K = \mathrm{conv}(\mathcal{X} \cup -\mathcal{X})$. Note that $t > 0$, and for $u := z/t$,
$$\|u\|_{*,\mathcal{X}} = \max_{x \in \mathcal{X}} |\langle x, u \rangle| = \frac{1}{t}\max_{x \in \mathcal{X}} |\langle x, z \rangle| = 1.$$
Now we have the following:
$$\|y\|_\mathcal{X} = \max_{\|x\|_{*,\mathcal{X}} \le 1} \langle x, y \rangle \ge \langle u, y \rangle = \frac{\langle z, y \rangle}{t} > 1.$$
Thus $y \notin B$. Therefore $B \subseteq K$. Hence, for any $z$, we have
$$\max_{\|y\|_\mathcal{X} \le 1} \langle y, z \rangle = \max_{y \in B} \langle y, z \rangle = \max_{y \in K} \langle y, z \rangle = \max_{x \in \mathcal{X}} |\langle x, z \rangle|.$$

C Optimistic FTRL Algorithm and RVU Property

For completeness, we adapt Proposition 7 of Syrgkanis et al. (2015) to general convex sets.

Lemma C.1 (RVU property, Syrgkanis et al. (2015)). Let $\mathcal{X} \subset \mathbb{R}^d$ be compact convex. Let $R : \mathcal{X} \to \mathbb{R}$ be $\sigma$-strongly convex w.r.t. a norm $\|\cdot\|$ with dual $\|\cdot\|_*$. Fix a step size $\eta > 0$ and initialize $\tilde{x}_0 = \tilde{x}_1 = \arg\min_{x \in \mathcal{X}} R(x)$. Assume $u_0 = 0$. For utilities $u_t \in \mathbb{R}^d$, define the OFTRL decisions
$$\tilde{x}_{t+1} \in \arg\max_{x \in \mathcal{X}}\left\{ \left\langle x, \sum_{\ell=1}^{t} u_\ell + u_t \right\rangle - \frac{1}{\eta}R(x) \right\}.$$
Then for every $x \in \mathcal{X}$,
$$\sum_{t=1}^{T}\langle u_t, x - \tilde{x}_t \rangle \le \frac{D_R(x, \tilde{x}_1)}{\eta} + \frac{\eta}{\sigma}\sum_{t=1}^{T}\|u_t - u_{t-1}\|_*^2 - \frac{\sigma}{4\eta}\sum_{t=1}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2.$$

Proof. We start by restating the updates we make with utility sequence $\{u_k\}$, $k \in [t-1]$, and the regularizer $R$. Lemma 1 in the lecture notes of Luo gives
$$\sum_{t=1}^{T}\langle u_t, x - \tilde{x}_t \rangle \le \frac{D_R(x, \tilde{x}'_1)}{\eta} + \sum_{t=1}^{T}\langle u_t - u_{t-1}, \tilde{x}'_{t+1} - \tilde{x}_t \rangle - \frac{1}{\eta}\sum_{t=1}^{T}\left( D_R(\tilde{x}_t, \tilde{x}'_t) + D_R(\tilde{x}'_{t+1}, \tilde{x}_t) \right),$$
where $\tilde{x}'_t$ is a hypothetical "vanilla" FTRL player that does not use the optimistic guess $u_{t-1}$. Similar to how Theorem 1 is shown in Luo, we bound the middle term with the help of Lemma 4 in the Lecture 2 notes of the same lecture note series:
$$\|\tilde{x}_t - \tilde{x}'_{t+1}\| \le \frac{\eta}{\sigma}\left\| \left( \sum_{k=1}^{t-1} u_k + u_{t-1} \right) - \sum_{k=1}^{t} u_k \right\|_* = \frac{\eta}{\sigma}\|u_{t-1} - u_t\|_*.$$
Via a Cauchy-Schwarz step,
$$\langle u_{t-1} - u_t, \tilde{x}_t - \tilde{x}'_{t+1} \rangle \le \|u_{t-1} - u_t\|_* \cdot \|\tilde{x}_t - \tilde{x}'_{t+1}\| \le \frac{\eta}{\sigma}\|u_t - u_{t-1}\|_*^2.$$
Summing over $t$ and putting everything together,
$$\sum_{t=1}^{T}\langle u_{t-1} - u_t, \tilde{x}_t - \tilde{x}'_{t+1} \rangle \le \frac{\eta}{\sigma}\sum_{t=1}^{T}\|u_t - u_{t-1}\|_*^2.$$
Now we bound the Bregman terms similarly to how it is done in the lecture notes. We first drop the non-negative terms at the boundaries and shift the index to get a lower bound:
$$\sum_{t=1}^{T}\left( D_R(\tilde{x}_t, \tilde{x}'_t) + D_R(\tilde{x}'_{t+1}, \tilde{x}_t) \right) \ge \sum_{t=2}^{T}\left( D_R(\tilde{x}_t, \tilde{x}'_t) + D_R(\tilde{x}'_t, \tilde{x}_{t-1}) \right) \ge \frac{\sigma}{2}\sum_{t=2}^{T}\left( \|\tilde{x}_t - \tilde{x}'_t\|^2 + \|\tilde{x}'_t - \tilde{x}_{t-1}\|^2 \right) \quad (R \text{ is } \sigma\text{-strongly convex}) \ge \frac{\sigma}{4}\sum_{t=2}^{T}\left( \|\tilde{x}_t - \tilde{x}'_t\| + \|\tilde{x}'_t - \tilde{x}_{t-1}\| \right)^2 \quad (a^2 + b^2 \ge (a+b)^2/2) \ge \frac{\sigma}{4}\sum_{t=2}^{T}\|(\tilde{x}_t - \tilde{x}'_t) + (\tilde{x}'_t - \tilde{x}_{t-1})\|^2 \quad (\text{triangle inequality}) = \frac{\sigma}{4}\sum_{t=2}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2.$$
Thus we have an upper bound on the negative Bregman divergence term:
$$-\frac{1}{\eta}\sum_{t=1}^{T}\left( D_R(\tilde{x}_t, \tilde{x}'_t) + D_R(\tilde{x}'_{t+1}, \tilde{x}_t) \right) \le -\frac{\sigma}{4\eta}\sum_{t=2}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2.$$
Putting everything together and using the fact that $\tilde{x}_1 = \tilde{x}'_1$,
$$\sum_{t=1}^{T}\langle u_t, x - \tilde{x}_t \rangle \le \frac{D_R(x, \tilde{x}'_1)}{\eta} + \frac{\eta}{\sigma}\sum_{t=1}^{T}\|u_t - u_{t-1}\|_*^2 - \frac{\sigma}{4\eta}\sum_{t=2}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2. \qquad \square$$

D RVU with Estimation Error

Lemma D.1 (RVU with estimation error). Let $\Delta^x_t := \hat{\theta}^x_t - \bar{\theta}^x_t$.
Then for any $x \in \mathcal{X}$ and any $\eta > 0$, we have the following for the row player:
$$\sum_{t=1}^{T}\langle u^x_t, x - \tilde{x}_t \rangle \le \frac{D_\phi(x, \tilde{x}_1)}{\eta} + 2\|T\Delta^x_T\|_{*,\mathcal{X}} + 36\eta\sum_{t=1}^{T}\|t\Delta^x_t\|_{*,\mathcal{X}}^2 + 4\eta\sum_{t=1}^{T}\|u^x_t - u^x_{t-1}\|_{*,\mathcal{X}}^2 - \frac{3}{16\eta}\sum_{t=1}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|_\mathcal{X}^2 + O(\eta + \log T).$$

Proof. For simplicity of presentation, we use $\|\cdot\|$ to denote the primal norm $\|\cdot\|_\mathcal{X}$ and $\|\cdot\|_*$ to denote the dual norm $\|\cdot\|_{*,\mathcal{X}}$. Recall that $u^x_t = A\tilde{y}_t$ is the true utility of each phase $t$. Let $\bar{u}^x_t := t\bar{\theta}^x_t - (t-1)\bar{\theta}^x_{t-1}$ denote the pseudo-utility for phase $t$. Note that $\bar{u}^x_t = \mathbb{E}[\hat{u}^x_t]$. Let $\mathcal{D}_t$ be the distribution over $\mathcal{Y}$ under which, with probability $1/2$, we choose $\bar{y}_t$ and, with the remaining probability, we sample an element uniformly from the column player's exploration set $\{y_1, y_2, \ldots, y_m\}$. Recall that $\hat{y}_t = \mathbb{E}[y_{t,s}]$. Let $\hat{z}_t := \mathbb{E}_{z \sim \mathcal{D}_t}[z]$. Observe that $\hat{y}_t = (1 - \lambda_t)\bar{y}_t + \lambda_t\hat{z}_t$. Recall that $\bar{y}_t = \frac{1}{t}\sum_{\ell=1}^{t}\tilde{y}_\ell$. Now observe that $\bar{u}^x_t$ can be further simplified as follows:
$$\bar{u}^x_t = t\bar{\theta}^x_t - (t-1)\bar{\theta}^x_{t-1} = t \cdot A\hat{y}_t - (t-1) \cdot A\hat{y}_{t-1} = (1 - \lambda_t)A\tilde{y}_t + (\lambda_{t-1} - \lambda_t)\sum_{s=1}^{t-1} A\tilde{y}_s + t\lambda_t A\hat{z}_t - (t-1)\lambda_{t-1}A\hat{z}_{t-1}. \tag{5}$$
Now, we have the following due to the triangle inequality:
$$\|\bar{u}^x_t - \bar{u}^x_{t-1}\|_* \le \|u^x_t - u^x_{t-1}\|_* + \lambda_t\|u^x_t\|_* + \lambda_{t-1}\|u^x_{t-1}\|_* + (\lambda_{t-1} - \lambda_t)\sum_{s=1}^{t-1}\|u^x_s\|_* + t\lambda_t\|A\hat{z}_t\|_* + (t-1)\lambda_{t-1}\|A\hat{z}_{t-1}\|_* + (\lambda_{t-2} - \lambda_{t-1})\sum_{s=1}^{t-2}\|u^x_s\|_* + (t-1)\lambda_{t-1}\|A\hat{z}_{t-1}\|_* + (t-2)\lambda_{t-2}\|A\hat{z}_{t-2}\|_* \le \|u^x_t - u^x_{t-1}\|_* + O(1/t),$$
where the last inequality follows from the fact that $\lambda_t = \frac{1}{t^2}$ and $|\langle x, Ay \rangle| \le 1$ for all $x \in \mathcal{X}, y \in \mathcal{Y}$. Next, we have the following due to Eq. (5) and the fact that $|\langle x, y \rangle| \le \|x\| \cdot \|y\|_*$:
$$\langle u^x_t, x - \tilde{x}_t \rangle \le \langle \bar{u}^x_t, x - \tilde{x}_t \rangle + \lambda_t\|u^x_t\|_*\,\|x - \tilde{x}_t\| + (\lambda_{t-1} - \lambda_t)\sum_{s=1}^{t-1}\|u^x_s\|_*\,\|x - \tilde{x}_t\| + t\lambda_t\|A\hat{z}_t\|_*\,\|x - \tilde{x}_t\| + (t-1)\lambda_{t-1}\|A\hat{z}_{t-1}\|_*\,\|x - \tilde{x}_t\| \le \langle \bar{u}^x_t, x - \tilde{x}_t \rangle + 2\lambda_t\|u^x_t\|_* + 2(\lambda_{t-1} - \lambda_t)\sum_{s=1}^{t-1}\|u^x_s\|_* + 2t\lambda_t\|A\hat{z}_t\|_* + 2(t-1)\lambda_{t-1}\|A\hat{z}_{t-1}\|_* \quad (\text{as } \|x - \tilde{x}_t\| \le \|x\| + \|\tilde{x}_t\| \le 2) \le \langle \bar{u}^x_t, x - \tilde{x}_t \rangle + O(1/t),$$
where the last inequality follows from the fact that $\lambda_t = \frac{1}{t^2}$ and $|\langle x, Ay \rangle| \le 1$ for all $x \in \mathcal{X}, y \in \mathcal{Y}$.

Recall that $\Delta^x_t := \hat{\theta}^x_t - \bar{\theta}^x_t$. Now we define $\delta_t := \hat{u}^x_t - \bar{u}^x_t = \hat{u}^x_t - \mathbb{E}[\hat{u}^x_t]$. Then
$$\delta_t = t\hat{\theta}^x_t - (t-1)\hat{\theta}^x_{t-1} - t\bar{\theta}^x_t + (t-1)\bar{\theta}^x_{t-1} = t\Delta^x_t - (t-1)\Delta^x_{t-1}.$$
The pseudo-regret can be written in terms of the regret against the estimated utilities and the error term $\delta_t$:
$$\sum_{t=1}^{T}\langle \bar{u}^x_t, x - \tilde{x}_t \rangle = \sum_{t=1}^{T}\langle \hat{u}^x_t, x - \tilde{x}_t \rangle - \sum_{t=1}^{T}\langle \delta_t, x - \tilde{x}_t \rangle.$$
We apply Lemma 3.1 with $\sigma := 1$ to the sequence the algorithm actually sees, $\hat{u}^x_t$. The lemma gives a bound on $\sum_{t=1}^{T}\langle \hat{u}^x_t, x - \tilde{x}_t \rangle$:
$$\sum_{t=1}^{T}\langle \hat{u}^x_t, x - \tilde{x}_t \rangle \le \frac{D_R(x, \tilde{x}_1)}{\eta} + \frac{\eta}{\sigma}\sum_{t=1}^{T}\|\hat{u}^x_t - \hat{u}^x_{t-1}\|_*^2 - \frac{\sigma}{4\eta}\sum_{t=1}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2.$$
Substituting this back, we get our main inequality:
$$\sum_{t=1}^{T}\langle \bar{u}^x_t, x - \tilde{x}_t \rangle \le \frac{D_R(x, \tilde{x}_1)}{\eta} + \underbrace{\frac{\eta}{\sigma}\sum_{t=1}^{T}\|\hat{u}^x_t - \hat{u}^x_{t-1}\|_*^2}_{\text{Term II}} + \underbrace{\sum_{t=1}^{T}\langle \delta_t, \tilde{x}_t - x \rangle}_{\text{Term I}} - \frac{\sigma}{4\eta}\sum_{t=1}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2. \tag{6}$$

Bounding Term I. We rewrite Term I using summation by parts:
$$\text{Term I} = \sum_{t=1}^{T}\langle t\Delta^x_t - (t-1)\Delta^x_{t-1}, \tilde{x}_t - x \rangle = \langle T\Delta^x_T, \tilde{x}_T - x \rangle + \sum_{t=1}^{T-1}\langle t\Delta^x_t, \tilde{x}_t - \tilde{x}_{t+1} \rangle \le \|T\Delta^x_T\|_*\,\|\tilde{x}_T - x\| + \sum_{t=1}^{T-1}\|t\Delta^x_t\|_*\,\|\tilde{x}_t - \tilde{x}_{t+1}\| \le 2\|T\Delta^x_T\|_* + \sum_{t=1}^{T-1}\|t\Delta^x_t\|_*\,\|\tilde{x}_t - \tilde{x}_{t+1}\|,$$
since $x, \tilde{x}_t \in \mathcal{X}$ gives $\|\tilde{x}_T - x\| \le 2\sup_{z \in \mathcal{X}}\|z\| \le 2$ (recall $\sup_{z \in \mathcal{X}}\|z\|_\mathcal{X} \le 1$, as $\mathcal{X} \subseteq K$, the unit ball of $\|\cdot\|_\mathcal{X}$). Combining the bound above with the negative movement term from Eq. (6), and using a separate $\frac{\sigma}{16\eta}$ portion for this bound, we consider
$$\sum_{t=1}^{T-1}\|t\Delta^x_t\|_*\,\|\tilde{x}_t - \tilde{x}_{t+1}\| - \frac{\sigma}{16\eta}\sum_{t=2}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2.$$
Using Young's inequality, $ab \le \frac{a^2}{2c} + \frac{cb^2}{2}$, with $a = \|t\Delta^x_t\|_*$, $b = \|\tilde{x}_t - \tilde{x}_{t+1}\|$, and $c = \frac{\sigma}{8\eta}$,
$$\|t\Delta^x_t\|_*\,\|\tilde{x}_t - \tilde{x}_{t+1}\| \le \frac{\|t\Delta^x_t\|_*^2}{2(\sigma/8\eta)} + \frac{(\sigma/8\eta)\|\tilde{x}_t - \tilde{x}_{t+1}\|^2}{2} = \frac{4\eta}{\sigma}\|t\Delta^x_t\|_*^2 + \frac{\sigma}{16\eta}\|\tilde{x}_t - \tilde{x}_{t+1}\|^2.$$
Summing this from $t = 1$ to $T-1$, the movement terms cancel the $-\frac{\sigma}{16\eta}\sum_{t=2}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2$ term, leaving only second-order error terms. Thus the contribution from Term I is bounded by
$$\text{Term I} - \frac{\sigma}{16\eta}\sum_{t=2}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2 \le 2\|T\Delta^x_T\|_* + \frac{4\eta}{\sigma}\sum_{t=1}^{T-1}\|t\Delta^x_t\|_*^2.$$

Bounding Term II. We use the inequality $(a+b)^2 \le 2a^2 + 2b^2$ and the triangle inequality:
$$\|\hat{u}^x_t - \hat{u}^x_{t-1}\|_*^2 = \|(\bar{u}^x_t - \bar{u}^x_{t-1}) + (\delta_t - \delta_{t-1})\|_*^2 \le 2\|\bar{u}^x_t - \bar{u}^x_{t-1}\|_*^2 + 2\|\delta_t - \delta_{t-1}\|_*^2 \le 4\|u^x_t - u^x_{t-1}\|_*^2 + 4\|\delta_t\|_*^2 + 4\|\delta_{t-1}\|_*^2 + O(1/t^2).$$
Moreover,
$$\sum_{t=1}^{T}\left( \|\delta_t\|_*^2 + \|\delta_{t-1}\|_*^2 \right) \le 2\sum_{t=1}^{T}\|\delta_t\|_*^2 \le 4\sum_{t=1}^{T}\|t\Delta^x_t\|_*^2 + 4\sum_{t=1}^{T}\|(t-1)\Delta^x_{t-1}\|_*^2 \le 8\sum_{t=1}^{T}\|t\Delta^x_t\|_*^2.$$
Hence,
$$\text{Term II} = \frac{\eta}{\sigma}\sum_{t=1}^{T}\|\hat{u}^x_t - \hat{u}^x_{t-1}\|_*^2 \le \frac{\eta}{\sigma}\left( 4\sum_{t=1}^{T}\|u^x_t - u^x_{t-1}\|_*^2 + 32\sum_{t=1}^{T}\|t\Delta^x_t\|_*^2 \right) + O\left(\frac{\eta}{\sigma}\right).$$

Combining the bounds:
$$\sum_{t=1}^{T}\langle \bar{u}^x_t, x - \tilde{x}_t \rangle \le \frac{D_R(x, \tilde{x}_1)}{\eta} + \text{Term II} + \text{Term I} - \frac{\sigma}{4\eta}\sum_{t=1}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2 = \frac{D_R(x, \tilde{x}_1)}{\eta} + \text{Term II} + \left( \text{Term I} - \frac{\sigma}{16\eta}\sum_{t=2}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2 \right) - \frac{\sigma}{4\eta}\|\tilde{x}_1 - \tilde{x}_0\|^2 - \left( \frac{\sigma}{4\eta} - \frac{\sigma}{16\eta} \right)\sum_{t=2}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2 \le \frac{D_R(x, \tilde{x}_1)}{\eta} + 2\|T\Delta^x_T\|_* + \frac{36\eta}{\sigma}\sum_{t=1}^{T}\|t\Delta^x_t\|_*^2 + \frac{4\eta}{\sigma}\sum_{t=1}^{T}\|u^x_t - u^x_{t-1}\|_*^2 - \frac{\sigma}{4\eta}\|\tilde{x}_1 - \tilde{x}_0\|^2 - \frac{3\sigma}{16\eta}\sum_{t=2}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2 + O\left(\frac{\eta}{\sigma}\right).$$
Since $\frac{\sigma}{4\eta} \ge \frac{3\sigma}{16\eta}$, we can weaken the bound on the $t = 1$ movement term to get a single, compact sum:
$$-\frac{\sigma}{4\eta}\|\tilde{x}_1 - \tilde{x}_0\|^2 - \frac{3\sigma}{16\eta}\sum_{t=2}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2 \le -\frac{3\sigma}{16\eta}\sum_{t=1}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2.$$
Finally, recalling that $\sum_{t=1}^{T}\langle u^x_t, x - \tilde{x}_t \rangle \le \sum_{t=1}^{T}\langle \bar{u}^x_t, x - \tilde{x}_t \rangle + O(\log T)$ (summing the per-phase $O(1/t)$ terms), this gives the final result:
$$\sum_{t=1}^{T}\langle u^x_t, x - \tilde{x}_t \rangle \le \frac{D_R(x, \tilde{x}_1)}{\eta} + 2\|T\Delta^x_T\|_* + \frac{36\eta}{\sigma}\sum_{t=1}^{T}\|t\Delta^x_t\|_*^2 + \frac{4\eta}{\sigma}\sum_{t=1}^{T}\|u^x_t - u^x_{t-1}\|_*^2 - \frac{3\sigma}{16\eta}\sum_{t=1}^{T}\|\tilde{x}_t - \tilde{x}_{t-1}\|^2 + O\left(\frac{\eta}{\sigma} + \log T\right). \qquad \square$$