A Generalized Minimax Q-learning Algorithm for Two-Player Zero-Sum Stochastic Games
Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar

The authors are with the Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560012, India (e-mails: raghub@iisc.ac.in; chandramouli@iisc.ac.in; shalabh@iisc.ac.in).

0018-9286 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. A version of this paper is accepted for publication in the IEEE Transactions on Automatic Control. DOI: 10.1109/TAC.2022.3159453

Abstract

We consider the problem of two-player zero-sum games. This problem is formulated as a min-max Markov game in the literature. The solution of this game, which is the min-max payoff starting from a given state, is called the min-max value of the state. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation, which has been successfully applied in the literature to obtain a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the setting of two-player zero-sum games. We show that, under a special structure on the game, this technique facilitates faster computation of the min-max value of the states. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm utilizing stochastic approximation techniques, under an assumption on the boundedness of iterates. Through experiments, we demonstrate the effectiveness of our proposed algorithm.

I. INTRODUCTION

In two-player zero-sum games, two agents compete against each other in a common environment. Based on the actions taken by the agents, they receive a payoff corresponding to the current state, and the environment transitions to the next state. The objective of one agent (say agent 1) is to compute a sequence of actions, starting from a given state, that maximizes the total discounted payoff. On the other hand, the objective of the second agent (agent 2) is to compute a sequence of actions that minimizes the total discounted payoff. This problem is formulated as a Markov game, and the value obtained as the min-max of the total expected discounted payoff starting from state $i$ is called the min-max value of state $i$. The policies that achieve this min-max value are the optimal policies of the agents. When the model information of the environment is known, a Bellman operator for the two-player zero-sum game [1] is constructed and a fixed-point iteration scheme analogous to value iteration is used to compute the min-max value. However, in most two-player zero-sum game settings, the model information is assumed unknown to the players and the objective is to compute optimal policies utilizing the state and payoff samples obtained from the environment. In our work, we construct a modified min-max Q-Bellman operator by using the technique of successive relaxation for Markov games and prove that its contraction factor is at most the contraction factor of the standard min-max Bellman operator.
This implies that, when the model information is known, the min-max value can be computed faster using our proposed scheme. We then proceed to develop a generalized minimax Q-learning algorithm based on the modified min-max Q-Bellman operator. The minimax Q-learning algorithm was presented in [2]. Two-player general-sum games are those where the payoffs of the agents are unrelated in general. If the payoff of one agent is the negative of the payoff of the other agent, the game reduces to a zero-sum game. A Nash Q-learning algorithm for solving general-sum games is proposed in [3]. In [4], Friend-or-Foe (FF) Q-learning for general-sum games is proposed and is shown to have stronger convergence properties compared to Nash Q-learning. A generalization of Nash Q-learning and FF Q-learning, namely correlated Q-learning, is discussed in [5]. In [6], desirable properties for an agent learning in multi-agent scenarios are studied and a new learning algorithm, namely "WoLF policy hill climbing", is proposed. Surveys of algorithms for multi-agent learning and multi-agent reinforcement learning are provided in [7, 8]. We now discuss some variants of minimax Q-learning in the literature. In [9], the minimax TD-learning algorithm, which utilizes the concept of temporal difference learning, is proposed. The minimax version of the Deep Deterministic Policy Gradient algorithm has recently been developed in [10]; however, no convergence proofs or theoretical guarantees are provided there. The concept of successive relaxation in the context of Markov Decision Processes (MDPs) was first applied in [11]. In our recent work [12], we proposed successive over-relaxation Q-learning for model-free MDPs (i.e., in the single-agent scenario). In this work, we extend the concept of successive relaxation to two-player zero-sum games and propose a provably convergent generalized minimax Q-learning algorithm. The contributions of the paper are as follows:

• We present a modified min-max Q-Bellman operator for two-player zero-sum Markov games and show that the operator is a max-norm contraction.
• We show that, under some assumptions, the contraction factor of the modified min-max Q-Bellman operator is smaller than that of the standard min-max Q-Bellman operator.
• We propose a model-free generalized minimax Q-learning algorithm and prove its almost sure convergence using ODE-based analysis of stochastic approximation, under an assumption on the boundedness of iterates.
• We discuss an interesting relation between standard minimax Q-learning and our proposed algorithm.
• Finally, through experimental evaluation, we show that our proposed algorithm has better performance compared to the standard minimax Q-learning algorithm.

We note here that the Successive Over-Relaxation (SOR) technique utilized to derive our algorithm and the stochastic approximation arguments employed in the convergence analysis are well known in the literature. Our contribution comprises applying these techniques to derive and analyze a generalized minimax Q-learning algorithm with faster convergence.

II. BACKGROUND AND PRELIMINARIES

In this paper, we consider the setting of two-player zero-sum Markov games [13]. The two players in the game are referred to as agent 1 and agent 2.
A two-player zero-sum Markov game is characterized by the tuple $(S, U, V, p, r, \alpha)$, where $S$ is the set of states that both agents observe, $U$ is the finite set of actions of agent 1, $V$ is the finite set of actions of agent 2, and $p$ denotes the transition probability rule, i.e., $p(j \mid i, u, v)$ denotes the probability of transition to state $j \in S$ from state $i \in S$ when actions $u \in U$ and $v \in V$ are chosen by agents 1 and 2, respectively. Let $r(i, u, v)$ denote the single-stage payoff obtained by agent 1 in state $i$ when actions $u$ and $v$ are chosen by agents 1 and 2, respectively. Note that, in the case of a zero-sum Markov game, the payoff of agent 2 is the negative of the payoff obtained by agent 1. Also, $0 \le \alpha < 1$ denotes the discount factor. The goals of the two agents in the Markov game are to individually learn the optimal policies $\pi_1 : S \to \Delta_{|U|}$ and $\pi_2 : S \to \Delta_{|V|}$, respectively, where $\Delta_d$ denotes the probability simplex in $\mathbb{R}^d$ and $\pi_1(i)$ (resp. $\pi_2(i)$) indicates the probability distribution over actions to be taken by agent 1 (resp. agent 2) in state $i$ that maximizes (resp. minimizes) the discounted objective given by:

$\min_{\pi_2} \max_{\pi_1} \sum_{t=0}^{\infty} E\Big[ \sum_{u=1}^{|U|} \sum_{v=1}^{|V|} \alpha^t x_t^u y_t^v \, r(s_t, u, v) \,\Big|\, s_0 = i \Big]$,   (1)

where $s_t$ is the state of the game at time $t$, $\pi_1(s_t) = (x_t^u)_{u=1}^{|U|} \in \Delta_{|U|}$, $\pi_2(s_t) = (y_t^v)_{v=1}^{|V|} \in \Delta_{|V|}$, $\forall t \ge 0$, and $E[\cdot]$ is the expectation taken over the states obtained over times $t = 1, \ldots, \infty$. Let $J^*(i)$ denote the min-max value in state $i$ obtained by solving (1). It can be shown ([14, Chapter 7]) that the min-max value function $J^*$ satisfies the following fixed-point equation in $J \in \mathbb{R}^{|S|}$:

$J(i) = \mathrm{val}[Q(i)], \ \forall i \in S$,   (2)

where $Q(i)$ is a matrix of size $|U| \times |V|$ whose $(u, v)$-th entry is given by $Q(i, u, v) = r(i, u, v) + \alpha \sum_{j \in S} p(j \mid i, u, v) J(j)$, and the function $\mathrm{val}[A]$, for a given $m \times n$ matrix $A$, is defined as follows:

$\mathrm{val}[A] = \min_{y} \max_{x} x^T A y$,   (3)

where $x \in \Delta_m$ and $y \in \Delta_n$, respectively. The system of equations in (2) can be rewritten as:

$J = TJ$,   (4)

where the operator $T$, for a given $J \in \mathbb{R}^{|S|}$, is defined as:

$(TJ)(i) = \mathrm{val}[Q(i)], \ \forall i \in S$.   (5)

The operator $T$ and the set of equations (2) are analogous to the Bellman operator and the Bellman optimality condition, respectively, for Markov Decision Processes (MDPs) [14].

III. THE PROPOSED ALGORITHM

We describe a single iteration of the synchronous version [15] of our proposed algorithm in Algorithm 1 below. At each iteration $n$, the Q-values of all state-action tuples $Q(i, u, v)$ are updated as shown in step 4 of Algorithm 1.

Remark 1. Note that step 3 of Algorithm 1 requires computation of $\mathrm{val}[Q_n(\cdot)]$, which is a linear program. Also observe that the generalized minimax Q-learning algorithm only requires an additional computation of $\mathrm{val}[Q_n(i_n)]$ compared to standard minimax Q-learning.

Algorithm 1 Generalized minimax Q-learning

Input:
$w$: Choose $w \in (0, w^*]$ (with $w^*$ as in (6)) as the relaxation parameter.
$\{Y_n\}$: a sequence of vectors of size $|S \times U \times V|$, where $Y_n(i, u, v)$ indicates the next state obtained when actions $u, v$ are taken in state $i$.
$r(i, u, v)$: single-stage payoff available when actions $u, v$ are chosen in state $i$.
$\gamma_n$: step-size chosen at time $n$.
$Q_n$: the estimate of $Q^{\dagger}$ (see (8)) at time $n$.

1: procedure GENERALIZED MINIMAX Q-LEARNING:
2: for $(i, u, v) \in S \times U \times V$ do
3:     $d_{n+1}(i, u, v) = w \big( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] \big) + (1 - w)\, \mathrm{val}[Q_n(i)]$
4:     $Q_{n+1}(i, u, v) = (1 - \gamma_n) Q_n(i, u, v) + \gamma_n\, d_{n+1}(i, u, v)$
5: return $Q_{n+1}$

Remark 2. Let $Q_N$ be the solution obtained by Algorithm 1 upon termination after $N$ iterations. Then the approximate min-max value $\tilde{J}(i)$ for a given state $i$ is obtained as $\tilde{J}(i) = \mathrm{val}[Q_N(i)]$, and the corresponding approximate policies of the agents are obtained as $(\tilde{\pi}_1(i), \tilde{\pi}_2(i)) \in \arg \mathrm{val}[Q_N(i)]$.

IV. CONVERGENCE ANALYSIS

Let $\Delta_d := \{ x \in \mathbb{R}^d : x_i \ge 0, \ \sum_{i=1}^{d} x_i = 1 \}$ denote the probability simplex in $\mathbb{R}^d$. For a matrix $A \in \mathbb{R}^{m \times n}$, $x \in \Delta_m$ and $y \in \Delta_n$, recall that the value of the matrix $A$ is defined as $\mathrm{val}[A] := \min_{y \in \Delta_n} \max_{x \in \Delta_m} x^T A y$. Note that the norm considered in this section is the max-norm, i.e., the norm of a vector $x \in \mathbb{R}^d$ is $\|x\| := \max_{1 \le i \le d} |x(i)|$. We first derive a few properties of the $\mathrm{val}[\cdot]$ operator that will be used in the subsequent analysis.

Lemma 1. Suppose $B = [b_{ij}], C = [c_{ij}] \in \mathbb{R}^{m \times n}$. Then $|\mathrm{val}[B] - \mathrm{val}[C]| \le \max_{i,j} |b_{ij} - c_{ij}| = \|B - C\|$.

Proof. With $x \in \Delta_m$ and $y \in \Delta_n$ throughout,

$\mathrm{val}[B] - \mathrm{val}[C] = \min_y \max_x x^T B y - \min_y \max_x x^T C y$
$\le - \min_y \big[ \max_x x^T C y - \max_x x^T B y \big]$
$= \max_y \big[ \max_x x^T B y - \max_x x^T C y \big]$
$\le \max_y \max_x x^T (B - C) y$
$\le \max_y \max_x \sum_{i,j} x_i (b_{ij} - c_{ij}) y_j$
$\le \max_{i,j} |b_{ij} - c_{ij}| \ \max_y \max_x \sum_{i,j} x_i y_j$
$= \max_{i,j} |b_{ij} - c_{ij}|$.

Similarly,

$\mathrm{val}[B] - \mathrm{val}[C] = \min_y \max_x x^T B y - \min_y \max_x x^T C y$
$\ge \min_y \big[ \max_x x^T B y - \max_x x^T C y \big]$
$\ge \min_y \big[ - \max_x x^T (C - B) y \big]$
$= - \max_y \max_x x^T (C - B) y$
$\ge - \max_y \max_x \sum_{i,j} x_i (c_{ij} - b_{ij}) y_j$
$\ge - \max_{i,j} |b_{ij} - c_{ij}| \ \max_y \max_x \sum_{i,j} x_i y_j$
$= - \max_{i,j} |b_{ij} - c_{ij}|$.

Therefore $|\mathrm{val}[B] - \mathrm{val}[C]| \le \max_{i,j} |b_{ij} - c_{ij}| = \|B - C\|$. Note the repeated application of the facts $\sum_{i,j} x_i y_j = 1$ and $\sup (a_n + b_n) \le \sup a_n + \sup b_n$ in the arguments. This completes the proof.

Corollary 1. Consider $B = [b_{ij}] \in \mathbb{R}^{m \times n}$. Then $|\mathrm{val}[B]| \le \|B\|$.

Proof. Using Lemma 1 with $C = 0$, we get $|\mathrm{val}[B]| \le \max_{i,j} |b_{ij}| = \|B\|$.

Lemma 2. Let $E = [e_{ij}]_{m \times n}$, where $e_{ij} = 1 \ \forall i, j$. Then for constants $\beta, k \in \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$, $\mathrm{val}[\beta A + k E] = \beta\, \mathrm{val}[A] + k$.

Proof. By the definition of the val operator, for $x \in \Delta_m$, $y \in \Delta_n$,

$\mathrm{val}[\beta A + k E] = \min_y \max_x x^T (\beta A + k E) y = \beta \min_y \max_x x^T A y + k \ (\text{since } x^T E y = 1) = \beta\, \mathrm{val}[A] + k$.

Recall that for a given stochastic game $(S, U, V, p, r, \alpha)$, the min-max value function $J^*$ satisfies [16] the system of equations $J(i) = \mathrm{val}[Q(i)], \ \forall i \in S$, where $Q(i)$ is a $|U| \times |V|$ matrix with $(u, v)$-th entry $Q(i, u, v) = r(i, u, v) + \alpha \sum_{j \in S} p(j \mid i, u, v) J(j)$, and this system of equations can be reformulated as the fixed-point equation $TJ = J$, with $T$ being a contraction under the max-norm with contraction factor $\alpha$. We define a quantity $w^*$ as follows:

$w^* \triangleq \min_{i, u, v} \Big\{ \dfrac{1}{1 - \alpha\, p(i \mid i, u, v)} \Big\}$.   (6)

As the probabilities $p(i \mid i, u, v) \ge 0, \ \forall (i, u, v)$, it is clear that $w^* \ge 1$.
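Remark 1 notes that evaluating $\mathrm{val}[Q_n(\cdot)]$ amounts to solving a linear program. As an illustration only (not part of the paper), the following is a minimal Python sketch of such a solver, assuming NumPy and SciPy are available; the function name `matrix_game_value` and the maximizer-side LP formulation are our own choices.

```python
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(A):
    """val[A] = min_y max_x x^T A y over the simplices (illustrative sketch).

    Uses the maximizer's LP:  max v  s.t.  sum_i x_i * A[i, j] >= v  for every column j,
    x >= 0, sum_i x_i = 1.  Returns the game value and the maximizer's mixed strategy.
    """
    m, n = A.shape
    # Decision vector z = (x_1, ..., x_m, v); linprog minimizes, so the objective is -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Constraints: v - sum_i x_i * A[i, j] <= 0 for every column j.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Simplex constraint: sum_i x_i = 1 (the value v is unconstrained in sign).
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    x, v = res.x[:m], res.x[-1]
    return v, x
```

For instance, for the matching-pennies matrix $A = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$ the routine returns value 0 with the uniform strategy, and Lemma 2 can be checked numerically for the values $\beta = w > 0$ used later in the paper.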
For $0 < w \le w^*$, we now define a modified operator $T_w : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ as follows [11]:

$(T_w J)(i) = w (TJ)(i) + (1 - w) J(i)$,

where $w$ represents a prescribed relaxation factor. Note that $T_w$ is in general not a convex combination of $T$ and the identity operator $I$, since we allow $w \ge 1$ as $w^* \ge 1$ (see above). Let $J^*$ denote the min-max value of the Markov game. Therefore, $TJ^* = J^*$. Now,

$(T_w J^*)(i) = w (TJ^*)(i) + (1 - w) J^*(i) = w J^*(i) + (1 - w) J^*(i) = J^*(i)$.   (7)

Therefore, the min-max value function $J^*$ is also a fixed point of $T_w$. Next, we derive a modified min-max Q-Bellman operator for the two-player zero-sum game. Let $Q^{\dagger}(i, u, v)$ be defined as follows (with $M = |S|$):

$Q^{\dagger}(i, u, v) := w \Big( r(i, u, v) + \alpha \sum_{j=1}^{M} p(j \mid i, u, v) J^*(j) \Big) + (1 - w) J^*(i)$.   (8)

Now let

$Q^*(i, u, v) = r(i, u, v) + \alpha \sum_{j=1}^{M} p(j \mid i, u, v) J^*(j)$.

Let $E = [e_{ij}]_{m \times n}$ with $e_{ij} = 1, \ \forall i, j$. Then,

$\mathrm{val}[Q^{\dagger}(i)] = \mathrm{val}[w Q^*(i) + (1 - w) J^*(i) E] = w\, \mathrm{val}[Q^*(i)] + (1 - w) J^*(i)$ (from Lemma 2) $= w (TJ^*)(i) + (1 - w) J^*(i) = (T_w J^*)(i) = J^*(i)$ (from (7)).

Hence equation (8) can be rewritten as follows:

$Q^{\dagger}(i, u, v) = w \Big( r(i, u, v) + \alpha \sum_{j=1}^{M} p(j \mid i, u, v)\, \mathrm{val}[Q^{\dagger}(j)] \Big) + (1 - w)\, \mathrm{val}[Q^{\dagger}(i)]$.   (9)

Let $H_w : \mathbb{R}^{|S \times U \times V|} \to \mathbb{R}^{|S \times U \times V|}$ be defined as follows. For $Q \in \mathbb{R}^{|S \times U \times V|}$,

$(H_w Q)(i, u, v) := w \Big( r(i, u, v) + \alpha \sum_{j=1}^{M} p(j \mid i, u, v)\, \mathrm{val}[Q(j)] \Big) + (1 - w)\, \mathrm{val}[Q(i)]$.

$H_w$ is the modified Q-Bellman operator for the two-player zero-sum Markov game.

Lemma 3. For $0 < w \le w^*$ with $w^*$ as in (6), the map $H_w : \mathbb{R}^{|S \times U \times V|} \to \mathbb{R}^{|S \times U \times V|}$ is a max-norm contraction and $Q^{\dagger}$ is the unique fixed point of $H_w$.

Proof. From equation (9), $Q^{\dagger}$ is a fixed point of $H_w$. Therefore, it is enough to show that $H_w$ is a contraction operator (which will also ensure uniqueness of the fixed point). For $P, Q \in \mathbb{R}^{|S \times U \times V|}$, we have

$(H_w P - H_w Q)(i, u, v) = w \alpha \sum_{j=1}^{M} p(j \mid i, u, v) \big( \mathrm{val}[P(j)] - \mathrm{val}[Q(j)] \big) + (1 - w) \big( \mathrm{val}[P(i)] - \mathrm{val}[Q(i)] \big)$
$= w \alpha \sum_{j=1, j \ne i}^{M} p(j \mid i, u, v) \big( \mathrm{val}[P(j)] - \mathrm{val}[Q(j)] \big) + \big( 1 - w + w \alpha p(i \mid i, u, v) \big) \big( \mathrm{val}[P(i)] - \mathrm{val}[Q(i)] \big)$.

Hence,

$|(H_w P - H_w Q)(i, u, v)| \le w \alpha \sum_{j=1, j \ne i}^{M} p(j \mid i, u, v) \big| \mathrm{val}[P(j)] - \mathrm{val}[Q(j)] \big| + \big| 1 - w + w \alpha p(i \mid i, u, v) \big| \, \big| \mathrm{val}[P(i)] - \mathrm{val}[Q(i)] \big|$   (10)
$\le w \alpha \sum_{j=1, j \ne i}^{M} p(j \mid i, u, v) \big| \mathrm{val}[P(j)] - \mathrm{val}[Q(j)] \big| + \big( 1 - w + w \alpha p(i \mid i, u, v) \big) \big| \mathrm{val}[P(i)] - \mathrm{val}[Q(i)] \big|$   (11)
$\le w \alpha \sum_{j=1, j \ne i}^{M} p(j \mid i, u, v) \max_{u, v} \big| P(j, u, v) - Q(j, u, v) \big| + \big( 1 - w + w \alpha p(i \mid i, u, v) \big) \max_{u, v} \big| P(i, u, v) - Q(i, u, v) \big|$
$\le (w \alpha + 1 - w) \, \|P - Q\|$.   (12)

Since the RHS is not a function of $(i, u, v)$, we have $\max_{i, u, v} |(H_w P - H_w Q)(i, u, v)| \le (w \alpha + 1 - w) \|P - Q\|$, i.e., $\|H_w P - H_w Q\| \le (w \alpha + 1 - w) \|P - Q\|$. Note the use of the assumption $0 < w \le w^*$ (with $w^*$ as in (6)) in equation (10), which ensures that the term $1 - w + w \alpha p(i \mid i, u, v) \ge 0$, to arrive at equation (11). Also, equation (12) is obtained by an application of Lemma 1 in equation (11). From the assumptions on $w$ and the discount factor $\alpha$, it is clear that $0 \le (w \alpha + 1 - w) < 1$. Therefore $H_w$ is a max-norm contraction with contraction factor $(w \alpha + 1 - w)$ and $Q^{\dagger}$ is its unique fixed point.

Lemma 4. $T_w$ is a contraction with contraction factor $(1 - w + w \alpha)$.

Proof. The proof is analogous to that of Lemma 3.
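When the model $(p, r)$ is known, $Q^{\dagger}$ can be computed by repeatedly applying $H_w$, in direct analogy with value iteration. The following is a minimal sketch of one such application (our own illustration, not the paper's code), assuming transition probabilities stored as an array `p[i, u, v, j]`, payoffs `r[i, u, v]`, and the `matrix_game_value` helper sketched above; by Lemma 3 the iteration $Q \leftarrow H_w Q$ contracts with factor $1 - w + w\alpha$ (Lemma 5 below compares this factor with $\alpha$).

```python
import numpy as np


def w_star(p, alpha):
    """Relaxation parameter w* = min_{i,u,v} 1 / (1 - alpha p(i|i,u,v)), eq. (6) (sketch)."""
    S = p.shape[0]
    self_loops = np.array([p[i, :, :, i] for i in range(S)])  # p(i|i,u,v), shape (S, U, V)
    return float(np.min(1.0 / (1.0 - alpha * self_loops)))


def modified_q_bellman(Q, p, r, alpha, w):
    """One application of the modified operator H_w, assuming p[i,u,v,j] and r[i,u,v]."""
    S, U, V = r.shape
    vals = np.array([matrix_game_value(Q[j])[0] for j in range(S)])  # val[Q(j)] for all j
    HQ = np.empty_like(Q)
    for i in range(S):
        for u in range(U):
            for v in range(V):
                expected_next = p[i, u, v] @ vals  # sum_j p(j|i,u,v) val[Q(j)]
                HQ[i, u, v] = w * (r[i, u, v] + alpha * expected_next) + (1.0 - w) * vals[i]
    return HQ
```

Iterating `Q = modified_q_bellman(Q, p, r, alpha, w_star(p, alpha))` from any initial `Q` until successive iterates are close yields an approximation of $Q^{\dagger}$; with $w = 1$ the same loop performs standard min-max value iteration.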
Lemma 5. For $1 \le w \le w^*$, the contraction factor of the map $H_w$ satisfies $1 - w + \alpha w \le \alpha$.

Proof. For $1 \le w \le w^*$, define $f(w) = 1 - w + \alpha w$. Let $w_1 < w_2$. Then $f(w_1) = 1 - w_1(1 - \alpha) > 1 - w_2(1 - \alpha) = f(w_2)$. Hence $f$ is decreasing. In particular, for $w \in [1, w^*]$, $1 - w + \alpha w = f(w) \le f(1) = \alpha$. This shows that, if $w^* > 1$ and $w$ is chosen such that $1 < w \le w^*$, the contraction factor is strictly smaller than $\alpha$.

Remark 3. Depending on the choice of $w$, the following observations can be made about our proposed generalized minimax Q-learning algorithm (refer to Algorithm 1).

• Case I ($w = 1$): Generalized minimax Q-learning reduces to standard minimax Q-learning.
• Case II ($w < 1$): The contraction factor of $H_w$ in this case is $1 - w + \alpha w > \alpha$, giving rise to a minimax Q-learning algorithm with slower convergence.
• Case III ($w > 1$): For this choice of $w$, it is required that $p(i \mid i, u, v) > 0, \ \forall (i, u, v)$ (refer to equation (6)). Under this condition, as shown in Lemma 5, the contraction factor of $H_w$ is $(1 - w + \alpha w) < \alpha$, giving rise to a faster minimax Q-learning algorithm.

Lemma 6. Let $Q^*(i, u, v) = r(i, u, v) + \alpha \sum_{j \in S} p(j \mid i, u, v) J^*(j)$ and let $Q^{\dagger}$ be the fixed point of $H_w$. Then for all $(i, u, v) \in S \times U \times V$,

$Q^{\dagger}(i, u, v) - Q^*(i, u, v) = (1 - w) \big( J^*(i) - Q^*(i, u, v) \big)$.

Moreover, $\mathrm{val}[Q^{\dagger}(i)] = \mathrm{val}[Q^*(i)] \ \forall i \in S$.

Proof. By the hypothesis on $Q^*$, $\mathrm{val}[Q^*(i)] = (TJ^*)(i) = J^*(i) = (T_w J^*)(i) = \mathrm{val}[Q^{\dagger}(i)] \ \forall i \in S$. Since $Q^{\dagger}$ is the fixed point of $H_w$, we have

$Q^{\dagger}(i, u, v) = (H_w Q^{\dagger})(i, u, v) = w \Big( r(i, u, v) + \alpha \sum_{j=1}^{M} p(j \mid i, u, v) J^*(j) \Big) + (1 - w) J^*(i)$.

Therefore $Q^{\dagger}(i, u, v) - Q^*(i, u, v) = (1 - w) \big( J^*(i) - Q^*(i, u, v) \big)$. This completes the proof.

This lemma is an interesting and important result in our paper. It shows that, even if the standard minimax Q-value iterates and the generalized minimax Q-value iterates are not the same for all $(i, u, v)$ tuples, the min-max values at each state given by both algorithms are equal. Therefore, this lemma states that generalized minimax Q-value iteration computes the min-max value function, which is the goal of the two-player zero-sum Markov game. We now show the convergence of generalized minimax Q-learning (refer to Algorithm 1). For this purpose, we first state the following result (Proposition 4.5 on page 157 of [14]) and apply it to show the convergence of our proposed algorithm. We consider $\gamma_n(i)$ to be deterministic, as in our algorithm, unlike [14] where these are allowed to be random.

Theorem 1. Let $\{r_n\}$ be the sequence generated by the iteration

$r_{n+1}(i) = (1 - \gamma_n(i)) r_n(i) + \gamma_n(i) \big( (F r_n)(i) + N_n(i) \big), \ n \ge 0$.

Suppose the following conditions hold:

• The step-sizes $\gamma_n(i)$ are non-negative and satisfy $\sum_{n=0}^{\infty} \gamma_n(i) = \infty$, $\sum_{n=0}^{\infty} \gamma_n^2(i) < \infty$.
• The noise terms $N_n(i)$ satisfy:
  – For every $i$ and $n$, we have $E[N_n(i) \mid \mathcal{F}_n] = 0$, where $\mathcal{F}_n = \sigma\big( r_0(i), \ldots, r_n(i), N_0(i), \ldots, N_{n-1}(i), \ 1 \le i \le d \big)$.
  – Given any norm $\|\cdot\|$ on $\mathbb{R}^d$, there exist constants $C$ and $D$ such that $E[N_n^2(i) \mid \mathcal{F}_n] \le C + D \|r_n\|^2, \ \forall i, n$.
• The mapping $F : \mathbb{R}^d \to \mathbb{R}^d$ is a max-norm contraction.

Then $r_n$ converges to $r^*$, the unique fixed point of $F$, with probability 1.

Theorem 2. Given a finite state-action two-player zero-sum Markov game $(S, U, V, p, r, \alpha)$ with bounded payoffs, i.e.,
$|r(i, u, v)| \le R < \infty, \ \forall (i, u, v) \in S \times U \times V$, the generalized minimax Q-learning algorithm (see Algorithm 1) given by the update rule

$Q_{n+1}(i, u, v) = Q_n(i, u, v) + \gamma_n \Big( w \big( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] \big) + (1 - w)\, \mathrm{val}[Q_n(i)] - Q_n(i, u, v) \Big)$

converges with probability 1 to $Q^{\dagger}(i, u, v)$ for all $(i, u, v) \in S \times U \times V$, as long as $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^2 < \infty$.

Proof. The update rule of the algorithm is given by

$Q_{n+1}(i, u, v) = (1 - \gamma_n) Q_n(i, u, v) + \gamma_n \Big[ w \big( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] \big) + (1 - w)\, \mathrm{val}[Q_n(i)] \Big]$.

Let $\mathcal{F}_n = \sigma\{ Q_0, Y_j, \ \forall j < n \}, \ n \ge 0$, be the associated filtration. Now observe that $Y_n(i, u, v) \sim p(\cdot \mid i, u, v)$. Also, given $(i, u, v)$, assume that the random variables $Y_n(i, u, v), \ n \ge 0$, are independent. Then the above equation can be rewritten as:

$Q_{n+1}(i, u, v) = (1 - \gamma_n) Q_n(i, u, v) + \gamma_n \big( (H_w Q_n)(i, u, v) + N_n(i, u, v) \big)$,   (13)

where

$(H_w Q_n)(i, u, v) = E\Big[ w \big( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] \big) + (1 - w)\, \mathrm{val}[Q_n(i)] \ \Big|\ \mathcal{F}_n \Big]$,   (14)

and

$N_n(i, u, v) = w \big( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] \big) + (1 - w)\, \mathrm{val}[Q_n(i)] - E\Big[ w \big( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] \big) + (1 - w)\, \mathrm{val}[Q_n(i)] \ \Big|\ \mathcal{F}_n \Big]$.   (15)

Now note, from Lemma 3, that the mapping $H_w$ is a max-norm contraction. Also, by the definition of $N_n$, we have that $N_n$ is $\mathcal{F}_{n+1}$-measurable $\forall n$. Further,

$E[N_n \mid \mathcal{F}_n] = 0, \ \forall n$.   (16)

Finally, as $Y_n$ is independent of $\mathcal{F}_n$, we have

$E\big[ N_n^2(i, u, v) \mid \mathcal{F}_n \big] = E\Big[ \big( w ( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] ) + (1 - w)\, \mathrm{val}[Q_n(i)] - (H_w Q_n)(i, u, v) \big)^2 \Big]$
$\le E\Big[ \big( w ( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] ) + (1 - w)\, \mathrm{val}[Q_n(i)] \big)^2 \Big]$
$\le 3 \big( w^2 R^2 + \alpha^2 w^2 \|Q_n\|^2 + (1 - w)^2 \|Q_n\|^2 \big) = C + D \|Q_n\|^2$,   (17)

where $C = 3 w^2 R^2$ and $D = 3 \big( \alpha^2 w^2 + (1 - w)^2 \big)$. Here the first inequality follows from the fact $E[Z - E Z]^2 = E[Z^2] - (E[Z])^2 \le E[Z^2]$. The second inequality follows from the facts $|r(i, u, v)| \le R$, $\|v\| = \max_i |v(i)|$, $(a + b + c)^2 \le 3(a^2 + b^2 + c^2) \ \forall a, b, c$, and Corollary 1. Therefore, by Theorem 1, with probability 1, the generalized minimax Q-learning iterates $Q_n$ converge. By virtue of Lemma 6, our proposed minimax Q-learning algorithm computes a policy whose value is the min-max value of the Markov game.

A. Extension to the asynchronous setting

In the setting considered above, the updates are synchronous, i.e., the Q-values of all state-action pairs are updated at every iteration. However, in online settings, only a single sample is obtained through the interaction with the environment at each time step. In the following, we describe the convergence of our algorithm in the asynchronous setting. The following assumption on the structure of the probability transition matrix $p$ and the control policies [15, Page 130] is necessary in the asynchronous setting:

Assumption 1. The Markov chain induced by all the control policies is ergodic. Moreover, under each policy, every action can be picked with a positive probability in any state.

The latter requirement in Assumption 1 is satisfied, for instance, by policies such as $\epsilon$-greedy; see [17]. We first state a result from [18, Theorem 3] and apply it to show the convergence of our proposed algorithm.
Let $T^i$ be an infinite subset of $\mathbb{N}$ and let $\{r_n\} \in \mathbb{R}^m$ be the sequence generated by the iteration

$r_{n+1}(i) = r_n(i)$, if $n \notin T^i$,
$r_{n+1}(i) = (1 - \gamma_n(i)) r_n(i) + \gamma_n(i) \big( (F r_n^i)(i) + N_n(i) \big)$, if $n \in T^i$.

Here, $r_n^i$ is a vector of possibly outdated components of $r$. In particular, we let $r_n^i = \big( r_{\tau_1^i(n)}(1), \ldots, r_{\tau_m^i(n)}(m) \big)$, where each $\tau_j^i(n)$ is an integer satisfying $0 \le \tau_j^i(n) \le n$, representing the delay in the information about component $j$ available while updating component $i$ at time $n$. If $\tau_j^i(n) = n, \ \forall i, j$, then this reduces to the synchronous setting. Let

$\mathcal{F}_n = \sigma\big( r_0(i), \ldots, r_n(i), \gamma_0(i), \ldots, \gamma_n(i), \tau_j^i(0), \ldots, \tau_j^i(n), N_0(i), \ldots, N_{n-1}(i), \ 1 \le i, j \le m \big)$.

It is important to note from the construction of $\{\mathcal{F}_n\}$ that the step-size sequences $\gamma_n(i)$ are in general allowed to be random. Thus, the component to be updated at time $n$ can be decided online based on the history until time $n$.

Assumption 2. For any $i$ and $j$, $\lim_{n \to \infty} \tau_j^i(n) = \infty$, with probability 1.

Assumption 3. For every $i$ and $n$, $N_n(i)$ is $\mathcal{F}_{n+1}$-measurable and $E[N_n(i) \mid \mathcal{F}_n] = 0$.

Assumption 4. $E[N_n^2(i) \mid \mathcal{F}_n] \le C + D \max_j \max_{\tau \le n} |r_\tau(j)|^2, \ \forall i, n$.

Assumption 5. The step-sizes $\gamma_n(i)$ are non-negative and satisfy $\sum_{n=0}^{\infty} \gamma_n(i) = \infty$, $\sum_{n=0}^{\infty} \gamma_n^2(i) < \infty$, with probability 1.

Assumption 6. There exist a vector $r^*$, a positive vector $v$, and a scalar $\beta \in [0, 1)$ such that $\|F(r) - r^*\|_v \le \beta \|r - r^*\|_v, \ \forall r \in \mathbb{R}^m$.

Theorem 3. Under Assumptions 2-6, $r_n$ converges to $r^*$, the unique fixed point of $F$, with probability 1.

Theorem 4. Consider a finite state-action two-player zero-sum Markov game $(S, U, V, p, r, \alpha)$ with bounded payoffs, i.e., $|r(i, u, v)| \le R < \infty, \ \forall (i, u, v) \in S \times U \times V$. Let the sample at iteration $n$ be $(i_n, u_n, v_n, Y_n(i_n, u_n, v_n))$. Then, under Assumption 1, the asynchronous generalized minimax Q-learning algorithm given by the update rule

$Q_{n+1}(i, u, v) = Q_n(i, u, v)$, if $(i, u, v) \ne (i_n, u_n, v_n)$,
$Q_{n+1}(i, u, v) = Q_n(i, u, v) + \gamma_n(i, u, v) \Big( w \big( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] \big) + (1 - w)\, \mathrm{val}[Q_n(i)] - Q_n(i, u, v) \Big)$, if $(i, u, v) = (i_n, u_n, v_n)$,

converges with probability 1 to $Q^{\dagger}(i, u, v)$ for all $(i, u, v) \in S \times U \times V$.

Proof. Assumption 2 is trivially satisfied as there is no delay in information during training; hence $\tau_j^i(n) = n, \ \forall i, j$. Assumptions 3 and 4 are shown in (16) and (17), respectively. In order for Assumption 5 to be true, all state and action pairs have to be visited infinitely often, which is ensured through Assumption 1. Finally, from Lemma 3,

$\|H_w(Q) - H_w(Q^{\dagger})\| \le (w \alpha + 1 - w) \|Q - Q^{\dagger}\|, \ \forall Q$.

However, as $Q^{\dagger}$ is the unique fixed point, we have

$\|H_w(Q) - Q^{\dagger}\| \le (w \alpha + 1 - w) \|Q - Q^{\dagger}\|, \ \forall Q$,

thereby verifying Assumption 6. Therefore, by Theorem 3, with probability 1, the asynchronous generalized minimax Q-learning iterates $Q_n$ converge to $Q^{\dagger}$.

V. RELATION BETWEEN GENERALIZED MINIMAX Q-LEARNING AND STANDARD MINIMAX Q-LEARNING

In this section, we describe the relation between our proposed generalized minimax Q-learning and the standard minimax Q-learning algorithm.
For the given two-player zero-sum Markov game $(S, U, V, p, r, \alpha)$, we construct a new game $(\bar{S}, \bar{U}, \bar{V}, q, \bar{r}, \bar{\alpha})$ as follows:

• $\bar{S} = S$, $\bar{U} = U$, $\bar{V} = V$;
• $\bar{r} = w r$, $\bar{\alpha} = (1 - w + \alpha w)$, and for a given $(i, u, v)$, let $q(\cdot \mid i, u, v) : S \to [0, 1]$ be defined as

$q(k \mid i, u, v) = \dfrac{w \alpha\, p(k \mid i, u, v)}{1 - w + w \alpha}$ for $k \ne i$, and $q(k \mid i, u, v) = \dfrac{1 - w + w \alpha\, p(i \mid i, u, v)}{1 - w + w \alpha}$ for $k = i$,

where $0 < w \le w^*$. We note that $q(\cdot \mid i, u, v)$ is a probability mass function on $\bar{S}$. Now consider the standard minimax Q-Bellman operator $\bar{H}$ for this game, given by $\bar{H} : \mathbb{R}^{\bar{S} \times \bar{U} \times \bar{V}} \to \mathbb{R}^{\bar{S} \times \bar{U} \times \bar{V}}$ with

$(\bar{H} Q)(i, u, v) = w r(i, u, v) + (1 - w + \alpha w) \sum_{j \in \bar{S}} q(j \mid i, u, v)\, \mathrm{val}[Q(j)]$,

where $Q(j)$ is the $|\bar{U}| \times |\bar{V}|$-dimensional matrix with $(u, v)$-th entry $Q(j, u, v)$ and $\mathrm{val}[Q(j)]$ is given by equation (3). Note that

$(\bar{H} Q)(i, u, v) = w r(i, u, v) + (1 - w + \alpha w) \sum_{j \in \bar{S}} q(j \mid i, u, v)\, \mathrm{val}[Q(j)]$
$= w r(i, u, v) + \sum_{j \in \bar{S}, j \ne i} w \alpha\, p(j \mid i, u, v)\, \mathrm{val}[Q(j)] + \big( 1 - w + w \alpha\, p(i \mid i, u, v) \big) \mathrm{val}[Q(i)]$
$= w \Big( r(i, u, v) + \alpha \sum_{j \in S} p(j \mid i, u, v)\, \mathrm{val}[Q(j)] \Big) + (1 - w)\, \mathrm{val}[Q(i)]$
$= (H_w Q)(i, u, v)$.

Hence the $\bar{H}$ operator of the game $(\bar{S}, \bar{U}, \bar{V}, q, \bar{r}, \bar{\alpha})$ is the same as the $H_w$ operator defined for the game $(S, U, V, p, r, \alpha)$. Let us consider an iteration of the minimax Q-learning algorithm on $(\bar{S}, \bar{U}, \bar{V}, q, \bar{r}, \bar{\alpha})$, given by

$\bar{Q}_{n+1}(i, u, v) = (1 - \gamma_n) \bar{Q}_n(i, u, v) + \gamma_n \Big( w r(i, u, v) + (1 - w + w \alpha)\, \mathrm{val}[\bar{Q}_n(\bar{Y}_n(i, u, v))] \Big)$
$= (1 - \gamma_n) \bar{Q}_n(i, u, v) + \gamma_n \big( (\bar{H} \bar{Q}_n)(i, u, v) + \bar{N}_n(i, u, v) \big)$,

where $\gamma_n, \ n \ge 0$, is the step-size sequence, $\bar{Y}_n(i, u, v) \sim q(\cdot \mid i, u, v)$, and

$\bar{N}_n(i, u, v) = w r(i, u, v) + (1 - w + w \alpha)\, \mathrm{val}[\bar{Q}_n(\bar{Y}_n(i, u, v))] - (\bar{H} \bar{Q}_n)(i, u, v)$,

and compare it with an iteration of generalized minimax Q-learning. Since $\bar{H} = H_w$, both algorithms converge to $Q^{\dagger}$, the fixed point of $H_w$, and differ only in the per-iterate noise terms $\bar{N}_n$ and $N_n$.

Lemma 7. Suppose $\{Q_n\}$ are the iterates of generalized minimax Q-learning. Then, given any $\epsilon > 0$, there exists a natural number $N$, possibly sample-path dependent, such that $\|Q_n\| \le \frac{R}{1 - \alpha} + \epsilon$ for $n > N$, almost surely.

Proof. Consider the iterates $\bar{Q}_n$ of the minimax Q-learning algorithm with respect to the stochastic game $(\bar{S}, \bar{U}, \bar{V}, q, \bar{r}, \bar{\alpha})$ with initial point satisfying $\|\bar{Q}_0\| \le \frac{R}{1 - \alpha}$. Now assume that $\|\bar{Q}_n\| \le \frac{R}{1 - \alpha}$ (induction hypothesis). Then

$\|\bar{Q}_{n+1}\| \le (1 - \gamma_n) \|\bar{Q}_n\| + \gamma_n \big( w \|r\| + (1 - w + w \alpha) \|\bar{Q}_n\| \big) \le (1 - \gamma_n) \frac{R}{1 - \alpha} + \gamma_n \Big( w R + (1 - w + w \alpha) \frac{R}{1 - \alpha} \Big) = \frac{R}{1 - \alpha}$.

Therefore, by induction, $\|\bar{Q}_n\| \le \frac{R}{1 - \alpha}, \ \forall n \ge 0$. As the sequences $\{\bar{Q}_n\}$ and $\{Q_n\}$ converge to $Q^{\dagger}$, given $\epsilon > 0$ there exists a natural number $N$ such that $\|Q_n - \bar{Q}_n\| \le \epsilon \Rightarrow \|Q_n\| \le \frac{R}{1 - \alpha} + \epsilon, \ \forall n > N$. Moreover, $\|Q^{\dagger}\| \le \frac{R}{1 - \alpha}$. To conclude, we have $\|Q_n\| \le \frac{R}{1 - \alpha} + \epsilon$ almost surely, where $N$ is possibly sample-path dependent. This completes the proof.

Remark 4. We invoke the standard Q-learning algorithm on $(\bar{S}, \bar{U}, \bar{V}, q, \bar{r}, \bar{\alpha})$ with the initial point $\bar{Q}_0$ chosen such that $\|\bar{Q}_0\| \le \frac{R}{1 - \alpha}$ to prove Lemma 7.
It is also possible to obtain the same desired conclusion by directly utilizing the convergence of the iterates of the standard Q-learning algorithm on $(\bar{S}, \bar{U}, \bar{V}, q, \bar{r}, \bar{\alpha})$ for an arbitrary initial point $\bar{Q}_0$.

VI. MODEL-FREE GENERALISED MINIMAX Q-LEARNING

Note that an input to Algorithm 1 is the relaxation parameter $w \le w^*$, where $w^*$ is defined in (6). As $w^*$ depends on the transition probability function $p$, it is not possible to choose a valid $w$ in experiments where we do not have access to the transition probability function. In this section, we describe a synchronous version of the model-free generalised minimax Q-learning procedure that mitigates this dependence on the model information. We maintain a count value $C_n[i][j][u][v], \ \forall i, j \in S, u \in U, v \in V$ (initialised to zero $\forall i, j, u, v$) that represents the number of times the sample $(i, u, v, j)$ has been encountered until iteration $n$. We define

$p'_n(j \mid i, u, v) = \dfrac{C_n[i][j][u][v]}{n}, \ \forall n \ge 1$,   (18)

with $p'_0(j \mid i, u, v) = 0, \ \forall i, j, u, v$. It is easy to see that

$p'_n(j \mid i, u, v) \to p(j \mid i, u, v), \ \forall i, j, u, v$,   (19)

as $n \to \infty$, almost surely (from the Strong Law of Large Numbers). Now, we propose our model-free "generalised minimax Q-learning" by modifying step 3 of Algorithm 1 as:

$d_{n+1}(i, u, v) = w_n \big( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] \big) + (1 - w_n)\, \mathrm{val}[Q_n(i)]$,   (20)

where the sequence $\{w_n, n \ge 1\}$ is updated as:

$w_{n+1} = (1 - \gamma_n) w_n + \gamma_n \dfrac{1}{1 - \alpha \min_{i, u, v} p'_n(i \mid i, u, v)}$,   (21)

with $w_0 \in [1, \frac{1}{1 - \alpha}]$.

A. Convergence Analysis

We write the two update equations as follows:

$Q_{n+1}(i, u, v) = Q_n(i, u, v) + \gamma_n \big( h(Q_n, w_n) + M_{n+1} \big), \ \forall i, u, v$,   (22)
$w_{n+1} = w_n + \gamma_n \big( g(w_n) + \epsilon_n \big)$.   (23)

The function $h$ is defined as:

$h(Q_n, w_n)(i, u, v) = H_{w_n}(Q_n)(i, u, v) - Q_n(i, u, v) = E\Big[ w_n \big( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(j)] \big) + (1 - w_n)\, \mathrm{val}[Q_n(i)] - Q_n(i, u, v) \Big]$,

where the expectation is over the next state $j \sim p(\cdot \mid i, u, v)$. The sequence $\{M_n\}$, defined as

$M_{n+1} = w_n \big( r(i, u, v) + \alpha\, \mathrm{val}[Q_n(Y_n(i, u, v))] \big) + (1 - w_n)\, \mathrm{val}[Q_n(i)] - Q_n(i, u, v) - h(Q_n, w_n)$,

is a martingale difference noise sequence with respect to the increasing $\sigma$-fields $\mathcal{F}_n := \sigma\{ Q_0, w_0, M_0, \ldots, w_n, M_n \}, \ n \ge 0$, satisfying

$E[\|M_{n+1}\|^2 \mid \mathcal{F}_n] \le 3 w_n^2 \max_{i, u, v} |r(i, u, v)|^2 + \dfrac{6 \alpha^2}{(1 - \alpha)^2} \|Q_n\|^2 \le K \big( 1 + \|w_n\|^2 + \|Q_n\|^2 \big)$,   (24)

where $K = \max\Big\{ 3 \max_{i, u, v} |r(i, u, v)|^2, \ \dfrac{6 \alpha^2}{(1 - \alpha)^2} \Big\}$. The function $g$ is defined as:

$g(w_n) = \dfrac{1}{1 - \alpha \min_{i, u, v} p(i \mid i, u, v)} - w_n$.

Finally,

$\epsilon_n = \dfrac{1}{1 - \alpha \min_{i, u, v} p'_n(i \mid i, u, v)} - \dfrac{1}{1 - \alpha \min_{i, u, v} p(i \mid i, u, v)}$,

where $p'_n$ is updated as shown in equation (18). Note that, from (19), we get

$\epsilon_n \to 0$, as $n \to \infty$, almost surely.   (25)

Notice from (22)-(23) that the $Q_n$-recursion in (22) depends on the $w_n$-update in (23), while the latter is an independent update that does not depend on $Q_n$. Let $Q^{\dagger}_{w^*}$ be the (unique) fixed point of $H_{w^*}$. Note that, from (21), $w_n \in \big[ 1, \frac{1}{1 - \alpha} \big], \ \forall n \ge 0$. Therefore, the $\{w_n, \ \forall n \ge 1\}$ updates are bounded. We now make an assumption on the boundedness of the $\{Q_n\}$ iterates.

Assumption 7. $\|Q_n\| \le B < \infty, \ \forall n \ge 0$.
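For concreteness, the following is a minimal sketch of the synchronous model-free procedure combining the count-based estimate (18), the modified update (20), and the $w_n$ recursion (21). This is our own illustration under stated assumptions: `sample_next_state(i, u, v)` is an assumed environment interface returning $j \sim p(\cdot \mid i, u, v)$, `step_size(n)` returns $\gamma_n$, and `matrix_game_value` is the helper sketched earlier.

```python
import numpy as np


def model_free_generalised_minimax_q(sample_next_state, r, alpha, num_iters, step_size):
    """Synchronous model-free generalised minimax Q-learning (illustrative sketch)."""
    S, U, V = r.shape
    Q = np.zeros((S, U, V))
    counts = np.zeros((S, S, U, V))        # C_n[i][j][u][v]
    w = 1.0                                # w_0 in [1, 1/(1 - alpha)]
    for n in range(1, num_iters + 1):
        gamma = step_size(n)
        vals = np.array([matrix_game_value(Q[j])[0] for j in range(S)])
        Q_next = np.empty_like(Q)
        for i in range(S):
            for u in range(U):
                for v in range(V):
                    j = sample_next_state(i, u, v)            # Y_n(i, u, v) ~ p(.|i, u, v)
                    counts[i, j, u, v] += 1
                    d = w * (r[i, u, v] + alpha * vals[j]) + (1.0 - w) * vals[i]   # eq. (20)
                    Q_next[i, u, v] = (1.0 - gamma) * Q[i, u, v] + gamma * d
        # Count-based self-loop estimates p'_n(i|i,u,v), eq. (18), and the w_n update, eq. (21).
        p_self = np.array([counts[i, i] for i in range(S)]) / n
        w = (1.0 - gamma) * w + gamma / (1.0 - alpha * np.min(p_self))
        Q = Q_next
    return Q, w
```

A typical choice is `step_size = lambda n: 1.0 / n`; as $n$ grows, $w_n$ tracks $w^*$ while the $Q$-iterates track $Q^{\dagger}_{w^*}$ (Theorem 5 below).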
In practice, the iterates $\{Q_n\}$ will satisfy Assumption 7 if they are projected onto a prescribed compact set $\Omega$ whenever they exit it; see, for instance, [19, Chapter 5] for a general setting of projected stochastic approximation. From Lemma 7, the solution satisfies $\|Q^{\dagger}_{w^*}\| \le \frac{R}{1 - \alpha}$. Therefore, we can choose the set $\Omega$ such that $\|\Omega\| := \max_{x \in \Omega} \|x\| > \frac{R}{1 - \alpha}$.

Lemma 8. The functions $h(Q, w)$ and $g(w)$ are Lipschitz.

Proof. Consider $p, q \in [-B, B]^{|S| \times |U| \times |V|}$ and $w_1, w_2 \in [1, \frac{1}{1 - \alpha}]$. Let $R = \max_{i, u, v} |r(i, u, v)|$. Then,

$\|h(p, w_1) - h(q, w_1)\| \le \|H_{w_1}(p) - H_{w_1}(q)\| + \|p - q\| \le \big( |1 - w_1| + w_1 \alpha \big) \|p - q\| + \|p - q\| \le \dfrac{1 + \alpha}{1 - \alpha} \|p - q\|$,

and

$|(h(q, w_1) - h(q, w_2))(i, u, v)| \le |w_1 - w_2| \ E\big[ \big| r(i, u, v) + \alpha\, \mathrm{val}[q(j)] - \mathrm{val}[q(i)] \big| \big] \le |w_1 - w_2| (R + 2B)$.

Hence, $\|h(p, w_1) - h(q, w_2)\| \le L \big( \|p - q\| + |w_1 - w_2| \big)$, where $L = \max\big\{ \frac{1 + \alpha}{1 - \alpha}, R + 2B \big\}$. Finally, $|g(w_1) - g(w_2)| \le |w_1 - w_2|$. Therefore, the functions $h(Q, w)$ and $g(w)$ are Lipschitz.

We now consider the iterates (22)-(23) in a combined form as follows:

$x_{n+1} = x_n + \gamma_n \big( f(x_n) + M'_{n+1} + \epsilon'_n \big)$,   (26)

where $x_n = (Q_n, w_n)^T$, $f(x_n) = (H_{w_n}(Q_n) - Q_n, \ w^* - w_n)^T$, $M'_{n+1} = (M_{n+1}, 0)^T$, and $\epsilon'_n = (0, \epsilon_n)^T$. Let $Q^{\dagger}_{w^*}$ be the fixed point of the modified min-max Q-Bellman operator (see (8)) when $w = w^*$ is used.

Theorem 5. $x_n \to x^*$, where $x^* = \big( Q^{\dagger}_{w^*}, w^* \big)^T$, almost surely.

Proof. The iterates $\{x_n\}$ in (26) track the ODE [15, Section 2.2]

$\dot{x} = f(x) = \big( H_w(Q) - Q, \ w^* - w \big)^T$.

Note that the $w_n$ iterates drive the $Q_n$ iterates but the reverse is not true, i.e., it is a one-way coupling of the dynamics. First, consider the ODE $\dot{w} = w^* - w$. Let $g_\infty(w) = \lim_{r \to \infty} \frac{g(r w)}{r}$. The function $g_\infty(w)$ exists and is equal to $-w$. Moreover, the origin is the unique globally asymptotically stable equilibrium for the ODE $\dot{w} = g_\infty(w) = -w$, with $V(w) = \frac{w^2}{2}$ serving as an associated Lyapunov function. Further, $w^*$ is the unique globally asymptotically stable equilibrium for the ODE $\dot{w} = w^* - w$. Therefore, by [15, Theorem 7, Chapter 3 and Theorem 2 - Corollary 4], we have $w_n \to w^*$ almost surely. The $Q_n$ iterates now track the ODE given by $\dot{Q} = H_{w^*}(Q) - Q$. By virtue of Lemma 3, $H_{w^*}$ is a contraction. Hence, by stochastic fixed point analysis [15, Section 10.3], we have $Q_n \to Q^{\dagger}_{w^*}$, almost surely.

Remark 5. One way to ensure Assumption 7 holds is to project the $\{Q_n\}$ iterates onto a prespecified convex and compact set $C$. Under projection, the update equation (22), i.e.,

$Q_{n+1}(i, u, v) = Q_n(i, u, v) + \gamma_n \big( h(Q_n, w_n) + M_{n+1} \big), \ \forall i, u, v$,   (27)

is replaced with

$Q_{n+1}(i, u, v) = \Gamma_C\big\{ Q_n(i, u, v) + \gamma_n \big( h(Q_n, w_n) + M_{n+1} \big) \big\}, \ \forall i, u, v$,   (28)

where $\Gamma_C(P)$ is the projection of $P$ onto a compact and convex set such as $C = [-B, B]^{|S| \times |U| \times |V|}$. Convexity of $C$ ensures that the projection $\Gamma_C(P)$ is unique for any $P$. The iterates $\{Q_n\}$ in (28) track the ODE [19, Chapter 5]

$\dot{Q} = \hat{\Gamma}\big( H_{w^*}(Q) - Q \big)$,   (29)

where the operator $\hat{\Gamma}(h)$, for a continuous function $h$, is defined as:

$\hat{\Gamma}(h(Q, w)) = \lim_{0 < \Delta \to 0} \dfrac{\Gamma_C\big( Q + \Delta\, h(Q, w) \big) - Q}{\Delta}$.   (30)

From [15, Theorem 2, Chapter 2], the $\{Q_n\}$ iterates converge to a compact, connected, internally chain transitive, invariant set of the ODE (29).
It is easy to see that $\{Q^{\dagger}_{w^*}\}$ is an invariant and internally chain transitive (ICT) set of the ODE (29). However, the projection operation may introduce spurious fixed points on the boundary of the set $C$ that will also be invariant and ICT sets of the ODE (29). In [15, Chapter 5.4], some practical techniques are discussed to avoid convergence to undesired equilibrium points (boundary points in this case).

VII. EXPERIMENTS AND RESULTS

We refer to Algorithm 1 with $w = w^*$ as "Generalised optimal minimax Q-learning" and to the model-free algorithm derived in the previous section as "Generalised minimax Q-learning" in the experiments. We generate a two-player zero-sum Markov game and run all the algorithms for 50 independent episodes in each of three cases: (a) 10 states and 5 actions for each of the agents, (b) 20 states and 5 actions for each of the agents, and (c) 50 states and 5 actions for each of the agents. The discount factor is set to 0.6. The probability transition matrix generated satisfies $p(i \mid i, u, v) > 0 \ \forall i, u, v$, as this condition is required for faster performance of generalized optimal minimax Q-learning and generalized minimax Q-learning. All the algorithms are run for 1000 iterations in each episode with the same step-size sequences.

Table I: Comparison of error among the three algorithms, averaged across 50 episodes

Algorithm                                  10 states      20 states      50 states
Standard minimax Q-learning                0.68 ± 0.07    1.67 ± 0.13    3.99 ± 0.11
Generalized minimax Q-learning             0.49 ± 0.08    1.43 ± 0.18    3.75 ± 0.12
Generalized optimal minimax Q-learning     0.35 ± 0.08    1.26 ± 0.19    3.59 ± 0.14

The comparison criterion considered is the average error, which is calculated as follows. At the end of each episode of an algorithm, the norm difference between the estimate of the min-max value function and the actual min-max value function is computed. This process is repeated for all 50 episodes and the average is computed. Thus,

$\text{Average Error} = \dfrac{1}{50} \sum_{k=1}^{50} \big\| J^* - \mathrm{val}[Q_k(\cdot)] \big\|_2$,   (31)

where $J^*$ is the min-max value function of the game and $Q_k(\cdot)$ is the minimax Q-value function estimate obtained at the end of the $k$-th episode. In Table I, we report the average error of the three algorithms. We can see that generalized optimal minimax Q-learning has the least average error, followed by the generalized minimax Q-learning algorithm. This is expected, as the generalized optimal minimax Q-learning algorithm makes use of the optimal relaxation parameter $w^*$ in its updates, which is not practically feasible. Therefore, we conclude that our proposed generalized minimax Q-learning algorithms perform empirically better (in terms of number of samples) than the standard minimax Q-learning algorithm.

VIII. CONCLUSIONS

In this work, we use the technique of successive relaxation to propose a modified min-max Bellman operator for two-player zero-sum games. We prove that the contraction factor of this modified min-max Bellman operator is less than the discount factor (the contraction factor of the standard min-max Bellman operator) for the choice of $w > 1$. The construction of the modified Q-Bellman operator enabled us to develop a generalized minimax Q-learning algorithm. We show the almost sure convergence of our proposed algorithm. We then derive a relation between our proposed algorithm and the standard minimax Q-learning algorithm.
We also propose a model-free (from samples) version of our algorithm and prove its convergence under the boundedness-of-iterates assumption. In the future, we would like to incorporate function approximation architectures and apply our proposed algorithm to practical applications. Moreover, as future work, we would like to explore the theoretical sample complexity of our algorithm and compare it with that of minimax Q-learning.

IX. ACKNOWLEDGEMENTS

Raghuram Bharadwaj was supported by a fellowship grant from the Centre for Networked Intelligence (a Cisco CSR initiative) of the Indian Institute of Science, Bangalore. Shalabh Bhatnagar was supported by the J. C. Bose Fellowship, a project from DST under the ICPS Program, and the RBCCPS, IISc.

REFERENCES

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 2. Belmont, MA: Athena Scientific, 2013.
[2] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Machine Learning Proceedings 1994. Elsevier, 1994, pp. 157–163.
[3] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.
[4] M. L. Littman, "Friend-or-foe Q-learning in general-sum games," in ICML, vol. 1, 2001, pp. 322–328.
[5] A. Greenwald, K. Hall, and R. Serrano, "Correlated Q-learning," in ICML, vol. 3, 2003, pp. 242–249.
[6] M. Bowling and M. Veloso, "Rational and convergent learning in stochastic games," in International Joint Conference on Artificial Intelligence, vol. 17, no. 1. Lawrence Erlbaum Associates Ltd, 2001, pp. 1021–1026.
[7] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.
[8] K. Zhang, Z. Yang, and T. Başar, "Multi-agent reinforcement learning: A selective overview of theories and algorithms," arXiv preprint arXiv:1911.10635, 2019.
[9] F. A. Dahl and O. M. Halck, "Minimax TD-learning with neural nets in a Markov game," in European Conference on Machine Learning. Springer, 2000, pp. 117–128.
[10] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, and S. Russell, "Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient," in AAAI Conference on Artificial Intelligence (AAAI), 2019.
[11] D. Reetz, "Solution of a Markovian decision problem by successive overrelaxation," Zeitschrift für Operations Research, vol. 17, no. 1, pp. 29–32, 1973.
[12] C. Kamanchi, R. B. Diddigi, and S. Bhatnagar, "Successive over-relaxation Q-learning," IEEE Control Systems Letters, vol. 4, no. 1, pp. 55–60, Jan. 2020.
[13] J. Filar and K. Vrieze, Competitive Markov Decision Processes. Springer Science & Business Media, 2012.
[14] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[15] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[16] L. S. Shapley, "Stochastic games," Proceedings of the National Academy of Sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2018.
[18] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.
[19] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer Science & Business Media, 2012, vol. 26.