Game-theoretical control with continuous action sets
Authors: Steven Perkins, Panayotis Mertikopoulos, David S. Leslie
Abstract: Motivated by the recent applications of game-theoretical learning techniques to the design of distributed control systems, we study a class of control problems that can be formulated as potential games with continuous action sets, and we propose an actor-critic reinforcement learning algorithm that provably converges to equilibrium in this class of problems. The method employed is to analyse the learning process under study through a mean-field dynamical system that evolves in an infinite-dimensional function space (the space of probability distributions over the players' continuous controls). To do so, we extend the theory of finite-dimensional two-timescale stochastic approximation to an infinite-dimensional, Banach space setting, and we prove that the continuous dynamics of the process converge to equilibrium in the case of potential games. These results combine to give a provably-convergent learning algorithm in which players do not need to keep track of the controls selected by the other agents.

I. INTRODUCTION

There has been much recent activity in using techniques of learning in games to design distributed control systems. This research traverses from utility function design [1–3], through analysis of potential suboptimalities due to the use of distributed selfish controllers [4], to the design and analysis of game-theoretical learning algorithms with specific control-inspired objectives (reaching a global optimum, fast convergence, etc.) [5, 6]. In this context, considerable interest has arisen from the approach of [1, 2], in which the independent controls available to a system are distributed among a set of agents, henceforth called "players".
To complete the game-theoretical analogy, the controls available to a player are called "actions", and each player is assigned a utility function which depends on the actions of all players (as does the global system-level utility). As such, a player's utility in a particular play of the game could be set to be the global utility of the joint action selected by all players. However, a more learnable choice is the so-called Wonderful Life Utility (WLU) [1, 2], in which the utility of any particular player is given by how much better the system is doing as a result of that player's action (compared to the situation where no other player changes their action but the focal player uses a baseline action instead). A fundamental result in this domain is that setting the players' utilities using WLUs results in a potential game [7] (see Section II below).

[Funding and affiliations: S. Perkins' Ph.D. research was funded by grant number EP/D063485/1 from the United Kingdom Engineering and Physical Sciences Research Council. P. Mertikopoulos' research was partially supported by the French National Research Agency under grant nos. ANR-GAGA-13-JS01-0004-01 and ANR-NETLEARN-13-INFR-004, and the CNRS grant PEPS-GATHERING-2014. D.S. Leslie's research was funded by grant number EP/I032622/1 from the United Kingdom Engineering and Physical Sciences Research Council. S. Perkins carried out this research while a PhD candidate at the School of Mathematics, University of Bristol, United Kingdom. P. Mertikopoulos is with the French National Center for Scientific Research (CNRS) and the Univ. Grenoble Alpes, LIG, F-38000 Grenoble, France. David S. Leslie is with the School of Mathematics and Statistics, Lancaster University, United Kingdom.]
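For illustration, the WLU can be sketched in a few lines; the global utility, joint action, and baseline below are hypothetical choices, not taken from the paper:

```python
def wonderful_life_utility(global_utility, joint_action, player, baseline):
    """WLU of `player`: the system utility at the joint action minus the
    system utility when the focal player switches to a baseline action
    while every other player keeps their action fixed."""
    counterfactual = dict(joint_action)
    counterfactual[player] = baseline
    return global_utility(joint_action) - global_utility(counterfactual)

# Hypothetical system utility: -(sum of controls - 1)^2, i.e. a target total of 1
g = lambda a: -(sum(a.values()) - 1.0) ** 2
joint = {"p1": 0.4, "p2": 0.3}
print(wonderful_life_utility(g, joint, "p1", baseline=0.0))
```

Because the baseline term does not depend on the focal player's own action, the WLU preserves each player's preferences over her own actions while measuring her marginal contribution to the system.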
There are alternative methods for converting a system-level utility function into individual utilities, such as Shapley value utility [8]; however, most of these also boil down to a potential game (possibly in the extended sense of [3]) in which the optimal system control is a Nash equilibrium of the game. Thus, by representing a control problem as a potential game, the controllers' main objective amounts to reaching a Nash equilibrium of the resulting game.

On the other hand, like much of the economic literature on learning in games [9, 10], the vast majority of this corpus of research has focused almost exclusively on situations where each player's controls comprise a finite set. This allows results from the theory of learning in games to be applied directly, resulting in learning algorithms that converge to the set of equilibria, and hence to system optima. However, the assumption of discrete action sets is frequently anomalous in control, engineering and economics: after all, prices are not discrete, and neither are the controls in a large number of engineering systems. For instance, in massively parallel grid computing networks (such as the Berkeley Open Infrastructure for Network Computing, BOINC) [11], the decision granularity of "bag-of-tasks" application scheduling gives rise to a potential game with continuous action sets [7]. A similar situation is encountered in the case of energy-efficient power control and power allocation in large wireless networks [12, 13]: mobile wireless users can transmit at different power levels (or split their power across different subcarriers [14]), and their throughput is a continuous function of their chosen transmit power profiles (which have to be optimized unilaterally and without recourse to user coordination or cooperation).
Finally, decision-making in the emerging "smart grid" paradigm for power generation and management in electricity grids also revolves around continuous variables (such as the amount of power to generate, or when to power down during the day), leading again to game-theoretical model formulations with continuous action sets [15].

In this paper, we focus squarely on control problems (presented as potential games) with continuous action sets, and we propose an actor-critic reinforcement learning algorithm that provably converges to equilibrium. To address this problem in an economic setting, very recent work by Perkins and Leslie [16] extended the theory of learning in games to zero-sum games with continuous action sets (see also [17, 18]); however, from a control-theoretical point of view, zero-sum games are of limited practical relevance because they only capture adversarial interactions between two players. Owing to this fundamental difference between zero-sum and potential games, the two-player analysis of [16] no longer applies to our case, so a completely different approach is required to obtain convergence in the context of many-player potential games.

To accomplish this, our analysis relies on two theoretical contributions of independent interest. The first is the extension of stochastic approximation techniques for Banach spaces (otherwise known as "abstract stochastic approximation" [19–24]) to the so-called "two-timescales" framework originally introduced in standard (finite-dimensional space) stochastic approximation by [25]. This allows us to consider interdependent strategies and value functions evolving as a stochastic process in a Banach space (the space of signed measures over the players' continuous action sets and the space of continuous functions from action space to R respectively, both endowed with appropriate norms).
Our second contribution is the asymptotic analysis of the mean field dynamics of this process on the space of probability measures on the action space; our analysis reveals that the dynamics' rest points in potential games are globally attracting, so, combined with our stochastic approximation results, we obtain the convergence of our actor-critic reinforcement learning algorithm to equilibrium.

In Section II we introduce the framework and notation, and introduce our actor-critic learning algorithm. Following that, in Section III we introduce two-timescales stochastic approximation in Banach spaces, and prove our general result. Section IV applies the stochastic approximation theory to the actor-critic algorithm to show that it can be studied via a mean field dynamical system. Section V then analyses the convergence of the mean field dynamical system in potential games, a result which allows us to prove the convergence of the actor-critic process in this context.

II. ACTOR–CRITIC LEARNING WITH CONTINUOUS ACTION SPACES

Throughout this paper, we will focus on control problems presented as potential games with finitely many players and continuous action spaces. Such a game comprises a finite set of players labelled i ∈ {1, ..., N}. For each i there exists an action set A_i ⊂ R which is a compact interval;¹ when each player selects an action a_i ∈ A_i, this results in a joint action a = (a_1, ..., a_N) ∈ A = ∏_{i=1}^N A_i. We will frequently use the notation (a_i, a_{-i}) to refer to the joint action a in which Player i uses action a_i and all other players use the joint action a_{-i} = (a_1, ..., a_{i-1}, a_{i+1}, ..., a_N). Each player i is also associated with a bounded and continuous utility function u_i : A → R.
For the game to be a potential game, there must exist a potential function φ : A → R such that

    u_i(a_i, a_{-i}) − u_i(ã_i, a_{-i}) = φ(a_i, a_{-i}) − φ(ã_i, a_{-i})

for all i ∈ {1, ..., N}, for all a_{-i}, and for all a_i, ã_i. Thus if any player changes their action while the others do not, the change in utility for the player that changes their action is equal to the change in value of the potential function of the game. Methods for constructing potential games from system utility functions [1–3] usually ensure that the potential corresponds to the system utility, so maximising the potential function corresponds to maximising the system utility.

Game-theoretical analyses usually focus on mixed strategies, where a player selects an action to play randomly. A mixed strategy for Player i is defined to be a probability distribution over the action space A_i. This is a simple concept when A_i is finite, but for the continuous action spaces A_i considered in this paper more care is required. Specifically, let B_i be the Borel sigma-algebra on A_i, and let P(A_i, B_i) denote the set of all probability measures on A_i. Throughout this article we endow P(A_i, B_i) with the weak topology, metrized by the bounded Lipschitz norm (see Section IV; also [16, 26, 27]). A mixed strategy is then an element π_i ∈ P(A_i, B_i); for B_i ∈ B_i we have that π_i(B_i) is the probability that Player i selects an action in the Borel set B_i. Note that a mixed strategy under this definition need not admit a density with respect to Lebesgue measure, and in particular may contain an atom at a particular action a_i.

Returning to our game-theoretical considerations, we extend the definition of utilities to the space ∆ = ∏_{i=1}^N P(A_i, B_i) of mixed strategy profiles.
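The defining identity of a potential game can be checked numerically on a toy example. In the following sketch (the potential and utilities are hypothetical, chosen only for illustration), each player's utility differs from a common φ only by a term depending on the other player's action, so unilateral deviations change u_i exactly as they change φ:

```python
import random

# Hypothetical potential: phi(a1, a2) = a1*a2 - a1^2 - a2^2
phi = lambda a1, a2: a1 * a2 - a1 ** 2 - a2 ** 2
# Utilities differing from phi only by a term in the *other* player's action,
# so phi is a potential for the game:
u1 = lambda a1, a2: phi(a1, a2) + a2 ** 3
u2 = lambda a1, a2: phi(a1, a2) + a1 ** 3

random.seed(0)
for _ in range(1000):
    a1, a2, d = (random.uniform(0.0, 1.0) for _ in range(3))
    # a unilateral deviation changes u_i by exactly the change in phi
    assert abs((u1(d, a2) - u1(a1, a2)) - (phi(d, a2) - phi(a1, a2))) < 1e-9
    assert abs((u2(a1, d) - u2(a1, a2)) - (phi(a1, d) - phi(a1, a2))) < 1e-9
print("unilateral deviations match the potential")
```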
In particular, let π ∈ ∆ be a mixed strategy profile, and define

    u_i(π) = ∫_{A_1} ··· ∫_{A_N} u_i(a) π_1(da_1) ··· π_N(da_N).

As before, we use the notation (π_i, π_{-i}) to refer to the mixed strategy profile π in which Player i uses π_i and all other players use π_{-i} = (π_1, ..., π_{i-1}, π_{i+1}, ..., π_N). In a further abuse of notation, we write (a_i, π_{-i}) for the mixed strategy profile (δ_{a_i}, π_{-i}), where δ_{a_i} is the Dirac measure at a_i (meaning that Player i selects action a_i with probability 1). Hence u_i(a_i, π_{-i}) is the utility to Player i for selecting a_i when all other players use strategy π_{-i}.

A central concept in game theory is the best response correspondence of Player i, i.e. the set of mixed strategies that maximise Player i's utility given any particular opponent mixed strategy π_{-i}. A Nash equilibrium is a fixed point of this correspondence, in which all players are playing a best response to all other players. In a learning context, however, the discontinuities that appear in best response correspondences can cause great difficulties [28]. We focus instead on a smoothing of the best response. For a fixed η > 0, the logit best response with noise level η of Player i to strategy π_{-i} is defined to be the mixed strategy L^i_η(π_{-i}) ∈ P(A_i, B_i) such that

    L^i_η(π_{-i})(B_i) = ∫_{B_i} exp{η^{-1} u_i(a_i, π_{-i})} da_i / ∫_{A_i} exp{η^{-1} u_i(b_i, π_{-i})} db_i    (1)

for each B_i ∈ B_i. In [18] it is shown that L^i_η(π_{-i}) ∈ P(A_i, B_i) is absolutely continuous (with respect to Lebesgue measure), with density given by

    l^i_η(π_{-i})(a_i) = exp{η^{-1} u_i(a_i, π_{-i})} / ∫_{A_i} exp{η^{-1} u_i(b_i, π_{-i})} db_i.    (2)

[Footnote 1: We are only making this assumption for convenience; our analysis carries through to higher-dimensional convex bodies with minimal hassle.]
To ease notation in what follows, we let L_η(π) = (L^1_η(π_{-1}), ..., L^N_η(π_{-N})). The existence of fixed points of L_η is shown in [18] and [16]; such a fixed point is a joint strategy π such that π_i = L^i_η(π_{-i}) for each i, and so is a mixed strategy profile such that every player is playing a smooth best response to the strategies of the other players. Such profiles π are called logit equilibria, and the set of all such fixed points will be denoted by LE_η. A logit equilibrium is thus an approximation of a local maximizer of the potential function of the game, in the sense that for small η a logit equilibrium places most of the probability mass in areas where the joint action results in a high potential function value; in particular, logit equilibria approximate Nash equilibria when the noise level is sufficiently small.²

Smooth best responses also play an important part in discrete action games, particularly when learning is considered. In this domain they were introduced in stochastic fictitious play by [30], and later studied by, among others, [31–33] to ensure that the played mixed strategies in a fictitious play process converge to logit equilibrium. This is in contrast to classical fictitious play, in which the beliefs of players converge but the played strategies are (almost) always pure. The technique was also required by [34–36] to allow simple reinforcement learners to converge to logit equilibria: as discussed in [34], players whose strategies are a function of the expected value of their actions cannot converge to a Nash equilibrium because, at equilibrium, all actions in the support of the equilibrium mixed strategies receive the same expected reward. Recently, [18] developed the dynamical systems tools necessary to consider whether the smooth best response dynamics converge to logit equilibria in the infinite-dimensional setting.
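On a discretization of A_i, the logit best response is simply a softmax of the payoff function. The sketch below (with a hypothetical payoff estimate; the grid and noise levels are illustrative) shows the density (2) concentrating around the payoff maximizer as η shrinks:

```python
import numpy as np

def logit_density(u_vals, grid, eta):
    """Discretized version of the logit density (2): proportional to
    exp(u/eta), normalized to integrate to 1 via a Riemann sum on the
    uniform grid."""
    dx = grid[1] - grid[0]
    w = np.exp((u_vals - u_vals.max()) / eta)   # subtract the max for stability
    return w / (w.sum() * dx)

grid = np.linspace(0.0, 1.0, 1001)      # discretized action set A_i = [0, 1]
u_vals = -(grid - 0.3) ** 2             # hypothetical payoff u_i(., pi_{-i})
dx = grid[1] - grid[0]
for eta in (1.0, 0.1, 0.001):
    dens = logit_density(u_vals, grid, eta)
    near = dens[np.abs(grid - 0.3) <= 0.1].sum() * dx
    print(eta, near)   # mass near the maximizer a = 0.3 grows as eta shrinks
```

For large η the density is nearly uniform; as η decreases it approaches a point mass at the payoff maximizer, which is the sense in which logit equilibria approximate Nash equilibria for small noise levels.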
This was extended to learning systems in [16], where it was shown that stochastic fictitious play converges to logit equilibrium in two-player zero-sum games with compact continuous action sets.

[Footnote 2: We note here that the notion of a logit equilibrium is a special case of the more general concept of quantal response equilibrium introduced in [29].]

One of the main requirements for efficient learning in a control setting is that the full utility functions of the game need not be known in advance, and players may not be able to observe the actions of all other players. Using fictitious play (or, indeed, many of the other standard game-theoretical tools) does not satisfy this requirement, because these methods assume full knowledge and observability of payoff functions and opponent actions. This is what motivates the simple reinforcement learning approaches discussed previously [34–36], and also the actor-critic reinforcement learning approach of [37], which we extend in this article to the continuous action space setting. The idea is to learn a value function Q^i : A_i → R that estimates the function u_i(a_i, π_{-i}) for the current value of π_{-i}, while also maintaining a separate mixed strategy π_i ∈ P(A_i, B_i). The critic, Q^i, informs the update of the actor, π_i.

Algorithm 1: Actor-Critic Reinforcement Learning Based on Logit Best Responses

    Parameters: step-size sequences α_n, γ_n.
    Initialize critics Q^i and actors π^i; n ← 0.
    Repeat:
        n ← n + 1;
        for each player i = 1, ..., N do
            select action a^i based on actor π^i;                      # play the game
            update critic: Q^i ← Q^i + γ_n (u_i(a^1, ..., a^N) − Q^i); # update payoff estimates
            draw sample b^i ∼ L^i_η(Q^i);                              # sample logit best response
            update actor: π^i ← π^i + α_n (δ_{b^i} − π^i);             # update mixed strategies
    until termination criterion is reached.
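As a concrete illustration, here is a minimal discretized sketch of Algorithm 1 for a hypothetical two-player potential game on [0, 1]; the grid, step sizes, potential, and noise level η are all illustrative choices, the critics are stored as vectors of grid values, and the actors as vectors of atom weights:

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 201)   # discretized action set A_i = [0, 1]
N, eta = 2, 0.1
# Hypothetical potential game: both players' utilities equal the potential
phi = lambda a: -((a[0] - 0.5) ** 2 + (a[1] - 0.5) ** 2)

Q = [np.zeros_like(grid) for _ in range(N)]             # critics Q^i
w = [np.ones_like(grid) / grid.size for _ in range(N)]  # actors as atom weights

for n in range(1, 5001):
    alpha, gamma = n ** -0.9, n ** -0.6    # two timescales: alpha_n/gamma_n -> 0
    a = [grid[rng.choice(grid.size, p=w[i])] for i in range(N)]  # play the game
    for i in range(N):
        u_vec = phi((grid, a[1 - i]))      # counterfactual payoffs u_i(., a^{-i})
        Q[i] += gamma * (u_vec - Q[i])     # critic update
        p = np.exp((Q[i] - Q[i].max()) / eta)
        p /= p.sum()                       # discretized logit response to Q^i
        b = rng.choice(grid.size, p=p)     # sample b^i from the logit response
        e = np.zeros(grid.size); e[b] = 1.0
        w[i] = (1 - alpha) * w[i] + alpha * e  # actor update
        w[i] /= w[i].sum()                 # guard against floating-point drift

means = [float(grid @ w[i]) for i in range(N)]
print(means)  # both actors concentrate around the potential maximizer 0.5
```

Note that this sketch exploits the atomic representation discussed in Remark 3 below: each actor is a finite collection of weighted atoms on the grid, so sampling from it is trivial.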
In turn, the observed utilities received by the actor, π^i, inform the update of the critic, Q^i. In the continuous action space setting of this paper, we implement the actor-critic algorithm as the following iterative process (for a pseudo-code implementation, see Algorithm 1):

1) At the n-th stage of the process, each player i = 1, ..., N selects an action a^i_n by sampling from the distribution π^i_n and uses a^i_n to play the game.
2) Players update their critics using the update equation

    Q^i_{n+1} = Q^i_n + γ_n (u_i(·, a^{-i}_n) − Q^i_n).    (3a)

3) Each player samples b^i_n ∼ L^i_η(Q^i_n) and updates their actor using the update equation

    π^i_{n+1} = π^i_n + α_n (δ_{b^i_n} − π^i_n).    (3b)

The algorithm above is the main focus of our paper, so some remarks are in order:

Remark 1. In (3a), it is assumed that a player can access u_i(·, a^{-i}_n), so they can calculate how much they would have received for each of their actions in response to the joint action that was selected by the other players. Even though this assumption restricts the applicability of our method somewhat, it is relatively harmless in many settings; for instance, in congestion games such estimates can be calculated simply by observing the utilization level of the system's facilities. Note further that to implement this algorithm an individual need not actually observe the action profile a^{-i}_n, needing only the utility u_i(·, a^{-i}_n). This means that a player need know nothing at all about the players who don't directly affect her utility function, which allows a degree of separation and modularisation in large systems, as demonstrated in [38].

Remark 2. The logit response L^i_η used to sample the b^i_n in (3b) is now parameterised by Q^i_n instead of π_{-i}.
This is a trivial change in which we use Q^i(·) in place of u_i(·, π_{-i}) in (1); it reflects the fact that players now select smooth best responses to their critic Q^i instead of directly to the estimated mixed strategy of the other players.

Remark 3. Also in (3b), the players update towards a sampled b^i_n instead of toward the full function L^i_η(Q^i_n). This is so that the actor π^i_n can be represented as a collection of weighted atoms, instead of as a complicated and continuous probability measure. Representing π^i_n as a collection of atoms means that sampling a^i_n ∼ π^i_n is particularly easy. On the other hand, sampling b^i_n ∼ L^i_η(Q^i_n) could be extremely difficult for general Q^i_n. The gradual evolution of the Q^i_n, however, implies that a sequential Monte Carlo sampler [39] could be used to produce samples according to L^i_η(Q^i_n). The representation of Q^i_n is also potentially troublesome, and we do not address it fully here. However, one could assume that each function u_i(·, a^{-i}_n) can be represented as a finite linear combination of basis functions, such as a spline, Fourier or wavelet basis. Another option would be to slowly increase the size of a Fourier or wavelet basis as n gets large, resulting in vanishing bias terms which can be easily incorporated in the stochastic approximation framework.

Remark 4. Finally, we note that the updates (3a) and (3b) use different step-size parameters α_n and γ_n. This separation is what allows the algorithm to be a two-timescales procedure, and is discussed at the start of Section III.

The remainder of this article works to prove the following theorem, while also providing several auxiliary results of independent interest along the way:

Theorem 1.
In a continuous-action-set potential game with bounded Lipschitz rewards and isolated equilibrium components, the actor-critic algorithm (3) converges strongly to a component of the equilibrium set LE_η (a.s.).

Remark. We recall here that strong convergence of probability measures π_n → π* is defined by asking that π_n(A) → π*(A) for every measurable A. As such, this notion of convergence is even stronger than the notion of convergence in distribution (weak convergence) used in the central limit theorem and other weak-convergence results.

III. TWO-TIMESCALES STOCHASTIC APPROXIMATION IN BANACH SPACES

The analysis of systems such as Algorithm 1 is enabled by the use of two-timescales stochastic approximation techniques [25]. By allowing α_n/γ_n → 0 as n → ∞, the system can be analysed as if the 'fast' update (3a), with the larger learning parameter γ_n, has fully converged to the current value of the 'slow' system (3b), with the smaller learning parameter α_n. Note that it is not the case that we have an outer and an inner loop in which (3a) is run to convergence for every update of (3b): both the critic Q_n and the actor π_n are updated on every iteration. It is simply that the two-timescales technique allows us to analyse the system as if there were an inner loop.

That being said, the results of [25] are cast only in the framework of finite-dimensional spaces. We have already observed that with continuous action spaces A_i, the mixed strategies π_i are probability measures in the space P(A_i, B_i), and the critics Q^i are L² functions. Placing appropriate norms on these spaces results in Banach spaces, and in this section we combine the two-timescales results of [25] with the Banach space stochastic approximation framework of [16] to develop the tool necessary to analyse the recursion (3).
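Before the general framework, the effect of the two step sizes can be seen in a scalar toy system (all of the dynamics below are illustrative, not taken from the paper). Both variables are updated on every iteration, yet because α_n/γ_n → 0 the fast variable y continually tracks its equilibrium target for the current x, here y*(x) = x, while x follows the slow averaged dynamics dx/dt = −x:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = 1.0, 0.0
for n in range(1, 200001):
    alpha, gamma = n ** -0.9, n ** -0.6   # alpha_n / gamma_n -> 0
    # fast update: for frozen x, dy/dt = x - y has global attractor y*(x) = x
    y += gamma * ((x - y) + rng.normal(0.0, 0.1))
    # slow update: with y ~ y*(x) = x, this behaves like dx/dt = -x
    x += alpha * (-y + rng.normal(0.0, 0.1))
print(x, y - x)  # x has relaxed towards 0, and y tracks y*(x) = x
```

No inner loop is ever run to convergence; the separation of step sizes alone makes the fast iterate behave as if it had been.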
To that end, consider the general two-timescales stochastic approximation system

    x_{n+1} = x_n + α_{n+1} [F(x_n, y_n) + U_{n+1} + c_{n+1}],    (4a)
    y_{n+1} = y_n + γ_{n+1} [G(x_n, y_n) + V_{n+1} + d_{n+1}],    (4b)

where
• x_n and y_n are sequences in the Banach spaces (X, ‖·‖_X) and (Y, ‖·‖_Y) respectively;
• {α_n} and {γ_n} are the learning rate sequences of the process;
• F : X × Y → X and G : X × Y → Y comprise the mean field of the process;
• {U_n} and {V_n} are stochastic processes in X and Y respectively (for a detailed exposition of Banach-valued random variables, see [40]);
• c_n ∈ X and d_n ∈ Y are bias terms that converge almost surely to 0.

We will study this system using the asymptotic pseudotrajectory approach of [41], which is already cast in the language of metric spaces; since Banach spaces are metric, the framework of [41] still applies to our scenario. This modernises the approach of [22], while also introducing the two-timescales technique to 'abstract stochastic approximation'.

To proceed, recall that a semiflow Φ on a metric space M is a continuous map Φ : R₊ × M → M, (t, x) ↦ Φ_t(x), such that Φ_0(x) = x and Φ_{t+s}(x) = Φ_t(Φ_s(x)) for all t, s ≥ 0. As in simple Euclidean spaces, well-posed differential equations on Banach spaces induce a semiflow [42]. A continuous function z : R₊ → M is an asymptotic pseudotrajectory for Φ if for any T > 0,

    lim_{t→∞} sup_{0≤s≤T} d(z(t + s), Φ_s(z(t))) = 0.

Properties of asymptotic pseudotrajectories are discussed in detail in [41]. We will prove that interpolations of the stochastic approximation process (4) are asymptotic pseudotrajectories to flows induced by dynamical systems on X and Y governed by F and G respectively. To do so, and to allow us to state the necessary assumptions on the processes, we define the timescales on which we will interpolate the stochastic approximation process.
In particular, let τ^α_n = Σ_{j=1}^n α_j (with τ^α_0 = 0), and for t ∈ R₊ let m^α(t) = sup{k ≥ 0 : τ^α_k ≤ t}. Similarly, let τ^γ_n = Σ_{j=1}^n γ_j (with τ^γ_0 = 0), and for t ∈ R₊ let m^γ(t) = sup{k ≥ 0 : τ^γ_k ≤ t}. With these timescales we define interpolations of the stochastic approximation processes (4). On the slow (α) timescale we define a continuous-time interpolation x̄^α : R₊ → X of {x_n}_{n∈N} by letting

    x̄^α(τ^α_n + s) = x_n + s (x_{n+1} − x_n) / α_{n+1}    (5)

for s ∈ [0, α_{n+1}). On the fast (γ) timescale we consider z_n = (x_n, y_n) ∈ X × Y, and define the continuous-time interpolation z̄^γ : R₊ → X × Y of {z_n}_{n∈N} by letting

    z̄^γ(τ^γ_n + s) = z_n + s (z_{n+1} − z_n) / γ_{n+1}    (6)

for s ∈ [0, γ_{n+1}).

Our assumptions, which are simple extensions of those of [25] and [41], can now be stated as follows:

A1) Noise control.
    a) For all T > 0,

        lim_{n→∞} sup_{k∈{n+1,...,m^α(τ^α_n+T)}} ‖ Σ_{j=n}^{k−1} α_{j+1} U_{j+1} ‖_X = 0,
        lim_{n→∞} sup_{k∈{n+1,...,m^γ(τ^γ_n+T)}} ‖ Σ_{j=n}^{k−1} γ_{j+1} V_{j+1} ‖_Y = 0.

    b) {c_n}_{n∈N} and {d_n}_{n∈N} are bounded sequences such that ‖c_n‖_X → 0 and ‖d_n‖_Y → 0 as n → ∞.

A2) Boundedness and continuity.
    a) There exist compact sets C ⊂ X and D ⊂ Y such that x_n ∈ C and y_n ∈ D for all n ∈ N.
    b) F and G are bounded and uniformly continuous on C × D.

A3) Learning rates.
    a) Σ_{n=1}^∞ α_n = ∞ and Σ_{n=1}^∞ γ_n = ∞, with α_n → 0 and γ_n → 0 as n → ∞.
    b) α_n/γ_n → 0 as n → ∞.

A4) Mean field behaviour.
    a) For any fixed x̃ ∈ C, the differential equation

        dy/dt = G(x̃, y)    (7)

    has unique solution trajectories that remain in D for any initial value y_0 ∈ D. Furthermore, the differential equation (7) has a unique globally attracting fixed point y*(x̃), and the function y* : C → D is Lipschitz continuous.
    b) The differential equation

        dx/dt = F(x, y*(x))    (8)

    has unique solution trajectories that remain in C for any initial value x_0 ∈ C.

Assumption A1 is the standard assumption for noise control in stochastic approximation. It has traditionally caused difficulty in abstract stochastic approximation, but recent solutions are discussed in the following paragraph. Assumption A2 is simply a boundedness and continuity assumption, but can cause difficulty with some norms in function spaces. Assumption A3 provides the two-timescales nature of the scheme, with both learning rate sequences converging to 0, but with α_n becoming much smaller than γ_n. Finally, Assumption A4 provides both the existence of unique solutions of the relevant mean field differential equations and the useful separation of timescales in continuous time, which is directly analogous to Assumption (A1) of [25]. Note that we do not make the stronger assumption that there exists a unique globally asymptotically stable fixed point of the slow timescale dynamics (8) [25, Assumption A2]; this assumption is not necessary for the theory presented here, and would unnecessarily restrict the applicability of the results.

Note that the noise assumption A1(a) has traditionally caused difficulty for stochastic approximation on Banach spaces: [23] considers the simple case where the stochastic terms are independent and identically distributed, whilst [22] proves a very weak convergence result for a particular process which again uses independent noise. However, [16] provides criteria analogous to the martingale noise assumptions in R^K which guarantee that the noise condition A1(a) holds in useful Banach spaces.
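Condition A1(a) can also be checked empirically for simple martingale noise. In the sketch below (all choices illustrative), α_j = j^(−0.9) is square-summable and the noise is i.i.d. ±1; the suprema of the weighted partial sums over windows of fixed timescale length T shrink as n grows, as A1(a) requires:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 2_000_000, 1.0
j = np.arange(1, N + 1)
alpha = j ** -0.9                      # deterministic, sum of alpha_j^2 finite
U = rng.choice([-1.0, 1.0], size=N)    # bounded martingale differences
S = np.concatenate([[0.0], np.cumsum(alpha * U)])   # S_k = sum_{j<=k} alpha_j U_j
tau = np.concatenate([[0.0], np.cumsum(alpha)])     # tau_k on the alpha timescale

for n in (100, 10_000, 1_000_000):
    m = np.searchsorted(tau, tau[n] + T, side="right") - 1   # m^alpha(tau_n + T)
    sup = np.max(np.abs(S[n + 1 : m + 1] - S[n]))
    print(n, sup)   # the window supremum shrinks as n grows
```

Intuitively, the tail sums Σ α_{j+1} U_{j+1} behave like a martingale whose remaining variance Σ α_j² vanishes, which is exactly the mechanism behind the criteria of [16] quoted below.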
In particular, if {U_n} is a sequence of martingale differences in a Banach space X, then

    lim_{n→∞} sup_{k∈{n+1,...,m^α(τ^α_n+T)}} ‖ Σ_{j=n}^{k−1} α_{j+1} U_{j+1} ‖_X = 0

with probability 1 if X is:
• the space of L^p functions for p ≥ 2, where {α_n}_{n∈N} is deterministic with Σ_{n∈N} α_n^{1+q/2} < ∞, {U_n}_{n∈N} is a martingale difference sequence with respect to some filtration {F_n}_{n∈N}, and sup_{n∈N} E[‖U_n‖^q_{L^p}] < ∞ (cf. the remark following Proposition A.1 of [16]);
• the space of L¹ functions on bounded spaces (see [43]); or
• the space of finite signed measures on a compact interval of R with the bounded Lipschitz norm (see [16, 26, 27] or Section IV below), where {α_n}_{n∈N} is deterministic with Σ_{n∈N} α_n² < ∞ and U_n = δ_{x_{n+1}} − P_n, with a filtration {F_n}_{n∈N} such that U_n is measurable with respect to F_n, P_n a bounded absolutely continuous probability measure which is measurable with respect to F_n and has density p_n, and x_{n+1} sampled from the probability distribution P_n (Proposition 3.6 of [16]).

Clearly, if similar conditions also hold for Y, then Assumption A1(a) holds.

Our first lemma demonstrates that we can analyse the system as if the fast system {y_n} is fully calibrated to the slow system {x_n}. By this we mean that, for sufficiently large n, y_n is close to the value it would converge to if x_n were fixed and y_n allowed to fully converge.

Lemma 2. Under Assumptions A1–A4, ‖y_n − y*(x_n)‖_Y → 0 as n → ∞.

Proof: Let Z = X × Y, with ‖·‖_Z the induced product norm from the topologies of X and Y. Under this topology, Z is a Banach space, and C × D is compact. The updates (4) can be expressed as

    z_{n+1} = z_n + γ_{n+1} [H(z_n) + W_{n+1} + κ_{n+1}],    (9)

where H : Z → Z is such that H(z_n) = (0, G(z_n)), for 0 ∈ X, and

    W_n = ((α_n/γ_n) U_n, V_n),    κ_{n+1} = ((α_{n+1}/γ_{n+1}) [F(z_n) + c_{n+1}], d_{n+1}).
Assumptions A1–A4 imply the assumptions of Theorem 3.3 of [16]. Most are direct translations, but the noise must be carefully considered. For any n ∈ N, any T > 0, and any k ∈ {n+1, ..., m^γ(τ^γ_n + T)},

    ‖ Σ_{j=n}^{k−1} γ_{j+1} (W_{j+1} + κ_{j+1}) ‖_Z
        ≤ ‖ Σ_{j=n}^{k−1} γ_{j+1} W_{j+1} ‖_Z + ‖ Σ_{j=n}^{k−1} γ_{j+1} κ_{j+1} ‖_Z
        ≤ ‖ Σ_{j=n}^{k−1} γ_{j+1} W_{j+1} ‖_Z + (sup_{k′∈{n+1,...,k}} ‖κ_{k′}‖_Z) Σ_{j=n}^{k−1} γ_{j+1}
        ≤ ‖ Σ_{j=n}^{k−1} γ_{j+1} W_{j+1} ‖_Z + (sup_{k′∈{n+1,...,m^γ(τ^γ_n+T)}} ‖κ_{k′}‖_Z) Σ_{j=n}^{m^γ(τ^γ_n+T)−1} γ_{j+1}
        ≤ ‖ Σ_{j=n}^{k−1} γ_{j+1} W_{j+1} ‖_Z + (sup_{k′≥n+1} ‖κ_{k′}‖_Z) T.

Since κ_n → 0, the second term converges to 0 as n → ∞. Hence, using Assumption A1 to control the first term,

    lim_{n→∞} sup_{k∈{n+1,...,m^γ(τ^γ_n+T)}} ‖ Σ_{j=n}^{k−1} γ_{j+1} (W_{j+1} + κ_{j+1}) ‖_Z = 0.

Therefore z̄^γ(·) : R₊ → X × Y, defined in (6), is an asymptotic pseudotrajectory of the flow defined by

    dz/dt = H(z(t)).    (10)

Assumption A4(a) implies that {(x, y*(x)) : x ∈ C} is globally attracting for (10). Hence Theorem 6.10 of [41] gives that z_n → {(x, y*(x)) : x ∈ C}. The result follows by the continuity of y* assumed in A4(a).

We use this fact to consider the evolution of x_n on the slow timescale.

Theorem 3. Under Assumptions A1–A4, the interpolation x̄^α(·) : R₊ → X, defined in (5), is an asymptotic pseudotrajectory of the flow induced by the differential equation (8).

Proof: Rewrite (4a) as

    x_{n+1} = x_n + α_{n+1} [F(x_n, y*(x_n)) + U_{n+1} + c̃_{n+1}],    (11)

where c̃_{n+1} = F(x_n, y_n) − F(x_n, y*(x_n)) + c_{n+1}. We will show that this is a well-behaved stochastic approximation process. In particular, we need to show that c̃_n can be absorbed into U_n in such a way that the equivalent of Assumption A1 of [16] can be applied to U_n + c̃_n. By Lemma 2 we have that ‖y_n − y*(x_n)‖_Y → 0.
Hence we can define $\delta_n=\inf\{\delta>0:\forall m\ge n,\ \|y_m-y^*(x_m)\|_{Y}<\delta\}$, with $\delta_n\to 0$ as $n\to\infty$. By the uniform continuity of $F$, it follows that we can define a sequence $\varepsilon_n\to 0$ such that for all $m\ge n$, $\|F(x_m,y_m)-F(x_m,y^*(x_m))\|_{X}<\varepsilon_n$. From this construction, for any $n\ge 0$ and for any $k\in\{n+1,\dots,m^{\alpha}(\tau^{\alpha}_{n}+T)\}$,
$$\Big\|\sum_{j=n}^{k-1}\alpha_{j+1}\big[F(x_j,y_j)-F\big(x_j,y^*(x_j)\big)\big]\Big\|_{X}\le \sum_{j=n}^{k-1}\alpha_{j+1}\varepsilon_{n}\le T\varepsilon_{n}.$$
As in the proof of Lemma 2, similar arguments can be used for $\{c_n\}_{n\in\mathbb{N}}$ under Assumption A1(b). Hence, for all $T>0$,
$$\lim_{n\to\infty}\ \sup_{k\in\{n+1,\dots,m^{\alpha}(\tau^{\alpha}_{n}+T)\}}\Big\|\sum_{j=n}^{k-1}\alpha_{j+1}\tilde c_{j+1}\Big\|_{X}=0.$$
Once again it is straightforward to show that, under A1–A4, the slow-timescale stochastic approximation (11) satisfies the assumptions of Theorem 3.3 of [16], and therefore $\bar x^{\alpha}(\cdot):\mathbb{R}_+\to X$ is an asymptotic pseudotrajectory of the flow induced by the differential equation (8).

While [41] provides several results that can be combined with Theorem 3, we summarise the result used in this paper with the following corollary.

Corollary 4. Suppose that Assumptions A1–A4 hold. Then $x_n$ converges to an internally chain transitive set of the flow induced by the mean-field differential equation (8).

Proof: This is an immediate consequence of Theorem 3 above and Theorem 5.7 of [41], where the definition of internally chain transitive sets can be found.

IV. STOCHASTIC APPROXIMATION OF THE ACTOR–CRITIC ALGORITHM

In this section we demonstrate that the actor–critic algorithm (3) can be analysed using the two-timescale stochastic approximation framework of Section III. Our first task is to define the Banach spaces in which the algorithm evolves. Note that the set $P(A^i,\mathcal{B}^i)$ of probability distributions on $A^i$ is a subset of the space $M(A^i,\mathcal{B}^i)$ of finite signed measures on $(A^i,\mathcal{B}^i)$.
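As a concrete illustration of the two-timescale behaviour established in Lemma 2 and Corollary 4, the following minimal sketch runs a hypothetical finite-dimensional instance of recursion (4): the fast iterate $y_n$ calibrates to $y^*(x_n)$, and the slow iterate $x_n$ then follows the mean field $\dot x = F(x, y^*(x))$. The drifts, step sizes, and noise levels below are invented for illustration and are not the actor–critic updates studied in this paper.

```python
import random

random.seed(0)

# Hypothetical instance of the two-timescale recursion (4):
#   x_{n+1} = x_n + a_n [ F(x_n, y_n) + noise ]   (slow iterate)
#   y_{n+1} = y_n + g_n [ G(x_n, y_n) + noise ]   (fast iterate)
# with a_n / g_n -> 0, so y_n tracks y*(x_n), the rest point of the fast flow.

def F(x, y):            # slow drift: with y calibrated to y*(x) = x, the
    return y - 2.0 * x  # mean field is dx/dt = x - 2x = -x, so x -> 0

def G(x, y):            # fast drift, with rest point y*(x) = x
    return x - y

x, y = 1.0, 5.0
for n in range(1, 200001):
    a = 1.0 / n              # slow steps: sum a_n = inf, sum a_n^2 < inf
    g = 1.0 / n ** (2 / 3)   # fast steps: a_n / g_n -> 0
    x += a * (F(x, y) + random.gauss(0.0, 0.1))
    y += g * (G(x, y) + random.gauss(0.0, 0.1))

print(f"x_n = {x:.3f}, |y_n - y*(x_n)| = {abs(y - x):.3f}")
```

With these step sizes both printed quantities should be close to zero, mirroring Lemma 2 (calibration of the fast iterate) and Corollary 4 (convergence of the slow iterate under the mean-field dynamics).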
To turn the space $M(A^i,\mathcal{B}^i)$ into a Banach space, the most convenient norm for our purposes is the bounded Lipschitz (BL) norm.³ To define the BL norm, let
$$G^i=\Big\{g:A^i\to\mathbb{R}\ :\ \sup_{a\in A^i}|g(a)|+\sup_{a,b\in A^i,\,a\ne b}\frac{|g(a)-g(b)|}{|a-b|}\le 1\Big\}.$$
Then, for $\mu\in M(A^i,\mathcal{B}^i)$, we define
$$\|\mu\|_{BL^i}=\sup_{g\in G^i}\int_{A^i}g\,d\mu.$$
$M(A^i,\mathcal{B}^i)$ with norm $\|\cdot\|_{BL^i}$ is a Banach space [27], and convergence of a sequence of probability measures under $\|\cdot\|_{BL^i}$ corresponds to weak convergence of the measures [26]. Under the BL norm, $P(A^i,\mathcal{B}^i)$ is a compact subset of $M(A^i,\mathcal{B}^i)$ (see Proposition 4.6 of [16]), allowing Assumption A2 to be easily verified.

We consider mixed strategy profiles as elements of the subset $\Delta$ of the product space $\Sigma=M(A^1,\mathcal{B}^1)\times\cdots\times M(A^N,\mathcal{B}^N)$. We use the max norm to induce the product topology, so that for $\mu=(\mu^1,\dots,\mu^N)\in\Sigma$ we define
$$\|\mu\|_{BL}=\max_{i=1,\dots,N}\|\mu^i\|_{BL^i}.\tag{12}$$
Suppose also that the utility functions $u^i$ are bounded and Lipschitz continuous. Since their domain is a bounded interval of $\mathbb{R}$, we can assume that the estimates $Q^i_n$ lie in the Banach space $L^2(A^i)$ of functions $A^i\to\mathbb{R}$ with finite $L^2$ norm. Hence we consider the vectors $Q_n=(Q^1_n,\dots,Q^N_n)$ as elements of the Banach space $Y=\times_{i=1}^{N}L^2(A^i)$ with $\|Q\|_{Y}=\max_{i=1,\dots,N}\|Q^i\|_{L^2}$.

Theorem 5. Consider the actor–critic algorithm (3). Suppose that for each $i$ the action space $A^i$ is a compact interval of $\mathbb{R}$, and the utility function $u^i$ is bounded and uniformly Lipschitz continuous. Suppose also that $\{\alpha_n\}_{n\in\mathbb{N}}$ and $\{\gamma_n\}_{n\in\mathbb{N}}$ are chosen to satisfy Assumption A3 as well as $\sum_{n\in\mathbb{N}}\alpha_n^2<\infty$ and $\sum_{n\in\mathbb{N}}\gamma_n^2<\infty$.
Then, under the bounded Lipschitz norm, $\{\pi_n\}_{n\in\mathbb{N}}$ converges with probability 1 to an internally chain transitive set of the flow defined by the $N$-player logit best response dynamics
$$\frac{d\pi}{dt}=L_{\eta}(\pi)-\pi.\tag{13}$$

Proof: We take $(X,\|\cdot\|_{X})=(\Sigma,\|\cdot\|_{BL})$ and $(Y,\|\cdot\|_{Y})$ as above. This allows a direct mapping of the actor–critic algorithm (3) to the stochastic approximation framework (4) by taking
$$x_n=\pi_n,\quad F(\pi,Q)=L_{\eta}(Q)-\pi,\quad U_{n+1}=(\delta_{b^1_n},\dots,\delta_{b^N_n})-L_{\eta}(Q),\quad c_n=0,$$
and
$$y_n=Q_n,\quad G(\pi,Q)=\big(G^1(\pi,Q),\dots,G^N(\pi,Q)\big),\quad G^i(\pi,Q)=u^i(\cdot,\pi^{-i})-Q^i,$$
$$V_{n+1}=(V^1_{n+1},\dots,V^N_{n+1}),\quad V^i_{n+1}=u^i(\cdot,a^{-i}_n)-u^i(\cdot,\pi^{-i}_n),\quad d_n=0.$$
By Corollary 4 we therefore only need to verify Assumptions A1–A4.

A1: $U_n$ is of exactly the form studied by [16], and therefore Proposition 3.6 of that paper suffices to prove that the condition on the tail behaviour of $\sum_j\alpha_{j+1}U_{j+1}$ holds with probability 1. The $V_{n+1}$ form a martingale difference sequence, since $\mathbb{E}\big(u^i(\cdot,a^{-i}_n)\mid\mathcal{F}_n\big)=u^i(\cdot,\pi^{-i}_n)$, and the $Q_{n+1}$ are $L^2$ functions. Hence Proposition A.1 of [16] suffices to prove that the condition on the tail behaviour of $\sum_j\gamma_{j+1}V_{j+1}$ holds with probability 1 under the $L^2$ norm. Since $c_n$ and $d_n$ are identically zero, we have shown that A1 holds.

A2: $\Delta$ is a compact subset of $\Sigma$ under the bounded Lipschitz norm, so taking $C=\Delta$ suffices. Furthermore, with bounded continuous reward functions $u^i$, the $Q^i_n$ are uniformly bounded and equicontinuous, and therefore remain in a compact set $D$. $G$ is clearly uniformly continuous on the compact set $C\times D$. The continuity of $L_{\eta}$, and therefore of $F$, is shown in Lemma C.2 of [16].

³ For a discussion regarding the appropriateness of this norm for game-theoretical considerations, see [18, 26, 27]; for stochastic approximation, see especially [16].
A3: The learning rates are chosen to satisfy this assumption.

A4: For fixed $\tilde\pi$, the differential equations $\dot Q^i=u^i(\cdot,\tilde\pi^{-i})-Q^i$ converge exponentially quickly to $Q^i=u^i(\cdot,\tilde\pi^{-i})$. Furthermore, $u^i(\cdot,\pi^{-i})$ is Lipschitz continuous in $\pi^{-i}$, so part (a) is satisfied. Equation (8) then becomes
$$\dot\pi^i=L^i_{\eta}\big(u^i(\cdot,\pi^{-i})\big)-\pi^i,\qquad i=1,\dots,N.$$
Since we rewrote $L^i_{\eta}$ to depend on the utility functions instead of directly on $\pi^{-i}$, we find that we have recovered the logit best response dynamics of [18] and [16], which those authors show to have unique solution trajectories.

V. CONVERGENCE OF THE LOGIT BEST RESPONSE DYNAMICS

We have shown in Theorem 5 that the actor–critic algorithm (3) results in joint strategies $\{\pi_n\}_{n\in\mathbb{N}}$ that converge, under the bounded Lipschitz norm, to an internally chain transitive set of the flow defined by the logit best response dynamics (13). It is demonstrated in [16] that in two-player zero-sum continuous action games the set $LE_{\eta}$ of logit equilibria (the fixed points of the logit best response $L_{\eta}$) is a global attractor of the flow. Hence, by Corollary 5.4 of [41], we instantly obtain the result that any internally chain transitive set is contained in $LE_{\eta}$. However, two-player zero-sum games are not particularly relevant for control systems: multiplayer potential games are much more important. The logit best responses in a potential game are identical to the logit best responses in the identical interest game in which the potential function is the global utility function. Hence the evolution of strategies under the logit best response dynamics in a potential game is identical to that in the identical interest game in which the potential acts as the global utility.
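To make this structure concrete, the sketch below discretises the action space and mimics the actor–critic scheme for a two-player identical interest game: each critic $Q^i$ tracks $u^i(\cdot,a^{-i})$ on the fast timescale, and each actor $\pi^i$ takes a small step towards a point mass at an action sampled from the logit best response to $Q^i$. The grid, payoff function, and parameters are hypothetical stand-ins (the paper's action sets are continuous); this is a sketch of the structure, not the algorithm itself.

```python
import math
import random

random.seed(1)

# Schematic, discretised two-timescale actor-critic for a 2-player
# identical-interest game on a grid of K actions in [0, 1]. The payoff u
# below is a hypothetical potential, maximised at a = 0.5.

K = 21
grid = [i / (K - 1) for i in range(K)]
eta = 0.05

def u(a1, a2):
    return -((a1 - 0.5) ** 2 + (a2 - 0.5) ** 2)

def logit(q):                       # discrete analogue of L_eta: softmax(q / eta)
    m = max(q)
    w = [math.exp((x - m) / eta) for x in q]
    s = sum(w)
    return [x / s for x in w]

def sample(p):                      # draw an index from a probability vector
    r, c = random.random(), 0.0
    for i, pi in enumerate(p):
        c += pi
        if r <= c:
            return i
    return len(p) - 1

pi1, pi2 = [1.0 / K] * K, [1.0 / K] * K      # actors (mixed strategies)
Q1, Q2 = [0.0] * K, [0.0] * K                # critics (payoff estimates)

for n in range(1, 20001):
    alpha, gamma = 1.0 / n, 1.0 / n ** (2 / 3)     # slow / fast step sizes
    b1, b2 = sample(logit(Q1)), sample(logit(Q2))  # played actions
    for i in range(K):   # critics track u(., opponent's realised action)
        Q1[i] += gamma * (u(grid[i], grid[b2]) - Q1[i])
        Q2[i] += gamma * (u(grid[b1], grid[i]) - Q2[i])
    for i in range(K):   # actors step towards the point mass at the action
        pi1[i] += alpha * ((1.0 if i == b1 else 0.0) - pi1[i])
        pi2[i] += alpha * ((1.0 if i == b2 else 0.0) - pi2[i])

mean1 = sum(p * a for p, a in zip(pi1, grid))
print(f"player 1 mean action = {mean1:.2f}")
```

In this toy run the strategy mass gathers near the potential maximiser $a=0.5$, consistent with convergence towards the set of logit equilibria for small $\eta$.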
We therefore carry out our convergence analysis of the logit best response dynamics (13) in $N$-player identical interest games with continuous action spaces; see [44] for related issues. For the remainder of this section we work to prove the following theorem.

Theorem 6. In a potential game with continuous bounded rewards, in which the connected components of the set $LE_{\eta}$ of logit equilibria of the game are isolated, any internally chain transitive set of the flow induced by the smooth best response dynamics (13) is contained in a connected component of $LE_{\eta}$.

Define
$$\Delta^{D}=\big\{\pi\in\Delta\ :\ \forall i=1,\dots,N,\ \pi^i \text{ is absolutely continuous with density } p^i \text{ such that } D^{-1}\le p^i(x^i)\le D \text{ for all } x^i\in A^i, \text{ and } p^i \text{ is Lipschitz continuous with constant } D\big\}.$$
Appendix C of [16] shows that if the utility functions $u^i$ are bounded and Lipschitz continuous then, for any $\eta>0$, there exists a $D$ such that $L_{\eta}(\pi)\in\Delta^{D}$ for all $\pi\in\Delta$, and that $\Delta^{D}$ is forward invariant under the logit best response dynamics. For the remainder of this article, $D$ is taken to be sufficiently large for this to be the case.

Our method first demonstrates that the set $\Delta^{D}$ is globally attracting for the flow, so any internally chain transitive set of the flow is contained in $\Delta^{D}$. The nice properties of $\Delta^{D}$ then allow the use of a Lyapunov function argument to show that any internally chain transitive set in $\Delta^{D}$ is a connected set of logit equilibria.

Lemma 7. Let $\Lambda\subset\Delta$ be an internally chain transitive set. Then $\Lambda\subset\Delta^{D}$.

Proof: Consider the trajectory of (13) starting at an arbitrary $\pi(0)\in\Delta$. We can write $\pi(t)$ as
$$\pi(t)=e^{-t}\pi(0)+\int_{0}^{t}e^{s-t}L_{\eta}\big(\pi(s)\big)\,ds.$$
Defining
$$\sigma(t)=\frac{\int_{0}^{t}e^{s-t}L_{\eta}\big(\pi(s)\big)\,ds}{1-e^{-t}},$$
it is immediate both that $\sigma(t)\in\Delta^{D}$ and that
$$\|\pi(t)-\sigma(t)\|_{BL}<2e^{-t}.\tag{14}$$
Thus $\pi(t)$ approaches $\Delta^{D}$ at an exponential rate, uniformly in $\pi(0)$. Hence $\Delta^{D}$ is uniformly globally attracting.
We would like to invoke Corollary 5.4 of [41], but since $\Delta^{D}$ may not be invariant, it is not an attractor in the terminology of [41] either. We therefore prove directly that $\Lambda\subset\Delta^{D}$. Suppose not; then there exists a point $p\in\Lambda\setminus\Delta^{D}$, and by the compactness of internally chain transitive sets there exists a $\delta>0$ such that $\inf_{\pi\in\Delta^{D}}\|p-\pi\|=2\delta$. There exists a $T>0$ such that, for the trajectory $p(t)$ with $p(0)=p$, $\inf_{\pi\in\Delta^{D}}\|p(T)-\pi\|<\delta$, and so $\|p(T)-p\|>\delta$. Hence, as in the proof of Proposition 5.3 of [41], $p$ cannot be part of an internally chain recurrent set (see [41]). Since internally chain transitive sets are internally chain recurrent sets [41, Proposition 5.3], we have a contradiction. Hence $\Lambda\subset\Delta^{D}$.

We are now left to find the internally chain transitive sets of the flow restricted to $\Delta^{D}$. Since all elements of $\Delta^{D}$ admit densities, we can define a Lyapunov function based on the densities of the mixed strategies. For an absolutely continuous mixed strategy $\pi^i$ with density function $p^i$, we define the entropy
$$\nu^{i}(\pi^{i})=-\int_{A^i}p^{i}(x^i)\log p^{i}(x^i)\,dx^i.$$
The Lyapunov function to be considered is
$$V_{\eta}(\pi)=-\Big[u(\pi)+\eta\sum_{i=1}^{N}\nu^{i}(\pi^{i})\Big],\tag{15}$$
where $u^i(\pi)=u(\pi)$ for all $i$. For $V_{\eta}$ to be a useful Lyapunov function, it must be continuous with respect to the bounded Lipschitz norm that we use on strategy space.

Lemma 8. $V_{\eta}:\Delta^{D}\to\mathbb{R}$ is continuous with respect to the bounded Lipschitz norm.

Proof: Note that $u$ is multilinear and therefore continuous. Therefore it suffices to show that the entropy $\nu^{i}(\pi^{i})$ is continuous in $\pi^{i}$. Consider two densities $p$ and $q$ corresponding to distributions $P$ and $Q$ on a finite interval $A\subset\mathbb{R}$, and assume that $p(x),q(x)\in[D^{-1},D]$ for all $x\in A$, and that both $p$ and $q$ are Lipschitz continuous with constant $D$.
We calculate that
$$\begin{aligned}
|\nu(P)-\nu(Q)|&=\Big|\int_{A}\big[p(x)\log p(x)-q(x)\log q(x)\big]\,dx\Big|\\
&\le \int_{A}|p(x)-q(x)|\,|\log p(x)|\,dx+\int_{A}q(x)\,|\log p(x)-\log q(x)|\,dx\\
&\le \log(D)\int_{A}|p(x)-q(x)|\,dx+D\int_{A}|\log p(x)-\log q(x)|\,dx,
\end{aligned}$$
since both $p(x)$ and $q(x)$ are uniformly bounded above by $D$. Furthermore, since $\log$ is Lipschitz on $[D^{-1},D]$ with constant $D$, $|\log p(x)-\log q(x)|\le D\,|p(x)-q(x)|$. We therefore see that
$$|\nu(P)-\nu(Q)|\le\big(\log D+D^{2}\big)\int_{A}|p(x)-q(x)|\,dx.$$
It remains to show that this integral is arbitrarily small for $P$ and $Q$ sufficiently close under the bounded Lipschitz norm. Note that this is not the case for arbitrary $P$ and $Q$, but the Lipschitz continuity of $p$ and $q$ ensures that we can complete the result. In particular, suppose that there exists an $x^*$ such that $p(x^*)-q(x^*)>\epsilon$. To reduce the notational effort, assume that $x^*\pm\epsilon/(4D)\in A$, avoiding boundary effects (which can be accommodated simply but with more notation). For $x\in[x^*-\epsilon/(4D),\,x^*+\epsilon/(4D)]$ we have that $p(x)>q(x)+\epsilon/2$. Define a test function $g(x)=\max\big(0,\ \epsilon/(8D)-|x-x^*|/2\big)$. We have that
$$\|P-Q\|_{BL}\ge\int_{A}\big(p(x)-q(x)\big)g(x)\,dx=\int_{x^*-\epsilon/(4D)}^{x^*+\epsilon/(4D)}\big(p(x)-q(x)\big)g(x)\,dx\ge\int_{x^*-\epsilon/(4D)}^{x^*+\epsilon/(4D)}\frac{\epsilon}{2}\,g(x)\,dx=\frac{\epsilon^{3}}{64D^{2}}.$$
So by taking $\|P-Q\|_{BL}$ small, we can force $p(x)-q(x)$ to be uniformly small, and hence $\int_{A}|p(x)-q(x)|\,dx$ to be small, giving the result.

Lemma 9. The function $V_{\eta}$ is strictly decreasing along any trajectory in $\Delta^{D}$ whenever $\pi\notin LE_{\eta}$.

Proof: Using the Gateaux derivative,
$$\dot V_{\eta}(\pi)=dV_{\eta}(\pi,\dot\pi)=-\Big[du(\pi,\dot\pi)+\eta\sum_{i=1}^{N}d\nu^{i}(\pi^{i},\dot\pi^{i})\Big]=-\sum_{i=1}^{N}\Big[du\big((\pi^{i},\pi^{-i}),\dot\pi^{i}\big)+\eta\,d\nu^{i}(\pi^{i},\dot\pi^{i})\Big].$$
It follows directly from the definition of the derivatives that $du\big((\pi^{i},\pi^{-i}),\dot\pi^{i}\big)=\int_{A^i}u(a^i,\pi^{-i})\,\dot\pi^{i}(da^i)$. Rearranging the definition of $l^{i}_{\eta}(\pi^{-i})$ from (2) gives
$$u(a^i,\pi^{-i})=\eta\log\big(l^{i}_{\eta}(\pi^{-i})(a^i)\big)+\eta\log\int_{A^i}\exp\{\eta^{-1}u(\tilde a^i,\pi^{-i})\}\,d\tilde a^i.$$
So, noting that $\int_{A^i}\dot\pi^{i}(da^i)=0$,
$$\int_{A^i}u(a^i,\pi^{-i})\,\dot\pi^{i}(da^i)=\eta\int_{A^i}\log\big(l^{i}_{\eta}(\pi^{-i})(a^i)\big)\,\dot\pi^{i}(da^i).$$
It is shown in [16, equation (D.3)] that $d\nu^{i}(\pi^{i},\dot\pi^{i})=-\int_{A^i}\log\big(p^{i}(a^i)\big)\,\dot\pi^{i}(da^i)$. Hence
$$\begin{aligned}
\dot V_{\eta}(\pi)&=-\eta\sum_{i=1}^{N}\int_{A^i}\Big[\log\big(l^{i}_{\eta}(\pi^{-i})(a^i)\big)-\log\big(p^{i}(a^i)\big)\Big]\,\dot\pi^{i}(da^i)\\
&=-\eta\sum_{i=1}^{N}\int_{A^i}\Big[\log\big(l^{i}_{\eta}(\pi^{-i})(a^i)\big)-\log\big(p^{i}(a^i)\big)\Big]\Big[l^{i}_{\eta}(\pi^{-i})(a^i)-p^{i}(a^i)\Big]\,da^i\\
&=-\eta\sum_{i=1}^{N}\Big[KL\big(l^{i}_{\eta}(\pi^{-i})\,\big\|\,p^{i}\big)+KL\big(p^{i}\,\big\|\,l^{i}_{\eta}(\pi^{-i})\big)\Big],
\end{aligned}$$
where $KL(\cdot\|\cdot)$ is the Kullback–Leibler divergence, which is non-negative and zero only when its two arguments are equal. Therefore $V_{\eta}$ is strictly decreasing unless $p^{i}=l^{i}_{\eta}(\pi^{-i})$ for all $i$, which is exactly the condition that $\pi\in LE_{\eta}$.

We thus have a continuous function which is decreasing whenever $\pi\notin LE_{\eta}$. However, as demonstrated by [41], this is insufficient to prove that all internally chain transitive sets are contained in $LE_{\eta}$. We could use a further result, namely that the set of values $V_{\eta}$ takes at points $\pi\in LE_{\eta}$ is a measure zero set. This is usually achieved by using Sard's theorem (see [44] for example), but Smale's generalisation of Sard's theorem to Banach spaces does not apply in our case. We therefore prove a new result directly, using the provided condition that the connected components of the set of logit equilibria $LE_{\eta}$ are isolated.

Lemma 10. Let $V:M\to\mathbb{R}$ be a strict Lyapunov function for some flow $\Phi$ on a metric space $M$.
If the connected equilibrium components of $\Phi$ are isolated, and $V$ is constant on each component, then every internally chain transitive set of $\Phi$ is contained in such a component.

Proof: Recall first that an internally chain transitive set $\Lambda$ is a compact, connected, invariant and attractor-free set. Let $\Lambda_{0}=\arg\min\{V(x):x\in\Lambda\}$ and $V_{0}=\min\{V(x):x\in\Lambda\}$. It then follows that $\Lambda_{0}$ consists only of equilibria of $\Phi$: otherwise, if $x\in\Lambda_{0}$ were not an equilibrium, we would have $V(\Phi(x,t))<V(x)$ for all $t>0$, contradicting the fact that $\Lambda$ is forward invariant and $V(x)\ge V_{0}$ for all $x\in\Lambda$.

Now, assume there exists some $x\in\Lambda$ with $V(x)>V_{0}$. Then take $\epsilon>0$ small enough that the closed set $\Lambda_{\epsilon}=\{x\in\Lambda:V(x)\le V_{0}+\epsilon\}$ contains no equilibria of $\Phi$ other than those in $\Lambda_{0}$ (that this is possible follows from the fact that $V$ is constant on equilibrium components and that these components are isolated). Since $V$ is a strict Lyapunov function for $\Phi$, we will also have $\Phi(\Lambda_{\epsilon},t)\subseteq\mathrm{int}(\Lambda_{\epsilon})$ for all $t>0$ (recall that $\Lambda_{0}$ is contained in the interior of $\Lambda_{\epsilon}$ and that $\Lambda_{\epsilon}$ has no other equilibria), so $\Lambda_{\epsilon}$ contains an attractor of $\Phi$ for all $\epsilon>0$ [41, Lemma 5.2]. This contradicts the fact that $\Lambda$ is attractor-free, so we must have $V(x)=V_{0}$ for all $x\in\Lambda$, i.e. $\Lambda=\Lambda_{0}$.

We are now in a position to prove Theorem 6 and, finally, Theorem 1.

Proof of Theorem 6: $V_{\eta}$ is necessarily constant on connected components of $LE_{\eta}$, so the conditions of Lemma 10 are met. Therefore any internally chain transitive (under the bounded Lipschitz norm) set of the flow defined by (13) is contained in a connected component of the set $LE_{\eta}$. This is precisely Theorem 6.

Proof of Theorem 1: Theorem 5 shows that $\{\pi_n\}_{n\in\mathbb{N}}$ converges under the bounded Lipschitz norm to an internally chain transitive set of the flow defined by the logit best response dynamics.
Theorem 6 shows that any internally chain transitive set of these dynamics is contained in $LE_{\eta}$. It thus follows that $\pi_n$ converges to $LE_{\eta}$ weakly. To establish our strong convergence claim, recall first that every probability measure in $LE_{\eta}$ is non-atomic and absolutely continuous with respect to Lebesgue measure on $\mathbb{R}$. On the other hand, if $\pi^*$ is a (weak) limit point of $\pi_n$, we will have $\pi_n(A)\to\pi^*(A)$ for every continuity set $A$ of $\pi^*$ (i.e. for every measurable set $A$ such that $\pi^*(\partial A)=0$). Since every weak limit point of $\pi_n$ is contained in $LE_{\eta}$ and Borel sets are also continuity sets for absolutely continuous measures, our assertion follows.

VI. CONCLUSIONS

In this paper, we introduced an actor-critic reinforcement learning algorithm for potential games with continuous action sets. By utilizing two different timescales for the actor and critic updates (slow and fast, respectively), we showed that the algorithm converges strongly to the game's set of logit equilibria with minimal information requirements: in particular, players are not assumed to observe their opponents' actions or to have full knowledge of their individual payoff functions. From a practical point of view, this provides an attractive algorithmic framework for distributed control and optimization in complex systems with sparse feedback, such as rate control and power allocation in large-scale, decentralized wireless networks. In addition, from a theoretical point of view, our approach provides a nontrivial extension of several finite-dimensional stochastic approximation techniques to infinite-dimensional Banach spaces. In this way, the proposed framework can be applied and extended to different scenarios of high practical relevance (especially in the context of wireless networks), such as the case of noisy/imperfect payoff observations, asynchronous and/or delayed player updates, etc.
These research directions lie beyond the scope of the current work, but we intend to pursue them in a future paper.

REFERENCES

[1] D. H. Wolpert and K. Tumer, "Optimal payoff functions for members of collectives," Adv. Complex Syst., vol. 4, pp. 265–279, 2001.
[2] G. Arslan, J. R. Marden, and J. S. Shamma, "Autonomous vehicle-target assignment: A game theoretical formulation," J. Dyn. Syst.-T. ASME, vol. 129, pp. 584–596, 2007.
[3] N. Li and J. R. Marden, "Designing games for distributed optimization," IEEE J. Sel. Top. Signa., vol. 7, no. 2, pp. 230–242, 2013.
[4] T. Roughgarden and E. Tardos, "How bad is selfish routing?" J. ACM, vol. 49, pp. 236–259, 2002.
[5] J. R. Marden, H. P. Young, G. Arslan, and J. S. Shamma, "Payoff-based dynamics for multi-player weakly acyclic games," SIAM J. Control Optim., vol. 48, pp. 373–396, 2009.
[6] J. R. Marden, P. Young, and L. Y. Pao, "Achieving Pareto optimality through distributed learning," in Conference on Decision and Control, 2012, pp. 7419–7424.
[7] D. Monderer and L. S. Shapley, "Potential games," Game. Econ. Behav., vol. 14, pp. 124–143, 1996.
[8] E. Anshelevich, A. Dasgupta, J. Kleinberg, E. Tardos, T. Wexler, and T. Roughgarden, "The price of stability for network design with fair cost allocation," SIAM J. Comput., vol. 38, pp. 1602–1623, 2008.
[9] D. Fudenberg and D. K. Levine, The Theory of Learning in Games, ser. MIT Press Series on Economic Learning and Social Evolution. Cambridge, MA: MIT Press, 1998.
[10] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[11] R. Bertin, A. Legrand, and C. Touati, "Toward a fully decentralized algorithm for multiple bag-of-tasks application scheduling on grids," in GRID '08: Proceedings of the 3rd IEEE/ACM International Conference on Grid Computing, 2008.
[12] F. Meshkati, A. J. Goldsmith, H. V. Poor, and S.
C. Schwartz, "A game-theoretic approach to energy-efficient modulation in CDMA networks with delay QoS constraints," IEEE J. Sel. Area Comm., vol. 25, pp. 1069–1078, 2007.
[13] G. Scutari, D. P. Palomar, and S. Barbarossa, "Competitive design of multiuser MIMO systems based on game theory: a unified view," IEEE J. Sel. Area Comm., vol. 26, pp. 1089–1103, 2008.
[14] P. Mertikopoulos, E. V. Belmega, A. L. Moustakas, and S. Lasaulce, "Distributed learning policies for power allocation in multiple access channels," IEEE J. Sel. Area Comm., vol. 30, pp. 96–106, 2012.
[15] W. Saad, Z. Han, H. V. Poor, and T. Başar, "Game-theoretic methods for the smart grid: an overview of microgrid systems, demand-side management, and smart grid communications," IEEE Signal Proc. Mag., vol. 29, pp. 86–105, 2012.
[16] S. Perkins and D. S. Leslie, "Stochastic fictitious play with continuous action sets," J. Econ. Theory, vol. 152, pp. 179–213, 2014.
[17] M.-W. Cheung, "Pairwise comparison dynamics for games with continuous strategy space," J. Econ. Theory, vol. 153, pp. 344–375, 2014.
[18] R. Lahkar and F. Riedel, "The continuous logit dynamic and price dispersion," Tech. Rep., 2013.
[19] H. Walk, "An invariance principle for the Robbins-Monro process in a Hilbert space," Z. Wahrscheinlichkeit, vol. 39, pp. 135–150, 1977.
[20] E. Berger, "Asymptotic behaviour of a class of stochastic approximation procedures," Probab. Theory Rel., vol. 71, pp. 517–552, 1986.
[21] H. Walk and L. Zsidó, "Convergence of the Robbins-Monro method for linear problems in a Banach space," J. Math. Anal. Appl., vol. 139, pp. 152–177, 1989.
[22] A. Shwartz and N. Berman, "Abstract stochastic approximations and applications," Stoch. Proc. Appl., vol. 31, pp. 133–149, 1989.
[23] V. A. Koval, "Rate of convergence of stochastic approximation procedures in a Banach space," Cybern. Syst. Anal., vol.
34, pp. 386–394, 1998.
[24] J. Dippon and H. Walk, "The averaged Robbins-Monro method for linear problems in a Banach space," J. Theor. Probab., vol. 19, pp. 166–189, 2006.
[25] V. S. Borkar, "Stochastic approximation with two time scales," Syst. Control Lett., vol. 29, pp. 291–294, 1997.
[26] J. Oechssler and F. Riedel, "On the dynamic foundation of evolutionary stability in continuous models," J. Econ. Theory, vol. 107, pp. 223–252, 2002.
[27] J. Hofbauer, J. Oechssler, and F. Riedel, "Brown-von Neumann-Nash dynamics: The continuous strategy case," Game. Econ. Behav., vol. 65, pp. 406–429, 2009.
[28] M. Benaïm, J. Hofbauer, and S. Sorin, "Stochastic approximations and differential inclusions," SIAM J. Control Optim., vol. 44, pp. 328–348, 2006.
[29] R. D. McKelvey and T. R. Palfrey, "Quantal response equilibria for normal form games," Game. Econ. Behav., vol. 10, pp. 6–38, 1995.
[30] D. Fudenberg and D. M. Kreps, "Learning mixed equilibria," Game. Econ. Behav., vol. 5, pp. 320–367, 1993.
[31] M. Benaïm and M. W. Hirsch, "Mixed equilibria and dynamical systems arising from fictitious play in perturbed games," Game. Econ. Behav., vol. 29, pp. 36–72, 1999.
[32] J. Hofbauer and E. Hopkins, "Learning in perturbed asymmetric games," Game. Econ. Behav., vol. 52, pp. 133–152, 2005.
[33] J. Hofbauer and W. H. Sandholm, "Evolution in games with randomly disturbed payoffs," J. Econ. Theory, vol. 132, pp. 47–69, 2007.
[34] D. S. Leslie and E. J. Collins, "Individual Q-learning in normal form games," SIAM J. Control Optim., vol. 44, pp. 495–514, 2005.
[35] R. Cominetti, E. Melo, and S. Sorin, "A payoff-based learning procedure and its application to traffic games," Game. Econ. Behav., vol. 70, pp. 71–83, 2010.
[36] P. Coucheney, B. Gaujal, and P. Mertikopoulos, "Penalty-regulated dynamics and robust learning procedures in games," Math. Oper. Res., to appear.
[37] D.
S. Leslie and E. J. Collins, "Convergent multiple-timescales reinforcement learning algorithms in normal form games," Ann. Appl. Probab., vol. 13, pp. 1231–1251, 2003.
[38] A. C. Chapman, D. S. Leslie, A. Rogers, and N. R. Jennings, "Convergent learning algorithms for unknown reward games," SIAM J. Control Optim., vol. 51, pp. 3154–3180, 2013.
[39] P. Del Moral, A. Doucet, and A. Jasra, "Sequential Monte Carlo samplers," J. Roy. Stat. Soc. B, vol. 68, pp. 411–436, 2006.
[40] M. Ledoux and M. Talagrand, Probability in Banach Spaces. Springer-Verlag, 1991.
[41] M. Benaïm, "Dynamics of stochastic approximation algorithms," Séminaire de Probabilités XXXIII, vol. 33, pp. 1–68, 1999.
[42] D. Luenberger, Optimization by Vector Space Methods. Prentice Hall, 1969.
[43] S. Perkins, "Advanced stochastic approximation frameworks and their applications," Ph.D. dissertation, University of Bristol, 2013.
[44] J. Hofbauer and W. H. Sandholm, "On the global convergence of stochastic fictitious play," Econometrica, vol. 70, pp. 2265–2294, 2002.