Multi-agent learning using Fictitious Play and Extended Kalman Filter


Authors: Michalis Smyrnakis

Complex Systems and Statistical Physics Group, School of Physics and Astronomy, University of Manchester, UK
Email: michalis.smyrnakis@manchester.ac.uk

August 21, 2018

Abstract

Decentralised optimisation tasks are important components of multi-agent systems. These tasks can be interpreted as n-player potential games; therefore game-theoretic learning algorithms can be used to solve decentralised optimisation tasks. Fictitious play is the canonical example of these algorithms. Nevertheless, fictitious play implicitly assumes that players have stationary strategies. We present a novel variant of fictitious play in which players predict their opponents' strategies using extended Kalman filters and use these predictions to update their own strategies. We show that in 2 × 2 games with at least one pure Nash equilibrium, and in potential games where players have two available actions, the proposed algorithm converges to the pure Nash equilibrium. The performance of the proposed algorithm was tested empirically in two strategic-form games and an ad-hoc sensor network surveillance problem. The proposed algorithm performs better than the classic fictitious play algorithm in these games and therefore improves the performance of game-theoretic learning in decentralised optimisation.

Keywords: Multi-agent learning, game theory, fictitious play, decentralised optimisation, learning in games, Extended Kalman filter.

1 Introduction

Recent advances in technology render decentralised optimisation a crucial component of many applications of multi-agent systems and decentralised control. Sensor networks (Kho et al., 2009), traffic control (van Leeuwen et al., 2002) and scheduling problems (Stranjak et al., 2008) are some of the tasks where decentralised optimisation can be used.
These tasks share common characteristics, such as large scale, high computational complexity and communication constraints, that make a centralised solution intractable. It is well known that many decentralised optimisation tasks can be cast as potential games (Wolpert and Tumer, 1999; Arslan et al., 2006), and the search for an optimal solution can be seen as the task of finding a Nash equilibrium of a game. Thus it is feasible to use iterative learning algorithms from the game-theoretic literature to solve decentralised optimisation problems.

A game-theoretic learning algorithm with proofs of convergence in certain classes of games is fictitious play (Fudenberg and Levine, 1998; Monderer and Shapley, 1996). It is a learning process in which players choose an action that maximises their expected rewards according to the beliefs they maintain about their opponents' strategies. The players update their beliefs about their opponents' strategies after observing their actions. Even though fictitious play converges to Nash equilibrium, this convergence can be very slow, because it implicitly assumes that the other players use a fixed strategy throughout the game. Smyrnakis and Leslie (2010) addressed this problem by representing the fictitious play process as a state-space model and by using particle filters to predict opponents' strategies. The drawback of this approach is the computational cost of the particle filters, which makes the method difficult to apply in real-time applications.

The alternative that we propose in this article is to use extended Kalman filters (EKF) instead of particle filters to predict opponents' strategies. The proposed algorithm therefore has a smaller computational cost than the particle filter variant of fictitious play proposed by Smyrnakis and Leslie (2010).
We show that the EKF fictitious play algorithm converges to a pure Nash equilibrium in 2 × 2 games with at least one pure Nash equilibrium and in potential games where players have two available actions. We also observe empirically, in a range of games, that the proposed algorithm needs fewer iterations than classic fictitious play to converge to a solution. Moreover, in our simulations the proposed algorithm converged to a solution with higher reward than the classic fictitious play algorithm.

The remainder of this paper is organised as follows. We start with a brief description of game theory, fictitious play and extended Kalman filters. Section 3 introduces the proposed algorithm, which combines fictitious play and extended Kalman filters. The convergence results we obtained are presented in Section 4. In Section 5 we propose some indicative values for the EKF algorithm parameters. Section 6 presents the simulation results of EKF fictitious play in a 2 × 2 coordination game, a three-player climbing hill game and an ad-hoc sensor network surveillance problem. In the final section we present our conclusions.

2 Background

In this section we introduce some definitions from game theory that we will use in the rest of this article, and the relation between potential games and decentralised optimisation. We also briefly present the classic fictitious play algorithm and the extended Kalman filter algorithm.

2.1 Game theory definitions

We consider a game Γ with I players, where each player i, i = 1, 2, ..., I, chooses his action s^i from a finite discrete set S^i. We can then define the joint action that is played in a game as an element of the set product S = ×_{i=1}^{I} S^i. Each player i receives a reward u^i after choosing an action. The reward is a map from the joint action space to the real numbers, u^i : S → R.
We will often write s = (s^i, s^{-i}), where s^i is the action of player i and s^{-i} is the joint action of player i's opponents. When players select their actions using a probability distribution they use mixed strategies. The mixed strategy of a player i, σ^i, is an element of the set Δ^i, where Δ^i is the set of all probability distributions over the action space S^i. The joint mixed strategy σ is then an element of Δ = ×_{i=1}^{I} Δ^i. Analogously to the joint actions, we will write σ = (σ^i, σ^{-i}). In the special case where a player chooses an action with probability one, we will say that he uses a pure strategy. The expected utility player i gains if he chooses a strategy σ^i (resp. s^i) when his opponents choose the joint strategy σ^{-i} is u^i(σ^i, σ^{-i}) (resp. u^i(s^i, σ^{-i})).

A common decision rule in game theory is best response (BR). The best response is defined as the action that maximises a player's expected utility given his opponents' strategies. Thus for a specific opponents' strategy σ^{-i} we evaluate the best response as:

BR^i(σ^{-i}) = argmax_{s^i ∈ S^i} u^i(s^i, σ^{-i})    (1)

Nash (1950) showed that every game has at least one equilibrium, which is a fixed point of the best response correspondence, σ^i ∈ BR^i(σ^{-i}). Thus when a joint mixed strategy σ̂ is a Nash equilibrium:

u^i(σ̂^i, σ̂^{-i}) ≥ u^i(s^i, σ̂^{-i}) for all s^i ∈ S^i    (2)

Equation (2) implies that if a strategy σ̂ is a Nash equilibrium then it is not possible for a player to increase his utility by unilaterally changing his strategy. When all the players in a game select their actions using pure strategies, the equilibrium actions are referred to as pure strategy Nash equilibria. A pure equilibrium is strict if each player has a unique best response to his opponents' actions.
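The best response of equation (1) can be illustrated with a short sketch. This is our own illustration, not the authors' code; the payoff matrix is the simple coordination game used later in the paper, and the opponent strategies are arbitrary examples.

```python
import numpy as np

def best_response(payoff, opponent_strategy):
    """Best response of a row player in a two-player game.

    payoff[a, b] is the row player's utility when he plays action a and
    the opponent plays action b; opponent_strategy is the opponent's
    mixed strategy sigma^{-i} (a probability vector over the columns).
    Returns the action index maximising expected utility, equation (1).
    """
    expected_utility = payoff @ opponent_strategy
    return int(np.argmax(expected_utility))

# Simple coordination game (Table 2): equilibria (U, L) and (D, R).
payoff = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
print(best_response(payoff, np.array([0.7, 0.3])))  # opponent mostly plays L -> best response U (index 0)
```

When the opponent's strategy instead favours the second column, the best response switches to the second row, which is exactly the fixed-point structure that equation (2) constrains.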
2.2 Decentralised optimisation tasks as potential games

A class of games of particular interest in multi-agent systems and decentralised optimisation tasks is potential games, because of their utility structure. In particular, in order to be able to solve an optimisation task decentrally, the local functions should share characteristics with the global function that we want to optimise. This suggests that an action which improves or reduces the utility of an individual should respectively increase or reduce the global utility. Potential games have this property, since the potential function (global function) reflects the changes in the players' payoffs (local functions) when they unilaterally change their actions. More formally, we can write

u^i(s^i, s^{-i}) − u^i(s̃^i, s^{-i}) = φ(s^i, s^{-i}) − φ(s̃^i, s^{-i})

where φ is a potential function and the above equality holds for every player i, for every joint action s^{-i} ∈ S^{-i}, and for every pair of actions s^i, s̃^i ∈ S^i, where S^i and S^{-i} represent the sets of available actions for player i and his opponents respectively. Moreover, a potential game has at least one pure Nash equilibrium; hence there is at least one joint action s from which no player can increase his reward, and therefore the potential function, through a unilateral deviation.

It is feasible to choose an appropriate form of the agents' utility function so that the global utility acts as a potential of the system. Wonderful life utility is a utility function that was introduced by Wolpert and Tumer (1999) and applied by Arslan et al. (2006) to formulate distributed optimisation tasks as potential games. Player i's utility, when wonderful life utility is used, is defined as the difference between the global utility u_g and the utility of the system when a reference action is used as player i's action.
More formally, when player i chooses an action s^i we write

u^i(s^i) = u_g(s^i, s^{-i}) − u_g(s^i_0, s^{-i})

where s^i_0 denotes the reference action of player i. Hence the decentralised optimisation problem can be cast as a potential game, and any algorithm that provably converges to a Nash equilibrium of a potential game, which is a local or the global optimum of the optimisation problem, will converge to a joint action from which no player can increase the global reward through unilateral deviation.

2.3 Fictitious play

Fictitious play (Brown, 1951) is a widely used learning technique in game theory. In fictitious play each player chooses his action according to the best response to his beliefs about his opponents' joint mixed strategy σ^{-i}. Initially each player has some prior beliefs about the strategy that each of his opponents uses to choose an action, based on a weight function κ_t. After each iteration, the players update the weight function, and therefore their beliefs about their opponents' strategies, and again play the best response according to their beliefs. More formally, at the beginning of a game player i maintains some arbitrary non-negative initial weight functions κ^j_0, ∀j ∈ {1, ..., I} \ {i}, that are updated using the formula

κ^j_t(s^j) = κ^j_{t−1}(s^j) + I_{s^j_t = s^j}

for each j, where

I_{s^j_t = s^j} = 1 if s^j_t = s^j, and 0 otherwise.

The mixed strategy of opponent j is estimated from the following formula:

σ^j_t(s^j) = κ^j_t(s^j) / Σ_{s' ∈ S^j} κ^j_t(s').    (3)

Player i, based on his beliefs about his opponents' strategies, chooses the action which maximises his expected payoff. When player i uses equation (3) to update the beliefs about his opponents' strategies, he treats the environment of the game as stationary and implicitly assumes that the actions of the players are sampled from a fixed probability distribution.
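The weight update and the strategy estimate of equation (3) can be sketched for a single opponent as follows (a minimal illustration, not the authors' code; the opponent's behaviour is an assumed example):

```python
import numpy as np

def fp_update(kappa, observed):
    """Classic fictitious play belief update for one opponent.

    kappa is the weight vector kappa_t over the opponent's actions; the
    observed action's weight is incremented, and the estimated mixed
    strategy is the normalised weight vector, equation (3).
    """
    kappa = kappa.copy()
    kappa[observed] += 1.0
    return kappa, kappa / kappa.sum()

# Against an opponent who always plays action 0 the belief concentrates
# on that action, but only at rate O(1/t): every past observation keeps
# the same weight as the most recent one.
kappa = np.ones(2)  # arbitrary non-negative initial weights
for _ in range(20):
    kappa, sigma = fp_update(kappa, observed=0)
print(sigma)
```

The slow 1/t forgetting visible here is precisely the stationarity assumption the paper sets out to relax.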
Therefore the recent observations have the same weight as the initial ones. This approach leads to poor adaptation when the other players choose to change their strategies.

2.4 Fictitious play as a state-space model

We follow Smyrnakis and Leslie (2010) and represent the fictitious play process as a state-space model. According to this state-space model each player has a propensity Q^i_t(s^i) to play each of his available actions s^i ∈ S^i, and he forms his strategy based on these propensities. Finally he chooses his actions based on his strategy and the best response decision rule. Because players have no information about the evolution of their opponents' propensities, and under the assumption that the changes in propensities are small from one iteration of the game to another, we place a Gaussian autoregressive prior on all propensities. We set Q_0 ∼ N(0, I) and recursively update the value of Q_t according to the value of Q_{t−1} as follows:

Q_t = Q_{t−1} + η_t

where η_t ∼ N(0, χ² I). The strategy of a player is then related to his propensity by the following sigmoid equation, for every s^i ∈ S^i:

σ^i_t(s^i) = exp(Q^i_t(s^i)/τ) / Σ_{s̃ ∈ S^i} exp(Q^i_t(s̃)/τ).

Therefore players assume that at every iteration t their opponents have a different strategy σ_t.
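The propensity-to-strategy map above can be sketched as follows (the temperature τ and the propensity values are illustrative assumptions):

```python
import numpy as np

def strategy_from_propensity(q, tau=1.0):
    """Map a propensity vector Q_t to a mixed strategy through the
    sigmoid (softmax) relation of Section 2.4; tau is the temperature."""
    z = np.exp(q / tau)
    return z / z.sum()

q = np.array([2.0, 0.0])
print(strategy_from_propensity(q))           # favours the first action
print(strategy_from_propensity(q, tau=10.0)) # high temperature: nearly uniform
```

Low temperatures make the strategy nearly deterministic in the highest-propensity action, while high temperatures flatten it towards the uniform distribution; this is the continuous approximation of best response used throughout the paper.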
The second is that an observation at time t depends only on the current state x_t.

One of the most common methods to estimate p(x_{1:t}, z_{1:t}) is the Kalman filter and its variations. The Kalman filter (Kalman, 1960) is based on two assumptions: the first is that the state variable is Gaussian; the second is that the observations are a linear combination of the state variable. Hence Kalman filters can be used in cases which are represented by the following state-space model:

x_t = A x_{t−1} + ξ_{t−1}   (hidden layer)
y_t = B x_t + ζ_t           (observations)

where ξ_t and ζ_t follow zero-mean normal distributions with covariance matrices Ξ = q_t I and Z = r_t I respectively, and A, B are linear transformation matrices. When the distribution of the state variable x_t is Gaussian, then p(x_t | y_{1:t}) is also a Gaussian distribution, since y_t is a linear combination of x_t. Therefore it is enough to estimate its mean and variance to fully characterise p(x_t | y_{1:t}).

Nevertheless, in the state-space model we want to implement, the relation between player i's opponent's propensity and his actions is not linear. Thus we should use a more general form of state-space model such as:

x_t = f(x_{t−1}) + ξ_t
y_t = h(x_t) + ζ_t    (4)
where ξ_t and ζ_t are the hidden and observation state noise respectively, with zero mean and covariance matrices Ξ = q_t I and Z = r_t I respectively. The distribution p(x_t | y_{1:t}) is no longer Gaussian, because f(·) and h(·) are non-linear functions. A simple method to overcome this shortcoming is to use a first-order Taylor expansion to approximate the distributions of the state-space model in (4). In particular we let x_t = m_{t−1} + ε, where m_t denotes the mean of x_t and ε ∼ N(0, P). We can rewrite (4) as:

x_t = f(m_{t−1} + ε) + ξ_{t−1} = f(m_{t−1}) + F_x(m_{t−1}) ε + ξ_{t−1}
y_t = h(m_t + ε) + ζ_t = h(m_t) + H_x(m_t) ε + ζ_t    (5)

where F_x(m_{t−1}) and H_x(m_t) are the Jacobian matrices of f and h evaluated at m_{t−1} and m_t respectively. If we use the transformations in (5) then p(x_t | y_{1:t}) is a Gaussian distribution, and since it is Gaussian it is fully characterised by its mean and its variance. The EKF process (Jazwinski, 1970; Grewal and Andrews, 2011) estimates this mean and variance in two steps: the prediction step and the update step. In the prediction step, at any iteration t, the distribution of the state variable is estimated based on all the observations until time t − 1, p(x_t | y_{1:t−1}). The distribution p(x_t | y_{1:t−1}) is Gaussian and we denote its mean and variance as m⁻_t and P⁻_t respectively. During the update step the estimate of the prediction step is corrected in the light of the new observation at time t, so we estimate p(x_t | y_{1:t}). This is also a Gaussian distribution and we denote its mean and variance as m_t and P_t respectively.
The prediction and update steps of the EKF process (Jazwinski, 1970; Grewal and Andrews, 2011), which estimate the mean and variance of p(x_t | y_{1:t−1}) and p(x_t | y_{1:t}) respectively, are the following:

Prediction step

m⁻_t = f(m_{t−1})
P⁻_t = F(m_{t−1}) P_{t−1} F^T(m_{t−1}) + Ξ_{t−1}

where the (j, j′) element of F is defined as

[F(m⁻_t)]_{j,j′} = ∂f_j(x, q)/∂x_{j′} |_{x = m⁻_t, q = 0}

Update step

v_t = z_t − h(m⁻_t)
S_t = H(m⁻_t) P⁻_t H^T(m⁻_t) + Z
K_t = P⁻_t H^T(m⁻_t) S_t^{-1}
m_t = m⁻_t + K_t v_t
P_t = P⁻_t − K_t S_t K_t^T

where z_t is the observation vector (with 1 in the entry of the observed action and 0 everywhere else) and the (j, j′) element of H is defined as:

[H(m⁻_t)]_{j,j′} = ∂h_j(x, r)/∂x_{j′} |_{x = m⁻_t, r = 0}

3 Fictitious play and EKF

For the rest of this paper we will only consider inference over a single opponent's mixed strategy in fictitious play; separate estimates are formed identically and independently for each opponent. We therefore consider only one opponent, drop all dependence on player i, and write s_t, σ_t and Q_t for player i's opponent's action, strategy and propensity respectively. Moreover, for any vector x, x[j] will denote the j-th element of the vector, and for any matrix Y, Y[i, j] will denote its (i, j)-th element.

We can use the following state-space model to describe the fictitious play process:

Q_t = Q_{t−1} + ξ_{t−1}
s_t = h(Q_t) + ζ_t

where ξ_{t−1} ∼ N(0, Ξ) is the noise of the state process and ζ_t is the error of the observation state, with zero mean and covariance matrix Z, which arises because we approximate a discrete process, best response (equation (1)), with a continuous function h(·). Hence we can combine the EKF with fictitious play as follows.
At time t − 1 player i has an estimate of his opponent's propensity, a Gaussian distribution with mean m_{t−1} and variance P_{t−1}, and has observed an action s_{t−1}. At time t he uses the EKF prediction step to estimate his opponent's propensity. The mean and variance of the approximation p(Q_t | s_{1:t−1}) of the opponent's propensity are:

m⁻_t = m_{t−1}
P⁻_t = P_{t−1} + Ξ

Player i then evaluates his opponent's strategy using his estimates as:

σ_t(s_t) = exp(m⁻_t[s_t]/τ) / Σ_{s̃ ∈ S} exp(m⁻_t[s̃]/τ)    (6)

where m⁻_t[s_t] is the mean of player i's estimate of the propensity of his opponent to play action s_t. Player i then uses the estimate of his opponent's strategy, equation (6), and best response, equation (1), to choose an action. After observing the opponent's action s_t, player i corrects his estimate of his opponent's propensity using the update equations of the EKF process:

v_t = z_t − h(m⁻_t)
S_t = H(m⁻_t) P⁻_t H^T(m⁻_t) + Z
K_t = P⁻_t H^T(m⁻_t) S_t^{-1}
m_t = m⁻_t + K_t v_t
P_t = P⁻_t − K_t S_t K_t^T

where

h(Q_t)[s] = exp(Q_t[s]/τ) / Σ_{s̃ ∈ S} exp(Q_t[s̃]/τ),

τ is a temperature parameter, and the Jacobian matrix H(m⁻_t) is defined as:

[H(m⁻_t)]_{j,j′} = Σ_{k ≠ j} exp(m⁻_t[k]) exp(m⁻_t[j]) / (Σ_k exp(m⁻_t[k]))²   if j = j′
[H(m⁻_t)]_{j,j′} = − exp(m⁻_t[j]) exp(m⁻_t[j′]) / (Σ_k exp(m⁻_t[k]))²   if j ≠ j′

Table 1 summarises the fictitious play algorithm when an EKF is used to predict the opponent's strategy.

At time t:

1. Player i maintains an estimate of his opponent's propensity up to time t − 1, p(Q_{t−1} | s_{1:t−1}); thus he has an estimate of the mean m_{t−1} and the covariance P_{t−1} of this distribution.
2. Player i updates his estimate of his opponent's propensity, p(Q_t | s_{1:t−1}), using the prediction equations m⁻_t = m_{t−1}, P⁻_t = P_{t−1} + Ξ.

3. Based on the estimates of step 2, player i updates his beliefs about his opponent's strategy using σ_t(j) = exp(m⁻_t[j]/τ) / Σ_{j′} exp(m⁻_t[j′]/τ).

4. He chooses an action based on the beliefs of step 3 according to best response.

5. He observes the opponent's action s_t.

6. He updates the propensity estimates using m_t = m⁻_t + K_t v_t and P_t = P⁻_t − K_t S_t K_t^T.

7. Set t = t + 1.

Table 1: EKF fictitious play algorithm

4 Theoretical results

In this section we present the convergence results we obtained for games with at least one pure Nash equilibrium in which players have two available actions, s ∈ {1, 2}. We will denote by −s the action that a player does not choose; for example, if player i's opponent chooses action 1, then s = 1 and hence −s = 2. Also we will denote by m[1] and m[2] the estimated means of the opponent's propensity for actions 1 and 2 respectively. Similarly P[1,1] and P[2,2] will represent the variances of the propensity estimates for actions 1 and 2 respectively, and P[1,2], P[2,1] their covariance. The proposed algorithm has the following two properties:

Proposition 1. If at iteration t of the EKF fictitious play algorithm action s is played by player i's opponent, then the estimate of his opponent's propensity to play action s increases, m_{t−1}[s] < m_t[s]. Also the estimate of his opponent's propensity to play action −s decreases, m_{t−1}[−s] > m_t[−s].

Proof. The proof of Proposition 1 is in Appendix A.

        L       R
U      1,1     0,0
D      0,0     1,1

Table 2: Simple coordination game

Proposition 1 implies that players who use EKF fictitious play learn their opponents' strategies, and eventually they will choose the action that maximises their reward based on their estimates.
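The Table 1 loop for a single tracked opponent can be sketched as follows. This is a minimal sketch and not the authors' code: τ is set to 1, the Jacobian is written in its closed softmax form (including the 1/τ chain-rule factor), and the observed action sequence is an assumed example.

```python
import numpy as np

def softmax(q, tau=1.0):
    z = np.exp(q / tau)
    return z / z.sum()

def ekf_fp_step(m, P, observed, Xi, Z, tau=1.0):
    """One pass of the Table 1 loop for one tracked opponent.

    m, P: mean and covariance of the propensity estimate p(Q_{t-1} | s_{1:t-1}).
    observed: the opponent's action s_t (an index into m).
    Returns the updated mean and covariance (steps 2 and 6)."""
    # Step 2: prediction (f is the identity, so F = I)
    m_pred, P_pred = m, P + Xi
    # Step 3: belief about the opponent's strategy, equation (6)
    sigma = softmax(m_pred, tau)
    # Jacobian of the softmax observation h at m_pred:
    # H[j, j'] = sigma_j (delta_{jj'} - sigma_{j'}) / tau
    H = (np.diag(sigma) - np.outer(sigma, sigma)) / tau
    # Steps 5-6: observe s_t and apply the EKF update equations
    z = np.zeros_like(m)
    z[observed] = 1.0
    v = z - sigma                        # innovation v_t
    S = H @ P_pred @ H.T + Z             # innovation covariance S_t
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain K_t
    return m_pred + K @ v, P_pred - K @ S @ K.T

# Track an opponent who always plays action 0.
rng = np.random.default_rng(0)
m, P, Xi = np.zeros(2), np.eye(2), 0.01 * np.eye(2)
for _ in range(30):
    Z = (0.1 + rng.normal(0.0, 1e-5)) * np.eye(2)  # Z = rI + eps*I, as in Proposition 2
    m, P = ekf_fp_step(m, P, observed=0, Xi=Xi, Z=Z)
print(m[0] > m[1])  # the estimate for the observed action grows, per Proposition 1
```

In line with Proposition 1, each update raises the estimated propensity of the observed action and lowers the other; the best response of step 4 would then be computed against σ_t as in equation (1).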
Nevertheless, there are cases where players may change their actions simultaneously and become trapped in a cycle instead of converging to a pure Nash equilibrium. As an example, consider the game depicted in Table 2. This is a simple coordination game with two pure Nash equilibria, the joint actions (U, L) and (D, R). If the two players start from joint action (U, R) or (D, L) and always change their actions simultaneously, then they will never reach one of the two pure Nash equilibria of the game.

Proposition 2. In a 2 × 2 game where the players use the EKF fictitious play process to choose their actions, and the variance of the observation state is set to Z = rI + εI, with high probability the two players will not change their actions simultaneously infinitely often. Here ε is a random number from a normal distribution with zero mean and arbitrarily small covariance, and I is the identity matrix.

Proof. The proof of Proposition 2 is in Appendix B.

We should mention here that the reason we set Z = rI + εI is to break any symmetries that occur because of the initialisation of the EKF fictitious play algorithm. Based on Propositions 1 and 2 we can infer the following propositions and theorems.

Proposition 3. (a) In a game where players have two available actions, if s is a strict Nash equilibrium and s is played at date t in the process of EKF fictitious play, then s is played at all subsequent dates. That is, strict Nash equilibria are absorbing for the process of EKF fictitious play. (b) Any pure strategy steady state of EKF fictitious play must be a Nash equilibrium.

Proof. Consider the case where the players' beliefs σ̂_t are such that their optimal choices correspond to a strict Nash equilibrium ŝ. In the EKF fictitious play process, players' beliefs are formed identically and independently for each opponent based on equation (6).
By Proposition 1 we know that the estimates each player maintains about his opponents' propensities, and therefore about their strategies, will increase for the actions included in ŝ and decrease otherwise. Thus the best response to the beliefs σ̂_{t+1} will again be ŝ, and since ŝ is a Nash equilibrium the players will not deviate from it. Conversely, if a player remains at a pure strategy profile, then eventually the assessments will become concentrated at that profile because of Proposition 1; hence, if the profile were not a Nash equilibrium, one of the players would eventually want to deviate.

Proposition 4. Under EKF fictitious play, if the beliefs over each player's choices converge, the strategy profile corresponding to the product of these distributions is a Nash equilibrium.

Proof. Suppose that the beliefs of the players at time t, σ_t, converge to some profile σ̂. If σ̂ were not a Nash equilibrium, some player would eventually want to deviate, and the beliefs would also deviate, since by Proposition 1 players eventually learn their opponents' actions.

Theorem 1. The EKF fictitious play process converges to a Nash equilibrium in 2 × 2 games with at least one pure Nash equilibrium, when the covariance matrix of the observation space error Z is defined as in Proposition 2, Z = rI + εI.

Proof. We can distinguish two possible initial states of the game. In the first, the players' initial beliefs about each other's actions are such that their initial joint action s_0 is a Nash equilibrium. From Proposition 3 and equation (6) we know that they will play this joint action, which is a Nash equilibrium, in all subsequent iterations of the game. The second case, where the initial beliefs of the players are such that their initial joint action s_0 is not a Nash equilibrium, is divided into two subcategories.
The first includes 2 × 2 games with only one pure Nash equilibrium. In this case one of the two players has a dominant action, so he will choose this action in every iteration of the game, since it maximises his expected payoff regardless of the other player's strategy. Therefore, because of Proposition 1, the other player will learn his opponent's strategy and the players will choose the joint action which is the pure Nash equilibrium.

The second subcategory includes 2 × 2 games with two pure Nash equilibria, like the simple coordination game depicted in Table 2, where the players' initial joint action s_0 = (s¹, s²) is not a Nash equilibrium. The players will then learn their opponent's strategy (Proposition 1 and equation (6)) and change their actions. We know from Proposition 2 that within finite time, with high probability, the players will not change their actions simultaneously, and hence they will end up in a joint action that is one of the two pure Nash equilibria of the game.

We can extend the results of Theorem 1 to n × 2 games with a better reply path. A game with a better reply path can be represented as a graph whose nodes are the joint actions s of the game, with an edge connecting s and s′ iff exactly one player i can increase his payoff by changing his action (Young, 2005). Potential games have a better reply path.

Theorem 2. The EKF fictitious play process converges to a Nash equilibrium in n × 2 games with a better reply path when the covariance matrix of the observation space error Z is defined as in Proposition 2, Z = rI + εI.
Proof. Similarly to the 2 × 2 case, if the initial beliefs of the players are such that their initial joint action s_0 is a Nash equilibrium, then from Proposition 3 and equation (6) we know that they will play this joint action, which is a Nash equilibrium, for the rest of the game.

Moreover, in the case where the initial beliefs of the players are such that their initial joint action s_0 is not a Nash equilibrium, then by Propositions 1 and 2, after a finite number of iterations, and because the game has a better reply path, the only player who can improve his payoff by changing his action will choose a new action, which results in a new joint action s. If this joint action is not a Nash equilibrium, then again after a finite number of iterations the player who can improve his payoff will change action and a new joint action s′ will be played. Thus, after a search of the nodes of a finite graph, and therefore after a finite number of iterations, the players will choose a joint action which is a Nash equilibrium.

5 Simulations to define the algorithm parameters Ξ and Z

The covariance matrix of the state-space error, Ξ = qI, and that of the measurement error, Z = rI, are two parameters that we should define at the beginning of the EKF fictitious play algorithm, and they affect its performance. Our aim is to find values, or ranges of values, of q and r that can efficiently track opponents' strategies whether they change smoothly or abruptly, instead of choosing q and r heuristically for each opponent when we use the EKF algorithm. Nevertheless, it is possible that for some games the results of the EKF algorithm will be improved by other combinations of q and r than the ones we propose in this section.

We examine the impact of the EKF fictitious play algorithm's parameters on its performance in the following two tracking scenarios.
In the first scenario a single opponent chooses his actions using a mixed strategy which changes smoothly and has a sinusoidal form over the iterations of the tracking scenario. In particular, for t = 1, 2, ..., 100 iterations of the game:

σ_t(1) = cos²(2πt/(n+1)) = 1 − σ_t(2), where n = 100.

In the second toy example player i's opponent changes his strategy abruptly: he chooses action 1 with probability σ_t(1) = 1 during the first 25 and the last 25 iterations of the game, and with probability σ_t(1) = 0 for the remaining iterations. The probability of the second action is σ_t(2) = 1 − σ_t(1).

We tested the performance of the proposed algorithm for the parameter ranges 10⁻⁴ ≤ q ≤ 1 and 10⁻⁴ ≤ r ≤ 1. We repeated both examples 100 times for each combination of q and r, each time measuring the absolute error of the estimated strategy against the real one. The combined average absolute error over both examples is depicted in Figure 1. The darkest areas of the contour plot represent the areas where the average absolute error is minimised. The average absolute error is minimised for ranges of values of q and r that form two distinct areas. In the first area, the wide dark area of Figure 1, the ranges of q and r were 0.08 ≤ q ≤ 0.4 and 0.2 ≤ r ≤ 1 respectively. In the second area, the narrow dark area of Figure 1, the ranges of q and r were 0.001 ≤ q ≤ 0.025 and 0.08 ≤ r ≤ 0.13 respectively. The minimum error we observed in our simulations was in the narrow area, in particular when Ξ = 0.01 I and Z = 0.1 I, where I is the identity matrix.

Figure 1: Combined absolute error for both tracking scenarios. The range of both parameters, q and r, is between 10⁻⁴ and 1.

Player 3: U            Player 3: M            Player 3: D
      U     M     D         U     M     D         U     M     D
U     0     0     0    U  -300    70    80    U   100  -300    90
M     0    50    40    M  -300    60     0    M     0     0     0
D     0     0    30    D     0     0     0    D     0     0     0

Table 3: Climbing hill game with three players.
Player 1 selects rows, Player 2 selects columns, and Player 3 selects the matrix. The global reward depicted in the matrices is received by all players. The unique Nash equilibrium, (U, U, D), is in bold.

6 Simulation results

This section is divided in two parts. The first part contains the results of our simulations in two strategic form games, and the second part contains the results we obtained in an ad-hoc sensor network surveillance problem. In all the simulations of this section we set the covariance matrices of the hidden state and the observations to Ξ = 0.01I and Z = (0.1 + ε)I respectively, where ε ∼ N(0, 10^−5) and I is the identity matrix.

6.1 Simulation results in strategic form games

In this section we compare the results of our algorithm with those of fictitious play in two coordination games. These games are depicted in Tables 2 and 3. The game depicted in Table 2, as described in Section 4, is a simple coordination game with two pure Nash equilibria, its diagonal elements. Table 3 presents an extreme version of the climbing hill game (Claus and Boutilier, 1998), in which three players must climb up a utility function in order to reach the Nash equilibrium where their reward is maximised.

We present the results of 50 replications of a learning episode of 50 iterations for each game. As depicted in Figures 2 and 3, the proposed algorithm performs better than fictitious play in both cases. In the simple coordination game of Table 2, the EKF fictitious play algorithm converges to one of the pure equilibria after a few iterations. On the other hand, fictitious play is trapped in a limit cycle in all the replications where the initial joint action was not one of the two pure Nash equilibria.

Figure 2: Results of EKF and classic fictitious play in the simple coordination game of Table 2.
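The structure of the climbing hill game can be checked numerically. The sketch below assumes our reading of the Table 3 payoffs (actions ordered U, M, D, with Player 3 selecting the matrix); the enumeration helper is illustrative. It confirms that (U, U, D) is the only strict pure Nash equilibrium and attains the maximal global reward.

```python
import itertools
import numpy as np

# Global payoff tensor G[a1, a2, a3]; indices 0, 1, 2 stand for U, M, D.
M_U = np.array([[0, 0, 0], [0, 50, 40], [0, 0, 30]])        # Player 3 plays U
M_M = np.array([[-300, 70, 80], [-300, 60, 0], [0, 0, 0]])  # Player 3 plays M
M_D = np.array([[100, -300, 90], [0, 0, 0], [0, 0, 0]])     # Player 3 plays D
G = np.stack([M_U, M_M, M_D], axis=2)

def is_strict_nash(G, s):
    """True if every unilateral deviation strictly lowers the common payoff."""
    for player in range(3):
        for dev in range(3):
            if dev == s[player]:
                continue
            alt = list(s)
            alt[player] = dev
            if G[tuple(alt)] >= G[tuple(s)]:
                return False
    return True

strict_eqs = [s for s in itertools.product(range(3), repeat=3)
              if is_strict_nash(G, s)]
print(strict_eqs)  # prints [(0, 0, 2)], i.e. the joint action (U, U, D)
```

Note the "climbing" structure: starting from the low-payoff cells, the chain of single-player payoff improvements 30 → 40 → 50 → 60 → 70 → 80 → 90 → 100 leads to (U, U, D), which is what a better-reply path requires.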
For that reason the players' payoff in every iteration of the game was either 1 utility unit or 0 utility units, depending on the initial joint action. In the climbing hill game of Table 3, the proposed algorithm converges to the Nash equilibrium after 35 iterations, whereas the fictitious play algorithm does not converge even after 50 iterations.

6.2 Ad-hoc sensor network surveillance problem

We compared the results of our algorithm against those of fictitious play in a coordination task of a power constrained sensor network, where sensors can be either in a sense or a sleep mode (Farinelli et al., 2008; Chapman et al., 2011). When the sensors are in sense mode they can observe the events that occur in their range. During their sleep mode the sensors harvest the energy they need in order to be able to function when they are in sense mode. The sensors should therefore coordinate and choose their sense/sleep schedules in order to maximise the coverage of the events. This optimisation task can be cast as a potential game.

In particular, we consider the case where I sensors are deployed in an area where E events occur. If an event e, e ∈ E, is observed by the sensors then it produces some utility V_e. Each of the sensors i = 1, ..., I should choose an action s_i = j, one of the j = 1, ..., J time intervals during which it can be in sense mode. Each sensor i, when it is in sense mode, can observe an event e, if e is in its sense range, with probability p_ie = 1/d_ie, where d_ie is the distance between sensor i and event e. We assume that the probability each sensor has of observing an event is independent of the other sensors.
If we denote by i_in the sensors that are in sense mode when the event e occurs and have e in their sensing range, then we can write the probability that an event e is observed by the sensors i_in as

1 − ∏_{i ∈ i_in} (1 − p_ie).

The expected utility produced by the event e is the product of its utility V_e and the probability that it is observed by the sensors i_in that are in sense mode when the event e occurs and have e in their sensing range. More formally, we can express the utility produced by an event e as:

U_e(s) = V_e (1 − ∏_{i ∈ i_in} (1 − p_ie)).

The global utility is then the sum of the utilities that all events, e ∈ E, produce:

U_global(s) = Σ_e U_e(s).

After each iteration of the game, each sensor receives some utility, which is based on the sensors and the events that are inside its communication and sense range respectively. For a sensor i we denote by ẽ the events that are in its sensing range and by s̃_{−i} the joint action of the sensors that are inside its communication range. The utility that sensor i will receive if his sense mode is j will be

U_i(s_i = j, s̃_{−i}) = Σ_{ẽ} U_ẽ(s_i = j, s̃_{−i}).

Figure 3: Probability of playing the (U, U, D) equilibrium for EKF fictitious play (solid line) and fictitious play (dashed line) in the three-player climbing hill game.

We compared the performance of the two algorithms in two instances of the above scenario, one with 20 and one with 50 sensors deployed in a unit square. In both instances sensors had to choose one time interval of the day in which they would be in sense mode, and use the remaining time intervals to harvest energy. We consider cases where sensors had to choose their sense mode among 2, 3 and 4 available time intervals. Sensors are able to communicate with other sensors that are at most 0.6 distance units away, and can only observe events that are at most 0.3 distance units away.
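The event and global utilities defined above are straightforward to compute. The sketch below is illustrative: the function names are ours, and we cap p_ie = 1/d_ie at 1 for sensors closer than one distance unit, an assumption the text does not spell out.

```python
import numpy as np

def event_utility(V_e, dists):
    """U_e = V_e * (1 - prod_i (1 - p_ie)), over the sensors i in i_in,
    with p_ie = 1 / d_ie capped at 1 (our assumption for d_ie < 1)."""
    p = np.minimum(1.0, 1.0 / np.asarray(dists, dtype=float))
    return V_e * (1.0 - np.prod(1.0 - p))

def global_utility(events):
    """events: iterable of (V_e, distances to the awake sensors covering e)."""
    return sum(event_utility(V, d) for V, d in events)

# Toy check: one event of value 1 watched by two awake sensors at distances
# 2 and 4, so p = 1/2 and 1/4 and coverage = 1 - (1/2)(3/4) = 5/8.
print(event_utility(1.0, [2.0, 4.0]))  # 0.625
```

A sensor's local utility U_i is the same computation restricted to the events ẽ in its sensing range, given the joint action s̃_{−i} of the sensors in its communication range.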
Moreover, in both instances we assumed that 20 events took place in the unit square area. These events were uniformly distributed in space and time, so an event could appear at any point of the unit square area and could occur at any time with equal probability. The duration of each event was uniformly chosen in (0, 6] hours, and each event had a value V_e ∈ (0, 1].

Figures 4 and 5 depict the average results of 50 replications of the game for the two algorithms. For each instance, both algorithms ran for 50 iterations. To be able to average across the 50 replications, we normalise the utility of a replication by the global utility that the sensors would gain if they were in sense mode during the whole day.

(a) Results when sensors have to choose between two time intervals. (b) Results when sensors have to choose between three time intervals. (c) Results when sensors have to choose between four time intervals.

Figure 4: Results of the instance where 20 sensors should coordinate, for both algorithms. The results of EKF fictitious play are the solid lines and the results of classic fictitious play are the dashed lines. The horizontal axis of each figure depicts the iteration of the game and the vertical axis the global utility as a percentage of the global utility of the system in the case where the sensors were always in sense mode.

As we observe in Figures 4 and 5, EKF fictitious play converges to a stable joint action faster than the fictitious play algorithm. In particular, on average the EKF fictitious play algorithm needed 10 "negotiation" steps between the sensors in order to reach a stable joint action, whereas fictitious play needed more than 25. Moreover, the classic fictitious play algorithm always resulted in joint actions with smaller reward than the proposed algorithm.

(a) Results when sensors have to choose between two time intervals.
(b) Results when sensors have to choose between three time intervals. (c) Results when sensors have to choose between four time intervals.

Figure 5: Results of the instance where 50 sensors should coordinate, for both algorithms. The results of EKF fictitious play are the solid lines and the results of classic fictitious play are the dashed lines. The horizontal axis of each figure depicts the iteration of the game and the vertical axis the global utility as a percentage of the global utility of the system in the case where the sensors were always in sense mode.

7 Conclusion

We have introduced a variation of fictitious play that uses Extended Kalman filters to predict opponents' strategies. This variation of fictitious play addresses the implicit assumption of the classic algorithm that opponents use the same strategy in every iteration of the game.

We showed that, for 2 × 2 games with at least one pure Nash equilibrium, EKF fictitious play converges to the pure Nash equilibrium of the game. Moreover, the proposed algorithm converges in games with a better reply path, such as potential games, when the n players have 2 available actions.

EKF fictitious play performed better than the classic algorithm in the strategic form games and the ad-hoc sensor network surveillance problem we simulated. Our empirical observations indicate that EKF fictitious play converges to a better solution than the classic algorithm and needs only a few iterations to reach that solution. Hence, by slightly increasing the computational intensity of fictitious play, less communication is required between agents to quickly coordinate on a desired solution.

8 Acknowledgements

This work is supported by The Engineering and Physical Sciences Research Council EPSRC (grant number EP/I005765/1).
A Proof of Proposition 1

We base the proof of Proposition 1 on the properties of the EKF when it is used to estimate the strategy of an opponent with two available actions. If Player i's opponent has two available actions, 1 and 2, then we can assume that at time t − 1 Player i maintains beliefs about his opponent's propensities, with mean m_{t−1} and covariance P_{t−1}. Based on these estimates he chooses his strategy σ_{t−1}. At the prediction step of this process he uses the following equations to predict his opponent's propensities, and chooses an action using best response:

m_t^- = (m_{t−1}[1], m_{t−1}[2])^T

P_t^- = P_{t−1} + qI.

Without loss of generality we can assume that his opponent chooses action 2 at iteration t. The update step is then:

v_t = z_t − h(m_t^-).

Since Player i's opponent played action 2, and

h(Q_t)[s'] = exp(Q_t[s']/τ) / Σ_{s̃ ∈ S} exp(Q_t[s̃]/τ),

we can write v_t and H_t(m_t^-) as:

v_t = (0, 1)^T − (σ_{t−1}(1), 1 − σ_{t−1}(1))^T = (−σ_{t−1}(1), σ_{t−1}(1))^T

H_t(m_t^-) = [a_t, −a_t; −a_t, a_t],

where a_t = σ_{t−1}(1) σ_{t−1}(2). The innovation covariance S_t = H(m_t^-) P_t^- H^T(m_t^-) + Z is:

S_t = a_t^2 [b, −b; −b, b] + Z,

where b = P_t^-[1,1] + P_t^-[2,2] − 2 P_t^-[1,2]. The Kalman gain, K_t = P_t^- H^T(m_t^-) S_t^{−1}, can then be written, up to a multiplicative constant, as

K_t ∼ [c, −c; −d, d],

where k = P_t^-[1,2], c = P_t^-[1,1] − k and d = P_t^-[2,2] − k.
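The closed form of the Jacobian used in this proof, H_t = [a_t, −a_t; −a_t, a_t] with a_t = σ(1)σ(2), can be verified numerically against a finite-difference Jacobian of the soft-max observation function (assuming τ = 1; the propensity vector below is arbitrary):

```python
import numpy as np

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

# Central-difference Jacobian of h at an arbitrary propensity vector m.
m = np.array([0.3, -0.7])
eps = 1e-6
H_num = np.zeros((2, 2))
for j in range(2):
    d = np.zeros(2)
    d[j] = eps
    H_num[:, j] = (softmax(m + d) - softmax(m - d)) / (2 * eps)

# Closed form from the proof: a = sigma(1) * sigma(2).
sigma = softmax(m)
a = sigma[0] * sigma[1]
H_closed = np.array([[a, -a], [-a, a]])
print(np.allclose(H_num, H_closed, atol=1e-8))  # True
```

This is the general soft-max Jacobian diag(p) − p p^T specialised to two actions, where the diagonal entries p_i(1 − p_i) both equal σ(1)σ(2).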
The updates for the mean and covariance are then:

m_t = m_t^- + K_t v_t

P_t = P_t^- − K_t S_t K_t^T.

The mean of the Gaussian distribution that is used to estimate the opponent's propensities is:

m_t = (m_t[1], m_t[2])^T = ( m_t^-[1] − 2σ(1)a(b − k) / (4a²(b − k) + (r + ε)),  m_t^-[2] + 2σ(1)a(b − k) / (4a²(b − k) + (r + ε)) )^T    (7)

Based on the above we observe that m_t[1] < m_{t−1}[1] and m_t[2] > m_{t−1}[2], which completes the proof.

B Proof of Proposition 2

We consider 2 × 2 games with at least one pure Nash equilibrium. In the case where only one Nash equilibrium exists, a dominant strategy exists, and thus one of the players will not deviate from this action. Hence we are interested in 2 × 2 games with two pure Nash equilibria. Without loss of generality we consider a game with a structure similar to the simple coordination game depicted in Table 2, with two equilibria, the joint actions on the diagonal of the payoff matrix, (U, L) and (D, R). We present calculations for Player 1, but the same results also hold for Player 2.

We define λ as the necessary confidence level that Player 1's estimate of σ_t(L) should reach in order to choose action U. Hence Player 1 will choose U if:

σ_t(1) > λ ⇔ exp(m_t^-[1]) / (exp(m_t^-[1]) + exp(m_t^-[2])) > λ ⇔ m_t^-[1] > ln(λ/(1 − λ)) + m_t^-[2] ⇔ m_{t−1}[1] > ln(λ/(1 − λ)) + m_{t−1}[2].

In order to prove Proposition 2, we need to show that when a player changes his action, his opponent changes his action at the same iteration with probability less than 1. Consider the case where at time t − 1 the joint action of the players is (U, R); then Player 1 believes that his opponent will play L, while observing him playing R. Assume that Player 2's beliefs about Player 1's strategy have reached the necessary confidence level, and at iteration t he will change his action from R to L.
Player 1 will also change his action at the same time if

m_{t−1}[2] > ln((1 − λ)/λ) + m_{t−1}[1].

We want to show that the players do not change actions simultaneously with probability 1. Hence it is enough to show that

Prob( m_{t−1}[1] > ln(λ/(1 − λ)) + m_{t−1}[2] ) > 0.    (8)

We can replace m_{t−1}[1] and m_{t−1}[2] with their equivalents from (7) and write:

m_t^-[1] − 2σ(1)a(b − k) / (4a²(b − k) + (r + ε)) > ln(λ/(1 − λ)) + m_t^-[2] + 2σ(1)a(b − k) / (4a²(b − k) + (r + ε))
⇔ −4σ(1)a(b − k) / (4a²(b − k) + (r + ε)) > ln(λ/(1 − λ)) + m_t^-[2] − m_t^-[1]
⇔ a(b − k) / (4a²(b − k) + (r + ε)) < ( ln(λ/(1 − λ)) + m_t^-[2] − m_t^-[1] ) / (−4σ(1)).

Solving this with respect to ε we have

ε > −4σ(1)a(b − k) / ( ln(λ/(1 − λ)) + m_t^-[2] − m_t^-[1] ) − 4a²(b − k) − r.

Thus we can write (8) as:

Prob( ε > −4σ(1)a(b − k) / ( ln(λ/(1 − λ)) + m_t^-[2] − m_t^-[1] ) − 4a²(b − k) − r ) > 0.    (9)

Since ε is Gaussian white noise, (9) is always true.

We also consider the case where at time t − 1 the joint action of the players is (D, L); then Player 1 believes that his opponent will play R, while observing him playing L. Assume that Player 2's beliefs about Player 1's strategy have reached the necessary confidence level, and at time t he will change his action from L to R. Player 1 will also change his action at the same time if

m_{t−1}[1] > ln(λ/(1 − λ)) + m_{t−1}[2].

We want to show that the players do not change actions simultaneously with probability 1. Hence it is enough to show that

Prob( m_{t−1}[2] > ln((1 − λ)/λ) + m_{t−1}[1] ) > 0.    (10)

We can rewrite (10), using the results obtained for m_{t−1}[1] and m_{t−1}[2] in (7), as

Prob( ε > −4σ(1)a(b − k) / ( ln(λ/(1 − λ)) + m_t^-[2] − m_t^-[1] ) − 4a²(b − k) − r ) > 0.    (11)

Since ε is Gaussian white noise, (11) is always true.
If we define ξ_t as the event that both players change their action simultaneously at time t, and assume that the two players have changed their actions simultaneously at iterations t_1, t_2, ..., t_T, then the probability that they will also change their actions simultaneously at time t_{T+1}, P(ξ_{t_1}, ξ_{t_2}, ..., ξ_{t_T}, ξ_{t_{T+1}}), is almost zero for large but finite T.

References

Arslan, G., Marden, J., Shamma, J.. Autonomous vehicle-target assignment: A game theoretical formulation. 2006.

Brown, G.W.. Iterative solutions of games by fictitious play. In: Koopmans, T.C., editor. Activity Analysis of Production and Allocation. Wiley; 1951. p. 374–376.

Chapman, A.C., Leslie, D.S., Rogers, A., Jennings, N.R.. Convergent learning algorithms for potential games with unknown noisy rewards. Working Papers 05/2011; University of Sydney Business School, Discipline of Business Analytics; 2011.

Claus, C., Boutilier, C.. The dynamics of reinforcement learning in cooperative multiagent systems. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence. 1998.

Farinelli, A., Rogers, A., Jennings, N.. Maximising sensor network efficiency through agent-based coordination of sense/sleep schedules. In: Workshop on Energy in Wireless Sensor Networks in conjunction with DCOSS 2008. 2008. p. 43–56.

Fudenberg, D., Levine, D.. The Theory of Learning in Games. The MIT Press, 1998.

Grewal, M., Andrews, A.. Kalman Filtering: Theory and Practice Using MATLAB. Wiley-IEEE Press, 2011.

Jazwinski, A.. Stochastic Processes and Filtering Theory. Volume 63. Academic Press, 1970.

Kalman, R., et al. A new approach to linear filtering and prediction problems. Journal of Basic Engineering 1960;82(1):35–45.

Kho, J., Rogers, A., Jennings, N.R.. Decentralized control of adaptive sampling in wireless sensor networks. ACM Trans Sen Netw 2009;5(3):1–35.

van Leeuwen, P., Hesselink, H., Rohling, J..
Scheduling aircraft using constraint satisfaction. Electronic Notes in Theoretical Computer Science 2002;76:252–268.

Monderer, D., Shapley, L.. Potential games. Games and Economic Behavior 1996;14:124–143.

Nash, J.. Equilibrium points in n-person games. In: Proceedings of the National Academy of Sciences, USA. Volume 36; 1950. p. 48–49.

Smyrnakis, M., Leslie, D.S.. Dynamic opponent modelling in fictitious play. The Computer Journal 2010.

Stranjak, A., Dutta, P.S., Ebden, M., Rogers, A., Vytelingum, P.. A multi-agent simulation system for prediction and scheduling of aero engine overhaul. In: AAMAS '08: Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems. 2008. p. 81–88.

Wolpert, D., Turner, K.. An overview of collective intelligence. Handbook of Agent Technology 1999.

Young, H.P.. Strategic Learning and Its Limits. Oxford University Press, 2005.
