Polynomial stochastic games via sum of squares optimization


Authors: Parikshit Shah, Pablo A. Parrilo

Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139

Abstract: Stochastic games are an important class of problems that generalize Markov decision processes to game-theoretic scenarios. We consider finite-state two-player zero-sum stochastic games over an infinite time horizon with discounted rewards. The players are assumed to have infinite strategy spaces and the payoffs are assumed to be polynomials. In this paper we restrict our attention to a special class of games for which the single-controller assumption holds. It is shown that minimax equilibria and optimal strategies for such games may be obtained via semidefinite programming.

I. INTRODUCTION

Markov decision processes (MDPs) are widely used system modeling tools in which a single agent makes decisions at each stage of a multi-stage process so as to optimize some reward or payoff [1]. Game theory is a system modeling paradigm for problems in which several (possibly adversarial) decision makers make individual decisions to optimize their own payoffs [2]. In this paper we study stochastic games [3], a framework that combines the modeling power of MDPs and games. Stochastic games may be viewed as competitive MDPs in which several decision makers make decisions at each stage to maximize their own rewards. Each state of a stochastic game is a simple game, but the decisions made by the players affect not only their current payoff but also the transition to the next state.

(This research was funded in part by AFOSR MURI subawards 2003-07688-1 and 102-1080673. Draft of October 26, 2018.)

Notions of solutions in games have been extensively studied, and are very well understood.
The most popular notion of a solution in game theory is that of a Nash equilibrium. While these equilibria are hard to compute in general, in certain cases they may be computed efficiently. For games involving two players and finite action spaces, mixed-strategy minimax equilibria always exist (see, e.g., [2]). These minimax saddle points correspond to the well-known notion of a Nash equilibrium. From a computational standpoint such games are considered tractable because Nash equilibria may be computed efficiently via linear programming.

Stochastic games were introduced by Shapley [4] in 1953. In his paper, he showed that the notion of a minimax equilibrium may be extended to stochastic games with finite state spaces and strategy sets. He also proposed a value-iteration-like algorithm to compute the equilibria. In 1981 Parthasarathy and Raghavan [5], [3] studied single-controller games, in which the transition probabilities are controlled by the action of only one player. They showed that stochastic games satisfying this property can be solved efficiently via linear programming (thus proving that such problems with rational data can be solved in a finite number of steps).

While computational techniques for finite games are reasonably well understood, there has been some recent interest in the class of infinite games; see [6], [7] and the references therein. In this important class, players have access to an infinite number of pure strategies, and the players are allowed to randomize over these choices. In a recent paper [6], Parrilo describes a technique to solve two-player zero-sum infinite games with polynomial payoffs via semidefinite programming. It is natural to wonder whether the techniques for finite stochastic games can be extended to infinite stochastic games (i.e.,
finite-state stochastic games in which players have access to infinitely many pure strategies). In particular, since finite, single-controller, zero-sum games can be solved via linear programming, can similar infinite stochastic games be solved via semidefinite programming? The answer is affirmative, and this paper focuses on establishing this result.

The main contribution of this paper is a computationally efficient, finite-dimensional characterization of the solution of single-controller polynomial stochastic games. For this, we extend the linear programming formulation that solves the finite-action single-controller stochastic game (i.e., under Assumption (SC) below) to an infinite-dimensional optimization problem when the actions are uncountably infinite. We furthermore establish the following properties of this infinite-dimensional optimization problem:
1) Its optimal solutions correspond to minimax equilibria.
2) The problem can be solved efficiently by semidefinite programming.

Section II of this paper provides a formal description of the problem and introduces the basic notation used in the paper. We show that for two-player zero-sum polynomial stochastic games, equilibria exist and the corresponding equilibrium value vector is unique. (This proof is essentially an adaptation of the original proof by Shapley in [4] for finite stochastic games.) In Section II we also briefly review some elegant results about polynomial nonnegativity, moment sequences of nonnegative measures, and their connection to semidefinite programming. In Section III, we briefly review the linear programming approach to finite stochastic games. Section IV states and proves the main result of this paper. In Section V we present an example of a two-player, two-state stochastic game, and compute the equilibria via semidefinite programming.
Finally, in Section VI we state some natural extensions of this problem, conclusions, and directions of future research.

II. PROBLEM DESCRIPTION

A. Stochastic games

We consider the problem of solving two-player zero-sum stochastic games via mathematical programming. The game consists of finitely many states with two adversarial players that make simultaneous decisions. Each player receives a payoff that depends on the actions of both players and the state (i.e., each state can be thought of as a particular zero-sum game). The transitions between the states are random (as in a finite-state Markov decision process), and the transition probabilities in general depend on the actions of the players and the current state. The process runs over an infinite horizon. Player 1 attempts to maximize his reward over the horizon (via a discounted accumulation of the rewards at each stage) while player 2 tries to minimize his payoff to player 1. If $(a_1^1, a_1^2, \ldots)$ and $(a_2^1, a_2^2, \ldots)$ are sequences of actions chosen by players 1 and 2, resulting in a sequence of states $(s^1, s^2, \ldots)$, then the reward of player 1 is given by

$$\sum_{k=1}^{\infty} \beta^k r(s^k, a_1^k, a_2^k).$$

[Fig. 1. A two-state stochastic game. The payoff functions associated to the states are denoted by $r_1$ and $r_2$. The edges are marked by the corresponding state transition probabilities.]

The game is completely defined via the specification of the following data:
1) The (finite) state space $S = \{1, \ldots, S\}$.
2) The sets of actions for players 1 and 2, given by $A_1$ and $A_2$.
3) The payoff function, denoted by $r(s, a_1, a_2)$, for a given state $s$ and actions $a_1$ and $a_2$ (of players 1 and 2).
4) The probability transition matrix $p(s'; s, a_1, a_2)$, which gives the conditional probability of transition from state $s$ to $s'$ given the players' actions.
5) The discount factor $\beta$, where $0 \le \beta < 1$.

To fix ideas, consider the following example of a two-state stochastic game (i.e., $S = \{1, 2\}$). The action spaces of the two players are $A_1 = A_2 = [0, 1]$. The payoff function in state 1 is $r(1, a_1, a_2) = r_1(a_1, a_2)$ and the payoff function in state 2 is $r(2, a_1, a_2) = r_2(a_1, a_2)$. Both are assumed to be polynomials in $a_1$ and $a_2$. The probability transition matrix is

$$P = \begin{bmatrix} p_{11}(a_1, a_2) & p_{12}(a_1, a_2) \\ p_{21}(a_1, a_2) & p_{22}(a_1, a_2) \end{bmatrix}.$$

Every entry in this matrix is assumed to be a polynomial in $a_1$ and $a_2$. This stochastic game can be depicted graphically as shown in Fig. 1. We will return to a specific instance of this example in Section V, where we explicitly solve for the equilibrium strategies of the two players.

Through most of this paper (except Section II-C) we make the following important assumption about the probability transition matrix:

Assumption SC: The probability of transition to state $s'$ conditioned upon the current state being $s$ depends only on $s$, $s'$, and the action $a_1$ of player 1, for every $s$ and $s'$. This probability is independent of the action of player 2. Thus, $p(s'; s, a_1, a_2) = p(s'; s, a_1)$.

This is known as the single-controller assumption. In this paper we will mostly (except briefly in Section III, where finite strategy spaces are considered) be concerned with the case where the action spaces $A_1$ and $A_2$ of the two players are uncountably infinite sets. For the sake of simplicity we will often consider the case where $A_1 = A_2 = [0, 1] \subset \mathbb{R}$. The results generalize easily to the case where the strategy sets are finite unions of arbitrary intervals of the real line. For the sake of simplicity, we also assume that the action sets are the same in each state, though this assumption may be relaxed.
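To make the discounted-reward objective concrete, the following is a minimal simulation sketch of a two-state single-controller game. The payoff polynomials and transition polynomials here are illustrative choices of ours, not the paper's example, and the players are held to fixed pure actions:

```python
import random

# Illustrative sketch (not the paper's data): one sample path of a two-state
# single-controller game, accumulating sum_{k>=1} beta^k r(s^k, a1, a2).

beta = 0.75

def r(s, a1, a2):
    # hypothetical polynomial payoffs r_1, r_2
    return (a1 - a2) ** 2 if s == 1 else a1 * a2 - a1 ** 2

def p_stay1(s, a1):
    # single-controller: P(next state = 1) depends only on player 1's action
    return 0.5 + 0.25 * a1 if s == 1 else 0.25 + 0.5 * a1

def discounted_reward(a1, a2, s0=1, horizon=200, seed=0):
    rng = random.Random(seed)  # seeded, so the run is reproducible
    s, total = s0, 0.0
    for k in range(1, horizon + 1):
        total += beta ** k * r(s, a1, a2)
        s = 1 if rng.random() < p_stay1(s, a1) else 2
    return total

print(discounted_reward(0.6, 0.3))
```

Truncating at a finite horizon is sound because the tail of the series is bounded by $\beta^{k+1} \max |r| / (1 - \beta)$.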
We will denote by $a_1$ and $a_2$ the actions chosen by players 1 and 2 from their respective action spaces. The payoff function is assumed to be a polynomial in the variables $a_1$ and $a_2$ with real coefficients:

$$r(s, a_1, a_2) = \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} r_{ij}(s)\, a_1^i a_2^j.$$

Finally, we assume that the transition probability $p(s'; s, a_1)$ is a polynomial in the action $a_1$.

The decision process runs over an infinite horizon, so it is natural to restrict one's attention to stationary strategies for each player, i.e., strategies that depend only on the state of the process and not on time. Moreover, since the process involves two adversarial decision makers, it is also natural to look for randomized strategies (or mixed strategies) rather than pure strategies, so as to recover the notion of a minimax equilibrium. A mixed strategy for player 1 is a finite set of probability measures $\mu = [\mu(1), \ldots, \mu(S)]$ supported on the action set $A_1$. Each probability measure corresponds to a randomized strategy for player 1 in a particular state; for example, $\mu(k)$ corresponds to the randomized strategy that player 1 would use when in state $k$. Similarly, player 2's strategy will be represented by $\nu = [\nu(1), \ldots, \nu(S)]$.

(A word on notation: Throughout the paper, indices in parentheses will be used to denote the state. Bold letters will be used to indicate vectorization with respect to the state, i.e., the collection of objects corresponding to different states into a vector whose $i$th entry corresponds to state $i$. The Greek letters $\xi$, $\mu$, $\nu$ will be used to denote measures. Subscripts on these Greek letters will be used to denote moments of the measures. A bar over a Greek letter indicates a (finite) moment sequence, the length of the sequence being clear from the context.
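Because the payoff is a polynomial, the expected payoff under mixed strategies depends on the measures only through finitely many of their moments: $\mathbb{E}_{\mu,\nu}[r] = \sum_{ij} r_{ij}\, \mu_i \nu_j$ with $\mu_i = \int a^i d\mu$. A small exact-arithmetic check of this fact, using illustrative coefficients of our own choosing and uniform measures on $[0,1]$ (whose $i$th moment is $1/(i+1)$):

```python
from fractions import Fraction

# Hypothetical payoff r(a1, a2) = a1*a2 + a1^2*a2^2, i.e. r_11 = r_22 = 1.
r_coeffs = {(1, 1): Fraction(1), (2, 2): Fraction(1)}

def uniform_moment(i):
    # i-th moment of the uniform probability measure on [0,1]: ∫_0^1 a^i da
    return Fraction(1, i + 1)

# Expected payoff is bilinear in the players' moment sequences.
expected = sum(c * uniform_moment(i) * uniform_moment(j)
               for (i, j), c in r_coeffs.items())
print(expected)  # 1/4 + 1/9 = 13/36
```

This is exactly why the equilibrium problem can be posed over finite moment vectors rather than over the measures themselves.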
For example, $\xi_j(i)$ denotes the $j$th moment of the measure $\xi$ corresponding to state $i$, and $\bar{\xi}(i) = [\xi_0(i), \ldots, \xi_n(i)]$.)

A strategy $\mu$ leads to a probability transition matrix $P(\mu)$ such that $P_{ij}(\mu) = \int_{A_1} p(j; i, a_1)\, d\mu(i)$. Thus, once player 1 fixes a strategy $\mu$, the probability transition matrix is fixed, and can be obtained by integrating each entry in the matrix with respect to the measure $\mu$. (Since the entries are polynomials, upon integration these entries depend affinely on the moments of $\mu(i)$.) Given strategies $\mu$ and $\nu$, the expected reward collected by player 1 in a stage at state $s$ is given by

$$r(s, \mu(s), \nu(s)) = \int_{A_1} \int_{A_2} r(s, a_1, a_2)\, d\mu(s)\, d\nu(s).$$

The reward collected over the infinite horizon (for fixed strategies $\mu(s)$ and $\nu(s)$) starting at state $s$, $v_\beta(s, \mu(s), \nu(s))$, is given by the system of equations

$$v_\beta(s, \mu(s), \nu(s)) = r(s, \mu(s), \nu(s)) + \beta \sum_{s' \in S} \left( \int_{A_1} p(s'; s, a_1)\, d\mu(s) \right) v_\beta(s', \mu(s'), \nu(s')) \quad \forall s.$$

Vectorizing $v_\beta(s, \mu(s), \nu(s))$, we obtain $v_\beta(\mu, \nu) = (I - \beta P(\mu))^{-1} r(\mu, \nu)$, where $r(\mu, \nu) = [r(1, \mu(1), \nu(1)), \ldots, r(S, \mu(S), \nu(S))] \in \mathbb{R}^S$.

B. Solution Concept

We now briefly discuss the question: "What is a reasonable solution concept for stochastic games?" Recall that for zero-sum normal-form games, a Nash equilibrium is a widely used notion of equilibrium in competitive scenarios. A Nash equilibrium in a two-player game is a pair of independent randomized strategies (say $\mu$ and $\nu$, one for each player) such that, given that player 2 plays $\nu$, player 1's best response is to play $\mu$, and vice versa. It is an easy exercise to show that computation of Nash equilibria is equivalent to finding saddle points of the payoff function.
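For fixed stationary strategies the value vector solves the linear system $v = r + \beta P v$, i.e. $v_\beta = (I - \beta P)^{-1} r$. A numerical sketch with illustrative data of our own (S = 2, so the 2x2 inverse can be written by hand):

```python
# Illustrative data, not the paper's: P(mu) and the stage rewards under
# some fixed pair of stationary strategies.
beta = 0.75
P = [[0.6, 0.4],   # row s: transition probabilities out of state s
     [0.3, 0.7]]
r = [1.0, -0.5]    # r(s, mu(s), nu(s))

# Solve v = (I - beta*P)^{-1} r via Cramer's rule for the 2x2 matrix M.
M = [[1 - beta * P[0][0], -beta * P[0][1]],
     [-beta * P[1][0], 1 - beta * P[1][1]]]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
v = [(M[1][1] * r[0] - M[0][1] * r[1]) / det,
     (M[0][0] * r[1] - M[1][0] * r[0]) / det]

# Sanity check: v satisfies the fixed-point system v = r + beta * P v.
for s in range(2):
    resid = v[s] - (r[s] + beta * sum(P[s][t] * v[t] for t in range(2)))
    assert abs(resid) < 1e-12
print(v)
```

The inverse exists because $\beta < 1$ makes $I - \beta P$ strictly diagonally dominant for any stochastic matrix $P$.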
It is also well known that Nash equilibria (or, equivalently, saddle points) correspond to the minimax notion of an equilibrium, i.e., points that satisfy the equality

$$\min_\mu \max_\nu v(\mu, \nu) = \max_\nu \min_\mu v(\mu, \nu).$$

While there may exist no pure strategies that satisfy this equality, it may be achieved by allowing randomization over the allowable strategies.

In his seminal paper [4], Shapley generalized the notion of Nash equilibria to stochastic games. He defined a "stationary equilibrium" to be a pair of randomized strategies (over the action space) that depend only on the state of the game. (Of course, to be an equilibrium, these mixed strategies must also satisfy the no-deviation principle.) For stochastic games, once one restricts attention to stationary equilibria, instead of a unique "value" (as in normal-form games) one has a unique "value vector". This vector is indexed by the state, and its $i$th component is interpreted as the equilibrium value player 1 can expect to receive (over the infinite discounted process) conditioned on the game starting in state $i$. Note that different states of the game may be favorable to different players. Since the actions affect both payoffs and state transitions, players must balance their strategies so that they receive good payoffs in a particular state along with favorable state transitions. The "no unilateral deviation" principle, the saddle-point inequality (interpreted row-wise, i.e., conditioned upon a particular state), and the equivalence of the minmax and maxmin over randomized strategies all extend to the stochastic game case, and when we restrict attention to games with just one state, we recover the classical notions of equilibrium.
Definition 1: A pair of vectors of mixed strategies (indexed by the state) $\mu_0$ and $\nu_0$ which satisfy the saddle-point property

$$v_\beta(\mu, \nu_0) \le v_\beta(\mu_0, \nu_0) \le v_\beta(\mu_0, \nu) \qquad (1)$$

for all (vectors of) mixed strategies $\mu, \nu$ are called equilibrium strategies. The corresponding vector $v_\beta(\mu_0, \nu_0)$ is called the value vector of the game.

One should note that $v_\beta(\mu, \nu)$ is a vector in $\mathbb{R}^S$ indexed by the initial state of the Markov process. Hence the above inequality is a vector inequality and is to be interpreted componentwise. More precisely, if $A$ is the action space, let $\Delta(A)$ denote the space of probability measures supported on $A$. Then $v_\beta$ is a function of the form

$$v_\beta : \prod_{i=1}^S \Delta(A) \times \prod_{i=1}^S \Delta(A) \to \mathbb{R}^S,$$

and equilibrium strategies correspond to the saddle points of this function. The mixed strategies of the players are indexed by the state (i.e., there is one probability measure per state per player). These probability measures (conditioned upon the state) are independent across states, and are also independent across the players.

C. Existence of Equilibria

In his original paper, Shapley [4] showed that stationary equilibria always exist (and that the corresponding value vectors are unique) for two-player, zero-sum, finite-state, finite-action stochastic games. (Shapley considered games where at each state there is some probability of termination, whereas in this paper we consider games over an infinite horizon with discounted rewards, as already mentioned. These two formulations are equivalent in the sense that starting from a discounted game one can construct a game with termination probabilities, and vice versa, such that both have the same equilibrium value vectors.)
In this subsection we address the existence and uniqueness issue, and prove that for two-player, zero-sum stochastic games over finite state spaces, with infinite strategy spaces and polynomial payoffs, stationary equilibria always exist and the value vectors are unique. Throughout the paper, we assume that the transition probabilities are polynomial functions of the actions of the players. It is important to note that the results of this subsection do not depend upon the single-controller assumption. As a by-product of this proof, we obtain a simple algorithm for computing equilibria for all such games. This algorithm is analogous to policy iteration in dynamic programming, and consists of solving a sequence of simple (non-stochastic) games whose value vectors converge to the true value vector.

Let $p(x, y)$ be a polynomial, and let $A = [0, 1]$ be the strategy space of players 1 and 2. Let $\mathrm{val}(p(x, y))$ be the value of the zero-sum polynomial game with payoff function $p(x, y)$ and strategy space $A$. It can be shown that a mixed-strategy Nash equilibrium always exists for two-player zero-sum polynomial games [8], and such equilibria can be computed using semidefinite programming [6].

Lemma 1: Let $p_1(x, y)$ and $p_2(x, y)$ be given polynomials. Then

$$|\mathrm{val}(p_1(x, y)) - \mathrm{val}(p_2(x, y))| \le \max_{x, y \in [0,1]} |p_1(x, y) - p_2(x, y)|.$$

Proof: Let $\mu_1, \nu_1$ be the optimal strategies for the polynomial zero-sum game with payoff $p_1(x, y)$ (so that $\mathbb{E}_{\mu_1, \nu_1}[p_1(x, y)] = \mathrm{val}(p_1(x, y))$), and let $\mu_2, \nu_2$ be the optimal strategies for the game with payoff $p_2(x, y)$. If $\mathrm{val}(p_1) = \mathrm{val}(p_2)$ the result is trivial, so without loss of generality assume that $\mathrm{val}(p_1) > \mathrm{val}(p_2)$. By the saddle-point property,

$$\int p_1(x, y)\, d\mu_1\, d\nu_2 \ \ge\ \int p_1(x, y)\, d\mu_1\, d\nu_1 \ \ge\ \int p_2(x, y)\, d\mu_2\, d\nu_2 \ \ge\ \int p_2(x, y)\, d\mu_1\, d\nu_2.$$
Here the first inequality follows by considering $\nu_2$ as a deviation of player 2 from his optimal strategy $\nu_1$ for the game with payoff $p_1$; the second inequality follows from the preceding assumption; and the third inequality follows from a deviation argument for player 1 from his optimal strategy. Hence,

$$\left| \int p_1(x, y)\, d\mu_1\, d\nu_1 - \int p_2(x, y)\, d\mu_2\, d\nu_2 \right| \le \left| \int \big(p_1(x, y) - p_2(x, y)\big)\, d\mu_1\, d\nu_2 \right| \le \max_{x, y \in [0,1]} |p_1(x, y) - p_2(x, y)| \int d\mu_1\, d\nu_2.$$

Note that the quantity on the right is bounded because we are considering the maximum of a bounded continuous function on a compact set.

Let $\alpha \in \mathbb{R}^S$. Given a polynomial game with payoff functions $r(s, a_1, a_2)$ and transition probabilities $p(t; s, a_1, a_2)$ (sometimes we will suppress the state indices and write the entire matrix as $P(a_1, a_2)$), fix a state $s$ and define the polynomial

$$G_s(\alpha) = r(s, a_1, a_2) + \beta \sum_{t \in S} p(t; s, a_1, a_2)\, \alpha_t.$$

We will perform iterations using this vector $\alpha \in \mathbb{R}^S$. We call the iterates $\alpha^k \in \mathbb{R}^S$ ($k$ being the iteration index), and denote the $s$th component of this vector by $\alpha_s^k$. Pick the vector $\alpha^0 \in \mathbb{R}^S$ arbitrarily and define the recursion for the $s$th component at iteration $k$ by

$$\alpha_s^k = \mathrm{val}(G_s(\alpha^{k-1})), \qquad k = 1, 2, \ldots$$

Rephrasing the above in terms of operators, define $T_s$ to be the operator such that $T_s \alpha = \mathrm{val}(G_s(\alpha))$, and let $T\alpha = [T_1 \alpha, \ldots, T_S \alpha]^T$. Then the recursion simply consists of computing the iterates $T^k(\alpha)$.

Lemma 2: The limit $\lim_{k \to \infty} T^k(\alpha) = \phi$ exists and is independent of $\alpha$. Moreover, $\phi$ is the unique fixed-point solution of the equation $\phi = T\phi$.

Proof: For $\alpha \in \mathbb{R}^S$ define the norm $\|\alpha\| = \max_s |\alpha_s|$.
Then,

$$\|T\gamma - T\alpha\| = \max_s |\mathrm{val}(G_s(\gamma)) - \mathrm{val}(G_s(\alpha))| \le \max_s \max_{a_1, a_2 \in [0,1]} \Big| \beta \sum_t p(t; s, a_1, a_2)(\gamma_t - \alpha_t) \Big| \quad \text{(using Lemma 1)}$$
$$\le \max_s \max_{a_1, a_2 \in [0,1]} \Big| \beta \sum_t p(t; s, a_1, a_2) \Big| \max_t |\gamma_t - \alpha_t| = \beta \|\gamma - \alpha\|.$$

Since the discount factor $\beta < 1$, we have a contraction, and by the contraction mapping principle the iteration $T^k \alpha$ converges to the unique fixed point of the equation $T\phi = \phi$.

Lemma 2 establishes that a fixed-point solution of the iteration exists. We now show that the fixed point is in fact the value vector of the game. To show this, we show that if we compute the optimal strategies $\mu(s), \nu(s)$ of the games $G_s(\phi)$, $s = 1, 2, \ldots, S$, then play according to these strategies achieves the value vector $\phi$. Since $\phi$ by definition satisfies the saddle-point inequality (1), an equilibrium solution exists. To show that the value vector is unique, we show that any value vector satisfies the fixed-point equation $T v_\beta = v_\beta$. Since there is a unique fixed point by Lemma 2, the value vector must be unique.

Theorem 1: Let $\phi$ be the fixed point defined in Lemma 2. Then:
a. Let $\mu(s), \nu(s)$ denote the optimal measures of the polynomial game with payoff $G_s(\phi)$, $s \in \{1, \ldots, S\}$. Then $\mu = [\mu(1), \ldots, \mu(S)]^T$, $\nu = [\nu(1), \ldots, \nu(S)]^T$ are the optimal strategies for the stochastic game.
b. If $v_\beta(\mu, \nu)$ is a value vector for the game then $v_\beta$ satisfies $T v_\beta = v_\beta$. Hence $v_\beta = \phi$ exists and is unique.

Proof: Let $\mu(s)$ and $\nu(s)$ be the optimal strategies for the game $G_s(\phi)$. Then by definition, the expected value of play under these strategies is $\phi_s = T_s \phi = \cdots = T_s^k \phi$. Vectorizing this equation, we note that

$$\phi = T^k \phi = \mathbb{E}_{\mu, \nu}\left[ r(a_1, a_2) + \beta P(a_1, a_2) r(a_1, a_2) + \cdots + \beta^{k-1} P^{k-1}(a_1, a_2) r(a_1, a_2) + \beta^k P^k(a_1, a_2)\, \phi \right].$$
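The contraction argument above can be watched in action numerically. As an illustrative simplification (ours, not the paper's), restrict each state to two pure actions so that $\mathrm{val}(G_s(\alpha))$ is the value of a 2x2 zero-sum matrix game, which has a closed form; the payoff matrices and transition probabilities below are made up for the demonstration:

```python
beta = 0.6

def val2x2(M):
    """Value of a 2x2 zero-sum matrix game (row player maximizes)."""
    lo = max(min(row) for row in M)                     # best pure maximin
    hi = min(max(M[0][j], M[1][j]) for j in range(2))   # best pure minimax
    if lo == hi:                                        # pure saddle point
        return lo
    (a, b), (c, d) = M
    return (a * d - b * c) / (a + d - b - c)            # mixed-strategy value

# Hypothetical two-state data: payoffs and P(next state = 1 | s, a1, a2).
r = {1: [[1, -1], [-1, 1]], 2: [[3, -1], [-2, 2]]}
p1 = {1: [[0.8, 0.2], [0.4, 0.6]], 2: [[0.5, 0.9], [0.1, 0.3]]}

def T(alpha):
    """One step of the Shapley operator: alpha_s <- val(G_s(alpha))."""
    out = []
    for s in (1, 2):
        G = [[r[s][i][j] + beta * (p1[s][i][j] * alpha[0]
                                   + (1 - p1[s][i][j]) * alpha[1])
              for j in range(2)] for i in range(2)]
        out.append(val2x2(G))
    return out

alpha = [0.0, 0.0]
for _ in range(200):
    alpha = T(alpha)
print(alpha)  # approximate value vector phi
```

Each application of $T$ shrinks distances by at least the factor $\beta$, so 200 iterations leave the iterate numerically indistinguishable from the fixed point $\phi$.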
Taking the limit as $k \to \infty$, we obtain

$$\phi = \mathbb{E}_{\mu, \nu}\left[ \sum_{k=0}^{\infty} \beta^k P^k(a_1, a_2)\, r(a_1, a_2) \right] = v_\beta(\mu, \nu).$$

Hence playing according to the stationary strategies $\mu(s), \nu(s)$, $s = 1, \ldots, S$, achieves the value vector $\phi$. Suppose player 1 plays according to the strategy $\mu$, and suppose player 2 deviates from the prescribed stationary strategy $\nu$ to a stationary strategy $\nu'$. Then, since $\mu, \nu$ are by definition equilibrium strategies for the games $G_s(\phi)$, we have the (vector) inequality for all $\nu'$:

$$\phi = \mathbb{E}_{\mu, \nu}[r(a_1, a_2) + \beta P(a_1, a_2)\, \phi] \le \mathbb{E}_{\mu, \nu'}[r(a_1, a_2) + \beta P(a_1, a_2)\, \phi] \le \mathbb{E}_{\mu, \nu'}[r(a_1, a_2) + \beta P(a_1, a_2) r(a_1, a_2) + \beta^2 P^2(a_1, a_2)\, \phi] \le \cdots \le \mathbb{E}_{\mu, \nu'}[r(a_1, a_2) + \beta P(a_1, a_2) r(a_1, a_2) + \cdots + \beta^{k-1} P^{k-1}(a_1, a_2) r(a_1, a_2) + \beta^k P^k(a_1, a_2)\, \phi].$$

In the first inequality a $\phi$ occurs on the right side; substituting the inequality into that $\phi$ yields the second inequality, and so on. Finally, we obtain the inequality

$$\phi = \mathbb{E}_{\mu, \nu}\left[ \sum_{k=0}^{\infty} \beta^k P^k(a_1, a_2)\, r(a_1, a_2) \right] \le \mathbb{E}_{\mu, \nu'}\left[ \sum_{k=0}^{\infty} \beta^k P^k(a_1, a_2)\, r(a_1, a_2) \right],$$

i.e., $\phi = v_\beta(\mu, \nu) \le v_\beta(\mu, \nu')$ for all $\nu'$. A similar argument for deviations $\mu'$ of player 1 shows that $v_\beta(\mu', \nu) \le v_\beta(\mu, \nu) = \phi$. Hence $\mu(s), \nu(s)$, constructed as the strategies for the games $G_s(\phi)$, satisfy the saddle-point inequality (1) componentwise. This establishes the existence of equilibria. For uniqueness, note that for any strategies $\mu, \nu$ such that $v_\beta(\mu, \nu)$ satisfies the saddle-point inequality (1), by definition we have $T v_\beta(\mu, \nu) = v_\beta(\mu, \nu)$. Since $T$ has a unique fixed point, the vector $v_\beta(\mu, \nu)$ must be unique.

It is interesting to note that the above proof also provides an algorithm to compute approximate equilibria.
To compute each iterate $T_s(\alpha)$ one needs to solve a polynomial game in normal form (which can be done by solving a single semidefinite program), and by solving a sequence of such problems one can compute $T^k(\alpha)$, which is provably close to the actual value vector. However, the rate of convergence of this iteration is not very attractive. In the rest of this paper, we focus attention on single-controller games, for which equilibria can be computed by solving a single semidefinite program.

D. SDP Characterization of Nonnegativity and Moments

Let $A$ be a closed interval on the real line. The set of univariate polynomials which are nonnegative on $A$ has an exact semidefinite description. The set of (finite) vectors in $\mathbb{R}^n$ which correspond to moment sequences of measures supported on $A$ also has an exact semidefinite description. We briefly review these notions here and introduce some related notation [6].

Let $\mathbb{R}[x]$ denote the set of univariate polynomials with real coefficients. Let $p(x) = \sum_{k=0}^n p_k x^k \in \mathbb{R}[x]$. We say that $p(x)$ is nonnegative on $A$ if $p(x) \ge 0$ for every $x \in A$. We denote the set of polynomials of degree $n$ which are nonnegative on $A$ by $P(A)$. (To avoid cumbersome notation, we exclude the degree information from the notation; the degree will usually be clear from the context.) The polynomial $p(x)$ is said to be a sum of squares (SOS) if there exist polynomials $q_1(x), \ldots, q_k(x)$ such that $p(x) = \sum_{i=1}^k q_i(x)^2$. It is well known that a univariate polynomial is a sum of squares if and only if $p(x) \in P(\mathbb{R})$.

Let $\mu$ denote a measure supported on the set $A$. The $i$th moment of the measure $\mu$ is denoted by

$$\mu_i = \int_A x^i\, d\mu.$$

Let $\bar{\mu} = [\mu_0, \ldots, \mu_n]$ be a vector in $\mathbb{R}^{n+1}$.
We say that $\bar{\mu}$ is a moment sequence of length $n+1$ if it corresponds to the first $n+1$ moments of some nonnegative measure $\mu$ supported on the set $A$. The moment space, denoted by $M(A)$, is the subset of $\mathbb{R}^{n+1}$ which corresponds to moments of nonnegative measures supported on the set $A$. We say that a nonnegative measure $\mu$ is a probability measure if its zeroth-order moment satisfies $\mu_0 = 1$. The set of moment sequences of length $n+1$ corresponding to probability measures is denoted by $M_P(A)$.

Let $\mathcal{S}^n$ denote the set of $n \times n$ symmetric matrices, and define the linear operator $H : \mathbb{R}^{2n-1} \to \mathcal{S}^n$ as

$$H : \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_{2n-1} \end{bmatrix} \mapsto \begin{bmatrix} a_1 & a_2 & \cdots & a_n \\ a_2 & a_3 & \cdots & a_{n+1} \\ \vdots & \vdots & \ddots & \vdots \\ a_n & a_{n+1} & \cdots & a_{2n-1} \end{bmatrix}.$$

Thus $H$ is simply the linear operator that takes a vector and constructs the associated Hankel matrix, which is constant along the antidiagonals. We will also frequently use the adjoint of this operator, the linear map $H^* : \mathcal{S}^n \to \mathbb{R}^{2n-1}$:

$$H^* : \begin{bmatrix} m_{11} & m_{12} & \cdots & m_{1n} \\ m_{12} & m_{22} & \cdots & m_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ m_{1n} & m_{2n} & \cdots & m_{nn} \end{bmatrix} \mapsto \begin{bmatrix} m_{11} \\ 2m_{12} \\ m_{22} + 2m_{13} \\ \vdots \\ m_{nn} \end{bmatrix}.$$

This map flattens a matrix into a vector by adding all the entries along antidiagonals.

Lemma 3: Let $p(x) = \sum_{k=0}^{2n} p_k x^k$ be a polynomial, and let $\bar{p} = [p_0, \ldots, p_{2n}]^T$ be the vector of its coefficients. Then $p(x)$ is nonnegative (or SOS) if and only if there exists $S \in \mathcal{S}^{n+1}$, $S \succeq 0$, such that $\bar{p} = H^*(S)$.

Proof: For univariate polynomials, nonnegativity is equivalent to SOS (see [9]). Let $[x]_n = [1, x, \ldots, x^n]^T$. For every $S \in \mathcal{S}^{n+1}$ we have

$$p(x) = \bar{p}^T [x]_{2n} = H^*(S)^T [x]_{2n} = [x]_n^T S\, [x]_n.$$

Factoring $S \succeq 0$, we obtain a sum-of-squares decomposition. The converse is immediate.
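A small sketch of the operators $H$ and $H^*$ makes Lemma 3 concrete. The example data are ours, not the paper's: $p(x) = (x-1)^2 = 1 - 2x + x^2$ is SOS, witnessed by the positive semidefinite Gram matrix $S = [[1,-1],[-1,1]]$:

```python
def hankel(a):
    """H: vector of length 2n-1 -> n x n Hankel matrix (constant antidiagonals)."""
    n = (len(a) + 1) // 2
    return [[a[i + j] for j in range(n)] for i in range(n)]

def hankel_adjoint(M):
    """H*: n x n symmetric matrix -> length 2n-1 vector of antidiagonal sums."""
    n = len(M)
    out = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            out[i + j] += M[i][j]
    return out

S = [[1, -1], [-1, 1]]   # PSD Gram matrix (eigenvalues 0 and 2)
p = hankel_adjoint(S)    # coefficient vector [p0, p1, p2]
print(p)                 # [1, -2, 1], i.e. p(x) = (x - 1)^2

# Check the identity p(x) = [1, x] S [1, x]^T at a few sample points.
for x in (-1.0, 0.0, 0.5, 2.0):
    quad = sum(S[i][j] * x ** i * x ** j for i in range(2) for j in range(2))
    poly = sum(p[k] * x ** k for k in range(3))
    assert abs(quad - poly) < 1e-12
```

Note that `hankel_adjoint` reproduces the $[m_{11},\, 2m_{12},\, m_{22} + 2m_{13},\, \ldots]$ pattern above, since a symmetric matrix contributes both $m_{ij}$ and $m_{ji}$ to each antidiagonal sum.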
One can give a similar semidefinite characterization of polynomials that are nonnegative on an interval. Since in this paper we typically consider the interval $[0, 1]$, we give an explicit semidefinite characterization of $P([0, 1])$. We define the following matrices:

$$L_1 = \begin{bmatrix} I_{n \times n} \\ 0_{1 \times n} \end{bmatrix}, \qquad L_2 = \begin{bmatrix} 0_{1 \times n} \\ I_{n \times n} \end{bmatrix},$$

where $I_{n \times n}$ stands for the $n \times n$ identity matrix.

Lemma 4: The polynomial $p(x) = \sum_{k=0}^{2n} p_k x^k$ is nonnegative on $[0, 1]$ if and only if there exist matrices $Z \in \mathcal{S}^{n+1}$ and $W \in \mathcal{S}^n$, $Z \succeq 0$, $W \succeq 0$, such that

$$\begin{bmatrix} p_0 \\ \vdots \\ p_{2n} \end{bmatrix} = H^*\!\left( Z + \tfrac{1}{2}\big(L_1 W L_2^T + L_2 W L_1^T\big) - L_2 W L_2^T \right).$$

Proof: The proof follows from the characterization of nonnegative polynomials on intervals. It is well known that

$$p(x) \ge 0 \ \ \forall x \in [0, 1] \iff p(x) = z(x) + x(1 - x)\, w(x),$$

where $z(x)$ and $w(x)$ are sums of squares. A simple application of Lemma 3 yields the required condition.

In this paper, we will also be using a very important classical result about the semidefinite representation of moment spaces [10], [11]. We give an explicit characterization of $M([0, 1])$ and $M_P([0, 1])$.

Lemma 5: The vector $\bar{\mu} = [\mu_0, \mu_1, \ldots, \mu_{2n}]^T$ is a valid set of moments for a nonnegative measure supported on $[0, 1]$ if and only if

$$H(\bar{\mu}) \succeq 0, \qquad \tfrac{1}{2}\big(L_1^T H(\bar{\mu}) L_2 + L_2^T H(\bar{\mu}) L_1\big) - L_2^T H(\bar{\mu}) L_2 \succeq 0. \qquad (2)$$

Moreover, it is a moment sequence corresponding to a probability measure if and only if, in addition to (2), it satisfies $\mu_0 = 1$.

Proof: The proof follows by dualizing Lemma 4. Alternatively, a direct proof may be found in [10].

For example, for $2n = 2$ the sequence $[\mu_0, \mu_1, \mu_2]$ is a moment sequence corresponding to a measure supported on $[0, 1]$ if and only if the following inequalities hold:

$$\begin{bmatrix} \mu_0 & \mu_1 \\ \mu_1 & \mu_2 \end{bmatrix} \succeq 0, \qquad \mu_1 - \mu_2 \ge 0.$$
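The $2n = 2$ case of Lemma 5 is easy to check numerically. The test vectors below are illustrative choices of ours: $[1, 1/2, 1/3]$ is the moment triple of the uniform probability measure on $[0,1]$ and must pass, while a vector whose Hankel matrix is indefinite must fail:

```python
def is_moment_triple(m0, m1, m2):
    """Lemma 5 for 2n = 2: [[m0, m1], [m1, m2]] PSD and m1 - m2 >= 0."""
    # A 2x2 symmetric matrix is PSD iff both diagonal entries and the
    # determinant are nonnegative.
    hankel_psd = m0 >= 0 and m2 >= 0 and m0 * m2 - m1 * m1 >= 0
    return hankel_psd and (m1 - m2 >= 0)

print(is_moment_triple(1, 0.5, 1 / 3))  # True: uniform measure on [0,1]
print(is_moment_triple(1, 0.9, 0.5))    # False: det = 0.5 - 0.81 < 0
```

The first condition is the classical Hankel moment-matrix constraint; the second localizes the support to $[0,1]$ (for a point mass at $x$ it reads $x - x^2 \ge 0$).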
III. FINITE STRATEGY CASE

For the reader's convenience and for comparison purposes, we briefly review here the case where each player has only finitely many strategies at each state [3]. Again, for simplicity we assume that the set of pure strategies available to each player at each state is identical, so that $A_1 = A_2 = \{1, \ldots, m\}$. In the finite-strategy case, when Assumption SC holds, a minimax solution may be computed via linear programming. We state the linear program in this section. In the next section, drawing motivation from this linear program, we write an infinite-dimensional optimization problem for the case where each player has a choice from infinitely many pure strategies.

The finite-action game is completely defined via the specification of the following data:
1) The state space $S = \{1, \ldots, S\}$.
2) The (finite) sets of actions for players 1 and 2, given by $A_1 = A_2 = \{1, \ldots, m\}$.
3) The payoff function for a given state $s$ (representable by a matrix indexed by the actions of the players), denoted by $r(s, a_1, a_2)$.
4) The probability transition matrix $p(s'; s, a_1)$, which gives the conditional probability of transition from state $s$ to $s'$ given player 1's action $a_1$.
5) The discount factor $\beta$.

A mixed strategy for player 1 is a function $f : S \times A_1 \to [0, 1]$ subject to the normalization constraint $\sum_{a_1} f(s, a_1) = 1$ for each $s \in S$ (so that $f(s) = [f(s, 1), \ldots, f(s, m)]$ becomes a probability distribution over the strategy space $A_1$). Similarly, the mixed strategy of player 2 in a particular state $s$ is given by $g(s) = [g(s, 1), \ldots, g(s, m)]$. The collections of mixed strategies (indexed by the states) will be denoted by $f = [f(1), \ldots, f(S)]$ and $g = [g(1), \ldots, g(S)]$, respectively. A strategy $f$ leads to a probability transition matrix $P(f)$ with entries $P_{ss'}(f) = \sum_{a_1 \in A_1} p(s'; s, a_1) f(s, a_1)$.
Again we consider a β-discounted process over an infinite horizon. Given strategies f and g, the reward collected by player 1 in some stage s is given by:

r(s, f(s), g(s)) = Σ_{a_1 ∈ A_1, a_2 ∈ A_2} r(s, a_1, a_2) f(s, a_1) g(s, a_2).

The reward collected over the infinite horizon starting at state s, v_β(s, f(s), g(s)), is given by the system of equations:

v_β(s, f(s), g(s)) = r(s, f(s), g(s)) + β Σ_{s' ∈ S} ( Σ_{a_1 ∈ A_1} p(s'; s, a_1) f(s, a_1) ) v_β(s', f(s'), g(s')).

Thus, v_β(f, g) = (I − βP(f))^{-1} r(f, g), where r(f, g) = [r(1, f(1), g(1)), ..., r(S, f(S), g(S))] ∈ R^S. The problem is to find equilibrium strategies f_0 and g_0 that satisfy the Nash equilibrium property:

v_β(f, g_0) ≤ v_β(f_0, g_0) ≤ v_β(f_0, g)   (3)

for all mixed strategies f, g.

Theorem 2 ([3]): Consider the primal-dual pair of linear programs:

minimize over g(s, a_2), v(s):  Σ_{s=1}^S v(s)
subject to
  v(s) ≥ Σ_{a_2 ∈ A_2} r(s, a_1, a_2) g(s, a_2) + β Σ_{s'=1}^S p(s'; s, a_1) v(s')   ∀ s ∈ S, a_1 ∈ A_1
  Σ_{a_2 ∈ A_2} g(s, a_2) = 1   ∀ s ∈ S
  g(s, a_2) ≥ 0   ∀ s ∈ S, a_2 ∈ A_2   (P)

and

maximize over x(s, a_1), z(s):  Σ_{s=1}^S z(s)
subject to
  Σ_{s=1}^S Σ_{a_1 ∈ A_1} [ δ(s, s') − β p(s'; s, a_1) ] x(s, a_1) = 1   ∀ s' ∈ S
  z(s) ≤ Σ_{a_1 ∈ A_1} x(s, a_1) r(s, a_1, a_2)   ∀ s ∈ S, a_2 ∈ A_2
  x(s, a_1) ≥ 0   ∀ s ∈ S, a_1 ∈ A_1.   (D)

Let p* be the optimal value of (P), and d* be the optimal value of (D). Let x*(s, a_1) be the optimal values of the x(s, a_1) variables obtained in (D). Let

f*(s, a_1) = x*(s, a_1) / Σ_{a_1} x*(s, a_1),

and let g*(s, a_2) be the distribution obtained from the optimal solution of (P). Then the following statements hold:
1) p* = d*.
2) Let v* = [v*(1), ..., v*(S)] be the optimal solution of (P). Then v* = v_β(f*, g*).
3) v_β(f*, g*) satisfies the saddle-point inequality (3).

Remark: Note that statement 2 claims that the solution of the LP (P) corresponds to the infinite horizon discounted reward obtained when players 1 and 2 play according to the distributions f* and g*. Statement 3 claims that these distributions are in fact optimal for the two players in the Nash equilibrium sense.

Proof: See [3, p. 93].

Remark: Note that the primal problem (P) has a natural interpretation in terms of security strategies. Feasible vectors v and g satisfy the first set of inequalities in (P); these inequalities can be interpreted to mean that, using strategy g, the payoff of player 2 will be at most v.

IV. INFINITE STRATEGY CASE

A. Problem Setup

In this section we consider the case where each player can choose from uncountably many different actions. In particular, each player can choose actions from the set [0, 1]. The number of states |S| = S is still finite. The payoff function r(s, a_1, a_2) is a polynomial in a_1 and a_2 for each s ∈ S. The single-controller case (Assumption SC) is studied; in this case, we assume that the probability of transition p(s'; s, a_1) is a polynomial in a_1. Again we consider the two-player zero-sum case where player 1 attempts to maximize his reward over the infinite horizon.

We generalize the problem (P) to this case. The variables f and g representing distributions over the finite sets A_1 and A_2 are replaced by measures μ(s) and ν(s). These measures represent mixed strategies over the uncountable action spaces. (We remind the reader that for each player there are S measures, each measure corresponding to a mixed strategy in a particular state. For example, μ(s) corresponds to the mixed strategy player 1 would adopt when the game is in state s.)

B.
Preliminary Results

We point out that the generalization of (P) to this case is an optimization problem involving nonnegativity of a system of univariate polynomials whose coefficients depend on the moments of these measures. The interpretation in terms of security strategies for player 2 still holds. The following is the generalization of the linear program (P) mentioned above:

minimize over ν(s), v(s):  Σ_{s=1}^S v(s)
subject to
  (a) v(s) ≥ ∫_{A_2} r(s, a_1, a_2) dν(s) + β Σ_{s'=1}^S p(s'; s, a_1) v(s')   for all s ∈ S, a_1 ∈ A_1
  (b) ν(s) is a measure supported on A_2   for all s ∈ S.

Since ∫ r(s, a_1, a_2) dν(s) = q_ν(s, a_1), a univariate polynomial in a_1 for each s ∈ S, the constraints (a) are, for a fixed vector v(s), a system of polynomial inequalities. Note that the coefficients of q depend on the measure ν only via finitely many moments. More concretely, let r(s, a_1, a_2) = Σ_{i,j}^{n_s, m_s} r_{ij}(s) a_1^i a_2^j be the payoff polynomial. Then ∫ r(s, a_1, a_2) dν(s) = Σ_{i,j} r_{ij}(s) a_1^i ν_j(s). Using this observation, the problem may be rewritten as follows:

minimize over ν̄(s), v(s):  Σ_{s=1}^S v(s)
subject to
  (c) v(s) − Σ_{i,j} r_{ij}(s) a_1^i ν_j(s) − β Σ_{s'=1}^S p(s'; s, a_1) v(s') ∈ P(A_1)   for all s ∈ S
  (d) ν̄(s) ∈ M(A_2), and ν_0(s) = 1   for all s ∈ S.   (P′)

The constraints (c) give a system of polynomial inequalities in a_1, one inequality per state. Fix some state s, and let the degree of the inequality for that state be d_s. Let [a_1]_{d_s} = [1, a_1, a_1^2, ..., a_1^{d_s}]^T. The first term in constraint (c) can be rewritten in vector form as:

Σ_{i,j} r_{ij}(s) a_1^i ν_j(s) = ν̄(s)^T R(s)^T [a_1]_{d_s},

where R(s) is a matrix that contains the coefficients of the polynomial r(s, a_1, a_2).
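For concreteness, the construction of R(s) can be illustrated on the payoff r(s, a_1, a_2) = (a_1 − a_2)^2 used later in Section V. The convention R_{ij} = r_{ij}(s), with i indexing powers of a_1 and j powers of a_2, is our reading of the identity above; the sketch verifies it against a point-mass ν:

```python
import numpy as np

# Payoff r(a1, a2) = (a1 - a2)^2 = a1^2 - 2*a1*a2 + a2^2, so the nonzero
# coefficients are r_20 = 1, r_11 = -2, r_02 = 1. Store them as R[i, j] = r_ij
# (i = power of a1, j = power of a2); this indexing convention is an assumption.
R = np.zeros((3, 3))
R[2, 0], R[1, 1], R[0, 2] = 1.0, -2.0, 1.0

# Moments of a point mass at a2 = c: nu_j = c^j.
c = 0.3
nu_bar = np.array([1.0, c, c**2])

# The bilinear form nu_bar^T R^T [a1]_d should reproduce
# the integral of r(a1, a2) dnu = (a1 - c)^2 for every a1.
for a1 in [0.0, 0.25, 1.0]:
    basis = np.array([1.0, a1, a1**2])
    assert abs(nu_bar @ R.T @ basis - (a1 - c) ** 2) < 1e-12
print("R(s) convention check passed")
```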
Similar to the finite strategy case, we define a vector v* = [v*(1), ..., v*(S)]^T, which will turn out to be the value vector of the stochastic game (indexed by the state). The second term in constraint (c), which depends on the probability transition p(s'; s, a_1), is also a polynomial in a_1 whose coefficients depend on the coefficients of p(s'; s, a_1) and v. Specifically,

Σ_{s'=1}^S p(s'; s, a_1) v(s') = v^T Q(s)^T [a_1]_{d_s},

for some matrix Q(s) which contains the coefficients of p(s'; s, a_1).

Lemma 6: Let A_1 = A_2 = [0, 1]. Let E_s ∈ R^{d_s × S} be the matrix which has a 1 in the (1, s) position. Then the semidefinite program (SP) given by:

minimize over ν̄(s), v(s):  Σ_{s=1}^S v(s)
subject to
  (e) H*( Z_s + (1/2)(L_1 W_s L_2^T + L_2 W_s L_1^T) − L_2 W_s L_2^T ) = E_s v − β Q(s) v − R(s) ν̄(s)   ∀ s ∈ S
  (f) H(ν̄(s)) ⪰ 0   ∀ s ∈ S
  (g) (1/2)( L_1^T H(ν̄(s)) L_2 + L_2^T H(ν̄(s)) L_1 ) − L_2^T H(ν̄(s)) L_2 ⪰ 0   ∀ s ∈ S
  (h) e_1^T ν̄(s) = 1   ∀ s ∈ S
  (i) Z_s, W_s ⪰ 0   ∀ s ∈ S   (SP)

exactly solves the polynomial optimization problem (P′).

Proof: The polynomial inequality (c) has the coefficient vector E_s v − β Q(s) v − R(s) ν̄(s). The proof then follows as a direct consequence of Lemma 4, concerning the semidefinite representation of polynomials nonnegative over [0, 1], and Lemma 5, concerning the semidefinite representation of moment sequences of nonnegative measures supported on [0, 1].
The dual of (SP) is given by the following semidefinite program:

maximize over α(s), ξ̄(s):  Σ_{s=1}^S α(s)
subject to
  (j) H*( A_s + (1/2)(L_1 B_s L_2^T + L_2 B_s L_1^T) − L_2 B_s L_2^T ) = R(s)^T ξ̄(s) − α(s) e_1   ∀ s ∈ S
  (k) H(ξ̄(s)) ⪰ 0   ∀ s ∈ S
  (l) (1/2)( L_1^T H(ξ̄(s)) L_2 + L_2^T H(ξ̄(s)) L_1 ) − L_2^T H(ξ̄(s)) L_2 ⪰ 0   ∀ s ∈ S
  Σ_s (E_s − β Q(s))^T ξ̄(s) = 1
  (m) A_s, B_s ⪰ 0   ∀ s ∈ S.   (SD)

Lemma 7: The dual SDP (SD) is equivalent to the following polynomial optimization problem:

maximize over α(s), ξ̄(s):  Σ_{s=1}^S α(s)
subject to
  (n) Σ_{i,j} r_{ij}(s) ξ_i(s) a_2^j − α(s) ≥ 0   ∀ a_2 ∈ A_2, s ∈ S
  (o) ξ̄(s) ∈ M(A_1)   ∀ s ∈ S
  (p) Σ_s ∫_{A_1} ( δ(s, s') − β p(s'; s, a_1) ) dξ(s) = 1   ∀ s' ∈ S.   (D′)

Proof: This again follows as a consequence of Lemmas 4 and 5.

Remark: Note that in the dual problem, the moment sequences do not necessarily correspond to probability measures. Hence, to convert them to probability measures, one needs to normalize them. Upon normalization, one obtains the optimal strategy for player 1.

Lemma 8: The polynomial optimization problems (P′) and (D′) are strong duals of each other.

Proof: We prove this by showing that the semidefinite program (SP) satisfies Slater's constraint qualification and that it is bounded from below. The result then follows from the strong duality of the equivalent semidefinite programs (SP) and (SD). First pick μ(s) and ν(s) to be the uniform distribution on [0, 1] for each state s ∈ S. One can show [10] that the moment sequence of μ is in the interior of the moment space of [0, 1]. As a consequence, constraints (f) and (g) are strictly positive definite. Using the strategies μ and ν, evaluate the discounted value of this pair of strategies as v_β(μ, ν) = (I − βP(μ))^{-1} r(μ, ν). Choose v > v_β.
The polynomial inequalities given by (c) are then all strictly positive, and thus constraints (i) can be satisfied with strictly positive definite Z_s, W_s. The equality constraints are trivially satisfied. To prove that the problem is bounded below, we note that r(s, a_1, a_2) is a polynomial and that the strategy spaces of both players are bounded. Hence inf_{a_1 ∈ A_1, a_2 ∈ A_2} r(s, a_1, a_2) is finite and provides a trivial lower bound for v(s).

Lemma 9: Let ν̄*(s) and ξ̄*(s) be optimal moment sequences for (P′) and (D′), respectively, and let ν*(s) and ξ*(s) be the corresponding measures, supported on A_2 and A_1 respectively. The following complementary slackness results hold for the optima of (P′) and (D′):

v*(s) ∫_{A_1} dξ*(s) = ∫_{A_2} ∫_{A_1} r(s, a_1, a_2) dξ*(s) dν*(s) + β Σ_{s'} v*(s') ∫_{A_1} p(s'; s, a_1) dξ*(s)   ∀ s ∈ S   (4)

α*(s) ∫_{A_2} dν*(s) = ∫_{A_2} ∫_{A_1} r(s, a_1, a_2) dξ*(s) dν*(s)   ∀ s ∈ S.   (5)

Proof: The result follows from the strong duality of the equivalent semidefinite representations of the primal-dual pair (P′)-(D′). The Lagrangian function for (P′) is given by:

L(ξ, α) = inf_{v, ν} { Σ_{s=1}^S v(s) − Σ_s ∫_{A_1} [ v(s) − ∫_{A_2} r(s, a_1, a_2) dν(s) − β Σ_{s'} v(s') p(s'; s, a_1) ] dξ(s) + Σ_s α(s)(1 − ν_0(s)) }.

L(ξ, α) must satisfy weak duality, i.e., d* ≤ p*. At optimality, p* = Σ_s v*(s) for some vector v*. However, strong duality holds, i.e., p* = d*. This forces the first complementary slackness relation; the second relation is obtained similarly by considering the Lagrangian of the dual problem.

We have shown that problem (P′) can be reduced to the semidefinite program (SP), and is thus computationally tractable via convex optimization algorithms. We next show that the solution to problem (P′) is in fact the desired equilibrium solution.

C.
Main Theorem

Let p* be the optimal value of (P′), and d* be the optimal value of (D′). Let ν*(s) and ξ*(s) be the optimal measures recovered from (P′) and (D′). Let

μ*(s) = ξ*(s) / ∫_{A_1} dξ*(s),

so that μ* is a normalized version of ξ* (i.e., μ* is a probability measure). Let v* be the vector obtained as the optimal solution of (P′).

Theorem 3: The optimal solutions to the primal-dual pair (P′), (D′) satisfy the following:
1) p* = d*.
2) v* = v_β(μ*, ν*).
3) v_β(μ*, ν*) satisfies the saddle-point inequality:

v_β(μ, ν*) ≤ v_β(μ*, ν*) ≤ v_β(μ*, ν)   (6)

for all mixed strategies μ, ν.

Proof:
1) Follows from the strong duality of the primal-dual pair (P′)-(D′).
2) Using equation (4) of Lemma 9 in normalized form (i.e., dividing throughout by ξ_0*(s), the zeroth-order moment of the measure ξ*(s)), we obtain

v*(s) = ∫_{A_2} ∫_{A_1} r(s, a_1, a_2) dμ*(s) dν*(s) + β Σ_{s'} v*(s') ∫_{A_1} p(s'; s, a_1) dμ*(s)   ∀ s ∈ S.

Upon simplification and vectorization of v*(s), one obtains v* = r(μ*, ν*) + βP(μ*) v*. Using a Bellman equation argument, or by simply iterating this equation (i.e., substituting repeatedly for v*), it is easy to see that v* = v_β(μ*, ν*).
3) Consider inequality (c) at the optimal solution. We have for every state s:

v*(s) ≥ ∫_{A_2} r(s, a_1, a_2) dν*(s) + β Σ_{s'=1}^S p(s'; s, a_1) v*(s').

Integrating with respect to an arbitrary probability measure μ(s) (with support on A_1), we get:

v*(s) ≥ ∫_{A_2} ∫_{A_1} r(s, a_1, a_2) dμ(s) dν*(s) + β Σ_{s'=1}^S ∫_{A_1} p(s'; s, a_1) v*(s') dμ(s).

Thus, v*(s) ≥ r(s, μ(s), ν*(s)) + β Σ_{s'=1}^S ∫_{A_1} p(s'; s, a_1) v*(s') dμ(s). Iterating this inequality, we obtain v_β(μ*, ν*) = v* ≥ v_β(μ, ν*) for every strategy μ.
This completes one side of the saddle-point inequality. Using the normalized version of equation (5), we get:

α*(s) / ξ_0*(s) = ∫_{A_2} ∫_{A_1} r(s, a_1, a_2) dμ*(s) dν*(s) = r(s, μ*(s), ν*(s)).

If we integrate inequality (n) in problem (D′) with respect to an arbitrary probability measure ν(s) with support on A_2, we obtain α*(s) / ξ_0*(s) ≤ r(s, μ*(s), ν(s)). Thus r(s, μ*(s), ν*(s)) ≤ r(s, μ*(s), ν(s)) for every s. Multiplying throughout by (I − βP(μ*))^{-1}, we get v_β(μ*, ν*) ≤ v_β(μ*, ν). This completes the other side of the saddle-point inequality.

D. Obtaining the measures

Solutions to the semidefinite programs (SP) and (SD) provide the moment sequences corresponding to optimal strategies; additional computation is required to recover the actual measures. We briefly describe a classical procedure to recover the measures using linear algebra. For more details, the reader may refer to [11], [12].

Let μ̄ ∈ R^{2n} be a given moment sequence. We wish to find a nonnegative measure μ supported on the real line with these moments. The resulting measure will be composed of finitely many atoms (i.e., a discrete measure) of the form Σ w_i δ(x − a_i), where Prob(x = a_i) = w_i for all i. Construct the following linear system:

[ μ_0      μ_1   ...  μ_{n−1}  ]   [ c_0     ]       [ μ_n      ]
[ μ_1      μ_2   ...  μ_n      ]   [ c_1     ]       [ μ_{n+1}  ]
[ ...      ...   ...  ...      ] · [ ...     ]  = − [ ...      ]
[ μ_{n−1}  μ_n   ...  μ_{2n−2} ]   [ c_{n−1} ]       [ μ_{2n−1} ]

Note that the Hankel matrix that appears on the left-hand side is a sub-matrix of H(μ̄). We assume without loss of generality that this Hankel matrix is strictly positive definite.
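Anticipating the remaining steps of the procedure described below (root extraction and the Vandermonde system for the weights), the whole recovery can be sketched in numpy; the input moments come from a hypothetical two-atom measure of our own choosing:

```python
import numpy as np

# Hypothetical discrete measure: 0.3*delta(x - 0.2) + 0.7*delta(x - 0.8).
atoms, weights = np.array([0.2, 0.8]), np.array([0.3, 0.7])
n = len(atoms)
mu = np.array([weights @ atoms**j for j in range(2 * n)])  # moments mu_0..mu_{2n-1}

# Solve the Hankel system for the coefficients c_0, ..., c_{n-1}.
H = np.array([[mu[i + j] for j in range(n)] for i in range(n)])
c = np.linalg.solve(H, -mu[n:2 * n])

# The atoms are the roots of x^n + c_{n-1} x^{n-1} + ... + c_1 x + c_0 = 0.
x = np.sort(np.roots(np.concatenate(([1.0], c[::-1]))).real)

# The weights solve the Vandermonde system sum_i w_i x_i^j = mu_j, 0 <= j <= n-1.
V = np.vander(x, n, increasing=True).T
w = np.linalg.solve(V, mu[:n])

print(x, w)  # recovers the atoms [0.2, 0.8] and weights [0.3, 0.7]
```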
(If the Hankel matrix above is not full rank, construct a smaller k × k linear system of equations by eliminating the last n − k rows and columns of the matrix, so that the k × k submatrix is full rank and therefore strictly positive definite.) By inverting this matrix, we solve for [c_0, ..., c_{n−1}]^T. Let x_i be the roots of the polynomial equation

x^n + c_{n−1} x^{n−1} + ··· + c_1 x + c_0 = 0.

It can be shown that the x_i are all real and distinct, and that they are the support points of the discrete measure. Once the supports are obtained, the weights w_i may be obtained by solving the nonsingular Vandermonde system given by:

Σ_{i=1}^n w_i x_i^j = μ_j   (0 ≤ j ≤ n − 1).

V. EXAMPLE

Consider the two-player discounted stochastic game with β = 0.5, S = {1, 2}, payoff functions r(1, a_1, a_2) = (a_1 − a_2)^2 and r(2, a_1, a_2) = −(a_1 − a_2)^2, and probability transition matrix

P(a_1) = [ a_1        1 − a_1  ;
           1 − a_1^2  a_1^2    ].

Fig. 2. A two-state stochastic game with transition probabilities dependent only on the action of Player 1. The payoffs associated to the states are indicated in the corresponding nodes; the edges are marked by the corresponding state transition probabilities.

Figure 2 graphically illustrates this stochastic game, consisting of two states (the nodes) with polynomial transition probabilities dependent on a_1 (as marked on the edges of the graph). Within the nodes, the payoffs associated to the corresponding states are indicated. To understand this game, consider first the zero-sum (nonstochastic) game with payoff function p(a_1, a_2) = (a_1 − a_2)^2 over the strategy space [0, 1]. This game (called the "guessing game") was studied by Parrilo in [6]. If Player 2 is able to guess the action of Player 1, he can simply imitate his action (i.e.
set a_2 = a_1), and the payoff to Player 1 would be zero (this is the minimum possible, since (a_1 − a_2)^2 ≥ 0). Player 1 would try to confuse Player 2 as much as possible, and thus randomizes between the extreme actions a_1 = 0 and a_1 = 1 with probability 1/2 each. Player 2's best response is to play a_2 = 1/2 with probability 1.

In the game described in Fig. 2, in State 1 Player 1 plays the role of confuser and Player 2 plays the role of guesser. In State 2 the roles of the players are reversed: Player 1 is the guesser and Player 2 the confuser. However, the problem is complicated a bit by the fact that State 1 is advantageous to Player 1, so that at every stage he has an incentive to play a strategy that gives him a good payoff as well as maximizing the chances of transitioning to State 1.

The polynomial optimization problem that computes the minimax strategies and the equilibrium values is the following:

minimize v(1) + v(2)
subject to
  v(1) ≥ ∫ (a_1 − a_2)^2 dν(1) + β( a_1 v(1) + (1 − a_1) v(2) )   ∀ a_1 ∈ [0, 1]
  v(2) ≥ −∫ (a_1 − a_2)^2 dν(2) + β( (1 − a_1^2) v(1) + a_1^2 v(2) )   ∀ a_1 ∈ [0, 1]
  ν(1), ν(2) probability measures supported on [0, 1].

This problem can be reformulated as follows:

minimize v(1) + v(2)
subject to
  v(1) ≥ a_1^2 − 2 a_1 ν_1(1) + ν_2(1) + β( a_1 v(1) + (1 − a_1) v(2) )   ∀ a_1 ∈ [0, 1]
  v(2) ≥ −a_1^2 + 2 a_1 ν_1(2) − ν_2(2) + β( (1 − a_1^2) v(1) + a_1^2 v(2) )   ∀ a_1 ∈ [0, 1]
  [1, ν_1(1), ν_2(1)]^T, [1, ν_1(2), ν_2(2)]^T ∈ M([0, 1]).

Solving the SDP and its dual, we obtain the following optimal cost-to-go and optimal moment sequences:

v* = [0.298, −0.158]^T
μ̄*(1) = [1, 0.614, 0.614]^T,   μ̄*(2) = [1, 0.5, 0.25]^T
ν̄*(1) = [1, 0.614, 0.377]^T,   ν̄*(2) = [1, 0.614, 0.614]^T.
The corresponding measures, obtained as explained in subsection IV-D, are supported at only finitely many points and are given by the following:

μ*(1) = 0.386 δ(a_1) + 0.614 δ(a_1 − 1)
μ*(2) = δ(a_1 − 0.5)
ν*(1) = δ(a_2 − 0.614)
ν*(2) = 0.386 δ(a_2) + 0.614 δ(a_2 − 1).

Consider, for example, play in State 1. If Player 1 were playing obliviously with respect to the state transitions, he would play actions a_1 = 0 and a_1 = 1 with probability one half each. However, to increase the probability of staying in State 1, he plays action 1 with a higher probability. Player 2 cannot affect the state transition probabilities directly, and thus must play a myopic best response (a myopic best response is one that is a best response for the game in the current state). Note that in State 1, once Player 1's strategy is fixed, the (only) best response for Player 2 is to play the action a_2 = 0.614 with probability 1. In State 2, Player 1's best strategy is to play a_1 = 0.5; Player 2 picks an action from his myopic best response set (in this case, all probability distributions supported on the points 0 and 1).

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a technique for solving two-player, zero-sum, finite-state stochastic games with infinite strategy spaces and polynomial payoffs, and we established the existence of equilibria for such games. As a by-product, we obtained an algorithm that converges to the unique value vector of the game (although this algorithm does not seem to have very attractive convergence rates). We focused mainly on the case where the single-controller assumption holds, and showed that the problem can be reduced to solving a system of univariate polynomial inequalities and moment constraints.
We used techniques from the classical theory of moments and sums of squares to reduce the problem to a semidefinite programming problem. By solving a primal-dual pair of semidefinite programs, we obtained minimax equilibria and optimal strategies for the players.

It is known that finite-state, finite-action, two-player zero-sum games which satisfy the orderfield property [13], [5] may be solved via linear programming. The single-controller case, games with perfect information, switching-controller stochastic games, separable reward-state independent transition (SER-SIT) games, and additive games all satisfy this property. We intend to extend these cases to the infinite strategy case with polynomial payoffs. General finite action stochastic games which do not satisfy the orderfield property still have an interesting mathematical structure, but efficient computational procedures are not available; developing such procedures presents an interesting direction of future research.

Acknowledgement: The authors would like to thank Ilan Lobel and Prof. Munther Dahleh for bringing to their attention the linear programming solution to single-controller finite stochastic games.

REFERENCES

[1] D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific, 2005, vol. I.
[2] D. Fudenberg and J. Tirole, Game theory. Cambridge, MA: MIT Press, 1991.
[3] J. A. Filar and K. Vrieze, Competitive Markov decision processes. New York: Springer, 1997.
[4] L. S. Shapley, "Stochastic games," Proc. Nat. Acad. Sci. U.S.A., vol. 39, pp. 1095-1100, 1953.
[5] T. Parthasarathy and T. E. S. Raghavan, "An orderfield property for stochastic games when one player controls transition probabilities," J. Optim. Theory Appl., vol. 33, no. 3, pp. 375-392, 1981.
[6] P. A.
Parrilo, "Polynomial games and sum of squares optimization," in Proceedings of the 45th IEEE Conference on Decision and Control, 2006.
[7] N. Stein, A. Ozdaglar, and P. A. Parrilo, "Separable and low-rank continuous games," in Proceedings of the 45th IEEE Conference on Decision and Control, 2006.
[8] M. Dresher, S. Karlin, and L. S. Shapley, "Polynomial games," in Contributions to the Theory of Games, ser. Annals of Mathematics Studies, no. 24. Princeton, NJ: Princeton University Press, 1950, pp. 161-180.
[9] P. A. Parrilo, "Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization," Ph.D. dissertation, California Institute of Technology, May 2000.
[10] S. Karlin and L. S. Shapley, Geometry of moment spaces, ser. Memoirs of the American Mathematical Society. AMS, 1953, vol. 12.
[11] J. A. Shohat and J. D. Tamarkin, The Problem of Moments, ser. American Mathematical Society Mathematical Surveys, vol. II. New York: American Mathematical Society, 1943.
[12] L. Devroye, Nonuniform random variate generation. New York: Springer-Verlag, 1986.
[13] T. E. S. Raghavan and J. A. Filar, "Algorithms for stochastic games: a survey," Z. Oper. Res., vol. 35, no. 6, pp. 437-472, 1991.
