Polynomial stochastic games via sum of squares optimization


Authors: Parikshit Shah, Pablo A. Parrilo

Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139

Abstract: Stochastic games are an important class of problems that generalize Markov decision processes to game-theoretic scenarios. We consider finite-state two-player zero-sum stochastic games over an infinite time horizon with discounted rewards. The players are assumed to have infinite strategy spaces and the payoffs are assumed to be polynomials. In this paper we restrict our attention to a special class of games for which the single-controller assumption holds. It is shown that minimax equilibria and optimal strategies for such games may be obtained via semidefinite programming.

I. INTRODUCTION

Markov decision processes (MDPs) are widely used system modeling tools in which a single agent makes decisions at each stage of a multi-stage process so as to optimize some reward or payoff [1]. Game theory is a system modeling paradigm for problems in which several (possibly adversarial) decision makers make individual decisions to optimize their own payoffs [2]. In this paper we study stochastic games [3], a framework that combines the modeling power of MDPs and games. Stochastic games may be viewed as competitive MDPs in which several decision makers make decisions at each stage to maximize their own rewards. Each state of a stochastic game is a simple game, but the decisions made by the players affect not only their current payoff but also the transition to the next state.

(This research was funded in part by AFOSR MURI subawards 2003-07688-1 and 102-1080673. Draft of October 26, 2018.)

Notions of solutions in games have been extensively studied, and are very well understood.
The most popular notion of a solution in game theory is that of a Nash equilibrium. While these equilibria are hard to compute in general, in certain cases they may be computed efficiently. For games involving two players and finite action spaces, mixed-strategy minimax equilibria always exist (see, e.g., [2]). These minimax saddle points correspond to the well-known notion of a Nash equilibrium. From a computational standpoint such games are considered tractable because Nash equilibria may be computed efficiently via linear programming.

Stochastic games were introduced by Shapley [4] in 1953. In his paper, he showed that the notion of a minimax equilibrium may be extended to stochastic games with finite state spaces and strategy sets. He also proposed a value-iteration-like algorithm to compute the equilibria. In 1981 Parthasarathy and Raghavan [5], [3] studied single-controller games, in which the transition probabilities are controlled by the action of only one player. They showed that stochastic games satisfying this property can be solved efficiently via linear programming (thus proving that such problems with rational data can be solved in a finite number of steps).

While computational techniques for finite games are reasonably well understood, there has been some recent interest in the class of infinite games; see [6], [7] and the references therein. In this important class, players have access to an infinite number of pure strategies, and the players are allowed to randomize over these choices. In a recent paper [6], Parrilo describes a technique to solve two-player zero-sum infinite games with polynomial payoffs via semidefinite programming. It is natural to wonder whether the techniques for finite stochastic games can be extended to infinite stochastic games (i.e.,
finite-state stochastic games in which players have access to infinitely many pure strategies). In particular, since finite, single-controller, zero-sum games can be solved via linear programming, can similar infinite stochastic games be solved via semidefinite programming? The answer is affirmative, and this paper focuses on establishing this result.

The main contribution of this paper is a computationally efficient, finite-dimensional characterization of the solution of single-controller polynomial stochastic games. For this, we extend the linear programming formulation that solves the finite-action single-controller stochastic game (i.e., under Assumption (SC) below) to an infinite-dimensional optimization problem when the actions are uncountably infinite. We furthermore establish the following properties of this infinite-dimensional optimization problem:
1) Its optimal solutions correspond to minimax equilibria.
2) The problem can be solved efficiently by semidefinite programming.

Section II of this paper provides a formal description of the problem and introduces the basic notation used in the paper. We show that for two-player zero-sum polynomial stochastic games, equilibria exist and the corresponding equilibrium value vector is unique. (This proof is essentially an adaptation of the original proof by Shapley in [4] for finite stochastic games.) In Section II we also briefly review some elegant results about polynomial nonnegativity, moment sequences of nonnegative measures, and their connection to semidefinite programming. In Section III, we briefly review the linear programming approach to finite stochastic games. Section IV states and proves the main result of this paper. In Section V we present an example of a two-player, two-state stochastic game, and compute the equilibria via semidefinite programming.
Finally, in Section VI we state some natural extensions of this problem, conclusions, and directions of future research.

II. PROBLEM DESCRIPTION

A. Stochastic games

We consider the problem of solving two-player zero-sum stochastic games via mathematical programming. The game consists of finitely many states with two adversarial players that make simultaneous decisions. Each player receives a payoff that depends on the actions of both players and the state (i.e., each state can be thought of as a particular zero-sum game). The transitions between the states are random (as in a finite-state Markov decision process), and the transition probabilities in general depend on the actions of the players and the current state. The process runs over an infinite horizon. Player 1 attempts to maximize his reward over the horizon (via a discounted accumulation of the rewards at each stage) while player 2 tries to minimize his payoff to player 1. If $(a_1^1, a_1^2, \ldots)$ and $(a_2^1, a_2^2, \ldots)$ are sequences of actions chosen by players 1 and 2, resulting in a sequence of states $(s^1, s^2, \ldots)$, then the reward of player 1 is given by

$$\sum_{k=1}^{\infty} \beta^k r(s^k, a_1^k, a_2^k).$$

[Fig. 1. A two-state stochastic game. The payoff functions associated to the states are denoted by $r_1$ and $r_2$. The edges are marked by the corresponding state transition probabilities.]

The game is completely defined via the specification of the following data:
1) The (finite) state space $S = \{1, \ldots, S\}$.
2) The sets of actions for players 1 and 2, given by $A_1$ and $A_2$.
3) The payoff function, denoted by $r(s, a_1, a_2)$, for a given state $s$ and actions $a_1$ and $a_2$ (of players 1 and 2).
4) The probability transition matrix $p(s'; s, a_1, a_2)$, which gives the conditional probability of transition from state $s$ to $s'$ given the players' actions.
5) The discount factor $\beta$, where $0 \le \beta < 1$.

To fix ideas, consider the following example of a two-state stochastic game (i.e., $S = \{1, 2\}$). The action spaces of the two players are $A_1 = A_2 = [0, 1]$. The payoff function in state 1 is $r(1, a_1, a_2) = r_1(a_1, a_2)$ and the payoff function in state 2 is $r(2, a_1, a_2) = r_2(a_1, a_2)$. Both are assumed to be polynomials in $a_1$ and $a_2$. The probability transition matrix is

$$P = \begin{bmatrix} p_{11}(a_1, a_2) & p_{12}(a_1, a_2) \\ p_{21}(a_1, a_2) & p_{22}(a_1, a_2) \end{bmatrix}.$$

Every entry in this matrix is assumed to be a polynomial in $a_1$ and $a_2$. This stochastic game can be depicted graphically as shown in Fig. 1. We will return to a specific instance of this example in Section V, where we explicitly solve for the equilibrium strategies of the two players.

Through most of this paper (except Section II-C) we make the following important assumption about the probability transition matrix:

Assumption SC: The probability of transition to state $s'$ conditioned upon the current state being $s$ depends only on $s$, $s'$, and the action $a_1$ of player 1, for every $s$ and $s'$. This probability is independent of the action of player 2. Thus, $p(s'; s, a_1, a_2) = p(s'; s, a_1)$.

This is known as the single-controller assumption. In this paper we will mostly (except briefly in Section III, where finite strategy spaces are considered) be concerned with the case where the action spaces $A_1$ and $A_2$ of the two players are uncountably infinite sets. For the sake of simplicity we will often consider the case where $A_1 = A_2 = [0, 1] \subset \mathbb{R}$. The results generalize easily to the case where the strategy sets are finite unions of arbitrary intervals of the real line. For the sake of simplicity, we also assume that the action sets are the same in each state, though this assumption may be relaxed.
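To make the discounted-reward objective concrete, the following is a minimal simulation sketch of a two-state single-controller game. The payoff polynomials and transition polynomials here are illustrative choices of ours, not the paper's example, and the players are held to fixed pure actions:

```python
import random

# Illustrative sketch (not the paper's data): one sample path of a two-state
# single-controller game, accumulating sum_{k>=1} beta^k r(s^k, a1, a2).

beta = 0.75

def r(s, a1, a2):
    # hypothetical polynomial payoffs r_1, r_2
    return (a1 - a2) ** 2 if s == 1 else a1 * a2 - a1 ** 2

def p_stay1(s, a1):
    # single-controller: P(next state = 1) depends only on player 1's action
    return 0.5 + 0.25 * a1 if s == 1 else 0.25 + 0.5 * a1

def discounted_reward(a1, a2, s0=1, horizon=200, seed=0):
    rng = random.Random(seed)  # seeded, so the run is reproducible
    s, total = s0, 0.0
    for k in range(1, horizon + 1):
        total += beta ** k * r(s, a1, a2)
        s = 1 if rng.random() < p_stay1(s, a1) else 2
    return total

print(discounted_reward(0.6, 0.3))
```

Truncating at a finite horizon is sound because the tail of the series is bounded by $\beta^{k+1} \max |r| / (1 - \beta)$.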
We will denote by $a_1$ and $a_2$ the actions chosen by players 1 and 2 from their respective action spaces. The payoff function is assumed to be a polynomial in the variables $a_1$ and $a_2$ with real coefficients:

$$r(s, a_1, a_2) = \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} r_{ij}(s)\, a_1^i a_2^j.$$

Finally, we assume that the transition probability $p(s'; s, a_1)$ is a polynomial in the action $a_1$.

The decision process runs over an infinite horizon, so it is natural to restrict one's attention to stationary strategies for each player, i.e., strategies that depend only on the state of the process and not on time. Moreover, since the process involves two adversarial decision makers, it is also natural to look for randomized strategies (or mixed strategies) rather than pure strategies, so as to recover the notion of a minimax equilibrium. A mixed strategy for player 1 is a finite set of probability measures $\mu = [\mu(1), \ldots, \mu(S)]$ supported on the action set $A_1$. Each probability measure corresponds to a randomized strategy for player 1 in a particular state; for example, $\mu(k)$ corresponds to the randomized strategy that player 1 would use when in state $k$. Similarly, player 2's strategy will be represented by $\nu = [\nu(1), \ldots, \nu(S)]$.

(A word on notation: Throughout the paper, indices in parentheses will be used to denote the state. Bold letters will be used to indicate vectorization with respect to the state, i.e., the collection of objects corresponding to different states into a vector whose $i$th entry corresponds to state $i$. The Greek letters $\xi$, $\mu$, $\nu$ will be used to denote measures. Subscripts on these Greek letters will be used to denote moments of the measures. A bar over a Greek letter indicates a (finite) moment sequence, the length of the sequence being clear from the context.
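Because the payoff is a polynomial, the expected payoff under mixed strategies depends on the measures only through finitely many of their moments: $\mathbb{E}_{\mu,\nu}[r] = \sum_{ij} r_{ij}\, \mu_i \nu_j$ with $\mu_i = \int a^i d\mu$. A small exact-arithmetic check of this fact, using illustrative coefficients of our own choosing and uniform measures on $[0,1]$ (whose $i$th moment is $1/(i+1)$):

```python
from fractions import Fraction

# Hypothetical payoff r(a1, a2) = a1*a2 + a1^2*a2^2, i.e. r_11 = r_22 = 1.
r_coeffs = {(1, 1): Fraction(1), (2, 2): Fraction(1)}

def uniform_moment(i):
    # i-th moment of the uniform probability measure on [0,1]: ∫_0^1 a^i da
    return Fraction(1, i + 1)

# Expected payoff is bilinear in the players' moment sequences.
expected = sum(c * uniform_moment(i) * uniform_moment(j)
               for (i, j), c in r_coeffs.items())
print(expected)  # 1/4 + 1/9 = 13/36
```

This is exactly why the equilibrium problem can be posed over finite moment vectors rather than over the measures themselves.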
For example, $\xi_j(i)$ denotes the $j$th moment of the measure $\xi$ corresponding to state $i$, and $\bar{\xi}(i) = [\xi_0(i), \ldots, \xi_n(i)]$.)

A strategy $\mu$ leads to a probability transition matrix $P(\mu)$ such that $P_{ij}(\mu) = \int_{A_1} p(j; i, a_1)\, d\mu(i)$. Thus, once player 1 fixes a strategy $\mu$, the probability transition matrix is fixed, and can be obtained by integrating each entry in the matrix with respect to the measure $\mu$. (Since the entries are polynomials, upon integration these entries depend affinely on the moments of $\mu(i)$.) Given strategies $\mu$ and $\nu$, the expected reward collected by player 1 in a stage at state $s$ is given by

$$r(s, \mu(s), \nu(s)) = \int_{A_1} \int_{A_2} r(s, a_1, a_2)\, d\mu(s)\, d\nu(s).$$

The reward collected over the infinite horizon (for fixed strategies $\mu(s)$ and $\nu(s)$) starting at state $s$, $v_\beta(s, \mu(s), \nu(s))$, is given by the system of equations

$$v_\beta(s, \mu(s), \nu(s)) = r(s, \mu(s), \nu(s)) + \beta \sum_{s' \in S} \left( \int_{A_1} p(s'; s, a_1)\, d\mu(s) \right) v_\beta(s', \mu(s'), \nu(s')) \quad \forall s.$$

Vectorizing $v_\beta(s, \mu(s), \nu(s))$, we obtain $v_\beta(\mu, \nu) = (I - \beta P(\mu))^{-1} r(\mu, \nu)$, where $r(\mu, \nu) = [r(1, \mu(1), \nu(1)), \ldots, r(S, \mu(S), \nu(S))] \in \mathbb{R}^S$.

B. Solution Concept

We now briefly discuss the question: "What is a reasonable solution concept for stochastic games?" Recall that for zero-sum normal-form games, a Nash equilibrium is a widely used notion of equilibrium in competitive scenarios. A Nash equilibrium in a two-player game is a pair of independent randomized strategies (say $\mu$ and $\nu$, one for each player) such that, given that player 2 plays $\nu$, player 1's best response is to play $\mu$, and vice versa. It is an easy exercise to show that computation of Nash equilibria is equivalent to finding saddle points of the payoff function.
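For fixed stationary strategies the value vector solves the linear system $v = r + \beta P v$, i.e. $v_\beta = (I - \beta P)^{-1} r$. A numerical sketch with illustrative data of our own (S = 2, so the 2x2 inverse can be written by hand):

```python
# Illustrative data, not the paper's: P(mu) and the stage rewards under
# some fixed pair of stationary strategies.
beta = 0.75
P = [[0.6, 0.4],   # row s: transition probabilities out of state s
     [0.3, 0.7]]
r = [1.0, -0.5]    # r(s, mu(s), nu(s))

# Solve v = (I - beta*P)^{-1} r via Cramer's rule for the 2x2 matrix M.
M = [[1 - beta * P[0][0], -beta * P[0][1]],
     [-beta * P[1][0], 1 - beta * P[1][1]]]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
v = [(M[1][1] * r[0] - M[0][1] * r[1]) / det,
     (M[0][0] * r[1] - M[1][0] * r[0]) / det]

# Sanity check: v satisfies the fixed-point system v = r + beta * P v.
for s in range(2):
    resid = v[s] - (r[s] + beta * sum(P[s][t] * v[t] for t in range(2)))
    assert abs(resid) < 1e-12
print(v)
```

The inverse exists because $\beta < 1$ makes $I - \beta P$ strictly diagonally dominant for any stochastic matrix $P$.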
It is also well known that Nash equilibria (or, equivalently, saddle points) correspond to the minimax notion of an equilibrium, i.e., points that satisfy the equality

$$\min_\mu \max_\nu v(\mu, \nu) = \max_\nu \min_\mu v(\mu, \nu).$$

While there may exist no pure strategies that satisfy this equality, it may be achieved by allowing randomization over the allowable strategies.

In his seminal paper [4], Shapley generalized the notion of Nash equilibria to stochastic games. He defined a "stationary equilibrium" to be a pair of randomized strategies (over the action space) that depend only on the state of the game. (Of course, to be an equilibrium, these mixed strategies must also satisfy the no-deviation principle.) For stochastic games, once one restricts attention to stationary equilibria, instead of a unique "value" (as in normal-form games) one has a unique "value vector". This vector is indexed by the state, and its $i$th component is interpreted as the equilibrium value player 1 can expect to receive (over the infinite discounted process) conditioned on the game starting in state $i$. Note that different states of the game may be favorable to different players. Since the actions affect both payoffs and state transitions, players must balance their strategies so that they receive good payoffs in a particular state along with favorable state transitions. The "no unilateral deviation" principle, the saddle-point inequality (interpreted row-wise, i.e., conditioned upon a particular state), and the equivalence of the minmax and maxmin over randomized strategies all extend to the stochastic game case, and when we restrict attention to games with just one state, we recover the classical notions of equilibrium.
Definition 1: A pair of vectors of mixed strategies (indexed by the state) $\mu_0$ and $\nu_0$ which satisfy the saddle-point property

$$v_\beta(\mu, \nu_0) \le v_\beta(\mu_0, \nu_0) \le v_\beta(\mu_0, \nu) \qquad (1)$$

for all (vectors of) mixed strategies $\mu, \nu$ are called equilibrium strategies. The corresponding vector $v_\beta(\mu_0, \nu_0)$ is called the value vector of the game.

One should note that $v_\beta(\mu, \nu)$ is a vector in $\mathbb{R}^S$ indexed by the initial state of the Markov process. Hence the above inequality is a vector inequality and is to be interpreted componentwise. More precisely, if $A$ is the action space, let $\Delta(A)$ denote the space of probability measures supported on $A$. Then $v_\beta$ is a function of the form

$$v_\beta : \prod_{i=1}^S \Delta(A) \times \prod_{i=1}^S \Delta(A) \to \mathbb{R}^S,$$

and equilibrium strategies correspond to the saddle points of this function. The mixed strategies of the players are indexed by the state (i.e., there is one probability measure per state per player). These probability measures (conditioned upon the state) are independent across states, and are also independent across the players.

C. Existence of Equilibria

In his original paper, Shapley [4] showed that stationary equilibria always exist (and that the corresponding value vectors are unique) for two-player, zero-sum, finite-state, finite-action stochastic games. (Shapley considered games where at each state there is some probability of termination, whereas in this paper we consider games over an infinite horizon with discounted rewards, as already mentioned. These two formulations are equivalent in the sense that starting from a discounted game one can construct a game with termination probabilities, and vice versa, such that both have the same equilibrium value vectors.)
In this subsection we address the existence and uniqueness issue, and prove that for two-player, zero-sum stochastic games over finite state spaces, with infinite strategy spaces and polynomial payoffs, stationary equilibria always exist and the value vectors are unique. Throughout the paper, we assume that the transition probabilities are polynomial functions of the actions of the players. It is important to note that the results of this subsection do not depend upon the single-controller assumption. As a by-product of this proof, we obtain a simple algorithm for computing equilibria for all such games. This algorithm is analogous to policy iteration in dynamic programming, and consists of solving a sequence of simple (non-stochastic) games whose value vectors converge to the true value vector.

Let $p(x, y)$ be a polynomial, and let $A = [0, 1]$ be the strategy space of players 1 and 2. Let $\mathrm{val}(p(x, y))$ be the value of the zero-sum polynomial game with payoff function $p(x, y)$ and strategy space $A$. It can be shown that a mixed-strategy Nash equilibrium always exists for two-player zero-sum polynomial games [8], and such equilibria can be computed using semidefinite programming [6].

Lemma 1: Let $p_1(x, y)$ and $p_2(x, y)$ be given polynomials. Then

$$|\mathrm{val}(p_1(x, y)) - \mathrm{val}(p_2(x, y))| \le \max_{x, y \in [0,1]} |p_1(x, y) - p_2(x, y)|.$$

Proof: Let $\mu_1, \nu_1$ be the optimal strategies for the polynomial zero-sum game with payoff $p_1(x, y)$ (so that $\mathbb{E}_{\mu_1, \nu_1}[p_1(x, y)] = \mathrm{val}(p_1(x, y))$), and let $\mu_2, \nu_2$ be the optimal strategies for the game with payoff $p_2(x, y)$. If $\mathrm{val}(p_1) = \mathrm{val}(p_2)$ the result is trivial, so without loss of generality assume that $\mathrm{val}(p_1) > \mathrm{val}(p_2)$. By the saddle-point property,

$$\int p_1(x, y)\, d\mu_1\, d\nu_2 \ \ge\ \int p_1(x, y)\, d\mu_1\, d\nu_1 \ \ge\ \int p_2(x, y)\, d\mu_2\, d\nu_2 \ \ge\ \int p_2(x, y)\, d\mu_1\, d\nu_2.$$
Here the first inequality follows by considering $\nu_2$ as a deviation of player 2 from his optimal strategy $\nu_1$ for the game with payoff $p_1$; the second inequality follows from the preceding assumption; and the third inequality follows from a deviation argument for player 1 from his optimal strategy. Hence,

$$\left| \int p_1(x, y)\, d\mu_1\, d\nu_1 - \int p_2(x, y)\, d\mu_2\, d\nu_2 \right| \le \left| \int \big(p_1(x, y) - p_2(x, y)\big)\, d\mu_1\, d\nu_2 \right| \le \max_{x, y \in [0,1]} |p_1(x, y) - p_2(x, y)| \int d\mu_1\, d\nu_2.$$

Note that the quantity on the right is bounded because we are considering the maximum of a bounded continuous function on a compact set.

Let $\alpha \in \mathbb{R}^S$. Given a polynomial game with payoff functions $r(s, a_1, a_2)$ and transition probabilities $p(t; s, a_1, a_2)$ (sometimes we will suppress the state indices and write the entire matrix as $P(a_1, a_2)$), fix a state $s$ and define the polynomial

$$G_s(\alpha) = r(s, a_1, a_2) + \beta \sum_{t \in S} p(t; s, a_1, a_2)\, \alpha_t.$$

We will perform iterations using this vector $\alpha \in \mathbb{R}^S$. We call the iterates $\alpha^k \in \mathbb{R}^S$ ($k$ being the iteration index), and denote the $s$th component of this vector by $\alpha_s^k$. Pick the vector $\alpha^0 \in \mathbb{R}^S$ arbitrarily and define the recursion for the $s$th component at iteration $k$ by

$$\alpha_s^k = \mathrm{val}(G_s(\alpha^{k-1})), \qquad k = 1, 2, \ldots$$

Rephrasing the above in terms of operators, define $T_s$ to be the operator such that $T_s \alpha = \mathrm{val}(G_s(\alpha))$, and let $T\alpha = [T_1 \alpha, \ldots, T_S \alpha]^T$. Then the recursion simply consists of computing the iterates $T^k(\alpha)$.

Lemma 2: The limit $\lim_{k \to \infty} T^k(\alpha) = \phi$ exists and is independent of $\alpha$. Moreover, $\phi$ is the unique fixed-point solution of the equation $\phi = T\phi$.

Proof: For $\alpha \in \mathbb{R}^S$ define the norm $\|\alpha\| = \max_s |\alpha_s|$.
Then,

$$\|T\gamma - T\alpha\| = \max_s |\mathrm{val}(G_s(\gamma)) - \mathrm{val}(G_s(\alpha))| \le \max_s \max_{a_1, a_2 \in [0,1]} \Big| \beta \sum_t p(t; s, a_1, a_2)(\gamma_t - \alpha_t) \Big| \quad \text{(using Lemma 1)}$$
$$\le \max_s \max_{a_1, a_2 \in [0,1]} \Big| \beta \sum_t p(t; s, a_1, a_2) \Big| \max_t |\gamma_t - \alpha_t| = \beta \|\gamma - \alpha\|.$$

Since the discount factor $\beta < 1$, we have a contraction, and by the contraction mapping principle the iteration $T^k \alpha$ converges to the unique fixed point of the equation $T\phi = \phi$.

Lemma 2 establishes that a fixed-point solution of the iteration exists. We now show that the fixed point is in fact the value vector of the game. To show this, we show that if we compute the optimal strategies $\mu(s), \nu(s)$ of the games $G_s(\phi)$, $s = 1, 2, \ldots, S$, then play according to these strategies achieves the value vector $\phi$. Since $\phi$ by definition satisfies the saddle-point inequality (1), an equilibrium solution exists. To show that the value vector is unique, we show that any value vector satisfies the fixed-point equation $T v_\beta = v_\beta$. Since there is a unique fixed point by Lemma 2, the value vector must be unique.

Theorem 1: Let $\phi$ be the fixed point defined in Lemma 2. Then:
a. Let $\mu(s), \nu(s)$ denote the optimal measures of the polynomial game with payoff $G_s(\phi)$, $s \in \{1, \ldots, S\}$. Then $\mu = [\mu(1), \ldots, \mu(S)]^T$, $\nu = [\nu(1), \ldots, \nu(S)]^T$ are the optimal strategies for the stochastic game.
b. If $v_\beta(\mu, \nu)$ is a value vector for the game then $v_\beta$ satisfies $T v_\beta = v_\beta$. Hence $v_\beta = \phi$ exists and is unique.

Proof: Let $\mu(s)$ and $\nu(s)$ be the optimal strategies for the game $G_s(\phi)$. Then by definition, the expected value of play under these strategies is $\phi_s = T_s \phi = \cdots = T_s^k \phi$. Vectorizing this equation, we note that

$$\phi = T^k \phi = \mathbb{E}_{\mu, \nu}\left[ r(a_1, a_2) + \beta P(a_1, a_2) r(a_1, a_2) + \cdots + \beta^{k-1} P^{k-1}(a_1, a_2) r(a_1, a_2) + \beta^k P^k(a_1, a_2)\, \phi \right].$$
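The contraction argument above can be watched in action numerically. As an illustrative simplification (ours, not the paper's), restrict each state to two pure actions so that $\mathrm{val}(G_s(\alpha))$ is the value of a 2x2 zero-sum matrix game, which has a closed form; the payoff matrices and transition probabilities below are made up for the demonstration:

```python
beta = 0.6

def val2x2(M):
    """Value of a 2x2 zero-sum matrix game (row player maximizes)."""
    lo = max(min(row) for row in M)                     # best pure maximin
    hi = min(max(M[0][j], M[1][j]) for j in range(2))   # best pure minimax
    if lo == hi:                                        # pure saddle point
        return lo
    (a, b), (c, d) = M
    return (a * d - b * c) / (a + d - b - c)            # mixed-strategy value

# Hypothetical two-state data: payoffs and P(next state = 1 | s, a1, a2).
r = {1: [[1, -1], [-1, 1]], 2: [[3, -1], [-2, 2]]}
p1 = {1: [[0.8, 0.2], [0.4, 0.6]], 2: [[0.5, 0.9], [0.1, 0.3]]}

def T(alpha):
    """One step of the Shapley operator: alpha_s <- val(G_s(alpha))."""
    out = []
    for s in (1, 2):
        G = [[r[s][i][j] + beta * (p1[s][i][j] * alpha[0]
                                   + (1 - p1[s][i][j]) * alpha[1])
              for j in range(2)] for i in range(2)]
        out.append(val2x2(G))
    return out

alpha = [0.0, 0.0]
for _ in range(200):
    alpha = T(alpha)
print(alpha)  # approximate value vector phi
```

Each application of $T$ shrinks distances by at least the factor $\beta$, so 200 iterations leave the iterate numerically indistinguishable from the fixed point $\phi$.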
Taking the limit as $k \to \infty$, we obtain

$$\phi = \mathbb{E}_{\mu, \nu}\left[ \sum_{k=0}^{\infty} \beta^k P^k(a_1, a_2)\, r(a_1, a_2) \right] = v_\beta(\mu, \nu).$$

Hence playing according to the stationary strategies $\mu(s), \nu(s)$, $s = 1, \ldots, S$, achieves the value vector $\phi$. Suppose player 1 plays according to the strategy $\mu$, and suppose player 2 deviates from the prescribed stationary strategy $\nu$ to a stationary strategy $\nu'$. Then, since $\mu, \nu$ are by definition equilibrium strategies for the games $G_s(\phi)$, we have the (vector) inequality for all $\nu'$:

$$\phi = \mathbb{E}_{\mu, \nu}[r(a_1, a_2) + \beta P(a_1, a_2)\, \phi] \le \mathbb{E}_{\mu, \nu'}[r(a_1, a_2) + \beta P(a_1, a_2)\, \phi] \le \mathbb{E}_{\mu, \nu'}[r(a_1, a_2) + \beta P(a_1, a_2) r(a_1, a_2) + \beta^2 P^2(a_1, a_2)\, \phi] \le \cdots \le \mathbb{E}_{\mu, \nu'}[r(a_1, a_2) + \beta P(a_1, a_2) r(a_1, a_2) + \cdots + \beta^{k-1} P^{k-1}(a_1, a_2) r(a_1, a_2) + \beta^k P^k(a_1, a_2)\, \phi].$$

In the first inequality a $\phi$ occurs on the right side; substituting the inequality into that $\phi$ yields the second inequality, and so on. Finally, we obtain the inequality

$$\phi = \mathbb{E}_{\mu, \nu}\left[ \sum_{k=0}^{\infty} \beta^k P^k(a_1, a_2)\, r(a_1, a_2) \right] \le \mathbb{E}_{\mu, \nu'}\left[ \sum_{k=0}^{\infty} \beta^k P^k(a_1, a_2)\, r(a_1, a_2) \right],$$

i.e., $\phi = v_\beta(\mu, \nu) \le v_\beta(\mu, \nu')$ for all $\nu'$. A similar argument for deviations $\mu'$ of player 1 shows that $v_\beta(\mu', \nu) \le v_\beta(\mu, \nu) = \phi$. Hence $\mu(s), \nu(s)$, constructed as the strategies for the games $G_s(\phi)$, satisfy the saddle-point inequality (1) componentwise. This establishes the existence of equilibria. For uniqueness, note that for any strategies $\mu, \nu$ such that $v_\beta(\mu, \nu)$ satisfies the saddle-point inequality (1), by definition we have $T v_\beta(\mu, \nu) = v_\beta(\mu, \nu)$. Since $T$ has a unique fixed point, the vector $v_\beta(\mu, \nu)$ must be unique.

It is interesting to note that the above proof also provides an algorithm to compute approximate equilibria.
To compute each iterate $T_s(\alpha)$ one needs to solve a polynomial game in normal form (which can be done by solving a single semidefinite program), and by solving a sequence of such problems one can compute $T^k(\alpha)$, which is provably close to the actual value vector. However, the rate of convergence of this iteration is not very attractive. In the rest of this paper, we focus attention on single-controller games, for which equilibria can be computed by solving a single semidefinite program.

D. SDP Characterization of Nonnegativity and Moments

Let $A$ be a closed interval on the real line. The set of univariate polynomials which are nonnegative on $A$ has an exact semidefinite description. The set of (finite) vectors in $\mathbb{R}^n$ which correspond to moment sequences of measures supported on $A$ also has an exact semidefinite description. We briefly review these notions here and introduce some related notation [6].

Let $\mathbb{R}[x]$ denote the set of univariate polynomials with real coefficients. Let $p(x) = \sum_{k=0}^n p_k x^k \in \mathbb{R}[x]$. We say that $p(x)$ is nonnegative on $A$ if $p(x) \ge 0$ for every $x \in A$. We denote the set of polynomials of degree $n$ which are nonnegative on $A$ by $P(A)$. (To avoid cumbersome notation, we exclude the degree information from the notation; the degree will usually be clear from the context.) The polynomial $p(x)$ is said to be a sum of squares (SOS) if there exist polynomials $q_1(x), \ldots, q_k(x)$ such that $p(x) = \sum_{i=1}^k q_i(x)^2$. It is well known that a univariate polynomial is a sum of squares if and only if $p(x) \in P(\mathbb{R})$.

Let $\mu$ denote a measure supported on the set $A$. The $i$th moment of the measure $\mu$ is denoted by

$$\mu_i = \int_A x^i\, d\mu.$$

Let $\bar{\mu} = [\mu_0, \ldots, \mu_n]$ be a vector in $\mathbb{R}^{n+1}$.
We say that $\bar{\mu}$ is a moment sequence of length $n+1$ if it corresponds to the first $n+1$ moments of some nonnegative measure $\mu$ supported on the set $A$. The moment space, denoted by $M(A)$, is the subset of $\mathbb{R}^{n+1}$ which corresponds to moments of nonnegative measures supported on the set $A$. We say that a nonnegative measure $\mu$ is a probability measure if its zeroth-order moment satisfies $\mu_0 = 1$. The set of moment sequences of length $n+1$ corresponding to probability measures is denoted by $M_P(A)$.

Let $\mathcal{S}^n$ denote the set of $n \times n$ symmetric matrices, and define the linear operator $H : \mathbb{R}^{2n-1} \to \mathcal{S}^n$ as

$$H : \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_{2n-1} \end{bmatrix} \mapsto \begin{bmatrix} a_1 & a_2 & \cdots & a_n \\ a_2 & a_3 & \cdots & a_{n+1} \\ \vdots & \vdots & \ddots & \vdots \\ a_n & a_{n+1} & \cdots & a_{2n-1} \end{bmatrix}.$$

Thus $H$ is simply the linear operator that takes a vector and constructs the associated Hankel matrix, which is constant along the antidiagonals. We will also frequently use the adjoint of this operator, the linear map $H^* : \mathcal{S}^n \to \mathbb{R}^{2n-1}$:

$$H^* : \begin{bmatrix} m_{11} & m_{12} & \cdots & m_{1n} \\ m_{12} & m_{22} & \cdots & m_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ m_{1n} & m_{2n} & \cdots & m_{nn} \end{bmatrix} \mapsto \begin{bmatrix} m_{11} \\ 2m_{12} \\ m_{22} + 2m_{13} \\ \vdots \\ m_{nn} \end{bmatrix}.$$

This map flattens a matrix into a vector by adding all the entries along antidiagonals.

Lemma 3: Let $p(x) = \sum_{k=0}^{2n} p_k x^k$ be a polynomial, and let $\bar{p} = [p_0, \ldots, p_{2n}]^T$ be the vector of its coefficients. Then $p(x)$ is nonnegative (or SOS) if and only if there exists $S \in \mathcal{S}^{n+1}$, $S \succeq 0$, such that $\bar{p} = H^*(S)$.

Proof: For univariate polynomials, nonnegativity is equivalent to SOS (see [9]). Let $[x]_n = [1, x, \ldots, x^n]^T$. For every $S \in \mathcal{S}^{n+1}$ we have

$$p(x) = \bar{p}^T [x]_{2n} = H^*(S)^T [x]_{2n} = [x]_n^T S\, [x]_n.$$

Factoring $S \succeq 0$, we obtain a sum-of-squares decomposition. The converse is immediate.
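A small sketch of the operators $H$ and $H^*$ makes Lemma 3 concrete. The example data are ours, not the paper's: $p(x) = (x-1)^2 = 1 - 2x + x^2$ is SOS, witnessed by the positive semidefinite Gram matrix $S = [[1,-1],[-1,1]]$:

```python
def hankel(a):
    """H: vector of length 2n-1 -> n x n Hankel matrix (constant antidiagonals)."""
    n = (len(a) + 1) // 2
    return [[a[i + j] for j in range(n)] for i in range(n)]

def hankel_adjoint(M):
    """H*: n x n symmetric matrix -> length 2n-1 vector of antidiagonal sums."""
    n = len(M)
    out = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            out[i + j] += M[i][j]
    return out

S = [[1, -1], [-1, 1]]   # PSD Gram matrix (eigenvalues 0 and 2)
p = hankel_adjoint(S)    # coefficient vector [p0, p1, p2]
print(p)                 # [1, -2, 1], i.e. p(x) = (x - 1)^2

# Check the identity p(x) = [1, x] S [1, x]^T at a few sample points.
for x in (-1.0, 0.0, 0.5, 2.0):
    quad = sum(S[i][j] * x ** i * x ** j for i in range(2) for j in range(2))
    poly = sum(p[k] * x ** k for k in range(3))
    assert abs(quad - poly) < 1e-12
```

Note that `hankel_adjoint` reproduces the $[m_{11},\, 2m_{12},\, m_{22} + 2m_{13},\, \ldots]$ pattern above, since a symmetric matrix contributes both $m_{ij}$ and $m_{ji}$ to each antidiagonal sum.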
One can give a similar semidefinite characterization of polynomials that are nonnegative on an interval. Since in this paper we typically consider the interval $[0, 1]$, we give an explicit semidefinite characterization of $P([0, 1])$. We define the following matrices:

$$L_1 = \begin{bmatrix} I_{n \times n} \\ 0_{1 \times n} \end{bmatrix}, \qquad L_2 = \begin{bmatrix} 0_{1 \times n} \\ I_{n \times n} \end{bmatrix},$$

where $I_{n \times n}$ stands for the $n \times n$ identity matrix.

Lemma 4: The polynomial $p(x) = \sum_{k=0}^{2n} p_k x^k$ is nonnegative on $[0, 1]$ if and only if there exist matrices $Z \in \mathcal{S}^{n+1}$ and $W \in \mathcal{S}^n$, $Z \succeq 0$, $W \succeq 0$, such that

$$\begin{bmatrix} p_0 \\ \vdots \\ p_{2n} \end{bmatrix} = H^*\!\left( Z + \tfrac{1}{2}\big(L_1 W L_2^T + L_2 W L_1^T\big) - L_2 W L_2^T \right).$$

Proof: The proof follows from the characterization of nonnegative polynomials on intervals. It is well known that

$$p(x) \ge 0 \ \ \forall x \in [0, 1] \iff p(x) = z(x) + x(1 - x)\, w(x),$$

where $z(x)$ and $w(x)$ are sums of squares. A simple application of Lemma 3 yields the required condition.

In this paper, we will also be using a very important classical result about the semidefinite representation of moment spaces [10], [11]. We give an explicit characterization of $M([0, 1])$ and $M_P([0, 1])$.

Lemma 5: The vector $\bar{\mu} = [\mu_0, \mu_1, \ldots, \mu_{2n}]^T$ is a valid set of moments for a nonnegative measure supported on $[0, 1]$ if and only if

$$H(\bar{\mu}) \succeq 0, \qquad \tfrac{1}{2}\big(L_1^T H(\bar{\mu}) L_2 + L_2^T H(\bar{\mu}) L_1\big) - L_2^T H(\bar{\mu}) L_2 \succeq 0. \qquad (2)$$

Moreover, it is a moment sequence corresponding to a probability measure if and only if, in addition to (2), it satisfies $\mu_0 = 1$.

Proof: The proof follows by dualizing Lemma 4. Alternatively, a direct proof may be found in [10].

For example, for $2n = 2$ the sequence $[\mu_0, \mu_1, \mu_2]$ is a moment sequence corresponding to a measure supported on $[0, 1]$ if and only if the following inequalities hold:

$$\begin{bmatrix} \mu_0 & \mu_1 \\ \mu_1 & \mu_2 \end{bmatrix} \succeq 0, \qquad \mu_1 - \mu_2 \ge 0.$$
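The $2n = 2$ case of Lemma 5 is easy to check numerically. The test vectors below are illustrative choices of ours: $[1, 1/2, 1/3]$ is the moment triple of the uniform probability measure on $[0,1]$ and must pass, while a vector whose Hankel matrix is indefinite must fail:

```python
def is_moment_triple(m0, m1, m2):
    """Lemma 5 for 2n = 2: [[m0, m1], [m1, m2]] PSD and m1 - m2 >= 0."""
    # A 2x2 symmetric matrix is PSD iff both diagonal entries and the
    # determinant are nonnegative.
    hankel_psd = m0 >= 0 and m2 >= 0 and m0 * m2 - m1 * m1 >= 0
    return hankel_psd and (m1 - m2 >= 0)

print(is_moment_triple(1, 0.5, 1 / 3))  # True: uniform measure on [0,1]
print(is_moment_triple(1, 0.9, 0.5))    # False: det = 0.5 - 0.81 < 0
```

The first condition is the classical Hankel moment-matrix constraint; the second localizes the support to $[0,1]$ (for a point mass at $x$ it reads $x - x^2 \ge 0$).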
III. FINITE STRATEGY CASE

For the reader's convenience and for comparison purposes, we briefly review here the case where each player has only finitely many strategies at each state [3]. Again, for simplicity we assume that the set of pure strategies available to each player at each state is identical, so that $A_1 = A_2 = \{1, \ldots, m\}$. In the finite-strategy case, when Assumption SC holds, a minimax solution may be computed via linear programming. We state the linear program in this section. In the next section, drawing motivation from this linear program, we write an infinite-dimensional optimization problem for the case where each player has a choice from infinitely many pure strategies.

The finite-action game is completely defined via the specification of the following data:
1) The state space $S = \{1, \ldots, S\}$.
2) The (finite) sets of actions for players 1 and 2, given by $A_1 = A_2 = \{1, \ldots, m\}$.
3) The payoff function for a given state $s$ (representable by a matrix indexed by the actions of the players), denoted by $r(s, a_1, a_2)$.
4) The probability transition matrix $p(s'; s, a_1)$, which gives the conditional probability of transition from state $s$ to $s'$ given player 1's action $a_1$.
5) The discount factor $\beta$.

A mixed strategy for player 1 is a function $f : S \times A_1 \to [0, 1]$ subject to the normalization constraint $\sum_{a_1} f(s, a_1) = 1$ for each $s \in S$ (so that $f(s) = [f(s, 1), \ldots, f(s, m)]$ becomes a probability distribution over the strategy space $A_1$). Similarly, the mixed strategy of player 2 in a particular state $s$ is given by $g(s) = [g(s, 1), \ldots, g(s, m)]$. The collections of mixed strategies (indexed by the states) will be denoted by $f = [f(1), \ldots, f(S)]$ and $g = [g(1), \ldots, g(S)]$, respectively. A strategy $f$ leads to a probability transition matrix $P(f)$ with entries $P_{ss'}(f) = \sum_{a_1 \in A_1} p(s'; s, a_1) f(s, a_1)$.
Again we consider a β-discounted process over an infinite horizon. Given strategies f and g, the reward collected by player 1 in some stage s is given by:

r(s, f(s), g(s)) = Σ_{a_1 ∈ A_1, a_2 ∈ A_2} r(s, a_1, a_2) f(s, a_1) g(s, a_2).

The reward collected over the infinite horizon starting at state s, v_β(s, f(s), g(s)), is given by the system of equations:

v_β(s, f(s), g(s)) = r(s, f(s), g(s)) + β Σ_{s' ∈ S} ( Σ_{a_1 ∈ A_1} p(s'; s, a_1) f(s, a_1) ) v_β(s', f(s'), g(s')).

Thus, v_β(f, g) = (I − βP(f))^{-1} r(f, g), where r(f, g) = [r(1, f(1), g(1)), ..., r(S, f(S), g(S))] ∈ R^S. The problem is to find equilibrium strategies f_0 and g_0 that satisfy the Nash equilibrium property:

v_β(f, g_0) ≤ v_β(f_0, g_0) ≤ v_β(f_0, g)   (3)

for all mixed strategies f, g.

Theorem 2 ([3]): Consider the primal-dual pair of linear programs:

minimize over g(s, a_2), v(s):  Σ_{s=1}^S v(s)
subject to
  v(s) ≥ Σ_{a_2 ∈ A_2} r(s, a_1, a_2) g(s, a_2) + β Σ_{s'=1}^S p(s'; s, a_1) v(s')   ∀ s ∈ S, a_1 ∈ A_1
  Σ_{a_2 ∈ A_2} g(s, a_2) = 1   ∀ s ∈ S
  g(s, a_2) ≥ 0   ∀ s ∈ S, a_2 ∈ A_2   (P)

and

maximize over x(s, a_1), z(s):  Σ_{s=1}^S z(s)
subject to
  Σ_{s=1}^S Σ_{a_1 ∈ A_1} [ δ(s, s') − β p(s'; s, a_1) ] x(s, a_1) = 1   ∀ s' ∈ S
  z(s) ≤ Σ_{a_1 ∈ A_1} x(s, a_1) r(s, a_1, a_2)   ∀ s ∈ S, a_2 ∈ A_2
  x(s, a_1) ≥ 0   ∀ s ∈ S, a_1 ∈ A_1.   (D)

Let p* be the optimal value of (P), and d* be the optimal value of (D). Let x*(s, a_1) be the optimal values of the x(s, a_1) variables obtained in (D). Let

f*(s, a_1) = x*(s, a_1) / Σ_{a_1} x*(s, a_1),

and let g*(s, a_2) be the distribution obtained from the optimal solution of (P). Then the following statements hold:
1) p* = d*.
2) Let v* = [v*(1), ..., v*(S)] be the optimal solution of (P). Then v* = v_β(f*, g*).
3) v_β(f*, g*) satisfies the saddle-point inequality (3).

Remark: Note that statement 2 claims that the solution of the LP (P) corresponds to the infinite horizon discounted reward obtained when players 1 and 2 play according to the distributions f* and g*. Statement 3 claims that these distributions are in fact optimal for the two players in the Nash equilibrium sense.

Proof: See [3, p. 93].

Remark: Note that the primal problem (P) has a natural interpretation in terms of security strategies. Feasible vectors v and g satisfy the first set of inequalities in (P); these inequalities can be interpreted to mean that, using strategy g, the payoff of player 2 will be at most v.

IV. INFINITE STRATEGY CASE

A. Problem Setup

In this section we consider the case where each player can choose from uncountably many different actions. In particular, each player can choose actions from the set [0, 1]. The number of states |S| = S is still finite. The payoff function r(s, a_1, a_2) is a polynomial in a_1 and a_2 for each s ∈ S. The single-controller case (Assumption SC) is studied; in this case, we assume that the probability of transition p(s'; s, a_1) is a polynomial in a_1. Again we consider the two-player zero-sum case where player 1 attempts to maximize his reward over the infinite horizon.

We generalize the problem (P) to this case. The variables f and g representing distributions over the finite sets A_1 and A_2 are replaced by measures μ(s) and ν(s). These measures represent mixed strategies over the uncountable action spaces. (We remind the reader that for each player there are S measures, each measure corresponding to a mixed strategy in a particular state. For example, μ(s) corresponds to the mixed strategy player 1 would adopt when the game is in state s.)

B.
Preliminary Results

We point out that the generalization of (P) to this case is an optimization problem involving nonnegativity of a system of univariate polynomials whose coefficients depend on the moments of these measures. The interpretation in terms of security strategies for player 2 still holds. The following is the generalization of the linear program (P) mentioned above:

minimize over ν(s), v(s):  Σ_{s=1}^S v(s)
subject to
  (a) v(s) ≥ ∫_{A_2} r(s, a_1, a_2) dν(s) + β Σ_{s'=1}^S p(s'; s, a_1) v(s')   for all s ∈ S, a_1 ∈ A_1
  (b) ν(s) is a measure supported on A_2   for all s ∈ S.

Since ∫ r(s, a_1, a_2) dν(s) = q_ν(s, a_1), a univariate polynomial in a_1 for each s ∈ S, the constraints (a) are, for a fixed vector v(s), a system of polynomial inequalities. Note that the coefficients of q depend on the measure ν only via finitely many moments. More concretely, let r(s, a_1, a_2) = Σ_{i,j}^{n_s, m_s} r_{ij}(s) a_1^i a_2^j be the payoff polynomial. Then ∫ r(s, a_1, a_2) dν(s) = Σ_{i,j} r_{ij}(s) a_1^i ν_j(s). Using this observation, the problem may be rewritten as follows:

minimize over ν̄(s), v(s):  Σ_{s=1}^S v(s)
subject to
  (c) v(s) − Σ_{i,j} r_{ij}(s) a_1^i ν_j(s) − β Σ_{s'=1}^S p(s'; s, a_1) v(s') ∈ P(A_1)   for all s ∈ S
  (d) ν̄(s) ∈ M(A_2), and ν_0(s) = 1   for all s ∈ S.   (P′)

The constraints (c) give a system of polynomial inequalities in a_1, one inequality per state. Fix some state s, and let the degree of the inequality for that state be d_s. Let [a_1]_{d_s} = [1, a_1, a_1^2, ..., a_1^{d_s}]^T. The first term in constraint (c) can be rewritten in vector form as:

Σ_{i,j} r_{ij}(s) a_1^i ν_j(s) = ν̄(s)^T R(s)^T [a_1]_{d_s},

where R(s) is a matrix that contains the coefficients of the polynomial r(s, a_1, a_2).
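For concreteness, the construction of R(s) can be illustrated on the payoff r(s, a_1, a_2) = (a_1 − a_2)^2 used later in Section V. The convention R_{ij} = r_{ij}(s), with i indexing powers of a_1 and j powers of a_2, is our reading of the identity above; the sketch verifies it against a point-mass ν:

```python
import numpy as np

# Payoff r(a1, a2) = (a1 - a2)^2 = a1^2 - 2*a1*a2 + a2^2, so the nonzero
# coefficients are r_20 = 1, r_11 = -2, r_02 = 1. Store them as R[i, j] = r_ij
# (i = power of a1, j = power of a2); this indexing convention is an assumption.
R = np.zeros((3, 3))
R[2, 0], R[1, 1], R[0, 2] = 1.0, -2.0, 1.0

# Moments of a point mass at a2 = c: nu_j = c^j.
c = 0.3
nu_bar = np.array([1.0, c, c**2])

# The bilinear form nu_bar^T R^T [a1]_d should reproduce
# the integral of r(a1, a2) dnu = (a1 - c)^2 for every a1.
for a1 in [0.0, 0.25, 1.0]:
    basis = np.array([1.0, a1, a1**2])
    assert abs(nu_bar @ R.T @ basis - (a1 - c) ** 2) < 1e-12
print("R(s) convention check passed")
```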
Similar to the finite strategy case, we define a vector v* = [v*(1), ..., v*(S)]^T, which will turn out to be the value vector of the stochastic game (indexed by the state). The second term in constraint (c), which depends on the probability transition p(s'; s, a_1), is also a polynomial in a_1 whose coefficients depend on the coefficients of p(s'; s, a_1) and v. Specifically,

Σ_{s'=1}^S p(s'; s, a_1) v(s') = v^T Q(s)^T [a_1]_{d_s},

for some matrix Q(s) which contains the coefficients of p(s'; s, a_1).

Lemma 6: Let A_1 = A_2 = [0, 1]. Let E_s ∈ R^{d_s × S} be the matrix which has a 1 in the (1, s) position. Then the semidefinite program (SP) given by:

minimize over ν̄(s), v(s):  Σ_{s=1}^S v(s)
subject to
  (e) H*( Z_s + (1/2)(L_1 W_s L_2^T + L_2 W_s L_1^T) − L_2 W_s L_2^T ) = E_s v − β Q(s) v − R(s) ν̄(s)   ∀ s ∈ S
  (f) H(ν̄(s)) ⪰ 0   ∀ s ∈ S
  (g) (1/2)( L_1^T H(ν̄(s)) L_2 + L_2^T H(ν̄(s)) L_1 ) − L_2^T H(ν̄(s)) L_2 ⪰ 0   ∀ s ∈ S
  (h) e_1^T ν̄(s) = 1   ∀ s ∈ S
  (i) Z_s, W_s ⪰ 0   ∀ s ∈ S   (SP)

exactly solves the polynomial optimization problem (P′).

Proof: The polynomial inequality (c) has the coefficient vector E_s v − β Q(s) v − R(s) ν̄(s). The proof then follows as a direct consequence of Lemma 4, concerning the semidefinite representation of polynomials nonnegative over [0, 1], and Lemma 5, concerning the semidefinite representation of moment sequences of nonnegative measures supported on [0, 1].
The dual of (SP) is given by the following semidefinite program:

maximize over α(s), ξ̄(s):  Σ_{s=1}^S α(s)
subject to
  (j) H*( A_s + (1/2)(L_1 B_s L_2^T + L_2 B_s L_1^T) − L_2 B_s L_2^T ) = R(s)^T ξ̄(s) − α(s) e_1   ∀ s ∈ S
  (k) H(ξ̄(s)) ⪰ 0   ∀ s ∈ S
  (l) (1/2)( L_1^T H(ξ̄(s)) L_2 + L_2^T H(ξ̄(s)) L_1 ) − L_2^T H(ξ̄(s)) L_2 ⪰ 0   ∀ s ∈ S
  Σ_s (E_s − β Q(s))^T ξ̄(s) = 1
  (m) A_s, B_s ⪰ 0   ∀ s ∈ S.   (SD)

Lemma 7: The dual SDP (SD) is equivalent to the following polynomial optimization problem:

maximize over α(s), ξ̄(s):  Σ_{s=1}^S α(s)
subject to
  (n) Σ_{i,j} r_{ij}(s) ξ_i(s) a_2^j − α(s) ≥ 0   ∀ a_2 ∈ A_2, s ∈ S
  (o) ξ̄(s) ∈ M(A_1)   ∀ s ∈ S
  (p) Σ_s ∫_{A_1} ( δ(s, s') − β p(s'; s, a_1) ) dξ(s) = 1   ∀ s' ∈ S.   (D′)

Proof: This again follows as a consequence of Lemmas 4 and 5.

Remark: Note that in the dual problem, the moment sequences do not necessarily correspond to probability measures. Hence, to convert them to probability measures, one needs to normalize them. Upon normalization, one obtains the optimal strategy for player 1.

Lemma 8: The polynomial optimization problems (P′) and (D′) are strong duals of each other.

Proof: We prove this by showing that the semidefinite program (SP) satisfies Slater's constraint qualification and that it is bounded from below. The result then follows from the strong duality of the equivalent semidefinite programs (SP) and (SD). First pick μ(s) and ν(s) to be the uniform distribution on [0, 1] for each state s ∈ S. One can show [10] that the moment sequence of μ is in the interior of the moment space of [0, 1]. As a consequence, constraints (f) and (g) are strictly positive definite. Using the strategies μ and ν, evaluate the discounted value of this pair of strategies as v_β(μ, ν) = (I − βP(μ))^{-1} r(μ, ν). Choose v > v_β.
The polynomial inequalities given by (c) are then all strictly positive, and thus constraints (i) can be satisfied with strictly positive definite Z_s, W_s. The equality constraints are trivially satisfied. To prove that the problem is bounded below, we note that r(s, a_1, a_2) is a polynomial and that the strategy spaces of both players are bounded. Hence inf_{a_1 ∈ A_1, a_2 ∈ A_2} r(s, a_1, a_2) is finite and provides a trivial lower bound for v(s).

Lemma 9: Let ν̄*(s) and ξ̄*(s) be optimal moment sequences for (P′) and (D′), respectively, and let ν*(s) and ξ*(s) be the corresponding measures, supported on A_2 and A_1 respectively. The following complementary slackness results hold for the optima of (P′) and (D′):

v*(s) ∫_{A_1} dξ*(s) = ∫_{A_2} ∫_{A_1} r(s, a_1, a_2) dξ*(s) dν*(s) + β Σ_{s'} v*(s') ∫_{A_1} p(s'; s, a_1) dξ*(s)   ∀ s ∈ S   (4)

α*(s) ∫_{A_2} dν*(s) = ∫_{A_2} ∫_{A_1} r(s, a_1, a_2) dξ*(s) dν*(s)   ∀ s ∈ S.   (5)

Proof: The result follows from the strong duality of the equivalent semidefinite representations of the primal-dual pair (P′)-(D′). The Lagrangian function for (P′) is given by:

L(ξ, α) = inf_{v, ν} { Σ_{s=1}^S v(s) − Σ_s ∫_{A_1} [ v(s) − ∫_{A_2} r(s, a_1, a_2) dν(s) − β Σ_{s'} v(s') p(s'; s, a_1) ] dξ(s) + Σ_s α(s)(1 − ν_0(s)) }.

L(ξ, α) must satisfy weak duality, i.e., d* ≤ p*. At optimality, p* = Σ_s v*(s) for some vector v*. However, strong duality holds, i.e., p* = d*. This forces the first complementary slackness relation; the second relation is obtained similarly by considering the Lagrangian of the dual problem.

We have shown that problem (P′) can be reduced to the semidefinite program (SP), and is thus computationally tractable via convex optimization algorithms. We next show that the solution to problem (P′) is in fact the desired equilibrium solution.

C.
Main Theorem

Let p* be the optimal value of (P′), and d* be the optimal value of (D′). Let ν*(s) and ξ*(s) be the optimal measures recovered from (P′) and (D′). Let

μ*(s) = ξ*(s) / ∫_{A_1} dξ*(s),

so that μ* is a normalized version of ξ* (i.e., μ* is a probability measure). Let v* be the vector obtained as the optimal solution of (P′).

Theorem 3: The optimal solutions to the primal-dual pair (P′), (D′) satisfy the following:
1) p* = d*.
2) v* = v_β(μ*, ν*).
3) v_β(μ*, ν*) satisfies the saddle-point inequality:

v_β(μ, ν*) ≤ v_β(μ*, ν*) ≤ v_β(μ*, ν)   (6)

for all mixed strategies μ, ν.

Proof:
1) Follows from the strong duality of the primal-dual pair (P′)-(D′).
2) Using equation (4) of Lemma 9 in normalized form (i.e., dividing throughout by ξ_0*(s), the zeroth-order moment of the measure ξ*(s)), we obtain

v*(s) = ∫_{A_2} ∫_{A_1} r(s, a_1, a_2) dμ*(s) dν*(s) + β Σ_{s'} v*(s') ∫_{A_1} p(s'; s, a_1) dμ*(s)   ∀ s ∈ S.

Upon simplification and vectorization of v*(s), one obtains v* = r(μ*, ν*) + βP(μ*) v*. Using a Bellman equation argument, or by simply iterating this equation (i.e., substituting repeatedly for v*), it is easy to see that v* = v_β(μ*, ν*).
3) Consider inequality (c) at the optimal solution. We have for every state s:

v*(s) ≥ ∫_{A_2} r(s, a_1, a_2) dν*(s) + β Σ_{s'=1}^S p(s'; s, a_1) v*(s').

Integrating with respect to an arbitrary probability measure μ(s) (with support on A_1), we get:

v*(s) ≥ ∫_{A_2} ∫_{A_1} r(s, a_1, a_2) dμ(s) dν*(s) + β Σ_{s'=1}^S ∫_{A_1} p(s'; s, a_1) v*(s') dμ(s).

Thus, v*(s) ≥ r(s, μ(s), ν*(s)) + β Σ_{s'=1}^S ∫_{A_1} p(s'; s, a_1) v*(s') dμ(s). Iterating this inequality, we obtain v_β(μ*, ν*) = v* ≥ v_β(μ, ν*) for every strategy μ.
This completes one side of the saddle-point inequality. Using the normalized version of equation (5), we get:

α*(s) / ξ_0*(s) = ∫_{A_2} ∫_{A_1} r(s, a_1, a_2) dμ*(s) dν*(s) = r(s, μ*(s), ν*(s)).

If we integrate inequality (n) in problem (D′) with respect to an arbitrary probability measure ν(s) with support on A_2, we obtain α*(s) / ξ_0*(s) ≤ r(s, μ*(s), ν(s)). Thus r(s, μ*(s), ν*(s)) ≤ r(s, μ*(s), ν(s)) for every s. Multiplying throughout by (I − βP(μ*))^{-1}, we get v_β(μ*, ν*) ≤ v_β(μ*, ν). This completes the other side of the saddle-point inequality.

D. Obtaining the measures

Solutions to the semidefinite programs (SP) and (SD) provide the moment sequences corresponding to optimal strategies; additional computation is required to recover the actual measures. We briefly describe a classical procedure to recover the measures using linear algebra. For more details, the reader may refer to [11], [12].

Let μ̄ ∈ R^{2n} be a given moment sequence. We wish to find a nonnegative measure μ supported on the real line with these moments. The resulting measure will be composed of finitely many atoms (i.e., a discrete measure) of the form Σ w_i δ(x − a_i), where Prob(x = a_i) = w_i for all i. Construct the following linear system:

[ μ_0      μ_1   ...  μ_{n−1}  ]   [ c_0     ]       [ μ_n      ]
[ μ_1      μ_2   ...  μ_n      ]   [ c_1     ]       [ μ_{n+1}  ]
[ ...      ...   ...  ...      ] · [ ...     ]  = − [ ...      ]
[ μ_{n−1}  μ_n   ...  μ_{2n−2} ]   [ c_{n−1} ]       [ μ_{2n−1} ]

Note that the Hankel matrix that appears on the left-hand side is a sub-matrix of H(μ̄). We assume without loss of generality that this Hankel matrix is strictly positive definite.
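Anticipating the remaining steps of the procedure described below (root extraction and the Vandermonde system for the weights), the whole recovery can be sketched in numpy; the input moments come from a hypothetical two-atom measure of our own choosing:

```python
import numpy as np

# Hypothetical discrete measure: 0.3*delta(x - 0.2) + 0.7*delta(x - 0.8).
atoms, weights = np.array([0.2, 0.8]), np.array([0.3, 0.7])
n = len(atoms)
mu = np.array([weights @ atoms**j for j in range(2 * n)])  # moments mu_0..mu_{2n-1}

# Solve the Hankel system for the coefficients c_0, ..., c_{n-1}.
H = np.array([[mu[i + j] for j in range(n)] for i in range(n)])
c = np.linalg.solve(H, -mu[n:2 * n])

# The atoms are the roots of x^n + c_{n-1} x^{n-1} + ... + c_1 x + c_0 = 0.
x = np.sort(np.roots(np.concatenate(([1.0], c[::-1]))).real)

# The weights solve the Vandermonde system sum_i w_i x_i^j = mu_j, 0 <= j <= n-1.
V = np.vander(x, n, increasing=True).T
w = np.linalg.solve(V, mu[:n])

print(x, w)  # recovers the atoms [0.2, 0.8] and weights [0.3, 0.7]
```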
(If the Hankel matrix above is not full rank, construct a smaller k × k linear system of equations by eliminating the last n − k rows and columns of the matrix, so that the k × k submatrix is full rank and therefore strictly positive definite.) By inverting this matrix, we solve for [c_0, ..., c_{n−1}]^T. Let x_i be the roots of the polynomial equation

x^n + c_{n−1} x^{n−1} + ··· + c_1 x + c_0 = 0.

It can be shown that the x_i are all real and distinct, and that they are the support points of the discrete measure. Once the supports are obtained, the weights w_i may be obtained by solving the nonsingular Vandermonde system given by:

Σ_{i=1}^n w_i x_i^j = μ_j   (0 ≤ j ≤ n − 1).

V. EXAMPLE

Consider the two-player discounted stochastic game with β = 0.5, S = {1, 2}, payoff functions r(1, a_1, a_2) = (a_1 − a_2)^2 and r(2, a_1, a_2) = −(a_1 − a_2)^2, and probability transition matrix

P(a_1) = [ a_1        1 − a_1  ;
           1 − a_1^2  a_1^2    ].

Fig. 2. A two-state stochastic game with transition probabilities dependent only on the action of Player 1. The payoffs associated to the states are indicated in the corresponding nodes; the edges are marked by the corresponding state transition probabilities.

Figure 2 graphically illustrates this stochastic game, consisting of two states (the nodes) with polynomial transition probabilities dependent on a_1 (as marked on the edges of the graph). Within the nodes, the payoffs associated to the corresponding states are indicated. To understand this game, consider first the zero-sum (nonstochastic) game with payoff function p(a_1, a_2) = (a_1 − a_2)^2 over the strategy space [0, 1]. This game (called the "guessing game") was studied by Parrilo in [6]. If Player 2 is able to guess the action of Player 1, he can simply imitate his action (i.e.
set a_2 = a_1), and the payoff to Player 1 would be zero (this is the minimum possible, since (a_1 − a_2)^2 ≥ 0). Player 1 would try to confuse Player 2 as much as possible, and thus randomizes between the extreme actions a_1 = 0 and a_1 = 1 with probability 1/2 each. Player 2's best response is to play a_2 = 1/2 with probability 1.

In the game described in Fig. 2, in State 1 Player 1 plays the role of confuser and Player 2 plays the role of guesser. In State 2 the roles of the players are reversed: Player 1 is the guesser and Player 2 the confuser. However, the problem is complicated a bit by the fact that State 1 is advantageous to Player 1, so that at every stage he has an incentive to play a strategy that gives him a good payoff as well as maximizing the chances of transitioning to State 1.

The polynomial optimization problem that computes the minimax strategies and the equilibrium values is the following:

minimize v(1) + v(2)
subject to
  v(1) ≥ ∫ (a_1 − a_2)^2 dν(1) + β( a_1 v(1) + (1 − a_1) v(2) )   ∀ a_1 ∈ [0, 1]
  v(2) ≥ −∫ (a_1 − a_2)^2 dν(2) + β( (1 − a_1^2) v(1) + a_1^2 v(2) )   ∀ a_1 ∈ [0, 1]
  ν(1), ν(2) probability measures supported on [0, 1].

This problem can be reformulated as follows:

minimize v(1) + v(2)
subject to
  v(1) ≥ a_1^2 − 2 a_1 ν_1(1) + ν_2(1) + β( a_1 v(1) + (1 − a_1) v(2) )   ∀ a_1 ∈ [0, 1]
  v(2) ≥ −a_1^2 + 2 a_1 ν_1(2) − ν_2(2) + β( (1 − a_1^2) v(1) + a_1^2 v(2) )   ∀ a_1 ∈ [0, 1]
  [1, ν_1(1), ν_2(1)]^T, [1, ν_1(2), ν_2(2)]^T ∈ M([0, 1]).

Solving the SDP and its dual, we obtain the following optimal cost-to-go and optimal moment sequences:

v* = [0.298, −0.158]^T
μ̄*(1) = [1, 0.614, 0.614]^T,   μ̄*(2) = [1, 0.5, 0.25]^T
ν̄*(1) = [1, 0.614, 0.377]^T,   ν̄*(2) = [1, 0.614, 0.614]^T.
The corresponding measures, obtained as explained in subsection IV-D, are supported at only finitely many points and are given by the following:

μ*(1) = 0.386 δ(a_1) + 0.614 δ(a_1 − 1)
μ*(2) = δ(a_1 − 0.5)
ν*(1) = δ(a_2 − 0.614)
ν*(2) = 0.386 δ(a_2) + 0.614 δ(a_2 − 1).

Consider, for example, play in State 1. If Player 1 were playing obliviously with respect to the state transitions, he would play actions a_1 = 0 and a_1 = 1 with probability one half each. However, to increase the probability of staying in State 1, he plays action 1 with a higher probability. Player 2 cannot affect the state transition probabilities directly, and thus must play a myopic best response (a myopic best response is one that is a best response for the game in the current state). Note that in State 1, once Player 1's strategy is fixed, the (only) best response for Player 2 is to play the action a_2 = 0.614 with probability 1. In State 2, Player 1's best strategy is to play a_1 = 0.5; Player 2 picks an action from his myopic best response set (in this case, all probability distributions supported on the points 0 and 1).

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a technique for solving two-player, zero-sum, finite-state stochastic games with infinite strategy spaces and polynomial payoffs, and we established the existence of equilibria for such games. As a by-product, we obtained an algorithm that converges to the unique value vector of the game (although this algorithm does not seem to have very attractive convergence rates). We focused mainly on the case where the single-controller assumption holds, and showed that the problem can be reduced to solving a system of univariate polynomial inequalities and moment constraints.
We used techniques from the classical theory of moments and sums of squares to reduce the problem to a semidefinite programming problem. By solving a primal-dual pair of semidefinite programs, we obtained minimax equilibria and optimal strategies for the players.

It is known that finite-state, finite-action, two-player zero-sum games which satisfy the orderfield property [13], [5] may be solved via linear programming. The single-controller case, games with perfect information, switching-controller stochastic games, separable reward-state independent transition (SER-SIT) games, and additive games all satisfy this property. We intend to extend these cases to the infinite strategy case with polynomial payoffs. General finite action stochastic games which do not satisfy the orderfield property still have an interesting mathematical structure, but efficient computational procedures are not available; developing such procedures presents an interesting direction of future research.

Acknowledgement: The authors would like to thank Ilan Lobel and Prof. Munther Dahleh for bringing to their attention the linear programming solution to single-controller finite stochastic games.

REFERENCES

[1] D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific, 2005, vol. I.
[2] D. Fudenberg and J. Tirole, Game theory. Cambridge, MA: MIT Press, 1991.
[3] J. A. Filar and K. Vrieze, Competitive Markov decision processes. New York: Springer, 1997.
[4] L. S. Shapley, "Stochastic games," Proc. Nat. Acad. Sci. U.S.A., vol. 39, pp. 1095-1100, 1953.
[5] T. Parthasarathy and T. E. S. Raghavan, "An orderfield property for stochastic games when one player controls transition probabilities," J. Optim. Theory Appl., vol. 33, no. 3, pp. 375-392, 1981.
[6] P. A.
Parrilo, "Polynomial games and sum of squares optimization," in Proceedings of the 45th IEEE Conference on Decision and Control, 2006.
[7] N. Stein, A. Ozdaglar, and P. A. Parrilo, "Separable and low-rank continuous games," in Proceedings of the 45th IEEE Conference on Decision and Control, 2006.
[8] M. Dresher, S. Karlin, and L. S. Shapley, "Polynomial games," in Contributions to the Theory of Games, ser. Annals of Mathematics Studies, no. 24. Princeton, NJ: Princeton University Press, 1950, pp. 161-180.
[9] P. A. Parrilo, "Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization," Ph.D. dissertation, California Institute of Technology, May 2000.
[10] S. Karlin and L. S. Shapley, Geometry of moment spaces, ser. Memoirs of the American Mathematical Society. AMS, 1953, vol. 12.
[11] J. A. Shohat and J. D. Tamarkin, The Problem of Moments, ser. American Mathematical Society Mathematical Surveys, vol. II. New York: American Mathematical Society, 1943.
[12] L. Devroye, Nonuniform random variate generation. New York: Springer-Verlag, 1986.
[13] T. E. S. Raghavan and J. A. Filar, "Algorithms for stochastic games: a survey," Z. Oper. Res., vol. 35, no. 6, pp. 437-472, 1991.
