Distributed Constrained Optimization with Semicoordinate Transformations



William Macready (wgm@dwavesys.com)
D-Wave Systems, Suite 100, 4401 Still Creek Dr, Burnaby, BC, V5C 6G9

David Wolpert (dhw@email.arc.nasa.gov)
NASA Ames Research Center, MailStop 269-2, Moffett Field, CA, 94035

July 28, 2021

Abstract

Recent work has shown how information theory extends conventional full-rationality game theory to allow bounded rational agents. The associated mathematical framework can be used to solve constrained optimization problems. This is done by translating the problem into an iterated game, where each agent controls a different variable of the problem, so that the joint probability distribution across the agents' moves gives an expected value of the objective function. The dynamics of the agents is designed to minimize a Lagrangian function of that joint distribution. Here we illustrate how the updating of the Lagrange parameters in the Lagrangian is a form of automated annealing, which focuses the joint distribution more and more tightly about the joint moves that optimize the objective function. We then investigate the use of "semicoordinate" variable transformations. These separate the joint state of the agents from the variables of the optimization problem, with the two connected by an onto mapping. We present experiments illustrating the ability of such transformations to facilitate optimization. We focus on the special kind of transformation in which the statistically independent states of the agents induce a mixture distribution over the optimization variables. Computer experiments illustrate this for k-sat constraint satisfaction problems and for unconstrained minimization of NK functions.
Subject Classification: programming: nonlinear, algorithms, theory; probability: applications
Area of Review: optimization

1 Introduction

1.1 Distributed optimization and control with Probability Collectives

As first described in (Wolpert 2003a, Wolpert 2004b), it turns out that one can translate many of the concepts from statistical physics, game theory, distributed optimization and distributed control into one another. This translation is based on the fact that those concepts all involve distributed systems in which the random variables are, at any single instant, statistically independent. (What is coupled is instead the distributions of those variables.) Using this translation, one can transfer theory and techniques between those fields, creating a large common mathematics that connects them. This common mathematics is known as Probability Collectives (PC). Its unifying concern is the set of probability distributions that govern any particular distributed system, and how to manipulate those distributions to optimize one or more objective functions. See (Wolpert, Tumer & Bandari 2003, Wolpert & Tumer 2001) for earlier, less formal work on this topic.

In this paper we consider the use of PC to solve constrained optimization and/or control problems. Reflecting the focus of PC on distributed systems, its use for such problems is particularly appropriate when the variables in the collective are spread across many physically separated agents with limited inter-agent communication (e.g., in a distributed design or supply chain application, or distributed control). A general advantage of PC for such problems is that since PC algorithms work with probabilities rather than the underlying variables, they can be implemented for arbitrary types of the underlying variables.
This same characteristic also means PC provides multiple solutions, each of which is robust, along with sensitivity information concerning those solutions. An advantage particularly relevant to optimization is that the distributed PC algorithm can often be implemented on a parallel computer. An advantage particularly relevant to control problems is that PC algorithms can, if desired, be used without any modelling assumptions about the (stochastic) system being controlled. These advantages are discussed in more detail below.

1.2 The Probability Collectives Approach

Broadly speaking, the PC approach to optimization/control is as follows. First one maps the provided problem into a multi-agent collective. In the simplest version of this process one assigns a separate agent of the collective to determine the value of each of the variables x_i ∈ X_i in the problem that we control. So for example if the i'th variable can only take on a finite number of values, those |X_i| possible values constitute the possible moves of the i'th agent. The value of the joint set of n variables (agents) describing the system is then x = [x_1, ..., x_n] ∈ X, with X ≜ X_1 × ... × X_n.

Unlike many optimization methods, in PC the variables are not manipulated directly. Rather, a probability distribution is what is manipulated. To avoid combinatorial explosions as the number of dimensions of X grows, we must restrict attention to a low-dimensional subset of the space of all probability distributions. We indicate this by writing our distributions as q ∈ Q over X. The manipulation of that q proceeds through an iterative process. The ultimate goal of this process is to induce a distribution that is highly peaked about the x optimizing the objective function G(x), sometimes called the world cost or world utility function.
(In this paper we only consider problems with a single overall objective function, and we arbitrarily choose lower values to be better, even when using the term "utility".) In the precise algorithms investigated here, at the start of any iteration a single Lagrangian function of q, L : Q → R, is specified, based on G(x) and the associated constraints of the optimization problem. Rather than minimize the objective function over the space X, the algorithm minimizes that Lagrangian over q ∈ Q. This is done by direct manipulation of the components of q by the agents.

After such a minimization of a Lagrangian, one modifies the Lagrangian slightly. This is done so that the q optimizing the new Lagrangian is more tightly concentrated about the x that solve our optimization problem than is the current q. One then uses the current q as the starting point for another process of having the agents minimize a Lagrangian, this time having them work on that new Lagrangian. At the end of a sequence of such iterations one ends up with a final q. That q is then used to determine a final answer in X, e.g., by sampling q, evaluating its mode, evaluating its mean (if that is defined), etc. For a properly chosen sequence of Lagrangians and algorithm for minimizing the Lagrangians, this last step should, with high probability, provide the desired optimal point in X.

For the class of Lagrangians used in this paper, the sequence of minimizations of Lagrangians is closely related to simulated annealing. The difference is that in simulated annealing an inefficient Metropolis sampling process is used to implicitly descend each iteration's Lagrangian. By explicitly manipulating q, PC allows for more efficient descent.

In this paper we shall consider the case where Q is a product space, q(x) = ∏_i q_i(x_i). The associated formulation of PC is sometimes called "Product Distribution" theory.
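The outer loop just described (minimize the current Lagrangian, then tighten it, then repeat) can be sketched in a few lines. For illustration we minimize over the set of all distributions on X, where the minimizer of each iteration's Lagrangian is available in closed form as a Boltzmann distribution (see Sec. 2); the algorithms in this paper instead restrict q to product distributions. The toy objective and the halving temperature schedule are our own illustrative choices.

```python
import numpy as np

def boltzmann(G, T):
    # Minimizer of L(q) = E_q(G) - T*S(q) over *all* distributions on a finite set.
    w = np.exp(-(G - G.min()) / T)   # shift by G.min() for numerical stability
    return w / w.sum()

G = np.array([3.0, 1.0, 0.0, 2.5])   # toy objective over four joint moves
T = 1.0
for _ in range(10):                  # annealing: re-minimize, then lower T
    q = boltzmann(G, T)
    T *= 0.5

# q is now tightly peaked about the argmin of G
print(q.argmax())                    # -> 2
```

As T falls, each re-minimization concentrates q more tightly about the minimizing move, which is the "automated annealing" behavior described above.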
It corresponds to noncooperative game theory, with each q_i being agent i's "mixed strategy" (Wolpert 2004b, Fudenberg & Tirole 1991). Our particular focus is the use of such product distributions when X is not the same as the ultimate space of the optimization variables, Z. In this formulation — a modification of what was presented above — there is an intermediate mapping from X → Z, and the provided G is actually a function over Z, not (directly) over X. Such intermediate mappings are called "semicoordinate systems", and going from one to another is a "semicoordinate transformation". As elaborated below, such transformations allow arbitrary coupling among the variables in Z while preserving many of the computational advantages of using product distributions over X.

1.3 Advantages of Probability Collectives

There are many advantages to working with distributions in Q rather than points in X. Usually the support of q is all of X, i.e., the q minimizing the Lagrangian lies in the interior of the unit simplices giving Q. Conversely, any element of X can be viewed as a probability distribution on the edge (a vertex) of those simplices. So working with X is a special case of working with Q, where one sticks to the vertices of Q. In this, optimizing over Q rather than X is analogous to interior point methods. Due to the breadth of the support of q, minimizing over it can also be viewed as a way to allow information from the value of the objective function at all x ∈ X to be exploited simultaneously.

Another advantage, alluded to above, is that by working with distributions Q rather than the space X, the same general PC approach can be used for essentially any X, be it continuous, discrete, time-extended, mixtures of these, etc. (Formally, those different spaces just correspond to different probability measures, as far as PC is concerned.)
For expository simplicity though, here we will work with finite X, and therefore have probability distributions rather than density functions, sums rather than integrals, etc. See in particular (Bieniawski & Wolpert 2004a, Wolpert & Bieniawski 2004a, Wolpert 2004a, Wolpert, Strauss & Rajnarayan 2006) for analysis explicitly for the case of infinite X.

Yet another advantage arises from the fact that when X is finite, q ∈ Q is a vector in a Euclidean space. Accordingly the Lagrangian we are minimizing is a real-valued function of a Euclidean vector. This means PC allows us to leverage the power of descent schemes for continuous spaces like gradient descent or Newton's method — even if X is a categorical, finite space. So with PC, schemes like "gradient descent for categorical variables" are perfectly well-defined.

While the Lagrangians can be based on prior knowledge or modelling assumptions concerning the problem, they need not be. Nor does optimization of a Lagrangian require control of all variables in X (i.e., some of the variables can be noisy). This allows PC to be very broadly applicable.

1.4 Connection with other sciences

A more general advantage of PC is how it relates seemingly disparate disciplines to one another. In particular, it can be motivated by using information theory to relate bounded rational game theory to statistical physics (Wolpert 2003a, Wolpert 2004b). This allows techniques from one field to be imported into the other field. For example, as illustrated below, the grand canonical ensemble of physics can be imported into noncooperative game theory to analyze games having stochastic numbers of players of various types.

To review, a noncooperative game consists of a sequence of stages.
At the beginning of each stage every agent (aka "player") sets a probability distribution (its "mixed strategy") over its moves (Fudenberg & Tirole 1991, Aumann & Hart 1992, Basar & Olsder 1999, Fudenberg & Levine 1998). The joint move at the stage is then formed by the agents simultaneously sampling their mixed strategies at that stage. So the moves those agents make at any particular stage of the game are statistically independent, and the distribution of the joint moves at any stage is a product distribution — just like in PC theory.

This does not mean that the moves of the agents across all time are statistically independent, however. At each stage of the game each agent will set its mixed strategy based on information gleaned from preceding stages, information that in general will reflect the earlier moves of the other agents. So the agents are coupled indirectly, across time, via the updating of the {q_i}_{i=1}^n at the end of each stage.

Analogously, consider again the iterative PC algorithm outlined above, and in particular the process of optimizing the Lagrangian within some particular single iteration. Typically that process proceeds by successively modifying q across a sequence of timesteps. In each of those timesteps q(x) = ∏_i q_i(x_i) is first sampled, and then it is updated based on all previous samples. So just like in a noncooperative game there is no direct coupling of the values of the underlying variables {x_i} at any particular timestep (q is a product distribution). Rather, just like in a noncooperative game, the variables are indirectly coupled, across time (i.e., across timesteps of the optimization), via coupling of the distributions q_i(x_i) at different timesteps.

In addition, information theory can be used to show that the bounded rational equilibrium of any noncooperative game is the q optimizing an associated "maxent Lagrangian" L(q) (Wolpert 2004b).
(That Lagrangian is minimized by the distribution that has maximal entropy while being consistent with specified values of the average payoffs of the agents.) This Lagrangian turns out to be exactly the one that arises in the version of PC considered in this paper. So bounded rational game theory is an instance of PC.

Now in statistical physics one often wishes to find the distribution, out of an allowed set of distributions (e.g., Q), with minimal distance to a fixed target distribution p ∈ P, the space of all possible distributions over X. Perhaps the most popular choice for a distance measure between distributions is the Kullback-Leibler (KL) distance: D(q ‖ p) ≜ Σ_x q(x) ln[q(x)/p(x)] (Cover & Thomas 1991). As the KL distance is not symmetric in its arguments p and q, we shall refer to D(q ‖ p) as the qp KL distance (this is also sometimes called the exclusive KL distance), and D(p ‖ q) as the pq distance (also sometimes called the inclusive KL distance).

Typically in physics p is given by one of the statistical "ensembles". An important example of such KL minimization arises with the Boltzmann distribution of the canonical ensemble: p(x) ∝ exp[−H(x)/T], where H is the "Hamiltonian" of the system. The KL distance D(q ‖ p) to the Boltzmann distribution is proportional to the Gibbs free energy of statistical physics. This free energy is identical to the maxent Lagrangian considered in this paper. Stated differently, if one solves for the distribution q from one's set that minimizes qp KL distance to the Boltzmann distribution, one gets the distribution from one's set having maximal entropy, subject to the constraint of having a specified expected value of H. When the set of distributions one considers is Q, the set of product distributions, this q minimizing qp KL distance to p is called a "mean-field approximation" to p.
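A minimal numerical check of this identity, assuming a finite X: the qp KL distance from an arbitrary q to a Boltzmann distribution equals [E_q(H) − T S(q)]/T plus the constant ln Z(T), so minimizing the KL distance and minimizing the free energy are the same problem. The toy Hamiltonian below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.uniform(0.0, 2.0, size=8)          # toy Hamiltonian on 8 states
T = 0.7

p = np.exp(-H / T)                         # Boltzmann distribution p ∝ exp(-H/T)
Z = p.sum()
p /= Z

q = rng.uniform(size=8)                    # an arbitrary trial distribution
q /= q.sum()

kl = np.sum(q * np.log(q / p))             # D(q||p), the "qp" KL distance
free_energy = q @ H - T * (-(q * np.log(q)).sum())   # E_q(H) - T*S(q)

# D(q||p) = [E_q(H) - T*S(q)]/T + ln Z(T): equal up to a q-independent constant
assert np.isclose(kl, free_energy / T + np.log(Z))
```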
So mean-field theory is an instance of PC. This illustrates that bounded rational games and the mean-field approximation to Boltzmann distributions are essentially identical. To relate them one equates H with a common payoff function G. The equivalence is completed by then identifying each (independent) agent with a different one of the (independent) physical variables in the argument of the Hamiltonian.

This connection between these fields allows us to exploit techniques from statistical physics in bounded rational game theory. For example, as mentioned above, rather than the canonical ensemble, we can apply the grand canonical ensemble to bounded rational games. This allows us to consider games in which the number of players of each type is stochastic (Wolpert 2004b).

1.5 The contribution of this paper

The use of a product distribution space Q for optimization is consistent with game theory (and more generally multi-agent systems). Further, this choice results in a highly parallel algorithm, and is well-suited to problems that are inherently distributed. Nonetheless, other concerns may dictate different Q. In particular, in many optimization tasks we seek multiple solutions which may be far apart from one another. For example, in Constraint Satisfaction Problems (CSPs) (Dechter 2003), the goal is to identify all feasible solutions which satisfy a set of constraints, or to show that none exist. For small problem instances, exhaustive enumeration techniques like branch-and-bound are typically used to identify all such feasible solutions. However, for larger problems it is desirable to develop local-search-based approaches which determine multiple distinct solutions in a single run. In cases like these, where we desire multiple distinct solutions, the use of PC with a product distribution is a poor choice.
The problem is that if each distribution q_i is peaked about every value of x_i which occurs in at least one of the multiple solutions, then in general there will be spurious peaks in the product q(x) = ∏_i q_i(x_i), i.e., q(x) may be peaked about some x that are not solutions. Alternatively, if each q_i is peaked only about a few of the solutions, this does not provide us with many solutions. To address this we might descend the Lagrangian many times, beginning from different starting points (i.e., different initial q). However there is no guarantee that multiple runs will each generate different solutions.

PC offers a simple solution to this problem that allows one to still use product distributions: extend the event space underlying the product distribution so that a single game provides multiple distinct solutions to the optimization problem. Intuitively speaking, such a transformation recasts the problem in terms of a "meta-game" by cloning the original game into several simultaneous games, with an independent set of agents for each game. A supervisory agent chooses which game is to be played. We then form a Lagrangian for the meta-game that is biased towards having any agents that control the same variable in different games have different mixed strategies from one another. The joint strategies for each of the separate games in the meta-game then give a set of multiple solutions to the original game. The supervisory agent sets the relative importance of which such solution is used. Since in general the resultant distribution across the variables being optimized (i.e., across Z) cannot be written as a single product distribution, it provides coupling among those variables.

Formally, the above process can be represented as a semicoordinate transformation. Recall that the space of arguments to the objective function G(z) is Z, and that the product distribution is defined over X.
A "semicoordinate system" maps from X to Z (Wolpert & Bieniawski 2004a, Wolpert 2004c). Before introduction of the semicoordinate system X = Z, and product distributions over X give product distributions over Z. However when X ≠ Z and we introduce a semicoordinate system, product distributions over X (i.e., the noncooperative game is played in X) need not be product distributions on Z. By appropriate choice of the semicoordinate transformation, such distributions can be made to correspond to any coupled distribution across Z. In general, any Bayes net topology can be achieved with an appropriate semicoordinate transformation (Wolpert 2004c, Wolpert & Bieniawski 2004a). Different product distributions over X correspond to different Bayes nets having the same independence relations. Here we consider an X that results in a mixture of M product distributions over Z,

q(z) = Σ_{m=1}^M q_0(m) q^m(z).

Intuitively, q_0 is the distribution over the moves of the supervisor agent, with m labelling the game that agent chooses. This mixture of product distributions allows for the determination of M solutions at once. At the same time, an entropy term in the Lagrangian "pushes" the separate products q^m(z) in the mixture apart. This biases the algorithm toward locating well separated solutions, as desired.

In Sec. 2 we review how one arrives at the Lagrangian considered in this paper, the maxent Lagrangian. In Sec. 3 we review two elementary techniques introduced in (Wolpert & Bieniawski 2004b, Wolpert 2003a, Wolpert 2004a) for updating a product distribution q to minimize the associated Lagrangian. Depending on the form of the objective, the terms involved in the updating of q may be evaluated in closed form, or may require estimation via Monte Carlo methods. In the experiments reported here all terms are calculated in closed form.
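A small sketch of such a mixture-inducing semicoordinate system over binary variables, with illustrative numbers of our own choosing: the supervisor's distribution q_0 mixes M = 2 product distributions, and the resulting q(z) couples the z_i even though every component of the mixture is itself a product distribution.

```python
import numpy as np

n, M = 3, 2                            # three binary variables z_i, M = 2 "games"
q0 = np.array([0.5, 0.5])              # supervisor's distribution over the games
# q_m[m, i] = probability that z_i = 1 under the m'th product distribution
q_m = np.array([[0.9, 0.9, 0.9],
                [0.1, 0.1, 0.1]])

def q_z(z):
    # Mixture q(z) = sum_m q0(m) * prod_i q^m_i(z_i)
    per_game = np.prod(np.where(z, q_m, 1.0 - q_m), axis=1)
    return q0 @ per_game

z = np.ones(n, dtype=bool)
marginals = q0 @ q_m                   # P(z_i = 1) for each i, here 0.5 each
# The mixture couples the z_i: z = (1,1,1) is far likelier under q than under
# the product of q's own single-variable marginals.
print(q_z(z), np.prod(marginals))
```

The two well-separated modes near (1,1,1) and (0,0,0) illustrate how a single mixture can hold multiple distinct candidate solutions at once.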
However, to demonstrate the wider applicability of the update rules, we review in Appendix A a set of Monte Carlo techniques providing low-variance estimates of the required quantities. The first derivation of these estimators is presented in this work. With this background review complete, semicoordinate transformations are introduced in Sec. 4. As an illustration of the use of semicoordinate transformations, particular attention is placed on mixture models, and how mixture models may be seen as product distributions over a different space. In Sec. 5 we analyze the minimization of the maxent Lagrangian associated with mixture-inducing semicoordinate transformations. In that section we also relate our maxent Lagrangian for mixture distributions to the Jensen-Shannon distance over X. Experimental validation of these techniques is then presented for the k-satisfiability CSP (Section 6.1) and the NK family of discrete optimization problems (Section 6.2). These sections consider the situation where the semicoordinate transformation is fixed a priori, but suggestions are made on how to determine a good semicoordinate transformation dynamically as the algorithm progresses. We conclude with a synopsis of some other techniques for updating a product distribution q to minimize the associated Lagrangian. This synopsis serves as the basis for a discussion of the relationship between PC and other techniques.

Like all of PC, the techniques presented in this paper can readily be applied to problems other than constrained optimization. For example, PC provides a natural improvement to the Metropolis sampling algorithm (Wolpert & Lee 2004), which the techniques of this paper should be able to improve further. In addition, while for simplicity we focus here on optimization over countable domains, PC can be extended in many ways to continuous-space optimization.
The associated technical difficulties can all be addressed (Wolpert et al. 2006). See (Antoine, Bieniawski, Kroo & Wolpert 2004, Wolpert & Bieniawski 2004a, Bieniawski, Wolpert & Kroo 2004, Bieniawski & Wolpert 2004b) for other examples of PC and experiments.

It should be emphasized that PC usually is not a good choice for how best to optimize problems lying in some narrowly defined class. When one knows a lot about the class of optimization problems under scrutiny, algorithms that are hand-tailored for that class will almost always be called for. It is also in such situations that one often can call upon formal convergence bounds. In contrast, PC is in the spirit of genetic algorithms, the cross-entropy method, simulated annealing, etc. It is a broadly applicable optimization algorithm that performs well in many domains, even when there is little prior knowledge about the domain. (See (Wolpert & Macready 1997) for a general discussion of this issue.)

Finally, in statistical inference, one parameterizes the possible solutions to one's problem to reduce the dimensionality of the solution space. Without such parameterization, the curse of dimensionality prevents good performance, in general (Duda, Hart & Stork 2000). By choosing one's parameterization though, one assumes (implicitly or otherwise) that that parameterization is flexible enough to capture the salient aspects of the stochastic process generating one's data. In essence, one assumes that the parameterization "projects out" the noise while keeping the signal. The PC analogue of the problem of what parameterization to use is what precise semicoordinate transformation to use. Just as there is no universally correct choice of how to parameterize a statistics problem, there is no universally correct choice of what semicoordinate transformation to use.
In both situations, one must rely on prior knowledge to make one's choice, potentially combined with conservative online adaptation of that choice.

2 The Lagrangian for Product Distributions

We begin by considering the case of the identity semicoordinate system, with X = Z. As discussed above, we consider qp distance to the T-parameterized Boltzmann distribution p(x) = exp[−G(x)/T]/Z(T), where Z(T) is the normalization constant. At low T the Boltzmann distribution is concentrated on x having low G values, so the product distribution with minimal qp distance to it would be expected to have the same behavior. Accordingly, one would expect that by taking qp KL distance to this distribution as one's Lagrangian, and modifying the Lagrangian from one iteration to the next by lowering T, one should end up at a q concentrated on x having low G values. (See (Wolpert & Bieniawski 2004b, Wolpert 2004a) for a more detailed formal justification of using this Lagrangian based on solving constrained optimization problems with Lagrange parameters.)

More precisely, the qp KL distance to the Boltzmann distribution is the maxent Lagrangian,

L(q) = E_q(G) − T S(q)   (1)

up to irrelevant additive and multiplicative constants. Equivalently, we can write it as

L(q) = β E_q(G) − S(q)   (2)

where β ≜ 1/T, up to an irrelevant overall constant. In these equations the inner product E_q(G) ≜ Σ_x q(x) G(x) is the expected value of G under q, and S(q) ≜ −Σ_x q(x) ln q(x) is the Shannon entropy of q.

For q's which are product distributions, S(q) = Σ_i S(q_i), where S(q_i) = −Σ_{x_i} q_i(x_i) ln q_i(x_i). Accordingly, we can view the maxent Lagrangian as equivalent to a set of Lagrangians, L_i(q) = Σ_{x_i} E_{q_{−i}}[G(x_i, x_{−i})] q_i(x_i) − T S_i(q_i), one such Lagrangian for each agent i, so that L(q) = Σ_{i=1}^n L_i(q).
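The decomposition just given is easy to verify numerically for a small two-agent example (the cost table and distributions below are arbitrary illustrative choices): E_q(G) factors through the conditional expectations E_{q_{-i}}(G | x_i), and the entropy of the product q is the sum of the per-agent entropies.

```python
import numpy as np

# Two agents with 2 and 3 moves; G[x1, x2] is the shared cost table.
G = np.array([[1.0, 0.5, 2.0],
              [0.0, 1.5, 0.3]])
q1 = np.array([0.6, 0.4])
q2 = np.array([0.2, 0.5, 0.3])
T = 0.5

def entropy(qi):
    return -(qi * np.log(qi)).sum()

expected_G = q1 @ G @ q2                   # E_q(G) for the product q = q1 * q2
L = expected_G - T * (entropy(q1) + entropy(q2))   # maxent Lagrangian, Eq. (1)

cond1 = G @ q2                             # E_{q_{-1}}(G | x_1), one entry per move of agent 1
assert np.isclose(q1 @ cond1, expected_G)  # averaging over x_1 recovers E_q(G)
```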
The first term in L is minimized by having perfectly rational players, i.e., by players who concentrate all their probability on the moves that are best for them, given the distributions of the other agents. The second term is minimized by perfectly irrational players, i.e., by a perfectly uniform joint mixed strategy q. So T specifies the balance between the rational and irrational behavior of the players. In particular, for T → 0, by minimizing the Lagrangian we recover the Nash equilibria of the game. Alternatively, from a statistical physics perspective, where T is the temperature of the system, this maxent Lagrangian is simply the Gibbs free energy for the Hamiltonian G.

2.1 Incorporating constraints

Since we are interested in problems with constraints, we replace G in Eqs. (1) and (2) with

G(x) + Σ_{a=1}^C λ_a c_a(x)   (3)

where G is the original objective function and the c_a are the set of C equality constraint functions that are required to equal zero. For constraint satisfaction problems we take the original objective function to be the constant function 0. The λ_a are Lagrange multipliers that are used to enforce the constraints. Collectively, we refer to the Lagrange multipliers with the C × 1 vector λ. The constraints are all equality constraints, so a saddle point of the Lagrangian over the space of possible q and λ is a solution of our problem. Note however that we do not have to find the exact saddle point; in general, sampling from a q close to the saddle point will give us the x's we seek.

There are certainly other ways in which constraints can be addressed within the PC framework. An alternative approach might allow constraints to be weakly violated. We would then iteratively anneal down those weaknesses, i.e., strengthen the constraints, to where they are not violated.
In this approach we could replace the maxent Lagrangian formulation encapsulated in Eqs. (2) and (3) with

L(q, β, λ) = β [E_q(G) − γ_G] + Σ_a λ_a [E_q(c_a) − γ_a] − S(q).   (4)

In each iteration of the algorithm β, λ are treated as Lagrange parameters, and one solves for their values that enforce the equality constraints E_q(G) = γ_G and the C constraints E_q(c_a) = γ_a while also minimizing L(q, β, λ). In the usual way, since our constraints are all equalities, one can do this by finding saddle points of L(q, β, λ). The next iteration would then start by modifying our Lagrangian by shrinking the values γ_G, {γ_a} slightly before proceeding to a new process of finding a saddle point.

Another, more theoretically justified way to incorporate constraints requires that the support of q be constrained to lie entirely within the feasible region. Any x which violates a constraint is assigned 0 probability, i.e., q(x) = 0 at all x which violate the constraints. For pedagogical simplicity, we do not consider these alternative approaches, but concentrate on the Lagrangian of Eq. (1) with the G of Eq. (3).

In addition to the constraints associated with the optimization problem, the vectors {q_i} must be probability distributions. So there are implicit constraints our solution must satisfy: 0 ≤ q_i(x_i) ≤ 1 for all i and x_i, and Σ_{x_i} q_i(x_i) = 1 for all i. To reduce the size of our equations we do not explicitly write these constraints.

3 Minimizing the maxent Lagrangian

For fixed β, our task is to find a saddle point of L(q, λ). In "first order methods" a saddle point is found by iterating a two-step process. In the first step the Lagrange parameters λ are fixed and one solves for the q that minimizes the associated L. In the second step one then freezes that q and updates the Lagrange parameters.
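A sketch of this two-step first-order method on a toy four-move problem with one equality constraint. For illustration, the inner minimization over q is done in closed form over all distributions (the Boltzmann distribution of G + λc), rather than over product distributions as in the following subsections; the step size and temperature below are arbitrary choices of ours.

```python
import numpy as np

# Four joint moves, objective G, and one equality constraint c(x) = 0;
# c(x) != 0 marks the infeasible moves.
G = np.array([0.0, 0.2, 1.0, 1.5])
c = np.array([1.0, 0.0, 0.0, 1.0])
T, eta, lam = 0.2, 2.0, 0.0

for _ in range(100):
    # Step 1 (minimize over q at fixed lambda): over all distributions the
    # minimizer of the maxent Lagrangian is the Boltzmann distribution of G + lam*c.
    g = G + lam * c
    q = np.exp(-(g - g.min()) / T)
    q /= q.sum()
    # Step 2 (gradient ascent on the multiplier): dL/dlam = E_q(c).
    lam += eta * (q @ c)

print(q.argmax())   # probability concentrates on the best feasible move -> 1
```

Note that the unconstrained minimizer of G (move 0) is infeasible; the growing multiplier penalizes it until q peaks on the lowest-G feasible move instead.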
There are more sophisticated ways of finding saddle points (Grantham 2005), and more generally one can use modified versions of the Lagrangian (e.g., an augmented Lagrangian (Bertsekas 1996)). For simplicity we do not consider such more sophisticated approaches.

In this section we review two approaches to finding the {q_i} for fixed Lagrange multipliers λ. We also describe our approach for the second step of the first order method, i.e., we describe how we use gradient ascent to update the Lagrange multipliers λ_a for fixed q. See (Wolpert 2003a, Wolpert & Bieniawski 2004b, Wolpert 2004a) for further discussion of these approaches as well as the many others one can use.

3.1 Brouwer Updating

At each step t the direction in the simplex Q that, to first order, maximizes the drop in L is given by (−1 times)

∇̃_q L(q) ≜ ∇_q L(q) − η(q).   (5)

In this equation η(q) is proportional to the unit vector, with its magnitude set to ensure that a step in the direction ∇̃_q L(q) remains in the unit simplex. Furthermore, the q_i(x_i) component of the gradient, one for every agent i and every possible move x_i by the agent, is (up to constant terms which have been absorbed into η(q)):

[∇_q L(q)]_{q_i(x_i)} = ∂L/∂q_i(x_i) = E_{q_{−i}}(G | x_i) + T ln[q_i(x_i)]   (6)

where

E_{q_{−i}}(G | x_i) = Σ_{x_{−i}} q_{−i}(x_{−i}) G(x_i, x_{−i})   (7)

with x_{−i} ≜ [x_1, ..., x_{i−1}, x_{i+1}, ..., x_n] and q_{−i}(x_{−i}) ≜ ∏_{j≠i} q_j(x_j). η(q) is the vector that needs to be added to ∇_q L(q) so that each q_i is properly normalized. The q_i(x_i) component of η(q) is equal to

[η(q)]_{q_i(x_i)} = η_i(q) ≜ (1/|X_i|) Σ_{x'_i} [∇_q L(q)]_{q_i(x'_i)}   (8)

where |X_i| is the number of possible moves (allowed values) x_i.
Note that for any agent i, all of the associated components of η(q), namely those for q_i(x_1), …, q_i(x_{|X_i|}), share the same value η_i(q). This choice ensures that Σ_{x_i} q_i(x_i) = 1 after the gradient update to the values q_i(x_i). The expression in Eq. (7) is the expected payoff to agent i when it plays move x_i, under the distribution q_{−i} across the moves of all other agents. Setting ∇̃_q L(q) to zero gives the solution

    q^{t+1}_i(x_i) ∝ exp( −E_{q^t_{−i}}(G | x_i) / T ).    (9)

Brouwer's fixed point theorem guarantees that a solution of Eq. (9) exists for any G (Wolpert 2004b, Wolpert 2003a). Hence we call update rules based on this equation Brouwer updating. Brouwer updating can be done in parallel on all the agents. However, one problem that can arise if all agents update in parallel is "thrashing". In Eq. (9) each agent i adopts the q_i that would be optimal assuming the other agents do not change their distributions. However, the other agents do change their distributions, and thereby at least partially confound agent i. One way to address this problem is to have agent i update q^t_i(x_i) using not the current value E_{q^t_{−i}}(G | x_i) alone, but rather a weighted average of all values E_{q^{t'}_{−i}}(G | x_i) for t' ≤ t, with the weights shrinking the further into the past one goes. This introduces an inertia effect which helps to stabilize the updating. (Indeed, in the continuum-time limit, this weighting becomes the replicator dynamics (Wolpert 2004c).) A similar idea is to have agent i use the current E_{q^t_{−i}}(G | x_i) alone, but have it move only part of the way the parallel Brouwer update recommends. Whether one moves all the way or only part-way, what agent i is interested in is what distribution will be optimal for the next distributions of the other agents.
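As a concrete illustration, the parallel Brouwer update of Eq. (9) can be sketched for a problem small enough to enumerate all joint moves. This is a minimal sketch of our own; the function names and the tiny payoff tensor are hypothetical, and in realistic applications the conditional expectations of Eq. (7) would be estimated by Monte Carlo rather than computed by exact tensor contraction:

```python
import numpy as np

def cond_expectation(G, qs, i):
    """E_{q_{-i}}(G | x_i): expected payoff for each move of agent i,
    averaging G over the other agents' current distributions (Eq. 7)."""
    cond = np.moveaxis(G, i, 0)          # put agent i's axis first
    for j, qj in enumerate(qs):
        if j != i:
            cond = np.tensordot(cond, qj, axes=([1], [0]))
    return cond

def brouwer_step(G, qs, T):
    """One parallel Brouwer update (Eq. 9): every agent simultaneously
    adopts a Boltzmann distribution over its moves at temperature T."""
    new_qs = []
    for i in range(len(qs)):
        logits = -cond_expectation(G, qs, i) / T
        logits -= logits.max()           # shift for numerical stability
        w = np.exp(logits)
        new_qs.append(w / w.sum())
    return new_qs

# toy two-player game, G[x1, x2], starting from uniform distributions
G = np.array([[0., 18.], [25., 2.]])
qs = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
new = brouwer_step(G, qs, T=7.0)
```

After one step each agent's distribution shifts weight toward the move with the lower expected payoff under the other agent's current distribution, as Eq. (9) prescribes.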
Accordingly, it makes sense to have agent i predict, using standard time-series tools, what those future distributions will be. This amounts to predicting what the next vector of values E_{q^t_{−i}}(G | x_i) will be, based on seeing how that vector has evolved in the recent past. See (Shamma & Arslan 2004) for related ideas. Another way of circumventing thrashing is to have the agents update their distributions serially (one after the other) rather than in parallel. See (Wolpert & Bieniawski 2004a) for a description of various kinds of serial schemes, as well as a discussion of partially serial, partially parallel algorithms.

3.2 Nearest-Newton Updating

To evaluate the gradient one only needs to evaluate or estimate the terms E_{q^t_{−i}}(G | x_i) for all agents (see below and (Wolpert 2003a, Wolpert 2004b)). Consequently, gradient descent is typically straightforward. Though it is also usually simple to evaluate the Hessian of the Lagrangian, conventional Newton descent is intractable for large systems because inverting the Hessian is computationally expensive. Of course there are schemes such as conjugate gradient or quasi-Newton methods that do exploit second-order information even when the Hessian cannot be inverted. However, the special structure of the Lagrangian also allows second-order information to be used in a simple variant of Newton descent. The associated update rule is called Nearest-Newton updating (Wolpert & Bieniawski 2004b); we review it here. To derive Nearest-Newton we begin by considering the Lagrangian E_π(G) − T S(π) for an unrestricted probability distribution π. This Lagrangian is a convex function of π with a diagonal Hessian. So given a current distribution π^t we can make an unrestricted Newton step of this Lagrangian to a new distribution π^{t+1}. That new distribution typically is not in Q (i.e.,
not a product distribution), even if the starting distribution is. However, we can solve for the q^{t+1} ∈ Q that is nearest to π^{t+1}, for example by finding the q^{t+1} ∈ Q that minimizes the qp KL distance D(p ‖ q) to that new point. More precisely, the Hessian of E_π(G) − T S(π), ∂²L/∂π(x)∂π(x'), is diagonal, and so is simply inverted. This gives the Newton update for π^t:

    π^{t+1}(x) = π^t(x) − α^t_q π^t(x) [ (G(x) − E_{π^t}(G)) / T + S(π^t) + ln π^t(x) ],

which is normalized if π^t is normalized, and where α^t_q is a step size. As π^{t+1} will typically not belong to Q, we find the product distribution nearest to π^{t+1} by minimizing the KL distance D(π^{t+1} ‖ q) with respect to q. The result is that q_i(x_i) = π^{t+1}_i(x_i), i.e., q_i is the marginal of π^{t+1} given by integrating it over x_{−i}. Thus, whenever π^t is itself a product distribution, the update rule for q_i(x_i) is

    q^{t+1}_i(x_i) = q^t_i(x_i) − α^t_q q^t_i(x_i) [ (E_{q^t_{−i}}(G | x_i) − E_{q^t}(G)) / T + S(q_i) + ln q^t_i(x_i) ].    (10)

This update maintains the normalization of q_i, but may make one or more q^{t+1}_i(x_i) greater than 1 or less than 0. In such cases we set q^{t+1}_i to be the valid product distribution nearest in Euclidean distance (rather than KL distance) to the suggested Newton update.

3.3 Updating Lagrange Multipliers

In order to satisfy the imposed optimization constraints {c_a(x)} we must also update the Lagrange multipliers. To minimize communication between agents this is done in the simplest possible way, by gradient ascent. Taking the partial derivatives with respect to λ_a gives the update rule

    λ^{t+1}_a = λ^t_a + α^t_λ E_{q^{t*}}[c_a(x)],    (11)

where α^t_λ is a step size and q^{t*} is the local minimizer of L determined as above at the old settings λ^t of the multipliers.
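The Nearest-Newton update of Eq. (10) can likewise be sketched on a fully enumerated toy problem. This is our own minimal sketch: the clip-and-renormalize at the end is a simple stand-in for the Euclidean projection mentioned in the text, and a real implementation would estimate the expectations by Monte Carlo:

```python
import numpy as np

def cond_exp(G, qs, i):
    """E_{q_{-i}}(G | x_i) via exact tensor contraction (toy setting)."""
    c = np.moveaxis(G, i, 0)
    for j, qj in enumerate(qs):
        if j != i:
            c = np.tensordot(c, qj, axes=([1], [0]))
    return c

def nearest_newton_step(G, qs, T, alpha):
    """One Nearest-Newton update (Eq. 10) applied to every agent."""
    # E_q(G): full expectation under the product distribution
    EG = G
    for qi in qs:
        EG = np.tensordot(EG, qi, axes=([0], [0]))
    EG = float(EG)
    new_qs = []
    for i, qi in enumerate(qs):
        S_i = -np.sum(qi * np.log(qi))   # entropy of the marginal q_i
        step = qi * ((cond_exp(G, qs, i) - EG) / T + S_i + np.log(qi))
        # surrogate for the Euclidean projection back onto the simplex
        q_new = np.clip(qi - alpha * step, 1e-12, None)
        new_qs.append(q_new / q_new.sum())
    return new_qs

G = np.array([[0., 18.], [25., 2.]])
qs = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
new = nearest_newton_step(G, qs, T=7.0, alpha=0.1)
```

For a uniform starting q the entropy and log terms cancel, so the first step simply shifts weight toward moves with below-average expected payoff, scaled by q_i and the step size.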
3.4 Other Descent Schemes

It should be emphasized that PC encompasses many approaches to optimization of the Lagrangian that differ from those used here. For example, in (Bieniawski & Wolpert 2004a, Wolpert 2004c) there is discussion of alternative types of descent algorithms that are related to block relaxation, as well as to the fictitious play algorithm of game theory (Fudenberg & Tirole 1991, Shamma & Arslan 2004) and to multi-agent reinforcement learning algorithms like those in collective intelligence (Wolpert & Tumer 2001, Wolpert et al. 2003). As another example, see (Wolpert & Bieniawski 2004b, Wolpert 2004b) for discussions of using pq KL distance (i.e., D(p ‖ q)) rather than qp distance. Interestingly, as discussed below, that alternative distance must be used even for descent of qp distance if one wishes to use second-order descent schemes. (Wolpert 2004a) discusses using non-Boltzmann target distributions p, and many other options for what functional(s) to descend.

3.5 Algorithmic Summary

Having described two possible PC algorithms, we summarize the steps involved in each. This basic framework will form the basis for the semicoordinate extensions described in Section 4. Pseudocode for basic PC optimization appears in Algorithm 1. Lines 1–3 initialize algorithmic parameters. The best starting temperatures and multiplier values vary from problem to problem. We typically initialize q to be the maximum entropy distribution, which is uniform over the search space X. An outer loop decreases the temperature according to a schedule determined by the function updateT. Later we comment on automatic schedules generated from the settings of the Lagrange multipliers. Inside this loop is another loop which increments the Lagrange multipliers according to Eq. (11) every time q is iterated to a local minimum of the Lagrangian.
The minimization of L for a fixed temperature and setting of the multipliers is accomplished in the innermost loop. This minimization is accomplished by repeatedly determining all the conditional expectations E_{q_{−i}}(G + Σ_a λ_a c_a | x_i) (line 7), and then using these in either of the two update rules, Eq. (9) or Eq. (10) (line 8). The evaluation of the conditional expectations of G + Σ_a λ_a c_a can be accomplished either analytically or with Monte Carlo estimates. For many problems analytical evaluation is prohibitively costly, and Monte Carlo methods are the only option. In Appendix A we consider unbiased low-variance Monte Carlo methods for estimating the required conditional expectations, and in Appendix B we derive the minimal-variance estimator from within a class of useful estimators.

Input: An objective function G(x), and a set of equality constraint functions {c_a(x) = 0}, a = 1, …, C
Output: A product distribution q*(x) = ∏_i q*_i(x_i) peaked about the minimizer of G satisfying the constraints.
 1  λ ← initializeMultipliers
 2  T ← initializeT
 3  q ← initializeQ
 4  repeat
 5      repeat
 6          repeat
 7              gs ← evalConditionalExpectations(G, q)
 8              q ← updateQ(gs)
 9          until atLocalMinimum(q)
10          λ ← updateλ
11      until constraintsSatisfied
12      T ← updateT
13  until done

Algorithm 1: The basic PC algorithmic framework.

4 Semicoordinate Transformations

4.1 Motivation

Consider a multi-stage game like chess, with the stages (i.e., the instants at which one of the players makes a move) delineated by t. In game-theoretic terms, the "strategy" of a player is the mapping from board configuration to response that specifies the rule it adopts before play starts (Fudenberg & Tirole 1991, Basar & Olsder 1999, Osborne & Rubenstein 1994, Aumann & Hart 1992, Fudenberg & Levine 1998).
More generally, in a multi-stage game like chess the strategy of player i, x_i, is the set of t-indexed maps taking what that player has observed in the stages t' < t into its move at stage t. Formally, this set of maps is called player i's normal-form strategy. The joint strategy of the two players in chess sets their joint move sequence, though in general the reverse need not be true. In addition, one can always find a joint strategy to result in any particular joint move sequence. Now typically at any stage there is overlap in what the players have observed over the preceding stages. This means that even if the players' strategies are statistically independent (being separately set before play started), their move sequences are statistically coupled. In such a situation, by parameterizing the space Z of joint move sequences z with joint strategies x, we shift our focus from the coupled distribution P(z) to the decoupled product distribution q(x). This is the advantage of casting multi-stage games in terms of normal-form strategies.

More generally, given any two spaces X and Z, any associated onto mapping ζ: X → Z, not necessarily invertible, is called a semicoordinate system. The identity mapping id: Z → Z is a trivial example of a semicoordinate system. Another semicoordinate system is the mapping from joint strategies in a multi-stage game to joint move sequences. In other words, changing the representation space of a multi-stage game from move sequences z to strategies x is a semicoordinate transformation of that game. Intuitively, a semicoordinate transformation is a reparameterization of how a game (a mapping from joint moves to associated payoffs) is represented. So we can perform a semicoordinate transformation even in a single-stage game. Say we restrict attention to distributions over X that are product distributions.
Then changing ζ(·) from the identity map to some other function means that the players' moves are no longer independent. After the transformation their move choices, the components of z, are statistically coupled, even though we are considering a product distribution. Formally, this is expressed via the standard rule for transforming probabilities,

    P_Z(z ∈ Z) ≜ ζ(P_X) ≜ ∫ dx P_X(x) δ(z − ζ(x)),    (12)

where P_X and P_Z are the distributions across X and Z, respectively. To see what this rule means geometrically, recall that P is the space of all distributions (product or otherwise) over Z, and that Q is the space of all product distributions over X. Let ζ(Q) be the image of Q in P. Then by changing ζ(·) we change that image; different choices of ζ(·) will result in different manifolds ζ(Q).

As an example, say we have two players, with two possible moves each. So z consists of the possible joint moves, labelled (0,0), (0,1), (1,0) and (1,1). Take X = Z, and choose ζ(0,0) = (0,0), ζ(0,1) = (1,1), ζ(1,0) = (1,0), and ζ(1,1) = (0,1). Say that q is given by q_1(x_1 = 0) = q_2(x_2 = 0) = 2/3. Then the distribution over joint moves z is P_Z(0,0) = P_X(0,0) = 4/9, P_Z(1,0) = P_Z(1,1) = 2/9, and P_Z(0,1) = 1/9. So P_Z(z) ≠ P_Z(z_1) P_Z(z_2); the moves of the players are statistically coupled, even though their strategies x_i are independent.

Any P_Z, no matter what the coupling among its components, can be expressed as ζ(P_X) for some product distribution P_X and associated ζ(·). In the worst case, one can simply choose X to have a single component, with ζ(·) a bijection between that component and the vector z; trivially, any distribution over such an X is a product distribution.
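The two-player example just given can be checked numerically by pushing the product distribution through ζ, i.e., by applying the finite-space version of Eq. (12). The helper function below is our own sketch:

```python
import numpy as np
from itertools import product

def pushforward(qs, zeta, shape):
    """Image of a product distribution under a semicoordinate map
    zeta: X -> Z (the finite-space version of Eq. 12)."""
    P_Z = np.zeros(shape)
    for x in product(*(range(len(qi)) for qi in qs)):
        p_x = np.prod([qi[xi] for qi, xi in zip(qs, x)])
        P_Z[zeta[x]] += p_x          # mass of x lands on z = zeta(x)
    return P_Z

# the two-player example from the text
zeta = {(0, 0): (0, 0), (0, 1): (1, 1), (1, 0): (1, 0), (1, 1): (0, 1)}
qs = [np.array([2/3, 1/3]), np.array([2/3, 1/3])]
P_Z = pushforward(qs, zeta, (2, 2))
```

The resulting P_Z reproduces the probabilities quoted above and is not the product of its own marginals, confirming that the moves are coupled even though the strategies are independent.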
Another simple example is where one aggregates one or more agents into a new single agent, i.e., replaces the product distribution over the joint moves of those agents with an arbitrary distribution over their joint moves. This is related to the concept of coalitions in cooperative game theory, as well as to Aumann's correlated equilibrium (Fudenberg & Tirole 1991, Aumann 1987, Aumann & Hart 1992).

Less trivially, given any model class of distributions {P_Z}, there is an X and associated ζ(·) such that {P_Z} is identical to ζ(Q_X). Formally this is expressed in a result concerning Bayes nets. For simplicity, restrict attention to finite Z. Order the components of Z from 1 to N. For each index i ∈ {1, 2, …, N}, have the parent function Pa(i, z) fix a subset of the components of z with index greater than i, returning the value of those components for the z in its second argument if that subset of components is non-empty. So for example, with N > 5, we could have Pa(1, z) = (z_2, z_5). Another possibility is that Pa(1, z) is the empty set, independent of z. Let A(Pa) be the set of all probability distributions P_Z that obey the conditional independencies implied by Pa:

    ∀ P_Z ∈ A(Pa), z ∈ Z:  P_Z(z) = ∏_{i=1}^N P_Z(z_i | Pa(i, z)).    (13)

By definition, if Pa(i, z) is empty, P_Z(z_i | Pa(i, z)) is just the i'th marginal of P_Z, namely P_Z(z_i). As an example of these definitions, the dependencies {Pa(1, z) = (z_2, z_3), Pa(2, z) = z_4, Pa(3, z) = (), Pa(4, z) = ()} correspond to the family of distributions factoring as

    P(z) = P(z_1 | z_2, z_3) P(z_2 | z_4) P(z_3) P(z_4).

As proven in (Wolpert & Bieniawski 2004a), for any choice of Pa there is an associated set of distributions ζ(Q_X) that equals A(Pa) exactly:

Proposition: Define the components of X using multiple indices: for all i ∈ {1, 2, …
, N} and possible associated values (as one varies over z ∈ Z) of the vector Pa(i, z), there is a separate component of x, x_{i;Pa(i,z)}. This component can take on any of the values that z_i can. Define ζ(·) recursively, starting at i = N and working to lower i, by the following rule: ∀ i ∈ {1, 2, …, N}, [ζ(x)]_i = x_{i;Pa(i,z)}. Then A(Pa) = ζ(Q_X).

Intuitively, each component of x in the proposition is the conditional distribution P_Z(z_i | Pa(i, z)) for some particular instance of the vector Pa(i, z). As illustration, consider again the example {Pa(1, z) = (z_2, z_3), Pa(2, z) = z_4, Pa(3, z) = (), Pa(4, z) = ()}. If each z_i assumes the value 0 or 1, then x has 8 components, x_4, x_3, x_{2;0}, x_{2;1}, x_{1;00}, x_{1;01}, x_{1;10}, and x_{1;11}, with each component also either 0 or 1. The product distribution in X is

    q(x) = q_4(x_4) q_3(x_3) q_{2;0}(x_{2;0}) q_{2;1}(x_{2;1}) q_{1;00}(x_{1;00}) q_{1;01}(x_{1;01}) q_{1;10}(x_{1;10}) q_{1;11}(x_{1;11}).

Under ζ the distribution q_4(x_4) is mapped to q_4(z_4), q_{2;0}(x_{2;0}) is mapped to q_2(z_2 | z_4 = 0), q_{1;01}(x_{1;01}) is mapped to q_1(z_1 | z_2 = 0, z_3 = 1), and so on.

The proposition means that in principle we never need consider coupled distributions. It suffices to restrict attention to product distributions, so long as we use an appropriate semicoordinate system. Semicoordinate systems also enable mixture models over Z to be represented using products. However, before discussing mixture models we show how transformation of semicoordinate systems can in principle be used to escape local minima in L(q).

4.2 Semicoordinate Transformations and Local Minima

To illustrate another application of semicoordinate transformations, we confine ourselves to the case where X = Z, so that ζ is a bijection on X.
We assume that the domain of the ith of n variables has size |X_i|. Then |X| = ∏_{i=1}^n |X_i| is the size of the search space. Each coordinate variable x_i partitions the search space into |X_i| disjoint regions. The partitions are such that the intersection over all variable coordinates yields a single x. In particular, the standard semicoordinate system relies on the partition [∗, …, ∗, x_i = 0, ∗, …, ∗], …, [∗, …, ∗, x_i = |X_i| − 1, ∗, …, ∗] for each coordinate x_i. As an illustrative example, consider 3 binary variables where X = {0, 1}^3. Figure 1(a) shows the 8 points in the search space represented in the standard coordinate system. Figure 1(b) shows a shuffling of the 8 configurations under the permutation (0 1 2 3 4 5 6 7) → (1 5 2 6 4 0 7 3) defined by ζ. The resulting partitions of configurations are given in Table 1.

Such transformations can be used to escape from local minima of the Lagrangian. To see this, consider a coordinate transformation ζ from X to the new space Z such that z = ζ(x), and choose q(z) = q(x) (i.e., do not change the associated probabilities). Then the entropy contribution to the Lagrangian remains unchanged, but the expected value of G changes from Σ_x G(x) q(x) to

    Σ_x G_X(x) q(x) ≜ Σ_x G(ζ(x)) q(x) = Σ_z G(z) q(ζ^{-1}(z)).

This means that the gradient of the maxent Lagrangian will typically differ before and after the application of ζ.
In particular, what was a local minimum with zero gradient before the semicoordinate transformation may not be a local minimum after the transformation and the resultant shuffling of utility values.

[Figure 1: (a) Original linear indexing for 3 binary variables x_1, x_2, x_3. (b) Result after applying the transformation to the new variables z_1, z_2, z_3.]

Table 1: Resultant partitions from the transformation of Figure 1(b).

    x_1 = 0: (0,0,0), (0,0,1), (0,1,0), (0,1,1)    z_1 = 0: (1,0,0), (0,1,0), (0,0,1), (1,1,1)
    x_1 = 1: (1,0,0), (1,0,1), (1,1,0), (1,1,1)    z_1 = 1: (0,0,0), (1,1,0), (1,0,1), (0,1,1)
    x_2 = 0: (0,0,0), (0,0,1), (1,0,0), (1,0,1)    z_2 = 0: (1,0,0), (0,1,0), (1,0,1), (0,1,1)
    x_2 = 1: (0,1,0), (0,1,1), (1,1,0), (1,1,1)    z_2 = 1: (0,0,0), (1,1,0), (0,0,1), (1,1,1)
    x_3 = 0: (0,0,0), (0,1,0), (1,0,0), (1,1,0)    z_3 = 0: (1,0,0), (0,1,0), (1,0,1), (0,1,1)
    x_3 = 1: (0,0,1), (0,1,1), (1,0,1), (1,1,1)    z_3 = 1: (0,0,0), (1,1,0), (0,0,1), (1,1,1)

As difficult problems typically have many local minima in their Lagrangian, such semicoordinate transformations may prove very useful. A simple example is shown in Figure 2(a), where a Lagrangian surface for 2 binary variables is shown. The utility values are G(0,0) = 0, G(1,0) = 25, G(0,1) = 18, G(1,1) = 2. If the temperature is 7 in units of the objective, then the global minimum is at q_1(0) = 0.95, q_2(0) = 0.91, where L = −0.82. At this temperature there is a suboptimal local minimum (indicated by the dot in the lower left) located at q*_1(0) = 0.14, q*_2(0) = 0.08, where L = 0.83.
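The two Lagrangian values quoted for this example can be reproduced directly from the definition L(q) = E_q(G) − T S(q); the small helper below is our own:

```python
import numpy as np

T = 7.0
G = np.array([[0., 18.],    # G[x1, x2]: G(0,0)=0, G(0,1)=18
              [25., 2.]])   #            G(1,0)=25, G(1,1)=2

def lagrangian(q1_0, q2_0):
    """L(q) = E_q(G) - T S(q) for a product of two binary marginals."""
    q1 = np.array([q1_0, 1 - q1_0])
    q2 = np.array([q2_0, 1 - q2_0])
    EG = q1 @ G @ q2                               # expected utility
    S = -(q1 @ np.log(q1)) - (q2 @ np.log(q2))     # entropy in nats
    return EG - T * S

L_global = lagrangian(0.95, 0.91)   # global minimum quoted in the text
L_local = lagrangian(0.14, 0.08)    # suboptimal local minimum q*
```

Evaluating the helper at the two quoted points recovers L ≈ −0.82 and L ≈ 0.83, matching the text to the stated precision.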
There are a number of criteria that might be used to determine a semicoordinate transformation to escape from this local minimum q*. Two simple choices are to select the transformation that minimizes the new value of the maxent Lagrangian (i.e., minimize L(q*)), or to select the transformation which results in the largest gradient of the maxent Lagrangian at q* (i.e., maximize ‖∇_q L(q*)‖). For this simple problem the results of both these choices are shown in Figures 2(b) and 2(c), respectively. The transformation in each of these cases is determined by optimizing over semicoordinate transformations (permutations), while keeping the probabilities fixed, to either minimize the Lagrangian value at q* or maximize the norm of the gradient at q*.

In principle, this semicoordinate search can be embedded within any optimization to dynamically escape local minima as they are encountered. Importantly, the search criteria listed above require no look-ahead in order to identify the best semicoordinate permutation. However, in practice, since we cannot search over arbitrarily large permutation spaces, we must select a few of the variables and permute amongst their possible joint moves. By composing such permutations we can easily account for the escape from multiple local minima. Heuristics for the selection of "permuting" variables, and the results of this procedure, await future work.
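For this 2-variable example the search over permutations, keeping the probabilities fixed at q*, can be done exhaustively. The sketch below (our own; it implements only the minimize-L(q*) criterion, and the paper's figures may use additional conventions we do not reproduce) enumerates all 24 shufflings of the four utility values:

```python
import numpy as np
from itertools import permutations

T = 7.0
q1 = np.array([0.14, 0.86])          # the local minimum q* from the text
q2 = np.array([0.08, 0.92])
G0 = np.array([0., 18., 25., 2.])    # G(0,0), G(0,1), G(1,0), G(1,1)

def lagrangian(g):
    """L(q*) for the shuffled utility assignment g, probabilities held fixed."""
    G = np.asarray(g).reshape(2, 2)
    S = -(q1 @ np.log(q1)) - (q2 @ np.log(q2))
    return q1 @ G @ q2 - T * S

# exhaustively shuffle the utility values over the joint moves
best = min(permutations(G0), key=lagrangian)
```

Since the identity permutation is among the candidates, the best shuffling can only lower L(q*), turning the former local minimum into a point from which descent can resume.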
[Figure 2: (a) The original Lagrangian surface (G(0,0) = 0, G(0,1) = 18, G(1,0) = 25, G(1,1) = 2), plotted against Prob(x_1 = 0) and Prob(x_2 = 0). The suboptimal local minimum is located at q*, indicated with a dot in the lower left corner; there L = 0.83 and ∇L = [0.00, −0.00]. (b) The Lagrangian under the coordinate transformation which minimizes L(q*) (G(0,0) = 25, G(0,1) = 0, G(1,0) = 18, G(1,1) = 2); there L = −1.66 and ∇L = [−14.13, −0.10]. (c) The Lagrangian under the coordinate transformation which maximizes the norm of the gradient at q* (G(0,0) = 18, G(0,1) = 0, G(1,0) = 2, G(1,1) = 25); there L = 15.49 and ∇L = [−34.65, −34.68]. The direction of the negative gradient is indicated by a black arrow.]

4.3 Semicoordinate Transformations for Mixture Distributions

In this section we turn to a different use of semicoordinate transformations. We have previously described how the Lagrangian measuring the distance from a product distribution to a Boltzmann distribution may be minimized in a distributed fashion. We now extend these results to mixtures of product distributions, in order to represent multiple product-distribution solutions at once. We can always do this by means of a semicoordinate transformation of the underlying variables; in this section we demonstrate that explicitly.

Let X indicate the set of variables in a space of dimension d_X, and let Z be the original (pre-transformation) n-dimensional space over which G is defined. We identify a product distribution over X (with d_X > n) and an appropriately chosen mapping ζ: X → Z which together induce a mixture distribution over Z.
To see this, consider an M-component mixture distribution over n variables, which we write as q(z) = Σ_{m=1}^M q_0(m) q^m(z), with Σ_{m=1}^M q_0(m) = 1 and q^m(z) = ∏_{i=1}^n q^m_i(z_i). We express this q(z) as (the image of) a product distribution over a space X of dimension d_X = 1 + Mn. Intuitively, the first dimension of X (indicated as x^0 ∈ [1, M]) labels the mixture component, and the remaining Mn dimensions (indicated as x^m_i ∈ Z_i) correspond to each of the original n dimensions for each of the M components. More precisely, write out the X-space product distribution as q_X(x) = q_0(x^0) ∏_{m=1}^M q^m(x^m), with q^m(x^m) = ∏_{i=1}^n q^m_i(x^m_i), for x = [x^0, x^1, …, x^M] and x^m = [x^m_1, …, x^m_n]. The densities in X and Z are related as usual by q(z) = Σ_x q_X(x) δ(z − ζ(x)), with the delta function acting on vectors understood component-wise. If we label the components of ζ so that z_i = ζ_i(x^0, x^1, …, x^M) ≜ x^{x^0}_i, we find

    q(z) = Σ_{x^0} q_0(x^0) Σ_{x^1,…,x^M} ∏_m q^m(x^m) ∏_i δ(z_i − ζ_i(x^0, x^1, …, x^M))
         = Σ_{x^0} q_0(x^0) Σ_{x^1,…,x^M} ∏_m q^m(x^m) ∏_i δ(z_i − x^{x^0}_i)
         = Σ_{x^0} q_0(x^0) Σ_{x^{x^0}} q^{x^0}(x^{x^0}) ∏_i δ(z_i − x^{x^0}_i)
         = Σ_{x^0} q_0(x^0) q^{x^0}(z).

Thus, under ζ the product distribution q_X is mapped to the mixture of products q(z) = Σ_m q_0(m) q^m(z) (after relabelling x^0 as m), as desired.

The maxent Lagrangian of the X product distribution q_X(x) is

    L_X(q_X) = Σ_m q_0(m) E_{q^m}(G) − T [ S(q_0) + Σ_{m=1}^M S(q^m) ].

This Lagrangian contains a term pushing us (as we search for the minimizer of that Lagrangian) to maximize the entropy of the mixture weights, but it provides no incentive for the distributions q^m to differ from each other.
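The construction above can be verified exactly on a small instance by enumerating every x, computing its product-distribution probability, and accumulating mass at z = ζ(x). This is our own sketch, for M = 3 components over n = 2 binary variables:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
M, n = 3, 2        # mixture components; original binary variables z_1, z_2

q0 = rng.dirichlet(np.ones(M))                # mixture weights q_0(m)
qm = rng.dirichlet(np.ones(2), size=(M, n))   # qm[m, i] = q^m_i over {0, 1}

# Push the product distribution over x = [x^0, x^1, ..., x^M] through
# zeta: z_i = x^{x^0}_i, by exhaustive enumeration of X.
P_Z = np.zeros((2, 2))
for x0 in range(M):
    for flat in product((0, 1), repeat=M * n):
        blocks = np.reshape(flat, (M, n))     # blocks[m, i] = x^m_i
        p_x = q0[x0] * np.prod([qm[m, i, blocks[m, i]]
                                for m in range(M) for i in range(n)])
        P_Z[tuple(blocks[x0])] += p_x         # only block x^0 is read out

# the intended mixture of products over z
mix = sum(q0[m] * np.outer(qm[m, 0], qm[m, 1]) for m in range(M))
```

The unused blocks x^m with m ≠ x^0 simply marginalize out, so P_Z coincides with the mixture, exactly as in the derivation.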
As we have argued, it is desirable to have the different mixture components capture different solutions, and so we modify the Lagrangian function slightly. If we wish the q^m to differ from one another, we instead consider the maxent Lagrangian defined over q(z). In this case

    L_Z(q) = Σ_z G(z) q(z) − T S(q) = Σ_x G(ζ(x)) q_X(x) − T S(q) = Σ_m q_0(m) E_{q^m}(G) − T S( Σ_m q_0(m) q^m(z) ).

The entropy term differs crucially in these two maxent Lagrangians. To see this, add and subtract T Σ_m q_0(m) S(q^m) to the Z Lagrangian to find

    L_Z(q) = Σ_m q_0(m) L(q^m) − T J(q),    (14)

where each L(q^m) is a maxent Lagrangian as given by Eq. (1), and J(q) ≥ 0 is a modified version of the Jensen-Shannon (JS) distance,

    J(q) = S( Σ_m q_0(m) q^m ) − Σ_m q_0(m) S(q^m) = − Σ_m Σ_z q_0(m) q^m(z) ln [ q(z) / q^m(z) ].

Conventional Jensen-Shannon distance is defined to compare two distributions to each other, and gives those distributions equal weight. In contrast, the generalized JS distance J(q) concerns multiple distributions, and weights them nonuniformly, according to q_0(m). J(q) is maximized when the q^m are all different from each other. Thus its inclusion in the Lagrangian pushes the mixture components away (in X) from one another. In this way, we can view Eq. (14) as a novel derivation of (a generalized version of) the Jensen-Shannon distance. Unfortunately, the JS distance also couples all of the variables (because of the sum inside the logarithm), which prevents a highly distributed solution. To address this, in this paper we replace J(q) in L_Z(q) with another function which lower-bounds J(q) but which requires less communication between agents. It is this modified Lagrangian that we will minimize.
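The generalized JS distance is easy to compute directly from its entropy form, which makes its role in Eq. (14) concrete: it vanishes when all components coincide and grows as they separate. A minimal sketch of our own:

```python
import numpy as np

def gen_js(q0, comps):
    """Generalized Jensen-Shannon distance J(q): entropy of the mixture
    minus the q0-weighted entropies of its components (in nats)."""
    comps = np.asarray(comps, dtype=float)
    mix = q0 @ comps                                   # sum_m q0(m) q^m
    ent = lambda p: -np.sum(p * np.log(np.where(p > 0, p, 1.0)))
    return ent(mix) - q0 @ np.array([ent(p) for p in comps])

q0 = np.array([0.3, 0.7])
J_same = gen_js(q0, [[0.5, 0.5], [0.5, 0.5]])  # identical components
J_diff = gen_js(q0, [[1.0, 0.0], [0.0, 1.0]])  # maximally different components
```

With identical components J(q) = 0, while with disjoint components it attains the entropy of the weights S(q_0), its maximum; subtracting −T J(q) in Eq. (14) therefore rewards component diversity.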
4.4 A Variational Lagrangian

Following (Jaakkola & Jordan 1998), we introduce M variational functions w(z|m) and lower-bound the true JS distance. We begin with the identity

    J(q) = Σ_m Σ_z q_0(m) q^m(z) ln w(z|m) − Σ_m q_0(m) ln q_0(m) − Σ_m Σ_z q_0(m) q^m(z) ln [ w(z|m) q(z) / (q_0(m) q^m(z)) ].

Now replace the M −ln terms with the lower bound −ln u ≥ −νu + ln ν + 1, obtained from the Legendre dual of the logarithm, to find

    J(q) ≥ J(q, w, ν) ≜ Σ_m Σ_z q_0(m) q^m(z) ln w(z|m) − Σ_m q_0(m) ln q_0(m) − Σ_m ν_m Σ_z w(z|m) q(z) + Σ_m q_0(m) ln ν_m + 1.

Optimization over w and ν maximizes this lower bound. To further aid in distributing the algorithm we restrict the class of variational w(z|m) to products: w(z|m) = ∏_i w_i(z_i|m). For this choice

    J(q, w, ν) = Σ_m q_0(m) [ B^{m,m} − Σ_{m̃} A^{m,m̃} ν_{m̃} + ln ν_m ] + S(q_0) + 1,    (15)

where A^{m̃,m}_i ≜ Σ_{z_i} q^{m̃}_i(z_i) w_i(z_i|m), A^{m̃,m} ≜ ∏_{i=1}^n A^{m̃,m}_i, B^{m,m}_i ≜ Σ_{z_i} q^m_i(z_i) ln w_i(z_i|m), and B^{m,m} ≜ Σ_{i=1}^n B^{m,m}_i. At any temperature T, the variational Lagrangian which must be minimized with respect to q, w and ν (subject to appropriate positivity and normalization constraints) is then

    L_Z(q, w, ν) = Σ_m q_0(m) L(q^m) − T J(q, w, ν),    (16)

with J(q, w, ν) given by Eq. (15).

5 Minimizing the Mixture Distribution Lagrangian

Equating the gradients with respect to w and ν to zero gives

    1/ν_m = (1/q_0(m)) Σ_{m̃} q_0(m̃) A^{m̃,m},    (17)

    w_i(z_i|m) ∝ q_0(m) q^m_i(z_i) ν_m [ Σ_{m̃} q_0(m̃) q^{m̃}_i(z_i) A^{m̃,m} / A^{m̃,m}_i ]^{−1}.
(18)

The dependence of L_Z on q_0(m) is particularly simple: up to q_0-independent terms,

    L_Z(q, w, ν) ≈ Σ_m q_0(m) E(m) − T [ S(q_0) + 1 ],

where

    E(m) ≜ E_{q^m}(G) − T ( S(q^m) + B^{m,m} − Σ_{m̃} A^{m,m̃} ν_{m̃} + ln ν_m ).

Thus the mixture weights are Boltzmann distributed with energy function E(m):

    q_0(m) = exp(−E(m)/T) / Σ_{m̃} exp(−E(m̃)/T).    (19)

The determination of q^m_i(z_i) is similar. The relevant terms of L_Z involving q^m_i(z_i) are L_Z ≈ q_0(m) [ Σ_{z_i} E^m(z_i) q^m_i(z_i) − T S(q^m_i) ], where

    E^m(z_i) = E_{q^m_{−i}}(G | z_i) − T [ ln w_i(z_i|m) − Σ_{m̃} (A^{m,m̃} / A^{m,m̃}_i) ν_{m̃} w_i(z_i|m̃) ].

As before, the conditional expectation E_{q^m_{−i}}(G | z_i) is Σ_{z_{−i}} G(z_i, z_{−i}) q^m_{−i}(z_{−i}). The mixture probabilities are thus determined as

    q^m_i(z_i) = exp(−E^m(z_i)/T) / Σ_{z_i} exp(−E^m(z_i)/T).    (20)

5.1 Agent Communication

As desired, these results require minimal communication between agents. A supervisory agent, call this the 0-agent, is assigned to manage the determination of q_0(m), and (i, m)-agents manage the determination of q^m_i(z_i). The M (i, m)-agents for a fixed i communicate their w_i(z_i|m) to determine A^{m,m̃}_i. These results, along with the B^{m,m}_i from each (i, m)-agent, are then forwarded to the 0-agent, who forms A^{m,m̃} and B^{m,m} and broadcasts these back to all (i, m)-agents. With these quantities and the local estimates of E_{q^m_{−i}}(G | z_i), all q^m_i can be updated independently.

6 Experiments

In this section we demonstrate our methods on some simple problems. These examples are illustrative only; for further examples the reader is directed to (Bieniawski & Wolpert 2004a, Bieniawski et al. 2004, Bieniawski & Wolpert 2004b, Lee & Wolpert 2004, Wolpert & Lee 2004) for related experiments.
We test the mixture semicoordinate probability collective method on two different problems: a k-sat constraint satisfaction problem having multiple feasible solutions, and unconstrained minimization of an NK function.

6.1 k-sat

The k-sat problem is perhaps the best studied CSP (Mezard, Parisi & Zecchina 2002). The goal is to assign N binary variables (labelled z_i) true/false values so that C disjunctive clauses are satisfied. The ath clause involves k variables, labelled by v_{a,j} ∈ [1, N] (for j ∈ [1, k]), and k binary values associated with each a and labelled by σ_{a,j}. The ath clause is satisfied iff ∨_{j=1}^k [z_{v_{a,j}} = σ_{a,j}], so we define the ath constraint as

    c_a(z) = 0 if ∨_{j=1}^k [z_{v_{a,j}} = σ_{a,j}], and 1 otherwise.

As the ath clause is violated only when all z_{v_{a,j}} = σ̄_{a,j} (with σ̄ defined to be not σ), the Lagrangian over product distributions can be written as L(q) = λ^⊤ c(q) − T S(q), where c(q) is the C-vector of expected constraint violations, and λ is the C-vector of Lagrange multipliers. The ath component of c(q) is the expected violation of the ath clause, and is given by

    c_a(q) ≜ Σ_z c_a(z) q(z) = ∏_{j=1}^k q_{v_{a,j}}(σ̄_{a,j}).

Note that no Monte Carlo estimates are required to evaluate this quantity. Further, the only communication required to evaluate G and its conditional expectations is between agents appearing in the same clause. Typically this communication network is sparse; for the N = 100, k = 3, C = 430 problem we consider here, each agent interacts with only 6 other agents on average.

We first present results for a single product distribution. For any fixed setting of the Lagrange multipliers, the Lagrangian is minimized by iterating Eq. (10).
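The closed form for c_a(q) above, the product of the probabilities that each literal is missed, can be checked against brute-force enumeration on a toy clause. The helper below is our own sketch:

```python
import numpy as np
from itertools import product

def expected_violation(clause, q):
    """c_a(q) = prod_j q_{v_aj}(sigma-bar_aj): probability that every
    literal of the clause is missed under the product distribution q.
    clause: list of (variable index, sigma) pairs; q[i] = Prob(z_i = 1)."""
    p = 1.0
    for v, sigma in clause:
        p *= (1.0 - q[v]) if sigma == 1 else q[v]
    return p

# brute-force check on a toy clause (z_0 OR not-z_1 OR z_2)
clause = [(0, 1), (1, 0), (2, 1)]
q = np.array([0.2, 0.7, 0.4])
brute = sum(
    np.prod([q[i] if z[i] else 1.0 - q[i] for i in range(3)])
    for z in product((0, 1), repeat=3)
    if not any(z[v] == s for v, s in clause)   # clause violated at z
)
```

Only the single assignment that misses every literal violates the clause, which is why the expectation collapses to a k-term product and needs no sampling.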
If the minimization is done by the Brouwer method, any random subset of variables, no two of which appear in the same clause, can be updated simultaneously. This eliminates "thrashing" and ensures that the Lagrangian decreases at each iteration. The minimization is terminated at a local minimum q*, which is detected when the norm of the gradient falls below a threshold. If all constraints are satisfied at q* we return the solution z* = arg max_z q*(z); otherwise the Lagrange multipliers are updated according to Eq. (11).

Figure 3: (a) Evolution of Lagrangian value (solid line), expected constraint violation (dotted line), and constraint violations of most likely configuration (dashed line). (b) P(G) after minimizing the Lagrangian for the first 3 multiplier settings. At termination P(G) = \delta(G).

Figure 4: Each constraint's Lagrange multiplier versus the iterations when they change.

In the present context, this updating rule offers a number of benefits. Firstly, those constraints which are violated most strongly have their penalty increased the most, and consequently, the agents involved in those constraints are most likely to alter their state. Secondly, the Lagrange multipliers contain a history of the constraint violations (since we keep adding to \lambda), so that when the agents coordinate on their next move they are unlikely to return to a previously violated state. This mimics the approach used in taboo search, where revisiting of configurations is explicitly prevented, and aids in an efficient exploration of the search space.
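The conflict-free update subsets used in this minimization can be sketched as follows (a greedy construction of ours; the paper does not prescribe any particular way of building such a subset):

```python
import random

def conflict_free_subset(n_vars, clauses, seed=None):
    """Pick a random subset of variables, no two of which appear in the
    same clause, so they can be Brouwer-updated simultaneously."""
    rng = random.Random(seed)
    order = list(range(n_vars))
    rng.shuffle(order)
    # map each variable to the set of variables it shares a clause with
    neighbors = {i: set() for i in range(n_vars)}
    for clause in clauses:
        for v in clause:
            neighbors[v].update(clause)
    chosen, blocked = [], set()
    for v in order:
        if v not in blocked:
            chosen.append(v)
            blocked |= neighbors[v]
    return chosen
```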
Lastly, rescaling the Lagrangian after each update of the multipliers by 1^\top\lambda \triangleq \sum_a \lambda_a gives L(q) = \hat\lambda^\top c(q) - \hat T S(q), where \hat\lambda = \lambda / 1^\top\lambda and \hat T = T / 1^\top\lambda. Since \sum_a \hat\lambda_a = 1, the first term reweights clauses according to their expected violation, while the temperature \hat T cools in an automated annealing schedule as the Lagrange multipliers increase. Cooling is most rapid when the expected constraint violation is large, and slows as the optimum is approached. The parameters \alpha_\lambda^t thus govern the overall rate of cooling. We used the fixed value \alpha_\lambda^t = 0.5 in the reported experiments.

Figure 3 presents results for a 100-variable k = 3 problem using a single mixture. The problem is the satisfiable formula uf100-01.cnf from SATLIB (www.satlib.org). It was generated with the ratio of clauses to variables near the phase transition, and consequently has few solutions. Fig. 3(a) shows the variation of the Lagrangian, the expected number of constraint violations, and the number of constraints violated in the most probable state z_mp \triangleq arg max_z q(z), as a function of the number of iterations. The starting state is the maximum entropy configuration of uniform q_i, and the starting temperature is T = 1.5 \cdot 10^{-3}. The iterations at which the Lagrange multipliers are updated are indicated by vertical dashed lines, which are clearly visible as discontinuities in the Lagrangian values. To show the stochastic underpinnings of the algorithm, we plot in Fig. 3(b) the probability density of the number of constraint violations, obtained as P(G) = \sum_z q(z)\, \delta\big( G - \sum_a c_a(z) \big).^{12} We see the downward movement of the expected value of G as the algorithm progresses. Figure 4 shows the evolution of the renormalized Lagrange multipliers \hat\lambda. At the first iteration the multipliers for all clauses are equal.
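One pass of this automated annealing looks roughly as follows (the additive step, the function names, and the fixed step size are our simplification of Eq. (11) and the rescaling described above):

```python
def multiplier_step(lmbda, violations, alpha=0.5):
    """Eq. (11)-style additive update: clauses with the largest expected
    violation get the largest penalty increase (alpha is the step size)."""
    return [l + alpha * c for l, c in zip(lmbda, violations)]

def rescale(lmbda, T):
    """lambda-hat = lambda / 1^T lambda sums to one, and
    T-hat = T / 1^T lambda cools as the multipliers accumulate."""
    total = sum(lmbda)
    return [l / total for l in lmbda], T / total
```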
As the algorithm progresses, weight is shifted amongst difficult-to-satisfy clauses.

Results on a larger problem with multiple mixtures are shown in Fig. 5(a). This is the 250-variable/1065-clause problem uf250-01.cnf from SATLIB, with the first 50 clauses removed so that the problem has multiple solutions. The optimization is performed by selecting a random subset of variables, no two of which appear in the same clause, at each iteration, and updating according to Eqs. (17), (18), (19), and (20). After convergence to a local minimum, the Lagrange multipliers are updated as above according to the expected constraint violation. The initial temperature is 10^{-1}. We plot the number of constraints violated in the most probable state of each mixture as a function of the number of updates, as well as the expected number of violated constraints. After 8000 steps three distinct solutions are found, along with a fourth configuration which violates a single constraint.

6.2 Minimization of NK Functions

Next we consider an unconstrained discrete optimization problem. The NK model defines a family of tunably difficult optimization problems (Kauffman & Levin 1987). The objective over N binary variables \{z_i\}_{i=1}^N is defined as the average of N randomly generated contributions, each depending on z_i and K other randomly chosen variables z_i^1, ..., z_i^K:

    G(z) = N^{-1} \sum_{i=1}^N E_i(z_i; z_i^1, \ldots, z_i^K).

For each of the 2^{K+1} local configurations, E_i is assigned a value drawn uniformly from 0 to 1. K, which ranges between 0 and N - 1, controls the number of local minima; under Hamming neighborhoods, K = 0 optimization landscapes have a single global optimum, and K = N - 1 landscapes have on average 2^N / (N + 1) local minima. Further properties of NK landscapes may be found in (Durrett & Limic 2003). Fig.
5(b) plots the energy of a 5-mixture model for a multi-modal N = 300, K = 2 function. The K spins other than i upon which E_i depends were selected at random. At termination of the PC algorithm (at a small but non-zero temperature), five distinct configurations are obtained, with the nearest pair of solutions having Hamming distance 12. Note that unlike the k-sat problem, which has multiple configurations all having the same global minimal energy, the JS distance (the dashed curve) of Fig. 5(b) drops to zero as the temperature decreases. This is because at exactly zero temperature there is no term forcing different solutions, and the Lagrangian is minimized by having all mixtures converge to delta functions at the lowest objective value configuration. The effect of mixtures enables the algorithm to simultaneously explore multiple local extrema before converging on the lowest objective solution.

Figure 5: (a) The solid curves show the number of unsatisfied clauses in the most probable configuration z_mp of each of the 4 mixtures vs iterations. The topmost solid black line plots the expected number of violations, and the dashed black line shows the approximation to the JS distance. (b) The solid curves show the evolution of the G value of the best z_mp configurations for each of 5 mixtures versus number of iterations. The dashed black line shows the corresponding approximation to the JS distance.

7 Relation of PC to other work

There has been much work from many fields related to PC.
The maxent Lagrangian has been used in statistical physics for over a century under the rubric of "free energy". Its derivation in terms of information-theoretic statistical inference was by Jaynes (Jaynes 1957). The maxent Lagrangian has also appeared occasionally in game theory as a heuristic, without a statistical inference justification (be it information-theoretic or otherwise) (Fudenberg & Levine 1998, Shamma & Arslan 2004).^{13} In none of this earlier work is there an appreciation for its relationship with the related work in other fields.

In the context of distributed control/optimization, the distinguishing feature of PC is that it does not view the variable x as the fundamental object, but rather the distribution across it, q. Samples of that distribution are not the direct object of interest, and in fact are only used if necessary to estimate functionals of q. The fundamental objective function is stated in terms of q. As explicated in the next subsection, the associated optimization algorithms are related to some work in several fields. Heretofore those fields have been unaware of each other, and of the breadth of their relation to information theory and game theory.

Finally, we note that the maxent or qp Lagrangian L(q) = E_q(G) - T S(q) can be viewed as a barrier-function (interior point) method with objective E_q(G). An entropic barrier function is used to enforce the constraints q_i(x_i) \ge 0 for all i and x_i, with the constraint that all q_i sum to 1 being implicit.

7.1 Various schemes for updating q

We have seen that the qp Lagrangian is minimized by the product distribution q given by

    q_i(x_i) \propto \exp\big( -E_{q_{-i}}(G | x_i) / T \big).    (21)

Direct application of these equations that minimize the Lagrangian forms the basis of the Brouwer update rules. Alternatively, steepest descent of the maxent Lagrangian forms the basis of the Nearest Newton algorithm.
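For intuition, one parallel Brouwer step of Eq. (21) can be written out by brute force for tiny binary problems (a sketch of ours; the exhaustive enumeration of joint moves is only for illustration, and all names are assumptions):

```python
import itertools
import numpy as np

def brouwer_step(q, G, T):
    """One parallel Brouwer update, Eq. (21):
    q_i(x_i) proportional to exp(-E_{q_{-i}}(G|x_i)/T).

    q: list of length-2 probability vectors over each agent's binary move.
    G: function mapping a joint 0/1 tuple to a real value.
    """
    n = len(q)
    new_q = []
    for i in range(n):
        cond = np.zeros(2)  # conditional expectations E_{q_{-i}}(G|x_i)
        for x in itertools.product((0, 1), repeat=n):
            # weight of the other agents' joint move under q_{-i}
            p = np.prod([q[j][x[j]] for j in range(n) if j != i])
            cond[x[i]] += p * G(x)
        w = np.exp(-(cond - cond.min()) / T)
        new_q.append(w / w.sum())
    return new_q
```

At low temperature each agent's distribution sharpens onto the move with the best conditional expectation.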
These update rules have analogues in conventional (non-PC) optimization. For example, Nearest Newton is based on Newton's method, and Brouwer updating is similar to block relaxation. This is one of the advantages of embedding the original optimization problem involving x in a problem involving distributions across x: it allows us to solve problems over non-Euclidean (e.g., countable) spaces using the powerful methods already well understood for optimization over Euclidean spaces.

However, there are other PC update rules that have no direct analogue in such well-understood methods for Euclidean space optimization. Algorithms may be developed that minimize the pq Lagrangian D(p^T \| q), where p^T(x) = \exp(-G(x)/T) / Z(T), with Z(T) being the normalization of the Boltzmann distribution. The pq Lagrangian is minimized by the product of the marginals of the Boltzmann distribution, i.e., q_i(x_i) = \int dx_{-i}\, p^T(x). Another example of update rules without Euclidean analogues are the iterative focusing update rules described in (Wolpert 2004a). Iterative focusing updates are intrinsically tied to the fact that we are minimizing (in the distribution setting) an expectation value.

A subset of update rules arising from qp and pq Lagrangians are described in (Wolpert 2004a). In all cases, the updates may be written as multiplicative updating of q. The following is a list of the update ratios r_{q,i}(x_i) \equiv q_i^{t+1}(x_i) / q_i^t(x_i) of some of those rules. In all of the rules in this list, F_G is a probability distribution over x that never increases between two x's if G does (e.g., a Boltzmann distribution in G(x)). In addition, const is a scalar that ensures the new distribution is properly normalized, and \alpha is a stepsize.^{14} Finally, "iterative focusing" is a technique that repeats the following process: It takes a distribution \tilde q as input.
It then produces as output a new distribution that is \tilde q "focused" more tightly about x's with good G(x). That focused distribution becomes \tilde q for the next iteration. See (Wolpert et al. 2006).

Gradient descent of qp distance to F_G:

    1 - \alpha \frac{\ln q_i^t(x_i) - E_{q^t}(\ln F_G | x_i)}{q_i^t(x_i)} - \frac{const}{q_i^t(x_i)}    (22)

Nearest Newton descent of qp distance to F_G:

    1 - \alpha \big( \ln q_i^t(x_i) - E_{q^t}(\ln F_G | x_i) \big) - const    (23)

Brouwer updating for qp distance to F_G:

    const \times \frac{e^{E_{q^t}(\ln F_G | x_i)}}{q_i^t(x_i)}    (24)

Importance sampling minimization of pq distance to F_G:

    const \times E_{q^t}\Big( \frac{F_G}{q^t} \,\Big|\, x_i \Big)    (25)

Iterative focusing of \tilde q with focusing function F_G using qp distance and gradient descent:

    1 - \alpha \frac{\ln\big( q_i^t(x_i) / \tilde q_i(x_i) \big) - E_{q^t}(\ln F_G | x_i)}{q_i^t(x_i)} - \frac{const}{q_i^t(x_i)}    (26)

Iterative focusing of \tilde q with focusing function F_G using qp distance and Nearest Newton:

    1 - \alpha \big( \ln\big( q_i^t(x_i) / \tilde q_i(x_i) \big) - E_{q^t}(\ln F_G | x_i) \big) - const    (27)

Iterative focusing of \tilde q with focusing function F_G using qp distance and Brouwer updating:

    const \times e^{E_{q^t}(\ln F_G | x_i)} \times \frac{\tilde q_i(x_i)}{q_i^t(x_i)}    (28)

Iterative focusing of \tilde q with focusing function F_G using pq distance:

    const \times E_{\tilde q}(F_G | x_i) \times \frac{\tilde q_i(x_i)}{q_i^t(x_i)}    (29)

Note that some of these update ratios are themselves proper probability distributions, e.g., the Nearest Newton update ratio. This list highlights the ability to go beyond conventional Euclidean optimization update rules, and is an advantage of embedding the original optimization problem in a problem over a space of probability distributions. Another advantage is the fact that the distribution itself provides much useful information (e.g., sensitivity information). Yet another advantage is the natural use of Monte Carlo techniques that arise with the embedding, which allow the optimization to be used for adaptive control.
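For concreteness, here is how a ratio of this kind can be applied as a multiplicative update, using the Nearest Newton form of Eq. (23) with const resolved in closed form (a sketch of ours; cond_lnF stands for the conditional expectations E_{q^t}(ln F_G | x_i), which we assume are given):

```python
import numpy as np

def nearest_newton_step(q_i, cond_lnF, alpha=0.1):
    """Multiply q_i by the Eq. (23) ratio
    r(x_i) = 1 - alpha*(ln q_i(x_i) - E_{q^t}(ln F_G|x_i)) - const.
    Choosing const = -alpha * E_{q_i}[s], with s the bracketed term,
    makes the updated distribution sum to one exactly."""
    q_i = np.asarray(q_i, dtype=float)
    s = np.log(q_i) - np.asarray(cond_lnF, dtype=float)
    r = 1.0 - alpha * (s - np.dot(q_i, s))
    return q_i * r
```

For small enough alpha the ratio stays positive, and moves with larger E(ln F_G | x_i) gain probability.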
7.1.1 Relation of PC to other distribution-based optimization algorithms

There is some work on optimization that precedes PC and that has directly considered the distribution q as the object of interest. Much of this work can be viewed as special cases of PC. In particular, deterministic annealing (Duda et al. 2000) is "bare-bones" parallel Brouwer updating. This involves no data-aging (or any other scheme to avoid thrashing of the agents), difference utilities, etc.^{15} More tantalizingly, the technique of probability matching (Sabes & Jordan 1996) uses Monte Carlo sampling to optimize a functional of q. This work was in the context of a single agent, and did not exploit techniques like data-aging. Unfortunately, this work was not pursued.

Other work has both viewed q as the fundamental object of interest and used techniques like data-aging and difference utilities. In particular, this is the case with the COllective INtelligence (COIN) work (Wolpert, Tumer & Frank 1999, Wolpert et al. 2003, Wolpert 2003c). However, this work was not based on information-theoretic considerations and had no explicit objective function for q. It was the introduction of such considerations that resulted in PC.

Another interesting body of early work is the cross entropy (CE) method (Rubinstein & Kroese 2004). This work is the same as iterative focusing using pq distance. The CE method does not consider the formal difficulty with iterative focusing identified in (Wolpert et al. 2006), or any of the potential solutions to that problem discussed there. That difficulty means that even with no sampling error (i.e., even if all estimated quantities are in fact exactly correct), in general there are no guarantees that the algorithm converges to an optimal x.

Other early work grew out of the Genetic Algorithms community. This work was initiated with MIMIC (De Bonet, Isbell Jr.
& Viola 1997), and has since developed into the Estimation of Distribution Algorithms (EDA) approach (Lozano, Larrañaga, Inza & Bengoetxa 2006). A number of important issues were raised in this early work, e.g., the importance of information-theoretic concepts. For the most part, however, especially in its early stages (e.g., with MIMIC), this work viewed the samples as the fundamental object of interest, rather than viewing the distribution being sampled that way. Little concern arises for what the objective function of the distribution being sampled should be, and for how samples can be used to achieve that optimization. This means that there is little concern for issues like the (lack of) convexity of that implicit objective function. This prevents EDAs from fully exploiting the power of continuous space optimization, e.g., the absence of local minima in convex problems (Boyd & Vandenberghe 2003). Similarly, it means that with EDAs there is no concern for cases where the distribution objective function can be optimized in closed form, without any need for sampling at all. Nor is there widespread appreciation for how old sample x's can be re-used to help guide the optimization (as, for example, they are in PC's adaptive importance sampling (Wolpert et al. 2006)). This contrasts with PC, whose distinguishing feature is that it does not treat the variable x as the fundamental object to be optimized, but rather the distribution across it, q. So, for example, in PC, samples of that distribution are only used if necessary to estimate quantities that cannot be evaluated other ways; the fundamental objective function is stated in terms of q. Indeed, in the cases considered in this paper, as well as in earlier PC work like that reported in (Macready & Wolpert 2004), no such samples of q arise.
Shortly after the introduction of PC, a variant of its Monte Carlo version of parallel Brouwer updating was introduced, called the MCE method (Rubenstein 2005). In this variant the annealing of the Lagrangian doesn't involve changing the temperature \beta, but instead changing the value of a constraint specifying E_q(G). Accordingly, rather than jump directly to the (\beta-specified) solution given above, one has to solve a set of coupled nonlinear equations relating all the q_i. (Another distinguishing feature is that no data-aging, difference utilities, or the like are used in the MCE method.) The MCE method has been justified with the KL argument reviewed above rather than with "ratchet"-based maximum entropy arguments. This has redrawn attention to the role of the argument-ordering of the KL distance, and how it relates Brouwer updating and the CE method.

Another body of work related to PC is (loopy) propagation algorithms, Bethe approximations, and the like (Mackay 2003). These techniques can be seen as an alternative to semicoordinate transformations for how to go beyond product distributions. Unlike those approaches, we are guaranteed to reach a local minimum of free energy. (If we were to use pq KL distance rather than qp KL distance, we would get to a global minimum of free energy.) In addition, via utilities like AU and WLU (see appendix), we can exploit variance-reduction techniques absent from those other techniques. Similarly, those other techniques do not make use of data-aging.

Finally, there is also work that has been done after the introduction of both the CE method and PC that is closely related to both. Such methods are typically sample-based, and a good summary of these approaches can be found in (Mühlenbein & Höns 2005).
The use of junction tree factorizations for q utilized in the above methods can be extended to PC theory, and results along these lines will be presented elsewhere.

8 Conclusion

A distributed constrained optimization framework based on probability collectives has been presented. Motivation for the framework was drawn from an extension of full-rationality game theory to bounded rational agents. An algorithm that is capable of obtaining one or more solutions simultaneously was developed and demonstrated on two problems. The results show a promising, highly distributed, off-the-shelf approach to constrained optimization.

There are many avenues for future exploration. Alternatives to the Lagrange multiplier method used here can be developed for constraint satisfaction problems. By viewing the constraints as separate objectives, a Pareto-like optimization procedure may be developed whereby a gradient direction is chosen which is constrained so that no constraints are worsened. This idea is motivated by the highly successful WalkSAT (Selman, Kautz & Cohen 1996) algorithm for k-sat, in which spins are flipped only if no previously satisfied clause becomes unsatisfied as a result of the change.

Probability collectives also offer promise in devising new methods for escaping local minima. Unlike traditional optimization methods, where monotonic transformations of the objective leave local minima unchanged, such transformations will alter the local minima structure of the Lagrangian. This observation, and alternative Lagrangians (see (Rubinstein 2001) for a related approach using a different minimization criterion), offer new approaches for improved optimization.

ACKNOWLEDGEMENTS We would like to thank Charlie Strauss for illuminating conversation, and Bill Dunbar and George Judge for helpful comments on the manuscript.

Notes

1 |S| denotes the number of elements in the set S.
2 In this paper vectors are indicated in bold font and scalars are in regular font.

3 Despite its popularity, the KL distance D(q \| p) between two probability distributions q and p is not a proper metric. It is not even symmetric between q and p. However, it is non-negative, and equals zero only when q = p.

4 Here and throughout, we fix the convention that it is desirable to minimize objective functions, not to maximize them.

5 We adopt the notation that q_{-i} indicates the distribution q with variable i marginalized out, i.e., the product \prod_{j \ne i} q_j. Analogously, x_{-i} \triangleq [x_1, ..., x_{i-1}, x_{i+1}, ..., x_n].

6 Properly speaking, one should find the global minimizer q. Here we content ourselves with finding local minima.

7 N.b., we do not project onto Q but rather add a vector to get back to it. See (Wolpert & Bieniawski 2004b).

8 For such a distribution we relax the requirement of being a product or having any other particular form; we only require that 0 \le \pi(x) \le 1 for all x and \sum_x \pi(x) = 1.

9 The Jacobian factor is irrelevant as \zeta is a permutation.

10 Alternatively, we can parameterize a smaller space of candidate permutations and select the best from amongst this candidate set.

11 Note that if w_i(z_i|m) = 1/|Z_i| is uniform across z_i, then A_i^{\tilde m,m} = 1/|Z_i| and B_i^{m,m} = -\ln|Z_i|. Maximizing over \nu_m we find that J(q, w = 1/|Z|, \nu = \nu^*) = 0. Thus, maximizing with respect to w increases the JS distance from 0.

12 In determining the density, 10^4 samples were drawn from q(z), with Gaussians centered at each sampled value of G and with the width of all Gaussians determined by cross-validation of the log likelihood. The fact that there is non-zero probability of obtaining non-integral numbers of constraint violations is an artifact of the finite width of the Gaussians.

13 However, an attempt at a first-principles derivation can be found in (Meginniss 1976).
14 As a practical matter, both Nearest Newton and gradient-based updating have to be modified in a particular step if their step size is large enough that they would otherwise take one off the unit simplex. This changes the update ratio for that step. See (Wolpert & Bieniawski 2004b).

15 Indeed, as conventionally cast, deterministic annealing assumes the conditional G can be evaluated in closed form, and therefore has no concern for Monte Carlo sampling issues.

References

Antoine, N., Bieniawski, S., Kroo, I. & Wolpert, D. H. (2004), Fleet assignment using collective intelligence, in 'Proceedings of 42nd Aerospace Sciences Meeting'. AIAA-2004-0622.

Aumann, R. & Hart, S. (1992), Handbook of Game Theory with Economic Applications, North-Holland Press.

Aumann, R. J. (1987), 'Correlated equilibrium as an expression of Bayesian rationality', Econometrica 55(1), 1-18.

Basar, T. & Olsder, G. (1999), Dynamic Noncooperative Game Theory, Siam, Philadelphia, PA. Second Edition.

Bertsekas, D. (1996), Constrained Optimization and Lagrange Multiplier Methods, Athena Scientific, Belmont, MA.

Bieniawski, S. & Wolpert, D. H. (2004a), Adaptive, distributed control of constrained multi-agent systems, in 'Proceedings of AAMAS 04'.

Bieniawski, S. & Wolpert, D. H. (2004b), Using product distributions for distributed optimization, in 'Proceedings of ICCS 04'.

Bieniawski, S., Wolpert, D. H. & Kroo, I. (2004), Discrete, continuous, and constrained optimization using collectives, in 'Proceedings of 10th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, Albany, New York'.

Bonabeau, E., Dorigo, M. & Theraulaz, G. (2000), 'Inspiration for optimization from social insect behaviour', Nature 406(6791), 39-42.

Boyd, S. & Vandenberghe, L. (2003), Convex Optimization, Cambridge University Press.

Cover, T. & Thomas, J.
(1991), Elements of Information Theory, Wiley-Interscience, New York.

De Bonet, J., Isbell Jr., C. & Viola, P. (1997), Mimic: Finding optima by estimating probability densities, in 'Advances in Neural Information Processing Systems - 9', MIT Press.

Dechter, R. (2003), Constraint Processing, Morgan Kaufmann.

Duda, R. O., Hart, P. E. & Stork, D. G. (2000), Pattern Classification (2nd ed.), Wiley and Sons.

Durrett, R. & Limic, V. (2003), 'Rigorous results for the NK model', Ann. Probab. 31(4), 1713-1753. Available at http://www.math.cornell.edu/~durrett/NK/.

Fudenberg, D. & Levine, D. K. (1998), The Theory of Learning in Games, MIT Press, Cambridge, MA.

Fudenberg, D. & Tirole, J. (1991), Game Theory, MIT Press, Cambridge, MA.

Grantham, W. J. (2005), Gradient transformation trajectory following algorithms for determining stationary min-max saddle points, in 'Advances in Dynamic Game Theory and Applications', Birkhauser.

Jaakkola, T. S. & Jordan, M. I. (1998), Improving the mean field approximation via the use of mixture distributions, in M. I. Jordan, ed., 'Learning in Graphical Models', Kluwer Academic.

Jaynes, E. T. (1957), 'Information theory and statistical mechanics', Physical Review 106, 620.

Kauffman, S. A. & Levin, S. A. (1987), 'Towards a general theory of adaptive walks on rugged landscapes', Journal of Theoretical Biology 128, 11-45.

Lee, C. F. & Wolpert, D. H. (2004), Product distribution theory for control of multi-agent systems, in 'Proceedings of AAMAS 04'.

Lozano, J., Larrañaga, P., Inza, I. & Bengoetxa, E. (2006), Towards a New Evolutionary Computation. Advances in Estimation of Distribution Algorithms, Springer Verlag.

Mackay, D. (2003), Information theory, inference, and learning algorithms, Cambridge University Press.

Macready, W. & Wolpert, D. H. (2004), Distributed optimization, in 'Proceedings of ICCS 04'.

Meginniss, J. R.
(1976), 'A new class of symmetric utility rules for gambles, subjective marginal probability functions, and a generalized Bayes rule', Proc. of the American Statistical Association, Business and Economics Statistics Section, pp. 471-476.

Mezard, M., Parisi, G. & Zecchina, R. (2002), 'Analytic and algorithmic solution of random satisfiability problems', Science.

Mühlenbein, M. & Höns, R. (2005), 'The estimation of distributions and the minimum relative entropy', Evolutionary Computation 13(1), 1-27.

Osborne, M. & Rubenstein, A. (1994), A Course in Game Theory, MIT Press, Cambridge, MA.

Rubenstein, R. (2005), The stochastic minimum cross-entropy method for combinatorial optimization and rare-event estimation. Unpublished.

Rubinstein, R. & Kroese, D. (2004), The Cross-Entropy Method, Springer.

Rubinstein, R. Y. (2001), Combinatorial optimization via cross-entropy, in S. Gass & C. Harris, eds, 'Encyclopedia of Operations Research and Management Sciences', Kluwer.

Sabes, P. N. & Jordan, M. I. (1996), Reinforcement learning by probability matching, in G. Tesauro, M. Mozer & M. Hasselmo, eds, 'Advances in Neural Information Processing Systems 8', MIT Press, Cambridge, MA, pp. 1080-1086.

Selman, B., Kautz, H. & Cohen, B. (1996), Local search strategies for satisfiability testing, in D. S. Johnson & M. A. Trick, eds, 'Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge, October 11-13, 1993', Vol. 26 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, AMS.

Shamma, J. & Arslan, G. (2004), Dynamic fictitious play, dynamic gradient play, and distributed convergence to Nash equilibria. Submitted.

Wolpert, D. H. (2003a), 'Factoring a canonical ensemble'. Preprint cond-mat/0307630.

Wolpert, D. H. (2003b), Theory of collective intelligence, in K. Tumer & D. H.
Wolpert, eds, 'Collectives and the Design of Complex Systems', Springer, New York.

Wolpert, D. H. (2003c), Theory of collectives, in 'The Design and Analysis of Collectives', Springer-Verlag, New York. Available at http://ic.arc.nasa.gov/~dhw.

Wolpert, D. H. (2004a), Finding bounded rational equilibria part 1: Iterative focusing, in 'Proceedings of the International Society of Dynamic Games Conference, 2004'. In press.

Wolpert, D. H. (2004b), Information theory - the bridge connecting bounded rational game theory and statistical physics, in D. Braha & Y. Bar-Yam, eds, 'Complex Engineering Systems'.

Wolpert, D. H. (2004c), What information theory says about best response, binding contracts, and collective intelligence, in A. N. et al, ed., 'Proceedings of WEHIA04', Springer Verlag.

Wolpert, D. H. & Bieniawski, S. (2004a), Adaptive distributed control: beyond single-instant categorical variables, in A. S. et al, ed., 'Proceedings of MSRAS04', Springer Verlag.

Wolpert, D. H. & Bieniawski, S. (2004b), Distributed control by Lagrangian steepest descent, in 'Proceedings of CDC 04'.

Wolpert, D. H. & Lee, C. F. (2004), Adaptive Metropolis-Hastings sampling using product distributions, in 'Proceedings of ICCS 04'.

Wolpert, D. H. & Macready, W. G. (1997), 'No free lunch theorems for optimization', IEEE Transactions on Evolutionary Computation 1(1), 67-82. Best Paper Award.

Wolpert, D. H., Strauss, C. E. M. & Rajnayaran, D. (2006), 'Advances in distributed optimization using probability collectives', Advances in Complex Systems. In press.

Wolpert, D. H. & Tumer, K. (2001), 'Optimal payoff functions for members of collectives', Advances in Complex Systems 4(2/3), 265-279.

Wolpert, D. H. & Tumer, K. (2002), 'Collective intelligence, data routing and Braess' paradox', Journal of Artificial Intelligence Research.

Wolpert, D. H., Tumer, K. & Bandari, E.
(2003), 'Improving search by using intelligent coordinates', Physical Review E. In press.

Wolpert, D. H., Tumer, K. & Bandari, E. (2004), 'Improving search by using intelligent coordinates', Physical Review E 69.

Wolpert, D. H., Tumer, K. & Frank, J. (1999), Using collective intelligence to route internet traffic, in 'Advances in Neural Information Processing Systems - 11', MIT Press, pp. 952-958.

A Statistical estimation to update q

Using either of the update rules Eqs. (9) or (10) requires knowing E_{q^t_{-i}}(G | x_i), defined in Eq. (7). However, often we cannot efficiently calculate all the terms E_{q^t_{-i}}(G | x_i). To use our update rules in such situations we can use Monte Carlo sampling, as described in this section.

A.1 Monte Carlo sampling

In the Monte Carlo approach, at each timestep every agent i samples its distribution q_i to get a point x_i. Since we have a product distribution q, these samples provide us with a sample x of the full joint distribution q. By repeating this process L times we get a "block" of such joint samples. The G values in that block can be used by each agent separately to estimate its updating values E_{q^t_{-i}}(G | x_i), for example simply by uniform averaging of the G values in the samples associated with each x_i. Note that the single set of samples can be used no matter how many agents are in the system; we don't need a different Monte Carlo process for each agent to estimate their separate E_{q^t_{-i}}(G | x_i). All agents (variables) sample moves (variable settings) independently, and coupling occurs only in the updates of the q_i. As we have seen, this update (even to second order) for agent i depends only on the conditional expectations E_{q_{-i}}(G | x_i), where q_{-i} describes the strategies used by the other agents.
Thus, if we are using Monte Carlo, the only information which needs to be communicated to each agent is the $G$ values upon which the estimate will be based. Using these values each agent independently updates its strategy (its $q_i$) in a way which collectively is guaranteed to lower the Lagrangian. If the expectation is evaluated analytically, the $i$th agent needs the $q_j$ distributions for each of the $j$ agents involved in factors of $G$ that also involve $i$. For objective functions which consist of a sum of local interactions, each of which individually involves only a small subset of the variables (e.g., the problems considered here), the number of agents that $i$ needs to communicate with is typically much smaller than $n$.

A.2 Difference utilities for faster Monte Carlo convergence

The basic Monte Carlo approach outlined above can be slow to converge in high-dimensional problems. For the problems considered in this paper this is irrelevant, since $E_{q^t_{-i}}(G \mid x_i)$ may be efficiently calculated in closed form for all agents $i$ and their moves $x_i$, so we don't need to use Monte Carlo sampling. For completeness, though, here we present details of an improvement of the basic Monte Carlo approach that converges far more quickly.

Scrutiny of the two update rules (9) and (10) reveals that we don't actually require the values $E_{q^t_{-i}}(G \mid x_i)$ for all $x_i$. All that is needed are the differences $E_{q^t_{-i}}(G \mid x_i) - E_{q^t_{-i}}(G \mid x'_i)$ for pairs of distinct moves $x_i$ and $x'_i$. (For both update rules the other degrees of freedom in the values $E_{q^t_{-i}}(G \mid x_i)$ are absorbed into the factor in the update rule ensuring normalization of $q_i$.) In other words, to update $q_i$ we can replace the sample averages of $G(x)$ with the sample averages of the associated difference utility $g_i(x) = G(x) - D_i(x_{-i})$, for any function $D_i(x_{-i})$.
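This invariance of the differences under the choice of $D_i$ can be checked numerically on a joint space small enough that the conditional expectations are exact. The two-agent objective, the distribution $q_1$, and the offset $D_0$ below are all hypothetical:

```python
# Exact conditional expectations over a tiny joint space, checking that the
# differences E(g_0 | x_0) - E(g_0 | x'_0) are the same for any D_0(x_{-0}).
q1 = {0: 0.3, 1: 0.7}                   # the other agent's distribution (hypothetical)
G = lambda x0, x1: (x0 + 1) * (x1 + 2)  # hypothetical objective
D0 = lambda x1: 10.0 * x1               # an arbitrary choice of D_i(x_{-i})
g0 = lambda x0, x1: G(x0, x1) - D0(x1)  # the associated difference utility

E = lambda h, x0: sum(p * h(x0, x1) for x1, p in q1.items())
diff_G = E(G, 0) - E(G, 1)    # difference of conditional expectations under G
diff_g = E(g0, 0) - E(g0, 1)  # ... and under the difference utility
# diff_G == diff_g, so the update to q_0 is unchanged by the choice of D_0
```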
The differences in expectation values $E_{q^t_{-i}}(g_i \mid x_i) - E_{q^t_{-i}}(g_i \mid x'_i)$ are invariant with respect to the choice of $D_i$. However, the choice of $D_i$ can have a major effect on the statistical accuracy of our Monte Carlo estimation. We exploit the freedom granted by the introduction of $D_i(x_{-i})$ to derive quantities which may be more easily estimated by Monte Carlo methods. In particular, we may define $D_i$ by requiring that the maximum likelihood estimator of the gradient has the lowest possible variance. In Appendix B we show that the choice
$$D_i(x_{-i}) \;=\; \Bigl( \sum_{x''_i} \frac{1}{L_{x''_i}} \Bigr)^{-1} \sum_{x'_i} \frac{G(x'_i, x_{-i})}{L_{x'_i}} \qquad (30)$$
gives the lowest possible variance. In the above equation $L_{x_i}$ is the number of times that the $i$th variable assumed value $x_i$ in a block of $L$ Monte Carlo samples.

The associated difference utility, $g_i(x) = G(x) - D_i(x_{-i})$, is called the Aristocrat utility (AU). An approximation to it was investigated in (Wolpert & Tumer 2001, Wolpert, Tumer & Bandari 2004, Wolpert & Tumer 2002) and references therein. AU itself was derived in (Wolpert 2003b). AU minimizes the variance of gradient descent updating regardless of the form of $q$. Sometimes not all the terms in the sum in AU can be stored, because $|\mathcal{X}_i|$ and/or the block size is too large. That sum can be approximated by replacing the $L_{x_i}$ values in the definition of AU with $q(x_i) L$. This replacement also provides a way to address cases where one or more $L_{x_i} = 0$. Similarly, for computational reasons it may be desirable to approximate the weighted average of $G$ over all $x'_i$ which defines AU. The sum over $x'_i$ occurring in AU should not be confused with the sum over $x'_i$ in Eq. (8) that pulls the gradient estimate back into the unit simplex. The sum here is over values of $G$ for counterfactual sample pairs $(x'_i, x_{-i})$.
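The weighted average defining $D_i$ in Eq. (30) is straightforward to compute from the per-move sample counts. A minimal sketch, with the counts and the row of objective values at a fixed $x_{-i}$ made up for illustration:

```python
def aristocrat_offset(G_row, counts):
    """D_i(x_{-i}) of Eq. (30): the counterfactual values G(x'_i, x_{-i})
    averaged with weights 1/L_{x'_i}, normalized by sum_{x''_i} 1/L_{x''_i}.
    G_row[xi] = G(xi, x_{-i}); counts[xi] = L_{xi}."""
    inv = {xi: 1.0 / counts[xi] for xi in counts}
    return sum(inv[xi] * G_row[xi] for xi in counts) / sum(inv.values())

# Hypothetical counts and objective values at one fixed x_{-i}:
counts = {0: 10, 1: 40}
G_row = {0: 5.0, 1: 6.0}
D = aristocrat_offset(G_row, counts)   # (0.1*5 + 0.025*6) / 0.125 = 5.2
```

Note how the rarer move ($x_i = 0$, with only 10 samples) receives the larger weight, which is the observation that motivates WLU below.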
(The other sum is over the values of our gradient estimate at all of its arguments.) When the functional form of $G$ is known, it is often the case that there is cancellation which allows AU to be calculated directly, in one evaluation, an evaluation which can be cheaper than that of $G(x)$. When this is not the case, their evaluation incurs a computational cost in general. This cost is offset by the fact that those evaluations allow us to determine the value of AU not just for the actual point $(x_i, x_{-i})$, but in fact for all points $\{(x'_i, x_{-i}) \mid x'_i \in \mathcal{X}_i\}$. Nonetheless, there will be cases where evaluating AU requires evaluating all possible $G(x'_i, x_{-i})$, and where the cost of that is prohibitive, even if it allows us to update AU for all $x_i$ at once.

Fortunately there are difference utilities that are cheaper to evaluate than AU while still having less variance than $G$. In particular, note that the weighting factor $L^{-1}_{x'_i} / \sum_{x''_i} L^{-1}_{x''_i}$ in the formula for AU is largest for those $x_i$ which occur infrequently, i.e., that have low $q_i(x_i)$. This observation leads to the Wonderful Life Utility (WLU), which is an approximation to AU that (being a difference utility) also has zero bias:
$$g^{WLU}_i(x_i, x_{-i}) \;=\; G(x_i, x_{-i}) - G(x^{clamp}_i, x_{-i}). \qquad (31)$$
In this formula $x^{clamp}_i = \arg\min_{x_i} L_{x_i}$, or, if we wish to be more conservative, $\arg\min_{x_i} q_i(x_i)$, agent $i$'s lowest-probability move (Wolpert 2003a, Wolpert 2004b). Given the derivation of AU, one would expect that in general WLU will work best if $x^{clamp}_i$ is set to the $x_i$ that is least likely. As the run unfolds, which $x_i$ has that property will change. However, changing $x^{clamp}_i$ accordingly will void the assumptions underlying the derivation of AU.
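WLU requires only two evaluations of $G$ per sample, which is its computational appeal over AU. A minimal sketch of Eq. (31); the joint move, the objective, and the counts are hypothetical:

```python
def wlu(G, x, i, counts):
    """Wonderful Life Utility, Eq. (31): G at the actual joint move minus G
    with agent i's move clamped to its least-frequently-sampled move."""
    x_clamp = min(counts, key=counts.get)   # arg min_{x_i} L_{x_i}
    clamped = x[:i] + (x_clamp,) + x[i + 1:]
    return G(x) - G(clamped)

# Hypothetical example: G is just the sum of the moves.
G = lambda x: float(sum(x))
counts = {1: 5, 3: 20}           # move 1 was agent 0's rarest move
val = wlu(G, (3, 4), 0, counts)  # G((3,4)) - G((1,4)) = 7 - 5 = 2
```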
In practice then, while one usually can update $x^{clamp}_i$ in such a dynamic fashion occasionally, doing so too frequently can cause problems.

A.3 Discussion of Monte Carlo sampling

Note that the foregoing analysis breaks down if any of the $L_{x_i} = 0$. More generally, it may break down if just one or more of the $q(x_i)$ are particularly small in comparison to the others, even if no $L_{x_i}$ is exactly zero. The reason for this is that our approximation of the average over $n_{x_i}$ (see Appendix B) with the average in which no $L_{x_i} = 0$ breaks down. Doing the exact calculation with no such approximation doesn't fix the situation: once we have to assign non-infinitesimal probability to $L_{x_i} = 0$, we're allowing a situation in which the gradient step would take us off the simplex of allowed $q \in \mathcal{Q}$. We might try to compensate for this by reducing the stepsize, but in general the foregoing analysis doesn't hold if the stepsize is reduced in some situations but not in others. (Variable stepsize constitutes a change to the update rule. Such a modification to the update rule must be incorporated into the analysis, which obviates the derivation of AU.)

One way to address this scenario would be to simply zero out the probability of agent $i$ making any move $x_i$ for which $q_i(x_i)$ is particularly small. In other words, we can redefine $i$'s move space to exclude any moves whose probability ever gets sufficiently small. This has the additional advantage of reducing the amount of "noise" that agents $j \neq i$ will see in the next Monte Carlo block, since the effect of agent $i$ on the value of $G$ in that block is more tightly constrained.

There are several ways to extend the derivation of AU, which only addresses estimation error for a single agent at a time, and for just that agent's current update.
One such extension is to have agent $i$'s utility set to improve the accuracy of the update estimation for agents $j \neq i$. For example, we could try to bias $q_i$ to be peaked about only a few moves, thereby reducing the amount of noise those other agents $j \neq i$ will see in the next Monte Carlo block due to variability in $i$'s move choice. Another extension is to have agent $i$'s utility set to improve the accuracy of its estimate of its update for future Monte Carlo blocks, even at the expense of accuracy for the current block.

Strictly speaking, the derivation of AU only applies to gradient descent updating of $q$. Difference utilities are unbiased estimators for the Nearest Newton update rule, so long as each $i$ estimates $E_{q^t}(g_i)$ as $q^t_i(x_i)$ times the estimate of $E_{q^t}(g_i \mid x_i)$,
$$\sum_{x_i} \hat{g}^i_{x_i}(n_i)\, q^t_i(x_i),$$
rather than as the empirical average over all samples of $g_i$,
$$\sum_{x_i} \hat{g}^i_{x_i}(n_i)\, \frac{L_{x_i}}{L}.$$
However, the calculation of how to minimize the variance must be modified. Redoing the algebra above, the analog of AU for the Nearest Newton rule arises if we replace
$$\frac{1}{L_{x_i}} \;\to\; \frac{[q_i(x_i)]^2}{L_{x_i}} \Bigl( [1 - q_i(x_i)]^2 + \sum_{x'_i \neq x_i} [q_i(x'_i)]^2 \Bigr) \qquad (32)$$
throughout the equation defining AU. Similar considerations apply to Brouwer updating as well. Nonetheless, in practice AU and WLU as defined above work well (and in particular far better than taking $g_i = G$) for the other updating rules as well.

For gradient descent updating, minimizing the expected quadratic error of our estimator of $E_{q^t_{-i}}(G \mid x_i)$ corresponds to making a quadratic approximation to the Lagrangian surface, and then minimizing the expected value of the Lagrangian after the gradient step (Wolpert 2003a). More generally, and especially for other update rules, some other kind of error measure might be preferable. Such measures would differ from the bias-variance decomposition.
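The distinction between the two estimates of $E_{q^t}(g_i)$ can be made concrete. In this sketch the per-move estimates, the distribution, and the counts are all hypothetical:

```python
def expectation_weighted(g_hat, q_i):
    """E_{q^t}(g_i) as sum_{x_i} q_i(x_i) * ghat_{x_i}: the form under which
    difference utilities remain unbiased for the Nearest Newton rule."""
    return sum(q_i[xi] * g_hat[xi] for xi in g_hat)

def expectation_empirical(g_hat, counts, L):
    """The plain empirical average, sum_{x_i} ghat_{x_i} * L_{x_i} / L."""
    return sum(g_hat[xi] * counts[xi] / L for xi in g_hat)

# Hypothetical per-move estimates and sample counts:
g_hat = {0: 1.0, 1: 3.0}
q_i = {0: 0.5, 1: 0.5}
counts = {0: 2, 1: 8}
a = expectation_weighted(g_hat, q_i)          # 0.5*1 + 0.5*3 = 2.0
b = expectation_empirical(g_hat, counts, 10)  # 0.2*1 + 0.8*3 = 2.6
```

The two agree only when the empirical move frequencies $L_{x_i}/L$ happen to match $q^t_i$; the weighted form removes that sampling noise.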
We do not consider such alternatives here.

Note that the agents are completely "blind" in the Monte Carlo process outlined above, getting no information from the other agents other than the values of $G(x)$. When we allow some information to be transmitted between the agents, we can improve the estimation of $E_{q^t_{-i}}(G \mid x_i)$ beyond that of the simple Monte Carlo process outlined above. For example, say that at every timestep agent $i$ knows not just its own move $x_i$, but in fact the joint move $x$. Then as time goes on it accumulates a training set of pairs $\{(x, G(x))\}$. These can be used with conventional supervised learning algorithms (Duda et al. 2000) to form a rough estimate, $\hat{G}$, of the entire function $G$. Say that in addition $i$ knows not just its own distribution $q_i(x^t_i)$, but in fact the entire joint distribution $q(x^t)$. Then it can use that joint distribution together with $\hat{G}$ to form an estimate of $E_{q^t_{-i}}(G \mid x_i)$. That estimate is in addition to the one formed by the blind Monte Carlo process outlined above. One can then combine these estimates to form one superior to both. See (Lee & Wolpert 2004).

Even when we are restricted to a blind Monte Carlo process, there are many heuristics that, when incorporated into the update rules, can greatly improve their performance on real-world problems. In this paper we examine problems for which the joint distributions $q$ are known to all agents, as is the functional form of $G(x)$, and the required expectations $E_{q^t_{-i}}(G \mid x_i)$ may be obtained in closed form. So there is no need for Monte Carlo approximations. Accordingly, there is no need for those heuristics, and there is not even any need to use difference utilities. Empirical investigations of the effects of using difference utility functions and the heuristics may be found in (Bieniawski & Wolpert 2004a, Bieniawski et al. 2004, Bieniawski & Wolpert 2004b).
B Utility Derivation

Say we are at a timestep $t$ at the end of a Monte Carlo block, and consider the simplest updating rule. This is gradient descent updating, in which we wish to update $q_i$ at timestep $t$ by having each agent $i$ take a small step in the direction (cf. Eq. (5))
$$f^{i,G} \;\triangleq\; -\Bigl( \frac{\partial \mathcal{L}(q^t)}{\partial q_i(x^1_i)}, \cdots, \frac{\partial \mathcal{L}(q^t)}{\partial q_i(x^{|\mathcal{X}_i|}_i)} \Bigr) - \eta_i(q)\, \mathbf{1}$$
where $\eta_i(q)$ was defined in Eq. (8), $\mathbf{1}$ is the vector of length $|\mathcal{X}_i|$ all of whose components are 1, and $x^1_i, \cdots, x^{|\mathcal{X}_i|}_i$ are the $|\mathcal{X}_i|$ moves available to agent $i$. In general, there will be some error in $i$'s estimate of that step, since it has limited information about $q^t_{-i}$. Presuming quadratic loss reflects the quality of the update, for agent $i$ the Bayes-optimal estimate of its update is the posterior expectation
$$\int dq^t\, P(q^t_{-i} \mid n_i)\, f^{i,G}$$
where $n_i$ is all the prior knowledge and data that $i$ has, and the dependence of $f^{i,G}$ on $q^t_{-i}$ is implicit. $P(q^t_{-i} \mid n_i)$ is a probability distribution over likely values of $q^t_{-i}$ given the information $n_i$ available to agent $i$.

Now agent $i$ can evaluate $\ln q_i(x_i)$ for each of its moves $x_i$ exactly. However, to perform its update it still needs the integrals
$$\int dq^t\, P(q^t_{-i} \mid n_i)\, E_{q^t_{-i}}(G \mid x_i)$$
(recall Eq. (6)). In general these integrals can be very difficult to evaluate. As an alternative, we can replace those integrals with simple maximum likelihood estimators of them, i.e., we can use Monte Carlo sampling. In this case the prior information $n_i$ available to the agent is a list, $\mathcal{L}$, of $L$ joint configurations $x$ along with their accompanying objective values $G(x)$. To define this precisely, for any function $h(x)$, let $\hat{h}(n_i)$ be a vector of length $|\mathcal{X}_i|$ which is indexed by $x_i$. The $x_i$ component of $\hat{h}(n_i)$ is indicated as $\hat{h}_{x_i}(n_i)$. Each of its components is given by the information in $n_i$.
The $x_i$'th such component is the empirical average of the values that $h$ had in the $L_{x_i}$ samples from the just-completed Monte Carlo block in which agent $i$ made move $x_i$, i.e.,
$$\hat{h}_{x_i}(n_i) \;=\; \frac{1}{L_{x_i}} \sum_{x \in \mathcal{L}_{x_i}} h(x)$$
where $\mathcal{L}_{x_i}$ is the set of $x$ in $\mathcal{L}$ whose $i$th component is equal to $x_i$, and where $L_{x_i} = |\mathcal{L}_{x_i}|$. Given this notation, we can express the components of the gradient update step for agent $i$ under the simple maximum likelihood estimator as the values
$$\hat{f}^{i,G}_{x_i}(n_i) \;=\; -\{\hat{G}_{x_i}(n_i) + T \ln q_i(x_i)\} - \hat{\eta}_i(n_i) \qquad (33)$$
where
$$\hat{\eta}_i(n_i) \;\triangleq\; -\frac{1}{|\mathcal{X}_i|} \sum_{x'_i} \bigl( \hat{G}_{x'_i}(n_i) + T \ln q_i(x'_i) \bigr). \qquad (34)$$
Unfortunately, often in very large systems the convergence of $\hat{G}(n_i)$ with growing $L$ is very slow, since the distribution sampled by the Monte Carlo process to produce $n_i$ is very broad. This suggests we use some alternative estimator. Here we focus on estimators that are still maximum likelihood, just with a different choice of utility. In particular, we will focus on estimators based on difference utilities, briefly introduced in the preceding appendix.

To motivate such estimators more carefully, first posit that the differences $E_{q^t_{-i}}(G \mid x_i) - E_{q^t_{-i}}(G \mid x'_i)$, one for each $(x_i, x'_i)$ pair, are unchanged when one replaces $G$ with some other function $g_i$. So the change is equivalent to adding a constant to $G$, as far as those differences are concerned. This means that if agent $i$ used $q^t_{-i}$ to evaluate its expectation values exactly, then its associated update would be unchanged under this replacement. (This is due to cancellation of the equivalent additive constant with the change that arises in $\eta_i(n_i)$ under the replacement of $G(x)$ with $g_i(x)$.)
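Eqs. (33)-(34) can be written out directly; the role of $\hat{\eta}_i$ is to make the step components sum to zero, so that adding a small multiple of the step to $q_i$ preserves normalization. A minimal sketch with hypothetical inputs for a three-move agent:

```python
import math

def gradient_step(G_hat, q_i, T):
    """Estimated gradient step of Eqs. (33)-(34):
    f_hat(x_i) = -(G_hat(x_i) + T ln q_i(x_i)) - eta_hat, where eta_hat is
    minus the mean of the bracketed terms over agent i's moves."""
    a = {xi: G_hat[xi] + T * math.log(q_i[xi]) for xi in q_i}
    eta = -sum(a.values()) / len(a)
    return {xi: -a[xi] - eta for xi in q_i}

# Hypothetical per-move estimates and current distribution:
G_hat = {0: 1.0, 1: 2.0, 2: 4.0}
q_i = {0: 0.2, 1: 0.3, 2: 0.5}
step = gradient_step(G_hat, q_i, T=0.5)
# sum(step.values()) == 0, so q_i plus a small multiple of step stays normalized
```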
It is straightforward to verify that the set of all $g_i$ guaranteed to have this character, regardless of the form of $q$, is the set of difference utilities, $g_i(x) = G(x) - D_i(x_{-i})$ for some function $D_i$. $G$ itself is the trivial case $D_i(x_{-i}) = 0$ for all $x_{-i}$. On the other hand, if we use a difference utility rather than $G$ in our maximum likelihood estimator, then it is the sample values of $g_i$ that generate $n_i$, and we use the associated $x_i$-indexed vector $\hat{g}^i_{x_i}(n_i)$ rather than $\hat{G}_{x_i}(n_i)$ to update each $q_i$. For well-chosen $D_i$ it may typically be the case that $\hat{g}^i(n_i)$ has a far smaller standard deviation than does $\hat{G}(n_i)$. In particular, if the number of coordinates coupled to $i$ through $G$ does not grow as the system does, often such difference utility error bars will not grow much with system size, whereas the error bars associated with $G$ will grow greatly. Another advantage of difference utilities is that very often the Monte Carlo values of a difference utility are far easier to evaluate than are those of $G$, due to cancellation in subtracting $D_i$.

To make this more precise we can solve for the difference utility with minimal error bar. First, as a notational matter, extend the definition of $f^{i,G}$ by replacing $G$ with (arbitrary) $h$ throughout, writing that extended version as $f^{i,h}$. Then, assuming no $L_{x_i} = 0$, we are interested in the $g_i$ minimizing the data-averaged quadratic error,
$$E_{q_{-i}, n^{g_i}_i}(\mathrm{Err}) \;=\; \int dq_{-i}\, P(q_{-i}) \int dn^{g_i}_i\, P(n^{g_i}_i \mid n^x_i, q_{-i}, g_i)\, \| f^{i,g_i} - \hat{f}^{i,g_i}(n_i) \|^2, \qquad (35)$$
where $P(q_{-i})$ reflects any prior information we might have concerning $q_{-i}$ (e.g., that it is likely that the current $f^{i,g_i}$ is close to that estimated for the previous block of $L$ steps), and $n^{g_i}_i$ is the set of values of the private utility contained in $n_i$.
(The associated $x_i$ values, $n^x_i$, are independent of $g_i$ and $q_{-i}$, and therefore for our purposes can be treated as though they are fixed.) Now the components of $\hat{f}^{i,g_i}(n_i)$ (one for each $x_i$) are not independent in general, being coupled via $\hat{\eta}_i(n_i)$. To get an integrand that involves only independent variables, we must work with only one $x_i$ component at a time. To that end, rewrite the data-averaged quadratic error as
$$\sum_{x_i} \int dq_{-i}\, P(q_{-i}) \int dn^{g_i}_i\, P(n^{g_i}_i \mid n^x_i, q_{-i}, g_i)\, [f^{i,g_i}_{x_i} - \hat{f}^{i,g_i}_{x_i}(n_i)]^2$$
where $f^{i,g_i}_{x_i}$ is the $q_i(x_i)$ component of $f^{i,g_i}$. Our results will hold for all $q_{-i}$, so we ignore the outer integral and focus on
$$\sum_{x_i} \int dn^{g_i}_i\, P(n^{g_i}_i \mid n^x_i, q_{-i}, g_i)\, [f^{i,g_i}_{x_i} - \hat{f}^{i,g_i}_{x_i}(n_i)]^2. \qquad (36)$$
For any $x_i$ the inner integral can be decomposed with the famous bias-variance decomposition into a sum of two terms (Duda et al. 2000). The first of the two terms in our sum is the (square of the) bias, $f^{i,g_i}_{x_i} - E_{n^{g_i}_i}(\hat{f}^{i,g_i}_{x_i})$, where
$$E_{n^{g_i}_i}\bigl( \hat{f}^{i,g_i}_{x_i}(n^{g_i}_i) \bigr) \;\triangleq\; \int dn^{g_i}_i\, P(n^{g_i}_i \mid n^x_i, q_{-i}, g_i)\, \hat{f}^{i,g_i}_{x_i}(n_i) \qquad (37)$$
is the expectation (over all possible sets of Monte Carlo sample utility values $n^{g_i}_i$) of $\hat{f}^{i,g_i}_{x_i}(n_i)$. The bias reflects the systematic trend of our sample-based estimate of $f^{i,g_i}_{x_i}$ to differ from the actual $f^{i,g_i}_{x_i}$. When the bias is zero, we know that on average our estimator will return the actual value it is estimating. The second term in our sum is the variance,
$$\mathrm{Var}\bigl( \hat{f}^{i,g_i}_{x_i}(n^{g_i}_i) \bigr) \;\triangleq\; \int dn^{g_i}_i\, P(n^{g_i}_i \mid n^x_i, q_{-i}, g_i)\, \bigl( \hat{f}^{i,g_i}_{x_i}(n_i) - E_{n^{g_i}_i}(\hat{f}^{i,g_i}_{x_i}) \bigr)^2. \qquad (38)$$
In general the variance reflects how much the value of our estimate "bounces around" if one were to resample our Monte Carlo block.
In our context, it reflects how much the private utility of agent $i$ depends on its own move $x_i$ versus the moves of the other agents. When $i$'s estimator is isolated from the moves of the other agents, $\hat{f}^{i,g_i}_{x_i}(n_i)$ is mostly independent of the moves of the other agents, and therefore of $n_i$. This means that the variance is low, and there is a crisp "signal-to-noise" ratio guiding $i$'s updates. In this situation the agent can achieve a preset accuracy level in its updating with a minimal total number of samples in the Monte Carlo block.

Plug the general form of a difference utility into the formula for $\hat{f}^{i,g_i}_{x_i}(n_i)$ to see that (due to cancellation with the $\hat{\eta}_i(n_i)$ term) its $n^{g_i}_i$-averaged value is independent of $D_i$. Accordingly, the bias must equal 0 for difference utilities. (In fact, difference utilities are the only utilities that are guaranteed to have zero bias for all $q_{-i}$.) So our expected error reduces to the sum over all $x_i$ of the variance for each $x_i$. For each one of those variances, again use Eq. (33) with $G$ replaced by $g_i$ throughout to expand $\hat{f}^{i,g_i}_{x_i}(n_i)$. Since the $q_i(x_i)$ terms in that expansion are constants independent of $n_i$, they don't contribute to the variance. Accordingly we have
$$\mathrm{Var}\bigl( \hat{f}^{i,g_i}_{x_i}(n^{g_i}_i) \bigr) \;=\; \mathrm{Var}\Bigl( \hat{g}^i_{x_i}(n^{g_i}) - \frac{1}{|\mathcal{X}_i|} \sum_{x'_i} \hat{g}^i_{x'_i}(n^{g_i}) \Bigr) \;=\; \mathrm{Var}\Bigl( \Bigl(1 - \frac{1}{|\mathcal{X}_i|}\Bigr) \hat{g}^i_{x_i}(n^{g_i}) - \frac{1}{|\mathcal{X}_i|} \sum_{x'_i \neq x_i} \hat{g}^i_{x'_i}(n^{g_i}) \Bigr). \qquad (39)$$
Since $n^x_i$ is fixed and we are doing IID sampling, the two expressions inside the variance function are statistically independent. In addition, the variance of a difference of independent variables is the sum of the variances. Accordingly, the sum over all $x_i$ of our variances is (cf. Eq.
(36))
$$\begin{aligned}
\sum_{x_i} \mathrm{Var}\bigl( \hat{f}^{i,g_i}_{x_i}(n^{g_i}) \bigr)
&= \sum_{x_i} \mathrm{Var}\Bigl( \Bigl(1 - \tfrac{1}{|\mathcal{X}_i|}\Bigr) \hat{g}^i_{x_i}(n^{g_i}) - \tfrac{1}{|\mathcal{X}_i|} \sum_{x'_i \neq x_i} \hat{g}^i_{x'_i}(n^{g_i}) \Bigr) \\
&= \sum_{x_i} \Bigl[ \Bigl(1 - \tfrac{1}{|\mathcal{X}_i|}\Bigr)^2 \mathrm{Var}\bigl( \hat{g}^i_{x_i}(n^{g_i}) \bigr) + \tfrac{1}{|\mathcal{X}_i|^2} \sum_{x'_i \neq x_i} \mathrm{Var}\bigl( \hat{g}^i_{x'_i}(n^{g_i}) \bigr) \Bigr] \\
&= \sum_{x_i} \Bigl[ \Bigl(1 - \tfrac{1}{|\mathcal{X}_i|}\Bigr)^2 + \tfrac{|\mathcal{X}_i| - 1}{|\mathcal{X}_i|^2} \Bigr] \mathrm{Var}\bigl( \hat{g}^i_{x_i}(n^{g_i}) \bigr) \\
&= \frac{|\mathcal{X}_i| - 1}{|\mathcal{X}_i|} \sum_{x_i} \mathrm{Var}\bigl( \hat{g}^i_{x_i}(n^{g_i}) \bigr), \qquad (40)
\end{aligned}$$
where the third equation follows from the second by using the trivial identity $\sum_a \sum_{b \neq a} F(b) = \sum_a F(a) \sum_{b \neq a} 1$ for any function $F$. Since for each such $x_i$ we are doing $L_{x_i}$-fold IID sampling of an associated fixed distribution, the variance for each separate $x_i$ is of the form
$$\int dy\, P(y)\, \Bigl( \frac{1}{L_{x_i}} \sum_{j=1}^{L_{x_i}} y_j - E_P\Bigl( \frac{1}{L_{x_i}} \sum_{j=1}^{L_{x_i}} y_j \Bigr) \Bigr)^2$$
for a fixed distribution $P(y_1, y_2, \ldots, y_{L_{x_i}}) = \prod_{j=1}^{L_{x_i}} P(y_j)$. We can again use the decomposition of the variance of a sum into a sum of variances to evaluate this. With the distribution $q^t$ implicit, define the single-sample variance of the value of any function $H(x)$, for move $x_i$, as
$$\mathrm{Var}(H(x_i)) \;\triangleq\; E([H]^2 \mid x_i) - [E(H \mid x_i)]^2. \qquad (41)$$
This gives
$$\mathrm{Var}\bigl( \hat{g}^i_{x_i}(n^{g_i}) \bigr) \;=\; \mathrm{Var}(g_i(x_i)) / L_{x_i}. \qquad (42)$$
Collecting terms, we get
$$\sum_{x_i} \mathrm{Var}\bigl( \hat{f}^{i,g_i}_{x_i}(n^{g_i}) \bigr) \;=\; \frac{|\mathcal{X}_i| - 1}{|\mathcal{X}_i|} \sum_{x_i} \frac{\mathrm{Var}(g_i(x_i))}{L_{x_i}}. \qquad (43)$$
Now $\mathrm{Var}(A(\tau)) = (1/2) \sum_{t_1, t_2} P(t_1) P(t_2) [A(t_1) - A(t_2)]^2$ for any random variable $\tau$ with distribution $P(t)$. Use this to rewrite the sum in Eq.
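The collapse of the per-move coefficient in Eq. (40) is pure algebra and easy to verify numerically: for $K = |\mathcal{X}_i|$, the factor $(1 - 1/K)^2 + (K-1)/K^2$ reduces to $(K-1)/K$.

```python
# Numeric check of the coefficient collapse used in the last step of Eq. (40).
for K in range(2, 12):  # K plays the role of |X_i|
    per_move = (1 - 1 / K) ** 2 + (K - 1) / K ** 2
    assert abs(per_move - (K - 1) / K) < 1e-12
```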
(43) as
$$\frac{|\mathcal{X}_i| - 1}{2 |\mathcal{X}_i|} \sum_{x_i} L^{-1}_{x_i} \sum_{x'_{-i}, x''_{-i}} q_{-i}(x'_{-i})\, q_{-i}(x''_{-i})\, [g_i(x_i, x'_{-i}) - g_i(x_i, x''_{-i})]^2.$$
Bring the sum over $x_i$ inside the other integrals, expand $g_i$, and drop the overall multiplicative constant to get
$$\sum_{x_i} \mathrm{Var}\bigl( \hat{f}^{i,g_i}_{x_i}(n^{g_i}) \bigr) \;\propto\; \sum_{x'_{-i}, x''_{-i}} q_{-i}(x'_{-i})\, q_{-i}(x''_{-i}) \sum_{x_i} \frac{\bigl( G(x_i, x'_{-i}) - G(x_i, x''_{-i}) - \{ D_i(x'_{-i}) - D_i(x''_{-i}) \} \bigr)^2}{L_{x_i}}.$$
For each $x'_{-i}$ and $x''_{-i}$, our choice of $D_i$ minimizes the sum so long as the difference in the curly brackets obeys
$$D_i(x'_{-i}) - D_i(x''_{-i}) \;=\; \sum_{x_i} \frac{G(x_i, x'_{-i}) - G(x_i, x''_{-i})}{L_{x_i}} \Bigm/ \sum_{x'_i} \frac{1}{L_{x'_i}}.$$
This can be assured by picking
$$D_i(x_{-i}) \;=\; \Bigl( \sum_{x'_i} \frac{1}{L_{x'_i}} \Bigr)^{-1} \sum_{x'_i} \frac{G(x'_i, x_{-i})}{L_{x'_i}} \qquad (44)$$
for all $x_{-i}$. Note that AU minimizes the variance of gradient descent updating regardless of the form of $q$. Indeed, being independent of $q_{-i}$, it minimizes our original $q_{-i}$ integral in Eq. (35), regardless of the prior $P(q_{-i})$. For the same reason it is optimal if the integral is replaced by a worst-case bound over $q_{-i}$.
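One can check numerically that the pairwise difference prescribed above does minimize the weighted inner sum. In this sketch the values of $G(x_i, x'_{-i}) - G(x_i, x''_{-i})$ and the counts $L_{x_i}$ are hypothetical:

```python
# For fixed x'_{-i}, x''_{-i}, minimize over the scalar offset
# d = D_i(x'_{-i}) - D_i(x''_{-i}) the sum over x_i of [dG(x_i) - d]^2 / L_{x_i}.
dG = {0: 2.0, 1: -1.0, 2: 4.0}  # hypothetical G(x_i, x'_{-i}) - G(x_i, x''_{-i})
L = {0: 5, 1: 20, 2: 2}         # hypothetical sample counts L_{x_i}

obj = lambda d: sum((dG[xi] - d) ** 2 / L[xi] for xi in dG)
# The minimizer is the 1/L_{x_i}-weighted mean of the dG values, as in Eq. (44):
d_star = sum(dG[xi] / L[xi] for xi in dG) / sum(1.0 / L[xi] for xi in dG)
assert all(obj(d_star) <= obj(d_star + eps) for eps in (-1.0, -0.1, 0.1, 1.0))
```

Since the objective is a convex quadratic in $d$, setting its derivative to zero gives exactly the weighted-mean formula, which is how Eq. (44) arises.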
