Promoting Coordination through Policy Regularization in Multi-Agent Deep Reinforcement Learning


Authors: Julien Roy, Paul Barde, Félix G. Harvey, Derek Nowrouzezahrai, Christopher Pal

Julien Roy* (Québec AI Institute (Mila), Polytechnique Montréal, julien.roy@mila.quebec), Paul Barde* (Québec AI Institute (Mila), McGill University, bardepau@mila.quebec), Félix G. Harvey (Québec AI Institute (Mila), Polytechnique Montréal), Derek Nowrouzezahrai (Québec AI Institute (Mila), McGill University), Christopher Pal‡ (Québec AI Institute (Mila), Polytechnique Montréal, Element AI)

Abstract

In multi-agent reinforcement learning, discovering successful collective behaviors is challenging: it requires exploring a joint action space that grows exponentially with the number of agents. While the tractability of independent, agent-wise exploration is appealing, this approach fails on tasks that require elaborate group strategies. We argue that coordinating the agents' policies can guide their exploration, and we investigate techniques to promote such an inductive bias. We propose two policy regularization methods: TeamReg, which is based on inter-agent action predictability, and CoachReg, which relies on synchronized behavior selection. We evaluate each approach on four challenging continuous-control tasks with sparse rewards that require varying levels of coordination, as well as on the discrete-action Google Research Football environment. Our experiments show improved performance across many cooperative multi-agent problems. Finally, we analyze the effects of our proposed methods on the learned policies and show that they successfully enforce the qualities that we propose as proxies for coordinated behaviors.

1 Introduction

Multi-Agent Reinforcement Learning (MARL) refers to the task of training an agent to maximize its expected return by interacting with an environment that contains other learning agents. It represents a challenging branch of Reinforcement Learning (RL) with interesting developments in recent years [11].
A popular framework for MARL is Centralized Training with Decentralized Execution (CTDE) [24, 8, 14, 7, 28]. Typically, one leverages centralized critics to approximate the value function of the aggregated observation-action pairs and trains actors restricted to the observation of a single agent. Such critics, if exposed to coordinated joint actions leading to high returns, can steer the agents' policies toward these highly rewarding behaviors. However, these approaches depend on the agents luckily stumbling upon these collective actions in order to grasp their benefit, and may thus fail in scenarios where such behaviors are unlikely to occur by chance. We hypothesize that in such scenarios, coordination-promoting inductive biases on the policy search could help discover successful behaviors more efficiently and supersede task-specific reward shaping and curriculum learning. To motivate this proposition, we present a simple Markov Game in which agents forced to coordinate their actions learn remarkably faster. For more realistic tasks, in which coordinated strategies cannot be easily engineered and must be learned, we propose to transpose this insight by relying on two coordination proxies to bias the policy search. The first avenue, TeamReg, assumes that an agent must be able to predict the behavior of its teammates in order to coordinate with them. The second, CoachReg, supposes that coordinated agents collectively recognize different situations and synchronously switch to different sub-policies to react to them.²

* Equal contribution. ‡ Canada CIFAR AI Chair.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Our contributions are threefold. First, we show that coordination can crucially accelerate multi-agent learning for cooperative tasks.
Second, we propose two novel approaches that aim at promoting such coordination by augmenting CTDE MARL algorithms with additional multi-agent objectives that act as policy regularizers and are optimized jointly with the main return-maximization objective. Third, we design two new sparse-reward cooperative tasks in the multi-agent particle environment [26]. We use them, along with two standard multi-agent tasks, to present a detailed evaluation of our approaches' benefits when they extend the reference CTDE MARL algorithm MADDPG [24]. We validate our methods' key components by performing an ablation study and a detailed analysis of their effect on agents' behaviors. Finally, we verify that these benefits hold on the more complex, discrete-action Google Research Football environment [20]. Our experiments suggest that our TeamReg objective provides a dense learning signal that can help guide the policy towards coordination in the absence of external reward, eventually leading to the discovery of higher-performing team strategies in a number of cooperative tasks. However, we also find that TeamReg does not lead to improvements in every single case and can even be harmful in environments with an adversarial component. For CoachReg, we find that enforcing synchronous sub-policy selection enables the agents to concurrently learn to react to different agreed-upon situations and consistently yields significant improvements in overall performance.

2 Background

2.1 Markov Games

We consider the framework of Markov Games [23], a multi-agent extension of Markov Decision Processes (MDPs). A Markov Game of $N$ agents is defined by the tuple $\langle \mathcal{S}, \mathcal{T}, \mathcal{P}, \{\mathcal{O}^i, \mathcal{A}^i, \mathcal{R}^i\}_{i=1}^N \rangle$, where $\mathcal{S}$, $\mathcal{T}$ and $\mathcal{P}$ are respectively the set of all possible states, the transition function and the initial state distribution. While these are global properties of the environment, $\mathcal{O}^i$, $\mathcal{A}^i$ and $\mathcal{R}^i$ are individually defined for each agent $i$.
They are respectively the observation functions, the sets of all possible actions and the reward functions. At each time-step $t$, the global state of the environment is given by $s_t \in \mathcal{S}$ and every agent's individual action is denoted by $a^i_t \in \mathcal{A}^i$. To select its action, each agent $i$ only has access to its own observation vector $o^i_t$, which is extracted by the observation function $\mathcal{O}^i$ from the global state $s_t$. The initial state $s_0$ is sampled from the initial state distribution $\mathcal{P}: \mathcal{S} \to [0, 1]$ and the next state $s_{t+1}$ is sampled from the distribution over possible next states given by the transition function $\mathcal{T}: \mathcal{S} \times \mathcal{S} \times \mathcal{A}^1 \times \dots \times \mathcal{A}^N \to [0, 1]$. Finally, at each time-step, each agent receives an individual scalar reward $r^i_t$ from its reward function $\mathcal{R}^i: \mathcal{S} \times \mathcal{S} \times \mathcal{A}^1 \times \dots \times \mathcal{A}^N \to \mathbb{R}$. Agents aim at maximizing their expected discounted return $\mathbb{E}\big[\sum_{t=0}^{T} \gamma^t r^i_t\big]$ over the time horizon $T$, where $\gamma \in [0, 1]$ is a discount factor.

2.2 Multi-Agent Deep Deterministic Policy Gradient

MADDPG [24] is an adaptation of the Deep Deterministic Policy Gradient algorithm [22] to the multi-agent setting. It allows the training of cooperating and competing decentralized policies through the use of a centralized training procedure. In this framework, each agent $i$ possesses its own deterministic policy $\mu^i$ for action selection and critic $Q^i$ for state-action value estimation, respectively parametrized by $\theta^i$ and $\phi^i$. All parametric models are trained off-policy from previous transitions $\zeta_t := (o_t, a_t, r_t, o_{t+1})$ uniformly sampled from a replay buffer $\mathcal{D}$. Note that $o_t := [o^1_t, \dots, o^N_t]$ is the joint observation vector and $a_t := [a^1_t, \dots, a^N_t]$ is the joint action vector, obtained by concatenating the individual observation vectors $o^i_t$ and action vectors $a^i_t$ of all $N$ agents. Each centralized critic is trained to estimate the expected return for a particular agent $i$ from the Q-learning loss [33]:

$$L^i(\phi^i) = \mathbb{E}_{\zeta_t \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q^i(o_t, a_t; \phi^i) - y^i_t\big)^2\Big], \qquad y^i_t = r^i_t + \gamma Q^i(o_{t+1}, a_{t+1}; \bar{\phi}^i)\Big|_{a^j_{t+1} = \mu^j(o^j_{t+1}; \bar{\theta}^j)\ \forall j} \quad (1)$$

For a given set of weights $w$, we define its target counterpart $\bar{w}$, updated as $\bar{w} \leftarrow \tau w + (1 - \tau)\bar{w}$, where $\tau$ is a hyper-parameter. Each policy is updated to maximize the expected discounted return of the corresponding agent $i$:

$$J^i_{PG}(\theta^i) = \mathbb{E}_{o_t \sim \mathcal{D}}\Big[Q^i(o_t, a_t)\Big|_{a^i_t = \mu^i(o^i_t; \theta^i),\ a^j_t = \mu^j(o^j_t; \bar{\theta}^j)\ \forall j \neq i}\Big] \quad (2)$$

By taking into account all agents' observation-action pairs when guiding an agent's policy, the value functions are trained in a centralized, stationary environment, despite operating in a multi-agent setting. This mechanism can allow coordinated strategies to be learned and then deployed in a decentralized way. However, this procedure does not encourage the discovery of coordinated strategies, since high-return behaviors have to be randomly experienced through unguided exploration.

² Source code for the algorithms and environments will be made public upon publication of this work. Visualisations of CoachReg are available here: https://sites.google.com/view/marl-coordination/

3 Motivation

Figure 1: (Top) The tabular Q-learning agents learn much more efficiently when constrained to the space of coordinated policies (solid lines) than in the original action space (dashed lines).
(Bottom) Simple Markov Game consisting of a chain of length $L$ leading to a terminal state (in grey). The agents can be seen as the two wheels of a vehicle: their actions need to be in agreement for the vehicle to move. The detailed experimental setup is reported in Appendix A.

In this section, we aim to answer the following question: can coordination help the discovery of effective policies in cooperative tasks? Intuitively, coordination can be defined as an agent's behavior being informed by the behavior of another agent, i.e. structure in the agents' interactions. Namely, a team whose agents behave independently of one another would not be coordinated. Consider the simple Markov Game consisting of a chain of length $L$ leading to a termination state, as depicted in Figure 1. At each time-step, both agents receive $r_t = -1$. The joint action of the two agents is given by $a \in \mathcal{A} = \mathcal{A}^1 \times \mathcal{A}^2$, where $\mathcal{A}^1 = \mathcal{A}^2 = \{0, 1\}$. Agent $i$ tries to go right when selecting $a^i = 0$ and left when selecting $a^i = 1$. However, to transition to a different state, both agents need to perform the same action at the same time (two lefts or two rights). Now consider a slight variant of this environment with a different joint action structure $a' \in \mathcal{A}'$. The former action structure is augmented with a hard-coded coordination module which maps the primitive joint action to $a'$ as follows:

$$a' = \begin{pmatrix} a'^1 = a^1 \\ a'^2 = a^1 a^2 + (1 - a^1)(1 - a^2) \end{pmatrix}, \qquad \begin{pmatrix} a^1 \\ a^2 \end{pmatrix} \in \mathcal{A}$$

While the second agent still learns a state-action value function $Q^2(s, a^2)$ with $a^2 \in \mathcal{A}^2$, the coordination module builds $a'^2$ from $(a^1, a^2)$, so that $a^2$ effectively determines whether the second agent acts in agreement or in disagreement with the first agent. In other words, if $a^2 = 1$ then $a'^2 = a^1$ (agreement), and if $a^2 = 0$ then $a'^2 = 1 - a^1$ (disagreement).
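For concreteness, the hard-coded coordination module above can be written out in a few lines of plain Python (the function name is ours, purely for illustration):

```python
def coordination_module(a1, a2):
    """Map primitive actions (a1, a2) in {0, 1}^2 to the effective joint
    action (a1', a2'), following a1' = a1 and
    a2' = a1 * a2 + (1 - a1) * (1 - a2).

    Agent 2's primitive action thus selects agreement (a2 = 1 gives
    a2' = a1) or disagreement (a2 = 0 gives a2' = 1 - a1) with agent 1.
    """
    return a1, a1 * a2 + (1 - a1) * (1 - a2)
```

Note that whenever agent 2 selects "agree", the effective joint action is (a1, a1) and the vehicle moves; when it selects "disagree", the outcome (no movement) is deterministic regardless of agent 1's choice, which is exactly the variance reduction discussed next.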
While this additional structure modifies neither the action space nor the independence of action selection, it reduces the stochasticity of the transition dynamics as seen by agent 2. In the first setup, the outcome of an agent's action is conditioned on the action of the other agent. In the second setup, if agent 2 decides to disagree, the transition becomes deterministic, as the outcome is independent of agent 1. This suggests that by reducing the entropy of the transition distribution, this mapping reduces the variance of the Q-updates and thus makes online tabular Q-learning agents learn much faster (Figure 1). This example uses a handcrafted mapping in order to demonstrate the effectiveness of exploring in the space of coordinated policies rather than in the unconstrained policy space. The following question remains: how can one softly learn the same type of constraint throughout training, for any cooperative multi-agent task? In the following sections, we present two algorithms that tackle this problem.

4 Coordination and Policy Regularization³

4.1 Team regularization

Figure 2: Illustration of TeamReg with two agents. Each agent's policy is equipped with additional heads that are trained to predict other agents' actions, and every agent is regularized to produce actions that its teammates correctly predict. The method is depicted for agent 1 only, to avoid cluttering.

This first approach aims at exploiting the structure present in the joint action space of coordinated policies to attain a certain degree of predictability of one agent's behavior with respect to its teammate(s). It is based on the hypothesis that the reciprocal also holds, i.e. that promoting agents' predictability could foster such team structure and lead to more coordinated behaviors. This assumption is cast into the decentralized framework by training agents to predict their teammates' actions given only their own observation.
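In the continuous-control case this prediction objective reduces to a mean squared error between a prediction head's output and the teammate's actual action. A minimal plain-Python sketch (the helper name and the example vectors are ours, standing in for the policy outputs):

```python
def mse(pred, target):
    """Mean squared error between a predicted and a true teammate action."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# agent i's prediction-head output vs. teammate j's actual action
# (illustrative 2-D continuous actions)
predicted_action = [0.5, -0.5]
true_action = [0.5, 0.5]
team_spirit_loss = mse(predicted_action, true_action)
```

In TeamReg this quantity is minimized both through the predictor (agent modelling) and through the teammate's own policy (predictability), as formalized below.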
For continuous control, the loss is the mean squared error (MSE) between the predicted and true actions of the teammates, yielding a teammate-modelling secondary objective. For discrete action spaces, we use the KL-divergence ($D_{KL}$) between the predicted and real action distributions of an agent pair. While estimating teammates' policies can be used to enrich the learned representations, we extend this objective to also drive the teammates' behaviors towards the predictions, by leveraging a differentiable action selection mechanism. We call team-spirit this pair of objectives $J^{i,j}_{TS}$ and $J^{j,i}_{TS}$ between agents $i$ and $j$:

$$J^{i,j}_{TS\text{-continuous}}(\theta^i, \theta^j) = -\mathbb{E}_{o_t \sim \mathcal{D}}\Big[\mathrm{MSE}\big(\mu^j(o^j_t; \theta^j), \hat{\mu}^{i,j}(o^i_t; \theta^i)\big)\Big] \quad (3)$$

$$J^{i,j}_{TS\text{-discrete}}(\theta^i, \theta^j) = -\mathbb{E}_{o_t \sim \mathcal{D}}\Big[D_{KL}\big(\pi^j(\cdot\,|\,o^j_t; \theta^j)\,||\,\hat{\pi}^{i,j}(\cdot\,|\,o^i_t; \theta^i)\big)\Big] \quad (4)$$

where $\hat{\mu}^{i,j}$ (or $\hat{\pi}^{i,j}$ in the discrete case) is the policy head of agent $i$ trying to predict the action of agent $j$. The total objective for a given agent $i$ becomes:

$$J^i_{total}(\theta^i) = J^i_{PG}(\theta^i) + \lambda_1 \sum_j J^{i,j}_{TS}(\theta^i, \theta^j) + \lambda_2 \sum_j J^{j,i}_{TS}(\theta^j, \theta^i) \quad (5)$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters that respectively weigh how well an agent should predict its teammates' actions, and how predictable an agent should be for its teammates. We call TeamReg this dual regularization with team-spirit objectives. Figure 2 summarizes these interactions.

4.2 Coach regularization

In order to foster coordinated interactions, this method aims at teaching the agents to recognize different situations and synchronously select corresponding sub-behaviors.

³ Pseudocodes of our implementations are provided in Appendix D (see Algorithms 1 and 2).

Figure 3: Illustration of CoachReg with two agents. A central model, the coach, takes all agents' observations as input and outputs the current mode (policy mask).
Agents are regularized to predict the same mask from their local observations and to optimize the corresponding sub-policy.

Sub-policy selection. Firstly, to enable explicit sub-behavior selection, we propose the use of policy masks as a means to modulate the agents' policies. A policy mask $u^j$ is a one-hot vector of size $K$ (a fixed hyper-parameter) with its $j$-th component set to one. In practice, we use these masks to perform dropout [30] in a structured manner on $\tilde{h}_1 \in \mathbb{R}^M$, the pre-activations of the first hidden layer $h_1$ of the policy network $\pi$. To do so, we construct the vector $\boldsymbol{u}^j$, the concatenation of $C$ copies of $u^j$, in order to reach the dimensionality $M = C \cdot K$. The element-wise product $\boldsymbol{u}^j \odot \tilde{h}_1$ is performed, and only the units of $\tilde{h}_1$ at indices $m$ with $m \bmod K = j$ are kept, for $m = 0, \dots, M - 1$. Each agent $i$ generates $e^i_t$, its own policy mask, from its observation $o^i_t$, to modulate its policy network. Here, a simple linear layer $l^i$ is used to produce a categorical probability distribution $p^i(e^i_t\,|\,o^i_t)$ from which the one-hot vector is sampled:

$$p^i(e^i_t = u^j \,|\, o^i_t) = \frac{\exp\big(l^i(o^i_t; \theta^i)_j\big)}{\sum_{k=0}^{K-1} \exp\big(l^i(o^i_t; \theta^i)_k\big)} \quad (6)$$

Synchronous sub-policy selection. Although the policy-masking mechanism enables an agent to swiftly switch between sub-policies, it does not encourage the agents to synchronously modulate their behavior. To promote synchronicity we introduce the coach entity, parametrized by $\psi$, which learns to produce policy masks $e^c_t$ from the joint observations, i.e. $p^c(e^c_t\,|\,o_t; \psi)$. The coach is used at training time only and drives the agents toward synchronously selecting the same behavior mask. Specifically, the coach is trained to output masks that (1) yield high returns when used by the agents and (2) are predictable by the agents.
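The structured-dropout masking described above is easy to make concrete. A minimal plain-Python sketch (helper names are ours; lists stand in for the pre-activation tensor):

```python
def policy_mask(j, K, C):
    """One-hot vector u_j of size K, concatenated C times (size M = C * K)."""
    u = [1.0 if k == j else 0.0 for k in range(K)]
    return u * C  # list repetition = concatenation of C copies

def apply_mask(mask, h):
    """Element-wise product: keeps unit m of h iff m % K == j."""
    return [m_k * h_k for m_k, h_k in zip(mask, h)]

K, C = 4, 2                      # K sub-policies, hidden size M = C * K = 8
h = list(range(K * C))           # pre-activations of the first hidden layer
masked = apply_mask(policy_mask(1, K, C), h)
```

Only every K-th unit (offset j) of the first hidden layer survives, so each mask value carves out a distinct sub-network of the same policy.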
Similarly, each agent is regularized so that (1) its private mask matches the coach's mask and (2) it derives efficient behavior when using the coach's mask. At evaluation time, the coach is removed and the agents rely only on their own policy masks. The policy gradient objective when agent $i$ is provided with the coach's mask is given by:

$$J^i_{EPG}(\theta^i, \psi) = \mathbb{E}_{o_t, a_t \sim \mathcal{D}}\Big[Q^i(o_t, a_t)\Big|_{a^i_t = \mu(o^i_t, e^c_t; \theta^i),\ e^c_t \sim p^c(\cdot\,|\,o_t; \psi)}\Big] \quad (7)$$

The difference between the mask distribution of agent $i$ and the coach's is measured with the Kullback-Leibler divergence:

$$J^i_E(\theta^i, \psi) = -\mathbb{E}_{o_t \sim \mathcal{D}}\Big[D_{KL}\big(p^c(\cdot\,|\,o_t; \psi)\,||\,p^i(\cdot\,|\,o^i_t; \theta^i)\big)\Big] \quad (8)$$

The total objective for agent $i$ is:

$$J^i_{total}(\theta^i) = J^i_{PG}(\theta^i) + \lambda_1 J^i_E(\theta^i, \psi) + \lambda_2 J^i_{EPG}(\theta^i, \psi) \quad (9)$$

with $\lambda_1$ and $\lambda_2$ the regularization coefficients. Similarly, the coach is trained with the following dual objective, weighted by the $\lambda_3$ coefficient:

$$J^c_{total}(\psi) = \frac{1}{N}\sum_{i=1}^{N}\Big(J^i_{EPG}(\theta^i, \psi) + \lambda_3 J^i_E(\theta^i, \psi)\Big) \quad (10)$$

In order to propagate gradients through the sampled policy mask, we reparametrize the categorical distribution using the Gumbel-softmax [15] with a temperature of 1. We call this coordinated sub-policy selection regularization CoachReg and illustrate it in Figure 3.

5 Related Work

Several works in MARL consider explicit communication channels between the agents and distinguish between communicative actions (e.g. broadcasting a given message) and physical actions (e.g. moving in a given direction) [6, 26, 21]. Consequently, they often focus on the emergence of language, considering tasks where the agents must discover a common communication protocol in order to succeed. Deriving a successful communication protocol can already be seen as coordination in the communicative action space, and can enable, to some extent, successful coordination in the physical action space [1].
Yet explicit communication is not a necessary condition for coordination, as agents can rely on physical communication [26, 9]. TeamReg falls in the line of work that explores how to shape agents' behaviors with respect to other agents through auxiliary tasks. Strouse et al. [31] use the mutual information between the agent's policy and a goal-independent policy to shape the agent's behavior towards hiding or spelling out its current goal. However, this approach is only applicable to tasks with an explicit goal representation and is not specifically intended for coordination. Jaques et al. [16] approximate the direct causal effect between agents' actions and use it as an intrinsic reward to encourage social empowerment. This approximation relies on each agent learning a model of other agents' policies to predict its effect on them. In general, this type of behavior prediction can be referred to as agent modelling (or opponent modelling) and has been used in previous work to enrich representations [12, 13], to stabilise the learning dynamics [10], or to classify the opponent's play style [29]. With CoachReg, agents learn to unitedly recognize different modes in the environment and adapt by jointly switching their policy. This echoes the hierarchical RL literature, and in particular the single-agent options framework [3], where the agent switches between different sub-policies, the options, depending on the current state. To encourage cooperation in the multi-agent setting, Ahilan and Dayan [1] proposed that one agent, the "manager", be extended with the ability to set other agents' rewards in order to guide collaboration. CoachReg stems from a similar idea: reaching a consensus is easier with a central entity that can asymmetrically influence the group.
Yet Ahilan and Dayan [1] guide the group in terms of "ends" (influencing it through the rewards), whereas CoachReg constrains it in terms of "means" (the group must synchronously switch between different strategies). Hence, the interest of CoachReg does not lie merely in training sub-policies (obtained here through a simple and novel masking procedure) but rather in co-evolving synchronized sub-policies across multiple agents. Mahajan et al. [25] also look at sub-policy co-evolution to tackle the problem of joint exploration; however, their selection mechanism occurs only at the first timestep and requires duplicating random seeds across agents at test time. In contrast, with CoachReg the sub-policy selection is explicitly decided by the agents themselves at each timestep, without requiring a common sampling procedure, since mode recognition has been learned and grounded in the state throughout training. Finally, Barton et al. [4] propose convergent cross mapping (CCM) to measure the degree of effective coordination between two agents. Although this represents an interesting avenue for behavior analysis, it fails to provide a tool for effectively enforcing coordination, as CCM must be computed over long time series, making it an impractical learning signal for single-step temporal difference methods. To our knowledge, this work is the first to extend agent modelling to derive an inductive bias towards team-predictable policies, and to introduce a collective, agent-induced modulation of the policies without an explicit communication channel. Importantly, these coordination proxies are enforced throughout training only, which allows decentralised execution to be maintained at test time.

6 Training environments

Our continuous-control tasks are built on OpenAI's multi-agent particle environment [26]. SPREAD and CHASE were introduced by [24]. We use SPREAD as-is, but with sparse rewards.
CHASE is modified with a prey controlled by repulsion forces, so that only the predators are learnable, as we wish to focus on coordination in cooperative tasks. Finally, we introduce COMPROMISE and BOUNCE, where agents are physically tied together. While positive return can be achieved in these tasks by selfish agents, they all benefit from coordinated strategies, and maximal return can only be achieved by agents working closely together. Figure 4 presents a visualization and a brief description. In all tasks, agents receive as observation their own global position and velocity, as well as the relative position of other entities. A more detailed description is provided in Appendix B. Note that work showcasing experiments on these environments often uses discrete action spaces and dense rewards (e.g. the proximity to the objective) [14, 24, 17]. In our experiments, agents learn with continuous action spaces and from sparse rewards, which is a far more challenging setting.

Figure 4: Multi-agent tasks we employ. (a) SPREAD: Agents must spread out and cover a set of landmarks. (b) BOUNCE: Two agents are linked together by a spring and must position themselves so that the falling black ball bounces towards a target. (c) COMPROMISE: Two linked agents must compete or cooperate to reach their own assigned landmark. (d) CHASE: Two agents chase a (non-learning) prey (turquoise) that moves w.r.t. repulsion forces from predators and walls.

7 Results and Discussion

The proposed methods offer a way to incorporate new inductive biases into CTDE multi-agent policy search algorithms. We evaluate them by extending MADDPG, one of the most widely used algorithms in the MARL literature. We compare against vanilla MADDPG as well as two of its variants on the four cooperative multi-agent tasks described in Section 6. The first variant (DDPG) is the single-agent counterpart of MADDPG (decentralized training).
The second (MADDPG + sharing) shares the policy and value-function models across agents. In addition to the two proposed algorithms and the three baselines, we present results for two ablated versions of our methods. The first ablation (MADDPG + agent modelling) is similar to TeamReg but with $\lambda_2 = 0$, which enforces agent modelling without encouraging agent predictability. The second ablation (MADDPG + policy mask) uses the same policy architecture as CoachReg but with $\lambda_{1,2,3} = 0$, which means that agents still predict and apply a mask to their own policy, but synchronicity is not encouraged. To offer a fair comparison between all methods, the hyper-parameter search routine is the same for each algorithm and environment (see Appendix E.1). For each search-experiment (one per algorithm per environment), 50 randomly sampled hyper-parameter configurations, each using 3 random seeds, are used to train the models for 15,000 episodes. For each algorithm-environment pair, we then select the best hyper-parameter configuration for the final comparison and retrain it on 10 seeds for twice as long. This thorough evaluation procedure represents around 3 CPU-years. We give all details about the training setup and model selection in Appendices C and E.2. The results of the hyper-parameter searches are given in Appendix E.5. Interestingly, Figure 9 shows that our proposed coordination regularizers improve robustness to hyper-parameters despite having more hyper-parameters to tune.

7.1 Asymptotic Performance

Figure 5 reports the average learning curves and Table 1 presents the final performance. CoachReg is the best-performing algorithm considering performance across all tasks. TeamReg also significantly improves performance on two tasks (SPREAD and BOUNCE) but shows unstable behavior on COMPROMISE, the only task with an adversarial component.
This result reveals one limitation of this approach and is discussed in detail in Appendix F. Note that all algorithms perform similarly well on CHASE, with a slight advantage to the one using parameter sharing; yet this superiority is restricted to this task, where the optimal strategy is to move symmetrically and squeeze the prey into a corner. Contrary to popular belief, we find that MADDPG almost never significantly outperforms DDPG in these sparse-reward environments, supporting the hypothesis that while CTDE algorithms can in principle identify and reinforce highly rewarding coordinated behaviors, they are likely to fail to do so if not incentivized to coordinate. Regarding the ablated versions of our methods, the use of unsynchronized policy masks might result in swift and unpredictable behavioral changes, making it difficult for agents to perform together and coordinate. Experimentally, "MADDPG + policy mask" performs similarly to or worse than MADDPG on all but one environment, and never outperforms the full CoachReg approach. However, policy masks alone seem sufficient to succeed on SPREAD, which is about selecting a landmark from a set. Finally, "MADDPG + agent modelling" does not drastically improve on MADDPG apart from one environment, and is always outperformed by the full TeamReg (except on COMPROMISE, see Appendix F), which supports the importance of enforcing predictability alongside agent modelling.

Table 1: Final performance reported as mean return over agents, averaged across 10 episodes and 10 seeds (± SE). All "+" columns are MADDPG variants.

env          DDPG         MADDPG       +sharing     +agent modelling   +policy mask   +TeamReg (ours)   +CoachReg (ours)
SPREAD       133 ± 12     159 ± 6      47 ± 8       183 ± 10           221 ± 11       216 ± 12          210 ± 12
BOUNCE       3.6 ± 1.4    4.0 ± 1.6    0.0 ± 0.0    3.8 ± 1.5          3.7 ± 1.1      5.8 ± 1.3         7.4 ± 1.2
COMPROMISE   19.1 ± 1.2   18.1 ± 1.1   19.6 ± 1.5   12.9 ± 0.9         18.4 ± 1.3     8.8 ± 0.9         31.1 ± 1.1
CHASE        727 ± 87     834 ± 80     980 ± 64     946 ± 69           722 ± 82       917 ± 90          949 ± 54

Figure 5: Learning curves (mean return over agents) for our two proposed algorithms, two ablations and three baselines on all four environments. Solid lines are the mean and envelopes are the standard error (SE) across the 10 training seeds.

7.2 Effects of enforcing predictable behavior

Here we validate that enforcing predictability makes the agent-modelling task more successful. To this end, we compare, on the SPREAD environment, the team-spirit losses of TeamReg and its ablated versions. Figure 6 shows that initially, due to the weight initialization, the predicted and actual actions both have relatively small norms, yielding small values of the team-spirit loss. As training goes on (~1,000 episodes), the norms of the action vectors increase and the regularization loss becomes more important. As expected, MADDPG yields the worst team-spirit loss, as it is not trained to predict the actions of other agents. When using only the agent-modelling objective ($\lambda_1 > 0$), the agents significantly decrease the team-spirit loss, but it never reaches values as low as with the full TeamReg objective ($\lambda_1 > 0$ and $\lambda_2 > 0$). Note that the team-spirit loss increases when performance starts to improve, i.e. when agents start to master the task (~8,000 episodes). Indeed, once the return-maximization signal becomes stronger, the relative importance of the auxiliary objective is reduced.
Being predictable with respect to one another may push agents to explore in a more structured and informed manner in the absence of a reward signal, as similarly pursued by intrinsic motivation approaches [5].

7.3 Analysis of synchronous sub-policy selection

In this section we confirm that CoachReg yields the desired behavior: agents synchronously alternating between varied sub-policies. Figure 7 shows the average entropy of the mask distributions for each environment, compared to the entropy of Categorical Uniform Distributions of size $k$ ($k$-CUD). In all environments, agents use several masks, and tend to alternate between masks with more variety (close to uniformly switching between 3 masks) on SPREAD (which has 3 agents and 3 goals) than on the other environments (comprised of 2 agents). Moreover, the Hamming proximity between the agents' mask sequences, $1 - D_h$, where $D_h$ is the Hamming distance (i.e. the ratio of timesteps at which the two sequences differ), shows that agents synchronously select the same policy mask at test time (without a coach). Finally, we observe that some settings result in the agents coming up with interpretable strategies, like the one depicted in Figure 13 in Appendix G.2, where the agents alternate between two sub-policies depending on the position of the target.⁴

⁴ See animations at https://sites.google.com/view/marl-coordination/

Figure 6: Effect of enabling and disabling the coefficients $\lambda_1$ and $\lambda_2$ on the ability of agents to predict their teammates' behavior. Solid lines and envelopes are average and SE over 10 seeds on SPREAD.
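The synchronicity measure $1 - D_h$ used above can be computed directly from two mask sequences. A minimal sketch (the function name is ours; integer mask indices stand in for the one-hot vectors):

```python
def hamming_proximity(masks_a, masks_b):
    """1 - D_h, where D_h is the ratio of timesteps at which the two
    agents' policy-mask sequences differ. Equals 1.0 for perfectly
    synchronized agents and 0.0 for agents that never agree."""
    assert len(masks_a) == len(masks_b)
    disagreements = sum(a != b for a, b in zip(masks_a, masks_b))
    return 1.0 - disagreements / len(masks_a)

# fully synchronized pair of mask sequences (illustrative values)
sync = hamming_proximity([0, 1, 1, 2], [0, 1, 1, 2])
```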
Figure 7: (Left) Average entropy of the policy mask distributions for each task. H_max,k is the entropy of a k-CUD. (Right) Average Hamming proximity between the policy mask sequences of agent pairs. rand k stands for agents independently sampling their masks from k-CUD. Error bars are the SE on 10 seeds.

7.4 Experiments on discrete action spaces

Table 2: Average returns for 3v2 football

MADDPG                   | 0.004 ± 0.002
MADDPG + sharing         | 0.005 ± 0.003
MADDPG + TeamReg (ours)  | 0.006 ± 0.003
MADDPG + CoachReg (ours) | 0.088 ± 0.017

Figure 8: Snapshot of the Google Research Football 3vs1-with-keeper task.

We evaluate our techniques on the more challenging 3vs2 task of the Google Research Football environment [20]. In this environment, each agent controls an offensive player and tries to score against a defensive player and a goalkeeper controlled by the engine's rule-based bots. Here agents have discrete action spaces of size 21, with actions like movement directions, dribble, sprint, short pass, high pass, etc. We use as observations 37-dimensional vectors containing the players' and ball's coordinates, directions, etc. The algorithms presented in Table 2 were trained using 25 randomly sampled hyperparameter configurations. The best configuration was retrained using 10 seeds for 80,000 episodes of 100 steps. Table 2 shows the mean return (± standard error across seeds) on the last 10,000 episodes. All algorithms but MADDPG + CoachReg fail to reliably learn policies that achieve positive return (i.e. scoring goals).
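The paper's reference list includes the Gumbel-Softmax estimator [15], a standard way to keep discrete action selection differentiable in deterministic-policy methods like MADDPG. A minimal sampling sketch, purely for illustration (this section does not spell out the mechanism actually used, so treat this as an assumption):

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature).
    Lower temperatures push the output closer to a hard one-hot action."""
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# A 3-way categorical (e.g. a tiny action space) relaxed at temperature 1.
probs = gumbel_softmax_sample([2.0, 0.5, -1.0], temperature=1.0,
                              rng=random.Random(0))
```

The output is a probability vector that can be fed to a centralized critic while still admitting gradients through the actor.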
8 Conclusion

In this work we motivate the use of coordinated policies to ease the discovery of successful strategies in cooperative multi-agent tasks and propose two distinct approaches to promote coordination in CTDE multi-agent RL algorithms. While the benefits of TeamReg appear task-dependent (we show, for example, that it can be detrimental on tasks with a competitive component), CoachReg significantly improves performance on almost all presented environments. Motivated by the success of this single-step coordination technique, a promising direction is to explore model-based planning approaches to promote coordination over long-term multi-agent interactions.

Broader Impact

In this work, we present and study methods to enforce coordination in MARL algorithms. It goes without saying that multi-agent systems can be employed for positive and negative applications alike. We do not propose methods aimed at making new applications possible or at improving a particular set of applications. We instead propose methods that help to better understand and improve multi-agent RL algorithms in general. Therefore, in this section we do not discuss the impact of Multi-Agent Reinforcement Learning applications themselves but focus on the impact of our contribution: promoting multi-agent behaviors that are coordinated.

We first observe that current Multi-Agent Reinforcement Learning (MARL) algorithms may fail to train agents that leverage information about the behavior of their teammates, even when explicitly given their teammates' observations, actions and current policies during the training phase. We believe that this is an important observation worth raising some concern in the community, since there is a widespread belief that centralized training (as in MADDPG) should always outperform decentralized training (as in DDPG).
Not only is this belief unsupported by empirical evidence (at least in our experiments), but it also prevents the community from investigating and tackling this flaw, which is an important limitation for learning safer and more effective multi-agent behavior. By not accounting for the behavior of its teammates, an agent cannot adapt to a new teammate or even to a change in a teammate's behavior. This prevents current methods from being applied in the real world, where there are external perturbations and uncertainties and where an artificial agent may need to interact with various different individuals. We propose to focus on coordination and sketch a definition of it: an agent's behavior should be predictable given its teammates' behavior. While this definition is restrictive, we believe that it is a good starting point. Indeed, enforcing that criterion should make learning agents more aware of their teammates in order to coordinate with them. Yet, coordination alone does not ensure success, as agents could be coordinated in an unproductive manner. Moreover, coordination could have detrimental effects if it enables an attacker to influence an agent by taking control of a teammate or using a mock-up teammate. For these reasons, when using multi-agent RL algorithms (or even single-agent RL, for that matter) in real-world applications, additional safeguards are absolutely required to prevent the system from misbehaving, which is highly probable if out-of-distribution states are encountered.

Acknowledgments and Disclosure of Funding

We thank Olivier Delalleau for his insightful feedback and comments. We also acknowledge funding in support of this work from the Fonds de Recherche Nature et Technologies (FRQNT) and Mitacs, as well as Compute Canada for supplying computing resources.

References

[1] Sanjeevan Ahilan and Peter Dayan. Feudal multi-agent hierarchies for cooperative reinforcement learning.
arXiv preprint, 2019.

[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[3] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[4] Sean L. Barton, Nicholas R. Waytowich, Erin Zaroukian, and Derrik E. Asher. Measuring collaborative emergent behavior in multi-agent reinforcement learning. In International Conference on Human Systems Engineering and Design: Future Trends and Applications, pages 422–427. Springer, 2018.

[5] Nuttapong Chentanez, Andrew G. Barto, and Satinder P. Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 1281–1288, 2005.

[6] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.

[7] Jakob Foerster, Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew Botvinick, and Michael Bowling. Bayesian action decoder for deep multi-agent reinforcement learning. In International Conference on Machine Learning, 2019.

[8] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[9] Jayesh K. Gupta, Maxim Egorov, and Mykel J. Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In AAMAS Workshops, 2017.

[10] He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pages 1804–1813, 2016.

[11] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. Is multiagent deep reinforcement learning the answer or the question?
A brief survey. arXiv preprint arXiv:1810.05587, 2018.

[12] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. Agent modeling as auxiliary task for deep reinforcement learning. In AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 2019.

[13] Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. A deep policy inference Q-network for multi-agent systems. arXiv preprint, 2017.

[14] Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning, pages 2961–2970, 2019.

[15] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR 2017). OpenReview.net, 2017.

[16] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z. Leibo, and Nando de Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, pages 3040–3049, 2019.

[17] Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems, pages 7254–7264, 2018.

[18] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] Karol Kurach, Anton Raichuk, Piotr Stańczyk, Michał Zając, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, et al. Google Research Football: A novel reinforcement learning environment. arXiv preprint arXiv:1907.11180, 2019.

[21] Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni.
Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.

[22] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[23] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.

[24] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.

[25] Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. MAVEN: Multi-agent variational exploration. In Advances in Neural Information Processing Systems, pages 7613–7624, 2019.

[26] Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[27] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[28] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pages 4292–4301, 2018.

[29] Frederik Schadd, Sander Bakkes, and Pieter Spronck. Opponent modeling in real-time strategy games. In GAMEON, pages 61–70, 2007.

[30] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting.
The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[31] Daniel Strouse, Max Kleiman-Weiner, Josh Tenenbaum, Matt Botvinick, and David J. Schwab. Learning to share and hide intentions using information regularization. In Advances in Neural Information Processing Systems, pages 10270–10281, 2018.

[32] George E. Uhlenbeck and Leonard S. Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.

[33] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

A Additional details for the experiment presented in Section 3 (Motivation)

We trained each agent i with online Q-learning [33] on the Q_i(a_i, s) table using Boltzmann exploration [18]. The Boltzmann temperature is fixed to 1 and we set the learning rate to 0.05 and the discount factor to 0.99. After each learning episode we evaluate the current greedy policy on 10 episodes and report the mean return. Curves are averaged over 20 seeds and the shaded area represents the standard error.

B Task descriptions

SPREAD (Figure 4a): In this environment, there are 3 agents (small orange circles) and 3 landmarks (bigger gray circles). At every timestep, agents receive a team-reward r_t = n − c, where n is the number of landmarks occupied by at least one agent and c is the number of collisions occurring at that timestep. To maximize their return, agents must therefore spread out and cover all landmarks. Initial agents' and landmarks' positions are random. Termination is triggered when the maximum number of timesteps is reached.

BOUNCE (Figure 4b): In this environment, two agents (small orange circles) are linked together with a spring that pulls them toward each other when stretched above its relaxation length. At the episode's mid-time, a ball (smaller black circle) falls from the top of the environment.
Agents must position themselves correctly so as to have the ball bounce off the spring towards the target (bigger beige circle), which turns yellow if the ball's bouncing trajectory passes through it. They receive a team-reward of r_t = 0.1 if the ball reflects towards the side walls, r_t = 0.2 if the ball reflects towards the top of the environment, and r_t = 10 if the ball reflects towards the target. At initialisation, the target's and ball's vertical positions are fixed and their horizontal positions are random. Agents' initial positions are also random. Termination is triggered when the ball is bounced by the agents or when the maximum number of timesteps is reached.

COMPROMISE (Figure 4c): In this environment, two agents (small orange circles) are linked together with a spring that pulls them toward each other when stretched above its relaxation length. They both have a distinct assigned landmark (light gray circle for the light orange agent, dark gray circle for the dark orange agent), and receive a reward of r_t = 10 when they reach it. Once a landmark is reached by its corresponding agent, the landmark is randomly relocated in the environment. Initial positions of agents and landmarks are random. Termination is triggered when the maximum number of timesteps is reached.

CHASE (Figure 4d): In this environment, two predators (orange circles) chase a prey (turquoise circle). The prey moves according to a scripted policy consisting of repulsion forces from the walls and predators. At each timestep, the learning agents (predators) receive a team-reward of r_t = n, where n is the number of predators touching the prey. The prey has a greater maximum speed and acceleration than the predators. Therefore, to maximize their return, the two agents must coordinate in order to squeeze the prey into a corner or against a wall and effectively trap it there. Termination is triggered when the maximum number of timesteps is reached.
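The SPREAD team reward r_t = n − c above can be sketched directly from agent and landmark positions. The circle-overlap tests for occupancy and collision are an assumption for illustration; the exact geometry follows the environment implementation:

```python
import math

def spread_reward(agent_pos, landmark_pos, agent_radius, landmark_radius):
    """Team reward r_t = n - c for SPREAD: n = landmarks occupied by at
    least one agent, c = collisions between agent pairs."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # A landmark counts as occupied when some agent's circle overlaps it.
    n = sum(
        any(dist(a, l) <= agent_radius + landmark_radius for a in agent_pos)
        for l in landmark_pos
    )
    # A collision is two agents' circles overlapping.
    c = sum(
        dist(agent_pos[i], agent_pos[j]) <= 2 * agent_radius
        for i in range(len(agent_pos))
        for j in range(i + 1, len(agent_pos))
    )
    return n - c

# Two landmarks covered, no collisions -> reward 2.
r = spread_reward([(0, 0), (1, 0), (5, 5)], [(0, 0), (1, 0), (9, 9)],
                  agent_radius=0.1, landmark_radius=0.3)  # -> 2
```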
C Training details

In all of our experiments, we use the Adam optimizer [19] to perform parameter updates. All models (actors, critics and coach) are parametrized by feedforward networks containing two hidden layers of 128 units. We use the Rectified Linear Unit (ReLU) [27] as the activation function and layer normalization [2] on the pre-activation units to stabilize learning. We use a buffer size of 10^6 entries and a batch size of 1024. We collect 100 transitions by interacting with the environment for each learning update. For all tasks in our hyper-parameter searches, we train the agents for 15,000 episodes of 100 steps and then re-train the best configuration for each algorithm-environment pair for twice as long (30,000 episodes) to ensure full convergence for the final evaluation. The scale of the exploration noise is kept constant for the first half of the training time and then decreases linearly to 0 until the end of training. We use a discount factor γ of 0.95 and a gradient-clipping threshold of 0.5 in all experiments. Finally, for CoachReg, we fixed K to 4, meaning that agents could choose between 4 sub-policies. Since the policies' hidden layers are of size 128, the corresponding value for C is 32. All experiments were run on Intel E5-2683 v4 Broadwell (2.1GHz) CPUs in less than 12 hours.
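The exploration-noise schedule described above (constant for the first half of training, then linearly annealed to 0) can be sketched as:

```python
def noise_scale(episode, total_episodes, initial_scale):
    """Exploration-noise scale: initial_scale for the first half of
    training, then a linear decay reaching 0 at the final episode."""
    half = total_episodes / 2
    if episode < half:
        return initial_scale
    remaining = total_episodes - episode
    return initial_scale * max(remaining / half, 0.0)

print(noise_scale(0, 30000, 1.0))      # -> 1.0
print(noise_scale(22500, 30000, 1.0))  # -> 0.5
print(noise_scale(30000, 30000, 1.0))  # -> 0.0
```

The scale multiplies the Ornstein-Uhlenbeck noise added to the actors' actions during training.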
D Algorithms

Algorithm 1 TeamReg
  Randomly initialize N critic networks Q_i and actor networks μ_i
  Initialize the target weights
  Initialize one replay buffer D
  for episode = 0 to number of episodes do
    Initialize random processes N_i for action exploration
    Receive initial joint observation o_0
    for timestep t = 0 to episode length do
      Select action a_i = μ_i(o_t^i) + N_t^i for each agent
      Execute joint action a_t and observe joint reward r_t and new observation o_{t+1}
      Store transition (o_t, a_t, r_t, o_{t+1}) in D
    end for
    Sample a random minibatch of M transitions from D
    for each agent i do
      Evaluate L^i and J^i_PG from Equations (1) and (2)
      for each other agent j ≠ i do
        Evaluate J^{i,j}_TS from Equations (3) and (4)
        Update actor j: θ_j ← θ_j + α_θ ∇_{θ_j} λ_2 J^{i,j}_TS
      end for
      Update critic: φ_i ← φ_i − α_φ ∇_{φ_i} L^i
      Update actor i: θ_i ← θ_i + α_θ ∇_{θ_i} ( J^i_PG + λ_1 Σ_{j=1}^N J^{i,j}_TS )
    end for
    Update all target weights
  end for

Algorithm 2 CoachReg
  Randomly initialize N critic networks Q_i, actor networks μ_i and one coach network p_c
  Initialize N target networks Q_i′ and μ_i′
  Initialize one replay buffer D
  for episode = 0 to number of episodes do
    Initialize random processes N_i for action exploration
    Receive initial joint observation o_0
    for timestep t = 0 to episode length do
      Select action a_i = μ_i(o_t^i) + N_t^i for each agent
      Execute joint action a_t and observe joint reward r_t and new observation o_{t+1}
      Store transition (o_t, a_t, r_t, o_{t+1}) in D
    end for
    Sample a random minibatch of M transitions from D
    for each agent i do
      Evaluate L^i and J^i_PG from Equations (1) and (2)
      Update critic: φ_i ← φ_i − α_φ ∇_{φ_i} L^i
      Update actor: θ_i ← θ_i + α_θ ∇_{θ_i} J^i_PG
    end for
    for each agent i do
      Evaluate J^i_E and J^i_EPG from Equations (8) and (7)
      Update actor: θ_i ← θ_i + α_θ ∇_{θ_i} ( λ_1 J^i_E + λ_2 J^i_EPG )
    end for
    Update coach: ψ ← ψ + α_ψ ∇_ψ (1/N) Σ_{i=1}^N ( J^i_EPG + λ_3 J^i_E )
    Update all target weights
  end for

E Hyper-parameter search

E.1 Hyper-parameter search ranges

We perform searches over the following hyper-parameters: the learning rate of the actor α_θ, the learning rate of the critic ω_φ relative to the actor (α_φ = ω_φ · α_θ), the target-network soft-update parameter τ and the initial scale of the exploration noise η_noise for the Ornstein-Uhlenbeck noise-generating process [32] as used by Lillicrap et al. [22]. When using TeamReg and CoachReg, we additionally search over the regularization weights λ_1, λ_2 and λ_3. The learning rate of the coach is always equal to the actor's learning rate (i.e. α_θ = α_ψ), motivated by their similar architectures and learning signals and in order to reduce the search space. Table 2 shows the ranges from which values for the hyper-parameters are drawn uniformly during the searches.

Table 2: Ranges for hyper-parameter search (log base 10)

HYPER-PARAMETER | RANGE
log(α_θ)  | [-8, -3]
log(ω_φ)  | [-2, 2]
log(τ)    | [-3, -1]
log(λ_1)  | [-3, 0]
log(λ_2)  | [-3, 0]
log(λ_3)  | [-1, 1]
η_noise   | [0.3, 1.8]

E.2 Model selection

During training, a policy is evaluated on a set of 10 different episodes every 100 learning steps. At the end of the training, the model at the best evaluation iteration is saved as the best version of the policy for this training, and is re-evaluated on 100 different episodes to have a better assessment of its final performance. The performance of a hyper-parameter configuration is defined as the average performance (across seeds) of the best policies learned using this set of hyper-parameter values.

E.3 Selected hyper-parameters

Tables 3, 4, 5, and 6 show the best hyper-parameters found by the random searches for each of the environments and each of the algorithms.
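For reference, each configuration in the search is drawn from the ranges in Table 2, with log-scaled parameters sampled uniformly in log10 space. A minimal sketch (the dictionary keys are ours):

```python
import random

def sample_config(rng):
    """Draw one hyper-parameter configuration from the Table 2 ranges."""
    def log_uniform(lo, hi):
        # Uniform in log10 space, as the log(.) rows of Table 2 suggest.
        return 10.0 ** rng.uniform(lo, hi)

    return {
        "alpha_theta": log_uniform(-8, -3),
        "omega_phi": log_uniform(-2, 2),
        "tau": log_uniform(-3, -1),
        "lambda_1": log_uniform(-3, 0),
        "lambda_2": log_uniform(-3, 0),
        "lambda_3": log_uniform(-1, 1),
        "eta_noise": rng.uniform(0.3, 1.8),  # sampled on a linear scale
    }

config = sample_config(random.Random(0))
```

Repeating this draw 50 times per (algorithm, environment) pair reproduces the shape of the search reported in Appendix E.5.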
Table 3: Best found hyper-parameters for the SPREAD environment

HYPER-PARAMETER | DDPG | MADDPG | MADDPG + SHARING | MADDPG + TEAMREG | MADDPG + COACHREG
α_θ | 5.3e-5 | 2.1e-5 | 9.0e-4 | 2.5e-5 | 1.2e-5
ω_φ | 53 | 79 | 0.71 | 42 | 82
τ | 0.05 | 0.083 | 0.076 | 0.098 | 0.0077
λ_1 | - | - | - | 0.054 | 0.13
λ_2 | - | - | - | 0.29 | 0.24
λ_3 | - | - | - | - | 8.4
η_noise | 1.0 | 0.5 | 0.7 | 1.2 | 1.6

Table 4: Best found hyper-parameters for the BOUNCE environment

HYPER-PARAMETER | DDPG | MADDPG | MADDPG + SHARING | MADDPG + TEAMREG | MADDPG + COACHREG
α_θ | 8.1e-4 | 3.8e-5 | 1.2e-4 | 1.3e-5 | 6.8e-5
ω_φ | 2.4 | 87 | 0.47 | 85 | 9.4
τ | 0.089 | 0.016 | 0.06 | 0.055 | 0.02
λ_1 | - | - | - | 0.06 | 0.0066
λ_2 | - | - | - | 0.0026 | 0.23
λ_3 | - | - | - | - | 0.34
η_noise | 1.2 | 0.9 | 1.2 | 1.0 | 1.1

Table 5: Best found hyper-parameters for the CHASE environment

HYPER-PARAMETER | DDPG | MADDPG | MADDPG + SHARING | MADDPG + TEAMREG | MADDPG + COACHREG
α_θ | 4.5e-4 | 2.0e-4 | 9.7e-4 | 1.3e-5 | 1.8e-4
ω_φ | 32 | 64 | 0.79 | 85 | 90
τ | 0.031 | 0.021 | 0.032 | 0.055 | 0.011
λ_1 | - | - | - | 0.06 | 0.0069
λ_2 | - | - | - | 0.0026 | 0.86
λ_3 | - | - | - | - | 0.76
η_noise | 0.6 | 1.0 | 1.5 | 1.0 | 1.1

Table 6: Best found hyper-parameters for the COMPROMISE environment

HYPER-PARAMETER | DDPG | MADDPG | MADDPG + SHARING | MADDPG + TEAMREG | MADDPG + COACHREG
α_θ | 6.1e-5 | 3.1e-4 | 6.2e-4 | 1.5e-5 | 3.4e-4
ω_φ | 1.7 | 0.94 | 0.58 | 90 | 29
τ | 0.065 | 0.045 | 0.007 | 0.02 | 0.0037
λ_1 | - | - | - | 0.0013 | 0.65
λ_2 | - | - | - | 0.56 | 0.5
λ_3 | - | - | - | - | 1.3
η_noise | 1.1 | 0.7 | 1.3 | 1.6 | 1.6

Table 7: Best found hyper-parameters for the 3-vs-1-with-keeper Google Football environment

HYPER-PARAMETER | MADDPG | MADDPG + SHARING | MADDPG + TEAMREG | MADDPG + COACHREG
α_θ | 1.6e-6 | 3.4e-5 | 3.5e-6 | 9.4e-5
ω_φ | 3.1 | 13 | 0.96 | 2.9
τ | 0.004 | 0.0014 | 0.0066 | 0.018
λ_1 | - | - | 0.1 | 0.027
λ_2 | - | - | 0.02 | 0.027
λ_3 | - | - | - | 2.4

E.4 Selected hyper-parameters (ablations)

Tables 8, 9, 10, and 11 show the best hyper-parameters found by the random searches for each of the environments and each of the ablated algorithms.

Table 8: Best found hyper-parameters for the SPREAD environment

HYPER-PARAMETER | MADDPG + AGENT MODELLING | MADDPG + POLICY MASK
α_θ | 1.3e-5 | 6.8e-5
ω_φ | 85 | 9.4
τ | 0.055 | 0.02
λ_1 | 0.06 | 0
λ_2 | 0 | 0
λ_3 | - | 0
η_noise | 1.0 | 1.1

Table 9: Best found hyper-parameters for the BOUNCE environment

HYPER-PARAMETER | MADDPG + AGENT MODELLING | MADDPG + POLICY MASK
α_θ | 1.3e-5 | 2.5e-4
ω_φ | 85 | 0.52
τ | 0.055 | 0.0077
λ_1 | 0.06 | 0
λ_2 | 0 | 0
λ_3 | - | 0
η_noise | 1.0 | 1.3

Table 10: Best found hyper-parameters for the CHASE environment

HYPER-PARAMETER | MADDPG + AGENT MODELLING | MADDPG + POLICY MASK
α_θ | 2.5e-5 | 6.8e-5
ω_φ | 42 | 9.4
τ | 0.098 | 0.02
λ_1 | 0.054 | 0
λ_2 | 0 | 0
λ_3 | - | 0
η_noise | 1.2 | 1.1

Table 11: Best found hyper-parameters for the COMPROMISE environment

HYPER-PARAMETER | MADDPG + AGENT MODELLING | MADDPG + POLICY MASK
α_θ | 1.2e-4 | 2.5e-4
ω_φ | 0.71 | 0.52
τ | 0.0051 | 0.0077
λ_1 | 0.0075 | 0
λ_2 | 0 | 0
λ_3 | - | 0
η_noise | 1.8 | 1.3

E.5 Hyper-parameter search results

The performance distributions across hyper-parameter configurations for each algorithm on each task are depicted in Figure 9 using box-and-whisker plots.
It can be seen that, while most algorithms can perform reasonably well with the correct configuration, TeamReg, CoachReg and their ablated versions boost the performance of the third quartile, suggesting an increase in robustness across hyper-parameters compared to the baselines.

Figure 9: Hyper-parameter tuning results for all algorithms on SPREAD, BOUNCE, COMPROMISE and CHASE. There is one distribution per (algorithm, environment) pair, each formed of 50 data points (hyper-parameter configuration samples). Each point represents the best model performance averaged over 100 evaluation episodes and over the 3 training seeds for one sampled hyper-parameter configuration. The box plots divide in quartiles the 49 lower-performing configurations for each distribution, while the score of the best-performing configuration is highlighted above the box plots by a single dot.

F The effects of enforcing predictability (additional results)

Figure 10: Average performance difference (Δperf) between the two agents in COMPROMISE for the 150 runs of the hyper-parameter searches (left). All occurrences of abnormally high performance difference are associated with high values of λ_2 (right).

The results presented in Figure 5 show that on COMPROMISE, MADDPG + TeamReg is outperformed by all other algorithms when considering average return across agents. In this section we seek to further investigate this failure mode. Importantly, COMPROMISE is the only task with a competitive component (i.e.
the only one in which agents do not share their rewards). The two agents being strapped together, a good policy has both agents reach their landmarks successively (e.g. by having both agents navigate towards the closest landmark). However, if one agent never reaches for its landmark, the optimal strategy for the other becomes to drag it around and always go for its own, leading to a strong imbalance in the return accumulated by the two agents. While such a scenario does not occur for the other algorithms, we found TeamReg to often lead to cases of domination, such as the one depicted in Figure 11. Figure 10 depicts the performance difference between the two agents for all 150 runs of the hyper-parameter search for TeamReg and the baselines, and shows (1) that TeamReg is the only algorithm that leads to large imbalances in performance between the two agents and (2) that these cases where one agent becomes dominant are all associated with high values of λ_2, which drives the agents to behave in a predictable fashion to one another. Looking back at Figure 11, while these domination dynamics tend to occur at the beginning of training, the dominated agent eventually gets exposed more and more to the sparse reward gathered by being dragged (by chance) onto its own landmark, picks up the goal of the task and starts pulling in its own direction, which causes the average return over agents to drop, as we see happening midway through training in Figure 5. These results suggest that using a predictability-based team regularization in a competitive task can be harmful; quite understandably, one might not want to optimize an objective that aims at making one's behavior predictable to an opponent.
Figure 11: Per-agent learning curves for TeamReg and the three baselines (DDPG, MADDPG, MADDPG + sharing) on COMPROMISE. We see that, while both agents remain equally performant as they improve at the task for the baseline algorithms, TeamReg tends to make one agent much stronger than the other. This domination is optimal as long as the other agent remains docile, as the dominant agent can gather much more reward than if it had to compromise. However, when the dominated agent finally picks up the task, the dominant agent, which has learned a policy that does not compromise, sees its return dramatically go down, and the mean over agents then remains lower than for the baselines.

G Analysis of sub-policy selection (additional results)

G.1 Mask densities

We depict in Figure 12 the mask distribution of each agent for each (seed, environment) experiment, collected on 100 different episodes. Firstly, in most of the experiments, agents use at least 2 different masks. Secondly, for a given experiment, agents' distributions are very similar, suggesting that they are using the same masks in the same situations and are therefore synchronized. Finally, agents collapse more to using only one mask on CHASE, where they also display more dissimilarity between one another. This may explain why CHASE is the only task where CoachReg does not improve performance. Indeed, on CHASE, agents do not seem synchronized, nor do they leverage multiple sub-policies, which are the priors to coordination behind CoachReg.
In brief, we observe that CoachReg is less effective in enforcing those priors to coordination on CHASE, an environment where it neither boosts nor harms performance.

Figure 12: Agents' policy mask distributions. For each (seed, environment) pair we collected the masks of each agent on 100 episodes.

G.2 Episode rollouts with synchronous sub-policy selection

We display here and at https://sites.google.com/view/marl-coordination/ some interesting sub-policy selection strategies evolved by CoachReg agents. In Figure 13, the agents identify two different scenarios depending on the target-ball location and use the corresponding policy mask for the whole episode, whereas in Figure 14 the agents synchronously switch between policy masks during an episode. In both cases, the whole group selects the same mask as the one that would have been suggested by the coach.

Figure 13: Visualization of two different BOUNCE evaluation episodes: (a) the ball is on the left side of the target and both agents select the purple policy mask; (b) the ball is on the right side of the target and both agents select the green policy mask. Note that here the agents' colors represent their chosen policy mask. Agents have learned to synchronously identify two distinct situations and act accordingly. The coach's masks (not used at evaluation time) are displayed with the timestep at the bottom of each frame.
Figure 14: Visualization of mask sequences on two different environments: (a) SPREAD and (b) COMPROMISE. An agent's color represents its current policy mask. The coach's masks (not used at evaluation time) are displayed with the timestep at the bottom of each frame. Agents synchronously switch between the available policy masks.

G.3 Mask diversity and synchronicity (ablation)

As in Subsection 7.3, we report the mean entropy of the mask distributions and the mean Hamming proximity for the ablated "MADDPG + policy mask" and compare them to the full CoachReg. With "MADDPG + policy mask", agents are not incentivized to use the same masks. Therefore, in order to assess whether they synchronously change policy masks, we computed, for each agent pair, seed and environment, the Hamming proximity for every possible mask equivalence (mask 3 of agent 1 corresponds to mask 0 of agent 2, etc.) and selected the equivalence that maximised the Hamming proximity between the two sequences. We can observe that, while "MADDPG + policy mask" agents display a more diverse mask usage, their selection is less synchronized than with CoachReg. This is easily understandable, as the coach will tend to reduce diversity in order to have all the agents agree on a common mask; on the other hand, this agreement enables the agents to synchronize their mask selection. In this regard, it should be noted that "MADDPG + policy mask" agents are more synchronized than agents independently sampling their masks from k-CUD, suggesting that, even in the absence of the coach, agents tend to synchronize their mask selection.
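The equivalence-maximized proximity described above can be sketched by searching over all relabelings of one agent's mask ids (the function name is ours):

```python
from itertools import permutations

def best_equivalence_proximity(seq_a, seq_b, k):
    """Hamming proximity between two mask sequences, maximized over all
    permutations (relabelings) of the second agent's k mask ids."""
    assert len(seq_a) == len(seq_b) > 0
    best = 0.0
    for perm in permutations(range(k)):
        # perm[b] is mask b of agent 2 relabeled into agent 1's ids.
        matches = sum(a == perm[b] for a, b in zip(seq_a, seq_b))
        best = max(best, matches / len(seq_a))
    return best

# Agent 2 follows the same schedule with masks 0 and 1 swapped:
print(best_equivalence_proximity([0, 0, 1, 1], [1, 1, 0, 0], k=2))  # -> 1.0
```

The factorial search over permutations is only practical for the small k used here (k = 4).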
Figure 15: (Left) Entropy of the policy mask distributions for each task and method, averaged over agents and training seeds. H_max,k is the entropy of a k-CUD. (Right) Hamming proximity between the policy mask sequences of each agent, averaged across agent pairs and seeds. "rand k" stands for agents independently sampling their masks from a k-CUD. Error bars are SE across seeds.

H Scalability with the number of agents

H.1 Complexity

In this section we discuss the increase in model complexity that our methods entail. In practice, this complexity is negligible compared to the overall complexity of the CTDE framework. In that respect, note that (1) the critics are not affected by the regularizations, so our approaches only increase the complexity of the forward and backward propagation of the actor, which accounts for roughly half of an agent's computational load at training time. Moreover, (2) efficient design choices significantly impact real-world scalability and performance: we implement TeamReg by adding only additional heads to the pre-existing actor model (effectively sharing most parameters between the teammates' action predictions and the agent's action selection model). CoachReg consists only of an additional linear layer per agent and a unique coach entity for the whole team (which scales better than a critic since it only takes observations as inputs). As such, only a small number of additional parameters need to be learned relative to the underlying base CTDE algorithm. For a TeamReg agent, the number of parameters of the actor increases linearly with the number of agents (additional heads) whereas the critic model grows quadratically (since the observation sizes themselves usually depend on the number of agents). In the limit of increasing the number of agents, the proportion of parameters added by TeamReg compared to the increase in parameters of the centralized critic vanishes to zero.
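The linear-versus-quadratic growth argument can be made concrete with a back-of-the-envelope parameter count. The layer sizes below are illustrative assumptions (not the paper's actual hyperparameters); only the scaling behavior matters:

```python
def teamreg_extra_params(n_agents, hidden=64, act_dim=2):
    """Extra actor parameters from TeamReg: one prediction head per
    teammate, so the added cost grows linearly with the team size."""
    return (n_agents - 1) * hidden * act_dim

def critic_first_layer_params(n_agents, per_agent_obs=4, act_dim=2, hidden=64):
    """First-layer parameters of a centralized critic: the joint input
    concatenates every agent's observation and action, and each
    observation itself grows with team size, hence quadratic growth."""
    obs_dim = n_agents * per_agent_obs          # observations scale with the team
    joint_input = n_agents * (obs_dim + act_dim)
    return joint_input * hidden

def added_fraction(n_agents):
    """Share of TeamReg's extra parameters relative to the critic."""
    return teamreg_extra_params(n_agents) / critic_first_layer_params(n_agents)
```

Under these toy sizes, `added_fraction` shrinks monotonically as agents are added (linear numerator over quadratic denominator), illustrating why the relative overhead of TeamReg vanishes in the many-agent limit.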
On the SPREAD task for example, training 3 agents with TeamReg increases the number of parameters by about 1.25% (with a similar increase in computational complexity). With 100 agents, this increase is only 0.48%. For CoachReg, the increase in an agent's parameters is independent of the number of agents. Finally, the additional heads in TeamReg and the coach in CoachReg are only used during training and can be safely removed at execution time, reducing the system's computational complexity to that of the base algorithm.

H.2 Robustness

To assess how the proposed methods scale to greater numbers of agents, we increase the number of agents in the SPREAD task from three to six. The results presented in Figure 16 show that the performance benefits provided by our methods hold when the number of agents is increased. Unsurprisingly, we also note how quickly learning becomes more challenging as the number of agents rises. Indeed, with each new agent the coordination problem becomes more difficult, which might explain why our methods, which promote coordination, maintain a higher degree of performance. Nonetheless, in the sparse reward setting the task soon becomes too difficult, and none of the algorithms is able to solve it with six agents. While these results show that our methods do not contribute to a quicker downfall when the number of agents is increased, they are not aimed at tackling the problem of massively multi-agent RL. Other approaches, such as those that use attention heads [14] or restrict an agent's perceptual field to its n closest teammates, are better suited to these particular challenges, and our proposed regularization schemes could readily be adapted to these settings as well.
[Learning-curve plots for SPREAD with 3, 4, 5 and 6 agents, comparing DDPG, MADDPG, MADDPG + sharing, MADDPG + TeamReg (ours), and MADDPG + CoachReg (ours) over 30,000 episodes.]

Figure 16: Learning curves (mean return over agents) for all algorithms on the SPREAD environment for varying numbers of agents. Solid lines are the mean and envelopes are the Standard Error (SE) across the 10 training seeds.
